What can CCP4@DL contribute to BioXHIT?

Background: (what does CCP4(i) do at the moment for data management?)

Currently CCP4i has a "project history database" which automatically keeps a record of each "job" run by the user in each project. A limited amount of information is stored for each job: date, taskname, status (e.g. whether the job failed), title, and lists of input and output files. (Jobs can also be annotated via an electronic notebook.) This database is only accessible from CCP4i tasks, so external applications can neither record jobs nor examine the stored data (although users can enter data for external jobs manually).
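A job record of this kind might be sketched as follows (the field names here are illustrative only, not the actual CCP4i database format):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class JobRecord:
    """Hypothetical sketch of one entry in a project history database."""
    job_id: int
    date: str                  # e.g. "2004-03-01"
    taskname: str              # e.g. "refmac5"
    status: str                # e.g. "FINISHED" or "FAILED"
    title: str                 # user-supplied job title
    input_files: List[str] = field(default_factory=list)
    output_files: List[str] = field(default_factory=list)

# One job as CCP4i might record it:
job = JobRecord(1, "2004-03-01", "refmac5", "FINISHED",
                "Initial refinement",
                input_files=["model.pdb", "data.mtz"],
                output_files=["refined.pdb", "refined.mtz"])
print(job.taskname, job.status)
```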

In addition, the CCP4 suite supports data harvesting (the automatic capture of essential data from runs of key programs), with the aim of reducing the manual workload at deposition and improving the accuracy of the deposited data.

A small number of CCP4 programs can output XML files containing key data items. This pilot project aims to show how XML can be used as a data transfer mechanism between applications; for example, CCP4i tasks can extract the data automatically to set task defaults.
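As an illustration of the idea (the element names below are invented for this sketch, not an actual CCP4 XML schema), a downstream task could pick key data items out of a program's XML output and use them as defaults:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML dump of key data items from a data-processing run;
# the element names are illustrative only.
xml_output = """
<program_output name="scala">
  <resolution_high>1.8</resolution_high>
  <resolution_low>45.0</resolution_low>
  <spacegroup>P212121</spacegroup>
</program_output>
"""

root = ET.fromstring(xml_output)
# A downstream task could read these values to set its own defaults:
defaults = {child.tag: child.text for child in root}
print(defaults["spacegroup"])   # P212121
```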

Aims: (what is this project intending to achieve?)

To expand the scope and accessibility of the project history database to allow effective tracking of data/workflow through a structure determination. In the case of high-throughput/structural genomics projects utilising automated structure determination, it is important that this information can be gathered from all the software (& hardware?) components used, not just CCP4.

Workplan: (what are we actually going to do?)

1. Separate the project history database out from CCP4i, to allow it to be accessed directly by other, non-CCP4 applications. The anticipated model is based on a database server approach, which would ultimately allow distributed access and would be able to cope with many users accessing the same or different project histories simultaneously.
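One possible shape for such a standalone store, sketched here with an in-memory SQLite database (the table and column names are purely illustrative; a server process wrapping a store like this could accept requests from CCP4i and non-CCP4 clients alike):

```python
import sqlite3

# Hypothetical schema: jobs keyed by project, with input/output files
# recorded in a separate table so any client can query either way.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE jobs (
        job_id    INTEGER PRIMARY KEY,
        project   TEXT NOT NULL,
        taskname  TEXT,
        status    TEXT,
        title     TEXT
    )""")
conn.execute("""
    CREATE TABLE job_files (
        job_id    INTEGER REFERENCES jobs(job_id),
        direction TEXT CHECK (direction IN ('input', 'output')),
        filename  TEXT
    )""")

conn.execute("INSERT INTO jobs VALUES (1, 'lysozyme', 'refmac5', "
             "'FINISHED', 'Refinement')")
conn.execute("INSERT INTO job_files VALUES (1, 'output', 'refined.pdb')")

# Any application, not just CCP4i, could ask questions like this:
rows = conn.execute(
    "SELECT taskname FROM jobs WHERE project = 'lysozyme'").fetchall()
print(rows)  # [('refmac5',)]
```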

2. Explore using the information in the project history database to infer paths ("tracking") through the structure determination, and investigate novel ways of presenting this to the user.

2i. The first step could be to investigate using the existing information: a naive approach would be to match up the input and output files of different jobs. Ultimately, however, additional information will need to be gathered and stored to make this reliable. The scope of this information needs to be determined - for example, do we store only the "previous step", or also additional information such as the data and criteria used to make the decision?
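The naive file-matching approach could be sketched as follows (the job data is invented for illustration): link job A to a later job B whenever an output file of A appears among the inputs of B:

```python
# Minimal "tracking" sketch: infer links between jobs by matching the
# output files of one job to the input files of a later one.
jobs = [
    {"id": 1, "task": "scala",   "inputs": ["raw.mtz"],
     "outputs": ["scaled.mtz"]},
    {"id": 2, "task": "phaser",  "inputs": ["scaled.mtz"],
     "outputs": ["phased.pdb"]},
    {"id": 3, "task": "refmac5", "inputs": ["phased.pdb", "scaled.mtz"],
     "outputs": ["refined.pdb"]},
]

edges = []
for producer in jobs:
    for consumer in jobs:
        if consumer["id"] <= producer["id"]:
            continue  # only link forwards in time
        shared = set(producer["outputs"]) & set(consumer["inputs"])
        if shared:
            edges.append((producer["id"], consumer["id"], sorted(shared)))

print(edges)
```

This already exposes the weakness noted above: file names alone cannot distinguish a genuine dependency from an incidental reuse, which is why additional stored information would be needed.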

It will be necessary to examine how this information will be captured (e.g. automatically dumped from programs via XML, manual entry by the user, direct input from an expert system, extraction from other databases used at a synchrotron site, etc.), which will require liaison with both software users and software developers. It will be of particular importance to address the issue of data capture from non-CCP4 software, to avoid gaps in the structure determination record.

The use of more sophisticated database backends would need to be investigated as a way of providing "data mining" tools (i.e. tools for extracting metadata from the project history, e.g. to supplement data harvesting information - connection to the SPINE initiative?). (Thinks: are there other tools that do this kind of thing? What other information could be extracted?)
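A trivial example of the kind of metadata such tools might extract - again with invented table and column names - is a per-task failure count across a project history:

```python
import sqlite3

# Hypothetical "data mining" query: how often has each task failed?
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (taskname TEXT, status TEXT)")
conn.executemany("INSERT INTO jobs VALUES (?, ?)", [
    ("refmac5", "FAILED"),
    ("refmac5", "FINISHED"),
    ("phaser",  "FINISHED"),
])

summary = conn.execute(
    "SELECT taskname, SUM(status = 'FAILED') "
    "FROM jobs GROUP BY taskname ORDER BY taskname"
).fetchall()
print(summary)  # [('phaser', 0), ('refmac5', 1)]
```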

2ii. Investigate different ways of presenting the project history data to the user - for example, as a network or as a timeline (multiple representations may be useful for gaining a clearer understanding of the progress of a project). Develop interfaces to allow access to the data mining functions (including a graphical interface to act as a plug-in for viewing and interacting with the database?).
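As one possible network presentation, inferred links between jobs could be emitted in Graphviz DOT format for an external viewer to render (the task names and links here are invented):

```python
# Sketch: render inferred job-to-job links as a Graphviz DOT digraph.
edges = [("scala", "phaser"), ("phaser", "refmac5")]

lines = ["digraph project_history {"]
for src, dst in edges:
    lines.append(f'    "{src}" -> "{dst}";')
lines.append("}")
dot = "\n".join(lines)
print(dot)
```

A timeline view could be generated from the same records by sorting on the job date instead of following the file-dependency links.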

Resources needed: PDA for 3 years? computer equipment? ???