K. Henrick
European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD
henrick@ebi.ac.uk
The Macromolecular Structure Database group at the European Bioinformatics Institute was set up in 1996. The group members are Geoff Barton, Peter Keller, John Ionides and Kim Henrick, with Janet Thornton as external advisor; a further member will join in October 1998. Its principal aim is to become the European partner in the deposition, archiving, and presentation of 3D services of macromolecular structural data, together with the Protein Data Bank at Brookhaven Laboratory and with the Nucleic Acid Database at Rutgers University. To achieve this aim we are working towards the design and production of a single, high quality, global structural data collection. We intend to generate a much richer database than is possible with the current PDB entries. This will require even more data to be deposited, and may make higher demands on the depositor with respect to the quality and completeness of the information required.
To make data archiving relatively painless, we have prototyped a data harvesting method with some of the CCP4 programs.
The proposed method takes the emphasis away from a web interface back to the research worker. The software used at each step in the process of the determination of a macromolecular structure by X-ray diffraction methods is to be modified such that a summary deposition file will be produced.
The research worker will be responsible for the management of these files, in the sense of making sure that they are in the correct file system location at the time of deposition. (The format of these files will be mmCIF, based on an extended mmCIF dictionary, but the researcher will not need to examine or edit them.) Deposition will then involve the transfer of the collection of files to the deposition site, where the data harvesting software will extract as much of the required information as possible. Missing information will have to be supplied by the depositor, but it is hoped that this will be kept to a minimum.
The set of files will contain the same information that will be required for publication and for writing a thesis. The requested information will be used to characterise the reflection dataset, the refinement used and provide confidence levels for aspects of the final structure model. Information that is normally only derived from coordinates will not be requested in the deposition process. The deposition centre will use wherever possible the same procedures to derive the information that is currently calculated from coordinates by research workers.
There are three classes of data items.
In outline a program will write a file containing the following:
data_phosphate_binding_protein[A197C_chromophore]
_entry.id phosphate_binding_protein
_entry.data_set_id A197C_chromophore
_audit.creation_DayTime 'Wed Jul 16 12:58:55 1997'
Here entry.id corresponds to the user's ProjectID (protein) and entry.data_set_id corresponds to the user's DataSetID.
Every project has at least 1 dataset, but may have more than 1, as for example in the case of MIR structure solution. A ProjectID is a research worker's identifier for a data set that is expected to become a submitted Entry. A mutant is therefore a new ProjectID, whereas a heavy atom derivative is not a new ProjectID but a different DataSetID.
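As an illustration of the outline above, the identifying lines of a harvest file could be generated along the following lines. This is only a sketch in Python, not the CCP4 implementation; the function name harvest_header is hypothetical, and the timestamp format is inferred from the single example shown above.

```python
from datetime import datetime


def harvest_header(project_id: str, dataset_id: str, when: datetime) -> str:
    """Return the identifying lines of a harvest file as mmCIF-style text.

    Hypothetical helper: the layout copies the outline in the text, and the
    date format is an assumption based on the one example given there.
    """
    stamp = when.strftime("%a %b %d %H:%M:%S %Y")
    lines = [
        f"data_{project_id}[{dataset_id}]",
        f"_entry.id {project_id}",
        f"_entry.data_set_id {dataset_id}",
        f"_audit.creation_DayTime '{stamp}'",
    ]
    return "\n".join(lines) + "\n"


# Reproduce the example from the text.
text = harvest_header(
    "phosphate_binding_protein",
    "A197C_chromophore",
    datetime(1997, 7, 16, 12, 58, 55),
)
print(text)
```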
The harvest snapshot files generated by running a particular CCP4 program will then contain, as mmCIF tag and value pairs, the data items that the program is capable of generating automatically. Identifying the data items within a particular program is a trivial task. For most CCP4 programs they correspond closely to the data contained in the xloggraph tables. [Parsing log files to obtain the requested deposition data items has been found to be very difficult: the output of the programs has varied, in some cases the changes are hard to track from version to version, and it is hard to recover the input parameters and the path taken through the options available for each program.]
To generate successfully a set of deposition files during the course of a structure determination, some minimal discipline will be required from the researcher. Essentially, this involves deciding on some name or code to identify the project, and sticking to it. This is necessary, to allow the data harvesting software to identify correctly and merge data from a number of different files. Where multiple datasets are involved, the user will also be required to identify each dataset in a consistent way.
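The reason consistent naming matters can be seen in a small sketch of the merging step. The Python below is an assumption about how harvesting software might group files, not a description of the EBI software: it reads simple one-line tag and value pairs only, and the function names and the numerical values are invented for illustration.

```python
import shlex
from collections import defaultdict


def parse_harvest(text):
    """Collect simple one-line '_tag value' pairs from one harvest file.

    Loops and multi-line values, which real mmCIF allows, are ignored here.
    """
    items = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("_"):
            parts = shlex.split(line)  # keeps 'quoted values' together
            if len(parts) == 2:
                items[parts[0]] = parts[1]
    return items


def merge_by_dataset(files):
    """Merge items from several files, keyed on (project, dataset)."""
    merged = defaultdict(dict)
    for text in files:
        items = parse_harvest(text)
        key = (items.get("_entry.id"), items.get("_entry.data_set_id"))
        merged[key].update(items)
    return dict(merged)


# Illustrative inputs: two programs run on the native dataset, one on a derivative.
head = "_entry.id my_protein\n_entry.data_set_id {}\n"
scaling = head.format("native") + "_reflns.d_resolution_high 1.8\n"
refinement = head.format("native") + "_refine.ls_R_factor_R_work 0.19\n"
derivative = head.format("derivative_1") + "_reflns.d_resolution_high 2.5\n"

merged = merge_by_dataset([scaling, refinement, derivative])
for key, items in merged.items():
    print(key, sorted(items))
```

If the same dataset were labelled "native" in one run and "Native" in another, the two files would land under different keys and could not be merged automatically.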
The prototype CCP4 mechanism involves automatically placing
the output from a program run into the directory (unix here),
$HOME/Deposition_Files/Project_Identifier/datasetname.program_name_function
In the above, "function" will be chosen by the program input keywords when the particular program is multifunctional. For example REFMAC can be used in several different modes, and MOSFLM can be run for selected sets of images for a particular dataset. No versions of these deposition files are to be kept and each output harvest file will be overwritten for each run of a program with a particular ProjectName, DataSetName, ProgramName and ProgramFunction.
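The naming and overwriting behaviour described above might look roughly like the following Python sketch. The helper names harvest_path and write_harvest, the function label "rigid", and the use of an underscore to join program name and function are assumptions made for illustration; a temporary directory stands in for $HOME.

```python
import tempfile
from pathlib import Path


def harvest_path(home, project, dataset, program, function=None):
    """Build <home>/Deposition_Files/<project>/<dataset>.<program>[_<function>]."""
    name = f"{dataset}.{program}"
    if function:
        name += f"_{function}"
    return Path(home) / "Deposition_Files" / project / name


def write_harvest(path, text):
    """Write a harvest file, replacing any file left by an earlier run."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text)  # no versions are kept: the old content is lost


with tempfile.TemporaryDirectory() as home:
    target = harvest_path(home, "my_protein", "native", "refmac", "rigid")
    write_harvest(target, "first run\n")
    write_harvest(target, "second run\n")
    final_text = target.read_text()
    file_count = len(list(target.parent.iterdir()))

print(target.name, file_count, repr(final_text))
```

Because the file name is built only from the project, dataset, program and function, a repeated run maps to the same path and simply replaces the earlier file.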
To produce these deposit files, the user will need to add
two extra options to existing software. In the context of CCP4 programs,
this will mean adding two extra keywords, e.g.
PROJECT my_protein DATASET native
The presence of these keywords in command input will cause the program to generate the deposition files. The individual research worker is ultimately responsible for the management of their own files. The proposed system of consistently using keywords to flag every project and data set will require some data management organisation and discipline. If it is followed, however, it will make life easier for everyone concerned.
The software should then handle the details.
The PROJECT and DATASET instructions are required to differentiate between different cell dimensions, resolutions, etc. for different data collections, in particular where derivative structure factors and heavy atom sites and phasing information are deposited.
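A program could decide whether to write a harvest file by looking for the two keywords in its command input. The Python sketch below is a simplified stand-in and not the CCP4 keyword parser; the function name harvest_labels and the placeholder line SOME_OTHER_KEYWORD are hypothetical.

```python
def harvest_labels(command_input):
    """Return (project, dataset) if both keywords are present, else None.

    Simplified assumption: a keyword is followed by its value on the same
    line, and keywords are matched without regard to case.
    """
    found = {}
    for line in command_input.splitlines():
        tokens = line.split()
        for i in range(len(tokens) - 1):
            key = tokens[i].upper()
            if key in ("PROJECT", "DATASET"):
                found[key] = tokens[i + 1]
    if "PROJECT" in found and "DATASET" in found:
        return found["PROJECT"], found["DATASET"]
    return None


with_keywords = harvest_labels("PROJECT my_protein DATASET native")
without_keywords = harvest_labels("SOME_OTHER_KEYWORD value")
print(with_keywords, without_keywords)
```

Requiring both keywords before anything is written keeps runs that are not labelled from producing files that could not later be assigned to a project and dataset.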
The additional keyword, USECWD, can be used to place the deposit file in the current working directory. This is required where a particular program is run on a machine where the research worker doesn't have a $HOME directory, e.g. at a synchrotron site or on an in-house generator/image plate system, or where part of the structure determination process is run on a different machine, either within the same institution or at a different institution. The research worker will be responsible for collecting all the files into the one harvesting archive directory ready for deposition. This keyword can also be used for tests, and when a successful run is made the output harvest file can then be simply moved to the DepositFiles sub-directory.
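Collecting a file written to a working directory into the archive is a simple move. The following Python sketch shows one way a user script might do it; collect_harvest_file is a hypothetical helper, the archive root is passed in rather than assumed, and temporary directories stand in for the real locations.

```python
import shutil
import tempfile
from pathlib import Path


def collect_harvest_file(src, archive_root, project):
    """Move one harvest file into <archive_root>/<project>/ and return its new path."""
    dest_dir = Path(archive_root) / project
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / Path(src).name
    shutil.move(str(src), str(dest))
    return dest


with tempfile.TemporaryDirectory() as workdir, tempfile.TemporaryDirectory() as archive:
    # A file as it might be left in the working directory after a test run.
    produced = Path(workdir) / "native.mosflm"
    produced.write_text("_entry.id my_protein\n_entry.data_set_id native\n")

    moved = collect_harvest_file(produced, archive, "my_protein")
    still_in_workdir = produced.exists()
    moved_exists = moved.exists()
    moved_parent = moved.parent.name

print(moved_parent, moved_exists, still_in_workdir)
```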
At each stage in a protein structure determination the program steps are often re-run and the stages are repeated in an iterative cycle. Data sets are commonly discarded as experience is gained and better crystals that give better data are grown. Therefore the above two program instructions are only required for the final run of each program. However, it is not always known at the time of a particular run that it will be the final one, and in the case of a preliminary publication the results are not definitive. Therefore the instructions could be left in for every run of the programs, and will overwrite existing files that correspond to a particular program. (The accumulation of versions of the deposit files would be unmanageable.)
The CCP4 implementation will involve a change to the MTZ library.
The MTZ header will be modified to hold labels for each ProjectName
and DataSetName and additional pointers for each column of information
that will indicate which dataset the item belongs to. Then entry
points to an MTZ file, such as MOSFLM (or ROTAPREP or F2MTZ), will
demand values for the keywords PROJECT and DATASET
which will be held in the header. Later procedures will also have
mandatory values for these items but will take the values from the MTZ
header where appropriate. The programs MTZUTILS, MTZDUMP and CAD will
also be modified to manage the merging of reflection datasets. It is
possible that the required items will also be placed in the MAP file
header.
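The proposed header change amounts to a table of project and dataset labels plus, for each column, a pointer into that table. The Python sketch below models only that idea; it does not reflect the actual MTZ header layout, and the class names and column labels are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetLabel:
    project: str
    dataset: str


@dataclass
class Header:
    """Toy model of a header holding dataset labels and per-column pointers."""
    datasets: list = field(default_factory=list)
    column_dataset: dict = field(default_factory=dict)  # column label -> index into datasets

    def add_dataset(self, project, dataset):
        """Register a (project, dataset) pair once and return its index."""
        label = DatasetLabel(project, dataset)
        if label not in self.datasets:
            self.datasets.append(label)
        return self.datasets.index(label)

    def add_column(self, column, project, dataset):
        """Record which dataset a column of information belongs to."""
        self.column_dataset[column] = self.add_dataset(project, dataset)

    def dataset_of(self, column):
        """Let a later program recover the labels without asking the user again."""
        return self.datasets[self.column_dataset[column]]


header = Header()
header.add_column("F", "my_protein", "native")
header.add_column("SIGF", "my_protein", "native")
header.add_column("FPH1", "my_protein", "derivative_1")

print(header.dataset_of("FPH1"))
print(len(header.datasets))
```

Keeping the pointer per column is what would allow a merged reflection file to carry columns from several datasets while each column still identifies its origin.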
Sample Outputs
The CCP4 programs, MOSFLM, SCALA, TRUNCATE, MLPHARE, REFMAC and
RESTRAIN have been modified to output harvest files and sample output
files are available (these files are preliminary as the new mmCIF data names
used here are under review):
The EBI have created an Oracle database to represent a PDB entry and the mechanism for managing deposition. This system will be tested extensively to accept a variety of deposition files and mechanisms. Before CCP4 or any other protein structure determination software can be modified and distributed, mechanisms that will be acceptable to research workers have to be finalised and tested. Towards this end a joint CCP4-EBI workshop will be held at the EBI in September 1998 to discuss such methods and to decide on the data to be harvested.
There has been support from other packages, including CNS, O, TNT, SHARP and SOLVE. It is hoped that within 2 years most new depositions to the central archive sites will be sending information automatically collected into a series of these snapshot files from any of the main crystallographic software suites.
The EBI harvest suggestion for deposition files labelled with the project and data set names is close to the proposed new image CIF header ideas, and it is hoped to integrate the two approaches (see the details on the Crystallographic Binary File (CBF) and the header information in the draft definitions).