John Badger
Structural GenomiX, 10505 Roselle St., San Diego, CA 92121, USA
The emergence of several public consortia (for example, the recent NIH initiative) and privately funded groups engaged in high throughput protein crystallography highlights a long-standing need to develop simple, automated and standardized mechanisms for annotating and storing all sets of experimental data that contribute to the solution of a macromolecular crystal structure.
A brief survey of the current situation illustrates the deficiencies of the mechanisms available at the present time.
Diffraction data in the public domain
The maintainers of the Protein Data Bank do not usually receive any diffraction data other than the data set used to refine the deposited structure and, in many cases, do not receive any diffraction data at all. In the PDB snapshot taken on October 1, 2000 there were only 4,129 structure factor files for the 10,900 structures determined by diffraction techniques. This shortfall in the number of diffraction data sets is not solely a problem with legacy structures: for the 1,601 structures solved by x-ray and released between January 1, 2000 and October 1, 2000 only 820 sets of experimental data were available.
Most of the structure factor data at the PDB is stored in very simple mmCIF files [1]. Data annotation (i.e. information on merging R-values, redundancy, etc.) is provided via REMARK fields in the associated PDB coordinate files. If completed by the depositor, this information provides a reasonable summary of overall data quality. However, it is unavoidably limited, in both accuracy and scope, by the fact that it is manually entered into a PDB deposition interface or deposition text file.
One can only speculate on what becomes of the multitude of heavy atom derivative and anomalous scattering data sets that are collected in individual laboratories but not normally deposited with the PDB. It seems more than optimistic to believe that many of these data files remain available for more than a few years after the date that the structure was solved. Furthermore, even when diffraction data sets are securely stored, there may not be sufficient information on their content and accuracy for them to be useful.
Diffraction data in the working laboratory
The two most widely used working formats at the present time appear to be the CCP4/MTZ format and the reflection file format used by programs in the X-PLOR/CNS/CNX lineage. Several other, more ad hoc formats (usually simple ASCII reflection lists) are used by individual programs. As working formats, both the CCP4/MTZ and X-PLOR/CNS/CNX formats are quite satisfactory and, in particular, the mapping of data types provides some safeguards against inappropriate use. The major disadvantage of these formats for data archival is that they do not provide space for data annotation. This means that information on the reliability and purpose of data stored in these types of file can easily become lost over time.
Key issues
The discussion above points out the lack of any public domain 'off the shelf' standards or data models for maintaining macromolecular structure factor data. The most significant issues are as follows:
In the case of our own organization (Structural GenomiX) many dozens of data sets, predominantly involving anomalous diffraction for MAD phasing, have been obtained in the last few months. There is an urgent need to develop not only a conceptual standard for annotating these data but also to provide a practical implementation, compatible with the output diagnostics of the current data processing programs.
This communication is intended to stimulate interest in this subject and describes our own efforts towards solving the problem of diffraction data storage within our internal database. Diffraction data files developed along these lines may well be useful for other groups interested in creating local archives of well-annotated diffraction data. It is not our intention to encourage structure depositors to provide diffraction data files of this type to the PDB, since the development of new formats and mechanisms for the annotation of diffraction data at the PDB would require a considerable public discussion involving a number of organizations.
Our principles for developing a reporting standard for diffraction data are that:
Items (1), (2) and (3) above require little comment.
The rationale behind item (4) is that we anticipate that the processing of raw (frame) data will increasingly become the domain of technical specialists using semi-automated procedures. The reduced data will then be provided to the crystallographer responsible for solving the structure. This scenario is likely to become very common for commercial organizations and public consortia engaged in high-throughput crystallography.
Item (5) expresses the practical consideration that most crystallographers use a variety of software packages, which require data in different formats. This means that the storage format should be easily parsable, with straightforward extraction of relevant data items.
Basic approach
We have taken the point of view that each set of reflections used in a structure determination should be represented in the same way and should be stored as a separate data set. Specifically, this means that we have not employed any special reflection types to represent data obtained from MAD experiments or from derivative crystals. This viewpoint contrasts with the conventional mmCIF view, which has special category groups (_phasing_set_refln, _phasing_MIR_der_refln, etc.) for reflections used in different types of experimental phasing.
We have taken this approach for reasons of conciseness and to avoid some awkwardness with these mmCIF categories for representing experimental phase information derived from multiple sources (i.e. phase information derived by both anomalous diffraction and isomorphous replacement). Furthermore, in some very common structure determination scenarios (for example, two-wavelength MAD from Se-Met crystals, where the 'best' data set plays the role of native in structure refinement) there is no 'native' data set in the conventional sense. In this case it seems undesirable to arbitrarily classify one of the data sets as 'native'. In fact, a survey of the PDB holdings to 1 October 2000 reveals that there are already 108 coordinate sets corresponding to Se-Met proteins.
Annotation
The conceptual design of the file for diffraction data storage involves selecting what information should be reported and the form in which it should be expressed. Figure 1 provides an example of the current content of the file that we have developed.
Our data files use mmCIF categories and records to provide a well-defined annotation. The decision to use the mmCIF dictionary as the basis for our diffraction data files is motivated by the fact that it contains most of the required annotation items and expresses them in a way that is usually fairly obvious to human readers.
Where possible we have used the existing mmCIF definitions for these diffraction data files; where no mmCIF record was available we have developed our own categories and records. In fact, 15 of the 59 records normally used in our diffraction data files are locally defined. The need to develop a significant number of new record types should not be considered surprising for a 'real-world' application developed some years after most of the conceptual development of the mmCIF dictionary. For comparison, it is interesting to note that the current RCSB PDB x-ray deposition form contains as many as 30 locally defined items (i.e. items with prefixes 'ndb' or 'rcsb') out of a total of 107 fields. We anticipate that additional records will be required in our diffraction data files as new data quality metrics emerge and/or further experience with reporting diffraction data in this format reveals a need to provide additional information.
The addition of new mmCIF records has mostly been necessary to report data statistics and reflection values relating to the use of anomalous scattering information. We have introduced _refln.sgx_Fplus_meas_au, _refln.sgx_Fplus_meas_sigma_au, _refln.sgx_Fminus_meas_au and _refln.sgx_Fminus_meas_sigma_au records to capture Friedel pairs as separate data items. Since we frequently want to extract anomalous differences from our data sets, this layout is much more convenient than storing all these data within the single column of _refln.F_meas_au data. In this case _refln.F_meas_au and _refln.F_meas_sigma_au represent the merged reflection, and this representation allows the merged value of the structure factor to be contained within the same data loop as the Friedel pairs.
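As an illustration of why separate Friedel-pair columns are convenient, the following minimal Python sketch computes an anomalous difference and its uncertainty from one reflection row. The function name and the dictionary layout are hypothetical, and the sigma is obtained by standard error propagation under the assumption that the F+ and F- measurements are independent; this is not necessarily how any particular phasing program treats the data.

```python
import math


def anomalous_difference(f_plus, sig_plus, f_minus, sig_minus):
    """Return (delta_F, sigma_delta_F) for one Friedel pair.

    Assumes independent errors, so the sigmas add in quadrature.
    """
    delta_f = f_plus - f_minus
    sigma = math.hypot(sig_plus, sig_minus)
    return delta_f, sigma


# Reflection (0, 0, 16) from the example file in Fig. 1,
# keyed by the locally defined record names.
row = {
    "_refln.sgx_Fplus_meas_au": 172.4213,
    "_refln.sgx_Fplus_meas_sigma_au": 6.9680,
    "_refln.sgx_Fminus_meas_au": 172.8959,
    "_refln.sgx_Fminus_meas_sigma_au": 11.8956,
}

d_ano, sig_d_ano = anomalous_difference(
    row["_refln.sgx_Fplus_meas_au"],
    row["_refln.sgx_Fplus_meas_sigma_au"],
    row["_refln.sgx_Fminus_meas_au"],
    row["_refln.sgx_Fminus_meas_sigma_au"],
)
print(f"anomalous difference = {d_ano:.4f}, sigma = {sig_d_ano:.4f}")
```

With all four values in the same loop row, no cross-referencing between reflection lists is needed to form the difference.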
The records in the new _reflns_info group are used to provide information on the use of the data set for phase determination; the _reflns_info.sgx_der_type tag is used to record whether the sample contains heavy atoms and _reflns_info.sgx_se reports whether the crystal contains Se atoms.
A few mandatory tags that have no real use within the context of reflection data at Structural GenomiX are included for formal compliance with the mmCIF dictionary.
Fig 1. An illustrative example of a mmCIF file containing diffraction data
data_example.1
_reflns.entry_id                          'example.1'
_cell.entry_id                            'example.1'
_symmetry.entry_id                        'example.1'
_reflns_info.id                           'example.1'
_diffrn.id                                1
_diffrn_radiation_wavelength.id           1
_diffrn_radiation_wavelength.wavelength   1.20
_reflns_info.sgx_der_type                 'none'
_reflns_info.sgx_se                       'yes'
_reflns.sgx_merged_anom                   'no'
_reflns.ndb_netI_over_av_sigmaI           13.9
_reflns.ndb_Rmerge_I_obs                  0.113
_reflns.sgx_Rmerge_anom                   0.052
_reflns.sgx_number_measured               228381
_reflns.percent_possible_obs              96.6
_reflns.ndb_redundancy                    6.8
_reflns.sgx_anom_completeness             95.4
loop_
_reflns_shell.d_res_low
_reflns_shell.d_res_high
_reflns_shell.meanI_over_sigI_obs
_reflns_shell.Rmerge_I_obs
_reflns_shell.sgx_Rmerge_anom
_reflns_shell.number_measured_all
_reflns_shell.number_unique_all
_reflns_shell.percent_possible_all
_reflns_shell.ndb_redundancy
_reflns_shell.sgx_anom_completeness
.     9.49  20.5  0.076  0.058   4973  1039  85.5  4.8  82.7
9.49  6.71  22.0  0.073  0.052  12969  2066  93.4  6.3  96.9
6.71  5.48  20.5  0.094  0.058  17289  2601  95.2  6.6  96.8
5.48  4.74  19.9  0.104  0.047  20646  3018  95.7  6.8  96.2
4.74  4.24  19.2  0.103  0.041  23571  3419  96.2  6.9  96.2
4.24  3.87  16.6  0.120  0.045  26387  3759  96.5  7.0  96.6
3.87  3.59  13.6  0.141  0.052  28558  4060  96.6  7.0  96.2
3.59  3.35  10.5  0.159  0.058  30344  4340  96.7  7.0  95.5
3.35  3.16   7.6  0.201  0.072  31541  4599  96.7  6.9  95.1
3.16  3.00   5.1  0.278  0.103  32103  4826  96.6  6.7  94.4
_reflns.B_iso_Wilson_estimate             75.366
_cell.length_a                            63.131
_cell.length_b                            84.626
_cell.length_c                            316.239
_cell.angle_alpha                         90.000
_cell.angle_beta                          90.000
_cell.angle_gamma                         90.000
_symmetry.Int_Tables_number               18
_symmetry.space_group_name_H-M            'P 21 21 2'
_reflns.number_all                        33714
_reflns.d_resolution_low                  20.00
_reflns.d_resolution_high                 3.00
loop_
_refln.wavelength_id
_refln.crystal_id
_refln.scale_group_code
_refln.index_h
_refln.index_k
_refln.index_l
_refln.F_meas_au
_refln.F_meas_sigma_au
_refln.sgx_Fplus_meas_au
_refln.sgx_Fplus_meas_sigma_au
_refln.sgx_Fminus_meas_au
_refln.sgx_Fminus_meas_sigma_au
1 1 1 0 0 16 172.6586  6.8931 172.4213  6.9680 172.8959 11.8956
1 1 1 0 0 17  14.2712  6.6891  11.7388  8.2799  16.8037 10.5080
...plus 33714 more rows of reflection data.....
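A file of the kind shown in Fig. 1 is straightforward to read with a small amount of code. The sketch below is a deliberately simplified reader written for illustration only; it is not the parser used at Structural GenomiX. It assumes a single data block, whitespace-separated tokens, single-quoted strings, and no multi-line text fields or comments, so it covers only a subset of the CIF syntax. The function name parse_simple_cif and the sample text (a cut-down version of Fig. 1) are hypothetical.

```python
import shlex

SAMPLE = """\
data_example.1
_reflns.entry_id 'example.1'
_symmetry.space_group_name_H-M 'P 21 21 2'
_reflns.d_resolution_high 3.00
loop_
_reflns_shell.d_res_low
_reflns_shell.d_res_high
_reflns_shell.Rmerge_I_obs
.    9.49 0.076
9.49 6.71 0.073
"""


def _is_keyword(token):
    """True for tokens that start a new item, loop or data block."""
    return token.startswith("_") or token == "loop_" or token.startswith("data_")


def parse_simple_cif(text):
    """Return (block_name, items, loops) for a simple one-block file.

    items maps each tag to its string value; loops is a list of tables,
    each table being a list of {tag: value} rows.
    """
    tokens = shlex.split(text)  # honours quotes such as 'P 21 21 2'
    block_name, items, loops = None, {}, []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.startswith("data_"):
            block_name = token[len("data_"):]
            i += 1
        elif token == "loop_":
            i += 1
            tags = []
            while i < len(tokens) and tokens[i].startswith("_"):
                tags.append(tokens[i])
                i += 1
            if not tags:
                raise ValueError("loop_ without any tags")
            values = []
            while i < len(tokens) and not _is_keyword(tokens[i]):
                values.append(tokens[i])
                i += 1
            if len(values) % len(tags) != 0:
                raise ValueError("loop values do not fill whole rows")
            rows = [
                dict(zip(tags, values[j:j + len(tags)]))
                for j in range(0, len(values), len(tags))
            ]
            loops.append(rows)
        elif token.startswith("_"):
            if i + 1 >= len(tokens):
                raise ValueError(f"tag {token} has no value")
            items[token] = tokens[i + 1]
            i += 2
        else:
            raise ValueError(f"unexpected token: {token}")
    return block_name, items, loops


block_name, items, loops = parse_simple_cif(SAMPLE)
print(block_name, items["_symmetry.space_group_name_H-M"])
print(loops[0][1]["_reflns_shell.Rmerge_I_obs"])
```

Values are returned as strings; converting them to numbers, and treating '.' as a missing value, is left to the caller.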
The focus of this communication is to describe a mechanism for reporting the data processing statistics that are available directly following data reduction. However, we are using the same type of file to store the additional reflection information that becomes available at the completion of the phase determination and refinement processes. We are using the new records _refln.sgx_HL_A, _refln.sgx_HL_B, _refln.sgx_HL_C and _refln.sgx_HL_D to denote Hendrickson-Lattman phase probability coefficients regardless of their experimental source, since phase probabilities only seem to appear in the _phasing_MIR_der_refln records in the mmCIF dictionary. Thus, the set of additional _refln records that we are employing to report phase determination and refinement information is:
_refln.sgx_Fmap
_refln.fom
_refln.phase_meas
_refln.sgx_HL_A
_refln.sgx_HL_B
_refln.sgx_HL_C
_refln.sgx_HL_D
_refln.status
The first seven fields in the list above provide a means to maintain complete experimental phase information, allowing reproduction of the initial experimentally phased electron density map. The final field reports whether a reflection was used within the 'working' refinement data set or it was a part of the 'test' cross-validation data set.
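To show how the four Hendrickson-Lattman coefficients carry the full phase information, the sketch below evaluates the standard Hendrickson-Lattman distribution, P(phi) proportional to exp(A cos phi + B sin phi + C cos 2phi + D sin 2phi), on a regular grid and derives the centroid phase and figure of merit from it. The function name, the grid size and the numerical integration scheme are choices made for this illustration; they are not taken from the software used at Structural GenomiX.

```python
import cmath
import math


def phase_statistics(a, b, c, d, n_steps=360):
    """Return (centroid_phase_degrees, figure_of_merit) from HL coefficients.

    The phase probability is sampled on a regular grid over 0..360 degrees.
    """
    phases = [2.0 * math.pi * k / n_steps for k in range(n_steps)]
    exponents = [
        a * math.cos(p) + b * math.sin(p)
        + c * math.cos(2.0 * p) + d * math.sin(2.0 * p)
        for p in phases
    ]
    shift = max(exponents)  # subtract the maximum to avoid overflow in exp()
    weights = [math.exp(e - shift) for e in exponents]
    total = sum(weights)
    # Probability-weighted mean of exp(i*phi): its modulus is the figure
    # of merit and its argument is the centroid ("best") phase.
    centroid = sum(w * cmath.exp(1j * p) for w, p in zip(weights, phases)) / total
    phase_deg = math.degrees(cmath.phase(centroid)) % 360.0
    return phase_deg, abs(centroid)


# Hypothetical coefficients, chosen only to exercise the function.
phase, fom = phase_statistics(0.0, 5.0, 0.0, 0.0)
print(f"centroid phase = {phase:.1f} deg, figure of merit = {fom:.3f}")
```

Because the centroid phase and figure of merit can be regenerated from A, B, C and D, storing the coefficients alongside _refln.fom and _refln.phase_meas keeps the file self-consistent and allows the phase distributions to be recombined later.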
The annotation provided in these files could be entered manually (by hand editing), although the process would be a little tedious.
Instead, we have developed a parser for automatically extracting most of the information from the reflection files and the output of data processing programs. Specifically, log files from CCP4/SCALA, CCP4/TRUNCATE and the MTZ reflection file are parsed to generate most of the content of the annotated diffraction data file. The EBI/CCP4 data harvesting system also provides a mechanism for extracting key information in standardized form from these programs. For our own database we considered it important to provide a mechanism for extracting data items absent from the data harvesting output, albeit at the cost of the effort involved in writing and maintaining a parser for the CCP4/SCALA and CCP4/TRUNCATE log files.
Although the content of these diffraction data files appears neutral with respect to the data processing programs used, the CCP4/SCALA log file provides a particularly convenient and complete report of data quality, and the information provided in some fields of the diffraction data file would be difficult to capture from the output of certain other data reduction software.
This file format is presented as a practical and self-contained format in which all data sets and key information associated with a structure determination can be maintained over long periods of time. All that is required is that the set of data files is stored in secure directories or on permanent media such as CD-ROM.
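The final step of such a pipeline, writing extracted statistics out as mmCIF records, can be sketched as follows. This is an illustration only: the helper names are hypothetical, the stats dictionary simply stands in for whatever a log-file parser has already extracted (no SCALA or TRUNCATE log format is assumed here), and the quoting rule is a simplification that does not handle strings containing quote characters.

```python
def format_cif_value(value):
    """Format one Python value as a simple CIF token."""
    if value is None:
        return "?"  # CIF convention for an unknown value
    if isinstance(value, str):
        return f"'{value}'"  # simplification: assumes no embedded quotes
    return str(value)


def write_items(block_name, items):
    """Return the text of a data block holding single-valued records."""
    width = max(len(tag) for tag in items) + 2
    lines = [f"data_{block_name}"]
    for tag, value in items.items():
        lines.append(f"{tag.ljust(width)}{format_cif_value(value)}")
    return "\n".join(lines) + "\n"


# Values copied from the example in Fig. 1; in practice they would come
# from the parsed output of the data processing programs.
stats = {
    "_reflns.entry_id": "example.1",
    "_reflns.ndb_netI_over_av_sigmaI": 13.9,
    "_reflns.ndb_Rmerge_I_obs": 0.113,
    "_reflns.sgx_Rmerge_anom": 0.052,
    "_reflns.sgx_number_measured": 228381,
    "_reflns.percent_possible_obs": 96.6,
    "_reflns.ndb_redundancy": 6.8,
}

cif_text = write_items("example.1", stats)
print(cif_text)
```

Keeping the extraction and the writing steps separate means that support for another data reduction program only requires a new extractor that fills the same dictionary.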
We have developed a storage system within a relational database (Oracle) to provide access to our entire corpus of annotated crystallographic data and coordinates. This crystallographic database, containing consistently and fully annotated diffraction data, should facilitate a longer term scientific objective of evaluating and improving structure determination methodologies. For example, it is clear that the rate and ease of a structure determination is strongly associated with the quality of the first experimentally phased electron density map. However, to properly determine the data collection regimes that optimize the chances of getting a 'good' initial map from a given crystal in the presence of radiation damage (i.e. decisions relating to the trade-off between the numbers of wavelengths for MAD versus redundancy of data) requires complete and consistent information on the outcomes of a large number of structure determinations.
This general approach of crafting appropriately extended mmCIF files may also be used in other areas of macromolecular crystal structure determination. For example, we have also developed an automated system that validates our structures using the SFCHECK V 5.3.4 program [2] and the PROCHECK V 3.5 software [3] (i.e. the versions of this software incorporated into the CCP4 4.0 release). This system collects key results from the output of these programs into a single mmCIF coordinate file, which provides a convenient and consistently annotated report of structure quality.
The virtue of these approaches is that they establish a highly consistent set of self-contained structure files without requiring the crystallographer to perform tedious clerical tasks. Time and expertise are better spent studying the structure and carrying out follow-up crystallography, biochemistry and molecular modelling studies.
I wish to thank Janet Newman, Tom Peat, Eric de la Fortelle, Sarah Dry and Phil Bourne for comments on this manuscript. The diffraction data that is being used with this system is the result of work by the Crystallography group at Structural GenomiX.
2. A.A. Vaguine, J. Richelle, and S.J. Wodak. (1999). Acta Cryst. D55, 191-205.
3. R.A. Laskowski, M.W. MacArthur, D.S. Moss and J.M. Thornton. (1993). J. Appl. Cryst., 26, 283-291.