

A system for storing annotated diffraction data

John Badger
Structural GenomiX, 10505 Roselle St., San Diego, CA 92121, USA


1. Introduction

The emergence of several public consortia (for example, the recent NIH initiative) and privately funded groups engaged in high throughput protein crystallography highlights a long-standing need to develop simple, automated and standardized mechanisms for annotating and storing all sets of experimental data that contribute to the solution of a macromolecular crystal structure.

A brief survey of the current situation illustrates the deficiencies of the mechanisms available at the present time.

Diffraction data in the public domain

The maintainers of the Protein Data Bank do not usually receive any diffraction data other than the data set used to refine the deposited structure and, in many cases, do not receive any diffraction data at all. In the PDB snapshot taken on October 1, 2000 there were only 4,129 structure factor files for the 10,900 structures determined by diffraction techniques (about 38%). This shortfall in the number of diffraction data sets is not solely a problem with legacy structures: for the 1,601 structures solved by X-ray diffraction and released between January 1, 2000 and October 1, 2000, only 820 sets of experimental data (about 51%) were available.

Most of the structure factor data at the PDB is stored in very simple mmCIF files [1]. Data annotation (i.e. information on merging R-values, redundancy etc.) is provided via REMARK fields in the associated PDB coordinate files. If completed by the depositor, this information provides a reasonable summary of overall data quality. However, it is unavoidably limited, in both accuracy and scope, by the fact that it is manually entered into a PDB deposition interface or deposition text file.

One can only speculate on what becomes of the multitudes of heavy atom derivative and anomalous scattering data sets that are collected in individual laboratories but not normally deposited with the PDB. It seems more than optimistic to believe that many of these data files remain available for more than a few years after the date that the structure was solved. Furthermore, even when diffraction data sets are securely stored, there may not be sufficient information on their content and accuracy for them to be useful.

Diffraction data in the working laboratory

The two most widely used working formats at the present time appear to be the CCP4/MTZ format and the reflection file format used by programs in the X-PLOR/CNS/CNX lineage. Several other more ad hoc formats (usually simple ASCII reflection lists) are used by other individual programs. As working formats, both the CCP4/MTZ and X-PLOR/CNS/CNX formats are quite satisfactory and, in particular, the mapping of data types provides some safeguards against inappropriate use. The major disadvantage of these formats for data archival is that they do not provide space for data annotation. This means that information on the reliability and purpose of data stored in these types of file can easily become lost over time.

Key issues

The discussion above points out the lack of any public domain 'off the shelf' standards or data models for maintaining macromolecular structure factor data. The most significant issues are as follows:

  1. Only the data set against which the structure was refined is securely stored at the PDB.
  2. The data annotation is disconnected from the file containing the reflection list.
  3. The data annotation is limited to key summary information.

In the case of our own organization (Structural GenomiX) many dozens of data sets, predominantly involving anomalous diffraction for MAD phasing, have been obtained in the last few months. There is an urgent need not only to develop a conceptual standard for annotating these data but also to provide a practical implementation, compatible with the output diagnostics of the current data processing programs.

This communication is intended to stimulate interest in this subject and describes our own efforts towards solving the problem of diffraction data storage within our internal database. Diffraction data files developed along these lines may well be useful for other groups interested in creating local archives of well-annotated diffraction data. It is not our intention to encourage structure depositors to provide diffraction data files of this type to the PDB, since the development of new formats and mechanisms for the annotation of diffraction data at the PDB would require a considerable public discussion involving a number of organizations.

2. Design principles

Our principles for developing a reporting standard for diffraction data are that:

  1. Data annotation should include (but not be limited to) the data quality indices normally reported in journal publications.
  2. The set of data files associated with a single structure entry should provide all information (other than amino acid sequence) that is needed to solve the structure.
  3. The data annotation should reside within the same file as the reflection data in order to avoid any possibility of information loss.
  4. The data annotation should provide sufficient information for an experienced crystallographer to detect problems and limitations in the data set.
  5. The diffraction data should be presented in a layout that is easily translatable to other formats for use with current crystallographic software.

Items (1), (2) and (3) above require little comment.

The rationale behind item (4) is that we anticipate that the processing of raw (frame) data will increasingly become the domain of technical specialists using semi-automated procedures. The reduced data will then be provided to the crystallographer responsible for solving the structure. This scenario is likely to become very common for commercial organizations and public consortia engaged in high-throughput crystallography.

Item (5) expresses the practical consideration that most crystallographers use a variety of programs, which require data in different formats. This means that the storage format should be easily parsable, with straightforward extraction of relevant data items.

3. Conceptual Design

Basic approach

We have taken the point of view that each set of reflections used in a structure determination should be represented in the same way and should be stored as a separate data set. Specifically, this means that we have not employed any special reflection types to represent data obtained from MAD experiments or from derivative crystals. This viewpoint contrasts with the conventional mmCIF view, which has special category groups (_phasing_set_refln, _phasing_MIR_der_refln, etc.) for reflections used in different types of experimental phasing.

We have taken this approach for reasons of conciseness and to avoid some awkwardness with these mmCIF categories for representing experimental phase information derived from multiple sources (i.e. phase information derived by both anomalous diffraction and isomorphous replacement). Furthermore, in some very common structure determination scenarios (for example, two-wavelength MAD from Se-Met crystals, where the 'best' data set plays the role of native in structure refinement) there is no 'native' data set in the conventional sense. In this case it seems undesirable to arbitrarily classify one of the data sets as 'native'. In fact, a survey of the PDB holdings to 1 October 2000 reveals that there are already 108 coordinate sets corresponding to Se-Met proteins.

Annotation

The conceptual design of the file for diffraction data storage involves selecting what information should be reported and the form in which it should be expressed. Figure 1 provides an example of the current content of the file that we have developed.

Our data files use mmCIF categories and records to provide a well-defined annotation. The decision to use the mmCIF dictionary as the basis for our diffraction data files is motivated by the fact that it contains most of the required annotation items and expresses them in a way that is usually fairly obvious to human readers.

Where possible we have used the existing mmCIF definitions for these diffraction data files; where no mmCIF record was available we have developed our own categories and records. In fact, 15 of the 59 records normally used in our diffraction data files are locally defined. The need to develop a significant number of new record types should not be considered surprising for a 'real-world' application developed some years after most of the conceptual development of the mmCIF dictionary. For comparison, it is interesting to note that the current RCSB PDB x-ray deposition form contains as many as 30 locally defined items (i.e. items with prefixes 'ndb' or 'rcsb') out of a total of 107 fields. We anticipate that additional records will be required in our diffraction data files as new data quality metrics emerge and/or further experience with reporting diffraction data in this format reveals a need to provide additional information.

The addition of new mmCIF records has mostly been necessary to report data statistics and reflection values relating to the use of anomalous scattering information. We have introduced _refln.sgx_Fplus_meas_au, _refln.sgx_Fplus_meas_sigma_au, _refln.sgx_Fminus_meas_au and _refln.sgx_Fminus_meas_sigma_au records to capture Friedel pairs as separate data items. Since we frequently want to extract anomalous differences from our data sets, this layout is much more convenient than storing all these data within the single column of _refln.F_meas_au data. In this case _refln.F_meas_au and _refln.F_meas_sigma_au represent the merged reflection, and this representation allows the merged value of the structure factor to be contained within the same data loop as the Friedel pairs.
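As an illustration of why separate Friedel-pair columns are convenient, the following minimal Python sketch computes an anomalous difference and its standard uncertainty from one row of the reflection loop. The function name is hypothetical, and the error propagation assumes that the F+ and F- measurements are independent; it is not necessarily the procedure used in any particular phasing program.

```python
import math

def anomalous_difference(f_plus, sig_plus, f_minus, sig_minus):
    """Return (F+ - F-, sigma) for one Friedel pair.

    Assumption: the two measurements have independent errors, so the
    sigmas add in quadrature.
    """
    d_ano = f_plus - f_minus
    sigma = math.hypot(sig_plus, sig_minus)  # sqrt(sig_plus**2 + sig_minus**2)
    return d_ano, sigma

# Values taken from the (0, 0, 16) reflection row shown in Figure 1.
d_ano, sigma = anomalous_difference(172.4213, 6.9680, 172.8959, 11.8956)
print(f"anomalous difference = {d_ano:.4f}, sigma = {sigma:.4f}")
```

Because the four values sit in named columns of the same row, no pairing or sorting of reflections is needed before the difference is taken.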

The records in the new _reflns_info group are used to provide information on the use of the data set for phase determination; the _reflns_info.sgx_der_type tag is used to record whether the sample contains heavy atoms and _reflns_info.sgx_se reports whether the crystal contains Se atoms.

A few mandatory tags that have no real use within the context of reflection data at Structural GenomiX are included for formal compliance with the mmCIF dictionary.

Fig 1. An illustrative example of an mmCIF file containing diffraction data

data_example.1
_reflns.entry_id                     'example.1'                        
_cell.entry_id                       'example.1'                        
_symmetry.entry_id                   'example.1'                        
_reflns_info.id                      'example.1'                        
_diffrn.id                           1
_diffrn_radiation_wavelength.id      1
_diffrn_radiation_wavelength.wavelength 1.20                               
_reflns_info.sgx_der_type            'none'                             
_reflns_info.sgx_se                  'yes'                              
_reflns.sgx_merged_anom              'no'                               
_reflns.ndb_netI_over_av_sigmaI      13.9                               
_reflns.ndb_Rmerge_I_obs             0.113                              
_reflns.sgx_Rmerge_anom              0.052                              
_reflns.sgx_number_measured          228381                             
_reflns.percent_possible_obs         96.6                               
_reflns.ndb_redundancy               6.8                                
_reflns.sgx_anom_completeness        95.4                               
 loop_
_reflns_shell.d_res_low             
_reflns_shell.d_res_high            
_reflns_shell.meanI_over_sigI_obs   
_reflns_shell.Rmerge_I_obs          
_reflns_shell.sgx_Rmerge_anom       
_reflns_shell.number_measured_all   
_reflns_shell.number_unique_all     
_reflns_shell.percent_possible_all  
_reflns_shell.ndb_redundancy        
_reflns_shell.sgx_anom_completeness 
  .     9.49   20.5   0.076   0.058     4973     1039     85.5      4.8     82.7    
 9.49   6.71   22.0   0.073   0.052    12969     2066     93.4      6.3     96.9    
 6.71   5.48   20.5   0.094   0.058    17289     2601     95.2      6.6     96.8    
 5.48   4.74   19.9   0.104   0.047    20646     3018     95.7      6.8     96.2    
 4.74   4.24   19.2   0.103   0.041    23571     3419     96.2      6.9     96.2    
 4.24   3.87   16.6   0.120   0.045    26387     3759     96.5      7.0     96.6    
 3.87   3.59   13.6   0.141   0.052    28558     4060     96.6      7.0     96.2    
 3.59   3.35   10.5   0.159   0.058    30344     4340     96.7      7.0     95.5    
 3.35   3.16    7.6   0.201   0.072    31541     4599     96.7      6.9     95.1    
 3.16   3.00    5.1   0.278   0.103    32103     4826     96.6      6.7     94.4    
_reflns.B_iso_Wilson_estimate        75.366
_cell.length_a   63.131
_cell.length_b   84.626
_cell.length_c  316.239
_cell.angle_alpha   90.000
_cell.angle_beta    90.000
_cell.angle_gamma   90.000
_symmetry.Int_Tables_number     18
_symmetry.space_group_name_H-M 'P 21 21 2'  
_reflns.number_all   33714
_reflns.d_resolution_low    20.00
_reflns.d_resolution_high    3.00
 loop_
_refln.wavelength_id
_refln.crystal_id
_refln.scale_group_code
_refln.index_h
_refln.index_k
_refln.index_l
_refln.F_meas_au
_refln.F_meas_sigma_au
_refln.sgx_Fplus_meas_au
_refln.sgx_Fplus_meas_sigma_au
_refln.sgx_Fminus_meas_au
_refln.sgx_Fminus_meas_sigma_au
1 1 1     0    0   16   172.6586     6.8931   172.4213     6.9680   172.8959    11.8956
1 1 1     0    0   17    14.2712     6.6891    11.7388     8.2799    16.8037    10.5080

...plus 33714 more rows of reflection data.....    
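Files laid out as in Figure 1 can be read with very little code. The sketch below is a minimal Python reader for only the subset of mmCIF syntax that appears in the figure (single tag/value records and loop_ blocks). The function name and the abridged sample text are illustrative assumptions; multi-line values, comments and the elided-rows placeholder line are not handled, and a full mmCIF parser should be used for general files.

```python
import shlex

def parse_simple_mmcif(text):
    """Parse single records and loop_ blocks from a simple mmCIF file.

    Returns (items, loops): items maps tag -> value string, and loops is
    a list of (tags, rows) pairs, one per loop_ block.
    """
    items, loops = {}, []
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    i = 0
    while i < len(lines):
        line = lines[i]
        if line == "loop_":
            i += 1
            tags = []
            # Loop header: lines holding a single tag and nothing else.
            while (i < len(lines) and lines[i].startswith("_")
                   and len(lines[i].split()) == 1):
                tags.append(lines[i])
                i += 1
            rows = []
            # Loop body: data rows until the next tag, loop_ or data_ line.
            while i < len(lines) and not lines[i].startswith(("_", "loop_", "data_")):
                rows.append(shlex.split(lines[i]))  # honours 'quoted values'
                i += 1
            loops.append((tags, rows))
        elif line.startswith("_"):
            tag, _, value = line.partition(" ")
            items[tag] = shlex.split(value)[0] if value.strip() else None
            i += 1
        else:
            i += 1  # data_ block header or unrecognised line: skipped
    return items, loops

# Abridged sample: two shells and a subset of the columns of Figure 1.
sample = """\
data_example.1
_symmetry.space_group_name_H-M 'P 21 21 2'
 loop_
_reflns_shell.d_res_low
_reflns_shell.d_res_high
_reflns_shell.number_measured_all
_reflns_shell.number_unique_all
_reflns_shell.ndb_redundancy
  .     9.49    4973   1039   4.8
 9.49   6.71   12969   2066   6.3
_reflns.d_resolution_high    3.00
"""

items, loops = parse_simple_mmcif(sample)
tags, rows = loops[0]
col = {tag: k for k, tag in enumerate(tags)}
for row in rows:
    measured = int(row[col["_reflns_shell.number_measured_all"]])
    unique = int(row[col["_reflns_shell.number_unique_all"]])
    reported = float(row[col["_reflns_shell.ndb_redundancy"]])
    # The reported redundancy appears to equal measured / unique.
    print(row[col["_reflns_shell.d_res_high"]], round(measured / unique, 1), reported)
```

The usage lines also show the kind of consistency check that becomes possible once the annotation travels with the data: here the redundancy in each shell is recomputed from the reflection counts and compared with the reported value.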

The focus of this communication is to describe a mechanism for reporting the data processing statistics that are available directly following data reduction. However, we are using the same type of file to store the additional reflection information that becomes available at the completion of the phase determination and refinement processes. We are using the new records _refln.sgx_HL_A, _refln.sgx_HL_B, _refln.sgx_HL_C, _refln.sgx_HL_D to denote Hendrickson-Lattman phase probability coefficients regardless of their experimental source, since phase probabilities only seem to appear in the _phasing_MIR_der_refln records in the mmCIF dictionary. Thus, the set of additional _refln records that we are employing to report phase determination and refinement information is:

_refln.sgx_Fmap 
_refln.fom 
_refln.phase_meas 
_refln.sgx_HL_A
_refln.sgx_HL_B
_refln.sgx_HL_C
_refln.sgx_HL_D
_refln.status

The first seven fields in the list above provide a means to maintain complete experimental phase information, allowing reproduction of the initial experimentally phased electron density map. The final field reports whether a reflection was used within the 'working' refinement data set or it was a part of the 'test' cross-validation data set.
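To show how the Hendrickson-Lattman coefficients relate to the phase and figure-of-merit fields, the Python sketch below evaluates the standard Hendrickson-Lattman phase probability, P(phi) proportional to exp(A cos(phi) + B sin(phi) + C cos(2 phi) + D sin(2 phi)), on a grid of phase angles and returns the centroid phase and figure of merit. The function name and the one-degree sampling are assumptions made for illustration; phasing programs may differ in detail.

```python
import math

def hl_to_phase_fom(a, b, c, d, step_deg=1):
    """Centroid phase (degrees) and figure of merit from HL coefficients.

    The phase probability is sampled every step_deg degrees and the
    probability-weighted mean of (cos(phi), sin(phi)) is computed.
    """
    angles = [math.radians(k) for k in range(0, 360, step_deg)]
    exponents = [a * math.cos(p) + b * math.sin(p)
                 + c * math.cos(2 * p) + d * math.sin(2 * p)
                 for p in angles]
    shift = max(exponents)  # subtract the maximum to avoid overflow in exp()
    total = sum_cos = sum_sin = 0.0
    for p, e in zip(angles, exponents):
        w = math.exp(e - shift)  # unnormalised probability at this phase
        total += w
        sum_cos += w * math.cos(p)
        sum_sin += w * math.sin(p)
    x, y = sum_cos / total, sum_sin / total
    fom = math.hypot(x, y)                           # length of the mean vector
    phase = math.degrees(math.atan2(y, x)) % 360.0   # direction of the mean vector
    return phase, fom

# Hypothetical coefficients: a unimodal distribution centred near 90 degrees.
phase, fom = hl_to_phase_fom(0.0, 2.0, 0.0, 0.0)
print(f"centroid phase = {phase:.1f} deg, figure of merit = {fom:.3f}")
```

Storing the four coefficients rather than only the centroid phase and figure of merit retains the full shape of the phase probability distribution, including bimodal cases, so that phase information from several sources can later be combined by adding coefficients.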

4. Mechanism for creating mmCIF diffraction data files

The annotation provided in these files could be entered manually (by hand editing), although the process would be a little tedious.

Instead, we have developed a parser for automatically extracting most of the information from the reflection files and the output of data processing programs. Specifically, log files from CCP4/SCALA, CCP4/TRUNCATE and the MTZ reflection file are parsed to generate most of the content of the annotated diffraction data file. The EBI/CCP4 data harvesting system also provides a mechanism for extracting key information in standardized form from these programs. For our own database we considered it important to provide a mechanism for extracting data items absent from the data harvesting output, albeit at the cost of the effort involved in writing and maintaining a parser for the CCP4/SCALA and CCP4/TRUNCATE log files.

Although the content of these diffraction data files appears neutral with respect to the data processing programs used, the CCP4/SCALA log file provides a particularly convenient and complete report of data quality, and the information provided in some fields of the diffraction data file would be difficult to capture from the output of certain other data reduction software.

5. Applications and related developments

This file format is presented as a practical and self-contained format in which all data sets and key information associated with a structure determination can be maintained over long periods of time. All that is required is that the set of data files is stored in secure directories or on permanent media such as CD-ROM.

We have developed a storage system within a relational database (Oracle) to provide access to our entire corpus of annotated crystallographic data and coordinates. This crystallographic database, containing consistently and fully annotated diffraction data, should facilitate a longer term scientific objective of evaluating and improving structure determination methodologies. For example, it is clear that the rate and ease of a structure determination is strongly associated with the quality of the first experimentally phased electron density map. However, to properly determine the data collection regimes that optimize the chances of getting a 'good' initial map from a given crystal in the presence of radiation damage (i.e. decisions relating to the trade-off between the numbers of wavelengths for MAD versus redundancy of data) requires complete and consistent information on the outcomes of a large number of structure determinations.

This general approach of crafting appropriately extended mmCIF files may also be used in other areas of macromolecular crystal structure determination. For example, we have also developed an automated system that validates our structures using the SFCHECK V 5.3.42 program [2] and the PROCHECK V 3.53 software [3] (i.e. the versions of this software incorporated into the CCP4 4.0 release). This system collects key results from the output of these programs into a single mmCIF coordinate file, which provides a convenient and consistently annotated report of structure quality.

The virtue of these approaches is that they establish a highly consistent set of self-contained structure files without requiring the crystallographer to perform tedious clerical tasks. Time and expertise are better spent studying the structure and carrying out follow-up crystallography, biochemistry and molecular modelling studies.

6. Acknowledgements

I wish to thank Janet Newman, Tom Peat, Eric de la Fortelle, Sarah Dry and Phil Bourne for comments on this manuscript. The diffraction data that is being used with this system is the result of work by the Crystallography group at Structural GenomiX.

7. References and notes

1. These files are non-standard in that the mandatory _refln.scale_group_code field is usually absent and the _refln.status field (used to denote free and working data sets) is usually represented by a numerical value '0' or '1' rather than the characters 'o' or 'f' defined in the mmCIF dictionary. The CCP4/MTZ2VARIOUS tool creates the _refln.status field according to the mmCIF dictionary definition. However, it appears that the mmCIF parsers in both the SFCHECK [2] and CNX programs read numeric rather than character codes.

2. A.A. Vaguine, J. Richelle, and S.J. Wodak. (1999). Acta Cryst. D55, 191-205.

3. R.A. Laskowski, M.W. MacArthur, D.S. Moss and J.M. Thornton. (1993). J. Appl. Cryst., 26, 283-291.

