CCP4 Coordinate Library Project
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Library with full documentation is found at
http://www.ebi.ac.uk/~keb/cldoc/


1. Introduction

The Coordinate Library is designed to help CCP4 developers work with
coordinate files and coordinate-related data. The Library works with
PDB and mmCIF coordinate files and has its own internal binary format,
which is portable between different platforms. All Library interface
functions are independent of the file format, so a developer does not
need to know which particular format will be used.

The Library provides various high-level tools for working with
coordinate-related data. These include not only reading, parsing and
writing, but also orthogonal-fractional coordinate transforms,
generation of symmetry mates, editing the molecular structure and many
others.

The Library should be viewed as a general low-level tool for unifying
the coordinate-related functionality in the CCP4 suite.


2. The coordinate hierarchy.

The Library represents a hierarchy of C++ classes that corresponds to
the macromolecular structure:

    Atom -> Residue -> Chain -> Model -> Manager

where Model corresponds to a model in NMR-originated PDB files. The
Library's Manager provides most of the interface functions,
coordinate-related functionality and communication between all
components of the coordinate hierarchy.

The coordinate hierarchy is created on the fly while a coordinate file
is being read. There are also several ways to create the hierarchy
programmatically, from inside an application. The Manager provides
efficient access to all elements of the hierarchy and a variety of
tools for editing and transforming it.

The Manager provides two functional interfaces. One replicates the
RWBROOK library of the CCP4 suite and may replace it in older CCP4
applications. The other interface is aimed at C/C++ applications and
provides considerably enhanced functionality compared with RWBROOK.

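As an illustration only, the Atom -> Residue -> Chain -> Model ->
Manager containment can be sketched with plain C++ structs. All type
and function names below (SketchAtom, SketchManager, countAtoms,
makeTinyHierarchy) are hypothetical and chosen for this sketch; they
are not the Library's classes, whose interface is described in the
full documentation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for the hierarchy levels; NOT the Library's classes.
struct SketchAtom    { std::string name; double x, y, z; };
struct SketchResidue { std::string name; int seqNum; std::vector<SketchAtom> atoms; };
struct SketchChain   { std::string id; std::vector<SketchResidue> residues; };
struct SketchModel   { int serial; std::vector<SketchChain> chains; };

// The manager owns the models and answers queries that span all levels.
struct SketchManager {
    std::vector<SketchModel> models;

    // Walks the whole hierarchy and counts the atoms.
    std::size_t countAtoms() const {
        std::size_t n = 0;
        for (const SketchModel& m : models)
            for (const SketchChain& c : m.chains)
                for (const SketchResidue& r : c.residues)
                    n += r.atoms.size();
        return n;
    }
};

// Builds a hierarchy programmatically: one model, one chain, one residue
// with two atoms (made-up coordinates).
inline SketchManager makeTinyHierarchy() {
    SketchResidue gly{"GLY", 1, {{"N", 0.0, 0.0, 0.0}, {"CA", 1.46, 0.0, 0.0}}};
    SketchChain chainA{"A", {gly}};
    SketchModel model{1, {chainA}};
    return SketchManager{{model}};
}
```

In this sketch each level simply owns a vector of the level below it,
which is the simplest way to express the containment; a real
implementation would also need back-references and lookup tables.
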
There is, in principle, no limit on the number of coordinate hierarchy
components on any level. The hierarchy remains resident in RAM until
its Manager is disposed of. A loaded coordinate file takes
approximately the same amount of RAM as its size.


3. Coordinate-related functionality.

3.1. Reading/Writing coordinate files.

A set of 23 interface functions allows for reading and writing PDB,
mmCIF and MMDB binary files. The file name may be given as a string or
referenced through a user-defined environment variable. All functions
can read gzipped files; output files may be automatically gzipped as
well. The Library provides functions for automatic recognition of the
coordinate file format, as well as format-independent read functions.

3.2. Querying the coordinate hierarchy.

This set of functions allows for retrieving different sorts of
information from the hierarchy. Examples include (but are not limited
to)

  o getting the entry ID (or a PDB ID code)
  o getting secondary structure as annotated in PDB or mmCIF files
  o getting the total number of atoms / residues / chains / models
  o checking on the presence of crystallographic information, space
    symmetry group, orthogonalising/fractionalising matrices and
    non-crystallographic symmetry transformation matrices
  o checking on the quality and completeness of the crystallographic
    information set up in the coordinate hierarchy
  o getting the number of symmetry operations and the symmetry
    transformation matrices
  o calculating averaged properties of atoms in the hierarchy, such
    as the mass center, bounding box, moments of the mass
    distribution and others.

3.3. Surfing the coordinate hierarchy.

More than 50 interface functions offer a variety of tools for surfing
the coordinate hierarchy.

Generally, any model, chain, residue or atom may be addressed directly
using any of the following three methods:

  1) by specifying the model number, chain ID, sequence number,
     insertion code, atom name, chemical element and alternative
     location; these characteristics may be replaced by their
     positional equivalents, e.g. 2nd chain (in a model), 25th residue
     (in a chain), 3rd atom (in a residue)
  2) by specifying a special path string - the "coordinate ID" (CID)
  3) by looking up the hierarchy's internal reference tables.

The methods differ in flexibility and efficiency. Access through the
internal tables is the most efficient, while the coordinate ID
provides maximum flexibility.

A coordinate ID is an ASCII string of the following format:

    /mdl/chn/seq(res).ic/atm[elm]:aloc

where

    mdl   - model number
    chn   - chain ID
    seq   - residue sequence number
    res   - residue name
    ic    - residue insertion code
    atm   - atom name
    elm   - chemical element ID
    aloc  - alternative location indicator.

Any item in a coordinate ID may be replaced by the wildcard "*", which
matches any value of that item. The wildcard is automatically implied
for any missing item except the chain ID, insertion code and
alternative location indicator. For example, under these rules
/1/A/33(ALA)/CA[C] addresses the carbon atom named CA of residue ALA
33 (no insertion code) in chain A of model 1, with no alternative
location indicator.

3.4. Selection functions.

Any object of the coordinate hierarchy may be selected, i.e. assigned
a mark whose presence can later be checked. Selected objects are
represented in a uniform indexed space, from which they may be
extracted as vectors of objects and on which various operations, such
as looking for contacts between atoms, may be performed.

Each selection is associated with an ID, called the selection handle.
Selection handles are generated by the Manager. Any object may
participate in an arbitrary number of selections simultaneously;
however, objects of different types cannot be mixed in one selection
set.

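A minimal sketch of the bookkeeping behind selection handles, and of
how the selection operations described in this section combine sets,
is given here. Each selection is modelled as a set of integer object
indices. The SelKey enumeration and the SketchSelections class are
assumptions made for this illustration and are not the Library's
interface.

```cpp
#include <cassert>
#include <map>
#include <set>

// Hypothetical names mirroring the selection operations of this section.
enum class SelKey { NEW, OR, AND, XOR, CLR };

class SketchSelections {
public:
    // Generates a fresh selection handle bound to an empty selection.
    int newHandle() {
        int h = nextHandle_++;
        sets_[h];  // creates the empty set for this handle
        return h;
    }

    // Combines the objects in 'picked' with the selection behind 'handle'.
    // Throws std::out_of_range (from map::at) if the handle is unknown.
    void select(int handle, SelKey key, const std::set<int>& picked) {
        std::set<int>& cur = sets_.at(handle);
        std::set<int> out;
        switch (key) {
        case SelKey::NEW:   // previous selection is wiped out
            out = picked;
            break;
        case SelKey::OR:    // union
            out = cur;
            out.insert(picked.begin(), picked.end());
            break;
        case SelKey::AND:   // intersection
            for (int i : cur) if (picked.count(i)) out.insert(i);
            break;
        case SelKey::XOR:   // in exactly one of the two sets
            for (int i : cur) if (!picked.count(i)) out.insert(i);
            for (int i : picked) if (!cur.count(i)) out.insert(i);
            break;
        case SelKey::CLR:   // set difference
            for (int i : cur) if (!picked.count(i)) out.insert(i);
            break;
        }
        cur = out;
    }

    // Returns the indices currently selected under 'handle'.
    const std::set<int>& selected(int handle) const { return sets_.at(handle); }

    // Deletes the selection and invalidates its handle.
    void deleteSelection(int handle) { sets_.erase(handle); }

private:
    int nextHandle_ = 1;
    std::map<int, std::set<int>> sets_;
};
```

Because every handle owns its own set, one object index may appear in
any number of selections at once, which matches the behaviour
described in this section.
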
There are 5 types of selection operations:

    NEW  a new selection is made for the given selection handle; the
         previous selection is wiped out
    OR   the new selection is added to the already selected set
    AND  the new selection is made on the already selected set
    XOR  selects only those atoms that are found in either the former
         or the newly selected set, but not in both (exclusive 'or')
    CLR  removes the new selection from the already selected set.

The selection range is specified directly by the object's address
properties (model number, chain ID, sequence number etc.) or by a
coordinate ID as defined above. In this case, the CID allows for lists
of values, e.g. "A,B,C" for chain IDs, and ranges of sequence numbers,
e.g. "33.A-40.B". Geometrical restrictions may be applied to the
selections, such as a sphere, cylinder, slab or proximity to already
selected objects.

A set of 17 interface functions provides the selection functionality.
These include

  o generation of selection handles
  o deleting selections
  o selection of objects using different methods for specifying the
    selection range and geometrical restrictions
  o retrieving vectors of selected objects.

3.5. Editing the coordinate hierarchy.

Any object of the coordinate hierarchy may be removed from it.
Similarly, any object may be put into the hierarchy at a specified
place. The Library provides numerous editing tools on all levels of
the hierarchy, which work down the hierarchy. For example, the Manager
has tools for editing any atom, residue, chain or model, while a chain
has tools for editing its residues and atoms only. The edited objects
are specified using their address properties (model number, chain ID,
sequence number etc.) or a coordinate ID (cf. above).

The Library allows for direct disposal of objects. For example, if an
atom was obtained in the course of a selection, it may be disposed of
through its pointer just as well as through the interface functions:
the atom's destructor will perform all actions necessary to check the
atom out of the hierarchy.

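The direct-disposal behaviour described above can be pictured with the
following sketch, in which an atom's destructor removes the atom from
its parent residue's table. SketchAtom, SketchResidue and their
members are hypothetical names chosen for this illustration; the
Library's own classes and ownership rules may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

struct SketchResidue;

// Hypothetical atom that knows which residue it is checked into.
struct SketchAtom {
    std::string name;
    SketchResidue* parent = nullptr;

    explicit SketchAtom(std::string n) : name(std::move(n)) {}
    ~SketchAtom();  // checks the atom out of its parent, if any
    SketchAtom(const SketchAtom&) = delete;
    SketchAtom& operator=(const SketchAtom&) = delete;
};

// Hypothetical residue holding pointers to heap-allocated atoms it owns.
struct SketchResidue {
    std::vector<SketchAtom*> atoms;

    SketchResidue() = default;
    SketchResidue(const SketchResidue&) = delete;
    SketchResidue& operator=(const SketchResidue&) = delete;

    // Takes ownership of a heap-allocated atom.
    void addAtom(SketchAtom* a) {
        a->parent = this;
        atoms.push_back(a);
    }

    // Interface-style removal: finds the first atom with this name and
    // disposes of it. Returns false if no such atom exists.
    bool deleteAtom(const std::string& atomName) {
        for (SketchAtom* a : atoms)
            if (a->name == atomName) { delete a; return true; }
        return false;
    }

    ~SketchResidue() {
        // Each delete shrinks 'atoms' through the atom's destructor.
        while (!atoms.empty()) delete atoms.back();
    }
};

// Removes the atom's pointer from the parent's table before it dies.
SketchAtom::~SketchAtom() {
    if (parent) {
        std::vector<SketchAtom*>& v = parent->atoms;
        v.erase(std::remove(v.begin(), v.end(), this), v.end());
    }
}
```

With this arrangement, "delete atomPointer" and
"residue.deleteAtom(name)" leave the residue in the same consistent
state, since both paths run the same destructor.
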
Editing tools allow for programmatic composition of coordinate
hierarchies. Generated atoms are added to residues, residues to chains
and chains to models; the models are then checked into the Manager
class.

3.6. Contact-looking functions.

These functions allow one to find all pairs of atoms that lie within a
particular range of distances from each other. Special functions are
designed for finding contacts between an atom and a set of atoms, and
between two sets of atoms. The functions use a bricking algorithm,
which sorts atoms into spatial cells ("bricks") so that only atoms in
nearby bricks are compared; this is much more efficient than
all-to-all pairwise comparison.

3.7. Coordinate transformations.

The Library offers various functions for performing different types of
coordinate transformations on the coordinate hierarchy, such as
fractionalizing/orthogonalizing, symmetry mating and rotation. These
functions may be applied to an arbitrary set of atoms, e.g. one
resulting from an atom selection.

3.8. User-defined data.

User-defined data (UDD) is data that is not found in PDB or mmCIF
coordinate files, has no predefined meaning and becomes known only at
run time. UDD may be saved into and retrieved from MMDB binary files.
UDD are allowed on any level of the coordinate hierarchy. This means
that an arbitrary number of UDDs may be added to any atom, any residue
and any model. MMDB offers 3 types of UDDs: integer values, real-type
values and strings. In order to avoid clashes between different
applications using UDD, each UDD must be registered with a unique name
(ID) before it may be used.

3.9. Monomer library support.

The Library provides functions for monomer library (ML) support. The
ML is compiled from the EBI ligand database and arranged into indexed
binary files. The Library provides functions for extracting ligand
data from ML files and for running queries, such as looking for the
ligand with maximal structural similarity.

The types of data that are supported by the ML include

  o compound name and synonyms
  o chemical formula
  o list of atoms, their chemical identities, bonds and bond orders
  o 3D atom coordinates where available
  o atom energy types
  o internal coordinates - bond lengths, angles and torsions as
    derived from energy types.

3.10. Reading/writing mmCIF files.

The Library offers standalone tools for reading and writing arbitrary
mmCIF files. The files are parsed into a hierarchy of C++ classes
corresponding to mmCIF categories and loops. The content of an mmCIF
file is then accessible in random order from this dynamic hierarchy,
by category name, data item name and loop row. The mmCIF class
hierarchy may also be created programmatically and filled with
application-dependent data. Writing such a hierarchy to a disk file
produces valid mmCIF syntax.

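To make the category/loop organisation concrete, here is a
deliberately small sketch of such a dynamic structure, with access by
category name, item name and loop row, and a simple writer. All names
(SketchLoop, SketchData, cifValue, writeCif) are hypothetical, and the
writer handles only simple values; it is not the Library's mmCIF class
hierarchy.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical loop: one mmCIF category with named items and rows of values.
struct SketchLoop {
    std::vector<std::string> items;              // data item names, e.g. "id"
    std::vector<std::vector<std::string>> rows;  // one vector per loop row

    // Random access by item name and row; returns "?" (the mmCIF marker
    // for an unknown value) if the item or row does not exist.
    std::string get(const std::string& item, std::size_t row) const {
        for (std::size_t i = 0; i < items.size(); ++i)
            if (items[i] == item && row < rows.size() && i < rows[row].size())
                return rows[row][i];
        return "?";
    }
};

// Hypothetical data block: category name (e.g. "_atom_site") -> loop.
struct SketchData {
    std::string name;
    std::map<std::string, SketchLoop> categories;
};

// Quotes a value if it contains a space and writes "." for an empty one.
// Values with quotes, newlines or leading special characters are outside
// the scope of this sketch.
std::string cifValue(const std::string& v) {
    if (v.empty()) return ".";
    return v.find(' ') == std::string::npos ? v : "'" + v + "'";
}

// Writes the data block header and every category as a loop_ construct.
void writeCif(const SketchData& d, std::ostream& out) {
    out << "data_" << d.name << "\n";
    for (const auto& kv : d.categories) {
        out << "loop_\n";
        for (const std::string& item : kv.second.items)
            out << kv.first << "." << item << "\n";
        for (const auto& row : kv.second.rows) {
            for (std::size_t i = 0; i < row.size(); ++i)
                out << (i ? " " : "") << cifValue(row[i]);
            out << "\n";
        }
    }
}
```

Keeping the writer as the only place where quoting is decided is what
lets a structure like this emit syntactically consistent output
regardless of how it was filled.
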