                   CCP4 Coordinate Library Project
                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
         the Library with full documentation is found at
                   http://www.ebi.ac.uk/~keb/cldoc/



1. Introduction

The Coordinate Library is designed to aid CCP4 developers in working with
coordinate files and coordinate-related data. The Library works with PDB
and mmCIF coordinate files and has its own internal binary format, which
is portable between platforms. All Library interface functions are
independent of the file format, so a developer does not need to know
which particular format will be used.

The Library provides various high-level tools for working with coordinate-
related data, which include not only reading, parsing and writing, but also
orthogonal-fractional coordinate transforms, generation of symmetry mates,
editing the molecular structure and many others. The Library should be
viewed as a general low-level tool for unifying the coordinate-related
functionality in the CCP4 suite.


2. The coordinate hierarchy.

The Library represents a hierarchy of C++ classes that corresponds to the
macromolecular structure:

      Atom -> Residue -> Chain -> Model -> Manager

where Model corresponds to a model in NMR-originated PDB files. The
Library's Manager provides most of the interface functions, the
coordinate-related functionality and communication between all components
of the coordinate hierarchy.
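
The nesting described above can be pictured with a minimal C++ sketch.
All type and member names below are hypothetical and chosen only for
illustration; they are not the Library's classes or its interface.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical types for illustration only; not the Library's classes.
struct Atom    { std::string name; std::string element; double x, y, z; };
struct Residue { std::string name; int seqNum; char insCode; std::vector<Atom> atoms; };
struct Chain   { std::string id; std::vector<Residue> residues; };
struct Model   { int serial; std::vector<Chain> chains; };

struct Manager {
    std::vector<Model> models;

    // Walks the hierarchy top-down and counts the atoms of one model.
    // Models are numbered from 1 in this sketch.
    std::size_t atomCount(int modelNo) const {
        assert(modelNo >= 1 &&
               static_cast<std::size_t>(modelNo) <= models.size());
        std::size_t n = 0;
        for (const Chain& c : models[modelNo - 1].chains)
            for (const Residue& r : c.residues)
                n += r.atoms.size();
        return n;
    }
};
```

In this sketch each level simply owns a vector of the level below it, so
any query starts at the Manager and works downwards.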

The coordinate hierarchy is created on the fly when a coordinate file is
being read. There are also several ways to create the hierarchy
programmatically, from inside an application. The Manager provides
efficient access to all elements of the hierarchy and a variety
of tools for editing and transforming it.

The Manager provides two functional interfaces. One replicates the
RWBROOK library of the CCP4 suite and may replace it in older CCP4
applications. The other interface is aimed at C/C++ based applications
and provides considerably more functionality than RWBROOK.

There is no limit, in principle, on the number of coordinate hierarchy
components on any level. The hierarchy remains resident in RAM until its
Manager is disposed of. A loaded coordinate file takes approximately
the same amount of RAM as its size on disk.


3. Coordinate-related functionality.

3.1. Reading/Writing coordinate files.

23 interface functions allow for reading/writing PDB, mmCIF and MMDB
binary files. The file name may be given as a string or pointed to by
a user-defined environment variable. All functions read gzipped
files, and the output files may be automatically gzipped as well.
The Library provides functions for automatic recognition of the
coordinate file format and format-independent read functions.


3.2. Querying the coordinate hierarchy.

This set of functions allows for retrieving different sorts of information
from the hierarchy. Examples include (but are not limited to)

o getting the entry ID (or a PDB ID code)
o getting secondary structure as annotated in PDB or mmCIF files
o getting the total number of atoms / residues / chains / models
o checking on the presence of crystallographic information, space symmetry
  group, orthogonalising/fractionalising matrices, non-crystallographic
  symmetry transformation matrices
o checking on quality and completeness of crystallographic information
  set up in the coordinate hierarchy
o getting the number of symmetry operations and the symmetry
  transformation matrix
o calculating averaged properties of atoms in the hierarchy such as
  mass center, bounding box, moments of the mass distribution and others.
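
As an illustration of the last item, the following C++ sketch computes
a mass center and an axis-aligned bounding box for a set of atoms. The
AtomRec and Summary types and the summarize function are hypothetical
names invented for this example; the Library's own functions for these
quantities are not shown here.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <vector>

// Hypothetical atom record for illustration; not a Library class.
struct AtomRec { double x, y, z, mass; };

struct Summary {
    std::array<double, 3> center;  // mass-weighted mean position
    std::array<double, 3> lo, hi;  // corners of the axis-aligned bounding box
};

Summary summarize(const std::vector<AtomRec>& atoms) {
    assert(!atoms.empty());
    Summary s;
    s.center = {0.0, 0.0, 0.0};
    s.lo = {atoms[0].x, atoms[0].y, atoms[0].z};
    s.hi = s.lo;
    double total = 0.0;
    for (const AtomRec& a : atoms) {
        const double p[3] = {a.x, a.y, a.z};
        for (int k = 0; k < 3; ++k) {
            s.center[k] += a.mass * p[k];       // accumulate mass * position
            s.lo[k] = std::min(s.lo[k], p[k]);
            s.hi[k] = std::max(s.hi[k], p[k]);
        }
        total += a.mass;
    }
    assert(total > 0.0);
    for (int k = 0; k < 3; ++k) s.center[k] /= total;
    return s;
}
```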


3.3. Surfing the coordinate hierarchy.

More than 50 interface functions offer a variety of tools for surfing
the coordinate hierarchy. Generally, any model, chain, residue or atom
may be addressed directly using any of the following three methods:

1) by specifying the model number, chain ID, sequence number, insertion
   code, atom name, chemical element and alternative location; these
   characteristics may be replaced by their positional equivalents,
   e.g. 2nd chain (in a model), 25th residue (in a chain), 3rd atom
   (in a residue).
2) by specifying a special path string - the "coordinate ID" (CID)
3) by looking up the hierarchy's internal reference tables.

Different methods provide different flexibility and efficiency. Access
through internal tables is the most efficient one, while using
the coordinate ID provides for maximum flexibility.

A coordinate ID is an ASCII string of the following format:

    /mdl/chn/seq(res).ic/atm[elm]:aloc

where mdl stands for the model number, chn - chain ID, seq - residue
sequence number, res - residue name, ic - residue insertion code,
atm - atom name, elm - chemical element ID, aloc - alternative location
indicator. Any item in a coordinate ID may be replaced by a wildcard "*",
which matches any value of that item. The wildcard value is
automatically implied for any missing item except the chain ID,
insertion code and alternative location indicator.
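
To make the format concrete, here is a simplified C++ sketch that splits
a coordinate ID written in the form above into its items. The CoordId
struct and the parseCid function are hypothetical names made up for this
example. The sketch handles only IDs that start with "/" and list their
fields in order; it is not the Library's parser and does not cover any
other forms the Library may accept.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for the parsed items; not a Library type.
struct CoordId {
    std::string mdl, chn, seq, res, ic, atm, elm, aloc;
};

CoordId parseCid(const std::string& cid) {
    assert(!cid.empty() && cid[0] == '/');
    const std::size_t npos = std::string::npos;

    // Split "/mdl/chn/residue/atom" on '/' into exactly four fields.
    std::vector<std::string> part;
    std::istringstream in(cid.substr(1));
    for (std::string item; std::getline(in, item, '/');) part.push_back(item);
    part.resize(4);

    CoordId id;
    id.mdl = part[0];
    id.chn = part[1];

    // Residue field: seq(res).ic
    const std::string& r = part[2];
    const std::size_t open = r.find('('), close = r.find(')');
    const std::size_t dot = r.find('.', close == npos ? 0 : close);
    if (open != npos && close != npos && open < close) {
        id.seq = r.substr(0, open);
        id.res = r.substr(open + 1, close - open - 1);
    } else {
        id.seq = r.substr(0, dot);
    }
    if (dot != npos) id.ic = r.substr(dot + 1);

    // Atom field: atm[elm]:aloc
    const std::string& a = part[3];
    const std::size_t colon = a.find(':');
    const std::string head = a.substr(0, colon);
    if (colon != npos) id.aloc = a.substr(colon + 1);
    const std::size_t lb = head.find('['), rb = head.find(']');
    if (lb != npos && rb != npos && lb < rb) {
        id.atm = head.substr(0, lb);
        id.elm = head.substr(lb + 1, rb - lb - 1);
    } else {
        id.atm = head;
    }

    // Missing items become "*", except chain ID, insertion code and
    // alternative location indicator, which stay empty.
    for (std::string* s : {&id.mdl, &id.seq, &id.res, &id.atm, &id.elm})
        if (s->empty()) *s = "*";
    return id;
}
```

For example, parsing "/1/A/33" with this sketch leaves the residue name,
atom name and element as "*", while the insertion code and alternative
location indicator remain empty.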


3.4. Selection functions.

Any object of the coordinate hierarchy may be selected, i.e. assigned a
mark whose presence can be checked later. Selected objects are represented
in a uniform indexed space, which may be used for extracting the selected
objects as vectors of objects and for performing different sorts of
operations on them, such as looking for contacts between atoms.

Each selection is associated with an ID, called the selection handle.
Selection handles are generated by the Manager. Any object may participate
in an arbitrary number of selections simultaneously; however, objects of
different types cannot be mixed in one selection set.

There are 5 types of selection operations:

NEW  a new selection is made for the given selection handle; the previous
     selection is wiped out
OR   the new selection is added to the already selected set
AND  the new selection is made within the already selected set
XOR  selects only those objects that are found in either the former or the
     newly selected set, but not in both (exclusive 'or')
CLR  removes the new selection from the already selected set.
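
The five operations correspond to ordinary set operations. The C++
sketch below models a selection as a set of integer object indices and a
handle as a 1-based index into a table of such sets. SelOp and
SelectionTable are hypothetical names for this example only; the sketch
shows the set logic, not how the Library stores selections.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <vector>

enum class SelOp { New, Or, And, Xor, Clr };

// Hypothetical selection table for illustration; not a Library class.
class SelectionTable {
public:
    // Generates a new, empty selection and returns its handle (1, 2, ...).
    int newHandle() {
        sets_.emplace_back();
        return static_cast<int>(sets_.size());
    }

    // Combines the stored selection with a freshly selected set.
    void select(int handle, SelOp op, const std::set<int>& fresh) {
        assert(handle >= 1 && handle <= static_cast<int>(sets_.size()));
        std::set<int>& cur = sets_[handle - 1];
        std::set<int> out;
        auto ins = std::inserter(out, out.end());
        switch (op) {
        case SelOp::New: out = fresh; break;
        case SelOp::Or:
            std::set_union(cur.begin(), cur.end(),
                           fresh.begin(), fresh.end(), ins); break;
        case SelOp::And:
            std::set_intersection(cur.begin(), cur.end(),
                                  fresh.begin(), fresh.end(), ins); break;
        case SelOp::Xor:
            std::set_symmetric_difference(cur.begin(), cur.end(),
                                          fresh.begin(), fresh.end(), ins); break;
        case SelOp::Clr:
            std::set_difference(cur.begin(), cur.end(),
                                fresh.begin(), fresh.end(), ins); break;
        }
        cur.swap(out);
    }

    const std::set<int>& selected(int handle) const {
        assert(handle >= 1 && handle <= static_cast<int>(sets_.size()));
        return sets_[handle - 1];
    }

private:
    std::vector<std::set<int>> sets_;
};
```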

The selection range is specified directly by the object's address
properties (model number, chain ID, sequence number etc.) or by a
coordinate ID as defined above. In the latter case, the CID allows for
lists of values, e.g. "A,B,C" for chain IDs, and ranges of sequence
numbers, e.g. "33.A-40.B".

Geometrical restrictions may be applied to the selections, such as sphere,
cylinder, slab and proximity to already selected objects.

17 interface functions provide the selection functionality. These include

o generation of selection handles
o deleting selections
o selection of objects using different methods for specifying the
  selection range and geometrical restrictions.
o retrieving vectors of selected objects.



3.5. Editing the coordinate hierarchy.

Any object of the coordinate hierarchy may be removed from it. Similarly,
any object may be inserted into the hierarchy at a specified place.
The Library provides numerous editing tools on all levels of the
hierarchy, which work down the hierarchy. For example, the Manager has
tools for editing any atom, residue, chain or model, while a chain has
tools for editing its residues and atoms only.

The edited objects are specified using their address properties (model
number, chain ID, sequence number etc.) or coordinate ID (cf. above).
The Library allows for direct disposal of objects; e.g. if an atom was
obtained in the course of selection, it may be disposed of using its
pointer just as well as through the interface functions: the atom's
destructor will perform all necessary actions for checking the atom out
of the hierarchy.

Editing tools allow for programmatic composition of coordinate
hierarchies. Generated atoms are added to residues, residues to chains,
chains to models, and then models are checked into the Manager class.


3.6. Contact-finding functions.

These functions allow one to find all pairs of atoms that are in a particular
range of distance from each other. Special functions are designed for
finding contacts between an atom and a set of atoms and between two sets
of atoms. The functions use a bricking algorithm, which is much more
efficient than all-to-all pairwise comparisons.
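
The idea behind bricking can be sketched as follows: space is divided
into cubic bricks, each atom is assigned to a brick, and an atom is then
compared only with atoms in its own and neighbouring bricks. The C++
code below is a generic illustration of that idea, with the brick edge
assumed equal to the maximal contact distance. Point, Contact and
findContacts are hypothetical names; the Library's actual brick size,
data layout and interface may differ.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

struct Point { double x, y, z; };
using Contact = std::pair<std::size_t, std::size_t>;  // indices (i, j), i < j

std::vector<Contact> findContacts(const std::vector<Point>& pts,
                                  double minDist, double maxDist) {
    assert(maxDist > 0.0 && minDist >= 0.0 && minDist <= maxDist);
    using Key = std::array<long, 3>;
    auto keyOf = [maxDist](const Point& p) {
        return Key{{static_cast<long>(std::floor(p.x / maxDist)),
                    static_cast<long>(std::floor(p.y / maxDist)),
                    static_cast<long>(std::floor(p.z / maxDist))}};
    };

    // Sort points into cubic bricks whose edge equals the maximal distance.
    std::map<Key, std::vector<std::size_t>> bricks;
    for (std::size_t i = 0; i < pts.size(); ++i)
        bricks[keyOf(pts[i])].push_back(i);

    // Any contact partner lies in the same brick or one of its 26 neighbours.
    std::vector<Contact> out;
    for (std::size_t i = 0; i < pts.size(); ++i) {
        const Key k = keyOf(pts[i]);
        for (long dx = -1; dx <= 1; ++dx)
        for (long dy = -1; dy <= 1; ++dy)
        for (long dz = -1; dz <= 1; ++dz) {
            auto it = bricks.find(Key{{k[0] + dx, k[1] + dy, k[2] + dz}});
            if (it == bricks.end()) continue;
            for (std::size_t j : it->second) {
                if (j <= i) continue;  // report each pair once
                const double ex = pts[i].x - pts[j].x;
                const double ey = pts[i].y - pts[j].y;
                const double ez = pts[i].z - pts[j].z;
                const double d = std::sqrt(ex * ex + ey * ey + ez * ez);
                if (d >= minDist && d <= maxDist) out.emplace_back(i, j);
            }
        }
    }
    return out;
}
```

With roughly uniform atom density, each atom is tested against a bounded
number of neighbours, so the work grows about linearly with the number
of atoms instead of quadratically.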


3.7. Coordinate transformations.

The Library offers various functions for performing different types of
coordinate transformations on the coordinate hierarchy, such as
fractionalizing/orthogonalizing, symmetry mating and rotation. These
functions may be applied to an arbitrary set of atoms, e.g. one resulting
from atom selection.
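
As a worked illustration of fractionalizing and orthogonalizing, the
C++ sketch below builds an orthogonalization matrix from unit cell
parameters and applies it in both directions. It assumes the common
convention in which the a axis lies along x and the b axis lies in the
xy plane; the convention actually used by the Library is not stated
here, so this is an assumption. Cell, orthMatrix, fracToOrth and
orthToFrac are hypothetical names for this example.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<Vec3, 3>;

// Unit cell: edge lengths and angles in degrees.
struct Cell { double a, b, c, alpha, beta, gamma; };

// Orthogonalization matrix under the assumed convention (upper triangular).
Mat3 orthMatrix(const Cell& u) {
    const double rad = std::acos(-1.0) / 180.0;
    const double ca = std::cos(u.alpha * rad), cb = std::cos(u.beta * rad);
    const double cg = std::cos(u.gamma * rad), sg = std::sin(u.gamma * rad);
    // Square of (cell volume / (a*b*c)).
    const double v2 = 1.0 - ca * ca - cb * cb - cg * cg + 2.0 * ca * cb * cg;
    assert(sg > 0.0 && v2 > 0.0);
    Mat3 m;
    m[0] = {u.a, u.b * cg, u.c * cb};
    m[1] = {0.0, u.b * sg, u.c * (ca - cb * cg) / sg};
    m[2] = {0.0, 0.0,      u.c * std::sqrt(v2) / sg};
    return m;
}

// Fractional -> orthogonal: plain matrix-vector product.
Vec3 fracToOrth(const Mat3& m, const Vec3& f) {
    Vec3 x;
    for (int i = 0; i < 3; ++i)
        x[i] = m[i][0] * f[0] + m[i][1] * f[1] + m[i][2] * f[2];
    return x;
}

// Orthogonal -> fractional: back substitution on the triangular matrix.
Vec3 orthToFrac(const Mat3& m, const Vec3& x) {
    Vec3 f;
    f[2] = x[2] / m[2][2];
    f[1] = (x[1] - m[1][2] * f[2]) / m[1][1];
    f[0] = (x[0] - m[0][1] * f[1] - m[0][2] * f[2]) / m[0][0];
    return f;
}
```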


3.8. User-defined data.

User-defined data (UDD) is data that is not found in PDB and mmCIF
coordinate files, has no predefined meaning and becomes known only
at run-time. UDD may be saved into and retrieved from MMDB binary files.

UDD are allowed on any level of the coordinate hierarchy. This means
that an arbitrary number of UDDs may be added to any atom, residue,
chain or model.

MMDB offers 3 types of UDDs: integer values, real-type values and
strings. In order to avoid clashes between different applications using
UDD, each UDD must be registered with a unique name (ID) before it
may be used.
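
The register-before-use rule can be illustrated with a small C++ sketch.
UddType, UddRegistry and all member names are hypothetical and do not
reproduce the Library's interface; only integer values are shown, and
the behaviour on a repeated or clashing name is an assumption made for
the example.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

enum class UddType { Integer, Real, String };

// Hypothetical registry for illustration; not a Library class.
class UddRegistry {
public:
    // Registers a name and returns its handle (1, 2, ...). Registering the
    // same name again with the same type returns the existing handle; the
    // same name with a different type is a clash and returns 0.
    int registerUdd(const std::string& name, UddType type) {
        auto it = byName_.find(name);
        if (it != byName_.end())
            return types_[it->second - 1] == type ? it->second : 0;
        types_.push_back(type);
        const int handle = static_cast<int>(types_.size());
        byName_[name] = handle;
        return handle;
    }

    // Stores an integer value for one object under a registered handle.
    void putInteger(int handle, int objectIndex, int value) {
        assert(handle >= 1 && handle <= static_cast<int>(types_.size()));
        assert(types_[handle - 1] == UddType::Integer);
        ints_[{handle, objectIndex}] = value;
    }

    // Returns false if no value was stored for this object and handle.
    bool getInteger(int handle, int objectIndex, int& value) const {
        auto it = ints_.find({handle, objectIndex});
        if (it == ints_.end()) return false;
        value = it->second;
        return true;
    }

private:
    std::map<std::string, int> byName_;
    std::vector<UddType> types_;
    std::map<std::pair<int, int>, int> ints_;
};
```

Prefixing names with the application name, e.g. "myapp.cluster", is one
simple way to keep them unique across applications.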


3.9. Monomer library support.

The Library provides functions for monomer library (ML) support. The ML
is compiled from the EBI ligand database and arranged into indexed binary
files. The Library provides functions for extracting ligand data from
ML files and for running queries such as looking for a ligand with
maximal structural similarity. The types of data that are supported by
the ML include

o compound name and synonyms
o chemical formula
o list of atoms, their chemical identities, bonds and bond orders
o 3D atom coordinates where available
o atom energy types
o internal coordinates - bond lengths, angles and torsions as
  derived from energy types.


3.10. Reading/writing mmCIF files.

The Library offers standalone tools for reading/writing arbitrary mmCIF
files. The files are parsed into a hierarchy of C++ classes corresponding
to mmCIF categories and loops. The content of mmCIF files is then
accessible in random order from this dynamic hierarchy by category
name, data item name and loop row. The mmCIF class hierarchy may also
be created programmatically and filled with application-dependent data.
Writing such a hierarchy to a disk file produces valid mmCIF syntax.