Rationalisation of CCP4 Model Analysis Programs

Peter Briggs
March 2000

Overview
Comparison of the programs
Possible actions

Overview

The CCP4 Suite contains a number of programs designated as model analysis programs:

act - analyse coordinates
areaimol - analyse solvent accessible areas
baverage - averages B over main and side chain atoms
contact - computes various types of contacts in protein structures
distang - distances and angles calculation
dyndom - determine dynamic domains when two conformations are available
polypose - superimpose many multi-domain structures
procheck - check the stereochemical quality of protein structures
sc - analyse shape complementarity
sfcheck - assess agreement between atomic model and X-ray data
surface - surface accessibility/prepare input to volume
topp - topological comparison
volume - polyhedral volume around selected atoms

Some other programs also seem to overlap into this area:

geomcalc - molecular geometry calculations
lsqkab - apply various transformations to coordinate files

There is also an unsupported program:

angles - bond lengths, bond angles and dihedral angles from coordinate files

Comparison of the programs

Here I'm trying to broadly group together programs with similar functions, at the same time keeping an eye on differences. This is far from comprehensive.

The most urgent cases for treatment are the groups of programs performing surface/volume calculations (SURFACE/VOLUME/AREAIMOL) and contact/distance calculations (CONTACT/DISTANG/ACT).

Distances, angles, and contacts

Both ACT and CONTACT look for crystal contacts. CONTACT has been developed more recently, although I think that the method used in ACT is better for finding all crystal contacts (it works in fractional coordinates rather than using the brute-force search method of CONTACT).

DISTANG has a number of functions:

1. calculate bond angles and distances from a subset of coordinates in a PDB file,
2. calculate possible packing restrictions for a coordinate set (packing search),
3. generate information in a suitable format for the PROTIN dictionary from a set of coordinates.

Function 1 is in CONTACT? Function 3 seems to duplicate MAKEDICT (hopefully soon to be obsolete with the advent of the new dictionary). The packing search sounds as if it might be useful. DISTANG is also used to generate input for WATERTIDY.

GEOMCALC also examines geometry within a set of coordinates (fit atoms to a plane, bond angles, atom-atom distances etc) which suggests some overlap with DISTANG (and MAKEDICT?).

Surfaces, areas and volumes

AREAIMOL and SURFACE calculate the same quantities but using different algorithms. Tests suggest that there isn't much difference in the results; however, the programs themselves differ in a number of ways. SURFACE has obtuse input and limited output, and doesn't allow for symmetry-related molecules or area differences, whereas AREAIMOL can look at these things and also offers a range of analyses. What AREAIMOL is missing is SURFACE's powerful method for selecting atoms to be included in or excluded from the calculations, and the ability to assign VDW radii based on context, e.g. Calphas may be given a different radius to Cbetas.

VOLUME is alone in calculating volumes. The drawback is that it requires SURFACE to prepare the input, so it cannot stand alone.

SC examines the interface between two molecular surfaces. At some stage possibly the calculation of the molecular surface could be subsumed into AREAIMOL/SURFACE?

Fitting model fragments

This strays into the area of molecular replacement.

POLYPOSE overlaps with LSQKAB? (or TOPP?)

B-value analysis

ACT does some calculation of B-value statistics - could these go into BAVERAGE? (are they there now?)

Miscellaneous

PROCHECK, SFCHECK, DYNDOM, TOPP and SC shouldn't be manipulated at a program level, since they still belong to the original authors (and may be under active development). Their functionality also does not appear to overlap. They are however prime candidates for interfacing via ccp4i.

Possible courses of action

Leaving aside the issue of simply merging together programs which share a large part of their functionality, some standardisation of the programs would be useful from the point of view of maintainability and improved efficiency. Possibly some functions could be generalised and farmed out to the libraries, eventually forming the basis of a "toolbox" of useful functions.

Standardisation of the way users interact with the programs can happen at two levels - at the level of the actual programs (requiring a lot of rewriting no doubt), and at the level of the graphical interface.

Program Standardisation

Program output

Improving the output of the programs is something which will have a direct and immediate impact on user friendliness. This will include adding html/summary tags into the output, something which will be increasingly useful for people using the graphical interface.

Keywords/atom selection

There is an argument for standardisation of keywords, particularly those used for selecting which parts of the model are to be analysed.

CONTACT recently had FROM...TO keywords added along the lines of DISTANG(?); this is one possible model. Liz Potterton has looked at the more general problem of atom selection (see http://www.ysbl.york.ac.uk/~lizp/ccp4/atom_selection.html), though these programs shouldn't need the whole range of options discussed there.

To start with, there is now a routine RDATOMSELECT in PARSER which can decode atom selection commands of the type used in CONTACT and DISTANG (in fact CONTACT now uses this routine). There is also a KEYPARSE equivalent (ParseAtomSelect).

Atom identification

This was prompted by looking inside of ACT, which performs different checks on the properties of each atom: B-value, occupancy, and whether the atom is "main chain", "side chain", solvent, metal etc.

So another possibility would be to turn these into library routines in some way. At a simplistic level, routines (e.g. is_main_chain) would only require the atom and residue names, and would return TRUE or FALSE accordingly. A more powerful implementation would create some kind of database, which would have to be initialised by the external program by loading it with the relevant details. Other routines would then be used to interrogate the database.

(This latter approach is attractive because a) it ties in with Liz's ideas about more powerful forms of atom selection, e.g. selecting on B-values, and b) it may offer the possibility of creating very powerful functions from which analysis programs could quickly be constructed. For example, functions could return the distance between two atoms. Such library functions could also be used for the bricking algorithms discussed below.)
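To make the two levels concrete, here is a minimal sketch in Python (purely for illustration; the real routines would live in the CCP4 libraries). Only the name is_main_chain comes from the discussion above. Everything else is hypothetical: the AtomTable class, its methods, and the sets of atom and residue names used for classification are assumptions for the example, not existing library code.

```python
import math
from dataclasses import dataclass

# Assumption: the usual PDB backbone atom names for amino-acid residues.
MAIN_CHAIN_NAMES = {"N", "CA", "C", "O", "OXT"}
# Assumption: residue names commonly used for water.
SOLVENT_RESIDUES = {"HOH", "WAT", "H2O"}


def is_solvent(res_name):
    return res_name.strip().upper() in SOLVENT_RESIDUES


def is_main_chain(atom_name, res_name):
    """Simplistic level: needs only the atom and residue names."""
    if is_solvent(res_name):
        return False
    return atom_name.strip().upper() in MAIN_CHAIN_NAMES


@dataclass
class Atom:
    name: str
    res_name: str
    xyz: tuple          # orthogonal coordinates in Angstroms
    b_value: float
    occupancy: float = 1.0


class AtomTable:
    """Hypothetical 'database' that the calling program loads and queries."""

    def __init__(self):
        self.atoms = []

    def load(self, atoms):
        self.atoms.extend(atoms)

    def select(self, predicate):
        """Return the indices of atoms for which predicate(atom) is true."""
        return [i for i, a in enumerate(self.atoms) if predicate(a)]

    def distance(self, i, j):
        """Distance in Angstroms between atoms i and j."""
        return math.dist(self.atoms[i].xyz, self.atoms[j].xyz)


table = AtomTable()
table.load([
    Atom("N", "ALA", (0.0, 0.0, 0.0), 12.0),
    Atom("CA", "ALA", (1.5, 0.0, 0.0), 14.0),
    Atom("CB", "ALA", (2.0, 1.4, 0.0), 30.0),
    Atom("O", "HOH", (8.0, 8.0, 8.0), 45.0),
])
main_chain = table.select(lambda a: is_main_chain(a.name, a.res_name))
high_b = table.select(lambda a: a.b_value > 25.0)
```

The name-based test is deliberately naive: for instance it would misclassify a calcium ion whose atom name is CA, so a real routine would also need to know whether the residue is an amino acid. Selection on B-values or occupancy falls out of the same select call once the details have been loaded.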

Other functions

CONTACT and AREAIMOL both use similar brick-search algorithms to search for contacts between atoms. (Does ACT also?) There may be some potential here for common bricking routines, also for routines to generate symmetry-related atoms and place them in the unit cell.
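As an illustration of what a common bricking routine might look like, the sketch below bins atoms into cubic bricks whose edge equals the contact cutoff, so that each atom only needs to be compared with atoms in its own brick and the 26 neighbouring ones. This is a generic version of the technique written in Python, not the code used in CONTACT or AREAIMOL; the function name find_contacts and its arguments are assumptions.

```python
import math
from collections import defaultdict
from itertools import product


def find_contacts(coords, cutoff):
    """Return sorted (i, j) pairs with i < j whose distance is within cutoff.

    Atoms are binned into cubic bricks of edge `cutoff`, so any pair within
    the cutoff must lie in the same brick or in two adjacent bricks.
    """
    bricks = defaultdict(list)
    for idx, (x, y, z) in enumerate(coords):
        key = (math.floor(x / cutoff),
               math.floor(y / cutoff),
               math.floor(z / cutoff))
        bricks[key].append(idx)

    pairs = []
    for (bx, by, bz), members in bricks.items():
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            # .get() does not create missing bricks, so the dict is not
            # modified while we iterate over it.
            neighbours = bricks.get((bx + dx, by + dy, bz + dz))
            if not neighbours:
                continue
            for i in members:
                for j in neighbours:
                    if i < j and math.dist(coords[i], coords[j]) <= cutoff:
                        pairs.append((i, j))
    return sorted(pairs)


coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0),
          (5.5, 5.0, 5.0), (-0.5, 0.0, 0.0)]
contacts = find_contacts(coords, 2.0)
```

A library version would presumably take the symmetry-generated atoms as well, so that the same routine serves both intra- and intermolecular contact searches.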

ACT also employs an efficient method for finding all symmetry-related contacts, regardless of the number of associated lattice-vector translations required in combination with the symmetry operator. This is in contrast to e.g. AREAIMOL (which searches +/-1 lattice vector in all directions) and CONTACT (up to +/-2 in BIGSEARCH mode, +/-1 otherwise). It might be useful to have generic routines using this method, which could then be plugged into other programs.
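The difference between the two kinds of search can be sketched as follows. The first function tries a fixed range of lattice translations, as described above for AREAIMOL and CONTACT. The second derives the range of translations for each axis from the cutoff itself, working in fractional coordinates. I have not checked this against the ACT source, so it should be read as one possible way of removing the fixed limit rather than as ACT's actual algorithm. To keep the arithmetic short the sketch assumes an orthorhombic cell (all angles 90 degrees); the function names are hypothetical.

```python
import math
from itertools import product


def _distance(frac_a, frac_b, t, cell):
    # Orthorhombic assumption: each fractional difference scales by its
    # own cell edge and the axes are mutually perpendicular.
    return math.sqrt(sum(((fb + ti - fa) * edge) ** 2
                         for fa, fb, ti, edge in zip(frac_a, frac_b, t, cell)))


def contacts_fixed_range(frac_a, frac_b, cell, cutoff, n):
    """Try every lattice translation in [-n, n] along each axis."""
    hits = []
    for t in product(range(-n, n + 1), repeat=3):
        d = _distance(frac_a, frac_b, t, cell)
        if d <= cutoff:
            hits.append((t, d))
    return hits


def contacts_any_translation(frac_a, frac_b, cell, cutoff):
    """Derive the translation range for each axis from the cutoff."""
    ranges = []
    for fa, fb, edge in zip(frac_a, frac_b, cell):
        delta = fb - fa
        # A contact needs |delta + t| * edge <= cutoff along this axis.
        lo = math.ceil(-cutoff / edge - delta)
        hi = math.floor(cutoff / edge - delta)
        ranges.append(range(lo, hi + 1))
    hits = []
    for t in product(*ranges):
        d = _distance(frac_a, frac_b, t, cell)
        if d <= cutoff:
            hits.append((t, d))
    return hits


cell = (10.0, 10.0, 10.0)         # cell edges a, b, c in Angstroms
atom = (0.05, 0.5, 0.5)           # fractional coordinates
sym_copy = (3.95, 0.5, 0.5)       # symmetry copy lying several cells away

plus_minus_1 = contacts_fixed_range(atom, sym_copy, cell, 3.0, 1)
plus_minus_2 = contacts_fixed_range(atom, sym_copy, cell, 3.0, 2)
unlimited = contacts_any_translation(atom, sym_copy, cell, 3.0)
```

In this example the symmetry copy needs a translation of -4 cells along a to come into contact, so both fixed-range searches return nothing while the cutoff-derived search finds the single contact at translation (-4, 0, 0). For a non-orthorhombic cell the per-axis bounds would have to be worked out from the full cell geometry.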

User interface standardisation

ccp4i offers the possibility of rationalising the way that users interact with the programs without changing the program input. A single task interface ("Analyse model"?) could be used to run several different programs depending on the choices made by the user in the protocol folder. Things like atom selection could be standardised at the level of the interface, with behind-the-scenes processing to convert to the syntax used by the required program.

Designing such an interface is a question of examining the available options presented by the suite, initially disregarding which program they come from, and presenting these options to the user in a sensible manner.