These are just some random thoughts. If you can understand them, then please comment.

Input/output modes

At least 4 cases:
  1. Read only, e.g. act, protin, sfall.
    Use ccif_load_cif only. "Get" what you want, advancing context until last packet, and close file.
  2. Write only, e.g. makedict (in some modes), peakmax.
    Use ccif_new_cif only. "Put" with istat = append_row.
  3. Read and write, making minor changes to existing rows, e.g. pdbset (in some modes), refmac.
    Use ccif_load_cif and ccif_new_cif with input file specified. "Get" from CIFOUT (copied from CIFIN) and "put" only those data items that are changed.
  4. Read and write where packets need to be added, deleted or re-ordered, e.g. gensym, hgen, sortwater.
    Use ccif_load_cif and ccif_new_cif with input file specified (we always want to avoid losing information, even if we're completely re-writing ATOM_SITE data). "Get" information from CIFIN, and re-write it for CIFOUT. Need to sort and renumber. How do we delete the old rows?
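
As a way of picturing case 4, here is a minimal sketch that holds the rows in memory, drops the unwanted ones, sorts the rest and renumbers them before they would be written to CIFOUT. It uses plain Python dictionaries rather than the libccif routines; the function name rewrite_rows and the row layout are hypothetical, and it says nothing about how libccif itself would delete packets.

```python
def rewrite_rows(rows, keep, sort_key):
    """Filter, sort and renumber a list of row dicts (hypothetical layout)."""
    # Copy each kept row so the input rows (the "CIFIN" side) are untouched.
    kept = [dict(row) for row in rows if keep(row)]
    kept.sort(key=sort_key)
    # Renumber from 1 in the new order.
    for new_id, row in enumerate(kept, start=1):
        row["id"] = new_id
    return kept


# Made-up rows: keep only waters, sorted by B value (roughly what a
# sortwater-like program might want).
rows = [
    {"id": 1, "comp_id": "HOH", "b_iso": 45.0},
    {"id": 2, "comp_id": "ALA", "b_iso": 12.0},
    {"id": 3, "comp_id": "HOH", "b_iso": 20.0},
]
waters = rewrite_rows(
    rows,
    keep=lambda row: row["comp_id"] == "HOH",
    sort_key=lambda row: row["b_iso"],
)
```

In this picture "deleting" is simply not copying a row across, which avoids the question of removing rows in place.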

Adventures with ACT

ACT is a simple program which reads data from a coordinate file in one go (s/r getxyz) and then produces statistics from that data. I have modified it so that it will read an mmCIF file. I have run the original PDB version and the new mmCIF version on HUMAN RHINOVIRUS 16 COAT PROTEIN (6980 atoms): both PDB and mmCIF files were taken from the PDB, code 1AYM. Observed differences:
  1. There are 4 chains. The PDB file numbers the residues from 1 for each chain. This becomes _atom_site.auth_seq_id in the mmCIF file, while _atom_site.label_seq_id numbers the full set of residues sequentially.
  2. The old PDB version includes the alternate location indicator as part of the atom name, and thus fails to recognise some atoms as being main chain. This is not an issue with the mmCIF version.
  3. And the bad news on timing:

    <  Times: User:      28.9s System:    0.4s Elapsed:    1:35
    >  Times: User:       7.4s System:    0.1s Elapsed:    0:17

    Yes, the first one is mmCIF.

I subsequently inserted calls to GETELAPSED to see where the time was being spent. The values for "User" time were as follows:


After loading dictionary:                   0.1s
After loading XYZIN:                        2.9s
After reading coordinate data into ACT:    23.6s
Final:                                     28.7s

Note that these values are cumulative. So the culprit seems to be reading the data, i.e. the repeated calls to GETATOMINFO in s/r GETXYZ.
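
Because the values are cumulative, the time spent in each stage is the difference between successive entries. A small sketch of that arithmetic, using the figures above (the label for the last stage is my own wording for "everything between reading the data and Final"):

```python
# Cumulative "User" times in seconds, as reported above.
cumulative = [
    ("loading dictionary", 0.1),
    ("loading XYZIN", 2.9),
    ("reading coordinate data into ACT", 23.6),
    ("rest of run (up to Final)", 28.7),
]

per_stage = {}
previous = 0.0
for label, total in cumulative:
    # Time for this stage alone = cumulative total minus the previous total.
    per_stage[label] = round(total - previous, 1)
    previous = total

for label, seconds in per_stage.items():
    print(f"{label:35s} {seconds:5.1f}s")
```

The reading stage alone accounts for 20.7s of the 28.7s total.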

Now the cciflib.f routine GETATOMINFO ultimately calls libccif routines ccif_get_real_esd, etc. I next replaced these calls with simple assignments, e.g. setting occupancy = 1.0 instead of reading it. I did this first for the occupancies, then in addition for the B values, and then also for the coordinates. The results:


                                            orig     -occ       -B     -xyz

After loading dictionary:                   0.1s     0.1s     0.1s     0.1s
After loading XYZIN:                        2.9s     2.9s     2.9s     2.9s
After reading coordinate data into ACT:    23.6s    20.6s    17.3s     6.1s
Final:                                     28.7s    25.6s    22.3s crashed!

Thus, it seems to be the calls to ccif_get_real_esd, etc. that are taking the time.
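
To put rough numbers on this, the sketch below works out how much of the reading stage disappears at each step, and what that means per atom. It only re-does arithmetic on the figures above; the per-atom values assume the saving is spread evenly over the 6980 atoms, which is my simplification.

```python
ATOMS = 6980          # number of atoms in the test structure
XYZIN_LOADED = 2.9    # cumulative time just before reading starts

# Cumulative time after reading coordinate data, for each successive change.
after_read = {"orig": 23.6, "-occ": 20.6, "-B": 17.3, "-xyz": 6.1}

# Time spent in the reading stage itself for each run.
read_time = {name: round(t - XYZIN_LOADED, 1) for name, t in after_read.items()}

# Saving from each additional replacement, relative to the previous run.
names = list(after_read)
saved = {
    names[i]: round(after_read[names[i - 1]] - after_read[names[i]], 1)
    for i in range(1, len(names))
}

# Rough cost per atom in milliseconds (even-spread assumption).
ms_per_atom = {name: 1000.0 * s / ATOMS for name, s in saved.items()}

for name in saved:
    print(f"{name:5s} saved {saved[name]:5.1f}s  (~{ms_per_atom[name]:.2f} ms/atom)")
```

On these figures the reading stage drops from 20.7s to 3.2s once all three sets of calls are replaced, with the coordinates (three values per atom) accounting for the largest share.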

Thoughts on limited use of full mmCIF dictionary

mmCIF data files used with CCP4 programs may contain any data items. However, CCP4 library routines will only utilise a small subset of data items from the full mmCIF dictionary. All other data items will simply be copied from input to output.

The following categories cover the information typically held in CCP4-PDB files:

CELL
cell dimensions (replacing CRYST1 card).
SYMMETRY
spacegroup name or number (not normally included in CCP4-PDB files).
ATOM_SITES
cell transformations (replacing SCALEx cards).
ATOM_SITE
atom site information (replacing ATOM,HETATM,ANISOU cards).
Note:
ATOM_SITES_ALT, ATOM_SITES_ALT_ENS, ATOM_SITES_ALT_GEN
these are pointed to by _atom_site.label_alt_id and give more information on alternative conformations; _atom_site.label_alt_id alone is sufficient for programs in their current form.
ATOM_SITE_ANISOTROP
this contains data items that are alternate_exclusive to those in category ATOM_SITE; in general, it is simpler to use the latter. However, when there is anisotropic U data for only a small subset of atoms, e.g. for metal ions only, it might be more convenient to use the separate category.

In addition, we're thinking of working with the following categories:

ENTITY
define polymer/non-polymer/water entities.
ENTITY_POLY_SEQ
sequence information. Ideally, this should correspond to the sequence in the ATOM_SITE category, although there are exceptions, e.g. if the latter describes a temporary poly-ALA model.
STRUCT_ASYM
describes contents of asymmetric unit.
STRUCT_CONN
describes disulphides, salt bridges and hydrogen bonds. The first would be useful for protin.

The following properties of the data items need to be considered:

Mandatory

The following items are mandatory (_item.mandatory_code): That is, they are required if the category is present. However, the categories themselves are not mandatory (are any?).

Dependent

There are several dependent groups, e.g.: But I think these dependent groups are always confined to one category?

Keys

The key items are: which of course are mandatory.

Related

Parent-child

_cell.entry_id and _symmetry.entry_id are children of _entry.id That is, the former data items are pointers to _entry.id and should have identical values in the CIF file.
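
A check of this parent-child relationship could look something like the sketch below. It assumes the data items have already been read into a plain dictionary of name/value pairs; the function name check_entry_ids and the example values are hypothetical, and no CIF library is involved.

```python
def check_entry_ids(items):
    """Return the child items whose value differs from _entry.id."""
    parent = items.get("_entry.id")
    children = ("_cell.entry_id", "_symmetry.entry_id")
    # Only children that are actually present are compared with the parent.
    return [
        name for name in children
        if name in items and items[name] != parent
    ]


# Made-up example: _symmetry.entry_id does not match its parent.
items = {
    "_entry.id": "1AYM",
    "_cell.entry_id": "1AYM",
    "_symmetry.entry_id": "XXXX",
}
mismatches = check_entry_ids(items)
print(mismatches)
```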


Other programs

General

The IUCr pages list several CIF libraries and validation tools. In particular, there are plenty of syntax checkers, but none seem to check the semantics.

cif2cif

cif2cif by Herb Bernstein copies an input CIF to the output, doing various checks and modifications. If a dictionary is specified, it will check the CIF file against that. In particular, it identifies:
  1. data items not in the dictionary.
  2. data item values which are not the correct type, e.g. character instead of number.
  3. missing semi-colons.
However, it does NOT complain about:
  1. missing mandatory data items.
  2. a missing member of a dependent group.
  3. non-unique key items.
  4. child data items which do not agree with the parent data item.
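
Two of the missing checks are easy to sketch once a category is held as a list of rows. The code below is only an illustration on made-up rows: the function names are hypothetical, the list of mandatory items is supplied by the caller rather than read from the dictionary, and _atom_site.id is used as the key item for the example.

```python
from collections import Counter


def missing_mandatory(row, mandatory):
    """Return the mandatory item names that are absent from one row."""
    return [name for name in mandatory if name not in row]


def duplicate_keys(rows, key_items):
    """Return the key values that occur in more than one row."""
    counts = Counter(tuple(row[k] for k in key_items) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# Made-up rows: the third row repeats id "2" and lacks label_atom_id.
rows = [
    {"_atom_site.id": "1", "_atom_site.label_atom_id": "N"},
    {"_atom_site.id": "2", "_atom_site.label_atom_id": "CA"},
    {"_atom_site.id": "2"},
]
mandatory = ["_atom_site.id", "_atom_site.label_atom_id"]

missing = [missing_mandatory(row, mandatory) for row in rows]
repeated = duplicate_keys(rows, ["_atom_site.id"])
print(missing)
print(repeated)
```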
