These are just some random thoughts. If you can understand them,
then please comment.
Input/output modes
At least 4 cases:
- Read only, e.g. act, protin, sfall.
Use ccif_load_cif only. "Get" what you want, advancing context
until last packet, and close file.
- Write only, e.g. makedict (in some modes), peakmax.
Use ccif_new_cif only. "Put" with istat = append_row.
- Read and write, making minor changes to existing rows,
e.g. pdbset (in some modes), refmac.
Use ccif_load_cif and ccif_new_cif with input file specified.
"Get" from CIFOUT (copied from CIFIN) and "put" only those
data items that are changed.
- Read and write where packets need to be added, deleted or
re-ordered, e.g. gensym, hgen, sortwater.
Use ccif_load_cif and ccif_new_cif with input file specified (always
want to avoid losing information, even if we're completely re-writing
ATOM_SITE data). "Get" information from CIFIN, and re-write for
CIFOUT. Need to sort and renumber. How do we delete the old stuff?
Adventures with ACT
ACT is a simple program which reads data from a coordinate file in one
go (s/r getxyz) and then produces statistics from that data. I have
modified it so that it will read an mmCIF file. I have run the original
PDB version and the new mmCIF version on HUMAN RHINOVIRUS 16 COAT PROTEIN
(6980 atoms): both PDB and mmCIF files were taken from the PDB, code 1AYM.
Observed differences:
- There are 4 chains. The PDB file numbers the residues from 1 for
each chain. This becomes _atom_site.auth_seq_id in the mmCIF
file, while _atom_site.label_seq_id numbers the full set of
residues sequentially.
- The old PDB version includes the alternate location indicator
as part of the atom name, and thus fails to recognise some atoms as
being main chain. This is not an issue with the mmCIF version.
- And the bad news on timing:
< Times: User: 28.9s System: 0.4s Elapsed: 1:35
> Times: User: 7.4s System: 0.1s Elapsed: 0:17
Yes, the first one is mmCIF.
I subsequently inserted calls to GETELAPSED to see where the time
was being spent. The values for "User" time were as follows:
After loading dictionary: 0.1s
After loading XYZIN: 2.9s
After reading coordinate data into ACT: 23.6s
Final: 28.7s
Note that these values are cumulative. So the culprit seems to be
reading the data, i.e. the repeated calls to GETATOMINFO in
s/r GETXYZ.
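Since the values are cumulative, subtracting each checkpoint from the next gives the time spent in each stage. A minimal sketch of that arithmetic, using the figures quoted above (the stage labels and the function name are mine, chosen for illustration):

```python
# Cumulative "User" times in seconds, copied from the GETELAPSED figures above.
# The label for the last stage is an illustrative name for whatever ACT does
# between reading the coordinates and finishing.
CUMULATIVE = [
    ("loading dictionary", 0.1),
    ("loading XYZIN", 2.9),
    ("reading coordinate data into ACT", 23.6),
    ("remaining processing", 28.7),
]


def stage_increments(checkpoints):
    """Turn cumulative checkpoint times into per-stage durations."""
    increments = []
    previous = 0.0
    for label, total in checkpoints:
        # Round to one decimal place, the precision of the quoted figures.
        increments.append((label, round(total - previous, 1)))
        previous = total
    return increments


increments = stage_increments(CUMULATIVE)
for label, seconds in increments:
    print(f"{label:35s} {seconds:5.1f}s")
```

On these numbers the coordinate-reading stage accounts for 20.7s of the 28.7s total.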
Now the cciflib.f routine GETATOMINFO ultimately
calls libccif routines ccif_get_real_esd, etc.
I next replaced these calls with simple assignments, e.g. setting
occupancy = 1.0 instead of reading it. I did this first for the
occupancies, then in addition for the B values, and then also for the
coordinates. The results:
                                         original  -occ   -B     -xyz
After loading dictionary:                0.1s      0.1    0.1    0.1
After loading XYZIN:                     2.9s      2.9    2.9    2.9
After reading coordinate data into ACT:  23.6s     20.6   17.3   6.1
Final:                                   28.7s     25.6   22.3   crashed!
Thus, it seems to be the calls to ccif_get_real_esd, etc.
that are taking the time.
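To put rough numbers on this, the saving from each successive substitution can be read off the "After reading coordinate data into ACT" row. The sketch below does that subtraction and divides by the 6980 atoms of the test case; the names are illustrative, and the per-atom figures are only as good as the one-decimal timings they come from.

```python
N_ATOMS = 6980  # atom count quoted for the 1AYM test case

# Cumulative "User" time (s) after reading coordinate data into ACT.
# Each later entry also includes the substitutions made before it.
READ_TIMES = [("original", 23.6), ("-occ", 20.6), ("-B", 17.3), ("-xyz", 6.1)]


def savings_per_step(times):
    """Seconds saved by each successive substitution."""
    saved = {}
    for (_, before), (label, after) in zip(times, times[1:]):
        saved[label] = round(before - after, 1)
    return saved


saved = savings_per_step(READ_TIMES)
for label, seconds in saved.items():
    per_atom_ms = 1000.0 * seconds / N_ATOMS
    print(f"{label:5s} saved {seconds:4.1f}s, about {per_atom_ms:.2f} ms per atom")
```

If x, y and z are each fetched by a separate call (an assumption, not something measured here), the coordinate saving of 11.2s works out at about 3.7s per value, in line with the occupancy and B value savings.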
Thoughts on limited use of full mmCIF dictionary
mmCIF data files used with CCP4 programs may contain any data items.
However, CCP4 library routines will only utilise a small subset of
data items from the full mmCIF dictionary. All other data items will
simply be copied from input to output.
The following categories cover the information typically held
in CCP4-PDB files:
- CELL
- cell dimensions (replacing CRYST1 card).
- SYMMETRY
- spacegroup name or number (not normally included in CCP4-PDB files).
- ATOM_SITES
- cell transformations (replacing SCALEx cards).
- ATOM_SITE
- atom site information (replacing ATOM,HETATM,ANISOU cards).
Note:
- ATOM_SITES_ALT, ATOM_SITES_ALT_ENS, ATOM_SITES_ALT_GEN
- pointed to by _atom_site.label_alt_id and gives more information
on alternative conformations: _atom_site.label_alt_id is sufficient
for programs in their current form.
- ATOM_SITE_ANISOTROP
- this contains alternate_exclusive data items to those in
category ATOM_SITE: in general, it is simpler to use the latter.
However, when there is anisotropic U data for only a small subset
of atoms, e.g. for metal ions only, then it might be more convenient
to use a separate category.
In addition, we're thinking of working with the following categories:
- ENTITY
- define polymer/non-polymer/water entities.
- ENTITY_POLY_SEQ
- sequence information. Ideally, this should correspond to the
sequence in the ATOM_SITE category, although there are exceptions,
e.g. if the latter describes a temporary poly-ALA model.
- STRUCT_ASYM
- describes contents of asymmetric unit.
- STRUCT_CONN
- describes disulphides, salt bridges and hydrogen bonds. The first
would be useful for protin.
The following properties of the data items need to be considered:
Mandatory
The following items are mandatory (_item.mandatory_code):
- _cell.entry_id
- _symmetry.entry_id
- _atom_site.id
- _atom_site.type_symbol
That is, they are required if the category is present. However, the
categories themselves are not mandatory (are any?).
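The rule "required if the category is present" can be sketched as a small check. This is not how libccif does it; the in-memory layout (a dict mapping category name to a list of row dicts) and the function name are assumptions made purely for illustration.

```python
# Mandatory items per category, taken from the list above.
MANDATORY = {
    "cell": ["entry_id"],
    "symmetry": ["entry_id"],
    "atom_site": ["id", "type_symbol"],
}


def missing_mandatory(data):
    """Return (category, row number, item) for each absent mandatory item.

    `data` is a hypothetical layout: {category: [row_dict, ...]}.
    An absent category produces no complaints, since the items are only
    required when the category itself is present.
    """
    problems = []
    for category, items in MANDATORY.items():
        for row_number, row in enumerate(data.get(category, []), start=1):
            for item in items:
                if item not in row:
                    problems.append((category, row_number, item))
    return problems


example = {
    "cell": [{"entry_id": "1AYM"}],
    # No SYMMETRY category at all: allowed.
    "atom_site": [
        {"id": "1", "type_symbol": "N"},
        {"id": "2"},  # type_symbol missing
    ],
}
problems = missing_mandatory(example)
print(problems)
```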
Dependent
There are several dependent groups, e.g.:
- _cell.angle_alpha, _cell.angle_beta and _cell.angle_gamma
But I think these dependent groups are always confined to one category?
Keys
The key items are:
- _cell.entry_id
- _symmetry.entry_id
- _atom_site.id
which of course are mandatory.
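A key item must take a distinct value in every row of its category. A minimal sketch of a uniqueness check, again assuming a hypothetical layout of {category: [row_dict, ...]} rather than any real libccif structure:

```python
from collections import Counter

# Key item for each category, taken from the list above.
KEY_ITEMS = {"cell": "entry_id", "symmetry": "entry_id", "atom_site": "id"}


def duplicate_keys(data):
    """Return {category: [repeated key values]} for categories with clashes."""
    duplicates = {}
    for category, key in KEY_ITEMS.items():
        counts = Counter(
            row[key] for row in data.get(category, []) if key in row
        )
        repeated = sorted(value for value, n in counts.items() if n > 1)
        if repeated:
            duplicates[category] = repeated
    return duplicates


example = {
    "cell": [{"entry_id": "1AYM"}],
    "atom_site": [{"id": "1"}, {"id": "2"}, {"id": "2"}],
}
clashes = duplicate_keys(example)
print(clashes)
```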
Related
Parent-child
_cell.entry_id and _symmetry.entry_id are children of _entry.id.
That is, the former data items are pointers to _entry.id and should
have identical values in the CIF file.
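The parent-child constraint amounts to checking that every child value also appears as a parent value. A sketch under the same assumed layout ({category: [row_dict, ...]}); the function name is hypothetical, and the comparison is a plain case-sensitive string match, which may be stricter than a real validator would want.

```python
# Child items that point at _entry.id, as listed above.
CHILD_ITEMS = [("cell", "entry_id"), ("symmetry", "entry_id")]


def orphaned_children(data):
    """Return (category, item, value) for child values with no matching _entry.id."""
    parent_ids = {row["id"] for row in data.get("entry", []) if "id" in row}
    problems = []
    for category, item in CHILD_ITEMS:
        for row in data.get(category, []):
            value = row.get(item)
            if value is not None and value not in parent_ids:
                problems.append((category, item, value))
    return problems


example = {
    "entry": [{"id": "1AYM"}],
    "cell": [{"entry_id": "1AYM"}],
    "symmetry": [{"entry_id": "1aym"}],  # differs from _entry.id in case
}
problems = orphaned_children(example)
print(problems)
```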
Other programs
General
The IUCr pages list several CIF libraries and validation tools. In
particular, there are plenty of syntax checkers, but none of them seem
to check the semantics.
cif2cif
cif2cif by Herb Bernstein copies an input CIF to the output,
doing various checks and modifications. If a dictionary is specified,
it will check the CIF file against that. In particular, it identifies:
- data items not in the dictionary.
- data item values which are not the correct type, e.g. character
instead of number.
- missing semi-colons.
However, it does NOT complain about:
- missing mandatory data items.
- a missing member of a dependent group.
- non-unique key items.
- child data items which do not agree with the parent data item.