A new C language library for reading, writing and manipulating MTZ files

Martyn Winn

June 2001

Background

The CCP4 software suite has used the MTZ data format for holding reflection data for many years. One of the main advantages of the MTZ format is the ability to hold all relevant data (intensities, structure factor amplitudes, phases, free R flags, etc.) together in one file. Each set of reflection data is held as a column in a table, and each column is identified and accessed via a label.

This arrangement has worked well, but ignores the fact that certain columns of data go naturally together. A calculated phase belongs with the corresponding calculated structure factor; the structure factors for different wavelengths of a MAD experiment are used together; the structure factors of a heavy-atom derivative are used to estimate phases for a native structure factor. In short, a set of columns does not describe the relationships implicit in the file.

Thus, the concepts of "project" and "dataset" were invented. A project contains all data used to create a model of the target structure. A dataset is the result of a single experiment contributing to the determination of the model (which may be represented by several columns of data). All columns in an MTZ file are assigned to a particular dataset in a particular project. This information was initially used for Data Harvesting, but has since been used in other contexts where inference of relationships between columns is useful (see my Newsletter article).

However, it has since been appreciated that this is not a complete data model of the reflection data. Such a model is described in the next section, and is based on ideas of Kevin Cowtan (see Kevin's page), Airlie McCoy and others.

Data model

Reflection data is still held in a table format, but columns of data are arranged in a hierarchical manner. From the top down, the hierarchy is:

File -> Crystal -> Dataset -> Datalist -> Column

A "Crystal" is essentially a single crystal form: usually there will be one crystal per derivative, unless a single derivative can crystallise in several cells (e.g. RT and frozen). A "Dataset" is a set of observations on a crystal. If data is collected at several wavelengths, each of these becomes a separate dataset. A "Datalist" is a grouping of associated columns. Thus a single list will hold both F and SigF. Another list holds all four Hendrickson-Lattman coefficients.

Each data list is linked to one of the datasets and each dataset is linked to one of the crystals. There may be several data lists per dataset and several datasets per crystal. Some data lists may be linked to the base dataset (which is just a placeholder) and thus to the base crystal. This will be the case for synthetic data and data types such as the FreeRflag.

The project is now simply an attribute of the crystal. It will still be used for the purposes of Data Harvesting, but does not form part of the hierarchy.
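The hierarchy described above can be sketched as a set of linked C structures. This is only an illustration of the data model: all type, field and function names below are hypothetical and are not the definitions found in mtzdata.h. The project appears as an attribute of the crystal rather than as a level of its own.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration; NOT the definitions in mtzdata.h. */
typedef struct SketchCrystal SketchCrystal;
typedef struct SketchDataset SketchDataset;
typedef struct SketchList    SketchList;

typedef struct SketchColumn {
    const char  *label;   /* column label, e.g. "F" or "SigF" */
    const float *values;  /* one value per reflection; may be NULL in this sketch */
    SketchList  *list;    /* parent data list */
} SketchColumn;

struct SketchList {
    const char    *name;  /* grouping of associated columns */
    SketchColumn  *cols;
    size_t         ncols;
    SketchDataset *set;   /* parent dataset */
};

struct SketchDataset {
    const char    *name;
    float          wavelength;
    SketchList    *lists;
    size_t         nlists;
    SketchCrystal *xtal;  /* parent crystal */
};

struct SketchCrystal {
    const char    *name;
    const char    *project; /* an attribute, not a level of the hierarchy */
    float          cell[6];
    SketchDataset *sets;
    size_t         nsets;
};

/* Walk upwards from a column to its crystal; NULL if a link is missing. */
static const SketchCrystal *sketch_column_crystal(const SketchColumn *col)
{
    assert(col != NULL);
    if (col->list == NULL || col->list->set == NULL)
        return NULL;
    return col->list->set->xtal;
}

/* Walk downwards and count every column held under one crystal. */
static size_t sketch_count_columns(const SketchCrystal *xtal)
{
    size_t n = 0;
    assert(xtal != NULL);
    for (size_t i = 0; i < xtal->nsets; ++i)
        for (size_t j = 0; j < xtal->sets[i].nlists; ++j)
            n += xtal->sets[i].lists[j].ncols;
    return n;
}
```

Holding both parent pointers and child arrays makes it cheap to go from a column to its cell parameters and from a crystal to all of its columns, at the cost of keeping the two sets of links consistent.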

Representation in MTZ file

An MTZ file consists of a table of real data, together with an ASCII header consisting of keyworded records. For the sake of maintaining backwards compatibility, this format has been retained. In particular, the above data model is described by a set of keyworded records in the MTZ header, from which the in-memory model can be constructed.

The relevant records are:

NDIF <ndatasets>
    The total number of datasets held in the file.

with the following recorded for each dataset:

PROJECT <dataset id> <project name>
    The project to which <dataset id> belongs.
CRYSTAL <dataset id> <crystal name>
    The crystal to which <dataset id> belongs.
DATASET <dataset id> <dataset name>
    The name of <dataset id>.
DCELL <dataset id> <cell parameters>
    The cell parameters of the crystal.
DWAVEL <dataset id> <radiation wavelength>
    The wavelength of the radiation used for this dataset.

Columns and batches contain pointers to their parent datasets. The Datalist grouping is inferred from the column names.

This is clearly not a perfect representation of the data model. Some information is duplicated: for example, all datasets belonging to the same crystal must have the same DCELL parameters recorded. However, it represents a simple extension of the existing MTZ format, and thus can be implemented without breaking existing software or rendering existing data obsolete. All keywords except CRYSTAL are already implemented in CCP4 4.1.
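As an illustration of how the per-dataset records might be turned back into an in-memory description, here is a minimal sketch of a record reader. It is not the code in mtzlib.c: the structure and function names are hypothetical, and it assumes that records are whitespace-separated, that names contain no spaces and are at most 64 characters long, and that the cell is given as six numbers.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the header information of one dataset. */
typedef struct {
    int   id;
    char  project[65];
    char  crystal[65];
    char  dataset[65];
    float cell[6];
    float wavelength;
} SketchDatasetInfo;

/* Apply one header record to info if it is a per-dataset record for
   info->id. Returns 1 if the record was used, 0 otherwise. */
static int sketch_apply_record(const char *line, SketchDatasetInfo *info)
{
    char  key[16];
    char  name[65];
    float v[6];
    int   id;

    assert(line != NULL && info != NULL);
    if (sscanf(line, "%15s %d", key, &id) != 2 || id != info->id)
        return 0;

    if (strcmp(key, "DCELL") == 0) {
        if (sscanf(line, "%*s %*d %f %f %f %f %f %f",
                   &v[0], &v[1], &v[2], &v[3], &v[4], &v[5]) != 6)
            return 0;
        memcpy(info->cell, v, sizeof v);
        return 1;
    }
    if (strcmp(key, "DWAVEL") == 0) {
        if (sscanf(line, "%*s %*d %f", &v[0]) != 1)
            return 0;
        info->wavelength = v[0];
        return 1;
    }

    /* The remaining records carry a single name after the dataset id. */
    if (sscanf(line, "%*s %*d %64s", name) != 1)
        return 0;
    if (strcmp(key, "PROJECT") == 0)
        strcpy(info->project, name);
    else if (strcmp(key, "CRYSTAL") == 0)
        strcpy(info->crystal, name);
    else if (strcmp(key, "DATASET") == 0)
        strcpy(info->dataset, name);
    else
        return 0;
    return 1;
}
```

A caller would set `info.id` and pass every header line through the function. Because the cell is stored per dataset in the file, a fuller reader would also have to check that datasets sharing a crystal agree on their DCELL values.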

C function library

I have written a C function library to read and write these extended MTZ files, and to manipulate a data structure representing the above data model. Many functions are derived from Jan-Pieter Abrahams' solomon code, though they have been substantially altered (and therefore any problems are of my creation).

Available files are:


mtzdata.h        defines basic MTZ data structure
mtzlib.c         C library for MTZ i/o and manipulating MTZ data structure
library.c        library.c for C programs (made robust by Charles Ballard)
symlib.c         C versions of symfr3 and symtr3
ccplib.c         C versions of ccperr, ccpfyp, ccprcs (mainly from Pete Briggs)
cparser.c        C version of parser (from Pete Briggs)

In more detail, mtzlib.c contains the following functions:

MTZ i/o


MTZ *MtzGet
void CmtzRrefl
void MtzPut
void CmtzWhdrLine
void CmtzWrefl

Memory allocation


MTZ *MtzMalloc
void MtzFree
MTZCOL *MtzMallocCol
void MtzFreeCol
MTZBAT *MtzMallocBatch
void MtzFreeBatch
char *MtzCallocHist
void MtzFreeHist

Header operations

 
int MtzNbat

Crystal operations


MTZXTAL *MtzAddXtal
char *MtzXtalPath
MTZXTAL *MtzXtalLookup

Dataset operations


MTZSET *MtzAddDataset
int MtzNset
MTZXTAL *MtzSetXtal
char *MtzSetPath
MTZSET *MtzSetLookup

Column operations


MTZCOL *MtzAddColumn
MTZSET  *MtzColSet
int MtzNcol
char *MtzColPath
void MtzRJustPath
int MtzPathMatch
MTZCOL *MtzColLookup
int MtzListColumn

Helper functions


float ind2reso
void hklcoeffs
void MtzArrayToBatch
void MtzBatchToArray

Pseudo-mtzlib routines


void clrtitl
int clrhist
void clrinfo
void clrsort
void clrbats
void clrcell
void clrrsol
void clrsymi
void clrsymm
int MtzParseLabin
MTZCOL *clrassn
void clridx
int clrrefl
int clrreff
int ccp4_ismnf
void clhprt
void clhprt_adv
void clrbat
void clwtitl
void clwsort
int clwhist
void clwassn
void clwidx
void clwidas
void clwrefl
void clwbat
void clwbsetid

Design Issues

Sorting

Reflections can be sorted on up to 5 columns, and this sort order is held in the MTZ file header. However, this facility does not seem to be used.
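For reference, sorting a reflection table on up to 5 columns can be sketched with the standard `qsort`. This is an illustration only and not taken from mtzlib.c: the function names are hypothetical, the table is assumed to be a contiguous row-major array of floats, and the sort keys are assumed to be 0-based column indices.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_MAX_SORT 5

/* qsort() passes no context to the comparator, so this sketch keeps the
   sort keys at file scope (and is therefore not re-entrant). */
static int sketch_nkeys;
static int sketch_keys[SKETCH_MAX_SORT];

/* Compare two rows key by key; later keys only break ties. */
static int sketch_cmp_rows(const void *a, const void *b)
{
    const float *ra = (const float *)a;
    const float *rb = (const float *)b;
    for (int k = 0; k < sketch_nkeys; ++k) {
        float x = ra[sketch_keys[k]];
        float y = rb[sketch_keys[k]];
        if (x < y) return -1;
        if (x > y) return 1;
    }
    return 0;
}

/* Sort nrows rows of ncols floats in place on up to 5 key columns.
   Returns 0 on success, -1 on invalid arguments. */
static int sketch_sort_reflections(float *table, size_t nrows, int ncols,
                                   const int *keys, int nkeys)
{
    if (table == NULL || keys == NULL || ncols <= 0 ||
        nkeys < 1 || nkeys > SKETCH_MAX_SORT)
        return -1;
    for (int k = 0; k < nkeys; ++k) {
        if (keys[k] < 0 || keys[k] >= ncols)
            return -1;
        sketch_keys[k] = keys[k];
    }
    sketch_nkeys = nkeys;
    qsort(table, nrows, (size_t)ncols * sizeof(float), sketch_cmp_rows);
    return 0;
}
```

The sketch makes no special provision for missing-number flags; a real implementation would need to decide where such values sort.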

Identifying columns

Column labels can now be qualified by a path, so that an assignment such as LABIN FP=NATIVE/FTOXD3 SIGFP=NATIVE/SIGFTOXD3 is now possible.
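One way such a qualified label could be resolved is to compare it, component by component from the right, against the full path of each column. The sketch below shows that idea only. The function name is hypothetical, the '/' separator is taken from the example above, and the matching rule is an assumption rather than the documented behaviour of MtzPathMatch.

```c
#include <assert.h>
#include <string.h>

/* Return 1 if 'partial' names the trailing component(s) of 'full',
   e.g. "NATIVE/FTOXD3" against "/xtal/NATIVE/FTOXD3"; otherwise 0.
   Hypothetical matching rule, for illustration only. */
static int sketch_path_match(const char *full, const char *partial)
{
    size_t nf, np;

    assert(full != NULL && partial != NULL);
    while (*partial == '/')          /* ignore leading separators in the query */
        ++partial;
    nf = strlen(full);
    np = strlen(partial);
    if (np == 0 || np > nf)
        return 0;
    if (strcmp(full + (nf - np), partial) != 0)
        return 0;
    /* The match must begin on a component boundary, so that "FTOXD3"
       does not match a column called "SIGFTOXD3". */
    return nf == np || full[nf - np - 1] == '/';
}
```

With this rule a bare label still works as before, while a longer path is only needed when the same label occurs under more than one parent.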

Fortran API

I have also written a Fortran API, "mtzlib_f.c", which is essentially a wrapper around the functions in mtzlib.c and mimics the existing mtzlib.f (routines are temporarily prefixed by "F_C" to avoid clashes). It should be possible to migrate existing Fortran programs to this API with few or no changes. It is expected, however, that future applications will use the core C functions, which give better access to the data structure.
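To show what such a wrapper typically has to do, here is a minimal sketch of a C routine written to be callable from Fortran. It is not taken from mtzlib_f.c. It assumes a common but compiler-dependent calling convention: a lowercase name with a trailing underscore, arguments passed by reference, and the length of each CHARACTER argument passed by value as a hidden `int` at the end of the argument list. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy a blank-padded Fortran CHARACTER argument into a NUL-terminated
   C string. Returns the trimmed length, or -1 if it does not fit. */
static int sketch_fstr_to_c(const char *fstr, int flen, char *out, size_t outsize)
{
    int n = flen;

    assert(fstr != NULL && out != NULL);
    if (flen < 0)
        return -1;
    while (n > 0 && fstr[n - 1] == ' ')   /* strip Fortran blank padding */
        --n;
    if ((size_t)n + 1 > outsize)
        return -1;
    memcpy(out, fstr, (size_t)n);
    out[n] = '\0';
    return n;
}

/* Hypothetical stand-in for a core C library function. */
static int sketch_label_is_known(const char *label)
{
    return strcmp(label, "FP") == 0 || strcmp(label, "SIGFP") == 0;
}

/* Hypothetical Fortran-callable wrapper, as for CALL SKETCH_KNOWN(LABEL, RESULT).
   The hidden string length is assumed to arrive as a trailing int. */
void sketch_known_(const char *label, int *result, int label_len);

void sketch_known_(const char *label, int *result, int label_len)
{
    char buf[31];

    *result = 0;
    if (sketch_fstr_to_c(label, label_len, buf, sizeof buf) >= 0)
        *result = sketch_label_is_known(buf);
}
```

Keeping the conversion in one helper means the core C functions never see blank-padded strings, which is one reason a thin wrapper layer can mimic an existing Fortran interface without changing the underlying library.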

Other languages

It is planned to provide bindings to other languages, e.g. Tcl, Python and Java.
m.d.winn@dl.ac.uk
Last modified: Fri Jun 8 10:42:23 BST 2001