CCP4i Project Database

Author: Peter Briggs
Revision: 0.2 Date: 23/3/2004

Background:

In addition to the existing CCP4i job database, it is proposed that project-specific information also be stored.

Rationale:

Aims:

Description:

Each project directory would have a CCP4_DATABASE/project.def file. The contents of this file could be displayed and edited via options in the "Projects and Dirs" window.

Data can be acquired in several ways:

* Entered manually on creating the project
* Imported from some other source, e.g. MOSFLM output, or external databases such as MOLE
* Extracted from running programs/tasks/other procedures
* Derived directly from other pieces of data

Data that could be stored:

* Sequence data (e.g. *.seq file?)
  - theoretical molecular weight (calculated from sequence)
  - actual molecular weight (determined from experiment)

* Experimental details
  - Type of diffraction experiment (e.g. MAD, SAD, MIR, etc.)
  - Crystal identifiers
    o Associated cell information
    o Heavy atom types and numbers
    o Details for each data set derived from a crystal, e.g. wavelengths,
      F' and F'', etc.

* Cell (from data processing)
* Pointgroup (from data processing)
* Spacegroup
* Non-crystallographic translation (aka pseudo-translation)
  (if there is a large off-origin peak in the Patterson)

* Twinning analysis
  - Second moment of I from truncate output

* Wilson estimated B-factor (from truncate)

* Solvent content analysis (calculated from information above)
  - list of nmol per solvent fraction

* "Anomalous reliability CCvsResolution Cutoff"
  - from Scala or Revise (resolution limit below which cc indicates
                          data is too unreliable for use in HA search)

* Project-specific defaults for data harvesting
  - e.g. project names and defaults for harvesting options

Describing data items

This is still extremely speculative.

One could imagine some kind of metalanguage to describe a data item. At the most basic level this would be a name, a description, a type, and value(s) e.g.:

name        WAVELENGTH
description "Wavelength of X-rays used in the diffraction experiment"
type        positive float
units       Angstroms

More complicated types could include "lists" and "tables", or super-types made up from more basic types. (A table could be something like pairs of values, such as number of molecules in the asu versus solvent content, as calculated from MATTHEWS. A list could be something like a cell, which is a compound type made up from a list of single data items like cell lengths and angles.)
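As a rough illustration of the description above, the sketch below models a basic data item and a compound type in Python. It is only one possible reading of the idea: the class names ItemDef and CompoundDef, the TYPE_CHECKS registry, the is_valid helper, and the member names chosen for the cell (A, B, C, ALPHA, BETA, GAMMA) are all assumptions made for this example and are not part of the proposal.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ItemDef:
    """Description of one basic data item (hypothetical structure)."""
    name: str
    description: str
    type_name: str
    units: Optional[str] = None


@dataclass
class CompoundDef:
    """A compound type made up from a list of more basic items."""
    name: str
    description: str
    members: List[ItemDef] = field(default_factory=list)


# Hypothetical registry mapping a type name to a check on candidate values.
TYPE_CHECKS = {
    "positive float": lambda v: isinstance(v, float) and v > 0.0,
}


def is_valid(item, value):
    """Return True if value satisfies the declared type of item."""
    return TYPE_CHECKS[item.type_name](value)


# The WAVELENGTH item exactly as described in the text.
WAVELENGTH = ItemDef(
    name="WAVELENGTH",
    description="Wavelength of X-rays used in the diffraction experiment",
    type_name="positive float",
    units="Angstroms",
)

# A cell as a compound type; the member names are assumed for illustration.
CELL = CompoundDef(
    name="CELL",
    description="Unit cell lengths and angles",
    members=[
        ItemDef(name, "Cell parameter " + name, "positive float", units)
        for name, units in [
            ("A", "Angstroms"), ("B", "Angstroms"), ("C", "Angstroms"),
            ("ALPHA", "degrees"), ("BETA", "degrees"), ("GAMMA", "degrees"),
        ]
    ],
)
```

With these definitions, is_valid(WAVELENGTH, 0.9795) accepts the value while is_valid(WAVELENGTH, -1.0) rejects it. A "table" type could be handled in the same spirit, for example as a compound whose members describe the columns.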

Namespacing could be used to distinguish between different projects, or items in compound objects, e.g.:


PROJECT::CRYSTAL
PROJECT::CRYSTAL::CELL
PROJECT::CRYSTAL::DATASET::WAVELENGTH
etc
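One simple way to picture how such namespaced names might be resolved is as a walk through nested containers, one level per "::" component. The Python sketch below assumes the data is held in nested dictionaries; the lookup function, the store layout, and the numerical values are invented purely for illustration.

```python
def lookup(store, qualified_name, separator="::"):
    """Follow a namespaced name such as 'A::B::C' through nested dicts."""
    node = store
    for part in qualified_name.split(separator):
        node = node[part]  # raises KeyError if any level is missing
    return node


# Hypothetical project contents; the values are made up for the example.
store = {
    "PROJECT": {
        "CRYSTAL": {
            "CELL": [78.1, 78.1, 37.2, 90.0, 90.0, 90.0],
            "DATASET": {"WAVELENGTH": 0.9795},
        }
    }
}

wavelength = lookup(store, "PROJECT::CRYSTAL::DATASET::WAVELENGTH")
cell = lookup(store, "PROJECT::CRYSTAL::CELL")
```

A real implementation would also need to cope with several crystals or data sets at the same level, which this sketch does not attempt.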

The relationships between items are also important, e.g.:

correspondsTo dataBase1:name1 [ dataBase2:name2 [ ... ] ]
dependsOn dataItem1 [ dataItem2 [ ... ] ]
calculatedFrom procedure using dataItem1 [ dataItem2 [ ... ] ]

correspondsTo would be a way of specifying how a data item in our database maps onto data stored in an external database. dependsOn would be a way of listing dependencies within our database, for example if the item can be calculated directly from other information (molecular weight from sequence). calculatedFrom would be a way of specifying which program or procedure has to be run in order to generate the data. This is similar to dependsOn, but differs in that a single calculatedFrom procedure, run on the same input data items, may generate several data items.
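To make the three relations more concrete, the following Python sketch parses statements written in the form shown above into simple dictionaries. The function parse_relation and the dictionary layout are hypothetical, and the item and procedure names in the usage lines are placeholders rather than agreed identifiers.

```python
def parse_relation(line):
    """Parse one relation statement into a dict describing it."""
    words = line.split()
    if not words:
        raise ValueError("empty relation statement")
    kind, args = words[0], words[1:]
    if kind == "correspondsTo":
        # Each argument has the form database:name.
        targets = [tuple(arg.split(":", 1)) for arg in args]
        return {"kind": kind, "targets": targets}
    if kind == "dependsOn":
        return {"kind": kind, "items": args}
    if kind == "calculatedFrom":
        # Expected form: calculatedFrom <procedure> using <item> ...
        if len(args) < 2 or args[1] != "using":
            raise ValueError("expected: calculatedFrom procedure using items")
        return {"kind": kind, "procedure": args[0], "items": args[2:]}
    raise ValueError("unknown relation: " + kind)


# Placeholder names, following the pattern used in the text.
mapping = parse_relation("correspondsTo dataBase1:name1 dataBase2:name2")
dependency = parse_relation("dependsOn dataItem1 dataItem2")
calculation = parse_relation("calculatedFrom procedure using dataItem1")
```

Keeping the parsed relations as plain data would make it straightforward for a code generator to consume them later.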

It should be possible to translate the metalanguage into other descriptions using a custom code generator. For example, it could be translated into a CCP4i def file plus a task file to draw an interface to display and edit the data.

Tracking data history

All data comes from somewhere: it may have been entered manually by the user, imported from a file or external database, derived from other data in the database, or extracted from the output of a procedure or a program. The value of a datum may also change, for example because a more accurate estimate becomes available. This may in turn have knock-on effects if other data depend on the changed datum.

There needs to be a way of recording changes to the data. (It is possible to imagine new ways of viewing the project history, for example via the history of individual data items.)
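A minimal sketch of how changes and their knock-on effects might be recorded is given below in Python. Everything here is an assumption made for illustration: the ProjectData class, the Change record, the "stale" set used to flag items whose inputs have changed, and the item names in the dependency table.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Change:
    """One recorded change to a data item (hypothetical record layout)."""
    when: datetime
    item: str
    old_value: object
    new_value: object
    source: str  # e.g. "manual", "imported", "derived", "program output"


class ProjectData:
    def __init__(self, depends_on):
        self.depends_on = depends_on  # item -> list of items it depends on
        self.values = {}
        self.history = []             # every Change, in order
        self.stale = set()            # items whose inputs have changed

    def dependents(self, changed):
        """Return all items that depend, directly or not, on changed."""
        found, pending = set(), [changed]
        while pending:
            current = pending.pop()
            for item, deps in self.depends_on.items():
                if current in deps and item not in found:
                    found.add(item)
                    pending.append(item)
        return found

    def set(self, item, value, source):
        """Store a value, log the change, and flag dependent items."""
        now = datetime.now(timezone.utc)
        old = self.values.get(item)
        self.history.append(Change(now, item, old, value, source))
        self.values[item] = value
        self.stale |= self.dependents(item)
        self.stale.discard(item)

    def item_history(self, item):
        """Return the recorded changes for one data item."""
        return [change for change in self.history if change.item == item]


# Assumed dependencies, loosely following the items listed earlier.
db = ProjectData({
    "MOLECULAR_WEIGHT": ["SEQUENCE"],
    "SOLVENT_CONTENT": ["MOLECULAR_WEIGHT", "CELL"],
})
db.set("SEQUENCE", "MKV", "manual")
db.set("SEQUENCE", "MKVL", "imported")
```

After these two calls, db.item_history("SEQUENCE") holds both changes, and db.stale contains MOLECULAR_WEIGHT and SOLVENT_CONTENT, which is the kind of per-item view of the project history mentioned above.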

Automatically populating CCP4i task interfaces

It should be possible to automatically populate CCP4i interfaces with data taken from the project file, provided that there is some description of which task parameters correspond to which project parameters (the correspondsTo relation suggested above, for example):

dm:SOLVENT_FRAC correspondsTo project:solvent_fraction

When the task is initiated, these relationships are checked and the parameters in dm.def can be loaded with the corresponding values from the project database (if present) before the window is drawn.
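The loading step described here could look something like the following Python sketch. It is not CCP4i code (CCP4i itself is written in Tcl/Tk); the populate_task function, the in-memory dictionaries standing in for dm.def and the project database, the SOME_OTHER_PARAM entry, and the value 0.45 are all assumptions for the purposes of the example.

```python
def populate_task(task_name, task_params, project_data, relations):
    """Return a copy of task_params filled in from project_data.

    relations is a list of (task_key, project_key) pairs, e.g.
    ("dm:SOLVENT_FRAC", "project:solvent_fraction").
    """
    filled = dict(task_params)
    for task_key, project_key in relations:
        prefix, _, param = task_key.partition(":")
        scope, _, item = project_key.partition(":")
        if prefix != task_name or scope != "project":
            continue  # relation belongs to another task or database
        if item in project_data:
            # Only overwrite when the project actually holds a value.
            filled[param] = project_data[item]
    return filled


RELATIONS = [("dm:SOLVENT_FRAC", "project:solvent_fraction")]

# Stand-ins for the contents of dm.def and the project database.
dm_params = {"SOLVENT_FRAC": "", "SOME_OTHER_PARAM": "unchanged"}
project = {"solvent_fraction": 0.45}

loaded = populate_task("dm", dm_params, project, RELATIONS)
```

Returning a copy rather than modifying the task parameters in place keeps the stored defaults intact if the project database has no value for an item.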

Relevant Links:

A related project, intended to facilitate automation generally, is a proposed list of CCP4 variables being prepared by Martyn Winn.