Requirements for Bioxhit/CCP4 Project Database

Introduction

This document reviews the usage of the initial database system in a number of software packages, with a view to updating the database schema to accommodate the needs of these packages.

Overview of the existing schema

The existing schema consists of a single mySQL database within which there are four types of "data object" (each represented by a mySQL table):

projects
jobs
facts
dingbats

Although each instance of the data objects has a unique (numerical) id, a hierarchical relationship is implied in that projects are collections of jobs, and jobs have associated facts and associated dingbats - or alternatively that each fact or dingbat must have a parent job, and each job must have a parent project. The API for the database reinforced this hierarchy.

Software systems using the existing schema

Two software systems have tried working with the schema described above (Graeme Winter's XIA and Dan Rolfe et al's Happy). Both these systems are under continual development and their requirements are not necessarily static, nonetheless it is useful to consider how they use the existing schema in order to understand its strengths and weaknesses.

XIA

XIA currently takes diffraction images and performs indexing, processing and scaling steps to generate reduced reflection data.

In the XIA usage, a project is arbitrarily defined by the user. A new project is created initially with a single (empty) job, which is referred to as the "global job" for this project.

The global job acts as a central datastore for the project. Data is associated with this global job in the form of dingbats holding pickled python objects or XML files, and as components of XIA run they use this central datastore as a way of exchanging data. Subsequent runs of XIA within the project look up information in the global job to determine the current state of knowledge about the project.

Other jobs are also arbitrarily defined, so that a run of a script may be a single job or it may create many jobs. Dingbats are also associated with individual jobs, which hold data only relevant to the job in question. XIA may need to determine what the last "useful" job was and then fetch the data in the associated dingbats in order to perform the next job.

XIA does not make any use of "fact" data objects.

Future suggestions

Associate "source" and "sink" metadata with each dingbat, which reference which processes created the data object, and which ones subsequently accessed it.
Generate an id for each job which would be uniquely dependent on the input data for that job. Two jobs which ran the same procedure using the same input data would have the same id, and this would enable XIA to identify jobs that had been run previously.
Ability to create arbitrary "sub-projects", for example "solve structure" might be project #1, "process data" might be project #1.1, and "process dataset 1" might be project #1.1.1.
Split the data management into three parts: 1) global data store (i.e. the global job) to store project-specific, application-specific data objects; 2) a tracking/provenance database; 3) general database (aka "export database" or "knowledge base") which would hold data common to all applications.

Happy

Happy is a heavy atom location and phasing pipeline.

In Happy, a project corresponds to a run of the application. Within a project there are "nodes" and "paths" which together track the progress of the application.

A node corresponds to "something happens", for example an action is taken or a decision is made. The exact "something" is characterised by the node's "type" (NORMAL, START, SELECT, FORK, COPY, END_FAIL or END_SUCCESS). Nodes also have associated input and output paths.

A path is a collection of nodes (at least two) which form a linear segment (so a FORK node can only occur at the start or end of a path?). The path grows as new nodes are added. Paths indicate the flow of execution of the code (so the order of nodes is significant?).

Note that there is no hierarchical relationship between nodes and paths (unlike jobs and facts in the existing schema, for example).

Happy uses a project object (which is stored outside of the Bioxhit database) to carry the crystallographic/application-specific data through the process.

Future directions

Nodes could also have a record of the input and output data objects associated with them (cf INPUT_FILES and OUTPUT_FILES in CCP4i).
Bioxhit database could allow "subjobs" (i.e. substeps within an individual job).

CCP4i and CRANK

CCP4i is the CCP4 graphical user interface. CRANK is an automated heavy atom substructure determination and phasing application which has been built with the framework of CCP4i, and which extends this framework to some degree.

Within CCP4i a project corresponds to a directory which holds arbitrary files, plus log files from each job. There is also a subdirectory called CCP4_DATABASE which holds a simple database file with metadata about each job, parameter files used as input for each job in the project, and some other small "databases" such as the Amore model database.

In this context a job corresponds to a run of a CCP4i task. Each job has a number of associated attributes (TITLE, TASKNAME, DATE, STATUS, INPUT_FILES and OUTPUT_FILES).