BIOXHIT/CCP4: Ms 5.2.4 Extending the tracking database

This document is a place to put ideas about extending the CCP4i tracking database (i.e. database.def) content and functions.

Additional data items

Project title: a short (e.g. one line) user-specified description of the project
Project description: a longer description (like an abstract?) that could be added by a user. Both the project title and description could be useful if the user has a lot of projects or is able to share them with others.
User-agent: (or "driver application") This would be the name of the program that acts for the user when making changes to the database - for example, for jobs run in CCP4i the user agent would be "ccp4i", for jobs run by XIA2 it would be "xia2" and so on.
Some kind of versioning information could also be included.
Subjobs or substeps: these would be smaller jobs within a single larger job. For example, a single run of an automated pipeline would probably be one "job", which the automated process would then explicitly divide into smaller jobs.
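As a rough illustration of how the user-agent and subjob items might sit together in a job record, here is a minimal Python sketch. The JobRecord class, its field names, the subjobs_of helper and the xia2 step names are all hypothetical; nothing here reflects the existing database.def layout.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class JobRecord:
    """Hypothetical job record; field names are assumptions, not the database.def format."""
    job_id: int
    taskname: str
    user_agent: str                            # e.g. "ccp4i" or "xia2"
    user_agent_version: Optional[str] = None   # optional versioning information
    parent_id: Optional[int] = None            # set for subjobs, None for top-level jobs


def subjobs_of(jobs: Dict[int, JobRecord], job_id: int) -> List[JobRecord]:
    """Return the direct subjobs of job_id, ordered by job id."""
    return sorted((j for j in jobs.values() if j.parent_id == job_id),
                  key=lambda j: j.job_id)


# Illustrative data only: one pipeline run split into two made-up substeps,
# plus an unrelated job run from CCP4i.
jobs = {
    1: JobRecord(1, "pipeline", "xia2"),
    2: JobRecord(2, "integrate", "xia2", parent_id=1),
    3: JobRecord(3, "scale", "xia2", parent_id=1),
    4: JobRecord(4, "scala", "ccp4i"),
}
print([j.taskname for j in subjobs_of(jobs, 1)])
```

Using a parent pointer rather than a nested list keeps subjobs as ordinary jobs in the same table, which may or may not suit the real database.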
History data: currently no history data is stored in the tracking db; instead it is derived implicitly from filenames. We could store explicit history links between jobs. There would be two types of link:
- Data flow: that is, when job X uses data produced from job Y, and
- Logical flow: that is, when job X follows on from job Y due to some kind of "application logic". Application logic means the logic encoded in a script or other system, which determines that one step follows another even if there is no apparent data flow. It could equally apply to "procedural logic", when a human user follows a procedure in which the steps are linked by some logical scheme.
We might also wish to store inferred links (which are what the ccp4i.history class generates at the moment) and broken links (links that have been explicitly removed by the user).
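One possible shape for explicit history links is sketched below in Python. All names (LinkType, LinkStatus, HistoryLink, predecessors) are hypothetical, and the rule that a broken link suppresses any other link between the same pair of jobs is an assumption made for the sketch, not a decided design.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Set


class LinkType(Enum):
    DATA_FLOW = "data"        # job X uses data produced by job Y
    LOGICAL_FLOW = "logical"  # job X follows job Y through application or procedural logic


class LinkStatus(Enum):
    EXPLICIT = "explicit"  # recorded directly by the application or user
    INFERRED = "inferred"  # deduced, e.g. from filenames
    BROKEN = "broken"      # explicitly removed by the user


@dataclass(frozen=True)
class HistoryLink:
    from_job: int                # job Y, the predecessor
    to_job: int                  # job X, the successor
    link_type: LinkType
    status: LinkStatus = LinkStatus.EXPLICIT


def predecessors(links: List[HistoryLink], job_id: int) -> Set[int]:
    """Jobs that job_id follows on from, ignoring any pair the user has broken."""
    broken = {(l.from_job, l.to_job) for l in links
              if l.status is LinkStatus.BROKEN}
    return {l.from_job for l in links
            if l.to_job == job_id
            and l.status is not LinkStatus.BROKEN
            and (l.from_job, l.to_job) not in broken}


links = [
    HistoryLink(1, 2, LinkType.DATA_FLOW),
    HistoryLink(2, 3, LinkType.LOGICAL_FLOW),
    HistoryLink(1, 3, LinkType.DATA_FLOW, LinkStatus.INFERRED),
    HistoryLink(1, 3, LinkType.DATA_FLOW, LinkStatus.BROKEN),  # user removed 1 -> 3
]
print(predecessors(links, 3))
```

Storing broken links as records, rather than deleting the link, would stop an inferred link from reappearing the next time the history is regenerated.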
Application control file: currently the name and location of the CCP4i parameter file are not explicitly stored. The name of the file is generated as jobid_taskname.def (e.g. 123_scala.def) and the file is stored in the CCP4_DATABASE subdirectory of the project. Extending the tracking to other applications means explicitly storing this data, or at least its location.
Logfile: similar to application control files, in CCP4i the logfile location isn't explicitly stored at present. The name of the logfile is generated as jobid_taskname.log (e.g. 123_scala.log) and the file is stored in the project directory.
Notebook: similar to the application control and log files above. The notebook data is stored in a file with the name generated as jobid_notebook.txt (e.g. 123_notebook.txt).
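The implicit naming conventions for these three files could be captured as in the Python sketch below, together with one way of preferring an explicitly stored location when there is one. The function names and the "control_file" record key are hypothetical. The notebook helper returns only a file name, because its directory is not stated above.

```python
import os
from typing import Dict, Optional


def def_file_path(project_dir: str, job_id: int, taskname: str) -> str:
    """CCP4i parameter file: jobid_taskname.def in the CCP4_DATABASE subdirectory."""
    return os.path.join(project_dir, "CCP4_DATABASE", f"{job_id}_{taskname}.def")


def log_file_path(project_dir: str, job_id: int, taskname: str) -> str:
    """CCP4i logfile: jobid_taskname.log in the project directory."""
    return os.path.join(project_dir, f"{job_id}_{taskname}.log")


def notebook_file_name(job_id: int) -> str:
    """Notebook file name: jobid_notebook.txt (directory not specified here)."""
    return f"{job_id}_notebook.txt"


def control_file_for(record: Dict[str, Optional[str]], project_dir: str,
                     job_id: int, taskname: str) -> str:
    """Prefer an explicitly stored location; fall back to the CCP4i convention.

    The "control_file" key is an assumed name for the new data item.
    """
    return record.get("control_file") or def_file_path(project_dir, job_id, taskname)


print(os.path.basename(def_file_path("myproject", 123, "scala")))
print(os.path.basename(log_file_path("myproject", 123, "scala")))
print(notebook_file_name(123))
```

With a fallback like this, existing CCP4i projects would keep working without the new item, while other applications could record whatever control file location they use.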
Tags: this is just an idea. Tags would allow users (or the system?) to associate one or more arbitrary keywords with particular jobs. These could be used for selection purposes by other functions. Needs some thought.
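To make the selection idea a little more concrete, here is a small Python sketch of choosing jobs by tag. The select_by_tags function, the mapping from job id to a set of tags, and the example tag names are all assumptions for illustration.

```python
from typing import Dict, Iterable, List, Set


def select_by_tags(job_tags: Dict[int, Set[str]], wanted: Iterable[str],
                   match_all: bool = True) -> List[int]:
    """Return sorted ids of jobs carrying all (or, if match_all is False, any) wanted tags."""
    wanted_set = set(wanted)
    if match_all:
        hits = [job for job, tags in job_tags.items() if wanted_set <= tags]
    else:
        hits = [job for job, tags in job_tags.items() if wanted_set & tags]
    return sorted(hits)


# Made-up tags attached to three jobs.
job_tags = {
    1: {"native", "best"},
    2: {"native"},
    3: {"derivative"},
}
print(select_by_tags(job_tags, ["native"]))
print(select_by_tags(job_tags, ["best", "derivative"], match_all=False))
```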
Operation type: another idea. This would allow the application to specify the type of operation, and allow program or script runs to be distinguished from reported jobs or editing operations. Needs some thought.

Additional functionality

Manipulating projects:
- Allow jobs to be moved between projects
- Allow import/export of projects
- Allow projects to be split or merged
- Allow branched projects to be synchronised part

Gathering Feedback

Possible ways to get feedback:

Talk to Graeme Winter about using the system in XIA2
Talk to Paul Emsley and Liz Potterton about using the system in Coot and CCP4mg
