CCP4i Database Handler: Specification

Author: Peter Briggs
Revision: 0.3 Date: 31/01/2003

Aims

Provide an API to the CCP4i project history database which can be accessed by: the main CCP4i process; independent processes started by the main CCP4i process ("jobs"); external (non-CCP4i) applications.

In the first instance this must reproduce the existing functionality within CCP4i v1.3.8 for interacting with the database. It must be able to interact with the existing format of the database (currently a flat format CCP4i-parameter or "def" file). However it should also be extensible, that is, be able to accommodate a wider scope of commands and to be easily extended to use different database backends.

Requirements

Core Requirements

Reproduce existing functionality found in CCP4i 1.3.8.
Clearly separate general database functionality from CCP4i-specific functionality.
Provide an API which can be used by non-CCP4i applications to interact with the database (applications may be in languages other than Tcl).
Provide an API which can be used in a distributed computing environment.

Additional Requirements

Provide a way of expanding the scope of the project history database (allow storage and access of new data items).
Enable new/different database backends to be used for storing the project history database.
Enable updates/changes to item definitions.
Enable access to external/3rd party databases (for example LIMS).

General Issues

Security and authentication are issues which will need to be addressed if the server is accessible over a network. These issues have not yet been explored.

Component-specific Issues

Requirements for CCP4i

The main CCP4i process needs both read and write access to the database, as it needs to be able to register new jobs, set/edit associated information, and delete jobs from the database, as well as requesting information on the current state of the database. It must be able to handle "update" messages from the database handler when the status of the database changes.

Requirements for Running Scripts ("jobs")

Running jobs need to be registered with the handler as part of their startup, but after this point only require limited write access (to add output files, register script termination) and no read access to the database.

Requirements for Non-CCP4i Applications

Applications may be written in any language, so the protocol for exchanging requests/information between the handler and the applications needs to be as generic as possible.

Requirements for Accessing Other Databases

These are not currently known.

Current Implementation

The database for a given project is stored on disk as a flat file in CCP4i parameter file format (".def"). The data in the file is read into an array in the main CCP4i process when the project is first opened by the user. Subsequent queries or actions on the database are made via the array held in memory, which is periodically written back out to the file.
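The read-once, work-in-memory, write-back pattern described above can be sketched as follows. This is only an illustration: the class name `ProjectCache` is hypothetical, and the one-`KEY VALUE`-pair-per-line layout is a stand-in chosen for brevity, not the actual CCP4i ".def" syntax.

```python
class ProjectCache:
    """Hypothetical in-memory image of a flat-file project database."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        self.dirty = False  # True when memory is newer than the file

    def load(self):
        # Assumed stand-in format: one "KEY VALUE" pair per line,
        # with blank lines and lines starting with "#" ignored.
        self.data.clear()
        with open(self.path) as handle:
            for line in handle:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                key, _, value = line.partition(" ")
                self.data[key] = value
        self.dirty = False

    def get(self, key, default=None):
        # Queries are answered from memory, never from the file.
        return self.data.get(key, default)

    def set(self, key, value):
        self.data[key] = value
        self.dirty = True

    def flush(self):
        # Write back only if something changed since the last load/flush.
        if not self.dirty:
            return False
        with open(self.path, "w") as handle:
            for key, value in self.data.items():
                handle.write(f"{key} {value}\n")
        self.dirty = False
        return True
```

A caller would invoke `flush()` periodically; anything set between two flushes exists only in the memory of the owning process, which is one reason other processes cannot safely share the file directly.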

Functions for interacting with the database information are tightly embedded in the code for the main CCP4i process. In some cases general database functions are mixed together with CCP4i-specific code, thus the full range of functionality is not easily accessible by non-CCP4i applications.

Possible Solutions

The current implementation does not meet all the requirements; for example, it does not provide an API which can be used easily by other applications.

The current prototype solution uses a separate database handler process. Application programs interact with the database via socket requests made to the database handler process. This addresses the issues of networked operation, and removes the requirement for multiple language-specific APIs.
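As a rough illustration of the socket-based approach, the sketch below runs a tiny handler in a background thread and sends it one request. The message format (one JSON object per line, with a `command` field) and the names `start_handler` and `send_request` are assumptions made for this example; the actual exchange protocol is still listed as an unresolved issue.

```python
import json
import socket
import socketserver
import threading


class RequestHandler(socketserver.StreamRequestHandler):
    """Answers one JSON request per line; the message format is assumed."""

    def handle(self):
        for raw in self.rfile:
            request = json.loads(raw)
            if request.get("command") == "ping":
                reply = {"status": "ok", "result": "pong"}
            else:
                reply = {"status": "error", "result": "unknown command"}
            self.wfile.write((json.dumps(reply) + "\n").encode("utf-8"))


def start_handler(host="127.0.0.1", port=0):
    """Start the handler in a background thread; port 0 lets the OS choose."""
    server = socketserver.ThreadingTCPServer((host, port), RequestHandler)
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def send_request(address, request):
    """Open a connection, send one request, and return the decoded reply."""
    with socket.create_connection(address, timeout=5) as conn:
        conn.sendall((json.dumps(request) + "\n").encode("utf-8"))
        with conn.makefile("r", encoding="utf-8") as reader:
            return json.loads(reader.readline())
```

Because the client side needs only a socket and a text encoder, an application in any language could make the same request without a language-specific API library.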

Other possible solutions (currently not under consideration):

It would be possible to separate the database handling API cleanly from the main CCP4i process in such a way that other applications could load the commands into an interpreter and use them to interact with the database directly. However, it would not be straightforward for non-Tcl applications to interact with the database. This could be addressed by implementing the API in a number of languages, but that would incur an undesirable maintenance overhead. This approach also fails to address the issue of operating over a network.

Outline of the Database Handler ("DbCCP4i")

The provisional name for the database handler process is DbCCP4i.

The current model of DbCCP4i consists of three basic sets of components:

Core handler. This manages the sockets for communicating with the connecting processes, performs basic book-keeping (e.g. tracking which processes need access to which databases, which databases are currently open, and so on), and passes requests and responses between processes and databases via the appropriate API layers.
API layer to deal with requests received from and responses sent to the connected processes using the appropriate protocols (e.g. XML/http).
API layers to deal with different types of database, e.g.:
1. CCP4i project database
2. SQL database
3. XML database
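One way to picture the split between the core handler and the per-database API layers is a common interface with interchangeable backends, as in the sketch below. All class and method names (`Backend`, `get_item`, `set_item`, `CoreHandler`) are hypothetical, and the two backends shown are stand-ins rather than the real CCP4i project database layer.

```python
import sqlite3
from abc import ABC, abstractmethod


class Backend(ABC):
    """Interface the core handler calls; method names are hypothetical."""

    @abstractmethod
    def get_item(self, job_id, name): ...

    @abstractmethod
    def set_item(self, job_id, name, value): ...


class MemoryBackend(Backend):
    """Stand-in backend holding items in a dict, for illustration only."""

    def __init__(self):
        self._items = {}

    def get_item(self, job_id, name):
        return self._items.get((job_id, name))

    def set_item(self, job_id, name, value):
        self._items[(job_id, name)] = value


class SqliteBackend(Backend):
    """Stand-in SQL backend using the standard-library sqlite3 module."""

    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS items "
            "(job_id INTEGER, name TEXT, value TEXT, "
            "PRIMARY KEY (job_id, name))"
        )

    def get_item(self, job_id, name):
        row = self._conn.execute(
            "SELECT value FROM items WHERE job_id = ? AND name = ?",
            (job_id, name),
        ).fetchone()
        return row[0] if row else None

    def set_item(self, job_id, name, value):
        with self._conn:  # commits on success
            self._conn.execute(
                "INSERT OR REPLACE INTO items VALUES (?, ?, ?)",
                (job_id, name, value),
            )


class CoreHandler:
    """Routes requests to whichever backend is registered for a project."""

    def __init__(self):
        self._backends = {}

    def open_project(self, project, backend):
        self._backends[project] = backend

    def handle(self, project, action, *args):
        backend = self._backends[project]
        if action == "get":
            return backend.get_item(*args)
        if action == "set":
            return backend.set_item(*args)
        raise ValueError(f"unknown action: {action}")
```

The core handler never needs to know which storage type sits behind a project, so adding a new database type would mean adding one more `Backend` subclass.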

A DbCCP4i process can be started either by the main CCP4i process or separately, e.g. via a user command. Each CCP4i project history database should only be accessed by a single DbCCP4i at any one time (this could be controlled via lock files), though a single DbCCP4i could access more than one project database - for example, when a user is browsing the project history data in one project but also has running jobs which are registered in a second project. To conserve system resources each user should only have one DbCCP4i running at any time.
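A minimal sketch of lock-file control, assuming one lock file per project database, is shown below. The function names and the choice to store the process ID in the file are illustrative assumptions, not part of the prototype.

```python
import os


def acquire_lock(lock_path):
    """Try to create the lock file; return True if this process now holds it."""
    try:
        # O_EXCL makes creation fail if the file already exists, so two
        # handlers cannot both believe they created the lock.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(f"{os.getpid()}\n")  # assumed content: owner's PID
    return True


def release_lock(lock_path):
    """Remove the lock file if it is present."""
    try:
        os.remove(lock_path)
    except FileNotFoundError:
        pass
```

This sketch does not deal with stale locks left behind by a handler that exited abnormally; a real implementation would need some way to detect and clear them.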

Ultimately it should be possible to allow several users to access the same database simultaneously via a single DbCCP4i.

Processes which wish to access the database must first register themselves with the DbCCP4i - in a secure environment this should include some authentication procedure. Registered processes may have different interaction requirements, for example: the main CCP4i process needs to be able to read and write to the database, but it also needs to know when the database content changes (so it can update its display); whereas running jobs only need to send information to the database (new output files, job finishes or fails) and do not need to be informed of updates. There will need to be an "update" mode whereby the DbCCP4i broadcasts notice of an updated database to all connected processes which have registered an interest in knowing when certain types of database content have changed.
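The registration and "update" broadcast ideas can be sketched with a small registry that records each client's access rights and the kinds of change it wants to hear about. The class name, the `interests` labels, and the callback-based notification are all assumptions for illustration; authentication is deliberately left out.

```python
class Registry:
    """Tracks registered clients, their access rights, and update interests."""

    def __init__(self):
        self._clients = {}

    def register(self, client_id, can_read, can_write, interests=(), notify=None):
        # notify is a callable(change_type, detail), or None for clients
        # (such as running jobs) that never need to be told about updates.
        self._clients[client_id] = {
            "can_read": can_read,
            "can_write": can_write,
            "interests": set(interests),
            "notify": notify,
        }

    def unregister(self, client_id):
        self._clients.pop(client_id, None)

    def may(self, client_id, operation):
        """Return True if the client may perform "read" or "write"."""
        client = self._clients.get(client_id)
        if client is None:
            return False  # unregistered processes get no access
        return client["can_" + operation]

    def broadcast(self, change_type, detail):
        """Notify every client that registered an interest in change_type."""
        notified = []
        for client_id, client in self._clients.items():
            if change_type in client["interests"] and client["notify"]:
                client["notify"](change_type, detail)
                notified.append(client_id)
        return notified
```

In this picture the main CCP4i process would register with read and write access plus an interest in, say, job status changes, while a running job would register as write-only with no interests.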

There needs to be a mechanism for new processes to detect and connect to an existing DbCCP4i when trying to access a project database - possibly through information stored in the lock file.
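If the lock file were used to advertise the handler, discovery might look like the sketch below. It assumes, purely for illustration, that the lock file holds a JSON object with `host` and `port` fields; the real lock file contents are not specified here.

```python
import json
import socket


def write_handler_info(lock_path, host, port):
    """Record where the handler is listening (assumed JSON layout)."""
    with open(lock_path, "w") as handle:
        json.dump({"host": host, "port": port}, handle)


def find_handler(lock_path, timeout=1.0):
    """Return (host, port) of a reachable handler, or None."""
    try:
        with open(lock_path) as handle:
            info = json.load(handle)
        address = (info["host"], info["port"])
    except (FileNotFoundError, ValueError, KeyError, TypeError):
        return None  # no lock file, or not in the assumed layout
    try:
        # A successful connection shows that something is listening there.
        with socket.create_connection(address, timeout=timeout):
            return address
    except OSError:
        return None  # stale lock: nothing answers at the recorded address
```

A new process would call `find_handler` first and start its own DbCCP4i only when the result is `None`.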

DbCCP4i processes should have a number of different persistence modes. Initially the DbCCP4i will persist as long as it still has registered processes (processes should unregister themselves on shutdown). It should also be possible to leave the DbCCP4i running indefinitely and have processes connect and disconnect as required. There also needs to be a way to shut down a DbCCP4i cleanly, e.g. via a user command.
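The persistence modes can be expressed as a small piece of exit logic, sketched below. The mode names and the `Lifecycle` class are hypothetical, and the rule that the handler does not exit before its first client has registered is an added assumption to avoid shutting down immediately at startup.

```python
from enum import Enum


class Persistence(Enum):
    WHILE_CLIENTS = "while_clients"  # exit once the last client unregisters
    INDEFINITE = "indefinite"        # keep running until told to stop


class Lifecycle:
    """Decides when a handler process should exit."""

    def __init__(self, mode=Persistence.WHILE_CLIENTS):
        self.mode = mode
        self.clients = set()
        self.shutdown_requested = False
        self._seen_client = False  # assumption: wait for a first client

    def register(self, client_id):
        self.clients.add(client_id)
        self._seen_client = True

    def unregister(self, client_id):
        self.clients.discard(client_id)

    def request_shutdown(self):
        # Corresponds to an explicit user command to stop the handler.
        self.shutdown_requested = True

    def should_exit(self):
        if self.shutdown_requested:
            return True
        if self.mode is Persistence.WHILE_CLIENTS:
            return self._seen_client and not self.clients
        return False
```

The handler's main loop would check `should_exit()` after each unregister or shutdown request and, if it returns true, write out any pending data before stopping.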

Prototype DbCCP4i: current status

A prototype version of DbCCP4i now exists which has been built on top of CCP4i 1.3.9.

Currently Unresolved Issues for DbCCP4i

Information exchange protocols: i.e. how are requests and information to be passed between the server and the application (syntax etc)?
Security: how to make sure that only authorised applications are allowed to interact with the database?
Locking issues: should multiple processes be allowed access (read-write, read-only) to the same database (queuing/lock-grab/lock-out)?

Work Breakdown for Implementation

1. Define the APIs for interaction between the DbCCP4i and the client processes.
2. Define the APIs for interaction between the DbCCP4i and the databases.
3. Rewrite CCP4i code so that CCP4i interacts with the database through a set of API commands. More specifically, this consists of separating the existing commands into two distinct classes: a set which interacts directly with the global "database" variable and its image on disk (the core database commands); and a set which interacts with the database information via calls to the first.
4. Write the core DbCCP4i process and implement functions to allow processes to start up, connect to/register with, disconnect/unregister from and shut down the DbCCP4i. Investigate/deal with security issues.
5. Implement the project database handling functions identified in step 3 inside the DbCCP4i.
6. Rewrite the CCP4i database API to interact with the project history database via the DbCCP4i.
7. Document the implementation and usage of the DbCCP4i and APIs.
8. Perform local testing.
9. Distribute new CCP4i for wider testing.
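The separation of commands into two classes described in the work breakdown (core database commands, and commands built only on calls to them) might look like the sketch below. The function names and the item names `TASKNAME` and `STATUS` are illustrative assumptions, and a plain dictionary stands in for the global database variable.

```python
# Core database commands: the only code that touches the stored data.
_database = {}


def db_get(job_id, item, default=None):
    return _database.get(job_id, {}).get(item, default)


def db_set(job_id, item, value):
    _database.setdefault(job_id, {})[item] = value


def db_job_ids():
    return sorted(_database)


# Higher-level commands: written only in terms of the core commands,
# so they keep working if the core is re-implemented to talk to a handler.
def register_job(task_name):
    job_id = max(db_job_ids(), default=0) + 1
    db_set(job_id, "TASKNAME", task_name)
    db_set(job_id, "STATUS", "STARTING")
    return job_id


def finish_job(job_id, succeeded=True):
    db_set(job_id, "STATUS", "FINISHED" if succeeded else "FAILED")


def jobs_with_status(status):
    return [job for job in db_job_ids() if db_get(job, "STATUS") == status]
```

With this split, moving the database behind the DbCCP4i would only require replacing the three core commands with versions that send requests to the handler.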