Automation standards and frameworks: from data reduction to structure
This document aims to outline the specific questions that each of the workshop
sessions will try to answer. See
http://www.ebi.ac.uk/msd-srv/docs/bioxhit05_1.html
for information on the workshop.
1. Standards for Frameworks for Automation
This session will focus on technologies rather than on the science - we are
interested in the similarities and differences between developments.
Ideally each talk will give a description of the pipeline
(e.g. is it a script, a set of daemons, or something else?), including the
language(s) it is implemented in, and then focus on answering the
following questions:
- Inter-component communications:
how do components within the pipeline communicate with each other?
- Abstraction of actions: how are tasks expressed in a generic fashion?
That is, how are tasks described in a way which separates the specific
program (e.g. SCALA) from the generic task (e.g. scaling)?
(A minimal sketch of this idea follows this list.)
- Action prerequisites: how do
components get the right data to run? Where is the data stored? How are
decisions made? How is the decision-making expertise stored and accessed?
- Action results: how are
results reported to the "end user"? Do the results get stored anywhere?
If so, where, how and for how long?
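As a concrete illustration of the "abstraction of actions" question above, the
following is a minimal Python sketch. It is not taken from any of the pipelines
under discussion: all class and file names are hypothetical, and SCALA is named
only because it appears in the example above. The idea is that pipeline
components depend on a generic task interface, while program-specific details
live in one concrete implementation.

    from abc import ABC, abstractmethod


    class ScalingTask(ABC):
        """Generic 'scaling' action, defined by its inputs and outputs
        rather than by the program used to perform it."""

        @abstractmethod
        def run(self, unmerged_reflections: str) -> dict:
            """Scale the input reflection file and return a result record."""


    class ScalaScalingTask(ScalingTask):
        """One concrete implementation of the generic task, wrapping a
        specific program (SCALA is named only for illustration)."""

        def run(self, unmerged_reflections: str) -> dict:
            # A real wrapper would build the command line, run the program
            # and parse its log files; a placeholder record is returned here
            # so that the sketch runs as-is.
            return {
                "task": "scaling",
                "program": "scala",
                "input": unmerged_reflections,
                "status": "success",
            }


    if __name__ == "__main__":
        task: ScalingTask = ScalaScalingTask()
        print(task.run("unmerged.mtz"))

Under this arrangement a different scaling program could be substituted by
providing another implementation of the same interface, without changing the
components that call it.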
The aim of this session is to make recommendations on how future
developments should answer these questions.
2. Standards for data exchange between computational units in the structure determination software pipeline
This session will focus on defining a practical hierarchy of
functional blocks within the software
pipeline. A "functional block" could be a single program or a section of
pipeline. It could itself be made up of other smaller functional blocks (e.g.
functions within programs or smaller sections of pipeline).
The point of doing this is to define the interfaces between
functional blocks, as it is across these interfaces that data will need to be
transferred. For each functional block we need to consider what information is
required for:
- Decision making and feedback - that is, what do you need to know to go
into a block, and what do you learn from doing it?
- Data transfer, including metadata (e.g. the expected number of heavy
atoms) - that is, things which are known or required throughout the
process or parts of it, e.g. sequence, spacegroup, cell parameters
(a possible record of this kind is sketched after this list)
- Archiving and deposition - in particular, what is missing from what we
already store? Are there issues to do with storing data which can easily
be regenerated? Can we anticipate requirements for future applications
(e.g. data mining)?
- Definitions of "success" and "failure" - what do these mean?
- Transfer of ambiguous data - we need to consider, for example, that
spacegroup names can have many different representations (also
illustrated in the sketch after this list)
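To make the metadata and ambiguity points above concrete, here is a small
Python sketch. The record fields, the alias table and all names are
illustrative assumptions, not part of any existing data model or of the
BIOXHIT standard.

    from dataclasses import dataclass
    from typing import Tuple


    @dataclass
    class ExperimentMetadata:
        """Items known or required throughout the process or parts of it."""
        sequence: str                     # one-letter amino-acid codes
        spacegroup: str                   # canonical Hermann-Mauguin symbol
        cell: Tuple[float, float, float,  # a, b, c (angstroms)
                    float, float, float]  # alpha, beta, gamma (degrees)
        expected_heavy_atoms: int = 0


    def normalise_spacegroup(name: str) -> str:
        """Illustrative only: map a few spellings of the same spacegroup
        onto a single canonical representation."""
        aliases = {
            "P212121": "P 21 21 21",
            "P 21 21 21": "P 21 21 21",
            "19": "P 21 21 21",  # International Tables spacegroup number
        }
        return aliases.get(name.strip(), name.strip())


    if __name__ == "__main__":
        meta = ExperimentMetadata(
            sequence="MKTAYIAKQR",  # an arbitrary example sequence
            spacegroup=normalise_spacegroup("P212121"),
            cell=(50.0, 60.0, 70.0, 90.0, 90.0, 90.0),
            expected_heavy_atoms=4,
        )
        print(meta)

The point of the normalisation helper is simply that, whichever representation
a component emits, the data crossing an interface should be in one agreed
canonical form.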
The session should also consider how much of this is covered
by existing standards.
The aim of this session is to provide a set of requirements which
can be used as the basis for or input to the BIOXHIT Data Exchange Standards
(task 5.1.2).
3. Toolboxes for Automation
This session will look at existing toolbox developments.
Each talk should describe the toolbox and should include:
- What problems did you set out to solve with the toolboxes?
- What did you learn from writing them?
- What would you do differently and what's missing?
The discussion should then focus on what functionality the
pipelines covered on day 1 need, and where they obtained this functionality
from - for example, using an existing toolbox, using existing programs from a
general software suite, or writing their own functions or "jiffy" utilities.
This would be a change to the current programme.
- What functionality is required to make your pipelines work?
- Where is this functionality already provided? Which functions have you
had to provide yourself?
- What languages and approaches are you using?
The aim of this session is to summarise the answers to these
questions to feed into the report on a software toolbox for automation (BIOXHIT
task 4.7.1).
Outcomes of this meeting
Report from each session of the meeting:
- From day 1: Report on the existing
pipeline developments, comparison of technologies and recommendations for
pipeline frameworks
- From day 2: Report on the
"functional blocks" and the interfaces as a set of requirements to be fed
into the BIOXHIT data model for data transfer
- From day 3: Report on the existing
toolboxes to be fed into the BIOXHIT report on toolbox requirements
We should also agree on follow-up meetings and other actions
as part of this meeting. It is suggested that we add a short wrap-up session at
the end of the third day to set deadlines for the next steps.