Writing CCP4 scripts in PERL

E. Courcelle and J.P. Samama

Institut de Pharmacologie et Biologie Structurale, IPBS, 205 route de Narbonne, 31077 TOULOUSE, Cedex, France

I. Introduction

Handling and processing crystallographic data commonly requires several programs, provided by the CCP4 project and by other sources. The processing steps include:

data format conversion programs
data treatment programs
normalization

Although possible, interactive work is tedious, inefficient and error-prone, as every program requires reading and writing several data, log and command files. It is therefore more appropriate to write scripts in which all calls, input and output file names, etc. are coded, so that a single script executes all the programs. However, large scripts end up resembling any other computer program: depending on how they were written, they can be more or less difficult to understand, and thus difficult to maintain or modify. We have tried to implement a tool that lets users write large but nevertheless clear and easy-to-maintain scripts.

II. Choosing a scripting language

It is most common, when writing scripts which execute scientific programs, to use traditional Unix shells, such as sh, ksh, csh or one of their dialects. This is a good and well-known approach when the only objective is to chain the execution of several programs.
However, data processing now requires increasingly complex scripts, invoking not only CCP4 programs but also programs from other sources (particularly xplor or cns). One example is running a given program in a loop and exiting from the loop only when a given condition is fulfilled. The program should be considered as a black box, whose behaviour may be described as:

reading data
doing something with the data, according to another input file (keywords, xplor input file, etc).
writing output files
telling the user what has been done

The output files of a given program are often the inputs of another one, defining some piping mechanism.
Another case arises when the script must analyze the log file, either to extract some results and print a summary, or to take a decision based upon those results. It is thus important to be able to extract information easily from ASCII files.
In both cases, the file will be treated inside the script, and will probably not be used later. Thus its name does not have to be chosen by, or even known to the user; however one must define an algorithm for the script to choose unique file names, avoiding any conflict. It is clear that this "black box concept" is most easily implemented using object oriented programming. One thus needs a scripting language with two main characteristics:

efficient ASCII file processing
support for object oriented programming

Neither of these two characteristics is properly supported in traditional Unix shell languages: leaving object orientation aside, users must call external Unix programs such as grep or awk to extract information from ASCII files. However, starting from version 5, the Perl language is well suited to these constraints:

Regular expression matching, as well as other facilities, is implemented as part of the language. Thus, one does not have to call external utilities to analyze ASCII files.
The language is extensible, so it is possible to write Perl modules. Those modules may support data encapsulation, inheritance and methods, which makes object oriented programming possible in Perl.
Memory management is automatic and efficient, which, combined with data encapsulation, is very useful for object oriented programming.
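
As an illustration of the first point, the following minimal sketch extracts values from log text with a regular expression, without calling grep or awk. The log excerpt and its layout are invented for the example; the log of a real program will look different and the pattern must be adapted.

```perl
use strict;
use warnings;

# Hypothetical log excerpt; real program logs will use a different layout.
my $log_text = <<'END_LOG';
 Cycle   1   R factor = 0.312
 Cycle   2   R factor = 0.287
 Cycle   3   R factor = 0.271
END_LOG

# Collect every (cycle, R factor) pair with one global regular expression match.
my %r_factor;
while ($log_text =~ /Cycle\s+(\d+)\s+R factor\s*=\s*([\d.]+)/g) {
    $r_factor{$1} = $2;
}

# Print a short summary, one line per cycle, in numerical order.
for my $cycle (sort { $a <=> $b } keys %r_factor) {
    printf "cycle %d: R = %.3f\n", $cycle, $r_factor{$cycle};
}
```

The same loop works on text read from a file, so a script can base a decision (for instance, whether to stop iterating) on the extracted values.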

III. The `occp4.pm` module.

A Perl module was written which allows the user to write Perl scripts in an object oriented way. This module is loaded with a standard Perl require instruction (cf. the first example).

Design of an object:

The constructor:

Before the execution of a program, an object, whose class is called occp4, is created (cf example). The parameters of the constructor are:

The name of the ccp4 program
The pairs (logical name, file name) that the program will use for its input/output.

As noted previously, in many cases, the files created or read by the programs do not have to survive after the script: they can be treated as temporary files, so the user does not have to worry about the choice of a filename. In those cases, the special name 'CHANA' (acronym of 'CHoose A NAme for me') causes the object to compute a unique temporary name for this file. The algorithm involved is implemented by the occp4 module, in such a way that names are generated in an "orderly" manner. This helps when debugging the script. The temporary files are deleted by the destructor of the object, unless otherwise specified.
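
The algorithm used by occp4.pm is not described here. The sketch below only shows one possible way of generating unique and orderly temporary names, by combining the program name, the logical name, the process number and a counter. The function name choose_a_name and the naming scheme are assumptions made for this illustration.

```perl
use strict;
use warnings;

# Hypothetical name generator; occp4.pm may use a different scheme.
{
    my $counter = 0;    # shared by all calls, private to this block

    sub choose_a_name {
        my ($program, $logical) = @_;
        $counter++;
        # The process number avoids clashes between concurrent scripts;
        # the counter keeps the names of one script in creation order.
        return sprintf "%s_%s_%d_%03d.tmp", $program, lc $logical, $$, $counter;
    }
}

my $first  = choose_a_name('fft',    'MAPOUT');
my $second = choose_a_name('extend', 'MAPOUT');
print "$first\n$second\n";
```

Because the counter always increases, listing the temporary files in alphabetical order of their suffix also lists them in the order in which they were created, which is what makes such names convenient when debugging.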

The member functions:

The behaviour of the object is then controlled by the member functions; the most important functions are listed here:

iofiles, iofildel to specify the input-output pairs of (logicals, file names), after the constructor has been called, or to retrieve the file name, if chosen by the system.
keywords, keywrep, keywdel to generate and modify a list of pairs of (keywords, values) controlling the behaviour of the program, as described in the documentation.
input_src, input_string, input_file: these functions allow the user to let the object read its control script not only from the list of (keywords,values), but also from a string or from a file: they are particularly useful when dealing with programs which do not belong to ccp4.
logfile allows the user to know the name of the (temporary) logfile computed by the object. It is also possible to force the object to use another log file. Forcing all objects to share the same logfile allows the user to keep a single logfile for the whole script.
debug controls the messages printed by the object at key moments in its life (mainly when the program is executed or when the object is destroyed).
run starts the program. The exit code of the program is returned, so that error handling is very easy to achieve.
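
The convention of returning the exit code can be sketched with the standard Perl system function. The helper run_program below is hypothetical and is not part of occp4.pm; the running Perl interpreter ($^X) is used as a stand-in for a crystallographic program.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of occp4.pm: run a command and return
# 0 on success or a non-zero value on failure.
sub run_program {
    my @command = @_;
    my $status = system(@command);      # wait status, also available in $?
    return -1 if $status == -1;         # the command could not be started
    # Exit code of the program, or the signal number if it was killed.
    return ($status >> 8) || ($status & 127);
}

# $^X is the running perl interpreter, used here as a stand-in program.
my $ok     = run_program($^X, '-e', 'exit 0');
my $failed = run_program($^X, '-e', 'exit 3');

print "first run returned $ok, second run returned $failed\n";
```

With such a convention, a script can stop at the first failure with a single line of the form die "message" if (run_program(...)), which is the idiom used in the examples of section V.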

The destructor:

The destructor is called by the system, as explained below. It essentially removes all the temporary files handled by the object, i.e. the files whose names were declared as "CHANA". It is thus very simple to implement pipe-like mechanisms, feeding a program with the output of another program, without having to worry about intermediate file names.
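
The principle of a destructor that removes a temporary file can be sketched as follows. The class TempFileOwner is written only for this illustration and is not the implementation of occp4.pm; in Perl, the destructor is the method named DESTROY.

```perl
use strict;
use warnings;

# Hypothetical class, written only for illustration; it is not occp4.pm.
package TempFileOwner;

sub new {
    my ($class, $filename) = @_;
    open(my $fh, '>', $filename) or die "cannot create $filename: $!";
    print {$fh} "intermediate data\n";
    close($fh);
    return bless { file => $filename }, $class;
}

sub file { return $_[0]{file} }

# Called automatically by Perl when the last reference to the object goes away.
sub DESTROY {
    my ($self) = @_;
    unlink $self->{file};
}

package main;

my $name = "owner_demo_$$.tmp";
{
    my $owner = TempFileOwner->new($name);
    my $state = -e $owner->file() ? "present" : "absent";
    print "inside the block: file is $state\n";
}   # $owner goes out of scope here, so DESTROY removes the file
my $state = -e $name ? "present" : "absent";
print "after the block:  file is $state\n";
```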

IV. Playing with memory management

Perl provides a sophisticated memory management system. A good understanding of its logic is important for using the occp4 objects correctly. While the constructor is explicitly called when a new object is created, the destructor is implicitly called when the object goes out of scope.

The `use strict` instruction

In order to make your script clearer, and to be sure of the scope of your variables (and consequently of your objects), you should declare use strict at the beginning of the script (cf. the first example); with this declaration, you are assured that:

Every variable must be declared with the my instruction before being used
Every variable declared with the my instruction is local to the block; this ensures that the memory allocated for this variable is given back to the system at the end of the block
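
Both rules can be checked with a few lines of plain Perl. This is a minimal sketch; the variable names are arbitrary.

```perl
use strict;
use warnings;

my $title = "outer";            # visible until the end of the file
{
    my $title = "inner";        # a new variable, local to this block
    my $scratch = "temporary";  # exists only inside the block
    print "in block:    $title\n";
}
print "after block: $title\n";

# Under 'use strict', using an undeclared variable is a compile-time error.
# eval with a string compiles the code at run time so the error can be shown.
my $result = eval 'use strict; $undeclared = 1; 1';
print "undeclared variable rejected\n" unless $result;
```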

How to call the destructor?

Objects in Perl are accessed through so-called "blessed" references (comparable to pointers in languages like C). When working only with local variables, as noted above, the following rules may be applied to control when the destructor is called:

When an object is created (calling its constructor with an instruction like $obj = new occp4(...)):
- some memory allocation is performed for the object
- the object is initialized
- the constructor returns a reference pointing to the created object
- The reference is stored inside $obj
It is legal to write instructions like $o1=$obj. Only the reference is copied; the object itself is not duplicated.
If the scope of $o1 is longer than the scope of $obj, the object is still alive, even when $obj is destroyed. The exact rule is: "the object is alive as long as there is at least one reference pointing to it". The object is destroyed, and thus the destructor is called, when the last reference pointing to it goes out of scope or is assigned another value, for instance with the instruction $obj = 0. See the third example for a short illustration of this property.

Thus, even though the destructor cannot be called explicitly, as it can in other object oriented languages like C++, you can control exactly when the system calls it, just by controlling the assignment and scope of reference variables.
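
These rules can be observed with a small tracing class. The class Traced and the @events list are hypothetical names used only for this sketch; they record when the constructor and the destructor are called.

```perl
use strict;
use warnings;

my @events;                       # log of what happened, in order

# Hypothetical class used only to trace construction and destruction.
package Traced;

sub new {
    my ($class, $name) = @_;
    push @events, "create $name";
    return bless { name => $name }, $class;
}

sub DESTROY {
    my ($self) = @_;
    push @events, "destroy $self->{name}";
}

package main;

my $o1;
{
    my $obj = Traced->new("A");   # one reference: $obj
    $o1 = $obj;                   # two references, still a single object
}                                 # $obj disappears, but $o1 keeps "A" alive
push @events, "block left";
$o1 = 0;                          # last reference overwritten: "A" is destroyed
push @events, "reference overwritten";

print "$_\n" for @events;
```

The destruction is recorded after "block left" and before "reference overwritten", which shows that the object survives the block in which it was created and disappears only when its last reference is overwritten.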

V. Some examples

This section shows three short example scripts, written in Perl, which use the occp4.pm module. Those scripts can be copied and pasted; the numbers shown inside comments refer to the notes below each example script.

Use of two CCP4 objects

This example shows the basics of occp4.pm:

use instructions and other magical formulas
How to create two objects
A file generated by the first object and used as input by the second
The destruction of those objects

The debug flag is also illustrated here.


#!/usr/local/bin/perl

use strict;
require "occp4.pm";                                     # 1


{		                                        # 2
my $obj_fft = new occp4 ('fft',
                         'hklin'=>"some_file",
	                 'mapout'=>'CHANA.map');        # 3

$obj_fft->logfile(">>file.log");                        # 4

my ($low_res, $high_res) = (20.0, 2.5);                 # example resolution limits (assumed values)

$obj_fft->keywords('TITLE'=>"some title",
                   'LABI' =>" F1=F1 SIG1=SIGF1 F2=FC PHI=PHIC",
                   'RESO' =>"$low_res $high_res",
                   'SCALE'=>"F1 3.0 0.0 F2 2.0 0.0",
                   'BINMAPOUT'=>' ');                   # 5
die ("something wrong in fft") if ($obj_fft->run());    # 6

my  $obj_ext = new occp4('extend',
                         'MAPIN' =>$obj_fft->iofiles('MAPOUT'),
                         'MAPOUT'=>'outfile.map',
                         'XYZIN' =>"some_file.pdb");    # 7

$obj_ext->debug("RVDF");                                # 8

$obj_ext->keywords('BORDER'=>"5.");
die ("something wrong in extend") if ($obj_ext->run());

}                                                       # 9

1. Magical formulas
2. Start a new block
3. Allocate memory for an object:
- The associated program is fft
- hklin is an existing file
- mapout is a temporary file, whose name is chosen by the object
4. The logfile is appended to the main script log file
5. The ccp4 keywords are entered
6. fft is run; the script is killed in the case of an error
7. Allocate memory for another object:
- The associated program is extend
- the mapin file is the mapout file of the previous fft object
- the mapout file is a permanent one.
8. Set some debugging flags:
- R: A message is printed when the program is run
- V: Every file is verified just before running the program
- D: A message is printed when the object is destroyed
- F: The temporary files are not removed when the object is destroyed.
9. End of the block:
- Both objects are destroyed
- The temporary files (among them the intermediate file created by $obj_fft and read by $obj_ext) are removed.
- The permanent files (the logfile and the last mapout file) are kept.
- The temporary logfile created by $obj_ext is not removed, because the 'F' debug flag was set for this object; however, a message is printed about this file.

Use of xplor and explicit destruction of an object

This example shows:

How to use xplor
How to destroy the objects explicitly.

my $obj_xpl = new occp4 ('xplor.exe');                    # 1
$obj_xpl->input_src("F");                                 # 2
$obj_xpl->input_file("file.inp");	                  # 3
$obj_xpl->logfile(">>file.log");
die "ERROR in xplor" if ($obj_xpl->run()); 
$obj_xpl=0 ;		                                  # 4

1. Allocate memory for a new object:
- the associated program is xplor.exe, which must be present in the path
- There are no logical/physical file names, because xplor does not use the CCP4 library.
2. Select a file as the source of commands for this object: as xplor does not use the CCP4 library, keyword/value pairs are not a relevant way of controlling the program. We prefer to write xplor commands to a file. However, the commands could also have been written to a string.
3. The name of the file containing the xplor commands
4. The $obj_xpl reference is set to 0; because no other reference points to our object, the object is destroyed.

How to write loops

The following example shows:

The allocation of an object inside a loop
The loop is exited when some condition is reached
Deferred destruction of the objects

my $old_obj=0 ;                                                   # 1
while (1) {                                                       # 2
    my $xyzin ;
    if ($old_obj) {
        $xyzin = $old_obj->iofiles("XYZOUT") ;    # file written at the previous iteration
    } else {
        $xyzin = "some_initial_value" ;
    } ;                                                           # 3
    my $obj = new occp4("some_program",                           # 4
                        XYZIN=>$xyzin,
                        XYZOUT=> "CHANA") ;
    # set the keywords, run the program, etc.
    if (some_condition()) {                       # placeholder for your own test
        last ;                                                    # 5
    } else {
        $old_obj = $obj ;                                         # 6
    } ;
} ;                                                               # 7

my $loop_obj = $old_obj;                                          # 8

1. Allocate memory for a reference; no object is pointed to yet.
2. Start a block, specifying an infinite loop
3. The input file is specified as follows:
- Some initial value when entering the loop, or
- A file written at the previous iteration
4. Allocate memory for a new object:
- The input file name was determined in the previous lines.
- The output file is temporary.
5. If the condition is verified, exit from the loop
6. If the condition is not verified, do not exit:
- We set the $old_obj reference so that it points to the last created object. This prevents the destruction of the object referenced by $obj, although we are going to start a new iteration.
- Unless we are executing the first iteration, the object that was pointed to by $old_obj is now destroyed, because no other reference points to it.
7. This iteration is terminated; we exit from the block and start a new iteration. However, the object that was pointed to by $obj is still alive, and is now pointed to by $old_obj.
8. The loop is terminated; $old_obj is the object produced by the penultimate iteration.

VI. Module availability

occp4.pm is available on the IPBS ftp server ftp://ftp.ipbs.fr/pub/occp4. Also available at the same URL is refmac.pl, a Perl rewrite of a script written in ksh by Laurent Maveyraud (refmac.ksh) to perform refmac refinement with bulk solvent correction (in xplor or refmac).
