Crank, Crunch2 and Bp3: A platform for rapid automated structure determination

Steven R. Ness, Irakli Sikharulidze, R.A.G. de Graaff and Navraj S. Pannu

Biophysical Structural Chemistry, Leiden Institute of Chemistry, P.O. Box 9502, 2300 RA Leiden, The Netherlands

Introduction

In recent years, there has been much progress on many fronts in the field of macromolecular crystallography, including automation of structure solution, novel crystallographic algorithms, and the use of massively parallel computational resources. Crank (Ness et al., 2004) is a new suite in macromolecular crystallography that combines these concepts into a self-consistent whole. It aims to help crystallographers solve structures faster and more efficiently, and to let them share their results with the world in a standardized way amenable to future data-mining. Crank has an interface based on the CCP4i (Potterton et al., 2003) toolkit and is designed to integrate into a CCP4 (1994) based workflow, providing an integrated system for performing many diverse kinds of crystallographic experiments. Two standalone programs within the Crank suite are Crunch2 for substructure detection and Bp3 for substructure phasing.

Automated Structure Solution

Due to the large amount of data generated by projects in structural genomics and the increased power in the algorithms used, automated structure solution has been gaining in popularity. Programs that do this usually take merged diffraction data and attempt to perform all the steps in a crystallographic structure determination. One of the first programs to do this was SOLVE (Terwilliger & Berendzen, 1999). By combining expert knowledge, heuristics and various programs, SOLVE can often determine a full protein structure starting from intensity data.

In recent years, many other packages, such as AutoSHARP (Bricogne et al., 2002), CHART (Emsley, 1999), ELVES (Holton & Alber, 2004), BnP (Weeks et al., 2002), Shelx[C/D/E] (Schneider & Sheldrick, 2002), HKL2MAP (Pape & Schneider, 2004), PHENIX (Adams et al., 2004), ARP/wARP (Perrakis et al., 1999) and Auto-Rickshaw (Panjikar et al., 2005), have been developed that use various strategies and subprograms in the pursuit of automated structure solution.

Given good quality data, any of these automated structure solution packages can produce a fully traced protein model. However, in the case of more difficult structures, it may become necessary for the crystallographer to try a variety of different packages, each implementing a different strategy and employing a different user interface. The Crank package provides wrappers for many different crystallographic programs, which allows users to construct their own strategies for solving difficult structures. In addition, Crank can provide wrappers for entire automated structure solution packages, allowing a user to quickly try a variety of different approaches and find the best programs for their particular dataset.

Because a structure solution database is integrated within Crank, a user can reuse substructure solutions or phase estimates across multiple Crank runs. In addition, Crank databases can be exported in a variety of formats, including mmCIF, for efficient deposition and sharing of results.

Novel Algorithms

New crystallographic algorithms and programs appear regularly. For example, with the use of programs such as MLPHARE (Otwinowski, 1991), SHARP (La Fortelle & Bricogne, 1997) and BP3 (Pannu & Read, 2004), along with direct methods in programs such as SHELXD (Schneider & Sheldrick, 2002), SnB (Weeks & Miller, 1999) and Crunch2 (de Graaff et al., 2001), crystallographers are solving previously insoluble crystal structures. Due to their new and often developmental nature, these programs are sometimes difficult to use effectively. By providing a common user interface for all programs and by including large amounts of on-line help, Crank shortens the learning process, enabling users to apply the most recent algorithms to structure solution.

To further this end, Crank is bundled with several new programs in crystallography, including Crunch2 and BP3. Crunch2 is a new direct methods based program that uses the rank reduction of Karle-Hauptman matrices to determine phases. Whereas most other direct methods programs for macromolecular crystallography use triplet and quartet relationships, the algorithm used in Crunch2 implicitly utilizes the higher-order phase relationships that are inherent in the rank reduction algorithm. Crunch2 has been shown to produce high quality solutions in the case of difficult structures. BP3 is a novel program for substructure refinement and phasing. It employs a multivariate approach and has been shown to outperform the leading substructure refinement programs in some test cases (Pannu & Read, 2004; Ness et al., 2004).

Crystallographic standards

In macromolecular crystallography, there is a wealth of different programs to help with various aspects of the process of structure solution. Some of these programs are standalone and accept input in either their own special format or in a more standard format. Others are part of a larger suite like CCP4. In these suites there is often a harmonization of program input and output. In the CCP4 suite, the common input formats are a keyword-based command script, run from an executable shell script, and the MTZ reflection file format. The output formats are MTZ files and logfiles.

With common and standardized inputs and outputs it becomes easier for the user to take the output from one program and convert it into input for a second program. CCP4i has further helped in this process by giving the user a standard graphical interface with which to build command scripts. Similar graphical interfaces also make it easier for users to run programs. In addition, the Crank CCP4i interface provides tooltips and other hints for the user, helping them determine the best parameters for their particular project. These kinds of interfaces have proven very useful and are now common in many projects, such as AutoSHARP, BnP, PHENIX, CHART and ELVES.

Grid Computation

In the last few years there have been attempts to come up with ways to enable novices to do massively parallel and distributed computational work. Cluster computing, in which large numbers of commodity computers are joined together by a queuing system to provide large amounts of computing power, has been popular for the last ten years. However, these efforts have often been stymied by issues of security, reliability, portability and ease of use. Grid computing is a new concept that combines many of the lessons learned in cluster computing into a single solution. The Grid uses XML as a common language and provides interfaces to help solve the many problems inherent in running programs over a worldwide network of computers.

Because Crank is built upon the same framework of XML and SOAP that the Grid is designed around, Crank integrates smoothly with the Grid. In fact, the entire Crank architecture can be envisioned as a subset of the Grid, with the individual subsections of Crank functioning like separate web services. This kind of architecture can allow truly global computation to take place. In the future, individual program authors could simply deploy their novel crystallographic algorithms as Grid web services, easily accessible by anyone in the world.

Crank takes all these diverse threads and combines them together in a flexible and extensible framework. Crank does not aim to replace existing programs or suites, but rather to combine them into a self-consistent whole, making it possible for users to run any combination of programs or suites in any order.

Crank Architecture

Crank is a loosely associated collection of programs that communicate via XML. In designing Crank, we held to the old UNIX maxim of "small programs, weakly interlinked", where each program does a specific task, and the programs communicate via a simple common language. Whereas most other packages communicate via language specific mechanisms, for instance Python Pickle objects, Crank instead stores all of its data in XML. This allows programmers to write their applications in the language most appropriate for them. To this end, we have written Crank in a variety of different languages including Tcl, Python, C, C++, FORTRAN, Bourne shell, C shell and Java. By writing various parts of Crank in these different languages, we try to ensure that the external XML based data structures we design are sufficiently portable to be used by other developers in their own projects and are suitable for future data-mining activities.

This philosophy has served us well: because the different aspects of Crank are weakly coupled, it is quite simple to add new programs to Crank. Programs have input and output, and sometimes they generate error messages. To add a program to Crank, one first identifies the different kinds of input to the program. There are two broad categories of input: dataset specific and program specific. Dataset specific information includes sequence data, information about the macromolecule, and reflection data. Program specific information is simply instructions to the program, for example the number of cycles of refinement to run. Program output can also be thought of in two broad categories: modified reflection data and logfile information.

In adding a program to Crank, we identify all the program inputs and outputs and write small conversion programs to change the program input and output into a standard format. By putting these small wrappers around all the various programs, we effectively turn all programs into standardized building blocks, out of which any arbitrary crystallographic experiment can be constructed.

Crank XML

Crank uses XML for all inter-program communication and also for long term database storage and archiving of information. In order to provide for the rapid addition of new programs, Crank Input XML data structures are simply program keywords, transliterated into XML. For example, the following Crunch2 run script:

 
#!/bin/sh 
   
crunch2 HKLIN crank.drear MODELIN coordinaten.xyz HITS crunch2.out.hits << END 
TRY 1 10 
NCYC 400 
ICOO 1 
CELL 17.8280 31.4450 44.0110 90.0000 90.0000 90.0000 
SYMM 19 
NATOM 10 
SCATT 15 
END 

was generated from the following XML files:

Crank Input XML
 
<crank> 
  <soap> 
    <run> 
      <job id="2">Crunch2 
        <name>Crunch2</name> 
        <tag>2_CRUNCH2</tag> 
        <input> 
           <coords></coords> 
           <evalue_columns> 
              <fa>1_AFRO_FA</fa> 
              <sigfa>1_AFRO_SIGFA</sigfa> 
              <e>1_AFRO_E</e> 
              <sige>1_AFRO_SIGE</sige> 
           </evalue_columns> 
        </input> 
        <output> 
           <coords>2_CRUNCH2.coords</coords> 
        </output> 
        <crunch2> 
           <pmf>1</pmf> 
           <scattering_power>15</scattering_power> 
           <max_resolution></max_resolution> 
           <min_atoms>3</min_atoms> 
           <ntrials>10</ntrials> 
           <ncycles>400</ncycles> 
           <pmf> 
             <npatt>1</npatt> 
             <max_resolution>4.0</max_resolution> 
          </pmf> 
        </crunch2> 
      </job> 
    </run> 
  </soap> 
</crank> 

Crank MTZ XML

<crank>
  <dataset_info>
    <cell>
      <cell_a>17.8280</cell_a>
      <cell_b>31.4450</cell_b>
      <cell_c>44.0110</cell_c>
      <cell_alpha>90.0000</cell_alpha>
      <cell_beta>90.0000</cell_beta>
      <cell_gamma>90.0000</cell_gamma>
    </cell>
    <spacegroup>
      <number>19</number>
      <lattice>P</lattice>
      <operator id="0">21</operator>
      <operator id="1">21</operator>
      <operator id="2">21</operator>
    </spacegroup>
    <n_symops>4</n_symops>
    <symmetry_operator id="0">
      <aa>1</aa>  <ab>0</ab>  <ac>0</ac>    <atrans>0.0000000</atrans>
      <ba>0</ba>  <bb>1</bb>  <bc>0</bc>    <btrans>0.0000000</btrans>
      <ca>0</ca>  <cb>0</cb>  <cc>1</cc>    <ctrans>0.0000000</ctrans>
    </symmetry_operator>
    <symmetry_operator id="1">
      <aa>-1</aa> <ab>0</ab>  <ac>0</ac>    <atrans>0.5000000</atrans>
      <ba>0</ba>  <bb>-1</bb> <bc>0</bc>    <btrans>0.0000000</btrans>
      <ca>0</ca>  <cb>0</cb>  <cc>1</cc>    <ctrans>0.5000000</ctrans>
    </symmetry_operator>
    <symmetry_operator id="2">
      <aa>1</aa>  <ab>0</ab>  <ac>0</ac>    <atrans>0.5000000</atrans>
      <ba>0</ba>  <bb>-1</bb> <bc>0</bc>    <btrans>0.5000000</btrans>
      <ca>0</ca>  <cb>0</cb>  <cc>-1</cc>   <ctrans>0.0000000</ctrans>
    </symmetry_operator>
    <symmetry_operator id="3">
      <aa>-1</aa> <ab>0</ab>  <ac>0</ac>    <atrans>0.0000000</atrans>
      <ba>0</ba>  <bb>1</bb>  <bc>0</bc>    <btrans>0.5000000</btrans>
      <ca>0</ca>  <cb>0</cb>  <cc>-1</cc>   <ctrans>0.5000000</ctrans>
    </symmetry_operator>
  </dataset_info>
</crank>
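
The translation from XML back to program keywords can be illustrated with a short Python sketch. It is only an illustration, not Crank's own converter: the function name keyword_lines and the KEYWORD_MAP table are hypothetical, the XML is abbreviated from the listings above, and the pairing of ncycles with NCYC and scattering_power with SCATT is an assumption inferred from the matching values in the run script.

```python
import xml.etree.ElementTree as ET

# Abbreviated copies of the two XML listings shown above.
INPUT_XML = """
<crank><soap><run>
  <job id="2">
    <name>Crunch2</name>
    <crunch2>
      <scattering_power>15</scattering_power>
      <ntrials>10</ntrials>
      <ncycles>400</ncycles>
    </crunch2>
  </job>
</run></soap></crank>
"""

MTZ_XML = """
<crank><dataset_info>
  <cell>
    <cell_a>17.8280</cell_a><cell_b>31.4450</cell_b><cell_c>44.0110</cell_c>
    <cell_alpha>90.0000</cell_alpha><cell_beta>90.0000</cell_beta>
    <cell_gamma>90.0000</cell_gamma>
  </cell>
  <spacegroup><number>19</number></spacegroup>
</dataset_info></crank>
"""

# Assumed mapping from XML element names to program keywords (hypothetical).
KEYWORD_MAP = {"ncycles": "NCYC", "scattering_power": "SCATT"}


def keyword_lines(input_xml, mtz_xml):
    """Build keyword lines from program-specific and dataset-specific XML."""
    program = ET.fromstring(input_xml).find("./soap/run/job/crunch2")
    dataset = ET.fromstring(mtz_xml).find("./dataset_info")

    # Program-specific keywords come from the Input XML.
    lines = [f"{keyword} {program.findtext(tag)}"
             for tag, keyword in KEYWORD_MAP.items()]

    # Dataset-specific keywords (cell and space group) come from the MTZ XML.
    cell = [dataset.findtext(f"cell/cell_{name}")
            for name in ("a", "b", "c", "alpha", "beta", "gamma")]
    lines.append("CELL " + " ".join(cell))
    lines.append("SYMM " + dataset.findtext("spacegroup/number"))
    return lines


if __name__ == "__main__":
    print("\n".join(keyword_lines(INPUT_XML, MTZ_XML)))
```

Keeping the mapping in a small table rather than in code reflects the idea that Input XML elements are program keywords transliterated into XML, so supporting a new keyword should only require a new table entry.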

Because there are so many different possible types of XML output, Crank does not try to impose its own standards on the XML output by programs, but rather maintains a database of program output and how to convert program output to standard Crank Output XML. In the case of a program that already has XML output, XSLT (Extensible Stylesheet Language Transformations) documents are used to convert program output to Crank XML. In the case of programs without XML output, program logfiles are converted to XML by small utility programs. After transforming the program output into standard Crank XML, Crank can then access and use this program output in subsequent "decision" steps in the Crank pipeline.
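
A small logfile-to-XML utility of the kind described above might look like the following Python sketch. The logfile excerpt and the regular expression are invented for illustration and do not reproduce the real BP3 log format; the function name log_to_xml is hypothetical, and the element names (job, program, luzzati) follow the Crank Output XML example given later in this article.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical logfile excerpt; a real program log will be formatted differently.
LOG = """\
Cycle 12 finished
Luzzati parameter 0: 0.3000
Luzzati parameter 1: 0.3000
Luzzati parameter 2: 0.3000
"""

# One match per line of the assumed form "Luzzati parameter <index>: <value>".
LUZZATI_RE = re.compile(r"^Luzzati parameter (\d+):\s*([0-9.]+)", re.MULTILINE)


def log_to_xml(log_text, job_id, program):
    """Extract values from a logfile and return them as an XML string."""
    job = ET.Element("job", id=str(job_id))
    ET.SubElement(job, "program").text = program
    for index, value in LUZZATI_RE.findall(log_text):
        # Keep the value as text so the original formatting is preserved.
        ET.SubElement(job, "luzzati", id=index).text = value
    return ET.tostring(job, encoding="unicode")


if __name__ == "__main__":
    print(log_to_xml(LOG, 3, "bp3"))
```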

Designing an experiment in Crank

Most other automated pipelines in crystallography rely on a combination of built-in expert knowledge and heuristics. This expert knowledge is often of very high quality, as in the SOLVE, BnP, AutoSHARP, ELVES and CHART pipelines. However, as the plurality of pipelines in existence suggests, there are different decisions possible at each stage, and each pipeline will often choose a different strategy. In Crank, we instead create a toolbox that allows a scientist to design their own custom pipeline, with different programs run at each step and multiple paths that can be chosen depending on program output from a previous step. In addition, by utilizing the parallel nature of the Grid, multiple programs can be run at once, and subsequent decisions can be made based on the best available knowledge at the time. This allows Crank to be both flexible and fast.

There are many pre-programmed strategies built into Crank for solving structures, and in addition, a user can build their own custom strategies.

To build a strategy, the user first starts the Crank CCP4i interface and is presented with an empty canvas containing only a section for protein and dataset information. A program or decision is chosen from the drop down menu and added to the Crank canvas by pressing "Add Program or Decision". The user can then edit the default parameters for the program being run. This process is repeated until a full experiment is built. An example pipeline for a MAD experiment could be:

Step  Program                                                        Description
0     SCALEIT (Evans, P.R., Dodson, E.J. & Dodson, R., unpublished)  Relative scaling of datasets
1     AFRO (Pannu, in preparation)                                   Calculate E-values
2     Crunch2 (de Graaff et al., 2001)                               Determine heavy atom positions
3     BP3 (Pannu et al., 2003)                                       Refine heavy atom positions and output phases
4     SOLOMON (Abrahams & Leslie, 1996)                              Density modification
5     DM (Cowtan, 1994)                                              Density modification
6     RESOLVE (Terwilliger, 2003)                                    Model building
7     REFMAC (Murshudov et al., 1997)                                Model refinement

The output of all these programs is stored in an experiment specific Crank Output XML database. For example, the Luzzati parameters from the BP3 step would be stored as:

 
<job id="3"> 
   <program>bp3</program> 
   <luzzati id="0">0.3000</luzzati> 
   <luzzati id="1">0.3000</luzzati> 
   <luzzati id="2">0.3000</luzzati> 
</job> 

In addition to program steps, a user can insert a decision step into an experiment. Decisions take the form of a familiar IF..THEN statement, where the variable referred to in the IF part of the statement is a value that has been stored in the Crank experiment database as the output from one of the programs that have been run.
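
As a minimal sketch of how such an IF..THEN decision could be evaluated against the output database, consider the following Python fragment. The function decide, the enclosing crank root element, the 0.5 threshold and the choice of follow-up steps are all assumptions made for illustration; they are not taken from the Crank implementation.

```python
import operator
import xml.etree.ElementTree as ET

# Output database fragment; the <crank> wrapper element is an assumption.
OUTPUT_XML = """
<crank>
  <job id="3">
    <program>bp3</program>
    <luzzati id="0">0.3000</luzzati>
    <luzzati id="1">0.3000</luzzati>
    <luzzati id="2">0.3000</luzzati>
  </job>
</crank>
"""

COMPARATORS = {"<": operator.lt, "<=": operator.le,
               ">": operator.gt, ">=": operator.ge}


def decide(database_xml, job_id, tag, comparator, threshold, then_step, else_step):
    """IF mean(<tag> values of job) <comparator> threshold THEN then_step ELSE else_step."""
    root = ET.fromstring(database_xml)
    values = [float(element.text)
              for element in root.findall(f"./job[@id='{job_id}']/{tag}")]
    if not values:
        raise LookupError(f"no <{tag}> values stored for job {job_id}")
    mean = sum(values) / len(values)
    return then_step if COMPARATORS[comparator](mean, threshold) else else_step


if __name__ == "__main__":
    # Hypothetical rule: continue to density modification if the mean stored
    # value is below 0.5, otherwise repeat the substructure search.
    print(decide(OUTPUT_XML, 3, "luzzati", "<", 0.5, "SOLOMON", "Crunch2"))
```

Because the decision only reads values from the XML database, it does not need to know which program wrote them or in which language that program's wrapper was written.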

Because of the simplicity of this approach, and its similarity to the way a scientist thinks about a real experiment, this methodology can be quite a powerful way to solve a structure.

In addition, this approach works quite well with the Grid model of computation, in that multiple experimental pipelines can be run in parallel, with decision steps able to take the current best solution. For example, multiple SHELXD jobs can be started with different resolution limits and as these jobs complete, subsequent programs that need substructure information can take the highest scoring solution as their starting point.
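
The idea of launching several jobs with different resolution limits and keeping the highest scoring solution can be sketched in Python as follows. This is not Grid code: a local thread pool stands in for distributed execution, run_substructure_search is a hypothetical placeholder that returns a made-up score instead of running a real program, and the resolution limits are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_substructure_search(resolution_limit):
    """Placeholder for one substructure-detection job (hypothetical).

    A real wrapper would launch the program with this resolution limit and
    parse its score from the output; here the score is invented so that the
    sketch is runnable.
    """
    score = 100.0 - 10.0 * abs(resolution_limit - 3.5)
    return {"resolution": resolution_limit, "score": score}


def best_solution(resolution_limits):
    """Run all jobs concurrently and track the best solution as jobs finish."""
    best = None
    with ThreadPoolExecutor(max_workers=len(resolution_limits)) as pool:
        futures = [pool.submit(run_substructure_search, limit)
                   for limit in resolution_limits]
        for future in as_completed(futures):
            result = future.result()
            # 'best' always holds the current highest scoring solution, so a
            # downstream step could pick it up before all jobs have completed.
            if best is None or result["score"] > best["score"]:
                best = result
    return best


if __name__ == "__main__":
    print(best_solution([3.0, 3.5, 4.0]))
```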

Test cases - Fast phasing

As well as improving the convergence radius of structure solution, developers have also attempted to obtain solutions quickly. Crank makes it possible to test and implement different protocols, one of which we present below. To obtain an answer from Bp3 quickly, we can perform a limited number of cycles of refinement of the error and atomic parameters and output the corresponding best phase and phase probability distribution. This procedure is invoked in Bp3 with the "PHASe" keyword. We report tests comparing it with the default Bp3 procedure. Table 1 shows the details of the data sets used.

Table 1

Molecule                                Experiment  Substructure  f'' (approx)  Reference
C. acidurici ferredoxin                 SAD         8 Fe          1.25          (Dauter et al., 1997; 2002)
Carbohydrate binding module             SAD         4 Se          5.4           (Boraston et al., 2003; Dodson, 2003)
DNA oligomer (CGCGCG)2                  SAD         10 P          0.434         (Dauter & Adamiak, 2001; Dauter et al., 2002)
Human acyl-protein thioesterase         SAD         22 Br         5             (Devedjiev et al., 2000; Dauter et al., 2002)
E. coli thioesterase II                 SAD         8 Se          5.4           (Li et al., 2000; Dauter et al., 2002)
Lysozyme (high redundancy)              SAD         10 S, 8 Cl    0.56          (Dauter et al., 1999; Dauter et al., 2002)
Pseudomonas serine carboxyl proteinase  SAD         9 Br          5             (Dauter et al., 2001; Dauter et al., 2002)
Calcium subtilisin                      SAD         3 Ca          1.28          (Betzel et al., 1988; Dauter et al., 2002)
MutS binding to G-T mismatch            SAD         45 Se         5             (Lamers et al., 2000)
Lysozyme (low redundancy)               SAD         10 S, 2 Cl    0.56          (Weiss, 2001)

To perform the phasing in Bp3, the substructure that was input was determined with Crunch2, using DREAR (Blessing & Smith, 1999) for FA value calculation. Results from the fast phasing (PHASe protocol) and default protocol are shown in Table 2.

Table 2: Statistics for substructure refinement and phasing in BP3 using two different protocols.

                                        PHASe protocol                                     Default protocol
Molecule                                Figure of merit  cos(Phase error)  Time (min:sec)  Figure of merit  cos(Phase error)  Time (min:sec)
C. acidurici ferredoxin                 0.39             0.53              1:30            0.56             0.54              9:52
Carbohydrate binding module             0.24             0.20              1:02            0.25             0.19              6:31
DNA oligomer (CGCGCG)2                  0.43             0.52              0:10            0.57             0.54              1:14
Human acyl-protein thioesterase         0.34             0.32              1:18            0.36             0.33              22:50
Lysozyme (high redundancy)              0.42             0.43              1:39            0.48             0.44              14:22
E. coli thioesterase II                 0.45             0.36              1:44            0.48             0.36              12:12
Pseudomonas serine carboxyl proteinase  0.30             0.12              1:07            0.26             0.12              30:34
Calcium subtilisin                      0.22             0.31              0:23            0.33             0.31              3:04
MutS binding to G-T mismatch            0.43             0.36              7:05            0.48             0.39              204:52
Lysozyme (low redundancy)               0.33             0.33              0:39            0.40             0.37              11:06

From Table 2, the general trend is that the PHASe protocol produces solutions of similar quality significantly faster than the default protocol: in these tests its run times are roughly 6 to 29 times shorter. However, the default protocol produces better phase estimates (as judged by a higher cosine of the phase error) and more accurate phase probability statistics (as judged by the agreement between the figure of merit and the cosine of the phase error). These small differences may be important for structure solution with smaller signals and worse phase quality. However, for large signals, the PHASe protocol may be sufficient to generate a completely built model efficiently. A thorough analysis will be performed in the future for obtaining the most complete model automatically and efficiently.
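
The timing comparison in Table 2 can be reproduced with a few lines of Python. The rows below are copied from Table 2; the helper names to_seconds and summarize are ours, and the two "gap" columns simply report the absolute difference between the figure of merit and the cosine of the phase error for each protocol, as one possible way to quantify their agreement.

```python
def to_seconds(min_sec):
    """Convert a 'minutes:seconds' string such as '204:52' to seconds."""
    minutes, seconds = min_sec.split(":")
    return int(minutes) * 60 + int(seconds)


# Rows copied from Table 2:
# (molecule, PHASe FOM, PHASe cos, PHASe time, default FOM, default cos, default time)
TABLE_2 = [
    ("C. acidurici ferredoxin", 0.39, 0.53, "1:30", 0.56, 0.54, "9:52"),
    ("Carbohydrate binding module", 0.24, 0.20, "1:02", 0.25, 0.19, "6:31"),
    ("DNA oligomer (CGCGCG)2", 0.43, 0.52, "0:10", 0.57, 0.54, "1:14"),
    ("Human acyl-protein thioesterase", 0.34, 0.32, "1:18", 0.36, 0.33, "22:50"),
    ("Lysozyme (high redundancy)", 0.42, 0.43, "1:39", 0.48, 0.44, "14:22"),
    ("E. coli thioesterase II", 0.45, 0.36, "1:44", 0.48, 0.36, "12:12"),
    ("Pseudomonas serine carboxyl proteinase", 0.30, 0.12, "1:07", 0.26, 0.12, "30:34"),
    ("Calcium subtilisin", 0.22, 0.31, "0:23", 0.33, 0.31, "3:04"),
    ("MutS binding to G-T mismatch", 0.43, 0.36, "7:05", 0.48, 0.39, "204:52"),
    ("Lysozyme (low redundancy)", 0.33, 0.33, "0:39", 0.40, 0.37, "11:06"),
]


def summarize(rows):
    """Return speedup and |FOM - cos(phase error)| per molecule and protocol."""
    summary = {}
    for name, p_fom, p_cos, p_time, d_fom, d_cos, d_time in rows:
        summary[name] = {
            "speedup": to_seconds(d_time) / to_seconds(p_time),
            "phase_gap": abs(p_fom - p_cos),
            "default_gap": abs(d_fom - d_cos),
        }
    return summary


if __name__ == "__main__":
    for name, stats in summarize(TABLE_2).items():
        print(f"{name:40s} speedup {stats['speedup']:5.1f}x  "
              f"gap PHASe {stats['phase_gap']:.2f}  "
              f"gap default {stats['default_gap']:.2f}")
```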

Conclusion

By combining existing concepts into a single suite, Crank allows crystallographers to use the most advanced programs with a common user interface and allows these programs to run efficiently on a world-wide cluster of computers. This enables the crystallographer to try a variety of different hypotheses in a time efficient manner, speeding up the process of structure determination. By exporting the output from various programs into a common format, Crank helps users to deposit and share their data more efficiently. In the future, this self-consistent data can be data-mined by the scientific community, adding value to the already valuable information produced by the crystallographic community.

Acknowledgements and Program Availability

We would like to thank the people who contributed datasets and the users who have downloaded the pre-beta versions of Crank, Crunch2 and Bp3. Version 0.8 of Crank, which is suitable for SAD or SIRAS experiments, is available from the web site http://www.bfsc.leidenuniv.nl/software/crank/ and will be available via CCP4. Version 1.0 of Crank, with support for MAD phasing, will also be available from our web site and via CCP4 in the future. Funding for this research was provided by the Netherlands Organization for Scientific Research (http://www.nwo.nl).

References

Abrahams, J. P. & Leslie, A. G. W. (1996). Acta Cryst. D52, 30-42.

Adams, P.D., Gopal, K., Grosse-Kunstleve, R.W., Hung, L.-W., Ioerger, T.R., McCoy, A.J., Moriarty, N.W., Pai, R.K., Read, R.J., Romo, T.D., Sacchettini, J.C., Sauter, N.K., Storoni, L.C. & Terwilliger, T.C. (2004). J. Synchrotron Rad. 11, 53-55.

Betzel, C., Dauter, Z., Dauter, M., Ingelman, M., Papendorf, G., Wilson, K.S. & Branner, S. (1988). J. Mol. Biol. 204, 803-804.

Blessing, R.H. & Smith, G.D. (1999). J. Appl. Cryst. 32, 664-670.

Boraston, A. B., Revett, T. J., Boraston, C. M., Nurizzo, D. & Davies, G. J. (2003). Structure 11, 665-675.

Bricogne, G., Vonrhein, C., Paciorek, W., Flensburg, C., Schiltz, M., Blanc, E., Roversi, P., Morris, R. & Evans, G. (2002). Acta Cryst. A58, C239.

Collaborative Computational Project, Number 4. (1994). Acta Cryst. D50, 760-763.

Cowtan, K. (1994). Joint CCP4 and ESF-EACBM Newsletter on Protein Crystallography, 31, 34-38.

Dauter, Z., Wilson, K.S., Sieker, L. C., Meyer, J. & Moulis, J.M. (1997). Biochemistry, 36, 16065-16073.

Dauter, Z., Dauter, M., de La Fortelle, E., Bricogne, G. & Sheldrick, G. M. (1999). J. Mol. Biol. 289, 83-92.

Dauter, Z. & Adamiak, D. A. (2001). Acta Cryst. D57, 990-995.

Dauter, Z., Li, M. & Wlodawer, A. (2001). Acta Cryst. D57, 239-249.

Dauter, Z., Dauter, M. & Dodson, E. (2002). Acta Cryst. D58, 494-506.

Devedjiev, Y., Dauter, Z., Kuznetsov, S.R., Jones, T. L. Z. & Derewenda, Z. S. (2000). Structure Fold Des., 8(11), 1137-1465.

Dodson, E. (2003). Acta Cryst. D59, 1958-1965.

Emsley, P. (1999). CCP4 Newsletter on Protein Crystallography. 36.

de Graaff, R. A. G., Hilge, M., van der Plas, J. L. & Abrahams, J. P. (2001). Acta Cryst. D57, 1857-1862.

La Fortelle, E. de & Bricogne, G. (1997). Methods Enzymol. 276, 472-494.

Lamers, M. H., Perrakis, A., Enzlin, J. H., Winterwerp, H. H. K., de Wind, N. & Sixma, T. (2000). Nature 407, 711-717.

Li, J., Derewenda, U., Dauter, Z., Smith, S. & Derewenda, Z. S. (2000). Nature Struct. Biol. 7, 555-559.

Holton, J. M. & Alber, T. (2004). Proc. Natl. Acad. Sci. (USA) 101, 1537-1542.

Murshudov, G.N., Vagin, A.A. & Dodson, E.J. (1997). Acta Cryst. D53, 240-255.

Ness, S.R., de Graaff, R.A.G., Abrahams, J.P. & Pannu, N.S. (2004). Structure 12, 1753-1761.

Otwinowski, Z. (1991). Proceedings of the CCP4 Study Weekend. Isomorphous Scattering and Anomalous Replacement. Edited by Wolf, W., Evans, P.R. & Leslie, A.G.W., pp. 80-86. Warrington: Daresbury Laboratory.

Panjikar, S., Parthasarathy, V., Lamzin, V. S., Weiss, M. S. & Tucker, P. A. (2005). Acta Cryst. D61, 449-457.

Pannu, N.S. & Read, R.J. (2004). Acta Cryst. D60, 22-27.

Pannu, N. S., McCoy, A. J. & Read, R. J. (2003). Acta Cryst. D59, 1801-1808.

Pape, T. & Schneider, T.R. (2004). J. Appl. Cryst. 37, 843-844.

Potterton, E., Briggs, P., Turkenburg, M. & Dodson, E.J. (2003). Acta Cryst. D59, 1131-1137.

Perrakis, A., Morris, R. J. & Lamzin, V. S. (1999). Nature Struct. Biol. 6, 458-463.

Schneider, T.R. & Sheldrick, G.M. (2002). Acta Cryst. D58, 1772-1779.

Terwilliger, T.C. & Berendzen, J. (1999). Acta Cryst. D55, 849-861.

Terwilliger, T.C. (2003). Acta Cryst. D59, 1174-1182.

Weeks, C.M. & Miller, R. (1999). J. Appl. Cryst. 32, 120-124.

Weeks, C.M., Blessing, R.H., Miller, R., Mungee, R., Potter, S.A., Rappleye, J., Smith, G.D., Xu, H. & Furey, W. (2002). Z. Kristallogr. 217, 686-693.

Weiss, M.S. (2001). J. Appl. Cryst. 34, 130-135.