Macromolecular Phasing by ShakeandBake
Charles M. Weeks1 and Russ Miller1,2
1HauptmanWoodward Med. Res. Inst., Inc., 73 High St., Buffalo, NY 14203 USA
2SUNYBuffalo, Dept. of Comp. Sci., Buffalo, NY 14260 USA
weeks@hwi.buffalo.edu and miller@hwi.buffalo.edu
Abstract. ShakeandBake is a direct methods procedure which has provided ab initio solutions for protein structures containing as many as 600 independent nonH atoms, provided that goodquality diffraction data are available to 1.1Å resolution. Its ultimate potential is unknown. The ShakeandBake algorithm extends the range of conventional direct methods by repetitively, unconditionally, and automatically alternating reciprocalspace phase refinement with filtering in real space to impose constraints. An extensive web site for SnB, a computer program implementing ShakeandBake, can be found at URL http://www.hwi.buffalo.edu/SnB.
1 Introduction
The majority of smallmolecule organic crystal structures having fewer than 100 independent nonH atoms are solved using reciprocalspace direct methods. Conventional direct methods begin to fail in the 100200 atom range because the accuracy of the underlying probabilistic phase relationships decreases as the size of the structure increases. Fortunately, however, the size of molecular structures amenable to phasing by direct methods can be increased significantly if these methods are augmented by the imposition of physicallymeaningful constraints in real space. Realspace phaseimprovement methods, commonly known as densitymodification methods, are widely used in macromolecular crystallography (see review by Podjarny et al., 1987). The potential for realspace constraints to improve phases in the context of smallmolecule direct methods was recognized by Jerome Karle (1968) who found that even a relatively small, chemicallysensible fragment extracted by manual interpretation of an electron density map could be parlayed into a complete solution by transformation back to reciprocal space and then performing additional iterations of phase refinement. The tremendous increases in computer speed in recent years have made it feasible to consider repeatedly cycling many trial structures backandforth between real and reciprocal space, while performing optimization alternately in each space. This computeintensive process, which requires the use of two Fourier transforms during each cycle, forms the basis of the synergistic Shake (phase refinement) and Bake (density modification) procedure in which the power of reciprocalspace phase refinement is automatically augmented by filtering to impose the phase constraints implicit in real space (Weeks et al., 1994).
Getting Started. Practitioners of conventional direct methods handle the problem of beginning a structure determination when no atomic positions are known by adopting a 'multisolution' approach in which multiple sets of trial phases are evaluated either symbolically (Karle & Karle, 1966) or numerically (Germain & Woolfson, 1968), with probable correct set(s) determined by ranking according to a suitable figureofmerit. Since solving very large structures requires a large initial set of presumedknown phases, it has been found advantageous to generate trial phase sets by using a randomnumber generator to assign values to many or all of the required phases (Baggio et al., 1978: Yao, 1981). In the ShakeandBake procedure, phases are assigned initial values by generating trial structures consisting of randomly positioned atoms (thereby imposing an atomicity constraint from the outset) and then computing structure factors.
Foundations of Phase Refinement. Direct methods are based on the fact that there exist linear combinations of phases, called structure invariants, whose values, in principle, depend only on the magnitudes of the normalized structure factors,
(1)
where rj is the position vector of one of the N atoms, assumed identical, in the unit cell. The conditional probability distributions of the structure invariants permit individual invariant values to be estimated as first proposed by Hauptman and Karle (1953). The most useful phase relationships are the threephase or triplet invariants,
(2)
which have the associated parameters (or weights),
(3)
Ab initio phase determination involves the derivation of individual phase values from a set of triplets having a sufficiently large triplet:phase ratio (e.g., 10:1). The tangent formula,
(4)
(Karle & Hauptman, 1956) provides a simple, but highly effective means for extracting phase values from the triplets. If several pairs of phases, K and HK, and their associated are known, equation (4) can be used to determine the most probable value for H. Phase expansion and/or refinement in reciprocal space is accomplished through successive applications of this relationship. The tangent formula, in either its original or a weighted form, is at the heart of widelyused conventional multisolution phasing programs, including MULTAN (Main et al., 1980) and SHELXS (Sheldrick, 1984), which refine multiple sets of trial phases by making many iterations or passes through the phase list.
Minimal Function. Minimization of an objective function like the minimal function,
(5)
(Debaerdemaeker & Woolfson, 1983; Hauptman, 1991; DeTitta et al., 1994) provides an alternative approach to phase refinement. R() is a measure of the meansquare difference between the values of the triplets calculated using a particular set of phases and their expected values as given by the ratio of modified Bessel functions, and it is expected to have a constrained global minimum when the phases are equal to their correct values for some choice of origin and enantiomorph (the minimal principle). Equation (5) can also be written to include contributions from higherorder (quartet) invariants, but this option has not been shown, within the context of SnB, to be computationally efficient. Experimentation has thus far confirmed that, when the minimal function is used actively in the phasing process and solutions are produced, the final trial structure corresponding to the smallest value of R() is a solution. Therefore, R() is also an extremely useful figureofmerit.
Parameter Shift. Parameter shift (Bhuiya & Stanley, 1963) is a seemingly simple search technique that has proven to be quite powerful as an optimization method when used to reduce the value of the minimal function, provided that appropriate choices of parameter values are made. The phases are considered in decreasing order with respect to the values of the associated 's. When considering a given phase i , the value of the minimal function (Eq. (5)) is initially evaluated three times. First, with the given set of phase assignments, second with phase i modified by the addition of the predetermined phase shift, and third with i modified by the subtraction of the predetermined phase shift. If the first evaluation yields the minimum of these three values of the minimal function, then consideration of i is complete, and parameter shift proceeds to i +1. Otherwise, the direction of search is determined by the modification that yields the minimum value, and the phase is updated to reflect that modification. In this case, phase i continues to be updated by the predetermined phase shift in the direction just determined so long as the value of the minimal function is reduced, though there is a userdefined predetermined maximum number of times that the shift is attempted. Based on extensive experimentation with these and related parameters, involving a variety of structures in several space groups, it has been determined that in terms of running time and percentage of trial structures that produce a solution, an excellent choice of parameters consists of the following: (i) perform a small number of passes through the phase set, (ii) evaluate the phases in order by decreasing values, and (iii) for each phase, perform a maximum of two 90° phase shifts. When the parametershift phase refinement is applied in centrosymmetric space groups, only a single shift of 180° is required for each phase. Surprisingly, higher success rates have been obtained if restricted phases in acentric space groups are treated as general phases (Weeks et al., 1994).
RealSpace Constraints. Automatic realspace electrondensity map interpretation in the ShakeandBake procedure consists of selecting an appropriate number of the largest peaks (typically equal to or less than the expected number of atoms) to be used as an updated trial structure without regard to chemical constraints other than a minimum allowed distance between atoms. If markedly unequal atoms are present, appropriate numbers of peaks (atoms) can be weighted by the proper atomic numbers during transformation back to reciprocal space. Thus, a priori knowledge concerning the chemical composition of the crystal is utilized, but no knowledge of constitution is required or used during peak selection. It is useful to think of peak picking in this context as simply an extreme form of density modification appropriate when atomicresolution data are available. The entire dualspace refinement procedure is repeated for an appropriate number of cycles which have been determined empirically by experimentation with known datasets (Weeks et al., 1994).
2 Methods
The ShakeandBake procedure has been implemented in an efficient and easytouse program, SnB (Miller et al, 1994). Pertinent information concerning SnB including the complete User's Manual may be accessed from the home page on the World Wide Web at URL:http://www.hwi.buffalo.edu/SnB. Standalone UNIX executables for SGI, SUN, IBM, and DEC alpha workstations as well as PC/Linux versions may be downloaded without cost to academic users. SnB has also been incorporated into Molecular Structure Corporation's teXsan package of crystallographic programs, and supercomputer versions have been installed on the Cray T3D and Cray C90 at the Pittsburgh Supercomputing Center, the CM5 at NCSA, and the SP2 at the Cornell Theory Center.
Overview of the SnB Program. The main menu of SnB gives the user the options of (i) generating and processing trial structures to determine a structure, (ii) producing a histogram of R() values for completed trial structures of a previously submitted structuredetermination process, and (iii) displaying the best current trial structure (i.e., lowest R()). A typical application of SnB consists of submitting a structuredetermination process, monitoring the progress of the trial structures by occasionally viewing a histogram of final minimalfunction values and, when a potential solution is identified, examining the geometry of this structure. The running time of the structuredetermination procedure for large, difficult structures requiring many trials is substantial, and the ability to follow conveniently the course of such jobs is essential.
The flow chart presented in Figure 1 illustrates the basic operation of the ShakeandBake process as implemented in SnB. Triplet and (optionally) negativequartet structure invariants,
Figure 1. A flow chart for the ShakeandBake algorithm. Solid lines represent flow of control; double lines show movement of data. ‘Start A’ represents the beginning of a structuredetermination process, and ‘Start B’ indicates the beginning of a session in which the R() histogram and molecular geometry are checked.
as well as the initial coordinates for the trial structures, are generated. Once this information is available, every trial structure is subjected to the following ShakeandBake procedure. Initially, a structurefactor calculation is performed which yields phases corresponding to the trial structure. The associated value of the minimal function, R(), is then computed. At this point, the cyclical ShakeandBake phasing procedure is initiated, as follows. The phases are refined via the tangent formula or by parameter shift so as to reduce the value of R(). These phases are then passed to a Fourier routine which produces an electrondensity map, but no graphical output is produced. Instead, the map is examined by a peakpicking routine which typically finds the n largest peaks (where n is the number of independent nonH atoms in the asymmetric unit) subject to the constraint that no two peaks are closer than a specified distance. These peaks are then considered to be atoms, and the process of structurefactor calculation, phase refinement, and density modification via peak selection is repeated for the predetermined number of ShakeandBake cycles.
For each completed trial structure, the final value of the minimal function is stored in a file, and the histogram routine can be run to determine whether or not a solution appears to be present in the set of completed trial structures. A bimodal distribution with significant separation is a typical indication that solutions are present (as shown in Figure 2), while a unimodal, bellshaped distribution (e.g., Figure 2 with the ‘0.467 to 0.470’ row omitted) typically indicates a set of nonsolutions. Two options permit the user to view the current best structure. The first requires only a characterbased terminal and produces a text plot suitable for printing on a line printer. The user can then manually ‘connect the dots.’ This routine also produces a list of the interpeak distances and angles. The second option makes use of GeomView, a graphical routine developed by the Geometry Center (Center for the Computation and Visualization of Geometric Structures at the University of Minnesota) and suitable for an XWindows environment. These options are included to assist the user in deciding whether a solution has, in fact, been obtained. They are not intended to provide complete visualization, especially for larger structures. It is expected that the coordinates will be input into other graphical programs for more extensive display.
SnB Parameters. The SnB user must supply (i) basic crystal data including space group, cell constants, and the contents of the asymmetric unit and (ii) an input reflection file consisting of h, k, l and the normalized structurefactor magnitudes, . The program will automatically sort this data into descending order by , eliminate systematic absences, and eliminate duplicate reflections. No selection based on s(F) or F/(F) is performed. It is often critical that values be calculated extremely carefully, and Blessing's programs (Blessing et al., 1996) are recommended. Costeffective default values for the control parameters (displayed following each query) are based on experience with several known test structures and are summarized in Table 1. Several parameters depend on structure size and can be expressed as a function of n. The user is free to override these recommendations, if desired.
A few comments concerning the parameters affecting the trial structures are in order. In practice, it is not necessary to use more than 100 randomly positioned atoms as an initial trial structure. During later cycles, choosing n peaks to recycle through the procedure gives optimum success rates for smaller structures. However, for large structures containing a significant number of atoms with low occupancy or high thermal motion, trial structures composed of less than n peaks (e.g., 0.8*n) give better performance. The geometry of trials that are solutions can be improved by EFourier recycling (Sheldrick, 1985), and the user can select the number of such Fourier refinement cycles (i.e., SnB cycles with no phase refinement) and the number of peaks. Also, it is often useful to build, over the course of several cycles, from the number of peaks used during the ShakeandBake stage to the approximate total number of atoms expected in the structure. When atoms with atomic numbers greater than 10 are present, the user has the option of weighting the appropriate number of largest peaks in the structurefactor calculations. Unequal weighting has resulted in accelerated convergence to solution in cases where a small number of sulfur, iron, or chlorine atoms is present.
Table 1. Default parameter values for the
SnB structuredetermination procedure.
Parameter 
Default 


NonH atoms in ASU 
n 


Invariant generation: 

Number of phases 
10n 
Number of triples 
100n 
Number of negative 
0 
quartets 



Starting atoms/random trial 
min (n,100) 


Number of SnB cycles: 

Parameter shift (PS) 
n/2 
refinement or Tangent 
n/4 
formula refinement 



PS phase refinement: 

Size of phase shift 
90° 
Max. number of shifts 
2 
Number of iterations 
1 
Exploit restricted phases? 
No 


Number of peaks to select 
[0.8n,n] 
Exploit heavy atoms? 
Yes 
Number EFourier steps 
0 
Figure 2. A 20bucket histogram of the final minimal function values after 255 cycles for the 624atom Tox II structure. 5000 phases and 50,000 triplet invariants were used. The separation between the single solution and the 1618 nonsolutions is clearly shown.
R(f) Range 
Trials in Range 


0.4670.470 
1* 
0.4710.474 
0 
0.4750.478 
0 
0.4790.482 
0 
0.4830.486 
0 
0.4870.490 
0 
0.4910.494 
0 
0.4950.498 
0 
0.4990.502 
0 
0.5030.506 
0 
0.5070.510 
25* 
0.5110.514 
135*** 
0.5150.518 
386********** 
0.5190.522 
639****************** 
0.5230.526 
390********** 
0.5270.530 
41* 
0.5310.534 
2* 
0.5350.538 
0 
0.5390.542 
0 
The relative efficiency of tangentformula and parametershift phase refinement in ShakeandBake has been compared using known atomicresolution datasets (Weeks et al., 1997). In the case of tangent refinement the minimal function is also computed, but used only as a figureofmerit. Regardless of which refinement method is used, optimization proceeds most rapidly when there is immediate feedback of each refined phase value. In general, the tangent formula solves small structures (<100 atoms) more costeffectively, but the two phaserefinement methods are equally efficient for solving most of the tested structures with more than 100 independent atoms, including crambin. However, only parameter shift has produced recognizable solutions for gramicidin A although another figureofmerit might be more reliable for tangent refinement. In addition, tangentformula costeffectiveness is highly dependent on the number of phaserefinement iterations (i.e., the number of passes through the list of phases) per complete ShakeandBake cycle whereas parameter shift does not exhibit such strong dependency. The number of iterations per cycle must be chosen judiciously if high efficiency is, in fact, to be achieved. This is especially true for structures in space group P1 where it is never advisable to perform more than one iteration of tangent refinement per cycle. Given a fixed number of machine cycles, it is important to consider the tradeoff between the number of trial structures processed and the number of cycles processed per trial structure. Experimentation has shown that, with a phaserefinement technique consisting of a singleiteration, twostep parameter shift of 90°, the point of diminishing returns is at approximately n/2 cycles. Therefore, the program defaults the number of cycles per trial to approximately this value. Overall recommendations for phaserefinement are given in Table 2.
After the dialogue is complete, the user is asked to review the information supplied and make any necessary changes. This information is then stored for use at a later time and for
Table 2. PhaseRefinement Recommendations.
Recommendation 
Method 
Cycles 
SnB Iterations/Cycle) 




n < 100 atoms 
Tangent Formula 
n/4 
4 or 1 (P1) 
n > 100 atoms 
Parameter Shift 
n/2 
1 
Always Safe 
Parameter Shift 
n 
1 
use by the histogram routine. Once a user decides that the parameters are satisfactory, the program automatically initiates the structuredetermination procedure by spawning a batch job.
3 Applications
A list of successful SnB applications to protein structures is given in Table 3. Gramicidin A, crambin, and rubredoxin were previously known test structures resolved at the HauptmanWoodward Institute. The 64residue scorpion toxin (Tox II) had been previously solved, but the number of residues and the amino acid sequence were deliberately withheld from the Buffalo group. The only information supplied was that the protein was composed of approximately 500 atoms and contained four disulfide bonds. The remaining structures (vancomycin, Er1 pheromone, and alpha1 peptide) were previously unknown, and these applications were made in other laboratories without direct involvement by the authors of SnB. All were solved routinely and automatically using essentially default parameters. Success rate (percentage of trial structures going to solution) depends on size and complexity of the structure, resolution and quality of data, the presence of heavier atoms (e.g., S, Cl, Fe), and the space group as well as the number of refinement cycles. Success rate typically decreases as the size of the structure increases or the resolution or data quality decreases. Success rates for structures in P1 are significantly higher than for other space groups, a result which may be related to the fact that the origin position can be chosen arbitrarily in P1.
The application to Tox II was made on a network of SGI R4000 Indigo Workstations with SnB running as a background job for approximately six weeks. One morning, the histogram reproduced in Figure 2 was found during the daily progress check. After detecting that the histogram was now bimodal, the single trial in the 0.467 to 0.470 range was examined, and a conservative model consisting of five fragments and a total of 241 atoms was constructed. Following multiple cycles of Xplor refinement, the residual was 0.16 for 624 nonH atoms (Smith et al., 1996).
It has been known for some time that conventional direct methods can be a valuable tool for locating the positions of heavy atoms using isomorphous E's (Wilson, 1978) and anomalous scatterers using anomalous E's (Mukherjee et al., 1989). Thus, it is no surprise that the ShakeandBake algorithm can be fruitfully applied in this arena as well. The first application of this type was to native and SeMet data for avian sarcoma virus integrase (Bujacz et al., 1995). The four Se atoms were found using 189 ∆E values (>1.76) in the resolution range 20 to 3.7Å. The investigators report that the isomorphous difference Patterson map was impossible to deconvolute without the aid of direct methods.
4 Concluding Remarks
The ultimate potential of the ShakeandBake approach to the ab initio structure determination of macromolecules is unknown. The combination of this technique with increasingly morepowerful computers has recently permitted directmethod solutions in situations regarded as impossible only a few years ago. The combination of ShakeandBake methodology with alternative densitymodification methods and supplemental phasing information from isomorphous replacement and single or multiplewavelength anomalous dispersion may allow equally spectacular advances in the near future.
Table 3. Protein structures solved ab initio by SnB
Structure References 

Vancomycin 
255 
P43212 
0.9Å 
1/4200 
P. Loll, pers. comm. 

Gramicidin A 
317 
P212121 
0.86 
0.25% 
Hauptman, 1995 

Er1 Pheromone 
328 
C2 
1.0 
0.25% 
Anderson et al., 1996 

Crambin 
~400 
P21 
0.83 
23% 
Weeks et al., 1995 

Alpha1 Peptide 
471 
P1 
0.92 
5% 
Prive et al., 1995 

Rubredoxin 
497 
P21 
1.0 
2.7% 
Hauptman, 1995 

Tox II 
624 
P212121 
0.96 
1/1619 
Smith et al., 1996 
Acknowledgments. The ShakeandBake algorithm and the SnB program have been made possible by the financial support of grants GM46733 from NIH and IRI9412415 from NSF. The authors would like to acknowledge the guidance and inspiration provided by Prof. Herbert Hauptman throughout the development of SnB.
References
Anderson, D.H., Weiss, M.S., & Eisenberg, D. Acta Cryst, D52, (1996) 469.
Baggio, R., Woolfson, M.M., Declercq, J.P., & Germain, G. Acta Cryst, A34 (1978) 883.
Bhuiya, A.K. and Stanley, E. Acta Cryst, 16 (1963) 981.
Blessing, R.H., Guo, D.Y., & Langs, D.A. Acta Cryst, D52 (1996) 257.
Bujacz, G., Jaskolski, M., Alexandratos, J., Wlodawer, A., Merkel, G., Katz, R.A., & Skalka, A.M. Jour. Mol. Biol, 253 (1995) 333.
Debaerdemaeker, T. and Woolfson, M.M. Acta Cryst, A39 (1983) 193.
DeTitta, G.T., Weeks, C.M., Thuman, P., Miller, R., & Hauptman, H.A. Acta Cryst, A50 (1994) 203.
Germain, G. and Woolfson, M.M. Acta Cryst, B24 (1968) 91.
Hauptman, H.A. "A Minimal Principle in the Phase Problem", in Crystallographic Computing 5: From Chemistry to Biology, D. Moras, A.D. Podnarny & J.C. Thierry (Eds.), IUCr Oxford Univ. Press, 1991, pp. 324332.
Hauptman, H.A., Acta Cryst, B51 (1995) 416.
Hauptman, H.A. and Karle, J. Solution of the Phase Problem. I. The Centrosymmetric Crystal, ACA Monograph No. 3, Dayton, OH:Polycrystal Book Service (1953).
Karle, J. Acta Cryst, B24 (1968) 182.
Karle, J. and Hauptman, H. Acta Cryst, 9 (1956) 635.
Karle, J. and Karle, I.L. Acta Cryst, 21 (1966) 849.
Miller, R., Gallo, S.M., Khalak, H.G., & Weeks, C.M. J. Appl. Cryst, 27, (1994) 613.
Mukherjee, A. K., Helliwell, J. R., & Main, P. Acta. Cryst, A45, (1989) 715718.
Podjarny, A.D., Bhat, T.N., & Zwick, M. Ann. Rev. Biophys. Biophys. Chem, 16 (1987) 351.
Prive, G., Ogihara, N., Wesson, L., Cascio, D., & Eisenberg, D. Abstr. W008, Proc. Am. Crystallogr. Assoc. Meeting, Montreal (1995).
Sheldrick, G.M. "SHELX84", in Crystallographic Computing 3: Data Collection, Structure Determination, Proteins, and Databases, G M. Sheldrick, C. Kruger & R. Goddard (Eds.), Clarendon Press, Oxford, 1985, pp. 184189.
Smith, G. D., Blessing, R. H., Ealick, S. E., FontecillaCamps, J. C., Hauptman, H. A., Housset, D., Langs, D. A., & Miller, R. Abstr. E1146, IUCr Meeting, Seattle (1996).
Weeks, C.M., DeTitta, G.T., Hauptman, H.A., Thuman, P., & Miller, R., Acta Cryst, A50 (1994) 210.
Weeks, C.M., Hauptman, H.A., Chang, C.S., & Miller, R. "Structure Determination by ShakeandBake with Tangent Refinement", ACA Transactions Symposium 30 (1997).
Weeks, C.M., Hauptman, H.A., Smith, G.D., Blessing, R.H., Teeter, M.M., & Miller, R. Acta Cryst, D51 (1995) 33.
Wilson, K. S. Acta Cryst, B34, (1978) 1599.