Dr. Kevin Cowtan
Structural Biology Laboratory, University of York, UK
Terwilliger (1999) describes a new likelihood-based framework for employing structural information which was previously exploited by means of conventional density modification calculations. The statistical basis of Conventional density modification is dubious at best (Cowtan & Main, 1996), despite the introduction of `fudge-factors' which attempt to correct for the worst sources of bias (Abrahams 1997, Cowtan & Main 1999). Terwilliger's treatment provides a better foundation for the integration of density-modification-like information with those elements of structure solution which have already received a more through statistical treatment.
In this paper a similar treatment is considered as part of a general consideration of the relationships between likelihood functions calculated for structure factors and electron density maps. This provides some new insights into likelihood-based phase improvement.
Suppose the likelihood of a set of structure factors can be represented by the product of likelihoods for the individual structure factors, and that these likelihoods in turn may be represented by Gaussians centred on some expected value for that structure factor. The log-likelihood is then the log of a general n-dimensional Gaussian, which is a sum of quadratic terms in the complex structure factors. The quadratic distribution may be represented in turn by a general quadratic distribution for the log-likelihood based on the electron density across the whole map.
Furthermore, the whole procedure may be reversed. Thus, a likelihood distribution in real or reciprocal space may be represented in the other space by the following series of steps:
Consider an ensemble of sets structure factors, which may be represented by a mean structure factor . Let the distribution of values of F(h) about be represented by a general two-dimensional Gaussian
where and and , and are the Gaussian coefficients obtained by inverting the variance-covariance coefficients for the real and imaginary parts of the reflection..
Let , , , then:
Assuming the reflections are independent, the conditional probability of a unique set of structure factors given a particular model for the magnitude and phase errors may be calculated from the product of the individual likelihoods:
If we take the logarithm of this expression we obtain a log-likelihood, which is a summation over the log-likelihoods for individual reflections::
where
and the summation is now over the unique reflections and their Friedel opposites.
Substituting , :
This expression gives the change in log-likelihood as the density moves away from the likelihood maximum, which in turn is the Fourier transform of the reciprocal space maximum.
How does the log-likelihood of the map vary as the density at a single point in the map is varied? For a map in spacegroup P1, set d(y)=0 for (for other space groups all symmetry related densities must be included).
Note that the density variance, which is inversely proportional to the quadratic coefficient, varies over a cell whose extent is half that of the crystal cell. Suppose that there is a large uncertainty in the magnitude of one reflection, but that its phase is well known. Then the corresponding uncertainty in the electron density will be large wherever the electron density wave corresponding to that reflection has a peak or a trough. Thus the features in the variance map have half the spacing of the features in the electron density map.
Consider an ensemble of density maps which may be represented by a mean density and a variance at each of points distributed throughout the cell, where is the effective number of parameters required to represent the map. If the conditional probabilities for values at each sample point are independent, then the likelihood of any particular map is given by:
If we take the log of this function, we obtain the log-likelihood of the map:
This may be generalised to the continuous electron density by integrating over the cell and re-normalising to correct for the effective number of parameters:
Let and . Then
Replacing d(x) and v(x) with their Fourier transforms D(h) and V(h) we obtain:
How does as we vary a single reflection, i.e. ignoring inter-reflection correlations? Set D(k)=0 for , since the correlation between a reflection and its Friedel opposite may not be ignored. (For spacegroups other than P1, symmetry equivalent reflections should also be included). Then:
Again this expression describes the change in log-likelihood as we move away from the maximum, given by the Fourier transform of the real space maximum.
Let , and expand:
For a general spacegroup, the expression is as follows:
where the summations are over the symmetry equivalent reflections and their opposites.
The terms in and are equivalent to the curvature expressions given by Terwilliger (1999) for a phase of 0 or . The additional cross term allows the semi-major axes of the quadratic function to be oriented in an arbitrary direction in the Argand diagram. This term could easily be incorporated into Terwilliger's expression.
By using the full expression in equation (9) the correlations between reflections may also be calculated.
The obvious application for this mathematics is phase improvement through real-space constraints, as demonstrated by Terwilliger. The new electron density log-likelihood expressions allow a variance to be calculated for the electron density at any position in the map. This variance may be used in Terwilliger's expressions to determine the probability that a map coordinate is in protein or solvent. It may also be used in non-crystallographic symmetry averaging to give an appropriate weight to the information from other copies of the molecule (with an additional term to account for non-isomorphism between the NCS related molecules).
The function V(h) in equation (10), used in calculating structure factor log-likelihoods, is related to the G-function of Hendrickson and Lattman (1970). Since all the minus-log-likelihood curvatures v(x) in real space are positive, V(h) must be dominated by a large origin term, which contributes to every structure factor variance. V(h) must also be limited in extent, since it is the Fourier transform of the low-resolution mask-outline. The use of a continuously varying probability instead of the traditional binary mask by Terwilliger will further limit the extent of V(h), but this is clearly better than including the meaningless high-resolution ripples introduced by a binary mask.
One notable difference between the approach described here and that of Terwilliger is that the density probability distribution is approximated here before changing from real to reciprocal space. As a result, no Taylor series expansion is required: the resulting distribution is correct across the whole of the Argand diagram. The disadvantage of this approach is that some information about the probability distribution of density values is discarded. For example, a bimodal probability distribution for the value of the electron density at some point in the map will be approximated by a single broad distribution. The practical implications of these changes have yet to be investigated. However, for the simple application of solvent flattening, the differences between a fixed width Gaussian plus constant and variable width Gaussian model are likely to be small.
It is now possible to ask what information can be gained from a solvent flatness constraint, by considering the variation in log-likelihood across the whole of the Argand diagram for each reflection. Suppose real space probability distributions are constructed to represent flat density in the solvent region and completely unknown density in the protein. The probability distribution for the electron density at any position in the map is therefore a Gaussian centred on the solvent density, whose width increases with the probability that the density belongs to the protein.
Ignoring cross terms between reflections, the likelihood distribution for each structure factor from this information has its maximum at the origin of the Argand diagram, since the Fourier transform of the flat solvent density is zero for all non-origin reflections. Therefore, in the case of unobserved reflections for which both magnitude and phase are unknown, this information alone cannot predict non-zero values for unobserved reflections. In the case of reflections whose magnitude is known, the Gaussian probability centred on the origin can give rise to at best a symmetrical bimodal phase distribution (i.e. in terms of Hendrickson Lattman coefficients, A=B=0). Combination of this probability distribution with an experimental phase probability distribution can provide some phase improvement.
Incorporation of the inter-reflection cross terms in the probability function however has the potential to dramatically increase the power of the approach for the following reasons. Suppose one reflection is known in magnitude and phase. Then, neighbouring reflections will be more likely to take values which suppress the contribution of the first structure factor in solvent regions, and as a result add to it in protein regions. The implicit phase relationships in solvent flattening are expressed explicitly in this manner. The calculation may be achieved by constructing a 2-D Gaussian model in the Argand diagram for the experimental phase probabilities on the basis of the structure factor magnitude, its standard deviation, and the Hendrickson-Lattman coefficients. The log of this distribution gives a (block biagonal) matrix of quadratic coefficients, which may be summed with the (full matrix) quadratic coefficients from the map variance information. The combined matrix may then be inverted to determine the shift to reach the likelihood maximum, and the variances of all the parameters. The calculation required the inversion of a matrix whose dimension is the number of reflections, but such a calculation is probably within the range of high-end workstations for small to medium sized structures. Alternatively, some form of sparse matrix approach may well be practical.
Terwilliger's approach of calculating the local curvature around the centroid map calculated from the experimental data provides an alternative and computationally efficient means of avoiding this lengthy calculation for the purposes of finding the likelihood maxima, as long as the shifts to the structure factors remain small or the curvatures are recalculated appropriately. However the incorporation of cross terms may improve the error estimation for the resulting phases.
The formulation described here could be modified to achieve similar efficiency by calculating as many terms as are desired of the curvature once (since it is constant in this formulation), and then using a gradient/curvature search to find the log likelihood maximum.
A further development of the ideas described here would be to drop the assumption of independence between density values in the map. However it is hard to see where density covariance information might come from except when calculated from experimental phase information. This offers the possibility of completely representing a Gaussian model for every structure factor and phase by a set of variances and covariances in real space, however the applications of such an approach are not obvious.
This work was funded by the BBSRC, grant number 87/B09875