CCP4 Tutorial: Session 5 - Molecular Replacement

See also the accompanying document giving background information.

This tutorial solves one protein structure using the structure of a similar protein.

In the following instructions, when you need to type something, or click on something, it will be shown in red. Output from the programs or text from the interface is given in green.

The Problem

The example is to determine the structure of cardiotoxin for which we know:

It is a small protein of 60 residues.
The space group is C2.
We have data to 1.5Å resolution.
We have the structure of the same cardiotoxin in a different space group but residues 6-11 and 28-31 are missing.

This protein is now in the PDB as 1tgx.pdb (solved by A.Bilwes,B.Rees,D.Moras J.Mol.Biol. 1994, V239, p-122).

Outline of the Method

Make an estimate of the number of molecules in the asymmetric unit so we know how many molecules to look for with the molecular replacement programs.
Look at our experimental data - are there any problems?
Run molecular replacement program to find solutions.

Alternative: Molecular Replacement with AMoRe
Refine the phases using NCS (non-cystallographic symmetry) phased refinement.

The Data Files

Files in directory DATA:

model.pdb	contains coordinates of the model we will use to solve cardiotoxin
cardiotoxin.mtz	contains the experimental data

Files in directory RESULTS:

matthews.log	the log file from Cell Content Analysis
mr_analyse.log	the log file from Analyse Data
molrep.log	the log file from Molrep
model_molrep1.pdb	the output coordinates from Molrep
mr_refmac.log	the log file from Refmac5 refinement

5a) Estimate the Number of Molecules in the Asymmetric Unit

Most protein crystals contain about 50% water. We will calculate the number of protein molecules in the asymmetric unit of our crystal which will give a water content of about 50%.

Exercise

Select the Molecular Replacement module and open the Cell Content Analysis task window.
Enter a job title such as

Job title Estimate solvent content for cardiotoxin (mr tutorial step 1)
Select the MTZ file - the program will read the space group and cell dimensions from the MTZ file (so you do not need to type them in):

MTZ file DATA cardiotoxin.mtz
Enter the molecular weight of the protein. The protein has 60 residues and we say average residue weight is 100 Dalton. So

Molecular weight of protein 6000.
Click on Run Now.
Look at the output in the window - it shows a table of the Matthews coefficient and percentage solvent content dependent on the number of molecules that are in the asymmetric unit.
```
For estimated protein molecular weight 6000.0
Nmol/asym  Matthews Coeff  %solvent

  1          6.6             81.2
  2          3.3             62.5
  3          2.2             43.7
  4          1.7             24.9
  5          1.3              6.1
```
We are looking for the number of molecules to give about 50% solvent. It looks as if our crystal will have three molecules in the asymmetric unit but two molecules is also possible.
Close the Cell Content Analysis window.

5b) Look at the Experimental Data

We will do two things:

Create a Patterson map and search it for peaks. We expect a big peak at the origin (position 0,0,0) but if there is another big peak (sometimes up to 0.25 the size of the origin peak) then there may be a translation between the molecules in the asymmetric unit and it will be more difficult to solve.

The theory behind this is explained on the web site of Bernhard Rupp:

http://www-structure.llnl.gov/xray/101index.html

For more information, go to the section on Phasing Techniques on this website, and click on NCS with native Patterson maps

Create a Wilson plot which is an indication of the self consistency of the data. Also find the average B-value of the data - this can be used to help the molecular replacement program.

Exercise

Select the Analyse Data for MR task and enter a job title such as

Job title Analyse experimental data for cardiotoxin (mr tutorial step 100)
Select the input experimental data:

MTZ in DATA cardiotoxin.mtz

The output map file will be automatically set to:

Map out TEMPORARY cardiotoxin_mr_analyse1.map
Select input model:

PDB in DATA model.pdb
In the folder Define Map enter the resolution limits:

Exclude data resolution less than 10.0 A or greater than 3.0
In the folder Wilson Plot for B Analysis enter the Number of residues in the asymmetric unit - i.e.:

number of molecules in asymmetric unit * number of residues per molecule

= 3 * 60 = 180, so:

Number of residues in asymmetric unit 180
Click on Run -> Run Now. You can now Close the Analyse Data window.

Look at the log file when the job has finished. In the main CCP4i window click on the job called mr_analyse and then from menu View Files from Job select View Log File. In the log file is output from the programs FFT which created the Patterson map and PEAKMAX which searched for peaks in the map. To find what we want click on the Find button and enter the text List of peaks. You now see a table which starts like this:

 Count Site Height        Grid         Fractional coordinates   Orthogonal coordinates

   1   1        100.00     0   0   0   0.0000  0.0000  0.0000     0.00   0.00   0.00
   2   2         17.56     5   0   0   0.0641  0.0000  0.0000     5.05   0.00   0.00
   3   3         15.24    11   0   0   0.1339  0.0000  0.0000    10.54   0.00   0.00
   4   4         12.83    20   0   0   0.2525  0.0000  0.0000    19.87   0.00   0.00
   5   5         14.52    28   0   0   0.3530  0.0000  0.0000    27.78   0.00   0.00
   6   6         16.12    38   0   0   0.4688  0.0000  0.0000    36.89   0.00   0.00
   7   7         14.84    16   0   2   0.2029  0.0000  0.0331    15.12   0.00   1.65
   8   8         16.33     6   0   3   0.0734  0.0000  0.0552     4.37   0.00   2.75
   9   9         14.84    36   0   3   0.4507  0.0000  0.0470    34.27   0.00   2.34
  10  10         16.35    24   0   4   0.3036  0.0000  0.0716    22.06   0.00   3.57

The peaks are scaled to 100.0 for the biggest peak, which is at position x=0,y=0,z=0 - this is as we expect. The next biggest peak is 17.56 - this is much smaller so there is no translation (good!).

Then there is a table with the 50 highest peaks scaled with Height/rms, which starts with:

   Order No. Site Height/Rms    Grid      Fractional coordinates   Orthogonal coordinates

     1    1    1   66.97     0   0   0   0.0000  0.0000  0.0000     0.00   0.00   0.00
     2  529  488   66.97    40  20   0   0.5000  0.5000  0.0000    39.35  20.20   0.00
     3  532  491   12.01    36  20   1   0.4454  0.5000  0.0253    34.40  20.20   1.26
     4   73    2   12.01     4   0  55   0.0546  0.0000  0.9821   -20.75   0.00  48.96
     5  135  126   11.77     3   4   1   0.0339  0.1025  0.0192     2.18   4.14   0.96
     6  469  411   11.77    37  16  55   0.4661  0.3975  0.9821    11.62  16.06  48.96
     7  525  484   11.77    37  19  51   0.4671  0.4784  0.9106    13.53  19.33  45.39
     8   78   72   11.77     3   1   5   0.0329  0.0216  0.0894     0.31   0.87   4.46
     9   32   32   11.23     8   0  22   0.1010  0.0000  0.3953    -2.13   0.00  19.71
    10  572  530   11.23    32  20  34   0.3990  0.5000  0.6047    15.97  20.20  30.15

Hints

In fact the values you get may be different, for example:

 Order No. Site Height/Rms    Grid      Fractional coordinates   Orthogonal coordinates

    1    1    1   58.43     0   0   0   0.0000  0.0000  0.0000     0.00   0.00   0.00
    2    4    2    3.47     4   0  55   0.0546  0.0000  0.9821   -20.75   0.00  48.96
    3   10    2    3.47    36  20   1   0.4454  0.5000  0.0253    34.40  20.20   1.26

but the comments above still apply.

You may also see a different number of peaks, for example:

 Order No. Site Height/Rms    Grid      Fractional coordinates   Orthogonal coordinates

     1    6    1   38.66    28  16   0   0.5000  0.5000  0.0000    39.35  20.20   0.00
     2    8    1   24.28    28  16  43   0.5000  0.5000  0.9773    14.42  20.20  48.72
     3    5    2    3.34    24  16   0   0.4269  0.5000  0.0000    33.60  20.20   0.00
     4    3    3    3.10     3   0  39   0.0597  0.0000  0.8826   -17.82   0.00  44.00

In this case x=0.5,y=0.5,z=0.0 is the centring operation of the spacegroup of the data (C2). It is a crystallographic translation of the origin peak (as opposed to a non-crystallographic translation).

The difference is an effect of the width of the Patterson origin peak being related to the resolution range of data included when generating the Patterson map. At lower resolutions the origin peak may overlap neighbouring grid points in the map, and result in apparent extra peaks in these adjacent positions.

Including higher resolution data narrows the origin peak and reduces the effect; try changing the high resolution limit from 4.0 Å to 3.0 Å in the Define Map folder, and re-run.

Now go to the end of the log file where you will see:

Average B value for experimental data = 18.178
Average B value for model = 20.000
Running aMoRe: set the Tabling parameter BADD
(the amount to add to the Bvalue) to -1.822

Look at the graphs in the log file. From the View Files from Job menu select View Log Graphs. There are three tables in the log file - look at them in turn:

Wilson Plot

This is a usual Wilson plot - no problems here!

Amplitude Analysis v. Resolution

This plot is the usual shape for amplitude versus resolution plot with 'water' peak at about 4Å.

Average B v. Residue

This shows the difference from the mean - in this PDB file all the B values are set to 20. This is not interesting for this protein.

Quit (or File -> Exit) from the two windows which display the log file and the graphs. Close all other windows except the main window.

5c) Run MolRep Molecular Replacement Program

This program will solve the structure - you must input a coordinate file for a protein similar to the protein in the crystal and the program will output a coordinate file with the molecule moved to the right position(s) in the crystal.

Exercise

From the Molecular Replacement module select MolRep - auto MR and enter a job title such as

Job title MolRep for cardiotoxin (mr tutorial step 200)
The default mode for running MolRep is good:

Do molecular replacement performing rotation and translation function
Select the input experimental data file:

MTZ in DATA cardiotoxin.mtz
Select the input model:

Model in DATA model.pdb
In the folder Search Parameters (...):

Search for 3 monomers in the asymmetric unit.
Click on Run -> Run Now.

MolRep will take a long time to run - if it is too long you can see the output files: $RESULTS/molrep.log and $RESULTS/model_molrep1.pdb.

Look at the log file by selecting View Any File from the right side of the main window, then select:

Go to directory RESULTS

then

File molrep.log

and then click on Display and Exit.

The log file lists many possible solutions. After the rotation function:
```
Number of peaks :      50

               alpha   beta   gamma   theta   phi     chi       Rf          Rf/sigma
 Sol_RF   1     28.27  60.29  182.91  148.91 -167.32  153.12    0.3796E+09  5.31
 Sol_RF   2     40.54  72.23  275.07  117.37  152.74   83.17    0.3249E+09  4.54
 Sol_RF   3    162.83  58.08  180.35  104.76  -98.76   60.26    0.3060E+09  4.28
 Sol_RF   4    325.42  64.83  249.36  146.36  128.03  150.77    0.3013E+09  4.21
 Sol_RF   5     64.03  63.07  262.23  115.31  170.90   70.70    0.2954E+09  4.13
```
This shows the possible rotation of the molecule: alpha beta gamma (or theta phi chi in polar coordinates). The score for the solution is the Rfactor.

After a translation function:
```
               alpha   beta   gamma   Xfrac  Yfrac  Zfrac Dens/sig R-fac   Corr
 Sol_TF_1 1     28.27  60.29 -177.09  0.091  0.000  0.297    3.91  0.598   0.139
 Sol_TF_1 2     28.27  60.29 -177.09  0.611  0.000  0.342    3.35  0.596   0.137
 Sol_TF_1 3     28.27  60.29 -177.09  0.174  0.000  0.105    3.23  0.602   0.125
 Sol_TF_1 4     28.27  60.29 -177.09  0.870  0.000  0.349    2.82  0.606   0.107
 Sol_TF_1 5     28.27  60.29 -177.09  0.268  0.000  0.471    2.67  0.607   0.120
```
This shows the rotation (alpha beta gamma) and translation as fractional coordinates (Xfrac Yfrac Zfrac). There are different ways to score the solutions: density/sigma, Rfactor, Correlation coefficient. The correlation coefficient is the best - this is bigger for good solutions.

The program runs the translation function three times to find three different molecules. For the second run of the translation function it will take the best solution from the first run and try to find another molecule which will fit well with the first solution. For the third run of the translation function it will keep the best solution from the first run and the second run and try to find molecule number three.

It is not possible to say what is a good score - this will depend on many things but it is good if the best score (the correlation factor in the column labelled Corr) is much bigger than the second best score. When you are looking for several molecules the best score for the first (and perhaps second) molecule will not be very good but you hope that the best score for all three molecules is much better than for other possible solutions.

There are three molecules in the output PDB file - these have chain names: A, B and C.

The diagram below shows how these molecules are related: there are three molecules and the vectors between the centres of the molecules. The direction of the axis of rotation to map one molecule onto another is shown; it is at 90 degrees to the vector between the molecule centres. To map one molecule onto the other needs some translation and a rotation of approximately 120 degrees (in fact the angles are 124 degrees, 104 degrees and 131 degrees which are surprisingly different from 120 degrees). This is right for a three-fold rotation axis and this shows there is a rotation axis and not a screw axis. This helps to confirm that the solution is correct because MolRep did not use this information to find the solution.
The output file model_molrep1.pdb contains three copies of the input model moved to the right positions in the asymmetric unit. The three molecules will pack together something like this:

Alternative 5c) Run AMoRe Molecular Replacement Program

5d) Refine the Structure

It is not always certain that the molecular replacement is correct - the best way to test it is to refine the model. The best way to start refinement when there is more than one molecule in the asymmetric unit is to use the non-crystallographic symmetry to restrain the refinement, i.e. the refinement program must keep all the molecules similar.

Exercise

Select the Refinement module and open the Run Refmac5 task window.
Enter a job title such as:

Job title Refine mr coordinates (mr tutorial step 300)
In the Files folder, enter the name of the data file:

MTZ in DATA cardiotoxin.mtz

The output MTZ file will be automatically set to:

MTZ out TEST cardiotoxin_refmac1.mtz
Enter the name of the input files - this is the coordinate file output from MolRep:

PDB in RESULTS model_molrep1.pdb

The output PDB file will be automatically set to:

PDB out TEST model_molrep1_refmac1.pdb
Now you must tell the program that it must keep the non-crystallographic symmetry by keeping the three chains similar. Click on the line with the folder title

Setup Non-Crystallographic Symmetry (NCS) Restraints
Click on the button

Add NCS restraint
On the first line, input:

NCS restrain chain A, residues 1 to 60 with medium restraints

Underneath this line, choose:
to chain B (same residues as defined above)
Then click Add chain and select:
to chain C
Click on Run -> Run Now.
You can look at a log file by using View Any File and selecting

Go to directory RESULTS

File type log CCP4 log filename filter *.log

Viewer View Log Graphs

and then select file:

File mr_refmac.log

and then click on Display and Exit.

Go to the bottom of the list of Tables in File and select

Rfactor analysis, stats vs cycle

Rfactor Analysis

The graph shows the Rfactor (red) and the Free Rfactor (blue) for 6 cycles of refinement. The Free Rfactor goes down from 52% to 45%. The Rfactor is high - this is normal after molecular replacement because we do not have a good model yet but it goes down so we probably have a good solution.

On to the next tutorial - Refinement.

Side-step to the alternative for step 5c - Molecular Replacement with AMoRe.

Back to the previous tutorial - Experimental Phasing (by MAD).

Back to the index.

To find out more:
MolRep: http://www.ysbl.york.ac.uk/~alexei/molrep.html
CCP4: http://www.dl.ac.uk/CCP/CCP4/

Prepared by Liz Potterton (lizp@ysbl.york.ac.uk) & Eleanor Dodson, July 2000
Additional material: Peter Briggs & Martyn Winn, February 2001
Adapted by Maria Turkenburg, 2002-2003