Collaborative Computational Project No. 4
Software for Macromolecular X-Ray Crystallography

DLS BAG Training, July 2015

Please ignore the links to individual datasets at the top of each of the sections below and instead take the unpacked data from the following location: /dls/i02/data/2015/mx13273-1/bag_training/.

Data processing with DIALS

This tutorial introduces data processing using the newly developed integration program DIALS. The tutorial is located at the DIALS website, but there is no need to download the dataset from the link there, as it has been unpacked for you locally.

Click here to visit the DIALS tutorial site: Processing in Detail


Data processing and more with iMosflm and CCP4

Here is a collection of tutorials starting from diffraction images. These will develop experience with data processing software and show some tips/tricks for ways to obtain good quality reduced datasets. However, you need not stop at that point. In some cases the tutorials also go on to phasing and structure solution.

These tutorials were developed, with permission, from the X-ray Tutorial at Helmholtz-Zentrum Berlin.

If you are unfamiliar with Linux commands please ask for help to unpack the datasets. It is not completely straightforward!
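For those who prefer not to use the Linux command line, the unpacking step can be sketched in Python with the standard `tarfile` module. This is only a minimal sketch: the archive name `dataset_part1.tgz` and the destination directory `unpacked` are hypothetical placeholders, not the real file names of the datasets.

```python
import tarfile
from pathlib import Path


def unpack(archive, dest):
    """Extract a tar archive (e.g. .tgz) into dest and return the member names."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:*" lets tarfile detect the compression (gzip, bzip2, ...) by itself.
    with tarfile.open(archive, "r:*") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names


if __name__ == "__main__":
    # Hypothetical archive name; substitute the file you were given.
    example = Path("dataset_part1.tgz")
    if example.exists():
        for name in unpack(example, "unpacked"):
            print(name)
```

Only unpack archives from a source you trust, since an archive can contain unexpected paths.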

Experiment 1: S-SAD on bovine Insulin

Data part 1
Data part 2

Data processing using iMosflm
Experimental phasing with CRANK

Experiment 3: Molecular Replacement on monoclinic Lysozyme

Data part 1
Data part 2
Data part 3

Data processing using iMosflm
Molecular replacement with MrBUMP


Solving a protein-protein complex with Phaser (BETA/BLIP)

This tutorial originally comes from the Phaser Wiki.

The files for this tutorial can be found at: 2_complex.tgz


Reflection data: beta_blip_P3221.mtz
Structure files: beta.pdb, blip.pdb
Sequence files: beta.seq, blip.seq


This tutorial demonstrates a difficult molecular replacement problem.

β-Lactamase (BETA, 29kDa) is an enzyme produced by various bacteria, and is of interest because it is responsible for penicillin resistance, cleaving penicillin at the β-lactam ring. There are many small molecule inhibitors of BETA in clinical use, but bacteria can become resistant to these as well. Streptomyces clavuligerus produces beta-lactamase inhibitory protein (BLIP, 17.5kDa), which has been investigated as an alternative to small molecule inhibitors, as it appears more difficult for bacteria to become resistant to this form of BETA inhibition. The structures of BETA and BLIP were originally solved separately by experimental phasing methods. The crystal structure of the complex between BETA and BLIP has been a test case for molecular replacement because of the difficulty encountered in the original structure solution. BETA, which models 62% of the unit cell, is trivial to locate, but BLIP is more difficult to find. The BLIP component was originally found by testing a large number of potential orientations with a translation function search, until one solution stood out from the noise.
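The 62% figure quoted above can be checked from the two molecular weights. The sketch below assumes one copy each of BETA and BLIP in the complex and that the fraction of the scattering scales roughly with mass; the function name is illustrative only.

```python
def mass_fraction(component_kda, all_components_kda):
    """Fraction of the total mass contributed by one component."""
    return component_kda / sum(all_components_kda)


beta_kda = 29.0   # BETA, as quoted in the text
blip_kda = 17.5   # BLIP, as quoted in the text

fraction = mass_fraction(beta_kda, [beta_kda, blip_kda])
print(f"BETA fraction of the complex: {fraction:.1%}")  # 62.4%
```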
  1. What do you think is the best order in which to search for BETA and BLIP? Under what circumstances could the lower molecular weight search model be the easiest to find by molecular replacement?
  2. What is the space-group recorded on the mtz file? If you had not solved this structure, would you know that this was the space-group? If not, what other space-group(s) must you consider?
    • Think about handedness (enantiomorphs)
  3. Run Phaser for solving BETA/BLIP
    • Bring up the GUI for Phaser
    • All the yellow boxes need to be filled in.
    • Search for BETA and BLIP in the one job.
  4. Has Phaser solved the structure?
    • Look at the Z-scores for the rotation and translation functions
  5. What search order was used?
    • If you wanted, you could force the other search order and see what difference this makes.
  6. Which space group was the solution in?
  7. Look through the job.sum file and identify the anisotropy correction, rotation function, translation function, packing, and refinement modes, for the two search molecules, and all the space groups. Draw a flow diagram of the search strategy.
  8. Why doesn't Phaser perform the rotation function in the two enantiomorphic space groups?
  9. Which reflections in the data are particularly important for deciding the translational symmetry of the space-groups to search? Under what data collection conditions might you not have recorded these important reflections? Are there any other space-groups that you might want to consider when solving BETA/BLIP?
  10. How big is the anisotropic correction for the data? How does this compare to TOXD?
  11. Run Phaser again with the anisotropy correction turned off. What effect does this have on the structure solution?
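As an aid to the handedness question above, the 11 enantiomorphic pairs of space groups can be written as a small lookup table. This is a sketch for checking your own answer; the function name `enantiomorph` is illustrative and is not part of Phaser or CCP4.

```python
# The 11 pairs of enantiomorphic space groups (compact names, no spaces).
_PAIRS = [
    ("P31", "P32"), ("P3112", "P3212"), ("P3121", "P3221"),
    ("P41", "P43"), ("P4122", "P4322"), ("P41212", "P43212"),
    ("P61", "P65"), ("P62", "P64"), ("P6122", "P6522"), ("P6222", "P6422"),
    ("P4132", "P4332"),
]

# Build a two-way lookup so either member of a pair finds its partner.
ENANTIOMORPH = {}
for first, second in _PAIRS:
    ENANTIOMORPH[first] = second
    ENANTIOMORPH[second] = first


def enantiomorph(space_group):
    """Return the enantiomorphic partner, or None if the group has none."""
    return ENANTIOMORPH.get(space_group.replace(" ", ""))


print(enantiomorph("P3221"))    # partner of the space group recorded in the mtz file
print(enantiomorph("P212121"))  # None: no enantiomorphic partner
```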


Using ensemble models with Phaser (TOXD)

This tutorial originally comes from the Phaser Wiki.

The files for this tutorial can be found at: 3_ensemble.tgz


Reflection data: toxd.mtz
Structure files: 1BIK.pdb, 1D0D_B.pdb
Sequence file: toxd.seq

MR using TOXD

This tutorial demonstrates the ensembling procedure in Phaser.

α-Dendrotoxin (TOXD, 7139Da) is a small neurotoxin from green mamba venom. You have two models for the structure. One is in the file 1BIK.pdb, which contains the protein chain from PDB entry 1BIK, and the other is in the file 1D0D_B.pdb, which contains chain B from PDB entry 1D0D. 1BIK is the structure of Bikunin, a serine protease inhibitor from the human inter-α-inhibitor complex, with sequence identity 37.7% to TOXD. 1D0D is the complex between tick anticoagulant protein (chain A) and bovine pancreatic trypsin inhibitor (BPTI, chain B). BPTI has a sequence identity of 36.4% to TOXD. Note that models making up an ensemble must be superimposed on each other, which has not yet been done with these two structures.
  1. Use the SSM superpose option in Coot to superimpose 1BIK on 1D0D_B, saving the resulting coordinates in 1BIK_on_1D0D.pdb.
    • Alternatively, superpose molecules using Gesamt (Molecular Replacement > Utilities > Superpose Coordinates). This may be the best option if Coot is slow in the workshop setup.
  2. Go to the Molecular Replacement module, in the yellow pull-down on the LHS of the GUI
  3. Bring up the GUI for Phaser
  4. All the yellow boxes need to be filled in. Note that in this example, the ensemble is the actual ensemble containing superposed structures. In addition:
    • it is a good idea to change the Ensemble id from the default.
    • it is also a good idea to fill in the TITLE.
  5. When you have entered all the information, run Phaser.
  6. Has Phaser solved the structure? What was the LLG of the best solution? What was the Z-score of the best translation function solution?
    • The meaning of the Z-score is given in the documentation
  7. Look through the log file and identify the anisotropy correction, rotation function, translation function, packing, and refinement modes. Draw a flow diagram of the search strategy.
  8. How many potential solutions did Phaser find or reject at each stage? What were the selection criteria for carrying potential solutions forward to the next step in the rotation and translation functions? How many other selection criteria could have been used, and what are they?
    • Use the documentation
  9. Run Phaser again without using ensembling i.e. run two jobs, one using 1BIK only and the other using 1D0D only as models. What are the LLGs of the final solutions? What are the Z-scores of the translation functions? Was ensembling a good idea?
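A Z-score expresses how many standard deviations a score lies above the mean of a sample of scores. The sketch below illustrates that idea only: the background scores are made-up numbers, and the use of the population standard deviation is an assumption, so consult the Phaser documentation for how its Z-scores are actually computed.

```python
from statistics import mean, pstdev


def z_score(top_score, background_scores):
    """Number of standard deviations by which top_score exceeds the sample mean."""
    mu = mean(background_scores)
    sigma = pstdev(background_scores)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("background scores have zero spread")
    return (top_score - mu) / sigma


# Hypothetical scores for a sample of trial positions, plus one stand-out peak.
background = [1.0, 2.0, 3.0, 4.0, 5.0]
print(round(z_score(10.0, background), 2))
```

A solution that stands far above the rest of the sample gives a large Z-score, which is why the Z-score is a more useful indicator than the raw score alone.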


An easy MR example


Parent directory: 9_easy
Sequence file: 4hg7.fasta
Reflection data: 4hg7.mtz
Structure files: 1t4e.pdb

1. Introduction

This is an easy but not completely straightforward MR example. It illustrates the importance of model editing.

Assume that 4hg7.mtz represents your experimental data. Try to solve and refine this structure starting from 1t4e.pdb and using any tools you like.

The asymmetric unit of the template structure 1t4e contains two molecules of a fragment of a human oncoprotein MDM. The target structure 4hg7 contains a complex of an almost (but not completely!) identical fragment of the same protein with the ligand.

The fragments are very similar and share the same initial sequence: the model covers residues 16-111 (plus 5 residues of expression tag, invisible in the structure) and the target covers residues 17-108. So at first glance there is no need to modify the model.

The ligand is as follows:

Clear density for this ligand will indicate that the structure is solved. At this point you may also practise using Coot > Calculate > Ligand Builder to generate a description of this ligand and to refine the complex.

2. Instructions

(1) Find the files listed above in the tutorial directory.

(2) How many molecules are there in the asymmetric unit of the target structure? To answer this question, use the files 4hg7.fasta and 4hg7.mtz and the Cell Content Analysis task in CCP4I (CCP4I > Molecular Replacement > Analysis > Cell Content Analysis).

(3) Extract chain A from the file 1t4e.pdb into 1t4e_A.pdb. (CCP4I > Molecular Replacement > Edit PDB Model File. Change selections in the second line of the task menu to "pdbset" and "perform selection for output PDB file". Other selections should be clear)

(4) Run molrep (don't supply the sequence, on the assumption that it is the same protein and the constructs are almost the same length) and examine the table at the end of the log file. Is the molrep solution convincing? Try refinement. Did it go smoothly and produce interpretable density? Do you see the ligand density?

(5) Run phaser. Did you notice any unusual message? Refine the model. Do you think phaser found a solution?

(6) Ways to proceed:

Phaser: increase Phaser's tolerance to clashes, or remove three C-terminal residues from the model.

Molrep: use the sequence this time, or remove three C-terminal residues from the model.

(7) Compare the solution from (4) or (5) with the solution from (6) using Csymmatch (CCP4I > Program List > csymmatch; define the input files and tick the box "Apply origin correction and hand correction").

(8) Overlay the structures in Coot. See the shift between the right and wrong solutions. Go to Draw > Cell and Symmetry > Show Symmetry > Symmetry by Molecule, display as C-alphas, and see why 3 residues made such a big difference.

This example shows how a small modification of the model can make a big difference.

*) This is like cloning in a way, where truncating two or three residues can make the construct soluble.
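The model editing in steps (3) and (6) can also be sketched in a few lines of Python, as an alternative to the CCP4I task. The sketch relies only on the fixed-column PDB format (chain ID in column 22, residue number and insertion code in columns 23-27). The function names are illustrative, and keeping only ATOM records (dropping waters and ligands) is a simplifying assumption.

```python
def extract_chain(pdb_lines, chain_id="A"):
    """Keep only the ATOM records belonging to one chain."""
    return [ln for ln in pdb_lines
            if ln.startswith("ATOM") and len(ln) > 26 and ln[21] == chain_id]


def drop_c_terminal(atom_lines, n=3):
    """Remove the last n residues from a single-chain list of ATOM records."""
    residues = []  # residue keys (number + insertion code) in order of appearance
    for ln in atom_lines:
        key = ln[22:27]
        if not residues or residues[-1] != key:
            residues.append(key)
    doomed = set(residues[-n:]) if n > 0 else set()
    return [ln for ln in atom_lines if ln[22:27] not in doomed]


if __name__ == "__main__":
    import os
    if os.path.exists("1t4e.pdb"):
        with open("1t4e.pdb") as fh:
            chain_a = extract_chain(fh.read().splitlines(), "A")
        with open("1t4e_A.pdb", "w") as out:
            out.write("\n".join(chain_a) + "\nEND\n")
        # Hypothetical output name for the truncated model of step (6).
        with open("1t4e_A_trimmed.pdb", "w") as out:
            out.write("\n".join(drop_c_terminal(chain_a, 3)) + "\nEND\n")
```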


Automated model building

Download the data from ccp4_modelbuilding.tgz


Reflection data: cMybCEBP_beta-sf.mtz
AMPLE generated Molecular Replacement model: ample.pdb
Sequence file(s): cMybCEBP_beta-seq.fasta (protein/DNA), cMybCEBP_beta-protein_seq.fasta (protein only) and cMybCEBP_beta-DNA_seq.fasta (DNA only)

Automated model building

The goal of this tutorial is to illustrate how to use various automated density modification and model building tools to improve upon an initial partial model for a target structure produced by molecular replacement (MR). Molecular replacement can often succeed with only a partial or fragment search model. It then becomes necessary to build upon this model to give a final structure. This can be done manually using programs like Coot and Refmac; however, effective use of density modification and automated model building tools can make the task of completion a lot easier.

Here we will use several tools from CCP4, SHELX and ARP/wARP to demonstrate how to build up a final model from an initial model produced by the AMPLE molecular replacement pipeline in CCP4. AMPLE uses ab initio modelling to derive ensemble search models for target structures that may not have suitable homologues available for use in MR. From an initial ensemble of search models it will produce several truncated forms of the same ensemble, truncating away the most variable sections in steps right down to a low-variance core which may only represent a fragment of the target structure.

The target structure we will try to complete is the crystal structure of a protein-DNA complex that reveals a mechanism of c-Myb-C/EBP beta cooperation from separated sites on a promoter. The protein component of the target is a coiled-coil structure, and AMPLE has been used to generate search models for it and to use them in MR. It has produced a solution (ample.pdb) made up of several short helical fragments which the scoring indicates are positioned correctly.

  1. Project setup

    Download and unpack the file ccp4_modelbuilding.tgz (linked to above) in a suitable location. In ccp4i create a new project with the directory ccp4_modelbuilding as the project directory. Give the project a suitable name. Select this project from the "Change Project" menu.

  2. Refinement of the AMPLE solution

    The first step is to refine the positioned model to produce an electron density map that we can examine. This will help us to further confirm the correctness of the solution and give us a starting map for model building.

    • From the Refinement menu select the "Run Refmac5" task. A more detailed tutorial on refinement can be found elsewhere, so we won't elaborate too much on the options chosen.
    • Set the job title e.g. "Refine AMPLE solution"
    • We will do 30 cycles of restrained refinement (default) with jelly-body refinement
    • Provide "MTZ in", it will be the file cMybCEBP_beta-sf.mtz. The interface will automatically detect the F/SIGF column labels
    • Provide "PDB in" as the solution model that has come from AMPLE (ample.pdb)
    • The interface will automatically assign output file names
    • Under "Refinement Parameters" set 30 cycles of maximum likelihood restrained refinement and select the option "use jelly-body refinement with sigma 0.02"
    • Select "Run" and then "Run now"
    • When the job completes, double-click on the job name in the task list. This will present you with the qtrview output from the job. This contains an overall results summary and the log file for the job. The initial and final R/Rfree will show a slight improvement, but not enough to convince us that it is a clear solution. We need to look at the electron density map to make a better assessment.
    • In the results summary go to the "Output files" section and click the Coot button. This will launch Coot with the model and electron density maps for 2fo-fc and fo-fc (difference map) loaded. Display the symmetry mates for the model and examine the density around the model. Can you see features in the maps that give an indication that the solution is possibly correct?
  3. First attempt at building the protein component

    Now that we are more confident that the MR solution is correct we will try to use Buccaneer to produce a more complete model for the protein part of the target structure.

    • From the "Model Building" menu select the "Buccaneer -autobuild/refine" task. Set the job title
    • Choose to perform model building/refinement starting from "molecular replacement". We are going to give Buccaneer the refined AMPLE pdb file to "seed chain growing". Provide this PDB in the "MR model PDB in" box.
    • Next, provide the sequence file for the protein part of the target (cMybCEBP_beta-protein_seq.fasta) in the box "Work SEQ in"
    • Provide the output MTZ file from the previous refinement job as the "Work MTZ in". The refined MTZ contains the calculated phases (PHIC) and figure of merit (FOM) required for model building.
    • The next section selects the column labels needed for input to Buccaneer. The structure factors F/SIGF will be automatically selected
    • We don't have phase information in the form of Hendrickson-Lattman coefficients so we need to tick the box "Use PHI/FOM instead of HL coefficients". Choose "PHIC" for PHI and "FOM" for FOM
    • Refmac gives us map coefficients as well so we can tick the box "Use map coefficients" and select "FWT" for F and "PHWT" for the corresponding PHI value.
    • We will also use the Free R set in the refinement steps so choose "Use Free-R flag" and give "FreeR_flag" in the "Free R flag" column
    • Everything else can be left as default. The default is to do 5 cycles of model building, with each cycle followed by 10 cycles of restrained refinement in Refmac
    • Select Run, Run now
    • Again, double click the job name to bring up the results GUI. You will see the build and refinement output for the 5 cycles. For each cycle the number of residues built should increase; however, the refinement R/Rfree remain stubbornly high. This is to be expected given that we have not tried to model the DNA part, which makes up a large fraction of the overall structure. The best way to assess the solution is to look at the model. Click the Coot button in the "Output files" section to view it. The coiled-coil helices should be present and mostly built, but notice also that Buccaneer has placed a lot of protein fragments in what appears to be a DNA-like structure. Although these fragments are mostly junk, we can use the fact that they have traced out the DNA location to our advantage. We will use this model as a starting point for density modification and phase improvement.
  4. Determining the solvent content of the unit cell

    We are going to run SHELXE to do density modification and c-alpha tracing. Before we can proceed we need to determine the solvent content of our unit cell. SHELXE is quite sensitive to this parameter, so we need to make an accurate estimate. To do this we will use the program Matthews_coef.

    • Go to the Molecular Replacement menu and select "Analysis", "Cell Content Analysis". This interface runs Matthews_coef, which is used to determine the Matthews coefficient given a space group, unit cell dimensions and some description of the structure present. In this case we will give it the target sequence file. Based on this sequence, it will give us a set of probabilities for the likely number of copies of the target structure in the asymmetric unit.
    • Give the job a title. We have a protein-DNA complex so we need to select this option where it says "Calculate Matthews coefficient for...". Next, provide it with the MTZ file containing our experimental structure factor amplitudes (cMybCEBP_beta-sf.mtz)
    • For the section "Use molecular weight..." select "estimated from sequence file" and provide the path to the sequence file containing both the protein and the DNA sequence
    • Hit "Run Now"
    • The output presents a table of Matthews coefficient, solvent percentage and probability values for different numbers of estimated molecules in the asymmetric unit. The results suggest that we have 2 copies in the asymmetric unit with a probability of 91% and a solvent content of 53.51%. It should be highlighted that this is just an estimate. Matthews_coef tends to favour solvent contents in the range 40-60%. In step 3, our run of Buccaneer produced a model that shows only 1 copy of the target in the asymmetric unit, so we should be suspicious of the Matthews_coef result. It may be that the crystal has an unusually high solvent content. To demonstrate the importance of estimating solvent content correctly we will proceed with running SHELXE twice, once assuming two copies in the asymmetric unit and once assuming one copy.
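The relationship between the number of copies and the solvent content can be sketched as follows. The Matthews coefficient is the cell volume divided by the total molecular mass in the cell, and the non-solvent fraction scales in proportion to the number of copies. The function names are illustrative and this is not the Matthews_coef implementation; the only number taken from this tutorial is the 53.51% reported for 2 copies.

```python
def matthews_coefficient(cell_volume_A3, n_symops, n_copies, mw_da):
    """Matthews coefficient V_M in cubic Angstroms per Dalton.

    n_symops is the number of asymmetric units in the unit cell and
    n_copies the number of molecules per asymmetric unit.
    """
    return cell_volume_A3 / (n_symops * n_copies * mw_da)


def rescale_solvent(solvent_fraction, copies_from, copies_to):
    """Solvent fraction after changing the assumed number of copies.

    The occupied (non-solvent) fraction is proportional to the copy number.
    """
    occupied = (1.0 - solvent_fraction) * copies_to / copies_from
    return 1.0 - occupied


# 53.51% solvent was reported for 2 copies; what does 1 copy imply?
one_copy = rescale_solvent(0.5351, copies_from=2, copies_to=1)
print(f"Solvent fraction for 1 copy: {one_copy:.4f}")
```

Halving the number of copies halves the occupied fraction (46.49% to about 23.2%), which gives a solvent content of roughly 76.8% for a single copy.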
  5. Density modification and c-alpha tracing in SHELXE

    We will now run SHELXE twice, firstly assuming 2 copies in the asymmetric unit and a solvent content of 53.51%, and secondly assuming 1 copy in the asymmetric unit and a solvent content of 76.75%.

    • Go to the "Model Building" menu. At the bottom of the list there will be the "SHELXE from MR solution" task. Select it and give the job a title.
    • This is a very simple interface to the SHELXE program designed to utilise the "trace from MR solution" option. SHELXE is also useful for density modification and tracing from experimental phasing. To access all the features of SHELXE it must be run from the command line.
    • Provide MTZ in and PDB in. These should be the original MTZ file with the structure factor amplitudes (cMybCEBP_beta-sf.mtz) and the output PDB file from the Buccaneer step. The PDB file is used by SHELXE to get an initial estimate for the phase information
    • The other thing we need to provide is the solvent content estimate. In the case where we assume 2 copies in the asymmetric unit we set this to 53.5 (or 0.535)
    • We want to make SHELXE perform several cycles of density modification and c-alpha tracing. Typically 15 cycles is a good number; however, in the interest of speed, we will do only 3 cycles, so set "Perform 3 cycles of main-chain tracing...."
    • Hit "Run" and then "Run now"
    • SHELXE will take a little time to run so in the meantime set up the second run, repeating the steps above but setting the solvent content to 76.7 (or 0.767)
    • When the jobs complete, double click the job name in the task list and have a look at the log file. Several utility programs are used to convert the input data to SHELXE format and back again. The SHELXE log itself will come after the Mtz2various log. At the end of each cycle a short summary is given. It will state how many residues have been traced and list the size of each trace fragment. A good rule of thumb is to look for an average fragment length of 10 or more as being indicative of success. The other requirement for success is that the "CC for partial structure against native data" is 25% or greater. (Note that currently these criteria only apply for favourable resolutions of about 2.3 Angstroms or better.) At the bottom of the SHELXE log it will report the best trace and which cycle produced it. The output pdb and phs (phases) files will be generated from this cycle. The phases file is then converted back to CCP4 MTZ format, and the resulting map and model can be viewed by clicking on the results tab and then clicking the Coot button in the "Output Files" section
    • You should be able to see a c-alpha trace of the coiled-coil structure in the output model. There will also be c-alpha fragments where we predict the DNA should be (based on the Buccaneer result). However, we are more interested in viewing the map that has been produced by the density modification. Expand the map radius in Coot (select Edit and then Map parameters). Set it to about 50.0 Angstroms. Can you see density for the DNA? Compare both SHELXE solutions. Which one has less noise and better density for the DNA? It should be clear that the job where we set solvent content to 76.7% has better density, meaning there is most likely to be only one copy in the asymmetric unit.
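The rule of thumb for judging a SHELXE trace (average fragment length of 10 or more, CC of 25% or greater, data to about 2.3 Angstroms or better) can be written down as a small check. This is a sketch: the function name is illustrative, and the fragment lengths and CC values in the example are made-up numbers that you would replace with the ones read from your own log.

```python
def trace_looks_successful(fragment_lengths, cc_percent, resolution_A,
                           min_avg_length=10.0, min_cc=25.0, max_resolution_A=2.3):
    """Apply the rule of thumb from the text to one SHELXE trace.

    Returns True or False, or None when the data resolution is worse than
    about 2.3 Angstroms and the criteria do not apply.
    """
    if resolution_A > max_resolution_A:
        return None
    if not fragment_lengths:
        return False
    average = sum(fragment_lengths) / len(fragment_lengths)
    return average >= min_avg_length and cc_percent >= min_cc


# Hypothetical values for illustration only.
print(trace_looks_successful([12, 15, 9], cc_percent=30.0, resolution_A=2.0))
print(trace_looks_successful([4, 5, 6], cc_percent=30.0, resolution_A=2.0))
```

The check complements, but does not replace, looking at the density-modified map in Coot as described above.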
  6. Building the DNA structure

    The density modification and phase improvement from SHELXE has given us a much better map to attempt model building in. Currently, the DNA and protein need to be built in separate steps. With a good map, the best approach is to build the DNA first and then the protein with the DNA fixed in place. The DNA can be built with Nautilus or with ARP/wARP. We will use ARP/wARP here but you should also try using Nautilus in your own time.

    • Go to the "Model Building" menu and select "ARP/wARP DNA/RNA". Provide a job title.
    • For "MTZ in" provide the MTZ file that has come from the SHELXE task where we used 76.7% for the solvent content. This will automatically populate the column label section with F/SIGF and PHI_SHELXE/FOM_SHELXE for the observed amplitudes and calculated phases respectively.
    • ARP/wARP also requires the number of nucleotides (52) and residues (284) in the asymmetric unit.
    • Run the job. It will take a few minutes to complete. If you view the log file you will get a message to say how many nucleotides have been built.
    • When the job completes there won't be a Coot button to view the model, you'll need to start Coot manually and select the pdb file from your project directory. The name of the output pdb file will be listed in the line at the bottom of the log starting "#CCP4I TERMINATION OUTPUT_FILES..."
    • All being well you should see a nice model for the DNA structure. You may have to do a little pruning in Coot to delete residues that are not correct. These should be fairly obvious if present. If you do need to do this, save the new coordinates to a file. This model will be fixed in place in the next step where we will attempt to rebuild the protein structure
    • Notice that we didn't need to provide ARP/wARP with the sequence for our target. It will have done its best to guess the nucleotides but some mutation and additional model building will be needed later on.
    • Try repeating the above steps with the other SHELXE output. Is the DNA model any better or worse?
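Finding the output PDB file name in the log, as described above, can be sketched as a small text search. The sketch assumes that the file paths follow the "#CCP4I TERMINATION OUTPUT_FILES" marker on the same line, separated by whitespace. That layout is an assumption, and the log content in the example is made up.

```python
def output_pdb_files(log_text, marker="#CCP4I TERMINATION OUTPUT_FILES"):
    """Return the .pdb paths listed on lines that start with the marker."""
    found = []
    for line in log_text.splitlines():
        if line.startswith(marker):
            tokens = line[len(marker):].split()
            # Assumption: paths contain no spaces and end in ".pdb".
            found.extend(tok for tok in tokens if tok.lower().endswith(".pdb"))
    return found


# Hypothetical log excerpt for illustration only.
example_log = (
    "some earlier output\n"
    "#CCP4I TERMINATION OUTPUT_FILES  /tmp/project/job_7_dna.pdb myproject\n"
)
print(output_pdb_files(example_log))
```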
  7. Building the protein structure

    Our final step will be to use Buccaneer again to build the protein structure, but this time fixing the DNA in place. We will also use the SHELXE output map to build into.

    • Go to the "Model Building" menu and select "Buccaneer autobuild/refine". Give a job title and again perform model building/refinement starting from "molecular replacement" phases.
    • This time we will use the MR model to place and name chains and "nothing else". Provide the output PDB file containing the DNA (remember that if you deleted elements in Coot you'll need to choose the saved pdb file from Coot, otherwise choose the output PDB from the ARP/wARP step).
    • Select "Specify an initial model to be extended" and input the same PDB file as above for the "Work PDB in".
    • For "Work SEQ in" choose the protein-only sequence file, "cMybCEBP_beta-protein_seq.fasta"
    • For "Work MTZ in" choose the MTZ file output by the SHELXE task
    • Again, we will "Use PHI/FOM instead of HL coefficients" and choose PHI_SHELXE/FOM_SHELXE for the phases PHI/FOM
    • We will also "Use Free-R flag" and select FreeR_flag for the "Free R flag" column label
    • By default, Buccaneer will rebuild our DNA structure as protein. We want to prevent it from doing this and get it to retain the DNA. Under the section "Model building parameters" select "Specify atoms from the initial model to keep". By default this will select all of the atoms in the initial model, but it is possible to adjust the range of what is kept
    • Select "Run" and "Run now"
    • 5 rounds of building and refinement will now be performed. Watch the results of each round in the Results tab of the Qtrview viewer. As each cycle completes more of the model will be traced and the refinement R/Rfree should steadily improve.
    • When the job completes, hit the Coot button under "Output files" to view the final model. You should see that we now have a substantial portion of the protein-DNA complex complete. Some further manual model building and general tidying-up is still required to complete the structure, but a lot of the work has been done by the automation tools, making completion of the structure a lot easier.


  1. What features in the electron density map after step 2 illustrate that the MR solution from AMPLE is likely to be correct?
  2. Why is it important to make an accurate estimate of the solvent content of the unit cell?
