Protein Sequence and Structure Bioinformatics: teaching

Showing posts with label teaching. Show all posts

Monday, October 4, 2010

Getting the most out of the Combinatorial Extension (CE) algorithm

The CE algorithm is a frequently used algorithm for protein structure alignment. We have recently ported the CE algorithm from C to Java and made the new implementation available through BioJava project. It is run as part of the Protein Comparison Tool at the RCSB PDB web site. This was a good opportunity to learn a few hidden aspects of the original implementation and re-consider some of the parameter choices that have been take for the original code. Here a quick summary of how CE works and how to get the most out of it.

CE essentially has two main steps:

Finding Aligned Fragment Pairs (AFPs) between the two protein structures that share similarity. This step in most cases is quite quick, however this only provides a rough alignment as a result which needs refinement.
The alignment optimization steps is the slow part of the algorithm. It iteratively tries to extend the initial alignment while staying below a certain RMSD threshold.

So how does these two steps work in detail? Today we will look at Step 1:

Step 1: Finding AFPs:This step starts with calculating Matrices of intra-molecular distances withing each of the two proteins. These are matrices that describe the distances within a protein and they look something like this:

Here you can see the two matrices, on the left for the alpha chain of Hemoglobin, and on the right side for the beta chain of Hemoglobin. You can view their alignment at the RCSB PDB web site.

As you can see the two chains are very similar. Their pattern of intra-molecular distances looks similar, too. The regions that are colored indicate close proximity in space. Can you spot the helices?
There are a few different ways to compare such distance matrices. In order to identify the AFPs, CE takes the approach to "slide a window" over each of the matrices and tries to find regions where the two proteins looks similar.

This is how it is calculated: First all distances for fragments of size 8 are calculated and the difference of Distances is

This can be done by calculating differences of distances for each fragment of length winSize, which is one of the parameters that can be sent to CE and which is by default set to a length of 8.

The original CE paper has a table on the impact of this parameter. Here you can see how different values for winSize (here called Fragment size m) affect the final RMSD, alignment length and calculation time.It is impressive how much CPU speed has improved since back in those days: Now the alignment shows up almost instantly on my screen. However back then they only had a 248 Mhz CPU and you can see in the table above how many seconds one had to be patient back in 1998.

I tried to reproduce this table with the current implementation, but the numbers don't show up exactly the same. There were a few follow up publication on the algorithm during which this step got improved. The trace step has now some RMSD threshold parameters which make sure that the RMSD between the traces stay within limits.

In this table the internal numbers after the Step 1: As you can see the main effect is on the nr. of traces that are being detected.

winSize	nr traces	RMSD after initial trace
4	2,662,211	4.4
6	1,858,380	3.8
8	1,158,327	4.6
10	891,807	4.3
12	847,802	4.3
16	758584	4.0
24	261,290	5.0
36	17395	4.6

To come back to the differences of distances between the fragments of length winSize: at this stage the difference of distance matrix looks like this:

Each dot on the matrix corresponds to the start of a fragment and color corresponds to closely similar fragments. As you can see there are plenty of regions that have similar fragments, and the fragments come in groups.

The next step, optimizing the alignment, will be discussed some other time...

Tuesday, August 25, 2009

Hydrogen bonds in proteins

Hydrogen bonds play an important part in determining the 3D structure of proteins. The formation of H-bonds contributes to the stability of a protein. They provide a characteristic pattern in secondary structure elements that can be used for secondary structure assignment.

A hydrogen bond is formed when two electonegative atoms compete for the same hydrogen atom. While the hydrogen is covalently bound to one of the atoms, the donor, it interacts with the other atom, the acceptor. The possibility to interact with one atom, while covalently bound to another, is special for hydrogens. The electrostatic and covalent aspects of the bond cause the three atoms to stay mostly linear.

H-bonds in proteins most frequently involve the C=O and N-H groups of amino acids in the polypeptide chain. A single oxygen can simultaneously participate in two bonds.

Here again some example code that can calculate the energy of an H-bond between the N-H and C=O atoms of two amino acids using BioJava. It assumes that the coordinates for H-bonds were calculated as described in one of my earlier postings.


/** calculate HBond energy of two groups in cal/mol ...
    * see Creighton page 147 f
    *
    * Jeffrey, George A., An introduction to hydrogen bonding, Oxford University Press, 1997.
    * categorizes hbonds with donor-acceptor distances of
    * 2.2-2.5 Å as "strong, mostly covalent",
    * 2.5-3.2 Å as "moderate, mostly electrostatic",
    * 3.2-4.0 Å as "weak, electrostatic".
    *  Energies are given as 40-14, 15-4, and <4 kcal/mol respectively.
    *
    */


   public  double calculateHBondEnergy(AminoAcid one, AminoAcid two )
   throws StructureException{


      //System.out.println("calcHBondEnergy" + one + "|" + two);

      Atom N = one.getN();
      Atom H = one.getH();

      Atom O = two.getO();
      Atom C = two.getC();

      double dno = Calc.getDistance(O,N);
      double dhc = Calc.getDistance(C,H);
      double dho = Calc.getDistance(O,H);
      double dnc = Calc.getDistance(C,N);

      

      double contact = MINDIST ;

      //   there seems to be a contact!
      if ( (dno < contact) || (dhc < contact) || (dnc < contact) || (dno < contact)) {
         //System.out.println("!!! contact " + one + " " + two);
         return HBONDLOWENERGY ;
      }

      double e1 = Q / dho  - Q / dhc ;
      double e2 = Q / dnc  - Q / dno ;

      double energy = e1 + e2;

    

      // bond too weak
      if ( energy > HBONDHIGHENERGY)
         return 0;

      // test to avoid bond too strong
      if ( energy > HBONDLOWENERGY)
         return energy;

      return HBONDLOWENERGY ;

   }

Sunday, August 16, 2009

How to calculate H atoms for Nitrogens in protein structures

Only few protein structures contain coordinates for hydrogen (H) atoms. This is because X-ray crystallography usually does not resolve the positions of H atoms. They have only one electron and they barely scatter X-rays. Only a few structures have been determined with a high enough resolution to place the hydrogen atoms (< 1.2 Å). Some PDB files contain H atoms that have been modeled into the protein structure. Neutron Diffraction is a method that does allow to observe H atoms. An protein that has been determined with a combination of X-ray and Neutron Diffraction and that contains hydrogen atoms is PDB ID:3HGN .

If hydrogen atoms are required for calculations (e.g. to calculate hydrogen bond energies in protein structures) they need to be inferred from the available information. Here two possible ways to approximate positions for H-atoms that are bound to the Nitrogen (N) atom of an amino acid.

Let's start with having a look at the peptide plane. We can use vectors from the plane in order to calculate H atoms.

A simple way to infer the coordinates is by putting the H on the opposite side of the O atom that is bound to C. C coordinates are from amino acid i. N, Cα atoms from amino acid i+1, for which the H atom is being calculated. For the examples below we are using BioJava. It treats every atom as a vector. This means, if the code fragments below talk about atoms it is meant as a synonym for a vector in a 3D space.


Group a  = groups[i];
Group b  = groups[i+1];

// atom will be for group b.
Atom H = calcSimpleH(a.getC(),b.getN(),b.getCA());

/** Calculates the H atom for group B (i+1) */
private static Atom calcSimpleH(Atom c,Atom o, Atom n) throws StructureException{

Atom h = Calc.substract(c,o);
     double dist = Calc.getDistance(c,o);
     //System.out.println(dist);
     double x = n.getX() + h.getX() / dist;
     double y = n.getY() + h.getY() / dist;
     double z = n.getZ() + h.getZ() / dist;

     h.setX(x);
     h.setY(y);
     h.setZ(z);

     return h;
}

One issue with this first approach is that H is not necessarily directly opposite of the C-O.

An alternative method is shown below:

Use the unit vectors NC and NCα to get the direction of the vector and substract it from N.



Group a  = groups[i];
Group b  = groups[i+1];
Atom H = calc_H(a.getC(),b.getN(),b.getCA());


/**
* Calculates the H atom for group B (i+1)
*/
private static Atom calc_H(Atom C, Atom N, Atom CA) throws StructureException
{

     Atom nc  = Calc.substract(N,C);
     Atom nca = Calc.substract(N,CA);

     Atom u_nc  = Calc.unitVector(nc)   ;
     Atom u_nca = Calc.unitVector(nca);

     Atom added = Calc.substract(u_nc,u_nca);

     //(According to Creighton the distance N-H is 1.03 +/- 0.02 Å.)
     Atom U = Calc.unitVector(added);

     Atom H = Calc.add(N,U);

     return H;
}

If you are doing your own implementation of this I recommend to add a check that the distance N-H is really 1 Angstrom. Additionally write out a modified PDB file and use a 3D viewer to verify that the atom is placed on the correct side of the N.

Protein Sequence and Structure Bioinformatics