Multiple Structure Alignment Datastructures #278

Merged
lafita merged 76 commits into biojava:minor from sbliven:multaln
Jun 16, 2015

@sbliven
Member

@sbliven sbliven commented Jun 5, 2015

Introduces new data structures for structure alignments, created along with @lafita. The data structure can represent standard pairwise alignments, but also multiple alignments, flexible alignments, and non-topological alignments (#126).

The structure consists of a hierarchy of objects:

  1. MultipleAlignmentEnsemble: A collection of alignments over a set of structures.
  2. MultipleAlignment: A single alignment.
  3. BlockSet: A portion of the alignment with a single rigid superposition.
  4. Block: A portion of the alignment with preserved sequence order. Stores the actual aligned residues for each position (which may be gaps).
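
As a rough illustration of how the four levels nest, here is a minimal sketch. All class, field, and method names (SketchEnsemble, SketchAlignment, SketchBlockSet, SketchBlock, rows, coreLength) are hypothetical stand-ins chosen for this example and are not the classes added by this pull request. It assumes residues are stored as integer indices and that a gap is a null entry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-ins for the four levels (not the BioJava classes).
class SketchBlock {
    // rows.get(s).get(c): residue index of structure s at column c, or null for a gap.
    final List<List<Integer>> rows = new ArrayList<>();

    // Number of alignment columns in this block.
    int length() {
        return rows.isEmpty() ? 0 : rows.get(0).size();
    }

    // Counts columns in which every structure has an aligned residue (no gaps).
    int coreLength() {
        int core = 0;
        for (int col = 0; col < length(); col++) {
            boolean gapFree = true;
            for (List<Integer> row : rows) {
                if (row.get(col) == null) {
                    gapFree = false;
                    break;
                }
            }
            if (gapFree) {
                core++;
            }
        }
        return core;
    }
}

// A group of blocks that share one rigid superposition.
class SketchBlockSet {
    final List<SketchBlock> blocks = new ArrayList<>();
}

// A single alignment: one or more block sets.
class SketchAlignment {
    final List<SketchBlockSet> blockSets = new ArrayList<>();

    // Total number of columns over all blocks of all block sets.
    int length() {
        int total = 0;
        for (SketchBlockSet bs : blockSets) {
            for (SketchBlock b : bs.blocks) {
                total += b.length();
            }
        }
        return total;
    }
}

// Alternative alignments over the same set of structures.
class SketchEnsemble {
    final List<String> structureNames = new ArrayList<>();
    final List<SketchAlignment> alignments = new ArrayList<>();
}

class HierarchySketch {
    public static void main(String[] args) {
        SketchBlock block = new SketchBlock();
        block.rows.add(Arrays.asList(1, 2, null, 4)); // structure 0, gap at column 2
        block.rows.add(Arrays.asList(5, null, 7, 8)); // structure 1, gap at column 1

        SketchBlockSet blockSet = new SketchBlockSet();
        blockSet.blocks.add(block);
        SketchAlignment alignment = new SketchAlignment();
        alignment.blockSets.add(blockSet);
        SketchEnsemble ensemble = new SketchEnsemble();
        ensemble.alignments.add(alignment);

        System.out.println("length=" + alignment.length() + " core=" + block.coreLength());
    }
}
```

A pairwise rigid alignment would be the special case of one alignment with one block set and two rows per block.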

Some documentation still needs to be written and will be added to the cookbook.

A few other design decisions bear mention:

  • The ensemble stores references to the aligned atom arrays.
  • In the normal case, atom arrays can be regenerated from the structure names list, which is currently of type String but will change to StructureIdentifier following the completion of #81 (Make loading of structures more consistent).
  • All levels of the hierarchy can serve as a cache for various scores (e.g. RMSD, TM-score, etc.), but such scores are not standardized and should be recalculated when needed by client code.
  • The superposition matrices are now stored using vecmath Matrix4d objects. To support flexible alignments, the definitive matrices are stored in each BlockSet. However, a default matrix can be stored in MultipleAlignment to save memory for rigid alignments.
  • AFPChain can be converted directly to MultipleAlignmentEnsemble
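
One plausible reading of the matrix storage decision above is a fallback lookup: a BlockSet returns its own transformations if it has any, and otherwise the default held by its parent alignment. The sketch below illustrates that idea only. The names (TransformAlignment, TransformBlockSet, defaultTransforms, getTransforms) are hypothetical, and plain double[] arrays of 16 values stand in for vecmath Matrix4d objects to keep the example dependency-free.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parent level: holds one default transformation per structure.
class TransformAlignment {
    // Each double[16] is a stand-in for a 4x4 matrix (assumption for this sketch).
    List<double[]> defaultTransforms;
}

// Hypothetical child level: may override the parent's default.
class TransformBlockSet {
    private final TransformAlignment parent;
    private List<double[]> ownTransforms; // null means "use the parent's default"

    TransformBlockSet(TransformAlignment parent) {
        this.parent = parent;
    }

    void setTransforms(List<double[]> transforms) {
        this.ownTransforms = transforms;
    }

    // Flexible alignments set per-BlockSet matrices; rigid ones fall back to the default.
    List<double[]> getTransforms() {
        return ownTransforms != null ? ownTransforms : parent.defaultTransforms;
    }
}

class TransformFallbackSketch {
    public static void main(String[] args) {
        TransformAlignment alignment = new TransformAlignment();
        alignment.defaultTransforms = new ArrayList<>();
        alignment.defaultTransforms.add(new double[16]);

        TransformBlockSet rigid = new TransformBlockSet(alignment);
        TransformBlockSet flexible = new TransformBlockSet(alignment);
        flexible.setTransforms(new ArrayList<>());

        System.out.println(rigid.getTransforms() == alignment.defaultTransforms);
        System.out.println(flexible.getTransforms() == alignment.defaultTransforms);
    }
}
```

With this layout a rigid alignment with many block sets stores each matrix once, which matches the memory-saving motivation stated above.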

This pull request also bundles concurrent development of:

  • GUI improvements
  • The creation of a Monte Carlo-based optimization strategy for refining structural alignments
  • A new AtomCache.getRepresentativeAtoms() method (that should replace getAtoms() everywhere)

etc. etc. !

This is a fairly major feature addition, so I'll leave this request open for a few days to allow comments.

lafita and others added 30 commits April 20, 2015 18:10
The core data structures for the Multiple Alignment object have been
created: MultipleAlignment, BlockSet, Block, Pose.
The distanceMatrix is renamed to distanceTables to match with the
AFPChain nomenclature. The description of replaceOptAln has also been
changed to be more general.
The pose contains the translation and the rotationMatrix as information
of the 3D transformation of the proteins. A Demo for the display of the
multiple alignment has been created.
In order to generalize the 3D GUI features of the Structure Alignment
and implement a Multiple Alignment GUI for the new MultipleAlignment
object.
The multiple alignments can be visualized through the
MultipleAlignmentJmol class, adapted from the StructureAlignmentJmol.
The coloring of the different blocks and the alignment menus are still
not implemented.
Gaps are described by null values in the Blocks of the
MultipleAlignment. Now the Jmol class accounts for these gaps and does
not color them.
from the Pose class, because it is a static variable that does not
depend on the specific BlockSet. It only stores the intra-residue
distances of every protein.
The wrong line was commented out, so the molecule was not colored.
Adapted the display method in StructureAlignmentDisplay to rotate and
display in Jmol the atoms of a MultipleAlignment.
Minor changes to respond to TODOs
Interfaces for the classes Block, Pose and BlockSet have been created to
generalize and document all the methods needed for a MultipleAlignment
object.
The interfaces have been implemented again and the Jmol display also
works for the new MultipleAlignment DS composition.
Add some methods to calculate internal variables (update), and moved the
cache variables (RMSD, TM-score, similarity, coverage) from the
MultipleAlignment to these two classes.
Another layer in the OO data structure has been added to allow returning
alternative alignments. An ensemble of MSTA is a collection of
MultipleAlignment objects. Another change has been the addition of two
different implementations of Pose, one to determine global
superimpositions and another to determine flexible part
superimpositions.
When an object is created with the constructor and its parent is set,
the parent also gets a link to the object automatically.
The Ensemble can calculate the distance Matrices for every structure in
the updateDistanceMatrix() method. Automatic cross-references added to
the setParent() methods, for consistency.
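
The automatic cross-referencing described in the two commit messages above can be sketched as follows. ParentNode and ChildNode are hypothetical names used only here. Detaching the child from a previous parent is an assumption of this sketch and is not stated in the commit messages.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parent (for example, a block set that owns blocks).
final class ParentNode {
    private final List<ChildNode> children = new ArrayList<>();

    List<ChildNode> getChildren() {
        return children;
    }
}

// Hypothetical child that keeps the parent's list in sync with its own link.
final class ChildNode {
    private ParentNode parent;

    // Constructing a child with a parent registers it in that parent.
    ChildNode(ParentNode parent) {
        setParent(parent);
    }

    ParentNode getParent() {
        return parent;
    }

    void setParent(ParentNode newParent) {
        // Assumption: remove the stale link so the old parent does not keep the child.
        if (parent != null) {
            parent.getChildren().remove(this);
        }
        parent = newParent;
        // Add the back-reference once, avoiding duplicates.
        if (newParent != null && !newParent.getChildren().contains(this)) {
            newParent.getChildren().add(this);
        }
    }
}
```

The benefit of this pattern is that client code cannot create a child that points to a parent which does not know about it.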
All pairwise structural comparisons are evaluated to build the
background distance Matrices. Atoms can be rotated from Pose as well.
A new Pose abstract implementation has been created that calculates the
TM-score and RMSD of the alignment. The name of AlignmentJmol has been
changed to AbstractAlignmentJmol to make clear that it is an abstract class.
A new MultipleAlignment can now be constructed from an AFPChain.
It creates an equivalent alignment object, for backwards compatibility.
The clone methods now entirely change the links between the cloned and
the original objects so that no cross-links occur.
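
A minimal sketch of a copy that leaves no cross-links between the original and the copy is shown below. CopyAlignment, CopyBlockSet, and deepCopy are hypothetical names, and a plain copy method is used instead of Object.clone() to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parent level.
class CopyAlignment {
    final List<CopyBlockSet> blockSets = new ArrayList<>();

    CopyAlignment deepCopy() {
        CopyAlignment copy = new CopyAlignment();
        for (CopyBlockSet bs : blockSets) {
            CopyBlockSet bsCopy = bs.deepCopy();
            // Re-link the copied child to the copied parent, in both directions.
            bsCopy.parent = copy;
            copy.blockSets.add(bsCopy);
        }
        return copy;
    }
}

// Hypothetical child level holding some aligned residue indices.
class CopyBlockSet {
    CopyAlignment parent;
    final List<Integer> residues = new ArrayList<>();

    // Copies the data but leaves the parent unset; the caller decides the new parent.
    CopyBlockSet deepCopy() {
        CopyBlockSet copy = new CopyBlockSet();
        copy.residues.addAll(residues);
        return copy;
    }
}
```

Forgetting either half of the re-linking step (setting the child's parent, or adding the child to the parent's list) leaves the copy inconsistent, so both are done in one place.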
An initial implementation of the CEMC algorithm for multiple structure
alignment has been created. Now a seed MultipleAlignment can be created
with a parallel pairwise all-to-all alignment. The MC optimization is
still not implemented. A demo is available under the structure-gui
package.
In the transition to replace AFPChain with the MultipleAlignment class.
A core structure for the CEMC algorithm has also been created.
@andreasprlic
Member

Brilliant... time for a new release...

@sbliven
Member Author

sbliven commented Jun 8, 2015

Are we happy with the class placement? Maybe we should add a new align.multiple package?

sbliven and others added 20 commits June 8, 2015 11:10
Such sequences better belong in display code than in the model.
They have been moved to a new MultipleAlignmentTools utility class.
Packing by columns is needed instead of by rows.
The transformation calculated in AFPChain was not copied. Now the
information is converted into a Matrix4d and copied.
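
For readers unfamiliar with the representation, a rotation matrix plus a translation vector can be combined into a single 4x4 homogeneous matrix. The sketch below shows the arithmetic with plain arrays. The class and method names are hypothetical, and it assumes the column-vector convention x' = R x + t, which may differ from the convention AFPChain uses internally.

```java
// Hypothetical helper; plain arrays stand in for matrix classes.
final class HomogeneousSketch {

    // Builds the 4x4 matrix [R t; 0 0 0 1] from a 3x3 rotation and a translation.
    static double[][] toMatrix4(double[][] rot, double[] trans) {
        double[][] m = new double[4][4];
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 3; j++) {
                m[i][j] = rot[i][j];
            }
            m[i][3] = trans[i];
        }
        m[3][3] = 1.0;
        return m;
    }

    // Applies the transformation to a 3D point: x' = R x + t.
    static double[] apply(double[][] m, double[] p) {
        double[] out = new double[3];
        for (int i = 0; i < 3; i++) {
            out[i] = m[i][0] * p[0] + m[i][1] * p[1] + m[i][2] * p[2] + m[i][3];
        }
        return out;
    }

    public static void main(String[] args) {
        // 90 degree rotation about the z axis, followed by a shift of +1 along x.
        double[][] rot = { { 0, -1, 0 }, { 1, 0, 0 }, { 0, 0, 1 } };
        double[] trans = { 1, 0, 0 };
        double[] moved = apply(toMatrix4(rot, trans), new double[] { 1, 0, 0 });
        System.out.println(moved[0] + " " + moved[1] + " " + moved[2]);
    }
}
```

Storing one matrix per structure keeps rotation and translation together, so they cannot be copied or updated separately by mistake.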
Bug fixes, Class renames and Code organization
The sequence alignment method has been improved to introduce a gap
between blocks. Many more output conversions need to be implemented
(Web, Aligned Pairs, etc.)
The alignment panel is fully functional, and the sequence alignment to
Jmol connection is now possible for MultipleAlignments.
It was only used for problems loading the Atom arrays and to check
consistency in some parts of the calculations, but the usage was not
clearly defined. 
The exceptions have been replaced by NullPointerException and
IllegalStateException, respectively. Since these are Java runtime
exceptions, they do not need to be declared in method signatures.
The old exceptions were never caught and had to be propagated everywhere,
so the change does not affect the behavior of the code,
but simplifies it.
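
To illustrate why unchecked exceptions simplify the calling code, here is a small sketch. AtomHolder and its methods are hypothetical and only mimic the two situations mentioned above (a missing atom array and an inconsistent state). Neither method needs a throws clause, so callers are not forced to catch or re-declare anything.

```java
import java.util.Objects;

// Hypothetical holder for atom arrays; String[] stands in for the real atom type.
final class AtomHolder {
    private String[] atoms;

    // NullPointerException is unchecked, so no "throws" clause is required.
    void setAtoms(String[] atoms) {
        this.atoms = Objects.requireNonNull(atoms, "atom array must not be null");
    }

    // IllegalStateException signals that the object is used before it is ready.
    String[] getAtoms() {
        if (atoms == null) {
            throw new IllegalStateException("atom arrays have not been loaded");
        }
        return atoms;
    }
}
```

Callers that can recover may still catch these exceptions, but code that cannot recover no longer has to propagate a checked exception through every signature.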
FatCat result, Alignment Residues (as Pairs) and FASTA format.
Two tests to check the correctness of the MultipleAlignment DS have been
implemented. Some bugs have been detected and fixed in the code while
writing the tests.
Implementations for MultipleAlignment DS
lafita added a commit that referenced this pull request Jun 16, 2015
Multiple Structure Alignment Datastructures
@lafita lafita merged commit 85378c8 into biojava:minor Jun 16, 2015
lafita added a commit that referenced this pull request Jun 17, 2015
The ReferenceSuperimposer now can calculate the transformation of each
individual BlockSet in case there are several.
A bug with the MultipleAlignment clone() method has been fixed (the
BlockSets and Blocks were not added to the parent Lists).
Improve documentation of the DataStructure.
lafita added a commit to lafita/biojava that referenced this pull request Jun 25, 2015
Parameters, StartupParameters and UserArgumentProcessor classes.
The old CEMC classes have been given a more general name, since the new
version supports any pairwise algorithm to generate the seed.
@sbliven sbliven deleted the multaln branch June 15, 2016 08:25