Protein Sequence and Structure Bioinformatics: BioJava


Friday, April 25, 2014

Systematic Detection of Internal Symmetry in Proteins Using CE-Symm

In our latest paper, "Systematic Detection of Internal Symmetry in Proteins Using CE-Symm", we are taking a look at how internal symmetry in proteins is related to protein function. A large number of proteins have symmetry not only in their biological assemblies, but also within their tertiary structures. To investigate the question of how internal symmetry evolved, how symmetry and function are related, and the overall frequency of internal symmetry, we developed a new algorithm that can detect pseudo-symmetry within tertiary structures of proteins. Our results indicate that more domains are pseudo-symmetric than previously estimated. We establish a number of recurring types of symmetry–function relationships and describe several characteristic cases in detail.

Read more over at JMB.

Several protein domains with internal symmetry that CE-Symm detects. Coloring is by symmetry unit.

Tuesday, March 25, 2014

BioJava 3.0.8 released

BioJava 3.0.8 was released on March 25th 2014 and is available from http://biojava.org/wiki/BioJava:Download as well as from the BioJava maven repository at http://www.biojava.org/download/maven/

This release would not have been possible without contributions from 13 developers, thanks to all for their support!

BioJava 3.0.8 includes a lot of new features as well as numerous bug fixes and improvements.

New Features:

new Genbank writer
new parser for Karyotype file from UCSC
new parser for Gene locations from UCSC
new parser for Gene names file from genenames.org
new module for Cox regression code for survival analysis
new calculation of accessible surface area (ASA)
new module for parsing .OBO files (ontologies)
improved representation of SCOP and Berkeley-SCOP classifications

For a detailed comparison see here:

https://github.com/biojava/biojava/compare/biojava-3.0.7...biojava-3.0.8

For the next release we are planning some refactoring and removal of code that has been deprecated for a long time. As such the next release will be named 3.1.0.

About BioJava:

BioJava is a mature open-source project that provides a framework for processing of biological data. BioJava contains powerful analysis and statistical routines, tools for parsing common file formats, and packages for manipulating sequences and 3D structures. It enables rapid bioinformatics application development in the Java programming language.

Happy BioJava-ing,

Andreas

Friday, November 30, 2012

BioJava 3.0.5 released

BioJava 3.0.5 has been released and is available from http://www.biojava.org/wiki/BioJava:Download as well as from the BioJava maven repository at http://www.biojava.org/download/maven/ .

New Features:

- New parser for CATH classification

- New parser for Stockholm file format

- Significantly improved representation of biological assemblies of protein structures. BioJava can now re-create the biological assembly from the asymmetric unit

- Several bug fixes

Thanks to Daniel Asarnow for contributing the CATH parser and Amr Al Hossary and Marco Vaz for their contributions to the Stockholm parser.

Sunday, October 28, 2012

RCSB PDB web site update Fall 2012

New Features at the RCSB PDB web site

This week the RCSB PDB released its latest major web site update. Here is a quick description of some of the new features.

Protein Feature View

One of the main new features is the new Protein Feature View. It allows comparing the full-length protein sequence, as defined by UniProt, with the regions that have been determined in 3D and are available together with their coordinates from the Protein Data Bank. Besides visualizing the PDB and UniProt relationships, the new view adds further annotations for a more comprehensive understanding of the protein. External data such as Pfam domains or regions for which homology models are available from the ProteinModelPortal are indicated. There are also some annotations that are calculated on the fly: protein disorder regions, as predicted by Peter Troshin's BioJava implementation of RONN, are available as a histogram-style track. Finally, regions with increased hydrophobicity can be spotted by looking at the Hydropathy track.
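
A hydropathy track of this kind is typically a sliding-window average over per-residue hydrophobicity values. The following is a minimal sketch of that idea, not the code behind the RCSB PDB view: the post does not name the scale or the window size, so the Kyte-Doolittle scale and the window of 5 used here are assumptions, and the class and method names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Sliding-window hydropathy profile; an illustrative sketch only. */
final class HydropathySketch {

    // Kyte-Doolittle hydropathy values (assumed scale; the post does not name one).
    private static final Map<Character, Double> KD = new HashMap<>();
    static {
        String aa = "ARNDCQEGHILKMFPSTWYV";
        double[] v = {1.8, -4.5, -3.5, -3.5, 2.5, -3.5, -3.5, -0.4, -3.2, 4.5,
                      3.8, -3.9, 1.9, 2.8, -1.6, -0.8, -0.7, -0.9, -1.3, 4.2};
        for (int i = 0; i < aa.length(); i++) {
            KD.put(aa.charAt(i), v[i]);
        }
    }

    /** Returns the mean hydropathy of every full window along the sequence. */
    static double[] profile(String sequence, int window) {
        if (window < 1 || window > sequence.length()) {
            throw new IllegalArgumentException("window must be in 1..sequence length");
        }
        double[] out = new double[sequence.length() - window + 1];
        double sum = 0.0;
        for (int i = 0; i < sequence.length(); i++) {
            sum += score(sequence.charAt(i));
            if (i >= window) {
                // Drop the residue that just left the window.
                sum -= score(sequence.charAt(i - window));
            }
            if (i >= window - 1) {
                out[i - window + 1] = sum / window;
            }
        }
        return out;
    }

    // Unknown residue codes contribute 0.0 (a simplification of this sketch).
    private static double score(char residue) {
        return KD.getOrDefault(Character.toUpperCase(residue), 0.0);
    }

    public static void main(String[] args) {
        double[] p = profile("MKTLLILAVVAAALA", 5);
        for (int i = 0; i < p.length; i++) {
            System.out.printf("%d\t%.2f%n", i + 1, p[i]);
        }
    }
}
```

Windows with a high average point to hydrophobic stretches; a wider window smooths the track at the cost of resolution.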

The Protein Feature View is built using SVG graphics and makes extensive use of the jQuery-SVG library. Using SVG graphics for a prominent feature on the site (it is on every protein-explorer page) has become possible because the majority of modern browsers now support this type of graphics. However, there are still a number of users who are stuck with old browser versions. According to our web site traffic logs, this number is rapidly declining and we estimate that currently less than 15% of our users can't use the new view. These users won't see error messages on the protein-explorer page, though. The graphics will simply not be visible, and the page gracefully falls back to the way it looked before the graphics were introduced.

Better Pfam integration

Another new feature of this release is a better integration with Pfam. Pfam family names are now searchable and one can quickly look up all protein structures related to these families. Since Pfam is used in structural genomics projects to prioritize targets for crystallization, a possible use case is to look up domains of unknown function (DUFs) and whether 3D coordinates have already been determined for them. As already mentioned above, Pfam domains can be viewed as part of the new Protein Feature View. Up-to-date Pfam-PDB mappings are calculated weekly by submitting newly released PDB entries to the HMMER3 web site. The details of this process are described at the Pfam blog site.

Searching and Reporting

Other improvements in this RCSB PDB web site update concern searching and reporting. RCSB searches now better support poly-proteins and their sub-components (see screenshot above). There is also better support for searching drug names, and more information about drugs, coming from DrugBank, on the Ligand Summary page (e.g. Lipitor). Once a search has been performed, there are now four different types of reports available for investigating the results. Besides the "traditional" search results there is now a "condensed" view, which provides a compact summary of results. The "gallery" provides images for the proteins that have been found in the search. A "timeline" gives a historic overview of when proteins were released in the PDB.

A full description of all the new features is (as always) available on the What's New Page.

Friday, August 10, 2012

BioJava 2012 paper published

Today the latest BioJava paper was published, describing the BioJava version 3 series.

Thanks to all developers for their contributions, it would not have been possible without them!

Abstract:

http://bioinformatics.oxfordjournals.org/cgi/content/abstract/bts494?ijkey=BzJOy9GgM2XNw07&keytype=ref

PDF:

http://bioinformatics.oxfordjournals.org/cgi/reprint/bts494?ijkey=BzJOy9GgM2XNw07&keytype=ref

Citation:

BioJava: an open-source framework for bioinformatics in 2012

Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius
Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock
Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L.
Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis

Bioinformatics 2012; doi: 10.1093/bioinformatics/bts494

Monday, May 21, 2012

BioJava 3.0.4 released

BioJava 3.0.4 just hit the servers. This is mainly a bug-fix release addressing a few issues with the protein structure and the disorder modules.

One new feature is that SCOP can now be either accessed from the original SCOP site in the UK or the Berkeley version.

Sunday, March 18, 2012

GSoC 2012 - how to get started with a proposal

To get started with a proposal I recommend looking at the BioJava
project proposals from the last two years (and here) to
see what kind of projects got funded and how those proposals were
written. Think about what you would like to work on. Get a copy of
BioJava and see how related features work. Come up with a plan
for how to extend them.

We are fairly flexible regarding what kind of projects we will run
this summer and this really depends on the submitted project
proposals. All proposals will be compared and ranked together with
other projects from the Bio* projects. As such a good proposal is key
to get funded.

A good proposal

- shows the motivation of the student
- shows that the candidate is qualified to do what they are proposing
- adds useful new functionality to BioJava
- discusses possible risks and what to do about them

It is difficult to answer questions like "how should I perform this or
that project?" There is more than one possible path, and the best
answer depends on your skills and interests.
Overall I recommend picking a project on a topic that is close to your
(future?) thesis, or is of particular interest to you.

Here are a couple more thoughts which are project specific:

- The best projects are those that you come up with yourself. If you
want to distinguish yours from every other proposal, suggest something
which is not on our list.

- File parsers:

If you want to work on file parsers, take a look at existing ones. What
features do they provide? How can they be extended? For example, if you
want to work on the CATH parser, take a look at how the SCOP parser
works. What features are available around this (access to domains) and
how can something like this be set up for CATH? Look at how the CATH
website provides files.

- Porting of algorithms:

There are several possible approaches to this. I recommend that
you have some background in both C and Java. Get a
copy of the algorithm you want to port, compile it, and take a look at
the source. There are several ways to proceed with the actual port,
and having a good strategy is key for this proposal. Perhaps
try your strategy on a simple test case to see how it
might work.

- BioJava in the cloud

The goal here is parallelization of existing code. What parts of
biojava are suitable for this? How can they be parallelized and moved
to current cloud infrastructure? There is a lot of online material
available for this which will be helpful here.
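
As a starting point for thinking about parallelization, here is a minimal sketch of the usual pattern: independent pairwise computations are submitted to a thread pool and the results are collected into a matrix. Everything in it is hypothetical and for illustration only. The `identities` method merely stands in for an expensive comparison such as an alignment, and none of the names come from BioJava.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** All-vs-all comparison on a fixed thread pool; an illustrative sketch only. */
final class ParallelPairsSketch {

    /** Stand-in for an expensive pairwise comparison: counts identical positions. */
    static int identities(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int same = 0;
        for (int i = 0; i < n; i++) {
            if (a.charAt(i) == b.charAt(i)) {
                same++;
            }
        }
        return same;
    }

    /** Fills a symmetric matrix with one task per pair; the diagonal stays 0. */
    static int[][] allVsAll(List<String> seqs, int threads) {
        int n = seqs.size();
        int[][] result = new int[n][n];
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<int[]>> futures = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                for (int j = i + 1; j < n; j++) {
                    final int a = i;
                    final int b = j;
                    // Each task is independent, so the tasks can run in any order.
                    futures.add(pool.submit(
                        () -> new int[] {a, b, identities(seqs.get(a), seqs.get(b))}));
                }
            }
            for (Future<int[]> f : futures) {
                int[] r = f.get();
                result[r[0]][r[1]] = r[2];
                result[r[1]][r[0]] = r[2];
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for results", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("comparison failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return result;
    }
}
```

Code that decomposes into independent tasks like this is the easiest to move to a cluster or cloud setup, since each task could just as well run on a different machine.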

Friday, March 16, 2012

BioJava at Google Summer of Code 2012

The Open Bioinformatics Foundation, as an umbrella organisation for
BioJava, has been accepted to participate in this year's Google Summer
of Code.

This means we will again be able to offer mentoring through BioJava
this year. Accepted students will get a stipend of $5,000 from Google.
Participation is possible from most countries in the world, as long as
you are eligible to work in the country in which you'll reside
throughout the duration of the program.

If you are interested in working on a BioJava related project, now is
the time to start preparing and discussing your proposals. For the
last two years we had many applications for the projects proposed by
mentors. If you want to distinguish your application I recommend
proposing your own project. Don't forget to discuss any proposal with
us before you submit it. We will try to provide feedback and match
you with a suitable mentor.

Also see http://biojava.org/wiki/Google_Summer_of_Code and Google's
FAQs: http://www.google-melange.com/document/show/gsoc_program/google/gsoc2012/faqs

The student application deadline is April 6th. Google will announce
which proposals got accepted on April 23rd.

BioJava 3.0.3 released

BioJava 3.0.3 has been released and is available from
http://www.biojava.org/wiki/BioJava:Download as well as from the
BioJava maven repository at http://www.biojava.org/download/maven/ .

New Features

BioJava 3.0.3 adds several new features

- Significant improvements for the web service module (ncbi blast and
hmmer web services)

- Fastq parser (ported from the biojava 1 series to version 3)

- Support for SIFTS-PDB to UniProt mapping

- Improved support for working with external protein domain definitions

- Protmod module renamed to modfinder

- Numerous improvements all over the place (several hundred commits
since last release)

- We are also working on an update for the legacy biojava 1.8 series.

This release would not have been possible without contributions from
numerous people, thanks to all for their support!

Happy BioJava-ing!

Tuesday, December 28, 2010

BioJava 3.0 released

Today we released BioJava 3.0. It is available from http://biojava.org/wiki/BioJava:Download.
Over the last year BioJava has undergone a major re-write. It has been modularized into small, re-usable components and a number of new features have been added. The new approach, modeled after the apache commons, minimizes dependencies and allows for easier contribution of new components.

At present the main modules are:

biojava3-core: The core module offers the basic tools required for working with biological sequences of various types (DNA, RNA, protein). Besides file parsers for popular file formats it provides efficient data structures for sequence manipulation and serialization.

biojava3-genome: The genome module provides support for reading and writing of gtf, gff2, gff3 file formats

biojava3-alignment: This module provides implementations for pairwise and multiple sequence alignments (MSA). The implementation for MSA provides a flexible and multi-threaded framework that works in linear space and that, as an option, allows the users to define anchors that are used in the build up of the multiple alignment.
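
To illustrate what "linear space" means for alignments, here is a minimal sketch of a global (Needleman-Wunsch style) alignment score that keeps only two rows of the dynamic programming matrix in memory. This is not the biojava3-alignment implementation: the class name, the simple match/mismatch scoring and the linear gap penalty are assumptions made for the example. It returns only the score; recovering the alignment itself in linear space requires an additional divide-and-conquer step (Hirschberg's algorithm).

```java
/** Global alignment score with two rolling rows; an illustrative sketch only. */
final class LinearSpaceAlignmentSketch {

    /**
     * Returns the optimal global alignment score of a and b.
     * Memory use is proportional to the length of b, not to the product of both lengths.
     */
    static int globalScore(String a, String b, int match, int mismatch, int gap) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // First row: aligning the empty prefix of a against prefixes of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j * gap;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i * gap;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch;
                int diagonal = prev[j - 1] + substitution;
                int up = prev[j] + gap;        // gap in b
                int left = curr[j - 1] + gap;  // gap in a
                curr[j] = Math.max(diagonal, Math.max(up, left));
            }
            // Reuse the two arrays instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(globalScore("GATTACA", "GATTACA", 1, -1, -2));
    }
}
```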

biojava3-structure: The 3D protein structure module provides parsers and a data model for working with PDB and mmCIF files. New features in this release are the implementation of the CE and FATCAT structural alignment algorithms and the support of chemical component definition files, for a chemically and biologically correct representation of modified residues and ligands.

biojava3-protmod: The protein modification module can detect more than 200 protein modifications and crosslinks in 3D protein structures. It comes with an XML file and Java data structures to store information about different types of protein modifications collected from PDB, RESID, and PSI-MOD.

Not every feature of the BioJava 1.X code base was migrated over to BioJava 3.0. A modularized version of the 1.X sources is available as a new "biojava-legacy" project.

Friday, October 8, 2010

BioJava's Google Summer of Code summary

Today a slightly belated summary of what happened at the Google Summer of Code at the BioJava project:

Our two students Mark Chapman and Jianjiong Gao did an amazing job on their two projects "All Java Multiple Sequence Alignment" (MSA) and "Identification and Classification of Posttranslational Modification of Proteins" (PTM).

For Multiple Sequence Alignments we now have a flexible and multi-threaded MSA implementation that works in linear space and that, as an option, allows the users to define anchors that are used in the build up of the multiple alignment. The code is available as part of the new biojava3-alignment module.

The Posttranslational Modification module (biojava3-protmod) can detect three different types of protein modifications in protein structures. It comes with an XML file & Java data structures to store information about different types of protein modifications, and contains entries from RESID, PDBCC and PSI-MOD. There is also a visualisation component to display cross linked PTM on a sequence viewer.

Both Mark and Jianjiong have expressed their interest in maintaining and further developing their modules and I am looking forward to interacting more with them in the future. I want to thank the Mentors and Co-Mentors Peter Rose, Kyle Ellrott and Scooter Willis for their help and guidance for the projects, without them this would not have been possible. Thanks also to Robert Buels and the Open Bioinformatics Foundation for organizing our applications for GSoC and last, but not least, Google for sponsoring this Summer of Code.

Tuesday, June 1, 2010

biojava-structure now supports Chemical Component Dictionary

I have updated the BioJava structure data model to support the PDB
chemical component dictionary. This has the benefit that now

* Chemically modified amino acids can be detected (and treated as
amino acids, rather than Hetatom groups)
* It is possible to get a component type for each Group, which allows
to identify ligands.

As a consequence the number of amino acids in a chain can change compared
to the previous data representation. As such the loading of chemical
components is set to "false" by default. It can be configured with the
"loadChemCompInfo" flag in the PDB/mmCIF file parsers.

PDB ID 1A4W - Thrombin with Thiazole-containing Inhibitors. Image source: RCSB PDB

An example where this representation makes a difference is PDB ID 1A4W. This structure contains several Ligands and a chemically modified residue. Without the help of the Chemical Component Dictionary it would have been difficult to correctly represent this protein.
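
Why the amino acid count can change is easiest to see with a toy example. The sketch below does not use the BioJava API. It is a self-contained illustration with a hypothetical two-entry "dictionary", in which a modified residue such as MSE (selenomethionine) is only counted as an amino acid once the component type is known. The class, enum and method names are made up, and the `loadChemCompInfo` parameter merely mirrors the name of the flag mentioned above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of dictionary-aware residue counting; an illustrative sketch only. */
final class ChemCompSketch {

    enum ComponentType { PEPTIDE_LINKING, NON_POLYMER }

    private static final Set<String> STANDARD = new HashSet<>(Arrays.asList(
        "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
        "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL"));

    // Hypothetical stand-in for the chemical component dictionary.
    private static final Map<String, ComponentType> DICTIONARY = new HashMap<>();
    static {
        DICTIONARY.put("MSE", ComponentType.PEPTIDE_LINKING); // modified residue
        DICTIONARY.put("HEM", ComponentType.NON_POLYMER);     // ligand
    }

    /** Counts the groups treated as amino acids, with or without the dictionary. */
    static int countAminoAcids(List<String> groups, boolean loadChemCompInfo) {
        int count = 0;
        for (String name : groups) {
            if (STANDARD.contains(name)) {
                count++;
            } else if (loadChemCompInfo
                    && DICTIONARY.get(name) == ComponentType.PEPTIDE_LINKING) {
                // Only the dictionary tells us this hetero group is really a residue.
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> chain = Arrays.asList("MSE", "ALA", "GLY", "HEM");
        System.out.println(countAminoAcids(chain, false));
        System.out.println(countAminoAcids(chain, true));
    }
}
```

With the dictionary switched off, MSE falls into the same bucket as the ligand. With it switched on, the count goes up by one, which is the change in chain length described above.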

You can get the code either from BioJava SVN, or from the (still slightly experimental) Maven repository at http://www.biojava.org/download/maven/ .

Thursday, March 18, 2010

Google Summer of Code 2010

Our (the Open Bioinformatics Foundation's) application for the Google Summer of Code has been accepted. http://socghop.appspot.com/gsoc/program/accepted_orgs/google/gsoc2010

I am offering to mentor a project to develop an All-Java Multiple Sequence Alignment algorithm as part of the BioJava project.

If you are an interested and skilled student, take a look at the project description, and if you think you are up for the challenge, send me an email with your application.

http://biojava.org/wiki/Google_Summer_of_Code

Friday, January 22, 2010

BioJava Hackathon - Last Day

Today was the last day of the BioJava Hackathon. It has been an exciting week and we made progress along several lines, which I will talk about in a moment. Special thanks go to Jonathan Warren for organizing the meeting room at the Sanger Institute. Also thanks to our hackers, without whom this hackathon would not have been possible. In particular thanks to Scooter Willis, Jules Jacobsen, Andy Yates, Jonathan Warren, Christoph Gille and Matias Piipari for participating during the week, and to our special guests who joined us for a day, Richard Holland and Jim Procter.

All the code that has been written is available through the new modules labeled with the biojava3 name. Most work was related to the new sequence and protein structure modules:

Sequence modules

There have been a lot of discussions over the last years about the current way sequences are represented. As such the "sequence guys" among the developers worked on a new design that provides a biologically meaningful (think central dogma) representation of sequences. What is still missing are file parsers using the new modules. The first fasta parser is about to be committed by Scooter as I am writing this. There is still more work required before the code will be ready for the next release. Still, this is the beginning of a new data representation which should make the code base ready for the next couple of years.

Structure modules

The protein structure modules are the part of BioJava3 that is closest to being released. During this week we added the CE algorithm for protein structure alignment, implemented core interfaces for a generic Model View Control wrapping of various 3D visualization tools, and added better support for chemically modified residues (like MSE) and natural ones like selenocysteine. They are now treated as amino acids. We also refactored the code base to have the structure data model clearly separated from the new graphical user interfaces. This gui module now provides a nice way of calculating and visualizing protein structure alignments.

Next BioJava release (3.0)

There is still more work required to push the new sequence module to a state where it can be released. We also did not write any documentation this week, so that will have to be added later on. We will try to bring the modules to a state where they can be released over the next weeks. Once a module is release-ready, a detailed summary of the new features will be posted to the mailing list. In any case there will be a BioJava 3.0 release in time for the ISMB/BOSC conference, as we have done in recent years.

Wednesday, January 20, 2010

BioJava Hackathon - Day 3 - Structure Modules

Today the main new feature in the structure modules is the release of a Java port of the Combinatorial Extension (CE) algorithm. This contains both a version of the algorithm that can be run from command line, as well as a GUI to view the results and trigger custom alignments. Essentially this is what is available from the RCSB website from: http://www.rcsb.org/pdb/workbench/workbench.do

About the generic design for Model View Control for 3D viewers: a currently unsolved problem is how to deal with selections. Selecting ranges, chains or atoms in proteins is done through a scripting interface in PyMOL or Jmol. Should we have a scripting interface (based on the syntax of one of these), or should we have multiple select methods that accept various arguments? Jules Jacobsen wrapped the Jmol-Biojava interface using the new interface definitions for the MVC.

Tuesday, January 19, 2010

BioJava Hackathon - Day 2

Yesterday's contributor who added the most lines of code is Michael Heuer, who is joining the hackathon remotely (i.e. from somewhere in the US). He added the new FASTQ parser to BioJava. Well done Michael!

During the morning session we did a "Post Up", a silent and structured way of brainstorming, in order to come up with new requirements for bringing the sequence modules up to the state of the art. Scooter moderated a discussion where we focused on biologically meaningful representations of biological sequences. A Chemical Compound will be at the core of any sequence representation and we want to have different types of sequences like Chromosome sequence, Scaffold, DNA, RNA, Protein, and Sugars.
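
To make the "compound at the core" idea more concrete, here is a minimal sketch of what such a design could look like: a sequence is a typed list of compounds, so a DNA sequence and an RNA sequence are different types and transcription becomes a typed conversion. All names below are hypothetical and only illustrate the idea discussed at the hackathon. This is not the interface that ended up in the biojava3 modules.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Sequences as typed lists of compounds; an illustrative sketch only. */
final class CompoundSequenceSketch {

    interface Compound {
        char symbol();
    }

    enum DnaBase implements Compound {
        A, C, G, T;
        public char symbol() { return name().charAt(0); }
    }

    enum RnaBase implements Compound {
        A, C, G, U;
        public char symbol() { return name().charAt(0); }
    }

    /** An immutable sequence over one compound type. */
    static final class Sequence<C extends Compound> {
        private final List<C> compounds;

        Sequence(List<C> compounds) {
            this.compounds = Collections.unmodifiableList(new ArrayList<>(compounds));
        }

        int length() { return compounds.size(); }

        C get(int index) { return compounds.get(index); }

        String asString() {
            StringBuilder sb = new StringBuilder();
            for (C c : compounds) {
                sb.append(c.symbol());
            }
            return sb.toString();
        }
    }

    /** Parses a DNA string; throws IllegalArgumentException on unknown letters. */
    static Sequence<DnaBase> dna(String text) {
        List<DnaBase> bases = new ArrayList<>();
        for (char ch : text.toUpperCase().toCharArray()) {
            bases.add(DnaBase.valueOf(String.valueOf(ch)));
        }
        return new Sequence<>(bases);
    }

    /** Transcribes a coding-strand DNA sequence into RNA (T becomes U). */
    static Sequence<RnaBase> transcribe(Sequence<DnaBase> coding) {
        List<RnaBase> bases = new ArrayList<>();
        for (int i = 0; i < coding.length(); i++) {
            DnaBase b = coding.get(i);
            bases.add(b == DnaBase.T ? RnaBase.U : RnaBase.valueOf(b.name()));
        }
        return new Sequence<>(bases);
    }

    public static void main(String[] args) {
        System.out.println(transcribe(dna("GATTACA")).asString());
    }
}
```

The benefit of typing the sequence by its compound is that the compiler rejects biologically meaningless operations, such as transcribing a protein sequence.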

We started with test-driven development for the new sequence interfaces; next we will wrap the existing sequence code with the new interfaces.

On the 3D structure side of things, we added a new 3D structure-gui module that is going to provide the Model View Control interface for the various open source viewers.

Monday, January 18, 2010

BioJava Hackathon - Day 1 part 2

Continuation of Day 1...

We had more discussion about how to deal with the sequence modules, bytecode dependencies of the core module and related topics. Seems there is a general agreement about moving the current sequence code out of the core module into its own space. Will continue tomorrow morning, when Richard Holland is back.

On a different side of things, Christoph Gille, Jules Jacobsen and I were discussing how to provide a Model View Control interface for using various open source 3D visualization libraries (Jmol, RCSB Libraries, Astex Viewer) together with Biojava.

We spent a lot of time discussing today, hope to be able to get more code done tomorrow.

BioJava Hackathon - Day 1

Hi,

I am going to blog every day about the BioJava Hackathon, so you can stay updated with what is happening here in Cambridge.

In the morning I gave this presentation around which we had several discussions about what are the most critical issues we want to solve. The issues are:

- Installation problems: Getting the latest checkout of the new Maven based build system causes problems for some of us. Sorting out the installation procedure is a major topic of the afternoon. It works successfully with the latest Eclipse, the m2eclipse plugin and the subclipse plugin. Some of the NetBeans based developers also reported no problems during installation.
- Features: The BioJava features should become first-class citizens. This means it should be possible to instantiate them independently of sequence objects.
- Simplify Sequences: Sequences should be Strings as far as possible. Only convert them to Sequence objects if required.
- Documentation: Some of the BioJava 3 documentation is not up to date and can lead to misunderstandings. The latest BioJava 3 code is available in the trunk.
- Memory efficiency: Make sure that iterating over RichSequences is memory efficient. (Fix a memory leak there.)
- Bytecode: The BioJava core module should not require the Bytecode module.

Andy Yates is tweeting about it at http://twitter.com/search?q=%23biojava

Saturday, January 16, 2010

BioJava Hackathon 2010

I am off to Cambridge, U.K. where we will have the BioJava Hackathon next week. I am planning to blog on a regular basis about what is going on there.

Sunday, January 10, 2010

Protein Comparison Tool

In recent months I spent some time developing the new RCSB PDB Protein Comparison Tool (you can see an example of it on the right-hand menu of this blog).

In particular I spent a lot of time porting the CE and FATCAT algorithms from C to Java and developing a new user interface. Check out the latest version at http://betastaging.rcsb.org/pdb/workbench/workbench.do . (E.g. try to align 4HHB chain A and 4HHB chain B.)

Having the algorithms in Java opens the door for a number of nice applications. It is now possible to launch the structure comparison application with a single mouse click using the Java Web Start technology.