Skip to content

StructureIdentifier framework#376

Merged
sbliven merged 9 commits into
biojava:masterfrom
sbliven:fix81_squashed
Jan 25, 2016
Merged

StructureIdentifier framework#376
sbliven merged 9 commits into
biojava:masterfrom
sbliven:fix81_squashed

Conversation

@sbliven

@sbliven sbliven commented Jan 20, 2016

Copy link
Copy Markdown
Member

Add substantially new unified class for identify structures: StructureIdentifier

This unifies various prior methods for identifying a structure from a string, which was
previously spread over AtomCache methods and StructureTools/StructureIO utilities. It
includes PDB IDs, residue range specifications, URLs, SCOP/CATH/PDP domains, and files.

Major Classes:

  • StructureIdentifier is an interface for methods which know how to load a structure from
    some resource and reduce it to the requested substructure. The most important property
    is the identifier, which can be an arbitrary string.
  • SubstructureIdentifier is considered the canonical way to specify a structure and
    is the major implementation for chain and residue level substructures. All identifiers
    can be converted to a SubstructureIdentifier.
  • StructureName is suitable for wrapping user-supplied identifiers, and it dispatches the
    request to a more specific class based on a guess as to the type (PDB, URL, file, etc).

StructureIdentifiers can represent arbitrary strings (e.g. domain IDs). These are converted
(possibly through some relatively slow process like downloading and parsing a file) into
standard-format SubstructureIdentifier instances, which should be easier to serialize and
recreate.

Detailed changes:

  • Substantially changes StructureIdentifier, which existed but wasn't used anywhere.
    • Remove getPdbId() method, since not globally relevant. It's still present in most
      implementations, and can always be accessed via toCanonical().getPdbId()
  • AtomCache
    • Accept chain indices ("4HHB.0") with a warning if there was not
      already a chain with that ID.
    • Removes support for the strictScop parameter. The new default is effectively strictScop=false. This can be overridden by getting a ScopDomain directly from a particular ScopDatabase, rather than using generic methods like AtomCache.getStructure(String), which use the StructureName class for fuzzy matching of domain names.
  • Better guessing of PDB/CIF filetypes from filename (but not full autodetection from contents, as requested in Auto-detecting file formats #232)

…eIdentifier

This unifies various prior methods for identifying a structure from a string, which was
previously spread over AtomCache methods and StructureTools/StructureIO utilities. It
includes PDB IDs, residue range specifications, URLs, SCOP/CATH/PDP domains, and files.

Major Classes:
* StructureIdentifier is an interface for methods which know how to load a structure from
  some resource and reduce it to the requested substructure. The most important property
  is the identifier, which can be an arbitrary string.
* SubstructureIdentifier is considered the canonical way to specify a structure and
  is the major implementation for chain and residue level substructures. All identifiers
  can be converted to a SubstructureIdentifier.
* StructureName is suitable for wrapping user-supplied identifiers, and it dispatches the
  request to a more specific class based on a guess as to the type (PDB, URL, file, etc).

StructureIdentifiers can represent arbitrary strings (e.g. domain IDs). These are converted
(possibly through some relatively slow process like downloading and parsing a file) into
standard-format SubstructureIdentifier instances, which should be easier to serialize and
recreate.

Detailed changes:

* Substantially changes StructureIdentifier, which existed but wasn't used anywhere.
  * Remove getPdbId() method, since not globally relevant. It's still present in most
    implementations, and can always be accessed via toCanonical().getPdbId()
* AtomCache
  * Accept chain indices ("4HHB.0") with a warning if there was not
    already a chain with that ID.


This commit rebases and squashes pre-4.1 development from sbliven's fix81 branch
culminating in 007ea6e. For posterity, the original commit messages are listed below.
Some changes may be omitted or modified when resolving the rebase.

Commit messages (oldest to youngest):

---

Major changes to StructureIdentifier (biojava#81)

Redefines StructureIdentifier as something which transforms a structure.
This will replace all of the disjoint places where strings are parsed
into various structures and ranges and identifiers.

---

More work on using StructureIdentifiers

* Add AtomCache.getStructure(StructureIdentifier) method, and change
    other methods to use it
* Implement StructureName and other StructureIdentifiers
* Remove the 'A:+1-+5' range syntax from StructureTools. It was stupid.
* Remove string parsing from lots of places and replace with
    StructureIdentifiers
* Fix lots of tests

---

Adding support for models in SubstructureIdentifier

---

More improvements for StructureIdentifiers

* Accept chain indices ("4HHB.0") with a warning if there was not
    already a chain with that ID.
* Fixing bugs in StructureName due to differences with
    SubstructureIdentifier
* Removing StructureName.compareTo method, since they aren't well
    ordered
* Changing more AtomCache methods to use StructureIdentifiers
* Test fixes

---

Fixing bugs with loading structures from files and urls

These are now valid StructureName values. They are implemented using the
PassthroughIdentifier, which makes AtomCache responsible for fetching
the right structure.

---

Implementing PDP identifiers in StructureName

PDP parsing doesn't have good test coverage, but my basic checks work.

---

Last test fix.

All (existing) tests now pass.

---

Fix some AtomCache synchronization.

---

Modifying StructureIdentifier interface

    - Add loadStructure() method
    - Remove getPdbId() method, since not globally relevant. It's still
      present in most implementations, and can always be accessed via
      toCanonical().getPdbId()
    - In AtomCache, rename protected loadStructureByPdbId to public
      getStructureForPdbId() to bypass StructureIdentifier parsing

---

Fix bug loading PDP structures

---
The StructureFiletype enum will be useful for biojava#232
The strictSCOP parameter has been removed from AtomCache. Its
get/setters are now deprecated and do nothing.

Instead, whether guessing occurs is determined by how the identifier is
created. If created directly from a ScopDatabase, no guessing is
performed. In the more common case, where a StructureName is created
from user input (e.g. through AtomCache.getStructure(String)), guessing
is performed. This effectively changes the default behavior from
strict identifiers to more fuzzy permissive ones.
Stores a reference to the identifier used to create the structure
Particularly MultipleAlignementEnsemble
This fixes a test issue with the previous commit.
@sbliven sbliven added the enhancement Improvement of existing code or method label Jan 20, 2016
@sbliven sbliven self-assigned this Jan 20, 2016
@sbliven sbliven added this to the BioJava 4.2.0 milestone Jan 20, 2016
@sbliven

sbliven commented Jan 20, 2016

Copy link
Copy Markdown
Member Author

I'd welcome a code review of this pull. Particularly, what should the return type for getCanonical() be? Ideally, it would be an identifier capable of representing arbitrary selections from a structure. Currently, the best we have is an identifier capable of representing residue ranges (SubstructureIdentifier).

@josemduarte

Copy link
Copy Markdown
Contributor

Travis is failing, is that some networking issue?

I'll give a look at the code when I get a bit of time. Thanks for the pull request, this looks awesome!

A question, this would support multi-letter chain identifiers, wouldn't it?

@sbliven

sbliven commented Jan 20, 2016

Copy link
Copy Markdown
Member Author

Tests pass on my laptop, but there are plenty of changes here. I'll restart Travis and double check.

@lafita

lafita commented Jan 20, 2016

Copy link
Copy Markdown
Member

There is a Test in the core module that sometimes gives a network exception from Travis. Is it the case?

@lafita

lafita commented Jan 21, 2016

Copy link
Copy Markdown
Member

The error is in the test files I created for MultipleAlignment serialization (XML, FASTA, etc), because the names of the structures do not match (that is expected from your changes @sbliven). We just need to update the files with the new names.

@sbliven

sbliven commented Jan 21, 2016

Copy link
Copy Markdown
Member Author

Ok, finally fixed the minor test regressions and got travis to finish the tests.

sbliven added a commit that referenced this pull request Jan 25, 2016
StructureIdentifier framework

This has some minor API incompatible changes (i.e. StructureIdentifier existed before but wasn't used anywhere). It also changes the default behavior for some things in AtomCache (e.g. filenames are now supported), but without changing the API.
@sbliven sbliven merged commit 5edfed4 into biojava:master Jan 25, 2016
@sbliven

sbliven commented Jan 25, 2016

Copy link
Copy Markdown
Member Author

I'm going to merge this before master diverges too much. Comments about the new code are still appreciated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement Improvement of existing code or method

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants