StructureIdentifier framework#376
Conversation
…eIdentifier
This unifies various prior methods for identifying a structure from a string, which was
previously spread over AtomCache methods and StructureTools/StructureIO utilities. It
includes PDB IDs, residue range specifications, URLs, SCOP/CATH/PDP domains, and files.
Major Classes:
* StructureIdentifier is an interface for methods which know how to load a structure from
some resource and reduce it to the requested substructure. The most important property
is the identifier, which can be an arbitrary string.
* SubstructureIdentifier is considered the canonical way to specify a structure and
is the major implementation for chain and residue level substructures. All identifiers
can be converted to a SubstructureIdentifier.
* StructureName is suitable for wrapping user-supplied identifiers, and it dispatches the
request to a more specific class based on a guess as to the type (PDB, URL, file, etc).
StructureIdentifiers can represent arbitrary strings (e.g. domain IDs). These are converted
(possibly through some relatively slow process like downloading and parsing a file) into
standard-format SubstructureIdentifier instances, which should be easier to serialize and
recreate.
Detailed changes:
* Substantially changes StructureIdentifier, which existed but wasn't used anywhere.
* Remove getPdbId() method, since not globally relevant. It's still present in most
implementations, and can always be accessed via toCanonical().getPdbId()
* AtomCache
* Accept chain indices ("4HHB.0") with a warning if there was not
already a chain with that ID.
This commit rebases and squashes pre-4.1 development from sbliven's fix81 branch
culminating in 007ea6e. For posterity, the original commit messages are listed below.
Some changes may be omitted or modified when resolving the rebase.
Commit messages (oldest to youngest):
---
Major changes to StructureIdentifier (biojava#81)
Redefines StructureIdentifier as something which transforms a structure.
This will replace all of the disjoint places where strings are parsed
into various structures and ranges and identifiers.
---
More work on using StructureIdentifiers
* Add AtomCache.getStructure(StructureIdentifier) method, and change
other methods to use it
* Implement StructureName and other StructureIdentifiers
* Remove the 'A:+1-+5' range syntax from StructureTools. It was stupid.
* Remove string parsing from lots of places and replace with
StructureIdentifiers
* Fix lots of tests
---
Adding support for models in SubstructureIdentifier
---
More improvements for StructureIdentifiers
* Accept chain indices ("4HHB.0") with a warning if there was not
already a chain with that ID.
* Fixing bugs in StructureName due to differences with
SubstructureIdentifier
* Removing StructureName.compareTo method, since they aren't well
ordered
* Changing more AtomCache methods to use StructureIdentifiers
* Test fixes
---
Fixing bugs with loading structures from files and urls
These are now valid StructureName values. They are implemented using the
PassthroughIdentifier, which makes AtomCache responsible for fetching
the right structure.
---
Implementing PDP identifiers in StructureName
PDP parsing doesn't have good test coverage, but my basic checks work.
---
Last test fix.
All (existing) tests now pass.
---
Fix some AtomCache synchronization.
---
Modifying StructureIdentifier interface
- Add loadStructure() method
- Remove getPdbId() method, since not globally relevant. It's still
present in most implementations, and can always be accessed via
toCanonical().getPdbId()
- In AtomCache, rename protected loadStructureByPdbId to public
getStructureForPdbId() to bypass StructureIdentifier parsing
---
Fix bug loading PDP structures
---
The StructureFiletype enum will be useful for biojava#232
The strictSCOP parameter has been removed from AtomCache. Its get/setters are now deprecated and do nothing. Instead, whether guessing occurs is determined by how the identifier is created. If created directly from a ScopDatabase, no guessing is performed. In the more common case, where a StructureName is created from user input (e.g. through AtomCache.getStructure(String)), guessing is performed. This effectively changes the default behavior from strict identifiers to more fuzzy permissive ones.
Stores a reference to the identifier used to create the structure
Particularly MultipleAlignementEnsemble
This fixes a test issue with the previous commit.
|
I'd welcome a code review of this pull. Particularly, what should the return type for getCanonical() be? Ideally, it would be an identifier capable of representing arbitrary selections from a structure. Currently, the best we have is an identifier capable of representing residue ranges (SubstructureIdentifier). |
|
Travis is failing, is that some networking issue? I'll give a look at the code when I get a bit of time. Thanks for the pull request, this looks awesome! A question, this would support multi-letter chain identifiers, wouldn't it? |
|
Tests pass on my laptop, but there are plenty of changes here. I'll restart Travis and double check. |
|
There is a Test in the core module that sometimes gives a network exception from Travis. Is it the case? |
|
The error is in the test files I created for MultipleAlignment serialization (XML, FASTA, etc), because the names of the structures do not match (that is expected from your changes @sbliven). We just need to update the files with the new names. |
c203184 to
1fa6f3d
Compare
|
Ok, finally fixed the minor test regressions and got travis to finish the tests. |
StructureIdentifier framework This has some minor API incompatible changes (i.e. StructureIdentifier existed before but wasn't used anywhere). It also changes the default behavior for some things in AtomCache (e.g. filenames are now supported), but without changing the API.
|
I'm going to merge this before master diverges too much. Comments about the new code are still appreciated. |
Add substantially new unified class for identify structures: StructureIdentifier
This unifies various prior methods for identifying a structure from a string, which was
previously spread over AtomCache methods and StructureTools/StructureIO utilities. It
includes PDB IDs, residue range specifications, URLs, SCOP/CATH/PDP domains, and files.
Major Classes:
some resource and reduce it to the requested substructure. The most important property
is the identifier, which can be an arbitrary string.
is the major implementation for chain and residue level substructures. All identifiers
can be converted to a SubstructureIdentifier.
request to a more specific class based on a guess as to the type (PDB, URL, file, etc).
StructureIdentifiers can represent arbitrary strings (e.g. domain IDs). These are converted
(possibly through some relatively slow process like downloading and parsing a file) into
standard-format SubstructureIdentifier instances, which should be easier to serialize and
recreate.
Detailed changes:
implementations, and can always be accessed via toCanonical().getPdbId()
already a chain with that ID.