StructureIdentifier framework by sbliven · Pull Request #376 · biojava/biojava

sbliven · 2016-01-20T17:33:41Z

Add substantially new unified class for identify structures: StructureIdentifier

This unifies various prior methods for identifying a structure from a string, which was
previously spread over AtomCache methods and StructureTools/StructureIO utilities. It
includes PDB IDs, residue range specifications, URLs, SCOP/CATH/PDP domains, and files.

Major Classes:

StructureIdentifier is an interface for methods which know how to load a structure from
some resource and reduce it to the requested substructure. The most important property
is the identifier, which can be an arbitrary string.
SubstructureIdentifier is considered the canonical way to specify a structure and
is the major implementation for chain and residue level substructures. All identifiers
can be converted to a SubstructureIdentifier.
StructureName is suitable for wrapping user-supplied identifiers, and it dispatches the
request to a more specific class based on a guess as to the type (PDB, URL, file, etc).

StructureIdentifiers can represent arbitrary strings (e.g. domain IDs). These are converted
(possibly through some relatively slow process like downloading and parsing a file) into
standard-format SubstructureIdentifier instances, which should be easier to serialize and
recreate.

Detailed changes:

Substantially changes StructureIdentifier, which existed but wasn't used anywhere.
- Remove getPdbId() method, since not globally relevant. It's still present in most
  implementations, and can always be accessed via toCanonical().getPdbId()
AtomCache
- Accept chain indices ("4HHB.0") with a warning if there was not
  already a chain with that ID.
- Removes support for the strictScop parameter. The new default is effectively strictScop=false. This can be overridden by getting a ScopDomain directly from a particular ScopDatabase, rather than using generic methods like AtomCache.getStructure(String), which use the StructureName class for fuzzy matching of domain names.
Better guessing of PDB/CIF filetypes from filename (but not full autodetection from contents, as requested in Auto-detecting file formats #232)

…eIdentifier This unifies various prior methods for identifying a structure from a string, which was previously spread over AtomCache methods and StructureTools/StructureIO utilities. It includes PDB IDs, residue range specifications, URLs, SCOP/CATH/PDP domains, and files. Major Classes: * StructureIdentifier is an interface for methods which know how to load a structure from some resource and reduce it to the requested substructure. The most important property is the identifier, which can be an arbitrary string. * SubstructureIdentifier is considered the canonical way to specify a structure and is the major implementation for chain and residue level substructures. All identifiers can be converted to a SubstructureIdentifier. * StructureName is suitable for wrapping user-supplied identifiers, and it dispatches the request to a more specific class based on a guess as to the type (PDB, URL, file, etc). StructureIdentifiers can represent arbitrary strings (e.g. domain IDs). These are converted (possibly through some relatively slow process like downloading and parsing a file) into standard-format SubstructureIdentifier instances, which should be easier to serialize and recreate. Detailed changes: * Substantially changes StructureIdentifier, which existed but wasn't used anywhere. * Remove getPdbId() method, since not globally relevant. It's still present in most implementations, and can always be accessed via toCanonical().getPdbId() * AtomCache * Accept chain indices ("4HHB.0") with a warning if there was not already a chain with that ID. This commit rebases and squashes pre-4.1 development from sbliven's fix81 branch culminating in 007ea6e. For posterity, the original commit messages are listed below. Some changes may be omitted or modified when resolving the rebase. Commit messages (oldest to youngest): --- Major changes to StructureIdentifier (biojava#81) Redefines StructureIdentifier as something which transforms a structure. This will replace all of the disjoint places where strings are parsed into various structures and ranges and identifiers. --- More work on using StructureIdentifiers * Add AtomCache.getStructure(StructureIdentifier) method, and change other methods to use it * Implement StructureName and other StructureIdentifiers * Remove the 'A:+1-+5' range syntax from StructureTools. It was stupid. * Remove string parsing from lots of places and replace with StructureIdentifiers * Fix lots of tests --- Adding support for models in SubstructureIdentifier --- More improvements for StructureIdentifiers * Accept chain indices ("4HHB.0") with a warning if there was not already a chain with that ID. * Fixing bugs in StructureName due to differences with SubstructureIdentifier * Removing StructureName.compareTo method, since they aren't well ordered * Changing more AtomCache methods to use StructureIdentifiers * Test fixes --- Fixing bugs with loading structures from files and urls These are now valid StructureName values. They are implemented using the PassthroughIdentifier, which makes AtomCache responsible for fetching the right structure. --- Implementing PDP identifiers in StructureName PDP parsing doesn't have good test coverage, but my basic checks work. --- Last test fix. All (existing) tests now pass. --- Fix some AtomCache synchronization. --- Modifying StructureIdentifier interface - Add loadStructure() method - Remove getPdbId() method, since not globally relevant. It's still present in most implementations, and can always be accessed via toCanonical().getPdbId() - In AtomCache, rename protected loadStructureByPdbId to public getStructureForPdbId() to bypass StructureIdentifier parsing --- Fix bug loading PDP structures ---

The StructureFiletype enum will be useful for biojava#232

The strictSCOP parameter has been removed from AtomCache. Its get/setters are now deprecated and do nothing. Instead, whether guessing occurs is determined by how the identifier is created. If created directly from a ScopDatabase, no guessing is performed. In the more common case, where a StructureName is created from user input (e.g. through AtomCache.getStructure(String)), guessing is performed. This effectively changes the default behavior from strict identifiers to more fuzzy permissive ones.

Stores a reference to the identifier used to create the structure

Particularly MultipleAlignementEnsemble

This fixes a test issue with the previous commit.

sbliven · 2016-01-20T19:18:30Z

I'd welcome a code review of this pull. Particularly, what should the return type for getCanonical() be? Ideally, it would be an identifier capable of representing arbitrary selections from a structure. Currently, the best we have is an identifier capable of representing residue ranges (SubstructureIdentifier).

josemduarte · 2016-01-20T19:25:50Z

Travis is failing, is that some networking issue?

I'll give a look at the code when I get a bit of time. Thanks for the pull request, this looks awesome!

A question, this would support multi-letter chain identifiers, wouldn't it?

sbliven · 2016-01-20T21:03:48Z

Tests pass on my laptop, but there are plenty of changes here. I'll restart Travis and double check.

lafita · 2016-01-20T21:24:08Z

There is a Test in the core module that sometimes gives a network exception from Travis. Is it the case?

lafita · 2016-01-21T08:09:25Z

The error is in the test files I created for MultipleAlignment serialization (XML, FASTA, etc), because the names of the structures do not match (that is expected from your changes @sbliven). We just need to update the files with the new names.

…fier

sbliven · 2016-01-21T11:00:02Z

Ok, finally fixed the minor test regressions and got travis to finish the tests.

StructureIdentifier framework This has some minor API incompatible changes (i.e. StructureIdentifier existed before but wasn't used anywhere). It also changes the default behavior for some things in AtomCache (e.g. filenames are now supported), but without changing the API.

sbliven · 2016-01-25T11:09:29Z

I'm going to merge this before master diverges too much. Comments about the new code are still appreciated.

sbliven added 8 commits January 19, 2016 11:47

Implement extension-based PDB/CIF filetype detection

10da1c2

The StructureFiletype enum will be useful for biojava#232

Adding URLIdentifier for files and URLs

de3e073

Fixing minor test failure

9555a84

Add Structure.get/setStructureIdentifier methods

6017337

Stores a reference to the identifier used to create the structure

Changing many classes to use StructureIdentifier

a62d45e

Particularly MultipleAlignementEnsemble

Switching UserConfiguration to default to mmcif

1b11a48

This fixes a test issue with the previous commit.

sbliven added the enhancement Improvement of existing code or method label Jan 20, 2016

sbliven self-assigned this Jan 20, 2016

sbliven added this to the BioJava 4.2.0 milestone Jan 20, 2016

Fix bug where an identifier's toString method differed from getIdenti…

1fa6f3d

…fier

sbliven force-pushed the fix81_squashed branch from c203184 to 1fa6f3d Compare January 21, 2016 09:16

sbliven merged commit 5edfed4 into biojava:master Jan 25, 2016

sbliven mentioned this pull request Jan 25, 2016

Make loading of structures more consistent #81

Closed

josemduarte added a commit that referenced this pull request Jan 26, 2016

Adding test for multichar chain ids: #81, #376

d7e0326

josemduarte mentioned this pull request Feb 1, 2016

StructureIO.getStructure(pdbId) fails if a file with pdb id present in working dir #398

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

StructureIdentifier framework#376

StructureIdentifier framework#376
sbliven merged 9 commits into
biojava:masterfrom
sbliven:fix81_squashed

sbliven commented Jan 20, 2016

Uh oh!

sbliven commented Jan 20, 2016

Uh oh!

josemduarte commented Jan 20, 2016

Uh oh!

sbliven commented Jan 20, 2016

Uh oh!

lafita commented Jan 20, 2016

Uh oh!

lafita commented Jan 21, 2016

Uh oh!

sbliven commented Jan 21, 2016

Uh oh!

sbliven commented Jan 25, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

sbliven commented Jan 20, 2016

Uh oh!

sbliven commented Jan 20, 2016

Uh oh!

josemduarte commented Jan 20, 2016

Uh oh!

sbliven commented Jan 20, 2016

Uh oh!

lafita commented Jan 20, 2016

Uh oh!

lafita commented Jan 21, 2016

Uh oh!

sbliven commented Jan 21, 2016

Uh oh!

sbliven commented Jan 25, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants