-
Notifications
You must be signed in to change notification settings - Fork 397
Support extended pdbid #950
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support extended pdbid #950
Conversation
Created a new class PDBId, and removed old deprecated methods. Most of the changes are in Structure, StructureImpl, and PDBFileParser classes.
sbliven
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aalhossary Awesome work! FYI I've started reviewing it but it's a pretty PR so it might take me a few days.
biojava-integrationtest/src/test/java/org/biojava/nbio/structure/test/cath/CathDomainTest.java
Show resolved
Hide resolved
biojava-structure/src/main/java/org/biojava/nbio/structure/PDBId.java
Outdated
Show resolved
Hide resolved
biojava-structure/src/main/java/org/biojava/nbio/structure/Structure.java
Outdated
Show resolved
Hide resolved
biojava-structure/src/main/java/org/biojava/nbio/structure/Structure.java
Outdated
Show resolved
Hide resolved
sbliven
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent, thanks @aalhossary! Extended ids are now available in mmcif so this is good timing.
I was looking for detailed information about this change but I wasn't able find much beyond the two wwpdb blogs. Do you know of any more detailed info about the change? Are there example files?
If I understand correctly, the major changes are:
- New
PDBIdclass which can represent IDs as either 1ABC or PDB_00001ABC form - Parse keywords from pdb headers
Breaking changes:
- Some deprecated and removed methods in Structure
- notably,
String getPdbId()becomesPDBId getPDBId()(note capitalization) orString getPDBCode()
- notably,
- PDB codes are returned upper-case consistently
PDBId class
I think that wrapping ids into a light-weight class like this is a good approach. It's better than storing them as strings and providing conversion methods, despite the small overhead of object creation.
One thing we should discuss is all the static flags controlling short/long behavior. It's nice to have this flexibility, but since it's global state it is difficult to test and can lead to unexpected bugs if the flags get changed in a different part of the code. Instead of global state, maybe any code that needs to override the defaultShorteningBehavior should be forced to call getId(false)?
We could also improve the memory usage by storing either the 4-byte string (if isShortCompatible) or the 8-byte extended ID. Then the PDB_ prefix and leading zeros would only be added in toExtendedId. Maybe this is premature optimization, but I think loading large lists of PDBIds is a reasonable use case.
Consistent usage
It would be good to be consistent about whether to accept String or PDBId.
I would suggest that classes that are used commonly by users should override methods to accept both Strings and PDBIds. The more "internal classes" (I'm thinking of the parser internals like StructureDataInterface) can just have PDBId versions. Or if we want to minimize breaking, we can define both but deprecate the String versions for a release.
Keywords
Is the keyword parsing related to the extended ids in any way? If not it might be good to extract that to a separate PR.
Keywords should also be set from mmcif (and mmtf if they are supported).
Parsing
This PR sets up all the infrastructure for storing extended PDBIds. However, I suspect that additional work will be needed to ensure that the parsers are using the new IDs. For instance, I don't see anywhere that is reading the _database_2.pdbx_database_accession field in the cif parser. (This is more a TODO than a suggestion.)
Testing
I didn't see any tests that actually use 5+ character structures. Can we add a cif file that uses a long id and test parsing and such. I suspect we will find additional bugs (e.g. in the alignment output formats). We don't need to fix everything in this initial PR of course.
Thanks so much for building this feature. This was way more work than I expected when I made this issue, so thanks for taking it on!
biojava-structure/src/main/java/org/biojava/nbio/structure/align/client/StructureName.java
Outdated
Show resolved
Hide resolved
...-structure/src/main/java/org/biojava/nbio/structure/align/pairwise/AlternativeAlignment.java
Outdated
Show resolved
Hide resolved
biojava-structure/src/main/java/org/biojava/nbio/structure/align/util/AtomCache.java
Outdated
Show resolved
Hide resolved
biojava-structure/src/main/java/org/biojava/nbio/structure/io/LocalPDBDirectory.java
Outdated
Show resolved
Hide resolved
biojava-structure/src/main/java/org/biojava/nbio/structure/align/client/StructureName.java
Outdated
Show resolved
Hide resolved
biojava-structure/src/main/java/org/biojava/nbio/structure/SubstructureIdentifier.java
Outdated
Show resolved
Hide resolved
biojava-structure/src/main/java/org/biojava/nbio/structure/SubstructureIdentifier.java
Show resolved
Hide resolved
biojava-structure/src/main/java/org/biojava/nbio/structure/SubstructureIdentifier.java
Outdated
Show resolved
Hide resolved
biojava-structure/src/main/java/org/biojava/nbio/structure/SubstructureIdentifier.java
Outdated
Show resolved
Hide resolved
biojava-structure-gui/src/main/java/org/biojava/nbio/structure/gui/util/PDBUploadPanel.java
Outdated
Show resolved
Hide resolved
|
Hi Spencer @sbliven, I don't have any resources other than these 2 blogs too :). There are no details nor examples :( While writing my edit, I focused on making the minimal changes possible to the current codebase. Therefore, I minimized throwing any new exceptions from already present code that does not do so. I will reply your individual questions in place. KEYWODS recordsIt is actually a part of another PR. Both have some files in common. I thought by the time this PR gets reviewed, the other one would have been approved already. PDBId classDefault shortening BehaviorThe static flag for default shortening behavior does NOT have a setter. The static flag is mainly for easier change in the future in a later release. On the other hand, the object method getId(boolean) is public in case anybody wants it. Short Id to minimize memory usageI don't think storing 4-bute short ID is a good idea for these reasons:
However, I will consider storing the Consistent usageI am planning to move away from the String gradually as possible, leaving the very frequently used classes by users have overloaded functions with deprecated String variant, but I will NOT remove them soon. Parsers using the new PDBIdWhatever the parser is doing, it will eventually one of the SetPdbId() [already removed altogether], setPDBId(), or setPdbCode() methods. TestingBefore submitting my PR, I had either modified my code a lot to pass the currently available unit tests, or to modify the tests themselves, if they test for non realistic scenarios. The only format EXCEPTIONXXXX is the only exception of an accepted ID that doesn't follow the |
to match the expected String
|
I submitted a new revision (this time I performed a full system test before submission :) ). |
|
@sbliven , @josemduarte This PR has been silent for 25 days. Shall I consider the API stable and start writing the documentation? |
josemduarte
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @aalhossary . I have a few concerns about this one, the most important of them already expressed by @sbliven : this new class to wrap the PDB identifiers should be used consistently throughout all the relevant interfaces. Doing that is quite complicated and begs for the question whether the String accessors should be deprecated or live side by side.
For that and other reasons, I'm going to vote against proceeding with this solution. I feel that having some helper methods that can do PDB id string handling could be a more flexible solution.
|
In fact, I am not convinced with your 2 or 3 points @josemduarte for these reasons: Helper MethodA helper method was an option that I excluded in the first place; because what we need is not only conversion from old format to extended format and vise versa. but also
Implementing all of these features needs several methods. If they all were implemented as static helper functions somewhere, the result would be
The need for complicated restructuring.I reviewed every place where
I believe my PR is quite comprehensive. @josemduarte I would appreciate it if you list out all remaining interfaces to restructure. It would then be my own task to handle them. Should the String accessors be deprecated or live side by side?By all means, they should coexist now and for a long-enough time. Maybe we will think about the answer of this question when we release BioJava 7.0.0 or even 10.0.0! FinallyI appreciate mentioning the remaining concerns you have @josemduarte. My next tasks in this PR
|
Only the ID proper is stored (in upper case).
e.g. for 1abc we store 00001ABC.
This format is the trade off between storing short/extended format and
all extended format.
storing the ID proper allows direct comparison and fast ordering of
PDBId objects with uniform names in the format \d{4}[1-9][0-9A-Z]{3}.
|
Sorry for the delay @aalhossary . I've read your new comments and arguments and I see your point. Let me check the PR more carefully now and submit a new review for it. |
|
BTW when you have time, could you solve the conflicts by merging master into your branch? That would also make the diff cleaner, without the changes related to #948 |
|
I committed what I have till now. |
Conflicts: biojava-structure/src/main/java/org/biojava/nbio/structure/PDBHeader.java biojava-structure/src/main/java/org/biojava/nbio/structure/Structure.java
|
I have 2 questions that need to clarify: Is the first character of the current (short) PDB ID format a digit in the range [0-9] or [1-9]? |
- Class name: changed into PdbId - methods name: changed into getPdbId(), setPdbId(PdbId) - field name: pdbId
|
I pushed a new set of commits addressing your comments @josemduarte. |
It introduced a new bug. Reverting it.
Conflicts: CHANGELOG.md biojava-structure/src/main/java/org/biojava/nbio/structure/align/client/PdbPair.java
|
I received this reply about the entry/ies with XXXX
My interpretation is that this XXXX may appear in other files. Consequently, I believe it's best to:
This should not break Java rules; as the rule in Java documentation is
|
|
@josemduarte I pushed a new commit right now. I hope it will be the last in this pull request :) |
I think that the reply you got makes it quite clear: XXXX is not special but just some random string that is used to make sure there's no confusion with actual PDB ids. For instance, one day XXXX could be replaced by any other string in those assembly cif files. Consequently I think it is important to not treat XXXX specially and to remove any logic around it. When XXXX or any other string that doesn't pass a standard PDB id regex is encountered, then it should not be considered a PDB id and the PdbId field in the Structure object should be set to null. |
Shall I understand that you want ANY malformed PdbId be gracefully set to |
|
Well, I removed XXXX altogether, and set all malformed |
josemduarte
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I've done another round of reviewing
biojava-structure/src/main/java/org/biojava/nbio/structure/PdbId.java
Outdated
Show resolved
Hide resolved
biojava-structure/src/main/java/org/biojava/nbio/structure/PdbId.java
Outdated
Show resolved
Hide resolved
biojava-structure/src/main/java/org/biojava/nbio/structure/PdbId.java
Outdated
Show resolved
Hide resolved
biojava-structure/src/main/java/org/biojava/nbio/structure/PdbId.java
Outdated
Show resolved
Hide resolved
| pdbId = new PdbId(URLIdentifier.guessPDBID( path.substring(path.lastIndexOf('/')+1) )); | ||
| } catch (IllegalArgumentException e) { | ||
| pdbId = null; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An explicit check for null (instead of try/catch) will be more readable and more robust if implementation changes later on
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a null parameter is not the only source of IllegalArgumentException. The method toInternalFormat() (called from the constructor) is actually more likely to throw the IllegalArgumentException.
I can't remove the catch. I can return the explicit checked exception thrown from the constructor instead :)
biojava-structure/src/main/java/org/biojava/nbio/structure/align/util/AtomCache.java
Outdated
Show resolved
Hide resolved
biojava-structure/src/main/java/org/biojava/nbio/structure/io/PDBFileParser.java
Show resolved
Hide resolved
biojava-structure/src/main/java/org/biojava/nbio/structure/io/cif/CifStructureConsumerImpl.java
Show resolved
Hide resolved
biojava-structure/src/main/java/org/biojava/nbio/structure/io/cif/CifStructureConsumerImpl.java
Outdated
Show resolved
Hide resolved
|
I submitted a new commit that addresses all reviewer's comments except replacing the try/catch block with an explicit null pointer check. |
josemduarte
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for all your efforts! I think all looks good now! Just the TODOs should be removed.
biojava-structure/src/main/java/org/biojava/nbio/structure/StructureIO.java
Outdated
Show resolved
Hide resolved
biojava-structure/src/main/java/org/biojava/nbio/structure/SubstructureIdentifier.java
Outdated
Show resolved
Hide resolved
biojava-structure/src/main/java/org/biojava/nbio/structure/align/util/AtomCache.java
Outdated
Show resolved
Hide resolved
|
Tests are passing. I'll merge this in by the end of tomorrow if there no other comments. |
I added
PDBIdclass to replace the current String PdbId with conversation methods to and from short format.I will write the documentation later, as soon as the API is approved and finalized.
Fixes #930.