Support extended pdbid #950

aalhossary · 2021-07-27T11:23:44Z

I added PDBId class to replace the current String PdbId with conversation methods to and from short format.
I will write the documentation later, as soon as the API is approved and finalized.
Fixes #930.

Created a new class PDBId, and removed old deprecated methods. Most of the changes are in Structure, StructureImpl, and PDBFileParser classes.

sbliven

@aalhossary Awesome work! FYI I've started reviewing it but it's a pretty PR so it might take me a few days.

biojava-integrationtest/src/test/java/org/biojava/nbio/structure/test/cath/CathDomainTest.java

biojava-structure/src/main/java/org/biojava/nbio/structure/PDBId.java

biojava-structure/src/main/java/org/biojava/nbio/structure/Structure.java

sbliven

Excellent, thanks @aalhossary! Extended ids are now available in mmcif so this is good timing.

I was looking for detailed information about this change but I wasn't able find much beyond the two wwpdb blogs. Do you know of any more detailed info about the change? Are there example files?

If I understand correctly, the major changes are:

New PDBId class which can represent IDs as either 1ABC or PDB_00001ABC form
Parse keywords from pdb headers

Breaking changes:

Some deprecated and removed methods in Structure
- notably, String getPdbId() becomes PDBId getPDBId() (note capitalization) or String getPDBCode()
PDB codes are returned upper-case consistently

PDBId class

I think that wrapping ids into a light-weight class like this is a good approach. It's better than storing them as strings and providing conversion methods, despite the small overhead of object creation.

One thing we should discuss is all the static flags controlling short/long behavior. It's nice to have this flexibility, but since it's global state it is difficult to test and can lead to unexpected bugs if the flags get changed in a different part of the code. Instead of global state, maybe any code that needs to override the defaultShorteningBehavior should be forced to call getId(false)?

We could also improve the memory usage by storing either the 4-byte string (if isShortCompatible) or the 8-byte extended ID. Then the PDB_ prefix and leading zeros would only be added in toExtendedId. Maybe this is premature optimization, but I think loading large lists of PDBIds is a reasonable use case.

Consistent usage

It would be good to be consistent about whether to accept String or PDBId.

I would suggest that classes that are used commonly by users should override methods to accept both Strings and PDBIds. The more "internal classes" (I'm thinking of the parser internals like StructureDataInterface) can just have PDBId versions. Or if we want to minimize breaking, we can define both but deprecate the String versions for a release.

Keywords

Is the keyword parsing related to the extended ids in any way? If not it might be good to extract that to a separate PR.

Keywords should also be set from mmcif (and mmtf if they are supported).

Parsing

This PR sets up all the infrastructure for storing extended PDBIds. However, I suspect that additional work will be needed to ensure that the parsers are using the new IDs. For instance, I don't see anywhere that is reading the _database_2.pdbx_database_accession field in the cif parser. (This is more a TODO than a suggestion.)

Testing

I didn't see any tests that actually use 5+ character structures. Can we add a cif file that uses a long id and test parsing and such. I suspect we will find additional bugs (e.g. in the alignment output formats). We don't need to fix everything in this initial PR of course.

Thanks so much for building this feature. This was way more work than I expected when I made this issue, so thanks for taking it on!

biojava-structure/src/main/java/org/biojava/nbio/structure/align/client/StructureName.java

...-structure/src/main/java/org/biojava/nbio/structure/align/pairwise/AlternativeAlignment.java

biojava-structure/src/main/java/org/biojava/nbio/structure/align/util/AtomCache.java

biojava-structure/src/main/java/org/biojava/nbio/structure/io/LocalPDBDirectory.java

biojava-structure/src/main/java/org/biojava/nbio/structure/align/client/StructureName.java

biojava-structure/src/main/java/org/biojava/nbio/structure/SubstructureIdentifier.java

biojava-structure-gui/src/main/java/org/biojava/nbio/structure/gui/util/PDBUploadPanel.java

aalhossary · 2021-08-03T13:08:51Z

Hi Spencer @sbliven,

I don't have any resources other than these 2 blogs too :). There are no details nor examples :(

While writing my edit, I focused on making the minimal changes possible to the current codebase. Therefore, I minimized throwing any new exceptions from already present code that does not do so.

I will reply your individual questions in place.

KEYWODS records

It is actually a part of another PR. Both have some files in common. I thought by the time this PR gets reviewed, the other one would have been approved already.

PDBId class

Default shortening Behavior

The static flag for default shortening behavior does NOT have a setter. The static flag is mainly for easier change in the future in a later release. On the other hand, the object method getId(boolean) is public in case anybody wants it.

Short Id to minimize memory usage

I don't think storing 4-bute short ID is a good idea for these reasons:

It will add a lot of overhead to compare / hashcode Ids for a small memory benefit
the New IDs would eventually supersede the short ID, especially as the new depositions are being added in an increasing rate, while the old ones are being obsoleted by new ones.

However, I will consider storing the 8-byte extended ID only in a later optimization PR, after the new IDs start to emerge and we see how the format would look like.

Consistent usage

I am planning to move away from the String gradually as possible, leaving the very frequently used classes by users have overloaded functions with deprecated String variant, but I will NOT remove them soon.

Parsers using the new PDBId

Whatever the parser is doing, it will eventually one of the SetPdbId() [already removed altogether], setPDBId(), or setPdbCode() methods.
However, I would appreciate it if you point out any missing / additional places to take care of. I am sorry I am still nw to parsing mmCIF files.

Testing

Before submitting my PR, I had either modified my code a lot to pass the currently available unit tests, or to modify the tests themselves, if they test for non realistic scenarios.

The only format EXCEPTION

XXXX is the only exception of an accepted ID that doesn't follow the \d\p{Alnum}{3} format.
I added it because I found it declared in the header of 4nwr-assembly1.cif.gz referenced from TestURLBasedFileParsing.
Can somebody confirm whither this file is correct or wrong and should be remediated? @josemduarte Can you help us find the answer of this question?

to match the expected String

aalhossary · 2021-08-04T19:05:26Z

I submitted a new revision (this time I performed a full system test before submission :) ).
I left the questions which need answers unresolved in @sbliven comments.

aalhossary · 2021-08-29T09:00:02Z

@sbliven , @josemduarte This PR has been silent for 25 days. Shall I consider the API stable and start writing the documentation?

josemduarte

Thanks @aalhossary . I have a few concerns about this one, the most important of them already expressed by @sbliven : this new class to wrap the PDB identifiers should be used consistently throughout all the relevant interfaces. Doing that is quite complicated and begs for the question whether the String accessors should be deprecated or live side by side.

For that and other reasons, I'm going to vote against proceeding with this solution. I feel that having some helper methods that can do PDB id string handling could be a more flexible solution.

aalhossary · 2021-09-22T23:23:26Z

In fact, I am not convinced with your 2 or 3 points @josemduarte for these reasons:

Helper Method

A helper method was an option that I excluded in the first place; because what we need is not only conversion from old format to extended format and vise versa. but also

checks of order (via java.lang.Comparable) among different PDBIDs.
checks of hashCode and equals among PDBIDs of different formats to use in a HashMap.
checking whether a PDBID is short or long, and hence do whatever necessary conversions.
checking whether an extended PDBID is shortable.
general regular expression matching.

Implementing all of these features needs several methods. If they all were implemented as static helper functions somewhere, the result would be

Complex code
We reply on the user's own discretion to use them properly.
We would either add several methods to the Strusture class that are actually unrelated to its main purpose, or add a new separate class . In the later case, I prefer grouping all related methods in a light-weight class, which is what I did.

The need for complicated restructuring.

I reviewed every place where setPDBCode() is called. I found only 4 places remaining, two are to be done in the next commit, and the other two are actually misuse of the field and should be changed.
I also reviewed every comment by Spenser Bliven and found that most if not all of them are met and solved already.

this new class to wrap the PDB identifiers should be used consistently throughout all the relevant interfaces.

I believe my PR is quite comprehensive. @josemduarte I would appreciate it if you list out all remaining interfaces to restructure. It would then be my own task to handle them.

Should the String accessors be deprecated or live side by side?

By all means, they should coexist now and for a long-enough time. Maybe we will think about the answer of this question when we release BioJava 7.0.0 or even 10.0.0!
If you ask me about my opinion, they should coexist (as convenience methods).

Finally

I appreciate mentioning the remaining concerns you have @josemduarte.
What is the opinion of other members?
What do you think @sbliven ?

My next tasks in this PR

Keeping shorter string to represent the PDB ID in memory.
Replacing remaining setPDBCode and getPDBCode with setPDBId and getPDBId
Adding documentation
Any other refactoring to be proposed by members.

Only the ID proper is stored (in upper case). e.g. for 1abc we store 00001ABC. This format is the trade off between storing short/extended format and all extended format. storing the ID proper allows direct comparison and fast ordering of PDBId objects with uniform names in the format \d{4}[1-9][0-9A-Z]{3}.

josemduarte · 2021-10-01T22:05:48Z

Sorry for the delay @aalhossary . I've read your new comments and arguments and I see your point. Let me check the PR more carefully now and submit a new review for it.

josemduarte · 2021-10-01T22:06:21Z

BTW when you have time, could you solve the conflicts by merging master into your branch? That would also make the diff cleaner, without the changes related to #948

aalhossary · 2021-10-02T08:11:55Z

I committed what I have till now.
I will try merging the master branch now.

Conflicts: biojava-structure/src/main/java/org/biojava/nbio/structure/PDBHeader.java biojava-structure/src/main/java/org/biojava/nbio/structure/Structure.java

aalhossary · 2021-10-02T09:42:22Z

I have 2 questions that need to clarify:

Is the first character of the current (short) PDB ID format a digit in the range [0-9] or [1-9]?
will the "PDB_" prefix in the future (extended) format be case sensitive or not?

- Class name: changed into PdbId - methods name: changed into getPdbId(), setPdbId(PdbId) - field name: pdbId

aalhossary · 2021-10-15T03:29:06Z

I pushed a new set of commits addressing your comments @josemduarte.
I did not modify the RegEx and PdbId.XXX yet. They can be modified later.

It introduced a new bug. Reverting it.

Conflicts: CHANGELOG.md biojava-structure/src/main/java/org/biojava/nbio/structure/align/client/PdbPair.java

aalhossary · 2021-10-22T13:41:08Z

I received this reply about the entry/ies with XXXX

XXXX in that file should not be considered a valid PDB ID.
PDB IDs are used only in .ent (entry files). XXXX is used in any derivative files (like biological assembly files) to ensure that these files are not interpreted to be actual released structures.

My interpretation is that this XXXX may appear in other files.

Consequently, I believe it's best to:

remove the PdbId.XXXX constant,
allow the XXXX exception, and
add another detail: XXXX PdbId objects should NOT be equal, unless they are the same object.

This should not break Java rules; as the rule in Java documentation is

"If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result." However, the opposite is not required.
equals(Object obj) "is reflexive: for any non-null reference value x, x.equals(x) should return true."

aalhossary · 2021-10-22T16:40:11Z

@josemduarte I pushed a new commit right now. I hope it will be the last in this pull request :)

josemduarte · 2021-10-22T18:05:55Z

My interpretation is that this XXXX may appear in other files.
Consequently, I believe it's best to:
remove the PdbId.XXXX constant,
allow the XXXX exception, and
add another detail: XXXX PdbId objects should NOT be equal, unless they are the same object.

I think that the reply you got makes it quite clear: XXXX is not special but just some random string that is used to make sure there's no confusion with actual PDB ids. For instance, one day XXXX could be replaced by any other string in those assembly cif files.

Consequently I think it is important to not treat XXXX specially and to remove any logic around it. When XXXX or any other string that doesn't pass a standard PDB id regex is encountered, then it should not be considered a PDB id and the PdbId field in the Structure object should be set to null.

aalhossary · 2021-10-22T23:58:07Z

Consequently I think it is important to not treat XXXX specially and to remove any logic around it. When XXXX or any other string that doesn't pass a standard PDB id regex is encountered, then it should not be considered a PDB id and the PdbId field in the Structure object should be set to null.

Shall I understand that you want ANY malformed PdbId be gracefully set to null?

aalhossary · 2021-10-23T02:55:41Z

Well, I removed XXXX altogether, and set all malformed PdbId values to null gracefully.
Please review the effects of this update carefully.
Thank you!
Amr

josemduarte

Thanks, I've done another round of reviewing

CHANGELOG.md

biojava-structure/src/main/java/org/biojava/nbio/structure/PdbId.java

josemduarte · 2021-10-25T18:58:21Z

biojava-structure/src/main/java/org/biojava/nbio/structure/align/client/StructureName.java

+				pdbId = new PdbId(URLIdentifier.guessPDBID( path.substring(path.lastIndexOf('/')+1) ));
+			} catch (IllegalArgumentException e) {
+				pdbId = null;
+			}


An explicit check for null (instead of try/catch) will be more readable and more robust if implementation changes later on

a null parameter is not the only source of IllegalArgumentException. The method toInternalFormat() (called from the constructor) is actually more likely to throw the IllegalArgumentException.
I can't remove the catch. I can return the explicit checked exception thrown from the constructor instead :)

biojava-structure/src/main/java/org/biojava/nbio/structure/align/util/AtomCache.java

biojava-structure/src/main/java/org/biojava/nbio/structure/io/PDBFileParser.java

biojava-structure/src/main/java/org/biojava/nbio/structure/io/cif/CifStructureConsumerImpl.java

aalhossary · 2021-10-26T05:30:30Z

I submitted a new commit that addresses all reviewer's comments except replacing the try/catch block with an explicit null pointer check.

josemduarte

Thank you for all your efforts! I think all looks good now! Just the TODOs should be removed.

biojava-structure/src/main/java/org/biojava/nbio/structure/StructureIO.java

biojava-structure/src/main/java/org/biojava/nbio/structure/SubstructureIdentifier.java

biojava-structure/src/main/java/org/biojava/nbio/structure/align/util/AtomCache.java

josemduarte · 2021-10-28T16:19:47Z

Tests are passing. I'll merge this in by the end of tomorrow if there no other comments.

Support for extended PDBID PDB_nnnnNXXX

6c22796

Created a new class PDBId, and removed old deprecated methods. Most of the changes are in Structure, StructureImpl, and PDBFileParser classes.

aalhossary requested a review from josemduarte July 27, 2021 11:23

aalhossary self-assigned this Jul 27, 2021

aalhossary requested a review from sbliven July 27, 2021 11:27

Some deprecated functions removal

dc12461

sbliven reviewed Aug 2, 2021

View reviewed changes

sbliven requested changes Aug 2, 2021

View reviewed changes

Fix typo

7b08387

aalhossary added 3 commits August 4, 2021 20:07

Targeted most of Spencer's comments

c39632c

Fixed a bug in PDBId.equals() method

7dd8528

Changing PDBID in SCOP record to lower case

e8466b9

to match the expected String

aalhossary requested a review from sbliven August 4, 2021 19:03

josemduarte reviewed Aug 30, 2021

View reviewed changes

josemduarte closed this Sep 20, 2021

aalhossary reopened this Sep 22, 2021

aalhossary added 4 commits October 1, 2021 20:31

public static PDBId field XXXX

12efad9

Behavior is now an enum

d4dfc95

Expose Pattern objects instead of Strings

22e9158

aalhossary added 2 commits October 2, 2021 15:44

Enhancements and bug fix

76112f0

Added BDBId Unit test

d95e561

Merge branch 'master' into Support_Extended_PDBID

f54dd7d

Conflicts: biojava-structure/src/main/java/org/biojava/nbio/structure/PDBHeader.java biojava-structure/src/main/java/org/biojava/nbio/structure/Structure.java

aalhossary added 7 commits October 14, 2021 19:51

Change capitalization state

a55aa95

- Class name: changed into PdbId - methods name: changed into getPdbId(), setPdbId(PdbId) - field name: pdbId

Fix probable NPE + keep consistent PdbId method naming convention

732a2c2

Addressed some of the reviewer's comments.

34cb49e

Use JUnit 5

40a55d2

PdbId class Documentation

141d667

JavaDoc style update

c91a4dc

Minor updates

9ce50fe

aalhossary requested a review from josemduarte October 15, 2021 03:29

aalhossary added 3 commits October 16, 2021 07:43

Reverting a wrong optimization

0fb6742

It introduced a new bug. Reverting it.

Merge branch 'biojavaorigin/master' into Support_Extended_PDBID

02bfee4

Conflicts: CHANGELOG.md biojava-structure/src/main/java/org/biojava/nbio/structure/align/client/PdbPair.java

Adding BioJava development code

0d96963

aalhossary added 2 commits October 23, 2021 00:21

Javadoc

87fbc10

XXXX PdbId objects are not equal unless they are the same object

17f366e

No more XXXX. ANY malformed PdbId gracefully set to null

ef9cf3f

josemduarte requested changes Oct 25, 2021

View reviewed changes

Addressing Reviewer's comments

02974cd

josemduarte approved these changes Oct 27, 2021

View reviewed changes

Removing TODOs

108fe97

aalhossary requested a review from josemduarte October 27, 2021 22:29

josemduarte approved these changes Oct 27, 2021

View reviewed changes

josemduarte merged commit 12d348d into biojava:master Oct 29, 2021

josemduarte mentioned this pull request Jan 18, 2022

Error running on Alphafold models rcsb/symmetry#113

Closed

Support extended pdbid #950

Support extended pdbid #950

Uh oh!

Conversation

aalhossary commented Jul 27, 2021

Uh oh!

sbliven left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sbliven left a comment

Choose a reason for hiding this comment

PDBId class

Consistent usage

Keywords

Parsing

Testing

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

aalhossary commented Aug 3, 2021

KEYWODS records

PDBId class

Default shortening Behavior

Short Id to minimize memory usage

Consistent usage

Parsers using the new PDBId

Testing

The only format EXCEPTION

Uh oh!

aalhossary commented Aug 4, 2021

Uh oh!

aalhossary commented Aug 29, 2021

Uh oh!

josemduarte left a comment

Choose a reason for hiding this comment

Uh oh!

aalhossary commented Sep 22, 2021

Helper Method

The need for complicated restructuring.

Should the String accessors be deprecated or live side by side?

Finally

My next tasks in this PR

Uh oh!

josemduarte commented Oct 1, 2021

Uh oh!

josemduarte commented Oct 1, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

aalhossary commented Oct 2, 2021

Uh oh!

aalhossary commented Oct 2, 2021

Uh oh!

aalhossary commented Oct 15, 2021

Uh oh!

aalhossary commented Oct 22, 2021

Uh oh!

aalhossary commented Oct 22, 2021

Uh oh!

josemduarte commented Oct 22, 2021

Uh oh!

aalhossary commented Oct 22, 2021

Uh oh!

aalhossary commented Oct 23, 2021

Uh oh!

josemduarte left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

josemduarte commented Oct 1, 2021 •

edited

Loading

aalhossary commented Oct 26, 2021 •

edited

Loading