Skip to content

Conversation

@aalhossary
Copy link
Member

I added PDBId class to replace the current String PdbId with conversation methods to and from short format.
I will write the documentation later, as soon as the API is approved and finalized.
Fixes #930.

Created a new class PDBId, and removed old deprecated methods.
Most of the changes are in Structure, StructureImpl, and PDBFileParser
classes.
@aalhossary aalhossary requested a review from josemduarte July 27, 2021 11:23
@aalhossary aalhossary self-assigned this Jul 27, 2021
@aalhossary aalhossary requested a review from sbliven July 27, 2021 11:27
Copy link
Member

@sbliven sbliven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aalhossary Awesome work! FYI I've started reviewing it but it's a pretty PR so it might take me a few days.

Copy link
Member

@sbliven sbliven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent, thanks @aalhossary! Extended ids are now available in mmcif so this is good timing.

I was looking for detailed information about this change but I wasn't able find much beyond the two wwpdb blogs. Do you know of any more detailed info about the change? Are there example files?

If I understand correctly, the major changes are:

  • New PDBId class which can represent IDs as either 1ABC or PDB_00001ABC form
  • Parse keywords from pdb headers

Breaking changes:

  • Some deprecated and removed methods in Structure
    • notably, String getPdbId() becomes PDBId getPDBId() (note capitalization) or String getPDBCode()
  • PDB codes are returned upper-case consistently

PDBId class

I think that wrapping ids into a light-weight class like this is a good approach. It's better than storing them as strings and providing conversion methods, despite the small overhead of object creation.

One thing we should discuss is all the static flags controlling short/long behavior. It's nice to have this flexibility, but since it's global state it is difficult to test and can lead to unexpected bugs if the flags get changed in a different part of the code. Instead of global state, maybe any code that needs to override the defaultShorteningBehavior should be forced to call getId(false)?

We could also improve the memory usage by storing either the 4-byte string (if isShortCompatible) or the 8-byte extended ID. Then the PDB_ prefix and leading zeros would only be added in toExtendedId. Maybe this is premature optimization, but I think loading large lists of PDBIds is a reasonable use case.

Consistent usage

It would be good to be consistent about whether to accept String or PDBId.

I would suggest that classes that are used commonly by users should override methods to accept both Strings and PDBIds. The more "internal classes" (I'm thinking of the parser internals like StructureDataInterface) can just have PDBId versions. Or if we want to minimize breaking, we can define both but deprecate the String versions for a release.

Keywords

Is the keyword parsing related to the extended ids in any way? If not it might be good to extract that to a separate PR.

Keywords should also be set from mmcif (and mmtf if they are supported).

Parsing

This PR sets up all the infrastructure for storing extended PDBIds. However, I suspect that additional work will be needed to ensure that the parsers are using the new IDs. For instance, I don't see anywhere that is reading the _database_2.pdbx_database_accession field in the cif parser. (This is more a TODO than a suggestion.)

Testing

I didn't see any tests that actually use 5+ character structures. Can we add a cif file that uses a long id and test parsing and such. I suspect we will find additional bugs (e.g. in the alignment output formats). We don't need to fix everything in this initial PR of course.

Thanks so much for building this feature. This was way more work than I expected when I made this issue, so thanks for taking it on!

@aalhossary
Copy link
Member Author

Hi Spencer @sbliven,

I don't have any resources other than these 2 blogs too :). There are no details nor examples :(

While writing my edit, I focused on making the minimal changes possible to the current codebase. Therefore, I minimized throwing any new exceptions from already present code that does not do so.

I will reply your individual questions in place.

KEYWODS records

It is actually a part of another PR. Both have some files in common. I thought by the time this PR gets reviewed, the other one would have been approved already.

PDBId class

Default shortening Behavior

The static flag for default shortening behavior does NOT have a setter. The static flag is mainly for easier change in the future in a later release. On the other hand, the object method getId(boolean) is public in case anybody wants it.

Short Id to minimize memory usage

I don't think storing 4-bute short ID is a good idea for these reasons:

  1. It will add a lot of overhead to compare / hashcode Ids for a small memory benefit
  2. the New IDs would eventually supersede the short ID, especially as the new depositions are being added in an increasing rate, while the old ones are being obsoleted by new ones.

However, I will consider storing the 8-byte extended ID only in a later optimization PR, after the new IDs start to emerge and we see how the format would look like.

Consistent usage

I am planning to move away from the String gradually as possible, leaving the very frequently used classes by users have overloaded functions with deprecated String variant, but I will NOT remove them soon.

Parsers using the new PDBId

Whatever the parser is doing, it will eventually one of the SetPdbId() [already removed altogether], setPDBId(), or setPdbCode() methods.
However, I would appreciate it if you point out any missing / additional places to take care of. I am sorry I am still nw to parsing mmCIF files.

Testing

Before submitting my PR, I had either modified my code a lot to pass the currently available unit tests, or to modify the tests themselves, if they test for non realistic scenarios.

The only format EXCEPTION

XXXX is the only exception of an accepted ID that doesn't follow the \d\p{Alnum}{3} format.
I added it because I found it declared in the header of 4nwr-assembly1.cif.gz referenced from TestURLBasedFileParsing.
Can somebody confirm whither this file is correct or wrong and should be remediated? @josemduarte Can you help us find the answer of this question?

@aalhossary aalhossary requested a review from sbliven August 4, 2021 19:03
@aalhossary
Copy link
Member Author

I submitted a new revision (this time I performed a full system test before submission :) ).
I left the questions which need answers unresolved in @sbliven comments.

@aalhossary
Copy link
Member Author

@sbliven , @josemduarte This PR has been silent for 25 days. Shall I consider the API stable and start writing the documentation?

Copy link
Contributor

@josemduarte josemduarte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @aalhossary . I have a few concerns about this one, the most important of them already expressed by @sbliven : this new class to wrap the PDB identifiers should be used consistently throughout all the relevant interfaces. Doing that is quite complicated and begs for the question whether the String accessors should be deprecated or live side by side.

For that and other reasons, I'm going to vote against proceeding with this solution. I feel that having some helper methods that can do PDB id string handling could be a more flexible solution.

@aalhossary
Copy link
Member Author

In fact, I am not convinced with your 2 or 3 points @josemduarte for these reasons:

Helper Method

A helper method was an option that I excluded in the first place; because what we need is not only conversion from old format to extended format and vise versa. but also

  • checks of order (via java.lang.Comparable) among different PDBIDs.
  • checks of hashCode and equals among PDBIDs of different formats to use in a HashMap.
  • checking whether a PDBID is short or long, and hence do whatever necessary conversions.
  • checking whether an extended PDBID is shortable.
  • general regular expression matching.

Implementing all of these features needs several methods. If they all were implemented as static helper functions somewhere, the result would be

  • Complex code
  • We reply on the user's own discretion to use them properly.
  • We would either add several methods to the Strusture class that are actually unrelated to its main purpose, or add a new separate class . In the later case, I prefer grouping all related methods in a light-weight class, which is what I did.

The need for complicated restructuring.

I reviewed every place where setPDBCode() is called. I found only 4 places remaining, two are to be done in the next commit, and the other two are actually misuse of the field and should be changed.
I also reviewed every comment by Spenser Bliven and found that most if not all of them are met and solved already.

this new class to wrap the PDB identifiers should be used consistently throughout all the relevant interfaces.

I believe my PR is quite comprehensive. @josemduarte I would appreciate it if you list out all remaining interfaces to restructure. It would then be my own task to handle them.

Should the String accessors be deprecated or live side by side?

By all means, they should coexist now and for a long-enough time. Maybe we will think about the answer of this question when we release BioJava 7.0.0 or even 10.0.0!
If you ask me about my opinion, they should coexist (as convenience methods).

Finally

I appreciate mentioning the remaining concerns you have @josemduarte.
What is the opinion of other members?
What do you think @sbliven ?

My next tasks in this PR

  1. Keeping shorter string to represent the PDB ID in memory.
  2. Replacing remaining setPDBCode and getPDBCode with setPDBId and getPDBId
  3. Adding documentation
  4. Any other refactoring to be proposed by members.

@aalhossary aalhossary reopened this Sep 22, 2021
Only the ID proper is stored (in upper case).
e.g. for 1abc we store 00001ABC.

This format is the trade off between storing short/extended format and
all extended format.
storing the ID proper allows direct comparison and fast ordering of
PDBId objects with uniform names in the format \d{4}[1-9][0-9A-Z]{3}.
@josemduarte
Copy link
Contributor

Sorry for the delay @aalhossary . I've read your new comments and arguments and I see your point. Let me check the PR more carefully now and submit a new review for it.

@josemduarte
Copy link
Contributor

josemduarte commented Oct 1, 2021

BTW when you have time, could you solve the conflicts by merging master into your branch? That would also make the diff cleaner, without the changes related to #948

@aalhossary
Copy link
Member Author

I committed what I have till now.
I will try merging the master branch now.

Conflicts:
	biojava-structure/src/main/java/org/biojava/nbio/structure/PDBHeader.java
	biojava-structure/src/main/java/org/biojava/nbio/structure/Structure.java
@aalhossary
Copy link
Member Author

I have 2 questions that need to clarify:

Is the first character of the current (short) PDB ID format a digit in the range [0-9] or [1-9]?
will the "PDB_" prefix in the future (extended) format be case sensitive or not?

@aalhossary
Copy link
Member Author

I pushed a new set of commits addressing your comments @josemduarte.
I did not modify the RegEx and PdbId.XXX yet. They can be modified later.

It introduced a new bug. Reverting it.
Conflicts:
	CHANGELOG.md
	biojava-structure/src/main/java/org/biojava/nbio/structure/align/client/PdbPair.java
@aalhossary
Copy link
Member Author

I received this reply about the entry/ies with XXXX

XXXX in that file should not be considered a valid PDB ID.
PDB IDs are used only in .ent (entry files). XXXX is used in any derivative files (like biological assembly files) to ensure that these files are not interpreted to be actual released structures.

My interpretation is that this XXXX may appear in other files.

Consequently, I believe it's best to:

  • remove the PdbId.XXXX constant,
  • allow the XXXX exception, and
  • add another detail: XXXX PdbId objects should NOT be equal, unless they are the same object.

This should not break Java rules; as the rule in Java documentation is

  • "If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result." However, the opposite is not required.
  • equals(Object obj) "is reflexive: for any non-null reference value x, x.equals(x) should return true."

@aalhossary
Copy link
Member Author

@josemduarte I pushed a new commit right now. I hope it will be the last in this pull request :)

@josemduarte
Copy link
Contributor

My interpretation is that this XXXX may appear in other files.
Consequently, I believe it's best to:
remove the PdbId.XXXX constant,
allow the XXXX exception, and
add another detail: XXXX PdbId objects should NOT be equal, unless they are the same object.

I think that the reply you got makes it quite clear: XXXX is not special but just some random string that is used to make sure there's no confusion with actual PDB ids. For instance, one day XXXX could be replaced by any other string in those assembly cif files.

Consequently I think it is important to not treat XXXX specially and to remove any logic around it. When XXXX or any other string that doesn't pass a standard PDB id regex is encountered, then it should not be considered a PDB id and the PdbId field in the Structure object should be set to null.

@aalhossary
Copy link
Member Author

Consequently I think it is important to not treat XXXX specially and to remove any logic around it. When XXXX or any other string that doesn't pass a standard PDB id regex is encountered, then it should not be considered a PDB id and the PdbId field in the Structure object should be set to null.

Shall I understand that you want ANY malformed PdbId be gracefully set to null?

@aalhossary
Copy link
Member Author

Well, I removed XXXX altogether, and set all malformed PdbId values to null gracefully.
Please review the effects of this update carefully.
Thank you!
Amr

Copy link
Contributor

@josemduarte josemduarte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I've done another round of reviewing

pdbId = new PdbId(URLIdentifier.guessPDBID( path.substring(path.lastIndexOf('/')+1) ));
} catch (IllegalArgumentException e) {
pdbId = null;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An explicit check for null (instead of try/catch) will be more readable and more robust if implementation changes later on

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a null parameter is not the only source of IllegalArgumentException. The method toInternalFormat() (called from the constructor) is actually more likely to throw the IllegalArgumentException.
I can't remove the catch. I can return the explicit checked exception thrown from the constructor instead :)

@aalhossary
Copy link
Member Author

aalhossary commented Oct 26, 2021

I submitted a new commit that addresses all reviewer's comments except replacing the try/catch block with an explicit null pointer check.

Copy link
Contributor

@josemduarte josemduarte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for all your efforts! I think all looks good now! Just the TODOs should be removed.

@josemduarte
Copy link
Contributor

Tests are passing. I'll merge this in by the end of tomorrow if there no other comments.

@josemduarte josemduarte merged commit 12d348d into biojava:master Oct 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support extended PDB IDs

4 participants