-
Notifications
You must be signed in to change notification settings - Fork 397
Fix #703: Recover from empty structure files in PDB_CACHE_DIR #774
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
log slow downloads upon start
The main change hasn't been implemented, so we want tests to fail. However, the tests exposed some NPE and IO exceptions. These are now fixed, so the tests fail in the expected manner. - Use ATP ligand, which is not covered by the ReducedChemCompProvider - Use the ReducedChemCompProvider as a fallback consistently, preventing null chemComp - Defensive parsing in SimpleMMcifConsumer - Robust test, doesn't require internet!
Files less than 40 bytes are deleted to allow for gzip headers. This addresses biojava#703
This is necessary when changing the cache path.
| return chemComp; | ||
| // May be null if the file was corrupt. Fall back on ReducedChemCompProvider in that case | ||
| if(chemComp != null) { | ||
| return chemComp; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a small change in behavior. If we can't read the downloaded chemcomp file, we now return a stub chemcomp rather than NULL. Previously the stub was returned for corrupt GZIP files (which threw an error) but not for malformed cif contents (which just returned null).
| // probably a network error happened. Try to use the ReducedChemCOmpProvider | ||
| ReducedChemCompProvider reduced = new ReducedChemCompProvider(); | ||
| if( fallback == null) { | ||
| fallback = new ReducedChemCompProvider(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
creating a ReducedChemCompProvider is cheap (and idempotent), but let's cache it anyways.
| * | ||
| * @param dir directory to delete | ||
| */ | ||
| public static void deleteDirectory(Path dir) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Surprisingly this isn't an nio method and I couldn't find one in BioJava
| cache.setPath(tmpCache.toString()); | ||
| cache.setCachePath(tmpCache.toString()); | ||
| cache.setUseMmCif(true); | ||
| ChemCompGroupFactory.setChemCompProvider(new DownloadChemCompProvider(tmpCache.toString())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's annoying how much code is needed to change the cache dir.
|
Tests pass locally, but there's some danger of system-dependent bugs here since I'm hacking the cache path for the two new tests. BTW, this could be an example of how to use AtomCache in tests without incurring network access. |
|
This should be merged to master after merging |
josemduarte
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sbliven !
I've submitted a few comments and questions
|
|
||
| protected void downloadFileFromRemote(URL remoteURL, File localFile) throws IOException{ | ||
| // System.out.println("downloading " + remoteURL + " to: " + localFile); | ||
| LOGGER.info("Downloaded file {} to local file {}", remoteURL, localFile); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be "Downloading"? It isn't downloaded at this point yet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
| if( ! success) { | ||
| return null; | ||
| } | ||
| assert(!f.exists()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather log a warning here. Assert is going to be ignored in normal situations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's intended. If the delete was successful then the file will not exist. This is just a sanity check during development to make sure delete() wasn't asynchronous or something.
| if( f.length() < MIN_PDB_FILE_SIZE ) { | ||
| boolean success = f.delete(); | ||
| if( ! success) { | ||
| return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this rather throw an exception (perhaps IOException)? I don't see how can this null be handled if file can't be deleted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
switched to Files.delete which throws the exception
| } | ||
|
|
||
| /** | ||
| * Force the cache to be reset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This clears the memory cache, right? Could you add that to javadoc, it's an important detail since there is both memory and file cache
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only memory. Documented.
| // delete files that are too short to have contents | ||
| if( f.length() < LocalPDBDirectory.MIN_PDB_FILE_SIZE ) { | ||
| f.delete(); | ||
| return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the file fails to be deleted, this won't work as expected. I think the fail to delete case should throw an IOException
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't check the status because the delete isn't really necessary here. If this method returns false then we re-download the file and move it on top of the old one. I just added the delete defensively.
| } | ||
|
|
||
| return reduced.getChemComp(recordName); | ||
| return fallback.getChemComp(recordName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a warning logged when the fallback is used? That'd be important, so that users are aware and can investigate if there's something wrong in their code or file system
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea.
| residueNrInt = Integer.parseInt(residueNumberS); | ||
| } else { | ||
| String label_seq_id = atom.getLabel_seq_id(); | ||
| residueNrInt = Integer.parseInt(label_seq_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
label_seq_id is not the same as residue number. This can introduce many problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should discuss this further.
I added it because the atp.cif.gz file I added (from pymol) has label_seq_id but not auth_seq_id. It seems reasonable to me to fall back on the sequential numbering if the authors don't specify a custom numbering.
Note that auth_seq_id is optional but label_seq_id is required. From the spec it sounds to me like label_seq_id is a good fallback for ResidueNumber:
An alternative identifier for _atom_site.label_seq_id that
may be provided by an author in order to match the identification
used in the publication that describes the structure.
Ideally BioJava would use Group.getId() rather then getResidueNumber() everywhere. However, many things still use residue numbers, so setting a default seems prudent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's another potential issue here, which is that the seq_id is stored as a long while ResidueNumber contains an Integer. I don't think there are any files with >2billion groups per entity, but if we hit it then the ResidueNumber would throw a NumberFormatException here. I think that's fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started #775 to continue this discussion.
| line = buf.readLine(); | ||
| while( line != null && (line.isEmpty() || line.startsWith(COMMENT_CHAR))) { | ||
| line = buf.readLine(); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice to handle comments in first line!
Is an empty first line also valid CIF?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think blank lines are permitted anywhere in CIF, but I could be wrong. I try to write permissive parsers. It seems like it shouldn't break the spec so we might as well be robust to it either way.
| assertTrue(chem.getAtoms().size() > 0); | ||
| assertEquals("NON-POLYMER", chem.getType()); | ||
| } finally { | ||
| // FileDownloadUtils.deleteDirectory(tmpCache); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why commented? Is cleaning up not needed here?
If not needed please remove the try/finally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. Cleaning up /tmp is polite, but makes it harder to debug failing tests.
| assertTrue(chem.getAtoms().size() > 0); | ||
| assertEquals("NON-POLYMER", chem.getType()); | ||
| } finally { | ||
| // FileDownloadUtils.deleteDirectory(tmpCache); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why commented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
See comments on biojava#774
Use GlobalsHelper.pushState()/restoreState() before and after tests to ensure that state isn't carried between tests. This is applied to the AtomCacheTest to fix test regressions while simplifying the code.
Maven runs tests with a clean environment, so we can't restore PDB_DIR
|
@josemduarte please check again and see if I addressed all your suggestions. The |
|
Tests pass with an old cache, but 4LNC got updated Tuesday and now |
4LNC was updated to remove the X-Ray experimental method. This switches the test to 6F2Q, which uses both Neutron & Xray.
| String recordName = atom.getGroup_PDB(); | ||
| String residueNumberS = atom.getAuth_seq_id(); | ||
| Integer residueNrInt = Integer.parseInt(residueNumberS); | ||
| Integer residueNrInt; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you handle this in a separate pull request? as per discussion in #775
Parsing such a file again throws a NumberFormatException. Further work/discussion of this issue is on biojava#775, but it was blocking the merging of biojava#774.
|
OK, I removed the mmcif changes (#775) and merged the result. |
For a short period the ChemComp files didn't resolve, resulting in empty files getting cached by BioJava. The caching problem has been long fixed, but this PR should make it so BioJava can recover and if the user happened to use an old version of BioJava (e.g. older CE-Symm versions).
Fixes #703