Skip to content

Conversation

@sbliven
Copy link
Member

@sbliven sbliven commented Jun 8, 2018

For a short period the ChemComp files didn't resolve, resulting in empty files getting cached by BioJava. The caching problem has been long fixed, but this PR should make it so BioJava can recover and if the user happened to use an old version of BioJava (e.g. older CE-Symm versions).

Fixes #703

sbliven added 6 commits June 4, 2018 17:21
log slow downloads upon start
The main change hasn't been implemented, so we want tests to fail.
However, the tests exposed some NPE and IO exceptions. These are now
fixed, so the tests fail in the expected manner.

- Use ATP ligand, which is not covered by the ReducedChemCompProvider
- Use the ReducedChemCompProvider as a fallback consistently, preventing
  null chemComp
- Defensive parsing in SimpleMMcifConsumer
- Robust test, doesn't require internet!
Files less than 40 bytes are deleted to allow for gzip headers.

This addresses biojava#703
This is necessary when changing the cache path.
return chemComp;
// May be null if the file was corrupt. Fall back on ReducedChemCompProvider in that case
if(chemComp != null) {
return chemComp;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a small change in behavior. If we can't read the downloaded chemcomp file, we now return a stub chemcomp rather than NULL. Previously the stub was returned for corrupt GZIP files (which threw an error) but not for malformed cif contents (which just returned null).

// probably a network error happened. Try to use the ReducedChemCOmpProvider
ReducedChemCompProvider reduced = new ReducedChemCompProvider();
if( fallback == null) {
fallback = new ReducedChemCompProvider();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

creating a ReducedChemCompProvider is cheap (and idempotent), but let's cache it anyways.

*
* @param dir directory to delete
*/
public static void deleteDirectory(Path dir) throws IOException {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surprisingly this isn't an nio method and I couldn't find one in BioJava

cache.setPath(tmpCache.toString());
cache.setCachePath(tmpCache.toString());
cache.setUseMmCif(true);
ChemCompGroupFactory.setChemCompProvider(new DownloadChemCompProvider(tmpCache.toString()));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's annoying how much code is needed to change the cache dir.

@sbliven
Copy link
Member Author

sbliven commented Jun 8, 2018

Tests pass locally, but there's some danger of system-dependent bugs here since I'm hacking the cache path for the two new tests.

BTW, this could be an example of how to use AtomCache in tests without incurring network access.

@sbliven
Copy link
Member Author

sbliven commented Jun 8, 2018

This should be merged to master after merging

Copy link
Contributor

@josemduarte josemduarte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sbliven !

I've submitted a few comments and questions


protected void downloadFileFromRemote(URL remoteURL, File localFile) throws IOException{
// System.out.println("downloading " + remoteURL + " to: " + localFile);
LOGGER.info("Downloaded file {} to local file {}", remoteURL, localFile);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be "Downloading"? It isn't downloaded at this point yet

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

if( ! success) {
return null;
}
assert(!f.exists());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather log a warning here. Assert is going to be ignored in normal situations

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's intended. If the delete was successful then the file will not exist. This is just a sanity check during development to make sure delete() wasn't asynchronous or something.

if( f.length() < MIN_PDB_FILE_SIZE ) {
boolean success = f.delete();
if( ! success) {
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this rather throw an exception (perhaps IOException)? I don't see how can this null be handled if file can't be deleted.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

switched to Files.delete which throws the exception

}

/**
* Force the cache to be reset.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This clears the memory cache, right? Could you add that to javadoc, it's an important detail since there is both memory and file cache

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only memory. Documented.

// delete files that are too short to have contents
if( f.length() < LocalPDBDirectory.MIN_PDB_FILE_SIZE ) {
f.delete();
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the file fails to be deleted, this won't work as expected. I think the fail to delete case should throw an IOException

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't check the status because the delete isn't really necessary here. If this method returns false then we re-download the file and move it on top of the old one. I just added the delete defensively.

}

return reduced.getChemComp(recordName);
return fallback.getChemComp(recordName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a warning logged when the fallback is used? That'd be important, so that users are aware and can investigate if there's something wrong in their code or file system

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea.

residueNrInt = Integer.parseInt(residueNumberS);
} else {
String label_seq_id = atom.getLabel_seq_id();
residueNrInt = Integer.parseInt(label_seq_id);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

label_seq_id is not the same as residue number. This can introduce many problems.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should discuss this further.

I added it because the atp.cif.gz file I added (from pymol) has label_seq_id but not auth_seq_id. It seems reasonable to me to fall back on the sequential numbering if the authors don't specify a custom numbering.

Note that auth_seq_id is optional but label_seq_id is required. From the spec it sounds to me like label_seq_id is a good fallback for ResidueNumber:

_atom_site.auth_seq_id:

An alternative identifier for _atom_site.label_seq_id that
may be provided by an author in order to match the identification
used in the publication that describes the structure.

Ideally BioJava would use Group.getId() rather then getResidueNumber() everywhere. However, many things still use residue numbers, so setting a default seems prudent.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's another potential issue here, which is that the seq_id is stored as a long while ResidueNumber contains an Integer. I don't think there are any files with >2billion groups per entity, but if we hit it then the ResidueNumber would throw a NumberFormatException here. I think that's fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started #775 to continue this discussion.

line = buf.readLine();
while( line != null && (line.isEmpty() || line.startsWith(COMMENT_CHAR))) {
line = buf.readLine();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice to handle comments in first line!

Is an empty first line also valid CIF?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think blank lines are permitted anywhere in CIF, but I could be wrong. I try to write permissive parsers. It seems like it shouldn't break the spec so we might as well be robust to it either way.

assertTrue(chem.getAtoms().size() > 0);
assertEquals("NON-POLYMER", chem.getType());
} finally {
// FileDownloadUtils.deleteDirectory(tmpCache);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why commented? Is cleaning up not needed here?

If not needed please remove the try/finally

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Cleaning up /tmp is polite, but makes it harder to debug failing tests.

assertTrue(chem.getAtoms().size() > 0);
assertEquals("NON-POLYMER", chem.getType());
} finally {
// FileDownloadUtils.deleteDirectory(tmpCache);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why commented?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

sbliven added 3 commits June 12, 2018 15:54
Use GlobalsHelper.pushState()/restoreState() before and after tests
to ensure that state isn't carried between tests.

This is applied to the AtomCacheTest to fix test regressions
while simplifying the code.
Maven runs tests with a clean environment, so we can't restore PDB_DIR
@sbliven
Copy link
Member Author

sbliven commented Jun 15, 2018

@josemduarte please check again and see if I addressed all your suggestions.

The GlobalsHelper class that I added is pretty clean, and it would be good to use it for more tests that hack paths and factory methods.

@sbliven
Copy link
Member Author

sbliven commented Jun 15, 2018

Tests pass with an old cache, but 4LNC got updated Tuesday and now TestExperimentalTechniques.test4LNC fails on Travis.

4LNC was updated to remove the X-Ray experimental method.
This switches the test to 6F2Q, which uses both Neutron & Xray.
String recordName = atom.getGroup_PDB();
String residueNumberS = atom.getAuth_seq_id();
Integer residueNrInt = Integer.parseInt(residueNumberS);
Integer residueNrInt;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you handle this in a separate pull request? as per discussion in #775

Parsing such a file again throws a NumberFormatException.
Further work/discussion of this issue is on biojava#775, but it was blocking
the merging of biojava#774.
@sbliven sbliven merged commit 5a4a68c into biojava:bugfixes-4.2 Jun 18, 2018
@sbliven
Copy link
Member Author

sbliven commented Jun 18, 2018

OK, I removed the mmcif changes (#775) and merged the result.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants