Description
I have a large .gbff file I am trying to read (or rather several large such files). The biggest is about 1.5GB.
The code looks like this:
InputStream is = new FileInputStream("filename.gbff.gz");
is = new GZIPInputStream(is);
GenbankReader<DNASequence, NucleotideCompound> dnaReader =
        new GenbankReader<DNASequence, NucleotideCompound>(
                is,
                new GenericGenbankHeaderParser<DNASequence, NucleotideCompound>(),
                new DNASequenceCreator(AmbiguityDNACompoundSet.getDNACompoundSet()));
LinkedHashMap<String, DNASequence> gbFile = dnaReader.process();
...Since the files have become too large to read in one go, I am trying to read them in smaller chunks. (I have also tried setting the VM arguments -Xms512M -Xmx4096M, which did not work due to lack of memory.) So I try:
InputStream is = new FileInputStream("filename.gbff.gz");
is = new GZIPInputStream(is);
GenbankReader<DNASequence, NucleotideCompound> dnaReader =
        new GenbankReader<DNASequence, NucleotideCompound>(
                is,
                new GenericGenbankHeaderParser<DNASequence, NucleotideCompound>(),
                new DNASequenceCreator(AmbiguityDNACompoundSet.getDNACompoundSet()));
LinkedHashMap<String, DNASequence> gbFile = dnaReader.process(1);
...which seems to work. However, I want all the entries, so I try to call process(1) more than once:
InputStream is = new FileInputStream("filename.gbff.gz");
is = new GZIPInputStream(is);
GenbankReader<DNASequence, NucleotideCompound> dnaReader =
        new GenbankReader<DNASequence, NucleotideCompound>(
                is,
                new GenericGenbankHeaderParser<DNASequence, NucleotideCompound>(),
                new DNASequenceCreator(AmbiguityDNACompoundSet.getDNACompoundSet()));
LinkedHashMap<String, DNASequence> gbFile1 = dnaReader.process(1);
LinkedHashMap<String, DNASequence> gbFile2 = dnaReader.process(1);
...I receive the error:
Exception in thread "main" org.biojava.nbio.core.exceptions.ParserException: Stream closed
at org.biojava.nbio.core.sequence.io.GenbankSequenceParser.getSequence(GenbankSequenceParser.java:391)
at org.biojava.nbio.core.sequence.io.GenbankReader.process(GenbankReader.java:145)
at Prorgam.main(Prorgam.java:39)
...which seems to defeat the purpose of passing an int (isn't the stream supposed to stay open?).
However,
InputStream is = new FileInputStream("filename.gbff.gz");
is = new GZIPInputStream(is);
GenbankReader<DNASequence, NucleotideCompound> dnaReader =
        new GenbankReader<DNASequence, NucleotideCompound>(
                is,
                new GenericGenbankHeaderParser<DNASequence, NucleotideCompound>(),
                new DNASequenceCreator(AmbiguityDNACompoundSet.getDNACompoundSet()));
LinkedHashMap<String, DNASequence> gbFile1 = dnaReader.process(1);
dnaReader.close();
...does work.
(Also, if I manage to get this to work, is there a GenbankReader.hasNext() or some similar mechanism that would let me step through all the smaller chunks of the file without reading them all into memory?)
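For reference, this is roughly the looping pattern I am hoping the API supports. It is only a sketch: it assumes process(int) really does leave the stream open between calls as documented, and the empty-map check as an end-of-input signal is my own assumption, since I have not found a hasNext()-style method.

```java
// Hypothetical chunked-read loop (assumes process(int) leaves the
// underlying stream open between calls, per the javadoc quoted below).
InputStream is = new GZIPInputStream(new FileInputStream("filename.gbff.gz"));
GenbankReader<DNASequence, NucleotideCompound> dnaReader =
        new GenbankReader<DNASequence, NucleotideCompound>(
                is,
                new GenericGenbankHeaderParser<DNASequence, NucleotideCompound>(),
                new DNASequenceCreator(AmbiguityDNACompoundSet.getDNACompoundSet()));
while (true) {
    // Read up to 100 records per iteration instead of the whole file.
    LinkedHashMap<String, DNASequence> chunk = dnaReader.process(100);
    if (chunk.isEmpty()) {
        break; // assumption: an empty result means no records remain
    }
    // ... process this chunk, then let it go out of scope so it can be GC'd ...
}
dnaReader.close();
```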
Have I misunderstood something here? From the docs
public LinkedHashMap<String,S> process(int max)
throws IOException,
CompoundNotFoundException
This method tries to parse maximum max records from the open File or InputStream, and leaves the underlying resource open.
Subsequent calls to the same method continue parsing the rest of the file.
This is particularly useful when dealing with very big data files, (e.g. NCBI nr database), which can't fit into memory and will take long time before the first result is available.
It seems like I should be able to use it this way?
Thanks