GenbankReader process(int max) seems to close the InputStream? #800

@TorkelE

Description

I have several large .gbff files I am trying to read; the biggest is about 1.5 GB.

The code looks like this:

InputStream is = new FileInputStream("filename.gbff.gz");
is = new GZIPInputStream(is);
GenbankReader<DNASequence, NucleotideCompound> dnaReader =
        new GenbankReader<DNASequence, NucleotideCompound>(
                is,
                new GenericGenbankHeaderParser<DNASequence, NucleotideCompound>(),
                new DNASequenceCreator(AmbiguityDNACompoundSet.getDNACompoundSet()));
LinkedHashMap<String, DNASequence> gbFile = dnaReader.process();
...

Since the files got too large, I am trying to read them in smaller chunks.
(I have also tried setting the VM arguments -Xms512M -Xmx4096M, which did not work due to lack of memory.)

I try

InputStream is = new FileInputStream("filename.gbff.gz");
is = new GZIPInputStream(is);
GenbankReader<DNASequence, NucleotideCompound> dnaReader =
        new GenbankReader<DNASequence, NucleotideCompound>(
                is,
                new GenericGenbankHeaderParser<DNASequence, NucleotideCompound>(),
                new DNASequenceCreator(AmbiguityDNACompoundSet.getDNACompoundSet()));
LinkedHashMap<String, DNASequence> gbFile = dnaReader.process(1);
...

which seems to work. However, I want all the entries, but when I call process(1) more than once:

InputStream is = new FileInputStream("filename.gbff.gz");
is = new GZIPInputStream(is);
GenbankReader<DNASequence, NucleotideCompound> dnaReader =
        new GenbankReader<DNASequence, NucleotideCompound>(
                is,
                new GenericGenbankHeaderParser<DNASequence, NucleotideCompound>(),
                new DNASequenceCreator(AmbiguityDNACompoundSet.getDNACompoundSet()));
LinkedHashMap<String, DNASequence> gbFile1 = dnaReader.process(1);
LinkedHashMap<String, DNASequence> gbFile2 = dnaReader.process(1);
...

I receive the error:

Exception in thread "main" org.biojava.nbio.core.exceptions.ParserException: Stream closed
	at org.biojava.nbio.core.sequence.io.GenbankSequenceParser.getSequence(GenbankSequenceParser.java:391)
	at org.biojava.nbio.core.sequence.io.GenbankReader.process(GenbankReader.java:145)
	at Prorgam.main(Prorgam.java:39)

which seems to defeat the purpose of passing a max count (isn't the stream supposed to stay open?).

However,

InputStream is = new FileInputStream("filename.gbff.gz");
is = new GZIPInputStream(is);
GenbankReader<DNASequence, NucleotideCompound> dnaReader =
        new GenbankReader<DNASequence, NucleotideCompound>(
                is,
                new GenericGenbankHeaderParser<DNASequence, NucleotideCompound>(),
                new DNASequenceCreator(AmbiguityDNACompoundSet.getDNACompoundSet()));
LinkedHashMap<String, DNASequence> gbFile1 = dnaReader.process(1);
dnaReader.close();

does work.

(Also, if I manage to get this to work, is there a GenbankReader.hasNext() or some similar way to go through the file in smaller chunks, without reading it all into memory?)
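If it helps, the usage I am hoping for is roughly the sketch below. To be clear, this is what I expect based on the docs, not something I have confirmed: I am assuming an empty map returned from process(int) signals the end of the input, and the chunk size of 100 is arbitrary.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.zip.GZIPInputStream;

import org.biojava.nbio.core.sequence.DNASequence;
import org.biojava.nbio.core.sequence.compound.AmbiguityDNACompoundSet;
import org.biojava.nbio.core.sequence.compound.NucleotideCompound;
import org.biojava.nbio.core.sequence.io.DNASequenceCreator;
import org.biojava.nbio.core.sequence.io.GenbankReader;
import org.biojava.nbio.core.sequence.io.GenericGenbankHeaderParser;

public class ChunkedGenbankRead {
    public static void main(String[] args) throws Exception {
        InputStream is = new GZIPInputStream(new FileInputStream("filename.gbff.gz"));
        GenbankReader<DNASequence, NucleotideCompound> dnaReader =
                new GenbankReader<DNASequence, NucleotideCompound>(
                        is,
                        new GenericGenbankHeaderParser<DNASequence, NucleotideCompound>(),
                        new DNASequenceCreator(AmbiguityDNACompoundSet.getDNACompoundSet()));
        try {
            LinkedHashMap<String, DNASequence> chunk;
            // Assumption: an empty map from process(int) means the input is exhausted.
            while (!(chunk = dnaReader.process(100)).isEmpty()) {
                // Handle this chunk of up to 100 sequences, then drop the
                // reference so the chunk can be garbage-collected.
                System.out.println("read " + chunk.size() + " sequences");
            }
        } finally {
            dnaReader.close(); // close the underlying stream only when fully done
        }
    }
}
```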

Have I misunderstood something here? From the docs:

public LinkedHashMap<String,S> process(int max)
        throws IOException, CompoundNotFoundException
This method tries to parse maximum max records from the open File or InputStream, and leaves the underlying resource open.
Subsequent calls to the same method continue parsing the rest of the file.
This is particularly useful when dealing with very big data files, (e.g. NCBI nr database), which can't fit into memory and will take long time before the first result is available.

It seems like I should be able to use it this way?
Thanks
