Added Java Streaming I/O for FASTA Files #1080
Merged
8 commits
f45e018 Changed the URLs to the mirrors from http to https. Changed the name… (hilbertglm)
96c8126 Before finding the InputStreamProvider (hilbertglm)
fd3be7e Removed the MagicNumber class after finding InputStreamProvider (hilbertglm)
ef665e2 Added comments (hilbertglm)
26b21d7 Added access to the iterable (hilbertglm)
d081b12 Changed the version to 7.1.0-SNAPSHOT. Made changes per review. (hilbertglm)
2d1f32e Missed a RuntimeException. Changed to UncheckedIOException (hilbertglm)
84998b9 Added an entry to the changelog for FastaStreamer (hilbertglm)
179 changes: 179 additions & 0 deletions
biojava-core/src/main/java/org/biojava/nbio/core/sequence/io/FastaStreamer.java
@@ -0,0 +1,179 @@

```java
package org.biojava.nbio.core.sequence.io;

import org.biojava.nbio.core.sequence.ProteinSequence;
import org.biojava.nbio.core.sequence.compound.AminoAcidCompound;
import org.biojava.nbio.core.sequence.compound.AminoAcidCompoundSet;
import org.biojava.nbio.core.sequence.io.template.SequenceCreatorInterface;
import org.biojava.nbio.core.sequence.io.template.SequenceHeaderParserInterface;
import org.biojava.nbio.core.util.InputStreamProvider;

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

/**
 * Read from a FASTA file (or gzipped FASTA file) and create a Java stream of {@link ProteinSequence} objects
 * for use in a functional programming paradigm.
 *
 * @author Gary Murphy
 * @since 7.1.0
 */
public class FastaStreamer {

    private final Path path;
    private int batchSize = 1_000;
    private SequenceHeaderParserInterface<ProteinSequence, AminoAcidCompound> headerParser;
    private SequenceCreatorInterface<AminoAcidCompound> sequenceCreator;
    private LinkedHashMap<String, ProteinSequence> chunk = new LinkedHashMap<>();
    private Iterator<Map.Entry<String, ProteinSequence>> iterator = Collections.emptyIterator();
    private boolean closed = false;

    /**
     * The constructor is private. Created via the <tt>from(...)</tt> static factory method
     *
     * @param path the path to the file containing the FASTA content (possibly GZipped)
     */
    private FastaStreamer(final Path path) {
        this.path = path;
    }

    public static FastaStreamer from(final Path path) {
        return new FastaStreamer(path);
    }

    public static FastaStreamer from(File file) {
        return from(file.toPath());
    }

    public FastaStreamer withHeaderParser(SequenceHeaderParserInterface<ProteinSequence, AminoAcidCompound> headerParser) {
        this.headerParser = headerParser;
        return this;
    }

    public FastaStreamer withSequenceCreator(SequenceCreatorInterface<AminoAcidCompound> sequenceCreator) {
        this.sequenceCreator = sequenceCreator;
        return this;
    }

    public FastaStreamer batchSize(int size) {
        this.batchSize = size;
        return this;
    }

    /**
     * Enable iteration through the proteins in the file using syntax such as:
     * <pre>
     * for(ProteinSequence sequence : FastaStreamer.from(path).each()) {
     *     .
     *     .
     *     .
     * }
     * </pre>
     *
     * @return an iterable suitable for an iteration loop
     */
    public Iterable<ProteinSequence> each() {
        return () -> stream().iterator();
    }

    /**
     * Create a stream of protein sequences from the contents of the path
     * @return the stream
     */
    public Stream<ProteinSequence> stream() {
        InputStreamProvider provider = new InputStreamProvider();
        InputStream input;
        try {
            input = provider.getInputStream(getPath().toFile());
        } catch (IOException exception) {
            throw new UncheckedIOException(exception);
        }
        FastaReader<ProteinSequence, AminoAcidCompound> reader = new FastaReader<>(input, getHeaderParser(), getSequenceCreator());
        Spliterator<ProteinSequence> source = new Spliterators.AbstractSpliterator<>(Integer.MAX_VALUE, Spliterator.IMMUTABLE | Spliterator.NONNULL) {
            @Override
            public boolean tryAdvance(Consumer<? super ProteinSequence> action) {
                if (closed) {
                    return false;
                }
                ProteinSequence protein = next(reader);
                if (null == protein) {
                    return false;
                }
                action.accept(protein);
                return true;
            }

            /**
             * Fetch the next header/protein tuple from the cache. If the cache is empty, fetch another
             * batch from the source file
             *
             * @param reader
             *     the input stream from which the FASTA content is read
             * @return the protein sequence
             */
            private ProteinSequence next(FastaReader<ProteinSequence, AminoAcidCompound> reader) {
                try {
                    if (!iterator.hasNext()) {
                        chunk = reader.process(getBatchSize());
                        if (null == chunk) {
                            closed = true;
                            reader.close();
                            return null;
                        }
                        iterator = chunk.entrySet().iterator();
                    }
                    if (iterator.hasNext()) {
                        Map.Entry<String, ProteinSequence> entry = iterator.next();
                        return createSequence(entry.getValue());
                    }
                    closed = true;
                    reader.close();
                } catch (IOException exception) {
                    throw new UncheckedIOException(String.format("I/O error reading the FASTA file from '%s'", getPath()), exception);
                }
                return null;
            }
        }; // Spliterator
        return StreamSupport.stream(source, false);
    }

    /**
     * Create the sequence with the information from the header. This implementation returns the sequence as-is, but
     * this is an opportunity for the implementer to build specific information into the user collection space
     * of the sequence
     *
     * @param sequence the protein sequence
     * @return the sequence
     */
    protected ProteinSequence createSequence(ProteinSequence sequence) {
        return sequence;
    }

    protected Path getPath() {
        return path;
    }

    protected int getBatchSize() {
        return batchSize;
    }

    protected SequenceHeaderParserInterface<ProteinSequence, AminoAcidCompound> getHeaderParser() {
        return Optional.ofNullable(headerParser).orElse(new GenericFastaHeaderParser<>());
    }

    public SequenceCreatorInterface<AminoAcidCompound> getSequenceCreator() {
        return Optional.ofNullable(sequenceCreator).orElse(new ProteinSequenceCreator(AminoAcidCompoundSet.getAminoAcidCompoundSet()));
    }
}
```
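
The batching pattern in `stream()` (refill a buffer from the reader, hand out one element per `tryAdvance` call, stop when the source is exhausted) can be tried without BioJava on the classpath. The following is a minimal sketch of that pattern using only the JDK. `BatchedLineStreamer` and its `stream(BufferedReader, int)` method are hypothetical names that are not part of the PR, and the sketch streams plain text lines rather than `ProteinSequence` objects. Unlike the PR, it keeps the buffer and the end-of-input flag inside the spliterator rather than in fields of the enclosing object, which is an assumption about what is convenient for a standalone example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

/** Hypothetical stand-in for FastaStreamer: streams text lines in batches. */
final class BatchedLineStreamer {

    /** Lazily stream the reader's lines, refilling an internal buffer batchSize lines at a time. */
    static Stream<String> stream(BufferedReader reader, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        Spliterator<String> source = new Spliterators.AbstractSpliterator<String>(
                Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.NONNULL) {

            // Per-stream state: the current batch and whether the reader has hit end of input.
            private final Deque<String> batch = new ArrayDeque<>();
            private boolean exhausted = false;

            @Override
            public boolean tryAdvance(Consumer<? super String> action) {
                if (batch.isEmpty() && !exhausted) {
                    refill();
                }
                String next = batch.poll();
                if (next == null) {
                    return false; // nothing buffered and nothing left to read
                }
                action.accept(next);
                return true;
            }

            /** Read up to batchSize lines; close the reader once end of input is reached. */
            private void refill() {
                try {
                    for (int i = 0; i < batchSize; i++) {
                        String line = reader.readLine();
                        if (line == null) {
                            exhausted = true;
                            reader.close();
                            return;
                        }
                        batch.add(line);
                    }
                } catch (IOException exception) {
                    // Stream operations cannot declare checked exceptions, so wrap it.
                    throw new UncheckedIOException(exception);
                }
            }
        };
        return StreamSupport.stream(source, false);
    }
}
```

Calling `BatchedLineStreamer.stream(new BufferedReader(new StringReader("a\nb\nc")), 2)` yields the three lines in order, with the second refill picking up the final line. A small batch size exercises the boundary between batches in the same way the `batchSize(2)` case in the PR's test does.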
50 changes: 50 additions & 0 deletions
biojava-core/src/test/java/org/biojava/nbio/core/sequence/io/FastaStreamerTest.java
@@ -0,0 +1,50 @@

```java
package org.biojava.nbio.core.sequence.io;

import org.biojava.nbio.core.sequence.ProteinSequence;
import org.junit.Assert;
import org.junit.Test;

import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

/**
 * Test the functionality of the {@link FastaStreamer} code
 */
public class FastaStreamerTest {

    @Test
    public void stream() throws IOException {
        String file = this.getClass().getResource("PF00104_small.fasta.gz").getFile();
        Path path = Paths.get(file);
        List<ProteinSequence> sequences;

        sequences = FastaStreamer.from(path).stream().collect(Collectors.toList());
        Assert.assertEquals("Count", 283, sequences.size());

        ProteinSequence sequence;
        sequence = sequences.get(0);
        Assert.assertEquals("A2D504_ATEGE/1-46", sequence.getOriginalHeader());
        sequence = sequences.get(sequences.size()-1);
        Assert.assertEquals("Q98SJ1_CHICK/15-61", sequence.getOriginalHeader());

        sequences = FastaStreamer.from(path)
                .batchSize(2) // Ensure there isn't an edge condition loading the next buffer
                .stream()
                .collect(Collectors.toList());
        Assert.assertEquals("Count", 283, sequences.size());
    }

    @Test
    public void iterate() {
        String file = this.getClass().getResource("PF00104_small.fasta.gz").getFile();
        Path path = Paths.get(file);
        int count = 0;
        for (ProteinSequence sequence : FastaStreamer.from(path).each()) {
            count++;
        }
        Assert.assertEquals("Count", 283, count);
    }
}
```
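
The `iterate()` test relies on `each()`, which the PR writes as `() -> stream().iterator()`. That works because `Iterable` has a single abstract method, `iterator()`, so a lambda can implement it. The sketch below isolates the idiom with JDK types only. `StreamIterables`, `each(Supplier)`, and `count(Iterable)` are hypothetical names introduced for illustration and do not appear in the PR.

```java
import java.util.function.Supplier;
import java.util.stream.Stream;

/** Hypothetical helper showing how a stream can back a for-each loop. */
final class StreamIterables {

    /** Adapt a stream supplier to an Iterable; every call to iterator() asks for a fresh stream. */
    static <T> Iterable<T> each(Supplier<Stream<T>> streams) {
        // The lambda is the implementation of Iterable.iterator().
        return () -> streams.get().iterator();
    }

    /** Count elements with a for-each loop, mirroring the shape of the iterate() test. */
    static <T> int count(Iterable<T> items) {
        int count = 0;
        for (T ignored : items) {
            count++;
        }
        return count;
    }
}
```

Passing a supplier instead of a stream matters here: a Java stream can be consumed only once, so an `Iterable` that is looped over twice needs a new stream for each pass.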
Binary file added (+30.3 KB): biojava-core/src/test/resources/org/biojava/nbio/core/sequence/io/PF00104_small.fasta.gz
Binary file added (+54 Bytes): biojava-core/src/test/resources/org/biojava/nbio/core/util/example.gz
Wouldn't it be better to throw the exception so that it can be handled at a higher level?
Or if `stream()` doesn't permit a throws in the signature, then I'd advise using an `UncheckedIOException` instead of `RuntimeException`.
The exceptions would break stream chaining, as the contrived example below shows. I changed to `UncheckedIOException`. I originally had the exception being thrown, so it is allowed by Java, but it caused me problems when I used it in my code, so I get where you are coming from.
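
The author's own example is not preserved in this capture of the page. As a separate illustration of the point, here is a minimal sketch of why a checked `IOException` gets in the way of stream pipelines and how wrapping it in `UncheckedIOException` avoids that. `UncheckedChaining`, `load`, `loadUnchecked`, and `loadAll` are hypothetical names and are not taken from the PR.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical example of keeping a pipeline chainable when a step can fail with an I/O error. */
final class UncheckedChaining {

    /** Stand-in for an I/O-backed operation that declares a checked exception. */
    static String load(String name) throws IOException {
        if (name.isEmpty()) {
            throw new IOException("empty name");
        }
        return name.toUpperCase();
    }

    /** Same operation, with the checked exception converted to an unchecked one. */
    static String loadUnchecked(String name) {
        try {
            return load(name);
        } catch (IOException exception) {
            throw new UncheckedIOException(exception);
        }
    }

    static List<String> loadAll(Stream<String> names) {
        // names.map(UncheckedChaining::load) does not compile, because
        // Function.apply declares no checked exceptions.
        return names.map(UncheckedChaining::loadUnchecked).collect(Collectors.toList());
    }
}
```

Callers that want to handle the failure at a higher level, as the first review comment asks, can still catch `UncheckedIOException` around the whole pipeline and read the original `IOException` from `getCause()`.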