Commit 94ae550

foisys authored and andreasprlic committed
Getting there ;-)
1 parent 8b8da60 commit 94ae550

2 files changed: 81 additions, 66 deletions

_wikis/BioJava:BioJavaXDocs.md

Lines changed: 33 additions & 41 deletions
@@ -616,13 +616,13 @@ you have to pass a RichSequenceIterator over your collection of
 sequences in order for the XML format to group them together into a
 single file with the correct headers:

-<java> BufferedReader br = new BufferedReader(new
-FileReader("myGenbank.gbk")); // an input GenBank file Namespace ns =
-RichObjectFactory.getDefaultNamespace(); // a namespace to override that
-in the file RichSequenceIterator seqs =
-RichSequence.IOTools.readGenbankDNA(br,ns); // we are reading DNA
-sequences RichSequence.IOTools.writeEMBLxml(System.out, seqs, ns); //
-write the whole lot in EMBLxml format to standard out </java>
+<java> // an input GenBank file BufferedReader br = new
+BufferedReader(new FileReader("myGenbank.gbk")); // a namespace to
+override that in the file Namespace ns =
+RichObjectFactory.getDefaultNamespace(); // we are reading DNA sequences
+RichSequenceIterator seqs = RichSequence.IOTools.readGenbankDNA(br,ns);
+// write the whole lot in EMBLxml format to standard out
+RichSequence.IOTools.writeEMBLxml(System.out, seqs, ns); </java>

 If you don't know what format your input file is in, but know it could
 be one of a fixed set of acceptable formats, then you can use BioJavaX's
@@ -664,12 +664,12 @@ constants in the RichSequenceBuilderFactory interface. These constants
 are defined as:

 Table 8.1. RichSequenceBuilderFactory predefined constants.
-RichSequenceBuilderFactor.FACTORY Does not attempt any compression on
-sequence data. RichSequenceBuilderFactor.PACKED Will compress all
-sequence data using PackedSymbolLists.
-RichSequenceBuilderFactor.THRESHOLD Will compress sequence data using a
-PackedSymbolList only when the sequence exceeds 5000 bases in length.
-Otherwise, data is not compressed.
+
+| Name of constant | What it will do |
+|-------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------|
+| RichSequenceBuilderFactor.FACTORY | Does not attempt any compression on sequence data. |
+| RichSequenceBuilderFactor.PACKED | Will compress all sequence data using PackedSymbolLists. |
+| RichSequenceBuilderFactor.THRESHOLD | Will compress sequence data using a PackedSymbolList only when the sequence exceeds 5000 bases in length. Otherwise, data is not compressed. |

 If you set the namespace to null, then the namespace used will depend on
 the format you are reading. For formats which specify namespaces, the
@@ -685,22 +685,16 @@ classes will have similar methods.

 For an alphabet which does not have a tools class, you can do this:

-Alphabet a = ...; // get an alphabet instance from somewhere
-SymbolTokenization st = a.getTokenization("token");
+<java> Alphabet a = ...; // get an alphabet instance from somewhere
+SymbolTokenization st = a.getTokenization("token"); </java>

 #### Writing using RichStreamWriter

 File output is done using RichStreamWriter. This requires:

-`  1.`
-
-`     an OutputStream to write sequences to.`
-`  2.`
-
-`     a Namespace to use for the sequences.`
-`  3.`
-
-`     a RichSequenceIterator that provides the sequences to write.`
+1. An OutputStream to write sequences to.
+2. A Namespace to use for the sequences.
+3. A RichSequenceIterator that provides the sequences to write.

 The namespace should only be specified when the file format includes
 namespace information and you wish to override the information
@@ -716,38 +710,36 @@ convert data from one file format to another with no intermediate steps.
 If you only have one sequence to write, you can wrap it in a temporary
 RichSequenceIterator by using a call like this:

-RichSequence rs = ...; // get sequence from somewhere
+<java> RichSequence rs = ...; // get sequence from somewhere
 RichSequenceIterator it = new SingleRichSeqIterator(rs); // wrap it in
-an iterator
+an iterator </java>

 #### Example

 The following is an example that will read some DNA sequences from a
 GenBank file and write them out to standard output (screen) as FASTA
 using the methods outlined above:

-SymbolTokenization dna = DNATools.getDNA().getTokenization("token"); //
-sequences will be DNA sequences
-
-RichSequenceFormat genbank = new GenbankFormat(); // read Genbank
-RichSequenceFormat fasta = new FastaFormat(); // write FASTA
-RichSequenceBuilderFactory factory =
-RichSequenceBuilderFactory.THRESHOLD; // compress only longer sequences
-Namespace bloggsNS = RichObjectFactory.getObject(
+<java> // sequences will be DNA sequences SymbolTokenization dna =
+DNATools.getDNA().getTokenization("token"); // read Genbank
+RichSequenceFormat genbank = new GenbankFormat(); // write FASTA
+RichSequenceFormat fasta = new FastaFormat(); // compress only longer
+sequences RichSequenceBuilderFactory factory =
+RichSequenceBuilderFactory.THRESHOLD; // read/write everything using the
+'bloggs' namespace Namespace bloggsNS = RichObjectFactory.getObject(

 `                       SimpleNamespace.class, `
 `                       new Object[]{"bloggs"} `
-`                    );                                                     // read/write everything using the 'bloggs' namespace`
+`                    );                                                     `

-BufferedReader input = new BufferedReader(new
-FileReader("mygenbank.file"));// read seqs from "mygenbank.file"
-OutputStream output = System.out; // write seqs to STDOUT
+// read seqs from "mygenbank.file" BufferedReader input = new
+BufferedReader(new FileReader("mygenbank.file")); // write seqs to
+STDOUT OutputStream output = System.out;

 RichStreamReader seqsIn = new
 RichStreamReader(input,genbank,dna,factory,bloggsNS); RichStreamWriter
-seqsOut = new RichStreamWriter(output,fasta);
-seqsOut.writeStream(seqsIn,bloggsNS); // one-step Genbank to Fasta
-conversion!
+seqsOut = new RichStreamWriter(output,fasta); // one-step Genbank to
+Fasta conversion! seqsOut.writeStream(seqsIn,bloggsNS); </java>

 #### Line widths and eliding information
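
Reader's note on the single-sequence case in this file: the documentation says a lone sequence can be wrapped in a temporary RichSequenceIterator with SingleRichSeqIterator. As a rough illustration of that wrapping idea only, here is a minimal plain-Java sketch. `SingleItemIterator` is a hypothetical name, and it implements `java.util.Iterator` rather than any BioJava type, so it should not be read as the library's implementation.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/**
 * Hypothetical stand-in for the idea behind SingleRichSeqIterator:
 * an iterator that yields exactly one element and is then exhausted.
 * Not BioJava code; all names here are invented for illustration.
 */
final class SingleItemIterator<T> implements Iterator<T> {
    private final T item;
    private boolean consumed = false;

    SingleItemIterator(T item) {
        this.item = item;
    }

    @Override
    public boolean hasNext() {
        // True only until the single item has been handed out.
        return !consumed;
    }

    @Override
    public T next() {
        if (consumed) {
            throw new NoSuchElementException("the single item was already returned");
        }
        consumed = true;
        return item;
    }

    public static void main(String[] args) {
        // A String stands in for a sequence object here.
        Iterator<String> it = new SingleItemIterator<>("ACGT");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

A consumer written against an iterator, such as the writer described in the text, can then treat one sequence and many sequences the same way.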

_wikis/BioJava:BioJavaXDocs.mediawiki

Lines changed: 48 additions & 25 deletions
@@ -475,10 +475,14 @@ while (seqs.hasNext()) {
 If you wish to output a number of sequences in one of the XML formats, you have to pass a RichSequenceIterator over your collection of sequences in order for the XML format to group them together into a single file with the correct headers:

 <java>
-BufferedReader br = new BufferedReader(new FileReader("myGenbank.gbk")); // an input GenBank file
-Namespace ns = RichObjectFactory.getDefaultNamespace(); // a namespace to override that in the file
-RichSequenceIterator seqs = RichSequence.IOTools.readGenbankDNA(br,ns); // we are reading DNA sequences
-RichSequence.IOTools.writeEMBLxml(System.out, seqs, ns); // write the whole lot in EMBLxml format to standard out
+// an input GenBank file
+BufferedReader br = new BufferedReader(new FileReader("myGenbank.gbk"));
+// a namespace to override that in the file
+Namespace ns = RichObjectFactory.getDefaultNamespace();
+// we are reading DNA sequences
+RichSequenceIterator seqs = RichSequence.IOTools.readGenbankDNA(br,ns);
+// write the whole lot in EMBLxml format to standard out
+RichSequence.IOTools.writeEMBLxml(System.out, seqs, ns);
 </java>

 If you don't know what format your input file is in, but know it could be one of a fixed set of acceptable formats, then you can use BioJavaX's format-guessing routine to attempt to read it:
@@ -512,62 +516,81 @@ To construct a RichStreamReader, you will need five things.
 The RichSequenceBuilderFactory is best set to one of the predefined constants in the RichSequenceBuilderFactory interface. These constants are defined as:

 Table 8.1. RichSequenceBuilderFactory predefined constants.
-RichSequenceBuilderFactor.FACTORY Does not attempt any compression on sequence data.
-RichSequenceBuilderFactor.PACKED Will compress all sequence data using PackedSymbolLists.
-RichSequenceBuilderFactor.THRESHOLD Will compress sequence data using a PackedSymbolList only when the sequence exceeds 5000 bases in length. Otherwise, data is not compressed.
+{|border="1" cellpadding="2"
+!width="200"|Name of constant
+!width="400"|What it will do
+|-
+|RichSequenceBuilderFactor.FACTORY
+|Does not attempt any compression on sequence data.
+|-
+|RichSequenceBuilderFactor.PACKED
+|Will compress all sequence data using PackedSymbolLists.
+|-
+|RichSequenceBuilderFactor.THRESHOLD
+|Will compress sequence data using a PackedSymbolList only when the sequence exceeds 5000 bases in length. Otherwise, data is not compressed.
+|}

 If you set the namespace to null, then the namespace used will depend on the format you are reading. For formats which specify namespaces, the namespace from the file will be used. For formats which do not specify namespaces, the default namespace provided by RichObjectFactory.getDefaultNamespace() will be used.

 The SymbolTokenization should be obtained from the Alphabet that represents the sequence data you are expecting from the file. If you are reading DNA sequences, you should use DNATools.getDNA().getTokenization("token"). Other alphabets with tools classes will have similar methods.

 For an alphabet which does not have a tools class, you can do this:

+<java>
 Alphabet a = ...; // get an alphabet instance from somewhere
 SymbolTokenization st = a.getTokenization("token");
+</java>

 ==== Writing using RichStreamWriter ====

 File output is done using RichStreamWriter. This requires:

-1.
-
-an OutputStream to write sequences to.
-2.
-
-a Namespace to use for the sequences.
-3.
-
-a RichSequenceIterator that provides the sequences to write.
+<ol>
+<li>An OutputStream to write sequences to.</li>
+<li>A Namespace to use for the sequences.</li>
+<li>A RichSequenceIterator that provides the sequences to write.</li>
+</ol>

 The namespace should only be specified when the file format includes namespace information and you wish to override the information associated with the actual sequences. If you do not wish to do this, just set it to null, and the namespace from each individual sequence will be used instead.

 The RichSequenceIterator is an iterator over a set of sequences, exactly the same as the one returned by the RichStreamReader. It is therefore possible to plug a RichStreamReader directly into a RichStreamWriter and convert data from one file format to another with no intermediate steps.

 If you only have one sequence to write, you can wrap it in a temporary RichSequenceIterator by using a call like this:

+<java>
 RichSequence rs = ...; // get sequence from somewhere
 RichSequenceIterator it = new SingleRichSeqIterator(rs); // wrap it in an iterator
+</java>

 ==== Example ====

 The following is an example that will read some DNA sequences from a GenBank file and write them out to standard output (screen) as FASTA using the methods outlined above:

-SymbolTokenization dna = DNATools.getDNA().getTokenization("token"); // sequences will be DNA sequences
-
-RichSequenceFormat genbank = new GenbankFormat(); // read Genbank
-RichSequenceFormat fasta = new FastaFormat(); // write FASTA
-RichSequenceBuilderFactory factory = RichSequenceBuilderFactory.THRESHOLD; // compress only longer sequences
+<java>
+// sequences will be DNA sequences
+SymbolTokenization dna = DNATools.getDNA().getTokenization("token");
+// read Genbank
+RichSequenceFormat genbank = new GenbankFormat();
+// write FASTA
+RichSequenceFormat fasta = new FastaFormat();
+// compress only longer sequences
+RichSequenceBuilderFactory factory = RichSequenceBuilderFactory.THRESHOLD;
+// read/write everything using the 'bloggs' namespace
 Namespace bloggsNS = RichObjectFactory.getObject(
 SimpleNamespace.class,
 new Object[]{"bloggs"}
-); // read/write everything using the 'bloggs' namespace
+);

-BufferedReader input = new BufferedReader(new FileReader("mygenbank.file"));// read seqs from "mygenbank.file"
-OutputStream output = System.out; // write seqs to STDOUT
+// read seqs from "mygenbank.file"
+BufferedReader input = new BufferedReader(new FileReader("mygenbank.file"));
+// write seqs to STDOUT
+OutputStream output = System.out;

 RichStreamReader seqsIn = new RichStreamReader(input,genbank,dna,factory,bloggsNS);
 RichStreamWriter seqsOut = new RichStreamWriter(output,fasta);
-seqsOut.writeStream(seqsIn,bloggsNS); // one-step Genbank to Fasta conversion!
+// one-step Genbank to Fasta conversion!
+seqsOut.writeStream(seqsIn,bloggsNS);
+</java>

 ==== Line widths and eliding information ====
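
Reader's note on Table 8.1 in this file: the THRESHOLD row says sequence data is compressed only when a sequence exceeds 5000 bases. The sketch below illustrates that decision in plain Java under stated assumptions. The 2-bits-per-base packing is an assumed encoding chosen for the example, since the documentation does not describe how PackedSymbolList stores its data, and `ThresholdPackingSketch` and its methods are hypothetical names, not BioJava API.

```java
/**
 * Hypothetical illustration of "compress only above a length threshold".
 * Not BioJava code: the packing scheme and every name here are assumptions.
 */
final class ThresholdPackingSketch {
    /** Cut-off taken from the "exceeds 5000 bases" wording in Table 8.1. */
    static final int THRESHOLD = 5000;

    /** Mirrors the table wording: pack only when the length exceeds the threshold. */
    static boolean shouldPack(int length) {
        return length > THRESHOLD;
    }

    /** Packs A/C/G/T into 2 bits per base, four bases per byte (assumed encoding). */
    static byte[] pack(String dna) {
        byte[] out = new byte[(dna.length() + 3) / 4];
        for (int i = 0; i < dna.length(); i++) {
            int code;
            switch (Character.toUpperCase(dna.charAt(i))) {
                case 'A': code = 0; break;
                case 'C': code = 1; break;
                case 'G': code = 2; break;
                case 'T': code = 3; break;
                default:
                    throw new IllegalArgumentException("unsupported base at index " + i);
            }
            out[i / 4] |= (byte) (code << ((i % 4) * 2));
        }
        return out;
    }

    /** Reads one base back out of the packed form. */
    static char baseAt(byte[] packed, int index) {
        int code = (packed[index / 4] >> ((index % 4) * 2)) & 0x3;
        return "ACGT".charAt(code);
    }

    /** Bytes needed under this sketch: one per base unpacked, one per four bases packed. */
    static int storedSize(int length) {
        return shouldPack(length) ? (length + 3) / 4 : length;
    }

    public static void main(String[] args) {
        System.out.println("5000 bases -> " + storedSize(5000) + " bytes");
        System.out.println("5001 bases -> " + storedSize(5001) + " bytes");
    }
}
```

The point of a threshold is that short sequences skip the packing and unpacking work, while long ones get the space saving.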
