Commit c8067a2

Gwaldon authored and andreasprlic committed
more on tokenization
1 parent 55e37c7 commit c8067a2

2 files changed: 34 additions and 1 deletion

_wikis/BioJava:Cookbook:Sequence.md

Lines changed: 19 additions & 0 deletions
@@ -99,3 +99,22 @@ public class SymbolListToString {
  }
 
 } </java>
+
+The above example uses the process of 'tokenization' to create the
+String, in this case hidden in the SeqString method. Different types of
+tokenization can be used to control the output String.
+
+<java>
+
+Alphabet alph; // An alphabet SymbolList sym; //A SymbolList
+
+SymbolTokenization tok= alph.getTokenization("token"); String output =
+tok.tokenizeSymbolList(sym)
+
+</java>
+
+Use "token" or "default" to represent nucleotides and amino acids in
+lower case single characters; use "alternate" to represent DNA in single
+capital letters and amino acids from the PROTEIN\_TERM alphabet in
+character triplets (e.g. Arg) (see
+[AlternateTokenization](http://www.biojava.org/docs/api1.8/org/biojava/bio/seq/io/AlternateTokenization.html)).

_wikis/BioJava:Cookbook:Sequence.mediawiki

Lines changed: 15 additions & 1 deletion
@@ -76,4 +76,18 @@ public class SymbolListToString {
 String s = sl.seqString();
 }
 }
-</java>
+</java>
+
+The above example uses the process of 'tokenization' to create the String, in this case hidden in the SeqString method. Different types of tokenization can be used to control the output String.
+
+<java>
+
+Alphabet alph; // An alphabet
+SymbolList sym; //A SymbolList
+
+SymbolTokenization tok= alph.getTokenization("token");
+String output = tok.tokenizeSymbolList(sym)
+
+</java>
+
+Use "token" or "default" to represent nucleotides and amino acids in lower case single characters; use "alternate" to represent DNA in single capital letters and amino acids from the PROTEIN_TERM alphabet in character triplets (e.g. Arg) (see [http://www.biojava.org/docs/api1.8/org/biojava/bio/seq/io/AlternateTokenization.html AlternateTokenization]).
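
The idea behind the added wiki text, that one list of symbols can be rendered as different Strings depending on the tokenization chosen, can be illustrated without BioJava. The class below is a minimal sketch under stated assumptions, not BioJava code: `TokenizationSketch`, `SINGLE_LOWER`, `TRIPLET` and `tokenize` are hypothetical names invented for this illustration, and the lookup tables cover only a handful of symbols. It loosely mirrors the roles of `getTokenization` and `tokenizeSymbolList` in the snippet from the diff.

```java
import java.util.List;
import java.util.Map;

/** Toy illustration of tokenization; hypothetical, not part of BioJava. */
public class TokenizationSketch {

    // Hypothetical scheme: one lower-case character per symbol.
    public static final Map<String, String> SINGLE_LOWER = Map.of(
            "adenine", "a", "cytosine", "c", "guanine", "g", "thymine", "t");

    // Hypothetical scheme: a three-letter token per amino acid.
    public static final Map<String, String> TRIPLET = Map.of(
            "arginine", "Arg", "glycine", "Gly", "alanine", "Ala");

    /** Concatenates the token that the given scheme assigns to each symbol. */
    public static String tokenize(List<String> symbols, Map<String, String> scheme) {
        StringBuilder out = new StringBuilder();
        for (String symbol : symbols) {
            String token = scheme.get(symbol);
            if (token == null) {
                // The symbol is not part of the alphabet this scheme covers.
                throw new IllegalArgumentException("No token for symbol: " + symbol);
            }
            out.append(token);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Single-character rendering of a short DNA symbol list.
        System.out.println(tokenize(List.of("guanine", "adenine", "thymine"), SINGLE_LOWER));
        // Triplet rendering of a short amino-acid symbol list.
        System.out.println(tokenize(List.of("arginine", "glycine"), TRIPLET));
    }
}
```

In this sketch the choice of map plays the part of the tokenization name ("token", "alternate"), while the symbol list itself is unchanged; only the String built from it differs.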
