Commit 18e9a19

Update README.md
added a pytorch example
1 parent 3a8d44a commit 18e9a19

File tree

1 file changed (+43, -5 lines)


README.md

Lines changed: 43 additions & 5 deletions
@@ -141,8 +141,46 @@ Model Handle: 2854016629088
 0 0 0 0 0 0 0 0]
 Model Freed
 ```
+### 4. Python example to call an XLM-R tokenizer and prepare a PyTorch batch
 
-### 4. Python example, doing tokenization and hyphenation of a text
+```python
+import os
+import torch
+from torch.nn.utils.rnn import pad_sequence
+from blingfire import load_model, text_to_ids, free_model
+
+# Load the XLM-RoBERTa tokenizer model provided by BlingFire
+model_path = os.path.join("./data", "xlm_roberta_base.bin")
+tokenizer_model = load_model(model_path)
+
+if __name__ == "__main__":
+    # Sample input texts
+    input_texts = [
+        "+1 (678) 274-9543 US https://lookup.robokiller.com/p/678-274-9543 (678) 274-9543 - roboKiller lookup",
+        "+1 (678) 274-9543 US https://lookup.robokiller.com/p/678-274-9543 (678) 274-9543 - super robo killer lookup"
+    ]
+
+    # Tokenize each text and wrap it in the XLM-R special tokens
+    token_ids = [
+        torch.cat([
+            torch.tensor([0], dtype=torch.long),  # 0 - <s> token
+            torch.tensor(text_to_ids(tokenizer_model, t, 256, 3, True).astype('int64')),  # 3 - <unk> id, True - no padding
+            torch.tensor([2], dtype=torch.long)  # 2 - </s> token
+        ])
+        for t in input_texts
+    ]
+
+    # Use torch.nn.utils.rnn.pad_sequence to pad the sequences to equal length
+    token_ids = pad_sequence(token_ids, batch_first=True, padding_value=1)  # 1 - <pad>
+
+    print("\nAfter Padding:")
+    print(token_ids)
+
+    # Free the model after use
+    free_model(tokenizer_model)
+```
+
+### 5. Python example, doing tokenization and hyphenation of a text
 
 Since hyphenation APIs take one word at a time, with a limit of 300 Unicode characters, we need to break the text into words first and then run hyphenation for each token.

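A minimal sketch of this word-by-word flow with the Python bindings, assuming the `syllab.bin` model sits in `./data` like the other models in this README; the optional separator code-point argument to `word_hyphenation_with_model` (used in the last print to pick a custom hyphen character, as noted in the next hunk) is an assumption based on the package source:

```python
import os
from blingfire import load_model, text_to_words, word_hyphenation_with_model, free_model

# Assumed model location, mirroring the README's other examples
h = load_model(os.path.join("./data", "syllab.bin"))

text = "Like Curiosity, the Perseverance rover was built by engineers and scientists."

# Break the text into words first, then hyphenate one word at a time
words = text_to_words(text).split(" ")
print(" ".join(word_hyphenation_with_model(h, w) for w in words))

# The hyphen is configurable: pass any Unicode code point as the separator
# (assumed optional argument; the default is '-')
print(" ".join(word_hyphenation_with_model(h, w, ord("=")) for w in words))

free_model(h)
```

With the default separator, the output should resemble the `Li-ke Cu-rios-i-ty ...` line quoted in the hunk header below.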
@@ -172,7 +210,7 @@ Li-ke Cu-rios-i-ty , the Per-se-ve-rance ro-ver was built by en-gi-neers and sci
 Note, you can specify any other Unicode character as the hyphen that the API inserts into the output string.
 
 
-### 5. C# example, calling XLM Roberta tokenizer and getting ids and offsets
+### 6. C# example, calling XLM Roberta tokenizer and getting ids and offsets
 
 Note, everything that is supported in Python is supported by the C# API as well. C# can also use parallel computation: since all models and functions are stateless, you can share the same model across threads without locks. Let's load the XLM Roberta model and tokenize a string, getting each token's ID and offsets in the original text.

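The parallelism note above is made for the C# API, but it rests on the models being stateless. A hedged Python sketch of the same idea, sharing one loaded handle across a thread pool without locks (assuming the thread safety claimed for C# carries over to the Python bindings; model path as in the earlier example):

```python
import os
from concurrent.futures import ThreadPoolExecutor
from blingfire import load_model, text_to_ids, free_model

# One shared, stateless model handle - no per-thread copies, no locks
# (assumption: the thread safety the README claims for C# holds here too)
h = load_model(os.path.join("./data", "xlm_roberta_base.bin"))

texts = ["first input text", "second input text", "third input text"]

with ThreadPoolExecutor(max_workers=4) as pool:
    # text_to_ids(handle, text, max_len, unk_id, no_padding), as in the example above
    all_ids = list(pool.map(lambda t: text_to_ids(h, t, 128, 3, True), texts))

for t, ids in zip(texts, all_ids):
    print(t, "->", ids)

free_model(h)
```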
@@ -231,7 +269,7 @@ tokens from offsets: ['Auto'/4396 'pho'/22014 'bia'/9166 ','/4 ' also'/2843 ' ca
 ```
 See this project for more C# examples: https://github.com/microsoft/BlingFire/tree/master/nuget/test .
 
-### 6. JavaScript example, fetching and loading model file, using the model to compute ids
+### 7. JavaScript example, fetching and loading a model file, using the model to compute ids
 
 The goal of the JavaScript integration is the ability to run the code in a browser with ML frameworks like TensorFlow.js and the FastText WebAssembly build.

@@ -277,11 +315,11 @@ $(document).ready(function() {
 Full example code can be found [here](https://github.com/microsoft/BlingFire/blob/master/wasm/example.html). Details of the API are described in the [wasm](https://github.com/microsoft/BlingFire/tree/master/wasm) folder.
 
 
-### 7. Example of making a difference with using Bling Fire default tokenizer in a classification task
+### 8. Example of making a difference by using the Bling Fire default tokenizer in a classification task
 
 [This notebook](/doc/Bling%20Fire%20Tokenizer%20Demo.ipynb) demonstrates how the Bling Fire tokenizer helps in a Stack Overflow posts classification problem.
 
-### 8. Example of reaching 99% accuracy for language detection
+### 9. Example of reaching 99% accuracy for language detection
 
 [This document](https://github.com/microsoft/BlingFire/wiki/How-to-train-better-language-detection-with-Bling-Fire-and-FastText) describes how to improve the [FastText](https://fasttext.cc/) language detection model with Bling Fire and achieve 99% accuracy in the language detection task for 365 languages.
