A highly efficient, customizable library for training Byte Pair Encoding (BPE) and TikToken (GPT-4 style) tokenizers from scratch. Designed for ease of use, it leverages Hugging Face's streaming datasets to train on massive corpora (like FineWeb-Edu) with minimal memory overhead. Train a production-ready tokenizer in only 4 lines of code.
- Dual Modes: Support for both standard BPE (BERT/RoBERTa style) and TikToken (GPT-4/cl100k style) training.
- Memory Efficient: Uses streaming datasets and batch iterators to train on huge datasets without loading them entirely into RAM.
- Fully Customizable: Easily configure vocabulary size, special tokens, and batch sizes.
- Hugging Face Compatible: Exports tokenizers directly to the `tokenizer.json` format, compatible with `transformers`.
You can install this package directly from GitHub using pip:
```bash
pip install git+https://github.com/OE-Void/Tokenizer-from_scratch.git
```

Train a production-ready tokenizer with best-practice defaults (FineWeb-Edu dataset, 64k vocab, 10k batch size) in just 4 lines of code:

```python
from BPE.tok import Trainer
from BPE.data import batch_iterator
# Automatically streams FineWeb-Edu and trains a 64k vocab BPE tokenizer
Trainer(batch_iterator())
```

The standard BPE implementation includes automatic `[CLS]` and `[SEP]` token injection (Template Processing).
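This injection is handled by a post-processing template. Below is a minimal sketch of how such a template is typically wired up with the Hugging Face `tokenizers` library; the exact template and token IDs inside this package may differ, and the IDs shown are placeholders:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.processors import TemplateProcessing

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# After training, the real IDs come from tokenizer.token_to_id("[CLS]"), etc.
cls_id, sep_id = 2, 3  # placeholder IDs for this sketch
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",             # single sequence: [CLS] ... [SEP]
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",  # sentence pairs get segment ID 1
    special_tokens=[("[CLS]", cls_id), ("[SEP]", sep_id)],
)
```

A fully explicit training run looks like this: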
```python
from BPE.tok import Trainer
from BPE.data import batch_iterator
# Initialize the data stream (defaults to FineWeb-Edu 10BT sample)
# Batches are yielded to keep memory usage low
batch_iter = batch_iterator(BATCH_SIZE=10_000)
# Start Training
Trainer(
batch_iter,
vocab_size=64_000,
special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
save_to="my_bpe_tokenizer"
)
```

This module replicates the GPT-4 (cl100k_base) pre-tokenization logic using specific regex patterns and ByteLevel encoding.
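For reference, here is a sketch of how the published cl100k_base split pattern and a byte-level step can be expressed with the `tokenizers` library; the package's internal wiring may differ:

```python
from tokenizers import Tokenizer, Regex, pre_tokenizers, decoders
from tokenizers.models import BPE

# Split pattern published for cl100k_base in tiktoken
CL100K_PATTERN = (
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    # Split on the GPT-4 pattern, keeping each match as its own piece
    pre_tokenizers.Split(Regex(CL100K_PATTERN), behavior="isolated"),
    # Map raw bytes to printable units so arbitrary input stays tokenizable
    pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False),
])
tokenizer.decoder = decoders.ByteLevel()
```

Training itself mirrors the BPE workflow: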
```python
from TikToken.tok import Trainer
from TikToken.data import batch_iterator
# Initialize data stream
batch_iter = batch_iterator(BATCH_SIZE=10_000)
# Train using TikToken logic
Trainer(
batch_iter,
vocab_size=64_000,
special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
save_to="my_tiktoken_tokenizer"
)
```

The `batch_iterator` function (exposed by both `BPE.data` and `TikToken.data`) handles streaming data from Hugging Face. Its parameters:
- `Dataset` (default: `"HuggingFaceFW/fineweb-edu"`): The HF dataset ID.
- `BATCH_SIZE` (default: `10_000`): Number of text samples per batch. 10k is optimized for ~24 GB RAM environments.
- `split`: Dataset split to use (e.g., `"train"`).
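For intuition, a streaming batch iterator of this shape can be sketched as follows; the FineWeb-Edu config name `sample-10BT` and the `text` field are assumptions based on that dataset's published layout:

```python
from datasets import load_dataset

def batch_iterator(Dataset="HuggingFaceFW/fineweb-edu", split="train", BATCH_SIZE=10_000):
    # Streaming mode fetches shards lazily instead of loading the corpus into RAM
    ds = load_dataset(Dataset, name="sample-10BT", split=split, streaming=True)
    batch = []
    for example in ds:
        batch.append(example["text"])  # FineWeb-Edu keeps raw text under "text"
        if len(batch) == BATCH_SIZE:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch
```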
`Trainer` is the main training entry point. Its parameters:
- `vocab_size`: Target vocabulary size (default: `64_000`).
- `special_tokens`: List of special tokens. The tokenizer automatically detects the mapping (e.g., `<cls>` or `[CLS]`).
- `save_to`: Directory/filename prefix for the saved tokenizer.
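For orientation, this is roughly what such a trainer does under the hood with the `tokenizers` API; names and details inside this package may differ:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

def train(batch_iter, vocab_size=64_000, special_tokens=None, save_to="tokenizer"):
    special_tokens = special_tokens or ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=special_tokens)
    # Consumes the streaming batches; the full corpus never sits in memory
    tokenizer.train_from_iterator(batch_iter, trainer=trainer)
    tokenizer.save(f"{save_to}.json")
```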
Once trained, your tokenizer is saved as a JSON file (and a folder for Hugging Face compatibility). You can load it easily with the `transformers` library:

```python
from transformers import PreTrainedTokenizerFast
# Load the tokenizer (replace 'tokenizer_tiktoken' with your save_to name)
tokenizer = PreTrainedTokenizerFast.from_pretrained("tokenizer_tiktoken")
# Encode text
text = "Hello, world! This is a test."
encoded = tokenizer(text)
print(f"Tokens: {encoded.tokens()}")
print(f"IDs: {encoded['input_ids']}")
# Decode back to string
decoded = tokenizer.decode(encoded['input_ids'])
print(f"Decoded: {decoded}")- BPE/: Contains the logic for standard Byte-Pair Encoding training.
- `TikToken/`: Implements the specific regex-split BPE used by models like GPT-4.
- `main.py`: A simple entry point script for testing.
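If you prefer to skip the `transformers` wrapper, the raw `tokenizer.json` artifact can also be loaded directly with the `tokenizers` library; the file name below assumes the default `save_to` prefix from the BPE example above:

```python
from tokenizers import Tokenizer

# Load the raw JSON artifact produced by training
tok = Tokenizer.from_file("my_bpe_tokenizer.json")
print(tok.encode("Hello, world!").tokens)
```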