A Python package for cross-lingual tokenizer alignment using bilingual dictionaries and recurrent alignment algorithms.
Dict Trans Tokenizer enables you to:
- Train BPE tokenizers from bilingual dictionary data
- Create aligned corpora from bilingual dictionaries
- Perform recurrent token alignment between source and target tokenizers
- Remap pre-trained models to use aligned tokenizers
This is particularly useful for adapting pre-trained language models to new languages or domains using bilingual dictionaries.
```bash
uv add "dict-trans-tokenizer @ git+https://github.com/sj-h4/dict_trans_tokenizer.git"
```

Requirements:

- `fast_align`: Required for token alignment
> [!NOTE]
> Please use https://github.com/FremyCompany/fast_align.git, which is the fork used in TransTokenizer.
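fast_align is a CMake project; a typical build of that fork looks like the following sketch (the binary path is illustrative, not something this package prescribes):

```bash
git clone https://github.com/FremyCompany/fast_align.git
cd fast_align
mkdir build && cd build
cmake ..
make
# pass the resulting binary (e.g. build/fast_align) as fast_align_path
```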
See also `example/basic_usage.py` for a complete example.
```python
from dict_trans_tokenizer import (
    run_recurrent_alignment,
    train_bpe_tokenizer,
    load_bilingual_dict,
)

# Train tokenizers and run alignment
run_recurrent_alignment(
    source_model="roberta-base",  # or path to a source model
    target_tokenizer="path/to/target/tokenizer",
    dictionary="path/to/bilingual_dict.json",
    corpus_path="output/corpus.moses",
    fast_align_path="fast_align",
    alignment_mode="token",  # or "word"
    mapping_mode="replace",
    min_count=10,
    output_dir="output/",
    logging_level="INFO",
)
```

The example below walks through the same workflow step by step:
```python
from transformers import AutoTokenizer

from dict_trans_tokenizer import (
    train_bpe_tokenizer,
    create_aligned_corpus,
    run_recurrent_alignment,
    load_bilingual_dict,
    AlignmentMode,
)

# 1. Load bilingual dictionary
dictionary = load_bilingual_dict("path/to/dict.json")

# 2. Train target language tokenizer
target_words = [entry.entry.lower().strip() for entry in dictionary]
train_bpe_tokenizer(
    train_words=target_words,
    output_dir="target_tokenizer/",
    vocab_size=1000,
    show_progress=True,
)

# 3. Create aligned corpus
# (the tokenizers must be loaded first; shown here with
# transformers.AutoTokenizer, adjust to your setup)
source_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
target_tokenizer = AutoTokenizer.from_pretrained("target_tokenizer/tokenizer-fast")
corpus_path = create_aligned_corpus(
    source_tokenizer=source_tokenizer,
    target_tokenizer=target_tokenizer,
    dictionary=dictionary,
    output_path="aligned_corpus.moses",
    alignment_mode=AlignmentMode.TOKEN,
)

# 4. Run recurrent alignment
# (corpus_path here names the file run_recurrent_alignment will write;
# see the parameter reference below)
run_recurrent_alignment(
    source_model="roberta-base",
    target_tokenizer="target_tokenizer/tokenizer-fast",
    dictionary="path/to/dict.json",
    corpus_path="corpus.moses",
    fast_align_path="fast_align",
    alignment_mode="token",
    mapping_mode="replace",
    min_count=5,
    output_dir="aligned_model/",
)
```

The bilingual dictionary should be a JSON file with the following structure:
```json
[
  {
    "entry": "hello",
    "definitions": ["greeting", "hi", "welcome"]
  },
  {
    "entry": "world",
    "definitions": ["earth", "planet", "globe"]
  }
]
```

Each entry contains:

- `entry`: The target language word
- `definitions`: List of source language translations/definitions
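As a quick sanity check, you can write a minimal dictionary and load it back with `load_bilingual_dict`. The `.entry` attribute on the loaded entries appears in the usage example above; the `.definitions` attribute is an assumption made here for illustration:

```python
import json

from dict_trans_tokenizer import load_bilingual_dict

# Two entries in the documented format
entries = [
    {"entry": "hello", "definitions": ["greeting", "hi", "welcome"]},
    {"entry": "world", "definitions": ["earth", "planet", "globe"]},
]
with open("toy_dict.json", "w", encoding="utf-8") as f:
    json.dump(entries, f, ensure_ascii=False, indent=2)

dictionary = load_bilingual_dict("toy_dict.json")
for item in dictionary:
    # .entry is shown in the usage example; .definitions is assumed here
    print(item.entry, item.definitions)
```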
`run_recurrent_alignment`: the main function; it performs the complete recurrent alignment process.

Parameters:

- `source_model` (str): Path to source model or Hugging Face model name
- `target_tokenizer` (str): Path to target tokenizer
- `dictionary` (str): Path to bilingual dictionary JSON file
- `corpus_path` (str): Path where the aligned corpus will be saved
- `fast_align_path` (str): Path to the fast_align binary
- `alignment_mode` (str): "token" or "word" alignment mode
- `mapping_mode` (str): "replace" (currently the only supported mode)
- `min_count` (int): Minimum count threshold for token consideration
- `output_dir` (str): Directory to save results
- `logging_level` (str): Logging level (default: "INFO")
- `seed` (int): Random seed (default: 42)
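After a run completes, the results are saved under `output_dir`. Assuming the remapped model and tokenizer are written in standard Hugging Face format (an assumption; check the files your run actually produces), they can be loaded back with `transformers`:

```python
from transformers import AutoModel, AutoTokenizer

# "aligned_model/" matches the output_dir used above; adjust if your run
# writes to a different location or layout (an assumption, not a
# documented guarantee of the package)
model = AutoModel.from_pretrained("aligned_model/")
tokenizer = AutoTokenizer.from_pretrained("aligned_model/")

inputs = tokenizer("a sentence in the target language", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```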
`train_bpe_tokenizer`: train a BPE tokenizer from a word list.

Parameters:

- `train_words` (list[str]): List of words to train on
- `output_dir` (str): Directory to save the tokenizer
- `vocab_size` (int): Vocabulary size
- `show_progress` (bool): Whether to show training progress
- `special_tokens` (list[str]): Special tokens (optional)
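For example, a throwaway tokenizer can be trained from a handful of words; the tiny `vocab_size` is just for illustration (real runs use values like the 1000 shown earlier):

```python
from dict_trans_tokenizer import train_bpe_tokenizer

# A toy word list; real runs use the dictionary entries as shown above
words = ["hello", "world", "greeting", "planet"]

train_bpe_tokenizer(
    train_words=words,
    output_dir="toy_tokenizer/",
    vocab_size=100,
    show_progress=False,
)
```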
`create_aligned_corpus`: create an aligned corpus from a bilingual dictionary.

Parameters:

- `source_tokenizer`: Source language tokenizer
- `target_tokenizer`: Target language tokenizer
- `dictionary`: List of bilingual dictionary entries
- `output_path` (str): Path to save the aligned corpus
- `alignment_mode`: `AlignmentMode.TOKEN` or `AlignmentMode.WORD`
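The resulting `.moses` file is in the `|||`-delimited parallel-text format that fast_align consumes. As a sketch, each line pairs an entry with one of its definitions; the pairing direction and tokenization are assumptions here (they depend on `alignment_mode` and the package's internals):

```text
hello ||| greeting
hello ||| hi
world ||| earth
```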
For development:

```bash
# install git hooks
uv run lefthook install
# run tests
uv run pytest
# lint and auto-fix
uv run ruff check --fix
```

If you use this package, please cite:

```bibtex
@inproceedings{sakajo-etal-2025-dictionaries,
    title = "Dictionaries to the Rescue: Cross-Lingual Vocabulary Transfer for Low-Resource Languages Using Bilingual Dictionaries",
    author = "Sakajo, Haruki and
      Ide, Yusuke and
      Vasselli, Justin and
      Sakai, Yusuke and
      Tian, Yingtao and
      Kamigaito, Hidetaka and
      Watanabe, Taro",
    editor = "Che, Wanxiang and
      Nabende, Joyce and
      Shutova, Ekaterina and
      Pilehvar, Mohammad Taher",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-acl.1333/",
    doi = "10.18653/v1/2025.findings-acl.1333",
    pages = "25963--25976",
    ISBN = "979-8-89176-256-5",
    abstract = "Cross-lingual vocabulary transfer plays a promising role in adapting pre-trained language models to new languages, including low-resource languages. Existing approaches that utilize monolingual or parallel corpora face challenges when applied to languages with limited resources. In this work, we propose a simple yet effective vocabulary transfer method that utilizes bilingual dictionaries, which are available for many languages, thanks to descriptive linguists. Our proposed method leverages a property of BPE tokenizers where removing a subword from the vocabulary causes a fallback to shorter subwords. The embeddings of target subwords are estimated iteratively by progressively removing them from the tokenizer. The experimental results show that our approach outperforms existing methods for low-resource languages, demonstrating the effectiveness of a dictionary-based approach for cross-lingual vocabulary transfer."
}
```