Dict Trans Tokenizer

A Python package for cross-lingual tokenizer alignment using bilingual dictionaries and recurrent alignment algorithms.

This repository contains the code for "Dictionaries to the Rescue: Cross-Lingual Vocabulary Transfer for Low-Resource Languages Using Bilingual Dictionaries" (Sakajo et al., Findings of ACL 2025).

Overview

Dict Trans Tokenizer enables you to:

  • Train BPE tokenizers from bilingual dictionary data
  • Create aligned corpora from bilingual dictionaries
  • Perform recurrent token alignment between source and target tokenizers
  • Remap pre-trained models to use aligned tokenizers

This is particularly useful for adapting pre-trained language models to new languages or domains using bilingual dictionaries.

Installation

uv add "dict-trans-tokenizer @ git+https://github.com/sj-h4/dict_trans_tokenizer.git"
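
If you are not using uv, a plain pip install from the repository should also work (assuming a standard pyproject-based build):

pip install git+https://github.com/sj-h4/dict_trans_tokenizer.git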

Dependencies

  • fast_align: Required for token alignment. fast_align is an external C++ word aligner (https://github.com/clab/fast_align); build it separately and pass the path to the compiled binary as fast_align_path (see the build sketch below).
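
A typical build follows the fast_align README (assuming cmake and a C++ toolchain are installed):

git clone https://github.com/clab/fast_align.git
cd fast_align && mkdir build && cd build
cmake .. && make
# pass the path to build/fast_align as fast_align_path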

Quick Start

See also example/basic_usage.py for a complete example.

Basic Usage

from dict_trans_tokenizer import run_recurrent_alignment

# Train tokenizers and run the full alignment pipeline in one call
run_recurrent_alignment(
    source_model="roberta-base",  # or path to a local source model
    target_tokenizer="path/to/target/tokenizer",
    dictionary="path/to/bilingual_dict.json",
    corpus_path="output/corpus.moses",
    fast_align_path="fast_align",  # path to the fast_align binary
    alignment_mode="token",  # or "word"
    mapping_mode="replace",
    min_count=10,
    output_dir="output/",
    logging_level="INFO",
)

Step-by-Step Workflow

from transformers import AutoTokenizer
from dict_trans_tokenizer import (
    train_bpe_tokenizer,
    create_aligned_corpus,
    run_recurrent_alignment,
    load_bilingual_dict,
    AlignmentMode,
)

# 1. Load the bilingual dictionary
dictionary = load_bilingual_dict("path/to/dict.json")

# 2. Train a target-language tokenizer on the dictionary entries
target_words = [entry.entry.lower().strip() for entry in dictionary]
train_bpe_tokenizer(
    train_words=target_words,
    output_dir="target_tokenizer/",
    vocab_size=1000,
    show_progress=True,
)

# 3. Create an aligned corpus from the dictionary
#    (tokenizers loaded here as HuggingFace tokenizers; adjust if the
#    package expects a different type)
source_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
target_tokenizer = AutoTokenizer.from_pretrained("target_tokenizer/tokenizer-fast")
corpus_path = create_aligned_corpus(
    source_tokenizer=source_tokenizer,
    target_tokenizer=target_tokenizer,
    dictionary=dictionary,
    output_path="aligned_corpus.moses",
    alignment_mode=AlignmentMode.TOKEN,
)

# 4. Run recurrent alignment and save the remapped model
run_recurrent_alignment(
    source_model="roberta-base",
    target_tokenizer="target_tokenizer/tokenizer-fast",
    dictionary="path/to/dict.json",
    corpus_path=str(corpus_path),  # reuse the corpus created in step 3
    fast_align_path="fast_align",
    alignment_mode="token",
    mapping_mode="replace",
    min_count=5,
    output_dir="aligned_model/",
)
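
Once alignment completes, output_dir should hold the remapped model. Assuming it is written in standard HuggingFace format (an assumption consistent with the transformers-style inputs above, not something this README states), loading it would look like:

from transformers import AutoModelForMaskedLM, AutoTokenizer

# Hypothetical output layout; adjust to what run_recurrent_alignment actually writes
model = AutoModelForMaskedLM.from_pretrained("aligned_model/")
tokenizer = AutoTokenizer.from_pretrained("aligned_model/")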

Dictionary Format

The bilingual dictionary should be a JSON file with the following structure:

[
  {
    "entry": "hello",
    "definitions": ["greeting", "hi", "welcome"]
  },
  {
    "entry": "world",
    "definitions": ["earth", "planet", "globe"]
  }
]

Each entry contains:

  • entry: The target language word
  • definitions: List of source language translations/definitions
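
As a quick check, you can write a toy dictionary in this format and load it back. Attribute access on the loaded entries follows the entry.entry usage from the workflow above; the .definitions attribute is assumed to mirror the JSON key:

import json
from dict_trans_tokenizer import load_bilingual_dict

# Write a two-entry dictionary in the expected format
entries = [
    {"entry": "hello", "definitions": ["greeting", "hi", "welcome"]},
    {"entry": "world", "definitions": ["earth", "planet", "globe"]},
]
with open("toy_dict.json", "w", encoding="utf-8") as f:
    json.dump(entries, f, ensure_ascii=False, indent=2)

dictionary = load_bilingual_dict("toy_dict.json")
for e in dictionary:
    print(e.entry, e.definitions)  # .definitions assumed to mirror the JSON key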

API Reference

Core Functions

run_recurrent_alignment()

Main function that performs the complete recurrent alignment process.

Parameters:

  • source_model (str): Path to source model or HuggingFace model name
  • target_tokenizer (str): Path to target tokenizer
  • dictionary (str): Path to bilingual dictionary JSON file
  • corpus_path (str): Path where aligned corpus will be saved
  • fast_align_path (str): Path to fast_align binary
  • alignment_mode (str): "token" or "word" alignment mode
  • mapping_mode (str): "replace" (currently only supported mode)
  • min_count (int): Minimum count threshold for token consideration
  • output_dir (str): Directory to save results
  • logging_level (str): Logging level (default: "INFO")
  • seed (int): Random seed (default: 42)

train_bpe_tokenizer()

Train a BPE tokenizer from word list.

Parameters:

  • train_words (list[str]): List of words to train on
  • output_dir (str): Directory to save tokenizer
  • vocab_size (int): Vocabulary size
  • show_progress (bool): Whether to show training progress
  • special_tokens (list[str]): Special tokens (optional)
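
For instance, a standalone call with explicit special tokens might look like the following; the special-token list here is an illustrative assumption, not a documented default:

from dict_trans_tokenizer import train_bpe_tokenizer

train_bpe_tokenizer(
    train_words=["hello", "world", "greeting"],
    output_dir="toy_tokenizer/",
    vocab_size=500,
    show_progress=False,
    # RoBERTa-style special tokens; assumed for illustration only
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)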

create_aligned_corpus()

Create aligned corpus from bilingual dictionary.

Parameters:

  • source_tokenizer: Source language tokenizer
  • target_tokenizer: Target language tokenizer
  • dictionary: List of bilingual dictionary entries
  • output_path (str): Path to save aligned corpus
  • alignment_mode: AlignmentMode.TOKEN or AlignmentMode.WORD

Development

Setup

uv run lefthook install

Running Tests

uv run pytest

Linting

uv run ruff check --fix

Citation

@inproceedings{sakajo-etal-2025-dictionaries,
    title = "Dictionaries to the Rescue: Cross-Lingual Vocabulary Transfer for Low-Resource Languages Using Bilingual Dictionaries",
    author = "Sakajo, Haruki  and
      Ide, Yusuke  and
      Vasselli, Justin  and
      Sakai, Yusuke  and
      Tian, Yingtao  and
      Kamigaito, Hidetaka  and
      Watanabe, Taro",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-acl.1333/",
    doi = "10.18653/v1/2025.findings-acl.1333",
    pages = "25963--25976",
    ISBN = "979-8-89176-256-5",
    abstract = "Cross-lingual vocabulary transfer plays a promising role in adapting pre-trained language models to new languages, including low-resource languages. Existing approaches that utilize monolingual or parallel corpora face challenges when applied to languages with limited resources. In this work, we propose a simple yet effective vocabulary transfer method that utilizes bilingual dictionaries, which are available for many languages, thanks to descriptive linguists. Our proposed method leverages a property of BPE tokenizers where removing a subword from the vocabulary causes a fallback to shorter subwords. The embeddings of target subwords are estimated iteratively by progressively removing them from the tokenizer. The experimental results show that our approach outperforms existing methods for low-resource languages, demonstrating the effectiveness of a dictionary-based approach for cross-lingual vocabulary transfer."
}
