Dict Trans Tokenizer

A Python package for cross-lingual tokenizer alignment using bilingual dictionaries and recurrent alignment algorithms.

This repository contains the code for "Dictionaries to the Rescue: Cross-Lingual Vocabulary Transfer for Low-Resource Languages Using Bilingual Dictionaries" (Sakajo et al., Findings of ACL 2025).

Overview

Dict Trans Tokenizer enables you to:

  • Train BPE tokenizers from bilingual dictionary data
  • Create aligned corpora from bilingual dictionaries
  • Perform recurrent token alignment between source and target tokenizers
  • Remap pre-trained models to use aligned tokenizers

This is particularly useful for adapting pre-trained language models to new languages or domains using bilingual dictionaries.

Installation

uv add "dict-trans-tokenizer @ git+https://github.com/sj-h4/dict_trans_tokenizer.git"
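
If you are not using uv, a plain pip install from the repository should also work (assuming a standard pyproject-based build):

pip install git+https://github.com/sj-h4/dict_trans_tokenizer.git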

Dependencies

  • fast_align: Required for token alignment. fast_align is an external C++ word aligner (https://github.com/clab/fast_align); build it separately and pass the path to the compiled binary as fast_align_path (see the build sketch below).
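
A typical build follows the fast_align README (assuming cmake and a C++ toolchain are installed):

git clone https://github.com/clab/fast_align.git
cd fast_align && mkdir build && cd build
cmake .. && make
# pass the path to build/fast_align as fast_align_path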

Quick Start

See also example/basic_usage.py for a complete example.

Basic Usage

from dict_trans_tokenizer import run_recurrent_alignment

# Train tokenizers and run the full alignment pipeline in one call
run_recurrent_alignment(
    source_model="roberta-base",  # or path to a local source model
    target_tokenizer="path/to/target/tokenizer",
    dictionary="path/to/bilingual_dict.json",
    corpus_path="output/corpus.moses",
    fast_align_path="fast_align",  # path to the fast_align binary
    alignment_mode="token",  # or "word"
    mapping_mode="replace",
    min_count=10,
    output_dir="output/",
    logging_level="INFO",
)

Step-by-Step Workflow

from transformers import AutoTokenizer
from dict_trans_tokenizer import (
    train_bpe_tokenizer,
    create_aligned_corpus,
    run_recurrent_alignment,
    load_bilingual_dict,
    AlignmentMode,
)

# 1. Load the bilingual dictionary
dictionary = load_bilingual_dict("path/to/dict.json")

# 2. Train a target-language tokenizer on the dictionary entries
target_words = [entry.entry.lower().strip() for entry in dictionary]
train_bpe_tokenizer(
    train_words=target_words,
    output_dir="target_tokenizer/",
    vocab_size=1000,
    show_progress=True,
)

# 3. Create an aligned corpus from the dictionary
#    (tokenizers loaded here as HuggingFace tokenizers; adjust if the
#    package expects a different type)
source_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
target_tokenizer = AutoTokenizer.from_pretrained("target_tokenizer/tokenizer-fast")
corpus_path = create_aligned_corpus(
    source_tokenizer=source_tokenizer,
    target_tokenizer=target_tokenizer,
    dictionary=dictionary,
    output_path="aligned_corpus.moses",
    alignment_mode=AlignmentMode.TOKEN,
)

# 4. Run recurrent alignment and save the remapped model
run_recurrent_alignment(
    source_model="roberta-base",
    target_tokenizer="target_tokenizer/tokenizer-fast",
    dictionary="path/to/dict.json",
    corpus_path=str(corpus_path),  # reuse the corpus created in step 3
    fast_align_path="fast_align",
    alignment_mode="token",
    mapping_mode="replace",
    min_count=5,
    output_dir="aligned_model/",
)
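
Once alignment completes, output_dir should hold the remapped model. Assuming it is written in standard HuggingFace format (an assumption consistent with the transformers-style inputs above, not something this README states), loading it would look like:

from transformers import AutoModelForMaskedLM, AutoTokenizer

# Hypothetical output layout; adjust to what run_recurrent_alignment actually writes
model = AutoModelForMaskedLM.from_pretrained("aligned_model/")
tokenizer = AutoTokenizer.from_pretrained("aligned_model/")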

Dictionary Format

The bilingual dictionary should be a JSON file with the following structure:

[
  {
    "entry": "hello",
    "definitions": ["greeting", "hi", "welcome"]
  },
  {
    "entry": "world",
    "definitions": ["earth", "planet", "globe"]
  }
]

Each entry contains:

  • entry: The target language word
  • definitions: List of source language translations/definitions
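
As a quick check, you can write a toy dictionary in this format and load it back. Attribute access on the loaded entries follows the entry.entry usage from the workflow above; the .definitions attribute is assumed to mirror the JSON key:

import json
from dict_trans_tokenizer import load_bilingual_dict

# Write a two-entry dictionary in the expected format
entries = [
    {"entry": "hello", "definitions": ["greeting", "hi", "welcome"]},
    {"entry": "world", "definitions": ["earth", "planet", "globe"]},
]
with open("toy_dict.json", "w", encoding="utf-8") as f:
    json.dump(entries, f, ensure_ascii=False, indent=2)

dictionary = load_bilingual_dict("toy_dict.json")
for e in dictionary:
    print(e.entry, e.definitions)  # .definitions assumed to mirror the JSON key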

API Reference

Core Functions

run_recurrent_alignment()

Main function that performs the complete recurrent alignment process.

Parameters:

  • source_model (str): Path to source model or HuggingFace model name
  • target_tokenizer (str): Path to target tokenizer
  • dictionary (str): Path to bilingual dictionary JSON file
  • corpus_path (str): Path where aligned corpus will be saved
  • fast_align_path (str): Path to fast_align binary
  • alignment_mode (str): "token" or "word" alignment mode
  • mapping_mode (str): "replace" (currently only supported mode)
  • min_count (int): Minimum count threshold for token consideration
  • output_dir (str): Directory to save results
  • logging_level (str): Logging level (default: "INFO")
  • seed (int): Random seed (default: 42)

train_bpe_tokenizer()

Train a BPE tokenizer from word list.

Parameters:

  • train_words (list[str]): List of words to train on
  • output_dir (str): Directory to save tokenizer
  • vocab_size (int): Vocabulary size
  • show_progress (bool): Whether to show training progress
  • special_tokens (list[str]): Special tokens (optional)
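
For instance, a standalone call with explicit special tokens might look like the following; the special-token list here is an illustrative assumption, not a documented default:

from dict_trans_tokenizer import train_bpe_tokenizer

train_bpe_tokenizer(
    train_words=["hello", "world", "greeting"],
    output_dir="toy_tokenizer/",
    vocab_size=500,
    show_progress=False,
    # RoBERTa-style special tokens; assumed for illustration only
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)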

create_aligned_corpus()

Create aligned corpus from bilingual dictionary.

Parameters:

  • source_tokenizer: Source language tokenizer
  • target_tokenizer: Target language tokenizer
  • dictionary: List of bilingual dictionary entries
  • output_path (str): Path to save aligned corpus
  • alignment_mode: AlignmentMode.TOKEN or AlignmentMode.WORD

Development

Setup

uv run lefthook install

Running Tests

uv run pytest

Linting

uv run ruff check --fix

Citation

@inproceedings{sakajo-etal-2025-dictionaries,
    title = "Dictionaries to the Rescue: Cross-Lingual Vocabulary Transfer for Low-Resource Languages Using Bilingual Dictionaries",
    author = "Sakajo, Haruki  and
      Ide, Yusuke  and
      Vasselli, Justin  and
      Sakai, Yusuke  and
      Tian, Yingtao  and
      Kamigaito, Hidetaka  and
      Watanabe, Taro",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-acl.1333/",
    doi = "10.18653/v1/2025.findings-acl.1333",
    pages = "25963--25976",
    ISBN = "979-8-89176-256-5",
    abstract = "Cross-lingual vocabulary transfer plays a promising role in adapting pre-trained language models to new languages, including low-resource languages. Existing approaches that utilize monolingual or parallel corpora face challenges when applied to languages with limited resources. In this work, we propose a simple yet effective vocabulary transfer method that utilizes bilingual dictionaries, which are available for many languages, thanks to descriptive linguists. Our proposed method leverages a property of BPE tokenizers where removing a subword from the vocabulary causes a fallback to shorter subwords. The embeddings of target subwords are estimated iteratively by progressively removing them from the tokenizer. The experimental results show that our approach outperforms existing methods for low-resource languages, demonstrating the effectiveness of a dictionary-based approach for cross-lingual vocabulary transfer."
}
