```bash
git clone https://github.com/ChenghaoMou/text-dedup
cd text-dedup
uv sync
```

This repository contains a collection of text deduplication scripts that are ready to use or to modify to fit your needs:
- MinHash + MinHashLSH for near-duplicate detection
- 64- or 128-bit SimHash
- Suffix Array substring exact deduplication
- Bloom Filter exact deduplication
All algorithms are configured through TOML files for easy customization: every deduplication script reads a `config.toml` file in the project root. Edit `config.toml` with your input data and algorithm settings; example configurations for each algorithm follow.
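For reference, a TOML file like the ones below can be inspected with Python's standard `tomllib` module. This is only a quick way to sanity-check the settings a script will read, not the project's own loader:

```python
import tomllib  # Python 3.11+; use the third-party `tomli` package on older versions
from pathlib import Path

# Parse the project-root config and print a couple of the settings.
with Path("config.toml").open("rb") as f:
    config = tomllib.load(f)

print(config["algorithm"]["algorithm_name"])
print(config["input"]["read_arguments"]["path"])
```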
MinHash Near Deduplication
```toml
[input]
input_type = "local_files"
file_type = "parquet"
[input.read_arguments]
path = "data/your_data"
split = "train"
[algorithm]
algorithm_name = "minhash"
text_column = "text"
seed = 42
batch_size = 10000
num_perm = 240
threshold = 0.7
false_positive_weight = 0.5
false_negative_weight = 0.5
hash_bits = 64
ngram_size = 5
check_false_positive = true
[output]
output_dir = "output"
clean_cache = false
save_clusters = true
[debug]
enable_profiling = false
```
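In the MinHash configuration, `num_perm` controls the signature length, `threshold` is the target Jaccard similarity, and the false positive/negative weights trade precision against recall when the LSH band/row layout is chosen. The sketch below shows the standard way such parameters are turned into a (bands, rows) split, as popularized by `datasketch`; it illustrates the math and is not the repository's exact code:

```python
from scipy.integrate import quad


def optimal_bands_rows(threshold: float, num_perm: int,
                       fp_weight: float, fn_weight: float) -> tuple[int, int]:
    """Choose (bands, rows) minimizing the weighted false positive/negative areas."""

    def false_positive_area(b: int, r: int) -> float:
        # Chance that a pair *below* the threshold still collides in some band.
        area, _ = quad(lambda s: 1 - (1 - s ** r) ** b, 0.0, threshold)
        return area

    def false_negative_area(b: int, r: int) -> float:
        # Chance that a pair *above* the threshold never collides in any band.
        area, _ = quad(lambda s: (1 - s ** r) ** b, threshold, 1.0)
        return area

    best, best_error = (1, num_perm), float("inf")
    for b in range(1, num_perm + 1):
        for r in range(1, num_perm // b + 1):  # keep bands * rows <= num_perm
            error = (fp_weight * false_positive_area(b, r)
                     + fn_weight * false_negative_area(b, r))
            if error < best_error:
                best, best_error = (b, r), error
    return best


# With the values from the config above:
print(optimal_bands_rows(threshold=0.7, num_perm=240, fp_weight=0.5, fn_weight=0.5))
```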
SimHash Near Deduplication

```toml
[input]
input_type = "local_files"
file_type = "parquet"
[input.read_arguments]
path = "data/your_data"
split = "train"
[algorithm]
algorithm_name = "simhash"
text_column = "text"
hash_bits = 64
ngram_size = 3
bit_diff = 3
[output]
output_dir = "output"
clean_cache = false
[debug]
enable_profiling = false
```
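With SimHash, each document is reduced to a single `hash_bits`-wide fingerprint built from its n-grams, and two documents are treated as near duplicates when their fingerprints differ in at most `bit_diff` bits. A toy illustration of the idea, not the repository's implementation:

```python
import hashlib


def simhash(text: str, ngram_size: int = 3, hash_bits: int = 64) -> int:
    """Toy SimHash: each n-gram votes +1/-1 per bit position, then threshold at 0."""
    tokens = text.split()
    ngrams = [" ".join(tokens[i:i + ngram_size])
              for i in range(max(1, len(tokens) - ngram_size + 1))]
    votes = [0] * hash_bits
    for gram in ngrams:
        h = int.from_bytes(hashlib.md5(gram.encode()).digest()[:hash_bits // 8], "big")
        for bit in range(hash_bits):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(hash_bits) if votes[bit] > 0)


def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


a = simhash("the quick brown fox jumps over the lazy dog")
b = simhash("the quick brown fox jumped over the lazy dog")
print(hamming_distance(a, b))  # near duplicates have a small bit difference
```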
Bloom Filter Exact Deduplication

```toml
[input]
input_type = "local_files"
file_type = "parquet"
[input.read_arguments]
path = "data/your_data"
split = "train"
[algorithm]
algorithm_name = "bloom_filter"
text_column = "text"
error_rate = 1e-5
expected_elements = 100000
[output]
output_dir = "output"
clean_cache = false
[debug]
enable_profiling = false
```
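For the Bloom filter script, `expected_elements` and `error_rate` determine how large the bit array is and how many hash functions are used; a document counts as an exact duplicate when its hash has already been seen by the filter. A sketch of the standard sizing formulas, which may differ slightly from how the script sizes its own filter:

```python
import math


def bloom_parameters(expected_elements: int, error_rate: float) -> tuple[int, int]:
    """Standard Bloom filter sizing: m bits and k hash functions for n elements."""
    m = math.ceil(-expected_elements * math.log(error_rate) / math.log(2) ** 2)
    k = max(1, round(m / expected_elements * math.log(2)))
    return m, k


m, k = bloom_parameters(expected_elements=100_000, error_rate=1e-5)
print(f"{m} bits (~{m / 8 / 1024:.0f} KiB), {k} hash functions")
```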
Suffix Array Substring Exact Deduplication

```toml
[input]
input_type = "local_files"
file_type = "parquet"
[input.read_arguments]
path = "data/your_data"
split = "train"
[algorithm]
algorithm_name = "suffix_array"
text_column = "text"
google_repo_path = "third_party/deduplicate-text-datasets"
merge_strategy = "longest"
length_threshold = 100
cache_dir = ".cache"
[output]
output_dir = "output"
clean_cache = false
[debug]
enable_profiling = false
```
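The suffix array script removes exact substrings of at least `length_threshold` bytes that occur more than once across the corpus; `google_repo_path` must point to a checkout of Google's deduplicate-text-datasets tooling, which does the heavy lifting. The core idea, sorted suffixes sharing long common prefixes, in a deliberately naive Python sketch (fine for a few kilobytes of text, nothing more):

```python
def repeated_substrings(text: str, length_threshold: int) -> set[str]:
    """Naive suffix-array sketch: sort all suffixes, compare neighbours.

    A common prefix of two adjacent sorted suffixes that reaches the
    threshold occurs at least twice and is a removal candidate.
    """
    order = sorted(range(len(text)), key=lambda i: text[i:])
    found = set()
    for prev, curr in zip(order, order[1:]):
        a, b = text[prev:], text[curr:]
        lcp = 0
        while lcp < min(len(a), len(b)) and a[lcp] == b[lcp]:
            lcp += 1
        if lcp >= length_threshold:
            found.add(a[:lcp])
    return found


corpus = "the same boilerplate paragraph appears twice. " * 2 + "unique tail text"
print(repeated_substrings(corpus, length_threshold=20))
```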
Then run the module for the algorithm you configured:

```bash
# MinHash
python -m text_dedup.minhash
# SimHash
python -m text_dedup.simhash
# Bloom Filter
python -m text_dedup.bloom_filter
# Suffix Array
python -m text_dedup.suffix_array
```

pinecone/core-2020-05-10-deduplication
| Algorithm | Precision (Duplicates) | Recall (Duplicates) | Precision (Non Duplicates) | Recall (Non Duplicates) | Macro F1 score | Accuracy | Time |
|---|---|---|---|---|---|---|---|
| MinHash | 0.9587 | 0.9416 | 0.9450 | 0.9611 | 0.9518 | 0.9277 | 11.09s |
| SimHash | 0.9038 | 0.7323 | 0.7993 | 0.9318 | 0.8515 | 0.8375 | 626.11s |
| Exact Title Matching [1] | 0.830 | 0.50 | 0.709 | 0.992 | 0.757 | 0.746 | - |
| Simhash Matching [1] | 0.697 | 0.247 | 0.598 | 0.985 | 0.631 | 0.616 | - |
| Document Vector Similarity [1] | 0.912 | 0.779 | 0.861 | 0.986 | 0.885 | 0.883 | - |
| Hybrid Method [1] | 0.908 | 0.828 | 0.899 | 0.979 | 0.904 | 0.903 | - |
| LaBSE [2] | 0.937 | 0.923 | 0.930 | 0.943 | 0.933 | 0.919 | - |
| Multilingual USE [2] | 0.917 | 0.907 | 0.918 | 0.927 | 0.917 | 0.909 | - |
| Multilingual E5-Base [2] | 0.931 | 0.908 | 0.919 | 0.939 | 0.924 | 0.920 | - |
| MinHash + LSH [2] | 0.929 | 0.902 | 0.915 | 0.938 | 0.921 | 0.918 | - |
| RETSim Partial-Dup [2] | 0.945 | 0.941 | 0.945 | 0.949 | 0.945 | 0.928 | - |
| RETSim Near-Dup [2] | 0.928 | 0.937 | 0.942 | 0.934 | 0.935 | 0.926 | - |
NEWS-COPY

Adjusted Rand Index (ARI) on the NEWS-COPY dataset (a short illustration of the metric follows the table):
| Model/Algorithm | ARI | Time |
|---|---|---|
| MinHash | 0.7293 | 3.01s |
| SimHash | 0.6463 | 140.03s |
| n-gram [3] | 0.440 | - |
| SimHash [2] | 0.695 | - |
| MinHash [3] | 0.737 | - |
| MinHash [2] | 0.783 | - |
| Multilingual USE [2] | 0.730 | - |
| Multilingual E5-Base [2] | 0.742 | - |
| S-BERT [3] | 0.700 | - |
| RETSim Partial-Dup [2] | 0.831 | - |
| RETSim Near-Dup [2] | 0.704 | - |
| Re-ranking [3] | 0.937 | - |
| Bi-encoder [3] | 0.915 | - |
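ARI scores the agreement between predicted duplicate clusters and gold clusters, where 1.0 is a perfect match and values near 0 mean chance-level agreement. A quick sketch with scikit-learn and made-up cluster labels:

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical gold and predicted duplicate-cluster ids for six documents.
gold_clusters = [0, 0, 1, 1, 2, 2]
predicted_clusters = [0, 0, 1, 2, 2, 2]
print(adjusted_rand_score(gold_clusters, predicted_clusters))  # ~0.44
```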
You can reproduce the benchmark results using the provided benchmark suite.
```bash
# Run all benchmarks (both datasets, all algorithms)
just benchmark-all

# Run only CORE dataset benchmarks
just benchmark-core

# Run only NEWS-COPY dataset benchmarks
just benchmark-news

# Run a specific algorithm on a specific dataset
just benchmark-core-minhash
just benchmark-core-simhash
just benchmark-news-minhash
just benchmark-news-simhash
```

Benchmark configuration files are located in `configs/`:
- `benchmark_core_minhash.toml` - MinHash on the CORE dataset
- `benchmark_core_simhash.toml` - SimHash on the CORE dataset
- `benchmark_news_minhash.toml` - MinHash on the NEWS-COPY dataset
- `benchmark_news_simhash.toml` - SimHash on the NEWS-COPY dataset
To customize benchmark parameters, edit the config files and adjust hyperparameters such as `num_perm`, `threshold`, `ngram_size`, or `bit_diff`; a small illustrative example follows.
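For instance, a stricter MinHash benchmark might change only the `[algorithm]` table of its config; the values below are illustrative, not tuned recommendations:

```toml
[algorithm]
algorithm_name = "minhash"
text_column = "text"
num_perm = 128    # fewer permutations: faster signatures, slightly lower recall
threshold = 0.8   # higher Jaccard cut-off: fewer pairs flagged as duplicates
ngram_size = 5
```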
Generally, you can cite this repository as:
```bibtex
@software{chenghao_mou_2023_8364980,
author = {Chenghao Mou and
Chris Ha and
Kenneth Enevoldsen and
Peiyuan Liu},
title = {ChenghaoMou/text-dedup: Reference Snapshot},
month = sep,
year = 2023,
publisher = {Zenodo},
version = {2023.09.20},
doi = {10.5281/zenodo.8364980},
url = {https://doi.org/10.5281/zenodo.8364980}
}
```

This repository is inspired by the following projects and is heavily influenced by lessons learned from my own participation in BigScience (Apache 2.0) and BigCode (Apache 2.0). There is a blog post about the journey. Feedback is welcome!
- Datasketch (MIT)
- simhash-py and simhash-cpp (MIT)
- Deduplicating Training Data Makes Language Models Better (Apache 2.0)
- Gaoya (MIT)
