```bash
git clone https://github.com/ChenghaoMou/text-dedup
cd text-dedup
uv sync
```

This repository contains a collection of text deduplication scripts that are ready to use or to modify to fit your needs:
- MinHash + MinHashLSH for near-duplicate detection
- 64- or 128-bit SimHash
- Suffix Array substring exact deduplication
- Bloom Filter exact deduplication
All algorithms are configured through TOML files for easy customization: every deduplication script reads a `config.toml` file in the project root. Edit `config.toml` with your input data and algorithm settings; example configurations for each algorithm follow.
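For reference, a TOML file like the ones below can be inspected with Python's standard `tomllib` module. This is only a quick way to sanity-check the settings a script will read, not the project's own loader:

```python
import tomllib  # Python 3.11+; use the third-party `tomli` package on older versions
from pathlib import Path

# Parse the project-root config and print a couple of the settings.
with Path("config.toml").open("rb") as f:
    config = tomllib.load(f)

print(config["algorithm"]["algorithm_name"])
print(config["input"]["read_arguments"]["path"])
```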
MinHash Near Deduplication
```toml
[input]
input_type = "local_files"
file_type = "parquet"
[input.read_arguments]
path = "data/your_data"
split = "train"
[algorithm]
algorithm_name = "minhash"
text_column = "text"
seed = 42
batch_size = 10000
num_perm = 240
threshold = 0.7
false_positive_weight = 0.5
false_negative_weight = 0.5
hash_bits = 64
ngram_size = 5
check_false_positive = true
[output]
output_dir = "output"
clean_cache = false
save_clusters = true
[debug]
enable_profiling = false
```
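In the MinHash configuration, `num_perm` controls the signature length, `threshold` is the target Jaccard similarity, and the false positive/negative weights trade precision against recall when the LSH band/row layout is chosen. The sketch below shows the standard way such parameters are turned into a (bands, rows) split, as popularized by `datasketch`; it illustrates the math and is not the repository's exact code:

```python
from scipy.integrate import quad


def optimal_bands_rows(threshold: float, num_perm: int,
                       fp_weight: float, fn_weight: float) -> tuple[int, int]:
    """Choose (bands, rows) minimizing the weighted false positive/negative areas."""

    def false_positive_area(b: int, r: int) -> float:
        # Chance that a pair *below* the threshold still collides in some band.
        area, _ = quad(lambda s: 1 - (1 - s ** r) ** b, 0.0, threshold)
        return area

    def false_negative_area(b: int, r: int) -> float:
        # Chance that a pair *above* the threshold never collides in any band.
        area, _ = quad(lambda s: (1 - s ** r) ** b, threshold, 1.0)
        return area

    best, best_error = (1, num_perm), float("inf")
    for b in range(1, num_perm + 1):
        for r in range(1, num_perm // b + 1):  # keep bands * rows <= num_perm
            error = (fp_weight * false_positive_area(b, r)
                     + fn_weight * false_negative_area(b, r))
            if error < best_error:
                best, best_error = (b, r), error
    return best


# With the values from the config above:
print(optimal_bands_rows(threshold=0.7, num_perm=240, fp_weight=0.5, fn_weight=0.5))
```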
SimHash Near Deduplication

```toml
[input]
input_type = "local_files"
file_type = "parquet"
[input.read_arguments]
path = "data/your_data"
split = "train"
[algorithm]
algorithm_name = "simhash"
text_column = "text"
hash_bits = 64
ngram_size = 3
bit_diff = 3
[output]
output_dir = "output"
clean_cache = false
[debug]
enable_profiling = false
```
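With SimHash, each document is reduced to a single `hash_bits`-wide fingerprint built from its n-grams, and two documents are treated as near duplicates when their fingerprints differ in at most `bit_diff` bits. A toy illustration of the idea, not the repository's implementation:

```python
import hashlib


def simhash(text: str, ngram_size: int = 3, hash_bits: int = 64) -> int:
    """Toy SimHash: each n-gram votes +1/-1 per bit position, then threshold at 0."""
    tokens = text.split()
    ngrams = [" ".join(tokens[i:i + ngram_size])
              for i in range(max(1, len(tokens) - ngram_size + 1))]
    votes = [0] * hash_bits
    for gram in ngrams:
        h = int.from_bytes(hashlib.md5(gram.encode()).digest()[:hash_bits // 8], "big")
        for bit in range(hash_bits):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(hash_bits) if votes[bit] > 0)


def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


a = simhash("the quick brown fox jumps over the lazy dog")
b = simhash("the quick brown fox jumped over the lazy dog")
print(hamming_distance(a, b))  # near duplicates have a small bit difference
```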
Bloom Filter Exact Deduplication

```toml
[input]
input_type = "local_files"
file_type = "parquet"
[input.read_arguments]
path = "data/your_data"
split = "train"
[algorithm]
algorithm_name = "bloom_filter"
text_column = "text"
error_rate = 1e-5
expected_elements = 100000
[output]
output_dir = "output"
clean_cache = false
[debug]
enable_profiling = false
```
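For the Bloom filter script, `expected_elements` and `error_rate` determine how large the bit array is and how many hash functions are used; a document counts as an exact duplicate when its hash has already been seen by the filter. A sketch of the standard sizing formulas, which may differ slightly from how the script sizes its own filter:

```python
import math


def bloom_parameters(expected_elements: int, error_rate: float) -> tuple[int, int]:
    """Standard Bloom filter sizing: m bits and k hash functions for n elements."""
    m = math.ceil(-expected_elements * math.log(error_rate) / math.log(2) ** 2)
    k = max(1, round(m / expected_elements * math.log(2)))
    return m, k


m, k = bloom_parameters(expected_elements=100_000, error_rate=1e-5)
print(f"{m} bits (~{m / 8 / 1024:.0f} KiB), {k} hash functions")
```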
Suffix Array Substring Exact Deduplication

```toml
[input]
input_type = "local_files"
file_type = "parquet"
[input.read_arguments]
path = "data/your_data"
split = "train"
[algorithm]
algorithm_name = "suffix_array"
text_column = "text"
google_repo_path = "third_party/deduplicate-text-datasets"
merge_strategy = "longest"
length_threshold = 100
cache_dir = ".cache"
[output]
output_dir = "output"
clean_cache = false
[debug]
enable_profiling = false
```
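The suffix array script removes exact substrings of at least `length_threshold` bytes that occur more than once across the corpus; `google_repo_path` must point to a checkout of Google's deduplicate-text-datasets tooling, which does the heavy lifting. The core idea, sorted suffixes sharing long common prefixes, in a deliberately naive Python sketch (fine for a few kilobytes of text, nothing more):

```python
def repeated_substrings(text: str, length_threshold: int) -> set[str]:
    """Naive suffix-array sketch: sort all suffixes, compare neighbours.

    A common prefix of two adjacent sorted suffixes that reaches the
    threshold occurs at least twice and is a removal candidate.
    """
    order = sorted(range(len(text)), key=lambda i: text[i:])
    found = set()
    for prev, curr in zip(order, order[1:]):
        a, b = text[prev:], text[curr:]
        lcp = 0
        while lcp < min(len(a), len(b)) and a[lcp] == b[lcp]:
            lcp += 1
        if lcp >= length_threshold:
            found.add(a[:lcp])
    return found


corpus = "the same boilerplate paragraph appears twice. " * 2 + "unique tail text"
print(repeated_substrings(corpus, length_threshold=20))
```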
Then run the module for the algorithm you configured:

```bash
# MinHash
python -m text_dedup.minhash
# SimHash
python -m text_dedup.simhash
# Bloom Filter
python -m text_dedup.bloom_filter
# Suffix Array
python -m text_dedup.suffix_array
```

pinecone/core-2020-05-10-deduplication
| Algorithm | Precision (Duplicates) | Recall (Duplicates) | Precision (Non Duplicates) | Recall (Non Duplicates) | Macro F1 score | Accuracy | Time |
|---|---|---|---|---|---|---|---|
| MinHash | 0.9587 | 0.9416 | 0.9450 | 0.9611 | 0.9518 | 0.9277 | 11.09s |
| SimHash | 0.9038 | 0.7323 | 0.7993 | 0.9318 | 0.8515 | 0.8375 | 626.11s |
| Exact Title Matching [1] | 0.830 | 0.50 | 0.709 | 0.992 | 0.757 | 0.746 | - |
| Simhash Matching [1] | 0.697 | 0.247 | 0.598 | 0.985 | 0.631 | 0.616 | - |
| Document Vector Similarity [1] | 0.912 | 0.779 | 0.861 | 0.986 | 0.885 | 0.883 | - |
| Hybrid Method [1] | 0.908 | 0.828 | 0.899 | 0.979 | 0.904 | 0.903 | - |
| LaBSE [2] | 0.937 | 0.923 | 0.930 | 0.943 | 0.933 | 0.919 | - |
| Multilingual USE [2] | 0.917 | 0.907 | 0.918 | 0.927 | 0.917 | 0.909 | - |
| Multilingual E5-Base [2] | 0.931 | 0.908 | 0.919 | 0.939 | 0.924 | 0.920 | - |
| MinHash + LSH [2] | 0.929 | 0.902 | 0.915 | 0.938 | 0.921 | 0.918 | - |
| RETSim Partial-Dup [2] | 0.945 | 0.941 | 0.945 | 0.949 | 0.945 | 0.928 | - |
| RETSim Near-Dup [2] | 0.928 | 0.937 | 0.942 | 0.934 | 0.935 | 0.926 | - |
NEWS-COPY

Adjusted Rand Index (ARI) on the NEWS-COPY dataset (a short illustration of the metric follows the table):
| Model/Algorithm | ARI | Time |
|---|---|---|
| MinHash | 0.7293 | 3.01s |
| SimHash | 0.6463 | 140.03s |
| n-gram [3] | 0.440 | - |
| SimHash [2] | 0.695 | - |
| MinHash [3] | 0.737 | - |
| MinHash [2] | 0.783 | - |
| Multilingual USE [2] | 0.730 | - |
| Multilingual E5-Base [2] | 0.742 | - |
| S-BERT [3] | 0.700 | - |
| RETSim Partial-Dup [2] | 0.831 | - |
| RETSim Near-Dup [2] | 0.704 | - |
| Re-ranking [3] | 0.937 | - |
| Bi-encoder [3] | 0.915 | - |
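ARI scores the agreement between predicted duplicate clusters and gold clusters, where 1.0 is a perfect match and values near 0 mean chance-level agreement. A quick sketch with scikit-learn and made-up cluster labels:

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical gold and predicted duplicate-cluster ids for six documents.
gold_clusters = [0, 0, 1, 1, 2, 2]
predicted_clusters = [0, 0, 1, 2, 2, 2]
print(adjusted_rand_score(gold_clusters, predicted_clusters))  # ~0.44
```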
You can reproduce the benchmark results using the provided benchmark suite.
```bash
# Run all benchmarks (both datasets, all algorithms)
just benchmark-all

# Run only CORE dataset benchmarks
just benchmark-core

# Run only NEWS-COPY dataset benchmarks
just benchmark-news

# Run a specific algorithm on a specific dataset
just benchmark-core-minhash
just benchmark-core-simhash
just benchmark-news-minhash
just benchmark-news-simhash
```

Benchmark configuration files are located in `configs/`:
- `benchmark_core_minhash.toml` - MinHash on the CORE dataset
- `benchmark_core_simhash.toml` - SimHash on the CORE dataset
- `benchmark_news_minhash.toml` - MinHash on the NEWS-COPY dataset
- `benchmark_news_simhash.toml` - SimHash on the NEWS-COPY dataset
To customize benchmark parameters, edit the config files and adjust hyperparameters such as `num_perm`, `threshold`, `ngram_size`, or `bit_diff`; a small illustrative example follows.
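For instance, a stricter MinHash benchmark might change only the `[algorithm]` table of its config; the values below are illustrative, not tuned recommendations:

```toml
[algorithm]
algorithm_name = "minhash"
text_column = "text"
num_perm = 128    # fewer permutations: faster signatures, slightly lower recall
threshold = 0.8   # higher Jaccard cut-off: fewer pairs flagged as duplicates
ngram_size = 5
```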
Generally, you can cite this repository as:
```bibtex
@software{chenghao_mou_2023_8364980,
author = {Chenghao Mou and
Chris Ha and
Kenneth Enevoldsen and
Peiyuan Liu},
title = {ChenghaoMou/text-dedup: Reference Snapshot},
month = sep,
year = 2023,
publisher = {Zenodo},
version = {2023.09.20},
doi = {10.5281/zenodo.8364980},
url = {https://doi.org/10.5281/zenodo.8364980}
}
```

This repository is inspired by the following projects and is heavily influenced by lessons learned from my own participation in BigScience (Apache 2.0) and BigCode (Apache 2.0). There is a blog post about the journey. Feedback is welcome!
- Datasketch (MIT)
- simhash-py and simhash-cpp (MIT)
- Deduplicating Training Data Makes Language Models Better (Apache 2.0)
- Gaoya (MIT)
