DataFog/datafog-core

DataFog

Fast PII detection and anonymization, built in Rust with Python and WASM bindings.

DataFog detects structured PII (emails, phone numbers, SSNs, credit cards, IP addresses, dates of birth, ZIP codes) using compiled regex patterns, and optionally detects soft PII (names, organizations, addresses) using a GLiNER ONNX model for named-entity recognition (NER). The regex engine runs in microseconds per kilobyte and needs no dependencies beyond the regex and serde crates.
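To illustrate what compiled-pattern detection looks like, here is a minimal sketch using only Python's standard re module. The patterns, the PATTERNS table, and the detect_regex helper are simplified assumptions for illustration, not DataFog's actual patterns or API. The returned dictionaries loosely mirror the shape of the output shown in the Python example under Install (Python).

```python
import re

# Hypothetical, simplified patterns; DataFog's real patterns are more thorough.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def detect_regex(text):
    """Return one dict per match, sorted by start offset."""
    found = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append(
                {"type": label, "value": m.group(), "start": m.start(), "end": m.end()}
            )
    return sorted(found, key=lambda e: e["start"])


entities = detect_regex("Contact john@example.com or call 555-123-4567")
```

Compiling each pattern once and reusing it across inputs is what keeps this style of detection cheap per call.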

Install (Python)

DataFog is a Rust library with Python bindings built by maturin. You need a Rust toolchain to install from source.

Prerequisites: Rust (stable) and Python 3.9+.

# From PyPI (once published)
pip install datafog

# From source
git clone https://github.com/DataFog/datafog.git
cd datafog
pip install .

That's it: pip install . compiles the Rust code and installs DataFog as a Python package. No separate Rust build step is needed, because maturin handles it behind the scenes.

from datafog import DataFog, detect, anonymize_text

# Detect PII
entities = detect("Contact john@example.com or call 555-123-4567")
# [{"type": "EMAIL", "value": "john@example.com", "start": 8, "end": 24, "score": 1.0},
#  {"type": "PHONE", "value": "555-123-4567", "start": 33, "end": 45, "score": 1.0}]

# Anonymize
clean = anonymize_text("SSN is 123-45-6789", method="redact")
# "SSN is [REDACTED]"

# Class API with batch support
fog = DataFog()
results = fog.detect_batch(["john@test.com", "555-123-4567", "no pii here"])
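The start and end fields in the example output above appear to be offsets into the input string, with end exclusive. The sketch below uses values copied from those comments rather than a live call, and shows how the offsets can be used to slice the original text and group results. For this ASCII example the offsets line up with Python string indices; whether they are byte or character offsets for non-ASCII text is not stated here.

```python
text = "Contact john@example.com or call 555-123-4567"

# Values copied from the example output above, not produced by a live call.
entities = [
    {"type": "EMAIL", "value": "john@example.com", "start": 8, "end": 24, "score": 1.0},
    {"type": "PHONE", "value": "555-123-4567", "start": 33, "end": 45, "score": 1.0},
]

# Slicing with [start:end] recovers each detected value.
slices = [text[e["start"]:e["end"]] for e in entities]

# Group values by entity type for downstream handling.
by_type = {}
for e in entities:
    by_type.setdefault(e["type"], []).append(e["value"])
```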

Install (Rust)

Add to your Cargo.toml:

[dependencies]
datafog-core = "0.1"

use datafog_core::DataFog;
use datafog_core::anonymizer::AnonymizeMethod;

let fog = DataFog::new();
let result = fog.detect("Contact john@example.com");
println!("{:?}", result.spans);

let anon = fog.anonymize("SSN is 123-45-6789", AnonymizeMethod::Redact);
assert_eq!(anon.text, "SSN is [REDACTED]");

NER engine (optional)

The NER engine uses a GLiNER ONNX model to detect soft PII that regex can't catch: person names, organizations, locations, addresses, and more. It runs both regex and NER, then merges the results.
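How the two result sets are merged is not described in detail here. One plausible policy, shown purely as an assumption, is to keep the higher-scoring span when two spans overlap and to keep every non-overlapping span. The merge_spans helper and the sample scores below are hypothetical and are not taken from DataFog.

```python
def merge_spans(regex_spans, ner_spans):
    """Combine two span lists, resolving overlaps by score (hypothetical policy)."""
    # Consider higher scores first; on ties, prefer earlier spans.
    candidates = sorted(regex_spans + ner_spans, key=lambda s: (-s["score"], s["start"]))
    kept = []
    for span in candidates:
        overlaps = any(
            span["start"] < k["end"] and k["start"] < span["end"] for k in kept
        )
        if not overlaps:
            kept.append(span)
    return sorted(kept, key=lambda s: s["start"])


# Illustrative spans for the text "John Smith at js@example.com".
regex_spans = [{"type": "EMAIL", "start": 14, "end": 28, "score": 1.0}]
ner_spans = [
    {"type": "PERSON", "start": 0, "end": 10, "score": 0.92},
    {"type": "URL", "start": 17, "end": 28, "score": 0.40},  # overlaps the email
]
merged = merge_spans(regex_spans, ner_spans)
```

Under this policy the low-confidence URL span is dropped because it overlaps the email match, while the PERSON span is kept alongside it.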

Building with NER pulls in gline-rs + ONNX Runtime (~50 MB binary size increase).

# Install from source with NER + model auto-download
pip install . --config-settings="build-args=--features full"

from datafog import DataFog, has_ner_support

if has_ner_support():
    fog = DataFog(engine="ner")  # downloads ~50 MB model on first use
    entities = fog.detect("John Smith works at Acme Corp in Paris")
    # detects PERSON, ORGANIZATION, LOCATION in addition to regex entities

You can also point to a local model directory:

fog = DataFog(engine="ner", model="/path/to/model/dir")

The model directory must contain tokenizer.json and model.onnx.
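Before passing a local directory, it can help to confirm that both files are present. The missing_model_files helper below is a hypothetical convenience written with the standard library, not part of DataFog.

```python
from pathlib import Path

# File names taken from the requirement stated above.
REQUIRED_FILES = ("tokenizer.json", "model.onnx")


def missing_model_files(model_dir):
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

An empty list from missing_model_files("/path/to/model/dir") means the directory has both expected files.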

Entity types

| Engine | Entity types |
| --- | --- |
| Regex | EMAIL, PHONE, SSN, CREDIT_CARD, IP_ADDRESS, DOB, ZIP |
| NER | PERSON, ORGANIZATION, LOCATION, ADDRESS, MEDICAL_RECORD_NUMBER, ACCOUNT_NUMBER, LICENSE_NUMBER, PASSPORT_NUMBER, URL |

When both engines are active, all entity types are available.

Anonymization methods

from datafog import anonymize_text

anonymize_text("Email: john@test.com", method="redact")    # "Email: [REDACTED]"
anonymize_text("Email: john@test.com", method="replace")   # "Email: [EMAIL_1]"
anonymize_text("Email: john@test.com", method="hash")      # "Email: a1b2c3d4..."

Available methods: redact, replace, hash (SHA-256), hash_md5, hash_sha3.
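The sketch below shows roughly what the three basic strategies do to a list of detected spans, using only the standard library. It is an illustration under assumptions, not DataFog's implementation: the apply_method helper is hypothetical, and whether the library emits a full or truncated digest for hash is not specified here.

```python
import hashlib


def apply_method(text, spans, method):
    """Rewrite each span in text according to method (hypothetical helper)."""
    counters = {}
    out = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s["start"]):
        out.append(text[cursor:span["start"]])
        value = text[span["start"]:span["end"]]
        if method == "redact":
            out.append("[REDACTED]")
        elif method == "replace":
            # Number placeholders per entity type: [EMAIL_1], [EMAIL_2], ...
            counters[span["type"]] = counters.get(span["type"], 0) + 1
            out.append(f"[{span['type']}_{counters[span['type']]}]")
        elif method == "hash":
            # Full SHA-256 hex digest of the matched value.
            out.append(hashlib.sha256(value.encode("utf-8")).hexdigest())
        else:
            raise ValueError(f"unsupported method: {method}")
        cursor = span["end"]
    out.append(text[cursor:])
    return "".join(out)


text = "Email: john@test.com"
spans = [{"type": "EMAIL", "start": 7, "end": 20}]
redacted = apply_method(text, spans, "redact")
replaced = apply_method(text, spans, "replace")
hashed = apply_method(text, spans, "hash")
```

Processing spans in start order and copying the untouched text between them keeps the offsets valid even though the replacements change the string length.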

WASM

The WASM target provides regex-only detection for browser and edge environments.

# Requires wasm-pack: https://rustwasm.github.io/wasm-pack/installer/
wasm-pack build crates/datafog-wasm --target web

A demo page is included at crates/datafog-wasm/demo/index.html.

Project structure

datafog/
├── crates/
│   ├── datafog-core/        # Pure Rust library (regex, NER, anonymizer, cascade)
│   ├── datafog-python/      # PyO3 bindings
│   └── datafog-wasm/        # wasm-bindgen bindings (regex-only)
├── python/datafog/          # Python package source + type stubs
├── tests/                   # Python integration tests
├── pyproject.toml           # maturin build config
└── Cargo.toml               # Workspace root

Feature flags (Rust)

| Flag | What it adds | Dependencies |
| --- | --- | --- |
| default | Regex-only detection | None beyond regex, serde |
| ner | GLiNER NER engine | gline-rs, ort |
| model-download | Auto-download models from HuggingFace | reqwest, dirs |
| parallel | Rayon-based batch parallelism | rayon |
| ner-cuda | CUDA GPU acceleration for NER | (implies ner) |
| ner-coreml | Apple CoreML acceleration for NER | (implies ner) |

Development

git clone https://github.com/DataFog/datafog.git
cd datafog

# Rust
cargo test --workspace
cargo clippy --workspace -- -D warnings
cargo fmt --all -- --check
cargo bench --package datafog-core

# Python
python3 -m venv .venv
source .venv/bin/activate
pip install ".[dev]"            # installs datafog + maturin + pytest
pytest -v

License

Apache 2.0
