DataFog 4.5 keeps the core package focused on lightweight text PII screening. The default path is:

    pip install datafog

    import datafog

    result = datafog.redact("Email jane@example.com", engine="regex")
    print(result.redacted_text)

OCR and Spark are supported optional surfaces. They are useful for image and distributed workflows, but they should not be treated as required for the core install, package import, text scanning, text redaction, or guardrail helpers.
Use OCR when you need to extract text from images before running PII detection.
Install local OCR support:

    pip install "datafog[ocr]"

Use URL-based image downloads:

    pip install "datafog[web,ocr]"

Use Donut OCR:

    pip install "datafog[nlp-advanced,ocr]"

Notes:

- Tesseract usage requires the system `tesseract` binary in addition to the Python extra.
- Python 3.13 is validated for the OCR install profile, Pillow, pytesseract, and system Tesseract smoke checks.
- Donut OCR requires a model that is already available locally. DataFog should not download models implicitly during normal runtime usage.
- OCR is not deprecated. A broader OCR API and packaging overhaul is deferred beyond the 4.5 focus release.
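
Because the Python extra does not install the system binary, it can help to check both before starting an OCR job. The following is a minimal sketch using only the standard library. The function name `missing_ocr_prerequisites` is hypothetical and not part of DataFog, and the module names `PIL` and `pytesseract` are assumed to be the import names behind the Pillow and pytesseract dependencies mentioned above.

```python
import importlib.util
import shutil


def missing_ocr_prerequisites(binary="tesseract", modules=("PIL", "pytesseract")):
    """Return a list of human-readable problems; an empty list means ready.

    Hypothetical helper, not a DataFog API.
    """
    problems = []
    # The Python extra does not install the system binary, so look for it on PATH.
    if shutil.which(binary) is None:
        problems.append(f"system binary not found on PATH: {binary}")
    # find_spec returns None for a missing top-level module without importing it.
    for name in modules:
        if importlib.util.find_spec(name) is None:
            problems.append(f"Python module not installed: {name}")
    return problems


if __name__ == "__main__":
    for problem in missing_ocr_prerequisites():
        print(problem)
```

An empty result only means the binary and modules were found. It does not confirm that a Donut model is present locally.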
Example local OCR flow:

    import asyncio

    from datafog.services.image_service import ImageService


    async def main():
        service = ImageService(use_tesseract=True, use_donut=False)
        extracted = await service.ocr_extract(["./invoice.png"])
        print(extracted)


    asyncio.run(main())

Use Spark when you need distributed processing around DataFog PII detection.
Install Spark support:

    pip install "datafog[distributed]"

Use Spark PII UDF helpers:

    pip install "datafog[distributed,nlp]"

Notes:

- `SparkService` requires PySpark and a Java runtime.
- Spark PII UDF helpers also require spaCy and an installed spaCy model.
- Spark is not deprecated. A broader Spark overhaul is deferred beyond the 4.5 focus release.
Example local Spark flow:

    from datafog.services.spark_service import SparkService

    service = SparkService(master="local[1]")
    rows = service.read_json("./records.json")
    print(rows)

The repository includes tests that block optional dependency imports while
importing datafog and running the default text helpers. These checks verify
that OCR, Spark, NLP, model-loading, and web dependencies are not required for
the core path.
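
One way such an isolation check could be written is sketched below. This is an illustration of the general technique, not the repository's actual test code. The helpers `blocked_imports` and `imports_cleanly` are hypothetical names, and the list of blocked module names in the usage note is an assumption. The sketch relies on documented Python behavior: a `None` entry in `sys.modules` makes importing that name raise `ImportError`.

```python
import contextlib
import importlib
import sys


@contextlib.contextmanager
def blocked_imports(*names):
    """Temporarily make `import <name>` raise ImportError for each name."""
    saved = {}
    for name in names:
        # Remember and remove the package plus any already-loaded submodules.
        for key in list(sys.modules):
            if key == name or key.startswith(name + "."):
                saved[key] = sys.modules.pop(key)
        # A None entry in sys.modules makes the import system raise ImportError.
        sys.modules[name] = None
    try:
        yield
    finally:
        # Drop the None markers, then restore whatever was loaded before.
        for name in names:
            sys.modules.pop(name, None)
        sys.modules.update(saved)


def imports_cleanly(module_name, blocked):
    """Return True if module_name imports while the blocked names are unavailable."""
    with blocked_imports(*blocked):
        try:
            importlib.import_module(module_name)
        except ImportError:
            return False
    return True
```

A call such as `imports_cleanly("datafog", blocked=("pyspark", "pytesseract", "spacy", "torch"))` would exercise the core import with those (assumed) optional modules unavailable. The check is only meaningful in a fresh interpreter, since a module that is already cached in `sys.modules` is returned without being re-imported.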
DataFog 4.5 validates Python 3.13 beyond the core/CLI path for the optional profiles that currently have compatible wheels in the tested dependency set.
| Profile | Python 3.13 status | Notes |
|---|---|---|
| `nlp` | Supported | spaCy imports and the profile smoke test pass on Python 3.13. |
| `nlp-advanced` | Supported | GLiNER, torch, transformers, and onnxruntime import successfully on Python 3.13. |
| `ocr` | Supported | Pillow, pytesseract, and the system Tesseract bridge validate on Python 3.13 when the `tesseract` binary is installed. |
| `nlp-advanced,ocr` | Supported with local model requirement | Donut dependencies import on Python 3.13; runtime OCR still requires the configured Donut model to be present locally. |
| `distributed` and `all` | Not newly certified in 4.5 | Keep using Python 3.10-3.12 for distributed/all-profile validation until Spark and the full optional surface are audited separately. |
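
The table above can be turned into a small startup guard. The sketch below transcribes only the Python 3.13 column. The names `PY313_VALIDATED` and `profile_warning` are hypothetical and not part of DataFog, and the function makes no claim about versions below 3.13 beyond what the table states.

```python
import sys

# Transcribed from the table above: True means validated on Python 3.13 in 4.5.
PY313_VALIDATED = {
    "nlp": True,
    "nlp-advanced": True,
    "ocr": True,
    "nlp-advanced,ocr": True,
    "distributed": False,
    "all": False,
}


def profile_warning(profile, version=None):
    """Return a warning string if the profile is not certified here, else None.

    Hypothetical helper, not a DataFog API.
    """
    if profile not in PY313_VALIDATED:
        raise KeyError(f"unknown profile: {profile}")
    # Compare only (major, minor); default to the running interpreter.
    major_minor = tuple((version or sys.version_info)[:2])
    if major_minor >= (3, 13) and not PY313_VALIDATED[profile]:
        return f"{profile} is not certified on Python 3.13 in 4.5; use Python 3.10-3.12"
    return None
```

For example, `profile_warning("distributed", (3, 13))` returns a warning string, while `profile_warning("ocr", (3, 13))` returns `None`.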