Description
Environment
Docling Version: 2.60.0
OS: Windows
Python: 3.12
Bug Description
I am testing Docling to process complex documents and am enabling do_formula_enrichment and generate_picture_images. On test documents, both features fail, resulting in corrupted data output.
Formula Failure: Instead of extracting LaTeX, the text is replaced with garbled code (e.g., a ¼ Δ f rep f rep ¼ 2 : 12 /C3 10 /C0 622 and SNRtime corr : /C0 apod : = 3112). This is a significant regression from standard text extraction. This appears related to another reported issue (on formula spacing), but my output is completely garbled, not just spaced out.
Image Failure: Instead of exporting images and linking them, Docling only inserts `<!-- image -->` placeholders. My script creates the target directory, but Docling does not save any image files into it (the directory remains empty). This appears to be the same bug as reported in Issue #2560.
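One way to flag this kind of formula corruption automatically is to scan the exported Markdown for glyph-code debris such as /C0 and /C3, which shows up in the snippets above. The sketch below is a heuristic only: the function name `find_glyph_debris` and the regular expression are assumptions made for illustration and are not part of Docling.

```python
import re

# Heuristic: tokens like "/C0" or "/C3" (slash, "C", one or two hex digits)
# look like unmapped font glyph codes rather than real text.
# The pattern is an assumption based on the snippets in this report.
GLYPH_DEBRIS = re.compile(r"/C[0-9A-Fa-f]{1,2}\b")


def find_glyph_debris(text: str) -> list[str]:
    """Return every glyph-code-like token found in text, in order."""
    return GLYPH_DEBRIS.findall(text)


sample = "f rep ¼ 2 : 12 /C3 10 /C0 622"
print(find_glyph_debris(sample))
```

A non-empty result on text that should contain formulas suggests the garbling described here rather than the milder spacing problem.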
Steps to Reproduce
Install docling==2.60.0.
Download a public PDF known to contain formulas and images (e.g., "Attention Is All You Need": https://arxiv.org/pdf/1706.03762v7).
Run the Python script below.
Minimal Python Code
Python
import docling
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat
from pathlib import Path

# --- 1. Configure Docling ---
print("ℹ️ Initializing Docling Converter...")
pipeline_options = PdfPipelineOptions()
# Bug 1: Enable Formula Enrichment
pipeline_options.do_formula_enrichment = True
# Bug 2: Enable Image Generation
pipeline_options.generate_picture_images = True
converter = DocumentConverter(format_options={
    InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
})
print("✅ Docling Converter initialized.")

# --- 2. Define Paths ---
# Using a public test document (e.g., "Attention Is All You Need")
# Download this file locally and place it in the same directory.
pdf_path = "1706.03762v7.pdf"
image_output_dir = Path("./test_images/doc_01")
image_output_dir.mkdir(parents=True, exist_ok=True)  # Create output dir
print(f"🔄 Processing {pdf_path}...")

# --- 3. Convert ---
try:
    result = converter.convert(pdf_path)
    doc = result.document
    # 4. Export
    # Per Issue #2560, export_to_markdown() does not accept image_dir or include_annotations
    markdown_text = doc.export_to_markdown()
    print("\n--- MARKDOWN RESULT (snippet) ---")
    # Look for the famous formula
    attention_formula_index = markdown_text.find("Attention(Q, K, V)")
    if attention_formula_index != -1:
        print(markdown_text[attention_formula_index : attention_formula_index + 200] + "...")
    else:
        print("Could not find 'Attention(Q, K, V)' snippet.")
    print("\n--- IMAGE PLACEHOLDERS ---")
    print(f"Found '<!-- image -->' placeholder: {'<!-- image -->' in markdown_text}")
    print("\n--- IMAGE DIRECTORY CONTENT ---")
    print(f"Checking directory: {image_output_dir.resolve()}")
    # This check will show an empty list, proving no images were saved.
    print(list(image_output_dir.glob('*')))
except Exception as e:
    print(f"❌ Error during Docling extraction: {e}")
Expected Behavior
Formulas: Formulas should be extracted as clean LaTeX (e.g.,
Images: generate_picture_images=True should save image files (e.g., img-0.png) to the output directory.
Markdown: The `<!-- image -->` placeholders should be replaced with valid Markdown image links (e.g.,
).
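To make the expected Markdown concrete, the placeholders can be rewritten into image links after export. This is a post-processing sketch, not Docling's own API: the helper `link_placeholders` is hypothetical, the `img-N.png` naming follows the example in this report, and it assumes that the placeholder string is `<!-- image -->` and that image files with those names have been written separately.

```python
PLACEHOLDER = "<!-- image -->"  # assumed placeholder string


def link_placeholders(markdown_text: str, image_dir: str = "test_images/doc_01") -> str:
    """Replace the n-th placeholder with a link to <image_dir>/img-n.png."""
    parts = markdown_text.split(PLACEHOLDER)
    out = [parts[0]]
    # Each remaining part was preceded by one placeholder in the input.
    for index, tail in enumerate(parts[1:]):
        out.append(f"![Image]({image_dir}/img-{index}.png)")
        out.append(tail)
    return "".join(out)


print(link_placeholders("Figure 1:\n<!-- image -->\n", "imgs"))
```

Text without any placeholder is returned unchanged, so the helper is safe to run on every export.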
Actual Behavior
Formulas: Formulas are destroyed and replaced with garbled text (see the snippets in the Bug Description).
Images: No image files are saved to disk. The directory specified (or any other directory) remains empty.
Markdown: The file only contains `<!-- image -->` placeholders.
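The image-related symptoms above can be checked in one place. The following is a small diagnostic sketch; `summarize_export` is a hypothetical helper, and it assumes that the placeholder string is `<!-- image -->` and that saved images would be PNG or JPEG files directly inside the given directory.

```python
import re
from pathlib import Path

PLACEHOLDER = "<!-- image -->"  # assumed placeholder string
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\([^)]+\)")  # Markdown image syntax
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}  # assumed output formats


def summarize_export(markdown_text: str, image_dir: str) -> dict:
    """Count placeholders and image links, and list saved image files."""
    directory = Path(image_dir)
    files = []
    if directory.is_dir():
        files = sorted(
            p.name for p in directory.glob("*")
            if p.suffix.lower() in IMAGE_SUFFIXES
        )
    return {
        "placeholders": markdown_text.count(PLACEHOLDER),
        "image_links": len(IMAGE_LINK.findall(markdown_text)),
        "image_files": files,
    }
```

For the behavior reported here, the summary would show a non-zero placeholder count, zero image links, and an empty file list.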
Thank you for your work on this promising project!