Bug: do_formula_enrichment=True produces garbled text (e.g., /C0 apod) and generate_picture_images=True creates empty folders & `<!-- image -->` placeholders #2568

@SiHo256

Environment
Docling Version: 2.60.0

OS: Windows

Python: 3.12

Bug Description
I am testing Docling to process complex documents and am enabling do_formula_enrichment and generate_picture_images. On test documents, both features fail, resulting in corrupted data output.

Formula Failure: Instead of extracting LaTeX, the text is replaced with garbled code (e.g., a ¼ Δ f rep f rep ¼ 2 : 12 /C3 10 /C0 622 and SNRtime corr : /C0 apod : = 3112). This is a significant regression from standard text extraction. This appears related to another reported issue (on formula spacing), but my output is completely garbled, not just spaced out.

Image Failure: Instead of exporting images and linking them, Docling only inserts `<!-- image -->` placeholders. My script creates the target directory, but Docling does not save any image files into it (the directory remains empty). This appears to be the same bug as reported in Issue #2560.
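The garbled formula output described above can be flagged automatically when checking many documents. A minimal sketch, assuming the stray tokens always have the form `/C` followed by digits (as in `/C0` and `/C3` from my snippets); `GLYPH_CODE`, `find_glyph_codes` and `looks_garbled` are hypothetical helper names, not part of Docling:

```python
import re

# Assumption: garbled output contains tokens such as "/C0" or "/C3".
# The pattern is derived only from the snippets in this report.
GLYPH_CODE = re.compile(r"/C\d+\b")


def find_glyph_codes(text: str) -> list[str]:
    """Return every '/C<digits>' token found in the exported text."""
    return GLYPH_CODE.findall(text)


def looks_garbled(text: str, threshold: int = 1) -> bool:
    """Flag text that contains at least `threshold` suspicious tokens."""
    return len(find_glyph_codes(text)) >= threshold
```

Running `looks_garbled(markdown_text)` on the exported Markdown would give a quick yes/no signal per document; it does not detect other kinds of corruption such as the stray `¼` characters.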

Steps to Reproduce
Install docling==2.60.0.

Download a public PDF known to contain formulas and images (e.g., "Attention Is All You Need": https://arxiv.org/pdf/1706.03762v7).

Run the Python script below.

Minimal Python Code
from pathlib import Path

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# --- 1. Configure Docling ---
print("ℹ️ Initializing Docling Converter...")
pipeline_options = PdfPipelineOptions()

# Bug 1: Enable Formula Enrichment
pipeline_options.do_formula_enrichment = True

# Bug 2: Enable Image Generation
pipeline_options.generate_picture_images = True

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
print("✅ Docling Converter initialized.")

# --- 2. Define Paths ---
# Using a public test document (e.g., "Attention Is All You Need").
# Download this file locally and place it in the same directory.
pdf_path = "1706.03762v7.pdf"
image_output_dir = Path("./test_images/doc_01")
image_output_dir.mkdir(parents=True, exist_ok=True)  # Create output dir

print(f"🔄 Processing {pdf_path}...")

# --- 3. Convert ---
try:
    result = converter.convert(pdf_path)
    doc = result.document

    # --- 4. Export ---
    # Per Issue #2560, export_to_markdown() does not accept image_dir or include_annotations
    markdown_text = doc.export_to_markdown()

    print("\n--- MARKDOWN RESULT (snippet) ---")
    # Look for the famous formula
    attention_formula_index = markdown_text.find("Attention(Q, K, V)")
    if attention_formula_index != -1:
        print(markdown_text[attention_formula_index : attention_formula_index + 200] + "...")
    else:
        print("Could not find 'Attention(Q, K, V)' snippet.")

    print("\n--- IMAGE PLACEHOLDERS ---")
    # Check for the image placeholder that appears in the Markdown output
    print(f"Found '<!-- image -->' placeholder: {'<!-- image -->' in markdown_text}")

    print("\n--- IMAGE DIRECTORY CONTENT ---")
    print(f"Checking directory: {image_output_dir.resolve()}")
    # This check will show an empty list, proving no images were saved.
    print(list(image_output_dir.glob("*")))

except Exception as e:
    print(f"❌ Error during Docling extraction: {e}")

Expected Behavior
Formulas: Formulas should be extracted as clean LaTeX (e.g., $$Attention(Q, K, V) = \text{softmax}(\frac{QK^{T}}{\sqrt{d_{k}}})V$$).

Images: generate_picture_images=True should save image files (e.g., img-0.png) to the output directory.

Markdown: The `<!-- image -->` placeholders should be replaced with valid Markdown image links (e.g., ...).

Actual Behavior
Formulas: Formulas are destroyed and replaced with garbled text (see the snippets in the Bug Description above).

Images: No image files are saved to disk. The directory specified (or any other directory) remains empty.

Markdown: The file only contains `<!-- image -->` placeholders.
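One possible explanation for the empty directory is that generate_picture_images=True only keeps the rendered pictures in memory on the converted document and never writes them anywhere by itself. That is an assumption on my side, based on my reading of Docling's figure-export example. Under that assumption, a minimal sketch for writing the pictures out manually could look like this; `save_pictures` and the `img-<n>.png` naming are hypothetical, while `doc.pictures` and `get_image(doc)` are the Docling document attributes I believe exist:

```python
from pathlib import Path


def save_pictures(doc, out_dir: Path) -> list[Path]:
    """Write each in-memory picture of `doc` to `out_dir` as a PNG file.

    Assumes `doc.pictures` is a list of picture items and that each item
    offers `get_image(doc)`, returning a PIL-style image or None.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    saved: list[Path] = []
    for index, picture in enumerate(doc.pictures):
        image = picture.get_image(doc)
        if image is None:
            # No image data was kept for this picture; skip it.
            continue
        target = out_dir / f"img-{index}.png"
        with target.open("wb") as fp:
            image.save(fp, "PNG")
        saved.append(target)
    return saved
```

In the reproduction script this would be called as `save_pictures(result.document, image_output_dir)` right after the conversion. If the returned list is still empty, the pictures were not generated at all, which would point to a problem in the pipeline rather than in the export step.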

Thank you for your work on this promising project!
