Bug
When I run Docling on one particular PDF, it throws a runtime error; other PDFs I have tested convert without problems. Unfortunately, I cannot share the PDF because it contains medical and PnC information.
...
Steps to reproduce
Pipeline used:
# 1b. Configure RapidOCR Options
# --- Define models directory ---
# Assumes models folder is at the root level (parent of app)
models_dir = Path(__file__).parent.parent.parent / "models" / "rapidocr"
models_dir.mkdir(parents=True, exist_ok=True)  # Ensure directory exists
logger.info(f"Ensuring RapidOCR models are in: {models_dir}")

# --- Download RapidOCR models to the specified directory ---
# This will only download if models aren't already in models_dir
download_path = snapshot_download(
    repo_id="SWHL/RapidOCR",
    local_dir=models_dir,  # Specify the target directory
    local_dir_use_symlinks=False,  # Recommended False on Windows for simplicity
)
logger.info(f"RapidOCR models location: {download_path}")  # download_path will be models_dir

# Setup RapidOcrOptions using paths relative to the download_path (which is models_dir)
# Ensure these sub-paths (PP-OCRv4, PP-OCRv3) match the actual downloaded folder structure
det_model_path = os.path.join(
    download_path, "PP-OCRv4", "en_PP-OCRv3_det_infer.onnx"
)
rec_model_path = os.path.join(
    download_path, "PP-OCRv3", "en_PP-OCRv3_rec_infer.onnx"
)
cls_model_path = os.path.join(
    download_path, "PP-OCRv3", "ch_ppocr_mobile_v2.0_cls_train.onnx"
)

# Optimized configuration for OCR
rapidocr_options = RapidOcrOptions(
    det_model_path=det_model_path,
    rec_model_path=rec_model_path,
    cls_model_path=cls_model_path,
    lang=['english'],  # Focus on English
    text_score=0.6,  # Confidence threshold for text detection
    force_full_page_ocr=True,  # Process the whole page
    use_det=True,  # Enable text detection
    use_rec=True,  # Enable text recognition
    use_cls=True,  # Enable text orientation classification
    bitmap_area_threshold=0.03,  # Threshold for detecting small text areas
    print_verbose=True,  # Enable debugging output
)

# 2. Configure Table Structure Options
table_options = TableStructureOptions(
    mode=TableFormerMode.ACCURATE,
    do_cell_matching=True,
)

# 3. Configure Accelerator Options
accelerator_options = AcceleratorOptions(
    device=AcceleratorDevice.CUDA if torch.cuda.is_available() else AcceleratorDevice.CPU,
    num_threads=8,
    cuda_use_flash_attention2=torch.cuda.is_available(),
)

# 4. Configure Main Pipeline Options
pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=rapidocr_options,
    do_table_structure=True,
    table_structure_options=table_options,
    accelerator_options=accelerator_options,
    generate_page_images=True,  # Keep True for visualizations
    generate_picture_images=True,
    create_legacy_output=True,
    document_timeout=300,
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
data = converter.convert(str(file_path))

- Use the setup above and run the affected PDF through this pipeline.
- The conversion should fail with the error shown in the logs below.
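
Since the failure is specific to one PDF, it can help to run the same converter over a batch of files and record which inputs fail and with what message. The following is a minimal sketch, not part of Docling: `find_failing_inputs` is a hypothetical helper name, and `convert` is assumed to be any callable that takes a path string and raises on failure (for example, the `converter.convert` from the setup above).

```python
from typing import Callable, Dict, Iterable


def find_failing_inputs(
    convert: Callable[[str], object], paths: Iterable[str]
) -> Dict[str, str]:
    """Run `convert` on each path and map failing paths to their error text.

    `convert` is an assumption: any callable that raises when conversion fails.
    """
    failures: Dict[str, str] = {}
    for path in paths:
        try:
            convert(path)
        except Exception as exc:  # broad on purpose: this is only for triage
            failures[path] = f"{type(exc).__name__}: {exc}"
    return failures
```

Called as `find_failing_inputs(converter.convert, list_of_pdf_paths)`, where `list_of_pdf_paths` is a placeholder for your own list of path strings, it returns a dictionary whose keys are the files that need a closer look.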
...
Docling version
2025-11-04 02:49:17,500 - INFO - Loading plugin 'docling_defaults'
2025-11-04 02:49:17,501 - INFO - Registered ocr engines: ['easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
Docling version: 2.31.0
Docling Core version: 2.50.0
Docling IBM Models version: 3.10.2
Docling Parse version: 4.7.0
Python: cpython-310 (3.10.17)
Platform: Windows-10-10.0.19045-SP0
...
Python version
Python 3.10.17
...
Logs:
[2025-11-04 00:45:41] [ERROR] docling.pipeline.standard_pdf_pipeline - standard_pdf_pipeline.py:349 - Stage preprocess failed for run 1: could not find the page-dimensions: {
"/Contents": [
"5 0 R [stream]"
],
"/Parent": "[skipping /Parent]",
"/Resources": {
"/Font": {
"/c": {
"/BaseFont": "/AAAAAA+Arial,Bold",
"/FirstChar": 32,
"/FontDescriptor": {
"/Ascent": 905,
"/AvgWidth": 479,
"/CapHeight": 500,
"/Descent": -212,
"/Flags": 4,
"/FontBBox": [
-628,
-376,
2000,
1056
],
"/FontFile2": "103 0 R [stream]",
"/FontName": "/AAAAAA+Arial,Bold",
"/ItalicAngle": 0,
"/Leading": 0,
"/MaxWidth": 2628,
"/MissingWidth": 479,
"/StemH": 0,
"/StemV": 0,
"/Type": "/FontDescriptor",
"/XHeight": 0
},
"/LastChar": 116,
"/Subtype": "/TrueType",
"/ToUnicode": "100 0 R [stream]",
"/Type": "/Font",
"/Widths": [
278,
0,
0,
0,
0,
0,
722,
0,
333,
333,
0,
0,
278,
333,
278,
278,
556,
556,
556,
556,
556,
556,
556,
556,
556,
556,
333,
0,
0,
0,
0,
0,
975,
722,
722,
722,
722,
667,
611,
778,
722,
278,
0,
722,
611,
833,
722,
778,
667,
778,
722,
667,
611,
722,
667,
944,
0,
667,
611,
0,
0,
0,
0,
0,
0,
556,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
278,
0,
0,
611,
0,
0,
0,
0,
333
]
},
"/d": {
"/BaseFont": "/AAAAAB+Arial",
"/FirstChar": 32,
"/FontDescriptor": {
"/Ascent": 905,
"/AvgWidth": 441,
"/CapHeight": 500,
"/Descent": -212,
"/Flags": 4,
"/FontBBox": [
-665,
-325,
2000,
1040
],
"/FontFile2": "113 0 R [stream]",
"/FontName": "/AAAAAB+Arial",
"/ItalicAngle": 0,
"/Leading": 0,
"/MaxWidth": 2665,
"/MissingWidth": 441,
"/StemH": 0,
"/StemV": 0,
"/Type": "/FontDescriptor",
"/XHeight": 0
},
"/LastChar": 121,
"/Subtype": "/TrueType",
"/ToUnicode": "110 0 R [stream]",
"/Type": "/Font",
"/Widths": [
278,
278,
355,
0,
0,
889,
667,
0,
333,
333,
389,
584,
278,
333,
278,
278,
556,
556,
556,
556,
556,
556,
556,
556,
556,
556,
278,
0,
0,
0,
584,
0,
0,
667,
667,
722,
722,
667,
611,
778,
722,
278,
500,
667,
556,
833,
722,
778,
667,
778,
722,
667,
611,
722,
667,
944,
667,
667,
611,
278,
0,
278,
0,
0,
0,
556,
556,
500,
556,
556,
278,
556,
556,
222,
222,
0,
222,
833,
556,
556,
0,
0,
333,
500,
278,
556,
500,
722,
500,
500
]
}
},
"/ProcSet": [
"/PDF",
"/Text",
"/ImageB",
"/ImageC",
"/ImageI"
],
"/XObject": {
"/img0": "9 0 R [stream]"
}
},
"/Type": "/Page"
}
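
The page dictionary dumped above has only `/Contents`, `/Parent`, `/Resources` and `/Type` at its top level, with no `/MediaBox` or `/CropBox`. That is consistent with the "could not find the page-dimensions" message, although the log alone does not confirm the cause: in PDF, `/MediaBox` may be inherited from an ancestor page-tree node, and the dump skips `/Parent`. The sketch below checks a dumped page dictionary for page-box keys. It assumes the dump has been saved as JSON text, and `page_box_keys_present` is a hypothetical helper name, not a Docling function.

```python
import json
from typing import List

# Page-boundary keys that can appear on a PDF page object.
PAGE_BOX_KEYS = ("/MediaBox", "/CropBox", "/BleedBox", "/TrimBox", "/ArtBox")


def page_box_keys_present(page_dump_json: str) -> List[str]:
    """Return the page-box keys found at the top level of a dumped page dict."""
    page = json.loads(page_dump_json)
    return [key for key in PAGE_BOX_KEYS if key in page]


# Abbreviated version of the dump from the log (font details left out).
dump = (
    '{"/Contents": ["5 0 R [stream]"], '
    '"/Parent": "[skipping /Parent]", '
    '"/Resources": {}, '
    '"/Type": "/Page"}'
)
print(page_box_keys_present(dump))  # prints []
```

An empty result only says that the page itself carries no box; whether the parent node supplies one has to be checked in the original file.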
...
jsamoocha