RuntimeError: Could not find the page-dimensions #2574

@lihomm

Bug

When I test Docling with certain PDF files, it throws this runtime error for one specific PDF but not for the other PDFs I have tested. Unfortunately, I cannot share the PDF because it contains medical and PnC information.
...

Steps to reproduce

Pipeline used:

        # 1b. Configure RapidOCR Options
        # --- Define models directory ---
        # Assumes models folder is at the root level (parent of app)
        models_dir = Path(__file__).parent.parent.parent / "models" / "rapidocr"
        models_dir.mkdir(parents=True, exist_ok=True) # Ensure directory exists
        logger.info(f"Ensuring RapidOCR models are in: {models_dir}")

        # --- Download RapidOCR models to the specified directory ---
        # This will only download if models aren't already in models_dir
        download_path = snapshot_download(
            repo_id="SWHL/RapidOCR",
            local_dir=models_dir, # Specify the target directory
            local_dir_use_symlinks=False # Recommended False on Windows for simplicity
        )
        logger.info(f"RapidOCR models location: {download_path}") # download_path will be models_dir

        # Setup RapidOcrOptions using paths relative to the download_path (which is models_dir)
        # Ensure these sub-paths (PP-OCRv4, PP-OCRv3) match the actual downloaded folder structure
        det_model_path = os.path.join(
            download_path, "PP-OCRv4", "en_PP-OCRv3_det_infer.onnx"
        )
        rec_model_path = os.path.join(
            download_path, "PP-OCRv3", "en_PP-OCRv3_rec_infer.onnx"
        )
        cls_model_path = os.path.join(
            download_path, "PP-OCRv3", "ch_ppocr_mobile_v2.0_cls_train.onnx"
        )

        # Optimized configuration for OCR
        rapidocr_options = RapidOcrOptions(
            det_model_path=det_model_path,
            rec_model_path=rec_model_path,
            cls_model_path=cls_model_path,
            lang=['english'],  # Focus on English
            text_score=0.6,    # Confidence threshold for text detection
            force_full_page_ocr=True,  # Process the whole page
            use_det=True,      # Enable text detection
            use_rec=True,      # Enable text recognition
            use_cls=True,      # Enable text orientation classification
            bitmap_area_threshold=0.03,  # Threshold for detecting small text areas
            print_verbose=True  # Enable debugging output
        )

        # 2. Configure Table Structure Options
        table_options = TableStructureOptions(
            mode=TableFormerMode.ACCURATE,
            do_cell_matching=True
        )

        # 3. Configure Accelerator Options
        accelerator_options = AcceleratorOptions(
            device=AcceleratorDevice.CUDA if torch.cuda.is_available() else AcceleratorDevice.CPU,
            num_threads=8,
            cuda_use_flash_attention2=torch.cuda.is_available(),
        )

        # 4. Configure Main Pipeline Options
        pipeline_options = PdfPipelineOptions(
            do_ocr=True,
            ocr_options=rapidocr_options,
            do_table_structure=True,
            table_structure_options=table_options,
            accelerator_options=accelerator_options,
            generate_page_images=True, # Keep True for visualizations
            generate_picture_images=True,
            create_legacy_output=True,
            document_timeout=300,
        )

        converter = DocumentConverter(
            format_options={
                InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
            }
        )

        data = converter.convert(str(file_path))

  1. Use the setup above and run the affected PDF through this pipeline.
  2. The conversion fails with the error shown in the logs below.

...

Docling version

2025-11-04 02:49:17,500 - INFO - Loading plugin 'docling_defaults'
2025-11-04 02:49:17,501 - INFO - Registered ocr engines: ['easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
Docling version: 2.31.0
Docling Core version: 2.50.0
Docling IBM Models version: 3.10.2
Docling Parse version: 4.7.0
Python: cpython-310 (3.10.17)
Platform: Windows-10-10.0.19045-SP0
...

Python version

Python 3.10.17
...

Logs:

[2025-11-04 00:45:41] [ERROR] docling.pipeline.standard_pdf_pipeline - standard_pdf_pipeline.py:349 - Stage preprocess failed for run 1: could not find the page-dimensions: {
    "/Contents": [
        "5 0 R [stream]"
    ],
    "/Parent": "[skipping /Parent]",
    "/Resources": {
        "/Font": {
            "/c": {
                "/BaseFont": "/AAAAAA+Arial,Bold",
                "/FirstChar": 32,
                "/FontDescriptor": {
                    "/Ascent": 905,
                    "/AvgWidth": 479,
                    "/CapHeight": 500,
                    "/Descent": -212,
                    "/Flags": 4,
                    "/FontBBox": [
                        -628,
                        -376,
                        2000,
                        1056
                    ],
                    "/FontFile2": "103 0 R [stream]",
                    "/FontName": "/AAAAAA+Arial,Bold",
                    "/ItalicAngle": 0,
                    "/Leading": 0,
                    "/MaxWidth": 2628,
                    "/MissingWidth": 479,
                    "/StemH": 0,
                    "/StemV": 0,
                    "/Type": "/FontDescriptor",
                    "/XHeight": 0
                },
                "/LastChar": 116,
                "/Subtype": "/TrueType",
                "/ToUnicode": "100 0 R [stream]",
                "/Type": "/Font",
                "/Widths": [
                    278,
                    0,
                    0,
                    0,
                    0,
                    0,
                    722,
                    0,
                    333,
                    333,
                    0,
                    0,
                    278,
                    333,
                    278,
                    278,
                    556,
                    556,
                    556,
                    556,
                    556,
                    556,
                    556,
                    556,
                    556,
                    556,
                    333,
                    0,
                    0,
                    0,
                    0,
                    0,
                    975,
                    722,
                    722,
                    722,
                    722,
                    667,
                    611,
                    778,
                    722,
                    278,
                    0,
                    722,
                    611,
                    833,
                    722,
                    778,
                    667,
                    778,
                    722,
                    667,
                    611,
                    722,
                    667,
                    944,
                    0,
                    667,
                    611,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    556,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    278,
                    0,
                    0,
                    611,
                    0,
                    0,
                    0,
                    0,
                    333
                ]
            },
            "/d": {
                "/BaseFont": "/AAAAAB+Arial",
                "/FirstChar": 32,
                "/FontDescriptor": {
                    "/Ascent": 905,
                    "/AvgWidth": 441,
                    "/CapHeight": 500,
                    "/Descent": -212,
                    "/Flags": 4,
                    "/FontBBox": [
                        -665,
                        -325,
                        2000,
                        1040
                    ],
                    "/FontFile2": "113 0 R [stream]",
                    "/FontName": "/AAAAAB+Arial",
                    "/ItalicAngle": 0,
                    "/Leading": 0,
                    "/MaxWidth": 2665,
                    "/MissingWidth": 441,
                    "/StemH": 0,
                    "/StemV": 0,
                    "/Type": "/FontDescriptor",
                    "/XHeight": 0
                },
                "/LastChar": 121,
                "/Subtype": "/TrueType",
                "/ToUnicode": "110 0 R [stream]",
                "/Type": "/Font",
                "/Widths": [
                    278,
                    278,
                    355,
                    0,
                    0,
                    889,
                    667,
                    0,
                    333,
                    333,
                    389,
                    584,
                    278,
                    333,
                    278,
                    278,
                    556,
                    556,
                    556,
                    556,
                    556,
                    556,
                    556,
                    556,
                    556,
                    556,
                    278,
                    0,
                    0,
                    0,
                    584,
                    0,
                    0,
                    667,
                    667,
                    722,
                    722,
                    667,
                    611,
                    778,
                    722,
                    278,
                    500,
                    667,
                    556,
                    833,
                    722,
                    778,
                    667,
                    778,
                    722,
                    667,
                    611,
                    722,
                    667,
                    944,
                    667,
                    667,
                    611,
                    278,
                    0,
                    278,
                    0,
                    0,
                    0,
                    556,
                    556,
                    500,
                    556,
                    556,
                    278,
                    556,
                    556,
                    222,
                    222,
                    0,
                    222,
                    833,
                    556,
                    556,
                    0,
                    0,
                    333,
                    500,
                    278,
                    556,
                    500,
                    722,
                    500,
                    500
                ]
            }
        },
        "/ProcSet": [
            "/PDF",
            "/Text",
            "/ImageB",
            "/ImageC",
            "/ImageI"
        ],
        "/XObject": {
            "/img0": "9 0 R [stream]"
        }
    },
    "/Type": "/Page"
}
...
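
Reading the dumped page dictionary above: its top-level keys are /Contents, /Parent, /Resources and /Type, with no /MediaBox or /CropBox entry on the page object itself. In PDF, /MediaBox may be inherited from the /Parent node, which the dump skips, so this alone does not show the file is malformed. Below is a minimal sketch for checking a dump like this one. `BOX_KEYS` and `dimension_keys` are hypothetical names for illustration, not part of Docling, and the dictionary is an abridged copy of the log with the font details omitted.

```python
import json

# Box entries that can define a page's dimensions in a PDF page dictionary.
BOX_KEYS = ("/MediaBox", "/CropBox")


def dimension_keys(page_dict):
    """Return the box keys that appear directly in a dumped page dictionary."""
    return [key for key in BOX_KEYS if key in page_dict]


# Abridged copy of the dictionary from the log (font details omitted).
logged_page = json.loads("""
{
    "/Contents": ["5 0 R [stream]"],
    "/Parent": "[skipping /Parent]",
    "/Resources": {
        "/ProcSet": ["/PDF", "/Text", "/ImageB", "/ImageC", "/ImageI"],
        "/XObject": {"/img0": "9 0 R [stream]"}
    },
    "/Type": "/Page"
}
""")

found = dimension_keys(logged_page)
print(found or "no box entries on the page object itself")
```

If the list is empty, as it is for this dump, the dimensions would have to come from the parent page-tree node, so that is the next place to look in the affected file.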
