TypeError when using custom PSM value in TesseractOcrOptions #2576

@Mulgyeol

Bug

When TesseractOcrOptions is given a custom psm (Page Segmentation Mode) value, conversion fails while the Tesseract OCR model is being initialized, with TypeError: __init__() takes exactly 0 positional arguments (1 given).

The bug is in docling/models/tesseract_ocr_model.py:100:

main_psm = (
    tesserocr.PSM(self.options.psm)  # Bug: Attempts to call PSM as constructor
    if self.options.psm is not None
    else tesserocr.PSM.AUTO
)

Root cause: The code assumes tesserocr.PSM is an enum that can be constructed from an integer, but it is actually a class that only exposes integer constants. tesserocr.PSM.AUTO is already the integer 3, so calling tesserocr.PSM(3) raises the TypeError.
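The difference described in the root cause can be reproduced without tesserocr installed. The sketch below uses two hypothetical stand-in classes (ConstantsPSM and EnumPSM are illustrative names, not part of tesserocr or docling): one that only holds integer constants, and one IntEnum that behaves the way the buggy code expected. The exact error message of the pure-Python stand-in differs from the one tesserocr produces, but the exception type is the same.

```python
from enum import IntEnum


class ConstantsPSM:
    """Hypothetical stand-in: a class that only holds integer constants."""
    AUTO = 3
    SINGLE_BLOCK = 6


class EnumPSM(IntEnum):
    """Hypothetical stand-in: what the buggy code assumed PSM to be."""
    AUTO = 3
    SINGLE_BLOCK = 6


# An IntEnum supports lookup by value, so this call succeeds.
assert EnumPSM(3) is EnumPSM.AUTO

# A plain constants class has no constructor that accepts a value,
# so calling it with an argument raises TypeError.
try:
    ConstantsPSM(3)
except TypeError as exc:
    constants_error = exc
else:
    constants_error = None

# The constant itself is already a plain int and can be passed on unchanged.
assert isinstance(ConstantsPSM.AUTO, int)
```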

Fix: Use the integer value directly:

main_psm = (
    self.options.psm  # Already an integer
    if self.options.psm is not None
    else tesserocr.PSM.AUTO
)
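Since the fix passes the user's integer straight through, it may be worth validating it before use. The following is only a sketch of one way to do that: resolve_psm and the PSM_* constants are hypothetical names that do not exist in docling or tesserocr, and the 0 to 13 range is an assumption based on Tesseract's documented page segmentation modes.

```python
from typing import Optional

# Assumption: Tesseract page segmentation modes are the integers 0..13,
# with 3 corresponding to PSM.AUTO as stated in the report.
PSM_AUTO = 3
PSM_MIN, PSM_MAX = 0, 13


def resolve_psm(psm: Optional[int], default: int = PSM_AUTO) -> int:
    """Return the PSM integer to use, falling back to the default when unset."""
    if psm is None:
        return default
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(psm, bool) or not isinstance(psm, int):
        raise TypeError(f"psm must be an int, got {type(psm).__name__}")
    if not PSM_MIN <= psm <= PSM_MAX:
        raise ValueError(f"psm must be between {PSM_MIN} and {PSM_MAX}, got {psm}")
    return psm
```

With a helper like this, the assignment above would reduce to main_psm = resolve_psm(self.options.psm), and an out-of-range value would fail early with a readable message instead of inside Tesseract.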

Steps to reproduce

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractOcrOptions
from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend

# Configure OCR with custom PSM value
ocr_options = TesseractOcrOptions(
    lang=["eng"],
    psm=3  # Any non-None psm value triggers the bug, including the default mode 3
)

pdf_options = PdfPipelineOptions(
    ocr_options=ocr_options,
    artifacts_path="/path/to/models"
)

# Initialize converter
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pdf_options,
            backend=DoclingParseV4DocumentBackend
        )
    }
)

# Convert document; the TypeError is raised when the OCR model is initialized
result = converter.convert("test.pdf")

Error traceback:

File "docling/models/tesseract_ocr_model.py", line 100, in __init__
    tesserocr.PSM(self.options.psm)
TypeError: __init__() takes exactly 0 positional arguments (1 given)

Docling version

docling 2.60.0
(Bug exists since v2.56.0 when PSM support was added in PR #2411)

Python version

Python 3.11.13


Additional context:

  • tesserocr version: 2.7.1 - 2.8.0
  • Platform: Linux (Docker), macOS
  • Impact: All users attempting to use custom PSM values
  • Workaround: None (users must avoid the psm parameter)

Labels: bug
