Closed
Labels
bug (Something isn't working)
Description
Bug
When using TesseractOcrOptions with a custom psm (Page Segmentation Mode) parameter, the DocumentConverter initialization fails with a TypeError: __init__() takes exactly 0 positional arguments (1 given).
The bug is in docling/models/tesseract_ocr_model.py:100:
main_psm = (
    tesserocr.PSM(self.options.psm)  # Bug: attempts to call PSM as a constructor
    if self.options.psm is not None
    else tesserocr.PSM.AUTO
)
Root cause: The code assumes tesserocr.PSM is an enum that can be initialized with an integer, but it's actually a class with integer constants. tesserocr.PSM.AUTO is already the integer 3, so calling PSM(3) raises TypeError.
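The failure mode can be reproduced without tesserocr installed. The sketch below uses a hypothetical stand-in class, `PSMLike`, that mimics the behavior described above: integer class attributes and an `__init__` that accepts no arguments. It is not tesserocr's actual implementation, and the wording of the error message in pure Python differs from the one in the traceback.

```python
class PSMLike:
    """Hypothetical stand-in for tesserocr.PSM: a namespace of integer constants."""

    AUTO = 3
    SINGLE_BLOCK = 6

    def __init__(self):
        # Accepts no positional arguments, as the traceback in this report suggests.
        pass


def call_as_constructor(value):
    """Return the TypeError message raised by PSMLike(value), or None if it succeeds."""
    try:
        PSMLike(value)
    except TypeError as exc:
        return str(exc)
    return None


# The constants are already plain integers, so no conversion is required.
assert isinstance(PSMLike.AUTO, int)

# Calling the class with a value fails, which is the pattern at line 100.
error_message = call_as_constructor(3)
```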
Fix: Use the integer value directly:
main_psm = (
    self.options.psm  # Already an integer
    if self.options.psm is not None
    else tesserocr.PSM.AUTO
)
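The same fallback logic can be written as a small standalone helper, which also makes it easy to unit-test without tesserocr. This is only a sketch: `resolve_psm` and `DEFAULT_PSM` are hypothetical names, not part of docling, and the type check is an optional extra that the proposed fix does not include.

```python
from typing import Optional

# Assumed to equal tesserocr.PSM.AUTO, which this report states is the integer 3.
DEFAULT_PSM = 3


def resolve_psm(psm: Optional[int], default: int = DEFAULT_PSM) -> int:
    """Return the configured PSM integer, or the default when none is set."""
    if psm is None:
        return default
    # bool is a subclass of int, so reject it explicitly along with non-integers.
    if isinstance(psm, bool) or not isinstance(psm, int):
        raise TypeError(f"psm must be an int or None, got {type(psm).__name__}")
    return psm
```

With this shape, the value is passed through unchanged rather than wrapped in a `PSM(...)` call, so a custom `psm` no longer triggers the constructor error.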
Steps to reproduce
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractOcrOptions
from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend

# Configure OCR with custom PSM value
ocr_options = TesseractOcrOptions(
    lang=["eng"],
    psm=3,  # Any custom PSM value will trigger the bug
)
pdf_options = PdfPipelineOptions(
    ocr_options=ocr_options,
    artifacts_path="/path/to/models",
)

# Initialize converter
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pdf_options,
            backend=DoclingParseV4DocumentBackend,
        )
    }
)

# Convert document - TypeError occurs during initialization
result = converter.convert("test.pdf")
Error traceback:
  File "docling/models/tesseract_ocr_model.py", line 100, in __init__
    tesserocr.PSM(self.options.psm)
TypeError: __init__() takes exactly 0 positional arguments (1 given)
Docling version
docling 2.60.0
(Bug exists since v2.56.0 when PSM support was added in PR #2411)
Python version
Python 3.11.13
Additional context:
- tesserocr version: 2.7.1 - 2.8.0
- Platform: Linux (Docker), macOS
- Impact: All users attempting to use custom PSM values
- Workaround: None (users must avoid the psm parameter)