@dosu
I have a document parser service that performs text extraction from various file types. The documents are mostly in Japanese; they do not share a single typical structure, but they may contain roughly: tables 40%, lists 40%, workflow images 10%, other 10%. Most of the files are PDFs that were OCR'd from paperwork.
One critical constraint is that we run a cloud-deployed solution, preferably CPU-only, as we are low on resources at this stage. So: Linux + CPU-only + low resources, which rules out huge models.
For now it uses the local Docling tool, backed by Docling's own model stack (Docling's internal models perform poorly on Japanese text). It does not match the quality of our ground-truth examples; table structure extraction in particular is still low-performing when the table contains Japanese text.
Pages 6–8 of the tested PDF are screenshot-heavy operation-guidance pages, so image/callout text should be represented, but separately from the canonical procedure text. Also, for this document, table-like sections come out as paragraphs and bullets rather than as tables — meaning the layout model never marked those regions as tables in the first place.
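To make the "layout model missed the table" failure detectable rather than silent, one cheap option is a text-side heuristic over the lines Docling emitted as plain paragraphs. This is a sketch under assumptions — the function name, the regex, the 12-character label cap, and the `min_rows=3` threshold are all hypothetical tuning values, not Docling API:

```python
import re

# Banded 項目/説明 (item/description) tables that a layout model misses often
# surface as runs of short "label<wide gap>description" text lines. Flag a
# page for table rescue when enough consecutive lines match that shape.
_ROW_RE = re.compile(r"^\S{1,12}\s{2,}\S+")

def looks_like_flattened_table(lines: list[str], min_rows: int = 3) -> bool:
    """Return True when `min_rows` consecutive lines resemble two-column
    table rows that were emitted as plain text instead of a table."""
    streak = 0
    for line in lines:
        if _ROW_RE.match(line.strip()):
            streak += 1
            if streak >= min_rows:
                return True
        else:
            streak = 0
    return False
```

A flag like this would only route a page into a fallback extractor or the VLM rescue path; it makes no extraction decision on its own.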
Current core file code:
```python
from __future__ import annotations

from pathlib import Path

from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import InputFormat
from docling.datamodel.layout_model_specs import (
    DOCLING_LAYOUT_EGRET_LARGE,
    DOCLING_LAYOUT_EGRET_MEDIUM,
    DOCLING_LAYOUT_EGRET_XLARGE,
    DOCLING_LAYOUT_HERON,
    DOCLING_LAYOUT_HERON_101,
)
from docling.datamodel.pipeline_options import (
    LayoutOptions,
    PdfPipelineOptions,
    PictureClassificationLabel,
    PictureDescriptionApiOptions,
    RapidOcrOptions,
    TableFormerMode,
    TableStructureOptions,
)
from docling.document_converter import (
    DocumentConverter,
    PdfFormatOption,
    PowerpointFormatOption,
    WordFormatOption,
)

_FIGURE_PROMPT_DEFAULT = (
    "You are extracting figure content for a vector search index over Japanese "
    "and English business / regulatory documents.\n\n"
    "Respond in 2-4 sentences (max 120 words). State:\n"
    "1. Figure type (screenshot, flowchart, diagram, chart, table-as-image).\n"
    "2. Its purpose in the surrounding procedure.\n"
    "3. Any specific labels, button names, field names, values, dates, codes — "
    "quote verbatim.\n\n"
    "Do NOT describe colors, fonts, window chrome, OS or browser names, or "
    "spatial layout. If the figure is a flowchart, list steps in order using "
    "arrows (→). Reply in the dominant language of the visible text."
)

_LAYOUT_SPEC = {
    "heron": DOCLING_LAYOUT_HERON,
    "heron_101": DOCLING_LAYOUT_HERON_101,
    "egret_medium": DOCLING_LAYOUT_EGRET_MEDIUM,
    "egret_large": DOCLING_LAYOUT_EGRET_LARGE,
    "egret_xlarge": DOCLING_LAYOUT_EGRET_XLARGE,
}


def _build_rapid_ocr_options(settings) -> RapidOcrOptions:
    """RapidOCR (PaddleOCR-ONNX) is Docling's recommended Japanese-capable
    CPU OCR backend. When det/rec model paths are configured, we pin to
    pre-downloaded PP-OCRv5 weights for stable offline operation; otherwise
    we fall back to RapidOCR's bundled defaults.

    Note: RapidOcrOptions.bitmap_area_threshold defaults to 0.05 in Docling
    2.89; force_full_page_ocr is the right switch when the embedded text
    layer is known to be unreliable (e.g. OCR'd-from-paperwork PDFs).
    """
    kwargs: dict = {
        "lang": settings.ocr_languages,
        "force_full_page_ocr": settings.ocr_mode == "force",
        "bitmap_area_threshold": settings.ocr_bitmap_area_threshold,
    }
    if settings.rapidocr_det_model_path:
        kwargs["det_model_path"] = settings.rapidocr_det_model_path
    if settings.rapidocr_rec_model_path:
        kwargs["rec_model_path"] = settings.rapidocr_rec_model_path
    if settings.rapidocr_rec_keys_path:
        kwargs["rec_keys_path"] = settings.rapidocr_rec_keys_path
    if settings.rapidocr_cls_model_path:
        kwargs["cls_model_path"] = settings.rapidocr_cls_model_path
        kwargs["use_cls"] = True
    else:
        # RapidOCR's text-direction classifier corrects lines rotated 180°.
        # When no cls model path is set, RapidOCR otherwise tries to load
        # its bundled default (a PP-OCRv4 cls weight at
        # /opt/docling/models/RapidOcr/onnx/PP-OCRv4/cls/...) and fails hard
        # if that file is not present on disk. Disabling cls avoids a hard
        # build-time dependency on weights that contribute little to
        # right-side-up printed Japanese documents. Set
        # RAPIDOCR_CLS_MODEL_PATH to re-enable orientation classification.
        kwargs["use_cls"] = False
    return RapidOcrOptions(**kwargs)


_DEFAULT_FIGURE_CLASSES_ALLOW: tuple[PictureClassificationLabel, ...] = (
    PictureClassificationLabel.SCREENSHOT_FROM_COMPUTER,
    PictureClassificationLabel.SCREENSHOT_FROM_MANUAL,
    PictureClassificationLabel.FLOW_CHART,
    PictureClassificationLabel.BAR_CHART,
    PictureClassificationLabel.LINE_CHART,
    PictureClassificationLabel.PIE_CHART,
    PictureClassificationLabel.BOX_PLOT,
    PictureClassificationLabel.SCATTER_PLOT,
    PictureClassificationLabel.TABLE,
    PictureClassificationLabel.ENGINEERING_DRAWING,
    PictureClassificationLabel.GEOGRAPHICAL_MAP,
    PictureClassificationLabel.TOPOGRAPHICAL_MAP,
    PictureClassificationLabel.PHOTOGRAPH,
)


def _build_picture_description_options(settings) -> PictureDescriptionApiOptions | None:
    if not settings.do_picture_description:
        return None
    if not settings.openai_api_key:
        raise RuntimeError(
            "DO_PICTURE_DESCRIPTION=true but OPENAI_API_KEY is empty. "
            "Either set the key or disable picture description."
        )
    return PictureDescriptionApiOptions(
        url="https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {settings.openai_api_key}"},
        params={"model": settings.openai_model},
        prompt=settings.figure_prompt.strip() or _FIGURE_PROMPT_DEFAULT,
        timeout=60,
        # Only send the VLM figures we actually want captions for. Icons,
        # logos, barcodes, page thumbnails, signatures and stamps are
        # filtered out — they produce noisy captions and burn budget.
        classification_allow=list(_DEFAULT_FIGURE_CLASSES_ALLOW),
        classification_min_confidence=settings.picture_classification_min_confidence,
        picture_area_threshold=settings.picture_area_threshold,
        concurrency=settings.figure_concurrency,
    )


def _build_pipeline_options(settings) -> PdfPipelineOptions:
    opts = PdfPipelineOptions()

    # ---- OCR (Japanese-tuned PP-OCRv5 via RapidOCR/ONNX, CPU-friendly) ----
    opts.do_ocr = settings.ocr_mode != "off"
    if opts.do_ocr:
        opts.ocr_options = _build_rapid_ocr_options(settings)

    # ---- Layout (HERON by default; HERON-101 for highest accuracy) ----
    opts.layout_options = LayoutOptions(model_spec=_LAYOUT_SPEC[settings.layout_model])

    # ---- Table structure (TableFormer V1 + ACCURATE + cell_matching tuned) ----
    # V1 (TableStructureOptions) supports FAST/ACCURATE modes; V2 does not.
    # V1 ACCURATE is also Docling's own default and benchmarks competitively
    # with V2 on banded/borderless tables when do_cell_matching=False.
    opts.do_table_structure = True
    opts.table_structure_options = TableStructureOptions(
        mode=(
            TableFormerMode.ACCURATE
            if settings.table_mode == "accurate"
            else TableFormerMode.FAST
        ),
        do_cell_matching=settings.table_do_cell_matching,
    )

    # ---- Figure enrichment (GPT-4.1-mini captions, area-thresholded) ----
    pd_opts = _build_picture_description_options(settings)
    opts.generate_picture_images = True
    # Required for region-level VLM table rescue (table_rescue.py crops from
    # the rendered page bitmap). Cheap on CPU since images are only
    # rasterized when this flag is true.
    opts.generate_page_images = settings.enable_vlm_table_rescue
    opts.images_scale = 2.0
    # `compile_model=True` (the Docling default) asks PyTorch Inductor to
    # JIT-compile the classifier, which requires a host C compiler — gcc on
    # Linux, MSVC `cl` on Windows. Disabling it avoids a hard dependency on
    # the build toolchain at runtime and costs ~5% throughput on CPU.
    opts.picture_classification_options.engine_options.compile_model = False
    opts.do_picture_classification = True  # required for classification_allow
    opts.do_picture_description = pd_opts is not None
    if pd_opts is not None:
        opts.picture_description_options = pd_opts
        opts.enable_remote_services = True

    # ---- Long-running document guard ----
    # Bound conversion wall-time so one slow doc cannot pin a worker.
    if settings.document_timeout_s:
        opts.document_timeout = settings.document_timeout_s

    # ---- CPU batch tuning ----
    opts.ocr_batch_size = settings.ocr_batch_size
    opts.layout_batch_size = settings.layout_batch_size
    opts.table_batch_size = settings.table_batch_size

    # ---- Accelerator (match AWS vCPUs) ----
    if settings.num_threads:
        opts.accelerator_options = AcceleratorOptions(
            num_threads=settings.num_threads,
            device=AcceleratorDevice.CPU,
        )
    return opts


def build_converter(settings) -> tuple[DocumentConverter, str]:
    """Returns (converter, backend_label). The backend_label is surfaced in
    every ParseResponse so operators can see at a glance which configuration
    produced the document.
    """
    opts = _build_pipeline_options(settings)
    label = (
        f"docling[layout={settings.layout_model} "
        f"ocr={settings.ocr_mode} "
        f"table={settings.table_mode}/cell_match={settings.table_do_cell_matching} "
        f"figures={'gpt-4.1-mini' if settings.do_picture_description else 'off'}"
        f"{' rescue=on' if settings.enable_vlm_table_rescue else ''}]"
    )
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=opts),
            InputFormat.DOCX: WordFormatOption(pipeline_options=opts),
            InputFormat.PPTX: PowerpointFormatOption(pipeline_options=opts),
        }
    )
    return converter, label
```
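For reference, `settings` is duck-typed throughout this module. A hypothetical dataclass spelling out every attribute the builders read makes that contract explicit — the class name and all defaults below are illustrative guesses, not values taken from the real service:

```python
from dataclasses import dataclass, field

@dataclass
class ParserSettings:
    """Hypothetical settings object covering every attribute read by
    _build_rapid_ocr_options, _build_pipeline_options and build_converter."""
    ocr_mode: str = "force"                 # "force" | "auto" | "off"
    ocr_languages: list[str] = field(default_factory=lambda: ["japan", "english"])
    ocr_bitmap_area_threshold: float = 0.05
    rapidocr_det_model_path: str = ""       # empty -> RapidOCR bundled defaults
    rapidocr_rec_model_path: str = ""
    rapidocr_rec_keys_path: str = ""
    rapidocr_cls_model_path: str = ""       # empty -> use_cls disabled
    layout_model: str = "heron"             # key into _LAYOUT_SPEC
    table_mode: str = "accurate"            # "accurate" | "fast"
    table_do_cell_matching: bool = False
    do_picture_description: bool = False
    openai_api_key: str = ""
    openai_model: str = "gpt-4.1-mini"
    figure_prompt: str = ""                 # empty -> _FIGURE_PROMPT_DEFAULT
    picture_classification_min_confidence: float = 0.5
    picture_area_threshold: float = 0.05
    figure_concurrency: int = 2
    enable_vlm_table_rescue: bool = False
    document_timeout_s: int = 600
    ocr_batch_size: int = 4
    layout_batch_size: int = 4
    table_batch_size: int = 4
    num_threads: int = 4

# Usage (requires docling installed):
# converter, label = build_converter(ParserSettings())
# result = converter.convert("manual.pdf")
```

Pinning the contract in one place would also let mypy catch a renamed setting before it fails at runtime inside a worker.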
It is better to use figure/screenshot enrichment for VLM captions only, not full-page VLM markdown as it is now.
I assume Docling should remain the orchestration spine, not the only extraction engine.
Correct me if I am wrong.
Full-page VLM markdown is too unstable for identifiers, headings, Japanese procedural wording, and table content. It does work fine with English words in tables — I translated part of the file and ran it through our service, and the parts left in Japanese were still broken. It produces malformed markdown, separates headings from their content, alters identifiers, introduces empty table artifacts, and irrecoverably trims parts of the table text. For Japanese procedural documents, the VLM should be treated as an enrichment/fallback layer, not the source of truth.
I need to understand a wider, more reliable approach to using Docling. Assuming this is the maximum we can expect from the tool, could we use other local options — libraries or tools — to perform this type of extraction instead, especially table extraction and structure preservation? Could we also integrate those into Docling's current pipeline, keeping Docling as the orchestration spine rather than the core extraction engine, since it provides excellent pipeline construction and image detection?
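One way to keep Docling as the orchestration spine while borrowing a secondary CPU-only table extractor (pdfplumber and camelot are common candidates) is a per-page router: trust TableFormer wherever it fired, and only hand a page to the fallback when Docling found no tables but the page looks table-like. A minimal sketch — the function and label names are hypothetical, and the "looks table-like" signal would come from your own heuristic or layout confidence scores:

```python
def choose_table_engine(docling_table_count: int, looks_table_like: bool) -> str:
    """Decide which extractor should own the tables on one page.

    docling_table_count: tables Docling's layout pass detected on the page.
    looks_table_like:    external signal that the page probably contains a
                         table the layout model missed (heuristic/VLM flag).
    """
    if docling_table_count > 0:
        return "docling"      # TableFormer fired; keep its structured output
    if looks_table_like:
        return "fallback"     # e.g. run pdfplumber's page.extract_tables()
    return "none"             # prose-only page, nothing extra to do
```

The fallback's cell grid would then be merged back into the Docling document at the page's reading-order position, so downstream consumers see one uniform structure regardless of which engine produced each table.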
I still think it is better to send images to a VLM like GPT-4.1-mini.
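If you ever bypass `PictureDescriptionApiOptions` and call the Chat Completions endpoint yourself (e.g. for region-level table rescue crops), the request body for one figure image can be built as below. This is a sketch: the helper name is hypothetical, but the message shape — a content list mixing a `text` part and an `image_url` part with a base64 data URL — is the standard Chat Completions image-input format:

```python
import base64

def build_caption_request(png_bytes: bytes, prompt: str,
                          model: str = "gpt-4.1-mini") -> dict:
    """Build a Chat Completions payload sending one figure crop inline
    as a data URL, for POSTing to /v1/chat/completions."""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"},
                    },
                ],
            }
        ],
    }
```

Driving the endpoint directly would let you batch crops, retry per figure, and cap tokens independently of Docling's enrichment loop.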
Conduct in-depth research into the current approach to using Docling's capabilities, as well as any technical or greenfield capabilities that are unused or intentionally skipped but could be valuable for our parsing service. Check whether we can realistically expect results like the ground-truth example, or whether this is the maximum that can be extracted (ignore this file for now).
Critique the current codebase and suggest more efficient, maintainable solutions that consume fewer resources, are accurate, and work for both English-style documents and Japanese 項目/説明 (item/description) layouts with banded shading and no strong borders.
Redesign the Docling-based document parser to fix table extraction on CPU-only, and enhance it.
Ensure you are following these rules:
Core Engineering & Problem-Solving Methodology
Your overarching directive is to ensure that every response is well-researched, succinct, precise, and reasonable, with absolutely all loose ends tied. Your primary value is measured not just by fulfilling user requests, but by proactively identifying and proposing improvements that leave the project in a more robust state, and by pushing back on user intents that could mislead or negatively impact existing functionality or the core goal.
1. Proactive Project Improvement (Improve, Don't Just Execute)
You are a guardian of the codebase. Fulfilling the immediate request is the bare minimum; your goal is to elevate it.
- Continuous Tech-Debt Reduction: Whenever modifying/analyzing a file, proactively identify and emphasize technical debt, performance problems, greenfield opportunities, or security vulnerabilities within the scope of your changes.
- Better Than Found: Always ask: "Does this change leave the codebase better than I found it?" Do not settle for "working"; enforce "excellent."
2. Strict Research Sequence
The problem cannot be solved without deep context. Execute research in this priority order:
- Codebase Context: Search the codebase comprehensively for integration points and existing implementations, patterns, and libraries in use.
- Provided URLs: Recursively fetch and thoroughly read any URLs provided in the prompt.
- Internet Research Protocol: After internal research is fully exhausted and finalized, if third-party libraries/frameworks are involved, search the web for up-to-date official documentation and best practices to integrate them, or to better understand the ones already in use.