markitdown-ocr: processing PDFs without a text layer generates an MD document containing only page numbers #1863

@liuhuanBamboo

Description

When markitdown-ocr processes PDFs that have no text layer, such as those converted from images or produced by scanning, it generates an MD document containing only page numbers.
The reason is in the file "pdf_converter_with_ocr.py": the code first adds a page header with markdown_content.append(f"\n## Page {page_num}\n"),
but the subsequent logic decides whether to perform full-page OCR based on whether the content is empty.

Because markdown_content already contains the page header, it is never empty, so OCR is skipped and the final MD document contains only page numbers.
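
To make the failure mode concrete, here is a minimal, self-contained sketch of the pattern described above. It is not the actual code from "pdf_converter_with_ocr.py": the function names convert_buggy and convert_fixed, the page_texts list, and the ocr_page callback are all hypothetical stand-ins, and the real converter may be structured differently. The sketch only shows why an emptiness check on markdown_content can never trigger once a header has been appended, and one possible way around it (checking the extracted page text instead of the accumulated list).

```python
from typing import Callable, List


def convert_buggy(page_texts: List[str], ocr_page: Callable[[int], str]) -> str:
    """Hypothetical reproduction: headers are appended before the emptiness check."""
    markdown_content: List[str] = []
    for page_num, text in enumerate(page_texts, start=1):
        markdown_content.append(f"\n## Page {page_num}\n")
        if text.strip():
            markdown_content.append(text.strip())
    # The list already holds the page headers, so this branch never runs
    # for a scanned PDF, and OCR is silently skipped.
    if not markdown_content:
        for page_num in range(1, len(page_texts) + 1):
            markdown_content.append(ocr_page(page_num))
    return "\n".join(markdown_content).strip()


def convert_fixed(page_texts: List[str], ocr_page: Callable[[int], str]) -> str:
    """Hypothetical fix: decide on OCR from the extracted text, not the output list."""
    markdown_content: List[str] = []
    for page_num, text in enumerate(page_texts, start=1):
        body = text.strip()
        if not body:
            # No text layer on this page, so fall back to full-page OCR.
            body = ocr_page(page_num).strip()
        markdown_content.append(f"\n## Page {page_num}\n")
        if body:
            markdown_content.append(body)
    return "\n".join(markdown_content).strip()


if __name__ == "__main__":
    scanned_pages = ["", ""]  # two pages with no extractable text
    fake_ocr = lambda n: f"ocr text {n}"  # stand-in for a real OCR call
    print(convert_buggy(scanned_pages, fake_ocr))
    print("-----")
    print(convert_fixed(scanned_pages, fake_ocr))
```

With two empty pages as input, the first function returns only the page headers, while the second includes the OCR text under each header. An equivalent fix would be to keep a separate flag or buffer for extracted text and test that before deciding on OCR.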
