feat: Infer hierarchical heading levels (H1-H4) for PDFs#4222
Angel98518 wants to merge 3 commits into Unstructured-IO:main from
Hi @badGarnet, can you review my PR please?
@badGarnet Please review my PR.
Awesome work dude! And I'm curious, is there any reason it's limited to H1-H4, rather than H1-H6?
The H1-H4 limit follows the issue title (#4204), which requested "H1, H2, H3, H4". The code can be extended to H6 if you want.
Oh okay, makes sense. I just named those specifically so that it was easier for people to search for. I think supporting up to H6 will probably help cover as many use cases as possible.
OK, I will update the code.
Hi @codebymikey, I updated the code to support H1-H6. Please check. Thank you for your review.
    # Normalize level to 1-4 range
    normalized_level = min(max(level + 1, 1), 4)
    outline_map[title.lower()] = normalized_level
    outline_map = {}
This comment was marked as resolved.
Hi @codebymikey, please comment if you have any other feedback.
Nope, all done. Probably just need to be rebased with upstream, and wait for a maintainer to review. Thanks again for implementing!
Thank you for your review.
Implements issue Unstructured-IO#4204: Add support for inferring hierarchical heading/title levels (H1, H2, H3, H4) for PDF documents.
Features:
- Extract PDF outline/bookmarks to determine heading hierarchy
- Analyze font sizes as fallback method for hierarchy detection
- Assign heading_level metadata (1-4) to Title elements
- Support fuzzy text matching for outline entries
- Works with all PDF partition strategies (HI_RES, FAST, OCR_ONLY)
Changes:
- Add heading_level field to ElementMetadata
- Create pdf_hierarchy.py module with hierarchy detection functions
- Integrate hierarchy detection into PDF partitioner
- Add comprehensive test suite
Total: 557 lines added across 4 files
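For readers following the outline-based path, here is a minimal sketch of how outline entries might be turned into heading levels and matched against Title text with fuzzy matching. It is not the PR's code: the function names `build_outline_map` and `match_heading_level`, the `(depth, title)` input shape, the use of `difflib`, and the 0.8 similarity threshold are all assumptions made for illustration. Only the clamping expression mirrors the snippet quoted in the review thread.

```python
from difflib import SequenceMatcher


def build_outline_map(outline, max_level=6):
    """Map lowercased outline titles to heading levels clamped to 1..max_level.

    `outline` is assumed to be a list of (depth, title) pairs with 0-based depth.
    """
    outline_map = {}
    for depth, title in outline:
        # Same clamping idea as the reviewed snippet: shift to 1-based, then bound.
        outline_map[title.strip().lower()] = min(max(depth + 1, 1), max_level)
    return outline_map


def match_heading_level(text, outline_map, threshold=0.8):
    """Return the level of the best-matching outline title, or None if no match."""
    key = text.strip().lower()
    if key in outline_map:  # exact match first
        return outline_map[key]
    best_level, best_score = None, threshold
    for title, level in outline_map.items():
        # Hypothetical fuzzy step: similarity ratio between the two strings.
        score = SequenceMatcher(None, key, title).ratio()
        if score >= best_score:
            best_level, best_score = level, score
    return best_level
```

A Title whose text differs slightly from the bookmark (for example an OCR typo) would still receive the bookmark's level, while unrelated text returns `None` and can fall through to the font-size fallback.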
Update implementation to support heading levels 1-6 (H1-H6) instead of limiting to H1-H4. This provides better coverage for documents with deeper heading hierarchies.
Changes:
- Update level normalization to cap at 6 instead of 4
- Update percentile-based assignment to distribute across 6 levels
- Update all documentation and comments to reflect H1-H6
- Update tests to validate H1-H6 range
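To make the font-size fallback concrete, the sketch below shows one way sizes could be spread across six levels. The commit message only says the assignment is percentile-based; the exact rule here is an assumption: the hypothetical `levels_from_font_sizes` helper ranks distinct sizes directly when there are at most `max_level` of them, and otherwise buckets them by rank percentile.

```python
def levels_from_font_sizes(font_sizes, max_level=6):
    """Assign a heading level in 1..max_level to each font size.

    Larger fonts get smaller level numbers (H1 is the largest font).
    """
    distinct = sorted(set(font_sizes), reverse=True)
    if not distinct:
        return []
    n = len(distinct)
    rank = {size: i for i, size in enumerate(distinct)}  # 0 = largest font

    levels = []
    for size in font_sizes:
        if n <= max_level:
            # Few distinct sizes: use the dense rank so levels have no gaps.
            level = rank[size] + 1
        else:
            # Many distinct sizes: bucket by rank percentile (integer arithmetic).
            level = rank[size] * max_level // n + 1
        levels.append(level)
    return levels
```

With integer bucketing the result never exceeds `max_level`, so no separate clamp is needed in this sketch.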
43db051 to 654ce92
Hi @codebymikey, when can a maintainer review this PR?
Not sure, as I'm not a maintainer. But based on the current activity in the project, it probably shouldn't take more than a couple of days to get a review.
Thank you
Description
Implements issue #4204: Add support for inferring hierarchical heading/title levels (H1, H2, H3, H4) for PDF documents.
Features
- Assign heading_level metadata (1-4) to Title elements
Implementation Details
New Files
- unstructured/partition/pdf_hierarchy.py (356 lines): Core hierarchy detection module
  - extract_pdf_outline(): Extracts PDF bookmarks/outline structure
  - extract_font_info_from_layout_element(): Extracts font information from PDFMiner layout
  - infer_heading_levels_from_outline(): Assigns levels based on PDF outline
  - infer_heading_levels_from_font_sizes(): Assigns levels based on font size analysis
  - infer_heading_levels(): Main integration function
- test_unstructured/partition/test_pdf_hierarchy.py (144 lines): Comprehensive test suite
Modified Files
- unstructured/documents/elements.py: Added heading_level field to ElementMetadata
- unstructured/partition/pdf.py: Integrated hierarchy detection into PDF partitioner
Usage
Title elements in PDFs will now have a heading_level metadata field (1-4) indicating their hierarchical level.
Testing
Changes Summary
Fixes #4204
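As an illustration of how downstream code might consume the new field, the sketch below renders an indented outline from serialized elements. It assumes element dictionaries shaped like the library's `to_dict()` output (`type`, `text`, `metadata`) with the `heading_level` key this PR proposes. The `render_outline` helper is hypothetical and not part of the PR.

```python
def render_outline(elements, indent="  "):
    """Build an indented text outline from Title elements that carry a heading_level."""
    lines = []
    for el in elements:
        if el.get("type") != "Title":
            continue  # only titles take part in the hierarchy
        level = el.get("metadata", {}).get("heading_level")
        if level is None:
            continue  # title without an inferred level
        lines.append(f"{indent * (level - 1)}H{level} {el['text']}")
    return "\n".join(lines)


# Hand-written sample data standing in for partition output.
sample = [
    {"type": "Title", "text": "Annual Report", "metadata": {"heading_level": 1}},
    {"type": "NarrativeText", "text": "Some body text.", "metadata": {}},
    {"type": "Title", "text": "Methods", "metadata": {"heading_level": 2}},
]
print(render_outline(sample))
```

Elements without a level are skipped rather than guessed, so the outline only reflects what the hierarchy detection actually assigned.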