feat: Infer hierarchical heading levels (H1-H4) for PDFs #4222

Open
Angel98518 wants to merge 3 commits into Unstructured-IO:main from Angel98518:feat/pdf-hierarchical-headings-4204

Conversation

@Angel98518

Description

Implements issue #4204: Add support for inferring hierarchical heading/title levels (H1, H2, H3, H4) for PDF documents.

Features

  • PDF Outline Extraction: Extracts PDF bookmarks/outline structure to determine heading hierarchy
  • Font Size Analysis: Analyzes font sizes as fallback method for hierarchy detection
  • Heading Level Assignment: Assigns heading_level metadata (1-4) to Title elements
  • Fuzzy Text Matching: Supports fuzzy matching for outline entries when exact matches are not found
  • Multi-Strategy Support: Works with all PDF partition strategies (HI_RES, FAST, OCR_ONLY)
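
The fuzzy-matching step listed above could look roughly like the sketch below. The PR text does not show its matching logic, so the function name `match_outline_level`, the `threshold` parameter, and the use of `difflib.SequenceMatcher` are all assumptions made for illustration. Only the idea (try an exact match first, then accept the closest outline entry if it is similar enough) comes from the description.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional


def match_outline_level(
    title_text: str,
    outline_map: Dict[str, int],
    threshold: float = 0.8,  # hypothetical cutoff, not taken from the PR
) -> Optional[int]:
    """Return the outline level for a title, or None if no entry is close enough.

    `outline_map` maps lower-cased outline titles to heading levels.
    """
    # Normalize case and collapse runs of whitespace before comparing.
    key = " ".join(title_text.lower().split())
    if key in outline_map:  # an exact match always wins
        return outline_map[key]

    best_level, best_score = None, 0.0
    for candidate, level in outline_map.items():
        score = SequenceMatcher(None, key, candidate).ratio()
        if score > best_score:
            best_level, best_score = level, score
    return best_level if best_score >= threshold else None


outline = {"1. introduction": 1, "1.1 background": 2}
print(match_outline_level("1. Introduction", outline))  # exact match
print(match_outline_level("1.1  Backgroud", outline))   # typo, still close
print(match_outline_level("Unrelated text", outline))   # below threshold
```

A higher threshold reduces false matches on short titles at the cost of missing titles that OCR has mangled more heavily.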

Implementation Details

New Files

  • unstructured/partition/pdf_hierarchy.py (356 lines): Core hierarchy detection module

    • extract_pdf_outline(): Extracts PDF bookmarks/outline structure
    • extract_font_info_from_layout_element(): Extracts font information from PDFMiner layout
    • infer_heading_levels_from_outline(): Assigns levels based on PDF outline
    • infer_heading_levels_from_font_sizes(): Assigns levels based on font size analysis
    • infer_heading_levels(): Main integration function
  • test_unstructured/partition/test_pdf_hierarchy.py (144 lines): Comprehensive test suite
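
How the functions above fit together is not spelled out in the description. As a rough sketch of the "outline first, font size as fallback" idea, the code below uses a hypothetical `Heading` dataclass in place of a real Title element and a hypothetical `assign_levels` helper. It assumes the fallback is applied per document (only when no outline exists), which the PR does not confirm.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Heading:
    """Hypothetical stand-in for a Title element; not a class from the PR."""
    text: str
    font_size: float
    heading_level: Optional[int] = None


def assign_levels(
    headings: List[Heading],
    outline_map: Dict[str, int],
    max_level: int = 6,
) -> None:
    """Fill in heading_level in place: outline if available, else font size."""
    if outline_map:
        # Outline available: look each title up; unmatched titles stay None.
        for h in headings:
            h.heading_level = outline_map.get(h.text.strip().lower())
        return

    # No outline: rank distinct font sizes, largest first, capped at max_level.
    sizes = sorted({h.font_size for h in headings}, reverse=True)
    size_rank = {size: min(rank, max_level) for rank, size in enumerate(sizes, start=1)}
    for h in headings:
        h.heading_level = size_rank[h.font_size]


doc = [Heading("Intro", 18.0), Heading("Background", 14.0), Heading("Details", 14.0)]
assign_levels(doc, outline_map={})
print([(h.text, h.heading_level) for h in doc])
```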

Modified Files

  • unstructured/documents/elements.py: Added heading_level field to ElementMetadata
  • unstructured/partition/pdf.py: Integrated hierarchy detection into PDF partitioner

Usage

Title elements in PDFs will now have a heading_level metadata field (1-4) indicating their hierarchical level:

from unstructured.documents.elements import Title
from unstructured.partition.auto import partition

elements = partition(filename="document.pdf")
for element in elements:
    # Skip titles for which no level was inferred
    if isinstance(element, Title) and element.metadata.heading_level:
        print(f"{element.text}: H{element.metadata.heading_level}")

Testing

  • Added comprehensive test suite covering:
    • PDF outline extraction
    • Font size analysis
    • Integration with partitioner
    • Edge cases and error handling

Changes Summary

  • Total: 557 lines added
  • Files changed: 4 files (2 new, 2 modified)

Fixes #4204

@Angel98518
Author

Hi @badGarnet, could you review my PR, please?

@Angel98518
Author

@badGarnet, please review my PR.

@codebymikey

Awesome work dude!

And I'm curious, is there any reason it's limited to H1-H4, rather than H1-H6?

@Angel98518
Author

Awesome work dude!

And I'm curious, is there any reason it's limited to H1-H4, rather than H1-H6?

The H1–H4 limit follows the issue title (#4204), which requested "H1, H2, H3, H4". The code can be extended to H6 if you want.

@codebymikey

Oh okay, makes sense. I just named those specifically so that it was easier for people to search for.

I think supporting up to H6 will probably help cover as many use cases as possible.

@Angel98518
Author

Oh okay, makes sense. I just named those specifically so that it was easier for people to search for.

I think supporting up to H6 will probably help cover as many use cases as possible.

OK, I will update the code.

@Angel98518
Author

Hi @codebymikey, I updated the code to support H1-H6. Please check. Thank you for your review.

# Normalize level to 1-4 range
normalized_level = min(max(level + 1, 1), 4)
outline_map[title.lower()] = normalized_level
outline_map = {}

This comment was marked as resolved.
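
For reference, the clamp in the snippet above generalizes to any maximum depth. The sketch below assumes that `level` is a zero-based outline depth (inferred from the `level + 1` in the snippet); the function name `normalize_outline_level` and the `max_level` parameter are hypothetical and not taken from the PR.

```python
def normalize_outline_level(depth: int, max_level: int = 6) -> int:
    """Map a zero-based outline depth to a heading level in [1, max_level].

    Depths deeper than max_level - 1 are capped at max_level;
    negative depths are raised to 1.
    """
    return min(max(depth + 1, 1), max_level)


for depth in (0, 3, 5, 9):
    print(depth, "->", normalize_outline_level(depth))
```

Passing `max_level=4` reproduces the original H1-H4 behavior, so the cap can be changed in one place.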

@Angel98518
Author

Hi @codebymikey, please comment if you have any other feedback.

@codebymikey

Nope, all done. Probably just need to be rebased with upstream, and wait for a maintainer to review.

Thanks again for implementing!

@Angel98518
Author

Nope, all done. Probably just need to be rebased with upstream, and wait for a maintainer to review.

Thanks again for implementing!

Thank you for your review.

Contributor added 3 commits February 5, 2026 17:41
Implements issue Unstructured-IO#4204: Add support for inferring hierarchical heading/title
levels (H1, H2, H3, H4) for PDF documents.

Features:
- Extract PDF outline/bookmarks to determine heading hierarchy
- Analyze font sizes as fallback method for hierarchy detection
- Assign heading_level metadata (1-4) to Title elements
- Support fuzzy text matching for outline entries
- Works with all PDF partition strategies (HI_RES, FAST, OCR_ONLY)

Changes:
- Add heading_level field to ElementMetadata
- Create pdf_hierarchy.py module with hierarchy detection functions
- Integrate hierarchy detection into PDF partitioner
- Add comprehensive test suite

Total: 557 lines added across 4 files

Update implementation to support heading levels 1-6 (H1-H6) instead of
limiting to H1-H4. This provides better coverage for documents with
deeper heading hierarchies.

Changes:
- Update level normalization to cap at 6 instead of 4
- Update percentile-based assignment to distribute across 6 levels
- Update all documentation and comments to reflect H1-H6
- Update tests to validate H1-H6 range
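
The commit message above mentions a percentile-based assignment spread across six levels but does not show it. One plausible reading is sketched here: rank the distinct font sizes from largest to smallest and map each rank proportionally onto the available levels. The function name `levels_by_percentile` and the exact bucketing rule are assumptions, not the PR's code.

```python
from typing import Dict, List


def levels_by_percentile(font_sizes: List[float], max_level: int = 6) -> Dict[float, int]:
    """Map each distinct font size to a heading level by its rank position.

    The largest size gets level 1. When there are more distinct sizes than
    levels, neighboring sizes share a level so the result stays in 1..max_level.
    """
    distinct = sorted(set(font_sizes), reverse=True)
    if not distinct:
        return {}
    if len(distinct) == 1:
        return {distinct[0]: 1}

    used_levels = min(max_level, len(distinct))
    last_rank = len(distinct) - 1
    # Integer arithmetic keeps the bucketing deterministic (no float rounding).
    return {
        size: 1 + (rank * (used_levels - 1)) // last_rank
        for rank, size in enumerate(distinct)
    }


print(levels_by_percentile([24, 18, 18, 12]))
print(levels_by_percentile(list(range(10, 21))))
```

With three distinct sizes the mapping is simply rank order; with eleven distinct sizes they are spread over all six levels, the largest at level 1 and the smallest at level 6.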

@Angel98518 force-pushed the feat/pdf-hierarchical-headings-4204 branch from 43db051 to 654ce92 on February 5, 2026 16:41

@Angel98518
Author

Hi @codebymikey, when can a maintainer review this PR?

@codebymikey

Not sure, as I'm not a maintainer.

But based off the current activity in the project, it probably shouldn't take more than a couple days to get some.

@Angel98518
Author

Not sure, as I'm not a maintainer.

But based off the current activity in the project, it probably shouldn't take more than a couple days to get some.

Thank you


Development

Successfully merging this pull request may close these issues.

feat/Infer the hierarchical heading/title levels such as H1, H2, H3, H4 for PDFs

2 participants