feat: Infer hierarchical heading levels (H1-H4) for PDFs by Angel98518 · Pull Request #4222 · Unstructured-IO/unstructured

Angel98518 · 2026-02-02T22:28:20Z

Description

Implements issue #4204: Add support for inferring hierarchical heading/title levels (H1, H2, H3, H4) for PDF documents.

Features

PDF Outline Extraction: Extracts PDF bookmarks/outline structure to determine heading hierarchy
Font Size Analysis: Analyzes font sizes as fallback method for hierarchy detection
Heading Level Assignment: Assigns heading_level metadata (1-4) to Title elements
Fuzzy Text Matching: Supports fuzzy matching for outline entries when exact matches are not found
Multi-Strategy Support: Works with all PDF partition strategies (HI_RES, FAST, OCR_ONLY)
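The fuzzy-matching feature above can be illustrated with a short sketch. The PR description does not say which matching algorithm is used, so this version assumes the standard library's `difflib.SequenceMatcher`; the function name `match_outline_level`, the `outline_map` shape, and the 0.8 threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def match_outline_level(title_text, outline_map, threshold=0.8):
    """Return the outline level for title_text, or None if nothing is close enough.

    outline_map maps lowercased outline titles to heading levels
    (a hypothetical shape chosen for this sketch).
    """
    key = " ".join(title_text.lower().split())  # normalize case and whitespace
    if key in outline_map:  # an exact match always wins
        return outline_map[key]
    best_level, best_score = None, threshold
    for entry, level in outline_map.items():
        score = SequenceMatcher(None, key, entry).ratio()
        if score >= best_score:  # keep the closest entry at or above the threshold
            best_level, best_score = level, score
    return best_level


outline = {"1 introduction": 1, "1.1 background and motivation": 2}
print(match_outline_level("1  Introduction", outline))               # exact after normalization
print(match_outline_level("1.1 Background and Motivaton", outline))  # typo, matched fuzzily
print(match_outline_level("Unrelated text", outline))                # no close entry
```

A threshold keeps unrelated Title elements from inheriting a level by accident; the right value would depend on how noisy the extracted text is.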

Implementation Details

New Files

unstructured/partition/pdf_hierarchy.py (356 lines): Core hierarchy detection module
- extract_pdf_outline(): Extracts PDF bookmarks/outline structure
- extract_font_info_from_layout_element(): Extracts font information from PDFMiner layout
- infer_heading_levels_from_outline(): Assigns levels based on PDF outline
- infer_heading_levels_from_font_sizes(): Assigns levels based on font size analysis
- infer_heading_levels(): Main integration function
test_unstructured/partition/test_pdf_hierarchy.py (144 lines): Comprehensive test suite

Modified Files

unstructured/documents/elements.py: Added heading_level field to ElementMetadata
unstructured/partition/pdf.py: Integrated hierarchy detection into PDF partitioner

Usage

Title elements in PDFs will now have a heading_level metadata field (1-4) indicating their hierarchical level:

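from unstructured.documents.elements import Title
from unstructured.partition.auto import partition

elements = partition("document.pdf")
for element in elements:
    # heading_level is only set on Title elements whose level could be inferred
    if isinstance(element, Title) and element.metadata.heading_level:
        print(f"{element.text}: H{element.metadata.heading_level}")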
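One thing the new field makes possible is rebuilding a table of contents. The sketch below assumes the `(text, level)` pairs were collected from `element.text` and `element.metadata.heading_level` as in the loop above; they are hard-coded here so the example runs without a PDF, and `render_outline` is a hypothetical helper, not part of this PR.

```python
def render_outline(headings):
    """Render (text, level) pairs as an indented outline, one line per heading."""
    lines = []
    for text, level in headings:
        indent = "  " * (level - 1)  # two spaces per nesting level below H1
        lines.append(f"{indent}- {text}")
    return "\n".join(lines)


# Hypothetical data standing in for Title elements from a partitioned PDF.
headings = [("Introduction", 1), ("Background", 2), ("Prior Work", 3), ("Methods", 1)]
print(render_outline(headings))
```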

Testing

Added comprehensive test suite covering:
- PDF outline extraction
- Font size analysis
- Integration with partitioner
- Edge cases and error handling

Changes Summary

Total lines: 557 lines added
Files changed: 4 files (2 new, 2 modified)

Fixes #4204

Angel98518 · 2026-02-03T20:04:25Z

Hi @badGarnet, can you review my PR please?

Angel98518 · 2026-02-04T10:50:00Z

@badGarnet, please review my PR.

codebymikey · 2026-02-05T13:26:53Z

Awesome work dude!

And I'm curious, is there any reason it's limited to H1-H4, rather than H1-H6?

Angel98518 · 2026-02-05T13:34:26Z

The H1–H4 limit follows the issue title (#4204), which requested "H1, H2, H3, H4". The code can be extended to H6 if you want.

codebymikey · 2026-02-05T15:07:23Z

Oh okay, makes sense. I just named those specifically so that it was easier for people to search for.

I think supporting up to H6 will probably help cover as many use cases as possible.

Angel98518 · 2026-02-05T15:28:30Z

OK, I will update the code.

Angel98518 · 2026-02-05T15:36:01Z

Hi @codebymikey, I updated the code to support H1-H6. Please check. Thank you for your review.

unstructured/partition/pdf_hierarchy.py

-        # Normalize level to 1-4 range
-        normalized_level = min(max(level + 1, 1), 4)
-        outline_map[title.lower()] = normalized_level
+        outline_map = {}
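The removed lines in this hunk clamp `level + 1` to the range 1-4, and the follow-up commit message says the cap was raised to 6. A minimal sketch of the same expression with the cap as a parameter follows; the function name and the `max_level` parameter are hypothetical, and the zero-based outline depth is an assumption inferred from the `level + 1` in the diff.

```python
def normalize_outline_level(level, max_level=6):
    """Map an outline depth (assumed zero-based) to a heading level in 1..max_level."""
    # Shift to one-based, then clamp below at 1 and above at max_level.
    return min(max(level + 1, 1), max_level)


for depth in (-2, 0, 3, 9):
    print(depth, "->", normalize_outline_level(depth))
```

Passing `max_level=4` reproduces the original H1-H4 behaviour, so the cap can be changed without touching the clamping logic.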


Angel98518 · 2026-02-05T16:00:16Z

Hi @codebymikey, please comment if you have any other feedback.

codebymikey · 2026-02-05T16:05:13Z

Nope, all done. Probably just need to be rebased with upstream, and wait for a maintainer to review.

Thanks again for implementing!

Angel98518 · 2026-02-05T16:09:59Z

Thank you for your review.

Implements issue Unstructured-IO#4204: Add support for inferring hierarchical heading/title levels (H1, H2, H3, H4) for PDF documents.

Features:
- Extract PDF outline/bookmarks to determine heading hierarchy
- Analyze font sizes as fallback method for hierarchy detection
- Assign heading_level metadata (1-4) to Title elements
- Support fuzzy text matching for outline entries
- Works with all PDF partition strategies (HI_RES, FAST, OCR_ONLY)

Changes:
- Add heading_level field to ElementMetadata
- Create pdf_hierarchy.py module with hierarchy detection functions
- Integrate hierarchy detection into PDF partitioner
- Add comprehensive test suite

Total: 557 lines added across 4 files

Update implementation to support heading levels 1-6 (H1-H6) instead of limiting to H1-H4. This provides better coverage for documents with deeper heading hierarchies.

Changes:
- Update level normalization to cap at 6 instead of 4
- Update percentile-based assignment to distribute across 6 levels
- Update all documentation and comments to reflect H1-H6
- Update tests to validate H1-H6 range
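The commit message mentions percentile-based assignment across six levels but does not give the thresholds. The following is only a sketch of one way such a fallback could work, assuming the font sizes of Title elements are already known; `levels_from_font_sizes` and its rank-based bucketing are hypothetical and may differ from what `infer_heading_levels_from_font_sizes()` does.

```python
def levels_from_font_sizes(font_sizes, max_level=6):
    """Assign heading levels 1..max_level; larger fonts get lower level numbers."""
    distinct = sorted(set(font_sizes), reverse=True)  # largest size first
    if not distinct:
        return []
    if len(distinct) <= max_level:
        # Few distinct sizes: one level per size, in rank order.
        size_to_level = {size: rank + 1 for rank, size in enumerate(distinct)}
    else:
        # Many distinct sizes: bucket each size by its rank percentile
        # into max_level equal-width bands.
        size_to_level = {
            size: min(rank * max_level // len(distinct) + 1, max_level)
            for rank, size in enumerate(distinct)
        }
    return [size_to_level[size] for size in font_sizes]


print(levels_from_font_sizes([24, 18, 18, 12]))
print(levels_from_font_sizes(list(range(12, 0, -1))))
```

Ranking distinct sizes rather than raw values keeps a document with only two or three heading sizes from being spread thinly across all six levels.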

Angel98518 · 2026-02-05T18:23:24Z

Hi @codebymikey, when can a maintainer review this PR?

codebymikey · 2026-02-05T18:59:05Z

Not sure, as I'm not a maintainer.

But based off the current activity in the project, it probably shouldn't take more than a couple days to get some.

Angel98518 · 2026-02-05T19:04:39Z

Thank you

codebymikey reviewed Feb 5, 2026

This comment was marked as resolved.


Contributor added 3 commits February 5, 2026 17:41

fix: Correct indentation in outline_map initialization

654ce92

Angel98518 force-pushed the feat/pdf-hierarchical-headings-4204 branch from 43db051 to 654ce92 Compare February 5, 2026 16:41
