Is it possible to include the page headers from a docx file in the text_content? #1391

davidrendall · 2025-08-05T16:57:57Z

Some of my docx files have page headers. When I process them with markitdown, I get the body content of the docx file but the page header is missing. Is there any way to ask markitdown to include the page header in the returned text_content? Thank you.

alishanawer · 2025-09-19T11:39:23Z

markitdown currently extracts the main document body text but does not parse headers or footers from .docx files. In the OpenXML structure, headers and footers are stored separately from the main document.xml part, which is why they don’t show up in the text_content.
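To make the separate storage concrete, here is a minimal sketch using only the Python standard library. It assumes the header and footer parts follow the conventional `word/header1.xml` / `word/footer1.xml` naming that Word produces; strictly speaking, the part names are defined by the package relationships, so a fully robust reader would resolve them from there. The function name `read_header_footer_parts` is hypothetical.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by document, header and footer parts
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# Assumption: parts are named word/headerN.xml and word/footerN.xml
PART_RE = re.compile(r"word/(header|footer)\d*\.xml")


def read_header_footer_parts(docx_file) -> dict[str, list[str]]:
    """Map each header/footer part name to its non-empty paragraph texts."""
    result: dict[str, list[str]] = {}
    with zipfile.ZipFile(docx_file) as zf:
        for name in sorted(zf.namelist()):
            if not PART_RE.fullmatch(name):
                continue  # skips word/document.xml and everything else
            root = ET.fromstring(zf.read(name))
            paragraphs = []
            for p in root.iter(f"{{{W_NS}}}p"):
                # A paragraph's text is split across one or more w:t runs
                text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t")).strip()
                if text:
                    paragraphs.append(text)
            result[name] = paragraphs
    return result
```

Calling `read_header_footer_parts("report.docx")` (the file name is a placeholder) returns a dictionary keyed by part name, which shows that the header text lives outside `word/document.xml`.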

Right now there isn’t a built-in option to include them. If you need header/footer text, you’d have to extend the parser or preprocess the file with a library like python-docx to extract those parts and then feed them into your pipeline.
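A rough sketch of that preprocessing route, assuming python-docx is installed (`pip install python-docx`). The helper names `collect_header_footer_text` and `merge_with_body` are made up for illustration, and this only reads the primary header/footer paragraphs of each section; text inside header tables, first-page headers, or even-page headers is not covered.

```python
def collect_header_footer_text(path: str) -> tuple[list[str], list[str]]:
    """Return (header_lines, footer_lines) found in a .docx file."""
    # Imported lazily because python-docx is a third-party dependency
    from docx import Document

    doc = Document(path)
    headers: list[str] = []
    footers: list[str] = []
    for section in doc.sections:
        for part, out in ((section.header, headers), (section.footer, footers)):
            # A linked header/footer has no definition of its own, so skip it
            if part.is_linked_to_previous:
                continue
            for para in part.paragraphs:
                text = para.text.strip()
                if text and text not in out:  # drop blanks and repeats
                    out.append(text)
    return headers, footers


def merge_with_body(body: str, headers: list[str], footers: list[str]) -> str:
    """Place header text before and footer text after the converted body."""
    parts = []
    if headers:
        parts.append("\n".join(headers))
    parts.append(body.strip())
    if footers:
        parts.append("\n".join(footers))
    return "\n\n".join(parts)
```

The body would come from markitdown as usual, for example `MarkItDown().convert(path).text_content`, and then be passed to `merge_with_body` together with the two lists. Where the header text should go in the final Markdown is a design choice; putting it once at the top is just the simplest option.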

So the short answer is: no, not with the current markitdown functionality. You’d need to handle headers explicitly outside of it.

3 replies

davidrendall Sep 19, 2025
Author

Thank you for your reply, I appreciate it. I ended up using a combination of markitdown and document intelligence, but the latter has various other problems with the output which MS support told me will not be fixed. Are you aware of any other Microsoft library which does a more comprehensive job of converting docx to markdown? It seems surprising that they would not have a good tool for this.

alishanawer Sep 20, 2025

As far as I know, Microsoft doesn’t provide a ready-made library that converts DOCX to Markdown while also including headers and footers. The closest is the Open XML SDK, which gives full access to all parts of a DOCX (including headers/footers), but you’d need to build the Markdown conversion layer yourself. Outside of Microsoft tools, the most reliable option is Pandoc, which does handle headers/footers and generally produces better Markdown output. If you’re already in Python, python-docx is another option for extracting everything and then formatting to Markdown manually.

Answer selected by davidrendall

davidrendall Sep 22, 2025
Author

Thank you very much for the advice, I really appreciate you taking the time. I've previously used python-docx but was not completely happy with the results. I'll have a look at the Open XML SDK next.
