Is it possible to include the page headers from a docx file in the text_content? #1391
Some of my docx files have page headers. When I process them with markitdown, I get the body content of the docx file but the page header is missing. Is there any way to ask markitdown to include the page header in the returned text_content? Thank you.
Replies: 1 comment 3 replies
markitdown currently extracts the main document body text but does not parse headers or footers from DOCX files. Right now there isn’t a built-in option to include them. If you need header/footer text, you’d have to extend the parser or preprocess the file with a separate library. So the short answer is: no, not with the current markitdown functionality; you’d need to handle headers explicitly outside of it.
As far as I know, Microsoft doesn’t provide a ready-made library that converts DOCX to Markdown while also including headers and footers. The closest is the Open XML SDK, which gives full access to all parts of a DOCX (including headers/footers), but you’d need to build the Markdown conversion layer yourself. Outside of Microsoft tools, Pandoc generally produces better Markdown output, though as far as I know its DOCX reader works on the document body, so verify that headers/footers actually appear in its output before relying on it. If you’re already in Python, python-docx is another option for extracting everything and then formatting to Markdown manually.
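
To make the "handle headers outside of markitdown" idea concrete, here is a minimal sketch that uses only the Python standard library. A DOCX file is a ZIP archive, and header/footer text usually lives in parts named like `word/header1.xml` and `word/footer1.xml`. The function name `extract_headers_footers` is hypothetical, and the code assumes the conventional part names and the transitional WordprocessingML namespace (Strict OOXML files use a different one). It ignores which section or page type each header belongs to, so treat it as a starting point rather than a complete solution.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# Transitional WordprocessingML namespace (assumption: not a Strict OOXML file).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# Conventional part names; the authoritative mapping lives in the .rels files.
PART_RE = re.compile(r"word/(header|footer)\d*\.xml")


def extract_headers_footers(docx_path):
    """Return {'header': [...], 'footer': [...]}, one string per non-empty paragraph."""
    found = {"header": [], "footer": []}
    with zipfile.ZipFile(docx_path) as archive:
        for name in sorted(archive.namelist()):
            match = PART_RE.fullmatch(name)
            if match is None:
                continue
            kind = match.group(1)
            root = ET.fromstring(archive.read(name))
            for para in root.iter(f"{{{W_NS}}}p"):
                # A paragraph's text is split across runs; join the <w:t> pieces.
                text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t")).strip()
                # Skip empty paragraphs and text already seen in another part.
                if text and text not in found[kind]:
                    found[kind].append(text)
    return found
```

Calling `extract_headers_footers("report.docx")` (the file name is a placeholder) returns the header and footer paragraphs, which you could then prepend or append to the `text_content` that markitdown returns for the same file.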