feat: Add option to emit compact tables (no column padding) for token-efficient output (RAG) #359

@shkarupa-alex

What is the proposed feature?

Problem

html-to-markdown renders GFM tables with column padding so cells are visually aligned in the source Markdown:

| name      | a   | b    |
| --------- | --- | ---- |
| short     | 1   | 2    |

This is great for human readability but wasteful when the Markdown is consumed by an LLM (RAG pipelines, embedding-based retrieval, prompt construction). The padding is pure whitespace that:

  • inflates token count significantly: on our sample (pricing plan-comparison pages), we measured padding at ~57% of the rendered article body;
  • pollutes embeddings (whitespace runs become part of the chunk);
  • raises inference cost with no semantic benefit, since the rendered HTML is identical for both forms.

Proposed feature

A new option, something like compact_tables=True (or a more general table_style="compact" | "padded"), that emits the same table without column padding:

| name | a | b |
| --- | --- | --- |
| short | 1 | 2 |

Requirements

  • valid GFM (renders identically in any GFM-compatible viewer);
  • alignment markers preserved: :-: (center), --: (right), --- (default/left);
  • cell content untouched (only inter-cell whitespace collapsed);
  • separator line uses the minimum 3 dashes per column.
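
The requirements above can be sketched as a small standard-library post-processor. This is only an illustration of the intended output, not html-to-markdown's implementation or API: the function name compact_tables and its helpers are hypothetical. It assumes tables use pipes on every row, treats every "\|" as an escaped pipe, and tracks fenced code blocks naively. To honor the 3-dash rule it emits :---: and ---: for centered and right-aligned columns.

```python
import re

# GFM splits table cells on "|" unless the pipe is escaped as "\|".
_CELL_SPLIT = re.compile(r"(?<!\\)\|")
# A single delimiter cell such as "---", ":---", "---:" or ":---:".
_DELIM_CELL = re.compile(r"^:?-+:?$")
# Opening or closing marker of a fenced code block (tracked naively).
_FENCE = re.compile(r"^\s*(```|~~~)")


def _split_row(line):
    """Return the cells of a pipe-delimited row with outer whitespace removed."""
    body = line.strip()
    if body.startswith("|"):
        body = body[1:]
    if body.endswith("|") and not body.endswith("\\|"):
        body = body[:-1]
    return [cell.strip() for cell in _CELL_SPLIT.split(body)]


def _is_delimiter_row(line):
    """True for rows like "| --- | :-: |" that separate header and body."""
    return "|" in line and all(_DELIM_CELL.match(c) for c in _split_row(line))


def _compact_delimiter(cell):
    """Shrink a delimiter cell to 3 dashes, keeping its alignment colons."""
    left = ":" if cell.startswith(":") else ""
    right = ":" if cell.endswith(":") else ""
    return f"{left}---{right}"


def _rebuild(line, cells):
    """Re-join cells with single spaces, keeping the row's indentation."""
    indent = line[: len(line) - len(line.lstrip())]
    return indent + "| " + " | ".join(cells) + " |"


def compact_tables(markdown):
    """Remove column padding from GFM tables; leave all other lines as-is."""
    lines = markdown.split("\n")
    out = []
    in_fence = False
    i = 0
    while i < len(lines):
        line = lines[i]
        if _FENCE.match(line):
            in_fence = not in_fence
        # A table starts with a header row followed by a delimiter row.
        starts_table = (
            not in_fence
            and "|" in line
            and i + 1 < len(lines)
            and _is_delimiter_row(lines[i + 1])
        )
        if not starts_table:
            out.append(line)
            i += 1
            continue
        delimiter = lines[i + 1]
        out.append(_rebuild(line, _split_row(line)))
        out.append(
            _rebuild(delimiter, [_compact_delimiter(c) for c in _split_row(delimiter)])
        )
        i += 2
        # Body rows continue until a blank line or a line without a pipe.
        while i < len(lines) and lines[i].strip() and "|" in lines[i]:
            out.append(_rebuild(lines[i], _split_row(lines[i])))
            i += 1
    return "\n".join(out)
```

Comparing len(compact_tables(md)) with len(md) gives a rough character-level estimate of the padding share. Actual token savings depend on the tokenizer.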

Why would this be a good addition?

Current workaround

We post-process the output by re-parsing it through mistletoe with a custom MarkdownRenderer that overrides calculate_table_column_widths, table_separator_line_to_text, and table_row_to_line to skip padding. It works but adds a second Markdown round-trip and an extra dependency just to strip whitespace.

Use case

Document-ingestion pipelines for RAG / LLM applications where token cost and chunking quality matter more than source-side readability.

Metadata

Labels: enhancement (New feature or request)

Status: Done
