Optimize MHTML parser: 1.5x faster on large files#1928

Draft
sweenzor wants to merge 1 commit into gildas-lormeau:master from sweenzor:mhtml-parser-perf
Conversation

@sweenzor sweenzor commented Mar 21, 2026

Speeds up MHTML parsing by roughly 1.5x on large files via three changes:

  • parse.js — Replace splice(len, 0, ...next) with a push() loop; use length truncation for soft line breaks.
  • util.js decodeBinary — Build string in 8KB chunks via String.fromCharCode.apply() instead of char-by-char.
  • util.js decodeBase64 — Direct Uint8Array loop instead of atob().split("").map(). Added try/catch for malformed input.
| Size  | Before | After  | Speedup |
| ----- | ------ | ------ | ------- |
| 10KB  | 0.51ms | 0.37ms | 1.4x    |
| 1MB   | 63.2ms | 43.1ms | 1.5x    |
| 10MB  | 572ms  | 423ms  | 1.4x    |
| 100MB | 6063ms | 3982ms | 1.5x    |

To reproduce, save a page as MHTML from Chrome, then:

// node bench.mjs path/to/file.mhtml
import { readFileSync } from "fs";
import { parse } from "./src/lib/mhtml-to-html/parse.js";
const mhtml = new Uint8Array(readFileSync(process.argv[2]));
const runs = 10, times = [];
for (let i = 0; i < runs; i++) {
  const start = performance.now();
  parse(mhtml);
  times.push(performance.now() - start);
}
times.sort((a, b) => a - b);
console.log(`Median: ${times[Math.floor(runs / 2)].toFixed(1)}ms`);

Three targeted performance fixes in the JS MHTML parser hot path:

1. parse.js: Replace splice(...spread) with push loop for byte
   accumulation. The old `resource.data.splice(len, 0, ...next)`
   spread a Uint8Array into individual arguments on every line of
   MHTML content — O(n) per call in a tight loop over thousands of
   lines. A simple `push(next[i])` loop avoids the spread overhead
   entirely. Truncation via `data.length -= N` replaces splice for
   quoted-printable soft line break removal.
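The change described above can be sketched as follows. This is a minimal illustration of the technique, not the actual parse.js code; `appendBytes` and `truncateEnd` are hypothetical helper names:

```javascript
// Accumulate decoded bytes with a push() loop instead of spreading a
// Uint8Array into splice() arguments.
function appendBytes(data, next) {
    // Old (slow): data.splice(data.length, 0, ...next) — spreads every
    // byte into a separate call argument, paying O(next.length) argument
    // overhead on each line of content.
    for (let i = 0; i < next.length; i++) {
        data.push(next[i]);
    }
    return data;
}

// Drop trailing bytes (e.g. a quoted-printable "=" soft line break marker)
// by shrinking the array length in place instead of calling splice(),
// which would allocate and return an array of the removed elements.
function truncateEnd(data, count) {
    data.length -= count;
    return data;
}
```

Assigning to `Array.prototype.length` is a standard way to shrink a plain array in place without any intermediate allocation.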

2. util.js decodeBinary: Replace character-by-character string
   concatenation (`data += String.fromCharCode(byte)`) with chunked
   `String.fromCharCode.apply(null, chunk)` joined at the end. The
   old approach was O(n²) due to string immutability; each `+=`
   allocated a new string. Chunks of 8192 bytes stay within the
   call stack limit for `apply`.
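A sketch of the chunked approach, assuming the decoder receives a `Uint8Array` and returns a binary string; names are illustrative rather than the actual util.js implementation:

```javascript
// Engines cap the number of arguments a single call can receive, so
// apply() over the whole buffer would throw on large inputs; 8192-byte
// chunks stay comfortably under that limit.
const CHUNK_SIZE = 8192;

function decodeBinary(bytes) {
    const parts = [];
    for (let offset = 0; offset < bytes.length; offset += CHUNK_SIZE) {
        // apply() passes each byte of the chunk as one charCode argument,
        // producing the chunk's string in a single call instead of one
        // `+=` concatenation per byte.
        const chunk = bytes.subarray(offset, offset + CHUNK_SIZE);
        parts.push(String.fromCharCode.apply(null, chunk));
    }
    // One join at the end avoids the O(n²) repeated-concatenation cost.
    return parts.join("");
}
```

`subarray()` returns a view over the same buffer, so the chunking itself copies no bytes.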

3. util.js decodeBase64: Replace `atob(v).split("").map(c =>
   c.charCodeAt(0))` with a pre-allocated Uint8Array filled via a
   direct for-loop. The old approach created two intermediate arrays
   (one from split, one from map) that were immediately discarded.
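The replacement can be sketched like this, including the try/catch for malformed input mentioned in the summary; this is an illustrative version, not the literal util.js code:

```javascript
function decodeBase64(value) {
    try {
        const binary = atob(value);
        // Pre-allocate the exact output size; no intermediate arrays from
        // split("") or map() are created.
        const bytes = new Uint8Array(binary.length);
        for (let i = 0; i < binary.length; i++) {
            bytes[i] = binary.charCodeAt(i);
        }
        return bytes;
    } catch (_error) {
        // atob() throws on malformed base64; degrade to empty data
        // instead of aborting the whole parse.
        return new Uint8Array(0);
    }
}
```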

Benchmarked on synthetic MHTML fixtures (before → after):
  - 10KB:   0.51ms → 0.37ms (1.4x)
  - 1MB:   63.2ms → 43.1ms (1.5x)
  - 10MB:   572ms →  423ms (1.4x)
  - 100MB: 6063ms → 3982ms (1.5x)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>