fix(cli): reconfigure stdout to UTF-8 so redirected output is lossless (#1802) #1864
Open
haosenwang1018 wants to merge 1 commit into
Closes microsoft#1802
Refs microsoft#1788

The CLI's stdout output path used the OS console codepage (``cp1252``/``gbk``/etc. on Windows) and re-encoded the converted markdown with ``errors="replace"`` to avoid raising. That suppressed ``UnicodeEncodeError: 'charmap' codec can't encode characters`` but silently replaced every non-encodable character (bullets, em-dashes, CJK, ...) with ``?``, so users running ``markitdown foo.docx > foo.md`` on Windows got corrupted output by default.

Reconfigure ``sys.stdout`` to UTF-8 before printing so the redirected file is faithful to the source. ``sys.stdout.reconfigure`` is a 3.7+ TextIOWrapper method; if it's unavailable (some embedded interpreters, non-standard stdout shims) we fall through to ``print`` and only re-encode with ``errors="replace"`` if a ``UnicodeEncodeError`` is actually raised. That preserves the previous "never crash on unicode" guarantee without paying the lossy cost on the common path.

Tests:
- ``test_handle_output_to_file_preserves_unicode`` pins the file output contract (already utf-8, but worth a regression guard alongside the stdout changes).
- ``test_handle_output_to_stdout_reconfigures_to_utf8`` asserts that ``sys.stdout.reconfigure(encoding="utf-8")`` is invoked AND that the payload survives unchanged through ``capsys``.
- ``test_handle_output_falls_back_when_reconfigure_unavailable`` covers the defensive replace path on a fake stdout that has no ``reconfigure`` and a strict ASCII codec; it should not raise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
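To make the failure mode concrete, here is a minimal sketch of what an ``errors="replace"`` round-trip does to text. ``lossy_roundtrip`` and the sample string are illustrative names, not code from this PR, and ``cp1252`` stands in for whatever codepage the console happens to report.

```python
def lossy_roundtrip(text: str, codepage: str) -> str:
    """Encode with errors="replace" and decode again (hypothetical helper).

    Characters the codepage cannot represent come back as "?",
    and no exception is raised to signal the loss.
    """
    return text.encode(codepage, errors="replace").decode(codepage)


sample = "café 中文"

# cp1252 can represent "é" but not the CJK characters.
print(lossy_roundtrip(sample, "cp1252"))  # café ??

# UTF-8 can represent every character, so the round-trip is lossless.
print(lossy_roundtrip(sample, "utf-8") == sample)  # True
```

Because the substitution is silent, the only visible symptom is the ``?`` characters in the redirected file.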
Issue
Closes #1802
Refs #1788
Root cause
The CLI's stdout output path used the OS console codepage (``cp1252``/``gbk``/etc. on Windows) and re-encoded the converted markdown with ``errors="replace"`` to avoid raising. That suppresses ``UnicodeEncodeError: 'charmap' codec can't encode characters`` (the symptom #1802 reports) but silently replaces every non-encodable character with ``?``. A user running ``markitdown foo.docx > foo.md`` on Windows ends up with bullet points, em-dashes, CJK, etc. all turned into ``?`` in the output file. From the reporter's perspective the bug is "the converted markdown is corrupt", even though no error is raised.

#1788's traceback shows the older pre-``errors="replace"`` version did raise ``UnicodeEncodeError`` outright, which is what the existing replace path was patching. The fix here addresses both issues at once.
Fix
Reconfigure ``sys.stdout`` to UTF-8 before printing so the redirected file is faithful to the source. ``sys.stdout.reconfigure`` is a 3.7+ ``TextIOWrapper`` method; if it's unavailable (some embedded interpreters, non-standard stdout shims) we fall through to ``print`` and only re-encode with ``errors="replace"`` if a ``UnicodeEncodeError`` is actually raised. That preserves the previous "never crash" guarantee without paying the lossy cost on the common path.
Tests
- ``test_handle_output_to_file_preserves_unicode`` pins the existing file-output utf-8 contract.
- ``test_handle_output_to_stdout_reconfigures_to_utf8`` asserts ``sys.stdout.reconfigure(encoding="utf-8")`` is invoked and the payload survives unchanged through ``capsys``.
- ``test_handle_output_falls_back_when_reconfigure_unavailable`` covers the defensive replace path on a fake stdout that has no ``reconfigure`` and a strict ASCII codec; it should not raise.
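
For readers who want the shape of the fix without opening the diff, the approach described under Fix could be sketched roughly as follows. This is an illustration under assumptions, not the PR's actual code: ``write_text_to_stdout`` is a hypothetical name, and the real change may be structured differently.

```python
import sys


def write_text_to_stdout(text: str) -> None:
    """Print text as UTF-8 when possible, never raising on unencodable input.

    Hypothetical helper illustrating the approach; not the PR's function.
    """
    # TextIOWrapper.reconfigure exists on Python 3.7+; stdout shims may lack it.
    reconfigure = getattr(sys.stdout, "reconfigure", None)
    if callable(reconfigure):
        reconfigure(encoding="utf-8")

    try:
        # Common path: lossless once stdout encodes as UTF-8.
        print(text)
    except UnicodeEncodeError:
        # Defensive path: only now accept the lossy "?" substitution.
        encoding = getattr(sys.stdout, "encoding", None) or "ascii"
        print(text.encode(encoding, errors="replace").decode(encoding))


if __name__ == "__main__":
    write_text_to_stdout("• bullets and 中文 survive redirection")
```

The ordering is the point of the design: the lossy re-encode runs only after ``print`` has actually failed, so streams that can carry UTF-8 never go through the replace path.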