Why does Python's open() function mangle my UTF-8 files?

Question

This is a strange one, and it might be due to a Python update, because it worked fine yesterday and I have changed nothing since. Here we go:

I have a program that opens UTF-8 files (which contain accented characters and so on, not just ASCII characters). When I read the files with open(file, encoding="utf-8-sig").read(), the non-ASCII characters get mangled, as shown here in my terminal:

[Screenshot: mangled characters when the encoding of open() is set to "utf-8-sig"]

However, when I set the encoding to "ansi", the characters are perfectly normal!

[Screenshot: normal characters with encoding="ansi"]

This is a complete mystery to me. As I said, this worked fine yesterday. I have checked multiple times that the files really are UTF-8. I don't know whether the problem lies with the open() function or with the print() function when the characters are displayed. In any case, it's strange. Using "ansi" would be a workaround, but it causes problems with Lark, which consumes the contents of the opened files.

In the screenshots I gave here, the code is basic:

with open(str(GRAMMAR), "r", encoding="utf-8-sig") as grammar:
    print(grammar.read())

What could this problem be caused by?

Is the output in the screenshots from your terminal? It could be entirely a printing issue and not an issue with the actual strings, especially on Windows. Do you have another way to check?
– Kemp, Commented Oct 20, 2021 at 12:17
You could try putting sys.stdout.reconfigure(encoding='utf-8') early in your script and see if that helps.
– Kemp, Commented Oct 20, 2021 at 12:20
Can you reduce the file to a single line that fails, read the file in binary mode, print the result with print(open(file,'rb').read()), and edit your question to include it? This will remove any ambiguity about the file content and encoding. ansi and utf-8 are completely different encodings, and if utf-8 works, the file is encoded in UTF-8. An "ANSI" encoded file with those characters would raise an exception if read as UTF-8.
– Mark Tolonen, Commented Oct 20, 2021 at 17:39
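
The checks suggested in these comments can be sketched roughly as follows. This is only an illustration: the sample text, the temporary file, and the helper name inspect_file are made up for the example and are not part of the asker's program. The idea is to look at the raw bytes and at an ASCII-safe representation of the decoded string, so that the terminal's own encoding cannot distort what is shown.

```python
import os
import sys
import tempfile


def inspect_file(path):
    """Return the raw bytes of a file and its text decoded as UTF-8.

    "utf-8-sig" strips a leading byte order mark (BOM) if one is present.
    """
    with open(path, "rb") as f:
        raw = f.read()
    return raw, raw.decode("utf-8-sig")


# Hypothetical sample file standing in for the real grammar file.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("caf\u00e9\n".encode("utf-8-sig"))
    path = tmp.name

raw, text = inspect_file(path)
os.remove(path)

# Both of these print pure ASCII, so the terminal cannot mangle them.
print(raw)          # the bytes exactly as stored on disk
print(ascii(text))  # the decoded string with non-ASCII characters escaped

# The encoding Python uses when printing to this terminal (may be None
# if stdout has been replaced by an object without that attribute).
print(getattr(sys.stdout, "encoding", None))
```

If ascii(text) shows the expected code points (for example \xe9 for é) while a plain print(text) looks wrong, the file was decoded correctly and the problem is probably on the output side.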

algorev · Accepted Answer · 2021-10-20 12:29:59Z

0

I just noticed something: ansi is not an encoding. The correct name for the encoding would be ascii. This means that when I typed encoding="ansi", Python ignored the encoding I asked it to set and read the file with its default encoding, which is normally utf-8. This does not explain why it doesn't work with utf-8-sig or why Lark is screaming at me, but that is specific to my case. So for future readers of this question, check 2 things:

If you want to use ascii, type ascii, not ansi.
Stick with the defaults.
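
Whether open() quietly falls back to a default when it is given a name it does not recognise can be checked directly. The sketch below is an illustration only: the sample file, the deliberately bogus name "not-a-real-codec", and the helper try_open are made up for the example. To my knowledge, open() raises LookupError for an unknown encoding name rather than ignoring it, and decoding non-ASCII bytes as ascii raises UnicodeDecodeError.

```python
import os
import tempfile

# Hypothetical sample file containing one accented character.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write("caf\u00e9".encode("utf-8"))
    path = tmp.name


def try_open(path, encoding):
    """Read a file with the given encoding, reporting errors as text."""
    try:
        with open(path, encoding=encoding) as f:
            return f.read()
    except LookupError as exc:
        # Raised when the encoding name is not known to Python.
        return f"LookupError: {exc}"
    except UnicodeDecodeError as exc:
        # Raised when the bytes are not valid in the chosen encoding.
        return f"UnicodeDecodeError: {exc.reason}"


result_bogus = try_open(path, "not-a-real-codec")
result_ascii = try_open(path, "ascii")
result_utf8 = try_open(path, "utf-8")
os.remove(path)

# ascii() keeps the output printable on any terminal.
for label, result in [("bogus", result_bogus),
                      ("ascii", result_ascii),
                      ("utf-8", result_utf8)]:
    print(label, "->", ascii(result))
```

If encoding="ansi" reads a file without raising an error, that suggests the name was recognised on that machine rather than ignored.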


snakecharmerb · Accepted Answer · 2021-10-20 15:31:54Z

0

On Windows machines, Python recognises the name "ansi" as an alias for the "mbcs" codec, defined as

Windows only: Encode the operand according to the ANSI codepage (CP_ACP).

So "ansi" is a valid encoding name on Windows, but it is not the same as ASCII or UTF-8, hence the apparent mangling.
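
A minimal way to see how a given Python installation resolves these names is to ask the codec registry. The helper name resolve_codec is made up for this sketch. On Windows "ansi" should resolve to "mbcs"; on other platforms the mbcs codec is not available, so the lookup is expected to fail there.

```python
import codecs


def resolve_codec(name):
    """Return the canonical codec name, or None if Python does not know it."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


# The result for "ansi" and "mbcs" depends on the platform.
for name in ("utf-8-sig", "ascii", "ansi", "mbcs"):
    print(name, "->", resolve_codec(name))
```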
