This is a strange one, and it might be due to a Python update, because it worked fine yesterday with no changes. Here we go:
I have a program that opens UTF-8 files (that use accented characters, etc., not just ANSI characters). When I open the files with open(file, encoding="utf-8-sig").read(), the non-ANSI characters get mangled, as shown here in my terminal:
[Screenshot: mangled characters when the encoding of open() is set to "utf-8-sig"]
However, when I set the encoding to "ansi", the characters are perfectly normal!
[Screenshot: normal characters with encoding="ansi"]
This is a complete mystery to me. As I said, this worked fine yesterday. I've checked multiple times that the files really are UTF-8. I don't know whether the problem is with the open() function or with print() when the characters are displayed; in any case, it's strange. Reading with "ansi" would be a workaround, but it causes problems with Lark, which uses the contents of the opened files.
In the screenshots I gave here, the code is basic:
with open(str(GRAMMAR), "r", encoding="utf-8-sig") as grammar:
    print(grammar.read())
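One quick way to narrow down whether open() or the display is at fault (a rough check, not a fix; GRAMMAR is the same path as above, shown here with a hypothetical value) is to escape the non-ASCII characters before printing, so the terminal cannot distort them, and to check what encoding print() uses:

import sys

GRAMMAR = "my_grammar.lark"   # hypothetical path; substitute the real file

with open(str(GRAMMAR), "r", encoding="utf-8-sig") as grammar:
    text = grammar.read()

# ascii() replaces non-ASCII characters with \xNN / \uNNNN escapes, so the output
# cannot be distorted by the terminal; if the escapes are the expected code points
# (e.g. \xe9 for 'é'), open() decoded the file correctly and the display is the culprit.
print(ascii(text[:80]))

# The encoding print() uses when writing to the console.
print(sys.stdout.encoding)

If sys.stdout.encoding turns out to be an ANSI code page such as cp1252 while the terminal itself expects UTF-8, that mismatch alone would explain both screenshots.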
What could this problem be caused by?
Comments:

Try adding sys.stdout.reconfigure(encoding='utf-8') early in your script and see if that helps.

Could you add the output of print(open(file, 'rb').read()) to your question? This will remove any ambiguity about the file content and encoding.

ansi and utf-8 are completely different encodings, and if utf-8 works, the file is encoded in UTF-8. An "ANSI"-encoded file with those characters would raise an exception if read as UTF-8.
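Putting the first two suggestions together, a minimal sketch (assuming Python 3.7+, where sys.stdout.reconfigure() is available, and reusing the GRAMMAR path from the question with a hypothetical value):

import sys

GRAMMAR = "my_grammar.lark"   # hypothetical path; substitute the real file

# Suggestion 1: make print() emit UTF-8 instead of the console's ANSI code page.
sys.stdout.reconfigure(encoding="utf-8")

# Suggestion 2: print the raw bytes so the file's actual encoding is unambiguous
# (a UTF-8 file with a BOM starts with b'\xef\xbb\xbf').
with open(str(GRAMMAR), "rb") as raw:
    print(raw.read()[:100])

# Re-run the original read: if the accented characters now display correctly,
# the file was valid UTF-8 all along and only the output encoding was wrong.
with open(str(GRAMMAR), "r", encoding="utf-8-sig") as grammar:
    print(grammar.read())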