2

This is a strange one, and it might be due to a Python update, because it worked fine yesterday and I made no changes. Here we go:

I have a program that opens utf-8 files (that use accented characters, etc, not just ansi characters). When I open the files with open(file, encoding="utf-8-sig").read(), the non-ansi characters get mangled, as shown here in my terminal:

[Screenshot: mangled characters when the encoding of open() is set to "utf-8-sig"]

However, when I set the encoding to "ansi", the characters are perfectly normal!

[Screenshot: normal characters with encoding="ansi"]

This is a complete mystery to me. As said before, this worked fine yesterday. I've checked multiple times that the files are indeed utf-8. I don't know if the problem is with the open() function, or with the print() function when the characters are displayed. In any case, it's strange. The "ansi" version would be a solution, but it causes problems with Lark, which uses the contents of the opened files.

In the screenshots I gave here, the code is basic:

with open(str(GRAMMAR), "r", encoding="utf-8-sig") as grammar:
    print(grammar.read())

What could this problem be caused by?

3
  • Is the output in the screenshots from your terminal? It could be entirely a printing issue and not an issue with the actual strings, especially on Windows. Do you have another way to check? Commented Oct 20, 2021 at 12:17
  • You could try putting sys.stdout.reconfigure(encoding='utf-8') early in your script and see if that helps. Commented Oct 20, 2021 at 12:20
  • Can you reduce the file to a single line that fails, read the file in binary mode, print the result with print(open(file,'rb').read()), and edit your question to include it? This will remove ambiguity about the file content and encoding. ansi and utf-8 are completely different encodings, and if utf-8 works, the file is encoded in UTF-8. An "ANSI" encoded file with those characters would raise an exception if read as UTF-8. Commented Oct 20, 2021 at 17:39
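
The checks suggested in these comments could be sketched roughly as follows. This is only an illustration: the helper name describe_file and the temporary sample file are hypothetical stand-ins, so substitute the real grammar file. The point is to look at the bytes on disk separately from whatever the terminal does with print().

```python
import sys
import tempfile
from pathlib import Path


def describe_file(path):
    """Return (raw bytes, starts with UTF-8 BOM?, decodes as UTF-8?)."""
    raw = Path(path).read_bytes()
    has_bom = raw.startswith(b"\xef\xbb\xbf")
    try:
        raw.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    return raw, has_bom, valid_utf8


# Hypothetical sample file standing in for the real one.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("café", encoding="utf-8-sig")
    raw, has_bom, valid_utf8 = describe_file(sample)

print(raw)                  # the bytes on disk, independent of the terminal
print(has_bom, valid_utf8)  # BOM present? valid UTF-8?
print(sys.stdout.encoding)  # the encoding print() will use for output

# If the terminal encoding turns out to be the culprit, this may help
# (available on text streams since Python 3.7).
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8")
```

If the raw bytes decode cleanly as UTF-8 but the printed text still looks wrong, the problem is more likely on the output side than in open().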

2 Answers

0

I just noticed something: ansi is not an encoding. The correct name for the encoding would be ascii. This means that when I typed encoding="ansi", Python ignored the encoding I asked it to set and read the file with its default encoding, which is normally utf-8. This does not explain why it doesn't work with utf-8-sig or why Lark is screaming at me, but that is specific to my case. So for future readers of this question, check 2 things:

  1. If you want to use ascii, type ascii, not ansi.
  2. Stick with the defaults.

0

On Windows machines, Python recognises the name "ansi" as an alias for the "mbcs" codec, defined as

Windows only: Encode the operand according to the ANSI codepage (CP_ACP).

So ansi is a valid encoding, but it isn't the same as ASCII, or UTF-8, hence the apparent mangling.
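
A small sketch of what happens when the same bytes are decoded with different codecs. It uses cp1252 as a stand-in for the ANSI code page, which is an assumption: the actual code page depends on the Windows locale, and the "ansi" and "mbcs" names are only available on Windows.

```python
import codecs

text = "é"
raw = text.encode("utf-8")          # two bytes: 0xC3 0xA9

as_utf8 = raw.decode("utf-8")       # round-trips to the original character
as_cp1252 = raw.decode("cp1252")    # each byte becomes its own character

print(ascii(as_utf8))
print(ascii(as_cp1252))

# The reverse direction: cp1252 bytes for the same character are not valid UTF-8.
legacy = text.encode("cp1252")      # one byte: 0xE9
try:
    legacy.decode("utf-8")
    reverse_ok = True
except UnicodeDecodeError:
    reverse_ok = False
print(reverse_ok)

# Check what "ansi" resolves to on this machine. An unknown codec name
# raises LookupError rather than silently falling back to a default.
try:
    ansi_codec = codecs.lookup("ansi").name
except LookupError:
    ansi_codec = None
print(ansi_codec)
```

Decoding UTF-8 bytes with a single-byte code page never fails, it just yields one character per byte, which is why the result looks mangled instead of raising an error.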
