I don't know how to resolve this UnicodeDecodeError.

I cannot write text to a file: I get a UnicodeDecodeError about the character â ('0xe2').

1) The character â ('0xe2') definitely does not exist in that string.

2) re.search cannot find the character â in the string that I am trying to write with file.writelines(string).

3) errors='replace' is set when the file is opened, so file.writelines() should not complain about character errors.

FileA = codecs.open(fname, 'w', 'utf-8', errors='replace')

lines=smart_str( lines, 'utf-8', strings_only=False, errors='replace' )
# lines is webpage text from BeautifulSoup.prettify (it does not contain the letter â = '0xe2'), converted to a string with Django's smart_str

FileA.writelines(lines) #gives UnicodeDecodeError : 'ascii' codec can't decode byte 0xe2 in position 9637: ordinal not in range(128).

myre = re.compile(r'0xe2', re.UNICODE) # letter   â = '0xe2'
print re.search(myre, lines) #gives None
linessub=myre.sub('', lines)
print re.search(myre, linessub)  #gives None

FileA.writelines(lines) #gives UnicodeDecodeError : 'ascii' codec can't decode byte 0xe2 in position 9637: ordinal not in range(128).
  • If you want to test whether a string contains a substring, just do '\xe2' in s instead of this re stuff, which wouldn't work anyway because '0xe2' != '\xe2'. Commented Nov 20, 2011 at 19:38

1 Answer

You're using codecs.open so your file object expects unicode strings, not byte strings.

The point of using this function is that you don't have to encode the strings yourself before writing them to the file. You write unicode strings and the file object encodes them internally.

It looks like smart_str returns UTF-8 encoded byte strings (since you pass the encoding name to it). If you pass those to the codec-aware file object, which expects unicode, it will first try to decode the byte strings back to unicode. Because it does not know the encoding of the string it was given, it uses ASCII. That is where the error comes from: the string isn't ASCII, it's UTF-8:

UnicodeDecodeError : 'ascii' codec can't decode...
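
As a rough illustration of that implicit decode, here is a minimal sketch written so that it also runs on Python 3, where the decode step has to be spelled out explicitly. The sample text is an assumption: a right single quotation mark (U+2019) is a common character in web pages whose UTF-8 encoding begins with the byte 0xe2, which may be why the error mentions 0xe2 even though the text contains no â.

```python
# Hypothetical sample text containing a right single quotation mark (U+2019).
text = u"it\u2019s"

# Roughly what an encoding helper such as smart_str(..., 'utf-8') hands back:
# a UTF-8 encoded byte string.
encoded = text.encode("utf-8")

# The quote occupies three bytes; the first one is 0xe2.
first_quote_byte = encoded[2:3]

# Python 2's codec-aware file object implicitly performs the equivalent of
# this ASCII decode when it is given a byte string instead of unicode.
try:
    encoded.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(repr(first_quote_byte))
print(error_message)
```

Note that errors='replace' on the file object only governs the encoding step on the way out; it has no effect on this implicit ASCII decode.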

So, you want to either skip the encoding stage done by smart_str and simply write unicode strings to the file, or switch from codecs.open() to the normal open(), which works with bytes and as such expects already encoded byte strings.
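
The two options can be sketched side by side. This is only an illustration under stated assumptions: it uses io.open rather than codecs.open or the plain open() so that the same code behaves identically on Python 2 and Python 3, and the sample text and file names are made up.

```python
import io
import os
import tempfile

text = u"it\u2019s"  # hypothetical sample text, held as unicode
tmpdir = tempfile.mkdtemp()

# Option 1: a file opened with an encoding. Write unicode; the file encodes it.
path_text = os.path.join(tmpdir, "text_mode.txt")
with io.open(path_text, "w", encoding="utf-8", errors="replace") as f:
    f.write(text)

# Option 2: a file opened in binary mode. Encode yourself and write bytes.
path_bytes = os.path.join(tmpdir, "byte_mode.txt")
with io.open(path_bytes, "wb") as f:
    f.write(text.encode("utf-8"))

# Read both files back as raw bytes to compare what ended up on disk.
with io.open(path_text, "rb") as f:
    data_text = f.read()
with io.open(path_bytes, "rb") as f:
    data_bytes = f.read()

print(data_text == data_bytes)
```

Either way the string is encoded exactly once; the error in the question arises when an already encoded byte string is handed to a file object that wants to do the encoding itself.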

By the way, your test for the existence of the 0xE2 character will not work. Firstly, you use r'0xe2' as the pattern, which is simply a four-character string, not a single 0xE2 character. Secondly, you don't need re for something as simple as that. Just try this:

print '\xe2' in your_str
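
To make the difference between the two tests concrete, here is a small sketch using explicit byte-string literals so it runs on both Python 2.6+ and Python 3. The contents of `data` are a made-up example, not the text from the question.

```python
import re

# Hypothetical UTF-8 encoded byte string containing the byte 0xe2.
data = u"it\u2019s".encode("utf-8")

# r'0xe2' is the four characters '0', 'x', 'e', '2', so this pattern only
# matches that literal text, never the single byte 0xe2.
literal_pattern_found = re.search(br"0xe2", data) is not None

# A plain membership test with the escape '\xe2' looks for the actual byte.
byte_found = b"\xe2" in data

print(len(br"0xe2"), len(b"\xe2"))
print(literal_pattern_found, byte_found)
```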

3 Comments

The question is how to resolve the UnicodeDecodeError about the character â = '0xe2', which appears when I try to write the text "lines" to a file.
I mean that this character definitely does not exist in the text. How did such an error arise, and what should I do about it?
I need smart_str because I perform a search on the string before writing it to the file, which is not shown here. After the search I get these errors and I cannot write the string to any file, whether opened with the plain open or with codecs.open(...).
