Since unsigned char represents 0 - 255 and the extended ascii code for 'à' is 133, I expected the following C code to print 133

unsigned char uc;

uc='à';

printf("%hhu \n",uc);

Instead, both clang and gcc produce the following error:

error: character too large for enclosing character literal type
uc='à';
    ^ 

What went wrong?

By the way I copied à from a French language website and pasted the result into the assignment statement. What I suspect is the way I created à may not be valid.

  • Your editor is using UTF-8 encoding, not extended ASCII. Commented Aug 17, 2021 at 19:07
  • Your compiler most likely treats your source code as UTF-8 and not as ASCII. And in UTF-8 the letter "à" is represented as the two byte sequence 0xC3 0xA0 and therefore does not fit into a char, be it signed or unsigned. Commented Aug 17, 2021 at 19:08
  • The character à cannot be used in a C program. Try with \x85. Commented Aug 17, 2021 at 19:11
  • @YvesDaoust do you have a link to normative reference? Commented Aug 17, 2021 at 19:12
  • There is no such thing as "extended ascii". Some people used this term a decade or two ago to denote several incompatible things. There is no reason to use it today. Commented Aug 17, 2021 at 19:38

1 Answer

Since unsigned char represents 0 - 255

This is true in most implementations, but the C standard does not limit a char to 8 bits. It only requires at least 8 bits (CHAR_BIT >= 8), so a char can be wider and support a larger range.

and the extended ascii code for 'à' is 133,

There can be a C implementation where 'à' has the value 133 (0x85), for example one that uses the IBM PC code page CP437. But most implementations today use Unicode, where 'à' is the code point 224 (0xE0), and the source file is most likely stored as UTF-8. Your editor is also set to UTF-8 and therefore needs more than a single byte to represent any character outside of ASCII. In UTF-8, all ASCII characters are stored exactly as they are in ASCII and need 1 byte; every other character is a sequence of 2 to 4 bytes, and bit 7 (the most significant bit) is set in each byte of such a sequence. I suggest you learn how UTF-8 works. UTF-8 is the best way to store text most of the time, so you should only use something else when you have a good reason to do so.

I expected the following C code to print 133

In UTF-8 the code point for à is stored as the two bytes 0xC3 0xA0, which decode to the value 0xE0. You can't store the two bytes 0xC3 0xA0 in an 8-bit char, so clang reports an error. You could try to store the literal in an int, unsigned, wchar_t or some other integer type that is large enough. GCC would then treat it as a multi-character constant (whose value is implementation-defined) and store 0xC3A0 and not 0xE0, because those are the bytes inside the ''. However, C supports wide characters. The type wchar_t can represent more characters than char; it is most likely 32 or 16 bits wide on your system. To write a wide character literal, use the prefix L. With a wide character literal, the compiler stores the correct value of 0xE0.

Change the code to:

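#include <stdio.h>
#include <wchar.h>

int main(void)
{
    wchar_t wc;
    wc = L'à';                      /* wide character literal */
    printf("%u \n", (unsigned)wc);  /* 224 if the compiler reads the source as UTF-8 */
    return 0;
}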

2 Comments

This answer is not written in a way that would explain to somebody unfamiliar with character sets and representing them via bytes.
“bit 7 is set in every one of them” contradicts “the code point for à is stored as 0x3C 0xA0”, since bit 7 is not set in 0x3C.