unsigned char in C not working as expected

Question

Since unsigned char represents 0 to 255 and the extended ASCII code for 'à' is 133, I expected the following C code to print 133:

unsigned char uc;

uc='à';

printf("%hhu \n",uc);

Instead, both clang and gcc produce the following error:

error: character too large for enclosing character literal type
uc='à';
    ^

What went wrong?

By the way, I copied à from a French-language website and pasted it into the assignment statement. I suspect the way I created à may not be valid.

Your compiler most likely treats your source code as UTF-8 rather than ASCII. In UTF-8, the letter "à" is represented as the two-byte sequence 0xC3 0xA0 and therefore does not fit into a char, be it signed or unsigned.
– Codo, Commented Aug 17, 2021 at 19:08
The character à cannot be used in a C program. Try with \x85.
– user1196549, Commented Aug 17, 2021 at 19:11
There is no such thing as "extended ascii". Some people used this term a decade or two ago to denote several incompatible things. There is no reason to use it today.
– n. m. could be an AI, Commented Aug 17, 2021 at 19:38

12431234123412341234123 · Accepted Answer · 2021-08-17 19:51:18Z

3

Since unsigned char represents 0 - 255

This is true in most implementations, but the C standard does not require a char to be exactly 8 bits; it can be wider and support a larger range.

and the extended ascii code for 'à' is 133,

There can be a C implementation where 'à' has the value 133 (0x85), but since most implementations use Unicode, 'à' probably has the code point 224 (0xE0), which is most likely stored as UTF-8. Your editor is also set to UTF-8 and therefore needs more than a single byte to represent characters outside of ASCII. In UTF-8, every ASCII character is stored exactly as in ASCII and needs 1 byte; every other character is a sequence of 2 to 4 bytes, and bit 7 is set in each of those bytes. I suggest you learn how UTF-8 works. UTF-8 is the best way to store text most of the time, so you should only use something else when you have a good reason to do so.

I expected the following C code to print 133

In UTF-8, the code point for à (0xE0) is stored as the two bytes 0xC3 0xA0, which decode back to the value 0xE0. You can't store 0xC3 0xA0 in an 8-bit char, so clang reports an error. You could try to store it in an int, unsigned, wchar_t or some other integer type that is large enough. GCC would then store the value 0xC3A0 and not 0xE0, because that is the value inside the ''. However, C supports wide characters. The type wchar_t may support more characters; it is most likely 32 or 16 bits wide on your system. To write a wide character literal, you can use the prefix L. With a wide character literal, the compiler would store the correct value of 0xE0.

Change the code to:

#include <stdio.h>
#include <wchar.h>

....

wchar_t wc;
wc = L'à';   /* wide character literal */
printf("%u\n", (unsigned)wc);

edited Aug 17, 2021 at 19:51

answered Aug 17, 2021 at 19:41


2 Comments

Eric Postpischil Over a year ago

This answer is not written in a way that would explain to somebody unfamiliar with character sets and representing them via bytes.

Eric Postpischil Over a year ago

“bit 7 is set in every one of them” contradicts “the code point for à is stored as 0x3C 0xA0”, since bit 7 is not set in 0x3C.
