Reading Unicode chars on the byte level

Question

Suppose I want to detect Unicode characters and encode them using \u notation. If I have to work with a byte array, are there simple rules I can follow to detect groups of bytes that belong to a single character?

I am referring to UTF-8 bytes that need to be encoded for an ASCII-only receiver. At the moment, every character other than printable ASCII, CR, LF, and tab is stripped with s/[^\x20-\x7e\r\n\t]//g.

I want to improve this so that those characters are written in \uXXXX notation instead of being dropped.

@SLaks, see my edits. Please let me know if I am unclear on anything.
– 700 Software, commented Dec 17, 2012 at 21:53

ikegami · Accepted Answer · 2012-12-18 00:07:14Z

You need to have Unicode characters, so start by decoding your byte array.

use Encode qw( decode );
my $decoded_text = decode("UTF-8", $encoded_text);

Only then can you escape Unicode characters.

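# Escape everything except LF and printable ASCII. The backslash (\x5C) is
# escaped too, so a literal "\u" in the input stays distinguishable.
( my $escaped_text = $decoded_text ) =~
   s/([^\x0A\x20-\x5B\x5D-\x7E])/sprintf("\\u%04X", ord($1))/eg;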

For example,

$ perl -CSDA -MEncode=decode -E'
   my $encoded_text = "\xC3\x89\x72\x69\x63\x20\xE2\x99\xA5\x20\x50\x65\x72\x6c";
   my $decoded_text = decode("UTF-8", $encoded_text);
   say $decoded_text;
   ( my $escaped_text = $decoded_text ) =~
      s/([^\x0A\x20-\x5B\x5D-\x7E])/sprintf("\\u%04X", ord($1))/eg;
   say $escaped_text;
'
Éric ♥ Perl
\u00C9ric \u2665 Perl
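
For readers curious about the byte-level rules the question asks about, here is a minimal sketch of how the lead byte of a UTF-8 sequence determines how many bytes belong to one character. The helper names utf8_seq_len and utf8_groups are hypothetical and not part of Encode. The sketch checks lead and continuation bytes only, not overlong three- and four-byte forms or encoded surrogates, so decode() remains the more robust route.

```perl
use strict;
use warnings;

# Hypothetical helper: length of the UTF-8 sequence started by this byte,
# or 0 if the byte cannot start a sequence.
sub utf8_seq_len {
    my ($byte) = @_;
    return 1 if $byte <= 0x7F;   # 0xxxxxxx: ASCII, stands alone
    return 0 if $byte <= 0xBF;   # 10xxxxxx: continuation byte, not a lead
    return 0 if $byte <= 0xC1;   # C0/C1 would only start overlong forms
    return 2 if $byte <= 0xDF;   # 110xxxxx: lead of a 2-byte sequence
    return 3 if $byte <= 0xEF;   # 1110xxxx: lead of a 3-byte sequence
    return 4 if $byte <= 0xF4;   # 11110xxx: lead of a 4-byte sequence
    return 0;                    # F5..FF never appear in UTF-8
}

# Hypothetical helper: split a byte string into one byte group per character.
sub utf8_groups {
    my ($bytes) = @_;
    my @groups;
    my $pos = 0;
    while ($pos < length $bytes) {
        my $len = utf8_seq_len(ord substr($bytes, $pos, 1));
        die "Unexpected byte at offset $pos\n" if !$len;
        die "Truncated sequence at offset $pos\n"
            if $pos + $len > length $bytes;
        # Every byte after the lead must be a continuation byte (80..BF).
        die "Bad continuation byte after offset $pos\n"
            if substr($bytes, $pos + 1, $len - 1) =~ /[^\x80-\xBF]/;
        push @groups, substr($bytes, $pos, $len);
        $pos += $len;
    }
    return @groups;
}

my $encoded_text = "\xC3\x89\x72\x69\x63\x20\xE2\x99\xA5\x20\x50\x65\x72\x6c";
print join(" ", map { unpack("H*", $_) } utf8_groups($encoded_text)), "\n";
# prints: c389 72 69 63 20 e299a5 20 50 65 72 6c
```

Each group still has to be turned into a code point before it can be written as \uXXXX, which is the work decode() does in the answer above.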

edited Dec 18, 2012 at 0:07

answered Dec 17, 2012 at 22:02


6 Comments

melpomene Over a year ago

Note that this approach adds ambiguity: code points over 0xFFFF will be indistinguishable from shorter code points followed by ASCII hex digits. Also, it will escape most ASCII characters in the range 0 .. 31, including tab and carriage return (or newline on classic Mac OS :-). I'd just use s/([^[:ascii:]])/.../eg there.

ikegami Over a year ago

There's nothing I can do about that; the scheme is not defined for code points over U+FFFF. Escaping control characters is a good thing.

Esailija Over a year ago

I have no idea about Perl, but \uXXXX is defined for code points above U+FFFF using surrogate pairs: \uHHHH\uLLLL. For example, U+20000: \ud840\udc00

ikegami Over a year ago

@Esailija, You're talking about JSON. The OP didn't mention anything about JSON. He asked to encode codepoints as \uXXXX. If he wants JSON, he should use JSON::XS.

Esailija Over a year ago

@ikegami What do you mean, JSON? I tested it in C#, Java, JavaScript, and Python 2.7. These all support surrogate pairs in the \u notation for the astral planes.
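
The surrogate-pair form discussed in these comments can be sketched in Perl as follows. This is only an illustration, assuming the receiver interprets \uHHHH\uLLLL pairs the way Java, JavaScript, or JSON do, which the question does not state. The names escape_code_point and escape_text are hypothetical; the character class is the one from the answer.

```perl
use strict;
use warnings;

# Hypothetical helper: format one code point as \uXXXX, or as a
# \uHHHH\uLLLL surrogate pair when it lies above U+FFFF.
sub escape_code_point {
    my ($cp) = @_;
    return sprintf("\\u%04X", $cp) if $cp <= 0xFFFF;
    my $v = $cp - 0x10000;                  # 20-bit offset into the astral planes
    return sprintf("\\u%04X\\u%04X",
        0xD800 + ($v >> 10),                # high surrogate: top 10 bits
        0xDC00 + ($v & 0x3FF));             # low surrogate: bottom 10 bits
}

# Hypothetical helper: escape already-decoded text, keeping LF and
# printable ASCII other than the backslash.
sub escape_text {
    my ($decoded_text) = @_;
    ( my $escaped_text = $decoded_text ) =~
        s/([^\x0A\x20-\x5B\x5D-\x7E])/escape_code_point(ord($1))/eg;
    return $escaped_text;
}

print escape_text("a\x{20000}b\x{2665}"), "\n";
# prints: a\uD840\uDC00b\u2665
```

Because every escape is exactly four hex digits, this variant also avoids the ambiguity raised in the first comment.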
