Suppose I want to detect Unicode characters and encode them using \u notation. If I have to work with a byte array, are there simple rules I can follow to detect the groups of bytes that belong to a single character?

I am referring to UTF-8 bytes that need to be encoded for an ASCII-only receiver. At the moment, everything except printable ASCII, CR, LF, and tab is stripped with s/[^\x20-\x7e\r\n\t]//g.

I want to improve this functionality to write \u0000 notation.

  • \u has nothing whatsoever to do with UTF-8. Commented Dec 17, 2012 at 21:50
  • @SLaks, See my edits. Please let me know if I am unclear on anything. Commented Dec 17, 2012 at 21:53

1 Answer

You need to have Unicode characters, so start by decoding your byte array.

use Encode qw( decode );
my $decoded_text = decode("UTF-8", $encoded_text);
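
As for the byte-grouping rules the question asks about: in UTF-8 the first byte of each character determines how many bytes the character occupies, and every following byte has the form 10xxxxxx. The sketch below only illustrates that grouping, assuming the input is a plain byte string. utf8_sequence_length and utf8_groups are hypothetical names, and decode() remains the better tool because it also deals with overlong forms and surrogates.

```perl
use strict;
use warnings;

# Length of the UTF-8 sequence introduced by a lead byte, or 0 if the
# byte cannot start a sequence (continuation byte or invalid lead).
sub utf8_sequence_length {
    my ($byte) = @_;
    return 1 if $byte <= 0x7F;   # 0xxxxxxx: ASCII
    return 0 if $byte <= 0xC1;   # 10xxxxxx continuation, or overlong lead C0/C1
    return 2 if $byte <= 0xDF;   # 110xxxxx
    return 3 if $byte <= 0xEF;   # 1110xxxx
    return 4 if $byte <= 0xF4;   # 11110xxx, capped at U+10FFFF
    return 0;                    # F5..FF never appear in UTF-8
}

# Split a byte string into per-character groups of byte values.
# Dies on a bad lead byte, a truncated sequence, or a bad continuation
# byte. It does not check overlong 3/4-byte forms or surrogates.
sub utf8_groups {
    my ($bytes) = @_;
    my @b = unpack 'C*', $bytes;
    my @groups;
    my $i = 0;
    while ($i < @b) {
        my $len = utf8_sequence_length($b[$i])
            or die sprintf "Invalid lead byte 0x%02X at offset %d\n", $b[$i], $i;
        die "Truncated sequence at offset $i\n" if $i + $len > @b;
        for my $j ($i + 1 .. $i + $len - 1) {
            die "Bad continuation byte at offset $j\n"
                unless ($b[$j] & 0xC0) == 0x80;
        }
        push @groups, [ @b[$i .. $i + $len - 1] ];
        $i += $len;
    }
    return @groups;
}
```

Calling utf8_groups("\xC3\x89\x72\xE2\x99\xA5") returns three groups: two bytes for É, one for r, and three for ♥.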

Only then can you escape Unicode characters.

( my $escaped_text = $decoded_text ) =~
   s/([^\x0A\x20-\x5B\x5D-\x7E])/sprintf("\\u%04X", ord($1))/eg;

For example,

$ perl -CSDA -MEncode=decode -E'
   my $encoded_text = "\xC3\x89\x72\x69\x63\x20\xE2\x99\xA5\x20\x50\x65\x72\x6c";
   my $decoded_text = decode("UTF-8", $encoded_text);
   say $decoded_text;
   ( my $escaped_text = $decoded_text ) =~
      s/([^\x0A\x20-\x5B\x5D-\x7E])/sprintf("\\u%04X", ord($1))/eg;
   say $escaped_text;
'
Éric ♥ Perl
\u00C9ric \u2665 Perl
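
The substitution above also escapes tab and carriage return, which the question's original filter let through. If those should stay as they are, the character class can be widened. A minimal sketch, assuming the receiver accepts raw \t and \r; escape_keep_whitespace is a hypothetical helper name:

```perl
use strict;
use warnings;
use Encode qw( decode );

# Variant of the escape above that passes \t (\x09), \n (\x0A) and
# \r (\x0D) through unescaped, as the question's original filter did.
# The backslash (\x5C) is still escaped, presumably so that a literal
# "\u" in the input cannot be mistaken for an escape sequence.
sub escape_keep_whitespace {
    my ($decoded_text) = @_;
    ( my $escaped_text = $decoded_text ) =~
        s/([^\x09\x0A\x0D\x20-\x5B\x5D-\x7E])/sprintf("\\u%04X", ord($1))/eg;
    return $escaped_text;
}

# Decode the UTF-8 bytes first, then escape.
my $decoded_text = decode("UTF-8", "\xC3\x89ric\t\xE2\x99\xA5\r\n");
print escape_keep_whitespace($decoded_text);
```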

6 Comments

Note that this approach adds ambiguity: code points over 0xFFFF will be indistinguishable from shorter code points followed by ASCII hex digits. Also, it will escape most ASCII characters in the range 0 .. 31, including tab and carriage return (or newline on classic Mac OS :-). I'd just use s/([^[:ascii:]])/.../eg there.
There's nothing I can do about the fact that the scheme is not defined for code points over U+FFFF. Escaping control characters is a good thing.
I have no idea about Perl, but \uxxxx is defined for code points above U+FFFF using surrogate pairs: \uHHHH\uLLLL. For example, U+20000 becomes \ud840\udc00.
@Esailija, You're talking about JSON. The OP didn't mention anything about JSON. He asked to encode codepoints as \uXXXX. If he wants JSON, he should use JSON::XS.
@ikegami What do you mean, JSON? I tested it in C#, Java, JavaScript and Python 2.7. These all support surrogate pairs in the \u notation for astral planes.
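
The surrogate-pair form discussed in these comments can be sketched as follows. This assumes the receiver follows the UTF-16 surrogate convention used by JavaScript, Java, and JSON, which the question does not state. escape_code_point and escape_text are hypothetical helper names.

```perl
use strict;
use warnings;

# Escape one code point as \uXXXX, splitting code points above U+FFFF
# into a UTF-16 surrogate pair (\uHHHH\uLLLL).
sub escape_code_point {
    my ($cp) = @_;
    return sprintf("\\u%04X", $cp) if $cp <= 0xFFFF;
    my $v = $cp - 0x10000;            # 20-bit offset into the astral planes
    return sprintf("\\u%04X\\u%04X",
        0xD800 + ($v >> 10),          # high surrogate: top 10 bits
        0xDC00 + ($v & 0x3FF));       # low surrogate: bottom 10 bits
}

# Same character class as the answer; only the replacement differs.
sub escape_text {
    my ($decoded_text) = @_;
    ( my $escaped_text = $decoded_text ) =~
        s/([^\x0A\x20-\x5B\x5D-\x7E])/escape_code_point(ord($1))/eg;
    return $escaped_text;
}

print escape_text("\x{20000}"), "\n";   # prints \uD840\uDC00
```

With this variant every escape is exactly four hex digits long, so the ambiguity raised in the first comment goes away, at the cost of the receiver having to recombine the pairs.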