Converting to unicode characters in Perl?

Question

I want to convert Hindi text to Unicode escape sequences in Perl. I have searched CPAN but could not find a module or approach that does exactly what I need. Basically, I am looking for something like this.

My Input is:

इस परीक्षण के लिए है

My expected output is:

\u0907\u0938\u0020\u092a\u0930\u0940\u0915\u094d\u0937\u0923\u0020\u0915\u0947\u0020\u0932\u093f\u090f\u0020\u0939\u0948

How can I achieve this in Perl? Any suggestions are welcome.

I have tried these Perl modules (Encode, Text::Unidecode). – vara, commented Sep 22, 2014 at 10:54

ikegami · Accepted Answer · 2014-09-22 12:42:04Z

8

Try this:

use utf8;

my $str = 'इस परीक्षण के लिए है';

for my $c (split //, $str) {
    printf("\\u%04x", ord($c));
}
print "\n";

edited Sep 22, 2014 at 12:42

ikegami

answered Sep 22, 2014 at 11:25

Oleg G

1 Comment

amon Over a year ago

Careful: The \uxxxx notation used by many tools or programming languages (such as Java, JavaScript, and Python) only supports 4 hex digits, which makes these escapes unsuitable for code points larger than U+FFFF. The common workaround is to encode the text as UTF-16 and then represent each 16-bit integer as an \uxxxx escape (i.e. encoding in surrogate pairs). Python also offers another workaround: a 32-bit \Uxxxxxxxx notation. Your code might produce five-or-more-digit escapes which are not generally recognized, e.g. when applied to $str = "😃" (U+1F603 Smiling Face With Open Mouth).
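To make this caveat concrete, here is a minimal sketch of the per-character approach applied to a character outside the BMP. The helper name naive_escape is made up for illustration and is not part of the answer.

```perl
use strict;
use warnings;

# Hypothetical helper: one \u escape per code point, as in the answer above.
sub naive_escape {
    my ($str) = @_;
    return join '', map { sprintf "\\u%04x", ord($_) } split //, $str;
}

# BMP character: exactly four hex digits.
print naive_escape("\x{0907}"), "\n";     # \u0907

# Non-BMP character: five hex digits, which a parser expecting
# exactly four digits after \u will not read back as U+1F603.
print naive_escape("\x{1F603}"), "\n";    # \u1f603
```

A consumer that reads exactly four hex digits would likely treat the second result as \u1f60 followed by a literal 3.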

ikegami · Accepted Answer · 2014-09-22 15:09:25Z

6

You don't really need a module for that. ord to extract each character's code point and sprintf to format it as four-digit, zero-padded hex are more than enough:

use utf8;
my $str = 'इस परीक्षण के लिए है';
(my $u_encoded = $str) =~ s/(.)/sprintf "\\u%04x", ord($1)/sge;
# \u0907\u0938\u0020\u092a\u0930\u0940\u0915\u094d\u0937\u0923\u0020\u0915\u0947\u0020\u0932\u093f\u090f\u0020\u0939\u0948

edited Sep 22, 2014 at 15:09

ikegami

answered Sep 22, 2014 at 11:30

Oleg V. Volkov

1 Comment

amon Over a year ago

This solution has potential issues with Unicode characters outside of the Basic Multilingual Plane (i.e. > U+FFFF), as your code will produce escapes having more than four hex digits. See my comment to the other Oleg's answer for details.

amon · Accepted Answer · 2014-09-23 14:14:49Z

Because I left a few comments on how the other answers might fall short of the expectations of various tools, I'd like to share a solution that encodes each character outside of the Basic Multilingual Plane as a pair of escapes: "😃" would become \ud83d\ude03.

This is done by:

1. Encoding the string as UTF-16, without a byte order mark. We explicitly choose an endianness; here, we arbitrarily use the big-endian form. This produces a string of octets (“bytes”), where two octets form one UTF-16 code unit, and two or four octets represent a Unicode code point. This is done for convenience and performance; we could just as well determine the numeric values of the UTF-16 code units ourselves.
2. Unpacking the resulting binary string into 16-bit integers, each representing one UTF-16 code unit. We have to respect the correct endianness, so we use the n* pattern for unpack (i.e. 16-bit big-endian unsigned integer).
3. Formatting each code unit as an \uxxxx escape.

As a Perl subroutine, this would look like:

use strict;
use warnings;
use Encode ();

sub unicode_escape {
    my ($str) = @_;
    my $UTF_16BE_octets = Encode::encode("UTF-16BE", $str);
    my @code_units = unpack "n*", $UTF_16BE_octets;
    return join '', map { sprintf "\\u%04x", $_ } @code_units;
}

Test cases:

use Test::More tests => 3;
use utf8;

is unicode_escape(''), '',
    'empty string is empty string';

is unicode_escape("\N{SMILING FACE WITH OPEN MOUTH}"), '\ud83d\ude03',
    'non-BMP code points are escaped as surrogate halves';

my $input = 'इस परीक्षण के लिए है';
my $output = '\u0907\u0938\u0020\u092a\u0930\u0940\u0915\u094d\u0937\u0923\u0020\u0915\u0947\u0020\u0932\u093f\u090f\u0020\u0939\u0948';
is unicode_escape($input), $output,
    'ordinary BMP code points each have a single escape';
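One way to gain extra confidence in the escapes is a round trip: parse them back into code units and decode those as UTF-16BE. The sketch below assumes the input consists solely of well-formed \uxxxx escapes; unicode_unescape is a hypothetical helper that is not part of this answer, and unicode_escape is repeated only so the example runs on its own.

```perl
use strict;
use warnings;
use Encode ();

# Same technique as the subroutine above, repeated so this sketch is self-contained.
sub unicode_escape {
    my ($str) = @_;
    my @code_units = unpack "n*", Encode::encode("UTF-16BE", $str);
    return join '', map { sprintf "\\u%04x", $_ } @code_units;
}

# Hypothetical inverse (an assumption for illustration, not from the answer).
# Assumes the input contains nothing but well-formed \uxxxx escapes.
sub unicode_unescape {
    my ($escaped) = @_;
    my @code_units = map { hex($_) } $escaped =~ /\\u([0-9a-fA-F]{4})/g;
    my $octets = pack "n*", @code_units;    # back to big-endian 16-bit units
    return Encode::decode("UTF-16BE", $octets);
}

my $original = "\x{0907}\x{0938} \x{1F603}";
my $escaped  = unicode_escape($original);
print "$escaped\n";    # \u0907\u0938\u0020\ud83d\ude03
print unicode_unescape($escaped) eq $original ? "round trip ok\n" : "mismatch\n";
```

Because the surrogate halves are decoded together as UTF-16, the non-BMP character comes back as a single code point.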

clt60 · Accepted Answer · 2014-09-23 16:20:16Z

4

If you only want a simple converter, you can use one of the following filters:

perl -CSDA -nle 'printf "\\u%*v04x\n", "\\u",$_'
#or
perl -CSDA -nlE 'printf "\\u%04x",$_ for unpack "U*"'

like:

echo "इस परीक्षण के लिए है" | perl -CSDA -ne 'printf "\\u%*v04x\n", "\\u",$_'
#or
perl -CSDA -ne 'printf "\\u%*v04x\n", "\\u",$_' <<<  "इस परीक्षण के लिए है"

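prints the following (the trailing \u000a is the input's newline, which is escaped too because these two invocations omit the -l switch):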

\u0907\u0938\u0020\u092a\u0930\u0940\u0915\u094d\u0937\u0923\u0020\u0915\u0947\u0020\u0932\u093f\u090f\u0020\u0939\u0948\u000a
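The one-liners rely on sprintf's vector flag, which formats the ordinal of every character in a string and joins the results with a separator (a custom one when written as *v). Here is a minimal sketch of the same idea as a subroutine. The name vector_escape is hypothetical, and the format uses the flag order shown in the sprintf documentation (%0*v4x) rather than the spelling in the one-liner.

```perl
use strict;
use warnings;

# Hypothetical helper built on sprintf's vector flag.
sub vector_escape {
    my ($str) = @_;
    # Guard: for an empty string the vector part is empty,
    # so only the literal leading "\u" would be produced.
    return '' if $str eq '';
    # "\\u" appears twice: once as the literal prefix of the first escape,
    # once as the join string placed between the remaining ordinals.
    return sprintf "\\u%0*v4x", "\\u", $str;
}

print vector_escape("ab"), "\n";                  # \u0061\u0062
print vector_escape("\x{0907}\x{0938}"), "\n";    # \u0907\u0938
```

Like the one-liners, this emits one escape per code point, so it has the same limitation for characters above U+FFFF.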

To handle code points above U+FFFF as well, emit surrogate pairs:

use strict;
use warnings;
use utf8;
use open qw(:std :utf8);

my $str = "if( \N{U+1F42A}+\N{U+1F410} == \N{U+1F41B} ){ \N{U+1F602} = \N{U+1F52B} } # ορισμός ";

print "$str\n";
for my $ch (unpack "U*", $str) {
        if( $ch > 0xffff ) {
                my $h = (($ch - 0x10000) >> 10) + 0xD800;    # high surrogate: top 10 bits
                my $l = (($ch - 0x10000) & 0x3FF) + 0xDC00;  # low surrogate: bottom 10 bits
                printf "\\u%04x\\u%04x", $h, $l;
        }
        else {
                printf "\\u%04x", $ch;
        }
}
print "\n";

prints

if( 🐪+🐐 == 🐛 ){ 😂 = 🔫 } # ορισμός 
\u0069\u0066\u0028\u0020\ud83d\udc2a\u002b\ud83d\udc10\u0020\u003d\u003d\u0020\ud83d\udc1b\u0020\u0029\u007b\u0020\ud83d\ude02\u0020\u003d\u0020\ud83d\udd2b\u0020\u007d\u0020\u0023\u0020\u03bf\u03c1\u03b9\u03c3\u03bc\u03cc\u03c2\u0020
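As a worked example of the surrogate arithmetic, the sketch below splits a single code point into its two halves. The helper name surrogate_pair is made up for illustration, and it assumes the argument is already known to be above U+FFFF.

```perl
use strict;
use warnings;

# Hypothetical helper: split a code point above U+FFFF into UTF-16 surrogates.
sub surrogate_pair {
    my ($cp) = @_;
    my $offset = $cp - 0x10000;               # 20-bit value
    my $high   = 0xD800 + ($offset >> 10);    # top 10 bits
    my $low    = 0xDC00 + ($offset & 0x3FF);  # bottom 10 bits
    return ($high, $low);
}

# U+1F42A: offset 0xF42A, top bits 0x3D, bottom bits 0x2A
printf "\\u%04x\\u%04x\n", surrogate_pair(0x1F42A);    # \ud83d\udc2a
```

The result matches the escape printed for 🐪 in the output above.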

edited Sep 23, 2014 at 16:20

answered Sep 22, 2014 at 12:44

clt60

2 Comments

amon Over a year ago

This solution has potential issues with Unicode characters outside of the Basic Multilingual Plane (i.e. > U+FFFF), as your code will produce escapes having more than four hex digits. See my comment to Oleg G's answer for details.

clt60 Over a year ago

@amon The OP is asking about Hindi, whose characters are inside the BMP, but you're right; for reference, I've added an example with surrogate pairs too.
