Perl: Converting strings to Unicode

Question

I have a MySQL database that stores strings with Unicode characters encoded as XML-style numeric character references (i.e., &#nnnnn;). An example of one of these strings would be &#27010;&#36848;, which represents the Unicode characters 概述.

Perl lets me make this conversion in my application if I hard-code the strings in the format:

\x{6982}\x{8ff0}

or even:

\N{U+6982}\N{U+8ff0}

To me it seems like a simple matter of changing the format from &#nnnnn; to \x{nnnn}.
The Perl application seems to require hex numbers, whereas MySQL is outputting decimal integers.

I wanted to do this simple conversion with a regex. So I matched the integer using:

m/\&\#(\d{3,5});/;

Then I converted the match to hex using: sprintf('{%04x}',$1)
Then I added in the necessary: \x{ }

I was easily able to create strings that contained "\x{6982}\x{8ff0}", but none of them were printed by the application as Unicode. They were simply printed as they were created: symbols and text.

I found out that if you hard-coded these strings into the program, Perl would "interpolate" them into Unicode characters. But if they were created as a string, the "interpolation" did not take place.
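A minimal sketch of that difference, using only core Perl. The variable names are illustrative and not part of the application. An escape such as \x{6982} is processed when a double-quoted literal in the source is compiled; a string assembled at run time just holds the backslash, braces, and hex digits as ordinary characters. The last line previews the chr function, which builds the character directly from its number.

```perl
use strict;
use warnings;

# Escapes inside a double-quoted literal are processed at compile time.
my $literal = "\x{6982}\x{8ff0}";

# Built at run time: the backslashes here are plain characters.
my $built = '\x' . sprintf('{%04x}', 27010)
          . '\x' . sprintf('{%04x}', 36848);

printf "literal has %d characters\n", length $literal;
printf "built has %d characters: %s\n", length $built, $built;

# chr turns a code point number into the character, no eval needed.
my $converted = join '', map { chr } 27010, 36848;
printf "%vX\n", $converted;
```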

I tried to force the interpolation by using various functions such as:
Encode::decode('UTF-8', "some string" );
Encode::encode('UTF-8', "some string" );
But that wasn't what those functions were intended for.
I also tried to use Perl's manual string interpolation

$v="${ \($v) }";

But that did not convert the string "\x{6982}\x{8ff0}" into Unicode. It simply remained the same string as before.

I came across an example using "eval()".


while($unicodeString =~ m/\&\#(\d{3,5});/) {
    $_=$unicodeString;     ## in the XML form of (spaces added so you could see it here): & #27010; & #36848;
    m/\&\#(\d{3,5});/;     ## Matches the integer number in the Unicode
    my $y=q(\x).sprintf('{%04x}',$1); ## Converts the integer to hex and adds the \x{}
    my $v = eval qq{"$y"}; ## Performs the interpolation of the string to get the Unicode
    $unicodeString =~ s/\&\#(\d{3,5});/$v/;  ## Replaces the old code with the new Unicode character
}

This conversion works now, but I am not happy with the repeated use of eval() to convert each character one at a time. I could build my string in the while loop and then simply eval() the new string, but I would prefer to eval() only those small strings that were specifically matched by the regex.

Is there a better way of converting an XML string (with Unicode characters shown as integers) into a string that contains the actual Unicode characters?

How can I easily go from a string that contains:

我认识到自己的长处和短处，并追求自我发展。

to one with:
我认识到自己的长处和短处，并追求自我发展。

The documents I need to convert contain thousands of these characters.

Have you already tested the CPAN module HTML::Entities? Maybe it can meet your needs.
– Ronaldo Ferreira de Lima, Commented Aug 30, 2021 at 20:48
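For reference, a hedged sketch of what that suggestion could look like. HTML::Entities is not a core module, so the example loads it defensively and skips the demonstration if it is missing. Note that decode_entities also decodes HTML named entities, which may be more than is wanted for XML-style data.

```perl
use strict;
use warnings;

# HTML::Entities comes from CPAN (HTML-Parser distribution), not core Perl,
# so load it at run time and skip the demo if it is not installed.
my $have_module = eval { require HTML::Entities; 1 };

my $decoded;
if ($have_module) {
    # In non-void context decode_entities returns a decoded copy.
    $decoded = HTML::Entities::decode_entities("&#27010;&#36848;");
    printf "%vX\n", $decoded;
}
else {
    print "HTML::Entities is not installed\n";
}
```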

Håkon Hægland · Accepted Answer · 2021-08-30 20:51:51Z

3

Here is a simple example of how you can replace the numeric Unicode escapes using the chr function:

use feature qw(say);
use strict;
use warnings;
use open qw( :encoding(utf-8) :std );

my $str = "&#27010;&#36848;";
$str =~ s/&#(\d+);/chr $1/eg;
printf "%vX\n", $str;
say $str;

Output:

6982.8FF0
概述
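The same substitution can be wrapped in a small helper when whole documents need converting. This is only a sketch: decode_decimal_refs is a hypothetical name, and the range check is an added assumption (references above U+10FFFF are left untouched rather than passed to chr). The /r flag, available from Perl 5.14, returns the modified copy instead of changing the argument.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the answer: returns a decoded copy.
sub decode_decimal_refs {
    my ($text) = @_;
    # /e evaluates the replacement as code, /g repeats for every match,
    # /r returns the result instead of modifying $text.
    return $text =~ s{&\#([0-9]+);}{
        $1 <= 0x10FFFF ? chr($1) : $&
    }egr;
}

my $decoded = decode_decimal_refs("&#27010;&#36848; &#99999999;");
printf "%vX\n", $decoded;
```

For a file, the helper could be applied line by line to a handle opened with an :encoding(UTF-8) layer, as in the answer's use open line.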

answered Aug 30, 2021 at 20:51

Håkon Hægland


1 Comment

JKandra Over a year ago

Thanks for this answer. I have tried using this suggested regex substitution and it works. The chr $1 was the key I was looking for. This really helped me to eliminate the multiple calls to eval(). Thanks.

ikegami · Accepted Answer · 2021-09-07 02:09:08Z

1

I didn't find a module that decodes XML entities, because they are normally only found in XML, where the XML parser handles them. But it's pretty easy to recreate.

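use strict;
use warnings;
use feature qw( say state );
use open qw( :std :encoding(UTF-8) );  # avoids "Wide character" warnings on output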

sub decode_xml_entities_inplace {
   state $ents = {
      amp  => "&",
      lt   => "<",
      gt   => ">",
      quot => '"',
      apos => "'",
   };
   
   $_[0] =~ s{
      &
      (?: \# (?: x([0-9a-fA-F]+)
             |   ([0-9]+)
             )
      |   (\w+)
      )
      ;
   }{
      if    (defined($1)) { chr(hex($1))      }
      elsif (defined($2)) { chr($2)           }
      else                { $ents->{$3} // $& }
   }xeg;
}

my $s = "&#27010;&#36848;";
decode_xml_entities_inplace($s);
say $s;

Of course, if you simply need to handle the decimal numeric entities, the above simplifies to

use feature qw( say );
use open qw( :std :encoding(UTF-8) );  # avoids "Wide character" warnings on output

my $s = "&#27010;&#36848;";
$s =~ s{ &\# ([0-9]+) ; }{ chr($1) }xeg;
say $s;

edited Sep 7, 2021 at 2:09

answered Aug 30, 2021 at 21:15

ikegami


3 Comments

JKandra Over a year ago

Thanks for this answer. I have tried implementing the above code. The use of "state" and "s{}{}" was a new lesson for me. I also learned a little about non-capturing groups using the "(?: )" syntax. Your solution didn't do exactly what I needed. It created new strings like the following: &#是&#一&#种&#诊&#断&#工&#具&#，&#可&#为&#您. As you can see, the Unicode characters are inserted by the chr() function that you refer to in the substitution. However, the "&#" is also being inserted with each Unicode character because of the non-capturing groups in the matching part of your regex.

JKandra Over a year ago

You also suggested an alternate (shorter) version of the replacement using: s{ &# ([0-9]+) ; }{ chr($1) }xeg; I tried implementing this solution as well. It didn't work out, but not because of any syntax error. The problem, I think, was the extra spaces before and after the characters to be captured. I reduced the spacing in the regex to: s{&#([0-9]+);}{chr $1}eg; and this worked just fine. Thanks for that suggestion.

ikegami Over a year ago

Forgot to escape # and there was a missing ). Fixed.
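A short sketch of why the escape matters, with illustrative variable names. Under the /x flag an unescaped # starts a comment that runs to the end of the line (or of the pattern), so everything after it, including the capture group, is ignored and the pattern reduces to a bare &. Writing \# or [#] keeps the # literal.

```perl
use strict;
use warnings;

my $input = "&#27010;";

# Unescaped: "# ([0-9]+) ; " is a comment under /x, so only "&" is matched.
my $unescaped_match = $input =~ m{ &# ([0-9]+) ; }x ? $& : '';

# Escaped: the whole reference is matched and $1 holds the digits.
my $escaped_match = $input =~ m{ &\# ([0-9]+) ; }x ? $& : '';
my $digits        = $1;

# A character class is an equivalent way to keep the # literal.
my $class_match = $input =~ m{ &[#] ([0-9]+) ; }x ? $& : '';

print "unescaped matched: $unescaped_match\n";
print "escaped matched:   $escaped_match (digits $digits)\n";
print "class matched:     $class_match\n";
```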
