I have a MySQL database that stores strings with the Unicode characters encoded using an XML-style format (i.e., &#nnnnn;). An example of one of these strings would be &#27010;&#36848;, which represents the Unicode characters 概述.

Perl lets me make this conversion in my application if I hard-code the strings in the format:

\x{6982}\x{8ff0}
or even:
\N{U+6982}\N{U+8ff0}

To me it seems like a simple matter of changing the format from &#nnnnn; to \x{nnnn}.
The Perl application seems to require hex numbers, whereas MySQL is outputting decimal integers.

I wanted to do this simple conversion with a regex, so I matched the integer using:

m/\&\#(\d{3,5});/;

Then I converted the match to hex using: sprintf('{%04x}',$1)
Then I added in the necessary: \x{ }

I was easily able to create strings that contained "\x{6982}\x{8ff0}", but none of them were printed by the application as Unicode. They were simply printed as they were created: symbols and text.

I found out that if you hard-coded these strings into the program, Perl would "interpolate" them into Unicode characters. But if they were created as a string, the "interpolation" did not take place.
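A minimal sketch of that difference, assuming nothing beyond core Perl (the variable names are only for illustration): escapes such as \x{...} are processed when a string literal is compiled, so a string assembled at run time keeps the backslash and braces as ordinary characters. chr builds the character directly from the number.

```perl
use strict;
use warnings;

# A literal: Perl's parser turns the escape into one character at compile time.
my $literal = "\x{6982}";

# Built at run time: this is just the eight characters \ x { 6 9 8 2 }.
my $built = '\x' . sprintf('{%04x}', 27010);

# chr() maps a code point number straight to the character; no escape parsing involved.
my $from_chr = chr(27010);

printf "literal:  length %d, code point U+%04X\n", length($literal), ord($literal);
printf "built:    length %d, text %s\n", length($built), $built;
printf "from chr: %s\n", $from_chr eq $literal ? "same as literal" : "different";
```

The run-time string has length 8, while the literal and the chr result are each a single character.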

I tried to force the interpolation by using various functions such as:
Encode::decode('UTF-8', "some string" );
Encode::encode('UTF-8', "some string" );
But that wasn't what those functions were intended for.
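For context, a short sketch of what those two functions actually do, using the core Encode module: decode turns a string of UTF-8 bytes into Perl characters, and encode goes the other way. Neither one looks at backslash escapes or &#nnnnn; references, so such text passes through unchanged. The sample values are only illustrative.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# The three UTF-8 bytes that encode U+6982.
my $bytes = "\xE6\xA6\x82";

my $chars = decode('UTF-8', $bytes);   # bytes -> 1 character
my $back  = encode('UTF-8', $chars);   # character -> 3 bytes again

# A numeric reference is plain ASCII text, so decoding leaves it alone.
my $reference = decode('UTF-8', '&#27010;');

printf "decoded: %d character, U+%04X\n", length($chars), ord($chars);
printf "encoded: %d bytes\n", length($back);
printf "reference after decode: %s\n", $reference;
```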
I also tried to use Perl's manual string interpolation:

$v="${ \($v) }";

But that did not convert the string "\x{6982}\x{8ff0}" into Unicode. It simply remained the same string as before.

I came across an example using "eval()".


while($unicodeString =~ m/\&\#(\d{3,5});/) {
    $_=$unicodeString;     ## in the XML form of (spaces added so you could see it here): & #27010; & #36848;
    m/\&\#(\d{3,5});/;     ## Matches the integer number in the Unicode
    my $y=q(\x).sprintf('{%04x}',$1); ## Converts the integer to hex and adds the \x{}
    my $v = eval qq{"$y"}; ## Performs the interpolation of the string to get the Unicode
    $unicodeString =~ s/\&\#(\d{3,5});/$v/;  ## Replaces the old code with the new Unicode character
}

This conversion works now, but I am not happy with the repeated use of eval() to convert each character one at a time. I could build the whole string in the while loop and then eval() it once, but I would prefer to eval() only the small strings that the regex specifically matched.

Is there a better way of converting an XML string (with Unicode characters shown as integers) into a string that contains the actual Unicode characters?

How can I easily go from a string that contains the &#nnnnn; references for this text:

我认识到自己的长处和短处,并追求自我发展。

to one with:
我认识到自己的长处和短处,并追求自我发展。

The documents I need to convert contain thousands of these characters.

1 Comment

Have you already tested the CPAN module HTML::Entities? Maybe it can meet your needs. Commented Aug 30, 2021 at 20:48

2 Answers


Here is a simple example of how you can replace the numeric character references using the chr function:

use feature qw(say);
use strict;
use warnings;
use open qw( :encoding(utf-8) :std );

my $str = "&#27010;&#36848;";
$str =~ s/&#(\d+);/chr $1/eg;
printf "%vX\n", $str;
say $str;

Output:

6982.8FF0
概述
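Since the question mentions documents with thousands of these references, here is a rough sketch of applying the same substitution line by line. The sample text and variable names are made up for illustration, and an in-memory filehandle stands in for a real file so the sketch runs on its own.

```perl
use strict;
use warnings;

# Hypothetical sample text standing in for a real document.
my $document = "Title: &#27010;&#36848;\nNo references on this line\n";

# Read from an in-memory filehandle; a real script would open a file here instead.
open my $in, '<', \$document or die "Cannot open input: $!";

my @decoded;
my $replaced = 0;
while (my $line = <$in>) {
    # s///g returns the number of substitutions made, or an empty string if none.
    $replaced += ($line =~ s/&#([0-9]+);/chr($1)/eg) || 0;
    push @decoded, $line;
}
close $in;

binmode(STDOUT, ':encoding(UTF-8)');   # encode wide characters on output
print @decoded;
print "References replaced: $replaced\n";
```

Each line is scanned once, and no eval is involved.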

1 Comment

Thanks for this answer. I have tried using this suggested regex substitution and it works. The chr $1 was the key I was looking for. This really helped me to eliminate the multiple calls to eval(). Thanks.

I didn't find a module that decodes XML entities, because they are normally only found in XML and the XML parser handles them. But it's pretty easy to recreate.

use feature qw( say state );

sub decode_xml_entities_inplace {
   state $ents = {
      amp  => "&",
      lt   => "<",
      gt   => ">",
      quot => '"',
      apos => "'",
   };
   
   $_[0] =~ s{
      &
      (?: \# (?: x([0-9a-fA-F]+)
             |   ([0-9]+)
             )
      |   (\w+)
      )
      ;
   }{
      if    (defined($1)) { chr(hex($1))      }
      elsif (defined($2)) { chr($2)           }
      else                { $ents->{$3} // $& }
   }xeg;
}

my $s = "&#27010;&#36848;";
decode_xml_entities_inplace($s);
say $s;

Of course, if you simply need to handle the decimal numeric entities, the above simplifies to

use feature qw( say );

my $s = "&#27010;&#36848;";
$s =~ s{ &\# ([0-9]+) ; }{ chr($1) }xeg;
say $s;
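If the input might contain numbers that are not valid Unicode code points, one possible refinement is to leave those references untouched. This is only a sketch: the function name decode_decimal_refs is made up here, and the choice to skip surrogates and values above U+10FFFF is an assumption about what you would want.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a decoded copy and leaves the argument unmodified.
sub decode_decimal_refs {
    my ($text) = @_;
    $text =~ s{&\#([0-9]+);}{
        my $cp = $1;
        # Convert only real code points; keep anything else exactly as matched.
        ($cp <= 0x10FFFF && !($cp >= 0xD800 && $cp <= 0xDFFF)) ? chr($cp) : $&
    }eg;
    return $text;
}

my $decoded = decode_decimal_refs('&#27010;&#36848;');
printf "%vX\n", $decoded;                      # code points in hex, joined by dots
print decode_decimal_refs('&#55296;'), "\n";   # a surrogate, left as-is
```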

3 Comments

Thanks for this answer. I have tried implementing the above code. The use of "state" and "s{}{}" was a new lesson for me. I also learned a little about non-capturing groups using the "(?: )" syntax. Your solution didn't exactly do what I needed. It created new strings like the following: &#是&#一&#种&#诊&#断&#工&#具&#,&#可&#为&#您. As you can see, the Unicode characters are inserted by the chr() function that you refer to in the substitution. However, the "&#" are also being inserted with the Unicode characters because of the non-capturing groups in the matching part of your regex.
You also suggested an alternate (shorter) version of the replacement using: s{ &# ([0-9]+) ; }{ chr($1) }xeg; I tried implementing this solution as well. It didn't work out, but not because of any syntax error. The problem, I think, was the extra spaces before and after the characters to be captured. I reduced the spacing in the regex to: s{&#([0-9]+);}{chr $1}eg; and this worked just fine. Thanks for that suggestion.
Forgot to escape # and there was a missing ). Fixed.
