I have a MySQL database that stores strings with the Unicode characters encoded as XML-style numeric character references (i.e., &#nnnnn;). An example of one of these strings would be: &#27010;&#36848; which represents the Unicode characters: 概述
Perl lets me make this conversion in my application if I hard-code the strings in the format:
\x{6982}\x{8ff0}
or even:
\N{U+6982}\N{U+8ff0}
To me it seems like a simple matter of changing the format from &#nnnnn; to \x{nnnn}.
Perl's \x{} escape requires hexadecimal numbers, whereas MySQL is outputting decimal integers.
I wanted to do this simple conversion with a regex. So I matched the integer using:
m/\&\#(\d{3,5});/;
Then I converted the match to hex using:
sprintf('{%04x}',$1)
Then I added in the necessary: \x{ }
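As a minimal sketch of those three steps together (the array names are only for illustration), the decimal-to-hex conversion and the added \x{ } look like this:

```perl
use strict;
use warnings;

# Turn each decimal code point into the TEXT of a \x{...} escape.
# The single quotes keep '\x' as two ordinary characters: a backslash and an x.
my @decimal = (27010, 36848);
my @escapes = map { '\x' . sprintf('{%04x}', $_) } @decimal;

print join('', @escapes), "\n";   # prints \x{6982}\x{8ff0}
```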
I was easily able to create strings that contained: "\x{6982}\x{8ff0}"
But none of them were printed by the application as Unicode characters. They were printed exactly as they were created: a backslash, braces, and hex digits.
I found out that if these strings are hard-coded in the program source, Perl "interpolates" the escapes into Unicode characters. But if the same text is built as a string at run time, the "interpolation" does not take place.
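A small sketch that seems to show the difference: as far as I understand it, the \x{...} escape is handled when Perl parses a double-quoted literal, so text assembled at run time is never looked at again. The variable names here are only illustrative.

```perl
use strict;
use warnings;

# Escapes inside a double-quoted literal are processed when the code is parsed.
my $literal = "\x{6982}\x{8ff0}";

# The same text assembled at run time stays as plain backslash-x-brace text.
my $built = '\x{6982}' . '\x{8ff0}';

print length($literal), "\n";   # 2  (two Unicode characters)
print length($built), "\n";     # 16 (sixteen ASCII characters)
```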
I tried to force the interpolation by using various functions such as:
Encode::decode('UTF-8', "some string" );
Encode::encode('UTF-8', "some string" );
But that wasn't what those functions were intended for.
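For reference, a short sketch of what those two functions appear to do instead: they translate between character strings and byte strings, and they leave the backslash text alone because it is plain ASCII. This assumes the core Encode module; the variable names are mine.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $char  = chr(0x6982);              # one character, U+6982
my $bytes = encode('UTF-8', $char);   # the UTF-8 byte sequence for it
my $back  = decode('UTF-8', $bytes);  # bytes turned back into one character

print length($char), "\n";    # 1
print length($bytes), "\n";   # 3

# The escape text is pure ASCII, so encoding it changes nothing.
my $escape_text = '\x{6982}';
my $encoded     = encode('UTF-8', $escape_text);
print $encoded eq $escape_text ? "unchanged\n" : "changed\n";   # unchanged
```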
I also tried to use Perl's manual string interpolation:
$v="${ \($v) }";
But that did not convert the string "\x{6982}\x{8ff0}" into Unicode. It simply remained the same string as before.
I came across an example using "eval()".
while ($unicodeString =~ m/\&\#(\d{3,5});/) {
    $_ = $unicodeString;                        ## in the XML form of (spaces added so you could see it here): & #27010; & #36848;
    m/\&\#(\d{3,5});/;                          ## Matches the integer number in the character reference
    my $y = q(\x) . sprintf('{%04x}', $1);      ## Converts the integer to hex and adds the \x{}
    my $v = eval qq{"$y"};                      ## Performs the interpolation of the string to get the Unicode character
    $unicodeString =~ s/\&\#(\d{3,5});/$v/;     ## Replaces the first remaining reference with the new Unicode character
}
This conversion works now. But I am not happy with the repeated use of eval() to convert the characters one at a time. I could build my string in the while loop and then eval() the whole new string once. But I would prefer to eval() only those small strings that were specifically matched by the regex.
Is there a better way of converting an XML string (with Unicode characters shown as integers) into a string that contains the actual Unicode characters?
How can I easily go from a string in which each of these characters is stored as a &#nnnnn; reference:
我认识到自己的长处和短处,并追求自我发展。
to one with:
我认识到自己的长处和短处,并追求自我发展。
The documents I need to convert contain thousands of these characters.
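One direction that might avoid eval() altogether, sketched here under the assumption that every reference is decimal (hex forms such as &#x6982; are not handled): chr() accepts the decimal code point directly, so no hex conversion is needed, and a single substitution with the /g and /e modifiers can replace every reference in one pass. The helper name decode_numeric_refs is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every decimal &#nnnnn; reference with its character.
sub decode_numeric_refs {
    my ($text) = @_;
    # /g visits every reference; /e evaluates chr($1) as code for each match.
    $text =~ s/&#(\d+);/chr($1)/ge;
    return $text;
}

my $in  = '&#27010;&#36848;';
my $out = decode_numeric_refs($in);

# Encode on output so the wide characters print as UTF-8.
binmode(STDOUT, ':encoding(UTF-8)');
print "$out\n";
```

The /e modifier compiles the replacement expression once with the rest of the program, unlike a string eval() per character. The CPAN module HTML::Entities also provides decode_entities(), which may be worth comparing if named entities need handling too.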