String substitution on UTF-8 encoded strings works fine when the regexp contains only ASCII characters, but produces garbled output when the regexp contains non-ASCII characters.

my $str = "¿más?";

$str =~ s/[?]//g; 
print "$str\n";

==> ¿más

$str =~ s/[¿]//g; 
print "$str\n";

==> m�s

UPDATE: The answers above made it clear that my original question was framed poorly. The answers focused on STDOUT, but in my actual problem I am not printing to STDOUT (I only did that to simplify the problem statement). In the actual problem, I retrieve data from an SQLite store and use it as filenames to search the file system. When I apply cleanup routines to the retrieved data, certain filenames get garbled.

One way to see this might be to simplify the example further:

my $str = "más";

$str =~ s/[?]//g; 
print "$str\n";

==> más

$str =~ s/[¿]//g; 
print "$str\n";

==> m�s

Now you can see that @ikegami's explanation does not apply. Something about the second s/// creates the problem. To be fair, both answers solved the problem as stated -- but any additional insights would be greatly appreciated!

UPDATE 2: As requested, I have added the output of sprintf's vector flag. Note: I have also changed the target substitution character from ¿ to ¡. I now think that my code above must have been copied incorrectly, as @ikegami suggested.

my $str = "más";
printf "%v02X\n", $str;

==> 6D.C3.A1.73

$str =~ s/[!]//g; 
printf "%v02X\n", $str;

==> 6D.C3.A1.73

print "$str\n";

==> más

$str =~ s/[¡]//g; 
printf "%v02X\n", $str;

==> 6D.C3.73

print "$str\n";

==> m�s

  • Despite what you claim in your update, I didn't mention STDOUT. Commented Oct 9, 2014 at 4:12
  • @ikegami - have added info you requested; also see note about ¿ vs ¡ Commented Oct 9, 2014 at 17:28

2 Answers

You are viewing your source code as if it were UTF-8, but unless you tell Perl it's UTF-8, it views it as US-ASCII, so each byte of a multi-byte character such as á or ¡ is treated as a separate character.

You say you have the following:

my $str = "más";
printf "%v02X %s\n", $str, $str;
$str =~ s/[!]//g; 
printf "%v02X %s\n", $str, $str;
$str =~ s/[¡]//g; 
printf "%v02X %s\n", $str, $str;

But you really gave the equivalent of the following to Perl:

my $str = "m\xC3\xA1s";
printf "%v02X %s\n", $str, $str;   # 6D.C3.A1.73 (the UTF-8 of más)
$str =~ s/[!]//g; 
printf "%v02X %s\n", $str, $str;   # 6D.C3.A1.73 (the UTF-8 of más)
$str =~ s/[\xC2\xA1]//g;           # Replaces either of these bytes
printf "%v02X %s\n", $str, $str;   # 6D.C3.73 (garbage)
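The brackets are what turn this into corruption: a character class matches any single one of its members, so without decoding, the two bytes of ¡ become two independent candidates. A minimal sketch of the difference, using only escaped bytes so it does not depend on how the file is saved (the variable names are illustrative, not from the question):

```perl
use strict;
use warnings;

my $bytes = "m\xC3\xA1s";                     # UTF-8 bytes of "más", not decoded

(my $by_class = $bytes) =~ s/[\xC2\xA1]//g;   # class: deletes any lone C2 or A1 byte
(my $by_seq   = $bytes) =~ s/\xC2\xA1//g;     # sequence: deletes only C2 followed by A1

printf "%v02X\n", $by_class;                  # 6D.C3.73    (á has lost its second byte)
printf "%v02X\n", $by_seq;                    # 6D.C3.A1.73 (unchanged)
```

Dropping the brackets only hides the problem for a single character; decoding the string is still the general fix.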

You want the following:

use utf8;                             # Source is encoded using UTF-8
use open ':std', ':encoding(UTF-8)';  # Terminal provides and expects UTF-8.

my $str = "más";
printf "U+%v04X %s\n", $str, $str;   # U+006D.00E1.0073 (the Unicode of más)
$str =~ s/[¡]//g;                    # Aka s/[\x{00A1}]//g
printf "U+%v04X %s\n", $str, $str;   # U+006D.00E1.0073 (the Unicode of más)

You mention you didn't get your string from the source code and that you're not outputting to STDOUT, but the fix is the same: Decode inputs and encode outputs.
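For data that arrives as raw bytes (from a database, say) and leaves as a filename, "decode inputs and encode outputs" might look like the sketch below. It assumes the stored bytes are valid UTF-8 and that the file system expects UTF-8 names; clean_name is a hypothetical helper, not something from the question. If the data comes through DBD::SQLite, its sqlite_unicode connection attribute may be able to do the decoding step for you, but that depends on the driver and version in use.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: strip ! ? and their inverted forms from a name
# that is held as raw UTF-8 bytes.
sub clean_name {
    my ($raw_bytes) = @_;
    my $chars = decode('UTF-8', $raw_bytes);   # bytes -> characters
    $chars =~ s/[\x{00A1}\x{00BF}!?]//g;       # U+00A1 is the inverted !, U+00BF the inverted ?
    return encode('UTF-8', $chars);            # characters -> bytes for the file system
}

my $cleaned = clean_name("\xC2\xBFm\xC3\xA1s?");   # bytes of "¿más?"
printf "%v02X\n", $cleaned;                        # 6D.C3.A1.73 (the UTF-8 of más)
```

Inside clean_name the regex works on whole characters, so the class can no longer match half of á.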


4 Comments

perlopentut models ':encoding(UTF-8)' whereas use open models ':encoding(utf8)'. Is there a notable difference between the two, and which do you recommend, if either?
@Miller, UTF-8 is a standard encoding. UTF8 is a Perl-specific lax variant.
Updated in response to change of question.
Thanks, @ikegami! Great to know about sprintf's vector flag.
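The UTF-8 versus utf8 distinction from the comments above can be checked by asking the Encode module which encoding object each name resolves to. A small sketch (the variable names are illustrative):

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

my $strict = find_encoding('UTF-8')->name;   # the standard, strict encoding
my $lax    = find_encoding('utf8')->name;    # Perl's lax variant

print "$strict\n";   # utf-8-strict
print "$lax\n";      # utf8
```

Unless you specifically need the lax behaviour, the strict 'UTF-8' name is the safer one to pass to decode, encode and :encoding(...).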

Specify the encoding of your source code using utf8 and output using binmode:

#!/usr/bin/env perl
use strict;
use warnings;
use utf8;

binmode STDOUT, ':encoding(UTF-8)';

my $str = "¿más?";

$str =~ s/[?]//g; 
print "$str\n";

$str = "¿más?";
$str =~ s/[¿]//g; 
print "$str\n";

Outputs:

¿más
más?

4 Comments

It works (thank you!), but what does STDOUT have to do with s///?
You're printing the results, which go to STDOUT by default, so you must specify that encoding too.
But results printed fine after first substitution.
Observe @ikegami's answer for a potential explanation. Either way the solution is the same.
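One way to see why the first substitution printed fine, with no output layer involved, is to compare the undecoded and decoded forms of the same string. This is a rough sketch with illustrative variable names, written with escaped bytes so the result does not depend on the file's encoding:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "m\xC3\xA1s";                 # what Perl holds without "use utf8" or decoding
my $chars = decode('UTF-8', $bytes);      # the same text as characters

print length($bytes), "\n";               # 4 (á occupies two elements)
print length($chars), "\n";               # 3

# An ASCII-only class cannot match either byte of á, so the bytes pass
# through intact and a UTF-8 terminal still renders them correctly.
(my $ascii_sub = $bytes) =~ s/[?]//g;
print $ascii_sub eq $bytes ? "bytes untouched\n" : "bytes changed\n";   # bytes untouched
```

The damage happens inside s/// as soon as the class contains a byte that also occurs in the data; print merely makes it visible.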
