Perl string substitution garbles Unicode string

Question

String substitution on UTF-8 encoded strings works fine when the regexp contains only ASCII characters, but produces garbled output when the regexp contains non-ASCII characters.

my $str = "¿más?";

$str =~ s/[?]//g; 
print "$str\n";

==> ¿más

$str =~ s/[¿]//g; 
print "$str\n";

==> m�s

UPDATE: The answers above made it clear that my original question was framed poorly. The answers focused on STDOUT, but in my actual problem I am not printing to STDOUT (I only did that to simplify the problem statement). In the actual problem, I retrieve data from an SQLite store and use it as filenames to search the file system. When I apply cleanup routines to the retrieved data, certain filenames get garbled.

One way to see this might be to simplify the example further:

my $str = "más";

$str =~ s/[?]//g; 
print "$str\n";

==> más

$str =~ s/[¿]//g; 
print "$str\n";

==> m�s

Now you can see that @ikegami's explanation does not apply. Something about the second s/// creates the problem. To be fair, both answers solved the problem as stated -- but any additional insights would be greatly appreciated!

UPDATE 2: As requested, have added sprintf's vector flag output. Note: Have also changed the target substitution character from ¿ to ¡ -- I now think that my code above (as @ikegami suggested) must have been copied incorrectly.

my $str = "más";
printf "%v02X\n", $str;

==> 6D.C3.A1.73

$str =~ s/[!]//g; 
printf "%v02X\n", $str;

==> 6D.C3.A1.73

print "$str\n";

==> más

$str =~ s/[¡]//g; 
printf "%v02X\n", $str;

==> 6D.C3.73

print "$str\n";

==> m�s

Despite what you claim in your update, I didn't mention STDOUT.
– ikegami, Commented Oct 9, 2014 at 4:12
@ikegami - have added info you requested; also see note about ¿ vs ¡
– Phil Mitchell, Commented Oct 9, 2014 at 17:28

ikegami · Accepted Answer · 2014-10-09 19:28:36Z

3

You are viewing your source code as if it was UTF-8, but unless you tell Perl it's UTF-8, it views it as US-ASCII.

You say you have the following:

my $str = "más";
printf "%v02X %s\n", $str, $str;
$str =~ s/[!]//g; 
printf "%v02X %s\n", $str, $str;
$str =~ s/[¡]//g; 
printf "%v02X %s\n", $str, $str;

But you really gave the equivalent of the following to Perl:

my $str = "m\xC3\xA1s";
printf "%v02X %s\n", $str, $str;   # 6D.C3.A1.73 (the UTF-8 of más)
$str =~ s/[!]//g; 
printf "%v02X %s\n", $str, $str;   # 6D.C3.A1.73 (the UTF-8 of más)
$str =~ s/[\xC2\xA1]//g;           # Removes either of these bytes
printf "%v02X %s\n", $str, $str;   # 6D.C3.73 (garbage)

You want the following:

use utf8;                             # Source is encoded using UTF-8
use open ':std', ':encoding(UTF-8)';  # Terminal provides and expects UTF-8.

my $str = "más";
printf "U+%v04X %s\n", $str, $str;   # U+006D.00E1.0073 (the Unicode of más)
$str =~ s/[¡]//g;                    # Aka s/[\x{00A1}]//g
printf "U+%v04X %s\n", $str, $str;   # U+006D.00E1.0073 (the Unicode of más)

You mention you didn't get your string from the source code and that you're not outputting to STDOUT, but the fix is the same: Decode inputs and encode outputs.
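As a minimal sketch of "decode inputs, encode outputs" when nothing is printed to STDOUT: the byte string below stands in for a value read from SQLite (an assumption: that the database hands back raw UTF-8 bytes), and clean_name is a hypothetical cleanup helper, not something from the question. The pattern uses \x{...} escapes so the example does not depend on how the source file itself is encoded.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical cleanup routine: works on decoded text (characters), not bytes.
sub clean_name {
    my ($text) = @_;
    $text =~ s/[\x{A1}\x{BF}!?]//g;    # strip ¡ ¿ ! ?
    return $text;
}

my $raw   = "\xC2\xA1m\xC3\xA1s!";     # stand-in for database bytes: UTF-8 of ¡más!
my $text  = decode('UTF-8', $raw);     # decode input: now 5 characters
my $clean = clean_name($text);         # "más" as 3 characters
my $bytes = encode('UTF-8', $clean);   # encode output, e.g. before using it as a filename

printf "%vX\n", $bytes;                # 6D.C3.A1.73
```

Because the substitution runs on characters, the á (U+00E1) is never split into its two UTF-8 bytes, so the \xA1 byte cannot be removed on its own.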

edited Oct 9, 2014 at 19:28

answered Oct 9, 2014 at 1:46

ikegami


4 Comments

Miller Over a year ago

perlopentut models ':encoding(UTF-8)' whereas use open models ':encoding(utf8)'. Is there a notable difference between these two and which do you recommend if either?

ikegami Over a year ago

@Miller, UTF-8 is a standard encoding. UTF8 is a Perl-specific lax variant.
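One way to see that the two spellings are different encodings is to ask the Encode module what each name resolves to. This is only a small sketch; per the Encode documentation, the strict variant rejects sequences the standard disallows (such as code points above U+10FFFF), while the lax one accepts them.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Encode resolves the two spellings to different encoding objects.
my $strict = find_encoding('UTF-8')->name;
my $lax    = find_encoding('utf8')->name;

print "$strict\n";   # utf-8-strict
print "$lax\n";      # utf8
```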

ikegami Over a year ago

Updated in response to change of question.

Phil Mitchell Over a year ago

Thanks, @ikegami! Great to know about sprintf's vector flag.

Miller · Accepted Answer · 2014-10-09 01:42:33Z

3

Declare the encoding of your source code with use utf8 and set the output encoding with binmode:

#!/usr/bin/env perl
use strict;
use warnings;
use utf8;

binmode STDOUT, ':encoding(UTF-8)';

my $str = "¿más?";

$str =~ s/[?]//g; 
print "$str\n";

$str = "¿más?";
$str =~ s/[¿]//g; 
print "$str\n";

Outputs:

¿más
más?

answered Oct 9, 2014 at 1:42

Miller


4 Comments

Phil Mitchell Over a year ago

It works (thank you!), but what does STDOUT have to do with s///?

Miller Over a year ago

You're printing the results (which defaults to STDOUT), therefore you must specify that encoding too.

Phil Mitchell Over a year ago

But results printed fine after first substitution.

Miller Over a year ago

Observe @ikegami's answer for a potential explanation. Either way the solution is the same.
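A sketch of why the output looked fine after the first substitution: without use utf8, both the string and the pattern are bytes. An ASCII-only character class cannot match any byte of a multi-byte character, so the UTF-8 passes through intact; a class written as ¡ is really the two bytes \xC2 and \xA1, and \xA1 also occurs inside á. The helper is_valid_utf8 is hypothetical and only wraps the core utf8::decode function.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the byte string is well-formed UTF-8.
# utf8::decode works in place, so it is applied to a copy.
sub is_valid_utf8 {
    my ($copy) = @_;
    return utf8::decode($copy) ? 1 : 0;
}

my $bytes = "m\xC3\xA1s";                      # what Perl sees for "más" without `use utf8`

(my $ascii_pat = $bytes) =~ s/[!]//g;          # ASCII-only class: no byte of á can match
(my $byte_pat  = $bytes) =~ s/[\xC2\xA1]//g;   # s/[¡]//g as bytes: strips \xA1 out of á

printf "%vX valid=%d\n", $ascii_pat, is_valid_utf8($ascii_pat);   # 6D.C3.A1.73 valid=1
printf "%vX valid=%d\n", $byte_pat,  is_valid_utf8($byte_pat);    # 6D.C3.73 valid=0
```

The same reasoning explains why the original ¿ example (bytes \xC2\xBF) would not have damaged "más": neither of those bytes appears in the UTF-8 encoding of á.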
