IO::Handle to get and unget Unicode characters

Question

I think I've run into a problem with Unicode and IO::Handle, though it's very likely I'm doing something wrong. I want to get and unget individual Unicode characters (not bytes) from an IO::Handle, but I'm getting a surprising error.

#!/usr/local/bin/perl

use 5.016;
use utf8;
use strict;
use warnings;

binmode(STDIN,  ':encoding(utf-8)');
binmode(STDOUT, ':encoding(utf-8)');
binmode(STDERR, ':encoding(utf-8)');

my $string = qq[a Å];
my $fh = IO::File->new();

$fh->open(\$string, '<:encoding(UTF-8)');

say $fh->getc(); # a
say $fh->getc(); # SPACE
say $fh->getc(); # Å LATIN CAPITAL LETTER A WITH RING ABOVE (U+00C5)
$fh->ungetc(ord("Å"));
say $fh->getc(); # should be A RING again.

The ungetc() line leads to two messages. The first is "Malformed UTF-8 character (unexpected end of string) in say at unicode.pl line 21." The second is: "\x{00c5}" does not map to utf8 at unicode.pl line 21. But 0xC5 is the correct code point for the character (U+00C5), so it should map back to that character.

I used a hex editor to make sure that the bytes for A-RING are correct for UTF-8.

This seems to be a problem for any two-byte character.

The final say outputs '\xC5' (literally four characters: backslash, x, C, 5).

And I've tested this by reading from files instead of scalar variables. The result is the same.

This is perl 5, version 16, subversion 2 (v5.16.2) built for darwin-2level

And the script is saved in UTF-8. That was the first thing I checked.

Tip: binmode(STDIN, ':encoding(utf-8)'); binmode(STDOUT, ':encoding(utf-8)'); binmode(STDERR, ':encoding(utf-8)'); can be written as use open ':std', ':encoding(utf-8)';. (Bonus: it happens at compile time.)
– ikegami, Commented Jan 6, 2013 at 6:46
Tip: my $fh = IO::File->new(); $fh->open(\$string, '<:encoding(UTF-8)'); can be written as open(my $fh, '<:encoding(UTF-8)', \$string). Yes, you can still use the methods.
– ikegami, Commented Jan 6, 2013 at 6:51
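
For reference, the two tips above could be combined roughly as follows. This is only a sketch: the explicit encode() step is an assumption added here so that the in-memory handle is opened on a byte string, and the @chars array exists only for illustration.

```perl
use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';   # replaces the three binmode() calls
use Encode qw(encode);
use IO::Handle;                        # makes $fh->getc available on older perls too

# Assumption: feed the in-memory handle UTF-8 bytes, not a character string.
my $bytes = encode('UTF-8', "a \x{C5}");

open(my $fh, '<:encoding(UTF-8)', \$bytes)
    or die "cannot open in-memory handle: $!";

# The method syntax still works on a lexical handle opened this way.
my @chars;
while (defined(my $c = $fh->getc())) {
    push @chars, $c;
}
close($fh);

printf "%d characters, last is U+%04X\n", scalar(@chars), ord($chars[-1]);
# prints "3 characters, last is U+00C5"
```

Reading works character by character here; only the ungetc() direction is at issue in this question.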

tchrist · Accepted Answer · 2013-01-06 23:02:34Z

2

I am pretty certain this proves there is a serious Unicode-processing bug going on, given that this output:

perl5.16.0 ungettest
ungettest 98896 @ Sun Jan  6 16:01:08 2013: sending normal line to kid
ungettest 98896 @ Sun Jan  6 16:01:08 2013: await()ing kid
ungettest 98897 @ Sun Jan  6 16:01:08 2013: ungetting litte z
ungettest 98897 @ Sun Jan  6 16:01:08 2013: ungetting big sigma
ungettest 98897 @ Sun Jan  6 16:01:08 2013: kid looping on parental input
98897: Unexpected fatalized warning: utf8 "\xA3" does not map to Unicode at ungettest line 40, <STDIN> line 1.
 at ungettest line 10, <STDIN> line 1.
    main::__ANON__('utf8 "\xA3" does not map to Unicode at ungettest line 40, <ST...') called at ungettest line 40
98896: parent pclose failed: 65280,  at ungettest line 28.
Exit 255

is produced by this program:

#!/usr/bin/env perl

use v5.16;
use strict;
use warnings;
use open qw( :utf8    :std );

use Carp;

$SIG{__WARN__} = sub {  confess "$$: Unexpected fatalized warning: @_" };

sub ungetchar($) {
    my $char = shift();
    confess "$$: expected single character pushback, not <$char>" if length($char) != 1;
    STDIN->ungetc(ord $char);
}

sub debug {
    my $now = localtime(time());
    print STDERR "$0 $$ \@ $now: @_\n";
}

if (open(STDOUT, "|-")                          // confess "$$: cannot fork: $!") {
    $| = 1;
    debug("sending normal line to kid");
    say "From \N{greek:alpha} to \N{greek:omega}.";
    debug("await()ing kid");
    close(STDOUT)                               || confess "$$: parent pclose failed: $?, $!";
    debug("child finished, parent exiting normally");
    exit(0);
}

debug("ungetting litte z");
ungetchar("z")                                  || confess "$$: ASCII ungetchar failed: $!";

debug("ungetting big sigma");
ungetchar("\N{greek:Sigma}")                    || confess "$$: Unicode ungetchar failed: $!";

debug("kid looping on parental input");
while (<STDIN>) {
    chomp;
    debug("kid got $_");
}
close(STDIN)                                    || confess "$$: child pclose failed: $?, $!";
debug("parent closed pipe, child exiting normally");
exit 0;

5 Comments

Michael Over a year ago

So it looks like ord() is returning the correct unicode code point, but ungetc() isn't interpreting it as a code point, but a raw sequence of bytes. Does that sound right? If it is, then it's a significant bug.

tchrist Over a year ago

@Michael Yes, in perlio.c, in PerlIO_ungetc(PerlIO *f, int ch), what is happening is that there is essentially a casting of an int into a char via STDCHAR buf = ch, where a STDCHAR is just a C char type. This is a Big Problem. Please file a bug report. Thank you.

Michael Over a year ago

Bug report away (via perlbug). How long will it take for a confirmation email that they received it?

tchrist Over a year ago

@Michael You should have received the auto-reply ticket virtually immediately. However, it doesn’t seem to have made its way through the whole system yet. At least, I haven’t gotten the p5p copy yet.

Michael Over a year ago

Bug report made it this time: rt.perl.org/rt3/Public/Bug/Display.html?id=116322
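
The narrowing described in the comments above can be checked with plain arithmetic. The sketch below only emulates the int-to-char conversion by masking off the low byte; low_byte is a made-up helper, and an 8-bit STDCHAR is assumed.

```perl
use strict;
use warnings;

# Emulate storing an int in an 8-bit char: only the low byte survives.
# (Assumption: STDCHAR is 8 bits wide.)
sub low_byte { return $_[0] & 0xFF }

# z, LATIN CAPITAL LETTER A WITH RING ABOVE, GREEK CAPITAL LETTER SIGMA
for my $cp (0x7A, 0xC5, 0x3A3) {
    printf "U+%04X -> byte 0x%02X\n", $cp, low_byte($cp);
}
# prints:
# U+007A -> byte 0x7A
# U+00C5 -> byte 0xC5
# U+03A3 -> byte 0xA3
```

The last line matches the "\xA3" in the fatalized warning for the big sigma, and 0xC5 matches the byte reported in the question.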

ikegami · Accepted Answer · 2013-01-06 22:08:45Z

1

ungetc prepends a byte to the underlying input stream. To return U+00C5, the stream would have to contain C3 A5 (the UTF-8 encoding of that character), not C5 (ord("Å")). Use IO::Unread's unread instead.
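
To make the difference between the code point and its encoding concrete, here is a small sketch using the core Encode module. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char  = "\x{C5}";    # LATIN CAPITAL LETTER A WITH RING ABOVE
my @bytes = unpack 'C*', encode('UTF-8', $char);

printf "code point:  U+%04X\n", ord($char);
printf "UTF-8 bytes: %s\n", join ' ', map { sprintf '%02X', $_ } @bytes;
# prints:
# code point:  U+00C5
# UTF-8 bytes: C3 A5
```

So ord() yields a single number, 0xC5, while a UTF-8 stream needs the two bytes C3 A5 to represent the same character.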

edited Jan 6, 2013 at 22:08

answered Jan 6, 2013 at 6:17

5 Comments

Michael Over a year ago

The documentation for ungetc() in IO::Handle says that it pushes a character back onto the stream. And since the stream is UTF-8 encoded, it should work, shouldn't it?

ikegami Over a year ago

Character is a vague term. In this case, it's using the C definition. <repeat answer here>

Michael Over a year ago

Except that it isn't using the C definition when reading characters.

ikegami Over a year ago

It would make sense for your file to contain 20 20 20 (three spaces in UTF-8). Would it make sense for your file to contain C5 20 20 20? No, prepending C5 makes no sense.

Michael Over a year ago

But nowhere am I asking it to prepend C5.
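
One way to sidestep the disagreement is to avoid ungetc() entirely and keep pushed-back characters in a Perl-level buffer. The following is a minimal sketch, not IO::Unread and not anything from IO::Handle: make_char_reader and the two closures it returns are hypothetical names introduced here.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: returns a (get, unget) pair of closures for a handle.
# Pushed-back characters live in a Perl array and never touch the PerlIO layers.
sub make_char_reader {
    my ($fh) = @_;
    my @pushback;

    my $get = sub {
        return @pushback ? pop(@pushback) : getc($fh);
    };
    my $unget = sub {
        my ($char) = @_;
        die "expected exactly one character\n"
            unless defined $char && length($char) == 1;
        push @pushback, $char;
        return 1;
    };
    return ($get, $unget);
}

# Assumption: the in-memory handle is opened on UTF-8 bytes.
my $bytes = encode('UTF-8', "a \x{C5}");
open(my $fh, '<:encoding(UTF-8)', \$bytes)
    or die "cannot open in-memory handle: $!";

my ($get, $unget) = make_char_reader($fh);

my @seen = map { $get->() } 1 .. 3;    # 'a', ' ', U+00C5
$unget->($seen[-1]);                   # push the A-ring back
my $again = $get->();

printf "re-read U+%04X\n", ord($again);    # prints "re-read U+00C5"
```

The catch is that the pushback is only visible through the $get closure; a direct read such as <$fh> on the same handle will not see it.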
