
I think I've run into a problem with Unicode and IO::Handle, though it's quite possible I'm doing something wrong. I want to get and unget individual Unicode characters (not bytes) from an IO::Handle, but I'm getting a surprising error.

#!/usr/local/bin/perl

use 5.016;
use utf8;
use strict;
use warnings;
use IO::File;

binmode(STDIN,  ':encoding(utf-8)');
binmode(STDOUT, ':encoding(utf-8)');
binmode(STDERR, ':encoding(utf-8)');

my $string = qq[a Å];
my $fh = IO::File->new();

$fh->open(\$string, '<:encoding(UTF-8)');

say $fh->getc(); # a
say $fh->getc(); # SPACE
say $fh->getc(); # Å LATIN CAPITAL LETTER A WITH RING ABOVE (U+00C5)
$fh->ungetc(ord("Å"));
say $fh->getc(); # should be A RING again.

The error message from the ungetc() line is "Malformed UTF-8 character (unexpected end of string) in say at unicode.pl line 21. "\x{00c5}" does not map to utf8 at unicode.pl line 21." But C5 is the correct hex value for that code point, so it should map to the character.

I used a hex editor to make sure that the bytes for A-RING are correct for UTF-8.

This seems to be a problem for any two-byte character.

The final say outputs '\xC5' (literally four characters: backslash, x, C, 5).

And I've tested this by reading from files instead of scalar variables. The result is the same.

This is perl 5, version 16, subversion 2 (v5.16.2) built for darwin-2level

And the script is saved in UTF-8. That was the first thing I checked.

  • Tip: binmode(STDIN, ':encoding(utf-8)'); binmode(STDOUT, ':encoding(utf-8)'); binmode(STDERR, ':encoding(utf-8)'); can be written use open ':std', ':encoding(utf-8)';. (Bonus, it happens at compile time.) Commented Jan 6, 2013 at 6:46
  • Tip: my $fh = IO::File->new(); $fh->open(\$string, '<:encoding(UTF-8)'); can be written open(my $fh, '<:encoding(UTF-8)', \$string). Yes, you can still use the methods. Commented Jan 6, 2013 at 6:51

2 Answers


I am pretty certain this proves there is a serious Unicode-processing bug going on, given that this output:

perl5.16.0 ungettest
ungettest 98896 @ Sun Jan  6 16:01:08 2013: sending normal line to kid
ungettest 98896 @ Sun Jan  6 16:01:08 2013: await()ing kid
ungettest 98897 @ Sun Jan  6 16:01:08 2013: ungetting litte z
ungettest 98897 @ Sun Jan  6 16:01:08 2013: ungetting big sigma
ungettest 98897 @ Sun Jan  6 16:01:08 2013: kid looping on parental input
98897: Unexpected fatalized warning: utf8 "\xA3" does not map to Unicode at ungettest line 40, <STDIN> line 1.
 at ungettest line 10, <STDIN> line 1.
    main::__ANON__('utf8 "\xA3" does not map to Unicode at ungettest line 40, <ST...') called at ungettest line 40
98896: parent pclose failed: 65280,  at ungettest line 28.
Exit 255

is produced by this program:

#!/usr/bin/env perl

use v5.16;
use strict;
use warnings;
use open qw( :utf8    :std );

use Carp;

$SIG{__WARN__} = sub {  confess "$$: Unexpected fatalized warning: @_" };

sub ungetchar($) {
    my $char = shift();
    confess "$$: expected single character pushback, not <$char>" if length($char) != 1;
    STDIN->ungetc(ord $char);
}

sub debug {
    my $now = localtime(time());
    print STDERR "$0 $$ \@ $now: @_\n";
}

if (open(STDOUT, "|-")                          // confess "$$: cannot fork: $!") {
    $| = 1;
    debug("sending normal line to kid");
    say "From \N{greek:alpha} to \N{greek:omega}.";
    debug("await()ing kid");
    close(STDOUT)                               || confess "$$: parent pclose failed: $?, $!";
    debug("child finished, parent exiting normally");
    exit(0);
}

debug("ungetting litte z");
ungetchar("z")                                  || confess "$$: ASCII ungetchar failed: $!";

debug("ungetting big sigma");
ungetchar("\N{greek:Sigma}")                    || confess "$$: Unicode ungetchar failed: $!";

debug("kid looping on parental input");
while (<STDIN>) {
    chomp;
    debug("kid got $_");
}
close(STDIN)                                    || confess "$$: child pclose failed: $?, $!";
debug("parent closed pipe, child exiting normally");
exit 0;

5 Comments

So it looks like ord() is returning the correct Unicode code point, but ungetc() is interpreting it as a raw byte rather than as a code point. Does that sound right? If so, then it's a significant bug.
@Michael Yes, in perlio.c, in PerlIO_ungetc(PerlIO *f, int ch), what is happening is that there is essentially a casting of an int into a char via STDCHAR buf = ch, where a STDCHAR is just a C char type. This is a Big Problem. Please file a bug report. Thank you.
Bug report away (via perlbug). How long will it take for a confirmation email that they received it?
@Michael You should have received the auto-reply ticket virtually immediately. However, it doesn’t seem to have made its way through the whole system yet. At least, I haven’t gotten the p5p copy yet.
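
The truncation described in the comments above can be illustrated without touching any file handle. This is only a sketch: low_byte is a hypothetical helper, and it assumes an 8-bit char, so the cast keeps just the low 8 bits of the code point. Under that assumption, big sigma (U+03A3) collapses to the byte A3 that appears in the error output, while A-ring (U+00C5) collapses to C5.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the low 8 bits of a code point,
# which is what a cast from int to an 8-bit char would leave behind.
sub low_byte {
    my ($code_point) = @_;
    return $code_point & 0xFF;
}

# little z, A-ring, big sigma
for my $cp (0x7A, 0xC5, 0x3A3) {
    printf "U+%04X -> byte %02X\n", $cp, low_byte($cp);
}
# prints:
# U+007A -> byte 7A
# U+00C5 -> byte C5
# U+03A3 -> byte A3
```

Neither C5 nor A3 is a complete UTF-8 sequence on its own, which is consistent with both error messages quoted on this page.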

ungetc prepends a byte to the underlying input stream. To return U+00C5, the stream would have to contain C3 A5 (the UTF-8 encoding of that character), not C5 (ord("Å")). Use IO::Unread's unread instead.
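
To make the byte-versus-character distinction concrete, here is a minimal sketch using the core Encode module. The helpers utf8_pushback_bytes and unget_char are hypothetical names, not part of IO::Handle or IO::Unread. The pushback part is an assumption: it relies on the handle accepting more than one byte of pushback, so treat it as an experiment rather than a guaranteed workaround.

```perl
use strict;
use warnings;
use Encode qw(encode);
use IO::Handle;

# Hypothetical helper: the UTF-8 bytes of one character, last byte first.
# That is the order in which they would have to be pushed back so that
# the first byte ends up at the front of the stream.
sub utf8_pushback_bytes {
    my ($char) = @_;
    return reverse unpack 'C*', encode('UTF-8', $char);
}

# Hypothetical helper: push a character back one UTF-8 byte at a time.
# ASSUMPTION: the handle tolerates several bytes of pushback.
sub unget_char {
    my ($fh, $char) = @_;
    $fh->ungetc($_) for utf8_pushback_bytes($char);
    return;
}

my $a_ring = "\x{C5}";    # LATIN CAPITAL LETTER A WITH RING ABOVE
printf "code point: %X\n", ord $a_ring;
printf "UTF-8 bytes: %s\n",
    join ' ', map { sprintf '%02X', $_ } unpack 'C*', encode('UTF-8', $a_ring);

# Read three characters from an in-memory UTF-8 byte string, then try
# to push the last one back and read it again.
my $bytes = encode('UTF-8', "a $a_ring");
open(my $fh, '<:encoding(UTF-8)', \$bytes) or die "open failed: $!";
my @chars;
push @chars, $fh->getc for 1 .. 3;
unget_char($fh, $chars[2]);
my $again = $fh->getc;
printf "read back: U+%04X\n", ord $again if defined $again;
```

The first two lines of output show the mismatch the answer describes: the code point is C5, but the stream needs C3 A5.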

5 Comments

The documentation for ungetc() in IO::Handle says that it pushes a character back onto the stream. And since the stream is UTF-8 encoded, it should work, shouldn't it?
Character is a vague term. In this case, it's using the C definition. <repeat answer here>
Except that it isn't using the C definition when reading characters.
It would make sense for your file to contain 20 20 20 (three spaces in UTF-8). Would it make sense for your file to contain C5 20 20 20? No, prepending C5 makes no sense.
But nowhere am I asking it to prepend C5.
