Perl unicode file with non-unicode content

Question

A piece of software is producing UTF-8 files, but it writes content to them that isn't Unicode. I can't change that software and have to take the output as it is. I don't know whether this will show up correctly here, but a German umlaut "ä" appears in the file as "Ã¤".

If I open the file in Notepad++, it tells me the file is encoded as UTF-8 (without BOM). If I then choose "convert to ANSI" in Notepad++ and switch the file encoding back to UTF-8 (without converting), the German umlauts in the file are correct. How can I achieve exactly the same behaviour in Perl? Whatever I have tried so far only made the umlaut mess worse.

To reproduce, create a UTF-8 encoded file and write this to it: MÃ¤nner SchÃ¼le VÃ¶ogel SÃ¼Ã

Then, in a UTF-8 MySQL database, create a table with a varchar field and utf8_unicode encoding. Now use this script:

use utf8;
use DBI;
use Encode;
if (open FILE, "test.csv") {
  my $db = DBI->connect(
    'DBI:mysql:your_db;host=127.0.0.1;mysql_compression=1', 'root', 'Yourpass',
    { PrintError => 1 }
  );
  my $sql = qq{SET NAMES 'utf8';};
  $db->do($sql);
  while (my $line = <FILE>) {
    my $sth = $db->prepare("INSERT IGNORE INTO testtable (testline) VALUES (?);");
    $sth->execute($line);
  }
}

The exact contents of the file get written to the database. But the output I expect in the database has the German umlauts:

Männer Schüler Vögel Süß

So, how can I convert that correctly?

There's no such thing as a "non-unicode file". Please provide examples of your input, your script, the current output, and the expected output, so people can get an idea of where your problem is.
– Oleg V. Volkov, Commented Jul 6, 2012 at 12:22
Off-topic: you should place the prepare outside of the while loop. The way you do it, the expensive prepare is repeated for each line of your file. See the outline section of the DBI documentation for more info.
– simbabque, Commented Jul 6, 2012 at 12:48
Try adding utf8::decode($line); as the first statement of your while loop. (It decodes in place; its return value is only a success flag, so don't assign it back to $line.)
– raina77ow, Commented Jul 6, 2012 at 12:50

raina77ow · Accepted Answer · 2012-07-06 12:42:42Z

It's ironic: as I see it, the software you talk about is not writing "non-unicode content" (that's nonsense); it encodes the text as UTF-8 twice. Take the character ä, for example: in UTF-8 it is represented by two bytes, 0xC3 0xA4. But then something in that program decides to treat these bytes as Latin-1 instead, so they become two separate characters, Ã and ¤, which are eventually encoded into UTF-8 again, and that is what gets saved to the file.
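To make the byte-level picture concrete, here is a minimal sketch that reproduces the double encoding for a single ä with the core Encode module. It assumes the faulty step reads the bytes as ISO-8859-1; the variable names and the `hex_octets` helper are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Illustrative helper: render a byte string as space-separated hex octets.
sub hex_octets {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $octets;
}

# The character "ä" (U+00E4), written as an escape so this source stays ASCII.
my $char = "\x{E4}";

# First, correct pass: one character becomes two octets.
my $once = encode('UTF-8', $char);

# The faulty step (assumed): the two octets are read as two Latin-1 characters ...
my $misread = decode('ISO-8859-1', $once);

# ... and those two characters are encoded as UTF-8 once more.
my $twice = encode('UTF-8', $misread);

print hex_octets($once),  "\n";   # C3 A4
print hex_octets($twice), "\n";   # C3 83 C2 A4
```

The four octets on the second line are what a UTF-8 aware editor shows as "Ã¤".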

I suppose the simplest way of reversing this is to make Perl treat the string read from the file as a series of bytes (and not a sequence of characters) and decode it a second time. It can be as simple (and as ugly) as...

open my $fh, '<:utf8', $file_name or die $!;
my $string = <$fh>;      # a sequence of characters (first decoding pass)
utf8::decode($string);   # in place: treat those characters as octets and decode them again
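One detail that is easy to trip over: utf8::decode modifies its argument in place and returns only a success flag, so the result must not be assigned back to the string. The sketch below shows this without a file; the literal stands in for a line that was read through the :utf8 layer, and utf8::upgrade is used only to mimic the internal state such a line would have.

```perl
use strict;
use warnings;

# Stand-in for a line read through the :utf8 layer from the double-encoded
# file: the characters M, U+00C3, U+00A4, n, n, e, r.
my $string = "M\x{C3}\x{A4}nner";
utf8::upgrade($string);    # mimic the UTF8-flagged string the layer would yield

# Second decoding pass. $string is changed in place; $ok is only a flag.
my $ok = utf8::decode($string);

printf "ok=%d length=%d second=U+%04X\n",
    ($ok ? 1 : 0), length($string), ord(substr($string, 1, 1));
# prints: ok=1 length=6 second=U+00E4
```

This works as long as every character in the line is below U+0100, which holds when the surplus encoding step treated the bytes as Latin-1; if utf8::decode returns false, the line was left unchanged.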

tripleee · Accepted Answer · 2012-07-06 16:34:29Z

Sounds like something is converting it a second time, assuming it to be something like ISO 8859-15 and then converting that to UTF-8. You can reverse this by converting UTF-8 to ISO 8859-15 (or whichever encoding seems to make sense for your data).

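As seen on http://www.fileformat.info/info/unicode/char/E4/index.htm the bytes 0xC3 0xA4 are the valid UTF-8 encoding of ä. When viewed as ISO 8859-1 or Windows-1252 they display as the string Ã¤. (In ISO 8859-15 the byte 0xA4 is the euro sign, so there the pair would display as Ã€; since your file shows Ã¤, ISO 8859-1 or Windows-1252 is the more likely intermediate encoding.)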
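The reversal described above can be sketched with the core Encode module. This is an illustration under the assumption that the surplus step used ISO-8859-1; `undo_double_utf8` is a hypothetical helper name, and for text that went through Windows-1252 the encoding name would have to be 'cp1252' instead.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: undo one surplus round of UTF-8 encoding.
# Input:  octets as read from the file without any I/O layer.
# Output: octets that are ordinary, singly encoded UTF-8.
sub undo_double_utf8 {
    my ($octets) = @_;
    my $chars = decode('UTF-8', $octets);    # C3 83 C2 A4 -> U+00C3 U+00A4
    return encode('ISO-8859-1', $chars);     # U+00C3 U+00A4 -> C3 A4
}

# "MÃ¤nner" as it sits in the file, written as octet escapes.
my $broken = "M\xC3\x83\xC2\xA4nner";
my $fixed  = undo_double_utf8($broken);

# $fixed now holds the octets 4D C3 A4 6E 6E 65 72, i.e. "Männer" in UTF-8.
```

The result is still a byte string, which suits a script that passes the line straight on to a connection already set to UTF-8; decode it once more if Perl character strings are needed.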

edited Jul 6, 2012 at 16:34

answered Jul 6, 2012 at 12:35

1 Comment

Mathias Over a year ago

It's ironic. I tried so many things, but always started out thinking "it is utf-8, so I have to convert from utf-8 to some other encoding". But it was so simple, maybe too simple for me... Here is what I'm using now, just converting from utf8 to latin1, and that works great: $line=encode("latin1", decode("utf8", $line));
