A piece of software produces UTF-8 files, but the content it writes to them isn't proper Unicode. I can't change that software and have to take its output as it is. I don't know if this will show up correctly here, but a German umlaut "ä" appears in the file as "Ã¤".

If I open the file in Notepad++, it tells me the file is encoded as UTF-8 (without BOM). If I then choose "convert to ANSI" in Notepad++ and switch the file encoding back to UTF-8 (without converting), the German umlauts in the file are correct. How can I achieve the exact same behaviour in Perl? Whatever I have tried so far, the umlaut mess just got worse.

To reproduce, create a UTF-8 encoded file and write this to it: MÃ¤nner SchÃ¼ler VÃ¶gel SÃ¼Ã

Then, in a UTF-8 MySQL database, create a table with a VARCHAR column using the utf8_unicode_ci collation. Now use this script:

use strict;
use warnings;
use utf8;
use DBI;
use Encode;

# Read the file as raw bytes and insert each line unchanged.
if (open my $fh, '<', 'test.csv') {
  my $db = DBI->connect(
    'DBI:mysql:your_db;host=127.0.0.1;mysql_compression=1', 'root', 'Yourpass',
    { PrintError => 1 }
  );
  # Tell the server that the client sends UTF-8.
  my $sql = qq{SET NAMES 'utf8';};
  $db->do($sql);
  while (my $line = <$fh>) {
    my $sth = $db->prepare("INSERT IGNORE INTO testtable (testline) VALUES (?);");
    $sth->execute($line);
  }
}

The exact contents of the file get written to the database. But the output I expect in the database has proper German umlauts:

Männer Schüler Vögel Süß

So, how can I convert that correctly?

  • There's no such thing as a "non-unicode file". Please provide examples of your input, the script you are using, the current output and the expected output, so people can get an idea of where your problem is. Commented Jul 6, 2012 at 12:22
  • I extended my text with some code and sample data. Commented Jul 6, 2012 at 12:40
  • Off-topic: you should place the prepare outside of the while loop. The way you do it, the prepare is expensive and will be done for each line of your file. See the outline section of the DBI documentation for more info. Commented Jul 6, 2012 at 12:48
  • Try adding utf8::decode($line); as the first statement of your while loop (it decodes the string in place, so don't assign its return value back to $line). Commented Jul 6, 2012 at 12:50
  • Great tip about the prepare, thank you :) Commented Jul 6, 2012 at 13:40

2 Answers

It's ironic: as I see it, the software you talk about is not writing 'non-unicode content' (that's nonsense); it encodes the content as UTF-8 twice. Take this ä character, for example: it's represented by two bytes in UTF-8, %C3 %A4. But then something in that program decides to treat these bytes as Latin-1 instead, so they become two separate characters (each of which is eventually encoded into UTF-8 again, and that's what gets saved into the file).
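To make the double encoding visible, here is a minimal sketch that reproduces it on purpose with the core Encode module. The variable names and the hex_bytes helper are only for illustration, and Latin-1 (ISO-8859-1) as the misread encoding is an assumption based on the description above.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Start from the correct character string ("Männer", written with an escape
# so the example does not depend on the encoding of this source file).
my $correct = "M\x{E4}nner";

# Step 1: proper UTF-8 encoding; "ä" becomes the two bytes C3 A4.
my $utf8_bytes = encode('UTF-8', $correct);

# Step 2 (the bug): those bytes are misread as Latin-1 characters ...
my $misread = decode('ISO-8859-1', $utf8_bytes);

# Step 3: ... and encoded to UTF-8 a second time.
my $double = encode('UTF-8', $misread);

# Illustrative helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

print hex_bytes($utf8_bytes), "\n";   # 4D C3 A4 6E 6E 65 72
print hex_bytes($double),     "\n";   # 4D C3 83 C2 A4 6E 6E 65 72
```

The second line of output is what ends up in the file: C3 83 is "Ã" and C2 A4 is "¤", which is why the umlaut shows up as two characters.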

I suppose the simplest way of reversing this is to make Perl treat the string read from the file as a series of bytes (and not a sequence of characters) and decode it once more. It can be done as simply (and as uglily) as...

open my $fh, '<:utf8', $file_name or die $!;
my $string = <$fh>;      # a sequence of characters (first decoding done by the :utf8 layer)
utf8::decode($string);   # decodes in place: the characters are treated as octets and decoded again
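A self-contained way to try this out, as a sketch: it writes a double-encoded "ä" to a scratch file (the temporary file is only there for the demonstration), reads it back through the :utf8 layer and then calls utf8::decode. Note that utf8::decode modifies its argument in place and returns only a success flag, so the return value must not be assigned back to the string.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a double-encoded "ä" (bytes C3 83 C2 A4) to a scratch file.
my ($out, $file_name) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\xC3\x83\xC2\xA4\n";
close $out or die $!;

# First decoding: the :utf8 layer turns the four bytes into the two
# characters U+00C3 and U+00A4.
open my $fh, '<:utf8', $file_name or die $!;
my $string = <$fh>;
close $fh;
chomp $string;

# Second decoding, in place: the two characters are taken as the octets
# C3 A4 and decoded to the single character U+00E4.
my $ok = utf8::decode($string);

printf "ok=%d length=%d codepoint=U+%04X\n", $ok ? 1 : 0, length($string), ord($string);
# prints: ok=1 length=1 codepoint=U+00E4
```

This only works when every character read from the file is below U+0100; otherwise utf8::decode returns false and leaves the string unchanged.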

Sounds like something is converting it a second time, assuming it to be something like ISO 8859-15 and then converting that to UTF-8. You can reverse this by converting UTF-8 to ISO 8859-15 (or whichever encoding seems to make sense for your data).

As seen on http://www.fileformat.info/info/unicode/char/E4/index.htm the bytes 0xC3 0xA4 are the valid UTF-8 encoding of ä. When viewed as ISO 8859-1 or Windows-1252 (or a number of other 8-bit encodings) they display as the string Ã¤. (In ISO 8859-15 the byte 0xA4 is the euro sign instead, so there the pair would show as Ã€.)

1 Comment

It's ironic. I tried so many things, but always started by thinking "it is UTF-8, so I have to convert from UTF-8 to some other encoding". But it was so simple, maybe too simple for me... Here is what I'm using now, just converting from utf8 to latin1, and that works great: $line=encode("latin1", decode("utf8", $line));
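The one-liner above can be wrapped in a small helper, sketched here under the assumption that the spurious step really was Latin-1 (ISO-8859-1). The name fix_double_utf8 is hypothetical; the function expects the raw bytes of a line and returns correctly encoded UTF-8 bytes.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: undo one spurious Latin-1 -> UTF-8 encoding step.
sub fix_double_utf8 {
    my ($bytes) = @_;
    # Undo the second (spurious) UTF-8 encoding: bytes -> characters.
    my $chars = decode('UTF-8', $bytes);
    # Map the characters U+0000..U+00FF back to single bytes, which are
    # then the original, correct UTF-8 bytes.
    return encode('ISO-8859-1', $chars);
}

# "M" + double-encoded "ä" + "nner", as it appears in the broken file.
my $broken = "M\xC3\x83\xC2\xA4nner";
my $fixed  = fix_double_utf8($broken);

print join(' ', map { sprintf('%02X', ord($_)) } split //, $fixed), "\n";
# prints: 4D C3 A4 6E 6E 65 72
```

In the loop from the question this would be called as $line = fix_double_utf8($line); before the execute. Lines containing characters that were correctly encoded in the first place (anything above U+00FF after the first decode) cannot be mapped to Latin-1 and would be damaged, so this is only safe when the whole file is double-encoded.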
