19

My delicious-to-wp Perl script works, but it produces even weirder output for all "weird" characters. So I tried

$description = decode_utf8( $description ); 

but that doesn't make a difference. I would like, e.g., “go live” to become “go live” and not “go live†. How can I handle Unicode in Perl so that this works?
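For anyone trying to reproduce the symptom without a database, here is a minimal sketch of how this kind of garbling typically arises: UTF-8 bytes get interpreted as a single-byte encoding somewhere along the way. It uses iso-8859-1 as the wrong encoding (an assumption; the output above looks more like cp1252, but the mechanism is the same), and the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A character string with curly quotes (U+201C and U+201D).
my $text = "\x{201C}go live\x{201D}";

# Encoding to UTF-8 turns each curly quote into three bytes.
my $bytes = encode('UTF-8', $text);

# Faulty path: the UTF-8 bytes are read as if they were iso-8859-1,
# so every byte becomes its own character.
my $mojibake = decode('iso-8859-1', $bytes);

# Correct path: decoding with the matching encoding restores the text.
my $roundtrip = decode('UTF-8', $bytes);

printf "%d characters became %d after the wrong decode\n",
    length $text, length $mojibake;
```

If the faulty path is what happens between the script and the database, the fix belongs at that boundary rather than in the string itself.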

UPDATE: I found that the problem was the character set of the DBI connection. I had to set it from Perl:

my $sql = qq{SET NAMES 'utf8';};
$dbh->do($sql);

That was the part that I had to set, tricky.

3
  • Distill your code to the shortest possible script that still exhibits the problem. Take the database out of the equation to figure out if the problem has anything to do with Perl. Come up with something others can test and debug. Commented Jun 11, 2009 at 21:34
  • Also note there's some work to do on the database side; see "How to make MySQL handle UTF-8 properly". Commented May 19 at 16:27
  • I tried all of the solutions here and on the question above. None of them worked. The only thing that worked for me was use utf8; binmode(STDOUT, ':utf8'); which makes Perl output UTF-8 characters correctly. For the UTF-8 string that you insert into the database, make sure you run use Encode qw(encode decode); $newstring=decode("UTF-8",$oldstring);. This is currently working for me. I talk about this more in my answer at the bottom, but neither the accepted solution nor any of the other solutions on either page worked. The database itself didn't require any configuration. Commented May 19 at 21:39

7 Answers

24

It's worth noting that if you're running a version of DBD::mysql new enough (3.0008 on), you can do the following: $dbh->{'mysql_enable_utf8'} = 1; and then everything's decode()ed/encode()ed for you on the way out from/in to DBI.


3 Comments

For the lazy like myself, this is the dbic variant: Schema->connect("dbi:mysql:".$db{db}, $db{user}, $db{pass}, {mysql_enable_utf8 => 1} );
Solution in the answer doesn't work for me but solution in the comment above by @al. works like a charm.
According to the documentation, it can be done during the connect, which requires no further action. If you do it after connecting, as suggested in the answer, you also have to run an extra command: SET NAMES utf8.
14

Enable UTF-8 when you connect to the database, like this:

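use DBI;

# mysql_enable_utf8 is passed as a connect() attribute so it takes full effect
my $dbh = DBI->connect(
    "dbi:mysql:dbname=db_name",
    "db_user", "db_pass",
    { RaiseError => 0, PrintError => 0, mysql_enable_utf8 => 1 }
) or die "Connect to database failed: $DBI::errstr";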

This should get you character mode strings with the UTF8 flag set as needed.

From DBI General Interface Rules & Caveats:

Perl supports two kinds of strings: Unicode (utf8 internally) and non-Unicode (defaults to iso-8859-1 if forced to assume an encoding). Drivers should accept both kinds of strings and, if required, convert them to the character set of the database being used. Similarly, when fetching from the database character data that isn't iso-8859-1 the driver should convert it into utf8.

And the specifics from DBD::mysql for mysql_enable_utf8

Additionally, turning on this flag tells MySQL that incoming data should be treated as UTF-8. This will only take effect if used as part of the call to connect(). If you turn the flag on after connecting, you will need to issue the command SET NAMES utf8 to get the same effect.
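For the case the documentation describes, where the handle is already open, a minimal sketch could look like the following. The helper name enable_utf8_after_connect is hypothetical, and it assumes $dbh is a DBD::mysql handle; only the attribute and the SET NAMES statement come from the documentation quoted above.

```perl
use strict;
use warnings;

# Hypothetical helper: switch an already-connected DBD::mysql handle to UTF-8.
sub enable_utf8_after_connect {
    my ($dbh) = @_;

    # Ask the driver to decode/encode strings going through this handle.
    $dbh->{mysql_enable_utf8} = 1;

    # Needed only because the flag was set after connect():
    # tell the server which encoding the client uses.
    $dbh->do('SET NAMES utf8');

    return $dbh;
}
```

Passing the flag to connect() as shown earlier remains the simpler option, since it needs no second step.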

1 Comment

Thank you for this. I tried updating the Content-Type and charset.... I pondered server side encoding and decoding.... all that seemed like perhaps I was missing something. My issue was the single apostrophe in an input field. Worked fine on my laptop, but an entry from my iphone (Both devices using Chrome by the way) I'd get a funky character that even existed in the database. But Hooray for your answer. That simple fix on my get_database_handle() routine solved my issue. Kudos to you.
6

The statement

$dbh->do(qq{SET NAMES 'utf8';});

definitely saves the day when accessing a database declared as UTF-8. But take note: if you are going to do any Perl processing of data obtained from the database, it would be wise to decode it into a Perl character string first, as this operation is not implicit.

use Encode qw(decode);
$utfstring = decode('utf8', $string_from_db);

Of course, for proper I/O handling of UTF-8 strings (reading, printing, writing to output), remember to set

use open ':utf8';

and

binmode STDOUT, ":utf8";

the latter being essential for printing out utf8 strings. Hope this helps.
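Putting the decode step and the output layer together, a minimal sketch might look like this. The value of $string_from_db is a stand-in for bytes fetched from the database (an assumption: the driver is not decoding for you), and the script uses the stricter 'UTF-8' spelling and the :encoding(UTF-8) layer rather than the shorter forms above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for a value fetched from the database: raw UTF-8 bytes.
my $string_from_db = "Fran\xC3\xA7ois";

# Decode once, at the boundary, so Perl sees characters instead of bytes.
my $utfstring = decode('UTF-8', $string_from_db);

# String functions now count and transform characters.
my $byte_len = length $string_from_db;
my $char_len = length $utfstring;
my $upper    = uc $utfstring;

# Encode again on the way out through an output layer.
binmode STDOUT, ':encoding(UTF-8)';
print "$upper ($char_len characters, $byte_len bytes)\n";
```

The point of decoding first is that length, uc, regexes and friends then operate on characters rather than on individual bytes.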

3 Comments

How to get decode function? Perl shows error, saying cannot find the decode() routine. Thanks!
This fixed my problem..Have a perl cgi script that connects to a Postgres utf-8 database that returns a JSON string to the client using XHR. Characters were showing up on the client side all garbled until I added my $final_utf8 = Encode::encode_utf8($treeJSON); print CGI->header('application/json;charset=UTF-8'); print "$final_utf8"; as suggested in your post. Remember you need use Encode; at the top of your perl script to use this.
None of the answers up to this one could solve my UTF8 DB problem but this one did. Thanks!
2

It may have nothing to do with Perl. Check to make sure you're using UTF encodings in the pertinent MySQL table columns.


0

By default, the Perl MySQL driver treats data as binary (at least that is what I concluded from some experiments with MySQL 5.1 and 5.5).

Without setting mysql_enable_utf8, I encoded/decoded the strings to/from UTF-8 before writing/reading to/from the database.

Do not rely on Perl's internal string representation as an array of bytes. Be aware that the internal 'utf8' is not guaranteed to be standard UTF-8; conversely, the single-byte encoding is not guaranteed to be ISO-8859-1. Explicitly encode to and decode from 'UTF-8' (and not 'utf8').

There are also some MySQL settings related to encodings (like SET NAMES above). As far as I remember, there is a client encoding, a connection encoding, and a server encoding, and how they interact when they do not all have the same value is not quite clear to me. Setting all of them to UTF-8, together with the recipe above, worked for me.
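A minimal sketch of the manual recipe, assuming the driver passes bytes through untouched (no mysql_enable_utf8). The variable names are illustrative only; the first two lines just show that Encode resolves the two spellings to different encodings.

```perl
use strict;
use warnings;
use Encode qw(encode decode find_encoding);

# 'UTF-8' and 'utf8' are not the same encoding as far as Encode is concerned.
my $strict_name = find_encoding('UTF-8')->name;  # the strict, standard variant
my $lax_name    = find_encoding('utf8')->name;   # Perl's lax variant

# Manual round trip around the database.
my $text    = "Andr\x{E9}";               # character string inside Perl
my $to_db   = encode('UTF-8', $text);     # bytes to bind in an INSERT
my $from_db = decode('UTF-8', $to_db);    # characters again after a SELECT

print "UTF-8 resolves to $strict_name, utf8 resolves to $lax_name\n";
```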


0

Leave this one out:

binmode STDOUT, ":utf8";

when using:

$dbh->do(qq{SET NAMES 'utf8';});

Otherwise your output will have double utf8 encoding, resulting in unreadable double byte characters!
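The double encoding can be reproduced without a database. This is only a sketch: $bytes_from_db stands in for an undecoded value coming back from the driver, and the helper through_utf8_layer is a made-up name that captures what a :utf8 output layer would write.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return the bytes a ':utf8' output layer writes.
sub through_utf8_layer {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>:utf8', \$buffer
        or die "Cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh;
    return $buffer;
}

# Stand-in for undecoded UTF-8 bytes returned by the driver.
my $bytes_from_db = "Andr\xC3\xA9";

# Bytes printed through the layer are treated as Latin-1 characters
# and encoded a second time.
my $double = through_utf8_layer($bytes_from_db);

# Decoding first means the layer encodes exactly once.
my $single = through_utf8_layer(decode('UTF-8', $bytes_from_db));

printf "undecoded: %d bytes, decoded first: %d bytes\n",
    length $double, length $single;
```

So the alternative to dropping binmode is to decode the data before printing it, which is what the comments below are getting at.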

2 Comments

If what you say is true then Perl does not know the data retrieved from the database was UTF-8 encoded and simply thinks it is bytes. Surely DBD::Mysql (or whatever DBD you are using) is decoding the data from the database for you in which case Perl should know you have Unicode characters and when they are written to stdout with the utf8 layer all will be well. If what you describe is what happens I'd suggest your DBD is broken. However, looking at latest DBD::Mysql it does do "sv_utf8_decode(sv);" at least on some data.
bohica, DBD::mysql normally provides bytes, implicit decoding is not turned on by default.
-1

I have a string with some special characters that I am trying to put in a database, namely chess games by François-André Danican Philidor. When I try to put this name in my MySQL database using DBI, I get this error...

HAND CHECK: Fran�ois Andr� Philidor||NN||1||0||Fran�ois Andr� Philidor||NN||
DBD::mysql::st execute failed: Incorrect string value: '\xE7ois A...' for column 'white_player' at row 1 at chessgames.pl line 110, <GEN0> line 2360.
SQL Error: Incorrect string value: '\xE7ois A...' for column 'white_player' at row 1
Fran�ois Andr� Philidor||NN||1||0||Fran�ois Andr� Philidor||NN||

MySQL is having trouble understanding the special characters in the name, specifically the ç and the é. The first thing that helped was to add the following to my script...

use utf8; #some names have utf8 characters
binmode(STDOUT, ':utf8');

These commands changed the output so that when Perl printed the name with special characters, it printed the special characters properly. But MySQL still did not understand the special characters. Output changed slightly to this...

HAND CHECK: François André Philidor||NN||1||0||François André Philidor||NN||
DBD::mysql::st execute failed: Incorrect string value: '\xE7ois A...' for column 'white_player' at row 1 at chessgames.pl line 110, <GEN0> line 2360.
SQL Error: Incorrect string value: '\xE7ois A...' for column 'white_player' at row 1
Fran�ois Andr� Philidor||NN||1||0||Fran�ois Andr� Philidor||NN||

My manual check was working, but the MySQL query was still returning an error. I tried some of the other solutions listed above...

$dbh->do('SET NAMES utf8');

This solution did not work for me and produced the same errors.

$dbh->{'mysql_enable_utf8'} = 1;

This solution also did not work. A different question, "How to store unicode in MySQL?", suggested declaring the column like this when creating the table:

white_player VARCHAR(128) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL, 

This did not work either. I decided to look at the Unicode section in perldoc. I read through

$ perldoc perlunicode

Which pointed to

$ perldoc perlunitut

Which in turn pointed to

$ perldoc perlpacktut

Which is where I found the solution that worked for me. Here is the relevant text from perldoc perlpacktut under the Unicode section...

Please note: in the general case, you're better off using
"Encode::decode('UTF-8', $utf)" to decode a UTF-8 encoded byte string to
a Perl Unicode string, and "Encode::encode('UTF-8', $str)" to encode a
Perl Unicode string to UTF-8 bytes. These functions provide means of
handling invalid byte sequences and generally have a friendlier
interface.

Encoding (as a verb) is the conversion from *text* to *binary*.
Decoding is the conversion from *binary* to *text*.

I added the following command to my code

use Encode qw(encode decode); #suggestion from perldoc perlpacktut
$whiteplayer = decode("UTF-8",$pgn->white); 

However this also didn't work, database still shows Fran�ois Andr� Danican Philidor instead of François André Danican Philidor. I eventually found this answer

execute failed: Incorrect string value: '\xE4rvine...' with mariadb and perl DBD

and saw that my input was not encoded as UTF-8 but as iso-8859-1, also known as latin-1. I changed my decode statement to

$whiteplayer = decode("iso-8859-1",$pgn->white);

and finally it was working! The code is long and non-obvious, but I will sum up the solution as follows.

  1. include use utf8;
  2. include binmode(STDOUT, ':utf8');
  3. do a little reading namely perldoc perlunicode, perldoc perlunitut, and perldoc perlpacktut
  4. find the correct encoding of the text you want to enter into the database. Usually UTF-8 but check to make sure. Note: Careful not to mix up terms in the decode function. It is UTF-8 and not UTF8.
  5. include use Encode qw(encode decode); and decode the string with special characters using something like $newstring=decode("UTF-8",$oldstring);
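Step 4 is the one that cost me the most time, so here is a rough sketch of one way to automate it: try a strict UTF-8 decode first and fall back to iso-8859-1 if the bytes are not valid UTF-8. The function name decode_guess is made up, and the fallback encoding is an assumption that happened to fit my data; it is a heuristic, not a reliable detector.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: strict UTF-8 first, iso-8859-1 as a fallback.
sub decode_guess {
    my ($bytes) = @_;

    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC keeps $bytes unmodified for the fallback.
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text if defined $text;

    # Every byte is valid in iso-8859-1, so this decode cannot fail.
    return decode('iso-8859-1', $bytes);
}

my $from_latin1 = decode_guess("Fran\xE7ois");      # lone \xE7 is not valid UTF-8
my $from_utf8   = decode_guess("Fran\xC3\xA7ois");  # valid UTF-8 sequence
```

Both calls should end up with the same character string, which can then be handed to the database layer.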

Here is some code to check if your string contains non ASCII characters...

while ( $string =~ /([^\x00-\x7f])/g ){
  print "string: $string contains non-ASCII character: $1\n"
}
print "\n";

Both utf8 and Encode ship with Perl itself, so normally nothing needs to be installed. If Encode is missing from your installation and CPAN is configured correctly, you can install it by running

$ cpan
cpan[1]> install Encode
