Slowly growing guide to Perl code that handles UTF8-encoded text data cleanly.
Why it is not trivial: even if you are sure you never deal with legacy encodings, you still have to consider code that handles binary data, whether it is read from or written to files or transferred over sockets. And that code may not be in *your* script but in third-party modules you downloaded from CPAN that were last touched in 2001.
Assuming no work (regex matching, sorting, case changes) is done on the data, it should simply be treated and stored as UTF8 text with no spurious recoding. This is important on its own: data is easily corrupted, e.g. double-encoded, when it is taken from a file or CGI parameters and stored in an SQL database (or vice versa). Getting this right takes much less knowledge than manipulating Unicode text correctly, so we first describe just the “keeping your data encoded correctly” part.
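To make "double-encoded" concrete, here is a minimal sketch of what happens when data that is already UTF-8 bytes gets encoded a second time. The variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character: LATIN SMALL LETTER E WITH ACUTE (U+00E9).
my $text = "\x{e9}";

# Correct: encode exactly once, when the data leaves the program.
my $bytes = encode('UTF-8', $text);      # bytes C3 A9

# Bug: encoding the already-encoded bytes again. Perl treats each byte
# as a Latin-1 character and encodes it separately.
my $double = encode('UTF-8', $bytes);    # bytes C3 83 C2 A9

printf "once:  %s\n", join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
printf "twice: %s\n", join ' ', map { sprintf '%02X', ord($_) } split //, $double;
```

The second form is what typically shows up as "Ã©" in a database or on a web page.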
If you are writing a UTF8-clean script, you must be aware of the data encoding every time data enters or leaves your program: through a filehandle, an argument, an environment variable, or a CPAN module you have no control over! Because of the amount of legacy code, you just have to think about this; there is no shortcut. Below, we outline a recipe for sane Unicode defaults at most program edges, but you will still need to make decisions (and apply workarounds), especially when calling other code.
We assume Perl v5.14, though v5.10+ should be mostly sane already.
use utf8;
use encoding::warnings;
use Encode; $data = Encode::decode('UTF-8', $data, Encode::FB_CROAK);
(compared to Encode::decode_utf8, here the data will also be checked for correct encoding; Encode::FB_CROAK will fail hard in case of an encoding error)
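Because Encode::FB_CROAK makes decode() die, callers usually want to trap the failure somewhere. A minimal sketch follows; decode_or_undef is a hypothetical helper name, not part of Encode.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode a byte string as strict UTF-8 and return
# undef instead of dying when the bytes are malformed.
sub decode_or_undef {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on bad input; LEAVE_SRC stops it from
    # modifying $bytes in place while it checks.
    my $text = eval {
        Encode::decode('UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text;
}

my $good = decode_or_undef("caf\xC3\xA9");   # valid UTF-8
my $bad  = decode_or_undef("caf\xE9");       # lone Latin-1 byte, not valid UTF-8

printf "good input: %s\n", defined $good ? "decoded" : "rejected";
printf "bad input:  %s\n", defined $bad  ? "decoded" : "rejected";
```

Whether to return undef, die, or substitute a replacement character is a policy decision for your program; the sketch only shows where the check happens.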
#!/usr/bin/perl -CSA
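(this will apply to all code, not just your script; note that since Perl 5.10.1 a -C switch on the #! line must also be given on the command line, so it works when the script is executed directly but running it as perl script.pl fails with a "Too late" error)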
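When the switch cannot be used, roughly the same effect can be set up by hand inside the script. The following is a sketch under that assumption; setup_utf8_edges is a hypothetical name, and it uses the checking :encoding(UTF-8) layer rather than the plain :utf8 layer that -C applies.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper doing by hand roughly what -CSA does.
sub setup_utf8_edges {
    # The "S" part: treat the three standard streams as UTF-8 text.
    binmode($_, ':encoding(UTF-8)') for \*STDIN, \*STDOUT, \*STDERR;
    # The "A" part: decode command-line arguments from UTF-8 bytes.
    @ARGV = map { Encode::decode('UTF-8', $_) } @ARGV;
    return;
}

setup_utf8_edges();
print "\x{263A}\n";   # WHITE SMILING FACE, no "Wide character" warning
```

As with the switch, the standard handles are global, so this still affects any module that prints to them.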
For the DATA filehandle (the text after __END__): binmode(DATA, ":encoding(UTF-8)");
There are two ways - defaulting that all filehandles are UTF-8 text (change -CSA earlier to -CDSA, but this will affect all code including CPAN modules, something you do not want most of the time)…
…or manually marking filehandles that are UTF-8 text:
open(my $fh, '<:encoding(UTF-8)', 'filename');
(Sometimes, you see :utf8 instead; that just declares the data is utf8 while :encoding() also checks if it is properly encoded.)
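A small round trip shows what the layer does in each direction. This is only a sketch: the temporary file stands in for your real data file, and the sample string is made up.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $text = "na\x{ef}ve \x{263A}";   # 7 characters, two of them outside ASCII

# Temporary file standing in for real data; the name is generated.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close $tmp_fh;

# Write: the layer encodes characters to UTF-8 bytes.
open(my $out, '>:encoding(UTF-8)', $filename)
    or die "Cannot write $filename: $!";
print {$out} $text;
close $out or die "Cannot close $filename: $!";

# Read back as text: the layer decodes (and checks) the bytes.
open(my $in, '<:encoding(UTF-8)', $filename)
    or die "Cannot read $filename: $!";
my $round_trip = <$in>;
close $in;

# Read back raw, to see what is actually on disk.
open(my $raw, '<:raw', $filename)
    or die "Cannot read $filename: $!";
my $bytes = <$raw>;
close $raw;

printf "characters: %d, bytes on disk: %d\n",
    length($round_trip), length($bytes);
```

The character count and the byte count differ, which is the whole point: inside the program you work with characters, and only the filehandle layer deals with bytes.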
my $dbh = DBI->connect(..., {RaiseError => 1, mysql_enable_utf8 => 1}); $dbh->do("SET NAMES 'utf8'");
(Of course RaiseError is somewhat unrelated - but you want it.)
DBI->connect("dbi:Pg:dbname=...", '', '', {AutoCommit => 1, RaiseError => 1, pg_enable_utf8 => 1});
header(-charset => 'UTF-8'); or equivalent
use CGI qw(-utf8); will make all query parameters utf8, but this will interfere with binary file uploads; alternatively, you need to manually Encode::decode all text param() calls. You may want to write a simple param_utf8() wrapper.
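One possible shape for such a wrapper is sketched below. It assumes a query object created without -utf8 (so param() returns raw bytes) and handles only the scalar, single-value case. My::FakeQuery is a made-up stand-in so the sketch runs without CGI.pm or a web server.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical wrapper: fetch one text parameter and decode it from UTF-8.
# Binary parameters (file uploads) should keep using plain param().
sub param_utf8 {
    my ($q, $name) = @_;
    my $bytes = $q->param($name);    # scalar context: first value only
    return unless defined $bytes;
    return Encode::decode('UTF-8', $bytes,
        Encode::FB_CROAK() | Encode::LEAVE_SRC());
}

# Tiny stand-in for a query object, for demonstration only.
package My::FakeQuery {
    sub new   { my ($class, %p) = @_; return bless {%p}, $class }
    sub param { my ($self, $name) = @_; return $self->{$name} }
}

my $q    = My::FakeQuery->new(name => "Ren\xC3\xA9e");
my $name = param_utf8($q, 'name');
printf "%d characters\n", length $name;
```

With FB_CROAK the wrapper dies on malformed input; in a real CGI script you would probably want to catch that and answer with an error page instead.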
Manipulating Unicode text correctly (regex matching, sorting, case changes) is waaay more complex. Terribly so. TODO, see below for some resources.
Useful starting point: tchrist's reply at http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default
http://stackoverflow.com/questions/3735721/checklist-for-going-the-unicode-way-with-perl?rq=1
The utf8::all module on CPAN
Always start your code with
use strict; use warnings; use autodie;