Perl + UTF-8

A slowly growing guide to writing Perl code that handles UTF-8 encoded text data cleanly.

Why this is not trivial: even if you are sure you will never deal with legacy encodings, you still have to account for code that handles binary data, whether it is read from or written to files or transferred over sockets. And that code may not be in *your* script but in third-party modules you downloaded from CPAN that were last touched in 2001.

Passing Unicode data well

Assuming no work (regex matching, sorting, case changes) is done on the data, it should simply be treated and stored as UTF-8 text with no spurious recoding. This matters on its own: data is easily corrupted, for example double-encoded, when it is read from a file or CGI parameters and stored in an SQL database (or vice versa). Keeping data correctly encoded takes far less knowledge than manipulating Unicode text correctly, so we describe the “keeping your data encoded correctly” part first.
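To make the double-encoding failure concrete, here is a minimal sketch using the core Encode module. The variable names are illustrative only; the point is that encoding already-encoded bytes a second time inflates every non-ASCII byte, and a single decode no longer recovers the original text.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Nine characters, four of them non-ASCII (written as escapes so the
# example does not depend on the encoding of this source file).
my $text = "\x{17E}lu\x{165}ou\x{10D}k\x{FD}";

my $bytes = encode('UTF-8', $text);    # correct: encode once, on output
my $twice = encode('UTF-8', $bytes);   # bug: bytes treated as characters again

printf "characters:    %d\n", length $text;
printf "encoded once:  %d bytes\n", length $bytes;
printf "encoded twice: %d bytes\n", length $twice;

# A single decode recovers the text only from the correctly encoded form.
my $ok  = decode('UTF-8', $bytes) eq $text;
my $bad = decode('UTF-8', $twice) eq $text;
print "single decode of once-encoded data: ",  ($ok  ? "matches" : "differs"), "\n";
print "single decode of twice-encoded data: ", ($bad ? "matches" : "differs"), "\n";
```

The same thing tends to happen silently when one layer (a filehandle, a database driver) encodes data that another layer has already encoded.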

When writing a UTF-8 clean script, you must be aware of the data encoding every time data enters or leaves your program: through a filehandle, an argument, an environment variable, or a CPAN module you have no control over! Because of the amount of legacy code, you simply have to think about this; there is no workaround. Below, we outline a recipe for sane Unicode defaults at most program edges, but you will still need to make decisions (and apply workarounds), especially when calling other code.

We assume Perl v5.14, though v5.10+ should be mostly sane already.

Common Basics

  • Source code (literals etc.):
    use utf8;
  • Emit a warning when encoding of your data implicitly changes in a dubious way:
    use encoding::warnings;
  • When data has been read earlier in binary and you want to promote it to UTF8 data, always decode it manually (you might need to do this a lot with data coming from CPAN modules):
    use Encode; $data = Encode::decode('UTF-8', $data, Encode::FB_CROAK);

    (unlike Encode::decode_utf8, this also checks that the data is correctly encoded; Encode::FB_CROAK makes the call die on an encoding error)

  • Main edge - @ARGV and STDIN, STDOUT, STDERR (if you are sure you are not looking at binary data):
    #!/usr/bin/perl -CSA

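    (this will apply to all code, not just your script; note that since Perl 5.10.1, -C on the #! line only works when the script is made executable and run directly, or when the same -C is also given on the perl command line, otherwise perl aborts with a “Too late for -C” error)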

  • DATA filehandle (lines in script file after __END__):
    binmode(DATA, ":encoding(UTF-8)");
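As an illustration of the manual decoding step above, the following sketch wraps the strict decode in a helper and shows it rejecting input that is not valid UTF-8. The helper name decode_strict and the sample byte strings are hypothetical, not part of any module.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode raw bytes strictly, dying on malformed UTF-8.
sub decode_strict {
    my ($bytes) = @_;
    # With a CHECK argument such as FB_CROAK, Encode may overwrite its
    # source buffer; $bytes is our own copy, so the caller's data is safe.
    return Encode::decode('UTF-8', $bytes, Encode::FB_CROAK);
}

my $good = "\xC5\xBE";            # the two UTF-8 bytes of U+017E
my $bad  = "caf\xE9 au lait";     # Latin-1 bytes, not valid UTF-8

my $text = decode_strict($good);
printf "decoded %d byte(s) into %d character(s)\n", length $good, length $text;

# The malformed input makes decode die; catch it to show the failure.
my $err;
eval { decode_strict($bad); 1 } or $err = $@;
print "malformed input rejected\n" if defined $err;
```

Failing hard at the point where bytes become text is usually easier to debug than discovering corrupted characters later in a database or a web page.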

Other Filehandles

There are two ways - defaulting that all filehandles are UTF-8 text…

  • You could extend the -CSA earlier to -CDSA, but this will affect all code including CPAN modules, something you do not want most of the time.
  • Much better is to set this as the default only in the local lexical scope:
    use open qw(:encoding(UTF-8));

    (you can also replace -CSA with -CA and add :std to the open pragma to lexically restrict UTF8 mode on stdio filehandles)

  • To open a file in binary mode:
    open(my $binfh, '<', 'file.bin') or die "file.bin: $!"; binmode $binfh;

    (this is good practice to always do anyway; it matters not only for UTF-8 but also on Windows etc., where text mode translates line endings)

…or manually marking filehandles that are UTF-8 text:

  • open(my $fh, '<:encoding(UTF-8)', 'filename');

    (Sometimes, you see :utf8 instead; that just declares the data is utf8 while :encoding() also checks if it is properly encoded.)
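Both approaches can be compared side by side in a short sketch. It assumes a scratch file created with File::Temp purely for illustration; the variable names are likewise made up for this example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Nine characters, four of them non-ASCII.
my $text = "\x{17E}lu\x{165}ou\x{10D}k\x{FD}";

# Scratch file, for illustration only.
my ($tmpfh, $path) = tempfile(UNLINK => 1);
close $tmpfh;

# Way 1: a lexically scoped default layer.
my ($default_read, $raw_read);
{
    use open qw(:encoding(UTF-8));   # affects open() calls in this block only

    open(my $out, '>', $path) or die "open $path: $!";
    print {$out} $text;              # characters are encoded on the way out
    close $out or die "close $path: $!";

    open(my $in, '<', $path) or die "open $path: $!";
    $default_read = <$in>;           # decoded characters
    close $in;

    open(my $bin, '<', $path) or die "open $path: $!";
    binmode $bin;                    # opt out of the default: raw bytes
    $raw_read = <$bin>;
    close $bin;
}

# Way 2: an explicit layer on a single filehandle.
open(my $fh, '<:encoding(UTF-8)', $path) or die "open $path: $!";
my $explicit_read = <$fh>;
close $fh;

printf "default layer: %d chars, binmode: %d bytes, explicit layer: %d chars\n",
    length $default_read, length $raw_read, length $explicit_read;
```

The lexical default keeps the boilerplate out of every open() call in your own code without changing how CPAN modules open their files.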

Non-core Perl

  • MySQL:
    my $dbh = DBI->connect(..., {RaiseError => 1, mysql_enable_utf8 => 1}); $dbh->do("SET NAMES 'utf8'");

    (Of course RaiseError is somewhat unrelated - but you want it.)

  • PostgreSQL:
    DBI->connect("dbi:Pg:dbname=...", '', '', {AutoCommit => 1, RaiseError => 1, pg_enable_utf8 => 1});
  • CGI.pm output:
    header(-charset => 'utf-8');

    or equivalent

  • CGI.pm input:
    use CGI qw(-utf8);

    will make all query parameters UTF-8 text, but this will interfere with binary file uploads; alternatively, you need to manually Encode::decode the result of every text param() call. You may want to write a simple param_utf8() wrapper.
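One possible shape for such a param_utf8() wrapper is sketched below. It assumes a query object whose param() method returns raw bytes (CGI loaded without -utf8) and handles single-valued parameters only. The My::StubQuery class is a hypothetical stand-in so the example runs without CGI.pm installed.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical wrapper: fetch one parameter and decode it strictly.
# Returns undef for a missing parameter; dies on malformed UTF-8.
sub param_utf8 {
    my ($q, $name) = @_;
    my $raw = $q->param($name);      # scalar context: first value only
    return undef unless defined $raw;
    return Encode::decode('UTF-8', $raw, Encode::FB_CROAK);
}

{
    # Stand-in for a CGI object, for illustration only.
    package My::StubQuery;
    sub new   { my ($class, %params) = @_; return bless {%params}, $class }
    sub param { my ($self, $name) = @_; return $self->{$name} }
}

my $q    = My::StubQuery->new(name => "\xC5\xBDofie");   # UTF-8 bytes
my $name = param_utf8($q, 'name');
printf "name: %d bytes in, %d characters out\n",
    length "\xC5\xBDofie", length $name;
```

File upload parameters would keep going through plain param(), so binary data is never decoded by accident.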

Processing Unicode text well

This is waaay more complex. Terribly so. TODO, see below for some resources.

Resources

General Modern Perl notes

Always start your code with

use strict;
use warnings;
use autodie;
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial-Share Alike 3.0 Unported