
Perl + UTF-8

A slowly growing guide to writing Perl code that handles UTF-8 encoded text data cleanly.

Why this is not trivial: even if you are sure you will never deal with legacy encodings, you still have to account for code that deals with binary data, whether it is read from or written to files, transferred over sockets, and so on. And that code may not be in *your* script at all, but in third-party modules you downloaded from CPAN that were last touched in 2001.

Passing Unicode data well

Assuming no work (regex matching, sorting, case changes) is done on the data, it should simply be treated and stored as UTF-8 text with no spurious recoding. This matters on its own, because data is easily corrupted along the way: for example, when it is taken from a file or CGI parameters and stored in an SQL database (or vice versa), it might get double-encoded, and so on. Keeping data correctly encoded requires much less knowledge than manipulating Unicode text correctly, so we first describe just the “keeping your data encoded correctly” part.
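To make the double-encoding problem concrete, here is a small sketch using the core Encode module. The sample string and variable names are only illustrative; the point is that encoding data that is already UTF-8 bytes silently produces longer, garbled data rather than an error.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A character string: "caf" followed by U+00E9 (e with acute), 4 characters.
my $text = "caf\x{e9}";

# Correct: encode exactly once, when the data leaves the program.
my $once = encode('UTF-8', $text);       # bytes: c a f C3 A9

# Bug: encoding bytes that are already UTF-8. Each byte is treated as a
# Latin-1 character and encoded again.
my $twice = encode('UTF-8', $once);      # bytes: c a f C3 83 C2 A9

# Decoding the double-encoded data once yields mojibake, not the original.
my $garbled = decode('UTF-8', $twice);   # "caf", U+00C3, U+00A9

printf "once: %d bytes, twice: %d bytes, round trip ok: %s\n",
    length($once), length($twice), ($garbled eq $text ? 'yes' : 'no');
```

Nothing here warns or dies, which is why this kind of corruption tends to be noticed only once the data is displayed.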

When writing a UTF-8 clean script, you should be aware of the data encoding every time data enters or leaves your program: through a filehandle, an argument, an environment variable, or a CPAN module you have no control over! Because of the amount of legacy code, you simply have to think about this; there is no way around it. Below, we outline a recipe for sane Unicode defaults at most program edges, but you will still need to make decisions (and apply workarounds), especially when calling other code.

We assume Perl v5.14, though v5.10+ should be mostly sane already.

Common Basics
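A minimal sketch of a common preamble, assuming that source literals, the standard streams and command-line arguments should all be treated as UTF-8. This is one widely used combination of core pragmas, not necessarily the exact recipe intended here; the helper name decode_args is made up for illustration.

```perl
use strict;
use warnings;
use utf8;                               # string literals in this source file are UTF-8
use open qw(:std :encoding(UTF-8));     # STDIN, STDOUT, STDERR (and open() defaults) are UTF-8 text
use Encode qw(decode);

# Command-line arguments arrive as raw bytes; decode them once, up front.
# Assumption: the shell/locale passes arguments encoded as UTF-8.
sub decode_args {
    return map { decode('UTF-8', $_) } @_;
}

@ARGV = decode_args(@ARGV);

# From here on, @ARGV holds character strings and printing them to
# STDOUT encodes them back to UTF-8.
print "Got argument: $_\n" for @ARGV;
```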

Other Filehandles

There are two ways: making UTF-8 text the default for all filehandles…
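A sketch of the default-everything approach using the core open pragma. The file name is hypothetical, and note that the pragma is lexically scoped, so it only affects open() calls in the file or block where it appears.

```perl
use strict;
use warnings;
use open ':encoding(UTF-8)';   # default layer for open() calls in this lexical scope

# Hypothetical scratch file, for illustration only.
my $path = 'example-default-utf8.txt';

open(my $out, '>', $path) or die "Cannot write $path: $!";
print {$out} "na\x{ef}ve \x{263a}\n";   # characters are encoded to UTF-8 on output
close($out) or die "Cannot close $path: $!";

open(my $in, '<', $path) or die "Cannot read $path: $!";
my $line = <$in>;                        # bytes are decoded to characters on input
close($in);
chomp $line;

print "Characters read: ", length($line), "\n";
unlink $path;
```

Filehandles that carry binary data then need an explicit `:raw` layer (or `binmode`) to opt out of the default.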

…or manually marking filehandles that are UTF-8 text:
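A sketch of marking handles individually, either with a layer in the open() mode or with binmode on a handle that is already open. The file name is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical scratch file, for illustration only.
my $path = 'example-manual-utf8.txt';

# Option 1: give the layer as part of the open() mode.
open(my $out, '>:encoding(UTF-8)', $path) or die "Cannot write $path: $!";
print {$out} "\x{10d}aj\n";              # U+010D (c with caron), then "aj"
close($out) or die "Cannot close $path: $!";

# Option 2: set the layer on an already opened handle with binmode.
open(my $in, '<', $path) or die "Cannot read $path: $!";
binmode($in, ':encoding(UTF-8)');
my $text = <$in>;
close($in);
chomp $text;

# A handle opened with :raw is not decoded; it returns the bytes as stored.
open(my $raw, '<:raw', $path) or die "Cannot read $path: $!";
my $bytes = do { local $/; <$raw> };
close($raw);

printf "%d characters as text, first raw bytes: %vX\n",
    length($text), substr($bytes, 0, 2);
unlink $path;
```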

Non-core Perl
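When a third-party module only understands bytes, one possible workaround is to encode and decode explicitly at the call boundary. The sketch below assumes such a module; legacy_store_and_fetch is a made-up stand-in, not a real CPAN function, and round_trip_text is an illustrative wrapper name.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical stand-in for a third-party function that accepts and
# returns UTF-8 encoded bytes and knows nothing about character strings.
sub legacy_store_and_fetch {
    my ($bytes) = @_;
    return $bytes;
}

# Wrapper that keeps the rest of the program working with characters.
sub round_trip_text {
    my ($text) = @_;
    my $bytes_in  = encode('UTF-8', $text);            # characters -> bytes on the way in
    my $bytes_out = legacy_store_and_fetch($bytes_in);
    # bytes -> characters on the way out; FB_CROAK dies on malformed UTF-8
    return decode('UTF-8', $bytes_out, Encode::FB_CROAK);
}

my $result = round_trip_text("\x{17e}lu\x{165}ou\x{10d}k\x{fd}");
print "Characters after round trip: ", length($result), "\n";
```

Whether a given module wants bytes or characters has to be checked in its documentation (or source); many modules have their own option for this, which is preferable to wrapping when it exists.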

Processing Unicode text well

This is waaay more complex. Terribly so. TODO, see below for some resources.

Resources

Useful starting point: tchrist's reply at http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default

http://stackoverflow.com/questions/3735721/checklist-for-going-the-unicode-way-with-perl?rq=1

CPAN utf8::all

General Modern Perl notes

Always start your code with

use strict;
use warnings;
use autodie;