kb:perl-utf8 [2012/06/24 18:32] – created pasky
kb:perl-utf8 [2012/08/29 01:35] (current) – [Non-core Perl] pasky
 +====== Perl + UTF-8 ======
  
 +Slowly growing guide to Perl code that handles UTF8-encoded text data cleanly.
 +
 +**Why it is not trivial:** Even if you are sure you never deal with legacy encodings, you still have to consider code that handles binary data: data read from or written to files, transferred over sockets, and so on. And that code may not be in //your// script but in third-party modules you downloaded from CPAN that were last touched in 2001.
 +
 +===== Passing Unicode data well =====
 +
 +Assuming no work (regex matching, sorting, case changes) is done on the data, it should simply be treated and stored as UTF8 text with no spurious recoding. This is important on its own: data is easily corrupted, e.g. double-encoded, when it is taken from a file or CGI parameters and stored in an SQL database (or vice versa). Getting this right requires much less knowledge than manipulating Unicode text correctly, so we first describe just the "keeping your data encoded correctly" part.
 +
 +If you are writing a UTF8-clean script, you should be aware of the data encoding every time data enters or leaves your program - through a filehandle, argument, environment variable, or a CPAN module you have no control over! Due to the amount of legacy code, you just have to //think// about this; there is no way around it. Below, we outline a recipe for sane Unicode defaults at most program edges, but especially when calling other code, you will still need to make decisions (and apply workarounds).
 +
 +We assume Perl v5.14, though v5.10+ should be mostly sane already.
 +
 +==== Common Basics ====
 +
 +  * Source code (literals etc.): <code perl>use utf8;</code>
 +  * Emit a warning when encoding of your data implicitly changes in a dubious way: <code perl>use encoding::warnings;</code>
 +  * When data has been read earlier in binary and you want to promote it to UTF8 data, always decode it manually (you might need to do this a lot with data coming from CPAN modules): <code perl>use Encode; $data = Encode::decode('UTF-8', $data, Encode::FB_CROAK);</code> (unlike ''Encode::decode_utf8'', this also checks that the data is correctly encoded; ''Encode::FB_CROAK'' makes it fail hard on an encoding error)
 +  * Main edge - @ARGV and STDIN, STDOUT, STDERR (if you are sure you are not looking at binary data): <code>#!/usr/bin/perl -CSA</code> (this will apply to all code, not just your script)
 +  * DATA filehandle (lines in script file after ''%%__END__%%''): <code perl>binmode(DATA, ":encoding(UTF-8)");</code>
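 +
 +The manual decoding step from the list above can be sketched as follows. This is a minimal illustration, not a complete recipe: the sample byte strings are made up, and ''Encode::LEAVE_SRC'' is an addition not used in the list (it keeps ''decode'' from emptying the source variable, so that its length can still be printed afterwards).
<code perl>
use strict;
use warnings;
use Encode ();

# "é" as two raw UTF-8 bytes, as it might arrive from a binary read.
my $bytes = "\xC3\xA9";

# LEAVE_SRC (an assumption of this sketch) leaves $bytes untouched;
# with FB_CROAK alone, decode() would consume the source string.
my $text = Encode::decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);

printf "bytes: %d, characters: %d\n", length($bytes), length($text);

# A lone 0xE9 byte is Latin-1 "é" but malformed UTF-8, so decoding croaks.
my $bad = "\xE9";
my $ok  = eval {
    Encode::decode('UTF-8', $bad, Encode::FB_CROAK | Encode::LEAVE_SRC);
    1;
};
print "invalid input rejected\n" unless $ok;
</code>
 +The first line printed should be ''bytes: 2, characters: 1'': the same data has a different length before and after decoding, which is why it matters to know on which side of the edge you are.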
 +
 +==== Other Filehandles ====
 +
 +There are two ways - making UTF-8 text the default for all filehandles...
 +
 +  * You could extend the ''-CSA'' earlier to ''-CDSA'', but this will affect all code including CPAN modules, something you **do not want** most of the time.
 +  * Much better is to set this as the default only in the local lexical scope: <code perl>use open qw(:encoding(UTF-8));</code> (you can also replace ''-CSA'' with ''-CA'' and add '':std'' to the open pragma to lexically restrict UTF8 mode on stdio filehandles)
 +  * To open a file in binary mode: <code perl>open(my $binfh, '<', 'file.bin') or die "file.bin: $!"; binmode $binfh;</code> (this is good practice to always do anyway; it makes a difference not just for UTF-8 but also on Windows etc.)
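 +
 +To illustrate the lexical scoping of the ''open'' pragma, here is a small sketch. The temporary file and the variable names are assumptions made for the example; only the pragma line and the ''binmode'' call come from the list above.
<code perl>
use strict;
use warnings;
use File::Temp qw(tempfile);

# Prepare a file holding the raw UTF-8 bytes of "čaj" (4 bytes, 3 characters).
my ($tmpfh, $path) = tempfile(UNLINK => 1);
binmode $tmpfh;
print {$tmpfh} "\xC4\x8Daj";
close $tmpfh or die "close: $!";

my $scoped_len;
{
    # Inside this block, open() defaults to UTF-8 text.
    use open qw(:encoding(UTF-8));
    open(my $fh, '<', $path) or die "open: $!";
    $scoped_len = length(do { local $/; <$fh> });
    close $fh;
}

# Outside the block the pragma no longer applies; read the same file as binary.
open(my $binfh, '<', $path) or die "open: $!";
binmode $binfh;
my $binary_len = length(do { local $/; <$binfh> });
close $binfh;

printf "inside scope: %d characters, outside: %d bytes\n", $scoped_len, $binary_len;
</code>
 +Code outside the block, including any CPAN module you call, keeps whatever defaults it had before.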
 +
 +...or manually marking filehandles that are UTF-8 text:
 +
 +  * <code perl>open(my $fh, '<:encoding(UTF-8)', 'filename');</code> (Sometimes, you see '':utf8'' instead; that just declares the data is utf8 while '':encoding()'' also checks if it is properly encoded.)
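 +
 +A round trip with explicitly marked filehandles might look like this. It is a sketch under assumptions: the temporary file and the sample word are invented for the example.
<code perl>
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmpfh, $path) = tempfile(UNLINK => 1);
close $tmpfh;

my $word = "\x{10D}aj";    # "čaj": 3 characters

# Write as UTF-8 text: characters are encoded to bytes on the way out.
open(my $out, '>:encoding(UTF-8)', $path) or die "open: $!";
print {$out} $word;
close $out or die "close: $!";

# Read back as UTF-8 text: bytes are decoded (and validated) on the way in.
open(my $in, '<:encoding(UTF-8)', $path) or die "open: $!";
my $chars = do { local $/; <$in> };
close $in;

# Read the same file as binary to see what is actually on disk.
open(my $raw, '<', $path) or die "open: $!";
binmode $raw;
my $octets = do { local $/; <$raw> };
close $raw;

printf "characters: %d, bytes on disk: %d\n", length($chars), length($octets);
</code>
 +The text comes back unchanged (3 characters) while the file itself holds 4 bytes, since "č" takes two bytes in UTF-8.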
 +
 +==== Non-core Perl ====
 +
 +  * MySQL: <code perl>my $dbh = DBI->connect(..., {RaiseError => 1, mysql_enable_utf8 => 1}); $dbh->do("SET NAMES 'utf8'");</code> (Of course ''RaiseError'' is somewhat unrelated - but you want it.)
 +  * PostgreSQL: <code perl>DBI->connect("dbi:Pg:dbname=...", '', '', {AutoCommit => 1, RaiseError => 1, pg_enable_utf8 => 1});</code>
 +  * CGI.pm output: <code perl>header(-charset => 'utf-8');</code> or equivalent
 +  * CGI.pm input: <code perl>use CGI qw(-utf8);</code> will make all query parameters utf8, but this will interfere with binary file uploads; alternatively, you need to manually ''Encode::decode'' all text ''param()'' calls. You may want to write a simple param_utf8() wrapper.
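 +
 +One possible shape for such a ''param_utf8()'' wrapper is sketched below. Only the name comes from the text above; the body, the single-value restriction and the ''FakeQuery'' stand-in class (used so the sketch runs without CGI.pm installed) are assumptions.
<code perl>
use strict;
use warnings;
use Encode ();

# Hypothetical helper: fetch one text parameter and decode it as UTF-8.
# $q is any object with a CGI.pm-style param() method.
# Do not use it for file upload fields, which must stay binary.
sub param_utf8 {
    my ($q, $name) = @_;
    my $octets = $q->param($name);    # scalar context: first value only
    return undef unless defined $octets;
    return Encode::decode('UTF-8', $octets, Encode::FB_CROAK);
}

# Stand-in for a CGI object, for demonstration only.
{
    package FakeQuery;
    sub new   { my ($class, %params) = @_; return bless {%params}, $class }
    sub param { my ($self, $name) = @_; return $self->{$name} }
}

my $q    = FakeQuery->new(name => "\xC5\xBDofie");    # raw bytes of "Žofie"
my $name = param_utf8($q, 'name');
printf "decoded length: %d\n", length($name);
</code>
 +With the real module you would pass a ''CGI->new'' object instead of the stand-in; multi-valued parameters would need a list-returning variant.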
 +
 +===== Processing Unicode text well =====
 +
 +This is waaay more complex. Terribly so. TODO, see below for some resources.
 +
 +===== Resources =====
 +
 +Useful starting point: [[http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default]] tchrist's reply
 +
 +[[http://stackoverflow.com/questions/3735721/checklist-for-going-the-unicode-way-with-perl?rq=1]]
 +
 +CPAN utf8::all
 +
 +===== General Modern Perl notes =====
 +
 +Always start your code with
 +<code perl>
 +use strict;
 +use warnings;
 +use autodie;
 +</code>
kb/perl-utf8.txt · Last modified: 2012/08/29 01:35 by pasky