kb:perl-utf8
kb:perl-utf8 [2012/06/24 18:32] – created pasky | kb:perl-utf8 [2012/08/29 01:35] (current) – [Non-core Perl] pasky | ||
---|---|---|---|
+ | ====== Perl + UTF-8 ====== | ||
+ | Slowly growing guide to Perl code that handles UTF8-encoded text data cleanly. | ||
+ | |||
+ | **Why it is not trivial:** Even if you are sure you never deal with legacy encodings, you still have to consider code that deals with binary data. Read/ | ||
+ | |||
+ | ===== Passing Unicode data well ===== | ||
+ | |||
+ | Assuming no work (regex matching, sorting, case changes) is done on the data, it should simply be treated and stored as UTF-8 text with no spurious recoding. This matters on its own: data easily gets corrupted, e.g. when it is taken from a file or CGI parameters and stored in an SQL database (and vice versa); it might get double-encoded, | ||
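The double-encoding failure mentioned above can be reproduced in a few lines. This is a minimal sketch using the core ''Encode'' module; the variable names are made up for illustration, and the character is written as an escape so the example does not depend on the encoding of the source file.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# One character: U+00E9 (LATIN SMALL LETTER E WITH ACUTE).
my $text = "\x{e9}";

# Correct: encode exactly once when the text leaves the program.
my $once = encode('UTF-8', $text);     # bytes C3 A9

# Bug: the bytes are mistaken for text and encoded a second time.
my $twice = encode('UTF-8', $once);    # bytes C3 83 C2 A9

printf "encoded once:  %d bytes\n", length($once);
printf "encoded twice: %d bytes\n", length($twice);
```

Decoding ''$twice'' once yields the two characters U+00C3 and U+00A9 instead of the original single character, which is the typical mojibake seen when a layer in the pipeline encodes data that was already encoded.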
+ | |||
+ | When writing a UTF8-clean script, you should be aware of the data encoding every time data enters or leaves your program - through a filehandle, argument, environment variable, or a CPAN module you have no control over! Due to the amount of legacy code, you just have to //think// about this, there' | ||
+ | |||
+ | We assume Perl v5.14, though v5.10+ should be mostly sane already. | ||
+ | |||
+ | ==== Common Basics ==== | ||
+ | |||
+ | * Source code (literals etc.): <code perl>use utf8;</code> | ||
+ | * Emit a warning when the encoding of your data implicitly changes in a dubious way: <code perl>use encoding:: | ||
+ | * When data has been read earlier in binary and you want to promote it to UTF8 data, always decode it manually (you might need to do this a lot with data coming from CPAN modules): <code perl>use Encode; $data = Encode:: | ||
+ | * Main edge - @ARGV and STDIN, STDOUT, STDERR (if you are sure you are not looking at binary data): < | ||
+ | * DATA filehandle (lines in script file after '' | ||
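The points in this list can be sketched as a small script preamble. This is only an illustration under assumptions, not necessarily the exact snippets the list refers to: it uses ''binmode'' and ''Encode::decode'' from core Perl, and ''bytes_to_text'' is a hypothetical helper name introduced here.

```perl
use strict;
use warnings;
use utf8;                     # string literals in this file are UTF-8
use Encode qw(decode);

# Treat the standard handles as UTF-8 text (only if they never carry binary data).
binmode($_, ':encoding(UTF-8)') for \*STDIN, \*STDOUT, \*STDERR;

# Command-line arguments arrive as bytes; decode them explicitly.
@ARGV = map { decode('UTF-8', $_) } @ARGV;

# Hypothetical helper: promote bytes that were read in binary mode to text.
# FB_CROAK makes malformed input fatal instead of silently substituting U+FFFD.
sub bytes_to_text {
    my ($bytes) = @_;
    return decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
}
```

Dying on malformed input is a design choice; for data that may legitimately contain broken sequences, calling ''decode'' without the check argument replaces them with U+FFFD instead.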
+ | |||
+ | ==== Other Filehandles ==== | ||
+ | |||
+ | There are two ways - either making UTF-8 text the default for all filehandles... | ||
+ | |||
+ | * You could extend the '' | ||
+ | * Much better is to set this as default only in local lexical scope: <code perl>use open qw(: | ||
+ | * To open a file in binary mode: <code perl> | ||
+ | |||
+ | ...or manually marking filehandles that are UTF-8 text: | ||
+ | |||
+ | * <code perl> | ||
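The manual approach can be sketched with explicit PerlIO layers in three-argument ''open'': '':encoding(UTF-8)'' for text and '':raw'' for binary data. The helper names below are hypothetical and exist only for this example.

```perl
use strict;
use warnings;

# Write text: characters are encoded to UTF-8 on the way out.
sub write_text {
    my ($path, $text) = @_;
    open(my $fh, '>:encoding(UTF-8)', $path) or die "Cannot write $path: $!";
    print {$fh} $text;
    close($fh) or die "Cannot close $path: $!";
    return;
}

# Read text: UTF-8 bytes are decoded to characters on the way in.
sub read_text {
    my ($path) = @_;
    open(my $fh, '<:encoding(UTF-8)', $path) or die "Cannot read $path: $!";
    local $/;                 # slurp the whole file
    my $text = <$fh>;
    close($fh);
    return $text;
}

# Read binary: no decoding and no newline translation.
sub read_bytes {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot read $path: $!";
    local $/;
    my $bytes = <$fh>;
    close($fh);
    return $bytes;
}
```

Reading the same file through both readers shows the difference: ''read_text'' returns characters, ''read_bytes'' returns the underlying bytes, so their lengths differ as soon as the file contains a non-ASCII character.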
+ | |||
+ | ==== Non-core Perl ==== | ||
+ | |||
+ | * MySQL: <code perl> | ||
+ | * PostgreSQL: <code perl> | ||
+ | * CGI.pm output: <code perl> | ||
+ | * CGI.pm input: <code perl>use CGI qw(-utf8);</code> | ||
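For the database entries, one possible sketch is to pass the driver-specific UTF-8 attribute at connect time. It assumes the ''mysql_enable_utf8'' attribute of DBD::mysql and the ''pg_enable_utf8'' attribute of DBD::Pg; whether these match the snippets above is an assumption, and both helper names are hypothetical. DBI is loaded lazily so the attribute logic can be used without a database.

```perl
use strict;
use warnings;

# Hypothetical helper: pick connect attributes based on the driver named in the DSN.
sub utf8_connect_attrs {
    my ($dsn) = @_;
    my ($driver) = $dsn =~ /\Adbi:(\w+):/i;
    $driver = '' unless defined $driver;

    my %attrs = (RaiseError => 1, AutoCommit => 1);
    $attrs{mysql_enable_utf8} = 1 if $driver eq 'mysql';   # DBD::mysql
    $attrs{pg_enable_utf8}    = 1 if $driver eq 'Pg';      # DBD::Pg
    return \%attrs;
}

# Hypothetical helper: connect with the attributes chosen above.
sub connect_utf8 {
    my ($dsn, $user, $password) = @_;
    require DBI;              # loaded only when a connection is actually made
    return DBI->connect($dsn, $user, $password, utf8_connect_attrs($dsn));
}
```

With these attributes the driver is expected to hand back decoded text strings, so values fetched from the database should not be decoded a second time.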
+ | |||
+ | ===== Processing Unicode text well ===== | ||
+ | |||
+ | This is waaay more complex. Terribly so. TODO, see below for some resources. | ||
+ | |||
+ | ===== Resources ===== | ||
+ | |||
+ | Useful starting point: [[http:// | ||
+ | |||
+ | [[http:// | ||
+ | |||
+ | CPAN utf8::all | ||
+ | |||
+ | ===== General Modern Perl notes ===== | ||
+ | |||
+ | Always start your code with | ||
+ | <code perl> | ||
+ | use strict; | ||
+ | use warnings; | ||
+ | use autodie; | ||
+ | </ |