How to achieve foolproof Unicode handling in Perl CGI?

So I have a MySQL database which is serving an old WordPress DB I backed up. I am writing some simple Perl scripts to serve the WordPress articles (I don't want a WordPress install).
WordPress for whatever reason stored all quotes as Unicode chars, all ellipses as Unicode chars, all double dashes, all apostrophes; there are Unicode nbsp characters all over the place -- it is a mess (this is why I do not want a WordPress install).
In my test environment, which is Linux Mint 17.1 with Perl 5.18.2 and MySQL 5.5, things mostly work fine when the Content-type line is served with "charset=utf-8" (except that apostrophes simply never decode properly, no matter what combination of things I try). Omitting the charset causes all Unicode characters to break (except that apostrophes now work). This is OK; with the exception of the apostrophes, I understand what is going on and I have a handle on the data.
My production environment, which is a VM, is Linux CentOS 6.5 with Perl 5.10.1 and MySQL 5.6.22, and here things do not work at all. Whether or not I include "charset=utf-8" in the Content-type makes no difference: no Unicode characters work correctly (including apostrophes). Maybe it has to do with the lower version of Perl? Does anyone have any insight?
Apart from this very specific case, does anyone know of a fool-proof Perl idiom for handling unicode which comes from the DB? (I'm not sure where in the pipeline things are going wrong, but I have a suspicion it is at the DB-driver level)
One of the problems is that my data is very inconsistent and dirty. I could parse the entire DB, scrub all the Unicode, and re-import it -- the point is I want to avoid that. I want a one-size-fits-all collection of Perl scripts for reading WordPress databases.

Dealing with Perl and UTF-8 has been a pain for me. After a good amount of time I learned that there is no "foolproof Unicode handling" in Perl ... but there is Unicode handling that can be of help:
The Encode module.
As the perlunifaq says (http://perldoc.perl.org/perlunifaq.html):
When should I decode or encode?
Whenever you're communicating text with anything that is external to
your perl process, like a database, a text file, a socket, or another
program. Even if the thing you're communicating with is also written
in Perl.
So we do this to every UTF-8 text string sent to our Perl process:
my $perl_str = decode('utf8',$myExt_str);
And this to every text string sent from Perl to anything external to our Perl process:
my $ext_str = encode('utf8',$perl_str);
...
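As a minimal, self-contained illustration of that rule with the Encode module (the byte string here is just an example):
use Encode qw( decode encode );

my $octets   = "caf\xC3\xA9";               # UTF-8 octets for "café", as read from outside
my $perl_str = decode('UTF-8', $octets);    # now a Perl text string
my $ext_str  = encode('UTF-8', $perl_str);  # back to octets before printing/storing

print length($perl_str), " characters, ", length($ext_str), " octets\n";   # 4 characters, 5 octets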
Now that's a lot of encoding/decoding when we retrieve or send data from/to a MySQL or PostgreSQL database. But fear not, because there is a way to tell Perl that EVERY TEXT STRING from/to a database is UTF-8. Additionally we tell the database that every text string should be treated as UTF-8. The only downside is that you need to be sure that every text string is UTF-8 encoded... but that's another story:
# For MySQL:
# This requires DBD::mysql version 4 or greater
use DBI;
my $dbh = DBI->connect(
    'dbi:mysql:test_db',
    $username,
    $password,
    { mysql_enable_utf8 => 1 },
);
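If the database also holds 4-byte characters such as emoji, newer DBD::mysql releases additionally offer a mysql_enable_utf8mb4 attribute; this is only a hedged variant, so check your driver version and documentation:
# Variant for 4-byte UTF-8 (emoji and friends); requires a recent DBD::mysql
my $dbh = DBI->connect(
    'dbi:mysql:test_db',
    $username,
    $password,
    { mysql_enable_utf8mb4 => 1 },
);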
Ok, now we have the text strings from our database in UTF-8 and the database knows all our text strings should be treated as UTF-8... But what about everything else? We need to tell Perl (AND CGI) that EVERY TEXT STRING we write in our process is UTF-8 AND tell other processes to treat our text strings as UTF-8 as well:
use utf8;
use CGI '-utf8';
my $cgi = CGI->new;
$cgi->charset('UTF-8');
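A short usage sketch: with the -utf8 import, CGI decodes incoming parameters to Perl text for you, and the charset you set ends up in the Content-Type header (the parameter name here is hypothetical):
use utf8;
use CGI '-utf8';

my $cgi  = CGI->new;
my $name = $cgi->param('name');   # already decoded to a Perl text string
print $cgi->header(-type => 'text/html', -charset => 'UTF-8');
print "<p>Hello, $name</p>\n";    # see the note about binmode STDOUT below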
UPDATED!
What is a "wide character"?
This is a term used both for characters with an ordinal value greater
than 127, characters with an ordinal value greater than 255, or any
character occupying more than one byte, depending on the context. The
Perl warning "Wide character in ..." is caused by a character with an
ordinal value greater than 255.
With no specified encoding layer, Perl
tries to fit things in ISO-8859-1 for backward compatibility reasons.
When it can't, it emits this warning (if warnings are enabled), and
outputs UTF-8 encoded data instead. To avoid this warning and to avoid
having different output encodings in a single stream, always specify
an encoding explicitly, for example with a PerlIO layer:
# The next line is required to avoid the "Wide character in print" warning
# AND to avoid having different output encodings in a single stream.
binmode STDOUT, ":encoding(UTF-8)";
...
Even with all of this, sometimes you still need to encode('utf8', $perl_str). That's why I learned there is no foolproof Unicode handling in Perl. Please read the perlunifaq (http://perldoc.perl.org/perlunifaq.html).
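For reference, here is a minimal end-to-end sketch tying the pieces above together; the credentials are placeholders and the wp_posts/post_content names are only illustrative of a typical WordPress schema:
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use DBI;
use CGI '-utf8';

binmode STDOUT, ':encoding(UTF-8)';

my $cgi = CGI->new;
print $cgi->header(-type => 'text/html', -charset => 'UTF-8');

my $dbh = DBI->connect('dbi:mysql:wordpress_db', 'wp_user', 'secret',
                       { mysql_enable_utf8 => 1, RaiseError => 1 });

my $posts = $dbh->selectall_arrayref(
    'SELECT post_title, post_content FROM wp_posts WHERE post_status = ?',
    { Slice => {} }, 'publish');

for my $post (@$posts) {
    print "<h2>$post->{post_title}</h2>\n<div>$post->{post_content}</div>\n";
}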
I hope this helps.

Related

Enabling non-ASCII characters in Perl [duplicate]

I wonder why most modern solutions built using Perl don't enable UTF-8 by default.
I understand there are many legacy problems for core Perl scripts, where it may break things. But, from my point of view, in the 21st century, big new projects (or projects with a big perspective) should make their software UTF-8 proof from scratch. Still I don't see it happening. For example, Moose enables strict and warnings, but not Unicode. Modern::Perl reduces boilerplate too, but no UTF-8 handling.
Why? Are there some reasons to avoid UTF-8 in modern Perl projects in the year 2011?
Commenting on @tchrist's answer got too long, so I'm adding it here.
It seems that I did not make myself clear. Let me try to add some things.
tchrist and I see the situation pretty similarly, but our conclusions are at completely opposite ends. I agree, the situation with Unicode is complicated, but this is why we (Perl users and coders) need some layer (or pragma) which makes UTF-8 handling as easy as it must be nowadays.
tchrist pointed to many aspects to cover, I will read and think about them for days or even weeks. Still, this is not my point. tchrist tries to prove that there is not one single way "to enable UTF-8". I have not so much knowledge to argue with that. So, I stick to live examples.
I played around with Rakudo and UTF-8 was just there as I needed. I didn't have any problems; it just worked. Maybe there are some limitations somewhere deeper, but at the start, all I tested worked as I expected.
Shouldn't that be a goal in modern Perl 5 too? I stress it more: I'm not suggesting UTF-8 as the default character set for core Perl, I suggest the possibility to trigger it with a snap for those who develop new projects.
Another example, but with a more negative tone. Frameworks should make development easier. Some years ago, I tried web frameworks, but just threw them away because "enabling UTF-8" was so obscure. I did not find how and where to hook in Unicode support. It was so time-consuming that I found it easier to go the old way. Now I saw here that there was a bounty to deal with the same problem with Mason 2: How to make Mason2 UTF-8 clean?. So, it is a pretty new framework, but using it with UTF-8 needs deep knowledge of its internals. It is like a big red sign: STOP, don't use me!
I really like Perl. But dealing with Unicode is painful. I still find myself running against walls. Some way tchrist is right and answers my questions: new projects don't attract UTF-8 because it is too complicated in Perl 5.
Simplest ℞: 7 Discrete Recommendations
Set your PERL_UNICODE environment variable to AS. This makes all Perl scripts decode @ARGV as UTF-8 strings, and sets the encoding of all three of stdin, stdout, and stderr to UTF-8. Both these are global effects, not lexical ones.
At the top of your source file (program, module, library, dohickey), prominently assert that you are running perl version 5.12 or better via:
use v5.12; # minimal for unicode string feature
use v5.14; # optimal for unicode string feature
Enable warnings, since the previous declaration only enables strictures and features, not warnings. I also suggest promoting Unicode warnings into exceptions, so use both these lines, not just one of them. Note however that under v5.14, the utf8 warning class comprises three other subwarnings which can all be separately enabled: nonchar, surrogate, and non_unicode. These you may wish to exert greater control over.
use warnings;
use warnings qw( FATAL utf8 );
Declare that this source unit is encoded as UTF-8. Although once upon a time this pragma did other things, it now serves this one singular purpose alone and no other:
use utf8;
Declare that anything that opens a filehandle within this lexical scope but not elsewhere is to assume that that stream is encoded in UTF-8 unless you tell it otherwise. That way you do not affect other modules' or other programs' code.
use open qw( :encoding(UTF-8) :std );
Enable named characters via \N{CHARNAME}.
use charnames qw( :full :short );
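For instance, once charnames is loaded you can spell characters by name instead of by code point (the names below are standard Unicode character names):
use charnames qw( :full :short );
my $sigma   = "\N{GREEK SMALL LETTER SIGMA}";   # U+03C3
my $snowman = "\N{SNOWMAN}";                    # U+2603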
If you have a DATA handle, you must explicitly set its encoding. If you want this to be UTF-8, then say:
binmode(DATA, ":encoding(UTF-8)");
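A minimal sketch of what that buys you (the __DATA__ text is just an example):
binmode(DATA, ":encoding(UTF-8)");
while (my $line = <DATA>) {
    chomp $line;
    print length($line), "\n";   # 10 characters, not the 12 UTF-8 bytes
}
__DATA__
naïve café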
There is of course no end of other matters with which you may eventually find yourself concerned, but these will suffice to approximate the stated goal of "make everything just work with UTF-8", albeit for a somewhat weakened sense of those terms.
One other pragma, although it is not Unicode related, is:
use autodie;
It is strongly recommended.
Go Thou and Do Likewise
Boilerplate for Unicode-Aware Code
My own boilerplate these days tends to look like this:
use 5.014;
use utf8;
use strict;
use autodie;
use warnings;
use warnings qw< FATAL utf8 >;
use open qw< :std :utf8 >;
use charnames qw< :full >;
use feature qw< unicode_strings >;
use File::Basename qw< basename >;
use Carp qw< carp croak confess cluck >;
use Encode qw< encode decode >;
use Unicode::Normalize qw< NFD NFC >;
END { close STDOUT }
if (grep /\P{ASCII}/ => @ARGV) {
    @ARGV = map { decode("UTF-8", $_) } @ARGV;
}

$0 = basename($0);  # shorter messages
$| = 1;

binmode(DATA, ":utf8");

# give a full stack dump on any untrapped exceptions
local $SIG{__DIE__} = sub {
    confess "Uncaught exception: @_" unless $^S;
};

# now promote run-time warnings into stack-dumped
# exceptions *unless* we're in a try block, in
# which case just cluck the stack dump instead
local $SIG{__WARN__} = sub {
    if ($^S) { cluck   "Trapped warning: @_" }
    else     { confess "Deadly warning: @_" }
};

while (<>) {
    chomp;
    $_ = NFD($_);
    ...
} continue {
    say NFC($_);
}
__END__
No Magic Bullet
Saying that "Perl should [somehow!] enable Unicode by default" doesn't even start to begin to think about getting around to saying enough to be even marginally useful in some sort of rare and isolated case. Unicode is much much more than just a larger character repertoire; it's also how those characters all interact in many, many ways.
Even the simple-minded minimal measures that (some) people seem to think they want are guaranteed to miserably break millions of lines of code, code that has no chance to "upgrade" to your spiffy new Brave New World modernity.
It is way way way more complicated than people pretend. I've thought about this a huge, whole lot over the past few years. I would love to be shown that I am wrong. But I don't think I am. Unicode is fundamentally more complex than the model that you would like to impose on it, and there is complexity here that you can never sweep under the carpet. If you try, you'll break either your own code or somebody else's. At some point, you simply have to break down and learn what Unicode is about. You cannot pretend it is something it is not.
Perl goes out of its way to make Unicode easy, far more than anything else I've ever used. If you think this is bad, try something else for a while. Then come back to Perl: either you will have returned to a better world, or else you will bring knowledge of the same with you so that we can make use of your new knowledge to make Perl better at these things.
Ideas for a Unicode-Aware Perl Laundry List
At a minimum, here are some things that would appear to be required for Perl to "enable Unicode by default", as you put it:
All Perl source code should be in UTF-8 by default. You can get that with use utf8 or export PERL5OPTS=-Mutf8.
The Perl DATA handle should be UTF-8. You will have to do this on a per-package basis, as in binmode(DATA, ":encoding(UTF-8)").
Program arguments to Perl scripts should be understood to be UTF-8 by default. export PERL_UNICODE=A, or perl -CA, or export PERL5OPTS=-CA.
The standard input, output, and error streams should default to UTF-8. export PERL_UNICODE=S for all of them, or I, O, and/or E for just some of them. This is like perl -CS.
Any other handles opened by Perl should be considered UTF-8 unless declared otherwise; export PERL_UNICODE=D or with i and o for particular ones of these; export PERL5OPTS=-CD would work. That makes -CSAD for all of them.
Cover both bases plus all the streams you open with export PERL5OPTS=-Mopen=:utf8,:std. See uniquote.
You don't want to miss UTF-8 encoding errors. Try export PERL5OPTS=-Mwarnings=FATAL,utf8. And make sure your input streams are always binmoded to :encoding(UTF-8), not just to :utf8.
Code points between 128 and 255 should be understood by Perl to be the corresponding Unicode code points, not just unpropertied binary values. use feature "unicode_strings" or export PERL5OPTS=-Mfeature=unicode_strings. That will make uc("\xDF") eq "SS" and "\xE9" =~ /\w/. A simple export PERL5OPTS=-Mv5.12 or better will also get that.
Named Unicode characters are not by default enabled, so add export PERL5OPTS=-Mcharnames=:full,:short,latin,greek or some such. See uninames and tcgrep.
You almost always need access to the functions from the standard Unicode::Normalize module for various types of decompositions. export PERL5OPTS=-MUnicode::Normalize=NFD,NFKD,NFC,NFKC, and then always run incoming stuff through NFD and outbound stuff through NFC. There's no I/O layer for these yet that I'm aware of, but see nfc, nfd, nfkd, and nfkc.
String comparisons in Perl using eq, ne, lc, cmp, sort, &c&cc are always wrong. So instead of @a = sort @b, you need @a = Unicode::Collate->new->sort(@b). Might as well add that to your export PERL5OPTS=-MUnicode::Collate. You can cache the key for binary comparisons (see the sketch after this list).
๐Ÿช built-ins like printf and write do the wrong thing with Unicode data. You need to use the Unicode::GCString module for the former, and both that and also the Unicode::LineBreak module as well for the latter. See uwc and unifmt.
If you want them to count as integers, then you are going to have to run your \d+ captures through the Unicode::UCD::num function because ๐Ÿชโ€™s built-in atoi(3) isnโ€™t currently clever enough.
You are going to have filesystem issues on ๐Ÿ‘ฝ filesystems. Some filesystems silently enforce a conversion to NFC; others silently enforce a conversion to NFD. And others do something else still. Some even ignore the matter altogether, which leads to even greater problems. So you have to do your own NFC/NFD handling to keep sane.
All your ๐Ÿช code involving a-z or A-Z and such MUST BE CHANGED, including m//, s///, and tr///. Itโ€™s should stand out as a screaming red flag that your code is broken. But it is not clear how it must change. Getting the right properties, and understanding their casefolds, is harder than you might think. I use unichars and uniprops every single day.
Code that uses \p{Lu} is almost as wrong as code that uses [A-Za-z]. You need to use \p{Upper} instead, and know the reason why. Yes, \p{Lowercase} and \p{Lower} are different from \p{Ll} and \p{Lowercase_Letter}.
Code that uses [a-zA-Z] is even worse. And it canโ€™t use \pL or \p{Letter}; it needs to use \p{Alphabetic}. Not all alphabetics are letters, you know!
If you are looking for Perl variables with /[\$\@\%]\w+/, then you have a problem. You need to look for /[\$\@\%]\p{IDS}\p{IDC}*/, and even that isn't thinking about the punctuation variables or package variables.
If you are checking for whitespace, then you should choose between \h and \v, depending. And you should never use \s, since it DOES NOT MEAN [\h\v], contrary to popular belief.
If you are using \n for a line boundary, or even \r\n, then you are doing it wrong. You have to use \R, which is not the same!
If you don't know when and whether to call Unicode::Stringprep, then you had better learn.
Case-insensitive comparisons need to check for whether two things are the same letters no matter their diacritics and such. The easiest way to do that is with the standard Unicode::Collate module: Unicode::Collate->new(level => 1)->cmp($a, $b). There are also eq methods and such, and you should probably learn about the match and substr methods, too. These have distinct advantages over the Perl built-ins.
Sometimes that's still not enough, and you need the Unicode::Collate::Locale module instead, as in Unicode::Collate::Locale->new(locale => "de__phonebook", level => 1)->cmp($a, $b). Consider that Unicode::Collate->new(level => 1)->eq("d", "ð") is true, but Unicode::Collate::Locale->new(locale => "is", level => 1)->eq("d", "ð") is false. Similarly, "ae" and "æ" are eq if you don't use locales, or if you use the English one, but they are different in the Icelandic locale. Now what? It's tough, I tell you. You can play with ucsort to test some of these things out.
Consider how to match the pattern CVCV (consonant, vowel, consonant, vowel) in the string "niño". Its NFD form -- which you had darned well better have remembered to put it in -- becomes "nin\x{303}o". Now what are you going to do? Even pretending that a vowel is [aeiou] (which is wrong, by the way), you won't be able to do something like (?=[aeiou])\X either, because even in NFD a code point like 'ø' does not decompose! However, it will test equal to an 'o' using the UCA comparison I just showed you. You can't rely on NFD, you have to rely on UCA.
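To make a couple of these points concrete, here is a minimal sketch, assuming only the core Unicode::Collate and Unicode::Normalize modules (the strings are just examples):
use v5.14;                       # implies the unicode_strings feature
use utf8;                        # this source contains UTF-8 literals
use Unicode::Collate;
use Unicode::Normalize qw( NFC );
binmode STDOUT, ':encoding(UTF-8)';

say uc("\xDF");                                        # SS: casing works above ASCII
my $coll = Unicode::Collate->new(level => 1);          # primary strength: base letters only
say $coll->eq('niño', 'nino') ? 'same letters' : 'different letters';
say for $coll->sort(qw( résumé rose resume ));         # the resume-words sort before rose
say NFC("o\x{303}") eq 'õ' ? 'NFC composed' : 'no match';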
Assume Brokenness
And that's not all. There are a million broken assumptions that people make about Unicode. Until they understand these things, their Perl code will be broken.
Code that assumes it can open a text file without specifying the encoding is broken.
Code that assumes the default encoding is some sort of native platform encoding is broken.
Code that assumes that web pages in Japanese or Chinese take up less space in UTF-16 than in UTF-8 is wrong.
Code that assumes Perl uses UTF-8 internally is wrong.
Code that assumes that encoding errors will always raise an exception is wrong.
Code that assumes Perl code points are limited to 0x10_FFFF is wrong.
Code that assumes you can set $/ to something that will work with any valid line separator is wrong.
Code that assumes roundtrip equality on casefolding, like lc(uc($s)) eq $s or uc(lc($s)) eq $s, is completely broken and wrong. Consider that uc("σ") and uc("ς") are both "Σ", but lc("Σ") cannot possibly return both of those.
Code that assumes every lowercase code point has a distinct uppercase one, or vice versa, is broken. For example, "ª" is a lowercase letter with no uppercase; whereas both "ᵃ" and "ᴬ" are letters, but they are not lowercase letters; however, they are both lowercase code points without corresponding uppercase versions. Got that? They are not \p{Lowercase_Letter}, despite being both \p{Letter} and \p{Lowercase}.
Code that assumes changing the case doesn't change the length of the string is broken.
Code that assumes there are only two cases is broken. There's also titlecase.
Code that assumes only letters have case is broken. Beyond just letters, it turns out that numbers, symbols, and even marks have case. In fact, changing the case can even make something change its main general category, like a \p{Mark} turning into a \p{Letter}. It can also make it switch from one script to another.
Code that assumes that case is never locale-dependent is broken.
Code that assumes Unicode gives a fig about POSIX locales is broken.
Code that assumes you can remove diacritics to get at base ASCII letters is evil, still, broken, brain-damaged, wrong, and justification for capital punishment.
Code that assumes that diacritics \p{Diacritic} and marks \p{Mark} are the same thing is broken.
Code that assumes \p{GC=Dash_Punctuation} covers as much as \p{Dash} is broken.
Code that assumes dashes, hyphens, and minuses are the same thing as each other, or that there is only one of each, is broken and wrong.
Code that assumes every code point takes up no more than one print column is broken.
Code that assumes that all \p{Mark} characters take up zero print columns is broken.
Code that assumes that characters which look alike are alike is broken.
Code that assumes that characters which do not look alike are not alike is broken.
Code that assumes there is a limit to the number of code points in a row that just one \X can match is wrong.
Code that assumes \X can never start with a \p{Mark} character is wrong.
Code that assumes that \X can never hold two non-\p{Mark} characters is wrong.
Code that assumes that it cannot use "\x{FFFF}" is wrong.
Code that assumes a non-BMP code point that requires two UTF-16 (surrogate) code units will encode to two separate UTF-8 characters, one per code unit, is wrong. It doesn't: it encodes to a single code point.
Code that transcodes from UTF-16 or UTF-32 with leading BOMs into UTF-8 is broken if it puts a BOM at the start of the resulting UTF-8. This is so stupid the engineer should have their eyelids removed.
Code that assumes that CESU-8 is a valid UTF encoding is wrong. Likewise, code that thinks encoding U+0000 as "\xC0\x80" is UTF-8 is broken and wrong. These guys also deserve the eyelid treatment.
Code that assumes characters like > always point to the right and < always point to the left is wrong, because in fact they do not.
Code that assumes if you first output character X and then character Y, that those will show up as XY is wrong. Sometimes they don't.
Code that assumes that ASCII is good enough for writing English properly is stupid, shortsighted, illiterate, broken, evil, and wrong. Off with their heads! If that seems too extreme, we can compromise: henceforth they may type only with their big toe from one foot. (The rest will be duct taped.)
Code that assumes that all \p{Math} code points are visible characters is wrong.
Code that assumes \w contains only letters, digits, and underscores is wrong.
Code that assumes that ^ and ~ are punctuation marks is wrong.
Code that assumes that ü has an umlaut is wrong.
Code that believes things like ₨ contain any letters in them is wrong.
Code that believes \p{InLatin} is the same as \p{Latin} is heinously broken.
Code that believes that \p{InLatin} is almost ever useful is almost certainly wrong.
Code that believes that given $FIRST_LETTER as the first letter in some alphabet and $LAST_LETTER as the last letter in that same alphabet, that [${FIRST_LETTER}-${LAST_LETTER}] has any meaning whatsoever is almost always completely broken and wrong and meaningless.
Code that believes someoneโ€™s name can only contain certain characters is stupid, offensive, and wrong.
Code that tries to reduce Unicode to ASCII is not merely wrong, its perpetrator should never be allowed to work in programming again. Period. I'm not even positive they should even be allowed to see again, since it obviously hasn't done them much good so far.
Code that believes there's some way to pretend textfile encodings don't exist is broken and dangerous. Might as well poke the other eye out, too.
Code that converts unknown characters to ? is broken, stupid, braindead, and runs contrary to the standard recommendation, which says NOT TO DO THAT! RTFM for why not.
Code that believes it can reliably guess the encoding of an unmarked textfile is guilty of a fatal mélange of hubris and naïveté that only a lightning bolt from Zeus will fix.
Code that believes you can use Perl printf widths to pad and justify Unicode data is broken and wrong.
Code that believes once you successfully create a file by a given name, that when you run ls or readdir on its enclosing directory, you'll actually find that file with the name you created it under is buggy, broken, and wrong. Stop being surprised by this!
Code that believes UTF-16 is a fixed-width encoding is stupid, broken, and wrong. Revoke their programming licence.
Code that treats code points from one plane one whit differently than those from any other plane is ipso facto broken and wrong. Go back to school.
Code that believes that stuff like /s/i can only match "S" or "s" is broken and wrong. You'd be surprised.
Code that uses \PM\pM* to find grapheme clusters instead of using \X is broken and wrong.
People who want to go back to the ASCII world should be whole-heartedly encouraged to do so, and in honor of their glorious upgrade they should be provided gratis with a pre-electric manual typewriter for all their data-entry needs. Messages sent to them should be sent via an ALLCAPS telegraph at 40 characters per line and hand-delivered by a courier. STOP.
Summary
I don't know how much more "default Unicode in Perl" you can get than what I've written. Well, yes I do: you should be using Unicode::Collate and Unicode::LineBreak, too. And probably more.
As you see, there are far too many Unicode things that you really do have to worry about for there to ever exist any such thing as "default to Unicode".
What you're going to discover, just as we did back in Perl 5.8, is that it is simply impossible to impose all these things on code that hasn't been designed right from the beginning to account for them. Your well-meaning selfishness just broke the entire world.
And even once you do, there are still critical issues that require a great deal of thought to get right. There is no switch you can flip. Nothing but brain, and I mean real brain, will suffice here. There's a heck of a lot of stuff you have to learn. Modulo the retreat to the manual typewriter, you simply cannot hope to sneak by in ignorance. This is the 21st century, and you cannot wish Unicode away by willful ignorance.
You have to learn it. Period. It will never be so easy that "everything just works," because that will guarantee that a lot of things don't work -- which invalidates the assumption that there can ever be a way to "make it all work."
You may be able to get a few reasonable defaults for a very few and very limited operations, but not without thinking about things a whole lot more than I think you have.
As just one example, canonical ordering is going to cause some real headaches. "\x{F5}" 'õ', "o\x{303}" 'õ', "o\x{303}\x{304}" 'ȭ', and "o\x{304}\x{303}" 'ō̃' should all match 'õ', but how in the world are you going to do that? This is harder than it looks, but it's something you need to account for.
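A small sketch of that point, assuming the core Unicode::Normalize and Unicode::Collate modules: NFD alone leaves three distinct spellings, while a primary-strength UCA comparison treats all four as the same base letter.
use utf8;
use Unicode::Normalize qw( NFD );
use Unicode::Collate;

my @forms = ("\x{F5}", "o\x{303}", "o\x{303}\x{304}", "o\x{304}\x{303}");

my %seen;
$seen{ NFD($_) }++ for @forms;
print scalar(keys %seen), " distinct NFD strings\n";               # 3: NFD only unifies the first two

my $uca = Unicode::Collate->new(level => 1);                       # primary strength ignores the marks
print "all four match o-tilde\n" if 4 == grep { $uca->eq($_, "õ") } @forms;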
If there's one thing I know about Perl, it is what its Unicode bits do and do not do, and this thing I promise you: there is no Unicode magic bullet.
You cannot just change some defaults and get smooth sailing. It's true that I run Perl with PERL_UNICODE set to "SA", but that's all, and even that is mostly for command-line stuff. For real work, I go through all the many steps outlined above, and I do it very, very carefully.
Good luck, have a nice day, and hope this helps!
There are two stages to processing Unicode text. The first is "how can I input it and output it without losing information". The second is "how do I treat text according to local language conventions".
tchrist's post covers both, but the second part is where 99% of the text in his post comes from. Most programs don't even handle I/O correctly, so it's important to understand that before you even begin to worry about normalization and collation.
This post aims to solve that first problem.
When you read data into Perl, it doesn't care what encoding it is. It allocates some memory and stashes the bytes away there. If you say print $str, it just blits those bytes out to your terminal, which is probably set to assume everything that is written to it is UTF-8, and your text shows up.
Marvelous.
Except, it's not. If you try to treat the data as text, you'll see that Something Bad is happening. You need go no further than length to see that what Perl thinks about your string and what you think about your string disagree. Write a one-liner like: perl -E 'while(<>){ chomp; say length }' and type in 文字化け and you get 12... not the correct answer, 4.
That's because Perl assumes your string is not text. You have to tell it that it's text before it will give you the right answer.
That's easy enough; the Encode module has the functions to do that. The generic entry point is Encode::decode (or use Encode qw(decode), of course). That function takes some string from the outside world (what we'll call "octets", a fancy way of saying "8-bit bytes"), and turns it into some text that Perl will understand. The first argument is a character encoding name, like "UTF-8" or "ASCII" or "EUC-JP". The second argument is the string. The return value is the Perl scalar containing the text.
(There is also Encode::decode_utf8, which assumes UTF-8 for the encoding.)
If we rewrite our one-liner:
perl -MEncode=decode -E 'while(<>){ chomp; say length decode("UTF-8", $_) }'
We type in 文字化け and get "4" as the result. Success.
That, right there, is the solution to 99% of Unicode problems in Perl.
The key is, whenever any text comes into your program, you must decode it. The Internet cannot transmit characters. Files cannot store characters. There are no characters in your database. There are only octets, and you can't treat octets as characters in Perl. You must decode the encoded octets into Perl characters with the Encode module.
The other half of the problem is getting data out of your program. That's easy too; you just say use Encode qw(encode), decide what encoding your data will be in (UTF-8 for terminals that understand UTF-8, UTF-16 for files on Windows, etc.), and then output the result of encode($encoding, $data) instead of just outputting $data.
This operation converts Perl's characters, which is what your program operates on, to octets that can be used by the outside world. It would be a lot easier if we could just send characters over the Internet or to our terminals, but we can't: octets only. So we have to convert characters to octets, otherwise the results are undefined.
To summarize: encode all outputs and decode all inputs.
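As a minimal sketch of that summary, here is a small filter that decodes what comes in and encodes what goes out by hand (you could equally push :encoding(UTF-8) layers with binmode and skip the explicit calls):
use strict;
use warnings;
use Encode qw( decode encode );

while (my $octets = <STDIN>) {
    my $text = decode('UTF-8', $octets);   # octets from the outside world -> Perl characters
    $text = uc $text;                      # string operations now see characters, not bytes
    print encode('UTF-8', $text);          # Perl characters -> octets for the outside world
}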
Now we'll talk about three issues that make this a little challenging. The first is libraries. Do they handle text correctly? The answer is... they try. If you download a web page, LWP will give you your result back as text. If you call the right method on the result, that is (and that happens to be decoded_content, not content, which is just the octet stream that it got from the server.) Database drivers can be flaky; if you use DBD::SQLite with just Perl, it will work out, but if some other tool has put text stored as some encoding other than UTF-8 in your database... well... it's not going to be handled correctly until you write code to handle it correctly.
Outputting data is usually easier, but if you see "wide character in print", then you know you're messing up the encoding somewhere. That warning means "hey, you're trying to leak Perl characters to the outside world and that doesn't make any sense". Your program appears to work (because the other end usually handles the raw Perl characters correctly), but it is very broken and could stop working at any moment. Fix it with an explicit Encode::encode!
The second problem is UTF-8 encoded source code. Unless you say use utf8 at the top of each file, Perl will not assume that your source code is UTF-8. This means that each time you say something like my $var = 'ใปใ’', you're injecting garbage into your program that will totally break everything horribly. You don't have to "use utf8", but if you don't, you must not use any non-ASCII characters in your program.
The third problem is how Perl handles The Past. A long time ago, there was no such thing as Unicode, and Perl assumed that everything was Latin-1 text or binary. So when data comes into your program and you start treating it as text, Perl treats each octet as a Latin-1 character. That's why, when we asked for the length of "文字化け", we got 12. Perl assumed that we were operating on the corresponding 12-character Latin-1 string (some of whose characters are non-printing).
This is called an "implicit upgrade", and it's a perfectly reasonable thing to do, but it's not what you want if your text is not Latin-1. That's why it's critical to explicitly decode input: if you don't do it, Perl will, and it might do it wrong.
People run into trouble where half their data is a proper character string, and some is still binary. Perl will interpret the part that's still binary as though it's Latin-1 text and then combine it with the correct character data. This will make it look like handling your characters correctly broke your program, but in reality, you just haven't fixed it enough.
Here's an example: you have a program that reads a UTF-8-encoded text file, you tack on a Unicode PILE OF POO to each line, and you print it out. You write it like:
while (<>) {
    chomp;
    say "$_ 💩";
}
And then run on some UTF-8 encoded data, like:
perl poo.pl input-data.txt
It prints the UTF-8 data with a poo at the end of each line. Perfect, my program works!
But nope, you're just doing binary concatenation. You're reading octets from the file, removing a \n with chomp, and then tacking on the bytes in the UTF-8 representation of the PILE OF POO character. When you revise your program to decode the data from the file and encode the output, you'll notice that you get mojibake garbage instead of the poo. This will lead you to believe that decoding the input file is the wrong thing to do. It's not.
The problem is that the poo is being implicitly upgraded as if it were Latin-1. If you use utf8 to make the literal text instead of binary, then it will work again!
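For completeness, here is a sketch of the fully repaired version (reading from STDIN for simplicity): the input is decoded, use utf8 makes the literal real text, and the output handle encodes on the way out.
use strict;
use warnings;
use utf8;                            # the literal below is source text, not raw bytes
binmode STDIN,  ':encoding(UTF-8)';  # decode incoming octets to characters
binmode STDOUT, ':encoding(UTF-8)';  # encode outgoing characters to octets

while (my $line = <STDIN>) {
    chomp $line;
    print "$line 💩\n";
}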
(That's the number one problem I see when helping people with Unicode. They did part right and that broke their program. That's what's sad about undefined results: you can have a working program for a long time, but when you start to repair it, it breaks. Don't worry; if you are adding encode/decode statements to your program and it breaks, it just means you have more work to do. Next time, when you design with Unicode in mind from the beginning, it will be much easier!)
That's really all you need to know about Perl and Unicode. If you tell Perl what your data is, it has the best Unicode support among all popular programming languages. If you assume it will magically know what sort of text you are feeding it, though, then you're going to trash your data irrevocably. Just because your program works today on your UTF-8 terminal doesn't mean it will work tomorrow on a UTF-16 encoded file. So make it safe now, and save yourself the headache of trashing your users' data!
The easy part of handling Unicode is encoding output and decoding input. The hard part is finding all your input and output, and determining which encoding it is. But that's why you get the big bucks :)
We're all in agreement that it is a difficult problem for many reasons,
but that's precisely the reason to try to make it easier on everybody.
There is a recent module on CPAN, utf8::all, that attempts to "turn on Unicode. All of it".
As has been pointed out, you can't magically make the entire system (outside programs, external web requests, etc.) use Unicode as well, but we can work together to make sensible tools that make doing common problems easier. That's the reason that we're programmers.
If utf8::all doesn't do something you think it should, let's improve it to make it better. Or let's make additional tools that together can suit people's varying needs as well as possible.
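For example, the intent is that a single line replaces most of the boilerplate above (exactly which handles and variables it covers depends on the utf8::all version, so check its documentation):
use utf8::all;        # UTF-8 source code, @ARGV, and the standard handles
use feature 'say';

say "déjà vu";        # printed without a "Wide character" warning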
I think you misunderstand Unicode and its relationship to Perl. No matter which way you store data, Unicode, ISO-8859-1, or many other things, your program has to know how to interpret the bytes it gets as input (decoding) and how to represent the information it wants to output (encoding). Get that interpretation wrong and you garble the data. There isn't some magic default setup inside your program that's going to tell the stuff outside your program how to act.
You think it's hard, most likely, because you are used to everything being ASCII. Everything you should have been thinking about was simply ignored by the programming language and all of the things it had to interact with. If everything used nothing but UTF-8 and you had no choice, then UTF-8 would be just as easy. But not everything does use UTF-8. For instance, you don't want your input handle to think that it's getting UTF-8 octets unless it actually is, and you don't want your output handles to be UTF-8 if the thing reading from them can't handle UTF-8. Perl has no way to know those things. That's why you are the programmer.
I don't think Unicode in Perl 5 is too complicated. I think it's scary and people avoid it. There's a difference. To that end, I've put Unicode in Learning Perl, 6th Edition, and there's a lot of Unicode stuff in Effective Perl Programming. You have to spend the time to learn and understand Unicode and how it works. You're not going to be able to use it effectively otherwise.
While reading this thread, I often get the impression that people are using "UTF-8" as a synonym for "Unicode". Please make a distinction between Unicode's code points, which are an enlarged relative of the ASCII code, and Unicode's various encodings. There are a few of them, of which UTF-8, UTF-16 and UTF-32 are the current ones, and a few more are obsolete.
Please: UTF-8 (as well as all the other encodings) exists and has meaning on input or output only. Internally, since Perl 5.8.1, all strings are kept as Unicode code points. True, you have to enable some features, as admirably covered previously.
There's a truly horrifying amount of ancient code out there in the wild, much of it in the form of common CPAN modules. I've found I have to be fairly careful enabling Unicode if I use external modules that might be affected by it, and am still trying to identify and fix some Unicode-related failures in several Perl scripts I use regularly (in particular, iTiVo fails badly on anything that's not 7-bit ASCII due to transcoding issues).
You should enable the unicode_strings feature, and this is the default if you use v5.14;
You should not really use Unicode identifiers, especially for foreign code via utf8, as they are insecure in perl5; only cperl got that right. See e.g. http://perl11.org/blog/unicode-identifiers.html
Regarding utf8 for your filehandles/streams: you need to decide for yourself the encoding of your external data. A library cannot know that, and since not even libc supports utf8, proper UTF-8 data is rare. There's also WTF-8, the Windows aberration of UTF-8, floating around.
BTW: Moose is not really "Modern Perl", they just hijacked the name. Moose is perfect Larry Wall-style postmodern perl mixed with Bjarne Stroustrup-style everything goes, with an eclectic aberration of proper perl6 syntax, e.g. using strings for variable names, horrible fields syntax, and a very immature naive implementation which is 10x slower than a proper implementation.
cperl and perl6 are the true modern perls, where form follows function, and the implementation is reduced and optimized.

Why does modern Perl avoid UTF-8 by default?

I wonder why most modern solutions built using Perl don't enable UTF-8 by default.
I understand there are many legacy problems for core Perl scripts, where it may break things. But, from my point of view, in the 21st century, big new projects (or projects with a big perspective) should make their software UTF-8 proof from scratch. Still I don't see it happening. For example, Moose enables strict and warnings, but not Unicode. Modern::Perl reduces boilerplate too, but no UTF-8 handling.
Why? Are there some reasons to avoid UTF-8 in modern Perl projects in the year 2011?
Commenting #tchrist got too long, so I'm adding it here.
It seems that I did not make myself clear. Let me try to add some things.
tchrist and I see situation pretty similarly, but our conclusions are completely in opposite ends. I agree, the situation with Unicode is complicated, but this is why we (Perl users and coders) need some layer (or pragma) which makes UTF-8 handling as easy as it must be nowadays.
tchrist pointed to many aspects to cover, I will read and think about them for days or even weeks. Still, this is not my point. tchrist tries to prove that there is not one single way "to enable UTF-8". I have not so much knowledge to argue with that. So, I stick to live examples.
I played around with Rakudo and UTF-8 was just there as I needed. I didn't have any problems, it just worked. Maybe there are some limitation somewhere deeper, but at start, all I tested worked as I expected.
Shouldn't that be a goal in modern Perl 5 too? I stress it more: I'm not suggesting UTF-8 as the default character set for core Perl, I suggest the possibility to trigger it with a snap for those who develop new projects.
Another example, but with a more negative tone. Frameworks should make development easier. Some years ago, I tried web frameworks, but just threw them away because "enabling UTF-8" was so obscure. I did not find how and where to hook Unicode support. It was so time-consuming that I found it easier to go the old way. Now I saw here there was a bounty to deal with the same problem with Mason 2: How to make Mason2 UTF-8 clean?. So, it is pretty new framework, but using it with UTF-8 needs deep knowledge of its internals. It is like a big red sign: STOP, don't use me!
I really like Perl. But dealing with Unicode is painful. I still find myself running against walls. Some way tchrist is right and answers my questions: new projects don't attract UTF-8 because it is too complicated in Perl 5.
๐™Ž๐™ž๐™ข๐™ฅ๐™ก๐™š๐™จ๐™ฉ โ„ž: ๐Ÿ• ๐˜ฟ๐™ž๐™จ๐™˜๐™ง๐™š๐™ฉ๐™š ๐™๐™š๐™˜๐™ค๐™ข๐™ข๐™š๐™ฃ๐™™๐™–๐™ฉ๐™ž๐™ค๐™ฃ๐™จ
Set your PERL_UNICODE envariable to AS. This makes all Perl scripts decode #ARGV as UTFโ€‘8 strings, and sets the encoding of all three of stdin, stdout, and stderr to UTFโ€‘8. Both these are global effects, not lexical ones.
At the top of your source file (program, module, library, dohickey), prominently assert that you are running perl version 5.12 or better via:
use v5.12; # minimal for unicode string feature
use v5.14; # optimal for unicode string feature
Enable warnings, since the previous declaration only enables strictures and features, not warnings. I also suggest promoting Unicode warnings into exceptions, so use both these lines, not just one of them. Note however that under v5.14, the utf8 warning class comprises three other subwarnings which can all be separately enabled: nonchar, surrogate, and non_unicode. These you may wish to exert greater control over.
use warnings;
use warnings qw( FATAL utf8 );
Declare that this source unit is encoded as UTFโ€‘8. Although once upon a time this pragma did other things, it now serves this one singular purpose alone and no other:
use utf8;
Declare that anything that opens a filehandle within this lexical scope but not elsewhere is to assume that that stream is encoded in UTFโ€‘8 unless you tell it otherwise. That way you do not affect other moduleโ€™s or other programโ€™s code.
use open qw( :encoding(UTF-8) :std );
Enable named characters via \N{CHARNAME}.
use charnames qw( :full :short );
If you have a DATA handle, you must explicitly set its encoding. If you want this to be UTFโ€‘8, then say:
binmode(DATA, ":encoding(UTF-8)");
There is of course no end of other matters with which you may eventually find yourself concerned, but these will suffice to approximate the state goal to โ€œmake everything just work with UTFโ€‘8โ€, albeit for a somewhat weakened sense of those terms.
One other pragma, although it is not Unicode related, is:
use autodie;
It is strongly recommended.
๐ŸŒด ๐Ÿช๐Ÿซ๐Ÿช ๐ŸŒž ๐•ฒ๐–” ๐•ฟ๐–๐–”๐–š ๐–†๐–“๐–‰ ๐•ฏ๐–” ๐•ท๐–Ž๐–๐–Š๐–œ๐–Ž๐–˜๐–Š ๐ŸŒž ๐Ÿช๐Ÿซ๐Ÿช ๐Ÿ
๐ŸŽ ๐Ÿช ๐•ญ๐–”๐–Ž๐–‘๐–Š๐–—โธ—๐–•๐–‘๐–†๐–™๐–Š ๐–‹๐–”๐–— ๐–€๐–“๐–Ž๐–ˆ๐–”๐–‰๐–Šโธ—๐•ฌ๐–œ๐–†๐–—๐–Š ๐•ฎ๐–”๐–‰๐–Š ๐Ÿช ๐ŸŽ
My own boilerplate these days tends to look like this:
use 5.014;
use utf8;
use strict;
use autodie;
use warnings;
use warnings qw< FATAL utf8 >;
use open qw< :std :utf8 >;
use charnames qw< :full >;
use feature qw< unicode_strings >;
use File::Basename qw< basename >;
use Carp qw< carp croak confess cluck >;
use Encode qw< encode decode >;
use Unicode::Normalize qw< NFD NFC >;
END { close STDOUT }
if (grep /\P{ASCII}/ => #ARGV) {
#ARGV = map { decode("UTF-8", $_) } #ARGV;
}
$0 = basename($0); # shorter messages
$| = 1;
binmode(DATA, ":utf8");
# give a full stack dump on any untrapped exceptions
local $SIG{__DIE__} = sub {
confess "Uncaught exception: #_" unless $^S;
};
# now promote run-time warnings into stack-dumped
# exceptions *unless* we're in an try block, in
# which case just cluck the stack dump instead
local $SIG{__WARN__} = sub {
if ($^S) { cluck "Trapped warning: #_" }
else { confess "Deadly warning: #_" }
};
while (<>) {
chomp;
$_ = NFD($_);
...
} continue {
say NFC($_);
}
__END__
๐ŸŽ… ๐•น ๐–” ๐•ธ ๐–† ๐–Œ ๐–Ž ๐–ˆ ๐•ญ ๐–š ๐–‘ ๐–‘ ๐–Š ๐–™ ๐ŸŽ…
Saying that โ€œPerl should [somehow!] enable Unicode by defaultโ€ doesnโ€™t even start to begin to think about getting around to saying enough to be even marginally useful in some sort of rare and isolated case. Unicode is much much more than just a larger character repertoire; itโ€™s also how those characters all interact in many, many ways.
Even the simple-minded minimal measures that (some) people seem to think they want are guaranteed to miserably break millions of lines of code, code that has no chance to โ€œupgradeโ€ to your spiffy new Brave New World modernity.
It is way way way more complicated than people pretend. Iโ€™ve thought about this a huge, whole lot over the past few years. I would love to be shown that I am wrong. But I donโ€™t think I am. Unicode is fundamentally more complex than the model that you would like to impose on it, and there is complexity here that you can never sweep under the carpet. If you try, youโ€™ll break either your own code or somebody elseโ€™s. At some point, you simply have to break down and learn what Unicode is about. You cannot pretend it is something it is not.
๐Ÿช goes out of its way to make Unicode easy, far more than anything else Iโ€™ve ever used. If you think this is bad, try something else for a while. Then come back to ๐Ÿช: either you will have returned to a better world, or else you will bring knowledge of the same with you so that we can make use of your new knowledge to make ๐Ÿช better at these things.
๐Ÿ’ก ๐•ด๐–‰๐–Š๐–†๐–˜ ๐–‹๐–”๐–— ๐–† ๐–€๐–“๐–Ž๐–ˆ๐–”๐–‰๐–Š โธ— ๐•ฌ๐–œ๐–†๐–—๐–Š ๐Ÿช ๐•ท๐–†๐–š๐–“๐–‰๐–—๐–ž ๐•ท๐–Ž๐–˜๐–™ ๐Ÿ’ก
At a minimum, here are some things that would appear to be required for ๐Ÿช to โ€œenable Unicode by defaultโ€, as you put it:
All ๐Ÿช source code should be in UTF-8 by default. You can get that with use utf8 or export PERL5OPTS=-Mutf8.
The ๐Ÿช DATA handle should be UTF-8. You will have to do this on a per-package basis, as in binmode(DATA, ":encoding(UTF-8)").
Program arguments to ๐Ÿช scripts should be understood to be UTF-8 by default. export PERL_UNICODE=A, or perl -CA, or export PERL5OPTS=-CA.
The standard input, output, and error streams should default to UTF-8. export PERL_UNICODE=S for all of them, or I, O, and/or E for just some of them. This is like perl -CS.
Any other handles opened by ๐Ÿช should be considered UTF-8 unless declared otherwise; export PERL_UNICODE=D or with i and o for particular ones of these; export PERL5OPTS=-CD would work. That makes -CSAD for all of them.
Cover both bases plus all the streams you open with export PERL5OPTS=-Mopen=:utf8,:std. See uniquote.
You donโ€™t want to miss UTF-8 encoding errors. Try export PERL5OPTS=-Mwarnings=FATAL,utf8. And make sure your input streams are always binmoded to :encoding(UTF-8), not just to :utf8.
Code points between 128โ€“255 should be understood by ๐Ÿช to be the corresponding Unicode code points, not just unpropertied binary values. use feature "unicode_strings" or export PERL5OPTS=-Mfeature=unicode_strings. That will make uc("\xDF") eq "SS" and "\xE9" =~ /\w/. A simple export PERL5OPTS=-Mv5.12 or better will also get that.
Named Unicode characters are not by default enabled, so add export PERL5OPTS=-Mcharnames=:full,:short,latin,greek or some such. See uninames and tcgrep.
You almost always need access to the functions from the standard Unicode::Normalize module various types of decompositions. export PERL5OPTS=-MUnicode::Normalize=NFD,NFKD,NFC,NFKD, and then always run incoming stuff through NFD and outbound stuff from NFC. Thereโ€™s no I/O layer for these yet that Iโ€™m aware of, but see nfc, nfd, nfkd, and nfkc.
String comparisons in ๐Ÿช using eq, ne, lc, cmp, sort, &c&cc are always wrong. So instead of #a = sort #b, you need #a = Unicode::Collate->new->sort(#b). Might as well add that to your export PERL5OPTS=-MUnicode::Collate. You can cache the key for binary comparisons.
๐Ÿช built-ins like printf and write do the wrong thing with Unicode data. You need to use the Unicode::GCString module for the former, and both that and also the Unicode::LineBreak module as well for the latter. See uwc and unifmt.
If you want them to count as integers, then you are going to have to run your \d+ captures through the Unicode::UCD::num function because ๐Ÿชโ€™s built-in atoi(3) isnโ€™t currently clever enough.
You are going to have filesystem issues on ๐Ÿ‘ฝ filesystems. Some filesystems silently enforce a conversion to NFC; others silently enforce a conversion to NFD. And others do something else still. Some even ignore the matter altogether, which leads to even greater problems. So you have to do your own NFC/NFD handling to keep sane.
All your ๐Ÿช code involving a-z or A-Z and such MUST BE CHANGED, including m//, s///, and tr///. Itโ€™s should stand out as a screaming red flag that your code is broken. But it is not clear how it must change. Getting the right properties, and understanding their casefolds, is harder than you might think. I use unichars and uniprops every single day.
Code that uses \p{Lu} is almost as wrong as code that uses [A-Za-z]. You need to use \p{Upper} instead, and know the reason why. Yes, \p{Lowercase} and \p{Lower} are different from \p{Ll} and \p{Lowercase_Letter}.
Code that uses [a-zA-Z] is even worse. And it canโ€™t use \pL or \p{Letter}; it needs to use \p{Alphabetic}. Not all alphabetics are letters, you know!
If you are looking for ๐Ÿช variables with /[\$\#\%]\w+/, then you have a problem. You need to look for /[\$\#\%]\p{IDS}\p{IDC}*/, and even that isnโ€™t thinking about the punctuation variables or package variables.
If you are checking for whitespace, then you should choose between \h and \v, depending. And you should never use \s, since it DOES NOT MEAN [\h\v], contrary to popular belief.
If you are using \n for a line boundary, or even \r\n, then you are doing it wrong. You have to use \R, which is not the same!
If you donโ€™t know when and whether to call Unicode::Stringprep, then you had better learn.
Case-insensitive comparisons need to check for whether two things are the same letters no matter their diacritics and such. The easiest way to do that is with the standard Unicode::Collate module. Unicode::Collate->new(level => 1)->cmp($a, $b). There are also eq methods and such, and you should probably learn about the match and substr methods, too. These are have distinct advantages over the ๐Ÿช built-ins.
Sometimes thatโ€™s still not enough, and you need the Unicode::Collate::Locale module instead, as in Unicode::Collate::Locale->new(locale => "de__phonebook", level => 1)->cmp($a, $b) instead. Consider that Unicode::Collate::->new(level => 1)->eq("d", "รฐ") is true, but Unicode::Collate::Locale->new(locale=>"is",level => 1)->eq("d", " รฐ") is false. Similarly, "ae" and "รฆ" are eq if you donโ€™t use locales, or if you use the English one, but they are different in the Icelandic locale. Now what? Itโ€™s tough, I tell you. You can play with ucsort to test some of these things out.
Consider how to match the pattern CVCV (consonsant, vowel, consonant, vowel) in the string โ€œniรฑoโ€. Its NFD form โ€” which you had darned well better have remembered to put it in โ€” becomes โ€œnin\x{303}oโ€. Now what are you going to do? Even pretending that a vowel is [aeiou] (which is wrong, by the way), you wonโ€™t be able to do something like (?=[aeiou])\X) either, because even in NFD a code point like โ€˜รธโ€™ does not decompose! However, it will test equal to an โ€˜oโ€™ using the UCA comparison I just showed you. You canโ€™t rely on NFD, you have to rely on UCA.
๐Ÿ’ฉ ๐”ธ ๐•ค ๐•ค ๐•ฆ ๐•ž ๐•– ๐”น ๐•ฃ ๐•  ๐•œ ๐•– ๐•Ÿ ๐•Ÿ ๐•– ๐•ค ๐•ค ๐Ÿ’ฉ
And thatโ€™s not all. There are a million broken assumptions that people make about Unicode. Until they understand these things, their ๐Ÿช code will be broken.
Code that assumes it can open a text file without specifying the encoding is broken.
Code that assumes the default encoding is some sort of native platform encoding is broken.
Code that assumes that web pages in Japanese or Chinese take up less space in UTFโ€‘16 than in UTFโ€‘8 is wrong.
Code that assumes Perl uses UTFโ€‘8 internally is wrong.
Code that assumes that encoding errors will always raise an exception is wrong.
Code that assumes Perl code points are limited to 0x10_FFFF is wrong.
Code that assumes you can set $/ to something that will work with any valid line separator is wrong.
Code that assumes roundtrip equality on casefolding, like lc(uc($s)) eq $s or uc(lc($s)) eq $s, is completely broken and wrong. Consider that the uc("ฯƒ") and uc("ฯ‚") are both "ฮฃ", but lc("ฮฃ") cannot possibly return both of those.
Code that assumes every lowercase code point has a distinct uppercase one, or vice versa, is broken. For example, "ยช" is a lowercase letter with no uppercase; whereas both "แตƒ" and "แดฌ" are letters, but they are not lowercase letters; however, they are both lowercase code points without corresponding uppercase versions. Got that? They are not \p{Lowercase_Letter}, despite being both \p{Letter} and \p{Lowercase}.
Code that assumes changing the case doesnโ€™t change the length of the string is broken.
Code that assumes there are only two cases is broken. Thereโ€™s also titlecase.
Code that assumes only letters have case is broken. Beyond just letters, it turns out that numbers, symbols, and even marks have case. In fact, changing the case can even make something change its main general category, like a \p{Mark} turning into a \p{Letter}. It can also make it switch from one script to another.
Code that assumes that case is never locale-dependent is broken.
Code that assumes Unicode gives a fig about POSIX locales is broken.
Code that assumes you can remove diacritics to get at base ASCII letters is evil, still, broken, brain-damaged, wrong, and justification for capital punishment.
Code that assumes that diacritics \p{Diacritic} and marks \p{Mark} are the same thing is broken.
Code that assumes \p{GC=Dash_Punctuation} covers as much as \p{Dash} is broken.
Code that assumes dash, hyphens, and minuses are the same thing as each other, or that there is only one of each, is broken and wrong.
Code that assumes every code point takes up no more than one print column is broken.
Code that assumes that all \p{Mark} characters take up zero print columns is broken.
Code that assumes that characters which look alike are alike is broken.
Code that assumes that characters which do not look alike are not alike is broken.
Code that assumes there is a limit to the number of code points in a row that just one \X can match is wrong.
Code that assumes \X can never start with a \p{Mark} character is wrong.
Code that assumes that \X can never hold two non-\p{Mark} characters is wrong.
Code that assumes that it cannot use "\x{FFFF}" is wrong.
Code that assumes a non-BMP code point that requires two UTF-16 (surrogate) code units will encode to two separate UTF-8 characters, one per code unit, is wrong. It doesn’t: it encodes to a single code point.
Code that transcodes from UTF-16 or UTF-32 with leading BOMs into UTF-8 is broken if it puts a BOM at the start of the resulting UTF-8. This is so stupid the engineer should have their eyelids removed.
Code that assumes CESU-8 is a valid UTF encoding is wrong. Likewise, code that thinks encoding U+0000 as "\xC0\x80" is UTF-8 is broken and wrong. These guys also deserve the eyelid treatment.
Code that assumes characters like > always point to the right and < always point to the left is wrong -- because in fact they do not.
Code that assumes if you first output character X and then character Y, that those will show up as XY is wrong. Sometimes they donโ€™t.
Code that assumes that ASCII is good enough for writing English properly is stupid, shortsighted, illiterate, broken, evil, and wrong. Off with their heads! If that seems too extreme, we can compromise: henceforth they may type only with their big toe from one foot. (The rest will be duct taped.)
Code that assumes that all \p{Math} code points are visible characters is wrong.
Code that assumes \w contains only letters, digits, and underscores is wrong.
Code that assumes that ^ and ~ are punctuation marks is wrong.
Code that assumes that ü has an umlaut is wrong.
Code that believes things like ₨ contain any letters in them is wrong.
Code that believes \p{InLatin} is the same as \p{Latin} is heinously broken.
Code that believes that \p{InLatin} is almost ever useful is almost certainly wrong.
Code that believes that given $FIRST_LETTER as the first letter in some alphabet and $LAST_LETTER as the last letter in that same alphabet, that [${FIRST_LETTER}-${LAST_LETTER}] has any meaning whatsoever is almost always completely broken, wrong, and meaningless.
Code that believes someoneโ€™s name can only contain certain characters is stupid, offensive, and wrong.
Code that tries to reduce Unicode to ASCII is not merely wrong, its perpetrator should never be allowed to work in programming again. Period. Iโ€™m not even positive they should even be allowed to see again, since it obviously hasnโ€™t done them much good so far.
Code that believes thereโ€™s some way to pretend textfile encodings donโ€™t exist is broken and dangerous. Might as well poke the other eye out, too.
Code that converts unknown characters to ? is broken, stupid, braindead, and runs contrary to the standard recommendation, which says NOT TO DO THAT! RTFM for why not.
Code that believes it can reliably guess the encoding of an unmarked textfile is guilty of a fatal mélange of hubris and naïveté that only a lightning bolt from Zeus will fix.
Code that believes you can use 🐪 printf widths to pad and justify Unicode data is broken and wrong.
Code that believes once you successfully create a file by a given name, that when you run ls or readdir on its enclosing directory, youโ€™ll actually find that file with the name you created it under is buggy, broken, and wrong. Stop being surprised by this!
Code that believes UTF-16 is a fixed-width encoding is stupid, broken, and wrong. Revoke their programming licence.
Code that treats code points from one plane one whit differently than those from any other plane is ipso facto broken and wrong. Go back to school.
Code that believes that stuff like /s/i can only match "S" or "s" is broken and wrong. Youโ€™d be surprised.
Code that uses \PM\pM* to find grapheme clusters instead of using \X is broken and wrong.
People who want to go back to the ASCII world should be whole-heartedly encouraged to do so, and in honor of their glorious upgrade they should be provided gratis with a pre-electric manual typewriter for all their data-entry needs. Messages sent to them should be sent via an ᴀʟʟᴄᴀᴘs telegraph at 40 characters per line and hand-delivered by a courier. STOP.
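As a concrete illustration of the sigma point above, a minimal sketch assuming Perl 5.16+ for fc():

use v5.16;                 # fc() arrived in 5.16
use utf8;
use open qw(:std :encoding(UTF-8));

# Both sigmas uppercase to Σ, so lc(uc($s)) cannot return the ς you started with.
for my $s ("σ", "ς") {
    my $round = lc(uc($s));
    say "$s -> ", uc($s), " -> $round   roundtrip ok? ", ($round eq $s ? "yes" : "no");
}

# fc() (full casefolding) is what you actually want for caseless comparison:
say "fold-equal? ", (fc("σ") eq fc("ς") ? "yes" : "no");   # yes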
😱 𝕾 𝖀 𝕸 𝕸 𝕬 𝕽 𝖄 😱
I don’t know how much more “default Unicode in 🐪” you can get than what I’ve written. Well, yes I do: you should be using Unicode::Collate and Unicode::LineBreak, too. And probably more.
As you see, there are far too many Unicode things that you really do have to worry about for there to ever exist any such thing as “default to Unicode”.
What you’re going to discover, just as we did back in 🐪 5.8, is that it is simply impossible to impose all these things on code that hasn’t been designed right from the beginning to account for them. Your well-meaning selfishness just broke the entire world.
And even once you do, there are still critical issues that require a great deal of thought to get right. There is no switch you can flip. Nothing but brain, and I mean real brain, will suffice here. There’s a heck of a lot of stuff you have to learn. Modulo the retreat to the manual typewriter, you simply cannot hope to sneak by in ignorance. This is the 21ˢᵗ century, and you cannot wish Unicode away by willful ignorance.
You have to learn it. Period. It will never be so easy that “everything just works,” because that will guarantee that a lot of things don’t work -- which invalidates the assumption that there can ever be a way to “make it all work.”
You may be able to get a few reasonable defaults for a very few and very limited operations, but not without thinking about things a whole lot more than I think you have.
As just one example, canonical ordering is going to cause some real headaches. 😭 "\x{F5}" ‘õ’, "o\x{303}" ‘õ’, "o\x{303}\x{304}" ‘ȭ’, and "o\x{304}\x{303}" ‘ō̃’ should all match ‘õ’, but how in the world are you going to do that? This is harder than it looks, but it’s something you need to account for. 💣
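One hedged way to get that behaviour is a UCA comparison at primary strength via Unicode::Collate (the parameters below are illustrative, not the only way to do it):

use v5.14;
use utf8;
use Unicode::Collate;

# Level 1 (primary strength) ignores diacritic differences, so every
# variant below compares equal to 'õ' (and to a bare 'o', for that matter).
my $coll = Unicode::Collate->new(level => 1, normalization => 'NFD');

for my $s ("\x{F5}", "o\x{303}", "o\x{303}\x{304}", "o\x{304}\x{303}") {
    my $label = sprintf "U+%v04X", $s;          # show the code points
    say "$label  matches 'õ'? ", ($coll->eq($s, "õ") ? "yes" : "no");
}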
If there’s one thing I know about Perl, it is what its Unicode bits do and do not do, and this thing I promise you: “ᴛʜᴇʀᴇ ɪs ɴᴏ Uɴɪᴄᴏᴅᴇ ᴍᴀɢɪᴄ ʙᴜʟʟᴇᴛ” 😞
You cannot just change some defaults and get smooth sailing. It’s true that I run 🐪 with PERL_UNICODE set to "SA", but that’s all, and even that is mostly for command-line stuff. For real work, I go through all the many steps outlined above, and I do it very, very carefully.
😈 ¡ƨdləɥ ƨᴉɥʇ ədoɥ puɐ ʻλɐp əɔᴉu ɐ əʌɐɥ ʻʞɔnl poo⅁ 😈
There are two stages to processing Unicode text. The first is "how can I input it and output it without losing information". The second is "how do I treat text according to local language conventions".
tchrist's post covers both, but the second part is where 99% of the text in his post comes from. Most programs don't even handle I/O correctly, so it's important to understand that before you even begin to worry about normalization and collation.
This post aims to solve that first problem.
When you read data into Perl, it doesn't care what encoding it is. It allocates some memory and stashes the bytes away there. If you say print $str, it just blits those bytes out to your terminal, which is probably set to assume everything that is written to it is UTF-8, and your text shows up.
Marvelous.
Except, it's not. If you try to treat the data as text, you'll see that Something Bad is happening. You need go no further than length to see that what Perl thinks about your string and what you think about your string disagree. Write a one-liner like perl -E 'while(<>){ chomp; say length }', type in 文字化け, and you get 12... not the correct answer, 4.
That's because Perl assumes your string is not text. You have to tell it that it's text before it will give you the right answer.
That's easy enough; the Encode module has the functions to do that. The generic entry point is Encode::decode (or use Encode qw(decode), of course). That function takes some string from the outside world (what we'll call "octets", a fancy way of saying "8-bit bytes"), and turns it into some text that Perl will understand. The first argument is a character encoding name, like "UTF-8" or "ASCII" or "EUC-JP". The second argument is the string. The return value is the Perl scalar containing the text.
(There is also Encode::decode_utf8, which assumes UTF-8 for the encoding.)
If we rewrite our one-liner:
perl -MEncode=decode -E 'while(<>){ chomp; say length decode("UTF-8", $_) }'
We type in 文字化け and get "4" as the result. Success.
That, right there, is the solution to 99% of Unicode problems in Perl.
The key is, whenever any text comes into your program, you must decode it. The Internet cannot transmit characters. Files cannot store characters. There are no characters in your database. There are only octets, and you can't treat octets as characters in Perl. You must decode the encoded octets into Perl characters with the Encode module.
The other half of the problem is getting data out of your program. That's easy too; you just say use Encode qw(encode), decide what encoding your data should be in (UTF-8 for terminals that understand UTF-8, UTF-16 for files on Windows, etc.), and then output the result of encode($encoding, $data) instead of just outputting $data.
This operation converts Perl's characters, which is what your program operates on, to octets that can be used by the outside world. It would be a lot easier if we could just send characters over the Internet or to our terminals, but we can't: octets only. So we have to convert characters to octets, otherwise the results are undefined.
To summarize: encode all outputs and decode all inputs.
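A minimal sketch of that rule, as a filter that decodes everything coming in and encodes everything going out (the choice of UTF-8 on both sides is just an example):

#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

while (my $octets = <STDIN>) {               # raw octets from the outside world
    my $text = decode('UTF-8', $octets);     # decode input into Perl characters
    $text = uc $text;                        # string ops now work on characters
    print encode('UTF-8', $text);            # encode characters back to octets
}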
Now we'll talk about three issues that make this a little challenging. The first is libraries. Do they handle text correctly? The answer is... they try. If you download a web page, LWP will give you your result back as text. If you call the right method on the result, that is (and that happens to be decoded_content, not content, which is just the octet stream that it got from the server.) Database drivers can be flaky; if you use DBD::SQLite with just Perl, it will work out, but if some other tool has put text stored as some encoding other than UTF-8 in your database... well... it's not going to be handled correctly until you write code to handle it correctly.
Outputting data is usually easier, but if you see "wide character in print", then you know you're messing up the encoding somewhere. That warning means "hey, you're trying to leak Perl characters to the outside world and that doesn't make any sense". Your program appears to work (because the other end usually handles the raw Perl characters correctly), but it is very broken and could stop working at any moment. Fix it with an explicit Encode::encode!
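Two hedged ways to fix that warning (the sample string is just an illustration): either encode at each print, or put an encoding layer on the handle once so PerlIO encodes for you:

use strict;
use warnings;
use Encode qw(encode);

my $text = "caf\x{E9} \x{263A}";             # Perl characters, including one above U+00FF

# Option 1: encode explicitly at the point of output.
print encode('UTF-8', $text), "\n";

# Option 2: attach an encoding layer once; every later print on STDOUT is encoded for you.
binmode STDOUT, ':encoding(UTF-8)';
print $text, "\n";                           # no "wide character in print" warning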
The second problem is UTF-8 encoded source code. Unless you say use utf8 at the top of each file, Perl will not assume that your source code is UTF-8. This means that each time you say something like my $var = 'ほげ', you're injecting garbage into your program that will totally break everything horribly. You don't have to "use utf8", but if you don't, you must not use any non-ASCII characters in your program.
The third problem is how Perl handles The Past. A long time ago, there was no such thing as Unicode, and Perl assumed that everything was Latin-1 text or binary. So when data comes into your program and you start treating it as text, Perl treats each octet as a Latin-1 character. That's why, when we asked for the length of "文字化け", we got 12. Perl assumed that we were operating on the Latin-1 string "æå­åã" (which is 12 characters, some of which are non-printing).
This is called an "implicit upgrade", and it's a perfectly reasonable thing to do, but it's not what you want if your text is not Latin-1. That's why it's critical to explicitly decode input: if you don't do it, Perl will, and it might do it wrong.
People run into trouble where half their data is a proper character string, and some is still binary. Perl will interpret the part that's still binary as though it's Latin-1 text and then combine it with the correct character data. This will make it look like handling your characters correctly broke your program, but in reality, you just haven't fixed it enough.
Here's an example: you have a program that reads a UTF-8-encoded text file, you tack on a Unicode PILE OF POO to each line, and you print it out. You write it like:
while(<>){
    chomp;
    say "$_ 💩";
}
And then run on some UTF-8 encoded data, like:
perl poo.pl input-data.txt
It prints the UTF-8 data with a poo at the end of each line. Perfect, my program works!
But nope, you're just doing binary concatenation. You're reading octets from the file, removing a \n with chomp, and then tacking on the bytes in the UTF-8 representation of the PILE OF POO character. When you revise your program to decode the data from the file and encode the output, you'll notice that you get garbage ("ð©") instead of the poo. This will lead you to believe that decoding the input file is the wrong thing to do. It's not.
The problem is that the poo literal is being implicitly upgraded as if it were Latin-1. If you use utf8 so that the literal is real text instead of binary, then it will work again!
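Putting it together, a sketch of the repaired script (same assumptions as the example above: UTF-8 input files and a UTF-8 terminal):

#!/usr/bin/perl
use strict;
use warnings;
use utf8;                                  # the 💩 literal below is UTF-8 source text
use Encode qw(decode encode);

while (my $line = <>) {
    my $text = decode('UTF-8', $line);     # octets from the file -> Perl characters
    chomp $text;
    print encode('UTF-8', "$text 💩\n");   # Perl characters -> octets for the terminal
}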
(That's the number one problem I see when helping people with Unicode. They did part right and that broke their program. That's what's sad about undefined results: you can have a working program for a long time, but when you start to repair it, it breaks. Don't worry; if you are adding encode/decode statements to your program and it breaks, it just means you have more work to do. Next time, when you design with Unicode in mind from the beginning, it will be much easier!)
That's really all you need to know about Perl and Unicode. If you tell Perl what your data is, it has the best Unicode support among all popular programming languages. If you assume it will magically know what sort of text you are feeding it, though, then you're going to trash your data irrevocably. Just because your program works today on your UTF-8 terminal doesn't mean it will work tomorrow on a UTF-16 encoded file. So make it safe now, and save yourself the headache of trashing your users' data!
The easy part of handling Unicode is encoding output and decoding input. The hard part is finding all your input and output, and determining which encoding it is. But that's why you get the big bucks :)
We're all in agreement that it is a difficult problem for many reasons,
but that's precisely the reason to try to make it easier on everybody.
There is a recent module on CPAN, utf8::all, that attempts to "turn on Unicode. All of it".
As has been pointed out, you can't magically make the entire system (outside programs, external web requests, etc.) use Unicode as well, but we can work together to make sensible tools that make doing common problems easier. That's the reason that we're programmers.
If utf8::all doesn't do something you think it should, let's improve it to make it better. Or let's make additional tools that together can suit people's varying needs as well as possible.
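For reference, a minimal sketch of what using it looks like; utf8::all is a real CPAN module, though exactly which handles and variables it affects depends on the version installed:

use utf8::all;     # UTF-8 source, UTF-8 layers on the standard handles and new opens, decoded @ARGV

while (<>) {
    chomp;
    print "got ", length($_), " characters\n";
}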
I think you misunderstand Unicode and its relationship to Perl. No matter which way you store data, Unicode, ISO-8859-1, or many other things, your program has to know how to interpret the bytes it gets as input (decoding) and how to represent the information it wants to output (encoding). Get that interpretation wrong and you garble the data. There isn't some magic default setup inside your program that's going to tell the stuff outside your program how to act.
You think it's hard, most likely, because you are used to everything being ASCII. Everything you should have been thinking about was simply ignored by the programming language and all of the things it had to interact with. If everything used nothing but UTF-8 and you had no choice, then UTF-8 would be just as easy. But not everything does use UTF-8. For instance, you don't want your input handle to think that it's getting UTF-8 octets unless it actually is, and you don't want your output handles to be UTF-8 if the thing reading from them can't handle UTF-8. Perl has no way to know those things. That's why you are the programmer.
I don't think Unicode in Perl 5 is too complicated. I think it's scary and people avoid it. There's a difference. To that end, I've put Unicode in Learning Perl, 6th Edition, and there's a lot of Unicode stuff in Effective Perl Programming. You have to spend the time to learn and understand Unicode and how it works. You're not going to be able to use it effectively otherwise.
While reading this thread, I often get the impression that people are using "UTF-8" as a synonym for "Unicode". Please make a distinction between Unicode's "code points", which are an enlarged relative of the ASCII code, and Unicode's various "encodings". There are a few of them, of which UTF-8, UTF-16 and UTF-32 are the current ones, and a few more are obsolete.
Please: UTF-8 (as well as every other encoding) exists and has meaning in input or in output only. Internally, since Perl 5.8.1, all strings are kept as Unicode code points. True, you have to enable some features, as admirably covered previously.
There's a truly horrifying amount of ancient code out there in the wild, much of it in the form of common CPAN modules. I've found I have to be fairly careful enabling Unicode if I use external modules that might be affected by it, and am still trying to identify and fix some Unicode-related failures in several Perl scripts I use regularly (in particular, iTiVo fails badly on anything that's not 7-bit ASCII due to transcoding issues).
You should enable the unicode_strings feature; this is the default if you use v5.14;
You should not really use Unicode identifiers, especially for foreign code loaded via utf8, as they are insecure in perl5; only cperl got that right. See e.g. http://perl11.org/blog/unicode-identifiers.html
Regarding utf8 for your filehandles/streams: you need to decide for yourself the encoding of your external data. A library cannot know that, and since not even libc supports utf8, proper UTF-8 data is rare. What you see more often is WTF-8, the Windows aberration of UTF-8.
BTW: Moose is not really "Modern Perl"; they just hijacked the name. Moose is perfect Larry Wall-style postmodern Perl mixed with Bjarne Stroustrup-style anything-goes, with an eclectic aberration of proper perl6 syntax, e.g. using strings for variable names, horrible fields syntax, and a very immature, naive implementation which is 10x slower than a proper one.
cperl and perl6 are the true modern perls, where form follows function, and the implementation is reduced and optimized.

Checklist for going the Unicode way with Perl

I am helping a client convert their Perl flat-file bulletin board site from ISO-8859-1 to Unicode.
Since this is my first time, I would like to know if the following "checklist" is complete. Everything works well in testing, but I may be missing something which would only show up on rare occasions.
This is what I have done so far (forgive me for only including "summary" code examples):
Made sure files are always read and written in UTF-8:
use open ':utf8';
Made sure CGI input is received as UTF-8 (the site is not using CGI.pm):
s{%([a-fA-F0-9]{2})}{ pack ("C", hex ($1)) }eg; # Kept from existing code
s{%u([0-9A-F]{4})}{ pack ('U*', hex ($1)) }eg; # Added
utf8::decode $_;
Made sure text is printed as UTF-8:
binmode STDOUT, ':utf8';
Made sure browsers interpret my content as UTF-8:
Content-Type: text/html; charset=UTF-8
<meta http-equiv="content-type" content="text/html;charset=UTF-8">
Made sure forms send UTF-8 (probably not necessary as long as page encoding is set):
accept-charset="UTF-8"
Don't think I need the following, since inline text (menus, headings, etc.) is only in ASCII:
use utf8;
Does this look reasonable, or am I missing something?
EDIT: I should probably also mention that we will be running a one-time batch to read all existing text data files and save them in UTF-8 encoding.
The :utf8 PerlIO layer is not strict enough. It permits input that fulfills the structural requirements of UTF-8 byte sequences, but for good security you want to reject anything that is not actually valid Unicode. Replace it everywhere with the PerlIO::encoding layer, thus: :encoding(UTF-8).
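Concretely, the difference looks like this (the filename is a placeholder):

# Lax: accepts structurally valid but bogus sequences (surrogates, too-high code points).
open my $lax,    '<:utf8',            'data.txt' or die $!;

# Strict: validates while decoding; this is what you want for untrusted input.
open my $strict, '<:encoding(UTF-8)', 'data.txt' or die $!;

# Same idea for an already-open handle:
binmode STDOUT, ':encoding(UTF-8)';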
For the same reason, always Encode::decode('UTF-8', โ€ฆ), not Encode::decode_utf8(โ€ฆ).
Make decoding fail hard with an exception, compare:
perl -E'use Encode qw(decode); say decode(q(UTF-8), qq(\x{c0})); say q(survived)'
perl -E'use Encode qw(decode); say decode(q(UTF-8), qq(\x{c0}), Encode::FB_CROAK); say q(survived)'
You are not taking care of surrogate pairs in the %u notation. This is the only major bug I can see in your list. Step 2 is written correctly as:
use Encode qw(decode);
use URI::Escape::XS qw(decodeURIComponent);
$_ = decode('UTF-8', decodeURIComponent($_), Encode::FB_CROAK);
Do not mess around with the functions from the utf8 module. Its documentation says so. It's intended as a pragma to tell Perl that the source code is in UTF-8. If you want to do encoding/decoding, use the Encode module.
Add the utf8 pragma anyway in every module. It cannot hurt, and it future-proofs code maintenance in case someone later adds non-ASCII string literals. See also CodeLayout::RequireUseUTF8.
Employ encoding::warnings to smoke out remaining implicit upgrades. Verify for each case whether it is intended/needed. If yes, convert it to an explicit upgrade with Unicode::Semantics. If not, that is a hint that you should have had a decoding step earlier. The documents from http://p3rl.org/UNI give the advice to decode immediately after receiving the data from the source. Go over the places where the code is reading/writing data and verify you have a decoding/encoding step, either explicitly (decode('UTF-8', …)) or implicitly through a layer (use open pragma, binmode, 3-argument form of open).
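A sketch of the layer-based variant of that advice, so the standard handles and every plain open use strict UTF-8 without touching each call site (the filename is a placeholder):

use strict;
use warnings;
use open qw(:std :encoding(UTF-8));   # default layers for the std handles and subsequent opens

open my $fh, '<', 'posts.txt' or die "open: $!";   # already reading characters, not octets
while (my $line = <$fh>) {
    chomp $line;
    print length($line), " characters\n";          # character count, not byte count
}
close $fh;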
For debugging: if you are not sure which string is in a variable in which representation at a certain time, don't just print; use the tools Devel::StringInfo and Devel::Peek instead.
You're always missing something. The problem is usually the unknown unknowns, though. :)
Effective Perl Programming has a Unicode chapter that covers many of the Perl basics. The one Item we didn't cover though, was everything you have to do to ensure your database server and web server do the right thing.
Some other things you'll need to do:
Upgrade to the most recent Perl you can. Unicode stuff got a lot easier in 5.8, and even easier in 5.10.
Ensure that site content is converted to UTF-8. You might write a crawler to hit pages and look for the Unicode replacement character (that thing that looks like a diamond with a question mark in it). Let's see if I can make it in StackOverflow: �
Ensure that your database server supports UTF-8, that you've set up the tables with UTF-8 aware columns, and that you tell DBI to use the UTF-8 support in its driver (some of this is in the book).
Ensure that anything looking at @ARGV translates the items from the locale of the command line to UTF-8 (it's in the book; a sketch follows this list).
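A hedged sketch of that @ARGV point; the locale lookup via I18N::Langinfo is one common approach and may not apply on every platform:

use strict;
use warnings;
use Encode qw(decode);
use I18N::Langinfo qw(langinfo CODESET);

my $codeset = langinfo(CODESET());            # e.g. "UTF-8" under most modern locales
@ARGV = map { decode($codeset, $_) } @ARGV;   # command-line octets -> Perl characters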
If you find anything else, please let us know by answering your own question with whatever we left out. ;)

Perl strings internals

How are Perl strings represented internally? What encoding is used? How do I handle different encodings properly?
I've been using Perl for quite a long time, but my work didn't involve much string handling in different encodings, and when I encountered a minor problem that had something to do with encodings I usually resorted to some shamanic actions.
Until this moment I thought of Perl strings as sequences of bytes, which fit pretty well for my tasks. Now I need to do some processing of a UTF-8 encoded file and here the trouble starts.
First, I read file into string like this:
open(my $in, '<', $ARGV[0]) or die "cannot open file $ARGV[0] for reading: $!";
binmode($in, ':utf8');
my $contents;
{
local $/;
$contents = <$in>;
}
close($in);
then simply print it:
print $contents;
And I get two things: a warning Wide character in print at <scriptname> line <n> and garbage in the console. So I can conclude that Perl strings have a concept of a "character" that can be "wide" or not, but when printed these "wide" characters are represented in the console as multiple bytes, not as a single "character".
(I now wonder why all my previous experience with binary files worked the way I expected, without any "character" issues.)
Why, then, do I see garbage in the console? If Perl stores strings as characters in some known encoding, I don't think it would be a big problem to find out the console encoding and print the text properly. (I use Windows, BTW.)
If Perl stores strings as variable-width character sequences (e.g. using the same UTF-8 encoding), why is it done this way? From my C experience, handling such strings is a PAIN.
Update.
I use two computers for testing, one runs Windows 7 x64 with English language pack installed, but with Russian regional settings (so I have cp866 as OEM codepage and cp1251 as ANSI) with ActivePerl 5.10.1 x64; another runs Windows XP 32 bit Russian localization with Cygwin Perl 5.10.0.
Thanks to the links, I now have a much more solid understanding of what's going on and how things should be done.
Setting the :utf8 layer before reading from the file is good; it automagically decodes the bytes into the internal encoding. (Which also happens to be UTF-8, but you don't need to know that and shouldn't rely on it.)
Before printing you need to encode the characters back to bytes.
use Encode;
utf8::encode($contents);
There is also a two-argument form of encode, for encodings other than UTF-8. (That sentence echoes too much, doesn't it?)
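For example, with the Russian console from the question's update, a sketch of the two-argument form might look like this (cp866 matches the OEM codepage mentioned there; adjust for your console):

use Encode qw(encode);

my $contents = "\x{442}\x{435}\x{441}\x{442}\n";   # "тест" as Perl characters
print encode('cp866', $contents);                  # octets a cp866 console can display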
Here is a good reference. (Would have been more, but it's my first post.) Check out perlunitut too, and the unicode article on Joel on Software.
http://www.ahinea.com/en/tech/perl-unicode-struggle.html
Oh, and it must use multi-byte strings, because otherwise it's just not unicode.
Perl strings are stored internally in one of two encodings, either an 8-bit byte-oriented native encoding, or UTF-8. For backwards compatibility the assumption is that all I/O and strings are in the native encoding, unless otherwise specified. The native encoding is usually 8-bit ASCII, but this can be changed with use locale.
In your sample you call binmode on your input handle changing it to use :utf8 semantics. One effect of this is that all strings read from this handle will be encoded as UTF-8. print writes to STDOUT by default, and STDOUT defaults to expecting native encoded characters.
Perl, in an attempt to do the right thing, will allow a UTF-8 string to be sent to a native-encoded output, but if there is no encoding attached to that handle then it has to guess how to output multi-byte characters, and it will almost certainly guess wrong. That is what the warning means: a multi-byte character was sent to a stream that only expects single-byte characters, and the result was that the character was probably damaged in translation.
Depending on what you want to accomplish you can use the Encode module mentioned by dylan to convert the UTF-8 data to a single byte character set that can be printed safely or if you know that whatever is attached to STDOUT can handle UTF-8 you can use binmode(STDOUT, ':utf8'); to tell Perl you want any data sent to STDOUT to be sent as UTF-8.
You should mention your actual Windows and Perl versions, as this really depends on the versions you use and the installed language packs.
Otherwise, have a look at the perlunicode manual first:
Perl uses logically-wide characters to represent strings internally.
It will confirm your statements.
Windows does not fully install all UTF-8 characters by default, so this might be the reason for your issue. You may need to install an additional language package.

What problems should I expect when moving legacy Perl code to UTF-8?

Until now, the project I work on used only ASCII in the source code. Due to several upcoming changes in the I18N area, and also because we need some Unicode strings in our tests, we are thinking about biting the bullet and moving the source code to UTF-8, while using the utf8 pragma (use utf8;).
Since the code is in ASCII now, I don't expect any trouble with the code itself. However, I'm not quite aware of the side effects we might get, though I think it's quite probable that there will be some, considering our environment (perl5.8.8, Apache2, mod_perl, MSSQL Server with FreeTDS driver).
If you have done such migrations in the past: what problems can I expect? How can I manage them?
The utf8 pragma merely tells Perl that your source code is UTF-8 encoded. If you have only used ASCII in your source, you won't have any problems with Perl understanding the source code. You might want to make a branch in your source control just to be safe. :)
If you need to deal with UTF-8 data from files, or write UTF-8 to files, you'll need to set the encodings on your filehandles and encode your data as external bits expect it. See, for instance, With a utf8-encoded Perl script, can it open a filename encoded as GB2312?.
Check out the Perl documentation that tells you about Unicode:
perlunicode
perlunifaq
perlunitut
Also see Juerd's Perl Unicode Advice.
A few years ago I moved our in-house mod_perl platform (~35k LOC) to UTF-8. Here are the things which we had to consider/change:
despite the perl doc advice of 'only where necessary', go for using 'use utf8;' in every source file - it gives you consistency.
convert your database to UTF-8 and ensure your DB config sets the connection charset to UTF-8 (in MySQL, watch out for field length issues with VARCHARs when doing this)
use a recent version of DBI - older versions don't correctly set the utf8 flag on returned scalars
use the Encode module, avoid using perl's built in utf8 functions unless you know exactly what data you're dealing with
when reading UTF-8 files, specify the layer - open($fh,"<:utf8",$filename)
on a RedHat-style OS (even 2008 releases) the included libraries won't like reading XML files stored in utf8 scalars - upgrade perl or just use the :raw layer
in older perls (even 5.8.x versions) some older string functions can be unpredictable - eg. $b=substr(lc($utf8string),0,2048) fails randomly but $a=lc($utf8string);$b=substr($a,0,2048) works!
remember to convert your input - eg. in a web app, incoming form data may need decoding (see the sketch after this list)
ensure all dev staff know which way around the terms encode/decode are - a 'utf8 string' in perl is in /de/-coded form, a raw byte string containing utf8 data is /en/-coded
handle your URLs properly - /en/-code a utf8 string into bytes and then do the %xx encoding to produce the ASCII form of the URL, and /de/-code it when reading it from mod_perl (eg. $uri=utf_decode($r->uri()))
one more for web apps, remember the charset in the HTTP header overrides the charset specified with <meta>
I'm sure this one goes without saying - if you do any byte operations (eg. packet data, bitwise operations, even a MIME Content-Length header) make sure you're calculating with bytes and not chars
make sure your developers know how to ensure their text editors are set to UTF-8 even if there's no BOM on a given file
remember to ensure your revision control system (for google's benefit - subversion/svn) will correctly handle the files
where possible, stick to ASCII for filenames and variable names - this avoids portability issues when moving code around or using different dev tools
One more - this is the golden rule - don't just hack till it works, make sure you fully understand what's happening in a given en/decoding situation!
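A hedged sketch pulling together the form-data and URL items from the list above (the module choices, CGI and URI::Escape, are illustrative; a mod_perl handler would use the Apache request object instead):

use strict;
use warnings;
use utf8;
use CGI;
use Encode qw(decode encode);
use URI::Escape qw(uri_escape uri_unescape);

# Incoming form data: octets from the browser -> Perl characters.
my $q    = CGI->new;
my $name = decode('UTF-8', scalar($q->param('name')) // '');

# Building a URL: characters -> UTF-8 octets -> %xx-escaped ASCII.
my $query = uri_escape(encode('UTF-8', $name));
my $url   = "/search?name=$query";

# Reading it back: %xx-unescape to octets, then decode to characters.
my $round_trip = decode('UTF-8', uri_unescape($query));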
I'm sure you already had most of these sorted out but hopefully all that helps someone out there avoid the many hours debugging which we went through.