How does the open pragma differ between the different utf8 layers? - perl

Do these three versions all behave differently?
use open qw( :encoding(UTF-8) :std );
use open qw( :encoding(UTF8) :std );
use open qw( :utf8 :std );

Firstly, :utf8 only marks the text as UTF-8; it does not check that it is valid. See this post on PerlMonks for information.
:encoding is an extension layer to PerlIO. From perldoc perliol:
":encoding"
use Encoding;
makes this layer available, although PerlIO.pm "knows" where to find it. It is an example of a layer which takes an argument as it is called thus: open( $fh, "<:encoding(iso-8859-7)", $pathname );
The other two questions are answered in the FAQ perldoc perlunifaq
What is the difference between ":encoding" and ":utf8"? Because UTF-8 is one of Perl's internal formats, you can often just skip the encoding or decoding step, and manipulate the UTF8 flag directly. Instead of ":encoding(UTF-8)", you can simply use ":utf8", which skips the encoding step if the data was already represented as UTF8 internally. This is widely accepted as good behavior when you're writing, but it can be dangerous when reading, because it causes internal inconsistency when you have invalid byte sequences. Using ":utf8" for input can sometimes result in security breaches, so please use ":encoding(UTF-8)" instead. Instead of "decode" and "encode", you could use "_utf8_on" and "_utf8_off", but this is considered bad style. Especially "_utf8_on" can be dangerous, for the same reason that ":utf8" can. There are some shortcuts for oneliners; see "-C" in perlrun.
What's the difference between "UTF-8" and "utf8"? "UTF-8" is the official standard. "utf8" is Perl's way of being liberal in what it accepts. If you have to communicate with things that aren't so liberal, you may want to consider using "UTF-8". If you have to communicate with things that are too liberal, you may have to use "utf8". The full explanation is in Encode. "UTF-8" is internally known as "utf-8-strict". The tutorial uses UTF-8 consistently, even where utf8 is actually used internally, because the distinction can be hard to make, and is mostly irrelevant. For example, utf8 can be used for code points that don't exist in Unicode, like 9999999, but if you encode that to UTF-8, you get a substitution character (by default; see "Handling Malformed Data" in Encode for more ways of dealing with this.) Okay, if you insist: the "internal format" is utf8, not UTF-8. (When it's not some other encoding.)
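To make the strict/liberal distinction from the FAQ concrete, here is a minimal sketch using only the core Encode module. The sample string "ok\xFF" is an illustrative assumption, chosen because the byte 0xFF can never occur in well-formed UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# The two spellings resolve to different Encode objects.
my $strict_name = find_encoding('UTF-8')->name;   # the strict, standard encoding
my $lax_name    = find_encoding('utf8')->name;    # Perl's liberal variant

# Illustrative input: 0xFF is never valid in UTF-8.
my $bad = "ok\xFF";

# Default behaviour: the malformed byte is replaced with U+FFFD.
my $replaced = decode('UTF-8', $bad);

# With FB_CROAK, decode dies instead. It may modify its input, so pass a copy.
my $died = !eval { decode('UTF-8', my $copy = $bad, Encode::FB_CROAK()); 1 };

printf "names: %s / %s\n", $strict_name, $lax_name;
printf "replacement inserted: %s\n", ($replaced eq "ok\x{FFFD}" ? 'yes' : 'no');
printf "strict decode died: %s\n",   ($died ? 'yes' : 'no');
```

The :utf8 layer performs none of these checks on input, which is why the FAQ recommends :encoding(UTF-8) for reading.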
The open pragma (i.e., use open) only sets the default PerlIO layers for input and output; :std does the following:
The ":std" subpragma on its own has no effect, but if combined with the ":utf8" or ":encoding" subpragmas, it converts the standard filehandles (STDIN, STDOUT, STDERR) to comply with encoding selected for input/output handles. For example, if both input and out are chosen to be ":encoding(utf8)", a ":std" will mean that STDIN, STDOUT, and STDERR are also in ":encoding(utf8)". On the other hand, if only output is chosen to be in ":encoding(koi8r)", a ":std" will cause only the STDOUT and STDERR to be in "koi8r". The ":locale" subpragma implicitly turns on ":std".
So :std is a subpragma (specific to open.pm) that applies the chosen layer, such as :utf8 above, to the standard streams STDIN, STDOUT, and STDERR.
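One way to see what the pragma and :std actually do is to list the layers on a handle. The following is a sketch, assuming a scratch file created with File::Temp purely for illustration; the exact layer names printed can vary between Perl builds.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Default layers for handles opened in this lexical scope, plus the std streams.
use open qw( :encoding(UTF-8) :std );

# Scratch file for the demonstration; File::Temp removes it at exit.
my (undef, $path) = tempfile(UNLINK => 1);

# No layer is named here, so the default set by the pragma applies.
open(my $in, '<', $path) or die "Cannot open $path: $!";

my @file_layers   = PerlIO::get_layers($in);
my @stdout_layers = PerlIO::get_layers(*STDOUT);
close $in;

print "file:   @file_layers\n";
print "STDOUT: @stdout_layers\n";
```

Without :std, only the handle opened inside the pragma's scope would be expected to carry the encoding layer.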

Evan seems to have your answer. For future ease of use see utf8::all, "turn on Unicode - all of it".

Related

Why does PerlIO::encoding insert an additional utf8 layer?

The documentation for PerlIO says:
:encoding Use :encoding(ENCODING) either in open() or binmode() to
install a layer that transparently does character set and encoding
transformations, for example from Shift-JIS to Unicode. Note that
under stdio an :encoding also enables :utf8 . See PerlIO::encoding for
more information.
Here is a test script:
use feature qw(say);
use strict;
use warnings;
my $fn = 'test.txt';
for my $mode ('>', '>:encoding(utf8)' ) {
open( my $fh, $mode, $fn);
say join ' ', (PerlIO::get_layers($fh));
close $fh;
}
Output is:
unix perlio
unix perlio encoding(utf8) utf8
Why do I get the additional utf8 layer here?
For reasons that require knowledge of Perl internals.
When you store the number 4 in a scalar, it could be stored as a signed integer, an unsigned integer or a floating point number. You don't know which is used, and you don't have any reason to care which one is used. Perl will automatically convert as needed.
It's the same situation for strings. There are two storage formats for them. Your name is the perfect example. "Håkon Hægland" can be stored as
48.E5.6B.6F.6E.20.48.E6.67.6C.61.6E.64
or as
48.C3.A5.6B.6F.6E.20.48.C3.A6.67.6C.61.6E.64
A flag called UTF8 indicates the choice of storage format. This is transparent to the user (or at least should be).
$ perl -Mutf8 -E'
$_ = "Håkon Hægland";
utf8::downgrade( $d = $_ ); # Converts to the first format mentioned above.
utf8::upgrade( $u = $_ ); # Converts to the second format mentioned above.
say $d eq $u ? "eq" : "ne";
'
eq
While it's transparent to you, it's far from transparent to Perl itself. Whenever you manipulate a string, Perl has to check in which storage format it's stored. For example, if you concatenate two strings, Perl has to make sure they use the same storage format before performing the concatenation, converting one if necessary.
It's also not transparent to PerlIO. PerlIO, like the rest of Perl, has to deal with the bytes in the string buffer rather than what you see at the Perl level. Sometimes, those bytes are destined to be the string buffer of scalars with the UTF8 flag cleared, and sometimes, those bytes are destined to be the string buffer of scalars with the UTF8 flag set. PerlIO needs to track that. Rather than carrying a flag along from layer to layer, PerlIO adds a :utf8 layer when the scalars obtained by reading from the handle need to have the UTF8 flag set.
So, :encoding converts the bytes that form
Håkon Hægland
from the specified encoding to
48.C3.A5.6B.6F.6E.20.48.C3.A6.67.6C.61.6E.64
And :utf8 causes the scalar to have the UTF8 flag set, causing the resulting scalar to contain
U+0048.00E5.006B.006F.006E.0020.0048.00E6.0067.006C.0061.006E.0064
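The two storage formats can be inspected directly. This sketch writes the name with \x escapes so that it does not depend on the encoding of the source file, and uses bytes::length only to peek at the size of the internal buffer.

```perl
use strict;
use warnings;
use bytes ();   # only for bytes::length; byte semantics are not enabled

# "Håkon Hægland", written with escapes.
my $name = "H\x{E5}kon H\x{E6}gland";

my $d = $name; utf8::downgrade($d);   # first format, UTF8 flag off
my $u = $name; utf8::upgrade($u);     # second format, UTF8 flag on

printf "flag:  d=%d u=%d\n", (utf8::is_utf8($d) ? 1 : 0), (utf8::is_utf8($u) ? 1 : 0);
printf "chars: d=%d u=%d\n", length($d), length($u);
printf "bytes: d=%d u=%d\n", bytes::length($d), bytes::length($u);
printf "equal: %s\n", ($d eq $u ? 'eq' : 'ne');
```

Both scalars hold the same 13 characters and compare equal; only the internal buffer differs in size (13 versus 15 bytes, matching the two hex dumps above).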

Is it correct to switch Perl's default IO to UTF-8 while using Plack and middlewares?

Two starting points:
In his answer to Why does modern Perl avoid UTF-8 by default? tchrist pointed out 52 things needed to ensure correct Unicode handling in Perl. The answer shows the boilerplate code with some use statements. A similar question about the use of Unicode is How to make "use My::defaults" with modern perl & utf8 defaults?
The PSGI spec is by design byte oriented. It is my responsibility to encode/decode everything, so for the Plack apps the correct way is to encode output and decode input, e.g.:
use Encode;
my $app = sub {
# Encode the character string to UTF-8 bytes before handing it to PSGI.
my $output = encode_utf8( myapp() );
return [ 200, [ 'Content-Type' => 'text/plain' ], [ $output ] ];
};
Is it correct to use
use uni::perl; # or any similar
in the PSGI application and/or in my modules?
uni::perl changes Perl's default IO to UTF-8, thus:
use open qw(:std :utf8);
binmode(STDIN, ":utf8");
binmode(STDOUT, ":utf8");
binmode(STDERR, ":utf8");
Will doing so break something in Plack or its middlewares? Or is the only correct way to write apps for Plack explicitly encoding/decoding at open, so without the open pragma?
You really don't want to set STDIN/STDOUT to be UTF-8 mode by default on Plack, because you don't know for instance whether they will be binary data transports. E.g. if those filehandles are the FastCGI protocol connector they will be carrying encoded binary structures and not UTF-8 text. They therefore must not have an encoding layer defined, or those binary structures will be mangled or rejected as invalid.
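A sketch of the alternative: leave STDIN/STDOUT untouched and encode explicitly at the application boundary. myapp() is a hypothetical stand-in for the application logic, and the handler is called directly with an empty environment instead of through a real PSGI server.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical application logic; returns a character string.
sub myapp { return "b\x{FD}t\n" }

my $app = sub {
    my ($env) = @_;

    # Characters become bytes here, and nowhere else.
    my $body = encode_utf8( myapp() );

    return [
        200,
        [
            'Content-Type'   => 'text/plain; charset=utf-8',
            'Content-Length' => length($body),   # byte count, since $body is bytes
        ],
        [ $body ],
    ];
};

# Invoke the handler directly with an empty environment.
my $res = $app->({});
printf "status %d, %d body bytes\n", $res->[0], length($res->[2][0]);
```

Because the body is already a byte string, the server's transport (FastCGI or otherwise) needs no encoding layer of its own.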
On modern GNU/Linux systems you should completely switch to UTF-8 globally. This means setting
LANG="xx_YY.UTF-8"
PERL_UNICODE=SDAL
PERL5OPT=-Mutf8
in your /etc/environment or /etc/sysconfig/i18n or /etc/default/locale or whatever your system configuration file is. Because of a RHEL/CentOS bug I symlinked /etc/environment to sysconfig/i18n.
Scripts that rely on binary input should set binmode on STDIN/OUT/ERR(?), use the open pragma, or be called with the -C0 option.
The problem is that some DBD drivers are buggy, e.g. DBD::JDBC, and you must set the utf8 flag by hand.
use Encode qw/_utf8_on/;
# Sets the UTF8 flag on each string in place, without validating the bytes.
_utf8_on($_) for @strings;

Unicode string mess in perl

I have an external module that returns some strings. I am not sure exactly how the strings are returned, and I don't really know how Unicode strings work or why.
The module should return, for example, the Czech word "být", meaning "to be". (If you cannot see the second letter - it should look like this.) If I display the string, returned by the module, with Data Dumper, I see it as b\x{fd}t.
However, if I try to print it with print $s, I get a "Wide character in print" warning, and ? instead of ý.
If I try Encode::decode(whatever, $s);, the resulting string cannot be printed anyway (always with the "Wide character" warning, sometimes with mangled characters, sometimes right), no matter what I put in whatever.
If I try Encode::encode("utf-8", $s);, the resulting string CAN be printed without the problems or error message.
If I use use encoding 'utf8';, printing works without any need of encoding/decoding. However, if I use IO::CaptureOutput or Capture::Tiny module, it starts shouting "Wide character" again.
I have a few questions, mostly about what exactly happens. (I tried to read the perldocs, but they did not make me much wiser.)
1. Why can't I print the string right after getting it from the module?
2. Why can't I print the string decoded by "decode"? What exactly did "decode" do?
3. What exactly did "encode" do, and why was there no problem printing the string after encoding?
4. What exactly does use encoding do? Why is the default encoding different from utf-8?
5. What do I have to do if I want to print the scalars without any problems, even when I want to use one of the capturing modules?
edit: Some people tell me to use -C or binmode or PERL_UNICODE. That is great advice. However, somehow, both the capturing modules magically destroy the UTF8-ness of STDOUT. That seems to be more a bug of the modules, but I am not really sure.
edit2: OK, the best solution was to dump the modules and write the "capturing" myself (with much less flexibility).
1. Because you output a string in perl's internal form (utf8) to a non-unicode filehandle.
2. The decode function decodes a sequence of bytes assumed to be in ENCODING into Perl's internal form (utf8). Your input seems to be already decoded.
3. The encode() function encodes a string from Perl's internal form into ENCODING.
4. The encoding pragma allows you to write your script in any encoding you like. String literals are automatically converted to perl's internal form.
5. Make sure perl knows which encoding your data comes in as and goes out as.
See also perluniintro, perlunicode, Encode module, binmode() function.
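As a small illustration of what decode and encode do, the sketch below round-trips the word from the question. It assumes the bytes are UTF-8, which the question does not actually establish for the module's output.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "b\xC3\xBDt";               # "být" as four UTF-8 bytes
my $chars = decode('UTF-8', $bytes);    # bytes -> three characters
my $again = encode('UTF-8', $chars);    # characters -> bytes again

printf "bytes in: %d, characters: %d, bytes out: %d\n",
    length($bytes), length($chars), length($again);
```

Decoding a string that already holds characters, as in the question, is what produces the mangled results: decode expects bytes, and encode expects characters.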
I recommend reading the Unicode chapter of my book Effective Perl Programming. We put together all the docs we could find and explained Unicode in Perl much more coherently than I've seen anywhere else.
This program works fine for me:
#!perl
use utf8;
use 5.010;
binmode STDOUT, ':utf8';
my $string = return_string();
say $string;
sub return_string { 'být' }
Additionally, Capture::Tiny works just fine for me:
#!perl
use utf8;
use 5.010;
use Capture::Tiny qw(capture);
binmode STDOUT, ':utf8';
my( $stdout, $stderr ) = capture {
system( $^X, '/Users/brian/Desktop/czech.pl' );
};
say "STDOUT is [$stdout]";
IO::CaptureOutput seems to have some problems though:
#!perl
use utf8;
use 5.010;
use IO::CaptureOutput qw(capture);
binmode STDOUT, ':utf8';
capture {
system( $^X, '/Users/brian/Desktop/czech.pl' );
} \my $stdout, \my $stderr;
say "STDOUT is [$stdout]";
For this I get:
STDOUT is [být
]
However, that's easy to fix. Don't use that module. :)
You should also look at the PERL_UNICODE environment variable, which is the same as using the -C option. That allows you to set STDIN/STDOUT/STDERR (and @ARGV) to be UTF-8 without having to alter your scripts.

How do I find "wide characters" printed by perl?

A perl script that scrapes static html pages from a website and writes them to individual files appears to work, but also prints many instances of wide character in print at ./script.pl line n to console: one for each page scraped.
However, a brief glance at the html files generated does not reveal any obvious mistakes in the scraping. How can I find/fix the problem character(s)? Should I even care about fixing it?
The relevant code:
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
...
foreach (@urls) {
$mech->get($_);
print FILE $mech->content; #MESSAGE REFERS TO THIS LINE
...
This is on OSX with Perl 5.8.8.
If you want to fix up the files after the fact, then you could pipe them through fix_latin which will make sure they're all UTF-8 (assuming the input is some mixture of ASCII, Latin-1, CP1252 or UTF-8 already).
For the future, you could use $mech->response->decoded_content which should give you UTF-8 regardless of what encoding the web server used. Then you would binmode(FILE, ':utf8') before writing to it, to ensure that Perl's internal string representation is converted to strict UTF-8 bytes on output.
I assume you're crawling images or something of that sort. Either way, you can get around the problem by adding binmode(FILE); or, if they are web pages in UTF-8, try binmode( FILE, ':utf8' ). See perldoc -f binmode, perldoc perlopentut, and perldoc PerlIO for more information.
The ":bytes", ":crlf", and ":utf8", and any other directives of the form ":...", are called I/O layers. The "open" pragma can be used to establish default I/O layers. See open.
To mark FILEHANDLE as UTF-8, use ":utf8" or ":encoding(utf8)". ":utf8" just marks the data as UTF-8 without further checking, while ":encoding(utf8)" checks the data for actually being valid UTF-8. More details can be found in PerlIO::encoding.
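The warning and its fix can be reproduced without a web scraper. This is a sketch that writes an illustrative string containing U+263A to two scratch files (created with File::Temp as an assumption, standing in for FILE) and collects warnings through a __WARN__ handler.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Illustrative text with one character above 0xFF.
my $text = "smile \x{263A}\n";

my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# 1. A handle with no encoding layer: perl warns about the wide character.
my ($raw, $raw_path) = tempfile(UNLINK => 1);
print {$raw} $text;
close $raw;
my $first_warning = $warnings[0] // '';

# 2. The same print after declaring the output encoding: no warning.
@warnings = ();
my ($enc, $enc_path) = tempfile(UNLINK => 1);
binmode($enc, ':encoding(UTF-8)');
print {$enc} $text;
close $enc;
my $encoded_warnings = scalar @warnings;

print "raw handle warned: ", ($first_warning =~ /Wide character/ ? 'yes' : 'no'), "\n";
print "warnings with layer: $encoded_warnings\n";
```

The warning is worth fixing rather than silencing: it means perl had to guess how to turn characters into bytes for that handle.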

Why do my Perl tests fail with use encoding 'utf8'?

I'm puzzled with this test script:
#!perl
use strict;
use warnings;
use encoding 'utf8';
use Test::More 'no_plan';
ok('áá' =~ m/á/, 'ok direct match');
my $re = qr{á};
ok('áá' =~ m/$re/, 'ok qr-based match');
like('áá', $re, 'like qr-based match');
All three tests fail, but I was expecting that use encoding 'utf8' would upgrade both the literal áá and the qr-based regexps to utf8 strings, so that the tests would pass.
If I remove the use encoding line the tests pass as expected, but I can't figure out why they would fail in utf8 mode.
I'm using perl 5.8.8 on Mac OS X (system version).
Do not use the encoding pragma. It’s broken. (Juerd Waalboer gave a great talk where he mentioned this at YAPC::EU 2k8.)
It does at least two things at once that do not belong together:
1. It specifies an encoding for your source file.
2. It specifies an encoding for your file input/output.
And to add injury to insult it also does #1 in a broken fashion: it reinterprets \xNN sequences as being undecoded octets as opposed to treating them like codepoints, and decodes them, preventing you from being able to express characters outside the encoding you specified and making your source code mean different things depending on the encoding. That’s just astonishingly wrong.
Write your source code in ASCII or UTF-8 only. In the latter case, the utf8 pragma is the correct thing to use. If you don’t want to use UTF-8, but you do want to include non-ASCII characters, escape or decode them explicitly.
And use I/O layers explicitly or set them using the open pragma to have I/O automatically transcoded properly.
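As a sketch of the "escape them explicitly" option, the three matches from the question can be written in pure ASCII source with \x{...} escapes, so the result does not depend on how the file is saved or on any source-encoding pragma. The variable names are illustrative.

```perl
use strict;
use warnings;

# "áá" and "á", written as code point escapes.
my $subject = "\x{E1}\x{E1}";
my $re      = qr/\x{E1}/;

my $direct = $subject =~ /\x{E1}/ ? 1 : 0;   # direct match
my $via_qr = $subject =~ $re      ? 1 : 0;   # qr-based match

# The match does not depend on the internal storage format either.
utf8::upgrade(my $upgraded = $subject);
my $upgraded_match = $upgraded =~ $re ? 1 : 0;

print "direct=$direct qr=$via_qr upgraded=$upgraded_match\n";
```

With UTF-8 source, the equivalent is to keep the literal characters and add use utf8 instead of the encoding pragma.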
It works fine on my computer (on perl 5.10). Maybe you should try replacing that use encoding 'utf8' with use utf8.
What version of perl are you using? I think older versions had bugs with UTF-8 in regexps.
The Test::More documentation contains a fix for this issue, which I just found today (and this entry shows higher in the googles).
utf8 / "Wide character in print"
If you use utf8 or other non-ASCII characters with Test::More you might get a "Wide character in print" warning. Using binmode STDOUT, ":utf8" will not fix it. Test::Builder (which powers Test::More) duplicates STDOUT and STDERR. So any changes to them, including changing their output disciplines, will not be seen by Test::More. The workaround is to change the filehandles used by Test::Builder directly.
my $builder = Test::More->builder;
binmode $builder->output, ":utf8";
binmode $builder->failure_output, ":utf8";
binmode $builder->todo_output, ":utf8";
I added this bit of boilerplate to my testing code and it works a charm.