Reliable Perl encoding with File::Slurp

I need to replace every occurrence of http:// with // in a file. The file may be (at least) in UTF-8, CP1251, or CP1255.
Does the following work?
use File::Slurp;
my $Text = read_file($File, binmode=>':raw');
$Text =~ s{http://}{//}gi;
write_file($File, {atomic=>1, binmode=>':raw'}, $Text);
It seems correct, but I need to be sure that the file will not be damaged whatever encoding it has. Please help me to be sure.

This answer won't make you sure, though I hope it can help.
I don't see any problem with your script (tested with utf8 and iso-8859-1 without problems), though there seems to be a discussion regarding the capacity of File::Slurp to correctly handle encodings: http://blogs.perl.org/users/leon_timmermans/2015/08/fileslurp-is-broken-and-wrong.html
In this answer on a similar subject, the author recommends File::Slurper as an alternative, due to better encoding handling: https://stackoverflow.com/a/206682/6193608

It's no longer recommended to use File::Slurp (see here).
I would recommend using Path::Tiny. It's easy to use, works with both files and directories, depends only on core modules, and has slurp/spew methods specifically for utf8 and raw, so you shouldn't have a problem with the encoding.
Usage:
use Path::Tiny;
my $Text = path($File)->slurp_raw;
$Text =~ s{http://}{//}gi;
path($File)->spew_raw($Text);
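Path::Tiny also provides an edit_raw method that wraps this read/modify/write cycle in one call. A minimal sketch, assuming a Path::Tiny version recent enough to have it (the callback works on the file contents in $_):
use Path::Tiny;
# Same substitution as above, done as a single read-modify-write.
path($File)->edit_raw( sub { s{http://}{//}gi } );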
Update: From documentation on spew:
Writes data to a file atomically. The file is written to a temporary file in the same directory, then renamed over the original. An optional hash reference may be used to pass options. The only option is binmode, which is passed to binmode() on the handle used for writing.
spew_raw is like spew with a binmode of :unix for a fast, unbuffered, raw write.
spew_utf8 is like spew with a binmode of :unix:encoding(UTF-8) (or PerlIO::utf8_strict). If Unicode::UTF8 0.58+ is installed, a raw spew will be done instead on the data encoded with Unicode::UTF8.


What is the Perl's IO::File equivalent to open($fh, ">:utf8",$path)?

It's possible to write a UTF-8 encoded file as follows:
open my $fh,">:utf8","/some/path" or die $!;
How do I get the same result with IO::File, preferably in 1 line?
I got this one, but does it do the same and can it be done in just 1 line?
my $fh_out = IO::File->new($target_file, 'w');
$fh_out->binmode(':utf8');
For reference, the script starts as follows:
use 5.020;
use strict;
use warnings;
use utf8;
# code here
Yes, you can do it in one line.
open accepts one, two or three parameters. With one parameter, it is just a front end for the built-in open function. With two or three parameters, the first parameter is a filename that may include whitespace or other special characters, and the second parameter is the open mode, optionally followed by a file permission value.
[...]
If IO::File::open is given a mode that includes the : character, it passes all the three arguments to the three-argument open operator.
So you just do this.
my $fh_out = IO::File->new('/some/path', '>:utf8');
It is the same as your first open line because it gets passed through.
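For instance, a minimal usage sketch (the path and the text written are placeholders):
use IO::File;
my $fh_out = IO::File->new('/some/path', '>:utf8')
    or die "Can't open /some/path: $!";
$fh_out->print("Hëllö, wörld!\n");  # characters are encoded to UTF-8 bytes on write
$fh_out->close;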
I would suggest trying out Path::Tiny. For example, to open and write out your file:
use Path::Tiny;
path('/some/path')->spew_utf8(@data);
From the docs, on spew, spew_raw, spew_utf8
Writes data to a file atomically. [ ... ]
spew_raw is like spew with a binmode of :unix for a fast, unbuffered, raw write.
spew_utf8 is like spew with a binmode of :unix:encoding(UTF-8) (or PerlIO::utf8_strict). If Unicode::UTF8 0.58+ is installed, a raw spew will be done instead on the data encoded with Unicode::UTF8.
The module integrates many tools for handling files and directories, paths and content. Often it's simple calls like the above, but there is also method chaining, a recursive directory iterator, hooks for callbacks, etc. There is error handling throughout, consistent and thoughtful treatment of edge cases, flock on input/output handles, its own tiny and useful class for exceptions ... see docs.
Edit:
You could also use File::Slurp, if its use weren't discouraged, e.g.:
use File::Slurp qw(write_file);
write_file( 'filename', {binmode => ':utf8'}, $buffer ) ;
The first argument to write_file is the filename. The next argument is an optional hash reference and it contains key/values that can modify the behavior of write_file. The rest of the argument list is the data to be written to the file.
Some good reasons not to use it:
Not reliable
Has some bugs
And, as @ThisSuitIsBlackNot said, File::Slurp is broken and wrong

UTF-8 in a Perl module name

How can I write a Perl module with UTF-8 in its name and filename? My current try yields "Can't locate Täst.pm in @INC", but the file does exist. I'm on Windows, and haven't tried this on Linux yet.
test.pl:
use strict;
use warnings;
use utf8;
use Täst;
Täst.pm:
package Täst;
use utf8;
Update: My current work-around is to use Tast (ASCII) and put package Täst (Unicode) in Tast.pm (ASCII). It's confusing, though.
Unfortunately, Perl, Windows, and Unicode filenames really don't go together at the moment. My advice is to save yourself a lot of hassle and stick with plain ASCII for your module names. This blog post mentions a few of the problems.
The use utf8 needs to appear before the package Täst, so that the latter can be correctly interpreted. On my Mac:
test.pl:
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Tëst;
# 'use utf8' only indicates the code's encoding, but we also want stdout to be utf8
use encoding "utf8";
Tëst::hëllö();
Tëst.pm:
use utf8;
package Tëst;
sub Tëst::hëllö() {
print "Hëllö, wörld!\n";
}
1;
Output:
Macintosh:Desktop sherm$ ./test.pl
Hëllö, wörld!
As I said though - I ran this on my Mac. As cjm said above, your mileage may vary on Windows.
Unicode support often fails at the boundaries. Package and subroutine names need to map cleanly onto filenames, which is problematic on some operating systems. Not only does the OS have to create the filename that you expect, but you also have to be able to find it later as the same name.
We talked a little about the filename issue in Effective Perl Programming, but I also summarized much more in How do I create then use long Windows paths from Perl?. Jeff Atwood mentions this as part of his post on his Filesystem Paths: How Long is Too Long?.
I wouldn't recommend this approach if this is software you plan to release, to be honest. Even if you get it working fine for you, it's likely to be somewhat fragile on machines where UTF-8 isn't configured quite right, and/or filenames may not contain UTF-8 characters, etc.

How can I convert CGI input to UTF-8 without Perl's Encode module?

Through this forum, I have learned that it is not a good idea to use the following for converting CGI input (from either an escape()d Ajax call or a normal HTML form post) to UTF-8:
read (STDIN, $_, $ENV{CONTENT_LENGTH});
s{%([a-fA-F0-9]{2})}{ pack ('C', hex ($1)) }eg;
utf8::decode $_;
A safer way (which for example does not allow bogus characters through) is to do the following:
use Encode qw (decode);
read (STDIN, $_, $ENV{CONTENT_LENGTH});
s{%([a-fA-F0-9]{2})}{ pack ('C', hex ($1)) }eg;
decode ('UTF-8', $_, Encode::FB_CROAK);
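Note that decode returns the decoded string rather than modifying its argument in place, so in practice you would keep the result. A minimal sketch of the same flow (variable names are illustrative):
use Encode qw(decode);
read(STDIN, my $raw, $ENV{CONTENT_LENGTH});
$raw =~ s{%([a-fA-F0-9]{2})}{ pack('C', hex($1)) }eg;
# FB_CROAK dies on malformed UTF-8 instead of letting bogus bytes through.
my $text = decode('UTF-8', $raw, Encode::FB_CROAK);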
I would, however, very much like to avoid using any modules (including XSLoader, Exporter, and whatever else they bring with them). The function is for a high-volume mod_perl driven website and I think both performance and maintainability will be better without modules (especially since the current code does not use any).
I guess one approach would be to examine the Encode module and strip out the functions and constants used for the “decode ('UTF-8', $_, Encode::FB_CROAK)” call. I am not sufficiently familiar with Unicode and Perl modules to do this. Maybe somebody else is capable of doing this or know a similar, safe “native” way of doing the UTF-8 conversion?
UPDATE:
I prefer keeping things non-modular, because then the only black-box is Perl's own compiler (unless of course you dig down into the module libs).
Sometimes you see large modules being replaced with a few specific lines of code. For example, instead of the CGI.pm module (which people are also in love with), one can use the following for parsing AJAX posts:
my %Input;
if ($ENV{CONTENT_LENGTH}) {
    read(STDIN, $_, $ENV{CONTENT_LENGTH});
    foreach (split(/&/)) {
        tr/+/ /;
        s/%([a-fA-F0-9]{2})/pack("C", hex($1))/eg;
        if (m{^(\w+)=\s*(.*?)\s*$}s) { $Input{$1} = $2; }
        else { die("bad input ($_)"); }
    }
}
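For example, a posted body of foo=bar&greeting=hello%20world yields %Input = (foo => 'bar', greeting => 'hello world').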
In a similar way, it would be great if one could extract or replicate Encode's UTF-8 decode function.
Don't pre-optimize. Do it the conventional way first, then profile and benchmark later to see where you need to optimize. People usually waste all their time somewhere else, so starting off blindfolded and handcuffed doesn't give you any benefit.
Don't be afraid of modules. The point of mod_perl is to load up everything as few times as possible so the startup time and module loading time are insignificant.
Don't use escape() to create your posted data. This isn't compatible with URL-encoding, it's a mutant JavaScript oddity which should normally never be used. One of the defects is that it will encode non-ASCII characters to non-standard %uNNNN sequences based on UTF-16 code units, instead of standard URL-encoded UTF-8. Your current code won't be able to handle that.
You should typically use encodeURIComponent() instead.
If you must URL-decode posted input yourself rather than using a form library (and this does mean you won't be able to handle multipart/form-data), you will need to convert + symbols to spaces before replacing %-sequences. This replacement is standard in form submissions (though not elsewhere in URL-encoded data).
To ensure input is valid UTF-8 if you really don't want to use a library, try this regex. It also excludes some control characters (you may want to tweak it to exclude more).
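The regex itself did not survive here; the following is a reconstruction based on a standard UTF-8 validation pattern (without the extra control-character exclusions mentioned above), not necessarily the author's exact regex. It assumes the raw bytes are in $raw:
# True only if $raw is a well-formed UTF-8 byte string.
my $is_valid_utf8 = $raw =~ m{^(?:
    [\x00-\x7F]                        # ASCII
  | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
  |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
  |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
  |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
  | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
  |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
)*\z}x;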

How do I find "wide characters" printed by perl?

A perl script that scrapes static html pages from a website and writes them to individual files appears to work, but also prints many instances of "Wide character in print at ./script.pl line n" to the console: one for each page scraped.
However, a brief glance at the html files generated does not reveal any obvious mistakes in the scraping. How can I find/fix the problem character(s)? Should I even care about fixing it?
The relevant code:
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
...
foreach (@urls) {
$mech->get($_);
print FILE $mech->content; #MESSAGE REFERS TO THIS LINE
...
This is on OSX with Perl 5.8.8.
If you want to fix up the files after the fact, then you could pipe them through fix_latin which will make sure they're all UTF-8 (assuming the input is some mixture of ASCII, Latin-1, CP1252 or UTF-8 already).
For the future, you could use $mech->response->decoded_content, which should give you UTF-8 regardless of what encoding the web server used. Then you would binmode(FILE, ':utf8') before writing to it, to ensure that Perl's internal string representation is converted to strict UTF-8 bytes on output.
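A minimal sketch of that combination ($url and $filename are placeholders):
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
open my $fh, '>:utf8', $filename or die "Can't open $filename: $!";
$mech->get($url);
# decoded_content returns Perl characters; the :utf8 layer encodes them on write.
print {$fh} $mech->response->decoded_content;
close $fh;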
I assume you're crawling images or something of that sort; in any case, you can get around the problem by adding binmode(FILE);, or if they are webpages and UTF-8, then try binmode(FILE, ':utf8'). See perldoc -f binmode, perldoc perlopentut, and perldoc PerlIO for more information.
The ":bytes", ":crlf", and ":utf8", and any other directives of the form ":...", are called I/O layers. The "open" pragma can be used to establish default I/O layers. See open.
To mark FILEHANDLE as UTF-8, use ":utf8" or ":encoding(utf8)". ":utf8" just marks the data as UTF-8 without further checking, while ":encoding(utf8)" checks the data for actually being valid UTF-8. More details can be found in PerlIO::encoding.

Why do my Perl tests fail with use encoding 'utf8'?

I'm puzzled with this test script:
#!perl
use strict;
use warnings;
use encoding 'utf8';
use Test::More 'no_plan';
ok('áá' =~ m/á/, 'ok direct match');
my $re = qr{á};
ok('áá' =~ m/$re/, 'ok qr-based match');
like('áá', $re, 'like qr-based match');
The three tests fail, but I was expecting that the use encoding 'utf8' would upgrade both the literal áá and the qr-based regexps to utf8 strings, and thus the tests would pass.
If I remove the use encoding line the tests pass as expected, but I can't figure it out why would they fail in utf8 mode.
I'm using perl 5.8.8 on Mac OS X (system version).
Do not use the encoding pragma. It’s broken. (Juerd Waalboer gave a great talk where he mentioned this at YAPC::EU 2k8.)
It does at least two things at once that do not belong together:
It specifies an encoding for your source file.
It specifies an encoding for your file input/output.
And to add insult to injury, it also does #1 in a broken fashion: it reinterprets \xNN sequences as being undecoded octets as opposed to treating them like codepoints, and decodes them, preventing you from being able to express characters outside the encoding you specified and making your source code mean different things depending on the encoding. That’s just astonishingly wrong.
Write your source code in ASCII or UTF-8 only. In the latter case, the utf8 pragma is the correct thing to use. If you don’t want to use UTF-8, but you do want to include non-ASCII characters, escape or decode them explicitly.
And use I/O layers explicitly or set them using the open pragma to have I/O automatically transcoded properly.
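A minimal sketch of that combination (assuming a UTF-8 source file and UTF-8 I/O throughout):
use utf8;                            # the source code itself is UTF-8
use open qw(:std :encoding(UTF-8));  # default encoding layer for file I/O, plus STDIN/STDOUT/STDERR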
It works fine on my computer (on perl 5.10). Maybe you should try replacing that use encoding 'utf8' with use utf8.
What version of perl are you using? I think older versions had bugs with UTF-8 in regexps.
The Test::More documentation contains a fix for this issue, which I just found today (and this entry shows higher in the googles).
utf8 / "Wide character in print"
If you use utf8 or other non-ASCII characters with Test::More you might get a "Wide character in print" warning. Using binmode STDOUT, ":utf8" will not fix it. Test::Builder (which powers Test::More) duplicates STDOUT and STDERR. So any changes to them, including changing their output disciplines, will not be seen by Test::More. The work around is to change the filehandles used by Test::Builder directly.
my $builder = Test::More->builder;
binmode $builder->output, ":utf8";
binmode $builder->failure_output, ":utf8";
binmode $builder->todo_output, ":utf8";
I added this bit of boilerplate to my testing code and it works a charm.