making sure that my handling of utf8 is correct - perl

I am using Perl for a module that involves processing a lot of Unicode documents. I started getting nervous because I'm not opening and closing files with the utf8 layers like open (OUT, '>:utf8', $textfile). However, I have been thoroughly testing and the output was still as expected. So I want to better understand why.
In a nutshell, my Perl module passes a document to an external service and gets a response. The response will be in Utf8. It uses LWP::UserAgent for this. When it gets the response it just writes it to a file:
my $fh;
open($fh, '>', $outputpath) or die "Could not open file '$outputpath' $!";
print $fh $response->content;
close $fh;
I have diffed these files against Unicode files representing the "expected" output and it is fine. And yet, you can see in my open command that I was not using the utf8 layer. So why is that?
What if I just returned $response->content to some other process, instead of printing it? Would it still be proper Unicode then?
I also have a separate process that I would like to ask about, very similar question. In this case I am trying to build a new service which replaces an old one. The old one read from a file like open(my $fh, '<:utf8', $inputfile) and wrote to a new file like open(my $fh, '>:utf8', $outputfile). The new service will still read the same way, but will not write to the output file anymore. It will send the string to another server using HTTP, and on that server it will be printed to a file using open(my $fh, '>', $outputfile) so no utf8 layer. I can't change that code immediately.
I want the file contents to be the exact same as they would otherwise have been (none of the other processing rules are changing). Should I be nervous about losing the layer?
I think maybe it would help if I understood better what these layers are doing.

There is no "handling of utf8" in the code in the main question, and that in itself isn't right.
The whole thing works only because the server is sending utf8, as you say. Here is how.
The content method used on $response is from HTTP::Message
The content() method sets the raw content if an argument is given. If no argument is given the content is not touched. In either case the original raw content is returned.
Since you don't specify layers† in open the default is used, likely :unix:perlio for Unix, with no encoding (see PerlIO). So you are dumping the original bytes to the disk, unchanged.
Looking further down the page, at decoded_content( %options ), we see the default
default_charset
This override the default charset guessed by content_charset() or if that fails "ISO-8859-1".
and can establish what you are getting by printing it
say 'Content type: ', $response->content_charset;
where you should get Content type: UTF-8. But when you receive a different encoding from the server then that will wind up in the file and any code that expects utf8 will break.
One should always decode all input and encode all output. Then we know exactly what is going on. As input is decoded the program carries on with character strings (not bytes in whatever encoding was sent). In the end encode suitably for output. This Effective Perler article should be useful. Here you'd use decoded_content and write files opened with :encoding(UTF-8).
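The "decode all input, encode all output" rule can be sketched without a network call. This is a minimal illustration, not the asker's code: $raw_bytes is a hypothetical stand-in for what $response->content would return, and the sample text is made up.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for $response->content: raw UTF-8 bytes.
my $raw_bytes = "na\xC3\xAFve caf\xC3\xA9";

# Decode at the input boundary: bytes become a character string.
my $text = decode('UTF-8', $raw_bytes);

# Inside the program, length counts characters, not bytes.
my $char_count = length $text;
my $byte_count = length $raw_bytes;

# Encode at the output boundary. Printing $text to a filehandle opened
# with :encoding(UTF-8) would perform this step implicitly.
my $out_bytes = encode('UTF-8', $text);

printf "%d characters, %d bytes in, %d bytes out\n",
    $char_count, $byte_count, length $out_bytes;
```

If the server ever sent a different encoding, only the decode step would need to change; the rest of the program would keep working with characters.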
With use open ":std", ":encoding(UTF-8)"; the standard streams, and files opened in the lexical scope of this pragma without explicit layers, will be handled as utf8. (This can be overridden for other specific uses, say by specifying layers in the three-argument open.)
See open pragma.
As for the other question, you need to properly encode what you intend to "send to another server." How to do that depends on how you are "sending" it.
†   With PerlIO the I/O "layers" can be set so that encoding of input and output is done as needed behind the scenes, as data is read or written. The work is done by Encode. For a nice explanation of the process see Encode::PerlIO.
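To see what a layer does, one can write the same character string through an :encoding(UTF-8) layer and through Encode by hand, then compare the bytes. This is only a sketch; the sample string and the in-memory "file" are assumptions for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A character string: "caf", e-acute, a space, and U+263A.
my $text = "caf\x{E9} \x{263A}";

# Write through an encoding layer into an in-memory "file".
my $via_layer = '';
open(my $out, '>:encoding(UTF-8)', \$via_layer) or die "open failed: $!";
print {$out} $text;
close $out or die "close failed: $!";

# Encode by hand; this is the work the layer did behind the scenes.
my $by_hand = encode('UTF-8', $text);

print "layer and manual encode agree\n" if $via_layer eq $by_hand;
printf "%d characters became %d bytes\n", length $text, length $via_layer;
```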
Also see perlunitut, perlunifaq, and perluniintro.

Related

Remove or completely supress null character \0

I have a script, MM.pl, which is the “workhorse”, and a simple “patchfile” that it reads from. In this case, the patch file is targeting an .ini file for search and replace. Simple enough. It took me 5 days to realize the ini is encoded with null (\0) characters between each letter. Since then, I have tried every option I could find both in code snippets, use:: functions, and regular expressions.
The only reason I found it was that I used use Data::Printer; to dump several values. In Notepad++, the ini appears to be encoded as UCS-2 LE. It is important that MM.pl handles the task instead of asking the user to “fix” the issue.
Update: This may provide a clue \xFF\xFE are the first 2 characters in the ini file. They appear after processing. The swap is not actually changing anything else like it is supposed to, but "reveals" 2 hidden characters.
As you noticed, those nulls aren't just junk to be stripped; they're part of the file's character encoding. So decode it:
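# Encode treats plain "UCS-2" as big-endian; this file is little-endian.
open my $fh, '<:encoding(UCS-2LE)', 'file.ini' or die "Cannot open file.ini: $!";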
Write it back out the same way once you're done.
When you read the file set the encoding
my $fh = IO::File->new( "< something.ini" );
binmode( $fh, ":encoding(UTF-16LE)" );
And when you output, you can write back whichever encoding you like, e.g.
my $out = IO::File->new( "> something-new.ini" );
binmode( $out, ":encoding(UTF-8)" );
Or even if you're dumping to the terminal
binmode( STDOUT, ":encoding(UTF-8)" );
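A self-contained sketch of the whole round trip may help. It assumes the ini is UTF-16LE with a leading BOM, as the \xFF\xFE bytes in the question suggest; the file contents and the key=old replacement are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a small UTF-16LE sample file with a BOM (hypothetical contents).
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp, ':raw:encoding(UTF-16LE)');
print {$tmp} "\x{FEFF}[section]\nkey=old\n";
close $tmp or die "close failed: $!";

# Read it back, decoding on the way in.
open(my $in, '<:raw:encoding(UTF-16LE)', $path) or die "Cannot open $path: $!";
my $content = do { local $/; <$in> };
close $in;

$content =~ s/\A\x{FEFF}//;      # drop the BOM character
$content =~ s/key=old/key=new/;  # an ordinary substitution now matches

print $content;
```

Once decoded, the string contains no \0 characters at all, so nothing needs to be stripped. To write the file back in its original form, open the output with the same layer and print the BOM character first.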
To be honest, this really is not a solution but a cop-out. After 4 weeks of trying and retrying methods, and reading and reading and reading, I have put it in park and switched to Python to build the app. Several references in the perldocs mention that UTF-16 is "problematic" and also mention situations in which it is treated differently.

Change file encoding for PostgreSQL w/Perl

I'm entering large amounts of data into a PostgreSQL database using Perl and the Perl DBI. I have been getting errors as my file is improperly encoded. I have the PostgreSQL encoding set to 'utf8' and used the debian 'file' command to determine that my file has "Non-ISO extended-ASCII text, with very long lines, with CRLF line terminators", and when I run my program the DBI fails due to an "invalid byte sequence". I already added a line in my Perl program to sub the '\r' carriage returns for nothing, but how can I convert my files to 'utf8' or get PostgreSQL to accept my file encoding. Thanks.
When you connect to PostgreSQL using DBI->connect(..., { pg_enable_utf8 => 1}) then the data used in all modifying DBI calls (SQL INSERT, UPDATE, DELETE, everywhere you use placeholders in queries etc) has to be encoded in Perl's internal encoding so that DBI itself can convert to the wire protocol correctly.
There are many ways to achieve that, and they all depend on how you read the file in the first place. The most basic one is if you use open (or one of the methods based directly on it, like IO::File->open). You can then use Perl's I/O layers (see the documentation for open) and let Perl do that for you. Assuming your file is encoded in UTF-8 already, you'll get away with:
open(my $fh, "<:encoding(UTF-8)", "filename");
while (my $line = <$fh>) {
# process query
}
This is basically equivalent to opening the file without an encoding layer and converting manually using Encode::decode, e.g. like this:
use Encode ();   # Encode::decode needs the module loaded
open(my $fh, "<", "filename");
while (my $line = <$fh>) {
$line = Encode::decode('UTF-8', $line);
# process query
}
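Many other modules that receive data from external sources (think of HTTP downloads with LWP, for example) can return values that have already been decoded into Perl's internal representation, such as decoded_content in LWP's case. Check each module's documentation, since some methods return raw bytes instead.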
So what you have to do is:
Figure out which encoding your file actually uses (try using iconv on the shell for that)
Tell DBI to enable UTF-8
Open the file with the correct encoding
Read line(s), process query, repeat
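The decoding part of these steps can be sketched without a database. The encoding name is an assumption: "Non-ISO extended-ASCII" files are often Windows-1252, but you need to confirm what yours actually is. The sample line is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the source bytes are Windows-1252; substitute the real encoding.
my $source_encoding = 'cp1252';

# Hypothetical input line as cp1252 bytes, ending in CRLF.
# 0xEB is e-diaeresis, 0x92 is a curly apostrophe, 0xE9 is e-acute.
my $raw_line = "Zo\xEB\x92s caf\xE9\r\n";

my $line = decode($source_encoding, $raw_line);  # bytes -> characters
$line =~ s/\r?\n\z//;                            # strip the line terminator

# $line is now a character string, which is what a DBI placeholder expects
# when pg_enable_utf8 is on. Encoding by hand is shown only to count bytes.
my $utf8_bytes = encode('UTF-8', $line);

printf "%d characters, %d UTF-8 bytes\n", length $line, length $utf8_bytes;
```

Opening the file with "<:encoding(cp1252)" would do the same decoding for every line read.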

How do I find "wide characters" printed by perl?

A perl script that scrapes static html pages from a website and writes them to individual files appears to work, but also prints many instances of wide character in print at ./script.pl line n to console: one for each page scraped.
However, a brief glance at the html files generated does not reveal any obvious mistakes in the scraping. How can I find/fix the problem character(s)? Should I even care about fixing it?
The relevant code:
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
...
foreach (@urls) {
$mech->get($_);
print FILE $mech->content; #MESSAGE REFERS TO THIS LINE
...
This is on OSX with Perl 5.8.8.
If you want to fix up the files after the fact, then you could pipe them through fix_latin which will make sure they're all UTF-8 (assuming the input is some mixture of ASCII, Latin-1, CP1252 or UTF-8 already).
For the future, you could use $mech->response->decoded_content, which should give you a decoded character string regardless of what encoding the web server used. Then you would binmode(FILE, ':utf8') before writing to it, to ensure that Perl's internal string representation is converted to UTF-8 bytes on output.
I assume you're crawling images or something of that sort; anyway, you can get around the problem by adding binmode(FILE); or, if they are webpages and UTF-8, then try binmode( FILE, ':utf8' ). See perldoc -f binmode, perldoc perlopentut, and perldoc PerlIO for more information.
The ":bytes", ":crlf", and ":utf8", and any other directives of the form ":...", are called I/O layers. The "open" pragma can be used to establish default I/O layers. See open.
To mark FILEHANDLE as UTF-8, use ":utf8" or ":encoding(utf8)". ":utf8" just marks the data as UTF-8 without further checking, while ":encoding(utf8)" checks the data for actually being valid UTF-8. More details can be found in PerlIO::encoding.
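To answer the "how can I find the problem characters" part directly: the warning fires when a string containing a character above 0xFF is printed to a handle with no encoding layer, so a regex can locate such characters. A minimal sketch; find_wide_chars and the sample text are hypothetical names and data, not part of WWW::Mechanize.

```perl
use strict;
use warnings;

# Return one "U+XXXX at offset N" entry per character above 0xFF.
sub find_wide_chars {
    my ($text) = @_;
    my @found;
    while ($text =~ /([^\x00-\xFF])/g) {
        push @found, sprintf('U+%04X at offset %d', ord $1, pos($text) - 1);
    }
    return @found;
}

# Hypothetical page content with a curly apostrophe and an em dash.
my $page = "It\x{2019}s fine \x{2014} mostly";
my @hits = find_wide_chars($page);
print "$_\n" for @hits;
```

In the scraper, the same function could be called on the content before printing it, to see which characters trigger the warning on each page.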

Are there reasons to ever use the two-argument form of open(...) in Perl?

Are there any reasons to ever use the two-argument form of open(...) in Perl rather than the three-or-more-argument versions?
The only reason I can come up with is the obvious observation that the two-argument form is shorter. But assuming that verbosity is not an issue, are there any other reasons that would make you choose the two-argument form of open(...)?
One- and two-arg open applies any default layers specified with the -C switch or open pragma. Three-arg open does not. In my opinion, this functional difference is the strongest reason to choose one or the other (and the choice will vary depending what you are opening). Which is easiest or most descriptive or "safest" (you can safely use two-arg open with arbitrary filenames, it's just not as convenient) take a back seat in module code; in script code you have more discretion to choose whether you will support default layers or not.
Also, one-arg open is needed for Damian Conway's file slurp operator
$_ = "filename";
$contents = readline!open(!((*{!$_},$/)=\$_));
Imagine you are writing a utility that accepts an input file name. People with reasonable Unix experience are used to substituting - for STDIN. Perl handles that automatically only when the magical form is used where the mode characters and file name are one string, else you have to handle this and similar special cases yourself. This is a somewhat common gotcha, I am surprised no one has posted that yet. Proof:
use IO::File qw();
my $user_supplied_file_name = '-';
IO::File->new($user_supplied_file_name, 'r') or warn "IO::File/non-magical mode - $!\n";
IO::File->new("<$user_supplied_file_name") or warn "IO::File/magical mode - $!\n";
open my $fh1, '<', $user_supplied_file_name or warn "non-magical open - $!\n";
open my $fh2, "<$user_supplied_file_name" or warn "magical open - $!\n";
__DATA__
IO::File/non-magical mode - No such file or directory
non-magical open - No such file or directory
Another small difference: the two-argument form trims spaces
$foo = " fic";
open(MH, ">$foo");
print MH "toto\n";
Writes in a file named fic
On the other hand
$foo = " fic";
open(MH, ">", $foo);
print MH "toto\n";
Will write to a file whose name begins with a space.
For short admin scripts with user input (or configuration file input), not having to bother with such details as trimming filenames is nice.
The two argument form of open was the only form supported by some old versions of perl.
If you're opening from a pipe, the three argument form isn't really helpful. Getting the equivalent of the three argument form involves doing a safe pipe open (open(FILE, '|-')) and then executing the program.
So for simple pipe opens (e.g. open(FILE, 'ps ax |')), the two argument syntax is much more compact.
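For comparison, Perl also accepts a list form of pipe open, where the mode is '-|' and the command and its arguments follow as separate items, so no shell is involved. A sketch, assuming a Unix-like system; $^X (the running perl binary) stands in for ps so the example does not depend on an external program.

```perl
use strict;
use warnings;

# List-form pipe open: the command and its arguments bypass the shell.
# $^X is the running perl binary, used here as a stand-in for 'ps'.
open(my $pipe, '-|', $^X, '-e', 'print "hello from child\n"')
    or die "Cannot start child: $!";
my @lines = <$pipe>;
close $pipe or warn "Child exited with status $?\n";

print @lines;
```

It is longer than the two-argument version, which is the trade-off the answer above describes.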
I think William's post pretty much hits it. Otherwise, the three-argument form is going to be more clear, as well as safer.
See also:
What's the best way to open and read a file in Perl?
Why is three-argument open calls with autovivified filehandles a Perl best practice?
One reason to use the two-argument version of open is if you want to open something which might be a pipe, or a file. If you have one function
sub strange
{
my ($file) = @_;
open my $input, $file or die $!;
}
then you want to call this either with a filename like "file":
strange ("file");
or a pipe like "zcat file.gz |"
strange ("zcat file.gz |");
depending on the situation of the file you find, then the two-argument version may be used. You will actually see the above construction in "legacy" Perl. However, the most sensible thing might be to open the filehandle appropriately and send the filehandle to the function rather than using the file name like this.
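The alternative suggested in the last sentence, opening the handle first and passing it in, might look like this. The function name count_lines and the sample data are hypothetical; an in-memory filehandle stands in for a real file.

```perl
use strict;
use warnings;

# Count lines from any readable filehandle; the caller decides how it was opened.
sub count_lines {
    my ($input) = @_;
    my $count = 0;
    while (my $line = <$input>) {
        $count++;
    }
    return $count;
}

# An in-memory filehandle stands in for a plain file here; a handle opened
# on a pipe could be passed to count_lines in exactly the same way.
my $data = "one\ntwo\nthree\n";
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";
my $lines = count_lines($fh);
close $fh;

print "$lines lines\n";
```

The function no longer needs to know whether its input is a file or a pipe, so the caller can use three-argument open for both.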
When you are combining a string or using a variable, it can be rather unclear whether '<' or '>' etc is in already. In such cases, I personally prefer readability, which means, I use the longer form:
open($FILE, '>', $varfn);
When you simply use a constant, I prefer the ease of typing (and, actually, consider the short version more readable anyway, or at least as readable as the long version).
open($FILE, '>somefile.xxx');
I'm guessing you mean open(FH, '<filename.txt') as opposed to open(FH, '<', 'filename.txt') ?
I think it's just a matter of preference. I always use the former out of habit.

How do I serve a large file for download with Perl?

I need to serve a large file (500+ MB) for download from a location that is not accessible to the web server. I found the question Serving large files with PHP, which is identical to my situation, but I'm using Perl instead of PHP.
I tried simply printing the file line by line, but this does not cause the browser to prompt for download before grabbing the entire file:
use Tie::File;
open my $fh, '<', '/path/to/file.txt';
tie my @file, 'Tie::File', $fh
    or die "Could not open file: $!";
my $size_in_bytes = -s $fh;
print "Content-type: text/plain\n";
print "Content-Length: $size_in_bytes\n";
print "Content-Disposition: attachment; filename=file.txt\n\n";
for my $line (@file) {
print $line;
}
untie @file;
close $fh;
exit;
Does Perl have an equivalent to PHP's readfile() function (as suggested with PHP) or is there a way to accomplish what I'm trying to do here?
If you just want to slurp input to output, this should do the trick.
use Carp ();
{ #Lexical For FileHandle and $/
open my $fh, '<' , '/path/to/file.txt' or Carp::croak("File Open Failed");
local $/ = undef;
print scalar <$fh>;
close $fh or Carp::carp("File Close Failed");
}
In response to "Does Perl have a PHP readfile() equivalent?", I guess my answer would be "But it doesn't really need one".
I've used PHP's manual file I/O controls and they're a pain; Perl's are so easy to use by comparison that shelling out for a one-size-fits-all function seems overkill.
Also, you might want to look at X-SendFile support, and basically send a header to your webserver to tell it what file to send: http://john.guen.in/past/2007/4/17/send_files_faster_with_xsendfile/ ( assuming of course it has permissions enough to access the file, but the file is just NOT normally accessible via a standard URI )
Edit: Noted, it is better to do it in a loop. I tested the above code with a hard drive, and it does implicitly try to store the whole thing in an invisible temporary variable and eat all your RAM.
Alternative using blocks
The following improved code reads the given file in blocks of 8192 chars, which is much more memory efficient, and gets a throughput respectably comparable with my disk raw read rate. ( I also pointed it at /dev/full for fits and giggles and got a healthy 500mb/s throughput, and it didn't eat all my rams, so that must be good )
{
open my $fh , '<', '/dev/sda' ;
local $/ = \8192; # this tells IO to use 8192 char chunks.
print $_ while defined ( $_ = scalar <$fh> );
close $fh;
}
Applying jrockway's suggestions
{
open my $fh , '<', '/dev/sda5' ;
print $_ while ( sysread $fh, $_ , 8192 );
close $fh;
}
This literally doubles performance, ... and in some cases, gets me better throughput than DD does O_o.
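A slightly more defensive version of the sysread loop is sketched below. It checks for read errors and handles partial writes, which the one-liner above does not. The name copy_in_blocks is hypothetical, and temporary files stand in for the real source and for STDOUT.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Copy everything from $in to $out in fixed-size blocks; returns bytes copied.
sub copy_in_blocks {
    my ($in, $out, $block_size) = @_;
    my $total = 0;
    while (1) {
        my $read = sysread($in, my $buffer, $block_size);
        die "Read failed: $!" unless defined $read;
        last if $read == 0;                      # end of file
        my $offset = 0;
        while ($offset < $read) {                # syswrite may write partially
            my $written = syswrite($out, $buffer, $read - $offset, $offset);
            die "Write failed: $!" unless defined $written;
            $offset += $written;
        }
        $total += $read;
    }
    return $total;
}

# Hypothetical test data: 20000 bytes, so the loop runs more than once.
my ($src, $src_path) = tempfile(UNLINK => 1);
binmode $src;
print {$src} 'x' x 20_000;
close $src or die "close failed: $!";

open(my $in, '<:raw', $src_path) or die "Cannot open $src_path: $!";
my ($out, $out_path) = tempfile(UNLINK => 1);
binmode $out;
my $copied = copy_in_blocks($in, $out, 8192);
close $in;
close $out or die "close failed: $!";

print "copied $copied bytes\n";
```

In a CGI script the output handle would be \*STDOUT, with the headers printed (and flushed) before the copy starts, since sysread and syswrite bypass Perl's buffering.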
The readline function is called readline (and can also be written as <>).
I'm not sure what problem you're having. Perhaps that for loops aren't lazily evaluated (which they're not). Or, perhaps Tie::File is screwing something up? Anyway, the idiomatic Perl for reading a file a line at a time is:
open my $fh, '<', $filename or die ...;
while(my $line = <$fh>){
# process $line
}
No need to use Tie::File.
Finally, you should not be handling this sort of thing yourself. This is a job for a web framework. If you were using Catalyst (or HTTP::Engine), you would just say:
open my $fh, '<', $filename ...
$c->res->body( $fh );
and the framework would automatically serve the data in the file efficiently. (Using stdio via readline is not a good idea here, it's better to read the file in blocks from the disk. But who cares, it's abstracted!)
You could use my Sys::Sendfile module. It should be highly efficient (as it uses sendfile underneath the hood), but not entirely portable (only Linux, FreeBSD and Solaris are currently supported).
When you say "this does not cause the browser to prompt for download" -- what's "the browser"?
Different browsers behave differently, and IE is particularly wilful, it will ignore headers and decide for itself what to do based on reading the first few kb of the file.
In other words, I think your problem may be at the client end, not the server end.
Try lying to "the browser" and telling it the file is of type application/octet-stream. Or why not just zip the file, especially as it's so huge.
Don't use for/foreach (<$input>) because it reads the whole file at once and then iterates over it. Use while (<$input>) instead. The sysread solution is good, but the sendfile is the best performance-wise.
Answering the (original) question ("Does Perl have an equivalent to PHP's readline() function ... ?"), the answer is "the angle bracket syntax":
open my $fh, '<', '/path/to/file.txt';
while (my $line = <$fh>) {
print $line;
}
Getting the content-length with this method isn't necessarily easy, though, so I'd recommend staying with Tie::File.
NOTE
Using:
for my $line (<$filehandle>) { ... }
(as I originally wrote) copies the contents of the file to a list and iterates over that. Using
while (my $line = <$filehandle>) { ... }
does not. When dealing with small files the difference isn't significant, but when dealing with large files it definitely can be.
Answering the (updated) question ("Does Perl have an equivalent to PHP's readfile() function ... ?"), the answer is slurping. There are a couple of syntaxes, but Perl6::Slurp seems to be the current module of choice.
The implied question ("why doesn't the browser prompt for download before grabbing the entire file?") has absolutely nothing to do with how you're reading in the file, and everything to do with what the browser thinks is good form. I would guess that the browser sees the mime-type and decides it knows how to display plain text.
Looking more closely at the Content-Disposition problem, I remember having similar trouble with IE ignoring Content-Disposition. Unfortunately I can't remember the workaround. IE has a long history of problems here (old page, refers to IE 5.0, 5.5 and 6.0). For clarification, however, I would like to know:
What kind of link are you using to point to this big file (i.e., are you using a normal <a href="perl_script.cgi?filename.txt"> link or are you using Javascript of some kind)?
What system are you using to actually serve the file? For instance, does the webserver make its own connection to the other computer without a webserver, and then copy the file to the webserver and then send the file to the end user, or does the user make the connection directly to the computer without a webserver?
In the original question you wrote "this does not cause the browser to prompt for download before grabbing the entire file" and in a comment you wrote "I still don't get a download prompt for the file until the whole thing is downloaded." Does this mean that the file gets displayed in the browser (since it's just text), that after the browser has downloaded the entire file you get a "where do you want to save this file" prompt, or something else?
I have a feeling that there is a chance the HTTP headers are getting stripped out at some point or that a Cache-control header is getting added (which apparently can cause trouble).
I've successfully done it by telling the browser it was of type application/octet-stream instead of type text/plain. Apparently most browsers prefer to display text/plain inline instead of giving the user a download dialog option.
It's technically lying to the browser, but it does the job.
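As a sketch of that approach, the headers from the question could be rewritten like this. The helper name download_headers is hypothetical, and the filename is inserted as-is, so it is assumed to contain no quotes or line breaks.

```perl
use strict;
use warnings;

# Build CGI response headers that ask the browser to save rather than display.
sub download_headers {
    my ($filename, $size) = @_;
    return join '',
        "Content-Type: application/octet-stream\r\n",
        "Content-Length: $size\r\n",
        qq{Content-Disposition: attachment; filename="$filename"\r\n},
        "\r\n";                       # blank line ends the header block
}

my $headers = download_headers('file.txt', 1024);
print $headers;
```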
The most efficient way to serve a large file for download depends on a web-server you use.
In addition to @Kent Fredric's X-Sendfile suggestion:
File Downloads Done Right has some links that describe how to do it for Apache, lighttpd (mod_secdownload: security via url generation), nginx. There are examples in PHP, Ruby (Rails), Python which can be adapted for Perl.
Basically it boils down to:
Configure paths, and permissions for your web-server.
Generate valid headers for the redirect in your Perl app (Content-Type, Content-Disposition, Content-length?, X-Sendfile or X-Accel-Redirect, etc).
There are probably CPAN modules and web-framework plugins that do exactly that, e.g., @Leon Timmermans mentioned Sys::Sendfile in his answer.
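The header-generation step might be sketched as follows. This only builds the header text; whether the server honours X-Sendfile or X-Accel-Redirect depends entirely on its configuration, and the helper name offload_headers and the path used are hypothetical.

```perl
use strict;
use warnings;

# Build headers that hand the actual transfer off to the web server.
# $header_name is 'X-Sendfile' or 'X-Accel-Redirect', depending on the server.
sub offload_headers {
    my ($header_name, $target, $download_name) = @_;
    return join '',
        "Content-Type: application/octet-stream\r\n",
        qq{Content-Disposition: attachment; filename="$download_name"\r\n},
        "$header_name: $target\r\n",
        "\r\n";                       # no body follows; the server supplies it
}

# Hypothetical path; the server must be configured to allow access to it.
print offload_headers('X-Sendfile', '/srv/private/big-file.bin', 'big-file.bin');
```

Note that the two headers take different kinds of values: X-Sendfile is typically given a filesystem path, while nginx's X-Accel-Redirect is given an internal URI.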