How do I serve a large file for download with Perl?

I need to serve a large file (500+ MB) for download from a location that is not accessible to the web server. I found the question Serving large files with PHP, which is identical to my situation, but I'm using Perl instead of PHP.
I tried simply printing the file line by line, but this does not cause the browser to prompt for download before grabbing the entire file:
use Tie::File;
open my $fh, '<', '/path/to/file.txt';
tie my @file, 'Tie::File', $fh
    or die "Could not open file: $!";
my $size_in_bytes = -s $fh;
print "Content-type: text/plain\n";
print "Content-Length: $size_in_bytes\n";
print "Content-Disposition: attachment; filename=file.txt\n\n";
for my $line (@file) {
    print $line;
}
untie @file;
close $fh;
exit;
Does Perl have an equivalent to PHP's readfile() function (as suggested with PHP) or is there a way to accomplish what I'm trying to do here?

If you just want to slurp input to output, this should do the trick.
use Carp ();
{   # lexical scope for the filehandle and $/
    open my $fh, '<', '/path/to/file.txt' or Carp::croak("File Open Failed");
    local $/ = undef;
    print scalar <$fh>;
    close $fh or Carp::carp("File Close Failed");
}
I guess in response to "Does Perl have a PHP readfile() equivalent?", my answer would be "it doesn't really need one".
I've used PHP's manual file I/O controls and they're a pain; Perl's are just so easy to use by comparison that shelling out for a one-size-fits-all function seems like overkill.
Also, you might want to look at X-Sendfile support: you basically send a header to your web server telling it which file to send: http://john.guen.in/past/2007/4/17/send_files_faster_with_xsendfile/ (assuming of course that the web server has permission to access the file, which is just not normally accessible via a standard URI).
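For example, a minimal CGI-style sketch of that idea (this is only a sketch: it assumes the web server has something like Apache's mod_xsendfile enabled, and the path below is hypothetical):
print "Content-Type: application/octet-stream\n";
print "Content-Disposition: attachment; filename=file.txt\n";
# The web server replaces the (empty) body with the named file itself.
print "X-Sendfile: /not/under/docroot/file.txt\n\n";
exit;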
Edit: noted, it is better to do it in a loop. I tested the slurp code above against a hard drive, and it does implicitly try to store the whole thing in an invisible temporary variable and eat all your RAM.
Alternative using blocks
The following improved code reads the given file in blocks of 8192 bytes, which is much more memory efficient, and gets throughput respectably comparable with my raw disk read rate. (I also pointed it at /dev/full for fits and giggles and got a healthy 500 MB/s throughput, and it didn't eat all my RAM, so that must be good.)
{
    open my $fh, '<', '/dev/sda';
    local $/ = \8192;  # this tells IO to use 8192-byte chunks
    print $_ while defined( $_ = scalar <$fh> );
    close $fh;
}
Applying jrockway's suggestions
{
    open my $fh, '<', '/dev/sda5';
    print $_ while ( sysread $fh, $_, 8192 );
    close $fh;
}
This literally doubles performance, and in some cases gets me better throughput than dd does O_o.

The readline function is called readline (and can also be written as <>).
I'm not sure what problem you're having. Perhaps it's that for loops aren't lazily evaluated (which they're not), or perhaps Tie::File is screwing something up? Anyway, the idiomatic Perl for reading a file a line at a time is:
open my $fh, '<', $filename or die ...;
while (my $line = <$fh>) {
    # process $line
}
No need to use Tie::File.
Finally, you should not be handling this sort of thing yourself. This is a job for a web framework. If you were using Catalyst (or HTTP::Engine), you would just say:
open my $fh, '<', $filename ...
$c->res->body( $fh );
and the framework would automatically serve the data in the file efficiently. (Using stdio via readline is not a good idea here; it's better to read the file in blocks from the disk. But who cares, it's abstracted!)

You could use my Sys::Sendfile module. It should be highly efficient (it uses sendfile under the hood), but it's not entirely portable (only Linux, FreeBSD and Solaris are currently supported).
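A rough usage sketch (hedged: it assumes you are writing to a real client socket, e.g. in a small standalone server, since sendfile() works on file descriptors; $client is a made-up variable for that socket):
use Sys::Sendfile qw(sendfile);
open my $fh, '<:raw', '/path/to/file.txt' or die "open failed: $!";
my $size = -s $fh;
# Print the HTTP headers ourselves, then let the kernel copy the file body
# straight from $fh to the socket.
print {$client} "HTTP/1.1 200 OK\r\n"
    . "Content-Type: application/octet-stream\r\n"
    . "Content-Length: $size\r\n"
    . "Content-Disposition: attachment; filename=file.txt\r\n\r\n";
sendfile($client, $fh, $size) or die "sendfile failed: $!";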

When you say "this does not cause the browser to prompt for download" -- what's "the browser"?
Different browsers behave differently, and IE is particularly wilful: it will ignore headers and decide for itself what to do based on reading the first few KB of the file.
In other words, I think your problem may be at the client end, not the server end.
Try lying to "the browser" and telling it the file is of type application/octet-stream. Or why not just zip the file, especially since it's so huge?
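For example, a hedged tweak of the headers from the question (everything else stays the same; only the MIME type changes):
my $size_in_bytes = -s '/path/to/file.txt';
print "Content-Type: application/octet-stream\n";
print "Content-Length: $size_in_bytes\n";
print "Content-Disposition: attachment; filename=file.txt\n\n";
# ... then stream the file body in chunks as shown in the other answers.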

Don't use for/foreach (<$input>), because it reads the whole file at once and then iterates over it. Use while (<$input>) instead. The sysread solution is good, but sendfile is the best performance-wise.

Answering the (original) question ("Does Perl have an equivalent to PHP's readline() function ... ?"), the answer is "the angle bracket syntax":
open my $fh, '<', '/path/to/file.txt';
while (my $line = <$fh>) {
    print $line;
}
Getting the content-length with this method isn't necessarily easy, though, so I'd recommend staying with Tie::File.
NOTE
Using:
for my $line (<$filehandle>) { ... }
(as I originally wrote) copies the contents of the file to a list and iterates over that. Using
while (my $line = <$filehandle>) { ... }
does not. When dealing with small files the difference isn't significant, but when dealing with large files it definitely can be.
Answering the (updated) question ("Does Perl have an equivalent to PHP's readfile() function ... ?"), the answer is slurping. There are a couple of syntaxes, but Perl6::Slurp seems to be the current module of choice.
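A hedged sketch with that module (assuming it is installed and, as its documentation describes, accepts an open-style mode/layer string as its first argument):
use Perl6::Slurp;
# Slurp the whole file as raw bytes; length() then doubles as the Content-Length.
# Note that for a 500+ MB file this costs that much RAM.
my $content = slurp '<:raw', '/path/to/file.txt';
print "Content-Length: ", length($content), "\n";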
The implied question ("why doesn't the browser prompt for download before grabbing the entire file?") has absolutely nothing to do with how you're reading in the file, and everything to do with what the browser thinks is good form. I would guess that the browser sees the mime-type and decides it knows how to display plain text.
Looking more closely at the Content-Disposition problem, I remember having similar trouble with IE ignoring Content-Disposition. Unfortunately I can't remember the workaround. IE has a long history of problems here (old page, refers to IE 5.0, 5.5 and 6.0). For clarification, however, I would like to know:
What kind of link are you using to point to this big file (i.e., are you using a normal <a href="perl_script.cgi?filename.txt"> link, or are you using JavaScript of some kind)?
What system are you using to actually serve the file? For instance, does the web server make its own connection to the other computer (the one without a web server), copy the file over, and then send it to the end user, or does the user connect directly to the computer without a web server?
In the original question you wrote "this does not cause the browser to prompt for download before grabbing the entire file" and in a comment you wrote "I still don't get a download prompt for the file until the whole thing is downloaded." Does this mean that the file gets displayed in the browser (since it's just text), that after the browser has downloaded the entire file you get a "where do you want to save this file" prompt, or something else?
I have a feeling that there is a chance the HTTP headers are getting stripped out at some point or that a Cache-control header is getting added (which apparently can cause trouble).

I've successfully done it by telling the browser it was of type application/octet-stream instead of type text/plain. Apparently most browsers prefer to display text/plain inline instead of giving the user a download dialog option.
It's technically lying to the browser, but it does the job.

The most efficient way to serve a large file for download depends on the web server you use.
In addition to @Kent Fredric's X-Sendfile suggestion:
File Downloads Done Right has some links that describe how to do it for Apache, lighttpd (mod_secdownload: security via URL generation), and nginx. There are examples in PHP, Ruby (Rails), and Python which can be adapted for Perl.
Basically it boils down to:
Configure paths and permissions for your web server.
Generate valid headers for the redirect in your Perl app (Content-Type, Content-Disposition, Content-Length?, X-Sendfile or X-Accel-Redirect, etc.).
There are probably CPAN modules or web-framework plugins that do exactly that; e.g., @Leon Timmermans mentioned Sys::Sendfile in his answer.
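For the nginx case, for example, a hedged sketch of the header-generation step (the /protected/ prefix is hypothetical and would be declared internal in the nginx configuration):
print "Content-Type: application/octet-stream\n";
print "Content-Disposition: attachment; filename=file.txt\n";
# nginx intercepts this header and streams the named file itself.
print "X-Accel-Redirect: /protected/file.txt\n\n";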

Related

How to pipe to and read from the same tempfile handle without race conditions?

Was debugging a Perl script for the first time in my life and came across this:
$my_temp_file = File::Temp->tmpnam();
system("cmd $blah | cmd2 > $my_temp_file");
open(FIL, "$my_temp_file");
...
unlink $my_temp_file;
This works pretty much like I want, except the obvious race conditions in lines 1-3. Even if using proper tempfile() there is no way (I can think of) to ensure that the file streamed to at line 2 is the same opened at line 3. One solution might be pipes, but the errors during cmd might occur late because of limited pipe buffering, and that would complicate my error handling (I think).
How do I:
Write all output from cmd $blah | cmd2 into a tempfile opened file handle?
Read the output without re-opening the file (risking race condition)?
You can open a pipe to a command and read its contents directly with no intermediate file:
open my $fh, '-|', 'cmd', $blah;
while ( <$fh> ) {
    ...
}
With short output, backticks might do the job, although in this case you have to be more careful to scrub the inputs so they aren't misinterpreted by the shell:
my $output = `cmd $blah`;
There are various modules on CPAN that handle this sort of thing, too.
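IPC::Run3 is one such option (named only as an example); a hedged sketch of capturing the command's output without ever touching a shell:
use IPC::Run3 qw(run3);
# The command is passed as a list, so $blah is a literal argument
# and never gets shell-interpreted.
my @cmd = ('cmd', $blah);
run3 \@cmd, \undef, \my $output, \my $errors;
die "cmd failed ($?): $errors" if $?;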
Some comments on temporary files
The comments mentioned race conditions, so I thought I'd write a few things for those wondering what people are talking about.
In the original code, Andreas uses File::Temp, a module from the Perl Standard Library. However, they use the tmpnam POSIX-like call, which has this caveat in the docs:
Implementations of mktemp(), tmpnam(), and tempnam() are provided, but should be used with caution since they return only a filename that was valid when function was called, so cannot guarantee that the file will not exist by the time the caller opens the filename.
This is discouraged and was removed for Perl v5.22's POSIX.
That is, you get back the name of a file that does not exist yet. After you get the name, you don't know if that filename was made by another program. And, that unlink later can cause problems for one of the programs.
The "race condition" comes in when two programs that probably don't know about each other try to do the same thing at roughly the same time. Your program tries to make a temporary file named "foo", and so does some other program. They both might see at the same time that a file named "foo" does not exist, then try to create it. They both might succeed, and as they both write to it, they might interleave or overwrite the other's output. Then, one of those programs thinks it is done and calls unlink. Now the other program wonders what happened.
In the malicious exploit case, some bad actor knows a temporary file will show up, so it recognizes a new file and gets in there to read or write data.
But this can also happen within the same program. Two or more versions of the same program run at the same time and try to do the same thing. With randomized filenames, it is probably exceedingly rare that two running programs will choose the same name at the same time. However, we don't care how rare something is; we care how devastating the consequences are should it happen. And, rare is much more frequent than never.
File::Temp
Knowing all that, File::Temp handles the details of ensuring that you get a filehandle:
my( $fh, $name ) = File::Temp->tempfile;
This uses a default template to create the name. When the filehandle goes out of scope, File::Temp also cleans up the mess.
{
    my( $fh, $name ) = File::Temp->tempfile;
    print $fh ...;
    ...;
}  # file cleaned up
Some systems might automatically clean up temp files, although I haven't cared about that in years. Typically it was a batch thing (say, once a week).
I often go one step further by giving my temporary filenames a template, where the Xs are literal characters the module recognizes and fills in with randomized characters:
my( $fh, $name ) = File::Temp->tempfile(
    sprintf "$0-%d-XXXXXX", time );
I'm often doing this while I'm developing things so I can watch the program make the files (and in which order) and see what's in them. In production I probably want to obscure the source program name ($0) and the time; I don't want to make it easier to guess who's making which file.
A scratchpad
I can also open a temporary file with open by not giving it a filename. This is useful when you want somewhere to collect output but don't need a named file outside the program. Opening it read-write means you can output some stuff then move around in that file (we show a fixed-length record example in Learning Perl):
open(my $tmp, "+>", undef) or die ...
print $tmp "Some stuff\n";
seek $tmp, 0, 0;
my $line = <$tmp>;
File::Temp opens the temp file in O_RDWR mode so all you have to do is use that one file handle for both reading and writing, even from external programs. The returned file handle is overloaded so that it stringifies to the temp file name so you can pass that to the external program. If that is dangerous for your purpose you can get the fileno() and redirect to /dev/fd/<fileno> instead.
All you have to do is mind your seeks and tells. :-) Just remember to always set autoflush!
use File::Temp;
use Data::Dump;
$fh = File::Temp->new;
$fh->autoflush;
system "ls /tmp/*.txt >> $fh" and die $!;
@lines = <$fh>;
printf "%s\n\n", Data::Dump::pp(\@lines);
print $fh "How now brown cow\n";
seek $fh, 0, 0 or die $!;
@lines2 = <$fh>;
printf "%s\n", Data::Dump::pp(\@lines2);
Which prints
[
"/tmp/cpan_htmlconvert_DPzx.txt\n",
"/tmp/cpan_htmlconvert_DunL.txt\n",
"/tmp/cpan_install_HfUe.txt\n",
"/tmp/cpan_install_XbD6.txt\n",
"/tmp/cpan_install_yzs9.txt\n",
]
[
"/tmp/cpan_htmlconvert_DPzx.txt\n",
"/tmp/cpan_htmlconvert_DunL.txt\n",
"/tmp/cpan_install_HfUe.txt\n",
"/tmp/cpan_install_XbD6.txt\n",
"/tmp/cpan_install_yzs9.txt\n",
"How now brown cow\n",
]
HTH

Faster way to read a file line by line in perl

I have to read a big Unix file line by line using Perl. The script takes more than 2 minutes to run for a big file but takes less time for a small file.
I am using the following code:
open(FILE, "filename");
while (<FILE>) {
}
Please let me know a way to parse the file faster.
What do you mean by a "big file" and a "small file"? What size are those files? How many lines do they have?
Unless your big file is absolutely huge, it seems likely that what is slowing your program down is not the reading from a file, but whatever you're doing in the while loop. To prove me wrong, you'd just need to run your program with nothing in the while loop to see how long that takes.
Assuming that I'm right, then you need to work out what section of your processing is causing the problems. Without seeing that code we, obviously, can't be any help there. But that's where a tool like Devel::NYTProf would be useful.
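The typical invocation (assuming Devel::NYTProf is installed) looks something like this:
perl -d:NYTProf yourscript.pl   # writes profiling data to nytprof.out
nytprofhtml                     # turns nytprof.out into a browsable HTML report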
I'm not sure where you learned your Perl from, but the idiom you're using to open your file is rather outdated. These days we would a) use lexical variables as filehandles, b) use the 3-argument version of open() and c) always check the return value from open() and take appropriate action.
open(my $fh, '<', 'filename')
    or die "Cannot open 'filename': $!\n";
while (<$fh>) {
    ...
}
If you have the memory, slurp the whole file with @array = <$fh> and then loop through the array.

making sure that my handling of utf8 is correct

I am using Perl for a module that involves processing a lot of Unicode documents. I started getting nervous because I'm not opening and closing files with the utf8 layers like open (OUT, '>:utf8', $textfile). However, I have been thoroughly testing and the output was still as expected. So I want to better understand why.
In a nutshell, my Perl module passes a document to an external service and gets a response. The response will be in UTF-8. It uses LWP::UserAgent for this. When it gets the response it just writes it to a file:
my $fh;
open($fh, '>', $outputpath) or die "Could not open file '$outputpath' $!";
print $fh $response->content;
close $fh;
I have diffed these files against Unicode files representing the "expected" output and it is fine. And yet, you can see in my open command that I was not using the utf8 layer. So why is that?
What if I just returned $response->content to some other process, instead of printing it? Would it still be proper Unicode then?
I also have a separate process that I would like to ask about, very similar question. In this case I am trying to build a new service which replaces an old one. The old one read from a file like open(my $fh, '<:utf8', $inputfile) and wrote to a new file like open(my $fh, '>:utf8', $outputfile). The new service will still read the same way, but will not write to the output file anymore. It will send the string to another server using HTTP, and on that server it will be printed to a file using open(my $fh, '>', $outputfile) so no utf8 layer. I can't change that code immediately.
I want the file contents to be the exact same as they would otherwise have been (none of the other processing rules are changing). Should I be nervous about losing the layer?
I think maybe it would help if I understood better what these layers are doing.
There is no "handling of utf8" in the code shown in the question, and that in itself isn't right.
The whole thing works, since the server is sending UTF-8 as you say, in the following way.
The content method used on $response is from HTTP::Message
The content() method sets the raw content if an argument is given. If no argument is given the content is not touched. In either case the original raw content is returned.
Since you don't specify layers† in open, the default is used, likely :unix:perlio on Unix, with no encoding (see PerlIO). So you are dumping the original bytes to the disk, unchanged.
Looking further down the page, at decoded_content( %options ), we see the default
default_charset
This override the default charset guessed by content_charset() or if that fails "ISO-8859-1".
and you can establish what you are getting by printing it
say 'Content type: ', $response->content_charset;
where you should get Content type: UTF-8. But when you receive a different encoding from the server, that is what will wind up in the file, and any code that expects utf8 will break.
One should always decode all input and encode all output. Then we know exactly what is going on. As input is decoded the program carries on with character strings (not bytes in whatever encoding was sent). In the end encode suitably for output. This Effective Perler article should be useful. Here you'd use decoded_content and write files opened with :encoding(UTF-8).
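A hedged sketch of that, adapted to the code in the question ($response and $outputpath as above):
# Decode the HTTP body into a Perl character string, then let the
# :encoding(UTF-8) layer encode it again on the way out.
my $text = $response->decoded_content;
open my $fh, '>:encoding(UTF-8)', $outputpath
    or die "Could not open file '$outputpath' $!";
print $fh $text;
close $fh;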
With use open ":std", ":encoding(UTF-8)"; all I/O via standard streams in the lexical scope of this pragma will be handled as utf8. (This can be overridden for other specific uses, say by specifying layers in the three-argument open.)
See open pragma.
As for the other question, you need to properly encode what you intend to "send to another server." How to do that depends on how you are "sending" it.
†   With PerlIO the I/O "layers" can be set so that encoding of input and output is done as needed behind the scenes, as data is read or written. The work is done by Encode. For a nice explanation of the process see Encode::PerlIO.
Also see perlunitut, perlunifaq, and perluniintro.

Are there reasons to ever use the two-argument form of open(...) in Perl?

Are there any reasons to ever use the two-argument form of open(...) in Perl rather than the three-or-more-argument versions?
The only reason I can come up with is the obvious observation that the two-argument form is shorter. But assuming that verbosity is not an issue, are there any other reasons that would make you choose the two-argument form of open(...)?
One- and two-arg open applies any default layers specified with the -C switch or the open pragma; three-arg open does not. In my opinion, this functional difference is the strongest reason to choose one or the other (and the choice will vary depending on what you are opening). Which is easiest, most descriptive, or "safest" (you can safely use two-arg open with arbitrary filenames, it's just not as convenient) takes a back seat in module code; in script code you have more discretion to choose whether you will support default layers or not.
Also, one-arg open is needed for Damian Conway's file slurp operator
$_ = "filename";
$contents = readline!open(!((*{!$_},$/)=\$_));
Imagine you are writing a utility that accepts an input file name. People with reasonable Unix experience are used to substituting - for STDIN. Perl handles that automatically only when the magical form is used, where the mode characters and file name are one string; otherwise you have to handle this and similar special cases yourself. This is a somewhat common gotcha; I am surprised no one has posted it yet. Proof:
use IO::File qw();
my $user_supplied_file_name = '-';
IO::File->new($user_supplied_file_name, 'r') or warn "IO::File/non-magical mode - $!\n";
IO::File->new("<$user_supplied_file_name") or warn "IO::File/magical mode - $!\n";
open my $fh1, '<', $user_supplied_file_name or warn "non-magical open - $!\n";
open my $fh2, "<$user_supplied_file_name" or warn "magical open - $!\n";
__DATA__
IO::File/non-magical mode - No such file or directory
non-magical open - No such file or directory
Another small difference: the two-argument form trims whitespace.
$foo = " fic";
open(MH, ">$foo");
print MH "toto\n";
Writes to a file named fic.
On the other hand
$foo = " fic";
open(MH, ">", $foo);
print MH "toto\n";
Will write to a file whose name begins with a space.
For short admin scripts with user input (or configuration file input), not having to bother with such details as trimming filenames is nice.
The two argument form of open was the only form supported by some old versions of perl.
If you're opening from a pipe, the three argument form isn't really helpful. Getting the equivalent of the three argument form involves doing a safe pipe open (open(FILE, '|-')) and then executing the program.
So for simple pipe opens (e.g. open(FILE, 'ps ax |')), the two argument syntax is much more compact.
I think William's post pretty much hits it. Otherwise, the three-argument form is going to be more clear, as well as safer.
See also:
What's the best way to open and read a file in Perl?
Why is three-argument open calls with autovivified filehandles a Perl best practice?
One reason to use the two-argument version of open is if you want to open something which might be a pipe, or a file. If you have one function
sub strange
{
    my ($file) = @_;
    open my $input, $file or die $!;
}
then you want to call this either with a filename like "file":
strange ("file");
or a pipe like "zcat file.gz |"
strange ("zcat file.gz |");
depending on the situation of the file you find, the two-argument version may be used. You will actually see the above construction in "legacy" Perl. However, the most sensible thing might be to open the filehandle appropriately yourself and pass the filehandle to the function, rather than passing the file name.
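A hedged sketch of that alternative (the names here are made up for illustration):
# The caller decides how to open things; the function only ever sees a handle.
sub process_handle {
    my ($fh) = @_;
    while (my $line = <$fh>) {
        # process $line
    }
}
open my $plain, '<', 'file' or die $!;
process_handle($plain);
open my $piped, '-|', 'zcat', 'file.gz' or die $!;
process_handle($piped);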
When you are combining strings or using a variable, it can be rather unclear whether '<' or '>' etc. is already in it. In such cases, I personally prefer readability, which means I use the longer form:
open($FILE, '>', $varfn);
When you simply use a constant, I prefer the ease of typing (and, actually, consider the short version at least as readable as the long one).
open($FILE, '>somefile.xxx');
I'm guessing you mean open(FH, '<filename.txt') as opposed to open(FH, '<', 'filename.txt') ?
I think it's just a matter of preference. I always use the former out of habit.

How do I read a file which is constantly updating?

I am getting a stream of data (text format) from an external server and would like to pass it on to a script line by line. The file is being appended to continuously. What is the ideal method to perform this operation? Will Perl's IO::Socket do? Eventually this data has to pass through a PHP program (reusable) and finally land in a MySQL database.
The question is how to open the file, which is continuously being updated.
In Perl, you can make use of seek and tell to read from a continuously growing file. It might look something like this (borrowed liberally from perldoc -f seek)
open(FH, '<', $the_file) || handle_error();  # typical open call
for (;;) {
    while (<FH>) {
        # ... process $_ and do something with it ...
    }
    # eof reached on FH, but wait a second and maybe there will be more output
    sleep 1;
    seek FH, 0, 1;  # this clears the eof flag on FH
}
In Perl there are a couple of modules that make tailing a file easier: IO::Tail and File::Tail. One uses a callback, the other uses a blocking read, so it just depends on which suits your needs better. There are likely other tailing modules as well, but these are the two that came to mind.
IO::Tail - follow the tail of files/stream
use IO::Tail;
my $tail = IO::Tail->new();
$tail->add('test.log', \&callback);
$tail->check();
$tail->loop();
File::Tail - Perl extension for reading from continuously updated files
use File::Tail;
my $file = File::Tail->new("/some/log/file");
while (defined(my $line = $file->read)) {
    print $line;
}
Perhaps a named pipe would help you?
You talk about opening a file, and ask about IO::Socket. These aren't quite the same thing, even if deep down you're going to be reading data off a file descriptor.
If you can access the remote stream from a named pipe or FIFO, then you can just open it as an ordinary file. It will block when nothing is available, and return whenever there is data that needs to be drained. You may, or may not, need to bring File::Tail to bear on the problem of not losing data if the sender runs too far ahead of you.
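A hedged sketch of the FIFO case (the path is hypothetical, and the FIFO has to exist already, e.g. created with mkfifo):
my $fifo = '/tmp/incoming.fifo';   # hypothetical; created beforehand with mkfifo
open my $fh, '<', $fifo or die "open $fifo: $!";
# Reads block until the writer on the other end produces more data.
while (my $line = <$fh>) {
    # hand $line to the next stage (PHP / MySQL)
}
close $fh;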
On the other hand, if you're opening a socket directly to the other server (which seems more likely), IO::Socket is not going to work out of the box as there is no getline method available. You would have to read and buffer block-by-block and then dole it out line by line through an intermediate holding pen.
You could pull out the socket descriptor into an IO::Handle, and use getline() on that. Something like:
my $sock = IO::Socket::INET->new(
    PeerAddr => '172.0.0.1',
    PeerPort => 1337,
    Proto    => 'tcp',
) or die $!;

my $io = IO::Handle->new;
$io->fdopen(fileno($sock), "r") or die $!;

while (defined( my $data = $io->getline() )) {
    chomp $data;
    # do something
}
You may have to perform a handshake in order to start receiving packets, but that's another matter.
In Python it is pretty straightforward:
import time

f = open('teste.txt', 'r')
for line in f:          # read all lines already in the file
    print line.strip()

# keep waiting forever for more lines.
while True:
    line = f.readline() # just read more
    if line:            # if you got something...
        print 'got data:', line.strip()
    time.sleep(1)       # wait a second to not fry the CPU needlessly
The solutions that read the whole file just to seek to the end are performance-unwise. If this is happening under Linux, I would suggest just renaming the log file. Then you can scan all the entries in the renamed file, while the original file gets filled again. After scanning the entire renamed file, delete it, or move it wherever you like. This way you get something like logrotate, but for scanning newly arriving data.
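A hedged Perl sketch of that rotate-then-scan idea (paths are hypothetical; it assumes the writer reopens the log by name, so new data starts filling a fresh file after the rename):
my $log     = '/var/log/stream.log';   # hypothetical, continuously appended to
my $rotated = $log . '.' . time;
rename $log, $rotated or die "rename failed: $!";
open my $fh, '<', $rotated or die "open $rotated: $!";
while (my $line = <$fh>) {
    # process $line (e.g. insert into MySQL)
}
close $fh;
unlink $rotated or warn "could not remove $rotated: $!";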