Faster way to read a file line by line in Perl

I have to read a big Unix file line by line using Perl. The script takes more than 2 minutes to run for a big file but much less time for a small file.
I am using the following code:
open(FILE, "filename");
while (<FILE>) {
}
Please let me know a faster way to parse the file.

What do you mean by a "big file" and a "small file"? What size are those files? How many lines do they have?
Unless your big file is absolutely huge, it seems likely that what is slowing your program down is not the reading from a file, but whatever you're doing in the while loop. To prove me wrong, you'd just need to run your program with nothing in the while loop to see how long that takes.
Assuming that I'm right, then you need to work out what section of your processing is causing the problems. Without seeing that code we, obviously, can't be any help there. But that's where a tool like Devel::NYTProf would be useful.
I'm not sure where you learned your Perl from, but the idiom you're using to open your file is rather outdated. These days we would a) use lexical variables as filehandles, b) use the 3-argument version of open() and c) always check the return value from open() and take appropriate action.
open(my $fh, '<', 'filename')
    or die "Cannot open 'filename': $!\n";

while (<$fh>) {
    ...
}
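To confirm that the loop body (and not the read itself) is the slow part, a bare timing pass like this sketch is usually enough (substitute your real filename):

use strict;
use warnings;
use Time::HiRes qw(time);

my $start = time();
open(my $fh, '<', 'filename')
    or die "Cannot open 'filename': $!\n";
while (<$fh>) {
    # intentionally empty: measures pure read speed
}
close $fh;
printf "Read-only pass took %.2f seconds\n", time() - $start;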

If you have the memory, slurp the whole file with @array = <$fh> and then loop through the array.
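A minimal sketch of that approach (assumes the file fits comfortably in RAM):

open(my $fh, '<', 'filename')
    or die "Cannot open 'filename': $!\n";
my @lines = <$fh>;    # every line read into memory at once
close $fh;

for my $line (@lines) {
    # process $line
}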

Related

How to pipe to and read from the same tempfile handle without race conditions?

Was debugging a perl script for the first time in my life and came over this:
$my_temp_file = File::Temp->tmpnam();
system("cmd $blah | cmd2 > $my_temp_file");
open(FIL, "$my_temp_file");
...
unlink $my_temp_file;
This works pretty much like I want, except the obvious race conditions in lines 1-3. Even if using proper tempfile() there is no way (I can think of) to ensure that the file streamed to at line 2 is the same opened at line 3. One solution might be pipes, but the errors during cmd might occur late because of limited pipe buffering, and that would complicate my error handling (I think).
How do I:
Write all output from cmd $blah | cmd2 into a tempfile opened file handle?
Read the output without re-opening the file (risking race condition)?
You can open a pipe to a command and read its contents directly with no intermediate file:
open my $fh, '-|', 'cmd', $blah;
while( <$fh> ) {
...
}
With short output, backticks might do the job, although in this case you have to be more careful to scrub the inputs so they aren't misinterpreted by the shell:
my $output = `cmd $blah`;
There are various modules on CPAN that handle this sort of thing, too.
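For example, Capture::Tiny will run the pipeline in a child process and hand back its output and exit status without any temporary file (a sketch; cmd, cmd2 and $blah are the placeholders from the question):

use Capture::Tiny qw(capture);

my ($stdout, $stderr, $exit) = capture {
    system("cmd $blah | cmd2");    # same shell pipeline, no temp file
};
die "pipeline failed: $stderr" if $exit != 0;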
Some comments on temporary files
The comments mentioned race conditions, so I thought I'd write a few things for those wondering what people are talking about.
In the original code, Andreas uses File::Temp, a module from the Perl Standard Library. However, they use the tmpnam POSIX-like call, which has this caveat in the docs:
Implementations of mktemp(), tmpnam(), and tempnam() are provided, but should be used with caution since they return only a filename that was valid when function was called, so cannot guarantee that the file will not exist by the time the caller opens the filename.
This is discouraged and was removed for Perl v5.22's POSIX.
That is, you get back the name of a file that does not exist yet. After you get the name, you don't know if that filename was made by another program. And, that unlink later can cause problems for one of the programs.
The "race condition" comes in when two programs that probably don't know about each other try to do the same thing as roughly the same time. Your program tries to make a temporary file named "foo", and so does some other program. They both might see at the same time that a file named "foo" does not exist, then try to create it. They both might succeed, and as they both write to it, they might interleave or overwrite the other's output. Then, one of those programs think it is done and calls unlink. Now the other program wonders what happened.
In the malicious exploit case, some bad actor knows a temporary file will show up, so it recognizes a new file and gets in there to read or write data.
But this can also happen within the same program. Two or more versions of the same program run at the same time and try to do the same thing. With randomized filenames, it is probably exceedingly rare that two running programs will choose the same name at the same time. However, we don't care how rare something is; we care how devastating the consequences are should it happen. And, rare is much more frequent than never.
File::Temp
Knowing all that, File::Temp handles the details of ensuring that you get a filehandle:
my( $fh, $name ) = File::Temp->tempfile;
This uses a default template to create the name. When the filehandle goes out of scope, File::Temp also cleans up the mess.
{
my( $fh, $name ) = File::Temp->tempfile;
print $fh ...;
...;
} # file cleaned up
Some systems might automatically clean up temp files, although I haven't had to care about that in years. Typically it was a batch job (say, once a week).
I often go one step further by giving my temporary filenames a template, where the Xs are literal characters the module recognizes and fills in with randomized characters:
my( $fh, $name ) = File::Temp->tempfile(
    sprintf "$0-%d-XXXXXX", time );
I'm often doing this while I'm developing things so I can watch the program make the files (and in which order) and see what's in them. In production I probably want to obscure the source program name ($0) and the time; I don't want to make it easier to guess who's making which file.
A scratchpad
I can also open a temporary file with open by not giving it a filename. This is useful when you want a scratch file that nothing outside the program ever needs to see. Opening it read-write means you can output some stuff then move around in that file (we show a fixed-length record example in Learning Perl):
open(my $tmp, "+>", undef) or die ...
print $tmp "Some stuff\n";
seek $tmp, 0, 0;
my $line = <$tmp>;
File::Temp opens the temp file in O_RDWR mode so all you have to do is use that one file handle for both reading and writing, even from external programs. The returned file handle is overloaded so that it stringifies to the temp file name so you can pass that to the external program. If that is dangerous for your purpose you can get the fileno() and redirect to /dev/fd/<fileno> instead.
All you have to do is mind your seeks and tells. :-) Just remember to always set autoflush!
use File::Temp;
use Data::Dump;

$fh = File::Temp->new;
$fh->autoflush;

system "ls /tmp/*.txt >> $fh" and die $!;
@lines = <$fh>;
printf "%s\n\n", Data::Dump::pp(\@lines);

print $fh "How now brown cow\n";
seek $fh, 0, 0 or die $!;
@lines2 = <$fh>;
printf "%s\n", Data::Dump::pp(\@lines2);
Which prints
[
"/tmp/cpan_htmlconvert_DPzx.txt\n",
"/tmp/cpan_htmlconvert_DunL.txt\n",
"/tmp/cpan_install_HfUe.txt\n",
"/tmp/cpan_install_XbD6.txt\n",
"/tmp/cpan_install_yzs9.txt\n",
]
[
"/tmp/cpan_htmlconvert_DPzx.txt\n",
"/tmp/cpan_htmlconvert_DunL.txt\n",
"/tmp/cpan_install_HfUe.txt\n",
"/tmp/cpan_install_XbD6.txt\n",
"/tmp/cpan_install_yzs9.txt\n",
"How now brown cow\n",
]
HTH

How does while work with a filehandle when reading a gigantic file in Perl

I have a very large file to read, so when I use while to read it line by line, the script takes progressively longer to read each line the deeper I get into the file, and the increase is exponential.
while (<$fh>)
{do something}
Does while have to parse through all the lines it has already read to get to the next unread line, or something like that?
How can I overcome such a situation?
EDIT 1:
My code:
$line = 0;
%values;
open my $fh1, '<', "file.xml" or die $!;
while (<$fh1>)
{
    $line++;
    if ($_ =~ s/foo//gi)
    {
        chomp $_;
        $values{'id'} = $_;
    }
    elsif ($_ =~ s/foo//gi)
    {
        chomp $_;
        $values{'type'} = $_;
    }
    elsif ($_ =~ s/foo//gi)
    {
        chomp $_;
        $values{'pattern'} = $_;
    }
    if (keys(%values) == 3)
    {
        open FILE, ">>temp.txt" or die $!;
        print FILE "$values{'id'}\t$values{'type'}\t$values{'pattern'}\n";
        close FILE;
        %values = ();
    }
    if ($line == ($line1 + 1000000))
    {
        $line1 = $line;
        $read_time = time();
        $processing_time = $read_time - $start_time - $processing_time;
        print "xml file parsed till line $line, time taken $processing_time sec\n";
    }
}
EDIT 2
<?xml version="1.0"?>
<!DOCTYPE Entrezgene-Set PUBLIC "-//NLM//DTD NCBI-Entrezgene, 21st January 2005//EN" "http://www.ncbi.nlm.nih.gov/data_specs/dtd/NCBI_Entrezgene.dtd">
<Entrezgene-Set>
  <Entrezgene>
    <Entrezgene_track-info>
      <Gene-track>
        <Gene-track_geneid>816394</Gene-track_geneid>
        <Gene-track_create-date>
          <Date>
            <Date_std>
              <Date-std>
                <Date-std_year>2003</Date-std_year>
                <Date-std_month>7</Date-std_month>
                <Date-std_day>30</Date-std_day>
                <Date-std_hour>19</Date-std_hour>
                <Date-std_minute>53</Date-std_minute>
                <Date-std_second>0</Date-std_second>
              </Date-std>
            </Date_std>
          </Date>
        </Gene-track_create-date>
        <Gene-track_update-date>
          <Date>
            <Date_std>
              <Date-std>
                <Date-std_year>2015</Date-std_year>
                <Date-std_month>1</Date-std_month>
                <Date-std_day>8</Date-std_day>
                <Date-std_hour>15</Date-std_hour>
                <Date-std_minute>41</Date-std_minute>
                <Date-std_second>0</Date-std_second>
              </Date-std>
            </Date_std>
          </Date>
        </Gene-track_update-date>
      </Gene-track>
    </Entrezgene_track-info>
    <Entrezgene_type value="protein-coding">6</Entrezgene_type>
    <Entrezgene_source>
      <BioSource>
        <BioSource_genome value="chromosome">21</BioSource_genome>
        <BioSource_org>
          <Org-ref>
            <Org-ref_taxname>Arabidopsis thaliana</Org-ref_taxname>
            <Org-ref_common>thale cress</Org-ref_common>
            <Org-ref_db>
              <Dbtag>
                <Dbtag_db>taxon</Dbtag_db>
This is just the gist of the original XML file; if you like, you can check the whole XML file from Here. Select any one entry and save it to a file as XML.
EDIT 3
Many pioneers have suggested that I should avoid using a substitution, but I feel it is essential in my code because, from a line in the XML file such as:
<Gene-track_geneid>816394</Gene-track_geneid>
I want to take only the id, which here is 816394 but can be any number (with any number of digits) for other entries; so how can I avoid using a substitution?
Thanks in advance
ANSWER:
First, I would like to apologize for taking so long to reply; I started Perl again from the ground up, and this time came back with use strict, which helped me maintain linear time. Using an XML parser is also a good thing to do when handling large XML files.
Thanks all for help and suggestions
Further to my comment above, you should get into the habit of using the strict and warnings pragmas at the start of every script. warnings just picks up mistakes that might not otherwise be found until runtime. strict enforces a number of good rules, including declaring all variables with my. The variable then exists only in the scope (typically the code block) it was declared in.
Try something like this and see if you get any improvement.
use strict;
use warnings;

my %values;
my $line = 0;

open my $XML,  '<',  "file.xml" or die $!;
open my $TEMP, '>>', "temp.txt" or die $!;

while (<$XML>) {
    chomp;
    $line++;
    if    (s/foo//gi) { $values{id}      = $_; }
    elsif (s/foo//gi) { $values{type}    = $_; }
    elsif (s/foo//gi) { $values{pattern} = $_; }

    if (keys(%values) == 3) {
        print $TEMP "$values{id}\t$values{type}\t$values{pattern}\n";
        undef %values;
    }
    # if ($line = ...
}
close $TEMP;
Ignore my one-line-if formatting; I did that for brevity. Format it however you like.
The main thing I've done, which I hope helps, is declare the %values hash inside the while block so it doesn't have a "global" scope, and then it's undefined at the end of each block, which, if I recall correctly, should clear the memory it was using. Also, opening and closing your output only once should cut down on a lot of unnecessary operations.
I also just cleaned up a few things. Since you are acting on the topical $_ variable, you can leave it out of operations like chomp (which now occurs only once, at the beginning of the loop) and your regex substitutions.
EDIT
It just occurred to me that you might wait multiple loop iterations before %values reaches 3 keys, in which case clearing it each time through the loop would not work, so I moved the undef back inside the if.
MORE EDIT
As has been commented below, you should look into installing and using an XML parser from CPAN. If for whatever reason you are unable to use a module, a capturing regex might work better than a replacement, e.g. my ($var) = m{^</(\w+)>}; would capture this from a closing tag like </this>.
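Applied to the geneid line from the question's EDIT 3, a capture (instead of a substitution) might look like this sketch:

# e.g. $_ is '<Gene-track_geneid>816394</Gene-track_geneid>'
if ( my ($id) = $_ =~ m{<Gene-track_geneid>(\d+)</Gene-track_geneid>} ) {
    $values{id} = $id;    # 816394, with no chomp or s/// needed
}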
There's no reason I see why that code would take exponentially more time. I don't see any memory leaks. %values will not grow. Looping over each line in a file does not depend on the file size only the line size. I even made an XML file with 4 million lines in it from your linked XML data to test it.
My thoughts are...
There's something you're not showing us (those regexes aren't real, $start_time is not initialized).
You're on a wacky filesystem, perhaps a network filesystem. (OP is on NTFS)
You're using a very old version of Perl with a bug. (OP is using Perl 5.20.1)
A poorly implemented network filesystem could slow down while reading an enormous file. It could also misbehave because of how rapidly you're opening and closing temp.txt. You could be chewing through file handles. temp.txt should be opened once, before the loop. @Joshua's improvement suggestions are good (though the concern about %values is a red herring).
As also noted, you should not be parsing XML by hand. For a file this large, use a SAX parser which works on the XML a piece at a time keeping the memory costs down, as opposed to a DOM parser which reads the whole file. There are many to choose from.
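For instance, a streaming pass with XML::Twig (just one of the available options; not strictly a SAX parser, but it processes the file a record at a time; the element names below are taken from the sample in the question) might look roughly like this:

use strict;
use warnings;
use XML::Twig;

open my $out, '>>', 'temp.txt' or die "Can't open temp.txt: $!";

my $twig = XML::Twig->new(
    twig_handlers => {
        # called once per <Entrezgene> record, which is then discarded
        'Entrezgene' => sub {
            my ($t, $gene) = @_;
            my $id   = $gene->first_descendant('Gene-track_geneid');
            my $type = $gene->first_descendant('Entrezgene_type');
            print $out join("\t",
                $id   ? $id->text           : '',
                $type ? $type->att('value') : '',
            ), "\n";
            $t->purge;    # free everything parsed so far
        },
    },
);
$twig->parsefile('file.xml');
close $out;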
while (<$fh>) {...} doesn't reread the file from the start on each iteration, no
The most likely cause of your problem is that you're keeping data in memory on each iteration, causing memory usage to grow as you work your way through the file. The slowdown comes in when physical memory is exhausted and the computer has to start paging out to virtual memory, ultimately producing a situation where you could be spending more time just moving memory pages back and forth between RAM and disk than on actual work.
If you can produce a brief, runnable test case which demonstrates your problem, I'm sure we can give more specific advice to fix it. If that's not possible, just a description of your {do something} process could give us enough to go on.
Edit after Edit 1 to question:
Looking at the code posted, I suspect that your slowdown may be caused by how you're handling your output. Closing and reopening the output file each time you add a line would definitely slow things down relative to just keeping it open and, depending on your OS/filesystem combination, it may need to seek through the entire file to find the end before appending.
Nothing else stands out to me as potentially causing performance issues, but a couple other minor points:
After your regex substitutions, $_ will never contain line ends (unless you explicitly include them in the foo patterns), so you can probably skip the chomp $_; lines.
You should open the output file the same way as you open the input file (lexical filehandle, three-argument open) instead of doing it the old way.

Backticks vs the native way of doing things in Perl

Consider these 2 snippets :
#!/usr/bin/perl
open(DATA, "<input.txt");
while (<DATA>)
{
    print($_);
}
and
$abcd = `cat input.txt`;
print $abcd;
Both will print the content of file input.txt as output
Question: Is there any standard as to which one (backticks or the native method) should be preferred over the other in a particular case, or are they always equivalent?
The reason I am asking is that I find the cat method easier than opening a file the native Perl way, so this puts me in doubt: if I can achieve something through backticks, should I go with it, or prefer the native ways of doing it?
I checked this thread too: What's the difference between Perl's backticks, system, and exec? but it went a different route than my doubt.
Use builtin functions wherever possible:
They are more portable: open works on Windows, while `cat input.txt` will not.
They have less overhead: Using backticks will fork, exec a shell which parses the command, which execs the cat program. This unnecessarily loads two programs. This is in contrast to open which is a builtin Perl function.
They make error handling easier. The open function will return a false value on error, which allows you to take different actions, e.g. like terminating the program with an error message:
open my $fh, "<", "input.txt" or die "Couldn't open input.txt: $!";
They are more flexible. For example, you can add encoding layers if your data isn't Latin-1 text:
open my $fh, "<:utf8", "input.txt" or die "Couldn't open input.txt: $!";
open my $fh, "<:raw", "input.bin" or die "Couldn't open input.bin: $!";
If you want a “just read this file into a scalar” function, look at the File::Slurp module:
use File::Slurp;
my $data = read_file "input.txt";
Using the back tick operators to call cat is highly inefficient, because:
It spawns a separate process (or maybe more than one if a shell is used) which does nothing more than read the file, which perl could do itself.
You are reading the whole file into memory instead of processing it one line at a time. OK for a small file, not so good for a large one.
The back tick method is ok for a quick and dirty script but I would not use it for anything serious.
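To make the contrast concrete, here is a sketch of the two approaches side by side (input.txt as in the question):

# Builtin open: one line in memory at a time, and errors are easy to catch
open my $fh, '<', 'input.txt' or die "Can't open input.txt: $!";
while (my $line = <$fh>) {
    print $line;
}
close $fh;

# Backticks: spawns a shell plus cat, and slurps the whole file into RAM
my $contents = `cat input.txt`;
die "cat failed (exit status $?)" if $?;
print $contents;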

How to include a "diff" in a Perl test? [duplicate]

This question already has answers here (it was closed as a duplicate 12 years ago):
Possible Duplicate: How can I use Perl to determine whether the contents of two files are identical?
If I am writing a Perl module test, and for example I want to test that an output file is exactly what is expected, if I use an external command like diff, the test might fail on some operating systems which don't provide the diff command. What would be a simple way to do something like diff on files, which doesn't rely on external commands? I understand that there are modules on CPAN which can do file diffs, but I would rather not complicate the build process unless necessary.
File::Compare, in core since 5.004.
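A minimal sketch of using it in a test (the filenames are placeholders):

use Test::More tests => 1;
use File::Compare;

# compare() returns 0 when the two files have identical contents
is( compare('got.txt', 'expected.txt'), 0, 'output file matches expected' );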
When testing and looking for differences in files or strings, I always use Test::Differences, which uses Text::Diff. I know you probably know that and would prefer a non-module solution, but looking for differences has many corner cases, so it is not trivial. I also write this answer more for googlers (just in case you already know these modules).
I like the table output of this module. It is very convenient when there are only a small number of differences.
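For example (a sketch; read the two files however you normally do):

use Test::More tests => 1;
use Test::Differences;

my $got      = do { local $/; open my $fh, '<', 'actual_results'   or die $!; <$fh> };
my $expected = do { local $/; open my $fh, '<', 'expected_results' or die $!; <$fh> };

# On failure, prints a side-by-side table of the differing lines
eq_or_diff $got, $expected, 'output matches expected results';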
Why not just read and compare the two files in perl? Something like...
sub readfile
{
    local ($/) = undef;
    open READFILE, "<", $_[0]
        or die "Can't read '$_[0]': $!";
    my $contents = <READFILE>;
    close READFILE or die "Can't close '$_[0]': $!";
    return $contents;
}

$expected = readfile("expected_results");
$actual   = readfile("actual_results");
if ($expected ne $actual) {    # string comparison, not numeric !=
    die "Got wrong results!";
}
(If you're concerned about multiple OS portability, you may also need to do something about line endings, either in your test program or here, because some OSs use CRLF instead of LF to separate lines in text files. If you want to handle it here, a regular expression replace will do the trick.)
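For instance, normalizing DOS/Windows line endings before the comparison is a one-line substitution (a sketch, applied to the slurped contents):

$contents =~ s/\r\n/\n/g;    # treat CRLF as plain LF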

How do I serve a large file for download with Perl?

I need to serve a large file (500+ MB) for download from a location that is not accessible to the web server. I found the question Serving large files with PHP, which is identical to my situation, but I'm using Perl instead of PHP.
I tried simply printing the file line by line, but this does not cause the browser to prompt for download before grabbing the entire file:
use Tie::File;

open my $fh, '<', '/path/to/file.txt';
tie my @file, 'Tie::File', $fh
    or die "Could not open file: $!";
my $size_in_bytes = -s $fh;

print "Content-type: text/plain\n";
print "Content-Length: $size_in_bytes\n";
print "Content-Disposition: attachment; filename=file.txt\n\n";

for my $line (@file) {
    print $line;
}
untie @file;
close $fh;
exit;
Does Perl have an equivalent to PHP's readfile() function (as suggested with PHP) or is there a way to accomplish what I'm trying to do here?
If you just want to slurp input to output, this should do the trick.
use Carp ();

{    # lexical scope for the filehandle and $/
    open my $fh, '<', '/path/to/file.txt' or Carp::croak("File Open Failed");
    local $/ = undef;
    print scalar <$fh>;
    close $fh or Carp::carp("File Close Failed");
}
I guess this is in response to "Does Perl have a PHP readfile equivalent?", and I guess my answer would be "it doesn't really need one".
I've used PHP's manual file IO controls and they're a pain; Perl's are just so easy to use by comparison that shelling out for a one-size-fits-all function seems like overkill.
Also, you might want to look at X-SendFile support, and basically send a header to your webserver to tell it what file to send: http://john.guen.in/past/2007/4/17/send_files_faster_with_xsendfile/ ( assuming of course it has permissions enough to access the file, but the file is just NOT normally accessible via a standard URI )
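In a plain CGI script, the X-Sendfile variant might look roughly like this (a sketch; it only works if the web server has X-Sendfile support enabled, and the path is a placeholder):

# The web server intercepts the X-Sendfile header and streams the file itself
print "X-Sendfile: /path/to/file.txt\n";
print "Content-Type: application/octet-stream\n";
print "Content-Disposition: attachment; filename=file.txt\n\n";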
Edit: Noted, it is better to do it in a loop. I tested the above code against a hard drive and it does implicitly try to store the whole thing in an invisible temporary variable and eat all your RAM.
Alternative using blocks
The following improved code reads the given file in blocks of 8192 chars, which is much more memory efficient, and gets throughput respectably comparable with my raw disk read rate. (I also pointed it at /dev/full for fits and giggles and got a healthy 500 MB/s throughput, and it didn't eat all my RAM, so that must be good.)
{
open my $fh , '<', '/dev/sda' ;
local $/ = \8192; # this tells IO to use 8192 char chunks.
print $_ while defined ( $_ = scalar <$fh> );
close $fh;
}
Applying jrockway's suggestions
{
open my $fh , '<', '/dev/sda5' ;
print $_ while ( sysread $fh, $_ , 8192 );
close $fh;
}
This literally doubles performance, ... and in some cases gets me better throughput than dd does O_o.
The readline function is called readline (and can also be written as <>).
I'm not sure what problem you're having. Perhaps it's that for loops aren't lazily evaluated (which they're not). Or perhaps Tie::File is screwing something up? Anyway, the idiomatic Perl for reading a file a line at a time is:
open my $fh, '<', $filename or die ...;
while(my $line = <$fh>){
# process $line
}
No need to use Tie::File.
Finally, you should not be handling this sort of thing yourself. This is a job for a web framework. If you were using Catalyst (or HTTP::Engine), you would just say:
open my $fh, '<', $filename ...
$c->res->body( $fh );
and the framework would automatically serve the data in the file efficiently. (Using stdio via readline is not a good idea here; it's better to read the file in blocks from the disk. But who cares, it's abstracted!)
You could use my Sys::Sendfile module. It should be highly efficient (as it uses sendfile under the hood), but it's not entirely portable (only Linux, FreeBSD and Solaris are currently supported).
When you say "this does not cause the browser to prompt for download" -- what's "the browser"?
Different browsers behave differently, and IE is particularly wilful: it will ignore headers and decide for itself what to do based on reading the first few KB of the file.
In other words, I think your problem may be at the client end, not the server end.
Try lying to "the browser" and telling it the file is of type application/octet-stream. Or why not just zip the file, especially as it's so huge.
Don't use for/foreach (<$input>) because it reads the whole file at once and then iterates over it. Use while (<$input>) instead. The sysread solution is good, but the sendfile is the best performance-wise.
Answering the (original) question ("Does Perl have an equivalent to PHP's readline() function ... ?"), the answer is "the angle bracket syntax":
open my $fh, '<', '/path/to/file.txt';
while (my $line = <$fh>) {
    print $line;
}
Getting the content-length with this method isn't necessarily easy, though, so I'd recommend staying with Tie::File.
NOTE
Using:
for my $line (<$filehandle>) { ... }
(as I originally wrote) copies the contents of the file to a list and iterates over that. Using
while (my $line = <$filehandle>) { ... }
does not. When dealing with small files the difference isn't significant, but when dealing with large files it definitely can be.
Answering the (updated) question ("Does Perl have an equivalent to PHP's readfile() function ... ?"), the answer is slurping. There are a couple of syntaxes, but Perl6::Slurp seems to be the current module of choice.
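A minimal sketch with Perl6::Slurp (the path is a placeholder):

use Perl6::Slurp;

# Whole file into one scalar, then printed; roughly what PHP's readfile does
my $contents = slurp '/path/to/file.txt';
print $contents;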
The implied question ("why doesn't the browser prompt for download before grabbing the entire file?") has absolutely nothing to do with how you're reading in the file, and everything to do with what the browser thinks is good form. I would guess that the browser sees the mime-type and decides it knows how to display plain text.
Looking more closely at the Content-Disposition problem, I remember having similar trouble with IE ignoring Content-Disposition. Unfortunately I can't remember the workaround. IE has a long history of problems here (old page, refers to IE 5.0, 5.5 and 6.0). For clarification, however, I would like to know:
What kind of link are you using to point to this big file (i.e., are you using a normal <a href="perl_script.cgi?filename.txt"> link, or are you using JavaScript of some kind)?
What system are you using to actually serve the file? For instance, does the webserver make its own connection to the other computer without a webserver, and then copy the file to the webserver and then send the file to the end user, or does the user make the connection directly to the computer without a webserver?
In the original question you wrote "this does not cause the browser to prompt for download before grabbing the entire file" and in a comment you wrote "I still don't get a download prompt for the file until the whole thing is downloaded." Does this mean that the file gets displayed in the browser (since it's just text), that after the browser has downloaded the entire file you get a "where do you want to save this file" prompt, or something else?
I have a feeling that there is a chance the HTTP headers are getting stripped out at some point or that a Cache-control header is getting added (which apparently can cause trouble).
I've successfully done it by telling the browser it was of type application/octet-stream instead of type text/plain. Apparently most browsers prefer to display text/plain inline instead of giving the user a download dialog option.
It's technically lying to the browser, but it does the job.
The most efficient way to serve a large file for download depends on the web server you use.
In addition to @Kent Fredric's X-Sendfile suggestion:
File Downloads Done Right has some links that describe how to do it for Apache, lighttpd (mod_secdownload: security via URL generation), and nginx. There are examples in PHP, Ruby (Rails), and Python which can be adapted for Perl.
Basically it boils down to:
Configure paths, and permissions for your web-server.
Generate valid headers for the redirect in your Perl app (Content-Type, Content-Disposition, Content-length?, X-Sendfile or X-Accel-Redirect, etc).
There are probably CPAN modules and web-framework plugins that do exactly that, e.g. @Leon Timmermans mentioned Sys::Sendfile in his answer.