Perl seek or read and discard results - perl

I am reading an opened filehandle as follows:
open(my $fh, "-|", "zcat test.csv.gz") or die "Cannot open test.csv.gz: $!";
read $fh, ???, 256;
print ">>", readline($fh), "<<\n";
close $fh;
Now, I want to seek to a known position, 256 bytes in this case.
I can't seek, because this is a filehandle based on another program's STDOUT. I tried that and the seek() does nothing. I can read and discard the data, but when seeking into a large result, this will waste memory loading and discarding it.
What can I put in for ??? that acts like /dev/null? Or is there another approach?

For 256 bytes, this is surely a micro-optimization. For larger seeks like 256MB, find an amount of memory you are willing to (temporarily) sacrifice and perform multiple reads.
use List::Util qw( min );
# Making $BUFFER_SIZE a multiple of 16 KiB makes sure the reads are efficient.
my $BUFFER_SIZE = 65_536;
my $n = 256_000_000;
my $buffer = "";
while ($n) {
    my $bytes_read = sysread($fh, $buffer, min($BUFFER_SIZE, $n));
    die($!) if !defined($bytes_read);
    die("Premature EOF") if !$bytes_read;
    $n -= $bytes_read;
}
# Doesn't necessarily return memory back to the OS,
# but it does make it available for Perl.
undef $buffer;

Related

skipping first x and last y lines of file

I'm doing some simple parsing on text files (which could get up into the 1 GB range). How would I go about skipping the first N rows, and more importantly, the last (different) N rows? I'm sure I could open the file, count the rows, and do something with $. < total_row_count - N, but that seems incredibly inefficient.
I'm pretty much a perl newb, btw.
A file is a sequence of bytes, without the notion of "lines." Some of those bytes are considered as "line" separators (linefeeds), which is how software gives us our "logical" lines to work with. So there is no way to know how many lines there are in a file -- without having read it and counted them, that is.
A simple and naive way is to read line-by-line and count
open my $fh, '<', $file or die "Can't open $file: $!";
my $cnt;
++$cnt while <$fh>;
with a slightly faster version using the $. variable
1 while <$fh>;
my $cnt = $.;
These take between 2.5 and 3 seconds for a 1.1 GB text file on a reasonable desktop.
We can speed this up a lot by reading in larger chunks and counting newline characters
open my $fh, '<', $file or die "Can't open $file: $!";
my $cnt;
NUM_LINES: {
    my $len = 64_000;
    my $buf;
    $cnt += $buf =~ tr/\n//
        while read $fh, $buf, $len;
    seek $fh, 0, 0;
};
This goes in barely over half a second, on same hardware and Perl versions.
I've put it in a block to scope the temporary variables, but it really belongs in a sub, where you can record where the filehandle is when you get it and return it there after counting (so you can count the "rest" of the lines from some point in the file and then continue processing). The sub should also check every read call for errors.
I'd say that half a second of overhead on a 1 GB file isn't bad at all.
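For reference, here is a minimal sketch of such a sub (the name count_lines_from_here is purely illustrative): it remembers the current position with tell, counts the remaining lines in chunks, checks every read, and seeks back before returning.
sub count_lines_from_here {
    my ($fh) = @_;
    my $pos = tell $fh;                     # remember the caller's position
    my ($cnt, $buf) = (0, '');
    while (1) {
        my $read = read $fh, $buf, 64_000;
        die "Error reading: $!" if not defined $read;
        last if !$read;                     # EOF
        $cnt += $buf =~ tr/\n//;
    }
    seek $fh, $pos, 0;                      # put the filehandle back where it was
    return $cnt;
}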
Still, you can go faster yet, at the expense of far messier code. Get the file size (metadata, so no reading involved) and seek to a position estimated to be the wanted number of lines before the end (again, no reading involved). That will most likely not hit the right spot, so read to the end to count lines and adjust, seeking back further or closer. Repeat until you reach the needed place.
open my $fh, "<", $file;
my $size = -s $file;
my $estimated_line_len = 80;
my $num_last_lines = 100;
my $pos = $size - $num_last_lines*$estimated_line_len;
seek $fh, $pos, 0;
my $cnt;
++$cnt while <$fh>;
say "There are $cnt lines from position $pos to the end";
# likely need to seek back further/closer ...
I'd guess that this should get you there in under 100 ms. Note that $pos is likely inside a line.
Then once you know the number of lines (or the position for desired number of lines before the end) do seek $fh, 0, 0 and process away. Or really have this in a sub which puts the filehandle back where it was before returning, as mentioned.
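Here is a rough sketch of that adjust-and-retry step, reusing $fh, $size, $estimated_line_len, $num_last_lines, $pos, and $cnt from above (the step size of 50 lines is arbitrary): after each seek the first line read is almost certainly partial, so it is discarded before counting.
$pos = $size - $num_last_lines * $estimated_line_len;
while (1) {
    $pos = 0 if $pos < 0;
    seek $fh, $pos, 0;
    <$fh> if $pos > 0;                      # discard the (likely partial) first line
    $cnt = 0;
    ++$cnt while <$fh>;
    last if $cnt >= $num_last_lines or $pos == 0;
    $pos -= 50 * $estimated_line_len;       # too few lines -- step further back
}
# everything after $pos (past the discarded partial line) now holds
# at least $num_last_lines complete lines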
I think you need a circular buffer to avoid reading the entire file into memory.
skip-first-last.pl
#!/usr/bin/perl
use strict;
use warnings;
my ($first, $last) = @ARGV;
my @buf;
while (<STDIN>) {
    my $mod = $. % $last;
    print $buf[$mod] if defined $buf[$mod];
    $buf[$mod] = $_ if $. > $first;
}
1;
Skip first 5 lines and last 2 lines:
$ cat -n skip-first-last.pl | ./skip-first-last.pl 5 2
6
7 my @buf;
8 while (<STDIN>) {
9 my $mod = $. % $last;
10 print $buf[$mod] if defined $buf[$mod];
11 $buf[$mod] = $_ if $. > $first;
12 }

Perl performance when reading and writing the same file

Is there any noticeable performance difference between these two ways of reading/writing a user file with Perl, on Linux?
Option 1:
open (READFILE, '<:utf8', "users/$_[0]") or die ("no read users/$_[0]");
# Do the reading
close (READFILE) or die;
# Do more stuff
open (WRITEFILE, '>:utf8', "users/$_[0]") or die ("no write users/$_[0]");
flock (WRITEFILE, 2) or die ("no lock users/$_[0]");
# Do the writing
close (WRITEFILE) or die;
Option 2:
open (USERFILE, '+<:utf8', "users/$_[0]") or die ("no open users/$_[0]");
flock (USERFILE, 2) or die ("no lock users/$_[0]");
# Do the reading
# Do more stuff
seek (USERFILE, 0, 0);
truncate (USERFILE, 0);
# Do the writing
close (USERFILE) or die ("no write users/$_[0]");
The user files are not big, typically 20-40 lines or 2-4 KB each.
And would there be other reasons for choosing option 1 or 2 (or a 3rd option)?
Here is a benchmark which you can use to test it. I suspect that getting a new file descriptor is the part that takes longest when you close and then open again.
#!/usr/bin/env perl
use warnings;
use strict;
use open qw(:encoding(utf8) :std);
use Benchmark qw<cmpthese>;
my $text = <<TEXT;
I had some longer text here, but for better readability, just
these two lines.
TEXT
cmpthese(10_000, {
    close => sub {
        open my $file, '<', "bla" or die "$!";
        my @array = <$file>;
        close $file or die;
        open $file, '>', "bla" or die "$!";
        $file->print($text);
    },
    truncate => sub {
        open my $file, '+<', "bla" or die "$!";
        my @array = <$file>;
        seek $file, 0, 0;
        truncate $file, 0;
        $file->print($text);
    },
    truncate_flock => sub {
        open my $file, '+<', "bla" or die "$!";
        flock $file, 2;
        my @array = <$file>;
        seek $file, 0, 0;
        truncate $file, 0;
        $file->print($text);
    },
});
Output on my machine:
                 Rate          close truncate_flock truncate
close          2703/s             --           -15%     -17%
truncate_flock 3175/s            17%             --      -3%
truncate       3257/s            21%             3%       --
A higher rate is better. Using close is 1.17 times slower.
But it heavily depends on how long your "more stuff" takes: in your truncate example you hold the flock for that whole time, so another program trying to access the file may be blocked for just as long.
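As an aside, the 2 passed to flock is LOCK_EX; a sketch of Option 2's open and lock using the named constants from Fcntl is a little more self-documenting:
use Fcntl qw( :flock );    # exports LOCK_SH, LOCK_EX, LOCK_UN

open my $fh, '+<:utf8', "users/$_[0]" or die "no open users/$_[0]: $!";
flock $fh, LOCK_EX or die "no lock users/$_[0]: $!";
# ... read, do more stuff, then seek/truncate and write as in Option 2 ...
close $fh or die "no write users/$_[0]: $!";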

Effective way to find checksum in perl without memory leakage

In my program I need to compute a checksum for many files. The checksum calculation is inside the find() callback.
find(sub {
    my $file = $File::Find::name;
    return if ! length($file);
    open (FILE, "$file");
    my $chksum = md5_base64(<FILE>);
    close FILE;
}, "/home/nijin");
The above code works perfectly. But if there is a file with a large size, for example 6 GB, in the path /home/nijin, it will load 6 GB into RAM and hold it continuously until the process is completed. Please note that this is a backup process and it will take more than 12 hours to complete, so I will lose 6 GB of memory for that whole time. The worst case is that the process hangs due to the large memory usage. As an option I have tried to use File::Map; the code is pasted below.
find(sub {
    my $file = $File::Find::name;
    return if ! length($file);
    map_file my $map, $file, '<';
    my $chksum = md5_base64($map);
}, "/home/nijin");
The above code also works, but I get a segmentation fault when using it. I have also tried Sys::Mmap, but with the same issue as the first one. Is there any other option to try?
I'd run the expensive calculation in a child process. This keeps the parent process at decent memory consumption. The child can eat lots of memory for large files, but once the MD5 is returned, the memory is returned to the OS:
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
use File::Find;
use Digest::MD5 qw{ md5_base64 };
my %md5;
find(sub {
    my $name = $File::Find::name;
    return unless -f;
    my $child_pid = open(my $CMD, '-|') // die "Can't fork: $!";
    if ($child_pid) {                 # Parent
        $md5{$name} = <$CMD>;
        wait;
    } else {                          # Child
        open my $IN, '<', $_ or die "$name: $!";
        print md5_base64(<$IN>);
        exit;
    }
}, shift);
print Dumper \%md5;
There's no reason to read the whole file into memory at once.
You can explicitly process it in 64k chunks by the following:
my $chksum = do {
    open my $fh, '<:raw', $file;
    my $md5 = Digest::MD5->new;
    local $/ = \65536;    # Read 64k at once
    while (<$fh>) {
        $md5->add($_);
    }
    $md5->hexdigest;
};
# Do whatever you were going to do with it here
You can also just pass the filehandle directly, although that gives you no control over how it reads the data:
my $chksum = do {
    open my $fh, '<:raw', $file;
    Digest::MD5->new->addfile($fh)->hexdigest;
};
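For completeness, a rough sketch of dropping the chunk-based digest into the original find() callback (reusing the /home/nijin path from the question and a %md5 hash like the fork-based answer above); addfile() reads the handle in small pieces, so memory stays bounded no matter how large the file is.
use strict;
use warnings;
use File::Find;
use Digest::MD5;

my %md5;
find(sub {
    return unless -f;                       # skip directories and other non-files
    my $name = $File::Find::name;
    open my $fh, '<:raw', $_ or do { warn "$name: $!"; return };
    $md5{$name} = Digest::MD5->new->addfile($fh)->b64digest;   # base64, like md5_base64()
    close $fh;
}, "/home/nijin");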

difference between -s and my implementation

Having this snippet:
my $file = "input.txt"; # let's assume that this is an ascii file
my $size1 = -s $file;
print "$size1\n";
$size2 = 0;
open F, $file;
$size2 += length($_) while (<F>);
close F;
print "$size2\n";
when can one assert that it is true that $size1 equals $size2?
If you don't specify an encoding that supports multibyte characters, it should hold. Otherwise, the result can be different:
$ cat 1.txt
žluťoučký kůň
$ perl -E 'say -s "1.txt";
open my $FH, "<:utf8", "1.txt";
my $f = do { local $/; <$FH> };
say length $f;'
20
14
You cannot, because the input layer may perform some conversion on the input, for example changing CRLF to LF, which may change the length of a line.
In addition, length $line counts the characters in $line; with a multi-byte encoding, as in the example given by @choroba, one character may occupy more than one byte.
See perlio for further details.
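For example, here is a small sketch (it writes a throwaway file, crlf.txt, with CRLF line endings) showing the :crlf layer shrinking the data as it is read, so the summed lengths no longer match -s:
# Two lines with CRLF endings: 6 bytes on disk
open my $out, '>:raw', 'crlf.txt' or die $!;
print {$out} "a\r\nb\r\n";
close $out;

my $size1 = -s 'crlf.txt';     # 6

# Read back through the :crlf layer: each "\r\n" arrives as "\n"
open my $in, '<:crlf', 'crlf.txt' or die $!;
my $size2 = 0;
$size2 += length while <$in>;
close $in;

print "$size1 vs $size2\n";    # prints "6 vs 4"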
No, as Lee Duhem says, the two numbers may be different because of Perl's end-of-line processing, or because length reports the size of the string in characters, which will throw the numbers out if there are any wide characters in the text.
However the tell function will report the exact position in bytes that you have read up to, so an equivalent to your program for which the numbers are guaranteed to match is this
use strict;
use warnings;
my $file = 'input.txt';
my $size1 = -s $file;
print "$size1\n";
open my $fh, '<', $file or die $!;
my $size2 = 0;
while (<$fh>) {
    $size2 = tell $fh;
}
close $fh;
print "$size2\n";
Please note the use of use strict and use warnings, the lexical file handle, the three-parameter form of open, and the check that it succeeded. All of these are best practice for Perl programs and should be used in everything you write.
You're simply missing binmode(F); or the :raw IO layer. These cause Perl to return the file exactly as it appears on disk. No line ending translation. No decoding of character encodings.
open(my $fh, '<:raw', $file)
    or die "open $file: $!\n";
Then your code works fine.
my $size = 0;
$size += length while <$fh>;
That's not particularly good because it could read the entire file at once for binary files. So let's read fixed-sized blocks instead.
local $/ = \(64*1024);
my $size = 0;
$size += length while <$fh>;
That's basically the same as using read, which reads 4K or 8K (in newer Perls) at a time. There are performance benefits to reading more than that at a time, and we can use sysread to do that.
my $size = 0;
while (my $bytes_read = sysread($fh, my $buf, 64*1024)) {
    $size += $bytes_read;
}
Reading the whole file is silly, though. You could just seek to the end of the file.
use Fcntl qw( SEEK_END );
my $size = sysseek($fh, 0, SEEK_END);
But then again, you might as well just use -s.
my $size = -s $fh;

How can I access a specific range of bytes in a file using Perl?

What is the most convenient way to extract a specified byte range of a file on disk into a variable?
Seek to the start of the range, then read the desired number of bytes (or use sysseek/sysread -- see nohat's comment).
open my $fh, '<', $filename or die "Can't open $filename: $!";
seek $fh, $startByte, 0;
my $numRead = read $fh, my $buffer, $endByte - $startByte;  # + 1 if the range is inclusive
do_something_with($buffer);
Sometimes I like to use File::Map, which lazily loads a file into a scalar. That turns it into string operations instead of filehandle operations:
use File::Map 'map_file';
map_file my $map, $filename;
my $range = substr( $map, $start, $length );