In my program I need to compute a checksum for many files. The checksum calculation is done inside the find callback.
find(sub {
    my $file = $File::Find::name;
    return if ! length($file);
    open (FILE, "$file");
    my $chksum = md5_base64(<FILE>);
    close FILE;
}, "/home/nijin");
The above code works perfectly. But if there is a large file in /home/nijin, for example 6 GB, it reads the whole 6 GB into RAM and the process holds that 6 GB of memory until it completes. Note that this is a backup process and it takes more than 12 hours to finish, so the 6 GB is tied up for the entire run. In the worst case the process hangs because of the memory usage. As an alternative I have tried File::Map; the code is pasted below.
find(sub {
    my $file = $File::Find::name;
    return if ! length($file);
    map_file my $map, $file, '<';
    my $chksum = md5_base64($map);
}, "/home/nijin");
The above code also works, but I get a segmentation fault when using it. I have also tried Sys::Mmap, with the same problem as the first approach. Is there any other option to try?
I'd run the expensive calculation in a child process. This keeps the parent process's memory consumption modest. The child can still eat lots of memory for large files, but once the MD5 has been returned, that memory is given back to the OS:
#!/usr/bin/perl
use warnings;
use strict;

use Data::Dumper;
use File::Find;
use Digest::MD5 qw{ md5_base64 };

my %md5;

find(sub {
    my $name = $File::Find::name;
    return unless -f;

    my $child_pid = open(my $CMD, '-|') // die "Can't fork: $!";
    if ($child_pid) {          # Parent
        $md5{$name} = <$CMD>;
        wait;
    } else {                   # Child
        open my $IN, '<', $_ or die "$name: $!";
        print md5_base64(<$IN>);
        exit;
    }
}, shift);

print Dumper \%md5;
There's no reason to read the whole file into memory at once.
You can explicitly process it in 64 KB chunks like this:
my $chksum = do {
    open my $fh, '<:raw', $file or die "Can't open $file: $!";
    my $md5 = Digest::MD5->new;
    local $/ = \65536;            # Read 64k at a time
    while (<$fh>) {
        $md5->add($_);
    }
    $md5->hexdigest;
};
# Do whatever you were going to do with it here
You can also just pass the filehandle directly, although then it is up to the module how the file gets read:
my $chksum = do {
    open my $fh, '<:raw', $file or die "Can't open $file: $!";
    Digest::MD5->new->addfile($fh)->hexdigest;
};
Related
I am reading a .gz file of around 3 GB and grepping for a pattern with a Perl program. I am able to find the pattern, but the processing takes too long. Can anyone suggest how to make it faster?
use strict;
use warnings;
use Compress::Zlib;

my $file = "test.gz";
my $gz = gzopen($file, "rb") or die "Error reading $file: $gzerrno";
while ($gz->gzreadline($_) > 0) {
    if (/pattern/) {
        print "$_----->PASS\n";
    }
}
die "Error reading $file: $gzerrno" if $gzerrno != Z_STREAM_END;
$gz->gzclose();
What does the Z_STREAM_END constant do?
I have written a script that times how long various methods take to read a gz file. I too have found that Compress::Zlib is very slow.
use strict;
use warnings;
use autodie ':all';
use Compress::Zlib;
use Time::HiRes 'time';

my $file = '/home/con/Documents/snp150.txt.gz';

# time zcat execution
my $start_zcat = Time::HiRes::time();
open my $zcat, "zcat $file |";
while (<$zcat>) {
    # print $_;
}
close $zcat;
my $end_zcat = Time::HiRes::time();

# time Compress::Zlib reading
# see http://blog-en.openalfa.com/how-to-read-and-write-compressed-files-in-perl
my $start_zlib = Time::HiRes::time();
my $gz = gzopen($file, 'r') or die "Error reading $file: $gzerrno";
while ($gz->gzreadline($_) > 0) {
    # print "$_";  # Process the line read in $_
}
$gz->gzclose();
my $end_zlib = Time::HiRes::time();

printf("zlib took %lf seconds.\n", $end_zlib - $start_zlib);
printf("zcat took %lf seconds.\n", $end_zcat - $start_zcat);
Using this script, I found that reading through zcat runs about 7x faster (!) than Compress::Zlib. This will vary from computer to computer, and from file to file, of course.
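Applied to the original grep task, a minimal sketch of the zcat-pipe approach might look like the following; it assumes a zcat binary is on the PATH and reuses the test.gz and /pattern/ placeholders from the question:

use strict;
use warnings;

my $file = "test.gz";

# Let zcat do the decompression in a separate process and read its output
# line by line. The list form of open avoids passing the name through a shell.
open my $zcat, '-|', 'zcat', $file or die "Cannot run zcat on $file: $!";
while (<$zcat>) {
    print "$_----->PASS\n" if /pattern/;
}
close $zcat or die "error closing zcat pipe for $file: $!";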
I am reading an opened filehandle as follows:
open(my $fh, "-|", "zcat test.csv.gz") or die "Cannot open test.csv.gz: $!";
read $fh, ???, 256;
print ">>", readline($fh), "<<\n";
close $fh;
Now, I want to seek to a known position, 256 bytes in this case.
I can't seek, because this filehandle comes from another program's STDOUT; I tried it, and seek() simply does nothing. I can read and discard the data, but for a large skip that wastes memory loading and discarding it.
What can I put in for ??? that acts like /dev/null? Or is there another approach?
For 256 bytes, this is surely a micro-optimization. For larger seeks like 256MB, find an amount of memory you are willing to (temporarily) sacrifice and perform multiple reads.
use List::Util qw( min );

# Making $BUFFER_SIZE a multiple of 16 KiB makes sure the reads are efficient.
my $BUFFER_SIZE = 65_536;

my $n = 256_000_000;
my $buffer = "";
while ($n) {
    my $bytes_read = sysread($fh, $buffer, min($BUFFER_SIZE, $n));
    die($!) if !defined($bytes_read);
    die("Premature EOF") if !$bytes_read;
    $n -= $bytes_read;
}

# Doesn't necessarily return memory back to the OS,
# but it does make it available for Perl.
undef $buffer;
Is there any noticeable performance difference between these two ways of reading/writing a user file with Perl, on Linux?
Option 1:
open (READFILE, '<:utf8', "users/$_[0]") or die ("no read users/$_[0]");
# Do the reading
close (READFILE) or die;
# Do more stuff
open (WRITEFILE, '>:utf8', "users/$_[0]") or die ("no write users/$_[0]");
flock (WRITEFILE, 2) or die ("no lock users/$_[0]");
# Do the writing
close (WRITEFILE) or die;
Option 2:
open (USERFILE, '+<:utf8', "users/$_[0]") or die ("no open users/$_[0]");
flock (USERFILE, 2) or die ("no lock users/$_[0]");
# Do the reading
# Do more stuff
seek (USERFILE, 0, 0);
truncate (USERFILE, 0);
# Do the writing
close (USERFILE) or die ("no write users/$_[0]");
The user files are not big, typically 20-40 lines or 2-4 KB each.
And would there be other reasons for choosing option 1 or 2 (or a 3rd option)?
Here is a benchmark you can use to test it. I suspect that obtaining a new file descriptor is the part that takes longer when you close and then open again.
#!/usr/bin/env perl
use warnings;
use strict;
use open qw(:encoding(utf8) :std);
use Benchmark qw<cmpthese>;

my $text = <<TEXT;
I had some longer text here, but for better readability, just
these two lines.
TEXT

cmpthese(10_000, {
    close => sub {
        open my $file, '<', "bla" or die "$!";
        my @array = <$file>;
        close $file or die;
        open $file, '>', "bla" or die "$!";
        $file->print($text)
    },
    truncate => sub {
        open my $file, '+<', "bla" or die "$!";
        my @array = <$file>;
        seek $file, 0, 0;
        truncate $file, 0;
        $file->print($text)
    },
    truncate_flock => sub {
        open my $file, '+<', "bla" or die "$!";
        flock $file, 2;
        my @array = <$file>;
        seek $file, 0, 0;
        truncate $file, 0;
        $file->print($text)
    },
});
Output on my machine:
                  Rate  close truncate_flock truncate
close           2703/s     --           -15%     -17%
truncate_flock  3175/s    17%             --      -3%
truncate        3257/s    21%             3%       --
A higher rate is better. Using close is 1.17 times slower.
But it heavily depends on how long your "more stuff" takes, since you keep the file flocked in the truncate variant; if another program tries to access the file, it may be held up by that lock for the whole time.
Hi 
this
is just
an example.
Let's assume the above is out.txt. I want to read out.txt and write the result back onto the same file.
<Hi >
<this>
<is just>
<an example.>
Modified out.txt.
I want to add tags at the beginning and end of some lines. Since I will be processing the file several times, I cannot keep writing it onto a different file each time.
EDIT 1
I tried using "+<" but it gives output like this:
Hi
this
is just
an example.
<Hi >
<this>
<is just>
<an example.>
out.txt
EDIT 2
Code for reference:
open(my $fh, "+<", "out.txt"); # or die "cannot open < C:\Users\daanishs\workspace\CCoverage\out.txt: $!";
while (<$fh>)
{
    $s1 = "<";
    $s2 = $_;
    $s3 = ">";
    $str = $s1 . $s2 . $s3;
    print $fh "$str";
}
The very idea of what you are trying to do is flawed. The file starts as
H i / t h i s / ...
If you were to change it in place, it would look as follows after processing the first line:
< H i > / i s / ...
Notice how you clobbered "th"? You need to make a copy of the file, modify the copy, then replace the original with the copy.
The simplest way is to make this copy in memory.
my $file;

{   # Read the file
    open(my $fh, '<', $qfn)
        or die "Can't open \"$qfn\": $!\n";
    local $/;
    $file = <$fh>;
}

# Change the file
$file =~ s/^(.*)\n/<$1>\n/mg;

{   # Save the changes
    open(my $fh, '>', $qfn)
        or die "Can't create \"$qfn\": $!\n";
    print($fh $file);
}
If you wanted to use the disk instead:
rename($qfn, "$qfn.old")
    or die "Can't rename \"$qfn\": $!\n";
open(my $fh_in, '<', "$qfn.old")
    or die "Can't open \"$qfn\": $!\n";
open(my $fh_out, '>', $qfn)
    or die "Can't create \"$qfn\": $!\n";

while (<$fh_in>) {
    chomp;
    $_ = "<$_>";
    print($fh_out "$_\n");
}

unlink("$qfn.old");
Using a trick, the above can be simplified to
local @ARGV = $qfn;
local $^I   = '';
while (<>) {
    chomp;
    $_ = "<$_>";
    print(ARGV "$_\n");
}
Or as a one-liner:
perl -i -ple'$_ = "<$_>"' file
Read the contents into memory, then prepare the required string as you write it back to the file. (A seek back to the zeroth byte, SEEK_SET, is required.)
#!/usr/bin/perl
open(INFILE, "+<in.txt");
@a = <INFILE>;
seek INFILE, 0, SEEK_SET;
foreach $i (@a)
{
    chomp $i;
    print INFILE "<" . $i . ">" . "\n";
}
If you are worried about the amount of data being read into memory, you will have to write to a temporary result file and finally copy the result file over the original file.
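A minimal sketch of that temp-file approach (the file names here are just placeholders) could look like this:

use strict;
use warnings;

my $qfn = "out.txt";
my $tmp = "$qfn.tmp";    # hypothetical temporary file name

open my $in,  '<', $qfn or die "Can't open $qfn: $!";
open my $out, '>', $tmp or die "Can't create $tmp: $!";

# Stream line by line so only one line is held in memory at a time.
while (my $line = <$in>) {
    chomp $line;
    print {$out} "<$line>\n";
}

close $in;
close $out or die "Can't write $tmp: $!";

# Replace the original with the finished result.
rename $tmp, $qfn or die "Can't rename $tmp to $qfn: $!";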
You could use Tie::File for easy random access to the lines in your file:
use Tie::File;
use strict;
use warnings;

my $filename = "out.txt";

my @array;
tie @array, 'Tie::File', $filename or die "can't tie file \"$filename\": $!";

for my $line (@array) {
    $line = "<$line>";
    # or $line =~ s/^(.*)$/<$1>/g; # -- whatever modifications you need to do
}

untie @array;
Disclaimer: Of course, this option is only viable if the file is not shared with other processes. Otherwise you could use flock to prevent shared access while you modify the file.
Disclaimer-2 (thanks to ikegami): Don't use this solution if you have to edit big files and are concerned about performance. Most of the performance loss is mitigated for small files (less than 2MB, though this is configurable using the memory arg).
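For reference, a sketch of how that memory option can be passed when tying; the 10 MB figure here is just an arbitrary example:

use Tie::File;

# Let Tie::File cache up to roughly 10 MB of the file in memory
# instead of the default of about 2 MB.
tie my @array, 'Tie::File', "out.txt", memory => 10_000_000
    or die "can't tie file: $!";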
One option is to open the file twice: Open it once read-only, read the data, close it, process it, open it again read-write (no append), write the data, and close it. This is good practice because it minimizes the time you have the file open, in case someone else needs it.
If you only want to open it once, you can use the "+<" open mode - just use a seek call between reading and writing to return to the beginning of the file. Otherwise you finish reading, are positioned at the end of the file, and start writing there, which is why you get the behavior you're seeing.
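A rough sketch of that single-open approach, assuming the whole file fits comfortably in memory (a truncate is added here so no leftover bytes would survive if the new contents were ever shorter than the old):

use strict;
use warnings;

my $qfn = "out.txt";

open my $fh, '+<', $qfn or die "Can't open $qfn: $!";

# Read everything first.
my @lines = <$fh>;

# Go back to the start and replace the old contents.
seek $fh, 0, 0;
truncate $fh, 0;

for my $line (@lines) {
    chomp $line;
    print {$fh} "<$line>\n";
}

close $fh or die "Can't write $qfn: $!";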
Need to specify
use Fcntl qw(SEEK_SET);
in order to use
seek INFILE, 0, SEEK_SET;
Thanks user1703205 for the example.
I have written two programs. One program continuously writes content to a text file; the other program reads that content at the same time. Both programs should run simultaneously. For me, the program that writes the file works correctly, but the other program does not read from the file.
I know that once the write process has completed, the data is stored on disk and another process can then read it. But I want to read and write at the same time, from different processes, on a single file. How can I do that?
Please help me.
The following code writes the content to the file:
sub generate_random_string
{
    my $length_of_randomstring = shift;    # the length of
                                           # the random string to generate

    my @chars = ('a'..'z', 'A'..'Z', '0'..'9', '_');
    my $random_string;
    foreach (1..$length_of_randomstring)
    {
        # rand @chars will generate a random
        # number between 0 and scalar @chars
        $random_string .= $chars[rand @chars];
    }
    return $random_string;
}

# Generate the random string
open (FH, ">>file.txt") or die "Can't Open";
while (1)
{
    my $random_string = &generate_random_string(20);
    sleep(1);
    # print $random_string."\n";
    print FH $random_string . "\n";
}
The following code reads the content. This is the other process:
open (FH, "<file.txt") or die "Can't Open";
print "Open the file Successfully\n\n";

while (<FH>)
{
    print "$_\n";
}
You might use an elaborate cooperation protocol such as in the following. Both ends, reader and writer, use common code in the TakeTurns module that handles fussy details such as locking and where the lock file lives. The clients need only specify what they want to do when they have exclusive access to the file.
reader
#! /usr/bin/perl

use warnings;
use strict;

use TakeTurns;

my $runs = 0;

reader "file.txt" =>
    sub {
        my ($fh) = @_;
        my @lines = <$fh>;
        print map "got: $_", @lines;
        ++$runs <= 10;
    };
writer
#! /usr/bin/perl

use warnings;
use strict;

use TakeTurns;

my $n = 10;
my @chars = ('a'..'z', 'A'..'Z', '0'..'9', '_');

writer "file.txt" =>
    sub {
        my ($fh) = @_;
        print $fh join("" => map $chars[rand @chars], 1..$n), "\n"
            or warn "$0: print: $!";
    };
The TakeTurns module is execute-around at work:
package TakeTurns;

use warnings;
use strict;

use Exporter 'import';
use Fcntl qw/ :DEFAULT :flock /;

our @EXPORT = qw/ reader writer /;

my $LOCKFILE = "/tmp/taketurns.lock";

sub _loop ($&) {
    my ($path, $action) = @_;
    while (1) {
        sysopen my $lock, $LOCKFILE, O_RDWR|O_CREAT
            or die "sysopen: $!";
        flock $lock, LOCK_EX or die "flock: $!";
        my $continue = $action->();
        close $lock or die "close: $!";
        return unless $continue;
        sleep 0;
    }
}

sub writer {
    my ($path, $w) = @_;
    _loop $path =>
        sub {
            open my $fh, ">", $path or die "open $path: $!";
            my $continue = $w->($fh);
            close $fh or die "close $path: $!";
            $continue;
        };
}

sub reader {
    my ($path, $r) = @_;
    _loop $path =>
        sub {
            open my $fh, "<", $path or die "open $path: $!";
            my $continue = $r->($fh);
            close $fh or die "close $path: $!";
            $continue;
        };
}

1;
Sample output:
got: 1Upem0iSfY
got: qAALqegWS5
got: 88RayL3XZw
got: NRB7POLdu6
got: IfqC8XeWN6
got: mgeA6sNEpY
got: 2TeiF5sDqy
got: S2ksYEkXsJ
got: zToPYkGPJ5
got: 6VXu6ut1Tq
got: ex0wYvp9Y8
Even though you went to so much trouble, there are still issues. The protocol is unreliable, so reader has no guarantee of seeing all messages that writer sends. With no writer active, reader is content to read the same message over and over.
You could add all this, but a more sensible approach would be using abstractions the operating system provides already.
For example, Unix named pipes seem to be a pretty close match to what you want, and note how simple the code is:
pread
#! /usr/bin/perl

use warnings;
use strict;

my $pipe = "/tmp/mypipe";
system "mknod $pipe p 2>/dev/null";

open my $fh, "<", $pipe or die "$0: open $pipe: $!";
while (<$fh>) {
    print "got: $_";
    sleep 0;
}
pwrite
#! /usr/bin/perl

use warnings;
use strict;

my $pipe = "/tmp/mypipe";
system "mknod $pipe p 2>/dev/null";

open my $fh, ">", $pipe or die "$0: open $pipe: $!";

my $n = 10;
my @chars = ('a'..'z', 'A'..'Z', '0'..'9', '_');
while (1) {
    print $fh join("" => map $chars[rand @chars], 1..$n), "\n"
        or warn "$0: print: $!";
}
Both ends attempt to create the pipe using mknod because they have no other method of synchronization. At least one will fail, but we don't care as long as the pipe exists.
As you can see, all the waiting machinery is handled by the system, so you do what you care about: reading and writing messages.
This works.
The writer:
use IO::File ();

sub generate_random_string {...};    # same as above

my $file_name = 'file.txt';
my $handle = IO::File->new($file_name, 'a');
die "Could not append to $file_name: $!" unless $handle;
$handle->autoflush(1);

while (1) {
    $handle->say(generate_random_string(20));
}
The reader:
use IO::File qw();

my $file_name = 'file.txt';
my $handle = IO::File->new($file_name, 'r');
die "Could not read $file_name: $!" unless $handle;

STDOUT->autoflush(1);
while (defined (my $line = $handle->getline)) {
    STDOUT->print($line);
}
Are you on Windows or *nix? On *nix you might be able to string something together by using tail to get the output as it is written to the file. On Windows you can call CreateFile() with FILE_SHARE_READ and/or FILE_SHARE_WRITE to allow others to access the file while you have it open for read/write. You may have to periodically check whether the file size has changed in order to know when to read (I'm not 100% certain here).
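A rough sketch of the tail-style approach on *nix, just to illustrate the usual trick of clearing the EOF condition with seek so the handle keeps reading as the file grows (the file name is the one from the question):

use strict;
use warnings;

open my $fh, '<', 'file.txt' or die "Can't open file.txt: $!";

while (1) {
    while (my $line = <$fh>) {
        print "got: $line";
    }
    # We hit EOF; wait a moment, then clear the EOF flag by seeking
    # to the current position so the next read picks up new data.
    sleep 1;
    seek $fh, 0, 1;    # SEEK_CUR with offset 0 resets the handle's EOF state
}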
Another option is a memory-mapped file.