I am using Digest::MD5 to compute the MD5 of a data stream: namely GZIPped files (3000 of them, to be precise) that are much too large to fit in RAM. So I'm doing this:
use Digest::MD5 qw(md5_base64);
my ($filename) = @_; # this is in a sub
my $ctx = Digest::MD5 -> new;
my $openme = $filename;                                       # usually it's a plain file
$openme = "gunzip -c '$filename' |" if $filename =~ /\.gz$/;  # it's a .gz file
open(FILE, $openme) or die "Cannot open $openme: $!";         # gunzip to STDOUT
binmode(FILE);
$ctx -> addfile(*FILE); # passing filehandle
close(FILE);
This is a success. addfile neatly slurps in the output of gunzip and gives a correct MD5.
However, I would really, really like to know the size of the slurped data (gunzipped "file" in this case).
I could add an additional
$size = 0 + `gunzip -c very/big-file.gz | wc -c`;
but that would involve reading the file twice.
Is there any way to extract the number of bytes slurped from Digest::MD5? I tried capturing the result: $result = $ctx -> addfile(*FILE); and doing Data::Dumper on both $result and $ctx, but nothing interesting emerged.
Edit: The files are often not gzipped. Added code to show what I really do.
I'd do it all in perl, without relying on an external program for the decompression:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;
use IO::Uncompress::Gunzip qw/$GunzipError/;
use Digest::MD5;
my $filename = shift or die "Missing gzip filename!\n";
my $md5 = Digest::MD5->new;
# Allow for reading both gzip format files and uncompressed files.
# This is the default behavior, but might as well be explicit about it.
my $z = IO::Uncompress::Gunzip->new($filename, Transparent => 1)
    or die "Unable to open $filename: $GunzipError\n";
my $len = 0;
while ((my $blen = $z->read(my $block)) > 0) {
    $len += $blen;
    $md5->add($block);
}
die "There was an error reading the file: $GunzipError\n" unless $z->eof;
say "Total uncompressed length: $len";
say "MD5: ", $md5->hexdigest;
If you want to use gunzip instead of the core IO::Uncompress::Gunzip module, you can do something similar using read to get a chunk of data at a time:
#!/usr/bin/perl
use warnings;
use strict;
use autodie; # So we don't have to explicitly check for i/o related errors
use feature qw/say/;
use Digest::MD5;
my $filename = shift or die "Missing gzip filename!\n";
my $md5 = Digest::MD5->new;
# Note use of lexical file handle and safer version of opening a pipe
# from a process that eliminates shell shenanigans. Also uses the :raw
# perlio layer instead of calling binmode on the handle (which has the
# same effect)
open my $z, "-|:raw", "gunzip", "-c", $filename;
# Non-compressed version
# open my $z, "<:raw", $filename;
my $len = 0;
while ((my $blen = read($z, my $block, 4096)) > 0) {
    $len += $blen;
    $md5->add($block);
}
say "Total uncompressed length: $len";
say "MD5: ", $md5->hexdigest;
You could read the contents yourself, feed it into $ctx->add($data), and keep a running count of how much data you've passed through. Whether you add all the data in a single call or across multiple calls makes no difference to the underlying algorithm. The docs include:
All these lines will have the same effect on the state of the $md5 object:
$md5->add("a"); $md5->add("b"); $md5->add("c");
$md5->add("a")->add("b")->add("c");
$md5->add("a", "b", "c");
$md5->add("abc");
which indicates that you can just do this a piece at a time.
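For instance, a small self-contained sketch (illustrative only, not from the original answer) that counts bytes while adding piecewise, and confirms the digest matches a single add:

use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Adding the data piecewise gives the same digest as adding it all at
# once, so a running byte count can be kept next to the add() calls.
my $md5  = Digest::MD5->new;
my $size = 0;
for my $chunk ('a', 'b', 'c') {
    $md5->add($chunk);
    $size += length $chunk;
}

print "piecewise digest:   ", $md5->hexdigest, "\n";
print "all-at-once digest: ", md5_hex('abc'), "\n";   # identical
print "bytes added:        $size\n";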
Related
I am reading a .gz file which is around 3 GB. I am grepping for a pattern using a Perl program. I am able to grep the pattern, but it is taking too long to process. Can anyone help me process it faster?
use strict;
use warnings;
use Compress::Zlib;

my $file = "test.gz";
my $gz = gzopen($file, "rb") or die "Error reading $file: $gzerrno";

while ($gz->gzreadline($_) > 0) {
    if (/pattern/) {
        print "$_----->PASS\n";
    }
}
die "Error reading $file: $gzerrno" if $gzerrno != Z_STREAM_END;
$gz->gzclose();
What does the Z_STREAM_END variable do?
I have written a script that times how long various methods take to read a gz file. I too have found that Compress::Zlib is very slow.
use strict;
use warnings;
use autodie ':all';
use Compress::Zlib;
use Time::HiRes 'time';
my $file = '/home/con/Documents/snp150.txt.gz';
# time zcat execution
my $start_zcat = Time::HiRes::time();
open my $zcat, "zcat $file |";
while (<$zcat>) {
    # print $_;
}
close $zcat;
my $end_zcat = Time::HiRes::time();
# time Compress::Zlib reading
my $start_zlib = Time::HiRes::time();
my $gz = gzopen($file, 'r') or die "Error reading $file: $gzerrno";
while ($gz->gzreadline($_) > 0) {    # http://blog-en.openalfa.com/how-to-read-and-write-compressed-files-in-perl
    # print "$_";    # Process the line read in $_
}
$gz->gzclose();
my $end_zlib = Time::HiRes::time();
printf("zlib took %lf seconds.\n", $end_zlib - $start_zlib);
printf("zcat took %lf seconds.\n", $end_zcat - $start_zcat);
Using this script, I found that reading through zcat runs about 7x faster (!) than Compress::Zlib. This will vary from computer to computer, and file to file, of course.
I'm trying to compute an incremental MD5 digest for all files in deep directory trees, but I'm unable to "reuse" the already calculated digests.
Here is my test-code:
#!/usr/bin/env perl
use 5.014;
use warnings;
use Digest::MD5;
use Path::Tiny;
# create some test-files in the tempdir
my @filenames = qw(a b);
my $testdir = Path::Tiny->tempdir;
$testdir->child($_)->spew($_) for @filenames; # create 2 files

dirmd5($testdir, @filenames);
exit;

sub dirmd5 {
    my ($dir, @files) = @_;
    my $dirctx = Digest::MD5->new;    # the md5 for the whole directory
    for my $fname (@files) {
        # calculate the md5 for one file
        my $filectx = Digest::MD5->new;
        my $fd = $dir->child($fname)->openr_raw;
        $filectx->addfile($fd);
        close $fd;
        say "md5 for $fname : ", $filectx->clone->hexdigest;

        # I want to somehow "add" the above file-md5 to the directory md5.
        # This does not work - even though the $filectx isn't reset (note the "clone" above):
        #$dirctx->add($filectx);

        # Adding the file as below works,
        # but it calculates the md5 again,
        # i.e. for each file the calculation is done two times:
        # once for the file alone (above)
        # and a second time for the directory.
        # Too bad in case of many and large files. ;(
        # Especially if I want to calculate the md5sum for whole directory trees.
        $fd = $dir->child($fname)->openr_raw;
        $dirctx->addfile($fd);
        close $fd;
    }
    say "md5 for dir: ", $dirctx->hexdigest;
}
The above prints:
md5 for a : 0cc175b9c0f1b6a831c399e269772661
md5 for b : 92eb5ffee6ae2fec3ad71c777531578f
md5 for dir: 187ef4436122d1cc2f40dc2b92f0eba0
which is correct, but unfortunately done in an inefficient way (see the comments).
Reading the docs, I didn't find any way to reuse an already calculated MD5, e.g. something like the $dirctx->add($filectx) above. Probably it is not possible.
Is there any way of checksumming that allows reusing already calculated checksums, so that I could calculate the checksums/digests for whole directory trees without needing to calculate the digest multiple times for each file?
Ref: trying to somewhat solve this question
No. There is nothing that relates MD5(initial data) and MD5(new data) to MD5(initial data + new data), because the position of the data in the stream matters as well as its value. Otherwise it wouldn't be a very useful error check, as aba, aab and baa would all have the same checksum.
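As a quick illustration (just a sketch, not part of the original answer), MD5 is indeed sensitive to the order of the same bytes:

use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# The same three bytes in different orders produce three different
# digests, so per-file digests cannot simply be combined afterwards.
print md5_hex('aba'), "\n";
print md5_hex('aab'), "\n";
print md5_hex('baa'), "\n";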
If the files are small enough, you could read each one into memory and use that copy to add the data to both digests. That would avoid reading twice from mass storage.
#!/usr/bin/env perl
use 5.014;
use warnings 'all';
use Digest::MD5;
use Path::Tiny;
# create some test-files in the tempdir
my @filenames = qw(a b);
my $testdir = Path::Tiny->tempdir;
$testdir->child($_)->spew($_) for @filenames; # create 2 files

dirmd5($testdir, @filenames);

sub dirmd5 {
    my ($dir, @files) = @_;
    my $dir_ctx = Digest::MD5->new;    # the md5 for the whole directory
    for my $fname (@files) {
        my $data = $dir->child($fname)->slurp_raw;

        # calculate the md5 for one file
        my $file_md5 = Digest::MD5->new->add($data)->hexdigest;
        say "md5 for $fname : $file_md5";

        $dir_ctx->add($data);
    }
    my $dir_md5 = $dir_ctx->hexdigest;
    say "md5 for dir: $dir_md5";
}
If the files are huge, then the only optimisation left is to avoid reopening the same file, and instead to rewind it back to the start before reading it a second time:
#!/usr/bin/env perl
use 5.014;
use warnings 'all';
use Digest::MD5;
use Path::Tiny;
use Fcntl ':seek';
# create some test-files in the tempdir
my @filenames = qw(a b);
my $testdir = Path::Tiny->tempdir;
$testdir->child($_)->spew($_) for @filenames; # create 2 files

dirmd5($testdir, @filenames);

sub dirmd5 {
    my ($dir, @files) = @_;
    my $dir_ctx = Digest::MD5->new;    # The digest for the whole directory
    for my $fname (@files) {
        my $fh = $dir->child($fname)->openr_raw;

        # The digest for just the current file
        my $file_md5 = Digest::MD5->new->addfile($fh)->hexdigest;
        say "md5 for $fname : $file_md5";

        seek $fh, 0, SEEK_SET;
        $dir_ctx->addfile($fh);
    }
    my $dir_md5 = $dir_ctx->hexdigest;
    say "md5 for dir: $dir_md5";
}
I'm trying to capture the output of a tail command to a temp file.
Here is a sample of my Apache access log.
Here is what I have tried so far.
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp ();
use File::Temp qw/ :seekable /;
chomp($tail = `tail access.log`);
my $tmp = File::Temp->new( UNLINK => 0, SUFFIX => '.dat' );
print $tmp "Some data\n";
print "Filename is $tmp\n";
I'm not sure how I can go about passing the output of $tail to this temporary file.
Thanks
I would use a different approach for tailing the file. Have a look at File::Tail; I think it will simplify things.
It sounds like all you need is
print $tmp $tail;
But you also need to declare $tail and you probably shouldn't chomp it, so
my $tail = `tail access.log`;
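Putting those pieces together with the File::Temp code from the question could look like this sketch (the access.log path is taken from the question; the error checks are additions):

#!/usr/bin/perl
use strict;
use warnings;
use File::Temp ();

# Capture the tail output, then write it to a temp file that is kept
# around (UNLINK => 0) so its name can be reported afterwards.
my $tail = `tail access.log`;
die "tail failed: $?\n" if $?;

my $tmp = File::Temp->new( UNLINK => 0, SUFFIX => '.dat' );
print $tmp $tail;
close $tmp or die "close: $!\n";

print "Filename is $tmp\n";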
A classic Perl approach is to open the tail command as a pipe and copy its output to the file yourself:
if (open LOGFILE, 'tail /some/log/file |' and open TAIL, '>/tmp/logtail')
{
    print TAIL $_ while <LOGFILE>;
    close TAIL and close LOGFILE;
}
There are many ways to do this, but since you are happy to use modules, you might as well use File::Tail:
use v5.12;
use warnings 'all';
use File::Tail;
my $lines_required = 10;
my $out_file = "output.txt";
open(my $out, '>', $out_file) or die "$out_file: $!\n";
my $tail = File::Tail->new("/some/log/file");
for (1 .. $lines_required) {
    print $out $tail->read;
}
close $out;
This sits and monitors the log file until it gets the 10 new lines. If you just want a copy of the last 10 lines as is, the easiest way is to use I/O redirection from the shell: tail /some/log/file > my_copy.txt
I need to write a Perl script that reads gzipped files from a text-file list of their paths, concatenates them, and outputs a new gzipped file. (I need to do this in Perl as it will be implemented in a pipeline.)
I am not sure how to accomplish the zcat and concatenation part; as the file sizes are in the GBs, I also need to take care of storage and run time.
So far I can think of it as -
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError) ;
#-------check the input file specified-------------#
$num_args = $#ARGV + 1;
if ($num_args != 1) {
print "\nUsage: name.pl Filelist.txt \n";
exit;
$file_list = $ARGV[0];
#-------------Read the file into arrray-------------#
my @fastqc_files; # Array that contains gzipped files
use File::Slurp;
my @fastqc_files = $file_list;
#-------use the zcat over the array contents
my $outputfile = "combined.txt"
open(my $combined_file, '>', $outputfile) or die "Could not open file '$outputfile' $!";
for my $fastqc_file (@fastqc_files) {
open(IN, sprintf("zcat %s |", $fastqc_file))
or die("Can't open pipe from command 'zcat $fastqc_file' : $!\n");
while (<IN>) {
while ( my $line = IN ) {
print $outputfile $line ;
}
}
close(IN);
my $Final_combied_zip = new IO::Compress::Gzip($combined_file);
or die "gzip failed: $GzipError\n";
Somehow I am not able to get it to run. Also, can anyone guide me on the correct way to output this zipped file?
Thanks!
You don't need Perl for this. You don't even need zcat/gzip, as gzipped files are catable:
cat $(cat pathfile) >resultfile
But if you really really need to try to get the extra compression by combining:
zcat $(cat pathfile)|gzip >resultfile
Adding: Also note the very first "related" link on the right, which seems to already answer this very question: How to concat two or more gzip files/streams
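If it really has to happen inside Perl, a raw byte copy of each listed .gz onto one output file does the same thing as the cat command above. A sketch (the list is assumed to contain one path per line; file names here are illustrative):

#!/usr/bin/perl
use strict;
use warnings;

# Sketch: byte-copy every listed gzip file onto one output file.
# Concatenated gzip members are themselves a valid gzip file, so this
# is the Perl equivalent of `cat $(cat pathfile) > resultfile`.
my ($list_file, $out_file) = @ARGV;
die "Usage: $0 filelist.txt combined.gz\n" unless $list_file and $out_file;

open my $list, '<', $list_file or die "open $list_file: $!\n";
open my $out, '>:raw', $out_file or die "open $out_file: $!\n";

while (my $path = <$list>) {
    chomp $path;
    next unless length $path;
    open my $in, '<:raw', $path or die "open $path: $!\n";
    while (read($in, my $buf, 64 * 1024)) {
        print {$out} $buf or die "write $out_file: $!\n";
    }
    close $in;
}
close $out or die "close $out_file: $!\n";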
Thanks for the replies - the script runs well now -
#!/usr/bin/perl
use strict;
use warnings;
use File::Slurp;
use IO::Compress::Gzip qw(gzip $GzipError);
my @data = read_file('./File_list.txt');
my $out = "./test.txt";
foreach my $data_file (@data) {
    chomp($data_file);
    system("zcat $data_file >> $out");
}
my $outzip = "./test.gz";
gzip $out => $outzip
    or die "gzip failed: $GzipError\n";
Having this snippet:
my $file = "input.txt"; # let's assume that this is an ascii file
my $size1 = -s $file;
print "$size1\n";
$size2 = 0;
open F, $file;
$size2 += length($_) while (<F>);
close F;
print "$size2\n";
When can one assert that $size1 equals $size2?
If you don't specify an encoding that supports multibyte characters, it should hold. Otherwise, the result can be different:
$ cat 1.txt
žluťoučký kůň
$ perl -E 'say -s "1.txt";
open my $FH, "<:utf8", "1.txt";
my $f = do { local $/; <$FH> };
say length $f;'
20
14
You cannot, because the input layer may perform some conversion on the input line, for example changing CRLF to "\n", which may change the length of that line.
In addition, length $line counts how many characters are in $line; in a multi-byte encoding, as in the example given by @choroba, one character may occupy more than one byte.
See perlio for further details.
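For example, a small sketch (using a made-up file name) showing the line-ending effect: a file with CRLF line endings read through the :crlf layer yields fewer characters than -s reports bytes.

use strict;
use warnings;

# Write a small file with CRLF line endings, then compare -s with the
# summed length of the lines read through the :crlf layer.
my $file = 'crlf-demo.txt';
open my $out, '>:raw', $file or die "open $file: $!";
print {$out} "one\r\ntwo\r\n";
close $out;

print -s $file, "\n";    # 10 bytes on disk

open my $in, '<:crlf', $file or die "open $file: $!";
my $chars = 0;
$chars += length while <$in>;
close $in;
print "$chars\n";        # 8 characters: each CRLF became "\n"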
No, as Lee Duhem says, the two numbers may be different because of Perl's end-of-line processing, or because length reports the size of the string in characters, which will throw the numbers out if there are any wide characters in the text.
However, the tell function will report the exact position in bytes that you have read up to, so an equivalent to your program for which the numbers are guaranteed to match is this:
use strict;
use warnings;
my $file = 'input.txt';
my $size1 = -s $file;
print "$size1\n";
open my $fh, '<', $file or die $!;
my $size2 = 0;
while (<$fh>) {
    $size2 = tell $fh;
}
close $fh;
print "$size2\n";
Please note the use of use strict and use warnings, the lexical file handle, the three-parameter form of open, and the check that it succeeded. All of these are best practice for Perl programs and should be used in everything you write.
You're simply missing binmode(F); or the :raw IO layer. These cause Perl to return the file exactly as it appears on disk. No line ending translation. No decoding of character encodings.
open(my $fh, '<:raw', $file)
    or die "open $file: $!\n";
Then your code works fine.
my $size = 0;
$size += length while <$fh>;
That's not particularly good because it could read the entire file at once for binary files. So let's read fixed-sized blocks instead.
local $/ = \(64*1024);
my $size = 0;
$size += length while <$fh>;
That's basically the same as using read, which reads 4K or 8K (in newer Perls) at a time. There are performance benefits to reading more than that at a time, and we can use sysread to do that.
my $size = 0;
while (my $bytes_read = sysread($fh, my $buf, 64*1024)) {
    $size += $bytes_read;
}
Reading the whole file is silly, though. You could just seek to the end of the file.
use Fcntl qw( SEEK_END );
my $size = sysseek($fh, 0, SEEK_END);
But then again, you might as well just use -s.
my $size = -s $fh;