I am reading a .gz file that is around 3 GB and grepping for a pattern with a Perl program. I am able to find the pattern, but it takes too long to process. Can anyone help me make this faster?
use strict;
use warnings;
use Compress::Zlib;

my $file = "test.gz";
my $gz = gzopen($file, "rb") or die "Error Reading $file: $gzerrno";

while ($gz->gzreadline($_) > 0) {
    if (/pattern/) {
        print "$_----->PASS\n";
    }
}

die "Error reading $file: $gzerrno" if $gzerrno != Z_STREAM_END;
$gz->gzclose();
Also, what does the Z_STREAM_END constant do?
I have written a script that times how long various methods take to read a .gz file, and I have also found that Compress::Zlib is very slow.
use strict;
use warnings;
use autodie ':all';
use Compress::Zlib;
use Time::HiRes 'time';

my $file = '/home/con/Documents/snp150.txt.gz';

# time zcat execution
my $start_zcat = Time::HiRes::time();
open my $zcat, "zcat $file |";
while (<$zcat>) {
    # print $_;
}
close $zcat;
my $end_zcat = Time::HiRes::time();

# time Compress::Zlib reading
# http://blog-en.openalfa.com/how-to-read-and-write-compressed-files-in-perl
my $start_zlib = Time::HiRes::time();
my $gz = gzopen($file, 'r') or die "Error reading $file: $gzerrno";
while ($gz->gzreadline($_) > 0) {
    # print "$_"; # Process the line read in $_
}
$gz->gzclose();
my $end_zlib = Time::HiRes::time();

printf("zlib took %f seconds.\n", $end_zlib - $start_zlib);
printf("zcat took %f seconds.\n", $end_zcat - $start_zcat);
Using this script, I found that reading through zcat runs about 7x faster (!) than Compress::Zlib. This will vary from computer to computer, and from file to file, of course.
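For reference, here is a minimal sketch of the original grep loop rewritten to read through a zcat pipe instead of Compress::Zlib (it assumes zcat is on your PATH and keeps the test.gz name and /pattern/ placeholder from the question):

use strict;
use warnings;

my $file = 'test.gz';

# Read through an external zcat process; the list form of open avoids the shell.
open my $zcat, '-|', 'zcat', $file
    or die "Can't run zcat on $file: $!";

while (<$zcat>) {
    print "$_----->PASS\n" if /pattern/;
}

close $zcat or die "zcat reported an error for $file (exit status $?)";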
I am using Digest::MD5 to compute the MD5 of a data stream, namely gzipped files (3000 of them, to be precise) that are much too large to fit in RAM. So I'm doing this:
use Digest::MD5 qw(md5_base64);

my ($filename) = @_;    # this is in a sub
my $ctx = Digest::MD5->new;

my $openme = $filename;                                      # usually, it's a plain file
$openme = "gunzip -c '$filename' |" if $filename =~ /\.gz$/; # it is a .gz

open(FILE, $openme);    # gunzip to STDOUT
binmode(FILE);
$ctx->addfile(*FILE);   # passing the filehandle
close(FILE);
This is a success. addfile neatly slurps in the output of gunzip and gives a correct MD5.
However, I would really, really like to know the size of the slurped data (gunzipped "file" in this case).
I could add an additional
$size = 0 + `gunzip -c very/big-file.gz | wc -c`;
but that would involve reading the file twice.
Is there any way to extract the number of bytes slurped from Digest::MD5? I tried capturing the result: $result = $ctx -> addfile(*FILE); and doing Data::Dumper on both $result and $ctx, but nothing interesting emerged.
Edit: The files are often not gzipped. Added code to show what I really do.
I'd do it all in Perl, without relying on an external program for the decompression:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;
use IO::Uncompress::Gunzip qw/$GunzipError/;
use Digest::MD5;

my $filename = shift or die "Missing gzip filename!\n";
my $md5 = Digest::MD5->new;

# Allow for reading both gzip format files and uncompressed files.
# This is the default behavior, but might as well be explicit about it.
my $z = IO::Uncompress::Gunzip->new($filename, Transparent => 1)
    or die "Unable to open $filename: $GunzipError\n";

my $len = 0;
while ((my $blen = $z->read(my $block)) > 0) {
    $len += $blen;
    $md5->add($block);
}
die "There was an error reading the file: $GunzipError\n" unless $z->eof;

say "Total uncompressed length: $len";
say "MD5: ", $md5->hexdigest;
If you want to use an external gunzip instead of the core IO::Uncompress::Gunzip module, you can do something similar, using read to grab a chunk of data at a time:
#!/usr/bin/perl
use warnings;
use strict;
use autodie; # So we don't have to explicitly check for i/o related errors
use feature qw/say/;
use Digest::MD5;

my $filename = shift or die "Missing gzip filename!\n";
my $md5 = Digest::MD5->new;

# Note use of lexical file handle and safer version of opening a pipe
# from a process that eliminates shell shenanigans. Also uses the :raw
# perlio layer instead of calling binmode on the handle (which has the
# same effect).
open my $z, "-|:raw", "gunzip", "-c", $filename;
# Non-compressed version
# open my $z, "<:raw", $filename;

my $len = 0;
while ((my $blen = read($z, my $block, 4096)) > 0) {
    $len += $blen;
    $md5->add($block);
}

say "Total uncompressed length: $len";
say "MD5: ", $md5->hexdigest;
You could read the contents yourself, feed the data into $ctx->add($data), and keep a running count of how much data you've passed through. Whether you add all the data in a single call or across multiple calls makes no difference to the underlying algorithm. The docs include:
All these lines will have the same effect on the state of the $md5 object:
$md5->add("a"); $md5->add("b"); $md5->add("c");
$md5->add("a")->add("b")->add("c");
$md5->add("a", "b", "c");
$md5->add("abc");
which indicates that you can just do this a piece at a time.
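Applied to the original gunzip pipe, a piece-at-a-time version might look like the sketch below (the md5_and_size helper, the 64k chunk size, and the surrounding variable handling are assumptions, not part of the original code):

use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: returns (md5_base64, uncompressed byte count) for a possibly gzipped file.
sub md5_and_size {
    my ($filename) = @_;

    my $openme = $filename;
    $openme = "gunzip -c '$filename' |" if $filename =~ /\.gz$/;

    open my $fh, $openme or die "Can't open $openme: $!";
    binmode $fh;

    my $ctx  = Digest::MD5->new;
    my $size = 0;

    # Read fixed-size chunks, counting bytes as they are fed to the digest.
    while ((my $n = read($fh, my $buf, 64 * 1024)) > 0) {
        $size += $n;
        $ctx->add($buf);
    }
    close $fh;

    return ($ctx->b64digest, $size);    # b64digest matches md5_base64's encoding
}

my ($md5, $bytes) = md5_and_size($ARGV[0]);
print "md5: $md5, uncompressed size: $bytes bytes\n";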
I'm trying to capture the output of a tail command on my Apache access log to a temp file.
Here is what I have tried so far.
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp ();
use File::Temp qw/ :seekable /;
chomp($tail = `tail access.log`);
my $tmp = File::Temp->new( UNLINK => 0, SUFFIX => '.dat' );
print $tmp "Some data\n";
print "Filename is $tmp\n";
I'm not sure how I can go about passing the output of $tail to this temporary file.
Thanks
I would use a different approach for tailing the file. Have a look at File::Tail; I think it will simplify things.
It sounds like all you need is
print $tmp $tail;
But you also need to declare $tail and you probably shouldn't chomp it, so
my $tail = `tail access.log`;
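Putting those pieces together with the question's File::Temp setup, a minimal working version might look like this (a sketch; access.log and the File::Temp options are taken from the question):

#!/usr/bin/perl
use strict;
use warnings;
use File::Temp ();

my $tail = `tail access.log`;    # no chomp, so the trailing newline is kept

my $tmp = File::Temp->new( UNLINK => 0, SUFFIX => '.dat' );
print $tmp $tail;
close $tmp or die "Could not close temp file: $!";

print "Filename is $tmp\n";      # a File::Temp object stringifies to its filename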
A classic Perl approach is to use plain named filehandles:
if (open LOGFILE, 'tail /some/log/file |' and open TAIL, '>/tmp/logtail')
{
    print TAIL $_ while <LOGFILE>;
    close TAIL and close LOGFILE;
}
There are many ways to do this, but since you are happy to use modules, you might as well use File::Tail:
use v5.12;
use warnings 'all';
use File::Tail;
my $lines_required = 10;
my $out_file = "output.txt";
open(my $out, '>', $out_file) or die "$out_file: $!\n";
my $tail = File::Tail->new("/some/log/file");
for (1 .. $lines_required) {
    print $out $tail->read;
}
close $out;
This sits and monitors the log file until it gets 10 new lines. If you just want a copy of the last 10 lines as they are now, the easiest way is to use I/O redirection from the shell: tail /some/log/file > my_copy.txt
I need to write a Perl script that reads gzipped files from a text file listing their paths, concatenates them, and outputs a new gzipped file. (I need to do this in Perl, as it will be implemented in a pipeline.)
I am not sure how to accomplish the zcat and concatenation part. Since the file sizes are in the GB range, I also need to take care of storage and run time.
So far I have come up with this:
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);

#-------check the input file specified-------------#
$num_args = $#ARGV + 1;
if ($num_args != 1) {
    print "\nUsage: name.pl Filelist.txt \n";
    exit;

$file_list = $ARGV[0];

#-------------Read the file into array-------------#
my @fastqc_files;    # Array that contains the gzipped files
use File::Slurp;
my @fastqc_files = $file_list;

#-------use the zcat over the array contents
my $outputfile = "combined.txt"

open(my $combined_file, '>', $outputfile) or die "Could not open file '$outputfile' $!";

for my $fastqc_file (@fastqc_files) {
    open(IN, sprintf("zcat %s |", $fastqc_file))
        or die("Can't open pipe from command 'zcat $fastqc_file' : $!\n");
    while (<IN>) {
        while ( my $line = IN ) {
            print $outputfile $line;
        }
    }
    close(IN);

my $Final_combied_zip = new IO::Compress::Gzip($combined_file);
    or die "gzip failed: $GzipError\n";
Somehow I am not able to get it to run. Also, can anyone advise on the correct way to output the final zipped file?
Thanks!
You don't need Perl for this. You don't even need zcat/gzip, as gzipped files are catable:
cat $(cat pathfile) >resultfile
But if you really really need to try to get the extra compression by combining:
zcat $(cat pathfile)|gzip >resultfile
Adding: Also note that this question appears to have been answered before: How to concat two or more gzip files/streams
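Since the question needs this to run from inside a Perl pipeline, here is a hedged sketch of the same byte-level concatenation driven from Perl (the Filelist.txt and combined.gz names are assumptions):

#!/usr/bin/perl
use strict;
use warnings;

my $list_file = shift // 'Filelist.txt';   # one .gz path per line
my $out_file  = 'combined.gz';

open my $list, '<', $list_file or die "Can't read $list_file: $!";
chomp(my @gz_files = <$list>);
close $list;

# gzip members can simply be concatenated; no decompression is needed.
open my $out, '>:raw', $out_file or die "Can't write $out_file: $!";
for my $gz (@gz_files) {
    open my $in, '<:raw', $gz or die "Can't read $gz: $!";
    while (my $n = sysread($in, my $buf, 64 * 1024)) {
        print {$out} $buf;
    }
    close $in;
}
close $out or die "Can't close $out_file: $!";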
Thanks for the replies. The script runs well now:
#!/usr/bin/perl
use strict;
use warnings;
use File::Slurp;
use IO::Compress::Gzip qw(gzip $GzipError);

my @data = read_file('./File_list.txt');
my $out = "./test.txt";

foreach my $data_file (@data)
{
    chomp($data_file);
    system("zcat $data_file >> $out");
}

my $outzip = "./test.gz";
gzip $out => $outzip;
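If the intermediate uncompressed test.txt is a storage concern, one possible variation (a sketch only, not part of the original pipeline) streams each zcat directly into the IO::Compress::Gzip handle instead:

#!/usr/bin/perl
use strict;
use warnings;
use File::Slurp;
use IO::Compress::Gzip qw(gzip $GzipError);

my @data   = read_file('./File_list.txt');
my $outzip = './test.gz';

my $z = IO::Compress::Gzip->new($outzip)
    or die "gzip failed: $GzipError\n";

foreach my $data_file (@data) {
    chomp $data_file;
    open my $in, '-|', 'zcat', $data_file
        or die "Can't run zcat on $data_file: $!";
    while (my $line = <$in>) {
        $z->print($line);    # recompress on the fly; no intermediate file
    }
    close $in;
}
$z->close();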
In my program I need to compute a checksum for many files. The checksum calculation is done inside the find callback:
find(sub {
    my $file = $File::Find::name;
    return if !length($file);
    open(FILE, "$file");
    my $chksum = md5_base64(<FILE>);
    close FILE;
}, "/home/nijin");
The above code works perfectly. But if there is a large file in the path /home/nijin, for example one of 6 GB, it loads 6 GB into RAM and holds that memory continuously until the process completes. Please note that this is a backup process and it will take more than 12 hours to complete, so I lose 6 GB for that whole time. The worst case is that the process hangs due to the high memory usage. As an alternative I have tried to use File::Map; the code is pasted below.
find(sub {
    my $file = $File::Find::name;
    return if !length($file);
    map_file my $map, $file, '<';
    my $chksum = md5_base64($map);
}, "/home/nijin");
This code also works, but I get a segmentation fault while using it. I have also tried Sys::Mmap, but I have the same issue as with the first approach. Is there any other option to try?
I'd run the expensive calculation in a child process. This keeps the parent process at decent memory consumption. The child can eat lots of memory for large files, but once the MD5 is returned, the memory is returned to the OS:
#!/usr/bin/perl
use warnings;
use strict;

use Data::Dumper;
use File::Find;
use Digest::MD5 qw{ md5_base64 };

my %md5;

find(sub {
    my $name = $File::Find::name;
    return unless -f;

    my $child_pid = open(my $CMD, '-|') // die "Can't fork: $!";
    if ($child_pid) {          # Parent
        $md5{$name} = <$CMD>;
        wait;
    } else {                   # Child
        open my $IN, '<', $_ or die "$name: $!";
        print md5_base64(<$IN>);
        exit;
    }
}, shift);

print Dumper \%md5;
There's no reason to read the whole file into memory at once.
You can explicitly process it in 64k chunks by the following:
my $chksum = do {
    open my $fh, '<:raw', $file;
    my $md5 = Digest::MD5->new;
    local $/ = \65536;    # Read 64k at once
    while (<$fh>) {
        $md5->add($_);
    }
    $md5->hexdigest;
};
# Do whatever you were going to do with it here
You can also just pass the filehandle directly, although that does not guarantee how it will process it:
my $chksum = do {
    open my $fh, '<:raw', $file;
    Digest::MD5->new->addfile($fh)->hexdigest;
};
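For completeness, here is a sketch of the chunked approach dropped into the question's find() callback (the %checksum hash used to collect results is an assumption, not part of the original code):

use strict;
use warnings;
use File::Find;
use Digest::MD5;

my %checksum;

find(sub {
    my $name = $File::Find::name;
    return unless -f;    # skip directories and other non-files

    open my $fh, '<:raw', $_ or do { warn "Can't open $name: $!"; return };

    my $md5 = Digest::MD5->new;
    local $/ = \65536;    # read the file in 64k chunks
    while (my $chunk = <$fh>) {
        $md5->add($chunk);
    }
    close $fh;

    $checksum{$name} = $md5->b64digest;    # same encoding as md5_base64
}, '/home/nijin');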
I need to alter my routine so that the final output file is gzipped. I'm trying to figure out the best way to gzip a file that is produced inside a Perl subroutine.
For example, I have a subroutine, extract_data, that creates the file.
Here's the main loop and subroutine:
foreach my $tblist (@tblist)
{
    chomp $tblist;
    extract_data($dbh, $tblist);
}
$dbh->disconnect;

sub extract_data
{
    my ($dbh, $tblist) = @_;
    my $final_file = "/home/proc/$node-$tblist.dat";

    open (my $out_fh, '>', $final_file) or die "cannot create $final_file: $!";

    my $sth = $dbh->prepare("...");
    $sth->execute();
    while (my ($uid, $hostnm, $col1, $col2, $col3, $upd, $col5) = $sth->fetchrow_array()) {
        print $out_fh "__my_key__^A$uid^Ehost^A$hostnm^Ecol1^A$col1^Ecol2^A$col2^Ecol3^A$col3^Ecol4^A$upd^Ecol5^A$col5^D";
    }
    $sth->finish;
    close $out_fh or die "Failed to close file: $!";
}
Do I do the gzip in the main loop or in the sub? What is the best way to do so?
The new file would then be $final_file = /home/proc/$node-$tblist.dat.gz
Thanks.
I know there are modules to do this without using external programs, but since I understand how to use gzip a lot better than I understand how to use those modules, I just open a process to gzip and call it a day.
open (my $gzip_fh, "| /bin/gzip -c > $final_file.gz") or die "error starting gzip $!";
...
while (... = $sth->fetchrow_array()) {
print $gzip_fh "__my_key__^A$uid^Ehost^A$hostname..."; # uncompressed data
}
...
close $gzip_fh;
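Inside the question's extract_data(), that pipe-to-gzip approach would look roughly like this (a sketch only; it keeps the question's $node variable and placeholder SQL, and only the open and the final filename change):

sub extract_data
{
    my ($dbh, $tblist) = @_;
    my $final_file = "/home/proc/$node-$tblist.dat.gz";

    # Write through an external gzip process instead of a plain file.
    open(my $out_fh, "| /bin/gzip -c > $final_file")
        or die "cannot start gzip for $final_file: $!";

    my $sth = $dbh->prepare("...");
    $sth->execute();
    while (my ($uid, $hostnm, $col1, $col2, $col3, $upd, $col5) = $sth->fetchrow_array()) {
        print $out_fh "__my_key__^A$uid^Ehost^A$hostnm^Ecol1^A$col1^Ecol2^A$col2^Ecol3^A$col3^Ecol4^A$upd^Ecol5^A$col5^D";
    }
    $sth->finish;
    close $out_fh or die "Failed to close gzip pipe: $!";
}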
You can use IO::Compress::Gzip, which is in the set of core Perl modules:
use IO::Compress::Gzip qw(gzip $GzipError) ;
my $z = IO::Compress::Gzip->new($fileName)
    or die "gzip failed: $GzipError\n";
# object interface
$z->print($string);
$z->printf($format, $string);
$z->write($string);
$z->close();
# IO::File mode
print($z $string);
printf($z $format, $string);
close($z);
More details are in the module's documentation (perldoc IO::Compress::Gzip).
FWIW, there's also IO::Uncompress::Gunzip for reading from gzipped files in a similar fashion.
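For reading, the pattern is symmetric; a minimal sketch (the file.gz name is just an example) would be:

use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

my $z = IO::Uncompress::Gunzip->new('file.gz')
    or die "gunzip failed: $GunzipError\n";

# Read decompressed lines one at a time.
while (my $line = $z->getline()) {
    print $line;
}
$z->close();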