zcat to read gzip files and then concatenate them in Perl - perl

I need to write a Perl script that reads a list of gzipped file paths from a text file, concatenates the files, and writes the result to a new gzipped file. (It has to be Perl, as it will be run as part of a pipeline.)
I am not sure how to handle the zcat and concatenation part; the files are several GB each, so I need to keep storage use and run time in check as well.
So far this is what I have -
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);

#-------check the input file specified-------------#
$num_args = $#ARGV + 1;
if ($num_args != 1) {
    print "\nUsage: name.pl Filelist.txt \n";
    exit;
}
$file_list = $ARGV[0];

#-------------Read the file into array-------------#
my @fastqc_files; #Array that contains gzipped files
use File::Slurp;
my @fastqc_files = $file_list;

#-------use the zcat over the array contents
my $outputfile = "combined.txt";
open(my $combined_file, '>', $outputfile) or die "Could not open file '$outputfile' $!";

for my $fastqc_file (@fastqc_files) {
    open(IN, sprintf("zcat %s |", $fastqc_file))
        or die("Can't open pipe from command 'zcat $fastqc_file' : $!\n");
    while (<IN>) {
        while ( my $line = IN ) {
            print $outputfile $line;
        }
    }
    close(IN);
}

my $Final_combied_zip = new IO::Compress::Gzip($combined_file)
    or die "gzip failed: $GzipError\n";
Somehow I am not able to get it to run. Also, could anyone advise on the correct way to write out the final gzipped file?
Thanks!

You don't need perl for this. You don't even need zcat/gzip as gzipped files are catable:
cat $(cat pathfile) >resultfile
But if you really really need to try to get the extra compression by combining:
zcat $(cat pathfile)|gzip >resultfile
Added: also note the related question How to concat two or more gzip files/streams, which seems to already answer this very question.
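If it really has to be a Perl script, here is a minimal sketch of the same plain-concatenation idea: gzip members can be appended to each other byte for byte, so nothing needs to be decompressed and memory use stays at one small buffer. (Taking the list file and the output name from @ARGV is my choice here, not part of the original.)

#!/usr/bin/perl
use strict;
use warnings;

my ($file_list, $out_gz) = @ARGV;
die "Usage: $0 Filelist.txt combined.gz\n" unless defined $out_gz;

open my $list, '<', $file_list or die "Cannot read $file_list: $!";
open my $out,  '>:raw', $out_gz or die "Cannot write $out_gz: $!";

while ( my $gz = <$list> ) {
    chomp $gz;
    next unless length $gz;
    open my $in, '<:raw', $gz or die "Cannot read $gz: $!";
    # Copy in 1 MB chunks so multi-GB inputs never sit in memory
    while ( read($in, my $buf, 1024 * 1024) ) {
        print {$out} $buf;
    }
    close $in;
}
close $out;
close $list;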

Thanks for the replies - the script runs well now -
#!/usr/bin/perl
use strict;
use warnings;
use File::Slurp;
use IO::Compress::Gzip qw(gzip $GzipError);

my @data = read_file('./File_list.txt');
my $out = "./test.txt";

foreach my $data_file (@data)
{
    chomp($data_file);
    system("zcat $data_file >> $out");
}

my $outzip = "./test.gz";
gzip $out => $outzip;
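One note on storage: ./test.txt holds all of the data uncompressed before the final gzip step. A hedged variant that streams each listed file straight into a single compressed output, so no uncompressed intermediate ever hits disk (IO::Uncompress::Gunzip ships with Perl alongside IO::Compress::Gzip; the rest is a sketch, not the original script):

#!/usr/bin/perl
use strict;
use warnings;
use File::Slurp;
use IO::Compress::Gzip     qw(gzip   $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

my @data = read_file('./File_list.txt');

# One gzip output stream for everything
my $z = IO::Compress::Gzip->new('./test.gz')
    or die "gzip failed: $GzipError\n";

foreach my $data_file (@data) {
    chomp $data_file;
    my $gz = IO::Uncompress::Gunzip->new($data_file)
        or die "gunzip failed on $data_file: $GunzipError\n";
    # Copy a block at a time; only one buffer is ever held in memory
    while ( (my $len = $gz->read(my $buffer)) > 0 ) {
        $z->print($buffer);
    }
    $gz->close;
}
$z->close;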

Related

Split large csv file into multiple files based on column(s)

I would like to know of a fast/efficient way in any program (awk/perl/python) to split a csv file (say 10k columns) into multiple small files each containing 2 columns. I would be doing this on a unix machine.
#contents of large_file.csv
1,2,3,4,5,6,7,8
a,b,c,d,e,f,g,h
q,w,e,r,t,y,u,i
a,s,d,f,g,h,j,k
z,x,c,v,b,n,m,z
I now want multiple files like this:
# contents of 1.csv
1,2
a,b
q,w
a,s
z,x
# contents of 2.csv
1,3
a,c
q,e
a,d
z,c
# contents of 3.csv
1,4
a,d
q,r
a,f
z,v
and so on...
I can do this currently with awk on small files (say 30 columns) like this:
awk -F, 'BEGIN{OFS=",";} {for (i=1; i < NF; i++) print $1, $(i+1) > i ".csv"}' large_file.csv
The above takes a very long time with large files and I was wondering if there is a faster and more efficient way of doing the same.
Thanks in advance.
The main hold up here is in writing so many files.
Here is one way
use warnings;
use strict;
use feature 'say';
my $file = shift // die "Usage: $0 csv-file\n";
my @lines = do { local @ARGV = $file; <> };
chomp @lines;

my @fhs = map {
    open my $fh, '>', "f${_}.csv" or die $!;
    $fh
}
1 .. scalar( split /,/, $lines[0] );

for (@lines) {
    my ($first, @cols) = split /,/;
    say {$fhs[$_]} join(',', $first, $cols[$_])
        for 0..$#cols;
}
I didn't time this against any other approaches. Assembling data for each file first and then dumping it in one operation into each file may help, but first let us know how large the original CSV file is.
Opening so many output files at once (for @fhs filehandles) may pose problems. If that is the case then the simplest way is to first assemble all data and then open and write a file at a time
use warnings;
use strict;
use feature 'say';
my $file = shift // die "Usage: $0 csv-file\n";
open my $fh, '<', $file or die "Can't open $file: $!";
my @data;
while (<$fh>) {
    chomp;
    my ($first, @cols) = split /,/;
    push @{$data[$_]}, join(',', $first, $cols[$_])
        for 0..$#cols;
}

for my $i (0..$#data) {
    open my $fh, '>', $i+1 . '.csv' or die $!;
    say $fh $_ for @{$data[$i]};
}
This depends on whether the entire original CSV file, plus a bit more, can be held in memory.
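If neither fits — too many columns to keep every filehandle open, and too much data to assemble in memory — a compromise (not from the answers above, just a sketch) is to make several passes over the file and handle a fixed batch of columns per pass. It assumes every row has the same number of fields and uses a made-up batch size of 200:

use warnings;
use strict;
use feature 'say';

my $file  = shift // die "Usage: $0 csv-file\n";
my $batch = 200;    # assumed per-pass column count; keep it below the open-file limit

# Count the data columns from the header line (fields 2..N each pair with field 1)
open my $hdr, '<', $file or die "Can't open $file: $!";
my $header = <$hdr>;
close $hdr;
my $ncols = () = $header =~ /,/g;    # number of commas == number of data columns

for (my $start = 0; $start < $ncols; $start += $batch) {
    my $end = $start + $batch - 1;
    $end = $ncols - 1 if $end > $ncols - 1;

    # Open only this batch of output files
    my @fhs = map {
        open my $fh, '>', ($_ + 1) . '.csv' or die $!;
        $fh;
    } $start .. $end;

    # Re-read the input once per batch
    open my $in, '<', $file or die "Can't open $file: $!";
    while (<$in>) {
        chomp;
        my ($first, @cols) = split /,/;
        say { $fhs[$_ - $start] } join(',', $first, $cols[$_]) for $start .. $end;
    }
    close $in;
}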
With your shown samples and attempts, please try the following awk code. Since your version opens all the output files at once, it may fail with the infamous "too many open files" error. To avoid that, this collects all values into an array, prints them one by one in the END block, and closes each output file as soon as its contents have been written.
awk '
BEGIN{ FS=OFS="," }
{
  for(i=1;i<NF;i++){
    value[i]=(value[i]?value[i] ORS:"") ($1 OFS $(i+1))
  }
}
END{
  for(i=1;i<NF;i++){
    outFile=i".csv"
    print value[i] > (outFile)
    close(outFile)
  }
}
' large_file.csv
I needed the same functionality and wrote it in bash.
Not sure if it will be faster than ravindersingh13's answer, but I hope it will help someone.
Current version: https://github.com/pgrabarczyk/csv-file-splitter
#!/usr/bin/env bash
set -eu

SOURCE_CSV_PATH="${1}"
LINES_PER_FILE="${2}"
DEST_PREFIX_NAME="${3}"
DEBUG="${4:-0}"

split_files() {
  local source_csv_path="${1}"
  local lines_per_file="${2}"
  local dest_prefix_name="${3}"
  local debug="${4}"

  _print_log "source_csv_path: ${source_csv_path}"
  local dest_prefix_path="$(pwd)/output/${dest_prefix_name}"
  _print_log "dest_prefix_path: ${dest_prefix_path}"

  local headline=$(awk "NR==1" "${source_csv_path}")
  local file_no=0

  mkdir -p "$(dirname ${dest_prefix_path})"

  local lines_in_files=$(wc -l "${source_csv_path}" | awk '{print $1}')
  local files_to_create=$(((lines_in_files-1)/lines_per_file))
  _print_log "There are ${lines_in_files} lines in the file. I will create ${files_to_create} files of ${lines_per_file} lines each (the last file may have fewer)."
  _print_log "Start processing."

  for (( start_line=1; start_line<=lines_in_files; )); do
    last_line=$((start_line+lines_per_file))
    file_no=$((file_no+1))
    local file_path="${dest_prefix_path}$(printf "%06d" ${file_no}).csv"

    if [ $debug -eq 1 ]; then
      _print_log "Creating file ${file_path} with lines [${start_line};${last_line}]"
    fi

    echo "${headline}" > "${file_path}"
    awk "NR>${start_line} && NR<=${last_line}" "${source_csv_path}" >> "${file_path}"

    start_line=$last_line
  done
  _print_log "Done."
}

_print_log() {
  local log_message="${1}"
  local date_time=$(date "+%Y-%m-%d %H:%M:%S.%3N")
  printf "%s - %s\n" "${date_time}" "${log_message}" >&2
}

split_files "${SOURCE_CSV_PATH}" "${LINES_PER_FILE}" "${DEST_PREFIX_NAME}" "${DEBUG}"
Execution:
bash csv-file-splitter.sh "sample.csv" 3 "result_" 1
Tried a solution using the module Text::CSV.
#! /usr/bin/env perl
use warnings;
use strict;
use utf8;
use open qw<:std :encoding(utf-8)>;
use autodie;
use feature qw<say>;
use Text::CSV;
my %hsh = ();
my $csv = Text::CSV->new({ sep_char => ',' });
print "Enter filename: ";
chomp(my $filename = <STDIN>);
open (my $ifile, '<', $filename);
while (<$ifile>) {
chomp;
if ($csv->parse($_)) {
my #fields = $csv->fields();
my $first = shift #fields;
while (my ($i, $v) = each #fields) {
push #{$hsh{($i + 1).".csv"}}, "$first,$v";
}
} else {
die "Line could not be parsed: $_\n";
}
}
close($ifile);
while (my ($k, $v) = each %hsh) {
open(my $ifile, '>', $k);
say {$ifile} $_ for #$v;
close($ifile);
}
exit(0);

Perl: How do I get "bytes read" from md5::digest addfile()?

I am using Digest::MD5 to compute the MD5 of a data stream; namely gzipped files (3000 of them, to be precise) that are much too large to fit in RAM. So I'm doing this:
use Digest::MD5 qw(md5_base64);

my ($filename) = @_;    # this is in a sub
my $ctx = Digest::MD5->new;

$openme = $filename;    # Usually, it's a plain file
$openme = "gunzip -c '$filename' |" if ($filename =~ /\.gz$/);    # is gz

open (FILE, $openme);   # gunzip to STDOUT
binmode(FILE);
$ctx->addfile(*FILE);   # passing filehandle
close(FILE);
This is a success. addfile neatly slurps in the output of gunzip and gives a correct MD5.
However, I would really, really like to know the size of the slurped data (gunzipped "file" in this case).
I could add an additional
$size = 0 + `gunzip -c very/big-file.gz | wc -c`;
but that would involve reading the file twice.
Is there any way to extract the number of bytes slurped from Digest::MD5? I tried capturing the result: $result = $ctx -> addfile(*FILE); and doing Data::Dumper on both $result and $ctx, but nothing interesting emerged.
Edit: The files are often not gzipped. Added code to show what I really do.
I'd do it all in perl, without relying on an external program for the decompression:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;
use IO::Uncompress::Gunzip qw/$GunzipError/;
use Digest::MD5;
my $filename = shift or die "Missing gzip filename!\n";
my $md5 = Digest::MD5->new;
# Allow for reading both gzip format files and uncompressed files.
# This is the default behavior, but might as well be explicit about it.
my $z = IO::Uncompress::Gunzip->new($filename, Transparent => 1)
or die "Unable to open $filename: $GunzipError\n";
my $len = 0;
while ((my $blen = $z->read(my $block)) > 0) {
    $len += $blen;
    $md5->add($block);
}
die "There was an error reading the file: $GunzipError\n" unless $z->eof;
say "Total uncompressed length: $len";
say "MD5: ", $md5->hexdigest;
If you want to use gunzip instead of the core IO::Uncompress::Gunzip module, you can do something similar, though, using read to get a chunk of data at a time:
#!/usr/bin/perl
use warnings;
use strict;
use autodie; # So we don't have to explicitly check for i/o related errors
use feature qw/say/;
use Digest::MD5;
my $filename = shift or die "Missing gzip filename!\n";
my $md5 = Digest::MD5->new;
# Note use of lexical file handle and safer version of opening a pipe
# from a process that eliminates shell shenanigans. Also uses the :raw
# perlio layer instead of calling binmode on the handle (which has the
# same effect)
open my $z, "-|:raw", "gunzip", "-c", $filename;
# Non-compressed version
# open my $z, "<:raw", $filename;
my $len = 0;
while ((my $blen = read($z, my $block, 4096)) > 0) {
    $len += $blen;
    $md5->add($block);
}
say "Total uncompressed length: $len";
say "MD5: ", $md5->hexdigest;
You could read the contents yourself, and feed it in to $ctx->add($data), and keep a running count of how much data you've passed through. Whether you add all the data in a single call, or across multiple calls, doesn't make any difference to the underlying algorithm. The docs include:
All these lines will have the same effect on the state of the $md5 object:
$md5->add("a"); $md5->add("b"); $md5->add("c");
$md5->add("a")->add("b")->add("c");
$md5->add("a", "b", "c");
$md5->add("abc");
which indicates that you can just do this a piece at a time.
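A minimal sketch of that manual approach, keeping the question's pipe-open trick for the .gz case (the sub name, chunk size, and lexical handle are arbitrary choices, not from the original):

use Digest::MD5;

sub md5_and_length {
    my ($filename) = @_;

    my $openme = $filename;
    $openme = "gunzip -c '$filename' |" if $filename =~ /\.gz$/;

    open my $fh, $openme or die "Can't open $openme: $!";
    binmode $fh;

    my $ctx   = Digest::MD5->new;
    my $bytes = 0;
    while ( my $n = read($fh, my $chunk, 65536) ) {
        $bytes += $n;          # running count of everything fed to add()
        $ctx->add($chunk);
    }
    close $fh;

    return ($ctx->b64digest, $bytes);    # same digest addfile() would have produced
}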

File::Temp pass system command output to temp file

I'm trying to capture the output of a tail command to a temp file.
here is a sample of my apache access log
Here is what I have tried so far.
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp ();
use File::Temp qw/ :seekable /;
chomp($tail = `tail access.log`);
my $tmp = File::Temp->new( UNLINK => 0, SUFFIX => '.dat' );
print $tmp "Some data\n";
print "Filename is $tmp\n";
I'm not sure how I can go about passing the output of $tail to this temporary file.
Thanks
I would use a different approach for tailing the file. Have a look at File::Tail; I think it will simplify things.
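A hedged sketch of what that might look like, combined with the question's File::Temp handle (File::Tail is a CPAN module; the count of 10 lines is an assumption):

use strict;
use warnings;
use File::Tail;
use File::Temp ();

my $tmp  = File::Temp->new( UNLINK => 0, SUFFIX => '.dat' );
my $tail = File::Tail->new( name => 'access.log' );

# read() blocks until a new line arrives, so this collects the next 10 lines
for ( 1 .. 10 ) {
    print {$tmp} $tail->read;
}
print "Filename is $tmp\n";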
It sounds like all you need is
print $tmp $tail;
But you also need to declare $tail and you probably shouldn't chomp it, so
my $tail = `tail access.log`;
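Putting those two fixes into the question's script, a minimal working version might look like this (everything else unchanged):

#!/usr/bin/perl
use strict;
use warnings;
use File::Temp ();

my $tail = `tail access.log`;

my $tmp = File::Temp->new( UNLINK => 0, SUFFIX => '.dat' );
print $tmp $tail;
print "Filename is $tmp\n";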
A classic Perl approach is to use named filehandles for both ends, reading from the pipe and writing to the output file:
if(open LOGFILE, 'tail /some/log/file |' and open TAIL, '>/tmp/logtail')
{
    print TAIL $_ while <LOGFILE>;
    close TAIL and close LOGFILE;
}
There are many ways to do this, but since you are happy to use modules, you might as well use File::Tail:
use v5.12;
use warnings 'all';
use File::Tail;
my $lines_required = 10;
my $out_file = "output.txt";
open(my $out, '>', $out_file) or die "$out_file: $!\n";
my $tail = File::Tail->new("/some/log/file");
for (1 .. $lines_required) {
    print $out $tail->read;
}
close $out;
This sits and monitors the log file until it gets the 10 new lines. If you just want a copy of the last 10 lines as is, the easiest way is to use I/O redirection from the shell: tail /some/log/file > my_copy.txt
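Since a File::Temp object stringifies to its file name, the shell-redirection route can also target the question's temp file directly; a small sketch (assumes the $tmp handle from the question is in scope):

# $tmp from File::Temp->new(...) interpolates as the temp file's path
system("tail access.log > $tmp") == 0
    or die "tail failed: $?";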

how to open directory and read the files inside that directory using perl

I am trying to unzip files, count the matching characters in them, and after that I need to concatenate the files based on their names. I have successfully achieved the first two steps, but I am stuck on the third objective. This is the script I am using.
#! use/bin/perl
use strict;
use warnings;
print"Enter file name for Unzip\n";
print"File name: ";
chomp(my $Filename=<>);
system("gunzip -r ./$Filename\*\n");
print"Enter match characters";
chomp(my $match=<>);
system("grep -c '$match' ./$Filename/* > $Filename/output");
open $fh,"/home/final_stage/test_(copy)";
if(my $file="sra_*_*_*_R1")
{
print $file;
}
system("mkdir $Filename/R1\n");
system("mkdir $Filename/R2\n");
Based on "sra____R1" file name matching i have to concatenate and put the out in R1 folder and "sra____R2" file name R2 folder.
Help me to complete this work, all suggestions are welcome !!!!!
#!/usr/bin/perl
use strict;
use warnings;
use Path::Class;
use autodie; # die if problem reading or writing a file
my $dir = dir("/tmp"); # /tmp
my $file = $dir->file("file.txt");# Read in the entire contents of a file
my $content = $file->slurp();#
openr() returns an IO::File object to read from
my $file_handle = $file->openr(); # Read in line at a timewhile
( my $line = $file_handle->getline() )
{
print $line;
}
Enjoy your day !!!!
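For the third objective the question actually asks about — concatenating files into R1 and R2 folders based on the name — here is a rough sketch. It assumes the unzipped files sit directly under the directory entered at the prompt and follow the sra_*_*_*_R1 / sra_*_*_*_R2 pattern shown in the question; the combined file names are placeholders.

#!/usr/bin/perl
use strict;
use warnings;

print "File name: ";
chomp( my $Filename = <> );

for my $group ('R1', 'R2') {
    mkdir "$Filename/$group" unless -d "$Filename/$group";

    # Every file whose name ends in _R1 (or _R2), in name order
    my @files = sort glob "$Filename/sra_*_*_*_$group";

    open my $out, '>', "$Filename/$group/combined_$group.txt"
        or die "Cannot write combined file: $!";
    for my $f (@files) {
        open my $in, '<', $f or die "Cannot read $f: $!";
        print {$out} $_ while <$in>;    # append this file's contents
        close $in;
    }
    close $out;
}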

Adding same text to all files of the directory

I have one folder. There are 32 files and 3 directories in that folder. I want to add some lines of text at the top of each file. How can I do that?
Use File::Find to find the files. Use Tie::File and unshift to add lines to the top of the file.
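A minimal sketch of that combination (the directory name, the text to add, and skipping anything that is not a plain file are assumptions here):

#!/usr/bin/env perl
use strict;
use warnings;
use File::Find;
use Tie::File;

my $dir  = shift // '.';
my @text = ('first added line', 'second added line');

find(
    sub {
        return unless -f $_;    # plain files only; skip the directories
        tie my @lines, 'Tie::File', $_ or die "Cannot tie $_: $!";
        unshift @lines, @text;  # put the new lines at the top
        untie @lines;
    },
    $dir
);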
TLP has already given you some hints on how to solve the problem, but there is always more than one way to do it. Instead of File::Find and Tie::File I would use some more "modern" modules. In this full example I use Path::Class::Rule, whose iterative interface I prefer to a recursive one.
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
use open ':encoding(UTF-8)';
use open ':std';
use Path::Class;
use Path::Class::Rule;
my $rule = Path::Class::Rule->new->file;
my $iter = $rule->iter(dir('test'));
while ( my $file = $iter->() ) {
    print $file->stringify, "\n";
    add_line_to_file($file, "Sid was here.\n");
}

# 1: Path::Class::File object
# 2: The line to add at the top
sub add_line_to_file {
    my ( $file, $line ) = @_;

    # Slurp the current contents, then rewrite the file with the new line first
    my $content = $file->slurp;

    my $fh = $file->open('>') or die "Cannot open $file: $!\n";
    $fh->print($line, $content);
    $fh->close;

    return;
}
This could also work from the shell (-0777 slurps each file whole, so the text is added once at the top rather than before every line):
perl -0777 -pi -e 's/^/my text\n/' * */*
Please try this on a copy of your directory first to make sure it does what you want.