Zipping files by group size in a directory - perl

I have two directories whose files I have to zip up. My issue is that each zip file can only be 5 MB, and these files are big, so I somehow have to group the files by size before zipping them. I would like to do this using Perl's Archive::Zip module. I have some code, but I was wondering if someone here would know a way to do this. Here is the code:
#!/perl/bin/perl -w
use strict;
use warnings;
use Archive::Zip qw/AZ_OK/;
use File::Temp qw/tempfile/;
use constant MB => 1024 * 1024;
#my @dir = '/dir1 dir2/';
my $dir = 'dir1/';
my @files = do {
    opendir my $fd, $dir or die $!;
    grep -f, map "$dir$_", readdir $fd;
};
my $zip = Archive::Zip->new;
my $total;
my $limit = 5*MB;
foreach my $file (@files) {
    my $temp = Archive::Zip->new;
    my $member = $temp->addFile($file);
    next unless $member->compressedSize;
    my $fh = tempfile();
    $temp->writeToFileHandle($fh) == AZ_OK or die $!;
    $zip->addMember($member);
    $total += $member->compressedSize;
    die "$total bytes exceeds archive size limit" if $total > $limit;
}
print "Total archive size: $total bytes\n\n";
$zip->writeToFileNamed('zipped.zip') == AZ_OK or die $!;
Thanks!

IMHO you are solving the wrong problem. If you need to zip the files into 5 MB chunks, why not zip them all together and then split the archive into 5 MB chunks?
You can do the splitting internally in Perl (a good example is here), or, for a less portable solution, use the system command split (available on Unix/Linux; there's a DOS port as well).
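For illustration, a rough sketch of the zip-then-split approach, reusing the dir1/ directory and zipped.zip name from the question (the .000/.001 chunk-naming scheme is my own assumption, not anything from the original post):
#!/usr/bin/env perl
use strict;
use warnings;
use Archive::Zip qw(AZ_OK);
use constant CHUNK => 5 * 1024 * 1024;    # 5 MB per piece

my $dir = 'dir1/';
opendir my $dh, $dir or die "Cannot open $dir: $!";
my @files = grep { -f } map { "$dir$_" } readdir $dh;
closedir $dh;

# Zip everything into a single archive first...
my $zip = Archive::Zip->new;
$zip->addFile($_) for @files;
$zip->writeToFileNamed('zipped.zip') == AZ_OK or die "write error";

# ...then split the archive into 5 MB pieces (zipped.zip.000, .001, ...).
open my $in, '<:raw', 'zipped.zip' or die $!;
my $part = 0;
while (read($in, my $buf, CHUNK)) {
    my $name = sprintf 'zipped.zip.%03d', $part++;
    open my $out, '>:raw', $name or die "Cannot write $name: $!";
    print {$out} $buf;
    close $out;
}
close $in;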

Related

Recovering a specific line in multiple .txt in a directory using Perl

I have the output of a program that ran some searches and gave me 2000+ .txt files. I just need a specific line from each file; this is what I have been trying in Perl:
opendir(DIR, $dirname) or die "Could not open $dirname\n";
while ($filename = readdir(DIR)) {
    print "$filename\n";
    open ($filename, '<', $filename) or die("Could not open file.");
    my $line;
    while( <$filename> ) {
        if( $. == $27 ) {
            print "$line\n";
            last;
        }
    }
}
closedir(DIR);
But there is a problem with the $filename in line 5 and I don't know an alternative to it so I don't have to manually name each file.
Several issues with that code:
Using an old-school bareword identifier for the directory handle instead of an autovivified lexical variable like you use for the file handle.
Using the same variable for the filename and file handle is pretty strange.
You don't check to see if the file is a directory or something else other than a plain file before trying to open it.
$27?
You never assign anything to that $line variable before printing it.
Unless $dirname is your program's current working directory, you're running into an issue mentioned in the readdir documentation:
If you're planning to filetest the return values out of a readdir, you'd better prepend the directory in question. Otherwise, because we didn't chdir there, it would have been testing the wrong file.
(Substitute open for filetest)
Always use strict; and use warnings;.
Personally, if you just want to print the 27th line of a large number of files, I'd turn to awk and find (using find's -exec ... + form to avoid potential errors from hitting the maximum command-line length):
find directory/ -maxdepth 1 -type f -exec awk 'FNR == 27 { print FILENAME; print }' \{\} \+
If you're on a Windows system without standard Unix tools like those installed, or it's part of a bigger program, a fixed-up Perl way:
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use feature qw/say/;
use File::Spec;
my $directory = shift;
opendir(my $dh, $directory);
while (my $filename = readdir $dh) {
    my $fullname = File::Spec->catfile($directory, $filename); # Construct a full path to the file
    next unless -f $fullname; # Only look at regular files
    open my $fh, "<", $fullname;
    while (my $line = <$fh>) {
        if ($. == 27) {
            say $fullname;
            print $line;
            last;
        }
    }
    close $fh;
}
closedir $dh;
You might also consider using glob to get the filenames instead of opendir/readdir/closedir.
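A minimal sketch of that glob variant, keeping the same command-line interface as the script above (glob returns paths with the directory already prepended, so the File::Spec->catfile step goes away):
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw/say/;

my $directory = shift;

for my $fullname (glob "$directory/*") {
    next unless -f $fullname;    # only regular files
    open my $fh, '<', $fullname or die "Cannot open $fullname: $!";
    while (my $line = <$fh>) {
        if ($. == 27) {
            say $fullname;
            print $line;
            last;
        }
    }
    close $fh;
}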
And if you have Path::Tiny available, a simpler version is:
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use feature qw/say/;
use Path::Tiny;
my $directory = shift;
my $dir = path $directory;
for my $file ($dir->children) {
    next unless -f $file;
    my @lines = $file->lines({count => 27});
    if (@lines == 27) {
        say $file;
        print $lines[-1];
    }
}

How to put a certain number of files in one 'master' file

I want to take all my log files from /var/log and cat them into a master log file, then zip that master file. How exactly would I do that?
I have cat in my code because that's what I know how to do in bash. How would I do it in Perl?
#!/usr/bin/perl
use strict;
use warnings;
use IO::Compress::Zip qw(zip $ZipError);
# cat /var/log/*log > /home/glork/masterlog.log
my @files = </var/log/*.log>;
zip \@files => 'glork.zip'
    or die "zip failed: $ZipError\n";
@files = </var/log/*.log>;
if (@files) {
    unlink @files or warn "Problem unlinking @files: $!";
    print "The job is done\n";
} else {
    warn "No files to unlink!\n";
}
As noted in the comments, there are several less involved ways to do this. If you really need to roll your own, Archive::Zip will do whatever you tell it to.
#!/usr/bin/env perl
use strict;
use warnings;
use Archive::Zip ':ERROR_CODES';
use File::Temp;
use Carp;

# don't remove "temp" files when filehandle is closed
$File::Temp::KEEP_ALL = 1;

# make a temp directory if not already present
my $dir = './tmp';
if (not -d $dir) {
    croak "failed to create directory [$dir]: $!" if not mkdir($dir);
}

my $zip = Archive::Zip->new();

# generate some fake log files to zip up
for my $idx (1 .. 10) {
    my $tmp = File::Temp->new(DIR => $dir, SUFFIX => '.log');
    my $fn  = $tmp->filename();
    print $tmp $fn, "\n";
}

# combine the logs into one big one
my $combined = "$dir/combined.log";
open my $out, '>', $combined or die "couldn't write [$combined]: $!";
for my $fn (<$dir/*.log>) {
    next if $fn eq $combined;    # don't read the combined log into itself
    open my $in, '<', $fn or die "couldn't read [$fn]: $!";
    # copy the file line by line so we don't use tons of memory for big files
    print {$out} $_ for <$in>;
}
close $out;

$zip->addFile({ filename => $combined, compressionLevel => 9 });

# write out the zip file we made
my $rc = $zip->writeToFileNamed('tmp.zip');
if ($rc != AZ_OK) {
    croak "failed to write zip file: $rc";
}
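As for the "less involved ways" mentioned above: IO::Compress::Zip's zip function also accepts a plain file name as input, so the concatenate-then-zip job from the question can be done with it alone. A small sketch under that assumption, with hypothetical file names:
#!/usr/bin/env perl
use strict;
use warnings;
use IO::Compress::Zip qw(zip $ZipError);

# Concatenate the logs line by line into one master file...
my @logs = </var/log/*.log>;
open my $master, '>', 'masterlog.log' or die "Cannot write masterlog.log: $!";
for my $log (@logs) {
    open my $in, '<', $log or die "Cannot read $log: $!";
    print {$master} $_ while <$in>;
}
close $master;

# ...then zip just that one file.
zip 'masterlog.log' => 'glork.zip'
    or die "zip failed: $ZipError\n";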

Optimize Perl script to deal with large amount of data

Here is my script:
#!/usr/bin/perl -w
use warnings;
use strict;
no warnings 'uninitialized';
`rm /slot/ems12093/oracle/working/marchfound.txt`;
`touch /slot/ems12093/oracle/working/marchfound.txt`;
`rm /slot/ems12093/oracle/working/newcontact.txt`;
`touch /slot/ems12093/oracle/working/newcontact.txt`;
my ( $filename, $handle, @contact_list, $file_list, $k, @file_list2, $i, $e, $m, $fh, $f, $g,
    $file1, $data, $file_location, $arrSize, $namefile );
$file_location = '/slot/ems12093/oracle/working/marchfound.txt';
$filename      = '/slot/ems12093/oracle/working/contact.txt';
open( $handle, '<', $filename ) or die $!;
@contact_list = <$handle>;
close $handle;
chomp @contact_list;
chdir('/scratch/mount_point/dnbfiles/oracle_cr/');
$file_list = qx(ls|grep -i "2016_03_Mar_EA");
chomp($file_list);
$k = "/scratch/mount_point/dnbfiles/oracle_cr/2016_03_Mar_EA";
chdir($k);
@file_list2 = qx(ls|grep -i contact|grep -i full|grep -Ev "Glb");
chomp @file_list2;
foreach $file1 (@file_list2) {
    foreach $i (@contact_list) {
        $e = "zgrep $i $file1";
        $f = qx($e);
        if ($f) {
            print "working\n";
            $g = "$f, $file1";
            open $data, '>>', $file_location or die $!;
            print $data "$g\n";
            close $data;
            @contact_list = grep { !/$i/ } @contact_list;
            $arrSize = @contact_list;
            print "$arrSize\n";
        }
    }
}
$m = "/slot/ems12093/oracle/working/";
chdir($m);
chomp @contact_list;
$namefile = '/slot/ems12093/oracle/working/newcontact.txt';
open( $fh, '<', $namefile ) or die $!;
@contact_list = <$fh>;
close $fh;
print "done\n";
Here I take an input file, contact.txt, which has 370k records (e.g. mail addresses), and check whether those records are present in the March month's zipped database 2016_03_Mar_EA.
The database in turn contains approx 1.6 million records (name, designation, mail, etc.), so it's going to take a LOT of time to check and print all 355k * 1.6M combinations.
Please suggest if there is any way I can improve my script to get a faster result.
Not purely speed-specific, but you should make the modifications below.
1) contact.txt has 370k records, so you should not slurp the whole file at once. Instead of doing
@contact_list = <$handle>;
read the data line by line:
while (<$handle>) {
    # process one contact at a time
}
2) You are changing directories and executing shell commands to get the desired files. It would be better to use File::Find::Rule. It's easier to use, e.g.:
my @files = File::Find::Rule->file()->name( '*.pm' )->in( @INC );
The way you are doing this, I'd bet most of the time is spent in uncompressing the database dump (which will happen 370k times). Uncompress it once, before doing the matches. (That assumes you have enough disk.)
If you are not matching actual regexps, fgrep will save some (marginal) time (though I suspect this optimization is done internally by grep anyway).
The advice on not slurping files is good for saving memory, and should not affect speed much for a single scan through the data. However, you are unnecessarily scanning the array multiple times in order to get rid of duplicate contacts:
@contact_list = grep { !/$i/ } @contact_list;
Not only does that slow the whole shebang down, it also wastes memory, since @contact_list is copied in memory each time.
You can read line by line, keep track in a hash, and skip the loop body on duplicates:
next if exists $seen{$i};
$seen{$i}++;
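Putting those pieces together, a rough sketch of the restructured flow might look like this. The file names are placeholders, and it assumes the dump has been uncompressed once beforehand and that each contact appears as a whole comma-separated field in the dump:
#!/usr/bin/env perl
use strict;
use warnings;

# Build a hash of the contacts we are looking for: O(1) lookups
# instead of grep-ing a 370k-element array on every hit.
my %wanted;
open my $contacts, '<', 'contact.txt' or die $!;
while (my $c = <$contacts>) {
    chomp $c;
    $wanted{$c} = 1;
}
close $contacts;

# Single pass over the (already uncompressed) database dump.
open my $db,    '<',  'dump.txt'       or die $!;
open my $found, '>>', 'marchfound.txt' or die $!;
while (my $row = <$db>) {
    chomp $row;
    for my $field (split /,/, $row) {
        if (delete $wanted{$field}) {    # found: record it, stop looking for it
            print {$found} "$row\n";
            last;
        }
    }
}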

In Perl, how can I filter all log files in a directory and extract interesting lines?

I'm trying to select only the .log files in my directory and then search in those files for the word "unbound" and print the entire line into a new output file with the same name as the log file (number###.log) but with a .txt extension. This is what I have so far:
#!/usr/bin/perl
use strict;
use warnings;
my $path = $ARGV[0];
my $outpath = $ARGV[1];
my @files;
my $files;
opendir(DIR, $path) or die "$!";
@files = grep { /\.log$/ } readdir(DIR);
my @out;
my $out;
opendir(OUT, $outpath) or die "$!";
my $line;
foreach $files (@files) {
    open (FILE, "$files");
    my @line = <FILE>;
    my $regex = Unbound;
    open (OUT, ">>$out");
    print grep { $line =~ /$regex/ } <>;
}
close OUT;
close FILE;
closedir(DIR);
closedir (OUT);
I'm a beginner, and I don't really know how to create a new text file with the acquired output.
A few things I'd suggest to improve this code:
declare your loop iterators within the loop: foreach my $file ( @files ) {
use 3-arg open: open ( my $input_fh, "<", $filename );
use glob rather than opendir then grep: foreach my $file ( <$path/*.log> ) {
grep is good for extracting things into arrays. Your grep reads the whole file just to print it, which isn't necessary. It doesn't matter much if the file is short, though.
perltidy is great for reformatting code.
you're opening OUT as a directory handle on $outpath (I think?), which isn't going to work, and then reusing OUT as an output file handle on $out, which is never assigned.
$outpath is a directory; you need a separate output file per input file, and opendir doesn't give you anything you can print to.
because you're using opendir, you're getting bare file names, not full paths, so you might be in the wrong place to actually open the files. Prepending the directory name or doing a chdir are possible solutions. That's one of the reasons I like glob: it returns a path as well.
So with that in mind - how about:
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;
# Extract paths
my $input_path  = $ARGV[0];
my $output_path = $ARGV[1];

# Error if paths are invalid.
unless ( defined $input_path
    and -d $input_path
    and defined $output_path
    and -d $output_path )
{
    die "Usage: $0 <input_path> <output_path>\n";
}

foreach my $filename (<$input_path/*.log>) {
    # extract the 'name' bit of the filename.
    # be slightly careful with this - it's based
    # on an assumption which isn't always true.
    # File::Spec is a more powerful way of accomplishing this,
    # but this should grab 'number####' from /path/to/file/number####.log
    my $output_file = basename( $filename, '.log' );

    # open input and output filehandles.
    open( my $input_fh,  "<", $filename ) or die $!;
    open( my $output_fh, ">", "$output_path/$output_file.txt" ) or die $!;
    print "Processing $filename -> $output_path/$output_file.txt\n";

    # iterate input, extracting into $line
    while ( my $line = <$input_fh> ) {
        # check if $line matches your RE.
        if ( $line =~ m/Unbound/ ) {
            # write it to output.
            print {$output_fh} $line;
        }
    }

    # tidy up our filehandles. Although technically, they'll
    # close automatically because they leave scope.
    close($output_fh);
    close($input_fh);
}
Here is a script that takes advantage of Path::Tiny. Now, at this stage of your learning process, you are probably better off understanding @Sobrique's solution, but using modules such as Path::Tiny or Path::Class will make it easier to write these one-off scripts quickly and correctly.
Also, I didn't really test this script, so watch out for bugs.
#!/usr/bin/env perl
use strict;
use warnings;
use Path::Tiny;
run(\@ARGV);

sub run {
    my $argv = shift;
    unless (@$argv == 2) {
        die "Need source and destination paths\n";
    }

    my $it = path($argv->[0])->realpath->iterator({
        recurse         => 0,
        follow_symlinks => 0,
    });

    my $outdir = path($argv->[1])->realpath;

    while (my $path = $it->()) {
        next unless -f $path;
        next unless $path =~ /[.]log\z/;

        my $logfh = $path->openr;
        my $outfile = $outdir->child($path->basename('.log') . '.txt');
        my $outfh;

        while (my $line = <$logfh>) {
            next unless $line =~ /Unbound/;
            unless ($outfh) {
                $outfh = $outfile->openw;
            }
            print $outfh $line;
        }

        # only close if we actually opened an output file
        if ($outfh) {
            close $outfh
                or die "Cannot close output '$outfile': $!";
        }
    }
}
Notes
realpath will croak if the path provided does not exist.
Similarly for openr and openw.
I am reading input files line-by-line to keep the memory footprint of the program independent of the sizes of input files.
I do not open the output file until I know I have a match to print to.
When matching a file extension using a regular expression pattern, keep in mind that \n is a valid character in Unix file names, and the $ anchor will match it.
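A quick illustration of that last point (the file name is contrived, but legal on Unix):
my $name = "evil.log\n";             # "\n" is a valid character in a Unix file name
print "dollar matched\n" if $name =~ /[.]log$/;   # prints: $ matches before a trailing newline
print "z matched\n"      if $name =~ /[.]log\z/;  # never prints: \z is the true end of string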

How can I add a prefix to all filenames under a directory?

I am trying to prefix a string (reference_) to the names of all the *.bmp files in all the directories as well as sub-directories. The first time we run the Silk script, it creates the directories and subdirectories, and under each subdirectory it stores each mobile application's screenshot with a .bmp extension.
When I run the automated Silk script a second time, it will again create *.bmp files in all the subdirectories. Before running the script the second time, I want to prefix every *.bmp with the string reference_.
For example, first_screen.bmp becomes reference_first_screen.bmp.
I have the directory structure below:
C:\Image_Repository\BG_Images\second
...
C:\Image_Repository\BG_Images\sixth
each having first_screen.bmp files, etc.
Could anyone help me out?
How can I prefix all the image file names with the string reference_?
When I run the script the second time, the Perl script in Silk will take both images from the sub-directory and compare them pixel by pixel. I am trying with the code below.
Could you please guide me on how to complete this task?
#!/usr/bin/perl -w
&one;
&two;

sub one {
    use Cwd;
    my $dir = "C:\\Image_Repository";
    #print "$dir\n";
    opendir(DIR, "+<$dir") or "die $!\n";
    my @dir = readdir DIR;
    #$lines=@dir;
    delete $dir[-1];
    print "$lines\n";
    foreach my $item (@dir) {
        print "$item\n";
    }
    closedir DIR;
}

sub two {
    use Cwd;
    my $dir1 = "C:\\Image_Repository\\BG_Images";
    #print "$dir1\n";
    opendir(D, "+<$dir1") or "die $!\n";
    my @dire = readdir D;
    #$lines=@dire;
    delete $dire[-1];
    #print "$lines\n";
    foreach my $item (@dire) {
        #print "$item\n";
        $dir2 = "C:\\Image_Repository\\BG_Images\\$item";
        print $dir2;
        opendir(D1, "+<$dir2") or die " $!\n";
        my @files = readdir D1;
        #print "@files\n";
        foreach $one (@files) {
            $one = "reference_" . $one;
            print "$one\n";
            #rename $one, Reference_.$one;
        }
    }
    closedir DIR;
}
I tried the open call with '+<' mode, but I get a compilation error for the read-write mode.
When I run this code, it shows the files in the BG_Images folder with the prefixed string, but it doesn't actually rename the files in the sub-directories.
You don't open a directory for writing. Just use opendir without the mode parts of the string:
opendir my($dir), $dirname or die "Could not open $dirname: $!";
However, you don't need that. You can use File::Find to make the list of files you need.
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;
use File::Find;
use File::Find::Closures qw(find_regular_files);
use File::Spec::Functions qw(catfile);
my( $wanted, $reporter ) = find_regular_files;
find( $wanted, $ARGV[0] );

my $prefix = 'reference_';

foreach my $file ( $reporter->() )
{
    my $basename = basename( $file );

    if( index( $basename, $prefix ) == 0 )
    {
        print STDERR "$file already has '$prefix'! Skipping.\n";
        next;
    }

    my $new_path = catfile(
        dirname( $file ),
        "$prefix$basename"
    );

    unless( rename $file, $new_path )
    {
        print STDERR "Could not rename $file: $!\n";
        next;
    }

    print $file, "\n";
}
You should probably check out the File::Find module for this - it will make recursing up and down the directory tree simpler.
You should probably be scanning the file names and modifying those that don't start with reference_ so that they do. That may require splitting the file name up into a directory name and a file name and then prefixing the file name part with reference_. That's done with the File::Basename module.
At some point, you need to decide what happens when you run the script the third time. Do the files that already start with reference_ get overwritten, or do the unprefixed files get overwritten, or what?
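Under those assumptions, a minimal sketch using File::Find directly, keeping the BG_Images directory from the question and the skip-files-already-prefixed policy. (File::Find chdirs into each directory as it walks, so $_ is a bare file name here and File::Basename isn't needed in this particular variant.)
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

find(sub {
    return unless -f $_ && /\.bmp$/i;    # only regular .bmp files
    return if /^reference_/;             # already prefixed: leave it alone
    rename $_, "reference_$_"
        or warn "Could not rename $_: $!";
}, 'C:\\Image_Repository\\BG_Images');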
The reason the files are not being renamed is that the rename operation is commented out. Remember to add use strict; at the top of your script (as well as the -w option, which you did use).
If you get a list of files in an array @files (and the names are base names, so you don't have to fiddle with File::Basename), then the loop might look like:
foreach my $one (@files)
{
    my $new = "reference_$one";
    print "$one --> $new\n";
    rename $one, $new or die "failed to rename $one to $new ($!)";
}
With the aid of the find utility from coreutils for Windows:
$ find -iname "*.bmp" | perl -wlne"chomp; ($prefix, $basename) = split(m~\/([^/]+)$~, $_); rename($_, join(q(/), ($prefix, q(reference_).$basename))) or warn qq(failed to rename '$_': $!)"