Reuse calculated md5 (or any other checksum) - perl

I'm trying to compute an incremental MD5 digest over all files in a deep directory tree, but I'm unable to "reuse" the already calculated per-file digests.
Here is my test-code:
#!/usr/bin/env perl
use 5.014;
use warnings;
use Digest::MD5;
use Path::Tiny;

# create some test-files in the tempdir
my @filenames = qw(a b);
my $testdir = Path::Tiny->tempdir;
$testdir->child($_)->spew($_) for @filenames;    # create 2 files

dirmd5($testdir, @filenames);
exit;

sub dirmd5 {
    my ($dir, @files) = @_;
    my $dirctx = Digest::MD5->new;    # the md5 for the whole directory
    for my $fname (@files) {
        # calculate the md5 for one file
        my $filectx = Digest::MD5->new;
        my $fd = $dir->child($fname)->openr_raw;
        $filectx->addfile($fd);
        close $fd;
        say "md5 for $fname : ", $filectx->clone->hexdigest;

        # I want to somehow "add" the above file md5 to the directory md5.
        # This does not work - even though $filectx isn't reset (note the "clone" above):
        #$dirctx->add($filectx);

        # Adding the file as below works, but it calculates the md5 again,
        # i.e. for each file the calculation is done two times:
        # once for the file alone (above) and a second time for the directory.
        # Too bad in case of many and large files. ;(
        # Especially if I want to calculate the md5sum for whole directory trees.
        $fd = $dir->child($fname)->openr_raw;
        $dirctx->addfile($fd);
        close $fd;
    }
    say "md5 for dir: ", $dirctx->hexdigest;
}
The above prints:
md5 for a : 0cc175b9c0f1b6a831c399e269772661
md5 for b : 92eb5ffee6ae2fec3ad71c777531578f
md5 for dir: 187ef4436122d1cc2f40dc2b92f0eba0
which is correct, but unfortunately calculated in an inefficient way (see the comments).
Reading the docs, I didn't find any way to reuse an already calculated md5, e.g. like the $dirctx->add($filectx); above. Probably it is not possible.
Is there any way of check-summing that allows reusing already calculated checksums, so that I would be able to calculate the checksums/digests for whole directory trees without needing to calculate the digest multiple times for each file?
Ref: I'm trying to somewhat solve this question.

No. There is nothing that relates MD5(initial data) and MD5(new data) to MD5(initial data + new data), because the position of the data in the stream matters as well as its value. Otherwise it wouldn't be a very useful error check, as aba, aab and baa would all have the same checksum.
If the files are small enough you could read each one into memory and use that copy to add the data to both digests. That would avoid reading twice from mass storage.
#!/usr/bin/env perl
use 5.014;
use warnings 'all';
use Digest::MD5;
use Path::Tiny;

# create some test-files in the tempdir
my @filenames = qw(a b);
my $testdir = Path::Tiny->tempdir;
$testdir->child($_)->spew($_) for @filenames;    # create 2 files

dirmd5($testdir, @filenames);

sub dirmd5 {
    my ($dir, @files) = @_;
    my $dir_ctx = Digest::MD5->new;    # the md5 for the whole directory
    for my $fname ( @files ) {
        my $data = $dir->child($fname)->slurp_raw;

        # calculate the md5 for one file
        my $file_md5 = Digest::MD5->new->add($data)->hexdigest;
        say "md5 for $fname : $file_md5";

        $dir_ctx->add($data);
    }
    my $dir_md5 = $dir_ctx->hexdigest;
    say "md5 for dir: $dir_md5";
}
If the files are huge then the only optimisation left is to avoid reopening the same file, and instead rewind it back to the start before reading it a second time:
#!/usr/bin/env perl
use 5.014;
use warnings 'all';
use Digest::MD5;
use Path::Tiny;
use Fcntl ':seek';

# create some test-files in the tempdir
my @filenames = qw(a b);
my $testdir = Path::Tiny->tempdir;
$testdir->child($_)->spew($_) for @filenames;    # create 2 files

dirmd5($testdir, @filenames);

sub dirmd5 {
    my ($dir, @files) = @_;
    my $dir_ctx = Digest::MD5->new;    # The digest for the whole directory
    for my $fname ( @files ) {
        my $fh = $dir->child($fname)->openr_raw;

        # The digest for just the current file
        my $file_md5 = Digest::MD5->new->addfile($fh)->hexdigest;
        say "md5 for $fname : $file_md5";

        seek $fh, 0, SEEK_SET;
        $dir_ctx->addfile($fh);
    }
    my $dir_md5 = $dir_ctx->hexdigest;
    say "md5 for dir: $dir_md5";
}
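A different trade-off, not covered in the answer above, is to make the directory checksum a digest over the per-file digests rather than over the raw file contents. The result is no longer the MD5 of the concatenated data, but it still changes whenever any file changes, each file is read only once, and already known per-file digests can be reused. A minimal sketch, reusing the dirmd5 setup from the question:
sub dirmd5_from_file_digests {
    my ($dir, @files) = @_;
    my $dirctx = Digest::MD5->new;    # digest over the per-file digests
    for my $fname (@files) {
        my $fd = $dir->child($fname)->openr_raw;
        my $file_md5 = Digest::MD5->new->addfile($fd)->hexdigest;
        close $fd;
        say "md5 for $fname : $file_md5";
        # feed the (name, digest) pair into the directory digest instead of the file data
        $dirctx->add("$fname\0$file_md5\0");
    }
    return $dirctx->hexdigest;
}
Including the file name alongside each digest means renaming a file also changes the directory checksum, and the \0 separators keep the name/digest boundaries unambiguous.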

Related

Compare multiple file contents in perl

I have a list of files (more than 2) where I need to verify whether all of those files are identical.
I tried to use the File::Compare module, but it seems to accept only two files, while in my case I have multiple files whose contents I want to verify are the same. Is there any other way to meet my requirement?
The solution to this problem is to take a digest of every file. Many options exist, most will be 'good enough' (e.g. technically MD5 has some issues, but they're not likely to matter outside a cryptographic/malicious code scenario).
So simply:
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw ( md5_hex );
use Data::Dumper;

my %digest_of;
my %unique_files;

foreach my $file (@ARGV) {
    open( my $input, '<', $file ) or warn $!;
    my $digest = md5_hex ( do { local $/; <$input> } );
    close ( $input );
    $digest_of{$file} = $digest;
    push @{$unique_files{$digest}}, $file;
}
print Dumper \%digest_of;
print Dumper \%unique_files;
%unique_files will give you each unique fingerprint, and all the files with that fingerprint - if you've got 2 (or more) fingerprints then you've got files that aren't identical.
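For example, a simple all-identical check on top of that (a small sketch, not part of the original answer) just counts the distinct fingerprints:
if ( keys %unique_files == 1 ) {
    print "All files are identical\n";
}
else {
    printf "Files differ: %d distinct contents found\n", scalar keys %unique_files;
}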

Perl: How do I get "bytes read" from md5::digest addfile()?

I am using Digest::MD5 to compute the MD5 of a data stream; namely a gzipped file (or to be precise, 3000 of them) that is much too large to fit in RAM. So I'm doing this:
use Digest::MD5 qw(md5_base64);

my ($filename) = @_;    # this is in a sub
my $ctx = Digest::MD5 -> new;
$openme = $filename;    # Usually, it's a plain file
$openme = "gunzip -c '$filename' |" if ($filename =~ /\.gz$/);    # is gz
open (FILE, $openme);     # gunzip to STDOUT
binmode(FILE);
$ctx -> addfile(*FILE);   # passing filehandle
close(FILE);
This is a success. addfile neatly slurps in the output of gunzip and gives a correct MD5.
However, I would really, really like to know the size of the slurped data (gunzipped "file" in this case).
I could add an additional
$size = 0 + `gunzip -c very/big-file.gz | wc -c`;
but that would involve reading the file twice.
Is there any way to extract the number of bytes slurped from Digest::MD5? I tried capturing the result: $result = $ctx -> addfile(*FILE); and doing Data::Dumper on both $result and $ctx, but nothing interesting emerged.
Edit: The files are often not gzipped. Added code to show what I really do.
I'd do it all in perl, without relying on an external program for the decompression:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;
use IO::Uncompress::Gunzip qw/$GunzipError/;
use Digest::MD5;

my $filename = shift or die "Missing gzip filename!\n";

my $md5 = Digest::MD5->new;

# Allow for reading both gzip format files and uncompressed files.
# This is the default behavior, but might as well be explicit about it.
my $z = IO::Uncompress::Gunzip->new($filename, Transparent => 1)
    or die "Unable to open $filename: $GunzipError\n";

my $len = 0;
while ((my $blen = $z->read(my $block)) > 0) {
    $len += $blen;
    $md5->add($block);
}
die "There was an error reading the file: $GunzipError\n" unless $z->eof;

say "Total uncompressed length: $len";
say "MD5: ", $md5->hexdigest;
If you want to use gunzip instead of the core IO::Uncompress::Gunzip module, you can do something similar, using read to get a chunk of data at a time:
#!/usr/bin/perl
use warnings;
use strict;
use autodie;    # So we don't have to explicitly check for i/o related errors
use feature qw/say/;
use Digest::MD5;

my $filename = shift or die "Missing gzip filename!\n";

my $md5 = Digest::MD5->new;

# Note use of lexical file handle and safer version of opening a pipe
# from a process that eliminates shell shenanigans. Also uses the :raw
# perlio layer instead of calling binmode on the handle (which has the
# same effect)
open my $z, "-|:raw", "gunzip", "-c", $filename;
# Non-compressed version
# open my $z, "<:raw", $filename;

my $len = 0;
while ((my $blen = read($z, my $block, 4096)) > 0) {
    $len += $blen;
    $md5->add($block);
}

say "Total uncompressed length: $len";
say "MD5: ", $md5->hexdigest;
You could read the contents yourself, and feed it in to $ctx->add($data), and keep a running count of how much data you've passed through. Whether you add all the data in a single call, or across multiple calls, doesn't make any difference to the underlying algorithm. The docs include:
All these lines will have the same effect on the state of the $md5 object:
$md5->add("a"); $md5->add("b"); $md5->add("c");
$md5->add("a")->add("b")->add("c");
$md5->add("a", "b", "c");
$md5->add("abc");
which indicates that you can just do this a piece at a time.
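As a quick sanity check of that equivalence (not part of the original answer), feeding the data piecewise and all at once gives identical digests:
use Digest::MD5;
my $piecewise = Digest::MD5->new->add("a")->add("b")->add("c")->hexdigest;
my $whole     = Digest::MD5->new->add("abc")->hexdigest;
print $piecewise eq $whole ? "same digest\n" : "different digest\n";    # prints "same digest"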

Need to loop through a directory and all of its subdirectories to find files of a certain size in Perl

I am attempting to loop through a directory and all of its sub-directories to see if the files within those directories are a certain size. But I am not sure whether the files in the @files array still carry the file size so I can compare it (i.e. size <= value_size). Can someone offer any guidance?
use strict;
use warnings;
use File::Find;
use DateTime;

my @files;
my $dt = DateTime->now;
my $date = $dt->ymd;
my $start_dir = "/apps/trinidad/archive/in/$date";
my $empty_file = 417;

find( \&wanted, $start_dir);

for my $file( @files )
{
    if(`ls -ltr | awk '{print $5}'` <= $empty_file)
    {
        print "The file $file appears to be empty please check within the folder if this is empty"
    }
    else
        return;
}
exit;

sub wanted {
    push @files, $File::Find::name unless -d;
    return;
}
I think you could use this code instead of shelling out to awk.
(I don't understand why my $empty_file = 417; is considered an empty file size.)
if (-s $file <= $empty_file)
Also notice that you are missing an open and close brace for your else branch.
(I'm also unsure why you want to 'return' here: the first file found that is not 'empty' branches to the return, which doesn't do anything useful because return is only meant for returning from a function.)
The exit is unnecessary, as is the return in the wanted function.
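Putting those points together, the loop might look something like this (a sketch reusing the asker's @files and $empty_file, with the -s file test instead of the awk call):
for my $file (@files) {
    # -s returns the file's size in bytes
    if ( -s $file <= $empty_file ) {
        print "The file $file appears to be empty, please check within the folder\n";
    }
}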
Update: A File::Find::Rule solution could also be used. Here is a small program that captures all files less than 14 bytes in my current directory and all of its subdirectories.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use File::Find::Rule;

my $dir = '.';
my @files = find( file => size => "<14", in => $dir);
say -s $_, " $_" for @files;

Perl - concatenate files with similar name patterns and write the concatenated file names to a list

I have a directory with multiple sub-directories in it and each subdir has a fixed set of files - one for each category like -
1)Main_dir
1.1) Subdir1 with files
- Test.1.age.txt
- Test.1.name.txt
- Test.1.place.csv
..........
1.2) Subdir2 with files
- Test.2.age.txt
- Test.2.name.txt
- Test.2.place.csv
.........
There are around 20 folders with 10 files each. I first need to concatenate the files under each category, like Test.1.age.txt and Test.2.age.txt, into a Combined.age.txt file, and once I have done all the concatenation I want to print these filenames to a new Final_list.txt file, like
./Main_dir/Combined.age.txt
./Main_dir/Combined.name.txt
I am able to read all the files from all subdirs into an array, but I am not sure how to do a pattern search for the similar file names. I will also be able to figure out the printout part of the code myself. Can anyone please share how to do this pattern search for the concatenation? My code so far:
use warnings;
use strict;
use File::Spec;
use Data::Dumper;
use File::Basename;

foreach my $file (@files) {
    print "$file\n";
}

my $testdir = './Main_dir';
my @Comp_list = glob("$testdir/test_dir*/*.txt");
I am trying to do the pattern search on the array contents in @Comp_list, which I surely need to learn -
foreach my $f1 (@Comp_list) {
    if($f1 !~ /^(\./\.txt$/) {
        print $f1; # check if reading the file right
        #push it to a file using concatfile(
    }
}
Thanks a lot!
This should work for you. I've only tested it superficially, as it would take me a while to create some test data, so as you have some at hand I'm hoping you'll report back with any problems.
The program segregates all the files found by the equivalent of your glob call, and puts them in buckets according to their type. I've assumed that the names are exactly as you've shown, so the type is the penultimate field when the file name is split on dots; i.e. the type of Test.1.age.txt is age.
Having collected all of the file lists, I've used a technique that was originally designed to read through all of the files specified on the command line. If @ARGV is set to a list of files then an <ARGV> operation will read through all the files as if they were one, and so they can easily be copied to a new output file.
If you need the files concatenated in a specific order then I will have to amend my solution. At present they will be processed in the order that glob returns them -- probably in lexical order of their file names, but you shouldn't rely on that.
use strict;
use warnings 'all';
use v5.14.0;    # For autoflush method
use File::Spec::Functions 'catfile';

use constant ROOT_DIR => './Main_dir';

my %files;

my $pattern = catfile(ROOT_DIR, 'test_dir*', '*.txt');

for my $file ( glob $pattern ) {
    my @fields = split /\./, $file;
    my $type = lc $fields[-2];
    push @{ $files{$type} }, $file;
}

STDOUT->autoflush;    # Get prompt reports of progress

for my $type ( keys %files ) {

    my $outfile = catfile(ROOT_DIR, "Combined.$type.txt");
    open my $out_fh, '>', $outfile or die qq{Unable to open "$outfile" for output: $!};

    my $files = $files{$type};

    printf qq{Writing aggregate file "%s" from %d input file%s ... },
        $outfile,
        scalar @$files,
        @$files == 1 ? '' : 's';

    local @ARGV = @$files;
    print $out_fh $_ while <ARGV>;

    print "complete\n";
}
I think it's easier if you categorize the files first; then you can work with them.
use warnings;
use strict;
use File::Spec;
use Data::Dumper;
use File::Basename;

my %hash = ();
my $testdir = './main_dir';
my @comp_list = glob("$testdir/**/*.txt");

foreach my $file (@comp_list){
    $file =~ /(\w+\.\d\..+\.txt)/;
    next if not defined $1;
    my @tmp = split(/\./, $1);
    if (not defined $hash{$tmp[-2]}) {
        $hash{$tmp[-2]} = [$file];
    }else{
        push @{ $hash{$tmp[-2]} }, $file;    # dereference the array ref before pushing
    }
}

print Dumper(\%hash);
Files:
main_dir
├── sub1
│   ├── File.1.age.txt
│   └── File.1.name.txt
└── sub2
├── File.2.age.txt
└── File.2.name.txt
Result:
$VAR1 = {
    'age' => [
        './main_dir/sub1/File.1.age.txt',
        './main_dir/sub2/File.2.age.txt'
    ],
    'name' => [
        './main_dir/sub1/File.1.name.txt',
        './main_dir/sub2/File.2.name.txt'
    ]
};
You can then create a loop to concatenate and combine the files, for example as sketched below.
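A minimal sketch of that loop, assuming the %hash and $testdir from the code above and the Combined.<type>.txt / Final_list.txt names from the question:
# Sketch: concatenate each category's files and record the combined file names.
open my $list_fh, '>', "$testdir/Final_list.txt" or die "Cannot open list file: $!";
for my $type (keys %hash) {
    my $outfile = "$testdir/Combined.$type.txt";
    open my $out_fh, '>', $outfile or die "Cannot open $outfile: $!";
    for my $infile (@{ $hash{$type} }) {
        open my $in_fh, '<', $infile or die "Cannot open $infile: $!";
        print {$out_fh} $_ while <$in_fh>;
        close $in_fh;
    }
    close $out_fh;
    print {$list_fh} "$outfile\n";    # write the combined file name to the list
}
close $list_fh;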

How to open a directory and read the files inside that directory using Perl

I am trying to unzip files and count the matching characters in the files, and after that I need to concatenate the files based on their file names. I successfully achieved the first two steps, but I am facing a problem achieving the third objective. This is the script I am using.
#! use/bin/perl
use strict;
use warnings;
print"Enter file name for Unzip\n";
print"File name: ";
chomp(my $Filename=<>);
system("gunzip -r ./$Filename\*\n");
print"Enter match characters";
chomp(my $match=<>);
system("grep -c '$match' ./$Filename/* > $Filename/output");
open $fh,"/home/final_stage/test_(copy)";
if(my $file="sra_*_*_*_R1")
{
print $file;
}
system("mkdir $Filename/R1\n");
system("mkdir $Filename/R2\n");
Based on "sra____R1" file name matching i have to concatenate and put the out in R1 folder and "sra____R2" file name R2 folder.
Help me to complete this work, all suggestions are welcome !!!!!
#!/usr/bin/perl
use strict;
use warnings;
use Path::Class;
use autodie;    # die if problem reading or writing a file

my $dir = dir("/tmp");    # /tmp
my $file = $dir->file("file.txt");

# Read in the entire contents of a file
my $content = $file->slurp();

# openr() returns an IO::File object to read from
my $file_handle = $file->openr();

# Read in line at a time
while ( my $line = $file_handle->getline() ) {
    print $line;
}
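The snippet above reads a single known file; since the question title asks about opening a directory and reading the files inside it, a minimal sketch of that part (with a hypothetical directory name) might look like:
#!/usr/bin/perl
use strict;
use warnings;

my $dirname = '/home/final_stage/test';    # hypothetical directory from the question
opendir my $dh, $dirname or die "Cannot open directory $dirname: $!";
for my $entry ( readdir $dh ) {
    next if $entry eq '.' or $entry eq '..';
    print "$entry\n";    # each plain file or subdirectory name
}
closedir $dh;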
Enjoy your day !!!!