Perl - matching elements of two lists

I am trying to grab the list of files Jenkins has updated between the last build and the latest build, and store it in a Perl array.
I also have a list of files and folders in the source code that are considered sensitive to changes, like XXXX\yyy/., XXX/TTTT/FFF.txt, ... stored in FILE.TXT.
I want the script to tell me if any of these sensitive files were part of my changed files and, if yes, list their names so that we can double-check the change with the development team before we trigger the build.
How should I achieve this, and how do I compare multiple files under one path against the changed file paths?
I have written the script below, which is called inside Jenkins with %workspace% as an argument.
It is not giving any matching results.
use warnings;
use Array::Utils qw(:all);

$url = `curl -s "http://localhost:8080/job/Rev-number/lastStableBuild/" | findstr "started by"`;
$url =~ /([0-9]+)/;
chdir $ARGV[1];    # system("cd ...") would only change directory in a child process
@difffiles = `svn diff -r $1:HEAD --summarize`;
chomp @difffiles;
foreach $f (@difffiles) {
    $f = substr( $f, 8 );
    $f = "$f\n";
}
open FILE, '/path/to/file'
    or die "Can't open file: $!\n";
@array = <FILE>;
@isect = intersect( @difffiles, @array );
print @isect;

I have managed to solve this issue using the Perl script below:
sub Verifysensitivefileschanges {
    $count1 = 0;
    @isect = intersect(@difffiles, @sensitive);
    #print "ISECT=@isect\n";
    foreach $t (@isect) {
        if (@isect) { print "Matching element found -- $t\n"; $count1 = 1; }
    }
    return $count1;
}

sub Verifysensitivedirschanges {
    $count2 = 0;
    foreach $g (@difffiles) {
        $dirs     = dirname($g);
        $filename = basename($g);
        #print "$dirs\n";
        foreach $j (@array) {
            if ( "$j" =~ /\Q$dirs/ ) {
                print "Sensitive path files changed under path $j and file name is $filename\n";
                $count2 = 1;
            }
        }
    }
    return $count2;
}
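As a side note, a hash-based lookup avoids calling intersect and rescanning the sensitive list for every changed file. This is a minimal sketch under the assumption that @difffiles and @sensitive each hold one normalized path per element (the sample paths below are hypothetical):

use strict;
use warnings;

# Hypothetical inputs: changed paths and sensitive paths, one per element.
my @difffiles = qw( XXX/TTTT/FFF.txt other/file.c );
my @sensitive = qw( XXX/TTTT/FFF.txt secret/config.xml );

# Build a set of sensitive paths for O(1) membership tests.
my %is_sensitive = map { $_ => 1 } @sensitive;

my @hits = grep { $is_sensitive{$_} } @difffiles;
print "Matching element found -- $_\n" for @hits;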

Related

Perl - Read multiple files and read line by line of the text file

I am trying to read multiple .txt files in a folder. Each file should be read line by line; however, I failed to read multiple .txt files using glob. Any advice on my code?
my %data;
@FILES = glob("*.txt");
$EmailMsg .= "EG. Folder(week) = Folder(CW01) --CW01 = Week 1 -- Number is week\n ";
$EmailMsg .= "=======================================================================================================\n";
# Try to Loop multiple files here
foreach my $file (@FILES) {
    local $/ = undef;
    open my $fh, '<', $file;
    $data{$file} = <$fh>;
    # Read the file one line at a time.
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/^\s+//;
        $line =~ s/\s+$//;
        my ($name, $date, $week) = split /\:/, $line;
        if ($name eq "NoneFolder") {
            $EmailMsg .= "Folder ($week) - No Folder created on the FTP! Failed to open folder!\n";
        }
        if ($name eq "EmptyFiles") {
            $EmailMsg .= "Folder ($week) - No Files insides the folder! Failed download files!\n";
        }
    }
}
$EmailMsg .= "=======================================================================================================\n";
$EmailMsg .= "Please note that if you receive this email means that the script is running fine just that no folder is created or no files inside the folder for the week on the FTP.\n";
# close the file.
#close <$fh>;
Current output:
EG. Folder(week) = Folder(CW01) --CW01 = Week 1 -- Number is week
=======================================================================================================
=======================================================================================================
Please note that if you receive this email means that the script is running fine just that no folder is created or no files inside the folder for the week on the FTP.
It failed to get any .txt files.
You are trying to read each file twice: firstly into the hash %data and then again line by line.
Once you have reached end of file, you have to either reopen the file or use seek to move the read pointer back to the beginning.
You also need to set $/ back to its original value, otherwise your loop will read the entire file instead of one line at a time.
It's not clear whether you really need the second copy of the file data in the hash, but you can avoid having to reset $/ by putting the change within a block, like this:
open my $fh, '<', $file;
$data{$file} = do {
    local $/ = undef;
    <$fh>;
};
and then reset the file pointer to the start again before the while loop.
seek $fh, 0, 0;
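Putting those pieces together, the corrected loop might look like this (a sketch; the %data slurp is kept on the assumption that it is actually needed):

my %data;
my @FILES = glob("*.txt");

foreach my $file (@FILES) {
    open my $fh, '<', $file or die "Can't open $file: $!";

    # Slurp the whole file inside a block so $/ is restored afterwards.
    $data{$file} = do {
        local $/;
        <$fh>;
    };

    # Rewind so the line-by-line loop starts from the beginning again.
    seek $fh, 0, 0;

    while (my $line = <$fh>) {
        chomp $line;
        # ... process $line as before ...
    }
    close $fh;
}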
#!/usr/bin/perl
use strict;
use warnings FATAL => 'all';
my @files = ('Read a file.pl', 'Read a single text file.pl', 'Read only one file.pl',
             'Read the file using while.pl', 'Reading the file.pl');
foreach my $i (@files) {
    open(FH, "<$i") or die "Cannot open $i: $!";
    while (my $row = <FH>) {
        chomp $row;
        print "$row\n";
    }
    close(FH);
}
The file globbing works for me. You might want to specify scope for your @FILES variable and check that there actually are files matching the path you have specified.
#!/bin/env perl
use strict;
use warnings;
## glob on all files in home directory
## see: http://perldoc.perl.org/File/Glob.html
use File::Glob ':globally';
my @configs = <~myname/project/etc/*.cfg>;
foreach my $fn (@configs) {
print "file $fn\n";
}
Your code:
my %data;
# here are some .c files
my @FILES = glob("../*.c");
foreach my $fn (@FILES) {
    print "file $fn\n";
}
exit;
This way catches more garbage for about the same amount of code.
my $PATH = shift @ARGV;
chomp $PATH;
opendir(TXTFILE, $PATH) || die("failed to opendir: $PATH");
my @file = readdir TXTFILE;
closedir(TXTFILE);
foreach (@file) {
    next unless ($_ =~ /\.txt$/i);          # Only get .txt files
    $PATH =~ s/\/$//g; $PATH =~ s/$/\//;    # Uniform trailing slash
    my $thisfile = $PATH . $_;              # now a fully qualified filename
    unless (open(THISFILE, $thisfile)) {    # Notify on busted files.
        warn("$thisfile failed to open");
        next;
    }
    while (<THISFILE>) {
        # etc. etc.
    }
    close(THISFILE);
}

Using File::Temp module with System command in Perl

Following is my Perl code:
use strict;
use File::Find;
use MIME::Base64;
use File::Temp qw(tempfile);
sub loadFiles(); #udf
sub mySub(); #udf
my @files = ();
my $dir = shift || die "Argument missing: directory name\n";
my $finalLoc;
my $filePath;
my $fileContents;
my $base64EncFile;
my $domain = "WTX";
my $devFilePath;
my $deviceDir;
my $position;
my $user = "admin";
my $encPwd = "YzNKcGNtRnRZVEF4";
my $decPwd;
my $response;
my $temp;
my $tempFilename;
loadFiles(); #call
foreach (@files) {
#take the file path into a variable
$filePath = $_;
#replace the '/' with '\' in the file path
$filePath =~ s/\//\\/g;
#take the file path into a variable
$devFilePath = $_;
#replace the '\' with '/' in the file path
$devFilePath =~ s/\\/\//g;
#perform string operation to derive a target file path
$position = index( $devFilePath, "RPDM" );
$deviceDir = "local:///" . substr( $devFilePath, $position );
#open handle on file to read the contents
open( FILE, "< $filePath" );
#read the entire file into a variable, 'fileContents'
$fileContents = do { local $/; <FILE> };
#base64 encode the file contents
$base64EncFile = encode_base64($fileContents);
#replace the <CR><LF> characters in the file and flatten the base64 string
$base64EncFile =~ s/[\x0A\x0D]//g;
#printing file path
print "FilePath=$filePath\n";
#creating a temp file with 9 random characters at the end, example 'tempUKv1vqBTp'
$temp = File::Temp->new(
TEMPLATE => "tempXXXXXXXXX",
UNLINK => 0
) or die "Could not make tempfile: $!";
$tempFilename = $temp->filename;
#Printing temp file name
print "TempFileName=$tempFilename\n";
#open the temp file for writing
open(TEMP, ">$tempFilename");
select(TEMP);
while($base64EncFile){
#??? HOW TO PRINT THE VARIABLE $base64EncFile CONTENTS INTO THE TEMP FILE ???
}
#creating a final request for sending to the web service
my $dpString = "<env:Envelope xmlns:env='http://schemas.xmlsoap.org/soap/envelope/' xmlns:dp='http://www.datapower.com/schemas/management'><env:Body><dp:request domain='$domain'><dp:set-file name='$deviceDir'>". $base64EncFile."</dp:set-file></dp:request></env:Body></env:Envelope>";
#decode the encoded password
$decPwd = decode_base64($encPwd);
system('C:\\apps\\curl-7.15.0\\curl.exe', '-#', '-k', '-u', "admin:$decPwd", '--data-binary', "$dpString", 'https://host/service/fileSet');
print "-----------------------------------------------------------\n";
close(TEMP);
close(FILE);
}
sub loadFiles() {
    find( \&mySub, "$dir" ); #custom subroutine find, parse $dir
}
# following gets called recursively for each file in $dir, check $_ to see if you want the file!
sub mySub() {
    push @files, $File::Find::name
        if (/(\.xml|\.xsl|\.xslt|\.ffd|\.dpa|\.wsdl|\.xsd)$/i);
    # modify the regex as per your needs or pass it as another arg
}
The task I am trying to accomplish: given a folder argument, the above Perl program makes recursive calls to a given web service endpoint. The problem is that the system command in Perl is unable to send files over 32 KB. While trying to use the File::Temp module, I am not sure how to write the contents of a variable into a temp file (this is my first week using Perl).
Any help to achieve this will be appreciated. Thanks!
Are you asking how to write a string to an open file?
print $fh $string;
should do the trick.
In your example, that would translate to replacing the while loop over $base64EncFile with something like:
print TEMP $base64EncFile;
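As for the 32 KB limit itself (assuming the cause is the payload being passed on the command line), one option is to write the whole request body to the temp file and let curl read it from disk; the @filename form of --data-binary is standard curl. A sketch reusing the question's variable names:

# Write the request body to the temp file instead of passing it on the command line.
print TEMP $dpString;
close(TEMP);

# The "@file" form tells curl to read the request body from the named file.
system('C:\\apps\\curl-7.15.0\\curl.exe', '-#', '-k', '-u', "admin:$decPwd",
       '--data-binary', "\@$tempFilename", 'https://host/service/fileSet');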

How can I check the extension of a file using Perl?

A file is passed as an argument to my Perl script. The file can be a .txt file, or a .zip file containing the .txt file.
I want to write code that looks something like this
if ($file is a zip) {
    unzip $file;
    $file =~ s/zip$/txt/;
}
One way to check the extension is to split on "." and then match the last element of the array returned by split.
Is there a better way?
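For reference, the split-based check described in the question might look like this (a sketch with a hypothetical input):

my $file = 'archive.zip';              # hypothetical input
my @parts = split /\./, $file;
if (lc($parts[-1]) eq 'zip') {
    # treat as a zip file
}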
You can use File::Basename for this.
#!/usr/bin/perl
use 5.010;
use strict;
use warnings;
use File::Basename;
my @exts = qw(.txt .zip);
while (my $file = <DATA>) {
    chomp $file;
    my ($name, $dir, $ext) = fileparse($file, @exts);
    given ($ext) {
        when ('.txt') {
            say "$file is a text file";
        }
        when ('.zip') {
            say "$file is a zip file";
        }
        default {
            say "$file is an unknown file type";
        }
    }
}
__DATA__
file.txt
file.zip
file.pl
Running this gives:
$ ./files
file.txt is a text file
file.zip is a zip file
file.pl is an unknown file type
Another solution is to make use of File::Type which determines the type of binary file.
use strict;
use warnings;
use File::Type;
my $file = '/path/to/file.ext';
my $ft = File::Type->new();
my $file_type = $ft->mime_type($file);
if ( $file_type eq 'application/octet-stream' ) {
# possibly a text file
}
elsif ( $file_type eq 'application/zip' ) {
# file is a zip archive
}
This way, you do not have to deal with missing/wrong extensions.
How about checking the end of the filename?
if ($file =~ /\.zip$/i) {
and then:
use strict;
use Archive::Extract;
if ($file =~ /\.zip$/i) {
    my $ae = Archive::Extract->new(archive => $file);
    my $ok = $ae->extract();
    my $files = $ae->files();
}
More information is available in the Archive::Extract documentation.
You can check the file extension using a regex match as:
if($file =~ /\.zip$/i) {
# $file is a zip file
}
I know this question is several years old, but for anyone that comes here in the future, an easy way to break apart a file path into its constituent path, filename, basename and extension is as follows.
use File::Basename;
my $filepath = '/foo/bar.txt';
my ($basename, $parentdir, $extension) = fileparse($filepath, qr/\.[^.]*$/);
my $filename = $basename . $extension;
You can test its results with the following.
my @test_paths = (
    '/foo/bar/fish.wibble',
    '/foo/bar/fish.',
    '/foo/bar/fish.asdf.d',
    '/foo/bar/fish.wibble.',
    '/fish.wibble',
    'fish.wibble',
);
foreach my $this_path (@test_paths) {
    print "Current path: $this_path\n";
    my ($this_basename, $parentdir, $extension) = fileparse($this_path, qr/\.[^.]*$/);
    my $this_filename = $this_basename . $extension;
    foreach my $var (qw/$parentdir $this_filename $this_basename $extension/) {
        print "$var = '" . eval($var) . "'\n";
    }
    print "\n\n";
}
Hope this helps.
Why rely on file extension? Just try to unzip and use appropriate exception handling:
eval {
    # try to unzip the file
};
if ($@) {
    # not a zip file
}
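A concrete version of that idea, sketched with Archive::Extract (which reports failure through its error method rather than dying, so an explicit die is added to make the eval useful; the file name is hypothetical):

use strict;
use warnings;
use Archive::Extract;

my $file = 'input.dat';    # hypothetical input path

eval {
    # Force zip handling; extract() returns false on failure.
    my $ae = Archive::Extract->new( archive => $file, type => 'zip' );
    $ae->extract or die $ae->error;
};
if ($@) {
    # not a (valid) zip file; fall back to treating it as text
}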
Maybe a little bit late but it could be used as an alternative reference:
sub unzip_all {
    my $director = shift;
    opendir my $DIRH, $director or die "Cannot open $director: $!";
    my @files = readdir $DIRH;
    foreach my $file (@files) {
        my $type = `file $director/$file`;
        if ($type =~ m/gzip compressed data/) {
            system "gunzip $director/$file";
        }
    }
    closedir $DIRH;
    return;
}
Here it is possible to use the Linux file command, executing it from Perl with backticks (``). You can pass the path of your folder and check whether any file in it is classified by file as gzip compressed data.
If you do not mind using a perl module, you can use Module::Generic::File, such as:
use Module::Generic::File qw( file );
my $f = file( '/some/where/file.zip' );
if( $f->extension eq 'zip' )
{
# do something
}
Module::Generic::File has a lot of features to handle and manipulate a file.

How can I compare file list from a tar archive and directory?

I am still learning Perl. Can anyone please suggest Perl code to compare the file list from a .tar.gz archive with a directory path?
Let's say I have a tar.gz backup of the following directory path, which I took a few days back:
a/file1
a/file2
a/file3
a/b/file4
a/b/file5
a/c/file5
a/b/d/file and so on..
Now I want to compare files and directories under this path with the tar.gz backup file.
Please suggest Perl code to do that.
See Archive::Tar.
The Archive::Tar and File::Find modules will be helpful. A basic example is shown below. It just prints information about the files in a tar and the files in a directory tree.
It was not clear from your question how you want to compare the files. If you need to compare the actual content, the get_content() method in Archive::Tar::File will likely be needed. If a simpler comparison is adequate (for example, name, size, and mtime), you won't need much more than methods used in the example below.
#!/usr/bin/perl
use strict;
use warnings;
# A utility function to display our results.
sub Print_file_info {
    print map("$_\n", @_), "\n";
}
# Print some basic information about files in a tar.
use Archive::Tar qw();
my $tar_file = 'some_tar_file.tar.gz';
my $tar = Archive::Tar->new($tar_file);
for my $ft ( $tar->get_files ) {
    # The variable $ft is an Archive::Tar::File object.
    Print_file_info(
        $ft->name,
        $ft->is_file ? 'file' : 'other',
        $ft->size,
        $ft->mtime,
    );
}
# Print some basic information about files in a directory tree.
use File::Find;
my $dir_name = 'some_directory';
my @files;
find(sub {push @files, $File::Find::name}, $dir_name);
Print_file_info(
    $_,
    -f $_ ? 'file' : 'other',
    -s,
    (stat)[9],
) for @files;
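If actual content comparison is needed, a minimal sketch using get_content might look like this, assuming the member names inside the tar correspond directly to paths under the current working directory:

use Archive::Tar;

my $tar2 = Archive::Tar->new('some_tar_file.tar.gz');
for my $member ( $tar2->get_files ) {
    next unless $member->is_file;
    my $path = $member->name;
    # Compare the on-disk bytes against the archived bytes.
    open my $fh, '<:raw', $path or do { print "missing: $path\n"; next };
    my $on_disk = do { local $/; <$fh> };
    close $fh;
    print "differs: $path\n" if $on_disk ne $member->get_content;
}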
Perl is kind of overkill for this, really. A shell script would do fine. The steps you need to take though:
Extract the tar to a temporary folder somewhere.
diff -uR the two folders and redirect the output somewhere (or perhaps pipe to less as appropriate)
Clean up the temporary folder.
And you're done. Shouldn't be more than 5-6 lines. Something quick and untested:
#!/bin/sh
mkdir $TEMP/$$
tar -xz -f ../backups/backup.tgz $TEMP/$$
diff -uR $TEMP/$$ ./ | less
rm -rf $TEMP/$$
Here's an example that checks whether every file in an archive also exists in a folder.
# $1 is the file to test
# $2 is the base folder
for file in $( tar --list -f $1 | perl -pe'chomp;$_=qq["'$2'$_" ]' )
do
# work around bash deficiency
if [[ -e "$( perl -eprint$file )" ]]
then
echo " $file"
else
echo "no $file"
fi
done
This is how I tested this:
I removed / renamed config, then ran the following:
bash test Downloads/update-dnsomatic-0.1.2.tar.gz Downloads/
Which gave the output of:
"Downloads/update-dnsomatic-0.1.2/"
no "Downloads/update-dnsomatic-0.1.2/config"
"Downloads/update-dnsomatic-0.1.2/update-dnsomatic"
"Downloads/update-dnsomatic-0.1.2/README"
"Downloads/update-dnsomatic-0.1.2/install.sh"
I am new to bash / shell programming, so there is probably a better way to do this.
This might be a good starting point for a good Perl program. It does what the question asked for though.
It was just hacked together, and ignores most of the best practices for Perl.
perl test.pl full \
Downloads/update-dnsomatic-0.1.2.tar.gz \
Downloads/ \
update-dnsomatic-0.1.2
#! /usr/bin/env perl
use strict;
use 5.010;
use warnings;
use autodie;
use Archive::Tar;
use File::Spec::Functions qw'catfile catdir';
my($action,$file,$directory,$special_dir) = @ARGV;
if( @ARGV == 1 ){
    $file = *STDOUT{IO};
}
if( @ARGV == 3 ){
    $special_dir = '';
}
sub has_file(_);
sub same_size($$);
sub find_missing(\%$);
given( lc $action ){
    # only compare names
    when( @{[qw'simple name names']} ){
        my @list = Archive::Tar->list_archive($file);
        say qq'missing file: "$_"' for grep{ ! has_file } @list;
    }
    # compare names, sizes, contents
    when( @{[qw'full aggressive']} ){
        my $next = Archive::Tar->iter($file);
        my( %visited );
        while( my $file = $next->() ){
            next unless $file->is_file;
            my $name = $file->name;
            $visited{$name} = 1;
            unless( has_file($name) ){
                say qq'missing file: "$name"';
                next;
            }
            unless( same_size( $name, $file->size ) ){
                say qq'different size: "$name"';
                next;
            }
            next unless $file->size;
            unless( same_checksum( $name, $file->get_content ) ){
                say qq'different checksums: "$name"';
                next;
            }
        }
        say qq'file not in archive: "$_"' for find_missing %visited, $special_dir;
    }
}
sub has_file(_){
    my($file) = @_;
    if( -e catfile $directory, $file ){
        return 1;
    }
    return;
}

sub same_size($$){
    my($file,$size) = @_;
    if( -s catfile($directory,$file) == $size ){
        return $size || '0 but true';
    }
    return; # empty list/undefined
}
sub same_checksum{
    my($file,$contents) = @_;
    require Digest::SHA1;
    my($outside,$inside);
    my $sha1 = Digest::SHA1->new;
    {
        open my $io, '<', catfile $directory, $file;
        $sha1->addfile($io);
        close $io;
        $outside = $sha1->digest;
    }
    $sha1->add($contents);
    $inside = $sha1->digest;
    return 1 if $inside eq $outside;
    return;
}
sub find_missing(\%$){
    my($found,$current_dir) = @_;
    my(@dirs,@files);
    {
        my $open_dir = catdir($directory,$current_dir);
        opendir my($h), $open_dir;
        while( my $elem = readdir $h ){
            next if $elem =~ /^[.]{1,2}[\\\/]?$/;
            my $path = catfile $current_dir, $elem;
            my $open_path = catfile $open_dir, $elem;
            given($open_path){
                when( -d ){
                    push @dirs, $path;
                }
                when( -f ){
                    push @files, $path unless $found->{$path};
                }
                default{
                    die qq'not a file or a directory: "$path"';
                }
            }
        }
    }
    for my $path ( @dirs ){
        push @files, find_missing %$found, $path;
    }
    return @files;
}
After renaming config to config.rm, adding an extra character to README, changing a character in install.sh, and adding a file .test, this is what it output:
missing file: "update-dnsomatic-0.1.2/config"
different size: "update-dnsomatic-0.1.2/README"
different checksums: "update-dnsomatic-0.1.2/install.sh"
file not in archive: "update-dnsomatic-0.1.2/config.rm"
file not in archive: "update-dnsomatic-0.1.2/.test"

Removing files with duplicate content from single directory [Perl, or algorithm]

I have a folder with a large number of files, some of which have exactly the same contents. I want to remove files with duplicate contents, meaning if two or more files with duplicate content are found, I'd like to keep one of these files and delete the others.
Following is what I came up with, but I don't know if it works :) -- I haven't tried it yet.
How would you do it? Perl or a general algorithm.
use strict;
use warnings;

my @files = <"./files/*.txt">;

my $current = 0;
while ( $current <= $#files ) {
    # read contents of $files[$current] into $contents1 scalar
    my $compareTo = $current + 1;
    while ( $compareTo <= $#files ) {
        # read contents of $files[$compareTo] into $contents2 scalar
        if ( $contents1 eq $contents2 ) {
            splice(@files, $compareTo, 1);
            # delete $files[$compareTo] here
        }
        else {
            $compareTo++;
        }
    }
    $current++;
}
Here's a general algorithm (edited for efficiency now that I've shaken off the sleepies -- and I also fixed a bug that no one reported)... :)
It's going to take forever (not to mention a lot of memory) if I compare every single file's contents against every other. Instead, why don't we apply the same search to their sizes first, and then compare checksums for those files of identical size.
So when we calculate the size of every file, we can use a hash table to do our matching for us, storing the matches together in arrayrefs:
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my %files_by_size;
foreach my $file (@ARGV)
{
    push @{$files_by_size{-s $file}}, $file; # store filename in the bucket for this file size (in bytes)
}
Now we just have to pull out the potential duplicates and check if they are the same (by creating a checksum for each, using Digest::MD5), using the same hashing technique:
while (my ($size, $files) = each %files_by_size)
{
    next if @$files == 1;
    my %files_by_md5;
    foreach my $file (@$files)
    {
        open my $filehandle, '<', $file or die "Can't open $file: $!";
        # enable slurp mode
        local $/;
        my $data = <$filehandle>;
        close $filehandle;
        my $md5 = md5_hex($data);
        push @{$files_by_md5{$md5}}, $file; # store filename in the bucket for this MD5
    }
    while (my ($md5, $files) = each %files_by_md5)
    {
        next if @$files == 1;
        print "These files are equal: " . join(", ", @$files) . "\n";
    }
}
-fini
Perl, with Digest::MD5 module.
use Digest::MD5;

%seen = ();
while( <*> ){
    -d and next;
    $filename = "$_";
    print "doing .. $filename\n";
    $md5 = getmd5($filename) . "\n";
    if( ! defined( $seen{$md5} ) ){
        $seen{$md5} = "$filename";
    }else{
        print "Duplicate: $filename and $seen{$md5}\n";
    }
}

sub getmd5 {
    my $file = shift;    # take the filename argument rather than relying on $_
    open(FH, "<", $file) or die "Cannot open file: $!\n";
    binmode(FH);
    my $md5 = Digest::MD5->new;
    $md5->addfile(FH);
    close(FH);
    return $md5->hexdigest;
}
If Perl is not a must and you are working on *nix, you can use shell tools
find /path -type f -print0 | xargs -0 md5sum | \
awk '($1 in seen){ print "duplicate: "$2" and "seen[$1] } \
( ! ($1 in seen ) ) { seen[$1]=$2 }'
md5sum *.txt | perl -ne '
    chomp;
    ($sum, $file) = split(" ");
    push @{$files{$sum}}, $file;
    END {
        foreach (keys %files) {
            shift @{$files{$_}};
            unlink @{$files{$_}} if @{$files{$_}};
        }
    }
'
Perl is kinda overkill for this:
md5sum * | sort | uniq -w 32 -D | cut -b 35- | tr '\n' '\0' | xargs -0 rm
(If you are missing some of these utilities or they don't have these flags/functions,
install GNU findutils and coreutils.)
Variations on a theme:
md5sum *.txt | perl -lne '
my ($sum, $file) = split " ", $_, 2;
unlink $file if $seen{$sum} ++;
'
No need to go and keep a list, just to remove one from the list and delete the rest; simply keep track of what you've seen before, and remove any file matching a sum that's already been seen. The 2-limit split is to do the right thing with filenames containing spaces.
Also, if you don't trust this, just change the word unlink to print and it will output a list of files to be removed. You can even tee that output to a file, and then rm $(cat to-delete.txt) in the end if it looks good.
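For example, the dry-run variant described above could look like this:

md5sum *.txt | perl -lne '
    my ($sum, $file) = split " ", $_, 2;
    print $file if $seen{$sum} ++;
' | tee to-delete.txt
# review to-delete.txt, then: rm $(cat to-delete.txt)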
A bash script is more expressive than Perl in this case:
md5sum * |sort -k1|uniq -w32 -d|cut -f2 -d' '|xargs rm
I'd recommend that you do it in Perl, and use File::Find while you're at it.
Who knows what you're doing to generate your list of files, but you might want to combine it with your duplicate checking.
perl -MFile::Find -MDigest::MD5 -e '
my %m;
find(sub{
    if(-f&&-r){
        open(F,"<",$File::Find::name);
        binmode F;
        $d=Digest::MD5->new->addfile(F);
        if(exists($m{$d->hexdigest})){
            $m{$d->hexdigest}[5]++;
            push @{$m{$d->hexdigest}[0]}, $File::Find::name;
        }else{
            $m{$d->hexdigest} = [[$File::Find::name],0,0,0,0,1];
        }
        close F
    }},".");
foreach $d (keys %m) {
    if ($m{$d}[5] > 1) {
        print "Probable duplicates: ".join(" , ",@{$m{$d}[0]})."\n\n";
    }
}'
Here is a way of filtering by size first and by md5 checksum second:
#!/usr/bin/perl
use strict; use warnings;
use Digest::MD5 qw( md5_hex );
use File::Slurp;
use File::Spec::Functions qw( catfile rel2abs );
use Getopt::Std;
my %opts;
getopt('de', \%opts);
$opts{d} = '.' unless defined $opts{d};
$opts{d} = rel2abs $opts{d};

warn sprintf "Checking %s\n", $opts{d};

my $files = get_same_size_files( \%opts );
$files = get_same_md5_files( $files );

for my $size ( keys %$files ) {
    for my $digest ( keys %{ $files->{$size} } ) {
        print "$digest ($size)\n";
        print "$_\n" for @{ $files->{$size}->{$digest} };
        print "\n";
    }
}
sub get_same_md5_files {
    my ($files) = @_;
    my %out;
    for my $size ( keys %$files ) {
        my %md5;
        for my $file ( @{ $files->{$size} } ) {
            my $contents = read_file $file, {binmode => ':raw'};
            push @{ $md5{ md5_hex($contents) } }, $file;
        }
        for my $k ( keys %md5 ) {
            delete $md5{$k} unless @{ $md5{$k} } > 1;
        }
        $out{$size} = \%md5 if keys %md5;
    }
    return \%out;
}
sub get_same_size_files {
    my ($opts) = @_;
    my $checker = defined($opts->{e})
                ? sub { scalar ($_[0] =~ /\.$opts->{e}\z/) }
                : sub { 1 };
    my %sizes;
    my @files = grep { $checker->($_) } read_dir $opts->{d};
    for my $file ( @files ) {
        my $path = catfile $opts->{d}, $file;
        next unless -f $path;
        my $size = (stat $path)[7];
        push @{ $sizes{$size} }, $path;
    }
    for my $k (keys %sizes) {
        delete $sizes{$k} unless @{ $sizes{$k} } > 1;
    }
    return \%sizes;
}
You might want to have a look at how I did it to find duplicate files and remove them, though you will have to modify it to your needs:
http://priyank.co.in/remove-duplicate-files