Need help parsing stream data from a PHP socket server with Perl

I'm working on a project for my company that uses a socket server (PHP) to gather data from a remote device. How can I make this Perl program run directly on the stream, instead of having the server first write to a tmp file, then running this script on that file, then writing out a CSV file for insertion into a database?
I thought about using IO::Socket but I'm not sure how to go about it.
Also, if anyone has any tips/pointers on how to clean up this code, it would be appreciated. (I'm still very much a Perl n00b ;-))
Here's what I have so far for the Perl program (I wind up with two tmp files after this is run):
#!/usr/bin/perl
# vim: set sw=2 ts=2
use warnings;
use strict;
use Data::Dumper;
&convert;
#open the source file, strip out any unneeded chars, reformat the data
sub convert{
my $source = "$ARGV[0]";
my $dest = "$ARGV[0]"."_tmp.txt";
chomp $source;
open SOURCE, '<', $source or die "Could not open '$source' $!";
open DEST, '>', $dest or die "Could not open '$dest' $!";
# move the data from the source file into an array
my %fields;
my @field_names = qw/FIELD1 FIELD2 FIELD3 FIELD4 FIELD5 FIELD6/;
my $pack_definition = 'a4 a2 a1 a4 a4 a8A*';
while(<SOURCE>){
# strip out the packet header
s#T18##g;
s#T00##g;
s/^(FF14).*$//g;
s/^(FF18).*$//g;
s/(^|\n)[\n\s]*/$1/g;
# arrange the data into the necessary order
@fields{@field_names} = unpack($pack_definition, $_);
s#(FF15)(.{2})(.{1})(.{2})(.{2})(.{2})(.{2})(.{2})(.{2})(.{2})(.{2})#$1\t$2\t$3\t$5$4\t$7$6\t$11$10$9$8#g;
s#(FF16)(.{2})(.{1})(.{2})(.{2})(.{2})(.{2})(.{2})(.{2})(.{2})(.{2})#$1\t$2\t$3\t$4\t$5\t$6\t$7\t$8\t$9\t$10\t$11#g;
s#(FF17)(.{2})(.{1})(.{2})(.{2})(.{2})(.{2})(.{4})(.{4})#$1\t$2\t$3\t$5$4\t$7$6\t$8\t$9#g;
my @spds = /(FF15)\t(.{2})\t(.{1})\t(.{4})\t(.{4})\t(.{8})/;
# convert the data from hex to ascii
foreach my $data (@spds){
my $replacement = hex($data);
s#$data#$replacement#g;
}
my @psis = /(FF16)\t(.{2})\t(.{1})\t(.{2})\t(.{2})\t(.{2})\t(.{2})\t(.{2})\t(.{2})\t(.{2})\t(.{2})/;
foreach my $data1(@psis){
my $replacement1 = hex($data1);
s#$data1#$replacement1#g;
}
my @rates= /(FF17)\t(.{2})\t(.{1})\t(.{4})\t(.{1,4})\t(.{4})\t(.{4})/;
foreach my $data2 (@rates){
my $replacement2 = hex($data2);
s#$data2#$replacement2#g;
}
# print the converted data to the destination file
print DEST;
}
# close the files
close SOURCE;
close DEST;
}
&create_vals;
# perform conversion from raw values to human readable output
sub create_vals{
my $source = "$ARGV[0]"."_tmp.txt";
my $dest = "$ARGV[0]"."_converted.txt";
chomp $source;
open SOURCE, '', $dest or die "Cannot open '$dest' for writing $!";
while(){
s#(65301)\t(.{2})\t(8)\t(.{1,5})\t(.{1,5})\t(.{1,4})#"'".$1."','". $2."','". $3."','". ($4/8)."','". ($5/8)."','". ($6/20)."'"#eg;
s#(65302)\t(.{2})\t(8)\t(.{1,3})\t(.{1,3})\t(.{1,3})\t(.{1,3})\t(.{1,3})\t(.{1,3})\t(.{1,3})\t(.{1,3})#"'".$1."','". $2."','". $3."','". $4."','". (($5*1.8)-40)."','". (($6*1.8)-40)."','". ($7*.58)."','".($8*.58)."','".($9*.29008)."','".(($10*1.8)-40)."','".$11."'"#eg;
s#(65303)\t(.{2})\t(8)\t(.{1,5})\t(.{1,5})\t(.{1,5})\t(.{1,5})#"'".$1."','". $2."','". $3."','". ($4*0.014)."','". ($5*.05)."'"#eg;
print DEST;
}
}
close SOURCE;
close DEST;
__END__

If "run directly on the stream" means "accept connections from a remote device instead of the PHP server" then IO::Socket is a way to go. Google has plenty of examples and you can also check out perlipc documentation.

Related

Save contents of those files which contain a specific known string in an single .txt or .tmp file using perl

I'm trying to write a Perl script that saves the whole contents of every file containing a specific string, 'PYAG_GENERATED', into a single .txt/.tmp file, one after another. The file names follow a specific pattern, 'output_nnnn.txt', where nnnn is 0001, 0002 and so on, but I don't know how many files with this 'output_nnnn.txt' name are present.
I'm new to Perl and I don't know how to resolve this issue to get the output correctly. Can anyone help me? Thanks in advance.
I've tried to write the Perl script in different ways but nothing is coming out in the output file. I'm giving here one of the attempts I've tried. 'new_1.txt' is the new file where I want to save the expected output and "PYAG_GENERATED" is the specific string I'm looking for in the files.
open(NEW,">>new_1.txt") or die "could not open:$!";
$find2="PYAG_GENERATED";
$n='0001';
while('output_$n.txt'){
if(/find2/){
print NEW;
}
$n++;
}
close NEW;
I expect that the output file 'new_1.txt' will save the whole contents of the files (with filename pattern 'output_nnnn.txt') which have the 'PYAG_GENERATED' string at least once inside.
Well, you tried I guess.
Welcome to the wonderful world of Perl where there are always a dozen ways of doing X :-) One possible way to achieve what you want. I put in a lot of comments I hope are helpful. It's also a bit verbose for the sake of clarity. I'm sure it could be golfed down to 5 lines of code.
use warnings; # Always start your Perl code with these two lines,
use strict; # and Perl will tell you about possible mistakes
use experimental 'signatures';
use File::Slurp;
# this is a subroutine/function, a block of code that can be called from
# somewhere else. it takes two arguments that the caller must provide
sub find_in_file( $filename, $what_to_look_for )
{
# the open function opens $filename for reading
# (that's what the "<" means, ">" stands for writing)
# if open is successful we will have a "file handle" in the variable $in
# if not open will return false ...
open( my $in, "<", $filename )
or die $!; # ... and the program will exit here. The variable $! will contain the error message
# now we read the file using a loop
# readline will give us the next line in the file
# or something false when there is nothing left to read
while ( my $line = readline($in) )
{
# now we test whether the current line contains what
# we are looking for.
# the index function gives us the position of a string within another string,
# or -1 if it is not found.
# for example index("abc", "c") will give us 2 (positions start at 0)
if ( index( $line, $what_to_look_for ) >= 0 )
{
# we found what we were looking for
# so we don't need to keep looking in this file anymore
# so we must first close the file
close( $in );
# and then we indicate to the caller that the search was successful
# this will immediately end the subroutine
return 1;
}
}
# If we arrive here the search was unsuccessful
# so we tell that to the caller
return 0;
}
# Here starts the main program
# First we get a list of files
# we want to look at
my @possible_files = glob( "where/your/files/are/output_*.txt" );
# Here we will store the files that we are interested in, aka that contain PYAG_GENERATED
my @wanted_files;
# and now we can loop over the files and see if they contain what we are looking for
foreach my $filename ( @possible_files )
{
# here we use the function we defined earlier
if ( find_in_file( $filename, "PYAG_GENERATED" ) )
{
# with push we can add things to the end of an array
push @wanted_files, $filename;
}
}
# We are finished searching, now we can start adding the files together
# if we found any
if ( scalar @wanted_files > 0 )
{
# Now we could code that ourselves: open the files, loop through them and write out
# line by line. But we make life easy for us and just
# use two functions from the module File::Slurp, a CPAN module
# (you may have to install it if you don't already have it)
foreach my $filename ( @wanted_files )
{
append_file( "new_1.txt", read_file( $filename ) );
}
print "Output created from " . (scalar #wanted_files) . " files\n";
}
else
{
print "No input files\n";
}
use strict;
use warnings;
my @a;
my $i=1;
my $find1="PYAG_GENERATED";
my $n=1;
my $total_files=47276; #got this no. of files by writing 'ls' command in the terminal
while($n<=$total_files){
open(NEW,"<output_$n.txt") or die "could not open:$!";
my $join=join('',<NEW>);
$a[$i]=$join;
#print "$a[10]";
$n++;
$i++;
}
close NEW;
for($i=1;$i<=$total_files;$i++){
if($a[$i]=~m/$find1/){
open(NEW1,">>new_1.tmp") or die "could not open:$!";
print NEW1 $a[$i];
}
}
close NEW1;

Removing HETATMs from PDB files

I want to remove the heteroatoms (HETATM)s from PDB text files that I have locally. I found a perl script that apparently needs a quick tweak to make it do what I want but I'm unsure of what that tweak is.
!#/usr/bin/env perl
open(FILE,"file.pdb");
@file=<FILE>;
foreach (@file){
if (/^HETATM/){
print $_,"\n";
}}
Also, if anyone has an existing perl or python script that they are OK with sharing, I would greatly appreciate it.
In R you can use the Bio3D package:
library(bio3d)
# read pdb
pdb <- read.pdb("1hel")
# make a subset based on TYPE
new <- trim.pdb(pdb, type="ATOM")
# write new pdb to disk
write.pdb(new, file="1hel_ATOM.pdb")
This can also be combined with various other selection criteria, e.g. chain id, residue number, residue name, etc etc:
# select ATOM records for chain A
n1 <- trim.pdb(pdb, type="ATOM", chain="A")
# select residue numbers 10 through 20
n2 <- trim.pdb(pdb, resno=10:20)
In Perl, try this:
use warnings;
use strict;
my $filename = "4BI7.pdb";
die "Error opening file" unless (open my $handler , '<' , "$filename");
open my $newfile, '>', "filename.pdb" or die "New file not created";
while($_ = <$handler>){
print $newfile "$_" unless /^HETATM.*/;
}

Cannot find argument passed to program called using Perl "system" command

I'm writing a Perl script to run an external program on every file in a directory. This program converts files from one format to another. Here's the deal...
When I run the program from the command line, everything works as it should:
computer.name % /path/program /inpath/input.in /outpath/output.out
converting: /inpath/input.in to /outpath/output.out
computer.name %
Here's the code I wrote to convert all files in a directory (listed in "file_list.txt"):
#!/usr/bin/perl -w
use warnings;
use diagnostics;
use FileHandle;
use File::Copy;
# Set simulation parameters and directories
@test_dates = ("20110414");
$listfile = "file_list.txt";
$execname = "/path/program";
foreach $date (@test_dates)
{
# Set/make directories
$obs_file_dir = "inpath";
$pred_file_dir = "outpath";
mkdir "$pred_file_dir", 0755 unless -d "$pred_file_dir";
# Read input file names to array
$obs_file_list = $obs_file_dir . $listfile;
open(DIR, $obs_file_list) or die "Could not open file!";
@obs_files = <DIR>;
close(DIR);
# Convert and save files
foreach $file (@obs_files)
{
$file =~ s/(\*)//g;
$infile = $obs_file_dir . $file;
$outfile = $pred_file_dir . $file;
$outfile =~ s/in/out/g;
print $infile . "\n";
@arg_list = ($execname, $infile, $outfile);
system(@arg_list);
}
}
The output shows me the following error for every file in the list:
computer.name % perl_script_name.pl
/inpath/input.in
converting: /inpath/input.in to /outpath/output.out
unable to find /inpath/input.in
stat status=-1
error while processing the product
I verified every file is in the proper place and have no idea why I am getting this error. Why can't the files be found? When I manually pass the arguments using the command line, no problem. When I pass the arguments through a variable via a system call, they can't be found even though the path and file names are correct.
Your advice is greatly appreciated!
Your list of files (@obs_files) comes from reading in a file via @obs_files = <DIR>;
When you do that, each element of the array will be a line from the file (e.g. a directory listing), with the line terminated by a newline character.
Before using it, you need to remove the newline character via chomp($file).
Please note that s/(\*)//g; does NOT remove that trailing newline!
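A minimal sketch of that fix inside the existing loop (everything else left as in the question):
foreach $file (@obs_files)
{
    chomp $file;                 # remove the trailing newline first
    $file =~ s/(\*)//g;          # then strip literal '*' characters as before
    $infile  = $obs_file_dir . $file;
    $outfile = $pred_file_dir . $file;
    $outfile =~ s/in/out/g;
    @arg_list = ($execname, $infile, $outfile);
    system(@arg_list);
}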

How can I remove non-unique lines from a large file with Perl?

Duplicate data removal using Perl, called via a batch file within Windows.
The Perl script runs in a DOS window in Windows, opened via a batch file; the batch file calls the Perl script which carries out the actions, and I already have the batch file.
The script I have removes duplicate data correctly so long as the data file is not too big.
The problem that needs resolving is with larger data files (2 GB or more): with files of this size, a memory error occurs when trying to load the complete file into an array for duplicate data removal.
The memory error occurs in the subroutine at:-
@contents_of_the_file = <INFILE>;
(A completely different method is acceptable so long as it solves this issue, please suggest).
The subroutine is:-
sub remove_duplicate_data_and_file
{
open(INFILE,"<" . $output_working_directory . $output_working_filename) or dienice ("Can't open $output_working_filename : INFILE :$!");
if ($test ne "YES")
{
flock(INFILE,1);
}
@contents_of_the_file = <INFILE>;
if ($test ne "YES")
{
flock(INFILE,8);
}
close (INFILE);
### TEST print "$#contents_of_the_file\n\n";
@unique_contents_of_the_file = grep(!$unique_contents_of_the_file{$_}++, @contents_of_the_file);
open(OUTFILE,">" . $output_restore_split_filename) or dienice ("Can't open $output_restore_split_filename : OUTFILE :$!");
if ($test ne "YES")
{
flock(OUTFILE,1);
}
for($element_number=0;$element_number<=$#unique_contents_of_the_file;$element_number++)
{
print OUTFILE "$unique_contents_of_the_file[$element_number]\n";
}
if ($test ne "YES")
{
flock(OUTFILE,8);
}
}
You are unnecessarily storing a full copy of the original file in @contents_of_the_file and -- if the amount of duplication is low relative to the file size -- nearly two other full copies in %unique_contents_of_the_file and @unique_contents_of_the_file. As ire_and_curses noted, you can reduce the storage requirements by making two passes over the data: (1) analyze the file, storing information about the line numbers of non-duplicate lines; and (2) process the file again to write non-dups to the output file.
Here is an illustration. I don't know whether I've picked the best module for the hashing function (Digest::MD5); perhaps others will comment on that. Also note the 3-argument form of open(), which you should be using.
use strict;
use warnings;
use Digest::MD5 qw(md5);
my (%seen, %keep_line_nums);
my $in_file = 'data.dat';
my $out_file = 'data_no_dups.dat';
open (my $in_handle, '<', $in_file) or die $!;
open (my $out_handle, '>', $out_file) or die $!;
while ( defined(my $line = <$in_handle>) ){
my $hashed_line = md5($line);
$keep_line_nums{$.} = 1 unless $seen{$hashed_line};
$seen{$hashed_line} = 1;
}
seek $in_handle, 0, 0;
$. = 0;
while ( defined(my $line = <$in_handle>) ){
print $out_handle $line if $keep_line_nums{$.};
}
close $in_handle;
close $out_handle;
You should be able to do this efficiently using hashing. You don't need to store the data from the lines, just identify which ones are the same. So...
Don't slurp - Read one line at a time.
Hash the line.
Store the hashed line representation as a key in a Perl hash of lists. Store the line number as the first value of the list.
If the key already exists, append the duplicate line number to the list corresponding to that value.
At the end of this process, you'll have a data-structure identifying all the duplicate lines. You can then do a second pass through the file to remove those duplicates.
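A rough sketch of that approach, along the same lines as the Digest::MD5 example above; the file names are placeholders:
use strict;
use warnings;
use Digest::MD5 qw(md5);

my %line_nums_for;                                 # digest => [ line numbers ]
open my $in, '<', 'data.dat' or die $!;
while (my $line = <$in>) {
    push @{ $line_nums_for{ md5($line) } }, $.;    # record every occurrence
}

# keep only the first line number recorded for each digest
my %keep = map { $_->[0] => 1 } values %line_nums_for;

seek $in, 0, 0;
$. = 0;
open my $out, '>', 'data_no_dups.dat' or die $!;
while (my $line = <$in>) {
    print {$out} $line if $keep{$.};
}
close $in;
close $out;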
Perl does heroic things with large files, but 2GB may be a limitation of DOS/Windows.
How much RAM do you have?
If your OS doesn't complain, it may be best to read the file one line at a time, and write immediately to output.
I'm thinking of something using the diamond operator <> but I'm reluctant to suggest any code because on the occasions I've posted code, I've offended a Perl guru on SO.
I'd rather not risk it. I hope the Perl cavalry will arrive soon.
In the meantime, here's a link.
Here's a solution that works no matter how big the file is. But it doesn't use RAM exclusively, so it's slower than a RAM-based solution. You can also specify the amount of RAM you want this thing to use.
The solution uses a temporary file that the program treats as a database with SQLite.
#!/usr/bin/perl
use DBI;
use Digest::SHA 'sha1_base64';
use Modern::Perl;
my $input= shift;
my $temp= 'unique.tmp';
my $cache_size_in_mb= 100;
unlink $temp if -f $temp;
my $cx= DBI->connect("dbi:SQLite:dbname=$temp");
$cx->do("PRAGMA cache_size = " . $cache_size_in_mb * 1000);
$cx->do("create table x (id varchar(86) primary key, line int unique)");
my $find= $cx->prepare("select line from x where id = ?");
my $list= $cx->prepare("select line from x order by line");
my $insert= $cx->prepare("insert into x (id, line) values(?, ?)");
open(FILE, $input) or die $!;
my ($line_number, $next_line_number, $line, $sha)= 1;
while($line= <FILE>) {
$line=~ s/\s+$//s;
$sha= sha1_base64($line);
unless($cx->selectrow_array($find, undef, $sha)) {
$insert->execute($sha, $line_number)}
$line_number++;
}
seek FILE, 0, 0;
$list->execute;
$line_number= 1;
$next_line_number= $list->fetchrow_array;
while($line= <FILE>) {
$line=~ s/\s+$//s;
if($next_line_number == $line_number) {
say $line;
$next_line_number= $list->fetchrow_array;
last unless $next_line_number;
}
$line_number++;
}
close FILE;
Well, you could use the in-place edit mode (-i) of command-line Perl.
perl -i~ -ne 'print unless $seen{$_}++' uberbigfilename
In the "completely different method" category, if you've got Unix commands (e.g. Cygwin):
cat infile | sort | uniq > outfile
This ought to work - no need for Perl at all - which may, or may not, solve your memory problem. However, you will lose the ordering of the infile (as outfile will now be sorted).
EDIT: An alternative solution that's better able to deal with large files may be to use the following algorithm:
Read INFILE line-by-line
Hash each line to a small hash (e.g. a hash# mod 10)
Append each line to a file unique to the hash number (e.g. tmp-1 to tmp-10)
Close INFILE
Open and sort each tmp-# to a new file sortedtmp-#
Mergesort sortedtmp-[1-10] (i.e. open all 10 files and read them simultaneously), skipping duplicates and writing each iteration to the end output file
This will be safer, for very large files, than slurping.
Parts 2 & 3 could be changed to a random# instead of a hash number mod 10.
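A rough Perl sketch of the partitioning in steps 2 and 3 (the INFILE and tmp-# names are stand-ins); identical lines always land in the same bucket, so each bucket can then be sorted or de-duplicated independently:
use strict;
use warnings;
use Digest::MD5 qw(md5);

# open ten bucket files, tmp-0 .. tmp-9
my @buckets = map {
    open my $fh, '>', "tmp-$_" or die "Cannot open tmp-$_: $!";
    $fh;
} 0 .. 9;

open my $in, '<', 'INFILE' or die "Cannot open INFILE: $!";
while (my $line = <$in>) {
    my $n = unpack('N', md5($line)) % 10;      # hash the line, take it mod 10
    print { $buckets[$n] } $line;
}
close $in;
close $_ for @buckets;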
Here's a script BigSort that may help (though I haven't tested it):
# BigSort
#
# sort big file
#
# $1 input file
# $2 output file
#
# equ sort -t";" -k 1,1 $1 > $2
BigSort()
{
if [ -s $1 ]; then
rm $1.split.* > /dev/null 2>&1
split -l 2500 -a 5 $1 $1.split.
rm $1.sort > /dev/null 2>&1
touch $1.sort1
for FILE in `ls $1.split.*`
do
echo "sort $FILE"
sort -t";" -k 1,1 $FILE > $FILE.sort
sort -m -t";" -k 1,1 $1.sort1 $FILE.sort > $1.sort2
mv $1.sort2 $1.sort1
done
mv $1.sort1 $2
rm $1.split.* > /dev/null 2>&1
else
# work for empty file !
cp $1 $2
fi
}

Using Perl to cleanup a filesystem with one or more duplicates

I have two disks: one an ad-hoc backup disk, which is a mess with duplicates everywhere, and another disk in my laptop, which is an equal mess. I need to back up unique files and delete duplicates. So, I need to do the following:
Find all non-zero size files
Calculate the MD5 digest of all files
Find files with duplicate file names
Separate unique files from master and other copies.
With the output of this script I will:
Backup the unique and master files
Delete the other copies
Unique file = no other copies
Master copy = first instance, where other copies exist, possibly matching preferential path
Other copies = not master copies
I've created the appended script, which seems to make sense to me, but:
total files != unique files + master copies + other copies
I have two questions:
Where's the error in my logic?
Is there a more efficient way of doing this?
I chose disk hashes, so that I don't run out of memory when processing enormous file lists.
#!/usr/bin/perl
use strict;
use warnings;
use DB_File;
use File::Spec;
use Digest::MD5;
my $path_pref = '/usr/local/bin';
my $base = '/var/backup/test';
my $find = "$base/find.txt";
my $files = "$base/files.txt";
my $db_duplicate_file = "$base/duplicate.db";
my $db_duplicate_count_file = "$base/duplicate_count.db";
my $db_unique_file = "$base/unique.db";
my $db_master_copy_file = "$base/master_copy.db";
my $db_other_copy_file = "$base/other_copy.db";
open (FIND, "< $find");
open (FILES, "> $files");
print "Extracting non-zero files from:\n\t$find\n";
my $total_files = 0;
while (my $path = <FIND>) {
chomp($path);
next if ($path =~ /^\s*$/);
if (-f $path && -s $path) {
print FILES "$path\n";
$total_files++;
printf "\r$total_files";
}
}
close(FIND);
close(FILES);
open (FILES, "< $files");
sub compare {
my ($key1, $key2) = @_;
$key1 cmp $key2;
}
$DB_BTREE->{'compare'} = \&compare;
my %duplicate_count = ();
tie %duplicate_count, "DB_File", $db_duplicate_count_file, O_RDWR|O_CREAT, 0666, $DB_BTREE
or die "Cannot open $db_duplicate_count_file: $!\n";
my %unique = ();
tie %unique, "DB_File", $db_unique_file, O_RDWR|O_CREAT, 0666, $DB_BTREE
or die "Cannot open $db_unique_file: $!\n";
my %master_copy = ();
tie %master_copy, "DB_File", $db_master_copy_file, O_RDWR|O_CREAT, 0666, $DB_BTREE
or die "Cannot open $db_master_copy_file: $!\n";
my %other_copy = ();
tie %other_copy, "DB_File", $db_other_copy_file, O_RDWR|O_CREAT, 0666, $DB_BTREE
or die "Cannot open $db_other_copy_file: $!\n";
print "\nFinding duplicate filenames and calculating their MD5 digests\n";
my $file_counter = 0;
my $percent_complete = 0;
while (my $path = <FILES>) {
$file_counter++;
# remove trailing whitespace
chomp($path);
# extract filename from path
my ($vol,$dir,$filename) = File::Spec->splitpath($path);
# calculate the file's MD5 digest
open(FILE, $path) or die "Can't open $path: $!";
binmode(FILE);
my $md5digest = Digest::MD5->new->addfile(*FILE)->hexdigest;
close(FILE);
# filename not stored as duplicate
if (!exists($duplicate_count{$filename})) {
# assume unique
$unique{$md5digest} = $path;
# which implies 0 duplicates
$duplicate_count{$filename} = 0;
}
# filename already found
else {
# delete unique record
delete($unique{$md5digest});
# second duplicate
if ($duplicate_count{$filename}) {
$duplicate_count{$filename}++;
}
# first duplicate
else {
$duplicate_count{$filename} = 1;
}
# the master copy is already assigned
if (exists($master_copy{$md5digest})) {
# the current path matches $path_pref, so becomes our new master copy
if ($path =~ qq|^$path_pref|) {
$master_copy{$md5digest} = $path;
}
else {
# this one is a secondary copy
$other_copy{$path} = $md5digest;
# store with path as key, as there are duplicate digests
}
}
# assume this is the master copy
else {
$master_copy{$md5digest} = $path;
}
}
$percent_complete = int(($file_counter/$total_files)*100);
printf("\rProgress: $percent_complete %%");
}
close(FILES);
# Write out data to text files for debugging
open (UNIQUE, "> $base/unique.txt");
open (UNIQUE_MD5, "> $base/unique_md5.txt");
print "\n\nUnique files: ",scalar keys %unique,"\n";
foreach my $key (keys %unique) {
print UNIQUE "$key\t", $unique{$key}, "\n";
print UNIQUE_MD5 "$key\n";
}
close UNIQUE;
close UNIQUE_MD5;
open (MASTER, "> $base/master_copy.txt");
open (MASTER_MD5, "> $base/master_copy_md5.txt");
print "Master copies: ",scalar keys %master_copy,"\n";
foreach my $key (keys %master_copy) {
print MASTER "$key\t", $master_copy{$key}, "\n";
print MASTER_MD5 "$key\n";
}
close MASTER;
close MASTER_MD5;
open (OTHER, "> $base/other_copy.txt");
open (OTHER_MD5, "> $base/other_copy_md5.txt");
print "Other copies: ",scalar keys %other_copy,"\n";
foreach my $key (keys %other_copy) {
print OTHER $other_copy{$key}, "\t$key\n";
print OTHER_MD5 "$other_copy{$key}\n";
}
close OTHER;
close OTHER_MD5;
print "\n";
untie %duplicate_count;
untie %unique;
untie %master_copy;
untie %other_copy;
print "\n";
Looking at the algorithm, I think I see why you are leaking files. The first time you encounter a file copy, you label it "unique":
if (!exists($duplicate_count{$filename})) {
# assume unique
$unique{$md5digest} = $path;
# which implies 0 duplicates
$duplicate_count{$filename} = 0;
}
The next time, you delete that unique record, without storing the path:
# delete unique record
delete($unique{$md5digest});
So whatever filepath was at $unique{$md5digest} has been lost, and won't be included in unique + other + master.
You'll need something like:
if(my $original_path = delete $unique{$md5digest}) {
# Where should this one go?
}
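One possible way to fill in that placeholder, reusing the question's own $path_pref test, might be the following; it only re-files the path that would otherwise be lost and does not address the filename-vs-digest keying:
if ( my $original_path = delete $unique{$md5digest} ) {
    # the path that was sitting in %unique is really the first copy of this
    # digest, so classify it the same way later copies are classified
    if ( $original_path =~ qq|^$path_pref| ) {
        $master_copy{$md5digest} = $original_path;
    }
    else {
        $other_copy{$original_path} = $md5digest;
    }
}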
Also, as I mentioned in a comment above, IO::File would really clean up this code.
This isn't really a response to the larger logic of the program, but you should be checking for errors in open every time (and while we're at it, why not use the more modern form of open with lexical filehandles and three arguments):
open my $unique, '>', "$base/unique.txt"
or die "Can't open $base/unique.txt for writing: $!";
If you don't want to write the check explicitly each time, you could also check out the autodie module.
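A minimal sketch of what that looks like (the $base path is the same value used in the question; the printed line is just a stand-in for the real output):
use strict;
use warnings;
use autodie;                 # open, close, etc. now die with a useful message on failure

my $base = '/var/backup/test';
open my $unique, '>', "$base/unique.txt";     # no explicit "or die ..." needed
print {$unique} "example line\n";             # stands in for the script's real output
close $unique;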
One apparent optimization is to use file size as an initial comparison basis, and only compute MD5 for files below a certain size or if you have a collision of two files with the same size. The larger a given file is on disc, the more costly the MD5 computation, but also the less likely its exact size will conflict with another file on the system. You can probably save yourself a lot of runtime that way.
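A rough sketch of that size-first grouping, adapted to the question's files list; the list-file path and the digest bookkeeping are stand-ins, not the questioner's exact code:
use strict;
use warnings;
use Digest::MD5;

my %paths_by_size;
open my $files, '<', '/var/backup/test/files.txt' or die $!;   # the list written earlier
while (my $path = <$files>) {
    chomp $path;
    push @{ $paths_by_size{ -s $path } }, $path;
}
close $files;

for my $size (keys %paths_by_size) {
    my $group = $paths_by_size{$size};
    next if @$group < 2;                 # a file with a unique size has no duplicate
    for my $path (@$group) {
        open my $fh, '<', $path or die "Can't open $path: $!";
        binmode $fh;
        my $md5digest = Digest::MD5->new->addfile($fh)->hexdigest;
        close $fh;
        # ...feed $md5digest into the existing master/other bookkeeping...
    }
}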
You also might want to consider changing your approach for certain kinds of files that contain embedded meta-data that might change without changing the underlying data, so you can find additional dupes even if the MD5's don't match. I'm speaking of course of MP3 or other music files that have metadata tags that might be updated by classifiers or player programs, but which otherwise contain the same audio bits.
See here for related discussion of solutions in the abstract:
https://stackoverflow.com/questions/405628/what-is-the-best-method-to-remove-duplicate-image-files-from-your-computer
IMPORTANT note: as much as we'd like to believe that 2 files with the same MD5 are the same file, that is not necessarily true. If your data means anything to you, once you've narrowed it down to a list of candidates that MD5 tells you are the same file, you need to run through every bit of those files linearly to check they are in fact the same.
Put it this way: given a hash function (which MD5 is) of size 1 bit, there are only 2 possible combinations:
0 1
If your hash function told you 2 files both returned a "1", you would not assume they are the same file.
Given a hash of 2 bits, there are only 4 possible combinations:
00 01 10 11
Two files returning the same value you would not assume to be the same file.
Given a hash of 3 bits, there are only 8 possible combinations
000 001 010 011
100 101 110 111
Two files returning the same value you would not assume to be the same file.
This pattern goes on in ever increasing amounts, to the point that people for some bizarre reason start putting "chance" into the equation. Even at 128 bits (MD5), 2 files sharing the same hash does not mean they are in fact the same file. The only way to know is by comparing every bit.
There is a minor optimization if you read them start to end, because you can stop reading as soon as you find a differing bit, but to confirm they are identical you need to read every bit.
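A minimal sketch of that confirmation step, reading both files in chunks and stopping at the first difference (the chunk size is arbitrary); it would be called on each pair of MD5-matched candidates before treating them as duplicates:
use strict;
use warnings;

sub files_identical {
    my ($path_a, $path_b) = @_;
    return 0 unless -s $path_a == -s $path_b;          # different sizes: not identical
    open my $fh_a, '<', $path_a or die "Can't open $path_a: $!";
    open my $fh_b, '<', $path_b or die "Can't open $path_b: $!";
    binmode $_ for $fh_a, $fh_b;
    while (1) {
        my $got_a = read $fh_a, my $chunk_a, 65536;
        my $got_b = read $fh_b, my $chunk_b, 65536;
        return 0 if $got_a != $got_b;                  # defensive; sizes already matched
        last if $got_a == 0;                           # both files exhausted
        return 0 if $chunk_a ne $chunk_b;              # stop at the first difference
    }
    return 1;                                          # every byte matched
}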