Why is my Tie::IxHash program taking a long time? - perl

Basically, I have a script to create a hash for COGs with corresponding gene IDs:
# Open directory and get all the files in it
opendir(DIR, "/my/path/to/COG/");
my @infiles = grep(/OG-.*\.fasta/, readdir(DIR));
closedir(DIR);
# Create hash for COGs and their corresponding gene IDs
tie my %ids_for, 'Tie::IxHash';
if (! -e '/my/path/to/COG/COG_hash.ref') {
for my $infile (@infiles) {
## $infile
%ids_for = (%ids_for, read_COG_fasta($infile));
}
## %ids_for
store \%ids_for, '/my/path/to/COG/COG_hash.ref';
}
my $id_ref = retrieve('/my/path/to/COG/COG_hash.ref');
%ids_for = %$id_ref;
## %ids_for
The problem isn't that it doesn't work (at least I think it does), but that it is extremely slow. When I tried a test run, it looked like it would take weeks to produce a result. Somehow the hash creation is really, really slow, and I'm sure there is some way to optimize it so it runs much faster.
Ideally, the paths should be inputs to the script, so there would be no need to keep editing the script whenever a path changes.
It would also be great if there were a way to see the progress of the hash creation, for example showing that it is 25% done, 50% done, 75% done and finally 100% done. For this last point I have seen things like use Term::ProgressBar, but I am not sure whether it would be appropriate in this case.

Do you really need Tie::IxHash?
That aside, I suspect your culprit is this set of lines:
for my $infile (@infiles) {
## $infile
%ids_for = (%ids_for, read_COG_fasta($infile));
}
To add keys to the hash, you are flattening the current hash into a list of key-value pairs, appending the new pairs, and then assigning the whole list back to the hash. Every iteration copies everything accumulated so far, so the cost grows quadratically with the number of keys.
What happens if you take the results from read_COG_fasta and add the keys one at a time?
for my $infile (@infiles) {
my %new_hash = read_COG_fasta($infile);
foreach my $key ( keys %new_hash ) {
$ids_for{$key} = $new_hash{$key};
}
}
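A hash slice does the same merge without the inner loop; a small sketch using the same variables (this also works with the tied Tie::IxHash hash):
for my $infile (@infiles) {
    my %new_hash = read_COG_fasta($infile);
    # Assign all of the new key-value pairs in one go via a hash slice.
    @ids_for{ keys %new_hash } = values %new_hash;
}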
As for progress, I usually have something like this when I'm trying to figure out something:
use v5.26;
my $file_count = @files;
foreach my $n ( 0 .. $#files ) {
say "[$n/$file_count] Processing $files[$n]";
my %result = ...;
printf "\tGot %d results\n", scalar %result; # v5.26 feature!
}
You could do the same sort of thing with the keys that you get back so you can track the size.
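As for the Term::ProgressBar idea mentioned in the question, a minimal sketch (assuming the @infiles array, read_COG_fasta, and %ids_for from above; the bar simply advances once per file):
use Term::ProgressBar;

my $progress = Term::ProgressBar->new({
    name  => 'COG files',
    count => scalar @infiles,
    ETA   => 'linear',
});

for my $n ( 0 .. $#infiles ) {
    my %new_hash = read_COG_fasta( $infiles[$n] );
    @ids_for{ keys %new_hash } = values %new_hash;
    $progress->update( $n + 1 );
}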

Related

How to change a name in a file for another in perl with hash?

I have a file like this.
>;1;
AACTCTGGGACAATGGCACACGGGAAACAGATAATGAACGATCAGCACAGGGAACTAGCG
>;2;
AACTCTGGGACAATGGCACACGGGAAACAGATAATGAACGATCAGCACAGGGAACTAGCG
>;3;
AACTCTGGGACAATGGCACACGGGAAACAGATAATGAACGATCAGCACAGGGAACTAGCG
I would like to change each number to a corresponding string.
I wrote the following Perl program but I don't know what is wrong with it.
%lista2 = (
1 => "CAT00.3",
2 => "CAT43.1",
3 => "CAT40.3"
);
open(OA, ">file2.txt");
foreach $key ( keys %lista2 ) {
open(SAL, "file.txt");
while ( <SAL> ) {
chomp;
if( />/ ) {
@w = split("\t");
$r = 0;
s/\;//g;
if ( /%lista2[i]/ ) {
print OA "$_ $lista2{$key}\n" ;
$r = 1;
}
}
}
}
close(SAL);
close(OA);
I want to get this
>CAT00.3
AACTCTGGGACAATGGCACACGGGAAACAGATAATGAACGATCAGCACAGGGAACTAGCG
>CAT43.1
AACTCTGGGACAATGGCACACGGGAAACAGATAATGAACGATCAGCACAGGGAACTAGCG
>CAT40.3
AACTCTGGGACAATGGCACACGGGAAACAGATAATGAACGATCAGCACAGGGAACTAGCG
But I don't know how to do that.
Well, you were headed in the right direction, I guess, but somewhere along the path you got lost and it seems like you started running in random directions. There are a lot of things wrong in your code.
For instance it's funny how you can have those two lines
if ( /%lista2[i]/ ) {
print OA "$_ $lista2{$key}\n" ;
having one correct attempt at accessing a hash value ($lista2{$key}) and a totally wrong one (%lista2[i]) so close together.
Then, since you only print to OA when the pattern matches, you would completely drop all other lines from the output. Your expected output indicates you don't want that.
Furthermore, change the loop nesting. Instead of opening the file over and over again, open it once, iterate over its lines, and in each such iteration iterate over the hash keys. Your way wasn't strictly wrong, but opening and closing files doesn't come cheap. And speaking of closing files: you didn't close SAL in the body of your outer loop, which is where you reopen it.
And use at least some very basic error handling. Check whether open has failed. With a wrong file name, the program currently fails without any indication of why. Make your life easier.
Why use chomp() if you later append a \n to the output anyway and turn it back into a line? Skip that.
I don't know how to interpret these lines:
@w = split("\t");
$r = 0;
s/\;//g;
Is that some leftovers? They don't do anything useful.
Last but not least, it's recommended to use strict; and possibly use warnings; to get pointers to problematic spots.
Here's one that passes your example.
#!/usr/bin/perl
use strict;
use warnings;
my %lista2 =
(
1 => "CAT00.3",
2 => "CAT43.1",
3 => "CAT40.3"
);
if (!open(OA, ">file2.txt")) {
die($!);
}
if (!open(SAL, "file.txt")) {
die($!);
}
foreach my $line (<SAL>) {
foreach my $key (keys(%lista2)) {
if ($line =~ s/^>;$key;$/>$lista2{$key}/) {
last;
}
}
print(OA $line);
}
close(SAL);
close(OA);
In fact, at its core this can be simplified to a pattern replacement. No splitting or anything else is needed. But patterns can be confusing if you're a beginner.
I also raised the level of verbosity to make things clearer.
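For the curious, a sketch of that pattern-replacement core, using the same %lista2 and filehandles as above (the /e modifier evaluates the replacement as code):
while (my $line = <SAL>) {
    # Rewrite ">;N;" headers in one substitution; all other lines pass through unchanged.
    $line =~ s{^>;(\d+);$}{ exists $lista2{$1} ? ">$lista2{$1}" : $& }e;
    print OA $line;
}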

To increase the performance of a script in perl

I have 2 files here which is newFile and LookupFile (which are huge files).
The contents of newFile will be searched for in LookupFile and further processing happens. This script is working fine; however, it is taking a long time to execute. Could you please let me know what can be done here to increase the performance? Could you please let me know if we can convert the files into hashes to increase performance?
My file looks like below
NewFile and LookupFile:
acl sourceipaddress subnet destinationipaddress subnet portnumber
.
.
Script:
#!/usr/bin/perl
use strict;
use warnings;
use File::Slurp::Tiny 'read_file';
use File::Copy;
use Data::Dumper;
use File::Copy qw(copy);
my %options = (
LookupFile => {
type => "=s",
help => "File name",
variable => 'gitFile',
required => 1,
}, newFile => {
type => "=s",
help => "file containing the acl lines to checked for",
variable => 'newFile',
required => 1,
} );
$opts->addOptions(%options);
$opts->parse();
$opts->validate();
my $newFile = $opts->getOption('newFile');
my $LookupFile = $opts->getOption('LookupFile');
my @LookupFile = read_file ("$LookupFile");
my @newFile = read_file ("$newFile");
@LookupFile = split (/\n/,$LookupFile[0]);
@newLines = split (/\n/,$newFile[0]);
open FILE1, "$newFile" or die "Could not open file: $! \n";
while(my $line = <FILE1>)
{
chomp($line);
my @columns = split(' ',$line);
$var = @columns;
my $fld1;
my $cnt;
my $fld2;
my $fld3;
my $fld4;
my $fld5;
my $dIP;
my $sIP;
my $sHOST;
my $dHOST;
if(....)
if (....) further checks and processing
)
First thing to do before any optimization is to profile your code. Rather than guessing, this will tell you what lines are taking up the most time, and how often they're called. Devel::NYTProf is a good tool for the job.
This is a problem.
my @LookupFile = read_file ("$LookupFile");
my @newFile = read_file ("$newFile");
@LookupFile = split (/\n/,$LookupFile[0]);
@newLines = split (/\n/,$newFile[0]);
read_file reads the whole file in as one big string (it should be my $contents = read_file(...); using an array is awkward). Then the code splits the whole thing on newlines, copying everything in the file again. This is slow, hard on memory, and unnecessary.
Instead, use read_lines. This will split the file into lines as it reads avoiding a costly copy.
my @lookups = read_lines($LookupFile);
my @new = read_lines($newFile);
Next problem is $newFile is opened again and iterated through line by line.
open FILE1, "$newFile" or die "Could not open file: $! \n";
while(my $line = <FILE1>) {
This is a waste as you've already read that file into memory. Use one or the other. However, in general, it's better to work with files line-by-line than to slurp them all into memory.
The above will speed things up, but they don't get at the crux of the problem. This is likely the real problem...
The contents in newFile will be searched in LookupFile and further processing happens.
You didn't show what you're doing, but I'm going to imagine it looks something like this...
for my $line (@lines) {
for my $thing (@lookups) {
...
}
}
That is, for each line in one file, you're looking at every line in the other. This is what is known as an O(n^2) algorithm meaning that as you double the size of the files you quadruple the time.
If each file has 10 lines, it will take 100 (10^2) turns through the inner loop. If they have 100 lines, it will take 10,000 (100^2). With 1,000 lines it will take 1,000,000 times.
With O(n^2) as the sizes get bigger things get very slow very quickly.
Could you please let me know if we can convert files into hash to increase performance?
You've got the right idea. You could convert the lookup file to a hash to speed things up. Let's say they're both lists of words.
# input
foo
bar
biff
up
down
# lookup
foo
bar
baz
And you want to check if any lines in input match any lines in lookup.
First you'd read lookup in and turn it into a hash. Then you'd read input and check if each line is in the hash.
use strict;
use warnings;
use autodie;
use v5.10;
...
# Populate `%lookup`
my %lookup;
{
open my $fh, $lookupFile;
while(my $line = <$fh>) {
chomp $line;
$lookup{$line} = 1;
}
}
# Check if any lines are in %lookup
open my $fh, $inputFile;
while(my $line = <$fh>) {
chomp $line;
print $line if $lookup{$line};
}
This way you only iterate through each file once. This is an O(n) algorithm, meaning it scales linearly, because hash lookups are basically instantaneous. If each file has 10 lines, it will only take 10 iterations of each loop. If they have 100 lines, it will only take 100 iterations of each loop. 1000 lines, 1000 iterations.
Finally, what you really want to do is skip all this and create a database for your data and search that. SQLite is a SQL database that requires no server, just a file. Put your data in there and perform SQL queries on it using DBD::SQLite.
While this means you have to learn SQL, and there is a cost to building and maintaining the database, it is fast and, most importantly, very flexible. SQLite can do all sorts of searches quickly without you having to write a bunch of extra code. SQL databases are very common, so learning SQL is a good investment.
Since you're splitting the file up with my @columns = split(' ',$line); it's probably a file with many fields in it. That will likely map to a SQL table very well.
SQLite can even import files like that for you. See this answer for details on how to do that.
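A hedged sketch of the SQLite route using DBI and DBD::SQLite; the database file name (lookup.db) and the single-column table are illustrative assumptions, not anything from the script above:
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Build (or reuse) a small SQLite database holding one lookup line per row.
my $dbh = DBI->connect("dbi:SQLite:dbname=lookup.db", "", "",
    { RaiseError => 1, AutoCommit => 0 });
$dbh->do("CREATE TABLE IF NOT EXISTS lookup (line TEXT PRIMARY KEY)");

my $insert = $dbh->prepare("INSERT OR IGNORE INTO lookup (line) VALUES (?)");
open my $lfh, '<', 'LookupFile' or die "Can't open LookupFile: $!";
while (my $line = <$lfh>) {
    chomp $line;
    $insert->execute($line);
}
close $lfh;
$dbh->commit;

# Each line of newFile is now a single indexed query instead of a full scan.
my $check = $dbh->prepare("SELECT 1 FROM lookup WHERE line = ?");
open my $nfh, '<', 'newFile' or die "Can't open newFile: $!";
while (my $line = <$nfh>) {
    chomp $line;
    $check->execute($line);
    my ($found) = $check->fetchrow_array;
    $check->finish;
    print "$line\n" if $found;
}
close $nfh;
$dbh->disconnect;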

How to find common parts in a paths with perl?

Having several paths, like:
1: /abc/def/some/common/part/xyz/file1.ext
2: /other/path/to/7433/qwe/some/common/part/anotherfile.ext
3: /misc/path/7433/qwe/some/common/part/filexx.ext
4: /2443/totally/different/path/file9988.ext
5: /abc/another/same/path/to/ppp/thisfile.ext
6: /deep1/deep2/another/same/path/to/diffone/filename.ext
I need to find the common parts - every possible one, e.g. in the above the common parts that could be found are:
/some/common/part/ - in the paths 1,2,3
/another/same/path/to/ - in the 5,6
/path/to/ - in the 2,5,6
/path/ - 2,3,4,5,6
etc..
I simply have absolutely no idea how to solve this - which approach is a good one:
string based - somehow find the common parts of the strings
list based - split all paths into lists and somehow compare the arrays for common elements
tree/graph based - somehow find the common parts of a graph
other?
Once I get some direction on how to solve this problem, I'm (probably) able to code it myself - so I don't want a free programming service - but I do need some guidance on how to start.
I'm sure there is already some CPAN module that could help me, but I really have no idea how to find the right module for the above problem among the 30k modules on the list. :(
EDIT - For what i need this:
Having approx. 200k files in 10k directories, and many of them "belong together", like:
/u/some/path/project1/subprojct/file1
/u/backup/of/work/date/project1/subproject/file2
/u/backup_of_backup/of/work/date/project1/subproject/file2
/u/new/addtions/to/projec1/subproject/file3
The files are of different kinds (pdf, images, doc, txt and so on); several are identical (like file2 above - easy to filter with Digest::MD5), but the only way to "group them together" is based on the "common parts" of a path - e.g. "project1/subproject" and so on..
Other files have the same MD5, so I can filter out duplicates, but they are in different trees, like
/u/path/some/file
/u/path/lastest_project/menu/file
/u/path/jquery/menu/file
/u/path/example/solution/jquery/menu/file
So the files are the same (identical MD5), but I need to somehow move one copy to the right place (and delete the others), and somehow determine the "most used" common paths and collect tags... (old path elements are tags)
The idea behind is:
if the same-MD5 files are mostly stored under some common path, I can make a decision about where to move the one copy...
It is more complicated than that, but the above is enough for the explanation ;)
I simply need to lower the entropy on my HDD ;)
There is some discussion about finding the longest common consecutive substrings in this thread: http://www.nntp.perl.org/group/perl.fwp/2002/02/msg1662.html
The "winner" appears to be the following code, but there are a few other things in there you could try:
#!/usr/bin/perl
use strict;
use warnings;
sub lcs {
my $this = shift;
my $that = shift;
my $str = join "\0", $this, $that;
my $len = 1;
my $lcs;
while ($str =~ m{ ([^\0]{$len,}) (?= [^\0]* \0 [^\0]*? \1 ) }xg) {
$lcs = $1;
$len = 1 + length($1);
}
if ($len == 1) { print("No common substring\n"); }
else {
print("Longest common substring of length $len: \"");
print("$lcs");
print("\"\n");
}
}
Keep in mind you would have to adjust it a little bit to account for the fact that you only want entire subdirectories that match... ie, change if ($len == 1) to something like if ($len == 1 or $lcs !~ /^\// or $lcs !~ /\/$/)
You would also have to add some bookkeeping to keep track of which ones match. When I ran this code on your examples above it also found the /abc/ match in lines 1 & 5.
One thing that may or may not be a problem is that the following two lines:
/abc/another/same/path/to/ppp/thisfile.ext
/abc/another/different/path/to/ppp/otherfile.ext
Would match on:
/abc/another/
But not on:
/path/to/ppp/
But -- here's the bad news -- you will have to do O(n^2) comparisons with n=200,000 files. That could take an obscene amount of time.
Another solution would be to go through each path in your list, add all of its possible directory paths as keys to a hash and push the file itself to the hash (so that the value is an array of files that have this path in it). Something like this:
use strict;
use warnings;
my %links;
open my $fh, "<", 'filename' or die "Can't open $!";
while (my $line = <$fh>) {
chomp($line);
my @dirs = split /\//, $line;
for my $i (0..$#dirs) {
if ($i == $#dirs) {
push(@{ $links{$dirs[$i]} }, $line);
}
for my $j ($i+1..$#dirs) {
push(@{ $links{join("/",@dirs[$i..$j])} }, $line);
#PROCESS THIS if length of array is > 1
}
}
}
Of course, this would take an obscene amount of memory. With 200,000 files to process, you might have a hard time no matter what you try, but maybe you can break it up into more manageable chunks. Hopefully, this will give you a starting point.
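For the "PROCESS THIS" step above, one way to use the finished %links hash is to report every partial path shared by more than one file; a sketch:
# Report shared partial paths, most widely shared first.
for my $partial ( sort { @{ $links{$b} } <=> @{ $links{$a} } } keys %links ) {
    my @shared = @{ $links{$partial} };
    next unless @shared > 1;    # skip partial paths that belong to a single file
    printf "%4d files share %s\n", scalar @shared, $partial;
}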
To solve this problem, you need the correct data structure. A hash that counts the partial paths works well:
use File::Spec;
my %Count_of = ();
while( <DATA> ){
my @names = File::Spec->splitdir( $_ );
# remove file
pop @names;
# if absolute path, remove empty names at start
shift @names while length( $names[0] ) == 0;
# don't count blank lines
next unless @names;
# move two cursors through the names
# and count the partial paths
# created from one to the other
for my $i ( 0 .. $#names ){
for my $j ( $i .. $#names ){
my $partial_path = File::Spec->catdir( @names[ $i .. $j ] );
$Count_of{ $partial_path } ++;
}
}
}
# now display the results
for my $path ( sort { $Count_of{$b} <=> $Count_of{$a} || $a cmp $b } keys %Count_of ){
# skip if singleton.
next if $Count_of{ $path } <= 1;
printf "%3d : %s\n", $Count_of{ $path }, $path;
}
__DATA__
/abc/def/some/common/part/xyz/file1.ext
/other/path/to/7433/qwe/some/common/part/anotherfile.ext
/misc/path/7433/qwe/some/common/part/filexx.ext
/2443/totally/different/path/file9988.ext
/abc/another/same/path/to/ppp/thisfile.ext
/deep1/deep2/another/same/path/to/diffone/filename.ext

Why does my Perl for loop exit early?

I am trying to get a Perl loop to work that operates on an array containing 6 elements. I want the loop to pull out two elements from the array, perform certain functions, and then loop back and pull out the next two elements, until the array runs out of elements. The problem is that the loop only pulls out the first two elements and then stops. Some help here would be greatly appreciated.
my open(infile, 'dnadata.txt');
my @data = < infile>;
chomp #data;
#print @data; #Debug
my $aminoacids = 'ARNDCQEGHILKMFPSTWYV';
my $aalen = length($aminoacids);
my $i=0;
my $j=0;
my @matrix = ();
for(my $i=0; $i<2; $i++){
for( my $j=0; $j<$aalen; $j++){
$matrix[$i][$j] = 0;
}
}
The guidelines for this program state that it should ignore the presence of gaps, which means that DNA code matched up with a gap should be ignored. So the code that is pushed through needs to have the alignment positions linked with gaps removed.
I need to divide the length of the array by two since I am comparing two sequences in this part of the loop.
#$lemseqcomp = $lenarray / 2;
#print $lenseqcomp;
#I need to initialize these saclar values.
$junk1 = " ";
$junk2 = " ";
$seq1 = " ";
$seq2 = " ";
This is the loop that is causing issues. I believe that the outer loop should go back to the array and pull out the next elements each time it loops, but it doesn't.
for($i=0; $i<$lenarray; $i++){
#This code should remove the the last value of the array once and
#then a second time. The sequences should be the same length at this point.
my $last1 =pop(@data1);
my $last2 =pop(@data1);
for($i=0; $i<length($last1); $i++){
my $letter1 = substr($last1, $i, 1);
my $letter2 = substr($last2, $i, 1);
if(($letter1 eq '-')|| ($letter2 eq '-')){
#I need to put the sequences I am getting rid of somewhere. Here is a good place as any.
$junk1 = $letter1 . $junk1;
$junk2 = $letter1 . $junk2;
}
else{
$seq1 = $letter1 . $seq1;
$seq2 = $letter2 . $seq2;
}
}
}
print "$seq1\n";
print "$seq2\n";
print "#data1\n";
I am actually trying to create a substitution matrix from scratch and return the data. The reason the code looks weird is that it isn't actually finished yet and I got stuck.
This is the test sequence if anyone is curious.
YFRFR
YF-FR
FRFRFR
ARFRFR
YFYFR-F
YFRFRYF
First off, if you're going to work with sequence data, use BioPerl. Life will be so much easier. However...
Since you know you'll be comparing the lines from your input file as pairs, it makes sense to read them into a data structure that reflects that. As suggested elsewhere, an array of pairs like @data = ([line1, line2], [line3, line4]) ensures that the correct pairs of lines always stay together.
What I'm not clear on is what you're trying to do:
a) are you generating a consensus sequence where the 2 sequences differ only by gaps, or
b) are your 2 sequences significantly different, and you're trying to exclude the non-aligning parts and then generate a consensus?
So, does the first pair represent your data, or is it more like the second?
ATCG---AAActctgGGGGG--taGC
ATCGcccAAActctgGGGGGTTtaGC
ATCG---AAActctgGGGGG--taGCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
ATCGcccAAActctgGGGGGTTtaGCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
The problem is that you're using $i as the counter variable for both your loops, so the inner loop modifies the counter out from under the outer loop. Try changing the inner loop's counter to $j, or using my to localize them properly.
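A minimal sketch of that fix, reusing the question's variables; I've also made the outer loop pop pairs until the array is exhausted, since each pass consumes two elements (that last change is my assumption about the intent):
while ( @data1 >= 2 ) {
    my $last1 = pop(@data1);
    my $last2 = pop(@data1);
    # The inner counter is now $j, so it no longer clobbers the outer loop's state.
    for ( my $j = 0; $j < length($last1); $j++ ) {
        my $letter1 = substr($last1, $j, 1);
        my $letter2 = substr($last2, $j, 1);
        if ( $letter1 eq '-' or $letter2 eq '-' ) {
            $junk1 = $letter1 . $junk1;
            $junk2 = $letter2 . $junk2;
        }
        else {
            $seq1 = $letter1 . $seq1;
            $seq2 = $letter2 . $seq2;
        }
    }
}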
Don't store your values as a flat array; store them as a two-dimensional array:
my @dataset = ([$val1, $val2], [$val3, $val4]);
or
my @dataset;
push (@dataset, [$val_n1, $val_n2]);
Then:
for my $value (@dataset) {
### Do stuff with $value->[0] and $value->[1]
}
There are lots of strange things in your code: you are initializing a matrix then not using it; reading a whole file into an array; scanning a string C style but then not doing anything with the unmatched values; and finally, just printing the two last processed values (which, in your case, are the two first elements of your array, since you are using pop.)
Here's a guess.
use strict;
my $aminoacids = 'ARNDCQEGHILKMFPSTWYV';
# Preparing a regular expression. This is kind of useful if processing large
# amounts of data. This will match anything that is not in the string above.
my $regex = qr([^$aminoacids]);
# Our work function.
sub do_something {
my ($a, $b) = @_;
$a =~ s/$regex//g; # removing unwanted characters
$b =~ s/$regex//g; # ditto
# Printing, saving, whatever...
print "Something: $a - $b\n";
return ($a, $b);
}
my $prev;
while (<>) {
chomp;
if ($prev) {
do_something($prev, $_);
$prev = undef;
} else {
$prev = $_;
}
}
print STDERR "Warning: trailing data: $prev\n"
if $prev;
Since you are a total Perl/programming newbie, I am going to show a rewrite of your first code block, then I'll offer you some general advice and links.
Let's look at your first block of sample code. There is a lot of stuff all strung together, and it's hard to follow. I, personally, am too dumb to remember more than a few things at a time, so I chop problems into small pieces that I can understand. This is (was) known as 'chunking'.
One easy way to chunk your program is to write subroutines. Take any particular action or idea that is likely to be repeated or would make the current section of code long and hard to understand, and wrap it up into a nice neat package and get it out of the way.
It also helps if you add space to your code to make it easier to read. Your mind is already struggling to grok the code soup, why make things harder than necessary? Grouping like things, using _ in names, blank lines and indentation all help. There are also conventions that can help, like making constant values (values that cannot or should not change) all capital letters.
use strict; # Using strict will help catch errors.
use warnings; # ditto for warnings.
use diagnostics; # diagnostics will help you understand the error messages
# Put constants at the top of your program.
# It makes them easy to find, and change as needed.
my $AMINO_ACIDS = 'ARNDCQEGHILKMFPSTWYV';
my $AMINO_COUNT = length($AMINO_ACIDS);
my $DATA_FILE = 'dnadata.txt';
# Here I am using subroutines to encapsulate complexity:
my @data = read_data_file( $DATA_FILE );
my @matrix = initialize_matrix( 2, $AMINO_COUNT, 0 );
# now we are done with the first block of code and can do more stuff
...
# This section down here looks kind of big, but it is mostly comments.
# Remove the didactic comments and suddenly the code is much more compact.
# Here are the actual subs that I abstracted out above.
# It helps to document your subs:
# - what they do
# - what arguments they take
# - what they return
# Read a data file and returns an array of dna strings read from the file.
#
# Arguments
# data_file => path to the data file to read
sub read_data_file {
my $data_file = shift;
# Here I am using a 3 argument open, and a lexical filehandle.
open( my $infile, '<', $data_file )
or die "Unable to open dnadata.txt - $!\n";
# I've left slurping the whole file intact, even though it can be very inefficient.
# Other times it is just what the doctor ordered.
my @data = <$infile>;
chomp @data;
# I return the data array rather than a reference
# to keep things simple since you are just learning.
#
# In my code, I'd pass a reference.
return @data;
}
# Initialize a matrix (or 2-d array) with a specified value.
#
# Arguments
# $i => width of matrix
# $j => height of matrix
# $value => initial value
sub initialize_matrix {
my $i = shift;
my $j = shift;
my $value = shift;
# I use two powerful perlisms here: map and the range operator.
#
# map is a list construction function that is very, very powerful.
# It calls the code in brackets for each member of the list it operates against.
# Think of it as a for loop that keeps the result of each iteration,
# and then builds an array out of the results.
#
# The range operator `..` creates a list of intervening values. For example:
# (1..5) is the same as (1, 2, 3, 4, 5)
my @matrix = map {
[ ($value) x $i ]
} 1..$j;
# So here we make a list of numbers from 1 to $j.
# For each member of the list we
# create an anonymous array containing a list of $i copies of $value.
# Then we add the anonymous array to the matrix.
return @matrix;
}
Now that the code rewrite is done, here are some links:
Here's a response I wrote titled "How to write a program". It offers some basic guidelines on how to approach writing software projects from specification. It is aimed at beginners. I hope you find it helpful. If nothing else, the links in it should be handy.
For a beginning programmer, beginning with Perl, there is no better book than Learning Perl.
I also recommend heading over to Perlmonks for Perl help and mentoring. It is an active Perl specific community site with very smart, friendly people who are happy to help you. Kind of like Stack Overflow, but more focused.
Good luck!
Instead of using a C-style for loop, you can read data from an array two elements at a time using splice inside a while loop:
while (my ($letter1, $letter2) = splice(@data, 0, 2))
{
# stuff...
}
I've cleaned up some of your other code below:
use strict;
use warnings;
open(my $infile, '<', 'dnadata.txt');
my @data = <$infile>;
close $infile;
chomp @data;
my $aminoacids = 'ARNDCQEGHILKMFPSTWYV';
my $aalen = length($aminoacids);
# initialize a 2 x 20 array for holding the amino acid data
my $matrix;
foreach my $i (0 .. 1)
{
foreach my $j (0 .. $aalen-1)
{
$matrix->[$i][$j] = 0;
}
}
# Process all letters in the DNA data
while (my ($letter1, $letter2) = splice(@data, 0, 2))
{
# do something... not sure what?
# you appear to want to look up the letters in a reference table, perhaps $aminoacids?
}
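If the intent is to tally how often each amino acid occurs in each sequence, here's a hedged guess at what that lookup might look like ($seq1 and $seq2 stand for one aligned pair of equal length taken from @data; the gap handling mirrors the earlier question):
for my $pos ( 0 .. length($seq1) - 1 ) {
    my $aa1 = substr($seq1, $pos, 1);
    my $aa2 = substr($seq2, $pos, 1);
    next if $aa1 eq '-' or $aa2 eq '-';      # ignore gap columns
    my $col1 = index($aminoacids, $aa1);     # column index in the 2 x 20 matrix
    my $col2 = index($aminoacids, $aa2);
    $matrix->[0][$col1]++ if $col1 >= 0;
    $matrix->[1][$col2]++ if $col2 >= 0;
}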

Using Perl to cleanup a filesystem with one or more duplicates

I have two disks: an ad-hoc backup disk, which is a mess with duplicates everywhere, and another disk in my laptop, which is an equal mess. I need to back up unique files and delete duplicates. So, I need to do the following:
Find all non-zero size files
Calculate the MD5 digest of all files
Find files with duplicate file names
Separate unique files from master and other copies.
With the output of this script I will:
Backup the unique and master files
Delete the other copies
Unique file = no other copies
Master copy = first instance, where other copies exist, possibly matching preferential path
Other copies = not master copies
I've created the appended script, which seems to make sense to me, but:
total files != unique files + master copies + other copies
I have two questions:
Where's the error in my logic?
Is there a more efficient way of doing this?
I chose disk-based hashes so that I don't run out of memory when processing enormous file lists.
#!/usr/bin/perl
use strict;
use warnings;
use DB_File;
use File::Spec;
use Digest::MD5;
my $path_pref = '/usr/local/bin';
my $base = '/var/backup/test';
my $find = "$base/find.txt";
my $files = "$base/files.txt";
my $db_duplicate_file = "$base/duplicate.db";
my $db_duplicate_count_file = "$base/duplicate_count.db";
my $db_unique_file = "$base/unique.db";
my $db_master_copy_file = "$base/master_copy.db";
my $db_other_copy_file = "$base/other_copy.db";
open (FIND, "< $find");
open (FILES, "> $files");
print "Extracting non-zero files from:\n\t$find\n";
my $total_files = 0;
while (my $path = <FIND>) {
chomp($path);
next if ($path =~ /^\s*$/);
if (-f $path && -s $path) {
print FILES "$path\n";
$total_files++;
printf "\r$total_files";
}
}
close(FIND);
close(FILES);
open (FILES, "< $files");
sub compare {
my ($key1, $key2) = @_;
$key1 cmp $key2;
}
$DB_BTREE->{'compare'} = \&compare;
my %duplicate_count = ();
tie %duplicate_count, "DB_File", $db_duplicate_count_file, O_RDWR|O_CREAT, 0666, $DB_BTREE
or die "Cannot open $db_duplicate_count_file: $!\n";
my %unique = ();
tie %unique, "DB_File", $db_unique_file, O_RDWR|O_CREAT, 0666, $DB_BTREE
or die "Cannot open $db_unique_file: $!\n";
my %master_copy = ();
tie %master_copy, "DB_File", $db_master_copy_file, O_RDWR|O_CREAT, 0666, $DB_BTREE
or die "Cannot open $db_master_copy_file: $!\n";
my %other_copy = ();
tie %other_copy, "DB_File", $db_other_copy_file, O_RDWR|O_CREAT, 0666, $DB_BTREE
or die "Cannot open $db_other_copy_file: $!\n";
print "\nFinding duplicate filenames and calculating their MD5 digests\n";
my $file_counter = 0;
my $percent_complete = 0;
while (my $path = <FILES>) {
$file_counter++;
# remove trailing whitespace
chomp($path);
# extract filename from path
my ($vol,$dir,$filename) = File::Spec->splitpath($path);
# calculate the file's MD5 digest
open(FILE, $path) or die "Can't open $path: $!";
binmode(FILE);
my $md5digest = Digest::MD5->new->addfile(*FILE)->hexdigest;
close(FILE);
# filename not stored as duplicate
if (!exists($duplicate_count{$filename})) {
# assume unique
$unique{$md5digest} = $path;
# which implies 0 duplicates
$duplicate_count{$filename} = 0;
}
# filename already found
else {
# delete unique record
delete($unique{$md5digest});
# second duplicate
if ($duplicate_count{$filename}) {
$duplicate_count{$filename}++;
}
# first duplicate
else {
$duplicate_count{$filename} = 1;
}
# the master copy is already assigned
if (exists($master_copy{$md5digest})) {
# the current path matches $path_pref, so becomes our new master copy
if ($path =~ qq|^$path_pref|) {
$master_copy{$md5digest} = $path;
}
else {
# this one is a secondary copy
$other_copy{$path} = $md5digest;
# store with path as key, as there are duplicate digests
}
}
# assume this is the master copy
else {
$master_copy{$md5digest} = $path;
}
}
$percent_complete = int(($file_counter/$total_files)*100);
printf("\rProgress: $percent_complete %%");
}
close(FILES);
# Write out data to text files for debugging
open (UNIQUE, "> $base/unique.txt");
open (UNIQUE_MD5, "> $base/unique_md5.txt");
print "\n\nUnique files: ",scalar keys %unique,"\n";
foreach my $key (keys %unique) {
print UNIQUE "$key\t", $unique{$key}, "\n";
print UNIQUE_MD5 "$key\n";
}
close UNIQUE;
close UNIQUE_MD5;
open (MASTER, "> $base/master_copy.txt");
open (MASTER_MD5, "> $base/master_copy_md5.txt");
print "Master copies: ",scalar keys %master_copy,"\n";
foreach my $key (keys %master_copy) {
print MASTER "$key\t", $master_copy{$key}, "\n";
print MASTER_MD5 "$key\n";
}
close MASTER;
close MASTER_MD5;
open (OTHER, "> $base/other_copy.txt");
open (OTHER_MD5, "> $base/other_copy_md5.txt");
print "Other copies: ",scalar keys %other_copy,"\n";
foreach my $key (keys %other_copy) {
print OTHER $other_copy{$key}, "\t$key\n";
print OTHER_MD5 "$other_copy{$key}\n";
}
close OTHER;
close OTHER_MD5;
print "\n";
untie %duplicate_count;
untie %unique;
untie %master_copy;
untie %other_copy;
print "\n";
Looking at the algorithm, I think I see why you are leaking files. The first time you encounter a file copy, you label it "unique":
if (!exists($duplicate_count{$filename})) {
# assume unique
$unique{$md5digest} = $path;
# which implies 0 duplicates
$duplicate_count{$filename} = 0;
}
The next time, you delete that unique record, without storing the path:
# delete unique record
delete($unique{$md5digest});
So whatever file path was stored at $unique{$md5digest}, you've lost it, and it won't be included in unique + other + master.
You'll need something like:
if(my $original_path = delete $unique{$md5digest}) {
# Where should this one go?
}
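Something like the following would keep that path and reclassify it using the same master/other rule the rest of the script applies (a sketch against the existing hashes; \Q...\E just quotes the prefix for the regex):
if ( my $original_path = delete $unique{$md5digest} ) {
    if ( $original_path =~ /^\Q$path_pref\E/ ) {
        $master_copy{$md5digest} = $original_path;
    }
    else {
        $other_copy{$original_path} = $md5digest;
    }
}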
Also, as I mentioned in a comment above, IO::File would really clean up this code.
This isn't really a response to the larger logic of the program, but you should be checking for errors from open every time (and while we're at it, why not use the more modern form of open, with a lexical filehandle and three arguments):
open my $unique, '>', "$base/unique.txt"
or die "Can't open $base/unique.txt for writing: $!";
If you don't want to explicitly ask each time, you could also check out the autodie module.
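For reference, a tiny sketch of what autodie buys you here (it wraps open, close and friends so they die with a descriptive message on failure):
use autodie;

open my $unique, '>', "$base/unique.txt";    # no explicit "or die" needed any more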
One apparent optimization is to use file size as an initial comparison basis, and only compute MD5 for files below a certain size or when two files have the same size. The larger a given file is on disk, the more costly the MD5 computation, but also the less likely its exact size will conflict with another file on the system. You can probably save yourself a lot of runtime that way; a sketch follows.
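A sketch of that size-first pass; %paths_by_size and the classification comment are illustrative, but it reuses the FILES list of non-zero paths and Digest::MD5 from the script:
# Group paths by file size; only hash groups with more than one member.
my %paths_by_size;
while ( my $path = <FILES> ) {
    chomp $path;
    push @{ $paths_by_size{ -s $path } }, $path;
}

for my $size ( keys %paths_by_size ) {
    my @candidates = @{ $paths_by_size{$size} };
    next unless @candidates > 1;    # a unique size cannot have a duplicate
    for my $path (@candidates) {
        open my $fh, '<', $path or next;
        binmode $fh;
        my $md5digest = Digest::MD5->new->addfile($fh)->hexdigest;
        close $fh;
        # ...classify $path by $md5digest as the main script does...
    }
}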
You also might want to consider changing your approach for certain kinds of files that contain embedded meta-data that might change without changing the underlying data, so you can find additional dupes even if the MD5's don't match. I'm speaking of course of MP3 or other music files that have metadata tags that might be updated by classifiers or player programs, but which otherwise contain the same audio bits.
See this related question for solutions to the problem in the abstract:
https://stackoverflow.com/questions/405628/what-is-the-best-method-to-remove-duplicate-image-files-from-your-computer
IMPORTANT note: as much as we'd like to believe two files with the same MD5 are the same file, that is not necessarily true. If your data means anything to you, once you've narrowed it down to a list of candidates that MD5 says are the same file, you need to run through every bit of those files linearly to check that they are in fact the same.
Put it this way: given a hash function (which MD5 is) of size 1 bit, there are only 2 possible combinations.
0 1
if your hash function told you 2 files both returned a "1" you would not assume they are the same file.
Given a hash of 2 bits, there are only 4 possible combinations:
00 01 10 11
You would not assume 2 files returning the same value to be the same file.
Given a hash of 3 bits, there are only 8 possible combinations
000 001 010 011
100 101 110 111
2 files returning the same value you would not assume to be the same file.
This pattern goes on in ever-increasing amounts, to the point that people, for some bizarre reason, start putting "chance" into the equation. Even at 128 bits (MD5), 2 files sharing the same hash does not mean they are in fact the same file. The only way to know is by comparing every bit.
There is a minor optimization if you read them start to end, because you can stop reading as soon as you find a differing bit, but to confirm they are identical, you need to read every bit.
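Core Perl already ships a byte-by-byte check in File::Compare, which returns 0 when the two files are identical; a sketch of that final confirmation step (the two path variables are placeholders for a pair of MD5-matched files):
use File::Compare;

# Only treat two MD5-matched candidates as true duplicates after a full byte comparison.
if ( compare($master_path, $candidate_path) == 0 ) {
    print "$candidate_path is a byte-for-byte duplicate of $master_path\n";
}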