Perl - looping through two arrays to find "gaps" - perl

So my code setup is as per below
#!/usr/bin/perl -w
# vim: set expandtab ts=2 bg=dark smartindent shiftwidth=2 softtabstop=2 :
#
use strict;
use Data::Dumper;
use Getopt::Long qw(GetOptions);
####use warnings;
use 5.010;
####my #arr1 = ( "0", "1", "2", "3", "4"); # OK
my #arr1 = ( "0", "1", "3", "4"); # Gap-1... seq2
my $arr1_len = scalar #arr1;
##
my #arr2 = ( "0", "1", "2", "3", "4"); # OK
####my #arr2 = ( "0", "1", "3", "4"); # Gap-2... seq2
my $arr2_len = scalar #arr2;
I am trying to loop through each list and compare it with eachother and detect any gaps in the sequence (the values are seq numbers of each stream - at the moment they are sequential, but they dont need to be/ wont be in the real world example)
EFFORT 1= if I use an inner for loop for arr2, it always starts at indx 0 - i.e. the pointer/ count doesn't tick on after we have a match
EFFORT 2= if I do a fake inner/ arr2 loop then I can't go to the next j without increasing the i counter
... I suspect there must be a simple way to do this - but I can't figure it out
EFFORT 1
ARR1: for (my $i=0; $i <= ($arr1_len-1); $i++) {
print "i[$i]=$arr1[$i]\n";
ARR2: for (my $j=0; $j <= ($arr2_len-1); $j++) {
print "... j[$j]=$arr2[$j] -- ((i[$i]=$arr1[$i]))\n";
# seq num match
if ( $arr1[$i] eq $arr2[$j]){
print "MATCH of seq_num [$arr1[$i]|$arr2[$j]]\n";
shift #arr2;
next ARR1;
} # end of seq num match
# gap in arr2
if ( $arr1[$i] < $arr2[$j]){
print "GAP in new [$arr1[$i]|$arr2[$j]]... New Missing $arr1[$i]\n";
next ARR1;
} # end of gap in arr2
# gap in arr1
if ( $arr2[$j] < $arr1[$i]){
print "GAP in old [$arr1[$i]|$arr2[$j]]... Old Missing $arr2[$j]\n";
next ARR2;
} # end of gap in arr1
} # end of j loop
} # i loop
EFFORT 2
my $j = 0;
ARR1: for (my $i=0; $i <= ($arr1_len-1); $i++) {
print "i[$i]=$arr1[$i]\n";
if ( $j <= ($arr2_len-1) ) {
print "... j[$j]=$arr2[$j] -- ((i[$i]=$arr1[$i]))\n";
# seq num match
if ( $arr1[$i] eq $arr2[$j]){
print "MATCH of seq_num [$arr1[$i]|$arr2[$j]]\n";
$j++;
next ARR1;
} # end of seq_num match
# probable gap in arr2
if ( $arr1[$i] < $arr2[$j]){
print "GAP in new [$arr1[$i]|$arr2[$j]]... New Missing $arr1[$i]\n";
next ARR1;
} # end of gap in arr2
# probable gap in arr1
if ( $arr2[$j] < $arr1[$i]){
print "GAP in old [$arr1[$i]|$arr2[$j]]... arr1 Missing $arr2[$j]\n";
# CANT NEXT J WITHOUT INCREASING I ?!?
} # end of gap in arr1
} # end of fake j loop!
} # end of i loop

The definition of a hole or a gap is only if it's in arr1 and not arr2 - or vice-versa... and I need to figure this out in an iterative manner (as when there is a match I then do some more examination of those data objects)
Hash tables are your friend. Something like this snippet:
use List::Util qw/max/;
my %hash1 = map { $_ => 1 } #arr1;
my %hash2 = map { $_ => 1 } #arr2;
my $len = max($arr1[$#arr1], $arr2[$#arr2]);
for my $n (0 .. $len) {
if (exists $hash1{$n} and exists $hash2{$n}) {
# n is in both lists
} elsif (exists $hash1{$n}) {
# n is only in the first one.
} elsif (exists $hash2{$n}) {
# n is only in the second one.
} else {
# Not in either one
}
}
might do the trick for you.

I don't know if this solves your problem, but Set::IntSpan can detect 'holes' in s sequence (even if unordered)
#!/usr/bin/perl
use strict;
use warnings;
use Set::IntSpan;
my #arr1 = ( 0, 1, 3, 4);
my $set = Set::IntSpan->new(#arr1);
print $set->holes;
Prints: 2

Related

Scoping in Perl

As a biology student, I'm trying to extend my programming knowledge and I ran into a problem with Perl.
I'm trying to create a program that generates random DNA strings and performs analysis work on the generated data.
In the first part of the program, I am able to print out the strings stored in the array, but the second part I cannot retrieve all but one of the elements of the array.
Could this be part of the scoping rules of Perl?
#!usr/bin/perl
# generate a random DNA strings and print it to file specified by the user.
$largearray[0] = 0;
print "How many nucleotides for the string?\n";
$n = <>;
$mylong = $n;
print "how many strings?\n";
$numstrings = <>;
# #largearray =();
$j = 0;
while ( $j < $numstrings ) {
$numstring = ''; # start with the empty string;
$dnastring = '';
$i = 0;
while ( $i < $n ) {
$numstring = int( rand( 4 ) ) . $numstring; # generate a new random integer
# between 0 and 3, and concatenate
# it with the existing $numstring,
# assigning the result to $numstring.
$i++; # increase the value of $i by one.
}
$dnastring = $numstring;
$dnastring =~ tr/0123/actg/; # translate the numbers to DNA characters.
#print $dnastring;
#print "\n";
$largearray[j] = $dnastring; #append generated string to end of array
#print $largearray[j];
#print $j;
#IN HERE THERE ARE GOOD ARRAY VALUES
#print "\n";
$j++;
}
# ii will be used to continuously take the next couple of strings from largearray
# for LCS matching.
$mytotal = 0;
$ii = 0;
while ( $ii < $numstrings ) {
$line = $largearray[ii];
print $largearray[ii]; #CANNOT RETRIEVE ARRAY VALUES
print "\n";
$ii++;
#string1 = split( //, $line );
$line = $largearray[ii];
#print $largearray[ii];
#print "\n";
$ii++;
chomp $line;
#string2 = split( //, $line );
$n = #string1; #assigning a list to a scalar just assigns the
#number of elements in the list to the scalar.
$m = #string2;
$v = 1;
$Cm = 0;
$Im = 0;
$V[0][0] = 0; # Assign the 0,0 entry of the V matrix
for ( $i = 1; $i <= $n; $i++ ) { # Assign the column 0 values and print
# String 1 See section 5.2 of Johnson
# for loops
$V[$i][0] = -$Im * $i;
}
for ( $j = 1; $j <= $m; $j++ ) { # Assign the row 0 values and print String 2
$V[0][$j] = -$Im * $j;
}
for ( $i = 1; $i <= $n; $i++ ) { # follow the recurrences to fill in the V matrix.
for ( $j = 1; $j <= $m; $j++ ) {
# print OUT "$string1[$i-1], $string2[$j-1]\n"; # This is here for debugging purposes.
if ( $string1[ $i - 1 ] eq $string2[ $j - 1 ] ) {
$t = 1 * $v;
}
else {
$t = -1 * $Cm;
}
$max = $V[ $i - 1 ][ $j - 1 ] + $t;
# print OUT "For $i, $j, t is $t \n"; # Another debugging line.
if ( $max < $V[$i][ $j - 1 ] - 1 * $Im ) {
$max = $V[$i][ $j - 1 ] - 1 * $Im;
}
if ( $V[ $i - 1 ][$j] - 1 * $Im > $max ) {
$max = $V[ $i - 1 ][$j] - 1 * $Im;
}
$V[$i][$j] = $max;
}
} #outer for loop
print $V[$n][$m];
$mytotal += $V[$n][$m]; # append current result to the grand total
print "\n";
} # end while loop
print "the average LCS value for length ", $mylong, " strings is: ";
print $mytotal/ $numstrings;
This isn't a scoping issue. You have declared none of your variables, which has the effect of implicitly making them all global and accessible everywhere in your code
I reformatted your Perl program so that I could read it, and then added this to the top of your program
use strict;
use warnings 'all';
which are essential in every Perl program you write
Then I added
no strict 'vars';
which is a very bad idea, and lets you get away without declaring any variables
The result is this
Argument "ii" isn't numeric in array element at E:\Perl\source\dna.pl line 60.
Argument "ii" isn't numeric in array element at E:\Perl\source\dna.pl line 61.
Argument "ii" isn't numeric in array element at E:\Perl\source\dna.pl line 67.
Argument "j" isn't numeric in array element at E:\Perl\source\dna.pl line 42.
Bareword "ii" not allowed while "strict subs" in use at E:\Perl\source\dna.pl line 60.
Bareword "ii" not allowed while "strict subs" in use at E:\Perl\source\dna.pl line 61.
Bareword "ii" not allowed while "strict subs" in use at E:\Perl\source\dna.pl line 67.
Bareword "j" not allowed while "strict subs" in use at E:\Perl\source\dna.pl line 42.
Execution of E:\Perl\source\dna.pl aborted due to compilation errors.
Line 42 (of my reformatted version) is
$largearray[j] = $dnastring
and lines 60, 61 and 67 are
$line = $largearray[ii];
print $largearray[ii]; #CANNOT RETRIEVE ARRAY VALUES
and
$line = $largearray[ii];
You are using j and ii as array indexes. Those are Perl subroutine calls, not variables. Adding use strict would have stopped this from compiling unless you had also declared sub ii and sub j
You might get away with it if you just change j and ii to $j and $ii, but you are certain to get into further problems
Please make the same changes to your own code, and declare every variable that you need using my as close as possible to the first place they are used
You should also improve your variable naming. Things like #largearray are pointless: the # says that it's an array, and whether it's large or not is relative, and of little use in understanding your code. If you have no better description of its purpose then #table or #data are probably a little better
Likewise, please avoid capital letters and most single-letter names. #V, $Cm and $Im are meaningless, and you would need fewer comments if those names were better
You certainly wouldn't need comments like # end while loop and # outer for loop if you had indented your blocks properly and kept them short enough so that both the beginning and the end can be seen on the screen at the same time, and the fewer comments you can get away with the better, because they badly clutter the code structure
Finally, it's worth noting that the C-style for loop is rarely the best choice in Perl. Your
for ( $i = 1; $i <= $n; $i++ ) { ... }
is much clearer as
for my $i ( 1 .. $n ) { ... }
and declaring the control variable at that point makes it unnecessary to invent new names like $ii for each new loop
I think you have a typo in your code:
ii => must be $ii
don't forget to put this at the beginning of your code:
use strict;
use warnings;
in order to avoid this (and others) kind of errors

Nested loops - Why do the loops include the previous evaluation?

My intention is to count the number of words in List 2 that contain each of the letters in List 1.
When I run the code, the first count is fine; however, the subsequent counts are added to the previous ones, such that the final count is the sum of all the counts, not the count of how many "words" contain an "F", as I want it to be.
Where am I doing wrong?
Here is my code.
use warnings; use strict;
my $count=0;
my #list1 = ("A", "B", "C", "D", "E", "F");
my #list2 = ("AXE", "DOG", "CAT", "FOOD", "TRANCE");
for (my $i=0; $i<scalar(#list1); $i++){
for (my $j=0; $j<scalar(#list2); $j++){
my $word = $list2[$j];
my $letter = $list1[$i];
if ($word =~ /$letter/){
$count++;
}
}
print "$count \n";
}
All the help appreciated.
If I understand your spec correctly, you just want to move the count declaration/initialization into the outer for loop:
for ( my $i = 0 ; $i < scalar(#list1) ; $i++ ) {
my $count = 0;
This resets the count for each letter.
As it was said above, you just need to reset the counter variable in order to get the correct result.
However, you can really simplify the code by making use of grep rather than nesting loops. Here's how I might do it:
#!/usr/bin/perl
use warnings;
use strict;
my #list1 = qw( A B C D E F );
my #list2 = qw( AXE DOG CAT FOOD TRANCE );
# Iterate over the letter array, using grep to count how many times
# it shows up in each word, and then store that result to a hash
my %result;
for my $letter ( #list1 ) {
my $count = grep { $_ =~ /$letter/ } #list2;
$result{$letter} = $count;
}
# Now print out all of the results
print "Number of words found for each letter:\n";
for ( sort keys %result ) {
print "$_: $result{$_}\n";
}
This gives me the following result based on your test data:
Number of words found for each letter:
A: 3
B: 0
C: 2
D: 2
E: 2
F: 1

Find the word with most letters in common with other words

I want Perl (5.8.8) to find out what word has the most letters in common with the other words in an array - but only letters that are in the same place. (And preferably without using libs.)
Take this list of words as an example:
BAKER
SALER
BALER
CARER
RUFFR
Her BALER is the word that has the most letters in common with the others. It matches BAxER in BAKER, xALER in SALER, xAxER in CARER, and xxxxR in RUFFR.
I want Perl to find this word for me in an arbitrary list of words with the same length and case. Seems I've hit the wall here, so help is much appreciated!
What I've tried until now
Don't really have much of a script at the moment:
use strict;
use warnings;
my #wordlist = qw(BAKER SALER MALER BARER RUFFR);
foreach my $word (#wordlist) {
my #letters = split(//, $word);
# now trip trough each iteration and work magic...
}
Where the comment is, I've tried several kinds of code, heavy with for-loops and ++ varables. Thus far, none of my attempts have done what I need it to do.
So, to better explain: What I need is to test word for word against the list, for each letterposition, to find the word that has the most letters in common with the others in the list, at that letter's position.
One possible way could be to first check which word(s) has the most in common at letter-position 0, then test letter-position 1, and so on, until you find the word that in sum has the most letters in common with the other words in the list. Then I'd like to print the list like a matrix with scores for each letterposition plus a total score for each word, not unlike what DavidO suggest.
What you'd in effect end up with is a matrix for each words, with the score for each letter position, and the sum total score fore each word in the matrix.
Purpose of the Program
Hehe, I might as well say it: The program is for hacking terminals in the game Fallout 3. :D My thinking is that it's a great way to learn Perl while also having fun gaming.
Here's one of the Fallout 3 terminal hacking tutorials I've used for research: FALLOUT 3: Hacking FAQ v1.2, and I've already made a program to shorten the list of words, like this:
#!/usr/bin/perl
# See if one word has equal letters as the other, and how many of them are equal
use strict;
use warnings;
my $checkword = "APPRECIATION"; # the word to be checked
my $match = 4; # equal to the match you got from testing your checkword
my #checkletters = split(//, $checkword); #/
my #wordlist = qw(
PARTNERSHIPS
REPRIMANDING
CIVILIZATION
APPRECIATION
CONVERSATION
CIRCUMSTANCE
PURIFICATION
SECLUSIONIST
CONSTRUCTION
DISAPPEARING
TRANSMISSION
APPREHENSIVE
ENCOUNTERING
);
print "$checkword has $match letters in common with:\n";
foreach my $word (#wordlist) {
next if $word eq $checkword;
my #letters = split(//, $word);
my $length = #letters; # determine length of array (how many letters to check)
my $eq_letters = 0; # reset to 0 for every new word to be tested
for (my $i = 0; $i < $length; $i++) {
if ($letters[$i] eq $checkletters[$i]) {
$eq_letters++;
}
}
if ($eq_letters == $match) {
print "$word\n";
}
}
# Now to make a script on to find the best word to check in the first place...
This script will yield CONSTRUCTION and TRANSMISSION as its result, just as in the game FAQ. The trick to the original question, though (and the thing I didn't manage to find out on my own), is how to find the best word to try in the first place, i.e. APPRECIATION.
OK, I've now supplied my own solution based on your help, and consider this thread closed. Many, many thanks to all the contributers. You've helped tremendously, and on the way I've also learned a lot. :D
Here's one way. Having re-read your spec a couple of times I think it's what you're looking for.
It's worth mentioning that it's possible there will be more than one word with an equal top score. From your list there's only one winner, but it's possible that in longer lists, there will be several equally winning words. This solution deals with that. Also, as I understand it, you count letter matches only if they occur in the same column per word. If that's the case, here's a working solution:
use 5.012;
use strict;
use warnings;
use List::Util 'max';
my #words = qw/
BAKER
SALER
BALER
CARER
RUFFR
/;
my #scores;
foreach my $word ( #words ) {
my $score;
foreach my $comp_word ( #words ) {
next if $comp_word eq $word;
foreach my $pos ( 0 .. ( length $word ) - 1 ) {
$score++ if substr( $word, $pos, 1 ) eq substr( $comp_word, $pos, 1);
}
}
push #scores, $score;
}
my $max = max( #scores );
my ( #max_ixs ) = grep { $scores[$_] == $max } 0 .. $#scores;
say "Words with most matches:";
say for #words[#max_ixs];
This solution counts how many times per letter column each word's letters match other words. So for example:
Words: Scores: Because:
ABC 1, 2, 1 = 4 A matched once, B matched twice, C matched once.
ABD 1, 2, 1 = 4 A matched once, B matched twice, D matched once.
CBD 0, 2, 1 = 3 C never matched, B matched twice, D matched once.
BAC 0, 0, 1 = 1 B never matched, A never matched, C matched once.
That gives you the winners of ABC and ABD, each with a score of four positional matches. Ie, the cumulative times that column one, row one matched column one row two, three, and four, and so on for the subsequent columns.
It may be able to be optimized further, and re-worded to be shorter, but I tried to keep the logic fairly easy to read. Enjoy!
UPDATE / EDIT
I thought about it and realized that though my existing method does exactly what your original question requested, it did it in O(n^2) time, which is comparatively slow. But if we use hash keys for each column's letters (one letter per key), and do a count of how many times each letter appears in the column (as the value of the hash element), we could do our summations in O(1) time, and our traversal of the list in O(n*c) time (where c is the number of columns, and n is the number of words). There's some setup time too (creation of the hash). But we still have a big improvement. Here is a new version of each technique, as well as a benchmark comparison of each.
use strict;
use warnings;
use List::Util qw/ max sum /;
use Benchmark qw/ cmpthese /;
my #words = qw/
PARTNERSHIPS
REPRIMANDING
CIVILIZATION
APPRECIATION
CONVERSATION
CIRCUMSTANCE
PURIFICATION
SECLUSIONIST
CONSTRUCTION
DISAPPEARING
TRANSMISSION
APPREHENSIVE
ENCOUNTERING
/;
# Just a test run for each solution.
my( $top, $indexes_ref );
($top, $indexes_ref ) = find_top_matches_force( \#words );
print "Testing force method: $top matches.\n";
print "#words[#$indexes_ref]\n";
( $top, $indexes_ref ) = find_top_matches_hash( \#words );
print "Testing hash method: $top matches.\n";
print "#words[#$indexes_ref]\n";
my $count = 20000;
cmpthese( $count, {
'Hash' => sub{ find_top_matches_hash( \#words ); },
'Force' => sub{ find_top_matches_force( \#words ); },
} );
sub find_top_matches_hash {
my $words = shift;
my #scores;
my $columns;
my $max_col = max( map { length $_ } #{$words} ) - 1;
foreach my $col_idx ( 0 .. $max_col ) {
$columns->[$col_idx]{ substr $_, $col_idx, 1 }++
for #{$words};
}
foreach my $word ( #{$words} ) {
my $score = sum(
map{
$columns->[$_]{ substr $word, $_, 1 } - 1
} 0 .. $max_col
);
push #scores, $score;
}
my $max = max( #scores );
my ( #max_ixs ) = grep { $scores[$_] == $max } 0 .. $#scores;
return( $max, \#max_ixs );
}
sub find_top_matches_force {
my $words = shift;
my #scores;
foreach my $word ( #{$words} ) {
my $score;
foreach my $comp_word ( #{$words} ) {
next if $comp_word eq $word;
foreach my $pos ( 0 .. ( length $word ) - 1 ) {
$score++ if
substr( $word, $pos, 1 ) eq substr( $comp_word, $pos, 1);
}
}
push #scores, $score;
}
my $max = max( #scores );
my ( #max_ixs ) = grep { $scores[$_] == $max } 0 .. $#scores;
return( $max, \#max_ixs );
}
The output is:
Testing force method: 39 matches.
APPRECIATION
Testing hash method: 39 matches.
APPRECIATION
Rate Force Hash
Force 2358/s -- -74%
Hash 9132/s 287% --
I realize your original spec changed after you saw some of the other options provided, and that's sort of the nature of innovation to a degree, but the puzzle was still alive in my mind. As you can see, my hash method is 287% faster than the original method. More fun in less time!
As a starting point, you can efficiently check how many letters they have in common with:
$count = ($word1 ^ $word2) =~ y/\0//;
But that's only useful if you loop through all possible pairs of words, something that isn't necessary in this case:
use strict;
use warnings;
my #words = qw/
BAKER
SALER
BALER
CARER
RUFFR
/;
# you want a hash to indicate which letters are present how many times in each position:
my %count;
for my $word (#words) {
my #letters = split //, $word;
$count{$_}{ $letters[$_] }++ for 0..$#letters;
}
# then for any given word, you get the count for each of its letters minus one (because the word itself is included in the count), and see if it is a maximum (so far) for any position or for the total:
my %max_common_letters_count;
my %max_common_letters_words;
for my $word (#words) {
my #letters = split //, $word;
my $total;
for my $position (0..$#letters, 'total') {
my $count;
if ( $position eq 'total' ) {
$count = $total;
}
else {
$count = $count{$position}{ $letters[$position] } - 1;
$total += $count;
}
if ( ! $max_common_letters_count{$position} || $count >= $max_common_letters_count{$position} ) {
if ( $max_common_letters_count{$position} && $count == $max_common_letters_count{$position} ) {
push #{ $max_common_letters_words{$position} }, $word;
}
else {
$max_common_letters_count{$position} = $count;
$max_common_letters_words{$position} = [ $word ];
}
}
}
}
# then show the maximum words for each position and in total:
for my $position ( sort { $a <=> $b } grep $_ ne 'total', keys %max_common_letters_count ) {
printf( "Position %s had a maximum of common letters of %s in words: %s\n",
$position,
$max_common_letters_count{$position},
join(', ', #{ $max_common_letters_words{$position} })
);
}
printf( "The maximum total common letters was %s in words(s): %s\n",
$max_common_letters_count{'total'},
join(', ', #{ $max_common_letters_words{'total'} })
);
Here's a complete script. It uses the same idea that ysth mentioned (although I had it independently). Use bitwise xor to combine the strings, and then count the number of NULs in the result. As long as your strings are ASCII, that will tell you how many matching letters there were. (That comparison is case sensitive, and I'm not sure what would happen if the strings were UTF-8. Probably nothing good.)
use strict;
use warnings;
use 5.010;
use List::Util qw(max);
sub findMatches
{
my ($words) = #_;
# Compare each word to every other word:
my #matches = (0) x #$words;
for my $i (0 .. $#$words-1) {
for my $j ($i+1 .. $#$words) {
my $m = ($words->[$i] ^ $words->[$j]) =~ tr/\0//;
$matches[$i] += $m;
$matches[$j] += $m;
}
}
# Find how many matches in the best word:
my $max = max(#matches);
# Find the words with that many matches:
my #wanted = grep { $matches[$_] == $max } 0 .. $#matches;
wantarray ? #$words[#wanted] : $words->[$wanted[0]];
} # end findMatches
my #words = qw(
BAKER
SALER
BALER
CARER
RUFFR
);
say for findMatches(\#words);
Haven't touched perl in a while, so pseudo-code it is. This isn't the fastest algorithm, but it will work fine for a small amount of words.
totals = new map #e.g. an object to map :key => :value
for each word a
for each word b
next if a equals b
totals[a] = 0
for i from 1 to a.length
if a[i] == b[i]
totals[a] += 1
end
end
end
end
return totals.sort_by_key.last
Sorry about the lack of perl, but if you code this into perl, it should work like a charm.
A quick note on run-time: this will run in time number_of_words^2 * length_of_words, so on a list of 100 words, each of length 10 characters, this will run in 100,000 cycles, which is adequate for most applications.
Here's a version that relies on transposing the words in order to count the identical characters. I used the words from your original comparison, not the code.
This should work with any length words, and any length list. Output is:
Word score
---- -----
BALER 12
SALER 11
BAKER 11
CARER 10
RUFFR 4
The code:
use warnings;
use strict;
my #w = qw(BAKER SALER BALER CARER RUFFR);
my #tword = t_word(#w);
my #score;
push #score, str_count($_) for #tword;
#score = t_score(#score);
my %total;
for (0 .. $#w) {
$total{$w[$_]} = $score[$_];
}
print "Word\tscore\n";
print "----\t-----\n";
print "$_\t$total{$_}\n" for (sort { $total{$b} <=> $total{$a} } keys %total);
# transpose the words
sub t_word {
my #w = #_;
my #tword;
for my $word (#w) {
my $i = 0;
while ($word =~ s/(.)//) {
$tword[$i++] .= $1;
}
}
return #tword;
}
# turn each character into a count
sub str_count {
my $str = uc(shift);
while ( $str =~ /([A-Z])/ ) {
my $chr = $1;
my $num = () = $str =~ /$chr/g;
$num--;
$str =~ s/$chr/$num /g;
}
return $str;
}
# sum up the character counts
# while reversing the transpose
sub t_score {
my #count = #_;
my #score;
for my $num (#count) {
my $i = 0;
while( $num =~ s/(\d+) //) {
$score[$i++] += $1;
}
}
return #score;
}
Here is my attempt at an answer. This will also allow you to see each individual match if you need it. (ie. BALER matches 4 characters in BAKER). EDIT: It now catches all matches if there is a tie between words (I added "CAKER" to the list to test).
#! usr/bin/perl
use strict;
use warnings;
my #wordlist = qw( BAKER SALER BALER CARER RUFFR CAKER);
my %wordcomparison;
#foreach word, break it into letters, then compare it against all other words
#break all other words into letters and loop through the letters (both words have same amount), adding to the count of matched characters each time there's a match
foreach my $word (#wordlist) {
my #letters = split(//, $word);
foreach my $otherword (#wordlist) {
my $count;
next if $otherword eq $word;
my #otherwordletters = split (//, $otherword);
foreach my $i (0..$#letters) {
$count++ if ( $letters[$i] eq $otherwordletters[$i] );
}
$wordcomparison{"$word"}{"$otherword"} = $count;
}
}
# sort (unnecessary) and loop through the keys of the hash (words in your list)
# foreach key, loop through the other words it compares with
#Add a new key: total, and sum up all the matched characters.
foreach my $word (sort keys %wordcomparison) {
foreach ( sort keys %{ $wordcomparison{$word} }) {
$wordcomparison{$word}{total} += $wordcomparison{$word}{$_};
}
}
#Want $word with highest total
my #max_match = (sort { $wordcomparison{$b}{total} <=> $wordcomparison{$a}{total} } keys %wordcomparison );
#This is to get all if there is a tie:
my $maximum = $max_match[0];
foreach (#max_match) {
print "$_\n" if ($wordcomparison{$_}{total} >= $wordcomparison{$maximum}{total} )
}
The output is simply: CAKER BALER and BAKER.
The hash %wordcomparison looks like:
'SALER'
{
'RUFFR' => 1,
'BALER' => 4,
'BAKER' => 3,
'total' => 11,
'CARER' => 3
};
You can do this, using a dirty regex trick to execute code if a letter matches in its place, but not otherwise, thankfully it's quite easy to build the regexes as you go:
An example regular expression is:
(?:(C(?{ $c++ }))|.)(?:(A(?{ $c++ }))|.)(?:(R(?{ $c++ }))|.)(?:(E(?{ $c++ }))|.)(?:(R(?{ $c++ }))|.)
This may or may not be fast.
use 5.12.0;
use warnings;
use re 'eval';
my #words = qw(BAKER SALER BALER CARER RUFFR);
my ($best, $count) = ('', 0);
foreach my $word (#words) {
our $c = 0;
foreach my $candidate (#words) {
next if $word eq $candidate;
my $regex_str = join('', map {"(?:($_(?{ \$c++ }))|.)"} split '', $word);
my $regex = qr/^$regex_str$/;
$candidate =~ $regex or die "did not match!";
}
say "$word $c";
if ($c > $count) {
$best = $word;
$count = $c;
}
}
say "Matching: first best: $best";
Using xor trick will be fast but assumes a lot about the range of characters you might encounter. There are many ways in which utf-8 will break with that case.
Many thanks to all the contributers! You've certainly shown me that I still have a lot to learn, but you have also helped me tremendously in working out my own answer. I'm just putting it here for reference and possible feedback, since there are probably better ways of doing it. To me this was the simplest and most straight forward approach I could find on my own. Enjøy! :)
#!/usr/bin/perl
use strict;
use warnings;
# a list of words for testing
my #list = qw(
BAKER
SALER
BALER
CARER
RUFFR
);
# populate two dimensional array with the list,
# so we can compare each letter with the other letters on the same row more easily
my $list_length = #list;
my #words;
for (my $i = 0; $i < $list_length; $i++) {
my #letters = split(//, $list[$i]);
my $letters_length = #letters;
for (my $j = 0; $j < $letters_length; $j++) {
$words[$i][$j] = $letters[$j];
}
}
# this gives a two-dimensionla array:
#
# #words = ( ["B", "A", "K", "E", "R"],
# ["S", "A", "L", "E", "R"],
# ["B", "A", "L", "E", "R"],
# ["C", "A", "R", "E", "R"],
# ["R", "U", "F", "F", "R"],
# );
# now, on to find the word with most letters in common with the other on the same row
# add up the score for each letter in each word
my $word_length = #words;
my #letter_score;
for my $i (0 .. $#words) {
for my $j (0 .. $#{$words[$i]}) {
for (my $k = 0; $k < $word_length; $k++) {
if ($words[$i][$j] eq $words[$k][$j]) {
$letter_score[$i][$j] += 1;
}
}
# we only want to add in matches outside the one we're testing, therefore
$letter_score[$i][$j] -= 1;
}
}
# sum each score up
my #scores;
for my $i (0 .. $#letter_score ) {
for my $j (0 .. $#{$letter_score[$i]}) {
$scores[$i] += $letter_score[$i][$j];
}
}
# find the highest score
my $max = $scores[0];
foreach my $i (#scores[1 .. $#scores]) {
if ($i > $max) {
$max = $i;
}
}
# and print it all out :D
for my $i (0 .. $#letter_score ) {
print "$list[$i]: $scores[$i]";
if ($scores[$i] == $max) {
print " <- best";
}
print "\n";
}
When run, the script yields the following:
BAKER: 11
SALER: 11
BALER: 12 <- best
CARER: 10
RUFFR: 4

What's the best way to compare arrays of strings in perl

I'm trying to compare multiple arrays of strings containing file listings of directories. The objective is to determine which files exist in each directory AND which files do not exists. Consider:
List1 List2 List3 List4
a a e f
b b d g
c f a h
The outcome should be:
List1:
List1 List2 List3 List4
a yes yes yes no
b yes yes no no
c yes no no no
List2:
List1 List2 List3 List4
a yes yes yes no
b yes yes no no
f no yes no yes
...
I could go through all the arrays and go through each entry, go through all the other arrays and do a grep:
for my $curfile (#currentdirfiles) {
if( grep(/$curfile/, #otherarrsfiles) ) {
// Set 'yes'
} else {
// set 'no'
}
}
My only concern is that I am ending up with a 0^2n order of magnitude. I may not be able to do anything about this since I would end up looping through all the arrays anyway. One improvement may be in the grep function, but I'm not sure.
Any thoughts?
For lots of string lookups, you generally want to use hashes. Here's one way of doing it:
use strict;
use warnings;
# Define the lists:
my #lists = (
[qw(a b c)], # List 1
[qw(a b f)], # List 2
[qw(e d a)], # List 3
[qw(f g h)], # List 4
);
# For each file, determine which lists it is in:
my %included;
for my $n (0 .. $#lists) {
for my $file (#{ $lists[$n] }) {
$included{$file}[$n] = 1;
} # end for each $file in this list
} # end for each list number $n
# Print out the results:
my $fileWidth = 8;
for my $n (0 .. $#lists) {
# Print the header rows:
printf "\nList %d:\n", $n+1;
print ' ' x $fileWidth;
printf "%-8s", "List $_" for 1 .. #lists;
print "\n";
# Print a line for each file:
for my $file (#{ $lists[$n] }) {
printf "%-${fileWidth}s", $file;
printf "%-8s", ($_ ? 'yes' : 'no') for #{ $included{$file} }[0 .. $#lists];
print "\n";
} # end for each $file in this list
} # end for each list number $n
Why not just remember where each file is when you're reading them in.
Let's say you have a list of directories to read from in #dirlist:
use File::Slurp qw( read_dir );
my %in_dir;
my %dir_files;
foreach my $dir ( #dirlist ) {
die "No such directory $dir" unless -d $dir;
foreach my $file ( read_dir($dir) ) {
$in_dir{$file}{$dir} = 1;
push #{ $dir_files{$dir} }, $file;
}
}
Now $in_dir{filename} will have entries defined for each directory of interest, and
$dir_files{directory} will have a list of files for each directory...
foreach my $dir ( #dirlist ) {
print "$dir\n";
print join("\t", "", #dirlist);
foreach my $file ( #{ $dir_files{$dir} } ) {
my #info = ($file);
foreach my $dir_for_file ( #dirlist ) {
if ( defined $in_dir{$file}{$dir_for_file} ) {
push #info, "Yes";
} else {
push #info, "No";
}
}
print join("\t", #info), "\n";
}
}
The clearest way is to use perl5i and autoboxing:
use perl5i;
my #list1 = qw(one two three);
my #list2 = qw(one two four);
my $missing = #list1 -> diff(\#list2);
my $both = #list1 -> intersect(\#list2);
In a more restricted setup, use hashes for this as the filenames will be unique:
sub in_list {
my ($one, $two) = #_;
my (#in, #out);
my %a = map {$_ => 1} #$one;
foreach my $f (#$two) {
if ($a{$f}) {
push #in, $f;
}
else {
push #out, $f;
}
}
return (\#in, \#out);
}
my #list1 = qw(one two three);
my #list2 = qw(one two four);
my ($in, $out) = in_list(\#list1, \#list2);
print "In list 1 and 2:\n";
print " $_\n" foreach #$in;
print "In list 2 and not in list 1\n";
print " $_\n" foreach #$out;
Now that the question has been amended, this produces the answer you want. It does work in O(n3) time, which is optimal for the problem (there are n3 outputs).
#!/usr/bin/env perl
use strict;
use warnings;
#List1 List2 List3 List4
#a a e f
#b b d g
#c f a h
my(#lists) = ( { a => 1, b => 1, c => 1 },
{ a => 1, b => 1, f => 1 },
{ e => 1, d => 1, a => 1 },
{ f => 1, g => 1, h => 1 },
);
my $i = 0;
foreach my $list (#lists)
{
analyze(++$i, $list, #lists);
}
sub analyze
{
my($num, $ref, #lists) = #_;
printf "List %d\n", $num;
my $pad = " ";
foreach my $i (1..4)
{
print "$pad List$i";
$pad = "";
}
print "\n";
foreach my $file (sort keys %{$ref})
{
printf "%-8s", $file;
foreach my $list (#lists)
{
my %dir = %{$list};
printf "%-8s", (defined $dir{$file}) ? "yes" : "no";
}
print "\n";
}
print "\n";
}
The output I get is:
List 1
List1 List2 List3 List4
a yes yes yes no
b yes yes no no
c yes no no no
List 2
List1 List2 List3 List4
a yes yes yes no
b yes yes no no
f no yes no yes
List 3
List1 List2 List3 List4
a yes yes yes no
d no no yes no
e no no yes no
List 4
List1 List2 List3 List4
f no yes no yes
g no no no yes
h no no no yes
My code is simpler but the output isn't quite what you want:
#lst1=('a', 'b', 'c');
#lst2=('a', 'b', 'f');
#lst3=('e', 'd', 'a');
#lst4=('f', 'g', 'h');
%hsh=();
foreach $item (#lst1) {
$hsh{$item}="list1";
}
foreach $item (#lst2) {
if (defined($hsh{$item})) {
$hsh{$item}=$hsh{$item}." list2";
}
else {
$hsh{$item}="list2";
}
}
foreach $item (#lst3) {
if (defined($hsh{$item})) {
$hsh{$item}=$hsh{$item}." list3";
}
else {
$hsh{$item}="list3";
}
}
foreach $item (#lst4) {
if (defined($hsh{$item})) {
$hsh{$item}=$hsh{$item}." list4";
}
else {
$hsh{$item}="list4";
}
}
foreach $key (sort keys %hsh) {
printf("%s %s\n", $key, $hsh{$key});
}
Gives:
a list1 list2 list3
b list1 list2
c list1
d list3
e list3
f list2 list4
g list4
h list4
Sorry for the late reply, I've been polishing this a while, because I did not want yet another negative score (bums me out).
This is an interesting efficiency problem. I don't know if my solution will work for you, but I thought I would share it anyway. It is probably efficient only if your arrays do not change too often, and if your arrays contain many duplicate values. I have not run any efficiency checks on it.
Basically, the solution is to remove one dimension of the cross checking by turning the array values into bits, and doing a bitwise comparison on the entire array in one go. Array values are deduped, sorted and given a serial number. The arrays total serial numbers are then stored in a single value by bitwise or. A single array can thereby be checked for a single serial number with only one operation, e.g.:
if ( array & serialno )
It will require one run to prepare the data, which can then be saved in cache or similar. This data can then be used until your data changes (e.g. files/folders are removed or added). I have added a fatal exit on undefined values, which means the data must be refreshed when it occurs.
Good luck!
use strict;
use warnings;
my #list1=('a', 'b', 'c');
my #list2=('a', 'b', 'f');
my #list3=('e', 'd', 'a');
my #list4=('f', 'g', 'h');
# combine arrays
my #total = (#list1, #list2, #list3, #list4);
# dedupe (Thanks Xetius for this code snippet)
my %unique = ();
foreach my $item (#total)
{
$unique{$item} ++;
}
# Default sort(), don't think it matters
#total = sort keys %unique;
# translate to serial numbers
my %serials = ();
for (my $num = 0; $num <= $#total; $num++)
{
$serials{$total[$num]} = $num;
}
# convert array values to serial numbers, and combine them
my #tx = ();
for my $entry (#list1) { $tx[0] |= 2**$serials{$entry}; }
for my $entry (#list2) { $tx[1] |= 2**$serials{$entry}; }
for my $entry (#list3) { $tx[2] |= 2**$serials{$entry}; }
for my $entry (#list4) { $tx[3] |= 2**$serials{$entry}; }
&print_all;
sub inList
{
my ($value, $list) = #_;
# Undefined serial numbers are not accepted
if (! defined ($serials{$value}) ) {
print "$value is not in the predefined list.\n";
exit;
}
return ( 2**$serials{$value} & $tx[$list] );
}
sub yesno
{
my ($value, $list) = #_;
return ( &inList($value, $list) ? "yes":"no" );
}
#
# The following code is for printing purposes only
#
sub print_all
{
printf "%-6s %-6s %-6s %-6s %-6s\n", "", "List1", "List2", "List3", "List4";
print "-" x 33, "\n";
&table_print(#list1);
&table_print(#list2);
&table_print(#list3);
&table_print(#list4);
}
sub table_print
{
my #list = #_;
for my $entry (#list) {
printf "%-6s %-6s %-6s %-6s %-6s\n", $entry,
&yesno($entry, 0),
&yesno($entry, 1),
&yesno($entry, 2),
&yesno($entry, 3);
}
print "-" x 33, "\n";
}
I would build a hash using directory entries as keys containing hashes (actually sets) of each listing in which that was found. Iterate over each listing, for each new entry add it to the outer hash with a single set (or hash) containing the identifier of the listing in which it was first encountered. For any entry that's found in the hash simply add the current listing identifier to the value's set/hash.
From there you can simply post process the sorted keys of the hash, and creating rows of your resulting table.
Personally I think Perl is ugly but here's a sample in Python:
#!/usr/bin/env python
import sys
if len(sys.argv) < 2:
print >> sys.stderr, "Must supply arguments"
sys.exit(1)
args = sys.argv[1:]
# build hash entries by iterating over each listing
d = dict()
for each_file in args:
name = each_file
f = open(each_file, 'r')
for line in f:
line = line.strip()
if line not in d:
d[line] = set()
d[line].add(name)
f.close()
# post process the hash
report_template = "%-20s" + (" %-10s" * len(args))
print report_template % (("Dir Entries",) + tuple(args))
for k in sorted(d.keys()):
row = list()
for col in args:
row.append("yes") if col in d[k] else row.append("no")
print report_template % ((k,)+tuple(row))
That should mostly be legible as if it were psuedo-code. The (k,) and ("Dir Entries",) expressions might look a little odd; but that's to force them to be tuples which are are necessary to unpack into the format string using the % operator for strings. Those could also have been written as tuple([k]+row) for example (wrapping the first item in [] makes it a list which can be added to the other list and all converted to a tuple).
Other than that a translation to Perl should be pretty straightforward, just using hashes instead of dictionaries and sets.
(Incidentally, this example will work with an arbitrary number of listings, supplied as arguments and output as columns. Obviously after a dozen columns the output would get to be rather cumbersome to print or display; but it was an easily generalization to make).

How can I compare arrays in Perl?

I have two arrays, #a and #b. I want to do a compare among the elements of the two arrays.
my #a = qw"abc def efg ghy klm ghn";
my #b = qw"def ghy jgk lom com klm";
If any element matches then set a flag. Is there any simple way to do this?
First of all, your 2 arrays need to be written correctly.
#a = ("abc","def","efg","ghy","klm","ghn");
#b = ("def","efg","ghy","klm","ghn","klm");
Second of all, for arbitrary arrays (e.g. arrays whose elements may be references to other data structures) you can use Data::Compare.
For arrays whose elements are scalar, you can do comparison using List::MoreUtils pairwise BLOCK ARRAY1 ARRAY2, where BLOCK is your comparison subroutine. You can emulate pairwise (if you don't have List::MoreUtils access) via:
if (#a != #b) {
$equals = 0;
} else {
$equals = 1;
foreach (my $i = 0; $i < #a; $i++) {
# Ideally, check for undef/value comparison here as well
if ($a[$i] != $b[$i]) { # use "ne" if elements are strings, not numbers
# Or you can use generic sub comparing 2 values
$equals = 0;
last;
}
}
}
P.S. I am not sure but List::Compare may always sort the lists. I'm not sure if it can do pairwise comparisons.
List::Compare
if ( scalar List::Compare->new(\#a, \#b)->get_intersection ) {
…
}
Check to create an intersect function, which will return a list of items that are present in both lists. Then your return value is dependent on the number of items in the intersected list.
You can easily find on the web the best implementation of intersect for Perl. I remember looking for it a few years ago.
Here's what I found :
my #array1 = (1, 2, 3);
my #array2 = (2, 3, 4);
my %original = ();
my #isect = ();
map { $original{$_} = 1 } #array1;
#isect = grep { $original{$_} } #array2;
This is one way:
use warnings;
use strict;
my #a = split /,/, "abc,def,efg,ghy,klm,ghn";
my #b = split /,/, "def,ghy,jgk,lom,com,klm";
my $flag = 0;
my %a;
#a{#a} = (1) x #a;
for (#b) {
if ($a{$_}) {
$flag = 1;
last;
}
}
print "$flag\n";
From the requirement that 'if any element matches', use the intersection of sets:
sub set{
my %set = map { $_, undef }, #_;
return sort keys %set;
}
sub compare{
my ($listA,$listB) = #_;
return ( (set(#$listA)-set(#$listB)) > 0)
}
my #a = qw' abc def efg ghy klm ghn ';
my #b = qw' def ghy jgk lom com klm ';
my $flag;
foreach my $item(#a) {
$flag = #b~~$item ? 0 : 1;
last if !$flag;
}
Note that you will need Perl 5.10, or later, to use the smart match operator (~~) .
Brute force should do the trick for small a n:
my $flag = 0;
foreach my $i (#a) {
foreach my $k (#b) {
if ($i eq $k) {
$flag = 1;
last;
}
}
}
For a large n, use a hash table:
my $flag = 0;
my %aa = ();
$aa{$_} = 1 foreach (#a);
foreach my $i (#b) {
if ($aa{$i}) {
$flag = 1;
last;
}
}
Where a large n is |#a| + |#b| > ~1000 items
IMHO, you should use List::MoreUtils::pairwise. However, if for some reason you cannot, then the following sub would return a 1 for every index where the value in the first array compares equal to the value in the second array. You can generalize this method as much as you want and pass your own comparator if you want to, but at that point, just installing List::MoreUtils would be a more productive use of your time.
use strict; use warnings;
my #a = qw(abc def ghi jkl);
my #b = qw(abc dgh dlkfj jkl kjj lkm);
my $map = which_ones_equal(\#a, \#b);
print join(', ', #$map), "\n";
sub which_ones_equal {
my ($x, $y, $compare) = #_;
my $last = $#$x > $#$y ? $#$x : $#$y;
no warnings 'uninitialized';
return [ map { 0 + ($x->[$_] eq $y->[$_]) } $[ .. $last ];
}
This is Perl. The 'obvious' solution:
my #a = qw"abc def efg ghy klm ghn";
my #b = qw"def ghy jgk lom com klm";
print "arrays equal\n"
if #a == #b and join("\0", #a) eq join("\0", #b);
given "\0" not being in #a.
But thanks for confirming that there is no other generic solution than rolling your own.
my #a1 = qw|a b c d|;
my #a2 = qw|b c d e|;
for my $i (0..$#a1) {
say "element $i of array 1 was not found in array 2"
unless grep {$_ eq $a1[$i]} #a2
}
If you would consider the arrays with different order to be different, you may use Array::Diff
if (Array::Diff->diff(\#a, \#b)->count) {
# not_same
} else {
# same
}
This question still could mean two things where it states "If any element matches then set a flag":
Elements at the same position, i.e $a[2] eq $b[2]
Values at any position, i.e. $a[3] eq $b[5]
For case 1, you might do this:
# iterate over all positions, and compare values at that position
my #matches = grep { $a[$_] eq $b[$_] } 0 .. $#a;
# set flag if there's any match at the same position
my $flag = 1 if #matches;
For case 2, you might do that:
# make a hash of #a and check if any #b are in there
my %a = map { $_ => 1 } #a;
my #matches = grep { $a{$_} } #b;
# set flag if there's matches at any position
my $flag = 1 if #matches;
Note that in the first case, #matches holds the indexes of where there are matching elements, and in the second case #matches holds the matching values in the order in which they appear in #b.