how to match two sequences using arrays in perl - perl

When looping through two arrays, I am confused about how to move the pointer through one loop but keep it constant in another. So for example:
Array 1: A T C G T C G A G C G
Array 2: A C G T C C T G T C G
So A in the first array matches A in the second array so we move on to next elements. But since the T doesn't match the C in the 2nd index, I want the program to compare that T to the next G in array 2 and so on until it finds the matching T.
my ($array1ref, $array2ref) = #_;
my #array1 = #$array1ref;
my #array2= #$array2ref;
my $count = 0;
foreach my $element (#array1) {
foreach my $element2 (#array2) {
if ($element eq $element2) {
$count++;
}else { ???????????
}

You could use a while loop to search for matches. If you find a match, advance in both arrays. If you don't, advance the second array. At the end you could print the remaining unmatched characters from the first array:
# [1, 2, 3] is a reference to an anonymous array (1, 2, 3)
# qw(1, 2, 3) is shorthand quoted-word for ('1', '2', '3')
my $arr1 = [qw(A T C G T C G A G C G)];
my $arr2 = [qw(A C G T C C T G T C G)];
my $idx1 = 0;
my $idx2 = 0;
# Find matched characters
# #$arr_ref is the size of the array referenced by $arr_ref
while ($idx1 < #$arr1 && $idx2 < #$arr2) {
my $char1 = $arr1->[$idx1];
my $char2 = $arr2->[$idx2];
if ($char1 eq $char2) {
# Matched character, advance arr1 and arr2
printf("%s %s -- arr1[%d] matches arr2[%d]\n", $char1, $char2, $idx1, $idx2);
++$idx1;
++$idx2;
} else {
# Unmatched character, advance arr2
printf(". %s -- skipping arr2[%d]\n", $char2, $idx2);
++$idx2;
}
}
# Remaining unmatched characters
while ($idx1 < #$arr1) {
my $char1 = $arr1->[$idx1];
printf("%s . -- arr1[%d] is beyond the end of arr2\n", $char1, $idx1);
$idx1++;
}
The script prints:
A A -- arr1[0] matches arr2[0]
. C -- skipping arr2[1]
. G -- skipping arr2[2]
T T -- arr1[1] matches arr2[3]
C C -- arr1[2] matches arr2[4]
. C -- skipping arr2[5]
. T -- skipping arr2[6]
G G -- arr1[3] matches arr2[7]
T T -- arr1[4] matches arr2[8]
C C -- arr1[5] matches arr2[9]
G G -- arr1[6] matches arr2[10]
A . -- arr1[7] is beyond the end of arr2
G . -- arr1[8] is beyond the end of arr2
C . -- arr1[9] is beyond the end of arr2
G . -- arr1[10] is beyond the end of arr2

Nested loops makes no sense. You don't want to loop over anything more than once.
You didn't specify what you wanted to happen after you resync, so you'll want to start with the following and tailor it to your needs.
my ($array1, $array2) = #_;
my $idx1 = 0;
my $idx2 = 0;
while ($idx1 < #$array1 && $idx2 < #$array2) {
if ($array1->[$idx1] eq $array2->[$idx2]) {
++$idx1;
++$idx2;
} else {
++$idx2;
}
}
...
As is, the above snippet will leave $idx1 at the last index it couldn't (eventually) resync. If instead you want to stop as soon as you first resync, you want
my ($array1, $array2) = #_;
my $idx1 = 0;
my $idx2 = 0;
my $mismatch = 0;
while ($idx1 < #$array1 && $idx2 < #$array2) {
if ($array1->[$idx1] eq $array2->[$idx2]) {
last if $mismatched;
++$idx1;
++$idx2;
} else {
++$mismatched;
++$idx2;
}
}
...

The foreach loops won't cut it: We'll either want to loop while there are available elements in both arrays, or iterate through all indices, which we can increment as we like:
EL1: while (defined(my $el1 = shift #array1) and #array2) {
EL2: while(defined(my $el2 = shift #array2)) {
++$count and next EL1 if $el1 eq $el2; # break out of inner loop
}
}
or
my $j = 0; # index of #array2
for (my $i = 0; $i <= $#array1; $i++) {
$j++ until $j > $#array or $array1[$i] eq $array2[$j];
last if $j > $#array;
$count++;
}
or any combination.

This is to complex a condition for for loops use while loops instead
my ($array1ref, $array2ref) = #_;
my #array1 = #$array1ref;
my #array2= #$array2ref;
my $count = 0;
my ($index, $index2) = (0,0);
#loop while indexs are in arrays
while($index <= ##array1 && $index2 <= ##array2) {
if($array1[$index] eq $array2[$index2]) {
$index++;
$index2++;
} else {
#increment index until we find a match
$index2++ until $array1[$index] eq $array2[$index2];
}
}

Here is one possibility. It will use indexes to go through both lists.
my #array1 = qw(A T C G T C G A G C G);
my #array2 = qw(A C G T C C T G T C G);
my $count = 0;
my $idx1 = 0;
my $idx2 = 0;
while(($idx1 < scalar #array1) && ($idx2 < scalar #array2)) {
if($array1[$idx1] eq $array2[$idx2]) {
print "Match of $array1[$idx1] array1 \# $idx1 and array2 \# $idx2\n";
$idx1++;
$idx2++;
$count++;
} else {
$idx2++;
}
}
print "Count = $count\n";

It seems like you could do this pretty easily with a 'grep', if you're guaranteed that array2 is always as long as or longer than array1. Something like this:
sub align
{
my ($array1, $array2) = #_;
my $index = 0;
return grep
{
$array1->[$index] eq $array2->[$_] ? ++$index : 0
} 0 .. scalar( #$array2 ) - 1;
}
Basically, the grep is saying "Return me the list of increasing indices into array2 that match contiguous elements from array1."
If you run the above with this test code, you can see it returns the expected alignment array:
my #array1 = qw(A T C G T C G A G C G);
my #array2 = qw(A C G T C C T G T C G);
say join ",", align \#array1, \#array2;
This outputs the expected mapping:
0,3,4,7,8,9,10. That list means that #array1[0 .. 6] correspond to #array2[0,3,4,7,8,9,10].
(Note: You need to use Modern::Perl or similar to use say.)
Now, you haven't really said what you need the output of the operation to be. I've assumed you wanted this mapping array. If you just need a count of the number of elements skipped in #array2 while aligning it with #array1, you can still use the grep above, but instead of the list, just return scalar(#$array2) - $index at the end.

As you may know, your problem is called Sequence Alignment. There are well-developed algorithms to do this efficiently, and one such module Algorithm::NeedlemanWunsch is available on CPAN. Here's how you might apply it to your problem.
#!/usr/bin/perl
use Algorithm::NeedlemanWunsch;
my $arr1 = [qw(A T C G T C G A G C G)];
my $arr2 = [qw(A C G T C C T G T C G)];
my $matcher = Algorithm::NeedlemanWunsch->new(sub {#_==0 ? -1 : $_[0] eq $_[1] ? 1 : -2});
my (#align1, #align2);
my $result = $matcher->align($arr1, $arr2,
{
align => sub {unshift #align1, $arr1->[shift]; unshift #align2, $arr2->[shift]},
shift_a => sub {unshift #align1, $arr1->[shift]; unshift #align2, '.'},
shift_b => sub {unshift #align1, '.'; unshift #align2, $arr1->[shift]},
});
print join("", #align1), "\n";
print join("", #align2), "\n";
That prints out an optimal solution in terms of the costs we specified in the constructor:
ATCGT.C.GAGCG
A.CGTTCGG.TCG
A very different method from the one in your original question, but I think it's worth knowing about.

Related

How to count the each element of same index key in array?

How to count the each element of same index number?
my #a = qw"A B C D E F";
my #b = qw"A B C C";
my $count = 0;
for($i = 0; $i<=scalar #a; $i++){
for($j = 0; $j <= scalar #b; $j++){
if($a[$i] eq $b[$j]){
$count++;
}
}
}
print "Total: $count";
I expect the output is:
Total:3
The output is done by count only the same element of the index key? How can i do it?
There are two potential interpretations to your problem:
1. How does one count the intersection of two arrays?
A hash is an ideal data structure to test for existance:
use strict;
use warnings;
my #a = qw"A B C D E F";
my #b = qw"A B C C";
my %b = map {$_ => 1} #b;
my $count = scalar grep {$b{$_}} #a;
print "Total: $count";
Outputs:
Total: 3
Additional perldoc reference: How do I compute the difference of two arrays? How do I compute the intersection of two arrays?
2. How does one test element equality between two arrays, index to index?
If this is your question, then you do not need two loops, just a single iterator.
use strict;
use warnings;
use List::Util qw(min);
my #a = qw"A B C D E F";
my #b = qw"A B C C";
my $count = scalar grep {$a[$_] eq $b[$_]} (0..min($#a, $#b));
print "Total: $count";
Outputs:
Total: 3

Perl Mismatch among arrays

I have two arrays:
#array1 = (A,B,C,D,E,F);
#array2 = (A,C,H,D,E,G);
The arrays could be of different size. I want to find how many mismatches are there between the arrays. The indexes should be the same. In this case there are three mismatch :b->c,c->h and F->G.(i.e , The 'C' in $array[2] should not be considered a match to 'C' in $array[1]) I would like to get the number of mismatches as well as the mismatch.
foreach my $a1 ( 0 .. $#array1) {
foreach my $a2( 0 .. $#array2)
if($array1[$a1] ne $array2[$a2]) {
}
}
}
my %array_one = map {$_, 1} #array1;
my #difference = grep {!$array_one {$_}} #array1;
print "#difference\n";
Ans: gives me H, G but not C.
with my little Perl knowledge I tried this, with no result. Could you suggest me how I should deal this? Your suggestions and pointers would be very helpful.
You shouldn't have nested loops. You only need to go through the indexes once.
use List::Util qw( max );
my #mismatches;
for my $i (0..max($#array1, $#array2)) {
push #mismatches, $i
if $i >= #array1
|| $i >= #array2
|| $array1[$i] ne $array2[$i];
}
}
say "There are " . (0+#mismatches) . " mismatches";
for my $i (#mismatches) {
...
}
Since you mentioned grep, this is how you'd replace the for with grep:
use List::Util qw( max );
my #mismatches =
grep { $_ >= #array1
|| $_ >= #array2
|| array1[$_] ne $array2[$_] }
0 .. max($#array1, $#array2);
say "There are " . (0+#mismatches) . " mismatches";
for my $i (#mismatches) {
...
}
Here's an example using each_arrayref from List::MoreUtils.
sub diff_array{
use List::MoreUtils qw'each_arrayref';
return unless #_ && defined wantarray;
my #out;
my $iter = each_arrayref(#_);
my $index = 0;
while( my #current = $iter->() ){
next if all_same(#current);
unshift #current, $index;
push #out, \#current;
}continue{ ++$index }
return #out;
}
This version should be faster if you are going to use this for determining the number of differences often. The output is exactly the same. It just doesn't have to work as hard when returning a number.
Read about wantarray for more information.
sub diff_array{
use List::MoreUtils qw'each_arrayref';
return unless #_ && defined wantarray;
my $iter = each_arrayref(#_);
if( wantarray ){
# return structure
my #out;
my $index = 0;
while( my #current = $iter->() ){
next if all_same(#current);
unshift #current, $index;
push #out, \#current;
}continue{ ++$index }
return #out;
}else{
# only return a count of differences
my $out = 0;
while( my #current = $iter->() ){
++$out unless all_same #current;
}
return $out;
}
}
diff_array uses the subroutine all_same to determine if all of the current list of elements are the same.
sub all_same{
my $head = shift;
return undef unless #_; # not enough arguments
for( #_ ){
return 0 if $_ ne $head; # at least one mismatch
}
return 1; # all are the same
}
To get just the number of differences:
print scalar diff_array \#array1, \#array2;
my $count = diff_array \#array1, \#array2;
To get a list of differences:
my #list = diff_array \#array1, \#array2;
To get both:
my $count = my #list = diff_array \#array1, \#array2;
The output for the input you provided:
(
[ 1, 'B', 'C' ],
[ 2, 'C', 'H' ],
[ 5, 'F', 'G' ]
)
Example usage
my #a1 = qw'A B C D E F';
my #a2 = qw'A C H D E G';
my $count = my #list = diff_array \#a1, \#a2;
print "There were $count differences\n\n";
for my $group (#list){
my $index = shift #$group;
print " At index $index\n";
print " $_\n" for #$group;
print "\n";
}
You're iterating over both arrays when you don't want to be doing so.
#array1 = ("A","B","C","D","E","F");
#array2 = ("A","C","H","D","E","G");
foreach my $index (0 .. $#array1) {
if ($array1[$index] ne $array2[$index]) {
print "Arrays differ at index $index: $array1[$index] and $array2[$index]\n";
}
}
Output:
Arrays differ at index 1: B and C
Arrays differ at index 2: C and H
Arrays differ at index 5: F and G
Well, first, you're going to want to go over each element of one of the arrays, and compare it to the same element of the other array. List::MoreUtils provides an easy way to do this:
use v5.14;
use List::MoreUtils qw(each_array);
my #a = qw(a b c d);
my #b = qw(1 2 3);
my $ea = each_array #a, #b;
while ( my ($a, $b) = $ea->() ) {
say "a = $a, b = $b, idx = ", $ea->('index');
}
You can extend that to find where there is a non-match by checking inside that while loop (note: this assumes your arrays don't have undefs at the end, or that if they do, undef is the same as having a shorter array):
my #mismatch;
my $ea = each_array #a, #b;
while ( my ($a, $b) = $ea->() ) {
if (defined $a != defined $b || $a ne $b) {
push #mismatch, $ea->('index');
}
}
and then:
say "Mismatched count = ", scalar(#mismatch), " items are: ", join(q{, }, #mismatch);
The following code builds a list of mismatched pairs, then prints them out.
#a1 = (A,B,C,D,E,F);
#a2 = (A,C,H,D,E,G);
#diff = map { [$a1[$_] => $a2[$_]] }
grep { $a1[$_] ne $a2[$_] }
(0..($#a1 < $#a2 ? $#a1 : $#a2));
print "$_->[0]->$_->[1]\n" for #diff
You have the right idea, but you only need a single loop, since you are looking at each index and comparing entries between the arrays:
foreach my $a1 ( 0 .. $#array1) {
if($array1[$a1] ne $array2[$a1]) {
print "$a1: $array1[$a1] <-> $array2[$a1]\n";
}
}

What's the best way to compare arrays of strings in perl

I'm trying to compare multiple arrays of strings containing file listings of directories. The objective is to determine which files exist in each directory AND which files do not exists. Consider:
List1 List2 List3 List4
a a e f
b b d g
c f a h
The outcome should be:
List1:
List1 List2 List3 List4
a yes yes yes no
b yes yes no no
c yes no no no
List2:
List1 List2 List3 List4
a yes yes yes no
b yes yes no no
f no yes no yes
...
I could go through all the arrays and go through each entry, go through all the other arrays and do a grep:
for my $curfile (#currentdirfiles) {
if( grep(/$curfile/, #otherarrsfiles) ) {
// Set 'yes'
} else {
// set 'no'
}
}
My only concern is that I am ending up with a 0^2n order of magnitude. I may not be able to do anything about this since I would end up looping through all the arrays anyway. One improvement may be in the grep function, but I'm not sure.
Any thoughts?
For lots of string lookups, you generally want to use hashes. Here's one way of doing it:
use strict;
use warnings;
# Define the lists:
my #lists = (
[qw(a b c)], # List 1
[qw(a b f)], # List 2
[qw(e d a)], # List 3
[qw(f g h)], # List 4
);
# For each file, determine which lists it is in:
my %included;
for my $n (0 .. $#lists) {
for my $file (#{ $lists[$n] }) {
$included{$file}[$n] = 1;
} # end for each $file in this list
} # end for each list number $n
# Print out the results:
my $fileWidth = 8;
for my $n (0 .. $#lists) {
# Print the header rows:
printf "\nList %d:\n", $n+1;
print ' ' x $fileWidth;
printf "%-8s", "List $_" for 1 .. #lists;
print "\n";
# Print a line for each file:
for my $file (#{ $lists[$n] }) {
printf "%-${fileWidth}s", $file;
printf "%-8s", ($_ ? 'yes' : 'no') for #{ $included{$file} }[0 .. $#lists];
print "\n";
} # end for each $file in this list
} # end for each list number $n
Why not just remember where each file is when you're reading them in.
Let's say you have a list of directories to read from in #dirlist:
use File::Slurp qw( read_dir );
my %in_dir;
my %dir_files;
foreach my $dir ( #dirlist ) {
die "No such directory $dir" unless -d $dir;
foreach my $file ( read_dir($dir) ) {
$in_dir{$file}{$dir} = 1;
push #{ $dir_files{$dir} }, $file;
}
}
Now $in_dir{filename} will have entries defined for each directory of interest, and
$dir_files{directory} will have a list of files for each directory...
foreach my $dir ( #dirlist ) {
print "$dir\n";
print join("\t", "", #dirlist);
foreach my $file ( #{ $dir_files{$dir} } ) {
my #info = ($file);
foreach my $dir_for_file ( #dirlist ) {
if ( defined $in_dir{$file}{$dir_for_file} ) {
push #info, "Yes";
} else {
push #info, "No";
}
}
print join("\t", #info), "\n";
}
}
The clearest way is to use perl5i and autoboxing:
use perl5i;
my #list1 = qw(one two three);
my #list2 = qw(one two four);
my $missing = #list1 -> diff(\#list2);
my $both = #list1 -> intersect(\#list2);
In a more restricted setup, use hashes for this as the filenames will be unique:
sub in_list {
my ($one, $two) = #_;
my (#in, #out);
my %a = map {$_ => 1} #$one;
foreach my $f (#$two) {
if ($a{$f}) {
push #in, $f;
}
else {
push #out, $f;
}
}
return (\#in, \#out);
}
my #list1 = qw(one two three);
my #list2 = qw(one two four);
my ($in, $out) = in_list(\#list1, \#list2);
print "In list 1 and 2:\n";
print " $_\n" foreach #$in;
print "In list 2 and not in list 1\n";
print " $_\n" foreach #$out;
Now that the question has been amended, this produces the answer you want. It does work in O(n3) time, which is optimal for the problem (there are n3 outputs).
#!/usr/bin/env perl
use strict;
use warnings;
#List1 List2 List3 List4
#a a e f
#b b d g
#c f a h
my(#lists) = ( { a => 1, b => 1, c => 1 },
{ a => 1, b => 1, f => 1 },
{ e => 1, d => 1, a => 1 },
{ f => 1, g => 1, h => 1 },
);
my $i = 0;
foreach my $list (#lists)
{
analyze(++$i, $list, #lists);
}
sub analyze
{
my($num, $ref, #lists) = #_;
printf "List %d\n", $num;
my $pad = " ";
foreach my $i (1..4)
{
print "$pad List$i";
$pad = "";
}
print "\n";
foreach my $file (sort keys %{$ref})
{
printf "%-8s", $file;
foreach my $list (#lists)
{
my %dir = %{$list};
printf "%-8s", (defined $dir{$file}) ? "yes" : "no";
}
print "\n";
}
print "\n";
}
The output I get is:
List 1
List1 List2 List3 List4
a yes yes yes no
b yes yes no no
c yes no no no
List 2
List1 List2 List3 List4
a yes yes yes no
b yes yes no no
f no yes no yes
List 3
List1 List2 List3 List4
a yes yes yes no
d no no yes no
e no no yes no
List 4
List1 List2 List3 List4
f no yes no yes
g no no no yes
h no no no yes
My code is simpler but the output isn't quite what you want:
#lst1=('a', 'b', 'c');
#lst2=('a', 'b', 'f');
#lst3=('e', 'd', 'a');
#lst4=('f', 'g', 'h');
%hsh=();
foreach $item (#lst1) {
$hsh{$item}="list1";
}
foreach $item (#lst2) {
if (defined($hsh{$item})) {
$hsh{$item}=$hsh{$item}." list2";
}
else {
$hsh{$item}="list2";
}
}
foreach $item (#lst3) {
if (defined($hsh{$item})) {
$hsh{$item}=$hsh{$item}." list3";
}
else {
$hsh{$item}="list3";
}
}
foreach $item (#lst4) {
if (defined($hsh{$item})) {
$hsh{$item}=$hsh{$item}." list4";
}
else {
$hsh{$item}="list4";
}
}
foreach $key (sort keys %hsh) {
printf("%s %s\n", $key, $hsh{$key});
}
Gives:
a list1 list2 list3
b list1 list2
c list1
d list3
e list3
f list2 list4
g list4
h list4
Sorry for the late reply, I've been polishing this a while, because I did not want yet another negative score (bums me out).
This is an interesting efficiency problem. I don't know if my solution will work for you, but I thought I would share it anyway. It is probably efficient only if your arrays do not change too often, and if your arrays contain many duplicate values. I have not run any efficiency checks on it.
Basically, the solution is to remove one dimension of the cross checking by turning the array values into bits, and doing a bitwise comparison on the entire array in one go. Array values are deduped, sorted and given a serial number. The arrays total serial numbers are then stored in a single value by bitwise or. A single array can thereby be checked for a single serial number with only one operation, e.g.:
if ( array & serialno )
It will require one run to prepare the data, which can then be saved in cache or similar. This data can then be used until your data changes (e.g. files/folders are removed or added). I have added a fatal exit on undefined values, which means the data must be refreshed when it occurs.
Good luck!
use strict;
use warnings;
my #list1=('a', 'b', 'c');
my #list2=('a', 'b', 'f');
my #list3=('e', 'd', 'a');
my #list4=('f', 'g', 'h');
# combine arrays
my #total = (#list1, #list2, #list3, #list4);
# dedupe (Thanks Xetius for this code snippet)
my %unique = ();
foreach my $item (#total)
{
$unique{$item} ++;
}
# Default sort(), don't think it matters
#total = sort keys %unique;
# translate to serial numbers
my %serials = ();
for (my $num = 0; $num <= $#total; $num++)
{
$serials{$total[$num]} = $num;
}
# convert array values to serial numbers, and combine them
my #tx = ();
for my $entry (#list1) { $tx[0] |= 2**$serials{$entry}; }
for my $entry (#list2) { $tx[1] |= 2**$serials{$entry}; }
for my $entry (#list3) { $tx[2] |= 2**$serials{$entry}; }
for my $entry (#list4) { $tx[3] |= 2**$serials{$entry}; }
&print_all;
sub inList
{
my ($value, $list) = #_;
# Undefined serial numbers are not accepted
if (! defined ($serials{$value}) ) {
print "$value is not in the predefined list.\n";
exit;
}
return ( 2**$serials{$value} & $tx[$list] );
}
sub yesno
{
my ($value, $list) = #_;
return ( &inList($value, $list) ? "yes":"no" );
}
#
# The following code is for printing purposes only
#
sub print_all
{
printf "%-6s %-6s %-6s %-6s %-6s\n", "", "List1", "List2", "List3", "List4";
print "-" x 33, "\n";
&table_print(#list1);
&table_print(#list2);
&table_print(#list3);
&table_print(#list4);
}
sub table_print
{
my #list = #_;
for my $entry (#list) {
printf "%-6s %-6s %-6s %-6s %-6s\n", $entry,
&yesno($entry, 0),
&yesno($entry, 1),
&yesno($entry, 2),
&yesno($entry, 3);
}
print "-" x 33, "\n";
}
I would build a hash using directory entries as keys containing hashes (actually sets) of each listing in which that was found. Iterate over each listing, for each new entry add it to the outer hash with a single set (or hash) containing the identifier of the listing in which it was first encountered. For any entry that's found in the hash simply add the current listing identifier to the value's set/hash.
From there you can simply post process the sorted keys of the hash, and creating rows of your resulting table.
Personally I think Perl is ugly but here's a sample in Python:
#!/usr/bin/env python
import sys
if len(sys.argv) < 2:
print >> sys.stderr, "Must supply arguments"
sys.exit(1)
args = sys.argv[1:]
# build hash entries by iterating over each listing
d = dict()
for each_file in args:
name = each_file
f = open(each_file, 'r')
for line in f:
line = line.strip()
if line not in d:
d[line] = set()
d[line].add(name)
f.close()
# post process the hash
report_template = "%-20s" + (" %-10s" * len(args))
print report_template % (("Dir Entries",) + tuple(args))
for k in sorted(d.keys()):
row = list()
for col in args:
row.append("yes") if col in d[k] else row.append("no")
print report_template % ((k,)+tuple(row))
That should mostly be legible as if it were psuedo-code. The (k,) and ("Dir Entries",) expressions might look a little odd; but that's to force them to be tuples which are are necessary to unpack into the format string using the % operator for strings. Those could also have been written as tuple([k]+row) for example (wrapping the first item in [] makes it a list which can be added to the other list and all converted to a tuple).
Other than that a translation to Perl should be pretty straightforward, just using hashes instead of dictionaries and sets.
(Incidentally, this example will work with an arbitrary number of listings, supplied as arguments and output as columns. Obviously after a dozen columns the output would get to be rather cumbersome to print or display; but it was an easily generalization to make).

How can I compare arrays in Perl?

I have two arrays, #a and #b. I want to do a compare among the elements of the two arrays.
my #a = qw"abc def efg ghy klm ghn";
my #b = qw"def ghy jgk lom com klm";
If any element matches then set a flag. Is there any simple way to do this?
First of all, your 2 arrays need to be written correctly.
#a = ("abc","def","efg","ghy","klm","ghn");
#b = ("def","efg","ghy","klm","ghn","klm");
Second of all, for arbitrary arrays (e.g. arrays whose elements may be references to other data structures) you can use Data::Compare.
For arrays whose elements are scalar, you can do comparison using List::MoreUtils pairwise BLOCK ARRAY1 ARRAY2, where BLOCK is your comparison subroutine. You can emulate pairwise (if you don't have List::MoreUtils access) via:
if (#a != #b) {
$equals = 0;
} else {
$equals = 1;
foreach (my $i = 0; $i < #a; $i++) {
# Ideally, check for undef/value comparison here as well
if ($a[$i] != $b[$i]) { # use "ne" if elements are strings, not numbers
# Or you can use generic sub comparing 2 values
$equals = 0;
last;
}
}
}
P.S. I am not sure but List::Compare may always sort the lists. I'm not sure if it can do pairwise comparisons.
List::Compare
if ( scalar List::Compare->new(\#a, \#b)->get_intersection ) {
…
}
Check to create an intersect function, which will return a list of items that are present in both lists. Then your return value is dependent on the number of items in the intersected list.
You can easily find on the web the best implementation of intersect for Perl. I remember looking for it a few years ago.
Here's what I found :
my #array1 = (1, 2, 3);
my #array2 = (2, 3, 4);
my %original = ();
my #isect = ();
map { $original{$_} = 1 } #array1;
#isect = grep { $original{$_} } #array2;
This is one way:
use warnings;
use strict;
my #a = split /,/, "abc,def,efg,ghy,klm,ghn";
my #b = split /,/, "def,ghy,jgk,lom,com,klm";
my $flag = 0;
my %a;
#a{#a} = (1) x #a;
for (#b) {
if ($a{$_}) {
$flag = 1;
last;
}
}
print "$flag\n";
From the requirement that 'if any element matches', use the intersection of sets:
sub set{
my %set = map { $_, undef }, #_;
return sort keys %set;
}
sub compare{
my ($listA,$listB) = #_;
return ( (set(#$listA)-set(#$listB)) > 0)
}
my #a = qw' abc def efg ghy klm ghn ';
my #b = qw' def ghy jgk lom com klm ';
my $flag;
foreach my $item(#a) {
$flag = #b~~$item ? 0 : 1;
last if !$flag;
}
Note that you will need Perl 5.10, or later, to use the smart match operator (~~) .
Brute force should do the trick for small a n:
my $flag = 0;
foreach my $i (#a) {
foreach my $k (#b) {
if ($i eq $k) {
$flag = 1;
last;
}
}
}
For a large n, use a hash table:
my $flag = 0;
my %aa = ();
$aa{$_} = 1 foreach (#a);
foreach my $i (#b) {
if ($aa{$i}) {
$flag = 1;
last;
}
}
Where a large n is |#a| + |#b| > ~1000 items
IMHO, you should use List::MoreUtils::pairwise. However, if for some reason you cannot, then the following sub would return a 1 for every index where the value in the first array compares equal to the value in the second array. You can generalize this method as much as you want and pass your own comparator if you want to, but at that point, just installing List::MoreUtils would be a more productive use of your time.
use strict; use warnings;
my #a = qw(abc def ghi jkl);
my #b = qw(abc dgh dlkfj jkl kjj lkm);
my $map = which_ones_equal(\#a, \#b);
print join(', ', #$map), "\n";
sub which_ones_equal {
my ($x, $y, $compare) = #_;
my $last = $#$x > $#$y ? $#$x : $#$y;
no warnings 'uninitialized';
return [ map { 0 + ($x->[$_] eq $y->[$_]) } $[ .. $last ];
}
This is Perl. The 'obvious' solution:
my #a = qw"abc def efg ghy klm ghn";
my #b = qw"def ghy jgk lom com klm";
print "arrays equal\n"
if #a == #b and join("\0", #a) eq join("\0", #b);
given "\0" not being in #a.
But thanks for confirming that there is no other generic solution than rolling your own.
my #a1 = qw|a b c d|;
my #a2 = qw|b c d e|;
for my $i (0..$#a1) {
say "element $i of array 1 was not found in array 2"
unless grep {$_ eq $a1[$i]} #a2
}
If you would consider the arrays with different order to be different, you may use Array::Diff
if (Array::Diff->diff(\#a, \#b)->count) {
# not_same
} else {
# same
}
This question still could mean two things where it states "If any element matches then set a flag":
Elements at the same position, i.e $a[2] eq $b[2]
Values at any position, i.e. $a[3] eq $b[5]
For case 1, you might do this:
# iterate over all positions, and compare values at that position
my #matches = grep { $a[$_] eq $b[$_] } 0 .. $#a;
# set flag if there's any match at the same position
my $flag = 1 if #matches;
For case 2, you might do that:
# make a hash of #a and check if any #b are in there
my %a = map { $_ => 1 } #a;
my #matches = grep { $a{$_} } #b;
# set flag if there's matches at any position
my $flag = 1 if #matches;
Note that in the first case, #matches holds the indexes of where there are matching elements, and in the second case #matches holds the matching values in the order in which they appear in #b.

How do I change this to "idiomatic" Perl?

I am beginning to delve deeper into Perl, but am having trouble writing "Perl-ly" code instead of writing C in Perl. How can I change the following code to use more Perl idioms, and how should I go about learning the idioms?
Just an explanation of what it is doing: This routine is part of a module that aligns DNA or amino acid sequences(using Needelman-Wunch if you care about such things). It creates two 2d arrays, one to store a score for each position in the two sequences, and one to keep track of the path so the highest-scoring alignment can be recreated later. It works fine, but I know I am not doing things very concisely and clearly.
edit: This was for an assignment. I completed it, but want to clean up my code a bit. The details on implementing the algorithm can be found on the class website if any of you are interested.
sub create_matrix {
my $self = shift;
#empty array reference
my $matrix = $self->{score_matrix};
#empty array ref
my $path_matrix = $self->{path_matrix};
#$seq1 and $seq2 are strings set previously
my $num_of_rows = length($self->{seq1}) + 1;
my $num_of_columns = length($self->{seq2}) + 1;
#create the 2d array of scores
for (my $i = 0; $i < $num_of_rows; $i++) {
push(#$matrix, []);
push(#$path_matrix, []);
$$matrix[$i][0] = $i * $self->{gap_cost};
$$path_matrix[$i][0] = 1;
}
#fill out the first row
for (my $i = 0; $i < $num_of_columns; $i++) {
$$matrix[0][$i] = $i * $self->{gap_cost};
$$path_matrix[0][$i] = -1;
}
#flag to signal end of traceback
$$path_matrix[0][0] = 2;
#double for loop to fill out each row
for (my $row = 1; $row < $num_of_rows; $row++) {
for (my $column = 1; $column < $num_of_columns; $column++) {
my $seq1_gap = $$matrix[$row-1][$column] + $self->{gap_cost};
my $seq2_gap = $$matrix[$row][$column-1] + $self->{gap_cost};
my $match_mismatch = $$matrix[$row-1][$column-1] + $self->get_match_score(substr($self->{seq1}, $row-1, 1), substr($self->{seq2}, $column-1, 1));
$$matrix[$row][$column] = max($seq1_gap, $seq2_gap, $match_mismatch);
#set the path matrix
#if it was a gap in seq1, -1, if was a (mis)match 0 if was a gap in seq2 1
if ($$matrix[$row][$column] == $seq1_gap) {
$$path_matrix[$row][$column] = -1;
}
elsif ($$matrix[$row][$column] == $match_mismatch) {
$$path_matrix[$row][$column] = 0;
}
elsif ($$matrix[$row][$column] == $seq2_gap) {
$$path_matrix[$row][$column] = 1;
}
}
}
}
You're getting several suggestions regarding syntax, but I would also suggest a more modular approach, if for no other reason that code readability. It's much easier to come up to speed on code if you can perceive the big picture before worrying about low-level details.
Your primary method might look like this.
sub create_matrix {
my $self = shift;
$self->create_2d_array_of_scores;
$self->fill_out_first_row;
$self->fill_out_other_rows;
}
And you would also have several smaller methods like this:
n_of_rows
n_of_cols
create_2d_array_of_scores
fill_out_first_row
fill_out_other_rows
And you might take it even further by defining even smaller methods -- getters, setters, and so forth. At that point, your middle-level methods like create_2d_array_of_scores would not directly touch the underlying data structure at all.
sub matrix { shift->{score_matrix} }
sub gap_cost { shift->{gap_cost} }
sub set_matrix_value {
my ($self, $r, $c, $val) = #_;
$self->matrix->[$r][$c] = $val;
}
# Etc.
One simple change is to use for loops like this:
for my $i (0 .. $num_of_rows){
# Do stuff.
}
For more info, see the Perl documentation on foreach loops and the range operator.
I have some other comments as well, but here is the first observation:
my $num_of_rows = length($self->{seq1}) + 1;
my $num_of_columns = length($self->{seq2}) + 1;
So $self->{seq1} and $self->{seq2} are strings and you keep accessing individual elements using substr. I would prefer to store them as arrays of characters:
$self->{seq1} = [ split //, $seq1 ];
Here is how I would have written it:
sub create_matrix {
my $self = shift;
my $matrix = $self->{score_matrix};
my $path_matrix = $self->{path_matrix};
my $rows = #{ $self->{seq1} };
my $cols = #{ $self->{seq2} };
for my $row (0 .. $rows) {
$matrix->[$row]->[0] = $row * $self->{gap_cost};
$path_matrix->[$row]->[0] = 1;
}
my $gap_cost = $self->{gap_cost};
$matrix->[0] = [ map { $_ * $gap_cost } 0 .. $cols ];
$path_matrix->[0] = [ (-1) x ($cols + 1) ];
$path_matrix->[0]->[0] = 2;
for my $row (1 .. $rows) {
for my $col (1 .. $cols) {
my $gap1 = $matrix->[$row - 1]->[$col] + $gap_cost;
my $gap2 = $matrix->[$row]->[$col - 1] + $gap_cost;
my $match_mismatch =
$matrix->[$row - 1]->[$col - 1] +
$self->get_match_score(
$self->{seq1}->[$row - 1],
$self->{seq2}->[$col - 1]
);
my $max = $matrix->[$row]->[$col] =
max($gap1, $gap2, $match_mismatch);
$path_matrix->[$row]->[$col] = $max == $gap1
? -1
: $max == $gap2
? 1
: 0;
}
}
}
Instead of dereferencing your two-dimensional arrays like this:
$$path_matrix[0][0] = 2;
do this:
$path_matrix->[0][0] = 2;
Also, you're doing a lot of if/then/else statements to match against particular subsequences: this could be better written as given statements (perl5.10's equivalent of C's switch). Read about it at perldoc perlsyn:
given ($matrix->[$row][$column])
{
when ($seq1_gap) { $path_matrix->[$row][$column] = -1; }
when ($match_mismatch) { $path_matrix->[$row][$column] = 0; }
when ($seq2_gap) { $path_matrix->[$row][$column] = 1; }
}
The majority of your code is manipulating 2D arrays. I think the biggest improvement would be switching to using PDL if you want to do much stuff with arrays, particularly if efficiency is a concern. It's a Perl module which provides excellent array support. The underlying routines are implemented in C for efficiency so it's fast too.
I would always advise to look at CPAN for previous solutions or examples of how to do things in Perl. Have you looked at Algorithm::NeedlemanWunsch?
The documentation to this module includes an example for matching DNA sequences. Here is an example using the similarity matrix from wikipedia.
#!/usr/bin/perl -w
use strict;
use warnings;
use Inline::Files; #multiple virtual files inside code
use Algorithm::NeedlemanWunsch; # refer CPAN - good style guide
# Read DNA sequences
my #a = read_DNA_seq("DNA_SEQ_A");
my #b = read_DNA_seq("DNA_SEQ_B");
# Read Similarity Matrix (held as a Hash of Hashes)
my %SM = read_Sim_Matrix();
# Define scoring based on "Similarity Matrix" %SM
sub score_sub {
if ( !#_ ) {
return -3; # gap penalty same as wikipedia)
}
return $SM{ $_[0] }{ $_[1] }; # Similarity Value matrix
}
my $matcher = Algorithm::NeedlemanWunsch->new( \&score_sub, -3 );
my $score = $matcher->align( \#a, \#b, { align => \&check_align, } );
print "\nThe maximum score is $score\n";
sub check_align {
my ( $i, $j ) = #_; # #a[i], #b[j]
print "seqA pos: $i, seqB pos: $j\t base \'$a[$i]\'\n";
}
sub read_DNA_seq {
my $source = shift;
my #data;
while (<$source>) {
push #data, /[ACGT-]{1}/g;
}
return #data;
}
sub read_Sim_Matrix {
#Read DNA similarity matrix (scores per Wikipedia)
my ( #AoA, %HoH );
while (<SIMILARITY_MATRIX>) {
push #AoA, [/(\S+)+/g];
}
for ( my $row = 1 ; $row < 5 ; $row++ ) {
for ( my $col = 1 ; $col < 5 ; $col++ ) {
$HoH{ $AoA[0][$col] }{ $AoA[$row][0] } = $AoA[$row][$col];
}
}
return %HoH;
}
__DNA_SEQ_A__
A T G T A G T G T A T A G T
A C A T G C A
__DNA_SEQ_B__
A T G T A G T A C A T G C A
__SIMILARITY_MATRIX__
- A G C T
A 10 -1 -3 -4
G -1 7 -5 -3
C -3 -5 9 0
T -4 -3 0 8
And here is some sample output:
seqA pos: 7, seqB pos: 2 base 'G'
seqA pos: 6, seqB pos: 1 base 'T'
seqA pos: 4, seqB pos: 0 base 'A'
The maximum score is 100