How do I create a Bubble Sort in Perl - perl

I am trying to create a simple bubble sort in Perl but it doesn't seem to work. Can any one help me?
Code:
for ( my $i = 1; $i < #array; $i++ ) {
for ( my $k = 0; $k = #array < $i - 1; $k++ ) {
if ( $array[$k] > $array[ $k + 1 ] ) {
$temp = $array[$k];
$array[$k] = $array[ $k + 1 ];
$array[ $k + 1 ] = $temp;
}
}
}
Then when I iterate through the array again it is not sorted.

Shouldn't the outer loop go from the back to the front of the array? Also the $k = #array< $i - 1 statement in the for inner loop doesn't make sense.
my #array = (5,6,3,1,7,3,2,9,10,4);
my $i, $k;
for ($i = $#array; $i > 0; $i--) { # $#array = last index = length-1
for ($k = 0; $k < $i; $k++) {
if ($array[$k] > $array[$k+1]) {
($array[$k], $array[$k+1]) = ($array[$k+1], $array[$k]);
}
}
}
print "#array\n"; # 1 2 3 3 4 5 6 7 9 10

You should avoid this sort of confusion by using list version of for
my #array = ( 5, 6, 3, 1, 7, 3, 2, 9, 10, 4 );
for my $i ( 1 .. $#array ) {
for my $k ( 0 .. $i - 1 ) {
#array[ $k, $k + 1 ] = #array[ $k + 1, $k ]
if $array[$k] > $array[ $k + 1 ];
}
}
print "#array\n";
You can prevent a lot of bugs in this way and code is more readable because there is clearly visible intent what do you like to achieve. Which leads to the realisation your algorithm is wrong and what you would probably like is
my #array = ( 5, 6, 3, 1, 7, 3, 2, 9, 10, 4 );
for my $i ( reverse 1 .. $#array ) {
for my $k ( 0 .. $i - 1 ) {
#array[ $k, $k + 1 ] = #array[ $k + 1, $k ]
if $array[$k] > $array[ $k + 1 ];
}
}
print "#array\n";

I cannot believe nobody gave him the true PERL answer yet. I believe he asked for it in Perl, not in C transcribed to Perl :)
for($i=$#a;$i>0;$i--){$m=$a[0];splice(#a,0,$i+1,map{$s=$m;$m>$_?$_:($s,$m=$_)[0]}#a[1..$i],$m);}

Related

Majority Element Failing to close cycles

I'm trying to figure out why this keeps printing the "majority element" candidate in every cycle.
The code I've been trying to make work is a Majority Element search (to find an element that is repeated more than half of the length of a list).
I can't separate the processes of finding the candidate and testing against the array because my input is a text file that has an indeterminate number of arrays. It's an exercise from rosalind.info that has different inputs every time you try to solve it.
An example of the input would be
-5 5 5 5 5 5 5 5 -8 7 7 7 1 7 3 7 -7 1 6 5 10 100 1000 1 -5 1 6 7 1 1 10 1
Here's what I've written so far.
foreach my $currentrow (#lists) {
my #row = ();
#row = split( /\s/, $currentrow );
my $length = $#row;
my $count = 0;
my $i = 0;
for $i ( 0 .. $length - 1 ) {
if ( $count == 0 ) {
$candidate = $row[$i];
$count++;
}
if ( ( $count > 0 ) and ( $i = $length - 1 ) ) {
my $counter2 = 0;
for my $j ( 0 .. $length - 1 ) {
if ( $row[$j] == $candidate ) {
$counter2++;
}
}
if ( $counter2 <= ( $#row / 2 ) and ( $i = $length - 1 ) ) {
$candidate = -1;
print $candidate, " ", $i, " ";
}
if ( $counter2 > ( $#row / 2 ) and ( $i = $length - 1 ) ) {
print $candidate, " ", $i, " ";
}
}
if ( $candidate == $row[$i] and $count > 0 ) {
$count = $count + 1;
}
if ( $candidate != $row[$i] and $count > 0 ) {
$count = $count - 1;
}
}
}
Do you have use strict and use warnings 'all' in place?
I imagine that your problem may be because of the test $i = $length - 1, which is an assignment, and should be $i == $length - 1
To find a majority element I would use a hash:
perl -nae '%h=(); $h{$_}+=2 for #F; $h{$_}>#F and print for keys %h; print "\n"'
Each line of input is treated separately. Each line of output matches a line of input and presents its majority element or is empty if there is no such element.
Edit: Now the solution uses autosplit (-a), which is shorter and work not only for numbers.

trying to predict future results, collatz, Perl

I am working on a collatz sequence. I currently have a for loop.
for my $num (1..1000000) {
my $count = 1;
for (my $i = $num; $i != 1; $count++) {
$i = $i % 2 ? 3 * $i + 1 : $i / 2;
}
}
And then I have a simple way of working out the count of the loop (who many times it takes to complete the theory).
if ($count > $max_length) {
$max = $num;
$max_length = $count;
}
I worked out that code could be made quicker by using a simple theory.
If n = 3, it would have this sequence {3,10,5,16,8,4,2,1} [8] If n =
6, it would have this sequence {6,3,10,5,16,8,4,2,1} [9] If n = 12, it
would have this sequence {12,6,3,10,5,16,8,4,2,1} [10]
So I want to save the result of 3, to be able to work out the result of 6 by just adding 1 to the count and so forth.
I tried to tackle this, with what I thought would do the trick but it infact made my program take 1 minute longer to complete, I now have a program that takes 1.49 seconds rather than 30 seconds I had before.
This is how I added the cache(it's probably wrong)
The below is outside of the for loop
my $cache = 0;
my $lengthcache = 0;
I then have this bit of code which sits after the $i line, line 4 in the for loop
$cache = $i;
$lengthcache = $count;
if ($cache = $num*2) {
$lengthcache++;
}
I don't want the answer given to me in full, I just need to understand how to correctly cache without making the code slower.
You just want the length, right? There's no much savings to be obtained to caching the sequence, and the memory usage will be quite large.
Write a recursive function that returns the length.
sub seq_len {
my ($n) = #_;
return 1 if $n == 1;
return 1 + seq_len( $n % 2 ? 3 * $n + 1 : $n / 2 );
}
Cache the result.
my %cache;
sub seq_len {
my ($n) = #_;
return $cache{$n} if $cache{$n};
return $cache{$n} = 1 if $n == 1;
return $cache{$n} = 1 + seq_len( $n % 2 ? 3 * $n + 1 : $n / 2 );
}
Might as well move terminating conditions to the cache.
my %cache = ( 1 => 1 );
sub seq_len {
my ($n) = #_;
return $cache{$n} ||= 1 + seq_len( $n % 2 ? 3 * $n + 1 : $n / 2 );
}
Recursion is not necessary. You can speed it up by flatting it. It's a bit tricky, but you can do it using the usual technique[1].
my %cache = ( 1 => 1 );
sub seq_len {
my ($n) = #_;
my #to_cache;
while (1) {
if (my $length = $cache{$n}) {
$cache{pop(#to_cache)} = ++$length while #to_cache;
return $length;
}
push #to_cache, $n;
$n = $n % 2 ? 3 * $n + 1 : $n / 2;
}
}
Making sure it works:
use strict;
use warnings;
use feature qw( say );
use List::Util qw( sum );
my $calculations;
my %cache = ( 1 => 1 );
sub seq_len {
my ($n) = #_;
my #to_cache;
while (1) {
if (my $length = $cache{$n}) {
$cache{pop(#to_cache)} = ++$length while #to_cache;
return $length;
}
push #to_cache, $n;
++$calculations;
$n = $n % 2 ? 3 * $n + 1 : $n / 2;
}
}
my #results = map { seq_len($_) } 3,6,12;
say for #results;
say "$calculations calculations instead of " . (sum(#results)-#results);
8
9
10
9 calculations instead of 24
Notes:
To remove recursion,
Make the function tail-recursive by rearranging code or by passing down information about what to do on return. (The former is not possible here.)
Replace the recursion with a loop plus stack.
Eliminate the stack if possible. (Not possible here.)
Clean up the result.
Changing your algorithm to cache results so that it can break out early:
use strict;
use warnings;
my #steps = (0,0);
my $max_steps = 0;
my $max_num = 0;
for my $num (2..1_000_000) {
my $count = 0;
my $i = $num;
while ($i >= $num) {
$i = $i % 2 ? 3 * $i + 1 : $i / 2;
$count++;
}
$count += $steps[$i];
$steps[$num] = $count;
if ($max_steps < $count) {
$max_steps = $count;
$max_num = $num;
}
}
print "$max_num takes $max_steps steps\n";
Changes my processing time from 37 seconds to 2.5 seconds.
Why is 2.5 seconds enough of an improvement?
I chose caching in an array #steps because the processing of all integers from 1 to N easily matches the indexes of an array. This also provides a memory benefit over using a hash of 33M vs 96M in a hash holding the same data.
As ikegami pointed out, this does mean that I can't cache all the values of cycles that go past 1million though, as that would quickly use up all memory. For example, the number 704,511 has a cycle that goes up to 56,991,483,520.
In the end, this means that my method does recalculate portions of certain cycles, but overall there is still a speed improvement due to not having to check for caches at every step. When I change this to use a hash and cache every cycle, the speed decreases to 9.2secs.
my %steps = (1 => 0);
for my $num (2..1_000_000) {
my #i = $num;
while (! defined $steps{$i[-1]}) {
push #i, $i[-1] % 2 ? 3 * $i[-1] + 1 : $i[-1] / 2;
}
my $count = $steps{pop #i};
$steps{pop #i} = ++$count while (#i);
#...
And when I use memoize like demonstrated by Oesor, the speed is 23secs.
If you change your implementation to be a recursive function, you can wrap it with Memoize (https://metacpan.org/pod/Memoize) to speed up already calculated responses.
use strict;
use warnings;
use feature qw/say/;
use Data::Printer;
use Memoize;
memoize('collatz');
for my $num (qw/3 6 12 1/) {
my #series = collatz($num);
p(#series);
say "$num : " . scalar #series;
}
sub collatz {
my($i) = #_;
return $i if $i == 1;
return ($i, collatz( $i % 2 ? 3 * $i + 1 : $i / 2 ));
}
Output
[
[0] 3,
[1] 10,
[2] 5,
[3] 16,
[4] 8,
[5] 4,
[6] 2,
[7] 1
]
3 : 8
[
[0] 6,
[1] 3,
[2] 10,
[3] 5,
[4] 16,
[5] 8,
[6] 4,
[7] 2,
[8] 1
]
6 : 9
[
[0] 12,
[1] 6,
[2] 3,
[3] 10,
[4] 5,
[5] 16,
[6] 8,
[7] 4,
[8] 2,
[9] 1
]
12 : 10
[
[0] 1
]
1 : 1

to change the value of an element of an array by using the index ref in perl

#!/bin/usr/perl -w
use strict;
print "Enter your input filename for original sample data values: \n";
chomp($data=<STDIN>);
print "Enter your input filename for adjustment values\n";
chomp($adj=<STDIN>) ;
print "Enter your output filename for resultant adjusted new sample data \n";
chomp($new=<STDIN>);
open(R1,"$data") or die("error");
open(R2,"$adj") or die ("error");
open(WW,"+>$new") or die ("error");
while( ($line1=(<R1>)) && ($line2=(<R2>)) )
{
$l1=$line1;
#arr1= split(" ",$l1);
$l2=$line2;
#arr2= split(" ",$l2);
$l= ( scalar#arr1);
$p= (scalar#arr2);
for ( $i = 0; $i <= $l; $i++ ){
for ( $j =($i+1); $j <= $l; $j++ ){
if ($arr1[$i]< $arr1[$j]){
$a = $arr1[$i] + ($arr2[$i]/2);
$b = $arr1[$j] - ($arr2[$i]/2);
push ( $arr1[$i]->$a , $arr1[$j]->$b);
}
elsif ( $arr1[$i]= $arr1[$j]){
$a = $arr1[$i];
$b = $arr1[$j];
push ($arr1[$i]->$a,$arr1[$j]->$b);
}
else{
$a = $arr1[$i]-($arr2[$i]/2);
$b = $arr1[$j]+ ($arr2[$i]/2);
push ($arr1[$i]->$a,$arr1[$j]->$b);
}
}
}
$l1 = scalar#arr1;
for ($k = 0; $k <= $l1 ; $k++)
{
if (($k % 10) != 0){
print WW "$arr1[$k]";
print WW "\t" ;
}
else {
print WW "\n";
print WW "$arr1[$k]";
print WW "\t";
}
}
}
close(R1);
close(R2);
close(WW);
exit;
when i am running this prog. i am getting an error that "not an ARRAY reference at line 29".
how can i create the reference to my first array #arr1 ??? so that it stores the changed values of the element at the particular index after running the iteration.
input :
#array1
1 2 3 4 5 6 7 8 9 10
#array2
1 2 3 4 5 6 7 8 9 10 9 8 7 6 5 4 3 2
desired output
#array1
15 1.5 2 3 6 4 11.5 5 5.5
Well, I'm not getting the answer you say you're looking for, but what it appears you're trying to do is to store the value of $a into the $i'th index of array #arr1 and the value of $b into the $jth index of #arr1. I have hoisted the assignment code out of the if branches since it will be the same for all three cases. I have also fixed a subtle error you had in your conditions. You had
elsif ( $arr1[$i]= $arr1[$j]){
but you surely meant to do an equality comparison rather than an assignment here:
elsif ( $arr1[$i] == $arr1[$j]){
So here is the modified section. As I say, it still doesn't print out what you say the desired result is, and I'm not sure whether it's because your computation is wrong or your printing is wrong (I couldn't figure out any obvious transform from your inputs to your desired output), but this should at least put you in the right direction:
for ( $i = 0; $i <= $l; $i++ ){
for ( $j =($i+1); $j <= $l; $j++ ){
if ($arr1[$i]< $arr1[$j]){
$a = $arr1[$i] + ($arr2[$i]/2);
$b = $arr1[$j] - ($arr2[$i]/2);
# push ( $arr1[$i]->$a , $arr1[$j]->$b);
}
elsif ( $arr1[$i] == $arr1[$j]){
$a = $arr1[$i];
$b = $arr1[$j];
# push ($arr1[$i]->$a,$arr1[$j]->$b);
}
else{
$a = $arr1[$i]-($arr2[$i]/2);
$b = $arr1[$j]+ ($arr2[$i]/2);
# push ($arr1[$i]->$a,$arr1[$j]->$b);
}
$arr1[$i] = $a;
$arr1[$j] = $b;
}
}

Perl mergesort - array reference

I was trying to implement merge-sort in Perl, I am quite new to Perl and I know I am doing something wrong with the array references. The arrays end up holding the same value after the process is done. Please help cause I don't see where I am going wrong.
The Corrected Code:
use strict;
use warnings;
my ( #aref, #auxref ) = ();
my ( $hi, $lo, $i, $j, $k, $n ) = 0;
#aref = ( 5, 7, 6, 3, 4, 1, 8, 9, 4 );
$n = #aref;
mergeSort( \#aref, \#auxref, 0, $n - 1 );
print "#auxref\n";
print "#aref\n";
sub mergeSort {
my ($aref) = $_[0];
my ($auxref) = $_[1];
my $lo = $_[2];
my $hi = $_[3];
if ( $hi <= $lo ) { return; }
my $mid = 0;
$mid = int( $lo + ( $hi - $lo ) / 2 );
mergeSort( $aref, $auxref, $lo, $mid );
mergeSort( $aref, $auxref, $mid + 1, $hi );
merge( $aref, $auxref, $lo, $mid, $hi );
}
sub merge {
my ($aref) = $_[0];
my ($auxref) = $_[1];
my $lo = $_[2];
my $mid = $_[3];
my $hi = $_[4];
for ( $i = $lo ; $i <= $hi ; $i++ ) {
$auxref->[$i] = $aref->[$i];
}
$i = $lo;
$j = $mid + 1;
for ( $k = $lo ; $k <= $hi ; $k++ ) {
if ( $i > $mid ) {
$aref->[$k] = $auxref->[$j];
$j++;
}
elsif ( $j > $hi ) {
$aref->[$k] = $auxref->[$i];
$i++;
}
elsif ( $auxref->[$i] <= $auxref->[$j] ) {
$aref->[$k] = $auxref->[$i];
$i++;
}
else {
$aref->[$k] = $auxref->[$j];
$j++;
}
}
}
In sub merge, you have two array refs: $auxref and $aref.
And you're accessing the array elements as though they were ordinary arrays (i.e. $aref[0]) but as they are array references, you need to dereference with an arrow first: $aref->[0].
Adding use strict; and use warnings; to the top of your script should have weeded out these errors though?
Arrays
my #arr = (1, 2, 3, 4);
$arr[0] = 5;
push #arr, 6;
# #arr = (5, 2, 3, 4, 6)
Array References
my $arr = [1,2,3];
$arr->[0] = 5;
push #$arr, 6;
# $arr = [5, 2, 3, 4, 6];
2D arrays of array references
my #arr = ([1, 2], [3, 4]);
print $arr[0][1]; # identical to $arr[0]->[1];
push #{$arr[1]}, 5;
# #arr = ([1, 2], [3, 4, 5]);
2D arrayref of array references
my $arr = [[1, 2], [3, 4]];
print $arr->[0][1]; # identical to $arr->[0]->[1];
push #{$arr->[1]}, 5;
# $arr = [[1, 2], [3, 4, 5]];
2D Array of arrays
...can't exist because an array can only hold scalars
my #arr = ((1, 2), (3, 4));
# #arr = (1, 2, 3, 4);
The following is a version of merge sort that doesn't rely on references at all. It almost certainly isn't as memory efficient as some of the original merge sort algorithms were intended, but it gets the job done.
use strict;
use warnings;
my #array = ( 5, 7, 6, 3, 4, 1, 8, 9, 4 );
my #sorted = mergeSort(#array);
print "#sorted\n";
sub mergeSort {
my #array = #_;
if (#array > 1) {
my $mid = int(#array / 2);
my #lowArray = mergeSort(#array[0..$mid-1]);
my #highArray = mergeSort(#array[$mid..$#array]);
# Merge the two halves
my #newArray = ();
while (#lowArray && #highArray) {
if ($lowArray[0] < $highArray[0]) {
push #newArray, shift #lowArray;
} else {
push #newArray, shift #highArray;
}
}
# Either the low or high array will be empty at this point,
# so no need to compare for the remainder.
return (#newArray, #lowArray, #highArray);
} else {
return #array;
}
}

perl blowing up in sequence alignment by dynamic programming

I'm comparing a reference sequence of size 5500 bases and query sequence of size 3600, using dynamic programming (semi global alignment), in fact I don't know much about complexity and performance and the code is blowing up and giving me the error "out of memory". Knowing that it works normally on smaller sequences, my question is: This behavior is normal or I might have another problem in code ?if it's normal any hint to solve this problem ? Thanks in advance.
sub semiGlobal {
my ( $seq1, $seq2,$MATCH,$MISMATCH,$GAP ) = #_;
# initialization: first row to 0 ;
my #matrix;
$matrix[0][0]{score} = 0;
$matrix[0][0]{pointer} = "none";
for ( my $j = 1 ; $j <= length($seq1) ; $j++ ) {
$matrix[0][$j]{score} = 0;
$matrix[0][$j]{pointer} = "none";
}
for ( my $i = 1 ; $i <= length($seq2) ; $i++ ) {
$matrix[$i][0]{score} = $GAP * $i;
$matrix[$i][0]{pointer} = "up";
}
# fill
my $max_i = 0;
my $max_j = 0;
my $max_score = 0;
print "seq2: ".length($seq2);
print "seq1: ".length($seq1);
for ( my $i = 1 ; $i <= length($seq2) ; $i++ ) {
for ( my $j = 1 ; $j <= length($seq1) ; $j++ ) {
my ( $diagonal_score, $left_score, $up_score );
# calculate match score
my $letter1 = substr( $seq1, $j - 1, 1 );
my $letter2 = substr( $seq2, $i - 1, 1 );
if ( $letter1 eq $letter2 ) {
$diagonal_score = $matrix[ $i - 1 ][ $j - 1 ]{score} + $MATCH;
}
else {
$diagonal_score = $matrix[ $i - 1 ][ $j - 1 ]{score} + $MISMATCH;
}
# calculate gap scores
$up_score = $matrix[ $i - 1 ][$j]{score} + $GAP;
$left_score = $matrix[$i][ $j - 1 ]{score} + $GAP;
# choose best score
if ( $diagonal_score >= $up_score ) {
if ( $diagonal_score >= $left_score ) {
$matrix[$i][$j]{score} = $diagonal_score;
$matrix[$i][$j]{pointer} = "diagonal";
}
else {
$matrix[$i][$j]{score} = $left_score;
$matrix[$i][$j]{pointer} = "left";
}
}
else {
if ( $up_score >= $left_score ) {
$matrix[$i][$j]{score} = $up_score;
$matrix[$i][$j]{pointer} = "up";
}
else {
$matrix[$i][$j]{score} = $left_score;
$matrix[$i][$j]{pointer} = "left";
}
}
# set maximum score
if ( $matrix[$i][$j]{score} > $max_score ) {
$max_i = $i;
$max_j = $j;
$max_score = $matrix[$i][$j]{score};
}
}
}
my $align1 = "";
my $align2 = "";
my $j = $max_j;
my $i = $max_i;
while (1) {
if ( $matrix[$i][$j]{pointer} eq "none" ) {
$stseq1 = $j;
last;
}
if ( $matrix[$i][$j]{pointer} eq "diagonal" ) {
$align1 .= substr( $seq1, $j - 1, 1 );
$align2 .= substr( $seq2, $i - 1, 1 );
$i--;
$j--;
}
elsif ( $matrix[$i][$j]{pointer} eq "left" ) {
$align1 .= substr( $seq1, $j - 1, 1 );
$align2 .= "-";
$j--;
}
elsif ( $matrix[$i][$j]{pointer} eq "up" ) {
$align1 .= "-";
$align2 .= substr( $seq2, $i - 1, 1 );
$i--;
}
}
$align1 = reverse $align1;
$align2 = reverse $align2;
return ( $align1, $align2, $stseq1 ,$max_j);
}
One way to possibly solve the problem is to tie the #matrix with a file. However, this will dramatically slow down the program. Consider this:
sub semiGlobal {
use Tie::Array::CSV;
tie my #matrix, 'Tie::Array::CSV', 'temp.txt'; # Don't forget to add your own error handler.
my ( $seq1, $seq2,$MATCH,$MISMATCH,$GAP ) = #_;
# initialization: first row to 0 ;
$matrix[0][0] = '0 n';
for ( my $j = 1 ; $j <= length($seq1) ; $j++ ) {
$matrix[0][$j] = '0 n';
}
for ( my $i = 1 ; $i <= length($seq2) ; $i++ ) {
my $score = $GAP * $i;
$matrix[$i][0] = join ' ',$score,'u';
}
#print Dumper(\#matrix);
# fill
my $max_i = 0;
my $max_j = 0;
my $max_score = 0;
print "seq2: ".length($seq2)."\n";
print "seq1: ".length($seq1)."\n";
for ( my $i = 1 ; $i <= length($seq2) ; $i++ ) {
for ( my $j = 1 ; $j <= length($seq1) ; $j++ ) {
my ( $diagonal_score, $left_score, $up_score );
# calculate match score
my $letter1 = substr( $seq1, $j - 1, 1 );
my $letter2 = substr( $seq2, $i - 1, 1 );
my $score = (split / /, $matrix[ $i - 1 ][ $j - 1 ])[0];
if ( $letter1 eq $letter2 ) {
$diagonal_score = $score + $MATCH;
}
else {
$diagonal_score = $score + $MISMATCH;
}
# calculate gap scores
$up_score = (split / /,$matrix[ $i - 1 ][$j])[0] + $GAP;
$left_score = (split / /,$matrix[$i][ $j - 1 ])[0] + $GAP;
# choose best score
if ( $diagonal_score >= $up_score ) {
if ( $diagonal_score >= $left_score ) {
$matrix[$i][$j] = join ' ',$diagonal_score,'d';
}
else {
$matrix[$i][$j] = join ' ', $left_score, 'l';
}
}
else {
if ( $up_score >= $left_score ) {
$matrix[$i][$j] = join ' ', $up_score, 'u';
}
else {
$matrix[$i][$j] = join ' ', $left_score, 'l';
}
}
# set maximum score
if ( (split / /, $matrix[$i][$j])[0] > $max_score ) {
$max_i = $i;
$max_j = $j;
$max_score = (split / /, $matrix[$i][$j])[0];
}
}
}
my $align1 = "";
my $align2 = "";
my $stseq1;
my $j = $max_j;
my $i = $max_i;
while (1) {
my $pointer = (split / /, $matrix[$i][$j])[1];
if ( $pointer eq "n" ) {
$stseq1 = $j;
last;
}
if ( $pointer eq "d" ) {
$align1 .= substr( $seq1, $j - 1, 1 );
$align2 .= substr( $seq2, $i - 1, 1 );
$i--;
$j--;
}
elsif ( $pointer eq "l" ) {
$align1 .= substr( $seq1, $j - 1, 1 );
$align2 .= "-";
$j--;
}
elsif ( $pointer eq "u" ) {
$align1 .= "-";
$align2 .= substr( $seq2, $i - 1, 1 );
$i--;
}
}
$align1 = reverse $align1;
$align2 = reverse $align2;
untie #matrix; # Don't forget to add your own error handler.
unlink 'temp.txt'; # Don't forget to add your own error handler.
return ( $align1, $align2, $stseq1 ,$max_j);
}
You can still use your original sub for short sequences, and switch to this sub for long ones.
I think that #j_random_hacker and #Ashalynd are on the right track regarding using this algorithm in most Perl implementations. The datatypes you're using are going to use more memory that absolutely needed for the calculations.
So this is "normal" in that you should expect to see this kind of memory usage for how you've written this algorithm in perl. You may have other problems in surrounding code that are using a lot of memory but this algorithm will hit your memory hard with large sequences.
You can address some of the memory issues by changing the datatypes that you're using as #Ashalynd suggests. You could try changing the hash which holds score and pointer into an array and changing the string pointers into integer values. Something like this might get you some benefit while still maintaining readability:
use strict;
use warnings;
# define constants for array positions and pointer values
# so the code is still readable.
# (If you have the "Readonly" CPAN module you may want to use it for constants
# instead although none of the downsides of the "constant" pragma apply in this code.)
use constant {
SCORE => 0,
POINTER => 1,
DIAGONAL => 0,
LEFT => 1,
UP => 2,
NONE => 3,
};
...
sub semiGlobal2 {
my ( $seq1, $seq2,$MATCH,$MISMATCH,$GAP ) = #_;
# initialization: first row to 0 ;
my #matrix;
# score and pointer are now stored in an array
# using the defined constants as indices
$matrix[0][0][SCORE] = 0;
# pointer value is now a constant integer
$matrix[0][0][POINTER] = NONE;
for ( my $j = 1 ; $j <= length($seq1) ; $j++ ) {
$matrix[0][$j][SCORE] = 0;
$matrix[0][$j][POINTER] = NONE;
}
for ( my $i = 1 ; $i <= length($seq2) ; $i++ ) {
$matrix[$i][0][SCORE] = $GAP * $i;
$matrix[$i][0][POINTER] = UP;
}
... # continue to make the appropriate changes throughout the code
However, when I tested this I didn't get a huge benefit when attempting to align a 3600 char string in a 5500 char string of random data. I programmed my code to abort when it consumed more than 2GB of memory. The original code aborted after 23 seconds while the one using constants and an array instead of a hash aborted after 32 seconds.
If you really want to use this specific algorithm I'd check out the performance of Algorithm::NeedlemanWunsch. It doesn't look like it's very mature but it may have addressed your performance issues. Otherwise look into writing an Inline or Perl XS wrapper around a C implementation