How to extend a binary search iterator to consume multiple targets - perl

I have a function, binary_range_search, that is called like so:
my $brs_iterator = binary_range_search(
target => $range, # eg. [1, 200]
search => $ranges # eg. [ {start => 1, end => 1000},
); # {start => 500, end => 1500} ]
brs_iterator->() will iterate over all #$ranges on which $range overlaps.
I would like to extend binary_range_search to be able to call it with multiple ranges as its target, eg:
target => $target_ranges # eg. [ [1, 200], [50, 300], ... ]
search => $search_ranges # as above
So, when the search on $range->[0] is exhausted, it should move on to $range->[1], and so on. Here is the function in question, in its original form:
sub binary_range_search {
my %options = #_;
my $range = $options{target} || return;
my $ranges = $options{search} || return;
my ( $low, $high ) = ( 0, #{$ranges} - 1 );
while ( $low <= $high ) {
my $try = int( ( $low + $high ) / 2 );
$low = $try + 1, next if $ranges->[$try]{end} < $range->[0];
$high = $try - 1, next if $ranges->[$try]{start} > $range->[1];
my ( $down, $up ) = ($try) x 2;
my %seen = ();
my $brs_iterator = sub {
if ( $ranges->[ $up + 1 ]{end} >= $range->[0]
and $ranges->[ $up + 1 ]{start} <= $range->[1]
and !exists $seen{ $up + 1 } )
{
$seen{ $up + 1 } = undef;
return $ranges->[ ++$up ];
}
elsif ( $ranges->[ $down - 1 ]{end} >= $range->[0]
and $ranges->[ $down + 1 ]{start} <= $range->[1]
and !exists $seen{ $down - 1 }
and $down > 0 )
{
$seen{ $down - 1 } = undef;
return $ranges->[ --$down ];
}
elsif ( !exists $seen{$try} ) {
$seen{$try} = undef;
return $ranges->[$try];
}
else {
return;
}
};
return $brs_iterator;
}
return sub { };
}
It's a standard binary search strategy, until it finds an overlapping range. It then moves on the right, exhausts it, moves on the left, exhausts it, and finally gives up. Ideally, it should then maybe shift the next target range, and redo the search, I suppose (perhaps via recursion?). My problem is, I am not sure how to make that work with the iterator construction.

I just wrapped your iterator generation in a for loop, and built up an array of iterator functions.
Depending on context, I either return a master iterator or a list of iterator functions. I wasn't sure what you wanted.
use strict;
use warnings;
my $t = [ [1,200], [400,900] ];
my #r = (
{ start => 1, end => 100 },
{ start => 2, end => 500 },
{ start => 204, end => 500 },
{ start => 208, end => 500 },
{ start => 215, end => 1000 },
{ start => 150, end => 1000 },
{ start => 500, end => 1100 },
);
# Get a master iterator that will process each iterator in turn.
my $brs_iterator = binary_range_search(
targets => $t,
search => \#r,
);
# Get an array of iterators
my #brs_iterator = binary_range_search(
targets => $t,
search => \#r,
);
sub binary_range_search {
my %options = #_;
my $targets = $options{targets} || return;
my $ranges = $options{search} || return;
my #iterators;
TARGET:
for my $target ( #$targets ) {
my ( $low, $high ) = ( 0, $#{$ranges} );
RANGE_CHECK:
while ( $low <= $high ) {
my $try = int( ( $low + $high ) / 2 );
# Remove non-overlapping ranges
$low = $try + 1, next RANGE_CHECK
if $ranges->[$try]{end} < $target->[0];
$high = $try - 1, next RANGE_CHECK
if $ranges->[$try]{start} > $target->[1];
my ( $down, $up ) = ($try) x 2;
my %seen = ();
my $brs_iterator = sub {
if ( exists $ranges->[$up + 1]
and $ranges->[ $up + 1 ]{end} >= $target->[0]
and $ranges->[ $up + 1 ]{start} <= $target->[1]
and !exists $seen{ $up + 1 } )
{
$seen{ $up + 1 } = undef;
return $ranges->[ ++$up ];
}
elsif ( $ranges->[ $down - 1 ]{end} >= $target->[0]
and $ranges->[ $down + 1 ]{start} <= $target->[1]
and !exists $seen{ $down - 1 }
and $down > 0 )
{
$seen{ $down - 1 } = undef;
return $ranges->[ --$down ];
}
elsif ( !exists $seen{$try} ) {
$seen{$try} = undef;
return $ranges->[$try];
}
else {
return;
}
};
push #iterators, $brs_iterator;
next TARGET;
}
}
# In scalar context return master iterator that iterates over the list of range iterators.
# In list context returns a list of range iterators.
return wantarray
? #iterators
: sub {
while( #iterators ) {
if( my $range = $iterators[0]() ) {
return $range;
}
shift #iterators;
}
return;
};
}

If you're wanting to iterate over all values that overlap the search ranges, you don't need binary search.
First the customary front matter:
use warnings;
use strict;
use Carp;
First off, check that we have target and search parameters and that for each range, the starting point is no greater than its ending point. Otherwise, we refuse to proceed.
sub binary_range_search {
my %arg = #_;
my #errors;
my $target = $arg{target} || push #errors => "no target";
my $search = $arg{search} || push #errors => "no search";
for (#$target) {
my($start,$end) = #$_;
push #errors => "Target start ($start) is greater than end ($end)"
if $start > $end;
}
for (#$search) {
my($start,$end) = #{$_}{qw/ start end /};
push #errors => "Search start ($start) is greater than end ($end)"
if $start > $end;
}
croak "Invalid use of binary_range_search:\n",
map " - $_\n", #errors
if #errors;
The iterator itself is a closure that maintains the following state:
my $i;
my($ta,$tb);
my($sa,$sb);
my $si = 0;
where
$i if defined is the next value from the current overlapping range
$ta and $tb are the starting and ending points for the current target range
$sa and $sb are like the above but for the current search range
$si is an index into #$search and defines the current search range
We will be assigning and returning the iterator $it. The declaration and initialization are separate so the iterator may call itself when necessary.
my $it;
$it = sub {
We are done if no more target ranges remain or if there were no search ranges to begin with.
return unless #$target && #$search;
When $i is defined, it means we have found an overlap and iterate by incrementing $i until it is greater than the ending point of either the current target range or the current search range.
if (defined $i) {
# iterating within a target range
if ($i > $tb || $i > $sb) {
++$si;
undef $i;
return $it->();
}
else {
return $i++;
}
}
Otherwise, we need to determine whether the next target range overlaps any search range. However, if $i is undefined and we've already considered all the search ranges, we discard the current target range and start again.
else {
# does the next target range overlap?
if ($si >= #$search) {
shift #$target;
$si = 0;
return $it->();
}
Here we pull out the starting and ending points of both the current target range (always at the front of #$target) and the current search range (indexed by $si).
($ta,$tb) = #{ $target->[0] };
($sa,$sb) = #{ $search->[$si] }{qw/ start end /};
Now testing for overlap is straightforward. For disjoint search ranges, we ignore and move on. Otherwise, we find the leftmost point in the overlap and iterate from there.
if ($sb < $ta || $sa > $tb) {
# disjoint
++$si;
undef $i;
return $it->();
}
elsif ($sa >= $ta) {
$i = $sa;
return $i++;
}
elsif ($ta >= $sa) {
$i = $ta;
return $i++;
}
}
};
Finally, we return the iterator:
$it;
}
For an example similar to the one in your question
my $it = binary_range_search(
target => [ [1, 200], [50, 300] ],
search => [ { start => 1, end => 1000 },
{ start => 500, end => 1500 },
{ start => 40, end => 60 },
{ start => 250, end => 260 } ],
);
while (defined(my $value = $it->())) {
print "got $value\n";
}
Its output with internal points elided is
got 1
[...]
got 200
got 40
[...]
got 60
got 50
[...]
got 300
got 50
[...]
got 60
got 250
[...]
got 260

Split it into two functions, an outer function that loops over the ranges and calls an inner function that implements a conventional binary chop.

Warning: a very c++ biased answer:
what you have to do is define a new type of iterator, which is a pair of a usual iterator, and a segmemt iterrator (if you don't have a segment iterator, it's a pair of a const pointer / ref to the segments, and an index pointing to the correct segment). You have to define all the concepts of a random access iterator (difference, addition of integer, etc). bear in mind, that at least in c++ lingo this is not a true random iterator, since addition of an integer is not really constant time; such is life.

Related

Sum values in second column in a text file

I have a text file with the following format:
1 4730 1031782 init
4 0 6 events
2190 450 0 top
21413 5928 1 sshd
22355 1970 2009 find
I need to sum the second column of data and return that value to the user.
My current attempt reads the data from the text file like so:
sub read_file{
my $data_failed = 1;
my $file = 'task_file';
if (-z $file){print "No tasks found\n";}
if(open (INFILE, "task_file" || die "$!\n")){
my #COLUMNS = qw( memory cpu program );
my %sort_strings = ( program => sub { $a cmp $b } );
my (%process_details, %sort);
while (<INFILE>) {
$data_failed = 0;
my ($process_id, $memory_size, $cpu_time, $program_name) = split;
$process_details{$process_id} = { memory => $memory_size,
cpu => $cpu_time,
program => $program_name };
undef $sort{memory}{$memory_size}{$process_id};
undef $sort{cpu}{$cpu_time}{$process_id};
undef $sort{program}{$program_name}{$process_id};
}
for my $column ($COLUMNS[2]) {
my $cmp = $sort_strings{$column} || sub { $a <=> $b };
for my $value (sort $cmp keys %{ $sort{$column} }
) {
my #pids = keys %{ $sort{$column}{$value} };
say join ' ', $_, #{ $process_details{$_} }{#COLUMNS}
for #pids;
}
}
if($total_memory_size == 1){
sum_memory_size();
}
} else { print "No tasks found"}
}
And my attempt to sum the values:
sub sum_memory_size{
my #do_sum = qw(memory_size);
my $column;
my $value;
for $column (keys %{ $sort{$column}{$value} }) {
my $found;
my $sum = 0;
for $value (#do_sum) {
next unless exists $sort{$column}{$value};
$sum += $hash{$column}{$value};
delete $hash{$column}{$value};
$found = 1;
}
$sort{$column}{somename} = $sum if $found;
}
print $sum;
}
Use sum form List::Util:
use List::Util qw{ sum };
# ...
my $memory_sum = sum(map $_->{memory}, values %process_details);
The question describes a simple task, which appears to be part of a larger problem. One answer to the simple task is:
perl -ne '#fields=split; $sum+=$fields[1]; END { print "$sum\n" }' foo.txt
For each line (-ne), split it and add the second value ($fields[1]) to a running $sum. Then, after all lines have been processed (END), print the sum. I get 13078 from the program, which is also what I get from a calculator on the input above :) (4730 + 0 + 450 + 5928 + 1970).

Best data structure for a sparse ordered 2D array of floats allowing interpolation (perl)

The data are stock options. I want to make a 2D array based on days till expiration (int) & normalized distance out of the money (float), with the values being a list of normalized bid and ask prices. If the desired element is not in the array, I want to be able to interpolate between the nearest elements present.
I see 3 possible data structures:
A sparse 2D array, maybe 10000 elements, maybe 1/3 full.
A 2D linked list, ie: 4 listpointers for each data element (so 3000 elements becomes 15000)
A 2D hash (maybe 3000 elements), with 2 sorted lists of the keys (maybe 100 elements each) in each dimension.
The main problem is efficient retrieval when interpolation is required.
Retrieval of existing elements is relatively straight-forward with any method.
I'm currently using choice 3, but retrieval is a bit of a kloodge, since I have to scan along the keylists of each dimension till I find occupied elements, and then do a 2- or 4-way interpolation.
I use moreUtils::firstindx($_ > $desiredKey) to find the keys. The linked lists (choice 2) would spare me the search of the keylist arrays.
Choice 1 would also require scanning, wouldn't need the initial step of keylist lookup, but might need to look at more empty cells. And insertion would be a real hassle.
I would be doing many more searches than insertions.
Does any one have any suggestions for the most efficient data structure.
Since you predominantly perform lookups by lifespan and lookups by distance, and few inserts, I'd use sorted arrays to lookup the records by binary search.
Locating an existing element: O(log N)
Locating the box of a missing element: O(log N)
Inserting: O(N)
Given,
my #data = (
[ $lifespan0, $distance0, $bid0, $ask0 ],
[ $lifespan1, $distance1, $bid1, $ask1 ],
...
);
my $lifespan_search_cmp = sub { $a <=> $data[$b][0] };
my $distance_search_cmp = sub { $a <=> $data[$b][1] };
First, create indexes:
my #by_lifespan = sort { $data[$a][0] <=> $data[$b][0] } 0..$#data;
my #by_distance = sort { $data[$a][1] <=> $data[$b][1] } 0..$#data;
To lookup:
my $i = binsearch_first \&$lifespan_search_cmp, $lifespan, #by_lifespan;
my $j = binsearch_first \&$distance_search_cmp, $distance, #by_distance;
my #lifespan_matching_idxs = get_run_forward \&$lifespan_search_cmp, $lifespan, $i, #by_lifespan;
my #distance_matching_idxs = get_run_forward \&$distance_search_cmp, $distance, $j, #by_distance;
my #cross_match_idxs = do {
my %lifespan_matching_idxs = map { $_ => 1 } #lifespan_matching_idxs;
grep { $lifespan_matching_idxs{$_} }
#distance_matching_idxs
};
if (#cross_match_idxs) {
# Exact match(es) found.
...
} else {
my $lifespan_lowerbracket;
my $lifespan_upperbracket;
if ($i >= 0) {
$lifespan_lowerbracket = $lifespan;
$lifespan_upperbracket = $lifespan;
} else {
die "Can't interpolate" if ~$i == 0 || ~$i >= #by_lifespan;
$lifespan_lowerbracket = $data[~$i ][0];
$lifespan_lowerbracket = $data[~$i - 1][0];
}
my $distance_lowerbracket;
my $distance_upperbracket;
if ($i >= 0) {
$distance_lowerbracket = $distance;
$distance_upperbracket = $distance;
} else {
die "Can't interpolate" if ~$j == 0 || ~$j >= #by_distance;
$distance_lowerbracket = $data[~$j ][1];
$distance_upperbracket = $data[~$j - 1][1];
}
...
}
To insert:
my $i = binsearch_first \&$lifespan_search_cmp, $lifespan, #by_lifespan;
my $j = binsearch_first \&$distance_search_cmp, $distance, #by_distance;
push #data, [ $lifespan, $distance , $bid, $ask ];
splice(#by_lifespan, $i >= 0 ? $i : ~$i, 0, $#data);
splice(#by_distance, $j >= 0 ? $j : ~$j, 0, $#data);
Subs:
sub binsearch_first(&$\#) {
my $compare = $_[0];
#my $value = $_[1];
my $array = $_[2];
my $min = 0;
my $max = $#$array;
return -1 if $max == -1;
my $ap = do { no strict 'refs'; \*{caller().'::a'} }; local *$ap;
my $bp = do { no strict 'refs'; \*{caller().'::b'} }; local *$bp;
*$ap = \($_[1]);
while ($min <= $max) {
my $mid = int(($min+$max)/2);
*$bp = \($array->[$mid]);
my $cmp = $compare->();
if ($cmp < 0) {
$max = $mid - 1;
}
elsif ($cmp > 0) {
$min = $mid + 1;
}
else {
return $mid if $mid == $min;
$max = $mid;
}
}
# Converts unsigned int to signed int.
return unpack('j', pack('J', ~$min));
}
sub get_run_forward(&$\#) {
my $compare = $_[0];
#my $value = $_[1];
my $start = $_[2];
my $array = $_[3];
return if $start < 0;
my $ap = do { no strict 'refs'; \*{caller().'::a'} }; local *$ap;
my $bp = do { no strict 'refs'; \*{caller().'::b'} }; local *$bp;
*$ap = \($_[1]);
my $i = $start;
while ($i <= $#$array) {
*$bp = \($array->[$i]);
my $cmp = $compare->()
and last;
++$i;
}
return wantarray ? ($start..$i-1) : $i-1;
}
You might want to use a tolerance in the floating-point comparions (i.e. in $distance_search_cmp).

perl blowing up in sequence alignment by dynamic programming

I'm comparing a reference sequence of size 5500 bases and query sequence of size 3600, using dynamic programming (semi global alignment), in fact I don't know much about complexity and performance and the code is blowing up and giving me the error "out of memory". Knowing that it works normally on smaller sequences, my question is: This behavior is normal or I might have another problem in code ?if it's normal any hint to solve this problem ? Thanks in advance.
sub semiGlobal {
my ( $seq1, $seq2,$MATCH,$MISMATCH,$GAP ) = #_;
# initialization: first row to 0 ;
my #matrix;
$matrix[0][0]{score} = 0;
$matrix[0][0]{pointer} = "none";
for ( my $j = 1 ; $j <= length($seq1) ; $j++ ) {
$matrix[0][$j]{score} = 0;
$matrix[0][$j]{pointer} = "none";
}
for ( my $i = 1 ; $i <= length($seq2) ; $i++ ) {
$matrix[$i][0]{score} = $GAP * $i;
$matrix[$i][0]{pointer} = "up";
}
# fill
my $max_i = 0;
my $max_j = 0;
my $max_score = 0;
print "seq2: ".length($seq2);
print "seq1: ".length($seq1);
for ( my $i = 1 ; $i <= length($seq2) ; $i++ ) {
for ( my $j = 1 ; $j <= length($seq1) ; $j++ ) {
my ( $diagonal_score, $left_score, $up_score );
# calculate match score
my $letter1 = substr( $seq1, $j - 1, 1 );
my $letter2 = substr( $seq2, $i - 1, 1 );
if ( $letter1 eq $letter2 ) {
$diagonal_score = $matrix[ $i - 1 ][ $j - 1 ]{score} + $MATCH;
}
else {
$diagonal_score = $matrix[ $i - 1 ][ $j - 1 ]{score} + $MISMATCH;
}
# calculate gap scores
$up_score = $matrix[ $i - 1 ][$j]{score} + $GAP;
$left_score = $matrix[$i][ $j - 1 ]{score} + $GAP;
# choose best score
if ( $diagonal_score >= $up_score ) {
if ( $diagonal_score >= $left_score ) {
$matrix[$i][$j]{score} = $diagonal_score;
$matrix[$i][$j]{pointer} = "diagonal";
}
else {
$matrix[$i][$j]{score} = $left_score;
$matrix[$i][$j]{pointer} = "left";
}
}
else {
if ( $up_score >= $left_score ) {
$matrix[$i][$j]{score} = $up_score;
$matrix[$i][$j]{pointer} = "up";
}
else {
$matrix[$i][$j]{score} = $left_score;
$matrix[$i][$j]{pointer} = "left";
}
}
# set maximum score
if ( $matrix[$i][$j]{score} > $max_score ) {
$max_i = $i;
$max_j = $j;
$max_score = $matrix[$i][$j]{score};
}
}
}
my $align1 = "";
my $align2 = "";
my $j = $max_j;
my $i = $max_i;
while (1) {
if ( $matrix[$i][$j]{pointer} eq "none" ) {
$stseq1 = $j;
last;
}
if ( $matrix[$i][$j]{pointer} eq "diagonal" ) {
$align1 .= substr( $seq1, $j - 1, 1 );
$align2 .= substr( $seq2, $i - 1, 1 );
$i--;
$j--;
}
elsif ( $matrix[$i][$j]{pointer} eq "left" ) {
$align1 .= substr( $seq1, $j - 1, 1 );
$align2 .= "-";
$j--;
}
elsif ( $matrix[$i][$j]{pointer} eq "up" ) {
$align1 .= "-";
$align2 .= substr( $seq2, $i - 1, 1 );
$i--;
}
}
$align1 = reverse $align1;
$align2 = reverse $align2;
return ( $align1, $align2, $stseq1 ,$max_j);
}
One way to possibly solve the problem is to tie the #matrix with a file. However, this will dramatically slow down the program. Consider this:
sub semiGlobal {
use Tie::Array::CSV;
tie my #matrix, 'Tie::Array::CSV', 'temp.txt'; # Don't forget to add your own error handler.
my ( $seq1, $seq2,$MATCH,$MISMATCH,$GAP ) = #_;
# initialization: first row to 0 ;
$matrix[0][0] = '0 n';
for ( my $j = 1 ; $j <= length($seq1) ; $j++ ) {
$matrix[0][$j] = '0 n';
}
for ( my $i = 1 ; $i <= length($seq2) ; $i++ ) {
my $score = $GAP * $i;
$matrix[$i][0] = join ' ',$score,'u';
}
#print Dumper(\#matrix);
# fill
my $max_i = 0;
my $max_j = 0;
my $max_score = 0;
print "seq2: ".length($seq2)."\n";
print "seq1: ".length($seq1)."\n";
for ( my $i = 1 ; $i <= length($seq2) ; $i++ ) {
for ( my $j = 1 ; $j <= length($seq1) ; $j++ ) {
my ( $diagonal_score, $left_score, $up_score );
# calculate match score
my $letter1 = substr( $seq1, $j - 1, 1 );
my $letter2 = substr( $seq2, $i - 1, 1 );
my $score = (split / /, $matrix[ $i - 1 ][ $j - 1 ])[0];
if ( $letter1 eq $letter2 ) {
$diagonal_score = $score + $MATCH;
}
else {
$diagonal_score = $score + $MISMATCH;
}
# calculate gap scores
$up_score = (split / /,$matrix[ $i - 1 ][$j])[0] + $GAP;
$left_score = (split / /,$matrix[$i][ $j - 1 ])[0] + $GAP;
# choose best score
if ( $diagonal_score >= $up_score ) {
if ( $diagonal_score >= $left_score ) {
$matrix[$i][$j] = join ' ',$diagonal_score,'d';
}
else {
$matrix[$i][$j] = join ' ', $left_score, 'l';
}
}
else {
if ( $up_score >= $left_score ) {
$matrix[$i][$j] = join ' ', $up_score, 'u';
}
else {
$matrix[$i][$j] = join ' ', $left_score, 'l';
}
}
# set maximum score
if ( (split / /, $matrix[$i][$j])[0] > $max_score ) {
$max_i = $i;
$max_j = $j;
$max_score = (split / /, $matrix[$i][$j])[0];
}
}
}
my $align1 = "";
my $align2 = "";
my $stseq1;
my $j = $max_j;
my $i = $max_i;
while (1) {
my $pointer = (split / /, $matrix[$i][$j])[1];
if ( $pointer eq "n" ) {
$stseq1 = $j;
last;
}
if ( $pointer eq "d" ) {
$align1 .= substr( $seq1, $j - 1, 1 );
$align2 .= substr( $seq2, $i - 1, 1 );
$i--;
$j--;
}
elsif ( $pointer eq "l" ) {
$align1 .= substr( $seq1, $j - 1, 1 );
$align2 .= "-";
$j--;
}
elsif ( $pointer eq "u" ) {
$align1 .= "-";
$align2 .= substr( $seq2, $i - 1, 1 );
$i--;
}
}
$align1 = reverse $align1;
$align2 = reverse $align2;
untie #matrix; # Don't forget to add your own error handler.
unlink 'temp.txt'; # Don't forget to add your own error handler.
return ( $align1, $align2, $stseq1 ,$max_j);
}
You can still use your original sub for short sequences, and switch to this sub for long ones.
I think that #j_random_hacker and #Ashalynd are on the right track regarding using this algorithm in most Perl implementations. The datatypes you're using are going to use more memory that absolutely needed for the calculations.
So this is "normal" in that you should expect to see this kind of memory usage for how you've written this algorithm in perl. You may have other problems in surrounding code that are using a lot of memory but this algorithm will hit your memory hard with large sequences.
You can address some of the memory issues by changing the datatypes that you're using as #Ashalynd suggests. You could try changing the hash which holds score and pointer into an array and changing the string pointers into integer values. Something like this might get you some benefit while still maintaining readability:
use strict;
use warnings;
# define constants for array positions and pointer values
# so the code is still readable.
# (If you have the "Readonly" CPAN module you may want to use it for constants
# instead although none of the downsides of the "constant" pragma apply in this code.)
use constant {
SCORE => 0,
POINTER => 1,
DIAGONAL => 0,
LEFT => 1,
UP => 2,
NONE => 3,
};
...
sub semiGlobal2 {
my ( $seq1, $seq2,$MATCH,$MISMATCH,$GAP ) = #_;
# initialization: first row to 0 ;
my #matrix;
# score and pointer are now stored in an array
# using the defined constants as indices
$matrix[0][0][SCORE] = 0;
# pointer value is now a constant integer
$matrix[0][0][POINTER] = NONE;
for ( my $j = 1 ; $j <= length($seq1) ; $j++ ) {
$matrix[0][$j][SCORE] = 0;
$matrix[0][$j][POINTER] = NONE;
}
for ( my $i = 1 ; $i <= length($seq2) ; $i++ ) {
$matrix[$i][0][SCORE] = $GAP * $i;
$matrix[$i][0][POINTER] = UP;
}
... # continue to make the appropriate changes throughout the code
However, when I tested this I didn't get a huge benefit when attempting to align a 3600 char string in a 5500 char string of random data. I programmed my code to abort when it consumed more than 2GB of memory. The original code aborted after 23 seconds while the one using constants and an array instead of a hash aborted after 32 seconds.
If you really want to use this specific algorithm I'd check out the performance of Algorithm::NeedlemanWunsch. It doesn't look like it's very mature but it may have addressed your performance issues. Otherwise look into writing an Inline or Perl XS wrapper around a C implementation

Trying to Develop PostFix Notation in Tree Using Perl

I'm using Perl to run through a tree, and then calculate the leaf nodes of the tree using the internal nodes as operators. I want to be able to print this in a postfix manner, and I managed to this this fairly easily with the basic operands (simply call the left and right nodes respectively before calling the parent) but I am having trouble producing the desired output for an average function. I don't have any trouble printing the actual result of the calculation, but I want to be able to print the operators and operands in postfix notation.
For example, 1 + average(3, 4, 5) will be shown as 1 ; 3 4 5 average +.
Here is my code:
use strict;
use warnings;
use Data::Dumper;
$Data::Dumper::Indent = 0;
$Data::Dumper::Terse = 1;
my $debug = 0;
# an arithmetic expression tree is a reference to a list, which can
# be of two kinds as follows:
# [ 'leaf', value ]
# [ 'internal', operation, leftarg, rightarg ]
# Evaluate($ex) takes an arithmetic expression tree and returns its
# evaluated value.
sub Evaluate {
my ($ex) = #_;
$debug and
print "evaluating: ", Dumper($ex), "\n";
# the kind of node is given in the first element of the array
my $node_type = $ex->[0];
# if a leaf, then the value is a number
if ( $node_type eq 'leaf' ) {
# base case
my $value = $ex->[1];
$debug and
print "returning leaf: $value\n";
return $value;
}
# if not a leaf, then is internal,
if ( $node_type ne 'internal' ) {
die "Eval: Strange node type '$node_type' when evaluating tree";
}
# should now have an operation and two arguments
my $operation = $ex->[1];
my $left_ex = $ex->[2];
my $right_ex = $ex->[3];
# evaluate the left and right arguments;
my $left_value = Evaluate($left_ex);
my $right_value = Evaluate($right_ex);
# if any arguments are undefined, our value is undefined.
return undef unless
defined($left_value) and defined($right_value);
my $result;
# or do it explicitly for the required operators ...
if ($operation eq 'average') {
$result = ($left_value + $right_value) / 2;
}
if ($operation eq '+') {
$result = $left_value + $right_value;
} elsif ($operation eq '-') {
$result = $left_value - $right_value;
} elsif ($operation eq '*') {
$result = $left_value * $right_value;
} elsif ($operation eq 'div') {
if ($right_value != 0 ) {
$result = int ($left_value / $right_value);
} else {
$result = undef;
}
} elsif ($operation eq 'mod') {
$result = $left_value % $right_value;
} elsif ($operation eq '/') {
if ( $right_value != 0 ) {
$result = $left_value / $right_value;
}
else {
$result = undef;
}
}
$debug and
print "returning '$operation' on $left_value and $right_value result: $result\n";
return $result;
}
# Display($ex, $style) takes an arithmetic expression tree and a style
# parameter ('infix' or 'postfix') and returns a string that represents
# printable form of the expression in the given style.
sub Display {
my ($ex, $style) = #_;
# the kind of node is given in the first element of the array
my $node_type = $ex->[0];
# if a leaf, then the value is a number
if ( $node_type eq 'leaf' ) {
# base case
my $value = $ex->[1];
return $value;
}
# if not a leaf, then is internal,
if ( $node_type ne 'internal' ) {
die "Display: Strange node type '$node_type' when evaluating tree";
}
# should now have an operation and two arguments
my $operation = $ex->[1];
my $left_ex = $ex->[2];
my $right_ex = $ex->[3];
# evaluate the left and right arguments;
my $left_value = Display($left_ex, $style);
my $right_value = Display($right_ex, $style);
my $result;
if ($operation ne 'average') {
$result = "($left_value $operation $right_value) \n $left_value $right_value $operation";
} else {
$result = "($left_value $operation $right_value) \n $left_value $right_value $operation";
}
return $result;
}
# module end;
1;
And here is a test:
use strict;
use warnings;
use Display;
use arith;
my $ex1 = [ 'leaf', 42];
my $ex2 = [ 'internal', '+', [ 'leaf', 42], [ 'leaf', 10 ] ];
my $ex3 = [ 'internal', 'average', $ex2, [ 'leaf', 1 ] ];
print "ex1 is ", Evaluate($ex1), "\n";
print "ex1: ", Display($ex1), "\n";
print "\n";
print "ex2 is ", Evaluate($ex2), "\n";
print "ex2: ", Display($ex2), "\n";
print "\n";
print "ex3 is ", Evaluate($ex3), "\n";
print "ex3: ", Display($ex3), "\n";
print "\n";
Display::Render(\$ex3);
In order to do this, I realize I will have to change the subroutine "Display", but I'm not sure how to get the output --> value value ; #to indicate values that aren't averaged# value value average operand etc.
Any ideas?
I am not 100% sure that I understand your problem, but here is a cleanup / improvement of your two functions:
my %ops = ( # dispatch table for operations
average => sub {my $acc; $acc += $_ for #_; $acc / #_},
'+' => sub {$_[0] + $_[1]},
'-' => sub {$_[0] - $_[1]},
'*' => sub {$_[0] * $_[1]},
'mod' => sub {$_[0] % $_[1]},
(map {$_ => sub {$_[1] ? $_[0] / $_[1] : undef}} qw (/ div)),
);
sub Evaluate {
my $ex = shift;
print "evaluating: ", Dumper($ex), "\n" if $debug;
my $node_type = $ex->[0];
if ( $node_type eq 'leaf' ) {
print "returning leaf: $$ex[1]\n" if $debug;
return $$ex[1];
}
elsif ( $node_type ne 'internal' ) {
die "Eval: Strange node type '$node_type' when evaluating tree";
}
my $operation = $ex->[1];
my #values = map {Evaluate($_)} #$ex[2 .. $#$ex];
defined or return for #values;
if (my $op = $ops{$operation}) {
return $op->(#values);
} else {
print "operation $operation not found\n";
return undef;
}
}
Here the large if/elsif block is replaced with a dispatch table. This allows you to separate the logic from the parser. I have also replaced the $left_value and $right_value variables with the #values array, allowing your code to scale to n-arity operations (like average).
The following Display function has also been updated to handle n-arity operations:
my %is_infix = map {$_ => 1} qw( * + / - );
sub Display {
my ($ex, $style) = #_;
my $node_type = $ex->[0];
# if a leaf, then the value is a number
if ( $node_type eq 'leaf' ) {
return $$ex[1];
}
# if not a leaf, then is internal,
if ( $node_type ne 'internal' ) {
die "Display: Strange node type '$node_type' when evaluating tree";
}
# should now have an operation and n arguments
my $operation = $ex->[1];
if ($style and $style eq 'infix') {
my #values = map {Display($_, $style)} #$ex[2 .. $#$ex];
if ($is_infix{$operation}) {
return "$values[0] $operation $values[1]"
} else {
local $" = ', '; # "
return "$operation( #values )"
}
} else { # postfix by default
my #out;
for (#$ex[2 .. $#$ex]) {
if (#out and $_->[0] eq 'internal') {
push #out, ';'
}
push #out, Display($_, $style)
}
return join ' ' => #out, $operation;
}
}
You can call Display as Display($tree) or Display($tree, 'postfix') for postfix notation. And Display($tree, 'infix') for the infix notation.
ex1 is 42
ex1: 42
ex1: 42
ex2 is 52
ex2: 42 10 +
ex2: 42 + 10
ex3 is 26.5
ex3: 42 10 + 1 average
ex3: average( 42 + 10, 1 )
Which I believe is what you are looking for.
Finally, using your first example 1 + average(3, 4, 5):
my $avg = ['internal', 'average', [leaf => 3], [leaf => 4], [leaf => 5] ];
my $ex4 = ['internal', '+', [leaf => 1], $avg ];
print "ex4 is ", Evaluate($ex4), "\n";
print "ex4: ", Display($ex4), "\n";
print "ex4: ", Display($ex4, 'infix'), "\n";
print "\n";
which prints:
ex4 is 5
ex4: 1 ; 3 4 5 average +
ex4: 1 + average( 3, 4, 5 )
Maybe try AlgebraicToRPN?

How to get all feature in a range from a GFF3 file in Perl?

I would like to write a Perl function that gets a GFF3 filename and a range (i.e. 100000 .. 2000000). and returns a reference to an array containing all names/accessions of genes found in this range.
I guess using bioperl will make sense, but I have very little experience with it. I can write a script that parses a GFF3 by my self, but if using bioperl (or another packagae) is not too complicated - I'd rather reuse their code.
use Bio::Tools::GFF;
my $range_start = 100000;
my $range_end = 200000;
my #features_in_range = ( );
my $gffio = Bio::Tools::GFF->new(-file => $gff_file, -gff_version => 3);
while (my $feature = $gffio->next_feature()) {
## What about features that are not contained within the coordinate range but
## do overlap it? Such features won't be caught by this check.
if (
($feature->start() >= $range_start)
&&
($feature->end() <= $range_end)
) {
push #features_in_range, $feature;
}
}
$gffio->close();
DISCLAIMER: Naive implementation. I just banged that out, it's had no testing. I won't even guarantee it compiles.
You do want to use BioPerl for this, using possibly the Bio::Tools::GFF module.
You should really ask on the BioPerl mailing list. It's very friendly and the subscribers are very knowledgeable -- they'll definitely be able to help you. And once you do get an answer (and if you don't get one here first), I suggest answering your own question here with the answer so we can all benefit!
The following function takes a hash of targets and ranges and returns a function that will iterate over all targets that overlap any of the ranges. The targets should be a reference to an array of references:
my $targets =
[
[
$start,
$end,
],
...,
]
The ranges should be a reference to an array of hashes:
my $ranges =
[
{
seqname => $seqname,
source => $source,
feature => $feature,
start => $start,
end => $end,
score => $score,
strand => $strand,
frame => $frame,
attribute => $attribute,
},
...,
]
You can, of course, just pass a single target.
my $brs_iterator
= binary_range_search( targets => $targets, ranges => $ranges );
while ( my $gff_line = $brs_iterator->() ) {
# do stuff
}
sub binary_range_search {
my %options = #_;
my $targets = $options{targets} || croak 'Need a targets parameter';
my $ranges = $options{ranges} || croak 'Need a ranges parameter';
my ( $low, $high ) = ( 0, $#{$ranges} );
my #iterators = ();
TARGET:
for my $range (#$targets) {
RANGE_CHECK:
while ( $low <= $high ) {
my $try = int( ( $low + $high ) / 2 );
$low = $try + 1, next RANGE_CHECK
if $ranges->[$try]{end} < $range->[0];
$high = $try - 1, next RANGE_CHECK
if $ranges->[$try]{start} > $range->[1];
my ( $down, $up ) = ($try) x 2;
my %seen = ();
my $brs_iterator = sub {
if ( $ranges->[ $up + 1 ]{end} >= $range->[0]
and $ranges->[ $up + 1 ]{start} <= $range->[1]
and !exists $seen{ $up + 1 } )
{
$seen{ $up + 1 } = undef;
return $ranges->[ ++$up ];
}
elsif ( $ranges->[ $down - 1 ]{end} >= $range->[0]
and $ranges->[ $down - 1 ]{start} <= $range->[1]
and !exists $seen{ $down - 1 }
and $down > 0 )
{
$seen{ $down - 1 } = undef;
return $ranges->[ --$down ];
}
elsif ( !exists $seen{$try} ) {
$seen{$try} = undef;
return $ranges->[$try];
}
else {
return;
}
};
push #iterators, $brs_iterator;
next TARGET;
}
}
# In scalar context return master iterator that iterates over the list of range iterators.
# In list context returns a list of range iterators.
return wantarray
? #iterators
: sub {
while (#iterators) {
if ( my $range = $iterators[0]->() ) {
return $range;
}
shift #iterators;
}
return;
};
}