Perl: getting all increasing and decreasing Strips in an array (use in Bioinformatics) - perl

I'm new to Perl and I'm having trouble designing a certain function.
The function should find and return all increasing and decreasing strips.
What does that mean? Two positions are neighbors if they hold neighboring numbers, e.g. (2,3) or (8,7). An increasing strip is an increasing run of neighbors, e.g. (3,4,5,6). A decreasing strip is defined similarly. A 0 is added at the beginning of every array, and the length of the array + 1 at the end. Single numbers without neighbors are decreasing; 0 and n+1 are increasing.
So if I have the array (0,3,4,5,9,8,6,2,1,7,10), I should get the following results:
Increasing strips are: (3,4,5) (10) (0)
Decreasing strips are: (9,8), (6), (2,1) (7)
I tried to reduce the problem to only getting all decreasing strips, but this is as far as I got: http://pastebin.com/yStbgNme
Code here:
sub getIncs{
my @$bar = shift;
my %incs;
my $inccount = 0;
my $i=0;
while($i<@bar-1){
for($j=$i; 1; $j++;){
if($bar[$j] == $bar[$j+1]+1){
$incs{$inccount} = ($i,$j);
} else {
$inccount++;
last;
}
}
}
//edit1: I found a Python program that contains said function getStrips(), but my Python is shaky at best. http://www.csbio.unc.edu/mcmillan/Media/breakpointReversalSort.txt
//edit2: Every number occurs exactly once in the array, so there can be no overlap.

use strict;
my @s = (0,3,4,5,9,8,6,2,1,7,10);
my $i = 0;
my $j = 0; # size of @s
my $inc = "Increasing: ";
my $dec = "Decreasing: ";
# Prepend the beginning with 0, if necessary
if($s[0] != 0 || @s == 0 ) { unshift @s, 0; }
$j = @s;
foreach(@s) {
# Increasing
if( ($s[$i] == 0) || ($i == $j-1) || ($s[$i+1] - $s[$i]) == 1 || ($s[$i] - $s[$i-1] == 1)) {
if($s[$i] - $s[$i-1] != 1) { $inc .= "("; }
$inc .= $s[$i];
if($s[$i+1] - $s[$i] != 1) { $inc .= ")"; }
if($s[$i+1] - $s[$i] == 1) { $inc .= ","; }
}
#Decreasing
if( ($s[$i]-$s[$i-1] != 1) && ($s[$i+1] - $s[$i] != 1) && ($s[$i] != 0) && ($i != $j-1) ) {
if($s[$i-1] - $s[$i] != 1) { $dec .= "("; }
$dec .= $s[$i];
if($s[$i] - $s[$i+1] != 1) { $dec .= ")"; }
if($s[$i] - $s[$i+1] == 1) { $dec .= ","; }
}
$i++;
}
$inc =~ s/\)\(/\),\(/g;
$dec =~ s/\)\(/\),\(/g;
print "$inc\n";
print "$dec\n";
Result:
Increasing: (0),(3,4,5),(10)
Decreasing: (9,8),(6),(2,1),(7)
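If you need the strips as data rather than as printed strings, the same classification can be sketched as a sub that returns two lists. This is only a sketch: the name get_strips is made up here, and it assumes the array is already framed by 0 and n+1 and that every number occurs exactly once.

```perl
use strict;
use warnings;

# Split a framed permutation into maximal strips.
# Returns two array refs: (increasing strips, decreasing strips).
sub get_strips {
    my @p = @_;
    my ( @inc, @dec );
    my $i = 0;
    while ( $i < @p ) {
        my $j   = $i;
        my $dir = 0;    # +1 rising, -1 falling, 0 for a single number
        if ( $j < $#p && abs( $p[ $j + 1 ] - $p[$j] ) == 1 ) {
            $dir = $p[ $j + 1 ] - $p[$j];
            # Extend the strip while the next element continues the run.
            $j++ while $j < $#p && $p[ $j + 1 ] - $p[$j] == $dir;
        }
        my @strip = @p[ $i .. $j ];
        # Single numbers are decreasing, except the framing 0 and n+1.
        my $is_inc = $dir == 1 || ( $dir == 0 && ( $i == 0 || $i == $#p ) );
        push @{ $is_inc ? \@inc : \@dec }, \@strip;
        $i = $j + 1;
    }
    return ( \@inc, \@dec );
}

my ( $inc, $dec ) = get_strips( 0, 3, 4, 5, 9, 8, 6, 2, 1, 7, 10 );
print "Increasing: ", join( ",", map { "(" . join( ",", @$_ ) . ")" } @$inc ), "\n";
print "Decreasing: ", join( ",", map { "(" . join( ",", @$_ ) . ")" } @$dec ), "\n";
```

For the example array this prints the same two lines as the result above.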

Related

Sum of Primes always returns 0

I have Perl code that takes a number and adds up all the prime numbers up to that number. I keep getting 0, which means my $sum variable is not being updated, but I don't know what else to do.
sub checkPrime {
my($numb) = @_;
$primeCheck = "prime\n";
if ($numb == 1) {
$primeCheck = "notPrime\n";
}
for ($i = 2; $i < $numb; $i++) {
$mod = $numb % $i;
if ($mod == 0) {
$primeCheck = "notPrime\n"
}
}
return $primeCheck;
}
sub sumOfPrimes {
my($input) = @_;
$sum = 0;
for ($i = 2; $i <= $input; $i++) {
if (checkPrime($i) eq "prime") {
$sum = $sum + $i;
}
}
return $sum;
}
print sumOfPrimes(10);
You are not comparing the correct string. You include a newline character (\n) when you set the value, but not when you compare it. Change:
if (checkPrime($i) eq "prime")
to:
if (checkPrime($i) eq "prime\n")
That is the simplest change, but you probably don't need to have \n in there at all.
To sum prime numbers you need to identify if the number is a prime number. Let's create a function which returns 1 if the number is prime and 0 otherwise.
sub isPrime {
my $n = shift;
return 0 unless $n > 1;
for( my $i = 2; $i < $n; $i++ ) {
return 0 if $n % $i == 0;
}
return 1;
}
Now go through the list of numbers and sum only those that are prime:
$sum += $num if isPrime($num);
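Putting the two pieces together, a complete version might look like the sketch below. It reuses isPrime as written in this answer and the name sumOfPrimes from the question; the lexical loop variables are a choice made here so that the two subs cannot interfere with each other.

```perl
use strict;
use warnings;

# Returns 1 if $n is prime, 0 otherwise (plain trial division).
sub isPrime {
    my $n = shift;
    return 0 unless $n > 1;
    for ( my $i = 2; $i < $n; $i++ ) {
        return 0 if $n % $i == 0;
    }
    return 1;
}

# Sum every prime in 2..$limit.
sub sumOfPrimes {
    my ($limit) = @_;
    my $sum = 0;
    for my $num ( 2 .. $limit ) {
        $sum += $num if isPrime($num);
    }
    return $sum;
}

print sumOfPrimes(10), "\n";    # 2 + 3 + 5 + 7 = 17
```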

Best data structure for a sparse ordered 2D array of floats allowing interpolation (perl)

The data are stock options. I want to make a 2D array based on days till expiration (int) & normalized distance out of the money (float), with the values being a list of normalized bid and ask prices. If the desired element is not in the array, I want to be able to interpolate between the nearest elements present.
I see 3 possible data structures:
A sparse 2D array, maybe 10000 elements, maybe 1/3 full.
A 2D linked list, ie: 4 listpointers for each data element (so 3000 elements becomes 15000)
A 2D hash (maybe 3000 elements), with 2 sorted lists of the keys (maybe 100 elements each) in each dimension.
The main problem is efficient retrieval when interpolation is required.
Retrieval of existing elements is relatively straight-forward with any method.
I'm currently using choice 3, but retrieval is a bit of a kludge, since I have to scan along the key lists of each dimension until I find occupied elements, and then do a 2- or 4-way interpolation.
I use List::MoreUtils::firstidx { $_ > $desiredKey } on the key lists to find the keys. The linked lists (choice 2) would spare me the search of the key-list arrays.
Choice 1 would also require scanning; it wouldn't need the initial key-list lookup, but might need to look at more empty cells. And insertion would be a real hassle.
I would be doing many more searches than insertions.
Does anyone have any suggestions for the most efficient data structure?
Since you predominantly perform lookups by lifespan and lookups by distance, and few inserts, I'd use sorted arrays to lookup the records by binary search.
Locating an existing element: O(log N)
Locating the box of a missing element: O(log N)
Inserting: O(N)
Given,
my @data = (
[ $lifespan0, $distance0, $bid0, $ask0 ],
[ $lifespan1, $distance1, $bid1, $ask1 ],
...
);
my $lifespan_search_cmp = sub { $a <=> $data[$b][0] };
my $distance_search_cmp = sub { $a <=> $data[$b][1] };
First, create indexes:
my @by_lifespan = sort { $data[$a][0] <=> $data[$b][0] } 0..$#data;
my @by_distance = sort { $data[$a][1] <=> $data[$b][1] } 0..$#data;
To look up:
my $i = binsearch_first \&$lifespan_search_cmp, $lifespan, @by_lifespan;
my $j = binsearch_first \&$distance_search_cmp, $distance, @by_distance;
my @lifespan_matching_idxs = get_run_forward \&$lifespan_search_cmp, $lifespan, $i, @by_lifespan;
my @distance_matching_idxs = get_run_forward \&$distance_search_cmp, $distance, $j, @by_distance;
my @cross_match_idxs = do {
my %lifespan_matching_idxs = map { $_ => 1 } @lifespan_matching_idxs;
grep { $lifespan_matching_idxs{$_} }
@distance_matching_idxs
};
if (@cross_match_idxs) {
# Exact match(es) found.
...
} else {
my $lifespan_lowerbracket;
my $lifespan_upperbracket;
if ($i >= 0) {
$lifespan_lowerbracket = $lifespan;
$lifespan_upperbracket = $lifespan;
} else {
# ~$i is the insertion point: the first index entry greater than $lifespan.
die "Can't interpolate" if ~$i == 0 || ~$i >= @by_lifespan;
$lifespan_lowerbracket = $data[ $by_lifespan[~$i - 1] ][0];
$lifespan_upperbracket = $data[ $by_lifespan[~$i] ][0];
}
my $distance_lowerbracket;
my $distance_upperbracket;
if ($j >= 0) {
$distance_lowerbracket = $distance;
$distance_upperbracket = $distance;
} else {
die "Can't interpolate" if ~$j == 0 || ~$j >= @by_distance;
$distance_lowerbracket = $data[ $by_distance[~$j - 1] ][1];
$distance_upperbracket = $data[ $by_distance[~$j] ][1];
}
...
}
To insert:
my $i = binsearch_first \&$lifespan_search_cmp, $lifespan, @by_lifespan;
my $j = binsearch_first \&$distance_search_cmp, $distance, @by_distance;
push @data, [ $lifespan, $distance , $bid, $ask ];
splice(@by_lifespan, $i >= 0 ? $i : ~$i, 0, $#data);
splice(@by_distance, $j >= 0 ? $j : ~$j, 0, $#data);
Subs:
sub binsearch_first(&$\@) {
my $compare = $_[0];
#my $value = $_[1];
my $array = $_[2];
my $min = 0;
my $max = $#$array;
return -1 if $max == -1;
my $ap = do { no strict 'refs'; \*{caller().'::a'} }; local *$ap;
my $bp = do { no strict 'refs'; \*{caller().'::b'} }; local *$bp;
*$ap = \($_[1]);
while ($min <= $max) {
my $mid = int(($min+$max)/2);
*$bp = \($array->[$mid]);
my $cmp = $compare->();
if ($cmp < 0) {
$max = $mid - 1;
}
elsif ($cmp > 0) {
$min = $mid + 1;
}
else {
return $mid if $mid == $min;
$max = $mid;
}
}
# Converts unsigned int to signed int.
return unpack('j', pack('J', ~$min));
}
sub get_run_forward(&$$\@) { # four arguments: comparator, value, start index, array
my $compare = $_[0];
#my $value = $_[1];
my $start = $_[2];
my $array = $_[3];
return if $start < 0;
my $ap = do { no strict 'refs'; \*{caller().'::a'} }; local *$ap;
my $bp = do { no strict 'refs'; \*{caller().'::b'} }; local *$bp;
*$ap = \($_[1]);
my $i = $start;
while ($i <= $#$array) {
*$bp = \($array->[$i]);
my $cmp = $compare->()
and last;
++$i;
}
return wantarray ? ($start..$i-1) : $i-1;
}
You might want to use a tolerance in the floating-point comparisons (i.e. in $distance_search_cmp).
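As a rough illustration of such a tolerance, a three-way comparison can treat values within a small epsilon as equal. The name cmp_tolerant and the value of $EPSILON are assumptions for this sketch, not part of the code above; pick an epsilon that suits the scale of your normalized distances.

```perl
use strict;
use warnings;

my $EPSILON = 1e-9;    # assumed tolerance; tune it to the scale of your data

# Three-way comparison that treats values within $EPSILON as equal.
sub cmp_tolerant {
    my ( $x, $y ) = @_;
    return 0 if abs( $x - $y ) <= $EPSILON;
    return $x <=> $y;
}

print cmp_tolerant( 0.1 + 0.2, 0.3 ), "\n";    # 0, whereas (0.1 + 0.2) <=> 0.3 gives 1
```

With this in place, the distance comparator could plausibly become sub { cmp_tolerant($a, $data[$b][1]) }.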

Perl: Finding out if a given number is a prime number

I am trying to write a subroutine that determines whether or not the number passed in is prime, and it's not working correctly. The numbers I'm passing in should not be identified as prime. Is there a logic error, or something about Perl that I'm missing?
sub isPrime {
my ( $n ) = @_;
for ( my $i = 3 ; $i < $n ; $i++ ) {
if ( $n % $i == 0 ) {
return 0;
}
else {
return 1;
}
}
}
At the moment your function only checks whether n is divisible by 3, because it returns immediately after the first test.
Make the function return 0 inside the for loop and return 1 after it, or set a flag for the number being prime that is initially true and return its value after the loop.
You should also start your for loop at 2, not at 3; otherwise you aren't testing even numbers.
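A minimal sketch of that advice follows. The name is_prime is chosen here to avoid clashing with the sub in the question, and stopping at the square root is an extra optimization that the advice above does not require.

```perl
use strict;
use warnings;

# Returns 1 if $n is prime, 0 otherwise.
sub is_prime {
    my ($n) = @_;
    return 0 if $n < 2;    # 0, 1 and negative numbers are not prime
    # Start at 2; a divisor larger than sqrt($n) implies one smaller than it.
    for ( my $i = 2; $i * $i <= $n; $i++ ) {
        return 0 if $n % $i == 0;    # found a divisor: not prime
    }
    return 1;    # only reached when no divisor was found
}

print is_prime($_) ? "$_ is prime\n" : "$_ is not prime\n" for 1, 2, 9, 97;
```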
Here is my code I wrote in about 40 minutes. Don't hate if it is inefficient, I am still learning perl.
print ("This is a prime number checker!\n");
print ("Enter a number below to check it:\n");
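$y = 0;
chomp($num = <>); # drop the trailing newline so the result prints on one line
for ($i = $num; $i > 0; $i--) {
if ($num % $i == 0) {
$y += 1;
}
}
# A prime has exactly two divisors (1 and itself), so 1 is not prime.
if ($y != 2) {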
print ("$num is not a prime!");
} else {
print ("$num is a prime!");
}

perl blowing up in sequence alignment by dynamic programming

I'm comparing a reference sequence of 5500 bases with a query sequence of 3600 bases using dynamic programming (semi-global alignment). I don't know much about complexity and performance, and the code blows up with the error "out of memory". It works normally on smaller sequences, so my question is: is this behavior normal, or might I have another problem in the code? If it's normal, any hint on how to solve it? Thanks in advance.
sub semiGlobal {
my ( $seq1, $seq2,$MATCH,$MISMATCH,$GAP ) = @_;
# initialization: first row to 0 ;
my @matrix;
$matrix[0][0]{score} = 0;
$matrix[0][0]{pointer} = "none";
for ( my $j = 1 ; $j <= length($seq1) ; $j++ ) {
$matrix[0][$j]{score} = 0;
$matrix[0][$j]{pointer} = "none";
}
for ( my $i = 1 ; $i <= length($seq2) ; $i++ ) {
$matrix[$i][0]{score} = $GAP * $i;
$matrix[$i][0]{pointer} = "up";
}
# fill
my $max_i = 0;
my $max_j = 0;
my $max_score = 0;
print "seq2: ".length($seq2);
print "seq1: ".length($seq1);
for ( my $i = 1 ; $i <= length($seq2) ; $i++ ) {
for ( my $j = 1 ; $j <= length($seq1) ; $j++ ) {
my ( $diagonal_score, $left_score, $up_score );
# calculate match score
my $letter1 = substr( $seq1, $j - 1, 1 );
my $letter2 = substr( $seq2, $i - 1, 1 );
if ( $letter1 eq $letter2 ) {
$diagonal_score = $matrix[ $i - 1 ][ $j - 1 ]{score} + $MATCH;
}
else {
$diagonal_score = $matrix[ $i - 1 ][ $j - 1 ]{score} + $MISMATCH;
}
# calculate gap scores
$up_score = $matrix[ $i - 1 ][$j]{score} + $GAP;
$left_score = $matrix[$i][ $j - 1 ]{score} + $GAP;
# choose best score
if ( $diagonal_score >= $up_score ) {
if ( $diagonal_score >= $left_score ) {
$matrix[$i][$j]{score} = $diagonal_score;
$matrix[$i][$j]{pointer} = "diagonal";
}
else {
$matrix[$i][$j]{score} = $left_score;
$matrix[$i][$j]{pointer} = "left";
}
}
else {
if ( $up_score >= $left_score ) {
$matrix[$i][$j]{score} = $up_score;
$matrix[$i][$j]{pointer} = "up";
}
else {
$matrix[$i][$j]{score} = $left_score;
$matrix[$i][$j]{pointer} = "left";
}
}
# set maximum score
if ( $matrix[$i][$j]{score} > $max_score ) {
$max_i = $i;
$max_j = $j;
$max_score = $matrix[$i][$j]{score};
}
}
}
my $align1 = "";
my $align2 = "";
my $j = $max_j;
my $i = $max_i;
while (1) {
if ( $matrix[$i][$j]{pointer} eq "none" ) {
$stseq1 = $j;
last;
}
if ( $matrix[$i][$j]{pointer} eq "diagonal" ) {
$align1 .= substr( $seq1, $j - 1, 1 );
$align2 .= substr( $seq2, $i - 1, 1 );
$i--;
$j--;
}
elsif ( $matrix[$i][$j]{pointer} eq "left" ) {
$align1 .= substr( $seq1, $j - 1, 1 );
$align2 .= "-";
$j--;
}
elsif ( $matrix[$i][$j]{pointer} eq "up" ) {
$align1 .= "-";
$align2 .= substr( $seq2, $i - 1, 1 );
$i--;
}
}
$align1 = reverse $align1;
$align2 = reverse $align2;
return ( $align1, $align2, $stseq1 ,$max_j);
}
One possible way to solve the problem is to tie @matrix to a file. However, this will dramatically slow down the program. Consider this:
sub semiGlobal {
use Tie::Array::CSV;
tie my #matrix, 'Tie::Array::CSV', 'temp.txt'; # Don't forget to add your own error handler.
my ( $seq1, $seq2,$MATCH,$MISMATCH,$GAP ) = @_;
# initialization: first row to 0 ;
$matrix[0][0] = '0 n';
for ( my $j = 1 ; $j <= length($seq1) ; $j++ ) {
$matrix[0][$j] = '0 n';
}
for ( my $i = 1 ; $i <= length($seq2) ; $i++ ) {
my $score = $GAP * $i;
$matrix[$i][0] = join ' ',$score,'u';
}
#print Dumper(\@matrix);
# fill
my $max_i = 0;
my $max_j = 0;
my $max_score = 0;
print "seq2: ".length($seq2)."\n";
print "seq1: ".length($seq1)."\n";
for ( my $i = 1 ; $i <= length($seq2) ; $i++ ) {
for ( my $j = 1 ; $j <= length($seq1) ; $j++ ) {
my ( $diagonal_score, $left_score, $up_score );
# calculate match score
my $letter1 = substr( $seq1, $j - 1, 1 );
my $letter2 = substr( $seq2, $i - 1, 1 );
my $score = (split / /, $matrix[ $i - 1 ][ $j - 1 ])[0];
if ( $letter1 eq $letter2 ) {
$diagonal_score = $score + $MATCH;
}
else {
$diagonal_score = $score + $MISMATCH;
}
# calculate gap scores
$up_score = (split / /,$matrix[ $i - 1 ][$j])[0] + $GAP;
$left_score = (split / /,$matrix[$i][ $j - 1 ])[0] + $GAP;
# choose best score
if ( $diagonal_score >= $up_score ) {
if ( $diagonal_score >= $left_score ) {
$matrix[$i][$j] = join ' ',$diagonal_score,'d';
}
else {
$matrix[$i][$j] = join ' ', $left_score, 'l';
}
}
else {
if ( $up_score >= $left_score ) {
$matrix[$i][$j] = join ' ', $up_score, 'u';
}
else {
$matrix[$i][$j] = join ' ', $left_score, 'l';
}
}
# set maximum score
if ( (split / /, $matrix[$i][$j])[0] > $max_score ) {
$max_i = $i;
$max_j = $j;
$max_score = (split / /, $matrix[$i][$j])[0];
}
}
}
my $align1 = "";
my $align2 = "";
my $stseq1;
my $j = $max_j;
my $i = $max_i;
while (1) {
my $pointer = (split / /, $matrix[$i][$j])[1];
if ( $pointer eq "n" ) {
$stseq1 = $j;
last;
}
if ( $pointer eq "d" ) {
$align1 .= substr( $seq1, $j - 1, 1 );
$align2 .= substr( $seq2, $i - 1, 1 );
$i--;
$j--;
}
elsif ( $pointer eq "l" ) {
$align1 .= substr( $seq1, $j - 1, 1 );
$align2 .= "-";
$j--;
}
elsif ( $pointer eq "u" ) {
$align1 .= "-";
$align2 .= substr( $seq2, $i - 1, 1 );
$i--;
}
}
$align1 = reverse $align1;
$align2 = reverse $align2;
untie @matrix; # Don't forget to add your own error handler.
unlink 'temp.txt'; # Don't forget to add your own error handler.
return ( $align1, $align2, $stseq1 ,$max_j);
}
You can still use your original sub for short sequences, and switch to this sub for long ones.
I think that @j_random_hacker and @Ashalynd are on the right track regarding using this algorithm in most Perl implementations. The datatypes you're using are going to use more memory than is absolutely needed for the calculations.
So this is "normal" in that you should expect to see this kind of memory usage for how you've written this algorithm in Perl. You may have other problems in surrounding code that are using a lot of memory, but this algorithm will hit your memory hard with large sequences.
You can address some of the memory issues by changing the datatypes that you're using, as @Ashalynd suggests. You could try changing the hash which holds score and pointer into an array and changing the string pointers into integer values. Something like this might get you some benefit while still maintaining readability:
use strict;
use warnings;
# define constants for array positions and pointer values
# so the code is still readable.
# (If you have the "Readonly" CPAN module you may want to use it for constants
# instead although none of the downsides of the "constant" pragma apply in this code.)
use constant {
SCORE => 0,
POINTER => 1,
DIAGONAL => 0,
LEFT => 1,
UP => 2,
NONE => 3,
};
...
sub semiGlobal2 {
my ( $seq1, $seq2,$MATCH,$MISMATCH,$GAP ) = @_;
# initialization: first row to 0 ;
my @matrix;
# score and pointer are now stored in an array
# using the defined constants as indices
$matrix[0][0][SCORE] = 0;
# pointer value is now a constant integer
$matrix[0][0][POINTER] = NONE;
for ( my $j = 1 ; $j <= length($seq1) ; $j++ ) {
$matrix[0][$j][SCORE] = 0;
$matrix[0][$j][POINTER] = NONE;
}
for ( my $i = 1 ; $i <= length($seq2) ; $i++ ) {
$matrix[$i][0][SCORE] = $GAP * $i;
$matrix[$i][0][POINTER] = UP;
}
... # continue to make the appropriate changes throughout the code
However, when I tested this I didn't get a huge benefit when attempting to align a 3600 char string in a 5500 char string of random data. I programmed my code to abort when it consumed more than 2GB of memory. The original code aborted after 23 seconds while the one using constants and an array instead of a hash aborted after 32 seconds.
If you really want to use this specific algorithm, I'd check out the performance of Algorithm::NeedlemanWunsch. It doesn't look like it's very mature, but it may have addressed your performance issues. Otherwise, look into writing an Inline or Perl XS wrapper around a C implementation.
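Staying in pure Perl, a different technique from the array-of-arrays change above is to keep only two rows of scores and pack the traceback pointers at 2 bits per cell with vec(). For 5500 x 3600 bases that is about 5 MB of pointers. The following is only a sketch under that assumption: the name semi_global_packed and the numeric pointer codes are made up here, it is meant to mirror the scoring and tie-breaking of the original semiGlobal, and it has not been benchmarked.

```perl
use strict;
use warnings;

use constant { NONE => 0, DIAGONAL => 1, LEFT => 2, UP => 3 };

# Hypothetical variant of semiGlobal: two score rows plus 2-bit pointers.
sub semi_global_packed {
    my ( $seq1, $seq2, $MATCH, $MISMATCH, $GAP ) = @_;
    my ( $n, $m ) = ( length $seq1, length $seq2 );
    my $width = $n + 1;    # cells per matrix row

    # One string holds every pointer, 2 bits per cell; all cells start as NONE.
    my $ptr  = "\0" x int( ( $width * ( $m + 1 ) + 3 ) / 4 );
    my @prev = (0) x $width;    # scores of row 0

    my ( $max_i, $max_j, $max_score ) = ( 0, 0, 0 );
    for my $i ( 1 .. $m ) {
        my @cur = ( $GAP * $i );    # first column
        vec( $ptr, $i * $width, 2 ) = UP;
        my $letter2 = substr( $seq2, $i - 1, 1 );
        for my $j ( 1 .. $n ) {
            my $same = substr( $seq1, $j - 1, 1 ) eq $letter2;
            my $diag = $prev[ $j - 1 ] + ( $same ? $MATCH : $MISMATCH );
            my $up   = $prev[$j] + $GAP;
            my $left = $cur[ $j - 1 ] + $GAP;

            # Ties favor diagonal over up and left, and up over left.
            my ( $best, $dir ) = ( $diag, DIAGONAL );
            ( $best, $dir ) = ( $up,   UP )   if $up > $best;
            ( $best, $dir ) = ( $left, LEFT ) if $left > $best;

            $cur[$j] = $best;
            vec( $ptr, $i * $width + $j, 2 ) = $dir;
            ( $max_i, $max_j, $max_score ) = ( $i, $j, $best )
                if $best > $max_score;
        }
        @prev = @cur;    # the finished row becomes the previous row
    }

    # Traceback over the packed pointers.
    my ( $align1, $align2, $stseq1 ) = ( "", "" );
    my ( $i, $j ) = ( $max_i, $max_j );
    while (1) {
        my $dir = vec( $ptr, $i * $width + $j, 2 );
        if ( $dir == NONE ) { $stseq1 = $j; last; }
        if ( $dir == DIAGONAL ) {
            $align1 .= substr( $seq1, --$j, 1 );
            $align2 .= substr( $seq2, --$i, 1 );
        }
        elsif ( $dir == LEFT ) {
            $align1 .= substr( $seq1, --$j, 1 );
            $align2 .= "-";
        }
        else {    # UP
            $align1 .= "-";
            $align2 .= substr( $seq2, --$i, 1 );
        }
    }
    return ( scalar reverse($align1), scalar reverse($align2), $stseq1, $max_j );
}

my @result = semi_global_packed( "ACGTACGT", "GTAC", 1, -1, -2 );
print "@result\n";    # GTAC GTAC 2 6
```

The score rows are still ordinary Perl arrays, so run time stays quadratic; only the memory footprint changes.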

How to extend a binary search iterator to consume multiple targets

I have a function, binary_range_search, that is called like so:
my $brs_iterator = binary_range_search(
target => $range, # eg. [1, 200]
search => $ranges # eg. [ {start => 1, end => 1000},
); # {start => 500, end => 1500} ]
$brs_iterator->() will iterate over all @$ranges that $range overlaps.
I would like to extend binary_range_search to be able to call it with multiple ranges as its target, eg:
target => $target_ranges # eg. [ [1, 200], [50, 300], ... ]
search => $search_ranges # as above
So, when the search on $range->[0] is exhausted, it should move on to $range->[1], and so on. Here is the function in question, in its original form:
sub binary_range_search {
my %options = @_;
my $range = $options{target} || return;
my $ranges = $options{search} || return;
my ( $low, $high ) = ( 0, @{$ranges} - 1 );
while ( $low <= $high ) {
my $try = int( ( $low + $high ) / 2 );
$low = $try + 1, next if $ranges->[$try]{end} < $range->[0];
$high = $try - 1, next if $ranges->[$try]{start} > $range->[1];
my ( $down, $up ) = ($try) x 2;
my %seen = ();
my $brs_iterator = sub {
if ( $ranges->[ $up + 1 ]{end} >= $range->[0]
and $ranges->[ $up + 1 ]{start} <= $range->[1]
and !exists $seen{ $up + 1 } )
{
$seen{ $up + 1 } = undef;
return $ranges->[ ++$up ];
}
elsif ( $ranges->[ $down - 1 ]{end} >= $range->[0]
and $ranges->[ $down + 1 ]{start} <= $range->[1]
and !exists $seen{ $down - 1 }
and $down > 0 )
{
$seen{ $down - 1 } = undef;
return $ranges->[ --$down ];
}
elsif ( !exists $seen{$try} ) {
$seen{$try} = undef;
return $ranges->[$try];
}
else {
return;
}
};
return $brs_iterator;
}
return sub { };
}
It's a standard binary search strategy until it finds an overlapping range. It then moves to the right, exhausts it, moves to the left, exhausts it, and finally gives up. Ideally, it should then shift to the next target range and redo the search, I suppose (perhaps via recursion?). My problem is that I am not sure how to make that work with the iterator construction.
I just wrapped your iterator generation in a for loop, and built up an array of iterator functions.
Depending on context, I either return a master iterator or a list of iterator functions. I wasn't sure what you wanted.
use strict;
use warnings;
my $t = [ [1,200], [400,900] ];
my @r = (
{ start => 1, end => 100 },
{ start => 2, end => 500 },
{ start => 204, end => 500 },
{ start => 208, end => 500 },
{ start => 215, end => 1000 },
{ start => 150, end => 1000 },
{ start => 500, end => 1100 },
);
# Get a master iterator that will process each iterator in turn.
my $brs_iterator = binary_range_search(
targets => $t,
search => \@r,
);
# Get an array of iterators
my @brs_iterator = binary_range_search(
targets => $t,
search => \@r,
);
sub binary_range_search {
my %options = @_;
my $targets = $options{targets} || return;
my $ranges = $options{search} || return;
my @iterators;
TARGET:
for my $target ( @$targets ) {
my ( $low, $high ) = ( 0, $#{$ranges} );
RANGE_CHECK:
while ( $low <= $high ) {
my $try = int( ( $low + $high ) / 2 );
# Remove non-overlapping ranges
$low = $try + 1, next RANGE_CHECK
if $ranges->[$try]{end} < $target->[0];
$high = $try - 1, next RANGE_CHECK
if $ranges->[$try]{start} > $target->[1];
my ( $down, $up ) = ($try) x 2;
my %seen = ();
my $brs_iterator = sub {
if ( exists $ranges->[$up + 1]
and $ranges->[ $up + 1 ]{end} >= $target->[0]
and $ranges->[ $up + 1 ]{start} <= $target->[1]
and !exists $seen{ $up + 1 } )
{
$seen{ $up + 1 } = undef;
return $ranges->[ ++$up ];
}
elsif ( $down > 0
and $ranges->[ $down - 1 ]{end} >= $target->[0]
and $ranges->[ $down - 1 ]{start} <= $target->[1]
and !exists $seen{ $down - 1 } )
{
$seen{ $down - 1 } = undef;
return $ranges->[ --$down ];
}
elsif ( !exists $seen{$try} ) {
$seen{$try} = undef;
return $ranges->[$try];
}
else {
return;
}
};
push @iterators, $brs_iterator;
next TARGET;
}
}
# In scalar context return master iterator that iterates over the list of range iterators.
# In list context returns a list of range iterators.
return wantarray
? @iterators
: sub {
while( @iterators ) {
if( my $range = $iterators[0]() ) {
return $range;
}
shift @iterators;
}
return;
};
}
If you're wanting to iterate over all values that overlap the search ranges, you don't need binary search.
First the customary front matter:
use warnings;
use strict;
use Carp;
First off, check that we have target and search parameters and that for each range, the starting point is no greater than its ending point. Otherwise, we refuse to proceed.
sub binary_range_search {
my %arg = @_;
my @errors;
my $target = $arg{target} || push @errors => "no target";
my $search = $arg{search} || push @errors => "no search";
for (@$target) {
my($start,$end) = @$_;
push @errors => "Target start ($start) is greater than end ($end)"
if $start > $end;
}
for (@$search) {
my($start,$end) = @{$_}{qw/ start end /};
push @errors => "Search start ($start) is greater than end ($end)"
if $start > $end;
}
croak "Invalid use of binary_range_search:\n",
map " - $_\n", @errors
if @errors;
The iterator itself is a closure that maintains the following state:
my $i;
my($ta,$tb);
my($sa,$sb);
my $si = 0;
where
$i if defined is the next value from the current overlapping range
$ta and $tb are the starting and ending points for the current target range
$sa and $sb are like the above but for the current search range
$si is an index into @$search and defines the current search range
We will be assigning and returning the iterator $it. The declaration and initialization are separate so the iterator may call itself when necessary.
my $it;
$it = sub {
We are done if no more target ranges remain or if there were no search ranges to begin with.
return unless @$target && @$search;
When $i is defined, it means we have found an overlap and iterate by incrementing $i until it is greater than the ending point of either the current target range or the current search range.
if (defined $i) {
# iterating within a target range
if ($i > $tb || $i > $sb) {
++$si;
undef $i;
return $it->();
}
else {
return $i++;
}
}
Otherwise, we need to determine whether the next target range overlaps any search range. However, if $i is undefined and we've already considered all the search ranges, we discard the current target range and start again.
else {
# does the next target range overlap?
if ($si >= @$search) {
shift @$target;
$si = 0;
return $it->();
}
Here we pull out the starting and ending points of both the current target range (always at the front of @$target) and the current search range (indexed by $si).
($ta,$tb) = @{ $target->[0] };
($sa,$sb) = @{ $search->[$si] }{qw/ start end /};
Now testing for overlap is straightforward. For disjoint search ranges, we ignore and move on. Otherwise, we find the leftmost point in the overlap and iterate from there.
if ($sb < $ta || $sa > $tb) {
# disjoint
++$si;
undef $i;
return $it->();
}
elsif ($sa >= $ta) {
$i = $sa;
return $i++;
}
elsif ($ta >= $sa) {
$i = $ta;
return $i++;
}
}
};
Finally, we return the iterator:
$it;
}
For an example similar to the one in your question
my $it = binary_range_search(
target => [ [1, 200], [50, 300] ],
search => [ { start => 1, end => 1000 },
{ start => 500, end => 1500 },
{ start => 40, end => 60 },
{ start => 250, end => 260 } ],
);
while (defined(my $value = $it->())) {
print "got $value\n";
}
Its output with internal points elided is
got 1
[...]
got 200
got 40
[...]
got 60
got 50
[...]
got 300
got 50
[...]
got 60
got 250
[...]
got 260
Split it into two functions, an outer function that loops over the ranges and calls an inner function that implements a conventional binary chop.
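One way this split could look is sketched below. The names find_overlap, overlaps and multi_range_iterator are made up for illustration, and the sketch assumes, as the binary search in the question does, that the search ranges are sorted so that the chop can discard half of them at each step.

```perl
use strict;
use warnings;

# True if range hash $r overlaps target pair $t.
sub overlaps {
    my ( $r, $t ) = @_;
    return $r->{end} >= $t->[0] && $r->{start} <= $t->[1];
}

# Inner function: conventional binary chop.
# Returns the index of one overlapping range, or undef if there is none.
sub find_overlap {
    my ( $ranges, $target ) = @_;
    my ( $low, $high ) = ( 0, $#$ranges );
    while ( $low <= $high ) {
        my $try = int( ( $low + $high ) / 2 );
        if    ( $ranges->[$try]{end} < $target->[0] )   { $low  = $try + 1 }
        elsif ( $ranges->[$try]{start} > $target->[1] ) { $high = $try - 1 }
        else                                            { return $try }
    }
    return;
}

# Outer function: walks the targets in order and returns an iterator.
sub multi_range_iterator {
    my ( $targets, $ranges ) = @_;
    my @todo = @$targets;
    my @queue;    # overlapping ranges found for the current target
    return sub {
        while ( !@queue && @todo ) {
            my $target = shift @todo;
            my $hit = find_overlap( $ranges, $target );
            next unless defined $hit;
            # Grow the hit left and right while the neighbors still overlap.
            my ( $down, $up ) = ( $hit, $hit );
            $down-- while $down > 0 && overlaps( $ranges->[ $down - 1 ], $target );
            $up++ while $up < $#$ranges && overlaps( $ranges->[ $up + 1 ], $target );
            @queue = @{$ranges}[ $down .. $up ];
        }
        return shift @queue;    # undef once every target is exhausted
    };
}

my $it = multi_range_iterator(
    [ [ 50, 160 ], [ 450, 650 ], [ 1000, 1100 ] ],
    [   { start => 1,   end => 100 }, { start => 150, end => 300 },
        { start => 400, end => 500 }, { start => 600, end => 700 } ],
);
while ( my $r = $it->() ) {
    print "$r->{start}-$r->{end}\n";
}
```

Unlike the original iterator, this one returns the overlapping ranges for each target in left-to-right order.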
Warning: a very C++-biased answer:
What you have to do is define a new type of iterator, which is a pair of a usual iterator and a segment iterator (if you don't have a segment iterator, it's a pair of a const pointer/reference to the segments and an index pointing to the correct segment). You have to define all the concepts of a random access iterator (difference, addition of an integer, etc.). Bear in mind that, at least in C++ lingo, this is not a true random access iterator, since addition of an integer is not really constant time; such is life.