Adding an array of floats produces weird sums adding forward vs. backward - perl

I'm adding (summing) an array of floats in Perl, and I was trying to speed it up. When I tried, I started getting weird results.
#!/usr/bin/perl
my $total = 0;
my $sum = 0;

# Compute $sum (adds from index 0 forward)
my @y = @{$$self{"closing"}}[-$periods..-1];
my @x = map {${$_}{$what}} @y;
# map { $sum += $_ } @x;
$sum += $_ for @x;

# Compute $total (adds from index -1 backward)
for (my $i = -1; $i >= -$periods; $i--) {
    $total += ${${$$self{"closing"}}[$i]}{$what};
}

if ($total != $sum) {
    printf("SMA($what, $periods) total ($total) != sum ($sum) %g\n",
           ($total - $sum));
}
# Example output:
# SMA(close, 20) total (941.03) != sum (941.03) -2.27374e-13
I seem to get different answers when I compute $sum and $total.
The only thing I can think of is that one method adds forward through the array, and the other backward.
Would this cause them to overflow differently? I would expect so, but it never occurred to me that I would get different answers. Notice that the difference is small (-2.27374e-13).
Is this what's going on, or is my code busted?
This is perl 5, version 16, subversion 3 (v5.16.3) built for x86_64-linux-thread-multi

As Eric mentioned in the comments, floating-point arithmetic is not associative, so the order in which you do the operations can change the answer.
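You can see this directly in Perl with ordinary doubles (a quick illustration, not the asker's data):

# The same three values, added in a different order:
my $a = (0.1 + 0.2) + 0.3;
my $b = 0.1 + (0.2 + 0.3);
printf "%.17g\n", $a;    # 0.60000000000000009
printf "%.17g\n", $b;    # 0.59999999999999998
printf "%g\n", $a - $b;  # ~1.1e-16, the same kind of tiny residue as in the question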
While "add smaller values first" is good advice, it is important to emphasize that you can have differences even with just regular "small" values. Here's one example:
x = 1.004028
y = 3.0039678
z = 4.000855
If these are taken to be IEEE-754 single-precision floats (i.e., 32-bit binary format), then we get:
x + (y+z) = 8.008851
(x+y) + z = 8.00885
The infinitely precise result is 8.0088508, so neither is very good! And the error isn't insignificant for scientific computations, and it accumulates over long sums.
This is a rich field with many numerical algorithms to ensure precision. Which one you pick depends entirely on your problem domain and the particular needs and resources you have available, but one of the best-known is Kahan's summation algorithm; see https://en.wikipedia.org/wiki/Kahan_summation_algorithm. You can easily adapt it to your problem for (hopefully) better results.
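For instance, a minimal Perl sketch of Kahan summation (my sketch, not from the original post; it assumes ordinary IEEE-754 doubles):

# Compensated summation: $c carries the low-order bits that plain
# addition would lose at each step, and feeds them back in.
sub kahan_sum {
    my ($sum, $c) = (0, 0);
    for my $v (@_) {
        my $y = $v - $c;        # apply the running compensation
        my $t = $sum + $y;      # low-order bits of $y may be lost here
        $c = ($t - $sum) - $y;  # recover what was lost
        $sum = $t;
    }
    return $sum;
}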

Related

How do I display a large number in scientific notation?

Using AutoIt, when I multiply 1 by 10^21, I get 1e+021. But in separate steps, such as multiplying 1 by 10^3 seven times, I get the overflow value of 3875820019684212736.
It appears AutoIt cannot handle numbers with more than eighteen digits. Is there a way around this? For example, can I multiply 10,000,000,000,000,000 by 1000 and have the result displayed as 1e+019?
Try this UDF: BigNum UDF
Example:
$X = "9999999999999999999999999999999"
$Y = "9999999999999999999999999999999"
$product = _BigNum_Mul($X, $Y)

In Powershell, How to generate a random variable (exponential) with a specified mean?

I am trying to write a basic simulation (a queue), which relies on generating random expovariates. While PowerShell offers a Get-Random function where you can specify a min and a max, it has nothing like Python's random.expovariate(lambd) function.
Supposedly this is the model I should be following: log(1 - $u) / (-λ), where log is the natural logarithm.
The excellent Python documentation has this to say about it:
"Exponential distribution. lambd is 1.0 divided by the desired mean. It should be nonzero. (The parameter would be called “lambda”, but that is a reserved word in Python.) Returned values range from 0 to positive infinity if lambd is positive, and from negative infinity to 0 if lambd is negative.". In another description, "expovariate() produces an exponential distribution useful for simulating arrival or interval time values for in homogeneous Poisson processes such as the rate of radioactive decay or requests coming into a web server.
The Pareto, or power law, distribution matches many observable phenomena and was popularized by Chris Anderon’s book, The Long Tail. The paretovariate() function is useful for simulating allocation of resources to individuals (wealth to people, demand for musicians, attention to blogs, etc.)."
I have tried writing this in PowerShell, but my distributions are way off. If I put in a mean of 3, I get results that closely follow what I should get from a mean of 1. My code is closely modeled on John D. Cook's SimpleRNG C# library.
function GetUniform #GetUint
{
    Return Get-Random -Minimum -0.00 -Maximum 1
}

# Get exponential random sample with specified mean
function GetExponential_SpecMean {
    param([double]$mean)
    if ($mean -le 0.0)
    {
        Write-Host "Mean must be positive. Received $mean."
    }
    $a = GetExponential
    $R = $mean * $a
    Return $R
}

# Get exponential random sample with mean 1
function GetExponential
{
    $x = GetUniform
    Return -[math]::log10(1-$x) # -Math.Log( GetUniform() );
}

cls
$mean5 = 1
$rangeBottom = 0.0
$rangeTop = 1.0
$j = 0
$k = 0
$l = 0
for($i=1; $i -le 1000; $i++){
    $a = GetExponential_SpecMean $mean5
    if($a -le 1.0){Write-Host $a; $j++}
    if($a -gt 1.0){Write-Host $a; $k++}
    if(($a -gt $rangeBottom) -and ($a -le $rangeTop)){ #Write-Host $a;
        $l++
    }
    Write-Host " -> $i "
}
Write-Host "One or less: $j"
Write-Host "Greater than one: $k"
Write-Host "Total in range between bottom $rangeBottom and top $rangeTop : $l"
For a sample of 1000 and a Mean ($mean5) of 1, I should get (I believe) 500 results that are 1.0 or less and 500 that are greater than 1.0 (1:1 ratio); however, I am getting a ratio of about 9:1 with a mean of 1 and a ratio of about 53:47 using a mean of 3.
There is some discussion in this Stack Overflow question with some good background, but it is not specific to Powershell: Pseudorandom Number Generator - Exponential Distribution
I see you are using [Math]::Log10(), which is the base-10 logarithm, while all the functions in your links use the natural logarithm. Use [Math]::Log() in its place and the distribution should come out right.
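A minimal corrected sketch of the mean-1 sampler (only the log call changes; the rest of the script can stay as it is):

# Get exponential random sample with mean 1, via the inverse CDF: -ln(1 - U)
function GetExponential
{
    $x = GetUniform
    Return -[math]::Log(1 - $x)
}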

How can I take n elements at random from a Perl array?

I have an array, A = [a1,a2,a3,...aP] with size P. I have to sample q elements from array A.
I plan to use a loop with q iterations, randomly picking an element from A at each iteration. But how can I make sure that the picked element is different at each iteration?
The other answers all involve shuffling the array, which is O(n).
It means modifying the original array (destructive) or copying the original array (memory intensive).
The first way to make it more memory efficient is not to shuffle the original array but to shuffle an array of indexes.
use List::Util 'shuffle';

# Shuffled list of indexes into @deck
my @shuffled_indexes = shuffle(0..$#deck);

# Get just N of them.
my @pick_indexes = @shuffled_indexes[ 0 .. $num_picks - 1 ];

# Pick cards from @deck
my @picks = @deck[ @pick_indexes ];
It is at least independent of the content of the @deck, but it's still O(n) performance and O(n) memory.
A more efficient algorithm (not necessarily faster; it depends on how big your array is) is to look at each element of the array and decide if it's going to make it into the picks. This is similar to how you select a random line from a file without reading the whole file into memory: each line has a 1/N chance of replacing the previous pick, where N is the line number. So the first line has a 1/1 chance (it's always picked). The next has a 1/2. Then 1/3, and so on. Each successful pick overwrites the previous pick. This results in each line having a 1/total_lines chance.
You can work it out for yourself: a one-line file has a 1/1 chance, so the first line is always picked. In a two-line file, the first line has a 1/1 chance of being picked, then a 1/2 chance of surviving, which is 1/2; the second line has a 1/2 chance. In a three-line file, the first line has a 1/1 chance of being picked, then a 1/2 * 2/3 chance of surviving, which is 2/6 or 1/3. And so on.
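The classic idiom for that file trick, from perlfaq5 (assuming $fh is an open filehandle):

my $line;
while (<$fh>) {
    # rand($.) < 1 is true with probability 1/$., where $. is the line number
    $line = $_ if rand($.) < 1;
}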
The algorithm is O(n) for speed, it iterates through an unordered array once, and does not consume any more memory than is needed to store the picks.
With a little modification, this works for multiple picks. Instead of a 1/$position chance, it's $picks_left / $position. Each time a pick is successful, you decrement $picks_left. You work from the high position to the low one. Unlike before, you don't overwrite.
my $picks_left = $picks;
my $num_left = @$deck;
my @picks;
my $idx = 0;

while( $picks_left > 0 ) {  # when we have all our picks, stop
    # random number from 0..$num_left-1
    my $rand = int(rand($num_left));

    # pick successful
    if( $rand < $picks_left ) {
        push @picks, $deck->[$idx];
        $picks_left--;
    }

    $num_left--;
    $idx++;
}
This is how perl5i implements its pick method (coming next release).
To understand viscerally why this works, take the example of picking 2 from a 4 element list. Each should have a 1/2 chance of being picked.
1. (2 picks, 4 items): 2/4 = 1/2
Simple enough. The next element has a 1/2 chance that an element will already have been picked, in which case its chances are 1/3. Otherwise its chances are 2/3. Doing the math...
2. (1 or 2 picks, 3 items): (1/3 * 1/2) + (2/3 * 1/2) = 3/6 = 1/2
The next has a 1/4 chance that both elements will already be picked (1/2 * 1/2), in which case it has no chance; a 1/2 chance that only one will be picked, in which case its chances are 1/2; and the remaining 1/4 that no items will be picked, in which case its chances are 2/2.
3. (0, 1 or 2 picks, 2 items): (0/2 * 1/4) + (1/2 * 2/4) + (2/2 * 1/4) = 2/8 + 1/4 = 1/2
Finally, for the last item, there's a 1/2 chance the previous element took the last pick.
4. (0 or 1 pick, 1 items): (0/1 * 2/4) + (1/1 * 2/4) = 1/2
Not exactly a proof, but good for convincing yourself it works.
From perldoc perlfaq4:
How do I shuffle an array randomly?
If you either have Perl 5.8.0 or later installed, or if you have Scalar-List-Utils 1.03 or later installed, you can say:
use List::Util 'shuffle';
@shuffled = shuffle(@list);
If not, you can use a Fisher-Yates shuffle.
sub fisher_yates_shuffle {
    my $deck = shift;      # $deck is a reference to an array
    return unless @$deck;  # must not be empty!

    my $i = @$deck;
    while (--$i) {
        my $j = int rand ($i+1);
        @$deck[$i,$j] = @$deck[$j,$i];
    }
}

# shuffle my mpeg collection
#
my @mpeg = <audio/*/*.mp3>;
fisher_yates_shuffle( \@mpeg );  # randomize @mpeg in place
print @mpeg;
You could also use List::Gen:
my $gen = <1..10>;
print "$_\n" for $gen->pick(5); # prints five random numbers
You can use the Fisher-Yates shuffle algorithm to randomly permute your array and then take a slice of the first q elements. Here's code from PerlMonks:
# randomly permute @array in place
sub fisher_yates_shuffle
{
    my $array = shift;
    my $i = @$array;
    while ( --$i )
    {
        my $j = int rand( $i+1 );
        @$array[$i,$j] = @$array[$j,$i];
    }
}

fisher_yates_shuffle( \@array );  # permutes @array in place
You can probably optimize this by having the shuffle stop after it has selected q random elements. (The way this is written, you'd want the last q elements.) A sketch of that idea follows.
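Here is one way that early-stopping shuffle could look (my sketch: @array and $q are assumed from the question, and the q picks land in the last q slots):

sub pick_q {
    my ($array, $q) = @_;
    my $i    = @$array;
    my $stop = $i - $q;

    # Run Fisher-Yates from the top, but stop once the last $q
    # positions hold a uniform random sample.
    while ( --$i >= $stop ) {
        my $j = int rand( $i + 1 );
        @$array[$i, $j] = @$array[$j, $i];
    }
    return @$array[ $stop .. $#$array ];
}

my @sample = pick_q( \@array, $q );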
You could also construct a second, boolean array of size P and store true for each picked index. When a number is picked, check the second array; if that entry is already true, you must pick again.
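A quick sketch of that rejection approach (a hash stands in for the boolean array; @A and $q are assumed from the question, and it slows down as q approaches P):

my %picked;
my @picks;
while ( @picks < $q ) {
    my $i = int rand @A;     # random index into @A
    next if $picked{$i}++;   # already taken; re-roll
    push @picks, $A[$i];
}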

Wanted: a quicker way to check all combinations within a very large hash

I have a hash with about 130,000 elements, and I am trying to check all combinations within that hash for something (130,000 x 130,000 combinations). My code looks like this:
foreach $key1 (keys %CNV)
{
    foreach $key2 (keys %CNV)
    {
        if (blablabla){do something that doesn't take as long}
    }
}
As you might expect, this takes ages to run. Does anyone know a quicker way to do this? Many thanks in advance!!
-Abdel
Edit: Update on the blablabla.
Hey guys, thanks for all the feedback! Really appreciate it. I changed the foreach statement to:
for ($j=1;$j<=24;++$j)
{
foreach $key1 (keys %{$CNV{$j}})
{
foreach $key2 (keys %{$CNV{$j}})
{
if (blablabla){do something}
}
}
}
The hash is now multidimensional:
$CNV{chromosome}{$start,$end}
I'll elaborate on what I'm exactly trying to do, as requested.
The blablabla is the following:
if ( (($CNVstart{$j}{$key1} >= $CNVstart{$j}{$key2}) && ($CNVstart{$j}{$key1} <= $CNVend{$j}{$key2})) ||
     (($CNVend{$j}{$key1}   >= $CNVstart{$j}{$key2}) && ($CNVend{$j}{$key1}   <= $CNVend{$j}{$key2})) ||
     (($CNVstart{$j}{$key2} >= $CNVstart{$j}{$key1}) && ($CNVstart{$j}{$key2} <= $CNVend{$j}{$key1})) ||
     (($CNVend{$j}{$key2}   >= $CNVstart{$j}{$key1}) && ($CNVend{$j}{$key2}   <= $CNVend{$j}{$key1}))
   )
In short: the hash elements represent specific parts of the DNA (so-called "CNVs"; think of them as genes for now), each with a start and an end. These are integers representing positions on that particular chromosome, stored in hashes with the same keys: %CNVstart and %CNVend. I'm trying to check, for every combination of CNVs, whether they overlap. If two elements overlap within a family (a family of persons whose DNA I have read in; there is also a for-statement around the foreach-statements that lets the program check this for every family, which makes it take even longer), I check whether they also have the same "copy number" (stored in another hash with the same keys) and print out the result.
Thank you guys for your time!
It sounds like Algorithm::Combinatorics may help you here. It's intended to provide "efficient generation of combinatorial sequences." From its docs:
Algorithm::Combinatorics is an efficient generator of combinatorial sequences. ... Iterators do not use recursion, nor stacks, and are written in C.
You could use its combinations subroutine to produce all possible 2-key combos from your full set of keys, as in the sketch below.
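A minimal sketch of that (assuming the %CNV hash from the question; combinations returns an iterator in scalar context):

use Algorithm::Combinatorics qw(combinations);

my @keys = keys %CNV;
my $iter = combinations(\@keys, 2);
while (my $pair = $iter->next) {
    my ($key1, $key2) = @$pair;
    # if (blablabla) { do something }
}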
On the other hand, Perl itself is written in C. So I honestly have no idea whether or not this would help at all.
Maybe by using concurrency? But you would have to be careful with what you do with a positive match so as not to run into problems.
E.g. take $key1 and split it into $key1A and $key1B. Then create two separate threads, each containing half of the loop.
I am not sure exactly how expensive it is to start new threads in Perl, but if your positive action doesn't have to be synchronized, I imagine that on matching hardware you would be faster.
Worth a try imho.
define blah blah.
You could write it like this:
foreach $key1 (keys %CNV)
{
    if (blah1)
    {
        foreach $key2 (keys %CNV)
        {
            if (blah2){do something that doesn't take as long}
        }
    }
}
This pass should be O(2N) instead of O(N^2), assuming blah1 is false for most keys.
The data structure in the question is not a good fit to the problem. Let's try it this way.
use 5.010;
use Set::IntSpan::Fast::XS;

my @CNV;
for ([3, 7], [4, 8], [9, 11]) {
    my $set = Set::IntSpan::Fast::XS->new;
    $set->add_range(@{$_});
    push @CNV, $set;
}

# The comparison is commutative, so we can cut the total number in half.
for my $index1 (0 .. -1+@CNV) {
    for my $index2 (0 .. $index1) {
        next if $index1 == $index2;  # skip if it's the same CNV
        say sprintf(
            'overlap of CNV %s, %s at indices %d, %d',
            $CNV[$index1]->as_string, $CNV[$index2]->as_string, $index1, $index2
        ) unless $CNV[$index1]->intersection($CNV[$index2])->is_empty;
    }
}
Output:
overlap of CNV 4-8, 3-7 at indices 1, 0
We will not get the overlap of 3-7, 4-8 because it is a duplicate.
There's also Bio::Range, but it doesn't look so efficient to me. You should definitely get in touch with the bio.perl.org/open-bio people; chances are what you're doing has been done a million times before and they already have the optimal algorithm figured out.
I think I found the answer :-)
Couldn't have done it without you guys though. I found a way to skip most of the comparisons I make:
for ($j=1;$j<=24;++$j)
{
foreach $key1 (sort keys %{$CNV{$j}})
{
foreach $key2 (sort keys %{$CNV{$j}})
{
if (($CNVstart{$j}{$key2} < $CNVstart{$j}{$key1}) && ($CNVend{$j}{$key2} < $CNVstart{$j}{$key1}))
{
next;
}
if (($CNVstart{$j}{$key2} > $CNVend{$j}{$key1}) && ($CNVend{$j}{$key2} > $CNVend{$j}{$key1}))
{
last;
}
if ( (($CNVstart{$j}{$key1} >= $CNVstart{$j}{$key2}) && ($CNVstart{$j}{$key1} <= $CNVend{$j}{$key2})) ||
(($CNVend{$j}{$key1} >= $CNVstart{$j}{$key2}) && ($CNVend{$j}{$key1} <= $CNVend{$j}{$key2})) ||
(($CNVstart{$j}{$key2} >= $CNVstart{$j}{$key1}) && ($CNVstart{$j}{$key2} <= $CNVend{$j}{$key1})) ||
(($CNVend{$j}{$key2} >= $CNVstart{$j}{$key1}) && ($CNVend{$j}{$key2} <= $CNVend{$j}{$key1}))
) {print some stuff out}
}
}
}
What I did is:
sort the keys of the hash for each foreach loop
do "next" if the CNV with $key2 still hasn't reached the CNV with $key1 (i.e. start2 and end2 are both smaller than start1)
and, probably the most time-saving: end the foreach loop ("last") if the CNV with $key2 has overtaken the CNV with $key1 (i.e. start2 and end2 are both larger than end1)
Thanks a lot for your time and feedback guys!
Your optimisation of moving the $j loop to the outside was good, but the solution is still far from optimal.
Your problem does have a simple O(N+M) solution where N is the total number of CNVs and M is the number of overlaps.
The idea is: you walk through the length of DNA while keeping track of all the "current" CNVs. If you see a new CNV start, you add it to the list and you know that it overlaps with all the other CNVs currently in the list. If you see a CNV end, you just remove it from the list.
I am not a very good Perl programmer, so treat the following as pseudo-code (it's more like a mix of Java and C# :)); a rough Perl version follows it.
// input:
Map<CNV, int> starts;
Map<CNV, int> ends;

// temporary:
List<Tuple<int, bool, CNV>> boundaries;

foreach (CNV cnv in starts)
    boundaries.add(starts[cnv], false, cnv);
foreach (CNV cnv in ends)
    boundaries.add(ends[cnv], true, cnv);

// Sort first by position,
// then where position is equal we put "starts" first, "ends" last
boundaries = boundaries.OrderBy(t => t.first*2 + (t.second ? 1 : 0));

HashSet<CNV> current;

// main loop:
foreach ((int position, bool isEnd, CNV cnv) in boundaries)
{
    if (isEnd)
        current.remove(cnv);
    else
    {
        foreach (CNV otherCnv in current)
            OVERLAP(cnv, otherCnv); // output of the algorithm
        current.add(cnv);
    }
}
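A rough Perl version of the same sweep (my sketch, assuming the %CNVstart and %CNVend hashes from the question, restricted to one chromosome, e.g. the slices %{$CNVstart{$j}} and %{$CNVend{$j}}):

my @boundaries;
for my $key (keys %CNVstart) {
    push @boundaries, [ $CNVstart{$key}, 0, $key ];  # 0 = start
    push @boundaries, [ $CNVend{$key},   1, $key ];  # 1 = end
}

# Sort by position; at equal positions, starts (0) come before ends (1).
@boundaries = sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] } @boundaries;

my %current;  # CNVs whose interval the sweep is currently inside
for my $b (@boundaries) {
    my ($pos, $is_end, $key) = @$b;
    if ($is_end) {
        delete $current{$key};
    }
    else {
        # the new CNV overlaps every CNV that is currently open
        print "overlap: $key and $_\n" for keys %current;
        $current{$key} = 1;
    }
}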
Now I'm not a Perl warrior, but based on the information given, this is the same in any programming language: unless you sort the hash on the property you want to check and do a binary lookup, you won't improve lookup performance.
You could also, if possible, calculate which keys in your hash would have the properties you are interested in, but since there is no information suggesting that is possible, this may not be a solution.

How do I factor integers using Perl?

I want to split integers into their factors. For example, if the total number of records is:
169 - ( 13 x 13 times)
146 - ( 73 x 2 times)
150 - ( 50 x 3 times)
175 - ( 25 x 7 times)
168 - ( 84 x 2 )
160 - ( 80 x 2 times)
When it's more than 10k - I want everything on 1000
When it's more than 100k - I want everything on 10k
In this way I want to factor the number. How to achieve this? Is there any Perl module available for these kinds of number operations?
Suppose the total number of records is 10k. It should be split as 1000 x 10 only; not by 100s or 10s.
I can use the sqrt function, but it's not always what I am expecting. If I give the input 146, I have to get (73, 2).
You can use the same algorithms you find for other languages in Perl. There isn't any Perl special magic in the ideas. It's just the implementation, and for something like this problem, it's probably going to look very similar to the implementation in any language.
What problem are you trying to solve? Maybe we can point you at the right algorithm if we know what you are trying to do:
Why must numbers over 10,000 use the 1,000 factor? Most numbers won't have a 1,000 factor.
Do you want all the factors, or just the largest and its companion?
What do you mean that the sqrt function doesn't work as you expect? If you're following the common algorithm, you just need to iterate up to the floor of the square root to test for factors. Most integers don't have an integral square root. A sketch of that approach follows.
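To make that concrete, a minimal trial-division sketch (my code, not from the original answers; it collects prime factors, so 146 yields 2 and 73):

sub prime_factors {
    my ($n) = @_;
    my @factors;

    # Divide out each factor up to sqrt($n); whatever remains is prime.
    my $d = 2;
    while ( $d * $d <= $n ) {
        while ( $n % $d == 0 ) {
            push @factors, $d;
            $n /= $d;
        }
        $d++;
    }
    push @factors, $n if $n > 1;
    return @factors;
}

print join( ' x ', prime_factors(146) ), "\n";  # 2 x 73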
If the number is not a prime you can use a factoring algorithm.
There is an example of such a function here: http://www.classhelper.org/articles/perl-by-example-factoring-numbers/factoring-numbers-with-perl.shtml
Loop through some common widths in an acceptable range (say, 9 to 15), compute the record count modulo each width, and choose the width with the smallest remainder.
sub compute_width {
    my ($total_records) = @_;

    my %remainders;
    for (my $width = 9; $width <= 15; $width += 1) {
        my $remainder = $total_records % $width;
        $remainders{$width} = $remainder;
    }

    my @widths = sort {
        $remainders{$a} <=> $remainders{$b} ||
        $a <=> $b
    } keys %remainders;

    return $widths[0];
}
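For example, calling it with one of the record counts from the question (my check, not part of the original answer):

print compute_width(146), "\n";  # 9: 146 % 9 == 2 ties with 146 % 12 == 2, and the sort prefers the smaller width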