random binomial distributed dataset in perl

I am trying to do in Perl what I succeeded in doing in R, because the R version is difficult to combine with my downstream needs.
In R I did the following:
library("MASS")
d <- rnegbin(100000, mu = 250, theta = 2)
hist(d, breaks=1000, xlim=c(0,1000))
This produces the nice graph I need, with a peak around 180-200 and a tail to the right.
Could someone help me code the Perl equivalent using Math::Random?
I tried this but do not get the right shape
use Math::Random qw(random_negative_binomial);
# random_negative_binomial($n, $ne, $p)
# When called in an array context, returns an array of $n outcomes
# generated from the negative binomial distribution with number of
# events $ne and probability of an event in each trial $p.
# When called in a scalar context, generates and returns only one
# such outcome as a scalar, regardless of the value of $n.
# Argument restrictions: $ne is rounded using int(), the result must be positive.
# $p must be between 0 and 1 exclusive.
# I tried different variable values but never got the right shape
my @dist = random_negative_binomial($n, $ne, $p);
What values do I need to mimic the R results?
I need the same range of values on X and the same general shape.
Thanks for any help; I did not find illustrated examples for that package.
Stephane

I don't know much about statistics, but since nobody else comes forward: I would use the Perl Data Language PDL (which I use for other things) and fetch the PDL::Stats::Distr module. You can find an example that looks somewhat similar to yours here http://pdl-stats.sourceforge.net/Distr.htm. The module includes pmf_binomial (mass function) and mme_binomial (distribution). You will also need the PGPLOT module.
You will need some random data:
$data = pdl 1..100000; ## generate linear 1 - 100000
$data = $data->random; ## make them random between 0..1
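On the parameter question itself: R's rnegbin(mu, theta) and an ($ne, $p) parameterisation describe the same distribution when $ne = theta and $p = theta / (theta + mu), so mu = 250 and theta = 2 would correspond to $ne = 2 and $p = 2/252 (about 0.0079). This assumes random_negative_binomial counts the failures before the $ne-th event, which I have not checked against the module source. Below is a minimal sketch of that mapping, written in Python only so that it runs with the standard library; negbin_params and sample_negbin are hypothetical helper names, not part of any package mentioned here.

```python
import math
import random


def negbin_params(mu, theta):
    """Convert R's (mu, theta) parameterisation to (size, p)."""
    return theta, theta / (theta + mu)


def sample_negbin(size, p, rng):
    """Draw one negative binomial value: failures before the size-th success.

    Built as a sum of `size` geometric variates (inverse-transform method),
    so `size` must be a positive integer in this sketch.
    """
    total = 0
    for _ in range(size):
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
        total += int(math.log(u) / math.log(1.0 - p))
    return total


if __name__ == "__main__":
    size, p = negbin_params(mu=250, theta=2)
    rng = random.Random(1)
    draws = [sample_negbin(size, p, rng) for _ in range(20000)]
    print("size:", size, "p:", p)
    print("sample mean:", sum(draws) / len(draws))  # should be close to mu
```

If the sample mean comes out near 250 with a long right tail, the same two numbers are a reasonable first thing to try for $ne and $p.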

Related

What is the meaning of ~ apart from logical not?

What does the following code do in MATLAB? I searched the documentation, but ~ is only described as logical not, and I could not relate the following output to anything about logical not.
[~, k ] = max([0.9 1.5 4.6; 3.31 0.76 5.4])
Output: 2 1 2
The ~ placeholder allows you to ignore an output from a function. Using this allows you to acknowledge that something is output by the function, but you do not have to allocate a variable to store the output in.
When a function returns values in Matlab the number of parameters it returns and the order of these parameters is important and allows you to know what each returned value is. You may sometimes come across situations where a function returns more values than you are interested in, and you can ignore the ones you are not interested in using ~.
In your example, M = max([0.9 1.5 4.6]) would return only the maximum value. If you want to know the index of the maximum value, you have to use [M,I] = max([0.9 1.5 4.6]). If you need to know the index of the maximum value but are not interested in the actual value itself, you can use [~,I] = max([0.9 1.5 4.6]), and you thus do not need to allocate a variable to hold the maximum values.
The reference you're looking for is the Symbol Reference, which states:
Tilde — ~
The tilde character is used in comparing arrays for unequal values,
finding the logical NOT of an array, and as a placeholder for an input
or output argument you want to omit from a function call.
Not Equal to
...
Argument Placeholder
To have the fileparts function return its third output value and skip
the first two, replace arguments one and two with a tilde character:
[~, ~, filenameExt] = fileparts(fileSpec);
which is what @David suggested in his comment.

Perl - Square Root results

I am quite new to the world of Perl and I am stuck with the sqrt function.
By stuck I mean the function is not returning the value it should.
After reading a text file with coordinate information, 8 values are stored in separate variables ($x1, $y1, $x2, $y2 and so forth). Then, a subroutine is called, which calculates the distance between the points and then other things. However, it doesn't do what it is supposed to do because the results of the sqrt function are not the ones they should! I thought it was a problem with how the variables were obtained and stored, but after performing the sqrt with the literal values, it also produces a wrong number.
Here are the values
-2130.07 207.56 -2084.46 210.76 -1892.78 -2525.74 -1938.39 -2528.93
And here are the sqrt calculations...
$side1=sqrt(($x1-$x2)^2+($y1-$y2)^2);
$sidecheck=sqrt((-2130.07-(-2084.46))^2+(207.56-210.76)^2);
Both $side1 and $sidecheck return a value of 6.7823 instead of 45.722.
Is there a way to sort this out? Thanks!
In Perl and a few other languages, the exponentiation operator is not the caret; it is a double asterisk. So you need to write
$sidecheck=sqrt((-2130.07-(-2084.46))**2+(207.56-210.76)**2);
The ^ is the bitwise XOR operator. To square a value, use **
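To see where 6.7823 comes from, here is a minimal sketch in Python, whose ^ and ** operators have the same meaning and relative precedence as Perl's. It assumes that bitwise XOR truncates its operands to integers, which is an assumption about Perl's behaviour, although it does reproduce the value reported in the question. Because ^ binds more loosely than +, the expression a ^ 2 + b ^ 2 groups as (a ^ (2 + b)) ^ 2.

```python
import math

x1, y1, x2, y2 = -2130.07, 207.56, -2084.46, 210.76

# Assumption: XOR works on integers, so the float operands are truncated.
dx = int(x1 - x2)          # -45.61 truncates to -45
rhs = int(2 + (y1 - y2))   # 2 + (-3.2) = -1.2 truncates to -1

# (a ^ (2 + b)) ^ 2, evaluated left to right
xor_result = (dx ^ rhs) ^ 2
wrong = math.sqrt(xor_result)

# With the power operator the expression is the usual Euclidean distance.
right = math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)

print(f"with ^ : {wrong:.4f}")
print(f"with **: {right:.3f}")
```

The first line of output matches the unexpected value from the question and the second matches the expected distance.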

Concatenate equivalent in MATLAB for a single value

I am trying to use MATLAB in order to generate a variable whose elements are either 0 or 1. I want to define this variable using some kind of concatenation (equivalent of Java string append) so that I can add as many 0's and 1's according to some upper limit.
I can only think of using a for loop to append values to an existing variable. Something like
variable=1;
for i=1:N
if ( i%2==0)
variable = variable.append('0')
else
variable = variable.append('1')
i=i+1;
end
Is there a better way to do this?
In MATLAB, you can almost always avoid a loop by treating arrays in a vectorized way.
The result of pseudo-code you provided can be obtained in a single line as:
variable = mod((1:N),2);
The above line generates a row vector [1,2,...,N] (with the code (1:N), use (1:N)' if you need a column vector) and the mod function (as most MATLAB functions) is applied to each element when it receives an array.
That's not valid Matlab code:
The % indicates the start of a comment, hence introducing a syntax error.
There is no append method (at least not for arrays).
There's no need to increment the index in a for loop.
Aside from that, it's a bad idea to have Matlab "grow" variables, as memory needs to be reallocated each time, slowing it down considerably. The correct approach is:
variable=zeros(N,1);
for i=1:N
variable(i)=mod(i,2);
end
If you really do want to grow variables (sometimes it is inevitable) you can use this:
variable=[variable;1];
Use ; for appending rows, use , for appending columns (does the same as vertcat and horzcat). Use cat if you have more than 2 dimensions in your array.

How to convert a z score to a percentage in Perl

Seems like in Perl this should be easy or a module, but I have not found an easy answer yet. I have a calculated z normal score and the mean, but have not found an easy method to calculate the percentage. The solutions I have found are for looking it up using a statistics table, but it seems like something that common would have a module or something easy. (ie a z-score of -3 is -3 sigma and has a percentage of 0.1%). I don't want to have to build a table in perl and then interpolate if I don't need to. Anyone know?
I'm not 100% sure, since the only description you gave is "the percentage", but I think the function you need is the cumulative distribution function of the normal distribution. You can get this from the module Math::CDF.
use Math::CDF;
my $prob = Math::CDF::pnorm(-3);
printf "%.1f%%\n", $prob * 100; # Prints 0.1%
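If installing a module is not an option, the same quantity has a closed form in terms of the complementary error function: Phi(z) = 0.5 * erfc(-z / sqrt(2)). A minimal sketch in Python follows (normal_cdf is a hypothetical helper name; this shows the formula only and is not a claim about how Math::CDF computes pnorm internally).

```python
import math


def normal_cdf(z):
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


if __name__ == "__main__":
    for z in (-3, -1, 0, 1.96):
        print(f"z = {z:>5}: {normal_cdf(z) * 100:.1f}%")
```

No table or interpolation is needed, which is what the question was hoping to avoid.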

How can I work around a round-off error that causes an infinite loop in Perl's Statistics::Descriptive?

I'm using the Statistics::Descriptive library in Perl to calculate frequency distributions and coming up against a floating point rounding error problem.
I pass in two values, 0.205 and 0.205, (taken from other numbers and sprintf'd to those) to the stats module and ask it to calculate the frequency distribution but it's getting stuck in an infinite loop.
Stepping through with a debugger I can see that it's doing:
my $interval = $self->{sample_range}/$partitions;
my $iter = $self->{min};
while (($iter += $interval) < $self->{max}) {
$bins{$iter} = 0;
push @k, $iter; ##Keep the "keys" unstringified
}
$self->sample_range (the range, max - min) is returning 2.77555756156289e-17 rather than 0 as I'd expect. This means that the loop ((min += interval) < max) is, for all intents and purposes, infinite.
DB<8> print $self->{max};
0.205
DB<9> print $self->{min};
0.205
DB<10> print $self->{max}-$self->{min};
2.77555756156289e-17
So this looks like a rounding problem. I can't think how to fix this on my side though, and I'm not sure editing the library is a good idea. I'm looking for suggestions of a workaround or alternative.
Cheers,
Neil
I am the Statistics::Descriptive maintainer. Due to its numeric nature, many rounding problems have been reported. I believe this particular one was fixed in a version I released recently, later than the one you were using, by using multiplication to compute the divisions instead of +=.
Please use the most up-to-date version from the CPAN, and it should be better.
Not exactly a rounding problem; you can see the more precise values with something like
printf("%.18g %.18g", $self->{max}, $self->{min});
Looks to me like there's a flaw in the module where it assumes the sample range can be divided up into $partitions pieces; because floating point doesn't have infinite precision, this isn't always possible. In your case, the min and max values are exactly adjacent representable values, so there can't be more than one partition. I don't know what exactly the module is using the partitions for, so I'm not sure what the impact of this may be.
Another possible problem in the module is that it is using numbers as hash keys, which implicitly stringifies them, which slightly rounds the value.
You may have some success in laundering your data through stringification before feeding it to the module:
$data = 0+"$data";
This will at least ensure that two numbers that (with the default printing precision) appear equal are actually equal.
That shouldn't cause an infinite loop. What would cause that loop to be infinite would be if $self->{sample_range}/$partitions is 0.
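The stall can be reproduced outside the module. The reported difference, 2.77555756156289e-17, is 2**-55, which is exactly the spacing between neighbouring doubles near 0.205, so the sketch below assumes min and max are adjacent doubles. It is written in Python; partitions = 4 and the cap argument are hypothetical choices, and none of this is the module's actual code. Each interval is then smaller than half the spacing, so adding it to min rounds straight back to min and the += loop never advances, even though the interval is not 0.

```python
def count_steps(lo, hi, partitions, cap=1000):
    """Mimic `while (($iter += $interval) < $max)`, giving up after `cap` steps."""
    interval = (hi - lo) / partitions
    it = lo
    steps = 0
    while steps < cap:
        it += interval          # absorbed by rounding when interval is tiny
        if not it < hi:
            break
        steps += 1
    return steps


def edges_by_multiplication(lo, hi, partitions):
    """Bin edges computed as lo + i * interval; the loop count is fixed up front."""
    interval = (hi - lo) / partitions
    return [lo + i * interval for i in range(1, partitions)]


if __name__ == "__main__":
    lo = 0.205
    hi = lo + 2.0 ** -55        # the next representable double above lo
    print("range:", hi - lo)
    print("steps before giving up:", count_steps(lo, hi, partitions=4))
    print("edges:", edges_by_multiplication(lo, hi, partitions=4))
```

Iterating over a fixed count and multiplying, which is one way to read the maintainer's description of the fix, always terminates, although with inputs this close together several edges collapse onto the same value.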