Ruby version of gamma.fit from scipy.stats

As the title suggests, I am trying to find a function that can take an array of floats and fit a distribution to my data.
From there I'll use it to find the CDF of new data I pass in.
I have installed and looked through the SciRuby Distribution and NArray docs, but nothing appears to match the 'fit' method.
The Python code looks like this:
# Approach 2: Model-based percentiles.
# Step 1: Find a Gamma distribution that fits your data
alpha, _, beta = stats.gamma.fit(data, floc = 0.)
# Step 2: Use that distribution's CDF to get percentiles.
scores = 100-100*stats.gamma.cdf(new_data, a = alpha, scale=beta)
print(scores)
Thank you in advance

After a deep dive into other packages and a lot of help from someone on the Cross Validated forum, I have the answer I needed.
To obtain the 'alpha' (shape) and 'beta' (rate) values that describe the gamma distribution, you first need the variance of the data.
There are a few approaches to computing it. See here for more information:
https://www.statisticshowto.com/probability-and-statistics/descriptive-statistics/sample-variance/
Code example:
data = [<insert your numbers>]

# Sample variance, computed by hand
sum = data.sum
sum_square_mean = (sum**2) / data.size   # (sum of x)^2 / n
all_square = data.map { |n| n**2 }.sum   # sum of x^2
net_square = all_square - sum_square_mean
minus_one = data.size - 1                # n - 1 (Bessel's correction)
variance = net_square / minus_one

# Method-of-moments estimates for the gamma distribution
mean = data.sum(0.0) / data.size
mean_squared = mean**2
alpha = mean_squared / variance   # shape
beta = mean / variance            # rate
theta = variance / mean           # scale (1 / beta)
The 'minus_one' line isn't strictly necessary, but dividing by n - 1 is done in statistics to reduce bias; look up Bessel's correction. You can also just compute the variance as net_square / data.size.
Second option, using the 'descriptive_statistics' gem:
require('descriptive_statistics')
# note: the gem's variance doesn't apply Bessel's correction
alpha = (data.mean**2) / data.variance
beta = data.mean / data.variance
theta = data.variance / data.mean
Once you have these values, you can use the cdf function from the Distribution gem (see its docs).
The next stage is to pass the values into this function, which will return a percentile.
Make sure to use the '1 over beta' value (the scale, theta) as the third argument or it won't work:
percentile = 100 - (100 * Distribution::Gamma::Ruby_.cdf(x, alpha, 1 / beta))
You may have noticed I have also calculated theta.
This was for a separate function which lets me return the value from my gamma distribution by passing in a percentile. Used like so:
value = Distribution::Gamma.quantile(0.5, alpha, theta)
This function is also known as the 'inverse CDF', 'inverse cumulative distribution function', or 'percent point function'. Here it is simply named 'quantile'.
For more information on gamma distributions, please see the Wikipedia article on the Gamma distribution.

Related

Small bug in MATLAB R2017B LogLikelihood after fitnlm?

Background: I am working on a problem similar to the nonlinear logistic regression described in link [1] (my problem is more complicated, but link [1] is enough for the next sections of this post). Comparing my results with those obtained in parallel with an R package, I got similar results for the coefficients, but a log-likelihood of (very approximately) opposite sign.
Hypothesis: The LogLikelihood given by fitnlm in Matlab is in fact the negative log-likelihood. (Note that this consequently impairs the BIC and AIC computations by Matlab.)
Reasoning: in [1], the same problem is solved through two different approaches. ML approach: define the negative log-likelihood and minimize it with fminsearch. GLS approach: use fitnlm.
The negative log-likelihood after the ML approach is: 380
The negative log-likelihood after the GLS approach is: -406
I imagine the second one should at least be multiplied by (-1)?
Questions: Did I miss something? Is the (-1) factor enough, or would this simple correction still not be enough?
Self-contained code:
%copy-pasting code from [1]
myf = @(beta,x) beta(1)*x./(beta(2) + x);
mymodelfun = @(beta,x) 1./(1 + exp(-myf(beta,x)));
rng(300,'twister');
x = linspace(-1,1,200)';
beta = [10;2];
beta0=[3;3];
mu = mymodelfun(beta,x);
n = 50;
z = binornd(n,mu);
y = z./n;
%ML Approach
mynegloglik = @(beta) -sum(log(binopdf(z,n,mymodelfun(beta,x))));
opts = optimset('fminsearch');
opts.MaxFunEvals = Inf;
opts.MaxIter = 10000;
betaHatML = fminsearch(mynegloglik,beta0,opts)
neglogLH_MLApproach = mynegloglik(betaHatML);
%GLS Approach
wfun = @(xx) n./(xx.*(1-xx));
nlm = fitnlm(x,y,mymodelfun,beta0,'Weights',wfun)
neglogLH_GLSApproach = - nlm.LogLikelihood;
Source:
[1] https://uk.mathworks.com/help/stats/examples/nonlinear-logistic-regression.html
This answer (now) only details which code is used. Please see Tom Lane's answer below for a substantive answer.
Basically, fitnlm.m is a call to NonLinearModel.fit.
When opening NonLinearModel.m, one gets in line 1209:
model.LogLikelihood = getlogLikelihood(model);
getlogLikelihood is itself described between lines 1234-1251.
For instance:
function L = getlogLikelihood(model)
(...)
L = -(model.DFE + model.NumObservations*log(2*pi) + (...) )/2;
(...)
Please also note that this notably impacts ModelCriterion.AIC and ModelCriterion.BIC, as they are computed using model.LogLikelihood ("thinking" it is the log-likelihood).
To get the corresponding formula for BIC/AIC/..., type:
edit classreg.regr.modelutils.modelcriterion
This is Tom from MathWorks. Take another look at the formula quoted:
L = -(model.DFE + model.NumObservations*log(2*pi) + (...) )/2;
Remember the normal distribution has a factor (1/sqrt(2*pi)), so taking logs of that gives us -log(2*pi)/2. So the minus sign comes from that and it is part of the log likelihood. The property value is not the negative log likelihood.
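For reference, the textbook identity behind that (not a quote from the MATLAB source): with $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ the maximum-likelihood variance estimate, the Gaussian log-likelihood of $n$ observations is
$$\ell = -\frac{n}{2}\left(\log(2\pi) + \log\hat{\sigma}^2 + 1\right),$$
so each observation contributes $-\tfrac{1}{2}\log(2\pi)$, and $\ell$ is positive whenever the residual variance is small enough for the densities to exceed 1, which is what happens in this example.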
One reason for the difference in the two log likelihood values is that the "ML approach" value is computing something based on the discrete probabilities from the binomial distribution. Those are all between 0 and 1, and they add up to 1. The "GLS approach" is computing something based on the probability density of the continuous normal distribution. In this example, the standard deviation of the residuals is about 0.0462. That leads to density values that are much higher than 1 at the peak. So the two things are not really comparable. You would need to convert the normal values to probabilities on the same discrete intervals that correspond to individual outcomes from the binomial distribution.
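As a rough check of that last point (my own sketch, not part of the original answer): the possible outcomes y = z/n are spaced 1/n apart, so a crude conversion is to multiply each fitted density by that spacing, i.e. add numel(y)*log(1/n) to the normal log-likelihood.
% Crude density-to-probability conversion: each outcome y = z/n sits on a
% grid with spacing 1/n, so density * (1/n) approximates a probability.
spacing = 1/n;
approxDiscreteLL = nlm.LogLikelihood + numel(y)*log(spacing);
% approxDiscreteLL should now be roughly comparable (in sign and magnitude)
% to -neglogLH_MLApproach from the ML approach above.
disp([approxDiscreteLL, -neglogLH_MLApproach])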

Matlab logncdf function is not producing expected result

So this problem seems pretty straightforward: we are given
mean of X = 10,281 and sigma of X = 4,112.4
and we are asked to determine P(X < 15,000).
Now, I thought the code for this in Matlab should be super straightforward:
mu = 10281
sigma = 4112.4
p = logncdf(15000,10281,4112.4)
However this gives
p = 0.0063
The given answer is 0.8790, and just looking at p you can tell it is wrong: 15,000 is above the mean, so the probability should be above 0.5. What is the deal with this function?
I saw somewhere that you might need to pass exp(15000) for x, but that results in a probability of 1, which is too high.
Any pointers would be much appreciated.
% If X is lognormally distributed with (arithmetic) mean and standard deviation:
mu = 10281;
sigma = 4112.4;
% then log(X) is normally distributed with the following parameters:
mew_actual = log((mu^2)/sqrt(sigma^2+mu^2));
sigma_actual = sqrt(log((sigma^2)/(mu^2) + 1));
Now you can use either of the following to compute the CDF:
p = cdf('Normal',log(15000),mew_actual,sigma_actual)
or
p=logncdf(15000,mew_actual,sigma_actual)
which gives 0.8796
(which I believe is the correct answer)
The answer given to you is 0.8790 because, if you solve the question by hand, you get something like z = 1.172759; when you look this value up in a standard normal table you can only find z = 1.17 (without the remaining decimal places), for which Φ(z) = 0.8790.
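A quick way to reproduce both numbers in Matlab (a small sketch using normcdf from the Statistics Toolbox and the mew_actual/sigma_actual values above):
% Exact z-score versus the two-decimal table lookup
z = (log(15000) - mew_actual) / sigma_actual;  % roughly 1.17
p_exact = normcdf(z)                           % roughly 0.8796
p_table = normcdf(1.17)                        % 0.8790, the tabulated value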
You can also verify the exact answer using an online normal-distribution calculator.

How to get cumulative distribution functions of a vector in Matlab using cumsum?

I want to get the probability of getting a value X greater than or equal to x_i, which means the cumulative distribution function (CDF), P(X >= x_i).
I've tried to do it in Matlab with this code.
Let's assume the data is in the column vector p1.
xp1 = linspace(min(p1), max(p1)); %range of bins
histp1 = histc(p1(:), xp1); %histogram of data
probp1 = histp1/sum(histp1); %PDF (probability distribution function)
figure; plot(probp1, 'o')
Now I want to calculate the CDF,
sorncount = flipud(histp1);
cumsump1 = cumsum(sorncount);
normcumsump1 = cumsump1/max(cumsump1);
cdf = flipud(normcumsump1);
figure;plot(xp1, cdf, 'ok');
I'm wondering whether anyone can tell me if this is OK or if I'm doing something wrong.
Your code works correctly, but is a bit more complicated than it could be. Since probp1 has been normalized to have sum equal to 1, the maximum of its cumulative sum is guaranteed to be 1, so there is no need to divide by this maximum. This shortens the code a bit:
xp1 = linspace(min(p1), max(p1)); %range of bins
histp1 = histc(p1(:), xp1); %count for each bin
probp1 = histp1/sum(histp1); %PDF (probability distribution function)
cdf = flipud(cumsum(flipud(probp1))); %CDF (unconventional, of P(X>=a) kind)
As Raab70 noted, most of the time the CDF is understood as P(X<=a), in which case you don't need flipud: taking cumsum(probp1) is all that's needed.
Also, I would probably use probp1(end:-1:1) instead of flipud(probp1), so that the vector is flipped whether it's a row or a column.
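For completeness, a minimal sketch of the conventional P(X <= a) version, using the same variables as above:
% Conventional empirical CDF, P(X <= a), from the normalized histogram
cdf_le = cumsum(probp1);
figure; plot(xp1, cdf_le, 'ok');
(In newer MATLAB releases, histcounts and ecdf offer related functionality.)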

A moving average with different functions and varying time-frames

I have a matrix of time-series data for 8 variables with about 2500 points (~10 years of Mon-Fri) and would like to calculate the mean, variance, skewness and kurtosis on a 'moving average' basis.
Let's say frames = [100 252 504 756]. I would like to calculate the four statistics above over each of the (time-)frames, on a daily basis - so the result for day 300 in the case of the 100-day frame would be [mean variance skewness kurtosis] for the period day 201 to day 300 (100 days in total), and so on.
I know this means I would get an array output, and the first frame-length of days would be NaNs, but I can't figure out the required indexing to get this done...
This is an interesting question because I think the optimal solution is different for the mean than it is for the other sample statistics.
I've provided a simulation example below that you can work through.
First, choose some arbitrary parameters and simulate some data:
%#Set some arbitrary parameters
T = 100; N = 5;
WindowLength = 10;
%#Simulate some data
X = randn(T, N);
For the mean, use filter to obtain a moving average:
MeanMA = filter(ones(1, WindowLength) / WindowLength, 1, X);
MeanMA(1:WindowLength-1, :) = nan;
I had originally thought to solve this problem using conv as follows:
MeanMA = nan(T, N);
for n = 1:N
MeanMA(WindowLength:T, n) = conv(X(:, n), ones(WindowLength, 1), 'valid');
end
MeanMA = (1/WindowLength) * MeanMA;
But as @PhilGoddard pointed out in the comments, the filter approach avoids the need for the loop.
Also note that I've chosen to make the dates in the output matrix correspond to the dates in X so in later work you can use the same subscripts for both. Thus, the first WindowLength-1 observations in MeanMA will be nan.
For the variance, I can't see how to use either filter or conv or even a running sum to make things more efficient, so instead I perform the calculation manually at each iteration:
VarianceMA = nan(T, N);
for t = WindowLength:T
VarianceMA(t, :) = var(X(t-WindowLength+1:t, :));
end
We could speed things up slightly by exploiting the fact that we have already calculated the mean moving average. Simply replace the line inside the loop above with:
VarianceMA(t, :) = (1/(WindowLength-1)) * sum((bsxfun(@minus, X(t-WindowLength+1:t, :), MeanMA(t, :))).^2);
However, I doubt this will make much difference.
If anyone else can see a clever way to use filter or conv to get the moving window variance I'd be very interested to see it.
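For what it's worth, here is one filter-based sketch for the moving variance. It assumes the usual sum-of-squares identity is numerically acceptable for your data (it can lose precision if the series has a large mean relative to its spread):
% Moving variance from two running sums, using
% var = (sum(x.^2) - (sum(x))^2/W) / (W-1) on each window.
S1 = filter(ones(WindowLength, 1), 1, X);     % running sum of x
S2 = filter(ones(WindowLength, 1), 1, X.^2);  % running sum of x.^2
VarianceMA2 = (S2 - S1.^2 / WindowLength) / (WindowLength - 1);
VarianceMA2(1:WindowLength-1, :) = nan;       % incomplete windows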
I leave the case of skewness and kurtosis to the OP, since they are essentially just the same as the variance example, but with the appropriate function.
A final point: if you were converting the above into a general function, you could pass an anonymous function in as one of the arguments; then you would have a moving-window routine that works for an arbitrary choice of transformation.
Final, final point: For a sequence of window lengths, simply loop over the entire code block for each window length.
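To make those last two points concrete, here is a minimal sketch (movstat is my own name, saved as movstat.m, not a toolbox function; skewness and kurtosis require the Statistics Toolbox):
function Out = movstat(X, WindowLength, statfun)
% Apply an arbitrary column-wise statistic over a trailing moving window.
% X is T-by-N; statfun is a function handle such as @mean, @var, @skewness.
[T, N] = size(X);
Out = nan(T, N);
for t = WindowLength:T
    Out(t, :) = statfun(X(t-WindowLength+1:t, :));
end
end
Usage would then be, e.g., SkewMA = movstat(X, 100, @skewness), looping over [100 252 504 756] for the different frames.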
I have managed to produce a solution which only uses basic functions within MATLAB and can also be expanded to include other functions (for finance, e.g. a moving Sharpe ratio or a moving Sortino ratio). The code below shows this and hopefully contains sufficient commentary.
I am using a time series of hedge fund data, with ca. 10 years' worth of daily returns (which were checked to be stationary - not shown in the code). Unfortunately I haven't got the corresponding dates in the example, so the x-axis in the plots would be 'no. of days'.
% start by importing the data you need - here it is a selection out of an
% excel spreadsheet
returnsHF = xlsread('HFRXIndices_Final.xlsx','EquityHedgeMarketNeutral','D1:D2742');
% two years to be used for the moving average. (250 business days in one year)
window = 500;
% create zero vectors to fill with the moving values at each point in time
% (note the +1: the loop below writes length(returnsHF)-window+1 entries)
mean_avg = zeros(length(returnsHF)-window+1,1);
st_dev = zeros(length(returnsHF)-window+1,1);
skew = zeros(length(returnsHF)-window+1,1);
kurt = zeros(length(returnsHF)-window+1,1);
% Now work through the time-series with each of the functions (one can add
% any other functions required), assigning the values to the preallocated vectors
for count = window:length(returnsHF)
% This is the most tricky part of the script, the indexing in this section
% The TwoYearReturn is what is shifted along one period at a time with the
% for-loop.
TwoYearReturn = returnsHF(count-window+1:count);
mean_avg(count-window+1) = mean(TwoYearReturn);
st_dev(count-window+1) = std(TwoYearReturn);
skew(count-window+1) = skewness(TwoYearReturn);
kurt(count-window+1) = kurtosis(TwoYearReturn);
end
% Plot the MAs
subplot(4,1,1), plot(mean_avg)
title('2yr mean')
subplot(4,1,2), plot(st_dev)
title('2yr stdv')
subplot(4,1,3), plot(skew)
title('2yr skewness')
subplot(4,1,4), plot(kurt)
title('2yr kurtosis')

Calculating an interest rate tree in Matlab

I would like to calibrate an interest rate tree using the optimization tools in Matlab, and need some guidance on doing it.
The interest rate tree looks like this:
How it works:
3.73% = 2.5%*exp(2*0.2)
96.40453 = (0.5*100 + 0.5*100)/(1+3.73%)
94.15801 = (0.5*96.40453 + 0.5*97.56098)/(1+3.00%)
The value of 2.5% is arbitrary; the upper node is obtained by multiplying it by an exponential of 2*volatility (here the volatility is 20%), and the root rate used for the final discounting is 3%.
I need to optimize the problem by varying different values for the lower node.
How do I do this optimization in Matlab?
What I have tried so far:
InterestTree{1}(1,1) = 0.03;                       % root rate (3%)
InterestTree{3-1}(1,3-1) = 2.5/100;                % lower node rate at step 2 (2.5%)
InterestTree{3}(2,:) = 100;                        % terminal values
InterestTree{3-1}(1,3-2) = (2.5*exp(2*0.2))/100;   % upper node rate = 2.5% * exp(2*vol)
% value at the lower node, discounted at 2.5% (97.56098)
InterestTree{3-1}(2,3-1)=(0.5*InterestTree{3}(2,3)+0.5*InterestTree{3}(2,3-1))/(1+InterestTree{3-1}(1,3-1));
j = 3-2;
% value at the upper node, discounted at 3.73% (96.40453)
InterestTree{3-1}(2,3-2)=(0.5*InterestTree{3}(2,j+1)+0.5*InterestTree{3}(2,j))/(1+InterestTree{3-1}(1,j));
% root value, discounted at the 3% root rate (94.15801)
InterestTree{3-2}(2,3-2)=(0.5*InterestTree{3-1}(2,j+1)+0.5*InterestTree{3-1}(2,j))/(1+InterestTree{3-2}(1,j));
But I am not sure how to go about the optimization. Any suggestions to improve the code are welcome too; I need some guidance on this.
Are you expecting the tree to increase in size, or are you just optimizing over the value of the "2.5%" parameter?
If it's the latter, there are two ways. The first is to model the tree as a closed-form expression by replacing 2.5% with x, which is possible with this tree. There are nonlinear optimization tools available in Matlab (e.g. the Optimization Toolbox), but it's been too long since I've done this to give you a more detailed answer.
The second is the approach I would try immediately. I'm interpreting the example you gave, so the equations I'm using may be incorrect - however, the principle of using the for loop is the same.
vol = 0.2;
maxival = 100;
val1 = zeros(1,maxival); %Preallocate
finalval = zeros(1,maxival);
for ival=1:maxival
val1(ival) = ival/1000; %Use any scaling you want. This will go from 0.1% to 10%
val2=val1(ival)*exp(2*vol);
x1 = (0.5*100+0.5*100)/(1+val2); %Based on the equation you gave
x2 = (0.5*100+0.5*100)/(1+val1(ival)); %I'm assuming this is how you calculate the bottom node
finalval(ival) = x1*0.5+x2*0.5/(1+...); %The example you gave isn't clear, so replace this with whatever it should be
end
[maxval, indmaxval] = max(finalval);
The maximum value is in maxval, and the interest that maximized this is in val1(indmaxval).
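If you later want to try the first (closed-form) approach instead of the grid search, a minimal sketch might look like this (the target price, starting guess and fzero call are my own illustration, not part of the answer above):
% Assumptions: two-period tree, root rate 3%, volatility 20%, par 100,
% and a known market price the root of the tree should reproduce.
vol = 0.2;
r0 = 0.03;                               % root rate from the question
targetPrice = 94.15801;                  % price the tree should reproduce
% Root price of the tree as a closed-form function of the lower node rate x
treePrice = @(x) 0.5*(100/(1 + x*exp(2*vol)) + 100/(1 + x)) / (1 + r0);
% Solve treePrice(x) = targetPrice for the lower node rate
xLower = fzero(@(x) treePrice(x) - targetPrice, 0.02);
% For this target, xLower comes out near 0.025 (the 2.5% in the question)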