finding quantile of the generalized normal distribution - matlab

I am writing MATLAB code to find quantiles of the generalized normal distribution
x ~ GN(0, alfa, beta):
p(x; 0, alfa, beta) = (beta/(2*alfa*gamma(1/beta))) * exp(-(abs(x)/alfa).^beta)
According to the quantile formula shown in https://en.wikipedia.org/wiki/Generalized_normal_distribution, I calculated the quantile for probability C as
z_c = sign(C-0.5).*gaminv(2*abs(C-0.5), 1./beta, 1./(alfa.^beta)).^(1./beta)+0
To validate the equation above, I assigned alfa = sqrt(2) and beta = 2, which reduces the generalized normal to a standard normal distribution. But when I calculated
>> C=0.05; beta =2; alfa =sqrt(2);
>> z_c = sign(C-0.5).*gaminv(2*abs(C-0.5), 1./beta, 1./(alfa.^beta)).^(1./beta)
z_c =
-0.8224
I expected the result to be exactly the same as the inverse of the normal CDF with mean mu = 0 and standard deviation sigma = 1; however,
>> norminv(C)
ans =
-1.6449
Could anyone help to point out the mistake(s) made above?
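A likely culprit (a sketch, not a confirmed answer from the thread): MATLAB's gaminv(P,A,B) expects a scale parameter B, while the call above passes the rate 1./(alfa.^beta). The Wikipedia formula scales the inverse regularized incomplete gamma by alfa^beta, so using the scale alfa.^beta instead reproduces norminv:
C = 0.05; beta = 2; alfa = sqrt(2);
z_c = sign(C-0.5).*gaminv(2*abs(C-0.5), 1./beta, alfa.^beta).^(1./beta)
% z_c = -1.6449, matching norminv(0.05)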

Related

Truncating Poisson distribution on desired support in Matlab

I want to construct a 3-dimensional Poisson distribution in MATLAB with lambda parameters [0.4, 0.2, 0.6], truncated so that each component has support 0,1,...,5. The 3 components are independent.
This is what I do
clear
n=3; % number of components of the distribution
supp_marginal=0:1:5;
suppsize_marginal=size(supp_marginal,2);
supp_temp=repmat(supp_marginal.',1,n);
supp_temp_cell=num2cell(supp_temp,1);
output_temp_cell=cell(1,n);
[output_temp_cell{:}] = ndgrid(supp_temp_cell{:});
supp=zeros(suppsize_marginal^n,n);
for h=1:n
    temp=output_temp_cell{h}; % grid for the h-th coordinate
    supp(:,h)=temp(:); % flatten into column h of the joint support
end
suppsize=size(supp,1);
lambda_1=0.4;
lambda_2=0.2;
lambda_3=0.6;
pr_mass=zeros(suppsize,1);
for j=1:suppsize
    pr_mass(j)=(poisspdf(supp(j,1),lambda_1).*...
        poisspdf(supp(j,2),lambda_2).*...
        poisspdf(supp(j,3),lambda_3))/...
        sum(poisspdf(supp(:,1),lambda_1).*...
        poisspdf(supp(:,2),lambda_2).*...
        poisspdf(supp(:,3),lambda_3)); % supp(:,3): normalize over the full support
end
When I compute the mean of the obtained distribution, I get lambda_1 and lambda_2 but not lambda_3.
lambda_empirical=sum(supp.*repmat(pr_mass,1,3));
Question: why do I not get lambda_3?
tl;dr: Truncation changes the distribution so different means are expected.
This is expected, as truncation itself changes the distribution and therefore adjusts the mean. You can see this from the experiment below. Notice that for your chosen parameters, this just starts to become noticeable around lambda = 0.6.
Similar to the wiki page, this illustrates the difference between E[X] (the expectation of X without truncation; fancy phrase for the mean) and E[X | LB ≤ X ≤ UB] (the expectation of X given that it lies in the interval [LB,UB]). This conditional expectation implies a different distribution than the unconditional distribution of X ~ Poisson(lambda).
% MATLAB R2018b
% Setup
LB = 0; % lowerbound
UB = 5; % upperbound
% Simple test to compare theoretical means with and without truncation
TestLam = 0.2:0.01:1.5;
Gap = zeros(size(TestLam(:)));
for jj = 1:length(TestLam)
    TrueMean = mean(makedist('Poisson','Lambda',TestLam(jj)));
    TruncatedMean = mean(truncate(makedist('Poisson','Lambda',TestLam(jj)),LB,UB));
    Gap(jj) = TrueMean-TruncatedMean;
end
plot(TestLam,Gap)
Notice that with these truncation bounds the gap at lambda = 0.6 is still small, and it becomes negligible as lambda approaches zero.
lam = 0.6; % <---- try different values (must be greater than 0)
pd = makedist('Poisson','Lambda',lam)
pdt = truncate(pd,LB,UB)
mean(pd) % 0.6
mean(pdt) % 0.5998
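As a cross-check (a small sketch, not in the original answer), the truncated mean can be computed directly from the definition of the conditional expectation:
x = LB:UB; % the truncated support
w = poisspdf(x, lam); % unnormalized probability masses on that support
sum(x.*w)/sum(w) % E[X | LB <= X <= UB] = 0.5998, matching mean(pdt)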
Other Resources:
1. Wiki for Truncated Distributions
2. What is a Truncated Distribution
3. MATLAB documentation for truncate(), makedist()
4. MATLAB: Working with Probability Distribution (Objects)

Simulate left Truncated Gamma distribution with mean = 1 and variance =0.4

Suppose X ~ Γ(α, β). I would like to truncate X from the left at t = 0.5, so that the truncated distribution has mean 1 and variance 0.4.
MATLAB codes:
t = 0.5; theta = 0.4;
syms alpha beta
EX = beta*( igamma(alpha+1,t/beta) / igamma(alpha,t/beta) ); %Mean
EX2 = beta^2*( igamma(alpha+2,t/beta) / igamma(alpha,t/beta) );%Second moment
VarX = EX2 -EX^2; %Variance
cond1 = alpha > 0; cond2 = beta > 0; cond3 = EX==1; cond4 = VarX ==theta;
conds =[cond1 cond2 cond3 cond4]; vars = [alpha, beta];
sol = solve(conds, [alpha beta], 'ReturnConditions',true);
soln_alpha = vpa(sol.alpha)
soln_beta = vpa(sol.beta)
The above code returns a numeric answer only if the constraint α > 0 is relaxed. That answer has a negative value of α, which is wrong, since both α (the shape parameter) and β (the scale parameter) must be strictly positive.
Based on your title, I take it you want to generate samples from a Gamma distribution with mean = 1 and variance = 0.4, but with the distribution truncated to [0.5, inf).
If X ~ Gamma(alpha,beta), then by definition it is nonnegative (see the Gamma Distribution wiki, or the MATLAB page). Both the shape and scale parameters must also be positive. Note: MATLAB uses the (k,theta) parameterization found on the wiki page.
MATLAB has implemented probability distribution objects, which make a lot of things very convenient from a practitioner's perspective (or for anyone who uses numerical approaches).
alpha = 0.4;
beta = 0.5;
pd = makedist('Gamma','a',alpha,'b',beta) % define the distribution object (makedist takes name-value pairs)
Generating samples is now very easy.
n = 1000; % Number of samples
X = random(pd,n,1); % Random samples of X ~ Gamma(alpha,beta)
All that is left is to identify the shape and scale parameters such that E[X] = 1 and Var(X) = 0.4.
You need to solve
alpha * beta = E[X],
alpha * (beta^2) = Var(X),
for alpha and beta. It is a system of two nonlinear equations with two unknowns.
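For the untruncated distribution this system has a closed-form solution; a quick sketch (these values also make a sensible starting guess for the truncated fit below):
beta0 = 0.4/1 % beta = Var(X)/E[X] = 0.4
alpha0 = 1/beta0 % alpha = E[X]/beta = 2.5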
However, truncation makes these moment equations obsolete; numerical approaches will still work fine.
LB = 0.5; % lower bound (X > LB)
UB = inf; % upper bound (X < UB)
pdt = truncate(pd,LB,UB) % Define truncated distribution object
Xt = random(pdt,n,1); % sample from the truncated distribution
pdt =
GammaDistribution
Gamma distribution
a = 0.4
b = 0.5
Truncated to the interval [0.5, Inf]
Fortunately, the mean and variance of a distribution object are accessible whether or not it is truncated.
mean(pdt) % compare to mean(pd)
var(pdt) % compare to var(pd)
You can numerically solve this problem to obtain your parameters with something like fmincon.
tgtmean = 1;
tgtvar = 0.4;
fh_mean = @(p) mean(truncate(makedist('Gamma','a',p(1),'b',p(2)),LB,UB));
fh_var = @(p) var(truncate(makedist('Gamma','a',p(1),'b',p(2)),LB,UB));
fh = @(p) (fh_mean(p)-tgtmean).^2 + (fh_var(p)-tgtvar).^2;
[p, fval] = fmincon(fh,[alpha;beta],[],[],[],[],[0;0],[inf;inf])
You can test the answer for validation:
pd_test = truncate(makedist('Gamma','a',p(1),'b',p(2)),LB,UB);
mean(pd_test)
var(pd_test)
ans = 1.0377
ans = 0.3758
Note this seems ill-conditioned due to the desired truncation and target mean. This might be good enough depending on your application.
histogram(random(pd_test,n,1)) % Visually inspect distribution
Mean and variance combinations must be feasible under the base distribution (here, Gamma distribution), but if truncating, that further restricts the set of feasible parameters. For example, it would be impossible to truncate X~Gamma() to the interval [5, 500] and seek to get a mean of 2 or a mean of 600.
MATLAB code verified with version R2017a.
Also note that solutions from nonlinear solvers like fmincon may be sensitive to the initial starting point for some problems. If the numerical approach is giving issues, it may be a feasibility issue (as alluded to above), or it may require using multiple start points and multiple fmincon calls to get multiple candidate answers, then keeping the best one, as sketched below.
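A minimal multistart sketch along those lines (reusing fh, LB, and UB from above; the number of restarts and the start-point range are arbitrary choices):
best_fval = inf;
for s = 1:20 % number of random restarts (arbitrary)
    p0 = 0.1 + 5*rand(2,1); % random positive start for [alpha; beta]
    [p_s, fval_s] = fmincon(fh, p0, [],[],[],[], [0;0], [inf;inf]);
    if fval_s < best_fval
        best_fval = fval_s;
        p_best = p_s;
    end
end
p_best % best parameter estimate across restarts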

Getting rank deficient warning when using regress function in MATLAB

I have a dataset comprising 30 independent variables, and I tried performing linear regression in MATLAB R2010b using the regress function.
I get a warning stating that my matrix X is rank deficient to within machine precision.
The coefficients I get after executing this function don't match the experimental ones.
Can anyone suggest how to perform the regression analysis for this equation comprising 30 variables?
Following our discussion, the reason you are getting that warning is that you have what is known as an underdetermined system: you have more variables to solve for than data to constrain them. One example of an underdetermined system is:
x + y + z = 1
x + y + 2z = 3
There are an infinite number of combinations of (x,y,z) that solve the above system; for example, (x, y, z) = (1, −2, 2), (2, −3, 2), and (3, −4, 2). What rank deficient means in your case is that there is more than one set of regression coefficients satisfying the governing equation that describes the relationship between your input variables and output observations. This is probably why the output of regress doesn't match your ground-truth regression coefficients; though it isn't the same answer, do know that the output is one possible answer. Running your data through regress, this is what I get if I define your observation matrix to be X and your output vector to be Y:
>> format long g;
>> B = regress(Y, X);
>> B
B =
0
0
28321.7264417536
0
35241.9719076362
899.386999172398
-95491.6154990829
-2879.96318251964
-31375.7038251919
5993.52959752106
0
18312.6649115112
0
0
8031.4391233753
27923.2569044728
7716.51932560781
-13621.1638587172
36721.8387047613
80622.0849069525
-114048.707780113
-70838.6034825939
-22843.7931997405
5345.06937207617
0
106542.307940305
-14178.0346010715
-20506.8096166108
-2498.51437396558
6783.3107243113
You can see that there are seven regression coefficients that are equal to 0, which corresponds to 30 - 23 = 7: we have 30 variables and only 23 constraints to work with. Be advised that this is not the only possible solution. regress essentially computes the solution that minimizes the sum of squared residuals of Y - X*B. This simplifies to:
B = X^(*)*Y
X^(*) is what is known as the pseudo-inverse of the matrix. MATLAB has this available, and it is called pinv. Therefore, if we did:
B = pinv(X)*Y
We get:
B =
44741.6923363563
32972.479220139
-31055.2846404536
-22897.9685877566
28888.7558524005
1146.70695371731
-4002.86163441217
9161.6908044046
-22704.9986509788
5526.10730457192
9161.69080479427
2607.08283489226
2591.21062004404
-31631.9969765197
-5357.85253691504
6025.47661106009
5519.89341411127
-7356.00479046122
-15411.5144034056
49827.6984426955
-26352.0537850382
-11144.2988973666
-14835.9087945295
-121.889618144655
-32355.2405829636
53712.1245333841
-1941.40823106236
-10929.3953469692
-3817.40117809984
2732.64066796307
You can see that there are no zero coefficients, because pinv returns the least-squares solution with minimum L2-norm, which tends to "spread" the coefficients across all variables (for lack of a better term). You can verify that these are valid regression coefficients by doing:
>> Y2 = X*B
Y2 =
16.1491563400241
16.1264219600856
16.525331600049
17.3170318001845
16.7481541301999
17.3266932502295
16.5465094100486
16.5184456100487
16.8428701100165
17.0749421099829
16.7393450000517
17.2993993099419
17.3925811702017
17.3347117202356
17.3362798302375
17.3184486799219
17.1169638102517
17.2813552099096
16.8792925100727
17.2557945601102
17.501873690151
17.6490477001922
17.7733493802508
Similarly, if we used the regression coefficients from regress, so B = regress(Y,X); then doing Y2 = X*B, we get:
Y2 =
16.1491563399927
16.1264219599996
16.5253315999987
17.3170317999969
16.7481541299967
17.3266932499992
16.5465094099978
16.5184456099983
16.8428701099975
17.0749421099985
16.7393449999981
17.2993993099983
17.3925811699993
17.3347117199991
17.3362798299967
17.3184486799987
17.1169638100025
17.281355209999
16.8792925099983
17.2557945599979
17.5018736899983
17.6490476999977
17.7733493799981
There are some slight computational differences, which is to be expected. Similarly, we can also find the answer by using mldivide:
B = X \ Y
B =
0
0
28321.726441712
0
35241.9719075889
899.386999170666
-95491.6154989513
-2879.96318251572
-31375.7038251485
5993.52959751295
0
18312.6649114859
0
0
8031.43912336425
27923.2569044349
7716.51932559712
-13621.1638586983
36721.8387047123
80622.0849068411
-114048.707779954
-70838.6034824987
-22843.7931997086
5345.06937206919
0
106542.307940158
-14178.0346010521
-20506.8096165825
-2498.51437396236
6783.31072430201
You can see that this curiously matches up with what regress gives you. That's because \ is a smarter operator: depending on how your matrix is structured, it solves the system by a different method. I'd like to refer you to the post by Amro that describes which algorithms mldivide uses based on the properties of the input matrix:
How to implement Matlab's mldivide (a.k.a. the backslash operator "\")
What you should take away from this answer is that you can certainly go ahead and use those regression coefficients, and they will more or less reproduce the expected output Y for each set of inputs X. However, be warned that those coefficients are not unique. This is apparent from your ground-truth coefficients not matching the output of regress: it generated another answer that satisfies the constraints you provided.
If you have an underdetermined system, more than one answer can describe that relationship, as you have seen from the experiments above.
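To see the non-uniqueness concretely, here is a small sketch using the toy system from the start of this answer:
A = [1 1 1; 1 1 2]; % the toy underdetermined system from above
b = [1; 3];
x1 = A \ b % a basic solution (at most two nonzeros), like regress
x2 = pinv(A)*b % minimum-norm solution: [-0.5; -0.5; 2]
[A*x1, A*x2] % both reproduce b exactly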

MATLAB how do I access a specific coefficient in a symbolic equation system solution?

I need to run a simple Monte Carlo varying coefficients on a system of equations. I need to record the solved coefficient of one of the variables each time.
The following gets me results from a single run:
syms alpha gamma Ps Pc beta lambda Pp Sp Ss Dp Ds;
eq1 = -Ss + alpha + 0.17*Ps - 1*Pc;
eq2 = -Sp + beta + 0.2*Pp;
eq3 = -Ds + gamma - 0.2*Ps + 1*Pp;
eq4 = -Dp + lambda - 0.17*Pp + 1*Ps;
eq5 = Ss - Ds;
eq6 = Sp - Dp;
ans1 = solve(eq1,eq2,eq3,eq4,eq5,eq6,'Ps','Pp','Ss','Ds','Sp','Dp');
disp('Ps')
vpa(ans1.Ps,3)
disp('Pp')
vpa(ans1.Pp,3)
disp('Ss')
vpa(ans1.Ss,3)
disp('Ds')
vpa(ans1.Ds,3)
disp('Sp')
vpa(ans1.Sp,3)
disp('Dp')
vpa(ans1.Dp,3)
I will be varying several of the variables (on Ps, Pp, and Pc), and recording the coefficient on Pc in each of the reduced-form equations (i.e. the coefficient on Pc that shows up after vpa(ans1.xx) -- so in the case above, it would be the 1x6 vector [-0.429,-1.16,-1.07,-1.07,-0.232,-0.232]).
I'm very new to MATLAB, but I'm sure I can figure out how to implement the loop code to do the model iterations. What I can't figure out is how to record the vector of coefficients after each iteration. Is there some "accessor" that will give me just the one coefficient for each equation each time?
Something like vpa(ans1.ps.coef(pc)) (which is a total shot in the dark, and it's wrong, but hopefully you get the idea).
There's probably a better way to do this, but this is all I could come up with at the moment.
Step 1:
In order to obtain the coefficient of Pc as a double from ans1.Ps, you can use the subs function. Since the solved expression is linear in the parameters with no constant term, substituting Pc = 1 and all other parameters = 0 leaves exactly the coefficient of Pc:
double(subs(ans1.Ps,{alpha,Pc,beta,gamma,lambda},{0,1,0,0,0}))
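Since the expressions are linear in Pc, differentiation is an equivalent way to pull out the same coefficient (a one-line sketch, not from the original answer):
double(diff(ans1.Ps, Pc)) % the derivative of a linear expression w.r.t. Pc is its Pc coefficient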
Step 2a:
To get a vector of all coefficients per ans1 expression (say ans1.Ps) you can use something like this:
N = numel(symvar(ans1.Ps)); % number of free symbols in the expression (should be 5 here)
cp = num2cell(eye(N)); % unit matrix as a cell array: row n sets only the nth symbol to 1
coefs = zeros(1,N);
for n = 1:N
    coefs(n) = double(subs(ans1.Ps,{alpha,Pc,beta,gamma,lambda},cp(n,:)));
end
Step 2b:
Alternatively, you may want just the Pc coefficient, but from all the ans1 expressions. If so, then you can do the following:
SNames = fieldnames(ans1); % get names of ans1 expressions
for n = 1:numel(SNames)
    expr = ans1.(SNames{n}); % get the expression itself
    pc(n) = double(subs(expr,{alpha,Pc,beta,gamma,lambda},{0,1,0,0,0})); % coefficient of Pc
end
You can now combine the two if you want all info about the coefficients.
Edit:
To store the retrieved Pc per iteration you can do the following:
alpha_vals = [3 1 4 6 7]; % example values to sweep; do NOT overwrite the symbolic alpha
beta_vals = [6 7 8 5 2];
SNames = fieldnames(ans1); % get names of ans1 expressions
for n = 1:numel(SNames)
    expr = ans1.(SNames{n}); % get the expression itself
    for n1 = 1:numel(alpha_vals)
        for n2 = 1:numel(beta_vals)
            pc(n,n1,n2) = double(subs(expr,{alpha,Pc,beta,gamma,lambda},{alpha_vals(n1),1,beta_vals(n2),0,0}));
        end
    end
end

Calculate posterior distribution of unknown mis-classification with PRTools in MATLAB

I'm using the PRTools MATLAB library to train some classifiers, generating test data and testing the classifiers.
I have the following details:
N: total # of test examples
k: # of mis-classifications for each classifier and class
I want to do:
Calculate and plot Bayesian posterior distributions of the unknown probabilities of mis-classification (denoted q), that is, as probability density functions over q itself (so, P(q) will be plotted over q, from 0 to 1).
I have that (math formulae, not matlab code!):
Posterior = Likelihood * Prior / Normalization constant =
P(q|k,N) = P(k|q,N) * P(q|N) / P(k|N)
The prior is set to 1, so I only need to calculate the likelihood and normalization constant.
I know that the likelihood can be expressed as (where B(N,k) is the binomial coefficient):
P(k|q,N) = B(N,k) * q^k * (1-q)^(N-k)
... so the Normalization constant is simply an integral of the posterior above, from 0 to 1:
P(k|N) = B(N,k) * integralFromZeroToOne( q^k * (1-q)^(N-k) )
(The binomial coefficient B(N,k) can be omitted, though, as it appears in both the likelihood and the normalization constant.)
Now, I've heard that the integral for the normalization constant should be able to be calculated as a series ... something like:
k!(N-k)! / (N+1)!
Is that correct? (I have some lecture notes with this series, but can't figure out if it is for the normalization constant integral, or for the overall distribution of mis-classification (q))
Also, hints are welcome on how to calculate this in practice (factorials easily cause overflow/truncation errors, right?) ... AND how to practically compute the final plot (the posterior distribution over q, from 0 to 1).
I really haven't done much with Bayesian posterior distributions (and not for a while), but I'll try to help with what you've given. First,
k!(N-k)! / (N+1)! = 1 / (B(N,k) * (N + 1))
and you can calculate the binomial coefficients in MATLAB with nchoosek(), though the docs note that there can be accuracy problems for large coefficients. How big are N and k?
Second, according to Mathematica,
integralFromZeroToOne( q^k * (1-q)^(N-k) ) = pi * csc((k-N)*pi) * Gamma(1+k)/(Gamma(k-N) * Gamma(2+N))
where csc() is the cosecant function and Gamma() is the gamma function. Note that Gamma(x) = (x-1)!, which we'll use in a moment. The problem is that we have Gamma(k-N) in the denominator, and k-N will be negative. The reflection formula helps with that, so we end up with:
= (N-k)! * k! / (N+1)!
Apparently, your notes were correct.
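As a quick numerical sanity check of that identity (a sketch; gammaln avoids forming large factorials and thus the truncation errors mentioned in the question):
N = 10; k = 3; % small example values
integral(@(q) q.^k .* (1-q).^(N-k), 0, 1) % 7.5758e-04 by quadrature
exp(gammaln(k+1) + gammaln(N-k+1) - gammaln(N+2)) % k!(N-k)!/(N+1)! = 7.5758e-04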
Let q be the probability of mis-classification. Then the probability that you would observe k mis-classifications in N runs is given by:
P(k|N,q) = B(N,k) q^k (1-q)^(N-k)
You then need to assume a suitable prior for q, which is bounded between 0 and 1. The conjugate prior for the above likelihood is the beta distribution: if q ~ Beta(a,b), then the posterior is also a beta distribution. For your information, the posterior is:
f(q|-) ~ Beta(a+k,b+N-k)
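In practice (a minimal sketch, assuming the uniform prior from the question, i.e. a = b = 1), the posterior is Beta(k+1, N-k+1), and betapdf evaluates it without computing any factorials:
N = 100; k = 5; % example counts (hypothetical values)
q = linspace(0, 1, 1000);
post = betapdf(q, k+1, N-k+1); % posterior density over q under a uniform prior
plot(q, post), xlabel('q'), ylabel('P(q|k,N)')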
Hope that helps.