How can we find the percentile or quantile of a gamma distribution in MATLAB?

Suppose that we have a gamma distribution fitted to some data in MATLAB. I want the part of the distribution with higher density (an x-axis range). How can I extract this in MATLAB? I fitted the distribution using the histfit function.
PS: My code:
figure;
histfit(Data,20,'gamma');
[phat, pci] = gamfit(Data);
phat =
   11.3360    4.2276

pci =
    8.4434    3.1281
   15.2196    5.7136

When you fit a gamma distribution to your data with [phat, pci] = gamfit(Data);, phat contains the maximum-likelihood estimates of the shape and scale parameters (in that order).
You can plug these into gaminv:
x = gaminv(p, phat(1), phat(2));
where p is a vector of probabilities, e.g. p = [.2, .8] for the 20th and 80th percentiles.
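Since the next answer switches to R, one caveat is worth noting: MATLAB's gaminv takes a shape and a scale parameter, while R's qgamma takes a rate as its second positional argument, so the scale must be passed explicitly. As a minimal sketch (the 10%/90% probabilities are just an example), the equivalent computation in R using the fitted values above would be:
## equivalent quantile computation in R (illustrative probabilities)
## gamfit/gaminv use (shape, scale); qgamma defaults to a rate, so pass scale=
phat <- c(11.3360, 4.2276)   # shape and scale reported by gamfit above
qgamma(c(0.10, 0.90), shape = phat[1], scale = phat[2])
## the interval between these two quantiles covers the central 80% of the fitted density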

I always base my code on the following. It is code I wrote once and now adapt where necessary; maybe you will find it useful as well. (The example is in R: it fits a gamma distribution to grouped frequency data by maximum likelihood and then derives quantiles and confidence intervals for the 95th percentile.)
table2.10 <- cbind(lower = c(0, 2.5, 7.5, 12.5, 17.5, 22.5, 32.5,
                             47.5, 67.5, 87.5, 125, 225, 300),
                   upper = c(2.5, 7.5, 12.5, 17.5, 22.5, 32.5,
                             47.5, 67.5, 87.5, 125, 225, 300, Inf),
                   freq  = c(41, 48, 24, 18, 15, 14, 16, 12, 6, 11, 5, 4, 3))
loglik <- function(p, d) {
  upper <- d[, 2]
  lower <- d[, 1]
  n     <- d[, 3]
  ## log-likelihood of the grouped (interval-censored) data; upper may be Inf
  ll <- n * log(ifelse(upper < Inf, pgamma(upper, p[1], p[2]), 1) -
                  pgamma(lower, p[1], p[2]))
  sum(ll)
}
p0 <- c(alpha = 0.47, beta = 0.014)
## maximise the grouped-data log-likelihood (fnscale = -1 turns optim into a maximiser)
m <- optim(p0, loglik, hessian = T, control = list(fnscale = -1),
           d = table2.10)
theta <- qgamma(0.95, m$par[1], m$par[2])   # 95th percentile of the fitted gamma
theta
One can also construct a 95% confidence interval by using the Delta method. To do so, we need to differentiate the quantile function of the claims, F_X^{-1}(0.95; α, β), with respect to α and β.
p   <- m$par
eps <- c(1e-5, 1e-6, 1e-7)
d.alpha <- 0 * eps
d.beta  <- 0 * eps
for (i in 1:3) {
  ## central finite-difference derivatives of the 95% quantile w.r.t. alpha and beta
  d.alpha[i] <- (qgamma(0.95, p[1] + eps[i], p[2]) - qgamma(0.95, p[1] - eps[i], p[2])) / (2 * eps[i])
  d.beta[i]  <- (qgamma(0.95, p[1], p[2] + eps[i]) - qgamma(0.95, p[1], p[2] - eps[i])) / (2 * eps[i])
}
d.alpha
d.beta
var.p <- solve(-m$hessian)
var.q95 <- t(c(d.alpha[2],d.beta[2])) %*% var.p %*% c(d.alpha[2],d.beta[2])
qgamma(0.95,p[1],p[2]) + qnorm(c(0.025,0.975))*sqrt(c(var.q95))
It is even possible to use a parametric bootstrap on the estimates of α and β to obtain B (10,000 below) different estimates of the 95th percentile of the loss distribution, and to use these estimates to construct a 95% confidence interval.
library(mvtnorm)
B <- 10000
q.b <- rep(NA, B)
for (b in 1:B) {
  ## draw parameters from the asymptotic normal approximation of the MLE
  p.b <- rmvnorm(1, p, var.p)
  if (!any(p.b < 0)) q.b[b] <- qgamma(0.95, p.b[1], p.b[2])
}
quantile(q.b, c(0.025, 0.975), na.rm = TRUE)   # na.rm guards against rejected draws
To do the nonparametric bootstrap, we first 'expand' the data to reflect each individual observation. Then we sample with replacement from the line numbers, recompute the frequency table, re-estimate the model and compute its 95th percentile.
line.numbers <- rep(1:13, table2.10[, "freq"])
q.b <- rep(NA, B)
table2.10b <- table2.10
for (b in 1:B) {
  line.numbers.b <- sample(line.numbers, size = 217, replace = TRUE)  # 217 = total number of observations
  table2.10b[, "freq"] <- table(factor(line.numbers.b, levels = 1:13))
  m.b <- optim(m$par, loglik, hessian = T, control = list(fnscale = -1),
               d = table2.10b)
  q.b[b] <- qgamma(0.95, m.b$par[1], m.b$par[2])
}
q.npb <- q.b
quantile(q.b, c(0.025, 0.975))

Related

How to run an exponential decay mixed model?

I am not familiar with nonlinear regression and would appreciate some help with running an exponential decay model in R. Please see the graph for what the data look like. My hunch is that an exponential model might be a good choice. I have one fixed effect and one random effect: y ~ x + (1|random factor). How do I get the starting values for the exponential model (please assume that I know nothing about nonlinear regression) in R? How do I subsequently run a nonlinear model with these starting values? Could anyone please help me with the logic as well as the R code?
As I am not familiar with nonlinear regression, I haven't been able to attempt it in R.
raw plot
The correct syntax will depend on your experimental design and model but I hope to give you a general idea on how to get started.
We begin by generating some data that should match the type of data you are working with. You had mentioned a fixed factor and a random one. Here, the fixed factor is represented by the variable treatment and the random factor is represented by the variable grouping_factor.
library(nlraa)
library(nlme)
library(ggplot2)
## Setting this seed should allow you to reach the same result as me
set.seed(3232333)
example_data <- expand.grid(treatment = c("A", "B"),
                            grouping_factor = c('1', '2', '3'),
                            replication = c(1, 2, 3),
                            xvar = 1:15)
The next step is to create some "observations". Here, we use an exponential function y = a*exp(c*x) and some random noise to create the data. Also, we add a constant to treatment A just to create some treatment differences.
example_data$y <- ave(example_data$xvar,
                      example_data[, c('treatment', 'replication', 'grouping_factor')],
                      FUN = function(x) {
                        expf(x = x, a = 10, c = -0.3) + rnorm(1, 0, 0.6)
                      })
example_data$y[example_data$treatment == 'A'] <- example_data$y[example_data$treatment == 'A'] + 0.8
All right, now we start fitting the model.
## Create a grouped data frame
exampleG <- groupedData(y ~ xvar|grouping_factor, data = example_data)
## Fit a separate model to each grouping level
fitL <- nlsList(y ~ SSexpf(xvar, a, c), data = exampleG)
## Combine the individual fits into a nonlinear mixed-effects model
fit1 <- nlme(fitL)
## Grab the fixed-effect coefficients of the general model
fxf <- fixed.effects(fit1)
## Add treatment as a fixed effect. Also, use the coefficients from the previous
## model as starting values.
fit2 <- update(fit1, fixed = a + c ~ treatment,
               start = c(fxf[1], 0,
                         fxf[2], 0))
Looking at the model summary, you will get output like the following:
Nonlinear mixed-effects model fit by maximum likelihood
  Model: y ~ SSexpf(xvar, a, c)
  Data: exampleG
       AIC      BIC    logLik
  475.8632 504.6506 -229.9316

Random effects:
 Formula: list(a ~ 1, c ~ 1)
 Level: grouping_factor
 Structure: General positive-definite, Log-Cholesky parametrization
              StdDev       Corr
a.(Intercept) 3.254827e-04 a.(In)
c.(Intercept) 1.248580e-06 0
Residual      5.670317e-01

Fixed effects: a + c ~ treatment
                  Value Std.Error  DF   t-value p-value
a.(Intercept)  9.634383 0.2189967 264  43.99329  0.0000
a.treatmentB   0.353342 0.3621573 264   0.97566  0.3301
c.(Intercept) -0.204848 0.0060642 264 -33.77976  0.0000
c.treatmentB  -0.092138 0.0120463 264  -7.64867  0.0000
 Correlation:
             a.(In) a.trtB c.(In)
a.treatmentB -0.605
c.(Intercept) -0.785  0.475
c.treatmentB  0.395 -0.792 -0.503

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max
-1.93208903 -0.34340037  0.04767133  0.78924247  1.95516431

Number of Observations: 270
Number of Groups: 3
Then, if you wanted to visualize the model fit, you could do the following.
## Here we store the model predictions for visualization purposes
predictionsDf <- cbind(example_data,
                       predict_nlme(fit2, interval = 'conf'))
## Here we make a graph to check it out
ggplot() +
  geom_ribbon(data = predictionsDf,
              aes(x = xvar, ymin = Q2.5, ymax = Q97.5, fill = treatment),
              color = NA, alpha = 0.3) +
  geom_point(data = example_data, aes(x = xvar, y = y, col = treatment)) +
  geom_line(data = predictionsDf, aes(x = xvar, y = Estimate, col = treatment), size = 1.1)
The resulting plot shows the model fit (points, fitted curves, and confidence ribbons).

"bnlearn" usage in R for prediction of discrete variables

I am working on "bnlearn" in R to construct a probabilistic network on discrete variables. I need to predict one variable based on the constructed BN. Let's take iris data as an example, I wrote R codes referring to another post in CrossValidated. The codes are as below:
> data(iris)
> set.seed(1)
> tr_idx <- sample(seq_len(nrow(iris)), size = 100)
> tr <- iris[tr_idx,]
> te <- iris[-tr_idx,]
> res <- hc(tr)
> fit <- bn.fit(res, tr)
> pred <- predict(fit, "Species", te)
> cb <- cbind(pred, te[, "Species"])
> accuracy(f = pred, x = te[, "Species"])
Am I right? My questions are:
1. Do I have to discretise each variable?
2. If my data include both discrete and continuous variables, which BN structure learning function should I use?
3. The accuracy function did not work in the above example; I got the error below:
accuracy(f = pred, x = te[, "Species"])
Error in accuracy(f = pred, x = te[, "Species"]) : First argument should be a forecast object or a time series.
Would you please help with this? Many thanks in advance.
Regards,
Xuemei
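Regarding question 3: the error message suggests that the accuracy() being called is the forecast package's function, which expects a forecast object or a time series rather than factor predictions, so it is not suited to measuring classification accuracy here. A minimal sketch of a plain accuracy check, using the pred and te objects defined above:
## confusion table and raw classification accuracy
table(predicted = pred, actual = te$Species)
mean(pred == te$Species)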

Define a Specific Probability Distribution Object in Matlab

I'm trying to create a specific distribution in Matlab to sample from. I want to model some processes that are distributed as ARG (Autoregressive Gamma).
A process v_t is said to be ARG(a, b, c) if it is of the form v_t ~ Gamma(a + z_t, c),
where z_t ~ Poisson(b*v_{t-1}/c).
The parameters a, b, c correspond, respectively, to a degree of freedom, a measure of serial dependence, and a scale parameter.
I know MATLAB has Poisson and Gamma distributions built in, but I'm not sure how to create a custom distribution function incorporating both of these.
I want to create a function for sampling from processes that follow ARG dynamics. For example, MATLAB has
R = mvnrnd(MU,SIGMA)
to generate random numbers from the normal distribution. How can I create something similar for ARG?
Edit:
An attempted implementation:
function v = ARG(a, b, c, startval)
% One transition of the ARG(a, b, c) process, given the previous value startval
% a = delta
% b = rho
% c = (1-rho)*(1-gamma^2)/delta
v_lag = startval;
z = poissrnd(b*v_lag / c);   % z_t ~ Poisson(b*v_{t-1}/c)
v = gamrnd(a + z, c);        % v_t ~ Gamma(a + z_t, c), shape/scale
end
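For illustration of how the recursion could be iterated to produce a whole sample path, here is a minimal sketch written in R (the language used elsewhere on this page), assuming the shape/scale Gamma parameterization given above; the MATLAB version would follow the same loop around the one-step draw:
## hypothetical sketch: simulate n steps of an ARG(a, b, c) process starting from v0,
## using v_t ~ Gamma(shape = a + z_t, scale = c) with z_t ~ Poisson(b * v_{t-1} / c)
rARG <- function(n, a, b, c, v0) {
  v <- numeric(n)
  v_lag <- v0
  for (t in 1:n) {
    z <- rpois(1, lambda = b * v_lag / c)
    v[t] <- rgamma(1, shape = a + z, scale = c)
    v_lag <- v[t]
  }
  v
}
## e.g. path <- rARG(100, a = 2, b = 0.9, c = 0.1, v0 = 0.2)  (hypothetical parameter values)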

Why CLARA clustering method does not give the same classes as when I do clustering manually?

I am using CLARA (in 'cluster' package). This method is supposed to assign each observation to the closest 'medoid'. But when I calculate the distance of medoids and observations manually and assign them manually, the results are slightly different (1-2 percent of occurrence probability). Does anyone know how clara calculates dissimilarities and why I get different clustering results?
This is the function I use to do clustering manually:
Manual.Clustering <- function(Data, Clusters, Weights = NULL) {
  if (is.null(Weights)) Weights <- rep(1, length(Data))
  if (length(Weights) == 1) Weights <- rep(Weights, length(Data))
  Data2 <- Data[, rownames(Clusters)]
  Data2 <- Weights * Data2
  dist <- matrix(NA, nrow = nrow(Data), ncol = ncol(Clusters))
  for (i in 1:ncol(Clusters)) {
    dist[, i] <- Dist2Center(Data2, Clusters[, i], Weights = NULL)
  }
  classes <- apply(dist, 1, which.min)
  Out <- cbind(Data, classes)
  colnames(Out) <- c(colnames(Data), "Class")
  Freq <- FreqTable(Out[, "Class"])
  Freq <- as.data.frame(Freq)
  return(list(Data = Out, Freq = Freq))
}
=====================================
Dist2Center <- function(Data, Center, Weights = NULL) {
  if (is.null(Weights)) Weights <- matrix(rep(1, nrow(Data)), ncol = 1)
  if (length(Weights) == 1) Weights <- rep(Weights, nrow(Data))
  if (ncol(Data) != length(Center)) stop()
  Dist <- Weights * apply(Data, 1, function(x) { sqrt(sum((x - Center)^2, na.rm = T)) })
  return(Dist)
}
Data: Original Data.
Clusters: t(Medoids).
Medoids: 'medoids' picked by clara.
Dist2Center: A function which calculates the Euclidean distance of each observation from each medoid.
Behnam.
I found that this happens only when input data has NA values. For inputs without NAs, the results of my algorithm and clara are identical. I think this is related to how clara handles NA values while calculating the distances of observations to medoids. Any comment? Any suggestion for replacing clara with a better algorithm compatible with large datasets and NA values?
Having a look at the clara C code, I found that clara adjusts the distances if there are any missing values. The line "dsum *= (nobs / pp)" in the code shows that it counts the number of non-missing values in each pair of observations (nobs), divides it by the number of variables (pp), and multiplies the sum of squares by this ratio. That is why it does not give the same results as my algorithm.
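To mimic that behaviour in the manual calculation, the distance function above could apply the same adjustment. The following is only a sketch that reproduces the scaling quoted above (it is not code taken from the cluster package):
## NA-aware Euclidean distance of each observation to a single medoid,
## scaling the sum of squares by nobs/pp as in the quoted line
Dist2CenterNA <- function(Data, Center) {
  pp <- length(Center)                   # total number of variables
  apply(Data, 1, function(x) {
    ok <- !is.na(x) & !is.na(Center)     # coordinates observed in this pair
    nobs <- sum(ok)
    if (nobs == 0) return(NA_real_)
    dsum <- sum((x[ok] - Center[ok])^2)
    sqrt(dsum * (nobs / pp))
  })
}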

How to estimate goodness-of-fit using scipy.odr?

I am fitting data with weights using scipy.odr but I don't know how to obtain a measure of goodness-of-fit or an R squared. Does anyone have suggestions for how to obtain this measure using the output stored by the function?
The res_var attribute of the Output is the so-called reduced Chi-square value for the fit, a popular choice of goodness-of-fit statistic. It is somewhat problematic for non-linear fitting, though. You can look at the residuals directly (out.delta for the X residuals and out.eps for the Y residuals). Implementing a cross-validation or bootstrap method for determining goodness-of-fit, as suggested in the linked paper, is left as an exercise for the reader.
The output of ODR gives both the estimated parameters beta as well as the standard deviation of those parameters sd_beta. Following p. 76 of the ODRPACK documentation, you can convert these values into a t-statistic with (beta - beta_0) / sd_beta, where beta_0 is the number that you're testing significance with respect to (often zero). From there, you can use the t-distribution to get the p-value.
Here's a working example:
import numpy as np
from scipy import stats, odr
def linear_func(B, x):
    """
    From https://docs.scipy.org/doc/scipy/reference/odr.html
    Linear function y = m*x + b
    """
    # B is a vector of the parameters.
    # x is an array of the current x values.
    # x is in the same format as the x passed to Data or RealData.
    #
    # Return an array in the same format as y passed to Data or RealData.
    return B[0] * x + B[1]
np.random.seed(0)
sigma_x = .1
sigma_y = .15
N = 100
x_star = np.linspace(0, 10, N)
x = np.random.normal(x_star, sigma_x, N)
# the true underlying function is y = 2*x_star + 1
y = np.random.normal(2*x_star + 1, sigma_y, N)
linear = odr.Model(linear_func)
dat = odr.Data(x, y, wd=1./sigma_x**2, we=1./sigma_y**2)
this_odr = odr.ODR(dat, linear, beta0=[1., 0.])
odr_out = this_odr.run()
# degrees of freedom are n_samples - n_parameters
df = N - 2 # equivalently, df = odr_out.iwork[10]
beta_0 = 0 # test if slope is significantly different from zero
t_stat = (odr_out.beta[0] - beta_0) / odr_out.sd_beta[0] # t statistic for the slope parameter
p_val = stats.t.sf(np.abs(t_stat), df) * 2
print('Recovered equation: y={:3.2f}x + {:3.2f}, t={:3.2f}, p={:.2e}'.format(odr_out.beta[0], odr_out.beta[1], t_stat, p_val))
Recovered equation: y=2.00x + 1.01, t=239.63, p=1.76e-137
One note of caution in using this approach on nonlinear problems, from the same ODRPACK docs:
"Note that for nonlinear ordinary least squares, the linearized confidence regions and intervals are asymptotically correct as n → ∞ [Jennrich, 1969]. For the orthogonal distance regression problem, they have been shown to be asymptotically correct as σ∗ → 0 [Fuller, 1987]. The difference between the conditions of asymptotic correctness can be explained by the fact that, as the number of observations increases in the orthogonal distance regression problem one does not obtain additional information for ∆. Note also that Vˆ is dependent upon the weight matrix Ω, which must be assumed to be correct, and cannot be confirmed from the orthogonal distance regression results. Errors in the values of wǫi and wδi that form Ω will have an adverse affect on the accuracy of Vˆ and its component parts. The results of a Monte Carlo experiment examining the accuracy
of the linearized confidence intervals for four different measurement error models is presented in [Boggs and Rogers, 1990b]. Those results indicate that the confidence regions and intervals for ∆ are not as accurate as those for β.
Despite its potential inaccuracy, the covariance matrix is frequently used to construct confidence regions and intervals for both nonlinear ordinary least squares and measurement error models because the resulting regions and intervals are inexpensive to compute, often adequate, and familiar to practitioners. Caution must be exercised when using such regions and intervals, however, since the validity of the approximation will depend on the nonlinearity of the model, the variance and distribution of the errors, and the data itself. When more reliable intervals and regions are required, other more accurate methods should be used. (See, e.g., [Bates and Watts, 1988], [Donaldson and Schnabel, 1987], and [Efron, 1985].)"
As mentioned by R. Ken, chi-square or variance of the residuals is one of the more commonly used tests of goodness of fit. ODR stores the sum of squared residuals in out.sum_square and you can verify yourself that out.res_var = out.sum_square/degrees_freedom corresponds to what is commonly called reduced chi-square: i.e. the chi-square test result divided by its expected value.
As for the other very popular estimator of goodness of fit in linear regression, R squared and its adjusted version, we can define the functions
import numpy as np

def R_squared(observed, predicted, uncertainty=1):
    """ Returns R square measure of goodness of fit for predicted model. """
    weight = 1. / uncertainty
    return 1. - (np.var((observed - predicted) * weight) / np.var(observed * weight))

def adjusted_R(x, y, model, popt, unc=1):
    """
    Returns adjusted R squared test for optimal parameters popt calculated
    according to W-MN formula, other forms have different coefficients:
    Wherry/McNemar : (n - 1)/(n - p - 1)
    Wherry         : (n - 1)/(n - p)
    Lord           : (n + p - 1)/(n - p - 1)
    Stein          : (n - 1)/(n - p - 1) * (n - 2)/(n - p - 2) * (n + 1)/n
    """
    # Assuming you have a model with ODR argument order f(beta, x)
    # otherwise if model is of the form f(x, a, b, c..) you could use
    # R = R_squared(y, model(x, *popt), uncertainty=unc)
    R = R_squared(y, model(popt, x), uncertainty=unc)
    n, p = len(y), len(popt)
    coefficient = (n - 1) / (n - p - 1)
    adj = 1 - (1 - R) * coefficient
    return adj, R
From the output of your ODR run you can find the optimal values for your model's parameters in out.beta and at this point we have everything we need for computing R squared.
from scipy import odr

def lin_model(beta, x):
    """
    Linear function y = m*x + q
    slope m, constant term/y-intercept q
    """
    return beta[0] * x + beta[1]

linear = odr.Model(lin_model)
data = odr.RealData(x, y, sx=sigma_x, sy=sigma_y)
init = odr.ODR(data, linear, beta0=[1, 1])
out = init.run()

adjusted_Rsq, Rsq = adjusted_R(x, y, lin_model, popt=out.beta)