How do I calculate the 95% confidence interval with lsqcurvefit in MATLAB?

Due to some problems in MATLAB with fixed parameters, I had to switch from the standard fit command to lsqcurvefit.
With the normal fit command, one of the output parameters is gof, from which I can calculate the +/- of each parameter and the r^2 value.
That should be possible for lsqcurvefit as well, but I don't get it as one of the output parameters.
Or to put my question in other words: how do I calculate the +/- of a fit parameter from lsqcurvefit?
Can someone help me with that?
Thanks, Niko

Yep. Get all the output parameters of lsqcurvefit and use them in nlparci like so:
[x,resnorm,residual,exitflag,output,lambda,jacobian] = ...
    lsqcurvefit(@myfun,x0,xdata,ydata);
conf = nlparci(x,residual,'jacobian',jacobian)
Now conf contains an N x 2 matrix for your N fit parameters. Each row of conf gives the lower and upper bounds of the 95% confidence interval for the corresponding parameter.
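For reference, here is a minimal end-to-end sketch; the exponential model, the synthetic data, and the starting point are made up for illustration:
myfun = @(x,xdata) x(1)*exp(x(2)*xdata); % hypothetical model y = x(1)*exp(x(2)*t)
xdata = linspace(0, 3, 50);
ydata = 2*exp(-1.5*xdata) + 0.05*randn(size(xdata)); % synthetic noisy data
x0 = [1, -1]; % initial guess
[x,resnorm,residual,exitflag,output,lambda,jacobian] = ...
    lsqcurvefit(myfun, x0, xdata, ydata);
conf = nlparci(x, residual, 'jacobian', jacobian); % 95% CIs, one row per parameter
halfwidth = (conf(:,2) - conf(:,1)) / 2; % the "+/-" of each parameter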

Related

Matlab Gumbel Distribution (Extreme Maximum case)

The default MATLAB 'Extreme Value' distribution (also called a Gumbel distribution) is used for the extreme MIN case.
Given the mean m and standard deviation s of Gumbel-distributed random variables for the extreme MAX case, I can get the location and scale parameters using the standard relations $\sigma = s\sqrt{6}/\pi$ and $\mu = m - \gamma\sigma$, where $\gamma \approx 0.5772$ is the Euler-Mascheroni constant.
My question is: how do I transform the MATLAB 'Extreme Value' distribution from the MIN case to the MAX case (MATLAB says "using the negative of the original values")?
I would like to use MATLAB's icdf function; do I need to negate the location and scale parameters of the inputs?
Judging by the last paragraph of your question, you want the inverse CDF of the Gumbel max distribution. MATLAB offers the inverse CDF of the Gumbel min distribution as follows:
X = evinv(P,mu,sigma);
You can get the inverse CDF of the Gumbel max by:
X = -evinv(1-P, -mu, sigma);
Note that for computing the PDF or CDF different expressions hold (that can be similarly worked out based on the definition of the two distributions).
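As a quick sanity check (this assumes the Statistics Toolbox; mu and sigma below are placeholder values), the mirrored evinv call should agree with the generalized extreme value inverse CDF at shape k = 0, which is exactly the Gumbel max distribution:
mu = 2; sigma = 1.5; P = 0.95;
x1 = -evinv(1-P, -mu, sigma); % mirrored Gumbel min
x2 = gevinv(P, 0, sigma, mu); % GEV with k = 0 is the Gumbel max
% x1 and x2 agree (about 6.455 here)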
I have been working on the same problem and this is what I've concluded:
To create the Gumbel (extreme value type I) probability distribution for the maximum case in MATLAB from mu and sigma (the location and scale parameters), use the makedist function with the generalized extreme value distribution and set its shape parameter k equal to zero. This creates a mirror image of MATLAB's ev (extreme value) distribution, which covers the minimum case, and the mirror of the Gumbel min is the Gumbel max.
pd = makedist('GeneralizedExtremeValue','k',0,'sigma',sigma,'mu',mu);
So using the above command, all you have to do is replace sigma and mu with the values you've got.
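Once pd exists, the usual distribution functions work directly on it; for example (placeholder values for mu and sigma):
pd = makedist('GeneralizedExtremeValue','k',0,'sigma',1.5,'mu',2);
x95 = icdf(pd, 0.95); % 95th percentile of the Gumbel max distribution
p = cdf(pd, x95);     % recovers 0.95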
I am a student and this is my understanding of this problem.

Mixture of 1D Gaussians fit to data in Matlab / Python

I have a discrete curve y=f(x). I know the locations and amplitudes of its peaks. I want to approximate the curve by fitting a Gaussian at each peak. How should I go about finding the optimal Gaussian parameters? I would also like to know whether there is a built-in function that would make this task simpler.
Edit
I have fixed the means of the Gaussians and tried to optimize over sigma using lsqcurvefit() in MATLAB. The MSE is small. However, I have an additional hard constraint: the value of the approximate curve should equal the original function at the peaks. This constraint is not satisfied by my model. I am pasting my current working code here. I would like a solution that obeys the hard constraint at the peaks and approximately fits the curve at the other points. The basic idea is that the approximate curve has fewer parameters but still closely resembles the original curve.
fun = @(x,xdata) myFun(x,xdata,pks,locs); % pks, locs are the peak amplitudes and locations, already available
x0 = w(1:6)*0.25; % my initial guess based on domain knowledge
[sigma, resnorm] = lsqcurvefit(fun,x0,xdata,ydata); % xdata and ydata are the original curve data points
recons = myFun(sigma,xdata,pks,locs);
figure; plot(ydata,'r'); hold on; plot(recons);

function f = myFun(sigma,xdata,a,c)
% a holds the (constant) amplitudes, c the means of the individual Gaussians
f = zeros(size(xdata));
for i = 1:6 % use 6 Gaussians to approximate the function
    f = f + a(i) * exp(-(xdata-c(i)).^2 ./ (2*sigma(i)^2));
end
end
If you know your peak locations and amplitudes, then all you have left to do is find the width of each Gaussian. You can think of this as an optimization problem.
Say you have x and y, which are samples from the curve you want to approximate.
First, define a function g() that will construct the approximation for given values of the widths. g() takes a parameter vector sigma containing the width of each Gaussian. The locations and amplitudes of the Gaussians will be constrained to the values you already know. g() outputs the value of the sum-of-gaussians approximation at each point in x.
Now, define a loss function L(), which takes sigma as input. L(sigma) returns a scalar that measures the error, i.e. how badly the approximation using that sigma differs from the curve you're trying to approximate. The squared error is a common loss function for curve fitting:
L(sigma) = sum((y - g(sigma)) .^ 2)
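As a concrete sketch in MATLAB (assuming x and y are the samples mentioned above, and pks and locs hold the known amplitudes and means as in the question's code; the implicit expansion used here requires R2016b or later):
% Sum of Gaussians with fixed amplitudes pks and means locs; only the widths vary
g = @(sigma) sum(pks(:)' .* exp(-(x(:) - locs(:)').^2 ./ (2*sigma(:)'.^2)), 2);
% Scalar squared-error loss over the sampled curve
L = @(sigma) sum((y(:) - g(sigma)).^2);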
The task now is to search over possible values of sigma, and find the choice that minimizes the error. This can be done using a variety of optimization routines.
If you have the MathWorks Optimization Toolbox, you can use the function lsqnonlin() (in this case you won't have to define L() yourself). The Curve Fitting Toolbox is probably an alternative. Otherwise, you can use an open-source optimization routine (check out cvxopt).
A couple of things to note. You need to impose the constraint that all values in sigma are greater than zero; you can tell the optimization algorithm about this constraint. You'll also need to specify an initial guess for the parameters (i.e. sigma). In this case, you could probably choose something reasonable by looking at the curve in the vicinity of each peak. When the loss function is nonconvex, the final solution may depend on the initial guess (i.e. you converge to a local minimum). There are many fancy techniques for dealing with this kind of situation, but a simple one is to try several different initial guesses and pick the best result.
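A minimal sketch with lsqnonlin, reusing the g handle from above; lsqnonlin wants the vector of residuals rather than the scalar loss, and the lower-bound argument keeps the widths positive (the small floor and the flat initial guess are arbitrary choices):
res = @(sigma) y(:) - g(sigma);   % residual vector; lsqnonlin minimizes its squared norm
sigma0 = ones(numel(pks), 1);     % rough initial guess for the widths
lb = 1e-6 * ones(size(sigma0));   % enforce sigma > 0
sigmaHat = lsqnonlin(res, sigma0, lb, []); % [] means no upper bound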
Edited to add:
In Python, you can use the optimization routines in the scipy.optimize module, e.g. curve_fit().
Edit 2 (response to edited question):
If your Gaussians have much overlap with each other, then taking their sum may cause the height of the peaks to differ from your known values. In this case, you could take a weighted sum, and treat the weights as another parameter to optimize.
If you want the peak heights to be exactly equal to some specified values, you can enforce this constraint in the optimization problem. lsqcurvefit() won't be able to do it because it only handles bound constraints on the parameters. Take a look at fmincon().
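A hedged sketch of that idea with fmincon, reusing g, sigma0, and lb from above; gAt is a hypothetical helper evaluating the sum of Gaussians at the peak locations, and the nonlinear equality ceq pins the reconstructed curve to the known amplitudes there:
gAt = @(sigma) sum(pks(:)' .* exp(-(locs(:) - locs(:)').^2 ./ (2*sigma(:)'.^2)), 2);
obj = @(sigma) sum((y(:) - g(sigma)).^2);        % squared-error objective
nonlcon = @(sigma) deal([], gAt(sigma) - pks(:)); % c = [], ceq = 0 at the solution
sigmaHat = fmincon(obj, sigma0, [], [], [], [], lb, [], nonlcon);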
You can use the Expectation-Maximization (EM) algorithm to fit a mixture of Gaussians to your data; it doesn't care about the data's dimension.
In the MATLAB documentation, look up gmdistribution.fit or fitgmdist.
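Note that fitgmdist works on raw samples rather than on curve values, so it suits the case where you have draws from the mixture itself. A minimal usage sketch with synthetic data:
data = [randn(500,1); 4 + 0.5*randn(300,1)]; % samples from two Gaussians
gm = fitgmdist(data, 2);                     % EM fit of a 2-component 1-D mixture
% gm.mu, gm.Sigma, and gm.ComponentProportion hold the fitted parameters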

Optimization algorithm in Matlab

I want to calculate the maximum of the CROSS-IN-TRAY function, which is
$f(x_1, x_2) = -0.0001\left(\left|\sin(x_1)\sin(x_2)\exp\left(\left|100 - \frac{\sqrt{x_1^2 + x_2^2}}{\pi}\right|\right)\right| + 1\right)^{0.1}$
So I have made this function in Matlab:
function f = CrossInTray2(x)
%the CrossInTray2 objective function
%
f = 0.0001 *(( abs(sin(x(:,1)).* sin(x(:,2)).*exp(abs(100 - sqrt(x(:,1).^2 + x(:,2).^2)/3.14159 )) )+1 ).^0.1);
end
I multiplied the whole formula by (-1) so the function is inverted: when I look for the minimum of the inverted formula, it is actually the maximum of the original one.
Then, when I go to the Optimization Tool, select the GA algorithm, and define the lower and upper bounds as -3 and 3, it shows me a result after about 60 iterations of about 0.13, and the final point is something like [0, 9.34].
How is it possible that the final point is not in the range defined by the bounds? And what is the actual maximum of this function?
The maximum is at (0,0) (in fact, wherever either input is 0, and periodically at multiples of pi). After you negate, you're looking for the minimum of a positive quantity. Just looking at the outer absolute value, it obviously can't get lower than 0, and that trivially occurs when either sin term is 0.
Plugging in, you have f_min = f(0,0) = 0.0001*(0 + 1)^0.1 = 1e-4.
This expression is trivial to evaluate and plot over a 2-D grid. Do that until you understand what you're looking at and what the approximate answer should be, and only then invoke an actual optimizer. GA does not sound like a good candidate for a relatively smooth expression like this. The reason you're getting strange answers is that only one of the input parameters has to be 0: once the optimizer finds one of those, the other input could be anything.
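For instance, a quick look at the negated function over the given bounds (a sketch; the grid resolution is arbitrary):
[X1, X2] = meshgrid(linspace(-3, 3, 401));
F = 0.0001*(abs(sin(X1).*sin(X2).*exp(abs(100 - sqrt(X1.^2 + X2.^2)/pi))) + 1).^0.1;
surf(X1, X2, F, 'EdgeColor', 'none');
xlabel('x_1'); ylabel('x_2');
% The flat valleys along the lines x_1 = 0 and x_2 = 0 attain the minimum
% value 1e-4, which is why the optimizer can wander along either axis.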

Python and Matlab compute variances differently. Am I using the correct functions? [duplicate]

I am trying to convert MATLAB code to NumPy and found that NumPy produces a different result for the std function.
in matlab
std([1,3,4,6])
ans = 2.0817
in numpy
np.std([1,3,4,6])
1.8027756377319946
Is this normal? And how should I handle this?
The NumPy function np.std takes an optional parameter ddof: "Delta Degrees of Freedom". By default, this is 0. Set it to 1 to get the MATLAB result:
>>> np.std([1,3,4,6], ddof=1)
2.0816659994661326
To add a little more context, in the calculation of the variance (of which the standard deviation is the square root) we typically divide by the number of values we have.
But if we select a random sample of N elements from a larger distribution and calculate the variance, division by N can lead to an underestimate of the actual variance. To fix this, we can lower the number we divide by (the degrees of freedom) to a number less than N (usually N-1). The ddof parameter allows us to change the divisor by the amount we specify.
Unless told otherwise, NumPy will calculate the biased estimator for the variance (ddof=0, dividing by N). This is what you want if you are working with the entire distribution (and not a subset of values which have been randomly picked from a larger distribution). If the ddof parameter is given, NumPy divides by N - ddof instead.
The default behaviour of MATLAB's std is to correct the bias for sample variance by dividing by N-1. This gets rid of some (but probably not all) of the bias in the standard deviation. This is likely to be what you want if you're using the function on a random sample of a larger distribution.
The nice answer by @hbaderts gives further mathematical details.
The standard deviation is the square root of the variance. The variance of a random variable $X$ is defined as
$$\operatorname{Var}(X) = \mathbb{E}\left[(X - \mathbb{E}[X])^2\right]$$
An estimator for the variance would therefore be
$$s_N^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2$$
where $\bar{x}$ denotes the sample mean. For randomly selected $x_i$, it can be shown that this estimator does not converge to the real variance, but to
$$\frac{N-1}{N}\operatorname{Var}(X)$$
If you randomly select samples and estimate the sample mean and variance, you will have to use a corrected (unbiased) estimator
$$s^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2$$
which will converge to $\operatorname{Var}(X)$. The correction term $\frac{N}{N-1}$ is also called Bessel's correction.
Now, by default, MATLAB's std calculates the unbiased estimator with the divisor n-1. NumPy however (as @ajcr explained) calculates the biased estimator with no correction by default. The parameter ddof allows setting any divisor n-ddof; by setting it to 1 you get the same result as in MATLAB.
Similarly, MATLAB's std accepts a second parameter w, which specifies the "weighting scheme". The default, w=0, results in the divisor n-1 (unbiased estimator), while for w=1 the divisor is just n (biased estimator).
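The two conventions side by side in MATLAB (w plays the opposite role of NumPy's ddof):
x = [1, 3, 4, 6];
std(x)    % w = 0 (default): divides by n-1, gives 2.0817
std(x, 1) % w = 1: divides by n, gives 1.8028, matching NumPy's default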
For people who aren't great with statistics, a simple guide is:
Include ddof=1 if you're calculating np.std() for a sample taken from your full dataset.
Ensure ddof=0 if you're calculating np.std() for the full population.
ddof=1 is there to counterbalance the bias that arises because the sample mean is estimated from the same data.

Bootstrap and asymmetric CI

I'm trying to create a confidence interval for a set of data that is not normally distributed and is heavily right-skewed. Searching around, I discovered a pretty crude method that consists in using the 97.5th percentile (of my data) for the upper confidence limit and the 2.5th percentile for the lower one.
Unfortunately, I need a more sophisticated approach!
Then I discovered the bootstrap, specifically the MATLAB bootci function, but I'm having a hard time understanding how to use it properly.
Let's say that M is my matrix containing my data (19x100), and let's say that:
Mean = mean(M,2);
StdDev = sqrt(var(M'))';
How can I compute the asymmetric CI for every row of the Mean vector using bootci?
Note: earlier, I was computing the CI in a very wrong way, as Mean +/- 2*StdDev. Shame on me!
Let's say you have a 100x19 data set where each column has a different distribution. We'll choose the log-normal distribution, so that the data skew to the right.
means = repmat(log(1:19), 100, 1);
stdevs = ones(100, 19);
X = lognrnd(means, stdevs);
Notice that, within each column, all observations come from the same distribution, and each row is a separate observation. Most functions in MATLAB operate along the columns by default (treating rows as observations), so it's preferable to keep your data this way around.
You can compute bootstrap confidence intervals for the mean using the bootci function.
ci = bootci(1000, @mean, X);
This does 1000 resamplings of your data, calculates the mean of each resample, and builds a 95% interval from those bootstrap means (by default bootci uses the bias-corrected and accelerated percentile method rather than the raw 2.5% and 97.5% quantiles). To show that it's an asymmetric confidence interval about the mean, we can plot the mean and the confidence intervals for each column:
plot(mean(X), 'r')
hold on
plot(ci')
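Applied to the question's 19x100 matrix M, you just need to transpose it first so that the 100 observations run down the rows; the result then has one column of bounds per original row of M:
ci = bootci(1000, @mean, M'); % ci is 2x19: lower and upper bound for each row's mean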