Mixture of 1D Gaussians fit to data in Matlab / Python - matlab

I have a discrete curve y=f(x). I know the locations and amplitudes of peaks. I want to approximate the curve by fitting a gaussian at each peak. How should I go about finding the optimized gaussian parameters ? I would like to know if there is any inbuilt function which will make my task simpler.
Edit
I have fixed mean of gaussians and tried to optimize on sigma using
lsqcurvefit() in matlab. MSE is less. However, I have an additional hard constraint that the value of approximate curve should be equal to the original function at the peaks. This constraint is not satisfied by my model. I am pasting current working code here. I would like to have a solution which obeys the hard constraint at peaks and approximately fits the curve at other points. The basic idea is that the approximate curve has fewer parameters but still closely resembles the original curve.
fun = #(x,xdata)myFun(x,xdata,pks,locs); %pks,locs are the peak locations and amplitudes already available
x0=w(1:6)*0.25; % my initial guess based on domain knowledge
[sigma resnorm] = lsqcurvefit(fun,x0,xdata,ydata); %xdata and ydata are the original curve data points
recons = myFun(sigma,xdata,pks,locs);
figure;plot(ydata,'r');hold on;plot(recons);
function f=myFun(sigma,xdata,a,c)
% a is constant , c is mean of individual gaussians
f=zeros(size(xdata));
for i = 1:6 %use 6 gaussians to approximate function
f = f + a(i) * exp(-(xdata-c(i)).^2 ./ (2*sigma(i)^2));
end
end

If you know your peak locations and amplitudes, then all you have left to do is find the width of each Gaussian. You can think of this as an optimization problem.
Say you have x and y, which are samples from the curve you want to approximate.
First, define a function g() that will construct the approximation for given values of the widths. g() takes a parameter vector sigma containing the width of each Gaussian. The locations and amplitudes of the Gaussians will be constrained to the values you already know. g() outputs the value of the sum-of-gaussians approximation at each point in x.
Now, define a loss function L(), which takes sigma as input. L(sigma) returns a scalar that measures the error--how badly the given approximation (using sigma) differs from the curve you're trying to approximate. The squared error is a common loss function for curve fitting:
L(sigma) = sum((y - g(sigma)) .^ 2)
The task now is to search over possible values of sigma, and find the choice that minimizes the error. This can be done using a variety of optimization routines.
If you have the Mathworks optimization toolbox, you can use the function lsqnonlin() (in this case you won't have to define L() yourself). The curve fitting toolbox is probably an alternative. Otherwise, you can use an open source optimization routine (check out cvxopt).
A couple things to note. You need to impose the constraint that all values in sigma are greater than zero. You can tell the optimization algorithm about this constraint. Also, you'll need to specify an initial guess for the parameters (i.e. sigma). In this case, you could probably choose something reasonable by looking at the curve in the vicinity of each peak. It may be the case (when the loss function is nonconvex) that the final solution is different, depending on the initial guess (i.e. you converge to a local minimum). There are many fancy techniques for dealing with this kind of situation, but a simple thing to do is to just try with multiple different initial guesses, and pick the best result.
Edited to add:
In python, you can use optimization routines in the scipy.optimize module, e.g. curve_fit().
Edit 2 (response to edited question):
If your Gaussians have much overlap with each other, then taking their sum may cause the height of the peaks to differ from your known values. In this case, you could take a weighted sum, and treat the weights as another parameter to optimize.
If you want the peak heights to be exactly equal to some specified values, you can enforce this constraint in the optimization problem. lsqcurvefit() won't be able to do it because it only handles bound constraints on the parameters. Take a look at fmincon().

you can use Expectation–Maximization algorithm for fitting Mixture of Gaussians on your data. it don't care about data dimension.
in documentation of MATLAB you can lookup gmdistribution.fit or fitgmdist.

Related

Double numerical integration in Matlab - Singularity

I have discrete data of a 2D function defined as
theta = linspace(0,pi,nTheta);
phi = linspace(0,2*pi,nPhi);
p=zeros(nPhi,nTheta);%only to show the dimension of my matrix
[np,nt]=ndgrid(phi,theta);
f1 = griddedInterpolant(np,nt,p,'spline');
f2= #(np,nt) f1(np,nt);
integral2(f2,0,2*pi,0,pi)
Note that p is calculated from a complex physical problem, but i showed above how it is initialized.
Also, I can increase nTheta and nPhi, which leads to more accurate calculation of p.
My calculated function (with nPhi=400,nTheta=200) is something like:
I tried 3 ways :
using Trapz function
using the code above but with linear interpolation for gridded interpolant
using the code above with spline interpolation
Although the spline is better than others, i still need to increase nPhi and nTheta, which makes it impossible for me to do the simulation due to its cost.
Is there any suggestion except these 3 methods or any general suggestion how i can do this computation more efficient? (I also took advantage of the symmetry in both directions)
Note that the shape of my function varies in each time step, so a local mesh refinement might be challenging because i don't know the detail of my function in advance.

Should I perform data centering before apply SVD?

I have to use SVD in Matlab to obtain a reduced version of my data.
I've read that the function svds(X,k) performs the SVD and returns the first k eigenvalues and eigenvectors. There is not mention in the documentation if the data have to be normalized.
With normalization I mean both substraction of the mean value and division by the standard deviation.
When I implemented PCA, I used to normalize in such way. But I know that it is not needed when using the matlab function pca() because it computes the covariance matrix by using cov() which implicitly performs the normalization.
So, the question is. I need the projection matrix useful to reduce my n-dim data to k-dim ones by SVD. Should I perform data normalization of the train data (and therefore, the same normalization to further projected new data) or not?
Thanks
Essentially, the answer is yes, you should typically perform normalization. The reason is that features can have very different scalings, and we typically do not want to take scaling into account when considering the uniqueness of features.
Suppose we have two features x and y, both with variance 1, but where x has a mean of 1 and y has a mean of 1000. Then the matrix of samples will look like
n = 500; % samples
x = 1 + randn(n,1);
y = 1000 + randn(n,1);
svd([x,y])
But the problem with this is that the scale of y (without normalizing) essentially washes out the small variations in x. Specifically, if we just examine the singular values of [x,y], we might be inclined to say that x is a linear factor of y (since one of the singular values is much smaller than the other). But actually, we know that that is not the case since x was generated independently.
In fact, you will often find that you only see the "real" data in a signal once we remove the mean. At the extremely end, you could image that we have some feature
z = 1e6 + sin(t)
Now if somebody just gave you those numbers, you might look at the sequence
z = 1000001.54, 1000001.2, 1000001.4,...
and just think, "that signal is boring, it basically is just 1e6 plus some round off terms...". But once we remove the mean, we see the signal for what it actually is... a very interesting and specific one indeed. So long story short, you should always remove the means and scale.
It really depends on what you want to do with your data. Centering and scaling can be helpful to obtain principial components that are representative of the shape of the variations in the data, irrespective of the scaling. I would say it is mostly needed if you want to further use the principal components itself, particularly, if you want to visualize them. It can also help during classification since your scores will then be normalized which may help your classifier. However, it depends on the application since in some applications the energy also carries useful information that one should not discard - there is no general answer!
Now you write that all you need is "the projection matrix useful to reduce my n-dim data to k-dim ones by SVD". In this case, no need to center or scale anything:
[U,~] = svd(TrainingData);
RecudedData = U(:,k)'*TestData;
will do the job. The svds may be worth considering when your TrainingData is huge (in both dimensions) so that svd is too slow (if it is huge in one dimension, just apply svd to the gram matrix).
It depends!!!
A common use in signal processing where it makes no sense to normalize is noise reduction via dimensionality reduction in correlated signals where all the fearures are contiminated with a random gaussian noise with the same variance. In that case if the magnitude of a certain feature is twice as large it's snr is also approximately twice as large so normalizing the features makes no sense since it would just make the parts with the worse snr larger and the parts with the good snr smaller. You also don't need to subtract the mean in that case (like in PCA), the mean (or dc) isn't different then any other frequency.

Creating a 1D Second derivative of gaussian Window

In MATLAB I need to generate a second derivative of a gaussian window to apply to a vector representing the height of a curve. I need the second derivative in order to determine the locations of the inflection points and maxima along the curve. The vector representing the curve may be quite noise hence the use of the gaussian window.
What is the best way to generate this window?
Is it best to use the gausswin function to generate the gaussian window then take the second derivative of that?
Or to generate the window manually using the equation for the second derivative of the gaussian?
Or even is it best to apply the gaussian window to the data, then take the second derivative of it all? (I know these last two are mathematically the same, however with the discrete data points I do not know which will be more accurate)
The maximum length of the height vector is going to be around 100-200 elements.
Thanks
Chris
I would create a linear filter composed of the weights generated by the second derivative of a Gaussian function and convolve this with your vector.
The weights of a second derivative of a Gaussian are given by:
Where:
Tau is the time shift for the filter. If you are generating weights for a discrete filter of length T with an odd number of samples, set tau to zero and allow t to vary from [-T/2,T/2]
sigma - varies the scale of your operator. Set sigma to a value somewhere between T/6. If you are concerned about long filter length then this can be reduced to T/4
C is the normalising factor. This can be derived algebraically but in practice I always do this numerically after calculating the filter weights. For unity gain when smoothing periodic signals, I will set C = 1 / sum(G'').
In terms of your comment on the equivalence of smoothing first and taking a derivative later, I would say it is more involved than that. As which derivative operator would you use in the second step? A simple central difference would not yield the same results.
You can get an equivalent (but approximate) response to a second derivative of a Gaussian by filtering the data with two Gaussians of different scales and then taking the point-wise differences between the two resulting vectors. See Difference of Gaussians for that approach.

Matlab recursive curve fitting with custom equations

I have a curve IxV. I also have an equation that I want to fit in this IxV curve, so I can adjust its constants. It is given by:
I = I01(exp((V-R*I)/(n1*vth))-1)+I02(exp((V-R*I)/(n2*vth))-1)
vth and R are constants already known, so I only want to achieve I01, I02, n1, n2. The problem is: as you can see, I is dependent on itself. I was trying to use the curve fitting toolbox, but it doesn't seem to work on recursive equations.
Is there a way to make the curve fitting toolbox work on this? And if there isn't, what can I do?
Assuming that I01 and I02 are variables and not functions, then you should set the problem up like this:
a0 = [I01 I02 n1 n2];
MinFun = #(a) abs(a(1)*(exp(V-R*I)/(a(3)*vth))-1) + a(2)*(exp((V-R*I)/a(4)*vth))-1) - I);
aout = fminsearch(a0,MinFun);
By subtracting I and taking the absolute value, the point where both sides are equal will be the point where MinFun is zero (minimized).
No, the CFTB cannot fit such recursively defined functions. And errors in I, since the true value of I is unknown for any point, will create a kind of errors in variables problem. All you have are the "measured" values for I.
The problem of errors in I MAY be serious, since any errors in I, or lack of fit, noise, model problems, etc., will be used in the expression itself. Then you exponentiate these inaccurate values, potentially casing a mess.
You may be able to use an iterative approach. Thus something like
% 0. Initialize I_pred
I_pred = I;
% 1. Estimate the values of your coefficients, for this model:
% (The curve fitting toolbox CAN solve this problem, given I_pred)
I = I01(exp((V-R*I_pred)/(n1*vth))-1)+I02(exp((V-R*I_pred)/(n2*vth))-1)
% 2. Generate new predictions for I_pred
I_pred = I01(exp((V-R*I_pred)/(n1*vth))-1)+I02(exp((V-R*I_pred)/(n2*vth))-1)
% Repeat steps 1 and 2 until the parameters from the CFTB stabilize.
The above pseudo-code will work only if your starting values are good, and there are not large errors/noise in the model/data. Even on a good day, the above approach may not converge well. But I see little hope otherwise.

Goodness of fit with MATLAB and chi-square test

I would like to measure the goodness-of-fit to an exponential decay curve. I am using the lsqcurvefit MATLAB function. I have been suggested by someone to do a chi-square test.
I would like to use the MATLAB function chi2gof but I am not sure how I would tell it that the data is being fitted to an exponential curve
The chi2gof function tests the null hypothesis that a set of data, say X, is a random sample drawn from some specified distribution (such as the exponential distribution).
From your description in the question, it sounds like you want to see how well your data X fits an exponential decay function. I really must emphasize, this is completely different to testing whether X is a random sample drawn from the exponential distribution. If you use chi2gof for your stated purpose, you'll get meaningless results.
The usual approach for testing the goodness of fit for some data X to some function f is least squares, or some variant on least squares. Further, a least squares approach can be used to generate test statistics that test goodness-of-fit, many of which are distributed according to the chi-square distribution. I believe this is probably what your friend was referring to.
EDIT: I have a few spare minutes so here's something to get you started. DISCLAIMER: I've never worked specifically on this problem, so what follows may not be correct. I'm going to assume you have a set of data x_n, n = 1, ..., N, and the corresponding timestamps for the data, t_n, n = 1, ..., N. Now, the exponential decay function is y_n = y_0 * e^{-b * t_n}. Note that by taking the natural logarithm of both sides we get: ln(y_n) = ln(y_0) - b * t_n. Okay, so this suggests using OLS to estimate the linear model ln(x_n) = ln(x_0) - b * t_n + e_n. Nice! Because now we can test goodness-of-fit using the standard R^2 measure, which matlab will return in the stats structure if you use the regress function to perform OLS. Hope this helps. Again I emphasize, I came up with this off the top of my head in a couple of minutes, so there may be good reasons why what I've suggested is a bad idea. Also, if you know the initial value of the process (ie x_0), then you may want to look into constrained least squares where you bind the parameter ln(x_0) to its known value.