Decompose a 2-peaked pdf into 2 elementary pdfs - matlab

How could I decompose a two-peaked (empirical) PDF into two elementary PDFs, say lognormals or other appropriate distributions, in a straightforward way? I'd prefer to do it in Matlab.
Thanks!

What you are looking for is called a mixture density, defined as p(x) = sum_i a_i p_i(x), where sum_i a_i = 1 and each p_i(x) is itself a density function. The most widely used such model is the Gaussian mixture density, in which each p_i(x) is a Gaussian density; the Statistics Toolbox in Matlab has functions to fit one (see fitgmdist). More generally, the p_i(x) can be any densities. The customary algorithm to fit the parameters is the expectation-maximization (EM) algorithm. A web search should turn up plenty of references, and probably some Matlab code as well.
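For instance, a minimal sketch of the Gaussian-mixture route (assumes the Statistics and Machine Learning Toolbox and that x is a column vector of your samples; for a lognormal mixture, a common trick is to fit the Gaussian mixture to log(x) instead):
gm = fitgmdist(x, 2);              % EM fit of a 2-component Gaussian mixture
disp(gm.mu); disp(gm.Sigma);       % component means and variances
disp(gm.ComponentProportion);      % the mixing weights a_i
xi = linspace(min(x), max(x), 200)';
plot(xi, pdf(gm, xi));             % plot the fitted mixture density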

Related

What are the differences between different gaussian functions in Matlab?

y = gauss(x,s,m)
Y = normpdf(X,mu,sigma)
R = normrnd(mu,sigma)
What are the basic differences between these three functions?
Y = normpdf(X,mu,sigma) is the probability density function for a normal distribution with mean mu and stdev sigma. Use this if you want to know the relative likelihood at a point X.
R = normrnd(mu,sigma) takes random samples from the same distribution as above. So use this function if you want to simulate something based on the normal distribution.
y = gauss(x,s,m) at first glance looks like the exact same function as normpdf(). But there is a slight difference: Its calculation is
Y = exp(-(X-M).^2 ./ S.^2) ./ (sqrt(2*pi).*S)
while normpdf() uses
Y = exp(-(X-M).^2 ./ (2*S.^2)) ./ (sqrt(2*pi).*S)
This means that the integral of gauss() from -inf to inf is 1/sqrt(2), so it isn't a legitimate PDF. I have no clue where one would use something like this.
For completeness we also have to mention p = normcdf(x,mu,sigma). This is the normal cumulative distribution function. It gives the probability that a value is between -inf and x.
A few more insights to add to Leander's good answer:
When comparing functions it is good to check their source or toolbox. gauss is not a function written by MathWorks, so it may well be redundant with a function that ships with Matlab.
Also, both normpdf and normrnd are part of the Statistics and Machine Learning Toolbox, so users without it cannot call them. However, generating random numbers from a normal distribution is such a common task that it should be accessible to users who have only core Matlab. Hence core Matlab provides randn, which makes normrnd largely redundant: normrnd(mu,sigma) is equivalent to mu + sigma.*randn().
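A quick check of the overlap (a minimal sketch; the first line of each pair needs the Statistics and Machine Learning Toolbox, the second is core Matlab):
y1 = normpdf(1.5, 0, 2);                        % toolbox
y2 = exp(-(1.5-0)^2/(2*2^2)) / (sqrt(2*pi)*2);  % same value, core Matlab
r1 = normrnd(3, 2, 1, 5);                       % toolbox: five N(3, 2^2) samples
r2 = 3 + 2*randn(1, 5);                         % same distribution, core Matlab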

Mixture of 1D Gaussians fit to data in Matlab / Python

I have a discrete curve y = f(x). I know the locations and amplitudes of its peaks, and I want to approximate the curve by fitting a Gaussian at each peak. How should I go about finding the optimal Gaussian parameters? I would also like to know if there is a built-in function that would make the task simpler.
Edit
I have fixed the means of the Gaussians and tried to optimize over sigma using
lsqcurvefit() in Matlab. The MSE is small. However, I have an additional hard constraint: the value of the approximate curve must equal the original function at the peaks, and my current model does not satisfy it. I am pasting the current working code below. I would like a solution that obeys the hard constraint at the peaks and approximately fits the curve at the other points. The basic idea is that the approximate curve has fewer parameters but still closely resembles the original curve.
fun = @(x,xdata) myFun(x,xdata,pks,locs); % pks, locs: the known peak amplitudes and locations
x0 = w(1:6)*0.25;                         % initial guess based on domain knowledge
[sigma, resnorm] = lsqcurvefit(fun,x0,xdata,ydata); % xdata, ydata: the original curve samples
recons = myFun(sigma,xdata,pks,locs);
figure; plot(ydata,'r'); hold on; plot(recons);
function f = myFun(sigma,xdata,a,c)
% a: fixed peak amplitudes, c: fixed means of the individual Gaussians
f = zeros(size(xdata));
for i = 1:6   % use 6 Gaussians to approximate the function
    f = f + a(i) * exp(-(xdata-c(i)).^2 ./ (2*sigma(i)^2));
end
end
If you know your peak locations and amplitudes, then all you have left to do is find the width of each Gaussian. You can think of this as an optimization problem.
Say you have x and y, which are samples from the curve you want to approximate.
First, define a function g() that will construct the approximation for given values of the widths. g() takes a parameter vector sigma containing the width of each Gaussian. The locations and amplitudes of the Gaussians will be constrained to the values you already know. g() outputs the value of the sum-of-gaussians approximation at each point in x.
Now, define a loss function L(), which takes sigma as input. L(sigma) returns a scalar that measures the error--how badly the given approximation (using sigma) differs from the curve you're trying to approximate. The squared error is a common loss function for curve fitting:
L(sigma) = sum((y - g(sigma)) .^ 2)
The task now is to search over possible values of sigma, and find the choice that minimizes the error. This can be done using a variety of optimization routines.
If you have the MathWorks Optimization Toolbox, you can use the function lsqnonlin() (in this case you won't have to define L() yourself). The Curve Fitting Toolbox probably offers an alternative. Otherwise, you can use an open-source optimization routine (check out cvxopt).
A couple of things to note. You need to impose the constraint that all values in sigma are greater than zero, and you can tell the optimization algorithm about this constraint. You'll also need to specify an initial guess for the parameters (i.e. sigma); in this case you could probably choose something reasonable by looking at the curve in the vicinity of each peak. When the loss function is nonconvex, the final solution may depend on the initial guess (i.e. you converge to a local minimum). There are many fancy techniques for dealing with this kind of situation, but a simple one is to just try several different initial guesses and pick the best result.
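A minimal sketch of that recipe with lsqnonlin (Optimization Toolbox; amp and loc are hypothetical names for the known amplitude and location vectors, x and y are samples of the curve, and implicit expansion needs R2016b or later):
g = @(sigma) sum(amp(:) .* exp(-(x(:)' - loc(:)).^2 ./ (2*sigma(:).^2)), 1); % sum of Gaussians at each x
res = @(sigma) y(:)' - g(sigma);       % residuals; lsqnonlin squares and sums these itself
sigma0 = ones(numel(amp), 1);          % initial guess; refine it from the data near each peak
lb = 1e-6*ones(size(sigma0));          % the sigma > 0 constraint, as a strictly positive lower bound
sigmaFit = lsqnonlin(res, sigma0, lb);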
Edited to add:
In python, you can use optimization routines in the scipy.optimize module, e.g. curve_fit().
Edit 2 (response to edited question):
If your Gaussians have much overlap with each other, then taking their sum may cause the height of the peaks to differ from your known values. In this case, you could take a weighted sum, and treat the weights as another parameter to optimize.
If you want the peak heights to be exactly equal to some specified values, you can enforce this constraint in the optimization problem. lsqcurvefit() won't be able to do it because it only handles bound constraints on the parameters. Take a look at fmincon().
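A minimal sketch of that constrained fit (assumes myFun, xdata, ydata, pks, locs from the question are in scope; the equality constraint ceq = 0 forces the fitted curve through the known peak values):
loss = @(sigma) sum((ydata - myFun(sigma,xdata,pks,locs)).^2);    % same squared error as before
nonlcon = @(sigma) deal([], myFun(sigma,locs,pks,locs) - pks);    % [c, ceq]: exact match at the peaks
lb = 1e-6*ones(1,6);                                              % keep the widths strictly positive
sigmaFit = fmincon(loss, 0.25*ones(1,6), [],[],[],[], lb, [], nonlcon);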
You can use the expectation-maximization algorithm to fit a mixture of Gaussians to your data; it doesn't care about the data dimension.
In the MATLAB documentation, look up gmdistribution.fit or fitgmdist.

How to customize an SVM non-linear decision boundary in MATLAB?

I want to train an SVM with a non-linear decision boundary. The boundary is known and is given by the formula
y = sgn( (w11*x1+ w12*x2 + w13*x3)* (w21*x4+ w22*x5 + w23*x6) ), where [x1 x2 ... x6] are 1-bit inputs, [w11 w12 w13 w21 w22 w23] are unknown parameters.
How can I learn [w11 w12 w13 w21 w22 w23] with train data?
SVM is not an algorithm for such a task. SVM has its own criterion to maximize, which has nothing to do with the shape of the decision boundary (ok, not nothing, but it is hard to convert one into the other). Obviously, one can try to predefine a custom kernel function to achieve this, but the task seems nearly unsolvable (I can't think of any reproducing kernel Hilbert space with such decision boundaries).
In short: your question is a bit like "how to make a watermelon remove nails from the wall?". Obviously - you can do some pretty hard "magic" to do so, but this is not what watermelons are for.

scaling when sampling from multivariate gaussian

I have a data matrix A (with dependencies between columns) from which I estimate the covariance matrix S. I now want to use this covariance matrix to simulate a new matrix A_sim. Since I assume that the underlying data generator of A was Gaussian, I can simply sample from a Gaussian specified by S. I do that in Matlab as follows:
A_sim = randn(size(A))*chol(S);
However, the values in A_sim are much larger than in A. If I scale S down by a factor of 100, A_sim looks much better. I am now looking for a way to determine this scaling factor in a principled way. Can anyone give advice or suggest literature that might be helpful?
Matlab has the function mvnrnd, which generates multivariate normal random samples for you.
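A minimal sketch (assumes A is an n-by-d data matrix; note that mvnrnd also wants the mean, and that cov() centers the data before estimating S, which may be where the scale discrepancy crept in):
mu = mean(A, 1);                  % sample mean of each column
S  = cov(A);                      % sample covariance (cov centers the data)
A_sim = mvnrnd(mu, S, size(A,1)); % size(A,1) draws from N(mu, S)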

Implementing Naïve Bayes algorithm in MATLAB - Need some guidance

I have a binary classification problem that I need to solve in MATLAB. There are two classes, and both the training and testing data are 2D coordinates drawn from Gaussian distributions.
The samples are 2D points and they are something like these (1000 samples for class A and 1000 samples for class B):
I am just posting some of them here:
5.867766 3.843014
5.019520 2.874257
1.787476 4.483156
4.494783 3.551501
1.212243 5.949315
2.216728 4.126151
2.864502 3.139245
1.532942 6.669650
6.569531 5.032038
2.552391 5.753817
2.610070 4.251235
1.943493 4.326230
1.617939 4.948345
If a new test data comes in, how should I classify the test sample?
P(Class|TestPoint) is proportional to P(TestPoint|Class) * P(Class).
I am not sure how to compute P(TestPoint|Class) for the given 2D coordinates. Right now I am using the formula
P(Coordinates|Class) = (Coordinates - mean for that class) / (standard deviation of points in that class).
However, I am not getting very good test results with this. Am I doing anything wrong?
That is the right method, but the formula is not correct; look at the multivariate normal distribution article on Wikipedia:
P(TestPoint|Class) = 1/((2*pi)^(d/2) * det(Sigma)^(1/2)) * exp(-1/2*(x-mu)'*inv(Sigma)*(x-mu)),
where d is the dimension (here d = 2), mu is the class mean, and det(Sigma) is the determinant of the class covariance matrix Sigma. In Matlab:
mu = mean(classPoint,2);               % class mean (classPoint is 2-by-N)
Xc = classPoint - mu;                  % center the data first (R2016b+; use bsxfun otherwise)
Sigma = Xc*Xc'/(size(classPoint,2)-1); % sample covariance matrix
proba = 1/((2*pi)^(2/2)*det(Sigma)^(1/2))*...
    exp(-1/2*(testPoint-mu)'*(Sigma\(testPoint-mu))); % testPoint as a 2-by-1 column
In your case, since there are as many points in each class, P(Class) = 1/2.
Assuming your formula is correctly applied, another issue could be the derivation of features from your data points. Your problem might not be suited for a linear classifier.
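A minimal sketch of the whole classifier (hypothetical names: XA and XB are the 1000-by-2 training matrices, p is a 1-by-2 test point; mvnpdf from the Statistics Toolbox evaluates the formula above, using full covariances rather than the diagonal ones of a strictly naive Bayes model):
muA = mean(XA); SA = cov(XA);  % per-class mean and covariance
muB = mean(XB); SB = cov(XB);
likA = mvnpdf(p, muA, SA);     % P(p | A)
likB = mvnpdf(p, muB, SB);     % P(p | B)
% Equal priors (1000 samples each), so compare likelihoods directly:
if likA > likB, label = 'A'; else, label = 'B'; end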