Implementing Naïve Bayes algorithm in MATLAB - Need some guidance - matlab

I have a Binary classification problem that I need to do in MATLAB. There are two classes and the training data and testing data problems are from two classes and they are 2d coordinates drawn from Gaussian distributions.
The samples are 2D points and they are something like these (1000 samples for class A and 1000 samples for class B):
I am just posting some of them here:
5.867766 3.843014
5.019520 2.874257
1.787476 4.483156
4.494783 3.551501
1.212243 5.949315
2.216728 4.126151
2.864502 3.139245
1.532942 6.669650
6.569531 5.032038
2.552391 5.753817
2.610070 4.251235
1.943493 4.326230
1.617939 4.948345
If a new test data comes in, how should I classify the test sample?
P(Class/TestPoint) is proportional to P(TestPoint/Class) * (ProbabilityOfClass).
I am not sure of how we compute the P(Sample/Class) variable for the 2D coordinates given. Right now, I am using the formula
P(Coordinates/Class) = (Coordinates- mean for that class) / standard deviation of points in that class).
However, I am not getting very good test results with this. Am I doing anything wrong?

That is the good method, however the formula is not correct, look at the multivariate gaussian distribution article on wikipedia:
P(TestPoint|Class)=
,
where is the determinant of A.
Sigma = classPoint*classPoint';
mu = mean(classPoint,2);
proba = 1/((2*pi)^(2/2)*det(Sigma)^(1/2))*...
exp(-1/2*(testPoint-mu)*inv(Sigma)*(testPoint-mu)');
In your case, since they are as many points in both class, P(class)=1/2

Assuming your formula is correctly applied, another issue could be the derivation of features from your data points. Your problem might not be suited for a linear classifier.

Related

Getting covariance matrix in Spark Linear Regression

I have been looking into Spark's documentation but still couldn't find how to get covariance matrix after doing linear regression.
Given input training data, I did a very simple linear regression similar to this:
val lr = new LinearRegression()
val fit = lr.fit(training)
Getting regression parameters is as easy as fit.coefficients but there seems to be no information on how to get covariance matrix.
And just to clarify, I am looking for function similar to vcov in R. With this, I should be able to do something like vcov(fit) to get the covariance matrix. Any other methods that can help to achieve this are okay too.
EDIT
The explanation on how to get covariance matrix from linear regression is discussed in detail here. Standard deviation is easy to get as it is provided by fit.summary.meanSsquaredError. However, the parameter (X'X)-1 is hard to get. It would be interesting to see if this can be used to somehow calculate the covariance matrix.
Although the whole covariance matrix is collected on the driver, it is not possible to obtain it without making your own solver. You can do that by copying WLS and setting additional "getters".
Closest you can get without digging into the code is lrModel.summary.coefficientStandardErrors that is based on diagonal of inverted matrix (A^T * W * A) which is based on upper triangular matrix (covariance).
I don't think that is enough so sorry about that.

How to generate a 2D random vector in MATLAB?

I have a non-negative function f defined on a unit square S = [0,1] x [0,1] such that
My question is, how can I use MATLAB to generate a 2D random vector from S according to the probability density function f?
Rejection Sampling
The suggestion Luis Mendo made is very good because it applies to nearly all distribution functions. Based on this answer I wrote code for m.
An important point when using rejection sampling this way is that you must know the maximum of your pdf within the range. If you over-estimate the maximum your code will only run slower. If you under-estimate it it will create wrong numbers!
The idea is that you sample many uniform distributed points and accept depending on the probability density for the points.
pdf=#(x).5.*x(:,1)+3./2.*x(:,2);
maximum=2; %Right maximum for THIS EXAMPLE.
%If you are unable to determine the maximum of your
%function within the [0,1]x[0,1] range, please give an example.
result=[];
n=10;
while (size(result,1)<n)
%1. sample random point:
val=rand(1,2);
%2. Accept with probability pdf(val)/maximum
if rand<pdf(val)/maximum
%append to solution
result(end+1,:)=val;
end
end
I know that this solution is not a fast implementation, but I wanted to start with an implementation as simple as possible to make sure that the concept of rejection sampling becomes clear.
ICDF
Besides rejection sampling there is a different approach to address this issue on a more mathematical level, but you need to sit down and do some math first to end up with a better solution. For 1 dimensional distributions you typically sample using the ICDF (inverted cumulative density function) function simply using ICDF(rand(n,1)) to get random samples.
If you manage to do the math, you could instead for your PDF function define two functions ICDF1 (ICDF for the first dimension) and ICDF2 (ICDF for the second dimension) in matlab.
The first ICDF1 would map unifrom random distributed samples to sample values for the first dimension of your random distribution.
The second ICDF2 would map the output if ICDF1 and uniform distributed samples to your intended solution.
Here is some matlab code assuming you already defined ICDF1 and ICDF2
samples=ICDF1(rand(n,1));
samples(:,2)=ICDF2(samples,rand(n,1));
The great advantage of this solution is, that it does not reject any samples, being potentially much faster.

leave-one-out regression using lasso in Matlab

I have 300 data samples with around 4000 dimension feature each. Each input has a 5 dim. output which is in the range of -2 to 2. I am trying to fit a lasso model to it. I went through a few posts which talk about cross validation strategies like this one: Leave one out cross validation algorithm in matlab
But I saw that lasso does not support leaveout in Matlab! http://www.mathworks.com/help/stats/lasso.html
How can I train a model using leave one out cross validation and fit a model using lasso on my dataset? I am trying to do this in matlab. I would like to get a set of weights which I will be able to use for future predictions on other data.
I tried using glmnet: http://www.stanford.edu/~hastie/glmnet_matlab/intro.html but I couldn't compile it on my machine due to lack of proper mex compiler.
Any solutions to my problem? Thanks :)
EDIT
I am also trying to use lasso function in-built with MATLAB. It has an option to perform cross validation. It outputs B and Fit Statistics, where B is Fitted coefficients, a p-by-L matrix, where p is the number of predictors (columns) in X, and L is the number of Lambda values.
Now given a new test sample, how can I calculate the output using this model?
You can use a leave-one-out approach regardless of your training method. As explained here, you can use crossvalind to split the data into training and test sets.
[Train, Test] = crossvalind('LeaveMOut', N, M)

Goodness of fit with MATLAB and chi-square test

I would like to measure the goodness-of-fit to an exponential decay curve. I am using the lsqcurvefit MATLAB function. I have been suggested by someone to do a chi-square test.
I would like to use the MATLAB function chi2gof but I am not sure how I would tell it that the data is being fitted to an exponential curve
The chi2gof function tests the null hypothesis that a set of data, say X, is a random sample drawn from some specified distribution (such as the exponential distribution).
From your description in the question, it sounds like you want to see how well your data X fits an exponential decay function. I really must emphasize, this is completely different to testing whether X is a random sample drawn from the exponential distribution. If you use chi2gof for your stated purpose, you'll get meaningless results.
The usual approach for testing the goodness of fit for some data X to some function f is least squares, or some variant on least squares. Further, a least squares approach can be used to generate test statistics that test goodness-of-fit, many of which are distributed according to the chi-square distribution. I believe this is probably what your friend was referring to.
EDIT: I have a few spare minutes so here's something to get you started. DISCLAIMER: I've never worked specifically on this problem, so what follows may not be correct. I'm going to assume you have a set of data x_n, n = 1, ..., N, and the corresponding timestamps for the data, t_n, n = 1, ..., N. Now, the exponential decay function is y_n = y_0 * e^{-b * t_n}. Note that by taking the natural logarithm of both sides we get: ln(y_n) = ln(y_0) - b * t_n. Okay, so this suggests using OLS to estimate the linear model ln(x_n) = ln(x_0) - b * t_n + e_n. Nice! Because now we can test goodness-of-fit using the standard R^2 measure, which matlab will return in the stats structure if you use the regress function to perform OLS. Hope this helps. Again I emphasize, I came up with this off the top of my head in a couple of minutes, so there may be good reasons why what I've suggested is a bad idea. Also, if you know the initial value of the process (ie x_0), then you may want to look into constrained least squares where you bind the parameter ln(x_0) to its known value.

scaling when sampling from multivariate gaussian

I have a data matrix A (with dependencies between columns) of which I estimate the covariance matrix S. I now want to use this covariance matrix to simulate a new matrix A_sim. Since I assume that the underlying data generator of A was gaussian, I can simply sample from a gaussian specified by S. I do that in matlab as follows:
A_sim = randn(size(A))*chol(S);
However, the values in A_sim are way larger than in A. if I scale down S by a factor of 100, A_sim looks much better. I am now looking for a way to determine this scaling factor in a principled way. can anyone give advise or suggest literature that might be helpful?
Matlab has the function mvnrnd which generates multivariate random variables for you.