Naive Bayes model using "carsmall" example data - matlab

I want to build a Naive Bayes model in Matlab with the carsmall data.
This is my code:
load carsmall
car = table(Model_Year, Weight);
naive_model = fitcnb(car, Origin)
But I get this error and I don't know why. Can anybody tell me where the error is?
Error using ClassificationNaiveBayes/findNoDataCombos (line 256)
A normal distribution cannot be fit for the combination of class Italy and predictor Model_Year. The data has zero variance.

Since the class "Italy" appears only once, its data has zero variance, so a normal distribution cannot be fitted for it. That is what causes the fitcnb error; removing this observation makes it work.
I also suggest organizing your code a bit more; maybe you were going to do this later, but it is good practice. So here is the new code with a bit more detail.
clear all
load carsmall
X = [Model_Year Weight];
Y = cellstr(Origin);
%The next line helps to see how many classnames you have
tabulate(Y);
Y(36) = [];   % removing the only case of Italy
X(36,:) = []; % removing the only case of Italy
%Train a naive Bayes classifier. It is good practice to specify the class order.
naive_model = fitcnb(X, Y,'ClassNames',{'USA','France','Japan','Germany','Sweden'});
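A hedged usage sketch (not part of the original answer): once the model trains without error, you can sanity-check it on the training data itself.
label = predict(naive_model, X);       % predicted origin for every remaining car
resub_err = resubLoss(naive_model);    % fraction of training samples misclassified
fprintf('Resubstitution error: %.2f\n', resub_err);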


How to use `crossval` in matlab for a Leave one Out Validation method

I have been reading the documentation (here and here) but it is really unclear to me, and I don't see how to use crossval in practice to do a leave-one-out cross-validation.
vals = crossval(fun,X)
vals = crossval(fun,X,Y,...)
mse = crossval('mse',X,y,'Predfun',predfun)
mcr = crossval('mcr',X,y,'Predfun',predfun)
val = crossval(criterion,X1,X2,...,y,'Predfun',predfun)
vals = crossval(...,'name',value)
I really don't understand the fun part.
I have estimated the chlorophyll rate with different indices. Then I did a linear regression between those indices and the chlorophyll rate measured in the field. Now I want to validate them: one of my estimates is a column with 22 entries, so I want to use 21 of them for training and 1 as a test, and do 22 loops so that every entry has been used as the test once.
But I don't know where I should put the regression model. If my regression is Y=aX+b,
do I re-use the a and b calculated before for the training part, or do I do a new linear regression with the training part and then see what the test gives with that?
I am not sure I totally understood how to build a leave-one-out model.
Then I want to know the result of the test by calculating the RMSE (and maybe the R²).
How do I code that using crossval?
I saw the answer to the question here, but I don't have access to the crossvalind function with my license.
Well, I finally figured it out; this is my script.
First I load my data and define the linear regression function:
X=indicesCha_without_Cloud(:,3);
y=Cha_g_m2t_without_Cloud(:,3);
testval = @(XTRAIN,ytrain,XTEST) Linear_regression_indices(XTRAIN,ytrain,XTEST);
where, in my case, fun (in the MathWorks help) is testval, and Linear_regression_indices is a very simple function:
function [ Linear_regression_indices ] = Linear_regression_indices(XTRAIN,ytrain,XTEST)
% Fit a first-order polynomial to the training data and evaluate it at XTEST
Linear_regression_indices = polyval(polyfit(XTRAIN,ytrain,1),XTEST);
end
There are two ways to do it, and they both give the same result.
The first one simply uses the crossval function:
cvMse = crossval('mse',X,y,'predfun',testval,'leaveout',1);
This will do as many folds as there are data points, each time using one of them as XTEST.
The second one uses cvpartition:
c = cvpartition(n,'LeaveOut') creates a random partition for leave-one-out cross validation on n observations. Leave-one-out is a special case of 'KFold', in which the number of folds equals the number of observations. link
c = cvpartition(length(y),'LeaveOut'); % leave-one-out partition on n = length(y) observations
cvMse2=crossval('mse',X,y,'predfun',testval,'partition',c);
then the RMSE can be easily calculated
RMSE=sqrt(cvMse);
RMSE2=sqrt(cvMse2);
Then I simply get my answer; in my case RMSE = 0.3548.
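A hedged sketch (not part of the original script) for also getting the R² mentioned in the question: collect the leave-one-out predictions explicitly, then compare them with the observed values.
n = numel(y);
yhat = zeros(n,1);
for k = 1:n
    train = true(n,1); train(k) = false;                       % leave observation k out
    yhat(k) = Linear_regression_indices(X(train), y(train), X(k));
end
R2 = 1 - sum((y - yhat).^2) / sum((y - mean(y)).^2);           % coefficient of determination
RMSE_loo = sqrt(mean((y - yhat).^2));                          % should agree with sqrt(cvMse)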

data driven curve - MATLAB

I have several sets of data that I want to fit but not all of them look the same (some look like a Gaussian with one peak, some like two Gaussians with 2 peaks or Lorentzians). I wanted to try this method
http://www.mathworks.com/matlabcentral/fileexchange/31562-data-driven-fitting-with-matlab/content/fitit.m
but the program given is not complete, so I cannot use it (there is no line that defines 'train' and 'test'). I am rewriting it so that it suits and works for my data (based on the given code and the demo). I was able to find the best fit, but I am also trying to use the bootstrap technique in order to find the confidence intervals. My data are xdata and ydata; they are sorted and the duplicates have been removed before I use them in my program.
cpart=cvpartition(size(xdata,1),'k',10);
tr_x=xdata(training(cpart,1));
tr_y=ydata(training(cpart,1));
tst_x=xdata(test(cpart,1));
tst_y=ydata(test(cpart,1));
all_span=linspace(0.01,0.99,99);
s=zeros(length(all_span),1);
for k=1:length(all_span)
f = @(tr_x,tr_y,tst_x,tst_y) norm(tst_y - mylowess(tr_x,tr_y,tst_x,all_span(k)))^2;
s(k) = sum(crossval(f,xdata,ydata,'partition',cpart));
end
[~,mj]=min(s);
n_span=all_span(mj);%n_span is the optimal span
function ys=mylowess(x1,y1,xs,span)
ys1 = smooth(x1,y1,span,'loess');
ys = interp1(x1,ys1,xs,'linear',NaN);
if any(isnan(ys))
ys(xs<x1(1)) = ys1(1);
ys(xs>x1(end)) = ys1(end);
end
So up to this point I understand the program and I have managed to find the optimal span. I want to find the confidence intervals but so far I was not able to make it work.
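(As an aside, a hedged usage sketch that is not from the original post: once n_span is known, the final smoothed curve would typically be evaluated with it.)
yfit = mylowess(xdata, ydata, xdata, n_span);   % smoothed values at the original x
plot(xdata, ydata, '.', xdata, yfit, '-');
legend('data', 'loess fit with optimal span');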
When I type:
NB=length(xdata);
f=@(xdata,ydata) mylowess(xdata,ydata,xdata,n_span);
yboot2 = bootstrp(NB,f,xdata,ydata)';
I get the following error
Error using griddedInterpolant
The grid vectors are not strictly monotonic increasing.
Error in interp1 (line 186)
F = griddedInterpolant(X,V,method);
Error in mylowess (line 26)
ysmooth=interp1(xdata,ysmooth1,xinput,'linear',NaN);
As I said before, there are no duplicates in xdata, and I had already sorted xdata before using it in the program. Can anyone see the mistake I am making? Or is there an easier way to get the confidence intervals?
Thank you for your help.

Why does this trivially learnable example break AdaBoost?

I'm testing out a boosted tree model that I built using Matlab's fitensemble method.
X = rand(100, 10);
Y = X(:, end)>.5;
boosted_tree = fitensemble(X, Y, 'AdaBoostM1', 100,'Tree');
predicted_Y = predict(boosted_tree, X);
I just wanted to run it on a few simple examples, so I threw in an easy case, one feature is >.5 for positive examples and < .5 for negative examples. I get the warning
Warning: AdaBoostM1 exits because classification error = 0
Which leads me to think, great, it figured out the relevant feature and all the training examples were correctly classified.
But if I look at the accuracy
sum(predicted_Y==Y)/length(Y)
The result is 0.5 because the classifier simply assigned the positive class to all examples!
Why does Matlab think that classification error = 0 when it is clearly not 0? I believe this example should be easily learnable. Is there a way to prevent this error and get the correct result using this method?
Edit: The code above should reproduce the warning.
This is not a bug, it's just that AdaBoost is not designed to work in cases where the first weak learner gets perfect classification. More details:
1) The warning you get is referring to the error of the first weak learner, which is indeed zero. You can see this by following the stack trace that comes along with the warning into the function Ensemble.m (in Matlab R2013b, at line 194). If you place a breakpoint there and run your example, then run the command H.predict(X), you will see that this learner has perfect prediction.
2) So why doesn't your ensemble have perfect prediction? If you look more at Ensemble.m, you'll see that this perfect learner never gets added to the ensemble. This is also reflected in that boosted_tree.NTrained is zero.
3) So why doesn't this perfect learner get added to the ensemble? If you find a description of the AdaBoost.M1 algorithm, you'll see that in each round, training examples are re-weighted based on the error of the previous weak learner. But if that weak learner had no error, then all of the weights collapse to zero and all subsequent learners have nothing to do (see the sketch after the code below).
4) If you come across this situation in the real world, what do you do? Don't bother with AdaBoost! The problem is easy enough that a single one of your weak learners can solve it:
X = rand(100, 10);
Y = X(:, end)>.5;
tree = fit(ClassificationTree.template, X, Y);
predicted_Y = predict(tree, X);
accuracy = sum(predicted_Y == Y) / length(Y)
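To make point 3 concrete, here is a minimal sketch (my own illustration, not Matlab's internal code) of the AdaBoost.M1 reweighting and why a zero-error first learner stops the ensemble:
w = ones(100,1)/100;            % initial observation weights
correct = true(100,1);          % the first weak learner classifies everything correctly
eps_t = sum(w(~correct));       % weighted training error = 0
beta_t = eps_t/(1 - eps_t);     % = 0
w(correct) = w(correct)*beta_t; % every weight collapses to zero
% The weights can no longer be renormalised, so no further learners can be trained.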

function parameters in matlab wander off after curve fitting

First a little background: I'm a psychology student, so my background in coding isn't on par with you guys :-)
My problem is as follows, and the most important observation is that curve fitting with two different programs gives completely different results for my parameters, although my graphs stay the same. The main program we have used to fit my longitudinal data is KaleidaGraph, which should be seen as kind of the 'gold standard'; the program I'm trying to move to is Matlab.
I was trying to be smart and wrote some code (a lot at least for me) and the goal of that code was the following:
1. Taking an individual longitudinal datafile
2. curve fitting this data on a non-parametric model using lsqcurvefit
3. obtaining figures and the points where f' and f'' are zero
This all worked well (woohoo :-)), but when I started comparing the function parameters both programs generate, there is a huge difference. KaleidaGraph stays close to its original starting values; Matlab wanders off and sometimes the parameters get larger by a factor of 1000. The graphs, however, stay more or less the same in both situations, and both fit the data well. It would be lovely to know how to make the Matlab curve fitting more 'conservative', so that the parameters stay near their original starting values.
validFitPersons = true(nbValidPersons,1);
for i=1:nbValidPersons
personalData = data{validPersons(i),3};
personalData = personalData(personalData(:,1)>=minAge,:);
% Fit a specific model for all valid persons
try
opts = optimoptions(#lsqcurvefit, 'Algorithm', 'levenberg-marquardt');
[personalParams,personalRes,personalResidual] = lsqcurvefit(heightModel,initialValues,personalData(:,1),personalData(:,2),[],[],opts);
catch
x=1;
end
Above is the part of the code I've written to fit the data files to a specific model.
Below is an example of a non-parametric model I use, with its function parameters.
elseif strcmpi(model,'jpa2')
% y = a.*(1-1/(1+(b_1(t+e))^c_1+(b_2(t+e))^c_2+(b_3(t+e))^c_3))
heightModel = #(params,ages) abs(params(1).*(1-1./(1+(params(2).* (ages+params(8) )).^params(5) +(params(3).* (ages+params(8) )).^params(6) +(params(4) .*(ages+params(8) )).^params(7) )));
modelStrings = {'a','b1','b2','b3','c1','c2','c3','e'};
% Define initial values
if strcmpi('male',gender)
initialValues = [176.76 0.339 0.1199 0.0764 0.42287 2.818 18.52 0.4363];
else
initialValues = [161.92 0.4173 0.1354 0.090 0.540 2.87 14.281 0.3701];
end
I've tried to mimic the curve-fitting process in KaleidaGraph as closely as possible. There I found that they use the Levenberg-Marquardt algorithm, which I've selected. However, results still vary and I don't have any more clues about what I can change.
Some extra adjustments:
The idea for this code was the following:
I'm trying to compare different fitting models (they are designed for this purpose). So what I do is: I have 5 models with different parameters and different starting values (the second part of my code), and then I have the general curve-fitting file. Since there are different models, it would be interesting if I could put restrictions on how far the parameters can wander off from their starting values.
Anyone any idea how this could be done?
Anybody willing to help a psychology student?
Cheers
This is a common issue when dealing with non-linear models.
If I were you, I would check whether you can remove some parameters from the model in order to simplify it.
If you really want to keep your solution not too far from the initial point, you can use upper bounds and lower bounds for each variable:
x = lsqcurvefit(fun,x0,xdata,ydata,lb,ub)
defines a set of lower and upper bounds on the design variables in x so that the solution is always in the range lb ≤ x ≤ ub.
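For example, a hedged sketch of how bounds could be applied to the model from the question (the ±50% window around the starting values is an arbitrary choice for illustration; note that lsqcurvefit only honours bounds with the default trust-region-reflective algorithm, while levenberg-marquardt does not accept them):
lb = initialValues - 0.5*abs(initialValues);   % lower bounds: 50% below the start values
ub = initialValues + 0.5*abs(initialValues);   % upper bounds: 50% above the start values
opts = optimoptions(@lsqcurvefit, 'Algorithm', 'trust-region-reflective');
personalParams = lsqcurvefit(heightModel, initialValues, ...
    personalData(:,1), personalData(:,2), lb, ub, opts);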
Cheers
You state:
I'm trying to compare different fitting models (they are designed for
this purpose). So what I do is I have 5 models with different
parameters and different starting values ( the second part of my code)
and next I have the general curve fitting file.
You will presumably compare the statistics from fits with different models, to see whether reductions in the fitting error are unlikely to be due to chance. You may want to rely on that comparison to pick the model that not only fits your data suitably but is also simplest (which is often referred to as the principle of parsimony).
The problem really lies with the model you have shown, which results in correlated parameters and therefore overfitting, as mentioned by @David. Again, this should be resolved when you compare different models and find that some do just as well (statistically speaking) even though they involve fewer parameters.
edit
To drive the point home regarding the problem with the choice of model, here are (1) the results of a trial fit using simulated data and (2) the correlation matrix of the parameters in graphical form:
Note that absolute values of the correlation close to 1 indicate strongly correlated parameters, which is highly undesirable. Note also that the trend in the data is practically linear over a long portion of the dataset, which implies that 2 parameters might suffice over that stretch, so using 8 parameters to describe it seems like overkill.
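For reference, a hedged sketch (not the code used for the figures above) of how such a parameter correlation matrix can be estimated from an lsqcurvefit result, via the returned residuals and Jacobian:
[params, resnorm, residual, ~, ~, ~, J] = lsqcurvefit(heightModel, initialValues, ...
    personalData(:,1), personalData(:,2));
J = full(J);                                  % the Jacobian comes back sparse
dof = numel(residual) - numel(params);        % residual degrees of freedom
covP = inv(J'*J) * (resnorm/dof);             % approximate parameter covariance
corrP = covP ./ (sqrt(diag(covP)) * sqrt(diag(covP))');  % parameter correlation matrix
imagesc(abs(corrP)); colorbar                 % values near 1 flag strongly correlated pairs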

How to get level of fitness of data to a distribution by using probplot() in Matlab?

I have 2 sets of data of float numbers, set A and set B. Both of them are matrices of size 40*40. I would like to find out which set is closer to the normal distribution. I know how to use probplot() in Matlab to plot the probability of one set. However, I do not know how to quantify how well the data fit the distribution.
In Python, when people use probplot, a parameter, R^2, shows how well the data match the normal distribution. The closer the R^2 value is to 1, the better the fit. Thus, I can simply use the function to compare two sets of data by their R^2 values. However, because of a machine problem, I cannot use Python on my current machine. Is there a parameter or function similar to the R^2 value in Matlab?
Thank you very much,
Fitting a curve or surface to data and obtaining the goodness of fit, i.e., sse, rsquare, dfe, adjrsquare, rmse, can be done using the function fit. More info here...
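One hedged sketch along these lines (mirroring what Python's probplot reports, and assuming both the Curve Fitting and Statistics toolboxes are available): fit a straight line to the normal Q-Q points of the 40-by-40 matrix A from the question, and read rsquare from the goodness-of-fit output of fit; do likewise for B and compare.
A = sort(A(:));                  % flatten to a 1600-by-1 ordered sample
n = numel(A);
p = ((1:n)' - 0.5) / n;          % plotting positions
q = norminv(p);                  % theoretical standard normal quantiles
[~, gofA] = fit(q, A, 'poly1');  % straight-line fit to the Q-Q plot
gofA.rsquare                     % closer to 1 means closer to a normal distribution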
The approach of @nate (+1) is definitely one possible way of going about this problem. However, the statistician in me is compelled to suggest the following alternative (that does, alas, require the Statistics Toolbox - but you have this if you have the student version):
Given that your data is Normal (not Multivariate normal), consider using the Jarque-Bera test.
Jarque-Bera tests the null hypothesis that a given dataset is generated by a Normal distribution, versus the alternative that it is generated by some other distribution. If the Jarque-Bera test statistic is less than some critical value, then we fail to reject the null hypothesis.
So how does this help with the goodness-of-fit problem? Well, the larger the test statistic, the more "non-Normal" the data is. The smaller the test statistic, the more "Normal" the data is.
So, assuming you have converted your matrices into two vectors, A and B (each should be 1600 by 1 based on the dimensions you provide in the question), you could do the following:
%# Build sample data
A = randn(1600, 1);
B = rand(1600, 1);
%# Perform JB test (h = 1 means the test REJECTS the null of normality)
[ARejects, ~, AStat] = jbtest(A);
[BRejects, ~, BStat] = jbtest(B);
%# Display result
if AStat < BStat
disp('A is closer to normal');
else
disp('B is closer to normal');
end
As a little bonus of doing things this way, ARejects and BRejects tell you whether you can reject or fail to reject the null hypothesis that the sample in A or B comes from a normal distribution! Specifically, if ARejects is 0, then you fail to reject the null (i.e. the test statistic indicates that A is probably drawn from a Normal). If ARejects is 1, then the test rejects the null, so the data in A is probably not generated from a Normal distribution.
CAUTION: The approach I've advocated here is only valid if A and B are the same size, but you've indicated in the question that they are :-)