Naive Bayes Classifier for Multiclass: Getting Same Error Rate

I have implemented a Naive Bayes classifier for multiclass problems, but my error rate stays the same as I increase the size of the training set. I have debugged this over and over but cannot figure out why it is happening, so I am posting it here to find out whether I am doing anything wrong.
%Naive Bayes Classifier
%This function splits the data 80:20 into training and test sets, then
%trains on increasing percentages (5,10,15,20,30) of the training portion
%to see how the error rate changes.
%Goal is to compare the plots with those in the Stanford paper
%http://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf
function[tPercent] = naivebayes(file, iter, percent)
dm = load(file);
for i=1:iter
%Getting the index common to test and train data
idx = randperm(size(dm.data,1))
%Using same idx for data and labels
shuffledMatrix_data = dm.data(idx,:);
shuffledMatrix_label = dm.labels(idx,:);
percent_data_80 = round((0.8) * length(shuffledMatrix_data));
%Doing 80-20 split
train = shuffledMatrix_data(1:percent_data_80,:);
test = shuffledMatrix_data(percent_data_80+1:length(shuffledMatrix_data),:);
%Getting the label data from the 80:20 split
train_labels = shuffledMatrix_label(1:percent_data_80,:);
test_labels = shuffledMatrix_label(percent_data_80+1:length(shuffledMatrix_data),:);
%Getting the array of percents [5 10 15..]
percent_tracker = zeros(length(percent), 2);
for pRows = 1:length(percent)
percentOfRows = round((percent(pRows)/100) * length(train));
new_train = train(1:percentOfRows,:);
new_train_label = train_labels(1:percentOfRows);
%get unique labels in training
numClasses = size(unique(new_train_label),1);
classMean = zeros(numClasses,size(new_train,2));
classStd = zeros(numClasses, size(new_train,2));
priorClass = zeros(numClasses, size(2,1));
% Doing the K class mean and std with prior
for kclass=1:numClasses
classMean(kclass,:) = mean(new_train(new_train_label == kclass,:));
classStd(kclass, :) = std(new_train(new_train_label == kclass,:));
priorClass(kclass, :) = length(new_train(new_train_label == kclass))/length(new_train);
end
error = 0;
p = zeros(numClasses,1);
% Calculating the posterior for each test row for each k class
for testRow=1:length(test)
c=0; k=0;
for class=1:numClasses
temp_p = normpdf(test(testRow,:),classMean(class,:), classStd(class,:));
p(class, 1) = sum(log(temp_p)) + (log(priorClass(class)));
end
%Take the max of posterior
[c,k] = max(p(1,:));
if test_labels(testRow) ~= k
error = error + 1;
end
end
avgError = error/length(test);
percent_tracker(pRows,:) = [avgError percent(pRows)];
tPercent = percent_tracker;
plot(percent_tracker)
end
end
end
Here is the dimensionality of my data:
x =
data: [768x8 double]
labels: [768x1 double]
I am using the Pima data set from UCI.

How does your implementation perform on the training data itself? Does it fit it at all?
It's hard to be sure, but there are a couple of things that I noticed:
Every class needs training data. You can't train a classifier to recognize a class for which it has seen no examples.
If possible, the number of training examples shouldn't be skewed towards some of the classes. For example, if class 1 makes up only 5% of the training and cross-validation examples in a 2-class problem, then a function that always returns class 2 will have an error of only 5%. Did you try checking precision and recall separately?
You're fitting a normal distribution to each feature within a class and then using it for the posterior probabilities. I'm not sure how that plays out in terms of smoothing. Could you try re-implementing it with simple counting and see whether it gives different results?
It could also be that the features are highly redundant and the Naive Bayes method overcounts their evidence.
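One more thing that may be worth checking in the posted code: p is allocated as a numClasses-by-1 column vector, so max(p(1,:)) only ever looks at its first entry and returns index 1. If that is what happens, every test row is predicted as class 1 and the error rate cannot depend on the amount of training data. The sketch below uses made-up log-posteriors and labels (all values are hypothetical) to show the effect and how per-class recall exposes it:

```matlab
% Hypothetical log-posteriors for one test row, one entry per class (column vector)
p = [-3.2; -1.1; -2.5];

[~, kFirst] = max(p(1,:));  % p(1,:) is the single value -3.2, so kFirst is always 1
[~, kAll]   = max(p);       % compares all classes; here kAll is 2

% If every test row is predicted as class 1, the error rate equals the
% share of test rows from other classes, whatever the training set size.
testLabels = [1; 2; 1; 1; 2; 1; 1; 2];        % made-up labels
predicted  = ones(size(testLabels));          % what max(p(1,:)) would produce
errorRate  = mean(predicted ~= testLabels);   % 3/8

% Per-class recall shows that class 2 is never predicted
recall1 = mean(predicted(testLabels == 1) == 1);  % 1
recall2 = mean(predicted(testLabels == 2) == 2);  % 0
```

Note also that the comparison new_train_label == kclass assumes the labels are coded 1..numClasses; if the file codes them differently (for example 0/1), some class statistics would be computed from empty selections.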

Related

Correct practice and approach for reporting the training and generalization performance

I am trying to learn the correct procedure for training a neural network for classification. There are many tutorials, but they never explain how to report generalization performance. Can somebody please tell me whether the following method is correct? I am using the first 100 examples from the fisheriris data set, which have labels 1 and 2, and I call them X and Y respectively. I then split X into trainData and Xtest with a 90/10 split ratio and train the NN model on trainData. The NN internally splits trainData further into tr, val and test subsets. My confusion is: which of these is usually used to report the model's performance on unseen data in conferences and journals?
The dataset can be found in the link: https://www.mathworks.com/matlabcentral/fileexchange/71468-simple-neural-networks-with-k-fold-cross-validation-manner
rng('default')
load iris.mat;
X = [f(1:100,:) l(1:100)];
numExamples = size(X,1);
indx = randperm(numExamples);
X = X(indx,:);
Y = X(:,end);
split1 = cvpartition(Y,'Holdout',0.1,'Stratify',true); %90% trainval 10% test
istrainval = training(split1); % index for fitting
istest = test(split1); % indices for quality assessment
trainData = X(istrainval,:);
Xtest = X(istest,:);
Ytest = Y(istest);
numExamplesXtrainval = size(trainData,1);
indxXtrainval = randperm(numExamplesXtrainval);
trainData = trainData(indxXtrainval,:);
Ytrain = trainData(:,end);
hiddenLayerSize = 10;
% data format = rows = number of dim, column = number of examples
net = patternnet(hiddenLayerSize);
net = init(net);
net.performFcn = 'crossentropy';
net.trainFcn = 'trainscg';
net.trainParam.epochs=50;
[net tr]= train(net,trainData', Ytrain');
Trained = sim(net, trainData'); %outputs predicted labels
train_predict = net(trainData');
performanceTrain = perform(net,Ytrain',train_predict)
lbl_train=grp2idx(Ytrain);
Yhat_train = (train_predict >= 0.5);
Lbl_Yhat_Train = grp2idx(Yhat_train);
[cmMatrixTrain]= confusionmat(lbl_train,Lbl_Yhat_Train )
accTrain=sum(lbl_train ==Lbl_Yhat_Train)/size(lbl_train,1);
disp(['Training Set: Total Train Acccuracy by MLP = ',num2str(100*accTrain ), '%'])
[confTest] = confusionmat(lbl_train(tr.testInd),Lbl_Yhat_Train(tr.testInd) )
%unknown test
test_predict = net(Xtest');
performanceTest = perform(net,Ytest',test_predict);
Yhat_test = (test_predict >= 0.5);
test_lbl=grp2idx(Ytest);
Lbl_Yhat_Test = grp2idx(Yhat_test);
[cmMatrix_Test]= confusionmat(test_lbl,Lbl_Yhat_Test )
This is the output.
Problem1: There seems to be no prediction for the other class. Why?
Problem2: Do I need a separate dataset like the one I created as Xtest for reporting generalization error or is it the practice to use the data trainData(tr.testInd,:) as the generalization test set? Did I create an unnecessary subset?
performanceTrain =
2.2204e-16
cmMatrixTrain =
45 0
45 0
Training Set: Total Train Acccuracy by MLP = 50%
confTest =
9 0
5 0
cmMatrix_Test =
5 0
5 0

There are a few issues with the code. Let's deal with them before answering your question. First, you set a threshold of 0.5 for making decisions (Yhat_train = (train_predict >= 0.5);), while all of your net's predictions are above 0.5. Every example is therefore assigned to the same class, which is why the second column of each confusion matrix is all zeros. You can plot the scores to choose a better threshold:
figure;
plot(train_predict(Ytrain == 1),'.b')
hold on
plot(train_predict(Ytrain == 2),'.r')
legend('label 1','label 2')
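To make the threshold problem concrete, here is a minimal sketch with made-up scores (the values and the midpoint rule are assumptions for illustration, not what the network in the question actually outputs). With a fixed cut at 0.5 every example lands in one class; a threshold placed between the two score clusters separates them:

```matlab
% Hypothetical network outputs; all of them lie above 0.5
scores = [0.9 1.1 1.0 1.9 2.1 2.0];
Ytrue  = [1   1   1   2   2   2  ];

predFixed = (scores >= 0.5) + 1;      % every example becomes class 2

% Assumption: place the threshold halfway between the two class means
thr = (mean(scores(Ytrue == 1)) + mean(scores(Ytrue == 2))) / 2;
predTuned = (scores >= thr) + 1;      % class 1 below thr, class 2 at or above it

accFixed = mean(predFixed == Ytrue);  % 0.5
accTuned = mean(predTuned == Ytrue);  % 1
```

Any threshold chosen this way should be picked on training or validation scores only, never on the data used for the final report.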
cvpartition gave me an error as written; it ran successfully as split1 = cvpartition(Y,'Holdout',0.1);. In any case, artificial neural networks usually manage partitioning within the training process, so you feed in X and Y along with some parameters that control the split. For example, you can set
net.divideParam.trainRatio = .4;
net.divideParam.valRatio = .3;
net.divideParam.testRatio = .3;
So how should you report the results? Only on the test data. Results on the training data suffer from overfitting and look misleadingly good. If you use validation data (you haven't), you cannot report results on it either, because it also suffers from overfitting. If you let the training procedure handle validation for you, your test results will be safe from overfitting.
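As a rough sketch of what "report only on the test data" means in code, the snippet below computes accuracy separately on two index sets. The labels, predictions and index ranges are invented; trainInd and testInd merely stand in for the tr.trainInd and tr.testInd fields of the training record:

```matlab
% Hypothetical labels and predictions for 10 examples
Ytrue = [1 1 2 2 1 2 1 2 1 2];
Ypred = [1 1 2 2 1 2 1 1 2 2];

% Hypothetical index sets, standing in for tr.trainInd and tr.testInd
trainInd = 1:7;
testInd  = 8:10;

accTrain = mean(Ypred(trainInd) == Ytrue(trainInd));  % fitted data, optimistic
accTest  = mean(Ypred(testInd)  == Ytrue(testInd));   % the figure to report
```

In this toy case the training accuracy is perfect while the test accuracy is only one in three, which is the gap the training-set number would hide.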

EEG data classification with SWLDA using matlab

I would like to ask for your help with EEG data classification.
I am a graduate student trying to analyze EEG data.
I am struggling to classify ERP speller (P300) data with SWLDA using Matlab.
Maybe there is something wrong in my code.
I have read several articles, but they do not go into much detail.
My data sizes are as follows:
size(target) = [300 1856]
size(nontarget) = [998 1856]
Rows are trials and columns are the flattened features.
(I flattened each [64 29] data matrix into a row; for visual representation I did not select an ROI.)
I used the stepwisefit function in Matlab to classify target vs. non-target.
The code is below.
ingredients = [targets; nontargets];
heat = [class_targets; class_nontargets]; % target: 1, non-target: -1
randomized_set = shuffle([ingredients heat]);
for k=1:10 % 10-fold cross validation
parition_factor = ceil(size(randomized_set,1) / 10);
cv_test_idx = (k-1)*parition_factor + 1:min(k * parition_factor, size(randomized_set,1));
total_idx = 1:size(randomized_set,1);
cv_train_idx = total_idx(~ismember(total_idx, cv_test_idx));
ingredients = randomized_set(cv_train_idx, 1:end-1);
heat = randomized_set(cv_train_idx, end);
[W,SE,PVAL,INMODEL,STATS,NEXTSTEP,HISTORY]= stepwisefit(ingredients, heat, 'penter', .1);
valid_id = find(INMODEL==1);
v_weights = W(valid_id)';
t_ingredients = randomized_set(cv_test_idx, 1:end-1);
t_heat = randomized_set(cv_test_idx, end); % true labels for test set
v_features = t_ingredients(:, valid_id);
v_weights = repmat(v_weights, size(v_features, 1), 1);
predictor = sum(v_weights .* v_features, 2);
m_result = predictor > 0; % class A: +1, B: 0
t_heat(t_heat==-1) = 0;
acc(k) = sum(m_result==t_heat) / length(m_result);
end
P.S. My code is currently very inefficient and may well be bad.
My assumption is that stepwisefit tests the coefficients for significance at every step, so that only the significant columns remain in the model.
This is not strictly LDA, but for binary classification LDA and linear regression are not different.
However, the results were almost at chance level (on other binary data sets from the internet, it worked).
I think I have done something wrong, and I hope you can help me correct it.
I would appreciate any suggestions and tips for implementing a classifier for the ERP speller.
Or any ideas for implementing SWLDA in Matlab?

The name SWLDA is only used in the context of Brain Computer Interfaces, but I bet it has another name in a more general context.
If you trace the SWLDA recipe back, you end up at the Krusienski 2006 papers ("A comparison..." and "Toward enhanced P300..") and, from there, at the book where stepwise logarithmic regression is explained: Draper and Smith, Applied Regression Analysis, 1981. However, as far as I am aware, no paper actually gives the complete recipe for implementing it, with all its details and secrets.
My approach was using stepwiseglm:
H=predictors;
TH=variables;
lbs=labels % (1,2)
if (stepwiseflag)
mdl = stepwiseglm(H', lbs'-1,'constant','upper','linear','distr','binomial');
if (mdl.NumEstimatedCoefficients>1)
inmodel = [];
for i=2:mdl.NumEstimatedCoefficients
inmodel = [inmodel str2num(mdl.CoefficientNames{i}(2:end))];
end
H = H(inmodel,:);
TH = TH(inmodel,:);
end
end
lbls = classify(TH',H',lbs','linear');
You can also use a k-fold cross-validation approach with Matlab's cvpartition.
c = cvpartition(lbs,'k',10);
opts = statset('display','iter');
fun = @(XT,yT,Xt,yt)...
(sum(~strcmp(yt,classify(Xt,XT,yT,'linear'))));
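A possible contributor to the chance-level results in the question's code: as far as I recall, stepwisefit reports the intercept separately (in STATS.intercept) rather than inside the coefficient vector, and the posted predictor thresholds sum(weights .* features) at 0 without it. With imbalanced ±1 labels the intercept can shift the decision boundary a lot. The sketch below illustrates this with plain least squares on invented one-feature data; it uses the backslash operator instead of stepwisefit, so it is an analogy under that assumption, not a reproduction of the original pipeline:

```matlab
% Toy, imbalanced data: 3 targets (+1) and 9 non-targets (-1), one feature
x = [2.0 2.2 1.8  0.9 1.1 1.0 0.8 1.2 1.0 0.9 1.1 1.0]';
y = [ 1   1   1   -1  -1  -1  -1  -1  -1  -1  -1  -1 ]';

% Ordinary least squares with an explicit intercept column
beta = [ones(size(x)) x] \ y;     % beta(1) = intercept, beta(2) = slope

scoreNoIntercept   = beta(2) * x;            % analogous to sum(weights .* features)
scoreWithIntercept = beta(1) + beta(2) * x;  % full regression output

accNoIntercept   = mean(sign(scoreNoIntercept)   == y);  % 0.25: all predicted +1
accWithIntercept = mean(sign(scoreWithIntercept) == y);  % 1
```

If this applies, adding the intercept to predictor before comparing with 0 would be the corresponding change in the question's loop.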

Chance level accuracy for clearly separable data

I have written what I believe to be quite a simple SVM-classifier [SVM = Support Vector Machine]. "Testing" it with normally distributed data with different parameters, the classifier is returning me a 50% accuracy. What is wrong?
Here is the code, the results should be reproducible:
features1 = normrnd(1,5,[100,5]);
features2 = normrnd(50,5,[100,5]);
features = [features1;features2];
labels = [zeros(100,1);ones(100,1)];
%% SVM-Classification
nrFolds = 10; %number of folds of crossvalidation
kernel = 'linear'; % 'linear', 'rbf' or 'polynomial'
C = 1; % C is the 'boxconstraint' parameter.
cvFolds = crossvalind('Kfold', labels, nrFolds);
for i = 1:nrFolds % iterate through each fold
testIdx = (cvFolds == i); % indices test instances
trainIdx = ~testIdx; % indices training instances
% train the SVM
cl = fitcsvm(features(trainIdx,:), labels(trainIdx),'KernelFunction',kernel,'Standardize',true,...
'BoxConstraint',C,'ClassNames',[0,1]);
[label,scores] = predict(cl, features(testIdx,:));
eq = sum(labels(testIdx));
accuracy(i) = eq/numel(labels(testIdx));
end
crossValAcc = mean(accuracy)

You are not computing the accuracy correctly. You need to determine how many predictions match the original data. You are simply summing up the total number of 1s in the test set, not the actual number of correct predictions.
Therefore you must change your eq statement to this:
eq = sum(labels(testIdx) == label);
Recall that labels(testIdx) extracts the true labels from your test set and label holds the labels predicted by your SVM model. The comparison produces a vector of 0/1 values, where 0 means the prediction does not match the actual label and 1 means they agree. Summing that vector counts the correct predictions, and dividing by the number of test instances, as your code already does, gives the accuracy.
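A small worked example with made-up labels shows the difference between the two expressions. The first counts how many test labels are 1, which is about half of them for balanced classes regardless of the classifier; the second counts matches:

```matlab
trueLabels = [0; 0; 1; 1; 1];   % made-up test-set labels
predLabels = [0; 1; 1; 1; 1];   % made-up predictions

wrongAcc = sum(trueLabels) / numel(trueLabels);                % 0.6: fraction of 1s
rightAcc = sum(trueLabels == predLabels) / numel(trueLabels);  % 0.8: fraction correct
```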

Matlab SVM example

I am trying to implement SVM for classification. The goal is to output the correct grid of origin of a power signal (.wav file). The grids are titled A-I and there are 93 total signals for the training set and 49 practice signals. I have a 93x10x36 matrix of feature vectors. Does anyone know why I get the errors shown? TrainCorrectGrid and Training_Cepstrum1 both have 93 rows so I don't understand what the problem is. Any help is greatly appreciated.
My code is shown here:
clc; clear; close all;
load('avg_fft_feature (4).mat'); %training feature vectors
load('practice_fft_Mag_all (2).mat'); %practice feature vectors
load('practice_GridOrigin.mat'); %correct grids of origin for practice data
load PracticeCorrectGrid.mat;
load Training_Cepstrum1;
load Practice_Cepstrum1a;
load fSet1.mat %load in correct practice grids
TrainCorrectGrid=['A';'A';'A';'A';'A';'A';'A';'A';'A';'B';'B';'B';'B';'B';'B';'B';'B';'B';'B';'C';'C';'C';'C';'C';'C';'C';'C';'C';'C';'C';'D';'D';'D';'D';'D';'D';'D';'D';'D';'D';'D';'E';'E';'E';'E';'E';'E';'E';'E';'E';'E';'E';'F';'F';'F';'F';'F';'F';'F';'F';'G';'G';'G';'G';'G';'G';'G';'G';'G';'G';'G';'H';'H';'H';'H';'H';'H';'H';'H';'H';'H';'H';'I';'I';'I';'I';'I';'I';'I';'I';'I';'I';'I'];
%[results,u] = multisvm(avg_fft_feature, TrainCorrectGrid, avg_fft_feature_practice);%avg_fft_feature);
[results,u] = multisvm(Training_Cepstrum1(93,:,1), TrainCorrectGrid, Practice_Cepstrum1a(49,:,1));
disp('Grids of Origin (SVM)');
%Display SVM Results
for i = 1:numel(u)
str = sprintf('%d: %s', i, u(i));
disp(str);
end
%Display Percent Correct
numCorrect = 0;
for i = 1:numel(u)
%if (strcmp(TrainCorrectGrid(i,1), u(i))==1); %compare training to
%training
if (strcmp(PracticeCorrectGrid(i,1), u(i))==1); %compare practice data to training
numCorrect = numCorrect + 1;
end
end
numberOfElements = numel(u);
percentCorrect = numCorrect / numberOfElements * 100;
% percentCorrect = round(percentCorrect, 2);
dispPercent = sprintf('Percent Correct = %0.3f%%', percentCorrect);
disp(dispPercent);
error shown here
The multisvm function is shown here:
function [result, u] = multisvm(TrainingSet,GroupTrain,TestSet)
%Models a given training set with a corresponding group vector and
%classifies a given test set using an SVM classifier according to a
%one vs. all relation.
%
%This code was written by Cody Neuburger cneuburg@fau.edu
%Florida Atlantic University, Florida USA and slightly modified by Renny Varghese
%This code was adapted and cleaned from Anand Mishra's multisvm function
%found at http://www.mathworks.com/matlabcentral/fileexchange/33170-multi-class-support-vector-machine/
u=unique(GroupTrain);
numClasses=length(u);
result = zeros(length(TestSet(:,1)),1);
%build models
for k=1:numClasses
%Vectorized statement that binarizes Group
%where 1 is the current class and 0 is all other classes
G1vAll=(GroupTrain==u(k));
models(k) = svmtrain(TrainingSet,G1vAll);
end
%classify test cases
for j=1:size(TestSet,1)
for k=1:numClasses
if(svmclassify(models(k),TestSet(j,:)))
break;
end
end
result(j) = k;
end
mapValues = 'ABCDEFGHI';
u = mapValues(result);

You state that Training_Cepstrum1 has size [93,10,36]. But when you call multisvm, you are only passing in Training_Cepstrum1(93,:,1) which has size [1,10]. Since TrainCorrectGrid has size [93,1], there is a mismatch in the number of rows.
It looks like you make the same error when passing in Practice_Cepstrum1a.
Try replacing your call to multisvm with
[results,u] = multisvm(Training_Cepstrum1(:,:,1), TrainCorrectGrid, Practice_Cepstrum1a(:,:,1));
This way Training_Cepstrum1(:,:,1) has size [93,10], the same number of rows as TrainCorrectGrid.
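The size difference is easy to confirm on a placeholder array of the same shape (A below is a zero-filled stand-in, not the real feature data):

```matlab
A = zeros(93, 10, 36);           % same shape as the feature array in the question

sizeOneRow  = size(A(93,:,1));   % [1 10]: only row 93 of the first page
sizeAllRows = size(A(:,:,1));    % [93 10]: every row of the first page
```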

Returning the Best Decision Tree From Cross Validation In Matlab

When using Matlab, what is the correct means of finding the model with the least error from a cross validated fitting? My goal is to show the error rates of the best, cross validated decision tree as a function of the size of test data and have the following code:
chess = csvread(filename);
predictors = chess(:,1:6);
class = chess(:,7);
cvpart = cvpartition(class,'holdout', 0.3);
Xtrain = predictors(training(cvpart),:);
Ytrain = class(training(cvpart),:);
Xtest = predictors(test(cvpart),:);
Ytest = class(test(cvpart),:);
numElements = numel(training(cvpart));
trainErrorGrowing = zeros(numElements,1);
testErrorGrowing = zeros(numElements,1);
for n = 100:numElements
data = datasample(training(cvpart), n);
dataX = predictors(data,:);
dataY = class(data,:);
% Fit the decision tree
tree = fitctree(dataX, dataY, 'AlgorithmForCategorical', 'PullLeft', 'CrossVal', 'on');
% Loop to find the model with the least error
kfoldError = 100;
bestTree = tree.Trained{1};
for i = 1:10
err = loss(tree.Trained{i}, Xtrain, Ytrain);
if err < kfoldError
kfoldError = err;
bestTree = tree.Trained{i};
end
end
trainErrorGrowing(n) = loss(bestTree,Xtest,Ytest,'Subtrees','all'); % Training Error
testErrorGrowing(n) = loss(bestTree,Xtest,Ytest,'Subtrees','all'); % Testing Error
end
plot(numElements,testErrorGrowing);
It is important to the metrics that the data used for the final testing not be used in any way to train the tree. However, when I try to execute this code, I get the error
Error using classreg.learning.internal.classCount
You passed an unknown class '1' of type double.
on the line
err = loss(tree.Trained{i}, Xtrain, Ytrain);
I have tried casting the iterator in an int8 and a char, but receive the same error both times. Is there a simpler way to find the resulting decision tree with the least error, or at least a way to reference the individual trained trees?

Let's say you are doing 10-fold cross validation while learning the model. You can then use the kfoldLoss function to also get the CV loss for each fold and then choose the trained model that gives you the least CV loss in the following way:
modelLosses = kfoldLoss(tree,'mode','individual');
The above code will give you a vector of length 10 (10 CV error values) if you have done 10-fold cross-validation while learning. Assuming the trained model with least CV error is the 'k'th one, you would then use:
testSetPredictions = predict(tree.Trained{k}, testSetFeatures);
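Finding k from the vector of per-fold losses can be done with min. In this sketch the loss values are invented and simply stand in for the output of kfoldLoss:

```matlab
% Hypothetical per-fold losses, standing in for the output of kfoldLoss
modelLosses = [0.21 0.18 0.25 0.15 0.19 0.22 0.17 0.20 0.23 0.16];

[bestLoss, k] = min(modelLosses);   % k is the index of the fold with the least loss
```

Here k is 4, so tree.Trained{4} would be the model passed to predict.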