Chance level accuracy for clearly separable data - matlab

I have written what I believe to be quite a simple SVM-classifier [SVM = Support Vector Machine]. "Testing" it with normally distributed data with different parameters, the classifier is returning me a 50% accuracy. What is wrong?
Here is the code, the results should be reproducible:
features1 = normrnd(1,5,[100,5]);
features2 = normrnd(50,5,[100,5]);
features = [features1;features2];
labels = [zeros(100,1);ones(100,1)];
%% SVM-Classification
nrFolds = 10; %number of folds of crossvalidation
kernel = 'linear'; % 'linear', 'rbf' or 'polynomial'
C = 1; % C is the 'boxconstraint' parameter.
cvFolds = crossvalind('Kfold', labels, nrFolds);
for i = 1:nrFolds % iterate through each fold
testIdx = (cvFolds == i); % indices test instances
trainIdx = ~testIdx; % indices training instances
% train the SVM
cl = fitcsvm(features(trainIdx,:), labels(trainIdx),'KernelFunction',kernel,'Standardize',true,...
'BoxConstraint',C,'ClassNames',[0,1]);
[label,scores] = predict(cl, features(testIdx,:));
eq = sum(labels(testIdx));
accuracy(i) = eq/numel(labels(testIdx));
end
crossValAcc = mean(accuracy)

You are not computing the accuracy correctly. You need to determine how many predictions match the original data. You are simply summing up the total number of 1s in the test set, not the actual number of correct predictions.
Therefore you must change your eq statement to this:
eq = sum(labels(testIdx) == label);
Recall that labels(testIdx) extracts the true label from your test set and label is the predicted results from your SVM model. This correctly generates a vector of 0/1 where 0 means that the prediction does not match the actual label from the test set and 1 means that they agree. Summing over each time they agree, or each time the vector is 1 is the way to compute the accuracy.

Related

Correct practice and approach for reporting the training and generalization performance

I am trying to learn the correct procedure for training a neural network for classification. Many tutorials are there but they never explain how to report for the generalization performance. Can somebody please tell me if the following is the correct method or not. I am using first 100 examples from the fisheriris data set that has labels 1,2 and call them as X and Y respectively. Then I split X into trainData and Xtest with a 90/10 split ratio. Using trainData I trained the NN model. Now the NN internally further splits trainData into tr,val,test subsets. My confusion is which one is usually used for generalization purpose when reporting the performance of the model to unseen data in conferences/Journals?
The dataset can be found in the link: https://www.mathworks.com/matlabcentral/fileexchange/71468-simple-neural-networks-with-k-fold-cross-validation-manner
rng('default')
load iris.mat;
X = [f(1:100,:) l(1:100)];
numExamples = size(X,1);
indx = randperm(numExamples);
X = X(indx,:);
Y = X(:,end);
split1 = cvpartition(Y,'Holdout',0.1,'Stratify',true); %90% trainval 10% test
istrainval = training(split1); % index for fitting
istest = test(split1); % indices for quality assessment
trainData = X(istrainval,:);
Xtest = X(istest,:);
Ytest = Y(istest);
numExamplesXtrainval = size(trainData,1);
indxXtrainval = randperm(numExamplesXtrainval);
trainData = trainData(indxXtrainval,:);
Ytrain = trainData(:,end);
hiddenLayerSize = 10;
% data format = rows = number of dim, column = number of examples
net = patternnet(hiddenLayerSize);
net = init(net);
net.performFcn = 'crossentropy';
net.trainFcn = 'trainscg';
net.trainParam.epochs=50;
[net tr]= train(net,trainData', Ytrain');
Trained = sim(net, trainData'); %outputs predicted labels
train_predict = net(trainData');
performanceTrain = perform(net,Ytrain',train_predict)
lbl_train=grp2idx(Ytrain);
Yhat_train = (train_predict >= 0.5);
Lbl_Yhat_Train = grp2idx(Yhat_train);
[cmMatrixTrain]= confusionmat(lbl_train,Lbl_Yhat_Train )
accTrain=sum(lbl_train ==Lbl_Yhat_Train)/size(lbl_train,1);
disp(['Training Set: Total Train Acccuracy by MLP = ',num2str(100*accTrain ), '%'])
[confTest] = confusionmat(lbl_train(tr.testInd),Lbl_Yhat_Train(tr.testInd) )
%unknown test
test_predict = net(Xtest');
performanceTest = perform(net,Ytest',test_predict);
Yhat_test = (test_predict >= 0.5);
test_lbl=grp2idx(Ytest);
Lbl_Yhat_Test = grp2idx(Yhat_test);
[cmMatrix_Test]= confusionmat(test_lbl,Lbl_Yhat_Test )
This is the output.
Problem1: There seems to be no prediction for the other class. Why?
Problem2: Do I need a separate dataset like the one I created as Xtest for reporting generalization error or is it the practice to use the data trainData(tr.testInd,:) as the generalization test set? Did I create an unnecessary subset?
performanceTrain =
2.2204e-16
cmMatrixTrain =
45 0
45 0
Training Set: Total Train Acccuracy by MLP = 50%
confTest =
9 0
5 0
cmMatrix_Test =
5 0
5 0
There are a few issues with the code. Let's deal with them before answering your question. First, you set a threshold of 0.5 for making decisions (Yhat_train = (train_predict >= 0.5);) while all points of your net prediction are above 0.5. This means you only get zeros in your confusion matrices. You can plot the scores to choose a better threshold:
figure;
plot(train_predict(Ytrain == 1),'.b')
hold on
plot(train_predict(Ytrain == 2),'.r')
legend('label 1','label 2')
cvpartition gave me an error. It ran successfully as split1 = cvpartition(Y,'Holdout',0.1); In any case, artificial neural networks usuallly manage partitioning within the training process, so you feed in X and Y and some parameters for how to do it. See here for example: link where you set
net.divideParam.trainRatio = .4;
net.divideParam.valRatio = .3;
net.divideParam.testRatio = .3;
So how to report the results? Only for the test data. The train data will suffer from overfit, and will show false, too good results. If you use validation data (you havn't), then you cannot show results for it because it will also suffer from overfit. If you let the training do validation for you your test results will be safe from overfit.

Separate matrix of column vectors into training and testing data in MATLAB

I'm currently doing a project in MATLAB using the MNIST data set. I have a training data set of n = 50000, represented by a matrix of 784 x 50000 (50000 column vectors of size 784).
I am trying to separate my training and testing data (70-30, respectively), but the method I am using is a bit wordy and brute force for my liking. Being that this is MATLAB, I'm sure there has got to be a better way. The code I have been using is listed below. I'm brand new to MATLAB so please help! Thanks :)
% MNIST - data loads into trn and trnAns, representing
% the input vectors and the desired output vectors, respectively
load('Data/mnistTrn.mat');
mnist_train = zeros(784, 35000);
mnist_train_ans = zeros(10, 35000);
mnist_test = zeros(784, 15000);
mnist_test_ans = zeros(10, 15000);
indexes = zeros(1,50000);
for i = 1:50000
indexes(i) = i;
end
indexes(randperm(length(indexes)));
for i = 1:50000
if i <= 35000
mnist_train (:,i) = trn(:,indexes(i));
mnist_train_ans(:,i) = trnAns(:,indexes(i));
else
mnist_test(:,i-35000) = trn(:,indexes(i));
mnist_test_ans(:,i-35000) = trnAns(:,indexes(i));
end
end
I hope this works:
% MNIST - data loads into trn and trnAns, representing
% the input vectors and the desired output vectors, respectively
load('Data/mnistTrn.mat');
% Generating a random permutation for both trn and trnAns:
perm = randperm(50000);
% Shuffling both trn and trnAns columns using a single random permutations:
trn = trn(:, perm);
trnAns = trnAns(:, perm);
mnist_train = trn(:, 1:35000);
mnist_train_ans = trnAns(:, 1:35000);
mnist_test = trn(:, 35001:50000);
mnist_test_ans = trnAns(:, 35001:50000);

Variable error rate of SVM Classifier using K-Fold Cross Vaidation Matlab

I'm using K-Fold Cross-validation to get the error rate of a SVM Classifier. This is the code with wich I'm getting the error rate for 8-Fold Cross-validation:
data = load('Entrenamiento.txt');
group = importdata('Grupos.txt');
CP = classperf(group);
N = length(group);
k = 8;
indices = crossvalind('KFold',N,k);
single_error = zeros(1,k);
for j = 1:k
test = (indices==j);
train = ~test;
SVMModel_1 = fitcsvm(data(train,:),group(train,:),'BoxConstraint',1,'KernelFunction','linear');
classification = predict(SVMModel_1,data(test,:));
classperf(CP,classification,test);
single_error(1,j) = CP.ErrorRate;
end
confusion_matrix = CP.CountingMatrix
VP = confusion_matrix(1,1);
FP = confusion_matrix(1,2);
FN = confusion_matrix(2,1);
VN = confusion_matrix(2,2);
mean_error = mean(single_error)
However, the mean_error changes each time I run the script. This is due to crossvalind, which generates random cross-validation indices, so each time I run the script, it generates different random indices.
What should I do to calculate the true error rate? Should I calculate the mean error rate of n code executions? Or what value should I use?
You can check wiki,
In k-fold cross-validation, the original sample is randomly partitioned into k equal size subsamples.
and
The k results from the folds can then be averaged (or otherwise combined) to produce a single estimation.
So no worries about different error rates of randomly selecting folds.
Of course the results will be different.
However if your error rate is in wide range then increasing k would help.
Also rng can be used to get fixed results.

Naive Bayse Classifier for Multiclass: Getting Same Error Rate

I have implemented the Naive Bayse Classifier for multiclass but problem is my error rate is same while I increase the training data set. I was debugging this over an over but wasn't able to figure why its happening. So I thought I ll post it here to find if I am doing anything wrong.
%Naive Bayse Classifier
%This function split data to 80:20 as data and test, then from 80
%We use incremental 5,10,15,20,30 as the test data to understand the error
%rate.
%Goal is to compare the plots in stanford paper
%http://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf
function[tPercent] = naivebayes(file, iter, percent)
dm = load(file);
for i=1:iter
%Getting the index common to test and train data
idx = randperm(size(dm.data,1))
%Using same idx for data and labels
shuffledMatrix_data = dm.data(idx,:);
shuffledMatrix_label = dm.labels(idx,:);
percent_data_80 = round((0.8) * length(shuffledMatrix_data));
%Doing 80-20 split
train = shuffledMatrix_data(1:percent_data_80,:);
test = shuffledMatrix_data(percent_data_80+1:length(shuffledMatrix_data),:);
%Getting the label data from the 80:20 split
train_labels = shuffledMatrix_label(1:percent_data_80,:);
test_labels = shuffledMatrix_label(percent_data_80+1:length(shuffledMatrix_data),:);
%Getting the array of percents [5 10 15..]
percent_tracker = zeros(length(percent), 2);
for pRows = 1:length(percent)
percentOfRows = round((percent(pRows)/100) * length(train));
new_train = train(1:percentOfRows,:);
new_train_label = train_labels(1:percentOfRows);
%get unique labels in training
numClasses = size(unique(new_train_label),1);
classMean = zeros(numClasses,size(new_train,2));
classStd = zeros(numClasses, size(new_train,2));
priorClass = zeros(numClasses, size(2,1));
% Doing the K class mean and std with prior
for kclass=1:numClasses
classMean(kclass,:) = mean(new_train(new_train_label == kclass,:));
classStd(kclass, :) = std(new_train(new_train_label == kclass,:));
priorClass(kclass, :) = length(new_train(new_train_label == kclass))/length(new_train);
end
error = 0;
p = zeros(numClasses,1);
% Calculating the posterior for each test row for each k class
for testRow=1:length(test)
c=0; k=0;
for class=1:numClasses
temp_p = normpdf(test(testRow,:),classMean(class,:), classStd(class,:));
p(class, 1) = sum(log(temp_p)) + (log(priorClass(class)));
end
%Take the max of posterior
[c,k] = max(p(1,:));
if test_labels(testRow) ~= k
error = error + 1;
end
end
avgError = error/length(test);
percent_tracker(pRows,:) = [avgError percent(pRows)];
tPercent = percent_tracker;
plot(percent_tracker)
end
end
end
Here is the dimentionality of my data
x =
data: [768x8 double]
labels: [768x1 double]
I am using Pima data set from UCI
What are the results of your implementation of the training data itself? Does it fit it at all?
It's hard to be sure but there are couple things that I noticed:
It is important for every class to have training data. You can't really train a classifier to recognize a class if there was no training data.
If possible number of training examples shouldn't be skewed towards some of classes. For example if in 2-class classification number of training and cross validation examples for class 1 constitutes only 5% of the data then function that always returns class 2 will have error of 5%. Did you try checking precision and recall separately?
You're trying to fit normal distribution to each feature in a class and then use it for posterior probabilities. I'm not sure how it plays out in terms of smoothing. Could you try to re-implement it with simple counting and see if it gives any different results?
It also could be that features are highly redundant and bayes method overcounts probabilities.

How to set output size in Matlab newff method

Summary:
I'm trying to do classification of some images depending on the angles between body parts.
I assume that human body consists of 10 parts(as rectangles) and find the center of each part and calculate the angle of each part by reference to torso.
And I have three action categories:Handwave-Walking-Running.
My goal is to find which test images fall into which action category.
Facts:
TrainSet:1057x10 feature set,1057 stands for number of image.
TestSet:821x10
I want my output to be 3x1 matrice each row showing the percentage of classification for action category.
row1:Handwave
row2:Walking
row3:Running
Code:
actionCat='H';
[train_data_hw train_label_hw] = tugrul_traindata(TrainData,actionCat);
[test_data_hw test_label_hw] = tugrul_testdata(TestData,actionCat);
actionCat='W';
[train_data_w train_label_w] = tugrul_traindata(TrainData,actionCat);
[test_data_w test_label_w] = tugrul_testdata(TestData,actionCat);
actionCat='R';
[train_data_r train_label_r] = tugrul_traindata(TrainData,actionCat);
[test_data_r test_label_r] = tugrul_testdata(TestData,actionCat);
Train=[train_data_hw;train_data_w;train_data_r];
Test=[test_data_hw;test_data_w;test_data_r];
Target=eye(3,1);
net=newff(minmax(Train),[10 3],{'logsig' 'logsig'},'trainscg');
net.trainParam.perf='sse';
net.trainParam.epochs=50;
net.trainParam.goal=1e-5;
net=train(net,Train);
trainSize=size(Train,1);
testSize=size(Test,1);
if(trainSize > testSize)
pend=-1*ones(trainSize-testSize,size(Test,2));
Test=[Test;pend];
end
x=sim(net,Test);
Question:
I'm using Matlab newff method.But my output is always an Nx10 matrice not 3x1.
My input set should be grouped as 3 classes but they are grouped to 10 classes.
Thanks
%% Load data : I generated some random data instead
Train = rand(1057,10);
Test = rand(821,10);
TrainLabels = randi([1 3], [1057 1]);
TestLabels = randi([1 3], [821 1]);
trainSize = size(Train,1);
testSize = size(Test,1);
%% prepare the input/output vectors (1-of-N output encoding)
input = Train'; %'matrix of size numFeatures-by-numImages
output = zeros(3,trainSize); % matrix of size numCategories-by-numImages
for i=1:trainSize
output(TrainLabels(i), i) = 1;
end
%% create net: one hidden layer with 10 nodes (output layer size is infered: 3)
net = newff(input, output, 10, {'logsig' 'logsig'}, 'trainscg');
net.trainParam.perf = 'sse';
net.trainParam.epochs = 50;
net.trainParam.goal = 1e-5;
view(net)
%% training
net = init(net); % initialize
[net,tr] = train(net, input, output); % train
%% performance (on Training data)
y = sim(net, input); % predict
%[err cm ind per] = confusion(output, y);
[maxVals predicted] = max(y); % predicted
cm = confusionmat(predicted, TrainLabels);
acc = sum(diag(cm))/sum(cm(:));
fprintf('Accuracy = %.2f%%\n', 100*acc);
fprintf('Confusion Matrix:\n');
disp(cm)
%% Testing (on Test data)
y = sim(net, Test');
Note how I converted from category label for each instance (1/2/3) to a 1-to-N encoding vector ([100]: 1, [010]: 2, [001]: 3)
Also note that the test set is currently not being used, since by default the input data is divided into train/test/validation. You could achieve your manual division by setting net.divideFcn to the divideind function, and setting the corresponding net.divideParam parameters.
I showed the testing on the same training data, but you could do the same for the Test data.