Understanding the output of ovrpredict in LIBSVM - matlab

I'm implementing a multiclass classification with Libsvm adopting a one versus all strategy. For this purpose, I used the ovrtrain and ovrpredict MATLAB functions:
model = ovrtrain(GroupTrain, TrainingSet,'t -0' );
[predicted_labels ac decv] = ovrpredict(testY, TestSet, model);
The output of ovrpredict is as follows
Accuracy = 90% (18/20) (classification)
Accuracy = 90% (18/20) (classification)
Accuracy = 90% (18/20) (classification)
Accuracy = 95% (19/20) (classification)
Accuracy = 90% (18/20) (classification)
Accuracy = 90% (18/20) (classification)
Accuracy = 90% (18/20) (classification)
Accuracy = 90% (18/20) (classification)
Accuracy = 90% (18/20) (classification)
Accuracy = 90% (18/20) (classification)
I have 10 classes. I'm new to LIBSVM, so I guess that those accuracies correspond to the classification accuracy of each class. However, I don't understand the difference between this output and the accuracy value ac returned by ovrpredict, which is 60%.
ac =
0.6000
Thanks

The two values measure different things. Each Accuracy line is printed by the svmpredict() function for one binary (class i versus the rest) model and tells you how well that single model separates its class from all the others, while ac is the accuracy of the final predicted class labels w.r.t. the input test class labels (testY in your case).
Let's have a look inside the ovrpredict function and see how these accuracy values are generated.
function [pred, ac, decv] = ovrpredict(y, x, model)
From the definition, we can see that it takes 3 input parameters.
y = Class labels
x = Test data set
model = A struct containing 10 models for 10 different classes.
labelSet = model.labelSet;
labelSet holds the unique class labels stored in the model. In your case, there will be 10 unique labels, depending on how you labelled the 10 classes in your training data.
labelSetSize = length(labelSet)
Here you get the number of classes (10 in your case).
models = model.models;
The 'models' variable contains all trained models (10 in your case).
decv= zeros(size(y, 1), labelSetSize)
Here, the decv matrix is created to hold the decision values of each test instance, one column per class. These are SVM decision values, not probabilities.
for i=1:labelSetSize
[l,a,d] = svmpredict(double(y == labelSet(i)), x, models{i});
decv(:, i) = d * (2 * models{i}.Label(1) - 1);
end
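Here, we pass our test data through the svmpredict function once for each trained model. In your case, this loop iterates 10 times and prints the classification Accuracy of each binary model. For example, Accuracy = 90% (18/20) (classification) indicates that, for that specific class, 18 out of 20 rows of your test data set were correctly identified as belonging or not belonging to it. A binary model that answers "not this class" for every row is already right for every row that belongs to one of the other 9 classes, so these values can be high even when the final prediction is poor.
Please note that in multi-class SVM you can't make a decision based on these Accuracy values. You will need pred for the individual predictions and ac for the overall estimate.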
double(y == labelSet(i)) turns the multi-class labels into binary labels by checking which labels in y belong to the class that the iterator i is pointing to. It outputs 0 for unmatched and 1 for matched cases, so the resulting label vector contains only 0's and 1's and corresponds to a two-class (one-versus-rest) SVM.
decv(:, i) = d * (2 * models{i}.Label(1) - 1) fixes the sign of the decision values so that +ve always means "belongs to class i" and -ve always means "belongs to the rest". svmpredict reports decision values relative to models{i}.Label(1), which can only be 0 (for unmatched cases) or 1 (for matched cases). Hence (2 * models{i}.Label(1) - 1) always evaluates to 1 or -1, and the sign is flipped only when the first label of the trained model is 0.
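The sign correction can be tried in isolation. The following is only a small sketch: the values in d are made up, and firstLabel is a hypothetical stand-in for models{i}.Label(1), so no LIBSVM call is needed.

```matlab
% Made-up raw decision values for three test rows (illustrative only).
d = [1.2; -0.4; 0.3];

% Case 1: the first label stored in the model is 1 (the "matched" class).
firstLabel = 1;                        % hypothetical stand-in for models{i}.Label(1)
oriented1 = d * (2 * firstLabel - 1);  % multiplier is +1, values are unchanged

% Case 2: the first label stored in the model is 0 (the "rest" class).
firstLabel = 0;
oriented2 = d * (2 * firstLabel - 1);  % multiplier is -1, every sign is flipped
```

In both cases a positive entry in the result points towards class i, which is what makes the columns of decv comparable across models.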
[tmp,pred] = max(decv, [], 2);
pred = labelSet(pred);
max returns two column vectors: the first (tmp) contains the maximum decision value in each row and the second (pred) the column (i.e. class) index at which that maximum occurs. Since we are only interested in the class index, we discard the tmp variable.
ac = sum(y==pred) / size(x, 1);
Finally, we calculate ac by counting how many predicted labels match the input test labels and dividing that sum by the number of test instances (rows of x).
In your case the test set has 20 rows (see the 18/20 above), so ac=0.6 means that 12 out of 20 test labels match the predicted labels and 8 labels have been predicted otherwise.
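To see how the per-model Accuracy values and ac can diverge, here is a minimal sketch of the last steps of a one-versus-rest prediction. The labels and the decision values in decv are made up for illustration (3 classes, 4 test rows); nothing here calls LIBSVM, and binaryAcc is a name introduced only for this example.

```matlab
% Made-up example: 3 classes, 4 test rows (not taken from the question).
labelSet = [10; 20; 30];
y = [10; 20; 30; 20];              % true labels of the test rows

% One column of (already sign-corrected) decision values per class.
decv = [ 0.8 -0.5 -0.9;
        -0.2 -0.4 -0.6;
        -0.7 -0.3  0.5;
        -0.9 -0.6 -0.1];

% Accuracy of each binary "class i versus rest" decision.
binaryAcc = zeros(1, numel(labelSet));
for i = 1:numel(labelSet)
    binaryAcc(i) = mean((decv(:, i) > 0) == (y == labelSet(i)));
end

% Final multi-class prediction: the class with the largest decision value.
[~, idx] = max(decv, [], 2);
pred = labelSet(idx);
ac = sum(y == pred) / numel(y);
```

With these numbers binaryAcc is [1 0.5 1] while ac is 0.5: two of the binary models are perfect, yet half of the final labels are wrong because the model for class 20 never fires.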
I hope this answers your question.

Related

Individual class accuracy calculation confusion

The total number of data points for which the following binary classification result is obtained = 1500. Out of which, I have
1473 labelled as 0 and
the remaining 27 as 1 .
As can be seen from the confusion matrix, out of 27 data points belonging to class 1, I got only 1 data point misclassified as 0 . So, I calculated the accuracy for individual classes and got Accuracy for class labelled as 0 = 98.2% and for the other as 1.7333%. Is this calculation correct? I am not sure...I did get a pretty good classification for the class labelled as 1 so why the accuracy for it is low?
The individual class accuracies should have been 100% for class0 and around 98% for class1
Does one misclassification reduce the accuracy of class 1 by that much? This is how I calculated the individual class accuracies in MATLAB.
cmMatrix =
1473 0
1 26
acc_class0 = 100*(cmMatrix(1,1))/1500;
acc_class1= 100*(cmMatrix(2,2))/1500;
If everything had been classified correctly, your computation would give an accuracy for class 1 of 27/1500=0.018 (1.8%). This is obviously wrong. Overall accuracy is 1499/1500, but per-class accuracy cannot use 1500 as the denominator. 27 is the maximum number of class 1 elements that can be classified correctly, and should therefore be the denominator.
acc_class0 = 100*cmMatrix(1,1)/sum(cmMatrix(1,:));
acc_class1 = 100*cmMatrix(2,2)/sum(cmMatrix(2,:));
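As a quick check, the same computation can be written in vectorised form for the confusion matrix from the question. This is a sketch that assumes, as the row sums above do, that each row of cmMatrix holds one true class; overallAcc and perClassAcc are names introduced here.

```matlab
% Confusion matrix from the question (rows assumed to be the true classes).
cmMatrix = [1473 0;
               1 26];

% Overall accuracy: correctly classified points over all points.
overallAcc = 100 * sum(diag(cmMatrix)) / sum(cmMatrix(:));

% Per-class accuracy: correct points of a class over the size of that class.
perClassAcc = 100 * diag(cmMatrix) ./ sum(cmMatrix, 2);
```

This gives 100% for class 0 and 100*26/27, roughly 96.3%, for class 1.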

2 errors in libsvm matlab "Model does not support probabiliy estimates and Subscripted assignment dimension mismatch"

I want to classify a list of 5 test images using the LIBSVM library with a 'one against all' strategy in order to obtain probabilities for each class. The code used is below:
load('D:\xapp.mat');
load('D:\xtest.mat');
load('D:\yapp.mat');%% matrix contains true class of images yapp=[641;645;1001;1010;1100]
load('D:\ytest.mat');%% matrix contains unlabeled class of test set ytest=[1;2;3;4;5]
numLabels=max(yapp);
numTest=size(ytest,1);
%# train one-against-all models
model = cell(numLabels,1);
for k=1:numLabels
model{k} = svmtrain(double(yapp==k),xapp, ['-c 1000 -g 10 -b 1 ']);
end
%# get probability estimates of test instances using each model
prob = zeros(numTest,numLabels);
for k=1:numLabels
[~,~,p] = svmpredict(double(ytest==k), xtest, model{k}, '-b 1');
prob(:,k) = p(:,model{k}.Label==1); %# probability of class==k
end
%# predict the class with the highest probability
[~,pred] = max(prob,[],2);
acc = sum(pred == ytest) ./ numel(ytest) %# accuracy
I obtain this error :
Model does not support probabiliy estimates
Subscripted assignment dimension mismatch.
Error in comp (line 98)
prob(:,k) = p(:,model{k}.Label==1); %# probability of class==k
please, help me to solve this error and thanks in advance
What you're trying to do is use a code snippet that evaluates the performance of an SVM classifier, whereas your goal is to estimate the labels for your test set.
I assume your five labels are [641;645;1001;1010;1100] (as in yapp). First thing you have to do is delete ytest, because you don't know any labels for the test set. It is pointless to fill ytest with some dummy values: the SVMs will return our predicted labels.
The first error, as already pointed out in the comments is in
numLabels=max(yapp);
you must replace max() with length() in order to get the number of classes (this works here because each label appears exactly once in yapp).
The training stage is almost correct.
Given the fact that k goes from 1 to 5 whereas yapp has the range above, you should consider changing double(yapp==k) into double(yapp==yapp(k)): in this manner we mark as positive the k-th value in yapp. Given the fact that k goes from 1 to 5, then yapp(k) will go from 641 to 1100.
And now the prediction stage.
The first input to svmpredict() should be the test labels, but we don't know them, so we can fill it with a vector of zeros (as many zeros as there are patterns in the test set). svmpredict() also reports the accuracy when the test labels are known, but that's not the case here. So you must change the second for-loop to
for k=1:numLabels
[~,~,p] = svmpredict(zeros(size(xtest,1),1), xtest, model{k}, '-b 1');
prob(:,k) = p(:,model{k}.Label==1); %# probability of class==k
end
and finally predict the labels with
[~,pred] = max(prob,[],2);
and pred contains the predicted labels.
Note 1: in this method, however, you cannot measure accuracy and/or other parameters because what we called the test set actually is not a test set. A test set is a labelled set and we pretend we don't know its labels in order to let the SVM predict them and then match the predicted labels with the actual labels in order to measure its accuracy.
Note 2: predicted labels in pred will most likely have values in range 1 to 5 due to the second for-loop. However, since your labels have different values, you can map back taking into account that 1 is 641, 2 is 645, 3 is 1001, 4 is 1010, 5 is 1100.
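The mapping in Note 2 can be done with a single indexing step. The sketch below uses a made-up prob matrix (three test rows) instead of real svmpredict output, and it assumes the models were trained in the order of yapp, i.e. column k of prob belongs to yapp(k); predictedLabels is a name introduced here.

```matlab
% Class labels in the order in which the one-against-all models were trained.
yapp = [641; 645; 1001; 1010; 1100];

% Made-up probability estimates: one row per test image, one column per model.
prob = [0.10 0.60 0.10 0.10 0.10;
        0.05 0.05 0.10 0.10 0.70;
        0.50 0.20 0.10 0.10 0.10];

% Column index of the most probable class for every test image (values 1..5).
[~, pred] = max(prob, [], 2);

% Map the column indices back to the original class labels.
predictedLabels = yapp(pred);
```

Here pred is [2; 5; 1] and predictedLabels is [645; 1100; 641].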

Classifying data based on a training set

I have some data that needs classifying. I've tried to use the classify function described here.
My sample is a matrix that has 1 column and 382 rows.
My training is a matrix with 1 column and 2 rows.
Grouping is causing me the issues. I've written: grouping = [a,b]; where a is one category and b is another.
This gives me the error:
Undefined function or variable 'a'.
Error in discrimtrialab (line 89)
grouping = [a,b];
Further to this, how do I classify a group, ie. not just the exact value in training?
Here is my code:
a = -0.09306:0.0001:0.00476;
b = -0.02968:0.0001:0.01484;
%training = groups (odour index)
training = [-0.09306:0.00476; -0.02968:0.01484;];
%grouping variable
group = [a,b]
%classify
[class, err] = classify(sample, training, group, 'linear');
class(a)
(note - there is some processing above this, but it is irrelevant to the question)
From the documentation:
class = classify(sample,training,group) classifies each row of the
data in sample into one of the groups in training. (See Grouped Data.)
sample and training must be matrices with the same number of columns.
group is a grouping variable for training. Its unique values define
groups; each element defines the group to which the corresponding row
of training belongs.
That is, "group" must have the same number of rows as training. From the example in the help:
load fisheriris
SL = meas(51:end,1);
SW = meas(51:end,2);
group = species(51:end);
SL & SW are 100 x 1 matrices to be used for training (two different measurements made on each of 100 samples). group is a 100 x 1 cell array of strings indicating which species each of those measurements belongs to. It could also be a char array or simply a list of numbers (1,2,3) where each number refers to a different group, but it must have 100 rows.
e.g. if your training matrix was a 100 x 1 matrix of doubles, where the first 50 were values that belonged to 'a' and the second 50 were values that belonged to 'b' your group matrix could be:
group = [repmat('a',50,1);repmat('b',50,1)];
However, if all your "groups" are just non-overlapping ranges as stated here in the comments:
What I want classify to do is work out whether or not each number in
"sample" is type A, ie, in the range -0.04416 +/- 0.0163, or type B,
with the range -0.00914 +/- 0.00742
then you don't really need classify. To extract the values from sample which are equal to a value plus or minus some tolerance:
sample1 = sample(abs(sample-value)<tol);
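Applied to the two ranges quoted above, the tolerance test could look like the following sketch. The sample values are made up, and the numeric codes 1 (type A), 2 (type B) and 0 (neither) are a choice made only for this example.

```matlab
% Made-up sample values to be sorted into type A, type B or neither.
sample = [-0.05; -0.03; -0.01; 0.02];

% Centre and tolerance of each range, taken from the comment quoted above.
isA = abs(sample - (-0.04416)) <= 0.0163;    % type A: -0.04416 +/- 0.0163
isB = abs(sample - (-0.00914)) <= 0.00742;   % type B: -0.00914 +/- 0.00742

% Encode the result: 1 = type A, 2 = type B, 0 = outside both ranges.
label = zeros(size(sample));
label(isA) = 1;
label(isB) = 2;
```

For these four values label is [1; 1; 2; 0]. Because the two ranges do not overlap, the order of the last two assignments does not matter.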
ETA after latest comment: "group" can be a numeric vector, so if you have a training data set which you need to group based on the ranges of some variable, then something like the following should work (this code is unchecked but the basic principle should be sound):
%presume "data" is our training data (381 x 3) and "sample" (n x 2) is the data we want to classify
group = zeros(length(data),1); %empty matrix
% first column is variable for grouping, second + third are data equivalent to the entries in "sample".
training = data(:,2:3);
% find where data(:,1) meets whatever our requirements are and label groups with numbers
group(data(:,1)<3)=1; % group "1" is wherever first column is below 3
group(data(:,1)>7)=2; % group "2" is wherever first column is above 7
group(group==0)=NaN; % set any remaining data to NaN
%now we classify "sample" based on "data" which has been split into "training" and "group" variables
class = classify(sample, training, group);

Find the classification rate of testing data

I need to use KNN search to classify the testing data and find the classification rate.
Below is the matlab code:
for example:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
load fisheriris
x = meas(:,3:4); % x =all training data
y = [5 1.45;6 2;2.75 .75]; % y =3 testing data
[n,d] = knnsearch(x,y,'k',10); % find the 10 nearest neighbors to three testing data
for b=1:3
tabulate(species(n(b,:)))
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The result was displayed in the Command Window:
tabulate(species(n(1,:)))
Value Count Percent
virginica 2 20.00%
versicolor 8 80.00%
tabulate(species(n(2,:)))
Value Count Percent
virginica 10 100.00%
tabulate(species(n(3,:)))
Value Count Percent
versicolor 7 70.00%
setosa 3 30.00%
If the true class of all testing points is 'versicolor', the first and third testing points are classified correctly and the second one is wrong. So the classification rate is 2/3 x 100% = 66.7%.
Is there any way to modify the MATLAB code so that it finds the classification rate automatically and saves the result in the Workspace?
In general you can find the number of correct predictions by using
sum(predicted_class == true_class) % For numerical data
sum(strcmp(predicted_class, true_class)) % For cellstrings
Or as a percentage
100 * sum(predicted_class == true_class) / length(predicted_class)
In the case of fisheriris the true class would be species. For your constructed data it would be
true_class = [cellstr('versicolor'); cellstr('versicolor'); cellstr('versicolor')]
In the case of nearest neighbours, the predicted classes would be the class of the nearest neighbour(s). For a single neighbour:
predicted_class = species(n)
Where n is the index of the nearest neighbour as found by [n, d] = knnsearch(x, y).
sum(strcmp(predicted_class, true_class))
% result: 1
Which is indeed correct when you use only one neighbor.
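For more than one neighbour, one common choice is a majority vote over the neighbours' classes. The sketch below is self-contained and does not call knnsearch: neighbourIds is a hand-built stand-in for the class indices of the 10 neighbours of each test point, reconstructed from the counts in the tabulate output of the question (the order within each row is arbitrary), and the coding 1 = setosa, 2 = versicolor, 3 = virginica is an assumption of this example.

```matlab
% Class names; the numeric coding 1..3 is chosen for this example only.
classNames = {'setosa'; 'versicolor'; 'virginica'};

% Class index of each of the 10 neighbours, one row per test point
% (counts copied from the tabulate output shown in the question).
neighbourIds = [repmat(2, 1, 8), repmat(3, 1, 2);
                repmat(3, 1, 10);
                repmat(2, 1, 7), repmat(1, 1, 3)];

% Majority vote along each row gives the predicted class index.
predictedIds = mode(neighbourIds, 2);
predicted_class = classNames(predictedIds);

% True classes as assumed in the question (all versicolor).
true_class = repmat({'versicolor'}, 3, 1);

% Classification rate in percent, stored in a workspace variable.
rate = 100 * sum(strcmp(predicted_class, true_class)) / numel(true_class);
```

With real data, neighbourIds could be obtained by indexing a numeric version of species with the n returned by knnsearch. For the counts above, rate is 2/3 x 100, i.e. about 66.7%, which matches the manual calculation in the question.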

Matlab : decision tree shows invalid output values

I'm making a decision tree using the classregtree(X,Y) function. I'm passing X as a matrix of size 70X9 (70 data objects, each having 9 attributes), and Y as a 70X1 matrix. Each one of my Y values is either 2 or 4. However, in the decision tree formed, it gives values of 2.5 or 3.5 for some of the leaf nodes.
Any ideas why this might be caused?
You are using classregtree in regression mode (which is the default mode).
Change the mode to classification mode.
Here is an example using CLASSREGTREE for classification:
%# load dataset
load fisheriris
%# split training/testing
cv = cvpartition(species, 'holdout',1/3);
trainIdx = cv.training;
testIdx = cv.test;
%# train
t = classregtree(meas(trainIdx,:), species(trainIdx), 'method','classification', ...
'names',{'SL' 'SW' 'PL' 'PW'});
%# predict
pred = t.eval(meas(testIdx,:));
%# evaluate
cm = confusionmat(species(testIdx),pred)
acc = sum(diag(cm))./sum(testIdx)
The output (confusion matrix and accuracy):
cm =
17 0 0
0 13 3
0 2 15
acc =
0.9
Now if your target class is encoded as numbers, the returned prediction will still be a cell array of strings, so you have to convert them back to numbers:
%# load dataset
load fisheriris
[species,GN] = grp2idx(species);
%# ...
%# evaluate
cm = confusionmat(species(testIdx),str2double(pred))
acc = sum(diag(cm))./sum(testIdx)
Note that classification will always return strings, so I think you might have mistakenly used the method=regression option, which performs regression (numeric target) not classification (discrete target)