What are the MATLAB 'perfcurve' ROC Curve Parameters?

I have been using the LibSVM classifier to classify between 3 different classes, labeled 2, 1, and -1.
I'm trying to use MATLAB to generate ROC curve graphs for some data produced using LibSVM, but I am having trouble understanding the parameters it needs to run.
I assume that:
labels is the vector of labels that states into which class my data belongs (mine consists of 1, -1, and 2 and is 60x1 in size)
scores is the variable created by LibSVM called 'accuracy_score' (60x3 in size)
But I don't know what posclass is?
I would also appreciate finding out if my assumptions are correct, and if not, why not?

See here for a clear explanation:
Given the following call:
[X,Y] = perfcurve(labels,scores,posclass);
labels are the true labels of the data, scores are the output scores from your classifier (before the threshold) and posclass is the positive class in your labels.

See the documentation of perfcurve: posclass is the label of the positive class; in your case it has to be either 1, -1, or 2.
http://www.mathworks.com/help/stats/perfcurve.html
A ROC curve has "false positive rate" on the x axis and "true positive rate" on the y axis. By specifying posclass, you specify with respect to which class the false positive rate and true positive rate are calculated.
For example, if you specify posclass as 2, then whenever the true label is 2, predicting either 1 or -1 is counted as a false prediction (a false negative).
Edit:
The accuracy_score you mentioned (in my version of the documentation (3.17, in the matlab folder) it is called decision_values/prob_estimates) has 3 columns; each column corresponds to the probability of the data belonging to one class.
e.g.
model = svmtrain(train_label, train_data);
[predicted_label, accuracy, decision_values] = svmpredict(test_label, test_data, model);
model.Label contains the class labels; each column of decision_values contains the probability that the test case belongs to the class given in the corresponding entry of model.Label (see http://www.csie.ntu.edu.tw/~b91082/SVM/README).
To use perfcurve to compute the ROC for class m:
[X,Y] = perfcurve(truelabels, decision_values(:,m)*model.Label(m),model.Label(m));
It is essential to use decision_values(:,m)*model.Label(m), especially when your class label is a negative number.
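Putting these pieces together, a minimal sketch for plotting one ROC curve per class might look like this (it assumes truelabels and decision_values as defined above, and follows the sign-adjustment advice; treat it as an illustration rather than a definitive recipe):
hold on
for m = 1:numel(model.Label)
    % one-vs-rest ROC, treating model.Label(m) as the positive class
    [X, Y] = perfcurve(truelabels, decision_values(:,m)*model.Label(m), model.Label(m));
    plot(X, Y);
end
hold off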

Related

Why ROC's plotting function perfcurve of MATLAB is yielding 3 ROC curves in case of cross validation?

I passed 5-fold cross-validation data as cell arrays to the perfcurve function with positive class = 1. It then generated 3 curves, as you can see in the diagram, whereas I was expecting only one curve.
[X,Y,T,AUC,OPTROCPT,SUBY,SUBYNAMES] = perfcurve(Actual_label,Score,1);
plot(X,Y)
Here, Actual_label and Score are cell arrays of size 5x1, each cell containing a 70x1 vector, and 1 denotes positive class = 1.
P.S.: I am using a one-class SVM, and the 'fitSVMPosterior' function is not appropriate for one-class learning (the same is mentioned in the MATLAB documentation). Therefore posterior probabilities can't be used here.
When you compute the confidence bounds, X and Y are an m-by-3 array, where m is the number of fixed X values or thresholds (T values). The first column of Y contains the mean value. The second and third columns contain the lower bound and the upper bound, respectively, of the pointwise confidence bounds. AUC is also a row vector with three elements, following the same convention.
Above explanation is taken from MATLAB documentation.
That is expected: because you passed the 5 folds as cell arrays, perfcurve computed pointwise confidence bounds, so plot(X,Y) draws the mean ROC curve together with the lower- and upper-bound curves.
Now if you want only one ROC curve for your classifier, you can either use the 5 trained classifiers to predict the labels of an independent test set, or average the posterior probabilities of the 5 folds and plot a single ROC.
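Alternatively, since the first column of X and Y holds the mean (per the documentation excerpt above), a minimal sketch for drawing only the averaged curve could be:
[X, Y, T, AUC] = perfcurve(Actual_label, Score, 1);
plot(X(:,1), Y(:,1)); % column 1 is the mean; columns 2 and 3 are the confidence bounds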

plot ROC curve for neural network classifier using perfcurve

I have been using the patternnet classifier to classify between 2 different classes, labeled 0 and 1.
I'm trying to use MATLAB to generate ROC curve graphs for some data produced using patternnet, but I am having trouble understanding the parameters it needs to run.
[xTr, yTr, TTr, aucTr] = perfcurve(t, results.Data.y, 1);
I assume that:
t is the vector of labels that states into which class my data belongs (mine consists of 0 and 1 and is 2x834 in size)
scores is the variable created by patternnet called 'results.Data.y' (2x834 in size)
posclass is 1.
But scores should be a vector (1x834 in size), and I don't know which row to choose.
Detail about the perfcurve function is available here :
http://in.mathworks.com/help/stats/perfcurve.html
The examples are pretty helpful.
Regarding your specific case, t is not the predicted vector but the target vector you create yourself to label your test data, which would be 1xn (where n = 834 in your case).
Your score matrix would be m x n, where m is the number of classes in your data (2 in your case).
Since you say 1 is your positive class, choose the row that contains the scores of your positive class (row 1 here, assuming the first row corresponds to class 1).
Something like:
[xTr, yTr, TTr, aucTr] = perfcurve(t, results.Data.y(1,:), 1);
plot(xTr, yTr);

KNN classification in MATLAB - confusion matrix and ROC?

I'm trying to classify a data set containing two classes using different classifiers (LDA, SVM, KNN) and would like to compare their performance. I've made ROC curves for the LDA by modifying the prior probability.
But how can I do the same for a KNN classifier?
I searched the documentation and found some functions:
(a) Class = knnclassify(Sample, Training, Group, k)
(b) mdl = ClassificationKNN.fit(X,Y,'NumNeighbors',i,'leaveout','On')
I can run (a) and get a confusion matrix by using leave-one-out cross-validation, but it is not possible to change the prior probability to make an ROC, is it?
I haven't tried (b) before, but it creates a model where you can modify mdl.Prior. However, I have no clue how to get a confusion matrix from it.
Is there an option I've missed, or can someone explain how to fully use those functions to get a ROC?
This is indeed not straightforward, because the output of the k-nn classifier is not a score from which a decision is derived by thresholding, but only a decision based on the majority vote.
My suggestion: define a score based on the ratio of classes in the neighborhood, and then threshold this score to compute the ROC. Loosely speaking, the score expresses how certain the algorithm is; it ranges from -1 (maximum certainty for class -1) to +1 (maximum certainty for class +1).
Example: for k=6, the score is
1 if all six neighbours are of class +1;
-1 if all six neighbours are of class -1;
0 if half the neighbours are of class +1 and half the neighbours are of class -1.
Once you have computed this score for each datapoint, you can feed it into a standard ROC function.
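In MATLAB, a minimal sketch of this idea might look like the following (Xtrain, ytrain, Xtest, and ytest are placeholder names, labels are assumed to be coded as -1/+1, and knnsearch is from the Statistics Toolbox):
k = 6;
idx = knnsearch(Xtrain, Xtest, 'K', k); % n-by-k indices of the k nearest training points
score = mean(ytrain(idx), 2);           % neighbourhood ratio, in [-1, +1]
[Xroc, Yroc, T, AUC] = perfcurve(ytest, score, 1);
plot(Xroc, Yroc);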

Linear Discriminant Analysis in Matlab

I am working on performing an LDA in MATLAB, and I am able to get it to successfully create a threshold for distinguishing between binary classes. However, I noticed that the threshold always crosses the origin, which gives me incorrect thresholds. Is there a way to perform an LDA in MATLAB without the threshold crossing the origin?
Thanks in advance
This depends on which formulation you are using for LDA.
By threshold, I assume you're referring to the decision threshold?
In the code below the prior probabilities affect the decision threshold, so you may not be setting them correctly.
Here is some sample pseudo code:
N = number of cases
c= number of classes
Priors = vector of prior probabilities for each case per class
Targets = target labels for each case per class
dimension of Data = Features x Cases.
Get target labels for each data point:
T = Targets(:,Cases); % Target labels for each case
Calculate the mean vector per class and the common covariance matrix:
for j = 1:c
    classifier.u(:,j) = mean(Data(:, T(j,:)==1), 2); % matrix of class mean vectors, one column per class
end
classifier.invCV = inv(cov(Data')); % inverse of the common covariance matrix
Get discriminant value using class mean vectors and common covariance matrix:
A1=classifier.u;
B1=classifier.invCV;
d = size(Data,1); % d = number of features (Data is Features x Cases)
D = A1'*B1*Data - 0.5*(A1'*B1.*A1')*ones(d,N) + log(Priors(:,Cases));
The function will produce c discriminant values per case. Each case is then assigned to the class with the largest discriminant value.
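If you would rather use the built-in tooling as a cross-check, here is a hedged sketch with fitcdiscr from the Statistics and Machine Learning Toolbox; its 'Prior' argument shifts the decision threshold, which is the knob relevant to the question (X as hypothetical n-by-p data and y as an n-by-1 label vector are placeholders):
mdl = fitcdiscr(X, y, 'Prior', [0.7 0.3]); % unequal priors move the threshold away from the equal-prior default
[pred, score] = predict(mdl, X);           % score has one column per class and can feed perfcurve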

how to calculate roc curves?

I wrote a classifier (Gaussian mixture model) to classify five human actions. For every observation the classifier computes the posterior probability of belonging to a cluster.
I want to evaluate the performance of my system parameterized with a threshold, with values from 0 to 100. For each threshold value and each observation, if the probability of belonging to one of the clusters is greater than the threshold, I accept the result of the classifier; otherwise I discard it.
For every threshold value I compute the number of true positives, true negatives, false positives, and false negatives.
Then I compute the two functions sensitivity and specificity as
sensitivity = TP/(TP+FN);
specificity=TN/(TN+FP);
In MATLAB:
plot(1-specificity,sensitivity);
to have the ROC curve. But the result isn't what I expect.
(Plots referenced here but not reproduced: the discards, errors, corrects, sensitivity, and specificity as functions of the threshold for one action; the ROC curve of that action; and a stem plot of the same ROC curve.)
I am doing something wrong, but I don't know where. Perhaps I am miscalculating FP, FN, TP, TN, especially when the classifier's output is below the threshold and the observation is discarded. What should I increment when there is a discard?
Background
I am answering this because I need to work through the content, and a question like this is a great excuse. Thank you for the good opportunity.
I use data from the built-in Fisher iris dataset:
http://archive.ics.uci.edu/ml/datasets/Iris
I also use code snippets from the Mathworks tutorial on classification, and from the documentation for plotroc:
http://www.mathworks.com/products/demos/statistics/classdemo.html
http://www.mathworks.com/help/nnet/ref/plotroc.html?searchHighlight=plotroc
Problem Description
There is a clearer boundary within the domain to classify "setosa", but there is overlap for "versicolor" vs. "virginica". This is a two-dimensional plot, and some of the other information has been discarded to produce it. The ambiguity in the classification boundaries is a useful thing in this case.
%load data
load fisheriris
%show raw data
figure(1); clf
gscatter(meas(:,1), meas(:,2), species,'rgb','osd');
xlabel('Sepal length');
ylabel('Sepal width');
axis equal
axis tight
title('Raw Data')
Analysis
Let's say that we want to determine the bounds for a linear classifier that defines "virginica" versus "non-virginica". We could look at "self vs. not-self" for the other classes, but they would have their own boundaries and their own ROC curves.
So now we make some linear discriminants and plot the ROC for them:
%load data
load fisheriris      % provides meas and species
load iris_dataset    % provides irisInputs and irisTargets (Neural Network Toolbox data)
irisTargets = irisTargets(3,:);   % 1 where the sample is virginica, 0 otherwise
%self-classify with five discriminant types
ldaClass1 = classify(meas(:,1:2),meas(:,1:2),irisTargets','linear')';
ldaClass2 = classify(meas(:,1:2),meas(:,1:2),irisTargets','diaglinear')';
ldaClass3 = classify(meas(:,1:2),meas(:,1:2),irisTargets','quadratic')';
ldaClass4 = classify(meas(:,1:2),meas(:,1:2),irisTargets','diagquadratic')';
ldaClass5 = classify(meas(:,1:2),meas(:,1:2),irisTargets','mahalanobis')';
%stack targets and outputs so plotroc draws one curve per classifier
myinput = repmat(irisTargets,5,1);
myoutput = [ldaClass1;ldaClass2;ldaClass3;ldaClass4;ldaClass5];
plotroc(myinput,myoutput)
The result is shown in the following, though it took deleting repeat copies of the diagonal:
You can note in the code that I stack "myinput" and "myoutput" and feed them as inputs into the "plotroc" function. You should take the results of your classifier as targets and actuals and you can get similar results. This compares the actual output of your classifier versus the ideal output of your target values. Those are the input to plotroc.
So this will give you "built-in" ROC, which is useful for quick work, but does not make you learn every step in detail.
Questions you can ask at this point include:
Which classifier is best? How do I determine what "best" means in this case?
What is the convex hull of the classifiers? Is there some mixture of classifiers that is more informative than any pure method? Bagging perhaps?
You are trying to draw curves of precision vs. recall as a function of the classifier's threshold parameter. The definitions of precision and recall are:
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
You can check the definition of these parameters in:
http://en.wikipedia.org/wiki/Precision_and_recall
There are some curves here:
http://www.cs.cornell.edu/courses/cs578/2003fa/performance_measures.pdf
Are you dividing your dataset into a training set, a cross-validation set, and a test set? (If you do not divide the data, it is normal for your precision-recall curve to look weird.)
EDITED: I think that there are two possible sources for your problem:
When you train a classifier for 5 classes, usually you have to train 5 distinct classifiers: one classifier for (class A = class 1, class B = class 2, 3, 4 or 5), then a second classifier for (class A = class 2, class B = class 1, 3, 4 or 5), ..., and a fifth for (class A = class 5, class B = class 1, 2, 3 or 4).
As you said, to select the output of your "compound" classifier you pass each new (test) datapoint through the five classifiers and choose the one with the biggest probability.
Then you should have 5 thresholds defining weighting values that may prioritize selecting one classifier over the others. You should check how the MATLAB implementation uses the thresholds, but their effect is that you don't choose the class with the highest probability, but the class with the best weighted probability.
As you say, maybe you are not calculating TP, TN, FP, FN correctly. Your test data should have datapoints belonging to all the classes, so that testdata(i,:) and classtestdata(i) are the feature vector and "ground truth" class of datapoint i. When you evaluate the classifier you obtain classifierOutput(i) = 1, 2, 3, 4 or 5. Then you should calculate the "confusion matrix", which is the way to compute TP, TN, FP, FN when you have multiple classes (> 2):
http://en.wikipedia.org/wiki/Confusion_matrix
http://www.mathworks.com/help/stats/confusionmat.html
(note the relation between TP, TN, FP, FN that you are calculating for the multiclass problem)
I think that you can obtain the TP, TN, FP, FN data of each subclassifier (remember that you are effectively training 5 separate classifiers, even if you do not realize it) from the confusion matrix. I am not sure, but you should then be able to draw the precision-recall curve for each subclassifier, as sketched below.
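As a concrete but hedged sketch, here is one way to pull per-class counts out of confusionmat, assuming classtestdata and classifierOutput as described above:
C = confusionmat(classtestdata, classifierOutput); % 5x5 for the 5-class problem
for k = 1:size(C,1)
    TP = C(k,k);
    FN = sum(C(k,:)) - TP; % true class k, predicted as something else
    FP = sum(C(:,k)) - TP; % predicted as k, true class something else
    TN = sum(C(:)) - TP - FN - FP;
    fprintf('class %d: TP=%d FP=%d FN=%d TN=%d\n', k, TP, FP, FN, TN);
end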
Also check these slides: http://www.slideserve.com/MikeCarlo/multi-class-and-structured-classification
I don't know what the ROC curve is; I will check it, because machine learning is a really interesting subject for me.
Hope this helps,