How to plot precision and recall of a CNN in MATLAB? - matlab

How to plot the precision and recall curves of a CNN?
I have generated the scores from CNN and want to plot the precision-recall curve, but I am unable to get that.
I have calculated TP, TN, FP, and FN using:
idx = (ACTUAL()==1);
p = length(ACTUAL(idx));
n = length(ACTUAL(~idx));
N = p+n;
tp = sum(ACTUAL(idx)==PREDICTED(idx));
tn = sum(ACTUAL(~idx)==PREDICTED(~idx));
fp = n-tn;
fn = p-tp;
The formula of precision and recall is
precision = tp/(tp+fp)
but with that, I am getting some undesired plot.
I have obtained scores of the CNN using the following command:
[YTest,score]=classify(convnet,TestData)

MATLAB has a function for creating ROC curves and similar performance curves (such as precision-recall curves) in the Statistics and Machine Learning Toolbox: perfcurve.
By default, the ROC curve is calculated.
The function has the following syntax:
[X, Y] = perfcurve(labels, scores, posclass)
Here, labels is the true label for each sample, scores is the prediction of the CNN (or any other classifier), and posclass is the label of the class you assume to be "positive" - which appears to be 1 in your example. The outputs of the perfcurve function are the (x, y) coordinates of the ROC curve, so you can easily plot it using
plot(X, Y)
To make perfcurve plot the precision-recall curve instead of the ROC curve, you have to set the optional 'XCrit' and 'YCrit' arguments of the function. As described in the documentation, different pre-defined criteria such as number of false positives ('fp'), true positive rate ('tpr'), accuracy ('accu') and many more, or even custom functions can be used.
By setting 'XCrit' to 'tpr' (Recall) and 'YCrit' to 'prec' (Precision), a precision-recall curve is created:
[X, Y] = perfcurve(labels, scores, posclass, 'XCrit', 'tpr', 'YCrit', 'prec');
plot(X, Y);
xlabel('Recall')
ylabel('Precision')
xlim([0, 1])
ylim([0, 1])
For example (using randomly generated data and a SVM):

The answer of hbaderts is correct but the end of the answer is wrong.
[X,Y] = perfcurve(labels,scores,posclass,'xCrit', 'fpr', 'yCrit', 'tpr');
Then the generated Receiver operating characteristic (ROC) curve is correct.

Related

Multiclass Logistic Regression ROC Curves in MATLAB

I have 7 classes within my training examples (labeled 1-7). I'm running logistic regression and I want to create my ROC curve for each of my classes.
To train my model and make a prediction, I have the following code:
Theta = zeros(k, n+1); %initialize theta
[Theta, costs] = gradientDescent(Theta, #(t)(CostFunc(t, X, Y, lambda)),...
#(t)(DerivOfCostFunc(t, X, Y, lambda)), alpha, iter_num);
%Make prediction with trained model
[scores,prediction] = predict(Theta, X_test); %X_test is the design matrix (ones on the first col)
Within the predict script, I have
scores = g(X*all_theta'); %this is the sigmoid function
[p_max, IndexOfMax]=max(scores, [], 2);
prediction = IndexOfMax;
Note that scores is a m by k matrix, where m is the number of training examples and k is the number of classes. Prediction is a m by 1 vector with numbers going from 1-7, based on the predicted class.
To create the ROC curve, for class 3 for example,
classNum=3;
for i=1:size(scores,1)
temp=scores(i,:);
diffscore(i,:)=temp(classNum)-max([temp(:,1:classNum-1),temp(:,classNum+1:end)]);
end
This last part I did because I read that I had to establish my class 3 as positive and the others as negative.
At last, I made my curve with the following code:
[xROC,yROC,~,auc] = perfcurve(y_test,diffscore,classNum);
%y_test contains my true labels, m by 1 column vector
However, when running the ROC curve for each of my classes, I get the same plot for all. They all have an AUC of 1. Based on some analysis, I know this is not correct but can't figure out in which part of the code I went wrong! Is there additional code I should add or should I need to modify any of my existing code?

What is the problem with the shape of the roc curve with low auc(.4)?

I'm trying to plot a ROC curve. I have 75 data points and I considered only 10 features. Ii'm getting a staircase like image see below. Is this due to the small data set? Can we add more points to improve the curve?
AUC is very low .44. Is there any method to upload csv file ?
species1= readtable('target.csv');
species1 = table2cell(species1)
meas1= readtable('feature.csv');
meas1=meas1(:,1:10);
meas1= table2array(meas1)
numObs = length(species1);
half = floor(numObs/2);
training = meas1(1:half,:);
trainingSpecies = species1(1:half);
sample = meas1(half+1:end,:);
trainingSpecies = cell2mat(trainingSpecies)
group = species1(half+1:end,:);
group = cell2mat(group)
SVMModel = fitcsvm(training,trainingSpecies)
[label,score] = predict(SVMModel,sample);
[X,Y,T,AUC] = perfcurve(group,score(:,2),'1');
plot(X,Y,'LineWidth',3)
xlabel('False positive rate')
ylabel('True positive rate')
title('ROC for Classification ')
As indicated by Durkee, the perfcurve function will always be stepwise. In fact, the ROC curve is an empirical (as opposed to theoretical) cumulative distribution function (ecdf), and ecdf are stepwise functions by definition (as it computes the CDF on the values observed in the sample).
Usually, smoothing of the ROC curve is done via binning. You could bin the score values and compute an approximate ROC curve, or you could bin the False Positive Rate values obtained by the actual ROC curve (i.e. bin the X values generated by perfcurve()) which generates a smooth version that preserves the area under the curve (AUC).
In the following example I will show and compare the smoothed ROC curves obtained from these two options, which can be accomplished using the TVals option and the XVals option of the perfcurve function, respectively.
In each case, the binning is done so that we get approximately equal-sized (equal in the number of cases) bins using the tiedrank function. The values to use for the TVals and the XVals options are then computed using the grpstats function as the max value on each bin of the original/pre-binned variable (scores or X, respectively).
%% Reference for the original ROC curve example: https://www.mathworks.com/help/stats/perfcurve.html
load fisheriris
pred = meas(51:end,1:2);
resp = (1:100)'>50; % Versicolor = 0, virginica = 1
mdl = fitglm(pred,resp,'Distribution','binomial','Link','logit');
scores = mdl.Fitted.Probability;
[X,Y,T,AUC] = perfcurve(species(51:end,:),scores,'virginica');
AUC
%% Define the number of bins to use for smoothing
nbins = 10;
%% Option 1 (RED): Smooth the ROC curve by defining score thresholds (based on equal-size bins of the score).
scores_grp = ceil(nbins * tiedrank(scores(:,1)) / length(scores));
scores_thr = grpstats(scores, scores_grp, #max);
[X_grpScore,Y_grpScore,T_grpScore,AUC_grpScore] = perfcurve(species(51:end,:),scores,'virginica','TVals',scores_thr);
AUC_grpScore
%% Option 2 (GREEN) Smooth the ROC curve by binning the False Positive Rate (variable X of the perfcurve() output)
X_grp = ceil(nbins * tiedrank(X(:,1)) / length(X));
X_thr = grpstats(X, X_grp, #max);
[X_grpFPR,Y_grpFPR,T_grpFPR,AUC_grpFPR] = perfcurve(species(51:end,:),scores,'virginica','XVals',X_thr);
AUC_grpFPR
%% Plot
figure
plot(X,Y,'b.-'); hold on
plot(X_grpScore,Y_grpScore,'rx-')
plot(X_grpFPR,Y_grpFPR,'g.-')
xlabel('False positive rate')
ylabel('True positive rate')
title('ROC for Classification by Logistic Regression')
legend({'Original ROC curve', ...
sprintf('Smoothed ROC curve in %d bins (based on score bins)', nbins), ...
sprintf('Smoothed ROC curve in %d bins (based on FPR bins)', nbins)}, ...
'Location', 'SouthEast')
The graphical output from this code is the following:
Note: if you look at the text output generated by the above code, you will notice that, as anticipated, the AUC values for the original ROC and the smoothed ROC curve based on FPR bins (GREEN option) coincide (AUC = 0.7918), whereas the AUC value for the smoothed ROC curve based on score bins (RED option) is quite smaller than the original AUC (= 0.6342), so the FPR approach should be preferred as smoothing technique for plotting purposes. Note however that the FPR approach requires computing the ROC curve twice, once on the original scores variable, and once on the binned FPR values (X values of the first ROC calculation).
However, the second ROC calculation can be avoided because the same smoothed ROC curve can be obtained by binning the X values and computing the max(Y) value on each bin, as shown in the following snippet:
%% Compute max(Y) on the binned X values
% Make a dataset with the X and Y variables as columns (for easier manipulation and grouping)
ds = dataset(X,Y);
% Compute equal size bins on X and the corresponding MAX statistics
ds.X_grp = ceil(nbins * tiedrank(ds.X(:,1)) / size(ds.X,1));
ds_grp = grpstats(ds, 'X_grp', #max, 'DataVars', {'X', 'Y'});
% Add the smooth curve to the previous plot
hold on
plot(ds_grp.max_X, ds_grp.max_Y, 'mx-')
And now you should see the above plot where the green curve has been overridden by a magenta curve with star points.

Non-symbolic derivative at all sample points including boundary points

Suppose I have a vector t = [0 0.1 0.9 1 1.4], and a vector x = [1 3 5 2 3]. How can I compute the derivative of x with respect to time that has the same length as the original vectors?
I should not use any symbolic operations. The command diff(x)./diff(t) does not produce a vector of the same length. Should I first interpolate the x(t) function and then take its derivative?
Different approaches exist to calculate the derivative at the same points as your initial data:
Finite differences: Use a central difference scheme at your inner points and a forward/backward scheme at your first/last point
or
Curve fitting: Fit a curve through your points, calculate the derivative of this fitted function and sample them at the same points as the original data. Typical fitting functions are polynomials or spline functions.
Note that the curve fitting approach gives better results, but needs more tuning options and is slower (~100x).
Demonstration
As an example, I will calculate the derivative of a sine function:
t = 0:0.1:1;
y = sin(t);
Its exact derivative is well known:
dy_dt_exact = cos(t);
The derivative can approximately been calculated as:
Finite differences:
dy_dt_approx = zeros(size(y));
dy_dt_approx(1) = (y(2) - y(1))/(t(2) - t(1)); % forward difference
dy_dt_approx(end) = (y(end) - y(end-1))/(t(end) - t(end-1)); % backward difference
dy_dt_approx(2:end-1) = (y(3:end) - y(1:end-2))./(t(3:end) - t(1:end-2)); % central difference
or
Polynomial fitting:
p = polyfit(t,y,5); % fit fifth order polynomial
dp = polyder(p); % calculate derivative of polynomial
The results can be visualised as follows:
figure('Name', 'Derivative')
hold on
plot(t, dy_dt_exact, 'DisplayName', 'eyact');
plot(t, dy_dt_approx, 'DisplayName', 'finite difference');
plot(t, polyval(dp, t), 'DisplayName', 'polynomial');
legend show
figure('Name', 'Error')
hold on
plot(t, abs(dy_dt_approx - dy_dt_exact)/max(dy_dt_exact), 'DisplayName', 'finite difference');
plot(t, abs(polyval(dp, t) - dy_dt_exact)/max(dy_dt_exact), 'DisplayName', 'polynomial');
legend show
The first graph shows the derivatives itself and the second graph plots the relative errors made by both methods.
Discussion
One clearly sees that the curve fitting method gives better results than the finite differences, but it is ~100x slower. The curve fitting methods has a relative error of order 10^-5. Note that the finite differences approach becomes better when your data is sampled more densely or you use a higher order scheme. The disadvantage of the curve fitting approach is that one has to choose a good polynomial order. Spline functions may be better suited in general.
A 10x faster sampled dataset, i.e. t = 0:0.01:1;, results in the following graphs:

How to write MatLab Code for bimodal Probability Density Functions?

I want to write a bimodal Probability Density Function (PDF with multiple peaks, Galtung S) without using the pdf function from statistics toolbox. Here is my code:
x = 0:0.01:5;
d = [0.5;2.5];
a = [12;14]; % scale parameter
y = 2*a(1).*(x-d(1)).*exp(-a(1).*(x-d(1)).^2) + ...
2*a(2).*(x-d(2)).*exp(-a(2).*(x-d(2)).^2);
plot(x,y)
Here's the curve.
plot(x,y)
I would like to change the mathematical formula to to get rid of the dips in the curve that appear at approx. 0<x<.5 and 2<x<2.5.
Is there a way to implement x>d(1) and x>d(2) in line 4 of the code to avoid y < 0? I would not want to solve this with a loop because I need to convert the formula to CDF later on.
If you want to plot only for x>max(d1,d2), you can use logical indexing:
plot(x(x>max(d)),y(x>max(d)))
If you to plot for all x but plot max(y,0), you just can write so:
plot(x,max(y,0))

Relative Frequency Histograms and Probability Density Functions

The function called DicePlot simulates rolling 10 dice 5000 times.
The function calculates the sum of values of the 10 dice of each roll, which will be a 1 ⇥ 5000 vector, and plot relative frequency histogram with edges of bins being selected in where each bin in the histogram represents a possible value of for the sum of the dice.
The mean and standard deviation of the 1 ⇥ 5000 sums of dice values will be computed, and the probability density function of normal distribution (with the mean and standard deviation computed) on top of the relative frequency histogram will be plotted.
Below is my code so far - What am I doing wrong? The graph shows up but not the extra red line on top? I looked at answers like this, and I don't think I'll be plotting anything like the Gaussian function.
% function[]= DicePlot()
for roll=1:5000
diceValues = randi(6,[1, 10]);
SumDice(roll) = sum(diceValues);
end
distr=zeros(1,6*10);
for i = 10:60
distr(i)=histc(SumDice,i);
end
bar(distr,1)
Y = normpdf(X)
xlabel('sum of dice values')
ylabel('relative frequency')
title(['NumDice = ',num2str(NumDice),' , NumRolls = ',num2str(NumRolls)]);
end
It is supposed to look like
But it looks like
The red line is not there because you aren't plotting it. Look at the documentation for normpdf. It computes the pdf, it doesn't plot it. So you problem is how do you add this line to the plot. The answer to that problem is to google "matlab hold on".
Here's some code to get you going in the right direction:
% Normalize your distribution
normalizedDist = distr/sum(distr);
bar(normalizedDist ,1);
hold on
% Setup your density function using the mean and std of your sample data
mu = mean(SumDice);
stdv = std(SumDice);
yy = normpdf(xx,mu,stdv);
xx = linspace(0,60);
% Plot pdf
h = plot(xx,yy,'r'); set(h,'linewidth',1.5);