ROC curve and libsvm - matlab

Given a ROC curve drawn with plotroc.m (see here):
Theoretical question: How to select the best threshold to be used?
Programming question: How to induce the libsvm classifier to work with the selected (best) threshold?

A ROC curve is generated by plotting the true positive rate (TPR) on the y-axis against the false positive rate (FPR) on the x-axis. So the coordinates (x, y) of any point on the ROC curve give the FPR and TPR at a particular threshold.
To pick a threshold, find the point (x, y) on the ROC curve that lies at the minimum distance from the top-left corner of the plot, i.e. from (0, 1). The threshold corresponding to that point is the one to use. (I am not permitted to post an image, so I cannot illustrate this with a figure, but see the ROC-related help for more details.)
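As a concrete sketch of that rule (this uses MATLAB's perfcurve from the Statistics Toolbox rather than plotroc.m, and labels/scores are placeholder names for your true labels and positive-class scores):
[fpr, tpr, thresholds] = perfcurve(labels, scores, 1); % positive class = 1
[~, idx] = min(sqrt(fpr.^2 + (1 - tpr).^2));           % distance of each ROC point from (0,1)
best_threshold = thresholds(idx);                      % threshold closest to the top-left corner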
Secondly, in libsvm the svmpredict function returns the probability of a data sample belonging to a particular class. So, if that probability (for the positive class) is greater than the threshold obtained from the ROC plot, we can assign the sample to the positive class. These few lines might be useful to you:
[pred_labels,~,p] = svmpredict(target_labels,feature_test,svmStruct,'-b 1');
% where svmStruct is the structure returned by svmtrain.
op = p(:,svmStruct.Label==1); % probability of the positive class
                              % (i.e. the class whose label is 1)
Now, if this variable 'op' is greater than the threshold, we can assign the corresponding test sample to the positive class. This can be done as
op_labels = op>th; % where 'th' is threshold obtained from ROC
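As a hedged follow-up (assuming the positive class is labelled 1 in target_labels), you can check how well the thresholded predictions agree with the true labels:
acc_at_th = mean(op_labels == (target_labels == 1)); % fraction of test samples classified correctly at threshold th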

Related

ROC curve from SVM classifier is visualised with limited thresholds in Python

I am trying to plot a ROC curve to evaluate my classifier, but my ROC plot is not "smooth". Is this supposed to be a problem with the thresholds? I am quite new to classification in Python, so there is probably something wrong with my code (see the image below). Where should I look for a solution?
I used drop_intermediate=False, but it does not help.
This is because you are passing 0 and 1 values (predicted labels) to the plotting function. The ROC curve can only be computed properly when you provide floats in the range 0.0 to 1.0 (predicted label probabilities), so that the ROC curve can consider multiple cutoff values and appear more "smooth" as a result.
Whatever classifier you are using, make sure y_train_pred contains float values in the range [0.0, 1.0]. If you have a scoring classifier with values in the range [-∞, +∞], you can apply a sigmoid function to remap them to this range.

Why is MATLAB's ROC plotting function perfcurve yielding 3 ROC curves in the case of cross-validation?

I passed 5-fold cross-validation data as cell arrays to the perfcurve function with positive class = 1. It generated 3 curves, as you can see in the diagram, but I was expecting only one curve.
[X,Y,T,AUC,OPTROCPT,SUBY,SUBYNAMES] = perfcurve(Actual_label,Score,1);
plot(X,Y)
Here, Actual_label and Score are cell arrays of size 5 x 1, and each cell contains a 70 x 1 vector. The third argument, 1, denotes that the positive class is 1.
P.S.: I am using a one-class SVM, and the fitSVMPosterior function is not appropriate for one-class learning (as mentioned in the MATLAB documentation), so posterior probabilities cannot be used here.
When you compute the confidence bounds, X and Y are an m-by-3 array, where m is the number of fixed X values or thresholds (T values). The first column of Y contains the mean value. The second and third columns contain the lower bound and the upper bound, respectively, of the pointwise confidence bounds. AUC is also a row vector with three elements, following the same convention.
The above explanation is taken from the MATLAB documentation.
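In practice this means you can plot just the mean curve (first column), optionally with the pointwise bounds. A minimal sketch using the cell-array inputs from the question (X(:,1) also works when X is returned as a plain vector):
[X,Y,T,AUC] = perfcurve(Actual_label,Score,1);  % cell-array (cross-validated) input
plot(X(:,1),Y(:,1))                             % mean ROC curve only
hold on
plot(X(:,1),Y(:,2),'--',X(:,1),Y(:,3),'--')     % pointwise confidence bounds
hold off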
That is expected because you are plotting the ROC curve for each of the 5 folds.
Now if you want to have only one ROC for your classifier, you can either use the 5 trained classifiers to predict the labels of an independent test set or you can average the posterior probabilities of the 5 folds and have one ROC.

Non-parametric estimate of the cdf in Matlab

I have a vector A in Matlab of dimension Nx1. I want to get a non-parametric estimate of the CDF at each point in A and store all the values in a vector B of dimension Nx1. What options do I have?
I have read about ecdf and ksdensity, but it is not clear to me what the difference is, or their pros and cons. Any direction would be appreciated.
This doesn't exactly answer your question, but you can compute the empirical CDF very simply:
A = randn(1,1e3); % example Gaussian data
x_cdf = sort(A);
y_cdf = (1:numel(A))/numel(A);
plot(x_cdf, y_cdf) % plot CDF
This works because, by definition, each sample contributes an increment of 1/N to the empirical CDF. That is, for values smaller than the minimum sample the CDF equals 0; between the smallest and the second-smallest sample it equals 1/N; and so on.
The advantage of this approach is that you know exactly what is being done.
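To get the N-by-1 vector B asked about in the question (the empirical CDF evaluated at each element of A itself), here is a small sketch along the same lines, assuming the values in A are distinct:
N = numel(A);
[~, order] = sort(A(:));
B = zeros(N,1);
B(order) = (1:N).'/N;   % B(i) is the fraction of samples <= A(i)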
If you need to evaluate the empirical CDF at prescribed x-axis values:
A = randn(1,1e3); % example Gaussian data
x_cdf = -5:.1:5;
y_cdf = sum(bsxfun(@le, A(:), x_cdf), 1)/numel(A);
plot(x_cdf, y_cdf) % plot CDF
If you have prescribed y-axis values, the corresponding x-axis values are by definition the quantiles of the (empirical) distribution:
A = randn(1,1e3); % example Gaussian data
y_cdf = 0:.01:1;
x_cdf = quantile(A, y_cdf);
plot(x_cdf, y_cdf) % plot CDF
You want ecdf, not ksdensity.
ecdf computes the empirical distribution function of your data set. This converges to the cumulative distribution function of the underlying population as the sample size increases.
ksdensity computes a kernel density estimation from your data. This converges to the probability density function of the underlying population as the sample size increases.
The PDF tells you how likely you are to get values near a given value. It wiggles up and down over your domain, going up near more likely values and falling near less likely values. The CDF tells you how likely you are to get values below a given value. So it always starts at zero at the left end of your domain and increases monotonically to one at the right end of your domain.
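A small illustrative sketch of the difference on the same sample (both functions are in the Statistics Toolbox):
A = randn(1000,1);              % example Gaussian sample
[F, xF] = ecdf(A);              % empirical CDF (a step function)
[f, xf] = ksdensity(A);         % kernel density estimate of the PDF
subplot(1,2,1); stairs(xF, F); title('ecdf: estimated CDF')
subplot(1,2,2); plot(xf, f); title('ksdensity: estimated PDF')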

MATLAB: how to plot a vector of probability densities onto a histogram?

I currently have a vector of calculated probability densities, i.e.
probden = (0.0008, 0.0016, 0.0048, 0.0064, 0.0072, ... , 1.0936, ... , 0.0072, 0.0064, 0.0048, 0.0016, 0.0008)
The list of calculated probability densities should be in the shape of a normal distribution.
I also have a same length list of the bins of each probability density.
I am trying to create a histogram such that each probability density is reflected on each bin on the X-axis.
If I use the function hist, it only shows how many probability densities are in each bin.
How should I approach this issue?
Thanks!
The function that goes hand in hand with hist is bar.
In your case you already have your histogram/distribution values (so there is no need to call hist); you can call bar directly:
bar( YourvectorOfBins , probden )
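A small, self-contained illustration (the bin centres and densities below are made up for the example, and normpdf is from the Statistics Toolbox):
bins = -4:0.5:4;                 % hypothetical bin centres
probden = normpdf(bins, 0, 1);   % hypothetical bell-shaped densities
bar(bins, probden, 1)            % width 1 makes adjacent bars touch, like a histogram
xlabel('bin'), ylabel('probability density')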

Determining probability density for a Normal Distribution in Matlab

I have the following code, which I use to obtain the graph below. How can I determine the probability density of the values, since I want my y-axis label to be "Probability density"? Or do I have to normalise the y-values?
Thanks
% thresh_strain contains a Normally Distributed set of numbers
[mu_j,sigma_j] = normfit(thresh_strain);
x=linspace(mu_j-4*sigma_j,mu_j+4*sigma_j,200);
pdf_x = 1/sqrt(2*pi)/sigma_j*exp(-(x-mu_j).^2/(2*sigma_j^2));
plot(x,pdf_x);
Your figure as it stands is correct - the area under the curve is 1. It does not need to be normalised.
You can check this by plotting the cumulative distribution function:
plot(x,(x(2)-x(1)).*cumsum(pdf_x));
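Equivalently, a quick numerical check of the area under the plotted density:
trapz(x, pdf_x)   % numerical integral of the density; should be approximately 1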
The y-axis in your figure needs to be relabeled as it is not "number of dents". "Probability density" is an acceptable label.