I am using MATLAB's perfcurve:
[X,Y,T,AUC] = perfcurve(labels,scores,posclass)
I am confused about the following. First a basic example, then my question.
a) [X,Y,T,AUC] = perfcurve([1 1 1 0 0 0],[.9 .9 .9 .1 .1 .1],1) produces AUC = 1
b) [X,Y,T,AUC] = perfcurve([0 0 0 1 1 1],[.9 .9 .9 .1 .1 .1],1) produces AUC = 0
When I provide the positive class (label=1), does it always have to have the higher scores?
If I make the positive class (label=1) have the lower scores, as in b) above, would the ROC curve be flipped (a mirror image of the normal ROC curve)?
The curves I generate with my data look like the ones below.
Plot 1 is the distribution of the scores. The classes are shown in red and blue. Notice that the label=1 (red) class has low scores.
red -> label=1
blue-> label=0
The next image is the generated ROC curve. It's basically a flipped image of what I want to see. Am I doing something wrong? Or is this behavior related to the label=1 class having low scores?
When you pass 1 as the third argument, you define which class label is treated as positive, and perfcurve then calculates FPR and TPR from the probabilities/scores you provide in the second argument, relative to that positive class. perfcurve assumes that higher scores indicate the positive class: at each threshold, the observations scoring above it are predicted positive. So if you swap the scores as in b) without also changing the positive class label, each TP becomes an FN and each TN becomes an FP, because every observation now sits on the opposite side of the thresholds used to build the ROC curve. That's why the plot is a mirror image of what you expect. To get the usual curve, either pass 0 as posclass or negate the scores.
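To make the flip concrete, here is a minimal sketch in Python (not MATLAB, and not perfcurve itself). The helper `auc` is a hypothetical name; it computes the area under the ROC curve by comparing every positive score with every negative score, under the same assumption that higher scores mean "more positive".

```python
def auc(labels, scores, posclass):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for l, s in zip(labels, scores) if l == posclass]
    neg = [s for l, s in zip(labels, scores) if l != posclass]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count half
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.9, 0.9, 0.1, 0.1, 0.1]
auc_a = auc([1, 1, 1, 0, 0, 0], scores, 1)                        # case a)
auc_b = auc([0, 0, 0, 1, 1, 1], scores, 1)                        # case b)
auc_b_negated = auc([0, 0, 0, 1, 1, 1], [-s for s in scores], 1)  # b) with negated scores
auc_b_other_class = auc([0, 0, 0, 1, 1, 1], scores, 0)            # b) with posclass = 0
print(auc_a, auc_b, auc_b_negated, auc_b_other_class)  # 1.0 0.0 1.0 1.0
```

Either negating the scores or switching the positive class turns the AUC of 0 back into 1, which is the mirror-image relationship described above.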
I'd like to fit my empirical data to a Poisson distribution curve.
I have a given mean, say 2.3, and the (empirical) data.
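import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import poisson

def fit_poisson(data=None, network=None, mu=2.3):
    sns.set_theme()
    fig, ax = plt.subplots(1, 1)
    # Integer support covering the central 98% of the Poisson(mu) mass
    x = np.arange(poisson.ppf(0.01, mu),
                  poisson.ppf(0.99, mu))
    sns.histplot(data, stat='density')
    plt.plot(x, poisson.pmf(x, mu))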
It plots:
Apparently there is a range issue in y here. Maybe a problem with lambda? How do I properly fit my empirical histogram to a Poisson distribution curve with the same mean?
Poisson random variables are discrete: their y value is a "probability", not a "density". But by default histplot does not assume that your data are discrete, and in this case it chooses bins with binwidth < 1.
Because density normalization forces the area of all bars to sum to 1, that means the density value for the bar containing observations of a certain value will be greater than the probability mass on that value.
There are two relevant parameters here:
stat="probability" will make the heights of the bars sum to 1, so they will match the PMF (assuming binwidth <= 1, so that only one unique value appears in each bar)
discrete=True, which sets binwidth=1 (and aligns the center of each bar with integral values)
sns.histplot(data, stat='probability', discrete=True, shrink=.8)
I've also added shrink=0.8, which draws the bars a bit narrower than the binwidth; this helps emphasize the discrete nature of the data.
(Note that with discrete=True (implying binwidth=1), density and probability normalization do the same thing, so discrete=True is actually all you need, but "Probability" is the right y-axis label to use here.)
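The mismatch between the two normalizations can be checked by hand. The following is only a sketch in plain Python (no seaborn): the `data` list is a made-up sample with mean 2, `poisson_pmf` is a hypothetical helper, and the bin width of 0.5 is an assumed stand-in for whatever width histplot would pick.

```python
import math
from collections import Counter

def poisson_pmf(k, mu):
    """Poisson probability mass at integer k."""
    return math.exp(-mu) * mu ** k / math.factorial(k)

# Hypothetical sample of counts, for illustration only (mean = 2.0)
data = [0, 1, 1, 2, 2, 2, 3, 3, 4, 2]
n = len(data)
counts = Counter(data)

binwidth = 0.5  # narrower than 1, so each bar holds at most one integer value

# Bar heights under the two normalizations
probability = {k: c / n for k, c in counts.items()}           # heights sum to 1
density = {k: c / (n * binwidth) for k, c in counts.items()}  # areas sum to 1

for k in sorted(counts):
    print(k, probability[k], density[k], round(poisson_pmf(k, 2.0), 3))
```

With a bin width of 0.5 every density bar is twice as tall as the corresponding probability bar, so only the probability heights are on the same scale as the PMF.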
I'm new to the forum and a beginner in programming.
I have the task to program a random walk in Matlab (1D or 2D) with a variance that I can adjust. I found the code for the random walk, but I'm really confused where to put the variance. I thought that the random walk always has the same variance (= t) so maybe I'm just lost in the math.
How do I control the variance?
For a simple random walk, consider using the Normal distribution with mean 0 (the mean is also called the 'drift') and a non-zero variance. Since the mean is zero and the distribution is symmetric, this is a symmetric random walk: on each step, the process is equally likely to go up or down, left or right, etc.
One easy way:
Step 1: Generate each step
Step 2: Get the cumulative sum
This can be done for any number of dimensions.
% MATLAB R2019a
drift = 0;
sigma = 1; % standard deviation = sqrt(variance); named sigma to avoid shadowing the built-in std
pd = makedist('Normal','mu',drift,'sigma',sigma);
% One Dimension
nsteps = 50;
Z = random(pd,nsteps,1);
X = [0; cumsum(Z)];
plot(0:nsteps,X) % alternatively: stairs(0:nsteps,X)
And in two dimensions:
% Two Dimensions
nsteps = 100;
Z = random(pd,nsteps,2);
X = [zeros(1,2); cumsum(Z)];
% 2D Plot
figure, hold on, box on
plot(X(1,1),X(1,2),'gd','DisplayName','Start','MarkerFaceColor','g')
plot(X(:,1),X(:,2),'k-','HandleVisibility','off')
plot(X(end,1),X(end,2),'rs','DisplayName','Stop','MarkerFaceColor','r')
legend('show')
The variance will affect the "volatility" so a higher variance means a more "jumpy" process relative to the lower variance.
Note: I've intentionally avoided the Brownian motion-type implementation (scaling, step size decreasing in the limit, etc.) since OP specifically asked for a random walk. A Brownian motion implementation can link the variance to a time-index due to Gaussian properties.
The OP writes:
the random walk has always the same variance
This is true for the steps (each step typically has the same distribution). However, the variance of the process at a time step (or point in time) should be increasing with the number of steps (or as time increases).
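A quick simulation illustrates the difference between the (constant) step variance and the (growing) variance of the position. This is a sketch in Python rather than MATLAB; `final_position` is a hypothetical helper, and the step standard deviation of 2 and the walk counts are arbitrary choices.

```python
import random
import statistics

def final_position(nsteps, sigma, rng):
    """Position after nsteps independent Normal(0, sigma) steps, starting at 0."""
    return sum(rng.gauss(0.0, sigma) for _ in range(nsteps))

rng = random.Random(0)  # fixed seed so the run is repeatable
sigma = 2.0             # per-step standard deviation (step variance = 4)
nwalks = 2000           # number of independent walks to simulate

# Sample variance of the end position across many walks, for two walk lengths
var_10 = statistics.variance([final_position(10, sigma, rng) for _ in range(nwalks)])
var_50 = statistics.variance([final_position(50, sigma, rng) for _ in range(nwalks)])
print(var_10, var_50)  # roughly 10*sigma**2 and 50*sigma**2
```

The variance of the position after n steps comes out close to n times the step variance, so changing sigma is how you adjust the variance of the walk.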
Related:
MATLAB: plotting a random walk
I passed 5-fold cross-validation data as cell arrays to the perfcurve function with positive class = 1. It generated 3 curves, as you can see in the diagram. I was expecting only one curve.
[X,Y,T,AUC,OPTROCPT,SUBY,SUBYNAMES] = perfcurve(Actual_label,Score,1);
plot(X,Y)
Here, Actual_label and Score are cell arrays of size 5 x 1. Each cell holds a 70 x 1 vector. The 1 denotes the positive class.
P.S.: I am using a one-class SVM, and the 'fitSVMPosterior' function is not appropriate for one-class learning (as mentioned in the MATLAB documentation). Therefore posterior probabilities can't be used here.
When you compute the confidence bounds, X and Y are an m-by-3 array, where m is the number of fixed X values or thresholds (T values). The first column of Y contains the mean value. The second and third columns contain the lower bound and the upper bound, respectively, of the pointwise confidence bounds. AUC is also a row vector with three elements, following the same convention.
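The above explanation is taken from the MATLAB documentation. So plot(X,Y) draws three lines: the mean ROC curve and its lower and upper confidence bounds. To plot only the mean curve, use plot(X(:,1),Y(:,1)).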
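To show where a mean curve plus two bounds can come from, here is a rough Python sketch of averaging per-fold ROC curves at fixed X values. It is not perfcurve's actual algorithm: the two folds are made up, `roc_points` and `tpr_at` are hypothetical helpers, and the minimum and maximum across folds are used as a crude band in place of real pointwise confidence bounds.

```python
def roc_points(labels, scores, posclass):
    """Return (fpr, tpr) pairs, one per distinct threshold, from strictest to loosest."""
    npos = sum(1 for l in labels if l == posclass)
    nneg = len(labels) - npos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for l, s in zip(labels, scores) if s >= t and l == posclass)
        fp = sum(1 for l, s in zip(labels, scores) if s >= t and l != posclass)
        points.append((fp / nneg, tp / npos))
    return points

def tpr_at(points, x):
    """Highest TPR reached at a false positive rate of at most x."""
    return max(tpr for fpr, tpr in points if fpr <= x)

# Two hypothetical folds (labels, scores); real folds would come from cross-validation
folds = [
    ([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]),
    ([1, 1, 0, 0], [0.9, 0.2, 0.6, 0.1]),
]
grid = [0.0, 0.5, 1.0]  # fixed X (FPR) values shared by all folds

curves = [roc_points(l, s, 1) for l, s in folds]
per_fold = [[tpr_at(c, x) for x in grid] for c in curves]
mean_tpr = [sum(col) / len(col) for col in zip(*per_fold)]  # analogue of Y(:,1)
low_tpr = [min(col) for col in zip(*per_fold)]              # crude lower band
high_tpr = [max(col) for col in zip(*per_fold)]             # crude upper band
print(mean_tpr, low_tpr, high_tpr)  # [0.75, 1.0, 1.0] [0.5, 1.0, 1.0] [1.0, 1.0, 1.0]
```

The three resulting lists play the role of the three columns of Y: one averaged curve and a band around it.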
That is expected because you are plotting the ROC curve for each of the 5 folds.
Now if you want to have only one ROC for your classifier, you can either use the 5 trained classifiers to predict the labels of an independent test set or you can average the posterior probabilities of the 5 folds and have one ROC.
Given a ROC curve drawn with plotroc.m (see here):
Theoretical question: How to select the best threshold to be used?
Programming question: How do I make the libsvm classifier use the selected (best) threshold?
A ROC curve is generated by plotting the fraction of true positives (TPR) on the y-axis against the fraction of false positives (FPR) on the x-axis. So the coordinates of any point (x,y) on the ROC curve give the FPR and TPR at a particular threshold.
We find the point (x,y) on the ROC curve that has the minimum distance from the top-left corner of the plot, i.e. (0,1). The threshold corresponding to that point is the required threshold. Sorry, I am not permitted to post images, so I couldn't illustrate this with a figure. For more details, see the ROC-related help.
Secondly, in libsvm the svmpredict function (called with '-b 1') returns the probability of a data sample belonging to each class. So if the probability for the positive class is greater than the threshold (obtained from the ROC plot), we can classify the sample as positive. These few lines might be useful to you:
[pred_labels,~,p] = svmpredict(target_labels,feature_test,svmStruct,'-b 1');
% where, svmStruct is structure returned by svmtrain function.
op = p(:,svmStruct.Label==1); % This gives probability for positive
% class (i.e whose label is 1 )
Now if 'op' is greater than the threshold, we classify the corresponding test sample as positive. This can be done as:
op_labels = op>th; % where 'th' is threshold obtained from ROC
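The threshold search itself can be sketched as follows, in Python rather than MATLAB. The labels and probabilities are made up, `best_threshold` is a hypothetical helper, and candidate thresholds are simply the observed scores; a sample is called positive when its score is strictly greater than the threshold, matching op>th above.

```python
import math

def best_threshold(labels, scores, posclass):
    """Pick the score threshold whose ROC point lies closest to the corner (0, 1)."""
    npos = sum(1 for l in labels if l == posclass)
    nneg = len(labels) - npos
    candidates = []
    for t in sorted(set(scores)):
        tp = sum(1 for l, s in zip(labels, scores) if s > t and l == posclass)
        fp = sum(1 for l, s in zip(labels, scores) if s > t and l != posclass)
        fpr, tpr = fp / nneg, tp / npos
        # Euclidean distance from (fpr, tpr) to the ideal point (0, 1)
        candidates.append((math.hypot(fpr, 1.0 - tpr), t))
    return min(candidates)[1]

# Hypothetical true labels and positive-class probabilities, for illustration only
labels = [1, 1, 1, 0, 1, 0, 0]
probs = [0.95, 0.85, 0.7, 0.6, 0.55, 0.4, 0.1]

th = best_threshold(labels, probs, 1)
op_labels = [p > th for p in probs]  # same rule as op>th
print(th, op_labels)  # 0.6 [True, True, True, False, False, False, False]
```

Here the point (FPR, TPR) = (0, 0.75) at threshold 0.6 is the closest to (0, 1), so 0.6 is selected.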
I have been using the LibSVM classifier to classify between 3 different classes - labeled 2, 1, -1
I'm trying to use MATLAB to generate ROC curve graphs for some data produced using LibSVM, but I am having trouble understanding the parameters perfcurve needs to run.
I assume that:
labels is the vector of labels stating which class each data point belongs to (mine consists of 1, -1 and 2 and is 60x1 in size)
scores is the variable created by LibSVM called 'accuracy_score' (60x3 in size)
But I don't know what posclass is?
I would also appreciate finding out if my assumptions are correct, and if not, why not?
See here for a clear explanation:
Given the following instruction:
[X,Y] = perfcurve(labels,scores,posclass);
labels are the true labels of the data, scores are the output scores from your classifier (before the threshold) and posclass is the positive class in your labels.
See the documentation of perfcurve: posclass is the label of the positive class; in your case it has to be either 1, -1 or 2.
http://www.mathworks.com/help/stats/perfcurve.html
A ROC curve has the "false positive rate" on the x-axis and the "true positive rate" on the y-axis. By specifying posclass, you specify the class with respect to which the false positive rate and true positive rate are calculated.
E.g. if you specify posclass as 2, then when the true label is 2, predicting either 1 or -1 counts as a false prediction (a false negative).
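As a small illustration of what choosing a positive class means with three labels, here is a Python sketch (not MATLAB). The labels, the scores for class 2, the threshold of 0.5 and the helper `rates` are all made up for the example; every label other than posclass is treated as negative.

```python
def rates(labels, scores, posclass, threshold):
    """TPR and FPR when posclass is positive and every other label is negative."""
    tp = fp = npos = nneg = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if label == posclass:
            npos += 1
            tp += predicted_positive  # True counts as 1
        else:
            nneg += 1
            fp += predicted_positive
    return tp / npos, fp / nneg

# Hypothetical three-class labels and per-sample scores for class 2
labels = [2, 2, 1, -1, 1, -1]
scores_for_class_2 = [0.9, 0.4, 0.7, 0.2, 0.1, 0.3]

tpr, fpr = rates(labels, scores_for_class_2, posclass=2, threshold=0.5)
print(tpr, fpr)  # 0.5 0.25
```

Repeating this over all thresholds gives the ROC curve for class 2 against the rest; a different posclass needs the score column that belongs to that class.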
Edit:
The accuracy_score you mentioned (in my version of the documentation (3.17, in the matlab folder) it is called decision_values/prob_estimates) has 3 columns; each column corresponds to the probability of the data belonging to one class.
e.g.
model=svmtrain(train_label,train_data);
[predicted_label, accuracy, decision_values]=svmpredict(test_label,test_data,model);
model.Label contains the class labels; the individual columns of decision_values contain the probability of the test case belonging to the class specified in model.Label (see http://www.csie.ntu.edu.tw/~b91082/SVM/README).
To use perfcurve to compute the ROC for class m:
[X,Y] = perfcurve(truelabels, decision_values(:,m)*model.Label(m),model.Label(m));
It is essential to use decision_values(:,m)*model.Label(m), especially when your class label is a negative number.