Calculating the area under the ROC curve from classification accuracy

I have an assignment:
Using Naive Bayes, we built a model on some data with 2 classes (the model returns two probabilities, one for the positive and one for the negative class). We calculated the area under the ROC curve, AUC = 0.8, and the classification accuracy, CA = 0.6, with the threshold set to 0.5 (if the probability of the positive class for some example is higher than 0.5, we predict the positive class for that example, otherwise the negative class). Then we discovered that if we set the threshold to 0.3, the classification accuracy becomes CA = 0.7. What is the AUC for the second threshold? If the result depends on the initial data, present all possibilities.
How can I calculate that?

Not sure if that qualifies as an answer, but the ROC curve is traced by plotting sensitivity against 1 - specificity while the classification threshold is swept over all possible values, and the AUC is the area under that whole curve. It is therefore a property of the model's scores, not of any single operating point: you cannot compute the AUC "for a specific threshold". Changing the decision threshold from 0.5 to 0.3 changes the accuracy but leaves the AUC at 0.8.
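For intuition, here is a small MATLAB sketch (using perfcurve from the Statistics Toolbox; 'labels' and 'scores' are placeholders for the true 0/1 classes and the Naive Bayes positive-class probabilities) showing that the AUC is computed from the scores alone, while the threshold only affects the accuracy:

% Hypothetical sketch; 'labels' (0/1) and 'scores' are stand-ins for your data
[~, ~, ~, auc] = perfcurve(labels, scores, 1);   % AUC uses the scores only, no threshold
acc05 = mean((scores > 0.5) == labels);          % accuracy at threshold 0.5
acc03 = mean((scores > 0.3) == labels);          % accuracy at threshold 0.3
% acc05 and acc03 differ, but auc is the same in both cases.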


How to calibrate the thresholds of neural network output layer in multiclass classification task?

Assume we have a multi-class classification task with 3 classes:
{Cheesecake, Ice Cream, Apple Pie}
Assume we have a trained neural network that can classify which of the three desserts a random chef would prefer, and that the output layer consists of 3 neurons with softmax activation, such that each neuron represents the probability of liking the corresponding dessert.
For example, possible outputs of such network might be:
Output(chef_1) = { P(Cheesecake) = 0.3; P(Ice Cream) = 0.1; P(Apple Pie) = 0.6; }
Output(chef_2) = { P(Cheesecake) = 0.2; P(Ice Cream) = 0.1; P(Apple Pie) = 0.7; }
Output(chef_3) = { P(Cheesecake) = 0.1; P(Ice Cream) = 0.1; P(Apple Pie) = 0.8; }
In such case, all instances (chef_1, chef_2 and chef_3) are likely to prefer an Apple Pie, but with a different confidence (e.g. chef_3 is more likely to prefer Apple Pie than chef_1 as the network probability outputs are 0.8 and 0.6 respectively)
Given that we have a new dataset of 1000 chefs, and we want to calculate the distribution of their favorite desserts, we would simply classify each one of the 1000 chefs and determine his favorite dessert based on the neuron with maximum probability.
We also want to improve the prediction accuracy by discarding chefs whose maximum prediction probability is below 0.6. Let's assume that 200 out of the 1000 chefs were predicted with such a low probability, and we discarded them.
In that case, we may bias the distribution over the remaining 800 chefs (who were predicted with a probability higher than 0.6) if one dessert is easier to predict than another.
For example, if the average prediction probability of the classes are:
AverageP(Cheesecake) = 0.9
AverageP(Ice Cream) = 0.5
AverageP(Apple Pie) = 0.8
and we discard chefs who were predicted with a probability lower than 0.6, then among the 200 discarded chefs there are likely to be more chefs who prefer Ice Cream, which will result in a biased distribution among the other 800.
Following this very long introduction (I am happy that you are still reading), my questions are:
Do we need a different threshold for each class? (e.g. among Cheesecake predictions discard instances whose probability is below X, among Ice Cream predictions discard instances whose probability is below Y, and among Apple Pie predictions discard instances whose probability is below Z).
If yes, how can I calibrate the thresholds without impacting the overall distribution of my 1000-chef dataset (i.e. discard predictions with low probability in order to improve accuracy, while preserving the distribution over the original dataset)?
I've tried using the average prediction probability of each class as the threshold, but I cannot be sure that it will not affect the distribution (as these thresholds may be overfit to the test set rather than to the 1000-chef dataset).
Any suggestions or related papers?
I had a similar multilabel problem. I plotted the F1 score of each class against the threshold to see where the maximum F1 score for each class lies, and it was different for each class: for some classes the best precision/recall trade-off was at a threshold above 0.8, while for others it was as low as 0.4. I therefore chose a different threshold for each class.
But if you don't want to bias a class towards high precision or high recall, you could select the per-class thresholds on a separate test set (and you can optimize how that test set is collected).
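As a rough illustration of the per-class threshold search described above, here is a MATLAB sketch; 'probs' (N x 3 softmax outputs) and 'labels' (N x 1 with values 1..3) are assumed to come from a held-out validation set rather than from the final 1000-chef dataset:

% Hypothetical sketch: sweep a threshold per class and keep the one with the best F1
thresholds = 0.05:0.05:0.95;
bestTh = zeros(1, 3);
for c = 1:3
    f1 = zeros(size(thresholds));
    for i = 1:numel(thresholds)
        pred  = probs(:, c) >= thresholds(i);   % "class c" predicted when its probability is high enough
        truth = (labels == c);
        tp    = sum(pred & truth);
        prec  = tp / max(sum(pred), 1);
        rec   = tp / max(sum(truth), 1);
        f1(i) = 2 * prec * rec / max(prec + rec, eps);
    end
    [~, idx]  = max(f1);
    bestTh(c) = thresholds(idx);                % per-class threshold
end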

Function approximation by ANN

So I have something like this,
y = l3*[sin(theta1)*cos(theta2)*cos(theta3) + cos(theta1)*sin(theta2)*cos(theta3) - sin(theta1)*sin(theta2)*sin(theta3) + cos(theta1)*cos(theta2)*sin(theta3)] + l2*[sin(theta1)*cos(theta2) + cos(theta1)*sin(theta2)] + l1*sin(theta1) + l0;
and something similar for x, where the theta_i are angles from specified intervals and the l_i are some coefficients. The task is to approximate the inverse of the equation: you set x and y and the result should be the appropriate thetas. So I randomly generate thetas from the specified intervals and compute x and y. Then I normalize x and y to [-1, 1] and the thetas to [0, 1]. I use this data as the training set: the inputs of the network are the normalized x and y, and the outputs are the normalized thetas.
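For reference, a sketch of that data-generation step in Matlab could look like the following; the link lengths l0..l3 are placeholder values, the angle ranges are assumed to be in degrees, and the x equation is only my guess at the "something similar" mentioned above:

% Hypothetical sketch of the data generation described above
N   = 1000;                                   % training-set size used in the question
th1 = deg2rad(   0 + 180*rand(N,1));          % theta1 in <0, 180>   (assumed degrees)
th2 = deg2rad(-130 + 260*rand(N,1));          % theta2 in <-130, 130>
th3 = deg2rad(-150 + 300*rand(N,1));          % theta3 in <-150, 150>
l0 = 0; l1 = 1; l2 = 1; l3 = 1;               % assumed coefficients
% sum-of-angles form, equivalent to the expanded expression above
y = l3*sin(th1+th2+th3) + l2*sin(th1+th2) + l1*sin(th1) + l0;
x = l3*cos(th1+th2+th3) + l2*cos(th1+th2) + l1*cos(th1);   % assumed analogue for x
% normalize inputs to [-1, 1] and targets to [0, 1]
xn = 2*(x - min(x))./(max(x) - min(x)) - 1;
yn = 2*(y - min(y))./(max(y) - min(y)) - 1;
T  = [th1 th2 th3];
Tn = (T - min(T)) ./ (max(T) - min(T));       % column-wise; needs implicit expansion (R2016b+)
inputs  = [xn yn];                            % network inputs
targets = Tn;                                 % network targets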
I trained the network and tried different configurations, but the absolute error of the network was still around 24.9% after a whole night of training. That is far too much, so I don't know what to do.
Bigger training set?
Bigger network?
Experiment with learning rate?
Longer training?
Technical info
The training algorithm is error backpropagation. Neurons have a sigmoid activation function and the units have biases. I tried the topologies [2 50 3] and [2 100 50 3]; the training set has length 1000 and training ran for 1000 cycles (in one cycle I go through the whole dataset). The learning rate is 0.2.
The approximation error was computed as
sum of abs(desired_output - reached_output) / dataset_length.
The optimizer is stochastic gradient descent.
The loss function is
1/2 * (desired - reached)^2.
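In Matlab terms, a minimal sketch of those two quantities (with D and O assumed to be dataset_length-by-3 matrices of desired and reached outputs) would be:

absErr = sum(abs(D(:) - O(:))) / size(D, 1);   % reported approximation error
loss   = 0.5 * (D - O).^2;                     % per-element squared-error loss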
The network was implemented with my own Matlab template for NNs. I know that is a weak point, but I'm fairly sure the template is correct (it has successfully solved the XOR problem and approximated differential equations and a state regulator). I show the template because this information may be useful:
Neuron class
Network class
EDIT:
I used 2500 unique data points within the theta ranges:
theta1 in <0, 180>, theta2 in <-130, 130>, theta3 in <-150, 150>
I also experimented with a larger dataset, but the accuracy doesn't improve.

Interpret linear regression results in MATLAB

I am trying to fit a model having as predictor the variables TNST and Seff and as response the variable AUCMET.
The result of the fitting is:
mdl1 =
Linear regression model:
AUCMET ~ 1 + TNST + Seff
Estimated Coefficients:
                 Estimate    SE          tStat      pValue
  (Intercept)    1251.5      72.176      17.34      1.4406e-58
  TNST           -2.3058     0.16045     -14.371    1.9579e-42
  Seff           13.087      1.0748      12.176     9.4907e-32
Number of observations: 932, Error degrees of freedom: 929
Root Mean Squared Error: 322
R-squared: 0.197, Adjusted R-Squared 0.195
F-statistic vs. constant model: 114, p-value = 5.36e-45
The result from the ANOVA analysis is
anova(mdl1)
ans =
           SumSq         DF     MeanSq        F         pValue
  TNST     2.1395e+07    1      2.1395e+07    206.52    1.9579e-42
  Seff     1.5359e+07    1      1.5359e+07    148.25    9.4907e-32
  Error    9.6243e+07    929    1.036e+05
The output of the diagnostic plot is
plotDiagnostics(mdl)
Could you help me interpret this result? I see that all the p-values are < 0.05, so the variables are important for the model.
Is it a good model? What should I look at to understand it?
The R-squared / adjusted R-squared is the coefficient of determination (for simple regression, the square of the Pearson correlation coefficient): https://en.m.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
A value of 1 is good and 0 is bad, so I'd say it's a pretty bad model.
Edit: Now that you edited the question with new information:
1- The diagnostic plot shows that a percentage of points have high leverage. But this plot does not reveal whether the high-leverage points are outliers. Try plotDiagnostics(mdl,'cookd') to find the outliers (points with a large Cook's distance) and remove them from the data; see the sketch after this list.
2- The ANOVA table shows that both variables are important, so you should not consider removing them.
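A hypothetical sketch of point 1-, assuming the model came from fitlm on a table called dataTbl, and using the common (but not mandatory) 3*mean cut-off for Cook's distance:

% Flag high Cook's distance points and refit without them
cooksd   = mdl1.Diagnostics.CooksDistance;
outliers = cooksd > 3*mean(cooksd, 'omitnan');
mdl2     = fitlm(dataTbl, 'AUCMET ~ TNST + Seff', 'Exclude', outliers);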
Is a Low R-squared Bad?
No. In fields such as predicting human behavior (e.g. psychology), R-squared values are low because human behavior is hard to predict.
Also, if the obtained R-squared is low but the predictions are good, the model still counts as a good model, so a low R-squared doesn't necessarily affect the interpretation of the significant variables. How high should the R-squared be for prediction? That depends on your requirements for the width of the prediction interval and on how much variability is present in your data: while a high R-squared is required for precise predictions, it is not sufficient by itself. On the other hand, high R-squared values are not inherently good; a high R-squared does not necessarily indicate that the model has a good fit.
What to do next?
To examine the quality of the model you can perform other tests, such as
ANOVA
To examine the quality of the fitted model, consult an ANOVA table.
tbl = anova(mdl)
Diagnostic plots
Diagnostic plots help you identify outliers, and see other problems in your model or fit.
plotDiagnostics(mdl)
Residuals
There are several residual plots to help you discover errors, outliers, or correlations in the model or data. The simplest residual plots are the default histogram plot, which shows the range of the residuals and their frequencies, and the probability plot, which shows how the distribution of the residuals compares to a normal distribution with matched variance.
plotResiduals(mdl)
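For example, two further plot types (standard options of plotResiduals for a LinearModel, as far as I know) are:

plotResiduals(mdl, 'probability')   % normal probability plot of the residuals
plotResiduals(mdl, 'fitted')        % residuals versus fitted values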
And more

ROC curve and libsvm

Given a ROC curve drawn with plotroc.m (see here):
Theoretical question: How to select the best threshold to be used?
Programming question: How to induce the libsvm classifier to work with the selected (best) threshold?
A ROC curve is a plot of the fraction of true positives (TPR) on the y-axis versus the fraction of false positives (FPR) on the x-axis. So the coordinates of any point (x, y) on the ROC curve give the FPR and TPR at a particular threshold.
We find the point (x, y) on the ROC curve with the minimum distance from the top-left corner of the plot, i.e. from (0, 1). The threshold value corresponding to that point is the required threshold. Sorry, I am not permitted to post any image, so I couldn't explain with a figure, but for more details about this click the ROC related help.
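If you have the Statistics Toolbox, a small sketch of this "closest to (0,1)" rule could look like the following (using perfcurve instead of plotroc; 'labels' and 'scores' stand for your test labels and positive-class probabilities):

[fpr, tpr, thr] = perfcurve(labels, scores, 1);
[~, idx] = min(sqrt(fpr.^2 + (1 - tpr).^2));   % distance to the top-left corner (0,1)
th = thr(idx);                                 % threshold at the closest point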
Secondly, in libsvm the svmpredict function returns the probability of a data sample belonging to a particular class. So, if that probability (for the positive class) is greater than the threshold obtained from the ROC plot, we can classify the sample into the positive class. These few lines might be useful to you:
[pred_labels,~,p] = svmpredict(target_labels,feature_test,svmStruct,'-b 1');
% where, svmStruct is structure returned by svmtrain function.
op = p(:,svmStruct.Label==1); % This gives probability for positive
% class (i.e whose label is 1 )
Now, if this variable 'op' is greater than the threshold, we can classify the corresponding test sample into the positive class. This can be done as:
op_labels = op>th; % where 'th' is threshold obtained from ROC

KNN classification in MATLAB - confusion matrix and ROC?

I'm trying to classify a data set containing two classes using different classifiers (LDA, SVM, KNN) and would like to compare their performance. I've made ROC curves for the LDA by modifying the prior probability.
But how can I do the same for a KNN classifier?
I searched the documentation and found some functions:
(a) Class = knnclassify(Sample, Training, Group, k)
(b) mdl = ClassificationKNN.fit(X,Y,'NumNeighbors',i,'leaveout','On')
I can run (a) and get a confusion matrix by using leave-one-out cross-validation, but it seems it is not possible to change the prior probability to make an ROC curve.
I haven't tried (b) before, but it creates a model in which you can modify mdl.Prior. However, I have no clue how to get a confusion matrix from it.
Is there an option I've missed, or can someone explain how to fully use those functions to get a ROC curve?
This is indeed not straightforward, because the output of the k-nn classifier is not a score from which a decision is derived by thresholding, but only a decision based on the majority vote.
My suggestion: define a score based on the ratio of classes in the neighborhood, and then threshold this score to compute the ROC. Loosely speaking, the score expresses how certain the algorithm is; it ranges from -1 (maximum certainty for class -1) to +1 (maximum certainty for class +1).
Example: for k=6, the score is
1 if all six neighbours are of class +1;
-1 if all six neighbours are of class -1;
0 if half the neighbours are of class +1 and half the neighbours are of class -1.
Once you have computed this score for each datapoint, you can feed it into a standard ROC function.
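As a sketch of how that might look with the built-in KNN model (assuming binary classes -1/+1 so that the positive class sorts second in ClassNames, and with Xtrain/Ytrain/Xtest/Ytest as placeholders for your own split):

% The second output of predict() for a ClassificationKNN model is the fraction of
% the k neighbours in each class, which can serve directly as the score.
mdl = fitcknn(Xtrain, Ytrain, 'NumNeighbors', 6);
[~, posterior] = predict(mdl, Xtest);
score = 2*posterior(:, 2) - 1;                              % map [0,1] fraction to [-1,+1]
[fpr, tpr, ~, auc] = perfcurve(Ytest, score, mdl.ClassNames(2));
plot(fpr, tpr); xlabel('False positive rate'); ylabel('True positive rate');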