What is the threshold in AUC (Area under curve) - classification

Assume a binary classifier (say a random forest) rfc, and I want to calculate the AUC. I struggle to understand how the thresholds are being used in the calculation. I understand that you make a plot of TPR vs. FPR for different thresholds. I also understand that a threshold is used as the cutoff for predicting class 1 (else class 0), but how does the AUC calculation predict classes?
Say, using sklearn.metrics.roc_auc_score, you pass y_true and y_rfc (the true values and the predicted values), but I do not see how the thresholds come into play in the AUC score/plot.
I have read different guides/tutorials for AUC, but all of their explanations of the threshold and how it is used are kinda vague.
I have also had a look at How does sklearn actually calculate AUROC?

The ROC curve is generated from the TPR/FPR at different thresholds. The main point is to sample thresholds from (0, 1) and get one point on the curve per threshold. Notice that if your classifier is perfect, some threshold separates the two classes exactly and gives the point (0, 1); the remaining thresholds can only trace the left and top edges of the plot, which leads to AUC = 1.
AUC therefore provides information not only about classification quality but also about how well the confidence (predicted probability) of your classifier is evaluated.
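For example, here is a minimal sketch (with made-up labels and scores) of where the thresholds live in sklearn: roc_curve returns them explicitly, one (FPR, TPR) point per threshold, while roc_auc_score just integrates over all of them, which is why you pass predicted probabilities/scores rather than hard 0/1 predictions:

    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    # Hypothetical labels and predicted probabilities for class 1
    # (in the question's setting these would come from rfc.predict_proba(X)[:, 1]).
    y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
    y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.55])

    # roc_curve tries the score values as thresholds and records TPR/FPR there.
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    print(thresholds)   # the thresholds the curve is built from
    print(fpr, tpr)     # one (FPR, TPR) point per threshold

    # roc_auc_score is the area under that curve; no single threshold is chosen.
    print(roc_auc_score(y_true, y_score))

So the AUC calculation never "predicts classes" at one fixed threshold; it sweeps every threshold implied by the scores and measures the area under the resulting curve.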

Related

How to get the best threshold for classification using H2O Python

I have a classification model using H2O in Python for which the AUC = 71%.
But the accuracy based on the confusion matrix is only 61%. I understand that the confusion matrix is based on a 0.5 threshold.
How do I determine for which threshold the accuracy will be 71%?
AUC of the ROC curve is not accuracy, and its value is threshold-independent. It is a measure of how well separated the two classes are. The 71% value tells you the probability that a randomly sampled positive-class example has a higher predicted probability than a randomly sampled negative-class example. See this explanation.
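That probabilistic interpretation is easy to check numerically; here is a small sketch (with synthetic scores, not H2O output) comparing the pairwise-ranking probability with roc_auc_score:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=200)
    y_score = np.clip(y_true * 0.3 + rng.normal(0.4, 0.25, size=200), 0, 1)

    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Probability that a random positive outscores a random negative
    # (ties counted as one half); this matches the AUC.
    pairwise = (pos[:, None] > neg[None, :]).mean() + 0.5 * (pos[:, None] == neg[None, :]).mean()
    print(pairwise, roc_auc_score(y_true, y_score))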
Selecting the threshold should depend on your cost matrix (how large the penalty is for false positives or false negatives). You would want to select the threshold that maximizes your desired metric (max F1, precision, accuracy, ...). H2O gives multiple options: if you call the model performance (Python ex: your_model.model_performance()), you will get the threshold for max accuracy and the other optimized metrics listed.
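If you want to see what that threshold search amounts to, here is a language-agnostic sketch (plain numpy with made-up data, not the H2O API) that scans every candidate threshold and keeps the one with the highest accuracy:

    import numpy as np

    def best_threshold_for_accuracy(y_true, y_prob):
        # Try every predicted probability as a cutoff and return the one
        # that maximizes accuracy (illustrative sketch only).
        thresholds = np.unique(y_prob)
        accuracies = [((y_prob >= t).astype(int) == y_true).mean() for t in thresholds]
        best = int(np.argmax(accuracies))
        return thresholds[best], accuracies[best]

    # Hypothetical labels and predicted probabilities
    y_true = np.array([0, 0, 1, 1, 0, 1])
    y_prob = np.array([0.2, 0.45, 0.4, 0.9, 0.3, 0.7])
    print(best_threshold_for_accuracy(y_true, y_prob))

H2O performs this kind of scan for you over the standard metrics and reports the maximizing threshold for each, as described above.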

Discriminant analysis method to classify data

My aim is to classify the data into two sections, upper and lower, by finding the mid line between the peaks.
I would like to apply machine learning methods, i.e. discriminant analysis.
Could you let me know how to do that in MATLAB?
It seems that what you are looking for is a GMM (Gaussian mixture model). With K = 2 (number of mixture components) and dimension equal to 1, this is a simple, fast method which will give you a direct solution. Given the fitted components, it is easy to find the local minimum between them analytically (which is just a weighted average of the means, with weights proportional to the standard deviations).
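The question asks for MATLAB (fitgmdist would be the natural route there); as a quick illustration of the idea, here is a sketch in Python with scikit-learn that fits a two-component 1-D GMM and locates the boundary between the two sections where the components are equally probable:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Hypothetical 1-D data with two bands (an upper and a lower group of peaks).
    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(2.0, 0.3, 500), rng.normal(5.0, 0.4, 500)]).reshape(-1, 1)

    gmm = GaussianMixture(n_components=2, random_state=0).fit(x)

    # The mid line is where the two components are equally likely:
    # scan a grid between the two means and pick the crossover point.
    means = np.sort(gmm.means_.ravel())
    grid = np.linspace(means[0], means[1], 1000).reshape(-1, 1)
    post = gmm.predict_proba(grid)
    mid_line = grid[np.argmin(np.abs(post[:, 0] - post[:, 1]))][0]
    print("mid line between the peaks ~", mid_line)

    # Classify every point into the upper or lower section.
    labels = gmm.predict(x)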

h2o random forest calculating MSE for multinomial classification

Why is h2o.randomforest calculating MSE on the out-of-bag sample and during training for a multinomial classification problem?
I have also done binary classification using h2o.randomforest; there it calculated AUC on the out-of-bag sample and during training, but for multiclass classification the random forest calculates MSE, which seems suspicious. Please see this screenshot.
My target variable was a factor containing 4 factor levels: model1, model2, model3 and model4. In the screenshot you can also see a confusion matrix for these factors.
Can someone please explain this behaviour?
Both binomial and multinomial classification display MSE, so you will see it in the Scoring History table for both models (highlighted training_MSE column).
H2O does not evaluate a multinomial AUC. A few evaluation methods exist, but there is not yet a single widely adopted one. The pROC package discusses the method of Hand and Till, but mentions that it cannot be plotted and that its results are rarely tested. Log loss and classification error are still available and are specific to classification, as each has a standard method of evaluation in a multinomial context.
There is a confusion matrix comparing your 4 factor levels, as you highlighted. Can you clarify what more you are expecting? If you were looking for four individual confusion matrices, the four-column table contains enough information that they could be computed.
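For instance, given a 4x4 confusion matrix like the one in the screenshot (made-up numbers below), each level's one-vs-rest confusion matrix can be derived directly from it; a short sketch:

    import numpy as np

    # Hypothetical 4x4 confusion matrix: rows = actual, columns = predicted,
    # standing in for the model1..model4 table from the screenshot.
    cm = np.array([[50,  3,  2,  1],
                   [ 4, 45,  5,  2],
                   [ 3,  6, 40,  4],
                   [ 1,  2,  3, 48]])

    total = cm.sum()
    for k, name in enumerate(["model1", "model2", "model3", "model4"]):
        tp = cm[k, k]
        fn = cm[k, :].sum() - tp
        fp = cm[:, k].sum() - tp
        tn = total - tp - fn - fp
        print(name, "one-vs-rest [[TP, FN], [FP, TN]]:", [[tp, fn], [fp, tn]])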

SVM Classification with Cross Validation

I am new to using Matlab and am trying to follow the example in the Bioinformatics Toolbox documentation (SVM Classification with Cross Validation) to handle a classification problem.
However, I am not able to understand Step 9, which says:
Set up a function that takes an input z=[rbf_sigma,boxconstraint], and returns the cross-validation value of exp(z).
The reason to take exp(z) is twofold:
rbf_sigma and boxconstraint must be positive.
You should look at points spaced approximately exponentially apart.
This function handle computes the cross validation at parameters
exp([rbf_sigma,boxconstraint]):
minfn = @(z)crossval('mcr',cdata,grp,'Predfun', ...
    @(xtrain,ytrain,xtest)crossfun(xtrain,ytrain,...
    xtest,exp(z(1)),exp(z(2))),'partition',c);
What is the function that I should be implementing here? Is it exp or minfn? I would appreciate it if you could give me the code for this section. Thanks.
I would also like to know what it means when it says exp([rbf_sigma,boxconstraint]).
rbf_sigma: The SVM is using a Gaussian kernel, and rbf_sigma sets the standard deviation (~size) of that kernel. To understand how the kernels work: the SVM puts a kernel around every sample (so that you have a Gaussian around every sample). The kernels are then summed over the samples of each category/type. At each point, the type whose sum is higher is the "winner". For example, if type A has a higher sum of these kernels at point X, then a new datum at point X will be classified as type A. (There are other configuration parameters that may change the actual threshold at which one category is selected over another.)
Fig.: Consider the figure from the webpage you gave us. You can see how, by adding up the Gaussian kernels on the red samples ("sumA") and on the green samples ("sumB"), it is logical that sumA > sumB in the central part of the figure, and that sumB > sumA in the outer part of the image.
boxconstraint: It is a cost/penalty on misclassified data. During the training stage of the classifier, where you use the training data to adjust the SVM parameters, the training algorithm uses an error function to decide how to optimize the SVM parameters iteratively. The cost for a misclassified sample is proportional to how far it is from the boundary where it would have been classified correctly. In the figure I am attaching, that boundary is the inner blue circumference.
Taking into account BGreene's indications and what I understand from the tutorial:
In the tutorial they advise trying values for rbf_sigma and boxconstraint that are exponentially spaced. This means that you should compare values like {0.2, 2, 20, ...} (note that this is {2*10^(i-2), i=1,2,3,...}), and NOT like {0.2, 0.3, 0.4, 0.5} (which would be linearly spaced). They advise this so that you try a wide range of values first; you can refine the search later around the first optimum you obtain.
The command [searchmin fval] = fminsearch(minfn,randn(2,1),opts) will give you back the optimum values for rbf_sigma and boxconstraint. You probably have to use exp(z) because it affects how fminsearch increments the values of z(1) and z(2) during the search for the optimum: with exp(z(1)) in the definition of @minfn, fminsearch effectively takes 'exponentially' big steps.
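As a rough analog in Python (a sketch, not the tutorial's MATLAB code; scikit-learn's gamma corresponds only loosely to rbf_sigma, roughly 1/(2*rbf_sigma^2)), the same trick looks like this: the optimizer works on z = log(parameters) and the objective exponentiates z before fitting, so positivity and exponential step sizes come for free:

    import numpy as np
    from scipy.optimize import minimize
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, random_state=0)

    def cv_error(z):
        # z holds log(gamma) and log(C); exponentiating keeps both positive
        # and makes the search move in exponential steps, as in the tutorial.
        gamma, C = np.exp(z)
        clf = SVC(kernel="rbf", gamma=gamma, C=C)
        return 1.0 - cross_val_score(clf, X, y, cv=5).mean()

    # Nelder-Mead plays the role of MATLAB's fminsearch here.
    res = minimize(cv_error, x0=np.zeros(2), method="Nelder-Mead")
    print("best gamma, C:", np.exp(res.x), "cv error:", res.fun)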
In machine learning, always keep in mind that there are three subsets of your data: training data, cross-validation data, and test data. The training set is used to optimize the parameters of the SVM classifier for EACH value of rbf_sigma and boxconstraint. The cross-validation set is then used to select the optimum values of rbf_sigma and boxconstraint. Finally, the test data is used to get an idea of the performance of your classifier (its efficiency is determined on the test set).
So, if you start with 10000 samples you might divide the data, for example, as training (50%), cross-validation (25%), test (25%): randomly sample 5000 samples for the training set, then 2500 of the remaining 5000 for the cross-validation set, and keep the last 2500 for the test set.
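A quick way to carve out those three subsets (a sketch assuming scikit-learn; the synthetic data stands in for your 10000 samples):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    # Stand-in for your 10000 samples.
    X, y = make_classification(n_samples=10000, random_state=0)

    # 50% training, then split the remaining half evenly into
    # cross-validation (25%) and test (25%) sets.
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.5, random_state=0)
    X_cv, X_test, y_cv, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
    print(len(X_train), len(X_cv), len(X_test))  # 5000 2500 2500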
I hope that I could clarify your doubts. By the way, if you are interested in the optimization of the parameters of classifiers and machine learning algorithms I strongly suggest that you follow this free course -> www.ml-class.org (it is awesome, really).
You need to implement a function called crossfun (see example).
The function handle minfn is passed to fminsearch to be minimized.
exp([rbf_sigma,boxconstraint]) is the quantity being optimized to minimize classification error.
There are a number of functions nested within this function handle:
- crossval is producing the classification error based on cross validation using partition c
- crossfun - classifies data using an SVM
- fminsearch - optimizes SVM hyperparameters to minimize classification error
Hope this helps

ROC curve from the result of a classification or clustering

Say that I've clustered a training dataset of 5 classes, containing 1000 instances, into 5 clusters (centers) using, for example, k-means. Then I've constructed a confusion matrix by validating on a test dataset. I then want to plot a ROC curve from this; how is it possible to do that?
ROC curves show the trade-off between the true positive rate and the false positive rate. In other words,
ROC graphs are two-dimensional graphs in which TP rate is plotted on
the Y axis and FP rate is plotted on the X axis
ROC Graphs: Notes and Practical Considerations for Researchers
When you use a discrete classifier, that classifier produces only a single point in ROC space. Normally you need a classifier that produces probabilities (scores). You change the parameters of your classifier so that your TP and FP rates change, and then use those points to draw a ROC curve.
Let's say you use k-means. K-means gives you cluster membership discretely: a point belongs to cluster A or ... cluster E. Therefore producing a ROC curve from k-means is not straightforward. Lee and Fujita describe an algorithm for this. You should look at their paper, but the algorithm is roughly the following:
- Apply k-means.
- Calculate TP and FP using the test data.
- Change the membership of data points from one cluster to another cluster.
- Calculate TP and FP using the test data again.
As you can see, this yields more points in ROC space, and these points are used to draw the ROC curve.
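If you do not want to implement that procedure, a simpler (and cruder) alternative is to turn the k-means output into continuous scores yourself, e.g. the negative distance to a cluster's center, and sweep a threshold over those scores one-vs-rest. A sketch assuming scikit-learn (this is not the Lee and Fujita algorithm):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import roc_curve, roc_auc_score

    # Hypothetical 5-class data standing in for the 1000-instance dataset.
    X, y = make_blobs(n_samples=1000, centers=5, random_state=0)

    km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

    # Use the negative distance to cluster 0's center as a continuous score,
    # evaluated one-vs-rest against the class that cluster 0 mostly captures.
    scores = -km.transform(X)[:, 0]            # higher score = closer to center 0
    majority_class = np.bincount(y[km.labels_ == 0]).argmax()
    y_binary = (y == majority_class).astype(int)

    fpr, tpr, thresholds = roc_curve(y_binary, scores)
    print("one-vs-rest AUC for that cluster:", roc_auc_score(y_binary, scores))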