I have a classification model built with H2O in Python, for which the AUC is 71%.
But the accuracy based on the confusion matrix is only 61%. I understand that the confusion matrix is based on a 0.5 threshold.
How do I determine the threshold at which the accuracy will be 71%?
The AUC of the ROC curve is not accuracy, and its value is threshold-independent. It is a measure of how well separated the two classes are. The 71% value is the probability that a randomly sampled positive example receives a higher predicted probability than a randomly sampled negative example. See this explanation.
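To make the "probability of ranking a positive above a negative" reading concrete, here is a minimal sketch in plain Python. The labels and scores are made-up toy values, and `pairwise_auc` is a hypothetical helper name, not something from H2O.

```python
from itertools import product

def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win. No threshold is involved anywhere.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p, n in product(pos, neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data (assumed for illustration only).
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.6, 0.7]

auc = pairwise_auc(y_true, scores)
print(auc)  # 7 of the 9 positive/negative pairs are ranked correctly
```

Since only the ordering of the scores matters here, the result says nothing about the accuracy at any particular cut-off.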
Selecting the threshold should depend on your cost matrix (how large the penalty is for false positives versus false negatives). You would want to select the threshold that maximizes your desired metric (e.g. max F1, precision, or accuracy). Note that there is no guarantee that any threshold gives an accuracy equal to the AUC. H2O gives multiple options: if you call the model performance (Python ex: your_model.model_performance()), you will get the threshold for max accuracy and other optimized metrics listed.
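Independently of H2O, the idea of "pick the threshold that maximizes accuracy" can be sketched by brute force. This is only an illustration on made-up data; `accuracy_at` and `best_threshold` are hypothetical helper names, and in practice you would sweep on a validation set rather than the training data.

```python
def accuracy_at(y_true, probs, threshold):
    """Accuracy when predicting class 1 for every probability >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    correct = sum(1 for y, yhat in zip(y_true, preds) if y == yhat)
    return correct / len(y_true)

def best_threshold(y_true, probs):
    """Try each distinct predicted probability (plus 'predict nothing') as cut-off."""
    candidates = sorted(set(probs)) + [float("inf")]
    scored = [(t, accuracy_at(y_true, probs, t)) for t in candidates]
    return max(scored, key=lambda pair: pair[1])

# Toy data (assumed for illustration only).
y_true = [0, 0, 1, 1, 0, 1]
probs = [0.1, 0.4, 0.35, 0.8, 0.6, 0.7]

acc_default = accuracy_at(y_true, probs, 0.5)
thr, acc_best = best_threshold(y_true, probs)
print(acc_default, thr, acc_best)
```

On this toy data the accuracy-maximizing threshold differs from 0.5, and the best accuracy is still not equal to the AUC of the same scores.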
Assume I have a binary classifier (say a random forest) rfc and I want to calculate the AUC. I struggle to understand how the thresholds are used in the calculation. I understand that you make a plot of TPR/FPR for different thresholds. I also understand that the threshold is used as the cut-off for predicting class 1 (else class 0), but how does the AUC algorithm predict classes?
Say using sklearn.metrics.roc_auc_score you pass y_true and y_rfc (being the true value and the predicted value), but I do not see how the thresholds come into play in the AUC score/plot.
I have read different guides/tutorials for AUC, but all of their explanations regarding the threshold and how it is used are kinda vague.
I have also had a look at "How does sklearn actually calculate AUROC?".
The ROC curve is generated from the TPR/FPR at different thresholds. The idea is to sweep the threshold over (0, 1) and plot one (FPR, TPR) point for each value. Notice that if your classifier is perfect, some threshold gives the point (0, 1), and lowering the threshold further cannot reduce the TPR, so the curve stays at TPR = 1 from there on, which leads to AUC = 1.
AUC gives you information not only about classification quality but also about how well your classifier's confidence scores rank the examples.
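The threshold sweep described above can be sketched in plain Python. This is a simplified illustration, not how sklearn implements it internally; the function names and toy data are assumptions. The key point is that the inputs are continuous scores (e.g. predicted probabilities for class 1), not hard 0/1 predictions.

```python
def roc_points(y_true, scores):
    """One (FPR, TPR) point per distinct score used as a threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def trapezoid_auc(points):
    """Area under the piecewise-linear curve through the points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Toy data (assumed for illustration only).
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.6, 0.7]
auc = trapezoid_auc(roc_points(y_true, scores))

# A perfectly separating classifier: the curve passes through (0, 1).
perfect = trapezoid_auc(roc_points([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))
print(auc, perfect)
```

Using every distinct score as a threshold is enough, because the predicted classes only change when the threshold crosses one of the scores.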
I am solving a classification problem. I train my unsupervised neural network on a set of entities (using a skip-gram architecture).
The way I evaluate is to search for the k nearest neighbours of each point in the validation data among the training data. I take a weighted sum (weights based on distance) of the labels of the nearest neighbours and use that as the score of each validation point.
Observation - As I increase the number of epochs (model 1 - 600 epochs, model 2 - 1400 epochs and model 3 - 2000 epochs), my AUC improves at smaller values of k but saturates at similar values.
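For readers trying to reproduce this evaluation, a rough sketch of such a distance-weighted kNN score could look like the following. The inverse-distance weighting and the normalization by the sum of weights are assumptions; the post does not say which weighting was used.

```python
import math

def knn_score(query, train_points, train_labels, k, eps=1e-9):
    """Distance-weighted average of the labels of the k nearest training points.

    Weights are 1 / (distance + eps), which is an assumed choice.
    """
    by_distance = sorted(
        (math.dist(query, p), label) for p, label in zip(train_points, train_labels)
    )
    nearest = by_distance[:k]
    weights = [1.0 / (d + eps) for d, _ in nearest]
    weighted_sum = sum(w * label for w, (_, label) in zip(weights, nearest))
    return weighted_sum / sum(weights)

# Toy embeddings and binary labels (assumed for illustration only).
train_points = [(0.0, 1.0), (0.0, 2.0), (5.0, 5.0)]
train_labels = [1, 0, 1]
score = knn_score((0.0, 0.0), train_points, train_labels, k=2)
print(score)
```

Scores produced this way for every validation point can then be fed into an AUC computation, one value of k at a time.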
What could be a possible explanation of this behaviour?
[Reposted from CrossValidated]
To cross-check whether imbalanced classes are an issue, try fitting an SVM model. If that gives a better classification (possible if your ANN is not very deep), it may be concluded that the classes should be balanced first.
Also, try some kernel functions to check whether the transformation makes the data linearly separable.
In Matlab, I'm creating a visual codebook using Bag of Features with the SURF features of 3913 images and k = 450. I train an SVM classifier with the visual codebook, and then use it to classify video frames to detect humans. The video I'm using is an aerial one. My maximum number of iterations is 100 by default, but when I run the code, I get a warning from Matlab that says "Failed convergence at 100 iterations". What does this mean? Does it affect my clustering? I only have 2 classes: person and nonperson. Does it also mean that I have to increase my maximum iterations for better results, or do I have to decrease it?
When you say 100 iterations, are you talking about the clustering, i.e. building the "visual vocabulary"? If so, then the message you are getting would indicate that the k-means clustering was not able to converge within 100 iterations. That means the cluster centers are still moving after each iteration by an amount greater than what is specified in the convergence criterion. The most reasonable thing to do would be to run k-means for more iterations.
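To illustrate what "failed to converge" means here, below is a minimal k-means (Lloyd's algorithm) sketch in Python with an iteration cap and a tolerance on how far the centers move. The tolerance value and the exact stopping rule are assumptions for illustration; Matlab's own criterion may differ.

```python
import math

def kmeans(points, centers, max_iter=100, tol=1e-4):
    """Returns (centers, iterations_used, converged)."""
    for it in range(1, max_iter + 1):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        # Update step: move each center to the mean of its members
        # (an empty cluster keeps its old center).
        new_centers = []
        for c, members in zip(centers, clusters):
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centers.append(c)
        # Convergence check: largest center movement in this iteration.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            return centers, it, True
    return centers, max_iter, False  # hit the cap while centers were still moving

# Toy data (assumed for illustration only).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
start = [(0.0, 0.0), (10.0, 10.0)]

centers, n_iter, converged = kmeans(points, start)
_, _, converged_capped = kmeans(points, start, max_iter=1)
print(n_iter, converged, converged_capped)
```

With the cap set too low, the function returns whatever centers it has at that moment, which is the situation the warning describes. Raising the cap lets the loop continue until the centers stop moving.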
Currently I'm trying to classify spam emails with kNN classification. The dataset is in bag-of-words representation and contains approx. 10000 observations with approx. 900 features. Matlab is the tool I use to process the data.
Over the last few days I have played with several machine learning approaches: SVM, Bayes and kNN. In my view, kNN's performance beats SVM and Bayes when it comes to minimizing the false positive rate. Checking with 10-fold cross-validation, I obtain a false positive rate of 0.0025 using k=9 and Manhattan distance. Hamming distance performs in the same region.
To further improve my FPR I tried to preprocess my data with PCA, but that made things much worse: the resulting FPR of 0.08 is not acceptable.
Do you have any idea how to tune the dataset to get a better FPR?
PS: Yes, this is a task I have to do in order to pass a machine learning course.
Something to try: double count the non-spam samples in your training data. Say 500 of 1000 samples are non-spam. After double counting the non-spam ones you will have a training set of 1500 samples. This might give the test samples that are currently false positives more non-spam nearest neighbours. Note that overall performance might suffer.
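A tiny sketch of the double-counting idea, in Python rather than Matlab and with a single made-up feature, just to show how duplicating one class changes the kNN vote. `duplicate_class` and `knn_predict` are hypothetical helpers, not library functions.

```python
from collections import Counter

def duplicate_class(X, y, label):
    """Return a training set in which every sample of `label` appears twice."""
    extra = [(x, t) for x, t in zip(X, y) if t == label]
    X_new = list(X) + [x for x, _ in extra]
    y_new = list(y) + [t for _, t in extra]
    return X_new, y_new

def knn_predict(query, X, y, k):
    """Majority vote among the k nearest samples (1-D, absolute distance)."""
    nearest = sorted(zip(X, y), key=lambda pair: abs(pair[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy 1-D training data (assumed for illustration only).
X = [1.0, 0.8, 1.3, 3.0]
y = ["non-spam", "spam", "spam", "spam"]

before = knn_predict(1.05, X, y, k=3)      # the lone non-spam sample is outvoted
X2, y2 = duplicate_class(X, y, "non-spam")
after = knn_predict(1.05, X2, y2, k=3)     # its duplicate now tips the vote
print(before, after)
```

The same duplication also makes it easier for genuine spam near non-spam samples to be voted non-spam, which is why overall performance might suffer.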
I am using the RBF kernel function in Matlab.
On a couple of datasets, as I increase the sigma value, the number of support vectors increases and the accuracy increases.
In the case of one dataset, however, as I increase the sigma value, the number of support vectors decreases and the accuracy increases.
I am not able to work out the relation between support vectors and accuracy in the case of the RBF kernel.
The number of support vectors doesn't have a direct relationship to accuracy; it depends on the shape of the data (and your C/nu parameter).
Higher sigma means that the kernel is a "flatter" Gaussian and so the decision boundary is "smoother"; lower sigma makes it a "sharper" peak, and so the decision boundary is more flexible and able to reproduce strange shapes if they're the right answer. If sigma is very high, your data points will have a very wide influence; if very low, they will have a very small influence.
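A quick numeric illustration of "wide" versus "small" influence, assuming the common parameterization k(x, y) = exp(-||x - y||^2 / (2 sigma^2)). Your library may scale sigma differently, so treat the exact numbers as illustrative.

```python
import math

def rbf_kernel(x, y, sigma):
    """Gaussian (RBF) kernel value, assuming exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

x = (0.0, 0.0)
y = (1.0, 1.0)  # squared distance to x is 2

k_sharp = rbf_kernel(x, y, sigma=0.1)   # practically zero: y barely "sees" x
k_mid = rbf_kernel(x, y, sigma=1.0)     # exp(-1)
k_flat = rbf_kernel(x, y, sigma=10.0)   # close to 1: x influences y strongly
print(k_sharp, k_mid, k_flat)
```

For a fixed pair of points the kernel value grows towards 1 as sigma increases, which is the sense in which each point's influence becomes wider.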
Thus, often, increasing the sigma values will result in more support vectors: for more-or-less the same decision boundary, more points will fall within the margin, because points become "fuzzier." Increased sigma also means, though, that the slack variables "moving" points past the margin are more expensive, and so the classifier might end up with a much smaller margin and fewer SVs. Of course, it also might just give you a dramatically different decision boundary with a completely different number of SVs.
In terms of maximizing accuracy, you should be doing a grid search over many different values of C and sigma and choosing the combination that gives you the best performance on e.g. 3-fold cross-validation on your training set. One reasonable approach is to choose from e.g. 2.^(-9:3:18) for C and median_eval * 2.^(-4:2:10) for sigma; those numbers are fairly arbitrary, but they're ones I've used with success in the past.
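The grids above translate to Python roughly as follows. This is a sketch of the search loop only: `sigma_scale` is a placeholder for the answer's median_eval, and `toy_cv_score` is a made-up stand-in for a real cross-validation routine that would train an SVM with the given C and sigma and return mean accuracy.

```python
import math

def log2_grid(start, step, stop):
    """Python analogue of Matlab's 2.^(start:step:stop), stop included."""
    return [2.0 ** e for e in range(start, stop + 1, step)]

def grid_search(c_values, sigma_values, cv_score):
    """Return (best_score, best_C, best_sigma) over the full grid."""
    best = None
    for c in c_values:
        for sigma in sigma_values:
            score = cv_score(c, sigma)
            if best is None or score > best[0]:
                best = (score, c, sigma)
    return best

sigma_scale = 1.0  # placeholder: set this to a scale that suits your data
c_grid = log2_grid(-9, 3, 18)
sigma_grid = [sigma_scale * v for v in log2_grid(-4, 2, 10)]

def toy_cv_score(c, sigma):
    # Made-up score surface peaking at C = 8, sigma = 4; replace with real
    # k-fold cross-validation accuracy on your training set.
    return -((math.log2(c) - 3) ** 2 + (math.log2(sigma) - 2) ** 2)

best_score, best_c, best_sigma = grid_search(c_grid, sigma_grid, toy_cv_score)
print(len(c_grid), len(sigma_grid), best_c, best_sigma)
```

The score function is passed in as a callback so that the same loop works with whatever SVM implementation and fold count you use.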