Accuracy degrades on highly skewed data after handling the imbalance problem - classification

After preprocessing my data (missing-value replacement and outlier detection), I partitioned it in WEKA using the Randomize and RemovePercentage filters. My dataset is highly skewed, with an imbalance ratio of 6:1 between the negative and positive classes. If I classify the data with a Naive Bayes classifier without handling the class imbalance, I get 83% accuracy with a recall of 0.623. However, if I handle the class imbalance (balancing to 1:1) with the supervised - instances - Resample or supervised - instances - SpreadSubsample filter and then apply Naive Bayes, accuracy degrades to 77% with a recall of 0.456.
I cannot understand why accuracy degrades when the class imbalance is handled.
Thank you.

If you have a class imbalance of 6:1, then the majority class makes up 6/7 ≈ 85.7% of the data. Just by always predicting the majority class (e.g. using ZeroR) you would get an accuracy slightly better than what NaiveBayes achieves.
After balancing your dataset, NaiveBayes reports 77% accuracy, which is well above the 50% for predicting the majority class.
NaiveBayes has, in some sense, actually improved.
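A quick way to see this, sketched in Python (sklearn's DummyClassifier plays the role of WEKA's ZeroR here; the data is a synthetic stand-in):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

y = np.array([0] * 600 + [1] * 100)   # 6:1 imbalance, 0 = negative class
X = np.zeros((len(y), 1))             # features are irrelevant to this baseline
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(baseline.score(X, y))           # ~0.857: the accuracy bar to beat
```

On the balanced 1:1 data the same baseline drops to 50%, which is why 77% there is a better result than 83% on the skewed data.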


Dimensionality reduction, normalization, resampling, k-fold CV... In what order?

In Python I am working on a binary classification problem of fraud detection on travel insurance. Here are the characteristics of my dataset:
Contains 40,000 samples with 20 features. After one-hot encoding, the number of features is 50 (4 numeric, 46 categorical).
Majority unlabeled: out of 40,000 samples, 33,000 samples are unlabeled.
Highly imbalanced: out of 7,000 labeled samples, only 800 samples (11%) are positive (fraud).
Metrics are precision, recall, and F2 score. We focus more on avoiding false negatives (missed fraud), therefore high recall is appreciated. As preprocessing I oversampled positive cases using SMOTE-NC, which takes categorical variables into account as well.
After trying several approaches, including semi-supervised learning with self-training and Label Propagation/Label Spreading, I achieved a high recall score (80% on training, 65-70% on test). However, my precision score shows signs of overfitting (60-70% on training, 10% on test). I understand that precision is good on training data because it is resampled, and low on test data because the test data directly reflects the class imbalance. But this precision score is unacceptably low, so I want to improve it.
So to simplify the model I am thinking about applying dimensionality reduction. I found a package called prince which comes with FAMD (Factor Analysis of Mixed Data).
Question 1: In what order should I do normalization, FAMD, k-fold cross-validation, and resampling? Is my approach below correct?
Question 2: The package prince does not have methods such as fit or transform like in Sklearn, so I cannot do the 3rd step described below. Are there any other good packages that provide fit and transform for FAMD? And is there any other good way to reduce dimensionality on this kind of dataset?
My approach (a code sketch follows the list):
Make k folds and isolate one of them for validation, use the rest for training
Normalize training data and transform validation data
Fit FAMD on the training data, and transform both training and validation data
Resample only training data using SMOTE-NC
Train whatever model it is, evaluate on validation data
Repeat steps 2-5 k times and take the average of precision, recall, and F2 score
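Here is a minimal sketch of steps 1-6, with assumptions flagged: X is a NumPy array of the one-hot-encoded features and y the labels; PCA stands in for FAMD (any reducer exposing fit/transform slots in the same place); and plain SMOTE is used at step 4 because the components after reduction are all continuous (SMOTE-NC would be the choice if you resampled before reduction):

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

def cv_scores(X, y, n_components=10, k=5):
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    prec, rec, f2 = [], [], []
    for tr, va in skf.split(X, y):                              # 1) k folds
        scaler = StandardScaler().fit(X[tr])                    # 2) normalize: fit on train only
        X_tr, X_va = scaler.transform(X[tr]), scaler.transform(X[va])
        reducer = PCA(n_components=n_components).fit(X_tr)      # 3) reduce (FAMD stand-in)
        X_tr, X_va = reducer.transform(X_tr), reducer.transform(X_va)
        X_tr, y_tr = SMOTE(random_state=0).fit_resample(X_tr, y[tr])  # 4) resample train only
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)       # 5) train, then evaluate
        pred = clf.predict(X_va)
        prec.append(precision_score(y[va], pred))
        rec.append(recall_score(y[va], pred))
        f2.append(fbeta_score(y[va], pred, beta=2))
    return np.mean(prec), np.mean(rec), np.mean(f2)             # 6) average over folds

# synthetic stand-in for the fraud data: ~11% positives
X, y = make_classification(n_samples=2000, n_features=50, weights=[0.89], random_state=0)
print(cv_scores(X, y))
```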
*I would also appreciate any kind of advice on my overall approach to this problem.
Thanks!

How to improve digit recognition prediction in Neural Networks in Matlab?

I've implemented digit recognition (56x56 digit images) using a neural network, but I'm getting 89.5% accuracy on the test set and 100% on the training set. I know that it's possible to get >95% on the test set using this training set. Is there any way to improve my training so I can get better predictions? Increasing the number of iterations from 300 to 1000 only gave me +0.12% accuracy. I'm also limited by file size, so increasing the number of nodes may be impossible, but if that's the issue, maybe I could cut some pixels/nodes from the input layer.
To train I'm using:
input layer: 3136 nodes
hidden layer: 220 nodes
labels: 36
regularized cost function with lambda=0.1
fmincg to calculate weights (1000 iterations)
As mentioned in the comments, the easiest and most promising way is to switch to a Convolutional Neural Network. But with your current model you can:
Add more layers with fewer neurons each, which increases learning capacity and should increase accuracy a bit. The problem is that you might start overfitting; use regularization to counter this.
Use batch normalization (BN). While you are already using regularization, BN also accelerates training and acts as a regularizer, and as an NN-specific technique it might work better.
Make an ensemble. Train several NNs on the same dataset, but with different initializations. This produces slightly different classifiers, and you can combine their outputs for a small increase in accuracy (see the sketch after this list).
Use a cross-entropy loss. You don't mention which loss function you are using; if it's not cross-entropy, you should switch to it. Most high-accuracy classifiers are trained with cross-entropy loss.
Switch to Stochastic Gradient Descent with backpropagation. I do not know the effect of a different optimization algorithm in your case, but it might outperform fmincg, and you could combine it with adaptive optimizers such as Adagrad or Adam.
Other small changes that might increase accuracy: change the activation function (e.g. to ReLU), shuffle the training samples after every epoch, and use data augmentation.
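To make the ensemble point concrete, here is a minimal sketch in Python (the question uses Matlab, but the idea is language-agnostic). sklearn's small 8x8 digit set stands in for your data, and MLPClassifier (which trains with a cross-entropy loss by default) stands in for your network:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# small 8x8 digit set as a stand-in for the 56x56 data
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

probas = []
for seed in range(5):  # five nets differing only in their random initialization
    net = MLPClassifier(hidden_layer_sizes=(220,), max_iter=300,
                        random_state=seed).fit(X_tr, y_tr)
    probas.append(net.predict_proba(X_te))

# average the predicted class probabilities, then take the argmax
pred = np.mean(probas, axis=0).argmax(axis=1)
print("ensemble accuracy:", (pred == y_te).mean())
```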

WEKA classifier evaluation

I'm trying to evaluate the performance of a classifier using 10-fold CV in WEKA. I have 32,000 records split across three different classes, "po", "ng", "ne".
po: ~950
ng: ~1200
ne: ~30000
How should I split the dataset for performing CV? Am I right in assuming that for CV I should have a roughly equal number of records for each class, to prevent unfair weighting towards the "ne" class?
Firstly, no: you need not have an equal number of cases in each class; not all datasets are balanced. But yes, imbalance can give misleading results. Class imbalance is a common phenomenon, and there are a few tactics to handle it:
1) Resampling the dataset
Undersampling: deleting records of the majority class
Oversampling: adding records to the minority class
You can use the SMOTE algorithm to do this for you.
2) Performance metrics
Some metrics, like Cohen's kappa, work well here because classification accuracy is normalized by the class imbalance in the data.
3) Cost-sensitive classification
WEKA has a CostSensitiveClassifier that can wrap any classifier and apply a custom penalty matrix for misclassification (a sketch of tactics 2 and 3 follows this list).
The challenge is how you determine the cost, because the cost should be domain dependent, not data dependent.
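To make tactics 2) and 3) concrete, here is a minimal Python analogue (WEKA itself is a Java GUI; sklearn's class_weight plays the role of WEKA's cost matrix, and the data here is a synthetic stand-in):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import cross_val_predict

# synthetic stand-in: ~95% majority class, ~5% minority
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)

# cost-sensitive: misclassifying the minority class costs 10x more
clf = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000)

pred = cross_val_predict(clf, X, y, cv=10)   # 10-fold CV, as in the question
print("kappa:", cohen_kappa_score(y, pred))  # chance-corrected agreement
```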
For cross-validation, I found this link useful:
http://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation
Hope it helps.

Evaluating performance of Neural Network embeddings in kNN classifier

I am solving a classification problem. I train an unsupervised neural network on a set of entities (using a skip-gram architecture).
The way I evaluate is to find, for each point in the validation data, its k nearest neighbours in the training data. I take a weighted sum (weights based on distance) of the neighbours' labels and use that as the score for each validation point, as in the sketch below.
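In code, the evaluation looks roughly like this (names like emb_train are placeholders for the learned embeddings; the toy data at the end is random, just to make the sketch runnable):

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import NearestNeighbors

def knn_auc(emb_train, y_train, emb_val, y_val, k=10):
    nn = NearestNeighbors(n_neighbors=k).fit(emb_train)
    dist, idx = nn.kneighbors(emb_val)       # k nearest training points per validation point
    w = 1.0 / (dist + 1e-8)                  # inverse-distance weights
    scores = (w * y_train[idx]).sum(axis=1) / w.sum(axis=1)  # weighted label average
    return roc_auc_score(y_val, scores)

# toy usage with random "embeddings"
rng = np.random.default_rng(0)
e_tr, e_va = rng.normal(size=(200, 16)), rng.normal(size=(50, 16))
y_tr, y_va = rng.integers(0, 2, 200), rng.integers(0, 2, 50)
print(knn_auc(e_tr, y_tr, e_va, y_va))
```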
Observation: as I increase the number of epochs (model 1: 600 epochs, model 2: 1400 epochs, model 3: 2000 epochs), my AUC improves at smaller values of k but saturates at similar values.
What could be a possible explanation of this behaviour?
[Reposted from CrossValidated]
To cross-check whether imbalanced classes are an issue, try fitting an SVM model. If that gives better classification (possible if your ANN is not very deep), you may conclude that the classes should be balanced first.
Also, try some kernel functions to check whether such a transformation makes the data linearly separable (see the sketch below).
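A minimal sketch of that check, assuming emb holds the learned embeddings and labels the classes (synthetic data stands in here); a large gap between the two kernels hints that the data is not linearly separable in the embedding space:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# synthetic stand-in for the real embeddings
emb, labels = make_classification(n_samples=1000, n_features=32, random_state=0)

for kernel in ("linear", "rbf"):
    acc = cross_val_score(SVC(kernel=kernel), emb, labels, cv=5).mean()
    print(kernel, round(acc, 3))
```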

How to enhance sensitivity in one-class classification when you are getting high accuracy and specificity but low sensitivity?

If you are getting low sensitivity and high specificity with one-class classification, how do you overcome this problem?
For example: the positive data is divided into 90% training and 10% testing, and a fair amount of negative data is then added to the 10% positive test data (only positive data is used for training, because this is a one-class classification problem).
Since the threshold in one-class classification is decided by rejecting the 10% (or some small percentage) most deviating training data, it is possible that your 10% test data lies in the rejection region, which leads to low sensitivity. A sketch of this setup is below.
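For concreteness, here is a minimal, hypothetical Python sketch of that thresholding scheme, assuming a OneClassSVM trained on positives only and a threshold set at the 10th percentile of the training scores (all data here is synthetic):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_pos_train = rng.normal(loc=0.0, size=(900, 5))           # 90% of the positives
X_test = np.vstack([rng.normal(loc=0.0, size=(100, 5)),    # held-out 10% positives
                    rng.normal(loc=3.0, size=(100, 5))])   # added negatives
y_test = np.array([1] * 100 + [0] * 100)

ocsvm = OneClassSVM(gamma="scale").fit(X_pos_train)         # trained on positives only
train_scores = ocsvm.decision_function(X_pos_train)
threshold = np.quantile(train_scores, 0.10)                 # reject 10% most deviating

pred = (ocsvm.decision_function(X_test) >= threshold).astype(int)
print("sensitivity:", (pred[y_test == 1] == 1).mean())
print("specificity:", (pred[y_test == 0] == 0).mean())
```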
How can I resolve this issue? I am stuck on this: high accuracy but low sensitivity.