WEKA classifier evaluation - classification

I'm trying to evaluate the performance of a classifier using 10-fold CV in WEKA. I have 32,000 records split across three different classes, "po", "ng", "ne".
po: ~950
ng: ~1200
ne: ~30000
How should I split the dataset for performing CV? Am I right in assuming that for CV I should have a roughly equal number of records for each class, so to prevent unfair weighting towards the "ne" class?

Firstly, no you need not have equal no. of cases in your classes. Not all datasets are balanced. Yes it might give unrealistic answer. The imbalance in the dataset is a common phenomenon but there are few tactics to handle it-:
1) Resampling the dataset
Undersampling- Deleting the records of majority class
Oversampling- Adding the records in minority class
you can use SMOTE algorithm to do it for you.
2) Performance Metrics
Some metrics like Kappa (or Cohen’s kappa)can work great in which Classification accuracy is normalized by the imbalance of the classes in the data.
3) Cost Sensitive Classifier
Weka has a CostSensitiveClassifier that can wrap any classifier and apply a custom penalty matrix for miss classification.
But the challenge here is how you determine the cost because cost should be domain dependent and not data dependent.
In case of cross-validation, I found this link to be useful.
http://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation
Hope it helps.

Related

Tensorflow Resnet with unbalanced classes

I use Resnet with Tensorflow to train a model with 20 classes.
My problem is that I have 6-7 classes with A LOT of samples, about the same number of classes with a medium number of samples and the rest of classes with few samples. With this given distribution, my model had a too strong tendency to predict classes with a larger sampling over the smaller one. I've tried to balance my classes by reducing the number of samples of my large classes, and it helped to give a place to the smaller classes during the prediction, but now I've reach a point where I can't improve my model over an accuracy of 90%, and I feel like I'm loosing a lot of valuable information by cutting samples in my large classes.
So, before I go buy more samples, I'm wondering if there is a way to work with unbalanced classes with a logic that model become very good to recognize if the larger classes are present or not (because it has so many samples of them that it is extremely capable to recognize their presence), then, if they are absent, to go check which other classes are present. The idea is to use all the samples I have instead of reducing them.
I've already try the weighted class option in Keras/Tensorflow, it didn't help.
Beside the undersampling technic you used so far, there two other ways to deal with imbalanced data:
class weighting
oversampling
Oversampling is the opposite of what you did ie. you will train every sample of under represneted classes multiple times. class weighting is the case when you tell the model how much it should weigh every class samples in training procedure (weight updates in case of a neural network training). Both these cases are supported by tensorflow and you cand find them in official tutorials.

Shouldn't we take average of n models in cross validation in linear regression?

I have a question regarding cross validation in Linear regression model.
From my understanding, in cross validation, we split the data into (say) 10 folds and train the data from 9 folds and the remaining folds we use for testing. We repeat this process until we test all of the folds, so that every folds are tested exactly once.
When we are training the model from 9 folds, should we not get a different model (may be slightly different from the model that we have created when using the whole dataset)? I know that we take an average of all the "n" performances.
But, what about the model? Shouldn't the resulting model also be taken as the average of all the "n" models? I see that the resulting model is same as the model which we created using whole of the dataset before cross-validation. If we are considering the overall model even after cross-validation (and not taking avg of all the models), then what's the point of calculating average performance from n different models (because they are trained from different folds of data and are supposed to be different, right?)
I apologize if my question is not clear or too funny.
Thanks for reading, though!
I think that there is some confusion in some of the answers proposed because of the use of the word "model" in the question asked. If I am guessing correctly, you are referring to the fact that in K-fold cross-validation we learn K-different predictors (or decision functions), which you call "model" (this is a bad idea because in machine learning we also do model selection which is choosing between families of predictors and this is something which can be done using cross-validation). Cross-validation is typically used for hyperparameter selection or to choose between different algorithms or different families of predictors. Once these chosen, the most common approach is to relearn a predictor with the selected hyperparameter and algorithm from all the data.
However, if the loss function which is optimized is convex with respect to the predictor, than it is possible to simply average the different predictors obtained from each fold.
This is because for a convex risk, the risk of the average of the predictor is always smaller than the average of the individual risks.
The PROs and CONs of averaging (vs retraining) are as follows
PROs: (1) In each fold, the evaluation that you made on the held out set gives you an unbiased estimate of the risk for those very predictors that you have obtained, and for these estimates the only source of uncertainty is due to the estimate of the empirical risk (the average of the loss function) on the held out data.
This should be contrasted with the logic which is used when you are retraining and which is that the cross-validation risk is an estimate of the "expected value of the risk of a given learning algorithm" (and not of a given predictor) so that if you relearn from data from the same distribution, you should have in average the same level of performance. But note that this is in average and when retraining from the whole data this could go up or down. In other words, there is an additional source of uncertainty due to the fact that you will retrain.
(2) The hyperparameters have been selected exactly for the number of datapoints that you used in each fold to learn. If you relearn from the whole dataset, the optimal value of the hyperparameter is in theory and in practice not the same anymore, and so in the idea of retraining, you really cross your fingers and hope that the hyperparameters that you have chosen are still fine for your larger dataset.
If you used leave-one-out, there is obviously no concern there, and if the number of data point is large with 10 fold-CV you should be fine. But if you are learning from 25 data points with 5 fold CV, the hyperparameters for 20 points are not really the same as for 25 points...
CONs: Well, intuitively you don't benefit from training with all the data at once
There are unfortunately very little thorough theory on this but the following two papers especially the second paper consider precisely the averaging or aggregation of the predictors from K-fold CV.
Jung, Y. (2016). Efficient Tuning Parameter Selection by Cross-Validated Score in High Dimensional Models. International Journal of Mathematical and Computational Sciences, 10(1), 19-25.
Maillard, G., Arlot, S., & Lerasle, M. (2019). Aggregated Hold-Out. arXiv preprint arXiv:1909.04890.
The answer is simple: you use the process of (repeated) cross validation (CV) to obtain a relatively stable performance estimate for a model instead of improving it.
Think of trying out different model types and parametrizations which are differently well suited for your problem. Using CV you obtain many different estimates on how each model type and parametrization would perform on unseen data. From those results you usually choose one well suited model type + parametrization which you will use, then train it again on all (training) data. The reason for doing this many times (different partitions with repeats, each using different partition splits) is to get a stable estimation of the performance - which will enable you to e.g. look at the mean/median performance and its spread (would give you information about how well the model usually performs and how likely it is to be lucky/unlucky and get better/worse results instead).
Two more things:
Usually, using CV will improve your results in the end - simply because you take a model that is better suited for the job.
You mentioned taking the "average" model. This actually exists as "model averaging", where you average the results of multiple, possibly differently trained models to obtain a single result. Its one way to use an ensemble of models instead of a single one. But also for those you want to use CV in the end for choosing reasonable model.
I like your thinking. I think you have just accidentally discovered Random Forest:
https://en.wikipedia.org/wiki/Random_forest
Without repeated cv your seemingly best model is likely to be only a mediocre model when you score it on new data...

Does sklearn support a cost matrix?

Is it possible to train classifiers in sklearn with a cost matrix with different costs for different mistakes? For example in a 2 class problem, the cost matrix would be a 2 by 2 square matrix. For example A_ij = cost of classifying i as j.
The main classifier I am using is a Random Forest.
Thanks.
The cost-sensitive framework you describe is not supported in scikit-learn, in any of the classifiers we have.
You could use a custom scoring function that accepts a matrix of per-class or per-instance costs. Here's an example of a scorer that calculates per-instance misclassification cost:
def financial_loss_scorer(y, y_pred, **kwargs):
import pandas as pd
totals = kwargs['totals']
# Create an indicator - 0 if correct, 1 otherwise
errors = pd.DataFrame((~(y == y_pred)).astype(int).rename('Result'))
# Use the product totals dataset to create results
results = errors.merge(totals, left_index=True, right_index=True, how='inner')
# Calculate per-prediction loss
loss = results.Result * results.SumNetAmount
return loss.sum()
The scorer becomes:
make_scorer(financial_loss_scorer, totals=totals_data, greater_is_better=False)
Where totals_data is a pandas.DataFrame with indexes that match the training set indexes.
You could always just look at your ROC curve. Each point on the ROC curve corresponds to a separate confusion matrix. So by specifying the confusion matrix you want, via choosing your classifier threshold implies some sort of cost weighting scheme. Then you just have to choose the confusion matrix that would imply the cost matrix you are looking for.
On the other hand if you really had your heart set on it, and really want to "train" an algorithm using a cost matrix, you could "sort of" do it in sklearn.
Although it is impossible to directly train an algorithm to be cost sensitive in sklearn you could use a cost matrix sort of setup to tune your hyper-parameters. I've done something similar to this using a genetic algorithm. It really doesn't do a great job, but it should give a modest boost to performance.
One way to circumvent this limitation is to use under or oversampling. E.g., if you are doing binary classification with an imbalanced dataset, and want to make errors on the minority class more costly, you could oversample it. You may want to have a look at imbalanced-learn which is a package from scikit-learn-contrib.
May not be direct to your question (since you are asking about Random Forest).
But for SVM (in Sklearn), you can utilize the class_weight parameter to specify the weights of different classes. Essentially, you will pass in a dictionary.
You might want to refer to this page to see an example of using class_weight.

K nearest neighbour validation performance

I am using knn to do classification for a telecom problem. I splitted my data into 70% training and 30% validation. While the knn classifier is able to catch over 80% in 2 deciles in training, its performance in validation sample is as good as random 45 degree line. I am surprised how does KNN work that the model performance in training and validation are so different.
Any pointers ?
Reasonable pointers are hardly possible without more details. The behavior of your KNN depends on several aspects:
The parameter K defining the neighbors. If it is set to K=1, for example, you will get no training error at all, this showing that the consideration of training-to-validation-error may not be justified.
The parameter K is often found using cross validation. I would suggest you to do this as well.
The distance metric. Which function are you using, are there different units, length scales, etc.?
The noise of your data, the size of your data ... -- there simply exist data sets which are hard to describe.
By the way: can you tell what kind of data you want to describe, and, if possible, also provide some examples or show some scatter plot (data and your result)?

Genetic algorithm for classification

I am trying to solve classification problem using Matlab GPTIPS framework.
I managed to build reasonable data representation and fitness function so far and got an average accuracy per class near 65%.
What I need now is some help with two difficulties:
My data is biased. Basically I am solving binary classification problem and only 20% of data belongs to class 1, while other 80% belong to class 0. I used accuracy of prediction as my fitness function at first, but it was really bad. The best I have now is
Fitness = 0.5*(PositivePredictiveValue + NegativePredictiveValue) - const*ComplexityOfSolution
Please, advize, how can I improve my function to make correction for data bias.
Second problem is overfitting. I divided my data into three parts: training (70%), testing (20%), validation (10%). I train each chromosome on training set, then evaluate it's fitness function on testing set. This routine allows me to reach fitness of 0.82 on my test data for the best individual in population. But same individual's result on validation data is only 60%.
I added validation check for best individual each time before new population is generated. Then I compare fitness on validation set with fitness on test set. If difference is more then 5%, then I increase penalty for solution complexity in my fitness function. But it didn't help.
I could also try to evaluate all individuals with validation set during each generation, and simply remove overfitted ones. But then I don't see any difference between my test and validation data. What else can be done here?
UPDATE:
For my second question I've found great article "Experiments on Controlling Overtting
in Genetic Programming" Along with some article authors' ideas on dealing with overfitting in GP it has impressive review with a lot of references to many different approaches to the issue. Now I have a lot of new ideas I can try for my problem.
Unfortunately, still cant' find anything on selecting a proper fitness function which will take into account unbalanced class proportions in my data.
65% accuracy is very bad when the baseline (classify everything as the class with most samples) would be 80%. You need to achieve at least baseline classification in order to have a better model than the naive one.
I would not penalize complexity. Rather limit the tree size (if possible). You could identify simpler models during the run, like storing a pareto front of models with quality and complexity as its two fitness values.
In HeuristicLab we have integrated GP based classification that can do these things. There are several options: You can choose to use MSE for classification or R2. In the latest trunk build there is also an evaluator to optimize accuracy directly (exactly speaking it optimizes the classification penalties). Optimizing MSE means it assigns each class a value (1, 2, 3,...) and tries to minimize mean squared error from that value. This may not seem optimal at first, but works. Optimizing accuracy directly may lead to faster overfitting. There is also a formula simplifier which allows you to prune and shrink your formula (and view the effects of that).
Also, does it need to be GP? Have you tried Random Forest Classification or Support Vector Machines as well? RF are pretty fast and work pretty well usually.