Is Cross Validation enough to ensure that there is no Overfitting in a classification algorithm? - matlab

I have a data set with 45 observations for one class and 55 observations for another class. Moreover, I am using 4 different features which were previously chosen by using a Feature Selection filter though the results of this procedure were somewhat strange..
On the other hand, I am using cross validation and getting good accuracy results (75% to 85%) from different classifiers since I'm using the classificationLearner on Matlab. Would this ensure that there is no overfitting? Or there might still be a chance for this? How can I assure that there is no overfitting?

That really depends on your training data set that you have available. If data that is available to you isn't representative enough, you will not get a good model regardless of the methods you use for training and validation.
Having that in mind, if you are sure your data is representative (has the same distribution of values for any subset of "important" attributes as the the global set of all data) than cross validation is good enough to rely on.

Related

Shouldn't we take average of n models in cross validation in linear regression?

I have a question regarding cross validation in Linear regression model.
From my understanding, in cross validation, we split the data into (say) 10 folds and train the data from 9 folds and the remaining folds we use for testing. We repeat this process until we test all of the folds, so that every folds are tested exactly once.
When we are training the model from 9 folds, should we not get a different model (may be slightly different from the model that we have created when using the whole dataset)? I know that we take an average of all the "n" performances.
But, what about the model? Shouldn't the resulting model also be taken as the average of all the "n" models? I see that the resulting model is same as the model which we created using whole of the dataset before cross-validation. If we are considering the overall model even after cross-validation (and not taking avg of all the models), then what's the point of calculating average performance from n different models (because they are trained from different folds of data and are supposed to be different, right?)
I apologize if my question is not clear or too funny.
Thanks for reading, though!
I think that there is some confusion in some of the answers proposed because of the use of the word "model" in the question asked. If I am guessing correctly, you are referring to the fact that in K-fold cross-validation we learn K-different predictors (or decision functions), which you call "model" (this is a bad idea because in machine learning we also do model selection which is choosing between families of predictors and this is something which can be done using cross-validation). Cross-validation is typically used for hyperparameter selection or to choose between different algorithms or different families of predictors. Once these chosen, the most common approach is to relearn a predictor with the selected hyperparameter and algorithm from all the data.
However, if the loss function which is optimized is convex with respect to the predictor, than it is possible to simply average the different predictors obtained from each fold.
This is because for a convex risk, the risk of the average of the predictor is always smaller than the average of the individual risks.
The PROs and CONs of averaging (vs retraining) are as follows
PROs: (1) In each fold, the evaluation that you made on the held out set gives you an unbiased estimate of the risk for those very predictors that you have obtained, and for these estimates the only source of uncertainty is due to the estimate of the empirical risk (the average of the loss function) on the held out data.
This should be contrasted with the logic which is used when you are retraining and which is that the cross-validation risk is an estimate of the "expected value of the risk of a given learning algorithm" (and not of a given predictor) so that if you relearn from data from the same distribution, you should have in average the same level of performance. But note that this is in average and when retraining from the whole data this could go up or down. In other words, there is an additional source of uncertainty due to the fact that you will retrain.
(2) The hyperparameters have been selected exactly for the number of datapoints that you used in each fold to learn. If you relearn from the whole dataset, the optimal value of the hyperparameter is in theory and in practice not the same anymore, and so in the idea of retraining, you really cross your fingers and hope that the hyperparameters that you have chosen are still fine for your larger dataset.
If you used leave-one-out, there is obviously no concern there, and if the number of data point is large with 10 fold-CV you should be fine. But if you are learning from 25 data points with 5 fold CV, the hyperparameters for 20 points are not really the same as for 25 points...
CONs: Well, intuitively you don't benefit from training with all the data at once
There are unfortunately very little thorough theory on this but the following two papers especially the second paper consider precisely the averaging or aggregation of the predictors from K-fold CV.
Jung, Y. (2016). Efficient Tuning Parameter Selection by Cross-Validated Score in High Dimensional Models. International Journal of Mathematical and Computational Sciences, 10(1), 19-25.
Maillard, G., Arlot, S., & Lerasle, M. (2019). Aggregated Hold-Out. arXiv preprint arXiv:1909.04890.
The answer is simple: you use the process of (repeated) cross validation (CV) to obtain a relatively stable performance estimate for a model instead of improving it.
Think of trying out different model types and parametrizations which are differently well suited for your problem. Using CV you obtain many different estimates on how each model type and parametrization would perform on unseen data. From those results you usually choose one well suited model type + parametrization which you will use, then train it again on all (training) data. The reason for doing this many times (different partitions with repeats, each using different partition splits) is to get a stable estimation of the performance - which will enable you to e.g. look at the mean/median performance and its spread (would give you information about how well the model usually performs and how likely it is to be lucky/unlucky and get better/worse results instead).
Two more things:
Usually, using CV will improve your results in the end - simply because you take a model that is better suited for the job.
You mentioned taking the "average" model. This actually exists as "model averaging", where you average the results of multiple, possibly differently trained models to obtain a single result. Its one way to use an ensemble of models instead of a single one. But also for those you want to use CV in the end for choosing reasonable model.
I like your thinking. I think you have just accidentally discovered Random Forest:
https://en.wikipedia.org/wiki/Random_forest
Without repeated cv your seemingly best model is likely to be only a mediocre model when you score it on new data...

K means Analysis on KDD Cup Dataset 99

What kind of knowledge/ inference can be made from k means clustering analysis of KDDcup99 dataset?
We ploted some graphs using matlab they looks like this:::
Experiment 1: Plot of dst_host_count vs serror_rate
Experiment 2: Plot of srv_count vs srv_serror_rate
Experiment 3: Plot of count vs serror_rate
I just extracted saome features from kddcup data set and ploted them.....
The main problem am facing is due to lack of domain knowledge I cant determine what inference can be drawn form this graphs another one is if I have chosen wrong axis then what should be the correct chosen feature?
I got very less time to complete this thing so I don't understand the backgrounds very well
Any help telling the interpretation of these graphs would be helpful
What kind of unsupervised learning can be made using this data and plots?
Just to give you some domain knowledge: the KDD cup data set contains information about different aspects of network connections. Each sample contains 'connection duration', 'protocol used', 'source/destination byte size' and many other features that describes one connection connection. Now, some of these connections are malicious. The malicious samples have their unique 'fingerprint' (unique combination of different feature values) that separates them from good ones.
What kind of knowledge/ inference can be made from k means clustering analysis of KDDcup99 dataset?
You can try k-means clustering to initially cluster the normal and bad connections. Also, the bad connections falls into 4 main categories themselves. So, you can try k = 5, where one cluster will capture the good ones and other 4 the 4 malicious ones. Look at the first section of the tasks page for details.
You can also check if some dimensions in your data set have high correlation. If so, then you can use something like PCA to reduce some dimensions. Look at the full list of features. After PCA, your data will have a simpler representation (with less number of dimensions) and might give better performance.
What should be the correct chosen feature?
This is hard to tell. Currently data is very high dimensional, so I don't think trying to visualize 2/3 of the dimensions in a graph will give you a good heuristics on what dimensions to choose. I would suggest
Use all the dimensions for for training and testing the model. This will give you a measure of the best performance.
Then try removing one dimension at a time to see how much the performance is affected. For example, you remove the dimension 'srv_serror_rate' from your data and the model performance comes out to be almost the same. Then you know this dimension is not giving you any important info about the problem at hand.
Repeat step two until you can't find any dimension that can be removed without hurting performance.

why we need cross validation in multiSVM method for image classification?

I am new to image classification, currently working on SVM(support Vector Machine) method for classifying four groups of images by multisvm function, my algorithm every time the training and testing data are randomly selected and the performance is varies at every time. Some one suggested to do cross validation i did not understand why we need cross validation and what is the main purpose of this? . My actual data set consist training matrix size 28×40000 and testing matrix size 17×40000. how to do cross validation by this data set help me. thanks in advance .
Cross validation is used to select your model. The out-of-sample error can be estimated from your validation error. As a result, you would like to select the model with the least validation error. Here the model refers to the features you want to use, and of more importance, the gamma and C in your SVM. After cross validation, you will use the selected gamma and C with the least average validation error to train the whole training data.
You may also need to estimate the performance of your features and parameters to avoid both high-bias and high-variance. Whether your model suffers underfitting or overfitting can be observed from both in-sample-error and validation error.
Ideally 10-fold is often used for cross validation.
I'm not familiar with multiSVM but you may want to check out libSVM, it is a popular, free SVM library with support for a number of different programming languages.
Here they describe cross validation briefly. It is a way to avoid over-fitting the model by breaking up the training data into sub groups. In this way you can find a model (defined by a set of parameters) which fits both sub groups optimally.
For example, in the following picture they plot the validation accuracy contours for parameterized gamma and C values which are used to define the model. From this contour plot you can tell that the heuristically optimal values (from those tested) are those that give an accuracy closer to 84 instead of 81.
Refer to this link for more detailed information on cross-validation.
You always need to cross-validate your experiments in order to guarantee a correct scientific approach. For instance, if you don't cross-validate, the results you read (such as accuracy) might be highly biased by your test set. In an extreme case, your training step might have been very weak (in terms of fitting data) and your test step might have been very good. This applies to ALL machine learning and optimization experiments, not only SVMs.
To avoid such problems just divide your initial dataset in two (for instance), then train in the first set and test in the second, and repeat the process invesely, training in the second and testing in the first. This will guarantee that any biases to the data are visible to you. As someone suggested, you can perform this with even further division: 10-fold cross-validation, means dividing your data set in 10 parts, then training in 9 and testing in 1, then repeating the process until you have tested in all parts.

Genetic algorithm for classification

I am trying to solve classification problem using Matlab GPTIPS framework.
I managed to build reasonable data representation and fitness function so far and got an average accuracy per class near 65%.
What I need now is some help with two difficulties:
My data is biased. Basically I am solving binary classification problem and only 20% of data belongs to class 1, while other 80% belong to class 0. I used accuracy of prediction as my fitness function at first, but it was really bad. The best I have now is
Fitness = 0.5*(PositivePredictiveValue + NegativePredictiveValue) - const*ComplexityOfSolution
Please, advize, how can I improve my function to make correction for data bias.
Second problem is overfitting. I divided my data into three parts: training (70%), testing (20%), validation (10%). I train each chromosome on training set, then evaluate it's fitness function on testing set. This routine allows me to reach fitness of 0.82 on my test data for the best individual in population. But same individual's result on validation data is only 60%.
I added validation check for best individual each time before new population is generated. Then I compare fitness on validation set with fitness on test set. If difference is more then 5%, then I increase penalty for solution complexity in my fitness function. But it didn't help.
I could also try to evaluate all individuals with validation set during each generation, and simply remove overfitted ones. But then I don't see any difference between my test and validation data. What else can be done here?
UPDATE:
For my second question I've found great article "Experiments on Controlling Overtting
in Genetic Programming" Along with some article authors' ideas on dealing with overfitting in GP it has impressive review with a lot of references to many different approaches to the issue. Now I have a lot of new ideas I can try for my problem.
Unfortunately, still cant' find anything on selecting a proper fitness function which will take into account unbalanced class proportions in my data.
65% accuracy is very bad when the baseline (classify everything as the class with most samples) would be 80%. You need to achieve at least baseline classification in order to have a better model than the naive one.
I would not penalize complexity. Rather limit the tree size (if possible). You could identify simpler models during the run, like storing a pareto front of models with quality and complexity as its two fitness values.
In HeuristicLab we have integrated GP based classification that can do these things. There are several options: You can choose to use MSE for classification or R2. In the latest trunk build there is also an evaluator to optimize accuracy directly (exactly speaking it optimizes the classification penalties). Optimizing MSE means it assigns each class a value (1, 2, 3,...) and tries to minimize mean squared error from that value. This may not seem optimal at first, but works. Optimizing accuracy directly may lead to faster overfitting. There is also a formula simplifier which allows you to prune and shrink your formula (and view the effects of that).
Also, does it need to be GP? Have you tried Random Forest Classification or Support Vector Machines as well? RF are pretty fast and work pretty well usually.

When to start using the selection set in a Back Propagation Neural Network?

Beginner on ANNs:
I am implementing a back propagation neural network to predict the price of gold. I know that I have to split my data into training data, selection data and test data.
However I unsure How to go on about using these sets of data. At first I was training the data network with my training set then after it's trained I am getting a number of inputs to my network from the test set and comparing the output.
I'm not sure if I'm doing this right and were does the selection set come in ?
thanks in advance!
The general idea is:
Train the network for a little while on the training set.
Evaluate the network on a second set, often called the validation set. Probably what you're calling the selection set.
Train the network a little more on the training set.
Evaluate the new network on the selection set again.
Which did better, the old network or the new network? If the new network is better, we're still getting some use out of training, so goto 3. If the new network is worse, more training will probably only hurt. Use the previously version of the network, since it did better.
In this way, you can tell when to stop training.
One easy modification to this is to always keep track of the best network seen so far, and we only stop training when we see some number (say, three) of training attempts that do worse in a row.
The third set, the test set, is necessary because the selection set is, if indirectly, involved in the training process. Final evaluation must be done on data that was not used at all during training.
This sort of thing is sufficient for simple experiments, but in general you'll want to use cross-validation to get a better idea of your system's performance.
I wanted to leave a comment just to say that validation sets are a good place for model-dependent hyper-parameter tuning, but I'm new here and hence lack the reputation points to do so. To make this more worthy of a separate posting, I've included an outline of my own train-validate-test process. In practice, my workflow is as follows:
Identify, collect, and clean data. Try to limit complaining during data munging process.
Split data into three sets: training, validation, test.
Establish two "base" models for evaluating more complex models built later on in the process. The first of these models is typically a basic linear/logistic regression using all possible features. The second models uses only the most obviously informative (initial identification of informative features depends on use case, typically involves combination of domain knowledge, basic clustering, simple correlation).
Begin more empirical feature selection (i.e. unsupervised NN, but usually random forest) and prototype a broad range of models using the training set.
Eliminate poorly performing models as well as uninformative features
Compare performance of remaining models against each other and the "base" models, using a modified version of the training set (same data, but sans uninformative features). Toss under-performing models.
Using the validation set, tune the appropriate hyper-parameters for each of the models (either by hand or gridsearch). Further reduce the number of models in consideration, ideally to just 2-3 (excluding base models).
Finally, evaluate model performance (with optimized hyper-parameters) on the test set. Again, compare models among themselves and against the base models. Make final model choice based on a problem-specific appropriate combination of computational complexity/cost, ease of interpretation/transparency/"explainability", and improvement over and/or performance vs base models.