Can a linear model give better prediction accuracy than a random forest, decision tree, or neural network? - neural-network

I calculated the following parameters after applying several algorithms to a dataset from Kaggle.
In the above case, the linear model gives the best results.
Are the above results correct, and can a linear model actually give better results than the other three in any case?
Or am I missing something?

According to the AUC criterion, this classification is perfect (1 is the theoretical maximum). This means the classes are clearly separated in the data. In this case, it makes no sense to talk about differences between the methods' results. Another point is that you can tune the methods' parameters (you will likely get slightly different results), and another method may come out ahead. But the real results will be indistinguishable. Sophisticated methods were invented for sophisticated data. This is not such a case.

All models are wrong, some are useful. - George Box
In terms of classification, a model is effective as long as it can fit the classification boundaries well.
For binary classification, if your data is perfectly linearly separable, then a linear model will do the job - actually the "best" job, since more complicated models won't perform any better.
If your +'s and -'s are scattered so that they cannot be separated by a line (actually a hyperplane), then a linear model can be beaten by a decision tree, simply because decision trees can produce classification boundaries of more complex shape (axis-aligned boxes).
A random forest may then beat a single decision tree, because its classification boundary is more flexible.
However, as mentioned earlier, the linear model still has its place.
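A minimal sketch of the point above, using only the Python standard library. The four-point datasets, the small grid of candidate linear classifiers, and the hand-written two-level tree are assumptions chosen for illustration; they are not taken from the Kaggle dataset in the question.

```python
from itertools import product

def accuracy(predict, points, labels):
    """Fraction of points for which predict(x, y) matches the label."""
    hits = sum(predict(x, y) == label for (x, y), label in zip(points, labels))
    return hits / len(points)

def best_linear_accuracy(points, labels):
    """Best accuracy over a small, hand-picked grid of linear classifiers."""
    best = 0.0
    for w1, w2, b in product([-1, 0, 1], [-1, 0, 1], [-1.5, -0.5, 0.5, 1.5]):
        linear = lambda x, y: int(w1 * x + w2 * y + b > 0)
        best = max(best, accuracy(linear, points, labels))
    return best

def tiny_tree(x, y):
    """Hand-written depth-2 tree: split on x, then on y (axis-aligned boxes)."""
    if x < 0.5:
        return 1 if y >= 0.5 else 0
    return 0 if y >= 0.5 else 1

points = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [0, 0, 0, 1]   # linearly separable
xor_labels = [0, 1, 1, 0]   # not linearly separable

linear_on_and = best_linear_accuracy(points, and_labels)
linear_on_xor = best_linear_accuracy(points, xor_labels)
tree_on_xor = accuracy(tiny_tree, points, xor_labels)
print(linear_on_and, linear_on_xor, tree_on_xor)  # prints 1.0 0.75 1.0
```

On the separable labels the linear rule is already perfect, so a more flexible model has nothing left to gain. On XOR no line can get more than three of the four points right, while the box-shaped tree boundary fits all four.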

Related

How to do Hierarchical Heteroskedastic Sparse GPs in GPflow?

Is it possible to model a general trend from a population using GPflow and also have individual predictions, as in Hensman et al?
Specifically, I am trying to fit spatial data from a number of individuals from a clinical assessment. For each individual, I am dealing with approx. 20000 datapoints (a different number of recordings for each individual), which definitely restricts me to a sparse implementation. In addition, it seems that I need an input-dependent noise model, hence the heteroskedasticity.
I have fitted a hetero-sparse model as in this notebook example, but I am not sure how to scale it to perform the hierarchical learning. Any ideas would be welcome :)
https://github.com/mattramos/SparseHGP may be helpful. This repo gives GPflow 2 code for a sparse hierarchical model. Note that there are still some rough edges in the implementation that require an expensive for loop to be constructed.

Poorly calibrated probabilities but good classification in confusion matrix

I have an imbalanced data set. My goal is to balance sensitivity and specificity via the confusion matrix. I used glmnet in R with class weights. The model does well at balancing the sensitivity/specificity, but I looked at the calibration plot, and the probabilities are not well calibrated. I have read about calibrating probabilities, but I am wondering if it matters if my goal is to produce class predictions. If it does matter, I have not found a way to calibrate the probabilities when using caret::train().
This topic has been widely discussed, especially in some answers by Stephan Kolassa. I will try to summarize the main take-home messages for your specific question.
From a purely statistical point of view, your interest should be in producing a probability for each class for any new data instance. As you are dealing with unbalanced data, such probabilities can be small, which is not an issue as long as they are correct. Of course, some models can give you poor estimates of the class probabilities. In such cases, calibration lets you correct the probabilities obtained from a given model. Well calibrated means that whenever you estimate a probability p that a new observation belongs to the target class, then p is indeed its true probability of being in that class.
If you are able to obtain a good probability estimator, then balancing sensitivity and specificity is not part of the statistical part of your problem, but rather of the decision component. The final decision will likely need some kind of threshold. Depending on the costs of type I and II errors, the cost-optimal threshold might change; however, an optimal decision might also include more than one threshold.
Ultimately, you really have to be careful about the specific needs of the end user of your model, because this is what determines the best way of making decisions with it.
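As a rough illustration of the decision component, the sketch below derives a threshold from misclassification costs. It assumes calibrated probabilities, zero cost for correct decisions, and made-up cost values; the function names are hypothetical and not part of glmnet or caret. Predicting positive costs (1 - p) * cost_fp in expectation and predicting negative costs p * cost_fn, so predicting positive is cheaper once p exceeds cost_fp / (cost_fp + cost_fn).

```python
def cost_optimal_threshold(cost_fp, cost_fn):
    """Probability above which predicting the positive class has lower expected cost.

    Assumes calibrated probabilities and zero cost for correct decisions.
    """
    return cost_fp / (cost_fp + cost_fn)

def decide(p_positive, cost_fp, cost_fn):
    """Return 1 (positive) or 0 (negative) for a calibrated probability."""
    return int(p_positive > cost_optimal_threshold(cost_fp, cost_fn))

# Hypothetical costs: a missed positive is nine times as costly as a false alarm.
threshold = cost_optimal_threshold(cost_fp=1.0, cost_fn=9.0)
decisions = [decide(p, cost_fp=1.0, cost_fn=9.0) for p in (0.05, 0.2, 0.6)]
print(threshold, decisions)
```

The threshold here only makes sense if p really is a probability, which is why poor calibration can matter even when the end product is a class label.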

Shouldn't we take average of n models in cross validation in linear regression?

I have a question regarding cross validation in Linear regression model.
From my understanding, in cross validation, we split the data into (say) 10 folds, train on 9 folds, and use the remaining fold for testing. We repeat this process until we have tested all of the folds, so that every fold is tested exactly once.
When we are training the model on 9 folds, should we not get a different model (maybe slightly different from the model that we created when using the whole dataset)? I know that we take an average of all the "n" performances.
But, what about the model? Shouldn't the resulting model also be taken as the average of all the "n" models? I see that the resulting model is same as the model which we created using whole of the dataset before cross-validation. If we are considering the overall model even after cross-validation (and not taking avg of all the models), then what's the point of calculating average performance from n different models (because they are trained from different folds of data and are supposed to be different, right?)
I apologize if my question is not clear or too funny.
Thanks for reading, though!
I think that there is some confusion in some of the answers proposed because of the use of the word "model" in the question asked. If I am guessing correctly, you are referring to the fact that in K-fold cross-validation we learn K different predictors (or decision functions), which you call "model" (this is a bad idea because in machine learning we also do model selection, which is choosing between families of predictors, and this is something which can be done using cross-validation). Cross-validation is typically used for hyperparameter selection or to choose between different algorithms or different families of predictors. Once these are chosen, the most common approach is to relearn a predictor with the selected hyperparameter and algorithm from all the data.
However, if the loss function which is optimized is convex with respect to the predictor, then it is possible to simply average the different predictors obtained from each fold.
This is because for a convex risk, the risk of the average of the predictors is never larger than the average of the individual risks (Jensen's inequality).
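The inequality can be checked numerically. The sketch below uses squared error, which is convex in the slope and intercept of a linear predictor; the three per-fold predictors and the data points are invented for illustration only.

```python
def risk(predictor, data):
    """Mean squared error of a linear predictor (slope, intercept) on data."""
    slope, intercept = predictor
    return sum((slope * x + intercept - y) ** 2 for x, y in data) / len(data)

# Hypothetical per-fold predictors (slope, intercept) and evaluation data.
fold_predictors = [(1.8, 1.3), (2.1, 0.9), (2.3, 0.7)]
data = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8)]

k = len(fold_predictors)
averaged = (sum(s for s, _ in fold_predictors) / k,
            sum(i for _, i in fold_predictors) / k)

risk_of_average = risk(averaged, data)
average_of_risks = sum(risk(p, data) for p in fold_predictors) / k
print(risk_of_average <= average_of_risks)  # prints True
```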
The PROs and CONs of averaging (vs retraining) are as follows
PROs: (1) In each fold, the evaluation that you made on the held out set gives you an unbiased estimate of the risk for those very predictors that you have obtained, and for these estimates the only source of uncertainty is due to the estimate of the empirical risk (the average of the loss function) on the held out data.
This should be contrasted with the logic used when you retrain, which is that the cross-validation risk is an estimate of the "expected value of the risk of a given learning algorithm" (and not of a given predictor), so that if you relearn from data from the same distribution, you should have on average the same level of performance. But note that this is on average, and when retraining from the whole data this could go up or down. In other words, there is an additional source of uncertainty due to the fact that you will retrain.
(2) The hyperparameters have been selected exactly for the number of datapoints that you used in each fold to learn. If you relearn from the whole dataset, the optimal value of the hyperparameter is in theory and in practice not the same anymore, and so in the idea of retraining, you really cross your fingers and hope that the hyperparameters that you have chosen are still fine for your larger dataset.
If you used leave-one-out, there is obviously no concern there, and if the number of data points is large, with 10-fold CV you should be fine. But if you are learning from 25 data points with 5-fold CV, the hyperparameters for 20 points are not really the same as for 25 points...
CONs: Well, intuitively you don't benefit from training with all the data at once
There is unfortunately very little thorough theory on this, but the following two papers, especially the second, consider precisely the averaging or aggregation of the predictors from K-fold CV.
Jung, Y. (2016). Efficient Tuning Parameter Selection by Cross-Validated Score in High Dimensional Models. International Journal of Mathematical and Computational Sciences, 10(1), 19-25.
Maillard, G., Arlot, S., & Lerasle, M. (2019). Aggregated Hold-Out. arXiv preprint arXiv:1909.04890.
The answer is simple: you use the process of (repeated) cross validation (CV) to obtain a relatively stable performance estimate for a model instead of improving it.
Think of trying out different model types and parametrizations which are differently well suited for your problem. Using CV you obtain many different estimates on how each model type and parametrization would perform on unseen data. From those results you usually choose one well suited model type + parametrization which you will use, then train it again on all (training) data. The reason for doing this many times (different partitions with repeats, each using different partition splits) is to get a stable estimation of the performance - which will enable you to e.g. look at the mean/median performance and its spread (would give you information about how well the model usually performs and how likely it is to be lucky/unlucky and get better/worse results instead).
Two more things:
Usually, using CV will improve your results in the end - simply because you take a model that is better suited for the job.
You mentioned taking the "average" model. This actually exists as "model averaging", where you average the results of multiple, possibly differently trained models to obtain a single result. It's one way to use an ensemble of models instead of a single one. But for those, too, you want to use CV in the end to choose a reasonable model.
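The workflow described above (estimate performance with CV, then retrain once on all the data) can be sketched as follows. This is a standard-library illustration with a one-feature least-squares fit; the synthetic data and the interleaved fold assignment are assumptions made for the example.

```python
def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mse(model, points):
    """Mean squared error of a (slope, intercept) model on points."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in points) / len(points)

def k_fold_scores(points, k):
    """Held-out MSE of each of the k fold models; the models are then discarded."""
    scores = []
    for fold in range(k):
        test = points[fold::k]  # every k-th point is held out
        train = [p for i, p in enumerate(points) if i % k != fold]
        scores.append(mse(fit_line(train), test))
    return scores

# Synthetic data: y = 2x + 1 plus a small deterministic wiggle (illustration only).
data = [(float(x), 2.0 * x + 1.0 + 0.3 * (-1) ** x) for x in range(20)]

fold_scores = k_fold_scores(data, k=5)
cv_estimate = sum(fold_scores) / len(fold_scores)  # the performance estimate
final_model = fit_line(data)                       # the model that is actually kept
print(round(cv_estimate, 3), final_model)
```

The five fold models exist only to produce `fold_scores`. The model that is kept is the one refitted on all 20 points.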
I like your thinking. I think you have just accidentally discovered Random Forest:
https://en.wikipedia.org/wiki/Random_forest
Without repeated cv your seemingly best model is likely to be only a mediocre model when you score it on new data...

Can KNN be better than other classifiers?

As is known, there are classifiers that have a training or learning step, like SVM or Random Forest. KNN, on the other hand, does not have one.
Can KNN be better than these classifiers?
If no, why?
If yes, when, how and why?
The main answer is yes, it can, due to the implications of the no free lunch theorem. The FLT can be loosely stated (in terms of classification) as:
There is no universal classifier which is consistently better than the others at every task.
It can also be (not very strictly) inverted:
For each (well-defined) classifier there exists a dataset where it is the best one.
And in particular, kNN is a well-defined classifier; it is consistent with any distribution, which means that given infinitely many training points it converges to the optimal, Bayesian separator.
So can it be better than SVM or RF? Obviously! When? There is no clear answer. First of all, in supervised learning you often get just one training set and try to fit the best model. In such a scenario any model can be the best one. When statisticians/theoretical ML researchers try to answer whether one model is better than another, we actually try to test "what would happen if we had infinitely many training sets", so we look at the expected value of the behaviour of the classifiers. In such a setting, we often show that SVM/RF is better than KNN. But it does not mean that they are always better. It only means that for a randomly selected dataset you should expect KNN to work worse, but this is only a probability. And just as you can always win a lottery (no matter the odds!), you can also always win with KNN (just to be clear - KNN has a bigger chance of being a good model than you have of winning a lottery :-)).
What are particular examples? Let us for example consider a rotated XOR problem.
If the true decision boundaries are as above and you only have these four points, 1NN will obviously be much better than SVM (with dot, poly or RBF kernel) or RF. This should also remain true as you include more and more training points.
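A small sketch of 1NN on such a four-point problem. The figure from the original answer is not available here, so the layout below (one class on the x axis, the other on the y axis) is an assumption about what a rotated XOR looks like; the query points are likewise invented.

```python
def nearest_neighbor_label(query, training):
    """1-NN: return the label of the training point closest to query (Euclidean)."""
    def squared_distance(point):
        return (point[0] - query[0]) ** 2 + (point[1] - query[1]) ** 2
    _, label = min(training, key=lambda item: squared_distance(item[0]))
    return label

# Assumed layout of a rotated XOR: one class on the x axis, the other on the y axis.
training = [((1, 0), "+"), ((-1, 0), "+"), ((0, 1), "-"), ((0, -1), "-")]

queries = [(0.9, 0.2), (-0.7, -0.1), (0.1, 0.8), (-0.2, -0.6)]
predictions = [nearest_neighbor_label(q, training) for q in queries]
print(predictions)  # prints ['+', '+', '-', '-']
```

With this layout the 1NN boundaries are the two diagonals |x| = |y|, so the four training points alone are enough to reproduce a boundary of that shape.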
"In general kNN would not be expected to exceed SVM or RF. When kNN does, that says something very interesting about the training data. If many doublets are present i the data set, a nearest neighbor algorithm works very well."
I heard an argument along these lines from Claudia Perlich in this podcast:
http://www.thetalkingmachines.com/blog/2015/6/18/working-with-data-and-machine-learning-in-advertizing
My intuitive understanding of why RF and SVM are better than kNN in general: all algorithms basically assume some local similarity, such that samples that are very alike get classified alike. kNN can only choose the most similar samples by distance (or some other global kernel). So the samples which can influence a kNN prediction lie within a hypersphere for the Euclidean distance kernel. RF and SVM can learn other definitions of locality, which can stretch far along some features and short along others. Also, the propagation of locality can take many learned shapes, and these shapes can differ throughout the feature space.

Optimization of Neural Network input data

I'm trying to build an app that detects which images on a webpage are advertisements. Once I detect those, I'll prevent them from being displayed on the client side.
Basically I'm using the back-propagation algorithm to train the neural network using the dataset given here: http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements.
But in that dataset the number of attributes is very high. In fact, one of the mentors of the project told me that if you train the neural network with that many attributes, it will take a long time to train. So is there a way to optimize the input dataset? Or do I just have to use that many attributes?
1558 is actually a modest number of features/attributes. The number of instances (3279) is also small. The problem is not on the dataset side, but on the training algorithm side.
An ANN is slow to train; I'd suggest using logistic regression or SVM. Both of them are very fast to train. SVM in particular has a lot of fast algorithms.
In this dataset, you are actually analyzing text, not images. I think a linear classifier, i.e. logistic regression or SVM, is better for your job.
If this is for production and you cannot use open source code, logistic regression is very easy to implement compared to a good ANN or SVM.
If you decide to use logistic regression or SVM, I can further recommend some articles or source code for you to refer to.
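To give a sense of how little code a basic logistic regression needs, here is a minimal sketch using batch gradient descent and only the standard library. The learning rate, the number of epochs, and the toy data are assumptions for illustration; it is not tuned for, or tested on, the advertisement dataset.

```python
import math

def train_logistic_regression(samples, labels, learning_rate=0.1, epochs=5000):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    n_features = len(samples[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(samples, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
            grad_b += error
        weights = [w - learning_rate * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(samples)
    return weights, bias

def predict(weights, bias, x):
    """Predict class 1 when the linear score is positive (probability above 0.5)."""
    return int(sum(w * xi for w, xi in zip(weights, x)) + bias > 0)

# Toy, linearly separable data (hypothetical, not the advertisement dataset).
samples = [(0, 0), (0, 1), (1, 0), (1, 1), (3, 3), (3, 4), (4, 3), (4, 4)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

weights, bias = train_logistic_regression(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
print(predictions == labels)
```

A production version would add regularization and a numerically safer sigmoid, but the core training loop stays about this size.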
If you're actually using a backpropagation network with 1558 input nodes and only 3279 samples, then the training time is the least of your problems: Even if you have a very small network with only one hidden layer containing 10 neurons, you have 1558*10 weights between the input layer and the hidden layer. How can you expect to get a good estimate for 15580 degrees of freedom from only 3279 samples? (And that simple calculation doesn't even take the "curse of dimensionality" into account)
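The weight count in the paragraph above can be reproduced with a few lines. The single output neuron and the bias terms in the whole-network count are assumptions, since the original only counts the input-to-hidden weights.

```python
def count_parameters(layer_sizes, include_biases=True):
    """Number of trainable parameters in a fully connected feed-forward network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # weights between consecutive layers
        if include_biases:
            total += n_out     # one bias per receiving neuron
    return total

n_features, n_hidden, n_samples = 1558, 10, 3279
input_to_hidden = count_parameters([n_features, n_hidden], include_biases=False)
whole_network = count_parameters([n_features, n_hidden, 1])  # assumed single output
print(input_to_hidden, whole_network, round(n_samples / whole_network, 2))
# prints 15580 15601 0.21
```

That works out to roughly one training sample for every five parameters.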
You have to analyze your data to find out how to optimize it. Try to understand your input data: Which (tuples of) features are (jointly) statistically significant? (Use standard statistical methods for this.) Are some features redundant? (Principal component analysis is a good starting point for this.) Don't expect the artificial neural network to do that work for you.
Also: remember Duda & Hart's famous "no-free-lunch theorem": No classification algorithm works for every problem. And for any classification algorithm X, there is a problem where flipping a coin leads to better results than X. If you take this into account, deciding what algorithm to use before analyzing your data might not be a smart idea. You might well have picked the algorithm that actually performs worse than blind guessing on your specific problem! (By the way: Duda, Hart & Stork's book about pattern classification is a great starting point to learn about this, if you haven't read it yet.)
Apply a separate ANN to each category of features,
for example:
457 inputs 1 output for url terms ( ANN1 )
495 inputs 1 output for origurl ( ANN2 )
...
Then train all of them,
and use another main ANN to join the results.