I'm working on a SHAP implementation and trying to show that KernelSHAP is trustworthy. My main concern is that, since KernelSHAP fits a linear model to estimate the SHAP values, there will be some uncertainty, because it is an approximation.
I looked for some metrics, but what the library provides in https://github.com/slundberg/shap/blob/master/shap/benchmark/metrics.py are benchmark metrics dedicated to the existing methods. I tried comparing against the SHAP values of TreeExplainer, since it is an exact calculation, on the same data, and I expected at least the most important features to be the same as with KernelSHAP, but I'm not getting comparable results.
These are the SHAP values for the Boston housing dataset with TreeExplainer (model: XGBRegressor):
These are the SHAP values for the Boston housing dataset with KernelSHAP (model: SVR):
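Note that TreeExplainer here explains the XGBRegressor while KernelSHAP explains the SVR, so the attributions describe two different models and some disagreement is expected even if both computations were exact. One way to quantify how much the feature rankings agree is the Spearman rank correlation of the per-feature mean |SHAP| values. A minimal numpy sketch, where the two SHAP matrices are made-up stand-ins for the explainer outputs:

```python
import numpy as np

def mean_abs_importance(shap_values):
    """Per-feature importance: mean |SHAP value| over samples."""
    return np.abs(shap_values).mean(axis=0)

def spearman_rank_corr(a, b):
    """Spearman correlation = Pearson correlation of the rank vectors.
    (argsort of argsort gives ranks; no tie handling, which is fine
    for continuous importance scores.)"""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return np.corrcoef(ra, rb)[0, 1]

# Hypothetical SHAP matrices (n_samples x n_features) standing in for
# the TreeExplainer and KernelSHAP outputs on the same data.
rng = np.random.default_rng(0)
tree_shap = rng.normal(size=(100, 13))
kernel_shap = tree_shap + rng.normal(scale=0.1, size=(100, 13))

rho = spearman_rank_corr(mean_abs_importance(tree_shap),
                         mean_abs_importance(kernel_shap))
```

A rho near 1 means the two explainers rank features similarly, even if the raw SHAP magnitudes differ.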
Let's say I have 2 images of a car, but one is from the camera and the other is a depth image generated from a LiDAR point-cloud transformation.
I used the same CNN model on both images to predict the class (the output is a softmax, as there are other classes in my dataset: pedestrian, van, truck, cyclist, etc.).
How can I combine the two probability vectors in order to predict the class taking both predictions into account?
I tried methods like average, maximum, minimum, and naive product applied to each score for each class, but I don't know whether they work.
Thank you in advance.
EDIT :
Following this article : https://www.researchgate.net/publication/327744903_Multimodal_CNN_Pedestrian_Classification_a_Study_on_Combining_LIDAR_and_Camera_Data
We can see that they use the maximum or minimum rule to combine the outputs of the classifiers. So does it work for a multiclass problem?
As per MSalter's comment, the softmax output isn't a true probability vector. But if we choose to regard it as one, we can simply take the average of the two predictions. This is equivalent to having two people each classify a random sample of objects from a big pool and, assuming they both counted an equal number, estimating the distribution of objects in the pool by combining their observations. The sum of the 'probabilities' of the classes will still equal 1.
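The fusion rules mentioned in the question can be sketched in a few lines of numpy; the class list and scores below are made up for illustration. The average already sums to 1, while max, min, and product need renormalization:

```python
import numpy as np

def combine(p_cam, p_lidar, rule="average"):
    """Combine two per-class score vectors with a simple fusion rule.
    For 'max', 'min' and 'product' the result is renormalized so it
    sums to 1 again; the plain average already does."""
    p_cam, p_lidar = np.asarray(p_cam, float), np.asarray(p_lidar, float)
    if rule == "average":
        fused = (p_cam + p_lidar) / 2.0
    elif rule == "max":
        fused = np.maximum(p_cam, p_lidar)
    elif rule == "min":
        fused = np.minimum(p_cam, p_lidar)
    elif rule == "product":
        fused = p_cam * p_lidar
    else:
        raise ValueError(rule)
    return fused / fused.sum()

# classes: car, pedestrian, van, truck, cyclist (made-up softmax scores)
cam   = [0.70, 0.10, 0.10, 0.05, 0.05]
lidar = [0.50, 0.05, 0.30, 0.10, 0.05]
fused = combine(cam, lidar, "average")
pred = int(np.argmax(fused))  # index of the fused class, 0 -> "car"
```

All four rules generalize directly to any number of classes; which works best is an empirical question for your dataset.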
I have a question regarding cross validation in a linear regression model.
From my understanding, in cross validation we split the data into (say) 10 folds, train on 9 folds, and use the remaining fold for testing. We repeat this process until every fold has been tested exactly once.
When we train the model on 9 folds, shouldn't we get a different model (maybe slightly different from the model created using the whole dataset)? I know that we take the average of all n performances.
But what about the model? Shouldn't the resulting model also be the average of all n models? I see that the resulting model is the same as the one created from the whole dataset before cross-validation. If we keep the overall model even after cross-validation (and don't average the models), then what's the point of computing the average performance of n different models (since they are trained on different folds of data and are supposed to differ, right)?
I apologize if my question is not clear or sounds silly.
Thanks for reading, though!
I think there is some confusion in some of the proposed answers because of the use of the word "model" in the question. If I am guessing correctly, you are referring to the fact that in K-fold cross-validation we learn K different predictors (or decision functions), which you call "models". (This is an unfortunate term, because in machine learning "model selection" means choosing between families of predictors, and that choice can itself be made using cross-validation.) Cross-validation is typically used for hyperparameter selection, or to choose between different algorithms or families of predictors. Once these are chosen, the most common approach is to relearn a predictor from all the data with the selected hyperparameters and algorithm.
However, if the loss function being optimized is convex with respect to the predictor, then it is possible to simply average the different predictors obtained from each fold.
This is because, for a convex risk, the risk of the average of the predictors is always smaller than the average of the individual risks.
The PROs and CONs of averaging (vs. retraining) are as follows:
PROs: (1) In each fold, the evaluation on the held-out set gives you an unbiased estimate of the risk of the very predictor you obtained; for these estimates, the only source of uncertainty is the estimate of the empirical risk (the average of the loss function) on the held-out data.
This should be contrasted with the logic used when retraining: there, the cross-validation risk is an estimate of the "expected risk of a given learning algorithm" (not of a given predictor), so that if you relearn from data from the same distribution, you should on average reach the same level of performance. But that holds only on average, and when retraining on the whole data the performance could go up or down. In other words, there is an additional source of uncertainty due to the fact that you retrain.
(2) The hyperparameters were selected for exactly the number of data points used for learning in each fold. If you relearn from the whole dataset, the optimal hyperparameter value is, in theory and in practice, no longer the same; so when retraining, you are really crossing your fingers and hoping that the hyperparameters you chose are still fine for the larger dataset.
If you used leave-one-out, there is obviously no concern there, and if the number of data points is large, with 10-fold CV you should be fine. But if you are learning from 25 data points with 5-fold CV, the best hyperparameters for 20 points are not really the same as for 25 points...
CONs: Well, intuitively you don't benefit from training on all the data at once.
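For linear predictors the averaging option is especially simple, because averaging the K predictors amounts to averaging their coefficient vectors. A minimal numpy sketch, using ordinary least squares on synthetic data as a stand-in for any convex learner, with the retraining alternative shown alongside:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, K = 100, 3, 5
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=n)

# K-fold split; fit an OLS predictor on each training fold.
folds = np.array_split(rng.permutation(n), K)
coefs = []
for k in range(K):
    train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
    w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    coefs.append(w)

# Option A (averaging): for linear predictors, averaging the K
# predictors is just averaging their coefficient vectors.
w_avg = np.mean(coefs, axis=0)

# Option B (retraining): one predictor refit on all the data.
w_full, *_ = np.linalg.lstsq(X, y, rcond=None)
```

On data this well-behaved the two options give nearly identical coefficients; the differences the answer discusses matter most for small samples and strongly tuned hyperparameters.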
There is unfortunately very little thorough theory on this, but the following two papers (especially the second) consider precisely the averaging or aggregation of the predictors from K-fold CV.
Jung, Y. (2016). Efficient Tuning Parameter Selection by Cross-Validated Score in High Dimensional Models. International Journal of Mathematical and Computational Sciences, 10(1), 19-25.
Maillard, G., Arlot, S., & Lerasle, M. (2019). Aggregated Hold-Out. arXiv preprint arXiv:1909.04890.
The answer is simple: you use the process of (repeated) cross validation (CV) to obtain a relatively stable performance estimate for a model, not to improve it.
Think of trying out different model types and parametrizations, which are suited to your problem to different degrees. Using CV, you obtain many different estimates of how each model type and parametrization would perform on unseen data. From those results you usually choose one well-suited model type + parametrization, which you then train again on all (training) data. The reason for doing this many times (different partitions with repeats, each using different splits) is to get a stable estimate of the performance, which lets you e.g. look at the mean/median performance and its spread (telling you how well the model usually performs, and how likely you are to be lucky/unlucky and get better/worse results).
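A minimal sketch of what "stable estimate with spread" means, using a hand-rolled repeated k-fold loop on synthetic data and plain least squares as a stand-in for any learner:

```python
import numpy as np

def kfold_scores(X, y, fit, predict, k=10, repeats=5, seed=0):
    """Repeated k-fold CV: returns one score (MSE) per fold per repeat,
    so you can look at both the mean and the spread of the estimate."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(repeats):
        folds = np.array_split(rng.permutation(len(y)), k)
        for i in range(k):
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            model = fit(X[train], y[train])
            err = y[test] - predict(model, X[test])
            scores.append(float(np.mean(err ** 2)))
    return np.array(scores)

# Example learner: ordinary least squares on synthetic data.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, 0.5, -1.0, 2.0]) + rng.normal(scale=0.2, size=200)
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda w, X: X @ w

s = kfold_scores(X, y, fit, predict)
mean_mse, spread = s.mean(), s.std()
```

Comparing `mean_mse` and `spread` across candidate models is the selection step; the chosen model is then refit on all the training data.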
Two more things:
Usually, using CV will improve your results in the end, simply because you pick a model that is better suited for the job.
You mentioned taking the "average" model. This actually exists as "model averaging", where you average the results of multiple, possibly differently trained, models to obtain a single result. It's one way to use an ensemble of models instead of a single one. But for those too, you want to use CV in the end to choose a reasonable model.
I like your thinking. I think you have just accidentally discovered Random Forest:
https://en.wikipedia.org/wiki/Random_forest
Without repeated CV, your seemingly best model is likely to be only a mediocre model when you score it on new data...
I have two sets of features predicting the same outputs. But instead of training on everything at once, I would like to train them separately and fuse the decisions. In SVM classification, we can take the probability values for the classes, which can be used to train another SVM. But how can we do this in SVR?
Any ideas?
Thanks :)
There are a couple of choices here. The two most popular ones would be:
ONE)
Build the two models and simply average the results.
It tends to work well in practice.
TWO)
You could do it in a very similar fashion as when you have probabilities. The problem is that you need to control for overfitting. What I mean is that it is "dangerous" to produce a score with one set of features and apply it to another set where the labels are exactly the same as before (even if the new features are different). This is because the applied score was trained on those labels and therefore overfits them (hyper-performs).
Normally you use cross-validation.
In your case you have
train_set_1 with X1 features and label Y
train_set_2 with X2 features and same label Y
Some pseudo code:
randomly split both train_set_1 and train_set_2 50-50 at exactly the same points, along with Y (the output array)
so now you have:
a.train_set_1 (50% of training_set_1)
b.train_set_1 (the rest of 50% of training_set_1)
a.train_set_2 (50% of training_set_2)
b.train_set_2 (the rest of 50% of training_set_2)
a.Y (50% of the output array that corresponds to the same sets as a.train_set_1 and a.train_set_2)
b.Y (50% of the output array that corresponds to the same sets as b.train_set_1 and b.train_set_2)
Here is the key part:
Build an SVR with a.train_set_1 (which contains the X1 features) and output a.Y, and
apply that model's predictions as a feature to b.train_set_2.
By this I mean: you score the b half with your first model (predicting from b.train_set_1), then take this score and paste it next to b.train_set_2. Now this set has the X2 features + 1 more feature, the score produced by the first model.
Then build your final model on the augmented b.train_set_2 and b.Y.
The new model, although it uses the score produced from train_set_1, still does so in an unbiased way, since the first model was never trained on these labels!
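The scheme above can be sketched in numpy; plain least squares is used here as a dependency-free stand-in for the SVRs, and the data is synthetic:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
X1 = rng.normal(size=(n, 4))  # train_set_1 features
X2 = rng.normal(size=(n, 3))  # train_set_2 features
y = (X1 @ rng.normal(size=4) + X2 @ rng.normal(size=3)
     + rng.normal(scale=0.1, size=n))

# Randomly split 50-50 at exactly the same points for both feature
# sets and for Y.
perm = rng.permutation(n)
a, b = perm[: n // 2], perm[n // 2:]

# Step 1: first model on the "a" half, using X1 features and a.Y.
w1 = np.linalg.lstsq(X1[a], y[a], rcond=None)[0]

# Step 2: score the "b" half with that model (from its X1 features)
# and paste the score next to the b half's X2 features.
score_b = X1[b] @ w1
X2b_aug = np.column_stack([X2[b], score_b])

# Step 3: final model on the augmented b half and b.Y. The score
# feature is unbiased because model 1 never saw these labels.
w2 = np.linalg.lstsq(X2b_aug, y[b], rcond=None)[0]
```

Swapping in an actual SVR (e.g. scikit-learn's `sklearn.svm.SVR`) for the two least-squares fits leaves the structure unchanged.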
You might also find this paper quite useful
I'm currently trying to build an AR model to approximate the error process in a sensor system, and I'm comparing the different parameter estimators in MATLAB. I have sets of data that I'm trying to fit a model to, but I'm not too sure about the benefits/disadvantages of the algorithms available in the Signal Processing Toolbox.
arburg: Autoregressive (AR) all-pole model parameters estimated using Burg method
arcov: Estimate AR model parameters using covariance method
armcov: Estimate AR model parameters using modified covariance method
aryule: Estimate autoregressive (AR) all-pole model using Yule-Walker method
If someone could give a more detailed description comparing the different algorithms, and which one would best model existing data, that would be very helpful.
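To make the Yule-Walker approach concrete (the idea behind `aryule`: fit the AR coefficients to the sample autocorrelation), here is a toolbox-free numpy sketch that estimates an AR(2) process from simulated data:

```python
import numpy as np

def yule_walker(x, order):
    """Estimate AR(p) coefficients from the sample autocovariances
    (the same idea as MATLAB's aryule, without the toolbox)."""
    x = np.asarray(x, float) - np.mean(x)
    n = len(x)
    # Biased sample autocovariances r[0..order]
    r = np.array([np.dot(x[: n - k], x[k:]) / n for k in range(order + 1)])
    # Solve the Toeplitz Yule-Walker system R a = r[1:]
    R = np.array([[r[abs(i - j)] for j in range(order)]
                  for i in range(order)])
    a = np.linalg.solve(R, r[1:])
    sigma2 = r[0] - np.dot(a, r[1:])  # innovation variance estimate
    return a, sigma2

# Simulate a stationary AR(2) process and recover its coefficients.
rng = np.random.default_rng(4)
phi = np.array([0.6, -0.3])
e = rng.normal(size=20000)
x = np.zeros_like(e)
for t in range(2, len(x)):
    x[t] = phi[0] * x[t - 1] + phi[1] * x[t - 2] + e[t]

a_hat, s2 = yule_walker(x, 2)
```

With plenty of data the four MATLAB estimators give very similar answers; their differences show up mainly on short records, so comparing them on truncated segments of your sensor data is a reasonable experiment.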
I have painstakingly gathered data for a proof-of-concept study I am performing. The data consists of 40 different subjects, each with 12 parameters measured at 60 time intervals and 1 output parameter being 0 or 1. So I am building a binary classifier.
I knew beforehand that there is a non-linear relation between the input parameters and the output, so a simple perceptron or Bayes classifier would be unable to classify the sample. This assumption proved correct after initial tests.
Therefore I went to neural networks and, as I hoped, the results were pretty good: an error of about 1-5% is typical. Training uses 70% of the data for training and 30% for evaluation. Running the complete dataset (100%) through the model again, I was very happy with the results. The following is a typical confusion matrix (P = positive, N = negative):
       P    N
P     13    2
N      3   42
So I am happy, and given that I used 30% for evaluation, I am confident that I am not fitting noise.
Therefore I turned to SVM for a double check, and the SVM was unable to converge to a good solution. Most of the time the solutions are terrible (say 90% error...). Maybe I am not fully aware of how SVMs work, or the implementations are not correct, but it troubles me, because I thought that when a NN provides a good solution, SVMs are most of the time better at separating the data due to their maximum-margin hyperplane.
What does this say of my result? Am I fitting noise? And how do I know if this is a correct result?
I am using Encog for the calculations, but the NN results are comparable to those of home-grown NN models I made.
If it is your first time using SVM, I strongly recommend you take a look at A Practical Guide to Support Vector Classification, by the authors of the famous SVM package libsvm. It gives a list of suggestions for training your SVM classifier:
Transform data to the format of an SVM package
Conduct simple scaling on the data
Consider the RBF kernel
Use cross-validation to find the best parameters C and γ
Use the best parameters C and γ to train on the whole training set
Test
In short, try scaling your data and carefully choosing the kernel plus the parameters.
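A minimal sketch of that recipe with scikit-learn (assumed available), on made-up data with a nonlinear boundary: scale the features, use an RBF kernel, and cross-validate over C and gamma:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the 40-subject dataset: a nonlinear boundary
# that a scaled RBF-SVM handles but a linear classifier does not.
rng = np.random.default_rng(5)
X = rng.normal(size=(120, 12))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 2).astype(int)

# Scaling inside the pipeline so the CV folds are scaled consistently.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.01, 0.1, 1]}
search = GridSearchCV(pipe, grid, cv=5)
search.fit(X, y)

best_C = search.best_params_["svc__C"]
best_gamma = search.best_params_["svc__gamma"]
# search.best_estimator_ is already refit on the whole training set.
```

Unscaled features and default C/gamma are the usual cause of the "90% error" behavior described in the question: the RBF kernel is extremely sensitive to feature scale.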