Obtaining the SHAP values for a prediction made with kNN

If I want to obtain the SHAP values with Kernel SHAP for a kNN classifier with n variables, do I have to recalculate the prediction 2^n times?
(I'm using MATLAB rather than Python, so I need to know the internals of the algorithm.)

For those who use Python, the following script gets SHAP values from a kNN model:
import shap
import sklearn.neighbors

# Initialize and fit the model
knn = sklearn.neighbors.KNeighborsClassifier()
knn.fit(X_train, Y_train)

# Get the model explainer object; X_train acts as the background data set
explainer = shap.KernelExplainer(knn.predict_proba, X_train)

# Get shap values for the test data observation whose index is 0, i.e. the first observation in the test set
shap_values = explainer.shap_values(X_test.iloc[0, :])

# Generate a force plot for this first observation using the derived shap values
shap.force_plot(explainer.expected_value[0], shap_values[0], X_test.iloc[0, :])

Yes, I believe so, according to the paper 'A Unified Approach to Interpreting Model Predictions'.
You will need to iterate through all feature coalitions: hence the 2^n (where n is the number of features).
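For intuition, here is a brute-force sketch in MATLAB (since the question mentions MATLAB) of the exact Shapley computation that Kernel SHAP approximates; the library itself solves a weighted regression over sampled coalitions rather than always enumerating all of them. The names f, x, and z are illustrative: a handle returning the model's scalar prediction (e.g. a class probability from a fitted kNN), the observation being explained, and one background sample used to fill in "missing" features.
function phi = exact_shapley(f, x, z)
% EXACT_SHAPLEY  Brute-force Shapley values for one observation x.
%   f : handle returning a scalar model prediction (e.g. a class probability)
%   x : 1-by-n observation being explained
%   z : 1-by-n background sample standing in for "missing" features
n = numel(x);
phi = zeros(1, n);
for i = 1:n
    others = setdiff(1:n, i);
    for b = 0:2^(n-1)-1                      % every coalition S that excludes feature i
        S = others(bitget(b, 1:n-1) == 1);
        % Shapley weight |S|! * (n-|S|-1)! / n!
        w = factorial(numel(S)) * factorial(n - numel(S) - 1) / factorial(n);
        xS = z;  xS(S) = x(S);               % coalition features from x, the rest from z
        xSi = xS; xSi(i) = x(i);             % the same coalition plus feature i
        phi(i) = phi(i) + w * (f(xSi) - f(xS));
    end
end
end
With one background sample this needs on the order of n * 2^n model evaluations, which is why the library subsamples coalitions and averages over a whole background data set to stay tractable for larger n.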

Related

How can we model independent noise for every output dimension of a multi-output GP in GPflow?

Say I have a problem with D outputs and isotopic data, and I would like to use independent noise for each output dimension of a multi-output GP model (Intrinsic Coregionalisation Model) in GPflow; the most general case would be a noise covariance like $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_D^2)$.
I have seen some examples of using multi-output GPs in GPflow, like this notebook and this question.
However, it seems that for the GPR model class in GPflow, the likelihood variance ($\Sigma$) is still a single number instead of D numbers, even if a product kernel (i.e. Kernel * Coregionalization) is specified.
Is there any way to achieve that?
Just like you can augment X with a column that designates, for each data point (row), which output it relates to (that column is specified by the active_dims keyword argument to the Coregion kernel; note that it uses zero-based indexing), you can augment Y with a column that selects a likelihood for each row (the SwitchedLikelihood is hard-coded to require the index to be in the last column of Y). There is an example (Demo 2) in the varying-noise notebook in the GPflow tutorials. You just have to combine the two: use a Coregion kernel and a SwitchedLikelihood, and augment both X and Y with the same column indicating outputs!
However, as plain GPR only works with a Gaussian likelihood, that model class has the likelihood hard-coded. It would certainly be possible to write a version that can deal with different Gaussian likelihoods for the different outputs, but you would have to do it all manually in the _build_likelihood method of a new model (incorporating the stitching code from the SwitchedLikelihood).
It would be much easier to simply use a VGP model that can handle any likelihood - for Gaussian likelihoods the optimisation problem is very simple and should be easy to optimise using ScipyOptimizer.

In MATLAB, what does it mean to use GMM as a posterior distribution to make a supervised classifier inspired by GMM? (Suggested by podludek and lejlot)

I understand that GMM is not a classifier itself, but I am trying to follow the instructions of some users in the Stack Exchange posts below to create a GMM-inspired classifier.
lejlot: Multiclass classification using Gaussian Mixture Models with scikit learn
"construct your own classifier where you fit one GMM per label and then use assigned probability to do actual classification. Then it is a proper classifier"
What is meant by "assigned probability" for GMM Matlab objects in the above quote and how can we input a new point to get our desired assigned probability? For a new point that we are trying to classify, my understanding is that we need to get the posterior probabilities that the new point belongs to either Gaussian and then compare these two probabilities.
It looks from the documentation (https://www.mathworks.com/help/stats/gmdistribution.html) like we only have access to the cluster centers (mu) and covariance matrices (Sigma), but not an actual probability distribution that would take in a point and spit out a probability.
podludek: Multiclass classification using Gaussian Mixture Models with scikit learn
"GMM is not a classifier, but generative model. You can use it to a classification problem by applying Bayes theorem.....You should use GMM as a posterior distribution, one GMM per each class." -
In the MATLAB documentation for posterior(gm,X), the tutorial shows us inputting X, which is already the data we used to create ("train") our GMM. But how can we get the posterior probability of being in a cluster for a new point?
https://www.mathworks.com/help/stats/gmdistribution.posterior.html
"P = posterior(gm,X) returns the posterior probability of each Gaussian mixture component in gm given each observation in X"
--> But the X used in the link above is the 'training' data used to create the GMM itself, not a new point. Also we have two gm objects, not one. How can we grab the probability a point belongs to a Gaussian?
The pseudocode below is how I envisioned a GMM-inspired classifier would go for a two-class example: I would fit GMMs to the individual classes as described by podludek. Then, I would use the posterior probabilities of a point being in each cluster and pick the bigger probability.
I'm aware there are issues with this conceptually (such as the two GMM objects having conflicting covariance matrices) but I've been assured by my mentor that there is a way to make a supervised version of GMM, and he wants me to make one, so here we go:
Pseudocode:
X % The training data matrix
% each new row is a new data point
% each column is new feature
% Ex: if you had 10,000 data points and 100 features for each, your matrix
% would be 10000 by 100
% Let's say we had 200 points of each class in our training data
% Grab subsets of X that corresponds to classes 1 and 2
X_only_class_2 = X(1:200,:);
X_only_class_1 = X(201:end,:);
gmfit_class_1 = fitgmdist(X_only_class_1,1,'RegularizationValue',0.1);
cov_matrix_1=gmfit_class_1.Sigma;
gmfit_class_2 = fitgmdist(X_only_class_2,1,'RegularizationValue',0.1);
cov_matrix_2=gmfit_class_2.Sigma;
% Now do some tests on data we already know the classification of to check if this is working as we would expect:
a = posterior(gmfit_class_1,X_only_class_1)
b = posterior(gmfit_class_1,X_only_class_2)
c = posterior(gmfit_class_2,X_only_class_1)
d = posterior(gmfit_class_2,X_only_class_2)
But unfortunately, computing these posteriors a, b, c, and d just results in column vectors of 1's. I'm aware these are degenerate cases (and pointless for actual classification, since we already know the classifications of our training data), but I still wanted to test them to make sure the posterior method works as I would expect.
Results and what I expected:
a = posterior(gmfit_class_1,X_only_class_1)
% ^ This produces a column vector of 1's, which I thought was fine. After all, the gmfit object was trained on those points
b = posterior(gmfit_class_1,X_only_class_2)
% ^ This one also produces a vector of 1's, which I thought was wrong. It should be a vector of low, but nonzero numbers
c = posterior(gmfit_class_2,X_only_class_1)
% ^ This one also produces a vector of 1's, which I thought was wrong. It should be a vector of low, but nonzero numbers
d = posterior(gmfit_class_2,X_only_class_2)
% ^ This produces a column vector of 1's, which I thought was fine. After all, the gmfit object was trained on those points
I have to think that MATLAB is somehow being confused by the fact that both fitted models contain only one cluster each. Either that, or I am not interpreting the posterior method correctly.
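For what it's worth, posterior(gm, X) returns component-membership probabilities within a single fitted mixture, so with a one-component model it is trivially a vector of ones, which matches the observations above. The class-conditional density itself comes from pdf(gm, X). A minimal sketch of the two-class comparison in the pseudocode above, assuming equal class priors (X_new is an illustrative name for the points to classify):
% Class-conditional densities p(x | class) from each one-component GMM
p1 = pdf(gmfit_class_1, X_new); % density under the class-1 model
p2 = pdf(gmfit_class_2, X_new); % density under the class-2 model
% Bayes' rule with equal priors: posterior probability of class 1
post1 = p1 ./ (p1 + p2);
predicted_class = 1 + (p2 > p1); % 1 where p1 >= p2, otherwise 2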

Using Linear Prediction Over Time Series to Determine Next K Points

I have a time series of N sunspot data points and would like to predict, based on a subset of these points, the remaining points in the series, and then compare the correctness.
I'm just getting introduced to linear prediction in MATLAB, so I decided to use the following code segment within a loop so that every point beyond the training set, up to the end of the given data, gets a prediction:
%x is the data, training set is some subset of x starting from beginning
%'unknown' is the number of points to extend the prediction over starting from the
%end of the training set (i.e. difference in length of training set and data vectors)
%x_pred is set to x initially
p = length(training_set);
coeffs = lpc(training_set, p);
for i = 1:unknown
    nextValue = -coeffs(2:end) * x_pred(end-unknown-1+i:-1:end-unknown-1+i-p+1)';
    x_pred(end-unknown+i) = nextValue;
end
error = norm(x - x_pred)
I have three questions regarding this:
1) Does this appropriately do what I have described? I ask because my error seems rather large (>100) when predicting over only the last 20 points of a dataset that has hundreds of points.
2) Am I interpreting the second argument of lpc correctly? Namely, that it means the 'order' or rather number of points that you want to use in predicting the next point?
3) Is there a more efficient, single-line function in MATLAB that I can call to replace the looping and just compute all necessary predictions for me, given some subset of my overall data as a training set?
I tried looking through the lpc MATLAB tutorial, but it didn't seem to do the kind of prediction my needs require. I have also been using How to use aryule() in Matlab to extend a number series? as a reference.
So, after much deliberation and experimentation, I have found the above approach to be correct, and there does not appear to be any single MATLAB function to do the above work. The large errors are reasonable, since I am applying a linear prediction algorithm to a problem (sunspot prediction) that has inherently nonlinear behavior.
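That said, the per-point recursion can be collapsed into a single filter call, because an LPC model is an all-pole filter: driving it with zero input while seeding its state with the last p known samples produces the same free-running forecast. A sketch under the same setup as above (filtic builds the initial filter state from past outputs, newest first):
% Free-running forecast with the all-pole LPC filter and zero input
zi = filtic(1, coeffs, training_set(end:-1:end-p+1)); % state from the newest p samples
forecast = filter(1, coeffs, zeros(1, unknown), zi);  % the next 'unknown' predictions
This should reproduce the loop's output up to floating-point error; it is a generic filtering trick rather than a dedicated forecasting routine.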
Hope this helps anyone else out there working on something similar.

leave-one-out regression using lasso in Matlab

I have 300 data samples, each with a feature vector of around 4000 dimensions. Each input has a 5-dimensional output in the range of -2 to 2. I am trying to fit a lasso model to it. I went through a few posts that talk about cross-validation strategies, like this one: Leave one out cross validation algorithm in matlab
But I saw that lasso does not support leaveout in Matlab! http://www.mathworks.com/help/stats/lasso.html
How can I train a model using leave one out cross validation and fit a model using lasso on my dataset? I am trying to do this in matlab. I would like to get a set of weights which I will be able to use for future predictions on other data.
I tried using glmnet: http://www.stanford.edu/~hastie/glmnet_matlab/intro.html but I couldn't compile it on my machine due to lack of proper mex compiler.
Any solutions to my problem? Thanks :)
EDIT
I am also trying to use the lasso function built into MATLAB. It has an option to perform cross-validation. It outputs B and fit statistics (FitInfo), where B contains the fitted coefficients: a p-by-L matrix, where p is the number of predictors (columns) in X and L is the number of lambda values.
Now given a new test sample, how can I calculate the output using this model?
You can use a leave-one-out approach regardless of your training method. As explained here, you can use crossvalind to split the data into training and test sets.
[Train, Test] = crossvalind('LeaveMOut', N, M)
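On the EDIT: when lasso is run with the 'CV' name-value pair, the returned FitInfo identifies a good lambda column, and prediction for a new sample is just a linear combination plus the intercept. A minimal sketch (X_test is an illustrative name; for your 5-dimensional output, fit one lasso per output column). Note that the 'CV' option also accepts a cvpartition object, which is one way to get leave-one-out:
% Leave-one-out CV via a cvpartition object; 'CV', 10 would give 10-fold CV instead
[B, FitInfo] = lasso(X, y, 'CV', cvpartition(size(X, 1), 'LeaveOut'));
% Predict for new samples at the lambda with minimum cross-validated MSE
idx = FitInfo.IndexMinMSE;
y_pred = X_test * B(:, idx) + FitInfo.Intercept(idx);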

Issues related to plots in pattern recognition(Part1)

I cannot follow the crossval() and cvpartition() functions given in the MATLAB documentation for crossval(). What goes in the parameters, and how would they help to compare the performance and accuracy of different classifiers? I would be obliged if a simpler version of it were provided here.
Let's work on Example 2 from CROSSVAL documentation.
load('fisheriris');
y = species;
X = meas;
Here we loaded the data from the example mat-file and assigned variables X and y. The meas matrix contains different measurements of iris flowers, and species holds the three classes of iris, which is what we are trying to predict from the data.
Cross-validation is used to train and test a classifier on the same data set many times. Basically, at each iteration you split the data set into training and test data; the proportion is determined by k. For example, if k is 10, 90% of the data will be used for training and the remaining 10% for testing, and you will have 10 iterations. This is done by the CVPARTITION function.
cp = cvpartition(y,'k',10); % Stratified cross-validation
You can explore the cp object if you type cp. and press Tab. You will see different properties and methods. For example, find(cp.test(1)) will show the indices of the test set for the 1st iteration.
The next step is to prepare the prediction function. This is probably where you had the main problem. The following statement creates a function handle using an anonymous function. The @(XTRAIN, ytrain, XTEST) part declares that this function has 3 input arguments. The next part, (classify(XTEST, XTRAIN, ytrain)), defines the function, which takes training data XTRAIN with known classes ytrain and predicts the classes of the XTEST data with the generated model. (Those data come from cp, remember?)
classf = @(XTRAIN, ytrain, XTEST)(classify(XTEST, XTRAIN, ytrain));
Then we run the CROSSVAL function to estimate the misclassification rate (mcr), passing the complete data set, the prediction function handle, and the partitioning object cp.
cvMCR = crossval('mcr',X,y,'predfun',classf,'partition',cp)
cvMCR =
0.0200
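To compare different classifiers, reuse the same cp partition so that every model is scored on identical folds; whichever prediction function gives the lower mcr is the more accurate one on this data. A sketch with a kNN model (fitcknn is assumed to be available in your Statistics Toolbox release):
% Same partition, different classifier: a kNN prediction function handle
knnfun = @(XTRAIN, ytrain, XTEST) predict(fitcknn(XTRAIN, ytrain), XTEST);
cvMCR_knn = crossval('mcr', X, y, 'predfun', knnfun, 'partition', cp)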
Still have questions?