Shap values hand calculated incorrectly - shap

I'm running into an issue trying to hand-calculate SHAP values for a single tree using LGBM. Basically, to firm up my understanding of how SHAP values are calculated for a GBDT, I wanted to perform the calculation manually and compare it to the package's output values for a single prediction. I plotted the tree using show_tree_digraph. The tree only has three leaves (and I only used one estimator) and a depth of 2 excluding the root (in other words, one variable at level 0 and one variable at level 1). This is for a regression problem.
I followed the methodology of the original Tree SHAP paper (drop out a variable, trace down the tree, and if you hit the dropped-out variable, weight the two potential outcomes by the tree's cover, then compare the final result to the true prediction). My hand calculation for each SHAP value was only off by ~5-10%. Here's the thing: it was consistently off for both variables I trained the tree on.
This raises the following questions:
Is there a constant term for each shap value I'm neglecting?
If not, does shap calculate the final value differently than I am?
Why would I get a non-zero SHAP value for a feature whose node is not touched at all by the sample? This seems to happen when the tree includes one or two additional layers the sample does not touch. Somehow the SHAP value is still non-zero, but shouldn't it be zero, since dropping out an untouched feature yields the same resulting prediction?
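For reference, this is roughly how I'm pulling the package values to compare against my hand calculation (a minimal sketch with made-up data; I'm assuming the shap and lightgbm Python packages):

import numpy as np
import lightgbm as lgb
import shap

# Tiny regression problem with two features (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

# Single shallow tree, so it can be traced by hand.
model = lgb.LGBMRegressor(n_estimators=1, max_depth=2, min_child_samples=1)
model.fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:1])   # one SHAP value per feature
prediction = model.predict(X[:1])

# The SHAP values are offsets from the explainer's expected value (base value),
# so base value + sum of SHAP values should reconstruct the prediction.
print("base value:", explainer.expected_value)
print("shap values:", shap_values)
print("base + sum(shap):", explainer.expected_value + shap_values.sum())
print("model prediction:", prediction)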

AnyLogic Sensitivity analysis visualization

I am new to this and at a loss about how to create a visualization for the AnyLogic sensitivity analysis. Here is a summary:
I have a dataset that captures the dependent and independent variables at the end of the simulation run (it just captures one pair at the end). I am trying to vary the independent variable to see its impact on the dependent variable. The resulting dataset is correct (when I copy and paste it into Excel) but the chart looks blank.
Also, the output states that it completed 5 iterations, but I specified 10, and the data shows that there were in fact 10 iterations.
There are no parameters for the chart data (see the screenshot), but I am guessing it is automatically populated at the end of the simulation based on the code (also copied below)? Otherwise, I cannot figure out what goes into the chart data (I tried to manually enter variables/datasets, to no avail).
This is the code after each simulation runs:
Color color = lerpColor( (getCurrentIteration() - 1) / (double) (getMaximumIterations() - 1), blue, red );
chart0.addDataSet( root.died_friend, format( root.SocFriendBrave ), color, true, Chart.INTERPOLATION_LINEAR, 1, Chart.POINT_NONE );
I realize this is a very basic question, but I am lost and cannot get on the right path based on the help I found. Thank you.
Let's say you have 2 experiments: simulation (the normal one) and sensitivity analysis (the new one).
The only way for the chart to look blank is if your dataset died_friend is empty. This can happen for many reasons, but the reason I suspect it happens here is that you fill this dataset with information at the end of the simulation run, which means that you are probably using the Java actions of the simulation experiment.
The sensitivity experiment DOES NOT read what you write in the Java actions of the simulation experiment, so that might be the problem.
If this is not the case, you need to check other possible reasons on why your dataset is empty when you run the sensitivity analysis.
Remember: the fact that your dataset has data in your simulation experiment does not necessarily mean the dataset will not be empty in the sensitivity experiment.
Since the Sensitivity Analysis experiment is set up via the wizard, it would be better if you showed us how you'd set that up, rather than the resultant AnyLogic-generated code.
From what you've said here, it looks like you may be incorrectly trying to use datasets instead of scalar values. If there's one 'dependent variable' (model output of interest) then the 'independent variable' needs to be a model parameter (parameter of your top-level agent, normally Main, which the experiment will be varying).
So you should specify:
Varying the relevant parameter (looks like you wanted it to be 0, 0.25, 0.5, 0.75, 1 so min 0, max 1 with step size of 0.25)
Producing a chart for the scalar output you want. (If this was, say, a variable called outputValue, you'd use the expression root.outputValue when setting up the chart in the wizard.)
Also you should only be getting more than 5 iterations if you set up your parameter variation criteria incorrectly.
(Dataset outputs in the experiment's charts are typically for where your model produces a time series as an output --- i.e., a dataset of sim time vs. value --- and you want the Sensitivity Analysis experiment to show charts with each run's time series as separate lines (i.e., time on the X-axis). Scalar charts are for where the X-axis is the varying parameter and the Y-axis the output of interest.)

should I use float or classes as output for the final layer in my neural network?

I am working on a deep learning problem where I am trying to predict time-to-failure on laboratory earthquake data from an observed seismic time series. The target is a single integer number (the time until the next earthquake) ranging, say, from 1 to 10.
I could design the last layer to return a single float and use, say, mean-squared error (MSE) as a loss to make that returned float close to the desired integer. Or, I could treat each integer possibility as a "class" and use a cross-entropy (CE) loss to optimize.
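For concreteness, here is a minimal sketch of the two options I have in mind (PyTorch, with a hypothetical 128-dimensional encoder output and batch size 8):

import torch
import torch.nn as nn

features = torch.randn(8, 128)   # hypothetical encoder output for a batch of 8
targets = torch.tensor([3., 7., 1., 9., 4., 2., 6., 5.])  # time-to-failure in [1, 10]

# Option 1: regression head -- a single float output trained with MSE.
reg_head = nn.Linear(128, 1)
reg_loss = nn.MSELoss()(reg_head(features).squeeze(1), targets)

# Option 2: classification head -- one logit per integer value, trained with CE.
# Classes 0..9 correspond to targets 1..10.
cls_head = nn.Linear(128, 10)
cls_loss = nn.CrossEntropyLoss()(cls_head(features), (targets - 1).long())

print(reg_loss.item(), cls_loss.item())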
Are there any arguments in favour of either of these options?
Also, what if the target is a float number ranging from 1 to 10? I could also turn this into a class/CE problem.
So far, I have tried the CE option (which works at some level) and am thinking of trying the MSE option, but wanted to step back and think before proceeding. Such thoughts would include reasoning as to why one approach might outperform the other.
I am working with pytorch version 1.0.1 and Python 3.7.
Thanks for any guidance.
I decided to just implement a float head with an L1Loss in PyTorch, and I created a simple but effective synthetic dataset to test the implementation. The dataset consisted of images into which a number of small squares were randomly drawn. The training label was simply the number of squares divided by 10, a float number with one decimal digit.
The net trained very quickly and to a high degree of precision: the test samples were correct to one decimal digit.
As to the original question, the runs I made definitely favoured the float head over the class head.
My take on this is that the class implementation had a basic imprecision in the assignment of the classes and, perhaps more importantly, that it has no concept of a "metric". That is, the float implementation, even if it misses the exact match, will try to generate an output "close" to the target label, while the class implementation has no concept of "close".
One warning with PyTorch: if you are fitting to a single float, be sure to wrap it in a length-1 vector in the data generator. PyTorch cannot handle a "naked" float number (even though it does become a vector once batches are assembled), but it doesn't complain. This cost me a bunch of time.
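Here is a minimal sketch of what I mean (a hypothetical Dataset; the names are made up):

import torch
from torch.utils.data import Dataset

class SquaresDataset(Dataset):
    """Hypothetical dataset returning an image and a single float target."""
    def __init__(self, images, labels):
        self.images = images      # e.g. a float tensor of shape (N, 1, H, W)
        self.labels = labels      # a plain Python list of floats

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Wrap the scalar target in a length-1 tensor so that batching
        # produces shape (batch, 1), matching the model's output.
        return self.images[idx], torch.tensor([self.labels[idx]], dtype=torch.float32)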

MATLAB: Using CONVN for moving average on Matrix

I'm looking for a bit of guidance on using CONVN to calculate moving averages in one dimension on a 3d matrix. I'm getting a little caught up on the flipping of the kernel under the hood and am hoping someone might be able to clarify the behaviour for me.
A similar post that still has me a bit confused is here:
CONVN example about flipping
The Problem:
I have daily river flow and weather data for a watershed at different source locations.
So the matrix is laid out as follows:
dim 1 (the rows) represent each site
dim 2 (the columns) represent the date
dim 3 (the pages) represents the different types of measurement (river height, flow, rainfall, etc.)
The goal is to use CONVN to take a 21-day moving average at each site, for each observation point, for each variable.
As I understand it, I should just be able to use a kernel such as:
ker = ones(1,21) ./ 21.;
mat = randn(150,365*10,4);
avgmat = convn(mat,ker,'valid');
I tried playing around and created another kernel which should (I think) also work, setting ker2 as:
ker2 = [zeros(1,21); ker; zeros(1,21)];
avgmat2 = convn(mat,ker2,'valid');
The question:
The results don't quite match and I'm wondering if I have the dimensions incorrect here for the kernel. Any guidance is greatly appreciated.
Judging from the context of your question, you have a 3D matrix and you want to find the moving average of each row independently over all 3D slices. The code above should work (the first case). However, the valid flag returns a matrix whose size is valid in terms of the boundaries of the convolution. Take a look at the first point of the post that you linked to for more details.
Specifically, the first 20 entries of each row will be missing due to the valid flag (the output has 21 - 1 = 20 fewer columns). You only get valid results (no pun intended) once the convolution kernel is completely contained inside a row of the matrix, which first happens when it covers entries 1 through 21. If you'd like to see these entries at the boundaries, then you'll need to use the 'same' flag if you want to maintain the same size matrix as the input, or the 'full' flag (which is the default), which gives you the size of the output starting from the most extreme outer edges; but bear in mind that near the edges the moving average is computed with a bunch of zeroes, so those entries wouldn't be what you expect anyway.
However, if I'm interpreting what you are asking correctly, then the valid flag is what you want, but bear in mind that you will have 20 entries missing to accommodate the edge cases. As for why your two results don't quite match: ker2 is a 3-by-21 kernel, so with the valid flag the output also loses a row at the top and bottom (148 rows instead of 150), and its rows line up with rows 2 through 149 of the input, even though the rows of zeroes contribute nothing to the sums. All in all, your first snippet should work, but be careful about how you interpret the results.
BTW, you have a symmetric kernel, so flipping should have no effect on the convolution output. What you have specified is a standard moving-average kernel, and so convolution should work in finding the moving average as you expect.
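If it helps to see the boundary behaviour concretely, here is a small illustration in Python/NumPy (not MATLAB, just to show how the 'valid', 'same', and 'full' modes differ in size for a moving average):

import numpy as np

x = np.arange(1.0, 31.0)          # one row of data, 30 samples
ker = np.ones(21) / 21.0          # 21-sample moving-average kernel

print(np.convolve(x, ker, mode='valid').shape)  # (10,)  -> 30 - 21 + 1
print(np.convolve(x, ker, mode='same').shape)   # (30,)  -> zero-padded at the edges
print(np.convolve(x, ker, mode='full').shape)   # (50,)  -> 30 + 21 - 1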
Good luck!

RapidMiner: Ability to classify based off user set support threshold?

I have built a small text-analysis model that classifies small text files as either good, bad, or neutral. I am using a support vector machine as my classifier. However, I was wondering whether, instead of classifying into all three, I could classify into either Good or Bad, but if the support for that text file is below 0.7 (or some user-specified threshold) it would be classified as neutral. I know this isn't looked at as the best way of doing this; I am just trying to see what would happen if I took a different approach.
The operator Drop Uncertain Predictions might be what you want.
After you have applied your model to some test data, the resulting example set will have a prediction and two new attributes called confidence(Good) and confidence(Bad). These confidences are between 0 and 1 and for the two class case they will sum to 1 for each example within the example set. The highest confidence dictates the value of the prediction.
The Drop Uncertain Predictions operator requires a min confidence parameter and will set the prediction to missing if the maximum confidence it finds is below this value (you can also have different confidences for different class values for more advanced investigations).
You could then use the Replace Missing Values operator to change all missing predictions to be a text value of your choice.
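Outside of RapidMiner, the logic those two operators implement together is roughly the following (a sketch in Python with hypothetical confidence values, just to illustrate the thresholding):

# Sketch of "drop uncertain predictions, then replace with neutral",
# using hypothetical per-example confidences for the two classes.
examples = [
    {"confidence_good": 0.95, "confidence_bad": 0.05},
    {"confidence_good": 0.60, "confidence_bad": 0.40},
    {"confidence_good": 0.20, "confidence_bad": 0.80},
]

min_confidence = 0.7  # user-specified threshold

for ex in examples:
    if ex["confidence_good"] >= ex["confidence_bad"]:
        prediction, confidence = "Good", ex["confidence_good"]
    else:
        prediction, confidence = "Bad", ex["confidence_bad"]
    if confidence < min_confidence:
        prediction = "Neutral"   # stand-in for drop-to-missing plus replace
    print(prediction, confidence)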

Shannon's Entropy measure in Decision Trees

Why is Shannon's Entropy measure used in Decision Tree branching?
Entropy(S) = - p(+)log( p(+) ) - p(-)log( p(-) )
I know it is a measure of the number of bits needed to encode information; the more uniform the distribution, the higher the entropy. But I don't see why it is so frequently applied when building decision trees (choosing a branch point).
Because you want to ask the question that will give you the most information. The goal is to minimize the number of decisions/questions/branches in the tree, so you start with the question that will give you the most information and then use the following questions to fill in the details.
For the sake of decision trees, forget about the number of bits and just focus on the formula itself. Consider a binary (+/-) classification task where you have an equal number of + and - examples in your training data. Initially, the entropy will be 1 since p(+) = p(-) = 0.5. You want to split the data on an attribute that most decreases the entropy (i.e., makes the distribution of classes least random). If you choose an attribute, A1, that is completely unrelated to the classes, then the entropy will still be 1 after splitting the data by the values of A1, so there is no reduction in entropy. Now suppose another attribute, A2, perfectly separates the classes (e.g., the class is always + for A2="yes" and always - for A2="no"). In this case, the entropy is zero, which is the ideal case.
In practical cases, attributes don't typically perfectly categorize the data (the entropy is greater than zero). So you choose the attribute that "best" categorizes the data (provides the greatest reduction in entropy). Once the data are separated in this manner, another attribute is selected for each of the branches from the first split in a similar manner to further reduce the entropy along that branch. This process is continued to construct the tree.
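As a quick illustration of "greatest reduction in entropy", here is a small worked example in Python (the counts and attribute names A1/A2 are hypothetical, matching the scenario above):

import math

def entropy(pos, neg):
    """Shannon entropy (base 2) of a binary class distribution."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

# 10 positive and 10 negative examples at the parent node.
parent = entropy(10, 10)                                 # 1.0 bit

# Splitting on A1 (unrelated to the class): both branches stay 5+/5-.
after_a1 = 0.5 * entropy(5, 5) + 0.5 * entropy(5, 5)     # still 1.0

# Splitting on A2 (perfectly separates the classes): 10+/0- and 0+/10-.
after_a2 = 0.5 * entropy(10, 0) + 0.5 * entropy(0, 10)   # 0.0

print("gain from A1:", parent - after_a1)   # 0.0 -> useless split
print("gain from A2:", parent - after_a2)   # 1.0 -> ideal split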
You seem to have an understanding of the math behind the method, but here is a simple example that might give you some intuition behind why this method is used: Imagine you are in a classroom that is occupied by 100 students. Each student is sitting at a desk, and the desks are organized such that there are 10 rows and 10 columns. 1 out of the 100 students has a prize that you can have, but you must guess which student it is to get the prize. The catch is that every time you guess, the prize decreases in value. You could start by asking each student individually whether or not they have the prize. However, initially, you only have a 1/100 chance of guessing correctly, and it is likely that by the time you find the prize it will be worthless (think of every guess as a branch in your decision tree). Instead, you could ask broad questions that dramatically reduce the search space with each question, for example, "Is the student somewhere in rows 1 through 5?" Whether the answer is "Yes" or "No", you have reduced the number of potential branches in your tree by half.