Deriving prediction efficiency and prediction errors for Ensemble Machine Learning model stacks

I am trying to derive prediction errors for ensemble models fitted using makeStackedLearner in the mlr package. These are the steps I am following:
> library(mlr)
> library(matrixStats)
> data(BostonHousing, package = "mlbench")
> BostonHousing$chas = as.numeric(BostonHousing$chas)
> tsk = makeRegrTask(data = BostonHousing, target = "medv")
> base = c("regr.rpart", "regr.svm", "regr.ranger")
> lrns = lapply(base, makeLearner)
> m = makeStackedLearner(base.learners = lrns,
+ predict.type = "response", method = "stack.cv", super.learner = "regr.lm")
> tmp = train(m, tsk)
> summary(tmp$learner.model$super.model$learner.model)
Call:
stats::lm(formula = f, data = d)
Residuals:
     Min       1Q   Median       3Q      Max
-10.8014  -1.5154  -0.2479   1.2160  23.6530

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.76991    0.43211  -6.410 3.35e-10 ***
regr.rpart  -0.09575    0.04858  -1.971   0.0493 *
regr.svm     0.17379    0.07710   2.254   0.0246 *
regr.ranger  1.04503    0.08904  11.736  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.129 on 502 degrees of freedom
Multiple R-squared: 0.885, Adjusted R-squared: 0.8843
F-statistic: 1287 on 3 and 502 DF, p-value: < 2.2e-16
> res = predict(tmp, tsk)
Note that I use method = "stack.cv", which means that any time the models are refitted using makeStackedLearner the numbers will be slightly different. My first question is:
Is the R-squared derived from the super.learner model an objective measure of the predictive power? (I assume that, because it is based on cross-validation with refitting, it should be.)
> ## cross-validation R-square
> round(1-tmp$learner.model$super.model$learner.model$deviance /
+ tmp$learner.model$super.model$learner.model$null.deviance, 3)
[1] 0.872
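As an additional sanity check, the predictive power of the whole stack can also be estimated by cross-validating it with mlr's resample (a quick sketch; this refits the stack in each fold, so it takes a while):
rdesc = makeResampleDesc("CV", iters = 5)
r = resample(m, tsk, rdesc, measures = list(rsq, rmse))
r$aggr  ## cross-validated R-squared and RMSE for the full stack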
How can I derive the prediction error (prediction interval) for all newdata rows?
The method I use at the moment simply takes the standard deviation of the multiple independent model predictions (which is the model error):
> res.all <- getStackedBaseLearnerPredictions(tmp)
> wt <- summary(tmp$learner.model$super.model$learner.model)$coefficients[-1,4]
> res.all$model.error <- matrixStats::rowSds(
+ as.matrix(as.data.frame(res.all))[,which(wt<0.05)], na.rm=TRUE)
> res$data[1,]
id truth response
1 1 24 26.85235
> res.all$model.error[1]
[1] 2.24609
So in this case the predicted value is 26.85, the truth is 24, and the prediction error is estimated at 2.24. Again, because the stack.cv method is used, every time you refit the models you get slightly different values. Are you aware of any similar approach to derive prediction errors for ensemble models? Thanks in advance.

To derive prediction intervals (individual errors at new data) we can use the predict.lm function:
> m = makeStackedLearner(base.learners = lrns, predict.type = "response",
method = "stack.cv", super.learner = "regr.lm")
> tmp = train(m, tsk)
> tmp$learner.model$super.model$learner.model
Call:
stats::lm(formula = f, data = d)
Coefficients:
(Intercept) regr.rpart regr.svm regr.ranger
-2.5879 -0.0971 0.3549 0.8635
> res.all <- getStackedBaseLearnerPredictions(tmp)
> pred.error = predict(tmp$learner.model$super.model$learner.model,
newdata = res.all, interval = "prediction", level=2/3)
> str(pred.error)
num [1:506, 1:3] 29.3 23.3 34.6 36 33.6 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:506] "1" "2" "3" "4" ...
..$ : chr [1:3] "fit" "lwr" "upr"
> summary(tmp$learner.model$super.model$learner.model$residuals)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-11.8037 -1.5931 -0.3161 0.0000 1.1951 29.2145
> mean((pred.error[,3]-pred.error[,2])/2)
[1] 3.253142
This is an example with an lm model as the super learner. The level argument can be used to pass different probabilities (2/3 corresponds to roughly one standard deviation). Prediction errors at new data should be somewhat higher than what you obtain with the training data (depending on extrapolation). This approach could also be extended to using e.g. Random Forest models (see the ranger package) and the quantile regression Random Forest derivation of prediction intervals (Hengl et al. 2019). Note that for this type of analysis there should be at least two base learners (three are recommended).
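For instance, here is a minimal sketch of such quantile-based intervals using ranger's built-in quantile regression forests (the 1/6 and 5/6 quantiles are chosen to mirror level = 2/3 above):
library(ranger)
qrf = ranger(medv ~ ., data = BostonHousing, quantreg = TRUE, num.trees = 500)
qpred = predict(qrf, data = BostonHousing, type = "quantiles",
                quantiles = c(1/6, 1/2, 5/6))
## half-width of the central 2/3 interval, analogous to (upr - lwr)/2 above
mean((qpred$predictions[, 3] - qpred$predictions[, 1]) / 2)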

Related

How to run an exponential decay mixed model?

I am not familiar with nonlinear regression and would appreciate some help with running an exponential decay model in R. Please see the graph for what the data look like. My hunch is that an exponential model might be a good choice. I have one fixed effect and one random effect: y ~ x + (1|random factor). How do I get the starting values for the exponential model (please assume that I know nothing about nonlinear regression) in R? How do I subsequently run a nonlinear model with these starting values? Could anyone please help me with the logic as well as the R code?
As I am not familiar with nonlinear regression, I haven't been able to attempt it in R.
[raw plot]
The correct syntax will depend on your experimental design and model but I hope to give you a general idea on how to get started.
We begin by generating some data that should match the type of data you are working with. You had mentioned a fixed factor and a random one. Here, the fixed factor is represented by the variable treatment and the random factor is represented by the variable grouping_factor.
library(nlraa)
library(nlme)
library(ggplot2)
## Setting this seed should allow you to reach the same result as me
set.seed(3232333)
example_data <- expand.grid(treatment = c("A", "B"),
grouping_factor = c('1', '2', '3'),
replication = c(1, 2, 3),
xvar = 1:15)
The next step is to create some "observations". Here, we use an exponential function y = a * exp(c * x) plus some random noise to create the data. Also, we add a constant to treatment A just to create some treatment differences.
example_data$y <- ave(example_data$xvar, example_data[, c('treatment', 'replication', 'grouping_factor')],
FUN = function(x) {expf(x = x,
a = 10,
c = -0.3) + rnorm(1, 0, 0.6)})
example_data$y[example_data$treatment == 'A'] <- example_data$y[example_data$treatment == 'A'] + 0.8
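Before fitting, it can help to eyeball the simulated data; a quick sketch using the ggplot2 package loaded above:
ggplot(example_data, aes(x = xvar, y = y, colour = treatment)) +
  geom_point() +
  facet_wrap(~ grouping_factor)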
All right, now we start fitting the model.
## Create a grouped data frame
exampleG <- groupedData(y ~ xvar | grouping_factor, data = example_data)
## Fit a separate nonlinear model to each grouping level
fitL <- nlsList(y ~ SSexpf(xvar, a, c), data = exampleG)
## Collapse the list of fits into a nonlinear mixed-effects model
fit1 <- nlme(fitL)
## Grab the fixed-effect coefficients of the general model
fxf <- fixed.effects(fit1)
## Add treatment as a fixed effect. Also, use the coefficients from the
## previous model as starting values.
fit2 <- update(fit1, fixed = a + c ~ treatment,
               start = c(fxf[1], 0,
                         fxf[2], 0))
Looking at the model output (e.g. via summary(fit2)), you will get information like the following:
Nonlinear mixed-effects model fit by maximum likelihood
  Model: y ~ SSexpf(xvar, a, c)
  Data: exampleG
       AIC      BIC    logLik
  475.8632 504.6506 -229.9316

Random effects:
 Formula: list(a ~ 1, c ~ 1)
 Level: grouping_factor
 Structure: General positive-definite, Log-Cholesky parametrization
              StdDev       Corr
a.(Intercept) 3.254827e-04 a.(In)
c.(Intercept) 1.248580e-06 0
Residual      5.670317e-01

Fixed effects: a + c ~ treatment
                  Value Std.Error  DF   t-value p-value
a.(Intercept)  9.634383 0.2189967 264  43.99329  0.0000
a.treatmentB   0.353342 0.3621573 264   0.97566  0.3301
c.(Intercept) -0.204848 0.0060642 264 -33.77976  0.0000
c.treatmentB  -0.092138 0.0120463 264  -7.64867  0.0000
 Correlation:
             a.(In) a.trtB c.(In)
a.treatmentB -0.605
c.(Intercept) -0.785  0.475
c.treatmentB   0.395 -0.792 -0.503

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max
-1.93208903 -0.34340037  0.04767133  0.78924247  1.95516431

Number of Observations: 270
Number of Groups: 3
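If you also want confidence intervals for the estimated fixed effects, nlme's intervals() function can be applied to the fitted model (a quick sketch):
intervals(fit2, which = "fixed")  ## approximate confidence intervals for the fixed effects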
Then, if you wanted to visualize the model fit, you could do the following.
## Here we store the model predictions for visualization purposes
predictionsDf <- cbind(example_data,
predict_nlme(fit2, interval = 'conf'))
## Here we make a graph to check it out
ggplot()+
geom_ribbon(data = predictionsDf,
aes( x = xvar , ymin = Q2.5, ymax = Q97.5, fill = treatment),
color = NA, alpha = 0.3)+
geom_point(data = example_data, aes( x = xvar, y = y, col = treatment))+
geom_line(data = predictionsDf, aes(x = xvar, y = Estimate, col = treatment), size = 1.1)
This shows the model fit.

mgcv: Difference between s(x, by=cat) and s(cat, bs='re')

What is the difference between adding a by= parameter to a smooth and adding a random effect smooth?
I've tried both, and get different results. E.g.:
library(mgcv)
library(tibble)
set.seed(26)
gam.df <- tibble(y = rnorm(400),
                 x1 = rnorm(400),
                 cat = factor(rep(1:4, each = 100)))
gam0 <- gam(y ~ s(x1, by=cat), data=gam.df)
summary(gam0)
produces:
> summary(gam0)
Family: gaussian
Link function: identity
Formula:
y ~ s(x1, by = cat)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.001275 0.049087 -0.026 0.979
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x1):cat1 1 1 7.437 0.00667 **
s(x1):cat2 1 1 0.047 0.82935
s(x1):cat3 1 1 0.393 0.53099
s(x1):cat4 1 1 0.019 0.89015
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.00968 Deviance explained = 1.96%
GCV = 0.97413 Scale est. = 0.96195 n = 400
On the other hand:
gam1 <- gam(y ~ s(x1) + s(cat, bs='re'), data=gam.df)
summary(gam1)
produces:
> summary(gam1)
Family: gaussian
Link function: identity
Formula:
y ~ s(x1) + s(cat, bs = "re")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0001211 0.0572271 0.002 0.998
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x1) 1.0000 1 2.359 0.125
s(cat) 0.7883 3 0.356 0.256
R-sq.(adj) = 0.00594 Deviance explained = 1.04%
GCV = 0.97236 Scale est. = 0.96558 n = 400
I understand that by= shows the summary by each factor level, but shouldn't the overall results of the model, such as R^2, be the same?
The factor-by model, gam0, contains a separate smooth of x1 for each level of cat, but doesn't include anything specifically for the means of y in each group[*], so it is mis-specified. Compare this with gam1, which has a single smooth of x1 plus group means for the levels of cat.
Even though you generated random data without any smooth or group-level effects, gam0 is a potentially much more complex and flexible model, as it contains 4 separate smooths, each using potentially 9 degrees of freedom. gam1 has a single smooth of x1, which uses up to 9 degrees of freedom, plus between 0 and 4 degrees of freedom for the random-effect smooth. gam0 is simply exploiting random variation in the data that can be explained a little by those extra potential degrees of freedom. You can see this in the output above: gam0 explains roughly twice the deviance that gam1 does, yet its adjusted R-sq. barely improves, since the adjustment penalizes the extra complexity (not that either model explains a good amount of deviance).
r$> library("gratia")
r$> smooths(gam0)
[1] "s(x1):cat1" "s(x1):cat2" "s(x1):cat3" "s(x1):cat4"
r$> smooths(gam1)
[1] "s(x1)" "s(cat)"
[*] Note that your by model should be
gam0 <- gam(y ~ cat + s(x1, by = cat), data = gam.df)
because the smooths created by s(x1, by = cat) are subject to an identifiability constraint (as there's a constant term, the intercept, in the model). This constraint is a sum-to-zero constraint, which means that the individual smooths do not contain the group means. Without the parametric cat term, the smooths are forced not only to model the way y changes as a function of x1 in each group but also to model the magnitude of y in the respective groups, yet the constraint leaves no functions in the span of the basis that could model such constant (magnitude) effects.
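For comparison, a minimal sketch (reusing gam.df from above; gam2 is just an illustrative name) fitting the corrected model alongside the originals:
gam2 <- gam(y ~ cat + s(x1, by = cat), data = gam.df)
summary(gam2)          ## the group means are now handled by the parametric cat term
AIC(gam0, gam1, gam2)  ## with pure-noise data the simpler gam1 should typically win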

Predicting same values for entire test Set in MATLAB using LIBSVM

I am using Support Vector Regression (SVR) from the libsvm package to predict outputs. Kernel: RBF
Train set size: 729x40
Test set size: 137x40
The output on the train set seems fine when measured against the ground truth, but the predictions on the test set are all the same: it outputs identical values.
After checking the related posts, I normalized the data and played with the values of gamma (10-100000), but the problem still persists.
trainGT=games(((games(:,46)>=2010) & (games(:,46)<2015) & (games(:,1)~=8)),43);
featuresTrain=lastGame(games,true,1);
testGT=games((games(:,46)>=2015 & (games(:,1)~=8)),43);
featureTest=lastGame(games,false,1);
model = svmtrain(trainGT, featuresTrain, '-s 4 -t 2 -c 10 -g 10');
w = (model.sv_coef' * full(model.SVs));
b = -model.rho;
predictionsTrain = svmpredict(trainGT, featuresTrain,model);
predictionsTest = svmpredict(zeros(length(testGT),1), featureTest, model);
My output is as follows
optimization finished, #iter = 1777
epsilon = 0.630588
obj = -19555.036253, rho = -17.470386
nSV = 681, nBSV = 118
Mean squared error = 305.214 (regression)
Squared correlation coefficient = -1.#IND (regression)
All my predictionsTest values are 17.4704 (which equals -rho from the output above). Can someone please help me with this? Thanks.

Matlab : Help in implementing an encoding for realizing a mapping function

An example: consider the unimodal logistic map x[n+1] = 4*x[n]*(1-x[n]). The map can be used to generate +1/-1 symbols by thresholding its output (e.g. against 0.5).
I want to extend the above concept using the map f(x) for 3 levels, each level corresponding to a symbol, but I am unsure how I can do that.
To map a continuous range (obtained for example as the output of a pseudo-random number generator, or alternatively the logistic map) to a small set of discrete values, you would need to split the continuous range into regions, and assign an output value to each of those regions. The limits of those regions would determine the corresponding threshold values to use.
For example, in the binary case you start off with a continuous range of values in [0,1], which you split into two regions: [0,0.5] and (0.5,1]. Each of those regions is assigned an output symbol, namely -1 and +1. As you have noted, setting the boundary of the regions to the midpoint of your [0,1] input range gives you a threshold of 0.5. This could be implemented as:
if (x > 0.5)
symbol = +1;
else
symbol = -1;
end
As a more compact implementation, the formula 2*(x>0.5)-1 takes advantage of the fact that in Matlab a true condition (from the x>0.5 expression) has a value of 1, whereas false has a value of 0.
For 3 discrete output values, you'd similarly split your [0,1] input range into 3 regions: [0,1/3], (1/3,2/3] and (2/3,1]. The corresponding thresholds thus being 1/3 and 2/3.
Finally for 8 discrete output values, you would similarly split your [0,1] input range into 8 regions: [0,1/8], (1/8,2/8], (2/8,3/8], (3/8,4/8], (4/8,5/8], (5/8,6/8], (6/8,7/8] and (7/8,1]. The corresponding thresholds thus being 1/8, 2/8, 3/8, 4/8, 5/8, 6/8 and 7/8, as illustrated in the following diagram:
thresholding function input: |-----|-----|-----|-----|-----|-----|-----|-----|
0 | | | | | | | 1
thresholds: 1/8 2/8 3/8 4/8 5/8 6/8 7/8
| | | | | | | |
v v v v v v v v
generated symbol: -7 -5 -3 -1 +1 +3 +5 +7
This then gives the following symbol mapping implementation:
if (x < 1/8)
symbol = -7;
elseif (x < 2/8)
symbol = -5;
elseif (x < 3/8)
symbol = -3;
elseif (x < 4/8)
symbol = -1;
elseif (x < 5/8)
symbol = +1;
elseif (x < 6/8)
symbol = +3;
elseif (x < 7/8)
symbol = +5;
else
symbol = +7;
end
As a more compact implementation, you could similarly use the floor function to obtain discrete levels:
% x : some value in the [0,1] range
% s : a symbol in the {-7,-5,-3,-1,+1,+3,+5,+7} set
function s = threshold(x)
% Note on implementation:
% 8*x turns the input range from [0,1] to [0,8]
% floor(8*x) then turns that into values {0,1,2,3,4,5,6,7}
% then a linear transform (2*() - 7) is applied to map
% 0 -> -7, 1 -> -5, 2 -> -3, ..., 7 -> 7
% min/max finally applied just as a safety to make sure we don't overflow due
% to roundoff errors (if any).
s = min(7, max(-7, 2*floor(8*x) - 7));
end
Now if you want to generate complex symbols with 8 levels for the real part and 8 levels for the imaginary part, you'd simply combine them just like in the binary case. Namely, you'd generate a first value which gives you the real part, then a second value for the imaginary part:
x_real = rand(); % random input 0 <= x_real <= 1
x_imag = rand(); % another one
s = threshold(x_real) + sqrt(-1)*threshold(x_imag);
Addressing some points raised by a previous revision of the question:
One thing to note is that x[n+1] = 4*x[n]*(1-x[n]) maps values in [0,1] to the same range of values. This makes it possible to iteratively apply the mapping to obtain additional values, and correspondingly generate a binary sequence with the threshold application (x > 0.5). The function f(x) you provided (in an earlier edit of the question), on the other hand, maps values within a range with discontinuities (roughly covering [-7.5,7.5] depending on p) to [0,1]. In other words, you would need to either modify f(x) or otherwise map its output back to the input domain of f(x). It would probably be easier to consider a general uniform pseudo-random number generator over the [-8,+8] range as input to the threshold function:
% x : some value in the [-8,8] range
% s : a symbol in the {-7,-5,-3,-1,+1,+3,+5,+7} set
function s = threshold_8PAM(x)
s = min(7, max(-7, 2*round(x/2 + 0.5) - 1));
end
To get the final 64-QAM symbols you would combine two 8-PAM symbols in quadrature (i.e. x64qam = xQ + sqrt(-1)*xI, where xQ and xI have both been generated with the above procedure).
That said, if the goal is to implement a digital communication system using 64-QAM symbols with additional chaotic modulation, you'd ultimately want to take into account the source of input data to transmit, rather than randomly generating both the chaotic modulation and the source data in one shot. That is, even if for performance evaluation you wind up generating the source data randomly, it is still a good idea to generate it independently of the chaotic modulation.
Addressing those concerns, the paper An Enhanced Spectral Efficiency Chaos-Based Symbolic Dynamics Transceiver Design suggests a different approach based on the inverse map you provided, which can be implemented as:
function x = inverse_mapping(x,SymbIndex,p)
if (SymbIndex==0)
x = ((1-p)*x-14)/2;
elseif (SymbIndex==1)
x = ((1-p)*x-10)/2;
elseif (SymbIndex==2)
x = ((1-p)*x-6)/2;
elseif (SymbIndex==3)
x = ((1-p)*x-2)/2;
elseif (SymbIndex==4)
x = ((1-p)*x+2)/2;
elseif (SymbIndex==5)
x = ((1-p)*x+6)/2;
elseif (SymbIndex==6)
x = ((1-p)*x+10)/2;
elseif (SymbIndex==7)
x = ((1-p)*x+14)/2;
end
end
As you may notice, the function takes a symbol index (3 bits, which you'd get from the input source data) and the current state of the modulated output (which you may seed with any value within the convergence range of inverse_mapping) as two independent input streams. Note that you can compute the bounds of the convergence range of inverse_mapping by finding the limits of repeated application of the mapping using input symbol index s=0, and s=7 (using for example a seed of x=0). This should converge to [-14/(1+p), 14/(1+p)].
The chaotic modulation described in the above referenced paper can then be achieved with (setting the control parameter p=0.8 as an example):
% Simulation parameters
Nsymb = 10000;
p = 0.8;
M = 64;
% Source data generation
SymbolIndexQ = randi([0 sqrt(M)-1],Nsymb,1);
SymbolIndexI = randi([0 sqrt(M)-1],Nsymb,1);
% Modulation
xmax = 14/(1+p); % found by iterative application of inverse_mapping
xQ = xmax*(2*rand(1)-1); % seed initial state
xI = xmax*(2*rand(1)-1); % seed initial state
x = zeros(Nsymb,1);
for i=1:Nsymb
xQ = inverse_mapping(xQ, SymbolIndexQ(i), p);
xI = inverse_mapping(xI, SymbolIndexI(i), p);
x(i) = xQ + sqrt(-1)*xI;
end
% x holds the modulated symbols
plot(real(x), imag(x), '.');
% if you also need the unmodulated symbols you can get them from
% SymbolIndexQ and SymbolIndexI
s = (2*SymbolIndexQ-7) + sqrt(-1)*(2*SymbolIndexI-7);
which should produce the corresponding constellation diagram (with p=1 the output is essentially unmodulated).

Chi squared test

I have written code in MATLAB for a chi-square test. I wish to obtain p-values such as 0.897 or 0.287, but my results are far too small. Below is my code:
pd = fitdist(sample, 'weibull');
[h,p,st] = chi2gof(sample,'CDF',pd)
I've also tried using the AD test, with similar results:
dist = makedist('Weibull', 'a',A, 'b',B);
[h,p,ad,cv] = adtest(sample, 'Distribution',dist)
Below is a histogram of the data with a fitted Weibull density function (Weibull parameters: A=4.0420, B=2.0853).
When the p-value is less than a predetermined significance level (default is 5% or 0.05), the null hypothesis is rejected (which in your case means that the sample did not come from a Weibull distribution).
The first output variable h of the chi2gof function denotes the test result, where h=1 means that the test rejects the null hypothesis at the specified significance level.
Example:
sample = rand(1000,1); % sample from Uniform distribution
pd = fitdist(sample, 'weibull');
[h,p,st] = chi2gof(sample, 'CDF',pd, 'Alpha',0.05)
The test clearly rejects H0 and concludes that the data did not come from a Weibull distribution:
h =
1 % 1: H1 (alternate hypo), 0: H0 (null hypo)
p =
2.8597e-27 % note that p << 0.05
st =
chi2stat: 141.1922
df: 7
edges: [0.0041 0.1035 0.2029 0.3023 0.4017 0.5011 0.6005 0.6999 0.7993 0.8987 0.9981]
O: [95 92 92 97 107 110 102 95 116 94]
E: [53.4103 105.6778 130.7911 136.7777 129.1428 113.1017 93.1844 72.8444 54.3360 110.7338]
Next let's try that again with a conforming sample:
>> sample = wblrnd(0.5, 2, [1000,1]); % sample from a Weibull distribution
>> pd = fitdist(sample, 'weibull')
pd =
WeibullDistribution
Weibull distribution
A = 0.496413 [0.481027, 0.512292]
B = 2.07314 [1.97524, 2.17589]
>> [h,p] = chi2gof(sample, 'CDF',pd, 'Alpha',0.05)
h =
0
p =
0.7340
The test now clearly passes with a high p-value.
EDIT:
Looking at the histogram you've shown, it does look like the data follow a Weibull distribution, although there might be some outliers (look at the right side of the histogram), which might explain why you are getting bad p-values. Consider preprocessing your data to handle extreme outliers.
Here is an example where I simulate outlier values:
% 5000 samples from a Weibull distribution
pd = makedist('Weibull', 'a',4.0420, 'b',2.0853);
sample = random(pd, [5000 1]);
%sample = wblrnd(4.0420, 2.0853, [5000 1]);
% add 20 outlier instances
sample(1:20) = [rand(10,1)+15; rand(10,1)+25];
% hypothesis tests using original distribution
[h,p,st] = chi2gof(sample, 'CDF',pd, 'Alpha',0.05)
[h,p,ad,cv] = adtest(sample, 'Distribution',pd)
% hypothesis tests using empirical distribution
[h,p,st] = chi2gof(sample, 'CDF',fitdist(sample,'Weibull'))
[h,p,ad,cv] = adtest(sample, 'Distribution', 'Weibull')
% show histogram
histfit(sample, 20, 'Weibull')
% chi-squared test
h =
1
p =
0.0382
st =
chi2stat: 8.4162
df: 3
edges: [0.1010 2.6835 5.2659 7.8483 25.9252]
O: [1741 2376 764 119]
E: [1.7332e+03 2.3857e+03 788.6020 92.5274]
% AD test
h =
1
p =
1.2000e-07
ad =
Inf
cv =
2.4924
The outliers are causing the distribution tests to fail (null hypothesis rejected). Still, I couldn't reproduce getting a NaN p-value (you might want to check this related question on Stats.SE about getting NaN p-values).