Adonis result does not match the NMDS plot? - vegan

I'm not a statistician and kinda working through this blindly, but why does my NMDS plot of Bray-Curtis dissimilarities show clear groupings for status (blue and pink) while adonis says that only time point is significantly different?
distance_calc <- phyloseq::distance(rarified2, "bray")
sampledf <- data.frame(sample_data(rarified2))
adonisresult<- adonis2(formula = distance_calc ~ Status * Time.Point, data = sampledf, type = "bray")
> adonisresult
Permutation test for adonis under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999
adonis2(formula = adonisf, data = sampledf, type = "bray")
                  Df SumOfSqs      R2      F Pr(>F)
Status             1   0.2534 0.05397 1.2969  0.198
Time.Point         1   0.4277 0.09111 2.1893  0.016 *
Status:Time.Point  1   0.1061 0.02259 0.5429  0.901
Residual          20   3.9072 0.83232
Total             23   4.6943 1.00000
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

mgcv: Difference between s(x, by=cat) and s(cat, bs='re')

What is the difference between adding a by= parameter to a smooth and adding a random effect smooth?
I've tried both, and get different results. E.g.:
library(mgcv)
library(tibble) # provides tibble(), used below
set.seed(26)
gam.df <- tibble(y=rnorm(400),
x1=rnorm(400),
cat=factor(rep(1:4, each=100)))
gam0 <- gam(y ~ s(x1, by=cat), data=gam.df)
summary(gam0)
produces:
15:15:39> summary(gam0)
Family: gaussian
Link function: identity
Formula:
y ~ s(x1, by = cat)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.001275 0.049087 -0.026 0.979
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x1):cat1 1 1 7.437 0.00667 **
s(x1):cat2 1 1 0.047 0.82935
s(x1):cat3 1 1 0.393 0.53099
s(x1):cat4 1 1 0.019 0.89015
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.00968 Deviance explained = 1.96%
GCV = 0.97413 Scale est. = 0.96195 n = 400
On the other hand:
gam1 <- gam(y ~ s(x1) + s(cat, bs='re'), data=gam.df)
summary(gam1)
produces:
15:16:33> summary(gam1)
Family: gaussian
Link function: identity
Formula:
y ~ s(x1) + s(cat, bs = "re")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0001211 0.0572271 0.002 0.998
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x1) 1.0000 1 2.359 0.125
s(cat) 0.7883 3 0.356 0.256
R-sq.(adj) = 0.00594 Deviance explained = 1.04%
GCV = 0.97236 Scale est. = 0.96558 n = 400
I understand that by= shows the summary by each factor level, but shouldn't the overall results of the model such as R^2 be the same?
The factor by model, gam0, contains a separate smooth of x1 for each level of cat, but doesn't include anything specifically for the means of y in each group[*] because it is misspecified. Compare this with gam1, which has a single smooth of x1 plus group means for the levels of cat.
Even though you generated random data without any smooth or group level effects, the gam0 model is potentially a much more complex and flexible model, as it contains 4 separate smooths, each using potentially 9 degrees of freedom. Your gam1 has a single smooth of x1, which uses up to 9 degrees of freedom, plus something between 0 and 4 degrees of freedom for the random effect smooth. gam0 is simply exploiting random variation in the data that can be explained a little bit by those extra potential degrees of freedom. You can see this in the summaries: gam0 explains roughly twice the deviance of gam1 (1.96% vs 1.04%), yet the adjusted R-squared is close to zero for both models (0.00968 vs 0.00594), so neither explains a meaningful amount of variation.
r$> library("gratia")
r$> smooths(gam0)
[1] "s(x1):cat1" "s(x1):cat2" "s(x1):cat3" "s(x1):cat4"
r$> smooths(gam1)
[1] "s(x1)" "s(cat)"
[*] Note that your by model should be
gam0 <- gam(y ~ cat + s(x1, by=cat), data=gam.df)
because the smooths created by s(x1, by=cat) are subject to an identifiability constraint (as there's a constant term, the intercept, in the model). This constraint is a sum-to-zero constraint, which means that the individual smooths do not contain the group means. Without the parametric cat term, the smooths are forced to model not only the way y changes as a function of x1 in each group but also the magnitude of y in the respective groups, yet they have no functions in the span of the basis that could represent such constant (magnitude) effects.
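A minimal sketch of the corrected factor-by model next to the random-effect model, refitting the simulated data from the question. It assumes mgcv is installed and uses data.frame() in place of tibble() to avoid an extra dependency; the name gam0_fixed is made up for this illustration.

```r
library(mgcv)

# Same simulated data as in the question: pure noise, 4 groups of 100
set.seed(26)
gam.df <- data.frame(y   = rnorm(400),
                     x1  = rnorm(400),
                     cat = factor(rep(1:4, each = 100)))

# Factor-by model with the parametric group means included
gam0_fixed <- gam(y ~ cat + s(x1, by = cat), data = gam.df)

# Single smooth plus random intercepts for the groups
gam1 <- gam(y ~ s(x1) + s(cat, bs = "re"), data = gam.df)

# The first coefficients of gam0_fixed are the intercept and group contrasts
print(names(coef(gam0_fixed))[1:4])

# Compare the two fits on a common scale
print(AIC(gam0_fixed, gam1))
```

The two models still differ (four group-specific smooths versus one shared smooth), so their summaries are not expected to agree; the point is only that the group means now have terms of their own in the by model.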

regression separately for specific variable

I am trying to run a lm/glm between two variables "area" and "intensity".
I ran a linear model regression between the variables with all rows combined and got summary results as below. I want to run the lm for the two variables individually for each city (A/B/C/D/E). How can I modify/loop the script such that I do not have to run the script 5 times, and the r-squared value and model results are added in the dataframe?
R1 <- lm(formula = area ~ intensity,
data = df1)
Call:
lm(formula = area ~ intensity, data = df1)
Residuals:
Min 1Q Median 3Q Max
-2716.1 -1540.5 -684.3 1588.8 2686.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1646.30 569.73 2.890 0.00976 **
intensity -333.10 42.73 -7.795 3.54e-07 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1790 on 18 degrees of freedom
Multiple R-squared: 0.7715, Adjusted R-squared: 0.7588
F-statistic: 60.77 on 1 and 18 DF, p-value: 3.537e-07
Here is a way to store the full model summaries in a list and to add the fitted values and R-squared to the primary data frame:
results <- list()
cities <- unique(df1$city)
df1$predicted <- NA_real_
df1$r_square <- NA_real_
for (ct in cities) {
rows <- df1$city == ct
R1 <- lm(area ~ intensity, data = df1[rows, ])
results[[as.character(ct)]] <- summary(R1) # if you want to store everything
# write the results straight back to this city's rows; merging a
# per-observation table by 'city' would duplicate rows
df1$predicted[rows] <- predict(R1, newdata = df1[rows, ])
df1$r_square[rows] <- summary(R1)$r.squared
}
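As an alternative without an explicit loop, the per-city fits can be sketched with split() and lapply(). The data below are simulated purely for illustration (the real df1 is not shown in the question), and per_city is a made-up name; only the column names city, area and intensity come from the question.

```r
# Toy stand-in for df1: 5 cities with 4 observations each (simulated)
set.seed(1)
df1 <- data.frame(city = rep(c("A", "B", "C", "D", "E"), each = 4),
                  intensity = runif(20, min = 1, max = 25))
df1$area <- 1600 - 330 * df1$intensity + rnorm(20, sd = 500)

# One lm per city
fits <- lapply(split(df1, df1$city), function(d) lm(area ~ intensity, data = d))

# One row of model results per city
per_city <- data.frame(
  city      = names(fits),
  intercept = sapply(fits, function(m) coef(m)[["(Intercept)"]]),
  slope     = sapply(fits, function(m) coef(m)[["intensity"]]),
  r_square  = sapply(fits, function(m) summary(m)$r.squared),
  row.names = NULL
)

# Because per_city has exactly one row per city, this merge keeps
# the original number of rows in df1
df1 <- merge(df1, per_city, by = "city")
print(per_city)
```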

Deriving prediction efficiency and prediction errors for Ensemble Machine Learning model stacks

I am trying to derive prediction errors for ensemble models fitted using makeStackedLearner in the mlr package. These are the steps I am following:
> library(mlr)
> library(matrixStats)
> data(BostonHousing, package = "mlbench")
> tsk = makeRegrTask(data = BostonHousing, target = "medv")
> BostonHousing$chas = as.numeric(BostonHousing$chas)
> base = c("regr.rpart", "regr.svm", "regr.ranger")
> lrns = lapply(base, makeLearner)
> m = makeStackedLearner(base.learners = lrns,
+ predict.type = "response", method = "stack.cv", super.learner = "regr.lm")
> tmp = train(m, tsk)
> summary(tmp$learner.model$super.model$learner.model)
Call:
stats::lm(formula = f, data = d)
Residuals:
Min 1Q Median 3Q Max
-10.8014 -1.5154 -0.2479 1.2160 23.6530
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.76991 0.43211 -6.410 3.35e-10 ***
regr.rpart -0.09575 0.04858 -1.971 0.0493 *
regr.svm 0.17379 0.07710 2.254 0.0246 *
regr.ranger 1.04503 0.08904 11.736 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.129 on 502 degrees of freedom
Multiple R-squared: 0.885, Adjusted R-squared: 0.8843
F-statistic: 1287 on 3 and 502 DF, p-value: < 2.2e-16
> res = predict(tmp, tsk)
Note I use the method = "stack.cv" which means that any time the models get refitted using makeStackedLearner the numbers will be slightly different. My first question is:
Is the R-square derived from the super.learner model an objective measure of the predictive power? (I assume because it is based on the Cross-Validation with refitting it should be)
> ## cross-validation R-square
> round(1-tmp$learner.model$super.model$learner.model$deviance /
+ tmp$learner.model$super.model$learner.model$null.deviance, 3)
[1] 0.872
How to derive prediction error (prediction interval) for all newdata rows?
The method I use at the moment simply derives standard deviation of the multiple independent model predictions (which is the model error):
> res.all <- getStackedBaseLearnerPredictions(tmp)
> wt <- summary(tmp$learner.model$super.model$learner.model)$coefficients[-1,4]
> res.all$model.error <- matrixStats::rowSds(
+ as.matrix(as.data.frame(res.all))[,which(wt<0.05)], na.rm=TRUE)
> res$data[1,]
id truth response
1 1 24 26.85235
> res.all$model.error[1]
[1] 2.24609
So in this case the predicted value is 26.85, the truth is 24, and the prediction error is estimated at 2.25. Again, because the stack.cv method is used, every time you refit the models you get slightly different values. Are you aware of any similar approach to derive prediction errors for ensemble models? Thanks in advance.
To derive prediction intervals (individual errors at new data) we can use the predict.lm function:
> m = makeStackedLearner(base.learners = lrns, predict.type = "response",
method = "stack.cv", super.learner = "regr.lm")
> tmp = train(m, tsk)
> tmp$learner.model$super.model$learner.model
Call:
stats::lm(formula = f, data = d)
Coefficients:
(Intercept) regr.rpart regr.svm regr.ranger
-2.5879 -0.0971 0.3549 0.8635
> res.all <- getStackedBaseLearnerPredictions(tmp)
> pred.error = predict(tmp$learner.model$super.model$learner.model,
newdata = res.all, interval = "prediction", level=2/3)
> str(pred.error)
num [1:506, 1:3] 29.3 23.3 34.6 36 33.6 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:506] "1" "2" "3" "4" ...
..$ : chr [1:3] "fit" "lwr" "upr"
> summary(tmp$learner.model$super.model$learner.model$residuals)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-11.8037 -1.5931 -0.3161 0.0000 1.1951 29.2145
> mean((pred.error[,3]-pred.error[,2])/2)
[1] 3.253142
This is an example with an lm model as the super learner. The level argument sets the coverage probability of the interval (2/3 corresponds approximately to plus or minus 1 standard deviation). Prediction errors at newdata should be somewhat larger than those obtained with the training data (depending on the degree of extrapolation). This approach could also be extended to e.g. Random Forest models (see the ranger package) and to quantile regression Random Forest derivation of prediction intervals (Hengl et al. 2019). Note that for this type of analysis there should be at least two base learners (three recommended).
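To make explicit what predict.lm returns with interval = "prediction", here is a small base R sketch on a plain lm (not a stacked learner). The mtcars model and the newdata values are arbitrary choices for illustration; the half-width of the interval is reproduced by hand from the standard error of the fit and the residual standard deviation.

```r
# A plain linear model standing in for the super learner
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Arbitrary new observations (illustrative values)
newdata <- data.frame(wt = c(2.5, 3.5), hp = c(100, 180))

# Prediction interval with ~1 standard deviation coverage
pred_int <- predict(fit, newdata = newdata,
                    interval = "prediction", level = 2/3)
half_width <- (pred_int[, "upr"] - pred_int[, "lwr"]) / 2

# The same half-width by hand:
# t quantile * sqrt(se of the fitted mean^2 + residual variance)
se <- predict(fit, newdata = newdata, se.fit = TRUE)
t_q <- qt(1 - (1 - 2/3) / 2, df = fit$df.residual)
manual <- t_q * sqrt(se$se.fit^2 + se$residual.scale^2)

print(cbind(half_width, manual))
```

The half-width is always larger than the residual standard deviation alone, because it also includes the uncertainty of the fitted mean, which grows as newdata moves away from the training data.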

stan number of effective sample size

I reproduced the results of a hierarchical model using the rethinking package with just rstan() and I am just curious why n_eff is not closer.
Here is the model with random intercepts for 2 groups (intercept_x2) using the rethinking package:
Code:
response = c(rnorm(500,0,1),rnorm(500,200,10))
predictor1_continuous = rnorm(1000)
predictor2_categorical = factor(c(rep("A",500),rep("B",500) ))
data = data.frame(y = response, x1 = predictor1_continuous, x2 = predictor2_categorical)
head(data)
library(rethinking)
m22 <- map2stan(
alist(
y ~ dnorm( mu , sigma ) ,
mu <- intercept + intercept_x2[x2] + beta*x1 ,
intercept ~ dnorm(0,10),
intercept_x2[x2] ~ dnorm(0, sigma_2),
beta ~ dnorm(0,10),
sigma ~ dnorm(0, 10),
sigma_2 ~ dnorm(0,10)
) ,
data=data , chains=1 , iter=5000 , warmup=500 )
precis(m22, depth = 2)
                  Mean StdDev lower 0.89 upper 0.89 n_eff Rhat
intercept         9.96   9.59      -5.14      25.84  1368    1
intercept_x2[1]  -9.94   9.59     -25.55       5.43  1371    1
intercept_x2[2] 189.68   9.59     173.28     204.26  1368    1
beta              0.06   0.22      -0.27       0.42  3458    1
sigma             6.94   0.16       6.70       7.20  2927    1
sigma_2          43.16   5.01      35.33      51.19  2757    1
Now here is the same model in rstan():
# create a numeric vector to indicate the categorical groups
data$GROUP_ID = match( data$x2, levels( data$x2 ) )
library(rstan)
standat <- list(
N = nrow(data),
y = data$y,
x1 = data$x1,
GROUP_ID = data$GROUP_ID,
nGROUPS = 2
)
stanmodelcode = '
data {
int<lower=1> N;
int nGROUPS;
real y[N];
real x1[N];
int<lower=1, upper=nGROUPS> GROUP_ID[N];
}
transformed data{
}
parameters {
real intercept;
vector[nGROUPS] intercept_x2;
real beta;
real<lower=0> sigma;
real<lower=0> sigma_2;
}
transformed parameters { // none needed
}
model {
real mu;
// priors
intercept~ normal(0,10);
intercept_x2 ~ normal(0,sigma_2);
beta ~ normal(0,10);
sigma ~ normal(0,10);
sigma_2 ~ normal(0,10);
// likelihood
for(i in 1:N){
mu = intercept + intercept_x2[ GROUP_ID[i] ] + beta*x1[i];
y[i] ~ normal(mu, sigma);
}
}
'
fit22 = stan(model_code=stanmodelcode, data=standat, iter=5000, warmup=500, chains = 1)
fit22
Inference for Stan model: b212ebc67c08c77926c59693aa719288.
1 chains, each with iter=5000; warmup=500; thin=1;
post-warmup draws per chain=4500, total post-warmup draws=4500.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
intercept 10.14 0.30 9.72 -8.42 3.56 10.21 16.71 29.19 1060 1
intercept_x2[1] -10.12 0.30 9.73 -29.09 -16.70 -10.25 -3.50 8.36 1059 1
intercept_x2[2] 189.50 0.30 9.72 170.40 182.98 189.42 196.09 208.05 1063 1
beta 0.05 0.00 0.21 -0.37 -0.10 0.05 0.20 0.47 3114 1
sigma 6.94 0.00 0.15 6.65 6.84 6.94 7.05 7.25 3432 1
sigma_2 43.14 0.09 4.88 34.38 39.71 42.84 46.36 53.26 3248 1
lp__ -2459.75 0.05 1.71 -2463.99 -2460.68 -2459.45 -2458.49 -2457.40 1334 1
Samples were drawn using NUTS(diag_e) at Thu Aug 31 15:53:09 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
My Questions:
1. n_eff is larger with rethinking. There are simulation differences, but do you think something else is going on here?
2. Besides n_eff being different, the percentiles of the posterior distributions are also different. I expected rethinking and rstan to return similar results with 5000 iterations, since rethinking just calls rstan. Are differences like that normal, or do the two implementations differ in some way?
3. I created data$GROUP_ID to indicate the categorical groupings. Is this the correct way to incorporate categorical variables into a hierarchical model in rstan? I have 2 groups here; if I had 50 groups I would use the same kind of data$GROUP_ID vector, but is that the standard way?
Thank you.
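On the group index specifically, a small base R check (toy factor, made-up variable names) suggests that match(x, levels(x)) yields the same integer codes as as.integer(x) on a factor, and that nlevels() can supply the group count instead of a hard-coded 2:

```r
# Toy factor standing in for data$x2
x2 <- factor(c(rep("A", 3), rep("B", 2)))

# The index as built in the question
id_match <- match(x2, levels(x2))

# The usual shortcut: integer codes of the factor
id_int <- as.integer(x2)

print(identical(id_match, id_int))

# Number of groups derived from the data rather than typed by hand
nGROUPS <- nlevels(x2)
print(nGROUPS)
```

Either form gives indices running from 1 to the number of levels, which is what the int<lower=1, upper=nGROUPS> declaration in the Stan data block expects, so the same construction should carry over to 50 groups.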

How to find subset selection for linear regression model?

I am working with mtcars dataset and using linear regression
data(mtcars)
fit <- lm(mpg ~ ., data = mtcars)
summary(fit)
When I fit the model with lm it shows the result like this
Call:
lm(formula = mpg ~ ., data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.5087 -1.3584 -0.0948 0.7745 4.6251
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.87913 20.06582 1.190 0.2525
cyl6 -2.64870 3.04089 -0.871 0.3975
cyl8 -0.33616 7.15954 -0.047 0.9632
disp 0.03555 0.03190 1.114 0.2827
hp -0.07051 0.03943 -1.788 0.0939 .
drat 1.18283 2.48348 0.476 0.6407
wt -4.52978 2.53875 -1.784 0.0946 .
qsec 0.36784 0.93540 0.393 0.6997
vs1 1.93085 2.87126 0.672 0.5115
amManual 1.21212 3.21355 0.377 0.7113
gear4 1.11435 3.79952 0.293 0.7733
gear5 2.52840 3.73636 0.677 0.5089
carb2 -0.97935 2.31797 -0.423 0.6787
carb3 2.99964 4.29355 0.699 0.4955
carb4 1.09142 4.44962 0.245 0.8096
carb6 4.47757 6.38406 0.701 0.4938
carb8 7.25041 8.36057 0.867 0.3995
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.833 on 15 degrees of freedom
Multiple R-squared: 0.8931, Adjusted R-squared: 0.779
F-statistic: 7.83 on 16 and 15 DF, p-value: 0.000124
I found that none of the variables are marked as significant at the 0.05 significance level.
To find the significant variables, I want to do subset selection to find the best pair of variables as predictors of the response variable mpg.
The function regsubsets in the package leaps does best subset regression (see ?leaps). Adapting your code:
library(leaps)
regfit <- regsubsets(mpg ~., data = mtcars)
summary(regfit)
# or for a more visual display
plot(regfit,scale="Cp")
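Since the question asks for the best pair of predictors, the same idea can also be sketched in base R by brute force: fit every two-variable model and keep the one with the highest adjusted R-squared. This is an exhaustive search written for illustration, not the leaps algorithm, and it assumes the stock numeric mtcars (the question's output suggests some columns had been converted to factors). The names pair_list, adj_r2 and best_pair are made up here.

```r
data(mtcars)

# All candidate predictors and all unordered pairs of them
predictors <- setdiff(names(mtcars), "mpg")
pair_list <- combn(predictors, 2, simplify = FALSE)

# Adjusted R-squared of mpg ~ a + b for every pair
adj_r2 <- sapply(pair_list, function(p) {
  fit <- lm(reformulate(p, response = "mpg"), data = mtcars)
  summary(fit)$adj.r.squared
})

# Best pair and its model
best_pair <- pair_list[[which.max(adj_r2)]]
best_fit <- lm(reformulate(best_pair, response = "mpg"), data = mtcars)
print(best_pair)
print(summary(best_fit))
```

With 10 predictors there are only 45 pairs, so the exhaustive loop is cheap; for larger subset sizes or many predictors, regsubsets is the more practical tool. Keep in mind that p-values from a model chosen this way are optimistic, because the same data were used to select the variables.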