Regression separately for a specific variable

I am trying to run an lm/glm between two variables, "area" and "intensity".
I ran a linear regression between the variables with all rows combined and got the summary results below. I want to run the lm for the two variables individually for each city (A/B/C/D/E). How can I modify/loop the script so that I do not have to run it 5 times, and so that the r-squared value and model results are added to the dataframe?
R1 <- lm(formula = area ~ intensity,
data = df1)
Call:
lm(formula = area ~ intensity, data = df1)
Residuals:
Min 1Q Median 3Q Max
-2716.1 -1540.5 -684.3 1588.8 2686.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1646.30 569.73 2.890 0.00976 **
intensity -333.10 42.73 -7.795 3.54e-07 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1790 on 18 degrees of freedom
Multiple R-squared: 0.7715, Adjusted R-squared: 0.7588
F-statistic: 60.77 on 1 and 18 DF, p-value: 3.537e-07

I am sharing a way to store a list of model outputs and also a way to put the results back into the primary dataframe:
results <- list()
cities <- unique(df1$city)
for (i in seq_along(cities)) {
  ## fit the model for one city at a time
  R1 <- lm(area ~ intensity, data = df1[df1$city == cities[i], ])
  results[[cities[i]]] <- summary(R1)  # if you want to store everything
  temp_df <- data.frame(predicted = fitted(R1))
  temp_df$city <- cities[i]
  temp_df$r_square <- summary(R1)$r.squared
  if (i == 1) result_df <- temp_df else result_df <- rbind(result_df, temp_df)
}
## merge the per-city R-squared back into the original dataframe
## (keeping only the city-level columns avoids duplicating rows in the merge)
df1 <- merge(df1, unique(result_df[, c("city", "r_square")]), by = "city")
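If you prefer a more compact alternative, the same per-city fits can be collected with dplyr and broom (a minimal sketch; it assumes those packages are installed and that df1 contains the columns city, area and intensity):
library(dplyr)
library(broom)
## one row of fit statistics (including r.squared) per city
per_city <- df1 %>%
  group_by(city) %>%
  group_modify(~ glance(lm(area ~ intensity, data = .x)))
## attach the per-city R-squared to the original dataframe
df1 <- left_join(df1, select(per_city, city, r_square = r.squared), by = "city")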

Adonis result does not match the NMDS plot?

I'm not a statistician and am somewhat working through this blindly, but why does my NMDS plot of Bray-Curtis dissimilarities show clear groupings for status (blue and pink), while adonis says that time point is significantly different?
distance_calc <- phyloseq::distance(rarified2, "bray")
sampledf <- data.frame(sample_data(rarified2))
adonisresult<- adonis2(formula = distance_calc ~ Status * Time.Point, data = sampledf, type = "bray")
> adonisresult
Permutation test for adonis under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999
adonis2(formula = adonisf, data = sampledf, type = "bray")
Df SumOfSqs R2 F Pr(>F)
Status 1 0.2534 0.05397 1.2969 0.198
Time.Point 1 0.4277 0.09111 2.1893 0.016 *
Status:Time.Point 1 0.1061 0.02259 0.5429 0.901
Residual 20 3.9072 0.83232
Total 23 4.6943 1.00000
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

How to run an exponential decay mixed model?

I am not familiar with nonlinear regression and would appreciate some help with running an exponential decay model in R. Please see the graph for what the data look like. My hunch is that an exponential model might be a good choice. I have one fixed effect and one random effect: y ~ x + (1|random factor). How do I get the starting values for the exponential model (please assume that I know nothing about nonlinear regression) in R? How do I subsequently run a nonlinear model with these starting values? Could anyone please help me with the logic as well as the R code?
As I am not familiar with nonlinear regression, I haven't been able to attempt it in R.
[raw plot of the data omitted]
The correct syntax will depend on your experimental design and model, but I hope to give you a general idea of how to get started.
We begin by generating some data that should match the type of data you are working with. You had mentioned a fixed factor and a random one. Here, the fixed factor is represented by the variable treatment and the random factor is represented by the variable grouping_factor.
library(nlraa)
library(nlme)
library(ggplot2)
## Setting this seed should allow you to reach the same result as me
set.seed(3232333)
example_data <- expand.grid(treatment = c("A", "B"),
grouping_factor = c('1', '2', '3'),
replication = c(1, 2, 3),
xvar = 1:15)
The next step is to create some "observations". Here, we use an exponential function y = a * exp(c * x) and some random noise to create the data. Also, we add a constant to treatment A just to create some treatment differences.
example_data$y <- ave(example_data$xvar, example_data[, c('treatment', 'replication', 'grouping_factor')],
FUN = function(x) {expf(x = x,
a = 10,
c = -0.3) + rnorm(1, 0, 0.6)})
example_data$y[example_data$treatment == 'A'] <- example_data$y[example_data$treatment == 'A'] + 0.8
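Before fitting, it can help to take a quick look at the simulated data (an optional sketch; ggplot2 was already loaded above):
ggplot(example_data, aes(x = xvar, y = y, colour = treatment)) +
  geom_point() +
  facet_wrap(~ grouping_factor)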
All right, now we start fitting the model.
## Create a grouped data frame
exampleG <- groupedData(y ~ xvar|grouping_factor, data = example_data)
## Fit a separate model to each grouping level
fitL <- nlsList(y ~ SSexpf(xvar, a, c), data = exampleG)
## Combine the individual fits into a nonlinear mixed-effects model
fit1 <- nlme(fitL)
## Grab the fixed-effect coefficients of the general model
fxf <- fixed.effects(fit1)
## Add treatment as a fixed effect. Also, use the coefficients from the previous
## model as starting values.
fit2 <- update(fit1, fixed = a + c ~ treatment,
               start = c(fxf[1], 0,
                         fxf[2], 0))
Looking at the model output, it will give you information like the following:
Nonlinear mixed-effects model fit by maximum likelihood
Model: y ~ SSexpf(xvar, a, c)
Data: exampleG
AIC BIC logLik
475.8632 504.6506 -229.9316
Random effects:
Formula: list(a ~ 1, c ~ 1)
Level: grouping_factor
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
a.(Intercept) 3.254827e-04 a.(In)
c.(Intercept) 1.248580e-06 0
Residual 5.670317e-01
Fixed effects: a + c ~ treatment
Value Std.Error DF t-value p-value
a.(Intercept) 9.634383 0.2189967 264 43.99329 0.0000
a.treatmentB 0.353342 0.3621573 264 0.97566 0.3301
c.(Intercept) -0.204848 0.0060642 264 -33.77976 0.0000
c.treatmentB -0.092138 0.0120463 264 -7.64867 0.0000
Correlation:
a.(In) a.trtB c.(In)
a.treatmentB -0.605
c.(Intercept) -0.785 0.475
c.treatmentB 0.395 -0.792 -0.503
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-1.93208903 -0.34340037 0.04767133 0.78924247 1.95516431
Number of Observations: 270
Number of Groups: 3
Then, if you wanted to visualize the model fit, you could do the following.
## Here we store the model predictions for visualization purposes
predictionsDf <- cbind(example_data,
predict_nlme(fit2, interval = 'conf'))
## Here we make a graph to check it out
ggplot()+
geom_ribbon(data = predictionsDf,
aes( x = xvar , ymin = Q2.5, ymax = Q97.5, fill = treatment),
color = NA, alpha = 0.3)+
geom_point(data = example_data, aes( x = xvar, y = y, col = treatment))+
geom_line(data = predictionsDf, aes(x = xvar, y = Estimate, col = treatment), size = 1.1)
This shows the model fit.
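If you would rather not rely on the self-starting SSexpf() function, a common way to obtain rough starting values by hand is to linearise the model: taking logs of y = a * exp(c * x) gives log(y) = log(a) + c * x, so an ordinary lm() of log(y) on x yields starting values for a and c. A minimal sketch (it assumes the y values used are positive, so noisy non-positive observations are dropped first):
## linearise the exponential model to get starting values
pos <- example_data$y > 0
start_fit <- lm(log(y) ~ xvar, data = example_data[pos, ])
a_start <- unname(exp(coef(start_fit)[1]))  # intercept on the log scale -> a
c_start <- unname(coef(start_fit)[2])       # slope on the log scale -> c
## these can then be passed as starting values, e.g.
## nls(y ~ a * exp(c * xvar), data = example_data, start = list(a = a_start, c = c_start))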

mgcv: Difference between s(x, by=cat) and s(cat, bs='re')

What is the difference between adding a by= parameter to a smooth and adding a random effect smooth?
I've tried both, and get different results. E.g.:
library(mgcv)
library(tibble)   # the example data frame below is built with tibble()
set.seed(26)
gam.df <- tibble(y = rnorm(400),
                 x1 = rnorm(400),
                 cat = factor(rep(1:4, each = 100)))
gam0 <- gam(y ~ s(x1, by=cat), data=gam.df)
summary(gam0)
produces:
15:15:39> summary(gam0)
Family: gaussian
Link function: identity
Formula:
y ~ s(x1, by = cat)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.001275 0.049087 -0.026 0.979
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x1):cat1 1 1 7.437 0.00667 **
s(x1):cat2 1 1 0.047 0.82935
s(x1):cat3 1 1 0.393 0.53099
s(x1):cat4 1 1 0.019 0.89015
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.00968 Deviance explained = 1.96%
GCV = 0.97413 Scale est. = 0.96195 n = 400
On the other hand:
gam1 <- gam(y ~ s(x1) + s(cat, bs='re'), data=gam.df)
summary(gam1)
produces:
15:16:33> summary(gam1)
Family: gaussian
Link function: identity
Formula:
y ~ s(x1) + s(cat, bs = "re")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0001211 0.0572271 0.002 0.998
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x1) 1.0000 1 2.359 0.125
s(cat) 0.7883 3 0.356 0.256
R-sq.(adj) = 0.00594 Deviance explained = 1.04%
GCV = 0.97236 Scale est. = 0.96558 n = 400
I understand that by= shows the summary by each factor level, but shouldn't the overall results of the model such as R^2 be the same?
The factor by model, gam0, contains a separate smooth of x1 for each level of cat, but doesn't include anything specifically for the means of y in each group[*], because it is misspecified. Compare this with gam1, which has a single smooth of x1 plus group means for the levels of cat.
Even though you generated random data without any smooth or group-level effects, the gam0 model is a potentially much more complex and flexible model, as it contains 4 separate smooths, each using up to 9 degrees of freedom. Your gam1 has a single smooth of x1 which uses up to 9 degrees of freedom, plus something between 0 and 4 degrees of freedom for the random-effect smooth. gam0 is simply exploiting random variation in the data that can be explained a little by those extra potential degrees of freedom. You can see this in the adjusted R-squared (R-sq.(adj)), which is lower for gam0 even though it explains roughly twice the deviance that gam1 does (not that either explains a meaningful amount of deviance).
r$> library("gratia")
r$> smooths(gam0)
[1] "s(x1):cat1" "s(x1):cat2" "s(x1):cat3" "s(x1):cat4"
r$> smooths(gam1)
[1] "s(x1)" "s(cat)"
[*] Note that your by model should be
gam0 <- gam(y ~ cat + s(x1, by=cat), data=gam.df)
because the smooths created by s(x1, by=cat) are subject to an identifiability constraint (as there is a constant term, the intercept, in the model). This constraint is a sum-to-zero constraint, which means that the individual smooths do not contain the group means. Without the parametric cat term, the smooths are forced not only to model the way Y changes as a function of x1 in each group but also to model the magnitude of Y in the respective groups, yet the basis contains no constant functions that could represent such (magnitude) effects.
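As a quick check (a minimal sketch using the simulated data above), you can refit the by-model with the parametric cat term included and compare the two specifications, for example via AIC:
gam0b <- gam(y ~ cat + s(x1, by = cat), data = gam.df)  # by-smooths plus group means
summary(gam0b)
AIC(gam0b, gam1)  # compare with the random-effect model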

Deriving prediction efficiency and prediction errors for Ensemble Machine Learning model stacks

I am trying to derive prediction errors for ensemble models fitted using makeStackedLearner in the mlr package. These are the steps I am following:
> library(mlr)
> library(matrixStats)
> data(BostonHousing, package = "mlbench")
> tsk = makeRegrTask(data = BostonHousing, target = "medv")
> BostonHousing$chas = as.numeric(BostonHousing$chas)
> base = c("regr.rpart", "regr.svm", "regr.ranger")
> lrns = lapply(base, makeLearner)
> m = makeStackedLearner(base.learners = lrns,
+ predict.type = "response", method = "stack.cv", super.learner = "regr.lm")
> tmp = train(m, tsk)
> summary(tmp$learner.model$super.model$learner.model)
Call:
stats::lm(formula = f, data = d)
Residuals:
Min 1Q Median 3Q Max
-10.8014 -1.5154 -0.2479 1.2160 23.6530
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.76991 0.43211 -6.410 3.35e-10 ***
regr.rpart -0.09575 0.04858 -1.971 0.0493 *
regr.svm 0.17379 0.07710 2.254 0.0246 *
regr.ranger 1.04503 0.08904 11.736 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.129 on 502 degrees of freedom
Multiple R-squared: 0.885, Adjusted R-squared: 0.8843
F-statistic: 1287 on 3 and 502 DF, p-value: < 2.2e-16
> res = predict(tmp, tsk)
Note that I use method = "stack.cv", which means that any time the models are refitted using makeStackedLearner the numbers will be slightly different. My first question is:
Is the R-squared derived from the super.learner model an objective measure of the predictive power? (I assume that, because it is based on cross-validation with refitting, it should be.)
> ## cross-validation R-square
> round(1-tmp$learner.model$super.model$learner.model$deviance /
+ tmp$learner.model$super.model$learner.model$null.deviance, 3)
[1] 0.872
How can I derive the prediction error (prediction interval) for all newdata rows?
The method I use at the moment simply takes the standard deviation of the multiple independent base-learner predictions (which I treat as the model error):
> res.all <- getStackedBaseLearnerPredictions(tmp)
> wt <- summary(tmp$learner.model$super.model$learner.model)$coefficients[-1,4]
> res.all$model.error <- matrixStats::rowSds(
+ as.matrix(as.data.frame(res.all))[,which(wt<0.05)], na.rm=TRUE)
> res$data[1,]
id truth response
1 1 24 26.85235
> res.all$model.error[1]
[1] 2.24609
So in this case the predicted value is 26.85, the truth is 24, and the prediction error is estimated at 2.24. Again, because the stack.cv method is used, every time you refit the models you get slightly different values. Are you aware of any similar approach to deriving prediction errors for ensemble models? Thanks in advance.
To derive prediction intervals (individual errors at new data) we can use the predict.lm function:
> m = makeStackedLearner(base.learners = lrns, predict.type = "response",
method = "stack.cv", super.learner = "regr.lm")
> tmp = train(m, tsk)
> tmp$learner.model$super.model$learner.model
Call:
stats::lm(formula = f, data = d)
Coefficients:
(Intercept) regr.rpart regr.svm regr.ranger
-2.5879 -0.0971 0.3549 0.8635
> res.all <- getStackedBaseLearnerPredictions(tmp)
> pred.error = predict(tmp$learner.model$super.model$learner.model,
newdata = res.all, interval = "prediction", level=2/3)
> str(pred.error)
num [1:506, 1:3] 29.3 23.3 34.6 36 33.6 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:506] "1" "2" "3" "4" ...
..$ : chr [1:3] "fit" "lwr" "upr"
> summary(tmp$learner.model$super.model$learner.model$residuals)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-11.8037 -1.5931 -0.3161 0.0000 1.1951 29.2145
> mean((pred.error[,3]-pred.error[,2])/2)
[1] 3.253142
This is an example with an lm model as the super learner. The level argument can be used to pass different probabilities (2/3 corresponds to roughly 1 standard deviation). Prediction errors at newdata should be somewhat higher than what you obtain with the training data (depending on the amount of extrapolation). This approach could also be extended to, e.g., Random Forest models (see the ranger package) and quantile regression forest derivation of prediction intervals (Hengl et al. 2019). Note that for this type of analysis there should be at least two base learners (three are recommended).
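For the quantile regression forest route mentioned above, a minimal sketch could look like the following. It fits ranger directly to the Boston data rather than inside the mlr stack, so treat it as an illustration of the idea and adapt it to your setup:
library(ranger)
data(BostonHousing, package = "mlbench")
## quantile regression forest via ranger
qrf <- ranger(medv ~ ., data = BostonHousing, quantreg = TRUE, num.trees = 500)
## roughly +/- 1 standard deviation corresponds to the 15.9th and 84.1st percentiles
pred.q <- predict(qrf, data = BostonHousing, type = "quantiles",
                  quantiles = c(0.159, 0.5, 0.841))
head(pred.q$predictions)
## half-width of the interval as a per-row prediction error
pred.width <- (pred.q$predictions[, 3] - pred.q$predictions[, 1]) / 2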

How can we find percentile or quantile of gamma distribution in MATLAB?

Suppose that we have this gamma distribution in MATLAB:
I want the part of the distribution with higher density (an x-axis range). How can I extract this in MATLAB? I fitted this distribution using the histfit function.
PS. My codes:
figure;
histfit(Data,20,'gamma');
[phat, pci] = gamfit(Data);
phat =
11.3360 4.2276
pci =
8.4434 3.1281
15.2196 5.7136
When you fit a gamma distribution to your data with [phat, pci] = gamfit(Data);, phat contains the MLE parameters.
You can plug this into gaminv:
x = gaminv(p, phat(1), phat(2));
where p is a vector of probabilities, e.g. p = [.2, .8].
I always base my code on the following. This is code I wrote once and now alter where necessary; maybe you will find it useful as well.
## Grouped claim data: class boundaries and observed frequencies
table2.10 <- cbind(lower = c(0, 2.5, 7.5, 12.5, 17.5, 22.5, 32.5, 47.5, 67.5, 87.5, 125, 225, 300),
                   upper = c(2.5, 7.5, 12.5, 17.5, 22.5, 32.5, 47.5, 67.5, 87.5, 125, 225, 300, Inf),
                   freq  = c(41, 48, 24, 18, 15, 14, 16, 12, 6, 11, 5, 4, 3))
## Log-likelihood of the gamma parameters p = c(alpha, beta) for the grouped data
loglik <- function(p, d) {
  upper <- d[, 2]
  lower <- d[, 1]
  n <- d[, 3]
  ll <- n * log(ifelse(upper < Inf, pgamma(upper, p[1], p[2]), 1) -
                  pgamma(lower, p[1], p[2]))
  sum(ll)
}
## Maximise the log-likelihood (fnscale = -1 turns optim into a maximiser)
p0 <- c(alpha = 0.47, beta = 0.014)
m <- optim(p0, loglik, hessian = TRUE, control = list(fnscale = -1), d = table2.10)
## 95th percentile of the fitted gamma distribution
theta <- qgamma(0.95, m$par[1], m$par[2])
theta
One can also create a 95% confidence interval by using the delta method. To do so, we need to differentiate the quantile function of the claims distribution, F^(−1)_X(0.95; α, β), with respect to α and β.
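In formula form, writing q(α, β) = F^(−1)_X(0.95; α, β) for the 95th percentile and letting Σ denote the estimated covariance matrix of the MLE (the inverse of the negative Hessian returned by optim), the delta method gives, approximately,
Var(q̂) ≈ ∇q' Σ ∇q,
where the gradient ∇q is evaluated at the MLE and is approximated below by the central differences d.alpha and d.beta.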
p <- m$par
## central finite differences for the gradient of the 95th percentile
## with respect to alpha and beta, tried at three step sizes
eps <- c(1e-5, 1e-6, 1e-7)
d.alpha <- 0 * eps
d.beta  <- 0 * eps
for (i in 1:3) {
  d.alpha[i] <- (qgamma(0.95, p[1] + eps[i], p[2]) - qgamma(0.95, p[1] - eps[i], p[2])) / (2 * eps[i])
  d.beta[i]  <- (qgamma(0.95, p[1], p[2] + eps[i]) - qgamma(0.95, p[1], p[2] - eps[i])) / (2 * eps[i])
}
d.alpha
d.beta
## covariance matrix of the MLE from the inverse of the negative Hessian
var.p <- solve(-m$hessian)
## delta-method variance of the estimated 95th percentile
var.q95 <- t(c(d.alpha[2], d.beta[2])) %*% var.p %*% c(d.alpha[2], d.beta[2])
## normal-approximation 95% confidence interval
qgamma(0.95, p[1], p[2]) + qnorm(c(0.025, 0.975)) * sqrt(c(var.q95))
It is even possible to use a parametric bootstrap on the estimates of α and β to obtain B different estimates of the 95th percentile of the loss distribution (B = 10000 in the code below), and to use these estimates to construct a 95% confidence interval:
library(mvtnorm)
B <- 10000
q.b <- rep(NA, B)
for (b in 1:B) {
  ## draw a parameter vector from the asymptotic normal distribution of the MLE
  p.b <- rmvnorm(1, p, var.p)
  if (!any(p.b < 0)) q.b[b] <- qgamma(0.95, p.b[1], p.b[2])
}
quantile(q.b, c(0.025, 0.975), na.rm = TRUE)  # na.rm in case any draws were rejected
To do the nonparametric bootstrap, we first ‘expand’ the data to reflect each individual observation. Then we sample with replacement from the line numbers, recalculate the frequency table, re-estimate the model and compute its 95th percentile.
## expand the grouped data to one line number per observation (217 in total)
line.numbers <- rep(1:13, table2.10[, "freq"])
q.b <- rep(NA, B)
table2.10b <- table2.10
for (b in 1:B) {
  ## resample observations, rebuild the frequency table, refit, store the 95th percentile
  line.numbers.b <- sample(line.numbers, size = 217, replace = TRUE)
  table2.10b[, "freq"] <- table(factor(line.numbers.b, levels = 1:13))
  m.b <- optim(m$par, loglik, hessian = TRUE, control = list(fnscale = -1),
               d = table2.10b)
  q.b[b] <- qgamma(0.95, m.b$par[1], m.b$par[2])
}
q.npb <- q.b
quantile(q.b, c(0.025, 0.975))