With gtsummary, is it possible to have N on a separate row to the column name? - gtsummary

gtsummary by default puts the number of observations in a by group beside the label for that group. This increases the width of the table... with many groups or large N, the table would quickly become very wide.
Is it possible to get gtsummary to report N on a separate row beneath the label? E.g.
> data(mtcars)
> mtcars %>%
+ select(mpg, cyl, vs, am) %>%
+ tbl_summary(by = am) %>%
+ as_tibble()
# A tibble: 6 x 3
`**Characteristic**` `**0**, N = 19` `**1**, N = 13`
<chr> <chr> <chr>
1 mpg 17.3 (14.9, 19.2) 22.8 (21.0, 30.4)
2 cyl NA NA
3 4 3 (16%) 8 (62%)
4 6 4 (21%) 3 (23%)
5 8 12 (63%) 2 (15%)
6 vs 7 (37%) 7 (54%)
would become
# A tibble: 6 x 3
`**Characteristic**` `**0**` `**1**`
<chr> <chr> <chr>
1 N = 19 N = 13
2 mpg 17.3 (14.9, 19.2) 22.8 (21.0, 30.4)
3 cyl NA NA
4 4 3 (16%) 8 (62%)
5 6 4 (21%) 3 (23%)
6 8 12 (63%) 2 (15%)
7 vs 7 (37%) 7 (54%)
(I only used as_tibble so that it was easy to show what I mean by editing it manually...)
Any idea?
Thanks!

Here is one way you could do this:
library(tidyverse)
library(gtsummary)
mtcars %>%
select(mpg, cyl, vs, am) %>%
# create a new variable to display N in table
mutate(total = 1) %>%
# this is just to reorder variables for table
select(total, everything()) %>%
tbl_summary(
by = am,
# this is to specify you only want N (and no percentage) for new total variable
statistic = total ~ "N = {N}") %>%
# this is a gtsummary function that allows you to edit the header
modify_header(all_stat_cols() ~ "**{level}**")
First, I am making a new variable that is just total observations (called total)
Then, I am customizing the way I want that variable statistic to be displayed
Then I am using gtsummary::modify_header() to remove N from the header
Additionally, if you use the flextable print engine, you can add a line break in the header itself:
mtcars %>%
select(mpg, cyl, vs, am) %>%
# create a new variable to display N in table
tbl_summary(
by = am
# this is to specify you only want N (and no percentage) for new total variable
) %>%
# this is a gtsummary function that allows you to edit the header
modify_header(all_stat_cols() ~ "**{level}**\nN = {n}") %>%
as_flex_table()
Good luck!

#kittykatstat already posted two fantastic solutions! I'll just add a slight variation :)
If you want to use the {gt} package to print the table and you're outputting to HTML, you can use the HTML tag <br> to add a line break in the header row (very similar to the \n solution already posted).
library(gtsummary)
mtcars %>%
select(mpg, cyl, vs, am) %>%
dplyr::mutate(am = factor(am, labels = c("Manual", "Automatic"))) %>%
# create a new variable to display N in table
tbl_summary(by = am) %>%
# this is a gtsummary function that allows you to edit the header
modify_header(stat_by = "**{level}**<br>N = {N}")

Related

Marginal Means accounting for the random effect uncertainty

When we have repeated measurements on an experimental unit, typically these units cannot be considered 'independent' and need to be modeled in a way that we get valid estimates for our standard errors.
When I compare the intervals obtained by computing the marginal means for the treatment using a mixed model (treating the unit as a random effect) and in the other case, first averaging over the unit and THEN runnning a simple linear model on the averaged responses, I get the exact same uncertainty intervals.
How do we incorporate the uncertainty of the measurements of the unit, into the uncertainty of what we think our treatments look like?
In order to really propogate all the uncertainty, shouldn't we see what the treatment looks like, averaged over "all possible measurements" on a unit?
``` r
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(emmeans)
library(lme4)
#> Loading required package: Matrix
library(ggplot2)
tmp <- structure(list(treatment = c("A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B", "B",
"B", "B", "B", "B"), response = c(151.27333548, 162.3933313,
159.2199999, 159.16666725, 210.82, 204.18666667, 196.97333333,
194.54666667, 154.18666667, 194.99333333, 193.48, 191.71333333,
124.1, 109.32666667, 105.32, 102.22, 110.83333333, 114.66666667,
110.54, 107.82, 105.62000069, 79.79999821, 77.58666557, 75.78666928
), experimental_unit = c("A-1", "A-1", "A-1", "A-1", "A-2", "A-2",
"A-2", "A-2", "A-3", "A-3", "A-3", "A-3", "B-1", "B-1", "B-1",
"B-1", "B-2", "B-2", "B-2", "B-2", "B-3", "B-3", "B-3", "B-3"
)), row.names = c(NA, -24L), class = c("tbl_df", "tbl", "data.frame"
))
### Option 1 - Treat the experimental unit as a random effect since there are
### 4 repeat observations for the same unit
lme4::lmer(response ~ treatment + (1 | experimental_unit), data = tmp) %>%
emmeans::emmeans(., ~ treatment) %>%
as.data.frame()
#> treatment emmean SE df lower.CL upper.CL
#> 1 A 181.0794 10.83359 4 151.00058 211.1583
#> 2 B 101.9683 10.83359 4 71.88947 132.0472
#ggplot(.,aes(treatment, emmean)) +
#geom_pointrange(aes(ymin = lower.CL, ymax = upper.CL))
### Option 2 - instead of treating the unit as random effect, we average over the
### 4 repeat observations, and run a simple linear model
tmp %>%
group_by(experimental_unit) %>%
summarise(mean_response = mean(response)) %>%
mutate(treatment = c(rep("A", 3), rep("B", 3))) %>%
lm(mean_response ~ treatment, data = .) %>%
emmeans::emmeans(., ~ treatment) %>%
as.data.frame()
#> treatment emmean SE df lower.CL upper.CL
#> 1 A 181.0794 10.83359 4 151.00058 211.1583
#> 2 B 101.9683 10.83359 4 71.88947 132.0472
#ggplot(., aes(treatment, emmean)) +
#geom_pointrange(aes(ymin = lower.CL, ymax = upper.CL))
### Whether we include a random effect for the unit, or average over it and THEN model it, we find no difference in the
### marginal means for the treatments
### How do we incoporate the variation of the repeat measurments to the marginal means of the treatments?
### Do we then ignore the variation in the 'subsamples' and simply average over them PRIOR to modeling?
<sup>Created on 2021-07-31 by the [reprex package](https://reprex.tidyverse.org) (v2.0.0)</sup>
emmeans() does take into account the errors of random effects. This is what I get when I remove the complex sequences of pipes:
> mmod = lme4::lmer(response ~ treatment + (1 | experimental_unit), data = tmp)
> emmeans(mmod, "treatment")
treatment emmean SE df lower.CL upper.CL
A 181 10.8 4 151.0 211
B 102 10.8 4 71.9 132
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
This is as shown. If I fit a fixed-effects model that accounts for experimental units as a fixed effect, I get:
> fmod = lm(response ~ treatment + experimental_unit, data = tmp)
> emmeans(fmod, "treatment")
NOTE: A nesting structure was detected in the fitted model:
experimental_unit %in% treatment
treatment emmean SE df lower.CL upper.CL
A 181 3.25 18 174.2 188
B 102 3.25 18 95.1 109
Results are averaged over the levels of: experimental_unit
Confidence level used: 0.95
The SEs of the latter results are considerably lower, and that is because the random variations in experimental_unit are modeled as fixed variations.
Apparently the piping you did accounts for the variation of the random effects and includes those in the EMMs. I think that is because you did things separately for each experimental unit and somehow combined those results. I'm not very comfortable with a sequence of pipes that is 7 steps long, and I don't understand why that results in just one set of means.
I recommend against the as.data.frame() at the end. That zaps out annotations that can be helpful in understanding what you have. If you are doing that to get more digits precision, I'll claim that those are digits you don't need, it just exaggerates the precision you are entitled to claim.
Notes on some follow-up comments
Subsequently, I am convinced that what we see in the piped operations in the second part of the OP doe indeed comprise computing the mean of each EU, then analyzing those.
Let's look at that in the context of the formal model. We have (sorry MathJax doesn't work on stackoverflow, but I'll leave the markup there anyway)
$$ Y_{ijk} = \mu + \tau_i + U_{ij} + E_{ijk} $$
where $Y_{ijk}$ is the kth response measurement on the ith treatment and jth EU in the ith treatment, and the rhs terms represent respectively the overall mean, the (fixed) treatment effects, the (random) EU effects, and the (random) error effects. We assume the random effects are all mutually independent. With a balanced design, the EMMs are just the marginal means:
$$ \bar Y_{i..} = \mu + \tau_i + \bar U_{i.} + \bar E_{i..} $$
where a '.' subscript means we averaged over that subscript. If there are n EUs per treatment and m measurements on each EU, we get that
$$ Var(\bar Y_{i..} = \sigma^2_U / n + \sigma^2_E / mn $$
Now, if we aggregate the data on EUs ahead of time, we are starting with
$$ \bar Y_{ij.} = \mu + U_{ij} + \bar E_{ij.} $$
However, if we then compute marginal means by averaging over j, we get exactly the same thing as we did before with $\bar Y_{i..}$, and the variance is exactly as already shown. That is why it doesn't matter if we aggregated first or not.

Option to cut values below a threshold in papaja::apa_table

I can't figure out how to selectively print values in a table above or below some value. What I'm looking for is known as "cut" in Revelle's psych package. MWE below.
library("psych")
library("psychTools")
derp <- fa(ability, nfactors=3)
print(derp, cut=0.5) #removes all loadings smaller than 0.5
derp <- print(derp, cut=0.5) #apa_table still doesn't print like this
Question is, how do I add that cut to an apa_table? Printing apa_table(derp) prints the entire table, including all values.
The print-method from psych does not return the formatted loadings but only the table of variance accounted for. You can, however, get the result you want by manually formatting the loadings table:
library("psych")
library("psychTools")
derp <- fa(ability, nfactors=3)
# Class `loadings` cannot be coerced to data.frame or matrix
class(derp$Structure)
[1] "loadings"
# Class `matrix` is supported by apa_table()
derp_loadings <- unclass(derp$Structure)
class(derp_loadings)
[1] "matrix"
# Remove values below "cut"
derp_loadings[derp_loadings < 0.5] <- NA
colnames(derp_loadings) <- paste("Factor", 1:3)
apa_table(
derp_loadings
, caption = "Factor loadings"
, added_stub_head = "Item"
, format = "pandoc" # Omit this in your R Markdown document
, format.args = list(na_string = "") # Don't print NA
)
*Factor loadings*
Item Factor 1 Factor 2 Factor 3
---------- --------- --------- ---------
reason.4 0.60
reason.16
reason.17 0.65
reason.19
letter.7 0.61
letter.33 0.56
letter.34 0.65
letter.58
matrix.45
matrix.46
matrix.47
matrix.55
rotate.3 0.70
rotate.4 0.73
rotate.6 0.63
rotate.8 0.63

An issue with argument "sortv" of function seqIplot()

I'm trying to plot individual sequences by means of function seqIplot() in TraMineR. These individual sequences represent work trajectories, completed by former school's graduates via a WEB questionnaire.
Using argument "sortv", I'd like to sort my sequences according to the order of the levels of one covariate, the year of graduation, named "PROMO".
"PROMO" is a factor variable contained in a data frame named "covariates.seq", gathering covariates together:
str(covariates.seq)
'data.frame': 733 obs. of 6 variables:
$ ID_SQ : Factor w/ 733 levels "1","2","3","5",..: 1 2 3 4 5 6
7 8 9 10 ...
$ SEXE : Factor w/ 2 levels "Féminin","Masculin": 1 1 1 1 2 1
1 2 2 1 ...
$ PROMO : Factor w/ 6 levels "1997","1998",..: 1 2 2 4 4 3 2 2
2 2 ...
$ DEPARTEMENT : Factor w/ 10 levels "BC","GCU","GE",..: 1 4 7 8 7 9
9 7 7 4 ...
$ NIVEAU_ADMISSION: Factor w/ 2 levels "En Premier Cycle",..: NA 1 1 1 1
1 NA 1 1 1 ...
$ FILIERE_SECTION : Factor w/ 4 levels "Cursus Classique",..: NA 4 2 NA
1 1 NA NA 4 3 ..
I'm also using "SEXE", the graduates' gender, as a grouping variable. To plot the individual sequences so, my command is as follows:
seqIplot(sequences, group = covariates.seq$SEXE,
sortv = covariates.seq$PROMO,
cex.axis = 0.7, cex.legend = 0.7)
I expected that, by using a process time axis (with the year of graduation as sequence-dependent origin), sorting the sequences according to the order of the levels of "PROMO" would give a plot with groups of sequences from the longest (for the older graduates) to the shortest (for the younger graduates).
But I've got an issue: in the output plot, the sequences don't appear to be correctly sorted according to the levels of "PROMO". Indeed, by using "sortv = covariates.seq$PROMO" as in the command above, the plot doesn't show groups of sequences from the longest to the shortest, as expected. It looks like the plot obtained without using the argument "sortv" (see Figures below).
Without using argument "sortv"
Using "sortv = covariates.seq$PROMO"
Note that I have 733 individual sequences in my object "sequences", created as follows:
labs <- c("En poste","Au chômage (d'au moins 6 mois)", "Autre situation
(d'au moins 6 mois)","En poursuite d'études (thèse ou hors
thèse)", "En reprise d'études / formation (d'au moins 6 mois)")
codes <- c("En poste", "Au chômage", "Autre situation", "En poursuite
d'études", "En reprise d'études / formation")
sequences <- seqdef(situations, alphabet = labs, states = codes, left =
NA, right = "DEL", missing = NA,
cnames = as.character(seq(0,7400/365,1/365)),
xtstep = 365)
The values of the covariates are sorted in the same order as the individual sequences. The covariate "PROMO" doesn't contain any missing value.
Something's going wrong, but what?
Thank you in advance for your help,
Best,
Arnaud.
Using a factor as sortv argument in seqIplot works fine as illustrated by the example below:
sdc <- c("aabbccdd","bbbccc","aaaddd","abcabcab")
sd <- seqdecomp(sdc, sep="")
seq <- seqdef(sd)
fac <- factor(c("2000","2001","2001","2000"))
par(mfrow=c(1,3))
seqIplot(seq, with.legend=FALSE)
seqIplot(seq, sortv=fac, with.legend=FALSE)
seqlegend(seq)

Aligning and italicising table column headings using Rmarkdown and pander

I am writing a rmarkdown document knitting to pdf with tables taken from portions of lists from the ezANOVA package. The tables are made using the pander package. Toy Rmarkdown file with toy dataset below.
---
title: "Table Doc"
output: pdf_document
---
```{r global_options, include=FALSE}
#set global knit options parameters.
knitr::opts_chunk$set(fig.width=12, fig.height=8, fig.path='Figs/',
echo=FALSE, warning=FALSE, message=FALSE, dev = 'pdf')
```
```{r, echo=FALSE}
# toy data
id <- rep(c(1,2,3,4), 5)
group1 <- factor(rep(c("A", "B"), 10))
group2 <- factor(rep(c("A", "B"), each = 10))
dv <- runif(20, min = 0, max = 10)
df <- data.frame(id, group1, group2, dv)
```
``` {r anova, echo = FALSE}
library(ez)
library(plyr)
library(pander)
# create anova object
anOb <- ezANOVA(df,
dv = dv,
wid = id,
between = c(group1, group2),
type = 3,
detailed = TRUE)
# extract the output table from the anova object, reduce it down to only desired columns
anOb <- data.frame(anOb[[1]][, c("Effect", "F", "p", "p<.05")])
# format entries in columns
anOb[,2] <- format( round (anOb[,2], digits = 1), nsmall = 1)
anOb[,3] <- format( round (anOb[,3], digits = 4), nsmall = 1)
pander(anOb, justify = c("left", "center", "center", "right"))
```
Now I have a few problems
a) For the last three columns I would like to have the column heading in the table aligned in the center, but the actual column entries underneath those headings aligned to the right.
b) I would like to have the column headings 'F' and 'p' in italics and the 'p' in the 'p<.05' column in italics also but the rest in normal font. So they read F, p and p<.05
I tried renaming the column headings using plyr::rename like so
anOb <- rename(anOb, c("F" = "italic(F)", "p" = "italic(p)", "p<.05" = ""))
But it didn't work
In markdown, you have to use the markdown syntax for italics, which is wrapping text between a star or underscore:
> names(anOb) <- c('Effect', '*F*', '*p*', '*p<.05*')
> pander(anOb)
-----------------------------------------
Effect *F* *p* *p<.05*
--------------- ------ -------- ---------
(Intercept) 52.3 0.0019 *
group1 1.3 0.3180
group2 2.0 0.2261
group1:group2 3.7 0.1273
-----------------------------------------
If you want to do that in a programmatic way, you can also use the pandoc.emphasis helper function to add the starts to a string.
But your other problem is due to a bug in the package, for which I've just proposed a fix on GH. Please feel free to give a try to that branch and report back on GH -- I will try to get some time later this week to clean up the related unit tests and merge the branch if everything seem to be OK.

Standardized coefficients for lmer model

I used to use the code below to calculate standardized coefficients of a lmer model. However, with the new version of lme the structure of the returned object has changed.
How to adapt the function stdCoef.lmer to make it work with the new lme4 version?
# Install old version of lme 4
install.packages("lme4.0", type="both",
repos=c("http://lme4.r-forge.r-project.org/repos",
getOption("repos")[["CRAN"]]))
# Load package
detach("package:lme4", unload=TRUE)
library(lme4.0)
# Define function to get standardized coefficients from an lmer
# See: https://github.com/jebyrnes/ext-meta/blob/master/r/lmerMetaPrep.R
stdCoef.lmer <- function(object) {
sdy <- sd(attr(object, "y"))
sdx <- apply(attr(object, "X"), 2, sd)
sc <- fixef(object)*sdx/sdy
#mimic se.ranef from pacakge "arm"
se.fixef <- function(obj) attr(summary(obj), "coefs")[,2]
se <- se.fixef(object)*sdx/sdy
return(list(stdcoef=sc, stdse=se))
}
# Run model
fm0 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
# Get standardized coefficients
stdCoef.lmer(fm0)
# Comparison model with prescaled variables
fm0.comparison <- lmer(scale(Reaction) ~ scale(Days) + (scale(Days) | Subject), sleepstudy)
The answer by #LeonardoBergamini works, but this one is more compact and understandable and only uses standard accessors — less likely to break in the future if/when the structure of the summary() output, or the internal structure of the fitted model, changes.
stdCoef.merMod <- function(object) {
sdy <- sd(getME(object,"y"))
sdx <- apply(getME(object,"X"), 2, sd)
sc <- fixef(object)*sdx/sdy
se.fixef <- coef(summary(object))[,"Std. Error"]
se <- se.fixef*sdx/sdy
return(data.frame(stdcoef=sc, stdse=se))
}
library("lme4")
fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
fixef(fm1)
## (Intercept) Days
## 251.40510 10.46729
stdCoef.merMod(fm1)
## stdcoef stdse
## (Intercept) 0.0000000 0.00000000
## Days 0.5352302 0.07904178
(This does give the same results as stdCoef.lmer in
#LeonardoBergamini's answer ...)
You can get partially scaled coefficients — scaled by 2 times the SD of x, but not scaled by SD(y), and not centred — using broom.mixed::tidy + dotwhisker::by_2sd:
library(broom.mixed)
library(dotwhisker)
(fm1
|> tidy(effect="fixed")
|> by_2sd(data=sleepstudy)
|> dplyr::select(term, estimate, std.error)
)
this should work:
stdCoef.lmer <- function(object) {
sdy <- sd(attr(object, "resp")$y) # the y values are now in the 'y' slot
### of the resp attribute
sdx <- apply(attr(object, "pp")$X, 2, sd) # And the X matriz is in the 'X' slot of the pp attr
sc <- fixef(object)*sdx/sdy
#mimic se.ranef from pacakge "arm"
se.fixef <- function(obj) as.data.frame(summary(obj)[10])[,2] # last change - extracting
## the standard errors from the summary
se <- se.fixef(object)*sdx/sdy
return(data.frame(stdcoef=sc, stdse=se))
}