Aligning and italicising table column headings using Rmarkdown and pander - knitr

I am writing an R Markdown document, knitted to PDF, with tables taken from portions of the lists returned by ezANOVA (from the ez package). The tables are made using the pander package. A toy R Markdown file with a toy dataset is below.
---
title: "Table Doc"
output: pdf_document
---
```{r global_options, include=FALSE}
#set global knit options parameters.
knitr::opts_chunk$set(fig.width=12, fig.height=8, fig.path='Figs/',
                      echo=FALSE, warning=FALSE, message=FALSE, dev = 'pdf')
```
```{r, echo=FALSE}
# toy data
id <- rep(c(1,2,3,4), 5)
group1 <- factor(rep(c("A", "B"), 10))
group2 <- factor(rep(c("A", "B"), each = 10))
dv <- runif(20, min = 0, max = 10)
df <- data.frame(id, group1, group2, dv)
```
```{r anova, echo = FALSE}
library(ez)
library(plyr)
library(pander)
# create anova object
anOb <- ezANOVA(df,
                dv = dv,
                wid = id,
                between = c(group1, group2),
                type = 3,
                detailed = TRUE)
# extract the output table from the anova object, reduce it down to only desired columns
anOb <- data.frame(anOb[[1]][, c("Effect", "F", "p", "p<.05")])
# format entries in columns
anOb[,2] <- format(round(anOb[,2], digits = 1), nsmall = 1)
anOb[,3] <- format(round(anOb[,3], digits = 4), nsmall = 1)
pander(anOb, justify = c("left", "center", "center", "right"))
```
Now I have a few problems:
a) For the last three columns, I would like the column headings in the table aligned in the center, but the actual column entries underneath those headings aligned to the right.
b) I would like the column headings 'F' and 'p' in italics, and the 'p' in the 'p<.05' column in italics too, with the rest in normal font, so they read *F*, *p* and *p*<.05.
I tried renaming the column headings using plyr::rename, like so:
anOb <- rename(anOb, c("F" = "italic(F)", "p" = "italic(p)", "p<.05" = ""))
But it didn't work.

In markdown, you have to use the markdown syntax for italics, which means wrapping the text in stars or underscores:
> names(anOb) <- c('Effect', '*F*', '*p*', '*p<.05*')
> pander(anOb)
-----------------------------------------
Effect           *F*     *p*     *p<.05*
--------------- ------ -------- ---------
(Intercept)      52.3   0.0019          *
group1            1.3   0.3180
group2            2.0   0.2261
group1:group2     3.7   0.1273
-----------------------------------------
If you want to do that in a programmatic way, you can also use the pandoc.emphasis helper function to add the stars to a string.
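For example, a minimal sketch of the programmatic route (assuming the anOb data frame from the question, and using pandoc.emphasis.return, the variant that returns the starred string rather than printing it):

```r
# wrap the last three header names in stars programmatically, then print
names(anOb)[2:4] <- sapply(names(anOb)[2:4], pandoc.emphasis.return)
pander(anOb, justify = c("left", "center", "center", "right"))
```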
But your other problem is due to a bug in the package, for which I've just proposed a fix on GH. Please feel free to give that branch a try and report back on GH -- I will try to get some time later this week to clean up the related unit tests and merge the branch if everything seems to be OK.

Related

Marginal Means accounting for the random effect uncertainty

When we have repeated measurements on an experimental unit, typically these units cannot be considered 'independent' and need to be modeled in a way that gives us valid estimates of our standard errors.
When I compare the intervals obtained by computing the marginal means for the treatment using a mixed model (treating the unit as a random effect) with those obtained by first averaging over the unit and THEN running a simple linear model on the averaged responses, I get the exact same uncertainty intervals.
How do we incorporate the uncertainty of the measurements on the unit into the uncertainty of what we think our treatments look like?
In order to really propagate all the uncertainty, shouldn't we see what the treatment looks like, averaged over "all possible measurements" on a unit?
``` r
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(emmeans)
library(lme4)
#> Loading required package: Matrix
library(ggplot2)
tmp <- structure(list(treatment = c("A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B", "B",
"B", "B", "B", "B"), response = c(151.27333548, 162.3933313,
159.2199999, 159.16666725, 210.82, 204.18666667, 196.97333333,
194.54666667, 154.18666667, 194.99333333, 193.48, 191.71333333,
124.1, 109.32666667, 105.32, 102.22, 110.83333333, 114.66666667,
110.54, 107.82, 105.62000069, 79.79999821, 77.58666557, 75.78666928
), experimental_unit = c("A-1", "A-1", "A-1", "A-1", "A-2", "A-2",
"A-2", "A-2", "A-3", "A-3", "A-3", "A-3", "B-1", "B-1", "B-1",
"B-1", "B-2", "B-2", "B-2", "B-2", "B-3", "B-3", "B-3", "B-3"
)), row.names = c(NA, -24L), class = c("tbl_df", "tbl", "data.frame"
))
### Option 1 - Treat the experimental unit as a random effect since there are
### 4 repeat observations for the same unit
lme4::lmer(response ~ treatment + (1 | experimental_unit), data = tmp) %>%
emmeans::emmeans(., ~ treatment) %>%
as.data.frame()
#>   treatment   emmean       SE df  lower.CL upper.CL
#> 1         A 181.0794 10.83359  4 151.00058 211.1583
#> 2         B 101.9683 10.83359  4  71.88947 132.0472
#ggplot(.,aes(treatment, emmean)) +
#geom_pointrange(aes(ymin = lower.CL, ymax = upper.CL))
### Option 2 - instead of treating the unit as random effect, we average over the
### 4 repeat observations, and run a simple linear model
tmp %>%
group_by(experimental_unit) %>%
summarise(mean_response = mean(response)) %>%
mutate(treatment = c(rep("A", 3), rep("B", 3))) %>%
lm(mean_response ~ treatment, data = .) %>%
emmeans::emmeans(., ~ treatment) %>%
as.data.frame()
#>   treatment   emmean       SE df  lower.CL upper.CL
#> 1         A 181.0794 10.83359  4 151.00058 211.1583
#> 2         B 101.9683 10.83359  4  71.88947 132.0472
#ggplot(., aes(treatment, emmean)) +
#geom_pointrange(aes(ymin = lower.CL, ymax = upper.CL))
### Whether we include a random effect for the unit, or average over it and THEN model it, we find no difference in the
### marginal means for the treatments
### How do we incorporate the variation of the repeat measurements into the marginal means of the treatments?
### Do we then ignore the variation in the 'subsamples' and simply average over them PRIOR to modeling?
Created on 2021-07-31 by the [reprex package](https://reprex.tidyverse.org) (v2.0.0)
emmeans() does take into account the errors of random effects. This is what I get when I remove the complex sequences of pipes:
> mmod = lme4::lmer(response ~ treatment + (1 | experimental_unit), data = tmp)
> emmeans(mmod, "treatment")
 treatment emmean   SE df lower.CL upper.CL
 A            181 10.8  4    151.0      211
 B            102 10.8  4     71.9      132
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
This matches what you show above. If I fit a fixed-effects model that accounts for experimental units as a fixed effect, I get:
> fmod = lm(response ~ treatment + experimental_unit, data = tmp)
> emmeans(fmod, "treatment")
NOTE: A nesting structure was detected in the fitted model:
experimental_unit %in% treatment
 treatment emmean   SE df lower.CL upper.CL
 A            181 3.25 18    174.2      188
 B            102 3.25 18     95.1      109
Results are averaged over the levels of: experimental_unit
Confidence level used: 0.95
The SEs of the latter results are considerably lower, and that is because the random variations in experimental_unit are modeled as fixed variations.
Apparently the piping you did accounts for the variation of the random effects and includes those in the EMMs. I think that is because you did things separately for each experimental unit and somehow combined those results. I'm not very comfortable with a sequence of pipes that is 7 steps long, and I don't understand why that results in just one set of means.
I recommend against the as.data.frame() at the end. That zaps out annotations that can be helpful in understanding what you have. If you are doing that to get more digits of precision, I'll claim that those are digits you don't need; they just exaggerate the precision you are entitled to claim.
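A hedged sketch of what I mean, reusing the mmod fit from above:

```r
# keep the emmGrid object instead of coercing it to a data frame;
# summary() (or just printing it) preserves the d.f.-method and
# confidence-level annotations that as.data.frame() strips away
emm <- emmeans::emmeans(mmod, "treatment")
summary(emm)
```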
Notes on some follow-up comments
Subsequently, I am convinced that what we see in the piped operations in the second part of the OP does indeed amount to computing the mean of each EU, then analyzing those.
Let's look at that in the context of the formal model. We have (sorry, MathJax doesn't work on Stack Overflow, but I'll leave the markup there anyway)
$$ Y_{ijk} = \mu + \tau_i + U_{ij} + E_{ijk} $$
where $Y_{ijk}$ is the kth response measurement on the jth EU in the ith treatment, and the rhs terms represent respectively the overall mean, the (fixed) treatment effects, the (random) EU effects, and the (random) error effects. We assume the random effects are all mutually independent. With a balanced design, the EMMs are just the marginal means:
$$ \bar Y_{i..} = \mu + \tau_i + \bar U_{i.} + \bar E_{i..} $$
where a '.' subscript means we averaged over that subscript. If there are n EUs per treatment and m measurements on each EU, we get that
$$ \operatorname{Var}(\bar Y_{i..}) = \sigma^2_U / n + \sigma^2_E / (mn) $$
Now, if we aggregate the data on EUs ahead of time, we are starting with
$$ \bar Y_{ij.} = \mu + \tau_i + U_{ij} + \bar E_{ij.} $$
However, if we then compute marginal means by averaging over j, we get exactly the same thing as we did before with $\bar Y_{i..}$, and the variance is exactly as already shown. That is why it doesn't matter if we aggregated first or not.
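As a numeric check of this formula (a sketch, assuming the mmod fit from above; here n = 3 EUs per treatment and m = 4 measurements per EU):

```r
# pull the estimated variance components and plug them into
# Var(Ybar_i..) = sigma2_U / n + sigma2_E / (m n)
vc <- as.data.frame(lme4::VarCorr(mmod))
sigma2_U <- vc$vcov[vc$grp == "experimental_unit"]
sigma2_E <- vc$vcov[vc$grp == "Residual"]
n <- 3; m <- 4
sqrt(sigma2_U / n + sigma2_E / (m * n))  # ~10.8, the SE emmeans reported
```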

Read and merge large tables on computer cluster

I need to merge different large tables (up to 10 GB each) into a single one. To do so, I am using a Linux computer cluster with 50+ cores and 10+ GB of RAM.
I always end up with an error message like: "Cannot allocate vector of size X Mb".
Given that commands like memory.limit(size=X) are Windows-specific and not accepted, I cannot find a way to merge my large tables.
Any suggestion welcome!
This is the code I use:
library(parallel)
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)
temp = list.files(pattern="*.txt$")
gc()
Here the error occurs:
myfiles = parLapply(cl, temp, function(x) read.csv(x,
                                                   header = TRUE,
                                                   sep = ";",
                                                   stringsAsFactors = FALSE,
                                                   encoding = "UTF-8",
                                                   na.strings = c("NA", "99", "")))
myfiles.final = do.call(rbind, myfiles)
You could just use merge, for example:
mergedTable <- merge(table1, table2, by = "dbSNP_RSID")
If your samples have overlapping column names, then you'll find that the mergedTable has (for example) columns called Sample1.x and Sample1.y. This can be fixed by renaming the columns before or after the merge.
Reproducible example:
x <- data.frame(dbSNP_RSID = paste0("rs", sample(1e6, 1e5)),
                matrix(paste0(sample(c("A", "C", "T", "G"), 1e7, replace = TRUE),
                              sample(c("A", "C", "T", "G"), 1e7, replace = TRUE)),
                       ncol = 100))
y <- data.frame(dbSNP_RSID = paste0("rs", sample(1e6, 1e5)),
                matrix(paste0(sample(c("A", "C", "T", "G"), 1e7, replace = TRUE),
                              sample(c("A", "C", "T", "G"), 1e7, replace = TRUE)),
                       ncol = 100))
colnames(x)[2:101] <- paste0("Sample", 1:100)
colnames(y)[2:101] <- paste0("Sample", 101:200)
mergedDf <- merge(x, y, by = "dbSNP_RSID")
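As an aside, merge() can also disambiguate overlapping column names for you via its suffixes argument, so the manual renaming becomes unnecessary; a small sketch:

```r
# overlapping non-key columns get tagged ".table1" / ".table2"
# instead of the default ".x" / ".y"
mergedDf <- merge(x, y, by = "dbSNP_RSID", suffixes = c(".table1", ".table2"))
```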
One way to approach this is with Python and dask. A dask dataframe is stored mostly on disk rather than in RAM, allowing you to work with larger-than-RAM data, and it can help you do computations with clever parallelization. A nice tutorial of ways to work on big data can be found in this kaggle post, which might also be helpful for you. I also suggest checking out the docs on dask performance here. To be clear, if your data can fit in RAM, using a regular R dataframe or pandas dataframe will be faster.
Here's a dask solution which will assume you have named columns in the tables to align the concat operation. Please add to your question if you have any other special requirements about the data we need to consider.
import dask.dataframe as dd
import glob

tmp = glob.glob("*.txt")

dfs = []
for f in tmp:
    # read each large table lazily
    ddf = dd.read_table(f)
    # make a list of all the dfs
    dfs.append(ddf)

# row-wise concat of the data
dd_all = dd.concat(dfs)
# repartition the df to 1 partition for saving
dd_all = dd_all.repartition(npartitions=1)
# save the data; provide a list with one name if you don't want
# the partition number appended to the file name
dd_all.to_csv(['all_big_files.tsv'], sep='\t')
If you just want to cat all the tables together, you can do something like this in straight Python (you could also use Linux cat/paste):
with open('all_big_files.tsv', 'w') as O:
    file_number = 0
    for f in tmp:
        with open(f, 'r') as F:
            if file_number == 0:
                # keep the header from the first file
                for line in F:
                    line = line.rstrip()
                    O.write(line + '\n')
            else:
                # skip the header line
                l = F.readline()
                for line in F:
                    line = line.rstrip()
                    O.write(line + '\n')
        file_number += 1

htmlTable in Rmd - conversion to Word docx

I have the following Rmd file, which produces an html file, which I then copy-paste into a docx file (for collaborators). Here are things I'd like to know how to do with the tables, but I can't find answers in the vignettes here:
A. I want to know how to remove the blank column that gets inserted in Word in between Cgroup 1 and Cgroup 2.
B. I want to know how to set the width of the column with the row names ("1st row",...)
C. How can I change the font and font size? I tried following this, but using output: word_document with htmlTable() doesn't work.
D. To ease the conversion to Word, is there a way to specify page breaks? Landscape orientation?
Thank you so much!
---
title: "Example"
output:
Gmisc::docx_document:
fig_caption: TRUE
force_captions: TRUE
---
Results
=======
```{r, echo = FALSE}
library(htmlTable)
library(Gmisc)
library(knitr)
mx <- matrix(ncol = 6, nrow = 8)
rownames(mx) <- paste(c("1st", "2nd", "3rd", paste0(4:8, "th")), "row")
colnames(mx) <- paste(c("1st", "2nd", "3rd", paste0(4:6, "th")), "hdr")
for (nr in 1:nrow(mx)){
  for (nc in 1:ncol(mx)){
    mx[nr, nc] <- paste0(nr, ":", nc)
  }
}
htmlTable(mx,
          cgroup = c("Cgroup 1", "Cgroup 2"),
          n.cgroup = c(2, 4))
```
The styling seemed to be off for the row names; this is now fixed in version 1.10.1, which you can download using the devtools package: devtools::install_github("gforge/htmlTable", ref="develop")
Regarding the styling, the function allows almost any CSS style you could imagine. Unfortunately it requires copy-pasting into Word, and this functionality hasn't been Microsoft's highest priority. You can easily adapt your example to accommodate the required changes using the css.cell argument:
library(htmlTable)
library(knitr)
mx <- matrix(ncol = 6, nrow = 8)
rownames(mx) <- paste(c("1st", "2nd", "3rd", paste0(4:8, "th")), "row")
colnames(mx) <- paste(c("1st", "2nd", "3rd", paste0(4:6, "th")), "hdr")
for (nr in 1:nrow(mx)){
  for (nc in 1:ncol(mx)){
    mx[nr, nc] <- paste0(nr, ":", nc)
  }
}

css.cell <- rep("font-size: 1.5em;", times = ncol(mx) + 1)
css.cell[1] <- "width: 4cm; font-size: 2em;"
htmlTable(mx,
          css.cell = css.cell,
          css.cgroup = "color: red",
          css.table = "color: blue",
          cgroup = c("Cgroup 1", "Cgroup 2"),
          n.cgroup = c(2, 4))
There is no way to remove the empty column generated by cgroups. This was required for the table to look nice and is a conscious design choice.
Regarding page breaks, I doubt there is any elegant way of doing that. An alternative could possibly be the ReporteRs package. I haven't used it myself, but it's more closely integrated with Word and could possibly be a solution.

Standardized coefficients for lmer model

I used to use the code below to calculate standardized coefficients of an lmer model. However, with the new version of lme4, the structure of the returned object has changed.
How can I adapt the function stdCoef.lmer to make it work with the new lme4 version?
# Install the old version of lme4 (lme4.0)
install.packages("lme4.0", type="both",
                 repos=c("http://lme4.r-forge.r-project.org/repos",
                         getOption("repos")[["CRAN"]]))
# Load package
detach("package:lme4", unload=TRUE)
library(lme4.0)
# Define function to get standardized coefficients from an lmer
# See: https://github.com/jebyrnes/ext-meta/blob/master/r/lmerMetaPrep.R
stdCoef.lmer <- function(object) {
  sdy <- sd(attr(object, "y"))
  sdx <- apply(attr(object, "X"), 2, sd)
  sc <- fixef(object)*sdx/sdy
  # mimic se.ranef from package "arm"
  se.fixef <- function(obj) attr(summary(obj), "coefs")[,2]
  se <- se.fixef(object)*sdx/sdy
  return(list(stdcoef = sc, stdse = se))
}
# Run model
fm0 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
# Get standardized coefficients
stdCoef.lmer(fm0)
# Comparison model with prescaled variables
fm0.comparison <- lmer(scale(Reaction) ~ scale(Days) + (scale(Days) | Subject), sleepstudy)
The answer by @LeonardoBergamini works, but this one is more compact and understandable and only uses standard accessors — less likely to break in the future if/when the structure of the summary() output, or the internal structure of the fitted model, changes.
stdCoef.merMod <- function(object) {
  sdy <- sd(getME(object, "y"))
  sdx <- apply(getME(object, "X"), 2, sd)
  sc <- fixef(object)*sdx/sdy
  se.fixef <- coef(summary(object))[, "Std. Error"]
  se <- se.fixef*sdx/sdy
  return(data.frame(stdcoef = sc, stdse = se))
}
library("lme4")
fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
fixef(fm1)
## (Intercept) Days
## 251.40510 10.46729
stdCoef.merMod(fm1)
## stdcoef stdse
## (Intercept) 0.0000000 0.00000000
## Days 0.5352302 0.07904178
(This does give the same results as stdCoef.lmer in @LeonardoBergamini's answer ...)
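As a quick sanity check (a sketch based on the comparison model from the question), the standardized slope should match the fixed effect from a fit with pre-scaled variables:

```r
# refit with both response and predictor standardized; the scale(Days)
# coefficient should agree with the stdcoef for Days above (~0.535)
fm1.scaled <- lmer(scale(Reaction) ~ scale(Days) + (scale(Days) | Subject),
                   sleepstudy)
fixef(fm1.scaled)
```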
You can get partially scaled coefficients — scaled by 2 times the SD of x, but not scaled by SD(y), and not centred — using broom.mixed::tidy + dotwhisker::by_2sd:
library(broom.mixed)
library(dotwhisker)
(fm1
|> tidy(effect="fixed")
|> by_2sd(data=sleepstudy)
|> dplyr::select(term, estimate, std.error)
)
This should work:
stdCoef.lmer <- function(object) {
  # the y values are now in the 'y' slot of the resp attribute
  sdy <- sd(attr(object, "resp")$y)
  # and the X matrix is in the 'X' slot of the pp attribute
  sdx <- apply(attr(object, "pp")$X, 2, sd)
  sc <- fixef(object)*sdx/sdy
  # mimic se.ranef from package "arm"
  # last change - extracting the standard errors from the summary
  se.fixef <- function(obj) as.data.frame(summary(obj)[10])[,2]
  se <- se.fixef(object)*sdx/sdy
  return(data.frame(stdcoef = sc, stdse = se))
}

Display Superscript in SSRS reports

I'm working on SSRS 2008.
I want to display dates as 1st January 2011, but with "st" in superscript, not plain "1st".
Is there any way to display "st", "nd", "rd" and "th" in superscript without installing any custom font?
Just copy and paste from the following list:
ABC⁰ ¹ ² ³ ⁴ ⁵ ⁶ ⁷ ⁸ ⁹ ⁺ ⁻ ⁼ ⁽ ⁾
ABC₀ ₁ ₂ ₃ ₄ ₅ ₆ ₇ ₈ ₉ ₊ ₋ ₌ ₍ ₎
ABCᵃ ᵇ ᶜ ᵈ ᵉ ᶠ ᵍ ʰ ⁱ ʲ ᵏ ˡ ᵐ ⁿ ᵒ ᵖ ʳ ˢ ᵗ ᵘ ᵛ ʷ ˣ ʸ ᶻ
ABCᴬ ᴮ ᴰ ᴱ ᴳ ᴴ ᴵ ᴶ ᴷ ᴸ ᴹ ᴺ ᴼ ᴾ ᴿ ᵀ ᵁ ᵂ
ABCₐ ₑ ᵢ ₒ ᵣ ᵤ ᵥ ₓ
ABC½ ¼ ¾ ⅓ ⅔ ⅕ ⅖ ⅗ ⅘ ⅙ ⅚ ⅛ ⅜ ⅝ ⅞ № ℠ ™ © ®
ABC^ ± ¶
Maybe...
You are limited to what can be done with String.Format. Font size and spacing also apply to the whole text box, so it isn't "native".
However, superscript characters exist in Unicode, so you may be able to do it with some fancy expression that concatenates characters. I'd suggest custom code.
I haven't tried this, but these articles mention it:
http://beyondrelational.com/blogs/jason/archive/2010/12/06/subscripts-and-superscripts-in-ssrs-reports.aspx
http://www.codeproject.com/KB/reporting-services/SSRSSuperscript.aspx
I am not looking for credit here, as the solution above has answered it for you, but for beginners' sake: I use a code function within my report.
So in my SQL, say I have a Number field; I then add a new field OrdinalNumber:
SELECT ..., Number,
       CASE WHEN (Number % 100) BETWEEN 10 AND 20 THEN 4
            WHEN (Number % 10) = 1 THEN 1
            WHEN (Number % 10) = 2 THEN 2
            WHEN (Number % 10) = 3 THEN 3
            ELSE 4
       END AS OrdinalNumber,
       ...
Then my code function:
Function OrdinalText(ByVal OrdinalNumber As Integer) As String
    Dim result As String
    Select Case OrdinalNumber
        Case 1
            result = "ˢᵗ"
        Case 2
            result = "ⁿᵈ"
        Case 3
            result = "ʳᵈ"
        Case Else
            result = "ᵗʰ"
    End Select
    Return result
End Function
Then in the report textbox I use the expression:
=CStr(Fields!Number.Value) & Code.OrdinalText(Fields!OrdinalNumber.Value)