Any other options besides the traditional CLD bar graph? - pairwise

I am looking for an alternative approach to plotting results from pairwise comparisons than traditional bar plots. If possible, I would like to create a plot like the one shown below [1], but for a model that includes an interaction effect. R code for the plot below is online [2]. Is there a way to revise or add onto this code to include an interaction effect?
Example of my data set (too large to include in its entirety but I can send upon request) and the model used:
aq <- tibble::tribble(
~trt, ~season, ~site, ~sp,
"herbicide", "early", 1L, 0.120494496,
"herbicide", "early", 1L, 0.04057757,
"herbicide", "early", 1L, 0.060556802,
"herbicide", "early", 1L, 0.050567186,
"herbicide", "early", 1L, 0.110504881,
"herbicide", "early", 1L, 0.090525649,
"herbicide", "early", 1L, 0.100515265,
"herbicide", "early", 1L, 0.030587954,
"herbicide", "early", 1L, 0.080536033,
"herbicide", "early", 1L, 0.010608723,
"herbicide", "early", 1L, 0.080536033,
"herbicide", "early", 1L, 0.04057757,
"herbicide", "mid", 1L, 0.050567186,
"herbicide", "mid", 1L, 0.050567186,
"herbicide", "mid", 1L, 0.04057757,
"herbicide", "mid", 1L, 0.04057757,
"herbicide", "mid", 1L, 0.140473728,
"herbicide", "mid", 1L, 0.030587954,
"herbicide", "mid", 1L, 0.150463344,
"herbicide", "mid", 1L, 0.020598339,
"herbicide", "mid", 1L, 0.120494496,
"herbicide", "mid", 1L, 0.04057757,
"herbicide", "mid", 1L, 0.050567186,
"herbicide", "late", 1L, 0.090525649,
"herbicide", "late", 1L, 0.070546417,
"herbicide", "late", 1L, 0.150463344,
"herbicide", "late", 1L, 0.070546417,
"herbicide", "late", 1L, 0.220390654,
"herbicide", "late", 1L, 0.120494496,
"herbicide", "late", 1L, 0.150463344,
"herbicide", "late", 1L, 0.130484112,
"herbicide", "late", 1L, 0.090525649,
"herbicide", "late", 1L, 0.020598339,
"herbicide", "late", 1L, 0.170442575,
"herbicide", "late", 1L, 0.050567186,
"herbicide", "early", 1L, 0.010608723,
"herbicide", "early", 1L, 0.060556802,
"herbicide", "early", 1L, 0.000619107,
"herbicide", "early", 1L, 0.050567186,
"herbicide", "early", 1L, 0.030587954,
"herbicide", "early", 1L, 0.010608723,
"herbicide", "early", 1L, 0.000619107,
"herbicide", "early", 1L, 0.000619107,
"herbicide", "early", 1L, 0.020598339,
"herbicide", "early", 1L, 0.000619107,
"herbicide", "early", 1L, 0.030587954,
"herbicide", "early", 1L, 0.010608723,
"herbicide", "mid", 1L, 0.04057757,
"herbicide", "mid", 1L, 0.050567186,
"herbicide", "mid", 1L, 0.010608723,
"herbicide", "mid", 1L, 0.010608723,
"herbicide", "mid", 1L, 0.04057757,
"herbicide", "mid", 1L, 0.010608723,
"herbicide", "mid", 1L, 0.050567186,
"herbicide", "mid", 1L, 0.010608723,
"herbicide", "mid", 1L, 0.010608723,
"herbicide", "mid", 1L, 0.070546417,
"herbicide", "mid", 1L, 0.020598339,
"herbicide", "mid", 1L, 0.060556802,
"herbicide", "late", 1L, 0.030587954,
"herbicide", "late", 1L, 0.030587954,
"herbicide", "late", 1L, 0.070546417,
"herbicide", "late", 1L, 0.04057757,
"herbicide", "late", 1L, 0.010608723,
"herbicide", "late", 1L, 0.080536033,
"herbicide", "late", 1L, 0.000619107,
"herbicide", "late", 1L, 0.010608723,
"herbicide", "late", 1L, 0.010608723,
"herbicide", "late", 1L, 0.030587954,
"mow", "early", 1L, 0.050567186,
"mow", "early", 1L, 0.050567186,
"mow", "early", 1L, 0.04057757,
"mow", "early", 1L, 0.04057757,
"mow", "early", 1L, 0.080536033,
"mow", "early", 1L, 0.050567186,
"mow", "early", 1L, 0.020598339,
"mow", "early", 1L, 0.060556802,
"mow", "early", 1L, 0.000619107,
"mow", "early", 1L, 0.04057757,
"mow", "early", 1L, 0.050567186,
"mow", "early", 1L, 0.020598339,
"mow", "mid", 1L, 0.020598339,
"mow", "mid", 1L, 0.020598339,
"mow", "mid", 1L, 0.070546417,
"mow", "mid", 1L, 0.020598339,
"mow", "mid", 1L, 0.04057757,
"mow", "mid", 1L, 0.04057757,
"mow", "mid", 1L, 0.020598339,
"mow", "mid", 1L, 0.020598339,
"mow", "mid", 1L, 0.030587954,
"mow", "mid", 1L, 0.010608723,
"mow", "mid", 1L, 0.010608723,
"mow", "late", 1L, 0.04057757,
"mow", "late", 1L, 0.020598339,
"mow", "late", 1L, 0.04057757,
"mow", "late", 1L, 0.020598339,
"mow", "late", 1L, 0.020598339,
"mow", "late", 1L, 0.020598339,
"mow", "late", 1L, 0.030587954,
"mow", "late", 1L, 0.030587954,
"mow", "late", 1L, 0.020598339,
"mow", "late", 1L, 0.000619107,
"mow", "late", 1L, 0.030587954,
"mow", "late", 1L, 0.030587954,
"mow", "early", 1L, 0.050567186,
"mow", "early", 1L, 0.010608723,
"mow", "early", 1L, 0.100515265,
"mow", "early", 1L, 0.110504881,
"mow", "early", 1L, 0.04057757,
"mow", "early", 1L, 0.030587954,
"mow", "early", 1L, 0.050567186,
"mow", "early", 1L, 0.04057757,
"mow", "early", 1L, 0.050567186,
"mow", "early", 1L, 0.010608723,
"mow", "early", 1L, 0.010608723,
"mow", "early", 1L, 0.000619107,
"mow", "mid", 1L, 0.060556802,
"mow", "mid", 1L, 0.010608723,
"mow", "mid", 1L, 0.000619107,
"mow", "mid", 1L, 0.030587954,
"mow", "mid", 1L, 0.060556802,
"mow", "mid", 1L, 0.020598339,
"mow", "mid", 1L, 0.050567186,
"mow", "mid", 1L, 0.04057757,
"mow", "mid", 1L, 0.020598339,
"mow", "mid", 1L, 0.04057757,
"mow", "mid", 1L, 0.030587954,
"mow", "mid", 1L, 0.030587954,
"mow", "late", 1L, 0.050567186,
"mow", "late", 1L, 0.050567186,
"mow", "late", 1L, 0.010608723,
"mow", "late", 1L, 0.030587954,
"mow", "late", 1L, 0.010608723,
"mow", "late", 1L, 0.010608723,
"mow", "late", 1L, 0.060556802,
"mow", "late", 1L, 0.020598339,
"mow", "late", 1L, 0.050567186,
"mow", "late", 1L, 0.04057757,
"mow", "late", 1L, 0.010608723,
"mow", "late", 1L, 0.070546417,
"herbicide", "early", 2L, 0.04057757,
"herbicide", "early", 2L, 0.450151817,
"herbicide", "early", 2L, 0.000619107,
"herbicide", "early", 2L, 0.500099896,
"herbicide", "early", 2L, 0.010608723,
"herbicide", "early", 2L, 0.190421807,
"herbicide", "early", 2L, 0.180432191,
"herbicide", "early", 2L, 0.130484112,
"herbicide", "early", 2L, 0.020598339,
"herbicide", "early", 2L, 0.360245275,
"herbicide", "early", 2L, 0.010608723,
"herbicide", "early", 2L, 0.030587954,
"herbicide", "mid", 2L, 0.050567186,
"herbicide", "mid", 2L, 0.370234891,
"herbicide", "mid", 2L, 0.010608723,
"herbicide", "mid", 2L, 0.250359502,
"herbicide", "mid", 2L, 0.050567186,
"herbicide", "mid", 2L, 0.080536033,
"herbicide", "mid", 2L, 0.04057757,
"herbicide", "mid", 2L, 0.050567186,
"herbicide", "mid", 2L, 0.050567186,
"herbicide", "mid", 2L, 0.16045296,
"herbicide", "mid", 2L, 0.000619107,
"herbicide", "mid", 2L, 0.000619107,
"herbicide", "late", 2L, 0.050567186,
"herbicide", "late", 2L, 0.540058359,
"herbicide", "late", 2L, 0.04057757,
"herbicide", "late", 2L, 0.260349117,
"herbicide", "late", 2L, 0.070546417,
"herbicide", "late", 2L, 0.120494496,
"herbicide", "late", 2L, 0.030587954,
"herbicide", "late", 2L, 0.070546417,
"herbicide", "late", 2L, 0.020598339,
"herbicide", "late", 2L, 0.120494496,
"herbicide", "late", 2L, 0.04057757,
"herbicide", "late", 2L, 0.000619107,
"herbicide", "early", 2L, 0.010608723,
"herbicide", "early", 2L, 0.050567186,
"herbicide", "early", 2L, 0.010608723,
"herbicide", "early", 2L, 0.010608723,
"herbicide", "early", 2L, 0.060556802,
"herbicide", "early", 2L, 0.04057757,
"herbicide", "early", 2L, 0.210401038,
"herbicide", "early", 2L, 0.060556802,
"herbicide", "early", 2L, 0.100515265,
"herbicide", "early", 2L, 0.090525649,
"herbicide", "early", 2L, 0.010608723,
"herbicide", "early", 2L, 0.000619107,
"herbicide", "mid", 2L, 0.060556802,
"herbicide", "mid", 2L, 0.020598339,
"herbicide", "mid", 2L, 0.030587954,
"herbicide", "mid", 2L, 0.010608723,
"herbicide", "mid", 2L, 0.000619107,
"herbicide", "mid", 2L, 0.010608723,
"herbicide", "mid", 2L, 0.030587954,
"herbicide", "mid", 2L, 0.070546417,
"herbicide", "mid", 2L, 0.020598339,
library(tidyverse)
library(betareg)
library(emmeans)
library(lmtest)
library(multcomp)
library(lme4)
library(car)
library(glmmTMB)
trt_key <- c(ctrl = "Control", mow = "FallMow", herbicide = "SpotSpray")
aq$trt <- recode(aq$trt, !!!trt_key)
aq$trt <- factor(aq$trt, levels = c("Control", "FallMow", "SpotSpray"))
season_key <- c(early = "Early", mid = "Mid", late = "Late")
aq$season <- recode(aq$season, !!!season_key)
aq$season <- factor(aq$season, levels=c("Early","Mid","Late"))
glm.soil <- glmmTMB(sp ~ trt + season + trt*season + (1 | site), data = aq,
family = list(family = "beta", link = "logit"), dispformula = ~trt)
#Interaction
lsm <- emmeans(glm.soil, pairwise ~ trt:season, type="response", adjust = "tukey")
lsmtab <- cld(lsm, Letter=letters, sort = F)
colnames(lsmtab)[1] <- "Treatment"
colnames(lsmtab)[2] <- "Season"
colnames(lsmtab)[8] <- "letter"
df <- as.data.frame(lsmtab)
print(df)
This is my first post, so I apologize in advance if I've overlooked any posting protocols. Thanks!
[1]: https://i.stack.imgur.com/GJ8VA.png
[2]: https://schmidtpaul.github.io/DSFAIR/compactletterdisplay.html

I'm the author of the plot/code you linked.
You are not the first one asking how to create an analogous plot when interactions are present. I am suggesting two options below using your data.
(Note that in the following reprex I deleted the part of the code with the data, because my post was reaching the character limit.)
# packages ---------------------------------------------------------------
library(emmeans)
library(glmmTMB)
library(car)
library(multcomp)
library(multcompView)
library(scales)
library(tidyverse)
# format ------------------------------------------------------------------
trt_key <- c(Control = "ctrl", FallMow = "mow", SpotSpray = "herbicide")
season_key <- c(Early = "early", Mid = "mid", Late = "late")
aq <- aq %>%
mutate(
trt = trt %>% fct_recode(!!!trt_key) %>% fct_relevel(names(trt_key)),
season = season %>% fct_recode(!!!season_key) %>% fct_relevel(names(season_key))
)
# model -------------------------------------------------------------------
glm.soil <-
glmmTMB(
sp ~ trt + season + trt:season + (1 | site),
data = aq,
family = list(family = "beta", link = "logit"),
dispformula = ~ trt
)
Anova(glm.soil) # interaction is significant!
#> Analysis of Deviance Table (Type II Wald chisquare tests)
#>
#> Response: sp
#> Chisq Df Pr(>Chisq)
#> trt 48.422 2 3.057e-11 ***
#> season 18.888 2 7.916e-05 ***
#> trt:season 16.980 4 0.001951 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here the ANOVA tells us that there are significant interaction effects between treatment and season. Simply put, this means that treatments behave differently depending on the season. In the extreme case this could mean that the same treatment could have the best performance in one season, but the worst performance in another season. Therefore it would be misleading to simply estimate one mean per treatment across seasons (or one mean per season across treatments). Instead, we should look at the means of all treatment-season combinations. In this scenario, there are 9 season-treatment combinations and we can estimate them using emmeans() via either ~ trt:season or ~ trt|season. These are the two options I was talking about above. Again, both estimate the same means for the same 9 combinations, but what is different is which of these means are compared to each other. None of the two appraoches is more correct than the other. Instead, you as the analyst must decide which approach is more informative for what you are trying to investigate.
# emmean comparisons ------------------------------------------------------
emm1 <- emmeans(glm.soil, ~ trt:season, type = "response")
emm2 <- emmeans(glm.soil, ~ trt|season, type = "response")
cld1 <- cld(emm1, Letter=letters, adjust = "tukey")
cld2 <- cld(emm2, Letter=letters, adjust = "tukey")
~ trt:season Comparing all combinations to all other combinations
Here, all combinations are compared to all other combinations. In order to plot this, I create a new column trt_season that represents each combination (and I also sort its factor levels according to their estimated mean) and put it on the x-axis. Note that I also filled the boxes and colored the dots according to their treatment, but this is optional.
cld1
#> trt season response SE df lower.CL upper.CL .group
#> Control Early 0.0560 0.00540 950 0.0428 0.0730 abc
#> FallMow Early 0.0316 0.00248 950 0.0254 0.0393 d
#> SpotSpray Early 0.0498 0.00402 950 0.0397 0.0622 abc
#> Control Mid 0.0654 0.00610 950 0.0504 0.0845 a
#> FallMow Mid 0.0427 0.00305 950 0.0350 0.0520 b
#> SpotSpray Mid 0.0609 0.00473 950 0.0490 0.0754 a c
#> Control Late 0.0442 0.00464 950 0.0330 0.0590 bcd
#> FallMow Late 0.0437 0.00314 950 0.0358 0.0533 b
#> SpotSpray Late 0.0630 0.00482 950 0.0508 0.0777 a c
#>
#> Confidence level used: 0.95
#> Conf-level adjustment: sidak method for 9 estimates
#> Intervals are back-transformed from the logit scale
#> P value adjustment: tukey method for comparing a family of 9 estimates
#> Tests are performed on the log odds ratio scale
#> significance level used: alpha = 0.05
#> NOTE: Compact letter displays can be misleading
#> because they show NON-findings rather than findings.
#> Consider using 'pairs()', 'pwpp()', or 'pwpm()' instead.
cld1_df <- cld1 %>% as.data.frame()
cld1_df <- cld1_df %>%
mutate(trt_season = paste0(season, "-", trt)) %>%
mutate(trt_season = fct_reorder(trt_season, response))
aq <- aq %>%
mutate(trt_season = paste0(season, "-", trt)) %>%
mutate(trt_season = fct_relevel(trt_season, levels(cld1_df$trt_season)))
ggplot() +
# y-axis
scale_y_continuous(
name = "sp",
limits = c(0, NA),
breaks = pretty_breaks(),
expand = expansion(mult = c(0, 0.1))
) +
# x-axis
scale_x_discrete(name = "Treatment-Season combination") +
# general layout
theme_bw() +
theme(axis.text.x = element_text(
angle = 45,
hjust = 1,
vjust = 1
),
legend.position = "top") +
# black data points
geom_point(
data = aq,
aes(y = sp, x = trt_season, color = trt),
#shape = 16,
alpha = 0.5,
position = position_nudge(x = -0.2)
) +
# black boxplot
geom_boxplot(
data = aq,
aes(y = sp, x = trt_season, fill = trt),
width = 0.05,
outlier.shape = NA,
position = position_nudge(x = -0.1)
) +
# red mean value
geom_point(
data = cld1_df,
aes(y = response, x = trt_season),
size = 2,
color = "red"
) +
# red mean errorbar
geom_errorbar(
data = cld1_df,
aes(ymin = lower.CL, ymax = upper.CL, x = trt_season),
width = 0.05,
color = "red"
) +
# red letters
geom_text(
data = cld1_df,
aes(
y = response,
x = trt_season,
label = str_trim(.group)
),
position = position_nudge(x = 0.1),
hjust = 0,
color = "red"
) +
labs(
caption = str_wrap("Black dots represent raw data. Red dots and error bars represent backtransformed estimated marginal means ± 95% confidence interval per group. Means not sharing any letter are significantly different by the Tukey-test at the 5% level of significance.", width = 100)
)
~ trt|season Comparing all trt combinations only within each season
Here, fewer mean comparisons are made, i.e. only 3 comparisons (Control vs. FallMow, FallMow vs. SpotSpray, Control vs. SpotSpray) for each of the seasons. This means that e.g. Early-Control is never compared to Mid-Control. Moreover, the letters from the compact letter display are also created separately for each season. This means that e.g. the a assigned to the Early-Control mean has nothing to do with the a assigned to the Mid-Control mean. This is crucial and I made sure to explicitly state this in the plot's caption. I also used facettes which separate the results for the seasons visually.
That being said, presenting the results in this way may actually be more suited for your investigation. (Obviously you could use colors per treatment or season here as well)
cld2
#> season = Early:
#> trt response SE df lower.CL upper.CL .group
#> Control 0.0560 0.00540 950 0.0444 0.0704 a
#> FallMow 0.0316 0.00248 950 0.0262 0.0381 b
#> SpotSpray 0.0498 0.00402 950 0.0410 0.0603 a
#>
#> season = Mid:
#> trt response SE df lower.CL upper.CL .group
#> Control 0.0654 0.00610 950 0.0522 0.0816 a
#> FallMow 0.0427 0.00305 950 0.0360 0.0506 b
#> SpotSpray 0.0609 0.00473 950 0.0505 0.0733 a
#>
#> season = Late:
#> trt response SE df lower.CL upper.CL .group
#> Control 0.0442 0.00464 950 0.0344 0.0567 a
#> FallMow 0.0437 0.00314 950 0.0368 0.0519 a
#> SpotSpray 0.0630 0.00482 950 0.0524 0.0755 b
#>
#> Confidence level used: 0.95
#> Conf-level adjustment: sidak method for 3 estimates
#> Intervals are back-transformed from the logit scale
#> P value adjustment: tukey method for comparing a family of 3 estimates
#> Tests are performed on the log odds ratio scale
#> significance level used: alpha = 0.05
#> NOTE: Compact letter displays can be misleading
#> because they show NON-findings rather than findings.
#> Consider using 'pairs()', 'pwpp()', or 'pwpm()' instead.
cld2_df <- cld2 %>% as.data.frame()
ggplot() +
facet_grid(cols = vars(season), labeller = label_both) +
# y-axis
scale_y_continuous(
name = "sp",
limits = c(0, NA),
breaks = pretty_breaks(),
expand = expansion(mult = c(0, 0.1))
) +
# x-axis
scale_x_discrete(name = "Treatment") +
# general layout
theme_bw() +
# black data points
geom_point(
data = aq,
aes(y = sp, x = trt),
shape = 16,
alpha = 0.5,
position = position_nudge(x = -0.2)
) +
# black boxplot
geom_boxplot(
data = aq,
aes(y = sp, x = trt),
width = 0.05,
outlier.shape = NA,
position = position_nudge(x = -0.1)
) +
# red mean value
geom_point(
data = cld2_df,
aes(y = response, x = trt),
size = 2,
color = "red"
) +
# red mean errorbar
geom_errorbar(
data = cld2_df,
aes(ymin = lower.CL, ymax = upper.CL, x = trt),
width = 0.05,
color = "red"
) +
# red letters
geom_text(
data = cld2_df,
aes(
y = response,
x = trt,
label = str_trim(.group)
),
position = position_nudge(x = 0.1),
hjust = 0,
color = "red"
) +
labs(
caption = str_wrap("Black dots represent raw data. Red dots and error bars represent backtransformed estimated marginal means ± 95% confidence interval per group. For each season separately, means not sharing any letter are significantly different by the Tukey-test at the 5% level of significance.", width = 100)
)
So these are the two options I think about when I have a significant two-way interaction and want to compare means. Note that you could also switch treatment and season to ~ season|trt and also plot it the other way around.
Further reading
emmeans documentation
chapter on Compact Letter Display (CLD) - you will also find a brief discussion on the weakness of CLDs that Russ Lenth mentioned in his comment
related stackoverflow post #1
related stackoverflow post #2
Bonus: Raincloud plot
This has nothing to do with your question, but I'd like to point out that you have much more data points than I have in the the plot you linked. Because of this, the original plotting approach could potentially be improved because geom_point() leads to many dots being plotted on top of each other so that we have no way seeing how much data there really is. Therefore, you could simply replace geom_point() with geom_jitter() or even go further and create these raincloud plots mentioned in this blogpost. I've created one (without the emmeans part) below that is analogous to the first plot above.
ggplot() +
# y-axis
scale_y_continuous(
name = "sp",
limits = c(0, NA),
breaks = pretty_breaks(),
expand = expansion(mult = c(0, 0.1))
) +
# x-axis
scale_x_discrete(name = "Treatment-Season combination") +
# general layout
theme_bw() +
theme(axis.text.x = element_text(
angle = 45,
hjust = 1,
vjust = 1
),
legend.position = "top") +
# add half-violin from {ggdist} package
ggdist::stat_halfeye(
data = aq,
aes(y = sp, x = trt_season, fill = trt),
adjust = .5,
width = .5,
.width = 0,
justification = -.2,
point_colour = NA,
show.legend = FALSE
) +
# boxplot
geom_boxplot(
data = aq,
aes(y = sp, x = trt_season, fill = trt),
width = 0.1,
outlier.shape = NA
) +
# add justified jitter from the {gghalves} package
gghalves::geom_half_point(
data = aq,
aes(y = sp, x = trt_season, color = trt),
side = "l",
range_scale = .4,
alpha = .2
)
Created on 2022-01-26 by the reprex package (v2.0.1)

IMO, almost anything is better than a CLD. They display non-findings rather than findings.
I suggest presenting the simple comparisons in tabular form
lsm = emmeans(glm.soil, ~ trt:season, type = "response")
pairs(lsm, by = "trt")
pairs(lsm, by = "season")
If you really want the diagonal comparisons, you could consider pwpp() or pwpm() which can display quite a few comparisons compactly. See the documentation in emmeans

Related

How to do repeated measures Anova and Friedman test in gtsummary?

I have repeated measures data for fasting glucose of few patients which are measured in Day 0, 3 and 7 during a clinical trial. I want to do friedman.test (and repeated measures anova for another data). Is it possible in gtsummaryto do Friedman test and repeated measures Anova?
This is my code
library(gtsummary)
library(dplyr)
smpl %>% select(c(Participant.ID, value, Day)) %>%
tbl_summary(by = Day, include = -Participant.ID) %>% add_p()
By default gtsummary does Kruskal-Wallis rank sum test for the three groups in my data (Day 0, 3 and 7). As it is a repeated measures I wanted to do friedman.test on these groups. So I tried the following code
my_friedman <- function(data, variable, by, random, ...) {
data <- data[c(variable, by)] %>% dplyr::filter(complete.cases(.))
friedman.test(variable ~ data[[by]] | random, data = data)
}
smpl %>% select(c(Participant.ID, value, Day)) %>%
tbl_summary(by = Day) %>%
add_p(test = value ~ "my_friedman", random = Participant.ID)
But that gives an error..
this is the dput for my sample data smpl
structure(list(Participant.ID = c(1002, 1007, 1010, 1017, 1018,
1022, 1044, 1050, 1051, 1052, 1002, 1007, 1010, 1017, 1018, 1022,
1044, 1050, 1051, 1052, 1002, 1007, 1010, 1017, 1018, 1022, 1044,
1050, 1051, 1052), Randomization = c("Pioglitazone", "Pioglitazone",
"Pioglitazone", "Pioglitazone", "Pioglitazone", "Pioglitazone",
"Pioglitazone", "Pioglitazone", "Pioglitazone", "Pioglitazone",
"Pioglitazone", "Pioglitazone", "Pioglitazone", "Pioglitazone",
"Pioglitazone", "Pioglitazone", "Pioglitazone", "Pioglitazone",
"Pioglitazone", "Pioglitazone", "Pioglitazone", "Pioglitazone",
"Pioglitazone", "Pioglitazone", "Pioglitazone", "Pioglitazone",
"Pioglitazone", "Pioglitazone", "Pioglitazone", "Pioglitazone"
), Day = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L), .Label = c("0", "3", "7"), class = "factor"),
value = c(31.92, 53.25, 34.69, 33.32, 25.94, 34.69, 22.65,
33.32, 24.33, 25.94, 31.92, 46.54, 44.74, 145.12, 37.34,
39.89, 22.65, 27.5, 25.94, 25.94, 37.34, 53.25, 29.02, 37.34,
25.94, 34.69, 22.65, 27.5, 29.02, 31.92)), row.names = c(NA,
-30L), class = c("tbl_df", "tbl", "data.frame"))
Can someone kindly help?
Here is the direct call to friedman.test
> friedman.test(value ~ Day | Participant.ID, data = aa)
Friedman rank sum test
data: value and Day and Participant.ID
Friedman chi-squared = 1.2667, df = 2, p-value = 0.5308
Your code was close! I made some small adjustments. See below!
library(gtsummary)
#> #BlackLivesMatter
smpl <-
structure(list(Participant.ID = c(1002, 1007, 1010, 1017, 1018,
1022, 1044, 1050, 1051, 1052, 1002, 1007, 1010, 1017, 1018, 1022,
1044, 1050, 1051, 1052, 1002, 1007, 1010, 1017, 1018, 1022, 1044,
1050, 1051, 1052), Randomization = c("Pioglitazone", "Pioglitazone",
"Pioglitazone", "Pioglitazone", "Pioglitazone", "Pioglitazone",
"Pioglitazone", "Pioglitazone", "Pioglitazone", "Pioglitazone",
"Pioglitazone", "Pioglitazone", "Pioglitazone", "Pioglitazone",
"Pioglitazone", "Pioglitazone", "Pioglitazone", "Pioglitazone",
"Pioglitazone", "Pioglitazone", "Pioglitazone", "Pioglitazone",
"Pioglitazone", "Pioglitazone", "Pioglitazone", "Pioglitazone",
"Pioglitazone", "Pioglitazone", "Pioglitazone", "Pioglitazone"
), Day = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L), .Label = c("0", "3", "7"), class = "factor"),
value = c(31.92, 53.25, 34.69, 33.32, 25.94, 34.69, 22.65,
33.32, 24.33, 25.94, 31.92, 46.54, 44.74, 145.12, 37.34,
39.89, 22.65, 27.5, 25.94, 25.94, 37.34, 53.25, 29.02, 37.34,
25.94, 34.69, 22.65, 27.5, 29.02, 31.92)), row.names = c(NA,
-30L), class = c("tbl_df", "tbl", "data.frame"))
my_friedman <- function(data, variable, by, group, ...) {
# construct the formula
formula <- stringr::str_glue("{variable} ~ {by} | {group}") |> as.formula()
# perform Friedman test
friedman.test(
formula = formula,
data = data
) |>
broom::tidy()
}
my_friedman(data = smpl, variable = "value", by = "Day", group = "Participant.ID")
#> # A tibble: 1 × 4
#> statistic p.value parameter method
#> <dbl> <dbl> <dbl> <chr>
#> 1 1.27 0.531 2 Friedman rank sum test
tbl <-
smpl %>%
select(Participant.ID, value, Day) %>%
tbl_summary(
by = Day,
include = -Participant.ID
) %>%
add_p(test = value ~ "my_friedman", group = Participant.ID)
Created on 2023-01-08 with reprex v2.0.2

Converting TSE data to STS with TraMineRextras

I have a data set which is formatted like TSE:
Event is a code of the mode of entry (1 is export, 2 to 5 is partenariat, 6 to 7 is Acq.Mino, 8 is Acq_Majo). Country is not used.
I would like to convert this table to a sequence object, and I understood that the TSE_to_STS function would help me to do so. However it seems that the output is not as I expect.
I ran the code:
seq.sts<-TSE_to_STS(seq.tse.data,id = 1, timestamp = 2, event = 3, stm =NULL, tmin = 1935,tmax = 2018, firstState = "None").
the output is for company 7:
> seq.sts[7,]
My first surprise is that the event happened in year n and are coded in year n+1 (in red in the above output). My second surprise is that in 2013 and in 2015 I lose information about the second entry in Thailand (in yellow in the above output).Third, I don’t understand why the states are added: why do we have in 2012 Acq_Mino.Acq_Majo and not only Acq_Mino?
The TSE_to_STS function is intended to convert time-stamped event sequences into state sequences. This is done by defining the state in which we are after the event occurs. With “stm = NULL”, the state after each event is characterized as the combination of the current state and the event.
Your example is not reproducible because:
We do not know the re-labeling you applied to your events coded from 1 to 8, and
Seems that you mixed the timestamp and event arguments in the TSE_to_STS example command.
I illustrate below with a reproducible example (based on your example except for the country column that we do not use).
library(TraMineRextras)
seq.tse.data <- structure(list(
ID = c(1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L,
6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L),
Year = c(2008L, 2010L, 2012L, 2007L, 2009L, 2010L, 2012L,
2013L, 1996L, 1997L, 1999L, 2003L, 2006L, 2008L,
2012L, 2007L, 2007L, 2008L, 2003L, 2007L, 2007L,
2009L, 2009L, 2011L, 2014L, 2016L, 2006L, 2009L,
2011L, 2013L, 2013L, 2015L, 2015L, 2016L),
Event = c(8L, 7L, 8L, 5L, 1L, 8L, 8L, 8L, 3L, 3L, 3L,
3L, 3L, 8L, 1L, 8L, 8L, 8L, 7L, 8L, 8L, 8L,
8L, 8L, 8L, 8L, 8L, 8L, 7L, 7L, 6L, 7L, 6L,
8L)),
class = "data.frame", row.names = c(NA, -34L)
)
seq.tse.data[seq.tse.data$ID==7,]
# ID Year Event
# 27 7 2006 8
# 28 7 2009 8
# 29 7 2011 7
# 30 7 2013 7
# 31 7 2013 6
# 32 7 2015 7
# 33 7 2015 6
# 34 7 2016 8
seq.sts <- TSE_to_STS(seq.tse.data,
id = 1, timestamp = 2, event = 3,
stm =NULL, tmin = 1990, tmax = 2018,
firstState = "None")
For the id 7, we get:
seq.sts[7,which(names(seq.sts)=="a2005"):which(names(seq.sts)=="a2018")]
# a2005 a2006 a2007 a2008 a2009 a2010 a2011 a2012 a2013 a2014 a2015 a2016 a2017 a2018
# 7 None None 8 8 8 8 8 8.7 8.7 8.7.6 8.7.6 8.7.6 8.7.6 8.7.6
In this example, for id=7 the first event (8) occurs in 2006. Therefore the state is None until 2006 and switches to 8 afterwards. In 2011, event 7 occurs. So, the next year (2012) we switch to the state 'events 8 and 7 already occurred', which is denoted as 8.7. When 7 occurs again in 2013, it does not change the state because 7 already occurred while the occurrence of event 6 leads to the state 8.7.6. The latter becomes effective the next year, i.e. in 2014.
Different schemes for determining the state resulting after the occurrence of each event can be defined by providing the stm matrix to the TSE_to_STS function. Use the seqe2stm function for defining the stm matrix. (Look at the help page of this latter function for more details.)

How to obtain convert DataFrame to specific RDD?

I have the following DataFrame in Spark 2.2:
df =
v_in v_out
123 456
123 789
456 789
This df defines edges of a graph. Each row is a pair of vertices. I want to extract the Array of edges in order to create an RDD of edges as follows:
val edgeArray = Array(
Edge(2L, 1L, 0.0),
Edge(2L, 4L, 0.2),
Edge(3L, 2L, 0.9),
Edge(3L, 6L, 0.1),
Edge(4L, 1L, 0.0),
Edge(5L, 2L, 0.8),
Edge(5L, 3L, 0.7),
Edge(5L, 6L, 0.5)
)
val spark = SparkSession.builder()
.appName("ES")
.master("local[*]")
.getOrCreate()
implicit val sparkContext = spark.sparkContext
val edgeRDD: RDD[Edge[Double]] = sparkContext.parallelize(edgeArray)
How can I obtain edgeArray of the same structure using df? In each Edge, the third value can be any random Double value from 0 to 1.
UPDATE:
I did it in this way, but not sure if this is the most optimal solution:
val edgeArray = df.rdd.collect().map(row => Edge(row.get(0).toString.toLong, row.get(1).toString.toLong, 0.0))
val edgeRDD: RDD[Edge[Double]] = sparkContext.parallelize(edgeArray)
I don't like to use Array, because I might have millions of edges. Can I pass DataFrame more directly to RDD?
Given
val df = Seq((123, 456), (123, 789), (456, 789)).toDF("v_in", "v_out")
Import
import org.apache.spark.sql.functions.rand
import org.apache.spark.graphx.Edge
and convert:
val edgeRDD = df.toDF("srcId", "dstId")
.withColumn("attr", rand)
.as[Edge[Double]].rdd
With graphframes:
spark.jars.packages graphframes:graphframes:X.X.X-sparkY.Y-s_Z.ZZ
where X.X.X is package version, Y.Y is Spark version and Z.ZZ is Scala version, you can create Graph like this:
GraphFrame.fromEdges(df.toDF("src", "dst")).toGraphX
but it'll use Row attributes.

Get all the nodes connected to a node in Apache Spark GraphX

Suppose we have got the input in Apache GraphX as :
Vertex RDD:
val vertexArray = Array(
(1L, "Alice"),
(2L, "Bob"),
(3L, "Charlie"),
(4L, "David"),
(5L, "Ed"),
(6L, "Fran")
)
Edge RDD:
val edgeArray = Array(
Edge(1L, 2L, 1),
Edge(2L, 3L, 1),
Edge(3L, 4L, 1),
Edge(5L, 6L, 1)
)
I need all the components connected to a node in Apache Spark GraphX
1,[1,2,3,4]
5,[5,6]
You can use ConnectedComponents which returns
a graph with the vertex value containing the lowest vertex id in the connected component containing that vertex.
and reshape results
graph.connectedComponents.vertices.map(_.swap).groupByKey

How can I get the number of common edges in Spark Graphx?

For example, if I have two graphs with vertices and edges like this:
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
val vertexRdd1: RDD[(VertexId, (String, Int))] = sc.parallelize(Array(
(1L, ("a", 28)),
(2L, ("b", 27)),
(3L, ("c", 65))
))
val edgeRdd1: RDD[Edge[Int]] = sc.parallelize(Array(
Edge(1L, 2L, 1),
Edge(2L, 3L, 8)
))
val vertexRdd2: RDD[(VertexId, (String, Int))] = sc.parallelize(Array(
(1L, ("a", 28)),
(2L, ("b", 27)),
(3L, ("c", 28)),
(4L, ("d", 27)),
(5L, ("e", 65))
))
val edgeRdd2: RDD[Edge[Int]] = sc.parallelize(Array(
Edge(1L, 2L, 1),
Edge(2L, 3L, 4),
Edge(3L, 5L, 1),
Edge(2L, 4L, 1)
))
How can I get the number of common edges between these two graphs, without considering the edge attribute? So, in the above example the number of common edges is 2 and the common edges are: Edge(1L, 2L, 1) common with Edge(1L, 2L, 1) and Edge(2L, 3L, 8) common with Edge(2L, 3L, 4).
I am programming in scala.
Assuming you have graph1 (Graph(vertexRdd1, edgeRdd1)) and graph2 (Graph(vertexRdd2, edgeRdd2))) you can map edges to (srcId, dstId) and then use intersection method:
val srcDst1 = graph1.edges.map(e => (e.srcId, e.dstId))
val srcDst2 = graph2.edges.map(e => (e.srcId, e.dstId))
srcDst1.intersection(srcDst2).count()