How to add two difference statistics in tbl_summary? - gtsummary

I am a beginner in r and trying to create some tables using this great gtsummary package. My question is that is it possible to use add_difference to add two separate difference statistics at the same time (or to combine these somehow)? I am able to create a perfect table with p-values or effect size, but not with both. Also, is it possible to use bonferroni adjusted p-values?
My simple code (with t.test) looks like this:
table1 <- tbl_summary(df, by = gr, statistic = list(all_continuous() ~ "{mean} ({sd})")) %>% add_difference(test = list(all_continuous() ~ "t.test"), group = gr, conf.level = 0.95, pvalue_fun = function(x) style_pvalue(x, digits = 2)))
thanks for help.

Yes, the table you're after is possible in gtsummary. Use the tbl_merge() function to merge the tables with standardized differences and the p-values. There is an example of this below. An alternative to this is to use add_stat() to construct customized columns.
library(gtsummary)
#> #BlackLivesMatter
packageVersion("gtsummary")
#> [1] '1.4.1'
# standardized difference
tbl1 <-
trial %>%
select(trt, age, marker) %>%
tbl_summary(by = trt, missing = "no",
statistic = all_continuous() ~ "{mean} ({sd})") %>%
add_difference(all_continuous() ~ "cohens_d")
# table with p-value and corrected p-values
tbl2 <-
trial %>%
select(trt, age, marker) %>%
tbl_summary(by = trt, missing = "no") %>%
add_p(all_continuous() ~ "t.test") %>%
add_q(method = "bonferroni") %>%
# hide the trt summaries, because we don't need them repeated
modify_column_hide(all_stat_cols())
#> add_q: Adjusting p-values with
#> `stats::p.adjust(x$table_body$p.value, method = "bonferroni")`
# merge tbls together
tbl_final <-
tbl_merge(list(tbl1, tbl2)) %>%
# remove spanning headers
modify_spanning_header(everything() ~ NA)
Created on 2021-07-20 by the reprex package (v2.0.0)

Related

Using gt functions to style gtsummary merged tables

I've been utilising the as_gt() function to format my gtsummary tables and this works fine on each individual table. But when trying to merge tables (tbl_merge) together this error appears:
Error: Error in argument 'x='. Expecting object of class 'gtsummary'
This happens regardless of feeding the gt format options to each individual table, or only to the merged object, same error. Any ideas on workarounds? Repex below.
library(survival)
library(tidyverse)
library(gt)
library(gtsummary)
# Works fine for individual tbls
t1 <-
glm(response ~ trt + grade + age, trial, family = binomial) %>%
tbl_regression(exponentiate = TRUE) %>%
as_gt() %>%
gt::tab_options(table.font.names = "Times New Roman")
t1
t2 <-
coxph(Surv(ttdeath, death) ~ trt + grade + age, trial) %>%
tbl_regression(exponentiate = TRUE)
# Error appears
tbl_merge_ex1 <-
tbl_merge(
tbls = list(t1, t2),
tab_spanner = c("**Tumor Response**", "**Time to Death**") %>%
as_gt() %>%
gt::tab_options(table.font.names = "Times New Roman")
)
Error: Error in argument 'x='. Expecting object of class 'gtsummary'
you need to remove all as_gt() until the very end after the merge like so:
library(survival)
library(tidyverse)
library(gt)
library(gtsummary)
t1 <-
glm(response ~ trt + grade + age, trial, family = binomial) %>%
tbl_regression(exponentiate = TRUE)
t2 <-
coxph(Surv(ttdeath, death) ~ trt + grade + age, trial) %>%
tbl_regression(exponentiate = TRUE)
tbl_merge_ex1 <-
tbl_merge(
tbls = list(t1, t2),
tab_spanner = c("**Tumor Response**", "**Time to Death**")
) %>%
as_gt() %>%
gt::tab_options(table.font.names = "Times New Roman")

Tbl_Strata count by distinct individual vs rows

How can I use tbl_strata and get the output to show counts by distinct individual rather than rows?
Also, how can I change the order that displays for the variable I am putting in the by= section in tbl_summary?
I have a long table AND a wide table with one row per patient. Not sure how to apply the wide table to this code. I can apply the long table but getting row counts instead of distinct patient counts.
I have included an example of the Long table I have and the wide table and what I would like the output to look like in the picture.
Example code:
#Wide Table Example
df_Wide <- data.frame(patientICN =c(1, 2, 3, 4, 5)
,testtype =c("liquid", "tissue", "tissue", "liquid", "liquid")
,gene1 =c("unk", "pos", "neg", "neg", "unk")
,gene2 =c("pos", "neg", "pos", "unk", "neg")
,gene3 =c("neg", "unk", "unk", "pos", "pos"))
#Long Table Example
df_Long <- data.frame(patientICN =c(1, 1, 2, 2, 3)
,testtype =c("liquid", "tissue", "tissue", "liquid", "liquid")
,gene =c("Gene1", "Gene2", "Gene3", "Gene1", "Gene2")
,result=c("Positve", "Negative", "Unknown","Positive","Unknown"))
#Table Categorized by testtype and result for long table
df_Long %>%
select (result, gene, testtype)%>%
mutate(testcategory=paste("TestType",testtype))%>%
tbl_strata(
strata=testtype,
.tbl_fun =
~.x %>%
tbl_summary(by=result,missing="no")%>%
add_n(),
.header= "**{strata}**, N={n}"
)
##above is giving multiple Rows per patient counts
Is this what you're after? You can install the bstfun pkg from my R-universe: https://ddsjoberg.r-universe.dev/ui#packages
library(gtsummary)
library(dplyr)
#Long Table Example
df_Long <- data.frame(patientICN =c(1, 1, 2, 2, 3)
,testtype =c("liquid", "tissue", "tissue", "liquid", "liquid")
,gene =c("Gene1", "Gene2", "Gene3", "Gene1", "Gene2")
,result=c("Positive", "Negative", "Unknown","Positive","Unknown"))
tbl <-
df_Long %>%
tidyr::pivot_wider(
id_cols = c(patientICN, testtype),
names_from = gene,
values_from = result,
values_fill = "Unknown"
) %>%
mutate(across(starts_with('Gene'), ~factor(.x, levels = c("Positive", "Negative", "Unknown")))) %>%
tbl_strata(
strata = testtype,
~ .x %>%
bstfun::tbl_likert(
include = starts_with("Gene")
)
)
Created on 2022-10-06 with reprex v2.0.2

How to do handle this use-case (running-window data) in spark

I am using spark-sql-2.4.1v with java 1.8.
Have source data as below :
val df_data = Seq(
("Indus_1","Indus_1_Name","Country1", "State1",12789979,"2020-03-01"),
("Indus_1","Indus_1_Name","Country1", "State1",12789979,"2019-06-01"),
("Indus_1","Indus_1_Name","Country1", "State1",12789979,"2019-03-01"),
("Indus_2","Indus_2_Name","Country1", "State2",21789933,"2020-03-01"),
("Indus_2","Indus_2_Name","Country1", "State2",300789933,"2018-03-01"),
("Indus_3","Indus_3_Name","Country1", "State3",27989978,"2019-03-01"),
("Indus_3","Indus_3_Name","Country1", "State3",30014633,"2017-06-01"),
("Indus_3","Indus_3_Name","Country1", "State3",30014633,"2017-03-01"),
("Indus_4","Indus_4_Name","Country2", "State1",41789978,"2020-03-01"),
("Indus_4","Indus_4_Name","Country2", "State1",41789978,"2018-03-01"),
("Indus_5","Indus_5_Name","Country3", "State3",67789978,"2019-03-01"),
("Indus_5","Indus_5_Name","Country3", "State3",67789978,"2018-03-01"),
("Indus_5","Indus_5_Name","Country3", "State3",67789978,"2017-03-01"),
("Indus_6","Indus_6_Name","Country1", "State1",37899790,"2020-03-01"),
("Indus_6","Indus_6_Name","Country1", "State1",37899790,"2020-06-01"),
("Indus_6","Indus_6_Name","Country1", "State1",37899790,"2018-03-01"),
("Indus_7","Indus_7_Name","Country3", "State1",26689900,"2020-03-01"),
("Indus_7","Indus_7_Name","Country3", "State1",26689900,"2020-12-01"),
("Indus_7","Indus_7_Name","Country3", "State1",26689900,"2019-03-01"),
("Indus_8","Indus_8_Name","Country1", "State2",212359979,"2018-03-01"),
("Indus_8","Indus_8_Name","Country1", "State2",212359979,"2018-09-01"),
("Indus_8","Indus_8_Name","Country1", "State2",212359979,"2016-03-01"),
("Indus_9","Indus_9_Name","Country4", "State1",97899790,"2020-03-01"),
("Indus_9","Indus_9_Name","Country4", "State1",97899790,"2019-09-01"),
("Indus_9","Indus_9_Name","Country4", "State1",97899790,"2016-03-01")
).toDF("industry_id","industry_name","country","state","revenue","generated_date");
Query :
val distinct_gen_date = df_data.select("generated_date").distinct.orderBy(desc("generated_date"));
For each "generated_date" in list distinct_gen_date , need to get all unique industry_ids for 6 months data
val cols = {col("industry_id")}
val ws = Window.partitionBy(cols).orderBy(desc("generated_date"));
val newDf = df_data
.withColumn("rank",rank().over(ws))
.where(col("rank").equalTo(lit(1)))
//.drop(col("rank"))
.select("*");
How to get moving aggregate (on unique industry_ids for 6 months data ) for each distinct item , how to achieve this moving aggregation.
more details :
Example, in the given sample data given , assume, is from "2020-03-01" to "2016-03-01". if some industry_x is not there in "2020-03-01", need to check "2020-02-01" "2020-01-01","2019-12-01","2019-11-01","2019-10-01","2019-09-01" sequentically whenever we found thats rank-1 is taken into consider for that data set for calculating "2020-03-01" data......we next go .."2020-02-01" i.e. each distinct "generated_date".. for each distinct date go back 6 months get unique industries ..pick rank 1 data...this data for ."2020-02-01"...next pick another distinct "generated_date" and do same so on .....here dataset keep changing....using for loop I can do but it is not giving parallesm..how to pick distinct dataset for each distinct "generated_date" parallell ?
I don't know how to do this with window functions but a self join can solve your problem.
First, you need a DataFrame with distinct dates:
val df_dates = df_data
.select("generated_date")
.withColumnRenamed("generated_date", "distinct_date")
.distinct()
Next, for each row in your industries data you need to calculate up to which date that industry will be included, i.e., add 6 months to generated_date. I think of them as active dates. I've used add_months() to do this but you can think of different logics.
import org.apache.spark.sql.functions.add_months
val df_active = df_data.withColumn("active_date", add_months(col("generated_date"), 6))
If we start with this data (separated by date just for our eyes):
industry_id generated_date
(("Indus_1", ..., "2020-03-01"),
("Indus_1", ..., "2019-12-01"),
("Indus_2", ..., "2019-12-01"),
("Indus_3", ..., "2018-06-01"))
It has now:
industry_id generated_date active_date
(("Indus_1", ..., "2020-03-01", "2020-09-01"),
("Indus_1", ..., "2019-12-01", "2020-06-01"),
("Indus_2", ..., "2019-12-01", "2020-06-01")
("Indus_3", ..., "2018-06-01", "2018-12-01"))
Now proceed with self join based on dates, using the join condition that will match your 6 month period:
val condition: Column = (
col("distinct_date") >= col("generated_date")).and(
col("distinct_date") <= col("active_date"))
val df_joined = df_dates.join(df_active, condition, "inner")
df_joined has now:
distinct_date industry_id generated_date active_date
(("2020-03-01", "Indus_1", ..., "2020-03-01", "2020-09-01"),
("2020-03-01", "Indus_1", ..., "2019-12-01", "2020-06-01"),
("2020-03-01", "Indus_2", ..., "2019-12-01", "2020-06-01"),
("2019-12-01", "Indus_1", ..., "2019-12-01", "2020-06-01"),
("2019-12-01", "Indus_2", ..., "2019-12-01", "2020-06-01"),
("2018-06-01", "Indus_3", ..., "2018-06-01", "2018-12-01"))
Drop that auxiliary column active_date or even better, drop duplicates based on your needs:
val df_result = df_joined.dropDuplicates(Seq("distinct_date", "industry_id"))
Which drops the duplicated "Indus_1" in "2020-03-01" (It appeared twice because it's retrieved from two different generated_dates):
distinct_date industry_id
(("2020-03-01", "Indus_1"),
("2020-03-01", "Indus_2"),
("2019-12-01", "Indus_1"),
("2019-12-01", "Indus_2"),
("2018-06-01", "Indus_3"))

Querying by Keys with LogicBlox

If I have two predicates (not functional):
addblock 'city(city_dim_id) -> int(city_dim_id).'
addblock 'city_name[city_dim_id] = name -> int(city_dim_id), string(name).'
I can add facts:
exec '+city(1).'
exec '+city_name[0] = "N/A".'
exec '+city_name[1] = "Chicago".'
These are then queries of facts in the predicates:
query '_(city_name) <- city_name(city_name, _).'
query '_(city_name) <- city_name(_, city_name).'
query '_(city_dim_id, city_name) <- city_name(city_dim_id, city_name).'
My question is how do I make a query to show
1. what are the city_dim_id in both tables,
2. return city_dim_id and city_name, but only where city_dim_id present in both tables?
Thanks in advance.
Sorry I'm struggling to understand the question.
The following will return the city_dim_id's that have the same city_name.
_(c1, c2) <-
city(c1),
city(c2),
city_name[c1] = city_name[c2],
c1 != c2.
If by ' city_dim_id in both tables ' you mean 'city_dim_id which are in both tables' then you want
_(id) <-city(id), city_name[id] = _.
if on the other hand you want the id who are in either table, you need to replace the conjunction by a disjunction.
_(id) <- city(id); city_name[id] = _.
I think you want
_(id,name) <- city(id), city_name[id] = name.
note: if you use the square bracket syntax city_name[id] = name then the predicate WILL be functional

F# Navigate object graph to return specific Nodes

I'm trying to build a list of DataTables based on DataRelations in a DataSet, where the tables returned are only included by their relationships with each other, knowing each end of the chain in advance. Such that my DataSet has 7 Tables. The relationships look like this:
Table1 -> Table2 -> Table3 -> Table4 -> Table5
-> Table6 -> Table7
So given Table1 and Table7, I want to return Tables 1, 2, 3, 6, 7
My code so far traverses all the relations and returns all Tables so in the example it returns Table4 and Table5 as well. I've passed in the first, last as arguments, and know that I'm not yet using the last one yet, I am still trying to think how to go about it, and is where I need the help.
type DataItem =
| T of DataTable
| R of DataRelation list
let GetRelatedTables (first, last) =
let rec flat_rec dt acc =
match dt with
| T(dt) ->
let rels = [ for r in dt.ParentRelations do yield r ]
dt :: flat_rec(R(rels)) acc
| R(h::t) ->
flat_rec(R(t)) acc # flat_rec(T(h.ParentTable)) acc
| R([]) -> []
flat_rec first []
I think something like this would do it (although I haven't tested it). It returns DataTable list option because, in theory, a path between two tables might not exist.
let findPathBetweenTables (table1 : DataTable) (table2 : DataTable) =
let visited = System.Collections.Generic.HashSet() //check for circular references
let rec search path =
let table = List.head path
if not (visited.Add(table)) then None
elif table = table2 then Some(List.rev path)
else
table.ChildRelations
|> Seq.cast<DataRelation>
|> Seq.tryPick (fun rel -> search (rel.ChildTable::path))
search [table1]