How can I use tbl_strata and get the output to show counts of distinct individuals rather than counts of rows?
Also, how can I change the display order of the levels of the variable I pass to by= in tbl_summary?
I have a long table and a wide table with one row per patient. I am not sure how to apply the wide table to this code. I can apply the long table, but then I get row counts instead of distinct patient counts.
I have included an example of the long table, the wide table, and what I would like the output to look like in the picture.
Example code:
#Wide Table Example
df_Wide <- data.frame(patientICN =c(1, 2, 3, 4, 5)
,testtype =c("liquid", "tissue", "tissue", "liquid", "liquid")
,gene1 =c("unk", "pos", "neg", "neg", "unk")
,gene2 =c("pos", "neg", "pos", "unk", "neg")
,gene3 =c("neg", "unk", "unk", "pos", "pos"))
#Long Table Example
df_Long <- data.frame(patientICN =c(1, 1, 2, 2, 3)
,testtype =c("liquid", "tissue", "tissue", "liquid", "liquid")
,gene =c("Gene1", "Gene2", "Gene3", "Gene1", "Gene2")
,result=c("Positive", "Negative", "Unknown","Positive","Unknown"))
#Table Categorized by testtype and result for long table
df_Long %>%
select (result, gene, testtype)%>%
mutate(testcategory=paste("TestType",testtype))%>%
tbl_strata(
strata=testtype,
.tbl_fun =
~.x %>%
tbl_summary(by=result,missing="no")%>%
add_n(),
.header= "**{strata}**, N={n}"
)
## The code above counts rows, so a patient with multiple rows is counted more than once
Is this what you're after? You can install the bstfun pkg from my R-universe: https://ddsjoberg.r-universe.dev/ui#packages
library(gtsummary)
library(dplyr)
#Long Table Example
df_Long <- data.frame(patientICN =c(1, 1, 2, 2, 3)
,testtype =c("liquid", "tissue", "tissue", "liquid", "liquid")
,gene =c("Gene1", "Gene2", "Gene3", "Gene1", "Gene2")
,result=c("Positive", "Negative", "Unknown","Positive","Unknown"))
tbl <-
df_Long %>%
tidyr::pivot_wider(
id_cols = c(patientICN, testtype),
names_from = gene,
values_from = result,
values_fill = "Unknown"
) %>%
mutate(across(starts_with('Gene'), ~factor(.x, levels = c("Positive", "Negative", "Unknown")))) %>%
tbl_strata(
strata = testtype,
~ .x %>%
bstfun::tbl_likert(
include = starts_with("Gene")
)
)
Created on 2022-10-06 with reprex v2.0.2
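The reshaping step is what turns row counts into patient counts: after pivoting, each patient and test type combination occupies exactly one row, so every gene column is tallied once per patient within a stratum. Below is a minimal sketch of that idea in plain Python, outside gtsummary, using the sample rows from the answer. The names long_rows, wide and counts are illustrative only and do not belong to any package.

```python
from collections import Counter, defaultdict

# Long-format rows from the example: (patientICN, testtype, gene, result)
long_rows = [
    (1, "liquid", "Gene1", "Positive"),
    (1, "tissue", "Gene2", "Negative"),
    (2, "tissue", "Gene3", "Unknown"),
    (2, "liquid", "Gene1", "Positive"),
    (3, "liquid", "Gene2", "Unknown"),
]
genes = sorted({gene for _, _, gene, _ in long_rows})

# Pivot wider: one record per (patient, testtype); untested genes default to "Unknown".
wide = defaultdict(lambda: {g: "Unknown" for g in genes})
for patient, testtype, gene, result in long_rows:
    wide[(patient, testtype)][gene] = result

# Tally results per (testtype, gene); each (patient, testtype) pair contributes once.
counts = defaultdict(Counter)
for (patient, testtype), results in wide.items():
    for gene, result in results.items():
        counts[(testtype, gene)][result] += 1

for key in sorted(counts):
    print(key, dict(counts[key]))
```

Because the tally runs over the pivoted records, the totals in each stratum equal the number of distinct patients with that test type, which is the behaviour the question asks for.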
I am using Spark SQL 2.4.1 with Java 1.8.
I have source data as below:
val df_data = Seq(
("Indus_1","Indus_1_Name","Country1", "State1",12789979,"2020-03-01"),
("Indus_1","Indus_1_Name","Country1", "State1",12789979,"2019-06-01"),
("Indus_1","Indus_1_Name","Country1", "State1",12789979,"2019-03-01"),
("Indus_2","Indus_2_Name","Country1", "State2",21789933,"2020-03-01"),
("Indus_2","Indus_2_Name","Country1", "State2",300789933,"2018-03-01"),
("Indus_3","Indus_3_Name","Country1", "State3",27989978,"2019-03-01"),
("Indus_3","Indus_3_Name","Country1", "State3",30014633,"2017-06-01"),
("Indus_3","Indus_3_Name","Country1", "State3",30014633,"2017-03-01"),
("Indus_4","Indus_4_Name","Country2", "State1",41789978,"2020-03-01"),
("Indus_4","Indus_4_Name","Country2", "State1",41789978,"2018-03-01"),
("Indus_5","Indus_5_Name","Country3", "State3",67789978,"2019-03-01"),
("Indus_5","Indus_5_Name","Country3", "State3",67789978,"2018-03-01"),
("Indus_5","Indus_5_Name","Country3", "State3",67789978,"2017-03-01"),
("Indus_6","Indus_6_Name","Country1", "State1",37899790,"2020-03-01"),
("Indus_6","Indus_6_Name","Country1", "State1",37899790,"2020-06-01"),
("Indus_6","Indus_6_Name","Country1", "State1",37899790,"2018-03-01"),
("Indus_7","Indus_7_Name","Country3", "State1",26689900,"2020-03-01"),
("Indus_7","Indus_7_Name","Country3", "State1",26689900,"2020-12-01"),
("Indus_7","Indus_7_Name","Country3", "State1",26689900,"2019-03-01"),
("Indus_8","Indus_8_Name","Country1", "State2",212359979,"2018-03-01"),
("Indus_8","Indus_8_Name","Country1", "State2",212359979,"2018-09-01"),
("Indus_8","Indus_8_Name","Country1", "State2",212359979,"2016-03-01"),
("Indus_9","Indus_9_Name","Country4", "State1",97899790,"2020-03-01"),
("Indus_9","Indus_9_Name","Country4", "State1",97899790,"2019-09-01"),
("Indus_9","Indus_9_Name","Country4", "State1",97899790,"2016-03-01")
).toDF("industry_id","industry_name","country","state","revenue","generated_date");
Query:
val distinct_gen_date = df_data.select("generated_date").distinct.orderBy(desc("generated_date"));
For each "generated_date" in distinct_gen_date, I need to get all unique industry_ids within the preceding 6 months of data.
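import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, lit, rank}
// Rank each industry's rows from newest to oldest generated_date and keep the newest
val cols = {col("industry_id")}
val ws = Window.partitionBy(cols).orderBy(desc("generated_date"));
val newDf = df_data
.withColumn("rank",rank().over(ws))
.where(col("rank").equalTo(lit(1)))
//.drop(col("rank"))
.select("*");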
How can I compute a moving aggregate (over the unique industry_ids in 6 months of data) for each distinct generated_date?
More details:
Example: assume the sample data spans "2016-03-01" to "2020-03-01". If some industry_x has no row for "2020-03-01", I need to check "2020-02-01", "2020-01-01", "2019-12-01", "2019-11-01", "2019-10-01" and "2019-09-01" in sequence; the first row found (rank 1) is the one used when calculating the "2020-03-01" data set. Then I move on to the next distinct "generated_date", e.g. "2020-02-01": again go back 6 months, collect the unique industries, and pick the rank-1 row for each. This repeats for every distinct "generated_date", so the data set keeps changing. I can do this with a for loop, but that gives no parallelism. How can I build the data set for each distinct "generated_date" in parallel?
I don't know how to do this with window functions but a self join can solve your problem.
First, you need a DataFrame with distinct dates:
val df_dates = df_data
.select("generated_date")
.withColumnRenamed("generated_date", "distinct_date")
.distinct()
Next, for each row in your industries data you need to calculate the last date on which that row should still be included, i.e., generated_date plus 6 months. I think of these as active dates. I've used add_months() to do this, but you can use different logic if your window is defined differently.
import org.apache.spark.sql.functions.{add_months, col}
val df_active = df_data.withColumn("active_date", add_months(col("generated_date"), 6))
If we start with this data (separated by date just for our eyes):
industry_id generated_date
(("Indus_1", ..., "2020-03-01"),
("Indus_1", ..., "2019-12-01"),
("Indus_2", ..., "2019-12-01"),
("Indus_3", ..., "2018-06-01"))
It now has:
industry_id generated_date active_date
(("Indus_1", ..., "2020-03-01", "2020-09-01"),
("Indus_1", ..., "2019-12-01", "2020-06-01"),
("Indus_2", ..., "2019-12-01", "2020-06-01"),
("Indus_3", ..., "2018-06-01", "2018-12-01"))
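The active_date values can be sanity-checked without a Spark session. The sketch below is a plain-Python stand-in for the month arithmetic, not Spark's implementation: the helper add_months here is hypothetical and clamps the day to the end of the target month, which is an assumption. Spark's handling of end-of-month dates may differ between versions, but every sample date falls on the first of the month, so that edge case does not arise here.

```python
import calendar
from datetime import date

def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping the day to the target month's end."""
    index = d.year * 12 + (d.month - 1) + months   # months counted from year 0
    year, month0 = divmod(index, 12)               # month0 is zero-based
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(d.day, last_day))

# generated_date values from the small example
for generated in ["2020-03-01", "2019-12-01", "2018-06-01"]:
    active = add_months(date.fromisoformat(generated), 6)
    print(generated, "->", active.isoformat())
```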
Now perform the self join on dates, using a join condition that matches your 6-month period:
import org.apache.spark.sql.Column
val condition: Column = (
col("distinct_date") >= col("generated_date")).and(
col("distinct_date") <= col("active_date"))
val df_joined = df_dates.join(df_active, condition, "inner")
df_joined now has:
distinct_date industry_id generated_date active_date
(("2020-03-01", "Indus_1", ..., "2020-03-01", "2020-09-01"),
("2020-03-01", "Indus_1", ..., "2019-12-01", "2020-06-01"),
("2020-03-01", "Indus_2", ..., "2019-12-01", "2020-06-01"),
("2019-12-01", "Indus_1", ..., "2019-12-01", "2020-06-01"),
("2019-12-01", "Indus_2", ..., "2019-12-01", "2020-06-01"),
("2018-06-01", "Indus_3", ..., "2018-06-01", "2018-12-01"))
Drop the auxiliary column active_date or, even better, drop duplicates according to your needs:
val df_result = df_joined.dropDuplicates(Seq("distinct_date", "industry_id"))
This drops the duplicated "Indus_1" for "2020-03-01" (it appeared twice because it was matched from two different generated_dates):
distinct_date industry_id
(("2020-03-01", "Indus_1"),
("2020-03-01", "Indus_2"),
("2019-12-01", "Indus_1"),
("2019-12-01", "Indus_2"),
("2018-06-01", "Indus_3"))
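One caveat: dropDuplicates does not promise which of the duplicate rows it keeps, while the question asks for the rank-1 (most recent) row in each window. The sketch below replays the join on the small example in plain Python and keeps the latest generated_date per (distinct_date, industry_id). It only illustrates the logic under that assumption about the requirement; in Spark, one option would be a row_number() window partitioned by those two columns and ordered by generated_date descending. The names active_rows and latest are illustrative.

```python
# Illustrative rows from the walkthrough: (industry_id, generated_date, active_date)
active_rows = [
    ("Indus_1", "2020-03-01", "2020-09-01"),
    ("Indus_1", "2019-12-01", "2020-06-01"),
    ("Indus_2", "2019-12-01", "2020-06-01"),
    ("Indus_3", "2018-06-01", "2018-12-01"),
]
distinct_dates = sorted({generated for _, generated, _ in active_rows}, reverse=True)

# ISO-8601 date strings sort chronologically, so string comparison is enough here.
latest = {}
for distinct_date in distinct_dates:
    for industry_id, generated, active in active_rows:
        if generated <= distinct_date <= active:      # same test as the join condition
            key = (distinct_date, industry_id)
            if key not in latest or generated > latest[key]:
                latest[key] = generated               # keep the most recent row only

for (distinct_date, industry_id), generated in sorted(latest.items(), reverse=True):
    print(distinct_date, industry_id, generated)
```

The result has the same five (distinct_date, industry_id) pairs as the table above, but the row retained for "Indus_1" on "2020-03-01" is deterministically the one generated on "2020-03-01".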
If I have two predicates (not functional):
addblock 'city(city_dim_id) -> int(city_dim_id).'
addblock 'city_name[city_dim_id] = name -> int(city_dim_id), string(name).'
I can add facts:
exec '+city(1).'
exec '+city_name[0] = "N/A".'
exec '+city_name[1] = "Chicago".'
These are then queries of facts in the predicates:
query '_(city_name) <- city_name(city_name, _).'
query '_(city_name) <- city_name(_, city_name).'
query '_(city_dim_id, city_name) <- city_name(city_dim_id, city_name).'
My question is: how do I write a query that
1. shows which city_dim_id values are in both tables,
2. returns city_dim_id and city_name, but only where city_dim_id is present in both tables?
Thanks in advance.
Sorry, I'm struggling to understand the question.
The following will return pairs of distinct city_dim_ids that share the same city_name.
_(c1, c2) <-
city(c1),
city(c2),
city_name[c1] = city_name[c2],
c1 != c2.
If by 'city_dim_id in both tables' you mean the city_dim_ids that are present in both tables, then you want
_(id) <-city(id), city_name[id] = _.
If, on the other hand, you want the ids that are in either table, you need to replace the conjunction with a disjunction.
_(id) <- city(id); city_name[id] = _.
I think you want
_(id,name) <- city(id), city_name[id] = name.
Note: if you use the square-bracket syntax city_name[id] = name, then the predicate WILL be functional.
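For readers less familiar with the comma and semicolon operators, the three queries above behave like set intersection, set union, and a lookup restricted to the intersection. Below is a minimal sketch in plain Python using the facts from the question; it is only an analogy for the semantics, not LogiQL, and the variable names are illustrative.

```python
# Facts from the question, mirrored as plain Python collections.
city = {1}                               # +city(1).
city_name = {0: "N/A", 1: "Chicago"}     # +city_name[0] = "N/A". +city_name[1] = "Chicago".

# Conjunction (comma): ids present in both predicates.
in_both = city & set(city_name)

# Disjunction (semicolon): ids present in either predicate.
in_either = city | set(city_name)

# Ids with their names, restricted to ids present in both.
id_and_name = {(i, city_name[i]) for i in in_both}

print(sorted(in_both))
print(sorted(in_either))
print(sorted(id_and_name))
```

Modelling city_name as a dict also reflects the note about functional predicates: each id maps to at most one name.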