gtsummary modify_table_styling in multiple rows

I am attempting to modify the indentation of multiple rows in gtsummary. How can I select multiple rows under "rows"?
For example, this works:
df1 %>%
  tbl_summary() %>%
  modify_table_styling(
    columns = label,
    rows = variable == "var1",
    text_format = "indent2"
  )
but this doesn't:
df1 %>%
  tbl_summary() %>%
  modify_table_styling(
    columns = label,
    rows = variable == c("var1", "var2"),
    text_format = "indent2"
  )
Thanks

Related

Updating multiple columns to blank when a flag is true, otherwise keeping column values as-is, in PySpark

Here df contains other columns and values in addition to the columns mentioned in col_list, but we need to update only the columns in col_list when the flag is true. The code below is not working as expected.
col_list = ['age', 'label', 'test', 'length', 'work', 'least']
for i in col_list:
    test = df.withColumn(i, when(df.flag != True, df[i]).otherwise(''))
You should replace df.withColumn with test_df.withColumn so that each iteration builds on the previous one. Right now each pass starts from the original df, so only the last column's change is kept.
from pyspark.sql import functions as F

test_df = df
for i in col_list:
    test_df = test_df.withColumn(i, F.when(df.flag != True, df[i]).otherwise(''))

How to label columns and retain group sizes when splitting summary table by group?

When creating a summary table, split by group, the size of each group automatically shows up at the top of their respective columns. So the column headings look like this: Characteristic | 1, N = 100 | 2, N = 120. Code below:
library(dplyr)
library(gtsummary)

data %>%
  select(group, age, sex) %>%
  tbl_summary(by = group)
However, I would like to name my groups to something more meaningful than "1" and "2". For example, if my data consists of kids in a swim class, I would want to name the groups by the name of the swim class: ducks and turtles. So I do something like this:
library(dplyr)
library(gtsummary)

data %>%
  select(group, age, sex) %>%
  tbl_summary(by = group) %>%
  modify_header(
    update = list(
      stat_1 ~ "**Ducks**",
      stat_2 ~ "**Turtles**")) %>%
  modify_spanning_header(
    update = starts_with("stat_") ~ "Swim Class Name")
This works! However, the size of each group disappears from the top of their respective columns. My work-around is to add in the size of each group manually, as part of the names. I have to leave a little note for myself to check the N for each group before adding it in. Like this:
library(dplyr)
library(gtsummary)

data %>%
  select(group, age, sex) %>%
  tbl_summary(by = group) %>%
  modify_header(
    update = list(
      stat_1 ~ "**Ducks**, N = 100",
      stat_2 ~ "**Turtles**, N = 120")) %>% # check the N for each group first; remove this line to see the default appearance, which shows the N
  modify_spanning_header(
    update = starts_with("stat_") ~ "Swim Class Name")
This works, but it's error-prone: it requires me to double-check the numbers and then add them in manually.
How do I label the columns, representing each group, AND retain the numbers showing group sizes when splitting the summary table by group?
There are two ways to get this done.
The first is to change the levels in the data frame before you pass it to tbl_summary(). Then the default column header will have your custom headers with the correct Ns by default.
The second is to take advantage of the dynamic statistics available within modify_header(). When you have a tbl_summary(by=) object split by a variable, you can access {n}, {N}, and {p}, and they can be placed in the column header. Review the help file for details: http://www.danieldsjoberg.com/gtsummary/reference/modify.html (Note you need gtsummary v1.3.6 for this code to work.)
Both methods lead to identical tables.
library(gtsummary)
packageVersion("gtsummary")
#> [1] '1.3.6'

# Method 1: Change the underlying data before passing df to `tbl_summary()`
tbl1 <-
  trial %>%
  select(trt, age) %>%
  mutate(trt = factor(trt, labels = c("Duck", "Turtle"))) %>%
  tbl_summary(by = trt, missing = "no")

# Method 2: Use the dynamic stats available in `modify_header()`
tbl2 <-
  trial %>%
  select(trt, age) %>%
  tbl_summary(by = trt, missing = "no") %>%
  modify_header(list(
    stat_1 ~ "**Duck**, N = {n}",
    stat_2 ~ "**Turtle**, N = {n}"
  ))
Created on 2021-01-18 by the reprex package (v0.3.0)

How to remove rows with more than x null values in pyspark

I am having some trouble removing rows with more than na_threshold nulls from my dataframe:
na_threshold=2
df3=df3.dropna(thresh=len(df3.columns) - na_threshold)
When I run
df_null = df3.where(reduce(lambda x, y: x | y, (f.col(x).isNull() for x in df3.columns)))
df_null is a dataframe with 1 row, which has only one column containing a null value.
I have tried increasing the value na_threshold but it hasn't made a difference.
I have realised that the dropna function does work. What happened is that the file was initially read in with Pandas, and I had put the dropna call before converting other null-like values to null (Pandas uses nan, NaT, and sometimes "-"):
from pyspark.sql import functions as F

for column in df3.columns:
    df3 = df3.withColumn(column, F.when(F.col(column) == 'nan', None).otherwise(F.col(column)))
    df3 = df3.withColumn(column, F.when(F.col(column) == 'NaT', None).otherwise(F.col(column)))
    df3 = df3.withColumn(column, F.when(F.col(column) == '-', None).otherwise(F.col(column)))

na_threshold = 2
df3 = df3.dropna(thresh=len(df3.columns) - na_threshold)

How to rename a duplicate column using column index?

I have a dataframe with two columns of the same name. Since the first column (agreementID) holds values, I want to rename the second one, which holds null values, to a different name. I want to use agreementID as a key in the future.
How can I rename the column using its position or index?
val columnIndex = 1
val newColumnName = "new_name"

val cols = df.columns
cols(columnIndex) = newColumnName
val renamed = df.toDF(cols: _*)
This should work:
val distinctColumns = Seq("name", "agreementId", "dupAgreementId")
val df2 = df.toDF(distinctColumns: _*)

pyspark processing & compare 2 dataframes

I am working in pyspark (Spark 2.2.0) with 2 dataframes that have common columns. The requirement I am dealing with is as below: join the 2 frames per the rule below.
frame1 = [Column 1, Column 2, Column 3....... column_n] ### dataframe
frame2 = [Column 1, Column 2, Column 3....... column_n] ### dataframe
key = [Column 1, Column 2] ### is an array
If frame1.[Column1, column2] == frame2.[Column1, column2]
if frame1.column_n == frame2.column_n
write to a new data frame DF_A using values from frame 2 as is
if frame1.column_n != frame2.column_n
write to a new data frame DF_A using values from frame 1 as is
write to a new data frame DF_B using values from frame 2 but with column3, & column 5 hard coded values
To do this, I am first creating 2 temp views and constructing 3 SQLs dynamically.
sql_1 = select frame1.* from frame1 join frame2 on [frame1.keys] = [frame2.keys]
where frame1.column_n=frame2.column_n
DF_A = sqlContext.sql(sql_1)
sql_2 = select [all columns from frame1] from frame1 join frame2 on [frame1.keys] = [frame2.keys]
where frame1.column_n != frame2.column_n
DF_A = DF_A.union(sqlContext.sql(sql_2))
sql_3 = select [all columns from frame2 except for column3 & column5 to be hard coded] from frame1 join frame2 on [frame1.keys] = [frame2.keys]
where frame1.column_n != frame2.column_n
DF_B = sqlContext.sql(sql_3)
Question 1: is there a better way to dynamically pass key columns for joining? I am currently doing this by maintaining key columns in arrays (it works) and constructing the SQL.
Question 2: is there a better way to dynamically pass selection columns without changing the sequence of columns? I am currently doing this by maintaining column names in an array and performing concatenation.
I did consider a single full outer join, but since the column names are the same I thought it would have more overhead from renaming.
For questions 1 and 2, I went with getting the column names from the dataframe schema (df.schema.names and df.columns) and string processing inside the loop.
For the logic, I went with a minimum of 2 SQLs, one of them a full outer join.