Create individual p.value / test method detail footnotes for add_difference in gtsummary

I am trying to apply the separate_p_footnotes() function from the gtsummary package to the p-values generated by the add_difference() function.
When I do so, I get an error, because x$meta_data doesn't contain a column named stat_test_lbl.
Example:
library(tidyverse)
library(gtsummary)
trial %>%
  select(trt, age, response) %>%
  tbl_summary(by = trt) %>%
  add_difference() %>%
  separate_p_footnotes()
#> Error in separate_p_footnotes(.): The `x$meta_data` data frame must have a column called 'stat_test_lbl'.
Created on 2023-01-11 with reprex v2.0.2
The test method details are stored much deeper in the object, so I can't figure out how to modify the separate_p_footnotes() function to retrieve the test method details for each row/variable.
Is there a way to retrieve similar separated p.value / test method information for add_difference()?

Is there any way to get the schema from the parquet files being queried?

So, I have parquet files separated into folders by date, something like:
root_folder
|_ date=20210101
|    |_ file_A.parquet
|_ date=20210102
|    |_ file_B.parquet
file_A has 2 columns (X, Y); file_B has 3 columns (X, Y, Z).
But when I query date 20210102 through a SparkSession, it uses the schema from the topmost folder (20210101), so when I try to query column Z it doesn't exist.
I've tried the mergeSchema=true option, but it doesn't fit my use case, because I need to treat the files that have column Z differently, and I check whether column Z exists using DataFrame.columns.
Is there any workaround for this? I need the schema to come only from the partition I query.
If computational cost is not a concern, you can solve this by reading the entire dataset into Spark, filtering to the date you are looking for, and then dropping the column if it is entirely null.
This performs a pass over the data just to figure out whether the column should be dropped, which is not great. Luckily .where and .count parallelize pretty well, so if you have enough compute it might be okay.
import org.apache.spark.sql.functions.col

val base = spark.read
  .option("mergeSchema", true)
  .parquet("root_folder/")
  .where(col("date") === "20210101")

// Drop Z only when it is entirely null for the selected date
val df = if (base.where(col("Z").isNotNull).count == 0) base.drop("Z") else base

df.schema // Should only have X, Y (plus the date partition column)
If you want to generalize this into a function that drops all empty columns, you can compute the .isNotNull count for all columns in 1 pass.
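A minimal sketch of that generalization, assuming the same base DataFrame as above: it counts the non-null values of every column in a single aggregation and then drops the columns whose count is zero.
import org.apache.spark.sql.functions.{col, count}

// One aggregation job that counts non-null values per column
val nonNullCounts = base
  .select(base.columns.map(c => count(col(c)).as(c)): _*)
  .first()

// Columns whose non-null count is 0 are entirely null for this slice of the data
val emptyColumns = base.columns.filter(c => nonNullCounts.getAs[Long](c) == 0L)
val trimmed = base.drop(emptyColumns: _*)
Because count() ignores nulls, this scans the filtered data once regardless of how many columns need to be checked.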

Scala Spark groupBy/Agg functions

I have two datasets that I need to join and perform operations on, and I can't figure out how to do it.
A stipulation for this is that I do not have the org.apache.spark.sql.functions methods available to me, so I must use the Dataset API.
The input given is two Datasets
The first dataset is of type Customer with Fields:
customerId, forename, surname - All String
And the second dataset is of Transaction:
customerId (String), accountId(String), amount (Long)
customerId is the link
The outputted Dataset needs to have these fields:
customerId (String), forename(String), surname(String), transactions( A list of type Transaction), transactionCount (int), totalTransactionAmount (Double),averageTransactionAmount (Double)
I understand that I need to use groupBy, agg, and some kind of join at the end.
Can anyone help/point me in the right direction? Thanks
It is very hard to work with the information you have, but from what I understand you don't want to use the DataFrame functions and instead want to implement everything with the Dataset API. You could do this in the following way:
Join the two datasets using joinWith; you can find an example here: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-joins.html#joinWith
Aggregating: I would use groupByKey followed by mapGroups, something like
ds.groupByKey(_.customerId).mapGroups { case (key, iter) =>
  val list = iter.toList
  val totalTransactionAmount = list.map(_.amount).sum
  val averageTransactionAmount = totalTransactionAmount.toDouble / list.size
  (key, totalTransactionAmount, averageTransactionAmount)
}
Hopefully the example gives you an idea of how you could solve your problem with the Dataset API; you can adapt it to your exact output.
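For completeness, here is a rough end-to-end sketch combining the joinWith and groupByKey steps into the requested output shape. The case class definitions, the customers/transactions Dataset names, and the CustomerSummary result type are assumptions for illustration, not code from the question.
// Assumed case classes matching the fields described in the question
case class Customer(customerId: String, forename: String, surname: String)
case class Transaction(customerId: String, accountId: String, amount: Long)
case class CustomerSummary(
  customerId: String, forename: String, surname: String,
  transactions: Seq[Transaction], transactionCount: Int,
  totalTransactionAmount: Double, averageTransactionAmount: Double)

import spark.implicits._ // assumes an existing SparkSession named spark

// customers: Dataset[Customer], transactions: Dataset[Transaction]
val joined = customers.joinWith(
  transactions,
  customers("customerId") === transactions("customerId"),
  "left_outer")

val result = joined
  .groupByKey { case (customer, _) => customer.customerId }
  .mapGroups { case (_, iter) =>
    val rows = iter.toList
    val customer = rows.head._1
    val txs = rows.flatMap(r => Option(r._2)) // drop nulls from the outer join
    val total = txs.map(_.amount).sum.toDouble
    CustomerSummary(
      customer.customerId, customer.forename, customer.surname,
      txs, txs.size, total,
      if (txs.nonEmpty) total / txs.size else 0.0)
  }
The === comparison is a method on Column rather than something from org.apache.spark.sql.functions, so this stays within the stipulation of using only the Dataset API.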

Spark : Dynamic generation of the query based on the fields in s3 file

Oversimplified Scenario:
A process generates monthly data in an S3 file. The number of fields can differ in each monthly run. Based on this data in S3, we load the data into a table and run a SQL query for a few metrics, manually, because the number of fields can change in each run with a few columns added or removed. There are more calculations/transforms on this data, but as a starter I'm presenting a simpler version of the use case.
Approach:
Given the schema-less nature of the data, where the number of fields in the S3 file can differ in each run (with a few fields added or removed) and the SQL therefore requires manual changes every time, I'm planning to explore Spark/Scala so that we can read directly from S3 and dynamically generate the SQL based on the fields.
Query:
How can I achieve this with Scala/Spark SQL/DataFrames? The S3 file contains only the required fields from each run, so there is no issue reading the dynamic fields from S3; the DataFrame takes care of that. The issue is how to generate the DataFrame-API/Spark SQL code to handle them.
I can read the S3 file into a DataFrame and register it with createOrReplaceTempView to write SQL, but I don't think that helps: the Spark SQL would still need manual changes whenever a new field is added to S3 in the next run. What is the best way to dynamically generate the SQL, or is there a better way to handle this?
Usecase-1:
First run
dataframe: customer, month_1_count (here the dataframe points directly to S3, which has only the required attributes)
--sample code
SELECT customer,sum(month_1_count)
FROM dataframe
GROUP BY customer
--Dataframe API/SparkSQL
dataframe.groupBy("customer").sum("month_1_count").show()
Second run - one additional column was added
dataframe: customer, month_1_count, month_2_count (here the dataframe points directly to S3, which has only the required attributes)
--Sample SQL
SELECT customer,sum(month_1_count),sum(month_2_count)
FROM dataframe
GROUP BY customer
--Dataframe API/SparkSQL
dataframe.groupBy("customer").sum("month_1_count","month_2_count").show()
I'm new to Spark/Scala; it would be helpful if you could point me in a direction so that I can explore further.
It sounds like you want to perform the same operation over and over again on new columns as they appear in the dataframe schema. This works (shown here in PySpark):
from pyspark.sql import functions
#search for column names you want to sum, I put in "month"
column_search = lambda col_names: 'month' in col_names
#get column names of temp dataframe w/ only the columns you want to sum
relevant_columns = original_df.select(*filter(column_search, original_df.columns)).columns
#create dictionary with relevant column names to be passed to the agg function
columns = {col_names: "sum" for col_names in relevant_columns}
#apply agg function with your groupBy, passing in columns dictionary
grouped_df = original_df.groupBy("customer").agg(columns)
#show result
grouped_df.show()
Some important concepts that can help you learn:
DataFrames store their column names in a list: dataframe.columns
Functions can be applied to lists to create new lists, as in column_search
agg accepts multiple expressions passed as a dictionary (see the GroupedData.agg documentation), which is what I pass in as columns
Spark is lazy, so it doesn't change data state or perform operations until you trigger an action like show(). This means that creating intermediate DataFrames just to use one piece of them (such as the column list), as I do above, is not costly, even though it may seem inefficient if you're used to SQL.
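Since the question asks about Scala/Spark, a rough Scala equivalent of the same idea might look like the sketch below; originalDf and the month_*_count column names are assumptions carried over from the question.
import org.apache.spark.sql.functions.sum

// Pick out the columns to aggregate by name, like column_search above
val monthColumns = originalDf.columns.filter(_.contains("month"))

// Build one sum() expression per matching column
val aggregations = monthColumns.map(c => sum(c).as(s"sum($c)"))

// Group by the fixed key and apply all the dynamically built aggregations
val groupedDf = originalDf.groupBy("customer").agg(aggregations.head, aggregations.tail: _*)
groupedDf.show()
If a run contains no matching month columns, aggregations is empty and the agg call above would fail, so a guard may be needed in practice.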

How to read partitioned parquets with same structure but different column names?

I have parquet files that are partitioned by the date created (BusinessDate) and the data source (SourceSystem). Some source systems generate their data with different column names (small stuff like capitalization, e.g. orderdate vs OrderDate), but the same overall data structure (column order and data types are always the same between files).
My data looks like this in my filesystem:
dataroot
|- BusinessDate=20170809
|    |- SourceSystem=StoreA
|    |    |- data.parquet (has column "orderdate")
|    |- SourceSystem=StoreB
|    |    |- data.parquet (has column "OrderDate")
Is there a way to read the data in from either dataroot or dataroot/BusinessDate=######/, and somehow normalize the data into a uniform schema?
My first attempt was to try:
val inputDF = spark.read.parquet(samplePqt)
standardNames = Seq(...) //list of uniform column names in order
val uniformDF = inputDF.toDF(standardNames: _*)
But this does not work: it renames the columns that already match between source systems, but it populates null for records from source system B, whose column names differ.
I never did find a way to process all of the data in one pass. My solution iterates through the distinct source systems, builds a file path for each source system, and processes them individually. As each one is processed, it is transformed into the standard schema and unioned with the other results:
val inputDF = spark.read.parquet(dataroot) // dataroot contains the business date
// Distinct source systems for this business date (partition column from the path)
val sourceList = inputDF.select("SourceSystem").distinct.collect.map(_.getString(0)).toList
sourceList.foreach(println)
for (ss <- sourceList) {
  // process each source system individually (see the sketch below)
}
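A rough sketch of the per-source processing described above; the exact path layout and the reuse of standardNames from the first attempt are assumptions, not code from the original answer.
import org.apache.spark.sql.DataFrame

// standardNames: the uniform column names in file order, as in the first attempt above
val perSource: List[DataFrame] = sourceList.map { ss =>
  spark.read
    .parquet(s"$dataroot/SourceSystem=$ss/") // dataroot already points at the business date
    .toDF(standardNames: _*)                 // rename positionally to the uniform schema
}

// Union the per-source results into one DataFrame with the standard schema
val uniformDF = perSource.reduce(_ union _)
Reading a single SourceSystem directory drops the partition columns from the schema, so toDF only has to supply names for the data columns.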

Merge two datasets duplicate BY variables Or I want to make following form

I am a novice SAS programmer.
I have a question about merging two datasets.
The two data sets look like this (see the linked Excel sheet image).
Please let me know the key concepts or code to make this happen!
I have searched for the answer through Googling etc., but there is no site that solves exactly what I want.
(If it is possible, I would like to tackle the above question without PROC SQL.)
To get the desired result you should do a Cartesian product (cross join), which pairs each row in table1 with all the rows in table2. I have used PROC SQL to do this, and I am eager to see how it can be done with a DATA step. Here's what I know:
proc sql;
  create table test_merge as
  select a.*, b.type_rhs, b.rhs1, b.rhs2
  from test a, test11 b
  where a.yearmonth = b.yearmonth;
quit;
Again, I am new to SAS as well, and I think this is one of the ways to create the desired output.
When working with huge data, you will see a note in the log that says "The execution of this query involves performing one or more Cartesian product joins that can not be optimized."