Omit categories from table - gtsummary

The following code
library(tidyverse)
library(gtsummary)
df <- tibble(category = c("a", "a", "a",
                          "b", "b", "b", "b"))
output_table <- df %>%
  tbl_summary()
output_table
produces this table.
Is it possible to remove the "a" category from this table without changing the associated frequencies? So in this case, the final table should look like this (but without the extra whitespace).

Using the tbl_summary(value=) argument, you can select a single level of a categorical variable to display in the table.
library(gtsummary)
trial %>%
  select(grade) %>%
  # show only one level for grade
  tbl_summary(value = grade ~ "I",
              label = grade ~ "Grade I")
You can also just delete a single row from the output. But you'll need to install the development version of the package to use the new function modify_table_body().
remotes::install_github("ddsjoberg/gtsummary")
trial %>%
  select(grade) %>%
  tbl_summary() %>%
  # remove grade I row
  modify_table_body(filter, !(variable == "grade" & label == "I"))


How can I add a custom type of column to my dataframe, or add this when collecting .as[Class]?

I'd love some help with an issue I've been stuck on for quite some time now.
I have a class Test(id: Int, name: String, type: SaleType), where SaleType is an enum that can be either SALE or PURCHASE. I want to collect this DataFrame into a list of Test objects.
I have a dataframe like this.
val ds = Seq(
  (0, "p"),
  (1, "s")
).toDF("id", "name")
If I try to add a column SaleType
.withColumn("SaleType",when(col("name") === "p", SaleType.PURCHASE)
.otherwise(SaleType.SALE)
This does not work because I the column type cannot be SaleType.
When I try to create a dataframe with this
val ds = Seq(
  (0, "p"),
  (1, "s")
).toDF("id", "name")
  .withColumn("SaleType", when(col("name") === "p", "PURCHASE")
    .otherwise("SALE"))
  .as[Test].collect().toList
I get an error, since the Test objects cannot be created because my third column is a String rather than a SaleType.
Is there a function that can make sure I can still create a list of Test objects?
I thought I could leave out the type column, collect the DataFrame, add the SaleType afterwards, and then convert with .as[...], but I'm unsure how to do that.
Thanks!
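One way to realise the approach described in the last paragraph (keep the type as a plain string inside the DataFrame, collect, and only then build the enum values on the driver) could look like the sketch below. It is only a sketch: it assumes SaleType is a plain Scala Enumeration, it renames the field to saleType because type is a Scala keyword, and it assumes spark is an existing SparkSession; all names are illustrative.
// Sketch only: SaleType as a plain Scala Enumeration, Test as a case class.
object SaleType extends Enumeration {
  val SALE, PURCHASE = Value
}
case class Test(id: Int, name: String, saleType: SaleType.Value)

import org.apache.spark.sql.functions.{col, lit, when}
import spark.implicits._

val ds = Seq((0, "p"), (1, "s")).toDF("id", "name")

// Keep the type as a string inside the DataFrame, collect, then map to the enum on the driver.
val tests: List[Test] = ds
  .withColumn("saleType", when(col("name") === "p", lit("PURCHASE")).otherwise(lit("SALE")))
  .as[(Int, String, String)]
  .collect()
  .map { case (id, name, st) => Test(id, name, SaleType.withName(st)) }
  .toList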

How to not make dplyr mutate calculate for missing group

I have data looking like this:
df=data.frame(a=1:6,b=rep(c("one","two"),each=3))
df[2,2]<-NA
I want to calculate the mean of each group for each row, like this:
df %>% group_by(b) %>% mutate(mean=mean(a))
The problem is that R views the NA as a group. Desired output would be
mean=c(2,NA,2,5,5,5).
Attempt:
df %>% group_by(b) %>% mutate(mean=if_else(b==NA,NA,mean(a)))
but this throws an error
Try
df %>% group_by(b) %>% mutate(mean=mean(a)) %>% mutate(mean = if_else(is.na(b), NA_real_, mean))
If you want to avoid error messages:
library(hablar)
df %>%
  convert(chr(b)) %>%
  group_by(b) %>%
  mutate(mean = if_else_(!is.na(b), mean(a), NA))

Pyspark groupby then sort within group

I have a table which contains id, offset, text. Suppose input:
id  offset  text
1   1       hello
1   7       world
2   1       foo
I want output like:
id  text
1   hello world
2   foo
I'm using:
df.groupby("id").agg(concat_ws(" ", collect_list("text")))
But I don't know how to ensure the order of the text. I did sort the data before the groupby, but I've heard that groupby might shuffle the data. Is there a way to sort within each group after the groupby?
This will create the required df:
df1 = sqlContext.createDataFrame([("1", "1","hello"), ("1", "7","world"), ("2", "1","foo")], ("id", "offset" ,"text" ))
display(df1)
Then you can use the following code (it could be optimized further):
from pyspark.sql.functions import udf, col, lit, concat, concat_ws, collect_list

@udf
def sort_by_offset(col):
    # split the "offset text" pairs, sort them numerically by offset, and rejoin the text
    result = ""
    text_list = col.split("-")
    for i in range(len(text_list)):
        text_list[i] = text_list[i].split(" ")
        text_list[i][0] = int(text_list[i][0])
    text_list = sorted(text_list, key=lambda x: x[0], reverse=False)
    for i in range(len(text_list)):
        result = result + " " + text_list[i][1]
    return result.lstrip()
df2 = df1.withColumn("offset_text",concat(col("offset"),lit(" "),col("text")))
df3 = df2.groupby(col("id")).agg(concat_ws("-",collect_list(col("offset_text"))).alias("offset_text"))
df4 = df3.withColumn("text",sort_by_offset(col("offset_text")))
display(df4)
Final Output:
Add sort_array:
from pyspark.sql.functions import sort_array
df.groupby("id").agg(concat_ws(" ", sort_array(collect_list("text"))))

How to calculate product of columns followed by sum over all columns?

Table 1 --Spark DataFrame table
There is a column called "productMe" in Table 1, and there are also other columns such as a, b, c and so on, whose names are contained in a schema array T.
What I want is the inner product of the columns in T with the column productMe (multiplying the two columns row by row), which gives Table 2. Then sum each column of Table 2 to get Table 3.
Table 2 is not necessary if you have a good idea for getting Table 3 in one step.
Table 2 -- Inner product table
For example, the column "a·productMe" is (3*0.2, 6*0.6, 5*0.4) = (0.6, 3.6, 2).
Table 3 -- sum table
For example, the column "sum(a·productMe)" is 0.6+3.6+2=6.2.
Table 1 is a Spark DataFrame; how can I get Table 3?
You can try something like the following:
val df = Seq(
  (3, 0.2, 0.5, 0.4),
  (6, 0.6, 0.3, 0.1),
  (5, 0.4, 0.6, 0.5)).toDF("productMe", "a", "b", "c")

import org.apache.spark.sql.functions.{col, round, sum}

val columnsToSum = df.
  columns.  // <-- grab all the columns by their name
  tail.     // <-- skip productMe
  map(col). // <-- create Column objects
  map(c => round(sum(c * col("productMe")), 3).as(s"sum_${c}_productMe"))

val df2 = df.select(columnsToSum: _*)
df2.show()
+---------------+---------------+---------------+
|sum_a_productMe|sum_b_productMe|sum_c_productMe|
+---------------+---------------+---------------+
|            6.2|            6.3|            4.3|
+---------------+---------------+---------------+
The trick is to use df.select(columnsToSum: _*), which selects all the columns on which we computed the sum of the column times the productMe column. The : _* is Scala syntax for passing a sequence as repeated (vararg) arguments, since select does not take a fixed number of arguments.
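As a tiny standalone illustration of the : _* syntax (plain Scala, nothing Spark-specific here):
// A method with a repeated (vararg) parameter:
def joinAll(parts: String*): String = parts.mkString(", ")

val xs = Seq("a", "b", "c")
joinAll(xs: _*)   // expands the Seq into repeated arguments -> "a, b, c"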
We can do it with plain Spark SQL:
val table1 = Seq(
  (3, 0.2, 0.5, 0.4),
  (6, 0.6, 0.3, 0.1),
  (5, 0.4, 0.6, 0.5)
).toDF("productMe", "a", "b", "c")
table1.show

table1.createOrReplaceTempView("table1")

val table2 = spark.sql("select a*productMe, b*productMe, c*productMe from table1") // spark is the SparkSession here
table2.show

val table3 = spark.sql("select sum(a*productMe), sum(b*productMe), sum(c*productMe) from table1")
table3.show
All the other answers use the sum aggregation, which uses groupBy under the covers.
groupBy always introduces a shuffle stage and usually (always?) is slower than corresponding window aggregates.
In this particular case, I also believe that window aggregates give better performance, as you can see in their physical plans and in the details of their single job.
CAUTION
Either solution uses a single partition to do the calculation, which in turn makes them unsuitable for large datasets, as their combined size may easily exceed the memory of a single JVM.
Window Aggregates
What follows is a window aggregate-based calculation which, in this particular case where we group over all the rows in a dataset, unfortunately gives the same physical plan. That makes my answer just a (hopefully) nice learning experience.
val df = Seq(
  (3, 0.2, 0.5, 0.4),
  (6, 0.6, 0.3, 0.1),
  (5, 0.4, 0.6, 0.5)).toDF("productMe", "a", "b", "c")
// yes, I did borrow this trick with columns from #eliasah's answer
import org.apache.spark.sql.functions.{col, sum}
val columns = df.columns.tail.map(col).map(c => c * col("productMe") as s"${c}_productMe")
val multiplies = df.select(columns: _*)
scala> multiplies.show
+------------------+------------------+------------------+
| a_productMe| b_productMe| c_productMe|
+------------------+------------------+------------------+
|0.6000000000000001| 1.5|1.2000000000000002|
|3.5999999999999996|1.7999999999999998|0.6000000000000001|
| 2.0| 3.0| 2.5|
+------------------+------------------+------------------+
def sumOverRows(name: String) = sum(name) over ()
val multipliesCols = multiplies.
  columns.
  map(c => sumOverRows(c) as s"sum_${c}")

val answer = multiplies.
  select(multipliesCols: _*).
  limit(1) // <-- don't use distinct or dropDuplicates here
scala> answer.show
+-----------------+---------------+-----------------+
| sum_a_productMe|sum_b_productMe| sum_c_productMe|
+-----------------+---------------+-----------------+
|6.199999999999999| 6.3|4.300000000000001|
+-----------------+---------------+-----------------+
Physical Plan
Let's see the physical plan then (as it was the only reason why we wanted to see how to do the query using window aggregates, wasn't it?). The physical plan and the details for the only job, job 0, were shown as Spark UI screenshots.
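If you want to print the plan rather than read it off the Spark UI, Dataset.explain writes it to the console:
// Print the physical plan of the window-aggregate query above.
answer.explain()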
If I understand your question correctly, then the following can be your solution.
val df = Seq(
  (3, 0.2, 0.5, 0.4),
  (6, 0.6, 0.3, 0.1),
  (5, 0.4, 0.6, 0.5)
).toDF("productMe", "a", "b", "c")
This gives the input DataFrame as you have it (you can add more rows):
+---------+---+---+---+
|productMe|a  |b  |c  |
+---------+---+---+---+
|3        |0.2|0.5|0.4|
|6        |0.6|0.3|0.1|
|5        |0.4|0.6|0.5|
+---------+---+---+---+
And
val productMe = df.columns.head
val colNames = df.columns.tail

var tempdf = df
for (column <- colNames) {
  tempdf = tempdf.withColumn(column, col(column) * col(productMe))
}
The above steps should give you Table 2:
+---------+------------------+------------------+------------------+
|productMe|a                 |b                 |c                 |
+---------+------------------+------------------+------------------+
|3        |0.6000000000000001|1.5               |1.2000000000000002|
|6        |3.5999999999999996|1.7999999999999998|0.6000000000000001|
|5        |2.0               |3.0               |2.5               |
+---------+------------------+------------------+------------------+
Table 3 can be achieved as follows:
tempdf.select(sum("a").as("sum(a.productMe)"), sum("b").as("sum(b.productMe)"), sum("c").as("sum(c.productMe)")).show(false)
Table 3 is:
+-----------------+----------------+-----------------+
|sum(a.productMe) |sum(b.productMe)|sum(c.productMe) |
+-----------------+----------------+-----------------+
|6.199999999999999|6.3             |4.300000000000001|
+-----------------+----------------+-----------------+
Table 2 can be built this way for any number of columns, whereas the Table 3 line above lists its columns explicitly.
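That said, if you would rather not list the columns by hand for Table 3 either, a possible variation (an untested sketch that reuses the colNames array defined above) is to build the sum expressions programmatically:
import org.apache.spark.sql.functions.sum

// Build one sum expression per column name instead of writing them out explicitly.
val sumCols = colNames.map(c => sum(c).as(s"sum($c.productMe)"))
tempdf.select(sumCols: _*).show(false)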

Scala code to label rows of data frame based on another data frame

I just started learning Scala for data analytics, and I ran into a problem when trying to label the rows of one data frame based on another data frame.
Suppose I have a df1 with columns "date", "id", "value", and "label", where "label" is initially set to "F" for all rows of df1. Then I have a df2, a smaller set of data with columns "date", "id", "value". I want to change the label in df1 from "F" to "T" if that row appears in df2, i.e. some row in df2 has the same combination of ("date", "id", "value") as that row in df1.
I tried df.filter and df.join, but it seems that neither solves my problem.
I think this is what you are looking for.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

val spark = SparkSession.builder().master("local").appName("test").getOrCreate()
import spark.implicits._

// create DataFrame 1
val df1 = spark.sparkContext.parallelize(Seq(
  ("2016-01-01", 1, "abcd", "F"),
  ("2016-01-01", 2, "efg", "F"),
  ("2016-01-01", 3, "hij", "F"),
  ("2016-01-01", 4, "klm", "F")
)).toDF("date", "id", "value", "label")

// create DataFrame 2
val df2 = spark.sparkContext.parallelize(Seq(
  ("2016-01-01", 1, "abcd"),
  ("2016-01-01", 3, "hij")
)).toDF("date1", "id1", "value1")

val condition = $"date" === $"date1" && $"id" === $"id1" && $"value" === $"value1"

// join the two DataFrames with the above condition
val result = df1.join(df2, condition, "left")

// check whether both sides matched and drop the extra columns
val finalResult = result.withColumn("label", condition)
  .drop("date1", "id1", "value1")

// update the label column from true/false to "T"/"F"
finalResult.withColumn("label", when(col("label") === true, "T").otherwise("F")).show
The basic idea is to join the two and then calculate the result. Something like this:
val df2Mod = df2.withColumn("tmp", lit(true))

val joined = df1.join(df2Mod,
  df1("date") <=> df2Mod("date") && df1("id") <=> df2Mod("id") && df1("value") <=> df2Mod("value"),
  "left_outer")

joined.withColumn("label", when(joined("tmp").isNull, "F").otherwise("T"))
The idea is that we add the "tmp" column and then do a left_outer join. "tmp" would be null for everything not in df2 and therefore we can use that to calculate the label.