PySpark - Get aggregated values for dynamic columns in a dataframe

I have a dataframe with the below rows:
+------+--------+-------+-------+
| label| machine| value1| value2|
+------+--------+-------+-------+
|label1|machine1|     13|    7.5|
|label1|machine1|      9|    7.5|
|label1|machine1|    8.5|    7.5|
|label1|machine1|   10.5|    7.5|
|label1|machine1|     12|      8|
|label1|machine2|      8|   13.5|
|label1|machine2|     18|     10|
|label1|machine2|     10|     14|
|label1|machine2|      9|   10.5|
|label1|machine2|    8.5|     10|
|label2|machine3|      8|    7.5|
|label2|machine3|     18|    7.5|
|label2|machine3|     10|    7.5|
|label2|machine3|      9|    7.5|
|label2|machine3|    8.5|      8|
|label2|machine4|   13.5|     13|
|label2|machine4|     10|      9|
|label2|machine4|     14|    8.5|
|label2|machine4|   10.5|   10.5|
|label2|machine4|     10|     12|
+------+--------+-------+-------+
Here, I can have multiple value columns other than value1, value2 in the data frame. For every column, I want to aggregate the values with collect_list and create a new column in the data frame, so that I can perform some functions later.
For this, I tried like this:
my_df = my_df.groupBy(['label', 'machine']) \
    .agg(collect_list("value1").alias("collected_value1"), collect_list("value2").alias("collected_value2"))
It is giving me the below 4 rows as I'm grouping by label and machine columns.
+------+--------+--------------------+--------------------+
| label| machine|    collected_value1|    collected_value2|
+------+--------+--------------------+--------------------+
|label1|machine1|[13.0, 9.0, 8.5, ...|[7.5, 7.5, 7.5, 7...|
|label1|machine2|[8.0, 18.0, 10.0,...|[13.5, 10.0, 14.0...|
|label2|machine3|[8.0, 18.0, 10.0,...|[7.5, 7.5, 7.5, 7...|
|label2|machine4|[13.5, 10.0, 14.0...|[13.0, 9.0, 8.5, ...|
+------+--------+--------------------+--------------------+
Now, my problem here is how to pass columns dynamically to this group by. The columns might differ for every run, so I want to use something like this:
df_cols = ['value1', 'value2']
my_df = my_df.groupBy(['label', 'machine']). \
agg(collect_list(col_name).alias(str(col_name+"_collected")) for col_name in df_cols)
This gives me an "AssertionError: all exprs should be Column" error.
How can I achieve this? Can someone please help me on this?
Thanks in advance.

The code below worked. Thank you.
from pyspark.sql.functions import collect_list

exprs = [collect_list(x).alias(str(x + "_collected")) for x in df_cols]
my_df = my_df.groupBy(['label', 'machine']).agg(*exprs)
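If even the list of value columns is not known in advance, one option is to derive df_cols from the DataFrame itself before building the expression list. A minimal sketch, assuming every column other than label and machine should be collected and that it runs against the original (pre-aggregation) my_df:
from pyspark.sql.functions import collect_list

group_cols = ['label', 'machine']
# Treat every non-grouping column as a value column (an assumption for this sketch).
df_cols = [c for c in my_df.columns if c not in group_cols]
exprs = [collect_list(c).alias(c + "_collected") for c in df_cols]
my_df = my_df.groupBy(group_cols).agg(*exprs)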

Related

How to save one of the values of a column eliminated by groupBy in Spark

Here is a sample of the dataFrame I have:
+----+------+-------+-----+
| id | pool | start | end |
+----+------+-------+-----+
| 1| cat| 100| 105|
| 2| cat| 104| 110|
| 3| cat| 130| 135|
| 4| dog| 100| 110|
| 5| dog| 101| 103|
| 6| rat| 150| 151|
| 7| rat| 180| 187|
| 8| rat| 183| 184|
| 9| cat| 143| 150|
| 10| dog| 107| 120|
| ...| ...| ...| ...|
+----+------+-------+-----+
And I want to group by the pool while preserving the min and max values seen in the start and end for each pool. Great, I run
sourceDataframe.groupBy("pool").agg(min("start"), max("end"))
which gives me:
+------+-----------+---------+
| pool | min(start)| max(end)|
+------+-----------+---------+
| cat| 100| 150|
| dog| 100| 120|
| rat| 150| 187|
| ...| ...| ...|
+------+-----------+---------+
But I want one more thing, which is any id belonging to each pool. If possible, preferably the one with the maximum end (or any other arbitrary requirement I suppose). And if possible, without doing another join after the fact.
Example of success:
// If two have the same maximum end, it should just pick an arbitrary one
+------+-----------+---------+-----------+
| pool | min(start)| max(end)| idOfMaxEnd|
+------+-----------+---------+-----------+
| cat| 100| 150| 9|
| dog| 100| 120| 10|
| rat| 150| 187| 7|
| ...| ...| ...| ...|
+------+-----------+---------+-----------+
Thanks for your time and effort!
Edit: Okay so doing the following almost gives me what I want:
sourceDataframe.groupBy("pool").agg(min("start"), max("end"), first("id"))
It saves the first ID of each group that it comes across. I would like to save the ID with the maximum end per group, if possible. I know I could solve this by sorting the DataFrame but that would require too much time.
You can use window functions with two different windows, although I am not sure it will be more efficient than using a join (which you mentioned you want to avoid):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, first}

val df: DataFrame = Seq(
  (1, "cat", 100, 1335),
  (3, "cat", 130, 135),
  (7, "rat", 180, 187),
  (8, "rat", 183, 184)
).toDF("id", "pool", "start", "end")

val windowSpecEnd = Window.partitionBy("pool").orderBy(col("end").desc)
val windowSpecStart = Window.partitionBy("pool").orderBy(col("start"))

val aggDf = df
  .withColumn("min_start", first("start").over(windowSpecStart))
  .withColumn("max_end", first("end").over(windowSpecEnd))
  .withColumn("idOfMaxEnd", first("id").over(windowSpecEnd))
  .select("pool", "min_start", "max_end", "idOfMaxEnd")
  .distinct
aggDf.show
// output:
+----+---------+-------+----------+
|pool|min_start|max_end|idOfMaxEnd|
+----+---------+-------+----------+
| rat| 180| 187| 7|
| cat| 100| 1335| 1|
+----+---------+-------+----------+
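Since the thread is tagged pyspark, here is an alternative sketch that needs only one pass and no distinct: aggregate a struct of (end, id), whose maximum is ordered by end first, so its id field is an id belonging to the maximum end. It assumes the sourceDataframe from the question:
from pyspark.sql import functions as F

result = sourceDataframe.groupBy("pool").agg(
    F.min("start").alias("min_start"),
    F.max("end").alias("max_end"),
    # max of the (end, id) struct compares end first, so its id field is the id of the max end
    F.max(F.struct("end", "id")).getField("id").alias("idOfMaxEnd"),
)
result.show()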

How do I find a max of a column for each unique value in another column in Spark?

Say I have a dataset similar to this:
My final product needs to be a row for each day of the week with the Place that had the most Activities for that day, i.e. Mon Place A 56, Wed Place C 64, etc. I have tried using the Window function with max and groupBy, but I am getting myself confused.
For your purposes you need to write a window function:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, row_number}

val df = Seq(
("Mon", "Place A", 10),
("Mon", "Place B", 42),
("Wed", "Place C", 41),
("Thurs", "Place D", 45),
("Fri", "Place E", 64),
("Fri", "Place A", 12),
("Wed", "Place F", 54),
("Wed", "Place A", 1)
).toDF("day", "place", "number")
df.show()
df.withColumn("orderedNumberForDay",
row_number()
.over(
Window.orderBy(col("number").desc)
.partitionBy("day")
)
).filter(col("orderedNumberForDay") === lit(1))
.select("day", "place", "number")
.show()
/*
+-----+-------+------+ +-----+-------+------+
| day| place|number| | day| place|number|
+-----+-------+------+ +-----+-------+------+
| Mon|Place A| 10| | Mon|Place B| 42|
| Mon|Place B| 42| ===>> | Wed|Place F| 54|
| Wed|Place C| 41| | Fri|Place E| 64|
|Thurs|Place D| 45| |Thurs|Place D| 45|
| Fri|Place E| 64| +-----+-------+------+
| Fri|Place A| 12|
| Wed|Place F| 54|
| Wed|Place A| 1|
+-----+-------+------+
*/
Just a little explanation of how it works.
First, you need to add a column containing the window function result:
df.withColumn("orderedNumberForDay",
row_number()
.over(
Window.orderBy(col("number").desc)
.partitionBy("day")
)
)
row_number() is a counter of rows inside each partition. A partition is like a group in a group by: partitionBy("day") puts rows with the same day value into the same window. Finally, we order each window by number in descending order, hence orderBy(col("number").desc) in the window specification. over is the bridge between the window and the computation performed inside it; it binds row_number to the window specification.
After executing this stage we have the following data:
+-----+-------+------+-------------------+
| day| place|number|orderedNumberForDay|
+-----+-------+------+-------------------+
| Mon|Place B| 42| 1|
| Mon|Place A| 10| 2|
| Wed|Place F| 54| 1|
| Wed|Place C| 41| 2|
| Wed|Place A| 1| 3|
| Fri|Place E| 64| 1|
| Fri|Place A| 12| 2|
|Thurs|Place D| 45| 1|
+-----+-------+------+-------------------+
So all we need to do is filter the rows where orderedNumberForDay equals 1 (those are the rows with the maximum number) and select the original columns: day, place, number. The final result is:
+-----+-------+------+
| day| place|number|
+-----+-------+------+
| Mon|Place B| 42|
| Wed|Place F| 54|
| Fri|Place E| 64|
|Thurs|Place D| 45|
+-----+-------+------+
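Since the question is tagged pyspark, roughly the same row_number approach in PySpark looks like this (a sketch, assuming a DataFrame df with the day, place and number columns used above):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank places within each day by number, descending, and keep the top row per day.
w = Window.partitionBy("day").orderBy(F.col("number").desc())
(df.withColumn("orderedNumberForDay", F.row_number().over(w))
   .filter(F.col("orderedNumberForDay") == 1)
   .select("day", "place", "number")
   .show())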
Spark 3.0 introduced the aggregation function max_by that does exactly what you are looking for:
df.groupBy("day")
.agg(expr("max_by(place, number)"), max('number))
.show()
Result:
+-----+---------------------+-----------+
| day|max_by(place, number)|max(number)|
+-----+---------------------+-----------+
| Mon| Place B| 42|
| Wed| Place F| 54|
| Fri| Place E| 64|
|Thurs| Place D| 45|
+-----+---------------------+-----------+
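In PySpark the same aggregation can be expressed through expr; a sketch on the same assumed df:
from pyspark.sql import functions as F

# max_by(place, number) returns the place from the row with the largest number per group (Spark 3.0+).
df.groupBy("day").agg(
    F.expr("max_by(place, number)").alias("place"),
    F.max("number").alias("max_number"),
).show()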

How do I replace null values of multiple columns with values from multiple different columns

I have a data frame like below
data = [
(1, None,7,10,11,19),
(1, 4,None,10,43,58),
(None, 4,7,67,88,91),
(1, None,7,78,96,32)
]
df = spark.createDataFrame(data, ["A_min", "B_min","C_min","A_max", "B_max","C_max"])
df.show()
and I want the null values in the 'min' columns to be replaced by the value from the equivalent 'max' column.
For example, null values in the A_min column should be replaced by the value from the A_max column.
The result should be the data frame below.
+-----+-----+-----+-----+-----+-----+
|A_min|B_min|C_min|A_max|B_max|C_max|
+-----+-----+-----+-----+-----+-----+
| 1| 11| 7| 10| 11| 19|
| 1| 4| 58| 10| 43| 58|
| 67| 4| 7| 67| 88| 91|
| 1| 96| 7| 78| 96| 32|
+-----+-----+-----+-----+-----+-----+
I have tried the code below by defining the columns but clearly this does not work. Really appreciate any help.
min_cols = ["A_min", "B_min","C_min"]
max_cols = ["A_max", "B_max","C_max"]
for i in min_cols
df = df.withColumn(i,when(f.col(i)=='',max_cols.otherwise(col(i))))
display(df)
Assuming you have the same number of max and min columns, you can use coalesce along with a Python list comprehension to obtain your solution:
from pyspark.sql.functions import coalesce
min_cols = ["A_min", "B_min","C_min"]
max_cols = ["A_max", "B_max","C_max"]
df.select(
    *[coalesce(df[val], df[max_cols[pos]]).alias(val) for pos, val in enumerate(min_cols)],
    *max_cols
).show()
Output:
+-----+-----+-----+-----+-----+-----+
|A_min|B_min|C_min|A_max|B_max|C_max|
+-----+-----+-----+-----+-----+-----+
| 1| 11| 7| 10| 11| 19|
| 1| 4| 58| 10| 43| 58|
| 67| 4| 7| 67| 88| 91|
| 1| 96| 7| 78| 96| 32|
+-----+-----+-----+-----+-----+-----+
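Alternatively, if you prefer something closer to the loop in the question, a sketch that overwrites each min column in place with coalesce (assuming the min_cols and max_cols lists defined above line up pairwise):
from pyspark.sql.functions import coalesce, col

# Fill each *_min column from its *_max counterpart when the min value is null.
for min_col, max_col in zip(min_cols, max_cols):
    df = df.withColumn(min_col, coalesce(col(min_col), col(max_col)))
df.show()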

Adding column with sum of all rows above in same grouping

I need to create a 'rolling count' column which takes the previous count and adds the new count for each day and company. I have already organized and sorted the dataframe into groups of ascending dates per company with the corresponding count. I also added a 'ix' column which indexes each grouping, like so:
+--------------------+--------------------+-----+---+
| Normalized_Date| company|count| ix|
+--------------------+--------------------+-----+---+
|09/25/2018 00:00:...|[5c40c8510fb7c017...| 7| 1|
|09/25/2018 00:00:...|[5bdb2b543951bf07...| 9| 1|
|11/28/2017 00:00:...|[593b0d9f3f21f9dd...| 7| 1|
|11/29/2017 00:00:...|[593b0d9f3f21f9dd...| 60| 2|
|01/09/2018 00:00:...|[593b0d9f3f21f9dd...| 1| 3|
|04/27/2018 00:00:...|[593b0d9f3f21f9dd...| 9| 4|
|09/25/2018 00:00:...|[593b0d9f3f21f9dd...| 29| 5|
|11/20/2018 00:00:...|[593b0d9f3f21f9dd...| 42| 6|
|12/11/2018 00:00:...|[593b0d9f3f21f9dd...| 317| 7|
|01/04/2019 00:00:...|[593b0d9f3f21f9dd...| 3| 8|
|02/13/2019 00:00:...|[593b0d9f3f21f9dd...| 15| 9|
|04/01/2019 00:00:...|[593b0d9f3f21f9dd...| 1| 10|
+--------------------+--------------------+-----+---+
The output I need would simply add up all the counts up to that date for each company. Like so:
+--------------------+--------------------+-----+---+------------+
| Normalized_Date| company|count| ix|RollingCount|
+--------------------+--------------------+-----+---+------------+
|09/25/2018 00:00:...|[5c40c8510fb7c017...| 7| 1| 7|
|09/25/2018 00:00:...|[5bdb2b543951bf07...| 9| 1| 9|
|11/28/2017 00:00:...|[593b0d9f3f21f9dd...| 7| 1| 7|
|11/29/2017 00:00:...|[593b0d9f3f21f9dd...| 60| 2| 67|
|01/09/2018 00:00:...|[593b0d9f3f21f9dd...| 1| 3| 68|
|04/27/2018 00:00:...|[593b0d9f3f21f9dd...| 9| 4| 77|
|09/25/2018 00:00:...|[593b0d9f3f21f9dd...| 29| 5| 106|
|11/20/2018 00:00:...|[593b0d9f3f21f9dd...| 42| 6| 148|
|12/11/2018 00:00:...|[593b0d9f3f21f9dd...| 317| 7| 465|
|01/04/2019 00:00:...|[593b0d9f3f21f9dd...| 3| 8| 468|
|02/13/2019 00:00:...|[593b0d9f3f21f9dd...| 15| 9| 483|
|04/01/2019 00:00:...|[593b0d9f3f21f9dd...| 1| 10| 484|
+--------------------+--------------------+-----+---+------------+
I figured the lag function would be of use, and I was able to get each row of rollingcount with ix > 1 to add the count directly above it with the following code:
w = Window.partitionBy('company').orderBy(F.unix_timestamp('Normalized_Date', 'MM/dd/yyyy HH:mm:ss aaa').cast('timestamp'))
refined_DF = solutionDF.withColumn("rn", F.row_number().over(w))
solutionDF = refined_DF.withColumn('RollingCount', F.when(refined_DF['rn'] > 1, refined_DF['count'] + F.lag(refined_DF['count'], 1).over(w)).otherwise(refined_DF['count']))
which yields the following df:
+--------------------+--------------------+-----+---+------------+
| Normalized_Date| company|count| ix|RollingCount|
+--------------------+--------------------+-----+---+------------+
|09/25/2018 00:00:...|[5c40c8510fb7c017...| 7| 1| 7|
|09/25/2018 00:00:...|[5bdb2b543951bf07...| 9| 1| 9|
|11/28/2017 00:00:...|[593b0d9f3f21f9dd...| 7| 1| 7|
|11/29/2017 00:00:...|[593b0d9f3f21f9dd...| 60| 2| 67|
|01/09/2018 00:00:...|[593b0d9f3f21f9dd...| 1| 3| 61|
|04/27/2018 00:00:...|[593b0d9f3f21f9dd...| 9| 4| 10|
|09/25/2018 00:00:...|[593b0d9f3f21f9dd...| 29| 5| 38|
|11/20/2018 00:00:...|[593b0d9f3f21f9dd...| 42| 6| 71|
|12/11/2018 00:00:...|[593b0d9f3f21f9dd...| 317| 7| 359|
|01/04/2019 00:00:...|[593b0d9f3f21f9dd...| 3| 8| 320|
|02/13/2019 00:00:...|[593b0d9f3f21f9dd...| 15| 9| 18|
|04/01/2019 00:00:...|[593b0d9f3f21f9dd...| 1| 10| 16|
+--------------------+--------------------+-----+---+------------+
I just need it to sum all of the counts ix rows above it. I have tried using a udf to figure out the 'count' input into the lag function, but I keep getting a "'Column' object is not callable" error, plus it doesn't do the sum of all of the rows. I have also tried using a loop but that seems impossible because it will make a new dataframe each time through, plus I would need to join them all afterwards. There must be an easier and simpler way to do this. Perhaps a different function than lag?
lag gives you a single row at a fixed offset before your current row, but you need a range to calculate the cumulative sum. Therefore you have to use the window frame rangeBetween (or rowsBetween). Have a look at the example below:
import pyspark.sql.functions as F
from pyspark.sql import Window
l = [
('09/25/2018', '5c40c8510fb7c017', 7, 1),
('09/25/2018', '5bdb2b543951bf07', 9, 1),
('11/28/2017', '593b0d9f3f21f9dd', 7, 1),
('11/29/2017', '593b0d9f3f21f9dd', 60, 2),
('01/09/2018', '593b0d9f3f21f9dd', 1, 3),
('04/27/2018', '593b0d9f3f21f9dd', 9, 4),
('09/25/2018', '593b0d9f3f21f9dd', 29, 5),
('11/20/2018', '593b0d9f3f21f9dd', 42, 6),
('12/11/2018', '593b0d9f3f21f9dd', 317, 7),
('01/04/2019', '593b0d9f3f21f9dd', 3, 8),
('02/13/2019', '593b0d9f3f21f9dd', 15, 9),
('04/01/2019', '593b0d9f3f21f9dd', 1, 10)
]
columns = ['Normalized_Date', 'company','count', 'ix']
df=spark.createDataFrame(l, columns)
df = df.withColumn('Normalized_Date', F.to_date(df.Normalized_Date, 'MM/dd/yyyy'))
w = Window.partitionBy('company').orderBy('Normalized_Date').rangeBetween(Window.unboundedPreceding, 0)
df = df.withColumn('Rolling_count', F.sum('count').over(w))
df.show()
Output:
+---------------+----------------+-----+---+-------------+
|Normalized_Date| company|count| ix|Rolling_count|
+---------------+----------------+-----+---+-------------+
| 2018-09-25|5c40c8510fb7c017| 7| 1| 7|
| 2018-09-25|5bdb2b543951bf07| 9| 1| 9|
| 2017-11-28|593b0d9f3f21f9dd| 7| 1| 7|
| 2017-11-29|593b0d9f3f21f9dd| 60| 2| 67|
| 2018-01-09|593b0d9f3f21f9dd| 1| 3| 68|
| 2018-04-27|593b0d9f3f21f9dd| 9| 4| 77|
| 2018-09-25|593b0d9f3f21f9dd| 29| 5| 106|
| 2018-11-20|593b0d9f3f21f9dd| 42| 6| 148|
| 2018-12-11|593b0d9f3f21f9dd| 317| 7| 465|
| 2019-01-04|593b0d9f3f21f9dd| 3| 8| 468|
| 2019-02-13|593b0d9f3f21f9dd| 15| 9| 483|
| 2019-04-01|593b0d9f3f21f9dd| 1| 10| 484|
+---------------+----------------+-----+---+-------------+
Try this. You need the sum of all preceding rows up to the current row in the window frame.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.WindowSpec
import org.apache.spark.sql.functions._
val df = Seq(
("5c40c8510fb7c017", 7, 1),
("5bdb2b543951bf07", 9, 1),
("593b0d9f3f21f9dd", 7, 1),
("593b0d9f3f21f9dd", 60, 2),
("593b0d9f3f21f9dd", 1, 3),
("593b0d9f3f21f9dd", 9, 4),
("593b0d9f3f21f9dd", 29, 5),
("593b0d9f3f21f9dd", 42, 6),
("593b0d9f3f21f9dd", 317, 7),
("593b0d9f3f21f9dd", 3, 8),
("593b0d9f3f21f9dd", 15, 9),
("593b0d9f3f21f9dd", 1, 10)
).toDF("company", "count", "ix")
scala> df.show(false)
+----------------+-----+---+
|company |count|ix |
+----------------+-----+---+
|5c40c8510fb7c017|7 |1 |
|5bdb2b543951bf07|9 |1 |
|593b0d9f3f21f9dd|7 |1 |
|593b0d9f3f21f9dd|60 |2 |
|593b0d9f3f21f9dd|1 |3 |
|593b0d9f3f21f9dd|9 |4 |
|593b0d9f3f21f9dd|29 |5 |
|593b0d9f3f21f9dd|42 |6 |
|593b0d9f3f21f9dd|317 |7 |
|593b0d9f3f21f9dd|3 |8 |
|593b0d9f3f21f9dd|15 |9 |
|593b0d9f3f21f9dd|1 |10 |
+----------------+-----+---+
scala> val overColumns = Window.partitionBy("company").orderBy("ix").rowsBetween(Window.unboundedPreceding, Window.currentRow)
overColumns: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@3ed5e17c
scala> val outputDF = df.withColumn("RollingCount", sum("count").over(overColumns))
outputDF: org.apache.spark.sql.DataFrame = [company: string, count: int ... 2 more fields]
scala> outputDF.show(false)
+----------------+-----+---+------------+
|company |count|ix |RollingCount|
+----------------+-----+---+------------+
|5c40c8510fb7c017|7 |1 |7 |
|5bdb2b543951bf07|9 |1 |9 |
|593b0d9f3f21f9dd|7 |1 |7 |
|593b0d9f3f21f9dd|60 |2 |67 |
|593b0d9f3f21f9dd|1 |3 |68 |
|593b0d9f3f21f9dd|9 |4 |77 |
|593b0d9f3f21f9dd|29 |5 |106 |
|593b0d9f3f21f9dd|42 |6 |148 |
|593b0d9f3f21f9dd|317 |7 |465 |
|593b0d9f3f21f9dd|3 |8 |468 |
|593b0d9f3f21f9dd|15 |9 |483 |
|593b0d9f3f21f9dd|1 |10 |484 |
+----------------+-----+---+------------+
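For reference, since the thread is tagged pyspark, the same rowsBetween frame can be written in PySpark; a sketch reusing the df built in the earlier PySpark answer:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Sum counts from the first row of each company up to and including the current row.
w = (Window.partitionBy("company").orderBy("ix")
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))
df.withColumn("RollingCount", F.sum("count").over(w)).show()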

Joining two DataFrames and appending where not exists

I have two DataFrames. One is a MasterList, the other is an InsertList
MasterList:
+--------+--------+
| ttm_id|audit_id|
+--------+--------+
| 1| 10|
| 15| 10|
+--------+--------+
InsertList:
+--------+--------+
| ttm_id|audit_id|
+--------+--------+
| 1| 10|
| 15| 9|
+--------+--------+
In Scala, how do I join two DataFrames but only append to the new DataFrame records
WHERE MasterList.ttm_id = InsertList.ttm_id AND
MasterList.audit_id != InsertList.audit_id
ExpectedOutput:
+--------+--------+
| ttm_id|audit_id|
+--------+--------+
| 1| 10|
| 15| 10|
| 15| 9|
+--------+--------+
I'd do an anti join (NOT IN) on both columns and then a union:
val masterList = Seq((1, 10), (15, 10)).toDF("ttm_id", "audit_id")
val insertList = Seq((1, 10), (15, 9)).toDF("ttm_id", "audit_id")
insertList
.join(masterList, Seq("ttm_id", "audit_id"), "leftanti")
.union(masterList)
.show
// +------+--------+
// |ttm_id|audit_id|
// +------+--------+
// | 15| 9|
// | 1| 10|
// | 15| 10|
// +------+--------+
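A PySpark sketch of the same anti join plus union, using hypothetical DataFrames master_list and insert_list that mirror the Scala ones:
master_list = spark.createDataFrame([(1, 10), (15, 10)], ["ttm_id", "audit_id"])
insert_list = spark.createDataFrame([(1, 10), (15, 9)], ["ttm_id", "audit_id"])

# Keep only insert rows that do not already exist in the master list, then add the master list back.
(insert_list
    .join(master_list, ["ttm_id", "audit_id"], "left_anti")
    .union(master_list)
    .show())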
It seems that you want to merge in the rows from the insertList DataFrame that are not in the masterList DataFrame. This can be achieved using the except function:
insertList.except(masterList)
Then you just use the union function to merge both DataFrames:
masterList.union(insertList.except(masterList))
You should get the result you want:
+------+--------+
|ttm_id|audit_id|
+------+--------+
|1 |10 |
|15 |10 |
|15 |9 |
+------+--------+
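In PySpark the closest equivalent of except is subtract (or exceptAll if duplicates must survive); reusing the master_list and insert_list from the sketch above, roughly:
# Rows of insert_list that are not present in master_list, appended to master_list.
master_list.union(insert_list.subtract(master_list)).show()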