Spark aggregation with window functions - pyspark

I have a spark df which I need to use to identify the last active record for each primary key based on a snapshot date. An example of what I have is:
A  B  C  Snap
1  2  3  2019-12-29
1  2  4  2019-12-31
where the primary key is formed by fields A and B. I need to create a new field to indicate which record is active (the one with the latest snap for each set of rows with the same PK). So I need something like this:
A  B  C  Snap        activity
1  2  3  2019-12-29  false
1  2  4  2019-12-31  true
I have done this by creating an auxiliary df and then joining it back to the first one to bring in the active indicator, but my original df is very big and I need something better in terms of performance. I have been thinking about window functions but I don't know how to implement this with them.
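For reference, the auxiliary-df-and-join baseline described above might look roughly like this (just a sketch, assuming the frame is called df and Snap sorts chronologically):

from pyspark.sql import functions as F

# auxiliary frame: latest snapshot per primary key (A, B)
last_snap = df.groupBy("A", "B").agg(F.max("Snap").alias("last_snap"))

# join back and flag the row whose Snap equals the latest snapshot
df_act = (df.join(last_snap, on=["A", "B"], how="left")
            .withColumn("activity", F.col("Snap") == F.col("last_snap"))
            .drop("last_snap"))

The window-based answer below avoids this extra join.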
Once I have this, I need to create a new field with the end date of each record, filled only when the activity field is false, by subtracting 1 day from the snap date of the latest record in each set of rows with the same PK. I would need something like this:
A  B  C  Snap        activity  end
1  2  3  2019-12-29  false     2019-12-30
1  2  4  2019-12-31  true

You can check row_number() ordered by Snap in descending order; the first row in each (A, B) partition is the last active snap:
df.selectExpr(
    '*',
    'row_number() over (partition by A, B order by Snap desc) = 1 as activity'
).show()
+---+---+---+----------+--------+
| A| B| C| Snap|activity|
+---+---+---+----------+--------+
| 1| 2| 4|2019-12-31| true|
| 1| 2| 3|2019-12-29| false|
+---+---+---+----------+--------+
Edit: to get the end date for each group, use the max window function on Snap:
import pyspark.sql.functions as f

df.withColumn(
    'activity',
    f.expr('row_number() over (partition by A, B order by Snap desc) = 1')
).withColumn(
    "end",
    f.expr('case when activity then null else max(date_add(to_date(Snap), -1)) over (partition by A, B) end')
).show()
+---+---+---+----------+--------+----------+
| A| B| C| Snap|activity| end|
+---+---+---+----------+--------+----------+
| 1| 2| 4|2019-12-31| true| null|
| 1| 2| 3|2019-12-29| false|2019-12-30|
+---+---+---+----------+--------+----------+
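For completeness, the same logic can be expressed with the DataFrame Window API instead of SQL expression strings; the sketch below is intended to be equivalent (same column names as above):

import pyspark.sql.functions as f
from pyspark.sql.window import Window

w_order = Window.partitionBy("A", "B").orderBy(f.col("Snap").desc())
w_all = Window.partitionBy("A", "B")

result = (
    df.withColumn("activity", f.row_number().over(w_order) == 1)
      # day before the latest snap of the group, used as the end date of inactive rows
      .withColumn("group_end", f.date_sub(f.max(f.to_date("Snap")).over(w_all), 1))
      .withColumn("end", f.when(f.col("activity"), f.lit(None)).otherwise(f.col("group_end")))
      .drop("group_end")
)
result.show()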

Related

How to count the last 30 day occurrence & transpose a column's row value to new columns in pyspark

I am trying to get the count of occurrences of the status column for each 'name', 'id' and 'branch' combination in the last 30 days using Pyspark.
For simplicity, let's assume the current day is 19/07/2021.
Input dataframe
id name branch status eventDate
1 a main failed 18/07/2021
1 a main error 15/07/2021
2 b main failed 16/07/2021
3 c main snooze 12/07/2021
4 d main failed 18/01/2021
2 b main failed 18/07/2021
expected output
id name branch failed error snooze
1 a main 1 1 0
2 b main 2 0 0
3 c main 0 0 1
4 d main 0 0 0
I tried the following code.
from pyspark.sql import functions as F

df = df.withColumn("eventAgeinDays", F.datediff(F.current_timestamp(), F.col("eventDate")))
df = df.groupBy('id', 'branch', 'name', 'status')\
    .agg(
        F.sum(
            F.when(F.col("eventAgeinDays") <= 30, 1).otherwise(0)
        ).alias("Last30dayFailure")
    )
df = df.groupBy('id', 'branch', 'name', 'status').pivot('status').agg(F.collect_list('Last30dayFailure'))
The code kind of gives me the output, but I get arrays in the output since I am using F.collect_list()
my partially correct output
id name branch failed error snooze
1 a main [1] [1] []
2 b main [2] [] []
3 c main [] [] [1]
4 d main [] [] []
Could you please suggest a more elegant way of creating my expected output? Or let me know how to fix my code?
Instead of collect_list, which creates a list, use first as the aggregation method (we can use first because you already aggregated grouped by id, branch, name and status, so you are sure there is at most one value for each unique combination):
(df.groupBy('id', 'branch', 'name')
   .pivot('status')
   .agg(F.first('Last30dayFailure'))
   .fillna(0)
   .show())
+---+------+----+-----+------+------+
| id|branch|name|error|failed|snooze|
+---+------+----+-----+------+------+
| 1| main| a| 1| 1| 0|
| 4| main| d| 0| 0| 0|
| 3| main| c| 0| 0| 1|
| 2| main| b| 0| 2| 0|
+---+------+----+-----+------+------+
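If you would rather skip the intermediate aggregation entirely, a one-pass sketch is to pivot directly with a conditional sum (assuming eventDate is a string in dd/MM/yyyy format as in the sample data; the date parsing here is an assumption, not part of the original answer):

from pyspark.sql import functions as F

# age of the event in days relative to today
age = F.datediff(F.current_date(), F.to_date("eventDate", "dd/MM/yyyy"))

(df.groupBy("id", "branch", "name")
   .pivot("status")
   .agg(F.sum(F.when(age <= 30, 1).otherwise(0)))
   .fillna(0)
   .show())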

Apache spark aggregation: aggregate column based on another column value

I am not sure if I am asking this correctly, and maybe that is the reason why I didn't find the correct answer so far. Anyway, if it turns out to be a duplicate I will delete this question.
I have following data:
id | last_updated | count
__________________________
1 | 20190101 | 3
1 | 20190201 | 2
1 | 20190301 | 1
I want to group this data by the "id" column, get the max value of "last_updated", and for the "count" column keep the value from the row where "last_updated" has its max value. So in that case the result should look like this:
id | last_updated | count
__________________________
1 | 20190301 | 1
So I imagine it will look like that:
df
.groupBy("id")
.agg(max("last_updated"), ... ("count"))
Is there any function I can use to get "count" based on the "last_updated" column?
I am using spark 2.4.0.
Thanks for any help
You have two options; the first is the better one, as far as I understand.
OPTION 1
Perform a window function over the ID and create a column with the max value over that window. Then keep the rows where the desired column equals the max value, drop the original column, and rename the max column as desired.
val w = Window.partitionBy("id")
df.withColumn("max", max("last_updated").over(w))
.where("max = last_updated")
.drop("last_updated")
.withColumnRenamed("max", "last_updated")
OPTION 2
You can perform a join with the original dataframe after grouping
df.groupBy("id")
.agg(max("last_updated").as("last_updated"))
.join(df, Seq("id", "last_updated"))
QUICK EXAMPLE
INPUT
df.show
+---+------------+-----+
| id|last_updated|count|
+---+------------+-----+
| 1| 20190101| 3|
| 1| 20190201| 2|
| 1| 20190301| 1|
+---+------------+-----+
OUTPUT
Option 1
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val w = Window.partitionBy("id")
df.withColumn("max", max("last_updated").over(w))
.where("max = last_updated")
.drop("last_updated")
.withColumnRenamed("max", "last_updated")
+---+-----+------------+
| id|count|last_updated|
+---+-----+------------+
| 1| 1| 20190301|
+---+-----+------------+
Option 2
df.groupBy("id")
.agg(max("last_updated").as("last_updated")
.join(df, Seq("id", "last_updated")).show
+---+------------+-----+
| id|last_updated|count|
+---+------------+-----+
|  1|    20190301|    1|
+---+------------+-----+
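A third option, shown here as a PySpark sketch (the Scala version is analogous), is to take the max of a struct whose first field is last_updated; structs compare field by field, so the matching count travels along with the winning last_updated:

from pyspark.sql import functions as F

(df.groupBy("id")
   .agg(F.max(F.struct("last_updated", "count")).alias("latest"))
   .select("id", "latest.last_updated", "latest.count")
   .show())

This works here because last_updated is numeric (yyyyMMdd), so its natural ordering matches the chronological one.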

Remove rows from Spark DataFrame that ONLY satisfy two conditions

I am using Scala and Spark. I want to remove from a DataFrame the rows that satisfy ALL of the conditions I am specifying, while keeping rows for which only one of the conditions is satisfied.
For example: let's say I have this DataFrame
+-------+----+
|country|date|
+-------+----+
| A| 1|
| A| 2|
| A| 3|
| B| 1|
| B| 2|
| B| 3|
+-------+----+
and I want to filter out the rows with country A and dates 1 and 2, so that the expected output should be:
+-------+----+
|country|date|
+-------+----+
| A| 3|
| B| 1|
| B| 2|
| B| 3|
+-------+----+
As you can see, I am still keeping country B with dates 1 and 2.
I tried to use filter in the following way
df.filter("country != 'A' and date not in (1,2)")
But the output filters out all rows with dates 1 and 2, which is not what I want.
Thanks.
Your current condition is
df.filter("country != 'A' and date not in (1,2)")
which can be translated as "accept any country other than A, and accept any date other than 1 or 2". The two conditions are applied independently.
What you want is:
df.filter("not (country = 'A' and date in (1,2))")
i.e. "Find the rows with country A and date of 1 or 2, and reject them"
or equivalently:
df.filter("country != 'A' or date not in (1,2)")
i.e. "If country isn't A, then accept it regardless of the date. If the country is A, then the date mustn't be 1 or 2"
See De Morgan's laws:
not(A or B) = not A and not B
not (A and B) = not A or not B
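The same predicate can also be written against the Column API instead of a SQL string; a PySpark sketch (the Scala Column syntax is analogous, using ! and && instead of ~ and &):

from pyspark.sql import functions as F

# reject only the rows where both conditions hold
df.filter(~((F.col("country") == "A") & F.col("date").isin(1, 2)))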

Get Unique records in Spark [duplicate]

This question already has answers here:
How to select the first row of each group?
(9 answers)
Closed 5 years ago.
I have a dataframe df as mentioned below:
customers  product  val_id  rule_name  rule_id  priority
1          A        1       ABC        123      1
3          Z        r       ERF        789      2
2          B        X       ABC        123      2
2          B        X       DEF        456      3
1          A        1       DEF        456      2
I want to create a new dataframe df2 which has only unique customer ids, but since the rule_name and rule_id columns differ for the same customer, I want to pick the record with the highest priority (lowest priority value) for each customer, so my final outcome should be:
customers  product  val_id  rule_name  rule_id  priority
1          A        1       ABC        123      1
3          Z        r       ERF        789      2
2          B        X       ABC        123      2
Can anyone please help me to achieve this using Spark Scala? Any help will be appreciated.
You basically want to select rows with extreme values in a column. This is a really common issue, so there's even a whole tag greatest-n-per-group. Also see this question SQL Select only rows with Max Value on a Column which has a nice answer.
Here's an example for your specific case.
Note that this could select multiple rows for a customer, if there are multiple rows for that customer with the same (minimum) priority value.
This example is in pyspark, but it should be straightforward to translate to Scala
from pyspark.sql import functions as F

# find the best (minimum) priority for each customer. this DF has only two columns.
cusPriDF = df.groupBy("customers").agg(F.min(df["priority"]).alias("priority"))
# now join back to choose only those rows and get all columns back
bestRowsDF = df.join(cusPriDF, on=["customers", "priority"], how="inner")
To create df2 you have to first order df by priority and then find unique customers by id. Like this:
val columns = df.schema.map(_.name).filterNot(_ == "customers").map(col => first(col).as(col))
val df2 = df.orderBy("priority").groupBy("customers").agg(columns.head, columns.tail: _*)
df2.show
It would give you expected output:
+----------+--------+-------+----------+--------+---------+
| customers| product| val_id| rule_name| rule_id| priority|
+----------+--------+-------+----------+--------+---------+
| 1| A| 1| ABC| 123| 1|
| 3| Z| r| ERF| 789| 2|
| 2| B| X| ABC| 123| 2|
+----------+--------+-------+----------+--------+---------+
Corey beat me to it, but here's the Scala version:
val df = Seq(
  (1,"A","1","ABC",123,1),
  (3,"Z","r","ERF",789,2),
  (2,"B","X","ABC",123,2),
  (2,"B","X","DEF",456,3),
  (1,"A","1","DEF",456,2)
).toDF("customers","product","val_id","rule_name","rule_id","priority")

val priorities = df.groupBy("customers").agg(min(df.col("priority")).alias("priority"))
val top_rows = df.join(priorities, Seq("customers","priority"), "inner")
top_rows.show
+---------+--------+-------+------+---------+-------+
|customers|priority|product|val_id|rule_name|rule_id|
+---------+--------+-------+------+---------+-------+
| 1| 1| A| 1| ABC| 123|
| 3| 2| Z| r| ERF| 789|
| 2| 2| B| X| ABC| 123|
+---------+--------+-------+------+---------+-------+
You will have to use a min aggregation on the priority column, grouping the dataframe by customers, and then inner join the original dataframe with the aggregated dataframe and select the required columns:
val aggregatedDF = dataframe.groupBy("customers").agg(min("priority").as("priority_1"))
  .withColumnRenamed("customers", "customers_1")
val finalDF = dataframe.join(aggregatedDF, dataframe("customers") === aggregatedDF("customers_1") && dataframe("priority") === aggregatedDF("priority_1"))
finalDF.select("customers", "product", "val_id", "rule_name", "rule_id", "priority").show
You should then have the desired result.
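As in the linked duplicate ("How to select the first row of each group?"), a window function is another way to get exactly one top-priority row per customer (ties are then broken arbitrarily, unlike the join-based answers above, which keep all tied rows); a PySpark sketch:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("customers").orderBy("priority")

df2 = (df.withColumn("rn", F.row_number().over(w))
         .where(F.col("rn") == 1)
         .drop("rn"))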

How to group by a column on a dataframe and applying single value to columns of all rows grouped?

I have a dataframe (Scala) and I want to do something like the following on it:
I want to group by column 'a', select any one of the values of column 'b' within each group, and apply it to all rows of that group. I.e. for a=1, b should be either x, y or h on all 3 rows, and the rest of the columns should be unaffected.
Any help on this?
You can try this, i.e. create another data frame that contains the a and b columns, where b has one value per a, and then join it back with the original data frame:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// create the window object so that we can assign a unique row number for each unique a
val w = Window.partitionBy($"a").orderBy($"b")

// take the first row of each group, which reduces the data frame to one row per a with its chosen b,
// then join it back with the original data frame (a, c columns) so that b has just one value per group
(df.withColumn("rn", row_number.over(w)).where($"rn" === 1).select("a", "b")
  .join(df.select("a", "c"), Seq("a"), "inner").show)
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  h|  g|
|  1|  h|  y|
|  1|  h|  x|
|  2|  c|  d|
|  2|  c|  x|
+---+---+---+
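An alternative that avoids the join entirely is to overwrite b with a single per-group value via a window function; a PySpark sketch of the idea (the Scala version is analogous, and min("b") would work equally well as the chosen value):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("a").orderBy("b")

# every row in a group gets the first b value (by the ordering) of that group
df.withColumn("b", F.first("b").over(w))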