Spark window function with condition on current row - scala

I am trying to count for a given order_id how many orders there were in the past 365 days which had a payment. And this is not the problem: I use the window function.
Where it gets tricky for me is: I don't want to count orders in this time window where the payment_date is after order_date of the current order_id.
Currently, I have something like this:
val window: WindowSpec = Window
.partitionBy("customer_id")
.orderBy("order_date")
.rangeBetween(-365*days, -1)
and
df.withColumn("paid_order_count", count("*") over window)
which would count all orders for the customer within the last 365 days before his current order.
How can I now incorporate a condition for the counting that takes the order_date of the current order into account?
Example:
+---------+-----------+-------------+------------+
|order_id |order_date |payment_date |customer_id |
+---------+-----------+-------------+------------+
|1 |2017-01-01 |2017-01-10 |A |
|2 |2017-02-01 |2017-02-10 |A |
|3 |2017-02-02 |2017-02-20 |A |
The resulting table should look like this:
+---------+-----------+-------------+------------+-----------------+
|order_id |order_date |payment_date |customer_id |paid_order_count |
+---------+-----------+-------------+------------+-----------------+
|1 |2017-01-01 |2017-01-10 |A |0 |
|2 |2017-02-01 |2017-02-10 |A |1 |
|3 |2017-02-02 |2017-02-20 |A |1 |
For order_id = 3 the paid_order_count should not be 2 but 1 as order_id = 2 is paid after order_id = 3 is placed.
I hope that I explained my problem well and look forward to your ideas!

Very good question!!!
A couple of remarks, using rangeBetween creates a fixed frame that is based on number of rows in it and not on values, so it will be problematic in 2 cases:
customer does not have orders every single day, so 365 rows window might contain rows with order_date well before one year ago
if customer has more than one order per day, it will mess with the one year coverage
combination of the 1 and 2
Also rangeBetween does not work with Date and Timestamp datatypes.
To solve it, it is possible to use window function with lists and an UDF:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = spark.sparkContext.parallelize(Seq(
(1, "2017-01-01", "2017-01-10", "A")
, (2, "2017-02-01", "2017-02-10", "A")
, (3, "2017-02-02", "2017-02-20", "A")
)
).toDF("order_id", "order_date", "payment_date", "customer_id")
.withColumn("order_date_ts", to_timestamp($"order_date", "yyyy-MM-dd").cast("long"))
.withColumn("payment_date_ts", to_timestamp($"payment_date", "yyyy-MM-dd").cast("long"))
// df.printSchema()
// df.show(false)
val window = Window.partitionBy("customer_id").orderBy("order_date_ts").rangeBetween(Window.unboundedPreceding, -1)
val count_filtered_dates = udf( (days: Int, top: Long, array: Seq[Long]) => {
val bottom = top - (days * 60 * 60 * 24).toLong // in spark timestamps are in secconds, calculating the date days ago
array.count(v => v >= bottom && v < top)
}
)
val res = df.withColumn("paid_orders", collect_list("payment_date_ts") over window)
.withColumn("paid_order_count", count_filtered_dates(lit(365), $"order_date_ts", $"paid_orders"))
res.show(false)
Output:
+--------+----------+------------+-----------+-------------+---------------+------------------------+----------------+
|order_id|order_date|payment_date|customer_id|order_date_ts|payment_date_ts|paid_orders |paid_order_count|
+--------+----------+------------+-----------+-------------+---------------+------------------------+----------------+
|1 |2017-01-01|2017-01-10 |A |1483228800 |1484006400 |[] |0 |
|2 |2017-02-01|2017-02-10 |A |1485907200 |1486684800 |[1484006400] |1 |
|3 |2017-02-02|2017-02-20 |A |1485993600 |1487548800 |[1484006400, 1486684800]|1 |
+--------+----------+------------+-----------+-------------+---------------+------------------------+----------------+
Converting dates to Spark timestamps in seconds makes the lists more memory efficient.
It is the easiest code to implement, but not the most optimal as the lists will take up some memory, custom UDAF would be best, but requires more coding, might do later. This will still work if you have thousands of orders per customer.

Related

How to plot a graph with data gaps in Zeppelin?

Dataframe was extracted to a temp table to plot the data density per time unit (1 day):
val dailySummariesDf =
getDFFromJdbcSource(SparkSession.builder().appName("test").master("local").getOrCreate(), s"SELECT * FROM values WHERE time > '2020-06-06' and devicename='Voltage' limit 100000000")
.persist(StorageLevel.MEMORY_ONLY_SER)
.groupBy($"digital_twin_id", window($"time", "1 day")).count().as("count")
.withColumn("windowstart", col("window.start"))
.withColumn("windowstartlong", unix_timestamp(col("window.start")))
.orderBy("windowstart")
dailySummariesDf.
registerTempTable("bank")
Then I plot it with %sql processor
%sql
select windowstart, count
from bank
and
%sql
select windowstartlong, count
from bank
What I get is shown below:
So, my expectation is to have gaps in this graph, as there were days with no data at all. But instead I see it being plotted densely, with October days plotted right after August, not showing a gap for September.
How can I force those graphs to display gaps and regard the real X axis values?
Indeed, grouping a dataset by window column won't produce any rows for the intervals that did not contain any original rows within those intervals.
One way to deal with that I can think of, is to add a bunch of fake rows ("manually fill in the gaps" in raw dataset), and only then apply a groupBy/window. For your case, that can be done by creating a trivial one-column dataset containing all the dates within a range you're interested in, and then joining it to your original dataset.
Here is my quick attempt:
import spark.implicits._
import org.apache.spark.sql.types._
// Define sample data
val df = Seq(("a","2021-12-01"),
("b","2021-12-01"),
("c","2021-12-01"),
("a","2021-12-02"),
("b","2021-12-17")
).toDF("c","d").withColumn("d",to_timestamp($"d"))
// Define a dummy dataframe for the range 12/01/2021 - 12/30/2021
import org.joda.time.DateTime
import org.joda.time.format.DateTimeFormat
val start = DateTime.parse("2021-12-01",DateTimeFormat.forPattern("yyyy-MM-dd")).getMillis/1000
val end = start + 30*24*60*60
val temp = spark.range(start,end,24*60*60).toDF().withColumn("tc",to_timestamp($"id".cast(TimestampType))).drop($"id")
// Fill the gaps in original dataframe
val nogaps = temp.join(df, temp.col("tc") === df.col("d"), "left")
// Aggregate counts by a tumbling 1-day window
val result = nogaps.groupBy(window($"tc","1 day","1 day","5 hours")).agg(sum(when($"c".isNotNull,1).otherwise(0)).as("count"))
result.withColumn("windowstart",to_date(col("window.start"))).select("windowstart","count").orderBy("windowstart").show(false)
+-----------+-----+
|windowstart|count|
+-----------+-----+
|2021-12-01 |3 |
|2021-12-02 |1 |
|2021-12-03 |0 |
|2021-12-04 |0 |
|2021-12-05 |0 |
|2021-12-06 |0 |
|2021-12-07 |0 |
|2021-12-08 |0 |
|2021-12-09 |0 |
|2021-12-10 |0 |
|2021-12-11 |0 |
|2021-12-12 |0 |
|2021-12-13 |0 |
|2021-12-14 |0 |
|2021-12-15 |0 |
|2021-12-16 |0 |
|2021-12-17 |1 |
|2021-12-18 |0 |
|2021-12-19 |0 |
|2021-12-20 |0 |
+-----------+-----+
For illustration purposes only :)

Spark UDF not giving rolling counts properly

I have a Spark UDF to calculate rolling counts of a column, precisely wrt time. If I need to calculate a rolling count for 24 hours, for example for entry with time 2020-10-02 09:04:00, I need to look back until 2020-10-01 09:04:00 (very precise).
The Rolling count UDF works fine and gives correct counts, if I run locally, but when I run on a cluster, its giving incorrect results. Here is the sample input and output
Input
+---------+-----------------------+
|OrderName|Time |
+---------+-----------------------+
|a |2020-07-11 23:58:45.538|
|a |2020-07-12 00:00:07.307|
|a |2020-07-12 00:01:08.817|
|a |2020-07-12 00:02:15.675|
|a |2020-07-12 00:05:48.277|
+---------+-----------------------+
Expected Output
+---------+-----------------------+-----+
|OrderName|Time |Count|
+---------+-----------------------+-----+
|a |2020-07-11 23:58:45.538|1 |
|a |2020-07-12 00:00:07.307|2 |
|a |2020-07-12 00:01:08.817|3 |
|a |2020-07-12 00:02:15.675|1 |
|a |2020-07-12 00:05:48.277|1 |
+---------+-----------------------+-----+
Last two entry values are 4 and 5 locally, but on cluster they are incorrect. My best guess is data is being distributed across executors and udf is also being called in parallel on each executor. As one of the parameter to UDF is column (Partition key - OrderName in this example), how could I control/correct the behavior for cluster if thats the case. So that it calculates proper counts for each partition in a right way. Any suggestion please
As per your comment , you want to count the total no of records of every record for the last 24 hours
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.types.LongType
//A sample data (Guessing from your question)
val df = Seq(("a","2020-07-10 23:58:45.438","1"),("a","2020-07-11 23:58:45.538","1"),("a","2020-07-11 23:58:45.638","1")).toDF("OrderName","Time","Count")
// Extract the UNIX TIMESTAMP for your time column
val df2 = df.withColumn("unix_time",concat(unix_timestamp($"Time"),split($"Time","\\.")(1)).cast(LongType))
val noOfMilisecondsDay : Long = 24*60*60*1000
//Create a window per `OrderName` and select rows from `current time - 24 hours` to `current time`
val winSpec = Window.partitionBy("OrderName").orderBy("unix_time").rangeBetween(Window.currentRow - noOfMilisecondsDay, Window.currentRow)
// Final you perform your COUNT or SUM(COUNT) as per your need
val finalDf = df2.withColumn("tot_count", count("OrderName").over(winSpec))
//or val finalDf = df2.withColumn("tot_count", sum("Count").over(winSpec))

Populate a "Grouper" column using .withcolumn in scala.spark dataframe

Trying to populate the grouper column like below. In the table below, X signifies the start of a new record. So, Each X,Y,Z needs to be grouped. In MySQL, I would accomplish like:
select #x:=1;
update table set grouper=if(column_1='X',#x:=#x+1,#x);
I am trying to see if there is a way to do this without using a loop using . With column or something similar.
what I have tried:
var group = 1;
val mydf4 = mydf3.withColumn("grouper", when(col("column_1").equalTo("INS"),group=group+1).otherwise(group))
Example DF
Simple window function and row_number() inbuilt function should get you your desired output
val df = Seq(
Tuple1("X"),
Tuple1("Y"),
Tuple1("Z"),
Tuple1("X"),
Tuple1("Y"),
Tuple1("Z")
).toDF("column_1")
import org.apache.spark.sql.expressions._
def windowSpec = Window.partitionBy("column_1").orderBy("column_1")
import org.apache.spark.sql.functions._
df.withColumn("grouper", row_number().over(windowSpec)).orderBy("grouper", "column_1").show(false)
which should give you
+--------+-------+
|column_1|grouper|
+--------+-------+
|X |1 |
|Y |1 |
|Z |1 |
|X |2 |
|Y |2 |
|Z |2 |
+--------+-------+
Note: The last orderBy is just to match the expected output and just for visualization. In real cluster and processing orderBy like that doesn't make sense

How to create pairs of nodes in Spark?

I have the following DataFrame in Spark and Scala:
group nodeId date
1 1 2016-10-12T12:10:00.000Z
1 2 2016-10-12T12:00:00.000Z
1 3 2016-10-12T12:05:00.000Z
2 1 2016-10-12T12:30:00.000Z
2 2 2016-10-12T12:35:00.000Z
I need to group records by group, sort them in ascending order by date and make pairs of sequential nodeId. Also, date should be converted to Unix epoch.
This can be better explained with the expected output:
group nodeId_1 nodeId_2 date
1 2 3 2016-10-12T12:00:00.000Z
1 3 1 2016-10-12T12:05:00.000Z
2 1 2 2016-10-12T12:30:00.000Z
This is what I did so far:
df
.groupBy("group")
.agg($"nodeId",$"date")
.orderBy(asc("date"))
But I don't know how to create pairs of nodeId.
You can be benefited by using Window function with lead inbuilt function to create the pairs and to_utc_timestamp inbuilt function to convert the date to epoch date. Finally you have to filter the unpaired rows as you don't require them in the output.
Following is the program of above explanation. I have used comments for clarity
import org.apache.spark.sql.expressions._
def windowSpec = Window.partitionBy("group").orderBy("date") //defining window function grouping by group and ordering by date
import org.apache.spark.sql.functions._
df.withColumn("date", to_utc_timestamp(col("date"), "Asia/Kathmandu")) //converting the date to epoch datetime you can choose other timezone as required
.withColumn("nodeId_2", lead("nodeId", 1).over(windowSpec)) //using window for creating pairs
.filter(col("nodeId_2").isNotNull) //filtering out the unpaired rows
.select(col("group"), col("nodeId").as("nodeId_1"), col("nodeId_2"), col("date")) //selecting as required final dataframe
.show(false)
You should get the final dataframe as required
+-----+--------+--------+-------------------+
|group|nodeId_1|nodeId_2|date |
+-----+--------+--------+-------------------+
|1 |2 |3 |2016-10-12 12:00:00|
|1 |3 |1 |2016-10-12 12:05:00|
|2 |1 |2 |2016-10-12 12:30:00|
+-----+--------+--------+-------------------+
I hope the answer is helpful
Note to get the correct epoch date I have used Asia/Kathmandu as timezone.
If I understand your requirement correctly, you can use a self-join on group and a < inequality condition on nodeId:
val df = Seq(
(1, 1, "2016-10-12T12:10:00.000Z"),
(1, 2, "2016-10-12T12:00:00.000Z"),
(1, 3, "2016-10-12T12:05:00.000Z"),
(2, 1, "2016-10-12T12:30:00.000Z"),
(2, 2, "2016-10-12T12:35:00.000Z")
).toDF("group", "nodeId", "date")
df.as("df1").join(
df.as("df2"),
$"df1.group" === $"df2.group" && $"df1.nodeId" < $"df2.nodeId"
).select(
$"df1.group", $"df1.nodeId", $"df2.nodeId",
when($"df1.date" < $"df2.date", $"df1.date").otherwise($"df2.date").as("date")
)
// +-----+------+------+------------------------+
// |group|nodeId|nodeId|date |
// +-----+------+------+------------------------+
// |1 |1 |3 |2016-10-12T12:05:00.000Z|
// |1 |1 |2 |2016-10-12T12:00:00.000Z|
// |1 |2 |3 |2016-10-12T12:00:00.000Z|
// |2 |1 |2 |2016-10-12T12:30:00.000Z|
// +-----+------+------+------------------------+

Remove duplicates in Pair RDD based on Values

I have an RDD with multiple rows which looks like below.
val row = [(String, String), (String, String, String)]
The value is a sequence of Tuples. In the tuple, the last String is a timestamp and the second one is category. I want to filter this sequence based on maximum timestamp for each category.
(A,B) Id Category Timestamp
-------------------------------------------------------
(123,abc) 1 A 2016-07-22 21:22:59+0000
(234,bcd) 2 B 2016-07-20 21:21:20+0000
(123,abc) 1 A 2017-07-09 21:22:59+0000
(345,cde) 4 C 2016-07-05 09:22:30+0000
(456,def) 5 D 2016-07-21 07:32:06+0000
(234,bcd) 2 B 2015-07-20 21:21:20+0000
I want one row for each of the categories.I was looking for some help on getting the row with the max timestamp for each category. The result I am looking to get is
(A,B) Id Category Timestamp
-------------------------------------------------------
(234,bcd) 2 B 2016-07-20 21:21:20+0000
(123,abc) 1 A 2017-07-09 21:22:59+0000
(345,cde) 4 C 2016-07-05 09:22:30+0000
(456,def) 5 D 2016-07-21 07:32:06+0000
Given input dataframe as
+---------+---+--------+------------------------+
|(A,B) |Id |Category|Timestamp |
+---------+---+--------+------------------------+
|[123,abc]|1 |A |2016-07-22 21:22:59+0000|
|[234,bcd]|2 |B |2016-07-20 21:21:20+0000|
|[123,abc]|1 |A |2017-07-09 21:22:59+0000|
|[345,cde]|4 |C |2016-07-05 09:22:30+0000|
|[456,def]|5 |D |2016-07-21 07:32:06+0000|
|[234,bcd]|2 |B |2015-07-20 21:21:20+0000|
+---------+---+--------+------------------------+
You can do the following to get the result dataframe you require
import org.apache.spark.sql.functions._
val requiredDataframe = df.orderBy($"Timestamp".desc).groupBy("Category").agg(first("(A,B)").as("(A,B)"), first("Id").as("Id"), first("Timestamp").as("Timestamp"))
You should have the requiredDataframe as
+--------+---------+---+------------------------+
|Category|(A,B) |Id |Timestamp |
+--------+---------+---+------------------------+
|B |[234,bcd]|2 |2016-07-20 21:21:20+0000|
|D |[456,def]|5 |2016-07-21 07:32:06+0000|
|C |[345,cde]|4 |2016-07-05 09:22:30+0000|
|A |[123,abc]|1 |2017-07-09 21:22:59+0000|
+--------+---------+---+------------------------+
You can do the same by using Window function as below
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy("Category").orderBy($"Timestamp".desc)
df.withColumn("rank", rank().over(windowSpec)).filter($"rank" === lit(1)).drop("rank")