Pyspark calculate time difference ordered by code - pyspark

I would like to know if it is possible using pyspark if I can calculate the time difference of a dataset by group.
For example I have
CODE1 | CODE2 | TIME
00001 | AAA | 2019-01-01 14:00:00
00001 | AAA | 2019-01-01 14:05:00
00001 | AAA | 2019-01-01 14:10:00
00001 | BBB | 2019-01-01 14:15:00
00001 | BBB | 2019-01-01 14:20:00
00001 | AAA | 2019-01-01 14:25:00
00001 | AAA | 2019-01-01 14:30:00
What I would like is something like
CODE1 | CODE2 | TIME_DIFF
00001 | AAA | 10 MINUTES
00001 | BBB | 5 MINUTES
00001 | AAA | 5 MINUTES
The time difference is from the last record to the first one in the same category. I have already sorted the information by time.
Is it possible?

I have coded it with a pretty normal & decent approach. However, the below can be optimized utilizing more inbuilt functions available in spark.
>>> df.show()
+-----+-----+-------------------+
|CODE1|CODE2| TIME|
+-----+-----+-------------------+
| 1| AAA|2019-01-01 14:00:00|
| 1| AAA|2019-01-01 14:05:00|
| 1| AAA|2019-01-01 14:10:00|
| 1| BBB|2019-01-01 14:15:00|
| 1| BBB|2019-01-01 14:20:00|
| 1| AAA|2019-01-01 14:25:00|
| 1| AAA|2019-01-01 14:30:00|
+-----+-----+-------------------+
>>> df.printSchema()
root
|-- CODE1: long (nullable = true)
|-- CODE2: string (nullable = true)
|-- TIME: string (nullable = true)
>>> from pyspark.sql import functions as F, Window
>>> win = Window.partitionBy(F.lit(0)).orderBy('TIME')
#batch_order column is to group CODE2 as per the ordered timestamp
>>> df_1=df.withColumn('prev_batch', F.lag('CODE2').over(win)) \
... .withColumn('flag', F.when(F.col('CODE2') == F.col('prev_batch'),0).otherwise(1)) \
... .withColumn('batch_order', F.sum('flag').over(win)) \
... .drop('prev_batch', 'flag') \
... .sort('TIME')
>>> df_1.show()
+-----+-----+-------------------+-----------+
|CODE1|CODE2| TIME|batch_order|
+-----+-----+-------------------+-----------+
| 1| AAA|2019-01-01 14:00:00| 1|
| 1| AAA|2019-01-01 14:05:00| 1|
| 1| AAA|2019-01-01 14:10:00| 1|
| 1| BBB|2019-01-01 14:15:00| 2|
| 1| BBB|2019-01-01 14:20:00| 2|
| 1| AAA|2019-01-01 14:25:00| 3|
| 1| AAA|2019-01-01 14:30:00| 3|
+-----+-----+-------------------+-----------+
#Extract min and max timestamps for each group
>>> df_max=df_1.groupBy([df_1.batch_order,df_1.CODE2]).agg(F.max("TIME").alias("mx"))
>>> df_min=df_1.groupBy([df_1.batch_order,df_1.CODE2]).agg(F.min("TIME").alias("mn"))
>>> df_max.show()
+-----------+-----+-------------------+
|batch_order|CODE2| mx|
+-----------+-----+-------------------+
| 1| AAA|2019-01-01 14:10:00|
| 2| BBB|2019-01-01 14:20:00|
| 3| AAA|2019-01-01 14:30:00|
+-----------+-----+-------------------+
>>> df_min.show()
+-----------+-----+-------------------+
|batch_order|CODE2| mn|
+-----------+-----+-------------------+
| 1| AAA|2019-01-01 14:00:00|
| 2| BBB|2019-01-01 14:15:00|
| 3| AAA|2019-01-01 14:25:00|
+-----------+-----+-------------------+
#join on batch_order
>>> df_joined=df_max.join(df_min,df_max.batch_order==df_min.batch_order)
>>> df_joined.show()
+-----------+-----+-------------------+-----------+-----+-------------------+
|batch_order|CODE2| mx|batch_order|CODE2| mn|
+-----------+-----+-------------------+-----------+-----+-------------------+
| 1| AAA|2019-01-01 14:10:00| 1| AAA|2019-01-01 14:00:00|
| 3| AAA|2019-01-01 14:30:00| 3| AAA|2019-01-01 14:25:00|
| 2| BBB|2019-01-01 14:20:00| 2| BBB|2019-01-01 14:15:00|
+-----------+-----+-------------------+-----------+-----+-------------------+
>>> from pyspark.sql.functions import unix_timestamp
>>> from pyspark.sql.types import IntegerType
#difference between the max and min timestamp
>>> df_joined.withColumn("diff",((unix_timestamp(df_joined.mx, 'yyyy-MM-dd HH:mm:ss')-unix_timestamp(df_joined.mn, 'yyyy-MM-dd HH:mm:ss'))/60).cast(IntegerType())).show()
+-----------+-----+-------------------+-------------------+----+
|batch_order|CODE2| mx| mn|diff|
+-----------+-----+-------------------+-------------------+----+
| 1| AAA|2019-01-01 14:10:00|2019-01-01 14:00:00| 10|
| 3| AAA|2019-01-01 14:30:00|2019-01-01 14:25:00| 5|
| 2| BBB|2019-01-01 14:20:00|2019-01-01 14:15:00| 5|
+-----------+-----+-------------------+-------------------+----+

Related

Pyspark - advanced aggregation of monthly data

I have a table of the following format.
|---------------------|------------------|------------------|
| Customer | Month | Sales |
|---------------------|------------------|------------------|
| A | 3 | 40 |
|---------------------|------------------|------------------|
| A | 2 | 50 |
|---------------------|------------------|------------------|
| B | 1 | 20 |
|---------------------|------------------|------------------|
I need it in the format as below
|---------------------|------------------|------------------|------------------|
| Customer | Month 1 | Month 2 | Month 3 |
|---------------------|------------------|------------------|------------------|
| A | 0 | 50 | 40 |
|---------------------|------------------|------------------|------------------|
| B | 20 | 0 | 0 |
|---------------------|------------------|------------------|------------------|
Can you please help me out to solve this problem in PySpark?
This should help , i am assumming you are using SUM to aggregate vales from the originical DF
>>> df.show()
+--------+-----+-----+
|Customer|Month|Sales|
+--------+-----+-----+
| A| 3| 40|
| A| 2| 50|
| B| 1| 20|
+--------+-----+-----+
>>> import pyspark.sql.functions as F
>>> df2=(df.withColumn('COLUMN_LABELS',F.concat(F.lit('Month '),F.col('Month')))
.groupby('Customer')
.pivot('COLUMN_LABELS')
.agg(F.sum('Sales'))
.fillna(0))
>>> df2.show()
+--------+-------+-------+-------+
|Customer|Month 1|Month 2|Month 3|
+--------+-------+-------+-------+
| A| 0| 50| 40|
| B| 20| 0| 0|
+--------+-------+-------+-------+

Spark dataframe groupby and order group?

I have the following data,
+-------+----+----+
|user_id|time|item|
+-------+----+----+
| 1| 5| ggg|
| 1| 5| ddd|
| 1| 20| aaa|
| 1| 20| ppp|
| 2| 3| ccc|
| 2| 3| ttt|
| 2| 20| eee|
+-------+----+----+
this could be generated by code:
val df = sc.parallelize(Array(
(1, 20, "aaa"),
(1, 5, "ggg"),
(2, 3, "ccc"),
(1, 20, "ppp"),
(1, 5, "ddd"),
(2, 20, "eee"),
(2, 3, "ttt"))).toDF("user_id", "time", "item")
How can I get the result:
+---------+------+------+----------+
| user_id | time | item | order_id |
+---------+------+------+----------+
| 1 | 5 | ggg | 1 |
| 1 | 5 | ddd | 1 |
| 1 | 20 | aaa | 2 |
| 1 | 20 | ppp | 2 |
| 2 | 3 | ccc | 1 |
| 2 | 3 | ttt | 1 |
| 2 | 20 | eee | 2 |
+---------+------+------+----------+
groupby user_id,time and order by time and rank the group, thanks~
To rank the rows you can use dense_rank window function and the order can be achieved by final orderBy transformation:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{dense_rank}
val w = Window.partitionBy("user_id").orderBy("user_id", "time")
val result = df
.withColumn("order_id", dense_rank().over(w))
.orderBy("user_id", "time")
result.show()
+-------+----+----+--------+
|user_id|time|item|order_id|
+-------+----+----+--------+
| 1| 5| ddd| 1|
| 1| 5| ggg| 1|
| 1| 20| aaa| 2|
| 1| 20| ppp| 2|
| 2| 3| ttt| 1|
| 2| 3| ccc| 1|
| 2| 20| eee| 2|
+-------+----+----+--------+
Note that the order in the item column is not given

How to calculate 5 day-mean, 10-day mean & 15-day mean for given data?

Scenario :
I have following dataframe as below
``` -- -----------------------------------
companyId | calc_date | mean |
----------------------------------
1111 | 01-08-2002 | 15 |
----------------------------------
1111 | 02-08-2002 | 16.5 |
----------------------------------
1111 | 03-08-2002 | 17 |
----------------------------------
1111 | 04-08-2002 | 15 |
----------------------------------
1111 | 05-08-2002 | 23 |
----------------------------------
1111 | 06-08-2002 | 22.6 |
----------------------------------
1111 | 07-08-2002 | 25 |
----------------------------------
1111 | 08-08-2002 | 15 |
----------------------------------
1111 | 09-08-2002 | 15 |
----------------------------------
1111 | 10-08-2002 | 16.5 |
----------------------------------
1111 | 11-08-2002 | 22.6 |
----------------------------------
1111 | 12-08-2002 | 15 |
----------------------------------
1111 | 13-08-2002 | 16.5 |
----------------------------------
1111 | 14-08-2002 | 25 |
----------------------------------
1111 | 15-08-2002 | 16.5 |
----------------------------------
```
Required :
Need to calculate for given data 5 day-mean , 10-day mean , 15-day mean for every record for every company.
5 day-mean --> Previous 5 days available mean sum
10 day-mean --> Previous 10 days available mean sum
15 day-mean --> Previous 15 days available mean sum
Resultant dataframe should have caluluated columns as below
----------------------------------------------------------------------------
companyId | calc_date | mean | 5 day-mean | 10-day mean | 15-day mean |
----------------------------------------------------------------------------
Question :
How to achieve this ?
What is the best way to do this in spark ?
Here's one approach using Window partitions by company to compute the n-day mean between the current row and previous rows within the specified timestamp range via rangeBetween, as shown below (using a dummy dataset):
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._
val df = (1 to 3).flatMap(i => Seq.tabulate(15)(j => (i, s"${j+1}-2-2019", j+1))).
toDF("company_id", "calc_date", "mean")
df.show
// +----------+---------+----+
// |company_id|calc_date|mean|
// +----------+---------+----+
// | 1| 1-2-2019| 1|
// | 1| 2-2-2019| 2|
// | 1| 3-2-2019| 3|
// | 1| 4-2-2019| 4|
// | 1| 5-2-2019| 5|
// | ... |
// | 1|14-2-2019| 14|
// | 1|15-2-2019| 15|
// | 2| 1-2-2019| 1|
// | 2| 2-2-2019| 2|
// | 2| 3-2-2019| 3|
// | ... |
// +----------+---------+----+
def winSpec = Window.partitionBy("company_id").orderBy("ts")
def dayRange(days: Int) = winSpec.rangeBetween(-(days * 24 * 60 * 60), 0)
df.
withColumn("ts", unix_timestamp(to_date($"calc_date", "d-M-yyyy"))).
withColumn("mean-5", mean($"mean").over(dayRange(5))).
withColumn("mean-10", mean($"mean").over(dayRange(10))).
withColumn("mean-15", mean($"mean").over(dayRange(15))).
show
// +----------+---------+----+----------+------+-------+-------+
// |company_id|calc_date|mean| ts|mean-5|mean-10|mean-15|
// +----------+---------+----+----------+------+-------+-------+
// | 1| 1-2-2019| 1|1549008000| 1.0| 1.0| 1.0|
// | 1| 2-2-2019| 2|1549094400| 1.5| 1.5| 1.5|
// | 1| 3-2-2019| 3|1549180800| 2.0| 2.0| 2.0|
// | 1| 4-2-2019| 4|1549267200| 2.5| 2.5| 2.5|
// | 1| 5-2-2019| 5|1549353600| 3.0| 3.0| 3.0|
// | 1| 6-2-2019| 6|1549440000| 3.5| 3.5| 3.5|
// | 1| 7-2-2019| 7|1549526400| 4.5| 4.0| 4.0|
// | 1| 8-2-2019| 8|1549612800| 5.5| 4.5| 4.5|
// | 1| 9-2-2019| 9|1549699200| 6.5| 5.0| 5.0|
// | 1|10-2-2019| 10|1549785600| 7.5| 5.5| 5.5|
// | 1|11-2-2019| 11|1549872000| 8.5| 6.0| 6.0|
// | 1|12-2-2019| 12|1549958400| 9.5| 7.0| 6.5|
// | 1|13-2-2019| 13|1550044800| 10.5| 8.0| 7.0|
// | 1|14-2-2019| 14|1550131200| 11.5| 9.0| 7.5|
// | 1|15-2-2019| 15|1550217600| 12.5| 10.0| 8.0|
// | 3| 1-2-2019| 1|1549008000| 1.0| 1.0| 1.0|
// | 3| 2-2-2019| 2|1549094400| 1.5| 1.5| 1.5|
// | 3| 3-2-2019| 3|1549180800| 2.0| 2.0| 2.0|
// | 3| 4-2-2019| 4|1549267200| 2.5| 2.5| 2.5|
// | 3| 5-2-2019| 5|1549353600| 3.0| 3.0| 3.0|
// +----------+---------+----+----------+------+-------+-------+
// only showing top 20 rows
Note that one could use rowsBetween (as opposed to rangeBetween) directly on calc_date if the dates are guaranteed to be contiguous per-day time series.

join 2 DF with diferent dimension scala

Hi I have 2 Differente DF
scala> d1.show() scala> d2.show()
+--------+-------+ +--------+----------+
| fecha|eventos| | fecha|TotalEvent|
+--------+-------+ +--------+----------+
|20180404| 3| | 0| 23534|
|20180405| 7| |20180322| 10|
|20180406| 10| |20180326| 50|
|20180409| 4| |20180402| 6|
.... |20180403| 118|
scala> d1.count() |20180404| 1110|
res3: Long = 60 ...
scala> d2.count()
res7: Long = 74
But I like to join them by fecha without loose data, and then, create a new column with a math operation (TotalEvent - eventos)*100/TotalEvent
Something like this:
+---------+-------+----------+--------+
|fecha |eventos|TotalEvent| KPI |
+---------+-------+----------+--------+
| 0| | 23534 | 100.00|
| 20180322| | 10 | 100.00|
| 20180326| | 50 | 100.00|
| 20180402| | 6 | 100.00|
| 20180403| | 118 | 100.00|
| 20180404| 3 | 1110 | 99.73|
| 20180405| 7 | 1204 | 99.42|
| 20180406| 10 | 1526 | 99.34|
| 20180407| | 14 | 100.00|
| 20180409| 4 | 1230 | 99.67|
| 20180410| 11 | 1456 | 99.24|
| 20180411| 6 | 1572 | 99.62|
| 20180412| 5 | 1450 | 99.66|
| 20180413| 7 | 1214 | 99.42|
.....
The problems is that I can't find the way to do it.
When I use:
scala> d1.join(d2,d2("fecha").contains(d1("fecha")), "left").show()
I loose the data that isn't in both table.
+--------+-------+--------+----------+
| fecha|eventos| fecha|TotalEvent|
+--------+-------+--------+----------+
|20180404| 3|20180404| 1110|
|20180405| 7|20180405| 1204|
|20180406| 10|20180406| 1526|
|20180409| 4|20180409| 1230|
|20180410| 11|20180410| 1456|
....
Additional, How can I add a new column with the math operation?
Thank you
I would recommend left-joining df2 with df1 and calculating KPI based on whether eventos is null or not in the joined dataset (using when/otherwise):
import org.apache.spark.sql.functions._
val df1 = Seq(
("20180404", 3),
("20180405", 7),
("20180406", 10),
("20180409", 4)
).toDF("fecha", "eventos")
val df2 = Seq(
("0", 23534),
("20180322", 10),
("20180326", 50),
("20180402", 6),
("20180403", 118),
("20180404", 1110),
("20180405", 100),
("20180406", 100)
).toDF("fecha", "TotalEvent")
df2.
join(df1, Seq("fecha"), "left_outer").
withColumn( "KPI",
round( when($"eventos".isNull, 100.0).
otherwise(($"TotalEvent" - $"eventos") * 100.0 / $"TotalEvent"),
2
)
).show
// +--------+----------+-------+-----+
// | fecha|TotalEvent|eventos| KPI|
// +--------+----------+-------+-----+
// | 0| 23534| null|100.0|
// |20180322| 10| null|100.0|
// |20180326| 50| null|100.0|
// |20180402| 6| null|100.0|
// |20180403| 118| null|100.0|
// |20180404| 1110| 3|99.73|
// |20180405| 100| 7| 93.0|
// |20180406| 100| 10| 90.0|
// +--------+----------+-------+-----+
Note that if the more precise raw KPI is wanted instead, just remove the wrapping round( , 2).
I would do this in several of steps. First join, then select the calculated column, then fill in the na:
# val df2a = df2.withColumnRenamed("fecha", "fecha2") # to avoid ambiguous column names after the join
# val df3 = df1.join(df2a, df1("fecha") === df2a("fecha2"), "outer")
# val kpi = df3.withColumn("KPI", (($"TotalEvent" - $"eventos") / $"TotalEvent" * 100 as "KPI")).na.fill(100, Seq("KPI"))
# kpi.show()
+--------+-------+--------+----------+-----------------+
| fecha|eventos| fecha2|TotalEvent| KPI|
+--------+-------+--------+----------+-----------------+
| null| null|20180402| 6| 100.0|
| null| null| 0| 23534| 100.0|
| null| null|20180322| 10| 100.0|
|20180404| 3|20180404| 1110|99.72972972972973|
|20180406| 10| null| null| 100.0|
| null| null|20180403| 118| 100.0|
| null| null|20180326| 50| 100.0|
|20180409| 4| null| null| 100.0|
|20180405| 7| null| null| 100.0|
+--------+-------+--------+----------+-----------------+
I solved the problems with mixed both suggestion recived.
val dfKPI=d1.join(right=d2, usingColumns = Seq("cliente","fecha"), "outer").orderBy("fecha").withColumn( "KPI",round( when($"eventos".isNull, 100.0).otherwise(($"TotalEvent" - $"eventos") * 100.0 / $"TotalEvent"),2))

Spark - How to apply rules defined in a dataframe to another dataframe

I'm trying to solve this kind of problem with Spark 2, but I can't find a solution.
I have a dataframe A :
+----+-------+------+
|id |COUNTRY| MONTH|
+----+-------+------+
| 1 | US | 1 |
| 2 | FR | 1 |
| 4 | DE | 1 |
| 5 | DE | 2 |
| 3 | DE | 3 |
+----+-------+------+
And a dataframe B :
+-------+------+------+
|COLUMN |VALUE | PRIO |
+-------+------+------+
|COUNTRY| US | 5 |
|COUNTRY| FR | 15 |
|MONTH | 3 | 2 |
+-------+------+------+
The idea is to apply "rules" of dataframe B on dataframe A in order to get this result :
dataframe A' :
+----+-------+------+------+
|id |COUNTRY| MONTH| PRIO |
+----+-------+------+------+
| 1 | US | 1 | 5 |
| 2 | FR | 1 | 15 |
| 4 | DE | 1 | 20 |
| 5 | DE | 2 | 20 |
| 3 | DE | 3 | 2 |
+----+-------+------+------+
I tried someting like that :
dfB.collect.foreach( r =>
var dfAp = dfA.where(r.getAs("COLUMN") == r.getAs("VALUE"))
dfAp.withColumn("PRIO", lit(r.getAs("PRIO")))
)
But I'm sure it's not the right way.
What are the strategy to solve this problem in Spark ?
Working under assumption that the set of rules is reasonably small (possible concerns are the size of the data and the size of generated expression, which in the worst case scenario, can crash the planner) the simplest solution is to use local collection and map it to a SQL expression:
import org.apache.spark.sql.functions.{coalesce, col, lit, when}
val df = Seq(
(1, "US", "1"), (2, "FR", "1"), (4, "DE", "1"),
(5, "DE", "2"), (3, "DE", "3")
).toDF("id", "COUNTRY", "MONTH")
val rules = Seq(
("COUNTRY", "US", 5), ("COUNTRY", "FR", 15), ("MONTH", "3", 2)
).toDF("COLUMN", "VALUE", "PRIO")
val prio = coalesce(rules.as[(String, String, Int)].collect.map {
case (c, v, p) => when(col(c) === v, p)
} :+ lit(20): _*)
df.withColumn("PRIO", prio)
+---+-------+-----+----+
| id|COUNTRY|MONTH|PRIO|
+---+-------+-----+----+
| 1| US| 1| 5|
| 2| FR| 1| 15|
| 4| DE| 1| 20|
| 5| DE| 2| 20|
| 3| DE| 3| 2|
+---+-------+-----+----+
You can replace coalesce with least or greatest to apply the smallest or the largest matching value respectively.
With larger set of rules you could:
melt data to convert to a long format.
val dfLong = df.melt(Seq("id"), df.columns.tail, "COLUMN", "VALUE")
join by column and value.
Aggregate PRIOR by id with appropriate aggregation function (for example min):
val priorities = dfLong.join(rules, Seq("COLUMN", "VALUE"))
.groupBy("id")
.agg(min("PRIO").alias("PRIO"))
Outer join the output with df by id.
df.join(priorities, Seq("id"), "leftouter").na.fill(20)
+---+-------+-----+----+
| id|COUNTRY|MONTH|PRIO|
+---+-------+-----+----+
| 1| US| 1| 5|
| 2| FR| 1| 15|
| 4| DE| 1| 20|
| 5| DE| 2| 20|
| 3| DE| 3| 2|
+---+-------+-----+----+
lets assume rules of dataframeB is limited
I have created dataframe "df" for below table
+---+-------+------+
| id|COUNTRY|MONTH|
+---+-------+------+
| 1| US| 1|
| 2| FR| 1|
| 4| DE| 1|
| 5| DE| 2|
| 3| DE| 3|
+---+-------+------+
By using UDF
val code = udf{(x:String,y:Int)=>if(x=="US") "5" else if (x=="FR") "15" else if (y==3) "2" else "20"}
df.withColumn("PRIO",code($"COUNTRY",$"MONTH")).show()
output
+---+-------+------+----+
| id|COUNTRY|MONTH|PRIO|
+---+-------+------+----+
| 1| US| 1| 5|
| 2| FR| 1| 15|
| 4| DE| 1| 20|
| 5| DE| 2| 20|
| 3| DE| 3| 2|
+---+-------+------+----+