PySpark window function to check a lagged row based on a condition - pyspark

I need to pick up STAT 200 (if it is present and is the last status).
If 200 is the last status, I want to take its date; that case I already know how to handle, it's ok.
But first I need to check whether it is the last status, and if it is not,
I still want to take the date of the STAT 200 row, but only if the status that follows it is not lower (for example 150, as in the first table below). In that case I don't want to consider it.
How can I do this in PySpark? Thank you very much.
ID  STAT  TIME
1   100   17/11/2021
1   200   18/11/2021
1   150   19/11/2021
1   100   20/11/2021
2   200   20/11/2021
In this case I don't want to consider ID 1 (status 150 < 200). So I take only ID 2.
ID  STAT  TIME
1   100   17/11/2021
1   200   18/11/2021
1   350   19/11/2021
1   400   20/11/2021
2   200   20/11/2021
In this case I want to consider status 200 and its details (status 350 > 200 for ID 1; just 200 for ID 2)
Thank you for the support!

You can specify a default value for the lead function and then handle both the last row with STAT = 200 and a non-last row with STAT = 200 using the same logic.
from pyspark.sql import functions as F
from pyspark.sql import Window
window_spec = Window.partitionBy("ID").orderBy("TIME")
df.withColumn("LEAD_STAT", F.lead("STAT", default=201).over(window_spec)).filter((F.col("STAT") == 200) & (F.col("LEAD_STAT") > 200)).show()
Example
Scenario One
from datetime import datetime
from pyspark.sql import functions as F
from pyspark.sql import Window
data = [(1, 100, datetime.strptime("17/11/2021", "%d/%m/%Y"),),
(1, 200, datetime.strptime("18/11/2021", "%d/%m/%Y"),),
(1, 150, datetime.strptime("19/11/2021", "%d/%m/%Y"),),
(1, 100, datetime.strptime("20/11/2021", "%d/%m/%Y"),),
(2, 200, datetime.strptime("20/11/2021", "%d/%m/%Y"),),]
df = spark.createDataFrame(data, ("ID", "STAT", "TIME"))
window_spec = Window.partitionBy("ID").orderBy("TIME")
df.withColumn("LEAD_STAT", F.lead("STAT", default=201).over(window_spec)).filter((F.col("STAT") == 200) & (F.col("LEAD_STAT") > 200)).show()
Output
+---+----+-------------------+---------+
| ID|STAT|               TIME|LEAD_STAT|
+---+----+-------------------+---------+
|  2| 200|2021-11-20 00:00:00|      201|
+---+----+-------------------+---------+
Scenario Two
from datetime import datetime
from pyspark.sql import functions as F
from pyspark.sql import Window
data = [(1, 100, datetime.strptime("17/11/2021", "%d/%m/%Y"),),
(1, 200, datetime.strptime("18/11/2021", "%d/%m/%Y"),),
(1, 350, datetime.strptime("19/11/2021", "%d/%m/%Y"),),
(1, 400, datetime.strptime("20/11/2021", "%d/%m/%Y"),),
(2, 200, datetime.strptime("20/11/2021", "%d/%m/%Y"),),
]
df = spark.createDataFrame(data, ("ID", "STAT", "TIME"))
window_spec = Window.partitionBy("ID").orderBy("TIME")
df.withColumn("LEAD_STAT", F.lead("STAT", default=201).over(window_spec)).filter((F.col("STAT") == 200) & (F.col("LEAD_STAT") > 200)).show()
Output
+---+----+-------------------+---------+
| ID|STAT|               TIME|LEAD_STAT|
+---+----+-------------------+---------+
|  1| 200|2021-11-18 00:00:00|      350|
|  2| 200|2021-11-20 00:00:00|      201|
+---+----+-------------------+---------+
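If what you ultimately need is only the date of the qualifying STAT 200 row (the question asks for its date), you can simply select it after the filter. A minimal follow-up sketch reusing the df and window_spec defined above:
result = (df
    .withColumn("LEAD_STAT", F.lead("STAT", default=201).over(window_spec))
    .filter((F.col("STAT") == 200) & (F.col("LEAD_STAT") > 200))
    .select("ID", "TIME"))  # keep only the identifier and the date of the STAT 200 row
result.show()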

Related

Spark Combining Disparate-rate DataFrames in Time

Using Spark and Scala, I have two DataFrames with data values.
I'm trying to accomplish something that would be trivial when processing serially, but seems daunting when processing in a cluster.
Let's say I have two sets of values. One of them is very regular:
Relative Time  Value1
10             1
20             2
30             3
And I want to combine it with another value that is very irregular:
Relative Time  Value2
1              100
22             200
And get this (driven by Value1):
Relative Time  Value1  Value2
10             1       100
20             2       100
30             3       200
Note: There are a few scenarios here. One of them is that Value1 is a massive DataFrame and Value2 only has a few hundred values. The other scenario is that they're both massive.
Also note: I depict Value2 as being very slow, and it might be, but it could also be much faster than Value1, so I may have 10 or 100 values of Value2 before my next value of Value1, and I'd want the latest. Because of this, doing a union of them and windowing it doesn't seem practical.
How would I accomplish this in Spark?
I think you can do:
Full outer join between the two tables
Use the last function to look back for the closest value of value2
import spark.implicits._
import org.apache.spark.sql.expressions.Window
val df1 = spark.sparkContext.parallelize(Seq(
(10, 1),
(20, 2),
(30, 3)
)).toDF("Relative Time", "value1")
val df2 = spark.sparkContext.parallelize(Seq(
(1, 100),
(22, 200)
)).toDF("Relative Time", "value2_temp")
val df = df1.join(df2, Seq("Relative Time"), "outer")
val window = Window.orderBy("Relative Time")
val result = df
  .withColumn("value2", last($"value2_temp", ignoreNulls = true).over(window))
  .filter($"value1".isNotNull)
  .drop("value2_temp")
result.show()
+-------------+------+------+
|Relative Time|value1|value2|
+-------------+------+------+
|           10|     1|   100|
|           20|     2|   100|
|           30|     3|   200|
+-------------+------+------+
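For reference, a rough PySpark equivalent of the same approach (full outer join, then last with ignorenulls to carry the most recent value2 forward; the names mirror the Scala snippet and are otherwise assumptions):
from pyspark.sql import functions as F
from pyspark.sql import Window

df1 = spark.createDataFrame([(10, 1), (20, 2), (30, 3)], ["Relative Time", "value1"])
df2 = spark.createDataFrame([(1, 100), (22, 200)], ["Relative Time", "value2_temp"])

w = Window.orderBy("Relative Time")  # single global ordering, as in the Scala version
result = (df1.join(df2, ["Relative Time"], "outer")
    .withColumn("value2", F.last("value2_temp", ignorenulls=True).over(w))  # last non-null value2 so far
    .filter(F.col("value1").isNotNull())
    .drop("value2_temp"))
result.show()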

Pyspark agg max abs val but keep sign

I want the max absolute value of a column, aggregating (grouping) on another column. For example, if the data was
ID  val
A   10
A   100
A   -150
A   15
B   10
B   200
B   -150
B   15
I'd want to return the below, keeping the sign. I'm not sure how to do this while preserving the sign.
ID  max(val)
A   -150
B   200
Option 1: using a window function + row_number. We partition by ID and order by abs(val) descending. Then we simply take the first row.
import pyspark.sql.functions as F
data = [
('A', 10),
('A', 100),
('A', -150),
('A', 15),
('B', 10),
('B', 200),
('B', -150),
('B', 15)
]
df = spark.createDataFrame(data=data, schema=('ID','val'))
w = Window().partitionBy('ID').orderBy(F.abs('val').desc())
(df
.withColumn('rn', F.row_number().over(w))
.filter(F.col('rn') == 1)
.drop('rn')
).show()
+---+----+
| ID| val|
+---+----+
|  A|-150|
|  B| 200|
+---+----+
Option 2: A solution which works with agg. We compare the max value to the absolute max value. If they match then take the max, if they don't then take the min. Note that this solution prefers the positive value in case of ties.
df.groupby('ID').agg(
F.when(F.max('val') == F.max(F.abs('val')), F.max('val')).otherwise(F.min('val')).alias('max_val')
).show()
+---+-------+
| ID|max_val|
+---+-------+
|  A|   -150|
|  B|    200|
+---+-------+
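If you are on a newer Spark version (the SQL function max_by was added in Spark 3.0), a third option, sketched here as an addition rather than part of the original answers, is to pick the value of val at the row where abs(val) is largest. Note that tie-breaking (e.g. between 150 and -150) is not guaranteed:
df.groupby('ID').agg(
    F.expr('max_by(val, abs(val))').alias('max_val')  # value of val at the row with the maximal abs(val)
).show()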

Validate data from the same column in different rows with pyspark

How can I change the value of a column depending on a validation across rows? What I need is to compare the kilometraje values of each customer's (id) records to check whether each following record has a higher kilometraje.
fecha id estado id_cliente error_code kilometraje error_km
1/1/2019 1 A 1 10
2/1/2019 2 A ERROR 20
3/1/2019 1 D 1 ERROR 30
4/1/2019 2 O ERROR
The error in the error_km column is there because, for customer (id) 2, the kilometraje value is less than the same customer's record for 2/1/2019. (As time passes the car is used, so the kilometraje increases; for a record to be error-free, the mileage has to be higher than or equal to the previous one.)
I know that with withColumn I can overwrite or create a column that doesn't exist, and that using when I can set conditions. For example, this is the code I use to validate the estado and id_cliente columns and write ERROR into the error_code column where applicable, but I don't understand how to validate between different rows for the same client.
from pyspark.sql.functions import lit
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark import StorageLevel
file_path = 'archive.txt'
error = 'ERROR'
df = spark.read.parquet(file_path)
df = df.persist(StorageLevel.MEMORY_AND_DISK)
df = df.select('estado', 'id_cliente')
df = df.withColumn("error_code", lit(''))
df = df.withColumn('error_code',
                   F.when((F.col('estado') == 'O') & (F.col('id_cliente') != '') |
                          (F.col('estado') == 'D') & (F.col('id_cliente') != '') |
                          (F.col('estado') == 'A') & (F.col('id_cliente') == ''),
                          F.concat(F.col("error_code"), F.lit(":[{}]".format(error))))
                   .otherwise(F.col('error_code')))
You can achieve that with the lag window function. lag returns the value from the row before the current row; with that you can easily compare the kilometraje values. Have a look at the code below:
import pyspark.sql.functions as F
from pyspark.sql import Window
l = [('1/1/2019' , 1 , 10),
('2/1/2019', 2 , 20 ),
('3/1/2019', 1 , 30 ),
('4/1/2019', 1 , 10 ),
('5/1/2019', 1 , 30 ),
('7/1/2019', 3 , 30 ),
('4/1/2019', 2 , 5)]
columns = ['fecha', 'id', 'kilometraje']
df=spark.createDataFrame(l, columns)
df = df.withColumn('fecha',F.to_date(df.fecha, 'dd/MM/yyyy'))
w = Window.partitionBy('id').orderBy('fecha')
df = df.withColumn('error_km', F.when(F.lag('kilometraje').over(w) > df.kilometraje, F.lit('ERROR') ).otherwise(F.lit('')))
df.show()
Output:
+----------+---+-----------+--------+
|     fecha| id|kilometraje|error_km|
+----------+---+-----------+--------+
|2019-01-01|  1|         10|        |
|2019-01-03|  1|         30|        |
|2019-01-04|  1|         10|   ERROR|
|2019-01-05|  1|         30|        |
|2019-01-07|  3|         30|        |
|2019-01-02|  2|         20|        |
|2019-01-04|  2|          5|   ERROR|
+----------+---+-----------+--------+
The fourth row doesn't get labeled with 'ERROR' because the previous row had a smaller kilometraje value (10 < 30). If you want to label every id that contains at least one corrupted row with 'ERROR', perform a left join:
(df.drop('error_km')
   .join(df.filter(df.error_km == 'ERROR').groupby('id').agg(F.first(df.error_km).alias('error_km')), 'id', 'left')
   .show())
I use .rangeBetween(Window.unboundedPreceding, 0).
This frame makes the window run from the start of the partition up to and including the current row, so the running maximum of the mileage can be compared with the current value.
import pyspark
from pyspark.sql.functions import lit
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql import Window
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.getOrCreate()
error = 'This is error'
l = [('1/1/2019' , 1 , 10),
('2/1/2019', 2 , 20 ),
('3/1/2019', 1 , 30 ),
('4/1/2019', 1 , 10 ),
('5/1/2019', 1 , 22 ),
('7/1/2019', 1 , 23 ),
('22/1/2019', 2 , 5),
('11/1/2019', 2 , 24),
('13/2/2019', 1 , 16),
('14/2/2019', 2 , 18),
('5/2/2019', 1 , 19),
('6/2/2019', 2 , 23),
('7/2/2019', 1 , 14),
('8/3/2019', 1 , 50),
('8/3/2019', 2 , 50)]
columns = ['date', 'vin', 'mileage']
df=spark.createDataFrame(l, columns)
df = df.withColumn('date',F.to_date(df.date, 'dd/MM/yyyy'))
df = df.withColumn("max", lit(0))
df = df.withColumn("error_code", lit(''))
w = Window.partitionBy('vin').orderBy('date').rangeBetween(Window.unboundedPreceding,0)
df = df.withColumn('max',F.max('mileage').over(w))
df = df.withColumn('error_code', F.when(F.col('mileage') < F.col('max'), F.lit('ERROR')).otherwise(F.lit('')))
df.show()
Finally, all that remains is to remove the column that has the maximum
df = df.drop('max')
df.show()

How to use a window function to count day of week occurrences in Pyspark 2.1

With the below pyspark dataset (2.1), how do you use a windowing function that counts the number of times the current record's day of week appeared in the last 28 days?
Example DataFrame:
from pyspark.sql import functions as F
df = sqlContext.createDataFrame([
("a", "1", "2018-01-01 12:01:01","Monday"),
("a", "13", "2018-01-01 14:01:01","Monday"),
("a", "22", "2018-01-02 22:01:01","Tuesday"),
("a", "43", "2018-01-08 01:01:01","Monday"),
("a", "43", "2018-01-09 01:01:01","Tuesday"),
("a", "74", "2018-01-10 12:01:01","Wednesday"),
("a", "95", "2018-01-15 06:01:01","Monday"),
], ["person_id", "other_id", "timestamp","dow"])
df.withColumn("dow_count",`some window function`)
Possible window
from pyspark.sql import Window
from pyspark.sql import functions as F
Days_28 = (86400 * 28)
window = Window.partitionBy("person_id").orderBy('timestamp').rangeBetween(-Days_28, -1)
## I know this next line is wrong
df.withColumn("dow_count",F.sum(F.when(Current_day=windowed_day,1).otherwise(0)).over(window))
Example Output
df.show()
+---------+--------+-------------------+---------+---------+
|person_id|other_id|          timestamp|      dow|dow_count|
+---------+--------+-------------------+---------+---------+
|        a|       1|2018-01-01 12:01:01|   Monday|        0|
|        a|      13|2018-01-01 14:01:01|   Monday|        1|
|        a|      22|2018-01-02 22:01:01|  Tuesday|        0|
|        a|      43|2018-01-08 01:01:01|   Monday|        2|
|        a|      43|2018-01-09 01:01:01|  Tuesday|        1|
|        a|      74|2018-01-10 12:01:01|Wednesday|        0|
|        a|      95|2018-01-15 06:01:01|   Monday|        3|
+---------+--------+-------------------+---------+---------+
Use F.row_number() with a window partitioned by (person_id, dow); the logic with your rangeBetween() should be replaced with a where():
from datetime import timedelta, datetime
N_days = 28
end = datetime.combine(datetime.today(), datetime.min.time())
start = end - timedelta(days=N_days)
window = Window.partitionBy("person_id", "dow").orderBy('timestamp')
df.where((df.timestamp < end) & (df.timestamp >= start)) \
.withColumn('dow_count', F.row_number().over(window)-1) \
.show()
I figured it out and thought I'd share.
First create a unix timestamp and cast it to long.
Then, partition by person and day of week.
Finally, use the count function over the window.
from pyspark.sql import functions as F
df = df.withColumn('unix_ts', df.timestamp.astype('Timestamp').cast("long"))
w = Window.partitionBy('person_id', 'dow').orderBy('unix_ts').rangeBetween(-86400*28, -1)
df = df.withColumn('dow_count', F.count('unix_ts').over(w))
df.sort(df.unix_ts).show()
Bonus: How to create the actual day of week from the timestamp.
df = df.withColumn("DayOfWeek",F.date_format(df.timestamp, 'EEEE'))
I couldn't have done it without tips from jxc and this stackoverflow article.

How to slice and sum elements of array column?

I would like to sum (or perform other aggregate functions too) on the array column using SparkSQL.
I have a table as
+-------+-------+---------------------------------+
|dept_id|dept_nm| emp_details|
+-------+-------+---------------------------------+
| 10|Finance| [100, 200, 300, 400, 500]|
| 20| IT| [10, 20, 50, 100]|
+-------+-------+---------------------------------+
I would like to sum the values of this emp_details column.
Expected query:
sqlContext.sql("select sum(emp_details) from mytable").show
Expected result
1500
180
I should also be able to sum over a range of the elements, like:
sqlContext.sql("select sum(slice(emp_details,0,3)) from mytable").show
result
600
80
When doing sum on the Array type, it complains (as expected) that sum expects the argument to be of numeric type, not array type.
I think we need to create a UDF for this, but how?
Will I be facing any performance hits with UDFs?
And is there any other solution apart from the UDF one?
Spark 2.4.0
As of Spark 2.4, Spark SQL supports higher-order functions that manipulate complex data structures, including arrays.
The "modern" solution would be as follows:
scala> input.show(false)
+-------+-------+-------------------------+
|dept_id|dept_nm|emp_details |
+-------+-------+-------------------------+
|10 |Finance|[100, 200, 300, 400, 500]|
|20 |IT |[10, 20, 50, 100] |
+-------+-------+-------------------------+
input.createOrReplaceTempView("mytable")
val sqlText = "select dept_id, dept_nm, aggregate(emp_details, 0, (acc, value) -> acc + value) as sum from mytable"
scala> sql(sqlText).show
+-------+-------+----+
|dept_id|dept_nm| sum|
+-------+-------+----+
| 10|Finance|1500|
| 20| IT| 180|
+-------+-------+----+
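Because the heavy lifting is done by the SQL aggregate function itself, the same query also works unchanged from PySpark. A small sketch, assuming the same data is registered as a temporary view named mytable:
spark.sql("""
    select dept_id, dept_nm,
           aggregate(emp_details, 0, (acc, value) -> acc + value) as sum
    from mytable
""").show()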
You can find a good reading on higher-order functions in the following articles and video:
Introducing New Built-in and Higher-Order Functions for Complex Data Types in Apache Spark 2.4
Working with Nested Data Using Higher Order Functions in SQL on Databricks
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell (Databricks)
Spark 2.3.2 and earlier
DISCLAIMER I would not recommend this approach (even though it got the most upvotes) because of the deserialization that Spark SQL does to execute Dataset.map. The query forces Spark to deserialize the data and load it onto the JVM heap (from memory regions that are managed by Spark outside the JVM). That will inevitably lead to more frequent GCs and hence worse performance.
One solution would be a Dataset-based solution, where the combination of Spark SQL and Scala can show its power.
scala> val inventory = Seq(
| (10, "Finance", Seq(100, 200, 300, 400, 500)),
| (20, "IT", Seq(10, 20, 50, 100))).toDF("dept_id", "dept_nm", "emp_details")
inventory: org.apache.spark.sql.DataFrame = [dept_id: int, dept_nm: string ... 1 more field]
// I'm too lazy today for a case class
scala> inventory.as[(Long, String, Seq[Int])].
map { case (deptId, deptName, details) => (deptId, deptName, details.sum) }.
toDF("dept_id", "dept_nm", "sum").
show
+-------+-------+----+
|dept_id|dept_nm| sum|
+-------+-------+----+
| 10|Finance|1500|
| 20| IT| 180|
+-------+-------+----+
I'm leaving the slice part as an exercise as it's equally simple.
Since Spark 2.4 you can slice with the slice function:
import org.apache.spark.sql.functions.slice
val df = Seq(
(10, "Finance", Seq(100, 200, 300, 400, 500)),
(20, "IT", Seq(10, 20, 50, 100))
).toDF("dept_id", "dept_nm", "emp_details")
val dfSliced = df.withColumn(
"emp_details_sliced",
slice($"emp_details", 1, 3)
)
dfSliced.show(false)
+-------+-------+-------------------------+------------------+
|dept_id|dept_nm|emp_details |emp_details_sliced|
+-------+-------+-------------------------+------------------+
|10 |Finance|[100, 200, 300, 400, 500]|[100, 200, 300] |
|20 |IT |[10, 20, 50, 100] |[10, 20, 50] |
+-------+-------+-------------------------+------------------+
and sum arrays with aggregate:
dfSliced.selectExpr(
"*",
"aggregate(emp_details, 0, (x, y) -> x + y) as details_sum",
"aggregate(emp_details_sliced, 0, (x, y) -> x + y) as details_sliced_sum"
).show
+-------+-------+--------------------+------------------+-----------+------------------+
|dept_id|dept_nm| emp_details|emp_details_sliced|details_sum|details_sliced_sum|
+-------+-------+--------------------+------------------+-----------+------------------+
| 10|Finance|[100, 200, 300, 4...| [100, 200, 300]| 1500| 600|
| 20| IT| [10, 20, 50, 100]| [10, 20, 50]| 180| 80|
+-------+-------+--------------------+------------------+-----------+------------------+
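The same two steps are also available from PySpark. A small sketch with equivalent data (df_py is a name introduced here; F.slice exists since 2.4 and the aggregate expression is plain Spark SQL, hence F.expr):
from pyspark.sql import functions as F

df_py = spark.createDataFrame(
    [(10, "Finance", [100, 200, 300, 400, 500]), (20, "IT", [10, 20, 50, 100])],
    ["dept_id", "dept_nm", "emp_details"])

(df_py
    .withColumn("emp_details_sliced", F.slice("emp_details", 1, 3))  # slice is 1-based: first three elements
    .withColumn("details_sum", F.expr("aggregate(emp_details, 0, (x, y) -> x + y)"))
    .withColumn("details_sliced_sum", F.expr("aggregate(emp_details_sliced, 0, (x, y) -> x + y)"))
    .show(truncate=False))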
A possible approach is to use explode() on your Array column and then aggregate the output by a unique key. For example:
import sqlContext.implicits._
import org.apache.spark.sql.functions._
(mytable
.withColumn("emp_sum",
explode($"emp_details"))
.groupBy("dept_nm")
.agg(sum("emp_sum")).show)
+-------+------------+
|dept_nm|sum(emp_sum)|
+-------+------------+
|Finance| 1500|
| IT| 180|
+-------+------------+
To select only specific values in your array, we can work with the answer from the linked question and apply it with a slight modification:
val slice = udf((array : Seq[Int], from : Int, to : Int) => array.slice(from,to))
(mytable
.withColumn("slice",
slice($"emp_details",
lit(0),
lit(3)))
.withColumn("emp_sum",
explode($"slice"))
.groupBy("dept_nm")
.agg(sum("emp_sum")).show)
+-------+------------+
|dept_nm|sum(emp_sum)|
+-------+------------+
|Finance| 600|
| IT| 80|
+-------+------------+
Data:
val data = Seq((10, "Finance", Array(100,200,300,400,500)),
(20, "IT", Array(10,20,50,100)))
val mytable = sc.parallelize(data).toDF("dept_id", "dept_nm","emp_details")
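The explode-and-groupBy approach translates to PySpark almost word for word. A small sketch with equivalent data (df_py is a name introduced here):
from pyspark.sql import functions as F

df_py = spark.createDataFrame(
    [(10, "Finance", [100, 200, 300, 400, 500]), (20, "IT", [10, 20, 50, 100])],
    ["dept_id", "dept_nm", "emp_details"])

(df_py
    .withColumn("emp_sum", F.explode("emp_details"))  # one row per array element
    .groupBy("dept_nm")
    .agg(F.sum("emp_sum"))
    .show())
For the sliced sum, the same explode can be applied to F.slice("emp_details", 1, 3) on Spark 2.4+, or to a slicing UDF on older versions.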
Here is an alternative to mtoto's answer without using a groupBy (I really don't know which one is fastest: the UDF, mtoto's solution, or mine; comments welcome).
You would see a performance impact when using a UDF, in general. There is an answer you might want to read, and this resource is a good read on UDFs.
Now for your problem, you can avoid the use of a UDF. What I would use is a Column expression generated with Scala logic.
data:
val df = Seq((10, "Finance", Array(100,200,300,400,500)),
(20, "IT", Array(10, 20, 50,100)))
.toDF("dept_id", "dept_nm","emp_details")
You need some trickery to be able to traverse an ArrayType; you can play a bit with the solution to discover various problems (see the edit at the bottom for the slice part). Here is my proposal, but you might find better. First you take the maximum length:
val maxLength = df.select(size('emp_details).as("l")).groupBy().max("l").first.getInt(0)
Then you use it, testing when you have a shorter array
val sumArray = (0 until maxLength)
.map(i => when(size('emp_details) > i,'emp_details(i)).otherwise(lit(0)))
.reduce(_ + _)
.as("sumArray")
val res = df
.select('dept_id,'dept_nm,'emp_details,sumArray)
result:
+-------+-------+--------------------+--------+
|dept_id|dept_nm| emp_details|sumArray|
+-------+-------+--------------------+--------+
| 10|Finance|[100, 200, 300, 4...| 1500|
| 20| IT| [10, 20, 50, 100]| 180|
+-------+-------+--------------------+--------+
I advise you to look at sumArray to understand what it is doing.
Edit: Of course I only read half of the question again... But if you want to change the items over which to sum, it becomes obvious with this solution (i.e. you don't need a slice function): just replace (0 until maxLength) with the range of indices you need:
def sumArray(from: Int, max: Int) = (from until max)
.map(i => when(size('emp_details) > i,'emp_details(i)).otherwise(lit(0)))
.reduce(_ + _)
.as("sumArray")
Building on zero323's awesome answer: in case you have an array of long integers, i.e. BIGINT, you need to change the initial value from 0 to BIGINT(0), as explained in the first paragraph here,
so you have
dfSliced.selectExpr(
"*",
"aggregate(emp_details, BIGINT(0), (x, y) -> x + y) as details_sum",
"aggregate(emp_details_sliced, BIGINT(0), (x, y) -> x + y) as details_sliced_sum"
).show
The rdd way is missing, so let me add it.
val df = Seq((10, "Finance", Array(100,200,300,400,500)),(20, "IT", Array(10,20,50,100))).toDF("dept_id", "dept_nm","emp_details")
import scala.collection.mutable._
val rdd1 = df.rdd.map( x=> {val p = x.getAs[mutable.WrappedArray[Int]]("emp_details").toArray; Row.merge(x,Row(p.sum,p.slice(0,2).sum)) })
spark.createDataFrame(rdd1,df.schema.add(StructField("sumArray",IntegerType)).add(StructField("sliceArray",IntegerType))).show(false)
Output:
+-------+-------+-------------------------+--------+----------+
|dept_id|dept_nm|emp_details |sumArray|sliceArray|
+-------+-------+-------------------------+--------+----------+
|10 |Finance|[100, 200, 300, 400, 500]|1500 |300 |
|20 |IT |[10, 20, 50, 100] |180 |30 |
+-------+-------+-------------------------+--------+----------+
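For completeness, a rough PySpark counterpart of the RDD round-trip (a sketch; it relies on Row being a tuple, so adding a tuple appends the two sums as extra columns, and df_py is a name introduced here):
df_py = spark.createDataFrame(
    [(10, "Finance", [100, 200, 300, 400, 500]), (20, "IT", [10, 20, 50, 100])],
    ["dept_id", "dept_nm", "emp_details"])

rdd1 = df_py.rdd.map(lambda r: r + (sum(r.emp_details), sum(r.emp_details[0:2])))  # append total and slice sums
spark.createDataFrame(rdd1, df_py.schema.names + ["sumArray", "sliceArray"]).show(truncate=False)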