How can I compare dates across adjacent rows (preceding and next) in a DataFrame? This should happen at a key level.
I have the following data after sorting on key and dates:
source_Df.show()
+-----+--------+------------+------------+
| key | code | begin_dt | end_dt |
+-----+--------+------------+------------+
| 10 | ABC | 2018-01-01 | 2018-01-08 |
| 10 | BAC | 2018-01-03 | 2018-01-15 |
| 10 | CAS | 2018-01-03 | 2018-01-21 |
| 20 | AAA | 2017-11-12 | 2018-01-03 |
| 20 | DAS | 2018-01-01 | 2018-01-12 |
| 20 | EDS | 2018-02-01 | 2018-02-16 |
+-----+--------+------------+------------+
When the date ranges of these rows overlap (i.e. the current row's begin_dt falls between the begin_dt and end_dt of the previous row), I need all such rows to carry the lowest begin date and the highest end date of the group.
Here is the output I need:
final_Df.show()
+-----+--------+------------+------------+
| key | code | begin_dt | end_dt |
+-----+--------+------------+------------+
| 10 | ABC | 2018-01-01 | 2018-01-21 |
| 10 | BAC | 2018-01-01 | 2018-01-21 |
| 10 | CAS | 2018-01-01 | 2018-01-21 |
| 20 | AAA | 2017-11-12 | 2018-01-12 |
| 20 | DAS | 2017-11-12 | 2018-01-12 |
| 20 | EDS | 2018-02-01 | 2018-02-16 |
+-----+--------+------------+------------+
I'd appreciate any ideas on how to achieve this. Thanks in advance!
Here's one approach:
Create a new column group_id that is null if begin_dt falls within the previous row's date range; otherwise assign a unique integer
Forward-fill the nulls in group_id with the last non-null value
Compute min(begin_dt) and max(end_dt) within each (key, group_id) partition
Example below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._

val df = Seq(
  (10, "ABC", "2018-01-01", "2018-01-08"),
  (10, "BAC", "2018-01-03", "2018-01-15"),
  (10, "CAS", "2018-01-03", "2018-01-21"),
  (20, "AAA", "2017-11-12", "2018-01-03"),
  (20, "DAS", "2018-01-01", "2018-01-12"),
  (20, "EDS", "2018-02-01", "2018-02-16")
).toDF("key", "code", "begin_dt", "end_dt")

val win1 = Window.partitionBy($"key").orderBy($"begin_dt", $"end_dt")
val win2 = Window.partitionBy($"key", $"group_id")

df.
  withColumn("group_id", when(
      $"begin_dt".between(lag($"begin_dt", 1).over(win1), lag($"end_dt", 1).over(win1)), null
    ).otherwise(monotonically_increasing_id)
  ).
  withColumn("group_id", last($"group_id", ignoreNulls=true).
    over(win1.rowsBetween(Window.unboundedPreceding, 0))
  ).
  withColumn("begin_dt2", min($"begin_dt").over(win2)).
  withColumn("end_dt2", max($"end_dt").over(win2)).
  orderBy("key", "begin_dt", "end_dt").
  show
// +---+----+----------+----------+-------------+----------+----------+
// |key|code|  begin_dt|    end_dt|     group_id| begin_dt2|   end_dt2|
// +---+----+----------+----------+-------------+----------+----------+
// | 10| ABC|2018-01-01|2018-01-08|1047972020224|2018-01-01|2018-01-21|
// | 10| BAC|2018-01-03|2018-01-15|1047972020224|2018-01-01|2018-01-21|
// | 10| CAS|2018-01-03|2018-01-21|1047972020224|2018-01-01|2018-01-21|
// | 20| AAA|2017-11-12|2018-01-03| 455266533376|2017-11-12|2018-01-12|
// | 20| DAS|2018-01-01|2018-01-12| 455266533376|2017-11-12|2018-01-12|
// | 20| EDS|2018-02-01|2018-02-16| 455266533377|2018-02-01|2018-02-16|
// +---+----+----------+----------+-------------+----------+----------+
Related
I have a dataset like this below:
+------+------+-------+------------+------------+-----------------+
| key1 | key2 | price | date_start | date_end | sequence_number |
+------+------+-------+------------+------------+-----------------+
| a | b | 10 | 2022-01-03 | 2022-01-05 | 1 |
| a | b | 10 | 2022-01-02 | 2050-05-15 | 2 |
| a | b | 10 | 2022-02-02 | 2022-05-10 | 3 |
| a | b | 20 | 2024-02-01 | 2050-10-10 | 4 |
| a | b | 20 | 2024-04-01 | 2025-09-10 | 5 |
| a | b | 10 | 2022-04-02 | 2024-09-10 | 6 |
| a | b | 20 | 2024-09-11 | 2050-10-10 | 7 |
+------+------+-------+------------+------------+-----------------+
What I want to achieve is this:
+------+------+-------+------------+------------+
| key1 | key2 | price | date_start | date_end |
+------+------+-------+------------+------------+
| a | b | 10 | 2022-01-02 | 2024-09-10 |
| a | b | 20 | 2024-09-11 | 2050-10-10 |
+------+------+-------+------------+------------+
The sequence number is the order in which the data rows were received.
The resulting dataset should fix the overlapping dates for each price, and should also account for the fact that when a new price arrives for the same key columns, the older record's date_end is updated to the new record's date_start - 1.
After the first three sequence numbers, the output looked like this:
+------+------+-------+------------+------------+
| key1 | key2 | price | date_start | date_end |
+------+------+-------+------------+------------+
| a | b | 10 | 2022-01-02 | 2050-05-15 |
+------+------+-------+------------+------------+
This covers the max range for the price.
After the 4th sequence number:
+------+------+-------+------------+------------+
| key1 | key2 | price | date_start | date_end |
+------+------+-------+------------+------------+
| a | b | 10 | 2022-01-02 | 2024-01-31 |
| a | b | 20 | 2024-02-01 | 2050-10-10 |
+------+------+-------+------------+------------+
After the 5th sequence number:
+------+------+-------+------------+------------+
| key1 | key2 | price | date_start | date_end |
+------+------+-------+------------+------------+
| a | b | 10 | 2022-01-02 | 2024-01-31 |
| a | b | 20 | 2024-02-01 | 2050-10-10 |
+------+------+-------+------------+------------+
No changes, as the date range overlaps the existing one.
After the 6th sequence number:
+------+------+-------+------------+------------+
| key1 | key2 | price | date_start | date_end |
+------+------+-------+------------+------------+
| a | b | 10 | 2022-01-02 | 2024-09-10 |
| a | b | 20 | 2024-09-11 | 2050-10-10 |
+------+------+-------+------------+------------+
Both date_start and date_end are updated.
After the 7th sequence number:
+------+------+-------+------------+------------+
| key1 | key2 | price | date_start | date_end |
+------+------+-------+------------+------------+
| a | b | 10 | 2022-01-02 | 2024-09-10 |
| a | b | 20 | 2024-09-11 | 2050-10-10 |
+------+------+-------+------------+------------+
No changes.
So here's an answer. First I expand the dates into rows. Then I use a group by with struct and collect_list to capture the price and sequence_number together. I take the first item of the sorted (and reversed) array returned from collect_list, which effectively gives me the entry with the max sequence number, and pull the price back from it. From there I group the dates by min/max.
from pyspark.sql.functions import (
    array_sort, col, collect_list, explode, expr, max, min, reverse, struct
)

data = [
    ("a", "b", 10, "2022-01-03", "2022-01-05", 1),
    ("a", "b", 10, "2022-01-02", "2050-05-15", 2),
    ("a", "b", 10, "2022-02-02", "2022-05-10", 3),
    ("a", "b", 20, "2024-02-01", "2050-10-10", 4),
    ("a", "b", 20, "2024-04-01", "2025-09-10", 5),
    ("a", "b", 10, "2022-04-02", "2024-09-10", 6),
    ("a", "b", 20, "2024-09-11", "2050-10-10", 7),
]
columns = ["key1", "key2", "price", "date_start", "date_end", "sequence_number"]
df = spark.createDataFrame(data).toDF(*columns)

# expand each date range into one row per day; the exploded column is named "col" by default
df_sequence = df.select(
    "*",
    explode(expr("sequence(to_date(date_start), to_date(date_end), interval 1 day)"))
)

df_date_range = df_sequence.groupby(
    col("key1"),
    col("key2"),
    col("col")
).agg(
    # here's the magic
    reverse(                  # descending sort, effectively on sequence number
        array_sort(           # ascending sort of the array by the first struct field, then the second
            collect_list(     # collect all elements of the group
                struct(       # capture the data we need
                    col("sequence_number").alias("sequence"),
                    col("price").alias("price")
                )
            )
        )
    )[0].alias("mystruct")    # pull the first element to get the "max"
).select(
    col("key1"),
    col("key2"),
    col("col"),
    col("mystruct.price")     # pull price out of the struct
)
#without comments
#df_date_range = df_sequence.groupby( col("key1"),col("key2"), col("col") ).agg( reverse(array_sort(collect_list( struct(col("sequence_number").alias("sequence") ,col("price" ).alias("price") ) )))[0].alias("mystruct") ).select( col("key1"), col("key2"), col("col"), col("mystruct.price") )
df_date_range.groupby( col("key1"),col("key2"), col("price") ).agg( min("col").alias("date_start"), max("col").alias("date_end") ).show()
+----+----+-----+----------+----------+
|key1|key2|price|date_start|  date_end|
+----+----+-----+----------+----------+
|   a|   b|   10|2022-01-02|2024-09-10|
|   a|   b|   20|2024-09-11|2050-10-10|
+----+----+-----+----------+----------+
This does assume you will only ever use a price once before changing it. If you needed to go back to a price, you'd have to use window logic to identify the different price ranges and add that to your collect_list as an extra factor to sort on; a rough sketch of that idea follows.
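For example, here is a minimal sketch of that window logic, assuming the df built above; price_changed and price_run are just illustrative names:
from pyspark.sql import Window
from pyspark.sql.functions import col, lag, sum as sum_, when

w = Window.partitionBy("key1", "key2").orderBy("sequence_number")

df_runs = (
    df
    # flag rows where the price differs from the previous row's price (or there is no previous row)
    .withColumn(
        "price_changed",
        when(lag("price").over(w).isNull() | (lag("price").over(w) != col("price")), 1).otherwise(0)
    )
    # running sum of the flags: a run id that increments every time the price changes
    .withColumn("price_run", sum_("price_changed").over(w))
)
price_run could then be added to the group by and to the struct inside collect_list, so that each re-appearance of a price is treated as its own range.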
I have a PySpark DataFrame with multiple map columns, and I want to flatten all the map columns recursively. personal and financial are map-type columns; similarly, we might have more map columns.
Input dataframe:
-------------------------------------------------------------------------------------------------------
| id | name | Gender | personal | financial |
-------------------------------------------------------------------------------------------------------
| 1 | A | M | {age:20,city:Dallas,State:Texas} | {salary:10000,bonus:2000,tax:1500}|
| 2 | B | F | {city:Houston,State:Texas,Zipcode:77001} | {salary:12000,tax:1800} |
| 3 | C | M | {age:22,city:San Jose,Zipcode:940088} | {salary:2000,bonus:500} |
-------------------------------------------------------------------------------------------------------
Output dataframe:
--------------------------------------------------------------------------------------------------------------
| id | name | Gender | age | city | state | Zipcode | salary | bonus | tax |
--------------------------------------------------------------------------------------------------------------
| 1 | A | M | 20 | Dallas | Texas | null | 10000 | 2000 | 1500 |
| 2 | B | F | null | Houston | Texas | 77001 | 12000 | null | 1800 |
| 3 | C | M | 22 | San Jose | null | 940088 | 2000 | 500 | null |
--------------------------------------------------------------------------------------------------------------
Use map_concat to merge the map fields and then explode the result. Exploding a map column creates two new columns, key and value. Pivot the key column with value as values to get your desired output.
import pyspark.sql.functions as func

data_sdf. \
    withColumn('personal_financial', func.map_concat('personal', 'financial')). \
    selectExpr(*[c for c in data_sdf.columns if c not in ['personal', 'financial']],
               'explode(personal_financial)'
               ). \
    groupBy([c for c in data_sdf.columns if c not in ['personal', 'financial']]). \
    pivot('key'). \
    agg(func.first('value')). \
    show(truncate=False)
# +---+----+------+-----+-------+----+-----+--------+------+----+
# |id |name|gender|State|Zipcode|age |bonus|city |salary|tax |
# +---+----+------+-----+-------+----+-----+--------+------+----+
# |1 |A |M |Texas|null |20 |2000 |Dallas |10000 |1500|
# |2 |B |F |Texas|77001 |null|null |Houston |12000 |1800|
# |3 |C |M |null |940088 |22 |500 |San Jose|2000 |null|
# +---+----+------+-----+-------+----+-----+--------+------+----+
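Since there might be more map columns than just personal and financial, one way to generalize this, sketched against the same data_sdf and assuming all map columns share compatible key/value types, is to pick the MapType columns out of the schema instead of hardcoding them:
import pyspark.sql.functions as func
from pyspark.sql.types import MapType

# find every map-typed column from the schema
map_cols = [f.name for f in data_sdf.schema.fields if isinstance(f.dataType, MapType)]
other_cols = [c for c in data_sdf.columns if c not in map_cols]

data_sdf. \
    withColumn('all_maps', func.map_concat(*map_cols)). \
    selectExpr(*other_cols, 'explode(all_maps)'). \
    groupBy(other_cols). \
    pivot('key'). \
    agg(func.first('value')). \
    show(truncate=False)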
I have a Spark DataFrame containing ranges of numbers (a start column and an end column), and a column containing the type of the range.
I want to create a new DataFrame with two columns: the first lists every number in each range (incrementing by one), and the second lists the range's type.
To explain more, this is the input DataFrame:
+-------+------+---------+
| start | end | type |
+-------+------+---------+
| 10 | 20 | LOW |
| 21 | 30 | MEDIUM |
| 31 | 40 | HIGH |
+-------+------+---------+
And this is the desired result :
+-------+---------+
| nbr | type |
+-------+---------+
| 10 | LOW |
| 11 | LOW |
| 12 | LOW |
| 13 | LOW |
| 14 | LOW |
| 15 | LOW |
| 16 | LOW |
| 17 | LOW |
| 18 | LOW |
| 19 | LOW |
| 20 | LOW |
| 21 | MEDIUM |
| 22 | MEDIUM |
| .. | ... |
+-------+---------+
Any ideas?
Try this.
import org.apache.spark.sql.functions._
import spark.implicits._

val data = List((10, 20, "Low"), (21, 30, "MEDIUM"), (31, 40, "High"))
val df = data.toDF("start", "end", "type")

df.withColumn("nbr", explode(sequence($"start", $"end"))).drop("start", "end").show(false)
output:
+------+---+
|type |nbr|
+------+---+
|Low |10 |
|Low |11 |
|Low |12 |
|Low |13 |
|Low |14 |
|Low |15 |
|Low |16 |
|Low |17 |
|Low |18 |
|Low |19 |
|Low |20 |
|MEDIUM|21 |
|MEDIUM|22 |
|MEDIUM|23 |
|MEDIUM|24 |
|MEDIUM|25 |
|MEDIUM|26 |
|MEDIUM|27 |
|MEDIUM|28 |
|MEDIUM|29 |
+------+---+
only showing top 20 rows
The solution provided by @Learn-Hadoop works if you're on Spark 2.4+.
For older Spark versions, consider creating a simple UDF to mimic the sequence function:
import org.apache.spark.sql.functions._

val sequence = udf { (lower: Int, upper: Int) =>
  Seq.iterate(lower, upper - lower + 1)(_ + 1)
}

df.withColumn("nbr", explode(sequence($"start", $"end"))).drop("start", "end").show(false)
I have a dataframe with several columns, and what I am trying to do is compute the mean of these columns while ignoring null values. For example:
+--------+-------+---------+-------+
| Baller | Power | Vision | KXD |
+--------+-------+---------+-------+
| John | 5 | null | 10 |
| Bilbo | 5 | 3 | 2 |
+--------+-------+---------+-------+
The output has to be:
+--------+-------+---------+-------+-----------+
| Baller | Power | Vision | KXD | MEAN |
+--------+-------+---------+-------+-----------+
| John | 5 | null | 10 | 7.5 |
| Bilbo | 5 | 3 | 2 | 3.33 |
+--------+-------+---------+-------+-----------+
What I am doing:
val a_cols = Array(col("Power"), col("Vision"), col("KXD"))
val avgFunc = a_cols.foldLeft(lit(0)){(x, y) => x+y}/a_cols.length
val avg_calc = df.withColumn("MEAN", avgFunc)
But I get null in the MEAN column when any of the values is null:
+--------+-------+---------+-------+-----------+
| Baller | Power | Vision | KXD | MEAN |
+--------+-------+---------+-------+-----------+
| John | 5 | null | 10 | null |
| Bilbo | 5 | 3 | 2 | 3.33 |
+--------+-------+---------+-------+-----------+
You can explode the columns and do a group by + mean, then join back to the original dataframe using the Baller column:
val result = df.join(
df.select(
col("Baller"),
explode(array(col("Power"), col("Vision"), col("KXD")))
).groupBy("Baller").agg(mean("col").as("MEAN")),
Seq("Baller")
)
result.show
+------+-----+------+---+------------------+
|Baller|Power|Vision|KXD|              MEAN|
+------+-----+------+---+------------------+
|  John|    5|  null| 10|               7.5|
| Bilbo|    5|     3|  2|3.3333333333333335|
+------+-----+------+---+------------------+
I have a DataFrame in Spark (Scala) loaded from a large CSV file.
The DataFrame looks something like this:
key| col1 | timestamp |
---------------------------------
1 | aa | 2019-01-01 08:02:05.1 |
1 | aa | 2019-09-02 08:02:05.2 |
1 | cc | 2019-12-24 08:02:05.3 |
2 | dd | 2013-01-22 08:02:05.4 |
I need to add two columns, start_date and end_date, something like this:
key| col1 | timestamp             | start_date            | end_date              |
-----------------------------------------------------------------------------------
1  | aa   | 2019-01-01 08:02:05.1 | 2019-01-01 08:02:05.1 | 2019-09-02 08:02:05.2 |
1  | aa   | 2019-09-02 08:02:05.2 | 2019-09-02 08:02:05.2 | 2019-12-24 08:02:05.3 |
1  | cc   | 2019-12-24 08:02:05.3 | 2019-12-24 08:02:05.3 | NULL                  |
2  | dd   | 2013-01-22 08:02:05.4 | 2013-01-22 08:02:05.4 | NULL                  |
Here, for each key, end_date is the next timestamp for the same key. However, end_date for the latest timestamp should be NULL.
What I tried so far:
I tried to use a window function to calculate a rank for each partition, something like this:
var df = read_csv()
//copy timestamp to start_date
df = df
.withColumn("start_date", df.col("timestamp"))
//add null value to the end_date
df = df.withColumn("end_date", typedLit[Option[String]](None))
val windowSpec = Window.partitionBy("merge_key_column").orderBy("start_date")
df
.withColumn("rank", dense_rank()
.over(windowSpec))
.withColumn("max", max("rank").over(Window.partitionBy("merge_key_column")))
So far, I haven't got the desired output.
Use the window lead function for this case.
Example:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
import spark.implicits._

val df = Seq(
  (1, "aa", "2019-01-01 08:02:05.1"),
  (1, "aa", "2019-09-02 08:02:05.2"),
  (1, "cc", "2019-12-24 08:02:05.3"),
  (2, "dd", "2013-01-22 08:02:05.4")
).toDF("key", "col1", "timestamp")

val df1 = df.withColumn("start_date", col("timestamp"))
val windowSpec = Window.partitionBy("key").orderBy("start_date")
df1.withColumn("end_date",lead(col("start_date"),1).over(windowSpec)).show(10,false)
//+---+----+---------------------+---------------------+---------------------+
//|key|col1|timestamp |start_date |end_date |
//+---+----+---------------------+---------------------+---------------------+
//|1 |aa |2019-01-01 08:02:05.1|2019-01-01 08:02:05.1|2019-09-02 08:02:05.2|
//|1 |aa |2019-09-02 08:02:05.2|2019-09-02 08:02:05.2|2019-12-24 08:02:05.3|
//|1 |cc |2019-12-24 08:02:05.3|2019-12-24 08:02:05.3|null |
//|2 |dd |2013-01-22 08:02:05.4|2013-01-22 08:02:05.4|null |
//+---+----+---------------------+---------------------+---------------------+