Fix overlapping dates in Pyspark

I have a dataset that looks like this with overlapping dates in Date1 and Date2
+------+------+------+------+-------+------------+------------+
| Key1 | Key2 | Key3 | Key4 | Value | Date1      | Date2      |
+------+------+------+------+-------+------------+------------+
| k1   | k2   | k3   | k4   | 10    | 2022-01-01 | 2026-01-30 |
| k1   | k2   | k3   | k4   | 12    | 2022-06-05 | 2026-01-10 |
| k1   | k2   | k3   | k4   | 14    | 2022-08-07 | 2026-01-15 |
+------+------+------+------+-------+------------+------------+
I want to fix the overlaps and make the dates continuous like this below -
+------+------+------+------+-------+------------+------------+
| Key1 | Key2 | Key3 | Key4 | Value | Date1      | Date2      |
+------+------+------+------+-------+------------+------------+
| k1   | k2   | k3   | k4   | 10    | 2022-01-01 | 2022-06-04 |
| k1   | k2   | k3   | k4   | 12    | 2022-06-05 | 2022-08-06 |
| k1   | k2   | k3   | k4   | 14    | 2022-08-07 | 2026-01-15 |
+------+------+------+------+-------+------------+------------+
That is, new_date2 = the next record's date1 - 1 day.

You can use the lead window function: take the next row's Date1 (ordered by Date1) and subtract one day; when there is no next row, keep the existing Date2. If the data contains more than one combination of the key columns, add a partition by Key1, Key2, Key3, Key4 clause inside the over(...).
from pyspark.sql import functions as F

df = df.withColumn('date2', F.expr('nvl(date_sub(lead(date1) over (order by date1), 1), date2)'))
df.show(truncate=False)
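For a quick end-to-end check, here is a minimal self-contained sketch; the DataFrame construction and the partition by clause are illustrative additions, not part of the original answer:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative reconstruction of the sample data from the question.
df = spark.createDataFrame(
    [
        ("k1", "k2", "k3", "k4", 10, "2022-01-01", "2026-01-30"),
        ("k1", "k2", "k3", "k4", 12, "2022-06-05", "2026-01-10"),
        ("k1", "k2", "k3", "k4", 14, "2022-08-07", "2026-01-15"),
    ],
    ["Key1", "Key2", "Key3", "Key4", "Value", "Date1", "Date2"],
).withColumn("Date1", F.to_date("Date1")).withColumn("Date2", F.to_date("Date2"))

# Next row's Date1 minus one day; the last row in each key group keeps its original Date2.
df = df.withColumn(
    "Date2",
    F.expr("nvl(date_sub(lead(Date1) over (partition by Key1, Key2, Key3, Key4 order by Date1), 1), Date2)"),
)
df.show(truncate=False)
This should print the three continuous ranges shown in the expected output above.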

Related

Fix date overlap in pyspark

I have a dataset like this below:
+------+------+-------+------------+------------+-----------------+
| key1 | key2 | price | date_start | date_end | sequence_number |
+------+------+-------+------------+------------+-----------------+
| a | b | 10 | 2022-01-03 | 2022-01-05 | 1 |
| a | b | 10 | 2022-01-02 | 2050-05-15 | 2 |
| a | b | 10 | 2022-02-02 | 2022-05-10 | 3 |
| a | b | 20 | 2024-02-01 | 2050-10-10 | 4 |
| a | b | 20 | 2024-04-01 | 2025-09-10 | 5 |
| a | b | 10 | 2022-04-02 | 2024-09-10 | 6 |
| a | b | 20 | 2024-09-11 | 2050-10-10 | 7 |
+------+------+-------+------------+------------+-----------------+
What I want to achieve is this:
+------+------+-------+------------+------------+
| key1 | key2 | price | date_start | date_end |
+------+------+-------+------------+------------+
| a | b | 10 | 2022-01-02 | 2024-09-10 |
| a | b | 20 | 2024-09-11 | 2050-10-10 |
+------+------+-------+------------+------------+
The sequence number is the order in which the data rows were received.
The resulting dataset should fix the overlapping dates for each price, but it should also account for the fact that when a new price arrives for the same key columns, the older record's date_end is updated to the new record's date_start - 1.
After the first three sequence numbers, the output looked like this:
+------+------+-------+------------+------------+
| key1 | key2 | price | date_start | date_end |
+------+------+-------+------------+------------+
| a | b | 10 | 2022-01-02 | 2050-05-15 |
+------+------+-------+------------+------------+
This covers the max range for the price.
After the 4th sequence number:
+------+------+-------+------------+------------+
| key1 | key2 | price | date_start | date_end |
+------+------+-------+------------+------------+
| a | b | 10 | 2022-01-02 | 2024-01-31 |
| a | b | 20 | 2024-02-01 | 2050-10-10 |
+------+------+-------+------------+------------+
After the 5th sequence number:
+------+------+-------+------------+------------+
| key1 | key2 | price | date_start | date_end |
+------+------+-------+------------+------------+
| a | b | 10 | 2022-01-02 | 2024-01-31 |
| a | b | 20 | 2024-02-01 | 2050-10-10 |
+------+------+-------+------------+------------+
No changes, as the new date range overlaps the existing one.
After the 6th sequence number:
+------+------+-------+------------+------------+
| key1 | key2 | price | date_start | date_end |
+------+------+-------+------------+------------+
| a | b | 10 | 2022-01-02 | 2024-09-10 |
| a | b | 20 | 2024-09-11 | 2050-10-10 |
+------+------+-------+------------+------------+
Both date_start and date_end are updated.
After the 7th sequence number:
+------+------+-------+------------+------------+
| key1 | key2 | price | date_start | date_end |
+------+------+-------+------------+------------+
| a | b | 10 | 2022-01-02 | 2024-09-10 |
| a | b | 20 | 2024-09-11 | 2050-10-10 |
+------+------+-------+------------+------------+
No changes.
So here's an answer. First I expand the date ranges into one row per day. Then I group by with struct & collect_list to capture the price and sequence_number together. I take the first item of the (sorted, then reversed) array returned from collect_list, which effectively gives me the max of the sequence number, and pull the price back out of it. From there I group the days by key and price and take the min/max.
from pyspark.sql.functions import (
    array_sort, col, collect_list, explode, expr, max, min, reverse, struct
)

data = [
    ("a", "b", 10, "2022-01-03", "2022-01-05", 1),
    ("a", "b", 10, "2022-01-02", "2050-05-15", 2),
    ("a", "b", 10, "2022-02-02", "2022-05-10", 3),
    ("a", "b", 20, "2024-02-01", "2050-10-10", 4),
    ("a", "b", 20, "2024-04-01", "2025-09-10", 5),
    ("a", "b", 10, "2022-04-02", "2024-09-10", 6),
    ("a", "b", 20, "2024-09-11", "2050-10-10", 7),
]
columns = ["key1", "key2", "price", "date_start", "date_end", "sequence_number"]
df = spark.createDataFrame(data).toDF(*columns)

# expand every date range into one row per day
df_sequence = df.select(
    "*",
    explode(expr("sequence(to_date(date_start), to_date(date_end), interval 1 day)")).alias("day"),
)

# for each day, keep the price of the highest sequence number that covers it
df_date_range = df_sequence.groupby(
    col("key1"),
    col("key2"),
    col("day"),
).agg(
    # here's the magic:
    reverse(                      # descending sort, effectively on sequence number
        array_sort(               # ascending sort of the array by the struct's first field, then second
            collect_list(         # collect all elements of the group
                struct(           # capture the sequence number and price together
                    col("sequence_number").alias("sequence"),
                    col("price").alias("price"),
                )
            )
        )
    )[0].alias("mystruct")        # take the first element to get the "max"
).select(
    col("key1"),
    col("key2"),
    col("day"),
    col("mystruct.price"),        # pull the price back out of the struct
)

# collapse the per-day rows back into continuous ranges per price
df_date_range.groupby(col("key1"), col("key2"), col("price")).agg(
    min("day").alias("date_start"),
    max("day").alias("date_end"),
).show()
+----+----+-----+----------+----------+
|key1|key2|price|date_start|  date_end|
+----+----+-----+----------+----------+
|   a|   b|   10|2022-01-02|2024-09-10|
|   a|   b|   20|2024-09-11|2050-10-10|
+----+----+-----+----------+----------+
This does assume you will only ever use a price once before changing it. If you needed to go back to a previous price, you would have to use window logic to identify the different price ranges and add that to your collect_list as an extra factor to sort on.
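For what it's worth, here is a rough sketch (not part of the original answer) of what such window logic could look like, assuming the per-day df_date_range built above (one row per key1/key2/day with its resolved price): flag the days where the price changes, turn a running count of those flags into a range id, and aggregate per range instead of per price.
from pyspark.sql import Window
import pyspark.sql.functions as F

# Hypothetical sketch: split a price that recurs later into separate contiguous ranges.
w = Window.partitionBy("key1", "key2").orderBy("day")

ranges = (
    df_date_range
    .withColumn("prev_price", F.lag("price").over(w))
    # a new range starts on the first day, or whenever the price differs from the previous day
    .withColumn("range_start",
                F.when(F.col("prev_price").isNull() | (F.col("prev_price") != F.col("price")), 1).otherwise(0))
    .withColumn("range_id", F.sum("range_start").over(w))  # running count of range starts
    .groupBy("key1", "key2", "range_id", "price")
    .agg(F.min("day").alias("date_start"), F.max("day").alias("date_end"))
    .drop("range_id")
)
ranges.show()
The range_id column keeps two non-contiguous runs of the same price apart, so each run collapses into its own date_start/date_end row.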

How to replace null values in a dataframe based on values in other dataframe?

Here's a dataframe, df1, I have
+---------+-------+---------+
| C1 | C2 | C3 |
+---------+-------+---------+
| xr | 1 | ixfg |
| we | 5 | jsfd |
| r5 | 7 | hfga |
| by | 8 | srjs |
| v4 | 4 | qwks |
| c0 | 0 | khfd |
| ba | 2 | gdbu |
| mi | 1 | pdlo |
| lp | 7 | ztpq |
+---------+-------+---------+
Here's another, df2, that I have
+----------+-------+---------+
| V1 | V2 | V3 |
+----------+-------+---------+
| Null | 6 | ixfg |
| Null | 2 | jsfd |
| Null | 2 | hfga |
| Null | 7 | qwks |
| Null | 1 | khfd |
| Null | 9 | gdbu |
+----------+-------+---------+
What I would like to have is another dataframe that
Ignores values in V2 and takes values in C2 wherever V3 and C3 match, and
Replaces V1 with values in C1 wherever V3 and C3 match.
The result should look like the following:
+----------+-------+---------+
| M1 | M2 | M3 |
+----------+-------+---------+
| xr | 1 | ixfg |
| we | 5 | jsfd |
| r5 | 7 | hfga |
| v4 | 4 | qwks |
| c0 | 0 | khfd |
| ba | 2 | gdbu |
+----------+-------+---------+
You can join the two dataframes and use coalesce to take the value with the higher priority.
Note: coalesce takes any number of columns (from highest to lowest priority, in argument order) and returns the first non-null value, so if you do want to end up with a null when the lower-priority column is null, you cannot use this function.
from pyspark.sql import functions as F

df = (df1.join(df2, on=(df1.C3 == df2.V3))
         .select(F.coalesce(df1.C1, df2.V1).alias('M1'),
                 F.coalesce(df1.C2, df2.V2).alias('M2'),   # C2 takes priority, per the requirement to ignore V2
                 F.col('C3').alias('M3')))
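As a quick end-to-end check, here is a self-contained sketch; the DataFrame construction is illustrative, rebuilt from the tables in the question:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [("xr", 1, "ixfg"), ("we", 5, "jsfd"), ("r5", 7, "hfga"),
     ("by", 8, "srjs"), ("v4", 4, "qwks"), ("c0", 0, "khfd"),
     ("ba", 2, "gdbu"), ("mi", 1, "pdlo"), ("lp", 7, "ztpq")],
    ["C1", "C2", "C3"],
)
# V1 is entirely null, so an explicit schema is needed instead of type inference.
df2 = spark.createDataFrame(
    [(None, 6, "ixfg"), (None, 2, "jsfd"), (None, 2, "hfga"),
     (None, 7, "qwks"), (None, 1, "khfd"), (None, 9, "gdbu")],
    schema="V1 string, V2 int, V3 string",
)

result = (df1.join(df2, on=(df1.C3 == df2.V3))
             .select(F.coalesce(df1.C1, df2.V1).alias("M1"),
                     F.coalesce(df1.C2, df2.V2).alias("M2"),
                     F.col("C3").alias("M3")))
result.show()
This should print the six matching rows from the expected result above; rows of df1 with no match in df2 are dropped by the inner join.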

Iterate over Dataframe & Recursive filters

I have 2 dataframes. "MinNRule" & "SampleData"
MinNRule provides rule information based on which SampleData needs to be processed:
1. Aggregate SampleData on the columns defined in MinNRule.MinimumNPopulation and MinNRule.OrderOfOperation
2. Check if Aggregate.Entity >= MinNRule.MinimumNValue
   a. All Entities that do not meet MinNRule.MinimumNValue are removed from the population
   b. All Entities that meet MinNRule.MinimumNValue are kept in the population
3. Perform 1 through 2 for the next MinNRule.OrderOfOperation using the 2.b dataset
MinNRule
| MinimumNGroupName | MinimumNPopulation | MinimumNValue | OrderOfOperation |
|:-----------------:|:------------------:|:-------------:|:----------------:|
| Group1 | People by Facility | 6 | 1 |
| Group1 | People by Project | 4 | 2 |
SampleData
| Facility | Project | PeopleID |
|:--------: |:-------: |:--------: |
| F1 | P1 | 166152 |
| F1 | P1 | 425906 |
| F1 | P1 | 332127 |
| F1 | P1 | 241630 |
| F1 | P2 | 373865 |
| F1 | P2 | 120672 |
| F1 | P2 | 369407 |
| F2 | P4 | 121705 |
| F2 | P4 | 211807 |
| F2 | P4 | 408041 |
| F2 | P4 | 415579 |
Proposed Steps:
1. Read MinNRule, read the rule with OrderOfOperation=1
   a. GroupBy Facility, Count on People
   b. Aggregate SampleData by 1.a and compare to MinimumNValue=6
| Facility | Count | MinNPass |
|:--------: |:-------: |:--------: |
| F1 | 7 | Y |
| F2 | 4 | N |
2. Select MinNPass='Y' rows and filter the initial dataframe down to those entities (F2 gets dropped)
| Facility | Project | PeopleID |
|:--------: |:-------: |:--------: |
| F1 | P1 | 166152 |
| F1 | P1 | 425906 |
| F1 | P1 | 332127 |
| F1 | P1 | 241630 |
| F1 | P2 | 373865 |
| F1 | P2 | 120672 |
| F1 | P2 | 369407 |
3. Read MinNRule, read the rule with OrderOfOperation=2
   a. GroupBy Project, Count on People
   b. Aggregate SampleData by 3.a and compare to MinimumNValue=4
| Project | Count | MinNPass |
|:--------: |:-------: |:--------: |
| P1 | 4 | Y |
| P2 | 3 | N |
4. Select MinNPass='Y' rows and filter the dataframe from 3 down to those entities (P2 gets dropped)
5. Print Final Result
| Facility | Project | PeopleID |
|:--------: |:-------: |:--------: |
| F1 | P1 | 166152 |
| F1 | P1 | 425906 |
| F1 | P1 | 332127 |
| F1 | P1 | 241630 |
Ideas:
I have been thinking of moving MinNRule to a LocalIterator and looping through it and "filtering" SampleData
I am not sure how to pass the result at the end of one loop over to another
Still learning Pyspark, unsure if this is the correct approach.
I am using Azure Databricks
IIUC, since the rules df defines the rules, it must be small and can be collected to the driver for performing the operations on the main data.
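For reference, here is a hypothetical construction of the two input DataFrames from the question's tables, so the snippets below can be run as-is; the names MinNRule and sampleData match the code that follows.
# Illustrative setup only; assumes a SparkSession named spark (available by default on Databricks).
MinNRule = spark.createDataFrame(
    [
        ("Group1", "People by Facility", 6, 1),
        ("Group1", "People by Project", 4, 2),
    ],
    ["MinimumNGroupName", "MinimumNPopulation", "MinimumNValue", "OrderOfOperation"],
)

sampleData = spark.createDataFrame(
    [
        ("F1", "P1", 166152), ("F1", "P1", 425906), ("F1", "P1", 332127), ("F1", "P1", 241630),
        ("F1", "P2", 373865), ("F1", "P2", 120672), ("F1", "P2", 369407),
        ("F2", "P4", 121705), ("F2", "P4", 211807), ("F2", "P4", 408041), ("F2", "P4", 415579),
    ],
    ["Facility", "Project", "PeopleID"],
)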
One approach to get the desired result is to collect the rules df and pass it to the reduce function:
from functools import reduce
from pyspark.sql.functions import col, count

data = MinNRule.orderBy('OrderOfOperation').collect()

dfnew = reduce(
    lambda df, rule: df.groupBy(col(rule.MinimumNPopulation.split('by')[1].strip()))
        .agg(count(col({'People': 'PeopleID'}.get(rule.MinimumNPopulation.split('by')[0].strip()))).alias('count'))
        .filter(col('count') >= rule.MinimumNValue)
        .drop('count')
        .join(df, rule.MinimumNPopulation.split('by')[1].strip(), 'inner'),
    data,
    sampleData,
)
dfnew.show()
+-------+--------+--------+
|Project|Facility|PeopleID|
+-------+--------+--------+
| P1| F1| 166152|
| P1| F1| 425906|
| P1| F1| 332127|
| P1| F1| 241630|
+-------+--------+--------+
Alternatively, you can also loop through the collected rules and get the result; the performance remains the same in both cases.
import pyspark.sql.functions as f

mapped_cols = {'People': 'PeopleID'}
data = MinNRule.orderBy('OrderOfOperation').collect()

for i in data:
    cnt, grp = i.MinimumNPopulation.split('by')
    cnt = mapped_cols.get(cnt.strip())
    grp = grp.strip()
    sampleData = sampleData.groupBy(f.col(grp)).agg(f.count(f.col(cnt)).alias('count')) \
        .filter(f.col('count') >= i.MinimumNValue).drop('count').join(sampleData, grp, 'inner')

sampleData.show()
+-------+--------+--------+
|Project|Facility|PeopleID|
+-------+--------+--------+
| P1| F1| 166152|
| P1| F1| 425906|
| P1| F1| 332127|
| P1| F1| 241630|
+-------+--------+--------+
Note: you have to manually parse your rules grammar, as it is subject to change.

DB2 Query multiple select and sum by date

I have 3 tables: ITEMS, ODETAILS, OHIST.
ITEMS - a list of products, ID is the key field
ODETAILS - line items of every order, no key field
OHIST - a view showing last year's order totals by month
ITEMS
+----+----------+
| ID | NAME     |
+----+----------+
| 10 | Widget10 |
| 11 | Widget11 |
| 12 | Widget12 |
| 13 | Widget13 |
+----+----------+

ODETAILS
+-----+---------+---------+----------+
| OID | ODUE    | ITEM_ID | ITEM_QTY |
+-----+---------+---------+----------+
| A33 | 1180503 | 10      | 100      |
| A33 | 1180504 | 11      | 215      |
| A34 | 1180505 | 10      | 500      |
| A34 | 1180504 | 11      | 320      |
| A34 | 1180504 | 12      | 450      |
| A34 | 1180505 | 13      | 125      |
+-----+---------+---------+----------+

OHIST
+---------+-------+
| ITEM_ID | M5QTY |
+---------+-------+
| 10      | 1000  |
| 11      | 1500  |
| 12      | 2251  |
| 13      | 4334  |
+---------+-------+
Assuming today is May 2, 2018 (1180502).
I want my results to show ID, NAME, M5QTY, and SUM(ITEM_QTY) grouped by day
over the next 3 days (D1, D2, D3)
Desired Result
+----+----------+-------+------+------+------+
| ID | NAME     | M5QTY | D1   | D2   | D3   |
+----+----------+-------+------+------+------+
| 10 | Widget10 | 1000  | 100  |      | 500  |
+----+----------+-------+------+------+------+
| 11 | Widget11 | 1500  |      | 535  |      |
+----+----------+-------+------+------+------+
| 12 | Widget12 | 2251  |      | 450  |      |
+----+----------+-------+------+------+------+
| 13 | Widget13 | 4334  |      |      | 125  |
+----+----------+-------+------+------+------+
This is how I convert ODUE to a date
DATE(concat(concat(concat(substr(char((ODETAILS.ODUE-1000000)+20000000),1,4),'-'), concat(substr(char((ODETAILS.ODUE-1000000)+20000000),5,2), '-')), substr(char((ODETAILS.ODUE-1000000)+20000000),7,2)))
Try this (you can add the joins you need)
SELECT ITEM_ID
, SUM(CASE WHEN ODUE = INT(CURRENT DATE) - 19000000 + 1 THEN ITEM_QTY ELSE 0 END) AS D1
, SUM(CASE WHEN ODUE = INT(CURRENT DATE) - 19000000 + 2 THEN ITEM_QTY ELSE 0 END) AS D2
, SUM(CASE WHEN ODUE = INT(CURRENT DATE) - 19000000 + 3 THEN ITEM_QTY ELSE 0 END) AS D3
FROM
ODETAILS
GROUP BY
ITEM_ID

Transposed table redshift

I want to transpose columns into rows (without using UNION):
|Dimension1 | Measure1 | Measure2 |
-----------------------------------
| 1 | x1 | y1 |
| 0 | x2 | y2 |
Into:
| Dimension1 | Measures | Values |
-----------------------------------
| 1 | Measure1 | x1 |
| 1 | Measure2 | y1 |
| 0 | Measure1 | x2 |
| 0 | Measure2 | y2 |
The number of measures is fixed.
I'm using Amazon Redshift.
You need to use UNION for that. Why don't you want to use it? There is no other way.