Fix date overlap in pyspark

I have a dataset like this below:
+------+------+-------+------------+------------+-----------------+
| key1 | key2 | price | date_start | date_end   | sequence_number |
+------+------+-------+------------+------------+-----------------+
| a    | b    | 10    | 2022-01-03 | 2022-01-05 | 1               |
| a    | b    | 10    | 2022-01-02 | 2050-05-15 | 2               |
| a    | b    | 10    | 2022-02-02 | 2022-05-10 | 3               |
| a    | b    | 20    | 2024-02-01 | 2050-10-10 | 4               |
| a    | b    | 20    | 2024-04-01 | 2025-09-10 | 5               |
| a    | b    | 10    | 2022-04-02 | 2024-09-10 | 6               |
| a    | b    | 20    | 2024-09-11 | 2050-10-10 | 7               |
+------+------+-------+------------+------------+-----------------+
What I want to achieve is this:
+------+------+-------+------------+------------+
| key1 | key2 | price | date_start | date_end   |
+------+------+-------+------------+------------+
| a    | b    | 10    | 2022-01-02 | 2024-09-10 |
| a    | b    | 20    | 2024-09-11 | 2050-10-10 |
+------+------+-------+------------+------------+
The sequence number is the order in which the data rows were received.
The resulting dataset should fix the overlapping dates for each price, but it must also account for the fact that when a new price arrives for the same key columns, the older record's date_end is updated to the new record's date_start - 1.
After the first three sequence numbers, the output looked like this:
+------+------+-------+------------+------------+
| key1 | key2 | price | date_start | date_end   |
+------+------+-------+------------+------------+
| a    | b    | 10    | 2022-01-02 | 2050-05-15 |
+------+------+-------+------------+------------+
This covers the max range for the price.
After the 4th sequence number:
+------+------+-------+------------+------------+
| key1 | key2 | price | date_start | date_end   |
+------+------+-------+------------+------------+
| a    | b    | 10    | 2022-01-02 | 2024-01-31 |
| a    | b    | 20    | 2024-02-01 | 2050-10-10 |
+------+------+-------+------------+------------+
After the 5th sequence number:
+------+------+-------+------------+------------+
| key1 | key2 | price | date_start | date_end   |
+------+------+-------+------------+------------+
| a    | b    | 10    | 2022-01-02 | 2024-01-31 |
| a    | b    | 20    | 2024-02-01 | 2050-10-10 |
+------+------+-------+------------+------------+
No changes, as the new date range is already covered by the existing range for that price.
After the 6th sequence number:
+------+------+-------+------------+------------+
| key1 | key2 | price | date_start | date_end   |
+------+------+-------+------------+------------+
| a    | b    | 10    | 2022-01-02 | 2024-09-10 |
| a    | b    | 20    | 2024-09-11 | 2050-10-10 |
+------+------+-------+------------+------------+
The price 10 record's date_end and the price 20 record's date_start are both updated.
After the 7th sequence number:
+------+------+-------+------------+------------+
| key1 | key2 | price | date_start | date_end   |
+------+------+-------+------------+------------+
| a    | b    | 10    | 2022-01-02 | 2024-09-10 |
| a    | b    | 20    | 2024-09-11 | 2050-10-10 |
+------+------+-------+------------+------------+
No changes.

So here's an answer. First I expand the dates into rows. Then I use a group by with struct & collect_list to capture the price and sequence_number together. I take the first item of the (sorted, then reversed) array returned from collect_list, which effectively gives me the entry with the max sequence number, and pull the price back out of it. From there I group the dates by min/max.
from pyspark.sql.functions import (
    min, max, explode, col, expr, reverse, array_sort, collect_list, struct
)

data = [
    ("a", "b", 10, "2022-01-03", "2022-01-05", 1),
    ("a", "b", 10, "2022-01-02", "2050-05-15", 2),
    ("a", "b", 10, "2022-02-02", "2022-05-10", 3),
    ("a", "b", 20, "2024-02-01", "2050-10-10", 4),
    ("a", "b", 20, "2024-04-01", "2025-09-10", 5),
    ("a", "b", 10, "2022-04-02", "2024-09-10", 6),
    ("a", "b", 20, "2024-09-11", "2050-10-10", 7),
]
columns = ["key1", "key2", "price", "date_start", "date_end", "sequence_number"]
df = spark.createDataFrame(data).toDF(*columns)

# expand every [date_start, date_end] range into one row per day
df_sequence = df.select(
    "*",
    explode(
        expr("sequence(to_date(date_start), to_date(date_end), interval 1 day)")
    ).alias("day"),
)

# for each (key1, key2, day), keep the price carried by the highest sequence_number:
# collect (sequence, price) structs, array_sort ascending (first by sequence, then price),
# reverse for a descending order, and take element [0]
df_date_range = df_sequence.groupby(
    col("key1"),
    col("key2"),
    col("day"),
).agg(
    reverse(
        array_sort(
            collect_list(
                struct(
                    col("sequence_number").alias("sequence"),
                    col("price").alias("price"),
                )
            )
        )
    )[0].alias("mystruct")
).select(
    col("key1"),
    col("key2"),
    col("day"),
    col("mystruct.price"),
)

# collapse the per-day rows back into min/max date ranges per price
df_date_range.groupby(col("key1"), col("key2"), col("price")).agg(
    min("day").alias("date_start"),
    max("day").alias("date_end"),
).show()
+----+----+-----+----------+----------+
|key1|key2|price|date_start|  date_end|
+----+----+-----+----------+----------+
|   a|   b|   10|2022-01-02|2024-09-10|
|   a|   b|   20|2024-09-11|2050-10-10|
+----+----+-----+----------+----------+
This does assume you will only ever use a price once before changing it. If you needed to go back to a price, you'd have to use window logic to identify the distinct price ranges and add that to your collect_list as an extra factor to sort on.
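For example, here is a rough sketch of that window logic (an assumption on my part, not part of the answer above): derive a price_group that increments every time the price changes for a key, which could then be added to the struct so repeated prices sort into separate ranges.

from pyspark.sql import Window
from pyspark.sql.functions import col, lag, sum as sum_, when

# order the rows as they were received and flag every price change per key
w = Window.partitionBy("key1", "key2").orderBy("sequence_number")
df_grouped = (
    df
    .withColumn(
        "price_changed",
        when(
            lag("price").over(w).isNull() | (lag("price").over(w) != col("price")),
            1,
        ).otherwise(0),
    )
    # running sum of the flags gives one id per contiguous run of the same price
    .withColumn("price_group", sum_("price_changed").over(w))
)
# price_group could then go into the struct() passed to collect_list above,
# so the array_sort/reverse trick can tell repeated prices apart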

Related

Postgres distinct rows whilst also summing

I have a dataset similar to this. For each client I need to pick out the sum of the quantities, together with the latest execution time and metadata (greater execution time = more recent) taken from the rows where the quantity > 0.
| Name  | Quantity | Metadata | Execution time |
| ----- | -------- | -------- | -------------- |
| Neil  | 1        | [1,3]    | 4              |
| James | 1        | [2,18]   | 5              |
| Neil  | 1        | [4, 1]   | 6              |
| Mike  | 1        | [5, 42]  | 7              |
| James | -1       | Null     | 8              |
| Neil  | -1       | Null     | 9              |
Eg the query needs to return:
| Name  | Summed Quantity | Metadata | Execution time |
| ----- | --------------- | -------- | -------------- |
| James | 0               | [2,18]   | 5              |
| Neil  | 1               | [4, 1]   | 6              |
| Mike  | 1               | [5, 42]  | 7              |
My query doesn't quite work as it's not returning the sum of the quantities correctly.
SELECT
  distinct on (name) name,
  (
    SELECT
      cast(
        sum(quantity) as int
      )
  ) as summed_quantity,
  meta,
  execution_time
FROM
  table
where
  quantity > 0
group by
  name,
  meta,
  execution_time
order by
  name,
  execution_time desc;
This query gives a result of
| Name  | Summed Quantity | Metadata | Execution time |
| ----- | --------------- | -------- | -------------- |
| James | 1               | [2,18]   | 5              |
| Neil  | 1               | [4, 1]   | 6              |
| Mike  | 1               | [5, 42]  | 7              |
i.e. it's just taking the quantity > 0 rows from the WHERE clause and not adding up the quantities in the subquery (I assume because of the DISTINCT clause). I'm unsure how to fix my query to produce the desired output.
This can be achieved using window functions (hence with a single pass of the data)
select
    name
  , sum_qty
  , metadata
  , execution_time
from (
  select
      *
    , sum(Quantity) over(partition by name) sum_qty
    , row_number() over(partition by name, case when quantity > 0 then 1 else 0 end
                        order by Execution_time DESC) as rn
  from mytable
) d
where rn = 1 and quantity > 0
order by name
result
+-------+---------+----------+----------------+
| name | sum_qty | metadata | execution_time |
+-------+---------+----------+----------------+
| James | 0 | [2,18] | 5 |
| Mike | 1 | [5,42] | 7 |
| Neil | 1 | [4,1] | 6 |
+-------+---------+----------+----------------+

PostgreSQL - Calculate SUM() of COUNT()

Basically I have a table called cities which looks like this:
+------+-----------+---------+----------+--------------+
| id   | name      | lat     | lng      | submitted_by |
|------+-----------+---------+----------+--------------|
| 1    | Pyongyang | 39.0392 | 125.7625 | 15           |
| 2    | Oslo      | 59.9139 | 10.7522  | 8            |
| 3    | Hebron    | 31.5326 | 35.0998  | 8            |
| 4    | Hebron    | 31.5326 | 35.0998  | 10           |
| 5    | Paris     | 48.8566 | 2.3522   | 12           |
| 6    | Hebron    | 31.5326 | 35.0998  | 7            |
+------+-----------+---------+----------+--------------+
Desired result:
+-----------+---------+
| name      | count   |
|-----------+---------|
| Hebron    | 3       |
| Pyongyang | 1       |
| Oslo      | 1       |
| Paris     | 1       |
| Total     | 6       | <-- The tricky part
+-----------+---------+
In other words, what I need to do is SELECT the SUM of the COUNT in the query I'm currently using:
SELECT name, count(name)::int FROM cities GROUP BY name;
But apparently nested aggregate functions are not allowed in PostgreSQL. I'm guessing I need to use ROLLUP in some way but I can't seem to get it right.
Thanks for the help.
You need to UNION ALL the total sum.
WITH ROLLUP works by summing up the total for every group separately and can't be used here.
CREATE TABLE cities (
  "id" INTEGER,
  "name" VARCHAR(9),
  "lat" FLOAT,
  "lng" FLOAT,
  "submitted_by" INTEGER
);

INSERT INTO cities
  ("id", "name", "lat", "lng", "submitted_by")
VALUES
  ('1', 'Pyongyang', '39.0392', '125.7625', '15'),
  ('2', 'Oslo', '59.9139', '10.7522', '8'),
  ('3', 'Hebron', '31.5326', '35.0998', '8'),
  ('4', 'Hebron', '31.5326', '35.0998', '10'),
  ('5', 'Paris', '48.8566', '2.3522', '12'),
  ('6', 'Hebron', '31.5326', '35.0998', '7');
SELECT name, COUNT(name)::int FROM cities GROUP BY name
UNION ALL
SELECT 'Total', COUNT(*) FROM cities
name | count
:-------- | ----:
Hebron | 3
Pyongyang | 1
Oslo | 1
Paris | 1
Total | 6

In Spark scala, how to check between adjacent rows in a dataframe

How can I check the dates from the adjacent rows (preceding and next) in a DataFrame? This should happen at the key level.
I have the following data after sorting on key, dates:
source_Df.show()
+-----+--------+------------+------------+
| key | code   | begin_dt   | end_dt     |
+-----+--------+------------+------------+
| 10  | ABC    | 2018-01-01 | 2018-01-08 |
| 10  | BAC    | 2018-01-03 | 2018-01-15 |
| 10  | CAS    | 2018-01-03 | 2018-01-21 |
| 20  | AAA    | 2017-11-12 | 2018-01-03 |
| 20  | DAS    | 2018-01-01 | 2018-01-12 |
| 20  | EDS    | 2018-02-01 | 2018-02-16 |
+-----+--------+------------+------------+
When the dates of these rows fall in a range (i.e. the current row's begin_dt falls between the begin and end dates of the previous row), I need all such rows to carry the lowest begin date and the highest end date.
Here is the output I need:
final_Df.show()
+-----+--------+------------+------------+
| key | code   | begin_dt   | end_dt     |
+-----+--------+------------+------------+
| 10  | ABC    | 2018-01-01 | 2018-01-21 |
| 10  | BAC    | 2018-01-01 | 2018-01-21 |
| 10  | CAS    | 2018-01-01 | 2018-01-21 |
| 20  | AAA    | 2017-11-12 | 2018-01-12 |
| 20  | DAS    | 2017-11-12 | 2018-01-12 |
| 20  | EDS    | 2018-02-01 | 2018-02-16 |
+-----+--------+------------+------------+
Appreciate any ideas to achieve this. Thanks in advance!
Here's one approach:
1. Create new column group_id with null value if begin_dt is within the date range from the previous row; otherwise a unique integer
2. Backfill nulls in group_id with the last non-null value
3. Compute min(begin_dt) and max(end_dt) within each (key, group_id) partition
Example below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val df = Seq(
  (10, "ABC", "2018-01-01", "2018-01-08"),
  (10, "BAC", "2018-01-03", "2018-01-15"),
  (10, "CAS", "2018-01-03", "2018-01-21"),
  (20, "AAA", "2017-11-12", "2018-01-03"),
  (20, "DAS", "2018-01-01", "2018-01-12"),
  (20, "EDS", "2018-02-01", "2018-02-16")
).toDF("key", "code", "begin_dt", "end_dt")

val win1 = Window.partitionBy($"key").orderBy($"begin_dt", $"end_dt")
val win2 = Window.partitionBy($"key", $"group_id")

df.
  withColumn("group_id", when(
      $"begin_dt".between(lag($"begin_dt", 1).over(win1), lag($"end_dt", 1).over(win1)), null
    ).otherwise(monotonically_increasing_id)
  ).
  withColumn("group_id", last($"group_id", ignoreNulls=true).
    over(win1.rowsBetween(Window.unboundedPreceding, 0))
  ).
  withColumn("begin_dt2", min($"begin_dt").over(win2)).
  withColumn("end_dt2", max($"end_dt").over(win2)).
  orderBy("key", "begin_dt", "end_dt").
  show
// +---+----+----------+----------+-------------+----------+----------+
// |key|code| begin_dt| end_dt| group_id| begin_dt2| end_dt2|
// +---+----+----------+----------+-------------+----------+----------+
// | 10| ABC|2018-01-01|2018-01-08|1047972020224|2018-01-01|2018-01-21|
// | 10| BAC|2018-01-03|2018-01-15|1047972020224|2018-01-01|2018-01-21|
// | 10| CAS|2018-01-03|2018-01-21|1047972020224|2018-01-01|2018-01-21|
// | 20| AAA|2017-11-12|2018-01-03| 455266533376|2017-11-12|2018-01-12|
// | 20| DAS|2018-01-01|2018-01-12| 455266533376|2017-11-12|2018-01-12|
// | 20| EDS|2018-02-01|2018-02-16| 455266533377|2018-02-01|2018-02-16|
// +---+----+----------+----------+-------------+----------+----------+

DB2 Query multiple select and sum by date

I have 3 tables: ITEMS, ODETAILS, OHIST.
ITEMS - a list of products, ID is the key field
ODETAILS - line items of every order, no key field
OHIST - a view showing last years order totals by month
ITEMS
+----+----------+
| ID | NAME     |
+----+----------+
| 10 | Widget10 |
| 11 | Widget11 |
| 12 | Widget12 |
| 13 | Widget13 |
+----+----------+

ODETAILS
+-----+---------+---------+----------+
| OID | ODUE    | ITEM_ID | ITEM_QTY |
+-----+---------+---------+----------+
| A33 | 1180503 | 10      | 100      |
| A33 | 1180504 | 11      | 215      |
| A34 | 1180505 | 10      | 500      |
| A34 | 1180504 | 11      | 320      |
| A34 | 1180504 | 12      | 450      |
| A34 | 1180505 | 13      | 125      |
+-----+---------+---------+----------+

OHIST
+---------+-------+
| ITEM_ID | M5QTY |
+---------+-------+
| 10      | 1000  |
| 11      | 1500  |
| 12      | 2251  |
| 13      | 4334  |
+---------+-------+
Assuming today is May 2, 2018 (1180502).
I want my results to show ID, NAME, M5QTY, and SUM(ITEM_QTY) grouped by day
over the next 3 days (D1, D2, D3)
Desired Result
+----+----------+-------+-----+-----+-----+
| ID | NAME     | M5QTY | D1  | D2  | D3  |
+----+----------+-------+-----+-----+-----+
| 10 | Widget10 | 1000  | 100 |     | 500 |
| 11 | Widget11 | 1500  |     | 535 |     |
| 12 | Widget12 | 2251  |     | 450 |     |
| 13 | Widget13 | 4334  |     |     | 125 |
+----+----------+-------+-----+-----+-----+
This is how I convert ODUE to a date
DATE(concat(concat(concat(substr(char((ODETAILS.ODUE-1000000)+20000000),1,4),'-'), concat(substr(char((ODETAILS.ODUE-1000000)+20000000),5,2), '-')), substr(char((ODETAILS.ODUE-1000000)+20000000),7,2)))
Try this (you can add the joins you need)
SELECT ITEM_ID
     , SUM(CASE WHEN ODUE = INT(CURRENT DATE) - 19000000 + 1 THEN ITEM_QTY ELSE 0 END) AS D1
     , SUM(CASE WHEN ODUE = INT(CURRENT DATE) - 19000000 + 2 THEN ITEM_QTY ELSE 0 END) AS D2
     , SUM(CASE WHEN ODUE = INT(CURRENT DATE) - 19000000 + 3 THEN ITEM_QTY ELSE 0 END) AS D3
FROM ODETAILS
GROUP BY ITEM_ID
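The joins themselves aren't spelled out in the answer, but going by the sample tables (assuming ITEMS.ID = ODETAILS.ITEM_ID = OHIST.ITEM_ID), a sketch that also pulls in NAME and M5QTY for the desired layout could look like this:

-- sketch only: reuses the answer's date arithmetic; LEFT JOINs keep items
-- that have no row in OHIST or no orders in the 3-day window
SELECT i.ID
     , i.NAME
     , h.M5QTY
     , SUM(CASE WHEN d.ODUE = INT(CURRENT DATE) - 19000000 + 1 THEN d.ITEM_QTY ELSE 0 END) AS D1
     , SUM(CASE WHEN d.ODUE = INT(CURRENT DATE) - 19000000 + 2 THEN d.ITEM_QTY ELSE 0 END) AS D2
     , SUM(CASE WHEN d.ODUE = INT(CURRENT DATE) - 19000000 + 3 THEN d.ITEM_QTY ELSE 0 END) AS D3
FROM ITEMS i
LEFT JOIN ODETAILS d ON d.ITEM_ID = i.ID
LEFT JOIN OHIST h ON h.ITEM_ID = i.ID
GROUP BY i.ID, i.NAME, h.M5QTY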

Crosstab function and Dates PostgreSQL

I had to create a crosstab table from a query where dates are turned into column names. These order-date columns can increase or decrease depending on the dates passed in the query. The order date is in Unix format, which is converted to a normal date.
Query is following:
Select cd.cust_id
     , od.order_id
     , od.order_size
     , (TIMESTAMP 'epoch' + od.order_date * INTERVAL '1 second')::Date As order_date
From consumer_details cd
   , consumer_order od
Where cd.cust_id = od.cust_id
  And od.order_date Between 1469212200 And 1469212600
Order By od.order_id, od.order_date
Table as follows:
cust_id   | order_id | order_size | order_date
----------|----------|------------|-----------
210721008 | 0437756  | 4323       | 2016-07-22
210721008 | 0437756  | 4586       | 2016-09-24
210721019 | 10749881 | 0          | 2016-07-28
210721019 | 10749881 | 0          | 2016-07-28
210721033 | 13639    | 2286145    | 2016-09-06
210721033 | 13639    | 2300040    | 2016-10-03
Result will be:
cust_id   | order_id | 2016-07-22 | 2016-09-24 | 2016-07-28 | 2016-09-06 | 2016-10-03
----------|----------|------------|------------|------------|------------|-----------
210721008 | 0437756  | 4323       | 4586       |            |            |
210721019 | 10749881 |            |            | 0          |            |
210721033 | 13639    |            |            |            | 2286145    | 2300040
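One way to pivot this (a sketch of mine, not from the original post) is tablefunc's crosstab() with an explicit category list. Note that crosstab() needs the output columns declared up front, so it only fits if the set of dates is known when the query is written, and the declared types below (bigint, text) are guesses that must be adjusted to the real column types:

CREATE EXTENSION IF NOT EXISTS tablefunc;

SELECT *
From crosstab(
  -- source query: row identifier first, then extra column(s), then category (the date) and value
  $$Select cd.cust_id
         , od.order_id
         , (TIMESTAMP 'epoch' + od.order_date * INTERVAL '1 second')::Date As order_date
         , od.order_size
      From consumer_details cd
         , consumer_order od
     Where cd.cust_id = od.cust_id
       And od.order_date Between 1469212200 And 1469212600
     Order By cd.cust_id, od.order_id$$,
  -- category query: one row per output date column, in the desired column order
  $$Values ('2016-07-22'::date), ('2016-09-24'), ('2016-07-28'),
           ('2016-09-06'), ('2016-10-03')$$
) As ct(cust_id bigint, order_id text,
        "2016-07-22" bigint, "2016-09-24" bigint, "2016-07-28" bigint,
        "2016-09-06" bigint, "2016-10-03" bigint);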