Elegantly merging rows on Spark, based on multiple conditions - scala

Heyo StackOverflow,
Currently trying to find an elegant way to do a specific transformation.
So I have a dataframe of actions, that looks like this:
+---------+----------+----------+---------+
|timestamp|   user_id|    action|    value|
+---------+----------+----------+---------+
|      100|         1|     click|     null|
|      101|         2|     click|     null|
|      103|         1|      drag|      AAA|
|      101|         1|     click|     null|
|      108|         1|     click|     null|
|      100|         2|     click|     null|
|      106|         1|      drag|      BBB|
+---------+----------+----------+---------+
Context:
Users can perform actions: clicks and drags. Clicks don't have a value, drags do. A drag implies there was a click but not the other way around. Let's also assume that the drag event can be recorded after or before the click event.
So for each drag, I have a corresponding click action. What I would like to do is merge the drag and click actions into one, i.e. delete the drag action after giving its value to the click action.
To know which click corresponds to which drag, I take the click whose timestamp is closest to the drag's timestamp. I also want to make sure that a drag cannot be linked to a click if their timestamp difference is over 5 (which means some drags might not be linked, and that's fine). Of course, I don't want a drag of user 1 to correspond to a click of user 2.
Here, the result would look like this:
+---------+----------+----------+---------+
|timestamp|   user_id|    action|    value|
+---------+----------+----------+---------+
|      100|         1|     click|     null|
|      101|         2|     click|     null|
|      101|         1|     click|      AAA|
|      108|         1|     click|      BBB|
|      100|         2|     click|     null|
+---------+----------+----------+---------+
The drag with AAA (timestamp = 103) was linked to the click that happened at 101 because it's the closest to 103. Same logic for BBB.
So I would like to perform these operations, in a smooth/efficient way. So far, I have something like this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, lead, when}

val window = Window.partitionBy($"user_id").orderBy($"timestamp".asc)

myDF
  .withColumn("previous_value", lag("value", 1, null).over(window))
  .withColumn("previous_timestamp", lag("timestamp", 1, null).over(window))
  .withColumn("next_value", lead("value", 1, null).over(window))
  .withColumn("next_timestamp", lead("timestamp", 1, null).over(window))
  .withColumn("value",
    when(
      $"previous_value".isNotNull and
      // If there is more than 5 sec. difference, it shouldn't be joined
      $"timestamp" - $"previous_timestamp" < 5 and
      (
        $"next_timestamp".isNull or
        $"next_timestamp" - $"timestamp" > $"timestamp" - $"previous_timestamp"
      ), $"previous_value")
    .otherwise(
      when($"next_timestamp" - $"timestamp" < 5, $"next_value")
        .otherwise(null)
    ))
  .filter($"action" === "click")
  .drop("previous_value", "previous_timestamp", "next_value", "next_timestamp")
But I feel this is rather inefficient. Is there a better way to do this? (Something that can be done without having to create 4 temporary columns...)
Is there a way to manipulate both the row with offset -1 and the row with offset +1 in the same expression, for example?
Thanks in advance!

Here's my attempt using Spark-SQL rather than DataFrame APIs, but it should be possible to convert:
myDF.createOrReplaceTempView("mydf")
spark.sql("""
with
clicks_table as (select * from mydf where action='click')
,drags_table  as (select * from mydf where action='drag' )
,one_click_many_drags as (
    select
          c.timestamp as c_timestamp
        , d.timestamp as d_timestamp
        , c.user_id   as c_user_id
        , d.user_id   as d_user_id
        , c.action    as c_action
        , d.action    as d_action
        , c.value     as c_value
        , d.value     as d_value
    from clicks_table c
    inner join drags_table d
        on  c.user_id = d.user_id
        and abs(c.timestamp - d.timestamp) <= 5 --a drag cannot be linked to a click if their timestamp difference is over 5
)
,one_click_one_drag as (
    select c_timestamp as timestamp, c_user_id as user_id, c_action as action, d_value as value
    from (
        select *, row_number() over (
            partition by d_user_id, d_timestamp --for each drag timestamp with multiple possible click timestamps, we rank the click timestamps by nearness
            order by
                  abs(c_timestamp - d_timestamp) asc --prefer nearest
                , c_timestamp asc --prefer next_value if tied
        ) as rn
        from one_click_many_drags
    )
    where rn=1 --take only the best match for each drag timestamp
)
--now we start from the clicks_table and add in the desired drag values!
select c.timestamp, c.user_id, c.action, m.value
from clicks_table c
left join one_click_one_drag m
    on  c.user_id   = m.user_id
    and c.timestamp = m.timestamp
""")
Tested to produce your desired output.
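In case you prefer to stay in the DataFrame API, here is a rough, untested sketch of the same join-then-rank idea (assuming the column names from the question, and spark.implicits._ in scope for the $ syntax):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val clicks = myDF.filter($"action" === "click")
val drags  = myDF.filter($"action" === "drag")

// One row per (click, drag) candidate pair of the same user within the 5-unit tolerance.
val candidates = clicks.as("c").join(drags.as("d"),
  $"c.user_id" === $"d.user_id" &&
  abs($"c.timestamp" - $"d.timestamp") <= 5)

// For each drag, keep only the nearest click (same ordering as the SQL above).
val w = Window
  .partitionBy($"d.user_id", $"d.timestamp")
  .orderBy(abs($"c.timestamp" - $"d.timestamp").asc, $"c.timestamp".asc)

val matched = candidates
  .withColumn("rn", row_number().over(w))
  .filter($"rn" === 1)
  .select(
    $"c.timestamp".as("timestamp"),
    $"c.user_id".as("user_id"),
    $"d.value".as("drag_value"))

// Attach the drag values back onto the clicks.
val result = clicks
  .join(matched, Seq("timestamp", "user_id"), "left")
  .select($"timestamp", $"user_id", $"action", $"drag_value".as("value"))
The idea is identical: pair every click with every drag of the same user within the tolerance, rank the candidates per drag by distance, keep the best one, and left-join the winners back onto the clicks.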

Related

How to calculate total revenue within 3 months in postgresql

Thanks in advance for any help.
I have a table with unique tickets, customer IDs and ticket prices. For each ticket, I want to see the number of tickets and the total revenue from that customer in the 3 months after the date of the ticket.
I tried to use partition-by window functions with the date condition set in the on clause of a join, but it just evaluates all of the customer's tickets rather than the 3-month period I want.
select distinct on (at2.ticket_number)
       at2.customer_id
      ,at2.ticket_id
      ,at2.ticket_number
      ,at2.initial_sale_date
      ,ata.tix "a_tix"
      ,ata.aov "a_aov"
      ,ata.rev "a_rev"
from reports.agg_tickets at2
left join (
      select at2.customer_id, at2.final_fare_value, at2.initial_sale_date,
             count(at2.customer_id)    over (partition by at2.customer_id) as tix,
             avg(at2.final_fare_value) over (partition by at2.customer_id) as aov,
             sum(at2.final_fare_value) over (partition by at2.customer_id) as rev
      from reports.agg_tickets at2
      ) ata
  on (ata.customer_id = at2.customer_id
  and ata.initial_sale_date > at2.initial_sale_date
  and ata.initial_sale_date < at2.initial_sale_date + interval '3 months')
I could use a left join lateral, but it takes far too long. I'm slightly confused about how to achieve what I want, so any help would be greatly appreciated.
Many thanks
Edit:
Here is the sample of data. Picture of data table.
The table is unique on ticket number, but not on customer.
There is no need to use a join at all; as you observed, that yields problematic performance.
What you need is a plain window function with a frame clause that considers the next 3 months for each ticket.
Example (self-explanatory):
count(*) over (partition by customer_id order by initial_sale_date
range between current row and '3 months' following) ticket_cnt
Here is a full query with simplified sample data and the result:
with dt as (
select * from (values
(1, 1, date'2020-01-01', 10),
(1, 2, date'2020-02-01', 15),
(1, 3, date'2020-03-01', 20),
(1, 4, date'2020-04-01', 25),
(1, 5, date'2020-05-01', 30),
(2, 6, date'2020-01-01', 15),
(2, 7, date'2020-02-01', 20),
(2, 7, date'2021-01-01', 25)
) tab (customer_id, ticket_id, initial_sale_date,final_fare_value)
)
select
customer_id, ticket_id, initial_sale_date, final_fare_value,
count(*) over (partition by customer_id order by initial_sale_date range between current row and '3 months' following) ticket_cnt,
sum(final_fare_value) over (partition by customer_id order by initial_sale_date range between current row and '3 months' following) ticket_sum
from dt;
customer_id|ticket_id|initial_sale_date|final_fare_value|ticket_cnt|ticket_sum|
-----------+---------+-----------------+----------------+----------+----------+
1| 1| 2020-01-01| 10| 4| 70|
1| 2| 2020-02-01| 15| 4| 90|
1| 3| 2020-03-01| 20| 3| 75|
1| 4| 2020-04-01| 25| 2| 55|
1| 5| 2020-05-01| 30| 1| 30|
2| 6| 2020-01-01| 15| 2| 35|
2| 7| 2020-02-01| 20| 1| 20|
2| 7| 2021-01-01| 25| 1| 25|

Spark Dataframe Combine 2 Columns into Single Column, with Additional Identifying Column

I'm trying to split and then combine 2 DataFrame columns into 1, with another column identifying which column it originated from. Here is the code to generate the sample DF:
import spark.implicits._ // needed for .toDF on the RDD

val data = Seq(("1", "in1,in2,in3", null), ("2", "in4,in5", "ex1,ex2,ex3"), ("3", null, "ex4,ex5"), ("4", null, null))
val df = spark.sparkContext.parallelize(data).toDF("id", "include", "exclude")
This is the sample DF
+---+-----------+-----------+
| id| include| exclude|
+---+-----------+-----------+
| 1|in1,in2,in3| null|
| 2| in4,in5|ex1,ex2,ex3|
| 3| null| ex4,ex5|
| 4| null| null|
+---+-----------+-----------+
which I'm trying to transform into
+---+----+---+
| id|type|col|
+---+----+---+
| 1|incl|in1|
| 1|incl|in2|
| 1|incl|in3|
| 2|incl|in4|
| 2|incl|in5|
| 2|excl|ex1|
| 2|excl|ex2|
| 2|excl|ex3|
| 3|excl|ex4|
| 3|excl|ex5|
+---+----+---+
EDIT: Should mention that the data inside each of the cells in the example DF is just for visualization, and doesn't need to have the form in1,ex1, etc.
I can get it to work with union, as so:
df.select($"id", lit("incl").as("type"), explode(split(col("include"), ",")))
.union(
df.select($"id", lit("excl").as("type"), explode(split(col("exclude"), ",")))
)
but I was wondering if this was possible to do without using union.
The approach I am thinking of is to combine both the include and exclude columns, then apply the explode function, keep only the non-empty values, and finish with a case statement.
This might be a long process. In Spark SQL it would look roughly like this (assuming the DataFrame is registered as a temp view named input_table):
with cte as (
    select id, concat_ws(',', include, exclude) as outputcol from input_table
),
ctes as (
    select id, explode(split(outputcol, ',')) as finalcol from cte
)
select id,
       case when finalcol like 'in%' then 'incl' else 'excl' end as type,
       finalcol as col
from ctes
where finalcol is not null and finalcol != ''
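If you would rather avoid both the prefix check and the union, one option is the stack generator. Here is a rough, untested sketch in the DataFrame API, assuming the id/include/exclude columns from the question (and spark.implicits._ in scope for the $ syntax):
import org.apache.spark.sql.functions.{explode, split}

// stack(2, ...) unpivots the two columns into (type, raw) rows in a single pass;
// rows whose raw CSV string is null are dropped before exploding.
val result = df
  .selectExpr("id", "stack(2, 'incl', include, 'excl', exclude) as (type, raw)")
  .where($"raw".isNotNull)
  .select($"id", $"type", explode(split($"raw", ",")).as("col"))

result.show()
This stays a single pass over the DataFrame, at the cost of expressing the unpivot as a SQL expression.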

Spark aggregation with window functions

I have a spark df which I need to use to identify the last active record for each primary key based on a snapshot date. An example of what I have is:
+---+---+---+----------+
|  A|  B|  C|      Snap|
+---+---+---+----------+
|  1|  2|  3|2019-12-29|
|  1|  2|  4|2019-12-31|
+---+---+---+----------+
where the primary key is formed by fields A and B. I need to create a new field to indicate which record is active (the last snap for each set of rows with the same PK). So I need something like this:
+---+---+---+----------+--------+
|  A|  B|  C|      Snap|activity|
+---+---+---+----------+--------+
|  1|  2|  3|2019-12-29|   false|
|  1|  2|  4|2019-12-31|    true|
+---+---+---+----------+--------+
I have done this by creating an auxiliary df and then joining it with the first one to bring back the active indicator, but my original df is very big and I need something better in terms of performance. I have been thinking about window functions but I don't know how I can implement it.
Once I have this, I need to create a new field with the end date of the record: it should be filled only when the activity field is false, by subtracting 1 day from the snap date of the latest record for each set of rows with the same PK. I would need something like this:
+---+---+---+----------+--------+----------+
|  A|  B|  C|      Snap|activity|       end|
+---+---+---+----------+--------+----------+
|  1|  2|  3|2019-12-29|   false|2019-12-30|
|  1|  2|  4|2019-12-31|    true|      null|
+---+---+---+----------+--------+----------+
You can use row_number() ordered by Snap in descending order; the first row is the last active snap:
df.selectExpr(
'*',
'row_number() over (partition by A, B order by Snap desc) = 1 as activity'
).show()
+---+---+---+----------+--------+
|  A|  B|  C|      Snap|activity|
+---+---+---+----------+--------+
|  1|  2|  4|2019-12-31|    true|
|  1|  2|  3|2019-12-29|   false|
+---+---+---+----------+--------+
Edit: to get the end date for each group, use max window function on Snap:
import pyspark.sql.functions as f
df.withColumn(
'activity',
f.expr('row_number() over (partition by A, B order by Snap desc) = 1')
).withColumn(
"end",
f.expr('case when activity then null else max(date_add(to_date(Snap), -1)) over (partition by A, B) end')
).show()
+---+---+---+----------+--------+----------+
|  A|  B|  C|      Snap|activity|       end|
+---+---+---+----------+--------+----------+
|  1|  2|  4|2019-12-31|    true|      null|
|  1|  2|  3|2019-12-29|   false|2019-12-30|
+---+---+---+----------+--------+----------+
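For readers on the Scala API, the same window logic might look roughly like this (an untested sketch, assuming a DataFrame df with the columns A, B, C and Snap shown above, and spark.implicits._ in scope for the $ syntax):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Windows over the primary key (A, B): one ordered for ranking, one unordered for the max.
val wDesc = Window.partitionBy($"A", $"B").orderBy($"Snap".desc)
val wAll  = Window.partitionBy($"A", $"B")

val result = df
  .withColumn("activity", row_number().over(wDesc) === 1)
  .withColumn("end",
    when($"activity", lit(null).cast("date"))
      .otherwise(max(date_add(to_date($"Snap"), -1)).over(wAll)))

result.show()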

concatenate a column value for several rows based on condition

I have a table which has format like this (id is the pk)
id|timestamps |year|month|day|groups_ids|status |SCHEDULED |uid|
--|-------------------|----|-----|---|----------|-------|-------------------|---|
1|2021-02-04 17:18:24|2020| 8| 9| 1|OK |2020-08-09 00:00:00| 1|
2|2021-02-04 17:18:09|2020| 9| 9| 1|OK |2020-09-09 00:00:00| 1|
3|2021-02-04 17:19:51|2020| 10| 9| 1|HOLD |2020-10-09 00:00:00| 1|
4|2021-02-04 17:19:04|2020| 10| 10| 2|HOLD |2020-10-09 00:00:00| 1|
5|2021-02-04 17:18:30|2020| 10| 11| 2|HOLD |2020-10-09 00:00:00| 1|
6|2021-02-04 17:18:57|2020| 10| 12| 2|OK |2020-10-09 00:00:00| 1|
7|2021-02-04 17:18:24|2020| 8| 9| 1|HOLD |2020-08-09 00:00:00| 2|
8|2021-02-04 17:18:09|2020| 9| 9| 2|HOLD |2020-09-09 00:00:00| 2|
9|2021-02-04 17:19:51|2020| 10| 9| 2|HOLD |2020-10-09 00:00:00| 2|
10|2021-02-04 17:19:04|2020| 10| 10| 2|HOLD |2020-10-09 00:00:00| 2|
11|2021-02-04 17:18:30|2020| 10| 11| 2|HOLD |2020-10-09 00:00:00| 2|
12|2021-02-04 17:18:57|2020| 10| 12| 2|HOLD |2020-10-09 00:00:00| 2|
The job: for each uid I want to extract every groups_ids where the status is OK, ordered by SCHEDULED ascending; if no OK is found in the records for that uid, take the latest HOLD based on year, month and day. After that I want to compute a weighted score from the groups_ids:
group_ids > score
1 > 100
2 > 80
3 > 60
4 > 50
5 > 10
6 > 50
7 > 0
So, for example, [1,1,2] becomes (100+100+80) = 280.
It will look like this:
ids|uid|pattern|score|
---|---|-------|-----|
1| 1|[1,1,2]| 280|
2| 2|[2] | 80|
It's pretty hard since I cannot find anything like Python's for loops and append operations in PostgreSQL.
step-by-step demo:db<>fiddle
SELECT
    s.uid, s.values,
    sum(v.value) as score
FROM (
    SELECT DISTINCT ON (uid)
        uid,
        CASE
            WHEN cardinality(ok_count) > 0 THEN ok_count
            ELSE ARRAY[last_value]
        END as values
    FROM (
        SELECT
            *,
            ARRAY_AGG(groups_ids) FILTER (WHERE status = 'OK') OVER (PARTITION BY uid ORDER BY scheduled) as ok_count,
            first_value(groups_ids) OVER (PARTITION BY uid ORDER BY year, month DESC) as last_value
        FROM mytable
    ) s
    ORDER BY uid, scheduled DESC
) s,
unnest(values) as u_group_id
JOIN (VALUES
    (1, 100), (2, 80), (3, 60), (4, 50), (5, 10), (6, 50), (7, 0)
) v(group_id, value) ON v.group_id = u_group_id
GROUP BY s.uid, s.values
Phew... quite complex. Let's have a look at the steps:
a)
SELECT
    *,
    -- 1:
    ARRAY_AGG(groups_ids) FILTER (WHERE status = 'OK') OVER (PARTITION BY uid ORDER BY scheduled) as ok_count,
    -- 2:
    first_value(groups_ids) OVER (PARTITION BY uid ORDER BY year, month DESC) as last_value
FROM mytable
Using the array_agg() window function creates an array of group_ids without losing the other data, as we would with a simple GROUP BY. The FILTER clause puts only the status = 'OK' records into the array.
Find the last group_id of a group (partition) using the first_value() window function; in descending order it returns the last value.
b)
SELECT DISTINCT ON (uid)  -- 2
    uid,
    CASE                  -- 1
        WHEN cardinality(ok_count) > 0 THEN ok_count
        ELSE ARRAY[last_value]
    END as values
FROM (
    ...
) s
ORDER BY uid, scheduled DESC  -- 2
The CASE clause either takes the previously created array (from step a1) or, if there is none, takes the last value (from step a2) and wraps it in a one-element array.
The DISTINCT ON clause returns only the first element of an ordered group. The group is your uid and the order is given by the column scheduled. Since you don't want the first but the last record within the group, you have to order it DESC so that the most recent one becomes the topmost record, which DISTINCT ON then takes.
c)
SELECT
    uid,
    u_group_id
FROM (
    ...
) s,
unnest(values) as u_group_id  -- 1
The arrays should be extracted into one record per element. That helps to join the weighted values later.
d)
SELECT
    s.uid, s.values,
    sum(v.weighted_value) as score  -- 2
FROM (
    ...
) s,
unnest(values) as u_group_id
JOIN (VALUES
    (1, 100), (2, 80), ...
) v(group_id, weighted_value) ON v.group_id = u_group_id  -- 1
GROUP BY s.uid, s.values  -- 2
Join your weighted values onto the array elements. Naturally, this can be a table, a query, or whatever.
Regroup by uid and values to calculate the SUM() of the weighted values.
Additional note:
You should avoid storing duplicate data. You don't need to store the date parts year, month and day if you also store the complete date; you can always derive them from the date.

Pyspark forward and backward fill within column level

I am trying to fill missing data in a PySpark dataframe. The dataframe looks like this:
+---------+---------+-------------------+----+
| latitude|longitude| timestamplast|name|
+---------+---------+-------------------+----+
| | 4.905615|2019-08-01 00:00:00| 1|
|51.819645| |2019-08-01 00:00:00| 1|
| 51.81964| 4.961713|2019-08-01 00:00:00| 2|
| | |2019-08-01 00:00:00| 3|
| 51.82918| 4.911187| | 3|
| 51.82385| 4.901488|2019-08-01 00:00:03| 5|
+---------+---------+-------------------+----+
Grouped by the column "name", I want to either forward fill or backward fill (whichever is necessary) only "latitude" and "longitude" ("timestamplast" should not be filled). How do I do this?
Output will be:
+---------+---------+-------------------+----+
| latitude|longitude| timestamplast|name|
+---------+---------+-------------------+----+
|51.819645| 4.905615|2019-08-01 00:00:00| 1|
|51.819645| 4.905615|2019-08-01 00:00:00| 1|
| 51.81964| 4.961713|2019-08-01 00:00:00| 2|
| 51.82918| 4.911187|2019-08-01 00:00:00| 3|
| 51.82918| 4.911187| | 3|
| 51.82385| 4.901488|2019-08-01 00:00:03| 5|
+---------+---------+-------------------+----+
In Pandas this would be done as such:
df = df.groupby("name")['longitude','latitude'].apply(lambda x : x.ffill().bfill())
How would this be done in Pyspark?
I suggest you use the following two Window Specs:
from pyspark.sql import Window
w1 = Window.partitionBy('name').orderBy('timestamplast')
w2 = w1.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
Where:
w1 is the regular WindowSpec we use to calculate the forward fill, which is the same as the following:
w1 = Window.partitionBy('name').orderBy('timestamplast').rowsBetween(Window.unboundedPreceding,0)
see the following note from the documentation for default window frames:
Note: When ordering is not defined, an unbounded window frame (rowFrame, unboundedPreceding, unboundedFollowing) is used by default. When ordering is defined, a growing window frame (rangeFrame, unboundedPreceding, currentRow) is used by default.
After ffill, we only need to fix the null values at the very front, if any exist, so we can use a fixed Window frame (between Window.unboundedPreceding and Window.unboundedFollowing). This is more efficient than using a running Window frame since it requires only one aggregate; see SPARK-8638.
Then the x.ffill().bfill() can be handled by using coalesce + last + first based on the above two WindowSpecs:
from pyspark.sql.functions import coalesce, last, first
df.withColumn('latitude_new', coalesce(last('latitude',True).over(w1), first('latitude',True).over(w2))) \
.select('name','timestamplast', 'latitude','latitude_new') \
.show()
+----+-------------------+---------+------------+
|name| timestamplast| latitude|latitude_new|
+----+-------------------+---------+------------+
| 1|2019-08-01 00:00:00| null| 51.819645|
| 1|2019-08-01 00:00:01| null| 51.819645|
| 1|2019-08-01 00:00:02|51.819645| 51.819645|
| 1|2019-08-01 00:00:03| 51.81964| 51.81964|
| 1|2019-08-01 00:00:04| null| 51.81964|
| 1|2019-08-01 00:00:05| null| 51.81964|
| 1|2019-08-01 00:00:06| null| 51.81964|
| 1|2019-08-01 00:00:07| 51.82385| 51.82385|
+----+-------------------+---------+------------+
Edit: to process (ffill+bfill) on multiple columns, use a list comprehension:
cols = ['latitude', 'longitude']
df_new = df.select([ c for c in df.columns if c not in cols ] + [ coalesce(last(c,True).over(w1), first(c,True).over(w2)).alias(c) for c in cols ])
I got a working solution for either forward or backward fill of one target column, "longitude". I guess I could repeat the procedure for "latitude" as well, and then again for the backward fill. Is there a more efficient way?
import sys

from pyspark.sql import Window
from pyspark.sql.functions import first, last

window = Window.partitionBy('name')\
               .orderBy('timestamplast')\
               .rowsBetween(-sys.maxsize, 0)  # this is for forward fill
               # .rowsBetween(0, sys.maxsize) # this is for backward fill

# define the forward-filled column
filled_column = last(df['longitude'], ignorenulls=True).over(window)  # this is for forward fill
# filled_column = first(df['longitude'], ignorenulls=True).over(window)  # this is for backward fill

df = df.withColumn('mmsi_filled', filled_column)  # do the fill