I have the table below, where streak increases while the activity_date values are consecutive and resets to 1 otherwise.
Now I need to get the min and max date of each group of streaks, using Spark with Scala or Spark SQL.
Input
floor activity_date streak
--------------------------------
floor1 2018-11-08 1
floor1 2019-01-24 1
floor1 2019-04-05 1
floor1 2019-04-08 1
floor1 2019-04-09 2
floor1 2019-04-14 1
floor1 2019-04-17 1
floor1 2019-04-20 1
floor2 2019-05-04 1
floor2 2019-05-05 2
floor2 2019-06-04 1
floor2 2019-07-28 1
floor2 2019-08-14 1
floor2 2019-08-22 1
Output
floor activity_date end_activity_date
----------------------------------------
floor1 2018-11-08 2018-11-08
floor1 2019-01-24 2019-01-24
floor1 2019-04-05 2019-04-05
floor1 2019-04-08 2019-04-09
floor1 2019-04-14 2019-04-14
floor1 2019-04-17 2019-04-17
floor1 2019-04-20 2019-04-20
floor2 2019-05-04 2019-05-05
floor2 2019-06-04 2019-06-04
floor2 2019-07-28 2019-07-28
floor2 2019-08-14 2019-08-14
floor2 2019-08-22 2019-08-22
You can use the following approach.
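For reference, here is a minimal sketch of how the sample input might be built and exposed to Spark SQL. The DataFrame and temp view name df match what the snippets below use; the SparkSession value spark and the date parsing are assumptions.

import org.apache.spark.sql.functions.to_date
import spark.implicits._   // assumes an active SparkSession named `spark` (e.g. in spark-shell)

val df = Seq(
  ("floor1", "2018-11-08", 1), ("floor1", "2019-01-24", 1), ("floor1", "2019-04-05", 1),
  ("floor1", "2019-04-08", 1), ("floor1", "2019-04-09", 2), ("floor1", "2019-04-14", 1),
  ("floor1", "2019-04-17", 1), ("floor1", "2019-04-20", 1),
  ("floor2", "2019-05-04", 1), ("floor2", "2019-05-05", 2), ("floor2", "2019-06-04", 1),
  ("floor2", "2019-07-28", 1), ("floor2", "2019-08-14", 1), ("floor2", "2019-08-22", 1)
).toDF("floor", "activity_date", "streak")
  .withColumn("activity_date", to_date($"activity_date"))

// the Spark SQL version below reads from a temp view called df
df.createOrReplaceTempView("df")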
Using Spark SQL
SELECT
    floor,
    activity_date,
    MAX(activity_date) OVER (PARTITION BY gn, floor) AS end_activity_date
FROM (
    SELECT
        *,
        -- running sum of the flag gives a group id per streak
        SUM(is_same_streak) OVER (
            PARTITION BY floor ORDER BY activity_date
        ) AS gn
    FROM (
        SELECT
            *,
            -- 0 while the streak keeps increasing, 1 when a new streak starts
            CASE
                WHEN streak > LAG(streak, 1, streak - 1) OVER (
                    PARTITION BY floor
                    ORDER BY activity_date
                ) THEN 0
                ELSE 1
            END AS is_same_streak
        FROM df
    ) t1
) t2
ORDER BY
    floor,
    activity_date
View a working demo on db fiddle.
Using the Scala API
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val floorWindow = Window.partitionBy("floor").orderBy("activity_date")

val output = df
  .withColumn(
    "is_same_streak",
    // 0 while the streak keeps increasing, 1 when a new streak starts
    when(
      col("streak") > lag(col("streak"), 1, col("streak") - 1).over(floorWindow), 0
    ).otherwise(1)
  )
  .withColumn(
    "gn",
    // running sum of the flag gives a group id per streak
    sum(col("is_same_streak")).over(floorWindow)
  )
  .select(
    col("floor"),
    col("activity_date"),
    max(col("activity_date")).over(
      Window.partitionBy("gn", "floor")
    ).alias("end_activity_date")
  )
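Note that the windowed MAX keeps one output row per input row, so the group spanning 2019-04-08 and 2019-04-09 appears twice. If you want exactly one row per streak, as in the sample output, a slightly different sketch (not the approach above) aggregates per derived group, relying on the question's statement that streak resets to 1 at the start of every run:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val w = Window.partitionBy("floor").orderBy("activity_date")

val collapsed = df
  // streak == 1 marks the first row of every streak, so a running
  // count of those rows works as a per-streak group id
  .withColumn("grp", sum(when(col("streak") === 1, 1).otherwise(0)).over(w))
  .groupBy("floor", "grp")
  .agg(
    min(col("activity_date")).as("activity_date"),
    max(col("activity_date")).as("end_activity_date")
  )
  .drop("grp")
  .orderBy("floor", "activity_date")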
Using the PySpark API
from pyspark.sql import functions as F
from pyspark.sql import Window

floorWindow = Window.partitionBy("floor").orderBy("activity_date")

output = (
    df.withColumn(
        "is_same_streak",
        # 0 while the streak keeps increasing, 1 when a new streak starts
        F.when(
            F.col("streak") > F.lag(F.col("streak"), 1, F.col("streak") - 1).over(floorWindow), 0
        ).otherwise(1)
    )
    .withColumn(
        "gn",
        # running sum of the flag gives a group id per streak
        F.sum(F.col("is_same_streak")).over(floorWindow)
    )
    .select(
        "floor",
        "activity_date",
        F.max(F.col("activity_date")).over(
            Window.partitionBy("gn", "floor")
        ).alias("end_activity_date")
    )
)
Let me know if this works for you.
Related
I have two Spark DataFrames. The first one (Events) contains event information as follows:
Event_id   Date         User_id
--------------------------------
1          2019-04-19   1
2          2019-05-30   2
3          2020-01-20   1
The second one (User) contains information from users as below:
Id   User_id   Date         Weight-kg
--------------------------------------
1    1         2019-04-05   78
2    1         2019-04-17   75
3    2         2019-10-10   50
4    1         2020-02-10   76
What I want to know is how to bring in the latest weight value from User prior to the event date, using PySpark.
The result should be the following table:
Event_id   Date         User_id   Weight-kg
---------------------------------------------
1          2019-04-19   1         75
2          2019-05-30   2         null
3          2020-01-20   1         75
The idea is to left join events and users, then rank the weights by date to get the latest one before each event:
from pyspark.sql import functions as F
from pyspark.sql import Window as W

(event
    # left join to keep all events;
    # note the join condition where
    # event's date >= user's date
    .join(
        user,
        on=[
            event['User_id'] == user['User_id'],
            event['Date'] >= user['Date'],
        ],
        how='left'
    )
    # rank the user's weights per event (latest date first),
    # using only the rows already filtered by the join condition
    .withColumn(
        'rank_weight',
        F.rank().over(W.partitionBy(event['Event_id']).orderBy(user['Date'].desc()))
    )
    .where(F.col('rank_weight') == 1)
    .drop('rank_weight')
    # drop unnecessary columns
    .drop(user['User_id'])
    .drop(user['Date'])
    .drop('Id')
    .orderBy('Event_id')
    .show()
)
# Output
# +--------+----------+-------+------+
# |Event_id| Date|User_id|Weight|
# +--------+----------+-------+------+
# | 1|2019-04-19| 1| 75|
# | 2|2019-05-30| 2| null|
# | 3|2020-01-20| 1| 75|
# +--------+----------+-------+------+
I have 2 parquet tables. The simplified schema is as follows:
case class Product(SerialNumber: Integer,
                   UniqueKey: String,
                   ValidityDate1: Date)
case class ExceptionEvents(SerialNumber: Integer,
                           ExceptionId: String,
                           ValidityDate2: Date)
The Product Dataframe can contain the following entries, as an example:
Product:
-----------------------------------------
SerialNumber UniqueKey ValidityDate1
-----------------------------------------
10001 Key_1 01/10/2021
10001 Key_2 05/10/2021
10001 Key_3 10/10/2021
10002 Key_4 02/10/2021
10003 Key_5 07/10/2021
-----------------------------------------
ExceptionEvents:
-----------------------------------------
SerialNumber ExceptionId ValidityDate2
-----------------------------------------
10001 ExcId_1 02/10/2021
10001 ExcId_2 05/10/2021
10001 ExcId_3 07/10/2021
10001 ExcId_4 11/10/2021
10001 ExcId_5 15/10/2021
-----------------------------------------
I want to join the two DataFrames such that the SerialNumbers match and each ExceptionEvents row is mapped to the Product row whose ValidityDate1 is on or before its ValidityDate2, with the two dates as close as possible.
For example, the resultant DF should look like below:
---------------------------------------------------------------------
SerialNumber ExceptionId UniqueKey ValidityDate2
---------------------------------------------------------------------
10001 ExcId_1 Key_1 02/10/2021
10001 ExcId_2 Key_2 05/10/2021
10001 ExcId_3 Key_2 07/10/2021
10001 ExcId_4 Key_3 11/10/2021
10001 ExcId_5 Key_3 15/10/2021
---------------------------------------------------------------------
Any idea how this query should be done using Scala and the Spark DataFrame API?
The below solution works fine for me:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._   // assumes an active SparkSession named `spark` (for toDF and $)

val dfp1 = List(("1001", "Key1", "01/10/2021"), ("1001", "Key2", "05/10/2021"), ("1001", "Key3", "10/10/2021"), ("1002", "Key4", "02/10/2021"))
  .toDF("SerialNumber", "UniqueKey", "Date1")
val dfProduct = dfp1.withColumn("Date1", to_date($"Date1", "dd/MM/yyyy"))

val dfe1 = List(("1001", "ExcId1", "02/10/2021"), ("1001", "ExcId2", "05/10/2021"), ("1001", "ExcId3", "07/10/2021"), ("1001", "ExcId4", "11/10/2021"), ("1001", "ExcId5", "15/10/2021"))
  .toDF("SerialNumber", "ExceptionId", "Date2")
val dfExceptions = dfe1.withColumn("Date2", to_date($"Date2", "dd/MM/yyyy"))

// join on SerialNumber, keeping only product dates on or before the exception date
val exceptionStat2 = dfExceptions.as("fact")
  .join(dfProduct.as("dim"), Seq("SerialNumber"))
  .select($"fact.*", $"dim.UniqueKey", datediff($"fact.Date2", $"dim.Date1").as("DiffDate"))
  .where($"DiffDate" >= 0)

// for each exception, rank the candidate product rows by date distance and keep the closest one
val exceptionStat3 = exceptionStat2
  .withColumn("rank", rank().over(Window.partitionBy($"SerialNumber", $"ExceptionId").orderBy($"DiffDate")))
  .where($"rank" === 1)
  .select($"SerialNumber", $"ExceptionId", $"UniqueKey", $"Date2", $"DiffDate", $"rank")
  .orderBy($"SerialNumber", $"Date2")
I have two tables, login_attempts and checkouts, in Amazon Redshift. A user can have multiple (un)successful login attempts and multiple (un)successful checkouts, as shown in this example:
login_attempts
login_id | user_id | login | success
-------------------------------------------------------
1 | 1 | 2021-07-01 14:00:00 | 0
2 | 1 | 2021-07-01 16:00:00 | 1
3 | 2 | 2021-07-02 05:01:01 | 1
4 | 1 | 2021-07-04 03:25:34 | 0
5 | 2 | 2021-07-05 11:20:50 | 0
6 | 2 | 2021-07-07 12:34:56 | 1
and
checkouts
checkout_id | checkout_time | user_id | success
------------------------------------------------------------
1 | 2021-07-01 18:00:00 | 1 | 0
2 | 2021-07-02 06:54:32 | 2 | 1
3 | 2021-07-04 13:00:01 | 1 | 1
4 | 2021-07-08 09:05:00 | 2 | 1
Given this information, how can I get the following table with historical performance included for each checkout AS OF THAT TIME?
checkout_id | checkout | user_id | lastGoodLogin | lastFailedLogin | lastGoodCheckout | lastFailedCheckout |
---------------------------------------------------------------------------------------------------------------------------------------
1 | 2021-07-01 18:00:00 | 1 | 2021-07-01 16:00:00 | 2021-07-01 14:00:00 | NULL | NULL
2 | 2021-07-02 06:54:32 | 2 | 2021-07-02 05:01:01 | NULL | NULL | NULL
3 | 2021-07-04 13:00:01 | 1 | 2021-07-01 16:00:00 | 2021-07-04 03:25:34 | NULL | 2021-07-01 18:00:00
4 | 2021-07-08 09:05:00 | 2 | 2021-07-07 12:34:56 | 2021-07-05 11:20:50 | 2021-07-02 06:54:32 | NULL
Update: I was able to get lastFailedCheckout and lastGoodCheckout, because those only need window operations on the same table (checkouts), but I don't understand how best to join it with the login_attempts table to get the last[Good|Failed]Login fields. (sqlfiddle)
P.S.: I am open to PostgreSQL suggestions as well.
Good start! A couple of things about your SQL: 1) You should really try to avoid inequality joins, as these can lead to data explosions and aren't needed in this case. Just put a CASE statement inside your window function to use only the type of checkout (or login) you want. 2) You can use the frame clause to avoid selecting the same row when finding previous checkouts.
Once you have this pattern you can use it to find the other two columns of data you are looking for. The first step is to UNION the tables together, not JOIN them. This means adding a few more columns so the data can live together, but that is easy. Now you have the user id and the time each "thing" happened all in the same data. You just need to window two more times to pull the info you want. Lastly, you need to strip out the non-checkout rows with an outer select with a WHERE clause.
Like this:
create table login_attempts(
loginid smallint,
userid smallint,
login timestamp,
success smallint
);
create table checkouts(
checkoutid smallint,
userid smallint,
checkout_time timestamp,
success smallint
);
insert into login_attempts values
(1, 1, '2021-07-01 14:00:00', 0),
(2, 1, '2021-07-01 16:00:00', 1),
(3, 2, '2021-07-02 05:01:01', 1),
(4, 1, '2021-07-04 03:25:34', 0),
(5, 2, '2021-07-05 11:20:50', 0),
(6, 2, '2021-07-07 12:34:56', 1)
;
insert into checkouts values
(1, 1, '2021-07-01 18:00:00', 0),
(2, 2, '2021-07-02 06:54:32', 1),
(3, 1, '2021-07-04 13:00:01', 1),
(4, 2, '2021-07-08 09:05:00', 1)
;
SQL:
select *
from (
    select
        c.checkoutid,
        c.userid,
        c.checkout_time,
        max(case success when 0 then checkout_time end) over (
            partition by userid
            order by event_time
            rows between unbounded preceding and 1 preceding
        ) as lastFailedCheckout,
        max(case success when 1 then checkout_time end) over (
            partition by userid
            order by event_time
            rows between unbounded preceding and 1 preceding
        ) as lastGoodCheckout,
        max(case lsuccess when 0 then login end) over (
            partition by userid
            order by event_time
            rows between unbounded preceding and 1 preceding
        ) as lastFailedLogin,
        max(case lsuccess when 1 then login end) over (
            partition by userid
            order by event_time
            rows between unbounded preceding and 1 preceding
        ) as lastGoodLogin
    from (
        select checkout_time as event_time, checkoutid, userid,
               checkout_time, success,
               NULL as login, NULL as lsuccess
        from checkouts
        UNION ALL
        select login as event_time, NULL as checkoutid, userid,
               NULL as checkout_time, NULL as success,
               login, success as lsuccess
        from login_attempts
    ) c
) o
where o.checkoutid is not null
order by o.checkoutid
This is T-SQL, and I'm trying to calculate the repeat purchase rate for the last 12 months. This is achieved by looking at the number of customers who have bought more than once in the last 12 months and the total number of customers in the last 12 months.
The SQL code below gives me just that, but I would like to do this dynamically for each of the last 12 months. This is the part where I'm stuck and not sure how best to achieve it.
Each month should include data going back 12 months, i.e. June should hold data between June 2018 and June 2019, and May should hold data from May 2018 till May 2019.
[Order Date] is a normal date field (yyyy-mm-dd hh:mm:ss).
DECLARE @startdate1 DATETIME
DECLARE @enddate1 DATETIME
SET @enddate1 = DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()) - 1, 0)   -- first day of last month, e.g. 2019-06-01 (window end)
SET @startdate1 = DATEADD(mm, DATEDIFF(mm, 0, GETDATE()) - 13, 0)      -- 13 months back, e.g. 2018-06-01 (window start)
;
with dataset as (
    select [Phone No_] as who_identifier,
           count(distinct([Order No_])) as mycount
    from [MyCompany$Sales Invoice Header]
    where [Order Date] between @startdate1 and @enddate1
    group by [Phone No_]
),
frequentbuyers as (
    select who_identifier, sum(mycount) as frequentbuyerscount
    from dataset
    where mycount > 1
    group by who_identifier
),
allpurchases as (
    select who_identifier, sum(mycount) as allpurchasescount
    from dataset
    group by who_identifier
)
select sum(frequentbuyerscount) as frequentbuyercount,
       (select sum(allpurchasescount) from allpurchases) as allpurchasecount
from frequentbuyers
I'm hoping to achieve an end result looking something like this:
...Dec, Jan, Feb, March, April, May, June, with each month holding both values, frequentbuyercount and allpurchasescount.
Here is the code. I made a small modification for frequentbuyerscount and allpurchasescount: if you use a SUMIF-like expression, you don't need a second CTE.
if object_id('tempdb.dbo.#tmpMonths') is not null drop table #tmpMonths
create table #tmpMonths ( MonthID datetime, StartDate datetime, EndDate datetime )

declare @MonthCount int = 12
declare @Month datetime = DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()), 0)

while @MonthCount > 0 begin
    insert into #tmpMonths ( MonthID, StartDate, EndDate )
    select @Month, dateadd(month, -12, @Month), @Month
    set @Month = dateadd(month, -1, @Month)
    set @MonthCount = @MonthCount - 1
end

;with dataset as (
    select m.MonthID as MonthID, [Phone No_] as who_identifier,
           count(distinct([Order No_])) as mycount
    from [MyCompany$Sales Invoice Header]
    inner join #tmpMonths m on [Order Date] between m.StartDate and m.EndDate
    group by m.MonthID, [Phone No_]
),
buyers as (
    select MonthID, who_identifier
         , sum(iif(mycount > 1, mycount, 0)) as frequentbuyerscount   -- sum only if count > 1
         , sum(mycount) as allpurchasescount
    from dataset
    group by MonthID, who_identifier
)
select
    b.MonthID
    , max(tm.StartDate) StartDate, max(tm.EndDate) EndDate
    , sum(b.frequentbuyerscount) as frequentbuyercount
    , sum(b.allpurchasescount) as allpurchasecount
from buyers b
inner join #tmpMonths tm on tm.MonthID = b.MonthID
group by b.MonthID
Be aware that the code was only tested syntax-wise.
With test data added, this is the result:
MonthID | StartDate | EndDate | frequentbuyercount | allpurchasecount
-----------------------------------------------------------------------------
2018-08-01 | 2017-08-01 | 2018-08-01 | 340 | 3702
2018-09-01 | 2017-09-01 | 2018-09-01 | 340 | 3702
2018-10-01 | 2017-10-01 | 2018-10-01 | 340 | 3702
2018-11-01 | 2017-11-01 | 2018-11-01 | 340 | 3702
2018-12-01 | 2017-12-01 | 2018-12-01 | 340 | 3703
2019-01-01 | 2018-01-01 | 2019-01-01 | 340 | 3703
2019-02-01 | 2018-02-01 | 2019-02-01 | 2 | 8
2019-03-01 | 2018-03-01 | 2019-03-01 | 2 | 3
2019-04-01 | 2018-04-01 | 2019-04-01 | 2 | 3
2019-05-01 | 2018-05-01 | 2019-05-01 | 2 | 3
2019-06-01 | 2018-06-01 | 2019-06-01 | 2 | 3
2019-07-01 | 2018-07-01 | 2019-07-01 | 2 | 3
I'm having problems querying when the lead() values are not within the range of the current row: rows at the edge of the range return null lead() values.
Let's say I have a simple table to keep track of continuous counters:
create table anytable (
    wseller  integer NOT NULL,
    wday     date NOT NULL,
    wshift   smallint NOT NULL,
    wcounter numeric(9,1)
)
with the following values
wseller wday wshift wcounter
1 2016-11-30 1 100.5
1 2017-01-03 1 102.5
1 2017-01-25 2 103.2
1 2017-02-05 2 106.1
2 2015-05-05 2 81.1
2 2017-01-01 1 92.1
2 2017-01-01 2 93.1
3 2016-12-01 1 45.2
3 2017-01-05 1 50.1
and I want the net units for the current year:
wseller wday wshift units
1 2017-01-03 1 2
1 2017-01-25 2 0.7
1 2017-02-05 2 2.9
2 2017-01-01 1 11
2 2017-01-01 2 1
3 2017-01-05 1 4.9
If I use
select wseller, wday, wshift,
       wcounter - lead(wcounter) over (partition by wseller order by wseller, wday desc, wshift desc)
from anytable
where wday >= '2017-01-01'
it gives me nulls for the rows at the edge of each wseller partition. I'm using this query within a large CTE.
What am I doing wrong?
Window functions are evaluated on the rows that remain after the WHERE clause is applied, so lead() cannot see the rows excluded by the filter. Move the condition to an outer query:
select *
from (
    select
        wseller, wday, wshift,
        wcounter - lead(wcounter) over (partition by wseller order by wday desc, wshift desc)
    from anytable
) s
where wday >= '2017-01-01'
order by wseller, wday, wshift
wseller | wday | wshift | ?column?
---------+------------+--------+----------
1 | 2017-01-03 | 1 | 2.0
1 | 2017-01-25 | 2 | 0.7
1 | 2017-02-05 | 2 | 2.9
2 | 2017-01-01 | 1 | 11.0
2 | 2017-01-01 | 2 | 1.0
3 | 2017-01-05 | 1 | 4.9
(6 rows)