Need to add date ranges between two date columns in pyspark? - pyspark

I have an input PySpark DataFrame with the columns ID, StartDatetime and EndDatetime. I want to add a new column named newdate that holds one row for every date from StartDatetime to EndDatetime, inclusive.
Input DF:
ID StartDatetime EndDatetime
1 21-06-2021 07:00 24-06-2021 16:00
2 21-06-2021 07:00 22-06-2021 16:00
Required output:
ID StartDatetime EndDatetime newdate
1 21-06-2021 07:00 24-06-2021 16:00 21-06-2021
1 21-06-2021 07:00 24-06-2021 16:00 22-06-2021
1 21-06-2021 07:00 24-06-2021 16:00 23-06-2021
1 21-06-2021 07:00 24-06-2021 16:00 24-06-2021
2 21-06-2021 07:00 22-06-2021 16:00 21-06-2021
2 21-06-2021 07:00 22-06-2021 16:00 22-06-2021

You can use explode and array_repeat to duplicate the rows.
I use a combination of row_number and date functions to get the date ranges between start and end dates:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Parse the dd-MM-yyyy HH:mm strings once and reuse the resulting date expressions.
start_dt = F.to_date(F.unix_timestamp('StartDatetime', 'dd-MM-yyyy HH:mm').cast('timestamp'))
end_dt = F.to_date(F.unix_timestamp('EndDatetime', 'dd-MM-yyyy HH:mm').cast('timestamp'))
# Numbers the duplicated rows 1..n within each ID (this relies on ID being unique per input row).
w = Window.partitionBy("ID").orderBy('StartDatetime')
output_df = (
    df.withColumn("diff", 1 + F.datediff(end_dt, start_dt))  # inclusive number of days
    .withColumn('diff', F.expr('explode(array_repeat(diff, int(diff)))'))  # one row per day
    .withColumn("diff", F.row_number().over(w))  # 1, 2, ..., n
    .withColumn("start_dt", start_dt)
    .withColumn("newdate", F.date_format(F.expr("date_add(start_dt, diff - 1)"), 'dd-MM-yyyy'))
    .drop('diff', 'start_dt')
)
Output:
output_df.orderBy("ID", "newdate").show()
+---+----------------+----------------+----------+
| ID| StartDatetime| EndDatetime| newdate|
+---+----------------+----------------+----------+
| 1|21-06-2021 07:00|24-06-2021 16:00|21-06-2021|
| 1|21-06-2021 07:00|24-06-2021 16:00|22-06-2021|
| 1|21-06-2021 07:00|24-06-2021 16:00|23-06-2021|
| 1|21-06-2021 07:00|24-06-2021 16:00|24-06-2021|
| 2|21-06-2021 07:00|22-06-2021 16:00|21-06-2021|
| 2|21-06-2021 07:00|22-06-2021 16:00|22-06-2021|
+---+----------------+----------------+----------+
I dropped the diff column, but displaying it will help you understand the logic if it's not clear.
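To make the logic easier to follow outside Spark, here is a minimal plain-Python sketch of what happens to a single row: count the days inclusively, then add 0 to n-1 days to the start date. The helper name expand_dates is hypothetical and is not part of PySpark.

```python
from datetime import datetime, timedelta

FMT = "%d-%m-%Y %H:%M"  # same layout as the dd-MM-yyyy HH:mm strings above


def expand_dates(start_str, end_str):
    """Return one dd-MM-yyyy string per calendar day from start to end, inclusive."""
    start = datetime.strptime(start_str, FMT).date()
    end = datetime.strptime(end_str, FMT).date()
    n_days = (end - start).days + 1  # mirrors 1 + datediff(end, start)
    # mirrors date_add(start_dt, diff - 1) for diff = 1..n_days
    return [(start + timedelta(days=i)).strftime("%d-%m-%Y") for i in range(n_days)]


print(expand_dates("21-06-2021 07:00", "24-06-2021 16:00"))
# ['21-06-2021', '22-06-2021', '23-06-2021', '24-06-2021']
```

The time of day is discarded before the subtraction, which is why the 07:00 and 16:00 parts do not affect the number of generated rows.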

Related

How to INSERT repeated values like (a,b,c,d,a,b,c,d....) in DB table?

I am trying to build a work schedule table.
I have a table like this:
shift_starts_dt     | shift_type
--------------------+-----------
2022-01-01 08:00:00 | Day
2022-01-01 20:00:00 | Night
2022-01-02 08:00:00 | Day
2022-01-02 20:00:00 | Night
2022-01-03 08:00:00 | Day
2022-01-03 20:00:00 | Night
2022-01-04 08:00:00 | Day
2022-01-04 20:00:00 | Night
etc., until the end of the year
I can't figure out how to add repeating values to the table.
I want to add a shift_name column that cycles through 'A', 'B', 'C', 'D' (these are the team names).
What query should I use to get the following result?
shift_starts_dt     | shift_type | shift_name
--------------------+------------+-----------
2022-01-01 08:00:00 | Day        | 'A'
2022-01-01 20:00:00 | Night      | 'B'
2022-01-02 08:00:00 | Day        | 'C'
2022-01-02 20:00:00 | Night      | 'D'
2022-01-03 08:00:00 | Day        | 'A'
2022-01-03 20:00:00 | Night      | 'B'
2022-01-04 08:00:00 | Day        | 'C'
2022-01-04 20:00:00 | Night      | 'D'
...
Use the number of half days since Jan 1, modulo 4, to index an array (array subscripts start at 1, hence the + 1):
select
  shift_starts_dt,
  shift_type,
  (array['A','B','C','D'])[(extract(epoch from shift_starts_dt - '2022-01-01')::int / 43200) % 4 + 1] as shift_name
from work_schedule
You could replace '2022-01-01' with (select min(shift_starts_dt) from work_schedule) for a more general solution.
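The same arithmetic can be checked outside the database. The following is only a plain-Python sketch of the indexing idea, not a replacement for the query; the names shift_name, SHIFT_NAMES and origin are made up for illustration.

```python
from datetime import datetime

SHIFT_NAMES = ["A", "B", "C", "D"]
HALF_DAY_SECONDS = 43200  # 12 hours, the same constant as in the query


def shift_name(shift_start, origin):
    """Pick the team by counting whole 12-hour steps since the origin."""
    seconds = int((shift_start - origin).total_seconds())
    half_days = seconds // HALF_DAY_SECONDS
    return SHIFT_NAMES[half_days % 4]  # Python lists are 0-based, so no "+ 1"


origin = datetime(2022, 1, 1)  # midnight, like the '2022-01-01' literal
starts = [
    datetime(2022, 1, 1, 8), datetime(2022, 1, 1, 20),
    datetime(2022, 1, 2, 8), datetime(2022, 1, 2, 20),
    datetime(2022, 1, 3, 8),
]
print([shift_name(s, origin) for s in starts])
# ['A', 'B', 'C', 'D', 'A']
```

The 08:00 shift is 8 hours after midnight, which integer division rounds down to 0 half days, so it maps to the first name.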

add 1 day to date when using CASE WHEN REDSHIFT

I am trying to use a CASE WHEN statement like below to add 1 day to a timestamp based on the time part of the timestamp:
CASE WHEN to_char(pickup_date, 'HH24:MI') between 0 and 7 then y.pickup_date else dateadd(day,1,y.pickup_date) end as ead_target
pickup_Date is a timestamp with default format YYYY-MM-DD HH:MM:SS
My output
pickup_Date ead_target
2020-07-01 10:00:00 2020-07-01 10:00:00
2020-07-02 3:00:00 2020-07-02 3:00:00
When the hour of the day is between 0 and 7 then ead_target = pickup_Date ELSE add 1 day
Expected output
pickup_Date ead_target
2020-07-01 10:00:00 2020-07-02 10:00:00
2020-07-02 3:00:00 2020-07-02 3:00:00
Use the date_part() function to extract the hour of the day as a number (to_char() returns a string): https://docs.aws.amazon.com/redshift/latest/dg/r_DATE_PART_function.html
Your CASE expression should work once you extract 'hour' from the timestamp and compare it to the range 0 to 7.
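As a quick sanity check of the intended rule, here is a small Python sketch (not Redshift SQL) that applies the same test to the two sample rows. The function name ead_target is borrowed from the column alias in the question and is otherwise hypothetical.

```python
from datetime import datetime, timedelta


def ead_target(pickup_date):
    """Keep the timestamp if its hour is 0..7, otherwise add one day."""
    # Same comparison as testing the extracted hour against the range 0 to 7.
    if 0 <= pickup_date.hour <= 7:
        return pickup_date
    return pickup_date + timedelta(days=1)


print(ead_target(datetime(2020, 7, 1, 10, 0)))  # 2020-07-02 10:00:00
print(ead_target(datetime(2020, 7, 2, 3, 0)))   # 2020-07-02 03:00:00
```

Both results match the expected output in the question.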

Data from last 12 months each month with trailing 12 months

This is T-SQL, and I'm trying to calculate the repeat purchase rate for the last 12 months. It is computed from the sum of customers who have bought more than once in the last 12 months and the total number of customers in the last 12 months.
The SQL code below gives me just that, but I would like to do this dynamically for each of the last 12 months. This is the part where I'm stuck and not sure how best to achieve it.
Each month should include data going back 12 months, i.e. June should hold data from June 2018 till June 2019, and May should hold data from May 2018 till May 2019.
[Order Date] is a normal datefield (yyyy-mm-dd hh:mm:ss)
DECLARE @startdate1 DATETIME
DECLARE @enddate1 DATETIME
SET @enddate1 = DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE())-1, 0) -- Ending June 2019
SET @startdate1 = DATEADD(mm,DATEDIFF(mm,0,GETDATE())-13,0) -- Starting June 2018
;
with dataset as (
select [Phone No_] as who_identifier,
count(distinct([Order No_])) as mycount
from [MyCompany$Sales Invoice Header]
where [Order Date] between @startdate1 and @enddate1
group by [Phone No_]
),
frequentbuyers as (
select who_identifier, sum(mycount) as frequentbuyerscount
from dataset
where mycount > 1
group by who_identifier),
allpurchases as (
select who_identifier, sum(mycount) as allpurchasescount
from dataset
group by who_identifier
)
select sum(frequentbuyerscount) as frequentbuyercount, (select sum(allpurchasescount) from allpurchases) as allpurchasecount
from frequentbuyers
I'm hoping to achieve end result looking something like this:
...Dec, Jan, Feb, March, April, May, June each month holding both values for frequentbuyercount and allpurchasescount.
Here is the code. I made a little modification for the frequentbuyerscount and allpurchasescount. If you use a sumif like expression you don't need a second cte.
if object_id('tempdb.dbo.#tmpMonths') is not null drop table #tmpMonths
create table #tmpMonths ( MonthID datetime, StartDate datetime, EndDate datetime)
declare @MonthCount int = 12
declare @Month datetime = DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()), 0)
-- one row per month: MonthID plus the 12-month window that ends at that month
while @MonthCount > 0 begin
    insert into #tmpMonths( MonthID, StartDate, EndDate )
    select @Month, dateadd(month, -12, @Month), @Month
    set @Month = dateadd(month, -1, @Month)
    set @MonthCount = @MonthCount - 1
end
;with dataset as (
select m.MonthID as MonthID, [Phone No_] as who_identifier,
count(distinct([Order No_])) as mycount
from [MyCompany$Sales Invoice Header]
inner join #tmpMonths m on [Order Date] between m.StartDate and m.EndDate
group by m.MonthID, [Phone No_]
),
buyers as (
select MonthID, who_identifier
, sum(iif(mycount > 1, mycount, 0)) as frequentbuyerscount --sum only if count > 1
, sum(mycount) as allpurchasescount
from dataset
group by MonthID, who_identifier
)
select
b.MonthID
, max(tm.StartDate) StartDate, max(tm.EndDate) EndDate
, sum(b.frequentbuyerscount) as frequentbuyercount
, sum(b.allpurchasescount) as allpurchasecount
from buyers b inner join #tmpMonths tm on tm.MonthID = b.MonthID
group by b.MonthID
Be aware that the code was originally tested only for syntax.
Run against some test data, this is the result:
MonthID | StartDate | EndDate | frequentbuyercount | allpurchasecount
-----------------------------------------------------------------------------
2018-08-01 | 2017-08-01 | 2018-08-01 | 340 | 3702
2018-09-01 | 2017-09-01 | 2018-09-01 | 340 | 3702
2018-10-01 | 2017-10-01 | 2018-10-01 | 340 | 3702
2018-11-01 | 2017-11-01 | 2018-11-01 | 340 | 3702
2018-12-01 | 2017-12-01 | 2018-12-01 | 340 | 3703
2019-01-01 | 2018-01-01 | 2019-01-01 | 340 | 3703
2019-02-01 | 2018-02-01 | 2019-02-01 | 2 | 8
2019-03-01 | 2018-03-01 | 2019-03-01 | 2 | 3
2019-04-01 | 2018-04-01 | 2019-04-01 | 2 | 3
2019-05-01 | 2018-05-01 | 2019-05-01 | 2 | 3
2019-06-01 | 2018-06-01 | 2019-06-01 | 2 | 3
2019-07-01 | 2018-07-01 | 2019-07-01 | 2 | 3
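The month table built by the WHILE loop is the core of this approach. Below is a small Python sketch of the same window generation, useful for checking which boundaries each month gets. The helper names add_months and trailing_windows are hypothetical.

```python
from datetime import date


def add_months(d, months):
    """Shift a first-of-month date by a number of months (may be negative)."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def trailing_windows(today, month_count=12, window=12):
    """Return (month_id, start, end) tuples, newest month first."""
    month = date(today.year, today.month, 1)  # first day of the current month
    rows = []
    for _ in range(month_count):
        # each month looks back `window` months, like dateadd(month, -12, ...)
        rows.append((month, add_months(month, -window), month))
        month = add_months(month, -1)
    return rows


for month_id, start, end in trailing_windows(date(2019, 7, 15)):
    print(month_id, start, end)
```

With 15 July 2019 as the assumed run date, the newest row is 2019-07-01 with a window starting at 2018-07-01, and the oldest is 2018-08-01 with a window starting at 2017-08-01, which matches the MonthID, StartDate and EndDate columns in the result above.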

How to get rows between time intervals

I have delivery slots with a from column (datetime).
Delivery slots are stored as 1 hour to 1 hour and 30 minute intervals, daily.
i.e. 3.00am-4.30am, 6.00am-7.30am, 9.00am-10.30am and so forth
id | from
------+---------------------
1 | 2016-01-01 03:00:00
2 | 2016-01-01 04:30:00
3 | 2016-01-01 06:00:00
4 | 2016-01-01 07:30:00
5 | 2016-01-01 09:00:00
6 | 2016-01-01 10:30:00
7 | 2016-01-01 12:00:00
8 | 2016-01-02 03:00:00
9 | 2016-01-02 04:30:00
10 | 2016-01-02 06:00:00
11 | 2016-01-02 07:30:00
12 | 2016-01-02 09:00:00
13 | 2016-01-02 10:30:00
14 | 2016-01-02 12:00:00
I'm trying to get all delivery_slots between the hours of 3.00am and 4.30am. I've got the following so far:
SELECT * FROM delivery_slots WHERE EXTRACT(HOUR FROM delivery_slots.from) >= 3 AND EXTRACT(MINUTE FROM delivery_slots.from) >= 0 AND EXTRACT(HOUR FROM delivery_slots.from) <= 4 AND EXTRACT(MINUTE FROM delivery_slots.from) <= 30;
Which kinda works. Kinda, because it only returns delivery slots that have minutes of 00.
That's because of the last WHERE condition (EXTRACT(MINUTE FROM delivery_slots.from) <= 30).
To give you an idea of what I expect:
id | from
-------+---------------------
1 | 2016-01-01 03:00:00
2 | 2016-01-01 04:30:00
8 | 2016-01-02 03:00:00
9 | 2016-01-02 04:30:00
15 | 2016-01-03 03:00:00
16 | 2016-01-03 04:30:00
etc...
Is there a better way to go about this?
Try this: (not tested)
SELECT * FROM delivery_slots WHERE delivery_slots.from::time >= '03:00:00' AND delivery_slots.from::time <= '04:30:00'
Hope this helps.
Cheers.
The easiest way to do this, in my mind, is to cast the from column to type time and filter with >= and <=, like so:
select * from testing where (date::time >= '3:00'::time and date::time <= '4:30'::time);
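The idea behind the cast is to compare only the time of day and ignore the date. Here is a minimal Python sketch of that comparison on a few of the sample rows; the slots list and variable names are made up for illustration.

```python
from datetime import datetime, time

# (id, from) pairs taken from the sample data in the question
slots = [
    (1, datetime(2016, 1, 1, 3, 0)),
    (2, datetime(2016, 1, 1, 4, 30)),
    (3, datetime(2016, 1, 1, 6, 0)),
    (8, datetime(2016, 1, 2, 3, 0)),
    (9, datetime(2016, 1, 2, 4, 30)),
    (10, datetime(2016, 1, 2, 6, 0)),
]

lo, hi = time(3, 0), time(4, 30)
# .time() drops the date part, much like the ::time cast in the query
matching = [slot_id for slot_id, start in slots if lo <= start.time() <= hi]
print(matching)  # [1, 2, 8, 9]
```

Comparing the whole time value avoids the problem of testing hours and minutes separately, where the minute condition is applied to every hour in the range.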

Pandas: Combine resampling with groupby and calculate time differences

I am doing data analysis on trading data. I would like to use Pandas to examine the times when the traders are active.
In particular, I am trying to compute, for each day, the difference in minutes between the first trades of the individual traders, and then accumulate it on a monthly basis.
The data looks like this:
Timestamp (Datetime) | Buyer | Volume
--------------------------------------
2012-01-01 09:00:00 | John | 10
2012-01-01 10:00:00 | Mark | 10
2012-01-01 16:00:00 | Mark | 10
2012-01-01 11:00:00 | Kevin | 10
2012-02-01 10:00:00 | Mark | 10
2012-02-01 09:00:00 | John | 10
2012-02-01 17:00:00 | Mark | 10
Right now I use resampling to retrieve the first trade on a daily basis. However, I want to group also by the buyer to calculate the differences in their trading dates. Like this
Timestamp (Datetime) | Buyer | Volume
--------------------------------------
2012-01-01 09:00:00 | John | 10
2012-01-01 10:00:00 | Mark | 10
2012-01-01 11:00:00 | Kevin | 10
2012-01-02 10:00:00 | Mark | 10
2012-01-02 09:00:00 | John | 10
Overall I am looking to calculate the differences in minutes between the first trades on a daily basis for each trader.
Update
For example in the case of John on the 2012-01-01: Dist = 60 (Diff John-Mark) + 120 (Diff John-Kevin) = 180
I would highly appreciate if anyone has an idea how to do this.
Thank you
Your original frame (the resampled one)
In [71]: df_orig
Out[71]:
buyer date volume
0 John 2012-01-01 09:00:00 10
1 Mark 2012-01-01 10:00:00 10
2 Kevin 2012-01-01 11:00:00 10
3 Mark 2012-01-02 10:00:00 10
4 John 2012-01-02 09:00:00 10
Set the index to the date column, keeping the date column in place
In [75]: df = df_orig.set_index('date',drop=False)
Create this aggregation function
def f(frame):
    frame.sort('date',inplace=True)
    frame['start'] = frame.date.iloc[0]
    return frame
Groupby the single date
In [74]: x = df.groupby(pd.TimeGrouper('1d')).apply(f)
Create the differential in minutes
In [86]: x['diff'] = (x.date-x.start).apply(lambda x: float(x.item().total_seconds())/60)
In [87]: x
Out[87]:
buyer date volume start diff
date
2012-01-01 2012-01-01 09:00:00 John 2012-01-01 09:00:00 10 2012-01-01 09:00:00 0
2012-01-01 10:00:00 Mark 2012-01-01 10:00:00 10 2012-01-01 09:00:00 60
2012-01-01 11:00:00 Kevin 2012-01-01 11:00:00 10 2012-01-01 09:00:00 120
2012-01-02 2012-01-02 09:00:00 John 2012-01-02 09:00:00 10 2012-01-02 09:00:00 0
2012-01-02 10:00:00 Mark 2012-01-02 10:00:00 10 2012-01-02 09:00:00 60
Here's the explanation. We use the TimeGrouper to group by date, and each day's frame is passed to the function f. This function then takes the first date of the day (the sort is necessary here). You subtract this from the date of each entry to get a timedelta64, which is then converted to minutes (this is a bit hacky right now because of some numpy issues; it should be more natural in 0.12).
Thanks for your update. I originally thought you wanted the diff per buyer, not from the first buyer, but that's just a minor tweak.
Update:
To track the buyer name as well (the buyer that corresponds to the start date), just include it in the function f:
def f(frame):
    frame.sort('date',inplace=True)
    frame['start'] = frame.date.iloc[0]
    frame['start_buyer'] = frame.buyer.iloc[0]
    return frame
Then you can group by this at the end:
In [14]: x.groupby(['start_buyer']).sum()
Out[14]:
diff
start_buyer
John 240
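The answer above was written for a very old pandas; DataFrame.sort and pd.TimeGrouper are no longer available in current releases. Below is a hedged sketch of the same calculation using sort_values and groupby(...).transform instead. The day column is a helper introduced here for the grouping, and this is one possible rewrite, not the original author's code.

```python
import pandas as pd

df = pd.DataFrame({
    "buyer": ["John", "Mark", "Kevin", "Mark", "John"],
    "date": pd.to_datetime([
        "2012-01-01 09:00", "2012-01-01 10:00", "2012-01-01 11:00",
        "2012-01-02 10:00", "2012-01-02 09:00",
    ]),
    "volume": [10, 10, 10, 10, 10],
})

# Calendar day of each trade, used as the grouping key.
df["day"] = df["date"].dt.normalize()
# Sort so that "first" within each day means the earliest trade.
df = df.sort_values("date")

df["start"] = df.groupby("day")["date"].transform("min")           # first trade of the day
df["start_buyer"] = df.groupby("day")["buyer"].transform("first")  # buyer who made it
df["diff"] = (df["date"] - df["start"]).dt.total_seconds() / 60    # minutes after the first trade

totals = df.groupby("start_buyer")["diff"].sum()
print(totals)
```

On the sample frame this gives 240 minutes for John (0 + 60 + 120 on the first day, 0 + 60 on the second), the same total as the output above.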