How to extract the whole hours from a time range in PostgreSQL and get the duration of each extracted hour

I'm new to databases (even more so to Postgres), so I hope you can help me. I have a table something like this:
id_interaction | start_time          | end_time
---------------+---------------------+--------------------
0001           | 2022-06-03 12:40:10 | 2022-06-03 12:45:16
0002           | 2022-06-04 10:50:40 | 2022-06-04 11:10:12
0003           | 2022-06-04 16:30:00 | 2022-06-04 18:20:00
0004           | 2022-06-05 23:00:00 | 2022-06-06 03:30:12
Basically I need to write a query that splits each interaction's duration up by hour, for example:
id_interaction | start_time          | end_time            | hour     | duration
---------------+---------------------+---------------------+----------+---------
0001           | 2022-06-03 12:40:10 | 2022-06-03 12:45:16 | 12:00:00 | 00:05:06
0002           | 2022-06-04 10:50:40 | 2022-06-04 11:10:12 | 10:00:00 | 00:09:20
0002           | 2022-06-04 10:50:40 | 2022-06-04 11:10:12 | 11:00:00 | 00:10:12
0003           | 2022-06-04 16:30:00 | 2022-06-04 18:20:00 | 16:00:00 | 00:30:00
0003           | 2022-06-04 16:30:00 | 2022-06-04 18:20:00 | 17:00:00 | 01:00:00
0003           | 2022-06-04 16:30:00 | 2022-06-04 18:20:00 | 18:00:00 | 00:20:00
0004           | 2022-06-05 23:00:00 | 2022-06-06 03:30:12 | 23:00:00 | 01:00:00
0004           | 2022-06-05 23:00:00 | 2022-06-06 03:30:12 | 00:00:00 | 01:00:00
0004           | 2022-06-05 23:00:00 | 2022-06-06 03:30:12 | 01:00:00 | 01:00:00
0004           | 2022-06-05 23:00:00 | 2022-06-06 03:30:12 | 02:00:00 | 01:00:00
0004           | 2022-06-05 23:00:00 | 2022-06-06 03:30:12 | 03:00:00 | 00:30:12
I need all the hours from start to finish. For example, if an interaction starts at 17:10 and ends at 19:00, I need durations for the 17:00, 18:00 and 19:00 hours.

If you're trying to get the duration within each whole-hour interval overlapped by your data, you can achieve this by rounding timestamps with date_trunc(), stepping through the intervals with generate_series(), and casting between time, interval and timestamp:
create or replace function hours_crossed(starts timestamp, ends timestamp)
returns integer
language sql as $$
  select case
    -- start and end fall inside the same hour: no extra buckets
    when date_trunc('hour', starts) = date_trunc('hour', ends)
      then 0
    -- start is exactly on the hour: whole hours in the difference
    when date_trunc('hour', starts) = starts
      then floor(extract(epoch from ends - starts)::numeric / 3600.0)
    -- otherwise add one for the partial first hour
    else floor(extract(epoch from ends - starts)::numeric / 3600.0) + 1
  end
$$;
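A quick sanity check of the helper (my addition, not part of the original answer), using interaction 0003, which runs from 16:30 to 18:20 and crosses two hour boundaries:

select hours_crossed('2022-06-04 16:30:00', '2022-06-04 18:20:00');
-- returns 2, so generate_series(0, 2) yields the three buckets 16:00, 17:00 and 18:00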
select * from (
  select
    id_interacao,
    tempo_inicial,
    tempo_final,
    to_char(hora, 'HH24:00')::time as hora,
    -- clip each hour bucket to the interaction's actual bounds
    least(tempo_final, hora + '1 hour'::interval)
      - greatest(tempo_inicial, hora) as duracao
  from (
    select
      *,
      -- one row per whole hour touched by the interaction
      date_trunc('hour', tempo_inicial)
        + (generate_series(0, hours_crossed(tempo_inicial, tempo_final))::text || ' hours')::interval
        as hora
    from test_times
  ) a
) a
where duracao <> '0'::interval;
This also fixes your first entry, which lasts 5 minutes but showed as 45.
You'll need to decide how you want to handle zero-length intervals and intervals that end on an exact hour; I added a condition to skip them.
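For reference, a minimal setup to run the query against (my assumption: the answer's test_times table keeps the Portuguese column names used in the query, loaded with the question's data):

create table test_times (
  id_interacao  text,
  tempo_inicial timestamp,
  tempo_final   timestamp
);

insert into test_times values
  ('0001', '2022-06-03 12:40:10', '2022-06-03 12:45:16'),
  ('0002', '2022-06-04 10:50:40', '2022-06-04 11:10:12'),
  ('0003', '2022-06-04 16:30:00', '2022-06-04 18:20:00'),
  ('0004', '2022-06-05 23:00:00', '2022-06-06 03:30:12');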

Related

How to INSERT repeated values like (a,b,c,d,a,b,c,d....) in DB table?

I'm trying to make a work schedule table.
I have a table like:
shift_starts_dt     | shift_type
--------------------+-----------
2022-01-01 08:00:00 | Day
2022-01-01 20:00:00 | Night
2022-01-02 08:00:00 | Day
2022-01-02 20:00:00 | Night
2022-01-03 08:00:00 | Day
2022-01-03 20:00:00 | Night
2022-01-04 08:00:00 | Day
2022-01-04 20:00:00 | Night
etc., until the end of the year.
I can't figure out how to add repeated values to the table.
I want to add a 'shift_name' column that contains 'A', 'B', 'C', 'D' (it's like a name for each team).
What query should I use to achieve the following result:
shift_starts_dt     | shift_type | shift_name
--------------------+------------+-----------
2022-01-01 08:00:00 | Day        | 'A'
2022-01-01 20:00:00 | Night      | 'B'
2022-01-02 08:00:00 | Day        | 'C'
2022-01-02 20:00:00 | Night      | 'D'
2022-01-03 08:00:00 | Day        | 'A'
2022-01-03 20:00:00 | Night      | 'B'
2022-01-04 08:00:00 | Day        | 'C'
2022-01-04 20:00:00 | Night      | 'D'
...
Use the number of half-days since Jan 1, modulo 4, to index an array:
select
  shift_starts_dt,
  shift_type,
  (array['A','B','C','D'])[(extract(epoch from shift_starts_dt - '2022-01-01')::int / 43200) % 4 + 1]
    as shift_name
from work_schedule;
You could replace '2022-01-01' with (select min(shift_starts_dt) from work_schedule) for a more general solution.
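Spelled out, that more general form looks like this (a sketch; it anchors the rotation at the earliest shift in the table instead of a hard-coded date):

select
  shift_starts_dt,
  shift_type,
  (array['A','B','C','D'])[
    (extract(epoch from shift_starts_dt
             - (select min(shift_starts_dt) from work_schedule))::int / 43200) % 4 + 1
  ] as shift_name
from work_schedule;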

Remove columns that match

With two PySpark dataframes:
id | myTimeStamp
---+-----------------
1  | 2022-06-01 05:00
1  | 2022-06-06 05:00
2  | 2022-06-01 05:00
2  | 2022-06-02 05:00
2  | 2022-06-03 05:00
2  | 2022-06-04 08:00
3  | 2022-06-02 05:00
3  | 2022-06-04 10:00
myTimeToRemove
----------------
2022-06-01 05:00
2022-06-04 05:00
I need to remove the records from the first dataframe whose date appears in the second dataframe (the time doesn't matter).
Expected dataframe:
id | myTimeStamp
---+-----------------
1  | 2022-06-06 05:00
2  | 2022-06-02 05:00
2  | 2022-06-03 05:00
3  | 2022-06-02 05:00
I tried
fdcn_df = fdcn_df.join(holidays_df, fdcn_df['myTimeStamp'].cast('date') != holidays_df['myTimeToRemove'].cast('date'), "inner")
but got no result.
What is the data type of myTimeStamp and myTimeToRemove? Assuming they are timestamp columns, you can use a leftanti join:
from pyspark.sql import functions as func

fdcn_df.alias('a').join(
    holidays_df.alias('b'),
    on=func.to_date(func.col('a.myTimeStamp')) == func.to_date(func.col('b.myTimeToRemove')),
    how='leftanti'
).show(100, False)

Postgresql group by recurring items

I'm using PostgreSQL to store historical data coming from an RTLS platform.
Position data is not collected continuously.
The historical_movements data is implemented as a single table as follows (it is a simplified table, but enough to present the use case):
User | Area  | EnterTime           | ExitTime
-----+-------+---------------------+--------------------
John | room1 | 2018-01-01 10:00:00 | 2018-01-01 10:00:05
Doe  | room1 | 2018-01-01 10:00:00 | 2018-01-01 10:10:00
John | room1 | 2018-01-01 10:05:00 | 2018-01-01 10:10:00
Doe  | room1 | 2018-01-01 10:20:00 | 2018-01-01 10:30:00
John | room2 | 2018-01-01 11:00:00 | 2018-01-01 11:05:00
John | room2 | 2018-01-01 11:08:00 | 2018-01-01 11:15:00
John | room1 | 2018-01-01 12:00:00 | 2018-01-01 12:08:00
John | room1 | 2018-01-01 12:10:00 | 2018-01-01 12:20:00
John | room1 | 2018-01-01 12:25:00 | 2018-01-01 12:25:00
John | room3 | 2018-01-01 12:30:00 | 2018-01-01 12:35:00
John | room3 | 2018-01-01 12:40:00 | 2018-01-01 12:50:00
I'm looking for a way to write a query that shows each user's stays in the various rooms, aggregating the rows that belong to the same room visit and computing the overall staying time, as follows:
User | Area  | EnterTime           | ExitTime            | AggregateTime
-----+-------+---------------------+---------------------+--------------
John | room1 | 2018-01-01 10:00:00 | 2018-01-01 10:10:00 | 00:10:00
John | room2 | 2018-01-01 11:00:00 | 2018-01-01 11:15:00 | 00:15:00
John | room1 | 2018-01-01 12:00:00 | 2018-01-01 12:25:00 | 00:25:00
John | room3 | 2018-01-01 12:30:00 | 2018-01-01 12:50:00 | 00:20:00
Doe  | room1 | 2018-01-01 10:00:00 | 2018-01-01 10:30:00 | 00:30:00
Looking at various threads, I'm fairly sure I need window functions such as lag() and partition by, but it's not clear how.
Any hints?
Best regards.
AggregateTime isn't really an aggregate in your expected result. It's the difference between the max and min times of each block, where a block is a set of contiguous rows with the same (users, area).
with block as (
  select users, area, entertime, exittime,
    -- gaps-and-islands: the difference of the two row_number() sequences is
    -- constant within each run of contiguous rows for the same (users, area)
    (row_number() over (order by users, entertime) -
     row_number() over (partition by users, area order by entertime)
    ) as grp
  from your_table
  order by 1, 2, 3
)
select users, area, entertime, exittime, (exittime - entertime) as duration
from (
  select users, area, grp, min(entertime) as entertime, max(exittime) as exittime
  from block
  group by users, area, grp
) t2
order by 5;
I made some changes to 'Resetting Row number according to record data change' to arrive at the solution.
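A minimal setup to try this against, built from the question's data (an assumption on my part: the query above expects columns named users, area, entertime and exittime, and your_table is a placeholder name):

create table your_table (
  users     text,
  area      text,
  entertime timestamp,
  exittime  timestamp
);

insert into your_table values
  ('John', 'room1', '2018-01-01 10:00:00', '2018-01-01 10:00:05'),
  ('Doe',  'room1', '2018-01-01 10:00:00', '2018-01-01 10:10:00'),
  ('John', 'room1', '2018-01-01 10:05:00', '2018-01-01 10:10:00'),
  ('Doe',  'room1', '2018-01-01 10:20:00', '2018-01-01 10:30:00'),
  ('John', 'room2', '2018-01-01 11:00:00', '2018-01-01 11:05:00'),
  ('John', 'room2', '2018-01-01 11:08:00', '2018-01-01 11:15:00'),
  ('John', 'room1', '2018-01-01 12:00:00', '2018-01-01 12:08:00'),
  ('John', 'room1', '2018-01-01 12:10:00', '2018-01-01 12:20:00'),
  ('John', 'room1', '2018-01-01 12:25:00', '2018-01-01 12:25:00'),
  ('John', 'room3', '2018-01-01 12:30:00', '2018-01-01 12:35:00'),
  ('John', 'room3', '2018-01-01 12:40:00', '2018-01-01 12:50:00');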

How to get rows between time intervals

I have delivery slots with a from column (datetime).
Delivery slots are stored as 1 hour to 1 hour and 30 minute intervals, daily,
i.e. 3.00am-4.30am, 6.00am-7.30am, 9.00am-10.30am and so forth:
id | from
------+---------------------
1 | 2016-01-01 03:00:00
2 | 2016-01-01 04:30:00
3 | 2016-01-01 06:00:00
4 | 2016-01-01 07:30:00
5 | 2016-01-01 09:00:00
6 | 2016-01-01 10:30:00
7 | 2016-01-01 12:00:00
8 | 2016-01-02 03:00:00
9 | 2016-01-02 04:30:00
10 | 2016-01-02 06:00:00
11 | 2016-01-02 07:30:00
12 | 2016-01-02 09:00:00
13 | 2016-01-02 10:30:00
14 | 2016-01-02 12:00:00
I'm trying to get all delivery_slots between the hours of 3.00am and 4.30am. I've got the following so far:
SELECT *
FROM delivery_slots
WHERE EXTRACT(HOUR FROM delivery_slots."from") >= 3
  AND EXTRACT(MINUTE FROM delivery_slots."from") >= 0
  AND EXTRACT(HOUR FROM delivery_slots."from") <= 4
  AND EXTRACT(MINUTE FROM delivery_slots."from") <= 30;
Which kinda works. Kinda, because it only returns delivery slots whose minutes are 00.
That's because of the last where condition (EXTRACT(MINUTE FROM delivery_slots."from") <= 30).
To give you an idea of what I'm expecting:
id | from
-------+---------------------
1 | 2016-01-01 03:00:00
2 | 2016-01-01 04:30:00
8 | 2016-01-02 03:00:00
9 | 2016-01-02 04:30:00
15 | 2016-01-03 03:00:00
16 | 2016-01-03 04:30:00
etc...
Is there a better way to go about this?
Try this (not tested):
SELECT *
FROM delivery_slots
WHERE delivery_slots."from"::time >= '03:00:00'
  AND delivery_slots."from"::time <= '04:30:00';
Hope this helps. Cheers.
The easiest way to do this, in my mind, is to cast the from column to time and compare with >= and <=, like so:
select * from testing where (date::time >= '3:00'::time and date::time <= '4:30'::time);
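Adapted to the question's delivery_slots table (a sketch; note that from is a reserved word in SQL, so the column has to be double-quoted, and between is inclusive on both ends):

select *
from delivery_slots
where "from"::time between '03:00'::time and '04:30'::time;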

Pandas: Combine resampling with groupby and calculate time differences

I am doing data analysis with trading data. I would like to use Pandas in order to examine the times when the traders are active.
In particular, I'm trying to extract the difference in minutes between the times of each trader's first trade of the day, and to cumulate it on a monthly basis.
The data looks like this:
Timestamp (Datetime) | Buyer | Volume
--------------------------------------
2012-01-01 09:00:00 | John | 10
2012-01-01 10:00:00 | Mark | 10
2012-01-01 16:00:00 | Mark | 10
2012-01-01 11:00:00 | Kevin | 10
2012-02-01 10:00:00 | Mark | 10
2012-02-01 09:00:00 | John | 10
2012-02-01 17:00:00 | Mark | 10
Right now I use resampling to retrieve the first trade on a daily basis. However, I also want to group by the buyer to calculate the differences in their trading times, like this:
Timestamp (Datetime) | Buyer | Volume
--------------------------------------
2012-01-01 09:00:00 | John | 10
2012-01-01 10:00:00 | Mark | 10
2012-01-01 11:00:00 | Kevin | 10
2012-01-02 10:00:00 | Mark | 10
2012-01-02 09:00:00 | John | 10
Overall I am looking to calculate the differences in minutes between the first trades on a daily basis for each trader.
Update
For example, in the case of John on 2012-01-01: dist = 60 (diff John-Mark) + 120 (diff John-Kevin) = 180.
I would highly appreciate it if anyone has an idea how to do this.
Thank you
Your original frame (the resampled one)
In [71]: df_orig
Out[71]:
buyer date volume
0 John 2012-01-01 09:00:00 10
1 Mark 2012-01-01 10:00:00 10
2 Kevin 2012-01-01 11:00:00 10
3 Mark 2012-01-02 10:00:00 10
4 John 2012-01-02 09:00:00 10
Set the index to the date column, keeping the date column in place
In [75]: df = df_orig.set_index('date',drop=False)
Create this aggregation function:
def f(frame):
    # sort so that iloc[0] is the day's first trade
    frame = frame.sort_values('date')
    frame['start'] = frame.date.iloc[0]
    return frame
Group by the single date:
In [74]: x = df.groupby(pd.Grouper(freq='1d')).apply(f)
Create the differential in minutes:
In [86]: x['diff'] = (x.date - x.start).dt.total_seconds() / 60
In [87]: x
Out[87]:
buyer date volume start diff
date
2012-01-01 2012-01-01 09:00:00 John 2012-01-01 09:00:00 10 2012-01-01 09:00:00 0
2012-01-01 10:00:00 Mark 2012-01-01 10:00:00 10 2012-01-01 09:00:00 60
2012-01-01 11:00:00 Kevin 2012-01-01 11:00:00 10 2012-01-01 09:00:00 120
2012-01-02 2012-01-02 09:00:00 John 2012-01-02 09:00:00 10 2012-01-02 09:00:00 0
2012-01-02 10:00:00 Mark 2012-01-02 10:00:00 10 2012-01-02 09:00:00 60
Here's the explanation. pd.Grouper groups the rows by day, and each day's frame is passed to the function f, which records the first trade time of that day (the sort is necessary here). Subtracting this start from each row's date gives a timedelta64, which .dt.total_seconds() converts to seconds and hence to minutes.
Thanks for your update; I originally thought you wanted the diff per buyer, not from the first buyer, but that's just a minor tweak.
Update:
To track the buyer name as well (the one that corresponds to the start date), just include it in the function f:
def f(frame):
    # sort so that iloc[0] is the day's first trade
    frame = frame.sort_values('date')
    frame['start'] = frame.date.iloc[0]
    frame['start_buyer'] = frame.buyer.iloc[0]
    return frame
Then you can group by it at the end:
In [14]: x.groupby(['start_buyer']).sum()
Out[14]:
diff
start_buyer
John 240