Postgresql group by recurring items - postgresql

I'm using postgresql to store historical data coming from an RTLS platform.
Position data is not collected continuously.
The historical_movements data is stored in a single table as follows (it is a simplified table, but enough to present the use case):
User Area EnterTime ExitTime
John room1 2018-01-01 10:00:00 2018-01-01 10:00:05
Doe room1 2018-01-01 10:00:00 2018-01-01 10:10:00
John room1 2018-01-01 10:05:00 2018-01-01 10:10:00
Doe room1 2018-01-01 10:20:00 2018-01-01 10:30:00
John room2 2018-01-01 11:00:00 2018-01-01 11:05:00
John room2 2018-01-01 11:08:00 2018-01-01 11:15:00
John room1 2018-01-01 12:00:00 2018-01-01 12:08:00
John room1 2018-01-01 12:10:00 2018-01-01 12:20:00
John room1 2018-01-01 12:25:00 2018-01-01 12:25:00
John room3 2018-01-01 12:30:00 2018-01-01 12:35:00
John room3 2018-01-01 12:40:00 2018-01-01 12:50:00
I'm looking for a way to write a query that shows each user's stay in the various rooms, aggregating consecutive rows for the same room and computing the overall staying time, as follows:
User Area EnterTime ExitTime AggregateTime
John room1 2018-01-01 10:00:00 2018-01-01 10:10:00 00:10:00
John room2 2018-01-01 11:00:00 2018-01-01 11:15:00 00:15:00
John room1 2018-01-01 12:00:00 2018-01-01 12:25:00 00:25:00
John room3 2018-01-01 12:30:00 2018-01-01 12:50:00 00:20:00
Doe room1 2018-01-01 10:00:00 2018-01-01 10:30:00 00:30:00
From various threads, I'm fairly sure I need to use lag() and PARTITION BY, but it's not clear how.
Any hints?
Best regards.

AggregateTime isn't really an aggregate in your expected result. It is the difference between the maximum exit time and the minimum enter time of each block, where a block is a set of contiguous rows with the same (users, area).
with block as (
    select users, area, entertime, exittime,
           -- the difference of the two row numbers stays constant within each
           -- run of consecutive rows that share the same (users, area)
           (row_number() over (order by users, entertime) -
            row_number() over (partition by users, area order by entertime)
           ) as grp
    from your_table
    order by 1,2,3
)
select users, area, entertime, exittime, (exittime - entertime) as duration
from (select users, area, grp, min(entertime) as entertime, max(exittime) as exittime
      from block
      group by users, area, grp
     ) t2
order by 5;
I made some changes to 'Resetting Row number according to record data change' to arrive at the solution.
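To make the grouping logic easier to follow outside SQL, here is a minimal Python sketch of the same idea, using the sample rows from the question. The helper names ts and merge_stays are made up for this illustration; itertools.groupby starts a new group whenever (user, area) changes between consecutive sorted rows, which should produce the same blocks as the row_number() difference.

```python
from datetime import datetime
from itertools import groupby


def ts(text):
    """Parse a 'YYYY-MM-DD HH:MM:SS' string (hypothetical helper)."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


# Sample rows from the question: (user, area, enter_time, exit_time)
rows = [
    ("John", "room1", ts("2018-01-01 10:00:00"), ts("2018-01-01 10:00:05")),
    ("Doe", "room1", ts("2018-01-01 10:00:00"), ts("2018-01-01 10:10:00")),
    ("John", "room1", ts("2018-01-01 10:05:00"), ts("2018-01-01 10:10:00")),
    ("Doe", "room1", ts("2018-01-01 10:20:00"), ts("2018-01-01 10:30:00")),
    ("John", "room2", ts("2018-01-01 11:00:00"), ts("2018-01-01 11:05:00")),
    ("John", "room2", ts("2018-01-01 11:08:00"), ts("2018-01-01 11:15:00")),
    ("John", "room1", ts("2018-01-01 12:00:00"), ts("2018-01-01 12:08:00")),
    ("John", "room1", ts("2018-01-01 12:10:00"), ts("2018-01-01 12:20:00")),
    ("John", "room1", ts("2018-01-01 12:25:00"), ts("2018-01-01 12:25:00")),
    ("John", "room3", ts("2018-01-01 12:30:00"), ts("2018-01-01 12:35:00")),
    ("John", "room3", ts("2018-01-01 12:40:00"), ts("2018-01-01 12:50:00")),
]


def merge_stays(rows):
    """Collapse consecutive rows with the same (user, area) into one stay."""
    # Same ordering as the first row_number(): by user, then by enter time.
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))
    merged = []
    for (user, area), block in groupby(ordered, key=lambda r: (r[0], r[1])):
        block = list(block)
        enter = min(r[2] for r in block)
        leave = max(r[3] for r in block)
        merged.append((user, area, enter, leave, leave - enter))
    return merged


stays = merge_stays(rows)
for user, area, enter, leave, duration in stays:
    print(user, area, enter, leave, duration)
```

John's two separate visits to room1 stay separate because the room2 rows fall between them in the sort order.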

Related

How to INSERT repeated values like (a,b,c,d,a,b,c,d....) in DB table?

I'm trying to build a work schedule table.
I have a table like this:
shift_starts_dt shift_type
2022-01-01 08:00:00 Day
2022-01-01 20:00:00 Night
2022-01-02 08:00:00 Day
2022-01-02 20:00:00 Night
2022-01-03 08:00:00 Day
2022-01-03 20:00:00 Night
2022-01-04 08:00:00 Day
2022-01-04 20:00:00 Night
and so on, until the end of the year.
I can't figure out how to add repeating values to the table.
I want to add a 'shift_name' column that contains 'A', 'B', 'C', 'D' (it works like a team name).
What query should I use to get the following result?
shift_starts_dt shift_type shift_name
2022-01-01 08:00:00 Day 'A'
2022-01-01 20:00:00 Night 'B'
2022-01-02 08:00:00 Day 'C'
2022-01-02 20:00:00 Night 'D'
2022-01-03 08:00:00 Day 'A'
2022-01-03 20:00:00 Night 'B'
2022-01-04 08:00:00 Day 'C'
2022-01-04 20:00:00 Night 'D'
...
Use the number of half days since Jan 1, modulo 4, to index an array:
select
    shift_starts_dt,
    shift_type,
    -- 43200 seconds = 12 hours; Postgres arrays are 1-based, hence the + 1
    (array['A','B','C','D'])[(extract(epoch from shift_starts_dt - '2022-01-01')::int / 43200) % 4 + 1] as shift_name
from work_schedule
You could replace '2022-01-01' with (select min(shift_starts_dt) from work_schedule) for a more general solution.
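The arithmetic behind the array index can be checked with a small Python sketch. The function name shift_name and the constants are assumptions made for this illustration; the base date is midnight on 2022-01-01, as in the query.

```python
from datetime import datetime

TEAMS = ["A", "B", "C", "D"]
HALF_DAY_SECONDS = 12 * 60 * 60  # 43200 seconds, the length of one shift


def shift_name(shift_start, base):
    """Pick a team by counting whole half days elapsed since base."""
    half_days = int((shift_start - base).total_seconds()) // HALF_DAY_SECONDS
    return TEAMS[half_days % len(TEAMS)]  # Python lists are 0-based, so no + 1


base = datetime(2022, 1, 1)
shifts = [
    datetime(2022, 1, 1, 8), datetime(2022, 1, 1, 20),
    datetime(2022, 1, 2, 8), datetime(2022, 1, 2, 20),
    datetime(2022, 1, 3, 8),
]
names = [shift_name(s, base) for s in shifts]
print(names)
```

The 08:00 shift on the first day is 8 hours past the base, which is 0 whole half days, so it maps to 'A'; the 20:00 shift is 1 whole half day, so it maps to 'B', and the cycle repeats every two days.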

How to extract the whole hours from a time range in Postgresql and get the duration of each extracted hour

I'm new to databases (and even more so to Postgres), so I'd appreciate any help. I have a table like this:
id_interaction start_time end_time
0001 2022-06-03 12:40:10 2022-06-03 12:45:16
0002 2022-06-04 10:50:40 2022-06-04 11:10:12
0003 2022-06-04 16:30:00 2022-06-04 18:20:00
0004 2022-06-05 23:00:00 2022-06-06 10:30:12
Basically, I need a query that splits each interaction by clock hour and returns the duration within each hour, for example:
id_interaction start_time end_time hour duration
0001 2022-06-03 12:40:10 2022-06-03 12:45:16 12:00:00 00:05:06
0002 2022-06-04 10:50:40 2022-06-04 11:10:12 10:00:00 00:09:20
0002 2022-06-04 10:50:40 2022-06-04 11:10:12 11:00:00 00:10:12
0003 2022-06-04 16:30:00 2022-06-04 18:20:00 16:00:00 00:30:00
0003 2022-06-04 16:30:00 2022-06-04 18:20:00 17:00:00 01:00:00
0003 2022-06-04 16:30:00 2022-06-04 18:20:00 18:00:00 00:20:00
0004 2022-06-05 23:00:00 2022-06-06 03:30:12 23:00:00 01:00:00
0004 2022-06-05 23:00:00 2022-06-06 03:30:12 24:00:00 01:00:00
0004 2022-06-05 23:00:00 2022-06-06 03:30:12 01:00:00 01:00:00
0004 2022-06-05 23:00:00 2022-06-06 03:30:12 02:00:00 01:00:00
0004 2022-06-05 23:00:00 2022-06-06 03:30:12 03:00:00 00:30:12
I need all the hours from start to finish. For example: if an id starts at 17:10 and ends at 19:00, I need the duration of 17:00, 18:00 and 19:00
If you're trying to get the duration in each whole hour interval overlapped by your data, this can be achieved by rounding timestamps using date_trunc(), using generate_series() to move around the intervals and casting between time, interval and timestamp:
create or replace function hours_crossed(starts timestamp, ends timestamp)
returns integer
language sql as '
    select case
        when date_trunc(''hour'',starts)=date_trunc(''hour'',ends)
            then 0
        when date_trunc(''hour'',starts)=starts
            then floor(extract(''epoch'' from ends-starts)::numeric/60.0/60.0)
        else floor(extract(''epoch'' from ends-starts)::numeric/60.0/60.0) +1
    end';

-- id_interacao, tempo_inicial and tempo_final correspond to the question's
-- id_interaction, start_time and end_time columns
select * from (
    select
        id_interacao,
        tempo_inicial,
        tempo_final,
        to_char(hora, 'HH24:00')::time as hora,
        least(tempo_final, hora + '1 hour'::interval)
            - greatest(tempo_inicial, hora)
            as duracao
    from (select
              *,
              date_trunc('hour',tempo_inicial)
                  + (generate_series(0, hours_crossed(tempo_inicial,tempo_final))::text||' hours')::interval
                  as hora
          from test_times
         ) a
) a
-- keep only hours that overlap the interval: this skips zero-length rows and the
-- surplus trailing hour that hours_crossed can produce (e.g. 16:10 to 18:50)
where duracao > '0'::interval;
This also fixes your first entry that lasts 5 minutes but shows as 45.
You'll need to decide how you want to handle zero-length intervals and ones that end on an exact hour; I added a condition to skip them.
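The same clipping logic (truncate to the hour, step one hour at a time, clip each hour to the interval) can be sketched in Python to check the expected durations. The function name split_by_hour is an assumption for this illustration, and like the query it skips hours with no overlap.

```python
from datetime import datetime, timedelta


def split_by_hour(start, end):
    """Return (hour_start, duration) for each clock hour the interval overlaps."""
    pieces = []
    # Equivalent of date_trunc('hour', start)
    hour = start.replace(minute=0, second=0, microsecond=0)
    while hour < end:
        next_hour = hour + timedelta(hours=1)
        # Clip the hour window to the interval, as least()/greatest() do in SQL
        duration = min(end, next_hour) - max(start, hour)
        if duration > timedelta(0):
            pieces.append((hour, duration))
        hour = next_hour
    return pieces


pieces = split_by_hour(datetime(2022, 6, 4, 10, 50, 40),
                       datetime(2022, 6, 4, 11, 10, 12))
for hour, duration in pieces:
    print(hour.strftime("%H:%M:%S"), duration)
```

For interaction 0002 this gives 9 minutes 20 seconds in the 10:00 hour and 10 minutes 12 seconds in the 11:00 hour, matching the expected output in the question. An interval that ends exactly on the hour (17:10 to 19:00) produces no 19:00 row under this rule.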

TSQL: Need to Count Multiple Columns and Group by their Contents

I have the following dataset:
StartDate EnterDate Order#
---------- ---------- ------
2018-01-01 2018-01-01 1
2018-01-01 2018-01-01 2
2018-01-01 2018-01-02 3
2018-01-02 2018-01-02 4
2018-01-02 2018-01-03 5
2018-01-02 2018-01-03 6
2018-01-03 2018-01-04 7
2018-01-03 2018-01-04 8
2018-01-03 2018-01-04 9
2018-01-03 2018-01-05 10
I need to COUNT the number of dates in each column.
Example output:
Date StartDate EnterDate
---------- --------- ---------
01-01-2018 3 2
01-02-2018 3 2
01-03-2018 4 2
01-04-2018 0 3
01-05-2018 0 1
NULL can be substituted for 0.
You can use a full join to achieve that:
select
    Date = isnull(t.StartDate, q.EnterDate),
    StartDate = isnull(t.cnt, 0),
    EnterDate = isnull(q.cnt, 0)
from (
    select StartDate, count(*) cnt
    from myTable
    group by StartDate
) t
full join (
    select EnterDate, count(*) cnt
    from myTable
    group by EnterDate
) q on t.StartDate = q.EnterDate
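As a cross-check of what the full join computes, here is a short Python sketch over the sample rows: count each column separately, then walk the union of the dates. The variable names are chosen for this illustration only.

```python
from collections import Counter

# Sample rows from the question: (StartDate, EnterDate, Order#)
orders = [
    ("2018-01-01", "2018-01-01", 1),
    ("2018-01-01", "2018-01-01", 2),
    ("2018-01-01", "2018-01-02", 3),
    ("2018-01-02", "2018-01-02", 4),
    ("2018-01-02", "2018-01-03", 5),
    ("2018-01-02", "2018-01-03", 6),
    ("2018-01-03", "2018-01-04", 7),
    ("2018-01-03", "2018-01-04", 8),
    ("2018-01-03", "2018-01-04", 9),
    ("2018-01-03", "2018-01-05", 10),
]

# One count per column, like the two grouped subqueries
start_counts = Counter(start for start, _, _ in orders)
enter_counts = Counter(enter for _, enter, _ in orders)

# The union of both key sets plays the role of the full join;
# a Counter returns 0 for a missing key, like isnull(cnt, 0)
all_dates = sorted(set(start_counts) | set(enter_counts))
result = [(d, start_counts[d], enter_counts[d]) for d in all_dates]

for row in result:
    print(*row)
```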

Select from table removing similar rows - PostgreSQL

There is a table with document revisions and authors. Looks like this:
doc_id rev_id rev_date editor title,content so on....
123 1 2016-01-01 03:20 Bill ......
123 2 2016-01-01 03:40 Bill
123 3 2016-01-01 03:50 Bill
123 4 2016-01-01 04:10 Bill
123 5 2016-01-01 08:40 Alice
123 6 2016-01-01 08:41 Alice
123 7 2016-01-01 09:00 Bill
123 8 2016-01-01 10:40 Cate
942 9 2016-01-01 11:10 Alice
942 10 2016-01-01 11:15 Bill
942 15 2016-01-01 11:17 Bill
I need to find the moments when a document was handed over to another editor, that is, only the first row of every editing series.
Like so:
doc_id rev_id rev_date editor title,content so on....
123 1 2016-01-01 03:20 Bill ......
123 5 2016-01-01 08:40 Alice
123 7 2016-01-01 09:00 Bill
123 8 2016-01-01 10:40 Cate
942 9 2016-01-01 11:10 Alice
942 10 2016-01-01 11:15 Bill
If I use DISTINCT ON (doc_id, editor), it re-sorts the table and I get only one row per document and editor, which is incorrect.
Of course, I could dump everything and filter with shell tools like awk | sort | uniq, but that is not practical for big tables.
Window functions like FIRST_ROW do not help much, because I cannot partition by doc_id, editor without mixing the separate series together.
How can I do this better?
Thank you.
You can use lag() to get the previous value, and then a simple comparison:
select t.*
from (select t.*,
             lag(editor) over (partition by doc_id order by rev_date) as prev_editor
      from t
     ) t
where prev_editor is null or prev_editor <> editor;
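The comparison with the previous row can be illustrated in plain Python, using the sample revisions from the question. The function name handovers and the tuple layout are assumptions for this sketch; the dictionary of last-seen editors stands in for lag() partitioned by doc_id.

```python
# Sample rows from the question: (doc_id, rev_id, rev_date, editor)
revisions = [
    (123, 1, "2016-01-01 03:20", "Bill"),
    (123, 2, "2016-01-01 03:40", "Bill"),
    (123, 3, "2016-01-01 03:50", "Bill"),
    (123, 4, "2016-01-01 04:10", "Bill"),
    (123, 5, "2016-01-01 08:40", "Alice"),
    (123, 6, "2016-01-01 08:41", "Alice"),
    (123, 7, "2016-01-01 09:00", "Bill"),
    (123, 8, "2016-01-01 10:40", "Cate"),
    (942, 9, "2016-01-01 11:10", "Alice"),
    (942, 10, "2016-01-01 11:15", "Bill"),
    (942, 15, "2016-01-01 11:17", "Bill"),
]


def handovers(revisions):
    """Keep the first revision of every run of consecutive edits by one editor."""
    # Order by document, then by date (ISO-style strings sort chronologically)
    ordered = sorted(revisions, key=lambda r: (r[0], r[2]))
    previous_editor = {}  # doc_id -> editor of the previous revision
    kept = []
    for doc_id, rev_id, rev_date, editor in ordered:
        # A missing entry (None) plays the role of "prev_editor is null"
        if previous_editor.get(doc_id) != editor:
            kept.append((doc_id, rev_id, rev_date, editor))
        previous_editor[doc_id] = editor
    return kept


first_rows = handovers(revisions)
print([rev_id for _, rev_id, _, _ in first_rows])
```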

Pandas: Combine resampling with groupby and calculate time differences

I am analysing trading data. I would like to use Pandas to examine the times when the traders are active.
In particular, I am trying to extract the difference in minutes between the first trades of each trader on each day, and accumulate it on a monthly basis.
The data looks like this:
Timestamp (Datetime) | Buyer | Volume
--------------------------------------
2012-01-01 09:00:00 | John | 10
2012-01-01 10:00:00 | Mark | 10
2012-01-01 16:00:00 | Mark | 10
2012-01-01 11:00:00 | Kevin | 10
2012-02-01 10:00:00 | Mark | 10
2012-02-01 09:00:00 | John | 10
2012-02-01 17:00:00 | Mark | 10
Right now I use resampling to retrieve the first trade on a daily basis. However, I also want to group by the buyer to calculate the differences between their trading times, like this:
Timestamp (Datetime) | Buyer | Volume
--------------------------------------
2012-01-01 09:00:00 | John | 10
2012-01-01 10:00:00 | Mark | 10
2012-01-01 11:00:00 | Kevin | 10
2012-01-02 10:00:00 | Mark | 10
2012-01-02 09:00:00 | John | 10
Overall I am looking to calculate the differences in minutes between the first trades on a daily basis for each trader.
Update
For example in the case of John on the 2012-01-01: Dist = 60 (Diff John-Mark) + 120 (Diff John-Kevin) = 180
I would highly appreciate it if anyone has an idea how to do this.
Thank you
Your original frame (the resampled one)
In [71]: df_orig
Out[71]:
buyer date volume
0 John 2012-01-01 09:00:00 10
1 Mark 2012-01-01 10:00:00 10
2 Kevin 2012-01-01 11:00:00 10
3 Mark 2012-01-02 10:00:00 10
4 John 2012-01-02 09:00:00 10
Set the index to the date column, keeping the date column in place
In [75]: df = df_orig.set_index('date',drop=False)
Create this aggregation function
def f(frame):
    frame.sort('date',inplace=True)
    frame['start'] = frame.date.iloc[0]
    return frame
Groupby the single date
In [74]: x = df.groupby(pd.TimeGrouper('1d')).apply(f)
Create the differential in minutes
In [86]: x['diff'] = (x.date-x.start).apply(lambda x: float(x.item().total_seconds())/60)
In [87]: x
Out[87]:
buyer date volume start diff
date
2012-01-01 2012-01-01 09:00:00 John 2012-01-01 09:00:00 10 2012-01-01 09:00:00 0
2012-01-01 10:00:00 Mark 2012-01-01 10:00:00 10 2012-01-01 09:00:00 60
2012-01-01 11:00:00 Kevin 2012-01-01 11:00:00 10 2012-01-01 09:00:00 120
2012-01-02 2012-01-02 09:00:00 John 2012-01-02 09:00:00 10 2012-01-02 09:00:00 0
2012-01-02 10:00:00 Mark 2012-01-02 10:00:00 10 2012-01-02 09:00:00 60
Here's the explanation. We use the TimeGrouper to group by date, and each day's frame is passed to the function f. This function then takes the first date of the day (the sort is necessary here). You subtract this from the date on each entry to get a timedelta64, which is then converted to minutes (this is a bit hacky right now because of some numpy issues; it should be more natural in 0.12). Note that DataFrame.sort and pd.TimeGrouper were removed in later pandas versions; sort_values and pd.Grouper(freq='1D') are their replacements.
Thanks for your update. I originally thought you wanted the diff per buyer, not from the first buyer, but that's just a minor tweak.
Update:
To track the buyer name as well (which corresponds to the start date), just include it in the function f:
def f(frame):
    frame.sort('date',inplace=True)
    frame['start'] = frame.date.iloc[0]
    frame['start_buyer'] = frame.buyer.iloc[0]
    return frame
Then you can group by it at the end:
In [14]: x.groupby(['start_buyer']).sum()
Out[14]:
diff
start_buyer
John 240
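For readers who want to verify the arithmetic without pandas, the same computation can be sketched with the standard library: group the first trades by calendar day, take the earliest trade of each day as the start, and sum the offsets in minutes under that day's first buyer. The variable names are assumptions for this illustration, and the rows are the resampled frame shown above.

```python
from collections import defaultdict
from datetime import datetime

# First trade per buyer per day, as in the resampled frame: (buyer, timestamp)
trades = [
    ("John", datetime(2012, 1, 1, 9, 0)),
    ("Mark", datetime(2012, 1, 1, 10, 0)),
    ("Kevin", datetime(2012, 1, 1, 11, 0)),
    ("Mark", datetime(2012, 1, 2, 10, 0)),
    ("John", datetime(2012, 1, 2, 9, 0)),
]

# Group by calendar day (the role of the daily grouper)
by_day = defaultdict(list)
for buyer, when in trades:
    by_day[when.date()].append((when, buyer))

# Sum each trade's offset from the day's first trade, keyed by the first buyer
totals = defaultdict(float)
for day in sorted(by_day):
    day_trades = sorted(by_day[day])  # earliest trade first
    start, start_buyer = day_trades[0]
    for when, _ in day_trades:
        totals[start_buyer] += (when - start).total_seconds() / 60

print(dict(totals))
```

John trades first on both days, so his total is 180 minutes for the first day plus 60 for the second, which agrees with the 240 in the output above.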