Aggregate difference between time ranges in Postgres - postgresql

I'm on Postgres 15 and using the multirange type to aggregate overlapping time ranges into islands:
CREATE TABLE time_entries (
id bigint NOT NULL,
contract_id bigint,
"from" timestamp(6) without time zone,
"to" timestamp(6) without time zone,
type varchar,
range tsrange GENERATED ALWAYS AS (tsrange("from", "to")) STORED
);
INSERT INTO time_entries VALUES (1, 1, '2022-12-07T09:00', '2022-12-07T10:00', 'billed');
INSERT INTO time_entries VALUES (2, 1, '2022-12-07T08:00', '2022-12-07T10:30', 'punch_clock');
INSERT INTO time_entries VALUES (1, 1, '2022-12-07T12:00', '2022-12-07T12:30', 'billed');
INSERT INTO time_entries VALUES (2, 1, '2022-12-07T11:30', '2022-12-07T12:15', 'punch_clock');
INSERT INTO time_entries VALUES (2, 1, '2022-12-07T13:00', '2022-12-07T13:30', 'billed');
INSERT INTO time_entries VALUES (2, 1, '2022-12-07T13:15', '2022-12-07T13:45', 'punch_clock');
INSERT INTO time_entries VALUES (2, 1, '2022-12-07T14:00', '2022-12-07T15:00', 'punch_clock');
SELECT contract_id, unnest(range_agg(range)) AS range FROM time_entries GROUP BY contract_id;
Working db<>fiddle: https://dbfiddle.uk/V9a7H8nJ
This results in these merged time ranges:
 contract_id | range
-------------+-----------------------------------------------
 1           | ["2022-12-07 08:00:00","2022-12-07 10:30:00")
 1           | ["2022-12-07 11:30:00","2022-12-07 12:30:00")
 1           | ["2022-12-07 13:00:00","2022-12-07 13:45:00")
 1           | ["2022-12-07 14:00:00","2022-12-07 15:00:00")
But now I need another metric: the unbilled hours of a contract. I have two types of entries, punch_clock and billed. The former is a tracker that runs in the background while the user is working; the latter is a parallel tracker that runs for specific projects.
How can I calculate the difference between the overlapping punch_clock and billed entries?
The desired result would be:
 contract_id | unbilled
-------------+-----------------------------------------------
 1           | ["2022-12-07 08:00:00","2022-12-07 09:00:00")
 1           | ["2022-12-07 10:00:00","2022-12-07 10:30:00")
 1           | ["2022-12-07 11:30:00","2022-12-07 12:00:00")
 1           | ["2022-12-07 13:30:00","2022-12-07 13:45:00")
 1           | ["2022-12-07 14:00:00","2022-12-07 15:00:00")
I played around with grouping by type, but it seems there is no range_difference_agg aggregate function for multiranges, only range_agg for unions and range_intersect_agg for intersections: https://www.postgresql.org/docs/current/functions-aggregate.html

First, I suggest changing the definition of the time_entries table so that the range column uses tsmultirange instead of tsrange:
CREATE TABLE time_entries (
id bigint NOT NULL,
contract_id bigint,
"from" timestamp(6) without time zone,
"to" timestamp(6) without time zone,
type varchar,
range tsmultirange GENERATED ALWAYS AS (tsmultirange(tsrange("from", "to"))) STORED
);
Then, according to the data sample you provided, you don't need an aggregate function. A self-join query should provide the expected result:
SELECT p.contract_id, unnest(CASE WHEN b.range IS NULL THEN p.range ELSE p.range - b.range END) AS unbilled
FROM ( SELECT contract_id, range FROM time_entries WHERE type = 'punch_clock') AS p
LEFT JOIN (SELECT contract_id, range FROM time_entries WHERE type = 'billed') AS b
ON p.contract_id = b.contract_id
AND p.range && b.range
Result:
 contract_id | unbilled
-------------+-----------------------------------------------
 1           | ["2022-12-07 08:00:00","2022-12-07 09:00:00")
 1           | ["2022-12-07 10:00:00","2022-12-07 10:30:00")
 1           | ["2022-12-07 11:30:00","2022-12-07 12:00:00")
 1           | ["2022-12-07 13:30:00","2022-12-07 13:45:00")
 1           | ["2022-12-07 14:00:00","2022-12-07 15:00:00")
This query works as long as only one 'billed' time range intersects a given 'punch_clock' time range. If several 'billed' time ranges may intersect the same 'punch_clock' time range, then you will need to create your own aggregate function based on the multirange difference operator:
CREATE OR REPLACE FUNCTION multirange_diff (x anymultirange, y anymultirange, z anymultirange)
RETURNS anymultirange LANGUAGE sql IMMUTABLE AS $$
  SELECT CASE WHEN x IS NULL THEN COALESCE(y - z, y) ELSE COALESCE(x - z, x) END;
$$;

CREATE OR REPLACE AGGREGATE multirange_diff_agg(anymultirange, anymultirange) (
  stype = anymultirange,
  sfunc = multirange_diff
);
The query using the new aggregate function is:
SELECT p.contract_id, unnest(multirange_diff_agg(p.range, b.range)) AS unbilled
FROM ( SELECT contract_id, range FROM time_entries WHERE type = 'punch_clock') AS p
LEFT JOIN (SELECT contract_id, range FROM time_entries WHERE type = 'billed') AS b
ON p.contract_id = b.contract_id
AND p.range && b.range
GROUP BY p.contract_id, p.range
As an example, after inserting this new row into the time_entries table:
INSERT INTO time_entries VALUES (1, 1, '2022-12-07T10:10', '2022-12-07T10:20', 'billed');
The result of the above query is:
 contract_id | unbilled
-------------+-----------------------------------------------
 1           | ["2022-12-07 08:00:00","2022-12-07 09:00:00")
 1           | ["2022-12-07 10:00:00","2022-12-07 10:10:00")
 1           | ["2022-12-07 10:20:00","2022-12-07 10:30:00")
 1           | ["2022-12-07 11:30:00","2022-12-07 12:00:00")
 1           | ["2022-12-07 13:30:00","2022-12-07 13:45:00")
 1           | ["2022-12-07 14:00:00","2022-12-07 15:00:00")
UPDATE
Even when several 'billed' time ranges intersect the same 'punch_clock' time range, you don't need an aggregate function. The following query should provide the expected result (the COALESCE guards contracts that have no 'billed' entries at all, where the filtered aggregate would be NULL and the difference would therefore also be NULL):
SELECT contract_id
     , unnest ( range_agg(range) FILTER (WHERE type = 'punch_clock')
              - COALESCE(range_agg(range) FILTER (WHERE type = 'billed'), '{}'::tsmultirange)
              ) AS unbilled
FROM time_entries
GROUP BY contract_id
See the test result in dbfiddle.
For more information about creating an aggregate function, see the manual.
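Since the metric the question asks for is unbilled hours, the merged ranges can be reduced to a duration by summing upper(r) - lower(r). A hedged sketch building on the grouped query above (assumes all ranges are bounded, as in the sample data):

```sql
-- Total unbilled time per contract: subtract the merged 'billed' ranges
-- from the merged 'punch_clock' ranges, then sum the widths of the pieces.
SELECT contract_id,
       SUM(upper(r) - lower(r)) AS unbilled_time
FROM (
  SELECT contract_id,
         unnest( range_agg(range) FILTER (WHERE type = 'punch_clock')
               - range_agg(range) FILTER (WHERE type = 'billed') ) AS r
  FROM time_entries
  GROUP BY contract_id
) AS u
GROUP BY contract_id;
```

With the sample data this yields a single interval value per contract; unbounded ranges would need guarding before upper()/lower() are applied.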

Related

SQL partition by query optimization

I have the prices table below, and I want to obtain a last-30-days price array and the last-year price from it, as shown below:
CREATE TABLE prices
(
id integer NOT NULL,
"time" timestamp without time zone NOT NULL,
close double precision NOT NULL,
CONSTRAINT prices_pkey PRIMARY KEY (id, "time")
)
select id, time,
       first_value(close) over (partition by id order by time range between '1 year' preceding and current row) as prev_year_close,
       array_agg(p.close) over (partition by id order by time rows between 30 preceding and current row) as prices_30
from prices p
However, I want to place a WHERE clause on the prices table to get the last-30-days price array and last-year price only for some specific rows, e.g. WHERE time >= now() - interval '1 week' (so the query runs only over the last week's values, as opposed to the entire table).
But a WHERE clause pre-filters the rows, and the window partition then runs only over those filtered rows, which produces wrong results. It gives results like:
time, id, last_30_days
-1day, X, [A,B,C,D,E, F,G]
-2day, X, [A,B,C,D,E,F]
-3day, X, [A,B,C,D,E]
-4day, X, [A,B,C,D]
-5day, X, [A,B,C]
-6day, X, [A,B]
-7day, X, [A]
How do I fix this so that the window frame always takes 30 values, irrespective of the WHERE condition, without having to run the query on the entire table and then select a subset of rows? My prices table is huge, and running the query over the whole table is very expensive.
EDIT
CREATE TABLE prices
(
id integer NOT NULL,
"time" timestamp without time zone NOT NULL,
close double precision NOT NULL,
prev_30 double precision[],
prev_year double precision,
CONSTRAINT prices_pkey PRIMARY KEY (id, "time")
)
Use a subquery:
SELECT *
FROM (select id, time,
             first_value(close) over (partition by id order by time range between '1 year' preceding and CURRENT ROW) as prev_year_close,
             array_agg(p.close) OVER (PARTITION BY id ORDER BY time ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) as prices_30
      from prices p
      WHERE time >= current_timestamp - INTERVAL '1 year 1 week') AS q
WHERE time >= current_timestamp - INTERVAL '1 week';
The inner WHERE keeps one extra year of history, so every row that survives the outer filter still sees its full window frame.
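Both filters above are on "time" alone, so a plain btree index can let Postgres skip the older rows entirely. A hedged sketch (the index name is an assumption, not part of the original schema):

```sql
-- Hypothetical supporting index: lets the one-year-one-week pre-filter
-- use an index scan instead of reading the whole huge table.
CREATE INDEX IF NOT EXISTS prices_time_idx ON prices ("time");
```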

How to correctly test whether a date is within time intervals

I have the timestamp of a user action, and several time intervals during which the user has grants to perform the action.
I need to check whether the timestamp of this action is within at least one of the time intervals or not.
Table with users:
CREATE TABLE ausers (
id serial PRIMARY KEY,
user_name VARCHAR(255) default NULL,
action_date TIMESTAMP
);
INSERT INTO ausers VALUES(1,'Jhon', '2018-02-21 15:05:06');
INSERT INTO ausers VALUES(2,'Bob', '2018-05-24 12:22:26');
#|id|user_name|action_date
----------------------------------
1|1 |Jhon |21.02.2018 15:05:06
2|2 |Bob |24.05.2018 12:22:26
Table with grants:
CREATE TABLE user_grants (
id serial PRIMARY KEY,
user_id INTEGER,
start_date TIMESTAMP,
end_date TIMESTAMP
);
INSERT INTO user_grants VALUES(1, 1, '2018-01-01 00:00:01', '2018-03-01 00:00:00');
INSERT INTO user_grants VALUES(2, 1, '2018-06-01 00:00:01', '2018-09-01 00:00:00');
INSERT INTO user_grants VALUES(3, 2, '2018-01-01 00:00:01', '2018-02-01 00:00:00');
INSERT INTO user_grants VALUES(4, 2, '2018-02-01 00:00:01', '2018-03-01 00:00:00');
#|id|user_id|start_date |end_date
------------------------------------------------------
1|1 |1 |01.01.2018 00:00:01 |01.03.2018 00:00:00
2|2 |1 |01.06.2018 00:00:01 |01.09.2018 00:00:00
3|3 |2 |01.01.2018 00:00:01 |01.02.2018 00:00:00
4|4 |2 |01.02.2018 00:00:01 |01.03.2018 00:00:00
The query:
select u.user_name,
case
when array_agg(gr.range) #> array_agg(tstzrange(u.action_date, u.action_date, '[]')) then 'Yes'
else 'No'
end as "permition was granted"
from ausers u
left join (select tstzrange(ug.start_date, ug.end_date, '[]') as range, ug.user_id as uid
from user_grants ug) as gr on gr.uid = u.id
group by u.user_name;
Result:
#|user_name|permition was granted
---------------------------------
1|Bob |No
2|Jhon |No
Timestamp '01.02.2018 15:05:06' is within the "01.01.2018 00:00:01, 01.03.2018 00:00:00" range, so "Bob" had grants to perform the action, and there should be "Yes" in the first row, not "No".
The expected output is like this:
#|user_name|permition was granted
---------------------------------
1|Bob |Yes
2|Jhon |No
I tried to test like this:
select array_agg(tstzrange('2018-02-21 15:05:06', '2018-02-21 15:05:06', '[]')) <# array_agg(tstzrange('2018-01-01 00:00:01', '2018-03-01 00:00:01', '[]'));
#|?column?
----------
|false
Result is "false".
But if I remove the array_agg function:
select tstzrange('2018-02-21 15:05:06', '2018-02-21 15:05:06', '[]') <# tstzrange('2018-01-01 00:00:01', '2018-03-01 00:00:01', '[]');
#|?column?
----------
|true
It works fine - the result is "true". Why? What's wrong with array_agg?
I have to use array_agg because I have several time intervals to compare.
I have to make a "fake" time interval
array_agg(tstzrange(u.action_date, u.action_date, '[]'))
from a single timestamp, because the #> operator doesn't allow comparing a timestamp with an array of time intervals.
How do I check that a date is within at least one time interval from an array of time intervals?
There are several containment operators in PostgreSQL:
tstzrange @> tstzrange tests whether the first interval contains the second;
anyarray @> anyarray tests whether the first array contains all elements of the second array.
In your query the latter applies, so it tests whether, for each interval in the second array, there is an equal interval in the first array.
There is a way to test whether an interval is contained in one of the elements of an array of intervals:
someinterval <@ ANY (array_of_intervals)
but there is no straightforward way to express your condition with an operator.
Do without an aggregate: join the two tables on containment and count the result rows.
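If you do want to keep the aggregated-array shape of the original query, the ANY construct can test the raw timestamp against every aggregated range directly, since elem <@ range is itself a containment operator. A hedged sketch against the tables above (relies on the implicit timestamp-to-timestamptz cast that tstzrange already uses in the question):

```sql
-- One row per user: 'Yes' if action_date falls in at least one grant range.
SELECT u.user_name,
       CASE WHEN u.action_date <@ ANY (array_agg(gr.range))
            THEN 'Yes' ELSE 'No' END AS "permition was granted"
FROM ausers u
LEFT JOIN (SELECT tstzrange(ug.start_date, ug.end_date, '[]') AS range,
                  ug.user_id AS uid
           FROM user_grants ug) AS gr ON gr.uid = u.id
GROUP BY u.user_name, u.action_date;
```

Users with no grants at all get a NULL array element, so the CASE falls through to 'No'.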
Since all three dates are scalar quantities, Postgres range checking is not required; a simple BETWEEN operation suffices.
select au.user_name
     , case when ug.user_id is null then 'No' else 'Yes' end as authorized
from ausers au
left join user_grants ug
       on ( au.id = ug.user_id  -- join on user_id, not the grant's own id
        and au.action_date between ug.start_date and ug.end_date
          );
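One caveat with the plain LEFT JOIN: if several grant intervals match the same action, it yields one row per match. A hedged EXISTS variant of the same BETWEEN test keeps exactly one row per user:

```sql
-- Same containment test, but EXISTS collapses multiple matching grants
-- into a single 'Yes' per user.
SELECT au.user_name,
       CASE WHEN EXISTS (SELECT 1
                         FROM user_grants ug
                         WHERE ug.user_id = au.id
                           AND au.action_date BETWEEN ug.start_date AND ug.end_date)
            THEN 'Yes' ELSE 'No' END AS authorized
FROM ausers au;
```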
BTW, I think the expected results you posted are backwards: neither user has a timestamp of '01.02.2018 15:05:06' as indicated in the description.

PostgreSQL: how do I group rows by 'nearby' timestamps

Considering the following simplified situation:
create table trans
(
id integer not null
, tm timestamp without time zone not null
, val integer not null
, cus_id integer not null
);
insert into trans
(id, tm, val, cus_id)
values
(1, '2017-12-12 16:42:00', 2, 500) --
,(2, '2017-12-12 16:42:02', 4, 501) -- <--+---------+
,(3, '2017-12-12 16:42:05', 7, 502) -- |dt=54s |
,(4, '2017-12-12 16:42:56', 3, 501) -- <--+ |dt=59s
,(5, '2017-12-12 16:43:00', 2, 503) -- |
,(6, '2017-12-12 16:43:01', 5, 501) -- <------------+
,(7, '2017-12-12 16:43:15', 6, 502) --
,(8, '2017-12-12 16:44:50', 4, 501) --
;
I want to group rows by cus_id, but also where the interval between time stamps of consecutive rows for the same cus_id is less than 1 minute.
In the example above this applies to rows with id's 2, 4 and 6. These rows have the same cus_id (501) and have intervals below 1 minute. The interval id{2,4} is 54s and for id{2,6} it is 59s. The interval id{4,6} is also below 1 minute, but it is overridden by the larger interval id{2,6}.
I need a query that gives me the output:
cus_id | tm | val
--------+---------------------+-----
501 | 2017-12-12 16:42:02 | 12
(1 row)
The tm value would be the tm of the first row, i.e. with the lowest tm. The val would be the sum(val) of the grouped rows.
In the example 3 rows are grouped, but that could also be 2, 4, 5, ...
For simplicity, I only let the rows for cus_id 501 have nearby time stamps, but in my real table, there would be a lot more of them. It contains 20M+ rows.
Is this possible?
Naive (suboptimal) solution using a CTE (a faster approach would avoid the CTE, replacing it with a joined subquery, or maybe even use a window function):
-- Step one: find the start of a cluster
-- (the start is everything after a 60 second silence)
WITH starters AS (
SELECT * FROM trans tr
WHERE NOT EXISTS (
SELECT * FROM trans nx
WHERE nx.cus_id = tr.cus_id
AND nx.tm < tr.tm
AND nx.tm >= tr.tm -'60sec'::interval
)
)
-- SELECT * FROM starters ; \q
-- Step two: join everything within 60sec to the starter
-- and aggregate the clusters
SELECT st.cus_id
, st.id AS id
, MAX(tr.id) AS max_id
, MIN(tr.tm) AS first_tm
, MAX(tr.tm) AS last_tm
, SUM(tr.val) AS val
FROM trans tr
JOIN starters st ON st.cus_id = tr.cus_id
AND st.tm <= tr.tm AND st.tm > tr.tm -'60sec'::interval
GROUP BY 1,2
ORDER BY 1,2
;
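As hinted above, a window-function rewrite is possible. This gaps-and-islands sketch starts a new cluster whenever the gap to the previous row of the same cus_id reaches 60 seconds; note the semantics differ slightly from the starter-based query, since here clusters chain together through consecutive sub-minute gaps rather than being anchored to one starter row:

```sql
SELECT cus_id, MIN(tm) AS first_tm, MAX(tm) AS last_tm, SUM(val) AS val
FROM (
  SELECT cus_id, tm, val,
         -- running count of "cluster starts" doubles as the cluster number
         COUNT(*) FILTER (WHERE gap IS NULL OR gap >= INTERVAL '60 seconds')
           OVER (PARTITION BY cus_id ORDER BY tm) AS grp
  FROM (
    SELECT cus_id, tm, val,
           tm - LAG(tm) OVER (PARTITION BY cus_id ORDER BY tm) AS gap
    FROM trans
  ) AS gaps
) AS clustered
GROUP BY cus_id, grp
ORDER BY cus_id, first_tm;
```

On the sample data this groups ids 2, 4 and 6 for cus_id 501 (first_tm 16:42:02, val 12), matching the desired output; a single pass over an index on (cus_id, tm) should scale better to 20M+ rows than the NOT EXISTS self-join.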

Closest datetime for PostgreSQL 9.5

I'm using PostgreSQL 9.5
and I have a table like this:
CREATE TABLE tracks (
track bigserial NOT NULL,
time_track timestamp,
CONSTRAINT pk_aircraft_tracks PRIMARY KEY ( track )
);
I want to obtain the track for the closest datetime value with a SELECT statement.
E.g., if I have:
 track | datatime
-------+---------------------
     1 | 2016-12-01 21:02:47
     2 | 2016-11-01 21:02:47
     3 | 2016-12-01 22:02:47
For the input datetime 2016-12-01 21:00, the track is 2.
I found Is there a postgres CLOSEST operator? - a similar question for integers.
But it is not working with datetimes on PostgreSQL 9.5:
SELECT * FROM
(
(SELECT time_track, track FROM tracks WHERE time_track >= now() ORDER BY time_track LIMIT 1) AS above
UNION ALL
(SELECT time_track, track FROM tracks WHERE time_track < now() ORDER BY time_track DESC LIMIT 1) AS below
)
ORDER BY abs(?-time_track) LIMIT 1;
The error:
ERROR: syntax error at or near "UNION"
LINE 4: UNION ALL
Track 1 is the closest to '2016-12-01 21:00':
with tracks(track, datatime) as (
values
(1, '2016-12-01 21:02:47'::timestamp),
(2, '2016-11-01 21:02:47'),
(3, '2016-12-01 22:02:47')
)
select *
from tracks
order by
case when datatime > '2016-12-01 21:00' then datatime - '2016-12-01 21:00'
else '2016-12-01 21:00' - datatime end
limit 1;
track | datatime
-------+---------------------
1 | 2016-12-01 21:02:47
(1 row)
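For completeness, the UNION attempt from the question can also be repaired: the syntax error comes from the AS above / AS below aliases on the parenthesized UNION branches, which are not allowed there (only the derived table as a whole takes an alias). A hedged sketch against the tracks table, with the ? placeholder replaced by a literal target time:

```sql
-- Pick the nearest neighbour on each side of the target, then keep the closer.
SELECT track, time_track
FROM (
  (SELECT track, time_track FROM tracks
    WHERE time_track >= timestamp '2016-12-01 21:00'
    ORDER BY time_track LIMIT 1)
  UNION ALL
  (SELECT track, time_track FROM tracks
    WHERE time_track < timestamp '2016-12-01 21:00'
    ORDER BY time_track DESC LIMIT 1)
) AS candidates
ORDER BY greatest(time_track - timestamp '2016-12-01 21:00',
                  timestamp '2016-12-01 21:00' - time_track)
LIMIT 1;
```

Unlike the ORDER BY over the whole table, this version is two index probes if time_track is indexed.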

Get Data From Postgres Table At every nth interval

Below is my table; I am inserting data into it from my Windows .NET application every second. I want to write a query that fetches data from the table at every nth interval, for example every 5 seconds. Below is the query I am using, but I am not getting the required result. Please help me.
CREATE TABLE table_1
(
timestamp_col timestamp without time zone,
value_1 bigint,
value_2 bigint
)
This is the query I am using:
select timestamp_col,value_1,value_2
from (
select timestamp_col,value_1,value_2,
INTERVAL '5 Seconds' * (row_number() OVER(ORDER BY timestamp_col) - 1 )
+ timestamp_col as r
from table_1
) as dt
Where r = 1
Use the date_part() function with the modulo operator:
select timestamp_col, value_1, value_2
from table_1
where date_part('second', timestamp_col)::int % 5 = 0
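Note this filter only returns rows whose seconds component happens to land exactly on a multiple of 5; if the inserts drift (e.g. :01, :06, :11), a hedged bucketing variant keeps the first row of every 5-second bin instead:

```sql
-- Bucket each row into a 5-second bin via floored epoch seconds,
-- then keep the earliest row of each bin with DISTINCT ON.
SELECT DISTINCT ON (bucket) timestamp_col, value_1, value_2
FROM (
  SELECT t.*,
         to_timestamp(floor(extract(epoch FROM timestamp_col) / 5) * 5) AS bucket
  FROM table_1 t
) AS binned
ORDER BY bucket, timestamp_col;
```

Swap the 5 for any n to sample at every nth second.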