Grouping Events in Postgres - postgresql

I've got an events table that is generated by user activity on a site:
timestamp | name
7:00 AM | ...
7:01 AM | ...
7:02 AM | ...
7:30 AM | ...
7:31 AM | ...
7:32 AM | ...
8:01 AM | ...
8:03 AM | ...
8:05 AM | ...
8:08 AM | ...
8:09 AM | ...
I'd like to aggregate over the events to provide a view of when a user is active. I'm defining active to mean the period in which an event is within +/- 2 minutes. For the above that'd mean:
from | till
7:00 AM | 7:02 AM
7:30 AM | 7:32 AM
8:01 AM | 8:05 AM
8:08 AM | 8:09 AM
What's the best way to write a query that'll aggregate in that method? Is it possible via a WINDOW function or self join or is PL/SQL required?

Use two window functions: one to calculate intervals between contiguous events (gaps) and another to find series of gaps less or equal 2 minutes:
select arr[1] as "from", arr[cardinality(arr)] as "till"
from (
select array_agg(timestamp order by timestamp) arr
from (
select timestamp, sum((gap > '2m' )::int) over w
from (
select timestamp, coalesce(timestamp - lag(timestamp) over w, '3m') gap
from events
window w as (order by timestamp)
) s
window w as (order by timestamp)
) s
group by sum
) s
from | till
----------+----------
07:00:00 | 07:02:00
07:30:00 | 07:32:00
08:01:00 | 08:05:00
(3 rows)
Test it here.

By grouping them around half-hour flooring and getting min & max values:
WITH x(t) AS ( VALUES
('7:02 AM'::TIME),('7:01 AM'::TIME),('7:00 AM'::TIME),
('7:30 AM'::TIME),('7:31 AM'::TIME),('7:32 AM'::TIME),
('8:01 AM'::TIME),('8:03 AM'::TIME),('8:05 AM'::TIME)
)
SELECT MIN(t) "from", MAX(t) "till"
FROM (select t, date_trunc('hour', t) +
CASE WHEN (t-date_trunc('hour', t)) >= '30 minutes'::interval
THEN '30 minutes'::interval ELSE '0'::interval END t1 FROM x ) y
GROUP BY t1 ORDER BY t1;
You can apply the same receipt with datetime values like:
WITH x(t) AS (
SELECT '2017-01-01'::TIMESTAMP + (RANDOM()*1440*'1 minute'::INTERVAL) t
FROM GENERATE_SERIES(0,1000))
SELECT MIN...

Related

postgres, group by date, and bucketize per hour

I would like to create a result object that can be used with Grafana for a heatmap. In order to display the data correctly I need it the output to be like:
| date | 00:00 | 01:00 | 02:00 | 03:00 | ...etc |
| 2023-01-01 | 1 | 2 | 0 | 1 | ... |
| 2023-01-02 | 0 | 0 | 1 | 1 | ... |
| 2023-01-03 | 4 | 0 | 2 | 0 | ... |
my data table structure:
trades
-----
id
closed_at
asset
So far, I know that I need to use generate_series and use the interval function to return the hours, but I need my query to plot these hours as columns, but I've not been able to do that, as its getting a bit too advanced.
So far I have the following query:
SELECT
closed_at::DATE,
COUNT(id)
FROM trades
GROUP BY closed_at
ORDER BY closed_at
It now shows the amount of rows grouped by the days, I want to further aggregate the data, so it outputs the count per hour, as shown above.
Thanks for your help!
You can add more columns, now I only add 0:00 to 05:00.
filter usage: https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-AGGREGATES
date_trunc usage: https://www.postgresql.org/docs/current/functions-datetime.html#FUNCTIONS-DATETIME-TRUNC
BEGIN;
CREATE temp TABLE trades (
id bigint GENERATED BY DEFAULT AS IDENTITY,
closed_a timestamp,
asset text
) ON COMMIT DROP;
INSERT INTO trades (closed_a)
SELECT
date '2023-01-01' + interval '10 min' * (random() * i * 10)::int
FROM
generate_series(1, 10) g (i);
INSERT INTO trades (closed_a)
SELECT
date '2023-01-02' + interval '10 min' * (random() * i * 10)::int
FROM
generate_series(1, 10) g (i);
SELECT
closed_a::date
,COUNT(id) FILTER (WHERE date_trunc('hour', closed_a) = closed_a::date) AS "0:00"
,COUNT(id) FILTER (WHERE date_trunc('hour', closed_a) = closed_a::date + interval '1 hour') AS "1:00"
,COUNT(id) FILTER (WHERE date_trunc('hour', closed_a) = closed_a::date + interval '2 hour') AS "2:00"
,COUNT(id) FILTER (WHERE date_trunc('hour', closed_a) = closed_a::date + interval '3 hour') AS "3:00"
,COUNT(id) FILTER (WHERE date_trunc('hour', closed_a) = closed_a::date + interval '4 hour') AS "4:00"
,COUNT(id) FILTER (WHERE date_trunc('hour', closed_a) = closed_a::date + interval '5 hour') AS "5:00"
FROM
trades
GROUP BY
1;
END;

Need help to merge overlapping time intervals

I need some help with merging overlapping time intervals if the interval not more than 4 minutes (for example only where id = 1).
I have the next table:
--------------------------------------
id | action | date
--------------------------------------
1 | started | 2020-08-18 13:51:02
1 | suspended | 2020-08-18 13:51:04
2 | started | 2020-08-18 13:52:14
2 | suspended | 2020-08-18 13:52:17
3 | started | 2020-08-18 13:52:21
3 | suspended | 2020-08-18 13:52:24
1 | started | 2020-08-18 13:57:21
1 | suspended | 2020-08-18 13:57:22
1 | started | 2020-08-18 15:07:56
1 | suspended | 2020-08-18 15:08:56
1 | started | 2020-08-18 15:09:11
1 | suspended | 2020-08-18 15:09:11
1 | started | 2020-08-18 15:09:11
1 | suspended | 2020-08-18 15:09:13
Expected result:
--------------------------------------
id | action | date
--------------------------------------
1 | started | 2020-08-18 13:51:02
1 | suspended | 2020-08-18 13:51:04
1 | started | 2020-08-18 13:57:21
1 | suspended | 2020-08-18 13:57:22
1 | started | 2020-08-18 15:07:56
1 | suspended | 2020-08-18 15:09:13
How it can be done? I will be very grateful for your help!
You want to eliminate suspended/start pairs that are for the same id and within 4 minutes. Use lag() and lead():
select t.*
from (select t.*,
lag(date) over (partition by id order by date) as prev_date,
lead(date) over (partition by id order by date) as next_date
from t
) t
where (action = 'start' and
prev_date > date - interval '4 minute'
) or
(action = 'suspended' and
next_date < date + interval '4 minute'
);
Date/time functions are notoriously database dependent. This is just adding or subtracting 4 minutes, which any database can do but the syntax might vary.
You're wanting to filter out certain rows, what is common with the rows you are removing?
It seems you want the first 'started' and last 'suspended' rows. Can you just ignore 'started' rows if there is another 'started' row in the previous 4 minutes, and ignore 'suspended' rows if there is another 'suspended row in the next 4 minutes?
from my_table a
where action = 'started' and not exists (
select 1 from my_table b
where b.id = a.id and b.action = 'started'
and datediff(minute, b.date, a.date) <= 4 -- row exists in the previous 4 min
)
Ditto for 'suspended' but the other way. That doesn't work if the difference between the last 'started' and 'suspended' is > 4 minutes though, but that can be overcome with another condition to check for no start within the last 4 minutes.
If you need to get overlapping intervals duration not more than 4 minutes, can use this query:
--cte where creating groups with time intervals
with base_cte as
(
select Tab.id,Tab.NumGr,Tab.date,
Tab.action from
(
select * from
(
--selecting only values where time difference <= 4 min
select *,sum(TimeDiff)over(partition by id,NumGr order by date rows unbounded preceding)SumTimeInterval from
(
--creating a group
select sum(Num)over(partition by id order by date rows unbounded preceding )NumGr, * from
(
select date,lead(date)over(partition by id order by date)lead_date,id,action,
lead(action)over(partition by id order by date)lead_action,
--split intervals between overlaps (240seconds)
iif(TimeDiff>240,1,0)Num,TimeDiff from
(
--find time difference in seconds between current and next date (partition by id)
select datediff(second,date,LEAD(date)over(partition by id order by date))TimeDiff,* from Table
)A
)B
)C
--selecting only pairs within time intervals
where TimeDiff<=240
--checking duration interval:<=4 min
)D where SumTimeInterval<=240
)E
CROSS JOIN LATERAL
(values (id,NumGr,date,action),
(id,NumGr,lead_date,lead_action)
)Tab(id,NumGr,date,action)
)
--selectig data with start/end overlapping time interval
select id,date,action from base_cte base
where date
in (select max(date) from base_cte where NumGr=base.NumGr)
or date in
(select min(date) from base_cte where NumGr=base.NumGr)

Fixed range of timestamps for every uuid in SQL

I would like to generate a table with the last n weeks timestamps of data (in this case, n=3) and all the data, even if it is null.
I am using the following pieces of code
with raw_weekly_data as (SELECT
distinct d.uuid,
date_trunc('week',a.start_timestamp) as tstamp,
avg(price) as price
FROM
a join d on a.uuid = d.uuid
where start_timestamp between date_trunc('week',now()) - interval '3 week' and date_trunc('week',now())
group by 1,2,3
order by 1)
,tstamp as (SELECT
distinct tstamp
FROM
raw_weekly_data
)
SELECT
t.tstamp,
r.*
from raw_weekly_data r right join tstamp t on r.tstamp = t.tstamp
order by uuid
I would like to have something like that:
week | uuid | price
w1 | 1 | 10
w2 | 1 | 2
w3 | 1 |
w1 | 2 | 20
w2 | 2 |
w3 | 2 |
w1 | 3 | 10
w2 | 3 | 10
w3 | 3 | 20
But instead all the null results are not showed. What is the best approach in here?
week | uuid | price
w1 | 1 | 10
w2 | 1 | 2
w1 | 2 | 20
w1 | 3 | 10
w2 | 3 | 10
w3 | 3 | 20
Form a Cartesian product of all weeks an UUIDs, then LEFT JOIN to actual avg, prices per (week, uuid). Like:
SELECT *
FROM generate_series (date_trunc('week', now() - interval '3 week')
, now() - interval '1 week'
, interval '1 week') tstamp
CROSS JOIN (SELECT DISTINCT uuid FROM a) a
LEFT JOIN (
SELECT d.uuid
, date_trunc('week', a.start_timestamp) AS tstamp
, avg(price) AS price -- d.price?
FROM a
JOIN d USING (uuid)
WHERE a.start_timestamp >= date_trunc('week',now()) - interval '3 week'
AND a.start_timestamp < date_trunc('week',now())
) ad USING (uuid, tstamp)
GROUP BY 1, 2
ORDER BY 1, 2
This way you get all combinations of the last three weeks and UUIDs, extended by the average price - if one should exist for the combination.
Based on some educated guesses to fill in missing information ..

selecting records without value

I have a problem when I'm trying to reach the desired result. The task looks simple — make a daily count of occurrences of the event for top countries.
The main table looks like this:
id | date | country | col1 | col2 | ...
1 | 2018-01-01 21:21:21 | US | value 1 | value 2 | ...
2 | 2018-01-01 22:32:54 | UK | value 1 | value 2 | ...
From this table, I want to get daily event counts by the country, which is achieved by
SELECT date::DATE AT TIME ZONE 'UTC', country, COALESCE(count(id),0) FROM tab1
GROUP BY 1, 2
The problem comes when there is no event was made by an UK user on 2 January 2018
country_events
date | country | count
2018-01-01 | US | 23
2018-01-01 | UK | 5
2018-01-02 | US | 30
2018-01-02 | UK | 0 -> is desired result, but row is missing
I've tried to generate date series and series of countries which I'm looking for, then CROSS JOIN these two tables. This helper with columns date and country I've left joined with my result table like
SELECT * FROM helper h
LEFT JOIN country_events c ON c.date::DATE = h.date::DATE AND c.country = h.country
I'm using PostgreSQL.
You need an outer join, not a cross join:
SELECT tab1.date::date, tab1.country, coalesce(count(*), 0)
FROM generate_series(TIMESTAMP '2018-01-01 00:00:00',
TIMESTAMP '2018-01-31 00:00:00',
INTERVAL '1 day') AS ts(d)
LEFT JOIN tab1 ON tab1.date >= ts.d AND tab1.date < ts.d + INTERVAL '1 day'
GROUP BY tab1.date::date, tab1.country
ORDER BY tab1.date::date, tab1.country;
This will give the desired list for January 2018.

Postgresql running totals with groups missing data and outer joins

I've written a sql query that pulls data from a user table and produces a running total and cumulative total of when users were created. The data is grouped by week (using the windowing feature of postgres). I'm using a left outer join to include the weeks when no users where created. Here is the query...
<!-- language: lang-sql -->
WITH reporting_period AS (
SELECT generate_series(date_trunc('week', date '2015-04-02'), date_trunc('week', date '2015-10-02'), interval '1 week') AS interval
)
SELECT
date(interval) AS interval
, count(users.created_at) as interval_count
, sum(count( users.created_at) ) OVER (order by date_trunc('week', users.created_at)) AS cumulative_count
FROM reporting_period
LEFT JOIN users
ON interval=date(date_trunc('week', users.created_at) )
GROUP BY interval, date_trunc('week', users.created_at) ORDER BY interval
It works almost perfectly. The cumulative value is calculated properly for weeks week a user was created. For weeks when no user was create it is set to grand total and not the cumulative total up to that point.
Notice the rows with ** the Week Tot column (interval_count) is 0 as expected but the Run Tot (cumulative_total) is 1053 which equals the grand total.
Week Week Tot Run Tot
-----------------------------------
2015-03-30 | 4 | 4
2015-04-06 | 13 | 17
2015-04-13 | 0 | 1053 **
2015-04-20 | 9 | 26
2015-04-27 | 3 | 29
2015-05-04 | 0 | 1053 **
2015-05-11 | 0 | 1053 **
2015-05-18 | 1 | 30
2015-05-25 | 0 | 1053 **
...
2015-06-08 | 996 | 1031
...
2015-09-07 | 2 | 1052
2015-09-14 | 0 | 1053 **
2015-09-21 | 1 | 1053 **
2015-09-28 | 0 | 1053 **
This is what I would like
Week Week Tot Run Tot
-----------------------------------
2015-03-30 | 4 | 4
2015-04-06 | 13 | 17
2015-04-13 | 0 | 17 **
2015-04-20 | 9 | 26
2015-04-27 | 3 | 29
2015-05-04 | 0 | 29 **
...
It seems to me that if the outer join can somehow apply the grand total to the last column it should be possible to apply the current running total but I'm at a loss on how to do it.
Is this possible?
This is not guaranteed to work out of the box as I havent tested on acutal tables, but the key here is to join users on created_at over a range of dates.
with reportingperiod as (
select intervaldate as interval_begin,
intervaldate + interval '1 month' as interval_end
from (
SELECT GENERATE_SERIES(DATE(DATE_TRUNC('day', DATE '2015-03-15')),
DATE(DATE_TRUNC('day', DATE '2015-10-15')), interval '1 month') AS intervaldate
) as rp
)
select interval_end,
interval_count,
sum(interval_count) over (order by interval_end) as running_sum
from (
select interval_end,
count(u.created_at) as interval_count
from reportingperiod rp
left join (
select created_at
from users
where created_at < '2015-10-02'
) u on u.created_at > rp.interval_begin
and u.created_at <= rp.interval_end
group by interval_end
) q
I figured it out. The trick was subqueries. Here's my approach
Add a count column to the generate_series call with default value of 0
Select interval and count(users.created_at) from the users data
Union the the generate_series and the result from the select in step #2
(At this point the result will have duplicates for each interval)
Use the results in a subquery to get interval and max(interval_count) which eliminates duplicates
Use the window aggregation as before to get the running total
SELECT
interval
, interval_count
, SUM(interval_count ) OVER (ORDER BY interval) AS cumulative_count
FROM
(
SELECT interval, MAX(interval_count) AS interval_count FROM
(
SELECT GENERATE_SERIES(DATE(DATE_TRUNC('week', DATE '2015-04-02')),
DATE(DATE_TRUNC('week', DATE '2015-10-02')), interval '1 week') AS interval,
0 AS interval_count
UNION
SELECT DATE_TRUNC('week', users.created_at) AS INTERVAL,
COUNT(users.created_at) AS interval_count FROM users
WHERE users.created_at < date '2015-10-02'
GROUP BY 1 ORDER BY 1
) sub1
GROUP BY interval
) grouped_data
I'm not sure if there are any serious performance issues with this approach but it seems to work. If anyone has a better, more elegant or performant approach I would love the feedback.
Edit: My solution doesn't work when trying to group by arbitrary time windows
Just tried this solution with the following changes
/* generate series using DATE_TRUNC('day'...)*/
SELECT GENERATE_SERIES(DATE(DATE_TRUNC('day', DATE '2015-04-02')),
DATE(DATE_TRUNC('day', DATE '2015-10-02')), interval '1 month') AS interval,
0 AS interval_count
/* And this part */
SELECT DATE_TRUNC('day', users.created_at) AS INTERVAL,
COUNT(users.created_at) AS interval_count FROM users
WHERE users.created_at < date '2015-10-02'
GROUP BY 1 ORDER BY 1
For example is is possible to produce these similar results but have the data grouped by intervals as so
3/15/15 - 4/14/15,
4/15/15 - 5/14/15,
5/15/15 - 6/14/15
etc.