postgres, group by date, and bucketize per hour

I would like to create a result object that can be used with Grafana for a heatmap. In order to display the data correctly, I need the output to look like this:
| date | 00:00 | 01:00 | 02:00 | 03:00 | ...etc |
| 2023-01-01 | 1 | 2 | 0 | 1 | ... |
| 2023-01-02 | 0 | 0 | 1 | 1 | ... |
| 2023-01-03 | 4 | 0 | 2 | 0 | ... |
my data table structure:
trades
-----
id
closed_at
asset
So far, I know that I need to use generate_series and the interval function to return the hours, but I need my query to output these hours as columns, and I have not been able to do that, as it is getting a bit too advanced for me.
So far I have the following query:
SELECT
closed_at::DATE,
COUNT(id)
FROM trades
GROUP BY closed_at
ORDER BY closed_at
It now shows the number of rows grouped by day. I want to aggregate the data further so that it outputs the count per hour, as shown above.
Thanks for your help!

You can add more columns; this example only covers 0:00 to 5:00.
filter usage: https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-AGGREGATES
date_trunc usage: https://www.postgresql.org/docs/current/functions-datetime.html#FUNCTIONS-DATETIME-TRUNC
BEGIN;
-- Temporary test table; it is dropped automatically when the transaction ends.
CREATE temp TABLE trades (
    id bigint GENERATED BY DEFAULT AS IDENTITY,
    closed_at timestamp,
    asset text
) ON COMMIT DROP;
-- Ten random timestamps starting from midnight of 2023-01-01.
INSERT INTO trades (closed_at)
SELECT
    date '2023-01-01' + interval '10 min' * (random() * i * 10)::int
FROM
    generate_series(1, 10) g (i);
-- Ten random timestamps starting from midnight of 2023-01-02.
INSERT INTO trades (closed_at)
SELECT
    date '2023-01-02' + interval '10 min' * (random() * i * 10)::int
FROM
    generate_series(1, 10) g (i);
-- One row per day; each FILTER clause counts only the rows that fall into that hour.
SELECT
    closed_at::date
    ,COUNT(id) FILTER (WHERE date_trunc('hour', closed_at) = closed_at::date) AS "0:00"
    ,COUNT(id) FILTER (WHERE date_trunc('hour', closed_at) = closed_at::date + interval '1 hour') AS "1:00"
    ,COUNT(id) FILTER (WHERE date_trunc('hour', closed_at) = closed_at::date + interval '2 hour') AS "2:00"
    ,COUNT(id) FILTER (WHERE date_trunc('hour', closed_at) = closed_at::date + interval '3 hour') AS "3:00"
    ,COUNT(id) FILTER (WHERE date_trunc('hour', closed_at) = closed_at::date + interval '4 hour') AS "4:00"
    ,COUNT(id) FILTER (WHERE date_trunc('hour', closed_at) = closed_at::date + interval '5 hour') AS "5:00"
FROM
    trades
GROUP BY
    1;
END;
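If a long format (one row per day and hour) is acceptable, generate_series can supply the 24 hours so that empty hours still appear with a count of 0. This is only a minimal sketch: the inline sample rows and the names trade_day, hour_of_day and trade_count are made up for illustration, and whether your Grafana panel accepts this shape is not something this sketch verifies.

```sql
-- Hypothetical sample rows standing in for the question's trades table.
WITH trades(id, closed_at) AS (
    VALUES (1, timestamp '2023-01-01 00:15')
         , (2, timestamp '2023-01-01 01:05')
         , (3, timestamp '2023-01-01 01:40')
         , (4, timestamp '2023-01-02 02:30')
)
SELECT d.trade_day
     , h.hour_of_day
     , count(t.id) AS trade_count               -- 0 for hours without trades
FROM (SELECT DISTINCT closed_at::date AS trade_day FROM trades) AS d
CROSS JOIN generate_series(0, 23) AS h(hour_of_day)  -- every hour of the day
LEFT JOIN trades AS t
       ON t.closed_at::date = d.trade_day
      AND extract(hour FROM t.closed_at) = h.hour_of_day
GROUP BY d.trade_day, h.hour_of_day
ORDER BY d.trade_day, h.hour_of_day;
```

With the sample rows this yields 24 rows for each of the two days. count(t.id) rather than count(*) is what keeps the empty hours at 0, because the unmatched rows produced by the LEFT JOIN carry a NULL id.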

Related

Fixed range of timestamps for every uuid in SQL

I would like to generate a table with the last n weeks timestamps of data (in this case, n=3) and all the data, even if it is null.
I am using the following pieces of code
with raw_weekly_data as (SELECT
distinct d.uuid,
date_trunc('week',a.start_timestamp) as tstamp,
avg(price) as price
FROM
a join d on a.uuid = d.uuid
where start_timestamp between date_trunc('week',now()) - interval '3 week' and date_trunc('week',now())
group by 1,2,3
order by 1)
,tstamp as (SELECT
distinct tstamp
FROM
raw_weekly_data
)
SELECT
t.tstamp,
r.*
from raw_weekly_data r right join tstamp t on r.tstamp = t.tstamp
order by uuid
I would like to have something like that:
week | uuid | price
w1 | 1 | 10
w2 | 1 | 2
w3 | 1 |
w1 | 2 | 20
w2 | 2 |
w3 | 2 |
w1 | 3 | 10
w2 | 3 | 10
w3 | 3 | 20
But instead, none of the null results are shown. What is the best approach here?
week | uuid | price
w1 | 1 | 10
w2 | 1 | 2
w1 | 2 | 20
w1 | 3 | 10
w2 | 3 | 10
w3 | 3 | 20
Form a Cartesian product of all weeks and UUIDs, then LEFT JOIN to the actual average prices per (week, uuid). Like:
SELECT *
FROM   generate_series (date_trunc('week', now() - interval '3 week')
                      , now() - interval '1 week'
                      , interval '1 week') tstamp
CROSS  JOIN (SELECT DISTINCT uuid FROM a) a
LEFT   JOIN (
   SELECT d.uuid
        , date_trunc('week', a.start_timestamp) AS tstamp
        , avg(price) AS price  -- d.price?
   FROM   a
   JOIN   d USING (uuid)
   WHERE  a.start_timestamp >= date_trunc('week',now()) - interval '3 week'
   AND    a.start_timestamp <  date_trunc('week',now())
   GROUP  BY 1, 2  -- aggregate per (uuid, week) inside the subquery
   ) ad USING (uuid, tstamp)
ORDER  BY 1, 2
This way you get all combinations of the last three weeks and UUIDs, extended by the average price - if one should exist for the combination.
This is based on some educated guesses to fill in missing information.
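The same pattern can be tried in isolation. The sketch below is self-contained and uses the sample values from the question's desired output; the CTE names weeks and prices are made up for illustration and replace the real a and d tables.

```sql
-- Hypothetical stand-ins for the real data: three weeks and the prices that exist.
WITH weeks(week) AS (
    VALUES ('w1'), ('w2'), ('w3')
)
, prices(week, uuid, price) AS (
    VALUES ('w1', 1, 10), ('w2', 1, 2)
         , ('w1', 2, 20)
         , ('w1', 3, 10), ('w2', 3, 10), ('w3', 3, 20)
)
SELECT w.week, u.uuid, p.price
FROM weeks AS w
CROSS JOIN (SELECT DISTINCT uuid FROM prices) AS u  -- every (week, uuid) pair
LEFT JOIN prices AS p USING (week, uuid)            -- price stays NULL if missing
ORDER BY u.uuid, w.week;
```

This returns all nine (week, uuid) combinations, with a NULL price for (w3, 1), (w2, 2) and (w3, 2).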

Combine generate series and count into one query

Postgres version 9.4.18, PostGIS Version 2.2.
I removed some of the details about the tables from this question because I doubt it's needed to answer the question. I can add those details back if necessary.
Desired result:
I want a total count for each week of year and hour of day (0100 to 5223). I'm able to successfully generate a series of 0100 to 5223 (actually up to 5300), and I'm able to get a total count for each week of year and hour of day individually, but I'm unable to combine the queries so that weeks of year/hours of day with a zero count still show up. I want to combine the count result with the generate_series (and ideally divide that result by 30) to get something like below.
MM-DD | count_not_zero | count_not_zero_divided_by_30
-------+----------------+----------------------------
0100 | 10 | 33.3
0101 | 0 | 0
0102 | 0 | 0
...
0123 | 0 | 0
0200 | 3 | 10
0201 | 10 | 33.3
...
5223 | 20 | 66.6
Here are my individual queries that work...that I want to combine:
SELECT DISTINCT f_woyhh(d::timestamp) as woyhh
FROM generate_series(timestamp '2018-01-01', timestamp '2018-12-31', interval '1 hour') d
GROUP BY woyhh
ORDER by woyhh asc;
SELECT dt, count(*) FROM
(SELECT f_woyhh((time)::timestamp at time zone 'utc' at time zone 'america/chicago')
AS dt,
EXTRACT(YEAR FROM time) AS ctYear, count(*)
AS ct
FROM counties c
INNER JOIN ltg_data d ON ST_contains(c.the_geom, d.ltg_geom)
WHERE countyname = 'Milwaukee' AND state = 'WI' AND EXTRACT(YEAR from time) > '1987' GROUP BY dt, EXTRACT(YEAR from time))
AS count group by dt;
The result from the second query above is (and skips zero count dt, which I don't want):
dt | count
-------+-------
0100 | 10
0104 | 5
0108 | 4
...
Conclusion:
I'm trying to combine the above working individual queries into a single query that provides a three-column result: woyhh, count, and count divided by 30. And I want to include the woyhh values that have a zero count, so that I have a complete set of woyhh.
Thanks for any help!!
I found the answer. I'll be posting it tomorrow, but I wanted to put this on today so no one unnecessarily works on this question. I apologize for the formatting.
WITH CTE_Dates AS (
    SELECT DISTINCT f_woyhh(d::timestamp) AS dt
    FROM generate_series(timestamp '2018-01-01', timestamp '2018-12-31', interval '1 hour') d
),
CTE_WeeklyHourlyCounts AS (
    SELECT dt, count(*) AS ct
    FROM (
        SELECT f_woyhh((time)::timestamp at time zone 'utc' at time zone 'america/chicago') AS dt,
               EXTRACT(YEAR FROM time) AS ctYear,
               count(*) AS ct
        FROM counties c
        INNER JOIN ltg_data d ON ST_contains(c.the_geom, d.ltg_geom)
        WHERE countyname = 'Milwaukee' AND state = 'WI' AND EXTRACT(YEAR FROM time) > '1987'
        GROUP BY dt, EXTRACT(YEAR FROM time)
    ) AS count
    GROUP BY dt
),
CTE_FullStats AS (
    SELECT CTE_Dates.dt AS dt, CAST(CTE_WeeklyHourlyCounts.ct AS decimal) AS ct
    FROM CTE_Dates
    LEFT JOIN CTE_WeeklyHourlyCounts ON CTE_WeeklyHourlyCounts.dt = CTE_Dates.dt
    GROUP BY CTE_Dates.dt, CTE_WeeklyHourlyCounts.ct, CTE_WeeklyHourlyCounts.dt
)
SELECT dt, COALESCE(ct, 0) AS count, round(((COALESCE(ct, 0) * 100) / 30), 0) AS percent
FROM CTE_FullStats
GROUP BY dt, ct
ORDER BY dt;
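Stripped of the PostGIS join, the core of this is a LEFT JOIN from the full list of slots to the counts, with COALESCE turning the missing counts into 0. A minimal sketch, assuming only the three sample counts shown in the question; the CTE names counts and slots are made up, and the slot labels for week 01 are built with lpad because f_woyhh is a user-defined function that is not shown here.

```sql
-- Hypothetical stand-in for the aggregated counts (values from the question).
WITH counts(dt, ct) AS (
    VALUES ('0100', 10), ('0104', 5), ('0108', 4)
)
, slots AS (
    -- All 24 hour slots of week 01: '0100' .. '0123'.
    SELECT '01' || lpad(h::text, 2, '0') AS dt
    FROM generate_series(0, 23) AS h
)
SELECT s.dt
     , COALESCE(c.ct, 0) AS count
     , round(COALESCE(c.ct, 0) * 100 / 30.0, 1) AS percent
FROM slots AS s
LEFT JOIN counts AS c USING (dt)   -- keeps slots that have no count
ORDER BY s.dt;
```

Dividing by 30.0 instead of 30 forces numeric division, so slot 0100 comes out as 33.3 rather than a truncated integer.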

Grouping Events in Postgres

I've got an events table that is generated by user activity on a site:
timestamp | name
7:00 AM | ...
7:01 AM | ...
7:02 AM | ...
7:30 AM | ...
7:31 AM | ...
7:32 AM | ...
8:01 AM | ...
8:03 AM | ...
8:05 AM | ...
8:08 AM | ...
8:09 AM | ...
I'd like to aggregate over the events to provide a view of when a user is active. I'm defining active to mean the period in which an event is within +/- 2 minutes. For the above that'd mean:
from | till
7:00 AM | 7:02 AM
7:30 AM | 7:32 AM
8:01 AM | 8:05 AM
8:08 AM | 8:09 AM
What's the best way to write a query that'll aggregate in that method? Is it possible via a WINDOW function or self join or is PL/SQL required?
Use two window functions: one to calculate the intervals between contiguous events (gaps) and another to find series of gaps less than or equal to 2 minutes:
select arr[1] as "from", arr[cardinality(arr)] as "till"
from (
select array_agg(timestamp order by timestamp) arr
from (
select timestamp, sum((gap > '2m' )::int) over w
from (
select timestamp, coalesce(timestamp - lag(timestamp) over w, '3m') gap
from events
window w as (order by timestamp)
) s
window w as (order by timestamp)
) s
group by sum
) s
from | till
----------+----------
07:00:00 | 07:02:00
07:30:00 | 07:32:00
08:01:00 | 08:05:00
(3 rows)
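The same gaps-and-islands idea can also be written with CTEs and min/max instead of array_agg. This is a minimal sketch rather than the answer's exact query: the events are inlined from the question's table, and the names ts, new_group and grp are made up for illustration.

```sql
-- The eleven events from the question, as plain time values.
WITH events(ts) AS (
    VALUES (time '07:00'), (time '07:01'), (time '07:02')
         , (time '07:30'), (time '07:31'), (time '07:32')
         , (time '08:01'), (time '08:03'), (time '08:05')
         , (time '08:08'), (time '08:09')
)
, flagged AS (
    -- 1 marks the first event of a new active period (gap > 2 minutes, or no predecessor).
    SELECT ts
         , CASE WHEN ts - lag(ts) OVER (ORDER BY ts) <= interval '2 minutes'
                THEN 0 ELSE 1 END AS new_group
    FROM events
)
, grouped AS (
    -- The running sum of the markers gives every period its own number.
    SELECT ts, sum(new_group) OVER (ORDER BY ts) AS grp
    FROM flagged
)
SELECT min(ts) AS "from", max(ts) AS "till"
FROM grouped
GROUP BY grp
ORDER BY 1;
```

With all eleven events from the question this returns four periods, the last one running from 08:08 to 08:09.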
By grouping them around half-hour flooring and getting min & max values:
WITH x(t) AS ( VALUES
('7:02 AM'::TIME),('7:01 AM'::TIME),('7:00 AM'::TIME),
('7:30 AM'::TIME),('7:31 AM'::TIME),('7:32 AM'::TIME),
('8:01 AM'::TIME),('8:03 AM'::TIME),('8:05 AM'::TIME)
)
SELECT MIN(t) "from", MAX(t) "till"
FROM (select t, date_trunc('hour', t) +
CASE WHEN (t-date_trunc('hour', t)) >= '30 minutes'::interval
THEN '30 minutes'::interval ELSE '0'::interval END t1 FROM x ) y
GROUP BY t1 ORDER BY t1;
You can apply the same recipe to datetime values, like:
WITH x(t) AS (
SELECT '2017-01-01'::TIMESTAMP + (RANDOM()*1440*'1 minute'::INTERVAL) t
FROM GENERATE_SERIES(0,1000))
SELECT MIN...

Postgresql running totals with groups missing data and outer joins

I've written a SQL query that pulls data from a user table and produces a running total and cumulative total of when users were created. The data is grouped by week (using the windowing feature of Postgres). I'm using a left outer join to include the weeks when no users were created. Here is the query:
WITH reporting_period AS (
SELECT generate_series(date_trunc('week', date '2015-04-02'), date_trunc('week', date '2015-10-02'), interval '1 week') AS interval
)
SELECT
date(interval) AS interval
, count(users.created_at) as interval_count
, sum(count( users.created_at) ) OVER (order by date_trunc('week', users.created_at)) AS cumulative_count
FROM reporting_period
LEFT JOIN users
ON interval=date(date_trunc('week', users.created_at) )
GROUP BY interval, date_trunc('week', users.created_at) ORDER BY interval
It works almost perfectly. The cumulative value is calculated properly for weeks in which a user was created. For weeks when no user was created, it is set to the grand total and not the cumulative total up to that point.
Notice the rows marked with **: the Week Tot column (interval_count) is 0 as expected, but the Run Tot column (cumulative_count) is 1053, which equals the grand total.
Week Week Tot Run Tot
-----------------------------------
2015-03-30 | 4 | 4
2015-04-06 | 13 | 17
2015-04-13 | 0 | 1053 **
2015-04-20 | 9 | 26
2015-04-27 | 3 | 29
2015-05-04 | 0 | 1053 **
2015-05-11 | 0 | 1053 **
2015-05-18 | 1 | 30
2015-05-25 | 0 | 1053 **
...
2015-06-08 | 996 | 1031
...
2015-09-07 | 2 | 1052
2015-09-14 | 0 | 1053 **
2015-09-21 | 1 | 1053 **
2015-09-28 | 0 | 1053 **
This is what I would like
Week Week Tot Run Tot
-----------------------------------
2015-03-30 | 4 | 4
2015-04-06 | 13 | 17
2015-04-13 | 0 | 17 **
2015-04-20 | 9 | 26
2015-04-27 | 3 | 29
2015-05-04 | 0 | 29 **
...
It seems to me that if the outer join can somehow apply the grand total to the last column it should be possible to apply the current running total but I'm at a loss on how to do it.
Is this possible?
This is not guaranteed to work out of the box, as I haven't tested it on actual tables, but the key here is to join users on created_at over a range of dates.
with reportingperiod as (
select intervaldate as interval_begin,
intervaldate + interval '1 month' as interval_end
from (
SELECT GENERATE_SERIES(DATE(DATE_TRUNC('day', DATE '2015-03-15')),
DATE(DATE_TRUNC('day', DATE '2015-10-15')), interval '1 month') AS intervaldate
) as rp
)
select interval_end,
interval_count,
sum(interval_count) over (order by interval_end) as running_sum
from (
select interval_end,
count(u.created_at) as interval_count
from reportingperiod rp
left join (
select created_at
from users
where created_at < '2015-10-02'
) u on u.created_at > rp.interval_begin
and u.created_at <= rp.interval_end
group by interval_end
) q
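One way to read the symptom in the question: the window there is ordered by date_trunc('week', users.created_at), which is NULL for weeks without users. NULLs sort last in ascending order, so those rows receive the sum over every row, which is the grand total. A minimal sketch that orders the window by the generated week instead; the inline users rows and the name week_start are made up for illustration.

```sql
-- Hypothetical sample rows standing in for the users table.
WITH users(created_at) AS (
    VALUES (timestamp '2015-03-31 09:00')
         , (timestamp '2015-04-01 12:00')
         , (timestamp '2015-04-08 08:30')
         , (timestamp '2015-04-21 17:45')
)
, reporting_period AS (
    SELECT d::date AS week_start
    FROM generate_series(timestamp '2015-03-30', timestamp '2015-04-27', interval '1 week') AS d
)
SELECT rp.week_start
     , count(u.created_at) AS interval_count
       -- Order by the generated week, which is never NULL.
     , sum(count(u.created_at)) OVER (ORDER BY rp.week_start) AS cumulative_count
FROM reporting_period AS rp
LEFT JOIN users AS u
       ON date_trunc('week', u.created_at)::date = rp.week_start
GROUP BY rp.week_start
ORDER BY rp.week_start;
```

With these rows the empty week of 2015-04-13 shows an interval_count of 0 and carries the running total of 3 forward from the week before.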
I figured it out. The trick was subqueries. Here's my approach:
1. Add a count column to the generate_series call with a default value of 0.
2. Select interval and count(users.created_at) from the users data.
3. Union the generate_series and the result from the select in step #2. (At this point the result will have duplicates for each interval.)
4. Use the results in a subquery to get interval and max(interval_count), which eliminates duplicates.
5. Use the window aggregation as before to get the running total.
SELECT
interval
, interval_count
, SUM(interval_count ) OVER (ORDER BY interval) AS cumulative_count
FROM
(
SELECT interval, MAX(interval_count) AS interval_count FROM
(
SELECT GENERATE_SERIES(DATE(DATE_TRUNC('week', DATE '2015-04-02')),
DATE(DATE_TRUNC('week', DATE '2015-10-02')), interval '1 week') AS interval,
0 AS interval_count
UNION
SELECT DATE_TRUNC('week', users.created_at) AS INTERVAL,
COUNT(users.created_at) AS interval_count FROM users
WHERE users.created_at < date '2015-10-02'
GROUP BY 1 ORDER BY 1
) sub1
GROUP BY interval
) grouped_data
I'm not sure if there are any serious performance issues with this approach but it seems to work. If anyone has a better, more elegant or performant approach I would love the feedback.
Edit: My solution doesn't work when trying to group by arbitrary time windows
Just tried this solution with the following changes
/* generate series using DATE_TRUNC('day'...)*/
SELECT GENERATE_SERIES(DATE(DATE_TRUNC('day', DATE '2015-04-02')),
DATE(DATE_TRUNC('day', DATE '2015-10-02')), interval '1 month') AS interval,
0 AS interval_count
/* And this part */
SELECT DATE_TRUNC('day', users.created_at) AS INTERVAL,
COUNT(users.created_at) AS interval_count FROM users
WHERE users.created_at < date '2015-10-02'
GROUP BY 1 ORDER BY 1
For example, is it possible to produce similar results but have the data grouped by intervals like so:
3/15/15 - 4/14/15,
4/15/15 - 5/14/15,
5/15/15 - 6/14/15
etc.

Getting data from postgres weekly (according to date)

user timespent(in sec) date(in timestamp)
u1 10 t1(2015-08-15)
u1 20 t2(2015-08-19)
u1 15 t3(2015-08-28)
u1 16 t4(2015-09-06)
Above is the format of my table, which represents the time spent by a user on a course, ordered by timestamp. I want to get the weekly sum of timespent for a particular user, say u1, in the format:
start_date end_date sum
2015-08-15 2015-08-21 30
2015-08-22 2015-08-28 15
2015-08-29 2015-09-04 0
2015-09-05 2015-09-11 16
The difficulty lies in the fact that the seven-day periods that you want to get are not regular weeks starting with Monday.
You therefore cannot use the standard functions to get the week number based on the date, and have to write your own week generator using generate_series().
Example data:
create table sessions (user_name text, time_spent int, session_date timestamp);
insert into sessions values
('u1', 10, '2015-08-15'),
('u1', 20, '2015-08-19'),
('u1', 15, '2015-08-28'),
('u1', 16, '2015-09-06');
The query for an arbitrarily chosen period from 2015-08-15 to 2015-09-06:
with weeks as (
select d::date start_date, d::date+ 6 end_date
from generate_series('2015-08-15', '2015-09-06', '7d'::interval) d
)
select w.start_date, w.end_date, coalesce(sum(time_spent), 0) total
from weeks w
left join (
select start_date, end_date, coalesce(time_spent, 0) time_spent
from weeks
join sessions
on session_date between start_date and end_date
where user_name = 'u1'
) s
on w.start_date = s.start_date and w.end_date = s.end_date
group by 1, 2
order by 1;
start_date | end_date | total
------------+------------+-------
2015-08-15 | 2015-08-21 | 30
2015-08-22 | 2015-08-28 | 15
2015-08-29 | 2015-09-04 | 0
2015-09-05 | 2015-09-11 | 16
(4 rows)
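A possible simplification is to join the sessions to the generated weeks directly. The sketch below is a variant under stated assumptions, not the query above: it inlines the same sample data as a CTE, puts the user filter into the ON clause so that empty weeks survive the LEFT JOIN, and uses a half-open range so that a session late on the last day of a week would still be counted.

```sql
-- Same sample data as above, inlined so the query runs on its own.
WITH sessions(user_name, time_spent, session_date) AS (
    VALUES ('u1', 10, timestamp '2015-08-15')
         , ('u1', 20, timestamp '2015-08-19')
         , ('u1', 15, timestamp '2015-08-28')
         , ('u1', 16, timestamp '2015-09-06')
)
, weeks AS (
    SELECT d::date AS start_date, d::date + 6 AS end_date
    FROM generate_series(timestamp '2015-08-15', timestamp '2015-09-06', interval '7 days') AS d
)
SELECT w.start_date
     , w.end_date
     , COALESCE(sum(s.time_spent), 0) AS total
FROM weeks AS w
LEFT JOIN sessions AS s
       ON s.user_name = 'u1'                  -- filter here, not in WHERE
      AND s.session_date >= w.start_date
      AND s.session_date <  w.end_date + 1    -- up to, but excluding, the next week
GROUP BY w.start_date, w.end_date
ORDER BY w.start_date;
```

On the sample data this gives the same four rows as above (30, 15, 0, 16).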
select
ui,
date_trunc('week', the_date)::date as start_date,
date_trunc('week', the_date)::date + 6 as end_date,
sum(timespent) as "sum"
from t
group by 1, 2, 3
order by 1,2
Something like this (assuming that by timestamp you mean the data type timestamp).
In order to make the first day of the week Sunday, I added an extra day to "date" in the group by.
select (start_date - date_part('dow', start_date) * interval '1 day')::date start_date,
(start_date + (6 - date_part('dow', start_date)) * interval '1 day')::date end_date,
total_time_spent
from (
select min("date") start_date, sum(timespent) total_time_spent
from mytable
where "user" = 'u1'
group by date_part('year', "date"), date_part('week', "date" + interval '1 day')) "tmp"
order by start_date
This is a more generic approach, for any date interval.