Fixed range of timestamps for every uuid in SQL - postgresql

I would like to generate a table covering the last n weeks (in this case, n=3) for every uuid, including the weeks without data, where the price should be null.
I am using the following code:
with raw_weekly_data as (SELECT
distinct d.uuid,
date_trunc('week',a.start_timestamp) as tstamp,
avg(price) as price
FROM
a join d on a.uuid = d.uuid
where start_timestamp between date_trunc('week',now()) - interval '3 week' and date_trunc('week',now())
group by 1,2,3
order by 1)
,tstamp as (SELECT
distinct tstamp
FROM
raw_weekly_data
)
SELECT
t.tstamp,
r.*
from raw_weekly_data r right join tstamp t on r.tstamp = t.tstamp
order by uuid
I would like to have something like that:
week | uuid | price
w1 | 1 | 10
w2 | 1 | 2
w3 | 1 |
w1 | 2 | 20
w2 | 2 |
w3 | 2 |
w1 | 3 | 10
w2 | 3 | 10
w3 | 3 | 20
But instead, the rows with null results are not shown, as below. What is the best approach here?
week | uuid | price
w1 | 1 | 10
w2 | 1 | 2
w1 | 2 | 20
w1 | 3 | 10
w2 | 3 | 10
w3 | 3 | 20

Form a Cartesian product of all weeks and UUIDs, then LEFT JOIN to the actual average prices per (week, uuid). Like:
SELECT *
FROM generate_series (date_trunc('week', now() - interval '3 week')
, now() - interval '1 week'
, interval '1 week') tstamp
CROSS JOIN (SELECT DISTINCT uuid FROM a) a
LEFT JOIN (
SELECT d.uuid
, date_trunc('week', a.start_timestamp) AS tstamp
, avg(price) AS price -- d.price?
FROM a
JOIN d USING (uuid)
WHERE a.start_timestamp >= date_trunc('week',now()) - interval '3 week'
AND a.start_timestamp < date_trunc('week',now())
GROUP BY 1, 2
) ad USING (uuid, tstamp)
ORDER BY 1, 2
This way you get all combinations of the last three weeks and UUIDs, extended by the average price - if one should exist for the combination.
This is based on some educated guesses to fill in missing information.
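To make the cross-join-then-left-join idea concrete outside the database, here is a minimal Python sketch. The sample prices are copied from the question's desired output; the names `avg_price`, `weeks` and `uuids` are hypothetical stand-ins for the aggregated subquery, the generate_series call and the DISTINCT uuid subquery.

```python
from itertools import product

# Average price per (week, uuid), only where data exists (values from the question).
avg_price = {
    ("w1", 1): 10, ("w2", 1): 2,
    ("w1", 2): 20,
    ("w1", 3): 10, ("w2", 3): 10, ("w3", 3): 20,
}
weeks = ["w1", "w2", "w3"]                  # stands in for generate_series(...)
uuids = sorted({u for _, u in avg_price})   # stands in for SELECT DISTINCT uuid

# CROSS JOIN: every (uuid, week) pair.
# LEFT JOIN: look up a price, falling back to None (NULL) when there is none.
rows = [(w, u, avg_price.get((w, u))) for u, w in product(uuids, weeks)]

for row in rows:
    print(row)
```

Every uuid gets one row per week, so the three missing combinations show up with a `None` price instead of disappearing.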

Related

postgres, group by date, and bucketize per hour

I would like to create a result set that can be used with Grafana for a heatmap. In order to display the data correctly, I need the output to look like this:
| date | 00:00 | 01:00 | 02:00 | 03:00 | ...etc |
| 2023-01-01 | 1 | 2 | 0 | 1 | ... |
| 2023-01-02 | 0 | 0 | 1 | 1 | ... |
| 2023-01-03 | 4 | 0 | 2 | 0 | ... |
My table structure:
trades
-----
id
closed_at
asset
So far, I know that I need to use generate_series and an interval to return the hours, but I also need my query to output these hours as columns. I have not been able to do that, as it is getting a bit too advanced for me.
So far I have the following query:
SELECT
closed_at::DATE,
COUNT(id)
FROM trades
GROUP BY closed_at
ORDER BY closed_at
It now shows the number of rows grouped by day; I want to aggregate the data further so that it outputs the count per hour, as shown above.
Thanks for your help!
You can add more columns; this example only covers 0:00 to 5:00.
filter usage: https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-AGGREGATES
date_trunc usage: https://www.postgresql.org/docs/current/functions-datetime.html#FUNCTIONS-DATETIME-TRUNC
BEGIN;
CREATE temp TABLE trades (
id bigint GENERATED BY DEFAULT AS IDENTITY,
closed_at timestamp,
asset text
) ON COMMIT DROP;
INSERT INTO trades (closed_at)
SELECT
date '2023-01-01' + interval '10 min' * (random() * i * 10)::int
FROM
generate_series(1, 10) g (i);
INSERT INTO trades (closed_at)
SELECT
date '2023-01-02' + interval '10 min' * (random() * i * 10)::int
FROM
generate_series(1, 10) g (i);
SELECT
closed_at::date
,COUNT(id) FILTER (WHERE date_trunc('hour', closed_at) = closed_at::date) AS "0:00"
,COUNT(id) FILTER (WHERE date_trunc('hour', closed_at) = closed_at::date + interval '1 hour') AS "1:00"
,COUNT(id) FILTER (WHERE date_trunc('hour', closed_at) = closed_at::date + interval '2 hour') AS "2:00"
,COUNT(id) FILTER (WHERE date_trunc('hour', closed_at) = closed_at::date + interval '3 hour') AS "3:00"
,COUNT(id) FILTER (WHERE date_trunc('hour', closed_at) = closed_at::date + interval '4 hour') AS "4:00"
,COUNT(id) FILTER (WHERE date_trunc('hour', closed_at) = closed_at::date + interval '5 hour') AS "5:00"
FROM
trades
GROUP BY
1;
END;
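Writing all 24 hour columns by hand is tedious, so one option is to generate the select list. The following is only a sketch: the helper `hour_columns` is hypothetical, and its default column names (`closed_at`, `id`) are taken from the question's table and can be overridden.

```python
def hour_columns(ts_col="closed_at", id_col="id", hours=range(24)):
    """Build one COUNT(...) FILTER (...) select-list entry per hour of the day."""
    cols = []
    for h in hours:
        cols.append(
            f"COUNT({id_col}) FILTER (WHERE date_trunc('hour', {ts_col}) = "
            f"{ts_col}::date + interval '{h} hour') AS \"{h}:00\""
        )
    return cols

# Assemble the full pivot query as text; run it with any PostgreSQL client.
select_list = ",\n".join(hour_columns())
query = f"SELECT closed_at::date,\n{select_list}\nFROM trades\nGROUP BY 1\nORDER BY 1;"
print(query)
```

The generated text follows the same pattern as the hand-written columns above, with `interval '0 hour'` used for the midnight bucket.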

Get different LIMIT on each group on postgresql rank

To get 2 rows from each group I can use ROW_NUMBER() with the condition <= 2, but what if I want a different limit for each group, e.g. 3 rows for section_id 1, 1 row for 2 and 1 row for 3?
Given the following table:
db=# SELECT * FROM xxx;
id | section_id | name
----+------------+------
1 | 1 | A
2 | 1 | B
3 | 1 | C
4 | 1 | D
5 | 2 | E
6 | 2 | F
7 | 3 | G
8 | 2 | H
(8 rows)
I get the first 2 rows (ordered by name) for each section_id, i.e. a result similar to:
id | section_id | name
----+------------+------
1 | 1 | A
2 | 1 | B
5 | 2 | E
6 | 2 | F
7 | 3 | G
(5 rows)
Current Query:
SELECT
*
FROM (
SELECT
ROW_NUMBER() OVER (PARTITION BY section_id ORDER BY name) AS r,
t.*
FROM
xxx t) x
WHERE
x.r <= 2;
Create a table to contain the section limits, then join. The big advantage is that as new sections are required or limits change, maintenance is reduced to a single table update and comes at very little cost. See example.
select s.section_id, s.name
from (select section_id, name
, row_number() over (partition by section_id order by name) rn
from sections
) s
left join section_limits sl on (sl.section_id = s.section_id)
where
s.rn <= coalesce(sl.limit_to,2);
Just fix up your where clause:
with numbered as (
select row_number() over (partition by section_id
order by name) as r,
t.*
from xxx t
)
select *
from numbered
where (section_id = 1 and r <= 3)
or (section_id = 2 and r <= 1)
or (section_id = 3 and r <= 1);
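Both answers boil down to the same rule: number the rows within each section, then keep a row only if its number does not exceed that section's limit. Here is a small Python sketch of that rule using the question's sample rows. The `limits` mapping and the `top_n_per_section` helper are hypothetical, and the default of 2 mirrors the `coalesce(sl.limit_to, 2)` in the first answer.

```python
from collections import defaultdict

rows = [  # (id, section_id, name) from the question's sample table
    (1, 1, "A"), (2, 1, "B"), (3, 1, "C"), (4, 1, "D"),
    (5, 2, "E"), (6, 2, "F"), (7, 3, "G"), (8, 2, "H"),
]
limits = {1: 3, 2: 1, 3: 1}   # per-section limits asked for in the question
DEFAULT_LIMIT = 2             # used for sections without an explicit limit

def top_n_per_section(rows, limits, default=DEFAULT_LIMIT):
    seen = defaultdict(int)   # running row number per section_id
    out = []
    # Sorting by (section_id, name) plays the role of PARTITION BY ... ORDER BY name.
    for row in sorted(rows, key=lambda r: (r[1], r[2])):
        seen[row[1]] += 1
        if seen[row[1]] <= limits.get(row[1], default):
            out.append(row)
    return out

result = top_n_per_section(rows, limits)
for row in result:
    print(row)
```

Calling `top_n_per_section(rows, {})` falls back to the default everywhere and returns the two-per-section result shown in the question.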

Grouping Events in Postgres

I've got an events table that is generated by user activity on a site:
timestamp | name
7:00 AM | ...
7:01 AM | ...
7:02 AM | ...
7:30 AM | ...
7:31 AM | ...
7:32 AM | ...
8:01 AM | ...
8:03 AM | ...
8:05 AM | ...
8:08 AM | ...
8:09 AM | ...
I'd like to aggregate over the events to provide a view of when a user is active. I'm defining active to mean the period in which an event is within +/- 2 minutes. For the above that'd mean:
from | till
7:00 AM | 7:02 AM
7:30 AM | 7:32 AM
8:01 AM | 8:05 AM
8:08 AM | 8:09 AM
What's the best way to write a query that'll aggregate in that way? Is it possible via a WINDOW function or a self join, or is PL/SQL required?
Use two window functions: one to calculate the intervals between contiguous events (gaps) and another to find series of gaps less than or equal to 2 minutes:
select arr[1] as "from", arr[cardinality(arr)] as "till"
from (
select array_agg(timestamp order by timestamp) arr
from (
select timestamp, sum((gap > '2m' )::int) over w
from (
select timestamp, coalesce(timestamp - lag(timestamp) over w, '3m') gap
from events
window w as (order by timestamp)
) s
window w as (order by timestamp)
) s
group by sum
) s
from | till
----------+----------
07:00:00 | 07:02:00
07:30:00 | 07:32:00
08:01:00 | 08:05:00
(3 rows)
Test it here.
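The same gap-based grouping can be sketched in plain Python, which may help when checking the query's logic. This is an illustration only: the helper names `active_periods` and `at` are hypothetical, and the date 2017-01-01 is an arbitrary placeholder because the question only shows times of day.

```python
from datetime import datetime, timedelta

def active_periods(timestamps, max_gap=timedelta(minutes=2)):
    """Return (start, end) pairs; a gap larger than max_gap starts a new period."""
    periods = []
    for ts in sorted(timestamps):
        if periods and ts - periods[-1][1] <= max_gap:
            periods[-1][1] = ts        # close enough: extend the current period
        else:
            periods.append([ts, ts])   # first event, or the gap was too large
    return [tuple(p) for p in periods]

def at(hour, minute):
    # Arbitrary placeholder date; only the time of day matters here.
    return datetime(2017, 1, 1, hour, minute)

events = [at(7, 0), at(7, 1), at(7, 2), at(7, 30), at(7, 31), at(7, 32),
          at(8, 1), at(8, 3), at(8, 5), at(8, 8), at(8, 9)]

for start, end in active_periods(events):
    print(start.strftime("%H:%M"), "|", end.strftime("%H:%M"))
```

On the eleven sample events from the question this yields four periods, including 08:08 to 08:09, because the 3-minute gap after 08:05 exceeds the 2-minute threshold.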
Alternatively, group the events by flooring them to the half hour and take the min and max values:
WITH x(t) AS ( VALUES
('7:02 AM'::TIME),('7:01 AM'::TIME),('7:00 AM'::TIME),
('7:30 AM'::TIME),('7:31 AM'::TIME),('7:32 AM'::TIME),
('8:01 AM'::TIME),('8:03 AM'::TIME),('8:05 AM'::TIME)
)
SELECT MIN(t) "from", MAX(t) "till"
FROM (select t, date_trunc('hour', t) +
CASE WHEN (t-date_trunc('hour', t)) >= '30 minutes'::interval
THEN '30 minutes'::interval ELSE '0'::interval END t1 FROM x ) y
GROUP BY t1 ORDER BY t1;
You can apply the same recipe with datetime values like:
WITH x(t) AS (
SELECT '2017-01-01'::TIMESTAMP + (RANDOM()*1440*'1 minute'::INTERVAL) t
FROM GENERATE_SERIES(0,1000))
SELECT MIN...

Postgresql running totals with groups missing data and outer joins

I've written a SQL query that pulls data from a user table and produces a weekly total and a cumulative total of when users were created. The data is grouped by week (using the windowing feature of Postgres). I'm using a left outer join to include the weeks when no users were created. Here is the query:
WITH reporting_period AS (
SELECT generate_series(date_trunc('week', date '2015-04-02'), date_trunc('week', date '2015-10-02'), interval '1 week') AS interval
)
SELECT
date(interval) AS interval
, count(users.created_at) as interval_count
, sum(count( users.created_at) ) OVER (order by date_trunc('week', users.created_at)) AS cumulative_count
FROM reporting_period
LEFT JOIN users
ON interval=date(date_trunc('week', users.created_at) )
GROUP BY interval, date_trunc('week', users.created_at) ORDER BY interval
It works almost perfectly. The cumulative value is calculated properly for weeks in which a user was created. For weeks when no user was created, it is set to the grand total and not the cumulative total up to that point.
Notice the rows marked with **: the Week Tot column (interval_count) is 0 as expected, but the Run Tot column (cumulative_count) is 1053, which equals the grand total.
Week Week Tot Run Tot
-----------------------------------
2015-03-30 | 4 | 4
2015-04-06 | 13 | 17
2015-04-13 | 0 | 1053 **
2015-04-20 | 9 | 26
2015-04-27 | 3 | 29
2015-05-04 | 0 | 1053 **
2015-05-11 | 0 | 1053 **
2015-05-18 | 1 | 30
2015-05-25 | 0 | 1053 **
...
2015-06-08 | 996 | 1031
...
2015-09-07 | 2 | 1052
2015-09-14 | 0 | 1053 **
2015-09-21 | 1 | 1053 **
2015-09-28 | 0 | 1053 **
This is what I would like
Week Week Tot Run Tot
-----------------------------------
2015-03-30 | 4 | 4
2015-04-06 | 13 | 17
2015-04-13 | 0 | 17 **
2015-04-20 | 9 | 26
2015-04-27 | 3 | 29
2015-05-04 | 0 | 29 **
...
It seems to me that if the outer join can somehow apply the grand total to the last column it should be possible to apply the current running total but I'm at a loss on how to do it.
Is this possible?
This is not guaranteed to work out of the box as I haven't tested it on actual tables, but the key here is to join users on created_at over a range of dates.
with reportingperiod as (
select intervaldate as interval_begin,
intervaldate + interval '1 month' as interval_end
from (
SELECT GENERATE_SERIES(DATE(DATE_TRUNC('day', DATE '2015-03-15')),
DATE(DATE_TRUNC('day', DATE '2015-10-15')), interval '1 month') AS intervaldate
) as rp
)
select interval_end,
interval_count,
sum(interval_count) over (order by interval_end) as running_sum
from (
select interval_end,
count(u.created_at) as interval_count
from reportingperiod rp
left join (
select created_at
from users
where created_at < '2015-10-02'
) u on u.created_at > rp.interval_begin
and u.created_at <= rp.interval_end
group by interval_end
) q
I figured it out. The trick was subqueries. Here's my approach:
1. Add a count column to the generate_series call with a default value of 0.
2. Select interval and count(users.created_at) from the users data.
3. Union the generate_series and the result from the select in step #2. (At this point the result will have duplicates for each interval.)
4. Use the results in a subquery to get interval and max(interval_count), which eliminates the duplicates.
5. Use the window aggregation as before to get the running total.
SELECT
interval
, interval_count
, SUM(interval_count ) OVER (ORDER BY interval) AS cumulative_count
FROM
(
SELECT interval, MAX(interval_count) AS interval_count FROM
(
SELECT GENERATE_SERIES(DATE(DATE_TRUNC('week', DATE '2015-04-02')),
DATE(DATE_TRUNC('week', DATE '2015-10-02')), interval '1 week') AS interval,
0 AS interval_count
UNION
SELECT DATE_TRUNC('week', users.created_at) AS INTERVAL,
COUNT(users.created_at) AS interval_count FROM users
WHERE users.created_at < date '2015-10-02'
GROUP BY 1 ORDER BY 1
) sub1
GROUP BY interval
) grouped_data
I'm not sure if there are any serious performance issues with this approach but it seems to work. If anyone has a better, more elegant or performant approach I would love the feedback.
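The zero-fill-then-accumulate idea behind this approach can be sketched in Python as a sanity check. This is only an illustration: `weekly_running_total` is a hypothetical helper, the counts are the first few weeks from the question's table, and `counts.get(w, 0)` plays the role that the UNION with a zero row plus MAX plays in the query.

```python
from datetime import date, timedelta
from itertools import accumulate

def weekly_running_total(counts, first_week, last_week):
    """Zero-fill missing weeks, then accumulate the counts in week order."""
    weeks = []
    wk = first_week
    while wk <= last_week:
        weeks.append(wk)
        wk += timedelta(weeks=1)
    per_week = [counts.get(w, 0) for w in weeks]   # 0 for weeks without users
    return list(zip(weeks, per_week, accumulate(per_week)))

# Weekly user counts copied from the start of the question's table.
counts = {
    date(2015, 3, 30): 4,
    date(2015, 4, 6): 13,
    date(2015, 4, 20): 9,
    date(2015, 4, 27): 3,
}

rows = weekly_running_total(counts, date(2015, 3, 30), date(2015, 5, 4))
for week, week_total, running_total in rows:
    print(week, "|", week_total, "|", running_total)
```

Because the accumulation runs over the complete series of weeks rather than over the weeks that happen to have users, empty weeks carry the previous running total forward (17 for 2015-04-13, 29 for 2015-05-04).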
Edit: My solution doesn't work when trying to group by arbitrary time windows.
I just tried this solution with the following changes:
/* generate series using DATE_TRUNC('day'...)*/
SELECT GENERATE_SERIES(DATE(DATE_TRUNC('day', DATE '2015-04-02')),
DATE(DATE_TRUNC('day', DATE '2015-10-02')), interval '1 month') AS interval,
0 AS interval_count
/* And this part */
SELECT DATE_TRUNC('day', users.created_at) AS INTERVAL,
COUNT(users.created_at) AS interval_count FROM users
WHERE users.created_at < date '2015-10-02'
GROUP BY 1 ORDER BY 1
For example, is it possible to produce similar results but have the data grouped by intervals like so:
3/15/15 - 4/14/15,
4/15/15 - 5/14/15,
5/15/15 - 6/14/15
etc.

Iterate through rows, compare them against each other and store results in another table

I have a table that contains the following rows:
product_id | order_date
A | 12/04/12
A | 01/11/13
A | 01/21/13
A | 03/05/13
B | 02/14/13
B | 03/09/13
What I now need is an overview for each month: how many products have been bought for the first time (= have not been bought the month before), how many are existing products (= have been bought the month before), and how many have not been purchased within a given month. Taking the sample above as input, the script should deliver the following result, regardless of what period of time is in the data:
month | new | existing | nopurchase
12/2012 | 1 | 0 | 0
01/2013 | 0 | 1 | 0
02/2013 | 1 | 0 | 1
03/2013 | 1 | 1 | 0
Would be great to get a first hint how this could be solved so I'm able to continue.
Thanks!
SQL Fiddle
with t as (
select product_id pid, date_trunc('month', order_date)::date od
from t
group by 1, 2
)
select od,
sum(is_new::integer) "new",
sum(is_existing::integer) existing,
sum(not_purchased::integer) nopurchase
from (
select od,
lag(t_pid) over(partition by s_pid order by od) is null and t_pid is not null is_new,
lag(t_pid) over(partition by s_pid order by od) is not null and t_pid is not null is_existing,
lag(t_pid) over(partition by s_pid order by od) is not null and t_pid is null not_purchased
from (
select t.pid t_pid, s.pid s_pid, s.od
from
t
right join
(
select pid, s.od
from
t
cross join
(
select date_trunc('month', d)::date od
from
generate_series(
(select min(od) from t),
(select max(od) from t),
'1 month'
) s(d)
) s
group by pid, s.od
) s on t.od = s.od and t.pid = s.pid
) s
) s
group by 1
order by 1
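The lag-based flags in the query compare each product's purchase status in a month with its status in the previous month. The same comparison can be sketched with sets in Python. This reflects my reading of the question's definitions (new = bought now but not the month before, existing = bought in both, nopurchase = bought the month before but not now); `monthly_overview` and `month_index` are hypothetical helpers.

```python
from datetime import date

orders = [  # (product_id, order_date) from the question's sample
    ("A", date(2012, 12, 4)), ("A", date(2013, 1, 11)), ("A", date(2013, 1, 21)),
    ("A", date(2013, 3, 5)), ("B", date(2013, 2, 14)), ("B", date(2013, 3, 9)),
]

def month_index(d):
    # Consecutive integer per calendar month, so "previous month" is index - 1.
    return d.year * 12 + (d.month - 1)

def monthly_overview(orders):
    bought = {}  # month index -> set of products bought in that month
    for pid, d in orders:
        bought.setdefault(month_index(d), set()).add(pid)
    rows = []
    for m in range(min(bought), max(bought) + 1):   # includes months without orders
        cur = bought.get(m, set())
        prev = bought.get(m - 1, set())
        label = f"{m % 12 + 1:02d}/{m // 12}"
        rows.append((label, len(cur - prev), len(cur & prev), len(prev - cur)))
    return rows

for row in monthly_overview(orders):
    print(row)
```

On the sample data this reproduces the four rows of the expected result in the question.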