Do continuous aggregates in postgres/timescaledb require the time_bucket function? - postgresql

I have a SELECT query which gives me the aggregated sum(minutes_per_hour_used) of some data, grouped by id, weekday and observed hour.
SELECT id,
extract(dow from observed_date) AS weekday, -- observed_date is type date
observed_hour, -- type timestamp without time zone, every full hour 00:00:00, 01:00:00, ...
sum(minutes_per_hour_used)
FROM base_table
GROUP BY id, weekday, observed_hour
ORDER BY id, weekday, observed_hour;
The result looks nice, but now I would like to store that in a self-maintained view which only considers/aggregates the last 8 weeks. I thought continuous aggregates were the right way, but I can't make it work (https://blog.timescale.com/blog/continuous-aggregates-faster-queries-with-automatically-maintained-materialized-views/). It seems I need to somehow use the time_bucket function, but I don't know how. Any ideas/hints?
I am using postgres with timescaledb.
EDIT: This gives me the desired output, but I can't put it in a continuous aggregate
SELECT id,
extract(dow from observed_date) AS weekday,
observed_hour,
sum(minutes_per_hour_used)
FROM base_table
WHERE observed_date >= now() - interval '8 weeks'
GROUP BY id, weekday, observed_hour
ORDER BY id, weekday, observed_hour;
EDIT: Prepending this with
CREATE VIEW my_view
WITH (timescaledb.continuous) AS
gives me [0A000] ERROR: invalid SELECT query for continuous aggregate

Continuous aggregates require grouping by time_bucket:
SELECT <grouping_exprs>, <aggregate_functions>
FROM <hypertable>
[WHERE ... ]
GROUP BY time_bucket( <const_value>, <partition_col_of_hypertable> ),
[ <optional_grouping_exprs> ]
[HAVING ...]
time_bucket must be applied to a partitioning column, which is usually the time dimension column used when the hypertable was created. Also, ORDER BY is not supported.
In the aggregate query in the question, no time column is used for grouping. Neither weekday nor observed_hour is a valid time column, since they don't increase with time; instead, their values repeat regularly: weekday repeats every 7 days and observed_hour repeats every 24 hours. This breaks the requirements for continuous aggregates.
Since there is no ready solution for this use case, one approach is to use a continuous aggregate to reduce the amount of data for the targeted query, e.g., by bucketing by day:
CREATE MATERIALIZED VIEW daily
WITH (timescaledb.continuous) AS
SELECT id,
time_bucket('1day', observed_date) AS day,
observed_hour,
sum(minutes_per_hour_used)
FROM base_table
GROUP BY 1, 2, 3;
Then execute the targeted aggregate query on top of it:
SELECT id,
extract(dow from day) AS weekday,
observed_hour,
sum(minutes_per_hour_used)
FROM daily
WHERE day >= now() - interval '8 weeks'
GROUP BY id, weekday, observed_hour
ORDER BY id, weekday, observed_hour;
Another approach is to use a plain PostgreSQL materialized view and refresh it on a regular basis with a custom job run by TimescaleDB's job scheduling framework. Note that the refresh re-calculates the entire view, which in the example case covers 8 weeks of data. The materialized view can be written in terms of the original table base_table or in terms of the continuous aggregate suggested above.
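A minimal sketch of that approach, assuming TimescaleDB 2.x user-defined actions (the view and procedure names here are made up for illustration):
CREATE MATERIALIZED VIEW weekday_hourly_usage AS
SELECT id,
       extract(dow from observed_date) AS weekday,
       observed_hour,
       sum(minutes_per_hour_used) AS minutes_used
FROM base_table
WHERE observed_date >= now() - interval '8 weeks'
GROUP BY id, weekday, observed_hour;

-- A custom job that re-materializes the 8-week snapshot.
CREATE OR REPLACE PROCEDURE refresh_weekday_hourly(job_id int, config jsonb)
LANGUAGE plpgsql AS $$
BEGIN
    REFRESH MATERIALIZED VIEW weekday_hourly_usage;
END
$$;

-- Schedule the refresh to run every hour.
SELECT add_job('refresh_weekday_hourly', '1 hour');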

Related

How to select data between two dates using only the start date?

I have a problem selecting data between two dates when only the start_date is available.
The example I want to see is which discount_nr was active between 2020-07-01 and 2020-07-15, or on only one day, 2020-07-14. I tried different solutions (date range, generate_series, and so on), but was still not able to get it to work.
The table only has start dates, no end dates.
Example:
discount_nr, start_date
1, 2020-06-30
2, 2020-07-03
3, 2020-07-10
4, 2020-07-15
You can get the end dates by looking at the start date of the next row. This is done with lead. lead(start_date) over(order by start_date asc) will get you the start_date of the next row. If we take 1 day from that we'll get the inclusive end date.
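For illustration, a minimal sketch of the separate start/end version (table and column names follow the example data; end_date is inclusive):
SELECT
    discount_nr,
    start_date,
    lead(start_date) OVER (ORDER BY start_date ASC) - 1 AS end_date -- day before the next discount starts
FROM discounts;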
Rather than separate start/end columns, a single daterange column is easier to work with. You can use that as a CTE or create a view.
create view discount_durations as
select
id,
daterange(
start_date,
lead(start_date) over(order by start_date asc)
) as duration
from discounts
Now querying it is easy using range operators. Use @> to check if the range contains a date.
select *
from discount_durations
where duration @> '2020-07-14'::date
And use && to see if they have any overlap.
select *
from discount_durations
where duration && daterange('2020-07-01', '2020-07-15');
Demonstration

PostgreSQL / TimescaleDB query that group data by custom interval

I have a working query that allows me to sum the value field for each day.
SELECT sum(value) as sum_value, to_char(time::date,'DD-MM-YY') as day
FROM "measures"
GROUP BY "day"
ORDER BY "day" asc
Is it possible to do the same query, but instead of grouping by day, grouping it by 2 days, or a specific duration (days only, not hours)?
You can use the timescaledb extension's time_bucket() function as @Mike Organek commented, or use a native PostgreSQL data transformation like:
SELECT
sum(value) as sum_value,
to_char(min("time")::date, 'DD-MM-YY') as day, -- label each bucket with its earliest day
floor(extract(epoch from "time")/(60*60*24*2)) as time_bucket
FROM "measures"
GROUP BY "time_bucket"
ORDER BY "time_bucket" asc;
In the above query we extract the Unix epoch from the datetime column, divide it by the grouping period in seconds (60*60*24*2, i.e. 2 days or 48 hours) and round it down so it can be used in the GROUP BY clause.
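With the timescaledb extension available, the same grouping can also be written with time_bucket (a sketch; table and column names follow the question):
SELECT
    time_bucket('2 days', "time") AS bucket,
    sum(value) AS sum_value
FROM "measures"
GROUP BY bucket
ORDER BY bucket ASC;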

How to execute SELECT DISTINCT ON query using SQLAlchemy

I have a requirement to display the spend estimation for the last 30 days. SpendEstimation is calculated multiple times a day. This can be achieved using a simple SQL query:
SELECT DISTINCT ON (date) date(time) AS date, resource_id , time
FROM spend_estimation
WHERE
resource_id = '<id>'
and time > now() - interval '30 days'
ORDER BY date DESC, time DESC;
Unfortunately I can't seem to do the same using SQLAlchemy. It always creates SELECT DISTINCT on all columns; the generated query does not contain DISTINCT ON.
query = session.query(
func.date(SpendEstimation.time).label('date'),
SpendEstimation.resource_id,
SpendEstimation.time
).distinct(
'date'
).order_by(
'date',
SpendEstimation.time
)
SELECT DISTINCT
date(time) AS date,
resource_id,
time
FROM spend
ORDER BY date, time
It is missing the ON (date) bit. If I use query.group_by, then SQLAlchemy adds DISTINCT ON. Though I can't think of a solution for the given problem using group by.
I tried using the function in the distinct part and the order by part as well.
query = session.query(
func.date(SpendEstimation.time).label('date'),
SpendEstimation.resource_id,
SpendEstimation.time
).distinct(
func.date(SpendEstimation.time).label('date')
).order_by(
func.date(SpendEstimation.time).label('date'),
SpendEstimation.time
)
Which resulted in this SQL:
SELECT DISTINCT
date(time) AS date,
resource_id,
time,
date(time) AS date -- only difference
FROM spend
ORDER BY date, time
Which is still missing DISTINCT ON.
Your SQLAlchemy version might be the culprit. See this related question:
Sqlalchemy with postgres. Try to get 'DISTINCT ON' instead of 'DISTINCT'
which links to this bug report:
https://bitbucket.org/zzzeek/sqlalchemy/issues/2142
The fix wasn't backported to 0.6; it looks like it was fixed in 0.7.
Stupid question: have you tried distinct on SpendEstimation.date instead of 'date'?
EDIT: It just struck me that you're trying to use the named column from the SELECT. SQLAlchemy is not that smart. Try passing in the func expression into the distinct() call.

Postgres generate_series excluding date ranges

I'm creating a subscription management system, and need to generate a list of upcoming billing dates for the next 2 years. I've been able to use generate_series to get the appropriate dates as such:
SELECT i::DATE
FROM generate_series('2015-08-01', '2017-08-01', '1 month'::INTERVAL) i
The last step I need to take is to exclude specific date ranges from the calculation. These excluded date ranges may be any range of time. Additionally, they should not be factored into the time range for the generate_series.
For example, say we have a date range exclusion from '2015-08-27' to '2015-09-03'. The resulting generate_series should exclude the dates in that week from the calculation and basically push all future monthly billing dates one week into the future:
2015-08-01
2015-09-10
2015-10-10
2015-11-10
2015-12-10
First you create a time series of dates over the next two years, EXCEPT your blackout dates:
SELECT dt
FROM generate_series('2015-08-01'::date, '2017-08-01'::date, interval '1 day') AS s(dt)
EXCEPT
SELECT dt
FROM generate_series('2015-08-27'::date, '2015-09-03'::date, interval '1 day') as ex1(dt)
Note that you can have as many EXCEPT clauses as you need. For individual blackout days (as opposed to ranges) you could use a VALUES clause instead of a SELECT.
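For example, a small sketch of the VALUES variant for single blackout days (the dates here are illustrative):
SELECT dt
FROM generate_series('2015-08-01'::date, '2017-08-01'::date, interval '1 day') AS s(dt)
EXCEPT
-- individual blackout days instead of a second generate_series
SELECT bd FROM (VALUES ('2015-12-25'::timestamp), ('2016-01-01'::timestamp)) AS ex(bd);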
Then you window over that time-series to generate row numbers of billable days:
SELECT row_number() OVER (ORDER BY dt) AS rn, dt
FROM (<query above>) x
Then you select those days where you want to bill:
SELECT dt
FROM (<query above>) y
WHERE rn % 30 = 1; -- billing on the first day of the period
(This latter query follows Craig's advice of billing by 30 days.)
Yields:
SELECT dt
FROM (
SELECT row_number() OVER (ORDER BY dt) AS rn, dt
FROM (
SELECT dt
FROM generate_series('2015-08-01'::date, '2017-08-01'::date, interval '1 day') AS s(dt)
EXCEPT
SELECT dt
FROM generate_series('2015-08-27'::date, '2015-09-03'::date, interval '1 day') as ex1(dt)
) x
) y
WHERE rn % 30 = 1;
You will have to split the call to generate_series around the exclusions. Something like a union of 3 queries:
The first query pulls dates from the start up to the start of the exclusion range.
The second query pulls dates between the end of the exclusion range and your end date.
The third query pulls dates when none of your series dates cross the exclusion range.
Note: You still need a way to loop through the exclusion list (if you have one). Also, this query may not be very efficient, as such scenarios can be better handled through functions or procedural code.
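A minimal sketch of that idea for a single, hard-coded exclusion range ('2015-08-27' to '2015-09-03'); the degenerate case where the exclusion does not overlap the series is omitted, and the pattern would be repeated per exclusion:
-- dates before the exclusion range
SELECT dt::date
FROM generate_series('2015-08-01'::date, '2015-08-26'::date, interval '1 day') AS s(dt)
UNION ALL
-- dates after the exclusion range
SELECT dt::date
FROM generate_series('2015-09-04'::date, '2017-08-01'::date, interval '1 day') AS s(dt)
ORDER BY 1;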

how do you sum over a related period

I need to sum values that fall within a custom period, e.g. +2 months or within a quarter (the periods come from a related date table).
Is there a way to use dense_rank to partition those custom periods?
select
FiscalMonth
,Value
from table
The SQL will have to do the following:
Join the value table and the period table.
Include the period in the select list and sum the value, grouping by the period.
i.e.:
select b.period, sum(a.value)
from table a
inner join period b on a.FiscalMonth between b.StartMonth and b.EndMonth
group by b.period
Note: The join condition will have to be modified based on what data you actually have in the period table.
Hope this helps
Well, if you need values from an X interval by month, you could use something like:
SELECT *
FROM yourTable
WHERE extract(month FROM some_date) = extract(month FROM CURRENT_DATE - interval '1 month') -- could be any X interval!
This example shows the rows from the previous month relative to the current one; the point is that you can build the filter from functions on intervals.
Of course, you could use SUM for the adding.
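For instance, a sketch that does the summing per month (the table and column names are assumed for illustration):
SELECT
    date_trunc('month', some_date) AS month,
    sum(value) AS total_value
FROM yourTable
GROUP BY 1
ORDER BY 1;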