PostgreSQL optimization: average over range of dates

I have a query (with a subquery) that calculates, for each day, the average temperature over the previous years, plus/minus one week around that day. It works, but it is not all that fast. The time series values below are just an example. I'm using doy because I want a sliding window around the same date in every year.
SELECT days,
       (SELECT avg(temperature)
        FROM temperatures
        WHERE site_id = ?
          AND extract(doy FROM timestamp) BETWEEN
              extract(doy FROM days) - 7 AND extract(doy FROM days) + 7
       ) AS temperature
FROM generate_series('2017-05-01'::date, '2017-08-31'::date, interval '1 day') days
So my question is: could this query somehow be improved? I was thinking about using some kind of window function, or possibly lag and lead. However, regular window frames cover a fixed number of rows, whereas there can be any number of measurements within the two-week window.
I can live with what I have for now, but as the amount of data grows, so does the execution time of the query. The two latter extract calls could be removed, but that brings no noticeable speed improvement and only makes the query less legible. Any help would be greatly appreciated.

The best index for your original query is
create index idx_temperatures_site_id_timestamp_doy
on temperatures(site_id, extract(doy from timestamp));
This can greatly improve your original query's performance.
While your original query is simple and readable, it has one flaw: it recalculates each day's average 14 times (on average). Instead, you could compute these averages once per day and then take the two-week window's weighted average (the weight for a day's average being the count of the individual rows in your original table). Something like this:
with p as (
    select timestamp '2017-05-01' min,
           timestamp '2017-08-31' max
)
select t.*
from p
cross join (
    select days,
           sum(sum(temperature)) over pn1week / sum(count(temperature)) over pn1week
    from p
    cross join generate_series(min - interval '1 week', max + interval '1 week', interval '1 day') days
    left join temperatures on site_id = ? and extract(doy from timestamp) = extract(doy from days)
    group by days
    window pn1week as (order by days rows between 7 preceding and 7 following)
) t
where days between min and max
order by days
However, there is not much gain here, as this is only twice as fast as your original query (with the optimal index).
http://rextester.com/JCAG41071
Notes: I used timestamp because I assumed your column's type is timestamp. But as it turned out, you use timestamptz (aka timestamp with time zone). With that type you cannot index the extract(doy from timestamp) expression, because its output depends on the client's current time zone setting.
For timestamptz, use an index that (at least) starts with site_id. The window version should still improve performance.
http://rextester.com/XTJSM42954
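If it is acceptable to pin the day-of-year calculation to one fixed time zone, the expression becomes indexable even for timestamptz, because timezone('UTC', ...) and extract() on a plain timestamp are both immutable. A sketch (the index name and the choice of UTC are assumptions; queries must repeat the exact same expression to be able to use the index):
create index idx_temperatures_site_id_doy_utc
on temperatures (site_id, extract(doy from timestamp at time zone 'UTC'));

-- ... where site_id = ? and extract(doy from timestamp at time zone 'UTC')
--       between extract(doy from days) - 7 and extract(doy from days) + 7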

Related

Calculating Median using Percentile on Redshift

I have a large table with over 18M rows and I want to calculate the median, using PERCENTILE_CONT for that. However, the query takes around 17 minutes, which is not ideal.
Here is my query
WITH raw_data AS
(
SELECT name AS series,
(duration) /(60000) AS value
FROM warehouse.table
),
quartiles AS
(
SELECT series,
value,
PERCENTILE_CONT(0.25) WITHIN GROUP(ORDER BY value) OVER (PARTITION BY series) AS q1,
MEDIAN(value) OVER (PARTITION BY series) AS median,
PERCENTILE_CONT(0.75) WITHIN GROUP(ORDER BY value) OVER (PARTITION BY series) AS q3
FROM raw_data
)
SELECT series,
MIN(value) AS minimum,
AVG(q1) AS q1,
AVG(median) AS median,
AVG(q3) AS q3,
MAX(value) AS maximum
FROM quartiles
GROUP BY 1
Is there a way I could speed this up?
Thanks
Your query is asking Redshift to do a lot of work. The data must be distributed according to your PARTITION column and then sorted according to your ORDER BY column.
There are two options to make it faster:
Use more hardware. Redshift performance scales very linearly. Most queries will run 2x as fast on 2x as much hardware.
Do some work in advance. You can maximize performance for this query by restructuring the table. Use the PARTITION column as the distribution key (DISTKEY(series)) and first sort key, and use the ORDER BY column as the second sort key (SORTKEY(series,value)). This minimizes the work required to answer the query. Time savings will vary, but on my small test cluster I have seen a 3m30s PERCENTILE_CONT query drop to 30s using this approach.
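For example, the table could be rebuilt along these lines (a sketch only; warehouse.table, name and duration come from the query above, and the new table name is made up). The raw_data CTE would then simply select from the new table:
CREATE TABLE warehouse.table_sorted
DISTKEY (series)
SORTKEY (series, value)
AS
SELECT name             AS series,
       duration / 60000 AS value
FROM warehouse.table;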
To get some speedup for part of this, try the following:
SELECT distinct
series,
value,
PERCENTILE_CONT(0.25) WITHIN GROUP(ORDER BY value) OVER (PARTITION BY series) AS q1,
MEDIAN(value) OVER (PARTITION BY series) AS median,
PERCENTILE_CONT(0.75) WITHIN GROUP(ORDER BY value) OVER (PARTITION BY series) AS q3
FROM warehouse.table
This may be faster, as it is more likely to make correct use of your table's sort and distribution keys.
You would have to calculate the min and max elsewhere, but at least see if it runs faster.
You can try the APPROXIMATE PERCENTILE_DISC(percentile) function, which is optimized for working on distributed data with a low error percentage; for the median, use a percentile of 0.5.
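For example (a sketch reusing the question's table and columns; APPROXIMATE PERCENTILE_DISC is a plain aggregate, so it is used with GROUP BY rather than as a window function):
SELECT name AS series,
       APPROXIMATE PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY duration / 60000) AS approx_median
FROM warehouse.table
GROUP BY 1;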

Finding difference between two dates in Tableau in terms of days, rounded to two decimals

I want to subtract two date fields in Tableau and find the difference in terms of days, rounded to two decimal places. I have already created a calculated field; however, the result is rounded down to the nearest whole number.
The current calculated field I have is such:
DATEDIFF('day', [Reg Time], [Valid Time])
Which returns a result as such:
Reg Time           | Valid Time        | Datediff
11/1/2018 12:00 AM | 11/1/2018 1:00 PM | 0
What I want is this:
Reg Time           | Valid Time        | Datediff
11/1/2018 12:00 AM | 11/1/2018 1:00 PM | .5
The datediff would return a result of 0.50 because the difference is 12 hours (half a day).
All help is greatly appreciated.
I assume you are working with fields whose data type is datetime instead of date. Otherwise, a result in whole number of days is as good as it is going to get :-)
Dates are compound data types, with internal structure. Every date has three component parts: a year, a month, and a day of the month. Date fields have a calendar icon beside the field name in the sidebar data pane.
Datetimes are also compound data types, and add three additional components: hour, minute and second. Datetimes add a little watch to the calendar symbol in their icons.
So if your data source truly has datetime data, and the Tableau datatype is set to datetime, then the following calculations get the effect you requested -- showing the difference or duration measured in units of days.
DATEDIFF('hour', [Reg Time], [Valid Time]) / 24
or
DATEDIFF('minute', [Reg Time], [Valid Time]) / (24 * 60)
This calculation is useful when making Gantt bars since the field on the size shelf in that case is assumed to be expressed in days.
The DATEDIFF function will work for you; however, the 'day' date part only returns whole days. To work around this, use the 'hour' date part in your DATEDIFF function.
Then divide the result of this calculation by 24 (hours in a day) to get the fraction of a day.
The last thing you need to do is make sure not to aggregate these values, which Tableau will try to do by default.
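Since the goal is two decimal places, you can also wrap the calculation in ROUND (a sketch using the field names from the question; 1440 is the number of minutes in a day):
ROUND(DATEDIFF('minute', [Reg Time], [Valid Time]) / 1440, 2)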
Hope this helps.

Storing hours, minutes and seconds effectively

Using a PostgreSQL database, what is the best way to store a time in hours, minutes and seconds? E.g. "40:21", as in 40 minutes and 21 seconds.
Example data:
20:21
1:20:02
12:20:02
40:21
time would be the obvious candidate to store time as you describe it. It enforces the range of daily time (00:00:00 to 24:00:00) and occupies 8 bytes.
interval allows arbitrary intervals, even negative ones, or a mix of positive and negative components like '1 month - 3 seconds'. It doesn't fit your description well and occupies 16 bytes. See:
How to get the number of days in a month?
To optimize storage size, make it an integer (4 bytes) signifying seconds. To convert time back and forth:
SELECT EXTRACT(epoch FROM time '18:55:28');         -- 68128
SELECT time '00:00' + interval '1 second' * 68128;  -- '18:55:28' (time)
It sounds like you want to store a length of time, or interval. PostgreSQL has a special interval type to store a length of time, e.g.
SELECT interval '2 hours 3 minutes 20 seconds';
This can be added to a timestamp to form a new timestamp, or multiplied (so that 2 * interval '2 hours' = interval '4 hours'). The interval type seems tailor-made for your use case.
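For example (a minimal sketch of the arithmetic described above):
SELECT timestamp '2017-05-01 12:00' + interval '40 minutes 21 seconds';  -- 2017-05-01 12:40:21
SELECT 2 * interval '2 hours';                                           -- 04:00:00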

Use Number of Values selected in a Filter inside a Calculation Field

In Tableau, I have a Date Filter with 31 days (days in a month).
I have a calculation,
Sum(sales)/(No of days)
Based on the number of days selected in the Date Filter, my calculation should change.
E.g.: if 12 days are selected in the filter,
the calculation should be Sum(sales)/12.
If 20 days, then Sum(sales)/20.
Regards
Try this formula
sum(Sales) / (datediff('day', min(Date), max(Date)) + 1)
Just realize that if you have missing data at the start or end of your period, say no entries at all for the first few days of the month, it will only start counting days at the very first day of data, i.e. min(Date).
For those reasons, it is often useful to pad your data to have at least one row per day, even if most of the fields are null in the padded rows. You can use a union or left join to do that without disturbing your original data, for example as sketched below.
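If the data lives in a SQL source such as PostgreSQL, the padding can be done with a date spine and a left join (a sketch only; the sales table and its sale_date/amount columns are hypothetical stand-ins for your data source):
SELECT d::date AS sale_date,
       s.amount
FROM generate_series('2018-11-01'::date, '2018-11-30'::date, interval '1 day') AS d
LEFT JOIN sales s ON s.sale_date = d::date;
Every day in the period then has at least one row, so min(Date) and max(Date) in Tableau cover the full selected range even when some days have no sales.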

RANGE PRECEDING is only supported with UNBOUNDED

I want to compute an avg() aggregation within a time window.
SQL code:
select
    user_id,
    timestamp,
    avg(y) over (order by timestamp range between interval '5 seconds' preceding and interval '5 seconds' following)
from A
but the system reports the error:
RANGE PRECEDING is only supported with UNBOUNDED
Is there any method to implement, say, a 10-second window for the avg() window function?
The frame of the window function should span from n seconds before the current row's timestamp to m seconds after it.
RANGE PRECEDING is only supported with UNBOUNDED
Yep ... PostgreSQL's window functions don't yet implement RANGE frames with value offsets (such as interval '5 seconds' preceding).
I've had many situations where they would've been useful, but it's a lot of work to implement them and time is limited.
Is there any method to implement, say, a 10-second window for the avg() window function?
You will need to use a left join over generate_series (and, if appropriate, aggregation) to turn the range into a regular sequence of rows, inserting null rows where there's no data and collapsing multiple readings within the same second into a single value.
Then you do a (ROWS n PRECEDING ...) window over the left-joined and aggregated data to get the running average.
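A sketch of that approach, assuming the table is A(user_id, timestamp, y) as in the question; the date bounds of the one-second grid are placeholders to adapt:
SELECT user_id,
       sec,
       avg(avg_y) OVER (PARTITION BY user_id
                        ORDER BY sec
                        ROWS BETWEEN 5 PRECEDING AND 5 FOLLOWING) AS avg_y_10s
FROM (
    -- one row per user and second, NULL where there is no data,
    -- multiple readings within a second collapsed into one average
    SELECT u.user_id, g.sec, avg(a.y) AS avg_y
    FROM (SELECT DISTINCT user_id FROM A) u
    CROSS JOIN generate_series(timestamp '2017-01-01 00:00:00',
                               timestamp '2017-01-01 01:00:00',
                               interval '1 second') AS g(sec)
    LEFT JOIN A a ON a.user_id = u.user_id
                 AND date_trunc('second', a.timestamp) = g.sec
    GROUP BY u.user_id, g.sec
) per_second;
Note that this averages the per-second averages; if the result should instead be weighted by the number of raw readings, carry sum() and count() through and divide them, as in the weighted-average query at the top of this page.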