Calculating Median using Percentile on Redshift - amazon-redshift

I have a large table with over 18M rows and I want to calculate the median. I am using PERCENTILE_CONT for that, but the query takes around 17 minutes, which is not ideal.
Here is my query
WITH raw_data AS (
    SELECT name AS series,
           duration / 60000 AS value
    FROM warehouse.table
),
quartiles AS (
    SELECT series,
           value,
           PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY value) OVER (PARTITION BY series) AS q1,
           MEDIAN(value) OVER (PARTITION BY series) AS median,
           PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) OVER (PARTITION BY series) AS q3
    FROM raw_data
)
SELECT series,
       MIN(value)  AS minimum,
       AVG(q1)     AS q1,
       AVG(median) AS median,
       AVG(q3)     AS q3,
       MAX(value)  AS maximum
FROM quartiles
GROUP BY 1
Is there a way I could speed this up?
Thanks

Your query is asking Redshift to do a lot of work. The data must be distributed according to your PARTITION BY column and then sorted according to your ORDER BY column.
There are two options to make it faster:
Use more hardware. Redshift performance scales close to linearly: most queries will run about twice as fast on twice as much hardware.
Do some work in advance. You can maximize performance for this query by restructuring the table: use the PARTITION BY column as the distribution key and first sort key, and the ORDER BY column as the second sort key (DISTKEY(series), SORTKEY(series, value)). This minimizes the work required to answer the query. Time savings will vary, but on my small test cluster a 3m30s PERCENTILE_CONT query dropped to 30s with this approach.
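For illustration, a minimal CTAS sketch of such a restructured copy, using only the name and duration columns from the question (the exact column list and table name are assumptions). Since value is just duration / 60000, sorting by duration orders the rows the same way as sorting by value.

-- Hypothetical copy of the table with the PARTITION BY column as DISTKEY and
-- first sort key, and the ORDER BY column's source as the second sort key.
CREATE TABLE warehouse.table_sorted
DISTKEY (name)
SORTKEY (name, duration)
AS
SELECT name, duration
FROM warehouse.table;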

To speed up part of this, try the following:
SELECT DISTINCT
       name AS series,
       duration / 60000 AS value,
       PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY duration / 60000) OVER (PARTITION BY name) AS q1,
       MEDIAN(duration / 60000) OVER (PARTITION BY name) AS median,
       PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY duration / 60000) OVER (PARTITION BY name) AS q3
FROM warehouse.table
This may be faster because it reads the base table directly and is more likely to use your table's sort and distribution keys.
You would have to calculate the min and max elsewhere, but at least see whether it runs faster.
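If it does run faster, one way to fold the min and max back in is a separate aggregate joined on the series. A rough sketch reusing the question's table and column names (the final shape is an assumption; value is dropped from the SELECT DISTINCT so the first CTE collapses to one row per series):

WITH quartiles AS (
    SELECT DISTINCT
           name AS series,
           PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY duration / 60000) OVER (PARTITION BY name) AS q1,
           MEDIAN(duration / 60000) OVER (PARTITION BY name) AS median,
           PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY duration / 60000) OVER (PARTITION BY name) AS q3
    FROM warehouse.table
),
extremes AS (
    SELECT name AS series,
           MIN(duration / 60000) AS minimum,
           MAX(duration / 60000) AS maximum
    FROM warehouse.table
    GROUP BY 1
)
SELECT q.series, e.minimum, q.q1, q.median, q.q3, e.maximum
FROM quartiles q
JOIN extremes e ON q.series = e.series;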

You can try the APPROXIMATE PERCENTILE_DISC ( percentile ) function, which is optimized for working on distributed data with a low error percentage; for the median, use a percentile of 0.5.
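A minimal sketch against the question's table (the / 60000 mirrors the original query); note that APPROXIMATE PERCENTILE_DISC is an aggregate rather than a window function, so it is used with GROUP BY:

-- Approximate median per series; much cheaper than the exact window functions.
SELECT name AS series,
       APPROXIMATE PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY duration / 60000) AS median
FROM warehouse.table
GROUP BY 1;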

Related

Select random rows from Postgres table weighted towards value in column

Users are presented with 2 random items from an assets table to vote on.
There is a votes_count column in the assets table that counts each time users vote.
When choosing the 2 random items, I'd like to weight that more towards a lower value in votes_count. So items with a lower vote count have a higher probability of being selected randomly.
How can I do this with Postgres?
I've used various methods for selecting random records (RAND(), TABLESAMPLE BERNOULLI(), TABLESAMPLE SYSTEM()), but those don't have the weighting that I'm after.
I'm running PostgreSQL 13, FWIW.
I suggest the following query:
SELECT *
FROM assets
ORDER BY random()/power(votes_count,1) DESC
LIMIT 2
It orders the rows of your table in a random way, while dividing by votes_count puts rows with a lower vote count in a better position than rows with a higher vote count in the DESC order.
Replacing the exponent 1 with any real number n increases the probability of selecting the lower-vote rows when n > 1, and decreases it when 0 < n < 1.
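For example, with n = 2 the bias toward low vote counts is stronger (this sketch assumes votes_count is at least 1, since a zero would make the divisor zero):

-- Stronger weighting toward assets with few votes.
SELECT *
FROM assets
ORDER BY random() / power(votes_count, 2) DESC
LIMIT 2;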

Randomly sampling n rows in impala using random() or tablesample system()

I would like to randomly sample n rows from a table using Impala. I can think of two ways to do this, namely:
SELECT * FROM TABLE ORDER BY RANDOM() LIMIT <n>
or
SELECT * FROM TABLE TABLESAMPLE SYSTEM(1) limit <n>
In my case I set n to 10000 and sample from a table of over 20 million rows. If I understand correctly, the first option essentially creates a random number between 0 and 1 for each row and orders by this random number.
The second option creates many different 'buckets' and then randomly samples at least 1% of the data (in practice this always seems to be much greater than the percentage provided). In both cases I then select only the 10000 first rows.
Is the first option reliable to randomly sample the 10K rows in my case?
Edit: some additional context. The structure of the data is why random sampling or shuffling of the entire table seems quite important to me. Additional rows are added to the table daily. For example, one of the columns is country, and the incoming rows usually arrive first all from country A, then from country B, etc. For this reason I am worried that the second option might sample too many rows from a single country rather than randomly. Is that a justified concern?
Related thread that reveals the second option: What is the best query to sample from Impala for a huge database?
I beg to differ, OP. I prefer the second option.
With the first option, you are assigning a value between 0 and 1 to every row and then picking the first 10000 records. Impala has to process every row in the table, so the operation will be slow on a 20-million-row table.
With the second option, Impala randomly picks rows from the underlying files based on the percentage you provide. Since this works at the file level, the returned row count may differ from the percentage you specified. This method is also what Impala uses to compute statistics, so performance-wise it is much better, but the correctness of the randomness can be a problem.
Final thought:
If you are worried about the randomness and correctness of your sample, go for option 1. But if you are not that worried about randomness and want sample data with quick performance, pick the second option. Since Impala uses it for COMPUTE STATS, I pick this one :)
EDIT: After looking at your requirement, here is a method to sample over a particular field or fields.
We will use a window function to assign a random row number within each country group, then pick 1% (or whatever % you want) from that data set.
This makes sure the data is evenly distributed between countries and each country contributes the same % of rows to the result.
select *
from (
    select row_number() over (partition by country order by country, random()) as rn,
           count(*) over (partition by country) as cntpartition,
           tab.*
    from dat.mytable tab
) rs
where rs.rn between 1 and rs.cntpartition * 1/100 -- this is for 1% of the data
HTH

filtering using Rank() and Index() not changing the total

I am calculating efficiency for mechanics as the sum of hours worked divided by the sum of hours we charged the customer per work order. Using Tableau's total from the Analytics pane gives me the weighted average of their efficiency (whereas the average function is skewed, as it only takes into account the final efficiency rating).
When I use index() or rank() to create a filter to remove individual work orders, the total doesn't change.
How can I remove work orders and change the total without having to use a filter that selects individual work orders?
You could try using an LOD expression with a specific condition in the IF statement before you take the average or do any calculation.
Since a FIXED calculation takes its data directly from the table, the number will only change with the filter when you put the parameter in the first part of the LOD.
A quick example:
{ FIXED [parameter] : AVG(IF [work orders] = condition THEN [weighted average] END) }

PostgreSQL optimization: average over range of dates

I have a query (with a subquery) that calculates an average of temperatures over the previous years, plus/minus one week around each day. It works, but it is not all that fast. The time series values below are just an example. I'm using doy because I want a sliding window around the same date in every year.
SELECT days,
       (SELECT avg(temperature)
        FROM temperatures
        WHERE site_id = ?
          AND extract(doy FROM timestamp) BETWEEN extract(doy FROM days) - 7
                                              AND extract(doy FROM days) + 7
       ) AS temperature
FROM generate_series('2017-05-01'::date, '2017-08-31'::date, interval '1 day') days
So my question is: could this query somehow be improved? I was thinking about using some kind of window function, or possibly lag and lead. However, regular window functions only work on a specific number of rows, whereas there can be any number of measurements within the two-week window.
I can live with what I have for now, but as the amount of data grows, so does the execution time of the query. The two latter extracts could be removed, but that gives no noticeable speed improvement and only makes the query less legible. Any help would be greatly appreciated.
The best index for your original query is
create index idx_temperatures_site_id_timestamp_doy
on temperatures(site_id, extract(doy from timestamp));
This can greatly improve your original query's performance.
While your original query is simple & readable, it has one flaw: it will calculate every day's average 14 times (on average). Instead, you could calculate these averages once per day and then take the two-week window's weighted average (the weight for a day's average needs to be the count of the individual rows in your original table). Something like this:
with p as (
    select timestamp '2017-05-01' as min,
           timestamp '2017-08-31' as max
)
select t.*
from p
cross join (
    select days,
           sum(sum(temperature)) over pn1week / sum(count(temperature)) over pn1week as temperature
    from p
    cross join generate_series(min - interval '1 week', max + interval '1 week', interval '1 day') days
    left join temperatures on site_id = ? and extract(doy from timestamp) = extract(doy from days)
    group by days
    window pn1week as (order by days rows between 7 preceding and 7 following)
) t
where days between min and max
order by days
However, there is not much gain here, as this is only twice as fast as your original query (with the optimal index).
http://rextester.com/JCAG41071
Notes: I used timestamp because I assumed your column's type is timestamp. But as it turned out, you use timestamptz (aka timestamp with time zone). With that type you cannot index the extract(doy from timestamp) expression, because that expression's output depends on the actual client's time zone setting.
For timestamptz, use an index which (at least) starts with site_id. Using the window version should improve the performance anyway.
http://rextester.com/XTJSM42954
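For reference, a minimal sketch of that fallback index (the column list beyond site_id is an assumption):

-- extract(doy from ...) on a timestamptz column is not immutable, so it cannot be
-- used in an index expression; a plain composite index leading with site_id still
-- lets the planner narrow the scan to a single site.
create index idx_temperatures_site_id_timestamp
    on temperatures (site_id, timestamp);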

RANGE PRECEDING is only supported with UNBOUNDED

I want to try an avg() aggregation within a time window.
SQL code:
select
    user_id,
    timestamp,
    avg(y) over (order by timestamp range between '5 second' preceding and '5 second' following)
from A
but the system reports this error:
RANGE PRECEDING is only supported with UNBOUNDED
Is there any method to implement, say, a 10-second window for the avg() window function?
The frame of the window function should span the range from n seconds preceding the timestamp of the current row to m seconds following it.
RANGE PRECEDING is only supported with UNBOUNDED
Yep ... PostgreSQL's window functions don't yet implement ranges.
I've had many situations where they would've been useful, but it's a lot of work to implement them and time is limited.
Is there any method to implement, say, a 10 second window for avg() window function?
You will need to use a left join over generate_series (and, if appropriate, aggregation) to turn the range into a regular sequence of rows: insert null rows where there's no data, and collapse multiple values within one second to a single value where there are several.
Then you do a (ROWS n PRECEDING ...) window over the left-joined and aggregated data to get the running average.
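A minimal sketch of that approach, assuming a table a(user_id, ts timestamp, y numeric) with names adapted from the question, and a window of 5 seconds on either side:

with seconds as (
    -- one row per second spanning the data
    select generate_series(date_trunc('second', min(ts)),
                           max(ts),
                           interval '1 second') as sec
    from a
),
per_second as (
    -- collapse multiple rows per second to one value; seconds without data stay NULL
    select s.sec,
           avg(a.y) as y_avg
    from seconds s
    left join a on date_trunc('second', a.ts) = s.sec
    group by s.sec
)
select sec,
       avg(y_avg) over (order by sec
                        rows between 5 preceding and 5 following) as sliding_avg
from per_second
order by sec;

A per-user version would add user_id to the join and a PARTITION BY user_id to both the grouping and the window clause.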