Is there a way to calculate a weighted moving average with a fixed window size in Amazon Redshift? In more detail, given a table with a date column and a value column, for each date compute the weighted average value over a window of a specified size, with weights specified in an auxiliary table.
My search attempts so far yielded plenty of examples for doing this with window functions for a simple average (without weights), for example here. There are also some related suggestions for Postgres, e.g., this SO question; however, Redshift's feature set is quite sparse compared with Postgres, and it doesn't support many of the advanced features that are suggested.
Assuming we have the following tables:
create temporary table _data (ref_date date, value int);
insert into _data values
('2016-01-01', 34)
, ('2016-01-02', 12)
, ('2016-01-03', 25)
, ('2016-01-04', 17)
, ('2016-01-05', 22)
;
create temporary table _weight (days_in_past int, weight int);
insert into _weight values
(0, 4)
, (1, 2)
, (2, 1)
;
Then, if we want to calculate a moving average over a window of three days (including the current date), where values closer to the current date are assigned a higher weight than those further in the past, we'd expect the weighted average for 2016-01-05 (based on the values from 2016-01-05, 2016-01-04 and 2016-01-03) to be:
(22*4 + 17*2 + 25*1) / (4+2+1) = 147 / 7 = 21
And the query could look as follows:
with _prepare_window as (
select
t1.ref_date
, datediff(day, t2.ref_date, t1.ref_date) as days_in_past
, t2.value * weight as weighted_value
, weight
, count(t2.ref_date) over(partition by t1.ref_date rows between unbounded preceding and unbounded following) as num_values_in_window
from
_data t1
left join
_data t2 on datediff(day, t2.ref_date, t1.ref_date) between 0 and 2
left join
_weight on datediff(day, t2.ref_date, t1.ref_date) = days_in_past
order by
t1.ref_date
, datediff(day, t2.ref_date, t1.ref_date)
)
select
ref_date
, round(sum(weighted_value)::float/sum(weight), 0) as weighted_average
from
_prepare_window
where
num_values_in_window = 3
group by
ref_date
order by
ref_date
;
Giving the result:
ref_date | weighted_average
------------+------------------
2016-01-03 | 23
2016-01-04 | 19
2016-01-05 | 21
(3 rows)
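If it helps, here is a more compact sketch that avoids the self-join, under the assumptions that there is exactly one row per calendar day and that the 4/2/1 weights can be hard-coded instead of read from _weight (the first two rows come out NULL because fewer than three values fall in their window):
select
ref_date
-- weights 4/2/1 hard-coded; divide by their sum (7)
, round((4 * value
         + 2 * lag(value, 1) over (order by ref_date)
         + 1 * lag(value, 2) over (order by ref_date))::float / 7, 0) as weighted_average
from
_data
order by
ref_date
;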
Related
I have daily changes in a table like the one below.
Table: performance
date       | percent_change
-----------+---------------
2022/12/01 | 2
2022/12/02 | -1
2022/12/03 | 3
I want to assume an initial value of 100 and show the cumulative value to date, like below.
Expected output:
date       | percent_change | cumulative value
-----------+----------------+-----------------
2022/12/01 | 2              | 102
2022/12/02 | -1             | 100.98
2022/12/03 | 3              | 104.0094
A product of values, like the one you want to make, is nothing more than EXP(SUM(LN(...))). It results in a slightly verbose query but does not require new functions to be coded and can be ported as is to other DBMS.
In your case, as long as none of your percentages is -100% or lower:
SELECT date,
       percent_change,
       100 * EXP(SUM(LN(1 + percent_change / 100.0)) OVER (ORDER BY date)) AS cumulative_value
FROM T
The SUM(...) OVER (ORDER BY ...) is what makes it a cumulative sum.
If you need to account for percentages lower than -100%, you need a bit more complexity.
SELECT date,
       percent_change,
       100 * (-1) ^ SUM(CASE WHEN percent_change < -100 THEN 1 ELSE 0 END) OVER (ORDER BY date)
           * EXP(SUM(LN(ABS(1 + percent_change / 100.0))) OVER (ORDER BY date))
           AS cumulative_value
FROM T
WHERE NOT EXISTS (SELECT 1 FROM T T2 WHERE T2.percent_change = -100 AND T2.date <= T.date)
UNION ALL
SELECT date, percent_change, 0
FROM T
WHERE EXISTS (SELECT 1 FROM T T2 WHERE T2.percent_change = -100 AND T2.date <= T.date)
Explanation:
An ABS(...) has been added to account for the values not supported by the previous query. It effectively strips the sign of 1 + percent_change / 100.
In front of the EXP(SUM(LN(ABS(...)))), the (-1) ^ SUM(...) is where the sign is put back into the calculation. Read it as: -1 to the power of how many times we encountered a value below -100%.
The WHERE EXISTS(...) / WHERE NOT EXISTS(...) part handles the special case of percent_change = -100%. When we encounter -100, we cannot calculate the logarithm even with a call to ABS(...). However, this does not matter much, as the products you want to calculate are going to be 0 from that point onward.
Side note:
You can save yourself some of the complexity of the above queries by changing how you store the changes.
Storing 0.02 to represent 2% removes the multiplications/divisions by 100.
Storing 0.0198026272961797 (i.e., LN(1 + 0.02)) removes the need to compute a logarithm in your query.
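For example, with a hypothetical ln_factor column holding LN(1 + percent_change / 100.0) precomputed at insert time, the first query above reduces to something like:
SELECT date,
       percent_change,
       100 * EXP(SUM(ln_factor) OVER (ORDER BY date)) AS cumulative_value
FROM T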
I assume that the date in the 3rd row is 2022/12/03. Otherwise you need to add an id or some other column to order percent changes that occurred on the same day.
Solution
To calculate the value after percent_change, you need to multiply the current value by (100 + percent_change) / 100.
For day n, the cumulative value is 100 multiplied by the product of the coefficients (100 + percent_change) / 100 up to day n. For example, 100 * 1.02 * 0.99 * 1.03 = 104.0094, which matches the last row of the expected output.
In PostgreSQL "up to day n" can be implemented with window functions.
Since there is no built-in aggregate function for multiplication, let's create one.
CREATE AGGREGATE PRODUCT(DOUBLE PRECISION) (
SFUNC = float8mul,
STYPE = FLOAT8
);
The final query will look like this:
SELECT
date,
percent_change,
100 * product((100 + percent_change)::float / 100) OVER (ORDER BY date) cumulative_value
FROM performance;
The query below, when executed against a DB2 database, does not bring in records from 31st March 2019. Ideally it should bring in those records as well, since the operator used is <=. There are matching rows, and it works if I use < '2019-04-01'; however, we do not want to use that and would prefer to stick with <=.
select wonum, requireddate ,cost
from workorder
where reportdate >='2019-03-01' AND reportdate <= '2019-03-31'
If reportdate is a datetime, then you might want to consider renaming the column to, e.g., reportdatetime or maybe REPORT_DATETIME, but hey, it's your database design.
So, anyway, you could do this:
select wonum, requireddate ,cost from workorder
where DATE(reportdate) >='2019-03-01' AND DATE(reportdate) <= '2019-03-31'
or
select wonum, requireddate ,cost from workorder
where DATE(reportdate) BETWEEN '2019-03-01' AND '2019-03-31'
or
select wonum, requireddate ,cost from workorder
where reportdate >='2019-03-01' AND reportdate <= '2019-03-31-24.00.00'
This works as designed.
'2019-03-31' == timestamp('2019-03-31-00.00.00')
If you really don't want to use < (is the < sign forbidden in your organization? :)), try the following:
reportdate <= timestamp('2019-03-31-23.59.59.999999999999', 12)
BTW, there is an interesting thing with timestamps in Db2:
with t(row, ts) as (values
(1, timestamp('2019-03-31-23.59.59.999999999999', 12))
, (2, timestamp('2019-04-01-00.00.00', 12) - 0.000000000001 second)
, (3, timestamp('2019-03-31-24.00.00', 12))
, (4, timestamp('2019-03-31-23.59.59.999999999999', 12) + 0.000000000001 second)
, (5, timestamp('2019-04-01-00.00.00', 12))
)
select row, ts, dense_rank() over (order by ts) order
from t;
ROW TS ORDER
----------- -------------------------------- --------------------
1 2019-03-31-23.59.59.999999999999 1
2 2019-03-31-23.59.59.999999999999 1
3 2019-03-31-24.00.00.000000000000 2
4 2019-04-01-00.00.00.000000000000 3
5 2019-04-01-00.00.00.000000000000 3
2019-03-31-24.00.00 is a "special" timestamp (with the 24:00:00 time part).
It's greater than any 2019-03-31-xx timestamp, but less than 2019-04-01-00.00.00.
So, as Paul mentioned, you may use reportdate <= '2019-03-31-24.00.00' instead of reportdate <= timestamp('2019-03-31-23.59.59.999999999999', 12).
Note that we must specify the fractional seconds precision (12) explicitly in the latter case. Otherwise the string is cast to timestamp(6) with data truncation.
The purpose of this question is to optimize some SQL by using set-based operations vs iterative (looping, like I'm doing below):
Some Explanation -
I have this CTE that is inserted into a temp table #dataForPeak. Each row represents a minute and the value retrieved for it.
For every row, my code uses a while loop to add 15 rows at a time (the current row + the next 14 rows). These sums are inserted into another temp table #PeakDemandIntervals, which is my workaround for then finding the max sum of these groups of 15.
My end goal, stated above, is to find the max sum of these groups of 15. My code achieves this, but it takes about 12 seconds for 26k rows. I'll be looking at much more data, so I know this is not fast enough for my use case.
My question is: can anyone help me find a fast alternative to this loop?
It can include more tables, CTEs, nested queries, whatever. The while loop might not even be the issue; it's probably the inner code.
insert into #dataForPeak
select timestamp, value
from cte
order by timestamp;

while @@ROWCOUNT <> 0
begin
    declare @timestamp datetime = (select top 1 timestamp from #dataForPeak);
    insert into #PeakDemandIntervals
    select @timestamp, sum(interval.value) as peak
    from (select * from #dataForPeak base
          where base.timestamp >= @timestamp
            and base.timestamp < DATEADD(minute, 15, @timestamp) -- 15-minute window: current minute plus the next 14
         ) interval;
    delete from #dataForPeak where timestamp = @timestamp;
end

select max(peak)
from #PeakDemandIntervals;
Edit
Here's an example of my goal, using groups of 3min instead of 15min.
Given the data:
Time | Value
1:50 | 2
1:51 | 4
1:52 | 6
1:53 | 8
1:54 | 6
1:55 | 4
1:56 | 2
the max sum (peak) I'm looking for is 20, because the group
1:52 | 6
1:53 | 8
1:54 | 6
has the highest sum.
Let me know if I need to clarify more than that.
Based on the example given, it seems like you are trying to get the maximum value of a rolling sum. You can calculate the 15-minute rolling sum very easily as follows:
SELECT [Time]
,[Value]
,SUM([Value]) OVER (ORDER BY [Time] ASC ROWS 14 PRECEDING) [RollingSum]
FROM #dataForPeak
Note the key here is the ROWS 14 PRECEDING clause. It effectively states that SQL Server should sum the preceding 14 records together with the current record, which gives you your 15-minute interval.
Now you can simply take the max of the rolling sum. The full query will look as follows:
;WITH CTE_RollingSum
AS
(
SELECT [Time]
,[Value]
,SUM([Value]) OVER (ORDER BY [Time] ASC ROWS 14 PRECEDING) [RollingSum]
FROM #dataForPeak
)
SELECT MAX([RollingSum]) AS Peak
FROM CTE_RollingSum
I would like to calculate the growth rate of customers for the following data.
month | customers
-------------------------
01-2015 | 1
02-2015 | 10
03-2015 | 10
06-2015 | 15
I have used the following formula to calculate the growth rate. It works only for one-month intervals, and it cannot give the expected output because of the gap between the 3rd and 6th months shown in the table above:
select
month, total,
(total::float / lag(total) over (order by month) - 1) * 100 growth
from (
select to_char(created, 'yyyy-mm') as month, count(id) total
from customers
group by month
) s
order by month;
I think this can be done by creating a date range and grouping by that range.
I expect two main outputs, separately:
1) The growth rate with an exact one-month difference.
2) The growth rate over an interval of two months instead of a single month. For the data above, sum the results for each two-month period and group by two months instead of one.
Still not sure about the second part. Here's the growth from your adapted query and a two-month column:
select
    month, total,
    (total::float / lag(total) over (order by m) - 1) * 100 growth, m, m2
from (
    select created, (sum(customers) over (order by m))::float total, customers, m, m2, to_char(created, 'yyyy-mm') as month
    from customers c
    right outer join (
        select generate_series('2015-01-01', '2015-06-01', '1 month'::interval) m
    ) m1 on m = c.created
    left outer join (
        select generate_series('2015-01-01', '2015-06-01', '2 month'::interval) m2
    ) m2 on m2 = m
    order by m
) s
order by m;
Basically, the answer is to use generate_series.
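For the first output (one-month growth with the gap months present as rows), a cleaner sketch could look like the following, assuming the customers table has created (a date) and id columns as in your original query, and counting months with no signups as 0; nullif avoids a division by zero when the previous month's count is 0:
select
    to_char(m.month, 'yyyy-mm') as month,
    count(c.id) as total,
    (count(c.id)::float / nullif(lag(count(c.id)) over (order by m.month), 0) - 1) * 100 as growth
from generate_series('2015-01-01'::date, '2015-06-01'::date, '1 month'::interval) as m(month)
left join customers c on date_trunc('month', c.created) = m.month
group by m.month
order by m.month;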
Need help with the following query:
Current Data format:
StudentID EnrolledStartTime EnrolledEndTime
1 7/18/2011 1.00 AM 7/18/2011 1.05 AM
2 7/18/2011 1.00 AM 7/18/2011 1.09 AM
3 7/18/2011 1.20 AM 7/18/2011 1.40 AM
4 7/18/2011 1.50 AM 7/18/2011 1.59 AM
5 7/19/2011 1.00 AM 7/19/2011 1.05 AM
6 7/19/2011 1.00 AM 7/19/2011 1.09 AM
7 7/19/2011 1.20 AM 7/19/2011 1.40 AM
8 7/19/2011 1.10 AM 7/19/2011 1.59 AM
I would like to calculate the time difference between EnrolledEndTime and EnrolledStartTime, group it into 15-minute buckets, and count the students that enrolled in each bucket.
Expected Result :
Count(StudentID) Date 0-15Mins 16-30Mins 31-45Mins 46-60Mins
4 7/18/2011 3 1 0 0
4 7/19/2011 2 1 0 1
Can I use a combination of the PIVOT function to achieve the required result? Any pointers would be helpful.
Create a table variable/temp table that includes all the columns from the original table, plus one column that marks the row as 0, 16, 31 or 46. Then
SELECT * FROM <temp table name> PIVOT (COUNT(StudentID) FOR <new column name> IN ([0], [16], [31], [46])) AS p.
That should put you pretty close.
It's possible (just see the basic pivot instructions here: http://msdn.microsoft.com/en-us/library/ms177410.aspx), but one problem you'll have using pivot is that you need to know ahead of time which columns you want to pivot into.
E.g., you mention 0-15, 16-30, etc., but actually you have no idea how long some students might take -- some might take 24 hours, or your full session timeout, or what have you.
So to alleviate this problem, I'd suggest having a final column as a catch-all, labeled something like '>60'.
Other than that, just do a select on this table, selecting the student ID, the date, and a CASE statement, and you'll have everything you need to work the pivot on.
CASE WHEN DATEDIFF(minute, date1, date2) <= 15 THEN '0-15' WHEN DATEDIFF(minute, date1, date2) <= 30 THEN '16-30' ... ELSE '>60' END.
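Putting those two pieces together, a rough sketch (untested; the source table name Enrollment and the bucket labels are just placeholders) could look like this:
;WITH bucketed AS
(
    SELECT StudentID,
           CAST(EnrolledStartTime AS date) AS EnrolledDate, -- CAST(... AS date) assumes SQL Server 2008 or later
           CASE
               WHEN DATEDIFF(minute, EnrolledStartTime, EnrolledEndTime) <= 15 THEN '0-15Mins'
               WHEN DATEDIFF(minute, EnrolledStartTime, EnrolledEndTime) <= 30 THEN '16-30Mins'
               WHEN DATEDIFF(minute, EnrolledStartTime, EnrolledEndTime) <= 45 THEN '31-45Mins'
               WHEN DATEDIFF(minute, EnrolledStartTime, EnrolledEndTime) <= 60 THEN '46-60Mins'
               ELSE '>60Mins'
           END AS Bucket
    FROM Enrollment
)
SELECT EnrolledDate, [0-15Mins], [16-30Mins], [31-45Mins], [46-60Mins], [>60Mins]
FROM bucketed
PIVOT (COUNT(StudentID) FOR Bucket IN ([0-15Mins], [16-30Mins], [31-45Mins], [46-60Mins], [>60Mins])) AS p;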
I have an old version of MS SQL Server that doesn't support PIVOT, so I wrote the SQL for getting the data but couldn't test the pivot part; I tried my best with it. The rest of the SQL will give you the exact data for the pivot table. If you accept NULL instead of 0, it can be written a lot more simply: you can skip the subselect named a that is defined in "with a ...".
declare @t table (EnrolledStartTime datetime, EnrolledEndTime datetime)
insert @t values('2011/7/18 01:00', '2011/7/18 01:05')
insert @t values('2011/7/18 01:00', '2011/7/18 01:09')
insert @t values('2011/7/18 01:20', '2011/7/18 01:40')
insert @t values('2011/7/18 01:50', '2011/7/18 01:59')
insert @t values('2011/7/19 01:00', '2011/7/19 01:05')
insert @t values('2011/7/19 01:00', '2011/7/19 01:09')
insert @t values('2011/7/19 01:20', '2011/7/19 01:40')
insert @t values('2011/7/19 01:10', '2011/7/19 01:59')
;with a
as
(select * from
(select distinct dateadd(day, cast(EnrolledStartTime as int), 0) date from @t) dates
cross join
(select '0-15Mins' t, 0 group1 union select '16-30Mins', 1 union select '31-45Mins', 2 union select '46-60Mins', 3) i)
, b as
(select (datediff(minute, EnrolledStartTime, EnrolledEndTime) - 1) / 15 group1, dateadd(day, cast(EnrolledStartTime as int), 0) date
from @t)
select count(b.date) count, a.date, a.t, a.group1 from a
left join b
on a.group1 = b.group1
and a.date = b.date
group by a.date, a.t, a.group1
-- PIVOT(max(date)
-- FOR group1
-- in(['0-15Mins'], ['16-30Mins'], ['31-45Mins'], ['46-60Mins'])AS p