First, some sample data to explain the business problem:
select
ItemID = 276,
Quantity,
Bucket,
DaysInMonth = day(eomonth(Bucket)),
DailyQuantity = cast(Quantity * 1.0 / day(eomonth(Bucket)) as decimal(4, 0)),
DaysFactor
into #data
from
(
values
('1/1/2021', 95, 5500),
('2/1/2021', 75, 6000),
('3/1/2021', 80, 5000),
('4/1/2021', 82, 5300),
('5/1/2021', 90, 5200),
('6/1/2021', 80, 6500),
('7/1/2021', 85, 6100),
('8/1/2021', 90, 5100),
('9/1/2021', null, 5800),
('10/1/2021', null, 5900)
) d (Bucket, DaysFactor, Quantity);
select * from #data;
Now, the business problem:
The first row has a DaysFactor of 95.
The forward rolling sum for this row is calculated as
(31 x 177) + (28 x 214) + (31 x 161) + (5 x 177) = 17,355
That is...
the daily quantity for all 31 days of the 1/1/2021 bucket plus
the daily quantity for all 28 days of the 2/1/2021 bucket plus
the daily quantity for all 31 days of the 3/1/2021 bucket plus
the daily quantity for 5 days of the 4/1/2021 bucket.
This results in 95 days of forward looking quantity.
95 days = 31 + 28 + 31 + 5
For the second row, with a DaysFactor of 75, it would start with daily quantity for the 28 days in the 2/1/2021 bucket and go out until a total of 75 days' worth of quantity were summed, like so:
(28 x 214) + (31 x 161) + (16 x 177) = 13,815
75 days = 28 + 31 + 16
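To double-check the arithmetic above, here is a minimal Python sketch (not part of the original T-SQL). It assumes the daily quantity is rounded to a whole number, as the decimal(4, 0) cast does, and the function names are hypothetical.

```python
from calendar import monthrange
from datetime import date

# First four sample rows from #data: (bucket start, DaysFactor, Quantity)
rows = [
    (date(2021, 1, 1), 95, 5500),
    (date(2021, 2, 1), 75, 6000),
    (date(2021, 3, 1), 80, 5000),
    (date(2021, 4, 1), 82, 5300),
]

def days_in_bucket(bucket):
    # Number of days in the bucket's month, like day(eomonth(Bucket)).
    return monthrange(bucket.year, bucket.month)[1]

def daily_quantity(bucket, quantity):
    # Whole-number daily quantity. None of the sample values is an exact .5 tie,
    # so Python's round() agrees with the T-SQL cast here.
    return round(quantity / days_in_bucket(bucket))

def rolling_forward_sum(start_index, days_factor):
    # Walk forward bucket by bucket until days_factor days have been consumed.
    remaining = days_factor
    total = 0
    for bucket, _, quantity in rows[start_index:]:
        if remaining <= 0:
            break
        used = min(remaining, days_in_bucket(bucket))
        total += used * daily_quantity(bucket, quantity)
        remaining -= used
    return total

print(rolling_forward_sum(0, 95))  # 17355
print(rolling_forward_sum(1, 75))  # 13815
```

Both results match the hand calculations for the first two rows.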
One approach to this is building a calendar of daily demand and then summing quantity over the specified days. However, I'm stuck on how to do the summing. Here is the code that builds the calendar with daily quantities:
with
dates as
(
select
FirstDay = min(cast(Bucket as date)),
LastDay = eomonth(max(cast(Bucket as date)))
from #data
),
tally as (
select top (select datediff(d, FirstDay, LastDay) + 1 from dates) --restrict to number of rows equal to number of days between first and last days
n = row_number() over(order by (select null)) - 1
from sys.messages
),
calendar as (
select
Bucket = dateadd(d, t.n, d.FirstDay)
from tally t
cross join dates d
)
select
c.Bucket,
d.DailyQuantity
from #data d
inner join calendar c
on year(d.Bucket) = year(c.Bucket)
and month(d.Bucket) = month(c.Bucket);
Here's a screenshot of a subset of rows from this query:
I was hoping to use T-SQL's LEAD() to do this but don't see a way to put the DaysFactor into the ROWS clause within OVER(). Is there a way to do that? If not, is there a set based approach to calculating the rolling forward sum?
Expected result set:
Figured it out using an approach different than LEAD(). This column was added to #data:
BucketEnd = cast(dateadd(d, DaysFactor - 1, Bucket) as date)
Then the code that builds the calendar with daily quantities, shown in the original question, was put into a temp table called #calendar.
Then this query performs the calculations:
select
d.ItemID,
d.Bucket,
RollingForwardQuantitySum = sum(iif(c.Bucket between d.Bucket and d.BucketEnd, c.DailyQuantity, null))
from #data d
cross join #calendar c
group by
d.ItemID,
d.Bucket
order by
d.ItemID,
cast(d.Bucket as date);
The output from this query matches the expected result set screenshot in the original post.
I am running an analysis on medication prescribing practices. We want to identify whether someone has been on a class of medications for 60 days out of a 90-day quarter. We have a start and end date for each prescription, and the bounds of the quarter (e.g., 4/1/2022 – 6/30/2022). For each prescription I’ve calculated the number of days between the start and end date (only including days that fall within the bounds of the quarter). There are many instances in which multiple drugs within the same class are prescribed: someone might try one antidepressant but not like it, and so be given another in the same class.
My original strategy was just to total up number of days for each class of medication and see if it’s 60 or over. The days don’t have to be consecutive, but if they overlap, days during an overlap period shouldn’t count twice (which they would in a simple sum).
For instance in the data table below, patient 1 in row 1 should be included as they are over 60 days. Patient 2 should also get in (rows 2 and 3) because the non-overlapping total (57+8) within the same med class gets them to over 60 days. However, patient 3 should NOT get in, even though the total of 32 + 32 is over 60 because the intervals overlap. This means that they were really on the medication class for only 32 days – this is an instance where someone might be on two different antidepressants simultaneously.
It’s not sufficient to just sum the days in the interval, but I also have to include some way to examine whether the intervals are overlapping and only add days if an interval for a given medication class falls outside another interval for that same class.
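Row num | Patid | Med class | Start date | End date   | Interval
--------+-------+-----------+------------+------------+---------
1       | 1     | A         | 2022-04-28 | 2022-09-12 | 63
2       | 2     | B         | 2022-05-03 | 2022-06-29 | 57
3       | 2     | B         | 2022-04-21 | 2022-04-29 | 8
4       | 3     | A         | 2022-01-19 | 2022-05-03 | 32
5       | 3     | A         | 2022-01-19 | 2022-05-03 | 32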
I’m having a hard time figuring out how to do this. Note, I'm limited to just using SQL for this.
Code that produced the above data. I would embed this in another query to generate a total interval but need to deal with the overlap issue.
DECLARE @startdt DATE;
DECLARE @enddt DATE;
SET @startdt='4/1/2022'
SET @enddt='6/30/2022'
--for q4 fy2022-23 (4/1/2022-6/30/2022)
SELECT DISTINCT
rx.patid, d.medication_category as medcat, start_date, end_date,
-- case statement to capture days within quarter only
CASE WHEN start_date<@startdt and end_date>@enddt then 90
WHEN start_date<@startdt and end_date>=@startdt then datediff(d,@startdt,end_date)
WHEN start_date>=@startdt and end_date>@enddt then datediff(d,start_date,@enddt)
ELSE datediff(d,start_date,end_date)
END as interval
FROM rx
INNER JOIN Drug_names_categories d
ON rx.drugname=d.drugname
WHERE start_date<'7/1/2022' and end_date>'3/30/2022'
AND rx.patid IS NOT NULL
AND d.medication_category IS NOT NULL
AND d.medication_category <>''
You can accomplish what you want by generating a calendar table (using a Common Table Expression) of individual days within the test range, joining those days with the prescriptions with overlapping days, and then counting distinct days for each patient and medication category combination.
Something like:
DECLARE @startdt DATE = '2022-04-01';
DECLARE @enddt DATE = '2022-06-30';
DECLARE @threshold INT = 60;
-- Recursive CTE: one row per day in the quarter
WITH Days AS (
SELECT @startdt AS Day
UNION ALL
SELECT DATEADD(day, 1, Day)
FROM Days
WHERE Day < @enddt
)
SELECT
rx.patid, d.medication_category as medcat,
COUNT(DISTINCT DD.Day) AS days_medicated,
MIN(DD.Day) AS start_date,
MAX(DD.Day) AS end_date
FROM rx
INNER JOIN Drug_names_categories d
ON rx.drugname = d.drugname
INNER JOIN Days DD
ON DD.Day BETWEEN rx.start_date AND rx.end_date
WHERE rx.start_date <= @enddt AND @startdt <= rx.end_date
GROUP BY rx.patid, d.medication_category
HAVING COUNT(DISTINCT DD.Day) >= @threshold
ORDER BY rx.patid, start_date;
If using SQL Server 2022 or later, the Days generator can be simplified by using the new GENERATE_SERIES() function:
WITH Days AS (
SELECT DATEADD(day, S.value, @startdt) AS Day
FROM GENERATE_SERIES(0, DATEDIFF(day, @startdt, @enddt)) S
)
See this db<>fiddle for an example with some sample data.
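To illustrate why counting distinct days handles the overlap problem, here is a rough Python sketch of the same idea (not a replacement for the SQL above), run against the five rows from the question. The function name days_medicated is hypothetical, and both end dates are treated as inclusive, matching the BETWEEN join.

```python
from datetime import date, timedelta

QUARTER_START = date(2022, 4, 1)
QUARTER_END = date(2022, 6, 30)
THRESHOLD = 60

# (patid, med class, start date, end date) from the question's table
prescriptions = [
    (1, "A", date(2022, 4, 28), date(2022, 9, 12)),
    (2, "B", date(2022, 5, 3), date(2022, 6, 29)),
    (2, "B", date(2022, 4, 21), date(2022, 4, 29)),
    (3, "A", date(2022, 1, 19), date(2022, 5, 3)),
    (3, "A", date(2022, 1, 19), date(2022, 5, 3)),
]

def days_medicated(rows):
    # Collect the distinct days covered per (patient, class), clipped to the quarter.
    covered = {}
    for patid, medcat, start, end in rows:
        day = max(start, QUARTER_START)
        last = min(end, QUARTER_END)
        days = covered.setdefault((patid, medcat), set())
        while day <= last:
            days.add(day)  # a set ignores days already seen, so overlaps count once
            day += timedelta(days=1)
    return {key: len(days) for key, days in covered.items()}

counts = days_medicated(prescriptions)
qualifying = sorted(key for key, n in counts.items() if n >= THRESHOLD)
print(counts)      # {(1, 'A'): 64, (2, 'B'): 67, (3, 'A'): 33}
print(qualifying)  # [(1, 'A'), (2, 'B')]
```

The counts are one higher per prescription than the Interval column in the question because DATEDIFF does not count both endpoints. The outcome is the same as the question expects: patients 1 and 2 qualify and patient 3 does not.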
I would do this using a date/calendar table, then it's pretty easy.
If you don't already have a date table, this link is one of many that describe how to create one easily ( https://www.mssqltips.com/sqlservertip/4054/creating-a-date-dimension-or-calendar-table-in-sql-server/ )
Here's the script from this link (in case the link dies)
DECLARE @StartDate date = '20100101';
DECLARE @CutoffDate date = DATEADD(DAY, -1, DATEADD(YEAR, 30, @StartDate));
;WITH seq(n) AS
(
SELECT 0 UNION ALL SELECT n + 1 FROM seq
WHERE n < DATEDIFF(DAY, @StartDate, @CutoffDate)
),
d(d) AS
(
SELECT DATEADD(DAY, n, @StartDate) FROM seq
),
src AS
(
SELECT
TheDate = CONVERT(date, d),
TheDay = DATEPART(DAY, d),
TheDayName = DATENAME(WEEKDAY, d),
TheWeek = DATEPART(WEEK, d),
TheISOWeek = DATEPART(ISO_WEEK, d),
TheDayOfWeek = DATEPART(WEEKDAY, d),
TheMonth = DATEPART(MONTH, d),
TheMonthName = DATENAME(MONTH, d),
TheQuarter = DATEPART(Quarter, d),
TheYear = DATEPART(YEAR, d),
TheFirstOfMonth = DATEFROMPARTS(YEAR(d), MONTH(d), 1),
TheLastOfYear = DATEFROMPARTS(YEAR(d), 12, 31),
TheDayOfYear = DATEPART(DAYOFYEAR, d)
FROM d
)
SELECT *
INTO MyDateTable
FROM src
ORDER BY TheDate
OPTION (MAXRECURSION 0);
Now that you have your new date table, you can join to it to get the list of dates that fall between the start and end dates, something like:
SELECT COUNT(DISTINCT dt.TheDate)
FROM rx
INNER JOIN MyDateTable dt ON dt.TheDate BETWEEN rx.start_date AND rx.end_date
INNER JOIN Drug_names_categories d ON rx.drugname=d.drugname
WHERE start_date<'7/1/2022' and end_date>'3/30/2022'
AND rx.patid IS NOT NULL
AND d.medication_category IS NOT NULL
AND d.medication_category <>''
Obviously this is a simple example, but you could easily extend it to include all the details you need. The point is that you now have a list of dates, or a distinct list of dates, which you can work with easily.
You could also simplify the date range filter by referencing the TheQuarter and TheYear columns. If this is a common task, consider extending the date table with a compound YearQuarter column (e.g. 2023Q1/202301 etc).
I need to find the number of users that were invoiced for an amount greater than 0 in the previous month and were not invoiced in the current month. This calculation is to be done for 12 months in a single query. Output should be as below.
Month Count
01/07/2019 50
01/08/2019 34
01/09/2019 23
01/10/2019 98
01/11/2019 10
01/12/2019 5
01/01/2020 32
01/02/2020 65
01/03/2020 23
01/04/2020 12
01/05/2020 64
01/06/2020 54
01/07/2020 78
I am able to get the value only for one month. I want to get it for all months in a single query.
This is my current query:
SELECT COUNT(DISTINCT TWO_MONTHS_AGO.USER_ID), TWO_MONTHS_AGO.MONTH AS INVOICE_MONTH
FROM (
SELECT USER_ID, LAST_DAY(invoice_ct_dt) AS MONTH
FROM table_a AS ID
WHERE invoice_amt > 0
AND LAST_DAY(invoice_ct_dt) = ADD_MONTHS(LAST_DAY(CURRENT_DATE - 1), - 2)
GROUP BY USER_ID, LAST_DAY(invoice_ct_dt)
) AS TWO_MONTHS_AGO
LEFT JOIN (
SELECT USER_ID, LAST_DAY(invoice_ct_dt) AS MONTH
FROM table_a AS ID
WHERE LAST_DAY(invoice_ct_dt) = ADD_MONTHS(LAST_DAY(CURRENT_DATE - 1), - 1)
GROUP BY USER_ID, LAST_DAY(invoice_ct_dt)
) AS ONE_MONTH_AGO ON TWO_MONTHS_AGO.USER_ID = ONE_MONTH_AGO.USER_ID
WHERE ONE_MONTH_AGO.USER_ID IS NULL
GROUP BY INVOICE_MONTH;
Thank you in advance.
Lona
There are probably lots of different approaches, but the way I would do it is as follows:
1) Summarise data by user and month for the last 13 months (you need 12 months plus the month before the first of them).
2) Compare "this" month (that has data) to "next" month and select records where there is no "next" month data.
3) Summarise this dataset by month and distinct userid.
For example, assuming a table created as follows:
create table INVOICE_DATA (
USERID varchar(4),
INVOICE_DT date,
INVOICE_AMT NUMBER(10,2)
);
the following query should give you what you want - you may need to adjust it depending on whether you are including this month, or only up to the end of last month, in your calculation, etc.:
--Summarise data by user and month
WITH MONTH_SUMMARY AS
(
SELECT USERID
,TO_CHAR(INVOICE_DT,'YYYY-MM') "INVOICE_MONTH"
,TO_CHAR(ADD_MONTHS(INVOICE_DT,1),'YYYY-MM') "NEXT_MONTH"
,SUM(INVOICE_AMT) "MONTHLY_TOTAL"
FROM INVOICE_DATA
WHERE INVOICE_DT >= TRUNC(ADD_MONTHS(current_date(),-13),'MONTH') -- Last 13 months of data
GROUP BY 1,2,3
),
--Get data for users with invoices in this month but not the next month
USER_DATA AS
(
SELECT USERID, INVOICE_MONTH, MONTHLY_TOTAL
FROM MONTH_SUMMARY MS_THIS
WHERE NOT EXISTS
(
SELECT USERID
FROM MONTH_SUMMARY MS_NEXT
WHERE
MS_THIS.USERID = MS_NEXT.USERID AND
MS_THIS.NEXT_MONTH = MS_NEXT.INVOICE_MONTH
)
AND MS_THIS.INVOICE_MONTH < TO_CHAR(current_date(),'YYYY-MM') -- Don't include this month as obviously no next month to compare to
)
SELECT INVOICE_MONTH, COUNT(DISTINCT USERID) "USER_COUNT"
FROM USER_DATA
GROUP BY INVOICE_MONTH
ORDER BY INVOICE_MONTH
;
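To make the "this month but not next month" logic easier to follow, here is a small Python sketch of the same three steps. The invoice rows, user ids and CURRENT_MONTH value are hypothetical; like the query above, it does not filter on the invoice amount.

```python
from collections import defaultdict

# Hypothetical invoices: (user id, invoice month as "YYYY-MM", amount)
invoices = [
    ("u1", "2020-01", 10.0), ("u1", "2020-02", 5.0),
    ("u2", "2020-01", 20.0),
    ("u3", "2020-02", 7.5), ("u3", "2020-03", 2.5),
]
CURRENT_MONTH = "2020-03"  # assumed "today"; excluded because it has no next month yet

def next_month(ym):
    # "2020-12" -> "2021-01", "2020-01" -> "2020-02"
    year, month = map(int, ym.split("-"))
    return f"{year + month // 12}-{month % 12 + 1:02d}"

# Step 1: summarise by user and month
users_by_month = defaultdict(set)
for user, month, _amount in invoices:
    users_by_month[month].add(user)

# Steps 2 and 3: users invoiced in a month but not in the following one, counted per month
lapsed = {}
for month in sorted(users_by_month):
    if month >= CURRENT_MONTH:
        continue
    lapsed_users = users_by_month[month] - users_by_month.get(next_month(month), set())
    if lapsed_users:
        lapsed[month] = len(lapsed_users)

print(lapsed)  # {'2020-01': 1, '2020-02': 1}
```

Here u2 lapses after January and u1 lapses after February, so each of those months gets a count of 1.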
I've been pounding my head on this one for two days now.
Here's my issue:
I have 18 buckets. One bucket has a negative value. I need to distribute the bucket with the negative value across the other 17 buckets as a percent of total for the 17 buckets. I can do this in Excel, but I need to do it in T-SQL without any hard coding because this is going to be used in a stored procedure.
Here's my data (Bucket and Amount) and my results from Excel (Pct of Total, Distribution and New Amount):
BUCKET AMOUNT [Pct of Total] Distribution [New Amount]
1 $174,130.91 9.5384% $(281.49) $173,849.41
2 $54,274.13 2.9730% $(87.74) $54,186.39
3 $150,637.86 8.2515% $(243.51) $150,394.34
4 $389,910.65 21.3581% $(630.31) $389,280.34
5 $379,177.75 20.7702% $(612.96) $378,564.79
6 $79,230.40 4.3400% $(128.08) $79,102.32
7 $47,008.64 2.5750% $(75.99) $46,932.64
8 $47,224.95 2.5868% $(76.34) $47,148.60
9 $102,731.42 5.6273% $(166.07) $102,565.35
10 $8,955.93 0.4906% $(14.48) $8,941.45
11 $43,749.52 2.3965% $(70.72) $43,678.80
12 $16,140.85 0.8841% $(26.09) $16,114.76
13 $72,165.14 3.9530% $(116.66) $72,048.48
14 $23,542.26 1.2896% $(38.06) $23,504.21
15 $874.82 0.0479% $(1.41) $873.41
16 $65,665.10 3.5969% $(106.15) $65,558.95
17 $170,162.38 9.3210% $(275.08) $169,887.30
18 $(2,951.15)
Total $1,822,631.55 100.0000% $(2,951.15) $1,822,631.55
Here is an example how you can manage this:
;with neg as(select sum(amount) as amount from t where amount < 0),
pos as(select * from t where amount >= 0)
select *,
p.amount * 100 / sum(p.amount) over() as pct,
neg.amount * p.amount / sum(p.amount) over() as dist,
p.amount + neg.amount * p.amount / sum(p.amount) over() as new
from pos p
cross join neg
Fiddle here: http://sqlfiddle.com/#!3/0cdd7/4
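The same proportional distribution can be sketched in a few lines of Python to show what the query computes. The three bucket amounts below are hypothetical round numbers chosen so the result is easy to check by hand; they are not the figures from the question.

```python
# Hypothetical buckets: two positive amounts and one negative amount to distribute
amounts = {1: 100.0, 2: 300.0, 3: -40.0}

negative_total = sum(a for a in amounts.values() if a < 0)   # like the "neg" CTE
positives = {b: a for b, a in amounts.items() if a >= 0}     # like the "pos" CTE
positive_total = sum(positives.values())                     # sum(p.amount) over()

pct = {b: a * 100 / positive_total for b, a in positives.items()}
dist = {b: negative_total * a / positive_total for b, a in positives.items()}
new_amounts = {b: a + dist[b] for b, a in positives.items()}

print(pct)          # {1: 25.0, 2: 75.0}
print(dist)         # {1: -10.0, 2: -30.0}
print(new_amounts)  # {1: 90.0, 2: 270.0}
```

The new amounts add up to 360, the same as the original total including the negative bucket, which is the check the Total row in the Excel output performs.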
I would like to calculate growth rate for customers for following data.
month | customers
-------------------------
01-2015 | 1
02-2015 | 10
03-2014 | 10
06-2015 | 15
I have used the following formula to calculate the growth rate. It only works for a one-month interval, and it does not give the expected output because of the gap between the 3rd and 6th months shown in the table above.
select
month, total,
(total::float / lag(total) over (order by month) - 1) * 100 growth
from (
select to_char(created, 'yyyy-mm') as month, count(id) total
from customers
group by month
) s
order by month;
I think this can be done by creating a date range and grouping by that range.
I expect two separate outputs:
1) Growth rate with exactly one month's difference
2) Growth rate with an interval of two months instead of a single month. For the data above, sum the results for each two-month period and group by two months instead of by month.
I'm still not sure about the second part. Here's the growth from your adapted query, plus a two-month growth column:
select
month, total,
(total::float / lag(total) over (order by m) - 1) * 100 growth,m,m2
from (
select created, (sum(customers) over (order by m))::float total,customers,m,m2,to_char(created, 'yyyy-mm') as month
from customers c
right outer join (
select generate_series('2015-01-01','2015-06-01','1 month'::interval) m
) m1 on m=c.created
left outer join (
select generate_series('2015-01-01','2015-06-01','2 month'::interval) m2
) m2 on m2=m
order by m
) s
order by m;
Basically, the answer is to use generate_series.
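As a rough illustration of what filling the gaps with a generated month series buys you, here is a Python sketch of the one-month growth calculation on a running total. It assumes the third row of the sample data is meant to be 03-2015 and treats the customers column as new customers per month; both are assumptions, not something stated in the question.

```python
from itertools import accumulate

# New customers per month; months with no rows are simply absent
new_customers = {"2015-01": 1, "2015-02": 10, "2015-03": 10, "2015-06": 15}

# Equivalent of generate_series: every month from 2015-01 to 2015-06
months = [f"2015-{m:02d}" for m in range(1, 7)]

# Missing months count as zero new customers, then a running total is taken
counts = [new_customers.get(m, 0) for m in months]
totals = list(accumulate(counts))

# Growth versus the previous month, like total / lag(total) - 1
growth = [None] + [(cur / prev - 1) * 100 for prev, cur in zip(totals, totals[1:])]

for month, total, g in zip(months, totals, growth):
    print(month, total, g)
```

With the gap months present, April and May show 0% growth instead of being skipped, and June is compared against May rather than March.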
Is there a way to calculate a weighted moving average with a fixed window size in Amazon Redshift? In more detail, given a table with a date column and a value column, for each date compute the weighted average value over a window of a specified size, with weights specified in an auxiliary table.
My search attempts so far yielded plenty of examples for doing this with window functions for simple average (without weights), for example here. There are also some related suggestions for postgres, e.g., this SO question, however Redshift's feature set is quite sparse compared with postgres and it doesn't support many of the advanced features that are suggested.
Assuming we have the following tables:
create temporary table _data (ref_date date, value int);
insert into _data values
('2016-01-01', 34)
, ('2016-01-02', 12)
, ('2016-01-03', 25)
, ('2016-01-04', 17)
, ('2016-01-05', 22)
;
create temporary table _weight (days_in_past int, weight int);
insert into _weight values
(0, 4)
, (1, 2)
, (2, 1)
;
Then, if we want to calculate a moving average over a window of three days (including the current date) where values closer to the current date are assigned a higher weight than those further in the past, we'd expect for the weighted average for 2016-01-05 (based on values from 2016-01-05, 2016-01-04 and 2016-01-03):
(22*4 + 17*2 + 25*1) / (4+2+1) = 147 / 7 = 21
And the query could look as follows:
with _prepare_window as (
select
t1.ref_date
, datediff(day, t2.ref_date, t1.ref_date) as days_in_past
, t2.value * weight as weighted_value
, weight
, count(t2.ref_date) over(partition by t1.ref_date rows between unbounded preceding and unbounded following) as num_values_in_window
from
_data t1
left join
_data t2 on datediff(day, t2.ref_date, t1.ref_date) between 0 and 2
left join
_weight on datediff(day, t2.ref_date, t1.ref_date) = days_in_past
order by
t1.ref_date
, datediff(day, t2.ref_date, t1.ref_date)
)
select
ref_date
, round(sum(weighted_value)::float/sum(weight), 0) as weighted_average
from
_prepare_window
where
num_values_in_window = 3
group by
ref_date
order by
ref_date
;
Giving the result:
ref_date | weighted_average
------------+------------------
2016-01-03 | 23
2016-01-04 | 19
2016-01-05 | 21
(3 rows)
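As a cross-check of the result above, here is a short Python sketch of the same weighted average using the sample values and weights. It assumes the dates are consecutive with no gaps, so a list index can stand in for days_in_past.

```python
# Sample data from _data, in date order
dates = ["2016-01-01", "2016-01-02", "2016-01-03", "2016-01-04", "2016-01-05"]
values = [34, 12, 25, 17, 22]

# Sample data from _weight: days_in_past -> weight
weights = {0: 4, 1: 2, 2: 1}
window = len(weights)

weighted_average = {}
# Start at the first date that has a full window, like num_values_in_window = 3
for i in range(window - 1, len(values)):
    weighted = sum(values[i - days] * w for days, w in weights.items())
    weighted_average[dates[i]] = round(weighted / sum(weights.values()))

print(weighted_average)
# {'2016-01-03': 23, '2016-01-04': 19, '2016-01-05': 21}
```

The three values agree with the query output, including the hand-calculated 21 for 2016-01-05.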