Window Function For Consecutive Dates - amazon-redshift

I want to know how many users were active for 3 consecutive days on any given day.
E.g., on 2022-11-03, 1 user (user_id = 111) was active 3 days in a row. Could someone please advise what kind of window function(?) would be needed?
This is my dataset:
user_id active_date
111 2022-11-01
111 2022-11-02
111 2022-11-03
222 2022-11-01
333 2022-11-01
333 2022-11-09
333 2022-11-10
333 2022-11-11

If you are confident there are no duplicate user_id + active_date rows in the source data, then you can use two LAG functions like this:
SELECT user_id,
active_date,
CASE WHEN DATEADD(day, -1, active_date) = LAG(active_date, 1) OVER (PARTITION BY user_id ORDER BY active_date)
AND DATEADD(day, -2, active_date) = LAG(active_date, 2) OVER (PARTITION BY user_id ORDER BY active_date)
THEN 'Yes'
ELSE 'No'
END AS rowof3
FROM your_table
ORDER BY user_id, active_date;
If there might be duplication, use this FROM clause instead:
FROM (SELECT DISTINCT user_id, active_date :: DATE AS active_date FROM your_table) AS t
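To get from the Yes/No flag to the count the question asks for (users on a 3-day streak per day), the flagged rows can be grouped by date. The following is a minimal sketch that runs the same LAG idea against the sample data using Python's built-in sqlite3 module. It assumes SQLite rather than Redshift, so date(x, '-1 days') stands in for DATEADD(day, -1, x); the helper name streak_counts is made up for illustration.

```python
import sqlite3

# Sample rows from the question: (user_id, active_date as ISO text).
ROWS = [
    (111, "2022-11-01"), (111, "2022-11-02"), (111, "2022-11-03"),
    (222, "2022-11-01"),
    (333, "2022-11-01"), (333, "2022-11-09"),
    (333, "2022-11-10"), (333, "2022-11-11"),
]

# SQLite dialect (an assumption): date(x, '-1 days') replaces
# Redshift's DATEADD(day, -1, x). The LAG logic is unchanged.
QUERY = """
WITH flagged AS (
    SELECT user_id,
           active_date,
           CASE WHEN date(active_date, '-1 days') =
                     LAG(active_date, 1) OVER (PARTITION BY user_id ORDER BY active_date)
                 AND date(active_date, '-2 days') =
                     LAG(active_date, 2) OVER (PARTITION BY user_id ORDER BY active_date)
                THEN 1 ELSE 0 END AS rowof3
    FROM (SELECT DISTINCT user_id, active_date FROM your_table) AS t
)
SELECT active_date, COUNT(*) AS users_on_streak
FROM flagged
WHERE rowof3 = 1
GROUP BY active_date
ORDER BY active_date
"""


def streak_counts(rows):
    """Return [(active_date, n_users)] for days that complete a 3-day streak."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE your_table (user_id INTEGER, active_date TEXT)")
    con.executemany("INSERT INTO your_table VALUES (?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


print(streak_counts(ROWS))
# [('2022-11-03', 1), ('2022-11-11', 1)]
```

Days on which nobody completes a streak are simply absent from the result; joining against a calendar table would be needed to report them as 0.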

Related

Cohort Analysis with RedShift by Month

I am trying to build a cohort analysis for monthly retention but am having trouble getting the Month Number column right. The month number is supposed to return the month(s) in which the user transacted, i.e. 0 for the registration month, 1 for the first month after the registration month, 2 for the second month, and so on until the last month. Currently, however, it returns negative month numbers in some cells.
It should be like this table:
cohort_month  total_users  month_number  percentage
------------  -----------  ------------  ----------
January       100          0             40
January       341          1             90
January       115          2             90
February      103          0             73
February      100          1             40
March         90           0             90
Here is the SQL:
with cohort_items as (
select
extract(month from insert_date) as cohort_month,
msisdn as user_id
from mfscore.t_um_user_detail where extract(year from insert_date)=2020
order by 1, 2
),
user_activities as (
select
A.sender_msisdn,
extract(month from A.insert_date)-C.cohort_month as month_number
from mfscore.t_wm_transaction_logs A
left join cohort_items C ON A.sender_msisdn = C.user_id
where extract(year from A.insert_date)=2020
group by 1, 2
),
cohort_size as (
select cohort_month, count(1) as num_users
from cohort_items
group by 1
order by 1
),
B as (
select
C.cohort_month,
A.month_number,
count(1) as num_users
from user_activities A
left join cohort_items C ON A.sender_msisdn = C.user_id
group by 1, 2
)
select
B.cohort_month,
S.num_users as total_users,
B.month_number,
B.num_users * 100 / S.num_users as percentage
from B
left join cohort_size S ON B.cohort_month = S.cohort_month
where B.cohort_month IS NOT NULL
order by 1, 3
I think the RANK window function is the right solution. The idea is to assign a rank to the months of user activity for each user, ordered by year and month.
Something like:
WITH activity_per_user AS (
SELECT
user_id,
event_date,
RANK() OVER (PARTITION BY user_id ORDER BY DATE_PART('year', event_date) , DATE_PART('month', event_date) ASC) AS month_number
FROM user_activities_table
)
RANK numbering starts from 1, so you may want to subtract 1.
Then, you can group by user_id and month_number to get the number of interactions for each user per month from the subscription (adapt to your use case accordingly).
SELECT
user_id,
month_number,
COUNT(1) AS n_interactions
FROM activity_per_user
GROUP BY 1, 2
Here is the documentation:
https://docs.aws.amazon.com/redshift/latest/dg/r_WF_RANK.html
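One detail worth checking before adopting the ranking approach: when a user has several events in the same month, RANK() assigns them the same number and then skips ahead, whereas DENSE_RANK() keeps the numbering consecutive. The sketch below compares the two on a few made-up events, using Python's sqlite3 module. The SQLite dialect is an assumption (strftime('%Y-%m', ...) stands in for the DATE_PART pair), and the rows are hypothetical, not from the question.

```python
import sqlite3

# Hypothetical events for one user: two in January, one in February, one in April.
EVENTS = [
    (1, "2020-01-05"),
    (1, "2020-01-20"),
    (1, "2020-02-03"),
    (1, "2020-04-10"),
]

# SQLite dialect (an assumption): strftime('%Y-%m', d) orders by year, then month.
QUERY = """
SELECT event_date,
       RANK()       OVER (PARTITION BY user_id
                          ORDER BY strftime('%Y-%m', event_date)) - 1 AS rank_based,
       DENSE_RANK() OVER (PARTITION BY user_id
                          ORDER BY strftime('%Y-%m', event_date)) - 1 AS dense_rank_based
FROM user_activities_table
ORDER BY event_date
"""


def month_numbers(events):
    """Return [(event_date, rank_based, dense_rank_based)] per event."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE user_activities_table (user_id INTEGER, event_date TEXT)")
    con.executemany("INSERT INTO user_activities_table VALUES (?, ?)", events)
    rows = con.execute(QUERY).fetchall()
    con.close()
    return rows


for row in month_numbers(EVENTS):
    print(row)
# ('2020-01-05', 0, 0)
# ('2020-01-20', 0, 0)
# ('2020-02-03', 2, 1)
# ('2020-04-10', 3, 2)
```

Note that both variants number the months in which the user was active, so the inactive March is not counted. If the month number should instead be the calendar distance from the registration month, a date difference in months between the two dates is probably the better fit.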

Find if between dates having any date from a list of dates?

I am trying to find out whether any of the dates from (select dates from public_holidays) fall between start_date and end_date.
It would look like this:
select
id, name, start_date, end_date,
case when (public_holidays = true) then number - 1 else number end as find_real_number
from my_table
Sample Data:
ID Name Start_date End_Date Numbers
1 Mike 3/9/2020 4/9/2020 67
2 Rick 3/1/2020 3/6/2020 34
3 Simm 3/24/2020 3/28/2020 98
4 Lisa 3/27/2020 4/5/2020 103
5 Rosy 3/9/2020 4/9/2020 23
And some sample expected results:
ID Name Start_date End_Date Numbers
1 Mike 3/9/2020 4/9/2020 66
2 Rick 3/1/2020 3/6/2020 34
3 Simm 3/24/2020 3/28/2020 98
4 Lisa 3/27/2020 4/5/2020 102
5 Rosy 3/9/2020 4/9/2020 23
Because we assume the 1st of April is a public holiday, the numbers in the 1st and 4th rows are reduced by 1.
And sample public holidays view I created:
Public_holidays Dates
April fools 04/01/2020
Labour Day 05/01/2020
Random Day 07/24/2020
However, because I am building the query in Metabase, I am not allowed to create a table. All I did was create a view that has 2 columns, 'Public Holidays' and 'Dates'.
Could anyone give me a suggestion on how to do this? Thanks.
Try something like this:
SELECT id, name, start_date, end_date,
numbers - ( SELECT COUNT(*) FROM holidays
WHERE dates BETWEEN t.start_date AND t.end_date ) AS numbers
FROM my_table AS t
This assumes that your holidays are in a table/view named holidays. It counts the holidays between the start and end dates and subtracts that count from the numbers column of my_table.
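As a quick check of the correlated-subquery approach, here is a small sketch that runs it on the sample data with Python's sqlite3 module. It assumes the dates are stored as ISO-formatted text (so BETWEEN compares them correctly in SQLite) and that the view is named holidays with a dates column, as in the answer above; the function name adjusted_numbers is made up for illustration.

```python
import sqlite3

# Sample rows from the question, with dates rewritten as ISO text (an assumption).
MY_TABLE = [
    (1, "Mike", "2020-03-09", "2020-04-09", 67),
    (2, "Rick", "2020-03-01", "2020-03-06", 34),
    (3, "Simm", "2020-03-24", "2020-03-28", 98),
    (4, "Lisa", "2020-03-27", "2020-04-05", 103),
    (5, "Rosy", "2020-03-09", "2020-04-09", 23),
]
HOLIDAYS = [
    ("April fools", "2020-04-01"),
    ("Labour Day", "2020-05-01"),
    ("Random Day", "2020-07-24"),
]

QUERY = """
SELECT id, name,
       numbers - (SELECT COUNT(*) FROM holidays
                  WHERE dates BETWEEN t.start_date AND t.end_date) AS numbers
FROM my_table AS t
ORDER BY id
"""


def adjusted_numbers(rows, holidays):
    """Return [(id, name, numbers minus holidays inside the date range)]."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE my_table (id INTEGER, name TEXT, "
                "start_date TEXT, end_date TEXT, numbers INTEGER)")
    con.execute("CREATE TABLE holidays (public_holidays TEXT, dates TEXT)")
    con.executemany("INSERT INTO my_table VALUES (?, ?, ?, ?, ?)", rows)
    con.executemany("INSERT INTO holidays VALUES (?, ?)", holidays)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


print(adjusted_numbers(MY_TABLE, HOLIDAYS))
# [(1, 'Mike', 66), (2, 'Rick', 34), (3, 'Simm', 98), (4, 'Lisa', 102), (5, 'Rosy', 22)]
```

Row 5 has the same date range as row 1, so this approach also subtracts the April 1 holiday from it and returns 22, which differs from the 23 shown in the question's expected results.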
I think you want to check whether a public holiday falls between the start and end dates,
so you should compare the dates like below:
select
id, name, start_date, end_date,
case when (CAST(start_date as date) < CAST(public_holiday_date as date)
and CAST(public_holiday_date as date) < CAST(end_date as date))
then number - 1 else number end as find_real_number
from my_table
or
select
id, name, start_date, end_date,
case when CAST(public_holiday_date as date)
between CAST(start_date as date) and CAST(end_date as date)
then number - 1 else number end as find_real_number
from my_table

Calculate past 3 month average for every past 3rd month

I am using SQL Server 2014. I have a table like this
create table revenue (id varchar(2), trasdate date, revenue int);
insert into revenue(id, trasdate, revenue)
values ('aa', '2018/09/01', 1234.5),
('aa' , '2018/08/04', 450),
('aa', '2018/07/03',500),
('aa', '2018/06/04',600),
('ab', '2018/09/01', 1234.5),
('ab' , '2018/08/04', 450),
('ab', '2018/07/03',500),
('ab', '2018/06/04',600),
('ab', '2018/05/03', 200),
('ab', '2018/04/02', 150),
('ab', '2018/03/01', 350),
('ab', '2018/02/05', 700),
('aa', '2018/01/07', 400)
;
I am preparing a SQL query to create an SSRS report. I want to calculate a past 3-month average for the current and every past 3rd month. As we are in the month of September right now, the result should show something like this:
id Period Revenue_3Mon
aa March-May 233
aa June-Aug 516
ab March-May 233
ab June-Aug 516
Though I can figure out the Period column myself, I was mainly focusing on getting Revenue_3Mon. I initially tried the query below after some googling, but it throws "Incorrect syntax near 'rows'". If I remove rows from the query, it throws "Incorrect syntax near the keyword 'between'" and "Incorrect syntax near i".
select i.id,i.mon,
avg([i.mon_revenue]) over (partition by i.id, i.mon order by [i.id],
[i.mon] rows between 3 preceding and 1 preceding row) as revenue_3mon --
-- using 3 preceding and 1 preceding row you exclude the current row
from (select a.id, month(a.trasdate) as mon,
sum(a.revenue) as mon_revenue
from revenue a
group by a.id, month(a.trasdate)) i
group by i.id, i.mon
order by i.id,i.mon;
After a few attempts, I gave up on this query and came up with a new solution that was a bit closer to my expectation (after lots of trial and error).
Declare @count as int;
declare @max as int;
set @count = 4
declare @temp as table (id varchar(2), monthoftrasdate int, revenue int,
[3monavg] int);
SET @MAX = (SELECT distinct MAX(a.ROWNUM) FROM (SELECT id, month(trasdate)
as mon, SUM(revenue) TotalRevenue,
-- sum(revenue) as mon_revenue,
ROW_NUMBER() OVER(PARTITION BY ID ORDER BY MONTH(TRASDATE)) AS ROWNUM
FROM revenue
GROUP BY ID, MONTH(TRASDATE)
) A GROUP BY A.ID);
while (@count <= @max )
begin
WITH CTE AS (
SELECT id, month(trasdate) as mon, SUM(revenue) TotalRevenue,
-- sum(revenue) as mon_revenue,
ROW_NUMBER() OVER(PARTITION BY ID ORDER BY MONTH(TRASDATE)) AS
ROWNUM
FROM revenue
GROUP BY ID, MONTH(TRASDATE)
)
insert into @temp
SELECT A.ID,A.MON, a.TotalRevenue
,( SELECT avg(b.TotalRevenue) as avgrev
FROM CTE B
WHERE B.ROWNUM BETWEEN A.ROWNUM-3 AND A.ROWNUM-1
AND A.ID = B.ID --AND A.mon = B.mon
--and b.ROWNUM < a.ROWNUM
and (a.mon > 3 and a.ROWNUM > 3)
GROUP BY B.id
) AS REVENUE_3MON
FROM CTE A
set @count = @count + 1
end
select distinct a.* from @temp a
The reason I had to use 'distinct' is because the query was showing duplicate records for every id and every month. So far the result shows like below
id MonthofTrasdate Revenue 3MonAvg
aa 1 400 NULL
aa 2 700 NULL
aa 3 350 NULL
aa 4 150 483
aa 5 200 400
aa 6 600 233
aa 7 500 316
aa 8 450 433
aa 9 1234 516
ab 1 400 NULL
ab 2 700 NULL
ab 3 350 NULL
ab 4 150 483
ab 5 200 400
ab 6 600 233
ab 7 500 316
ab 8 450 433
ab 9 1234 516
This pulls out the past 3-month average for every month, and I will just manipulate the rest in SSRS the way I want it.
Currently my table has no data for the previous year, so this shows the appropriate result for the next couple of months. My concern is that when I have to show my boss the report next year for Jan, Feb and March, it should also be able to pull periods like Oct-Dec (previous year), Nov-Jan and Dec-Feb. I am struggling to figure out the proper way to put this in my query.
Can you please help me out with this query? And also let me know what is wrong with my former query.
Problems with your first attempt:
You enclosed some of the aliases and column names in square brackets like [i.mon_revenue]. There is no need for square brackets, but if you want to use them, you have to break them up at the dot: [i].[mon_revenue].
In your window frame clause, there is one ROW keyword too many (at the end): it should read ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING.
Window functions are applied at the very end (after the rest of the respective query), so you also have to include i.mon_revenue in your GROUP BY clause of the outer query.
Knowing that the inner query will produce one row per id and mon, there will never be preceding rows in an id-mon partition. Therefore, you must not partition by both, but only by id.
To simplify the query after resolving the issues: ordering by a partition column generally makes no sense, and since (as already mentioned) the inner query returns unique id-mon combinations, you don't have to group by these in the outer query. Looking at that query, we see that the outer query just directly selects and uses the values from the inner query, which makes a separation into two queries unnecessary. So, in fact, you wanted to perform the following query, which will produce the rolling 3-month average (I added the monthly TotalRevenue as well):
SELECT id, MONTH(trasdate) AS mon, SUM(revenue) AS TotalRevenue,
AVG(SUM(revenue)) OVER (PARTITION BY id ORDER BY MONTH(trasdate) ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING) AS revenue_3mon
FROM revenue
GROUP BY id, MONTH(trasdate)
ORDER BY id, MONTH(trasdate);
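To see what the frame ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING actually averages, here is a small sketch that runs the same idea on the sample data with Python's sqlite3 module. It is written for SQLite, not SQL Server (an assumption): the monthly totals are computed in a CTE first, dates are stored as ISO text, and strftime('%m', ...) replaces MONTH(). The function name rolling_avg is made up for illustration.

```python
import sqlite3

# Sample data from the question, with dates rewritten as ISO text (an assumption).
REVENUE = [
    ("aa", "2018-09-01", 1234.5), ("aa", "2018-08-04", 450), ("aa", "2018-07-03", 500),
    ("aa", "2018-06-04", 600), ("ab", "2018-09-01", 1234.5), ("ab", "2018-08-04", 450),
    ("ab", "2018-07-03", 500), ("ab", "2018-06-04", 600), ("ab", "2018-05-03", 200),
    ("ab", "2018-04-02", 150), ("ab", "2018-03-01", 350), ("ab", "2018-02-05", 700),
    ("aa", "2018-01-07", 400),
]

QUERY = """
WITH monthly AS (
    SELECT id,
           CAST(strftime('%m', trasdate) AS INTEGER) AS mon,
           SUM(revenue) AS total_revenue
    FROM revenue
    GROUP BY id, CAST(strftime('%m', trasdate) AS INTEGER)
)
SELECT id, mon, total_revenue,
       AVG(total_revenue) OVER (PARTITION BY id ORDER BY mon
                                ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING) AS revenue_3mon
FROM monthly
ORDER BY id, mon
"""


def rolling_avg(rows):
    """Return {(id, mon): average of up to 3 preceding monthly totals, or None}."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE revenue (id TEXT, trasdate TEXT, revenue REAL)")
    con.executemany("INSERT INTO revenue VALUES (?, ?, ?)", rows)
    result = {(r[0], r[1]): r[3] for r in con.execute(QUERY)}
    con.close()
    return result


averages = rolling_avg(REVENUE)
print(averages[("ab", 9)])  # mean of the June, July and August totals
print(averages[("aa", 6)])  # only January precedes June for 'aa'
```

The frame counts rows, not calendar months: for id 'aa' the June value is taken from January alone, because February to May have no rows. The first month of each id has no preceding rows and yields NULL (None in Python).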
Suggestions on your second attempt:
When calculating the @MAX value, you rely on the fact that each id has revenues for the same number of months. Are you sure?
The code inside the WHILE loop does not depend on @count, so it will add the same data into the @temp table multiple times, which is probably the reason why you thought you needed a DISTINCT. Therefore: no need for the variables, no need for a loop and a @temp table, no need for DISTINCT.
The conditions A.mon > 3 and A.rownum > 3 are redundant with your current data. In general, I guess, you don't want to explicitly exclude the months from January to March, so A.mon > 3 should be removed. A.rownum > 3 could be removed, too, unless you really don't want to see a 3-month average when there are only 2 preceding months or fewer.
As the subquery for the average is restricted to only one id, there's no need for a GROUP BY.
Since the ROW_NUMBER function doesn't care about gaps in the months, I suggest to use a different numbering function, for example DATEDIFF(month, MAX(trasdate), GETDATE()) AS mnum. Of course, the comparison in the WHERE clause of the subquery then has to be changed to B.mnum BETWEEN A.mnum+1 AND A.mnum+3.
So, your second attempt can be reduced to this, which will produce the same result as the above, at least with your sample data, where no gaps in the months exist:
WITH CTE AS (
SELECT id, MONTH(trasdate) AS mon, SUM(revenue) AS TotalRevenue,
DATEDIFF(month, MAX(trasdate), GETDATE()) AS mnum
FROM revenue
GROUP BY id, MONTH(trasdate)
)
SELECT id, mon, TotalRevenue
, (SELECT AVG(B.TotalRevenue)
FROM CTE B
WHERE B.mnum BETWEEN A.mnum+1 AND A.mnum+3
AND A.id = B.id
) AS revenue_3mon
FROM CTE A
ORDER BY id, mnum DESC;
Now, guess what, an expression like my mnum using DATEDIFF increases by one every month as we move to the past, regardless of a change of years, so this might be useful for grouping as well, whether you want to (or can?) use Window functions or not:
With OVER()
SELECT id, MONTH(MIN(trasdate)) AS mon, YEAR(MIN(trasdate)) AS yr, SUM(revenue) AS TotalRevenue,
AVG(SUM(revenue)) OVER (PARTITION BY id ORDER BY MIN(trasdate) ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING) AS revenue_3mon
FROM revenue
GROUP BY id, DATEDIFF(month, trasdate, GETDATE())
ORDER BY id, DATEDIFF(month, trasdate, GETDATE()) DESC;
Without OVER()
WITH CTE AS (
SELECT id, MIN(trasdate) AS min_dt, SUM(revenue) AS TotalRevenue,
DATEDIFF(month, trasdate, GETDATE()) AS mnum
FROM revenue
GROUP BY id, DATEDIFF(month, trasdate, GETDATE())
)
SELECT id, MONTH(min_dt) AS mon, YEAR(min_dt) AS yr, TotalRevenue
, (SELECT AVG(B.TotalRevenue)
FROM CTE B
WHERE B.mnum BETWEEN A.mnum+1 AND A.mnum+3
AND A.id = B.id
) AS revenue_3mon
FROM CTE A
ORDER BY id, mnum DESC;
Both queries allow for retrieving the minimum and maximum date for each period (including month and year).
If you instead wanted what you originally posted under The result should show something like this (just grouping by previous 3-months intervals), you just would have to group your original revenue table by id and (DATEDIFF(month, trasdate, GETDATE())-1)/3 (filtering WHERE DATEDIFF(month, trasdate, GETDATE()) > 0). If so, this kind of grouping and aggregation could, of course, be done also by the Report Server.
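The bucketing expression (months_ago - 1) / 3 can be tried out on the sample data as follows. This is a sketch in Python's sqlite3 module under a few assumptions: SQLite has no DATEDIFF, so the month difference is computed from the year and month parts; the reference month is passed in as parameters (September 2018, as in the question) instead of calling GETDATE(); and the names months_ago, bucket and three_month_buckets are made up for illustration.

```python
import sqlite3

# Sample data from the question, with dates rewritten as ISO text (an assumption).
REVENUE = [
    ("aa", "2018-09-01", 1234.5), ("aa", "2018-08-04", 450), ("aa", "2018-07-03", 500),
    ("aa", "2018-06-04", 600), ("ab", "2018-09-01", 1234.5), ("ab", "2018-08-04", 450),
    ("ab", "2018-07-03", 500), ("ab", "2018-06-04", 600), ("ab", "2018-05-03", 200),
    ("ab", "2018-04-02", 150), ("ab", "2018-03-01", 350), ("ab", "2018-02-05", 700),
    ("aa", "2018-01-07", 400),
]

# months_ago mimics DATEDIFF(month, trasdate, <reference date>).
QUERY = """
WITH aged AS (
    SELECT id, revenue,
           (:ref_year  - CAST(strftime('%Y', trasdate) AS INTEGER)) * 12
         + (:ref_month - CAST(strftime('%m', trasdate) AS INTEGER)) AS months_ago
    FROM revenue
)
SELECT id,
       (months_ago - 1) / 3 AS bucket,   -- integer division: 1-3 -> 0, 4-6 -> 1, ...
       AVG(revenue)         AS revenue_3mon
FROM aged
WHERE months_ago > 0                     -- skip the current month
GROUP BY id, (months_ago - 1) / 3
ORDER BY id, bucket
"""


def three_month_buckets(rows, ref_year, ref_month):
    """Return [(id, bucket, average revenue)] for 3-month periods before the reference month."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE revenue (id TEXT, trasdate TEXT, revenue REAL)")
    con.executemany("INSERT INTO revenue VALUES (?, ?, ?)", rows)
    result = con.execute(QUERY, {"ref_year": ref_year, "ref_month": ref_month}).fetchall()
    con.close()
    return result


for row in three_month_buckets(REVENUE, 2018, 9):
    print(row)
```

With September 2018 as the reference month, bucket 0 covers June to August and bucket 1 covers March to May, which gives roughly 516.67 and 233.33 for id 'ab'. AVG(revenue) here averages the individual rows, which equals the average of monthly totals only because the sample has one row per month. Because months_ago keeps growing across year boundaries, the same expression also handles periods such as Oct-Dec of the previous year.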
I think this should do what you want:
select r.*,
avg(r.mon_revenue) over (partition by r.id
order by r.mon_min
rows between 3 preceding and 1 preceding
) as revenue_3mon
-- using 3 preceding and 1 preceding row you exclude the current row
from (select r.id, month(r.trasdate) as mon,
min(r.trasdate) as mon_min,
sum(r.revenue) as mon_revenue
from revenue r
group by r.id, year(r.trasdate), month(r.trasdate)
) r
order by r.id, r.mon, r.mon_min;
Notes:
I fixed the code so it recognizes years as well as dates.
The expression [i.mon_revenue] is not a valid column reference (in your case). You have no column with the name "i.mon_revenue" (with the . in the name).
I changed the column alias to r to match the table.
I added a date column for each month to make it easier to express the ordering.
The outer group by is not necessary.
There are several syntax errors in your code. This should give you what you need. The inner query is the important bit, but hopefully this will be enough to get you on your way.
I switched out the temp table for a table variable and changed the revenue column so it is no longer INT, since you have decimal values in there; other than that, your original sample table is unchanged.
DECLARE @revenue table (id varchar(2), trasdate date, revenue float)
insert into @revenue(id, trasdate, revenue)
values ('aa', '2018/09/01', 1234.5),
('aa' , '2018/08/04', 450),
('aa', '2018/07/03',500),
('aa', '2018/06/04',600),
('ab', '2018/09/01', 1234.5),
('ab' , '2018/08/04', 450),
('ab', '2018/07/03',500),
('ab', '2018/06/04',600),
('ab', '2018/05/03', 200),
('ab', '2018/04/02', 150),
('ab', '2018/03/01', 350),
('ab', '2018/02/05', 700),
('aa', '2018/01/07', 400)
SELECT
*
FROM
(
SELECT
*
, MONTH(trasdate) as MonthNumber
, AVG(revenue) OVER (PARTITION BY id
ORDER BY
id
, MONTH(trasdate) ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING) as ThreeMonthAvg
FROM @revenue
) a
WHERE MONTH(GETDATE()) - MonthNumber IN (0, 3, 6, 9)
This gives the following results
aa 2018-06-04 600 6 400
aa 2018-09-01 1234.5 9 516.666666666667
ab 2018-03-01 350 3 700
ab 2018-06-04 600 6 233.333333333333
ab 2018-09-01 1234.5 9 516.666666666667

PGSQL duplicate record in same column

I have a table and I want to find where duplicate records are present for the same columns. These are my columns; I want to get the records where group_id or week differs for the same code, fweek and newcode.
Id newcode fweek code group_id week
1 343001 2016-01 343 100 8
2 343002 2016-01 343 100 8
3 343001 2016-01 343 101 08
Required record is
Id newcode fweek code group_id week
3 343001 2016-01 343 101 08
To find the duplicate values, I joined the table with itself.
We then group the results by code, fweek and newcode and keep only the groups with more than one matching row. I used max() to get the last inserted row.
IS DISTINCT FROM behaves like an inequality that also treats NULL as a comparable value. If you don't want to compare NULLs, you can use the <> operator instead.
select r.*
from your_table r
where r.id in (select max(r.id)
from your_table r
join your_table r2 on r2.code = r.code and r2.fweek = r.fweek and r2.newcode = r.newcode
where
r2.group_id is distinct from r.group_id or
r2.week is distinct from r.week
group by r.code,
r.fweek,
r.newcode
having count(*) > 1)

How do I select the min opendate from a list of duplicates?

I have 3 columns. SSN|AccountNumber|OpenDate
1 SSN may have multiple AccountNumbers
Each AccountNumber has a corresponding OpenDate
In my list I have many SSN's, each containing several account numbers which may have been opened on different days.
I want the results of my query to be SSN|earliest OpenDate|AccountNumber that corresponds with the earliest OpenDate.
I'm dealing with about 200,000 records.
EDIT: First I did
select SSN, min(OpenDate), AcctNumber from Table Group By SSN, AccountNumber
but that didn't quite give me the correct data.
The raw data gives me something like this:
SSN | AcctNumber | OpenDate
---------------------------
10 101 Jan
10 102 Feb
10 103 Mar
Where I got 10, Jan, and AccNumber 102, which is not the account number associated with the Jan OpenDate. After looking at other rows, I found that the account number I got was just one of the account numbers associated with that SSN rather than the one that corresponds with min(OpenDate).
WITH CTE AS ( SELECT SSN, AcctNumber, OpenDate, ROW_NUMBER() OVER (PARTITION BY SSN ORDER BY OpenDate ASC) AS RN FROM tbl ) SELECT SSN, AcctNumber, OpenDate FROM CTE WHERE RN=1;
If your table is like this:
SSN | AcctNumber | OpenDate
---------------------------
10 101 April
10 101 May
10 102 April
20 201 June
20 201 July
Do you want your query to return this?
SSN | AcctNumber | OpenDate
---------------------------
10 101 April
10 102 April
20 201 June
Then you would use this query:
select ssn, min(OpenDate), acctNumber from tbl group by ssn, acctNumber
You can try this..
select SSN , AcctNumber, OpenDate
from (SELECT SSN , AcctNumber, OpenDate
, ROW_NUMBER() OVER ( PARTITION BY SSN ORDER BY OpenDate ASC ) AS RN
FROM table) AS temp
WHERE temp.RN= 1
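The ROW_NUMBER approach can be checked on a small example. The sketch below uses Python's sqlite3 module with made-up rows; the table name accounts, the ISO-formatted dates and the function name earliest_accounts are assumptions for illustration, not part of the question.

```python
import sqlite3

# Hypothetical rows: (SSN, AcctNumber, OpenDate as ISO text).
ACCOUNTS = [
    (10, 101, "2020-01-15"),
    (10, 102, "2020-02-15"),
    (10, 103, "2020-03-15"),
    (20, 201, "2020-06-01"),
    (20, 202, "2020-05-01"),
]

# Number each SSN's accounts by OpenDate, earliest first, and keep row 1.
QUERY = """
SELECT SSN, AcctNumber, OpenDate
FROM (SELECT SSN, AcctNumber, OpenDate,
             ROW_NUMBER() OVER (PARTITION BY SSN ORDER BY OpenDate ASC) AS RN
      FROM accounts) AS temp
WHERE temp.RN = 1
ORDER BY SSN
"""


def earliest_accounts(rows):
    """Return one (SSN, AcctNumber, OpenDate) row per SSN: its earliest account."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE accounts (SSN INTEGER, AcctNumber INTEGER, OpenDate TEXT)")
    con.executemany("INSERT INTO accounts VALUES (?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


print(earliest_accounts(ACCOUNTS))
# [(10, 101, '2020-01-15'), (20, 202, '2020-05-01')]
```

If two accounts of the same SSN share the earliest OpenDate, ROW_NUMBER picks one of them arbitrarily; adding AcctNumber as a second ORDER BY column would make the choice deterministic.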