Using Lag() function to retrieve values across dates - postgresql

I am trying to use the LAG() and LEAD() functions in postgres to retrieve values from other rows/records in a table and I am running into some difficulty. The functionality works as intended as long as the LAG or LEAD function is only looking at dates within the same month (i.e. June 2nd can look back to June 1st, but when I try to look back to May 31st, I retrieve a NULL value).
Here's what the table looks like
_date      count_daily_active_users  count_new_users  day1_users  users_arriving_today_who_returned_tomrrow  day_retained_users
5/27/2013  1742                      335              266         207                                        0.617910448
5/28/2013  1768                      241              207         146                                        0.605809129
5/29/2013  1860                      272              146         161                                        0.591911765
5/30/2013  2596                      841              161         499                                        0.59334126
5/31/2013  2837                      703              499         NULL                                       NULL
6/1/2013   12881                     10372            0           5446                                       0.525067489
6/2/2013   14340                     6584             5446        2781                                       0.422387606
6/3/2013   12222                     3690             2781        1494                                       0.404878049
6/4/2013   25861                     17254            1494        8912                                       0.516517909
From that table you can see that on May 31st when I try to 'look ahead' to June 1st to retrieve the number of users who arrived for the first time on May 31st and then returned again on June 1st I get a NULL value. This happens at every month boundary and it happens regardless of the number of days I try to 'look ahead'. So if I look ahead two days, then I'd have NULLs for May 30th and May 31st.
Here's the SQL I wrote
SELECT
timestamp_session::date AS _date
, COUNT(DISTINCT dim_player_key) AS count_daily_active_users
, COUNT(DISTINCT CASE WHEN days_since_birth = 0 THEN dim_player_key ELSE NULL END) AS count_new_users
, COUNT(DISTINCT CASE WHEN days_since_birth != 0 THEN dim_player_key ELSE NULL END) AS count_returning_users
, COUNT(DISTINCT CASE WHEN days_since_birth = 1 THEN dim_player_key ELSE NULL END) AS day1_users -- note: the function is a LAG function instead of a LEAD function because of the sort order
, (NULLIF(LAG(COUNT(DISTINCT CASE WHEN days_since_birth = 0 THEN dim_player_key ELSE NULL END), 1) OVER (order by _date)::float, 0)) as AA
, (NULLIF(LAG(COUNT(DISTINCT CASE WHEN days_since_birth = 1 THEN dim_player_key ELSE NULL END), 1) OVER (order by _date)::float, 0)) as AB
, (NULLIF(LAG(COUNT(DISTINCT CASE WHEN days_since_birth = 0 THEN dim_player_key ELSE NULL END), 0) OVER (order by _date)::float, 0)) as BB
, (NULLIF(LAG(COUNT(DISTINCT CASE WHEN days_since_birth = 1 THEN dim_player_key ELSE NULL END), 0) OVER (order by _date)::float, 0)) as BA
FROM ( SELECT sessions_table.account_id AS dim_player_key,
sessions_table.session_id AS dim_session_key,
sessions_table.title_id AS dim_title_id,
sessions_table.appid AS dim_app_id,
sessions_table.loginip AS login_ip,
sessions_table.logindate AS timestamp_session,
birthdate_table.birthdate AS timestamp_birthdate,
EXTRACT(EPOCH FROM (sessions_table.logindate - birthdate_table.birthdate)) AS count_age_in_seconds,
(date_part('day', sessions_table.logindate)- date_part('day', birthdate_table.birthdate)) AS days_since_birth
FROM
dataset.tablename1 AS sessions_table
JOIN (
SELECT
account_id,
MIN(logindate) AS birthdate
FROM
dataset.tablename1
GROUP BY
account_id )
-- call this sub-table the birthdate_table
birthdate_table ON
sessions_table.account_id = birthdate_table.account_id
-- call this table the outer_sessions_table
) AS outer_sessions_table
GROUP BY
_date
ORDER BY
_date ASC
I think that what I probably need to do is add an additional field in the inner select that reports the date as an integer value, something like the EPOCH time for that date at midnight. But when I tried that (adding a per-day epoch time), it changed all of the values in the output table to 1, and I don't understand why.
Can anyone help me out?
Thanks,
Brad

The problem was with the days_since_birth calculation. I was using
(date_part('day',
sessions_table.logindate)- date_part('day',
birthdate_table.birthdate)) AS days_since_birth
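as though it were subtracting the absolute dates to give me the difference between them in days, but it just converts each date to a day of the month and subtracts those, so at the month rollover it returns -27, -29 or -30 (depending on the month). The fix is to subtract the dates themselves, for example sessions_table.logindate::date - birthdate_table.birthdate::date, which returns the difference in whole days (wrapping the expression in ABS is not enough, because it would turn -30 into 30 rather than 1).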
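To illustrate why the month boundary breaks the calculation, here is a minimal Python sketch (not part of the original query) that contrasts subtracting day-of-month values with subtracting the dates themselves. The function names are made up for the illustration.

```python
from datetime import date


def day_of_month_diff(login: date, birth: date) -> int:
    # Mirrors date_part('day', login) - date_part('day', birth):
    # only the day-of-month components are compared.
    return login.day - birth.day


def days_between(login: date, birth: date) -> int:
    # Mirrors subtracting one date from another, which yields whole days.
    return (login - birth).days


birth = date(2013, 5, 31)
login = date(2013, 6, 1)
print(day_of_month_diff(login, birth))  # -30
print(days_between(login, birth))       # 1
```

With the day-of-month version, a player born on May 31st who returns on June 1st gets -30 instead of 1, so no CASE branch that checks for days_since_birth = 1 ever matches across the boundary.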

Related

How to subtract a separate count from one grouping

I have a postgres query like this
select application.status as status, count(*) as "current_month" from application
where to_char(application.created, 'mon') = to_char('now'::timestamp - '1 month'::interval, 'mon')
and date_part('year',application.created) = date_part('year', CURRENT_DATE)
and application.job_status != 'expired'
group by application.status
It returns the table below, which has the number of applications grouped by status for the current month. However, I want to subtract the total count of a separate but related query from the internal_review number only. I want to count the number of rows with type = abc within the same table and for the same date range, and then subtract that amount from the internal_review number (type is a separate field). The current_month_desired column shows how it should look.
status           current_month  current_month_desired
fail             22             22
internal_review  95             22
pass             146            146
UNTESTED: but maybe...
The intent here is to use a CASE expression inside the aggregate to sum conditionally. This way, the subtraction is not needed in the first place, as you are only "counting" the rows you need.
SELECT application.status as status
, sum(case when type = 'abc'
and application.status = 'internal_review' then 0
else 1 end) as "current_month"
FROM application
WHERE to_char(application.created, 'mon') = to_char('now'::timestamp - '1 month'::interval, 'mon')
and date_part('year',application.created) = date_part('year', CURRENT_DATE)
and application.job_status != 'expired'
GROUP BY application.status
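
The conditional-sum idea can be sketched outside SQL as well. The following Python snippet is only an illustration with made-up rows (the statuses and types are hypothetical sample data, not taken from the question): each row contributes 1 to its status group unless it is an internal_review row of type abc.

```python
from collections import Counter

# Hypothetical sample rows: (status, type)
rows = [
    ("fail", "xyz"),
    ("internal_review", "abc"),
    ("internal_review", "xyz"),
    ("internal_review", "abc"),
    ("pass", "abc"),
]


def conditional_counts(rows):
    counts = Counter()
    for status, type_ in rows:
        # Same test as the CASE expression: these rows add 0, all others add 1.
        excluded = status == "internal_review" and type_ == "abc"
        counts[status] += 0 if excluded else 1
    return dict(counts)


print(conditional_counts(rows))
```

Only the internal_review group is affected; rows of type abc in other groups are still counted.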

Calculate past 3 month average for every past 3rd month

I am using SQL Server 2014. I have a table like this
create table revenue (id varchar(2), trasdate date, revenue int);
insert into revenue(id, trasdate, revenue)
values ('aa', '2018/09/01', 1234.5),
('aa' , '2018/08/04', 450),
('aa', '2018/07/03',500),
('aa', '2018/06/04',600),
('ab', '2018/09/01', 1234.5),
('ab' , '2018/08/04', 450),
('ab', '2018/07/03',500),
('ab', '2018/06/04',600),
('ab', '2018/05/03', 200),
('ab', '2018/04/02', 150),
('ab', '2018/03/01', 350),
('ab', '2018/02/05', 700),
('aa', '2018/01/07', 400)
;
I am preparing a SQL query for an SSRS report. I want to calculate the past 3-month average for the current month and for every 3rd month before it, with a result like the one below. We are in the month of September right now. The result should show something like this:
id  Period     Revenue_3Mon
aa  March-May  233
aa  June-Aug   516
ab  March-May  233
ab  June-Aug   516
I can figure out the Period column myself; I was mainly focusing on getting Revenue_3Mon. I initially tried the query below after some googling, but it throws the error "incorrect syntax near 'rows'", and if I remove rows from the query, it throws "Incorrect syntax near the keyword 'between'" and "incorrect syntax near i".
select i.id,i.mon,
avg([i.mon_revenue]) over (partition by i.id, i.mon order by [i.id],
[i.mon] rows between 3 preceding and 1 preceding row) as revenue_3mon --
-- using 3 preceding and 1 preceding row you exclude the current row
from (select a.id, month(a.trasdate) as mon,
sum(a.revenue) as mon_revenue
from revenue a
group by a.id, month(a.trasdate)) i
group by i.id, i.mon
order by i.id,i.mon;
After a few attempts, I gave up on this query and came up with a new solution that was closer to what I expected (after a lot of trial and error).
Declare @count as int;
declare @max as int;
set @count = 4
declare @temp as table (id varchar(2), monthoftrasdate int, revenue int,
[3monavg] int);
SET @MAX = (SELECT distinct MAX(a.ROWNUM) FROM (SELECT id, month(trasdate)
as mon, SUM(revenue) TotalRevenue,
-- sum(revenue) as mon_revenue,
ROW_NUMBER() OVER(PARTITION BY ID ORDER BY MONTH(TRASDATE)) AS ROWNUM
FROM revenue
GROUP BY ID, MONTH(TRASDATE)
) A GROUP BY A.ID);
while (@count <= @max )
begin
WITH CTE AS (
SELECT id, month(trasdate) as mon, SUM(revenue) TotalRevenue,
-- sum(revenue) as mon_revenue,
ROW_NUMBER() OVER(PARTITION BY ID ORDER BY MONTH(TRASDATE)) AS
ROWNUM
FROM revenue
GROUP BY ID, MONTH(TRASDATE)
)
insert into @temp
SELECT A.ID,A.MON, a.TotalRevenue
,( SELECT avg(b.TotalRevenue) as avgrev
FROM CTE B
WHERE B.ROWNUM BETWEEN A.ROWNUM-3 AND A.ROWNUM-1
AND A.ID = B.ID --AND A.mon = B.mon
--and b.ROWNUM < a.ROWNUM
and (a.mon > 3 and a.ROWNUM > 3)
GROUP BY B.id
) AS REVENUE_3MON
FROM CTE A
set @count = @count + 1
end
select distinct a.* from @temp a
The reason I had to use 'distinct' is because the query was showing duplicate records for every id and every month. So far the result shows like below
id MonthofTrasdate Revenue 3MonAvg
aa 1 400 NULL
aa 2 700 NULL
aa 3 350 NULL
aa 4 150 483
aa 5 200 400
aa 6 600 233
aa 7 500 316
aa 8 450 433
aa 9 1234 516
ab 1 400 NULL
ab 2 700 NULL
ab 3 350 NULL
ab 4 150 483
ab 5 200 400
ab 6 600 233
ab 7 500 316
ab 8 450 433
ab 9 1234 516
This pulls out the past 3-month average for every month, and I can manipulate the rest in SSRS the way I want.
My table currently has no data for the previous year, so this shows the appropriate result for the next couple of months. My concern is that when I have to show my boss the figures for January, February and March of next year, the query should also pull the preceding periods that cross the year boundary, such as Oct-Dec (previous year), Nov-Jan and Dec-Feb. I am struggling to figure out the proper way to put this in my query.
Can you please help me out with this query? Also, please let me know what is wrong with my former query.
Problems with your first attempt:
You enclosed some of the aliases and column names in square brackets like [i.mon_revenue]. There is no need for square brackets, but if you want to use them, you have to break them up at the dot: [i].[mon_revenue].
In your window function expression, there is one "row" too many at the end: the frame has to read rows between 3 preceding and 1 preceding, without a trailing row.
Window functions are applied at the very end (after the rest of the respective query), so you also have to include i.mon_revenue in your GROUP BY clause of the outer query.
Knowing that the inner query will produce one row per id and mon, there will never be preceding rows in an id-mon partition. Therefore, you must not partition by both, but only by id.
To simplify the query after resolving the issues: ordering by a partition column generally makes no sense, and since (as already mentioned) the inner query returns unique id-mon combinations, you don't have to group by these in the outer query. Looking at that query, we see that the outer query just directly selects and uses the values from the inner query, which makes a separation into two queries unnecessary. So, in fact, you wanted to perform the following query, which will produce the rolling 3-month average (I added the monthly TotalRevenue as well):
SELECT id, MONTH(trasdate) AS mon, SUM(revenue) AS TotalRevenue,
AVG(SUM(revenue)) OVER (PARTITION BY id ORDER BY MONTH(trasdate) ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING) AS revenue_3mon
FROM revenue
GROUP BY id, MONTH(trasdate)
ORDER BY id, MONTH(trasdate);
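
As a rough illustration of what the frame ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING computes, here is a small Python sketch (the function name is made up). It averages up to three rows before the current one and excludes the current row. It uses floating-point division, whereas SQL Server's AVG over an int column returns a truncated integer.

```python
def rolling_avg_preceding(values, window=3):
    """Average of up to `window` values before each position, excluding it."""
    out = []
    for i in range(len(values)):
        preceding = values[max(0, i - window):i]
        # No preceding rows: the SQL window function would return NULL.
        out.append(sum(preceding) / len(preceding) if preceding else None)
    return out


# Monthly totals for months 1..9, as in the result table posted in the question
monthly = [400, 700, 350, 150, 200, 600, 500, 450, 1234]
for month, avg in enumerate(rolling_avg_preceding(monthly), start=1):
    print(month, avg)
```

Note that, unlike the question's second attempt, the window version also returns an average for months 2 and 3, based on the one or two rows that precede them.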
Suggestions on your second attempt:
When calculating the @MAX value, you rely on the fact that each id has revenues for the same number of months. Are you sure?
The code inside the WHILE loop does not depend on @count, so it will add the same data into the @temp table multiple times, which is probably the reason why you thought you needed a DISTINCT. Therefore: no need for the variables, no need for a loop and a @temp, no need for DISTINCT.
The conditions A.mon > 3 and A.rownum > 3 are redundant with your current data. In general, I guess, you don't want to explicitly exclude the months from January to March, so A.mon > 3 should be removed. A.rownum > 3 could be removed, too, unless you really don't want to see a 3-month average when there are only 2 preceding months or fewer.
As the subquery for the average is restricted to only one id, there's no need for a GROUP BY.
Since the ROW_NUMBER function doesn't care about gaps in the months, I suggest using a different numbering expression, for example DATEDIFF(month, MAX(trasdate), GETDATE()) AS mnum. Of course, the comparison in the WHERE clause of the subquery then has to be changed to B.mnum BETWEEN A.mnum+1 AND A.mnum+3.
So, your second attempt can be reduced to this, which will produce the same result as the above, at least with your sample data, where no gaps in the months exist:
WITH CTE AS (
SELECT id, MONTH(trasdate) AS mon, SUM(revenue) AS TotalRevenue,
DATEDIFF(month, MAX(trasdate), GETDATE()) AS mnum
FROM revenue
GROUP BY id, MONTH(trasdate)
)
SELECT id, mon, TotalRevenue
, (SELECT AVG(B.TotalRevenue)
FROM CTE B
WHERE B.mnum BETWEEN A.mnum+1 AND A.mnum+3
AND A.id = B.id
) AS revenue_3mon
FROM CTE A
ORDER BY id, mnum DESC;
Now, guess what, an expression like my mnum using DATEDIFF increases by one every month as we move to the past, regardless of a change of years, so this might be useful for grouping as well, whether you want to (or can?) use Window functions or not:
With OVER()
SELECT id, MONTH(MIN(trasdate)) AS mon, YEAR(MIN(trasdate)) AS yr, SUM(revenue) AS TotalRevenue,
AVG(SUM(revenue)) OVER (PARTITION BY id ORDER BY MIN(trasdate) ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING) AS revenue_3mon
FROM revenue
GROUP BY id, DATEDIFF(month, trasdate, GETDATE())
ORDER BY id, DATEDIFF(month, trasdate, GETDATE()) DESC;
Without OVER()
WITH CTE AS (
SELECT id, MIN(trasdate) AS min_dt, SUM(revenue) AS TotalRevenue,
DATEDIFF(month, trasdate, GETDATE()) AS mnum
FROM revenue
GROUP BY id, DATEDIFF(month, trasdate, GETDATE())
)
SELECT id, MONTH(min_dt) AS mon, YEAR(min_dt) AS yr, TotalRevenue
, (SELECT AVG(B.TotalRevenue)
FROM CTE B
WHERE B.mnum BETWEEN A.mnum+1 AND A.mnum+3
AND A.id = B.id
) AS revenue_3mon
FROM CTE A
ORDER BY id, mnum DESC;
Both queries allow for retrieving the minimum and maximum date for each period (including month and year).
If you instead wanted what you originally posted under The result should show something like this (just grouping by previous 3-months intervals), you just would have to group your original revenue table by id and (DATEDIFF(month, trasdate, GETDATE())-1)/3 (filtering WHERE DATEDIFF(month, trasdate, GETDATE()) > 0). If so, this kind of grouping and aggregation could, of course, be done also by the Report Server.
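
The grouping by (DATEDIFF(month, trasdate, GETDATE())-1)/3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names are made up, "today" is fixed to a date in September 2018, and the average is taken over individual rows, which matches an average of monthly totals only because the sample data has one row per id and month.

```python
from collections import defaultdict
from datetime import date


def months_ago(d: date, today: date) -> int:
    # Counts month boundaries crossed, like DATEDIFF(month, d, today),
    # so it keeps increasing across a change of year.
    return (today.year - d.year) * 12 + (today.month - d.month)


def three_month_averages(rows, today):
    buckets = defaultdict(list)
    for id_, d, revenue in rows:
        m = months_ago(d, today)
        if m > 0:  # skip the current month
            # Months 1-3 ago land in bucket 0, months 4-6 ago in bucket 1, ...
            buckets[(id_, (m - 1) // 3)].append(revenue)
    return {key: sum(v) / len(v) for key, v in buckets.items()}


rows = [
    ("ab", date(2018, 9, 1), 1234.5), ("ab", date(2018, 8, 4), 450),
    ("ab", date(2018, 7, 3), 500), ("ab", date(2018, 6, 4), 600),
    ("ab", date(2018, 5, 3), 200), ("ab", date(2018, 4, 2), 150),
    ("ab", date(2018, 3, 1), 350),
]
print(three_month_averages(rows, today=date(2018, 9, 15)))
```

Bucket 0 corresponds to June-Aug and bucket 1 to March-May, and the same arithmetic would keep working for periods such as Oct-Dec of the previous year.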
I think this should do what you want:
select r.*,
avg(r.mon_revenue) over (partition by r.id
order by r.mon_min
rows between 3 preceding and 1 preceding
) as revenue_3mon
-- using 3 preceding and 1 preceding row you exclude the current row
from (select r.id, month(r.trasdate) as mon,
min(r.trasdate) as mon_min,
sum(r.revenue) as mon_revenue
from revenue r
group by r.id, year(r.trasdate), month(r.trasdate)
) r
order by r.id, r.mon, r.mon_min;
Notes:
I fixed the code so it recognizes years as well as months.
The expression [i.mon_revenue] is not a valid column reference (in your case). You have no column with the name "i.mon_revenue" (with the . in the name).
I changed the table alias to r to match the table.
I added a date column for each month to make it easier to express the ordering.
The outer group by is not necessary.
There are several syntax errors in your code. This should give you what you need. The inner query is the important bit, but hopefully this will be enough to get you on your way.
I swapped out the table for a table variable and changed the revenue column from INT to float, since you have decimal values in there; other than that, your original sample table is unchanged.
DECLARE @revenue table (id varchar(2), trasdate date, revenue float)
insert into @revenue(id, trasdate, revenue)
values ('aa', '2018/09/01', 1234.5),
('aa' , '2018/08/04', 450),
('aa', '2018/07/03',500),
('aa', '2018/06/04',600),
('ab', '2018/09/01', 1234.5),
('ab' , '2018/08/04', 450),
('ab', '2018/07/03',500),
('ab', '2018/06/04',600),
('ab', '2018/05/03', 200),
('ab', '2018/04/02', 150),
('ab', '2018/03/01', 350),
('ab', '2018/02/05', 700),
('aa', '2018/01/07', 400)
SELECT
*
FROM
(
SELECT
*
, MONTH(trasdate) as MonthNumber
, AVG(revenue) OVER (PARTITION BY id
ORDER BY
id
, MONTH(trasdate) ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING) as ThreeMonthAvg
FROM @revenue
) a
WHERE MONTH(GETDATE()) - MonthNumber IN (0, 3, 6, 9)
This gives the following results
aa 2018-06-04 600 6 400
aa 2018-09-01 1234.5 9 516.666666666667
ab 2018-03-01 350 3 700
ab 2018-06-04 600 6 233.333333333333
ab 2018-09-01 1234.5 9 516.666666666667

SQL: Turning a single set of results into a table

I've created a T-SQL query which counts all of the students who are at a lesson at a specific time, for each day of the week (below). Is there a way of taking this query (which gives a snapshot) and displaying results for a range of different '@time' values, so I get a table which looks something like:
Day 09:00 09:30..... etc
1 300 455
2 43 467
3 21 312
The increment in time along the columns would be constant - and ideally adjustable, but if that's needlessly complex, every 30 mins would probably be optimum.
Query:
DECLARE @time AS VARCHAR(5)
SELECT @time = '09:00'
SELECT A.day_of_week,
Sum(A.count_student) AS Students
FROM (
SELECT sttrgprf.register_id,
sttrgprf.day_of_week,
CONVERT(VARCHAR(5), start_time, 108) AS start_time,
CONVERT(VARCHAR(5), end_time, 108) AS end_time,
Count(DISTINCT Student_ID) AS count_student
FROM qlsdat.dbo.Stthstud
INNER JOIN qlsdat.dbo.Sttrgprf AS Sttrgprf
ON Stthstud.register_id = Sttrgprf.register_ID
AND Stthstud.register_group = Sttrgprf.register_group
AND Stthstud.acad_period = Sttrgprf.acad_period
WHERE sttrgprf.Acad_period = '14/15'
GROUP BY sttrgprf.register_id,
sttrgprf.day_of_week,
sttrgprf.start_time,
sttrgprf.end_time
) AS A
WHERE Start_time <= @time
AND End_time > @time
GROUP BY day_of_week
ORDER BY day_of_week

postgres complicated query

I wonder whether a query like this is possible. I have a table that stores some numbers per date.
Let's say I have 3 columns: Date, Value, Good/Bad
For example:
2014-03-03 100 Good
2014-03-03 15 Bad
2014-03-04 120 Good
2014-03-04 10 Bad
And I want to select and subtract Good-Bad:
2014-03-03 85
2014-03-04 110
Is it possible? I have been thinking about it a lot and don't have an idea yet. It would be rather simple if I had the Good and Bad values in separate tables.
The trick is to join your table back to itself as shown below. myTable as A will read only the Good rows and myTable as B will read only the Bad rows. Those rows then get joined into a single row based on date.
SQL Fiddle Demo
select
a.date
,a.count as Good_count
,b.count as bad_count
,a.count-b.count as diff_count
from myTable as a
inner join myTable as b
on a.date = b.date and b.type = 'Bad'
where a.type = 'Good'
Output returned:
DATE GOOD_COUNT BAD_COUNT DIFF_COUNT
March, 03 2014 00:00:00+0000 100 15 85
March, 04 2014 00:00:00+0000 120 10 110
Another approach would be to use GROUP BY instead of the inner join:
select
a.date
,sum(case when type = 'Good' then a.count else 0 end) as Good_count
,sum(case when type = 'Bad' then a.count else 0 end) as Bad_count
,sum(case when type = 'Good' then a.count else 0 end) -
sum(case when type = 'Bad' then a.count else 0 end) as Diff_count
from myTable as a
group by a.date
order by a.date
Both approaches produce the same result.
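
The GROUP BY variant amounts to a signed sum per date. Here is a minimal Python sketch of that idea using the sample rows from the question (the function name is made up for the illustration):

```python
from collections import defaultdict

# (date, value, type) rows from the question
rows = [
    ("2014-03-03", 100, "Good"),
    ("2014-03-03", 15, "Bad"),
    ("2014-03-04", 120, "Good"),
    ("2014-03-04", 10, "Bad"),
]


def good_minus_bad(rows):
    diff = defaultdict(int)
    for day, value, kind in rows:
        # Good values are added, Bad values are subtracted, per date.
        if kind == "Good":
            diff[day] += value
        elif kind == "Bad":
            diff[day] -= value
    return dict(sorted(diff.items()))


print(good_minus_bad(rows))  # {'2014-03-03': 85, '2014-03-04': 110}
```

One difference worth knowing between the two SQL approaches: the inner join drops any date that has no Bad row (or no Good row), while the conditional sum still returns it and treats the missing side as 0, as this sketch does.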

Counting Differences within a Case-When

I'm trying to do some analytics on user activity, specifically how many users are still active, or at least logging in, over a period of time. However, I have some conflicting numbers for the first month's count, which should just be the count of users that signed up during a month. To figure that out, my simple query is this.
SELECT count(user_id)
FROM users
WHERE date_part('year', member_since) = 2013
AND date_part('month', member_since) = 01
Hypothetically this returns '1,000' which I believe to be correct because of the simplicity. But if I do this...
SELECT
COUNT(CASE WHEN date_part('day', last_login - member_since) >= 0
THEN user_id END) days_0
FROM users
WHERE date_part('year', member_since) = 2013
AND date_part('month', member_since) = 01
...It will return a number less than 1,000. Theoretically this should return the same number as above because even if last_login is the same day as member_since that would be zero and should count those users. Both member_since and last_login are 'timestamp' types. I have a hunch that the difference could be users where last_login is the exact same as member_since, meaning that they signed up and never came back, but I'm not sure how I would test this. Is this a NULL issue? If so, how could I include that to get back to the count of '1,000'?
I would bet dollars to donuts that your NULLs are causing the problem: when last_login is NULL, the comparison evaluates to NULL, so the CASE expression has no matching branch, returns NULL, and COUNT skips that row.
To correct for that, try this:
SELECT
COUNT(CASE WHEN last_login is null or date_part('day', last_login - member_since) >= 0
THEN user_id END) days_0
FROM users
WHERE date_part('year', member_since) = 2013
AND date_part('month', member_since) = 01;
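
To see how a NULL last_login drops out of the count, here is a small Python model of the two queries. It is only a sketch: None stands in for NULL, the sample users are made up, and it only models the NULL path, not every detail of date_part on intervals.

```python
from datetime import datetime


def count_days_0(users, include_null_logins=False):
    n = 0
    for member_since, last_login in users:
        if last_login is None:
            # In SQL, NULL - member_since is NULL and NULL >= 0 is not true,
            # so the CASE yields NULL and COUNT ignores the row, unless the
            # extra "last_login is null" condition is added.
            if include_null_logins:
                n += 1
        elif (last_login - member_since).days >= 0:
            n += 1
    return n


# Hypothetical users: (member_since, last_login)
users = [
    (datetime(2013, 1, 5, 9, 0), datetime(2013, 1, 5, 9, 0)),
    (datetime(2013, 1, 10, 12, 0), datetime(2013, 2, 1, 8, 0)),
    (datetime(2013, 1, 20, 18, 0), None),  # signed up, never logged in again
]
print(count_days_0(users))                            # 2
print(count_days_0(users, include_null_logins=True))  # 3
```

A quick way to test the hunch directly is to count the rows where last_login IS NULL for the same month and check whether that number accounts for the gap.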