7 Day Return/Retention Rate - postgresql

I've been trying to calculate the 7 Day Return Rate (also known as Classic Retention Rate, as described here: https://www.braze.com/blog/calculate-retention-rate/) in PostgreSQL and then take a 30-day average to reduce noise.
However, I'm sure I'm doing something wrong. First of all, the numbers look way higher than I intuitively feel they should be (the rest of the sector is generally around 5%). Also, I believe the first 7 days should show 0, since a user theoretically needs at least 7 days to count as a "return". Instead, I get around 40-70%, as shown below.
Would someone mind taking a look at the code below and seeing if there are any errors? 7 Day Return Rate is a really common metric for apps, and I haven't found any questions on Stack Exchange (or even the rest of the web) that calculate it in PostgreSQL to this level of sophistication, so I feel a solid response could be very useful to a lot of people.
Sample output (first_event_date, rolling_avg_retention_rate in %):
Wednesday, August 1, 2018 12:00 AM 71.14
Thursday, August 2, 2018 12:00 AM 55.44
Friday, August 3, 2018 12:00 AM 50.09
Saturday, August 4, 2018 12:00 AM 45.81
Sunday, August 5, 2018 12:00 AM 43.27
Monday, August 6, 2018 12:00 AM 40.61
Tuesday, August 7, 2018 12:00 AM 39.38
Wednesday, August 8, 2018 12:00 AM 38.46
Thursday, August 9, 2018 12:00 AM 36.81
Friday, August 10, 2018 12:00 AM 35.94
with
user_first_event as (
select distinct id, min(timestamp)::date as first_event_date
from log
where
timestamp <= current_date
and timestamp >= {{start_date}} and timestamp <= {{end_date}}
group by id),
event as (
select distinct id, timestamp::date as user_event_date
from log
where timestamp <= current_date and timestamp >= {{start_date}}),
gap as (
select
user_first_event.id,
user_first_event.first_event_date,
event.user_event_date,
event.user_event_date - user_first_event.first_event_date as days_since_signup
from user_first_event
join event on user_first_event.id = event.id
where user_first_event.first_event_date <= event.user_event_date),
conversion_rate as (
select
first_event_date,
(sum(case when days_since_signup = 7 then 1 else 0 end) * 100.0 /
count(distinct id)
) as seven_day_retention_rate
from gap
group by first_event_date
)
SELECT first_event_date,
AVG(seven_day_retention_rate)
OVER(ORDER BY first_event_date ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS rolling_avg_retention_rate
FROM conversion_rate

The problem is a bit easier than your query makes it seem. You can actually do it with just one subquery and one outer query, as follows:
select first_event_date
, avg(seven_day_return) as seven_day_return_day_only
, avg( avg(seven_day_return) ) OVER(ORDER BY first_event_date asc ROWS BETWEEN 29 preceding AND CURRENT ROW ) AS thirty_day_rolling_retention
from (
--inner query to get value for user, 1 if they retain and 0 if they do not
select min(timestamp)::date as first_event_date
, case when array_agg(timestamp::date) @> ARRAY[ (min(timestamp)::date + 7) ] then 1 else 0 end as seven_day_return
from log
group by id ) t
group by t.first_event_date;
Note that this weights each day equally rather than each user equally across days. If you want to weight the average by user across days then you can update the outer calculation using more aggregates and windows to compute the value with weightings.
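To make the weighting point concrete, here is a minimal sketch in plain Python rather than SQL. The function names and the tiny event list are made up for illustration; they are not part of the query above. It builds the same per-user 0/1 flag (did the user have an event exactly 7 days after their first event?) and then compares the two ways of averaging.

```python
from collections import defaultdict
from datetime import date, timedelta


def seven_day_return(events):
    """Group users by first event date; flag 1 if they came back exactly 7 days later."""
    days_by_user = defaultdict(set)
    for user_id, day in events:
        days_by_user[user_id].add(day)
    cohorts = defaultdict(list)
    for days in days_by_user.values():
        first = min(days)
        cohorts[first].append(1 if first + timedelta(days=7) in days else 0)
    return dict(cohorts)


def day_weighted(cohorts):
    """Average of the per-day rates: every cohort day counts the same."""
    rates = [sum(flags) / len(flags) for flags in cohorts.values()]
    return sum(rates) / len(rates)


def user_weighted(cohorts):
    """Average over all users: every user counts the same."""
    flags = [flag for day_flags in cohorts.values() for flag in day_flags]
    return sum(flags) / len(flags)


# Hypothetical data: one user on day 0 who returns, three users on day 1 who do not.
d0 = date(2018, 8, 1)
events = [
    ("a", d0), ("a", d0 + timedelta(days=7)),
    ("b", d0 + timedelta(days=1)),
    ("c", d0 + timedelta(days=1)),
    ("d", d0 + timedelta(days=1)),
]
cohorts = seven_day_return(events)
print(day_weighted(cohorts))   # 0.5
print(user_weighted(cohorts))  # 0.25
```

With only one user in the first cohort, the day-weighted figure is pulled up to 50% even though just one of four users returned, which is the effect the note above warns about.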
Reference: http://sqlfiddle.com/#!17/ee17e/1/0
If you don't have access to array_agg (but have access to window functions) you can use:
select first_event_date
, avg(seven_day_return) as day_seven_day_return
, avg( avg(seven_day_return) ) OVER(ORDER BY first_event_date asc ROWS BETWEEN 29 preceding AND CURRENT ROW ) AS thirty_day_rolling_retention
from (
--inner query to get value for user
select min(timestamp)::date as first_event_date
, case when exists(select 1 from log l2 where l2.id = log.id and l2.timestamp::date = min(log.timestamp)::date + 7) then 1 else 0 end as seven_day_return
from log
group by id ) t
group by t.first_event_date;

Related

Certain Range of Date in each Month

I'd like to get the range from the 20th to the 25th of each month in BigQuery, but I don't know what syntax I should use. For example:
Jan 20 - 25
Feb 20 - 25
and so on
The only approach I can think of is creating a CTE for every month and then combining them with UNION ALL.
Consider the query below.
SELECT DATE_ADD(month, INTERVAL day - 1 DAY) date_range,
FROM UNNEST(GENERATE_DATE_ARRAY('2022-01-01', '2022-03-01', INTERVAL 1 MONTH)) month,
UNNEST(GENERATE_ARRAY(20, 25)) day;
The query below seems simpler than my original answer, and you can adjust the date range by changing the condition in the WHERE clause.
SELECT *
FROM UNNEST(GENERATE_DATE_ARRAY('2022-01-01', '2022-12-31', INTERVAL 1 DAY)) date_range
WHERE EXTRACT(DAY FROM date_range) BETWEEN 20 AND 25
For the use case you mentioned in the comments:
WHERE EXTRACT(DAY FROM date_range) >= 21 OR EXTRACT(DAY FROM date_range) = 1
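The filter-on-day-of-month idea can be checked outside BigQuery. This is a rough Python sketch of the same logic for the 20th to 25th range from the question; `days_in_range` and its parameters are hypothetical names used only for this illustration.

```python
from datetime import date, timedelta


def days_in_range(start, end, first_day=20, last_day=25):
    """Walk every date from start to end and keep those whose day of month is in range."""
    kept = []
    current = start
    while current <= end:
        if first_day <= current.day <= last_day:
            kept.append(current)
        current += timedelta(days=1)
    return kept


dates = days_in_range(date(2022, 1, 1), date(2022, 3, 31))
print(len(dates))            # 18 (6 days in each of 3 months)
print(dates[0], dates[-1])   # 2022-01-20 2022-03-25
```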

12 months rolling

I have the query below, which returns 12 months of rolling data. If I run it today, it brings back data from 23rd August 2015 to 23rd August 2016. Ideally, I would like it to start from 1st August 2015, and if I ran it again next month, it would start from 1st September 2015. Is this possible? Thanks
select
Date
Street
Town
Incidents
IncidentType A
IncidentType B
IncidentType C
FROM
(
select
COUNT(I.INC_NUM) as Incidents,
COUNT(case when i.INC_TYPE = ''A'' THEN 1
end)
"IncidentType A"
COUNT(case when i.INC_TYPE = ''B'' THEN 1
end)
"IncidentType B"
COUNT(case when i.INC_TYPE = ''C'' THEN 1
end)
"IncidentType C"
FROM Table i
GROUP BY i.INC_NUM
) i
where Date >= (now()-('12 months'::interval))
Your code suggests that you are using Postgres. If the code works and you just need to adjust the where clause, use date_trunc():
where Date >= date_trunc('month', now() - ('12 months'::interval))
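As a sanity check on what that expression returns, here is a small Python sketch of the same arithmetic: go back 12 months, then snap to the first of that month. `rolling_start` is a hypothetical helper for illustration, not a Postgres function.

```python
from datetime import date


def rolling_start(today, months_back=12):
    """First day of the month that lies months_back months before today."""
    # Count months since year 0 so the subtraction carries across year boundaries.
    total_months = today.year * 12 + (today.month - 1) - months_back
    return date(total_months // 12, total_months % 12 + 1, 1)


print(rolling_start(date(2016, 8, 23)))  # 2015-08-01
print(rolling_start(date(2016, 9, 5)))   # 2015-09-01
```

These match the two dates asked for in the question: running on 23rd August 2016 starts at 1st August 2015, and running a month later starts at 1st September 2015.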

Insert subquery date according to day

I would like to insert a date from a subquery based on its day of the week. In addition, each date can only be used four times. Once a date has been used four times, the fifth row should use the next date that falls on the same day, in other words the same weekday of the following week. For example, move from Monday 6 JUNE 2016 to Monday 13 JUNE 2016 (you may check the calendar).
I have a query that gets a list of dates based on presentationdatestart and presentationdateend from the presentation table:
select a.presentationid,
a.presentationday,
to_char (a.presentationdatestart + delta, 'DD-MM-YYYY', 'NLS_CALENDAR=GREGORIAN') list_date
from presentation a,
(select level - 1 as delta
from dual
connect by level - 1 <= (select max (presentationdateend - presentationdatestart)
from presentation))
where a.presentationdatestart + delta <= a.presentationdateend
and a.presentationday = to_char(a.presentationdatestart + delta, 'fmDay')
order by a.presentationdatestart + delta,
a.presentationid; --IMPORTANT!!!--
For example,
presentationday presentationdatestart presentationdateend
Monday 01-05-2016 04-06-2016
Tuesday 01-05-2016 04-06-2016
Wednesday 01-05-2016 04-06-2016
Thursday 01-05-2016 04-06-2016
The query result will list all possible dates from 01-05-2016 to 04-06-2016:
Monday 02-05-2016
Tuesday 03-05-2016
Wednesday 04-05-2016
Thursday 05-05-2016
....
Monday 30-05-2016
Tuesday 31-05-2016
Wednesday 01-06-2016
Thursday 02-06-2016 (20 rows)
This is my INSERT query :
insert into CSP600_SCHEDULE (studentID,
studentName,
projectTitle,
supervisorID,
supervisorName,
examinerID,
examinerName,
exavailableID,
availableday,
availablestart,
availableend,
availabledate)
select '2013816591',
'mong',
'abc',
'1004',
'Sue',
'1002',
'hazlifah',
2,
'Monday', --BASED ON THIS DAY
'12:00:00',
'2:00:00',
to_char (a.presentationdatestart + delta, 'DD-MM-YYYY', 'NLS_CALENDAR=GREGORIAN') list_date --FOR AVAILABLEDATE
from presentation a,
(select level - 1 as delta
from dual
connect by level - 1 <= (select max (presentationdateend - presentationdatestart)
from presentation))
where a.presentationdatestart + delta <= a.presentationdateend
and a.presentationday = to_char(a.presentationdatestart + delta, 'fmDay')
order by a.presentationdatestart + delta,
a.presentationid;
This query successfully added 20 rows because there were 20 possible dates. I would like to modify the query so that it inserts based on availableDay and each date is used at most four times, each time for a different studentID.
Possible outcome in CSP600_SCHEDULE (I am removing unrelated columns to ease readability):
StudentID StudentName availableDay availableDate
2013 abc Monday 01-05-2016
2014 def Monday 01-05-2016
2015 ghi Monday 01-05-2016
2016 klm Monday 01-05-2016
2010 nop Tuesday 02-05-2016
2017 qrs Tuesday 02-05-2016
2018 tuv Tuesday 02-05-2016
2019 wxy Tuesday 02-05-2016
.....
2039 rrr Monday 09-05-2016
.....
You may check the calendar :)
I think what you're asking for is to list your students and then batch them up in groups of 4 - each batch is then allocated to a date. Is that right?
In which case something like this should work (I'm using a list of tables as the student names just so I don't need to insert any data into a custom table) :
WITH students AS
(SELECT table_name
FROM all_tables
WHERE rownum < 100
)
SELECT
table_name
,SYSDATE + (CEIL(rownum/4) -1)
FROM
students
;
I hope that helps you
...okay, following your comments, I think this might be a better solution :
WITH students AS
(SELECT table_name student_name
FROM all_tables
WHERE rownum < 100
)
, dates AS
(SELECT TRUNC(sysdate) appointment_date from dual UNION
SELECT TRUNC(sysdate+2) from dual UNION
SELECT TRUNC(sysdate+4) from dual UNION
SELECT TRUNC(sysdate+6) from dual UNION
SELECT TRUNC(sysdate+8) from dual UNION
SELECT TRUNC(sysdate+10) from dual UNION
SELECT TRUNC(sysdate+12) from dual UNION
SELECT TRUNC(sysdate+14) from dual
)
SELECT
s.student_name
,d.appointment_date
FROM
--get a list of students each with a sequential row number, ordered by student name
(SELECT
student_name
,ROW_NUMBER() OVER (ORDER BY student_name) rn
FROM students
) s
--get a list of available dates with a sequential row number, ordered by date
,(SELECT
appointment_date
,ROW_NUMBER() OVER (ORDER BY appointment_date) rn
FROM dates
) d
WHERE 1=1
--allocate the first four students to date rownumber1, next four students to date rownumber 2...
AND CEIL(s.rn/4) = d.rn
;
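The matching rule in that last query, CEIL(s.rn/4) = d.rn, is easy to try out on its own. Below is a minimal Python sketch of the same allocation; `allocate`, the student names and the dates are all made up for illustration. Like the inner join in the SQL, students for whom no date is left are simply dropped.

```python
from datetime import date
from math import ceil


def allocate(students, dates, per_date=4):
    """Give the first four students the first date, the next four the second, and so on."""
    ordered_dates = sorted(dates)
    schedule = []
    for rn, student in enumerate(sorted(students), start=1):
        slot = ceil(rn / per_date)          # 1 for rows 1-4, 2 for rows 5-8, ...
        if slot <= len(ordered_dates):      # no date left: the row is dropped
            schedule.append((student, ordered_dates[slot - 1]))
    return schedule


students = ["s%02d" % i for i in range(1, 10)]     # nine hypothetical students
dates = [date(2016, 6, 6), date(2016, 6, 13)]      # two Mondays
schedule = allocate(students, dates)
for student, day in schedule:
    print(student, day)
```

With nine students and two dates, eight students are scheduled and the ninth is left out, so the list of dates needs to be long enough for the number of students.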

Postgres: longest streak per developer regardless of Saturdays and Sundays

I got the information I needed from my last post, Postgres: Defining the longest streak (in days) per developer.
However, now I want to know the longest streak per developer regardless of Saturdays and Sundays. For instance, Bob worked on Thursday 18, Friday 19, Monday 22 and Tuesday 23, hence Bob's streak is 4 days.
I understand I can use the DOW field (via EXTRACT), which gives me 0 for Sunday, 1 for Monday and so on, but I don't see how I can apply it in the last solution proposed by Gordon Linoff.
Can some of you help me with this? Cheers,
WITH
working_limits AS (
SELECT
MIN(mr_date) AS start_date,
MAX(mr_date) AS end_date
FROM
xxx
),
working_days AS (
SELECT
ROW_NUMBER() OVER (ORDER BY s.d) AS day_number,
s.d::date AS date
FROM
GENERATE_SERIES((SELECT start_date FROM working_limits),
(SELECT end_date FROM working_limits),
'1 day') AS s(d)
WHERE
EXTRACT(dow FROM s.d) BETWEEN 1 AND 5),
worked_days AS (
SELECT
ROW_NUMBER() OVER (ORDER BY developer, mr_date) AS day_number,
developer,
mr_date AS date
FROM
xxx
ORDER BY
developer,
mr_date
)
SELECT
y.developer,
MAX(y.days)
FROM (
SELECT
x.developer,
COUNT(*) AS days
FROM (
SELECT
wngd.date,
wd.developer,
wngd.day_number - wd.day_number AS delta
FROM
working_days wngd INNER JOIN worked_days wd
ON
wngd.date = wd.date) AS x
GROUP BY
x.developer,
x.delta) AS y
GROUP BY
y.developer;
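The query relies on a numbering trick: number every weekday in the period, number the developer's worked days in date order, and subtract. Inside an unbroken run of weekdays the difference stays constant, so counting rows per difference gives the streak lengths. Here is a rough Python sketch of that idea for a single developer; `longest_weekday_streak` is a hypothetical name, and January 2018 is picked only because the 18th falls on a Thursday, as in Bob's example.

```python
from datetime import date, timedelta


def longest_weekday_streak(worked_dates):
    """Longest run of consecutive weekdays worked, treating Friday -> Monday as adjacent."""
    days = sorted({d for d in worked_dates if d.weekday() < 5})  # Mon=0 .. Fri=4
    if not days:
        return 0
    # Number every weekday between the first and last worked day (the working_days CTE).
    weekday_number = {}
    counter = 0
    current = days[0]
    while current <= days[-1]:
        if current.weekday() < 5:
            counter += 1
            weekday_number[current] = counter
        current += timedelta(days=1)
    # Worked days are numbered 1..n in date order; equal deltas belong to one streak.
    streaks = {}
    for worked_number, day in enumerate(days, start=1):
        delta = weekday_number[day] - worked_number
        streaks[delta] = streaks.get(delta, 0) + 1
    return max(streaks.values())


# Thu 18, Fri 19, Mon 22, Tue 23, then a skipped Wednesday and Thu 25.
bob = [date(2018, 1, d) for d in (18, 19, 22, 23, 25)]
print(longest_weekday_streak(bob))  # 4
```

The sketch removes duplicate dates before numbering. If the table can hold several rows per developer and day, the SQL would need a similar deduplication step for the deltas to line up.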

Get yesterday's rows starting from 10AM

For the previous day I use the expression below.
DATE_INSERTED >=DATEADD(day, DATEDIFF(day,0,GETDATE())-1,0)
AND DATE_INSERTED < DATEADD(day, DATEDIFF(day,0,GETDATE()),0)
How do I get rows from yesterday 10 AM to today 10 AM?
-- yesterday at midnight:
DECLARE @yesterday DATETIME = DATEADD(DAY,DATEDIFF(DAY,1,GETDATE()),0);
SELECT
...
WHERE DATE_INSERTED >= DATEADD(HOUR, 10, @yesterday) -- 10 AM yesterday
AND DATE_INSERTED < DATEADD(HOUR, 34, @yesterday); -- 10 AM today
Instead of using zeroes, use some date(time)s that have the desirable properties:
DATE_INSERTED >=
DATEADD(day, DATEDIFF(day,'20010102',GETDATE()),'2001-01-01T10:00:00')
AND DATE_INSERTED <
DATEADD(day, DATEDIFF(day,'20010102',GETDATE()),'2001-01-02T10:00:00')
I.e. if you add the total number of days that have occurred since 2nd January 2001 onto 10:00am on the 1st January 2001, you'll always obtain a value which is "yesterday at 10am". The second one is almost identical.
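The anchor-date arithmetic can be verified with a few lines of Python. This is only a sketch of the same calculation, assuming DATEDIFF(day, ...) counts calendar-day boundaries (so it is modelled here as a difference of dates); `window_10am` is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def window_10am(now):
    """Return (yesterday 10:00, today 10:00) using the 2001-01-02 anchor trick."""
    # Whole calendar days between the anchor date and today, ignoring time of day.
    whole_days = (now.date() - datetime(2001, 1, 2).date()).days
    start = datetime(2001, 1, 1, 10) + timedelta(days=whole_days)
    end = datetime(2001, 1, 2, 10) + timedelta(days=whole_days)
    return start, end


start, end = window_10am(datetime(2016, 8, 23, 15, 30))
print(start)  # 2016-08-22 10:00:00
print(end)    # 2016-08-23 10:00:00
```

Because the day count is measured from 2nd January but added to 1st January, the result always lands one day behind today, which is why the first bound is "yesterday at 10am".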