I have daily log data stored in a Postgres database structured with an id and date. Users can, obviously, have multiple rows in the database if they log in multiple times.
To visualize:
| id | timestamp |
|------|---------------------|
| 0099 | 2004-10-19 10:23:54 |
| 1029 | 2004-10-01 10:23:54 |
| 2353 | 2004-10-20 8:23:54 |
Let's say MAU ("monthly active users") is defined as the number of unique ids that log in for a given calendar month. I would like to get the rolling sum of MAU for each day in a month, i.e. MAU at different points in time as it grows. For example, if we were looking at October 2014:
| date | MAU |
|------------|-------|
| 2014-10-01 | 10000 |
| 2014-10-02 | 12948 |
| 2014-10-03 | 13465 |
And so forth until the end of the month. I've heard that window functions might be one way to solve this. Any ideas how to utilize that to get the rolling MAU sum?
After reading the documentation for Postgres window functions, here's one solution that gets the rolling MAU sum for the current month:
-- First, get the id and day of each login within the current month
WITH raw_data AS (
    SELECT id, date_trunc('day', timestamp) AS timestamp
    FROM user_logs
    WHERE date_trunc('month', timestamp) = date_trunc('month', current_timestamp)
),
-- Since we only want to count each user's earliest login in the month,
-- aggregate with MIN()
month_data AS (
    SELECT id, MIN(timestamp) AS timestamp_day
    FROM raw_data
    GROUP BY id
)
-- Postgres doesn't support DISTINCT for window functions, so query
-- from the rolling count and collapse to one row per day
SELECT timestamp_day AS date, MAX(count) AS MAU
FROM (SELECT timestamp_day, COUNT(id) OVER (ORDER BY timestamp_day) FROM month_data) foo
GROUP BY timestamp_day;
For a given month, you can calculate this by adding in a user on the first day during the month when they are seen:
select date_trunc('day', mints), count(*) as usersOnDay,
       sum(count(*)) over (order by date_trunc('day', mints)) as cume_users
from (select id, min(timestamp) as mints
      from log
      where timestamp >= '2004-10-01'::date and timestamp < '2004-11-01'::date
      group by id
     ) l
group by date_trunc('day', mints);
Note: This answers your question for one month. It can be extended to multiple calendar months by counting each user on their first login day within the month and then accumulating the daily increments.
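For instance, here is a minimal sketch of that extension, reusing the log table from the query above; the PARTITION BY makes the running count start over on the first of each month:
select date_trunc('day', mints) as day,
       count(*) as users_on_day,
       sum(count(*)) over (partition by mon
                           order by date_trunc('day', mints)) as cume_users
from (select id,
             date_trunc('month', timestamp) as mon,
             min(timestamp) as mints
      from log
      group by id, date_trunc('month', timestamp)
     ) l
group by mon, date_trunc('day', mints);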
If you have a question where the cumulative period passes month boundaries, then ask another question and explain what a month means under those circumstances.
Related
I am facing an issue extracting the month and year from the InvoiceDate column (character varying type) in Postgres. The solution is relatively easy with the DATEFROMPARTS function, as in the code below, but that function comes from SQL Server and is not available in Postgres. How can I get the same result with DATE_PART in Postgres, while at the same time changing the data type of the "InvoiceDate" column to date?
Select
CustomerID,
min(InvoiceDate) first_purchase_date,
DATEFROMPARTS(year(min(InvoiceDate)), month(min(InvoiceDate)), 1) Cohort_Date
into #cohort
from #online_retail_main
group by CustomerID
The output:
Customer ID| first_purchase_date |Cohort_Date|
-----------+-------------------------+-----------+
12345 | 2010-12-20 15:47:00.00 | 2010-12-01|
I am trying to make a date consisting of the year and month, with the day set to 1 for all rows.
Assuming a valid Postgres timestamp:
select date_trunc('month', '2010-12-20 15:47:00.00'::timestamp)::date;
date_trunc
------------
12/01/2010
--or ISO format
set datestyle = 'ISO,MDY';
select date_trunc('month', '2010-12-20 15:47:00.00'::timestamp)::date;
date_trunc
------------
2010-12-01
date_trunc truncates the timestamp to the month, which means the first of the month; the ::date cast then converts the result to the date type. DateStyle only affects how the value is presented to the user; the value is not stored in any particular format.
To do something similar to what you did in SQL Server:
select make_date(extract(year from '2010-12-20 15:47:00.00'::timestamp)::integer,
                 extract(month from '2010-12-20 15:47:00.00'::timestamp)::integer,
                 1);
This uses make_date (see the Postgres Date/Time Functions documentation) and extract to build a date.
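And a rough sketch of how the whole cohort query from the question might look in Postgres; the table name is assumed (Postgres has no # temporary tables), and the InvoiceDate strings must be in a format the ::timestamp cast can parse:
select CustomerID,
       min(InvoiceDate::timestamp) as first_purchase_date,
       date_trunc('month', min(InvoiceDate::timestamp))::date as Cohort_Date
from online_retail_main  -- assumed table name
group by CustomerID;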
First time poster here, not sure if my title really outlines what I am looking for here...
I am trying to get the following:
"Which month of the year does each property type earn the most money on average?"
I have two tables with the following fields I am working with:
calendar_metric
  period (this is a date, formatted 'yyyy-mm-dd')
  revenue
  airbnb_property_id
property
  property_type
  airbnb_property_id
I have figured out how to get the month, property type, and average revenue to display, but I think I am having trouble with grouping it correctly.
select
extract(month from calendar_metric.period) as month,
property.property_type,
avg(calendar_metric.revenue) as average_revenue
from
calendar_metric
inner join property on
calendar_metric.airbnb_property_id = property.airbnb_property_id
group by
month,
property_type
What I want it to output would look like this:
month | property_type | max_average_revenue
---------------------------------------------
1 | place | 123
2 | floor apt | 535
3 | hostel | 666
4 | b&b | 363
5 | boat | 777
etc| etc | etc
but currently I am getting this:
month | property_type | max_average_revenue
---------------------------------------------
1 | place | 123
2 | floor apt | 535
1 | place | 444
4 | b&b | 363
4 | b&b | 777
etc| etc | etc
So essentially, months are coming back duplicated because I extracted the month from a timestamp and the data set spans about 5 years, so I am probably not grouping right. I know I am probably missing something simple; I just cannot seem to figure out how to do this correctly.
Help!
You should be grouping by year and month, since you are looking at a 5-year period.
select
extract(month from calendar_metric.period) as month,
property.property_type,
avg(calendar_metric.revenue) as average_revenue
from
calendar_metric
inner join property on
calendar_metric.airbnb_property_id = property.airbnb_property_id
group by
extract(year from period),
month,
property_type
I think your query is basically there; it's just returning all months rather than filtering out the rows you don't want. I'd tend to use the DISTINCT ON clause for this sort of thing, something like:
SELECT DISTINCT ON (property_type)
p.property_type, extract(month from cm.period) AS month,
avg(cm.revenue) AS revenue
FROM calendar_metric AS cm
JOIN property AS p USING (airbnb_property_id)
GROUP BY property_type, month
ORDER BY property_type, revenue DESC;
I've shortened your query down a bit, hope it still makes sense to you.
Using CTEs you can express this in two steps, which might make it easier to follow what's going on:
WITH results AS (
SELECT p.property_type, extract(month from cm.period) AS month,
avg(cm.revenue) AS revenue
FROM calendar_metric AS cm
JOIN property AS p USING (airbnb_property_id)
GROUP BY property_type, month
)
SELECT DISTINCT ON (property_type)
property_type, month, revenue
FROM results
ORDER BY property_type, revenue DESC;
Difficult question to title, but I am trying to replicate what social media or notification feeds do where they batch recent events so they can display “sequences” of actions. For example, if these are "like" records, in reverse chronological order:
like_id | user_id | like_timestamp
--------------------------------
1 | bob | 12:30:00
2 | bob | 12:29:00
3 | jane | 12:27:00
4 | bob | 12:26:00
5 | jane | 12:24:00
6 | jane | 12:23:00
7 | scott | 12:22:00
8 | bob | 12:20:00
9 | alice | 12:19:00
10 | scott | 12:18:00
I would like to group them such that I get the last 3 "bursts" of user likes, grouped (partitioned?) by user. If the "burst" rule is that likes less than 5 minutes apart belong to the same burst, then we would get:
user_id | num_likes | burst_start | burst_end
----------------------------------------------
bob | 3 | 12:26:00 | 12:30:00
jane | 3 | 12:23:00 | 12:27:00
scott | 2 | 12:18:00 | 12:22:00
alice's like does not get counted because it's part of the 4th most recent batch, and like 8 does not get added to bob's tally because it is 6 minutes before the next one.
I've tried keeping track of bursts with postgres' lag function, which lets me mark start and end events, but since like events can be staggered, I have no way of tying a like back to its "originator" (for example, tying id 4 back to 2).
Is this grouping possible? If so, is it possible to keep track of the start and end timestamp of each burst?
step-by-step demo: db<>fiddle
WITH group_ids AS ( -- 1
    SELECT DISTINCT
        user_id,
        first_value(like_id) OVER (PARTITION BY user_id ORDER BY like_id) AS group_id
    FROM
        likes
    ORDER BY group_id   -- keep the three most recently active users
    LIMIT 3
)
SELECT
    user_id,
    COUNT(*) AS num_likes,
    burst_start,
    burst_end
FROM (
    SELECT
        user_id,
        -- 4
        first_value(like_timestamp) OVER (PARTITION BY group_id ORDER BY like_id) AS burst_end,
        first_value(like_timestamp) OVER (PARTITION BY group_id ORDER BY like_id DESC) AS burst_start
    FROM (
        SELECT
            l.*, gi.group_id,
            -- 2
            lag(like_timestamp) OVER (PARTITION BY group_id ORDER BY like_id) - like_timestamp AS diff
        FROM
            likes l
        JOIN
            group_ids gi ON l.user_id = gi.user_id
    ) s
    WHERE diff IS NULL OR diff <= '00:05:00' -- 3
) s
GROUP BY user_id, burst_start, burst_end -- 5
1. The CTE creates an ordered group_id per user_id. The first user (here the most recent one, bob) gets the lowest group_id, the second user (jane) the second lowest, and so on. This makes it possible to work with all likes of a given user within one partition; you cannot simply order by user_id, which would bring alice to the top. The ORDER BY group_id LIMIT 3 restricts the whole query to the three most recent users.
2. After joining each user's calculated group_id, the time differences are computed with the lag() window function, which returns the previous row's value, so it gives the difference between the current timestamp and the previous one. This happens only within each user's group.
3. The likes that are too far away (more than 5 minutes from the previous one) are then removed using the calculated diff.
4. The latest and earliest timestamps are found with the first_value() window function (in ascending and descending order); these are your burst_end and burst_start.
5. Finally, group by user and count the records.
I have a table of companies where each company has its own time zone.
for example -
company 1 has time zone UTC+10 and
company 2 - UTC+2
The companies table has a time_zone field that stores the zone name, like America/Los_Angeles (I can add an additional field to store the offset from UTC if needed).
There is also a requests table with a start_date field stored as TIMESTAMP without time zone (UTC).
for example -
id | company_id | start_date (utc-0)
------------------------------------
1 | 1 | 21-03-16 02-00 // added for company `21-03-16 12-00`
2 | 2 | 21-03-16 23-00 // added for company `22-03-16 01-00`
3 | 1 | 20-03-16 13-00 // added for company `20-03-16 23-00`
4 | 1 | 21-03-16 23-00 // added for company `22-03-16 09-00
I want to select the records that started from 21-03-16 00:00 to 21-03-16 23:59 in each company's time zone.
But if I use:
select * from request where start_date between '2016-03-21 00:00:00.000000' AND '2016-03-21 23:59:59.999999'
I get requests where id = 2 and 4.
But those requests were actually added on 22-03-16 in each company's local time.
Any suggestions on how I can handle this with a single select? Many thanks.
I'm not sure if I understand your question right, you might need to clarify.
But I'd say that joining with companies where the time zone information is stored should solve the problem:
SELECT r.*
FROM request r
JOIN companies c ON c.id = r.company_id
WHERE r.start_date BETWEEN '2016-03-21 00:00:00.000000'
AT TIME ZONE c.time_zone
AND '2016-03-21 23:59:59.999999'
AT TIME ZONE c.time_zone;
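Another way to express the same filter, if it is easier to reason about, is to convert each stored UTC timestamp to the company's local date and compare that directly; this sketch assumes start_date really is stored as UTC:
SELECT r.*
FROM request r
JOIN companies c ON c.id = r.company_id
WHERE (r.start_date AT TIME ZONE 'UTC' AT TIME ZONE c.time_zone)::date
      = DATE '2016-03-21';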
Here's my use case:
We have an analytics-like tool that counts the number of users per hour on our system. Now the business would like to have the number of unique users. As our number of users is very small, we will do that using:
SELECT count(*)
FROM (
SELECT DISTINCT user_id
FROM unique_users
WHERE date BETWEEN x and y
) distinct_users
i.e. we will store the pair (user_id, date) and count unique users using DISTINCT (user_id is not a foreign key, as users are not logged in; it's just a unique identifier generated by the system, some kind of UUIDv4).
This works great in terms of performance for our volume of data.
Now the problem is importing legacy data into it.
I would like to know the SQL query to transform
date | number_of_users
12:00 | 2
13:00 | 4
into
date | user_id
12:00 | 1
12:00 | 2
13:00 | 1
13:00 | 2
13:00 | 3
13:00 | 4
(as long as the "count but not unique" returns the same number as before, we're fine if the "unique users count" is a bit off)
Of course, I could write a Python script, but I was wondering if there is a SQL trick to do that, using generate_series or something related.
generate_series() is indeed the way to go:
with data (date, number_of_users) as (
values
('12:00',2),
('13:00',4)
)
select d.date, i.n
from data d
cross join lateral generate_series(1, d.number_of_users) i (n)
order by d.date, i.n ;
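If it helps, the same expression can drive the actual import; the table and column names below are only placeholders based on the examples above, and the ::text cast is just an example, so adjust it to whatever type your user_id column really has:
-- hypothetical import: legacy_counts holds the (date, number_of_users) rows,
-- unique_users is the target table from the question
insert into unique_users (date, user_id)
select d.date, i.n::text
from legacy_counts d
cross join lateral generate_series(1, d.number_of_users) i (n);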