Monthly Counting PostgreSQL giving months - postgresql

I have a table for customers like this
cust_id | date_signed_up | location_id
-----------------------------------------
1 | 2019/01/01 | 1
2 | 2019/03/05 | 1
3 | 2019/06/17 | 1
What I need is a monthly count that includes every month, even when the count is 0. For example:
monthly_count | count
-------------------------
Jan | 1
Feb | 0
Mar | 1
Apr | 0
(months can be in numbers)
Right now I made this query:
SELECT date_trunc('MONTH', (date_signed_up::date)) AS monthly, count(customer_id) AS count FROM customer
WHERE group_id = 1
GROUP BY monthly
ORDER BY monthly asc
but it only returns the months that have data and skips the ones where the count is zero. How can I get all the months, whether or not they have data?

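You need a list of months (see "How to generate Month list in PostgreSQL?"). Then LEFT JOIN your table to that list so that months without rows still appear with a count of 0:
SELECT a.month, count(y.cust_id)
FROM allMonths a
LEFT JOIN yourTable y
ON a.month = date_trunc('MONTH', y.date_signed_up::date)
GROUP BY a.month
ORDER BY a.month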
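As a minimal, runnable sketch of the month-list idea, the snippet below uses Python's built-in sqlite3 module so it can be tried without a PostgreSQL server. The table name `customer`, the fixed January to June 2019 range and the `all_months` CTE are assumptions made for illustration; in PostgreSQL the month list would more naturally come from generate_series(). Note that the location filter sits in the ON clause, because putting it in WHERE would discard the empty months again.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cust_id INTEGER, date_signed_up TEXT, location_id INTEGER);
INSERT INTO customer VALUES
    (1, '2019-01-01', 1),
    (2, '2019-03-05', 1),
    (3, '2019-06-17', 1);
""")

# Hypothetical month list built with a recursive CTE (SQLite dialect).
# A PostgreSQL equivalent would be:
#   generate_series(date '2019-01-01', date '2019-06-01', interval '1 month')
query = """
WITH RECURSIVE all_months(month) AS (
    SELECT '2019-01-01'
    UNION ALL
    SELECT date(month, '+1 month') FROM all_months WHERE month < '2019-06-01'
)
SELECT strftime('%m', a.month) AS month_no,
       count(c.cust_id)        AS signups      -- counts only matched rows
FROM all_months a
LEFT JOIN customer c
       ON date(c.date_signed_up, 'start of month') = a.month
      AND c.location_id = 1                    -- filter in ON, not WHERE
GROUP BY a.month
ORDER BY a.month
"""
monthly_counts = conn.execute(query).fetchall()
for month_no, signups in monthly_counts:
    print(month_no, signups)
```

count(c.cust_id) is used rather than count(*) because the unmatched months still produce one row each, with NULL in the customer columns.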

Related

Need help to merge overlapping time intervals

I need some help merging overlapping time intervals when the gap between them is no more than 4 minutes (in this example, only for id = 1).
I have the following table:
--------------------------------------
id | action | date
--------------------------------------
1 | started | 2020-08-18 13:51:02
1 | suspended | 2020-08-18 13:51:04
2 | started | 2020-08-18 13:52:14
2 | suspended | 2020-08-18 13:52:17
3 | started | 2020-08-18 13:52:21
3 | suspended | 2020-08-18 13:52:24
1 | started | 2020-08-18 13:57:21
1 | suspended | 2020-08-18 13:57:22
1 | started | 2020-08-18 15:07:56
1 | suspended | 2020-08-18 15:08:56
1 | started | 2020-08-18 15:09:11
1 | suspended | 2020-08-18 15:09:11
1 | started | 2020-08-18 15:09:11
1 | suspended | 2020-08-18 15:09:13
Expected result:
--------------------------------------
id | action | date
--------------------------------------
1 | started | 2020-08-18 13:51:02
1 | suspended | 2020-08-18 13:51:04
1 | started | 2020-08-18 13:57:21
1 | suspended | 2020-08-18 13:57:22
1 | started | 2020-08-18 15:07:56
1 | suspended | 2020-08-18 15:09:13
How can this be done? I would be very grateful for your help!
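You want to eliminate suspended/started pairs that are for the same id and within 4 minutes of each other. Use lag() and lead() to look at the neighbouring rows, and keep a row unless it is a 'started' that follows another row within 4 minutes or a 'suspended' that is followed by another row within 4 minutes:
select t.*
from (select t.*,
             lag(date) over (partition by id order by date) as prev_date,
             lead(date) over (partition by id order by date) as next_date
      from t
     ) t
where not (action = 'started' and
           prev_date is not null and
           prev_date > date - interval '4 minute'
          ) and
      not (action = 'suspended' and
           next_date is not null and
           next_date < date + interval '4 minute'
          );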
Date/time functions are notoriously database dependent. This is just adding or subtracting 4 minutes, which any database can do but the syntax might vary.
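To make the lag()/lead() filter concrete, here is a hedged sketch that runs the question's sample data through Python's built-in sqlite3 module (window functions need SQLite 3.25 or later). The table name `events` and the `seq` column are assumptions added for illustration; `seq` only gives rows with identical timestamps a stable order. The date arithmetic uses SQLite's datetime() modifiers, which is where the syntax would differ in PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (seq INTEGER PRIMARY KEY, id INTEGER, action TEXT, date TEXT)"
)
sample = [
    (1, "started", "2020-08-18 13:51:02"), (1, "suspended", "2020-08-18 13:51:04"),
    (2, "started", "2020-08-18 13:52:14"), (2, "suspended", "2020-08-18 13:52:17"),
    (3, "started", "2020-08-18 13:52:21"), (3, "suspended", "2020-08-18 13:52:24"),
    (1, "started", "2020-08-18 13:57:21"), (1, "suspended", "2020-08-18 13:57:22"),
    (1, "started", "2020-08-18 15:07:56"), (1, "suspended", "2020-08-18 15:08:56"),
    (1, "started", "2020-08-18 15:09:11"), (1, "suspended", "2020-08-18 15:09:11"),
    (1, "started", "2020-08-18 15:09:11"), (1, "suspended", "2020-08-18 15:09:13"),
]
conn.executemany("INSERT INTO events (id, action, date) VALUES (?, ?, ?)", sample)

query = """
SELECT id, action, date
FROM (
    SELECT e.*,
           lag(date)  OVER (PARTITION BY id ORDER BY date, seq) AS prev_date,
           lead(date) OVER (PARTITION BY id ORDER BY date, seq) AS next_date
    FROM events e
    WHERE id = 1
) AS w
-- Drop a 'started' row that follows any row within 4 minutes, and a
-- 'suspended' row that is followed by any row within 4 minutes.
WHERE NOT (action = 'started'   AND prev_date IS NOT NULL
           AND prev_date > datetime(date, '-4 minutes'))
  AND NOT (action = 'suspended' AND next_date IS NOT NULL
           AND next_date < datetime(date, '+4 minutes'))
ORDER BY date, seq
"""
merged = conn.execute(query).fetchall()
for row in merged:
    print(row)
```

The explicit IS NOT NULL checks matter: the first and last row of each id have no neighbour, and without the checks the comparison would evaluate to NULL and the row would be filtered out.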
You're wanting to filter out certain rows, what is common with the rows you are removing?
It seems you want the first 'started' and last 'suspended' rows. Can you just ignore 'started' rows if there is another 'started' row in the previous 4 minutes, and ignore 'suspended' rows if there is another 'suspended' row in the next 4 minutes?
select a.*
from my_table a
where action = 'started' and not exists (
select 1 from my_table b
where b.id = a.id and b.action = 'started'
and b.date < a.date -- only look at earlier rows, not the row itself
and datediff(minute, b.date, a.date) <= 4 -- row exists in the previous 4 min (SQL Server syntax)
)
Ditto for 'suspended' but the other way. This doesn't work if the difference between the last 'started' and 'suspended' is more than 4 minutes, but that can be overcome with another condition that checks for no 'started' row within the last 4 minutes.
If you need the overlapping intervals whose total duration is no more than 4 minutes, you can use this query:
--cte where creating groups with time intervals
with base_cte as
(
select Tab.id,Tab.NumGr,Tab.date,
Tab.action from
(
select * from
(
--selecting only values where time difference <= 4 min
select *,sum(TimeDiff)over(partition by id,NumGr order by date rows unbounded preceding)SumTimeInterval from
(
--creating a group
select sum(Num)over(partition by id order by date rows unbounded preceding )NumGr, * from
(
select date,lead(date)over(partition by id order by date)lead_date,id,action,
lead(action)over(partition by id order by date)lead_action,
--split intervals between overlaps (240seconds)
iif(TimeDiff>240,1,0)Num,TimeDiff from
(
--find time difference in seconds between current and next date (partition by id)
select datediff(second,date,LEAD(date)over(partition by id order by date))TimeDiff,* from Table
)A
)B
)C
--selecting only pairs within time intervals
where TimeDiff<=240
--checking duration interval:<=4 min
)D where SumTimeInterval<=240
)E
CROSS JOIN LATERAL
(values (id,NumGr,date,action),
(id,NumGr,lead_date,lead_action)
)Tab(id,NumGr,date,action)
)
--selecting the rows at the start/end of each overlapping time interval
select id,date,action from base_cte base
where date
in (select max(date) from base_cte where NumGr=base.NumGr)
or date in
(select min(date) from base_cte where NumGr=base.NumGr)

How can I find the status in each month using a start and end date?

[ Title was: "Find out the facts: How to find the month wise active members in an healthcare organization per each year and also find the growth percentage" ]
I have 5 years of history data and would like to do some analytics on it. The data contains both active and inactive members. The ask is to find the active members for each month of each year.
What I am doing now is extracting the month and year from the effective date and grouping by month and year, filtered on the active status, i.e. Status = 'Active'.
But this way I lose the history records.
For example, suppose a person had a membership from 01-01-2015 to 31-12-2016. This member is shown as inactive now, but the same person was an active member during that period. So if I filter on the status, I lose these old records.
I need to go to a given month, say Jan 2015, and check everyone who was active at that time. So I thought of doing it another way.
I extracted the month of the expiry date and checked whether exp_month is equal to or greater than the month of the effective date, as shown below. Here I am not relying on the incoming source field containing the member status; instead I create a field with logic to identify the status of the member during the period in question. This is just to identify the active members for each month of each year, but I am not sure it gives me the right answer. Please suggest a better approach.
SELECT extract(YEAR FROM member_effective_date) AS year
, extract(MONTH FROM member_expiry_date) AS month
, CASE WHEN extract(MONTH FROM member_expiry_date)
= extract(MONTH FROM member_effective_date)
OR extract(MONTH FROM member_expiry_date)
> extract(MONTH FROM member_effective_date)
THEN 'Yes'
ELSE 'No' END AS active_status
FROM table_name
You need to use a cross join with a table of dates to get the status in each period. The cross join "inflates" the status table so you can evaluate the status for each period.
Here is an example:
CREATE TEMP TABLE table_name AS
SELECT 'member1' AS member
, '2020-01-01'::DATE AS member_effective_date
, '2020-04-27'::DATE AS member_expiry_date
;
WITH month_list
-- Month start and end for previous 12 months
AS (SELECT DATE_TRUNC('month',dt) AS month_start
, MAX(dt) AS month_end
FROM
-- List of the previous 365 dates
(SELECT DATE_TRUNC('day',SYSDATE) - (n * INTERVAL '1 day') AS dt
FROM
-- List of numbers from 1 to 365
(SELECT ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 365) )
GROUP BY month_start
)
SELECT extract(YEAR FROM b.month_start) AS year
, extract(MONTH FROM b.month_start) AS month
, CASE WHEN -- Effective before the month ended and expired after it started
(a.member_effective_date <= b.month_end
AND a.member_expiry_date > b.month_start)
THEN 'Yes'
ELSE 'No' END AS active
FROM table_name a
CROSS JOIN month_list b -- Explicit cartesian product
ORDER BY 1,2
;
| year | month | active|
|------|-------|-------|
| 2019 | 8 | No |
| 2019 | 9 | No |
| 2019 | 10 | No |
| 2019 | 11 | No |
| 2019 | 12 | No |
| 2020 | 1 | Yes |
| 2020 | 2 | Yes |
| 2020 | 3 | Yes |
| 2020 | 4 | Yes |
| 2020 | 5 | No |
| 2020 | 6 | No |
| 2020 | 7 | No |
| 2020 | 8 | No |
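The example above relies on Redshift-specific pieces (SYSDATE, stl_scan). As a self-contained sketch of the same cross-join idea, the snippet below uses Python's built-in sqlite3 module with a fixed, hypothetical month range (November 2019 to June 2020) instead of "the previous 12 months"; the table name `members` and the recursive `month_list` CTE are assumptions for illustration. A member counts as active in a month when the membership starts before the next month begins and expires after the month starts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE members (member TEXT, member_effective_date TEXT, member_expiry_date TEXT);
INSERT INTO members VALUES ('member1', '2020-01-01', '2020-04-27');
""")

query = """
WITH RECURSIVE month_list(month_start) AS (
    SELECT '2019-11-01'                       -- assumed reporting window
    UNION ALL
    SELECT date(month_start, '+1 month')
    FROM month_list
    WHERE month_start < '2020-06-01'
)
SELECT strftime('%Y-%m', b.month_start) AS month,
       CASE WHEN a.member_effective_date < date(b.month_start, '+1 month')
             AND a.member_expiry_date   > b.month_start
            THEN 'Yes' ELSE 'No' END AS active
FROM members a
CROSS JOIN month_list b                       -- one row per member per month
ORDER BY b.month_start
"""
status_by_month = conn.execute(query).fetchall()
for month, active in status_by_month:
    print(month, active)
```

To get a head count per month instead of a Yes/No flag per member, the same join can be grouped by month_start and the CASE expression summed as 1/0.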

Count the number of consecutive entries fulfilling a condition within a GROUP BY

I've got a list of users who are behind on their bills, and I want to generate an entry for each of them that says how many consecutive bills they've been behind on. So here's the table:
user | bill_date | outstanding_balance
---------------------------------------
a | 2017-03-01 | 90
a | 2016-12-01 | 60
a | 2016-09-01 | 30
b | 2017-03-01 | 50
b | 2016-12-01 | 0
b | 2016-09-01 | 40
c | 2017-03-01 | 0
c | 2016-12-01 | 0
c | 2016-09-01 | 1
And I want a query that would generate the following table:
user | consecutive_billing_periods_behind
-----------------------------------------
a | 3
b | 1
c | 0
In other words, if you've paid up at any point, I want to ignore all of the earlier entries and only count how many billing periods you've been behind since you were last paid up. How do I do this most simply?
If I understood the question correctly, you first need to find the last date on which each customer had paid their bill, i.e. the last date their outstanding balance was 0. You can get the candidate dates with this subquery:
(SELECT
user1,
bill_date AS no_outstanding_bill_date
FROM table1
WHERE outstanding_balance = 0)
Then you need to get the last bill date and create a flag on each row showing whether that bill is outstanding. Then keep only the rows between the last clear day and the last bill date of each customer with this WHERE clause:
WHERE bill_date >= last_clear_day AND bill_date <= last_bill_date
Putting the pieces together gives this query:
SELECT
DISTINCT
user1,
sum(is_outstanding_bill)
OVER (
PARTITION BY user1 ) AS consecutive_billing_periods_behind
FROM (
SELECT
user1,
last_value(bill_date)
OVER (
PARTITION BY user1
ORDER BY bill_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS last_bill_date,
CASE WHEN outstanding_balance > 0
THEN 1
ELSE 0 END AS is_outstanding_bill,
bill_date,
outstanding_balance,
nvl(max(t2.no_outstanding_bill_date)
OVER (
PARTITION BY user1 ), min(bill_date)
OVER (
PARTITION BY user1 )) AS last_clear_day
FROM table1 t1
LEFT JOIN (SELECT
user1,
bill_date AS no_outstanding_bill_date
FROM table1
WHERE outstanding_balance = 0) t2 USING (user1)
) table2
WHERE bill_date >= last_clear_day AND bill_date <= last_bill_date
Since we are using distinct you will not need the group by clause.
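select
    user,
    count(case when min_balance > 0 then 1 end)
        as consecutive_billing_periods_behind
from
(
    select
        user,
        -- running minimum from the newest bill backwards:
        -- it drops to 0 at the most recent paid-up bill and stays there
        min(outstanding_balance)
            over (partition by user order by bill_date desc) as min_balance
    from tbl
) t
group by user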
Or:
select
    user,
    count(*) as consecutive_billing_periods_behind
from
(
    select
        user,
        bill_date,
        max(case when outstanding_balance = 0 then bill_date end) over
            (partition by user)
            as max_bill_date_with_zero_balance
    from tbl
) t
where
    -- If the user has no row with outstanding_balance = 0,
    -- count all rows.
    max_bill_date_with_zero_balance is null
    or
    -- Otherwise count only the rows after the last zero balance.
    bill_date > max_bill_date_with_zero_balance
group by user
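As a quick check of the running-minimum approach against the question's sample data, here is a sketch using Python's built-in sqlite3 module (SQLite 3.25 or later for window functions). The table name `bills` and the column name `user_id` are assumptions chosen to avoid the reserved word `user`. The sketch also assumes balances are never negative, so a running minimum of 0 means "paid up at or after this bill".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bills (user_id TEXT, bill_date TEXT, outstanding_balance INTEGER);
INSERT INTO bills VALUES
    ('a', '2017-03-01', 90), ('a', '2016-12-01', 60), ('a', '2016-09-01', 30),
    ('b', '2017-03-01', 50), ('b', '2016-12-01', 0),  ('b', '2016-09-01', 40),
    ('c', '2017-03-01', 0),  ('c', '2016-12-01', 0),  ('c', '2016-09-01', 1);
""")

query = """
SELECT user_id,
       count(CASE WHEN min_balance > 0 THEN 1 END) AS consecutive_billing_periods_behind
FROM (
    SELECT user_id,
           -- Walk from the newest bill to the oldest; once a zero balance
           -- is seen, the running minimum stays at 0 for all older bills.
           min(outstanding_balance)
               OVER (PARTITION BY user_id ORDER BY bill_date DESC) AS min_balance
    FROM bills
) AS t
GROUP BY user_id
ORDER BY user_id
"""
periods_behind = conn.execute(query).fetchall()
for user_id, n in periods_behind:
    print(user_id, n)
```

Because the count is a conditional aggregate rather than a WHERE filter, users who are currently paid up still appear in the result with a count of 0.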

Postgresql running totals with groups missing data and outer joins

I've written a SQL query that pulls data from a user table and produces a running total and cumulative total of when users were created. The data is grouped by week (using the windowing feature of Postgres). I'm using a left outer join to include the weeks when no users were created. Here is the query:
WITH reporting_period AS (
SELECT generate_series(date_trunc('week', date '2015-04-02'), date_trunc('week', date '2015-10-02'), interval '1 week') AS interval
)
SELECT
date(interval) AS interval
, count(users.created_at) as interval_count
, sum(count( users.created_at) ) OVER (order by date_trunc('week', users.created_at)) AS cumulative_count
FROM reporting_period
LEFT JOIN users
ON interval=date(date_trunc('week', users.created_at) )
GROUP BY interval, date_trunc('week', users.created_at) ORDER BY interval
It works almost perfectly. The cumulative value is calculated properly for weeks in which a user was created. For weeks when no user was created, it is set to the grand total and not the cumulative total up to that point.
Notice that in the rows marked with **, the Week Tot column (interval_count) is 0 as expected, but the Run Tot column (cumulative_count) is 1053, which equals the grand total.
Week Week Tot Run Tot
-----------------------------------
2015-03-30 | 4 | 4
2015-04-06 | 13 | 17
2015-04-13 | 0 | 1053 **
2015-04-20 | 9 | 26
2015-04-27 | 3 | 29
2015-05-04 | 0 | 1053 **
2015-05-11 | 0 | 1053 **
2015-05-18 | 1 | 30
2015-05-25 | 0 | 1053 **
...
2015-06-08 | 996 | 1031
...
2015-09-07 | 2 | 1052
2015-09-14 | 0 | 1053 **
2015-09-21 | 1 | 1053 **
2015-09-28 | 0 | 1053 **
This is what I would like
Week Week Tot Run Tot
-----------------------------------
2015-03-30 | 4 | 4
2015-04-06 | 13 | 17
2015-04-13 | 0 | 17 **
2015-04-20 | 9 | 26
2015-04-27 | 3 | 29
2015-05-04 | 0 | 29 **
...
It seems to me that if the outer join can somehow apply the grand total to the last column it should be possible to apply the current running total but I'm at a loss on how to do it.
Is this possible?
This is not guaranteed to work out of the box as I haven't tested it on actual tables, but the key here is to join users on created_at over a range of dates.
with reportingperiod as (
select intervaldate as interval_begin,
intervaldate + interval '1 month' as interval_end
from (
SELECT GENERATE_SERIES(DATE(DATE_TRUNC('day', DATE '2015-03-15')),
DATE(DATE_TRUNC('day', DATE '2015-10-15')), interval '1 month') AS intervaldate
) as rp
)
select interval_end,
interval_count,
sum(interval_count) over (order by interval_end) as running_sum
from (
select interval_end,
count(u.created_at) as interval_count
from reportingperiod rp
left join (
select created_at
from users
where created_at < '2015-10-02'
) u on u.created_at > rp.interval_begin
and u.created_at <= rp.interval_end
group by interval_end
) q
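A runnable sketch of this range-join idea follows, using Python's built-in sqlite3 module (SQLite 3.25 or later). The sample `users` rows and the three month-long periods starting on the 15th are made up for illustration, and the period list is built with a recursive CTE because SQLite has no generate_series() by default. The sketch uses half-open periods (begin inclusive, end exclusive), a small deviation from the query above, so that no timestamp can fall into two periods.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (created_at TEXT);
INSERT INTO users VALUES
    ('2015-03-20 10:00:00'), ('2015-04-01 09:30:00'),
    ('2015-05-16 12:00:00'), ('2015-05-20 08:15:00'), ('2015-05-30 23:59:59');
""")

query = """
WITH RECURSIVE reporting_period(interval_begin, interval_end) AS (
    SELECT '2015-03-15', date('2015-03-15', '+1 month')
    UNION ALL
    SELECT interval_end, date(interval_end, '+1 month')
    FROM reporting_period
    WHERE interval_end < '2015-06-15'
)
SELECT interval_begin,
       interval_count,
       -- The window is ordered by the period column, which is never NULL,
       -- so empty periods carry the previous running total forward.
       sum(interval_count) OVER (ORDER BY interval_begin) AS running_sum
FROM (
    SELECT rp.interval_begin,
           count(u.created_at) AS interval_count
    FROM reporting_period rp
    LEFT JOIN users u
           ON u.created_at >= rp.interval_begin
          AND u.created_at <  rp.interval_end
    GROUP BY rp.interval_begin
) AS q
ORDER BY interval_begin
"""
running = conn.execute(query).fetchall()
for row in running:
    print(row)
```

Aggregating in the inner query and applying the window function in the outer one keeps the running total tied to the generated periods rather than to users.created_at, which is NULL for periods without users.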
I figured it out. The trick was subqueries. Here's my approach:
1. Add a count column to the generate_series call with a default value of 0.
2. Select interval and count(users.created_at) from the users data.
3. Union the generate_series and the result from the select in step #2. (At this point the result will have duplicates for each interval.)
4. Use the results in a subquery to get interval and max(interval_count), which eliminates the duplicates.
5. Use the window aggregation as before to get the running total.
SELECT
interval
, interval_count
, SUM(interval_count ) OVER (ORDER BY interval) AS cumulative_count
FROM
(
SELECT interval, MAX(interval_count) AS interval_count FROM
(
SELECT GENERATE_SERIES(DATE(DATE_TRUNC('week', DATE '2015-04-02')),
DATE(DATE_TRUNC('week', DATE '2015-10-02')), interval '1 week') AS interval,
0 AS interval_count
UNION
SELECT DATE_TRUNC('week', users.created_at) AS INTERVAL,
COUNT(users.created_at) AS interval_count FROM users
WHERE users.created_at < date '2015-10-02'
GROUP BY 1 ORDER BY 1
) sub1
GROUP BY interval
) grouped_data
I'm not sure if there are any serious performance issues with this approach but it seems to work. If anyone has a better, more elegant or performant approach I would love the feedback.
Edit: My solution doesn't work when trying to group by arbitrary time windows
Just tried this solution with the following changes
/* generate series using DATE_TRUNC('day'...)*/
SELECT GENERATE_SERIES(DATE(DATE_TRUNC('day', DATE '2015-04-02')),
DATE(DATE_TRUNC('day', DATE '2015-10-02')), interval '1 month') AS interval,
0 AS interval_count
/* And this part */
SELECT DATE_TRUNC('day', users.created_at) AS INTERVAL,
COUNT(users.created_at) AS interval_count FROM users
WHERE users.created_at < date '2015-10-02'
GROUP BY 1 ORDER BY 1
For example, is it possible to produce similar results but with the data grouped by intervals like these?
3/15/15 - 4/14/15,
4/15/15 - 5/14/15,
5/15/15 - 6/14/15
etc.

Selecting rows ordered by some column and distinct on another

Related to - PostgreSQL DISTINCT ON with different ORDER BY
I have table purchases (product_id, purchased_at, address_id)
Sample data:
| id | product_id | purchased_at | address_id |
| 1 | 2 | 20 Mar 2012 21:01 | 1 |
| 2 | 2 | 20 Mar 2012 21:33 | 1 |
| 3 | 2 | 20 Mar 2012 21:39 | 2 |
| 4 | 2 | 20 Mar 2012 21:48 | 2 |
The result I expect is the most recently purchased product (full row) for each address_id, and that result must be sorted in descending order by the purchased_at field:
| id | product_id | purchased_at | address_id |
| 4 | 2 | 20 Mar 2012 21:48 | 2 |
| 2 | 2 | 20 Mar 2012 21:33 | 1 |
Using query:
SELECT DISTINCT ON (address_id) purchases.address_id, purchases.*
FROM "purchases"
WHERE "purchases"."product_id" = 2
ORDER BY purchases.address_id ASC, purchases.purchased_at DESC
I'm getting:
| id | product_id | purchased_at | address_id |
| 2 | 2 | 20 Mar 2012 21:33 | 1 |
| 4 | 2 | 20 Mar 2012 21:48 | 2 |
So the rows are the same, but the order is wrong. Is there any way to fix it?
Quite a clear question :)
SELECT t1.* FROM purchases t1
LEFT JOIN purchases t2
ON t1.address_id = t2.address_id AND t1.purchased_at < t2.purchased_at
WHERE t2.purchased_at IS NULL
ORDER BY t1.purchased_at DESC
And most likely a faster approach:
SELECT t1.* FROM purchases t1
JOIN (
SELECT address_id, max(purchased_at) max_purchased_at
FROM purchases
GROUP BY address_id
) t2
ON t1.address_id = t2.address_id AND t1.purchased_at = t2.max_purchased_at
ORDER BY t1.purchased_at DESC
Your ORDER BY is used by DISTINCT ON for picking which row for each distinct address_id to produce. If you then want to order the resulting records, make the DISTINCT ON a subselect and order its results:
SELECT * FROM
(
SELECT DISTINCT ON (address_id) purchases.address_id, purchases.*
FROM "purchases"
WHERE "purchases"."product_id" = 2
ORDER BY purchases.address_id ASC, purchases.purchased_at DESC
) distinct_addrs
order by distinct_addrs.purchased_at DESC
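The two-step shape (pick one row per address first, then sort the picked rows) can be tried with Python's built-in sqlite3 module. SQLite has no DISTINCT ON, so this sketch swaps in row_number() for the inner step; that substitution, the ISO-formatted timestamps and the `id DESC` tie-break are assumptions for illustration, not part of the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchases (id INTEGER PRIMARY KEY, product_id INTEGER,
                        purchased_at TEXT, address_id INTEGER);
INSERT INTO purchases VALUES
    (1, 2, '2012-03-20 21:01', 1),
    (2, 2, '2012-03-20 21:33', 1),
    (3, 2, '2012-03-20 21:39', 2),
    (4, 2, '2012-03-20 21:48', 2);
""")

query = """
SELECT id, product_id, purchased_at, address_id
FROM (
    -- Inner step: rank the purchases of each address, newest first.
    SELECT p.*,
           row_number() OVER (PARTITION BY address_id
                              ORDER BY purchased_at DESC, id DESC) AS rn
    FROM purchases p
    WHERE product_id = 2
) AS latest
WHERE rn = 1                  -- keep the newest row per address
ORDER BY purchased_at DESC    -- outer step: sort the surviving rows
"""
latest_per_address = conn.execute(query).fetchall()
for row in latest_per_address:
    print(row)
```

As with DISTINCT ON, the inner ordering decides which row wins within each address, and only the outer ORDER BY controls the order of the final result.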
This query is trickier to rephrase properly than it looks.
The currently accepted, join-based answer doesn’t correctly handle the case where two candidate rows have the same given purchased_at value: it will return both rows.
You can get the right behaviour this way:
SELECT * FROM purchases AS given
WHERE product_id = 2
AND NOT EXISTS (
SELECT NULL FROM purchases AS other
WHERE given.address_id = other.address_id
AND (given.purchased_at < other.purchased_at
OR (given.purchased_at = other.purchased_at AND given.id < other.id))
)
ORDER BY purchased_at DESC
Note how it has a fallback of comparing id values to disambiguate the case in which the purchased_at values match. This ensures that the condition can only ever be true for a single row among those that have the same address_id value.
The original query using DISTINCT ON handles this case automatically!
Also note the way that you are forced to encode the fact that you want “the latest for each address_id” twice, both in the given.purchased_at < other.purchased_at condition and the ORDER BY purchased_at DESC clause, and you have to make sure they match. I had to spend a few extra minutes to convince myself that this query is really positively correct.
It’s much easier to write this query correctly and understandably by using DISTINCT ON together with an outer subquery, as suggested by dbenhur.
Try this:
SELECT DISTINCT ON (address_id) *
FROM purchases
WHERE product_id = 2
ORDER BY address_id, purchased_at DESC