Need help to merge overlapping time intervals - postgresql

I need some help with merging overlapping time intervals when the gap between them is not more than 4 minutes (for example, only where id = 1).
I have the following table:
--------------------------------------
id | action | date
--------------------------------------
1 | started | 2020-08-18 13:51:02
1 | suspended | 2020-08-18 13:51:04
2 | started | 2020-08-18 13:52:14
2 | suspended | 2020-08-18 13:52:17
3 | started | 2020-08-18 13:52:21
3 | suspended | 2020-08-18 13:52:24
1 | started | 2020-08-18 13:57:21
1 | suspended | 2020-08-18 13:57:22
1 | started | 2020-08-18 15:07:56
1 | suspended | 2020-08-18 15:08:56
1 | started | 2020-08-18 15:09:11
1 | suspended | 2020-08-18 15:09:11
1 | started | 2020-08-18 15:09:11
1 | suspended | 2020-08-18 15:09:13
Expected result:
--------------------------------------
id | action | date
--------------------------------------
1 | started | 2020-08-18 13:51:02
1 | suspended | 2020-08-18 13:51:04
1 | started | 2020-08-18 13:57:21
1 | suspended | 2020-08-18 13:57:22
1 | started | 2020-08-18 15:07:56
1 | suspended | 2020-08-18 15:09:13
How can this be done? I will be very grateful for your help!

You want to eliminate suspended/started pairs that are for the same id and within 4 minutes of each other. Use lag() and lead() to look at the neighboring timestamps and keep only the rows that are not part of such a pair:
select t.*
from (select t.*,
             lag(date) over (partition by id order by date) as prev_date,
             lead(date) over (partition by id order by date) as next_date
      from t
     ) t
where (action = 'started' and
       (prev_date is null or prev_date < date - interval '4 minute')
      ) or
      (action = 'suspended' and
       (next_date is null or next_date > date + interval '4 minute')
      );
Date/time functions are notoriously database dependent. This is just adding or subtracting 4 minutes, which any database can do but the syntax might vary.
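For illustration, the same 4-minute arithmetic in a few dialects (only the PostgreSQL form applies to this question; the others are shown for comparison):

```sql
-- PostgreSQL
select now() + interval '4 minute';
-- SQL Server equivalent: DATEADD(minute, 4, GETDATE())
-- MySQL equivalent:      NOW() + INTERVAL 4 MINUTE
```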

You want to filter out certain rows; what do the rows you are removing have in common?
It seems you want the first 'started' and last 'suspended' rows. Can you just ignore 'started' rows if there is another 'started' row in the previous 4 minutes, and ignore 'suspended' rows if there is another 'suspended' row in the next 4 minutes?
select a.*
from my_table a
where action = 'started' and not exists (
    select 1 from my_table b
    where b.id = a.id and b.action = 'started'
    and b.date >= a.date - interval '4 minutes'
    and b.date <  a.date  -- a 'started' row exists in the previous 4 minutes
)
Ditto for 'suspended', but the other way around. That doesn't work if the difference between the last 'started' and the 'suspended' is more than 4 minutes, though; that can be overcome with another condition checking that no 'started' occurred within the previous 4 minutes.
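Putting both halves of that idea together, a PostgreSQL sketch (untested, assuming the table is called my_table) might look like:

```sql
-- keep a 'started' row only if no other 'started' occurred in the previous 4 minutes
select a.*
from my_table a
where a.action = 'started'
  and not exists (
      select 1
      from my_table b
      where b.id = a.id
        and b.action = 'started'
        and b.date >= a.date - interval '4 minutes'
        and b.date <  a.date
  )
union all
-- keep a 'suspended' row only if no other 'suspended' occurs in the next 4 minutes
select a.*
from my_table a
where a.action = 'suspended'
  and not exists (
      select 1
      from my_table b
      where b.id = a.id
        and b.action = 'suspended'
        and b.date <= a.date + interval '4 minutes'
        and b.date >  a.date
  )
order by id, date;
```

Against the sample data for id = 1 this reproduces the expected result, subject to the caveat above about a 'started'/'suspended' pair more than 4 minutes apart.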

If you need to merge overlapping intervals separated by not more than 4 minutes, you can use this query (the original used SQL Server's datediff and iif; below they are translated to PostgreSQL's extract(epoch from ...) and CASE, and the table is assumed to be called my_table):
--cte creating groups of time intervals
with base_cte as
(
  select Tab.id, Tab.NumGr, Tab.date, Tab.action
  from
  (
    select * from
    (
      --selecting only values where time difference <= 4 min
      select *,
             sum(TimeDiff) over(partition by id, NumGr order by date rows unbounded preceding) SumTimeInterval
      from
      (
        --creating a group
        select sum(Num) over(partition by id order by date rows unbounded preceding) NumGr, *
        from
        (
          select date,
                 lead(date) over(partition by id order by date) lead_date,
                 id, action,
                 lead(action) over(partition by id order by date) lead_action,
                 --split intervals between overlaps (240 seconds)
                 case when TimeDiff > 240 then 1 else 0 end Num,
                 TimeDiff
          from
          (
            --time difference in seconds between current and next date (per id)
            select extract(epoch from lead(date) over(partition by id order by date) - date) TimeDiff, *
            from my_table
          ) A
        ) B
      ) C
      --selecting only pairs within time intervals
      where TimeDiff <= 240
      --checking interval duration: <= 4 min
    ) D
    where SumTimeInterval <= 240
  ) E
  CROSS JOIN LATERAL
    (values (id, NumGr, date, action),
            (id, NumGr, lead_date, lead_action)
    ) Tab(id, NumGr, date, action)
)
--selecting the start/end of each overlapping time interval
select id, date, action
from base_cte base
where date in (select max(date) from base_cte where NumGr = base.NumGr)
   or date in (select min(date) from base_cte where NumGr = base.NumGr)

Related

postgresql pivot using crosstab

I have trouble using crosstab() in postgresql-11.
Here is my table,
CREATE TABLE monitor(tz timestamptz, level int, table_name text, status text);
The table monitors events on other tables. It contains:
table_name (table on which the event occurred)
tz (timestamp at which the event occurred)
level (level of the event)
status (start/end of the event)
Here is some sample data:
tz | level | status | table_name
----------------------------------+-------+--------+--------------
2019-10-24 16:18:34.89435+05:30 | 2 | start | test_table_2
2019-10-24 16:18:58.922523+05:30 | 2 | end | test_table_2
2019-11-01 10:31:08.948459+05:30 | 3 | start | test_table_3
2019-11-01 10:41:22.863529+05:30 | 3 | end | test_table_3
2019-11-01 10:51:44.009129+05:30 | 3 | start | test_table_3
2019-11-01 12:35:23.280294+05:30 | 3 | end | test_table_3
Given a timestamp, I want to list all events current at that time. This could be done using the criteria:
start_time <= 'given_timestamp' and end_time >= 'given_timestamp'
So I tried to use crosstab() to pivot the table over columns table_name,status and timestamp. My query is,
with q1 (table_name, start_time,end_time) as
(select * from crosstab
('select table_name, status, tz from monitor ')
as finalresult (table_name text, start_time timestamptz, end_time timestamptz)),
q2 (level,start_time,end_time) as
(select * from crosstab('select level, status, tz from monitor ')
as finalresult (level int, start_time timestamptz, end_time timestamptz))
select q1.table_name,q2.level,q1.start_time,q1.end_time
from q1,q2
where q1.start_time=q2.start_time;
The output of the query is,
table_name | level | start_time | end_time
--------------+-------+----------------------------------+----------------------------------
test_table_2 | 2 | 2019-10-24 16:18:34.89435+05:30 | 2019-10-24 16:18:58.922523+05:30
test_table_3 | 3 | 2019-11-01 10:31:08.948459+05:30 | 2019-11-01 10:41:22.863529+05:30
But my expected output is,
table_name | level | start_time | end_time
--------------+-------+----------------------------------+----------------------------------
test_table_2 | 2 | 2019-10-24 16:18:34.89435+05:30 | 2019-10-24 16:18:58.922523+05:30
test_table_3 | 3 | 2019-11-01 10:31:08.948459+05:30 | 2019-11-01 10:41:22.863529+05:30
test_table_3 | 3 | 2019-11-01 10:51:44.009129+05:30 | 2019-11-01 12:35:23.280294+05:30
How do I achieve the expected output? Or is there any better way other than crosstab?
I would use a self join for this. To keep the rows on the same level and table together you can use a window function to assign numbers to them so they can be distinguished.
with numbered as (
select tz, level, table_name, status,
row_number() over (partition by table_name, status order by tz) as rn
from monitor
)
select st.table_name, st.level, st.tz as start_time, et.tz as end_time
from numbered as st
join numbered as et on st.table_name = et.table_name
and et.status = 'end'
and et.level = st.level
and et.rn = st.rn
where st.status = 'start'
order by st.table_name, st.level;
This assumes that there will never be a row with status = 'end' and an earlier timestamp than the corresponding row with status = 'start'.
Online example: https://rextester.com/QYJK57764
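Building on that pairing, the original goal (all events current at a given timestamp) becomes a simple range check. A sketch, where the literal timestamp is just an example value:

```sql
with numbered as (
    select tz, level, table_name, status,
           row_number() over (partition by table_name, status order by tz) as rn
    from monitor
),
paired as (
    select st.table_name, st.level, st.tz as start_time, et.tz as end_time
    from numbered st
    join numbered et
      on et.table_name = st.table_name
     and et.level = st.level
     and et.rn = st.rn
     and et.status = 'end'
    where st.status = 'start'
)
select *
from paired
where start_time <= timestamptz '2019-11-01 11:00:00+05:30'
  and end_time   >= timestamptz '2019-11-01 11:00:00+05:30';
```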

Monthly Counting PostgreSQL giving months

I have a table for customers like this
cust_id | date_signed_up | location_id
-----------------------------------------
1 | 2019/01/01 | 1
2 | 2019/03/05 | 1
3 | 2019/06/17 | 1
What I need is a monthly count that includes the months even if the count is 0. Ex:
monthly_count | count
-------------------------
Jan | 1
Feb | 0
Mar | 1
Apr | 0
(months can be in numbers)
Right now I made this query:
SELECT date_trunc('MONTH', date_signed_up::date) AS monthly, count(cust_id) AS count
FROM customer
WHERE location_id = 1
GROUP BY monthly
ORDER BY monthly ASC
but it's only giving me rows for the months that have data, skipping the ones where the count is zero. How can I get all the months, whether they have information or not?
You need a list of months.
How to generate Month list in PostgreSQL?
SELECT a.month, count(y.cust_id)
FROM allMonths a
LEFT JOIN yourTable y
  ON a.month = date_trunc('MONTH', y.date_signed_up::date)
GROUP BY a.month
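A self-contained sketch of that idea, assuming the customer table and columns shown in the question and a date range picked purely for illustration:

```sql
WITH allMonths AS (
    SELECT generate_series(date '2019-01-01', date '2019-12-01',
                           interval '1 month') AS month
)
SELECT a.month, count(y.cust_id) AS count
FROM allMonths a
LEFT JOIN customer y
  ON a.month = date_trunc('MONTH', y.date_signed_up::date)
 AND y.location_id = 1   -- filter in the ON clause, not WHERE
GROUP BY a.month
ORDER BY a.month;
```

Note that the location_id filter belongs in the ON clause: putting it in WHERE would turn the left join back into an inner join and drop the zero-count months again.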

Postgresql: Create a date sequence, use it in date range query

I'm not great with SQL but I have been making good progress on a project up to this point. Now I am completely stuck.
I'm trying to get a count for the number of apartments with each status. I want this information for each day so that I can trend it over time. I have data that looks like this:
table: y_unit_status
unit | date_occurred | start_date | end_date | status
1 | 2017-01-01 | 2017-01-01 | 2017-01-05 | Occupied No Notice
1 | 2017-01-06 | 2017-01-06 | 2017-01-31 | Occupied Notice
1 | 2017-02-01 | 2017-02-01 | | Vacant
2 | 2017-01-01 | 2017-01-01 | | Occupied No Notice
And I want to get output that looks like this:
date | occupied_no_notice | occupied_notice | vacant
2017-01-01 | 2 | 0 | 0
...
2017-01-10 | 1 | 1 | 0
...
2017-02-01 | 1 | 0 | 1
Or, this approach would work:
date | status | count
2017-01-01 | occupied no notice | 2
2017-01-01 | occupied notice | 0
date_occurred: Date when the status of the unit changed
start_date: Same as date_occurred
end_date: Date when status stopped being x and changed to y.
I am pulling in the number of bedrooms and a property id so the second approach of selecting counts for one status at a time would produce a relatively large number of rows vs. option 1 (if that matters).
I've found a lot of references that have gotten me close to what I'm looking for but I always end up with a sort of rolling, cumulative count.
Here's my query, which produces a column of dates and counts, which accumulate over time rather than reflecting a snapshot of counts for a particular day. You can see my references to another table where I'm pulling in a property id. The table schema is Property -> Unit -> Unit Status.
WITH t AS(
SELECT i::date from generate_series('2016-06-29', '2017-08-03', '1 day'::interval) i
)
SELECT t.i as date,
u.hproperty,
count(us.hmy) as count --us.hmy is the id
FROM t
LEFT OUTER JOIN y_unit_status us ON t.i BETWEEN us.dtstart AND
us.dtend
INNER JOIN y_unit u ON u.hmy = us.hunit -- to get property id
WHERE us.sstatus = 'Occupied No Notice'
AND t.i >= us.dtstart
AND t.i <= us.dtend
AND u.hproperty = '1'
GROUP BY t.i, u.hproperty
ORDER BY t.i
limit 1500
I also tried a FOR loop, iterating over the dates to determine cases where the date was between start and end but my logic wasn't working. Thanks for any insight!
You are on the right track, but you'll need to handle NULL values in end_date. If a NULL there means that the status is assumed to change at some unknown point in the future, the containment operators (@> and <@) for the daterange type are perfect for you (because ranges can be "unbounded"):
with params as (
select date '2017-01-01' date_from,
date '2017-02-02' date_to
)
select date_from + d, status, count(unit)
from params
cross join generate_series(0, date_to - date_from) d
left join y_unit_status on daterange(start_date, end_date, '[]') @> (date_from + d)
group by 1, 2
To achieve the first variant, you can use conditional aggregation:
with params as (
select date '2017-01-01' date_from,
date '2017-02-02' date_to
)
select date_from + d,
count(unit) filter (where status = 'Occupied No Notice') occupied_no_notice,
count(unit) filter (where status = 'Occupied Notice') occupied_notice,
count(unit) filter (where status = 'Vacant') vacant
from params
cross join generate_series(0, date_to - date_from) d
left join y_unit_status on daterange(start_date, end_date, '[]') @> (date_from + d)
group by 1
Notes:
The syntax filter (where <predicate>) is new in 9.4. Before that, you can use CASE (and the fact that most aggregate functions do not include NULL values) to emulate it.
You can even index the expression daterange(start_date, end_date, '[]') (using gist) for better performance.
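Such a GiST expression index might be created like this (a sketch; the index name is my own choice):

```sql
create index y_unit_status_period_idx
    on y_unit_status
 using gist (daterange(start_date, end_date, '[]'));
```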
http://rextester.com/HWKDE34743
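For example, the conditional-aggregation query rewritten with CASE instead of FILTER might look like this (untested sketch; count simply ignores the NULLs produced by the CASE arms that have no ELSE):

```sql
with params as (
  select date '2017-01-01' date_from,
         date '2017-02-02' date_to
)
select date_from + d,
       count(case when status = 'Occupied No Notice' then unit end) occupied_no_notice,
       count(case when status = 'Occupied Notice' then unit end) occupied_notice,
       count(case when status = 'Vacant' then unit end) vacant
from params
cross join generate_series(0, date_to - date_from) d
left join y_unit_status on daterange(start_date, end_date, '[]') @> (date_from + d)
group by 1
```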

Count rows within a group, but also from global result set: performance issue

I have a table with log records. Each log record is represented by a status (open or closed) and a date:
CREATE TABLE logs (
id BIGSERIAL PRIMARY KEY,
status VARCHAR NOT NULL,
inserted_at DATE NOT NULL
);
I need to get a daily report with a following information:
how many log records with status = open were created,
how many log records with status = closed were created,
how many log records with status = open exist to this day, including this day.
Here's a sample report output:
day | created | closed | total_open
------------+---------+--------+------------
2017-01-01 | 2 | 0 | 2
2017-01-02 | 2 | 1 | 3
2017-01-03 | 1 | 1 | 3
2017-01-04 | 1 | 0 | 4
2017-01-05 | 1 | 0 | 5
2017-01-06 | 1 | 0 | 6
2017-01-07 | 1 | 0 | 7
2017-01-08 | 0 | 1 | 6
2017-01-09 | 0 | 0 | 6
2017-01-10 | 0 | 0 | 6
(10 rows)
I solved this in a very "dirty" way:
INSERT INTO logs (status, inserted_at) VALUES
('created', '2017-01-01'),
('created', '2017-01-01'),
('closed', '2017-01-02'),
('created', '2017-01-02'),
('created', '2017-01-02'),
('created', '2017-01-03'),
('closed', '2017-01-03'),
('created', '2017-01-04'),
('created', '2017-01-05'),
('created', '2017-01-06'),
('created', '2017-01-07'),
('closed', '2017-01-08');
SELECT days.day,
count(case when logs.inserted_at = days.day AND logs.status = 'created' then 1 end) as created,
count(case when logs.inserted_at = days.day AND logs.status = 'closed' then 1 end) as closed,
count(case when logs.inserted_at <= days.day AND logs.status = 'created' then 1 end) -
count(case when logs.inserted_at <= days.day AND logs.status = 'closed' then 1 end) as total
FROM (SELECT day::date FROM generate_series('2017-01-01'::date, '2017-01-10'::date, '1 day'::interval) day) days,
logs
GROUP BY days.day
ORDER BY days.day;
I also posted it on a gist for brevity, and I would like to improve the solution.
Right now EXPLAIN for my query shows some ridiculous cost numbers that I would like to minimize (I don't have any indexes yet).
What would an efficient query for the report above look like?
A possible solution is to use window functions:
select s.*, sum(created - closed) over (order by inserted_at)
from (select inserted_at,
count(status) filter (where status = 'created') created,
count(status) filter (where status = 'closed') closed
from (select d::date inserted_at
from generate_series('2017-01-01'::date, '2017-01-10'::date, '1 day'::interval) d) d
left join logs using (inserted_at)
group by inserted_at) s
http://rextester.com/GFRRP71067
Also, an index on (inserted_at, status) could help you a lot with this query.
Note: count(...) filter (where ...) is really just a fancy way to write count(case when ... then ... [else null] end).
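The index suggested above could be created like this (the index name is my own choice):

```sql
create index logs_inserted_at_status_idx on logs (inserted_at, status);
```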

Postgresql running totals with groups missing data and outer joins

I've written a SQL query that pulls data from a user table and produces a running total and cumulative total of when users were created. The data is grouped by week (using the windowing feature of Postgres). I'm using a left outer join to include the weeks when no users were created. Here is the query...
WITH reporting_period AS (
SELECT generate_series(date_trunc('week', date '2015-04-02'), date_trunc('week', date '2015-10-02'), interval '1 week') AS interval
)
SELECT
date(interval) AS interval
, count(users.created_at) as interval_count
, sum(count( users.created_at) ) OVER (order by date_trunc('week', users.created_at)) AS cumulative_count
FROM reporting_period
LEFT JOIN users
ON interval=date(date_trunc('week', users.created_at) )
GROUP BY interval, date_trunc('week', users.created_at) ORDER BY interval
It works almost perfectly. The cumulative value is calculated properly for weeks in which a user was created. For weeks when no user was created, it is set to the grand total rather than the cumulative total up to that point.
Notice the rows marked with **: the Week Tot column (interval_count) is 0 as expected, but the Run Tot (cumulative_count) is 1053, which equals the grand total.
Week Week Tot Run Tot
-----------------------------------
2015-03-30 | 4 | 4
2015-04-06 | 13 | 17
2015-04-13 | 0 | 1053 **
2015-04-20 | 9 | 26
2015-04-27 | 3 | 29
2015-05-04 | 0 | 1053 **
2015-05-11 | 0 | 1053 **
2015-05-18 | 1 | 30
2015-05-25 | 0 | 1053 **
...
2015-06-08 | 996 | 1031
...
2015-09-07 | 2 | 1052
2015-09-14 | 0 | 1053 **
2015-09-21 | 1 | 1053 **
2015-09-28 | 0 | 1053 **
This is what I would like
Week Week Tot Run Tot
-----------------------------------
2015-03-30 | 4 | 4
2015-04-06 | 13 | 17
2015-04-13 | 0 | 17 **
2015-04-20 | 9 | 26
2015-04-27 | 3 | 29
2015-05-04 | 0 | 29 **
...
It seems to me that if the outer join can somehow apply the grand total to the last column it should be possible to apply the current running total but I'm at a loss on how to do it.
Is this possible?
This is not guaranteed to work out of the box as I haven't tested it on actual tables, but the key here is to join users on created_at over a range of dates.
with reportingperiod as (
select intervaldate as interval_begin,
intervaldate + interval '1 month' as interval_end
from (
SELECT GENERATE_SERIES(DATE(DATE_TRUNC('day', DATE '2015-03-15')),
DATE(DATE_TRUNC('day', DATE '2015-10-15')), interval '1 month') AS intervaldate
) as rp
)
select interval_end,
interval_count,
sum(interval_count) over (order by interval_end) as running_sum
from (
select interval_end,
count(u.created_at) as interval_count
from reportingperiod rp
left join (
select created_at
from users
where created_at < '2015-10-02'
) u on u.created_at > rp.interval_begin
and u.created_at <= rp.interval_end
group by interval_end
) q
I figured it out. The trick was subqueries. Here's my approach:
1. Add a count column to the generate_series call, with a default value of 0
2. Select interval and count(users.created_at) from the users data
3. Union the generate_series result and the result from step #2 (at this point the result will have duplicates for each interval)
4. Use the results in a subquery to get interval and max(interval_count), which eliminates the duplicates
5. Use the window aggregation as before to get the running total
SELECT
interval
, interval_count
, SUM(interval_count ) OVER (ORDER BY interval) AS cumulative_count
FROM
(
SELECT interval, MAX(interval_count) AS interval_count FROM
(
SELECT GENERATE_SERIES(DATE(DATE_TRUNC('week', DATE '2015-04-02')),
DATE(DATE_TRUNC('week', DATE '2015-10-02')), interval '1 week') AS interval,
0 AS interval_count
UNION
SELECT DATE_TRUNC('week', users.created_at) AS INTERVAL,
COUNT(users.created_at) AS interval_count FROM users
WHERE users.created_at < date '2015-10-02'
GROUP BY 1 ORDER BY 1
) sub1
GROUP BY interval
) grouped_data
I'm not sure if there are any serious performance issues with this approach but it seems to work. If anyone has a better, more elegant or performant approach I would love the feedback.
Edit: My solution doesn't work when trying to group by arbitrary time windows
Just tried this solution with the following changes
/* generate series using DATE_TRUNC('day'...)*/
SELECT GENERATE_SERIES(DATE(DATE_TRUNC('day', DATE '2015-04-02')),
DATE(DATE_TRUNC('day', DATE '2015-10-02')), interval '1 month') AS interval,
0 AS interval_count
/* And this part */
SELECT DATE_TRUNC('day', users.created_at) AS INTERVAL,
COUNT(users.created_at) AS interval_count FROM users
WHERE users.created_at < date '2015-10-02'
GROUP BY 1 ORDER BY 1
For example, is it possible to produce similar results but have the data grouped by intervals such as
3/15/15 - 4/14/15,
4/15/15 - 5/14/15,
5/15/15 - 6/14/15
etc.
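The range-join approach from the first answer above should handle such mid-month windows, since it never calls date_trunc on created_at. A sketch (untested, reusing the users table from the question):

```sql
with reportingperiod as (
    select intervaldate as interval_begin,
           intervaldate + interval '1 month' as interval_end
    from generate_series(date '2015-03-15', date '2015-10-15',
                         interval '1 month') as intervaldate
)
select interval_begin,
       count(u.created_at) as interval_count,
       sum(count(u.created_at)) over (order by interval_begin) as running_sum
from reportingperiod rp
left join users u
  on u.created_at >= rp.interval_begin
 and u.created_at <  rp.interval_end
group by interval_begin
order by interval_begin;
```

Because each row of the series carries its own begin/end pair, the window boundaries can be any dates at all, not just truncated weeks or months.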