Merge overlapping date intervals into big intervals, keeping uniqueness of some id inside the merged group - postgresql

I have some date intervals, each characterized by a known "prop_id". My goal is to merge overlapping intervals into big intervals while keeping the uniqueness of "prop_id" inside each merged group. I have some code that gets me the big intervals, but I have no idea how to keep the uniqueness condition. Thanks in advance for any assistance.
________1 ________1
___________2
________1 |________1
_________|__2
[1,2]_________|________[1,2]
For SQLFiddle:
CREATE SEQUENCE ido_seq;
create table slots (
ido integer NOT NULL default nextval('ido_seq'),
begin_at date,
end_at date,
prop_id integer);
ALTER SEQUENCE ido_seq owned by slots.ido;
INSERT INTO slots (ido, begin_at, end_at, prop_id) VALUES
(0, '2014-10-05', '2014-10-10', 1),
(1, '2014-10-08', '2014-10-15', 2),
(2, '2014-10-13', '2014-10-20', 1),
(3, '2014-10-21', '2014-10-30', 2);
-- desired output:
-- start, end, props
-- 2014-10-05, 2014-10-12, [1,2] --! the whole group is (2014-10-05, 2014-10-20, [1,2,1]), but props should be unique
-- 2014-10-13, 2014-10-20, [1,2] --so we obtain 2 ranges instead of 1, each one with 2 generating prop_ids
-- 2014-10-21, 2014-10-30, [2]
How do we get it:
If two date intervals overlap, we merge them. The first ['2014-10-05', '2014-10-10'] and the second ['2014-10-08', '2014-10-15'] have the part ['2014-10-08', '2014-10-10'] in common, so we can merge them into ['2014-10-05', '2014-10-15']. The generating props are unique - OK. The next one, ['2014-10-13', '2014-10-20'], overlaps the previously calculated ['2014-10-05', '2014-10-15'], but we can't merge them without breaking the uniqueness condition. So we have to split the big interval ['2014-10-05', '2014-10-20'] into 2 smaller ones at the begin date of the repeating prop ('2014-10-13'), keeping the condition, and we receive ['2014-10-05', '2014-10-12'] (as '2014-10-13' minus 1 day) and ['2014-10-13', '2014-10-20'], both generated by props 1 and 2.
My attempt to get merged intervals (not keeping the uniqueness condition):
SELECT min(begin_at), max(enddate), array_agg(prop_id) AS props
FROM (
  SELECT *,
         count(nextstart > enddate OR NULL) OVER (ORDER BY begin_at DESC, end_at DESC) AS grp
  FROM (
    SELECT prop_id
         , begin_at
         , end_at
         , end_at AS enddate
         , lead(begin_at) OVER (ORDER BY begin_at, end_at) AS nextstart
    FROM slots
  ) a
) b
GROUP BY grp
ORDER BY 1;

The right solution here is probably to use a recursive CTE to find the large intervals no matter how many smaller intervals need to be combined, and then to remove the intervals that we do not need.
with recursive intervals(idos, begin_at, end_at, prop_ids) as (
  select array[ido], begin_at, end_at, array[prop_id]
  from slots
  union
  select i.idos || s.ido
       , least(s.begin_at, i.begin_at)
       , greatest(s.end_at, i.end_at)
       , i.prop_ids || s.prop_id
  from intervals i
  join slots s
    on (s.begin_at, s.end_at) overlaps (i.begin_at, i.end_at)
   and not (i.prop_ids && array[s.prop_id]) -- check that the prop_id is not already in the large interval
  where s.begin_at < i.begin_at -- to avoid having double intervals
)
select * from intervals i
-- finally, remove the intervals that are a subinterval of another included interval
where not exists (select 1 from intervals i2
                  where i2.idos @> i.idos -- @> is array containment; #> would be a JSON operator
                    and i2.idos <> i.idos);
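For a quick sanity check of the three operators the recursive step relies on (the row-wise OVERLAPS predicate, array overlap &&, and array containment @>), with values taken from the sample data:
SELECT (DATE '2014-10-05', DATE '2014-10-10') OVERLAPS
       (DATE '2014-10-08', DATE '2014-10-15') AS intervals_overlap,    -- true
       ARRAY[1,2] && ARRAY[2]                 AS arrays_share_element, -- true
       ARRAY[1,2,3] @> ARRAY[2,1]             AS array_contains;       -- true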

Related

Postgres join vs aggregation on very large partitioned tables

I have a large table with 100s of millions of rows. Because it is so big, it is partitioned by date range first, and then each date partition is further partitioned by period_id.
CREATE TABLE research.ranks
(
security_id integer NOT NULL,
period_id smallint NOT NULL,
classificationtype_id smallint NOT NULL,
dtz timestamp with time zone NOT NULL,
create_dt timestamp with time zone NOT NULL DEFAULT now(),
update_dt timestamp with time zone NOT NULL DEFAULT now(),
rank_1 smallint,
rank_2 smallint,
rank_3 smallint
) PARTITION BY RANGE (dtz);
CREATE TABLE zpart.ranks_y1990 PARTITION OF research.ranks
FOR VALUES FROM ('1990-01-01 00:00:00+00') TO ('1991-01-01 00:00:00+00')
PARTITION BY LIST (period_id);
CREATE TABLE zpart.ranks_y1990p1 PARTITION OF zpart.ranks_y1990
FOR VALUES IN ('1');
Every year has a partition, and there are another dozen partitions for each year.
I needed to see the result for security_ids side by side for different period_ids.
So the join I initially used was one like this:
select c1.security_id, c1.dtz,c1.rank_2 as rank_2_1, c9.rank_2 as rank_2_9
from research.ranks c1
left join research.ranks c9 on c1.dtz=c9.dtz and c1.security_id=c9.security_id and c9.period_id=9
where c1.period_id =1 and c1.dtz>now()-interval'10 years'
which was slow, but acceptable. I'll call this the JOIN version.
Then, we wanted to show two more period_ids and extended the above to add additional joins on the new period_ids.
This slowed down the join enough for us to look at a different solution.
We found that the following type of query runs about 6 or 7 times faster:
select c1.security_id, c1.dtz
,sum(case when c1.period_id=1 then c1.rank_2 end) as rank_2_1
,sum(case when c1.period_id=9 then c1.rank_2 end) as rank_2_9
,sum(case when c1.period_id=11 then c1.rank_2 end) as rank_2_11
,sum(case when c1.period_id=14 then c1.rank_2 end) as rank_2_14
from research.ranks c1
where c1.period_id in (1,11,14,9) and c1.dtz>now()-interval'10 years'
group by c1.security_id, c1.dtz;
We can use the sum because the table has unique indexes so we know there will only ever be one record that is being "summed". I'll call this the SUM version.
The speed is so much better that I'm questioning half of the code I have written previously! Two questions:
Should I be trying to use the SUM version rather than the JOIN version everywhere or is the efficiency likely to be a factor of the specific structure and not likely to be as useful in other circumstances?
Is there a problem with the logic of the SUM version in cases that I haven't considered?
To be honest, I don't think your "join" version was ever a good idea anyway. You only have one (partitioned) table so there never was a need for any join.
SUM() is the way to go, but I would use SUM(...) FILTER(WHERE ..) instead of a CASE:
SELECT
    security_id,
    dtz,
    SUM(rank_2) FILTER (WHERE period_id = 1)  AS rank_2_1,
    SUM(rank_2) FILTER (WHERE period_id = 9)  AS rank_2_9,
    SUM(rank_2) FILTER (WHERE period_id = 11) AS rank_2_11,
    SUM(rank_2) FILTER (WHERE period_id = 14) AS rank_2_14
FROM
    research.ranks
WHERE
    period_id IN ( 1, 11, 14, 9 )
    AND dtz > now( ) - INTERVAL '10 years'
GROUP BY
    security_id,
    dtz;
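For reference, the one-row-per-bucket assumption that makes the SUM safe could be made explicit with a unique constraint; a hypothetical sketch (the constraint name is mine, and on a partitioned table the constraint must include the partition key columns period_id and dtz, which it does):
ALTER TABLE research.ranks
    ADD CONSTRAINT ranks_security_period_dtz_key -- hypothetical name
    UNIQUE (security_id, period_id, dtz);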

postgreSQL select interval and fill blanks

I'm working on a system to manage the problems in different projects.
I have the following tables:
Projects

id | Description   | Country
---+---------------+--------
1  | 3D experience | Brazil
2  | Lorem Epsum   | Chile

Problems

id | idProject | Description
---+-----------+--------------
1  | 1         | Not loading
2  | 1         | Breaking down

Problems_status

id | idProblem | Status | Start_date | End_date
---+-----------+--------+------------+-----------
1  | 1         | Red    | 2020-10-17 | 2020-10-25
2  | 1         | Yellow | 2020-10-25 | 2020-11-20
3  | 1         | Red    | 2020-11-20 |
4  | 2         | Red    | 2020-11-01 | 2020-11-25
5  | 2         | Yellow | 2020-11-25 | 2020-12-22
6  | 2         | Red    | 2020-12-22 | 2020-12-23
7  | 2         | Green  | 2020-12-23 |
In the above examples, problem 1 is still red and problem 2 is green (no end date).
I need to create a chart, shown when the user selects a specific project, of the status of the problems along the weeks (starting at the week of the first registered problem). The chart of project 1 should look like this:
I'm trying to write a code in postgreSQL to return a table like this, so that I can populate this chart:
Week  | Green | Yellow | Red
------+-------+--------+----
42/20 |     0 |      0 |   1
43/20 |     0 |      0 |   1
44/20 |     0 |      1 |   0
...   |   ... |    ... | ...
04/21 |     1 |      0 |   1
I've been trying multiple ways but just can't figure out how to do that. Could someone help me, please?
Below is a db-fiddle to help:
CREATE TABLE projects (
id serial NOT NULL,
description character varying(50) NOT NULL,
country character varying(50) NOT NULL,
CONSTRAINT projects_pkey PRIMARY KEY (id)
);
CREATE TABLE problems (
id serial NOT NULL,
id_project integer NOT NULL,
description character varying(50) NOT NULL,
CONSTRAINT problems_pkey PRIMARY KEY (id),
CONSTRAINT problems_id_project_fkey FOREIGN KEY (id_project)
REFERENCES projects (id) MATCH SIMPLE
);
CREATE TABLE problems_status (
id serial NOT NULL,
id_problem integer NOT NULL,
status character varying(50) NOT NULL,
start_date date NOT NULL,
end_date date,
CONSTRAINT problems_status_pkey PRIMARY KEY (id),
CONSTRAINT problems_status_id_problem_fkey FOREIGN KEY (id_problem)
REFERENCES problems (id) MATCH SIMPLE
);
INSERT INTO projects (description, country) VALUES ('3D experience','Brazil');
INSERT INTO projects (description, country) VALUES ('Lorem Epsum','Chile');
INSERT INTO problems (id_project ,description) VALUES (1,'Not loading');
INSERT INTO problems (id_project ,description) VALUES (1,'Breaking down');
INSERT INTO problems_status (id_problem, status, start_date, end_date) VALUES
(1, 'Red', '2020-10-17', '2020-10-25'),(1, 'Yellow', '2020-10-25', '2020-11-20'),
(1, 'Red', '2020-11-20', NULL),(2, 'Red', '2020-11-01', '2020-11-25'),
(2, 'Yellow', '2020-11-25', '2020-12-22'),(2, 'Red', '2020-12-22', '2020-12-23'),
(2, 'Green', '2020-12-23', NULL);
If I understood correctly, your goal is to produce a weekly tally by problem status for a particular project over a specific time period (minimum date in the db to the current date). Further, if a problem status spans weeks, it should be included in each week's tally. That involves two time periods, the report period and the status start/end dates, and checking for overlap between them. There are four overlap scenarios that need checking; let's call any week in the report period A and the start/end of a status B. Allowing that A must end within the reporting period but B does not, we have the following:
A starts, B starts, A ends, B ends. B overlaps end of A.
A starts, B starts, B ends, A ends. B totally contained within A.
B starts, A starts, B ends, A ends. B overlaps start of A.
B starts, A starts, A ends, B ends. A totally enclosed within B.
Fortunately, Postgres provides functionality to handle all of the above, so the query does not have to perform the individual validations: DATERANGEs and the overlap operator (&&). The difficult work then becomes defining each week within A. Then employ the overlap operator on the daterange for each week in A against the daterange for B (start_date, end_date), and finally do conditional aggregation for each overlap detected. See the full example here.
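As a minimal illustration of the overlap operator on DATERANGEs before the full query (dates chosen arbitrarily):
SELECT daterange('2020-10-19', '2020-10-26') && daterange('2020-10-17', '2020-10-25') AS weeks_overlap; -- true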
with problem_list( problem_id ) as
-- identify the specific problem_ids desirded
(select ps.id
from projects p
join problems ps on(ps.id_project = p.id)
where p.id = &selected_project
) --select * from problem_list;
, report_period(srange, erange) as
-- generate the first day of week (Mon) for the
-- oldest start date through day of week of Current_Date
(select min(first_of_week(ps.start_date))
, first_of_week(current_date)
from problem_status ps
join problem_list pl
on (pl.problem_id = ps.id_problem)
) --select * from report_period;
, weekly_calendar(wk,yr, week_dates) as
-- expand the start, end date ranges to week dates (Mon-Sun)
-- and identify the week number with year
(select extract( week from mon)::integer wk
, extract( isoyear from mon)::integer yr
, daterange(mon, mon+6, '[]'::text) wk_dates
from (select generate_series(srange,erange, interval '7 days')::date mon
from report_period
) d
) -- select * from weekly_calendar;
, status_by_week(yr,wk,status) as
-- determine where problem start_date, end_date overlaps each calendar week
-- then where multiple statuses exist for any week keep only the lat
( select yr,wk,status
from (select wc.yr,wc.wk,ps.status
-- , ps.start_date, wc.week_dates,id_problem
, row_number() over (partition by ps.id_problem,yr,wk order by yr, wk, start_date desc) rn
from problem_status ps
join problem_list pl on (pl.problem_id = ps.id_problem)
join weekly_calendar wc on (wc.week_dates && daterange(ps.start_date,ps.end_date)) -- actual overlap test
) ac
where rn=1
) -- select * from status_by_week order by wk;
select 'Project ' || p.id || ': ' || p.description Project
, to_char(wk,'fm09') || '/' || substr(to_char(yr,'fm0000'),3) "WK"
, "Red", "Yellow", "Green"
from projects p
cross join (select sbw.yr,sbw.wk
, count(*) filter (where sbw.status = 'Red') "Red"
, count(*) filter (where sbw.status = 'Yellow') "Yellow"
, count(*) filter (where sbw.status = 'Green') "Green"
from status_by_week sbw
group by sbw.yr, sbw.wk
) sr
where p.id = &selected_project
order by yr,wk;
The CTEs and the main query operate as follows:
problem_list: identifies the problems (id_problem) related to the specified project.
report_period: identifies the full reporting period, start to end.
weekly_calendar: generates the beginning date (Mon) and ending date (Sun) for each week within the reporting period (A above). Along the way it also gathers the week of the year and the ISO year.
status_by_week: this is the real workhorse, performing two tasks. First it passes each problem across each week in the calendar, building a row for each overlap detected. Then it enforces the "one status per week" rule.
Finally, the main select aggregates the statuses into the appropriate buckets and adds the syntactic sugar of the project name.
Note the function first_of_week(). This is a user-defined function, available in the example and below. I created it some time ago and have found it useful. You are free to use it, but you do so without any claim of suitability or guarantee.
create or replace
function first_of_week(date_in date)
returns date
language sql
immutable strict
/*
 * Given a date, return the first day of the week according to ISO-8601.
 *
 * ISO-8601 Standard (in short)
 * 1 All weeks begin on Monday.
 * 2 All weeks have exactly 7 days.
 * 3 The first week of any year begins on the Monday on or before 4-Jan.
 *   This implies that the last few days of Dec may be in the
 *   first week of the following year and that the first few
 *   days of Jan may be in week 52 or 53 of the prior year.
 *   (Not at the same time, obviously.)
 *
 */
as $$
with wk_adj(l_days) as (values (array[0,1,2,3,4,5,6]))
select date_in - l_days[ extract (isodow from date_in)::integer ]
from wk_adj;
$$;
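For example (2020-10-22 is a Thursday; the function walks back to the preceding Monday):
SELECT first_of_week(DATE '2020-10-22'); -- 2020-10-19, a Monday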
In the example I have implemented the query as a SQL function, as it seems db<>fiddle has issues with bound variables and substitution variables; besides, that gave me the ability to parameterize it (I hate hard-coded values). For the example I added additional data for extra testing, mostly data that will not be selected, plus an additional status to show what happens if the query encounters something other than those 3 status values (in this case Pink). This is easy to remove: just get rid of OTHER.
Your notice that "the daterange is covering mon-mon, instead of mon-sun" is incorrect, although it can look that way to someone not used to ranges. Let's take week 43. If you queried the date range it would show [2020-10-19,2020-10-26), and yes, both of those dates are Mondays. However, the bracketing characters have meaning: the leading character [ says the date is included, and the trailing character ) says the date is not included. A standard condition
somedate <@ daterange('2020-10-19','2020-10-26')
is the same as
somedate >= '2020-10-19' and somedate < '2020-10-26'
This is why, when you changed the increment from "mon+6" to "mon+5", you fixed week 43 but introduced errors into other weeks.
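A minimal check of that half-open behavior, using week 43's range:
SELECT daterange('2020-10-19', '2020-10-26') @> DATE '2020-10-25' AS includes_sunday, -- true
       daterange('2020-10-19', '2020-10-26') @> DATE '2020-10-26' AS includes_monday; -- false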
You can fill in blanks using COALESCE to select the first non-null value in the list.
SELECT COALESCE(<some_value_that_could_be_null>, <some_value_that_will_not_be_null>);
If you want to force the bounds of your time range into a result set you can UNION your result set with a specific date.
SELECT ... -- your data query here
UNION ALL
SELECT end_ts -- WHERE end_ts is a timestamptz type
In order to UNION you will need the same arity and the same field types returned in the unioned query. You can fill in everything other than the timestamp with NULL cast to the matching type.
More concrete example:
WITH data AS -- get raw data
(
SELECT p.id
, ps.status
, ps.start_date
, COALESCE(ps.end_date, CURRENT_DATE, '2025-01-01'::DATE) AS end_date -- you can fill in NULL values with COALESCE
, pj.country
, pj.description
, MAX(start_date) OVER (PARTITION BY p.id) AS latest_update
FROM problems p
JOIN projects pj ON (pj.id = p.id_project)
JOIN problems_status ps ON (p.id = ps.id_problem)
UNION ALL -- force bounds in the following
SELECT NULL::INTEGER -- could be null or a defaulted value
, NULL::TEXT -- could be null or a defaulted value
, start_date -- either as an input param to a function or a hard-coded date
, end_date -- either as an input param to a function or a hard-coded date
, NULL::TEXT
, NULL::TEXT
, NULL::DATE
) -- aggregate in the following
SELECT <week> -- you'll have to figure out how you're getting weeks out of the DATE data
, COUNT(*) FILTER (WHERE status = 'Red')
, COUNT(*) FILTER (WHERE status = 'Yellow')
, COUNT(*) FILTER (WHERE status = 'Green')
FROM data
WHERE start_date = latest_update
GROUP BY <week>
;
Some of the features used in this query are very powerful and worth looking up if they're new to you and you're going to be doing a bunch of reporting queries: mainly COALESCE, common table expressions (CTEs), window functions, and aggregate expressions.
Aggregate Expressions
WITH Queries (CTEs)
COALESCE
Window Functions
I wrote a dbfiddle for you to take a look at here after you updated your requirements.

how to concatenate timestamp in different rows in postgresql?

I'm looking for a way to concatenate timestamps from two different rows. For example, I have this table:
I want it to be grouped by weekday and to concatenate the min(start_hour) with the max(stop_hour), to get something like this
and I'm using this query to retrieve the first image result
The query below should give you what you are looking for with the information supplied. I made an assumption: that the '00:00:00' values in the start and stop hours are not valid times and can be ignored. If they should be considered valid, then Friday's output would be one entry of '00:00:00' - '11:30:00'.
I created two CTEs, one for the start hours and the other for the stop hours, where the values are not '00:00:00', and added a row number to each CTE so I can match up the day & row_number to pair them into a set.
SELECT day
,array_to_string(array_agg(t.shift), ',') shifts
FROM (
WITH cte_start AS (
SELECT row_number() OVER (PARTITION BY day)
,day
,start_hour
FROM test22
WHERE start_hour <> '00:00:00'::time
)
,cte_stop AS (
SELECT row_number() OVER (PARTITION BY day)
,day
,stop_hour
FROM test22
WHERE stop_hour <> '00:00:00'::time
)
SELECT cte_start.day
,cte_start.start_hour::varchar || ' - ' || cte_stop.stop_hour::varchar AS shift
FROM cte_start
LEFT OUTER JOIN cte_stop ON cte_start.day = cte_stop.day
AND cte_start.row_number = cte_stop.row_number
) T
GROUP BY T.day
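As an aside, string_agg can do the array_to_string(array_agg(...)) step in a single call; a self-contained sketch with made-up shift values:
SELECT day
      ,string_agg(shift, ',' ORDER BY shift) AS shifts
FROM (VALUES ('friday', '07:00:00 - 11:30:00')
           , ('monday', '08:00:00 - 12:00:00')
           , ('monday', '13:00:00 - 17:30:00')) AS t(day, shift)
GROUP BY day;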
-HTH

How to rewrite SQL joins into window functions?

Database is HP Vertica 7 or PostgreSQL 9.
create table test (
id int,
card_id int,
tran_dt date,
amount int
);
insert into test values (1, 1, '2017-07-06', 10);
insert into test values (2, 1, '2017-06-01', 20);
insert into test values (3, 1, '2017-05-01', 30);
insert into test values (4, 1, '2017-04-01', 40);
insert into test values (5, 2, '2017-07-04', 10);
Of the payment cards used in the last day, what is the maximum amount charged on each of those cards in the last 90 days?
select t.card_id, max(t2.amount) max
from test t
join test t2 on t2.card_id=t.card_id and t2.tran_dt>='2017-04-06'
where t.tran_dt>='2017-07-06'
group by t.card_id
order by t.card_id;
Results are correct
card_id max
------- ---
1 30
I want to rewrite the query into sql window functions.
select card_id, max(amount) over(partition by card_id order by tran_dt range between '90 days' preceding and current row) max
from test
where card_id in (select card_id from test where tran_dt>='2017-07-06')
order by card_id;
But the result set does not match. How can this be done?
Test data here:
http://sqlfiddle.com/#!17/db317/1
I can't try PostgreSQL, but in Vertica, you can apply the ANSI standard OLAP window function.
But you'll need to nest two queries: The window function only returns sensible results if it has all rows that need to be evaluated in the result set.
But you only want the row from '2017-07-06' to be displayed.
So you'll have to filter for that date in an outer query:
WITH olap_output AS (
SELECT
card_id
, tran_dt
, MAX(amount) OVER (
PARTITION BY card_id
ORDER BY tran_dt
RANGE BETWEEN '90 DAYS' PRECEDING AND CURRENT ROW
) AS the_max
FROM test
)
SELECT
card_id
, the_max
FROM olap_output
WHERE tran_dt='2017-07-06'
;
card_id|the_max
1| 30
As far as I know, PostgreSQL's window functions don't support a bounded RANGE preceding, so range between '90 days' preceding won't work. They do support bounded ROWS preceding, such as rows between 90 preceding, but then you would need to assemble a time-series query similar to the following for the window function to operate on time-based rows:
SELECT c.card_id, t.amount, g.d as d_series
FROM generate_series(
'2017-04-06'::timestamp, '2017-07-06'::timestamp, '1 day'::interval
) g(d)
CROSS JOIN ( SELECT distinct card_id from test ) c
LEFT JOIN test t ON t.card_id = c.card_id and t.tran_dt = g.d
ORDER BY c.card_id, d_series
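To complete the idea, an untested sketch that layers a ROWS frame over that scaffold and filters to the target date in an outer query (the CTE and column names are mine). It assumes at most one transaction per card per day, so 90 rows approximate 90 days, and it still lacks the "cards used in the last day" restriction, which you would add the same way as in your IN subquery:
WITH series AS (
  SELECT c.card_id, g.d::date AS d_series, t.amount
  FROM generate_series(
    '2017-04-06'::timestamp, '2017-07-06'::timestamp, '1 day'::interval
  ) g(d)
  CROSS JOIN (SELECT DISTINCT card_id FROM test) c
  LEFT JOIN test t ON t.card_id = c.card_id AND t.tran_dt = g.d
), windowed AS (
  SELECT card_id, d_series,
         MAX(amount) OVER (PARTITION BY card_id
                           ORDER BY d_series
                           ROWS BETWEEN 90 PRECEDING AND CURRENT ROW) AS max_90d
  FROM series
)
SELECT card_id, max_90d
FROM windowed
WHERE d_series = DATE '2017-07-06';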
For what you need (based on your question description), I would stick to using group by.

Aggregated values depending on an other field

I have a table with a date-time and multiple properties, some of which I group by and some of which I aggregate; a typical query is "get me revenue per customer last week".
Now I want to see the change between the requested period and the previous one, so I will have 2 columns: revenue and previous_revenue.
Right now I'm requesting the rows of the requested period plus the rows of the previous period, and for each aggregated field I add a CASE statement that returns the value, or 0 if the row is not in the period I want.
That leads to as many CASEs as there are aggregated fields, but always with the same conditional statement.
I'm wondering if there is a better design for this use case...
SELECT
    customer,
    SUM(
        CASE TIMESTAMP_CMP('2016-07-01 00:00:00', ft.date_hour) <= 0 WHEN true THEN
            REVENUE
        ELSE 0 END
    ) AS revenue,            -- requested period: on/after 2016-07-01
    SUM(
        CASE TIMESTAMP_CMP('2016-07-01 00:00:00', ft.date_hour) > 0 WHEN true THEN
            REVENUE
        ELSE 0 END
    ) AS previous_revenue    -- previous period: before 2016-07-01
FROM _revenue ft             -- fact table; name assumed to match the answer below
WHERE ft.date_hour >= '2016-06-01 00:00:00'
    AND ft.date_hour <= '2016-07-31 23:59:59'
GROUP BY customer
(In my real use case I have many columns which make it even more ugly)
First, I'd suggest to refactor out the timestamps and precalculate the current and previous period for later use. This is not strictly necessary to solve your problem, though:
create temporary table _period as
select
'2016-07-01 00:00:00'::timestamp as curr_period_start
, '2016-07-31 23:59:59'::timestamp as curr_period_end
, '2016-06-01 00:00:00'::timestamp as prev_period_start
, '2016-06-30 23:59:59'::timestamp as prev_period_end
;
Now, a possible design that avoids repeating the timestamps and CASE statements is to group by the periods first and then do a FULL OUTER JOIN of that aggregate with itself:
with _aggregate as (
select
case
when date_hour between prev_period_start and prev_period_end then 'previous'
when date_hour between curr_period_start and curr_period_end then 'current'
end::varchar(20) as period
, customer
-- < other columns to group by go here >
, sum(revenue) as revenue
-- < other aggregates go here >
from
_revenue, _period
where
date_hour between prev_period_start and curr_period_end
group by 1, 2
)
select
customer
, current_period.revenue as revenue
, previous_period.revenue as previous_revenue
from
(select * from _aggregate where period = 'previous') previous_period
full outer join (select * from _aggregate where period = 'current') current_period
using(customer) -- All columns which have been group by must go into the using() clause:
-- e.g. using(customer, some_column, another_column)
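If your engine is PostgreSQL 9.4 or later, an aggregate FILTER clause collapses this to a single pass with no self-join; a sketch under that assumption, reusing _period and the hypothetical _revenue table (note TIMESTAMP_CMP in the question hints at Redshift, which does not support FILTER):
select
    customer
    -- < other columns to group by go here >
  , sum(revenue) filter (where date_hour between curr_period_start and curr_period_end) as revenue
  , sum(revenue) filter (where date_hour between prev_period_start and prev_period_end) as previous_revenue
from
    _revenue, _period
where
    date_hour between prev_period_start and curr_period_end
group by
    customer;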
;