How to shorten loading time with multiple subqueries and joins? - postgresql

I have a query that finds which v_address belongs to which v_group and when each v_address was created. Each v_group was active at some times and inactive at others, and every status change has a date, so I wrote the query to find the periods between those changes.
My problem is that the query takes too long to run because it has multiple subqueries and joins.
Does anyone have a better idea for shortening the load time?
My query is below; identifier = 81 is a small test case. For the final result, I will need to retrieve data for more than 100k ids.
I tried changing the inner joins to left joins, but that loses the null values and still takes too long to retrieve the data.
with v_address as (
select id, v_group_id, created_at
from prod.venue_address --different schema and table. v_address.v_group_id can have multiple v_address.id
group by 1,2,3
order by 3 asc
),
v_group as (
select identifier, final_status, created_at
from dwh.venue_address_archive
where identifier = 81
group by 1,2,3
order by 3 asc
),
filtering as (
select identifier, created_at,
case when sum(case when final_status = 'active' then 1 else 0 end) > 0 then 'active' else 'inactive' end as filtered_status --This filters either of active or inactive
from v_group
group by 1,2
order by 2 asc
),
prev as (
select identifier, created_at, filtered_status,
lag(case when filtered_status = 'active' then 'active' else 'inactive' end) over (partition by identifier order by created_at) = filtered_status as is_prev_status
from filtering
group by 1,2,3
),
periods as (
select identifier, filtered_status, created_at, is_prev_status,
sum(case when is_prev_status = true then 0 else 1 end) over (order by identifier, created_at) as period
from prev
group by 1,2,3,4
),
islands_gaps_start as (
select identifier, period, min(created_at) as start_at
from periods
group by 2,1
),
islands_gaps as (
select identifier, period, start_at,
lead(start_at) over (partition by identifier order by period) as end_at
from islands_gaps_start
)
select vg.identifier as "vg_id", p.created_at as "vg_created_at", p.filtered_status as "status", p.is_prev_status, p.period, va.id as "va_id", g.start_at, g.end_at
from v_address va
left join v_group vg
on va.v_group_id = vg.identifier
inner join filtering f
on vg.identifier = f.identifier
inner join prev pr
on pr.filtered_status = f.filtered_status
inner join periods p
on p.filtered_status = pr.filtered_status
inner join islands_gaps_start gs
on p.period = gs.period
inner join islands_gaps g
on gs.start_at = g.start_at
group by 6,1,2,3,4,5,7,8
order by 2 asc
I already have the output for the example identifier = 81, but I have to run this query for more than 100k identifiers, so I'm looking for any advice on how to speed it up.
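The period logic described in the question (flag each status change, number the periods with a running sum, then take each period's start and the next period's start as its end) can be sketched in a single pass that never joins the intermediate results back together. This is only an illustration under assumptions: the table status_log and its sample rows are hypothetical, and the SQL runs on SQLite through Python so that it is self-contained; the same window functions exist in PostgreSQL. Partitioning every window by identifier keeps the numbering separate for each identifier, which matters once more than one identifier is processed at a time.

```python
import sqlite3

# Hypothetical sample data: one row per status observation of a group.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE status_log (identifier INTEGER, created_at TEXT, status TEXT);
INSERT INTO status_log VALUES
  (81, '2021-01-01', 'active'),
  (81, '2021-01-05', 'active'),
  (81, '2021-02-01', 'inactive'),
  (81, '2021-03-01', 'active'),
  (82, '2021-01-10', 'inactive'),
  (82, '2021-01-20', 'active');
""")

PERIODS_SQL = """
WITH flagged AS (
  -- 1 when the status differs from the previous row of the same identifier
  -- (or when there is no previous row), otherwise 0
  SELECT identifier, created_at, status,
         CASE WHEN status = LAG(status) OVER (PARTITION BY identifier
                                              ORDER BY created_at)
              THEN 0 ELSE 1 END AS is_change
  FROM status_log
),
numbered AS (
  -- running sum of the change flags gives a period number per identifier
  SELECT identifier, created_at, status,
         SUM(is_change) OVER (PARTITION BY identifier
                              ORDER BY created_at) AS period
  FROM flagged
),
starts AS (
  SELECT identifier, period, status, MIN(created_at) AS start_at
  FROM numbered
  GROUP BY identifier, period, status
)
-- the end of a period is the start of the next one (NULL while still open)
SELECT identifier, period, status, start_at,
       LEAD(start_at) OVER (PARTITION BY identifier ORDER BY period) AS end_at
FROM starts
ORDER BY identifier, period
"""

periods = conn.execute(PERIODS_SQL).fetchall()
for row in periods:
    print(row)
```

Whether this is faster on the real tables depends on data volume and indexes, so it is worth comparing both versions with EXPLAIN ANALYZE.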

Related

Postgres - partitioning a table

What does the code look like to partition the following table? date and status are given; the partition column is to be added. The Group column is there only to show where each group starts and ends.
In the end, I would like to do some analytics, e.g. how long the process takes per group.
In words (I don't know how to convert this to code):
status 'approved' always defines the end. Only an 'open' after an 'approved' defines the start. The other 'open' rows are not relevant.
date        status    Group           Partition
-----------------------------------------------
1.10.2022   open      Group 1 Starts  1
2.10.2022   waiting                   1
3.10.2022   open                      1
4.10.2022   waiting                   1
5.10.2022   approved  Group 1 Ends    1
7.10.2022   open      Group 2 Start   2
8.10.2022   waiting                   2
9.10.2022   open                      2
10.10.2022  waiting                   2
11.10.2022  open                      2
12.10.2022  waiting                   2
15.10.2022  approved  Group 2 Ends    2
17.10.2022  open      Group 3 Starts  3
20.10.2022  waiting                   3
Thanks for the solution. Works fine :-) And sorry for not using the right expression. If Group is better than Partition, even better...
Can we make it slightly more complicated?
This pattern in the table applies to several parent records, so in reality there is an additional column Parent ID. The table shown here is then, for example, for parent ID A. There are many more parents.
How can an additional grouping by Parent ID be added?
At each new parent the counting starts again at 1.
Assuming you have the first two columns and want to derive the last two, treat this as a gaps-and-islands problem:
with groups as ( -- Assign partitions
  select *,
         coalesce(
           sum(case when status = 'approved' then 1 else 0 end)
             over (order by date rows between unbounded preceding
                                          and 1 preceding),
           0
         ) + 1 as partition
  from do_part
)
select date, status,
       case -- Construct text descriptions
         when partition != coalesce(lead(partition) over w, partition)
           then format('Group %s Ends', partition)
         when partition = lag(partition) over w
           then ''
         else format('Group %s Starts', partition)
       end as "group",
       partition
from groups
window w as (order by date);
Fiddle here
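For the follow-up about Parent ID, one possible extension is to add the parent column to the window's PARTITION BY, so the running count of preceding 'approved' rows restarts for every parent. The sketch below is an assumption-laden illustration: the column name parent_id and the sample rows are hypothetical, the output column is called grp, and the SQL runs on SQLite through Python purely to keep it self-contained.

```python
import sqlite3

# Hypothetical data: the same status pattern for two parents, A and B.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE do_part (parent_id TEXT, date TEXT, status TEXT);
INSERT INTO do_part VALUES
  ('A', '2022-10-01', 'open'),
  ('A', '2022-10-02', 'waiting'),
  ('A', '2022-10-05', 'approved'),
  ('A', '2022-10-07', 'open'),
  ('B', '2022-10-03', 'open'),
  ('B', '2022-10-04', 'approved'),
  ('B', '2022-10-06', 'open'),
  ('B', '2022-10-08', 'approved'),
  ('B', '2022-10-09', 'open');
""")

rows = conn.execute("""
SELECT parent_id, date, status,
       -- count the 'approved' rows strictly before this one, per parent;
       -- the frame is empty on a parent's first row, hence COALESCE
       COALESCE(
         SUM(CASE WHEN status = 'approved' THEN 1 ELSE 0 END)
           OVER (PARTITION BY parent_id ORDER BY date
                 ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),
         0) + 1 AS grp
FROM do_part
ORDER BY parent_id, date
""").fetchall()

for row in rows:
    print(row)
```

The text labels ('Group n Starts' / 'Group n Ends') would need the same PARTITION BY parent_id added to their lead/lag window.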
Demo based on Mike Organek's fiddle.
Idea: a left join combined with DISTINCT ON can cut the groups properly.
SELECT DISTINCT ON (date,status)
date,
status,
coalesce(date_d, CURRENT_DATE) AS date_end
FROM
do_part t
LEFT JOIN (
SELECT
date AS date_d
FROM
do_part
WHERE
status = 'approved'
ORDER BY
date) s ON s.date_d >= t.date
ORDER BY
date,status,
date_d;
Final query (can be simplified):
WITH cte AS (
SELECT DISTINCT ON (date,
status)
date,
status,
coalesce(date_d, CURRENT_DATE) AS date_end
FROM
do_part t
LEFT JOIN (
SELECT
date AS date_d
FROM
do_part
WHERE
status = 'approved'
ORDER BY
date) s ON s.date_d >= t.date
ORDER BY
date,
status,
date_d
),
cte1 AS (
SELECT
*,
date_end - first_value(date) OVER (PARTITION BY date_end ORDER BY date) AS date_gap,
dense_rank() OVER (ORDER BY date_end),
CASE WHEN (date = first_value(date) OVER (PARTITION BY date_end ORDER BY date)) THEN
'group begin'
WHEN (status = 'approved') THEN
'group end '
ELSE
NULL
END AS grp
FROM
cte
)
SELECT
*,
CASE WHEN grp IS NOT NULL THEN
grp || dense_rank::text
END
FROM
cte1;

How to repeat some data points in query results?

I am trying to get the max date by account from 3 different tables and view those dates side by side. I created a separate query for each table, merged the results with UNION ALL, and then wrapped all that in a PIVOT.
The first 2 sections in the link/pic below show what I have been able to accomplish and the 3rd section is what I would like to do.
Query results by step
How can I get the results from 2 of the tables to repeat? Is that possible?
--define var_ent_type = 'ACOM'
--define var_ent_id = '52766'
--define var_dict_id = 113
SELECT
*
FROM
(
SELECT
E.ENTITY_TYPE,
E.ENTITY_ID,
'PERF_SUMMARY' as "TableName",
PS.DICTIONARY_ID,
to_char(MAX(PS.END_EFFECTIVE_DATE), 'YYYY-MM-DD') as "MaxDate"
FROM
RULESDBO.ENTITY E
INNER JOIN PERFORMDBO.PERF_SUMMARY PS ON (PS.ENTITY_ID = E.ENTITY_ID)
WHERE
1=1
-- AND E.ENTITY_TYPE = '&var_ent_type'
-- AND E.ENTITY_ID = '&var_ent_id'
AND PS.DICTIONARY_ID >= 100
AND (E.ACTIVE_STATUS <> 'N' )--and E.TERMINATION_DATE is null )
GROUP BY
E.ENTITY_TYPE,
E.ENTITY_ID,
'PERF_SUMMARY',
PS.DICTIONARY_ID
union all
SELECT
E.ENTITY_TYPE,
E.ENTITY_ID,
'POSITION' as "TableName",
0 as DICTIONARY_ID,
to_char(MAX(H.EFFECTIVE_DATE), 'YYYY-MM-DD') as "MaxDate"
FROM
RULESDBO.ENTITY E
INNER JOIN HOLDINGDBO.POSITION H ON (H.ENTITY_ID = E.ENTITY_ID)
WHERE
1=1
-- AND E.ENTITY_TYPE = '&var_ent_type'
-- AND E.ENTITY_ID = '&var_ent_id'
AND (E.ACTIVE_STATUS <> 'N' )--and E.TERMINATION_DATE is null )
GROUP BY
E.ENTITY_TYPE,
E.ENTITY_ID,
'POSITION',
1
union all
SELECT
E.ENTITY_TYPE,
E.ENTITY_ID,
'CASH_ACTIVITY' as "TableName",
0 as DICTIONARY_ID,
to_char(MAX(C.EFFECTIVE_DATE), 'YYYY-MM-DD') as "MaxDate"
FROM
RULESDBO.ENTITY E
INNER JOIN CASHDBO.CASH_ACTIVITY C ON (C.ENTITY_ID = E.ENTITY_ID)
WHERE
1=1
-- AND E.ENTITY_TYPE = '&var_ent_type'
-- AND E.ENTITY_ID = '&var_ent_id'
AND (E.ACTIVE_STATUS <> 'N' )--and E.TERMINATION_DATE is null )
GROUP BY
E.ENTITY_TYPE,
E.ENTITY_ID,
'CASH_ACTIVITY',
1
--ORDER BY
-- 2,3, 4
)
PIVOT
(
MAX("MaxDate")
FOR "TableName"
IN ('CASH_ACTIVITY', 'PERF_SUMMARY','POSITION')
)
Everything is possible. You only need a window function to make the value repeat across the rows that have no data.
--Assuming current query is QC
With QC as (
...
)
select code, account, grouping,
--cash,
first_value(cash) over (partition by code, account order by grouping asc rows unbounded preceding) as cash_repeat,
perf,
--pos,
first_value(pos) over (partition by code, account order by grouping asc rows unbounded preceding) as pos_repeat
from QC
;
See first_value() help here: https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/FIRST_VALUE.html#GUID-D454EC3F-370C-4C64-9B11-33FCB10D95EC
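A small illustration of how first_value() repeats a value down the rows of a partition. Everything here is a hypothetical stand-in for the pivoted result: the table qc, its columns, and the sample dates are invented, and the SQL runs on SQLite through Python rather than on Oracle. It relies on the assumption that the row carrying the cash and pos dates sorts first within each account (as the rows with DICTIONARY_ID 0 would in the question's query).

```python
import sqlite3

# Hypothetical pivoted result: cash/pos are only filled on the
# dictionary_id = 0 row, perf only on the other rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE qc (account TEXT, dictionary_id INTEGER,
                 cash TEXT, perf TEXT, pos TEXT);
INSERT INTO qc VALUES
  ('52766', 0,   '2024-01-31', NULL,         '2024-01-30'),
  ('52766', 100, NULL,         '2023-12-31', NULL),
  ('52766', 113, NULL,         '2023-11-30', NULL);
""")

rows = conn.execute("""
SELECT account, dictionary_id,
       FIRST_VALUE(cash) OVER w AS cash_repeat,
       perf,
       FIRST_VALUE(pos)  OVER w AS pos_repeat
FROM qc
WINDOW w AS (PARTITION BY account ORDER BY dictionary_id
             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
ORDER BY account, dictionary_id
""").fetchall()

for row in rows:
    print(row)
```

If the populated row does not sort first, first_value() returns NULL for the whole partition, so the ordering column needs to be chosen with that in mind.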

Query to select by number of associated objects

I have two tables that look like the following:
Orders
------
id
tracking_number
ShippingLogs
------
tracking_number
created_at
stage
I would like to select the IDs of Orders that have ONLY ONE ShippingLog associated with them, and the stage of that ShippingLog must be error. If an order has two ShippingLog entries, I don't want it. If it has one ShippingLog but its stage is shipped, I don't want it.
This is what I have. It doesn't work, and I know why (it finds the log with the error but has no way of knowing whether there are others). I just don't know how to get it the way I need it.
SELECT DISTINCT
orders.id, shipping_logs.created_at, COUNT(shipping_logs.*)
FROM
orders
JOIN
shipping_logs ON orders.tracking_number = shipping_logs.tracking_number
WHERE
shipping_logs.created_at BETWEEN '2021-01-01 23:40:00'::timestamp AND '2021-01-26 23:40:00'::timestamp AND shipping_logs.stage = 'error'
GROUP BY
orders.id, shipping_logs.created_at
HAVING
COUNT(shipping_logs.*) = 1
ORDER BY
orders.id, shipping_logs.created_at DESC;
If you want to retain every column from the join of the two tables given your requirements, then I would suggest using COUNT here as an analytic function:
WITH cte AS (
SELECT o.id, sl.created_at,
COUNT(*) OVER (PARTITION BY o.id) num_logs,
COUNT(*) FILTER (WHERE sl.stage <> 'error')
OVER (PARTITION BY o.id) non_error_cnt
FROM orders o
INNER JOIN shipping_logs sl ON sl.tracking_number = o.tracking_number
WHERE sl.created_at BETWEEN '2021-01-01 23:40:00'::timestamp AND
'2021-01-26 23:40:00'::timestamp
)
SELECT id AS order_id, created_at
FROM cte
WHERE num_logs = 1 AND non_error_cnt = 0
ORDER BY id, created_at DESC;
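If only the order IDs are needed, rather than every column of the join, the same two conditions can also be expressed with a plain GROUP BY and HAVING. The sketch below shows that variant, not the analytic-function version above; the sample rows are hypothetical, the date filter is left out for brevity, and the SQL runs on SQLite through Python so it can be executed as-is.

```python
import sqlite3

# Hypothetical data: order 1 has a single 'error' log, order 2 has two logs,
# order 3 has a single 'shipped' log.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, tracking_number TEXT);
CREATE TABLE shipping_logs (tracking_number TEXT, created_at TEXT, stage TEXT);
INSERT INTO orders VALUES (1, 'T1'), (2, 'T2'), (3, 'T3');
INSERT INTO shipping_logs VALUES
  ('T1', '2021-01-05 10:00:00', 'error'),
  ('T2', '2021-01-06 10:00:00', 'error'),
  ('T2', '2021-01-07 10:00:00', 'shipped'),
  ('T3', '2021-01-08 10:00:00', 'shipped');
""")

rows = conn.execute("""
SELECT o.id
FROM orders o
JOIN shipping_logs sl ON sl.tracking_number = o.tracking_number
GROUP BY o.id
-- exactly one log in total, and none of the logs is a non-error log
HAVING COUNT(*) = 1
   AND SUM(CASE WHEN sl.stage <> 'error' THEN 1 ELSE 0 END) = 0
ORDER BY o.id
""").fetchall()

print(rows)
```

The key point in both variants is that the stage test is not placed in WHERE: filtering on stage = 'error' before counting hides the other logs of the same order.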

Postgresql count by past weeks

select id, wk0_count
from teams
left join
(select team_id, count(team_id) as wk0_count
from (
select created_at, team_id, trunc(EXTRACT(EPOCH FROM age(CURRENT_TIMESTAMP,created_at)) / 604800) as wk_offset
from loan_files
where loan_type <> 2
order by created_at DESC) as t1
where wk_offset = 0
group by team_id) as t_wk0
on teams.id = t_wk0.team_id
I've created the query above that shows me how many loans each team did in a given week. Week 0 is the past seven days.
Ideally I want a table that shows how many loans each team did in the last 8 weeks, grouped by week. The output would look like:
Any ideas on the best way to do this?
select
t.id,
count(week = 0 or null) as wk0,
count(week = 1 or null) as wk1,
count(week = 2 or null) as wk2,
count(week = 3 or null) as wk3
from
teams t
left join
loan_files lf on lf.team_id = t.id and loan_type <> 2
cross join lateral
(select (current_date - created_at::date) / 7 as week) w
group by 1
In versions 9.4 and later, use the aggregate filter syntax:
count(*) filter (where week = 0) as wk0,
lateral is available from 9.3. In an earlier version, write the week expression directly inside each count condition.
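The conditional-count pivot can be tried on a small data set as follows. This is a sketch under assumptions: the sample rows and the fixed reference date are hypothetical (a fixed date keeps the result reproducible), and it runs on SQLite through Python, so the week number is derived with julianday() and the counts use CASE expressions instead of PostgreSQL's date subtraction and FILTER clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE teams (id INTEGER);
CREATE TABLE loan_files (team_id INTEGER, loan_type INTEGER, created_at TEXT);
INSERT INTO teams VALUES (1), (2), (3);
INSERT INTO loan_files VALUES
  (1, 1, '2024-03-14'),  -- 1 day before the reference date: week 0
  (1, 1, '2024-03-09'),  -- 6 days before: week 0
  (1, 1, '2024-03-08'),  -- 7 days before: week 1
  (1, 2, '2024-03-14'),  -- loan_type 2 is excluded
  (2, 3, '2024-02-29');  -- 15 days before: week 2
""")

TODAY = "2024-03-15"  # hypothetical stand-in for current_date

rows = conn.execute("""
SELECT t.id,
       COUNT(CASE WHEN w.week = 0 THEN 1 END) AS wk0,
       COUNT(CASE WHEN w.week = 1 THEN 1 END) AS wk1,
       COUNT(CASE WHEN w.week = 2 THEN 1 END) AS wk2,
       COUNT(CASE WHEN w.week = 3 THEN 1 END) AS wk3
FROM teams t
LEFT JOIN (
  -- whole weeks elapsed between the loan date and the reference date
  SELECT team_id,
         CAST((julianday(:today) - julianday(created_at)) / 7 AS INTEGER) AS week
  FROM loan_files
  WHERE loan_type <> 2
) w ON w.team_id = t.id
GROUP BY t.id
ORDER BY t.id
""", {"today": TODAY}).fetchall()

for row in rows:
    print(row)
```

Team 3 has no loans and still appears with zero counts, because the loan filter sits inside the joined subquery rather than in the outer WHERE clause.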
How about the following query?
SELECT teams.id, count(team_id) AS wk0_count
FROM teams
LEFT JOIN loan_files
  ON teams.id = team_id
 AND loan_type <> 2
 AND trunc(EXTRACT(epoch FROM age(CURRENT_TIMESTAMP, created_at)) / 604800) = 0
GROUP BY teams.id
Notable changes are:
the ORDER BY clause in the subquery was pointless;
created_at in the innermost subquery was never used;
the wk_offset test is applied directly in the join condition instead of in two distinct steps (it goes in the ON clause rather than WHERE, so that teams without matching loans are still returned with a count of 0);
the outermost subquery was not needed.

Grouping consecutive dates in PostgreSQL

I have two tables that I need to combine, because some dates are found in table A but not in table B and vice versa. My desired result is that ranges that overlap or fall on consecutive days are merged.
I'm using PostgreSQL.
Table A
id startdate enddate
--------------------------
101 12/28/2013 12/31/2013
Table B
id startdate enddate
--------------------------
101 12/15/2013 12/15/2013
101 12/16/2013 12/16/2013
101 12/28/2013 12/28/2013
101 12/29/2013 12/31/2013
Desired Result
id startdate enddate
-------------------------
101 12/15/2013 12/16/2013
101 12/28/2013 12/31/2013
Right. I have a query that I think works. It certainly works on the sample records you provided. It uses a recursive CTE.
First, you need to merge the two tables. Next, use a recursive CTE to get the sequences of overlapping dates. Finally, get the start and end dates, and join back to the "merged" table to get the id.
with recursive allrecords as -- this merges the input tables. Add a unique row identifier
(
select *, row_number() over (ORDER BY startdate) as rowid from
(select * from table1
UNION
select * from table2) a
),
path as ( -- the recursive CTE. This gets the sequences
select rowid as parent,rowid,startdate,enddate from allrecords a
union
select p.parent,b.rowid,b.startdate,b.enddate from allrecords b join path p on (p.enddate + interval '1 day')>=b.startdate and p.startdate <= b.startdate
)
SELECT id,g.startdate,g.enddate FROM -- outer query to get the id
-- inner query to get the start and end of each sequence
(select parent,min(startdate) as startdate, max(enddate) as enddate from
(
select *, row_number() OVER (partition by rowid order by parent,startdate) as row_number from path
) a
where row_number = 1 -- We only want the first occurrence of each record
group by parent)g
INNER JOIN allrecords a on a.rowid = parent
The fragment below does what you intend (but it will probably be very slow). The problem is that detecting (non-)overlapping date ranges is impossible with the standard range operators, since a range could be split into two parts.
So, my code does the following:
split the date ranges from table_A into atomic records, with one date per record
[the same for table_b]
full outer join these two sets (we are only interested in A_not_in_B and B_not_in_A), remembering which wing of the outer join each record came from.
re-aggregate the resulting records into date ranges.
-- EXPLAIN ANALYZE
--
WITH RECURSIVE ranges AS (
-- Chop up the a-table into atomic date units
WITH ar AS (
SELECT generate_series(a.startdate,a.enddate , '1day'::interval)::date AS thedate
, 'A'::text AS which
, a.id
FROM a
)
-- Same for the b-table
, br AS (
SELECT generate_series(b.startdate,b.enddate, '1day'::interval)::date AS thedate
, 'B'::text AS which
, b.id
FROM b
)
-- combine the two sets, retaining a_not_in_b plus b_not_in_a
, moments AS (
SELECT COALESCE(ar.id,br.id) AS id
, COALESCE(ar.which, br.which) AS which
, COALESCE(ar.thedate, br.thedate) AS thedate
FROM ar
FULL JOIN br ON br.id = ar.id AND br.thedate = ar.thedate
WHERE ar.id IS NULL OR br.id IS NULL
)
-- use a recursive CTE to re-aggregate the atomic moments into ranges
SELECT m0.id, m0.which
, m0.thedate AS startdate
, m0.thedate AS enddate
FROM moments m0
WHERE NOT EXISTS ( SELECT * FROM moments nx WHERE nx.id = m0.id AND nx.which = m0.which
AND nx.thedate = m0.thedate -1
)
UNION ALL
SELECT rr.id, rr.which
, rr.startdate AS startdate
, m1.thedate AS enddate
FROM ranges rr
JOIN moments m1 ON m1.id = rr.id AND m1.which = rr.which AND m1.thedate = rr.enddate +1
)
SELECT * FROM ranges ra
WHERE NOT EXISTS (SELECT * FROM ranges nx
-- suppress partial subassemblies
WHERE nx.id = ra.id AND nx.which = ra.which
AND nx.startdate = ra.startdate
AND nx.enddate > ra.enddate
)
;
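For comparison, the merge that the question asks for (combine both tables, then collapse ranges that overlap or touch on consecutive days) can also be sketched without recursion, using window functions in the gaps-and-islands style. This is a different technique from the two answers above, shown only as an illustration: the table names table_a and table_b are assumptions, the dates are stored as ISO text, and the SQL runs on SQLite through Python (in PostgreSQL the day arithmetic would be done directly on date values).

```python
import sqlite3

# The sample rows from the question, with dates written in ISO format.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_a (id INTEGER, startdate TEXT, enddate TEXT);
CREATE TABLE table_b (id INTEGER, startdate TEXT, enddate TEXT);
INSERT INTO table_a VALUES (101, '2013-12-28', '2013-12-31');
INSERT INTO table_b VALUES
  (101, '2013-12-15', '2013-12-15'),
  (101, '2013-12-16', '2013-12-16'),
  (101, '2013-12-28', '2013-12-28'),
  (101, '2013-12-29', '2013-12-31');
""")

MERGE_SQL = """
WITH allrecords AS (
  SELECT id, startdate, enddate FROM table_a
  UNION
  SELECT id, startdate, enddate FROM table_b
),
marked AS (
  -- latest end date seen so far among the earlier rows of the same id
  SELECT *,
         MAX(enddate) OVER (PARTITION BY id ORDER BY startdate, enddate
                            ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
           AS prev_end
  FROM allrecords
),
flagged AS (
  -- a new island starts when there is a gap of more than one day
  SELECT *,
         CASE WHEN prev_end IS NULL
                OR julianday(startdate) > julianday(prev_end) + 1
              THEN 1 ELSE 0 END AS is_start
  FROM marked
),
grouped AS (
  SELECT *,
         SUM(is_start) OVER (PARTITION BY id
                             ORDER BY startdate, enddate) AS grp
  FROM flagged
)
SELECT id, MIN(startdate) AS startdate, MAX(enddate) AS enddate
FROM grouped
GROUP BY id, grp
ORDER BY id, startdate
"""

merged = conn.execute(MERGE_SQL).fetchall()
for row in merged:
    print(row)
```

On the question's sample rows this yields the two ranges listed under Desired Result; behaviour on larger or messier data has not been checked here.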