Subquery in select statement from a join table - postgresql

I have an accounts table, units table and reports table. An account has many units (foreign key of units is account_id), and a unit has many reports (foreign key of reports is unit_id). I want to select account name, total number of units for that account, and the last report time:
SELECT accounts.name AS account_name,
COUNT(units.id) AS unit_count,
(SELECT reports.time FROM reports INNER JOIN units ON units.id = reports.unit_id ORDER BY time desc LIMIT 1) AS last_reported_time
FROM accounts
INNER JOIN units ON accounts.id = units.account_id
INNER JOIN reports ON units.id = reports.unit_id
GROUP BY account_name, last_reported_time
ORDER BY unit_count desc;
This query has been running forever, and I am not sure it's doing what I expect.
An account has many units and a unit has many reports. I want to display the time of the newest report from all the units associated for each given account. Is this query correct? If not, how can I accomplish my task (if possible without using a scripting language)?
The result of EXPLAIN:
Sort (cost=21466114.58..21466547.03 rows=172980 width=38)
Sort Key: (count(public.units.id))
InitPlan 1 (returns $0)
-> Limit (cost=0.00..12.02 rows=1 width=8)
-> Nested Loop (cost=0.00..928988485.04 rows=77309416 width=8)
-> Index Scan Backward using index_reports_on_time on reports (cost=0.00..296291138.34 rows=77309416 width=12)
-> Index Scan using units_pkey on units (cost=0.00..8.17 rows=1 width=4)
Index Cond: (public.units.id = public.reports.unit_id)
-> GroupAggregate (cost=20807359.99..21446321.09 rows=172980 width=38)
-> Sort (cost=20807359.99..20966559.70 rows=63679885 width=38)
Sort Key: accounts.name, public.units.last_reported_time
-> Hash Join (cost=975.50..3846816.82 rows=63679885 width=38)
Hash Cond: (public.reports.unit_id = public.units.id)
-> Seq Scan on reports (cost=0.00..2919132.16 rows=77309416 width=4)
-> Hash (cost=961.43..961.43 rows=1126 width=38)
-> Hash Join (cost=16.37..961.43 rows=1126 width=38)
Hash Cond: (public.units.account_id = accounts.id)
-> Seq Scan on units (cost=0.00..928.67 rows=1367 width=28)
-> Hash (cost=11.72..11.72 rows=372 width=18)
-> Seq Scan on accounts (cost=0.00..11.72 rows=372 width=18)

About 95% of the cost of the query is here:
-> Sort (cost=20807359.99..20966559.70 rows=63679885 width=38)
Sort Key: accounts.name, public.units.last_reported_time
-> Hash Join (cost=975.50..3846816.82 rows=63679885 width=38)
Hash Cond: (public.reports.unit_id = public.units.id)
-> Seq Scan on reports (cost=0.00..2919132.16 rows=77309416 width=4)
Do you have an index on reports.unit_id? If not, you should definitely add one.
Other than that, the output column unit_count appears to give the number of units per account, but calculating it after all the joins and then ordering by it is very wasteful. The sub-query in the select list is something of a mystery to me; I presume you want the most recent reporting time per account, but as written it returns the single most recent time across all units combined. Try this instead:
SELECT a.name AS account_name, u.unit_count, r.last_reported_time
FROM accounts a
JOIN (
SELECT account_id, COUNT(*) AS unit_count
FROM units
GROUP BY 1) u ON u.account_id = a.id
LEFT JOIN ( -- allow for units that have not yet submitted a report
SELECT u.account_id, max(r.time) AS last_reported_time
FROM reports r
JOIN units u ON u.id = r.unit_id
GROUP BY 1) r ON r.account_id = a.id
ORDER BY 2 DESC;
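The aggregate-first, join-later rewrite can be sanity-checked on a toy dataset. Here's a minimal sketch using SQLite from Python; the rows are invented, but the table and column names mirror the question, and SQLite's dialect is close enough for this pattern:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE units (id INTEGER PRIMARY KEY, account_id INTEGER REFERENCES accounts(id));
CREATE TABLE reports (id INTEGER PRIMARY KEY, unit_id INTEGER REFERENCES units(id), time TEXT);
INSERT INTO accounts VALUES (1, 'acme'), (2, 'globex');
INSERT INTO units VALUES (10, 1), (11, 1), (20, 2);
INSERT INTO reports VALUES
  (100, 10, '2023-01-01'),
  (101, 11, '2023-02-01'),
  (102, 20, '2023-03-01');
""")

# Each aggregate runs over its own table before the join, so no huge
# post-join sort is needed to feed a GroupAggregate.
rows = con.execute("""
SELECT a.name AS account_name, u.unit_count, r.last_reported_time
FROM accounts a
JOIN (SELECT account_id, COUNT(*) AS unit_count
      FROM units GROUP BY account_id) u ON u.account_id = a.id
LEFT JOIN (SELECT u.account_id, MAX(r.time) AS last_reported_time
           FROM reports r JOIN units u ON u.id = r.unit_id
           GROUP BY u.account_id) r ON r.account_id = a.id
ORDER BY u.unit_count DESC;
""").fetchall()
print(rows)
```

With this data, acme (two units, newest report 2023-02-01) sorts first, which matches what the rewritten Postgres query would return.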

Related

Bad execution plan on Postgresql

I'm trying to migrate from SQL Server to Postgresql.
Here is my Postgresql code:
Create View person_names As
SELECT lp."Code", n."Name", n."Type"
from "Persons" lp
Left Join LATERAL
(
Select *
From "Names" n
Where n.id = lp.id
Order By "Date" desc
Limit 1
) n on true
limit 100;
Explain
Select "Code" From person_names;
It prints
"Subquery Scan on person_names (cost=0.42..448.85 rows=100 width=10)"
" -> Limit (cost=0.42..447.85 rows=100 width=56)"
" -> Nested Loop Left Join (cost=0.42..303946.91 rows=67931 width=56)"
" -> Seq Scan on ""Persons"" lp (cost=0.00..1314.31 rows=67931 width=10)"
" -> Limit (cost=0.42..4.44 rows=1 width=100)"
" -> Index Only Scan Backward using ""IX_Names_Person"" on ""Names"" n (cost=0.42..4.44 rows=1 width=100)"
" Index Cond: ("id" = (lp."id")::numeric)"
Why is there an "Index Only Scan" on the "Names" table? That table is not needed to produce the result. On SQL Server I get only a single scan over the "Persons" table.
How can I tune Postgres to get a better query plan? I'm on the latest version, PostgreSQL 15 beta 3.
Here is SQL Server version:
Create View person_names As
SELECT top 100 lp."Code", n."Name", n."Type"
from "Persons" lp
Outer Apply
(
Select Top 1 *
From "Names" n
Where n.id = lp.id
Order By "Date" desc
) n
GO
SET SHOWPLAN_TEXT ON;
GO
Select "Code" From person_names;
It gives the correct execution plan:
|--Top(TOP EXPRESSION:((100)))
|--Index Scan(OBJECT:([Persons].[IX_Persons] AS [lp]))
Change the lateral join to a regular left join; then Postgres is able to remove the scan on the Names table:
create View person_names
As
SELECT lp.Code, n.Name, n.Type
from Persons lp
Left Join (
Select distinct on (id) *
From Names n
Order By id, Date desc
) n on n.id = lp.id
limit 100;
The following index will support the distinct on () in case you do include columns from the Names table:
create index on "Names"(id, "Date" desc);
For select "Code" from person_names this gives me this plan:
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Seq Scan on persons lp (cost=0.00..309.00 rows=20000 width=7) (actual time=0.009..1.348 rows=20000 loops=1)
Planning Time: 0.262 ms
Execution Time: 1.738 ms
For select Code, name, type From person_names; this gives me this plan:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Hash Right Join (cost=559.42..14465.93 rows=20000 width=25) (actual time=5.585..68.545 rows=20000 loops=1)
Hash Cond: (n.id = lp.id)
-> Unique (cost=0.42..13653.49 rows=20074 width=26) (actual time=0.053..57.323 rows=20000 loops=1)
-> Index Scan using names_id_date_idx on names n (cost=0.42..12903.49 rows=300000 width=26) (actual time=0.052..41.125 rows=300000 loops=1)
-> Hash (cost=309.00..309.00 rows=20000 width=11) (actual time=5.407..5.407 rows=20000 loops=1)
Buckets: 32768 Batches: 1 Memory Usage: 1116kB
-> Seq Scan on persons lp (cost=0.00..309.00 rows=20000 width=11) (actual time=0.011..2.036 rows=20000 loops=1)
Planning Time: 0.460 ms
Execution Time: 69.180 ms
Of course I had to guess the table structures as you haven't provided any DDL.
Online example
Change your view definition like this:
create view person_names as
select p."Code",
(select "Name"
from "Names" n
where n.id = p.id
order by "Date" desc
limit 1)
from "Persons" p
limit 100;
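The scalar sub-query version is portable enough to try directly. Here's a sketch in SQLite from Python with invented rows, showing that the newest Name per person comes back without the view having to join the whole Names table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Persons (id INTEGER PRIMARY KEY, Code TEXT);
CREATE TABLE Names (id INTEGER, Name TEXT, Type TEXT, Date TEXT);
INSERT INTO Persons VALUES (1, 'P1'), (2, 'P2');
INSERT INTO Names VALUES
  (1, 'old name', 't', '2020-01-01'),
  (1, 'new name', 't', '2021-01-01'),
  (2, 'only name', 't', '2020-06-01');
""")

# The scalar sub-query picks the newest name per person; when the
# outer query selects only Code, a planner can skip it entirely.
rows = con.execute("""
SELECT p.Code,
       (SELECT n.Name
        FROM Names n
        WHERE n.id = p.id
        ORDER BY n.Date DESC
        LIMIT 1) AS Name
FROM Persons p
ORDER BY p.Code;
""").fetchall()
print(rows)
```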

Adding a where clause and order by slows down the view

I have this view that uses a lateral join against a function. The query runs fine and fast, but as soon as I add the WHERE clause and ORDER BY, it crawls.
CREATE OR REPLACE VIEW public.vw_top_info_v1_0
AS
SELECT pse.symbol,
pse.order_book,
pse.company_name,
pse.logo_url,
pse.display_logo,
pse.base_url,
stats.value::numeric(20,4) AS stock_value,
stats.volume::numeric(20,0) AS volume,
stats.last_trade_price,
stats.stock_date AS last_trade_date
FROM ( SELECT pse_1.symbol,
pse_1.company_name,
pse_1.order_book,
pse_1.display_logo,
pse_1.base_url,
pse_1.logo_url
FROM vw_pse_traded_companies pse_1
WHERE pse_1.group_name::text = 'N'::text) pse,
LATERAL iq_get_stats_security_for_top_data_v1_0(pse.order_book, (( SELECT date(d.added_date) AS date
FROM prod_itchbbo_p_small_message d
ORDER BY d.added_date DESC
LIMIT 1))::timestamp without time zone) stats(value, volume, stock_date, last_trade_price)
WHERE stats.value IS NOT NULL
ORDER BY stats.value DESC;
Here's the explain output.
Subquery Scan on vw_top_info_v1_0 (cost=161022.59..165450.34 rows=354220 width=192)
-> Sort (cost=161022.59..161908.14 rows=354220 width=200)
Sort Key: stats.value DESC
InitPlan 1 (returns $0)
-> Limit (cost=49734.18..49734.18 rows=1 width=12)
-> Sort (cost=49734.18..51793.06 rows=823553 width=12)
Sort Key: d.added_date DESC
-> Seq Scan on prod_itchbbo_p_small_message d (cost=0.00..45616.41 rows=823553 width=12)
-> Nested Loop (cost=188.59..10837.44 rows=354220 width=200)
-> Sort (cost=188.34..189.23 rows=356 width=2866)
Sort Key: info.order_book, listed.symbol
-> Hash Join (cost=18.19..173.25 rows=356 width=2866)
Hash Cond: ((info.symbol)::text = (listed.symbol)::text)
-> Seq Scan on prod_stock_information info (cost=0.00..151.85 rows=1220 width=12)
Filter: ((group_name)::text = 'N'::text)
-> Hash (cost=13.64..13.64 rows=364 width=128)
-> Seq Scan on prod_pse_listed_companies listed (cost=0.00..13.64 rows=364 width=128)
-> Function Scan on iq_get_stats_security_for_top_data_v1_0 stats (cost=0.25..10.25 rows=995 width=32)
Filter: (value IS NOT NULL)
Is there a way to improve the query?
I don't fully understand what this is doing, but a significant part of the cost comes from the seq scan on prod_itchbbo_p_small_message to sort by date and find the max.
You indicated the cost changes when you add the sort, so if you don't have one, I'd add a b-tree index on prod_itchbbo_p_small_message.added_date.
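To see why the index helps, here's a sketch with SQLite standing in for Postgres (table contents invented): with a b-tree index on added_date, the engine reads the max straight off the end of the index instead of scanning and sorting the whole table, which is exactly the Sort node that dominates the plan above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE prod_itchbbo_p_small_message
               (id INTEGER PRIMARY KEY, added_date TEXT)""")
con.executemany(
    "INSERT INTO prod_itchbbo_p_small_message (added_date) VALUES (?)",
    [(f"2023-01-{d:02d}",) for d in range(1, 29)],
)
con.execute("""CREATE INDEX prod_itchbbo_p_small_message_added_date_idx
               ON prod_itchbbo_p_small_message (added_date)""")

# The min/max optimization: one index probe instead of scan-and-sort.
mx = con.execute(
    "SELECT MAX(added_date) FROM prod_itchbbo_p_small_message"
).fetchone()[0]
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT MAX(added_date) FROM prod_itchbbo_p_small_message"
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)
print(mx, "|", plan_text)
```

Postgres applies the same optimization for `max(added_date)` (and for `ORDER BY added_date DESC LIMIT 1`) once the b-tree index exists.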

Can a LEFT JOIN be deferred to only apply to matching rows?

When joining on a table and then filtering (LIMIT 30, for instance), Postgres will apply the JOIN operation to all rows, even if the columns from those rows are only used in the returned columns, not as a filtering predicate.
This would be understandable for an INNER JOIN (PG has to know if the row will be returned or not) or for a LEFT JOIN without a unique constraint (PG has to know if more than one row will be returned or not), but for a LEFT JOIN on a UNIQUE column, this seems wasteful: if the query matches 10k rows, then 10k joins will be performed, and then only 30 will be returned.
It would seem more efficient to "delay", or defer, the join, as much as possible, and this is something that I've seen happen on some other queries.
Splitting this into a subquery (SELECT * FROM (SELECT * FROM main WHERE x LIMIT 30) LEFT JOIN secondary) works, by ensuring that only 30 items are returned from the main table before joining them, but it feels like I'm missing something, and the "standard" form of the query should also apply the same optimization.
Looking at the EXPLAIN plans, however, I can see that the number of rows joined is always the total number of rows, without "early bailing out" as you could see when, for instance, running a Seq Scan with a LIMIT 5.
Example schema, with a main table and a secondary one: secondary columns will only be returned, never filtered on.
drop table if exists secondary;
drop table if exists main;
create table main(id int primary key not null, main_column int);
create index main_column on main(main_column);
insert into main(id, main_column) SELECT i, i % 3000 from generate_series( 1, 1000000, 1) i;
create table secondary(id serial primary key not null, main_id int references main(id) not null, secondary_column int);
create unique index secondary_main_id on secondary(main_id);
insert into secondary(main_id, secondary_column) SELECT i, (i + 17) % 113 from generate_series( 1, 1000000, 1) i;
analyze main;
analyze secondary;
Example query:
explain analyze verbose select main.id, main_column, secondary_column
from main
left join secondary on main.id = secondary.main_id
where main_column = 5
order by main.id
limit 50;
This is the most "obvious" way of writing the query, takes on average around 5ms on my computer.
Explain:
Limit (cost=3742.93..3743.05 rows=50 width=12) (actual time=5.010..5.322 rows=50 loops=1)
Output: main.id, main.main_column, secondary.secondary_column
-> Sort (cost=3742.93..3743.76 rows=332 width=12) (actual time=5.006..5.094 rows=50 loops=1)
Output: main.id, main.main_column, secondary.secondary_column
Sort Key: main.id
Sort Method: top-N heapsort Memory: 27kB
-> Nested Loop Left Join (cost=11.42..3731.90 rows=332 width=12) (actual time=0.123..4.446 rows=334 loops=1)
Output: main.id, main.main_column, secondary.secondary_column
Inner Unique: true
-> Bitmap Heap Scan on public.main (cost=11.00..1036.99 rows=332 width=8) (actual time=0.106..1.021 rows=334 loops=1)
Output: main.id, main.main_column
Recheck Cond: (main.main_column = 5)
Heap Blocks: exact=334
-> Bitmap Index Scan on main_column (cost=0.00..10.92 rows=332 width=0) (actual time=0.056..0.057 rows=334 loops=1)
Index Cond: (main.main_column = 5)
-> Index Scan using secondary_main_id on public.secondary (cost=0.42..8.12 rows=1 width=8) (actual time=0.006..0.006 rows=1 loops=334)
Output: secondary.id, secondary.main_id, secondary.secondary_column
Index Cond: (secondary.main_id = main.id)
Planning Time: 0.761 ms
Execution Time: 5.423 ms
explain analyze verbose select m.id, main_column, secondary_column
from (
select main.id, main_column
from main
where main_column = 5
order by main.id
limit 50
) m
left join secondary on m.id = secondary.main_id
where main_column = 5
order by m.id
limit 50
This returns the same results, in 2ms.
The total EXPLAIN cost of the original plan is also about three times higher, in line with the performance gain we're seeing.
Limit (cost=1048.44..1057.21 rows=1 width=12) (actual time=1.219..2.027 rows=50 loops=1)
Output: m.id, m.main_column, secondary.secondary_column
-> Nested Loop Left Join (cost=1048.44..1057.21 rows=1 width=12) (actual time=1.216..1.900 rows=50 loops=1)
Output: m.id, m.main_column, secondary.secondary_column
Inner Unique: true
-> Subquery Scan on m (cost=1048.02..1048.77 rows=1 width=8) (actual time=1.201..1.515 rows=50 loops=1)
Output: m.id, m.main_column
Filter: (m.main_column = 5)
-> Limit (cost=1048.02..1048.14 rows=50 width=8) (actual time=1.196..1.384 rows=50 loops=1)
Output: main.id, main.main_column
-> Sort (cost=1048.02..1048.85 rows=332 width=8) (actual time=1.194..1.260 rows=50 loops=1)
Output: main.id, main.main_column
Sort Key: main.id
Sort Method: top-N heapsort Memory: 27kB
-> Bitmap Heap Scan on public.main (cost=11.00..1036.99 rows=332 width=8) (actual time=0.054..0.753 rows=334 loops=1)
Output: main.id, main.main_column
Recheck Cond: (main.main_column = 5)
Heap Blocks: exact=334
-> Bitmap Index Scan on main_column (cost=0.00..10.92 rows=332 width=0) (actual time=0.029..0.030 rows=334 loops=1)
Index Cond: (main.main_column = 5)
-> Index Scan using secondary_main_id on public.secondary (cost=0.42..8.44 rows=1 width=8) (actual time=0.004..0.004 rows=1 loops=50)
Output: secondary.id, secondary.main_id, secondary.secondary_column
Index Cond: (secondary.main_id = m.id)
Planning Time: 0.161 ms
Execution Time: 2.115 ms
This is a toy dataset here, but on a real DB, the IO difference is significant (no need to fetch 1000 rows when 30 are enough), and the timing difference also quickly adds up (up to an order of magnitude slower).
So my question: is there any way to get the planner to understand that the JOIN can be applied much later in the process?
It seems like something that could be applied automatically to gain a sizeable performance boost.
Deferred joins are good. It's usually helpful to run the limit operation on a subquery that yields only the id values, so the ORDER BY ... LIMIT step has less data to sort and then discard.
select main.id, main.main_column, secondary.secondary_column
from main
join (
select id
from main
where main_column = 5
order by id
limit 50
) selection on main.id = selection.id
left join secondary on main.id = secondary.main_id
order by main.id
limit 50
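The deferred-join shape above can be run on a small in-memory dataset. This sketch uses SQLite from Python with invented rows (ids 1-12, main_column = id % 3, and secondary rows only for even ids), filtering on main_column = 2 and a LIMIT of 3 instead of the question's values:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE main (id INTEGER PRIMARY KEY, main_column INTEGER);
CREATE TABLE secondary (id INTEGER PRIMARY KEY,
                        main_id INTEGER UNIQUE REFERENCES main(id),
                        secondary_column INTEGER);
""")
con.executemany("INSERT INTO main VALUES (?, ?)",
                [(i, i % 3) for i in range(1, 13)])
con.executemany("INSERT INTO secondary (main_id, secondary_column) VALUES (?, ?)",
                [(i, i * 10) for i in range(2, 13, 2)])

# The inner subquery picks the limited id set first; the outer joins
# then touch only those rows.
rows = con.execute("""
SELECT main.id, main.main_column, secondary.secondary_column
FROM main
JOIN (SELECT id FROM main WHERE main_column = 2 ORDER BY id LIMIT 3) selection
  ON main.id = selection.id
LEFT JOIN secondary ON main.id = secondary.main_id
ORDER BY main.id;
""").fetchall()
print(rows)
```

Only ids 2, 5, and 8 survive the inner LIMIT, and id 5 (no secondary row) comes back with NULL, confirming the LEFT JOIN semantics are preserved.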
It's also possible adding id to your main_column index will help. With a BTREE index the query planner knows it can get the id values in ascending order from the index, so it may be able to skip the sort step entirely and just scan the first 50 values.
create index main_column on main(main_column, id);
Edit: In a large table, the heavy lifting of your query will be the selection of the 50 main.id values to process. To get those 50 id values as cheaply as possible, use a scan of the covering index proposed above inside the subquery proposed above. Once you have your 50 id values, looking up 50 rows' worth of details from your various tables by main.id and secondary.main_id is trivial: you have the correct indexes in place and it's a limited number of rows, so it won't take much time.
It looks like your table sizes are too small for various optimizations to have much effect, though. Query plans change a lot when tables are larger.
Alternative query, using row_number() instead of LIMIT (I think you could even omit LIMIT here):
-- prepare q3 AS
select m.id, main_column, secondary_column
from (
select id, main_column
, row_number() OVER (ORDER BY id, main_column) AS rn
from main
where main_column = 5
) m
left join secondary on m.id = secondary.main_id
WHERE m.rn <= 50
ORDER BY m.id
LIMIT 50
;
Putting the subsetting into a CTE can prevent it from being merged into the main query:
PREPARE q6 AS
WITH
-- MATERIALIZED -- not needed before version 12
xxx AS (
SELECT DISTINCT x.id
FROM main x
WHERE x.main_column = 5
ORDER BY x.id
LIMIT 50
)
select m.id, m.main_column, s.secondary_column
from main m
left join secondary s on m.id = s.main_id
WHERE EXISTS (
SELECT *
FROM xxx x WHERE x.id = m.id
)
order by m.id
-- limit 50
;
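A runnable sketch of the CTE-plus-EXISTS shape, again with invented data in SQLite (ids 1-12, main_column = id % 3, secondary rows only for even ids). SQLite only understands the MATERIALIZED keyword in recent versions, so it is left out here; the point is the query shape, not the materialization hint:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE main (id INTEGER PRIMARY KEY, main_column INTEGER);
CREATE TABLE secondary (id INTEGER PRIMARY KEY,
                        main_id INTEGER UNIQUE,
                        secondary_column INTEGER);
""")
con.executemany("INSERT INTO main VALUES (?, ?)",
                [(i, i % 3) for i in range(1, 13)])
con.executemany("INSERT INTO secondary (main_id, secondary_column) VALUES (?, ?)",
                [(i, i * 10) for i in range(2, 13, 2)])

# In PostgreSQL 12+ you would write: WITH xxx AS MATERIALIZED (...)
rows = con.execute("""
WITH xxx AS (
  SELECT DISTINCT x.id
  FROM main x
  WHERE x.main_column = 2
  ORDER BY x.id
  LIMIT 3
)
SELECT m.id, m.main_column, s.secondary_column
FROM main m
LEFT JOIN secondary s ON m.id = s.main_id
WHERE EXISTS (SELECT * FROM xxx x WHERE x.id = m.id)
ORDER BY m.id;
""").fetchall()
print(rows)
```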

PostgreSql doesn't use index on Join

Let's say we have the following 2 tables:
purchases
-> id
-> classic_id(indexed TEXT)
-> other columns
purchase_items_2(a temporary table)
-> id
-> order_id(indexed TEXT)
-> other columns
I want to do a SQL join between the 2 tables like so:
Select pi.id, pi.order_id, p.id
from purchase_items_2 pi
INNER JOIN purchases p ON pi.order_id = p.classic_id
This should use the indexes, no? It does not.
Any clue why?
This is the explanation of the query
QUERY PLAN
---------------------------------------------------------------------------------
Hash Join (cost=433.80..744.69 rows=5848 width=216)
Hash Cond: ((purchase_items_2.order_id)::text = (purchases.classic_id)::text)
-> Seq Scan on purchase_items_2 (cost=0.00..230.48 rows=5848 width=208)
-> Hash (cost=282.80..282.80 rows=12080 width=16)
-> Seq Scan on purchases (cost=0.00..282.80 rows=12080 width=16)
(5 rows)
When I do a where query
Select pi.id
from purchase_items_2 pi
where pi.order_id = 'gigel'
It uses the index
QUERY PLAN
--------------------------------------------------------------------------------------------------
Bitmap Heap Scan on purchase_items_2 (cost=4.51..80.78 rows=29 width=208)
Recheck Cond: ((order_id)::text = 'gigel'::text)
-> Bitmap Index Scan on index_purchase_items_2_on_order_id (cost=0.00..4.50 rows=29 width=0)
Index Cond: ((order_id)::text = 'gigel'::text)
(4 rows)
Since you have no WHERE condition, the query has to read all rows of both tables anyway. And since the hash table built by the hash join fits in work_mem, a hash join (that has to perform a sequential scan on both tables) is the most efficient join strategy.
PostgreSQL doesn't use the indexes because it is faster without them in this specific query.
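The WHERE case can be illustrated in SQLite (table contents invented; note SQLite's planner differs from Postgres in that it only performs nested-loop joins, so it would use the index for the join too). With a selective equality predicate, the index lookup is what any planner chooses:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE purchase_items_2 (id INTEGER PRIMARY KEY, order_id TEXT);
CREATE INDEX index_purchase_items_2_on_order_id
  ON purchase_items_2 (order_id);
INSERT INTO purchase_items_2 (order_id) VALUES ('gigel'), ('other');
""")

# A selective equality filter makes the index lookup the cheap choice.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM purchase_items_2 WHERE order_id = 'gigel'"
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)
print(plan_text)
```

Without any filter, though, every row must be read anyway, so a full scan (plus a hash join in Postgres) wins, which is exactly the answer's point.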

PostgreSQL: Fast check whether all elements of an LTREE [] <# LTREE[]

I have the following tables (simplified):
CREATE TABLE groups
( id PRIMARY KEY,
path ltree,
...
);
CREATE TABLE items
( id bigserial,
path ltree,
...
PRIMARY KEY (id, path)
);
For each item, there is a list of groups that the item belongs to. A group is represented by its full path. There may be up to 10M items, and each item belongs to about 20 groups.
I need to design the following query. Given (a) a "parent" group and (b) a list of up to 10 additional groups, find those immediate descendants of the "parent" group that have at least one item in their subtree that is contained in each of the groups in the search criteria.
For example, given the parent group "NorthAmerica.USA" and additional groups ["CandyLovers.ChocolateLovers", "Athletes.Footballers"], "NorthAmerica.USA.CA" is a result if there exists an item like "George" that is in groups like ["NorthAmerica.USA.CA.LosAngeles", "Athletes.Footballers", "CandyLovers.ChocolateLovers.ChocolateDonutLovers"].
I tried a few different ways to write the query, and they scale very poorly: they take minutes to return results on a set of 1M items with 3-4 paths in the search criteria. For example:
EXPLAIN ANALYZE
SELECT *
FROM groups
WHERE path ~ CAST ('1.2.22' || '.*{1}' AS lquery)
AND EXISTS
(SELECT 1
FROM
(SELECT array_agg(DISTINCT leaf_paths_sans_result_path.path) AS paths_of_a_match,
max(path_count) AS path_count
FROM items,
(SELECT path,
count(*) OVER() AS path_count
FROM (
VALUES (groups.path) , ('1.3'),('1.4')) t (path)) leaf_paths_sans_result_path
WHERE 1 = 1
AND items.path <# leaf_paths_sans_result_path.path
GROUP BY id) items_by_id
WHERE cardinality(paths_of_a_match) = path_count );
Results in the following:
Index Scan using idx_groups__path__gist on groups (cost=0.28..37013.74 rows=38 width=469) (actual time=11.735..322285.421 rows=950 loops=1)
Index Cond: (path ~ '1.2.22.*{1}'::lquery)
Filter: (SubPlan 1)
Rows Removed by Filter: 3
SubPlan 1
-> Subquery Scan on items_by_id (cost=0.55..1809359.86 rows=3752 width=0) (actual time=338.162..338.162 rows=1 loops=953)
-> GroupAggregate (cost=0.55..1809322.34 rows=3752 width=65) (actual time=338.159..338.159 rows=1 loops=953)
Group Key: ibt.id
Filter: (cardinality(array_agg(DISTINCT "*VALUES*".column1)) >= max(3))
Rows Removed by Filter: 7845
-> Nested Loop (cost=0.55..1809228.54 rows=3752 width=65) (actual time=0.044..307.087 rows=20423 loops=953)
Join Filter: (ibt.path <# "*VALUES*".column1)
Rows Removed by Join Filter: 651228
-> Index Scan using idx_items__id on items (cost=0.55..1752954.06 rows=1250543 width=193) (actual time=0.007..110.517 rows=223884 loops=953)
-> Materialize (cost=0.00..0.05 rows=3 width=32) (actual time=0.000..0.000 rows=3 loops=213361141)
-> Values Scan on "*VALUES*" (cost=0.00..0.04 rows=3 width=32) (actual time=0.002..0.003 rows=3 loops=953)
Planning time: 3.151 ms
Execution time: 322286.404 ms
(18 rows)
I can change the data model as needed in order to optimize for this query. I am running PostgreSQL v9.5
Many thanks! Sorry for a messy question.
Looks like you're using the ltree module. The following query avoids the intermediate array_agg arrays:
select *
from items i
join groups g
on i.groups = g.id
where g.path ~ '1.2.22.*' and
(
i.path ~ '*.1.3.*' or
i.path ~ '*.1.4.*'
)
group by
g.id
having count(distinct
case
when i.path ~ '*.1.3.*' then 1
when i.path ~ '*.1.4.*' then 2
end) = 2
The count construct asserts that both conditions are met, not just two rows that match the same pattern.
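The count-distinct-case trick can be tried out with plain prefix matching standing in for ltree's operators. Here's a sketch in SQLite using the example data from the question; `LIKE 'x%'` is only an illustration of the `<#`/`~` semantics, not a replacement for them:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE items (id INTEGER, path TEXT);
INSERT INTO items VALUES
  (1, 'NorthAmerica.USA.CA.LosAngeles'),
  (1, 'Athletes.Footballers'),
  (1, 'CandyLovers.ChocolateLovers.ChocolateDonutLovers'),
  (2, 'NorthAmerica.USA.CA.SanDiego'),
  (2, 'Athletes.Footballers');
""")

# Each CASE branch maps one required group to a distinct number, so
# COUNT(DISTINCT ...) = 2 holds only for items matching BOTH groups,
# never for two rows matching the same group.
rows = con.execute("""
SELECT id
FROM items
WHERE path LIKE 'Athletes.Footballers%'
   OR path LIKE 'CandyLovers.ChocolateLovers%'
GROUP BY id
HAVING COUNT(DISTINCT CASE
         WHEN path LIKE 'Athletes.Footballers%' THEN 1
         WHEN path LIKE 'CandyLovers.ChocolateLovers%' THEN 2
       END) = 2
ORDER BY id;
""").fetchall()
print(rows)
```

Only item 1 ("George") belongs to both required groups, so it is the sole survivor of the HAVING clause.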