I am trying to understand a Postgres explain plan using this website (http://explain.depesz.com).
This is my plan:
Unique  (cost=795.43..800.89 rows=546 width=23) (actual time=0.599..0.755 rows=620 loops=1)
  ->  Sort  (cost=795.43..796.79 rows=546 width=23) (actual time=0.598..0.626 rows=620 loops=1)
        Sort Key: m.member_id, c.total_charged_amt DESC, c.claim_status
        Sort Method: quicksort  Memory: 73kB
        ->  Nested Loop  (cost=9.64..770.60 rows=546 width=23) (actual time=0.023..0.342 rows=620 loops=1)
              ->  Bitmap Heap Scan on member m  (cost=9.22..222.50 rows=63 width=8) (actual time=0.016..0.024 rows=62 loops=1)
                    Recheck Cond: (((member_id >= 1584161838) AND (member_id <= 1584161898)) OR (member_birth_dt = '1978-03-13'::date))
                    Heap Blocks: exact=3
                    ->  BitmapOr  (cost=9.22..9.22 rows=63 width=0) (actual time=0.013..0.013 rows=0 loops=1)
                          ->  Bitmap Index Scan on member_pkey  (cost=0.00..4.31 rows=2 width=0) (actual time=0.007..0.007 rows=61 loops=1)
                                Index Cond: ((member_id >= 1584161838) AND (member_id <= 1584161898))
                          ->  Bitmap Index Scan on idx_dob  (cost=0.00..4.88 rows=61 width=0) (actual time=0.006..0.006 rows=1 loops=1)
                                Index Cond: (member_birth_dt = '1978-03-13'::date)
              ->  Index Scan using idx_memberid on claim c  (cost=0.42..8.60 rows=10 width=23) (actual time=0.001..0.003 rows=10 loops=62)
                    Index Cond: (member_id = m.member_id)
Planning Time: 0.218 ms
Execution Time: 0.812 ms
The plan is also available at the following link: https://explain.depesz.com/s/Qzau.
I have the following questions:
The two Bitmap Index Scans, which run in 0.007 ms and 0.006 ms respectively, appear to run in parallel because of the indentation.
So why does BitmapOr start at 0.013 ms? Why is it adding up the times of both of its children? The start time of BitmapOr should be the maximum of its two children, so ideally the exclusive time of BitmapOr should be 0.006 (0.013 - 0.007), but it is 0 (0.013 - 0.007 - 0.006).
The Bitmap Heap Scan and the Index Scan are children of the Nested Loop and run in 0.024 ms and 0.186 ms (0.003 ms × 62 loops), respectively.
Because of the indentation I assume these two children run in parallel, so the parent's exclusive time should be 0.156 (0.342 - 0.186), but instead it is 0.132 (0.342 - 0.186 - 0.024).
My understanding is that we should subtract the maximum of the child timings (since they run in parallel) to get the exclusive time spent in a node. Instead, the tool adds up the child timings and subtracts the sum from the parent's end time to get the exclusive time. Have I misunderstood how to read explain plans?
Any help is appreciated, as I am totally confused about interpreting them.
As the horse said, there is no parallel processing involved. If parallel query were involved, you would see a Gather node that collects the results.
The indentation visualizes the graph of the query execution plan:
                 Unique
                    |
                  Sort
                    |
               Nested Loop
                /       \
    Bitmap Heap Scan   Index Scan
            |
        Bitmap Or
         /      \
Bitmap Index Scan  Bitmap Index Scan
That explains why explain.depesz.com calculates the exclusive time the way it does. Note that this is just a best effort, not the absolute truth, because (for example) a "bitmap or" does not take zero time. But explain.depesz.com can only go by the numbers it gets from EXPLAIN.
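For example, applying that to the plan above: the exclusive time of a node is its total actual time minus the total actual time of each child (per-loop time multiplied by loops):

Nested Loop: 0.342 - (0.024 × 1) - (0.003 × 62) = 0.342 - 0.024 - 0.186 = 0.132 ms
BitmapOr:    0.013 - (0.007 × 1) - (0.006 × 1)  = 0.000 ms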
We have a complex query that is built dynamically depending on different options on the customer and fetches the data. We have functions in Azure which run these queries every night to build reporting data; we run approx. 30k of these. In isolation the queries are about as fast as I can get them, approximately 100 ms, but when we run the functions in parallel on the consumption plan in Azure (restricted to a maximum of 5 functions running at the same time), the performance of the queries drops off and some even time out at 5 minutes. Some of the ones that time out I have tested in isolation, and they come in at under 100 ms. There are no writes, as this uses a read replica in Azure to load the data.
We are running Postgres 11.6 on hosted Azure, with PgBouncer on a VM. All these queries go to a read replica configured as 4 vCore Memory Optimized.
What changes can we make to allow more parallel execution of these queries, or is scaling up our only option?
I would like to share the EXPLAIN ANALYZE output, but this is restricted by the business. Please let me know what information would help and I will try to provide as much as possible.
CTE Scan on bravo_zulu romeo (cost=2151.89..2151.94 rows=1 width=204) (actual time=27.756..84.147 rows=36 loops=1)
CTE bravo_zulu
-> Nested Loop (cost=13.84..2151.89 rows=1 width=139) (actual time=27.744..84.009 rows=36 loops=1)
-> Nested Loop (cost=13.42..2151.43 rows=1 width=139) (actual time=26.811..76.983 rows=36 loops=1)
-> Nested Loop (cost=12.86..130.51 rows=1 width=44) (actual time=7.471..19.361 rows=29 loops=1)
-> Nested Loop (cost=4.88..97.73 rows=1 width=24) (actual time=7.410..10.480 rows=24 loops=1)
-> Index Scan using yankee on xray_zulu foxtrot_uniform (cost=0.28..8.29 rows=1 width=8) (actual time=1.339..1.340 rows=1 loops=1)
Index Cond: ("juliet" = 20)
-> Bitmap Heap Scan on golf_delta hotel_six (cost=4.60..89.43 rows=1 width=20) (actual time=6.064..9.123 rows=24 loops=1)
Recheck Cond: ("delta_oscar_hotel" = foxtrot_uniform."lima")
Filter: ("juliet" = ANY ('foxtrot_oscar'::integer[]))
Rows Removed by Filter: 442
Heap Blocks: exact=65
-> Bitmap Index Scan on papa (cost=0.00..4.60 rows=42 width=0) (actual time=0.024..0.024 rows=466 loops=1)
Index Cond: ("delta_oscar_hotel" = foxtrot_uniform."lima")
-> Bitmap Heap Scan on delta_sierra_two bravo_hotel (cost=7.98..32.76 rows=2 width=20) (actual time=0.321..0.363 rows=1 loops=24)
Recheck Cond: ((hotel_six."juliet" = "xray_india") OR (hotel_six."juliet" = "foxtrot_foxtrot"))
Filter: ("hotel_golf" = 23)
Rows Removed by Filter: 10
Heap Blocks: exact=240
-> BitmapOr (cost=7.98..7.98 rows=9 width=0) (actual time=0.066..0.066 rows=0 loops=24)
-> Bitmap Index Scan on delta_sierra_sierra (cost=0.00..3.99 rows=5 width=0) (actual time=0.063..0.063 rows=11 loops=24)
Index Cond: (hotel_six."juliet" = "xray_india")
-> Bitmap Index Scan on xray_sierra (cost=0.00..3.99 rows=4 width=0) (actual time=0.002..0.002 rows=0 loops=24)
Index Cond: (hotel_six."juliet" = "foxtrot_foxtrot")
-> Index Only Scan using echo on xray_papa victor (cost=0.56..2020.44 rows=48 width=102) (actual time=1.606..1.986 rows=1 loops=29)
Index Cond: (("five_lima" = 23) AND ("seven_yankee" = bravo_hotel."november") AND ("charlie_hotel" five_romeo NULL))
Filter: (("three" = 'charlie_romeo'::text) AND (("alpha" = 'golf_bravo'::text) OR ("alpha" = 'delta_echo'::text)) AND ((("alpha" = ANY ('mike_juliet'::text[])) AND ("mike_lima" >= 'xray_whiskey'::date) AND ("mike_lima" <= 'uniform'::date)) OR (("alpha" = ANY ('kilo'::text[])) AND ("quebec_uniform" >= 'xray_whiskey'::date) AND ("quebec_uniform" <= 'uniform'::date)) OR (("alpha" = 'quebec_alpha_quebec'::text) AND ("quebec_uniform" >= 'xray_whiskey'::date) AND ("quebec_uniform" <= 'uniform'::date) AND ("mike_lima" >= 'xray_whiskey'::date) AND ("mike_lima" <= 'uniform'::date))) AND ((("alpha" = ANY ('oscar'::text[])) AND ("seven_india" = ANY ('four'::text[]))) OR (("alpha" = ANY ('quebec_alpha_delta'::text[])) AND ("seven_charlie" = ANY ('four'::text[])))))
Rows Removed by Filter: 1059
Heap Fetches: 0
-> Index Scan using bravo_papa on tango sierra (cost=0.42..0.45 rows=1 width=16) (actual time=0.194..0.194 rows=1 loops=36)
Index Cond: (("bravo_two" = 23) AND ("delta_tango" = six1."delta_oscar_romeo"))
SubPlan
-> Result (cost=0.01..0.02 rows=1 width=32) (actual time=0.001..0.001 rows=1 loops=36)
One-Time Filter: ((romeo.zulu = 'golf_bravo'::text) AND (romeo.golf_uniform = 20) AND (romeo.charlie_two = 'charlie_romeo'::text))
SubPlan
-> Result (cost=0.01..0.02 rows=1 width=32) (actual time=0.000..0.000 rows=0 loops=36)
One-Time Filter: ((romeo.zulu = 'delta_echo'::text) AND (romeo.charlie_two = 'charlie_romeo'::text) AND (romeo.golf_uniform = 20))
Planning time: 19.385 ms
Execution time: 84.373 ms
Above is an anonymised execution plan; this same query timed out when the functions ran in parallel in Azure.
Table sizes are not large; the largest is 8 million rows, and all the others are in the low hundreds of thousands.
When analysing a problem like this, I find it useful to restate what we know:
Your queries go from taking 100ms to timing out after 5 minutes.
This is happening on a 4 core system that is restricted to a maximum of 5 functions running at the same time.
This sounds more like a deadlock problem than a load problem.
There are two things that you could try:
Run one report at a time, to see if the timeouts disappear.
Check your SQL code for any "for update" clauses that could be leading to database locks (a quick check for blocked sessions is sketched below).
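For the second point, something like the following sketch (using the standard pg_stat_activity view; run it while the reports are executing) lists sessions that are currently waiting on a lock:

SELECT pid, wait_event_type, wait_event, state, query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';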
Edit:
Based on the answer in the comments, we can assume that it is not a database lock problem.
The next thing to check is the connection to the database. It could be that the system is running out of available connections. Things to check:
The maximum number of connections available (a quick check is sketched after this list)
Whether connection pooling is used
Whether connections are being closed/released back to the pool when they are no longer needed
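For the first point, something along these lines (a sketch using the standard max_connections setting and pg_stat_activity view) shows how close you are to the limit:

SHOW max_connections;

SELECT count(*) AS total_connections,
       count(*) FILTER (WHERE state = 'idle') AS idle_connections
FROM pg_stat_activity;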
I am facing a problem with a specific query on PostgreSQL.
Look at the explain:
-> Nested Loop Left Join (cost=21547.86..87609.16 rows=123 width=69) (actual time=28.997..562.299 rows=32710 loops=1)
-> Hash Join (cost=21547.30..87210.72 rows=123 width=53) (actual time=28.913..74.682 rows=32710 loops=1)
Hash Cond: (registry.id = profile.registry_id)
-> Bitmap Heap Scan on registry (cost=726.99..66218.46 rows=65503 width=53) (actual time=5.123..32.794 rows=66496 loops=1)
Recheck Cond: ((tenant_id = 1009469) AND active AND (excluded_at IS NULL))
Heap Blocks: exact=12563
-> Bitmap Index Scan on registry_tenant_id_excluded_at (cost=0.00..710.61 rows=65503 width=0) (actual time=3.589..3.589 rows=66496 loops=1)
Index Cond: (tenant_id = 1009469)
-> Hash (cost=20202.82..20202.82 rows=49399 width=16) (actual time=23.738..23.738 rows=32710 loops=1)
Buckets: 65536 Batches: 1 Memory Usage: 2046kB
-> Index Only Scan using profile_tenant_id_registry_id on profile (cost=0.56..20202.82 rows=49399 width=16) (actual time=0.019..19.173 rows=32710 loops=1)
Index Cond: (tenant_id = 1009469)
Heap Fetches: 29493
It misestimates the hash join, even though both scan estimates are accurate.
I have already tried boosting the statistics targets on the related columns, but the join estimate only went from 117 to 123, so I guess this is not the issue.
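(For reference, raising a statistics target and re-analyzing looks like the sketch below; the table and column names are taken from the plan above, and 1000 is an arbitrary target value.)

ALTER TABLE profile ALTER COLUMN registry_id SET STATISTICS 1000;
ANALYZE profile;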
Why is it misestimating so badly?
The nested loop creates a lot of work for the database.
It looks like rows with the same tenant_id also mostly have the same value for registry_id/registry.id, but the planner doesn't understand that. It thinks that registry_id = registry.id will be true as often for the actually selected rows as it would be for randomly selected pairs of rows.
I don't think there is anything you can do about this.
This is the query:
EXPLAIN (analyze, BUFFERS, SETTINGS)
SELECT
    operation.id
FROM
    operation
    RIGHT JOIN (
        SELECT uid, did FROM (
            SELECT uid, did FROM operation where id = 993754
        ) t
    ) parts ON (operation.uid = parts.uid AND operation.did = parts.did)
and the EXPLAIN output:
Nested Loop Left Join (cost=0.85..29695.77 rows=100 width=8) (actual time=13.709..13.711 rows=1 loops=1)
Buffers: shared hit=4905
-> Unique (cost=0.42..8.45 rows=1 width=16) (actual time=0.011..0.013 rows=1 loops=1)
Buffers: shared hit=5
-> Index Only Scan using oi on operation operation_1 (cost=0.42..8.44 rows=1 width=16) (actual time=0.011..0.011 rows=1 loops=1)
Index Cond: (id = 993754)
Heap Fetches: 1
Buffers: shared hit=5
-> Index Only Scan using oi on operation (cost=0.42..29686.32 rows=100 width=24) (actual time=13.695..13.696 rows=1 loops=1)
Index Cond: ((uid = operation_1.uid) AND (did = operation_1.did))
Heap Fetches: 1
Buffers: shared hit=4900
Settings: max_parallel_workers_per_gather = '4', min_parallel_index_scan_size = '0', min_parallel_table_scan_size = '0', parallel_setup_cost = '0', parallel_tuple_cost = '0', work_mem = '256MB'
Planning Time: 0.084 ms
Execution Time: 13.728 ms
Why does the Nested Loop take so much more time than the sum of its children's times? What can I do about that? The execution time should be less than 1 ms, right?
Update:
Nested Loop Left Join (cost=5.88..400.63 rows=101 width=8) (actual time=0.012..0.012 rows=1 loops=1)
Buffers: shared hit=8
-> Index Scan using oi on operation operation_1 (cost=0.42..8.44 rows=1 width=16) (actual time=0.005..0.005 rows=1 loops=1)
Index Cond: (id = 993754)
Buffers: shared hit=4
-> Bitmap Heap Scan on operation (cost=5.45..391.19 rows=100 width=24) (actual time=0.004..0.005 rows=1 loops=1)
Recheck Cond: ((uid = operation_1.uid) AND (did = operation_1.did))
Heap Blocks: exact=1
Buffers: shared hit=4
-> Bitmap Index Scan on ou (cost=0.00..5.42 rows=100 width=0) (actual time=0.003..0.003 rows=1 loops=1)
Index Cond: ((uid = operation_1.uid) AND (did = operation_1.did))
Buffers: shared hit=3
Settings: max_parallel_workers_per_gather = '4', min_parallel_index_scan_size = '0', min_parallel_table_scan_size = '0', parallel_setup_cost = '0', parallel_tuple_cost = '0', work_mem = '256MB'
Planning Time: 0.127 ms
Execution Time: 0.028 ms
Thanks, all of you. When I split the index into btree(id) and btree(uid, did), everything works perfectly. But why can't the two conditions be used together with the original index? Any details or rules?
By the way, the SQL is used for real-time calculation; there is some window function code not shown here.
The Nested Loop does not actually take much time itself. The actual time of 13.709..13.711 means that it took 13.709 ms until the first row was ready to be emitted from this node, and only another 0.002 ms until it was finished.
Note that the startup time of 13.709 ms includes the time spent in its two child nodes. Both of the child nodes need to emit at least one row before the nested loop can start.
The Unique child began emitting its first (and only) row after 0.011 ms. The Index Only Scan child, however, only started to emit its first (and only) row after 13.695 ms. This means that most of the actual time is spent in this Index Only Scan.
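In numbers: 13.711 ms (Nested Loop total) - 0.013 ms (Unique) - 13.696 ms (Index Only Scan) ≈ 0.002 ms spent in the Nested Loop itself.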
There is a great answer here which explains the costs and actual times in depth.
Also there is a nice tool at https://explain.depesz.com which calculates an inclusive and exclusive time for each node. Here it is used for your query plan which clearly shows that most of the time is spent in the Index Only Scan.
Since the query is spending almost all of the time in this index only scan, optimizations there will have the most benefit. Creating a separate index for the columns uid and did on the operation table should improve query time a lot.
CREATE INDEX operation_uid_did ON operation(uid, did);
The current execution plan contains 2 index only scans.
A slow one:
-> Index Only Scan using oi on operation (cost=0.42..29686.32 rows=100 width=24) (actual time=13.695..13.696 rows=1 loops=1)
Index Cond: ((uid = operation_1.uid) AND (did = operation_1.did))
Heap Fetches: 1
Buffers: shared hit=4900
And a fast one:
-> Index Only Scan using oi on operation operation_1 (cost=0.42..8.44 rows=1 width=16) (actual time=0.011..0.011 rows=1 loops=1)
Index Cond: (id = 993754)
Heap Fetches: 1
Buffers: shared hit=5
Both of them use the index oi but have different index conditions. Note how the fast one, which uses id as its index condition, only needs to read 5 pages of data (Buffers: shared hit=5). The slow one needs to read 4900 pages instead (Buffers: shared hit=4900). This indicates that the index is optimized for querying by id, but not so much for uid and did. Probably the index oi covers all 3 columns id, uid, did, in this order.
A multi-column btree index can only be used efficiently when there are constraints in the query on the leftmost columns. The official documentation about multi-column indexes explains this very well in depth.
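A minimal sketch of that rule, assuming the index covers (id, uid, did) in that order as suspected above (the literal values are placeholders):

-- Can descend the btree directly: the leading column id is constrained.
SELECT * FROM operation WHERE id = 993754;

-- Cannot descend: id is unconstrained, so at best the whole index is
-- scanned as a skinny version of the table.
SELECT * FROM operation WHERE uid = 1 AND did = 2;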
Why does the Nested Loop take so much more time than the sum of its children's times?
Based on your example, it doesn't. Can you elaborate on what makes you think it does?
Anyway, it seems extravagant to visit 4900 pages to fetch 1 tuple. I'm guessing your tables are not getting vacuumed enough.
Although now I prefer Florian's suggestion: that "uid" and "did" are not the leading columns of the index, and that is why it is slow. It is basically doing a full index scan, using the index as a skinny version of the table. It is a shame that the EXPLAIN output doesn't make it clear when an index is being used in this fashion, rather than in the traditional "jump to a specific part of the index" way.
So you have a missing index.
Recently we experienced a performance problem in a production Aurora PostgreSQL cluster. This is an EXPLAIN ANALYZE of the query.
The majority of the time is spent on Bitmap Index Scan on job_stage (cost=0.00..172.93 rows=9666 width=0) (actual time=238.410..238.410 rows=2019444 loops=1), where 2,019,444 rows are scanned. However, what troubles me is that there are only 70k rows in this table. Autovacuum is turned on, but the RDS instance was recently overloaded by another issue, and we suspect that autovacuum fell behind. If that is the case, would it explain our observation that the number of scanned rows exceeds the actual number of rows in the table?
Nested Loop (cost=229.16..265.28 rows=1 width=464) (actual time=239.815..239.815 rows=0 loops=1)
-> Nested Loop (cost=228.62..252.71 rows=1 width=540) (actual time=239.814..239.814 rows=0 loops=1)
Join Filter: (job.scanner_uuid = scanner_resource_pool.resource_uuid)
Rows Removed by Join Filter: 1
-> Index Scan using scanner_resource_pool_scanner_index on scanner_resource_pool (cost=0.41..8.43 rows=1 width=115) (actual time=0.017..0.019 rows=1 loops=1)
Index Cond: ((box_uuid = '5d8a7e0c-23ff-4853-bb6d-ffff6a38afa7'::text) AND (scanner_uuid = '9be9ac50-de05-4ddd-9545-ddddc484dce'::text))
-> Bitmap Heap Scan on job (cost=228.22..244.23 rows=4 width=464) (actual time=239.790..239.791 rows=1 loops=1)
Recheck Cond: ((box_uuid = '5d8a7e0c-23ff-4853-bb6d-ffff6a38afa7'::text) AND (stage = 'active'::text))
Rows Removed by Index Recheck: 6
Heap Blocks: exact=791
-> BitmapAnd (cost=228.22..228.22 rows=4 width=0) (actual time=238.913..238.913 rows=0 loops=1)
-> Bitmap Index Scan on job_box_status (cost=0.00..55.04 rows=1398 width=0) (actual time=0.183..0.183 rows=899 loops=1)
Index Cond: (box_uuid = '5d8a7e0c-23ff-4853-bb6d-ffff6a38afa7'::text)
-> Bitmap Index Scan on job_stage (cost=0.00..172.93 rows=9666 width=0) (actual time=238.410..238.410 rows=2019444 loops=1)
Index Cond: (stage = 'active'::text)
-> Index Only Scan using uc_box_uuid on scanner (cost=0.54..12.56 rows=1 width=87) (never executed)
Index Cond: ((box_uuid = '5d8a7e0c-23ff-4853-bb6d-ffff6a38afa7'::text) AND (uuid = '9be9ac50-de05-4ddd-9545-ddddc484dce'::text))
Heap Fetches: 0
Planning time: 1.274 ms
Execution time: 239.876 ms
I found my answer by confirming with AWS. If autovacuum is running behind, the EXPLAIN ANALYZE result may show this discrepancy: the index can still contain entries for dead row versions, and the bitmap index scan counts those.
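A quick way to check whether autovacuum is keeping up on that table is the sketch below (pg_stat_user_tables is a standard view; the table name is taken from the plan):

SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'job';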
I am having problems optimizing a query in PostgreSQL 9.5.14.
select *
from file as f
join product_collection pc on (f.product_collection_id = pc.id)
where pc.mission_id = 7
order by f.id asc
limit 100;
Takes about 100 seconds. If I drop the LIMIT clause, it takes about 0.5 seconds:
With limit:
explain (analyze,buffers) ... -- query exactly as above
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.84..859.32 rows=100 width=457) (actual time=102793.422..102856.884 rows=100 loops=1)
Buffers: shared hit=222430592
-> Nested Loop (cost=0.84..58412343.43 rows=6804163 width=457) (actual time=102793.417..102856.872 rows=100 loops=1)
Buffers: shared hit=222430592
-> Index Scan using file_pkey on file f (cost=0.57..23409008.61 rows=113831736 width=330) (actual time=0.048..28207.152 rows=55858772 loops=1)
Buffers: shared hit=55652672
-> Index Scan using product_collection_pkey on product_collection pc (cost=0.28..0.30 rows=1 width=127) (actual time=0.001..0.001 rows=0 loops=55858772)
Index Cond: (id = f.product_collection_id)
Filter: (mission_id = 7)
Rows Removed by Filter: 1
Buffers: shared hit=166777920
Planning time: 0.803 ms
Execution time: 102856.988 ms
Without limit:
=> explain (analyze,buffers) ... -- query as above, just without limit
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=20509671.01..20526681.42 rows=6804163 width=457) (actual time=456.175..510.596 rows=142055 loops=1)
Sort Key: f.id
Sort Method: quicksort Memory: 79392kB
Buffers: shared hit=37956
-> Nested Loop (cost=0.84..16494851.02 rows=6804163 width=457) (actual time=0.044..231.051 rows=142055 loops=1)
Buffers: shared hit=37956
-> Index Scan using product_collection_mission_id_index on product_collection pc (cost=0.28..46.13 rows=87 width=127) (actual time=0.017..0.101 rows=87 loops=1)
Index Cond: (mission_id = 7)
Buffers: shared hit=10
-> Index Scan using file_product_collection_id_index on file f (cost=0.57..187900.11 rows=169535 width=330) (actual time=0.007..1.335 rows=1633 loops=87)
Index Cond: (product_collection_id = pc.id)
Buffers: shared hit=37946
Planning time: 0.807 ms
Execution time: 569.865 ms
I have copied the database to a backup server so that I may safely manipulate the database without something else changing it on me.
Cardinalities:
Table file: 113,831,736 rows.
Table product_collection: 1370 rows.
The query without LIMIT: 142,055 rows.
SELECT count(*) FROM product_collection WHERE mission_id = 7: 87 rows.
What I have tried:
searching stack overflow
vacuum full analyze
creating a two-column index on file.product_collection_id & file.id (there are already single-column indexes on every field touched).
creating a two-column index on file.id & file.product_collection_id.
increasing the statistics on file.id & file.product_collection_id, then re-vacuum analyze.
changing various query planner settings.
creating non-materialized views.
walking up and down the hallway while muttering to myself.
None of them seem to change the performance in a significant way.
Thoughts?
UPDATE from OP:
Tested this on PostgreSQL 9.6 & 10.4, and found no significant changes in plans or performance.
However, setting random_page_cost low enough is the only way to get faster performance for the query without LIMIT.
With the default random_page_cost = 4, the query without LIMIT gives:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=9270013.01..9287875.64 rows=7145054 width=457) (actual time=47782.523..47843.812 rows=145697 loops=1)
Sort Key: f.id
Sort Method: external sort Disk: 59416kB
Buffers: shared hit=3997185 read=1295264, temp read=7427 written=7427
-> Hash Join (cost=24.19..6966882.72 rows=7145054 width=457) (actual time=1.323..47458.767 rows=145697 loops=1)
Hash Cond: (f.product_collection_id = pc.id)
Buffers: shared hit=3997182 read=1295264
-> Seq Scan on file f (cost=0.00..6458232.17 rows=116580217 width=330) (actual time=0.007..17097.581 rows=116729984 loops=1)
Buffers: shared hit=3997169 read=1295261
-> Hash (cost=23.08..23.08 rows=89 width=127) (actual time=0.840..0.840 rows=87 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 15kB
Buffers: shared hit=13 read=3
-> Bitmap Heap Scan on product_collection pc (cost=4.97..23.08 rows=89 width=127) (actual time=0.722..0.801 rows=87 loops=1)
Recheck Cond: (mission_id = 7)
Heap Blocks: exact=10
Buffers: shared hit=13 read=3
-> Bitmap Index Scan on product_collection_mission_id_index (cost=0.00..4.95 rows=89 width=0) (actual time=0.707..0.707 rows=87 loops=1)
Index Cond: (mission_id = 7)
Buffers: shared hit=3 read=3
Planning time: 0.929 ms
Execution time: 47911.689 ms
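(For reference, lowering the parameter for a session is just the statement below; the value 1.1 is illustrative, not one reported here.)

SET random_page_cost = 1.1;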
User Erwin's answer below will take me some time to fully understand and generalize to all of the use cases needed. In the meantime we will probably use either a materialized view or simply flatten our table structure.
This query is harder for the Postgres query planner than it might look. Depending on cardinalities, data distribution, value frequencies, sizes, ... completely different query plans can prevail and the planner has a hard time predicting which is best. Current versions of Postgres are better at this in several aspects, but it's still hard to optimize.
Since you retrieve only relatively few rows from product_collection, this equivalent query with LIMIT in a LATERAL subquery should avoid performance degradation:
SELECT *
FROM   product_collection pc
CROSS  JOIN LATERAL (
   SELECT *
   FROM   file f  -- big table
   WHERE  f.product_collection_id = pc.id
   ORDER  BY f.id
   LIMIT  100
   ) f
WHERE  pc.mission_id = 7
ORDER  BY f.id
LIMIT  100;
Edit: With explain (analyze, verbose), the OP reports that this results in the following query plan:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=30524.34..30524.59 rows=100 width=457) (actual time=13.128..13.167 rows=100 loops=1)
Buffers: shared hit=3213
-> Sort (cost=30524.34..30546.09 rows=8700 width=457) (actual time=13.126..13.152 rows=100 loops=1)
Sort Key: file.id
Sort Method: top-N heapsort Memory: 76kB
Buffers: shared hit=3213
-> Nested Loop (cost=0.57..30191.83 rows=8700 width=457) (actual time=0.060..9.868 rows=2880 loops=1)
Buffers: shared hit=3213
-> Seq Scan on product_collection pc (cost=0.00..69.12 rows=87 width=127) (actual time=0.024..0.336 rows=87 loops=1)
Filter: (mission_id = 7)
Rows Removed by Filter: 1283
Buffers: shared hit=13
-> Limit (cost=0.57..344.24 rows=100 width=330) (actual time=0.008..0.071 rows=33 loops=87)
Buffers: shared hit=3200
-> Index Scan using file_pc_id_index on file (cost=0.57..582642.42 rows=169535 width=330) (actual time=0.007..0.065 rows=33 loops=87)
Index Cond: (product_collection_id = pc.id)
Buffers: shared hit=3200
Planning time: 0.595 ms
Execution time: 13.319 ms
You need these indexes (will help your original query, too):
CREATE INDEX idx1 ON file (product_collection_id, id); -- crucial
CREATE INDEX idx2 ON product_collection (mission_id, id); -- helpful
You mentioned:
two column indexes on file.id & file.product_collection_id.
Etc. But we need it the other way round: id last. The order of index expressions is crucial. See:
Is a composite index also good for queries on the first field?
Rationale: With only 87 rows from product_collection, we only fetch a maximum of 87 x 100 = 8700 rows (fewer if not every pc.id has 100 rows in table file), which are then sorted before picking the top 100. Performance degrades with the number of rows you get from product_collection and with bigger LIMIT.
With the multicolumn index idx1 above, that's 87 fast index scans. The rest is not very expensive.
More optimization is possible, depending on additional information. Related:
Can spatial index help a “range - order by - limit” query