When does PostgreSQL do partition pruning with JOIN columns?

I have two tables in a Postgres 11 database:
client table
--------
client_id integer
client_name character_varying
file table
--------
file_id integer
client_id integer
file_name character_varying
The client table is not partitioned, the file table is partitioned by client_id (partition by list). When a new client is inserted into the client table, a trigger creates a new partition for the file table.
The file table has a foreign key constraint referencing the client table on client_id.
When I execute this SQL (where c.client_id = 1), everything seems fine:
explain
select *
from client c
join file f using (client_id)
where c.client_id = 1;
Partition pruning is used, only the partition file_p1 is scanned:
Nested Loop (cost=0.00..3685.05 rows=100001 width=82)
-> Seq Scan on client c (cost=0.00..1.02 rows=1 width=29)
Filter: (client_id = 1)
-> Append (cost=0.00..2684.02 rows=100001 width=57)
-> Seq Scan on file_p1 f (cost=0.00..2184.01 rows=100001 width=57)
Filter: (client_id = 1)
But when I use a where clause like "where c.client_name = 'test'", the database scans all partitions and does not recognize that client_name "test" corresponds to client_id 1:
explain
select *
from client c
join file f using (client_id)
where c.client_name = 'test';
Execution plan:
Hash Join (cost=1.04..6507.57 rows=100001 width=82)
Hash Cond: (f.client_id = c.client_id)
-> Append (cost=0.00..4869.02 rows=200002 width=57)
-> Seq Scan on file_p1 f (cost=0.00..1934.01 rows=100001 width=57)
-> Seq Scan on file_p4 f_1 (cost=0.00..1934.00 rows=100000 width=57)
-> Seq Scan on file_pdefault f_2 (cost=0.00..1.00 rows=1 width=556)
-> Hash (cost=1.02..1.02 rows=1 width=29)
-> Seq Scan on client c (cost=0.00..1.02 rows=1 width=29)
Filter: ((name)::text = 'test'::text)
So for this SQL, all partitions of the file table are scanned.
So should every select use the column by which the tables are partitioned? Is the database not able to derive the partition pruning criteria?
Edit:
To add some information:
In the past, I have been working with Oracle databases most of the time.
The execution plan there would be something like
Do a full table scan on client table with the client name to find out the client_id.
Do a "PARTITION LIST" access to the file table, where SQL Developer states PARTITION_START = KEY and PARTITION_STOP = KEY to indicate that the exact partition is not known when the execution plan is calculated, but that only a list of partitions will be accessed, determined by the client_id found in the client table.
This is what I would have expected in PostgreSQL as well.

The documentation states that dynamic partition pruning is possible
(...) During actual execution of the query plan. Partition pruning may also be performed here to remove partitions using values which are only known during actual query execution. This includes values from subqueries and values from execution-time parameters such as those from parameterized nested loop joins.
If I understand it correctly, it applies to prepared statements or queries with subqueries which provide the partition key value as a parameter. Use explain analyse to see dynamic pruning (my sample data contains a million rows in three partitions):
explain analyze
select *
from file
where client_id = (
select client_id
from client
where client_name = 'test');
Append (cost=25.88..22931.88 rows=1000000 width=14) (actual time=0.091..96.139 rows=333333 loops=1)
InitPlan 1 (returns $0)
-> Seq Scan on client (cost=0.00..25.88 rows=6 width=4) (actual time=0.040..0.042 rows=1 loops=1)
Filter: (client_name = 'test'::text)
Rows Removed by Filter: 2
-> Seq Scan on file_p1 (cost=0.00..5968.66 rows=333333 width=14) (actual time=0.039..70.026 rows=333333 loops=1)
Filter: (client_id = $0)
-> Seq Scan on file_p2 (cost=0.00..5968.68 rows=333334 width=14) (never executed)
Filter: (client_id = $0)
-> Seq Scan on file_p3 (cost=0.00..5968.66 rows=333333 width=14) (never executed)
Filter: (client_id = $0)
Planning Time: 0.423 ms
Execution Time: 109.189 ms
Note that scans for partitions p2 and p3 were never executed.
Answering your exact question: partition pruning for queries with joins like the one in the question is not implemented in Postgres (yet?).
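To make the difference between the two plans concrete, here is a toy Python model. It is not PostgreSQL code, and the dictionaries and function names are made up for illustration. Each list partition is treated as a bucket keyed by client_id; the first function scans every bucket the way the hash join plan does, the second resolves the key first and then touches only one bucket, which is roughly what the InitPlan in the subquery version makes possible:

```python
# Toy model of list partitioning: one bucket of rows per client_id.
# Illustration only; this is not how PostgreSQL is implemented.
partitions = {
    1: [{"file_id": 10, "client_id": 1}, {"file_id": 11, "client_id": 1}],
    2: [{"file_id": 20, "client_id": 2}],
    3: [{"file_id": 30, "client_id": 3}],
}
clients = [
    {"client_id": 1, "client_name": "test"},
    {"client_id": 2, "client_name": "other"},
    {"client_id": 3, "client_name": "third"},
]


def join_without_pruning(name):
    """Join on client_name: every partition is scanned, then filtered."""
    wanted = {c["client_id"] for c in clients if c["client_name"] == name}
    scanned, rows = [], []
    for key, part in partitions.items():
        scanned.append(key)
        rows.extend(r for r in part if r["client_id"] in wanted)
    return rows, scanned


def subquery_with_runtime_pruning(name):
    """Resolve the partition key first, then read only that partition."""
    key = next(c["client_id"] for c in clients if c["client_name"] == name)
    return list(partitions.get(key, [])), [key]


rows_a, scanned_a = join_without_pruning("test")
rows_b, scanned_b = subquery_with_runtime_pruning("test")
print("join scanned partitions:", scanned_a)
print("subquery scanned partitions:", scanned_b)
```

Both functions return the same rows; only the number of partitions they touch differs.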

PostgreSQL - ORDER BY with LIMIT not using indexes as expected

We have two tables - event_deltas and deltas_to_retrieve - which both have BTREE indexes on the same two columns:
CREATE TABLE event_deltas
(
event_id UUID REFERENCES events(id) NOT NULL,
version INT NOT NULL,
json_patch JSONB NOT NULL,
PRIMARY KEY (event_id, version)
);
CREATE TABLE deltas_to_retrieve(event_id UUID NOT NULL, version INT NOT NULL);
CREATE UNIQUE INDEX event_id_version ON deltas_to_retrieve (event_id, version);
In terms of table size, deltas_to_retrieve is a tiny lookup table of ~500 rows. The event_deltas table contains ~7,000,000 rows. Due to the size of the latter table, we want to limit how much we retrieve at once. Therefore, the tables are queried as follows:
SELECT ed.event_id, ed.version
FROM deltas_to_retrieve zz, event_deltas ed
WHERE zz.event_id = ed.event_id
AND ed.version > zz.version
ORDER BY ed.event_id, ed.version
LIMIT 5000;
Without the LIMIT, for the example I'm looking at the query returns ~30,000 rows.
What's odd about this query is the impact of the ORDER BY. Due to the existing indexes, the data comes back in the order we want with or without it. I would rather keep the explicit ORDER BY there so we're protected against future changes, as well as for readability. However, as things stand it has a significant negative impact on performance.
According to the docs:
An important special case is ORDER BY in combination with LIMIT n: an explicit sort will have to process all the data to identify the first n rows, but if there is an index matching the ORDER BY, the first n rows can be retrieved directly, without scanning the remainder at all.
This makes me think that, given the indexes we already have in place, the ORDER BY should not slow down the query at all. However, in practice I'm seeing execution times of ~10s with the ORDER BY and <1s without. I've included the plans outputted by EXPLAIN below:
Without ORDER BY
Just EXPLAIN:
QUERY PLAN
Limit (cost=0.56..20033.38 rows=5000 width=20)
-> Nested Loop (cost=0.56..331980.39 rows=82859 width=20)
-> Seq Scan on deltas_to_retrieve zz (cost=0.00..9.37 rows=537 width=20)
-> Index Only Scan using event_deltas_pkey on event_deltas ed (cost=0.56..616.66 rows=154 width=20)
Index Cond: ((event_id = zz.event_id) AND (version > zz.version))
More detailed EXPLAIN (ANALYZE, BUFFERS):
QUERY PLAN
Limit (cost=0.56..20039.35 rows=5000 width=20) (actual time=3.675..2083.063 rows=5000 loops=1)
Buffers: shared hit=1450 read=4783, local hit=2
-> Nested Loop (cost=0.56..1055082.88 rows=263260 width=20) (actual time=3.673..2080.745 rows=5000 loops=1)
Buffers: shared hit=1450 read=4783, local hit=2
-> Seq Scan on deltas_to_retrieve zz (cost=0.00..27.00 rows=1700 width=20) (actual time=0.022..0.307 rows=241 loops=1)
Buffers: local hit=2
-> Index Only Scan using event_deltas_pkey on event_deltas ed (cost=0.56..619.07 rows=155 width=20) (actual time=1.317..8.612 rows=21 loops=241)
Index Cond: ((event_id = zz.event_id) AND (version > zz.version))
Heap Fetches: 5000
Buffers: shared hit=1450 read=4783
Planning Time: 1.150 ms
Execution Time: 2084.647 ms
With ORDER BY
Just EXPLAIN:
QUERY PLAN
Limit (cost=0.84..929199.06 rows=5000 width=20)
-> Merge Join (cost=0.84..48924145.53 rows=263260 width=20)
Merge Cond: (ed.event_id = zz.event_id)
Join Filter: (ed.version > zz.version)
-> Index Only Scan using event_deltas_pkey on event_deltas ed (cost=0.56..48873353.76 rows=12318733 width=20)
-> Materialize (cost=0.28..6178.03 rows=1700 width=20)
-> Index Only Scan using event_id_version on deltas_to_retrieve zz (cost=0.28..6173.78 rows=1700 width=20)
More detailed EXPLAIN (ANALYZE, BUFFERS):
QUERY PLAN
Limit (cost=0.84..929199.06 rows=5000 width=20) (actual time=4457.770..506706.443 rows=5000 loops=1)
Buffers: shared hit=78806 read=1071004 dirtied=148, local hit=63
-> Merge Join (cost=0.84..48924145.53 rows=263260 width=20) (actual time=4457.768..506704.815 rows=5000 loops=1)
Merge Cond: (ed.event_id = zz.event_id)
Join Filter: (ed.version > zz.version)
Buffers: shared hit=78806 read=1071004 dirtied=148, local hit=63
-> Index Only Scan using event_deltas_pkey on event_deltas ed (cost=0.56..48873353.76 rows=12318733 width=20) (actual time=4.566..505443.407 rows=1813438 loops=1)
Heap Fetches: 1814767
Buffers: shared hit=78806 read=1071004 dirtied=148
-> Materialize (cost=0.28..6178.03 rows=1700 width=20) (actual time=0.063..2.524 rows=5000 loops=1)
Buffers: local hit=63
-> Index Only Scan using event_id_version on deltas_to_retrieve zz (cost=0.28..6173.78 rows=1700 width=20) (actual time=0.056..0.663 rows=78 loops=1)
Heap Fetches: 78
Buffers: local hit=63
Planning Time: 1.088 ms
Execution Time: 506709.819 ms
I'm not very experienced at reading these plans, but it's obviously thinking that it needs to retrieve everything, sort it and then return TOP N, rather than just grabbing the first N using the index. It's doing a Seq Scan on the smaller deltas_to_retrieve table rather than an Index Only Scan - is that the problem? That table is very small (~500 rows), so I wonder if it's just not bothering to use the index because of that?
Postgres version: 11.12
Upgrading to Postgres 13 fixed this for us, with the introduction of incremental sort. From some docs on the feature:
Incremental sorting: Sorting is a performance-intensive task, so every improvement in this area can make a difference. Now PostgreSQL 13 introduces incremental sorting, which leverages early-stage sorts of a query and sorts only the incremental unsorted fields, increasing the chances the sorted block will fit in memory and by that, improving performance.
The new query plan from EXPLAIN is as follows, with the query now completing in <500ms consistently:
QUERY PLAN
Limit (cost=71.06..820.32 rows=5000 width=20)
-> Incremental Sort (cost=71.06..15461.82 rows=102706 width=20)
Sort Key: ed.event_id, ed.version
Presorted Key: ed.event_id
-> Nested Loop (cost=0.84..6659.05 rows=102706 width=20)
-> Index Only Scan using event_id_version on deltas_to_retrieve zz (cost=0.28..1116.39 rows=541 width=20)
-> Index Only Scan using event_deltas_pkey on event_deltas ed (cost=0.56..8.35 rows=190 width=20)
Index Cond: ((event_id = zz.event_id) AND (version > zz.version))
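For anyone wondering what "Presorted Key: ed.event_id" buys you, here is a rough Python sketch of the idea behind incremental sort. It is only an illustration under my own assumptions (the function name and sample rows are invented), not PostgreSQL's implementation: the input already arrives ordered by event_id, so only one group of equal event_id values needs sorting at a time, and a LIMIT can stop before later groups are ever sorted.

```python
from itertools import groupby, islice
from operator import itemgetter


def incremental_sort(rows, presorted_key, full_key):
    """Yield rows ordered by full_key, given input already ordered by presorted_key.

    Only one group of equal presorted_key values is sorted at a time, so a
    consumer that stops early (like LIMIT) never pays for the remaining groups.
    """
    for _, group in groupby(rows, key=presorted_key):
        yield from sorted(group, key=full_key)


# (event_id, version) pairs: ordered by event_id, but not by version within it.
rows = [("a", 3), ("a", 1), ("a", 2), ("b", 2), ("b", 1), ("c", 5)]

# Equivalent of ORDER BY event_id, version LIMIT 4.
first_four = list(islice(
    incremental_sort(rows, presorted_key=itemgetter(0), full_key=itemgetter(0, 1)),
    4,
))
print(first_four)
```

Group "c" is never sorted here because the limit is reached while group "b" is being emitted.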
Notes:
Start by running VACUUM ANALYZE on both tables.
Since deltas_to_retrieve only needs to contain the lowest versions, it could be unique on event_id.
You can simplify the query to:
SELECT event_id, version
FROM event_deltas ed
WHERE EXISTS (
SELECT * FROM deltas_to_retrieve zz
WHERE zz.event_id = ed.event_id
AND zz.version < ed.version
)
ORDER BY event_id, version
LIMIT 5000;

Can a LEFT JOIN be deferred to only apply to matching rows?

When joining on a table and then filtering (LIMIT 30 for instance), Postgres will apply a JOIN operation on all rows, even if the columns from those rows are only used in the returned columns, and not as a filtering predicate.
This would be understandable for an INNER JOIN (PG has to know if the row will be returned or not) or for a LEFT JOIN without a unique constraint (PG has to know if more than one row will be returned or not), but for a LEFT JOIN on a UNIQUE column, this seems wasteful: if the query matches 10k rows, then 10k joins will be performed, and then only 30 will be returned.
It would seem more efficient to "delay", or defer, the join, as much as possible, and this is something that I've seen happen on some other queries.
Splitting this into a subquery (SELECT * FROM (SELECT * FROM main WHERE x LIMIT 30) LEFT JOIN secondary) works, by ensuring that only 30 items are returned from the main table before joining them, but it feels like I'm missing something, and the "standard" form of the query should also apply the same optimization.
Looking at the EXPLAIN plans, however, I can see that the number of rows joined is always the total number of rows, without "early bailing out" as you could see when, for instance, running a Seq Scan with a LIMIT 5.
Example schema, with a main table and a secondary one: secondary columns will only be returned, never filtered on.
drop table if exists secondary;
drop table if exists main;
create table main(id int primary key not null, main_column int);
create index main_column on main(main_column);
insert into main(id, main_column) SELECT i, i % 3000 from generate_series( 1, 1000000, 1) i;
create table secondary(id serial primary key not null, main_id int references main(id) not null, secondary_column int);
create unique index secondary_main_id on secondary(main_id);
insert into secondary(main_id, secondary_column) SELECT i, (i + 17) % 113 from generate_series( 1, 1000000, 1) i;
analyze main;
analyze secondary;
Example query:
explain analyze verbose select main.id, main_column, secondary_column
from main
left join secondary on main.id = secondary.main_id
where main_column = 5
order by main.id
limit 50;
This is the most "obvious" way of writing the query; it takes around 5ms on average on my computer.
Explain:
Limit (cost=3742.93..3743.05 rows=50 width=12) (actual time=5.010..5.322 rows=50 loops=1)
Output: main.id, main.main_column, secondary.secondary_column
-> Sort (cost=3742.93..3743.76 rows=332 width=12) (actual time=5.006..5.094 rows=50 loops=1)
Output: main.id, main.main_column, secondary.secondary_column
Sort Key: main.id
Sort Method: top-N heapsort Memory: 27kB
-> Nested Loop Left Join (cost=11.42..3731.90 rows=332 width=12) (actual time=0.123..4.446 rows=334 loops=1)
Output: main.id, main.main_column, secondary.secondary_column
Inner Unique: true
-> Bitmap Heap Scan on public.main (cost=11.00..1036.99 rows=332 width=8) (actual time=0.106..1.021 rows=334 loops=1)
Output: main.id, main.main_column
Recheck Cond: (main.main_column = 5)
Heap Blocks: exact=334
-> Bitmap Index Scan on main_column (cost=0.00..10.92 rows=332 width=0) (actual time=0.056..0.057 rows=334 loops=1)
Index Cond: (main.main_column = 5)
-> Index Scan using secondary_main_id on public.secondary (cost=0.42..8.12 rows=1 width=8) (actual time=0.006..0.006 rows=1 loops=334)
Output: secondary.id, secondary.main_id, secondary.secondary_column
Index Cond: (secondary.main_id = main.id)
Planning Time: 0.761 ms
Execution Time: 5.423 ms
explain analyze verbose select m.id, main_column, secondary_column
from (
select main.id, main_column
from main
where main_column = 5
order by main.id
limit 50
) m
left join secondary on m.id = secondary.main_id
where main_column = 5
order by m.id
limit 50
This returns the same results, in 2ms.
The total EXPLAIN cost is also roughly three times lower (1057 versus 3743), in line with the performance gain we're seeing.
Limit (cost=1048.44..1057.21 rows=1 width=12) (actual time=1.219..2.027 rows=50 loops=1)
Output: m.id, m.main_column, secondary.secondary_column
-> Nested Loop Left Join (cost=1048.44..1057.21 rows=1 width=12) (actual time=1.216..1.900 rows=50 loops=1)
Output: m.id, m.main_column, secondary.secondary_column
Inner Unique: true
-> Subquery Scan on m (cost=1048.02..1048.77 rows=1 width=8) (actual time=1.201..1.515 rows=50 loops=1)
Output: m.id, m.main_column
Filter: (m.main_column = 5)
-> Limit (cost=1048.02..1048.14 rows=50 width=8) (actual time=1.196..1.384 rows=50 loops=1)
Output: main.id, main.main_column
-> Sort (cost=1048.02..1048.85 rows=332 width=8) (actual time=1.194..1.260 rows=50 loops=1)
Output: main.id, main.main_column
Sort Key: main.id
Sort Method: top-N heapsort Memory: 27kB
-> Bitmap Heap Scan on public.main (cost=11.00..1036.99 rows=332 width=8) (actual time=0.054..0.753 rows=334 loops=1)
Output: main.id, main.main_column
Recheck Cond: (main.main_column = 5)
Heap Blocks: exact=334
-> Bitmap Index Scan on main_column (cost=0.00..10.92 rows=332 width=0) (actual time=0.029..0.030 rows=334 loops=1)
Index Cond: (main.main_column = 5)
-> Index Scan using secondary_main_id on public.secondary (cost=0.42..8.44 rows=1 width=8) (actual time=0.004..0.004 rows=1 loops=50)
Output: secondary.id, secondary.main_id, secondary.secondary_column
Index Cond: (secondary.main_id = m.id)
Planning Time: 0.161 ms
Execution Time: 2.115 ms
This is a toy dataset here, but on a real DB, the IO difference is significant (no need to fetch 1000 rows when 30 are enough), and the timing difference also quickly adds up (up to an order of magnitude slower).
So my question: is there any way to get the planner to understand that the JOIN can be applied much later in the process?
It seems like something that could be applied automatically to gain a sizeable performance boost.
Deferred joins are good. It's usually helpful to run the limit operation on a subquery that yields only the id values. The order by....limit operation has to sort less data just to discard it.
select main.id, main.main_column, secondary.secondary_column
from main
join (
select id
from main
where main_column = 5
order by id
limit 50
) selection on main.id = selection.id
left join secondary on main.id = secondary.main_id
order by main.id
limit 50
It's also possible adding id to your main_column index will help. With a BTREE index the query planner knows it can get the id values in ascending order from the index, so it may be able to skip the sort step entirely and just scan the first 50 values.
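-- the name main_column is already taken by the question's single-column index, so use a new name
create index main_column_id on main(main_column, id);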
Edit: In a large table, the heavy lifting of your query will be the selection of the 50 main.id values to process. To get those 50 id values as cheaply as possible you can combine a scan of the covering index I proposed with the subquery I proposed. Once you've got your 50 id values, looking up 50 rows' worth of details from your various tables by main.id and secondary.main_id is trivial; you have the correct indexes in place and it's a limited number of rows, so it won't take much time.
It looks like your table sizes are too small for various optimizations to have much effect, though. Query plans change a lot when tables are larger.
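If you want to convince yourself that the deferred-join form returns exactly the same rows as the plain form, here is a small self-contained check. Loud assumptions: it uses Python's built-in sqlite3 as a stand-in (so it says nothing about how Postgres plans either query), a smaller data set than the question's (20,000 rows, i % 100) to keep it fast, and an index name, main_column_id, that I made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table main(id integer primary key, main_column int);
    create table secondary(
        id integer primary key,
        main_id int unique references main(id),
        secondary_column int
    );
    -- hypothetical name for the (main_column, id) index
    create index main_column_id on main(main_column, id);
""")
conn.executemany(
    "insert into main(id, main_column) values (?, ?)",
    [(i, i % 100) for i in range(1, 20001)],
)
conn.executemany(
    "insert into secondary(main_id, secondary_column) values (?, ?)",
    [(i, (i + 17) % 113) for i in range(1, 20001)],
)

# Plain form: join first, then filter, order and limit.
plain = conn.execute("""
    select main.id, main.main_column, secondary.secondary_column
    from main
    left join secondary on main.id = secondary.main_id
    where main_column = 5
    order by main.id
    limit 50
""").fetchall()

# Deferred form: pick the 50 ids first, then join only those rows.
deferred = conn.execute("""
    select main.id, main.main_column, secondary.secondary_column
    from main
    join (
        select id from main where main_column = 5 order by id limit 50
    ) selection on main.id = selection.id
    left join secondary on main.id = secondary.main_id
    order by main.id
    limit 50
""").fetchall()

print(len(plain), plain == deferred)
```

The equivalence relies on secondary.main_id being unique; without that, the left join could multiply rows and the two forms could differ.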
Alternative query, using row_number() instead of LIMIT (I think you could even omit LIMIT here):
-- prepare q3 AS
select m.id, main_column, secondary_column
from (
select id, main_column
, row_number() OVER (ORDER BY id, main_column) AS rn
from main
where main_column = 5
) m
left join secondary on m.id = secondary.main_id
WHERE m.rn <= 50
ORDER BY m.id
LIMIT 50
;
Putting the subsetting into a CTE can keep it from being merged into the main query:
PREPARE q6 AS
WITH
-- MATERIALIZED -- not needed before version 12
xxx AS (
SELECT DISTINCT x.id
FROM main x
WHERE x.main_column = 5
ORDER BY x.id
LIMIT 50
)
select m.id, m.main_column, s.secondary_column
from main m
left join secondary s on m.id = s.main_id
WHERE EXISTS (
SELECT *
FROM xxx x WHERE x.id = m.id
)
order by m.id
-- limit 50
;

Postgres uses Hash Join with Seq Scan when Inner Select Index Cond is faster

Postgres is using a much heavier Seq Scan on table tracking even though an index is available. The first query was the original attempt; it uses a Seq Scan and is therefore slow. I attempted to force an Index Scan with an inner select, but Postgres converted it back to effectively the same plan with nearly the same runtime. I then copied the list returned by the inner select of query two into the third query. This time Postgres used the Index Scan, which dramatically decreased the runtime. The third query is not viable in a production environment. What would make Postgres use the last query plan?
(vacuum was used on both tables)
Tables
tracking (worker_id, localdatetime) total records: 118664105
project_worker (id, project_id) total records: 12935
INDEX
CREATE INDEX tracking_worker_id_localdatetime_idx ON public.tracking USING btree (worker_id, localdatetime)
Queries
SELECT worker_id, localdatetime FROM tracking t JOIN project_worker pw ON t.worker_id = pw.id WHERE project_id = 68475018
Hash Join (cost=29185.80..2638162.26 rows=19294218 width=16) (actual time=16.912..18376.032 rows=177681 loops=1)
Hash Cond: (t.worker_id = pw.id)
-> Seq Scan on tracking t (cost=0.00..2297293.86 rows=118716186 width=16) (actual time=0.004..8242.891 rows=118674660 loops=1)
-> Hash (cost=29134.80..29134.80 rows=4080 width=8) (actual time=16.855..16.855 rows=2102 loops=1)
Buckets: 4096 Batches: 1 Memory Usage: 115kB
-> Seq Scan on project_worker pw (cost=0.00..29134.80 rows=4080 width=8) (actual time=0.004..16.596 rows=2102 loops=1)
Filter: (project_id = 68475018)
Rows Removed by Filter: 10833
Planning Time: 0.192 ms
Execution Time: 18382.698 ms
SELECT worker_id, localdatetime FROM tracking t WHERE worker_id IN (SELECT id FROM project_worker WHERE project_id = 68475018 LIMIT 500)
Hash Semi Join (cost=6905.32..2923969.14 rows=27733254 width=24) (actual time=19.715..20191.517 rows=20530 loops=1)
Hash Cond: (t.worker_id = project_worker.id)
-> Seq Scan on tracking t (cost=0.00..2296948.27 rows=118698327 width=24) (actual time=0.005..9184.676 rows=118657026 loops=1)
-> Hash (cost=6899.07..6899.07 rows=500 width=8) (actual time=1.103..1.103 rows=500 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 28kB
-> Limit (cost=0.00..6894.07 rows=500 width=8) (actual time=0.006..1.011 rows=500 loops=1)
-> Seq Scan on project_worker (cost=0.00..28982.65 rows=2102 width=8) (actual time=0.005..0.968 rows=500 loops=1)
Filter: (project_id = 68475018)
Rows Removed by Filter: 4493
Planning Time: 0.224 ms
Execution Time: 20192.421 ms
SELECT worker_id, localdatetime FROM tracking t WHERE worker_id IN (322016383,316007840,...,285702579)
Index Scan using tracking_worker_id_localdatetime_idx on tracking t (cost=0.57..4766798.31 rows=21877360 width=24) (actual time=0.079..29.756 rows=22112 loops=1)
Index Cond: (worker_id = ANY ('{322016383,316007840,...,285702579}'::bigint[]))
Planning Time: 1.162 ms
Execution Time: 30.884 ms
... is in place of the 500 id entries used in the query
The same query run on another set of 500 ids:
Index Scan using tracking_worker_id_localdatetime_idx on tracking t (cost=0.57..4776714.91 rows=21900980 width=24) (actual time=0.105..5528.109 rows=117838 loops=1)
Index Cond: (worker_id = ANY ('{286237712,286237844,...,216724213}'::bigint[]))
Planning Time: 2.105 ms
Execution Time: 5534.948 ms
The distribution of "worker_id" within "tracking" seems very skewed. For one thing, one of your instances of query 3 returns over 5 times as many rows as the other. For another, the estimated number of rows is roughly 100 to more than 1000 times higher than the actual number. This can certainly lead to bad plans (although it is unlikely to be the complete picture).
What is the actual number of distinct values for worker_id within tracking: select count(distinct worker_id) from tracking? What does the planner think this value is: select n_distinct from pg_stats where tablename='tracking' and attname='worker_id'? If those values are far apart and you force the planner to use a more reasonable value with alter table tracking alter column worker_id set (n_distinct = <real value>); analyze tracking; does that change the plans?
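To put numbers on the misestimates, here is a quick Python calculation using only the estimated and actual row counts copied from the four plans in the question (the labels are mine):

```python
# (estimated rows, actual rows) copied from the top node of each plan.
plans = {
    "query 1 (join)": (19294218, 177681),
    "query 2 (IN subquery)": (27733254, 20530),
    "query 3 (first id list)": (21877360, 22112),
    "query 3 (second id list)": (21900980, 117838),
}

ratios = {name: est / actual for name, (est, actual) in plans.items()}
for name, ratio in ratios.items():
    print(f"{name}: estimate is about {ratio:.0f}x the actual row count")
```

Every plan overestimates by two to three orders of magnitude, which is why checking n_distinct is a sensible first step.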
If you want to nudge PostgreSQL towards a nested loop join, try the following:
Create an index on tracking that can be used for an index-only scan:
CREATE INDEX ON tracking (worker_id) INCLUDE (localdatetime);
Make sure that tracking is VACUUMed often, so that an index-only scan is effective.
Reduce random_page_cost and increase effective_cache_size so that the optimizer prices index scans lower (but don't use insane values).
Make sure that you have good estimates on project_worker:
ALTER TABLE project_worker ALTER project_id SET STATISTICS 1000;
ANALYZE project_worker;

PostgreSql doesn't use index on Join

Let's say we have the following 2 tables:
purchases
-> id
-> classic_id(indexed TEXT)
-> other columns
purchase_items_2(a temporary table)
-> id
-> order_id(indexed TEXT)
-> other columns
I want to do a SQL join between the 2 tables like so:
Select pi.id, pi.order_id, p.id
from purchase_items_2 pi
INNER JOIN purchases p ON pi.order_id = p.classic_id
This should use the indexes, no? It does not.
Any clue why?
This is the explanation of the query
INNER JOIN purchases ON #{#table_name}.order_id = purchases.classic_id")
QUERY PLAN
---------------------------------------------------------------------------------
Hash Join (cost=433.80..744.69 rows=5848 width=216)
Hash Cond: ((purchase_items_2.order_id)::text = (purchases.classic_id)::text)
-> Seq Scan on purchase_items_2 (cost=0.00..230.48 rows=5848 width=208)
-> Hash (cost=282.80..282.80 rows=12080 width=16)
-> Seq Scan on purchases (cost=0.00..282.80 rows=12080 width=16)
(5 rows)
When I do a where query
Select pi.id
from purchase_items_2 pi
where pi.order_id = 'gigel'
It uses the index
QUERY PLAN
--------------------------------------------------------------------------------------------------
Bitmap Heap Scan on purchase_items_2 (cost=4.51..80.78 rows=29 width=208)
Recheck Cond: ((order_id)::text = 'gigel'::text)
-> Bitmap Index Scan on index_purchase_items_2_on_order_id (cost=0.00..4.50 rows=29 width=0)
Index Cond: ((order_id)::text = 'gigel'::text)
(4 rows)
Since you have no WHERE condition, the query has to read all rows of both tables anyway. And since the hash table built by the hash join fits in work_mem, a hash join (that has to perform a sequential scan on both tables) is the most efficient join strategy.
PostgreSQL doesn't use the indexes because it is faster without them in this specific query.
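A minimal Python sketch of what a hash join does may help show why no index is needed. This is a simplified illustration with invented sample rows, not PostgreSQL's executor: one side is read once to build an in-memory hash table, the other side is read once to probe it, so each table is scanned sequentially exactly one time.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Join two row lists with one pass over each; no index is consulted."""
    table = defaultdict(list)
    for row in build_rows:              # sequential scan of the build side
        table[build_key(row)].append(row)
    for row in probe_rows:              # sequential scan of the probe side
        for match in table.get(probe_key(row), []):
            yield row, match


# Invented sample data using the question's column names.
purchases = [{"id": 1, "classic_id": "a"}, {"id": 2, "classic_id": "b"}]
items = [
    {"id": 10, "order_id": "a"},
    {"id": 11, "order_id": "a"},
    {"id": 12, "order_id": "zzz"},
]

joined = [
    (item["id"], item["order_id"], purchase["id"])
    for item, purchase in hash_join(
        purchases, items,
        build_key=lambda p: p["classic_id"],
        probe_key=lambda i: i["order_id"],
    )
]
print(joined)
```

With a selective WHERE clause only a few rows are needed, and probing an index for just those rows becomes cheaper than reading everything, which is what the second plan in the question shows.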

postgres text field performance

I am using postgres 9.2.4.
We have a background job that imports users' emails into our system and stores them in a Postgres database table.
Below is the table:
CREATE TABLE emails
(
id serial NOT NULL,
subject text,
body text,
personal boolean,
sent_at timestamp without time zone NOT NULL,
created_at timestamp without time zone,
updated_at timestamp without time zone,
account_id integer NOT NULL,
sender_user_id integer,
sender_contact_id integer,
html text,
folder text,
draft boolean DEFAULT false,
check_for_response timestamp without time zone,
send_time timestamp without time zone,
CONSTRAINT emails_pkey PRIMARY KEY (id),
CONSTRAINT emails_account_id_fkey FOREIGN KEY (account_id)
REFERENCES accounts (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE CASCADE,
CONSTRAINT emails_sender_contact_id_fkey FOREIGN KEY (sender_contact_id)
REFERENCES contacts (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE CASCADE
)
WITH (
OIDS=FALSE
);
ALTER TABLE emails
OWNER TO paulcowan;
-- Index: emails_account_id_index
-- DROP INDEX emails_account_id_index;
CREATE INDEX emails_account_id_index
ON emails
USING btree
(account_id);
-- Index: emails_sender_contact_id_index
-- DROP INDEX emails_sender_contact_id_index;
CREATE INDEX emails_sender_contact_id_index
ON emails
USING btree
(sender_contact_id);
-- Index: emails_sender_user_id_index
-- DROP INDEX emails_sender_user_id_index;
CREATE INDEX emails_sender_user_id_index
ON emails
USING btree
(sender_user_id);
The query is further complicated because I have a view on this table where I pull in other data:
CREATE OR REPLACE VIEW email_graphs AS
SELECT emails.id, emails.subject, emails.body, emails.folder, emails.html,
emails.personal, emails.draft, emails.created_at, emails.updated_at,
emails.sent_at, emails.sender_contact_id, emails.sender_user_id,
emails.addresses, emails.read_by, emails.check_for_response,
emails.send_time, ts.ids AS todo_ids, cs.ids AS call_ids,
ds.ids AS deal_ids, ms.ids AS meeting_ids, c.comments, p.people,
atts.ids AS attachment_ids
FROM emails
LEFT JOIN ( SELECT todos.reference_email_id AS email_id,
array_to_json(array_agg(todos.id)) AS ids
FROM todos
GROUP BY todos.reference_email_id) ts ON ts.email_id = emails.id
LEFT JOIN ( SELECT calls.reference_email_id AS email_id,
array_to_json(array_agg(calls.id)) AS ids
FROM calls
GROUP BY calls.reference_email_id) cs ON cs.email_id = emails.id
LEFT JOIN ( SELECT deals.reference_email_id AS email_id,
array_to_json(array_agg(deals.id)) AS ids
FROM deals
GROUP BY deals.reference_email_id) ds ON ds.email_id = emails.id
LEFT JOIN ( SELECT meetings.reference_email_id AS email_id,
array_to_json(array_agg(meetings.id)) AS ids
FROM meetings
GROUP BY meetings.reference_email_id) ms ON ms.email_id = emails.id
LEFT JOIN ( SELECT comments.email_id,
array_to_json(array_agg(( SELECT row_to_json(r.*) AS row_to_json
FROM ( VALUES (comments.id,comments.text,comments.author_id,comments.created_at,comments.updated_at)) r(id, text, author_id, created_at, updated_at)))) AS comments
FROM comments
WHERE comments.email_id IS NOT NULL
GROUP BY comments.email_id) c ON c.email_id = emails.id
LEFT JOIN ( SELECT email_participants.email_id,
array_to_json(array_agg(( SELECT row_to_json(r.*) AS row_to_json
FROM ( VALUES (email_participants.user_id,email_participants.contact_id,email_participants.kind)) r(user_id, contact_id, kind)))) AS people
FROM email_participants
GROUP BY email_participants.email_id) p ON p.email_id = emails.id
LEFT JOIN ( SELECT attachments.reference_email_id AS email_id,
array_to_json(array_agg(attachments.id)) AS ids
FROM attachments
GROUP BY attachments.reference_email_id) atts ON atts.email_id = emails.id;
ALTER TABLE email_graphs
OWNER TO paulcowan;
We then run paginated queries against this view e.g.
SELECT "email_graphs".* FROM "email_graphs" INNER JOIN "email_participants" ON ("email_participants"."email_id" = "email_graphs"."id") WHERE (("user_id" = 75) AND ("folder" = 'INBOX')) ORDER BY "sent_at" DESC LIMIT 5 OFFSET 0
As the table has grown, queries on this table have dramatically slowed down.
If I run the paginated query with EXPLAIN ANALYZE
EXPLAIN ANALYZE SELECT "email_graphs".* FROM "email_graphs" INNER JOIN "email_participants" ON ("email_participants"."email_id" = "email_graphs"."id") WHERE (("user_id" = 75) AND ("folder" = 'INBOX')) ORDER BY "sent_at" DESC LIMIT 5 OFFSET 0;
I get this result
-> Seq Scan on deals (cost=0.00..9.11 rows=36 width=8) (actual time=0.003..0.044 rows=34 loops=1)
-> Sort (cost=5.36..5.43 rows=131 width=36) (actual time=0.416..0.416 rows=1 loops=1)
Sort Key: ms.email_id
Sort Method: quicksort Memory: 26kB
-> Subquery Scan on ms (cost=3.52..4.44 rows=131 width=36) (actual time=0.408..0.411 rows=1 loops=1)
-> HashAggregate (cost=3.52..4.05 rows=131 width=8) (actual time=0.406..0.408 rows=1 loops=1)
-> Seq Scan on meetings (cost=0.00..3.39 rows=131 width=8) (actual time=0.006..0.163 rows=161 loops=1)
-> Sort (cost=18.81..18.91 rows=199 width=36) (actual time=0.012..0.012 rows=0 loops=1)
Sort Key: c.email_id
Sort Method: quicksort Memory: 25kB
-> Subquery Scan on c (cost=15.90..17.29 rows=199 width=36) (actual time=0.007..0.007 rows=0 loops=1)
-> HashAggregate (cost=15.90..16.70 rows=199 width=60) (actual time=0.006..0.006 rows=0 loops=1)
-> Seq Scan on comments (cost=0.00..12.22 rows=736 width=60) (actual time=0.004..0.004 rows=0 loops=1)
Filter: (email_id IS NOT NULL)
Rows Removed by Filter: 2
SubPlan 1
-> Values Scan on "*VALUES*" (cost=0.00..0.00 rows=1 width=56) (never executed)
-> Materialize (cost=4220.14..4883.55 rows=27275 width=36) (actual time=247.720..1189.545 rows=29516 loops=1)
-> GroupAggregate (cost=4220.14..4788.09 rows=27275 width=15) (actual time=247.715..1131.787 rows=29516 loops=1)
-> Sort (cost=4220.14..4261.86 rows=83426 width=15) (actual time=247.634..339.376 rows=82632 loops=1)
Sort Key: public.email_participants.email_id
Sort Method: external sort Disk: 1760kB
-> Seq Scan on email_participants (cost=0.00..2856.28 rows=83426 width=15) (actual time=0.009..88.938 rows=82720 loops=1)
SubPlan 2
-> Values Scan on "*VALUES*" (cost=0.00..0.00 rows=1 width=40) (actual time=0.004..0.005 rows=1 loops=82631)
-> Sort (cost=2.01..2.01 rows=1 width=36) (actual time=0.074..0.077 rows=3 loops=1)
Sort Key: atts.email_id
Sort Method: quicksort Memory: 25kB
-> Subquery Scan on atts (cost=2.00..2.01 rows=1 width=36) (actual time=0.048..0.060 rows=3 loops=1)
-> HashAggregate (cost=2.00..2.01 rows=1 width=8) (actual time=0.045..0.051 rows=3 loops=1)
-> Seq Scan on attachments (cost=0.00..2.00 rows=1 width=8) (actual time=0.013..0.021 rows=5 loops=1)
-> Index Only Scan using email_participants_email_id_user_id_index on email_participants (cost=0.00..990.04 rows=269 width=4) (actual time=1.357..2.886 rows=43 loops=1)
Index Cond: (user_id = 75)
Heap Fetches: 43
Total runtime: 1642.157 ms
(75 rows)
I am definitely not looking for fixes :) or a refactored query. Any sort of high-level advice would be most welcome.
Per my comment, the gist of the issue lies in aggregates that get joined with one another. This prevents the use of indexes, and yields a bunch of merge joins (and a materialize) in your query plan…
Put another way, think of it as a plan so convoluted that Postgres proceeds by materializing temporary tables in memory, and then sorting them repeatedly until they're all merge-joined as appropriate. From where I'm standing, the full hogwash seems to amount to selecting all rows from all tables, and all of their possible relationships. Once it has been figured out and conjured up into existence, Postgres proceeds to sorting the mess in order to extract the top-n rows.
Anyway, you want to rewrite the query so it can actually use indexes to begin with.
Part of this is simple. This, for instance, is a big no no:
select …,
ts.ids AS todo_ids, cs.ids AS call_ids,
ds.ids AS deal_ids, ms.ids AS meeting_ids, c.comments, p.people,
atts.ids AS attachment_ids
Fetch the emails in one query. Fetch the related objects in separate queries with a bite-sized email_id in (…) clause. Merely doing that should speed things up quite a bit.
For the rest, it may or may not be simple or involve some re-engineering of your schema. I only scanned through the incomprehensible monster and its gruesome query plan, so I cannot comment for sure.
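As a rough sketch of the "emails first, related objects second" approach, here is a Python example. Assumptions, stated loudly: it uses the built-in sqlite3 module as a stand-in database, a cut-down emails and todos schema with made-up rows, and only one related table; the same pattern would be repeated for calls, deals, and so on.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- cut-down, hypothetical versions of the real tables
    create table emails(id integer primary key, folder text, sent_at text);
    create table todos(id integer primary key, reference_email_id int);
    insert into emails values (1, 'INBOX', '2013-01-01'),
                              (2, 'INBOX', '2013-01-03'),
                              (3, 'SENT',  '2013-01-02');
    insert into todos values (100, 1), (101, 2), (102, 2);
""")

# Query 1: fetch just the page of emails.
page = conn.execute(
    "select id, sent_at from emails where folder = ? "
    "order by sent_at desc limit 5",
    ("INBOX",),
).fetchall()
email_ids = [row[0] for row in page]

# Query 2: fetch related rows for those few ids only.
todo_ids = defaultdict(list)
if email_ids:
    placeholders = ",".join("?" * len(email_ids))
    query = (
        "select reference_email_id, id from todos "
        f"where reference_email_id in ({placeholders}) order by id"
    )
    for email_id, todo_id in conn.execute(query, email_ids):
        todo_ids[email_id].append(todo_id)

print(email_ids, dict(todo_ids))
```

The second query only ever sees a handful of ids, so an index on the reference column can be used instead of aggregating the whole table.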
I think the big view is not likely to ever perform well and that you should break it into more manageable components, but still here are two specific bits of advice that come to mind:
Schema change
Move the text and html bodies out of the main table.
Although large contents get automatically stored in TOAST space, mail parts will often be smaller than the TOAST threshold (~2000 bytes), especially for plain text, so it won't happen systematically.
Each non-TOASTED content inflates the table in a way that is detrimental to I/O performance and caching, if you consider that the primary purpose of the table is to contain the header fields like sender, recipients, date, subject...
I can test this with contents I happen to have in a mail database. On a sample of the 55k mails in my inbox:
average text/plain size: 1511 bytes.
average text/html size: 11895 bytes (but 42395 messages have no html at all)
Size of the mail table without the bodies: 14 MB (no TOAST)
If adding the bodies as 2 more TEXT columns like you have: 59 MB in main storage, 61 MB in TOAST.
Despite TOAST, the main storage appears to be about 4 times larger. So when scanning the table without needing the TEXT columns, roughly 75 to 80% of the I/O is wasted. Future row updates are likely to make this worse through fragmentation.
The effect in terms of block reads can be spotted through the pg_statio_all_tables view (compare heap_blks_read + heap_blks_hit before and after the query)
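For reference, the size figures from my sample work out as follows. This is plain arithmetic on the two main-storage sizes quoted above, nothing Postgres-specific:

```python
# Main (non-TOAST) storage sizes from the sample, in megabytes.
without_bodies_mb = 14
with_bodies_mb = 59

growth = with_bodies_mb / without_bodies_mb
wasted = 1 - without_bodies_mb / with_bodies_mb

print(f"main storage is {growth:.1f}x larger")
print(f"{wasted:.0%} of a full scan reads body data it does not need")
```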
Tuning
This part of the EXPLAIN
Sort Method: external sort Disk: 1760kB
suggests that your work_mem is too small. You don't want to hit the disk for such small sorts. Make it at least 10 MB unless you're low on free memory. While you're at it, set shared_buffers to a reasonable value if it's still the default. See http://wiki.postgresql.org/wiki/Performance_Optimization for more.