Very bad query plan in PostgreSQL 9.6 - postgresql

I have a performance problem with PostgreSQL 9.6.17 (x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5
20150623 (Red Hat 4.8.5-39), 64-bit). Sometimes a very inefficient query plan is chosen for a relatively simple query.
There is a dir_current table with 750M rows:
\d sf.dir_current
Table "sf.dir_current"
Column | Type | Collation | Nullable | Default
--------------------+-------------+-----------+----------+-----------------------------------------------
id | bigint | | not null | nextval('sf.object_id_seq'::regclass)
volume_id | bigint | | not null |
parent_id | bigint | | |
blocks | sf.blkcnt_t | | |
rec_aggrs | jsonb | | not null |
...
Indexes:
"dir_current_pk" PRIMARY KEY, btree (id), tablespace "sf_current"
"dir_current_parentid_idx" btree (parent_id), tablespace "sf_current"
"dir_current_volumeid_id_unq" UNIQUE CONSTRAINT, btree (volume_id, id), tablespace "sf_current"
Foreign-key constraints:
"dir_current_parentid_fk" FOREIGN KEY (parent_id) REFERENCES sf.dir_current(id) DEFERRABLE INITIALLY DEFERRED
(some columns omitted as they're irrelevant here).
Now, a temporary table is created with ca. 1K rows:
CREATE TEMP TABLE dir_process AS (
SELECT sf.dir_current.id, volume_id, parent_id, depth, size, blocks, atime, ctime, mtime, sync_time, local_aggrs FROM sf.dir_current
WHERE ....
);
CREATE INDEX dir_process_indx ON dir_process(volume_id, id);
ANALYZE dir_process;
The actual condition .... doesn't matter here - it selects some rows to be processed.
Here is a query which is sometimes very slow:
SELECT dir.id, dir.volume_id, dir.parent_id, dir.rec_aggrs, dir.blocks FROM sf.dir_current AS dir
INNER JOIN dir_process ON dir.parent_id = dir_process.id AND dir.volume_id = dir_process.volume_id
WHERE dir.volume_id = ANY(volume_ids)
A few slow plans:
duration: 1822060.789 ms plan:
Merge Join (cost=150260.47..265775.37 rows=1 width=456) (actual rows=14305 loops=1)
Merge Cond: (dir.volume_id = dir_process.volume_id)
Join Filter: (dir.parent_id = dir_process.id)
Rows Removed by Join Filter: 23117117695
-> Index Scan using dir_current_volumeid_id_unq on dir_current dir (cost=0.12..922747.05 rows=624805 width=456) (actual rows=1231600 loops=1)
Index Cond: (volume_id = ANY ('{88}'::bigint[]))
-> Sort (cost=966.16..975.55 rows=18770 width=16) (actual rows=23115900401 loops=1)
Sort Key: dir_process.volume_id
Sort Method: quicksort Memory: 1648kB
-> Seq Scan on dir_process (cost=0.00..699.70 rows=18770 width=16) (actual rows=18770 loops=1)
duration: 10140968.829 ms plan:
Merge Join (cost=0.17..8389.13 rows=1 width=456) (actual rows=819 loops=1)
Merge Cond: (dir_process.volume_id = dir.volume_id)
Join Filter: (dir.parent_id = dir_process.id)
Rows Removed by Join Filter: 2153506735
-> Index Only Scan using dir_process_indx on dir_process (cost=0.06..659.76 rows=1166 width=16) (actual rows=1166 loops=1)
Heap Fetches: 1166
-> Index Scan using dir_current_volumeid_id_unq on dir_current dir (cost=0.12..885276.20 rows=602172 width=456) (actual rows=2153506389 loops=1)
Index Cond: (volume_id = ANY ('{5}'::bigint[]))
duration: 12524111.200 ms plan:
Merge Join (cost=480671.74..878819.79 rows=1 width=456) (actual rows=62 loops=1)
Merge Cond: (dir.volume_id = dir_process.volume_id)
Join Filter: (dir.parent_id = dir_process.id)
Rows Removed by Join Filter: 153595373018
-> Index Scan using dir_current_volumeid_id_unq on dir_current dir (cost=0.12..922747.05 rows=624805 width=456) (actual rows=2360101 loops=1)
Index Cond: (volume_id = ANY ('{441}'::bigint[]))
-> Sort (cost=2621.42..2653.96 rows=65080 width=16) (actual rows=153593012980 loops=1)
Sort Key: dir_process.volume_id
Sort Method: quicksort Memory: 4587kB
-> Seq Scan on dir_process (cost=0.00..1580.80 rows=65080 width=16) (actual rows=65080 loops=1)
The first reaction is the usual one: "do I have up-to-date statistics?". The answer is yes: dir_current changes frequently but is also analyzed ca. once an hour, and dir_process is analyzed as soon as it's created.
Note that the estimated number of rows matches pretty well:
est. 624805, actual=1231600
est. 624805, actual=2360101
Where the estimates are way off is the inner side of the merge join, which actually shows the number of rows times the number of loops (153593012980, 2153506389 or 23115900401). The poor executor is spinning in the inner loop, iterating over all rows with a given volume_id while looking for a given id.
The biggest problem seems to be that Postgres chooses to do a merge join on a very inefficient condition: dir.volume_id = dir_process.volume_id instead of dir.parent_id = dir_process.id. For a given volume_id there are a few million rows in dir_current; for a given parent_id there are hundreds or thousands of rows, but not millions.
By contrast, the effective query plan is a nested loop whose outer loop iterates over dir_process and uses an index to fetch the matching rows from dir_current.
I understand that I could disable merge join before running this query but I was wondering if there is a better solution.
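For reference, a minimal sketch of disabling the merge join for just this statement (assuming it can run inside its own transaction; the setting reverts at COMMIT):
BEGIN;
SET LOCAL enable_mergejoin = off;  -- forces the planner towards a nested loop or hash join for this transaction only
SELECT dir.id, dir.volume_id, dir.parent_id, dir.rec_aggrs, dir.blocks FROM sf.dir_current AS dir
INNER JOIN dir_process ON dir.parent_id = dir_process.id AND dir.volume_id = dir_process.volume_id
WHERE dir.volume_id = ANY(volume_ids);
COMMIT;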
Any more ideas about what can be done to avoid this inefficient plan? How is it possible that it's chosen over a nested loop?

Seemingly Random Delay in queries

This is a follow-up to this issue I posted a while ago.
I have the following code:
SET work_mem = '16MB';
SELECT s.start_date, s.end_date, s.resources, s.activity_index, r.resource_id, sa.usedresourceset
FROM rm_o_resource_usage_instance_splits_new s
INNER JOIN rm_o_resource_usage r ON s.usage_id = r.id
INNER JOIN scheduledactivities sa ON s.activity_index = sa.activity_index AND r.schedule_id = sa.solution_id and s.solution = sa.solution_id
WHERE r.schedule_id = 10
ORDER BY r.resource_id, s.start_date
When I run EXPLAIN (ANALYZE, BUFFERS) I get the following:
Sort (cost=3724.02..3724.29 rows=105 width=89) (actual time=245.802..247.573 rows=22302 loops=1)
Sort Key: r.resource_id, s.start_date
Sort Method: quicksort Memory: 6692kB
Buffers: shared hit=198702 read=5993 written=612
-> Nested Loop (cost=703.76..3720.50 rows=105 width=89) (actual time=1.898..164.741 rows=22302 loops=1)
Buffers: shared hit=198702 read=5993 written=612
-> Hash Join (cost=703.34..3558.54 rows=105 width=101) (actual time=1.815..11.259 rows=22302 loops=1)
Hash Cond: (s.usage_id = r.id)
Buffers: shared hit=3 read=397 written=2
-> Bitmap Heap Scan on rm_o_resource_usage_instance_splits_new s (cost=690.61..3486.58 rows=22477 width=69) (actual time=1.782..5.820 rows=22302 loops=1)
Recheck Cond: (solution = 10)
Heap Blocks: exact=319
Buffers: shared hit=2 read=396 written=2
-> Bitmap Index Scan on rm_o_resource_usage_instance_splits_new_solution_idx (cost=0.00..685.00 rows=22477 width=0) (actual time=1.609..1.609 rows=22302 loops=1)
Index Cond: (solution = 10)
Buffers: shared hit=2 read=77
-> Hash (cost=12.66..12.66 rows=5 width=48) (actual time=0.023..0.023 rows=1 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
Buffers: shared hit=1 read=1
-> Bitmap Heap Scan on rm_o_resource_usage r (cost=4.19..12.66 rows=5 width=48) (actual time=0.020..0.020 rows=1 loops=1)
Recheck Cond: (schedule_id = 10)
Heap Blocks: exact=1
Buffers: shared hit=1 read=1
-> Bitmap Index Scan on rm_o_resource_usage_sched (cost=0.00..4.19 rows=5 width=0) (actual time=0.017..0.017 rows=1 loops=1)
Index Cond: (schedule_id = 10)
Buffers: shared read=1
-> Index Scan using scheduledactivities_activity_index_idx on scheduledactivities sa (cost=0.42..1.53 rows=1 width=16) (actual time=0.004..0.007 rows=1 loops=22302)
Index Cond: (activity_index = s.activity_index)
Filter: (solution_id = 10)
Rows Removed by Filter: 5
Buffers: shared hit=198699 read=5596 written=610
Planning time: 7.070 ms
Execution time: 248.691 ms
Every time I run EXPLAIN, I get roughly the same results. The Execution Time is always between 170ms and 250ms, which, to me, is perfectly fine. However, when this query is run through a C++ project (using PQexec(conn, query), where conn is a dedicated connection and query is the above query), the time it takes seems to vary widely. In general, the query is very quick and you don't notice a delay. The problem is that, on occasion, this query takes 2 to 3 minutes to complete.
If I open pgAdmin and have a look at the "server activity" for the database, there are about 30 or so connections, mostly sitting at "idle". The above query's connection is marked as "active", and stays "active" for several minutes.
I am at a loss as to why it randomly takes several minutes to complete the same query, with no change in the data in the DB either. I have tried increasing work_mem, which didn't make any difference (nor did I really expect it to). Any help or suggestions would be greatly appreciated.
There aren't any more specific tags, but I'm currently using Postgres 10.11; it's also been an issue on other 10.x versions. The system is a quad-core Xeon @ 3.4GHz, with an SSD and 24GB of memory.
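A minimal auto_explain configuration along these lines (the threshold values are assumptions, adjust to taste) is enough to capture the plan of an occasional slow run:
# postgresql.conf
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '10s'   # only log statements slower than this
auto_explain.log_analyze = on           # include actual row counts and timings
auto_explain.log_buffers = on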
Per jjanes's suggestion, I put in auto_explain. Eventually got this output:
duration: 128057.373 ms
plan:
Query Text: SET work_mem = '32MB';SELECT s.start_date, s.end_date, s.resources, s.activity_index, r.resource_id, sa.usedresourceset FROM rm_o_resource_usage_instance_splits_new s INNER JOIN rm_o_resource_usage r ON s.usage_id = r.id INNER JOIN scheduledactivities sa ON s.activity_index = sa.activity_index AND r.schedule_id = sa.solution_id and s.solution = sa.solution_id WHERE r.schedule_id = 12642 ORDER BY r.resource_id, s.start_date
Sort (cost=14.36..14.37 rows=1 width=98) (actual time=128042.083..128043.287 rows=21899 loops=1)
Output: s.start_date, s.end_date, s.resources, s.activity_index, r.resource_id, sa.usedresourceset
Sort Key: r.resource_id, s.start_date
Sort Method: quicksort Memory: 6585kB
Buffers: shared hit=21198435 read=388 dirtied=119
-> Nested Loop (cost=0.85..14.35 rows=1 width=98) (actual time=4.995..127958.935 rows=21899 loops=1)
Output: s.start_date, s.end_date, s.resources, s.activity_index, r.resource_id, sa.usedresourceset
Join Filter: (s.activity_index = sa.activity_index)
Rows Removed by Join Filter: 705476285
Buffers: shared hit=21198435 read=388 dirtied=119
-> Nested Loop (cost=0.42..9.74 rows=1 width=110) (actual time=0.091..227.705 rows=21899 loops=1)
Output: s.start_date, s.end_date, s.resources, s.activity_index, s.solution, r.resource_id, r.schedule_id
Inner Unique: true
Join Filter: (s.usage_id = r.id)
Buffers: shared hit=22102 read=388 dirtied=119
-> Index Scan using rm_o_resource_usage_instance_splits_new_solution_idx on public.rm_o_resource_usage_instance_splits_new s (cost=0.42..8.44 rows=1 width=69) (actual time=0.082..17.418 rows=21899 loops=1)
Output: s.start_time, s.end_time, s.resources, s.activity_index, s.usage_id, s.start_date, s.end_date, s.solution
Index Cond: (s.solution = 12642)
Buffers: shared hit=203 read=388 dirtied=119
-> Seq Scan on public.rm_o_resource_usage r (cost=0.00..1.29 rows=1 width=57) (actual time=0.002..0.002 rows=1 loops=21899)
Output: r.id, r.schedule_id, r.resource_id
Filter: (r.schedule_id = 12642)
Rows Removed by Filter: 26
Buffers: shared hit=21899
-> Index Scan using scheduled_activities_idx on public.scheduledactivities sa (cost=0.42..4.60 rows=1 width=16) (actual time=0.006..4.612 rows=32216 loops=21899)
Output: sa.usedresourceset, sa.activity_index, sa.solution_id
Index Cond: (sa.solution_id = 12642)
Buffers: shared hit=21176333
EDIT: Full definitions of the tables are below:
CREATE TABLE public.rm_o_resource_usage_instance_splits_new
(
start_time integer NOT NULL,
end_time integer NOT NULL,
resources jsonb NOT NULL,
activity_index integer NOT NULL,
usage_id bigint NOT NULL,
start_date text COLLATE pg_catalog."default" NOT NULL,
end_date text COLLATE pg_catalog."default" NOT NULL,
solution bigint NOT NULL,
CONSTRAINT rm_o_resource_usage_instance_splits_new_pkey PRIMARY KEY (start_time, activity_index, usage_id),
CONSTRAINT rm_o_resource_usage_instance_splits_new_solution_fkey FOREIGN KEY (solution)
REFERENCES public.rm_o_schedule_stats (id) MATCH SIMPLE
ON UPDATE CASCADE
ON DELETE CASCADE,
CONSTRAINT rm_o_resource_usage_instance_splits_new_usage_id_fkey FOREIGN KEY (usage_id)
REFERENCES public.rm_o_resource_usage (id) MATCH SIMPLE
ON UPDATE CASCADE
ON DELETE CASCADE
)
WITH (
OIDS = FALSE
)
TABLESPACE pg_default;
CREATE INDEX rm_o_resource_usage_instance_splits_new_activity_idx
ON public.rm_o_resource_usage_instance_splits_new USING btree
(activity_index ASC NULLS LAST)
TABLESPACE pg_default;
CREATE INDEX rm_o_resource_usage_instance_splits_new_solution_idx
ON public.rm_o_resource_usage_instance_splits_new USING btree
(solution ASC NULLS LAST)
TABLESPACE pg_default;
CREATE INDEX rm_o_resource_usage_instance_splits_new_usage_idx
ON public.rm_o_resource_usage_instance_splits_new USING btree
(usage_id ASC NULLS LAST)
TABLESPACE pg_default;
CREATE TABLE public.rm_o_resource_usage
(
id bigint NOT NULL DEFAULT nextval('rm_o_resource_usage_id_seq'::regclass),
schedule_id bigint NOT NULL,
resource_id text COLLATE pg_catalog."default" NOT NULL,
CONSTRAINT rm_o_resource_usage_pkey PRIMARY KEY (id),
CONSTRAINT rm_o_resource_usage_schedule_id_fkey FOREIGN KEY (schedule_id)
REFERENCES public.rm_o_schedule_stats (id) MATCH SIMPLE
ON UPDATE CASCADE
ON DELETE CASCADE
)
WITH (
OIDS = FALSE
)
TABLESPACE pg_default;
CREATE INDEX rm_o_resource_usage_idx
ON public.rm_o_resource_usage USING btree
(id ASC NULLS LAST)
TABLESPACE pg_default;
CREATE INDEX rm_o_resource_usage_sched
ON public.rm_o_resource_usage USING btree
(schedule_id ASC NULLS LAST)
TABLESPACE pg_default;
CREATE TABLE public.scheduledactivities
(
id bigint NOT NULL DEFAULT nextval('scheduledactivities_id_seq'::regclass),
solution_id bigint NOT NULL,
activity_id text COLLATE pg_catalog."default" NOT NULL,
sequence_index integer,
startminute integer,
finishminute integer,
issue text COLLATE pg_catalog."default",
activity_index integer NOT NULL,
is_objective boolean NOT NULL,
usedresourceset integer DEFAULT '-1'::integer,
start timestamp without time zone,
finish timestamp without time zone,
is_ore boolean,
is_ignored boolean,
CONSTRAINT scheduled_activities_pkey PRIMARY KEY (id),
CONSTRAINT scheduledactivities_solution_id_fkey FOREIGN KEY (solution_id)
REFERENCES public.rm_o_schedule_stats (id) MATCH SIMPLE
ON UPDATE CASCADE
ON DELETE CASCADE
)
WITH (
OIDS = FALSE
)
TABLESPACE pg_default;
CREATE INDEX scheduled_activities_activity_id_idx
ON public.scheduledactivities USING btree
(activity_id COLLATE pg_catalog."default" ASC NULLS LAST)
TABLESPACE pg_default;
CREATE INDEX scheduled_activities_id_idx
ON public.scheduledactivities USING btree
(id ASC NULLS LAST)
TABLESPACE pg_default;
CREATE INDEX scheduled_activities_idx
ON public.scheduledactivities USING btree
(solution_id ASC NULLS LAST)
TABLESPACE pg_default;
CREATE INDEX scheduledactivities_activity_index_idx
ON public.scheduledactivities USING btree
(activity_index ASC NULLS LAST)
TABLESPACE pg_default;
EDIT: Additional output from auto_explain after adding index on scheduledactivities (solution_id, activity_index)
Output: s.start_date, s.end_date, s.resources, s.activity_index, r.resource_id, sa.usedresourceset
Sort Key: r.resource_id, s.start_date
Sort Method: quicksort Memory: 6283kB
Buffers: shared hit=20159117 read=375 dirtied=190
-> Nested Loop (cost=0.85..10.76 rows=1 width=100) (actual time=5.518..122489.627 rows=20761 loops=1)
Output: s.start_date, s.end_date, s.resources, s.activity_index, r.resource_id, sa.usedresourceset
Join Filter: (s.activity_index = sa.activity_index)
Rows Removed by Join Filter: 668815615
Buffers: shared hit=20159117 read=375 dirtied=190
-> Nested Loop (cost=0.42..5.80 rows=1 width=112) (actual time=0.057..217.563 rows=20761 loops=1)
Output: s.start_date, s.end_date, s.resources, s.activity_index, s.solution, r.resource_id, r.schedule_id
Inner Unique: true
Join Filter: (s.usage_id = r.id)
Buffers: shared hit=20947 read=375 dirtied=190
-> Index Scan using rm_o_resource_usage_instance_splits_new_solution_idx on public.rm_o_resource_usage_instance_splits_new s (cost=0.42..4.44 rows=1 width=69) (actual time=0.049..17.622 rows=20761 loops=1)
Output: s.start_time, s.end_time, s.resources, s.activity_index, s.usage_id, s.start_date, s.end_date, s.solution
Index Cond: (s.solution = 12644)
Buffers: shared hit=186 read=375 dirtied=190
-> Seq Scan on public.rm_o_resource_usage r (cost=0.00..1.35 rows=1 width=59) (actual time=0.002..0.002 rows=1 loops=20761)
Output: r.id, r.schedule_id, r.resource_id
Filter: (r.schedule_id = 12644)
Rows Removed by Filter: 22
Buffers: shared hit=20761
-> Index Scan using scheduled_activities_idx on public.scheduledactivities sa (cost=0.42..4.94 rows=1 width=16) (actual time=0.007..4.654 rows=32216 loops=20761)
Output: sa.usedresourceset, sa.activity_index, sa.solution_id
Index Cond: (sa.solution_id = 12644)
Buffers: shared hit=20138170
The easiest way to reproduce the issue is to add more values to the three tables. I didn't delete any, only did a few thousand INSERTs.
-> Index Scan using .. s (cost=0.42..8.44 rows=1 width=69) (actual time=0.082..17.418 rows=21899 loops=1)
Index Cond: (s.solution = 12642)
The planner thinks it will find 1 row, and instead finds 21899. That error can pretty clearly lead to bad plans. And a single equality condition should be estimated quite accurately, so I'd say the statistics on your table are way off. It could be that the autovac launcher is tuned poorly so it doesn't run often enough, or it could be that select parts of your data change very rapidly (did you just insert 21899 rows with s.solution = 12642 immediately before running the query?) and so the stats can't be kept accurate enough.
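If the rows with a new solution value are indeed inserted right before the query, the usual workaround is an explicit ANALYZE (or more aggressive per-table autovacuum settings); a sketch, with illustrative values:
-- run right after the bulk insert so the planner sees the new rows
ANALYZE rm_o_resource_usage_instance_splits_new;
-- or make autovacuum analyze this table more eagerly (thresholds are illustrative)
ALTER TABLE rm_o_resource_usage_instance_splits_new
    SET (autovacuum_analyze_scale_factor = 0.01, autovacuum_analyze_threshold = 500);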
-> Nested Loop ...
Join Filter: (s.activity_index = sa.activity_index)
Rows Removed by Join Filter: 705476285
-> ...
-> Index Scan using scheduled_activities_idx on public.scheduledactivities sa (cost=0.42..4.60 rows=1 width=16) (actual time=0.006..4.612 rows=32216 loops=21899)
Output: sa.usedresourceset, sa.activity_index, sa.solution_id
Index Cond: (sa.solution_id = 12642)
If you can't get it to use the Hash Join, you can at least reduce the harm of the Nested Loop by building an index on scheduledactivities (solution_id, activity_index). That way the activity_index criterion could be part of the Index Condition, rather than being a Join Filter. You could probably then drop the index exclusively on solution_id, as there is little point in maintaining both indexes.
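A sketch of the suggested index (the name is illustrative):
CREATE INDEX scheduledactivities_solution_activity_idx
    ON public.scheduledactivities (solution_id, activity_index);
-- the existing single-column index on solution_id can then probably be dropped:
-- DROP INDEX scheduled_activities_idx;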
The SQL statement of the fast plan uses WHERE r.schedule_id = 10 and returns about 22000 rows (with an estimate of 105).
The SQL statement of the slow plan uses WHERE r.schedule_id = 12642 and returns about 21000 rows (with an estimate of only 1).
The slow plan uses nested loops instead of hash joins, probably because of a bad join estimate: the estimated row count is 1, but the actual count is 21899.
For example in this step:
Nested Loop (cost=0.42..9.74 rows=1 width=110) (actual time=0.091..227.705 rows=21899 loops=1)
If the data does not change, there may be a statistics issue (skewed data) for some columns.
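A quick way to inspect the planner's statistics for the suspect column (table and column names as in the question) is to look at pg_stats:
SELECT n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'rm_o_resource_usage_instance_splits_new'
  AND attname = 'solution';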

PostgreSQL table indexing

I want to index my tables for the following query:
select
t.*
from main_transaction t
left join main_profile profile on profile.id = t.profile_id
left join main_customer customer on (customer.id = profile.user_id)
where
(upper(t.request_no) LIKE upper('%'||#requestNumber||'%') OR upper(customer.phone) LIKE upper('%'||#phoneNumber||'%'))
and t.service_type = 'SERVICE_1'
and t.status = 'SUCCESS'
and t.mode = 'AUTO'
and t.transaction_type = 'WITHDRAW'
and customer.client = 'corp'
and t.pub_date>='2018-09-05' and t.pub_date<='2018-11-05'
order by t.pub_date desc, t.id asc
LIMIT 1000;
This is how I tried to index my tables:
CREATE INDEX main_transaction_pr_id ON main_transaction (profile_id);
CREATE INDEX main_profile_user_id ON main_profile (user_id);
CREATE INDEX main_customer_client ON main_customer (client);
CREATE INDEX main_transaction_gin_req_no ON main_transaction USING gin (upper(request_no) gin_trgm_ops);
CREATE INDEX main_customer_gin_phone ON main_customer USING gin (upper(phone) gin_trgm_ops);
CREATE INDEX main_transaction_general ON main_transaction (service_type, status, mode, transaction_type); --> don't know if this one is true!!
After indexing as above, my query still takes over 4.5 seconds just to select 1000 rows!
I am selecting from the following table, which has 34 columns, including 3 FOREIGN KEYs, and over 3 million rows:
CREATE TABLE main_transaction (
id integer NOT NULL DEFAULT nextval('main_transaction_id_seq'::regclass),
description character varying(255) NOT NULL,
request_no character varying(18),
account character varying(50),
service_type character varying(50),
pub_date" timestamptz(6) NOT NULL,
"service_id" varchar(50) COLLATE "pg_catalog"."default",
....
);
I am also joining two tables (main_profile, main_customer) to search on customer.phone and to select customer.client. To get from the main_transaction table to the main_customer table, I can only go through main_profile.
My question is: how can I index my tables to increase performance for the above query?
Please do not suggest replacing the OR condition (upper(t.request_no) LIKE upper('%'||#requestNumber||'%') OR upper(customer.phone) LIKE upper('%'||#phoneNumber||'%')) with a UNION in this case; can we use a CASE WHEN condition instead? I have to convert my PostgreSQL query into Hibernate JPA, and I don't know how to convert a UNION except with Hibernate native SQL, which I am not allowed to use.
Explain:
Limit (cost=411601.73..411601.82 rows=38 width=1906) (actual time=3885.380..3885.381 rows=1 loops=1)
-> Sort (cost=411601.73..411601.82 rows=38 width=1906) (actual time=3885.380..3885.380 rows=1 loops=1)
Sort Key: t.pub_date DESC, t.id
Sort Method: quicksort Memory: 27kB
-> Hash Join (cost=20817.10..411600.73 rows=38 width=1906) (actual time=3214.473..3885.369 rows=1 loops=1)
Hash Cond: (t.profile_id = profile.id)
Join Filter: ((upper((t.request_no)::text) ~~ '%20181104-2158-2723948%'::text) OR (upper((customer.phone)::text) ~~ '%20181104-2158-2723948%'::text))
Rows Removed by Join Filter: 593118
-> Seq Scan on main_transaction t (cost=0.00..288212.28 rows=205572 width=1906) (actual time=0.068..1527.677 rows=593119 loops=1)
Filter: ((pub_date >= '2016-09-05 00:00:00+05'::timestamp with time zone) AND (pub_date <= '2018-11-05 00:00:00+05'::timestamp with time zone) AND ((service_type)::text = 'SERVICE_1'::text) AND ((status)::text = 'SUCCESS'::text) AND ((mode)::text = 'AUTO'::text) AND ((transaction_type)::text = 'WITHDRAW'::text))
Rows Removed by Filter: 2132732
-> Hash (cost=17670.80..17670.80 rows=180984 width=16) (actual time=211.211..211.211 rows=181516 loops=1)
Buckets: 131072 Batches: 4 Memory Usage: 3166kB
-> Hash Join (cost=6936.09..17670.80 rows=180984 width=16) (actual time=46.846..183.689 rows=181516 loops=1)
Hash Cond: (customer.id = profile.user_id)
-> Seq Scan on main_customer customer (cost=0.00..5699.73 rows=181106 width=16) (actual time=0.013..40.866 rows=181618 loops=1)
Filter: ((client)::text = 'corp'::text)
Rows Removed by Filter: 16920
-> Hash (cost=3680.04..3680.04 rows=198404 width=8) (actual time=46.087..46.087 rows=198404 loops=1)
Buckets: 131072 Batches: 4 Memory Usage: 2966kB
-> Seq Scan on main_profile profile (cost=0.00..3680.04 rows=198404 width=8) (actual time=0.008..20.099 rows=198404 loops=1)
Planning time: 0.757 ms
Execution time: 3885.680 ms
With the restriction to not use UNION, you won't get a good plan.
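For comparison only, the UNION rewrite that would let each OR branch use its own trigram index would look roughly like this (a sketch, not tested against the schema):
SELECT * FROM (
    SELECT t.* FROM main_transaction t
    JOIN main_profile profile ON profile.id = t.profile_id
    JOIN main_customer customer ON customer.id = profile.user_id
    WHERE upper(t.request_no) LIKE upper('%'||#requestNumber||'%')
      AND t.service_type = 'SERVICE_1' AND t.status = 'SUCCESS' AND t.mode = 'AUTO'
      AND t.transaction_type = 'WITHDRAW' AND customer.client = 'corp'
      AND t.pub_date >= '2018-09-05' AND t.pub_date <= '2018-11-05'
    UNION
    SELECT t.* FROM main_transaction t
    JOIN main_profile profile ON profile.id = t.profile_id
    JOIN main_customer customer ON customer.id = profile.user_id
    WHERE upper(customer.phone) LIKE upper('%'||#phoneNumber||'%')
      AND t.service_type = 'SERVICE_1' AND t.status = 'SUCCESS' AND t.mode = 'AUTO'
      AND t.transaction_type = 'WITHDRAW' AND customer.client = 'corp'
      AND t.pub_date >= '2018-09-05' AND t.pub_date <= '2018-11-05'
) q
ORDER BY q.pub_date DESC, q.id ASC
LIMIT 1000;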
You can slightly speed up processing with the following indexes:
main_transaction ((service_type::text), (status::text), (mode::text),
(transaction_type::text), pub_date)
main_customer ((client::text))
These should at least get rid of the sequential scans, but the hash join that takes the lion's share of the processing time will remain.
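Spelled out as DDL, the suggested indexes would look roughly like this (the index names are illustrative):
CREATE INDEX main_transaction_filter_idx
    ON main_transaction ((service_type::text), (status::text), (mode::text),
                         (transaction_type::text), pub_date);
CREATE INDEX main_customer_client_text_idx
    ON main_customer ((client::text));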

Query on postgres 9.5 several times slower than on postgres 9.1

I am running performance tests on two systems that have the same postgres database,
with the same data. One system has postgres 9.1 and the other postgres 9.5.
The only slight differences in the data are caused by slightly different timings (time ordering) of concurrent insertions, and they should not be significant (less than 0.1 percent of the number of rows).
The postgres configuration of both systems is given below.
The query, the database schema and the query plans both on postgres 9.5 and 9.1 are also given below.
I am consistently getting several times slower query execution on postgres 9.5:
postgres 9.5: Execution time: 280.777 ms
postgres 9.1: Total runtime: 66.566 ms
which I can not understand.
In this case postgres 9.5 is several times slower.
It uses a different query plan than postgres 9.1.
Any help explaining why the performance is different, or directions on where to look, will be greatly appreciated.
Postgres configuration
The configuration of postgres 9.1 is:
------------------------------------------------------------------------------
shared_buffers = 2048MB
temp_buffers = 8MB
work_mem = 18MB
maintenance_work_mem = 512MB
max_stack_depth = 2MB
------------------------------------------------------------------------------
wal_level = minimal
wal_buffers = 8MB
checkpoint_segments = 32
checkpoint_timeout = 5min
checkpoint_completion_target = 0.9
checkpoint_warning = 30s
------------------------------------------------------------------------------
effective_cache_size = 5120MB
The configuration of the postgres 9.5 is:
------------------------------
shared_buffers = 2048MB
temp_buffers = 8MB
work_mem = 80MB
maintenance_work_mem = 512MB
dynamic_shared_memory_type = posix
------------------------------
wal_level = minimal
wal_buffers = 8MB
max_wal_size = 5GB
wal_keep_segments = 32
effective_cache_size = 4GB
default_statistics_target = 100
------------------------------
QUERY:
EXPLAIN ANALYZE SELECT bricks.id AS gid,TIMESTAMP '2017-07-03 06:00:00' AS time,
intensity, bricks.weather_type, polygons.geometry AS geom,
starttime, endtime, ST_AsGeoJSON(polygons.geometry) as geometry_json,
weather_types.name as weather_type_literal, notification_type, temp.level as level
FROM bricks
INNER JOIN weather_types
ON weather_types.id = bricks.weather_type
INNER JOIN polygons
ON bricks.polygon_id=polygons.id
JOIN notifications
ON bricks.notification_id = notifications.id
JOIN bricks_notif_type_priority prio
ON (bricks.polygon_id = prio.polygon_id
AND bricks.weather_type = prio.weather_type
AND notifications.notification_type = prio.priority_notif_type)
JOIN (VALUES (14, 1),
(4, 2),
(5, 3),
(1, 4),
(2, 5),
(3, 6),
(13, 7),
(15,8)) as order_list (id, ordering)
ON bricks.weather_type = order_list.id
JOIN (VALUES
(15,1,1,2,'warning'::notification_type_enum),
(15,2,2,3,'warning'::notification_type_enum),
(15,3,3,999999,'warning'::notification_type_enum),
(13,1,1,2,'warning'::notification_type_enum),
(13,2,2,3,'warning'::notification_type_enum),
(13,3,3,99999,'warning'::notification_type_enum),
(5,1,1,3,'warning'::notification_type_enum),
(5,2,3,5,'warning'::notification_type_enum),
(5,3,5,99999,'warning'::notification_type_enum),
(4,1,15,25,'warning'::notification_type_enum),
(4,2,25,35,'warning'::notification_type_enum),
(4,3,35,99999,'warning'::notification_type_enum),
(3,1,75,100,'warning'::notification_type_enum),
(3,2,100,130,'warning'::notification_type_enum),
(3,3,130,99999,'warning'::notification_type_enum),
(2,1,30,50,'warning'::notification_type_enum),
(2,2,50,100,'warning'::notification_type_enum),
(2,3,100,99999,'warning'::notification_type_enum),
(1,1,18,50,'warning'::notification_type_enum),
(1,2,50,300,'warning'::notification_type_enum),
(1,3,300,999999,'warning'::notification_type_enum),
(15,1,1,2,'autowarn'::notification_type_enum),
(15,2,2,3,'autowarn'::notification_type_enum),
(15,3,3,999999,'autowarn'::notification_type_enum),
(13,1,1,2,'autowarn'::notification_type_enum),
(13,2,2,3,'autowarn'::notification_type_enum),
(13,3,3,99999,'autowarn'::notification_type_enum),
(5,1,10,20,'autowarn'::notification_type_enum),
(5,2,20,50,'autowarn'::notification_type_enum),
(5,3,50,99999,'autowarn'::notification_type_enum),
(4,1,15,25,'autowarn'::notification_type_enum),
(4,2,25,35,'autowarn'::notification_type_enum),
(4,3,35,99999,'autowarn'::notification_type_enum),
(3,1,75,100,'autowarn'::notification_type_enum),
(3,2,100,130,'autowarn'::notification_type_enum),
(3,3,130,99999,'autowarn'::notification_type_enum),
(2,1,30,50,'autowarn'::notification_type_enum),
(2,2,50,100,'autowarn'::notification_type_enum),
(2,3,100,99999,'autowarn'::notification_type_enum),
(1,1,18,50,'autowarn'::notification_type_enum),
(1,2,50,300,'autowarn'::notification_type_enum),
(1,3,300,999999,'autowarn'::notification_type_enum),
(15,1,1,2,'forewarn'::notification_type_enum),
(15,2,2,3,'forewarn'::notification_type_enum),
(15,3,3,999999,'forewarn'::notification_type_enum),
(13,1,1,2,'forewarn'::notification_type_enum),
(13,2,2,3,'forewarn'::notification_type_enum),
(13,3,3,99999,'forewarn'::notification_type_enum),
(5,1,1,3,'forewarn'::notification_type_enum),
(5,2,3,5,'forewarn'::notification_type_enum),
(5,3,5,99999,'forewarn'::notification_type_enum),
(4,1,15,25,'forewarn'::notification_type_enum),
(4,2,25,35,'forewarn'::notification_type_enum),
(4,3,35,99999,'forewarn'::notification_type_enum),
(3,1,75,100,'forewarn'::notification_type_enum),
(3,2,100,130,'forewarn'::notification_type_enum),
(3,3,130,99999,'forewarn'::notification_type_enum),
(2,1,30,50,'forewarn'::notification_type_enum),
(2,2,50,100,'forewarn'::notification_type_enum),
(2,3,100,99999,'forewarn'::notification_type_enum),
(1,1,18,50,'forewarn'::notification_type_enum),
(1,2,50,300,'forewarn'::notification_type_enum),
(1,3,300,999999,'forewarn'::notification_type_enum),
(15,1,1,2,'auto-forewarn'::notification_type_enum),
(15,2,2,3,'auto-forewarn'::notification_type_enum),
(15,3,3,999999,'auto-forewarn'::notification_type_enum),
(13,1,1,2,'auto-forewarn'::notification_type_enum),
(13,2,2,3,'auto-forewarn'::notification_type_enum),
(13,3,3,99999,'auto-forewarn'::notification_type_enum),
(5,1,10,20,'auto-forewarn'::notification_type_enum),
(5,2,20,50,'auto-forewarn'::notification_type_enum),
(5,3,50,99999,'auto-forewarn'::notification_type_enum),
(4,1,15,25,'auto-forewarn'::notification_type_enum),
(4,2,25,35,'auto-forewarn'::notification_type_enum),
(4,3,35,99999,'auto-forewarn'::notification_type_enum),
(3,1,75,100,'auto-forewarn'::notification_type_enum),
(3,2,100,130,'auto-forewarn'::notification_type_enum),
(3,3,130,99999,'auto-forewarn'::notification_type_enum),
(2,1,30,50,'auto-forewarn'::notification_type_enum),
(2,2,50,100,'auto-forewarn'::notification_type_enum),
(2,3,100,99999,'auto-forewarn'::notification_type_enum),
(1,1,18,50,'auto-forewarn'::notification_type_enum),
(1,2,50,300,'auto-forewarn'::notification_type_enum),
(1,3,300,999999,'auto-forewarn'::notification_type_enum)
) AS temp (weather_type,level,min,max,ntype)
ON bricks.weather_type = temp.weather_type
AND notifications.notification_type=temp.ntype
AND intensity >= temp.min
AND intensity < temp.max
AND temp.level in (1,2,3)
WHERE polygons.set = 0
AND '2017-07-03 06:00:00' BETWEEN starttime AND endtime
AND weather_types.name in ('rain','snowfall','storm','freezingRain','thunderstorm')
ORDER BY
(CASE notifications.notification_type
WHEN 'forewarn' THEN 0
WHEN 'auto-forewarn' then 0
ELSE 1
END) ASC,
temp.level ASC,
order_list.ordering DESC
LIMIT 10000000
TABLE information:
Table "public.notifications"
Column | Type | Modifiers
-------------------+-----------------------------+------------------------------------------------------------
id | integer | not null default nextval('notifications_id_seq'::regclass)
weather_type | integer |
macro_blob | bytea | not null
logproxy_id | integer | not null
sent_time | timestamp without time zone | not null
max_endtime | timestamp without time zone | not null
notification_type | notification_type_enum | not null
Indexes:
"notifications_pkey" PRIMARY KEY, btree (id)
"notifications_unique_logproxy_id" UNIQUE CONSTRAINT, btree (logproxy_id)
"notifications_max_endtime_idx" btree (max_endtime)
"notifications_weather_type_idx" btree (weather_type)
Foreign-key constraints:
"notifications_weather_type_fkey" FOREIGN KEY (weather_type) REFERENCES weather_types(id) ON DELETE CASCADE
Referenced by:
TABLE "bricks_notif_extent" CONSTRAINT "bricks_notif_extent_notification_id_fkey" FOREIGN KEY (notification_id) REFERENCES notifications(id) ON DELETE CASCADE
TABLE "bricks" CONSTRAINT "bricks_notification_id_fkey" FOREIGN KEY (notification_id) REFERENCES notifications(id) ON DELETE CASCADE
Table "public.bricks"
Column | Type | Modifiers
-----------------+-----------------------------+-----------------------------------------------------
id | bigint | not null default nextval('bricks_id_seq'::regclass)
polygon_id | integer |
terrain_types | integer | not null
weather_type | integer |
intensity | integer | not null
starttime | timestamp without time zone | not null
endtime | timestamp without time zone | not null
notification_id | integer |
Indexes:
"bricks_pkey" PRIMARY KEY, btree (id)
"bricks_notification_id_idx" btree (notification_id)
"bricks_period_idx" gist (cube(date_part('epoch'::text, starttime), date_part('epoch'::text, endtime)))
"bricks_polygon_idx" btree (polygon_id)
"bricks_weather_type_idx" btree (weather_type)
Foreign-key constraints:
"bricks_notification_id_fkey" FOREIGN KEY (notification_id) REFERENCES notifications(id) ON DELETE CASCADE
"bricks_polygon_id_fkey" FOREIGN KEY (polygon_id) REFERENCES polygons(id) ON DELETE CASCADE
"bricks_weather_type_fkey" FOREIGN KEY (weather_type) REFERENCES weather_types(id) ON DELETE CASCADE
Table "public.polygons"
Column | Type | Modifiers
----------+-----------------------+-------------------------------------------------------
id | integer | not null default nextval('polygons_id_seq'::regclass)
country | character(2) | not null
set | integer | not null
geometry | geometry |
zone_id | character varying(32) |
Indexes:
"polygons_pkey" PRIMARY KEY, btree (id)
"polygons_geometry_idx" gist (geometry)
"polygons_zone_id_idx" btree (zone_id)
Check constraints:
"enforce_dims_geometry" CHECK (st_ndims(geometry) = 2)
"enforce_geotype_geometry" CHECK (geometrytype(geometry) = 'MULTIPOLYGON'::text OR geometry IS NULL)
"enforce_srid_geometry" CHECK (st_srid(geometry) = 4326)
Referenced by:
TABLE "bricks_notif_type_priority" CONSTRAINT "bricks_notif_type_priority_polygon_id_fkey" FOREIGN KEY (polygon_id) REFERENCES polygons(id) ON DELETE CASCADE
TABLE "bricks" CONSTRAINT "bricks_polygon_id_fkey" FOREIGN KEY (polygon_id) REFERENCES polygons(id) ON DELETE CASCADE
TABLE "polygon_contains" CONSTRAINT "polygon_contains_contained_id_fkey" FOREIGN KEY (contained_id) REFERENCES polygons(id) ON DELETE CASCADE
TABLE "polygon_contains" CONSTRAINT "polygon_contains_id_fkey" FOREIGN KEY (id) REFERENCES polygons(id) ON DELETE CASCADE
Table "public.weather_types"
Column | Type | Modifiers
--------+-----------------------+------------------------------------------------------------
id | integer | not null default nextval('weather_types_id_seq'::regclass)
name | character varying(32) | not null
Indexes:
"weather_types_pkey" PRIMARY KEY, btree (id)
"weather_type_unique_name" UNIQUE CONSTRAINT, btree (name)
Referenced by:
TABLE "bricks_notif_type_priority" CONSTRAINT "bricks_notif_type_priority_weather_type_fkey" FOREIGN KEY (weather_type) REFERENCES weather_types(id) ON DELETE CASCADE
TABLE "bricks" CONSTRAINT "bricks_weather_type_fkey" FOREIGN KEY (weather_type) REFERENCES weather_types(id) ON DELETE CASCADE
TABLE "notifications" CONSTRAINT "notifications_weather_type_fkey" FOREIGN KEY (weather_type) REFERENCES weather_types(id) ON DELETE CASCADE
Table "public.bricks_notif_type_priority"
Column | Type | Modifiers
---------------------+------------------------+-------------------------------------------------------------------------
id | integer | not null default nextval('bricks_notif_type_priority_id_seq'::regclass)
polygon_id | integer | not null
weather_type | integer | not null
priority_notif_type | notification_type_enum | not null
Indexes:
"bricks_notif_type_priority_pkey" PRIMARY KEY, btree (id)
"bricks_notif_type_priority_poly_idx" btree (polygon_id, weather_type)
Foreign-key constraints:
"bricks_notif_type_priority_polygon_id_fkey" FOREIGN KEY (polygon_id) REFERENCES polygons(id) ON DELETE CASCADE
"bricks_notif_type_priority_weather_type_fkey" FOREIGN KEY (weather_type) REFERENCES weather_types(id) ON DELETE CASCADE
The query plan on postgres 9.5:
Limit (cost=1339.71..1339.72 rows=1 width=2083) (actual time=280.390..280.429 rows=224 loops=1)
-> Sort (cost=1339.71..1339.72 rows=1 width=2083) (actual time=280.388..280.404 rows=224 loops=1)
Sort Key: (CASE notifications.notification_type WHEN 'forewarn'::notification_type_enum THEN 0 WHEN 'auto-forewarn'::notification_type_enum THEN 0 ELSE 1 END), "*VALUES*_1".column2, "*VALUES*".column2 DESC
Sort Method: quicksort Memory: 929kB
-> Nested Loop (cost=437.79..1339.70 rows=1 width=2083) (actual time=208.373..278.926 rows=224 loops=1)
Join Filter: (bricks.polygon_id = polygons.id)
-> Nested Loop (cost=437.50..1339.30 rows=1 width=62) (actual time=186.122..221.536 rows=307 loops=1)
Join Filter: ((weather_types.id = prio.weather_type) AND (notifications.notification_type = prio.priority_notif_type))
Rows Removed by Join Filter: 655
-> Nested Loop (cost=437.08..1336.68 rows=1 width=74) (actual time=5.522..209.237 rows=652 loops=1)
Join Filter: ("*VALUES*_1".column5 = notifications.notification_type)
Rows Removed by Join Filter: 1956
-> Merge Join (cost=436.66..1327.38 rows=4 width=74) (actual time=5.277..195.569 rows=2608 loops=1)
Merge Cond: (bricks.weather_type = weather_types.id)
-> Nested Loop (cost=435.33..1325.89 rows=28 width=60) (actual time=5.232..193.652 rows=2608 loops=1)
Join Filter: ((bricks.intensity >= "*VALUES*_1".column3) AND (bricks.intensity < "*VALUES*_1".column4))
Rows Removed by Join Filter: 5216
-> Merge Join (cost=1.61..1.67 rows=1 width=28) (actual time=0.093..0.181 rows=84 loops=1)
Merge Cond: ("*VALUES*".column1 = "*VALUES*_1".column1)
-> Sort (cost=0.22..0.24 rows=8 width=8) (actual time=0.022..0.024 rows=8 loops=1)
Sort Key: "*VALUES*".column1
Sort Method: quicksort Memory: 25kB
-> Values Scan on "*VALUES*" (cost=0.00..0.10 rows=8 width=8) (actual time=0.005..0.007 rows=8 loops=1)
-> Sort (cost=1.39..1.40 rows=3 width=20) (actual time=0.067..0.097 rows=84 loops=1)
Sort Key: "*VALUES*_1".column1
Sort Method: quicksort Memory: 31kB
-> Values Scan on "*VALUES*_1" (cost=0.00..1.36 rows=3 width=20) (actual time=0.009..0.041 rows=84 loops=1)
Filter: (column2 = ANY ('{1,2,3}'::integer[]))
-> Bitmap Heap Scan on bricks (cost=433.72..1302.96 rows=1417 width=40) (actual time=0.568..2.277 rows=93 loops=84)
Recheck Cond: (weather_type = "*VALUES*".column1)
Filter: (('2017-07-03 06:00:00'::timestamp without time zone >= starttime) AND ('2017-07-03 06:00:00'::timestamp without time zone <= endtime))
Rows Removed by Filter: 6245
Heap Blocks: exact=18528
-> Bitmap Index Scan on bricks_weather_type_idx (cost=0.00..433.37 rows=8860 width=0) (actual time=0.536..0.536 rows=6361 loops=84)
Index Cond: (weather_type = "*VALUES*".column1)
-> Sort (cost=1.33..1.35 rows=5 width=14) (actual time=0.040..0.418 rows=2609 loops=1)
Sort Key: weather_types.id
Sort Method: quicksort Memory: 25kB
-> Seq Scan on weather_types (cost=0.00..1.28 rows=5 width=14) (actual time=0.009..0.013 rows=5 loops=1)
Filter: ((name)::text = ANY ('{rain,snowfall,storm,freezingRain,thunderstorm}'::text[]))
Rows Removed by Filter: 12
-> Index Scan using notifications_pkey on notifications (cost=0.41..2.31 rows=1 width=8) (actual time=0.004..0.004 rows=1 loops=2608)
Index Cond: (id = bricks.notification_id)
-> Index Scan using bricks_notif_type_priority_poly_idx on bricks_notif_type_priority prio (cost=0.42..2.61 rows=1 width=12) (actual time=0.017..0.017 rows=1 loops=652)
Index Cond: ((polygon_id = bricks.polygon_id) AND (weather_type = bricks.weather_type))
-> Index Scan using polygons_pkey on polygons (cost=0.29..0.38 rows=1 width=2033) (actual time=0.021..0.022 rows=1 loops=307)
Index Cond: (id = prio.polygon_id)
Filter: (set = 0)
Rows Removed by Filter: 0
Planning time: 27.326 ms
Execution time: 280.777 ms
The query plan on postgres 9.1:
Limit (cost=2486.29..2486.30 rows=1 width=8219) (actual time=66.273..66.301 rows=224 loops=1)
-> Sort (cost=2486.29..2486.30 rows=1 width=8219) (actual time=66.272..66.281 rows=224 loops=1)
Sort Key: (CASE notifications.notification_type WHEN 'forewarn'::notification_type_enum THEN 0 WHEN 'auto-forewarn'::notification_type_enum THEN 0 ELSE 1 END), "*VALUES*".column2, "*VALUES*".column2
Sort Method: quicksort Memory: 1044kB
-> Nested Loop (cost=171.27..2486.28 rows=1 width=8219) (actual time=22.064..65.335 rows=224 loops=1)
Join Filter: ((bricks.intensity >= "*VALUES*".column3) AND (bricks.intensity < "*VALUES*".column4) AND (weather_types.id = "*VALUES*".column1) AND (notifications.notification_type = "*VALUES*".column5))
-> Nested Loop (cost=171.27..2484.85 rows=1 width=8231) (actual time=16.632..24.503 rows=224 loops=1)
Join Filter: (prio.priority_notif_type = notifications.notification_type)
-> Nested Loop (cost=171.27..2482.66 rows=1 width=8231) (actual time=3.318..22.309 rows=681 loops=1)
-> Nested Loop (cost=171.27..2479.52 rows=1 width=98) (actual time=3.175..19.252 rows=962 loops=1)
-> Nested Loop (cost=171.27..2471.14 rows=1 width=86) (actual time=3.144..14.751 rows=652 loops=1)
-> Hash Join (cost=0.20..29.73 rows=1 width=46) (actual time=0.025..0.039 rows=5 loops=1)
Hash Cond: (weather_types.id = "*VALUES*".column1)
-> Seq Scan on weather_types (cost=0.00..29.50 rows=5 width=38) (actual time=0.007..0.015 rows=5 loops=1)
Filter: ((name)::text = ANY ('{rain,snowfall,storm,freezingRain,thunderstorm}'::text[]))
-> Hash (cost=0.10..0.10 rows=8 width=8) (actual time=0.006..0.006 rows=8 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
-> Values Scan on "*VALUES*" (cost=0.00..0.10 rows=8 width=8) (actual time=0.002..0.004 rows=8 loops=1)
-> Bitmap Heap Scan on bricks (cost=171.07..2352.98 rows=7074 width=40) (actual time=0.718..2.917 rows=130 loops=5)
Recheck Cond: (weather_type = weather_types.id)
Filter: (('2017-07-03 06:00:00'::timestamp without time zone >= starttime) AND ('2017-07-03 06:00:00'::timestamp without time zone <= endtime))
-> Bitmap Index Scan on bricks_weather_type_idx (cost=0.00..170.71 rows=8861 width=0) (actual time=0.644..0.644 rows=8906 loops=5)
Index Cond: (weather_type = weather_types.id)
-> Index Scan using bricks_notif_type_priority_poly_idx on bricks_notif_type_priority prio (cost=0.00..8.37 rows=1 width=12) (actual time=0.006..0.006 rows=1 loops=652)
Index Cond: ((polygon_id = bricks.polygon_id) AND (weather_type = weather_types.id))
-> Index Scan using polygons_pkey on polygons (cost=0.00..3.13 rows=1 width=8145) (actual time=0.003..0.003 rows=1 loops=962)
Index Cond: (id = bricks.polygon_id)
Filter: (set = 0)
-> Index Scan using notifications_pkey on notifications (cost=0.00..2.17 rows=1 width=8) (actual time=0.002..0.003 rows=1 loops=681)
Index Cond: (id = bricks.notification_id)
-> Values Scan on "*VALUES*" (cost=0.00..1.36 rows=3 width=20) (actual time=0.001..0.028 rows=84 loops=224)
Filter: (column2 = ANY ('{1,2,3}'::integer[]))
Total runtime: 66.566 ms
(33 rows)
work_mem = 80MB
By assigning more memory to this parameter you can indeed speed up some queries by performing larger operations in memory, but it can also encourage the query planner to choose less optimal paths. Try tweaking/decreasing this parameter and re-running the query.
Correspondingly, you might want to run VACUUM ANALYZE on your 9.5 database.
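A sketch of both suggestions, for the current session only (the 18MB value simply mirrors the 9.1 configuration):
VACUUM ANALYZE;          -- refresh planner statistics on the 9.5 database
SET work_mem = '18MB';   -- mirror the 9.1 value for this session only
-- ... re-run the EXPLAIN ANALYZE from above ...
RESET work_mem;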
I believe you will have noticed that the query plan has changed significantly between 9.1 and 9.5.
It seems to me there are two issues with the new query plan.
Too many rows; refer to:
-> Sort (cost=1.33..1.35 rows=5 width=14) (actual time=0.040..0.418 rows=2609 loops=1)
Sort Key: weather_types.id
Sort Method: quicksort Memory: 25kB
-> Seq Scan on weather_types (cost=0.00..1.28 rows=5 width=14) (actual time=0.009..0.013 rows=5 loops=1)
Filter: ((name)::text = ANY ('{rain,snowfall,storm,freezingRain,thunderstorm}'::text[]))
Rows Removed by Filter: 12
This leads to high I/O and high CPU usage (to sort), and most of those rows are useless towards the end goal (refer to these lines):
Rows Removed by Join Filter: 655
Rows Removed by Join Filter: 1956
Rows Removed by Join Filter: 5216
There have been a few changes to query planning between 9.1 and 9.5, but I am not sure which one had an impact here.
The solution listed at https://www.datadoghq.com/blog/100x-faster-postgres-performance-by-changing-1-line/ works on 9.1 (compare the line in that post with the following line from your 9.1 query plan):
-> Values Scan on "*VALUES*"
If you can share the data set for the other tables, it would be easier to find which join/where clause caused this.

PostgreSQL efficient query with filter over boolean

There's a table with 15M rows holding users' inbox data:
user_id | integer | not null
subject | character varying(255) | not null
...
last_message_id | integer |
last_message_at | timestamp with time zone |
deleted_at | timestamp with time zone |
Here's the slow query in a nutshell:
SELECT *
FROM dialogs
WHERE user_id = 1234
AND deleted_at IS NULL
LIMIT 21
Full query:
(irrelevant fields deleted)
SELECT "dialogs"."id", "dialogs"."subject", "dialogs"."product_id", "dialogs"."user_id", "dialogs"."participant_id", "dialogs"."thread_id", "dialogs"."last_message_id", "dialogs"."last_message_at", "dialogs"."read_at", "dialogs"."deleted_at", "products"."id", ... , T4."id", ... , "messages"."id", ...,
FROM "dialogs"
LEFT OUTER JOIN "products" ON ("dialogs"."product_id" = "products"."id")
INNER JOIN "auth_user" T4 ON ("dialogs"."participant_id" = T4."id")
LEFT OUTER JOIN "messages" ON ("dialogs"."last_message_id" = "messages"."id")
WHERE ("dialogs"."deleted_at" IS NULL AND "dialogs"."user_id" = 9069)
ORDER BY "dialogs"."last_message_id" DESC
LIMIT 21;
EXPLAIN:
Limit (cost=1.85..28061.24 rows=21 width=1693) (actual time=4.700..93087.871 rows=17 loops=1)
-> Nested Loop Left Join (cost=1.85..9707215.30 rows=7265 width=1693) (actual time=4.699..93087.861 rows=17 loops=1)
-> Nested Loop (cost=1.41..9647421.07 rows=7265 width=1457) (actual time=4.689..93062.481 rows=17 loops=1)
-> Nested Loop Left Join (cost=0.99..9611285.66 rows=7265 width=1115) (actual time=4.676..93062.292 rows=17 loops=1)
-> Index Scan Backward using dialogs_last_message_id on dialogs (cost=0.56..9554417.92 rows=7265 width=102) (actual time=4.629..93062.050 rows=17 loops=1)
Filter: ((deleted_at IS NULL) AND (user_id = 9069))
Rows Removed by Filter: 6852907
-> Index Scan using products_pkey on products (cost=0.43..7.82 rows=1 width=1013) (actual time=0.012..0.012 rows=1 loops=17)
Index Cond: (dialogs.product_id = id)
-> Index Scan using auth_user_pkey on auth_user t4 (cost=0.42..4.96 rows=1 width=342) (actual time=0.009..0.010 rows=1 loops=17)
Index Cond: (id = dialogs.participant_id)
-> Index Scan using messages_pkey on messages (cost=0.44..8.22 rows=1 width=236) (actual time=1.491..1.492 rows=1 loops=17)
Index Cond: (dialogs.last_message_id = id)
Total runtime: 93091.494 ms
(14 rows)
OFFSET is not used
There's an index on the user_id field.
An index on deleted_at isn't used because the condition isn't selective enough (90% of the values are actually NULL). A partial index (... WHERE deleted_at IS NULL) won't help either.
It gets especially slow if the query hits a part of the results that was created a long time ago; then the query has to filter out and discard millions of rows in between.
List of indexes:
Indexes:
"dialogs_pkey" PRIMARY KEY, btree (id)
"dialogs_deleted_at_d57b320e_uniq" btree (deleted_at) WHERE deleted_at IS NULL
"dialogs_last_message_id" btree (last_message_id)
"dialogs_participant_id" btree (participant_id)
"dialogs_product_id" btree (product_id)
"dialogs_thread_id" btree (thread_id)
"dialogs_user_id" btree (user_id)
Currently I'm thinking about querying only recent data (i.e. ... WHERE last_message_at > <date 3-6 months ago>) with an appropriate index (BRIN?).
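A sketch of that idea (the index name is illustrative):
CREATE INDEX dialogs_last_message_at_brin
    ON dialogs USING brin (last_message_at);
-- then restrict queries to recent data, e.g.:
-- ... AND last_message_at > now() - interval '6 months'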
What is best practice to speed up such queries?
As posted in comments:
Start by creating a partial index on (user_id, last_message_id) with a condition WHERE deleted_at IS NULL
Per your answer, this seems to be quite effective :-)
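In DDL form (the index name here is illustrative; the plan below shows the name actually used):
CREATE INDEX dialogs_user_id_last_message_id_partial
    ON dialogs (user_id, last_message_id)
    WHERE deleted_at IS NULL;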
So, here are the results of the solutions I tried.
1) The index (user_id) WHERE deleted_at IS NULL is used only in rare cases, depending on the particular user_id value in the WHERE user_id = ? condition. Most of the time the query has to filter out rows as before.
2) The greatest speedup was achieved with the (user_id, last_message_id) WHERE deleted_at IS NULL index. While it's 2.5x bigger than the other tested indexes, it's used all the time and is very fast. Here's the resulting query plan:
Limit (cost=1.72..270.45 rows=11 width=1308) (actual time=0.105..0.468 rows=8 loops=1)
-> Nested Loop Left Join (cost=1.72..247038.21 rows=10112 width=1308) (actual time=0.104..0.465 rows=8 loops=1)
-> Nested Loop (cost=1.29..164532.13 rows=10112 width=1072) (actual time=0.071..0.293 rows=8 loops=1)
-> Nested Loop Left Join (cost=0.86..116292.45 rows=10112 width=736) (actual time=0.057..0.198 rows=8 loops=1)
-> Index Scan Backward using dialogs_user_id_last_message_id_d57b320e on dialogs (cost=0.43..38842.21 rows=10112 width=102) (actual time=0.038..0.084 rows=8 loops=1)
Index Cond: (user_id = 9069)
-> Index Scan using products_pkey on products (cost=0.43..7.65 rows=1 width=634) (actual time=0.012..0.012 rows=1 loops=8)
Index Cond: (dialogs.product_id = id)
-> Index Scan using auth_user_pkey on auth_user t4 (cost=0.42..4.76 rows=1 width=336) (actual time=0.010..0.010 rows=1 loops=8)
Index Cond: (id = dialogs.participant_id)
-> Index Scan using messages_pkey on messages (cost=0.44..8.15 rows=1 width=236) (actual time=0.019..0.020 rows=1 loops=8)
Index Cond: (dialogs.last_message_id = id)
Total runtime: 0.678 ms
Thanks @jcaron. Your suggestion should be the accepted answer.

PostgreSQL recursive CTE performance issue

I'm trying to understand such a huge difference in performance of two queries.
Let's assume I have two tables.
First one contains A records for some set of domains:
Table "public.dns_a"
Column | Type | Modifiers | Storage | Stats target | Description
--------+------------------------+-----------+----------+--------------+-------------
name | character varying(125) | | extended | |
a | inet | | main | |
Indexes:
"dns_a_a_idx" btree (a)
"dns_a_name_idx" btree (name varchar_pattern_ops)
Second table handles CNAME records:
Table "public.dns_cname"
Column | Type | Modifiers | Storage | Stats target | Description
--------+------------------------+-----------+----------+--------------+-------------
name | character varying(256) | | extended | |
cname | character varying(256) | | extended | |
Indexes:
"dns_cname_cname_idx" btree (cname varchar_pattern_ops)
"dns_cname_name_idx" btree (name varchar_pattern_ops)
Now I'm trying to solve the "simple" problem of getting all the domains pointing to the same IP address, including CNAMEs.
The first attempt, using a recursive CTE, kind of works:
EXPLAIN ANALYZE WITH RECURSIVE names_traverse AS (
(
SELECT name::varchar(256), NULL::varchar(256) as cname, a FROM dns_a WHERE a = '118.145.5.20'
)
UNION ALL
SELECT c.name, c.cname, NULL::inet as a FROM names_traverse nt, dns_cname c WHERE c.cname=nt.name
)
SELECT * FROM names_traverse;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
CTE Scan on names_traverse (cost=3051757.20..4337044.86 rows=64264383 width=1064) (actual time=0.037..1697.444 rows=199 loops=1)
CTE names_traverse
-> Recursive Union (cost=0.57..3051757.20 rows=64264383 width=45) (actual time=0.036..1697.395 rows=199 loops=1)
-> Index Scan using dns_a_a_idx on dns_a (cost=0.57..1988.89 rows=1953 width=24) (actual time=0.035..0.064 rows=14 loops=1)
Index Cond: (a = '118.145.5.20'::inet)
-> Merge Join (cost=4377.00..176448.06 rows=6426243 width=45) (actual time=498.101..848.648 rows=92 loops=2)
Merge Cond: ((c.cname)::text = (nt.name)::text)
-> Index Scan using dns_cname_cname_idx on dns_cname c (cost=0.56..69958.06 rows=2268434 width=45) (actual time=4.732..688.456 rows=2219973 loops=2)
-> Materialize (cost=4376.44..4474.09 rows=19530 width=516) (actual time=0.039..0.084 rows=187 loops=2)
-> Sort (cost=4376.44..4425.27 rows=19530 width=516) (actual time=0.037..0.053 rows=100 loops=2)
Sort Key: nt.name USING ~<~
Sort Method: quicksort Memory: 33kB
-> WorkTable Scan on names_traverse nt (cost=0.00..390.60 rows=19530 width=516) (actual time=0.001..0.007 rows=100 loops=2)
Planning time: 0.130 ms
Execution time: 1697.477 ms
(15 rows)
There are two loops in the example above, so if I make a simple outer join query, I get much better results:
EXPLAIN ANALYZE
SELECT *
FROM dns_a a
LEFT JOIN dns_cname c1 ON (c1.cname=a.name)
LEFT JOIN dns_cname c2 ON (c2.cname=c1.name)
WHERE a.a='118.145.5.20';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop Left Join (cost=1.68..65674.19 rows=1953 width=114) (actual time=1.086..12.992 rows=189 loops=1)
-> Nested Loop Left Join (cost=1.12..46889.57 rows=1953 width=69) (actual time=1.085..2.154 rows=189 loops=1)
-> Index Scan using dns_a_a_idx on dns_a a (cost=0.57..1988.89 rows=1953 width=24) (actual time=0.022..0.055 rows=14 loops=1)
Index Cond: (a = '118.145.5.20'::inet)
-> Index Scan using dns_cname_cname_idx on dns_cname c1 (cost=0.56..19.70 rows=329 width=45) (actual time=0.137..0.148 rows=13 loops=14)
Index Cond: ((cname)::text = (a.name)::text)
-> Index Scan using dns_cname_cname_idx on dns_cname c2 (cost=0.56..6.33 rows=329 width=45) (actual time=0.057..0.057 rows=0 loops=189)
Index Cond: ((cname)::text = (c1.name)::text)
Planning time: 0.452 ms
Execution time: 13.012 ms
(10 rows)
Time: 13.787 ms
So the performance difference is about 100x, and that's what worries me.
I like the convenience of the recursive CTE and prefer to use it instead of doing dirty tricks on the application side, but I don't get why the cost of Index Scan using dns_cname_cname_idx on dns_cname c (cost=0.56..69958.06 rows=2268434 width=45) (actual time=4.732..688.456 rows=2219973 loops=2) is so high.
Am I missing something important regarding CTEs, or is the issue with something else?
Thanks!
Update: A friend of mine spotted the row count I had missed in Index Scan using dns_cname_cname_idx on dns_cname c (cost=0.56..69958.06 rows=2268434 width=45) (actual time=4.732..688.456 rows=2219973 loops=2): it equals the total number of rows in the table, so, if I understand correctly, it performs a full index scan without a condition, and I don't get where the condition is lost.
Result: After applying SET LOCAL enable_mergejoin TO false; execution time is much, much better.
EXPLAIN ANALYZE WITH RECURSIVE names_traverse AS (
(
SELECT name::varchar(256), NULL::varchar(256) as cname, a FROM dns_a WHERE a = '118.145.5.20'
)
UNION ALL
SELECT c.name, c.cname, NULL::inet as a FROM names_traverse nt, dns_cname c WHERE c.cname=nt.name
)
SELECT * FROM names_traverse;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
CTE Scan on names_traverse (cost=4746432.42..6527720.02 rows=89064380 width=1064) (actual time=0.718..45.656 rows=199 loops=1)
CTE names_traverse
-> Recursive Union (cost=0.57..4746432.42 rows=89064380 width=45) (actual time=0.717..45.597 rows=199 loops=1)
-> Index Scan using dns_a_a_idx on dns_a (cost=0.57..74.82 rows=2700 width=24) (actual time=0.716..0.717 rows=14 loops=1)
Index Cond: (a = '118.145.5.20'::inet)
-> Nested Loop (cost=0.56..296507.00 rows=8906168 width=45) (actual time=11.276..22.418 rows=92 loops=2)
-> WorkTable Scan on names_traverse nt (cost=0.00..540.00 rows=27000 width=516) (actual time=0.000..0.013 rows=100 loops=2)
-> Index Scan using dns_cname_cname_idx on dns_cname c (cost=0.56..7.66 rows=330 width=45) (actual time=0.125..0.225 rows=1 loops=199)
Index Cond: ((cname)::text = (nt.name)::text)
Planning time: 0.253 ms
Execution time: 45.697 ms
(11 rows)
The first query is slow because of the index scan, as you noted.
The plan has to scan the complete index in order to get dns_cname sorted by cname, which is needed for the merge join. A merge join requires that both input tables are sorted by the join key, which can either be done with an index scan over the complete table (as in this case) or by a sequential scan followed by an explicit sort.
You will notice that the planner grossly overestimates all row counts for the CTE evaluation, which is probably the root of the problem. For fewer rows, PostgreSQL might choose a nested loop join which would not have to scan the whole table dns_cname.
That may be fixable or not. One thing that I can see immediately is that the estimate for the initial value '118.145.5.20' is too high by a factor of 139.5, which is pretty bad. You might fix that by running ANALYZE on dns_a, perhaps after increasing the statistics target for the column:
ALTER TABLE dns_a ALTER a SET STATISTICS 1000;
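-- follow-up step (sketch): re-collect the statistics so the larger target actually takes effect
ANALYZE dns_a;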
See if that makes a difference.
If that doesn't do the trick, you can manually set enable_mergejoin and enable_hashjoin to off and see if a plan with a nested loop join is really better or not. If you can get away with changing these parameters for this one statement only (probably with SET LOCAL) and get a better result that way, that is another option you have.
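A sketch of the per-statement variant mentioned above (both settings revert automatically at the end of the transaction):
BEGIN;
SET LOCAL enable_mergejoin TO off;
SET LOCAL enable_hashjoin TO off;
-- ... run the recursive CTE query from above ...
COMMIT;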