How to unique index over primary key constraint? - postgresql

I have a partitioned table and created a unique index on it.
I am running some queries; some of them use the primary key index and some use my created index. I want my queries to use the unique index instead of the primary key index.
I tried reindexing; it didn't help.
Here are the two queries.
1) Here my created index is used.
The query plan is:
Finalize Aggregate (cost=296958.94..296958.95 rows=1 width=8) (actual time=927.948..927.948 rows=1 loops=1)
-> Gather (cost=296958.72..296958.93 rows=2 width=8) (actual time=927.887..933.730 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=295958.72..295958.73 rows=1 width=8) (actual time=924.885..924.885 rows=1 loops=3)
-> Parallel Append (cost=0.68..293370.57 rows=1035261 width=8) (actual time=0.076..852.758 rows=825334 loops=3)
-> Parallel Index Only Scan using testdate2019jan_april_cost_mo_user_id_account_id_resource__idx5 on testdate2019jan_april_cost_mod3rem2 (cost=0.68..146591.56 rows=525490 width=8) (actual time=0.082..388.130 rows=421251 loops=3)
Index Cond: (user_id = 1)
Heap Fetches: 3922
-> Parallel Index Only Scan using testdate2018sept_dec_cost_mod_user_id_account_id_resource__idx5 on testdate2018sept_dec_cost_mod3rem2 (cost=0.68..141570.15 rows=509767 width=8) (actual time=0.057..551.572 rows=606125 loops=2)
Index Cond: (user_id = 1)
Heap Fetches: 0
-> Parallel Index Scan using testdate2018jan_april_cost_mo_account_id_user_id_resource__idx2 on testdate2018jan_april_cost_mod3rem2 (cost=0.12..8.14 rows=1 width=8) (actual time=0.001..0.001 rows=0 loops=1)
Index Cond: (user_id = 1)
-> Parallel Index Scan using testdate2018may_august_cost_m_account_id_user_id_resource__idx1 on testdate2018may_august_cost_mod3rem2 (cost=0.12..8.14 rows=1 width=8) (actual time=0.001..0.001 rows=0 loops=1)
Index Cond: (user_id = 1)
-> Parallel Index Scan using testdate2019may_august_cost_m_account_id_user_id_resource__idx2 on testdate2019may_august_cost_mod3rem2 (cost=0.12..8.14 rows=1 width=8) (actual time=0.002..0.002 rows=0 loops=1)
Index Cond: (user_id = 1)
-> Parallel Index Scan using testdate2019sept_dec_cost_mod_account_id_user_id_resource__idx2 on testdate2019sept_dec_cost_mod3rem2 (cost=0.12..8.14 rows=1 width=8) (actual time=0.003..0.003 rows=0 loops=1)
Index Cond: (user_id = 1)
Planning Time: 0.754 ms
Execution Time: 933.797 ms
In the above plan my index testdate2018may_august_cost_m_account_id_user_id_resource__idx1 is used, as I want.
2) Here my created index is not used; the primary key index is used instead. The query plan is:
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=2 read=66080
-> Finalize GroupAggregate (cost=388046.40..388187.55 rows=6 width=61) (actual time=510.710..513.262 rows=10 loops=1)
Group Key: c_1.instance_type, c_1.currency
Buffers: shared hit=2 read=66080
-> Gather Merge (cost=388046.40..388187.24 rows=12 width=85) (actual time=510.689..513.303 rows=28 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=26 read=206407
-> Partial GroupAggregate (cost=387046.38..387185.83 rows=6 width=85) (actual time=504.731..507.277 rows=9 loops=3)
Group Key: c_1.instance_type, c_1.currency
Buffers: shared hit=26 read=206407
-> Sort (cost=387046.38..387056.71 rows=4130 width=36) (actual time=504.694..504.933 rows=3895 loops=3)
Sort Key: c_1.instance_type, c_1.currency
Sort Method: quicksort Memory: 404kB
Worker 0: Sort Method: quicksort Memory: 354kB
Worker 1: Sort Method: quicksort Memory: 541kB
Buffers: shared hit=20 read=206407
-> Parallel Append (cost=0.13..386798.33 rows=4130 width=36) (actual time=0.081..501.720 rows=3895 loops=3)
Buffers: shared hit=6 read=206405
Subplans Removed: 3
-> Parallel Index Scan using testdate2019may_august_cost_mod3rem2_pkey on testdate2019may_august_cost_mod3rem2 c_1 (cost=0.13..8.15 rows=1 width=36) (actual time=0.008..0.008 rows=0 loops=1)
Index Cond: ((usage_start_date >= (CURRENT_DATE - 30)) AND (user_id = '1'::bigint))
Filter: ((instance_type IS NOT NULL) AND ((account_id)::text = '807331824280'::text) AND (usage_end_date <= CURRENT_DATE))
Buffers: shared hit=1
-> Parallel Index Scan using testdate2019sept_dec_cost_mod3rem2_pkey on testdate2019sept_dec_cost_mod3rem2 c_2 (cost=0.13..8.15 rows=1 width=36) (actual time=0.006..0.006 rows=0 loops=1)
Index Cond: ((usage_start_date >= (CURRENT_DATE - 30)) AND (user_id = '1'::bigint))
Filter: ((instance_type IS NOT NULL) AND ((account_id)::text = '807331824280'::text) AND (usage_end_date <= CURRENT_DATE))
Buffers: shared hit=1
-> Parallel Seq Scan on testdate2019jan_april_cost_mod3rem2 c (cost=0.00..258266.58 rows=4125 width=36) (actual time=0.076..501.060 rows=3895 loops=3)
Filter: ((instance_type IS NOT NULL) AND (user_id = '1'::bigint) AND ((account_id)::text = '807331824280'::text) AND (usage_end_date <= CURRENT_DATE) AND (usage_start_date >= (CURRENT_DATE - 30)))
Rows Removed by Filter: 1504689
Buffers: shared hit=4 read=206405
Planning Time: 1.290 ms
Execution Time: 513.439 ms
In the above plan testdate2019sept_dec_cost_mod3rem2_pkey, which is the primary key index, is used.
I want it to use my created index instead of the primary key.
Is my 2nd query plan correct with respect to the partitioning?
Table Creation Queries:
CREATE TABLE a2i.testawscost_line_item (
line_item_id uuid NOT NULL,
account_id character varying(255) COLLATE pg_catalog."default",
availability_zone character varying(255) COLLATE pg_catalog."default",
base_cost double precision,
base_rate double precision,
cost double precision,
currency character varying(255) COLLATE pg_catalog."default",
instance_family character varying(255) COLLATE pg_catalog."default",
instance_type character varying(255) COLLATE pg_catalog."default",
line_item_type character varying(255) COLLATE pg_catalog."default",
operating_system character varying(255) COLLATE pg_catalog."default",
operation character varying(255) COLLATE pg_catalog."default",
payer_account_id character varying(255) COLLATE pg_catalog."default",
product_code character varying(255) COLLATE pg_catalog."default",
product_family character varying(255) COLLATE pg_catalog."default",
product_group character varying(255) COLLATE pg_catalog."default",
product_name character varying(255) COLLATE pg_catalog."default",
rate double precision,
rate_description character varying(255) COLLATE pg_catalog."default",
reservation_id character varying(255) COLLATE pg_catalog."default",
resource_type character varying(255) COLLATE pg_catalog."default",
sku character varying(255) COLLATE pg_catalog."default",
tax_type character varying(255) COLLATE pg_catalog."default",
unit character varying(255) COLLATE pg_catalog."default",
usage_end_date timestamp without time zone,
usage_quantity double precision,
usage_start_date timestamp without time zone,
usage_type character varying(255) COLLATE pg_catalog."default",
user_id bigint,
resource_id character varying(255) COLLATE pg_catalog."default",
CONSTRAINT testawscost_line_item_pkey PRIMARY KEY
(line_item_id, usage_start_date, user_id),
CONSTRAINT fkptp4hyur3i4yj88wo3rxnaf05 FOREIGN KEY (resource_id)
REFERENCES a2i.awscost_resource (resource_id) MATCH SIMPLE
ON UPDATE NO ACTION
ON DELETE NO ACTION
) PARTITION BY hash(user_id);
The partitions:
create table a2i.testuser_cost_mod3rem0
partition of a2i.testawscost_line_item
for values with (MODULUS 3, REMAINDER 0)
partition by range(usage_start_date);
create table a2i.testuser_cost_mod3rem1
partition of a2i.testawscost_line_item
for values with (MODULUS 3, REMAINDER 1)
partition by range(usage_start_date);
create table a2i.testuser_cost_mod3rem2
partition of a2i.testawscost_line_item
for values with (MODULUS 3, REMAINDER 2)
partition by range(usage_start_date);
Partitions of the partitions for 2019:
create table a2i.testdate2019jan_april_cost_mod3rem0
partition of a2i.testuser_cost_mod3rem0
for values from ('2019-01-01 00:00:00') to ('2019-05-01 00:00:00');
create table a2i.testdate2019may_august_cost_mod3rem0
partition of a2i.testuser_cost_mod3rem0
for values from ('2019-05-01 00:00:00') to ('2019-09-01 00:00:00');
create table a2i.testdate2019sept_dec_cost_mod3rem0
partition of a2i.testuser_cost_mod3rem0
for values from ('2019-09-01 00:00:00') to ('2020-01-01 00:00:00');
create table a2i.testdate2019jan_april_cost_mod3rem1
partition of a2i.testuser_cost_mod3rem1
for values from ('2019-01-01 00:00:00') to ('2019-05-01 00:00:00');
create table a2i.testdate2019may_august_cost_mod3rem1
partition of a2i.testuser_cost_mod3rem1
for values from ('2019-05-01 00:00:00') to ('2019-09-01 00:00:00');
create table a2i.testdate2019sept_dec_cost_mod3rem1
partition of a2i.testuser_cost_mod3rem1
for values from ('2019-09-01 00:00:00') to ('2020-01-01 00:00:00');
create table a2i.testdate2019jan_april_cost_mod3rem2
partition of a2i.testuser_cost_mod3rem2
for values from ('2019-01-01 00:00:00') to ('2019-05-01 00:00:00');
create table a2i.testdate2019may_august_cost_mod3rem2
partition of a2i.testuser_cost_mod3rem2
for values from ('2019-05-01 00:00:00') to ('2019-09-01 00:00:00');
create table a2i.testdate2019sept_dec_cost_mod3rem2
partition of a2i.testuser_cost_mod3rem2
for values from ('2019-09-01 00:00:00') to ('2020-01-01 00:00:00');
The index:
CREATE UNIQUE INDEX awscost_line_item_unique_pkey ON a2i.awscost_line_item (
account_id, user_id, resource_id, usage_start_date, usage_end_date, usage_type,
usage_quantity, line_item_type, sku, rate, base_rate, base_cost,
"cost", currency, product_code, operation
);
For the 1st query plan, the query is:
explain analyze select sum(cost) from testawscost_line_item where
user_id='1';
The 2nd query:
explain (analyze/*, buffers*/) SELECT sum (c.cost),
sum (case when c.resource_type = 'Compute' then c.cost end) as computeCost,
sum (case when c.resource_type = 'Storage' then c.cost end) as storageCost,
sum (case when c.resource_type = 'Network' then c.cost end) as networkCost,
sum (case when c.resource_type not in ('Compute', 'Network', 'Storage')
then c.cost end) as otherCost,
c.currency,
c.instance_type as productFamily,
avg (c.rate) FROM testawscost_line_item c WHERE
(c.user_id ='1') AND (c.account_id = '807331824280') AND
(c.usage_start_date >= current_date-30 AND c.usage_end_date <=
current_date) AND
(c.instance_type is not null )
GROUP BY c.instance_type, c.currency
ORDER BY 1 desc

The problem is that your index works well for the first query, but not for the second.
resource_id is in your index but not in your query, so none of the index columns after it can be used for this query. PostgreSQL therefore decides to use the much smaller primary key index.
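As a made-up illustration of that left-prefix rule (table and index here are hypothetical):
CREATE TABLE t (a int, b int, c int);
CREATE INDEX t_abc_idx ON t (a, b, c);
-- Both conditions can narrow the index scan:
SELECT * FROM t WHERE a = 1 AND b = 2;
-- Only the condition on a can position the scan; with no condition on b,
-- the predicate on c cannot narrow the scanned index range:
SELECT * FROM t WHERE a = 1 AND c = 3;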
The perfect index for this query is:
CREATE INDEX ON a2i.testawscost_line_item (user_id, account_id, usage_start_date)
WHERE instance_type IS NOT NULL;
I assume that the condition on usage_end_date is not more selective than the one on usage_start_date.
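Once that index exists on the partitioned parent, PostgreSQL creates it on every partition automatically, and re-checking the plan shows whether the planner switches over; a sketch using a simplified form of the second query:
EXPLAIN (ANALYZE, BUFFERS)
SELECT sum(c.cost)
FROM a2i.testawscost_line_item c
WHERE c.user_id = 1
  AND c.account_id = '807331824280'
  AND c.usage_start_date >= current_date - 30
  AND c.usage_end_date <= current_date
  AND c.instance_type IS NOT NULL;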

Related

Seemingly Random Delay in queries

This is a follow up to this issue I posted a while ago.
I have the following code:
SET work_mem = '16MB';
SELECT s.start_date, s.end_date, s.resources, s.activity_index, r.resource_id, sa.usedresourceset
FROM rm_o_resource_usage_instance_splits_new s
INNER JOIN rm_o_resource_usage r ON s.usage_id = r.id
INNER JOIN scheduledactivities sa ON s.activity_index = sa.activity_index AND r.schedule_id = sa.solution_id and s.solution = sa.solution_id
WHERE r.schedule_id = 10
ORDER BY r.resource_id, s.start_date
When I run EXPLAIN (ANALYZE, BUFFERS) I get the following:
Sort (cost=3724.02..3724.29 rows=105 width=89) (actual time=245.802..247.573 rows=22302 loops=1)
Sort Key: r.resource_id, s.start_date
Sort Method: quicksort Memory: 6692kB
Buffers: shared hit=198702 read=5993 written=612
-> Nested Loop (cost=703.76..3720.50 rows=105 width=89) (actual time=1.898..164.741 rows=22302 loops=1)
Buffers: shared hit=198702 read=5993 written=612
-> Hash Join (cost=703.34..3558.54 rows=105 width=101) (actual time=1.815..11.259 rows=22302 loops=1)
Hash Cond: (s.usage_id = r.id)
Buffers: shared hit=3 read=397 written=2
-> Bitmap Heap Scan on rm_o_resource_usage_instance_splits_new s (cost=690.61..3486.58 rows=22477 width=69) (actual time=1.782..5.820 rows=22302 loops=1)
Recheck Cond: (solution = 10)
Heap Blocks: exact=319
Buffers: shared hit=2 read=396 written=2
-> Bitmap Index Scan on rm_o_resource_usage_instance_splits_new_solution_idx (cost=0.00..685.00 rows=22477 width=0) (actual time=1.609..1.609 rows=22302 loops=1)
Index Cond: (solution = 10)
Buffers: shared hit=2 read=77
-> Hash (cost=12.66..12.66 rows=5 width=48) (actual time=0.023..0.023 rows=1 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
Buffers: shared hit=1 read=1
-> Bitmap Heap Scan on rm_o_resource_usage r (cost=4.19..12.66 rows=5 width=48) (actual time=0.020..0.020 rows=1 loops=1)
Recheck Cond: (schedule_id = 10)
Heap Blocks: exact=1
Buffers: shared hit=1 read=1
-> Bitmap Index Scan on rm_o_resource_usage_sched (cost=0.00..4.19 rows=5 width=0) (actual time=0.017..0.017 rows=1 loops=1)
Index Cond: (schedule_id = 10)
Buffers: shared read=1
-> Index Scan using scheduledactivities_activity_index_idx on scheduledactivities sa (cost=0.42..1.53 rows=1 width=16) (actual time=0.004..0.007 rows=1 loops=22302)
Index Cond: (activity_index = s.activity_index)
Filter: (solution_id = 10)
Rows Removed by Filter: 5
Buffers: shared hit=198699 read=5596 written=610
Planning time: 7.070 ms
Execution time: 248.691 ms
Every time I run EXPLAIN, I get roughly the same results. The execution time is always between 170 ms and 250 ms, which, to me, is perfectly fine. However, when this query is run through a C++ project (using PQexec(conn, query), where conn is a dedicated connection and query is the above query), the time it takes varies widely. In general the query is very quick and you don't notice a delay. The problem is that, on occasion, this query takes 2 to 3 minutes to complete.
If I open the pgadmin, and have a look at the "server activity" for the database, there's about 30 or so connections, mostly sitting at "idle". The above query's connection is marked as "active", and will stay as "active" for several minutes.
I am at a loss as to why it randomly takes several minutes to complete the same query, with no change in the data in the DB either. I have tried increasing work_mem, which didn't make any difference (nor did I really expect it to). Any help or suggestions would be greatly appreciated.
There aren't any more specific tags, but I'm currently using Postgres 10.11; it's also been an issue on other 10.x versions. The system is a quad-core Xeon @ 3.4 GHz, with an SSD and 24 GB of memory.
Per jjanes's suggestion, I put in auto_explain. Eventually got this output:
duration: 128057.373 ms
plan:
Query Text: SET work_mem = '32MB';SELECT s.start_date, s.end_date, s.resources, s.activity_index, r.resource_id, sa.usedresourceset FROM rm_o_resource_usage_instance_splits_new s INNER JOIN rm_o_resource_usage r ON s.usage_id = r.id INNER JOIN scheduledactivities sa ON s.activity_index = sa.activity_index AND r.schedule_id = sa.solution_id and s.solution = sa.solution_id WHERE r.schedule_id = 12642 ORDER BY r.resource_id, s.start_date
Sort (cost=14.36..14.37 rows=1 width=98) (actual time=128042.083..128043.287 rows=21899 loops=1)
Output: s.start_date, s.end_date, s.resources, s.activity_index, r.resource_id, sa.usedresourceset
Sort Key: r.resource_id, s.start_date
Sort Method: quicksort Memory: 6585kB
Buffers: shared hit=21198435 read=388 dirtied=119
-> Nested Loop (cost=0.85..14.35 rows=1 width=98) (actual time=4.995..127958.935 rows=21899 loops=1)
Output: s.start_date, s.end_date, s.resources, s.activity_index, r.resource_id, sa.usedresourceset
Join Filter: (s.activity_index = sa.activity_index)
Rows Removed by Join Filter: 705476285
Buffers: shared hit=21198435 read=388 dirtied=119
-> Nested Loop (cost=0.42..9.74 rows=1 width=110) (actual time=0.091..227.705 rows=21899 loops=1)
Output: s.start_date, s.end_date, s.resources, s.activity_index, s.solution, r.resource_id, r.schedule_id
Inner Unique: true
Join Filter: (s.usage_id = r.id)
Buffers: shared hit=22102 read=388 dirtied=119
-> Index Scan using rm_o_resource_usage_instance_splits_new_solution_idx on public.rm_o_resource_usage_instance_splits_new s (cost=0.42..8.44 rows=1 width=69) (actual time=0.082..17.418 rows=21899 loops=1)
Output: s.start_time, s.end_time, s.resources, s.activity_index, s.usage_id, s.start_date, s.end_date, s.solution
Index Cond: (s.solution = 12642)
Buffers: shared hit=203 read=388 dirtied=119
-> Seq Scan on public.rm_o_resource_usage r (cost=0.00..1.29 rows=1 width=57) (actual time=0.002..0.002 rows=1 loops=21899)
Output: r.id, r.schedule_id, r.resource_id
Filter: (r.schedule_id = 12642)
Rows Removed by Filter: 26
Buffers: shared hit=21899
-> Index Scan using scheduled_activities_idx on public.scheduledactivities sa (cost=0.42..4.60 rows=1 width=16) (actual time=0.006..4.612 rows=32216 loops=21899)
Output: sa.usedresourceset, sa.activity_index, sa.solution_id
Index Cond: (sa.solution_id = 12642)
Buffers: shared hit=21176333
EDIT: Full definitions of the tables are below:
CREATE TABLE public.rm_o_resource_usage_instance_splits_new
(
start_time integer NOT NULL,
end_time integer NOT NULL,
resources jsonb NOT NULL,
activity_index integer NOT NULL,
usage_id bigint NOT NULL,
start_date text COLLATE pg_catalog."default" NOT NULL,
end_date text COLLATE pg_catalog."default" NOT NULL,
solution bigint NOT NULL,
CONSTRAINT rm_o_resource_usage_instance_splits_new_pkey PRIMARY KEY (start_time, activity_index, usage_id),
CONSTRAINT rm_o_resource_usage_instance_splits_new_solution_fkey FOREIGN KEY (solution)
REFERENCES public.rm_o_schedule_stats (id) MATCH SIMPLE
ON UPDATE CASCADE
ON DELETE CASCADE,
CONSTRAINT rm_o_resource_usage_instance_splits_new_usage_id_fkey FOREIGN KEY (usage_id)
REFERENCES public.rm_o_resource_usage (id) MATCH SIMPLE
ON UPDATE CASCADE
ON DELETE CASCADE
)
WITH (
OIDS = FALSE
)
TABLESPACE pg_default;
CREATE INDEX rm_o_resource_usage_instance_splits_new_activity_idx
ON public.rm_o_resource_usage_instance_splits_new USING btree
(activity_index ASC NULLS LAST)
TABLESPACE pg_default;
CREATE INDEX rm_o_resource_usage_instance_splits_new_solution_idx
ON public.rm_o_resource_usage_instance_splits_new USING btree
(solution ASC NULLS LAST)
TABLESPACE pg_default;
CREATE INDEX rm_o_resource_usage_instance_splits_new_usage_idx
ON public.rm_o_resource_usage_instance_splits_new USING btree
(usage_id ASC NULLS LAST)
TABLESPACE pg_default;
CREATE TABLE public.rm_o_resource_usage
(
id bigint NOT NULL DEFAULT nextval('rm_o_resource_usage_id_seq'::regclass),
schedule_id bigint NOT NULL,
resource_id text COLLATE pg_catalog."default" NOT NULL,
CONSTRAINT rm_o_resource_usage_pkey PRIMARY KEY (id),
CONSTRAINT rm_o_resource_usage_schedule_id_fkey FOREIGN KEY (schedule_id)
REFERENCES public.rm_o_schedule_stats (id) MATCH SIMPLE
ON UPDATE CASCADE
ON DELETE CASCADE
)
WITH (
OIDS = FALSE
)
TABLESPACE pg_default;
CREATE INDEX rm_o_resource_usage_idx
ON public.rm_o_resource_usage USING btree
(id ASC NULLS LAST)
TABLESPACE pg_default;
CREATE INDEX rm_o_resource_usage_sched
ON public.rm_o_resource_usage USING btree
(schedule_id ASC NULLS LAST)
TABLESPACE pg_default;
CREATE TABLE public.scheduledactivities
(
id bigint NOT NULL DEFAULT nextval('scheduledactivities_id_seq'::regclass),
solution_id bigint NOT NULL,
activity_id text COLLATE pg_catalog."default" NOT NULL,
sequence_index integer,
startminute integer,
finishminute integer,
issue text COLLATE pg_catalog."default",
activity_index integer NOT NULL,
is_objective boolean NOT NULL,
usedresourceset integer DEFAULT '-1'::integer,
start timestamp without time zone,
finish timestamp without time zone,
is_ore boolean,
is_ignored boolean,
CONSTRAINT scheduled_activities_pkey PRIMARY KEY (id),
CONSTRAINT scheduledactivities_solution_id_fkey FOREIGN KEY (solution_id)
REFERENCES public.rm_o_schedule_stats (id) MATCH SIMPLE
ON UPDATE CASCADE
ON DELETE CASCADE
)
WITH (
OIDS = FALSE
)
TABLESPACE pg_default;
CREATE INDEX scheduled_activities_activity_id_idx
ON public.scheduledactivities USING btree
(activity_id COLLATE pg_catalog."default" ASC NULLS LAST)
TABLESPACE pg_default;
CREATE INDEX scheduled_activities_id_idx
ON public.scheduledactivities USING btree
(id ASC NULLS LAST)
TABLESPACE pg_default;
CREATE INDEX scheduled_activities_idx
ON public.scheduledactivities USING btree
(solution_id ASC NULLS LAST)
TABLESPACE pg_default;
CREATE INDEX scheduledactivities_activity_index_idx
ON public.scheduledactivities USING btree
(activity_index ASC NULLS LAST)
TABLESPACE pg_default;
EDIT: Additional output from auto_explain after adding index on scheduledactivities (solution_id, activity_index)
Output: s.start_date, s.end_date, s.resources, s.activity_index, r.resource_id, sa.usedresourceset
Sort Key: r.resource_id, s.start_date
Sort Method: quicksort Memory: 6283kB
Buffers: shared hit=20159117 read=375 dirtied=190
-> Nested Loop (cost=0.85..10.76 rows=1 width=100) (actual time=5.518..122489.627 rows=20761 loops=1)
Output: s.start_date, s.end_date, s.resources, s.activity_index, r.resource_id, sa.usedresourceset
Join Filter: (s.activity_index = sa.activity_index)
Rows Removed by Join Filter: 668815615
Buffers: shared hit=20159117 read=375 dirtied=190
-> Nested Loop (cost=0.42..5.80 rows=1 width=112) (actual time=0.057..217.563 rows=20761 loops=1)
Output: s.start_date, s.end_date, s.resources, s.activity_index, s.solution, r.resource_id, r.schedule_id
Inner Unique: true
Join Filter: (s.usage_id = r.id)
Buffers: shared hit=20947 read=375 dirtied=190
-> Index Scan using rm_o_resource_usage_instance_splits_new_solution_idx on public.rm_o_resource_usage_instance_splits_new s (cost=0.42..4.44 rows=1 width=69) (actual time=0.049..17.622 rows=20761 loops=1)
Output: s.start_time, s.end_time, s.resources, s.activity_index, s.usage_id, s.start_date, s.end_date, s.solution
Index Cond: (s.solution = 12644)
Buffers: shared hit=186 read=375 dirtied=190
-> Seq Scan on public.rm_o_resource_usage r (cost=0.00..1.35 rows=1 width=59) (actual time=0.002..0.002 rows=1 loops=20761)
Output: r.id, r.schedule_id, r.resource_id
Filter: (r.schedule_id = 12644)
Rows Removed by Filter: 22
Buffers: shared hit=20761
-> Index Scan using scheduled_activities_idx on public.scheduledactivities sa (cost=0.42..4.94 rows=1 width=16) (actual time=0.007..4.654 rows=32216 loops=20761)
Output: sa.usedresourceset, sa.activity_index, sa.solution_id
Index Cond: (sa.solution_id = 12644)
Buffers: shared hit=20138170
The easiest way to reproduce the issue is to add more values to the three tables. I didn't delete any, only did a few thousand INSERTs.
-> Index Scan using .. s (cost=0.42..8.44 rows=1 width=69) (actual time=0.082..17.418 rows=21899 loops=1)
Index Cond: (s.solution = 12642)
The planner thinks it will find 1 row, and instead finds 21899. That error can pretty clearly lead to bad plans. And a single equality condition should be estimated quite accurately, so I'd say the statistics on your table are way off. It could be that the autovac launcher is tuned poorly so it doesn't run often enough, or it could be that select parts of your data change very rapidly (did you just insert 21899 rows with s.solution = 12642 immediately before running the query?) and so the stats can't be kept accurate enough.
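If the autovacuum theory fits, per-table tuning is a cheap way to test it (a sketch; the 1% scale factor is an assumption to experiment with):
-- Analyze this table after ~1% of its rows change, instead of the default 10%:
ALTER TABLE rm_o_resource_usage_instance_splits_new
    SET (autovacuum_analyze_scale_factor = 0.01);
-- Or refresh the statistics by hand right after a bulk insert:
ANALYZE rm_o_resource_usage_instance_splits_new;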
-> Nested Loop ...
Join Filter: (s.activity_index = sa.activity_index)
Rows Removed by Join Filter: 705476285
-> ...
-> Index Scan using scheduled_activities_idx on public.scheduledactivities sa (cost=0.42..4.60 rows=1 width=16) (actual time=0.006..4.612 rows=32216 loops=21899)
Output: sa.usedresourceset, sa.activity_index, sa.solution_id
Index Cond: (sa.solution_id = 12642)
If you can't get it to use the Hash Join, you can at least reduce the harm of the Nested Loop by building an index on scheduledactivities (solution_id, activity_index). That way the activity_index criterion could be part of the Index Condition, rather than being a Join Filter. You could probably then drop the index exclusively on solution_id, as there is little point in maintaining both indexes.
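A sketch of that suggestion (the index name is assumed):
CREATE INDEX scheduledactivities_solution_id_activity_index_idx
    ON public.scheduledactivities (solution_id, activity_index);
-- With the composite index in place, the single-column index on
-- solution_id becomes redundant:
DROP INDEX scheduled_activities_idx;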
The SQL statement of the fast plan is using WHERE r.schedule_id = 10 and returns about 22000 rows (with estimated 105).
The SQL statement of the slow plan is using WHERE r.schedule_id = 12642 and returns about 21000 rows (with estimated only 1).
The slow plan is using nested loops instead of hash joins, maybe because of a bad join estimate: the estimated row count is 1, but the actual count is 21899.
For example in this step:
Nested Loop (cost=0.42..9.74 rows=1 width=110) (actual time=0.091..227.705 rows=21899 loops=1)
If the data does not change, there may be a statistics issue (skewed data) in some columns.
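A sketch of how to probe that (the statistics target of 1000 is an assumption to experiment with):
-- Compare the planner's idea of the column with reality:
SELECT n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 'rm_o_resource_usage_instance_splits_new'
  AND attname = 'solution';
-- For skewed data, a larger sample can help; then re-analyze:
ALTER TABLE rm_o_resource_usage_instance_splits_new
    ALTER COLUMN solution SET STATISTICS 1000;
ANALYZE rm_o_resource_usage_instance_splits_new;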

PostGIS intersect/summary query very slow for particular table

Edited question based on a comment by @peter, so now the two tables in question use the same geometry type.
I'm running a query that intersects a table's geometry with a generic input geometry and then summarizes the results based on a specific attribute.
The reason I'm suspicious of (and frustrated by) this particular query is that I can run the exact same query on a different table that is 40x the size, and it takes 1/100 of the time.
I'm puzzled because I'm very happy with the speed (~400 ms) of this query on a large table with 1.3M records. However, on a particular table with 30,000 records the query takes 40+ seconds to complete.
Here's the query:
WITH input_geom AS (
SELECT ST_Transform(
ST_SetSRID(
ST_GeomFromGeoJSON(
'{"type":"Polygon","coordinates":[[[-91.865616,47.803339],[-91.830597,47.780274],[-91.810341,47.817404],[-91.865616,47.803339]]]}'
), 4326
), 26915
) AS geom
)
-- find total area and proportion for each type
SELECT
attr,
total_area_summtable AS area,
total_area_summtable / buff_area.area_sqm AS percent
FROM
(-- group by attribute and buffer
SELECT
attr,
sum(area_sqm) AS total_area_summtable
FROM
(-- find intersected area of each type
-- Clip ownership by input geom
SELECT
%attr% AS attr,
CASE
-- speed intersection calculation by using summary table
-- geom when it covers the entire buffer
-- otherwise use intersection of geometries
WHEN ST_CoveredBy(input_geom.geom, summtable.geom) THEN ST_Area(input_geom.geom)
ELSE ST_Area(ST_Multi(ST_Intersection(input_geom.geom,summtable.geom)))
END AS area_sqm
FROM input_geom
INNER JOIN %table% AS summtable ON ST_Intersects(input_geom.geom, summtable.geom)
) AS summtable_inter
-- group by type
GROUP BY attr
) AS summtable_area,
(-- find total area for the buffer
SELECT
ST_Area(ST_Collect(geom)) AS area_sqm
FROM input_geom
) AS buff_area
That produces results like this:
attr  area              percent
6     17106063.3199902  0.0630578194718625
8     41892903.9272884  0.154429170732226
2     4441738.70688669  0.016373513430921
....
Here are the Explain Analyze results for this query:
Nested Loop (cost=31.00..31.34 rows=9 width=23) (actual time=49042.306..49042.309 rows=5 loops=1)
Output: mown.owner_desc, (sum(CASE WHEN ((input_geom_1.geom # mown.geom) AND _st_coveredby(input_geom_1.geom, mown.geom)) THEN st_area(input_geom_1.geom) ELSE st_area(st_multi(st_intersection(input_geom_1.geom, mown.geom))) END)), ((sum(CASE WHEN ((input (...)
CTE input_geom
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.002..0.002 rows=1 loops=1)
Output: '01030000202369000001000000040000003D484506D9D92141B7A4EA61F6325441A3BEFDC8A2EE2141E731F0497F305441774E0C95FEF9214173B409BA8C3454413D484506D9D92141B7A4EA61F6325441'::geometry
-> Aggregate (cost=0.02..0.06 rows=1 width=32) (actual time=0.035..0.036 rows=1 loops=1)
Output: st_area(st_collect(input_geom.geom))
-> CTE Scan on input_geom (cost=0.00..0.02 rows=1 width=32) (actual time=0.005..0.006 rows=1 loops=1)
Output: input_geom.geom
-> HashAggregate (cost=30.96..31.05 rows=9 width=18085) (actual time=49042.264..49042.266 rows=5 loops=1)
Output: mown.owner_desc, sum(CASE WHEN ((input_geom_1.geom # mown.geom) AND _st_coveredby(input_geom_1.geom, mown.geom)) THEN st_area(input_geom_1.geom) ELSE st_area(st_multi(st_intersection(input_geom_1.geom, mown.geom))) END)
Group Key: mown.owner_desc
-> Nested Loop (cost=4.32..25.34 rows=18 width=18085) (actual time=3.304..791.829 rows=39 loops=1)
Output: input_geom_1.geom, mown.owner_desc, mown.geom
-> CTE Scan on input_geom input_geom_1 (cost=0.00..0.02 rows=1 width=32) (actual time=0.001..0.003 rows=1 loops=1)
Output: input_geom_1.geom
-> Bitmap Heap Scan on public.gap_stewardship_2008_all_ownership_types mown (cost=4.32..25.30 rows=2 width=18053) (actual time=3.299..791.762 rows=39 loops=1)
Output: mown.gid, mown.wetland_ty, mown.county, mown.name, mown.unit, mown.owner, mown.owner_ver1, mown.owner_desc, mown.owner_name, mown.agency, mown.agncy_ver1, mown.agency_nam, mown.new_manage, mown.name_manag, mown.comments, mown.or (...)
Recheck Cond: (input_geom_1.geom && mown.geom)
Filter: _st_intersects(input_geom_1.geom, mown.geom)
Rows Removed by Filter: 208
Heap Blocks: exact=142
-> Bitmap Index Scan on gap_stewardship_2008_all_ownership_types_geom_idx (cost=0.00..4.31 rows=5 width=0) (actual time=0.651..0.651 rows=247 loops=1)
Index Cond: (input_geom_1.geom && mown.geom)
Planning time: 1.245 ms
Execution time: 49046.184 ms
Here is the SQL to recreate the tables in question:
Slow table (50,000 rows):
CREATE TABLE public.gap_stewardship_2008_all_ownership_types
(
gid integer NOT NULL DEFAULT nextval('gap_stewardship_2008_all_ownership_types_gid_seq'::regclass),
wetland_ty character varying(50),
county character varying(50),
name character varying(50),
unit character varying(50),
owner smallint,
owner_ver1 smallint,
owner_desc character varying(50),
owner_name character varying(50),
agency smallint,
agncy_ver1 smallint,
agency_nam character varying(50),
new_manage smallint,
name_manag character varying(50),
comments character varying(100),
origin character varying(50),
area numeric,
acres numeric,
perfeet numeric,
perimeter numeric,
km2 numeric,
shape_leng numeric,
shape_area numeric,
geom geometry(MultiPolygon,26915),
CONSTRAINT gap_stewardship_2008_all_ownership_types_pkey PRIMARY KEY (gid)
)
Fast table (1,300,000 rows):
CREATE TABLE public.nwi_combine
(
id integer NOT NULL DEFAULT nextval('"NWI_combine_AOI_id_seq"'::regclass),
geom geometry(MultiPolygon,26915),
attribute character varying(254),
wetland_ty character varying(254),
acres numeric,
hgm_code character varying(254),
hgm_desc character varying(254),
spcc_desc character varying(254),
cow_class1 character varying(254),
circ39_cla bigint,
hgm_ll_des character varying(254),
shape_leng numeric,
shape_area numeric,
nwi_code character varying(254),
new_cow character varying(254),
system character varying(254),
subsystem character varying(254),
class1 character varying(254),
subclass1 character varying(254),
class2 character varying(254),
subclass2 character varying(254),
wreg character varying(254),
soilm character varying(254),
spec_mod1 character varying(254),
spec_mod2 character varying(254),
circ39 character varying(254),
old_cow character varying(254),
mnwet character varying(254),
circ39_com bigint,
CONSTRAINT "NWI_combine_AOI_pkey" PRIMARY KEY (id)
)
Each of these tables has a GIST index on the geometry field.
Does anyone have any idea what could be contributing to such a difference?
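One diagnostic worth running (a sketch): the plan's row width on the slow table (~18 kB per row) hints that its polygons are far more complex, and ST_Intersection pays for every vertex. Comparing vertex counts would confirm or rule that out:
SELECT avg(ST_NPoints(geom)) AS avg_pts, max(ST_NPoints(geom)) AS max_pts
FROM public.gap_stewardship_2008_all_ownership_types;
SELECT avg(ST_NPoints(geom)) AS avg_pts, max(ST_NPoints(geom)) AS max_pts
FROM public.nwi_combine;
If the counts differ wildly, subdividing the complex geometries (e.g. with ST_Subdivide into a side table) is a common remedy, though that is an untested idea for this dataset.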

How to speed up a PostgreSQL group by query through multiple joins?

This query searches for product_groupings often purchased with product_grouping ID 99999. It fans out to all the orders that contain product_grouping 99999, then joins back down to count the number of times each other product_grouping has been ordered, and takes the top 10.
Is there any way to speed this query up?
SELECT product_groupings.*, count(product_groupings.id) AS product_groupings_count
FROM "product_groupings"
INNER JOIN "products" ON "product_groupings"."id" = "products"."product_grouping_id"
INNER JOIN "variants" ON "products"."id" = "variants"."product_id"
INNER JOIN "order_items" ON "variants"."id" = "order_items"."variant_id"
INNER JOIN "shipments" ON "order_items"."shipment_id" = "shipments"."id"
INNER JOIN "orders" ON "shipments"."order_id" = "orders"."id"
INNER JOIN "shipments" "shipments_often_purchased_with_join" ON "orders"."id" = "shipments_often_purchased_with_join"."order_id"
INNER JOIN "order_items" "order_items_often_purchased_with_join" ON "shipments_often_purchased_with_join"."id" = "order_items_often_purchased_with_join"."shipment_id"
INNER JOIN "variants" "variants_often_purchased_with_join" ON "order_items_often_purchased_with_join"."variant_id" = "variants_often_purchased_with_join"."id"
INNER JOIN "products" "products_often_purchased_with_join" ON "variants_often_purchased_with_join"."product_id" = "products_often_purchased_with_join"."id"
WHERE "products_often_purchased_with_join"."product_grouping_id" = 99999 AND (product_groupings.id != 99999) AND "product_groupings"."state" = 'active' AND ("shipments"."state" NOT IN ('pending', 'cancelled'))
GROUP BY product_groupings.id
ORDER BY product_groupings_count desc LIMIT 10
schema:
CREATE TABLE product_groupings (
id integer NOT NULL,
state character varying(255) DEFAULT 'active'::character varying,
brand_id integer,
product_content_id integer,
hierarchy_category_id integer,
hierarchy_subtype_id integer,
hierarchy_type_id integer,
product_type_id integer,
description text,
keywords text,
created_at timestamp without time zone,
updated_at timestamp without time zone
);
CREATE INDEX index_product_groupings_on_brand_id ON product_groupings USING btree (brand_id);
CREATE INDEX index_product_groupings_on_hierarchy_category_id ON product_groupings USING btree (hierarchy_category_id);
CREATE INDEX index_product_groupings_on_hierarchy_subtype_id ON product_groupings USING btree (hierarchy_subtype_id);
CREATE INDEX index_product_groupings_on_hierarchy_type_id ON product_groupings USING btree (hierarchy_type_id);
CREATE INDEX index_product_groupings_on_name ON product_groupings USING btree (name);
CREATE INDEX index_product_groupings_on_product_content_id ON product_groupings USING btree (product_content_id);
CREATE INDEX index_product_groupings_on_product_type_id ON product_groupings USING btree (product_type_id);
ALTER TABLE ONLY product_groupings
ADD CONSTRAINT product_groupings_pkey PRIMARY KEY (id);
CREATE TABLE products (
id integer NOT NULL,
name character varying(255) NOT NULL,
prototype_id integer,
deleted_at timestamp without time zone,
created_at timestamp without time zone,
updated_at timestamp without time zone,
item_volume character varying(255),
upc character varying(255),
state character varying(255),
volume_unit character varying(255),
volume_value numeric,
container_type character varying(255),
container_count integer,
upc_ext character varying(8),
product_grouping_id integer,
short_pack_size character varying(255),
short_volume character varying(255),
additional_upcs character varying(255)[] DEFAULT '{}'::character varying[]
);
CREATE INDEX index_products_on_additional_upcs ON products USING gin (additional_upcs);
CREATE INDEX index_products_on_deleted_at ON products USING btree (deleted_at);
CREATE INDEX index_products_on_name ON products USING btree (name);
CREATE INDEX index_products_on_product_grouping_id ON products USING btree (product_grouping_id);
CREATE INDEX index_products_on_prototype_id ON products USING btree (prototype_id);
CREATE INDEX index_products_on_upc ON products USING btree (upc);
ALTER TABLE ONLY products
ADD CONSTRAINT products_pkey PRIMARY KEY (id);
CREATE TABLE variants (
id integer NOT NULL,
product_id integer NOT NULL,
sku character varying(255) NOT NULL,
name character varying(255),
price numeric(8,2) DEFAULT 0.0 NOT NULL,
deleted_at timestamp without time zone,
supplier_id integer,
created_at timestamp without time zone,
updated_at timestamp without time zone,
inventory_id integer,
product_active boolean DEFAULT false NOT NULL,
original_name character varying(255),
original_item_volume character varying(255),
protected boolean DEFAULT false NOT NULL,
sale_price numeric(8,2) DEFAULT 0.0 NOT NULL
);
CREATE INDEX index_variants_on_inventory_id ON variants USING btree (inventory_id);
CREATE INDEX index_variants_on_product_id_and_deleted_at ON variants USING btree (product_id, deleted_at);
CREATE INDEX index_variants_on_sku ON variants USING btree (sku);
CREATE INDEX index_variants_on_state_attributes ON variants USING btree (deleted_at, product_active, protected, id);
CREATE INDEX index_variants_on_supplier_id ON variants USING btree (supplier_id);
ALTER TABLE ONLY variants
ADD CONSTRAINT variants_pkey PRIMARY KEY (id);
CREATE TABLE order_items (
id integer NOT NULL,
price numeric(8,2),
total numeric(8,2),
variant_id integer NOT NULL,
shipment_id integer,
created_at timestamp without time zone,
updated_at timestamp without time zone,
quantity integer DEFAULT 1
);
CREATE INDEX index_order_items_on_shipment_id ON order_items USING btree (shipment_id);
CREATE INDEX index_order_items_on_variant_id ON order_items USING btree (variant_id);
ALTER TABLE ONLY order_items
ADD CONSTRAINT order_items_pkey PRIMARY KEY (id);
CREATE TABLE shipments (
id integer NOT NULL,
order_id integer,
shipping_method_id integer NOT NULL,
number character varying,
state character varying(255) DEFAULT 'pending'::character varying NOT NULL,
created_at timestamp without time zone,
updated_at timestamp without time zone,
supplier_id integer,
confirmed_at timestamp without time zone,
canceled_at timestamp without time zone,
out_of_hours boolean DEFAULT false NOT NULL,
delivered_at timestamp without time zone,
uuid uuid DEFAULT uuid_generate_v4()
);
CREATE INDEX index_shipments_on_order_id_and_supplier_id ON shipments USING btree (order_id, supplier_id);
CREATE INDEX index_shipments_on_state ON shipments USING btree (state);
CREATE INDEX index_shipments_on_supplier_id ON shipments USING btree (supplier_id);
ALTER TABLE ONLY shipments
ADD CONSTRAINT shipments_pkey PRIMARY KEY (id);
CREATE TABLE orders (
id integer NOT NULL,
number character varying(255),
ip_address character varying(255),
state character varying(255),
ship_address_id integer,
active boolean DEFAULT true NOT NULL,
completed_at timestamp without time zone,
created_at timestamp without time zone,
updated_at timestamp without time zone,
tip_amount numeric(8,2) DEFAULT 0.0,
confirmed_at timestamp without time zone,
delivery_notes text,
cancelled_at timestamp without time zone,
courier boolean DEFAULT false NOT NULL,
scheduled_for timestamp without time zone,
client character varying(255),
subscription_id character varying(255),
pickup_detail_id integer
);
CREATE INDEX index_orders_on_bill_address_id ON orders USING btree (bill_address_id);
CREATE INDEX index_orders_on_completed_at ON orders USING btree (completed_at);
CREATE UNIQUE INDEX index_orders_on_number ON orders USING btree (number);
CREATE INDEX index_orders_on_ship_address_id ON orders USING btree (ship_address_id);
CREATE INDEX index_orders_on_state ON orders USING btree (state);
ALTER TABLE ONLY orders
ADD CONSTRAINT orders_pkey PRIMARY KEY (id);
Query plan:
Limit (cost=685117.80..685117.81 rows=10 width=595) (actual time=33659.659..33659.661 rows=10 loops=1)
Output: product_groupings.id, product_groupings.featured, product_groupings.searchable, product_groupings.state, product_groupings.brand_id, product_groupings.product_content_id, product_groupings.hierarchy_category_id, product_groupings.hierarchy_subtype_id, product_groupings.hierarchy_type_id, product_groupings.product_type_id, product_groupings.meta_description, product_groupings.meta_keywords, product_groupings.name, product_groupings.permalink, product_groupings.description, product_groupings.keywords, product_groupings.created_at, product_groupings.updated_at, product_groupings.tax_category_id, product_groupings.trimmed_name, (count(product_groupings.id))
Buffers: shared hit=259132 read=85657, temp read=30892 written=30886
I/O Timings: read=5542.213
-> Sort (cost=685117.80..685117.81 rows=14 width=595) (actual time=33659.658..33659.659 rows=10 loops=1)
Output: product_groupings.id, product_groupings.featured, product_groupings.searchable, product_groupings.state, product_groupings.brand_id, product_groupings.product_content_id, product_groupings.hierarchy_category_id, product_groupings.hierarchy_subtype_id, product_groupings.hierarchy_type_id, product_groupings.product_type_id, product_groupings.meta_description, product_groupings.meta_keywords, product_groupings.name, product_groupings.permalink, product_groupings.description, product_groupings.keywords, product_groupings.created_at, product_groupings.updated_at, product_groupings.tax_category_id, product_groupings.trimmed_name, (count(product_groupings.id))
Sort Key: (count(product_groupings.id))
Sort Method: top-N heapsort Memory: 30kB
Buffers: shared hit=259132 read=85657, temp read=30892 written=30886
I/O Timings: read=5542.213
-> HashAggregate (cost=685117.71..685117.75 rows=14 width=595) (actual time=33659.407..33659.491 rows=122 loops=1)
Output: product_groupings.id, product_groupings.featured, product_groupings.searchable, product_groupings.state, product_groupings.brand_id, product_groupings.product_content_id, product_groupings.hierarchy_category_id, product_groupings.hierarchy_subtype_id, product_groupings.hierarchy_type_id, product_groupings.product_type_id, product_groupings.meta_description, product_groupings.meta_keywords, product_groupings.name, product_groupings.permalink, product_groupings.description, product_groupings.keywords, product_groupings.created_at, product_groupings.updated_at, product_groupings.tax_category_id, product_groupings.trimmed_name, count(product_groupings.id)
Buffers: shared hit=259129 read=85657, temp read=30892 written=30886
I/O Timings: read=5542.213
-> Hash Join (cost=453037.24..685117.69 rows=14 width=595) (actual time=26019.889..33658.886 rows=181 loops=1)
Output: product_groupings.id, product_groupings.featured, product_groupings.searchable, product_groupings.state, product_groupings.brand_id, product_groupings.product_content_id, product_groupings.hierarchy_category_id, product_groupings.hierarchy_subtype_id, product_groupings.hierarchy_type_id, product_groupings.product_type_id, product_groupings.meta_description, product_groupings.meta_keywords, product_groupings.name, product_groupings.permalink, product_groupings.description, product_groupings.keywords, product_groupings.created_at, product_groupings.updated_at, product_groupings.tax_category_id, product_groupings.trimmed_name
Hash Cond: (order_items_often_purchased_with_join.variant_id = variants_often_purchased_with_join.id)
Buffers: shared hit=259129 read=85657, temp read=30892 written=30886
I/O Timings: read=5542.213
-> Hash Join (cost=452970.37..681530.70 rows=4693428 width=599) (actual time=22306.463..32908.056 rows=8417034 loops=1)
Output: product_groupings.id, product_groupings.featured, product_groupings.searchable, product_groupings.state, product_groupings.brand_id, product_groupings.product_content_id, product_groupings.hierarchy_category_id, product_groupings.hierarchy_subtype_id, product_groupings.hierarchy_type_id, product_groupings.product_type_id, product_groupings.meta_description, product_groupings.meta_keywords, product_groupings.name, product_groupings.permalink, product_groupings.description, product_groupings.keywords, product_groupings.created_at, product_groupings.updated_at, product_groupings.tax_category_id, product_groupings.trimmed_name, order_items_often_purchased_with_join.variant_id
Hash Cond: (products.product_grouping_id = product_groupings.id)
Buffers: shared hit=259080 read=85650, temp read=30892 written=30886
I/O Timings: read=5540.529
-> Hash Join (cost=381952.28..493289.49 rows=5047613 width=8) (actual time=21028.128..25416.504 rows=8417518 loops=1)
Output: products.product_grouping_id, order_items_often_purchased_with_join.variant_id
Hash Cond: (order_items_often_purchased_with_join.shipment_id = shipments_often_purchased_with_join.id)
Buffers: shared hit=249520 read=77729
I/O Timings: read=5134.878
-> Seq Scan on public.order_items order_items_often_purchased_with_join (cost=0.00..82689.54 rows=4910847 width=8) (actual time=0.003..1061.456 rows=4909856 loops=1)
Output: order_items_often_purchased_with_join.shipment_id, order_items_often_purchased_with_join.variant_id
Buffers: shared hit=67957
-> Hash (cost=373991.27..373991.27 rows=2274574 width=8) (actual time=21027.220..21027.220 rows=2117538 loops=1)
Output: products.product_grouping_id, shipments_often_purchased_with_join.id
Buckets: 262144 Batches: 1 Memory Usage: 82717kB
Buffers: shared hit=181563 read=77729
I/O Timings: read=5134.878
-> Hash Join (cost=249781.35..373991.27 rows=2274574 width=8) (actual time=10496.552..20383.404 rows=2117538 loops=1)
Output: products.product_grouping_id, shipments_often_purchased_with_join.id
Hash Cond: (shipments.order_id = orders.id)
Buffers: shared hit=181563 read=77729
I/O Timings: read=5134.878
-> Hash Join (cost=118183.04..233677.13 rows=1802577 width=8) (actual time=6080.516..14318.439 rows=1899610 loops=1)
Output: products.product_grouping_id, shipments.order_id
Hash Cond: (variants.product_id = products.id)
Buffers: shared hit=107220 read=55876
I/O Timings: read=5033.540
-> Hash Join (cost=83249.21..190181.06 rows=1802577 width=8) (actual time=4526.391..11330.434 rows=1899808 loops=1)
Output: variants.product_id, shipments.order_id
Hash Cond: (order_items.variant_id = variants.id)
Buffers: shared hit=88026 read=44439
I/O Timings: read=4009.465
-> Hash Join (cost=40902.30..138821.27 rows=1802577 width=8) (actual time=3665.477..8553.803 rows=1899816 loops=1)
Output: order_items.variant_id, shipments.order_id
Hash Cond: (order_items.shipment_id = shipments.id)
Buffers: shared hit=56654 read=43022
I/O Timings: read=3872.065
-> Seq Scan on public.order_items (cost=0.00..82689.54 rows=4910847 width=8) (actual time=0.003..2338.108 rows=4909856 loops=1)
Output: order_items.variant_id, order_items.shipment_id
Buffers: shared hit=55987 read=11970
I/O Timings: read=1059.971
-> Hash (cost=38059.31..38059.31 rows=812284 width=8) (actual time=3664.973..3664.973 rows=834713 loops=1)
Output: shipments.id, shipments.order_id
Buckets: 131072 Batches: 1 Memory Usage: 32606kB
Buffers: shared hit=667 read=31052
I/O Timings: read=2812.094
-> Seq Scan on public.shipments (cost=0.00..38059.31 rows=812284 width=8) (actual time=0.017..3393.420 rows=834713 loops=1)
Output: shipments.id, shipments.order_id
Filter: ((shipments.state)::text <> ALL ('{pending,cancelled}'::text[]))
Rows Removed by Filter: 1013053
Buffers: shared hit=667 read=31052
I/O Timings: read=2812.094
-> Hash (cost=37200.34..37200.34 rows=1470448 width=8) (actual time=859.887..859.887 rows=1555657 loops=1)
Output: variants.product_id, variants.id
Buckets: 262144 Batches: 1 Memory Usage: 60768kB
Buffers: shared hit=31372 read=1417
I/O Timings: read=137.400
-> Seq Scan on public.variants (cost=0.00..37200.34 rows=1470448 width=8) (actual time=0.009..479.528 rows=1555657 loops=1)
Output: variants.product_id, variants.id
Buffers: shared hit=31372 read=1417
I/O Timings: read=137.400
-> Hash (cost=32616.92..32616.92 rows=661973 width=8) (actual time=1553.664..1553.664 rows=688697 loops=1)
Output: products.product_grouping_id, products.id
Buckets: 131072 Batches: 1 Memory Usage: 26903kB
Buffers: shared hit=19194 read=11437
I/O Timings: read=1024.075
-> Seq Scan on public.products (cost=0.00..32616.92 rows=661973 width=8) (actual time=0.011..1375.757 rows=688697 loops=1)
Output: products.product_grouping_id, products.id
Buffers: shared hit=19194 read=11437
I/O Timings: read=1024.075
-> Hash (cost=125258.00..125258.00 rows=1811516 width=12) (actual time=4415.081..4415.081 rows=1847746 loops=1)
Output: orders.id, shipments_often_purchased_with_join.order_id, shipments_often_purchased_with_join.id
Buckets: 262144 Batches: 1 Memory Usage: 79396kB
Buffers: shared hit=74343 read=21853
I/O Timings: read=101.338
-> Hash Join (cost=78141.12..125258.00 rows=1811516 width=12) (actual time=1043.228..3875.433 rows=1847746 loops=1)
Output: orders.id, shipments_often_purchased_with_join.order_id, shipments_often_purchased_with_join.id
Hash Cond: (shipments_often_purchased_with_join.order_id = orders.id)
Buffers: shared hit=74343 read=21853
I/O Timings: read=101.338
-> Seq Scan on public.shipments shipments_often_purchased_with_join (cost=0.00..37153.55 rows=1811516 width=8) (actual time=0.006..413.785 rows=1847766 loops=1)
Output: shipments_often_purchased_with_join.order_id, shipments_often_purchased_with_join.id
Buffers: shared hit=31719
-> Hash (cost=70783.52..70783.52 rows=2102172 width=4) (actual time=1042.239..1042.239 rows=2097229 loops=1)
Output: orders.id
Buckets: 262144 Batches: 1 Memory Usage: 73731kB
Buffers: shared hit=42624 read=21853
I/O Timings: read=101.338
-> Seq Scan on public.orders (cost=0.00..70783.52 rows=2102172 width=4) (actual time=0.012..553.606 rows=2097229 loops=1)
Output: orders.id
Buffers: shared hit=42624 read=21853
I/O Timings: read=101.338
-> Hash (cost=20222.66..20222.66 rows=637552 width=595) (actual time=1278.121..1278.121 rows=626176 loops=1)
Output: product_groupings.id, product_groupings.featured, product_groupings.searchable, product_groupings.state, product_groupings.brand_id, product_groupings.product_content_id, product_groupings.hierarchy_category_id, product_groupings.hierarchy_subtype_id, product_groupings.hierarchy_type_id, product_groupings.product_type_id, product_groupings.meta_description, product_groupings.meta_keywords, product_groupings.name, product_groupings.permalink, product_groupings.description, product_groupings.keywords, product_groupings.created_at, product_groupings.updated_at, product_groupings.tax_category_id, product_groupings.trimmed_name
Buckets: 16384 Batches: 4 Memory Usage: 29780kB
Buffers: shared hit=9559 read=7921, temp written=10448
I/O Timings: read=405.651
-> Seq Scan on public.product_groupings (cost=0.00..20222.66 rows=637552 width=595) (actual time=0.020..873.844 rows=626176 loops=1)
Output: product_groupings.id, product_groupings.featured, product_groupings.searchable, product_groupings.state, product_groupings.brand_id, product_groupings.product_content_id, product_groupings.hierarchy_category_id, product_groupings.hierarchy_subtype_id, product_groupings.hierarchy_type_id, product_groupings.product_type_id, product_groupings.meta_description, product_groupings.meta_keywords, product_groupings.name, product_groupings.permalink, product_groupings.description, product_groupings.keywords, product_groupings.created_at, product_groupings.updated_at, product_groupings.tax_category_id, product_groupings.trimmed_name
Filter: ((product_groupings.id <> 99999) AND ((product_groupings.state)::text = 'active'::text))
Rows Removed by Filter: 48650
Buffers: shared hit=9559 read=7921
I/O Timings: read=405.651
-> Hash (cost=66.86..66.86 rows=4 width=4) (actual time=2.223..2.223 rows=30 loops=1)
Output: variants_often_purchased_with_join.id
Buckets: 1024 Batches: 1 Memory Usage: 2kB
Buffers: shared hit=49 read=7
I/O Timings: read=1.684
-> Nested Loop (cost=0.17..66.86 rows=4 width=4) (actual time=0.715..2.211 rows=30 loops=1)
Output: variants_often_purchased_with_join.id
Buffers: shared hit=49 read=7
I/O Timings: read=1.684
-> Index Scan using index_products_on_product_grouping_id on public.products products_often_purchased_with_join (cost=0.08..5.58 rows=2 width=4) (actual time=0.074..0.659 rows=6 loops=1)
Output: products_often_purchased_with_join.id
Index Cond: (products_often_purchased_with_join.product_grouping_id = 99999)
Buffers: shared hit=5 read=4
I/O Timings: read=0.552
-> Index Scan using index_variants_on_product_id_and_deleted_at on public.variants variants_often_purchased_with_join (cost=0.09..30.60 rows=15 width=8) (actual time=0.222..0.256 rows=5 loops=6)
Output: variants_often_purchased_with_join.id, variants_often_purchased_with_join.product_id
Index Cond: (variants_often_purchased_with_join.product_id = products_often_purchased_with_join.id)
Buffers: shared hit=44 read=3
I/O Timings: read=1.132
Total runtime: 33705.142 ms
Gained a significant ~20x increase in throughput using a sub-select:
SELECT product_groupings.*, count(product_groupings.id) AS product_groupings_count
FROM "product_groupings"
INNER JOIN "products" ON "products"."product_grouping_id" = "product_groupings"."id"
INNER JOIN "variants" ON "variants"."product_id" = "products"."id"
INNER JOIN "order_items" ON "order_items"."variant_id" = "variants"."id"
INNER JOIN "shipments" ON "shipments"."id" = "order_items"."shipment_id"
WHERE ("product_groupings"."id" != 99999)
AND "product_groupings"."state" = 'active'
AND ("shipments"."state" NOT IN ('pending', 'cancelled'))
AND ("shipments"."order_id" IN (
SELECT "shipments"."order_id"
FROM "shipments"
INNER JOIN "order_items" ON "order_items"."shipment_id" = "shipments"."id"
INNER JOIN "variants" ON "variants"."id" = "order_items"."variant_id"
INNER JOIN "products" ON "products"."id" = "variants"."product_id"
WHERE "products"."product_grouping_id" = 99999 AND ("shipments"."state" NOT IN ('pending', 'cancelled'))
GROUP BY "shipments"."order_id"
ORDER BY "shipments"."order_id" ASC
))
GROUP BY product_groupings.id
ORDER BY product_groupings_count desc
LIMIT 10
Although I'd welcome any further optimisations. :)
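For comparison, the same semi-join can be written with EXISTS; the GROUP BY and ORDER BY inside the IN (...) subquery do no useful work and can be dropped (a sketch, not benchmarked here):
SELECT product_groupings.*, count(product_groupings.id) AS product_groupings_count
FROM product_groupings
INNER JOIN products ON products.product_grouping_id = product_groupings.id
INNER JOIN variants ON variants.product_id = products.id
INNER JOIN order_items ON order_items.variant_id = variants.id
INNER JOIN shipments ON shipments.id = order_items.shipment_id
WHERE product_groupings.id != 99999
  AND product_groupings.state = 'active'
  AND shipments.state NOT IN ('pending', 'cancelled')
  AND EXISTS (
    SELECT 1
    FROM shipments s2
    INNER JOIN order_items oi2 ON oi2.shipment_id = s2.id
    INNER JOIN variants v2 ON v2.id = oi2.variant_id
    INNER JOIN products p2 ON p2.id = v2.product_id
    WHERE s2.order_id = shipments.order_id
      AND s2.state NOT IN ('pending', 'cancelled')
      AND p2.product_grouping_id = 99999
  )
GROUP BY product_groupings.id
ORDER BY product_groupings_count DESC
LIMIT 10;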

Storing 'ties' for Contests in Postgres

I'm trying to determine if there is a "low cost" optimization for the following query. We've implemented a system whereby 'tickets' earn 'points' and thus can be ranked. In order to support analytical types of queries, we store the rank of every ticket, and whether the ticket is tied, along with the ticket.
I've found that, at scale, storing the is_tied field is very slow. I'm attempting to run the scenario below on sets of about 20-75k tickets.
I'm hoping that someone can help identify why, and offer some help.
We're on postgres 9.3.6
Here's a simplified ticket table schema:
ogs_1=> \d api_ticket
                                 Table "public.api_ticket"
    Column     |           Type           |                        Modifiers
---------------+--------------------------+----------------------------------------------------------
 id            | integer                  | not null default nextval('api_ticket_id_seq'::regclass)
 status        | character varying(3)     | not null
 points_earned | integer                  | not null
 rank          | integer                  | not null
 event_id      | integer                  | not null
 user_id       | integer                  | not null
 is_tied       | boolean                  | not null
Indexes:
    "api_ticket_pkey" PRIMARY KEY, btree (id)
    "api_ticket_4437cfac" btree (event_id)
    "api_ticket_e8701ad4" btree (user_id)
    "api_ticket_points_earned_idx" btree (points_earned)
    "api_ticket_rank_idx" btree ("rank")
Foreign-key constraints:
    "api_ticket_event_id_598c97289edc0e3e_fk_api_event_id" FOREIGN KEY (event_id) REFERENCES api_event(id) DEFERRABLE INITIALLY DEFERRED
    FOREIGN KEY (user_id) REFERENCES auth_user(id) DEFERRABLE INITIALLY DEFERRED
Here's the query that I'm executing:
UPDATE api_ticket t SET is_tied = False
WHERE t.event_id IN (SELECT id FROM api_event WHERE status = 'c');
UPDATE api_ticket t SET is_tied = True
FROM (
SELECT event_id, rank
FROM api_ticket tt
WHERE event_id in (SELECT id FROM api_event WHERE status = 'c')
AND tt.status <> 'x'
GROUP BY rank, event_id
HAVING count(*) > 1
) AS tied_tickets
WHERE t.rank = tied_tickets.rank AND
tied_tickets.event_id = t.event_id;
Here's the explain on a set of about 35k rows:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Update on api_ticket t (cost=3590.01..603570.21 rows=157 width=128)
-> Nested Loop (cost=3590.01..603570.21 rows=157 width=128)
-> Subquery Scan on tied_tickets (cost=2543.31..2556.18 rows=572 width=40)
-> HashAggregate (cost=2543.31..2550.46 rows=572 width=8)
Filter: (count(*) > 1)
-> Nested Loop (cost=0.84..2539.02 rows=572 width=8)
-> Index Scan using api_event_status_idx1 on api_event (cost=0.29..8.31 rows=1 width=4)
Index Cond: ((status)::text = 'c'::text)
-> Index Scan using api_ticket_4437cfac on api_ticket tt (cost=0.55..2524.99 rows=572 width=8)
Index Cond: (event_id = api_event.id)
Filter: ((status)::text <> 'x'::text)
-> Bitmap Heap Scan on api_ticket t (cost=1046.70..1050.71 rows=1 width=92)
Recheck Cond: (("rank" = tied_tickets."rank") AND (event_id = tied_tickets.event_id))
-> BitmapAnd (cost=1046.70..1046.70 rows=1 width=0)
-> Bitmap Index Scan on api_ticket_rank_idx (cost=0.00..26.65 rows=708 width=0)
Index Cond: ("rank" = tied_tickets."rank")
-> Bitmap Index Scan on api_ticket_4437cfac (cost=0.00..1019.79 rows=573 width=0)
Index Cond: (event_id = tied_tickets.event_id)
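One thing the plan invites testing (an assumption, not something verified here): the BitmapAnd combines separate indexes on "rank" and event_id for each row of tied_tickets, so a composite index that satisfies both join conditions in a single scan might help (index name assumed):
CREATE INDEX api_ticket_event_id_rank_idx ON api_ticket (event_id, "rank");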

Why is this simple SQL query so slow?

I have a query
select p.id
from Product p,Identifier i
where p.id=i.product_id
and p.aggregatorId='5109037'
and p.deletionDate is null
and i.type='03'
and i.value='9783639382891'
which takes about 3.5 seconds to run. Product has about 4.7m entries, Identifier about 20m. As you can see in the schema below, every column used in the query is indexed, but not all indexes are used. If I exclude the columns p.aggregatorId and i.type, the query runs as fast as I would have expected. I have also tried to join the Identifier table, but with no change in the explain plan.
Why are the indexes not used?
The explain plan looks like this:
Nested Loop (cost=3.21..63.48 rows=1 width=33) (actual time=10.856..3236.830 rows=1 loops=1)
-> Index Scan using idx_prod_aggr on product p (cost=0.43..2.45 rows=1 width=33) (actual time=0.041..191.339 rows=146692 loops=1)
Index Cond: ((aggregatorid)::text = '5109037'::text)
Filter: (deletiondate IS NULL)
-> Bitmap Heap Scan on identifier i (cost=2.78..61.01 rows=1 width=33) (actual time=0.019..0.019 rows=0 loops=146692)
Recheck Cond: ((product_id)::text = (p.id)::text)
Filter: (((type)::text = '03'::text) AND ((value)::text = '9783639382891'::text))
Rows Removed by Filter: 2
-> Bitmap Index Scan on idx_id_prod_id (cost=0.00..2.78 rows=29 width=0) (actual time=0.016..0.016 rows=2 loops=146692)
Index Cond: ((product_id)::text = (p.id)::text)
The reduced DB schema looks like this:
CREATE TABLE product
(
id character varying(32) NOT NULL,
version integer,
active boolean,
aggregatorid character varying(15),
deletiondate timestamp without time zone
)
WITH (
OIDS=FALSE
);
CREATE INDEX idx_prod_active
ON product
USING btree
(active);
CREATE INDEX idx_prod_aggr
ON product
USING btree
(aggregatorid COLLATE pg_catalog."default");
CREATE INDEX idx_prod_del_date
ON product
USING btree
(deletiondate);
CREATE TABLE identifier
(
id character varying(32) NOT NULL,
version integer,
typename character varying(50),
type character varying(3) NOT NULL,
value character varying(512) NOT NULL,
product_id character varying(32),
CONSTRAINT identifier_pkey PRIMARY KEY (id),
CONSTRAINT fk165a88c9c93f3e7f FOREIGN KEY (product_id)
REFERENCES product (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION
)
WITH (
OIDS=FALSE
);
CREATE INDEX idx_id_prod_type
ON identifier
USING btree
(type COLLATE pg_catalog."default");
CREATE INDEX idx_id_prod_value
ON identifier
USING btree
(value COLLATE pg_catalog."default");
CREATE INDEX idx_id_typename
ON identifier
USING btree
(typename COLLATE pg_catalog."default");
CREATE INDEX idx_prod_ident
ON identifier
USING btree
(product_id COLLATE pg_catalog."default");
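For what it's worth, the plan above estimates 1 row for aggregatorid = '5109037' but finds 146,692, which points at stale or insufficient statistics, and the bitmap scan on identifier then repeats 146,692 times. A hedged sketch of two things to try (the composite index and its name are assumptions):
-- Refresh planner statistics:
ANALYZE product;
ANALYZE identifier;
-- Let the planner drive the join from the highly selective
-- (value, type) pair instead of ~147k product rows:
CREATE INDEX idx_id_value_type ON identifier (value, type, product_id);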