I have this query that is highly inefficient, if I remove all the count columns, it takes 10 seconds to query (the tables are quite large, around 750mb each). But if I add 1 count column, it takes 36 seconds to execute, if I leave it all in, it doesn't finish at all
I tried sum(case when r.value is not null then 1 else 0 end) in place of count(DISTINCT r.*) FILTER (WHERE r.value IS NOT NULL) AS responses, but it gets incorrect counts
SELECT c.id,
c.title,
count(DISTINCT cc.*) AS contacts,
count(DISTINCT m.user_id) AS texters,
count(DISTINCT cc.*) FILTER (WHERE cc.assignment_id IS NULL) AS needs_assignment,
count(DISTINCT m.*) FILTER (WHERE m.is_from_contact = false) AS sent_messages,
count(DISTINCT m.*) FILTER (WHERE m.is_from_contact = true) AS received_messages,
count(DISTINCT cc.*) FILTER (WHERE m.is_from_contact = false) AS contacted,
count(DISTINCT cc.*) FILTER (WHERE m.is_from_contact = true) AS received_reply,
count(DISTINCT cc.*) FILTER (WHERE cc.message_status = 'needsResponse'::text AND NOT cc.is_opted_out) AS needs_response,
count(DISTINCT cc.*) FILTER (WHERE cc.is_opted_out = true) AS opt_outs,
count(DISTINCT m.*) FILTER (WHERE m.error_code IS NOT NULL AND m.error_code <> 0) AS errors,
c.is_started,
c.is_archived,
c.use_dynamic_assignment,
c.due_by,
c.created_at,
creator.email AS creator_email,
concat(c.join_token, '/join/', c.id) AS join_path,
count(DISTINCT r.*) FILTER (WHERE r.value IS NOT NULL) AS responses,
c.batch_size,
c.texting_hours_start,
c.texting_hours_end,
c.timezone,
c.organization_id
FROM campaign c
JOIN "user" creator ON c.creator_id = creator.id
LEFT JOIN campaign_contact cc ON c.id = cc.campaign_id
LEFT JOIN message m ON m.campaign_contact_id = cc.id
LEFT JOIN question_response r ON cc.id = r.campaign_contact_id
GROUP BY c.id, creator.email
Any direction is appreciated, thank you!
Create some test data...
create unlogged table users( user_id serial primary key, login text unique not null );
insert into users (login) select 'user'||n from generate_series(1,100000) n;
create unlogged table messages( message_id serial primary key, sender_id integer not null, receiver_id integer not null);
insert into messages (sender_id,receiver_id) select random()*100000+1, random()*100000+1 from generate_series(1,1000000);
create index messages_s on messages(sender_id);
create index messages_r on messages(receiver_id);
vacuum analyze users,messages;
And then:
EXPLAIN ANALYZE
SELECT user_id, count(DISTINCT m1.message_id), count(DISTINCT m2.message_id)
FROM users u
LEFT JOIN messages m1 ON m1.receiver_id = user_id
LEFT JOIN messages m2 ON m2.sender_id = user_id
GROUP BY user_id;
GroupAggregate (cost=4.39..326190.22 rows=100000 width=20) (actual time=4.023..3331.031 rows=100000 loops=1)
Group Key: u.user_id
-> Merge Left Join (cost=4.39..250190.22 rows=10000000 width=12) (actual time=3.987..2161.032 rows=9998915 loops=1)
Merge Cond: (u.user_id = m1.receiver_id)
-> Merge Left Join (cost=2.11..56522.26 rows=1000000 width=8) (actual time=3.978..515.730 rows=1000004 loops=1)
Merge Cond: (u.user_id = m2.sender_id)
-> Index Only Scan using users_pkey on users u (cost=0.29..2604.29 rows=100000 width=4) (actual time=0.016..10.149 rows=100000 loops=1)
Heap Fetches: 0
-> Index Scan using messages_s on messages m2 (cost=0.42..41168.40 rows=1000000 width=8) (actual time=0.011..397.128 rows=999996 loops=1)
-> Materialize (cost=0.42..43668.42 rows=1000000 width=8) (actual time=0.008..746.748 rows=9998810 loops=1)
-> Index Scan using messages_r on messages m1 (cost=0.42..41168.42 rows=1000000 width=8) (actual time=0.006..392.426 rows=999997 loops=1)
Execution Time: 3432.131 ms
Since I put in 100k users and 1M messages, each user has about 100 messages as sender and 100 also as receiver, which means the joins generate 100*100=10k rows per user which then have to be processed by the count(DISTINCT ...) aggregates. Postgres doesn't realize this is all unnecessary because the counts and group by's should really be moved inside the joined tables, which means this is extremely slow.
The solution is to move the aggregation inside the joined tables manually, to avoid generating all these unnecessary extra rows.
EXPLAIN ANALYZE
SELECT user_id, m1.cnt, m2.cnt
FROM users u
LEFT JOIN (SELECT receiver_id, count(*) cnt FROM messages GROUP BY receiver_id) m1 ON m1.receiver_id = user_id
LEFT JOIN (SELECT sender_id, count(*) cnt FROM messages GROUP BY sender_id) m2 ON m2.sender_id = user_id;
Hash Left Join (cost=46780.40..48846.42 rows=100000 width=20) (actual time=469.699..511.613 rows=100000 loops=1)
Hash Cond: (u.user_id = m2.sender_id)
-> Hash Left Join (cost=23391.68..25195.19 rows=100000 width=12) (actual time=237.435..262.545 rows=100000 loops=1)
Hash Cond: (u.user_id = m1.receiver_id)
-> Seq Scan on users u (cost=0.00..1541.00 rows=100000 width=4) (actual time=0.015..5.162 rows=100000 loops=1)
-> Hash (cost=22243.34..22243.34 rows=91867 width=12) (actual time=237.252..237.253 rows=99991 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 5321kB
-> Subquery Scan on m1 (cost=20406.00..22243.34 rows=91867 width=12) (actual time=210.817..227.793 rows=99991 loops=1)
-> HashAggregate (cost=20406.00..21324.67 rows=91867 width=12) (actual time=210.815..222.794 rows=99991 loops=1)
Group Key: messages.receiver_id
Batches: 1 Memory Usage: 14353kB
-> Seq Scan on messages (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.010..47.173 rows=1000000 loops=1)
-> Hash (cost=22241.52..22241.52 rows=91776 width=12) (actual time=232.003..232.004 rows=99992 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 5321kB
-> Subquery Scan on m2 (cost=20406.00..22241.52 rows=91776 width=12) (actual time=205.401..222.517 rows=99992 loops=1)
-> HashAggregate (cost=20406.00..21323.76 rows=91776 width=12) (actual time=205.400..217.518 rows=99992 loops=1)
Group Key: messages_1.sender_id
Batches: 1 Memory Usage: 14353kB
-> Seq Scan on messages messages_1 (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.008..43.402 rows=1000000 loops=1)
Planning Time: 0.574 ms
Execution Time: 515.753 ms
I used a schema that is a bit different from yours, but you get the idea: instead of generating lots of duplicate rows by doing what is essentially a cross product, push aggregations into the joined tables so they return only one row per value of whatever column you're joining on, then remove the GROUP BY from the main query since it is no longer necessary.
Note that count(DISTINCT table.*) is not smart enough to understand that it can do this by looking only at the primary key of the table if there is one, so it will pull the whole row to run the distinct on it. When a table is named "message" or "question_response" it smells like it has a largish TEXT column in it, which will make this very slow. So in case you really need a count(distinct ...) you should use count(DISTINCT table.primarykey):
explain analyze SELECT count(distinct user_id) from users;
Aggregate (cost=1791.00..1791.01 rows=1 width=8) (actual time=15.220..15.221 rows=1 loops=1)
-> Seq Scan on users (cost=0.00..1541.00 rows=100000 width=4) (actual time=0.016..5.830 rows=100000 loops=1)
Execution Time: 15.263 ms
explain analyze SELECT count(distinct users.*) from users;
Aggregate (cost=1791.00..1791.01 rows=1 width=8) (actual time=90.896..90.896 rows=1 loops=1)
-> Seq Scan on users (cost=0.00..1541.00 rows=100000 width=37) (actual time=0.038..38.497 rows=100000 loops=1)
Execution Time: 90.958 ms
The problem is the DISTINCT in the aggregate functions. PostgreSQL is not very smart about processing these.
No knowing your data model, I cannot tell if the DISTINCT is really needed. Omit it if you can.
I have table (over 100 millions records) on PostgreSQL 13.1
CREATE TABLE report
(
id serial primary key,
license_plate_id integer,
datetime timestamp
);
Indexes (for test I create both of them):
create index report_lp_datetime_index on report (license_plate_id, datetime);
create index report_lp_datetime_desc_index on report (license_plate_id desc, datetime desc);
So, my question is why query like
select * from report r
where r.license_plate_id in (1,2,4,5,6,7,8,10,15,22,34,75)
order by datetime desc
limit 100
Is very slow (~10sec). But query without order statement is fast (milliseconds).
Explain:
explain (analyze, buffers, format text) select * from report r
where r.license_plate_id in (1,2,4,5,6,7,8,10,15,22,34, 75,374,57123)
limit 100
Limit (cost=0.57..400.38 rows=100 width=316) (actual time=0.037..0.216 rows=100 loops=1)
Buffers: shared hit=103
-> Index Scan using report_lp_id_idx on report r (cost=0.57..44986.97 rows=11252 width=316) (actual time=0.035..0.202 rows=100 loops=1)
Index Cond: (license_plate_id = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75,374,57123}'::integer[]))
Buffers: shared hit=103
Planning Time: 0.228 ms
Execution Time: 0.251 ms
explain (analyze, buffers, format text) select * from report r
where r.license_plate_id in (1,2,4,5,6,7,8,10,15,22,34,75,374,57123)
order by datetime desc
limit 100
Limit (cost=44193.63..44193.88 rows=100 width=316) (actual time=4921.030..4921.047 rows=100 loops=1)
Buffers: shared hit=11455 read=671
-> Sort (cost=44193.63..44221.76 rows=11252 width=316) (actual time=4921.028..4921.035 rows=100 loops=1)
Sort Key: datetime DESC
Sort Method: top-N heapsort Memory: 128kB
Buffers: shared hit=11455 read=671
-> Bitmap Heap Scan on report r (cost=151.18..43763.59 rows=11252 width=316) (actual time=54.422..4911.927 rows=12148 loops=1)
Recheck Cond: (license_plate_id = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75,374,57123}'::integer[]))
Heap Blocks: exact=12063
Buffers: shared hit=11455 read=671
-> Bitmap Index Scan on report_lp_id_idx (cost=0.00..148.37 rows=11252 width=0) (actual time=52.631..52.632 rows=12148 loops=1)
Index Cond: (license_plate_id = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75,374,57123}'::integer[]))
Buffers: shared hit=59 read=4
Planning Time: 0.427 ms
Execution Time: 4921.128 ms
You seem to have rather slow storage, if reading 671 8kB-blocks from disk takes a couple of seconds.
The way to speed this up is to reorder the table in the same way as the index, so that you can find the required rows in the same or adjacent table blocks:
CLUSTER report_lp_id_idx USING report_lp_id_idx;
Be warned that rewriting the table in this way causes downtime – the table will not be available while it is being rewritten. Moreover, PostgreSQL does not maintain the table order, so subsequent data modifications will cause performance to gradually deteriorate, so that after a while you will have to run CLUSTER again.
But if you need this query to be fast no matter what, CLUSTER is the way to go.
Your two indices do exactly the same thing, so you can remove the second one, it's useless.
To optimize your query, the order of the fields inside the index must be reversed:
create index report_lp_datetime_index on report (datetime,license_plate_id);
BEGIN;
CREATE TABLE foo (d INTEGER, i INTEGER);
INSERT INTO foo SELECT random()*100000, random()*1000 FROM generate_series(1,1000000) s;
CREATE INDEX foo_d_i ON foo(d DESC,i);
COMMIT;
VACUUM ANALYZE foo;
EXPLAIN ANALYZE SELECT * FROM foo WHERE i IN (1,2,4,5,6,7,8,10,15,22,34,75) ORDER BY d DESC LIMIT 100;
Limit (cost=0.42..343.92 rows=100 width=8) (actual time=0.076..9.359 rows=100 loops=1)
-> Index Only Scan Backward using foo_d_i on foo (cost=0.42..40976.43 rows=11929 width=8) (actual time=0.075..9.339 rows=100 loops=1)
Filter: (i = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75}'::integer[]))
Rows Removed by Filter: 9016
Heap Fetches: 0
Planning Time: 0.339 ms
Execution Time: 9.387 ms
Note the index is not used to optimize the WHERE clause. It is used here as a compact and fast way to store references to the rows ordered by date DESC, so the ORDER BY can do an index-only scan and avoid sorting. By adding column id to the index, an index-only scan can be performed to test the condition on id, without hitting the table for every row. Since there is a low LIMIT value it does not need to scan the whole index, it only scans it in date DESC order until it finds enough rows satisfying the WHERE condition to return the result.
It will be faster if you create the index in date DESC order, this could be useful if you use ORDER BY date DESC + LIMIT in other queries too.
You forget that OP's table has a third column, and he is using SELECT *. So that wouldn't be an index-only scan.
Easy to work around. The optimum way to do this query would be an index-only scan to filter on WHERE conditions, then LIMIT, then hit the table to get the rows. For some reason if "select *" is used postgres takes the id column from the table instead of taking it from the index, which results in lots of unnecessary heap fetches for rows whose id is rejected by the WHERE condition.
Easy to work around, by doing it manually. I've also added another bogus column to make sure the SELECT * hits the table.
EXPLAIN (ANALYZE,buffers) SELECT * FROM foo
JOIN (SELECT d,i FROM foo WHERE i IN (1,2,4,5,6,7,8,10,15,22,34,75) ORDER BY d DESC LIMIT 100) f USING (d,i)
ORDER BY d DESC LIMIT 100;
Limit (cost=0.85..1281.94 rows=1 width=17) (actual time=0.052..3.618 rows=100 loops=1)
Buffers: shared hit=453
-> Nested Loop (cost=0.85..1281.94 rows=1 width=17) (actual time=0.050..3.594 rows=100 loops=1)
Buffers: shared hit=453
-> Limit (cost=0.42..435.44 rows=100 width=8) (actual time=0.037..2.953 rows=100 loops=1)
Buffers: shared hit=53
-> Index Only Scan using foo_d_i on foo foo_1 (cost=0.42..51936.43 rows=11939 width=8) (actual time=0.037..2.935 rows=100 loops=1)
Filter: (i = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75}'::integer[]))
Rows Removed by Filter: 9010
Heap Fetches: 0
Buffers: shared hit=53
-> Index Scan using foo_d_i on foo (cost=0.42..8.45 rows=1 width=17) (actual time=0.005..0.005 rows=1 loops=100)
Index Cond: ((d = foo_1.d) AND (i = foo_1.i))
Buffers: shared hit=400
Execution Time: 3.663 ms
Another option is to just add the primary key to the date,license_plate index.
SELECT * FROM foo JOIN (SELECT id FROM foo WHERE i IN (1,2,4,5,6,7,8,10,15,22,34,75) ORDER BY d DESC LIMIT 100) f USING (id) ORDER BY d DESC LIMIT 100;
Limit (cost=1357.98..1358.23 rows=100 width=17) (actual time=3.920..3.947 rows=100 loops=1)
Buffers: shared hit=473
-> Sort (cost=1357.98..1358.23 rows=100 width=17) (actual time=3.919..3.931 rows=100 loops=1)
Sort Key: foo.d DESC
Sort Method: quicksort Memory: 32kB
Buffers: shared hit=473
-> Nested Loop (cost=0.85..1354.66 rows=100 width=17) (actual time=0.055..3.858 rows=100 loops=1)
Buffers: shared hit=473
-> Limit (cost=0.42..509.41 rows=100 width=8) (actual time=0.039..3.116 rows=100 loops=1)
Buffers: shared hit=73
-> Index Only Scan using foo_d_i_id on foo foo_1 (cost=0.42..60768.43 rows=11939 width=8) (actual time=0.039..3.093 rows=100 loops=1)
Filter: (i = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75}'::integer[]))
Rows Removed by Filter: 9010
Heap Fetches: 0
Buffers: shared hit=73
-> Index Scan using foo_pkey on foo (cost=0.42..8.44 rows=1 width=17) (actual time=0.006..0.006 rows=1 loops=100)
Index Cond: (id = foo_1.id)
Buffers: shared hit=400
Execution Time: 3.972 ms
Edit
After thinking about it... since the LIMIT restricts the output to 100 rows ordered by date desc, wouldn't it be nice if we could get the 100 most recent rows for each license_plate_id, put all that into a top-n sort, and only keep the best 100 for all license_plate_ids? That would avoid reading and throwing away a lot of rows from the index. Even if that's much faster than hitting the table, it will still load up these index pages in RAM and clog up your buffers with stuff you don't actually need to keep in cache. Let's use LATERAL JOIN:
EXPLAIN (ANALYZE,BUFFERS)
SELECT * FROM foo
JOIN (SELECT d,i FROM
(VALUES (1),(2),(4),(5),(6),(7),(8),(10),(15),(22),(34),(75)) idlist
CROSS JOIN LATERAL
(SELECT d,i FROM foo WHERE i=idlist.column1 ORDER BY d DESC LIMIT 100) f2
ORDER BY d DESC LIMIT 100
) f3 USING (d,i)
ORDER BY d DESC LIMIT 100;
It's even faster: 2ms, and it uses the index on (license_plate_id,date) instead of the other way around. Also, and this is important, since each subquery in the lateral hits only the index pages that contain rows that will actually be selected, while the previous queries hit much more index pages. So you save on RAM buffers.
If you don't need the index on (date,license_plate_id) and don't want to keep a useless index, that could be interesting since this query doesn't use it. On the other hand, if you need the index on (date,license_plate_id) for something else and want to keep it, then... maybe not.
Please post results for the winning query 🔥
I am having a really slow query (~100mins). I have omitted a lot of the inner child nodes by denoting it with a suffix ...
HashAggregate (cost=6449635645.84..6449635742.59 rows=1290 width=112) (actual time=5853093.882..5853095.159 rows=785 loops=1)
Group Key: p.processid
-> Nested Loop (cost=10851145.36..6449523319.09 rows=832050 width=112) (actual time=166573.289..5853043.076 rows=3904 loops=1)
Join Filter: (SubPlan 2)
Rows Removed by Join Filter: 617040
-> Merge Left Join (cost=5425572.68..5439530.95 rows=1290 width=799) (actual time=80092.782..80114.828 rows=788 loops=1) ...
-> Materialize (cost=5425572.68..5439550.30 rows=1290 width=112) (actual time=109.689..109.934 rows=788 loops=788) ...
SubPlan 2
-> Limit (cost=3869.12..3869.13 rows=5 width=8) (actual time=9.155..9.156 rows=5 loops=620944) ...
Planning time: 1796.764 ms
Execution time: 5853316.418 ms
(2836 rows)
The above query plan is a query executed to the view, schema below (simplified)
create or replace view foo_bar_view(processid, column_1, run_count) as
SELECT
q.processid,
q.column_1,
q.run_count
FROM
(
SELECT
r.processid,
avg(h.some_column) AS column_1,
-- many more aggregate function on many more columns
count(1) AS run_count
FROM
foo_bar_table r,
foo_bar_table h
WHERE (h.processid IN (SELECT p.processid
FROM process p
LEFT JOIN bar i ON p.barid = i.id
LEFT JOIN foo ii ON i.fooid = ii.fooid
JOIN foofoobar pt ON p.typeid = pt.typeid AND pt.displayname ~~
((SELECT ('%'::text || property.value) || '%'::text
FROM property
WHERE property.name = 'something'::text))
WHERE p.processid < r.processid
AND (ii.name = r.foo_name OR ii.name IS NULL AND r.foo_name IS NULL)
ORDER BY p.processid DESC
LIMIT 5))
GROUP BY r.processid
) q;
I would just like to understand, does this mean that most of the time is spent performing the GROUP BY processid?
If not, what is causing the issue? I can't think of a reason why is this query so slow.
The aggregate functions used are avg, min, max, stddev.
A total of 52 of them were used, 4 on each of the 13 columns.
Update: Expanding on the child node of SubPlan 2. We can see that the Bitmap Index Scan on process_pkey part is the bottleneck.
-> Bitmap Heap Scan on process p_30 (cost=1825.89..3786.00 rows=715 width=24) (actual time=8.642..8.833 rows=394 loops=620944)
Recheck Cond: ((typeid = pt_30.typeid) AND (processid < p.processid))
Heap Blocks: exact=185476288
-> BitmapAnd (cost=1825.89..1825.89 rows=715 width=0) (actual time=8.611..8.611 rows=0 loops=620944)
-> Bitmap Index Scan on ix_process_typeid (cost=0.00..40.50 rows=2144 width=0) (actual time=0.077..0.077 rows=788 loops=620944)
Index Cond: (typeid = pt_30.typeid)
-> Bitmap Index Scan on process_pkey (cost=0.00..1761.20 rows=95037 width=0) (actual time=8.481..8.481 rows=145093 loops=620944)
Index Cond: (processid < p.processid)
What I am unable to figure out is why is it using a Bitmap Index Scan and not Index Scan. From what it seems, there should only be 788 rows that needs to be compared? Wouldn't that be faster? If not how can I optimise this query?
processid is of bigint type and has an index
The complete execution plan is here.
You conveniently left out the names of the tables in the execution plan, but I assume that the nested loop join is between foo_bar_table r and foo_bar_table h, and the subplan is the IN condition.
The high execution time is caused by the subplan, which is executed for each potential join result, that is 788 * 788 = 620944 times. 620944 * 9.156 accounts for 5685363 milliseconds.
Create this index:
CREATE INDEX ON process (typeid, processid, installationid);
And run VACUUM:
VACUUM process;
That should give you a fast index-only scan.
I'm trying to take advantages of partitioning in one case:
I have table "events" which partitioned by list by field "dt_pk" which is foreign key to table "dates".
-- Schema
drop schema if exists test cascade;
create schema test;
-- Tables
create table if not exists test.dates (
id bigint primary key,
dt date not null
);
create sequence test.seq_events_id;
create table if not exists test.events
(
id bigint not null,
dt_pk bigint not null,
content_int bigint,
foreign key (dt_pk) references test.dates(id) on delete cascade,
primary key (dt_pk, id)
)
partition by list (dt_pk);
-- Partitions
create table test.events_1 partition of test.events for values in (1);
create table test.events_2 partition of test.events for values in (2);
create table test.events_3 partition of test.events for values in (3);
-- Fill tables
insert into test.dates (id, dt)
select id, dt
from (
select 1 id, '2020-01-01'::date as dt
union all
select 2 id, '2020-01-02'::date as dt
union all
select 3 id, '2020-01-03'::date as dt
) t;
do $$
declare
dts record;
begin
for dts in (
select id
from test.dates
) loop
for k in 1..10000 loop
insert into test.events (id, dt_pk, content_int)
values (nextval('test.seq_events_id'), dts.id, random_between(1, 1000000));
end loop;
commit;
end loop;
end;
$$;
vacuum analyze test.dates, test.events;
I want to run select like this:
select *
from test.events e
join test.dates d on e.dt_pk = d.id
where d.dt between '2020-01-02'::date and '2020-01-03'::date;
But in this case partition pruning doesn't work. It's clear, I don't have constant for partition key. But from documentation I know that there is partition pruning at execution time, which works with value obtained from a subquery:
Partition pruning can be performed not only during the planning of a
given query, but also during its execution. This is useful as it can
allow more partitions to be pruned when clauses contain expressions
whose values are not known at query planning time, for example,
parameters defined in a PREPARE statement, using a value obtained from
a subquery, or using a parameterized value on the inner side of a
nested loop join.
So I rewrite my query like this and I expected partitionin pruning:
select *
from test.events e
where e.dt_pk in (
select d.id
from test.dates d
where d.dt between '2020-01-02'::date and '2020-01-03'::date
);
But explain for this select says:
Hash Join (cost=1.07..833.07 rows=20000 width=24) (actual time=3.581..15.989 rows=20000 loops=1)
Hash Cond: (e.dt_pk = d.id)
-> Append (cost=0.00..642.00 rows=30000 width=24) (actual time=0.005..6.361 rows=30000 loops=1)
-> Seq Scan on events_1 e (cost=0.00..164.00 rows=10000 width=24) (actual time=0.005..1.104 rows=10000 loops=1)
-> Seq Scan on events_2 e_1 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.005..1.127 rows=10000 loops=1)
-> Seq Scan on events_3 e_2 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.008..1.097 rows=10000 loops=1)
-> Hash (cost=1.04..1.04 rows=2 width=8) (actual time=0.006..0.006 rows=2 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on dates d (cost=0.00..1.04 rows=2 width=8) (actual time=0.004..0.004 rows=2 loops=1)
Filter: ((dt >= '2020-01-02'::date) AND (dt <= '2020-01-03'::date))
Rows Removed by Filter: 1
Planning Time: 0.206 ms
Execution Time: 17.237 ms
So, we read all partitions. I even tried to the planner to use nested loop join, because I read in documentation "parameterized value on the inner side of a nested loop join", but it didn't work:
set enable_hashjoin to off;
set enable_mergejoin to off;
And again:
Nested Loop (cost=0.00..1443.05 rows=20000 width=24) (actual time=9.160..25.252 rows=20000 loops=1)
Join Filter: (e.dt_pk = d.id)
Rows Removed by Join Filter: 30000
-> Append (cost=0.00..642.00 rows=30000 width=24) (actual time=0.008..6.280 rows=30000 loops=1)
-> Seq Scan on events_1 e (cost=0.00..164.00 rows=10000 width=24) (actual time=0.008..1.105 rows=10000 loops=1)
-> Seq Scan on events_2 e_1 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.008..1.047 rows=10000 loops=1)
-> Seq Scan on events_3 e_2 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.007..1.082 rows=10000 loops=1)
-> Materialize (cost=0.00..1.05 rows=2 width=8) (actual time=0.000..0.000 rows=2 loops=30000)
-> Seq Scan on dates d (cost=0.00..1.04 rows=2 width=8) (actual time=0.004..0.004 rows=2 loops=1)
Filter: ((dt >= '2020-01-02'::date) AND (dt <= '2020-01-03'::date))
Rows Removed by Filter: 1
Planning Time: 0.202 ms
Execution Time: 26.516 ms
Then I noticed that in every example of "partition pruning at execution time" I see only = condition, not in.
And it really works that way:
explain (analyze) select * from test.events e where e.dt_pk = (select id from test.dates where id = 2);
Append (cost=1.04..718.04 rows=30000 width=24) (actual time=0.014..3.018 rows=10000 loops=1)
InitPlan 1 (returns $0)
-> Seq Scan on dates (cost=0.00..1.04 rows=1 width=8) (actual time=0.007..0.008 rows=1 loops=1)
Filter: (id = 2)
Rows Removed by Filter: 2
-> Seq Scan on events_1 e (cost=0.00..189.00 rows=10000 width=24) (never executed)
Filter: (dt_pk = $0)
-> Seq Scan on events_2 e_1 (cost=0.00..189.00 rows=10000 width=24) (actual time=0.004..2.009 rows=10000 loops=1)
Filter: (dt_pk = $0)
-> Seq Scan on events_3 e_2 (cost=0.00..189.00 rows=10000 width=24) (never executed)
Filter: (dt_pk = $0)
Planning Time: 0.135 ms
Execution Time: 3.639 ms
And here is my final question: does partition pruning at execution time work only with subquery returning one item, or there is a way to get advantages of partition pruning with subquery returning a list?
And why doesn't it work with nested loop join, did I understand something wrong in words:
This includes values from subqueries and values from execution-time
parameters such as those from parameterized nested loop joins.
Or "parameterized nested loop joins" is something different from regular nested loop joins?
There is no partition pruning in your nested loop join because the partitioned table is on the outer side, which is always scanned completely. The inner side is scanned with the join key from the outer side as parameter (hence parameterized scan), so if the partitioned table were on the inner side of the nested loop join, partition pruning could happen.
Partition pruning with IN lists can take place if the list vales are known at plan time:
EXPLAIN (COSTS OFF)
SELECT * FROM test.events WHERE dt_pk IN (1, 2);
QUERY PLAN
---------------------------------------------------
Append
-> Seq Scan on events_1
Filter: (dt_pk = ANY ('{1,2}'::bigint[]))
-> Seq Scan on events_2
Filter: (dt_pk = ANY ('{1,2}'::bigint[]))
(5 rows)
But no attempts are made to flatten a subquery, and PostgreSQL doesn't use partition pruning, even if you force the partitioned table to be on the inner side (enable_material = off, enable_hashjoin = off, enable_mergejoin = off):
EXPLAIN (ANALYZE)
SELECT * FROM test.events WHERE dt_pk IN (SELECT 1 UNION SELECT 2);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.06..2034.09 rows=20000 width=24) (actual time=0.057..15.523 rows=20000 loops=1)
Join Filter: (events_1.dt_pk = (1))
Rows Removed by Join Filter: 40000
-> Unique (cost=0.06..0.07 rows=2 width=4) (actual time=0.026..0.029 rows=2 loops=1)
-> Sort (cost=0.06..0.07 rows=2 width=4) (actual time=0.024..0.025 rows=2 loops=1)
Sort Key: (1)
Sort Method: quicksort Memory: 25kB
-> Append (cost=0.00..0.05 rows=2 width=4) (actual time=0.006..0.009 rows=2 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=1)
-> Append (cost=0.00..642.00 rows=30000 width=24) (actual time=0.012..4.334 rows=30000 loops=2)
-> Seq Scan on events_1 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.011..1.057 rows=10000 loops=2)
-> Seq Scan on events_2 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.004..0.641 rows=10000 loops=2)
-> Seq Scan on events_3 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.002..0.594 rows=10000 loops=2)
Planning Time: 0.531 ms
Execution Time: 16.567 ms
(16 rows)
I am not certain, but it may be because the tables are so small. You might want to try with bigger tables.
If you care more about get it working than the fine details, and you haven't tried this yet: you can rewrite the query to something like
explain analyze select *
from test.dates d
join test.events e on e.dt_pk = d.id
where
d.dt between '2020-01-02'::date and '2020-01-03'::date
and e.dt_pk in (extract(day from '2020-01-02'::date)::int,
extract(day from '2020-01-03'::date)::int);
which will give the expected pruning.
I have very simple query which uses json data for joining on primary table:
WITH
timecode_range AS
(
SELECT
(t->>'table_id')::integer AS table_id,
(t->>'timecode_from')::bigint AS timecode_from,
(t->>'timecode_to')::bigint AS timecode_to
FROM (SELECT '{"table_id":1,"timecode_from":19890328,"timecode_to":119899328}'::jsonb t) rowset
)
SELECT n.*
FROM partition.json_notification n
INNER JOIN timecode_range r ON n.table_id = r.table_id AND n.timecode > r.timecode_from AND n.timecode <= r.timecode_to
It works perfectly when "timecode_range" returns only 1 record:
Nested Loop (cost=0.43..4668.80 rows=1416 width=97) (actual time=0.352..0.352 rows=0 loops=1)
CTE timecode_range
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.002..0.002 rows=1 loops=1)
-> CTE Scan on timecode_range r (cost=0.00..0.02 rows=1 width=20) (actual time=0.007..0.007 rows=1 loops=1)
-> Index Scan using json_notification_pkey on json_notification n (cost=0.42..4654.61 rows=1416 width=97) (actual time=0.322..0.322 rows=0 loops=1)
Index Cond: ((timecode > r.timecode_from) AND (timecode <= r.timecode_to))
Filter: (r.table_id = table_id)
Planning time: 2.292 ms
Execution time: 0.665 ms
But when I need to return several records:
WITH
timecode_range AS
(
SELECT
(t->>'table_id')::integer AS table_id,
(t->>'timecode_from')::bigint AS timecode_from,
(t->>'timecode_to')::bigint AS timecode_to
FROM (SELECT json_array_elements('[{"table_id":1,"timecode_from":19890328,"timecode_to":119899328}]') t) rowset
)
SELECT n.*
FROM partition.json_notification n
INNER JOIN timecode_range r ON n.table_id = r.table_id AND n.timecode > r.timecode_from AND n.timecode <= r.timecode_to
It starts using sequential scan and execution time dramatically grows :(
Hash Join (cost=7.01..37289.68 rows=92068 width=97) (actual time=418.563..418.563 rows=0 loops=1)
Hash Cond: (n.table_id = r.table_id)
Join Filter: ((n.timecode > r.timecode_from) AND (n.timecode <= r.timecode_to))
Rows Removed by Join Filter: 14444
CTE timecode_range
-> Subquery Scan on rowset (cost=0.00..3.76 rows=100 width=32) (actual time=0.233..0.234 rows=1 loops=1)
-> Result (cost=0.00..0.51 rows=100 width=0) (actual time=0.218..0.218 rows=1 loops=1)
-> Seq Scan on json_notification n (cost=0.00..21703.36 rows=840036 width=97) (actual time=0.205..312.991 rows=840036 loops=1)
-> Hash (cost=2.00..2.00 rows=100 width=20) (actual time=0.239..0.239 rows=1 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> CTE Scan on timecode_range r (cost=0.00..2.00 rows=100 width=20) (actual time=0.235..0.236 rows=1 loops=1)
Planning time: 4.729 ms
Execution time: 418.937 ms
What am I doing wrong?
PostgreSQL has no possibility to estimate the number of rows returned from a table function, so it uses the ROWS value specified in CREATE FUNCTION (default 1000).
For json_array_elements this value is set to 100:
SELECT prorows FROM pg_proc WHERE proname = 'json_array_elements';
┌─────────┐
│ prorows │
├─────────┤
│ 100 │
└─────────┘
(1 row)
But in your case the function returns only 1 row.
This misestimate makes PostgreSQL choose another join strategy (hash join instead of nested loop), which causes the longer execution time.
If you can choose some other construct than such a table function (e.g. a VALUES statement) that PostgreSQL can estimate, you'll get a better plan.
An alternative is to use a LIMIT clause on the CTE definition if you can safely specify an upper limit.
If you think that PostgreSQL is wrong when it switches to a hash join beyond a certain row count, you can test as follows:
Run the query (using a sequential scan and a hash join) and measure the duration (psql's \timing command will help).
Force a nested loop join:
SET enable_hashjoin=off;
SET enable_mergejoin=off;
Run the query again (with a nested loop join) and measure the duration.
If PostgreSQL is indeed wrong, you could adjust the optimizer parameters by lowering random_page_cost to a value closer to seq_page_cost.