Why is the PostgreSQL recursive view execution plan so inefficient?

My app employs a multilevel hierarchical structure. Many PL/pgSQL functions in the app use the same type of selection: "select the entities in a given list and all their child entities". I created a recursive view to avoid this redundancy. The problem is that, if I understand correctly, PostgreSQL (12.3, compiled by Visual C++ build 1914, 64-bit) selects all entities first and only then filters the records.
Here is a simplified example.
drop view if exists v;
drop table if exists t;
create table t
(
id int primary key,
parent_id int
);
insert into t (id, parent_id)
select s, (s - 1) * random()
from generate_series(1, 100000) as s;
create recursive view v (start_id, id, pid) as
select id, id, parent_id
from t
union all
select v.start_id, t.id, t.parent_id
from t
inner join v on v.id = t.parent_id;
explain (analyze)
select *
from v
where start_id = 10
order by start_id, id;
explain (analyze)
select *
from v
where start_id in (10, 11, 12, 20, 100)
order by start_id, id;
Is there a better solution? Any help is greatly appreciated.
Here is the query plan I got on my computer for the second query:
Sort (actual time=3809.581..3812.541 rows=29652 loops=1)
Sort Key: v.start_id, v.id
Sort Method: quicksort Memory: 2158kB
-> CTE Scan on v (actual time=0.044..3795.424 rows=29652 loops=1)
Filter: (start_id = ANY ('{10,11,12,20,100}'::integer[]))
Rows Removed by Filter: 1069171
CTE v
-> Recursive Union (actual time=0.028..3411.325 rows=1098823 loops=1)
-> Seq Scan on t (actual time=0.025..19.465 rows=100000 loops=1)
-> Merge Join (actual time=74.631..127.916 rows=41618 loops=24)
Merge Cond: (t_1.parent_id = v_1.id)
-> Sort (actual time=46.021..59.589 rows=99997 loops=24)
Sort Key: t_1.parent_id
Sort Method: external merge Disk: 1768kB
-> Seq Scan on t t_1 (actual time=0.016..11.797 rows=100000 loops=24)
-> Materialize (actual time=23.542..42.088 rows=65212 loops=24)
-> Sort (actual time=23.188..29.740 rows=45385 loops=24)
Sort Key: v_1.id
Sort Method: quicksort Memory: 25kB
-> WorkTable Scan on v v_1 (actual time=0.017..7.412 rows=45784 loops=24)
Planning Time: 0.260 ms
Execution Time: 3819.152 ms
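The plan shows the filter on start_id being applied only after the whole recursive CTE has been built (1,098,823 rows produced, 1,069,171 removed by the filter). One possible workaround, offered as a sketch rather than a definitive fix, is to move the filter into the non-recursive term so that recursion starts only from the requested ids. The function name subtree, its parameter start_ids, and the index name t_parent_id_idx are assumptions made for illustration; they are not part of the original schema.

```sql
-- Hypothetical helper: walk only the subtrees rooted at the requested ids.
create function subtree(start_ids int[])
returns table (start_id int, id int, pid int)
language sql stable
as $$
    with recursive v (start_id, id, pid) as (
        select t.id, t.id, t.parent_id
        from t
        where t.id = any (start_ids)   -- filter applied before recursion starts
        union all
        select v.start_id, t.id, t.parent_id
        from t
        inner join v on v.id = t.parent_id
    )
    select v.start_id, v.id, v.pid
    from v;
$$;

-- Assumption: an index on parent_id, so each recursive step can look up
-- children without scanning the whole table.
create index if not exists t_parent_id_idx on t (parent_id);

select *
from subtree(array[10, 11, 12, 20, 100])
order by start_id, id;
```

The trade-off is that callers pass the id list as an argument instead of writing a WHERE clause against a view.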

Related

Improve PostgreSQL query

I have this query that is highly inefficient. If I remove all the count columns, it takes 10 seconds to run (the tables are quite large, around 750 MB each). If I add one count column, it takes 36 seconds, and if I leave them all in, it doesn't finish at all.
I tried sum(case when r.value is not null then 1 else 0 end) in place of count(DISTINCT r.*) FILTER (WHERE r.value IS NOT NULL) AS responses, but it returns incorrect counts.
SELECT c.id,
c.title,
count(DISTINCT cc.*) AS contacts,
count(DISTINCT m.user_id) AS texters,
count(DISTINCT cc.*) FILTER (WHERE cc.assignment_id IS NULL) AS needs_assignment,
count(DISTINCT m.*) FILTER (WHERE m.is_from_contact = false) AS sent_messages,
count(DISTINCT m.*) FILTER (WHERE m.is_from_contact = true) AS received_messages,
count(DISTINCT cc.*) FILTER (WHERE m.is_from_contact = false) AS contacted,
count(DISTINCT cc.*) FILTER (WHERE m.is_from_contact = true) AS received_reply,
count(DISTINCT cc.*) FILTER (WHERE cc.message_status = 'needsResponse'::text AND NOT cc.is_opted_out) AS needs_response,
count(DISTINCT cc.*) FILTER (WHERE cc.is_opted_out = true) AS opt_outs,
count(DISTINCT m.*) FILTER (WHERE m.error_code IS NOT NULL AND m.error_code <> 0) AS errors,
c.is_started,
c.is_archived,
c.use_dynamic_assignment,
c.due_by,
c.created_at,
creator.email AS creator_email,
concat(c.join_token, '/join/', c.id) AS join_path,
count(DISTINCT r.*) FILTER (WHERE r.value IS NOT NULL) AS responses,
c.batch_size,
c.texting_hours_start,
c.texting_hours_end,
c.timezone,
c.organization_id
FROM campaign c
JOIN "user" creator ON c.creator_id = creator.id
LEFT JOIN campaign_contact cc ON c.id = cc.campaign_id
LEFT JOIN message m ON m.campaign_contact_id = cc.id
LEFT JOIN question_response r ON cc.id = r.campaign_contact_id
GROUP BY c.id, creator.email
Any direction is appreciated, thank you!
Create some test data...
create unlogged table users( user_id serial primary key, login text unique not null );
insert into users (login) select 'user'||n from generate_series(1,100000) n;
create unlogged table messages( message_id serial primary key, sender_id integer not null, receiver_id integer not null);
insert into messages (sender_id,receiver_id) select random()*100000+1, random()*100000+1 from generate_series(1,1000000);
create index messages_s on messages(sender_id);
create index messages_r on messages(receiver_id);
vacuum analyze users,messages;
And then:
EXPLAIN ANALYZE
SELECT user_id, count(DISTINCT m1.message_id), count(DISTINCT m2.message_id)
FROM users u
LEFT JOIN messages m1 ON m1.receiver_id = user_id
LEFT JOIN messages m2 ON m2.sender_id = user_id
GROUP BY user_id;
GroupAggregate (cost=4.39..326190.22 rows=100000 width=20) (actual time=4.023..3331.031 rows=100000 loops=1)
Group Key: u.user_id
-> Merge Left Join (cost=4.39..250190.22 rows=10000000 width=12) (actual time=3.987..2161.032 rows=9998915 loops=1)
Merge Cond: (u.user_id = m1.receiver_id)
-> Merge Left Join (cost=2.11..56522.26 rows=1000000 width=8) (actual time=3.978..515.730 rows=1000004 loops=1)
Merge Cond: (u.user_id = m2.sender_id)
-> Index Only Scan using users_pkey on users u (cost=0.29..2604.29 rows=100000 width=4) (actual time=0.016..10.149 rows=100000 loops=1)
Heap Fetches: 0
-> Index Scan using messages_s on messages m2 (cost=0.42..41168.40 rows=1000000 width=8) (actual time=0.011..397.128 rows=999996 loops=1)
-> Materialize (cost=0.42..43668.42 rows=1000000 width=8) (actual time=0.008..746.748 rows=9998810 loops=1)
-> Index Scan using messages_r on messages m1 (cost=0.42..41168.42 rows=1000000 width=8) (actual time=0.006..392.426 rows=999997 loops=1)
Execution Time: 3432.131 ms
Since I put in 100k users and 1M messages, each user has about 10 messages as sender and about 10 as receiver, which means the joins generate roughly 10*10=100 rows per user (about 10 million rows in total, as the plan shows), which then have to be processed by the count(DISTINCT ...) aggregates. Postgres doesn't realize this is all unnecessary because the counts and GROUP BYs could be moved inside the joined tables, so the query is extremely slow.
The solution is to move the aggregation inside the joined tables manually, to avoid generating all these unnecessary extra rows.
EXPLAIN ANALYZE
SELECT user_id, m1.cnt, m2.cnt
FROM users u
LEFT JOIN (SELECT receiver_id, count(*) cnt FROM messages GROUP BY receiver_id) m1 ON m1.receiver_id = user_id
LEFT JOIN (SELECT sender_id, count(*) cnt FROM messages GROUP BY sender_id) m2 ON m2.sender_id = user_id;
Hash Left Join (cost=46780.40..48846.42 rows=100000 width=20) (actual time=469.699..511.613 rows=100000 loops=1)
Hash Cond: (u.user_id = m2.sender_id)
-> Hash Left Join (cost=23391.68..25195.19 rows=100000 width=12) (actual time=237.435..262.545 rows=100000 loops=1)
Hash Cond: (u.user_id = m1.receiver_id)
-> Seq Scan on users u (cost=0.00..1541.00 rows=100000 width=4) (actual time=0.015..5.162 rows=100000 loops=1)
-> Hash (cost=22243.34..22243.34 rows=91867 width=12) (actual time=237.252..237.253 rows=99991 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 5321kB
-> Subquery Scan on m1 (cost=20406.00..22243.34 rows=91867 width=12) (actual time=210.817..227.793 rows=99991 loops=1)
-> HashAggregate (cost=20406.00..21324.67 rows=91867 width=12) (actual time=210.815..222.794 rows=99991 loops=1)
Group Key: messages.receiver_id
Batches: 1 Memory Usage: 14353kB
-> Seq Scan on messages (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.010..47.173 rows=1000000 loops=1)
-> Hash (cost=22241.52..22241.52 rows=91776 width=12) (actual time=232.003..232.004 rows=99992 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 5321kB
-> Subquery Scan on m2 (cost=20406.00..22241.52 rows=91776 width=12) (actual time=205.401..222.517 rows=99992 loops=1)
-> HashAggregate (cost=20406.00..21323.76 rows=91776 width=12) (actual time=205.400..217.518 rows=99992 loops=1)
Group Key: messages_1.sender_id
Batches: 1 Memory Usage: 14353kB
-> Seq Scan on messages messages_1 (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.008..43.402 rows=1000000 loops=1)
Planning Time: 0.574 ms
Execution Time: 515.753 ms
I used a schema that is a bit different from yours, but you get the idea: instead of generating lots of duplicate rows by doing what is essentially a cross product, push aggregations into the joined tables so they return only one row per value of whatever column you're joining on, then remove the GROUP BY from the main query since it is no longer necessary.
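Applied back to the schema in the question, the idea might look like the sketch below. It covers only a few of the original columns and assumes that campaign_contact.id is unique (so count(*) can stand in for count(DISTINCT cc.*)) and that each message row belongs to exactly one contact; the table and column names are taken from the question, everything else is illustrative.

```sql
SELECT c.id,
       c.title,
       coalesce(cc.contacts, 0)          AS contacts,
       coalesce(cc.needs_assignment, 0)  AS needs_assignment,
       coalesce(cc.opt_outs, 0)          AS opt_outs,
       coalesce(m.texters, 0)            AS texters,
       coalesce(m.sent_messages, 0)      AS sent_messages,
       coalesce(m.received_messages, 0)  AS received_messages,
       coalesce(m.contacted, 0)          AS contacted
FROM campaign c
LEFT JOIN (
    -- one row per campaign: no DISTINCT needed, each contact appears once
    SELECT campaign_id,
           count(*)                                        AS contacts,
           count(*) FILTER (WHERE assignment_id IS NULL)   AS needs_assignment,
           count(*) FILTER (WHERE is_opted_out)            AS opt_outs
    FROM campaign_contact
    GROUP BY campaign_id
) cc ON cc.campaign_id = c.id
LEFT JOIN (
    -- one row per campaign: each message appears once in this join
    SELECT ct.campaign_id,
           count(DISTINCT msg.user_id)                        AS texters,
           count(*) FILTER (WHERE NOT msg.is_from_contact)    AS sent_messages,
           count(*) FILTER (WHERE msg.is_from_contact)        AS received_messages,
           count(DISTINCT msg.campaign_contact_id)
               FILTER (WHERE NOT msg.is_from_contact)         AS contacted
    FROM message msg
    JOIN campaign_contact ct ON ct.id = msg.campaign_contact_id
    GROUP BY ct.campaign_id
) m ON m.campaign_id = c.id;
```

The coalesce calls are there because a LEFT JOIN to a pre-aggregated subquery yields NULL, not 0, for campaigns with no matching rows. Where a DISTINCT is still needed, it runs on a single narrow key column instead of on whole rows.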
Note that count(DISTINCT table.*) is not smart enough to understand that it can do this by looking only at the primary key of the table if there is one, so it will pull the whole row to run the distinct on it. When a table is named "message" or "question_response" it smells like it has a largish TEXT column in it, which will make this very slow. So in case you really need a count(distinct ...) you should use count(DISTINCT table.primarykey):
explain analyze SELECT count(distinct user_id) from users;
Aggregate (cost=1791.00..1791.01 rows=1 width=8) (actual time=15.220..15.221 rows=1 loops=1)
-> Seq Scan on users (cost=0.00..1541.00 rows=100000 width=4) (actual time=0.016..5.830 rows=100000 loops=1)
Execution Time: 15.263 ms
explain analyze SELECT count(distinct users.*) from users;
Aggregate (cost=1791.00..1791.01 rows=1 width=8) (actual time=90.896..90.896 rows=1 loops=1)
-> Seq Scan on users (cost=0.00..1541.00 rows=100000 width=37) (actual time=0.038..38.497 rows=100000 loops=1)
Execution Time: 90.958 ms
The problem is the DISTINCT in the aggregate functions. PostgreSQL is not very smart about processing these.
Not knowing your data model, I cannot tell if the DISTINCT is really needed. Omit it if you can.

Postgres optimization failing to filter window function partitions early

In some cases, PostgreSQL does not filter out window function partitions until after they are calculated, while in a very similar scenario PostgreSQL filters rows before performing the window function calculation.
Tables used for a minimal reproduction: log is the main data table; each row contains either an incremental or an absolute value. An absolute value resets the current counter to a new base value. Window functions need to process all log rows for a given account_id to calculate the correct running total. The view uses a subquery to ensure that the underlying log rows are not filtered by ts, since that would break the window function.
CREATE TABLE account(
id serial PRIMARY KEY,
name VARCHAR(100)
);
CREATE TABLE log(
id serial,
absolute int,
incremental int,
account_id int,
ts timestamp,
PRIMARY KEY(id),
CONSTRAINT fk_account
FOREIGN KEY(account_id)
REFERENCES account(id)
);
CREATE FUNCTION get_running_total_func(
aggregated_total int,
absolute int,
incremental int
) RETURNS int
LANGUAGE sql IMMUTABLE CALLED ON NULL INPUT AS
$$
SELECT
CASE
WHEN absolute IS NOT NULL THEN absolute
ELSE COALESCE(aggregated_total, 0) + incremental
END
$$;
CREATE AGGREGATE get_running_total(integer, integer) (
sfunc = get_running_total_func,
stype = integer
);
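To make the reset semantics concrete, here is a small sketch that runs the aggregate over a handful of inline rows instead of the log table. The sample values are invented for illustration; only get_running_total comes from the definitions above.

```sql
-- Illustrative data only: two increments, an absolute reset, one more increment.
SELECT ts,
       absolute,
       incremental,
       get_running_total(absolute, incremental)
           OVER (ORDER BY ts) AS running_value
FROM (VALUES
    (timestamp '2020-01-01 00:00', NULL::int, 5),
    ('2020-01-01 00:01',           NULL,      3),
    ('2020-01-01 00:02',           100,       NULL),
    ('2020-01-01 00:03',           NULL,      2)
) AS sample(ts, absolute, incremental)
ORDER BY ts;
-- running_value per row: 5, 8, 100, 102
```

The third row shows why the window has to see every earlier row of the partition: the absolute value discards whatever was accumulated before it.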
Slow view:
CREATE VIEW test_view
(
log_id,
running_value,
account_id,
ts
)
AS
SELECT log_running.* FROM
(SELECT
log.id,
get_running_total(
log.absolute,
log.incremental
)
OVER(
PARTITION BY log.account_id
ORDER BY log.ts RANGE UNBOUNDED PRECEDING
),
account.id,
ts
FROM log log JOIN account account ON log.account_id=account.id
) AS log_running;
CREATE VIEW
postgres=# EXPLAIN ANALYZE SELECT * FROM test_view WHERE account_id=1;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Subquery Scan on log_running (cost=12734.02..15981.48 rows=1 width=20) (actual time=7510.851..16122.404 rows=20 loops=1)
Filter: (log_running.id_1 = 1)
Rows Removed by Filter: 99902
-> WindowAgg (cost=12734.02..14732.46 rows=99922 width=32) (actual time=7510.830..14438.783 rows=99922 loops=1)
-> Sort (cost=12734.02..12983.82 rows=99922 width=28) (actual time=7510.628..9312.399 rows=99922 loops=1)
Sort Key: log.account_id, log.ts
Sort Method: external merge Disk: 3328kB
-> Hash Join (cost=143.50..2042.24 rows=99922 width=28) (actual time=169.941..5431.650 rows=99922 loops=1)
Hash Cond: (log.account_id = account.id)
-> Seq Scan on log (cost=0.00..1636.22 rows=99922 width=24) (actual time=0.063..1697.802 rows=99922 loops=1)
-> Hash (cost=81.00..81.00 rows=5000 width=4) (actual time=169.837..169.865 rows=5000 loops=1)
Buckets: 8192 Batches: 1 Memory Usage: 240kB
-> Seq Scan on account (cost=0.00..81.00 rows=5000 width=4) (actual time=0.017..84.639 rows=5000 loops=1)
Planning Time: 0.199 ms
Execution Time: 16127.275 ms
(15 rows)
Fast view - only change is account.id -> log.account_id (!):
CREATE VIEW test_view
(
log_id,
running_value,
account_id,
ts
)
AS
SELECT log_running.* FROM
(SELECT
log.id,
get_running_total(
log.absolute,
log.incremental
)
OVER(
PARTITION BY log.account_id
ORDER BY log.ts RANGE UNBOUNDED PRECEDING
),
log.account_id,
ts
FROM log log JOIN account account ON log.account_id=account.id
) AS log_running;
CREATE VIEW
postgres=# EXPLAIN ANALYZE SELECT * FROM test_view WHERE account_id=1;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Subquery Scan on log_running (cost=1894.96..1895.56 rows=20 width=20) (actual time=34.718..45.958 rows=20 loops=1)
-> WindowAgg (cost=1894.96..1895.36 rows=20 width=28) (actual time=34.691..45.307 rows=20 loops=1)
-> Sort (cost=1894.96..1895.01 rows=20 width=24) (actual time=34.367..35.925 rows=20 loops=1)
Sort Key: log.ts
Sort Method: quicksort Memory: 26kB
-> Nested Loop (cost=0.28..1894.53 rows=20 width=24) (actual time=0.542..34.066 rows=20 loops=1)
-> Index Only Scan using account_pkey on account (cost=0.28..8.30 rows=1 width=4) (actual time=0.025..0.054 rows=1 loops=1)
Index Cond: (id = 1)
Heap Fetches: 1
-> Seq Scan on log (cost=0.00..1886.03 rows=20 width=24) (actual time=0.195..32.937 rows=20 loops=1)
Filter: (account_id = 1)
Rows Removed by Filter: 99902
Planning Time: 0.297 ms
Execution Time: 47.300 ms
(14 rows)
Is this a bug in the PostgreSQL implementation? It seems that this change in the view definition shouldn't affect performance at all; PostgreSQL should be able to filter the data before applying the window function to the whole data set.

Aggregate timeseries data with value array

I have sensor data in a table by timestamp with multiple values in an array. E.g.:
CREATE TABLE test_raw (
ts timestamp without time zone NOT NULL,
values real[]
);
INSERT INTO test_raw VALUES
('2020-7-14 00:00:00', ARRAY[1, 10]),
('2020-7-14 00:01:00', ARRAY[2, 20, 30]),
('2020-7-14 00:20:00', ARRAY[3, NULL, 30, 40]),
('2020-7-14 00:23:00', ARRAY[9, NULL, 50, 80]),
('2020-7-14 00:10:00', ARRAY[3, 30, 40]),
('2020-7-14 00:11:00', ARRAY[3, 30, NULL, 50])
;
The array corresponds to different metrics collected by a device, e.g., values[1] might be temperature, values[2] might be humidity, etc. The full schema has additional columns (e.g. device ID) that indicate what the array contains.
I'd now like to create an aggregate/rollup table that has, say, the average over 10 minutes. If values were a scalar and not an array, I'd write the following view (which I'd use to populate the rollup table):
CREATE VIEW test_raw_10m AS
SELECT
floor(extract(epoch FROM ts)/600)*600 as ts,
AVG(value) /* scalar value! */
FROM test_raw
GROUP BY 1; /* group by the computed bucket; a bare "ts" would resolve to the raw input column */
But it's not so simple with a values array. I saw the answer to a very closely related question: Pairwise array sum aggregate function?
This leads me to the following, which seems overly complicated:
WITH test_raw_10m AS (
SELECT floor(extract(epoch FROM ts)/600)*600 as ts, values
FROM test_raw
)
SELECT
t.ts,
ARRAY( SELECT
AVG(value) as value
FROM test_raw_10m tt, UNNEST(tt.values) WITH ORDINALITY x(value, rn)
WHERE tt.ts = t.ts
GROUP by x.rn
ORDER by x.rn) AS values
FROM test_raw_10m AS t
GROUP BY ts
ORDER by ts
;
My question: Is there a better way to do this?
For completeness, here's the result given the above sample data:
ts | values
------------+----------------
1594684800 | {1.5,15,30}
1594685400 | {3,30,40,50}
1594686000 | {6,NULL,40,60}
(3 rows)
and here's the query plan:
QUERY PLAN
-------------------------------------------------------------------------------------------
Group (cost=119.37..9490.26 rows=200 width=40)
Group Key: t.ts
CTE test_raw_10m
-> Seq Scan on test_raw (cost=0.00..34.00 rows=1200 width=40)
-> Sort (cost=85.37..88.37 rows=1200 width=8)
Sort Key: t.ts
-> CTE Scan on test_raw_10m t (cost=0.00..24.00 rows=1200 width=8)
SubPlan 2
-> Sort (cost=46.57..46.82 rows=100 width=16)
Sort Key: x.rn
-> HashAggregate (cost=42.00..43.25 rows=100 width=16)
Group Key: x.rn
-> Nested Loop (cost=0.00..39.00 rows=600 width=12)
-> CTE Scan on test_raw_10m tt (cost=0.00..27.00 rows=6 width=32)
Filter: (ts = t.ts)
-> Function Scan on unnest x (cost=0.00..1.00 rows=100 width=12)
The following query is significantly faster on my real dataset if I do partial updates by changing FROM test_raw to something like FROM test_raw WHERE ts >= <some timestamp> (in both queries):
SELECT bucket as ts, ARRAY_AGG(v)
FROM (
SELECT to_timestamp(floor(extract(epoch FROM ts)/600)*600) as bucket, AVG(values[i]) AS v
FROM (SELECT ts, generate_subscripts(values, 1) AS i, values FROM test_raw) AS foo
GROUP BY bucket, i
ORDER BY bucket, i
) bar
GROUP BY bucket;
I believe the ORDER BY bucket, i is not necessary, but I'm not sure.
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=228027.62..241630.12 rows=200 width=40) (actual time=0.948..1.209 rows=3 loops=1)
Group Key: (to_timestamp((floor((date_part('epoch'::text, foo.ts) / '600'::double precision)) * '600'::double precision)))
-> GroupAggregate (cost=228027.62..241027.62 rows=40000 width=20) (actual time=0.826..1.099 rows=11 loops=1)
Group Key: (to_timestamp((floor((date_part('epoch'::text, foo.ts) / '600'::double precision)) * '600'::double precision))), foo.i
-> Sort (cost=228027.62..231027.62 rows=1200000 width=44) (actual time=0.773..0.870 rows=20 loops=1)
Sort Key: (to_timestamp((floor((date_part('epoch'::text, foo.ts) / '600'::double precision)) * '600'::double precision))), foo.i
Sort Method: quicksort Memory: 19kB
-> Subquery Scan on foo (cost=0.00..33031.00 rows=1200000 width=44) (actual time=0.165..0.619 rows=20 loops=1)
-> ProjectSet (cost=0.00..6031.00 rows=1200000 width=44) (actual time=0.131..0.312 rows=20 loops=1)
-> Seq Scan on test_raw (cost=0.00..22.00 rows=1200 width=40) (actual time=0.034..0.070 rows=6 loops=1)
Planning Time: 0.525 ms
Execution Time: 1.504 ms
(12 rows)
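On the doubt about ORDER BY bucket, i: without an explicit order, array_agg makes no promise about the order of its input, and a sorted subquery only usually works. A sketch that states the order inside the aggregate instead, otherwise assuming the same query as above (the only additions are i in the inner select list and an outer ORDER BY bucket):

```sql
SELECT bucket AS ts,
       array_agg(v ORDER BY i) AS values   -- element order fixed by subscript
FROM (
    SELECT to_timestamp(floor(extract(epoch FROM ts) / 600) * 600) AS bucket,
           i,
           avg(values[i]) AS v
    FROM (SELECT ts, generate_subscripts(values, 1) AS i, values
          FROM test_raw) AS foo
    GROUP BY bucket, i
) bar
GROUP BY bucket
ORDER BY bucket;
```

With the order stated in the aggregate call, the inner ORDER BY can be dropped without affecting which element lands at which array position.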

Can PostgreSQL 12 do partition pruning at execution time with subquery returning a list?

I'm trying to take advantage of partitioning in one case:
I have a table "events" which is partitioned by list on the field "dt_pk", which is a foreign key to the table "dates".
-- Schema
drop schema if exists test cascade;
create schema test;
-- Tables
create table if not exists test.dates (
id bigint primary key,
dt date not null
);
create sequence test.seq_events_id;
create table if not exists test.events
(
id bigint not null,
dt_pk bigint not null,
content_int bigint,
foreign key (dt_pk) references test.dates(id) on delete cascade,
primary key (dt_pk, id)
)
partition by list (dt_pk);
-- Partitions
create table test.events_1 partition of test.events for values in (1);
create table test.events_2 partition of test.events for values in (2);
create table test.events_3 partition of test.events for values in (3);
-- Fill tables
insert into test.dates (id, dt)
select id, dt
from (
select 1 id, '2020-01-01'::date as dt
union all
select 2 id, '2020-01-02'::date as dt
union all
select 3 id, '2020-01-03'::date as dt
) t;
do $$
declare
dts record;
begin
for dts in (
select id
from test.dates
) loop
for k in 1..10000 loop
insert into test.events (id, dt_pk, content_int)
values (nextval('test.seq_events_id'), dts.id, floor(random() * 1000000 + 1)::bigint);
end loop;
commit;
end loop;
end;
$$;
vacuum analyze test.dates, test.events;
I want to run select like this:
select *
from test.events e
join test.dates d on e.dt_pk = d.id
where d.dt between '2020-01-02'::date and '2020-01-03'::date;
But in this case partition pruning doesn't work. That makes sense: I don't have a constant for the partition key. But from the documentation I know that there is partition pruning at execution time, which works with a value obtained from a subquery:
Partition pruning can be performed not only during the planning of a
given query, but also during its execution. This is useful as it can
allow more partitions to be pruned when clauses contain expressions
whose values are not known at query planning time, for example,
parameters defined in a PREPARE statement, using a value obtained from
a subquery, or using a parameterized value on the inner side of a
nested loop join.
So I rewrote my query like this, expecting partition pruning:
select *
from test.events e
where e.dt_pk in (
select d.id
from test.dates d
where d.dt between '2020-01-02'::date and '2020-01-03'::date
);
But explain for this select says:
Hash Join (cost=1.07..833.07 rows=20000 width=24) (actual time=3.581..15.989 rows=20000 loops=1)
Hash Cond: (e.dt_pk = d.id)
-> Append (cost=0.00..642.00 rows=30000 width=24) (actual time=0.005..6.361 rows=30000 loops=1)
-> Seq Scan on events_1 e (cost=0.00..164.00 rows=10000 width=24) (actual time=0.005..1.104 rows=10000 loops=1)
-> Seq Scan on events_2 e_1 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.005..1.127 rows=10000 loops=1)
-> Seq Scan on events_3 e_2 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.008..1.097 rows=10000 loops=1)
-> Hash (cost=1.04..1.04 rows=2 width=8) (actual time=0.006..0.006 rows=2 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on dates d (cost=0.00..1.04 rows=2 width=8) (actual time=0.004..0.004 rows=2 loops=1)
Filter: ((dt >= '2020-01-02'::date) AND (dt <= '2020-01-03'::date))
Rows Removed by Filter: 1
Planning Time: 0.206 ms
Execution Time: 17.237 ms
So, we read all partitions. I even tried to force the planner to use a nested loop join, because the documentation mentions a "parameterized value on the inner side of a nested loop join", but it didn't work:
set enable_hashjoin to off;
set enable_mergejoin to off;
And again:
Nested Loop (cost=0.00..1443.05 rows=20000 width=24) (actual time=9.160..25.252 rows=20000 loops=1)
Join Filter: (e.dt_pk = d.id)
Rows Removed by Join Filter: 30000
-> Append (cost=0.00..642.00 rows=30000 width=24) (actual time=0.008..6.280 rows=30000 loops=1)
-> Seq Scan on events_1 e (cost=0.00..164.00 rows=10000 width=24) (actual time=0.008..1.105 rows=10000 loops=1)
-> Seq Scan on events_2 e_1 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.008..1.047 rows=10000 loops=1)
-> Seq Scan on events_3 e_2 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.007..1.082 rows=10000 loops=1)
-> Materialize (cost=0.00..1.05 rows=2 width=8) (actual time=0.000..0.000 rows=2 loops=30000)
-> Seq Scan on dates d (cost=0.00..1.04 rows=2 width=8) (actual time=0.004..0.004 rows=2 loops=1)
Filter: ((dt >= '2020-01-02'::date) AND (dt <= '2020-01-03'::date))
Rows Removed by Filter: 1
Planning Time: 0.202 ms
Execution Time: 26.516 ms
Then I noticed that every example of "partition pruning at execution time" I have seen uses only an = condition, not IN.
And it really works that way:
explain (analyze) select * from test.events e where e.dt_pk = (select id from test.dates where id = 2);
Append (cost=1.04..718.04 rows=30000 width=24) (actual time=0.014..3.018 rows=10000 loops=1)
InitPlan 1 (returns $0)
-> Seq Scan on dates (cost=0.00..1.04 rows=1 width=8) (actual time=0.007..0.008 rows=1 loops=1)
Filter: (id = 2)
Rows Removed by Filter: 2
-> Seq Scan on events_1 e (cost=0.00..189.00 rows=10000 width=24) (never executed)
Filter: (dt_pk = $0)
-> Seq Scan on events_2 e_1 (cost=0.00..189.00 rows=10000 width=24) (actual time=0.004..2.009 rows=10000 loops=1)
Filter: (dt_pk = $0)
-> Seq Scan on events_3 e_2 (cost=0.00..189.00 rows=10000 width=24) (never executed)
Filter: (dt_pk = $0)
Planning Time: 0.135 ms
Execution Time: 3.639 ms
And here is my final question: does partition pruning at execution time work only with a subquery returning one item, or is there a way to get the advantages of partition pruning with a subquery returning a list?
And why doesn't it work with a nested loop join? Did I misunderstand these words:
This includes values from subqueries and values from execution-time
parameters such as those from parameterized nested loop joins.
Or "parameterized nested loop joins" is something different from regular nested loop joins?
There is no partition pruning in your nested loop join because the partitioned table is on the outer side, which is always scanned completely. The inner side is scanned with the join key from the outer side as parameter (hence parameterized scan), so if the partitioned table were on the inner side of the nested loop join, partition pruning could happen.
Partition pruning with IN lists can take place if the list values are known at plan time:
EXPLAIN (COSTS OFF)
SELECT * FROM test.events WHERE dt_pk IN (1, 2);
QUERY PLAN
---------------------------------------------------
Append
-> Seq Scan on events_1
Filter: (dt_pk = ANY ('{1,2}'::bigint[]))
-> Seq Scan on events_2
Filter: (dt_pk = ANY ('{1,2}'::bigint[]))
(5 rows)
But no attempts are made to flatten a subquery, and PostgreSQL doesn't use partition pruning, even if you force the partitioned table to be on the inner side (enable_material = off, enable_hashjoin = off, enable_mergejoin = off):
EXPLAIN (ANALYZE)
SELECT * FROM test.events WHERE dt_pk IN (SELECT 1 UNION SELECT 2);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.06..2034.09 rows=20000 width=24) (actual time=0.057..15.523 rows=20000 loops=1)
Join Filter: (events_1.dt_pk = (1))
Rows Removed by Join Filter: 40000
-> Unique (cost=0.06..0.07 rows=2 width=4) (actual time=0.026..0.029 rows=2 loops=1)
-> Sort (cost=0.06..0.07 rows=2 width=4) (actual time=0.024..0.025 rows=2 loops=1)
Sort Key: (1)
Sort Method: quicksort Memory: 25kB
-> Append (cost=0.00..0.05 rows=2 width=4) (actual time=0.006..0.009 rows=2 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=1)
-> Append (cost=0.00..642.00 rows=30000 width=24) (actual time=0.012..4.334 rows=30000 loops=2)
-> Seq Scan on events_1 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.011..1.057 rows=10000 loops=2)
-> Seq Scan on events_2 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.004..0.641 rows=10000 loops=2)
-> Seq Scan on events_3 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.002..0.594 rows=10000 loops=2)
Planning Time: 0.531 ms
Execution Time: 16.567 ms
(16 rows)
I am not certain, but it may be because the tables are so small. You might want to try with bigger tables.
If you care more about getting it working than about the fine details, and you haven't tried this yet: you can rewrite the query to something like
explain analyze select *
from test.dates d
join test.events e on e.dt_pk = d.id
where
d.dt between '2020-01-02'::date and '2020-01-03'::date
and e.dt_pk in (extract(day from '2020-01-02'::date)::int,
extract(day from '2020-01-03'::date)::int);
which will give the expected pruning. Note that this relies on dates.id matching the day of the month, as it does in the sample data.
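Since an IN list whose values are known at plan time is pruned, another option is to fetch the ids first and then build the statement with the values embedded as a literal. The following is a minimal sketch of that two-step approach using dynamic SQL in a DO block; the variable names ids and n and the count(*) query are placeholders for illustration, not something from the question.

```sql
do $$
declare
    ids bigint[];   -- partition keys resolved in a first step
    n   bigint;
begin
    select array_agg(d.id)
    into ids
    from test.dates d
    where d.dt between date '2020-01-02' and date '2020-01-03';

    -- %L embeds the array as a quoted literal, so the planner sees a
    -- constant list and can prune partitions at plan time.
    execute format(
        'select count(*) from test.events where dt_pk = any (%L::bigint[])',
        ids)
    into n;

    raise notice 'matching events: %', n;
end;
$$;
```

Running EXPLAIN on the generated statement text should confirm which partitions remain. The cost of this approach is an extra round of planning for every distinct list.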

PostgreSQL 11 goes for parallel seq scan on partitioned table where index should be enough

The problem is I keep getting seq scan on a rather simple query for a very trivial setup. What am I doing wrong?
Postgres 11 on Windows Server 2016
Config changes done: constraint_exclusion = partition
A single table partitioned into 200 subtables, tens of millions of records per partition.
An index on the field in question (assuming the index is partitioned as well)
Here's the create statement:
CREATE TABLE A (
K int NOT NULL,
X bigint NOT NULL,
Date timestamp NOT NULL,
fy smallint NOT NULL,
fz decimal(18, 8) NOT NULL,
fw decimal(18, 8) NOT NULL,
fv decimal(18, 8) NULL,
PRIMARY KEY (K, X)
) PARTITION BY LIST (K);
CREATE TABLE A_1 PARTITION OF A FOR VALUES IN (1);
CREATE TABLE A_2 PARTITION OF A FOR VALUES IN (2);
...
CREATE TABLE A_200 PARTITION OF A FOR VALUES IN (200);
CREATE TABLE A_Default PARTITION OF A DEFAULT;
CREATE INDEX IX_A_Date ON A (Date);
The query in question:
SELECT K, MIN(Date), MAX(Date)
FROM A
GROUP BY K
That always gives a sequential scan that takes several minutes, while it is evident that there is no need for table data at all: the Date field is indexed, and I'm just asking for the first and last leaf of its B-tree.
Originally the index was on (K, Date), and it quickly became clear to me that Postgres would not use it in any of the queries where I expected it to. An index on (Date) did the trick for other queries, and Postgres apparently creates the index on each partition automatically. However, this specific simple query always goes for a seq scan.
Any thoughts appreciated!
UPDATE
Query plan (analyze, buffers) is as follows:
Finalize GroupAggregate (cost=4058360.99..4058412.66 rows=200 width=20) (actual time=148448.183..148448.189 rows=5 loops=1)
Group Key: a_16.k
Buffers: shared hit=5970 read=548034 dirtied=4851 written=1446
-> Gather Merge (cost=4058360.99..4058407.66 rows=400 width=20) (actual time=148448.166..148463.953 rows=8 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=5998 read=1919356 dirtied=4865 written=1454
-> Sort (cost=4057360.97..4057361.47 rows=200 width=20) (actual time=148302.271..148302.285 rows=3 loops=3)
Sort Key: a_16.k
Sort Method: quicksort Memory: 25kB
Worker 0: Sort Method: quicksort Memory: 25kB
Worker 1: Sort Method: quicksort Memory: 25kB
Buffers: shared hit=5998 read=1919356 dirtied=4865 written=1454
-> Partial HashAggregate (cost=4057351.32..4057353.32 rows=200 width=20) (actual time=148302.199..148302.203 rows=3 loops=3)
Group Key: a_16.k
Buffers: shared hit=5984 read=1919356 dirtied=4865 written=1454
-> Parallel Append (cost=0.00..3347409.96 rows=94658849 width=12) (actual time=1.678..116664.051 rows=75662243 loops=3)
Buffers: shared hit=5984 read=1919356 dirtied=4865 written=1454
-> Parallel Seq Scan on a_16 (cost=0.00..1302601.32 rows=42870432 width=12) (actual time=0.320..41625.766 rows=34283419 loops=3)
Buffers: shared hit=14 read=873883 dirtied=14 written=8
-> Parallel Seq Scan on a_19 (cost=0.00..794121.94 rows=26070794 width=12) (actual time=0.603..54017.937 rows=31276617 loops=2)
Buffers: shared read=533414
-> Parallel Seq Scan on a_20 (cost=0.00..447025.50 rows=14900850 width=12) (actual time=0.347..52866.404 rows=35762000 loops=1)
Buffers: shared hit=5964 read=292053 dirtied=4850 written=1446
-> Parallel Seq Scan on a_18 (cost=0.00..198330.23 rows=6450422 width=12) (actual time=4.504..27197.706 rows=15481014 loops=1)
Buffers: shared read=133826
-> Parallel Seq Scan on a_17 (cost=0.00..129272.31 rows=4308631 width=12) (actual time=3.014..18423.307 rows=10340224 loops=1)
Buffers: shared hit=6 read=86180 dirtied=1
...
-> Parallel Seq Scan on a_197 (cost=0.00..14.18 rows=418 width=12) (actual time=0.000..0.000 rows=0 loops=1)
-> Parallel Seq Scan on a_198 (cost=0.00..14.18 rows=418 width=12) (actual time=0.001..0.002 rows=0 loops=1)
-> Parallel Seq Scan on a_199 (cost=0.00..14.18 rows=418 width=12) (actual time=0.001..0.001 rows=0 loops=1)
-> Parallel Seq Scan on a_default (cost=0.00..14.18 rows=418 width=12) (actual time=0.001..0.002 rows=0 loops=1)
Planning Time: 16.893 ms
Execution Time: 148466.519 ms
UPDATE 2 Just to avoid future comments like “you should index on (K, Date)”:
The query plan with both indexes in place is exactly the same, analysis numbers are the same and even buffer hits/reads are almost the same.
Aggregate push-down into parallel plans can be enabled by setting enable_partitionwise_aggregate to on.
That will probably speed up your query somewhat, because PostgreSQL doesn't have to pass so much data between the parallel workers.
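A minimal way to try this for one session only, without touching postgresql.conf (the setting is off by default), reusing the query from the question:

```sql
SET enable_partitionwise_aggregate = on;   -- affects the current session only

EXPLAIN (ANALYZE, BUFFERS)
SELECT k, min(date), max(date)
FROM a
GROUP BY k;

RESET enable_partitionwise_aggregate;      -- back to the configured default
```

Comparing this plan with the one in the question shows whether the aggregation now happens below the Append node, per partition.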
But it looks like PostgreSQL isn't smart enough to figure out it can use the index to speed up min and max for each partition, although it is smart enough to do that with a non-partitioned table.
There is no pretty way to work around that; you could resort to querying each partition:
SELECT k, min(min_date), max(max_date)
FROM (
SELECT 1 AS k, MIN(date) AS min_date, MAX(date) AS max_date FROM a_1
UNION ALL
SELECT 2, MIN(date), MAX(date) FROM a_2
UNION ALL
...
SELECT 200, MIN(date), MAX(date) FROM a_200
UNION ALL
SELECT k, MIN(date), MAX(date) FROM a_default
) AS all_a
GROUP BY k;
Yuck! There is clearly room for improvement here.
I dug into the code and found the reason in src/backend/optimizer/plan/planagg.c:
/*
* preprocess_minmax_aggregates - preprocess MIN/MAX aggregates
*
* Check to see whether the query contains MIN/MAX aggregate functions that
* might be optimizable via indexscans. If it does, and all the aggregates
* are potentially optimizable, then create a MinMaxAggPath and add it to
* the (UPPERREL_GROUP_AGG, NULL) upperrel.
[...]
*/
void
preprocess_minmax_aggregates(PlannerInfo *root, List *tlist)
{
[...]
/*
* Reject unoptimizable cases.
*
* We don't handle GROUP BY or windowing, because our current
* implementations of grouping require looking at all the rows anyway, and
* so there's not much point in optimizing MIN/MAX.
*/
if (parse->groupClause || list_length(parse->groupingSets) > 1 ||
parse->hasWindowFuncs)
return;
Basically, PostgreSQL punts when it sees a GROUP BY clause.
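For comparison, this is roughly what the MIN/MAX optimization amounts to when it does apply, written out by hand for a single partition without GROUP BY. It is a sketch under the assumption that a_1 has its own index on date (which the index created on the parent table should provide); each subquery only has to read one end of that index.

```sql
SELECT (SELECT date
        FROM a_1
        WHERE date IS NOT NULL
        ORDER BY date ASC
        LIMIT 1) AS min_date,    -- first entry of the index
       (SELECT date
        FROM a_1
        WHERE date IS NOT NULL
        ORDER BY date DESC
        LIMIT 1) AS max_date;    -- last entry of the index
```

The per-partition UNION ALL query shown earlier works because each branch has this shape: no GROUP BY, so the planner is free to use the index.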