PostgreSQL very slow query on indexed columns

I have a very simple query which uses JSON data for joining on the primary table:
WITH timecode_range AS (
    SELECT
        (t->>'table_id')::integer AS table_id,
        (t->>'timecode_from')::bigint AS timecode_from,
        (t->>'timecode_to')::bigint AS timecode_to
    FROM (SELECT '{"table_id":1,"timecode_from":19890328,"timecode_to":119899328}'::jsonb t) rowset
)
SELECT n.*
FROM partition.json_notification n
INNER JOIN timecode_range r ON n.table_id = r.table_id AND n.timecode > r.timecode_from AND n.timecode <= r.timecode_to
It works perfectly when "timecode_range" returns only 1 record:
Nested Loop (cost=0.43..4668.80 rows=1416 width=97) (actual time=0.352..0.352 rows=0 loops=1)
CTE timecode_range
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.002..0.002 rows=1 loops=1)
-> CTE Scan on timecode_range r (cost=0.00..0.02 rows=1 width=20) (actual time=0.007..0.007 rows=1 loops=1)
-> Index Scan using json_notification_pkey on json_notification n (cost=0.42..4654.61 rows=1416 width=97) (actual time=0.322..0.322 rows=0 loops=1)
Index Cond: ((timecode > r.timecode_from) AND (timecode <= r.timecode_to))
Filter: (r.table_id = table_id)
Planning time: 2.292 ms
Execution time: 0.665 ms
But when I need to return several records:
WITH timecode_range AS (
    SELECT
        (t->>'table_id')::integer AS table_id,
        (t->>'timecode_from')::bigint AS timecode_from,
        (t->>'timecode_to')::bigint AS timecode_to
    FROM (SELECT json_array_elements('[{"table_id":1,"timecode_from":19890328,"timecode_to":119899328}]') t) rowset
)
SELECT n.*
FROM partition.json_notification n
INNER JOIN timecode_range r ON n.table_id = r.table_id AND n.timecode > r.timecode_from AND n.timecode <= r.timecode_to
It starts using sequential scan and execution time dramatically grows :(
Hash Join (cost=7.01..37289.68 rows=92068 width=97) (actual time=418.563..418.563 rows=0 loops=1)
Hash Cond: (n.table_id = r.table_id)
Join Filter: ((n.timecode > r.timecode_from) AND (n.timecode <= r.timecode_to))
Rows Removed by Join Filter: 14444
CTE timecode_range
-> Subquery Scan on rowset (cost=0.00..3.76 rows=100 width=32) (actual time=0.233..0.234 rows=1 loops=1)
-> Result (cost=0.00..0.51 rows=100 width=0) (actual time=0.218..0.218 rows=1 loops=1)
-> Seq Scan on json_notification n (cost=0.00..21703.36 rows=840036 width=97) (actual time=0.205..312.991 rows=840036 loops=1)
-> Hash (cost=2.00..2.00 rows=100 width=20) (actual time=0.239..0.239 rows=1 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> CTE Scan on timecode_range r (cost=0.00..2.00 rows=100 width=20) (actual time=0.235..0.236 rows=1 loops=1)
Planning time: 4.729 ms
Execution time: 418.937 ms
What am I doing wrong?

PostgreSQL has no way to estimate the number of rows returned from a table function, so it uses the ROWS value specified in CREATE FUNCTION (default 1000).
For json_array_elements this value is set to 100:
SELECT prorows FROM pg_proc WHERE proname = 'json_array_elements';
│ prorows │
│ 100 │
(1 row)
But in your case the function returns only 1 row.
This misestimate makes PostgreSQL choose another join strategy (hash join instead of nested loop), which causes the longer execution time.
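To see the default estimate in isolation, one could run EXPLAIN on the function call alone. This is a minimal sketch using the one-element array from the question; the planner should report the prorows default (100 rows here) regardless of how many elements the array actually holds.

```sql
-- The array has a single element, but the row estimate
-- comes from pg_proc.prorows, not from the data.
EXPLAIN
SELECT *
FROM json_array_elements(
         '[{"table_id":1,"timecode_from":19890328,"timecode_to":119899328}]'::json
     ) AS t;
```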
If you can choose some other construct than such a table function (e.g. a VALUES statement) that PostgreSQL can estimate, you'll get a better plan.
An alternative is to use a LIMIT clause on the CTE definition if you can safely specify an upper limit.
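Both suggestions could look roughly like the sketch below. The select list n.* and the upper bound of 10 are assumptions made for illustration; adapt them to the real query and data.

```sql
-- Variant 1: a VALUES list, whose row count the planner can see directly.
WITH timecode_range (table_id, timecode_from, timecode_to) AS (
    VALUES (1, 19890328::bigint, 119899328::bigint)
)
SELECT n.*
FROM partition.json_notification n
INNER JOIN timecode_range r
        ON n.table_id = r.table_id
       AND n.timecode > r.timecode_from
       AND n.timecode <= r.timecode_to;

-- Variant 2: keep json_array_elements, but cap the estimate with LIMIT.
-- The bound of 10 is an assumption; choose one that is safe for your data.
WITH timecode_range AS (
    SELECT (t->>'table_id')::integer     AS table_id,
           (t->>'timecode_from')::bigint AS timecode_from,
           (t->>'timecode_to')::bigint   AS timecode_to
    FROM (SELECT json_array_elements('[{"table_id":1,"timecode_from":19890328,"timecode_to":119899328}]') t) rowset
    LIMIT 10
)
SELECT n.*
FROM partition.json_notification n
INNER JOIN timecode_range r
        ON n.table_id = r.table_id
       AND n.timecode > r.timecode_from
       AND n.timecode <= r.timecode_to;
```

Running either variant under EXPLAIN (ANALYZE) should show whether the row estimate for timecode_range drops far enough for the nested loop to be chosen again.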
If you think that PostgreSQL is wrong when it switches to a hash join beyond a certain row count, you can test as follows:
Run the query (using a sequential scan and a hash join) and measure the duration (psql's \timing command will help).
Force a nested loop join:
SET enable_hashjoin=off;
SET enable_mergejoin=off;
Run the query again (with a nested loop join) and measure the duration.
If PostgreSQL is indeed wrong, you could adjust the optimizer parameters by lowering random_page_cost to a value closer to seq_page_cost.
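A session-level experiment along those lines might look like this. The value 1.1 is only an assumed starting point (often quoted for storage where random reads are cheap), not a recommendation for your system.

```sql
SHOW seq_page_cost;          -- 1 by default
SHOW random_page_cost;       -- 4 by default

SET random_page_cost = 1.1;  -- assumed value, affects this session only

-- Re-run the query under EXPLAIN (ANALYZE) here and compare the plans.

-- Undo the experiment, including the join settings changed earlier.
RESET random_page_cost;
RESET enable_hashjoin;
RESET enable_mergejoin;
```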


GIN index not working with `SELECT 1` but it works if I do `SELECT COUNT(*)` on PostgreSQL

I have the following query
> explain analyze SELECT 1 AS one FROM "orders" WHERE "orders"."email" ILIKE '' LIMIT 1;
Limit (cost=0.00..470.44 rows=1 width=4) (actual time=2303.032..2303.033 rows=1 loops=1)
Output: 1
-> Seq Scan on public.orders (cost=0.00..108200.10 rows=230 width=4) (actual time=2303.031..2303.031 rows=1 loops=1)
Output: 1
Filter: ((orders.email)::text ~~* ''::text)
Rows Removed by Filter: 2309367
Planning Time: 0.195 ms
Execution Time: 2303.047 ms
If I run the same query but use SELECT COUNT(*) instead of SELECT 1, the GIN index (gin_trgm_ops) starts to work:
> explain analyze SELECT COUNT(*) FROM "orders" WHERE "orders"."email" ILIKE '' LIMIT 1;
Limit (cost=1263.98..1263.99 rows=1 width=8) (actual time=18.074..18.075 rows=1 loops=1)
-> Aggregate (cost=1263.98..1263.99 rows=1 width=8) (actual time=18.073..18.073 rows=1 loops=1)
-> Bitmap Heap Scan on orders (cost=377.78..1263.40 rows=230 width=0) (actual time=18.062..18.067 rows=3 loops=1)
Recheck Cond: ((email)::text ~~* ''::text)
Heap Blocks: exact=2
-> Bitmap Index Scan on index_orders_on_email_gin (cost=0.00..377.72 rows=230 width=0) (actual time=18.043..18.044 rows=3 loops=1)
Index Cond: ((email)::text ~~* ''::text)
Planning Time: 0.575 ms
Execution Time: 18.120 ms
Any idea why?
With SELECT 1 ... LIMIT 1, it can stop early once it finds one qualifying row. Since PostgreSQL misestimates how many qualifying rows there are, it misestimates how useful this stopping early will be.
The LIMIT doesn't do anything when used with COUNT(*) but without a GROUP BY, since only one row is returned anyway. There is no stopping-early that can be done, as every qualifying row needs to be found in order to count them.
The crux of the matter is not SELECT 1 versus SELECT COUNT(*), it is a LIMIT that does something versus one that does not.
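The numbers in the first plan illustrate this. As far as I understand the cost model, the Limit node's cost is interpolated from the scan below it: the planner expects 230 matching rows spread evenly through the table, so it assumes it has to read only about 1/230 of the sequential scan before it finds the first one. A rough check of that arithmetic:

```sql
-- Seq Scan total cost 108200.10, 230 estimated matching rows, LIMIT 1:
SELECT round(108200.10 * 1 / 230, 2) AS estimated_limit_cost;
-- 470.44, the estimated total cost of the Limit node in the first plan
```

That 470.44 is lower than the 1263.40 estimated for the bitmap heap scan, so the sequential scan wins on paper. In reality only 3 rows match, and the scan discards 2309367 rows before it finds one.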

Can PostgreSQL 12 do partition pruning at execution time with subquery returning a list?

I'm trying to take advantage of partitioning in one case:
I have a table "events" which is partitioned by list on the field "dt_pk", which is a foreign key to the table "dates".
-- Schema
drop schema if exists test cascade;
create schema test;
-- Tables
create table if not exists test.dates (
    id bigint primary key,
    dt date not null
);
create sequence test.seq_events_id;
create table if not exists test.events (
    id bigint not null,
    dt_pk bigint not null,
    content_int bigint,
    foreign key (dt_pk) references test.dates(id) on delete cascade,
    primary key (dt_pk, id)
)
partition by list (dt_pk);
-- Partitions
create table test.events_1 partition of test.events for values in (1);
create table test.events_2 partition of test.events for values in (2);
create table test.events_3 partition of test.events for values in (3);
-- Fill tables
insert into test.dates (id, dt)
select id, dt
from (
    select 1 id, '2020-01-01'::date as dt
    union all
    select 2 id, '2020-01-02'::date as dt
    union all
    select 3 id, '2020-01-03'::date as dt
) t;
-- random_between() is not built into PostgreSQL 12; it is assumed to be a user-defined helper
do $$
declare
    dts record;
begin
    for dts in (
        select id
        from test.dates
    ) loop
        for k in 1..10000 loop
            insert into test.events (id, dt_pk, content_int)
            values (nextval('test.seq_events_id'), dts.id, random_between(1, 1000000));
        end loop;
    end loop;
end;
$$;
vacuum analyze test.dates, test.events;
I want to run a select like this:
select *
from test.events e
join test.dates d on e.dt_pk = d.id
where d.dt between '2020-01-02'::date and '2020-01-03'::date;
But in this case partition pruning doesn't work. That is understandable: I don't have a constant for the partition key. But from the documentation I know that there is partition pruning at execution time, which works with a value obtained from a subquery:
Partition pruning can be performed not only during the planning of a
given query, but also during its execution. This is useful as it can
allow more partitions to be pruned when clauses contain expressions
whose values are not known at query planning time, for example,
parameters defined in a PREPARE statement, using a value obtained from
a subquery, or using a parameterized value on the inner side of a
nested loop join.
So I rewrote my query like this, expecting partition pruning:
select *
from test.events e
where e.dt_pk in (
    select d.id
    from test.dates d
    where d.dt between '2020-01-02'::date and '2020-01-03'::date
);
But explain for this select says:
Hash Join (cost=1.07..833.07 rows=20000 width=24) (actual time=3.581..15.989 rows=20000 loops=1)
Hash Cond: (e.dt_pk = d.id)
-> Append (cost=0.00..642.00 rows=30000 width=24) (actual time=0.005..6.361 rows=30000 loops=1)
-> Seq Scan on events_1 e (cost=0.00..164.00 rows=10000 width=24) (actual time=0.005..1.104 rows=10000 loops=1)
-> Seq Scan on events_2 e_1 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.005..1.127 rows=10000 loops=1)
-> Seq Scan on events_3 e_2 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.008..1.097 rows=10000 loops=1)
-> Hash (cost=1.04..1.04 rows=2 width=8) (actual time=0.006..0.006 rows=2 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on dates d (cost=0.00..1.04 rows=2 width=8) (actual time=0.004..0.004 rows=2 loops=1)
Filter: ((dt >= '2020-01-02'::date) AND (dt <= '2020-01-03'::date))
Rows Removed by Filter: 1
Planning Time: 0.206 ms
Execution Time: 17.237 ms
So, we read all partitions. I even tried to force the planner to use a nested loop join, because I read in the documentation "parameterized value on the inner side of a nested loop join", but it didn't work:
set enable_hashjoin to off;
set enable_mergejoin to off;
And again:
Nested Loop (cost=0.00..1443.05 rows=20000 width=24) (actual time=9.160..25.252 rows=20000 loops=1)
Join Filter: (e.dt_pk = d.id)
Rows Removed by Join Filter: 30000
-> Append (cost=0.00..642.00 rows=30000 width=24) (actual time=0.008..6.280 rows=30000 loops=1)
-> Seq Scan on events_1 e (cost=0.00..164.00 rows=10000 width=24) (actual time=0.008..1.105 rows=10000 loops=1)
-> Seq Scan on events_2 e_1 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.008..1.047 rows=10000 loops=1)
-> Seq Scan on events_3 e_2 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.007..1.082 rows=10000 loops=1)
-> Materialize (cost=0.00..1.05 rows=2 width=8) (actual time=0.000..0.000 rows=2 loops=30000)
-> Seq Scan on dates d (cost=0.00..1.04 rows=2 width=8) (actual time=0.004..0.004 rows=2 loops=1)
Filter: ((dt >= '2020-01-02'::date) AND (dt <= '2020-01-03'::date))
Rows Removed by Filter: 1
Planning Time: 0.202 ms
Execution Time: 26.516 ms
Then I noticed that in every example of "partition pruning at execution time" I see only = condition, not in.
And it really works that way:
explain (analyze) select * from test.events e where e.dt_pk = (select id from test.dates where id = 2);
Append (cost=1.04..718.04 rows=30000 width=24) (actual time=0.014..3.018 rows=10000 loops=1)
InitPlan 1 (returns $0)
-> Seq Scan on dates (cost=0.00..1.04 rows=1 width=8) (actual time=0.007..0.008 rows=1 loops=1)
Filter: (id = 2)
Rows Removed by Filter: 2
-> Seq Scan on events_1 e (cost=0.00..189.00 rows=10000 width=24) (never executed)
Filter: (dt_pk = $0)
-> Seq Scan on events_2 e_1 (cost=0.00..189.00 rows=10000 width=24) (actual time=0.004..2.009 rows=10000 loops=1)
Filter: (dt_pk = $0)
-> Seq Scan on events_3 e_2 (cost=0.00..189.00 rows=10000 width=24) (never executed)
Filter: (dt_pk = $0)
Planning Time: 0.135 ms
Execution Time: 3.639 ms
And here is my final question: does partition pruning at execution time work only with a subquery returning one item, or is there a way to get the advantages of partition pruning with a subquery returning a list?
And why doesn't it work with a nested loop join? Did I misunderstand something in these words:
This includes values from subqueries and values from execution-time
parameters such as those from parameterized nested loop joins.
Or "parameterized nested loop joins" is something different from regular nested loop joins?
There is no partition pruning in your nested loop join because the partitioned table is on the outer side, which is always scanned completely. The inner side is scanned with the join key from the outer side as parameter (hence parameterized scan), so if the partitioned table were on the inner side of the nested loop join, partition pruning could happen.
Partition pruning with IN lists can take place if the list values are known at plan time:
EXPLAIN (COSTS OFF) SELECT * FROM test.events WHERE dt_pk IN (1, 2);
Append
-> Seq Scan on events_1
Filter: (dt_pk = ANY ('{1,2}'::bigint[]))
-> Seq Scan on events_2
Filter: (dt_pk = ANY ('{1,2}'::bigint[]))
(5 rows)
But no attempts are made to flatten a subquery, and PostgreSQL doesn't use partition pruning, even if you force the partitioned table to be on the inner side (enable_material = off, enable_hashjoin = off, enable_mergejoin = off):
Nested Loop (cost=0.06..2034.09 rows=20000 width=24) (actual time=0.057..15.523 rows=20000 loops=1)
Join Filter: (events_1.dt_pk = (1))
Rows Removed by Join Filter: 40000
-> Unique (cost=0.06..0.07 rows=2 width=4) (actual time=0.026..0.029 rows=2 loops=1)
-> Sort (cost=0.06..0.07 rows=2 width=4) (actual time=0.024..0.025 rows=2 loops=1)
Sort Key: (1)
Sort Method: quicksort Memory: 25kB
-> Append (cost=0.00..0.05 rows=2 width=4) (actual time=0.006..0.009 rows=2 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=1)
-> Append (cost=0.00..642.00 rows=30000 width=24) (actual time=0.012..4.334 rows=30000 loops=2)
-> Seq Scan on events_1 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.011..1.057 rows=10000 loops=2)
-> Seq Scan on events_2 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.004..0.641 rows=10000 loops=2)
-> Seq Scan on events_3 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.002..0.594 rows=10000 loops=2)
Planning Time: 0.531 ms
Execution Time: 16.567 ms
(16 rows)
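Since pruning does happen when the list values are known at plan time, one possible workaround is to resolve the ids first and then send a second statement with literal values. This is only a sketch of that idea, not something taken from the plans above. With the sample data from the question, the date range maps to ids 2 and 3.

```sql
-- Step 1: look up the partition keys for the date range.
SELECT id
FROM test.dates
WHERE dt BETWEEN '2020-01-02'::date AND '2020-01-03'::date;

-- Step 2: the application builds the IN list from the result of step 1
-- (2 and 3 for the sample data) and runs the query with literals.
EXPLAIN (COSTS OFF)
SELECT *
FROM test.events
WHERE dt_pk IN (2, 3);
```

If pruning takes place, events_1 should be absent from the plan of the second statement. The price is an extra round trip and a query text that changes with the date range.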
I am not certain, but it may be because the tables are so small. You might want to try with bigger tables.
If you care more about getting it working than about the fine details, and you haven't tried this yet: you can rewrite the query to something like
explain analyze select *
from test.dates d
join test.events e on e.dt_pk = d.id
where d.dt between '2020-01-02'::date and '2020-01-03'::date
and e.dt_pk in (extract(day from '2020-01-02'::date)::int,
extract(day from '2020-01-03'::date)::int);
which will give the expected pruning.

PostgreSQL - Why does planner underestimate number of rows when doing a Foreign Scan? [duplicate]

I am running a query with an INNER JOIN where the planner decides to use a Nested Loop. I've figured out that it has to do with the WHERE conditions, as I have tried writing the query with different WHERE conditions so that it returns the same result but does not use a Nested Loop.
My question is why has the planner decided to make the different decisions when the queries appear to be identical as they both return the same result? The query runs in 77 secs with the Nested Loop and in 13 sec without, and the query that runs in 13 sec is quite ugly and inelegant making me think there is a better way to write it.
Here are the two queries. Note that the difference between the two is how the WHERE clause filters by date where the first uses BETWEEN and the second uses a series of OR statements. I am aware that it's strange that current_date is wrapped in their own subqueries but that is because these queries are using foreign data wrappers. This allows current_date to be passed as an immutable object to greatly speed up performance.
SELECT ROUND(AVG(m.forecast - w.wind),6) from pjm.wind_forecast_recent w
INNER JOIN pjm.load_forecast_recent m ON w.pricedate = m.pricedate AND w.hour = m.hour
WHERE w.hour = 5 AND m.area = 'RTO_COMBINED' AND
(w.pricedate BETWEEN (SELECT current_date-6) AND (SELECT current_date));
SELECT ROUND(AVG(m.forecast - w.wind),6) from pjm.wind_forecast_recent w
INNER JOIN pjm.load_forecast_recent m ON w.pricedate = m.pricedate AND w.hour = m.hour
WHERE w.hour = 5 AND m.area = 'RTO_COMBINED' AND (
w.pricedate = (SELECT current_date-6) OR
w.pricedate = (SELECT current_date-5) OR
w.pricedate = (SELECT current_date-4) OR
w.pricedate = (SELECT current_date-3) OR
w.pricedate = (SELECT current_date-2) OR
w.pricedate = (SELECT current_date-1) OR
w.pricedate = (SELECT current_date))
And here are the respective EXPLAIN ANALYZE:
Aggregate (cost=842341.01..842341.02 rows=1 width=32) (actual time=77120.088..77120.089 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.007..0.008 rows=1 loops=1)
InitPlan 2 (returns $1)
-> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=1)
-> Nested Loop (cost=840333.25..842340.97 rows=1 width=18) (actual time=14719.661..77119.994 rows=7 loops=1)
-> Foreign Scan on wind_forecast_recent w (cost=242218.45..242218.49 rows=1 width=18) (actual time=3184.714..3184.720 rows=7 loops=1)
-> Foreign Scan on load_forecast_recent m (cost=598114.80..600122.47 rows=1 width=16) (actual time=10531.723..10531.724 rows=1 loops=7)
Planning Time: 744.979 ms
Execution Time: 77227.512 ms
Aggregate (cost=841657.94..841657.95 rows=1 width=32) (actual time=13683.022..13683.023 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=1)
InitPlan 2 (returns $1)
-> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=1)
InitPlan 3 (returns $2)
-> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=1)
InitPlan 4 (returns $3)
-> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=1)
InitPlan 5 (returns $4)
-> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=1)
InitPlan 6 (returns $5)
-> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=1)
InitPlan 7 (returns $6)
-> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=1)
-> Foreign Scan (cost=833725.15..841657.83 rows=1 width=18) (actual time=13682.974..13682.977 rows=7 loops=1)
Relations: (pjm.wind_forecast_recent w) INNER JOIN (pjm.load_forecast_recent m)
Planning Time: 332.870 ms
Functions: 16
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 4.163 ms, Inlining 15.088 ms, Optimization 44.489 ms, Emission 28.064 ms, Total 91.804 ms
Execution Time: 13724.094 ms
I am running PostgreSQL 12.1 on an Ubuntu 18.04 server.
Let me know if you have any further questions. Thanks!
The planner does not decide to use a certain join strategy based on deep reasoning; it simply constructs all possible join strategies, estimates their costs and chooses the cheapest one.
That said, nested loop joins are usually the best choice if the outer table is small, so that the inner loop does not have to be executed often. Also, an index on the join condition of the inner table can greatly reduce the cost of a nested loop join and make it an attractive strategy.
In your case, the bad choice is due to a mis-estimate:
Foreign Scan on wind_forecast_recent w (cost=... rows=1 ...) (actual ... rows=7 ...)
That causes the inner loop to be executed 7 times rather than once, so that the execution time is 70 seconds rather than 10.
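The timings in the first plan bear this out. A quick back-of-the-envelope check, using the actual times reported for the two Foreign Scan nodes:

```sql
-- outer scan time + 7 loops * per-loop time of the inner foreign scan (ms)
SELECT 3184.720 + 7 * 10531.724 AS approx_nested_loop_ms;
-- 76906.788, close to the 77119.994 ms reported for the Nested Loop node
```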
You should collect table statistics on wind_forecast_recent:
ANALYZE wind_forecast_recent;
Remember that autoanalyze does not treat foreign tables; you have to take care of that yourself.
If that doesn't do the trick, you can try setting the use_remote_estimate option on the foreign table and make sure that the table statistics are accurate on the remote database.
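Both steps might look like the following sketch. It assumes the foreign tables are defined with postgres_fdw (the join pushdown in the second plan suggests so, but the question does not say) and that the option has not been set on them yet.

```sql
-- Collect local statistics for both foreign tables (autoanalyze skips them).
ANALYZE pjm.wind_forecast_recent;
ANALYZE pjm.load_forecast_recent;

-- Alternatively, ask the remote server for estimates at plan time.
-- Assumes postgres_fdw; use SET instead of ADD if the option already exists.
ALTER FOREIGN TABLE pjm.wind_forecast_recent
    OPTIONS (ADD use_remote_estimate 'true');
ALTER FOREIGN TABLE pjm.load_forecast_recent
    OPTIONS (ADD use_remote_estimate 'true');
```

With use_remote_estimate enabled, planning involves extra round trips to the remote server, so planning time may go up in exchange for better estimates.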

Postgres using unperformant plan when rerun

I'm importing a non circular graph and flattening the ancestors to an array per code. This works fine (for a bit): ~45s for 400k codes over ~900k edges.
However, after the first successful execution Postgres decides to stop using the Nested Loop and the update query performance drops drastically: ~2s per code.
I can force the issue by putting a vacuum right before the update but I am curious why the unoptimization is happening.
CREATE TABLE tmp_rel (
from_id BIGINT,
to_id BIGINT,
CREATE TABLE tmp_edges (
    start_node BIGINT,
    end_node BIGINT
);
INSERT INTO tmp_edges(start_node, end_node)
SELECT from_id AS start_node, to_id AS end_node
FROM tmp_rel;
CREATE INDEX tmp_edges_end ON tmp_edges (end_node);
CREATE TABLE tmp_codes (
active SMALLINT,
COPY tmp_codes FROM 'codes.txt' WITH DELIMITER E'\t' CSV HEADER;
code BIGINT,
ancestors BIGINT[]
FROM tmp_codes
WHERE active = 1;
CREATE INDEX tmp_anc_codes ON tmp_anc_codes (code);
VACUUM; -- Need this for the update to execute optimally
UPDATE tmp_anc sa SET ancestors = (
    WITH RECURSIVE ancestors(code) AS (
        SELECT start_node FROM tmp_edges WHERE end_node = sa.code
        -- the set operator is missing from the post; UNION is assumed here
        UNION
        SELECT se.start_node
        FROM tmp_edges se, ancestors a
        WHERE se.end_node = a.code
    )
    SELECT array_agg(code) FROM ancestors
);
Table stats:
tmp_rel 507 MB 0 bytes
tmp_edges 74 MB 37 MB
tmp_codes 32 MB 0 bytes
tmp_anc 22 MB 8544 kB
Without VACUUM before UPDATE:
Update on tmp_anc sa (cost=10000000000.00..11081583053.74 rows=10 width=46) (actual time=38294.005..38294.005 rows=0 loops=1)
-> Seq Scan on tmp_anc sa (cost=10000000000.00..11081583053.74 rows=10 width=46) (actual time=3300.974..38292.613 rows=10 loops=1)
SubPlan 2
-> Aggregate (cost=108158305.25..108158305.26 rows=1 width=32) (actual time=3829.253..3829.253 rows=1 loops=10)
CTE ancestors
-> Recursive Union (cost=81.97..66015893.05 rows=1872996098 width=8) (actual time=0.037..3827.917 rows=45 loops=10)
-> Bitmap Heap Scan on tmp_edges (cost=81.97..4913.18 rows=4328 width=8) (actual time=0.022..0.022 rows=2 loops=10)
Recheck Cond: (end_node = sa.code)
Heap Blocks: exact=12
-> Bitmap Index Scan on tmp_edges_end (cost=0.00..80.89 rows=4328 width=0) (actual time=0.014..0.014 rows=2 loops=10)
Index Cond: (end_node = sa.code)
-> Merge Join (cost=4198.89..2855105.79 rows=187299177 width=8) (actual time=163.746..425.295 rows=10 loops=90)
Merge Cond: (a.code = se.end_node)
-> Sort (cost=4198.47..4306.67 rows=43280 width=8) (actual time=0.012..0.016 rows=5 loops=90)
Sort Key: a.code
Sort Method: quicksort Memory: 25kB
-> WorkTable Scan on ancestors a (cost=0.00..865.60 rows=43280 width=8) (actual time=0.000..0.001 rows=5 loops=90)
-> Materialize (cost=0.42..43367.08 rows=865523 width=16) (actual time=0.010..337.592 rows=537171 loops=90)
-> Index Scan using tmp_edges_end on edges se (cost=0.42..41203.27 rows=865523 width=16) (actual time=0.009..247.547 rows=537171 loops=90)
-> CTE Scan on ancestors (cost=0.00..37459921.96 rows=1872996098 width=8) (actual time=1.227..3829.159 rows=45 loops=10)
With VACUUM before UPDATE:
Update on tmp_anc sa (cost=0.00..2949980136.43 rows=387059 width=14) (actual time=74701.329..74701.329 rows=0 loops=1)
-> Seq Scan on tmp_anc sa (cost=0.00..2949980136.43 rows=387059 width=14) (actual time=0.336..70324.848 rows=387059 loops=1)
SubPlan 2
-> Aggregate (cost=7621.50..7621.51 rows=1 width=8) (actual time=0.180..0.180 rows=1 loops=387059)
CTE ancestors
-> Recursive Union (cost=0.42..7583.83 rows=1674 width=8) (actual time=0.005..0.162 rows=32 loops=387059)
-> Index Scan using tmp_edges_end on tmp_edges (cost=0.42..18.93 rows=4 width=8) (actual time=0.004..0.005 rows=2 loops=387059)
Index Cond: (end_node = sa.code)
-> Nested Loop (cost=0.42..753.14 rows=167 width=8) (actual time=0.003..0.019 rows=10 loops=2700448)
-> WorkTable Scan on ancestors a (cost=0.00..0.80 rows=40 width=8) (actual time=0.000..0.001 rows=5 loops=2700448)
-> Index Scan using tmp_edges_end on tmp_edges se (cost=0.42..18.77 rows=4 width=16) (actual time=0.003..0.003 rows=2 loops=12559395)
Index Cond: (end_node = a.code)
-> CTE Scan on ancestors (cost=0.00..33.48 rows=1674 width=8) (actual time=0.007..0.173 rows=32 loops=387059)
The first execution plan has really bad estimates (Bitmap Index Scan on tmp_edges_end estimates 4328 instead of 2 rows), while the second execution has good estimates and thus chooses a good plan.
So something between the two executions you quote above must have changed the estimates.
Moreover, you say that the first execution of the UPDATE (for which we have no EXPLAIN (ANALYZE) output) was fast.
The only good explanation for the initial performance drop is that it takes the autovacuum daemon some time to collect statistics for the new tables. This normally improves query performance, but of course it can also work the other way around.
Also, a VACUUM usually doesn't fix performance issues. Could it be that you used VACUUM (ANALYZE)?
It would be interesting to know how things are when you collect statistics before your initial UPDATE:
ANALYZE tmp_edges;
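A slightly fuller sketch of that experiment, which also shows whether and when statistics were gathered. The catalog views are standard; analyzing tmp_anc as well is my addition.

```sql
-- Gather statistics explicitly before the first UPDATE.
ANALYZE tmp_edges;
ANALYZE tmp_anc;

-- When were the tables last analyzed, manually or by autovacuum?
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname IN ('tmp_edges', 'tmp_anc');

-- What does the planner believe about the columns of tmp_edges?
SELECT attname, n_distinct
FROM pg_stats
WHERE tablename = 'tmp_edges';
```

If pg_stats returns no rows for tmp_edges, the planner is working without statistics for that table, which would fit the wildly different estimates in the two plans.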
When I read your query more closely, however, I wonder why you use a correlated subquery for that. Maybe it would be faster to do something like:
UPDATE tmp_anc sa
SET ancestors = a.codes
FROM (WITH RECURSIVE ancestors(code, start_node) AS
        (SELECT tmp_anc.code, tmp_edges.start_node
         FROM tmp_edges
            JOIN tmp_anc ON tmp_edges.end_node = tmp_anc.code
         -- the set operator is missing from the post; UNION is assumed here
         UNION
         SELECT a.code, se.start_node
         FROM tmp_edges se
            -- continue from the ancestor found in the previous step
            JOIN ancestors a ON se.end_node = a.start_node
        )
      SELECT code,
             array_agg(start_node) AS codes
      FROM ancestors
      GROUP BY (code)
     ) a
WHERE sa.code = a.code;
(This is untested, so there may be mistakes.)

Postgres Slow group by query with max

I am using postgres 9.1 and I have a table with about 3.5M rows of eventtype (varchar) and eventtime (timestamp) - and some other fields. There are only about 20 different eventtype's and the event time spans about 4 years.
I want to get the last timestamp of each event type. If I run a query like:
select eventtype, max(eventtime)
from allevents
group by eventtype
it takes around 20 seconds. Selecting distinct eventtype's is equally slow. The query plan shows a full sequential scan of the table - not surprising it is slow.
Explain analyse for the above query gives:
HashAggregate (cost=84591.47..84591.68 rows=21 width=21) (actual time=20918.131..20918.141 rows=21 loops=1)
-> Seq Scan on allevents (cost=0.00..66117.98 rows=3694698 width=21) (actual time=0.021..4831.793 rows=3694392 loops=1)
Total runtime: 20918.204 ms
If I add a where clause to select a specific eventtype, it takes anywhere from 40ms to 150ms which is at least decent.
Query plan when selecting specific eventtype:
GroupAggregate (cost=343.87..24942.71 rows=1 width=21) (actual time=98.397..98.397 rows=1 loops=1)
-> Bitmap Heap Scan on allevents (cost=343.87..24871.07 rows=14325 width=21) (actual time=6.820..89.610 rows=19736 loops=1)
Recheck Cond: ((eventtype)::text = 'TEST_EVENT'::text)
-> Bitmap Index Scan on allevents_idx2 (cost=0.00..340.28 rows=14325 width=0) (actual time=6.121..6.121 rows=19736 loops=1)
Index Cond: ((eventtype)::text = 'TEST_EVENT'::text)
Total runtime: 98.482 ms
Primary key is (eventtype, eventtime). I also have the following indexes:
allevents_idx (event time desc, eventtype)
allevents_idx2 (eventtype).
How can I speed up the query?
The query plan for the correlated subquery suggested by @denis below, with 14 manually entered values, is:
Function Scan on unnest val (cost=0.00..185.40 rows=100 width=32) (actual time=0.121..8983.134 rows=14 loops=1)
SubPlan 2
-> Result (cost=1.83..1.84 rows=1 width=0) (actual time=641.644..641.645 rows=1 loops=14)
InitPlan 1 (returns $1)
-> Limit (cost=0.00..1.83 rows=1 width=8) (actual time=641.640..641.641 rows=1 loops=14)
-> Index Scan using allevents_idx on allevents (cost=0.00..322672.36 rows=175938 width=8) (actual time=641.638..641.638 rows=1 loops=14)
Index Cond: ((eventtime IS NOT NULL) AND ((eventtype)::text = val.val))
Total runtime: 8983.203 ms
Using the recursive query suggested by @jjanes, the query runs between 4 and 5 seconds with the following plan:
CTE Scan on t (cost=260.32..448.63 rows=101 width=32) (actual time=0.146..4325.598 rows=22 loops=1)
-> Recursive Union (cost=2.52..260.32 rows=101 width=32) (actual time=0.075..1.449 rows=22 loops=1)
-> Result (cost=2.52..2.53 rows=1 width=0) (actual time=0.074..0.074 rows=1 loops=1)
InitPlan 1 (returns $1)
-> Limit (cost=0.00..2.52 rows=1 width=13) (actual time=0.070..0.071 rows=1 loops=1)
-> Index Scan using allevents_idx2 on allevents (cost=0.00..9315751.37 rows=3696851 width=13) (actual time=0.070..0.070 rows=1 loops=1)
Index Cond: ((eventtype)::text IS NOT NULL)
-> WorkTable Scan on t (cost=0.00..25.58 rows=10 width=32) (actual time=0.059..0.060 rows=1 loops=22)
Filter: (eventtype IS NOT NULL)
SubPlan 3
-> Result (cost=2.53..2.54 rows=1 width=0) (actual time=0.059..0.059 rows=1 loops=21)
InitPlan 2 (returns $3)
-> Limit (cost=0.00..2.53 rows=1 width=13) (actual time=0.057..0.057 rows=1 loops=21)
-> Index Scan using allevents_idx2 on allevents (cost=0.00..3114852.66 rows=1232284 width=13) (actual time=0.055..0.055 rows=1 loops=21)
Index Cond: (((eventtype)::text IS NOT NULL) AND ((eventtype)::text > t.eventtype))
SubPlan 6
-> Result (cost=1.83..1.84 rows=1 width=0) (actual time=196.549..196.549 rows=1 loops=22)
InitPlan 5 (returns $6)
-> Limit (cost=0.00..1.83 rows=1 width=8) (actual time=196.546..196.546 rows=1 loops=22)
-> Index Scan using allevents_idx on allevents (cost=0.00..322946.21 rows=176041 width=8) (actual time=196.544..196.544 rows=1 loops=22)
Index Cond: ((eventtime IS NOT NULL) AND ((eventtype)::text = t.eventtype))
Total runtime: 4325.694 ms
What you need is a "skip scan" or "loose index scan". PostgreSQL's planner does not yet implement those automatically, but you can trick it into using one by using a recursive query.
WITH RECURSIVE t AS (
    SELECT min(eventtype) AS eventtype FROM allevents
    UNION ALL
    SELECT (SELECT min(eventtype) as eventtype FROM allevents WHERE eventtype > t.eventtype)
    FROM t where t.eventtype is not null
)
select eventtype, (select max(eventtime) from allevents where eventtype=t.eventtype) from t;
There may be a way to collapse the max(eventtime) into the recursive query rather than doing it outside that query, but if so I have not hit upon it.
This needs an index on (eventtype, eventtime) in order to be efficient. You can have it be DESC on the eventtime, but that is not necessary. This is efficient only if eventtype has only a few distinct values (21 of them, in your case).
Based on the question you already have the relevant index.
If upgrading to Postgres 9.3 or an index on (eventtype, eventtime desc) doesn't make a difference, this is a case where rewriting the query so it uses a correlated subquery works very well if you can enumerate all of the event types manually:
select val as eventtype,
(select max(eventtime)
from allevents
where allevents.eventtype = val
) as eventtime
from unnest('{type1,type2,…}'::text[]) as val;
Here's the plans I get when running similar queries:
denis=# select version();
PostgreSQL 9.3.1 on x86_64-apple-darwin11.4.2, compiled by Apple LLVM version 4.2 (clang-425.0.28) (based on LLVM 3.2svn), 64-bit
(1 row)
Test data:
denis=# create table test (evttype int, evttime timestamp, primary key (evttype, evttime));
denis=# insert into test (evttype, evttime) select i, now() + (i % 3) * interval '1 min' - j * interval '1 sec' from generate_series(1,10) i, generate_series(1,10000) j;
INSERT 0 100000
denis=# create index on test (evttime, evttype);
denis=# vacuum analyze test;
First query:
denis=# explain analyze select evttype, max(evttime) from test group by evttype;
HashAggregate (cost=2041.00..2041.10 rows=10 width=12) (actual time=54.983..54.987 rows=10 loops=1)
-> Seq Scan on test (cost=0.00..1541.00 rows=100000 width=12) (actual time=0.009..15.954 rows=100000 loops=1)
Total runtime: 55.045 ms
(3 rows)
Second query:
denis=# explain analyze select val as evttype, (select max(evttime) from test where test.evttype = val) as evttime from unnest('{1,2,3,4,5,6,7,8,9,10}'::int[]) val;
Function Scan on unnest val (cost=0.00..48.39 rows=100 width=4) (actual time=0.086..0.292 rows=10 loops=1)
SubPlan 2
-> Result (cost=0.46..0.47 rows=1 width=0) (actual time=0.024..0.024 rows=1 loops=10)
InitPlan 1 (returns $1)
-> Limit (cost=0.42..0.46 rows=1 width=8) (actual time=0.021..0.021 rows=1 loops=10)
-> Index Only Scan Backward using test_pkey on test (cost=0.42..464.42 rows=10000 width=8) (actual time=0.019..0.019 rows=1 loops=10)
Index Cond: ((evttype = val.val) AND (evttime IS NOT NULL))
Heap Fetches: 0
Total runtime: 0.370 ms
(9 rows)
An index on (eventtype, eventtime desc) should help, or reindexing the primary key index. I would also recommend changing the type of eventtype to an enum (if the number of types is fixed) or to int/smallint. This will decrease the size of the data and indexes, so queries will run faster.
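The first two suggestions could be tried as follows. This is a sketch: the new index name is made up, and allevents_pkey is the name PostgreSQL would assign to the primary key index by default, assumed here because the question does not show it.

```sql
-- Index name is hypothetical.
CREATE INDEX allevents_type_time_desc_idx
    ON allevents (eventtype, eventtime DESC);

-- Rebuild the primary key index; the name is the assumed default.
REINDEX INDEX allevents_pkey;
```

After either change, re-running the query under EXPLAIN ANALYZE shows whether the planner picks the new index and whether the per-type lookups get faster.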