Select records with IDs contained in another table - postgresql

vit=# select count(*) from evtags;
count
---------
4496914
vit=# explain select tag from evtags where evid in (1002, 1023);
QUERY PLAN
---------------------------------------------------------------------------------
Index Only Scan using evtags_pkey on evtags (cost=0.00..15.64 rows=12 width=7)
Index Cond: (evid = ANY ('{1002,1023}'::integer[]))
This seems completely ok so far. Next, I want to use IDs from another table instead of specifying them in the query.
vit=# select count(*) from zzz;
count
-------
49738
Here we go...
vit=# explain select tag from evtags where evid in (select evid from zzz);
QUERY PLAN
-----------------------------------------------------------------------
Hash Semi Join (cost=1535.11..142452.47 rows=291712 width=7)
Hash Cond: (evtags.evid = zzz.evid)
-> Seq Scan on evtags (cost=0.00..69283.14 rows=4496914 width=11)
-> Hash (cost=718.38..718.38 rows=49738 width=4)
-> Seq Scan on zzz (cost=0.00..718.38 rows=49738 width=4)
Why is there a sequential scan on the much larger table, and what's the correct way to do this?
EDIT
I recreated my zzz table and now it is better for some reason:
vit=# explain analyze select tag from evtags where evid in (select evid from zzz);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=708.00..2699.17 rows=2248457 width=7) (actual time=28.935..805.923 rows=244353 loops=1)
-> HashAggregate (cost=708.00..710.00 rows=200 width=4) (actual time=28.893..54.461 rows=38822 loops=1)
-> Seq Scan on zzz (cost=0.00..601.80 rows=42480 width=4) (actual time=0.032..10.985 rows=40000 loops=1)
-> Index Only Scan using evtags_pkey on evtags (cost=0.00..9.89 rows=6 width=11) (actual time=0.015..0.017 rows=6 loops=38822)
Index Cond: (evid = zzz.evid)
Heap Fetches: 0
Total runtime: 825.651 ms
But after several executions it changes to
vit=# explain analyze select tag from evtags where evid in (select evid from zzz);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Merge Semi Join (cost=4184.11..127258.48 rows=235512 width=7) (actual time=38.269..1461.755 rows=244353 loops=1)
Merge Cond: (evtags.evid = zzz.evid)
-> Index Only Scan using evtags_pkey on evtags (cost=0.00..136736.89 rows=4496914 width=11) (actual time=0.038..899.647 rows=3630070 loops=1)
Heap Fetches: 0
-> Materialize (cost=4184.04..4384.04 rows=40000 width=4) (actual time=38.212..61.038 rows=40000 loops=1)
-> Sort (cost=4184.04..4284.04 rows=40000 width=4) (actual time=38.208..51.104 rows=40000 loops=1)
Sort Key: zzz.evid
Sort Method: external sort Disk: 552kB
-> Seq Scan on zzz (cost=0.00..577.00 rows=40000 width=4) (actual time=0.018..8.833 rows=40000 loops=1)
Total runtime: 1484.293 ms
...which is actually slower. Is there any way to hint at the 'correct' execution plan?
The point of these operations is that I want to perform a number of queries on a subset of my data, and I wanted to use a separate temporary table to hold the IDs of the records I want to process.

An inner join has a better chance of a good plan (note that, unlike IN, it returns a row more than once if zzz.evid contains duplicates):
select e.tag
from
evtags e
inner join
zzz z using (evid)
Or this:
select e.tag
from evtags e
where exists (
select 1
from zzz
where evid = e.evid
)
As pointed out in the comments, run analyze evtags; analyze zzz; so that the planner has current statistics for both tables.
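Since the goal is to run several queries against a temporary table of IDs, a minimal sketch of that workflow might look like the following. The table name tmp_ids and the way it is filled are hypothetical. The relevant part is the explicit ANALYZE, because autovacuum does not process temporary tables and therefore never collects statistics for them.

```sql
-- Hypothetical temporary table holding the IDs to process.
CREATE TEMPORARY TABLE tmp_ids AS
SELECT DISTINCT evid
FROM zzz;          -- stand-in for whatever selects the subset

-- Temporary tables are not visible to autovacuum, so gather statistics by hand.
ANALYZE tmp_ids;

-- The planner now has row-count and distinct-value estimates for tmp_ids.
EXPLAIN
SELECT e.tag
FROM evtags e
WHERE EXISTS (SELECT 1 FROM tmp_ids t WHERE t.evid = e.evid);
```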

Related

Improve PostgreSQL query

I have this query that is highly inefficient. If I remove all the count columns, it takes 10 seconds to run (the tables are quite large, around 750 MB each). If I add one count column, it takes 36 seconds, and if I leave them all in, it doesn't finish at all.
I tried sum(case when r.value is not null then 1 else 0 end) in place of count(DISTINCT r.*) FILTER (WHERE r.value IS NOT NULL) AS responses, but it returns incorrect counts.
SELECT c.id,
c.title,
count(DISTINCT cc.*) AS contacts,
count(DISTINCT m.user_id) AS texters,
count(DISTINCT cc.*) FILTER (WHERE cc.assignment_id IS NULL) AS needs_assignment,
count(DISTINCT m.*) FILTER (WHERE m.is_from_contact = false) AS sent_messages,
count(DISTINCT m.*) FILTER (WHERE m.is_from_contact = true) AS received_messages,
count(DISTINCT cc.*) FILTER (WHERE m.is_from_contact = false) AS contacted,
count(DISTINCT cc.*) FILTER (WHERE m.is_from_contact = true) AS received_reply,
count(DISTINCT cc.*) FILTER (WHERE cc.message_status = 'needsResponse'::text AND NOT cc.is_opted_out) AS needs_response,
count(DISTINCT cc.*) FILTER (WHERE cc.is_opted_out = true) AS opt_outs,
count(DISTINCT m.*) FILTER (WHERE m.error_code IS NOT NULL AND m.error_code <> 0) AS errors,
c.is_started,
c.is_archived,
c.use_dynamic_assignment,
c.due_by,
c.created_at,
creator.email AS creator_email,
concat(c.join_token, '/join/', c.id) AS join_path,
count(DISTINCT r.*) FILTER (WHERE r.value IS NOT NULL) AS responses,
c.batch_size,
c.texting_hours_start,
c.texting_hours_end,
c.timezone,
c.organization_id
FROM campaign c
JOIN "user" creator ON c.creator_id = creator.id
LEFT JOIN campaign_contact cc ON c.id = cc.campaign_id
LEFT JOIN message m ON m.campaign_contact_id = cc.id
LEFT JOIN question_response r ON cc.id = r.campaign_contact_id
GROUP BY c.id, creator.email
Any direction is appreciated, thank you!
Create some test data...
create unlogged table users( user_id serial primary key, login text unique not null );
insert into users (login) select 'user'||n from generate_series(1,100000) n;
create unlogged table messages( message_id serial primary key, sender_id integer not null, receiver_id integer not null);
insert into messages (sender_id,receiver_id) select random()*100000+1, random()*100000+1 from generate_series(1,1000000);
create index messages_s on messages(sender_id);
create index messages_r on messages(receiver_id);
vacuum analyze users,messages;
And then:
EXPLAIN ANALYZE
SELECT user_id, count(DISTINCT m1.message_id), count(DISTINCT m2.message_id)
FROM users u
LEFT JOIN messages m1 ON m1.receiver_id = user_id
LEFT JOIN messages m2 ON m2.sender_id = user_id
GROUP BY user_id;
GroupAggregate (cost=4.39..326190.22 rows=100000 width=20) (actual time=4.023..3331.031 rows=100000 loops=1)
Group Key: u.user_id
-> Merge Left Join (cost=4.39..250190.22 rows=10000000 width=12) (actual time=3.987..2161.032 rows=9998915 loops=1)
Merge Cond: (u.user_id = m1.receiver_id)
-> Merge Left Join (cost=2.11..56522.26 rows=1000000 width=8) (actual time=3.978..515.730 rows=1000004 loops=1)
Merge Cond: (u.user_id = m2.sender_id)
-> Index Only Scan using users_pkey on users u (cost=0.29..2604.29 rows=100000 width=4) (actual time=0.016..10.149 rows=100000 loops=1)
Heap Fetches: 0
-> Index Scan using messages_s on messages m2 (cost=0.42..41168.40 rows=1000000 width=8) (actual time=0.011..397.128 rows=999996 loops=1)
-> Materialize (cost=0.42..43668.42 rows=1000000 width=8) (actual time=0.008..746.748 rows=9998810 loops=1)
-> Index Scan using messages_r on messages m1 (cost=0.42..41168.42 rows=1000000 width=8) (actual time=0.006..392.426 rows=999997 loops=1)
Execution Time: 3432.131 ms
Since I put in 100k users and 1M messages, each user has about 10 messages as sender and 10 as receiver, which means the joins generate about 10*10=100 rows per user (roughly 10 million rows in total, as the plan shows), all of which then have to be processed by the count(DISTINCT ...) aggregates. Postgres doesn't realize that this is unnecessary and that the counts and GROUP BYs could be moved inside the joined tables, so the query is extremely slow.
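To see the multiplication directly, a quick check on the test tables above can compare the number of joined rows with the two distinct counts for a single user. The user_id value 42 is an arbitrary choice for illustration.

```sql
-- For one user, compare the raw join size with the distinct message counts.
SELECT u.user_id,
       count(*)                      AS joined_rows,
       count(DISTINCT m1.message_id) AS received,
       count(DISTINCT m2.message_id) AS sent
FROM users u
LEFT JOIN messages m1 ON m1.receiver_id = u.user_id
LEFT JOIN messages m2 ON m2.sender_id = u.user_id
WHERE u.user_id = 42
GROUP BY u.user_id;
```

When the user has both received and sent messages, joined_rows equals received * sent, because every received message is paired with every sent one.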
The solution is to move the aggregation inside the joined tables manually, to avoid generating all these unnecessary extra rows.
EXPLAIN ANALYZE
SELECT user_id, m1.cnt, m2.cnt
FROM users u
LEFT JOIN (SELECT receiver_id, count(*) cnt FROM messages GROUP BY receiver_id) m1 ON m1.receiver_id = user_id
LEFT JOIN (SELECT sender_id, count(*) cnt FROM messages GROUP BY sender_id) m2 ON m2.sender_id = user_id;
Hash Left Join (cost=46780.40..48846.42 rows=100000 width=20) (actual time=469.699..511.613 rows=100000 loops=1)
Hash Cond: (u.user_id = m2.sender_id)
-> Hash Left Join (cost=23391.68..25195.19 rows=100000 width=12) (actual time=237.435..262.545 rows=100000 loops=1)
Hash Cond: (u.user_id = m1.receiver_id)
-> Seq Scan on users u (cost=0.00..1541.00 rows=100000 width=4) (actual time=0.015..5.162 rows=100000 loops=1)
-> Hash (cost=22243.34..22243.34 rows=91867 width=12) (actual time=237.252..237.253 rows=99991 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 5321kB
-> Subquery Scan on m1 (cost=20406.00..22243.34 rows=91867 width=12) (actual time=210.817..227.793 rows=99991 loops=1)
-> HashAggregate (cost=20406.00..21324.67 rows=91867 width=12) (actual time=210.815..222.794 rows=99991 loops=1)
Group Key: messages.receiver_id
Batches: 1 Memory Usage: 14353kB
-> Seq Scan on messages (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.010..47.173 rows=1000000 loops=1)
-> Hash (cost=22241.52..22241.52 rows=91776 width=12) (actual time=232.003..232.004 rows=99992 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 5321kB
-> Subquery Scan on m2 (cost=20406.00..22241.52 rows=91776 width=12) (actual time=205.401..222.517 rows=99992 loops=1)
-> HashAggregate (cost=20406.00..21323.76 rows=91776 width=12) (actual time=205.400..217.518 rows=99992 loops=1)
Group Key: messages_1.sender_id
Batches: 1 Memory Usage: 14353kB
-> Seq Scan on messages messages_1 (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.008..43.402 rows=1000000 loops=1)
Planning Time: 0.574 ms
Execution Time: 515.753 ms
I used a schema that is a bit different from yours, but you get the idea: instead of generating lots of duplicate rows by doing what is essentially a cross product, push aggregations into the joined tables so they return only one row per value of whatever column you're joining on, then remove the GROUP BY from the main query since it is no longer necessary.
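Applied to the tables from the question, the same idea might look roughly like the sketch below. It covers only a few of the counts and rests on assumptions that the question does not confirm: that campaign_contact.id is unique, so count(*) can stand in for count(DISTINCT cc.*), and that each message belongs to exactly one campaign contact. The aliases cc_agg and m_agg are made up for illustration.

```sql
SELECT c.id,
       c.title,
       coalesce(cc_agg.contacts, 0)          AS contacts,
       coalesce(cc_agg.needs_assignment, 0)  AS needs_assignment,
       coalesce(m_agg.sent_messages, 0)      AS sent_messages,
       coalesce(m_agg.received_messages, 0)  AS received_messages
FROM campaign c
-- One row per campaign: contact counts, aggregated before joining.
LEFT JOIN (
    SELECT campaign_id,
           count(*)                                      AS contacts,
           count(*) FILTER (WHERE assignment_id IS NULL) AS needs_assignment
    FROM campaign_contact
    GROUP BY campaign_id
) cc_agg ON cc_agg.campaign_id = c.id
-- One row per campaign: message counts, aggregated before joining.
LEFT JOIN (
    SELECT cc.campaign_id,
           count(*) FILTER (WHERE NOT m.is_from_contact) AS sent_messages,
           count(*) FILTER (WHERE m.is_from_contact)     AS received_messages
    FROM message m
    JOIN campaign_contact cc ON cc.id = m.campaign_contact_id
    GROUP BY cc.campaign_id
) m_agg ON m_agg.campaign_id = c.id;
```

The coalesce calls turn the NULLs produced by the left joins into 0 for campaigns without contacts or messages.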
Note that count(DISTINCT table.*) is not smart enough to understand that it can do this by looking only at the primary key of the table if there is one, so it will pull the whole row to run the distinct on it. When a table is named "message" or "question_response" it smells like it has a largish TEXT column in it, which will make this very slow. So in case you really need a count(distinct ...) you should use count(DISTINCT table.primarykey):
explain analyze SELECT count(distinct user_id) from users;
Aggregate (cost=1791.00..1791.01 rows=1 width=8) (actual time=15.220..15.221 rows=1 loops=1)
-> Seq Scan on users (cost=0.00..1541.00 rows=100000 width=4) (actual time=0.016..5.830 rows=100000 loops=1)
Execution Time: 15.263 ms
explain analyze SELECT count(distinct users.*) from users;
Aggregate (cost=1791.00..1791.01 rows=1 width=8) (actual time=90.896..90.896 rows=1 loops=1)
-> Seq Scan on users (cost=0.00..1541.00 rows=100000 width=37) (actual time=0.038..38.497 rows=100000 loops=1)
Execution Time: 90.958 ms
The problem is the DISTINCT in the aggregate functions. PostgreSQL is not very smart about processing these.
Not knowing your data model, I cannot tell if the DISTINCT is really needed. Omit it if you can.

Can PostgreSQL 12 do partition pruning at execution time with subquery returning a list?

I'm trying to take advantage of partitioning in one case:
I have a table "events" which is partitioned by list on the field "dt_pk", which is a foreign key to the table "dates".
-- Schema
drop schema if exists test cascade;
create schema test;
-- Tables
create table if not exists test.dates (
id bigint primary key,
dt date not null
);
create sequence test.seq_events_id;
create table if not exists test.events
(
id bigint not null,
dt_pk bigint not null,
content_int bigint,
foreign key (dt_pk) references test.dates(id) on delete cascade,
primary key (dt_pk, id)
)
partition by list (dt_pk);
-- Partitions
create table test.events_1 partition of test.events for values in (1);
create table test.events_2 partition of test.events for values in (2);
create table test.events_3 partition of test.events for values in (3);
-- Fill tables
insert into test.dates (id, dt)
select id, dt
from (
select 1 id, '2020-01-01'::date as dt
union all
select 2 id, '2020-01-02'::date as dt
union all
select 3 id, '2020-01-03'::date as dt
) t;
do $$
declare
dts record;
begin
for dts in (
select id
from test.dates
) loop
for k in 1..10000 loop
insert into test.events (id, dt_pk, content_int)
values (nextval('test.seq_events_id'), dts.id, floor(random() * 1000000 + 1)::bigint);
end loop;
commit;
end loop;
end;
$$;
vacuum analyze test.dates, test.events;
I want to run a select like this:
select *
from test.events e
join test.dates d on e.dt_pk = d.id
where d.dt between '2020-01-02'::date and '2020-01-03'::date;
But in this case partition pruning doesn't work. That is understandable, since I don't have a constant for the partition key. But from the documentation I know that there is partition pruning at execution time, which works with a value obtained from a subquery:
Partition pruning can be performed not only during the planning of a
given query, but also during its execution. This is useful as it can
allow more partitions to be pruned when clauses contain expressions
whose values are not known at query planning time, for example,
parameters defined in a PREPARE statement, using a value obtained from
a subquery, or using a parameterized value on the inner side of a
nested loop join.
So I rewrote my query like this, expecting partition pruning:
select *
from test.events e
where e.dt_pk in (
select d.id
from test.dates d
where d.dt between '2020-01-02'::date and '2020-01-03'::date
);
But explain for this select says:
Hash Join (cost=1.07..833.07 rows=20000 width=24) (actual time=3.581..15.989 rows=20000 loops=1)
Hash Cond: (e.dt_pk = d.id)
-> Append (cost=0.00..642.00 rows=30000 width=24) (actual time=0.005..6.361 rows=30000 loops=1)
-> Seq Scan on events_1 e (cost=0.00..164.00 rows=10000 width=24) (actual time=0.005..1.104 rows=10000 loops=1)
-> Seq Scan on events_2 e_1 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.005..1.127 rows=10000 loops=1)
-> Seq Scan on events_3 e_2 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.008..1.097 rows=10000 loops=1)
-> Hash (cost=1.04..1.04 rows=2 width=8) (actual time=0.006..0.006 rows=2 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on dates d (cost=0.00..1.04 rows=2 width=8) (actual time=0.004..0.004 rows=2 loops=1)
Filter: ((dt >= '2020-01-02'::date) AND (dt <= '2020-01-03'::date))
Rows Removed by Filter: 1
Planning Time: 0.206 ms
Execution Time: 17.237 ms
So all partitions are read. I even tried to force the planner to use a nested loop join, because I read in the documentation "parameterized value on the inner side of a nested loop join", but it didn't work:
set enable_hashjoin to off;
set enable_mergejoin to off;
And again:
Nested Loop (cost=0.00..1443.05 rows=20000 width=24) (actual time=9.160..25.252 rows=20000 loops=1)
Join Filter: (e.dt_pk = d.id)
Rows Removed by Join Filter: 30000
-> Append (cost=0.00..642.00 rows=30000 width=24) (actual time=0.008..6.280 rows=30000 loops=1)
-> Seq Scan on events_1 e (cost=0.00..164.00 rows=10000 width=24) (actual time=0.008..1.105 rows=10000 loops=1)
-> Seq Scan on events_2 e_1 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.008..1.047 rows=10000 loops=1)
-> Seq Scan on events_3 e_2 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.007..1.082 rows=10000 loops=1)
-> Materialize (cost=0.00..1.05 rows=2 width=8) (actual time=0.000..0.000 rows=2 loops=30000)
-> Seq Scan on dates d (cost=0.00..1.04 rows=2 width=8) (actual time=0.004..0.004 rows=2 loops=1)
Filter: ((dt >= '2020-01-02'::date) AND (dt <= '2020-01-03'::date))
Rows Removed by Filter: 1
Planning Time: 0.202 ms
Execution Time: 26.516 ms
Then I noticed that every example of "partition pruning at execution time" I have seen uses only an = condition, not IN.
And it really works that way:
explain (analyze) select * from test.events e where e.dt_pk = (select id from test.dates where id = 2);
Append (cost=1.04..718.04 rows=30000 width=24) (actual time=0.014..3.018 rows=10000 loops=1)
InitPlan 1 (returns $0)
-> Seq Scan on dates (cost=0.00..1.04 rows=1 width=8) (actual time=0.007..0.008 rows=1 loops=1)
Filter: (id = 2)
Rows Removed by Filter: 2
-> Seq Scan on events_1 e (cost=0.00..189.00 rows=10000 width=24) (never executed)
Filter: (dt_pk = $0)
-> Seq Scan on events_2 e_1 (cost=0.00..189.00 rows=10000 width=24) (actual time=0.004..2.009 rows=10000 loops=1)
Filter: (dt_pk = $0)
-> Seq Scan on events_3 e_2 (cost=0.00..189.00 rows=10000 width=24) (never executed)
Filter: (dt_pk = $0)
Planning Time: 0.135 ms
Execution Time: 3.639 ms
And here is my final question: does partition pruning at execution time work only with a subquery returning one item, or is there a way to get the advantages of partition pruning with a subquery returning a list?
And why doesn't it work with the nested loop join? Did I misunderstand these words:
This includes values from subqueries and values from execution-time
parameters such as those from parameterized nested loop joins.
Or is a "parameterized nested loop join" something different from a regular nested loop join?
There is no partition pruning in your nested loop join because the partitioned table is on the outer side, which is always scanned completely. The inner side is scanned with the join key from the outer side as parameter (hence parameterized scan), so if the partitioned table were on the inner side of the nested loop join, partition pruning could happen.
Partition pruning with IN lists can take place if the list values are known at plan time:
EXPLAIN (COSTS OFF)
SELECT * FROM test.events WHERE dt_pk IN (1, 2);
QUERY PLAN
---------------------------------------------------
Append
-> Seq Scan on events_1
Filter: (dt_pk = ANY ('{1,2}'::bigint[]))
-> Seq Scan on events_2
Filter: (dt_pk = ANY ('{1,2}'::bigint[]))
(5 rows)
But no attempts are made to flatten a subquery, and PostgreSQL doesn't use partition pruning, even if you force the partitioned table to be on the inner side (enable_material = off, enable_hashjoin = off, enable_mergejoin = off):
EXPLAIN (ANALYZE)
SELECT * FROM test.events WHERE dt_pk IN (SELECT 1 UNION SELECT 2);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.06..2034.09 rows=20000 width=24) (actual time=0.057..15.523 rows=20000 loops=1)
Join Filter: (events_1.dt_pk = (1))
Rows Removed by Join Filter: 40000
-> Unique (cost=0.06..0.07 rows=2 width=4) (actual time=0.026..0.029 rows=2 loops=1)
-> Sort (cost=0.06..0.07 rows=2 width=4) (actual time=0.024..0.025 rows=2 loops=1)
Sort Key: (1)
Sort Method: quicksort Memory: 25kB
-> Append (cost=0.00..0.05 rows=2 width=4) (actual time=0.006..0.009 rows=2 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=1)
-> Append (cost=0.00..642.00 rows=30000 width=24) (actual time=0.012..4.334 rows=30000 loops=2)
-> Seq Scan on events_1 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.011..1.057 rows=10000 loops=2)
-> Seq Scan on events_2 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.004..0.641 rows=10000 loops=2)
-> Seq Scan on events_3 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.002..0.594 rows=10000 loops=2)
Planning Time: 0.531 ms
Execution Time: 16.567 ms
(16 rows)
I am not certain, but it may be because the tables are so small. You might want to try with bigger tables.
If you care more about getting it working than about the fine details, and you haven't tried this yet, you can rewrite the query to something like
explain analyze select *
from test.dates d
join test.events e on e.dt_pk = d.id
where
d.dt between '2020-01-02'::date and '2020-01-03'::date
and e.dt_pk in (extract(day from '2020-01-02'::date)::int,
extract(day from '2020-01-03'::date)::int);
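which will give the expected pruning. (This relies on the ids in test.dates matching the day of the month, as they do in this test data.)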
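Since pruning does happen when the list values are constants at plan time, another option is to fetch the ids first and then run the main query with those ids embedded as a literal. The following is a sketch only: the function name test.events_between is made up for illustration, and it assumes that building the statement dynamically is acceptable in your application.

```sql
CREATE OR REPLACE FUNCTION test.events_between(p_from date, p_to date)
RETURNS SETOF test.events
LANGUAGE plpgsql
AS $$
DECLARE
    ids bigint[];
BEGIN
    -- Step 1: resolve the date range to partition key values.
    SELECT array_agg(d.id) INTO ids
    FROM test.dates d
    WHERE d.dt BETWEEN p_from AND p_to;

    -- Step 2: embed the keys as an array literal, so that the planner
    -- sees constants and can prune partitions at plan time.
    RETURN QUERY EXECUTE format(
        'SELECT * FROM test.events WHERE dt_pk = ANY (%L::bigint[])', ids);
END;
$$;

-- Usage: should touch only the partitions for the matching dates.
SELECT * FROM test.events_between('2020-01-02', '2020-01-03');
```

If no dates match, ids is NULL and the generated query returns no rows.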

Postgres optimize/replace DISTINCT

Trying to select users with most "followed_by" joining to filter by "tag". Both tables have millions of records. Using distinct to only select unique users.
select distinct u.*
from users u join posts p
on u.id=p.user_id
where p.tags @> ARRAY['love']
order by u.followed_by desc nulls last limit 21
It takes over 16 s, apparently because the 'distinct' causes a Seq Scan over 6+ million users. Here is the explain analyze output:
Limit (cost=15509958.30..15509959.09 rows=21 width=292) (actual time=16882.861..16883.753 rows=21 loops=1)
-> Unique (cost=15509958.30..15595560.30 rows=2282720 width=292) (actual time=16882.859..16883.749 rows=21 loops=1)
-> Sort (cost=15509958.30..15515665.10 rows=2282720 width=292) (actual time=16882.857..16883.424 rows=525 loops=1)
Sort Key: u.followed_by DESC NULLS LAST, u.id, u.username, u.fullname, u.follows, u.media, u.profile_pic_url_hd, u.is_private, u.is_verified, u.biography, u.external_url, u.updated, u.location_id, u.final_post
Sort Method: external merge Disk: 583064kB
-> Gather (cost=1000.57..14956785.06 rows=2282720 width=292) (actual time=0.377..11506.001 rows=1680890 loops=1)
Workers Planned: 9
Workers Launched: 9
-> Nested Loop (cost=0.57..14727513.06 rows=253636 width=292) (actual time=1.013..12031.634 rows=168089 loops=10)
-> Parallel Seq Scan on posts p (cost=0.00..13187797.79 rows=253636 width=8) (actual time=0.940..10872.630 rows=168089 loops=10)
Filter: (tags @> '{love}'::text[])
Rows Removed by Filter: 6251355
-> Index Scan using user_pk on users u (cost=0.57..6.06 rows=1 width=292) (actual time=0.006..0.006 rows=1 loops=1680890)
Index Cond: (id = p.user_id)
Planning time: 1.276 ms
Execution time: 16964.271 ms
Would appreciate tips on how to make this fast.
Update
Thanks to @a_horse_with_no_name, for "love" tags it became really fast:
Limit (cost=1.14..4293986.91 rows=21 width=292) (actual time=1.735..31.613 rows=21 loops=1)
-> Nested Loop Semi Join (cost=1.14..10959887484.70 rows=53600 width=292) (actual time=1.733..31.607 rows=21 loops=1)
-> Index Scan using idx_followed_by on users u (cost=0.57..322693786.19 rows=232404560 width=292) (actual time=0.011..0.103 rows=32 loops=1)
-> Index Scan using fki_user_fk1 on posts p (cost=0.57..1943.85 rows=43 width=8) (actual time=0.983..0.983 rows=1 loops=32)
Index Cond: (user_id = u.id)
Filter: (tags @> '{love}'::text[])
Rows Removed by Filter: 1699
Planning time: 1.322 ms
Execution time: 31.656 ms
However, for some other tags like "beautiful" it's better, but still a little slow. It also takes a different execution path:
Limit (cost=3893365.84..3893365.89 rows=21 width=292) (actual time=2813.876..2813.892 rows=21 loops=1)
-> Sort (cost=3893365.84..3893499.84 rows=53600 width=292) (actual time=2813.874..2813.887 rows=21 loops=1)
Sort Key: u.followed_by DESC NULLS LAST
Sort Method: top-N heapsort Memory: 34kB
-> Nested Loop (cost=3437011.27..3891920.70 rows=53600 width=292) (actual time=1130.847..2779.928 rows=35230 loops=1)
-> HashAggregate (cost=3437010.70..3437546.70 rows=53600 width=8) (actual time=1130.809..1148.209 rows=35230 loops=1)
Group Key: p.user_id
-> Bitmap Heap Scan on posts p (cost=10484.20..3434173.21 rows=1134993 width=8) (actual time=268.602..972.390 rows=814919 loops=1)
Recheck Cond: (tags @> '{beautiful}'::text[])
Heap Blocks: exact=347002
-> Bitmap Index Scan on idx_tags (cost=0.00..10200.45 rows=1134993 width=0) (actual time=168.453..168.453 rows=814919 loops=1)
Index Cond: (tags @> '{beautiful}'::text[])
-> Index Scan using user_pk on users u (cost=0.57..8.47 rows=1 width=292) (actual time=0.045..0.046 rows=1 loops=35230)
Index Cond: (id = p.user_id)
Planning time: 1.388 ms
Execution time: 2814.132 ms
I did have a gin index for 'tags' already in place
This should be faster:
select *
from users u
where exists (select *
from posts p
where u.id=p.user_id
and p.tags @> ARRAY['love'])
order by u.followed_by desc nulls last
limit 21;
If there are only a few (<10%) posts with that tag, an index on posts.tags should help as well:
create index on posts using gin (tags);

Postgres using unperformant plan when rerun

I'm importing a non circular graph and flattening the ancestors to an array per code. This works fine (for a bit): ~45s for 400k codes over ~900k edges.
However, after the first successful execution Postgres decides to stop using the Nested Loop and the update query performance drops drastically: ~2s per code.
I can force the issue by putting a vacuum right before the update but I am curious why the unoptimization is happening.
DROP TABLE IF EXISTS tmp_anc;
DROP TABLE IF EXISTS tmp_rel;
DROP TABLE IF EXISTS tmp_edges;
DROP TABLE IF EXISTS tmp_codes;
CREATE TABLE tmp_rel (
from_id BIGINT,
to_id BIGINT
);
COPY tmp_rel FROM 'rel.txt' WITH DELIMITER E'\t' CSV HEADER;
CREATE TABLE tmp_edges(
start_node BIGINT,
end_node BIGINT
);
INSERT INTO tmp_edges(start_node, end_node)
SELECT from_id AS start_node, to_id AS end_node
FROM tmp_rel;
CREATE INDEX tmp_edges_end ON tmp_edges (end_node);
CREATE TABLE tmp_codes (
id BIGINT,
active SMALLINT
);
COPY tmp_codes FROM 'codes.txt' WITH DELIMITER E'\t' CSV HEADER;
CREATE TABLE tmp_anc(
code BIGINT,
ancestors BIGINT[]
);
INSERT INTO tmp_anc
SELECT DISTINCT(id)
FROM tmp_codes
WHERE active = 1;
CREATE INDEX tmp_anc_codes ON tmp_anc (code);
VACUUM; -- Need this for the update to execute optimally
UPDATE tmp_anc sa SET ancestors = (
WITH RECURSIVE ancestors(code) AS (
SELECT start_node FROM tmp_edges WHERE end_node = sa.code
UNION
SELECT se.start_node
FROM tmp_edges se, ancestors a
WHERE se.end_node = a.code
)
SELECT array_agg(code) FROM ancestors
);
Table stats:
tmp_rel 507 MB 0 bytes
tmp_edges 74 MB 37 MB
tmp_codes 32 MB 0 bytes
tmp_anc 22 MB 8544 kB
Explains:
Without VACUUM before UPDATE:
Update on tmp_anc sa (cost=10000000000.00..11081583053.74 rows=10 width=46) (actual time=38294.005..38294.005 rows=0 loops=1)
-> Seq Scan on tmp_anc sa (cost=10000000000.00..11081583053.74 rows=10 width=46) (actual time=3300.974..38292.613 rows=10 loops=1)
SubPlan 2
-> Aggregate (cost=108158305.25..108158305.26 rows=1 width=32) (actual time=3829.253..3829.253 rows=1 loops=10)
CTE ancestors
-> Recursive Union (cost=81.97..66015893.05 rows=1872996098 width=8) (actual time=0.037..3827.917 rows=45 loops=10)
-> Bitmap Heap Scan on tmp_edges (cost=81.97..4913.18 rows=4328 width=8) (actual time=0.022..0.022 rows=2 loops=10)
Recheck Cond: (end_node = sa.code)
Heap Blocks: exact=12
-> Bitmap Index Scan on tmp_edges_end (cost=0.00..80.89 rows=4328 width=0) (actual time=0.014..0.014 rows=2 loops=10)
Index Cond: (end_node = sa.code)
-> Merge Join (cost=4198.89..2855105.79 rows=187299177 width=8) (actual time=163.746..425.295 rows=10 loops=90)
Merge Cond: (a.code = se.end_node)
-> Sort (cost=4198.47..4306.67 rows=43280 width=8) (actual time=0.012..0.016 rows=5 loops=90)
Sort Key: a.code
Sort Method: quicksort Memory: 25kB
-> WorkTable Scan on ancestors a (cost=0.00..865.60 rows=43280 width=8) (actual time=0.000..0.001 rows=5 loops=90)
-> Materialize (cost=0.42..43367.08 rows=865523 width=16) (actual time=0.010..337.592 rows=537171 loops=90)
-> Index Scan using tmp_edges_end on edges se (cost=0.42..41203.27 rows=865523 width=16) (actual time=0.009..247.547 rows=537171 loops=90)
-> CTE Scan on ancestors (cost=0.00..37459921.96 rows=1872996098 width=8) (actual time=1.227..3829.159 rows=45 loops=10)
With VACUUM before UPDATE:
Update on tmp_anc sa (cost=0.00..2949980136.43 rows=387059 width=14) (actual time=74701.329..74701.329 rows=0 loops=1)
-> Seq Scan on tmp_anc sa (cost=0.00..2949980136.43 rows=387059 width=14) (actual time=0.336..70324.848 rows=387059 loops=1)
SubPlan 2
-> Aggregate (cost=7621.50..7621.51 rows=1 width=8) (actual time=0.180..0.180 rows=1 loops=387059)
CTE ancestors
-> Recursive Union (cost=0.42..7583.83 rows=1674 width=8) (actual time=0.005..0.162 rows=32 loops=387059)
-> Index Scan using tmp_edges_end on tmp_edges (cost=0.42..18.93 rows=4 width=8) (actual time=0.004..0.005 rows=2 loops=387059)
Index Cond: (end_node = sa.code)
-> Nested Loop (cost=0.42..753.14 rows=167 width=8) (actual time=0.003..0.019 rows=10 loops=2700448)
-> WorkTable Scan on ancestors a (cost=0.00..0.80 rows=40 width=8) (actual time=0.000..0.001 rows=5 loops=2700448)
-> Index Scan using tmp_edges_end on tmp_edges se (cost=0.42..18.77 rows=4 width=16) (actual time=0.003..0.003 rows=2 loops=12559395)
Index Cond: (end_node = a.code)
-> CTE Scan on ancestors (cost=0.00..33.48 rows=1674 width=8) (actual time=0.007..0.173 rows=32 loops=387059)
The first execution plan has really bad estimates (Bitmap Index Scan on tmp_edges_end estimates 4328 instead of 2 rows), while the second execution has good estimates and thus chooses a good plan.
So something between the two executions you quote above must have changed the estimates.
Moreover, you say that the first execution of the UPDATE (for which we have no EXPLAIN (ANALYZE) output) was fast.
The only good explanation for the initial performance drop is that it takes the autovacuum daemon some time to collect statistics for the new tables. This normally improves query performance, but of course it can also work the other way around.
Also, a VACUUM usually doesn't fix performance issues. Could it be that you used VACUUM (ANALYZE)?
It would be interesting to know how things are when you collect statistics before your initial UPDATE:
ANALYZE tmp_edges;
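As a sketch of how to check this, one could collect statistics right after loading and indexing the tables, and then look at when each table was last analyzed, either manually or by autovacuum. The table names are the ones from the question; pg_stat_user_tables is a standard statistics view.

```sql
-- Collect statistics immediately after the bulk load, before the UPDATE.
ANALYZE tmp_edges;
ANALYZE tmp_anc;

-- Check whether (and when) statistics were gathered for each table.
SELECT relname, n_live_tup, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname IN ('tmp_edges', 'tmp_anc');
```

If last_autoanalyze changes between a fast and a slow run, that would support the explanation that autovacuum's statistics collection is what alters the plan.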
When I read your query more closely, however, I wonder why you use a correlated subquery for that. Maybe it would be faster to do something like:
UPDATE tmp_anc sa
SET ancestors = a.codes
FROM (WITH RECURSIVE ancestors(code, start_node) AS
(SELECT tmp_anc.code, tmp_edges.start_node
FROM tmp_edges
JOIN tmp_anc ON tmp_edges.end_node = tmp_anc.code
UNION
SELECT a.code, se.start_node
FROM tmp_edges se
JOIN ancestors a ON se.end_node = a.code
)
SELECT code,
array_agg(start_node) AS codes
FROM ancestors
GROUP BY (code)
) a
WHERE sa.code = a.code;
(This is untested, so there may be mistakes.)

Postgres Query Optimization w/ simple join

I have the following query:
SELECT "person_dimensions"."dimension"
FROM "person_dimensions"
join users
on users.id = person_dimensions.user_id
where users.team_id = 2
The following is the result of EXPLAIN ANALYZE:
Nested Loop (cost=0.43..93033.84 rows=452 width=11) (actual time=1245.321..42915.426 rows=827 loops=1)
-> Seq Scan on person_dimensions (cost=0.00..254.72 rows=13772 width=15) (actual time=0.022..9.907 rows=13772 loops=1)
-> Index Scan using users_pkey on users (cost=0.43..6.73 rows=1 width=4) (actual time=2.978..3.114 rows=0 loops=13772)
Index Cond: (id = person_dimensions.user_id)
Filter: (team_id = 2)
Rows Removed by Filter: 1
Planning time: 0.396 ms
Execution time: 42915.678 ms
Indexes exist on person_dimensions.user_id and users.team_id, so it is unclear as to why this seemingly simple query would be taking so long.
Maybe it has something to do with team_id being unable to be used in the join condition? Ideas how to speed this up?
EDIT:
I tried this query:
SELECT "person_dimensions"."dimension"
FROM "person_dimensions"
join users ON users.id = person_dimensions.user_id
WHERE users.id IN (2337,2654,3501,56,4373,1060,3170,97,4629,41,3175,4541,2827)
which contains the IDs returned by the subquery:
SELECT id FROM users WHERE team_id = 2
The result was 380ms versus 42s as above. I could use this as a workaround, but I am really curious as to what is going on here...
I rebooted my DB server yesterday, and when it came back up this same query was performing as expected with a completely different query plan that used expected indices:
QUERY PLAN
Hash Join (cost=1135.63..1443.45 rows=84 width=11) (actual time=0.354..6.312 rows=835 loops=1)
Hash Cond: (person_dimensions.user_id = users.id)
-> Seq Scan on person_dimensions (cost=0.00..255.17 rows=13817 width=15) (actual time=0.002..2.764 rows=13902 loops=1)
-> Hash (cost=1132.96..1132.96 rows=214 width=4) (actual time=0.175..0.175 rows=60 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 11kB
-> Bitmap Heap Scan on users (cost=286.07..1132.96 rows=214 width=4) (actual time=0.032..0.157 rows=60 loops=1)
Recheck Cond: (team_id = 2)
Heap Blocks: exact=68
-> Bitmap Index Scan on index_users_on_team_id (cost=0.00..286.02 rows=214 width=0) (actual time=0.021..0.021 rows=82 loops=1)
Index Cond: (team_id = 2)
Planning time: 0.215 ms
Execution time: 6.474 ms
Anyone have any ideas why it required a reboot to be aware of all of this? Could it be that manual vacuums were required that hadn't been done in a while, or something like this? Recall I did do an analyze on the relevant tables before the reboot and it didn't change anything.