I'm looking for ways to abstract database access to Postgres. In my examples I will use a hypothetical Twitter clone in Node.js, but in the end it's a question about how Postgres handles prepared statements, so the language and library don't really matter:
Suppose I want to be able to access a list of all tweets from a user by username:
name: "tweets by username"
text: "SELECT (SELECT * FROM tweets WHERE tweets.user_id = users.user_id) FROM users WHERE users.username = $1"
values: [username]
That works fine, but it seems inefficient, in both practical and code-quality terms, to have to write another function to handle getting tweets by email rather than by username:
name: "tweets by email"
text: "SELECT (SELECT * FROM tweets WHERE tweets.user_id = users.user_id) FROM users WHERE users.email = $1"
values: [email]
Is it possible to include a field as a parameter to the prepared statement?
name: "tweets by user"
text: "SELECT (SELECT * FROM tweets WHERE tweets.user_id = users.user_id) FROM users WHERE users.$1 = $2"
values: [field, value]
While it's true that this might be a bit less efficient in the corner case of accessing tweets by user_id, that's a trade I'm willing to make to improve code quality and, hopefully, overall efficiency by reducing the number of query templates from 3+ to 1.
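Note that in PostgreSQL a parameter can only stand in for a value, never for an identifier, so users.$1 = $2 is rejected at PREPARE time. If the column genuinely has to vary, the usual escape hatch is dynamic SQL, for example a PL/pgSQL function built on format() and EXECUTE. This is only a sketch with illustrative names, assuming the hypothetical tweets/users schema above:
create function tweets_by(field text, val text)
returns setof tweets
language plpgsql stable as $$
begin
    -- whitelist the column name to keep the dynamic SQL safe
    if field not in ('username', 'email') then
        raise exception 'unsupported field: %', field;
    end if;
    return query execute format(
        'select t.* from tweets t
         join users u on t.user_id = u.user_id
         where u.%I = $1', field)
    using val;
end;
$$;
Each field value still gets its own plan under the hood, but the application shrinks to a single query template: SELECT * FROM tweets_by($1, $2).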
@Clodoaldo's answer is correct in that it gives you the capability you want and should return the right results. Unfortunately it produces rather slow execution.
I set up an experimental database with tweets and users, populated with 10K users of 100 tweets each (1M tweet records). I indexed the PKs u.id and t.id, the FK t.user_id, and the predicate fields u.name and u.email.
create table t(id serial PRIMARY KEY, data integer, user_id bigint);
create index t1 on t(user_id);
create table u(id serial PRIMARY KEY, name text, email text);
create index u1 on u(name);
create index u2 on u(email);
insert into u(name,email) select i::text, i::text from generate_series(1,10000) i;
insert into t(data,user_id) select i, (i/100)::bigint from generate_series(1,1000000) i;
analyze t;
analyze u;
A simple query using one field as predicate is very fast:
prepare qn as select t.* from t join u on t.user_id = u.id where u.name = $1;
explain analyze execute qn('1111');
Nested Loop (cost=0.00..19.81 rows=1 width=16) (actual time=0.030..0.057 rows=100 loops=1)
-> Index Scan using u1 on u (cost=0.00..8.46 rows=1 width=4) (actual time=0.020..0.020 rows=1 loops=1)
Index Cond: (name = $1)
-> Index Scan using t1 on t (cost=0.00..10.10 rows=100 width=16) (actual time=0.007..0.023 rows=100 loops=1)
Index Cond: (t.user_id = u.id)
Total runtime: 0.093 ms
A query using a CASE expression in the WHERE clause, as @Clodoaldo proposed, takes almost 30 seconds:
prepare qen as select t.* from t join u on t.user_id = u.id
where case $2 when 'e' then u.email = $1 when 'n' then u.name = $1 end;
explain analyze execute qen('1111','n');
Merge Join (cost=25.61..38402.69 rows=500000 width=16) (actual time=27.771..26345.439 rows=100 loops=1)
Merge Cond: (t.user_id = u.id)
-> Index Scan using t1 on t (cost=0.00..30457.35 rows=1000000 width=16) (actual time=0.023..17.741 rows=111200 loops=1)
-> Index Scan using u_pkey on u (cost=0.00..42257.36 rows=500000 width=4) (actual time=0.325..26317.384 rows=1 loops=1)
Filter: CASE $2 WHEN 'e'::text THEN (u.email = $1) WHEN 'n'::text THEN (u.name = $1) ELSE NULL::boolean END
Total runtime: 26345.535 ms
Observing that plan, I thought that using a union subselect then filtering its results to get the id appropriate to the parametrized predicate choice would allow the planner to use specific indexes for each predicate. Turns out I was right:
prepare qen2 as
select t.*
from t
join (
SELECT id from
(
SELECT 'n' as fld, id from u where u.name = $1
UNION ALL
SELECT 'e' as fld, id from u where u.email = $1
) poly
where poly.fld = $2
) uu
on t.user_id = uu.id;
explain analyze execute qen2('1111','n');
Nested Loop (cost=0.00..28.31 rows=100 width=16) (actual time=0.058..0.120 rows=100 loops=1)
-> Subquery Scan poly (cost=0.00..16.96 rows=1 width=4) (actual time=0.041..0.073 rows=1 loops=1)
Filter: (poly.fld = $2)
-> Append (cost=0.00..16.94 rows=2 width=4) (actual time=0.038..0.070 rows=2 loops=1)
-> Subquery Scan "*SELECT* 1" (cost=0.00..8.47 rows=1 width=4) (actual time=0.038..0.038 rows=1 loops=1)
-> Index Scan using u1 on u (cost=0.00..8.46 rows=1 width=4) (actual time=0.038..0.038 rows=1 loops=1)
Index Cond: (name = $1)
-> Subquery Scan "*SELECT* 2" (cost=0.00..8.47 rows=1 width=4) (actual time=0.031..0.032 rows=1 loops=1)
-> Index Scan using u2 on u (cost=0.00..8.46 rows=1 width=4) (actual time=0.030..0.031 rows=1 loops=1)
Index Cond: (email = $1)
-> Index Scan using t1 on t (cost=0.00..10.10 rows=100 width=16) (actual time=0.015..0.028 rows=100 loops=1)
Index Cond: (t.user_id = poly.id)
Total runtime: 0.170 ms
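If you want a single entry point on the application side, the UNION ALL trick can be wrapped in a SQL function (a sketch using the experiment's table names; $1 and $2 refer to the function's arguments):
create function tweets_by_user(val text, fld text)
returns setof t
language sql stable as $$
    select t.*
    from t
    join (
        select id from (
            select 'n' as fld, id from u where u.name = $1
            union all
            select 'e' as fld, id from u where u.email = $1
        ) poly
        where poly.fld = $2
    ) uu on t.user_id = uu.id;
$$;
select * from tweets_by_user('1111', 'n');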
For reference, here is the CASE-based approach from @Clodoaldo's answer:
SELECT t.*
FROM tweets t
inner join users u on t.user_id = u.user_id
WHERE case $2
when 'username' then u.username = $1
when 'email' then u.email = $1
else u.user_id = $1
end
While executing the two queries below, I noticed a serious difference in the query plans. Why is that?
select * from table1
where id = 'dummy' or id in (select id from table2 where id = 'dummy')
Query plan
Seq Scan on table1 (cost=8.30..49611.63 rows=254478 width=820) (actual time=535.477..557.431 rows=1 loops=1)
Filter: (((code)::text = 'dummy'::text) OR (hashed SubPlan 1))
Rows Removed by Filter: 510467
SubPlan 1
-> Index Scan using idx on table2 (cost=0.29..8.30 rows=1 width=8) (actual time=0.009..0.012 rows=0 loops=1)
Index Cond: ((id)::text = 'dummy'::text)
Planning Time: 0.165 ms
Execution Time: 557.517 ms
select * from table1
where id = 'dummy'
union
select * from table1
where id in (select id from table2 where id = 'dummy')
Unique (cost=25.22..25.42 rows=2 width=5818) (actual time=0.045..0.047 rows=1 loops=1)
-> Sort (cost=25.22..25.23 rows=2 width=5818) (actual time=0.045..0.046 rows=1 loops=1)
Sort Method: quicksort Memory: 25kB
-> Append (cost=0.42..25.21 rows=2 width=5818) (actual time=0.016..0.026 rows=1 loops=1)
-> Index Scan using id on table1 (cost=0.42..8.44 rows=1 width=820) (actual time=0.015..0.016 rows=1 loops=1)
Index Cond: ((id)::text = 'dummy'::text)
-> Nested Loop (cost=0.71..16.74 rows=1 width=820) (actual time=0.009..0.009 rows=0 loops=1)
-> Index Scan using idx on table2 (cost=0.29..8.30 rows=1 width=8) (actual time=0.008..0.008 rows=0 loops=1)
Index Cond: ((id)::text = 'dummy'::text)
-> Index Scan using pkey on table1 (cost=0.42..8.44 rows=1 width=820) (never executed)
Index Cond: (id = table2.id)
Planning Time: 0.753 ms
Execution Time: 0.131 ms
So the main difference you can see is that the first plan estimates 254478 rows while the second estimates just 2 rows.
Why is that?
Please do another test -- run both these queries -- do they give the same results as the queries without my changes?
select * from table1
where table1.id = 'dummy' or
table1.id in (select table2.id from table2 where table2.id = 'dummy')
select * from table1
where table1.id = 'dummy'
union
select * from table1
where table1.id in (select table2.id from table2 where table2.id = 'dummy')
I don't think you are sharing your actual code with us -- because as written your code makes little sense -- you are returning a list of ids in the sub-query that equal 'dummy' -- so you will just get a list of dummy multiple times.
Note: these comments turned out not to be true, since the suggested changes had no impact on the results -- the order of operations was working as expected.
What result do you get when you do this:
select * from table1
where (id = 'dummy') or id in (select id from table2 where id = 'dummy')
The reason your query was giving more results is that it was selecting records from table1 where id equals 'dummy' or id = id. The query in the original post gives you all the records. The OR was being applied to the first expression, not splitting the two expressions.
We are working on a platform right now, and I'm having difficulty optimizing one of our queries; it's actually a view.
When explaining the query a full table scan is performed on my user_role table.
The query behind the view looks like what I've posted below (I used SELECT * instead of the different columns of the different tables, just to showcase my issue). All results are based on that simplified query.
SELECT t.*
FROM task t
LEFT JOIN project p ON t.project_id = p.id
LEFT JOIN company comp ON comp.id = p.company_id
LEFT JOIN company_users cu ON cu.companies_id = comp.id
LEFT JOIN user_table u ON u.email= cu.users_email
LEFT JOIN user_role role ON u.email= role.user_email
WHERE lower(t.type) = 'project'
AND t.project_id IS NOT NULL
AND role.role::text in ('SOME_ROLE_1', 'SOME_ROLE_2')
Basically this query runs OK like this. If I EXPLAIN the query, it uses all the indexes I have in place. Nice!
But as soon as I start adding extra WHERE clauses, the issue arises. I simply add this:
and u.email = 'my#email.com'
and comp.id = 4
and p.id = 3
And all of a sudden, a full table scan is performed on the tables company_users and user_role, but not on table project.
The full query plan is:
Nested Loop (cost=0.98..22.59 rows=1 width=97) (actual time=0.115..4.448 rows=189 loops=1)
-> Nested Loop (cost=0.84..22.02 rows=1 width=632) (actual time=0.099..3.091 rows=252 loops=1)
-> Nested Loop (cost=0.70..20.10 rows=1 width=613) (actual time=0.082..1.774 rows=252 loops=1)
-> Nested Loop (cost=0.56..19.81 rows=1 width=621) (actual time=0.068..0.919 rows=252 loops=1)
-> Nested Loop (cost=0.43..19.62 rows=1 width=101) (actual time=0.058..0.504 rows=63 loops=1)
-> Index Scan using task_project_id_index on task t (cost=0.28..11.43 rows=1 width=97) (actual time=0.041..0.199 rows=63 loops=1)
Index Cond: (project_id IS NOT NULL)
Filter: (lower((type)::text) = 'project'::text)
-> Index Scan using project_id_uindex on project p (cost=0.15..8.17 rows=1 width=8) (actual time=0.003..0.003 rows=1 loops=63)
Index Cond: (id = t.project_id)
-> Index Scan using company_users_companies_id_index on company_users cu (cost=0.14..0.17 rows=1 width=520) (actual time=0.002..0.004 rows=4 loops=63)
Index Cond: (companies_id = p.company_id)
-> Index Only Scan using company_id_index on company comp (cost=0.14..0.29 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=252)
Index Cond: (id = p.company_id)
Heap Fetches: 252
-> Index Only Scan using user_table_email_index on user_table u (cost=0.14..1.81 rows=1 width=19) (actual time=0.004..0.004 rows=1 loops=252)
Index Cond: (email = (cu.users_email)::text)
Heap Fetches: 252
-> Index Scan using user_role_user_email_index on user_role role (cost=0.14..0.56 rows=1 width=516) (actual time=0.004..0.004 rows=1 loops=252)
Index Cond: ((user_email)::text = (u.email)::text)
Filter: ((role)::text = ANY ('{COMPANY_ADMIN,COMPANY_USER}'::text[]))
Rows Removed by Filter: 0
Planning time: 2.581 ms
Execution time: 4.630 ms
The explanation for company_users in particular is:
SEQ_SCAN (Seq Scan) table: company_users; 1 1.44 0.0 Parent Relationship = Inner;
Parallel Aware = false;
Alias = cu;
Plan Width = 520;
Filter = ((companies_id = 4) AND ((users_email)::text = 'my#email.com'::text));
However I already created an index on the company_users table: create index if not exists company_users_users_email_companies_id_index on company_users (users_email, companies_id);.
The same goes for the user_role table. Its explanation is:
SEQ_SCAN (Seq Scan) table: user_role; 1 1.45 0.0 Parent Relationship = Inner;
Parallel Aware = false;
Alias = role;
Plan Width = 516;
Filter = (((role)::text = ANY ('{SOME_ROLE_1,SOME_ROLE_2}'::text[])) AND ((user_email)::text = 'my#email.com'::text));
And thus for user_role I also have an index in place on columns role and user_email: create index if not exists user_role_role_user_email_index on user_role (role, user_email);
I did not expect it to change anything, but I did try adapting the query to include only a single role in the WHERE clause, and I also tried OR conditions instead of IN, but it makes no difference!
The weird thing to me is that without my extra 3 WHERE filters it works perfectly, and as soon as I add them it does not, even though I do have indexes created for the fields mentioned in the explanations...
We are using exactly the same approach in a few other queries, and they all suffer the same issue on those two tables...
So the big question is: how can I further improve my queries and make them use the indexes instead of a full table scan?
Create an index and calculate statistics:
CREATE INDEX ON task (lower(type)) WHERE project_id IS NOT NULL;
ANALYZE task;
That should improve the estimate, so that PostgreSQL chooses a better join strategy.
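If the planner still prefers sequential scans on company_users and user_role after that, two things are worth checking. First, on very small tables a sequential scan is simply cheaper than any index scan (the costs of 1.44 and 1.45 in the explanations suggest the tables are tiny). Second, index column order matters; the user_role index from the question leads with the low-cardinality role column. A sketch, reusing the column names from the question:
-- Put the selective equality column first so the index can be probed
-- directly by user_email (column names taken from the question):
CREATE INDEX IF NOT EXISTS user_role_user_email_role_index
    ON user_role (user_email, role);
ANALYZE user_role;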
I have this query that is highly inefficient. If I remove all the count columns, it takes 10 seconds to run (the tables are quite large, around 750 MB each). But if I add one count column, it takes 36 seconds to execute, and if I leave them all in, it doesn't finish at all.
I tried sum(case when r.value is not null then 1 else 0 end) in place of count(DISTINCT r.*) FILTER (WHERE r.value IS NOT NULL) AS responses, but it produces incorrect counts.
SELECT c.id,
c.title,
count(DISTINCT cc.*) AS contacts,
count(DISTINCT m.user_id) AS texters,
count(DISTINCT cc.*) FILTER (WHERE cc.assignment_id IS NULL) AS needs_assignment,
count(DISTINCT m.*) FILTER (WHERE m.is_from_contact = false) AS sent_messages,
count(DISTINCT m.*) FILTER (WHERE m.is_from_contact = true) AS received_messages,
count(DISTINCT cc.*) FILTER (WHERE m.is_from_contact = false) AS contacted,
count(DISTINCT cc.*) FILTER (WHERE m.is_from_contact = true) AS received_reply,
count(DISTINCT cc.*) FILTER (WHERE cc.message_status = 'needsResponse'::text AND NOT cc.is_opted_out) AS needs_response,
count(DISTINCT cc.*) FILTER (WHERE cc.is_opted_out = true) AS opt_outs,
count(DISTINCT m.*) FILTER (WHERE m.error_code IS NOT NULL AND m.error_code <> 0) AS errors,
c.is_started,
c.is_archived,
c.use_dynamic_assignment,
c.due_by,
c.created_at,
creator.email AS creator_email,
concat(c.join_token, '/join/', c.id) AS join_path,
count(DISTINCT r.*) FILTER (WHERE r.value IS NOT NULL) AS responses,
c.batch_size,
c.texting_hours_start,
c.texting_hours_end,
c.timezone,
c.organization_id
FROM campaign c
JOIN "user" creator ON c.creator_id = creator.id
LEFT JOIN campaign_contact cc ON c.id = cc.campaign_id
LEFT JOIN message m ON m.campaign_contact_id = cc.id
LEFT JOIN question_response r ON cc.id = r.campaign_contact_id
GROUP BY c.id, creator.email
Any direction is appreciated, thank you!
Create some test data...
create unlogged table users( user_id serial primary key, login text unique not null );
insert into users (login) select 'user'||n from generate_series(1,100000) n;
create unlogged table messages( message_id serial primary key, sender_id integer not null, receiver_id integer not null);
insert into messages (sender_id,receiver_id) select random()*100000+1, random()*100000+1 from generate_series(1,1000000);
create index messages_s on messages(sender_id);
create index messages_r on messages(receiver_id);
vacuum analyze users,messages;
And then:
EXPLAIN ANALYZE
SELECT user_id, count(DISTINCT m1.message_id), count(DISTINCT m2.message_id)
FROM users u
LEFT JOIN messages m1 ON m1.receiver_id = user_id
LEFT JOIN messages m2 ON m2.sender_id = user_id
GROUP BY user_id;
GroupAggregate (cost=4.39..326190.22 rows=100000 width=20) (actual time=4.023..3331.031 rows=100000 loops=1)
Group Key: u.user_id
-> Merge Left Join (cost=4.39..250190.22 rows=10000000 width=12) (actual time=3.987..2161.032 rows=9998915 loops=1)
Merge Cond: (u.user_id = m1.receiver_id)
-> Merge Left Join (cost=2.11..56522.26 rows=1000000 width=8) (actual time=3.978..515.730 rows=1000004 loops=1)
Merge Cond: (u.user_id = m2.sender_id)
-> Index Only Scan using users_pkey on users u (cost=0.29..2604.29 rows=100000 width=4) (actual time=0.016..10.149 rows=100000 loops=1)
Heap Fetches: 0
-> Index Scan using messages_s on messages m2 (cost=0.42..41168.40 rows=1000000 width=8) (actual time=0.011..397.128 rows=999996 loops=1)
-> Materialize (cost=0.42..43668.42 rows=1000000 width=8) (actual time=0.008..746.748 rows=9998810 loops=1)
-> Index Scan using messages_r on messages m1 (cost=0.42..41168.42 rows=1000000 width=8) (actual time=0.006..392.426 rows=999997 loops=1)
Execution Time: 3432.131 ms
Since I put in 100k users and 1M messages, each user has about 100 messages as sender and 100 as receiver, which means the joins generate 100*100 = 10k rows per user, which then have to be processed by the count(DISTINCT ...) aggregates. Postgres doesn't realize this is all unnecessary, because the counts and GROUP BYs really belong inside the joined tables, so this is extremely slow.
The solution is to move the aggregation inside the joined tables manually, to avoid generating all these unnecessary extra rows.
EXPLAIN ANALYZE
SELECT user_id, m1.cnt, m2.cnt
FROM users u
LEFT JOIN (SELECT receiver_id, count(*) cnt FROM messages GROUP BY receiver_id) m1 ON m1.receiver_id = user_id
LEFT JOIN (SELECT sender_id, count(*) cnt FROM messages GROUP BY sender_id) m2 ON m2.sender_id = user_id;
Hash Left Join (cost=46780.40..48846.42 rows=100000 width=20) (actual time=469.699..511.613 rows=100000 loops=1)
Hash Cond: (u.user_id = m2.sender_id)
-> Hash Left Join (cost=23391.68..25195.19 rows=100000 width=12) (actual time=237.435..262.545 rows=100000 loops=1)
Hash Cond: (u.user_id = m1.receiver_id)
-> Seq Scan on users u (cost=0.00..1541.00 rows=100000 width=4) (actual time=0.015..5.162 rows=100000 loops=1)
-> Hash (cost=22243.34..22243.34 rows=91867 width=12) (actual time=237.252..237.253 rows=99991 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 5321kB
-> Subquery Scan on m1 (cost=20406.00..22243.34 rows=91867 width=12) (actual time=210.817..227.793 rows=99991 loops=1)
-> HashAggregate (cost=20406.00..21324.67 rows=91867 width=12) (actual time=210.815..222.794 rows=99991 loops=1)
Group Key: messages.receiver_id
Batches: 1 Memory Usage: 14353kB
-> Seq Scan on messages (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.010..47.173 rows=1000000 loops=1)
-> Hash (cost=22241.52..22241.52 rows=91776 width=12) (actual time=232.003..232.004 rows=99992 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 5321kB
-> Subquery Scan on m2 (cost=20406.00..22241.52 rows=91776 width=12) (actual time=205.401..222.517 rows=99992 loops=1)
-> HashAggregate (cost=20406.00..21323.76 rows=91776 width=12) (actual time=205.400..217.518 rows=99992 loops=1)
Group Key: messages_1.sender_id
Batches: 1 Memory Usage: 14353kB
-> Seq Scan on messages messages_1 (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.008..43.402 rows=1000000 loops=1)
Planning Time: 0.574 ms
Execution Time: 515.753 ms
I used a schema that is a bit different from yours, but you get the idea: instead of generating lots of duplicate rows by doing what is essentially a cross product, push aggregations into the joined tables so they return only one row per value of whatever column you're joining on, then remove the GROUP BY from the main query since it is no longer necessary.
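Applied to the query from the question, the same idea looks roughly like this. It is a partial sketch covering only a few of the aggregates and assuming the column names from the question; the remaining counts follow the same pattern:
SELECT c.id,
       c.title,
       creator.email AS creator_email,
       COALESCE(cc.contacts, 0)     AS contacts,
       COALESCE(cc.opt_outs, 0)     AS opt_outs,
       COALESCE(m.texters, 0)       AS texters,
       COALESCE(m.sent_messages, 0) AS sent_messages
FROM campaign c
JOIN "user" creator ON c.creator_id = creator.id
LEFT JOIN (
    -- one row per campaign: contact counts
    SELECT campaign_id,
           count(*) AS contacts,
           count(*) FILTER (WHERE is_opted_out) AS opt_outs
    FROM campaign_contact
    GROUP BY campaign_id
) cc ON cc.campaign_id = c.id
LEFT JOIN (
    -- one row per campaign: message counts
    SELECT cc.campaign_id,
           count(DISTINCT m.user_id) AS texters,
           count(*) FILTER (WHERE NOT m.is_from_contact) AS sent_messages
    FROM campaign_contact cc
    JOIN message m ON m.campaign_contact_id = cc.id
    GROUP BY cc.campaign_id
) m ON m.campaign_id = c.id;
Each subquery scans its table once and returns at most one row per campaign, so the outer query needs no GROUP BY at all.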
Note that count(DISTINCT table.*) is not smart enough to understand that it can do this by looking only at the primary key of the table if there is one, so it will pull the whole row to run the distinct on it. When a table is named "message" or "question_response" it smells like it has a largish TEXT column in it, which will make this very slow. So in case you really need a count(distinct ...) you should use count(DISTINCT table.primarykey):
explain analyze SELECT count(distinct user_id) from users;
Aggregate (cost=1791.00..1791.01 rows=1 width=8) (actual time=15.220..15.221 rows=1 loops=1)
-> Seq Scan on users (cost=0.00..1541.00 rows=100000 width=4) (actual time=0.016..5.830 rows=100000 loops=1)
Execution Time: 15.263 ms
explain analyze SELECT count(distinct users.*) from users;
Aggregate (cost=1791.00..1791.01 rows=1 width=8) (actual time=90.896..90.896 rows=1 loops=1)
-> Seq Scan on users (cost=0.00..1541.00 rows=100000 width=37) (actual time=0.038..38.497 rows=100000 loops=1)
Execution Time: 90.958 ms
The problem is the DISTINCT in the aggregate functions. PostgreSQL is not very smart about processing these.
Not knowing your data model, I cannot tell if the DISTINCT is really needed. Omit it if you can.
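For example, once the message counts are computed in their own subquery (as in the other answer), message rows are no longer duplicated by the join against question_response, and the DISTINCT can simply be dropped. A sketch, assuming message has an id primary key:
SELECT cc.campaign_id,
       count(m.id) FILTER (WHERE NOT m.is_from_contact) AS sent_messages,
       count(m.id) FILTER (WHERE m.is_from_contact)     AS received_messages
FROM campaign_contact cc
LEFT JOIN message m ON m.campaign_contact_id = cc.id
GROUP BY cc.campaign_id;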
Here is my query. It takes 60141 ms to execute, and I don't know what I should do to make it run faster. I've posted the EXPLAIN ANALYZE output of my query below. Please help me with this.
EXPLAIN (BUFFERS,ANALYZE) SELECT id
FROM activitylog
WHERE (url = '/staff/save/117' OR url = '/staff/create/117')
AND timestamp > '1990-01-01 00:00:00'
AND userid IN ( SELECT id
FROM users
WHERE companyid = ( SELECT companyid
FROM users
WHERE id='150' ) )
ORDER BY timestamp DESC
Output:
Sort (cost=934879.83..934879.83 rows=1 width=12) (actual time=63918.947..63918.948 rows=4 loops=1)
Sort Key: activitylog."timestamp"
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=168161 read=561433
InitPlan 1 (returns $0)
-> Index Scan using "usersPrimary" on users users_1 (cost=0.14..8.16 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=1)
Index Cond: (id = 150)
Buffers: shared hit=2
-> Nested Loop (cost=0.00..934871.66 rows=1 width=12) (actual time=63918.693..63918.917 rows=4 loops=1)
Join Filter: (activitylog.userid = users.id)
Rows Removed by Join Filter: 400
Buffers: shared hit=168158 read=561433
-> Seq Scan on users (cost=0.00..10.53 rows=25 width=4) (actual time=0.018..0.085 rows=101 loops=1)
Filter: (companyid = $0)
Rows Removed by Filter: 114
Buffers: shared hit=10
-> Materialize (cost=0.00..934860.39 rows=2 width=16) (actual time=120.024..632.858 rows=4 loops=101)
Buffers: shared hit=168148 read=561433
-> Seq Scan on activitylog (cost=0.00..934860.38 rows=2 width=16) (actual time=12122.376..63918.564 rows=4 loops=1)
Filter: (("timestamp" > '2019-01-02 19:19:12.649837+00'::timestamp with time zone) AND (((url)::text = '/jobs/save/81924'::text) OR ((url)::text = '/jobs/create/81924'::text)))
Rows Removed by Filter: 11935833
Buffers: shared hit=168148 read=561433
Planning time: 0.806 ms
Execution time: 63919.748 ms
Thanks in advance.
Try this one. Rewriting the subquery as an EXISTS with a join lets the planner use indexes and should improve query execution time:
SELECT id
FROM activitylog
WHERE url in ('/staff/save/117','/staff/create/117')
AND TIMESTAMP > '1990-01-01 00:00:00'
AND EXISTS
(SELECT 1
FROM users AS u
JOIN users ur ON ur.CompanyID = u.CompanyID
WHERE ur.ID = '150'
AND u.id = activitylog.userid)
ORDER BY TIMESTAMP DESC
You should rewrite the query using IN instead of OR, as Pranesh Janarthanan's answer suggests, because OR is a performance killer.
In addition, you need indexes to avoid the expensive sequential scan on activitylog:
CREATE INDEX ON activitylog (timestamp);
CREATE INDEX ON activitylog (url);
Which of these indexes you need depends on how selective the respective conditions are.
Try with each of these indexes and with both (which may give you a “Bitmap And”) and keep what works best.
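If neither condition is very selective on its own but the combination is, a single multicolumn index can serve both in one scan (a sketch; the equality column url goes first, the range column timestamp second):
CREATE INDEX ON activitylog (url, timestamp);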
You should make sure you have the right indexes. Also, let's optimize the query:
SELECT id
FROM activitylog
WHERE url in ('/staff/save/117', '/staff/create/117')
AND timestamp > '1990-01-01 00:00:00'
AND
(
(userid = 150) or
EXISTS
(
select 1
from users workmate
join users u150
on workmate.companyid = u150.companyid and u150.id = 150
and activitylog.userid = workmate.id
)
)
ORDER BY timestamp DESC
I have a query where Postgres performs a hash join with sequential scans instead of a nested loop join with index scans when I use an OR condition. This causes the query to take 2 seconds instead of completing in < 100 ms. I have run VACUUM ANALYZE and have rebuilt the index on the PATIENTCHARTNOTE table (which is about 5 GB), but it's still using a hash join. Do you have any suggestions on how I can improve this?
explain analyze
SELECT Count (_pcn.id) AS total_open_note
FROM patientchartnote _pcn
INNER JOIN appointment _appt
ON _appt.id = _pcn.appointment_id
INNER JOIN patient _pt
ON _pt.id = _appt.patient_id
LEFT OUTER JOIN person _ps
ON _ps.id = _pt.appuser_id
WHERE _pcn.active = true
AND _pt.active = true
AND _appt.datecomplete IS NULL
AND _pcn.title IS NOT NULL
AND _pcn.title <> ''
AND ( _pt.assigned_to_user_id = '136964'
OR _pcn.createdby_id = '136964'
);
Aggregate (cost=237655.59..237655.60 rows=1 width=8) (actual time=1602.069..1602.069 rows=1 loops=1)
-> Hash Join (cost=83095.43..237645.30 rows=4117 width=4) (actual time=944.850..1602.014 rows=241 loops=1)
Hash Cond: (_appt.patient_id = _pt.id)
Join Filter: ((_pt.assigned_to_user_id = 136964) OR (_pcn.createdby_id = 136964))
Rows Removed by Join Filter: 94036
-> Hash Join (cost=46650.68..182243.64 rows=556034 width=12) (actual time=415.862..1163.812 rows=94457 loops=1)
Hash Cond: (_pcn.appointment_id = _appt.id)
-> Seq Scan on patientchartnote _pcn (cost=0.00..112794.20 rows=1073978 width=12) (actual time=0.016..423.262 rows=1073618 loops=1)
Filter: (active AND (title IS NOT NULL) AND ((title)::text <> ''::text))
Rows Removed by Filter: 22488
-> Hash (cost=35223.61..35223.61 rows=696486 width=8) (actual time=414.749..414.749 rows=692839 loops=1)
Buckets: 131072 Batches: 16 Memory Usage: 2732kB
-> Seq Scan on appointment _appt (cost=0.00..35223.61 rows=696486 width=8) (actual time=0.010..271.208 rows=692839 loops=1)
Filter: (datecomplete IS NULL)
Rows Removed by Filter: 652426
-> Hash (cost=24698.57..24698.57 rows=675694 width=12) (actual time=351.566..351.566 rows=674929 loops=1)
Buckets: 131072 Batches: 16 Memory Usage: 2737kB
-> Seq Scan on patient _pt (cost=0.00..24698.57 rows=675694 width=12) (actual time=0.013..197.268 rows=674929 loops=1)
Filter: active
Rows Removed by Filter: 17426
Planning time: 1.533 ms
Execution time: 1602.715 ms
When I replace "OR _pcn.createdby_id = '136964'" with "AND _pcn.createdby_id = '136964'", Postgres performs an index scan
Aggregate (cost=29167.56..29167.57 rows=1 width=8) (actual time=937.743..937.743 rows=1 loops=1)
-> Nested Loop (cost=1.28..29167.55 rows=7 width=4) (actual time=19.136..937.669 rows=37 loops=1)
-> Nested Loop (cost=0.85..27393.03 rows=1654 width=4) (actual time=2.154..910.250 rows=1649 loops=1)
-> Index Scan using patient_activeassigned_idx on patient _pt (cost=0.42..3075.00 rows=1644 width=8) (actual time=1.599..11.820 rows=1627 loops=1)
Index Cond: ((active = true) AND (assigned_to_user_id = 136964))
Filter: active
-> Index Scan using appointment_datepatient_idx on appointment _appt (cost=0.43..14.75 rows=4 width=8) (actual time=0.543..0.550 rows=1 loops=1627)
Index Cond: ((patient_id = _pt.id) AND (datecomplete IS NULL))
-> Index Scan using patientchartnote_activeappointment_idx on patientchartnote _pcn (cost=0.43..1.06 rows=1 width=8) (actual time=0.014..0.014 rows=0 loops=1649)
Index Cond: ((active = true) AND (createdby_id = 136964) AND (appointment_id = _appt.id) AND (title IS NOT NULL))
Filter: (active AND ((title)::text <> ''::text))
Planning time: 1.489 ms
Execution time: 937.910 ms
Using OR in SQL queries usually results in bad performance.
That is because – different from AND – it does not restrict, but extend the number of rows in the query result. With AND, you can use an index scan for one part of the condition and further restrict the result set with a filter on the second condition. That is not possible with OR.
So PostgreSQL does the only thing left: it computes the whole join and then filters out all rows that do not match the condition. Of course that is very inefficient when you are joining three tables (I didn't count the outer join).
Assuming that all columns called id are primary keys, you could rewrite the query as follows:
SELECT count(*) FROM
(SELECT _pcn.id
FROM patientchartnote _pcn
INNER JOIN appointment _appt
ON _appt.id = _pcn.appointment_id
INNER JOIN patient _pt
ON _pt.id = _appt.patient_id
LEFT OUTER JOIN person _ps
ON _ps.id = _pt.appuser_id
WHERE _pcn.active = true
AND _pt.active = true
AND _appt.datecomplete IS NULL
AND _pcn.title IS NOT NULL
AND _pcn.title <> ''
AND _pt.assigned_to_user_id = '136964'
UNION
SELECT _pcn.id
FROM patientchartnote _pcn
INNER JOIN appointment _appt
ON _appt.id = _pcn.appointment_id
INNER JOIN patient _pt
ON _pt.id = _appt.patient_id
LEFT OUTER JOIN person _ps
ON _ps.id = _pt.appuser_id
WHERE _pcn.active = true
AND _pt.active = true
AND _appt.datecomplete IS NULL
AND _pcn.title IS NOT NULL
AND _pcn.title <> ''
AND _pcn.createdby_id = '136964'
) q;
Even though this is running the query twice, indexes can be used to filter out all but a few rows early on, so this query should perform better.
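For the UNION halves to be cheap, each branch needs an index that finds its few rows directly; the second plan in the question suggests suitable indexes already exist. A sketch of what they could look like (names and the partial-index predicate are illustrative):
CREATE INDEX ON patient (assigned_to_user_id) WHERE active;
CREATE INDEX ON patientchartnote (createdby_id) WHERE active;
Also note that UNION (rather than UNION ALL) matters here: a note can satisfy both branches at once (created by the user for a patient assigned to that same user), and UNION removes such duplicates before counting.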