optimize Query in PostgreSQL - postgresql

SELECT count(*)
FROM contacts_lists
JOIN plain_contacts
ON contacts_lists.contact_id = plain_contacts.contact_id
JOIN contacts
ON contacts.id = plain_contacts.contact_id
WHERE plain_contacts.has_email
AND NOT contacts.email_bad
AND NOT contacts.email_unsub
AND contacts_lists.list_id =67339
how can i optimize this query.. could you please explain...

Reformatting your query plan for clarity:
QUERY PLAN Aggregate (cost=126377.96..126377.97 rows=1 width=0)
-> Hash Join (cost=6014.51..126225.38 rows=61033 width=0)
Hash Cond: (contacts_lists.contact_id = plain_contacts.contact_id)
-> Hash Join (cost=3067.30..121828.63 rows=61033 width=8)
Hash Cond: (contacts_lists.contact_id = contacts.id)
-> Index Scan using index_contacts_lists_on_list_id_and_contact_id
on contacts_lists (cost=0.00..116909.97 rows=61033 width=4)
Index Cond: (list_id = 66996)
-> Hash (cost=1721.41..1721.41 rows=84551 width=4)
-> Seq Scan on contacts (cost=0.00..1721.41 rows=84551 width=4)
Filter: ((NOT email_bad) AND (NOT email_unsub))
-> Hash (cost=2474.97..2474.97 rows=37779 width=4)
-> Seq Scan on plain_contacts (cost=0.00..2474.97 rows=37779 width=4)
Filter: has_email
Two partial indexes might eliminate seq scans depending on your data distribution:
-- if many contacts have bad emails or are unsubscribed:
CREATE INDEX contacts_valid_email_idx ON contacts (id)
WHERE (NOT email_bad AND NOT email_unsub);
-- if many contacts have no email:
CREATE INDEX plain_contacts_valid_email_idx ON plain_contacts (id)
WHERE (has_email);
You might be missing an index on a foreign key:
CREATE INDEX plain_contacts_contact_id_idx ON plain_contacts (contact_id);
Last but not least if you've never analyzed your data, you need to run:
VACUUM ANALYZE;
If it's still slow once all that is done, there isn't much you can do short of merging your plain_contacts and your contacts tables: getting the above query plan in spite of the above indexes means most/all of your subscribers are subscribed to that particular list -- in which case the above query plan is the fastest you'll get.

This is already a very simple query that the database will run in the most efficient way providing that statistics are up to date
So in terms of the query itself there's not much to do.
In terms of database administration you can add indexes - there should be indexes in the database for all the join conditions and also for the most selective part of the where clause (list_id, contact_id as FK in plain_contacts and contacts_lists). This is the most significant opportunity to improve performance of this query (orders of magnitude). Still as SpliFF notes, you probably already have those indexes, so check.
Also, postgres has good explain command that you should learn and use. It will help with optimizing queries.

Since you only want to inlude rows that has some flags set in the joined tables, I would move that statements into the join clause:
SELECT count(*)
FROM contacts_lists
JOIN plain_contacts
ON contacts_lists.contact_id = plain_contacts.contact_id
AND NOT plain_contacts.has_email
JOIN contacts
ON contacts.id = plain_contacts.contact_id
AND NOT contacts.email_unsub
AND NOT contacts.email_bad
WHERE contacts_lists.list_id =67339
I'm not sure if this would make a great impact on performance, but worth a try. You should probably have indexes on the joined tables as well for optimal performance, like this:
plain_contacts: contact_id, has_email
contacts: id, email_unsub, email_bad

Have you run ANALYZE on the database recently? Do the row counts in the EXPLAIN plan look like they make sense? (Looks like you ran only EXPLAIN. EXPLAIN ANALYZE gives both estimated and actual timings.)

You can use SELECT count(1) ... but other than that I'd say it looks fine. You could always cache some parts of the query using views or put indexes on contact_id and list_id if you're really struggling (I assume you have one on id already).

Related

Postges makes Seq Scan instead of Index Only Scan for a very simple SELECT x ... ORDER BY x

I have a table union_events with 17M rows and 102 columns in Postgres 10. I run the commands:
CREATE INDEX union_events_index ON temp_schema_to_delete.union_events(id)
ANALYZE temp_schema_to_delete.union_events
EXPLAIN SELECT id FROM temp_schema_to_delete.union_events ORDER BY id
and get the following results:
Sort (cost=3614290.72..3658708.19 rows=17766988 width=4)
Sort Key: id
-> Seq Scan on union_events (cost=0.00..1474905.88 rows=17766988 width=4)
id is some non-null and non-unique integer field.
I expect that my index will be used and I haven't to sort the table once more.
I made a quick test:
SELECT s INTO temp_schema_to_delete.test FROM generate_series(0, 10000000) AS s
CREATE INDEX test_index ON temp_schema_to_delete.test(s)
ANALYZE temp_schema_to_delete.test
EXPLAIN SELECT s FROM temp_schema_to_delete.test ORDER BY s
It gets:
Index Only Scan using test_index on test (cost=0.43..303940.15 rows=10000048 width=4)
It seems okay.
What is wrong with my first table or query? Why the index on id is not used?
As #joop recommended, I've made a VACUUM for the table.
It resulted in a better plan that uses my index.
For me still it's not clear, how VACUUM could help with repsect to the fact that I've created this table using SELECT ... INTO TABLE and haven't done any UPDATEs or DELETEs.
It would be great if somebody could explain this in other answer, maybe there exists some effective solution then VACUUM.

Postgres doing a sort on simple join

I have two tables in my database (address and person_address). Address has a PK in address_id. person_address has a PK on (address_id, person_id, usage_code)
When joining this two tables through the address_id, my expectation is that the PK index is used on both cases. However, Postgres is adding sort and materialize steps to the plan, which slows down the execution of the query. I have tried dropping indexes (person_address had an index on address_id), analyzing stats, without success.
I will appreciate any help on how to isolate this situation since those queries run slower than expected on our production environment
This is the query:
select *
from person_addresses pa
join address a
on pa.address_id = a.address_id
This is the plan :
Merge Join (cost=1506935.96..2416648.39 rows=16033774 width=338)
Merge Cond: (pa.address_id = ((a.address_id)::numeric))
-> Index Scan using person_addresses_pkey on person_addresses pa (cost=0.43..592822.76 rows=5256374 width=104)
-> Materialize (cost=1506935.53..1526969.90 rows=4006874 width=234)
-> Sort (cost=1506935.53..1516952.71 rows=4006874 width=234)
Sort Key: ((a.address_id)::numeric)
-> Seq Scan on address a (cost=0.00..163604.74 rows=4006874 width=234)
Thanks.
Edit 1. After the comment checked the data types and found a discrepancy. Fixing the data type changed the plan to the following
Hash Join (cost=343467.18..881125.47 rows=5256374 width=348)
Hash Cond: (pa.address_id = a.address_id)
-> Seq Scan on person_addresses pa (cost=0.00..147477.74 rows=5256374 width=104)
-> Hash (cost=159113.97..159113.97 rows=4033697 width=244)
-> Seq Scan on address_normalization a (cost=0.00..159113.97 rows=4033697 width=244)
Performance improvement is evident on the plan, but am wondering if the sequential scans are expected without any filters
So there are two questions here:
why did Postgres choose the (expensive) "Merge Join" in the first query?
The reason for this is that it could not use the more efficient "Hash Join" because the hash values of integer and numeric values would be different. But the Merge join requires that the values are sorted, and that's where the "Sort" step comes from in the first execution plan. Given the number of rows a "Nested Loop" would have been even more expensive.
The second question is:
I am wondering if the sequential scans are expected without any filters
Yes they are expected. The query retrieves all matching rows from both tables and that is done most efficiently by scanning all rows. An index scan requires about 2-3 I/O operations per row that has to be retrieved. A sequential scan usually requires less than one I/O operation as one block (which is the smallest unit the database reads from the disk) contains multiple rows.
You can run explain (analyze, buffers) to see how much "logical reads" each step takes.

Database where filter using compound index

I've a like object that can be used to keep likes to any other db table, each represented by a content_type_id
SELECT *
from like LEFT OUTER JOIN "accounts_user"
ON ("like"."user_id" = "accounts_user"."id")
WHERE ("like"."object_id" = %s AND "like"."content_type_id" = %s AND "like"."like" = %s)`
like is a Boolean field. What would be the best indices (indexes) for this database table?
Operations are, checking if a user liked an object, and checking which users liked which objects.
My current indices are
CREATE INDEX like_417f1b1d ON like USING btree (content_type_id)
CREATE INDEX like_e8701ad5 ON like USING btree (user_id)
CREATE UNIQUE INDEX like_user_id_uniq ON like USING btree (user_id, content_type_id, object_id
)
And the query plan is:
Query plan -> Nested Loop Left Join (cost=26563.29..168218.18 rows=1 width=1383)
Query plan -> Bitmap Heap Scan on like (cost=26562.86..168209.72 rows=1 width=29)
Query plan Recheck Cond: ?
Query plan Filter: ?
Query plan -> Bitmap Index Scan on like_417f1b1d (cost=0.00..26562.86 rows=1271257 width=0)
Query plan Index Cond: ?
Query plan -> Index Scan using accounts_user_pkey on accounts_user (cost=0.43..8.45 rows=1 width=1354)
Query plan Index Cond: ?
Can you spot any irregularities? Because this is taking way longer than expected. (1.5 seconds)
I assume since the indices I have are not the exact WHERE clause this might cause a slowdown but DB engine should be smart enough to handle it, so that's why I am here.
I would recommend
CREATE INDEX ON like(object_id);
This assumes that object_id is selective (there are many different object_ids in the table) and content_type_id and like are not selective.
If there are many different values for content_type_id, add it to the index.
If like is not evenly distributed (e.g. 99% of all values ate TRUE), it might also help to add it to the index.
I'd like to mention that posting a truncated execution plan like that does not help at all. We need the complete thing.

Query simplification based on selected columns

I'm trying to understand how PostgreSQL simplifies a query: let's say I have 2 tables ("tb_thing" and "tb_thing_template"), where each thing points to a template, and that I run a query like this:
EXPLAIN SELECT
tb_thing.id
FROM
tb_thing,
tb_thing_template
WHERE
tb_thing_template.id = tb_thing.template_id
;
This is the result:
QUERY PLAN
---------------------------------------------------------------------------------
Hash Join (cost=34.75..64.47 rows=788 width=4)
Hash Cond: (tb_thing.template_id = tb_thing_template.id)
-> Seq Scan on tb_thing (cost=0.00..18.88 rows=788 width=8)
-> Hash (cost=21.00..21.00 rows=1100 width=4)
-> Seq Scan on tb_thing_template (cost=0.00..21.00 rows=1100 width=4)
The planner is joining the two tables even if I'm just selecting one field from "tb_thing" and nothing from "tb_thing_template". I was hoping the planner was smart enough to figure out it didn't need to actually join the "tb_thing_template" table because I'm not selecting anything from it.
Why does it do the join anyway? Why isn't the column selection taken into account when the query is planned?
Thanks!
Semantically your query and a simple SELECT tb_thing.id FROM tb_thing are not the same.
Assume, for instance, that table tb_thing_template has 4 rows with an identical id value that is also a tb_thing.template_id. The result of your query will then have 4 rows with the same tb_thing.id. Inversely, if a tb_thing.template_id is not present in tb_thing_template.id then that row will not be output.
Only when tb_thing_template.id is a PRIMARY KEY (so unique) and tb_thing.template_id is a FOREIGN KEY to that id with just a single row for each PRIMARY KEY, so a 1:1 relationship, are both queries semantically the same. Even a 1:N relationship, which is more typical in a PK-FK relationship, would require the join in a semantic sense. But the planner has no way of knowing if the relationship is 1:1, so you get the join.
But you should not try to spoof the query planner; it is smart, but not necessarily smarter than you (might be) dumb.

Postgres combining multiple Indexes

I have the following table/indexes -
CREATE TABLE test
(
coords geography(Point,4326),
user_id varchar(50),
created_at timestamp
);
CREATE INDEX ix_coords ON test USING GIST (coords);
CREATE INDEX ix_user_id ON test (user_id);
CREATE INDEX ix_created_at ON test (created_at DESC);
This is the query I want to execute:
select *
from updates
where ST_DWithin(coords, ST_MakePoint(-126.4, 45.32)::geography, 30000)
and user_id='3212312'
order by created_at desc
limit 60
When I run the query it only uses ix_coords index. How can I ensure that Postgres uses ix_user_id and ix_created_at index as well for the query?
This is a new table in which I did bulk insert of production data. Total rows in the test table: 15,069,489
I am running PostgreSQL 9.2.1 (with Postgis) with (effective_cache_size = 2GB). This is my local OSX with 16GB RAM, Core i7/2.5 GHz, non-SSD disk.
Adding the EXPLAIN ANALYZE output -
Limit (cost=71.64..71.65 rows=1 width=280) (actual time=1278.652..1278.665 rows=60 loops=1)
-> Sort (cost=71.64..71.65 rows=1 width=280) (actual time=1278.651..1278.662 rows=60 loops=1)
Sort Key: created_at
Sort Method: top-N heapsort Memory: 33kB
-> Index Scan using ix_coords on test (cost=0.00..71.63 rows=1 width=280) (actual time=0.198..1278.227 rows=178 loops=1)
Index Cond: (coords && '0101000020E61000006666666666E63C40C3F5285C8F824440'::geography)
Filter: (((user_id)::text = '4f1092000b921a000100015c'::text) AND ('0101000020E61000006666666666E63C40C3F5285C8F824440'::geography && _st_expand(coords, 30000::double precision)) AND _st_dwithin(coords, '0101000020E61000006666666666E63C40C3F5285C8F824440'::geography, 30000::double precision, true))
Rows Removed by Filter: 3122459
Total runtime: 1278.701 ms
UPDATE:
Based on the suggestions below I tried index on cords + user_id:
CREATE INDEX ix_coords_and_user_id ON updates USING GIST (coords, user_id);
..but get the following error:
ERROR: data type character varying has no default operator class for access method "gist"
HINT: You must specify an operator class for the index or define a default operator class for the data type.
UPDATE:
So the CREATE EXTENSION btree_gist; solved the btree/gist compound index issue. And now my index looks like
CREATE INDEX ix_coords_user_id_created_at ON test USING GIST (coords, user_id, created_at);
NOTE: btree_gist does not accept DESC/ASC.
New query plan:
Limit (cost=134.99..135.00 rows=1 width=280) (actual time=273.282..273.292 rows=60 loops=1)
-> Sort (cost=134.99..135.00 rows=1 width=280) (actual time=273.281..273.285 rows=60 loops=1)
Sort Key: created_at
Sort Method: quicksort Memory: 41kB
-> Index Scan using ix_updates_coords_user_id_created_at on updates (cost=0.00..134.98 rows=1 width=280) (actual time=0.406..273.110 rows=115 loops=1)
Index Cond: ((coords && '0101000020E61000006666666666E63C40C3F5285C8F824440'::geography) AND ((user_id)::text = '4e952bb5b9a77200010019ad'::text))
Filter: (('0101000020E61000006666666666E63C40C3F5285C8F824440'::geography && _st_expand(coords, 30000::double precision)) AND _st_dwithin(coords, '0101000020E61000006666666666E63C40C3F5285C8F824440'::geography, 30000::double precision, true))
Rows Removed by Filter: 1
Total runtime: 273.331 ms
The query is performing better than before, almost a second better but still not great. I guess this is the best that I can get?? I was hoping somewhere around 60-80ms. Also taking order by created_at desc from the query, shaves off another 100ms, meaning it is unable to use the index. Anyway to fix this?
I don't know if Pg can combine a GiST index and regular b-tree indexes with a bitmap index scan, but I suspect not. You may be getting the best result you can without adding a user_id column to your GiST index (and consequently making it bigger and slower for other queries that don't use user_id).
As an experiment you could:
CREATE EXTENSION btree_gist;
CREATE INDEX ix_coords_and_user_id ON test USING GIST (coords, user_id);
which is likely to result in a big index, but might boost that query - if it works. Be aware that maintaining such an index will significantly slow INSERT and UPDATEs. If you drop the old ix_coords your queries will use ix_coords_and_user_id even if they don't filter on user_id, but it'll be slower than ix_coords. Keeping both will make the INSERT and UPDATE slowdown even worse.
See btree-gist
(Obsoleted by edit to question that changes the question completely; when written the user had a multicolumn index they've now split into two separate ones):
You don't seem to be filtering or sorting on user_id, only create_date. Pg won't (can't?) use only the second term of a multi-column index like (user_id, create_date), it needs use of the first item too.
If you want to index create_date, create a separate index for it. If you use and need the (user_id, create_date) index and don't generally use just user_id alone, see if you can reverse the column order. Alternately create two independent indexes, (user_id) and (create_date). When both columns are needed Pg can combine the two indepependent indexes using a bitmap index scan.
I think Craig is correct with his answer, but I just wanted to add a few things (and it wouldn't fit in a comment)
You have to work pretty hard to force PostgreSQL to use an index. The Query optimizer is smart and there are times where it will believe that a sequential table scan will be faster. It is usually right! :) But, you can play with some settings (such as seq_page_cost, random_page_cost, etc) you can play with to try and get it to favor an index. Here is a link to some of the configurations that you might want to examine if you feel like it is not making the correct decision. But, again... my experience is that most of the time, Postgres is smarter than I am! :)
Hope this helps you (or someone in the future).