Postgres doing a sort on simple join - postgresql

I have two tables in my database (address and person_address). address has a PK on address_id; person_address has a PK on (address_id, person_id, usage_code).
When joining these two tables on address_id, my expectation is that the PK index is used in both cases. However, Postgres is adding Sort and Materialize steps to the plan, which slows down the execution of the query. I have tried dropping indexes (person_address had an index on address_id) and analyzing stats, without success.
I would appreciate any help in isolating this issue, since these queries run slower than expected in our production environment.
This is the query:
select *
from person_addresses pa
join address a
on pa.address_id = a.address_id
This is the plan:
Merge Join  (cost=1506935.96..2416648.39 rows=16033774 width=338)
  Merge Cond: (pa.address_id = ((a.address_id)::numeric))
  ->  Index Scan using person_addresses_pkey on person_addresses pa  (cost=0.43..592822.76 rows=5256374 width=104)
  ->  Materialize  (cost=1506935.53..1526969.90 rows=4006874 width=234)
        ->  Sort  (cost=1506935.53..1516952.71 rows=4006874 width=234)
              Sort Key: ((a.address_id)::numeric)
              ->  Seq Scan on address a  (cost=0.00..163604.74 rows=4006874 width=234)
Thanks.
Edit 1: After the comment, I checked the data types and found a discrepancy. Fixing the data type changed the plan to the following:
Hash Join  (cost=343467.18..881125.47 rows=5256374 width=348)
  Hash Cond: (pa.address_id = a.address_id)
  ->  Seq Scan on person_addresses pa  (cost=0.00..147477.74 rows=5256374 width=104)
  ->  Hash  (cost=159113.97..159113.97 rows=4033697 width=244)
        ->  Seq Scan on address_normalization a  (cost=0.00..159113.97 rows=4033697 width=244)
The performance improvement is evident in the plan, but I am wondering whether the sequential scans are expected without any filters.

So there are two questions here:
Why did Postgres choose the (expensive) "Merge Join" in the first query?
The reason is that it could not use the more efficient "Hash Join", because the hash values of integer and numeric values are different. A merge join, however, requires that the values are sorted, and that's where the "Sort" step in the first execution plan comes from. Given the number of rows, a "Nested Loop" would have been even more expensive.
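The kind of fix described in the question's edit could look roughly like this. This is only a sketch: the cast in the merge condition suggests person_addresses.address_id was numeric while address.address_id was integer, but verify your actual column types (and that no values would be lost in the conversion) before running anything like it:
-- align the join columns so both sides produce the same hash values
ALTER TABLE person_addresses
    ALTER COLUMN address_id TYPE integer USING address_id::integer;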
The second question is:
I am wondering if the sequential scans are expected without any filters
Yes, they are expected. The query retrieves all matching rows from both tables, and that is done most efficiently by scanning all rows. An index scan requires about 2-3 I/O operations per row that has to be retrieved. A sequential scan usually requires less than one I/O operation per row, as one block (the smallest unit the database reads from the disk) contains multiple rows.
You can run explain (analyze, buffers) to see how many "logical reads" each step performs.
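For example, using the tables from the question, something like this will report buffer hits and reads per plan node:
explain (analyze, buffers)
select *
from person_addresses pa
join address a on pa.address_id = a.address_id;
-- look at the "Buffers: shared hit=... read=..." lines under each node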

Related

What is the proper postgresql index for listing all distinct json array values?

I have the following query
select distinct c1::text
from (select json_array_elements((value::jsonb -> 'boundaries')::json) as c1 from geoinfo) t1;
And I get this query plan:
QUERY PLAN
----------------------------------------------------------------------------------------------------
HashAggregate  (cost=912918.87..912921.87 rows=200 width=32)
  Group Key: (t1.c1)::text
  ->  Subquery Scan on t1  (cost=1000.00..906769.25 rows=2459849 width=32)
        ->  Gather  (cost=1000.00..869871.51 rows=2459849 width=32)
              Workers Planned: 2
              ->  ProjectSet  (cost=0.00..622886.61 rows=102493700 width=32)
                    ->  Parallel Seq Scan on geoinfo  (cost=0.00..89919.37 rows=1024937 width=222)
There are ~500 rows returned from a table with 2.5 million rows.
What index can I create that will cause this query to execute much faster?
I tried the somewhat obvious, and it didn't work:
# create index gin_boundaries_array on geoinfo using gin (json_array_elements((value::jsonb -> 'boundaries')::json));
ERROR: set-returning functions are not allowed in index expressions
LINE 1: ... index gin_boundaries_array on geoinfo using gin (json_array...
+1000 on Bergi's comment. JSON is a fantastic interchange format. But for searchable storage? It's an edge-case that's become mainstream. Doesn't mean that it's always ill-advised, but when you start having to do joins against nested/embedded elements, spend a lot of mental bandwidth (and syntactic sugar) to get things done, etc., it's worth asking "is the convenience of stuffing things into JSON outweighed by the cost and hassle of seeing inside the objects, and getting things out?"
Specifically in Postgres, you can index JSON elements, but the planner cannot maintain useful statistics in the way that it can on full columns. (As I understand it, I haven't tested this out...I use JSON(B) for raw storage of JSON and search by other columns.)
As you may already know, Postgres has a lot of handy utilities for dealing with JSON. I have one JSONB field that I expand using jsonb_to_recordset and a cross join. If you're in the mood, it would be worth setting up a test database, unpacking the JSON into a real table, and then trying your queries against that. See for yourself whether it's a better fit for your needs or not.
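A rough sketch of that normalization, assuming geoinfo has a primary key id (the table and index names below are made up):
-- one row per boundary element, materialized once
create table geoinfo_boundaries as
select g.id as geoinfo_id,
       jsonb_array_elements_text(g.value::jsonb -> 'boundaries') as boundary
from geoinfo g;

create index geoinfo_boundaries_boundary_idx on geoinfo_boundaries (boundary);

-- the original DISTINCT then becomes a query the planner has real statistics for
select distinct boundary from geoinfo_boundaries;
The trade-off is keeping the side table in sync with geoinfo (a trigger or a periodic rebuild), but in return you get ordinary column statistics and a usable index.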

Query simplification based on selected columns

I'm trying to understand how PostgreSQL simplifies a query: let's say I have 2 tables ("tb_thing" and "tb_thing_template"), where each thing points to a template, and that I run a query like this:
EXPLAIN SELECT
tb_thing.id
FROM
tb_thing,
tb_thing_template
WHERE
tb_thing_template.id = tb_thing.template_id
;
This is the result:
QUERY PLAN
---------------------------------------------------------------------------------
Hash Join  (cost=34.75..64.47 rows=788 width=4)
  Hash Cond: (tb_thing.template_id = tb_thing_template.id)
  ->  Seq Scan on tb_thing  (cost=0.00..18.88 rows=788 width=8)
  ->  Hash  (cost=21.00..21.00 rows=1100 width=4)
        ->  Seq Scan on tb_thing_template  (cost=0.00..21.00 rows=1100 width=4)
The planner is joining the two tables even though I'm only selecting one field from "tb_thing" and nothing from "tb_thing_template". I was hoping the planner would be smart enough to figure out that it doesn't need to actually join the "tb_thing_template" table, because I'm not selecting anything from it.
Why does it do the join anyway? Why isn't the column selection taken into account when the query is planned?
Thanks!
Semantically your query and a simple SELECT tb_thing.id FROM tb_thing are not the same.
Assume, for instance, that table tb_thing_template has 4 rows with an identical id value that is also a tb_thing.template_id. The result of your query will then have 4 rows with the same tb_thing.id. Conversely, if a tb_thing.template_id is not present in tb_thing_template.id, then that row will not be output.
Only when tb_thing_template.id is a PRIMARY KEY (so unique) and tb_thing.template_id is a FOREIGN KEY to that id, with just a single row for each PRIMARY KEY (a 1:1 relationship), are both queries semantically the same. Even a 1:N relationship, which is more typical in a PK-FK relationship, would require the join in a semantic sense. But the planner does not use that knowledge to remove inner joins, so you get the join.
But you should not try to spoof the query planner; it is smart, but not necessarily smart enough to make up for a query that asks for more work than you actually need.
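For completeness: PostgreSQL can remove a join it can prove changes nothing, but only for LEFT JOINs whose inner side is provably unique and whose columns are not referenced. A sketch, assuming tb_thing_template.id has a primary key or unique constraint:
EXPLAIN SELECT
    tb_thing.id
FROM
    tb_thing
    LEFT JOIN tb_thing_template
        ON tb_thing_template.id = tb_thing.template_id;
-- every tb_thing row is returned exactly once regardless of the template table,
-- so the planner can drop the join and scan tb_thing alone
With the inner join in your query, the planner cannot do that, for the semantic reasons above.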

Group by too slow on Amazon RDS Postgres

I am running Postgres 9.4.4 on an Amazon RDS db.r3.4xlarge instance (16 vCPUs, 122 GB memory).
I recently came across a query which needed a fairly straightforward aggregation on a large table (~270 million records). The query takes over 5 hours to execute.
The joining column and the grouping column on the large table have indexes defined. I have tried experimenting with work_mem and temp_buffers by setting each to 1GB, but it didn't help much.
Here's the query and the execution plan. Any leads will be highly appreciated.
explain SELECT
largetable.column_group,
MAX(largetable.event_captured_dt) AS last_open_date,
.....
FROM largetable
LEFT JOIN smalltable
ON smalltable.column_b = largetable.column_a
WHERE largetable.column_group IS NOT NULL
GROUP BY largetable.column_group
Here is the execution plan -
GroupAggregate  (cost=699299968.28..954348399.96 rows=685311 width=38)
  Group Key: largetable.column_group
  ->  Sort  (cost=699299968.28..707801354.23 rows=3400554381 width=38)
        Sort Key: largetable.column_group
        ->  Merge Left Join  (cost=25512.78..67955201.22 rows=3400554381 width=38)
              Merge Cond: (largetable.column_a = smalltable.column_b)
              ->  Index Scan using xcrmstg_largetable_launch_id on largetable  (cost=0.57..16241746.24 rows=271850823 width=34)
                    Filter: (column_a IS NOT NULL)
              ->  Sort  (cost=25512.21..26127.21 rows=246000 width=4)
                    Sort Key: smalltable.column_b
                    ->  Seq Scan on smalltable  (cost=0.00..3485.00 rows=246000 width=4)
You say the joining key and the grouping key on the large table are indexed, but you don't mention the joining key on the small table.
The merges and sorts are a big source of slowness. However, I'm also concerned that you're returning ~700,000 rows of data. Is that really useful to you? What's the situation where you need to return that much data, but a 5-hour wait is too long? If you don't need all that data coming out, then filtering as early as possible is far and away the largest speed gain you'll realize.
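If the small table's joining key really is unindexed, a first (hedged) step would be to add one and re-analyze; table and column names below are taken from the plan:
-- lets the planner read smalltable in join-key order instead of sorting it first
CREATE INDEX ON smalltable (column_b);
ANALYZE smalltable;
That alone won't remove the huge sort above the join, though; only reducing the number of rows that reach the GROUP BY (or the number of groups returned) will.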

Optimize query in PostgreSQL

SELECT count(*)
FROM contacts_lists
JOIN plain_contacts
ON contacts_lists.contact_id = plain_contacts.contact_id
JOIN contacts
ON contacts.id = plain_contacts.contact_id
WHERE plain_contacts.has_email
AND NOT contacts.email_bad
AND NOT contacts.email_unsub
AND contacts_lists.list_id = 67339
How can I optimize this query? Could you please explain?
Reformatting your query plan for clarity:
QUERY PLAN
Aggregate  (cost=126377.96..126377.97 rows=1 width=0)
  ->  Hash Join  (cost=6014.51..126225.38 rows=61033 width=0)
        Hash Cond: (contacts_lists.contact_id = plain_contacts.contact_id)
        ->  Hash Join  (cost=3067.30..121828.63 rows=61033 width=8)
              Hash Cond: (contacts_lists.contact_id = contacts.id)
              ->  Index Scan using index_contacts_lists_on_list_id_and_contact_id on contacts_lists  (cost=0.00..116909.97 rows=61033 width=4)
                    Index Cond: (list_id = 66996)
              ->  Hash  (cost=1721.41..1721.41 rows=84551 width=4)
                    ->  Seq Scan on contacts  (cost=0.00..1721.41 rows=84551 width=4)
                          Filter: ((NOT email_bad) AND (NOT email_unsub))
        ->  Hash  (cost=2474.97..2474.97 rows=37779 width=4)
              ->  Seq Scan on plain_contacts  (cost=0.00..2474.97 rows=37779 width=4)
                    Filter: has_email
Two partial indexes might eliminate seq scans depending on your data distribution:
-- if many contacts have bad emails or are unsubscribed:
CREATE INDEX contacts_valid_email_idx ON contacts (id)
WHERE (NOT email_bad AND NOT email_unsub);
-- if many contacts have no email:
CREATE INDEX plain_contacts_valid_email_idx ON plain_contacts (id)
WHERE (has_email);
You might be missing an index on a foreign key:
CREATE INDEX plain_contacts_contact_id_idx ON plain_contacts (contact_id);
Last but not least if you've never analyzed your data, you need to run:
VACUUM ANALYZE;
If it's still slow once all that is done, there isn't much you can do short of merging your plain_contacts and your contacts tables: getting the above query plan in spite of the above indexes means most/all of your subscribers are subscribed to that particular list -- in which case the above query plan is the fastest you'll get.
This is already a very simple query that the database will run in the most efficient way, provided that the statistics are up to date.
So in terms of the query itself, there's not much to do.
In terms of database administration you can add indexes - there should be indexes for all the join conditions, and also for the most selective part of the WHERE clause (list_id, and contact_id as a FK in plain_contacts and contacts_lists). This is the most significant opportunity to improve the performance of this query (orders of magnitude). Still, as SpliFF notes, you probably already have those indexes, so check.
Also, Postgres has a good EXPLAIN command that you should learn and use; it will help with optimizing queries.
Since you only want to include rows that have certain flags set in the joined tables, I would move those conditions into the join clauses:
SELECT count(*)
FROM contacts_lists
JOIN plain_contacts
ON contacts_lists.contact_id = plain_contacts.contact_id
AND plain_contacts.has_email
JOIN contacts
ON contacts.id = plain_contacts.contact_id
AND NOT contacts.email_unsub
AND NOT contacts.email_bad
WHERE contacts_lists.list_id =67339
I'm not sure if this will make a great impact on performance, but it's worth a try. You should probably have indexes on the joined tables as well for optimal performance, like this:
plain_contacts: contact_id, has_email
contacts: id, email_unsub, email_bad
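Spelled out as DDL, one literal reading of those suggestions would be the following (index names are made up; the partial indexes shown earlier in this thread are often a better fit for boolean flags, so treat these as an alternative sketch):
CREATE INDEX plain_contacts_contact_id_has_email_idx
    ON plain_contacts (contact_id, has_email);
CREATE INDEX contacts_id_email_flags_idx
    ON contacts (id, email_unsub, email_bad);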
Have you run ANALYZE on the database recently? Do the row counts in the EXPLAIN plan look like they make sense? (Looks like you ran only EXPLAIN. EXPLAIN ANALYZE gives both estimated and actual timings.)
You can use SELECT count(1) ... but other than that I'd say it looks fine. You could always cache some parts of the query using views or put indexes on contact_id and list_id if you're really struggling (I assume you have one on id already).

How can I "think better" when reading a PostgreSQL query plan?

I spent over an hour today puzzling over a query plan that I couldn't understand. The query was an UPDATE and it just wouldn't run at all - totally stuck, and pg_locks showed it wasn't waiting for anything either. Now, I don't consider myself the best or worst query plan reader, but I find this one exceptionally difficult. I'm wondering: how does one read these? Is there a methodology that the Pg aces follow in order to pinpoint the error?
I plan on asking another question as to how to work around this issue, but right now I'm speaking specifically about how to read these types of plans.
QUERY PLAN
--------------------------------------------------------------------------------------------
Nested Loop Anti Join  (cost=47680.88..169413.12 rows=1 width=77)
  Join Filter: ((co.fkey_style = v.chrome_styleid) AND (co.name = o.name))
  ->  Nested Loop  (cost=5301.58..31738.10 rows=1 width=81)
        ->  Hash Join  (cost=5301.58..29722.32 rows=229 width=40)
              Hash Cond: ((io.lot_id = iv.lot_id) AND ((io.vin)::text = (iv.vin)::text))
              ->  Seq Scan on options io  (cost=0.00..20223.32 rows=23004 width=36)
                    Filter: (name IS NULL)
              ->  Hash  (cost=4547.33..4547.33 rows=36150 width=24)
                    ->  Seq Scan on vehicles iv  (cost=0.00..4547.33 rows=36150 width=24)
                          Filter: (date_sold IS NULL)
        ->  Index Scan using options_pkey on options co  (cost=0.00..8.79 rows=1 width=49)
              Index Cond: ((co.fkey_style = iv.chrome_styleid) AND (co.code = io.code))
  ->  Hash Join  (cost=42379.30..137424.09 rows=16729 width=26)
        Hash Cond: ((v.lot_id = o.lot_id) AND ((v.vin)::text = (o.vin)::text))
        ->  Seq Scan on vehicles v  (cost=0.00..4547.33 rows=65233 width=24)
        ->  Hash  (cost=20223.32..20223.32 rows=931332 width=44)
              ->  Seq Scan on options o  (cost=0.00..20223.32 rows=931332 width=44)
(17 rows)
The issue with this query plan - which I believe I understand - is probably best explained by RhodiumToad (he is definitely better at this, so I'll bet on his explanation being better) on irc://irc.freenode.net/#postgresql:
oh, that plan is potentially disastrous
the problem with that plan is that it's running a hugely expensive hashjoin for each row
the problem is the rows=1 estimate from the other join and
the planner thinks it's ok to put a hugely expensive query in the inner path of a nestloop where the outer path is estimated to return only one row.
since, obviously, by the planner's estimate the expensive part will only be run once
but this has an obvious tendency to really mess up in practice
the problem is that the planner believes its own estimates
ideally, the planner needs to know the difference between "estimated to return 1 row" and "not possible to return more than 1 row"
but it's not at all clear how to incorporate that into the existing code
He goes on to say:
it can affect any join, but usually joins against subqueries are the most likely
Now, when I read this plan, the first thing I noticed was the Nested Loop Anti Join, which had a cost of 169,413 (I'll stick to upper bounds). This anti-join breaks down into the result of a Nested Loop at a cost of 31,738 and the result of a Hash Join at a cost of 137,424. Since 137,424 is much greater than 31,738, I figured the problem was the Hash Join.
Then I proceeded to EXPLAIN ANALYZE the Hash Join segment outside of the query. It executed in 7 seconds. I made sure there were indexes on (lot_id, vin) and (co.code, v.code) -- there were. I disabled seq_scan and hashjoin individually and noticed a speed increase of less than 2 seconds -- not nearly enough to account for why it wasn't progressing after an hour.
But after all this, I was totally wrong! Yes, it was the slower part of the query, but the real problem was the rows=1 estimate (I presume on the Nested Loop Anti Join). Is this a bug (or a lack of ability) in the planner, mis-estimating the number of rows? How am I supposed to read the plan to come to the same conclusion RhodiumToad did?
Is it simply the rows=1 that is supposed to tip me off?
I did run VACUUM FULL ANALYZE on all of the tables involved, and this is PostgreSQL 8.4.
Seeing through issues like this requires some experience with where things can go wrong. But to find issues in query plans, try to validate the produced plan from the inside out: check whether the row-count estimates are sane and whether the cost estimates match the time actually spent. By the way, the two cost estimates aren't lower and upper bounds: the first is the estimated cost to produce the first row of output, and the second is the estimated total cost; see the EXPLAIN documentation for details (there is also some planner documentation available). It also helps to know how the different access methods work. As a starting point, Wikipedia has information on nested loop, hash and merge joins.
In your example, you'd start with:
-> Seq Scan on options io (cost=0.00..20223.32 rows=23004 width=36)
Filter: (name IS NULL)
Run EXPLAIN ANALYZE SELECT * FROM options WHERE name IS NULL; and see if the number of returned rows matches the estimate. A factor of 2 off isn't usually a problem; you're trying to spot order-of-magnitude differences.
Then see whether EXPLAIN ANALYZE SELECT * FROM vehicles WHERE date_sold IS NULL; returns the expected number of rows.
Then go up one level to the hash join:
-> Hash Join (cost=5301.58..29722.32 rows=229 width=40)
Hash Cond: ((io.lot_id = iv.lot_id) AND ((io.vin)::text = (iv.vin)::text))
See if EXPLAIN ANALYZE SELECT * FROM vehicles AS iv INNER JOIN options io ON (io.lot_id = iv.lot_id) AND ((io.vin)::text = (iv.vin)::text) WHERE iv.date_sold IS NULL AND io.name IS NULL; results in 229 rows.
Up one more level adds INNER JOIN options co ON (co.fkey_style = iv.chrome_styleid) AND (co.code = io.code) and is expected to return only one row. This is probably where the issue is, because if the actual number of rows goes from 1 to 100, the total cost estimate of traversing the inner loop of the containing nested loop is off by a factor of 100.
The underlying mistake the planner is making is probably that it expects the two predicates used for joining co to be independent of each other and multiplies their selectivities. In reality they may be heavily correlated, in which case the combined selectivity is closer to min(s1, s2) than to s1*s2.
Did you ANALYZE the tables? And what does pg_stats have to say about them? The query plan is based on the stats, so they have to be OK. And what version are you using? 8.4?
The costs can be calculated using the stats, the number of relpages, the number of rows, and the settings in postgresql.conf for the planner cost constants.
work_mem is also involved; if it is too low, sorts spill to disk and hash joins have to be processed in multiple batches, which kills performance...
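A quick, hedged way to test the work_mem theory without changing anything globally (SHOW and SET only affect the current session):
SHOW work_mem;
SET work_mem = '64MB';  -- session-local; pick a value that fits your available RAM
-- then re-run EXPLAIN ANALYZE on the problem query and compare the plan and timings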