Nested Join vs Merge Join vs Hash Join in PostgreSQL - postgresql

I know how the
Nested Join
Merge Join
Hash Join
works and its functionality.
I wanted to know in which situation these joins are used in Postgres

The following are a few rules of thumb:
Nested loop joins are preferred if one of the sides of the join has few rows. Nested loop joins are also used as the only option if the join condition does not use the equality operator.
Hash Joins are preferred if the join condition uses an equality operator and both sides of the join are large and the hash fits into work_mem.
Merge Joins are preferred if the join condition uses an equality operator and both sides of the join are large, but can be sorted on the join condition efficiently (for example, if there is an index on the expressions used in the join column).
A typical OLTP query that chooses only one row from one table and the associated rows from another table will always use a nested loop join as the only efficient method.
Queries that join tables with many rows (which cannot be filtered out before the join) would be very inefficient with a nested loop join and will always use a hash or merge join if the join condition allows it.
The optimizer considers each of these join strategies and uses the one that promises the lowest costs. The most important factor on which this decision is based is the estimated row count from both sides of the join. Consequently, wrong optimizer choices are usually caused by misestimates in the row counts.

Related

Merge join in PostgreSQL performs sort on indexed column

I am trying to optimize the following query in postgresql
SELECT ci.country_id, ci.ci_id,ci.name
FROM customer c
INNER JOIN address a ON c.a_id = a.a_id
INNER JOIN city ci ON ci.ci_id = a.ci_id
The columns customer.a_id, address.a_id, city.ci_id and adress.ci_id all have an btree index.
I wanted to use a merge join instead of a hash join as I read that a hash join not really uses indexes so I turned of the hash joins with Set enable_hashjoin=off.
My query is now according to the query plan using a merge join but it is performing always a quick sort before the merge join. I know that for merge join the columns need to be sorted but they should already be sorted through the index. Is there a way to force Postgres to use the index and not to perform the sort?
You are joining three tables. It is using two merge joins to do that, with the output of one merge join being one input of the other. The intermediate table is joined using two different columns, but it can't be ordered on two different columns simultaneously, so if you are only going to use merge joins, you need at least one sort.
This whole thing seems pointless, as the query is already very fast, and why do you care if it uses a hash join or not?

In outer join, where does a plain filter condition come from?

From PostgreSQL document, when explaining basics of EXPLAIN command:
When dealing with outer joins, you might see join plan nodes with both
“Join Filter” and plain “Filter” conditions attached. Join Filter
conditions come from the outer join's ON clause, so a row that
fails the Join Filter condition could still get emitted as a
null-extended row. But a plain Filter condition is applied after the outer-join rules and so acts to remove rows unconditionally. In an inner join there is no semantic
difference between these types of filters.
"Join Filter conditions come from the outer join's ON clause". Then in outer join, where does a plain filter condition come from?
Could you give some examples?
Thanks.
There is no other use of the term "plain Filter condition" used elsewhere in the Postgres documentation, so I would suspect that the author meant the word "plain" literally like not decorated or elaborate; simple or ordinary in character.
So really they are saying "When a filter is applied in an OUTER JOIN's ON clause the table or derived table being joined is just plainly filtered before the join occurs. This will lead to any columns from this table or derived table in the result set to be null".
Here is a little example that might enlighten you:
CREATE TABLE a(a_id) AS VALUES (1), (3), (4);
CREATE TABLE b(b_id) AS VALUES (1), (2), (5);
Now we have to force a nested loop join:
SET enable_hashjoin = off;
SET enable_mergejoin = off;
Our query is:
SELECT *
FROM a
LEFT JOIN b ON a_id = b_id
WHERE a_id > coalesce(b_id, 0);
a_id | b_id
------+------
3 |
4 |
(2 rows)
The plan is:
QUERY PLAN
------------------------------------------
Nested Loop Left Join
Join Filter: (a.a_id = b.b_id)
Filter: (a.a_id > COALESCE(b.b_id, 0))
-> Seq Scan on a
-> Materialize
-> Seq Scan on b
The “plain filter” is a condition that is applied after the join.
It is a frequent mistake to believe that conditions in the WHERE clause are the same as conditions in a JOIN … ON clause. That is only the case for inner joins. For outer joins, rows from the outer side that don't meet the condition are also included in the result.
That makes it necessary to have two different filters.

AWS docs say that merge joins can be used for outer joins, but not full joins. Are those the same things?

Redshift documentation says:
Merge Join
Typically the fastest join, a merge join is used for inner joins and
outer joins. The merge join is not used for full joins.
But I've always read that full joins and outer joins are the same thing: rows from both tables are kept, regardless of if they exist in the other table.
Are they just referring to left outer joins and right outer joins as those that work for merge sort, while "outer join"s (full outer joins) do not?
Good spot, perhaps the docs could be clearer. You can submit a pull request for our docs if you feel motivated to do so.
In this case, the doc means that a merge join is used when one table is the "primary" table, e.g, it will be used for INNER JOIN, LEFT [OUTER] JOIN, RIGHT [OUTER] JOIN but not used for FULL [OUTER] JOIN where rows from both side must be retained.

How to optimize a query with several join?

I have to write on paper a good physical plan for a Postgresql's query with several natural join, is it the same as treating a query with a simple join or should I use a different approach?
I am working on this one, by the way
SELECT zname
FROM Cage natural join Animal natural join DailyFeeds natural join Zookeeper
WHERE shift=’const’ AND clocation=’const’;
By Oracle
A NATURAL JOIN is a JOIN operation that creates an implicit join
clause for you based on the common columns in the two tables being
joined. Common columns are columns that have the same name in both
tables.
I think the above is answering following
is it the same as treating a query with a simple join or should I use a different approach?
I hope it helps.

Difference between INNER JOIN and WHERE?

First Query:
Select * from table1 inner join table2 on table1.Id = table2.Id
Second Query:
Select * from table1, table2 where table1.Id = table2.Id
What is difference between these query regarding performance which should one use?
The two statements you posted are logically identical. There isn't really a
practical reason to prefer one over the other, it's largely a matter of
personal style and readability. Some people prefer the INNER JOIN syntax and
some prefer just to use WHERE.
Refering to Using Inner Joins:
In the ISO standard, inner joins can
be specified in either the FROM or
WHERE clause. This is the only type of
join that ISO supports in the WHERE
clause. Inner joins specified in the
WHERE clause are known as old-style
inner joins.
Refering to Join Fundamentals:
Specifying the join conditions in the
FROM clause helps separate them from
any other search conditions that may
be specified in a WHERE clause, and is
the recommended method for specifying
joins.
Personaly, I prefer using INNER JOIN. I find it much clearer, as I can separate the join conditions from the filter conditions and using a seperate join block for each joined table.
To amplify #Akram's answer - many people prefer the inner join syntax, since it then allows you to more easily distinguish between the join conditions (how the various tables in the FROM clause relate to each other) from the filter conditions (those conditions that should be used to reduce the overall result set. There's no difference between them in this circumstance, but on larger queries, with more tables, it may improve readability to use the inner join form.
In addition, once you start considering outer joins, you pretty well need to use the infix join syntax (left outer join,right outer join), so many find a form of symmetry in using the same style for inner join. There is an older deprecated syntax for performing outer joins in the WHERE clause (using *=), but support for such joins is dying out.