Joining too many tables makes Postgres query extremely slow - postgresql

I've been trying to optimize this simple query on Postgres 12 that joins several tables to a base relation. Each joined table has a 1-to-1 relation with the base table and anywhere from 10 thousand to 10 million rows.
SELECT *
FROM base
LEFT JOIN t1 ON t1.id = base.t1_id
LEFT JOIN t2 ON t2.id = base.t2_id
LEFT JOIN t3 ON t3.id = base.t3_id
LEFT JOIN t4 ON t4.id = base.t4_id
LEFT JOIN t5 ON t5.id = base.t5_id
LEFT JOIN t6 ON t6.id = base.t6_id
LEFT JOIN t7 ON t7.id = base.t7_id
LEFT JOIN t8 ON t8.id = base.t8_id
LEFT JOIN t9 ON t9.id = base.t9_id
(the actual relations are a bit more complicated than this, but for demonstration purposes this is fine)
I noticed that the query is still very slow even when I only do SELECT base.id, which seems odd, because then the query planner should know that the joins are unnecessary and shouldn't affect performance.
Then I noticed that 8 seems to be some kind of magic number. If I remove any single one of the joins, the query time drops from 500 ms to 1 ms. With EXPLAIN I was able to see that Postgres does index-only scans when joining 8 tables, but with 9 tables it starts doing sequential scans.
That happens even when I only do SELECT base.id, so somehow the number of tables is tripping up the query planner.
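For reference, this is roughly how the plan change can be confirmed; the actual output is omitted here since it depends on the data:
EXPLAIN (ANALYZE, BUFFERS)
SELECT base.id
FROM base
LEFT JOIN t1 ON t1.id = base.t1_id
-- ... same joins as above ...
LEFT JOIN t9 ON t9.id = base.t9_id;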

We finally found out that there is indeed a configuration setting in postgres called join_collapse_limit, which is set to 8 by default.
https://www.postgresql.org/docs/current/runtime-config-query.html
The planner will rewrite explicit JOIN constructs (except FULL JOINs) into lists of FROM items whenever a list of no more than this many items would result. Smaller values reduce planning time but might yield inferior query plans. By default, this variable is set the same as from_collapse_limit, which is appropriate for most uses. Setting it to 1 prevents any reordering of explicit JOINs. Thus, the explicit join order specified in the query will be the actual order in which the relations are joined. Because the query planner does not always choose the optimal join order, advanced users can elect to temporarily set this variable to 1, and then specify the join order they desire explicitly.
After reading this we decided to increase the limit, along with related settings such as from_collapse_limit and geqo_threshold. Beware that query planning time can grow exponentially with the number of joins, so the limit is there for a reason and should not be raised carelessly.
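For example, something along these lines; the exact values are only illustrative and should be chosen based on how many relations your queries actually join:
SHOW join_collapse_limit;              -- 8 by default
SET join_collapse_limit = 12;          -- per-session override
SET from_collapse_limit = 12;
SET geqo_threshold = 14;               -- genetic query optimization is used for queries with at least this many FROM items
-- or, to persist it server-wide:
ALTER SYSTEM SET join_collapse_limit = 12;
SELECT pg_reload_conf();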

Related

TSQL -- Does ordering of on clause matter

For the simplified query:
Select t1.c1, t1.c2, t2.d1
FROM table1 t1
LEFT JOIN table2 t2
ON
(t1.c1 = t2.d2)
It seems, by the symmetric property of equality, that this would be exactly the same as if one reversed the ON to:
Select t1.c1, t1.c2, t2.d1
FROM table1 t1
LEFT JOIN table2 t2
ON
(t2.d2 = t1.c1)
Is this always true in TSQL, or is there some exception where one could get more rows if the query was subtly changed as described above?
I've learned that very subtle join changes (in queries much more complicated than this simple example) can affect row counts greatly.
Also, in addition to the rows returned (which I think should be EXACTLY THE SAME for both queries, always), I would suppose that one ordering of the ON clause could give better performance. Could someone verify that?
The ON condition in any sort of JOIN is commutative, as you have observed. You can write ON a = b or ON b = a and have them mean precisely the same thing. They're nothing but Boolean expressions.

SQL NESTED Query Optimization

I am running two sql queries say,
select obname from table1 where obid = 12
select modname from table2 where modid = 12
Both take very little time, say 300 ms each.
But when I am running:
select obname, modname
from (select obname from table1 where obid = 12) as alias1,
(select modname from table2 where modid = 12) as alias2
It takes 3500 ms. Why is that?
In general, putting two scalar queries in the from clause is not going to affect performance. In fact, from an application perspective, one query may be faster because there is less overhead going back and forth to the database. A scalar query returns one column and one row.
However, if the queries are returning multiple rows, then your little comma is doing a massive Cartesian product (which is why I always use CROSS JOIN rather than a comma in a FROM clause). In that case, all bets are off, because the data has to be processed after the results start returning.
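Assuming each inner query really does return a single row, an equivalent but more explicit way to write the combined query (same tables and columns as in the question) is:
select obname, modname
from (select obname from table1 where obid = 12) as alias1
cross join (select modname from table2 where modid = 12) as alias2;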

Can I apply predicates to the same columns against multiple tables in a JOIN only once?

I want to join two tables together and add additional information from two other tables to the same columns in both queried tables. I've come up with the code below, which works, but I don't feel comfortable having to add another JOIN clause for each table, as it would make the query substantially longer if I wanted to join/add more things.
Is there a way to combine it, so that I can join additional tables only once (just use S and E aliases every time)?
SELECT
J.JobId,
J.StandardJobId,
S.JobName,
J.EngineerId,
E.EngineerName,
JF.JobId AS FollowUpJobId,
JF.StandardJobId AS FollowUpStandardJobId,
SF.JobName AS FollowUpJobName,
JF.EngineerId AS FollowUpEngineerId,
EF.EngineerName AS FollowUpEngineerName
FROM
Jobs J
INNER JOIN
Jobs JF
ON
J.FollowUpJobId = JF.JobId
INNER JOIN
StandardJobs S
ON
J.StandardJobId = S.StandardJobId
INNER JOIN
Engineers E
ON
E.EngineerId = J.EngineerId
INNER JOIN
StandardJobs SF
ON
SF.StandardJobId = JF.StandardJobId
INNER JOIN
Engineers EF
ON
EF.EngineerId = JF.EngineerId
One approach would be to use a Common Table Expression (CTE) - something like:
with cte as
(SELECT J.JobId,
J.StandardJobId,
S.JobName,
J.EngineerId,
E.EngineerName,
J.FollowUpJobId
FROM Jobs J
INNER JOIN StandardJobs S ON J.StandardJobId = S.StandardJobId
INNER JOIN Engineers E ON E.EngineerId = J.EngineerId)
SELECT O.*,
F.StandardJobId AS FollowUpStandardJobId,
F.JobName AS FollowUpJobName,
F.EngineerId AS FollowUpEngineerId,
F.EngineerName AS FollowUpEngineerName
FROM CTE AS O
JOIN CTE AS F ON O.FollowUpJobId = F.JobId
You can sort of do this with either a CTE (Common Table Expressions, the WITH clause) or a View:
;WITH Jobs_Extended As
(
SELECT j.*,
s.JobName,
E.EngineerName
FROM Jobs As j
JOIN StandardJobs As s ON s.StandardJobId = j.StandardJobId
JOIN Engineers As e ON e.EngineerId = j.EngineerId
)
SELECT
J.JobId,
J.StandardJobId,
J.JobName,
J.EngineerId,
J.EngineerName,
JF.JobId AS FollowUpJobId,
JF.StandardJobId AS FollowUpStandardJobId,
JF.JobName AS FollowUpJobName,
JF.EngineerId AS FollowUpEngineerId,
JF.EngineerName AS FollowUpEngineerName
FROM Jobs_Extended J
JOIN Jobs_Extended JF ON J.FollowUpJobId = JF.JobId
In this example the CTE Jobs_Extended becomes a defined alias for the relationship between the Jobs, Engineers and StandardJobs tables. Then once defined, you can use it multiple times in the query without having to redefine those interior relations.
You can do the same thing by changing the WITH to a View, which will make the defined alias permanent in your database.
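A sketch of that View variant, under the same assumed schema:
CREATE VIEW Jobs_Extended AS
SELECT j.*,
s.JobName,
e.EngineerName
FROM Jobs AS j
JOIN StandardJobs AS s ON s.StandardJobId = j.StandardJobId
JOIN Engineers AS e ON e.EngineerId = j.EngineerId;
-- The main query is then the same as above, just without the WITH block, e.g.:
SELECT J.JobId, J.JobName, JF.JobId AS FollowUpJobId, JF.JobName AS FollowUpJobName
FROM Jobs_Extended J
JOIN Jobs_Extended JF ON J.FollowUpJobId = JF.JobId;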
No, you cannot avoid JOINing related tables each time a separate reference is needed. The issue is that you are not working with the tables in a general sense but instead working with the specific rows of each table, even more specifically, just those rows that match the JOIN and WHERE conditions.
There is no way to specify the references to either StandardJobs or Engineers only once because you are needing to work with two rows from each table at the same time, at least in the given example.
However, depending on which direction you want to go with "additional tables" (more references to Jobs, or more lookups like StandardJobs and Engineers for the given 2 references of Jobs), the CTE construct shown by Mark is probably the easiest / best way to abstract it. I posted this answer mainly to explain the issue at hand.

T-SQL different JOIN approaches, same results, which one would you prefer?

These are 3 approaches to writing a join. I would like to hear some words on the performance of these 3 queries.
Thank you
SELECT * FROM
tableA A LEFT JOIN tableB B
INNER JOIN tableC C
ON C.ColumnC = B.ColumnB
ON B.ColumnB = A.ColumnB
WHERE ColumnX = 'XY'
Versus
SELECT * FROM
tableA A LEFT JOIN tableB B
ON B.ColumnB = A.ColumnB
INNER JOIN tableC C
ON C.ColumnC = B.ColumnB
WHERE ColumnX = 'XY'
Versus Common Table Expression
WITH T...
It does not matter.
SQL Server has a cost-based optimizer (as opposed to a rule-based optimizer). That means that the engine is able to figure out that both of your first two options are identical. Run your estimated and actual execution plans and you will see that this is the case.
The only reason you would choose one option over the other is for readability's sake. I go with your second option, because it's a lot easier to read when there are a great many joins involved. ON clauses in reverse order become quite difficult to track.
In my experience, any of the above could be quicker depending on your tables.
As you're setting up joins, you want to start with the most restrictive join possible (without negatively affecting your end result, obviously). The same logic also applies to the WHERE clause, for the same reason. By starting with the most restrictive, you limit the number of rows that are joined and thus evaluated by the WHERE clause and then returned/manipulated in the SELECT clause. For my answers below regarding the three specific scenarios, I'm assuming a sufficiently complicated query that is doing more than just combining data from multiple tables (i.e., queries answering specific questions).
If Table A is huge and Tables B & C are smaller and more directly related to the data you're trying to isolate, then the first option would likely be fastest.
If Table B or C are huge and Table A is more related to your desired data, the second option would likely be fastest.
As far as option 3 goes, I love CTEs but I try to only use them when I need to do so. Using a CTE will speed up your overall query if the data joined, manipulated, and returned by the CTE is only related to the rest of the query in a limited fashion. Including tables that are only partially related to your end result in your primary string of joins is going to needlessly slow down your query. If you can parse out that data into a CTE, it can run quickly by itself and then be incorporated back into the main query at the end.
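A rough sketch of that last point, reusing the table and column names from the question; the pre-filter inside the CTE (and the assumption that ColumnX lives in tableC) is purely hypothetical:
WITH FilteredC AS
(
SELECT ColumnC                 -- keep only what the outer query needs
FROM tableC
WHERE ColumnX = 'XY'           -- hypothetical pre-filter so the CTE stays small
)
SELECT *
FROM tableA A
LEFT JOIN tableB B ON B.ColumnB = A.ColumnB
INNER JOIN FilteredC FC ON FC.ColumnC = B.ColumnB;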

PostgreSQL slow COUNT() - is trigger the only solution?

I have a table with posts, which are categorized by:
type
tag
language
All of those "categories" are stored in separate tables (posts_types) and connected via link tables (posts_types_assignment).
COUNTing in PostgreSQL is really slow (I have more than 500k records in that table) and I need to get the number of posts categorized by any combination of type/tag/lang.
If I were to solve it with triggers, it would be full of multi-level loops, which really doesn't look nice and is hard to maintain.
Is there any other solution to efficiently get the actual number of posts categorized by any type/tag/language?
Let me get this straight.
You have a table posts. You have a table posts_types. The two have a many to many join on posts_types_assignment. And you have some query like this that is slow:
SELECT count(*)
FROM posts p
JOIN posts_types_assignment pta1
ON p.id = pta1.post_id
JOIN posts_types pt1
ON pt1.id = pta1.post_type_id
AND pt1.type = 'language'
AND pt1.name = 'English'
JOIN posts_types_assignment pta2
ON p.id = pta2.post_id
JOIN posts_types pt2
ON pt2.id = pta2.post_type_id
AND pt2.type = 'tag'
AND pt2.name = 'awesome'
And you would like to know why it is painfully slow.
My first note is that PostgreSQL would have to do a lot less work if you had the identifiers in the posts table rather than in the joins. But that is a moot point; the decision has been made.
My more useful note is that I believe PostgreSQL has a query optimizer similar to Oracle's. In that case, to limit the combinatorial explosion of possible query plans, it only considers plans that start with some table and then repeatedly join in one more data set at a time. However, no such query plan works well here. You can start with pt1, get 1 record, then go to pta1, get a bunch of records, join p, wind up with the same number of records, then join pta2, and now you get a huge number of records, then join to pt2, and get just a few records. Joining to pta2 is the slow step, because the database has no idea which records you want, and therefore has to create a temporary result set for every combination of a post and a piece of metadata (type, language or tag) on it.
If this is indeed your problem, then the right plan looks like this. Join pt1 to pta1, put an index on it. Join pt2 to pta2, then join to the result of the first query, then join to p. Then count. This means that we don't get huge result sets.
If this is the case, there is no way to tell the query optimizer that, just this once, you want it to think up a new type of execution plan. But there is a way to force it.
CREATE TEMPORARY TABLE t1
AS
SELECT pta.*
FROM posts_types pt
JOIN posts_types_assignment pta
ON pt.id = pta.post_type_id
WHERE pt.type = 'language'
AND pt.name = 'English';
CREATE INDEX idx1 ON t1 (post_id);
CREATE TEMPORARY TABLE t2
AS
SELECT pta.*
FROM posts_types pt
JOIN posts_types_assignment pta
ON pt.id = pta.post_type_id
JOIN t1
ON t1.post_id = pta.post_id
WHERE pt.type = 'tag'
AND pt.name = 'awesome';
SELECT COUNT(*)
FROM posts p
JOIN t2
ON p.id = t2.post_id;
Barring random typos, etc, this is likely to perform somewhat better. If it doesn't, double check the indexes on your tables.
As btilly notes, and if he has correctly guessed the schema, the table design does not help - it seems (at first sight, at least) that, for example, having three tables posts_tag(post_id, tag), post_lang(post_id, lang), post_type(post_id, type) would be more natural and much more efficient.
Apart from that (or in addition to that), one could think of a table or materialized view that summarizes all the possible counts, with columns (lang, type, tag, nposts). Of course, computing this in full would be VERY slow, but (apart from the first time) it can be done either in full "in the background" at some interval (if the data does not vary much and you don't require exact counts), or eagerly with triggers.
See for example here
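A very rough sketch of such a summary materialized view, assuming btilly's guessed schema (the table, column, and category names here are all guesses) and a PostgreSQL version that supports materialized views:
CREATE MATERIALIZED VIEW post_counts AS
SELECT lang.name AS lang,
typ.name AS type,
tag.name AS tag,
COUNT(*) AS nposts
FROM posts p
JOIN posts_types_assignment pa_lang ON pa_lang.post_id = p.id
JOIN posts_types lang ON lang.id = pa_lang.post_type_id AND lang.type = 'language'
JOIN posts_types_assignment pa_type ON pa_type.post_id = p.id
JOIN posts_types typ ON typ.id = pa_type.post_type_id AND typ.type = 'type'
JOIN posts_types_assignment pa_tag ON pa_tag.post_id = p.id
JOIN posts_types tag ON tag.id = pa_tag.post_type_id AND tag.type = 'tag'
GROUP BY 1, 2, 3;
-- refreshed "in the background" at whatever interval is acceptable:
REFRESH MATERIALIZED VIEW post_counts;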