TSQL -- how does optimizer deal with joins with unused tables - tsql

TSQL -- how does optimizer deal with joins with unused tables
SELECT table1.col1, table2.col1, table2.col2 -- etc.
FROM dbo.table1
LEFT JOIN dbo.table2
on (table1.id = table2.id)
LEFT JOIN dbo.table3
on (table1.id = table3.id)
In the simple example above it is obvious that table 3 isn't needed, however more intricate cases can exist.
Question: Is the TSQL query optimizer able to determine that the JOIN with table 3 isn't needed? If so, would it potentially miss other optimizations for more complex queries, if table3 wasn't manually removed from the query?
I use SSMS 14.017 with an underlying SQL database of select ##version = Microsoft SQL Server 2014 (SP3)

Yes, query optimizer could skip table3 because it is not referenced in SELECT list and the join condition is LEFT OUTER JOIN so there is no filtering.
Related: 10 Cool SQL Optimisations That do not Depend on the Cost Model by Lukas Eder:
Unneeded Table Accesses
JOIN Elimination
JOIN Elimination: An Essential Optimiser Feature for Advanced SQL Usage

Related

Joining too many tables makes Postgres query extremely slow

I've been trying to optimize this simple query on Postgres 12 that joins several tables to a base relation. They each have 1-to-1 relation and have anywhere from 10 thousand to 10 million rowss.
SELECT *
FROM base
LEFT JOIN t1 ON t1.id = base.t1_id
LEFT JOIN t2 ON t2.id = base.t2_id
LEFT JOIN t3 ON t3.id = base.t3_id
LEFT JOIN t4 ON t4.id = base.t4_id
LEFT JOIN t5 ON t5.id = base.t5_id
LEFT JOIN t6 ON t6.id = base.t6_id
LEFT JOIN t7 ON t7.id = base.t7_id
LEFT JOIN t8 ON t8.id = base.t8_id
LEFT JOIN t9 ON t9.id = base.t9_id
(the actual relations are a bit more complicated than this, but for demonstration purposes this is fine)
I noticed that the query is still very slow when I only do SELECT base.id which seems odd, because then query planner should know that the joins are unnecessary and shouldn't affect the performance.
Then I noticed that 8 seems to be some kind of magic number. If I remove any single one of the joins, the query time goes from 500ms to 1ms. With EXPLAIN I was able to see that Postgres is doing index only scans when joining 8 tables, but with 9 tables it starts doing sequential scans.
That's even when I only do SELECT base.id so somehow the amount of tables is tripping up the query planner.
We finally found out that there is indeed a configuration setting in postgres called join_collapse_limit, which is set to 8 by default.
https://www.postgresql.org/docs/current/runtime-config-query.html
The planner will rewrite explicit JOIN constructs (except FULL JOINs) into lists of FROM items whenever a list of no more than this many items would result. Smaller values reduce planning time but might yield inferior query plans. By default, this variable is set the same as from_collapse_limit, which is appropriate for most uses. Setting it to 1 prevents any reordering of explicit JOINs. Thus, the explicit join order specified in the query will be the actual order in which the relations are joined. Because the query planner does not always choose the optimal join order, advanced users can elect to temporarily set this variable to 1, and then specify the join order they desire explicitly.
After reading this article we decided to increase the limit, along with other values such as from_collapse_limit and geco_threshold. Beware that query planning time increases exponentially with the amount of joins, so the limit is there for a reason and should not be increased carelessly.

When is it better to use CTE or temp table postgres

I am doing a query on a very large data set and i am using WITH (CTE) syntax.. this seems to take a while and i was reading online that temp tables could be faster to use in these cases can someone advise me in which direction to go. In the CTE we join to a lot of tables then we filter on the CTE result..
Only interesting in postgres answers
What version of PostgreSQL are you using? CTEs perform differently in PostgreSQL versions 11 and older than versions 12 and above.
In PostgreSQL 11 and older, CTEs are optimization fences (outer query restrictions are not passed on to CTEs) and the database evaluates the query inside the CTE and caches the results (i.e., materialized results) and outer WHERE clauses are applied later when the outer query is processed, which means either a full table scan or a full index seek is performed and results in horrible performance for large tables. To avoid this, apply as much filters in the WHERE clause inside the CTE:
WITH UserRecord AS (SELECT * FROM Users WHERE Id = 100)
SELECT * FROM UserRecord;
PostgreSQL 12 addresses this problem by introducing query optimizer hints to enable us to control if the CTE should be materialized or not: MATERIALIZED, NOT MATERIALIZED.
WITH AllUsers AS NOT MATERIALIZED (SELECT * FROM Users)
SELECT * FROM AllUsers WHERE Id = 100;
Note: Text and code examples are taken from my book Migrating your SQL Server Workloads to PostgreSQL
Summary:
PostgreSQL 11 and older: Use Subquery
PostgreSQL 12 and above: Use CTE with NOT MATERIALIZED clause
My follow up comment is more than I can fit in a comment... so understand this may not be an answer to the OP per se.
Take the following query, which uses a CTE:
with sales as (
select item, sum (qty) as sales_qty, sum (revenue) as sales_revenue
from sales_data
where country = 'USA'
group by item
),
inventory as (
select item, sum (on_hand_qty) as inventory_qty
from inventory_data
where country = 'USA' and on_hand_qty != 0
group by item
)
select
a.item, a.description, s.sales_qty, s.sales_revenue,
i.inventory_qty, i.inventory_qty * a.cost as inventory_cost
from
all_items a
left join sales s on
a.item = s.item
left join inventory i on
a.item = i.item
There are times where I cannot explain why that the query runs slower than I would expect. Some times, simply materializing the CTEs makes it run better, as expected. Other times it does not, but when I do this:
drop table if exists sales;
drop table if exists inventory;
create temporary table sales as
select item, sum (qty) as sales_qty, sum (revenue) as sales_revenue
from sales_data
where country = 'USA'
group by item;
create temporary table inventory as
select item, sum (on_hand_qty) as inventory_qty
from inventory_data
where country = 'USA' and on_hand_qty != 0
group by item;
select
a.item, a.description, s.sales_qty, s.sales_revenue,
i.inventory_qty, i.inventory_qty * a.cost as inventory_cost
from
all_items a
left join sales s on
a.item = s.item
left join inventory i on
a.item = i.item;
Suddenly all is right in the world.
Temp tables may persist across sessions, but to my knowledge the data in them will be session-based. I'm honestly not even sure if the structures persist, which is why to be safe I always drop:
drop table if exists sales;
And use "if exists" to avoid any errors about the object not existing.
I rarely use these in common queries for the simple reason that they are not as portable as a simple SQL statement (you can't give the final query to another user without having the temp tables). My most common use case is when I am processing within a procedure/function:
create procedure sales_and_inventory()
language plpgsql
as
$BODY$
BEGIN
create temp table sales...
insert into sales_inventory
select ...
drop table sales;
END;
$BODY$
Hopefully this helps.
Also, to answer your question on indexes... typically I don't, but nothing says that's always the right answer. If I put data into a temp table, I assume I'm going to use all or most of it. That said, if you plan to query it multiple times with conditions where an index makes sense, then by all means do it.

Can I apply predicates to the same columns against multiple tables in a JOIN only once?

I want to join two tables together and add additional information from two other tables to the same columns in both queried tables. I've come up with the below code, which works, but I don't feel comfortable about having to add another JOIN clause for each table, as it would make the query substantially long if I wanted to join/add more things.
Is there a way to combine it, so that I can join additional tables only once (just use S and E aliases every time)?
SELECT
J.JobId,
J.StandardJobId,
S.JobName,
J.EngineerId,
E.EngineerName,
JF.JobId AS FollowUpJobId,
JF.StandardJobId AS FollowUpStandardJobId,
SF.JobName AS FollowUpJobName,
JF.EngineerId AS FollowUpEngineerId,
EF.EngineerName AS FollowUpEngineerName
FROM
Jobs J
INNER JOIN
Jobs JF
ON
J.FollowUpJobId = JF.JobId
INNER JOIN
StandardJobs S
ON
J.StandardJobId = S.StandardJobId
INNER JOIN
Engineers E
ON
E.EngineerId = J.EngineerId
INNER JOIN
StandardJobs SF
ON
SF.StandardJobId = JF.StandardJobId
INNER JOIN
Engineers EF
ON
EF.EngineerId = JF.EngineerId
One approach would be to use a Common Table Expression (CTE) - something like:
with cte as
(SELECT J.JobId,
J.StandardJobId,
S.JobName,
J.EngineerId,
E.EngineerName,
J.FollowUpJobId
FROM Jobs J
INNER JOIN StandardJobs S ON J.StandardJobId = S.StandardJobId
INNER JOIN Engineers E ON E.EngineerId = J.EngineerId)
SELECT O.*,
F.StandardJobId AS FollowUpStandardJobId,
F.JobName AS FollowUpJobName,
F.EngineerId AS FollowUpEngineerId,
F.EngineerName AS FollowUpEngineerName
FROM CTE AS O
JOIN CTE AS F ON O.FollowUpJobId = F.JobId
You can sort of do this with either a CTE (Common Table Expressions, the WITH clause) or a View:
;WITH Jobs_Extended As
(
SELECT j.*,
s.JobName,
E.EngineerName
FROM Jobs As j
JOIN StandardJobs As s ON s.StandardJobId = j.StandardJobId
JOIN Engineer As e ON e.EngineerId = j.EngineerId
)
SELECT
J.JobId,
J.StandardJobId,
J.JobName,
J.EngineerId,
J.EngineerName,
JF.JobId AS FollowUpJobId,
JF.StandardJobId AS FollowUpStandardJobId,
JF.JobName AS FollowUpJobName,
JF.EngineerId AS FollowUpEngineerId,
JF.EngineerName AS FollowUpEngineerName
FROM Jobs_Extended J
JOIN Jobs_Extended JF ON J.FollowUpJobId = JF.JobId
In this example the CTE Jobs_Extended becomes a defined alias for the relationship between the Jobs, Engineers and StandardJobs tables. Then once defined, you can use it multiple times in the query without having to redefine those interior relations.
You can do the same thing by change the WITH to a View, which will make the defined alias permannet in your database.
No, you cannot avoid JOINing related tables each time a separate reference is needed. The issue is that you are not working with the tables in a general sense but instead working with the specific rows of each table, even more specifically, just those rows that match the JOIN and WHERE conditions.
There is no way to specify the references to either StandardJobs or Engineers only once because you are needing to work with two rows from each table at the same time, at least in the given example.
However, depending on which direction you are wanting to go with "additional tables" (more references to Jobs or more lookups like StandardJobs and Engineers for the given 2 references of Jobs), the CTE construct shown by Mark is the probably the easiest / best way to abstract it. I posted this answer mainly to explain the issue at hand.

T-SQL different JOIN approaches, same results, which one would you prefer?

these are 3 approaches how to make a join. I would like to hear some word on perforance of these 3 queries.
Thank you
SELECT * FROM
tableA A LEFT JOIN tableB B
INNER JOIN tableC C
ON C.ColumnC = B.ColumnB
ON B.ColumnB = A.ColumnB
WHERE ColumnX = 'XY'
Versus
SELECT * FROM
tableA A LEFT JOIN tableB B
ON B.ColumnB = A.ColumnB
INNER JOIN tableC C
ON C.ColumnC = B.ColumnB
WHERE ColumnX = 'XY'
Versus Common Table Expression
WITH T...
It does not matter.
SQL Server has a cost-based optimizer (as opposed to a rule-based optimizer). That means that the engine is able to figure out that both of your first two options are identical. Run your estimated and actual execution plans and you will see that this is the case.
The only reason you would choose one option over the other is for readability's sake. I go with your second option, because it's a lot easier to read when there are a great many joins involved. ON clauses in reverse order become quite difficult to track.
In my experience, any of the above could be quicker depending on your tables.
As you're setting up joins, you want to start with the most restrictive as possible (without negatively affecting your end result, obviously). This same logic also applies to the Where clause for the same reason. By starting with the most restrictive, you're limiting the number of rows that are being joined and thus evaluated by the Where clause and then returned/manipulated in the select clause. For my answers below regarding the three specific scenarios, I'm assuming a sufficiently complicated query that is doing more than just looking to combine data from multiple tables (i.e., queries answering specific questions).
If Table A is huge and Tables B & C are smaller and more directly related to the data you're trying to isolate, then the first option would likely be fastest.
If Table B or C are huge and Table A is more related to your desired data, the second option would likely be fastest.
As far as option 3 goes, I love CTEs but I try to only use them when I need to do so. Using a CTE will speed up your overall query if the data joined, manipulated, and returned by the CTE is only related to the rest of the query in a limited fashion. Including tables that are only partially related to your end result in your primary string of joins is going to needlessly slow down your query. If you can parse out that data into a CTE, it can run quickly by itself and then be incorporated back into the main query at the end.

Difference between INNER JOIN and WHERE?

First Query:
Select * from table1 inner join table2 on table1.Id = table2.Id
Second Query:
Select * from table1, table2 where table1.Id = table2.Id
What is difference between these query regarding performance which should one use?
The two statements you posted are logically identical. There isn't really a
practical reason to prefer one over the other, it's largely a matter of
personal style and readability. Some people prefer the INNER JOIN syntax and
some prefer just to use WHERE.
Refering to Using Inner Joins:
In the ISO standard, inner joins can
be specified in either the FROM or
WHERE clause. This is the only type of
join that ISO supports in the WHERE
clause. Inner joins specified in the
WHERE clause are known as old-style
inner joins.
Refering to Join Fundamentals:
Specifying the join conditions in the
FROM clause helps separate them from
any other search conditions that may
be specified in a WHERE clause, and is
the recommended method for specifying
joins.
Personaly, I prefer using INNER JOIN. I find it much clearer, as I can separate the join conditions from the filter conditions and using a seperate join block for each joined table.
To amplify #Akram's answer - many people prefer the inner join syntax, since it then allows you to more easily distinguish between the join conditions (how the various tables in the FROM clause relate to each other) from the filter conditions (those conditions that should be used to reduce the overall result set. There's no difference between them in this circumstance, but on larger queries, with more tables, it may improve readability to use the inner join form.
In addition, once you start considering outer joins, you pretty well need to use the infix join syntax (left outer join,right outer join), so many find a form of symmetry in using the same style for inner join. There is an older deprecated syntax for performing outer joins in the WHERE clause (using *=), but support for such joins is dying out.