Is it possible in Postgres to have an optional join?
My use case is something like
select ...
from a
inner join b using (b_id)
where b.type in (...)
a is a very large reporting table. b is used to filter a, BUT the most common use case is that we will want all b.types, and therefore all the b records in the join. In other words, in most cases we don't want to filter by b at all, and would not need the join in that case, but the filtering optionality still needs to be there in cases when the user wants to filter by type.
So is it possible to invoke the join optionally, and save the join effort in cases when we just want all of a?
If not, what's my next best option? IF ... THEN or CTE with a union of separate queries?
If you don't need any of b's columns, there is no need to JOIN table b. You can filter by using EXISTS (SELECT .. FROM b WHERE ...).
If you want to conditionally exclude a part of the WHERE clause, you could use the following construct (the ignore_b boolean functions as an on/off switch):
-- $ignore_b is a Boolean flag:
-- when true, the optimiser will ignore the EXISTS(...)
SELECT ...
FROM a
WHERE ($ignore_b OR EXISTS (
        SELECT *
        FROM b
        WHERE b.b_id = a.some_id
          AND b.type IN (1, 2, 3, 4, 5)
      )
);
In this example, you are still filtering based on b: on whether a row with that b_id exists in b in the first place.
PostgreSQL will remove unneeded joins, but only under very specific circumstances: you write the join as a LEFT JOIN, so that no rows of A can be removed due to the absence of corresponding rows in B; the column B.b_id is declared unique or is the primary key, so that no rows of A can be duplicated due to duplicate matches in B; and no column of B may be referenced anywhere in the query (except the key column in the LEFT JOIN condition itself).
In those cases, you can just always write the LEFT JOIN, and PostgreSQL will figure out that it can skip it.
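For illustration, a minimal sketch (assuming b_id is declared as b's primary key):

-- b_id is unique, the join is a LEFT JOIN, and no column of b
-- is referenced outside the join condition:
EXPLAIN SELECT a.* FROM a LEFT JOIN b USING (b_id);
-- the plan is just "Seq Scan on a" -- the join node is gone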
You can argue that if you have a declared foreign key constraint on the join condition, then you shouldn't need the JOIN to be a LEFT JOIN in order to implement this optimization. I think that that argument is correct, but PostgreSQL does not implement it that way.
I would just do it programmatically. If you are already programmatically adding references to B in the WHERE clause, you should be able to do the same for the join.
From the PostgreSQL documentation, in the section explaining the basics of the EXPLAIN command:

When dealing with outer joins, you might see join plan nodes with both “Join Filter” and plain “Filter” conditions attached. Join Filter conditions come from the outer join's ON clause, so a row that fails the Join Filter condition could still get emitted as a null-extended row. But a plain Filter condition is applied after the outer-join rules and so acts to remove rows unconditionally. In an inner join there is no semantic difference between these types of filters.
"Join Filter conditions come from the outer join's ON clause". Then in outer join, where does a plain filter condition come from?
Could you give some examples?
Thanks.
The term "plain Filter condition" is not used anywhere else in the Postgres documentation, so I suspect the author meant the word "plain" literally: not decorated or elaborate; simple or ordinary in character.
So really they are saying: "When a filter is applied in an OUTER JOIN's ON clause, the table or derived table being joined is just plainly filtered before the join occurs. This will lead to any columns from this table or derived table in the result set being null".
Here is a little example that might enlighten you:
CREATE TABLE a(a_id) AS VALUES (1), (3), (4);
CREATE TABLE b(b_id) AS VALUES (1), (2), (5);
Now we have to force a nested loop join:
SET enable_hashjoin = off;
SET enable_mergejoin = off;
Our query is:
SELECT *
FROM a
LEFT JOIN b ON a_id = b_id
WHERE a_id > coalesce(b_id, 0);
 a_id | b_id
------+------
    3 |
    4 |
(2 rows)
The plan is:
                QUERY PLAN
------------------------------------------
 Nested Loop Left Join
   Join Filter: (a.a_id = b.b_id)
   Filter: (a.a_id > COALESCE(b.b_id, 0))
   ->  Seq Scan on a
   ->  Materialize
         ->  Seq Scan on b
The “plain filter” is a condition that is applied after the join.
It is a frequent mistake to believe that conditions in the WHERE clause are the same as conditions in a JOIN … ON clause. That is only the case for inner joins. For outer joins, rows from the outer side that don't meet the condition are also included in the result.
That makes it necessary to have two different filters.
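To see the difference with the same tables a and b as above:

-- condition in ON: all rows of a are kept, null-extended when the
-- condition fails
SELECT * FROM a LEFT JOIN b ON a_id = b_id AND b_id > 1;
-- returns (1, NULL), (3, NULL), (4, NULL)

-- condition in WHERE: applied after the join, removes rows unconditionally
SELECT * FROM a LEFT JOIN b ON a_id = b_id WHERE b_id > 1;
-- returns no rows: the only match (1, 1) fails b_id > 1, and the
-- null-extended rows fail it too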
So if table A is:
no | username
1 | admin
2 | chicken
And table B is:
id | no
a | 1
b | 3
c | 4
Then, I do a NATURAL FULL OUTER JOIN like so:
SELECT no
FROM A NATURAL FULL OUTER JOIN
B;
Then, what is the result? And is the result the same for all PostgreSQL implementations?
Because the 'no' could come from table A or from table B, it seems ambiguous. But NATURAL joins combine the 'no' columns. So what if one of the 'no' values is missing, i.e. A.no IS NOT NULL but B.no IS NULL: which 'no' does it pick? And what if A.no and B.no are both NULL?
TL;DR: So the question is: WHAT is the value of no in SELECT no: is it A.no or B.no, or is it the COALESCE of them?
SELECT no
FROM A NATURAL FULL OUTER JOIN
B;
First, don't use natural for joins. It is a bug waiting to happen. As you note in your question, natural chooses the join keys based on the names of columns. It doesn't take types into account. It doesn't even take explicitly declared foreign key relationships into account.
The particularly insidious problem, though, is that someone reading the query does not see the join keys. That makes it much harder to debug queries or to modify/enhance them.
So, my advice is to use using instead.
SELECT no
FROM A FULL OUTER JOIN
B
USING (no);
What does a full join return? It returns all rows from both tables, whether or not the join finds a match. And because a comparison with NULL never evaluates to true, NULL values will not match in the join condition.
For example, the following query returns 4 rows (not 2), each containing a NULL value:
with x as (
    select NULL::int as id
    union all
    select NULL as id
)
select id
from x
full join x y using (id);
You would get the same result with a natural join, but I simply don't use that construct.
I'm not 100% sure, but I'm pretty sure that all versions of Postgres that support full join would work the same way. This behavior is derived specifically from the ANSI definitions of joins and join conditions.
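To answer the "which no does it pick" part directly: for a full join, the merged USING (or natural-join) column is effectively COALESCE(A.no, B.no). A quick sketch with the question's sample data (table names changed to avoid clashing with earlier examples):

CREATE TABLE ta (no int, username text);
INSERT INTO ta VALUES (1, 'admin'), (2, 'chicken');
CREATE TABLE tb (id text, no int);
INSERT INTO tb VALUES ('a', 1), ('b', 3), ('c', 4);

SELECT no FROM ta FULL JOIN tb USING (no);
-- no
------
--  1   -- matched in both tables
--  2   -- only in ta: COALESCE(2, NULL)
--  3   -- only in tb: COALESCE(NULL, 3)
--  4   -- only in tb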
Perhaps I'm approaching this all wrong, in which case feel free to point out a better way to solve the overall question, which is: "How do I use an intermediate table for future queries?"
Let's say I've got tables foo and bar, which join on some baz_id, and I want to combine them into an intermediate table to be fed into upcoming queries. I know of the WITH .. AS (...) statement, but am running into problems, as such:
WITH foobar AS (
    SELECT *
    FROM foo
    INNER JOIN bar ON bar.baz_id = foo.baz_id
)
SELECT
    baz_id
    -- some other things as well
FROM
    foobar
The issue is that (Postgres 9.4) tells me baz_id is ambiguous. I understand this happens because SELECT * includes all the columns in both tables, so baz_id shows up twice; but I'm not sure how to get around it. I was hoping to avoid copying the column names out individually, like
SELECT
foo.var1, foo.var2, foo.var3, ...
bar.other1, bar.other2, bar.other3, ...
FROM foo INNER JOIN bar ...
because there are hundreds of columns in these tables.
Is there some way around this I'm missing, or some altogether different way to approach the question at hand?
WITH foobar AS (
    SELECT *
    FROM foo
    INNER JOIN bar USING (baz_id)
)
SELECT
    baz_id
    -- some other things as well
FROM
    foobar
It leaves only one instance of the baz_id column in the select list.
From the documentation:
The USING clause is a shorthand that allows you to take advantage of the specific situation where both sides of the join use the same name for the joining column(s). It takes a comma-separated list of the shared column names and forms a join condition that includes an equality comparison for each one. For example, joining T1 and T2 with USING (a, b) produces the join condition ON T1.a = T2.a AND T1.b = T2.b.
Furthermore, the output of JOIN USING suppresses redundant columns: there is no need to print both of the matched columns, since they must have equal values. While JOIN ON produces all columns from T1 followed by all columns from T2, JOIN USING produces one output column for each of the listed column pairs (in the listed order), followed by any remaining columns from T1, followed by any remaining columns from T2.
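A small sketch of that column behavior (hypothetical tables t1 and t2):

CREATE TABLE t1 (a int, b int, x int);
CREATE TABLE t2 (a int, b int, y int);

SELECT * FROM t1 JOIN t2 USING (a, b);
-- output columns: a, b, x, y
-- one copy of each join column, then the remaining columns of t1,
-- then those of t2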
I have a query that uses a subquery and I am having a problem returning the expected results. The error I receive is..."Only one expression can be specified in the select list when the subquery is not introduced with EXISTS." How can I rewrite this to work?
SELECT
   a.Part,
   b.Location,
   b.LeadTime
FROM
   dbo.Parts a
   LEFT OUTER JOIN dbo.Vendor b ON b.Part = a.Part
WHERE
   b.Location IN ('A','B','C')
   AND Date IN (SELECT Location, MAX(Date) FROM dbo.Vendor GROUP BY Location)
GROUP BY
   a.Part,
   b.Location,
   b.LeadTime
ORDER BY
   a.Part
I think something like this may be what you're looking for. You didn't say what version of SQL Server--this works in SQL 2005 and up:
SELECT
   p.Part,
   p.Location, -- from *p*, otherwise if no match we'll get a NULL
   v.LeadTime
FROM
   dbo.Parts p
   OUTER APPLY (
      SELECT TOP (1) * -- * here is okay because we specify columns outside
      FROM dbo.Vendor v
      WHERE p.Location = v.Location -- the correlation part
      ORDER BY v.Date DESC
   ) v
WHERE
   p.Location IN ('A','B','C')
ORDER BY
   p.Part
;
Now, your query can be repaired as is by adding the "correlation" part to change your query into a correlated subquery as demonstrated in Kory's answer (you'd also remove the GROUP BY clause). However, that method still requires an additional and unnecessary join, hurting performance, plus it can only pull one column at a time. This method allows you to pull all the columns from the other table, and has no extra join.
Note: this gives logically the same results as Lamak's answer, however I prefer it for a few reasons:
When there is an index on the correlation columns (Location, here) this can be satisfied with seeks, but the Row_Number solution has to scan (I believe).
I prefer the way this expresses the intent of the query more directly and succinctly. In the Row_Number method, one must get out to the outer condition to see that we are only grabbing the rn = 1 values, then bop back into the CTE to see what that is.
Using CROSS APPLY or OUTER APPLY, all the other tables not involved in the single-inner-row-per-outer-row selection are outside where (to me) they belong. We aren't squishing concerns together. Using Row_Number feels a bit like throwing a DISTINCT on a query to fix duplication rather than dealing with the underlying issue. I guess this is basically the same issue as the previous point worded in a different way.
The moment you have TWO tables from which you wish to pull the most recent value, the Row_Number() solution blows up completely. With this syntax, you just add another APPLY clause (see the sketch after this list), and it's crystal clear what you're doing. There is a way to use Row_Number for the multiple-table scenario by moving the other tables outside, but I still don't prefer that syntax.
Using this syntax allows you to perform additional joins based on whether the selected row exists or not (in the case that no matching row was found). In the Row_Number solution, you can only reasonably do that NOT NULL checking in the outer query--so you are forced to split up the query into multiple, separated parts (you don't want to be joining to values you will be discarding!).
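For instance, a sketch of that two-table case (dbo.Pricing is a hypothetical second table with Location, Price, and Date columns):

SELECT
   p.Part,
   v.LeadTime,
   pr.Price
FROM
   dbo.Parts p
   OUTER APPLY (
      SELECT TOP (1) *
      FROM dbo.Vendor v
      WHERE p.Location = v.Location
      ORDER BY v.Date DESC
   ) v
   OUTER APPLY (
      SELECT TOP (1) *
      FROM dbo.Pricing pr
      WHERE p.Location = pr.Location
      ORDER BY pr.Date DESC
   ) pr
WHERE
   p.Location IN ('A','B','C');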
P.S. I strongly encourage you to use aliases that hint at the table they represent. Please don't use a and b. I used p for Parts and v for Vendor--this helps you and others make sense of the query more quickly in the future.
If I understood you correctly, you want the rows with the max date for locations A, B, and C. Now, assuming SQL Server 2005+, you can do this:
;WITH CTE AS
(
    SELECT
        a.Part,
        b.Location,
        b.LeadTime,
        RN = ROW_NUMBER() OVER(PARTITION BY a.Part ORDER BY [Date] DESC)
    FROM
        dbo.Parts a
        LEFT OUTER JOIN dbo.Vendor b ON b.Part = a.Part
    WHERE
        b.Location IN ('A','B','C')
)
SELECT Part,
       Location,
       LeadTime
FROM CTE
WHERE RN = 1
ORDER BY Part
In your subquery you need to correlate the Location and Part to the outer query.
Example:
Date = (SELECT MAX(v.Date)
        FROM dbo.Vendor v
        WHERE v.Location = b.Location
          AND v.Part = b.Part
       )
So this will bring back one date for each location and part.
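Plugged back into the original query, that looks something like this (a sketch; as noted in the OUTER APPLY answer above, the GROUP BY can then be dropped):

SELECT
   a.Part,
   b.Location,
   b.LeadTime
FROM
   dbo.Parts a
   LEFT OUTER JOIN dbo.Vendor b ON b.Part = a.Part
WHERE
   b.Location IN ('A','B','C')
   AND b.Date = (SELECT MAX(v.Date)
                 FROM dbo.Vendor v
                 WHERE v.Location = b.Location
                   AND v.Part = b.Part)
ORDER BY
   a.Part;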
I have a query
SELECT
cd.signoffdate,
min(cmp.dsignoff) as dsignoff
FROM clients AS c
LEFT JOIN campaigns AS cmp ORDER BY dsignoff;
If I want to have something like this built into the Postgres query, will it work, and how do I do it?
If cd.signoffdate is empty, it should take min(cmp.dsignoff) as dsignoff as the value, and then order by this column. In other words, it should order by dsignoff and cd.signoffdate, treating them as one column. Is this possible, and how?
Your query could look like this:
SELECT c.client_id, COALESCE(c.signoffdate, min(cmp.dsignoff)) AS signoff
FROM clients c
LEFT JOIN campaigns cmp ON cmp.client_id = c.client_id -- join condition!
GROUP BY c.client_id, c.signoffdate -- group by!
ORDER BY COALESCE(c.signoffdate, min(cmp.dsignoff));
Or, with simplified syntax:
SELECT c.client_id, COALESCE(c.signoffdate, min(cmp.dsignoff)) AS signoff
FROM clients c
LEFT JOIN campaigns cmp USING (client_id)
GROUP BY 1, c.signoffdate
ORDER BY 2;
Major points:
You declared the alias c, but referenced it as cd.
A missing join condition leads to a cross join, which is probably not intended.
Missing GROUP BY.
I assume that you want to group by the primary key column of clients and call it client_id.
I also assume that client_id links the two tables together.
COALESCE() serves as fallback in case signoffdate IS NULL.
ORDER BY coalesce(cd.signoffdate, min(cmp.dsignoff));
But don't you need some GROUP BY in your original query?
You can use COALESCE
SELECT COALESCE(cd.signoffdate, min(cmp.dsignoff)) as dsignoff
I'm not sure if you can order by a COALESCE expression in Postgres - it might be worth just ordering by both columns.
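(For what it's worth, Postgres does accept expressions in ORDER BY; a minimal check:)

-- ordering by a COALESCE expression is valid in Postgres
SELECT x, y
FROM (VALUES (NULL::int, 2), (1, 3)) AS t(x, y)
ORDER BY COALESCE(x, y);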