Getting Different results with "LEFT OUTER JOIN" and "IN", where did my logic go wrong? - tsql

I have four tables, one is a Master Invoice table and three others are Invoices from different region. What I am trying to achieve is to return only records from the Master Invoice table where the invoice number is in one of the other three tables. For example:
SELECT * FROM Invoice_Master M
LEFT OUTER JOIN Invoice_North N
ON M.InvNo = N.InvNo
LEFT OUTER JOIN Invoice_East E
ON M.InvNo = E.InvNo
LEFT OUTER JOIN Invoice_South S
ON M.InvNo = S.InvNo
WHERE N.InvNo IS NOT NULL
OR E.InvNo IS NOT NULL
OR S.InvNo IS NOT NULL
The logic is if I "LEFT OUTER JOIN" the 3 tables to the Master table, if any InvNo is not null then the invoice must exist in the original Master table.
However, when I write the code as this implicit join, I get slightly fewer records in return:
select * FROM Invoice_Master
WHERE InvNo IN (
SELECT InvNo FROM Invoice_North)
OR InvNo IN (
SELECT InvNo FROM Invoice_East)
OR InvNo IN (
SELECT InvNo FROM Invoice_South)
Where did my logic go wrong?

The difference could be due to the fact that the second query returns each row of the master table at most once, whereas your first query returns join results that can contain duplicate rows. For example, if the left outer join matches two rows in, say, Invoice_North, then the master row appears twice in the main select.
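A minimal sketch of that effect, using Python's built-in sqlite3 module and made-up data (the table names come from the question; the rows, and the restriction to a single regional table, are assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Invoice_Master (InvNo INTEGER);
    CREATE TABLE Invoice_North  (InvNo INTEGER);
    INSERT INTO Invoice_Master VALUES (1), (2), (3);
    -- hypothetical data: invoice 1 appears twice in the regional table
    INSERT INTO Invoice_North  VALUES (1), (1), (2);
""")

# Join form: one output row per matching (master, regional) pair.
join_rows = conn.execute("""
    SELECT M.InvNo
    FROM Invoice_Master M
    LEFT OUTER JOIN Invoice_North N ON M.InvNo = N.InvNo
    WHERE N.InvNo IS NOT NULL
    ORDER BY M.InvNo
""").fetchall()

# IN form: each master row is returned at most once.
in_rows = conn.execute("""
    SELECT InvNo
    FROM Invoice_Master
    WHERE InvNo IN (SELECT InvNo FROM Invoice_North)
    ORDER BY InvNo
""").fetchall()

print(join_rows)  # [(1,), (1,), (2,)]
print(in_rows)    # [(1,), (2,)]
```

If the join version returns more rows than the IN version on the real data, duplicated InvNo values in one of the regional tables are a likely cause.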

Related

How can I list other matching values even if there is an unmatched value in the query?

In my query there is a value that will not match in the demand_category table. Because that one value does not match, the other matching values do not appear in the output either.
What I want to know:
How can I list the other matching values even if there is an unmatched value in the query?
process Table
fk_unit_id fk_unit_position fk_demand_category
1 2 1
unit table
unit_id
1
unit_position table
unit_position
2
demand_category table
demand_category
1
Query:
SELECT unit_name,unit_position_name,demand_category_name From process
INNER JOIN unit ON process.fk_unit_id = unit_id and unit_id =1
INNER JOIN unit_position ON process.fk_unit_position_id = unit_position_id and unit_position_id = 2
INNER JOIN demand_category ON process.fk_demand_category_id = demand_category_id and demand_category_id =0 ;
Replace the INNER JOIN on demand_category with a LEFT JOIN.
A LEFT JOIN returns all records from the left side together with the related records from the right table. If there is no related record, any columns you selected from the right table will contain NULL.
SELECT unit_name,unit_position_name,demand_category_name From process
INNER JOIN unit ON process.fk_unit_id = unit_id and unit_id =1
INNER JOIN unit_position ON process.fk_unit_position_id = unit_position_id and unit_position_id = 2
LEFT JOIN demand_category ON process.fk_demand_category_id = demand_category_id and demand_category_id =0 ;
You can use an outer join to keep the rows that don't match; the corresponding columns from the other table are simply padded with NULL. Another way is to use the IN operator, though it may perform worse.
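The difference between the two join types can be checked with a small sketch using Python's sqlite3 module. The schema is a simplified, assumed version of the tables in the question (unit_position is left out, and the names 'Unit A' and 'Category A' are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE process (fk_unit_id INTEGER, fk_demand_category_id INTEGER);
    CREATE TABLE unit (unit_id INTEGER, unit_name TEXT);
    CREATE TABLE demand_category (demand_category_id INTEGER,
                                  demand_category_name TEXT);
    INSERT INTO process VALUES (1, 1);
    INSERT INTO unit VALUES (1, 'Unit A');
    INSERT INTO demand_category VALUES (1, 'Category A');
""")

# Same query twice; only the join type on demand_category changes.
query = """
    SELECT unit_name, demand_category_name
    FROM process
    INNER JOIN unit
        ON process.fk_unit_id = unit_id AND unit_id = 1
    {join} JOIN demand_category
        ON process.fk_demand_category_id = demand_category_id
       AND demand_category_id = 0
"""
inner_rows = conn.execute(query.format(join="INNER")).fetchall()
left_rows = conn.execute(query.format(join="LEFT")).fetchall()

print(inner_rows)  # []
print(left_rows)   # [('Unit A', None)]
```

With INNER JOIN the unmatched category removes the whole row; with LEFT JOIN the row survives and the category column is NULL (None in Python).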

Snowflake "Exploding Join" issue while doing left join for multiple tables

I am trying to do some left joins on multiple tables and facing the following issue.
Row Counts of tables
Table 1: 1.6M
Table 2: 1.7M
Table 3: 1.5M
When I am doing left Join using Table 1 and 2 and following query, I get data count as 1.8 M (acceptable):
SELECT Table1.ID1, Table1.ID2, Table2.Name, Table2.City
FROM Table1
LEFT JOIN Table2
ON Table1.ID1 = Table2.ID1
AND Table1.ID2 = Table2.ID2
AND Table1.Source_System = Table2.Source_System
;
Similarly when I am doing left Join using Table 1 and 3 and following query, I get data count as 1.9 M (acceptable):
SELECT Table1.ID1, Table1.ID2, Table3.Name, Table3.City
FROM Table1
LEFT JOIN Table3
ON Table1.ID1 = Table3.ID1
AND Table1.ID2 = Table3.ID2
AND Table1.Source_System = Table3.Source_System
;
But when I am doing left Join using Table 1, 2 and 3 and following query, I get data count as 11.9 G (ISSUE):
SELECT
Table1.ID1, Table1.ID2,
Table2.Name, Table2.City,
Table3.Name as Name1, Table3.City as City1
FROM Table1
LEFT JOIN Table2
ON Table1.ID1 = Table2.ID1
AND Table1.ID2 = Table2.ID2
AND Table1.Source_System = Table2.Source_System
LEFT JOIN Table3
ON Table1.ID1 = Table3.ID1
AND Table1.ID2 = Table3.ID2
AND Table1.Source_System = Table3.Source_System
;
It seems you have assumed that Table1 and Table2 join at roughly 1:1, that Table1 and Table3 also join at roughly 1:1, and therefore that joining all three should again stay close to 1:1.
But the row counts only look 1:1 on average. A LEFT JOIN keeps each unmatched Table1 row once, so the fewer Table1 rows that match, the more duplicates the matching keys must carry to reach 1.8M. In the three-table query, each Table1 row is returned once per combination of its Table2 matches and its Table3 matches, so the per-key duplicate counts multiply: a key with 100 matches in each table yields 100 × 100 = 10,000 rows. Going from 1.6M rows to 11.9G is a factor of roughly 7,400, close to four orders of magnitude, which suggests a relatively small set of keys with a very large number of duplicates in both Table2 and Table3.
This can be seen via:
SELECT Table1.ID1, Table1.ID2, Table1.Source_System, COUNT(*) AS counts
FROM Table1
LEFT JOIN Table2
ON Table1.ID1 = Table2.ID1
AND Table1.ID2 = Table2.ID2
AND Table1.Source_System = Table2.Source_System
GROUP BY 1,2,3
ORDER BY counts DESC
;
This shows the row count for each distinct key combination, and which keys are the worst contributors to the combinatorial explosion.
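The multiplication can be reproduced on a tiny data set. This is only a sketch with Python's sqlite3 module: the join is reduced to a single key column and the duplicate counts (3 and 4) are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (ID1 INTEGER);
    CREATE TABLE Table2 (ID1 INTEGER, Name TEXT);
    CREATE TABLE Table3 (ID1 INTEGER, Name TEXT);
    INSERT INTO Table1 VALUES (1), (2);
""")
# Hypothetical duplicates: key 1 has 3 rows in Table2 and 4 rows in Table3.
# Key 2 has no match in either table.
conn.executemany("INSERT INTO Table2 VALUES (1, ?)",
                 [(f"t2-{i}",) for i in range(3)])
conn.executemany("INSERT INTO Table3 VALUES (1, ?)",
                 [(f"t3-{i}",) for i in range(4)])


def count(sql):
    """Run a COUNT(*) query and return the single number."""
    return conn.execute(sql).fetchone()[0]


two_way_t2 = count(
    "SELECT COUNT(*) FROM Table1 LEFT JOIN Table2 ON Table1.ID1 = Table2.ID1")
two_way_t3 = count(
    "SELECT COUNT(*) FROM Table1 LEFT JOIN Table3 ON Table1.ID1 = Table3.ID1")
three_way = count("""
    SELECT COUNT(*)
    FROM Table1
    LEFT JOIN Table2 ON Table1.ID1 = Table2.ID1
    LEFT JOIN Table3 ON Table1.ID1 = Table3.ID1
""")

print(two_way_t2, two_way_t3, three_way)  # 4 5 13
```

Each two-way join grows only slightly (3 + 1 and 4 + 1 rows), but the three-way join returns 3 × 4 + 1 = 13 rows, because the duplicates for key 1 are combined with each other.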
When a left join produces more records than the table on its left side, that should not be considered acceptable! It signals a problem in your join condition or your data. Either investigate those records in the table to avoid the duplication in the first place, or keep adjusting your SQL until the join is clean and returns exactly the left table's row count. Otherwise, it is very common for left joins to further tables with even a small number of duplicate records to multiply the row count, as you are seeing here.
To add to that, here is how to investigate and find those rows. Use the following SQL to find the rows in each table that share the same ID1, ID2 and Source_System values:
Select ID1, ID2 ,Source_System, COUNT(*) AS NUM_RECORDS_DUPS
FROM TABLE1
GROUP BY ID1, ID2 , Source_System
HAVING COUNT(*) > 1 -- keep only keys with more than one row satisfying the join condition
Run the same query against each of the tables to find those records, then either add another condition that makes the join unique, aggregate the table on the joining keys, or ask for the data to be cleansed.
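A sketch of the "find the duplicates, then aggregate on the joining keys" route, again with Python's sqlite3 module and a single assumed key column. Using MIN(Name) to choose which duplicate survives is an arbitrary assumption; the real rule depends on the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (ID1 INTEGER);
    CREATE TABLE Table2 (ID1 INTEGER, Name TEXT);
    INSERT INTO Table1 VALUES (1), (2);
    -- hypothetical data: key 1 is duplicated in Table2
    INSERT INTO Table2 VALUES (1, 'a'), (1, 'b'), (2, 'c');
""")

# Step 1: list the keys that occur more than once.
dup_keys = conn.execute("""
    SELECT ID1, COUNT(*) AS NUM_RECORDS_DUPS
    FROM Table2
    GROUP BY ID1
    HAVING COUNT(*) > 1
""").fetchall()

# Step 2: collapse Table2 to one row per key before joining.
joined = conn.execute("""
    SELECT Table1.ID1, T2.Name
    FROM Table1
    LEFT JOIN (
        SELECT ID1, MIN(Name) AS Name  -- arbitrary tie-break (assumption)
        FROM Table2
        GROUP BY ID1
    ) AS T2 ON Table1.ID1 = T2.ID1
    ORDER BY Table1.ID1
""").fetchall()

print(dup_keys)  # [(1, 2)]
print(joined)    # [(1, 'a'), (2, 'c')]
```

After the aggregation the join returns exactly one row per Table1 row.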
Have you tried adding a DISTINCT clause?
SELECT DISTINCT columns, of, choice
FROM Table1
LEFT JOIN Table2 on ...
LEFT JOIN Table3 on ...
I think what's happening is that you have duplicates that left join onto another giant set of duplicates.
Use the proper keys to join the tables; that solves the issue.

COALESCE TSQL with a join tsql

I have a requirement to pick up data that can be in more than one place, and I can do this after a fashion with the COALESCE function. Basically I am looking to coalesce the join itself, but from looking online it seems I can only do this on fields.
So we have a Products and a Suppliers table, and each also has a temp table, so four tables in total (products, tempproducts, suppliers, tempsuppliers). The suppliers and products tables store our existing products and suppliers, and the temp tables store any new suppliers/products. We also have a tempsupplierproduct table, which joins new suppliers to new products. However, we can end up in a situation where a new supplier has an existing product: the new supplier will be in the tempsuppliers table, its product will be in the products table (NOT tempproducts, as it is not new), and there will be a new tempsupplierproduct row to join the two up.
So I want a query which looks in the tempsupplierproduct table and then gets basic information about the supplier and product. To do this I am using COALESCE.
SELECT DISTINCT SP.*, COALESCE(P.Product, PD.Product) 'Product', COALESCE(S.Supplier, SU.Supplier) 'Supplier'
FROM tempsupplierproduct SP
LEFT JOIN tempProduct P ON SP.ProductCode = P.Code
LEFT JOIN Products PD ON SP.ProductCode = PD.Code
LEFT JOIN tempSupplier S ON SP.SupplierCode = S.Code
LEFT JOIN Suppliers SU ON SP.SupplierCode = SU.Code
Now while this works, something at the back of my head tells me it is not entirely right. Ideally I want: if the data is not in table A, then join to table B. I have seen suggestions of coalescing inside the join itself, but I am unsure how to do this.
LEFT JOIN Suppliers Su ON SP.SupplierCode = COALESCE(S.Code, SU.Code)
may be a way, but I am confused by it: all it says is use the code in the temp table and, if it is not there, use the supplier code. So what happens if we do have a code in the temp table? Will it try to join on that? If so, then this is incorrect too.
Any help is appreciated
You can union the two suppliers tables together and then join them in one go like this. I'm assuming that there are no duplicates between the two tables in this case but with a bit of extra work that could be resolved as well.
WITH AllSuppliers AS
(
SELECT Code, Supplier FROM Suppliers
UNION ALL
SELECT Code, Supplier FROM tempSupplier
)
SELECT DISTINCT SP.*, COALESCE(P.Product, PD.Product) 'Product', S.Supplier
FROM tempsupplierproduct SP
LEFT JOIN tempProduct P ON SP.ProductCode = P.Code
LEFT JOIN Products PD ON SP.ProductCode = PD.Code
LEFT JOIN AllSuppliers S ON SP.SupplierCode = S.Code
If you need to handle duplicates across the two suppliers tables, then an approach like this should work. Essentially, we rank the duplicates and then pick the highest-ranked result. For two tables you could use a full outer join between the two, but this approach scales to any number of tables.
WITH AllSuppliers AS
(
SELECT Code, Supplier, 1 AS TablePriority FROM Suppliers
UNION ALL
SELECT Code, Supplier, 2 AS TablePriority FROM tempSupplier
),
SuppliersRanked AS
(
SELECT Code, Supplier,
ROW_NUMBER() OVER (PARTITION BY Code ORDER BY TablePriority) AS RowPriority
FROM AllSuppliers
)
SELECT DISTINCT SP.*, COALESCE(P.Product, PD.Product) 'Product', S.Supplier
FROM tempsupplierproduct SP
LEFT JOIN tempProduct P ON SP.ProductCode = P.Code
LEFT JOIN Products PD ON SP.ProductCode = PD.Code
LEFT JOIN SuppliersRanked S ON SP.SupplierCode = S.Code
AND RowPriority = 1
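To see what the ranking step does in isolation, here is a sketch that runs the two CTEs through Python's sqlite3 module (this assumes an SQLite build with window-function support, 3.25 or later). The supplier names are invented; code 'S1' deliberately exists in both tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Suppliers    (Code TEXT, Supplier TEXT);
    CREATE TABLE tempSupplier (Code TEXT, Supplier TEXT);
    -- hypothetical data: S1 is present in both tables, S2 only in the temp table
    INSERT INTO Suppliers    VALUES ('S1', 'Acme (main)');
    INSERT INTO tempSupplier VALUES ('S1', 'Acme (temp)'), ('S2', 'Bolt (temp)');
""")

ranked = conn.execute("""
    WITH AllSuppliers AS (
        SELECT Code, Supplier, 1 AS TablePriority FROM Suppliers
        UNION ALL
        SELECT Code, Supplier, 2 AS TablePriority FROM tempSupplier
    ),
    SuppliersRanked AS (
        SELECT Code, Supplier,
               ROW_NUMBER() OVER (PARTITION BY Code
                                  ORDER BY TablePriority) AS RowPriority
        FROM AllSuppliers
    )
    SELECT Code, Supplier
    FROM SuppliersRanked
    WHERE RowPriority = 1
    ORDER BY Code
""").fetchall()

print(ranked)  # [('S1', 'Acme (main)'), ('S2', 'Bolt (temp)')]
```

Each code comes back exactly once, and for S1 the row from Suppliers wins because it has TablePriority 1. Swapping the two priority constants would make the temp table win instead.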
You can absolutely join on a coalesced field. Here is a snippet from one of my production views:
LEFT JOIN [Portal].tblHelpdeskresource supplier ON PO.fld_str_SupplierID = supplier.fld_str_SupplierID
-- Job type a
LEFT JOIN [Portal].tblHelpDeskFault HDF ON PO.fld_int_HelpdeskFaultID = HDF.fld_int_ID
-- Job Type b
LEFT JOIN [Portal].tblProjectHeader PH ON PO.fld_int_ProjectHeaderID = PH.fld_int_ID
LEFT JOIN [Portal].tblPPMScheduleLine PSL ON PH.fld_int_PPMScheduleRef = PSL.fld_int_ID
-- Managers (used to be separate for a & b type, now converged)
LEFT JOIN [Portal].uvw_HelpDeskSiteManagers PSM ON COALESCE(PSL.fld_int_StoreID,HDF.fld_int_StoreID) = PSM.PortalSiteId
LEFT JOIN [Portal].tblHelpdeskResource PHDR ON PSM.PortalResourceId = PHDR.fld_int_ID

SQL left join on maximum date

I have two tables: contracts and contract_descriptions.
contract_descriptions has a column named contract_id, which matches contract_id on the contracts table.
I am trying to join the latest record on contract_descriptions:
SELECT *
FROM contracts c
LEFT JOIN contract_descriptions d ON d.contract_id = c.contract_id
AND d.date_description =
(SELECT MAX(date_description)
FROM contract_descriptions t
WHERE t.contract_id = c.contract_id)
It works, but is this a performant way to do it? Is there a way to avoid the second SELECT?
You could also alternatively use DISTINCT ON:
SELECT * FROM contracts c LEFT JOIN (
SELECT DISTINCT ON (cd.contract_id) cd.* FROM contract_descriptions cd
ORDER BY cd.contract_id, cd.date_description DESC
) d ON d.contract_id = c.contract_id
DISTINCT ON selects only one row per contract_id, while the sort clause cd.date_description DESC ensures that it is always the latest description.
Performance depends on many factors (for example, table size). In any case, you should compare both approaches with EXPLAIN.
Your query looks okay to me. One typical way to join only n rows by some order from the other table is a lateral join:
SELECT *
FROM contracts c
LEFT JOIN LATERAL -- keeps contracts without any description; CROSS JOIN LATERAL would drop them
(
SELECT *
FROM contract_descriptions cd
WHERE cd.contract_id = c.contract_id
ORDER BY cd.date_description DESC
FETCH FIRST 1 ROW ONLY
) cdlast ON TRUE;
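Whichever form you choose, the expected result is the same: one row per contract, carrying its latest description or NULLs if it has none. SQLite supports neither DISTINCT ON nor LATERAL, so the sketch below uses Python's sqlite3 module to run the correlated-subquery form from the question on invented data (the description column and its values are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contracts (contract_id INTEGER);
    CREATE TABLE contract_descriptions (
        contract_id INTEGER,
        date_description TEXT,   -- ISO dates sort correctly as text
        description TEXT         -- hypothetical payload column
    );
    INSERT INTO contracts VALUES (1), (2);
    -- contract 1 has two descriptions, contract 2 has none
    INSERT INTO contract_descriptions VALUES
        (1, '2023-01-01', 'old'),
        (1, '2024-01-01', 'new');
""")

latest = conn.execute("""
    SELECT c.contract_id, d.description
    FROM contracts c
    LEFT JOIN contract_descriptions d
           ON d.contract_id = c.contract_id
          AND d.date_description = (SELECT MAX(t.date_description)
                                    FROM contract_descriptions t
                                    WHERE t.contract_id = c.contract_id)
    ORDER BY c.contract_id
""").fetchall()

print(latest)  # [(1, 'new'), (2, None)]
```

Note that if two descriptions of one contract share the same maximum date, this form returns both, whereas the one-row-per-contract variants return only one of them.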

Selecting non-repeating values in Postgres

SELECT DISTINCT a.s_id, select2Result.s_id, select2Result."mNrPhone",
select2Result."dNrPhone"
FROM "Table1" AS a INNER JOIN
(
SELECT b.s_id, c."mNrPhone", c."dNrPhone" FROM "Table2" AS b, "Table3" AS c
WHERE b.a_id = 1001 AND b.s_id = c.s_id
ORDER BY b.last_name) AS select2Result
ON a.a_id = select2Result.student_id
WHERE a.k_id = 11211
It returns:
1001;1001;"";""
1002;1002;"";""
1002;1002;"2342342232123";"2342342"
1003;1003;"";""
1004;1004;"";""
1002 value is repeated twice, but it shouldn't because I used DISTINCT and no other table has an id repeated twice.
You can use DISTINCT ON like this:
SELECT DISTINCT ON (a.s_id)
a.s_id, select2Result.s_id, select2Result."mNrPhone",
select2Result."dNrPhone"
...
But as others have told you, the "repeated records" are actually different.
The qualifier DISTINCT applies to the entire row, not to the first column in the select-list. Since columns 3 and 4 (mNrPhone and dNrPhone) are different for the two rows with s_id = 1002, the DBMS correctly lists both rows. You have to write your query differently if you only want the s_id = 1002 to appear once, and you have to decide which auxiliary data you want shown.
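A small sketch of that behaviour with Python's sqlite3 module. The single table and its rows are invented to mirror the output in the question, and MAX(phone) is just one assumed way of deciding which auxiliary value to keep:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE phones (s_id INTEGER, phone TEXT);
    -- hypothetical data: 1002 has two different rows, 1003 two identical rows
    INSERT INTO phones VALUES
        (1002, ''),
        (1002, '2342342232123'),
        (1003, ''),
        (1003, '');
""")

# DISTINCT compares whole rows: only the identical 1003 rows collapse.
distinct_rows = conn.execute("""
    SELECT DISTINCT s_id, phone
    FROM phones
    ORDER BY s_id, phone
""").fetchall()

# One row per s_id requires choosing a value, here the largest phone string.
one_per_id = conn.execute("""
    SELECT s_id, MAX(phone) AS phone
    FROM phones
    GROUP BY s_id
    ORDER BY s_id
""").fetchall()

print(distinct_rows)  # [(1002, ''), (1002, '2342342232123'), (1003, '')]
print(one_per_id)     # [(1002, '2342342232123'), (1003, '')]
```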
As an aside, it is strongly recommended that you always use the explicit JOIN notation (which was introduced in SQL-92) in all queries and sub-queries. Do not use the old implicit join notation (which is all that was available in SQL-86 or SQL-89), and especially do not use a mixture of explicit and implicit join notations (where your sub-query uses the implicit join, but the main query uses explicit join). You need to know the old notation so you can understand old queries. You should write new queries in the new notation.
First of all, the query as displayed does not work at all: student_id is missing from the sub-query, yet you use it in the JOIN later.
More interestingly:
Pick a certain row out of a set with DISTINCT
DISTINCT and DISTINCT ON conceptually return distinct values by sorting all rows according to the set of columns to be distinct, then picking the first row from every set. The sort is by all columns for a general DISTINCT and only by the specified columns for DISTINCT ON. Here lies the opportunity to pick certain rows out of a set over others.
For instance if you prefer rows with not-empty "mNrPhone" in your example:
SELECT DISTINCT ON (a.s_id) -- sure you didn't want a.a_id?
a.s_id AS a_s_id -- use aliases to avoid dupe name
,s.s_id AS s_s_id
,s."mNrPhone"
,s."dNrPhone"
FROM "Table1" a
JOIN (
SELECT b.s_id, c."mNrPhone", c."dNrPhone", ??.student_id -- missing!
FROM "Table2" b
JOIN "Table3" c USING (s_id)
WHERE b.a_id = 1001
-- ORDER BY b.last_name -- pointless, DISTINCT will re-order
) s ON a.a_id = s.student_id
WHERE a.k_id = 11211
ORDER BY a.s_id -- first col must agree with DISTINCT ON, could add DESC though
,("mNrPhone" <> '') DESC -- non-empty first
ORDER BY cannot disagree with DISTINCT on the same query level. To get around this you can either use GROUP BY instead or put the whole query in a sub-query and run another SELECT with ORDER BY on it.
The ORDER BY you had in the sub-query is voided now.
In this particular case, if - as it seems - the dupes come only from the sub-query (you'd have to verify), you could instead:
SELECT a.a_id, s.s_id, s."mNrPhone", s."dNrPhone" -- picking a.a_id over s_id
FROM "Table1" a
JOIN (
SELECT DISTINCT ON (b.s_id)
b.s_id, c."mNrPhone", c."dNrPhone", ??.student_id -- missing!
FROM "Table2" b
JOIN "Table3" c USING (s_id)
WHERE b.a_id = 1001
ORDER BY b.s_id, (c."mNrPhone" <> '') DESC -- pick non-empty first
) s ON a.a_id = s.student_id
WHERE a.k_id = 11211
ORDER BY a.a_id -- now you can ORDER BY freely