Snowflake "Exploding Join" issue while doing left join for multiple tables

I am trying to do some left joins on multiple tables and am facing the following issue.
Row Counts of tables
Table 1: 1.6M
Table 2: 1.7M
Table 3: 1.5M
When I left join Table 1 and Table 2 with the following query, I get a row count of 1.8 M (acceptable):
SELECT Table1.ID1, Table1.ID2, Table2.Name, Table2.City
FROM Table1
LEFT JOIN Table2
ON Table1.ID1 = Table2.ID1
AND Table1.ID2 = Table2.ID2
AND Table1.Source_System = Table2.Source_System
;
Similarly, when I left join Table 1 and Table 3 with the following query, I get a row count of 1.9 M (acceptable):
SELECT Table1.ID1, Table1.ID2, Table3.Name, Table3.City
FROM Table1
LEFT JOIN Table3
ON Table1.ID1 = Table3.ID1
AND Table1.ID2 = Table3.ID2
AND Table1.Source_System = Table3.Source_System
;
But when I left join Table 1, 2 and 3 with the following query, I get a row count of 11.9 G (ISSUE):
SELECT
Table1.ID1, Table1.ID2,
Table2.Name, Table2.City,
Table3.Name as Name1, Table3.City as City1
FROM Table1
LEFT JOIN Table2
ON Table1.ID1 = Table2.ID1
AND Table1.ID2 = Table2.ID2
AND Table1.Source_System = Table2.Source_System
LEFT JOIN Table3
ON Table1.ID1 = Table3.ID1
AND Table1.ID2 = Table3.ID2
AND Table1.Source_System = Table3.Source_System
;

So it seems you have assumed that the data in Table1 and Table2 joins in a 1:1 ratio, and that Table1 and Table3 also join in a 1:1 ratio, and therefore that when all three tables are joined the ratio should again be on the order of 1:1.
But if half of your entries in Table1 have no match in Table2, then to get the 1.8M result the matching rows would have to be duplicated more than 2.0 times to account for that increase. If we change that from half matching to only a tenth matching, there would need to be more than 10.0 duplicates. Thus, to get the four orders of magnitude of growth you have, it seems that only about a hundredth of the rows match, but each has more than 100.0 duplicates, which when cross joined (100 x 100) gives the 10,000-fold growth in rows.
This can be seen via:
SELECT Table1.ID1, Table1.ID2, Table1.Source_System, count(*) as counts
FROM Table1
LEFT JOIN Table2
ON Table1.ID1 = Table2.ID1
AND Table1.ID2 = Table2.ID2
AND Table1.Source_System = Table2.Source_System
GROUP BY 1,2,3
ORDER BY counts DESC
;
This shows every distinct key combination and which of them are the worst contributors to the combinatorial explosion.
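The multiplication described above can be reproduced on a tiny scale. The following is a minimal sketch using Python's built-in sqlite3 module rather than Snowflake; the toy rows (one key, repeated 3 times in Table2 and 4 times in Table3) and the helper names `left_join` and `count_rows` are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (ID1 INT, ID2 INT, Source_System TEXT);
    CREATE TABLE Table2 (ID1 INT, ID2 INT, Source_System TEXT, Name TEXT);
    CREATE TABLE Table3 (ID1 INT, ID2 INT, Source_System TEXT, Name TEXT);
""")

# Hypothetical toy data: one key in Table1, repeated 3x in Table2 and 4x in Table3.
conn.execute("INSERT INTO Table1 VALUES (1, 1, 'A')")
conn.executemany("INSERT INTO Table2 VALUES (1, 1, 'A', ?)",
                 [(f"t2-{i}",) for i in range(3)])
conn.executemany("INSERT INTO Table3 VALUES (1, 1, 'A', ?)",
                 [(f"t3-{i}",) for i in range(4)])


def left_join(table):
    """Build the LEFT JOIN clause from the question for the given table."""
    return (f"LEFT JOIN {table} ON Table1.ID1 = {table}.ID1 "
            f"AND Table1.ID2 = {table}.ID2 "
            f"AND Table1.Source_System = {table}.Source_System ")


def count_rows(*tables):
    """Count the rows produced by left joining Table1 to each given table."""
    sql = "SELECT COUNT(*) FROM Table1 " + "".join(left_join(t) for t in tables)
    return conn.execute(sql).fetchone()[0]


rows_12 = count_rows("Table2")             # 1 x 3 rows
rows_13 = count_rows("Table3")             # 1 x 4 rows
rows_123 = count_rows("Table2", "Table3")  # 1 x 3 x 4 rows
print(rows_12, rows_13, rows_123)
```

Each pairwise join looks harmless on its own, but in the three-table join every Table2 match is paired with every Table3 match for the same key, so the per-key duplicate counts multiply.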

When your left join produces more records than the reference (left-hand) table, that should not be acceptable! It should be a warning sign about your join condition and your data. Either investigate those records in the table to avoid the problem in the first place, or keep tweaking your SQL until you have a clean join that produces exactly the reference table's row count. Otherwise, it is very common for a left join to another table with even a few duplicate records to multiply the row count, as you are seeing here.
To investigate and find those rows, use the following SQL to find which rows in each table share the same ID1, ID2 and Source_System values,
i.e.:
Select ID1, ID2 ,Source_System, COUNT(*) AS NUM_RECORDS_DUPS
FROM TABLE1
GROUP BY ID1, ID2 , Source_System
HAVING COUNT(*) > 1 -- keep only key combinations where more than one row satisfies the join condition
Use the same query for each of the tables to find those records, then either add another condition that makes the join unique, aggregate the table on the joining keys, or ask for those records to be cleansed.
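One of the options above, aggregating each table on the joining keys before joining, might look like the sketch below. It runs on Python's built-in sqlite3 rather than Snowflake, and the toy rows, the `dedup` helper and the choice of MAX(Name) as the surviving value are all assumptions for illustration; which row to keep is a data decision that only you can make.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (ID1 INT, ID2 INT, Source_System TEXT);
    CREATE TABLE Table2 (ID1 INT, ID2 INT, Source_System TEXT, Name TEXT);
    CREATE TABLE Table3 (ID1 INT, ID2 INT, Source_System TEXT, Name TEXT);
    -- Hypothetical toy data with duplicated keys in Table2 and Table3.
    INSERT INTO Table1 VALUES (1, 1, 'A'), (2, 2, 'A');
    INSERT INTO Table2 VALUES (1, 1, 'A', 'x'), (1, 1, 'A', 'y'), (2, 2, 'A', 'z');
    INSERT INTO Table3 VALUES (1, 1, 'A', 'p'), (1, 1, 'A', 'q'), (1, 1, 'A', 'r');
""")


def dedup(table):
    """Subquery that collapses a table to one row per join key (keeps MAX(Name))."""
    return (f"(SELECT ID1, ID2, Source_System, MAX(Name) AS Name "
            f"FROM {table} GROUP BY ID1, ID2, Source_System)")


sql = f"""
    SELECT Table1.ID1, Table1.ID2, t2.Name, t3.Name AS Name1
    FROM Table1
    LEFT JOIN {dedup('Table2')} AS t2
      ON Table1.ID1 = t2.ID1 AND Table1.ID2 = t2.ID2
     AND Table1.Source_System = t2.Source_System
    LEFT JOIN {dedup('Table3')} AS t3
      ON Table1.ID1 = t3.ID1 AND Table1.ID2 = t3.ID2
     AND Table1.Source_System = t3.Source_System
    ORDER BY Table1.ID1
"""
rows = conn.execute(sql).fetchall()
table1_count = conn.execute("SELECT COUNT(*) FROM Table1").fetchone()[0]
print(len(rows), table1_count)
```

Because each derived table has at most one row per key, the result has exactly as many rows as Table1, provided Table1 itself is unique on those keys.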

Have you tried adding a DISTINCT clause?
SELECT DISTINCT columns, of, choice
FROM Table1
LEFT JOIN Table2 on ...
LEFT JOIN Table3 on ...
I think what's happening is that you have duplicates that left join onto another giant set of duplicates.

Use the proper keys to join the tables; that solves the issue.

Related

How can I list other matching values even if there is an unmatched value in the query?

My query contains a value that has no match in the demand_category table. Because that one value does not match, the other matching values do not appear in the output of my query either.
What I want to know:
How can I list the other matching values even if there is an unmatched value in the query?
process Table
fk_unit_id fk_unit_position fk_demand_category
1 2 1
unit table
unit_id
1
unit_position table
unit_position
2
demand_category table
demand_category
1
Query:
SELECT unit_name,unit_position_name,demand_category_name From process
INNER JOIN unit ON process.fk_unit_id = unit_id and unit_id =1
INNER JOIN unit_position ON process.fk_unit_position_id = unit_position_id and unit_position_id = 2
INNER JOIN demand_category ON process.fk_demand_category_id = demand_category_id and demand_category_id =0 ;
Replace the INNER JOIN on demand_category with a LEFT JOIN.
A LEFT JOIN returns all records from the left side together with the related records from the right table; if there is no related record, any columns you selected from the right table will contain NULL.
SELECT unit_name,unit_position_name,demand_category_name From process
INNER JOIN unit ON process.fk_unit_id = unit_id and unit_id =1
INNER JOIN unit_position ON process.fk_unit_position_id = unit_position_id and unit_position_id = 2
LEFT JOIN demand_category ON process.fk_demand_category_id = demand_category_id and demand_category_id =0 ;
You can use an outer join to keep the rows that don't match; the corresponding columns from the other table will just be padded with NULL. Another way is to use the IN operator, but with slower query performance.
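The difference between the two join types can be checked with a small sketch. It uses Python's built-in sqlite3, a reduced schema (the unit_position table is left out for brevity) and hypothetical names 'Unit A' and 'Category A'; none of these come from the original question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE unit (unit_id INT, unit_name TEXT);
    CREATE TABLE demand_category (demand_category_id INT, demand_category_name TEXT);
    CREATE TABLE process (fk_unit_id INT, fk_demand_category_id INT);
    -- Hypothetical sample rows.
    INSERT INTO unit VALUES (1, 'Unit A');
    INSERT INTO demand_category VALUES (1, 'Category A');
    INSERT INTO process VALUES (1, 1);
""")

# Same query shape as in the question; only the join type on demand_category varies.
QUERY = """
    SELECT unit_name, demand_category_name
    FROM process
    INNER JOIN unit ON process.fk_unit_id = unit_id AND unit_id = 1
    {join} demand_category
        ON process.fk_demand_category_id = demand_category_id
       AND demand_category_id = 0
"""

inner_rows = conn.execute(QUERY.format(join="INNER JOIN")).fetchall()
left_rows = conn.execute(QUERY.format(join="LEFT JOIN")).fetchall()
print(inner_rows)  # no demand_category row has id 0, so the whole row is dropped
print(left_rows)   # the unit still appears; the category column is NULL (None)
```

Note that the filter `demand_category_id = 0` stays in the ON clause. Moving it to a WHERE clause would discard the NULL-padded row again and make the LEFT JOIN behave like an INNER JOIN.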

How to make postgres (cursor?) start at particular row

I have created the following query:
select t.id, t.row_id, t.content, t.location, t.retweet_count, t.favorite_count, t.happened_at,
a.id, a.screen_name, a.name, a.description, a.followers_count, a.friends_count, a.statuses_count,
c.id, c.code, c.name,
t.parent_id
from tweets t
join accounts a on a.id = t.author_id
left outer join countries c on c.id = t.country_id
where t.row_id > %s
-- order by t.row_id
limit 100
Where %s is a number that starts at 0 and is incremented by 100 after each such query. I want to fetch all records from the database using this method, just increasing the %s in the WHERE condition. I found this approach at https://ivopereira.net/efficient-pagination-dont-use-offset-limit. I also added a column to my table that corresponds to the row number (I named it row_id). The problem is that when I run this query the first time, it returns rows that have a row_id of 3 million. I would like the cursor (not sure if my terminology is correct) to start from the rows with row_id 1 through 100, and so on. The table contains 7 million rows. Am I missing something obvious that would let me achieve my goal?
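For reference, here is a minimal sketch of the pagination pattern the question describes, with the ORDER BY enabled. Without an ORDER BY, SQL makes no guarantee about which 100 of the matching rows a LIMIT returns. The sketch uses Python's built-in sqlite3 instead of Postgres, a reduced hypothetical tweets table, and a hypothetical `fetch_page` helper; passing the last row_id actually seen (rather than adding 100) is a design choice of this sketch so that gaps in row_id do not skip rows.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (row_id INTEGER, content TEXT)")

# Insert 250 hypothetical rows in shuffled order so physical order != row_id order.
ids = list(range(1, 251))
random.Random(0).shuffle(ids)
conn.executemany("INSERT INTO tweets VALUES (?, ?)",
                 [(i, f"tweet {i}") for i in ids])


def fetch_page(last_row_id, page_size=100):
    """Return the next page of rows whose row_id is greater than last_row_id."""
    return conn.execute(
        "SELECT row_id, content FROM tweets "
        "WHERE row_id > ? ORDER BY row_id LIMIT ?",
        (last_row_id, page_size),
    ).fetchall()


seen = []
last_row_id = 0
while True:
    page = fetch_page(last_row_id)
    if not page:
        break
    seen.extend(row_id for row_id, _ in page)
    last_row_id = page[-1][0]  # continue after the last row actually returned

print(len(seen), seen[:3], seen[-3:])
```

With the ORDER BY in place the first page holds row_id 1 through 100, and an index on row_id lets the database find each page's starting point without scanning the rows before it.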

SQL left join on maximum date

I have two tables: contracts and contract_descriptions.
contract_descriptions has a column named contract_id that matches the contract_id of records in the contracts table.
I am trying to join the latest record on contract_descriptions:
SELECT *
FROM contracts c
LEFT JOIN contract_descriptions d ON d.contract_id = c.contract_id
AND d.date_description =
(SELECT MAX(date_description)
FROM contract_descriptions t
WHERE t.contract_id = c.contract_id)
It works, but is this a performant way to do it? Is there a way to avoid the second SELECT?
Alternatively, you could use DISTINCT ON:
SELECT * FROM contracts c LEFT JOIN (
SELECT DISTINCT ON (cd.contract_id) cd.* FROM contract_descriptions cd
ORDER BY cd.contract_id, cd.date_description DESC
) d ON d.contract_id = c.contract_id
DISTINCT ON selects only one row per contract_id, while the sort clause cd.date_description DESC ensures that it is always the latest description.
Performance depends on many factors (for example, table size). In any case, you should compare both approaches with EXPLAIN.
Your query looks okay to me. One typical way to join only n rows by some order from the other table is a lateral join. LEFT JOIN LATERAL ... ON true keeps contracts that have no description, as your LEFT JOIN does; a CROSS JOIN LATERAL would drop them:
SELECT *
FROM contracts c
LEFT JOIN LATERAL
(
SELECT *
FROM contract_descriptions cd
WHERE cd.contract_id = c.contract_id
ORDER BY cd.date_description DESC
FETCH FIRST 1 ROW ONLY
) cdlast ON true;
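DISTINCT ON and LATERAL are PostgreSQL features, so the sketch below does not reproduce them. Instead it runs the correlated-subquery join from the question on Python's built-in sqlite3 and compares it with a plain derived-table variant (a swapped-in technique, not one of the answers above). The `description` column and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contracts (contract_id INT);
    CREATE TABLE contract_descriptions (
        contract_id INT, date_description TEXT, description TEXT);
    -- Hypothetical rows: contract 3 has no description at all.
    INSERT INTO contracts VALUES (1), (2), (3);
    INSERT INTO contract_descriptions VALUES
        (1, '2020-01-01', 'old'),
        (1, '2021-06-01', 'new'),
        (2, '2019-03-15', 'only');
""")

# The join from the question: match only the row with the maximum date.
correlated = conn.execute("""
    SELECT c.contract_id, d.description
    FROM contracts c
    LEFT JOIN contract_descriptions d ON d.contract_id = c.contract_id
     AND d.date_description =
         (SELECT MAX(date_description)
          FROM contract_descriptions t
          WHERE t.contract_id = c.contract_id)
    ORDER BY c.contract_id
""").fetchall()

# Variant: compute the maximum date per contract once, then join back on it.
derived = conn.execute("""
    SELECT c.contract_id, d.description
    FROM contracts c
    LEFT JOIN (SELECT contract_id, MAX(date_description) AS max_date
               FROM contract_descriptions
               GROUP BY contract_id) m ON m.contract_id = c.contract_id
    LEFT JOIN contract_descriptions d ON d.contract_id = c.contract_id
     AND d.date_description = m.max_date
    ORDER BY c.contract_id
""").fetchall()

print(correlated)
print(derived)
```

Both forms keep the contract without descriptions and return one row per contract here. If two descriptions share the same maximum date, either form returns both rows, which is one reason to prefer DISTINCT ON or a LIMIT 1 lateral subquery in PostgreSQL.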

How to join vertical and horizontal table together

I have two tables, one of which is vertical, i.e. it stores only key-value pairs with a reference ID from table 1. I want to join both tables and display the key-value pairs as columns in the SELECT, and also sort on a few of the keys.
T1 having (id,empid,dpt)
T2 having (empid,key,value)
select
T1.*,
t21.value,
t22.value,
t23.value
from Table1 t1
join Table2 t21 on t1.empid = t21.empid
join Table2 t22 on t1.empid = t22.empid
join Table2 t23 on t1.empid = t23.empid
where
t21.key = 'FNAME'
and t22.key = 'LNAME'
and t23.key='AGE'
The query you demonstrate is very inefficient (another join for each additional column) and also has a potential problem: if there isn't a row in T2 for every key in the WHERE clause, the whole row is excluded.
The second problem can be avoided with LEFT [OUTER] JOIN instead of [INNER] JOIN. But don't bother, the solution to the first problem is a completely different query. "Pivot" T2 using crosstab() from the additional module tablefunc:
SELECT * FROM crosstab(
'SELECT empid, key, value FROM t2 ORDER BY 1'
, $$VALUES ('FNAME'), ('LNAME'), ('AGE')$$ -- more?
) AS ct (empid int -- use *actual* data types
, fname text
, lname text
, age text);
-- more?
Then just join to T1:
select *
from t1
JOIN (<insert query from above>) AS t2 USING (empid);
This time you may want to use [INNER] JOIN.
The USING clause conveniently removes the second instance of the empid column.
Detailed instructions:
PostgreSQL Crosstab Query
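crosstab() is specific to PostgreSQL's tablefunc module. As a portable illustration of the same pivot idea, the sketch below uses conditional aggregation (MAX(CASE ...)), which is a different technique from the crosstab() call above, and runs it on Python's built-in sqlite3. The sample employees and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INT, empid INT, dpt TEXT);
    CREATE TABLE t2 (empid INT, "key" TEXT, value TEXT);
    -- Hypothetical rows: employee 200 has no LNAME or AGE entry.
    INSERT INTO t1 VALUES (1, 100, 'HR'), (2, 200, 'IT');
    INSERT INTO t2 VALUES
        (100, 'FNAME', 'Ann'), (100, 'LNAME', 'Lee'), (100, 'AGE', '30'),
        (200, 'FNAME', 'Bob');
""")

# Pivot t2 once (one output row per empid), then join the result to t1.
rows = conn.execute("""
    SELECT t1.id, t1.empid, t1.dpt, p.fname, p.lname, p.age
    FROM t1
    JOIN (SELECT empid,
                 MAX(CASE WHEN "key" = 'FNAME' THEN value END) AS fname,
                 MAX(CASE WHEN "key" = 'LNAME' THEN value END) AS lname,
                 MAX(CASE WHEN "key" = 'AGE'   THEN value END) AS age
          FROM t2
          GROUP BY empid) p ON p.empid = t1.empid
    ORDER BY t1.id
""").fetchall()
print(rows)
```

As with crosstab(), t2 is read once regardless of how many keys become columns, and a missing key yields NULL instead of removing the employee from the result.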

Degraded SQL Query Speed By Nesting a Single Query inline vs temp table

I have a query of the following, basic form.
SELECT DISTINCT
a.field1,
b.field2,
c.agg_values
FROM a
INNER JOIN b ON a.something = b.something
LEFT JOIN (
SELECT
array_to_string(array_agg(label), ';;') AS agg_values,
some_table.some_field
FROM some_table
WHERE some_table.some_field = 'some-fixed-value'
GROUP BY some_field
) AS c ON a.some_field = c.some_field
WHERE a.some_other_field = 'some-other-fixed-value'
There's nothing too wild about this query. Pretty run of the mill!
This query runs pretty slowly in my Postgres 9.4.5 (~4 minutes), where I have maybe 15k records returned in total. some_table has probably ~10k records.
If I move the content of that LEFT JOIN sub-query to a temp table and left join from the temp table, my performance increases substantially. My query may take only 15s now, vs 240s. To be more explicit, if I remove SELECT array_to_string ... GROUP BY some_field query, and put that query into a temp table, then left join on that temp table, BAM, fast.
CREATE TEMP TABLE temp_table_c ( ... );
INSERT INTO temp_table_c SELECT ... same query nested in LEFT JOIN from before ...;
SELECT DISTINCT
a.field1,
b.field2,
c.agg_values
FROM a
INNER JOIN b ON a.something = b.something
LEFT JOIN temp_table_c AS c ON a.some_field = c.some_field
WHERE a.some_other_field = 'some-other-fixed-value'
I would appreciate it if someone could explain why the TEMP TABLE version of the query is so much more performant.
Thanks!