SQL - Distinct entries with some cruddy data - tsql

I have an issue. I have 2 tables that are linked by an ID.
TableA
ID
Price
Other_Stuff
TableB
ID
TableA_ID
Type
Age
I can do a:
SELECT
M1.Age,
F1.Age,
M2.Age,
F2.Age
FROM TableA
LEFT JOIN (SELECT * FROM TableB WHERE TableB.Type='1') AS M1 ON M1.TableA_ID=TableA.ID
LEFT JOIN (SELECT * FROM TableB WHERE TableB.Type='2') AS F1 ON F1.TableA_ID=TableA.ID
LEFT JOIN (SELECT * FROM TableB WHERE TableB.Type='3') AS M2 ON M2.TableA_ID=TableA.ID
LEFT JOIN (SELECT * FROM TableB WHERE TableB.Type='4') AS F2 ON F2.TableA_ID=TableA.ID
And things work as expected while the data is good, but the data is not always good. Normally there is at most one row of each Type in TableB for a given TableA entry. The problem is that in older data, from before types 3 and 4 existed, there may be two type 1 rows and/or two type 2 rows. In that case I want the second type 1 to be treated as a type 2 and the second type 2 to be treated as a type 4.
Basically I want a single record returned for each entry in TableA with the 4 ages listed in their own columns, I do not want multiple records for each in TableA.
I am using MS SQL 2000 (old, I know).
Thanks,

Try changing the subqueries to something like:
SELECT TOP 1 * FROM TableB WHERE TableB.Type='1' AND TableA_ID=TableA.ID
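To illustrate the idea of picking at most one row per type, here is a minimal sketch run through Python's sqlite3 module. It is not T-SQL: SQLite's LIMIT 1 stands in for TOP 1, the sample rows are invented, and the subqueries are written as correlated scalar subqueries in the select list. It only guarantees one output row per TableA entry; it does not remap a second type 1 to type 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (ID INTEGER PRIMARY KEY, Price REAL);
    CREATE TABLE TableB (ID INTEGER PRIMARY KEY, TableA_ID INTEGER,
                         Type TEXT, Age INTEGER);
    INSERT INTO TableA VALUES (1, 10.0), (2, 20.0);
    -- TableA 1 has clean data; TableA 2 has two type '1' rows (made-up messy data).
    INSERT INTO TableB VALUES
        (1, 1, '1', 40), (2, 1, '2', 38),
        (3, 2, '1', 50), (4, 2, '1', 47);
""")

# One correlated subquery per type; LIMIT 1 (TOP 1 in T-SQL) keeps the
# lowest-ID row, so duplicates can never multiply the TableA rows.
rows = conn.execute("""
    SELECT TableA.ID,
           (SELECT Age FROM TableB
             WHERE TableB.Type = '1' AND TableB.TableA_ID = TableA.ID
             ORDER BY TableB.ID LIMIT 1) AS M1_Age,
           (SELECT Age FROM TableB
             WHERE TableB.Type = '2' AND TableB.TableA_ID = TableA.ID
             ORDER BY TableB.ID LIMIT 1) AS F1_Age
    FROM TableA
    ORDER BY TableA.ID
""").fetchall()
print(rows)
```

The extra type 1 row for TableA 2 is simply ignored here, so each TableA entry still comes back exactly once.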

Related

Join two tables on all columns to determine if they contain identical information

I want to check if tables table_a and table_b are identical. I thought I could full outer join both tables on all columns and count the number of rows and missing values. However, both tables have many columns and I do not want to explicitly type out every column name.
Both tables have the same number of columns as well as names. How can I full outer join both of them on all columns without explicitly typing every column name?
I would like to do something along this syntax:
select
count(1)
,sum(case when x.id is null then 1 else 0 end) as x_nulls
,sum(case when y.id is null then 1 else 0 end) as y_nulls
from
x
full outer join
y
on
*
;
You can use NATURAL FULL OUTER JOIN here. The NATURAL keyword joins on all columns that have the same name in both tables.
Just testing if the tables are identical could then be:
SELECT *
FROM x NATURAL FULL OUTER JOIN y
WHERE x.id IS NULL OR y.id IS NULL
This will show "orphaned" rows in either table.
You might use the EXCEPT operator.
For example, the following returns an empty set if every row of t1 also appears in t2:
select * from t1
except
select * from t2;
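The EXCEPT query above already lists the rows of t1 that have no identical row in t2. If you instead want all of t1 returned only when no such rows exist (and nothing otherwise), you could do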
select * from t1
where not exists (select * from t1 except select * from t2);
Provided the number and types of columns match, you can use select * even if the column names differ between the tables. You could also run the EXCEPT in the opposite direction and union the two results to return the combined differences.
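The "invert and union" idea can be sketched as follows. This is only an illustration using Python's sqlite3 module with invented tables t1 and t2 and a made-up "side" label column. Note that EXCEPT works on distinct rows, so a difference in how many times a row is repeated would not show up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER, name TEXT);
    CREATE TABLE t2 (id INTEGER, name TEXT);
    INSERT INTO t1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO t2 VALUES (1, 'a'), (2, 'B'), (4, 'd');
""")

# Rows only in t1, plus rows only in t2; an empty result means both
# tables hold the same set of distinct rows.
diff = conn.execute("""
    SELECT 't1 only' AS side, * FROM (SELECT * FROM t1 EXCEPT SELECT * FROM t2)
    UNION ALL
    SELECT 't2 only' AS side, * FROM (SELECT * FROM t2 EXCEPT SELECT * FROM t1)
    ORDER BY side, id
""").fetchall()

for row in diff:
    print(row)
```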

Snowflake "Exploding Join" issue while doing left join for multiple tables

I am trying to do some left joins on multiple tables and facing the following issue.
Row Counts of tables
Table 1: 1.6M
Table 2: 1.7M
Table 3: 1.5M
When I am doing left Join using Table 1 and 2 and following query, I get data count as 1.8 M (acceptable):
SELECT Table1.ID1, Table1.ID2, Table2.Name, Table2.City
FROM Table1
LEFT JOIN Table2
ON Table1.ID1 = Table2.ID1
AND Table1.ID2 = Table2.ID2
AND Table1.Source_System = Table2.Source_System
;
Similarly when I am doing left Join using Table 1 and 3 and following query, I get data count as 1.9 M (acceptable):
SELECT Table1.ID1, Table1.ID2, Table3.Name, Table3.City
FROM Table1
LEFT JOIN Table3
ON Table1.ID1 = Table3.ID1
AND Table1.ID2 = Table3.ID2
AND Table1.Source_System = Table3.Source_System
;
But when I am doing left Join using Table 1, 2 and 3 and following query, I get data count as 11.9 G (ISSUE):
SELECT
Table1.ID1, Table1.ID2,
Table2.Name, Table2.City,
Table3.Name as Name1, Table3.City as City1
FROM Table1
LEFT JOIN Table2
ON Table1.ID1 = Table2.ID1
AND Table1.ID2 = Table2.ID2
AND Table1.Source_System = Table2.Source_System
LEFT JOIN Table3
ON Table1.ID1 = Table3.ID1
AND Table1.ID2 = Table3.ID2
AND Table1.Source_System = Table3.Source_System
;
So it seems you have assumed that table1 and table2 join in a 1:1 ratio, that table1 and table3 also join in a 1:1 ratio, and therefore that joining all three tables would again give a ratio of roughly 1:1.
But the extra rows in your two-table results can only come from join keys that match more than one row, and the fewer keys that are affected, the more duplicates each of them must carry. In the three-table join those duplicates multiply: a key with 100 matching rows in table2 and 100 in table3 contributes 100 x 100 = 10,000 rows on its own. That kind of cross join within a key is what gives you the roughly four orders of magnitude of growth in rows.
This could be seen via:
SELECT Table1.ID1, Table1.ID2, Table1.Source_System, count(*) as counts
FROM Table1
LEFT JOIN Table2
ON Table1.ID1 = Table2.ID1
AND Table1.ID2 = Table2.ID2
AND Table1.Source_System = Table2.Source_System
GROUP BY 1,2,3
ORDER BY counts DESC
;
This will show every distinct key combination, and which ones are the worst contributors to the combinatorial explosion.
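A tiny reproduction of the effect, as a sketch only: it uses Python's sqlite3 module rather than Snowflake, a single join column instead of three, and invented rows. One key has 3 matches in Table2 and 4 in Table3, so the three-way join returns 3 x 4 rows for that key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (ID1 INTEGER, Name TEXT);
    CREATE TABLE Table2 (ID1 INTEGER, Name TEXT);
    CREATE TABLE Table3 (ID1 INTEGER, Name TEXT);
    INSERT INTO Table1 VALUES (1, 'x'), (2, 'y');
""")
# Key 1 is duplicated 3 times in Table2 and 4 times in Table3; key 2 has no match.
conn.executemany("INSERT INTO Table2 VALUES (1, ?)", [(f"t2-{i}",) for i in range(3)])
conn.executemany("INSERT INTO Table3 VALUES (1, ?)", [(f"t3-{i}",) for i in range(4)])


def count(sql):
    """Run a COUNT(*) query and return the single number."""
    return conn.execute(sql).fetchone()[0]


n12 = count("SELECT COUNT(*) FROM Table1 LEFT JOIN Table2 ON Table1.ID1 = Table2.ID1")
n13 = count("SELECT COUNT(*) FROM Table1 LEFT JOIN Table3 ON Table1.ID1 = Table3.ID1")
n123 = count("""
    SELECT COUNT(*) FROM Table1
    LEFT JOIN Table2 ON Table1.ID1 = Table2.ID1
    LEFT JOIN Table3 ON Table1.ID1 = Table3.ID1
""")
print(n12, n13, n123)
```

Each two-table join grows only a little (4 and 5 rows from a 2-row Table1), while the three-table join jumps to 13, because the duplicates for key 1 are combined with each other.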
When a left join produces more rows than the left-hand table, that should not be considered acceptable! It should be a warning sign about your join condition and your data. Either investigate those records to avoid the duplication in the first place, or keep adjusting your SQL until the join produces exactly the row count of the reference table. Otherwise, it is very common for a left join to a table with even a few duplicate keys to multiply the row count, as you are seeing here.
To investigate and find those rows, use the following SQL to see which rows in each table share the same ID1, ID2 and Source_System values,
i.e.:
Select ID1, ID2 ,Source_System, COUNT(*) AS NUM_RECORDS_DUPS
FROM TABLE1
GROUP BY ID1, ID2 , Source_System
HAVING COUNT(*)>1 -- keep only join keys that occur on more than one row
Run the same query against each of the tables to find those records, then either add another condition that makes the join unique, aggregate the table on the joining keys, or ask for those records to be cleansed.
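As a rough sketch of the "aggregate the table on the joining keys" option, the following uses Python's sqlite3 module with invented rows. Picking MIN(City) as the surviving value is purely an assumption for the example; which row should win is a data decision.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (ID1 INTEGER, ID2 INTEGER, Source_System TEXT);
    CREATE TABLE Table2 (ID1 INTEGER, ID2 INTEGER, Source_System TEXT, City TEXT);
    INSERT INTO Table1 VALUES (1, 1, 'A'), (2, 1, 'A'), (3, 1, 'B');
    INSERT INTO Table2 VALUES
        (1, 1, 'A', 'Oslo'), (1, 1, 'A', 'Rome'),  -- duplicate join key
        (2, 1, 'A', 'Lima');
""")

# Step 1: find join keys that occur on more than one row.
dups = conn.execute("""
    SELECT ID1, ID2, Source_System, COUNT(*) AS NUM_RECORDS_DUPS
    FROM Table2
    GROUP BY ID1, ID2, Source_System
    HAVING COUNT(*) > 1
""").fetchall()

# Step 2: collapse Table2 to one row per key before joining, so the
# result has exactly as many rows as Table1.
joined = conn.execute("""
    SELECT Table1.ID1, T2.City
    FROM Table1
    LEFT JOIN (
        SELECT ID1, ID2, Source_System, MIN(City) AS City
        FROM Table2
        GROUP BY ID1, ID2, Source_System
    ) AS T2
      ON Table1.ID1 = T2.ID1
     AND Table1.ID2 = T2.ID2
     AND Table1.Source_System = T2.Source_System
    ORDER BY Table1.ID1
""").fetchall()
print(dups)
print(joined)
```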
Have you tried adding a DISTINCT clause?
SELECT DISTINCT columns, of, choice
FROM Table1
LEFT JOIN Table2 on ...
LEFT JOIN Table3 on ...
I think what's happening is you have dups that left join on another giant set of dups.
Use the proper keys to join the tables; that solves the issue.

dynamically choose fields from different table based on existence

I have two tables A and B.
Both the tables have same number of columns.
Table A always contains all ids of Table B.
I need to fetch a row from Table B first; if it does not exist there, I have to fetch it from Table A.
I was trying to dynamically do this
select
CASE
WHEN b.id is null THEN
a.*
ELSE
b.*
END
from A a
left join B b on b.id = a.id
I think this syntax is not correct.
Can someone suggest how to proceed?
It looks like you want to select all columns from table A except when a matching ID exists in table B. In that case you want to select all columns from table B.
That can be done with this query as long as the number and types of columns in both tables are compatible:
select * from a where not exists (select 1 from b where b.id = a.id)
union all
select * from b
If the number, types, or order of columns differs you will need to explicitly specify the columns to return in each sub query.
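Here is a small check of that query, run through Python's sqlite3 module. The two-column layout (id, val) and the rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, val TEXT);
    CREATE TABLE b (id INTEGER PRIMARY KEY, val TEXT);
    INSERT INTO a VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
    INSERT INTO b VALUES (2, 'b2');  -- id 2 should come from b, not a
""")

# Rows of a that have no counterpart in b, plus every row of b.
rows = conn.execute("""
    SELECT * FROM a WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.id = a.id)
    UNION ALL
    SELECT * FROM b
    ORDER BY id
""").fetchall()
print(rows)
```

Each id appears exactly once: ids 1 and 3 come from a, and id 2 comes from b.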

JPA query returning a Tuple where one part is an Entity

I have two unrelated tables that I want to do a LEFT JOIN on. I only want 1 column from the left table, but the entire entity (which I intend to update if it's present, or create if not) from the right.
Simplified version of tables:
TABLE1
id, type, data
TABLE2
id, type, and, other, stuff
Current JPQL:
SELECT T1.type,
(SELECT T2
FROM TABLE2 T2
WHERE T2.id = T1.id
AND T2.type = T1.type)
FROM TABLE1 T1
WHERE T1.id = :ID
I am currently getting some sort of logical union error...
Can this be done, or should I just use separate queries?
The exact exception is:
Caused by: java.lang.ClassCastException: org.apache.openjpa.jdbc.sql.LogicalUnion$UnionSelect incompatible with org.apache.openjpa.jdbc.sql.SelectImpl
The Java code I use follows:
Query q = this.em.createQuery(jql, Tuple.class);
q.setParameter("ID", id);
@SuppressWarnings("unchecked")
List<Tuple> result = q.getResultList();
The subquery is not essential to my solution - it's just the only form that was parseable - a regular SQL LEFT JOIN wasn't. In words what I am trying to do is for a given ID in TABLE1 find all rows in TABLE2 that have the same ID and type or null if there is no row. Later code will create rows in TABLE2 where there are none for the id and type. I'm expecting 2-3 types per ID in TABLE1 and about half the time for a matching row in TABLE2.

Full outer join on multiple tables in PostgreSQL

In PostgreSQL, I have N tables, each consisting of two columns: id and value. Within each table, id is a unique identifier and value is numeric.
I would like to join all the tables using id and, for each id, create a sum of values of all the tables where the id is present (meaning the id may be present only in subset of tables).
I was trying the following query:
SELECT COALESCE(a.id, b.id, c.id) AS id,
COALESCE(a.value,0) + COALESCE(b.value,0) + COALESCE(c.value,0) AS value
FROM
a
FULL OUTER JOIN
b
ON (a.id=b.id)
FULL OUTER JOIN
c
ON (b.id=c.id)
But it doesn't work for cases when the id is present in a and c, but not in b.
I suppose I would have to do some bracketing like:
SELECT COALESCE(x.id, c.id) AS id, COALESCE(x.value,0) + COALESCE(c.value,0) AS value
FROM
(SELECT COALESCE(a.id, b.id) AS id, COALESCE(a.value,0) + COALESCE(b.value,0) AS value
FROM
a
FULL OUTER JOIN
b
ON (a.id=b.id)
) AS x
FULL OUTER JOIN
c
ON (x.id = c.id)
That is only 3 tables and the code is ugly enough already, imho. Is there an elegant, systematic way to do the join for N tables, so that I do not get lost in my code?
I would also like to point out that I did some simplifications in my example. Tables a, b, c, ..., are actually results of quite complex queries over several materialized views. But the syntactical problem remains the same.
As I understand it, you need to sum the values from N tables and group them by id, correct?
For that I would do this:
Select x.id, sum (x.value) from (
Select * from a
Union all
Select * from b
Union all........
) as x group by x.id;
Since the N tables all have the same columns, you can union them all, creating one big table of all the (id, value) tuples from all tables. Use union all, because plain union removes duplicate rows!
Then just sum all the values grouped by id.
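A quick sketch of this approach for three tables, using Python's sqlite3 module instead of PostgreSQL. The table names a, b, c and their rows are made up, and the UNION ALL text is built from a hard-coded list of names only to show how it scales to N tables. Id 1 is present in a and c but not in b, which is the case the chained full outer join got wrong.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, value INTEGER);
    CREATE TABLE b (id INTEGER PRIMARY KEY, value INTEGER);
    CREATE TABLE c (id INTEGER PRIMARY KEY, value INTEGER);
    INSERT INTO a VALUES (1, 10), (2, 20);
    INSERT INTO b VALUES (2, 5);
    INSERT INTO c VALUES (1, 1), (3, 7);
""")

# Hard-coded, trusted table names; never build SQL from user input like this.
tables = ["a", "b", "c"]
union_sql = "\n    UNION ALL\n    ".join(f"SELECT id, value FROM {t}" for t in tables)

# Stack all (id, value) pairs, then sum per id.
totals = conn.execute(f"""
    SELECT x.id, SUM(x.value) AS value
    FROM (
    {union_sql}
    ) AS x
    GROUP BY x.id
    ORDER BY x.id
""").fetchall()
print(totals)
```

Adding another table only means adding its name to the list; no join conditions need to change.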