Inner Joining a large query with itself - tsql

Problem:
I need to remove duplicate pairs from the result of a query
(same problem as described here)
So if the result has (A,B), (B,A), (C,A)
I am only interested in (A,B) and (C,A)
The Complication:
Unlike in the linked question, the data is not available in a table that I can simply self join. It is more in the following state:
(SELECT C1, C2 from a mind boggling number of joins and unions)
So I can wrap it in a derived table as follows:
SELECT T1.C1, T1.C2
FROM (SELECT C1, C2 from a mind boggling number of joins and unions) T1
I would like to perform an inner join to remove duplicate pairs as mentioned above.
Is there a way to do that in such a scenario?
The query below is syntactically wrong, but hopefully it conveys the idea:
SELECT A.C1, A.C2
((SELECT C1, C2 from a mind boggling number of joins and unions)) T1 A
INNER JOIN T1 B
ON A.C1 = B.C1 AND
A.C2 < B.C2
I am running SQL Server 2012

Here is one way to achieve what you want with CTEs.
You can also store the result in a temporary table and use cte1 alone.
with cte
as
(
select col1, col2 from --- your query here.
)
, cte1
as
(
select col1, col2, row_number() over
( partition by (case when col1 >= col2 then col1
else col2
end) ,
(case when col1 <= col2 then col1
else col2
end) order by (select null)
) as rn
from cte
)
select * from cte1 where rn =1
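To make the ranking idea concrete, here is a minimal sketch on made-up data. The #pairs temp table and its three rows are assumptions standing in for the result of the large query, and the ORDER BY col1, col2 inside the window is my substitution for (select null), so that the surviving row of each pair is predictable.

```sql
-- Hypothetical sample data standing in for the large query's result.
SELECT v.col1, v.col2
INTO #pairs
FROM (VALUES ('A', 'B'), ('B', 'A'), ('C', 'A')) AS v (col1, col2);

WITH ranked AS
(
    SELECT col1, col2,
           ROW_NUMBER() OVER
           (
               -- (A,B) and (B,A) land in the same partition: (smaller, larger)
               PARTITION BY CASE WHEN col1 <= col2 THEN col1 ELSE col2 END,
                            CASE WHEN col1 <= col2 THEN col2 ELSE col1 END
               -- deterministic tie-break: keeps (A,B) rather than (B,A)
               ORDER BY col1, col2
           ) AS rn
    FROM #pairs
)
SELECT col1, col2
FROM ranked
WHERE rn = 1;

DROP TABLE #pairs;
```

With this sample data the query should return (A,B) and (C,A). With order by (select null), as in the answer, either (A,B) or (B,A) may be the one that survives.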

Related

Return multiple columns on single CASE DB2

Is it possible to return multiple columns from a single CASE evaluation in DB2?
The query below returns a single column.
select (case when 1=1 then 0 else 1 end) as col from table;
I need multiple columns, something like:
select (case when 1=1 then 0 as col, 1 as col1 else 2 as col1 , 3 as col2 end) from table;
select (case when 1=1 then 0,1 else 2, 3 end)col , col1 from table;
Is the COALESCE function useful for the above conditions? Thanks.
It’s not possible with a single CASE expression in Db2.
But you can use something like the following:
select
coalesce(t1.c1, t2.c1, t3.c1) c1
, coalesce(t1.c2, t2.c2, t3.c2) c2
from
(
select tabschema, tabname, rownumber() over (partition by tabschema) rn_
from syscat.tables
) b
left join table(values ('_SYSIBM_', b.tabname)) t1 (c1, c2) on b.tabschema='SYSIBM'
left join table(values ('_SYSCAT_', b.tabname)) t2 (c1, c2) on b.tabschema='SYSCAT'
cross join table(values (b.tabschema, b.tabname)) t3 (c1, c2)
where b.rn_=1;
The sub-select on syscat.tables is constructed to return only one table from each schema, just to show the idea (use your own base table in its place). The "case condition" here is what you see in the on clause of each join. The "returned values" of this "case expression" are inside the values clauses.
A CASE statement can be re-written as a UNION. Logically they are the same thing.
So, you could do this
select 0 as col, 1 as col1 from table where 1=1
UNION ALL
select 2 as col, 3 as col1 from table where NOT 1=1 OR 1=1 IS NULL
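As a sketch of the rewrite above with a real predicate instead of 1=1: some_table and its nullable flag column are hypothetical names, not taken from the question. Both queries below should produce the same (col, col1) pairs, possibly in a different row order. The second one simply repeats the condition in one CASE expression per output column.

```sql
-- UNION ALL form: one branch per CASE outcome.
SELECT 0 AS col, 1 AS col1
FROM some_table
WHERE flag = 1
UNION ALL
SELECT 2 AS col, 3 AS col1
FROM some_table
WHERE NOT (flag = 1) OR flag IS NULL;   -- the ELSE branch, including NULL flags

-- Equivalent form: repeat the same condition for each output column.
SELECT CASE WHEN flag = 1 THEN 0 ELSE 2 END AS col,
       CASE WHEN flag = 1 THEN 1 ELSE 3 END AS col1
FROM some_table;
```

The explicit IS NULL test matters because a row whose flag is NULL satisfies neither flag = 1 nor NOT (flag = 1), while CASE sends it to the ELSE branch.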

Sort two csv fields by removing duplicates and without row-by-row processing

I am trying to combine two csv fields, eliminate duplicates, sort and store it in a new field.
I was able to achieve this. However, I encountered a scenario where the values are like abc and abc*. I need to keep the one with abc* and remove the other.
Could this be achieved without row by row processing?
Here is what I have.
CREATE TABLE csv_test
(
Col1 VARCHAR(100),
Col2 VARCHAR(100),
Col3 VARCHAR(500)
);
INSERT dbo.csv_test (Col1, Col2)
VALUES ('xyz,def,abc', 'abc*,tuv,def,xyz*,abc'), ('qwe,bca,a23', 'qwe,bca,a23*,abc')
--It is assumed that there are no spaces around commas
SELECT Col1, Col2, Col1 + ',' + Col2 AS Combined_NonUnique_Unsorted,
STUFF((
SELECT ',' + Item
FROM (SELECT DISTINCT Item FROM dbo.DelimitedSplit8K(Col1 + ',' + Col2,',')) t
ORDER BY Item
FOR XML PATH('')
),1,1,'') Combined_Unique_Sorted
, ExpectedResult = 'Keep the one with * and make it unique'
FROM dbo.csv_test;
--Expected Results; if there are values like abc and abc* ; I need to keep abc* and remove abc ;
--How can I achieve this without looping or using temp tables?
abc,abc*,def,tuv,xyz,xyz* -> abc*,def,tuv,xyz*
a23,a23*,abc,bca,qwe -> a23*,abc,bca,qwe
Well, since you agree that normalizing the database is the correct thing to do, I decided to try to come up with a solution for you.
I ended up with quite a cumbersome solution involving 4(!) common table expressions - cumbersome, but it works.
The first cte is to add a row identifier missing from your table - I've used ROW_NUMBER() OVER(ORDER BY Col1, Col2) for that.
The second cte is to get a unique set of values from combining both csv columns. Note that this does not handle the * part yet.
The third cte is handling the * issue.
And finally, the fourth cte is putting all the unique items back into a single csv. (I could do it in the third cte but I wanted to have each cte responsible for a single part of the solution - it's much more readable.)
Now all that's left is to update the first cte's Col3 with the fourth cte's Combined_Unique_Sorted:
;WITH cte1 as
(
SELECT Col1,
Col2,
Col3,
ROW_NUMBER() OVER(ORDER BY Col1, Col2) As rn
FROM dbo.csv_test
), cte2 as
(
SELECT rn, Item
FROM cte1
CROSS APPLY
(
SELECT DISTINCT Item
FROM dbo.DelimitedSplit8K(Col1 +','+ Col2, ',')
) x
), cte3 AS
(
SELECT rn, Item
FROM cte2 t0
WHERE NOT EXISTS
(
SELECT 1
FROM cte2 t1
WHERE t0.Item + '*' = t1.Item
AND t0.rn = t1.rn
)
), cte4 AS
(
SELECT rn,
STUFF
((
SELECT ',' + Item
FROM cte3 t1
WHERE t1.rn = t0.rn
ORDER BY Item
FOR XML PATH('')
), 1, 1, '') Combined_Unique_Sorted
FROM cte3 t0
)
UPDATE t0
SET Col3 = Combined_Unique_Sorted
FROM cte1 t0
INNER JOIN cte4 t1 ON t0.rn = t1.rn
To verify the results:
SELECT *
FROM csv_test
ORDER BY Col1, Col2
Results:
Col1          Col2                    Col3
qwe,bca,a23   qwe,bca,a23*,abc        a23*,abc,bca,qwe
xyz,def,abc   abc*,tuv,def,xyz*,abc   abc*,def,tuv,xyz*
You can see a live demo on rextester.
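The starred-value filter from the third cte can be tried in isolation. This is only a sketch: the items cte and its rows are made up to stand in for the output of the split step, with rn identifying the source row.

```sql
-- Hypothetical split items: rn identifies the source row, Item is one csv value.
WITH items (rn, Item) AS
(
    SELECT v.rn, v.Item
    FROM (VALUES (1, 'abc'), (1, 'abc*'), (1, 'def'), (1, 'xyz*'), (2, 'abc')) AS v (rn, Item)
)
SELECT t0.rn, t0.Item
FROM items t0
WHERE NOT EXISTS
(
    -- drop a plain item when its starred twin exists in the same source row
    SELECT 1
    FROM items t1
    WHERE t1.rn = t0.rn
    AND t1.Item = t0.Item + '*'
);
```

For row 1 this should keep abc*, def and xyz* and drop abc. For row 2 the plain abc stays, because that row has no abc*.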

How to optimise a SQL query to check for consistency of column values across tables

I would like to check across multiple tables that the same keys / same number of keys are present in each of the tables.
Currently I have created a solution that checks the count of keys per individual table, checks the count of keys when all tables are merged together, then compares.
This solution works but I wonder if there is a more optimal solution...
Example solution as it stands:
SELECT COUNT(DISTINCT variable) AS num_ids FROM table_a;
SELECT COUNT(DISTINCT variable) AS num_ids FROM table_b;
SELECT COUNT(DISTINCT variable) AS num_ids FROM table_c;
SELECT COUNT(DISTINCT a.variable) AS num_ids
FROM (SELECT DISTINCT VARIABLE FROM table_a) a
INNER JOIN (SELECT DISTINCT VARIABLE FROM table_b) b ON a.variable = b.variable
INNER JOIN (SELECT DISTINCT VARIABLE FROM table_c) c ON a.variable = c.variable;
UPDATE:
The difficulty that I'm facing putting this together in one query is that any of the tables might not be unique on the VARIABLE that I am looking to check, so I've had to use distinct before merging to avoid expanding the join.
Since we are only counting, I think there is no need to join the tables on the variable column. A UNION should be enough.
We still have to use DISTINCT to ignore/suppress duplicates, which often means an extra sort.
An index on variable should help for getting counts for separate tables, but it will not help for getting the count of the combined table.
Here is an example for comparing two tables:
WITH
CTE_A
AS
(
SELECT COUNT(DISTINCT variable) AS CountA
FROM TableA
)
,CTE_B
AS
(
SELECT COUNT(DISTINCT variable) AS CountB
FROM TableB
)
,CTE_AB
AS
(
SELECT COUNT(DISTINCT variable) AS CountAB
FROM
(
SELECT variable
FROM TableA
UNION ALL
-- sic! use ALL here to avoid sort when merging two tables
-- there should be only one distinct sort for the outer `COUNT`
SELECT variable
FROM TableB
) AS AB
)
SELECT
CASE WHEN CountA = CountAB AND CountB = CountAB
THEN 'same' ELSE 'different' END AS ResultAB
FROM
CTE_A
CROSS JOIN CTE_B
CROSS JOIN CTE_AB
;
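Why the three counts are enough: if CountA equals CountAB, then TableB adds no key that TableA lacks, and the same holds the other way round for CountB. A minimal sketch on inline data (the a and b ctes and their values are made up; the syntax should work on both Postgres and SQL Server):

```sql
WITH a (variable) AS
(
    SELECT v.variable FROM (VALUES (1), (1), (2), (3)) AS v (variable)  -- duplicate key 1
)
, b (variable) AS
(
    SELECT v.variable FROM (VALUES (3), (2), (1)) AS v (variable)
)
, counts AS
(
    SELECT
        (SELECT COUNT(DISTINCT variable) FROM a) AS count_a,
        (SELECT COUNT(DISTINCT variable) FROM b) AS count_b,
        (SELECT COUNT(DISTINCT variable)
         FROM (SELECT variable FROM a
               UNION ALL
               SELECT variable FROM b) AS ab) AS count_ab
)
SELECT count_a, count_b, count_ab,
       CASE WHEN count_a = count_ab AND count_b = count_ab
            THEN 'same' ELSE 'different' END AS result_ab
FROM counts;
```

Here all three counts are 3, so the result is 'same', even though a holds key 1 twice.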
Three tables:
WITH
CTE_A
AS
(
SELECT COUNT(DISTINCT variable) AS CountA
FROM TableA
)
,CTE_B
AS
(
SELECT COUNT(DISTINCT variable) AS CountB
FROM TableB
)
,CTE_C
AS
(
SELECT COUNT(DISTINCT variable) AS CountC
FROM TableC
)
,CTE_ABC
AS
(
SELECT COUNT(DISTINCT variable) AS CountABC
FROM
(
SELECT variable
FROM TableA
UNION ALL
-- sic! use ALL here to avoid sort when merging two tables
-- there should be only one distinct sort for the outer `COUNT`
SELECT variable
FROM TableB
UNION ALL
-- sic! use ALL here to avoid sort when merging two tables
-- there should be only one distinct sort for the outer `COUNT`
SELECT variable
FROM TableC
) AS AB
)
SELECT
CASE WHEN CountA = CountABC AND CountB = CountABC AND CountC = CountABC
THEN 'same' ELSE 'different' END AS ResultABC
FROM
CTE_A
CROSS JOIN CTE_B
CROSS JOIN CTE_C
CROSS JOIN CTE_ABC
;
I deliberately chose CTEs because, as far as I know, Postgres materializes them (that was the default before PostgreSQL 12), and in our case each CTE will have only one row.
Using array_agg with ORDER BY is an even better variant, if it is available on Redshift. You still need DISTINCT, but you don't have to merge all the tables together.
WITH
CTE_A
AS
(
SELECT array_agg(DISTINCT variable ORDER BY variable) AS A
FROM TableA
)
,CTE_B
AS
(
SELECT array_agg(DISTINCT variable ORDER BY variable) AS B
FROM TableB
)
,CTE_C
AS
(
SELECT array_agg(DISTINCT variable ORDER BY variable) AS C
FROM TableC
)
SELECT
CASE WHEN A = B AND B = C
THEN 'same' ELSE 'different' END AS ResultABC
FROM
CTE_A
CROSS JOIN CTE_B
CROSS JOIN CTE_C
;
Well, here is probably the nastiest piece of SQL I could build for you :) I will forever deny that I wrote this and claim that my Stack Overflow account was hacked ;)
SELECT
'All OK'
WHERE
( SELECT COUNT(DISTINCT id) FROM table_a ) = ( SELECT COUNT(DISTINCT id) FROM table_b )
AND ( SELECT COUNT(DISTINCT id) FROM table_b ) = ( SELECT COUNT(DISTINCT id) FROM table_c )
By the way, this won't optimise the query - it's still doing three queries (but I guess it's better than 4?).
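One caveat worth spelling out: equal counts do not prove equal keys. A small sketch with made-up data (the table_a and table_b ctes here are inline stand-ins, not the real tables) where both counts match but the keys differ:

```sql
WITH table_a (id) AS
(
    SELECT v.id FROM (VALUES (1), (2)) AS v (id)
)
, table_b (id) AS
(
    SELECT v.id FROM (VALUES (2), (3)) AS v (id)
)
SELECT
    (SELECT COUNT(DISTINCT id) FROM table_a) AS count_a,
    (SELECT COUNT(DISTINCT id) FROM table_b) AS count_b,
    -- keys present in table_a but missing from table_b
    (SELECT COUNT(*)
     FROM (SELECT id FROM table_a
           EXCEPT
           SELECT id FROM table_b) AS d) AS only_in_a;
```

Both counts are 2 here, yet only_in_a is 1, so a count-only check would report the tables as consistent when they are not.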
UPDATE: In light of your use-case below: NEW sql fiddle http://sqlfiddle.com/#!15/a0403/1
SELECT DISTINCT
tbl_a.a_count,
tbl_b.b_count,
tbl_c.c_count
FROM
( SELECT COUNT(id) a_count, array_agg(id order by id) ids FROM table_a) tbl_a,
( SELECT COUNT(id) b_count, array_agg(id order by id) ids FROM table_b) tbl_b,
( SELECT COUNT(id) c_count, array_agg(id order by id) ids FROM table_c) tbl_c
WHERE
tbl_a.ids = tbl_b.ids
AND tbl_b.ids = tbl_c.ids
The above query only returns a row if all tables have the same number of rows and the same IDs.

How to optimize SELECT DISTINCT when using multiple Joins?

I have read that using CTEs you can speed up a SELECT DISTINCT by up to 100 times (link to the website). They have the following example:
USE tempdb;
GO
DROP TABLE dbo.Test;
GO
CREATE TABLE
dbo.Test
(
data INTEGER NOT NULL,
);
GO
CREATE CLUSTERED INDEX c ON dbo.Test (data);
GO
-- Lots of duplicated values
INSERT dbo.Test WITH (TABLOCK)
(data)
SELECT TOP (5000000)
ROW_NUMBER() OVER (ORDER BY (SELECT 0)) / 117329
FROM master.sys.columns C1,
master.sys.columns C2,
master.sys.columns C3;
GO
WITH RecursiveCTE
AS (
SELECT data = MIN(T.data)
FROM dbo.Test T
UNION ALL
SELECT R.data
FROM (
-- A cunning way to use TOP in the recursive part of a CTE :)
SELECT T.data,
rn = ROW_NUMBER() OVER (ORDER BY T.data)
FROM dbo.Test T
JOIN RecursiveCTE R
ON R.data < T.data
) R
WHERE R.rn = 1
)
SELECT *
FROM RecursiveCTE
OPTION (MAXRECURSION 0);
How would one apply this to a query that has multiple joins? For example, I am trying to run the query below, but it takes roughly two and a half minutes. How would I optimize it?
SELECT DISTINCT x.code
From jpa
INNER JOIN jp ON jpa.ID=jp.ID
INNER JOIN jd ON (jd.ID=jp.ID And jd.JID=3)
INNER JOIN l ON jpa.ID=l.ID AND l.CID=3
INNER JOIN fa ON fa.ID=jpa.ID
INNER JOIN x ON fa.ID=x.ID
1) GROUP BY on every column worked faster for me.
2) If you have duplicates in some of the tables, you can pre-select the distinct rows in an inner query and join to that.
3) Generally you can nest joins if you expect a join to limit the data.
SQL join format - nested inner joins
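A hedged sketch of points 1 and 2 applied to the query in the question. It assumes every join really is on the same ID value, as the ON clauses in the question suggest, and the short aliases (a, p, d, lo, f) are mine. Whether it is actually faster depends on the indexes and the data, so compare the execution plans before adopting it.

```sql
SELECT x.code
FROM (SELECT DISTINCT ID FROM jpa) AS a                             -- point 2: de-duplicate before joining
INNER JOIN (SELECT DISTINCT ID FROM jp) AS p ON p.ID = a.ID
INNER JOIN (SELECT DISTINCT ID FROM jd WHERE JID = 3) AS d ON d.ID = a.ID
INNER JOIN (SELECT DISTINCT ID FROM l WHERE CID = 3) AS lo ON lo.ID = a.ID
INNER JOIN (SELECT DISTINCT ID FROM fa) AS f ON f.ID = a.ID
INNER JOIN x ON x.ID = a.ID
GROUP BY x.code;                                                    -- point 1: GROUP BY instead of DISTINCT
```

Because each derived table returns an ID at most once, the joins can no longer multiply rows, and only x itself can contribute repeated codes for the final GROUP BY to collapse.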

Is it possible to reference the same CTE in more than one unrelated query?

I have a CTE, CTE1, that I'd like to reference in two or more unrelated queries:
WITH CTE1
AS
(
SELECT col1, col2, col3
FROM Table1
WHERE condition
)
SELECT * FROM CTE1;
SELECT * FROM CTE1 c -- This is not working: Invalid object name 'CTE1'.
JOIN TABLE2 t
ON c.colx = t.xolx;
Is there a way to accomplish this?
Thanks for helping