If you run the following sample code in SQL Server, you'll notice that NEWID() materializes after the join whereas ROW_NUMBER() materializes before the join. Does anyone understand why, and is there a way to work around it?
declare @a table ( num varchar(10) )
insert into @a values ('dan')
insert into @a values ('dan')
insert into @a values ('fran')
insert into @a values ('fran')
select *
from @a T
inner join
(select num, newid() id
from @a
group by num) T1 on T1.num = T.num
select *
from @a T
inner join
(select num, row_number() over (order by num) id
from @a
group by num) T1 on T1.num = T.num
I had a similar problem and found out that the "inner join" was the problem. I was able to use "left joins" ...
Not sure I see what the problem is here. Materialize the subquery T1 first:
SELECT num, ROW_NUMBER() OVER (ORDER BY num)
FROM #a
GROUP BY num;
You get two rows:
dan 1
fran 2
Now join that against @a on num = num and you get 4 rows, 2 for each distinct value. What is your actual goal here? Perhaps you should be applying ROW_NUMBER() outside?
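If the idea is just to attach a sequential id to the joined rows, one hedged reading of "applying ROW_NUMBER() outside" is a sketch like this, reusing the @a table variable from the question (swap in DENSE_RANK() if you want one id per distinct num):
select T.num,
row_number() over (order by T.num) id  -- or dense_rank() for one id per distinct num
from @a T
inner join
(select num
from @a
group by num) T1 on T1.num = T.num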
The order of materialization is up to the optimizer. You'll find that other built-ins (RAND(), GETDATE() etc.) have similarly inconsistent materialization behavior. Not much you can do about it, and not much chance they're going to "fix" it.
EDIT
New code sample. Write the contents of @a to a #temp table to "materialize" the NEWID() assignment per unique num value.
SELECT num, id = NEWID()
INTO #foo FROM @a GROUP BY num;
SELECT a.num, f.id
FROM @a AS a
INNER JOIN #foo AS f
ON a.num = f.num;
DROP TABLE #foo;
I would like to check across multiple tables that the same keys / same number of keys are present in each of the tables.
Currently I have created a solution that checks the count of keys per individual table, checks the count of keys when all tables are merged together, then compares.
This solution works but I wonder if there is a more optimal solution...
Example solution as it stands:
SELECT COUNT(DISTINCT variable) AS num_ids FROM table_a;
SELECT COUNT(DISTINCT variable) AS num_ids FROM table_b;
SELECT COUNT(DISTINCT variable) AS num_ids FROM table_c;
SELECT COUNT(DISTINCT a.variable) AS num_ids
FROM (SELECT DISTINCT VARIABLE FROM table_a) a
INNER JOIN (SELECT DISTINCT VARIABLE FROM table_b) b ON a.variable = b.variable
INNER JOIN (SELECT DISTINCT VARIABLE FROM table_c) c ON a.variable = c.variable;
UPDATE:
The difficulty I'm facing putting this together in one query is that any of the tables might not be unique on the VARIABLE that I am looking to check, so I've had to use DISTINCT before merging to avoid expanding the join.
Since we are only counting, I think there is no need to join the tables on the variable column. A UNION should be enough.
We still have to use DISTINCT to ignore/suppress duplicates, which often means an extra sort.
An index on variable should help when getting the counts for the separate tables, but it will not help for getting the count of the combined table.
Here is an example for comparing two tables:
WITH
CTE_A
AS
(
SELECT COUNT(DISTINCT variable) AS CountA
FROM TableA
)
,CTE_B
AS
(
SELECT COUNT(DISTINCT variable) AS CountB
FROM TableB
)
,CTE_AB
AS
(
SELECT COUNT(DISTINCT variable) AS CountAB
FROM
(
SELECT variable
FROM TableA
UNION ALL
-- sic! use ALL here to avoid sort when merging two tables
-- there should be only one distinct sort for the outer `COUNT`
SELECT variable
FROM TableB
) AS AB
)
SELECT
CASE WHEN CountA = CountAB AND CountB = CountAB
THEN 'same' ELSE 'different' END AS ResultAB
FROM
CTE_A
CROSS JOIN CTE_B
CROSS JOIN CTE_AB
;
Three tables:
WITH
CTE_A
AS
(
SELECT COUNT(DISTINCT variable) AS CountA
FROM TableA
)
,CTE_B
AS
(
SELECT COUNT(DISTINCT variable) AS CountB
FROM TableB
)
,CTE_C
AS
(
SELECT COUNT(DISTINCT variable) AS CountC
FROM TableC
)
,CTE_ABC
AS
(
SELECT COUNT(DISTINCT variable) AS CountABC
FROM
(
SELECT variable
FROM TableA
UNION ALL
-- sic! use ALL here to avoid sort when merging two tables
-- there should be only one distinct sort for the outer `COUNT`
SELECT variable
FROM TableB
UNION ALL
-- sic! use ALL here to avoid sort when merging two tables
-- there should be only one distinct sort for the outer `COUNT`
SELECT variable
FROM TableC
) AS ABC
)
SELECT
CASE WHEN CountA = CountABC AND CountB = CountABC AND CountC = CountABC
THEN 'same' ELSE 'different' END AS ResultABC
FROM
CTE_A
CROSS JOIN CTE_B
CROSS JOIN CTE_C
CROSS JOIN CTE_ABC
;
I deliberately chose CTEs because, as far as I know, Postgres materializes CTEs, and in our case each CTE will have only one row.
Using array_agg with ORDER BY is an even better variant, if it is available on Redshift. You still need DISTINCT, but you don't have to merge all the tables together.
WITH
CTE_A
AS
(
SELECT array_agg(DISTINCT variable ORDER BY variable) AS A
FROM TableA
)
,CTE_B
AS
(
SELECT array_agg(DISTINCT variable ORDER BY variable) AS B
FROM TableB
)
,CTE_C
AS
(
SELECT array_agg(DISTINCT variable ORDER BY variable) AS C
FROM TableC
)
SELECT
CASE WHEN A = B AND B = C
THEN 'same' ELSE 'different' END AS ResultABC
FROM
CTE_A
CROSS JOIN CTE_B
CROSS JOIN CTE_C
;
Well, here is probably the nastiest piece of SQL I could build for you :) I will forever deny that I wrote this and claim that my Stack Overflow account was hacked ;)
SELECT
'All OK'
WHERE
( SELECT COUNT(DISTINCT id) FROM table_a ) = ( SELECT COUNT(DISTINCT id) FROM table_b )
AND ( SELECT COUNT(DISTINCT id) FROM table_b ) = ( SELECT COUNT(DISTINCT id) FROM table_c )
By the way, this won't optimise the query - it's still doing three queries (but I guess it's better than 4?).
UPDATE: In light of your use-case below: NEW sql fiddle http://sqlfiddle.com/#!15/a0403/1
SELECT DISTINCT
tbl_a.a_count,
tbl_b.b_count,
tbl_c.c_count
FROM
( SELECT COUNT(id) a_count, array_agg(id order by id) ids FROM table_a) tbl_a,
( SELECT COUNT(id) b_count, array_agg(id order by id) ids FROM table_b) tbl_b,
( SELECT COUNT(id) c_count, array_agg(id order by id) ids FROM table_c) tbl_c
WHERE
tbl_a.ids = tbl_b.ids
AND tbl_b.ids = tbl_c.ids
The above query only returns a row if all the tables contain the same IDs: the sorted ID arrays are compared, so both the row counts and the ID values have to match.
I have read that using CTEs you can speed up a SELECT DISTINCT by up to 100 times. Link to the website. They have the following example:
USE tempdb;
GO
DROP TABLE dbo.Test;
GO
CREATE TABLE
dbo.Test
(
data INTEGER NOT NULL
);
GO
CREATE CLUSTERED INDEX c ON dbo.Test (data);
GO
-- Lots of duplicated values
INSERT dbo.Test WITH (TABLOCK)
(data)
SELECT TOP (5000000)
ROW_NUMBER() OVER (ORDER BY (SELECT 0)) / 117329
FROM master.sys.columns C1,
master.sys.columns C2,
master.sys.columns C3;
GO
WITH RecursiveCTE
AS (
SELECT data = MIN(T.data)
FROM dbo.Test T
UNION ALL
SELECT R.data
FROM (
-- A cunning way to use TOP in the recursive part of a CTE :)
SELECT T.data,
rn = ROW_NUMBER() OVER (ORDER BY T.data)
FROM dbo.Test T
JOIN RecursiveCTE R
ON R.data < T.data
) R
WHERE R.rn = 1
)
SELECT *
FROM RecursiveCTE
OPTION (MAXRECURSION 0);
How would one apply this to a query that has multiple joins? For example, I am trying to run the query below, but it takes roughly two and a half minutes. How would I optimize it accordingly?
SELECT DISTINCT x.code
FROM jpa
INNER JOIN jp ON jpa.ID = jp.ID
INNER JOIN jd ON jd.ID = jp.ID AND jd.JID = 3
INNER JOIN l ON jpa.ID = l.ID AND l.CID = 3
INNER JOIN fa ON fa.ID = jpa.ID
INNER JOIN x ON fa.ID = x.ID
1) GROUP BY on every selected column worked faster for me than DISTINCT.
2) If some of the tables contain duplicates, you can also pre-select the distinct rows and join from that as an inner query, as sketched below.
3) Generally, you can nest a join when you expect that join to limit the data.
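For instance, here is a hedged sketch of points 1 and 2 applied to the query above; which table actually contains the duplicates is an assumption, so adjust it to your data:
SELECT x.code
FROM jpa
INNER JOIN jp ON jpa.ID = jp.ID
INNER JOIN jd ON jd.ID = jp.ID AND jd.JID = 3
-- point 2: pre-select the distinct rows from l before joining
INNER JOIN (SELECT DISTINCT ID, CID FROM l) l ON jpa.ID = l.ID AND l.CID = 3
INNER JOIN fa ON fa.ID = jpa.ID
INNER JOIN x ON fa.ID = x.ID
GROUP BY x.code;  -- point 1: GROUP BY instead of DISTINCT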
I made two queries that I thought should have the same result:
SELECT COUNT(*) FROM (
SELECT DISTINCT ON (id1) id1, value
FROM (
SELECT table1.id1, table2.value
FROM table1
JOIN table2 ON table1.id1=table2.id
WHERE table2.value = '1')
AS result1 ORDER BY id1)
AS result2;
SELECT COUNT(*) FROM (
SELECT DISTINCT ON (id1) id1, value
FROM (
SELECT table1.id1, table2.value
FROM table1
JOIN table2 ON table1.id1=table2.id
)
AS result1 ORDER BY id1)
AS result2
WHERE value = '1';
The only difference is that one has the WHERE clause inside the SELECT DISTINCT ON subquery, and the other outside it but inside the SELECT COUNT. But the results were not the same. I don't understand why the position of the WHERE clause makes a difference in this case. Can anyone explain? Or is there a better way to phrase this question?
Here's a good way to look at this:
SELECT DISTINCT ON (id) id, value
FROM (select 1 as id, 1 as value
union
select 1 as id, 2 as value) a;
SELECT DISTINCT ON (id) id, value
FROM (select 1 as id, 1 as value
union
select 1 as id, 2 as value) a
WHERE value = 2;
The problem has to do with when the DISTINCT ON is applied and which rows are visible at that point: in the first query the filter removes rows before DISTINCT ON picks one row per id, while in the second query DISTINCT ON has already picked an arbitrary row per id and the filter is then applied to that result. It is behavior by design.
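One hedged way to see the "by design" part: the query's ORDER BY controls which row DISTINCT ON keeps, and without it the choice is arbitrary, which is why the two queries can disagree. A small sketch based on the same example:
SELECT DISTINCT ON (id) id, value
FROM (select 1 as id, 1 as value
union
select 1 as id, 2 as value) a
ORDER BY id, value;  -- deterministically keeps the smallest value per id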
In my stored procedure I have a table variable that contains row IDs. There are 2 scenarios: the table variable is either empty or not.
declare @IDTable as table
(
number NUMERIC(18,0)
)
In the main query, I join that table:
inner join @IDTable tab on (tab.number = csr.id)
BUT: given how an inner join works, I need my query to handle two cases:
when @IDTable is empty, return all rows;
otherwise, return ONLY the rows whose IDs exist in @IDTable.
I also tried a LEFT JOIN, but it doesn't work. Any ideas how to solve this?
If `@IDTable` is empty, then what rows do you return? Do you just ignore the join on to the table?
I'm not sure I get what you're trying to do, but this might be easier.
if (Select Count(*) From @IDTable) = 0
begin
-- do a SELECT that doesn't join on to @IDTable
end
else
begin
-- do a SELECT that joins on to @IDTable
end
It is not optimal, but it works:
declare @z table
(
id int
)
--insert @z values(2)
select * from somTable n
left join @z z on (z.id = n.id)
where NOT exists(select 1 from @z) or (z.id is not null)
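A hedged alternative using the same somTable and @z names as above: the condition can also be written without the LEFT JOIN, which avoids multiplying rows if @z ever contains duplicate ids.
select * from somTable n
where NOT exists(select 1 from @z)  -- @z is empty: return everything
or exists(select 1 from @z z where z.id = n.id)  -- otherwise: only matching rows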
Scenario:
Let's say I have two tables, TableA and TableB. TableB's primary key is a single column (BId), and is a foreign key column in TableA.
In my situation, I want to remove all rows in TableA that are linked with specific rows in TableB. Can I do that through joins, deleting all the rows that are pulled in by the joins?
DELETE FROM TableA
FROM
TableA a
INNER JOIN TableB b
ON b.BId = a.BId
AND [my filter condition]
Or am I forced to do this:
DELETE FROM TableA
WHERE
BId IN (SELECT BId FROM TableB WHERE [my filter condition])
The reason I ask is that it seems to me the first option would be much more efficient when dealing with larger tables.
Thanks!
DELETE TableA
FROM TableA a
INNER JOIN TableB b
ON b.Bid = a.Bid
AND [my filter condition]
should work
I would use this syntax
Delete a
from TableA a
Inner Join TableB b
on a.BId = b.BId
WHERE [filter condition]
Yes, you can. Example:
DELETE TableA
FROM TableA AS a
INNER JOIN TableB AS b
ON a.BId = b.BId
WHERE [filter condition]
I was trying to do this with an Access database and found I needed to use a.* right after the DELETE.
DELETE a.*
FROM TableA AS a
INNER JOIN TableB AS b
ON a.BId = b.BId
WHERE [filter condition]
It's almost the same in MySQL, but you have to use the table alias right after the word "DELETE":
DELETE a
FROM TableA AS a
INNER JOIN TableB AS b
ON a.BId = b.BId
WHERE [filter condition]
The syntax above doesn't work in Interbase 2007. Instead, I had to use something like:
DELETE FROM TableA a WHERE [filter condition on TableA]
AND (a.BId IN (SELECT a.BId FROM TableB b JOIN TableA a
ON a.BId = b.BId
WHERE [filter condition on TableB]))
(Note Interbase doesn't support the AS keyword for aliases)
I'm using this
DELETE TableA
FROM TableA a
INNER JOIN
TableB b on b.Bid = a.Bid
AND [condition]
And @TheTXI's way is good enough, but reading the answers and comments I found one thing that still needed answering: whether to put the condition in the WHERE clause or in the join condition. So I decided to test it and wrote a snippet, but I didn't find a meaningful difference between them. You can see the SQL script here. The important point is that I would have preferred to write this as a comment, because it is not an exact answer, but it is too large to fit in a comment, so please pardon me.
Declare @TableA Table
(
aId INT,
aName VARCHAR(50),
bId INT
)
Declare @TableB Table
(
bId INT,
bName VARCHAR(50)
)
Declare @TableC Table
(
cId INT,
cName VARCHAR(50),
dId INT
)
Declare @TableD Table
(
dId INT,
dName VARCHAR(50)
)
DECLARE @StartTime DATETIME;
SELECT @StartTime = GETDATE();
DECLARE @i INT;
SET @i = 1;
WHILE @i < 1000000
BEGIN
INSERT INTO @TableB VALUES(@i, 'nameB:' + CONVERT(VARCHAR, @i))
INSERT INTO @TableA VALUES(@i+5, 'nameA:' + CONVERT(VARCHAR, @i+5), @i)
SET @i = @i + 1;
END
SELECT @StartTime = GETDATE()
DELETE a
--SELECT *
FROM @TableA a
Inner Join @TableB b
ON a.BId = b.BId
WHERE a.aName LIKE '%5'
SELECT Duration = DATEDIFF(ms,@StartTime,GETDATE())
SET @i = 1;
WHILE @i < 1000000
BEGIN
INSERT INTO @TableD VALUES(@i, 'nameB:' + CONVERT(VARCHAR, @i))
INSERT INTO @TableC VALUES(@i+5, 'nameA:' + CONVERT(VARCHAR, @i+5), @i)
SET @i = @i + 1;
END
SELECT @StartTime = GETDATE()
DELETE c
--SELECT *
FROM @TableC c
Inner Join @TableD d
ON c.DId = d.DId
AND c.cName LIKE '%5'
SELECT Duration = DATEDIFF(ms,@StartTime,GETDATE())
If you can draw a good conclusion from this script, or write a more useful one, please share. Thanks, and I hope this helps.
Let's say you have 2 tables, one with a master set (e.g. Employees) and one with a child set (e.g. Dependents), and you want to get rid of all the rows in the Dependents table that cannot key up with any rows in the master table.
delete from Dependents where EmpID in (
select d.EmpID from Employees e
right join Dependents d on e.EmpID = d.EmpID
where e.EmpID is null)
The point to notice here is that you're just collecting an 'array' of EmpIDs from the join first, then using that set of EmpIDs to do a deletion operation on the Dependents table.
In SQLite, the only thing that works is something similar to beauXjames' answer.
It seems to come down to this:
DELETE FROM table1 WHERE table1.col1 IN (SOME TEMPORARY TABLE);
and that temporary table can be created with a SELECT that joins your two tables, filtered on the condition that identifies the records you want to delete from table1.
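As a hedged sketch of that idea in SQLite, with the "temporary table" written inline as a subquery; table2, its id column, and the status filter are illustrative stand-ins for your second table and condition:
DELETE FROM table1
WHERE table1.col1 IN (
    SELECT t1.col1
    FROM table1 t1
    INNER JOIN table2 t2 ON t2.id = t1.col1
    WHERE t2.status = 'obsolete'  -- hypothetical condition selecting the rows to delete
);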
The simpler way is:
DELETE TableA
FROM TableB
WHERE TableA.ID = TableB.ID
DELETE FROM table1
where id IN
(SELECT id FROM table2..INNER JOIN..INNER JOIN WHERE etc)
Minimize the use of DML queries with joins; you should be able to do most DML queries with subqueries like the one above.
In general, joins should only be used when you need to SELECT or GROUP BY columns in 2 or more tables. If you're only touching multiple tables to define a population, use subqueries. For DELETE queries, use a correlated subquery.
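For the DELETE case, a correlated subquery would look roughly like this sketch, reusing the TableA/TableB names and the [my filter condition] placeholder from the question above:
DELETE FROM TableA
WHERE EXISTS
(SELECT 1 FROM TableB b
 WHERE b.BId = TableA.BId
 AND [my filter condition])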
You can run this query:
DELETE FROM TableA
FROM
TableA a, TableB b
WHERE
a.Bid=b.Bid
AND
[my filter condition]