T-SQL: returning rows with identical column values

Given an example table 'Users', which has an int column named 'UserID' (and some arbitrary number of other columns), what is the best way to select all rows for which UserID appears more than once?
So far I've come up with
select * from Users where UserID in
(select UserID from Users group by UserID having COUNT(UserID) > 1)
This seems like quite an inefficient way to do this, though. Is there a better way?

In SQL Server 2005+ you could use this approach:
;WITH UsersNumbered AS (
SELECT
UserID,
rownum = ROW_NUMBER() OVER (PARTITION BY UserID ORDER BY UserID)
FROM Users
)
SELECT u.*
FROM Users u
INNER JOIN UsersNumbered n ON u.UserID = n.UserID AND n.rownum = 2
Provided there exists a non-clustered index on UserID, this yields a slightly worse execution plan than your approach. To make it better (actually, same as yours), you'll need to use... a subquery, however counter-intuitive it may seem:
;WITH UsersNumbered AS (
SELECT
UserID,
rownum = ROW_NUMBER() OVER (PARTITION BY UserID ORDER BY UserID)
FROM Users
)
SELECT u.*
FROM Users u
WHERE EXISTS (
SELECT *
FROM UsersNumbered n
WHERE u.UserID = n.UserID AND n.rownum = 2
);
In case of a clustered index on UserID all three solutions give the same plan.

This does the same thing with a join instead of IN. Compare the execution plans on your data; it may well be faster. Of course, there should be an index on the UserID column.
select u.*
from Users u
join (select UserID,count(UserID) as CUserID from Users group by UserID) u1 on u1.UserID = u.UserID
where CUserID > 1
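To check that the IN/GROUP BY query and the derived-table join above return the same rows, here is a minimal sketch. It assumes Python's built-in sqlite3 module as a stand-in for SQL Server, so it says nothing about execution plans. The Name column and the sample rows are made up for illustration.

```python
import sqlite3

def duplicated_users(rows):
    """Return (in_rows, join_rows): Users rows whose UserID occurs more than once."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: only UserID comes from the question.
    conn.execute("CREATE TABLE Users (UserID INTEGER, Name TEXT)")
    conn.executemany("INSERT INTO Users VALUES (?, ?)", rows)
    in_query = """
        SELECT UserID, Name FROM Users
        WHERE UserID IN (SELECT UserID FROM Users
                         GROUP BY UserID HAVING COUNT(UserID) > 1)
        ORDER BY UserID, Name"""
    join_query = """
        SELECT u.UserID, u.Name FROM Users u
        JOIN (SELECT UserID, COUNT(UserID) AS CUserID
              FROM Users GROUP BY UserID) u1 ON u1.UserID = u.UserID
        WHERE u1.CUserID > 1
        ORDER BY u.UserID, u.Name"""
    result = (conn.execute(in_query).fetchall(),
              conn.execute(join_query).fetchall())
    conn.close()
    return result

SAMPLE_USERS = [(1, "ann"), (2, "bob"), (2, "bobby"),
                (3, "cy"), (3, "cyd"), (3, "cyrus")]
in_rows, join_rows = duplicated_users(SAMPLE_USERS)
print(in_rows)
print(in_rows == join_rows)
```

Both queries return every row for UserID 2 and 3 and skip UserID 1, which appears only once.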

Related

Query to select by number of associated objects

I have two tables that look like the following:
Orders
------
id
tracking_number
ShippingLogs
------
tracking_number
created_at
stage
I would like to select the IDs of Orders that have ONLY ONE ShippingLog associated with them, and the stage of that ShippingLog must be error. If an order has two ShippingLog entries, I don't want it. If it has one ShippingLog but its stage is shipped, I don't want it.
This is what I have, and it doesn't work, and I know why (it finds the log with the error, but has no way of knowing if there are others). I just don't really know how to get it the way I need it.
SELECT DISTINCT
orders.id, shipping_logs.created_at, COUNT(shipping_logs.*)
FROM
orders
JOIN
shipping_logs ON orders.tracking_number = shipping_logs.tracking_number
WHERE
shipping_logs.created_at BETWEEN '2021-01-01 23:40:00'::timestamp AND '2021-01-26 23:40:00'::timestamp AND shipping_logs.stage = 'error'
GROUP BY
orders.id, shipping_logs.created_at
HAVING
COUNT(shipping_logs.*) = 1
ORDER BY
orders.id, shipping_logs.created_at DESC;
If you want to retain every column from the join of the two tables given your requirements, then I would suggest using COUNT here as an analytic function:
WITH cte AS (
SELECT o.id, sl.created_at,
COUNT(*) OVER (PARTITION BY o.id) num_logs,
COUNT(*) FILTER (WHERE sl.stage <> 'error')
OVER (PARTITION BY o.id) non_error_cnt
FROM orders o
INNER JOIN shipping_logs sl ON sl.tracking_number = o.tracking_number
WHERE sl.created_at BETWEEN '2021-01-01 23:40:00'::timestamp AND
'2021-01-26 23:40:00'::timestamp
)
SELECT id AS order_id, created_at
FROM cte
WHERE num_logs = 1 AND non_error_cnt = 0
ORDER BY id, created_at DESC;
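If only the order IDs are needed, the same rule (exactly one log, and that log is an error) can also be written with plain GROUP BY and HAVING. The sketch below is an alternative, not the query from the answer above. It assumes Python's built-in sqlite3 module in place of PostgreSQL, leaves out the created_at date range for brevity, and uses invented sample rows.

```python
import sqlite3

def orders_with_single_error_log(orders, logs):
    """Return ids of orders that have exactly one shipping log, with stage 'error'."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (id INTEGER, tracking_number TEXT);
        CREATE TABLE shipping_logs (tracking_number TEXT, created_at TEXT, stage TEXT);
    """)
    conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
    conn.executemany("INSERT INTO shipping_logs VALUES (?, ?, ?)", logs)
    rows = conn.execute("""
        SELECT o.id
        FROM orders o
        JOIN shipping_logs sl ON sl.tracking_number = o.tracking_number
        GROUP BY o.id
        HAVING COUNT(*) = 1                        -- only one log in total
           AND SUM(CASE WHEN sl.stage <> 'error'
                        THEN 1 ELSE 0 END) = 0     -- and it is not a non-error log
        ORDER BY o.id
    """).fetchall()
    conn.close()
    return [row[0] for row in rows]

SAMPLE_ORDERS = [(1, "A"), (2, "B"), (3, "C"), (4, "D")]
SAMPLE_LOGS = [
    ("A", "2021-01-02 10:00:00", "error"),    # order 1: one error log, wanted
    ("B", "2021-01-03 10:00:00", "error"),    # order 2: two logs, not wanted
    ("B", "2021-01-04 10:00:00", "shipped"),
    ("C", "2021-01-05 10:00:00", "shipped"),  # order 3: one non-error log
]                                             # order 4: no logs at all
print(orders_with_single_error_log(SAMPLE_ORDERS, SAMPLE_LOGS))
```

The key point is that the stage test lives in HAVING rather than WHERE, so non-error logs are still counted before the group is judged.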

How should I add fields without adding them to a GROUP BY?

I have a SQL statement that works as-is. I get an area name and the minimum value within that area. Next, I need to add in a key so I can actually do something with the results. The key is necessary since names and values are unlikely to be unique.
select g.name, min(g.rndval) from
(
select p.rndval, a.name, p.id
from points p, areas a
where ST_WITHIN(p.geom, a.geom)
) AS g
group by g.name
When I add the Id field to the group by, the query returns multiple rows for each area, as expected since it's grouping by the name and id combination, and the results are no longer what I need. How should I add in the id field (p.id in the inner select)?
You can try:
WITH cte AS
( select p.rndval, a.name, p.id
from points p, areas a
where ST_WITHIN(p.geom, a.geom)
), cte_aggregated AS
(
SELECT name, min(rndval) AS min_value
FROM cte
GROUP BY name
)
SELECT DISTINCT c.rndval, c.name, c.id
FROM cte c
JOIN cte_aggregated ca
ON c.rndval = ca.min_value
AND c.name = ca.name;
You can solve this quite elegantly with a window function:
select name, rndval as min, id
from (
select a.name, p.rndval, p.id, rank() over (partition by a.name order by p.rndval) as rnk
from points p
join areas a on ST_Within(p.geom, a.geom)) as g
where rnk = 1;
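One thing worth knowing about the rank() version is that ties are kept: if two points in an area share the minimum value, both ids come back. The sketch below illustrates this under some assumptions: it uses Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer), and since ST_Within is not available there, it starts from a hypothetical table g that already holds the joined (name, rndval, id) rows. The sample data is invented.

```python
import sqlite3

def min_point_per_area(joined_rows):
    """joined_rows: (area_name, rndval, point_id) tuples, i.e. points already matched to areas."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE g (name TEXT, rndval REAL, id INTEGER)")
    conn.executemany("INSERT INTO g VALUES (?, ?, ?)", joined_rows)
    rows = conn.execute("""
        SELECT name, rndval, id
        FROM (SELECT name, rndval, id,
                     rank() OVER (PARTITION BY name ORDER BY rndval) AS rnk
              FROM g) AS ranked
        WHERE rnk = 1
        ORDER BY name, id
    """).fetchall()
    conn.close()
    return rows

SAMPLE_POINTS = [
    ("north", 0.7, 10),
    ("north", 0.2, 11),   # tied minimum in 'north'
    ("north", 0.2, 12),   # tied minimum in 'north'
    ("south", 0.5, 20),
    ("south", 0.9, 21),
]
print(min_point_per_area(SAMPLE_POINTS))
```

If exactly one row per area is required, row_number() with a tie-breaking column in ORDER BY can be used in place of rank().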

How to get fields not added in GROUP BY in PostgreSQL 8.4?

I am selecting the column used in the GROUP BY along with a count, and the query looks something like
SELECT s.country, count(*) AS posts_ct
FROM store s
JOIN store_post_map sp ON sp.store_id = s.id
GROUP BY 1;
However, I want to select some more fields, like store name or store address from the store table where the count is max, but I don't want to include them in the GROUP BY clause.
For instance, to get the stores with the highest post-count per country:
SELECT DISTINCT ON (s.country)
s.country, s.id AS store_id, s.name, sp.post_ct
FROM store s
JOIN (
SELECT store_id, count(*) AS post_ct
FROM store_post_map
GROUP BY store_id
) sp ON sp.store_id = s.id
ORDER BY s.country, sp.post_ct DESC
Add any number of columns from store to the SELECT list.
Details about this query style in this related answer:
Select first row in each GROUP BY group?
Reply to comment
This produces the count per country and picks (one of) the store(s) with the highest post-count:
SELECT DISTINCT ON (s.country)
s.country, s.id AS store_id, s.name
,sum(post_ct) OVER (PARTITION BY s.country) AS post_ct_for_country
FROM store s
JOIN (
SELECT store_id, count(*) AS post_ct
FROM store_post_map
GROUP BY store_id
) sp ON sp.store_id = s.id
ORDER BY s.country, sp.post_ct DESC;
This works because the window function sum() is applied before DISTINCT ON per definition.
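DISTINCT ON is PostgreSQL-specific. For comparison, here is a sketch of the same idea written with row_number(), which is a different technique that also runs on engines without DISTINCT ON. It assumes Python's built-in sqlite3 module (SQLite 3.25 or newer for window functions), a store table with columns id, country and name, and a store_post_map table with a hypothetical post_id column. The sample rows are invented.

```python
import sqlite3

def top_store_per_country(stores, post_map):
    """Return (country, store_id, name, post_ct, post_ct_for_country) for the busiest store per country."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE store (id INTEGER, country TEXT, name TEXT);
        CREATE TABLE store_post_map (store_id INTEGER, post_id INTEGER);
    """)
    conn.executemany("INSERT INTO store VALUES (?, ?, ?)", stores)
    conn.executemany("INSERT INTO store_post_map VALUES (?, ?)", post_map)
    rows = conn.execute("""
        SELECT country, store_id, name, post_ct, post_ct_for_country
        FROM (
            SELECT s.country, s.id AS store_id, s.name, sp.post_ct,
                   sum(sp.post_ct) OVER (PARTITION BY s.country) AS post_ct_for_country,
                   -- s.id breaks ties so the pick is deterministic
                   row_number() OVER (PARTITION BY s.country
                                      ORDER BY sp.post_ct DESC, s.id) AS rn
            FROM store s
            JOIN (SELECT store_id, count(*) AS post_ct
                  FROM store_post_map
                  GROUP BY store_id) sp ON sp.store_id = s.id
        ) AS ranked
        WHERE rn = 1
        ORDER BY country
    """).fetchall()
    conn.close()
    return rows

SAMPLE_STORES = [(1, "US", "a"), (2, "US", "b"), (3, "DE", "c")]
SAMPLE_POSTS = [(1, 1), (1, 2), (2, 3), (2, 4), (2, 5), (3, 6)]
print(top_store_per_country(SAMPLE_STORES, SAMPLE_POSTS))
```

As in the queries above, stores without any posts drop out because of the inner join.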

TSQL Group By with an "OR"?

This query for creating a list of Candidate duplicates is easy enough:
SELECT Count(*), Can_FName, Can_HPhone, Can_EMail
FROM Can
GROUP BY Can_FName, Can_HPhone, Can_EMail
HAVING Count(*) > 1
But if the actual rule I want to check against is FName and (HPhone OR Email) - how can I adjust the GROUP BY to work with this?
I'm fairly certain I'm going to end up with a UNION SELECT here (i.e. do FName, HPhone on one and FName, EMail on the other and combine the results) - but I'd love to know if anyone knows an easier way to do it.
Thank you in advance for any help.
Scott in Maine
Before I can advise anything, I need to know the answer to this question:
name phone email
John 555-00-00 john#example.com
John 555-00-01 john#example.com
John 555-00-01 john-other#example.com
What COUNT(*) do you want for this data?
Update:
If you just want to know that a record has any duplicates, use this:
WITH q AS (
SELECT 1 AS id, 'John' AS name, '555-00-00' AS phone, 'john#example.com' AS email
UNION ALL
SELECT 2 AS id, 'John', '555-00-01', 'john#example.com'
UNION ALL
SELECT 3 AS id, 'John', '555-00-01', 'john-other#example.com'
UNION ALL
SELECT 4 AS id, 'James', '555-00-00', 'james#example.com'
UNION ALL
SELECT 5 AS id, 'James', '555-00-01', 'james-other#example.com'
)
SELECT *
FROM q qo
WHERE EXISTS
(
SELECT NULL
FROM q qi
WHERE qi.id <> qo.id
AND qi.name = qo.name
AND (qi.phone = qo.phone OR qi.email = qo.email)
)
It's more efficient, but doesn't tell you where the duplicate chain started.
This query selects all entries along with a special field, chainid, that indicates where the duplicate chain started.
WITH q AS (
SELECT 1 AS id, 'John' AS name, '555-00-00' AS phone, 'john#example.com' AS email
UNION ALL
SELECT 2 AS id, 'John', '555-00-01', 'john#example.com'
UNION ALL
SELECT 3 AS id, 'John', '555-00-01', 'john-other#example.com'
UNION ALL
SELECT 4 AS id, 'James', '555-00-00', 'james#example.com'
UNION ALL
SELECT 5 AS id, 'James', '555-00-01', 'james-other#example.com'
),
dup AS (
SELECT id AS chainid, id, name, phone, email, 1 as d
FROM q
UNION ALL
SELECT chainid, qo.id, qo.name, qo.phone, qo.email, d + 1
FROM dup
JOIN q qo
ON qo.name = dup.name
AND (qo.phone = dup.phone OR qo.email = dup.email)
AND qo.id > dup.id
),
chains AS
(
SELECT *
FROM dup do
WHERE chainid NOT IN
(
SELECT id
FROM dup di
WHERE di.chainid < do.chainid
)
)
SELECT *
FROM chains
ORDER BY
chainid
None of these answers is correct. Quassnoi's is a decent approach, but you will notice one fatal flaw in the expressions "qo.id > dup.id" and "di.chainid < do.chainid": comparisons made by ID! This is ALWAYS bad practice because it depends on some inherent ordering in the IDs. IDs should NEVER be given any implicit meaning and should ONLY participate in equality or null testing. You can easily break Quassnoi's solution in this example by simply reordering the IDs in the data.
The essential problem is a disjunctive condition with a grouping, which leads to the possibility of two records being related through an intermediate, though they are not directly relatable.
e.g., you stated these records should all be grouped:
(1) John 555-00-00 john#example.com
(2) John 555-00-01 john#example.com
(3) John 555-00-01 john-other#example.com
You can see that #1 and #2 are relatable, as are #2 and #3, but clearly #1 and #3 are not directly relatable as a group.
This establishes that a recursive or iterative solution is the ONLY possible solution.
Plain recursion, however, is not viable, since you can easily end up in a loop. This is what Quassnoi was trying to avoid with his ID comparisons, but in doing so he broke the algorithm. You could try limiting the levels of recursion, but then you may not complete all relations, and you may still follow loops back upon yourself, leading to excessive data size and prohibitive inefficiency.
The best solution is ITERATIVE: Start a result set by tagging each ID as a unique group ID, and then spin through the result set and update it, combining IDs into the same unique group ID as they match on the disjunctive condition. Repeat the process on the updated set each time until no further updates can be made.
I will create example code for this soon.
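In the meantime, the iterative procedure described above can be sketched roughly as follows. This is only one possible reading of it, not the poster's code. It assumes Python's built-in sqlite3 module, a hypothetical table can(id, name, phone, email, grp), and the sample rows from earlier in this thread. Taking MIN(grp) is just a deterministic way to choose which label survives a merge; any consistent choice would do.

```python
import sqlite3

# One merge pass: each row adopts the smallest group label among the rows
# it matches on name AND (phone OR email). Labels only ever decrease.
MERGE_STEP = """
    UPDATE can
    SET grp = (SELECT MIN(o.grp) FROM can o
               WHERE o.name = can.name
                 AND (o.phone = can.phone OR o.email = can.email))
    WHERE grp > (SELECT MIN(o.grp) FROM can o
                 WHERE o.name = can.name
                   AND (o.phone = can.phone OR o.email = can.email))
"""

def group_duplicates(rows):
    """rows: (id, name, phone, email) tuples. Returns {id: group label}."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE can (id INTEGER PRIMARY KEY, name TEXT, "
                 "phone TEXT, email TEXT, grp INTEGER)")
    # Start with every row in its own group, labelled by its own id.
    conn.executemany(
        "INSERT INTO can (id, name, phone, email, grp) VALUES (?, ?, ?, ?, ?)",
        [(i, n, p, e, i) for (i, n, p, e) in rows])
    # Repeat the pass until no row changes its label any more.
    while conn.execute(MERGE_STEP).rowcount > 0:
        pass
    groups = dict(conn.execute("SELECT id, grp FROM can ORDER BY id").fetchall())
    conn.close()
    return groups

SAMPLE_CANDIDATES = [
    (1, "John", "555-00-00", "john#example.com"),
    (2, "John", "555-00-01", "john#example.com"),
    (3, "John", "555-00-01", "john-other#example.com"),
    (4, "James", "555-00-00", "james#example.com"),
    (5, "James", "555-00-01", "james-other#example.com"),
]
print(group_duplicates(SAMPLE_CANDIDATES))
```

Rows 1 and 3 end up in the same group even though they share neither phone nor email, because row 2 links them. The two James rows stay separate. The loop always terminates, since each pass can only lower labels.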
GROUP BY doesn't support OR - it's implicitly AND and must include every non-aggregated column in the select list.
I assume you also have a unique ID integer as the primary key on this table. If you don't, it's a good idea to have one, for this purpose and many others.
Find those duplicates by a self-join:
select
c1.ID
, c1.Can_FName
, c1.Can_HPhone
, c1.Can_Email
, c2.ID
, c2.Can_FName
, c2.Can_HPhone
, c2.Can_Email
from
(
select
min(ID) as ID,
Can_FName,
Can_HPhone,
Can_Email
from Can
group by
Can_FName,
Can_HPhone,
Can_Email
) c1
inner join Can c2 on c1.ID < c2.ID
where
c1.Can_FName = c2.Can_FName
and (c1.Can_HPhone = c2.Can_HPhone OR c1.Can_Email = c2.Can_Email)
order by
c1.ID
The query gives you N-1 rows for each N duplicate combinations - if you want just a count along with each unique combination, count the rows grouped by the "left" side:
select count(1) + 1
, c1.Can_FName
, c1.Can_HPhone
, c1.Can_Email
from
(
select
min(ID) as ID,
Can_FName,
Can_HPhone,
Can_Email
from Can
group by
Can_FName,
Can_HPhone,
Can_Email
) c1
inner join Can c2 on c1.ID < c2.ID
where
c1.Can_FName = c2.Can_FName
and (c1.Can_HPhone = c2.Can_HPhone OR c1.Can_Email = c2.Can_Email)
group by
c1.Can_FName
, c1.Can_HPhone
, c1.Can_Email
Granted, this is more involved than a union - but I think it illustrates a good way of thinking about duplicates.
Project the desired transformation first from a derived table, then do the aggregation:
SELECT COUNT(*)
, CAN_FName
, Can_HPhoneOrEMail
FROM (
SELECT Can_FName
, ISNULL(Can_HPhone,'') + ISNULL(Can_EMail,'') AS Can_HPhoneOrEMail
FROM Can) AS Can_Transformed
GROUP BY Can_FName, Can_HPhoneOrEMail
HAVING Count(*) > 1
Adjust your 'OR' operation as needed in the derived table project list.
I know this answer will be criticised for the use of the temp table, but it will work anyway:
-- create temp table to give the table a unique key
create table #tmp(
ID int identity,
can_Fname varchar(200) null, -- real type and len here
can_HPhone varchar(200) null, -- real type and len here
can_Email varchar(200) null, -- real type and len here
)
-- just copy the rows where a duplicate fname exists
-- (better performance, especially for a big table)
insert into #tmp
select can_fname,can_hphone,can_email
from Can
where can_fname in (select can_fname from Can
group by can_fname having count(*)>1)
-- select the rows that have the same fname and
-- at least the same phone or email
select can_Fname, can_Hphone, can_Email
from #tmp a where exists
(select * from #tmp b where
a.ID<>b.ID and A.can_fname = b.can_fname
and (isnull(a.can_HPhone,'')=isnull(b.can_HPhone,'')
or isnull(a.can_email,'')=isnull(b.can_email,'')))
Try this:
SELECT Can_FName, COUNT(*)
FROM (
SELECT
rank() over(partition by Can_FName order by Can_FName,Can_HPhone) rnk_p,
rank() over(partition by Can_FName order by Can_FName,Can_EMail) rnk_m,
Can_FName
FROM Can
) X
WHERE rnk_p=1 or rnk_m =1
GROUP BY Can_FName
HAVING COUNT(*)>1

SPROC T-SQL Syntax to return results if rows exist on multiple days

What I need to test for on my table is whether there are rows for a given user id and order id on two separate days (there is a DATETIME field for the timestamp).
I'm pretty sure I'd need a having clause and that's why I'm here...that frightens me terribly.
Having shouldn't scare you, it is just a "Where" on an aggregated field:
Select UserID, Count(*) From OrderTbl Group By UserID Having Count(*) > 1
That'll give you all the Users that have multiple orders.
Select UserID, Count(*) From OrderTbl Where (UserID=@UserID) Group By UserID Having Count(*) > 1
will give you the count if there are multiple records for the user id in @UserID and no rows if not.
if exists (Select UserID, Count(*) From OrderTbl Where (UserID=@UserID) Group By UserID
Having Count(*) > 1) Select 1 else Select 0
will return a 1 if there are multiple records for the User, 0 if not.
Update: Didn't realize that you could have multiple orders per day. This query will do what you want:
With DistinctDates as (Select Distinct UserID, [DATE] From OrderTbl Where (UserID=@UserID))
Select UserID, Count(*) From DistinctDates
Group By UserID Having Count(*) > 1
I am not sure if I understood your question, but this may work for you. HAVING is your friend, and you can still use the WHERE clause. This should let you know which order and user id combo is occurring more than once in the table.
SELECT [UserId], [OrderId]
FROM OrderTable
WHERE UserId = @UserId
AND OrderId = @OrderId
GROUP BY UserId, OrderId
HAVING Count(*) > 1
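Note that Count(*) > 1 is also true for two rows on the same day. To test for two separate days, the timestamp has to be reduced to its date part before counting distinct values. Here is a minimal sketch of that idea, assuming Python's built-in sqlite3 module and a hypothetical CreatedAt timestamp column; in T-SQL the date part would come from something like CAST(CreatedAt AS date) instead of SQLite's date(). The sample rows are invented.

```python
import sqlite3

def combos_on_multiple_days(order_rows):
    """order_rows: (UserId, OrderId, CreatedAt) tuples. Returns combos seen on 2+ distinct days."""
    conn = sqlite3.connect(":memory:")
    # CreatedAt is a hypothetical name for the DATETIME column in the question.
    conn.execute("CREATE TABLE OrderTable (UserId INTEGER, OrderId INTEGER, CreatedAt TEXT)")
    conn.executemany("INSERT INTO OrderTable VALUES (?, ?, ?)", order_rows)
    rows = conn.execute("""
        SELECT UserId, OrderId
        FROM OrderTable
        GROUP BY UserId, OrderId
        HAVING COUNT(DISTINCT date(CreatedAt)) > 1   -- distinct calendar days
        ORDER BY UserId, OrderId
    """).fetchall()
    conn.close()
    return rows

SAMPLE_ORDER_ROWS = [
    (1, 100, "2021-01-01 09:00:00"),  # same day twice: not a match
    (1, 100, "2021-01-01 17:00:00"),
    (1, 101, "2021-01-01 10:00:00"),  # two different days: match
    (1, 101, "2021-01-02 10:00:00"),
    (2, 200, "2021-01-05 08:00:00"),  # single row: not a match
]
print(combos_on_multiple_days(SAMPLE_ORDER_ROWS))
```

Only the user 1 / order 101 combination is returned, since it is the only one with rows on two different calendar days.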