Identifying duplicates within a table: looking for query advice - tsql

I am trying to identify duplicated contact records within an account and am looking for the best way to do this. There is an account table and a contact table. Below is the query I've come up with to give me what I need, but I feel there is probably a better or more efficient way to do this, so I'm looking for any feedback or advice. Thanks in advance!
SELECT * FROM sysdba.CONTACT a WITH(NOLOCK)
WHERE EXISTS
(
SELECT ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL FROM sysdba.CONTACT b WITH(NOLOCK)
GROUP BY ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL
HAVING COUNT(*) > 1
AND a.ACCOUNTID = b.ACCOUNTID AND a.FIRSTNAME = b.FIRSTNAME AND a.LASTNAME = b.LASTNAME AND a.EMAIL = b.EMAIL
)
ORDER BY ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL
Here is another way I can do this, but having to use DISTINCT seems ugly:
SELECT DISTINCT a.CONTACTID, a.FIRSTNAME, a.LASTNAME, a.EMAIL FROM sysdba.CONTACT a WITH(NOLOCK)
JOIN sysdba.CONTACT b WITH(NOLOCK)
ON a.ACCOUNTID = b.ACCOUNTID AND a.FIRSTNAME = b.FIRSTNAME AND a.LASTNAME = b.LASTNAME AND a.EMAIL = b.EMAIL AND a.CONTACTID != b.CONTACTID
ORDER BY a.CONTACTID, a.FIRSTNAME, a.LASTNAME, a.EMAIL
When checking the execution plans for both, the first query shows a relative cost of 37% compared to 63% for the second, which is surprising, as I've always thought (apparently wrongly) that using joins is quicker than relying on a WHERE clause.

A common practice when you are trying to identify duplicates is to use windowed functions such as COUNT() OVER (...) and ROW_NUMBER() OVER (...).
The query below should return the groups of records where there is more than one CONTACTID for the same ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL combination. In other words, it returns every record that has duplicates, along with those duplicates:
;WITH cteCONTACT
AS (
SELECT ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL, CONTACTID,
CNT = COUNT(*) OVER (PARTITION BY ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL)
FROM sysdba.CONTACT
)
SELECT ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL, CONTACTID
FROM cteCONTACT
WHERE CNT > 1;
And the following query should return only the duplicates, without the original record that each one duplicates (the row with the lowest CONTACTID in each group):
;WITH cteCONTACT
AS (
SELECT ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL, CONTACTID,
NUM = ROW_NUMBER() OVER (
PARTITION BY ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL
ORDER BY CONTACTID)
FROM sysdba.CONTACT
)
SELECT ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL, CONTACTID
FROM cteCONTACT
WHERE NUM > 1;
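
To see how the two window-function queries differ, here is a minimal sketch that runs the same idea against an in-memory SQLite database from Python. SQLite is only a stand-in for SQL Server here (it needs SQLite 3.25 or later for window functions), and the lowercase `contact` table and its sample rows are made up for illustration.

```python
import sqlite3

# Hypothetical miniature of sysdba.CONTACT; SQLite stands in for SQL Server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contact (contactid INTEGER PRIMARY KEY, accountid TEXT,"
    " firstname TEXT, lastname TEXT, email TEXT)"
)
conn.executemany(
    "INSERT INTO contact VALUES (?, ?, ?, ?, ?)",
    [
        (1, "A1", "Ann", "Lee", "ann@example.com"),
        (2, "A1", "Ann", "Lee", "ann@example.com"),
        (3, "A1", "Bob", "Ray", "bob@example.com"),
        (4, "A2", "Ann", "Lee", "ann@example.com"),  # same person, other account
        (5, "A1", "Ann", "Lee", "ann@example.com"),
    ],
)

# COUNT(*) OVER: every row that belongs to a group with more than one member.
in_duplicate_groups = [row[0] for row in conn.execute("""
    SELECT contactid
    FROM (SELECT contactid,
                 COUNT(*) OVER (PARTITION BY accountid, firstname, lastname, email) AS cnt
          FROM contact) AS c
    WHERE cnt > 1
    ORDER BY contactid
""")]

# ROW_NUMBER() OVER: skip the first row of each group, keep only the extras.
extras_only = [row[0] for row in conn.execute("""
    SELECT contactid
    FROM (SELECT contactid,
                 ROW_NUMBER() OVER (PARTITION BY accountid, firstname, lastname, email
                                    ORDER BY contactid) AS num
          FROM contact) AS c
    WHERE num > 1
    ORDER BY contactid
""")]

print(in_duplicate_groups)
print(extras_only)
```

The second list is the one you would typically feed into a cleanup step, since it leaves one surviving row per group.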

Related

Selecting distinct values

The domain is:
company (id, name, adress)
employee (id, name, adress, company_id, expertise_id)
dependantrelative (id, name, employee_id)
expertise (id, name, class)
I want to know how to get the number of dependantrelatives of each employee who are unique experts in their respective companies.
The Query below does not return the correct answer. Can you help me?
SELECT DISTINCT dependantrelative.employee_id
, COUNT(*) AS qty_dependantrelatives
FROM dependantrelative
INNER JOIN employee
ON employee.id = dependantrelative.employee_id
GROUP BY dependantrelative.employee_id

I just tried out the Query below and it works, but I want to know if there is a faster and simple way of getting the answer.
SELECT employee.id
,COUNT(dependantrelative.employee_id) AS qty_dependantrelatives
FROM (
SELECT employee.company_id
, employee.expertise_id AS expert
, COUNT(employee.expertise_id)
FROM employee
GROUP BY employee.company_id
, employee.expertise_id
HAVING COUNT(employee.expertise_id)<2
) AS uniexpert
LEFT JOIN employee
ON employee.company_id = uniexpert.company_id
AND employee.expertise_id = uniexpert.expert
LEFT JOIN dependantrelative
ON dependantrelative.employee_id = employee.id
GROUP BY employee.id
ORDER BY employee.id
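
One possibly simpler shape for this, sketched under assumptions: count the employees per (company_id, expertise_id) with a window function, keep the rows where that count is 1, and left join the dependants. The sketch below uses Python with an in-memory SQLite database (3.25 or later) purely as a stand-in; the sample rows are invented and the table layout is reduced to the columns the query needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT,
                           company_id INTEGER, expertise_id INTEGER);
    CREATE TABLE dependantrelative (id INTEGER PRIMARY KEY, name TEXT,
                                    employee_id INTEGER);
    -- Invented sample data: employees 1 and 2 share an expertise in company 1.
    INSERT INTO employee VALUES
        (1, 'Ann', 1, 10), (2, 'Bob', 1, 10), (3, 'Cy', 1, 20), (4, 'Di', 2, 10);
    INSERT INTO dependantrelative VALUES
        (1, 'Kid A', 3), (2, 'Kid B', 3), (3, 'Kid C', 1);
""")

# peers = how many employees share this expertise within the same company.
unique_expert_dependants = conn.execute("""
    SELECT e.id, COUNT(d.id) AS qty_dependantrelatives
    FROM (SELECT id,
                 COUNT(*) OVER (PARTITION BY company_id, expertise_id) AS peers
          FROM employee) AS e
    LEFT JOIN dependantrelative AS d ON d.employee_id = e.id
    WHERE e.peers = 1
    GROUP BY e.id
    ORDER BY e.id
""").fetchall()

print(unique_expert_dependants)
```

Employee 4 shows why the company has to be part of the partition: expertise 10 is shared in company 1 but unique in company 2.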

How to manage NULL strings or dates in sql queries (PostgreSQL)

PostgreSQL 11.1
With the below sql query, where $1 and $2 are strings and $3 is a timestamp, how can the below query be rewritten so that a null value in $3 allows for every date to be selected (not just null dates).
SELECT lastname, firstname, birthdate FROM patients
WHERE UPPER(lastname) LIKE UPPER($1)||'%' and UPPER(firstname) LIKE UPPER($2)||'%' AND birthdate::date = $3::date
UNION
SELECT lastname, firstname, birthdate FROM appointment_book
WHERE UPPER(lastname) LIKE UPPER($1)||'%' and UPPER(firstname) LIKE UPPER($2)||'%' and birthdate::date = $3::date
That is, if $3 is null, then this should reduce to:
SELECT lastname, firstname, birthdate FROM patients
WHERE UPPER(lastname) LIKE UPPER($1)||'%' and UPPER(firstname) LIKE UPPER($2)||'%'
UNION
SELECT lastname, firstname, birthdate FROM appointment_book
WHERE UPPER(lastname) LIKE UPPER($1)||'%' and UPPER(firstname) LIKE UPPER($2)||'%'

Untested, but I think you can handle that with a CASE expression:
SELECT lastname, firstname, birthdate FROM patients p
WHERE UPPER(p.lastname) LIKE UPPER($1)||'%'
AND UPPER(p.firstname) LIKE UPPER($2)||'%'
AND (CASE WHEN $3 IS NULL THEN TRUE
ELSE p.birthdate::date = $3::date
END)
UNION
SELECT lastname, firstname, birthdate FROM appointment_book ab
WHERE UPPER(ab.lastname) LIKE UPPER($1)||'%'
AND UPPER(ab.firstname) LIKE UPPER($2)||'%'
AND (CASE WHEN $3 IS NULL THEN TRUE
ELSE ab.birthdate::date = $3::date
END);
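
The same "null parameter means no filter" idea can also be written as `parameter IS NULL OR column = parameter`. Below is a minimal sketch of that pattern, run from Python against in-memory SQLite rather than PostgreSQL, so the `date(...)` calls and the named `:born` parameter are SQLite-flavoured assumptions; in PostgreSQL the parameter may need an explicit cast such as `$3::date IS NULL`. The sample rows and the `search` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (lastname TEXT, firstname TEXT, birthdate TEXT);
    CREATE TABLE appointment_book (lastname TEXT, firstname TEXT, birthdate TEXT);
    INSERT INTO patients VALUES
        ('Smith', 'John', '1980-01-02'), ('Smythe', 'Jane', '1990-05-06');
    INSERT INTO appointment_book VALUES
        ('Smith', 'John', '1980-01-02'), ('Smith', 'Jo', NULL);
""")

# When :born is NULL the last predicate is true for every row, so no date
# filtering happens; otherwise only rows with a matching birthdate survive.
QUERY = """
    SELECT lastname, firstname, birthdate FROM patients
    WHERE UPPER(lastname) LIKE UPPER(:last) || '%'
      AND UPPER(firstname) LIKE UPPER(:first) || '%'
      AND (:born IS NULL OR date(birthdate) = date(:born))
    UNION
    SELECT lastname, firstname, birthdate FROM appointment_book
    WHERE UPPER(lastname) LIKE UPPER(:last) || '%'
      AND UPPER(firstname) LIKE UPPER(:first) || '%'
      AND (:born IS NULL OR date(birthdate) = date(:born))
"""


def search(last, first, born=None):
    """Hypothetical helper: run the search and return the rows as a set."""
    params = {"last": last, "first": first, "born": born}
    return set(conn.execute(QUERY, params).fetchall())


print(search("sm", "j"))                # no date filter at all
print(search("sm", "j", "1980-01-02"))  # only the matching birthdate
```

Note that with a null date parameter the row whose birthdate is itself NULL is returned too, which matches the "every date" behaviour asked for.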

Getting value from table with max key

I have a table with two columns:
UserId (auto int)
Email(Nvarchar)
I want to retrieve the email that was last inserted on table.
I've tried some options, but nothing seems to be working.
Thanks in advance.

Perhaps simply:
SELECT TOP 1 Email FROM dbo.[Table] ORDER BY UserId DESC
or
SELECT UserId, Email
FROM dbo.[Table]
WHERE UserId = (SELECT MAX(UserId) FROM dbo.[Table])
However, it's not good practise to abuse a primary-key column for information like "last inserted". Add a datetime column for this.
You could also use the ROW_NUMBER function:
WITH x AS (
SELECT UserId, Email,
rn = Row_number() OVER(ORDER BY UserId DESC)
FROM dbo.[Table])
SELECT UserId, Email
FROM x
WHERE rn = 1
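
As a quick sanity check that these variants agree, here is a small sketch in Python against in-memory SQLite (3.25 or later). It is a stand-in only: SQLite has no TOP, so `LIMIT 1` plays that role, and the `users` table name and sample emails are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (userid INTEGER PRIMARY KEY AUTOINCREMENT, email TEXT)"
)
for email in ("a@example.com", "b@example.com", "c@example.com"):
    conn.execute("INSERT INTO users (email) VALUES (?)", (email,))

# Equivalent of SELECT TOP 1 ... ORDER BY UserId DESC.
by_limit = conn.execute(
    "SELECT email FROM users ORDER BY userid DESC LIMIT 1"
).fetchone()[0]

# Scalar subquery on MAX(UserId).
by_max = conn.execute(
    "SELECT email FROM users WHERE userid = (SELECT MAX(userid) FROM users)"
).fetchone()[0]

# ROW_NUMBER() ordered by the key, descending.
by_row_number = conn.execute("""
    SELECT email
    FROM (SELECT email, ROW_NUMBER() OVER (ORDER BY userid DESC) AS rn
          FROM users) AS x
    WHERE rn = 1
""").fetchone()[0]

print(by_limit, by_max, by_row_number)
```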

Union Select Distinct syntax?

I have a huge table that contains both shipping address information and billing address information. I can get unique shipping and billing addresses in two separate tables with the following:
SELECT DISTINCT ShipToName, ShipToAddress1, ShipToAddress2, ShipToAddress3, ShipToCity, ShipToZipCode
FROM Orders
ORDER BY Orders.ShipToName
SELECT DISTINCT BillToName, BillToAddress1, BillToAddress2, BillToAddress3, BillToCity, BillToZipCode
FROM Orders
ORDER BY Orders.BillToName
How can I get the distinct intersection of the two? I am unsure of the syntax.

Something like this?
SELECT DISTINCT
toname, addr1, addr2, addr3, city, zip
FROM
(SELECT DISTINCT
ShipToName AS toName,
ShipToAddress1 AS addr1,
ShipToAddress2 AS addr2,
ShipToAddress3 AS addr3,
ShipToCity AS city,
ShipToZipCode AS zip
FROM
Orders
UNION ALL
SELECT DISTINCT
BillToName AS toName,
BillToAddress1 AS addr1,
BillToAddress2 AS addr2,
BillToAddress3 AS addr3,
BillToCity AS city,
BillToZipCode AS zip
FROM
Orders) o
ORDER BY ToName

You say "Intersection" but you accepted the Union answer so I guess you just want the UNION DISTINCT. No need for derived tables and the three DISTINCT. You can use the simple:
SELECT
ShipToName AS Name,
ShipToAddress1 AS Address1,
ShipToAddress2 AS Address2,
ShipToAddress3 AS Address3,
ShipToCity AS City,
ShipToZipCode AS ZipCode
FROM
Orders
UNION --- UNION means UNION DISTINCT
SELECT
BillToName,
BillToAddress1,
BillToAddress2,
BillToAddress3,
BillToCity,
BillToZipCode
FROM
Orders
ORDER BY
Name ;

You can join both sets on all fields and this will return the records that match:
SELECT *
FROM Orders o1
INNER JOIN Orders o2
ON o1.ShipToName = o2.BillToName
AND o1.ShipToAddress1 = o2.BillToAddress1
AND o1.ShipToAddress2 = o2.BillToAddress2
AND o1.ShipToAddress3 = o2.BillToAddress3
AND o1.ShipToCity = o2.BillToCity
AND o1.ShipToZipCode = o2.BillToZipCode
Or you should be able to use INTERSECT:
SELECT ShipToName, ShipToAddress1, ShipToAddress2, ShipToAddress3, ShipToCity, ShipToZipCode
FROM Orders
INTERSECT
SELECT BillToName, BillToAddress1, BillToAddress2, BillToAddress3, BillToCity, BillToZipCode
FROM Orders
Or even a UNION query (UNION removes duplicates between two sets of data):
SELECT ShipToName, ShipToAddress1, ShipToAddress2, ShipToAddress3, ShipToCity, ShipToZipCode
FROM Orders
UNION
SELECT BillToName, BillToAddress1, BillToAddress2, BillToAddress3, BillToCity, BillToZipCode
FROM Orders
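
One difference between the join and INTERSECT variants above is worth checking if address lines can be NULL: `=` in a join never matches NULL to NULL, while INTERSECT (like UNION and DISTINCT) treats NULLs as equal. A minimal sketch from Python against in-memory SQLite, with a cut-down, hypothetical `orders` table (three columns per address) and invented rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        shiptoname TEXT, shiptoaddress1 TEXT, shiptoaddress2 TEXT,
        billtoname TEXT, billtoaddress1 TEXT, billtoaddress2 TEXT);
    INSERT INTO orders VALUES
        ('Ann', '1 Main St', NULL,    'Ann', '1 Main St', NULL),
        ('Bob', '2 Oak Ave', 'Apt 4', 'Cy',  '9 Elm Rd',  NULL),
        ('Cy',  '9 Elm Rd',  NULL,    'Bob', '2 Oak Ave', 'Apt 4');
""")

# Self-join on all columns: rows with a NULL address line never match.
via_join = [row[0] for row in conn.execute("""
    SELECT DISTINCT o1.shiptoname
    FROM orders AS o1
    JOIN orders AS o2
      ON o1.shiptoname = o2.billtoname
     AND o1.shiptoaddress1 = o2.billtoaddress1
     AND o1.shiptoaddress2 = o2.billtoaddress2
    ORDER BY o1.shiptoname
""")]

# INTERSECT: NULLs compare as equal, so all three addresses are found.
via_intersect = [row[0] for row in conn.execute("""
    SELECT shiptoname, shiptoaddress1, shiptoaddress2 FROM orders
    INTERSECT
    SELECT billtoname, billtoaddress1, billtoaddress2 FROM orders
    ORDER BY 1
""")]

print(via_join)
print(via_intersect)
```

So if ShipToAddress2 or ShipToAddress3 are often empty, INTERSECT is probably closer to what you expect than the plain equality join.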

How do you perform a search on a 1-to-many relationship when the criteria could be on either table?

I am using T-SQL. I have what I thought would be an easy search. There is a 1-to-many relationship between SalesPerson and TradeShow: one salesperson could have gone to many trade shows. I need to be able to search on the SalesPerson. I also need to be able to search on the LAST trade show they attended. I thought I would be able to do a simple join and group on their last trade show, but I cannot display the City or State.
SELECT SalesPerson.SalesPersonID, FirstName, LastName, TradeShow.DateLastWent
FROM SalesPerson INNER JOIN
(SELECT SalesPersonID, MAX(DateLastWent) AS DateLastWent
FROM TradeShow
GROUP BY SalesPersonID) AS TradeShow ON SalesPerson.SalesPersonID = TradeShow.SalesPersonID
This works, but TradeShow also has City and State. I need to be able to search on and display City and State. But if I include them in the subquery, I have to wrap them in an aggregate function, and if I do that, I get the incorrect City and State.
The tables are simple:
SALESPERSON
salespersonID PK
firstname
lastname
TRADESHOW
tradeshowID PK
datelastwent
city
state
salespersonID FK

Re-word it: what you want is the salesperson, plus the information from the last show that they have been to.
Select
SalesPerson.SalesPersonID,
FirstName,
LastName,
TradeShow.DateLastWent,
TradeShow.City,
TradeShow.State
From
SalesPerson
Inner Join TradeShow
On SalesPerson.SalesPersonID = TradeShow.SalesPersonID
Where
TradeShow.TradeShowID =
(Select Top 1 Latest.TradeShowID
From TradeShow As Latest
Where SalesPerson.SalesPersonID = Latest.SalesPersonID
Order By Latest.DateLastWent Desc)

You can join TradeShow twice :
SELECT SalesPerson.SalesPersonID, FirstName, LastName, TS1.DateLastWent,
TS2.City, TS2.State
FROM SalesPerson INNER JOIN
(SELECT SalesPersonID, MAX(DateLastWent) AS DateLastWent
FROM TradeShow
GROUP BY SalesPersonID
) AS TS1 ON (SalesPerson.SalesPersonID = TS1.SalesPersonID)
INNER JOIN TradeShow TS2 ON
(TS2.SalesPersonID = TS1.SalesPersonID AND TS2.DateLastWent = TS1.DateLastWent)
WHERE TS2.City = 'CityName'

There is likely a more elegant way to solve this, but my first thought is to simply grab the newest TradeShow record to join with
SELECT SalePersonID, FirstName, LastName, TradeShow.DateLastWent
FROM SalesPerson
INNER JOIN (
SELECT *
FROM (
SELECT TradeShowId, DateLastWent, City, State, SalesPersonId
FROM TradeShow
ORDER BY datelastwent DESC
)
WHERE ROWNUM <= 1
) ON SalesPerson.SalesPersonId = TradeShow.SalesPersonId
Edit
Oops... been playing with Oracle too much
ROW_NUMBER() OVER(order by date) or SELECT TOP X
would be the SQL Server way of doing this... I don't have an instance of SQL Server running, but I'm pretty sure the syntax ends up being something like
SELECT SalesPerson.SalesPersonId, FirstName, LastName, TradeShow.DateLastWent, TradeShow.City, TradeShow.State
FROM SalesPerson
INNER JOIN (
SELECT TradeShowId, DateLastWent, City, State, SalesPersonId, ROW_NUMBER() OVER(PARTITION BY SalesPersonId ORDER BY DateLastWent DESC) AS RowNumber
FROM TradeShow
) AS TradeShow ON SalesPerson.SalesPersonId = TradeShow.SalesPersonId AND TradeShow.RowNumber = 1
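
For anyone who wants to try the ROW_NUMBER approach without a SQL Server instance, here is a rough sketch in Python using in-memory SQLite (3.25 or later) as a stand-in. The lowercase table names, sample rows, the `latest_shows` helper and the optional `:city` parameter are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE salesperson (salespersonid INTEGER PRIMARY KEY,
                              firstname TEXT, lastname TEXT);
    CREATE TABLE tradeshow (tradeshowid INTEGER PRIMARY KEY, datelastwent TEXT,
                            city TEXT, state TEXT,
                            salespersonid INTEGER REFERENCES salesperson);
    INSERT INTO salesperson VALUES (1, 'Ann', 'Lee'), (2, 'Bob', 'Ray');
    INSERT INTO tradeshow VALUES
        (1, '2020-01-01', 'Austin', 'TX', 1),
        (2, '2021-06-01', 'Boston', 'MA', 1),
        (3, '2019-03-01', 'Boston', 'MA', 2);
""")

# rn = 1 marks each salesperson's most recent show; the city filter is applied
# to that latest show only, never to older ones.
QUERY = """
    SELECT sp.salespersonid, sp.firstname, ts.datelastwent, ts.city, ts.state
    FROM salesperson AS sp
    JOIN (SELECT salespersonid, datelastwent, city, state,
                 ROW_NUMBER() OVER (PARTITION BY salespersonid
                                    ORDER BY datelastwent DESC) AS rn
          FROM tradeshow) AS ts
      ON ts.salespersonid = sp.salespersonid AND ts.rn = 1
    WHERE :city IS NULL OR ts.city = :city
    ORDER BY sp.salespersonid
"""


def latest_shows(city=None):
    """Hypothetical helper: latest show per salesperson, optionally by city."""
    return conn.execute(QUERY, {"city": city}).fetchall()


print(latest_shows())
print(latest_shows("Austin"))  # Ann went to Austin, but not most recently
```

Searching for Austin returns nothing here because Ann's latest show was in Boston, which is the behaviour the question asks for.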
