SQL Server 2008 De-duping

Long story short, I took over a project and a table in the database is in serious need of de-duping. The table looks like this:
supply_req_id | int | [primary key]
supply_req_dt | datetime |
request_id | int | [foreign key]
supply_id | int | [foreign key]
is_disabled | bit |
The duplication exists among records that share the same request_id and supply_id. I'd like to find a best-practice way to de-dupe this table.
[EDIT]
@Kirk_Broadhurst, thanks for the question. Since supply_req_id is not referenced anywhere else, I would answer by saying keep the first and delete any subsequent occurrences.
Happy Holidays

This creates a rank for each row within each (request_id, supply_id) group, starting with 1 = lowest supply_req_id. Any dupe has a value > 1.
;WITH cDupes AS
(
SELECT
supply_req_id,
ROW_NUMBER() OVER (PARTITION BY request_id, supply_id ORDER BY supply_req_id) AS RowNum
FROM
MyTable
)
DELETE FROM
cDupes
WHERE
RowNum > 1
Then add a unique constraint or index so the duplicates can't come back:
CREATE UNIQUE INDEX IXU_NoDupes ON MyTable (request_id, supply_id)
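If you want to try the ranking idea before touching the real table, here is a rough sketch using Python's built-in sqlite3 module as a stand-in for SQL Server (that substitution is an assumption made purely to get a self-contained demo). SQLite can't delete through a CTE, so the ranked rows are fed to an IN list instead. The table name supply_req and the sample rows are made up, and the groups are the (request_id, supply_id) pairs from the question.

```python
import sqlite3


def dedupe_keep_lowest_id(conn):
    """Delete every row ranked > 1 within its (request_id, supply_id) group."""
    conn.execute("""
        DELETE FROM supply_req
        WHERE supply_req_id IN (
            SELECT supply_req_id
            FROM (SELECT supply_req_id,
                         ROW_NUMBER() OVER (PARTITION BY request_id, supply_id
                                            ORDER BY supply_req_id) AS row_num
                  FROM supply_req)
            WHERE row_num > 1)
    """)
    # With the duplicates gone, the unique index can be created.
    conn.execute(
        "CREATE UNIQUE INDEX ixu_no_dupes ON supply_req (request_id, supply_id)"
    )


# Hypothetical sample data: ids 2 and 4 duplicate id 1.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE supply_req ("
    "supply_req_id INTEGER PRIMARY KEY, request_id INT, supply_id INT)"
)
conn.executemany(
    "INSERT INTO supply_req VALUES (?, ?, ?)",
    [(1, 10, 100), (2, 10, 100), (3, 10, 200), (4, 10, 100)],
)
dedupe_keep_lowest_id(conn)
remaining = [
    row[0]
    for row in conn.execute(
        "SELECT supply_req_id FROM supply_req ORDER BY supply_req_id"
    )
]
print(remaining)
```

Once the index exists, inserting another row with an already-used (request_id, supply_id) pair fails with an integrity error, which is the point of adding it.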

Seems like there should be a command for this, but maybe that's because I'm used to a different database server. Here's the relevant support doc:
How to remove duplicate rows from a table in SQL Server
http://support.microsoft.com/kb/139444

You need to clarify your rule for determining which record to keep in the case of a 'match' - the most recent, the earliest, the one that has is_disabled true, or false?
Once you've identified that rule, the rest is fairly simple:
select the records you want to keep - the distinct records
join back to the original table to get the ids for those records.
delete everything that is not in the joined dataset.
So let's say you want to keep the most recent records of any 'duplicate' pair. Your query would look like this:
DELETE FROM [table] WHERE supply_req_id NOT IN
(SELECT supply_req_id from [table] t
INNER JOIN
(SELECT MAX(supply_req_dt) dt, request_id, supply_id
FROM [table]
GROUP BY request_id, supply_id) d
ON t.supply_req_dt = d.dt
AND t.request_id = d.request_id
AND t.supply_id = d.supply_id)
The catch is that if supply_req_dt is also duplicated, you'll end up keeping both of the duplicates. The fix is to add another group by and select the top id
select MAX(supply_req_id), supply_req_dt, request_id, supply_id
from [table]
group by supply_req_dt, request_id, supply_id
as an interim step. But if you don't need to do that, don't bother with it.
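Putting the "keep the most recent" rule and the tie-break on the id together might look roughly like the sketch below. It runs against an in-memory SQLite database through Python's sqlite3 module, which is only a stand-in for SQL Server here; the table name supply_req and the rows are invented for the demo, and dates are stored as ISO strings so that MAX compares them correctly.

```python
import sqlite3

# Keep, per (request_id, supply_id), the row with the latest supply_req_dt;
# if that date occurs more than once, keep the highest supply_req_id.
KEEP_LATEST = """
DELETE FROM supply_req
WHERE supply_req_id NOT IN (
    SELECT MAX(t.supply_req_id)
    FROM supply_req t
    JOIN (SELECT MAX(supply_req_dt) AS dt, request_id, supply_id
          FROM supply_req
          GROUP BY request_id, supply_id) d
      ON t.supply_req_dt = d.dt
     AND t.request_id = d.request_id
     AND t.supply_id = d.supply_id
    GROUP BY t.supply_req_dt, t.request_id, t.supply_id)
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE supply_req (supply_req_id INTEGER PRIMARY KEY, "
    "supply_req_dt TEXT, request_id INT, supply_id INT)"
)
conn.executemany(
    "INSERT INTO supply_req VALUES (?, ?, ?, ?)",
    [
        (1, "2020-01-01", 10, 100),
        (2, "2020-02-01", 10, 100),  # same latest date as id 3
        (3, "2020-02-01", 10, 100),
        (4, "2020-01-15", 10, 200),  # not a duplicate
    ],
)
conn.execute(KEEP_LATEST)
kept = [
    row[0]
    for row in conn.execute(
        "SELECT supply_req_id FROM supply_req ORDER BY supply_req_id"
    )
]
print(kept)
```

Ids 2 and 3 tie on the date, so the extra grouping step is what decides between them.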

Related

delete duplicates in a table and update references

I have a table with an id column. We recently added a new field, calculated from an external source to identify unique entries, which made us realize we actually have duplicates in the database:
Main Table
id | unique_id | ...
---|------------
4 | A |
5 | A
6 | B
We can see: 5 is actually a duplicate of 4, as they both have the same unique_id.
Now this needs to be cleaned up.
I sadly can not simply delete those duplicates (5), as other tables depend on it:
Other Table (OtherTable.main_id REFERENCES MainTable.id)
id | main_id | ...
---|------------
1 | 4 | Blah
2 | 5
3 | 6
Now I have to clean up the duplicates, here
UPDATE OtherTable SET main_id = 4 WHERE main_id = 5
How can I do that in an efficient update?
I tried to simply update every reference to the first one with that same unique_id, however that didn't complete in a day.
UPDATE "OtherTable" SET "main_id" = (SELECT "id" FROM "MainTable" WHERE "unique_id" = (SELECT "unique_id" FROM "MainTable" WHERE "id" = "OtherTable"."main_id") LIMIT 1)
If it helps, the MainTable contains about 750,000 entries, the OtherTable contains 12,000,000 rows.
Probably that's because the triple nested select is quite inefficient.
For the simple part, deleting the duplicates (once I am done changing the references to the first one of its kind), I found this query to work swiftly enough:
DELETE FROM MainTable
WHERE id IN
(SELECT id
FROM
(SELECT id,
ROW_NUMBER() OVER( PARTITION BY unique_id
ORDER BY id ) AS row_num
FROM MainTable ) t
WHERE t.row_num > 1 );
However I need a way to update the references to the non-deleted ones of the duplicates.

Instead of UPDATE with a nested query, I'd suggest using UPDATE FROM for a join, and the same window function as in your DELETE statement:
UPDATE "OtherTable" AS other
SET main_id = main.min_id
FROM (SELECT
id,
first_value(id) OVER (PARTITION BY unique_id ORDER BY id) AS min_id
FROM "MainTable"
) AS main
WHERE main.id = other.main_id
AND main.id <> main.min_id
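To see the whole clean-up end to end on the sample rows from the question, here is a small sketch using Python's sqlite3 module instead of Postgres (an assumption, just to keep it runnable anywhere). Older SQLite builds lack UPDATE ... FROM, so the sketch materialises the same first_value mapping into a helper table called id_map (a hypothetical name) and applies it with a correlated subquery. It illustrates the order of the steps, not how the join performs on 12 million rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main_table (id INTEGER PRIMARY KEY, unique_id TEXT);
    CREATE TABLE other_table (
        id INTEGER PRIMARY KEY,
        main_id INT REFERENCES main_table(id)
    );
    INSERT INTO main_table VALUES (4, 'A'), (5, 'A'), (6, 'B');
    INSERT INTO other_table VALUES (1, 4), (2, 5), (3, 6);

    -- Map every row to the first id sharing its unique_id.
    CREATE TEMP TABLE id_map AS
    SELECT id,
           first_value(id) OVER (PARTITION BY unique_id ORDER BY id) AS min_id
    FROM main_table;

    -- Step 1: repoint only the references that target a duplicate.
    UPDATE other_table
    SET main_id = (SELECT m.min_id FROM id_map m
                   WHERE m.id = other_table.main_id)
    WHERE main_id IN (SELECT m.id FROM id_map m WHERE m.id <> m.min_id);

    -- Step 2: nothing references the duplicates any more, so delete them.
    DELETE FROM main_table
    WHERE id IN (SELECT m.id FROM id_map m WHERE m.id <> m.min_id);
""")
refs = conn.execute(
    "SELECT id, main_id FROM other_table ORDER BY id"
).fetchall()
mains = [r[0] for r in conn.execute("SELECT id FROM main_table ORDER BY id")]
print(refs, mains)
```

The update has to run before the delete, otherwise the foreign key from other_table would be left pointing at rows that no longer exist.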

How to use new created column in where column in sql?

Hi, I have a query which looks like the following:
SELECT device_id, tag_id, at, _deleted, data,
row_number() OVER (PARTITION BY device_id ORDER BY at DESC) AS row_num
FROM mdb_history.devices_tags_mapping_history
WHERE at <= '2019-04-01'
AND _deleted = False
AND (tag_id = '275674' or tag_id = '275673')
AND row_num = 1
However, when I run the above query, I get the following error:
ERROR: column "row_num" does not exist
Is there any way around this? One way I tried was to use it in the following way:
SELECT * from (SELECT device_id, tag_id, at, _deleted, data,
row_number() OVER (PARTITION BY device_id ORDER BY at DESC) AS row_num
FROM mdb_history.devices_tags_mapping_history
WHERE at <= '2019-04-01'
AND _deleted = False
AND (tag_id = '275674' or tag_id = '275673')) tag_deleted
WHERE tag_deleted.row_num = 1
But this becomes way too complicated when I do it with other queries, as I have a number of joins and have to repeat the column list in the outer select, so it causes a lot of select statements. Is there a smarter, simpler way of doing this? Thanks

You can't refer to the row_num alias which you defined in the same level of the select in your query. So, your main option here would be to subquery, where row_num would be available. But, Postgres actually has an option to get what you want in another way. You could use DISTINCT ON here:
SELECT DISTINCT ON (device_id) device_id, tag_id, at, _deleted, data
FROM mdb_history.devices_tags_mapping_history
WHERE
at <= '2019-04-01' AND
_deleted = false AND
tag_id IN ('275674', '275673')
ORDER BY
device_id,
at DESC;

Too long / too formatted for a comment. There is a reason behind @TimBiegeleisen's statement about the "alias which you defined in the same level of the select". That reason is that all SQL statements follow the same sequence of evaluation. Unfortunately, that sequence does NOT follow the order in which the clauses are written in the statement. The sequence, in order, is:
from
where
group by
having
select
limits
You will notice that what actually gets selected falls well after evaluation of the where clause. Since your alias is defined within the select phase, it does not exist during the where phase.
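Given that evaluation order, the filter on row_num has to live one level above the select that defines it. A CTE is one way to write that which stays readable when more joins are involved. The sketch below uses Python's sqlite3 module as a stand-in for Postgres (an assumption for the demo; SQLite has no DISTINCT ON), with a cut-down, made-up history table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history (device_id INT, tag_id TEXT, at TEXT)")
conn.executemany(
    "INSERT INTO history VALUES (?, ?, ?)",
    [
        (1, "275674", "2019-03-01"),
        (1, "275673", "2019-03-20"),
        (2, "275674", "2019-02-10"),
        (2, "275673", "2019-04-15"),  # after the cutoff, filtered out
    ],
)

# row_num is computed inside the CTE, so it is an ordinary column by the
# time the outer WHERE clause is evaluated.
latest = conn.execute("""
    WITH ranked AS (
        SELECT device_id, tag_id, at,
               ROW_NUMBER() OVER (PARTITION BY device_id
                                  ORDER BY at DESC) AS row_num
        FROM history
        WHERE at <= '2019-04-01'
    )
    SELECT device_id, tag_id, at
    FROM ranked
    WHERE row_num = 1
    ORDER BY device_id
""").fetchall()
print(latest)
```

The inner WHERE still runs before the window function, so the cutoff date limits which rows get ranked, and the outer WHERE then keeps the top row per device.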

Postgres: Need to match records from two tables based on key value and earliest dates in each table

I'm dealing with a pretty unique record matching problem within postgres right now. Essentially I have a table (A) with a lot of records in it, including a key value that I need to match on and the date of the record. Then I have this other table (B) that I want to match the first table to on that key value. However, there can be multiple of the same 'key values' in both tables. To get around this I need to match the earliest key value from table A to the earliest key value in table B, the second earliest to the second earliest, and so on... However, if table B runs out of key value matches then I want to default to the latest key value match in A, even though something else already matched on it.
My initial thought is to use something like this on both tables:
ROW_NUMBER() OVER ( PARTITION BY key_value ORDER BY date) AS rank
And then join on the rank and key_value field. However, I'm not exactly sure how to get that default scenario to work with this method. And if records are added to one table and not the other and I try the join again, I feel like it might get out of sync.
My other thought was to use a cursor, but I'm really struggling to see how I'd implement that.
Any help would be greatly appreciated!

First you need to number all your rows, then find the ones with matching ranks.
After that, match the ones without a match to the latest date.
with cteA as (
SELECT *, ROW_NUMBER() OVER ( PARTITION BY key_value ORDER BY date) AS rank
FROM tableA
), cteB as (
SELECT *, ROW_NUMBER() OVER ( PARTITION BY key_value ORDER BY date) AS rank
FROM tableB
), ranked_match as (
-- list the columns explicitly: cteA.*, cteB.* would give duplicate column names
SELECT cteA.key_value, cteA.date AS a_date, cteA.rank, cteB.date AS b_date, cteB.rank AS b_rank
FROM cteA
LEFT JOIN cteB
ON cteA.key_value = cteB.key_value
AND cteA.rank = cteB.rank
), latest_row as (
SELECT *, ROW_NUMBER() OVER ( PARTITION BY key_value ORDER BY date DESC) AS rank
FROM tableB
)
-- rows from A that found a B row with the same rank
SELECT key_value, a_date, b_date
FROM ranked_match
WHERE b_rank IS NOT NULL
UNION ALL
-- leftover rows from A fall back to the latest B row for that key
SELECT ranked_match.key_value, ranked_match.a_date, latest_row.date
FROM ranked_match
JOIN latest_row
ON ranked_match.key_value = latest_row.key_value
WHERE ranked_match.b_rank IS NULL
AND latest_row.rank = 1
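A more compact variant of the same idea, offered only as a sketch: instead of a UNION, clamp A's rank to the number of rows B has for that key, so surplus rows in A land on B's latest row. This is a swapped-in technique rather than the query above. It runs on SQLite through Python's sqlite3 module for convenience; the table names table_a and table_b and the rows are invented, and in Postgres the two-argument MIN would be LEAST.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (key_value TEXT, date TEXT);
    CREATE TABLE table_b (key_value TEXT, date TEXT);
    INSERT INTO table_a VALUES
        ('k1', '2021-01-01'), ('k1', '2021-02-01'), ('k1', '2021-03-01');
    INSERT INTO table_b VALUES
        ('k1', '2021-01-05'), ('k1', '2021-02-05');
""")

pairs = conn.execute("""
    WITH a AS (
        SELECT key_value, date,
               ROW_NUMBER() OVER (PARTITION BY key_value ORDER BY date) AS rn
        FROM table_a
    ), b AS (
        SELECT key_value, date,
               ROW_NUMBER() OVER (PARTITION BY key_value ORDER BY date) AS rn,
               COUNT(*) OVER (PARTITION BY key_value) AS n
        FROM table_b
    )
    SELECT a.date, b.date
    FROM a
    JOIN b ON b.key_value = a.key_value
          AND b.rn = MIN(a.rn, b.n)  -- clamp: surplus A rows reuse B's last row
    ORDER BY a.date
""").fetchall()
print(pairs)
```

Because the ranks are recomputed on every run, adding rows to one table simply shifts the pairing the next time the query executes. Keys that exist only in A drop out of this inner join, so switch to a LEFT JOIN if those rows should be kept.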

Updating with Nested Select Statements

I have a table that holds 3 fields of data: Acct#, YMCode, and EmployeeID. The YMCode is an int formatted like 201308, 201307, etc. For each Acct#, I need to select the EmployeeID used for the YMCode 201308 and then update all of the other YMCodes for the Acct# to the EmployeeID used in 201308.
so for each customer account in the table...
Update MyTable
Set EmployeeID = EmployeeID used in YMCode 201308
Having a hard time with it.

Put it in a transaction and look at the results before committing, but I think this is what you want:
UPDATE b
SET EmployeeID = a.EmployeeID
FROM MyTable a
INNER JOIN MyTable b
ON a.[Acct#] = b.[Acct#]
where a.YMCode =
(SELECT MAX(YMCode) from MyTable)
To get the max YMCode, I just added a select statement at the end.
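The "look before committing" advice can be sketched like this with Python's sqlite3 module standing in for SQL Server (an assumption for the demo). SQLite spells the self-join as a correlated subquery, the column names acct, ym_code and employee_id are simplified stand-ins for the real ones, and the sketch assumes one 201308 row per account.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (acct INT, ym_code INT, employee_id INT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [
        (1, 201306, 7), (1, 201307, 8), (1, 201308, 9),
        (2, 201307, 3), (2, 201308, 4),
    ],
)
conn.commit()

# sqlite3 opens a transaction implicitly before the UPDATE, so nothing is
# permanent until commit() is called.
conn.execute("""
    UPDATE my_table
    SET employee_id = (SELECT src.employee_id
                       FROM my_table AS src
                       WHERE src.acct = my_table.acct
                         AND src.ym_code = 201308)
    WHERE EXISTS (SELECT 1
                  FROM my_table AS src
                  WHERE src.acct = my_table.acct
                    AND src.ym_code = 201308)
""")
preview = conn.execute(
    "SELECT acct, ym_code, employee_id FROM my_table ORDER BY acct, ym_code"
).fetchall()
print(preview)

# Inspect the preview, then call conn.commit() to keep the change
# or conn.rollback() to undo it.
conn.commit()
```

The EXISTS guard leaves accounts without a 201308 row untouched instead of setting their EmployeeID to NULL.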

Retrieving Representative Records for Unique Values of Single Column

For Postgresql 8.x, I have an answers table containing (id, user_id, question_id, choice) where choice is a string value. I need a query that will return a set of records (all columns returned) for all unique choice values. What I'm looking for is a single representative record for each unique choice. I also want to have an aggregate votes column that is a count() of the number of records matching each unique choice accompanying each record. I want to force choice to lowercase for this comparison to be made (HeLLo and Hello should be considered equal). I can't GROUP BY lower(choice) because I want all columns in the result-set. Grouping by all columns causes all records to return, including all duplicates.
1. Closest I've gotten
select lower(choice), count(choice) as votes from answers where question_id = 21 group by lower(choice) order by votes desc;
The issue with this is it will not return all columns.
lower | votes
-----------------------------------------------+-------
dancing in the moonlight | 8
pumped up kicks | 7
party rock anthem | 6
sexy and i know it | 5
moves like jagger | 4
2. Trying with all columns
select *, count(choice) as votes from answers where question_id = 21 group by lower(choice) order by votes desc;
Because I am not specifying every column from the SELECT in my GROUP BY, this throws an error telling me to do so.
3. Specifying all columns in the GROUP BY
select *, count(choice) as votes from answers where question_id = 21 group by lower(choice), id, user_id, question_id, choice order by votes desc;
This simply dumps the table with votes column as 1 for all records.
How can I get the vote count and unique representative records from 1., but with all columns from the table returned?

Join the grouped results back to the primary table, then show only one row for each (question_id, choice) combination.
Similar to this:
WITH top5 AS (
select question_id, lower(choice) as choice, count(*) as votes
from answers
where question_id = 21
group by question_id, lower(choice)
order by count(*) desc
limit 5
)
SELECT DISTINCT ON (top5.question_id, top5.choice) answers.*, top5.votes
FROM top5
JOIN answers ON answers.question_id = top5.question_id
AND lower(answers.choice) = top5.choice
ORDER BY top5.question_id, top5.choice, answers.id;

Here's what I ended up with:
SELECT answers.*, cc.votes as votes FROM answers join (
select max(id) as id, count(id) as votes
from answers
group by trim(lower(choice))
) cc
on answers.id = cc.id ORDER BY votes desc, lower(choice) asc
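For anyone who wants to poke at this, here is the same join-back idea as a small runnable sketch. It uses Python's sqlite3 module rather than Postgres (an assumption for the demo), the sample answers are invented, and a question_id filter is added inside the grouped subquery so the counts cover a single question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE answers ("
    "id INTEGER PRIMARY KEY, user_id INT, question_id INT, choice TEXT)"
)
conn.executemany(
    "INSERT INTO answers VALUES (?, ?, ?, ?)",
    [
        (1, 11, 21, "Hello"), (2, 12, 21, "HeLLo"), (3, 13, 21, "hello "),
        (4, 14, 21, "World"), (5, 15, 21, "world"),
    ],
)

# The grouped subquery picks one representative id per normalised choice
# and counts the votes; joining back on id recovers the full row.
rows = conn.execute("""
    SELECT answers.id, answers.choice, cc.votes
    FROM answers
    JOIN (SELECT MAX(id) AS id, COUNT(id) AS votes
          FROM answers
          WHERE question_id = 21
          GROUP BY TRIM(LOWER(choice))) AS cc
      ON answers.id = cc.id
    ORDER BY cc.votes DESC, LOWER(answers.choice)
""").fetchall()
print(rows)
```

The representative row is whichever one has the highest id in its group, so the returned choice keeps that row's original casing and whitespace.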
