Fixing duplicate rows to adhere to constraint - postgresql

Is there a way to force existing rows to be unique on a column before adding a unique constraint? I am adding this constraint to my db:
create unique index customfields_name_org_id_key
on CustomFields(name, org_id) where deleted is false;
But would like to first find all cases where this constraint wouldn't be met, and add 1 to the name of one of the rows (adding higher numbers if there are more than 2 colliding rows). So, for example,
SELECT name, org_id, deleted,
row_number() OVER (PARTITION BY name, deleted ORDER BY org_id) AS rnum
FROM customfields ORDER BY org_id;
gives me
name | org_id | deleted | rnum
-----------+--------+---------+------
Another | 1 | f | 1
Bad email | 1 | t | 1
Dog? | 1 | f | 1
New | 1 | f | 1
New | 1 | f | 2
New | 1 | t | 1
New field | 1 | t | 1
and I would like
New | 1 | f | 2
To be renamed "New2"
I have written this code:
update CustomFields
set name = case
when cf.rnum = 1 or cf.deleted
then cf.name
else cf.name || rnum
end
from (SELECT name, org_id, deleted,
row_number() OVER (PARTITION BY name, deleted ORDER BY org_id) AS rnum
FROM customfields ORDER BY org_id) as cf;
But it just takes the first row from the select and renames all the names to "Another". How do I alter this code so that the update works on the corresponding rows in cf?
Sample code: https://www.db-fiddle.com/#&togetherjs=iKSKze0tGm

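You can number the rows in a CTE and then update each row from its own numbered entry with a correlated subquery. Your original UPDATE had no condition linking CustomFields to cf, which is why every row received the same value. Partitioning by org_id as well keeps the numbering in line with the (name, org_id) index:
WITH cte AS (
SELECT *, row_number() OVER (PARTITION BY name, org_id, deleted ORDER BY id) AS rnum
FROM customfields
)
UPDATE CustomFields
SET name = (SELECT case when cf.rnum = 1 or cf.deleted then cf.name
else cf.name || cf.rnum end
FROM cte cf WHERE cf.id = CustomFields.id);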
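To try the renaming on a small data set, here is a minimal sketch in Python using the standard sqlite3 module. An in-memory SQLite database stands in for PostgreSQL (an assumption so the example runs without a server; it needs SQLite 3.25 or later for window functions). The sample rows, including the id values and the org 2 row, are made up for illustration. The numbering is stored in a temporary table first, so the UPDATE never reads rows it has already renamed.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for PostgreSQL; rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customfields (
        id INTEGER PRIMARY KEY, name TEXT, org_id INTEGER, deleted INTEGER);
    INSERT INTO customfields (id, name, org_id, deleted) VALUES
        (1, 'Another', 1, 0),
        (2, 'New',     1, 0),
        (3, 'New',     1, 0),
        (4, 'New',     1, 1),
        (5, 'New',     2, 0),
        (6, 'New',     1, 0);

    -- Number the rows inside each (name, org_id, deleted) group once.
    CREATE TEMP TABLE numbered AS
    SELECT id,
           row_number() OVER (PARTITION BY name, org_id, deleted ORDER BY id) AS rnum
    FROM customfields;

    -- Append the number to every non-deleted row after the first of its group.
    UPDATE customfields
    SET name = name || (SELECT rnum FROM numbered n WHERE n.id = customfields.id)
    WHERE deleted = 0
      AND id IN (SELECT id FROM numbered WHERE rnum > 1);
""")

renamed = conn.execute("SELECT id, name FROM customfields ORDER BY id").fetchall()
print(renamed)
```

The sketch does not check whether a generated name such as "New2" already exists in the same org, so it is worth running the duplicate check again before creating the index.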

Related

Finding duplicate records posted within a lapse of time, in PostgreSQL

I'm trying to find duplicate rows in a large database (300,000 records). Here's an example of how it looks:
| id | title | thedate |
|----|---------|------------|
| 1 | Title 1 | 2021-01-01 |
| 2 | Title 2 | 2020-12-24 |
| 3 | Title 3 | 2021-02-14 |
| 4 | Title 2 | 2021-05-01 |
| 5 | Title 1 | 2021-01-13 |
I found this excellent (i.e. fast) answer here: Find duplicate rows with PostgreSQL
-- adapted from @MatthewJ answering in https://stackoverflow.com/questions/14471179/find-duplicate-rows-with-postgresql/14471928#14471928
select * from (
SELECT id, title, TO_DATE(thedate,'YYYY-MM-DD'),
ROW_NUMBER() OVER(PARTITION BY title ORDER BY id asc) AS Row
FROM table1
) dups
where
dups.Row > 1
Which I'm trying to use as a base to solve my specific problem: I need to find duplicates according to column values like in the example, but only for records posted within 15 days of each other (the date of record insertion in the column "thedate" in my DB).
I reproduced it in this fiddle http://sqlfiddle.com/#!15/ae109/2, where id 5 (same title as id 1, and posted within 15 days of each other) should be the only acceptable answer.
How would I implement that condition in the query?
With the LAG function you can get the date from the previous row with the same title and then filter based on the time difference.
WITH with_prev AS (
SELECT
*,
LAG(thedate, 1) OVER (PARTITION BY title ORDER BY thedate) AS prev_date
FROM table1
)
SELECT id, title, thedate
FROM with_prev
WHERE thedate::timestamp - prev_date::timestamp < INTERVAL '15 days'
You don't necessarily need window functions for this; you can use a plain old self-join, like:
select p.id, p.thedate, n.id, n.thedate, p.title
from table1 p
join table1 n on p.title = n.title and p.thedate < n.thedate
where n.thedate::date - p.thedate::date < 15
http://sqlfiddle.com/#!15/a3a73a/7
This has the advantage that it might use some of your indexes on the table, and also, you can decide if you want to use the data (i.e. the ID) of the previous row or the next row from each pair.
If your date column however is not unique, you'll need to be a little more specific in your join condition, like:
select p.id, p.thedate, n.id, n.thedate, p.title
from table1 p
join table1 n on p.title = n.title and p.thedate <= n.thedate and p.id <> n.id
where n.thedate::date - p.thedate::date < 15
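To check the 15-day condition against the sample rows from the question, here is a minimal sketch in Python with the standard sqlite3 module. SQLite stands in for PostgreSQL (an assumption; window functions need SQLite 3.25 or later), and julianday() replaces the PostgreSQL date subtraction. It follows the LAG approach: each row is compared with the previous row of the same title.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, title TEXT, thedate TEXT);
    INSERT INTO table1 VALUES
        (1, 'Title 1', '2021-01-01'),
        (2, 'Title 2', '2020-12-24'),
        (3, 'Title 3', '2021-02-14'),
        (4, 'Title 2', '2021-05-01'),
        (5, 'Title 1', '2021-01-13');
""")

# Rows with no earlier row of the same title get a NULL prev_date,
# so the comparison is NULL and they are filtered out.
close_duplicates = conn.execute("""
    WITH with_prev AS (
        SELECT id, title, thedate,
               LAG(thedate) OVER (PARTITION BY title ORDER BY thedate, id) AS prev_date
        FROM table1
    )
    SELECT id, title, thedate
    FROM with_prev
    WHERE julianday(thedate) - julianday(prev_date) < 15
    ORDER BY id
""").fetchall()
print(close_duplicates)
```

Only id 5 is returned: it is 12 days after id 1, while the two "Title 2" rows are months apart.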

Delete Duplicate Data on PostgreSQL

How can I delete duplicate rows from a table that contains data like the following?
I want to keep only the row with the latest created_at for each attribute id.
The data looks as follows:
attribute id | created at | product_id
1 | 2020-04-28 15:31:11 | 112235
4 | 2020-04-28 15:30:25 | 112235
1 | 2020-04-29 15:30:25 | 112236
4 | 2020-04-29 15:30:25 | 112236
You can use an EXISTS condition.
delete from the_table t1
where exists (select *
from the_table t2
where t2.created_at > t1.created_at
and t2.attribute_id = t1.attribute_id);
This deletes every row for which another row with the same attribute_id and a later created_at exists, so only the row with the highest created_at survives for each attribute_id. Note that if several rows tie for the highest created_at of an attribute_id, all of the tied rows are kept.
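A minimal sketch of that DELETE in Python with the standard sqlite3 module, using SQLite in place of PostgreSQL (an assumption so it runs standalone). The first four rows come from the question; the two rows for attribute 9 are hypothetical and were added only to show what happens on a tie.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE the_table (attribute_id INTEGER, created_at TEXT, product_id INTEGER);
    INSERT INTO the_table VALUES
        (1, '2020-04-28 15:31:11', 112235),
        (4, '2020-04-28 15:30:25', 112235),
        (1, '2020-04-29 15:30:25', 112236),
        (4, '2020-04-29 15:30:25', 112236),
        (9, '2020-04-30 10:00:00', 112237),  -- hypothetical tie
        (9, '2020-04-30 10:00:00', 112238);  -- hypothetical tie
""")

# Delete a row when a newer row exists for the same attribute_id.
# ISO-formatted timestamps compare correctly as text.
conn.execute("""
    DELETE FROM the_table
    WHERE EXISTS (SELECT 1
                  FROM the_table AS t2
                  WHERE t2.attribute_id = the_table.attribute_id
                    AND t2.created_at > the_table.created_at)
""")

kept = conn.execute(
    "SELECT attribute_id, created_at, product_id FROM the_table "
    "ORDER BY attribute_id, product_id"
).fetchall()
print(kept)
```

Attributes 1 and 4 each keep their newest row, while both tied rows for attribute 9 remain, so a tie-breaker column would be needed if exactly one row per attribute is required.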

Fetch records with distinct value of one column while replacing another col's value when multiple records

I have two tables that I need to join on rid, returning one row per rid and replacing the code value when a rid has different codes across multiple rows. It is better explained with the example set below.
CREATE TABLE usr (rid INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(12) NOT NULL,
email VARCHAR(20) NOT NULL);
CREATE TABLE usr_loc
(rid INT NOT NULL,
code CHAR NOT NULL,
loc_id INT NOT NULL,
PRIMARY KEY (rid, code, loc_id));
INSERT INTO usr VALUES
(1,'John','john@product'),
(2,'Linda','linda@product'),
(3,'Greg','greg@product'),
(4,'Kate','kate@product'),
(5,'Johny','johny@product'),
(6,'Mary','mary@test');
INSERT INTO usr_loc VALUES
(1,'A',4532),
(1,'I',4538),
(1,'I',4545),
(2,'I',3123),
(3,'A',4512),
(3,'A',4527),
(4,'I',4567),
(4,'A',4565),
(5,'I',4512),
(6,'I',4567),
(6,'I',4569);
Required Result Set
+-----+-------+------+-----------------+
| rid | name  | Code | email           |
+-----+-------+------+-----------------+
| 1   | John  | B    | 'john@product'  |
| 2   | Linda | I    | 'linda@product' |
| 3   | Greg  | A    | 'greg@product'  |
| 4   | Kate  | B    | 'kate@product'  |
| 5   | Johny | I    | 'johny@product' |
| 6   | Mary  | I    | 'mary@test'     |
+-----+-------+------+-----------------+
I have tried some queries that join and some that count, but I can't find one that satisfies the whole scenario.
The query I came up with is
SELECT distinct(a.rid) as rid, a.name, a.email, 'B' as code
FROM usr a
JOIN usr_loc b ON a.rid=b.rid
WHERE a.rid IN (SELECT rid FROM usr_loc GROUP BY rid HAVING COUNT(*) > 1);
You need to group by user and count how many distinct codes each one has in usr_loc. If there is more than one, replace the code with B. Counting distinct codes rather than rows keeps Greg and Mary, who have several rows with the same code, at A and I as in your required result. See below:
select
rid,
name,
case when cnt > 1 then 'B' else min_code end as code,
email
from (
select u.rid, u.name, u.email, min(l.code) as min_code, count(distinct l.code) as cnt
from usr u
join usr_loc l on l.rid = u.rid
group by u.rid, u.name, u.email
) x;
Seems to me that you are using MySQL, rather than IBM DB2. Is that so?
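To check the grouping against the required result set, here is a minimal sketch in Python with the standard sqlite3 module. SQLite stands in for the asker's database (an assumption), the email column is left out to keep it short, and the query counts distinct codes per user so that users whose rows all share one code keep that code.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the original database;
# the email column is omitted for brevity.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE usr (rid INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE usr_loc (
        rid INTEGER NOT NULL, code TEXT NOT NULL, loc_id INTEGER NOT NULL,
        PRIMARY KEY (rid, code, loc_id));
    INSERT INTO usr VALUES
        (1,'John'), (2,'Linda'), (3,'Greg'), (4,'Kate'), (5,'Johny'), (6,'Mary');
    INSERT INTO usr_loc VALUES
        (1,'A',4532), (1,'I',4538), (1,'I',4545),
        (2,'I',3123),
        (3,'A',4512), (3,'A',4527),
        (4,'I',4567), (4,'A',4565),
        (5,'I',4512),
        (6,'I',4567), (6,'I',4569);
""")

# 'B' only when a user has more than one distinct code; otherwise the single code.
result = conn.execute("""
    SELECT u.rid, u.name,
           CASE WHEN COUNT(DISTINCT l.code) > 1 THEN 'B' ELSE MIN(l.code) END AS code
    FROM usr u
    JOIN usr_loc l ON l.rid = u.rid
    GROUP BY u.rid, u.name
    ORDER BY u.rid
""").fetchall()
print(result)
```

Greg (two 'A' rows) and Mary (two 'I' rows) keep their code, while John and Kate, who mix 'A' and 'I', get 'B'.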

removing duplicate rows and dependencies without cursor

I have a table that has a long list of duplicated items. I am working on a stored procedure to consolidate them all into one record. Each one of the duplicated items has a number of child tables that should either be deleted, or rekeyed to point at the resulting record. My table has an Id, but the ReadableIdentifier is the column I need to deduplicate.
Id | ReadableIdentifier | Name | UpdatedOn
1 | ABC1234 | Product X | 2014-04-25 16:00:08.000
2 | ABC1234 | Product X | 2014-04-28 16:00:08.000
3 | ABC1234 | Product X | 2014-04-21 16:00:08.000
4 | ABDD9945 | Widget R | 2014-04-25 16:00:08.000
5 | ABDD9945 | Widget R | 2014-04-25 18:45:08.000
As you can see, records 1-3 are duplicates with different Id's and UpdatedOn dates. Same for 4-5. I need to consolidate these into one record, preferring the one with the most recent UpdatedOn date.
End Goal (not showing children tables):
Id | ReadableIdentifier | Name | UpdatedOn
2 | ABC1234 | Product X | 2014-04-28 16:00:08.000
5 | ABDD9945 | Widget R | 2014-04-25 18:45:08.000
I am using a CURSOR to do this, but am wondering if there is a better solution.
DECLARE dupeCursor CURSOR
FAST_FORWARD
FOR
WITH Counts AS (
SELECT
COUNT(1) Count,
ReadableIdentifier
FROM dbo.Item WITH (NOLOCK)
WHERE ReadableIdentifier IS NOT NULL
GROUP BY ReadableIdentifier)
SELECT
Counts.Count,
Counts.ReadableIdentifier,
Counts.CompanyId
FROM
Counts
WHERE Counts.Count > 1;
OPEN dupeCursor;
DECLARE @readableId VARCHAR(50);
DECLARE @itemToPersistId INT, @itemToDeleteId INT;
FETCH NEXT FROM dupeCursor INTO @readableId;
WHILE @@FETCH_STATUS = 0
BEGIN
WITH V AS (
SELECT Id, ROW_NUMBER() OVER (PARTITION BY ReadableId ORDER BY UpdatedOn DESC) as Row
FROM dbo.Item WITH (NOLOCK) WHERE ReadableId = @readableId
)
SELECT @itemToPersistId = Id
FROM V
WHERE V.Row = 1
CREATE TABLE #itemsToDelete (Id UNIQUEIDENTIFIER)
INSERT INTO #itemsToDelete
SELECT Id
FROM dbo.Item WITH (NOLOCK)
WHERE ReadableId = @readableId AND Id != @itemToPersistId;
--UPDATE CHILDREN TABLES
DELETE FROM dbo.ItemDetails WHERE ItemId IN (SELECT Id FROM #itemsToDelete);
UPDATE dbo.ItemPurchases SET ItemId = @itemToPersistId
WHERE ItemId IN (SELECT Id FROM #itemsToDelete);
UPDATE dbo.PurchaseOrders SET ItemId = @itemToPersistId
WHERE ItemId IN (SELECT Id FROM #itemsToDelete);
DELETE FROM dbo.ItemMetadata WHERE ItemId IN (SELECT Id FROM #itemsToDelete);
--delete Duplicated Items
DELETE FROM dbo.Item WHERE Id IN (SELECT Id FROM #itemsToDelete);
DROP TABLE #itemsToDelete
FETCH NEXT FROM dupeCursor INTO @readableId;
END
CLOSE dupeCursor;
DEALLOCATE dupeCursor;
I realize the cursor is most likely the issue, but I'm not sure how to go about updating all of the child tables without using one.
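OK, I don't have data to test this against the child tables, but it should work. The idea is to number every item once within its ReadableIdentifier group (Row = 1 is the record to keep), store the result in a temp table, and then update or delete the child rows with set-based joins instead of looping:
WITH V
AS (SELECT *,
ROW_NUMBER() OVER(PARTITION BY ReadableIdentifier ORDER BY UpdatedOn DESC) AS Row
FROM dbo.Item WITH (NOLOCK))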
SELECT *
INTO #itemsToDelete
FROM V;
--UPDATE CHILDREN TABLES
DELETE FROM dbo.ItemDetails
WHERE ItemId IN
(
SELECT Id
FROM #itemsToDelete
WHERE Row > 1
);
UPDATE IP
SET
IP.ItemId = itk.ID
FROM dbo.ItemPurchases AS IP
INNER JOIN #itemsToDelete AS itd ON IP.ItemId = itd.ID
AND itd.Row > 1
INNER JOIN #itemsToDelete AS itk ON itk.ReadableIdentifier = itd.ReadableIdentifier
AND itk.Row = 1
AND itd.Row > 1;
UPDATE po
SET
po.ItemId = itk.ID
FROM dbo.PurchaseOrders AS po
INNER JOIN #itemsToDelete AS itd ON po.ItemId = itd.ID
AND itd.Row > 1
INNER JOIN #itemsToDelete AS itk ON itk.ReadableIdentifier = itd.ReadableIdentifier
AND itk.Row = 1
AND itd.Row > 1;
DELETE FROM dbo.ItemMetadata
WHERE ItemId IN
(
SELECT Id
FROM #itemsToDelete
WHERE Row > 1
);
--delete Duplicated Items
DELETE FROM dbo.Item
WHERE Id IN
(
SELECT Id
FROM #itemsToDelete
WHERE Row > 1
);
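The same set-based idea can be tried on the sample rows with a minimal sketch in Python and the standard sqlite3 module. This is SQLite, not SQL Server (an assumption so it runs standalone), it covers only one child table, and the PurchaseId column and the purchase rows are hypothetical. Instead of a row number, it stores the surviving Id next to every item, which makes the re-keying a simple lookup.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for SQL Server; child rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Item (
        Id INTEGER PRIMARY KEY, ReadableIdentifier TEXT, Name TEXT, UpdatedOn TEXT);
    CREATE TABLE ItemPurchases (PurchaseId INTEGER PRIMARY KEY, ItemId INTEGER);
    INSERT INTO Item VALUES
        (1, 'ABC1234',  'Product X', '2014-04-25 16:00:08.000'),
        (2, 'ABC1234',  'Product X', '2014-04-28 16:00:08.000'),
        (3, 'ABC1234',  'Product X', '2014-04-21 16:00:08.000'),
        (4, 'ABDD9945', 'Widget R',  '2014-04-25 16:00:08.000'),
        (5, 'ABDD9945', 'Widget R',  '2014-04-25 18:45:08.000');
    INSERT INTO ItemPurchases VALUES (10, 1), (11, 2), (12, 3), (13, 4);

    -- Map every item to the most recently updated item of its group.
    CREATE TEMP TABLE item_map AS
    SELECT Id,
           FIRST_VALUE(Id) OVER (PARTITION BY ReadableIdentifier
                                 ORDER BY UpdatedOn DESC, Id DESC) AS KeepId
    FROM Item;

    -- Re-key child rows that point at a duplicate.
    UPDATE ItemPurchases
    SET ItemId = (SELECT KeepId FROM item_map m WHERE m.Id = ItemPurchases.ItemId)
    WHERE ItemId IN (SELECT Id FROM item_map WHERE Id <> KeepId);

    -- Remove the duplicates themselves.
    DELETE FROM Item WHERE Id IN (SELECT Id FROM item_map WHERE Id <> KeepId);
""")

remaining_items = [row[0] for row in conn.execute("SELECT Id FROM Item ORDER BY Id")]
purchases = conn.execute(
    "SELECT PurchaseId, ItemId FROM ItemPurchases ORDER BY PurchaseId"
).fetchall()
print(remaining_items, purchases)
```

Items 2 and 5 survive, and every purchase now points at one of them. On a real system the statements would normally run inside one transaction so the child tables and dbo.Item change together.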

Update Count column in Postgresql

I have a single table laid out as such:
id | name | count
1 | John |
2 | Jim |
3 | John |
4 | Tim |
I need to fill out the count column such that the result is the number of times the specific name shows up in the column name.
The result should be:
id | name | count
1 | John | 2
2 | Jim | 1
3 | John | 2
4 | Tim | 1
I can get the count of occurrences of unique names easily using:
SELECT COUNT(name)
FROM table
GROUP BY name
But that doesn't fit into an UPDATE statement due to it returning multiple rows.
I can also get it narrowed down to a single row by doing this:
SELECT COUNT(name)
FROM table
WHERE name = 'John'
GROUP BY name
But that doesn't allow me to fill out the entire column, just the 'John' rows.
You can do that with a common table expression:
with counted as (
select name, count(*) as name_count
from the_table
group by name
)
update the_table
set "count" = c.name_count
from counted c
where c.name = the_table.name;
Another (slower) option would be to use a correlated subquery:
update the_table
set "count" = (select count(*)
from the_table t2
where t2.name = the_table.name);
But in general it is a bad idea to store values that can easily be calculated on the fly:
select id,
name,
count(*) over (partition by name) as name_count
from the_table;
Another method, using a derived table:
UPDATE tb
SET count = t.count
FROM (
SELECT count(NAME)
,NAME
FROM tb
GROUP BY 2
) t
WHERE t.NAME = tb.NAME
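A minimal sketch in Python with the standard sqlite3 module that fills the column on the sample rows and compares the result with the on-the-fly window query. SQLite stands in for PostgreSQL (an assumption; window functions need SQLite 3.25 or later), and the correlated subquery form is used because it works unchanged in both.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE the_table (id INTEGER PRIMARY KEY, name TEXT, "count" INTEGER);
    INSERT INTO the_table (id, name) VALUES
        (1, 'John'), (2, 'Jim'), (3, 'John'), (4, 'Tim');

    -- Store, for every row, how many rows share its name.
    UPDATE the_table
    SET "count" = (SELECT count(*)
                   FROM the_table t2
                   WHERE t2.name = the_table.name);
""")

stored = conn.execute('SELECT id, name, "count" FROM the_table ORDER BY id').fetchall()

# The same numbers computed on the fly, without storing anything.
on_the_fly = conn.execute("""
    SELECT id, name, count(*) OVER (PARTITION BY name) AS name_count
    FROM the_table
    ORDER BY id
""").fetchall()
print(stored)
print(on_the_fly)
```

Both queries return the same rows here, but the stored column goes stale as soon as a row is inserted, renamed, or deleted, which is the reason for preferring the window query.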