Delete Duplicate Data on PostgreSQL - postgresql

How do I delete duplicate data from a table that contains rows like the ones below?
I want to keep only the row with the latest created_at for each attribute id.
For example:
| attribute_id | created_at          | product_id |
|--------------|---------------------|------------|
| 1            | 2020-04-28 15:31:11 | 112235     |
| 4            | 2020-04-28 15:30:25 | 112235     |
| 1            | 2020-04-29 15:30:25 | 112236     |
| 4            | 2020-04-29 15:30:25 | 112236     |

You can use an EXISTS condition.
delete from the_table t1
where exists (select *
              from the_table t2
              where t2.created_at > t1.created_at
                and t2.attribute_id = t1.attribute_id);
This deletes every row for which another row with the same attribute_id and a greater created_at value exists, thus keeping only the row with the highest created_at for each attribute_id. Note that if two created_at values are identical, nothing is deleted for that attribute_id.
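If ties on created_at are possible and you still want to keep exactly one row per attribute_id, you can add a tie-breaker to the EXISTS condition. A sketch, assuming product_id is an acceptable tie-breaker for otherwise tied rows:
delete from the_table t1
where exists (select *
              from the_table t2
              where t2.attribute_id = t1.attribute_id
                and (t2.created_at > t1.created_at
                     or (t2.created_at = t1.created_at
                         and t2.product_id > t1.product_id)));
With ties, this keeps the tied row with the largest product_id instead of keeping both.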

Related

How can I ensure that a join table references two tables with composite FKs, one of the two columns being common to both tables?

I have 3 tables: employee, event, and, since these are N-N, a third table employee_event.
The trick is that they can only be related (N-N) within the same group.
employee
+----+-------+
| id | group |
+----+-------+
| 1  | A     |
| 2  | B     |
+----+-------+
event
+----+-------+
| id | group |
+----+-------+
| 43 | A     |
| 44 | B     |
+----+-------+
employee_event
+-------------+----------+
| employee_id | event_id |
+-------------+----------+
| 1           | 43       |
| 2           | 44       |
+-------------+----------+
So the combination employee_id=1, event_id=44 should not be possible, because an employee from group A cannot attend an event from group B. How can I enforce this in my DB?
My first idea is to add a column employee_event.group so that I can create two composite FKs, (employee_id, group) and (event_id, group), referencing employee and event respectively. But is there a way to avoid adding a column to the join table solely for the purpose of the FKs?
Thx!
You may create a function and use it as a check constraint on table employee_event.
create or replace function groups_match (employee_id integer, event_id integer)
returns boolean language sql as
$$
select
  (select "group" from employee where id = employee_id) =
  (select "group" from event where id = event_id);
$$;
and then add a check constraint on table employee_event.
ALTER TABLE employee_event
  ADD CONSTRAINT groups_match_check
  CHECK (groups_match(employee_id, event_id));
Still, bear in mind that rows in employee_event that used to be valid may become invalid, yet remain in place, if the relevant rows in employee or event are later changed.
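For comparison, the composite-FK approach sketched in the question would look roughly like this (the "group" column type and all constraint names below are assumptions). Unlike the CHECK function, the foreign keys keep employee_event consistent even when the parent rows change later:
-- the referenced column pairs must be unique in the parent tables
ALTER TABLE employee ADD CONSTRAINT employee_id_group_key UNIQUE (id, "group");
ALTER TABLE event    ADD CONSTRAINT event_id_group_key    UNIQUE (id, "group");

-- the extra column in the join table, redundant but required for the FKs
ALTER TABLE employee_event ADD COLUMN "group" text;

ALTER TABLE employee_event
  ADD CONSTRAINT employee_event_employee_fk
  FOREIGN KEY (employee_id, "group") REFERENCES employee (id, "group");

ALTER TABLE employee_event
  ADD CONSTRAINT employee_event_event_fk
  FOREIGN KEY (event_id, "group") REFERENCES event (id, "group");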

Finding duplicate records posted within a lapse of time, in PostgreSQL

I'm trying to find duplicate rows in a large database (300,000 records). Here's an example of how it looks:
| id | title | thedate |
|----|---------|------------|
| 1 | Title 1 | 2021-01-01 |
| 2 | Title 2 | 2020-12-24 |
| 3 | Title 3 | 2021-02-14 |
| 4 | Title 2 | 2021-05-01 |
| 5 | Title 1 | 2021-01-13 |
I found this excellent (i.e. fast) answer here: Find duplicate rows with PostgreSQL
-- adapted from #MatthewJ answering in https://stackoverflow.com/questions/14471179/find-duplicate-rows-with-postgresql/14471928#14471928
select * from (
  SELECT id, title, TO_DATE(thedate,'YYYY-MM-DD'),
         ROW_NUMBER() OVER(PARTITION BY title ORDER BY id asc) AS Row
  FROM table1
) dups
where dups.Row > 1
Which I'm trying to use as a base to solve my specific problem: I need to find duplicates according to column values like in the example, but only for records posted within 15 days of each other (the date of record insertion is stored in the column "thedate" in my DB).
I reproduced it in this fiddle http://sqlfiddle.com/#!15/ae109/2, where id 5 (same title as id 1, and posted within 15 days of each other) should be the only acceptable answer.
How would I implement that condition in the query?
With the LAG function you can get the date from the previous row with the same title and then filter based on the time difference.
WITH with_prev AS (
  SELECT *,
         LAG(thedate, 1) OVER (PARTITION BY title ORDER BY thedate) AS prev_date
  FROM table1
)
SELECT id, title, thedate
FROM with_prev
WHERE thedate::timestamp - prev_date::timestamp < INTERVAL '15 days'
You don't necessarily need window functions for this; you can use a plain old self-join, like:
select p.id, p.thedate, n.id, n.thedate, p.title
from table1 p
join table1 n on p.title = n.title and p.thedate < n.thedate
where n.thedate::date - p.thedate::date < 15
http://sqlfiddle.com/#!15/a3a73a/7
This has the advantage that it might use some of your indexes on the table, and also, you can decide if you want to use the data (i.e. the ID) of the previous row or the next row from each pair.
However, if your date column is not unique, you'll need to be a little more specific in your join condition, like:
select p.id, p.thedate, n.id, n.thedate, p.title
from table1 p
join table1 n on p.title = n.title and p.thedate <= n.thedate and p.id <> n.id
where n.thedate::date - p.thedate::date < 15
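If, as in the question, you only want the later row of each close pair (id 5 in the example), a small variation on the self-join above selects just the columns from n and removes repeated hits:
select distinct n.id, n.title, n.thedate
from table1 p
join table1 n on p.title = n.title and p.thedate < n.thedate
where n.thedate::date - p.thedate::date < 15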

remove duplicate records in postgres where all records are duplicate

My Postgres table model contains exact duplicate records, and I need to write a query to delete them.
id | model | model_id | dependent_on_model
-----+-------+----------+--------------------
1 | Card | 72 | Metric
1 | Card | 72 | Metric
2 | Card | 79 | Metric
2 | Card | 79 | Metric
3 | Card | 83 | Metric
3 | Card | 83 | Metric
5 | Card | 86 | Metric
Using a CTE is not helping, as I am getting the error
relation "cte" does not exist.
Please suggest a query that deletes the duplicate rows so that I am left with just the 4 distinct records.
My suggestion is to duplicate the table into a TEMPORARY TABLE created WITH OIDS. That way you have another id to distinguish two otherwise identical rows.
Idea:
Duplicate the data, with an extra ID, into a temporary table.
Remove duplicates in the temporary table.
Delete the rows from the actual table.
Copy the data back into the actual table from the temporary table.
Drop the TEMPORARY TABLE.
You'll have to perform some destructive action on your actual table so make sure your TEMPORARY TABLE has what you want remaining before deleting anything from your actual table.
This is how you would create the TEMPORARY TABLE:
CREATE TEMPORARY TABLE dups_with_oids
( id integer
, model text
, model_id integer
, dependent_on_model text
) WITH OIDS;
Here is the DELETE query:
WITH temp AS
(
  SELECT d.id   AS keep
       , d.oid  AS keep_oid
       , d2.id  AS del
       , d2.oid AS del_oid
  FROM dups_with_oids d
  JOIN dups_with_oids d2 ON (d.id = d2.id AND d.oid < d2.oid)
)
DELETE FROM dups_with_oids d
WHERE d.oid IN (SELECT temp.del_oid FROM temp);
SQLFiddle to prove the theory.
I should add that if id were a PRIMARY KEY or UNIQUE these duplicates wouldn't have been possible.
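On PostgreSQL 12 and later, WITH OIDS is no longer supported, so the same idea can be carried out by rebuilding the table from a de-duplicated copy instead. A minimal sketch, assuming the table is simply called model as in the question:
-- keep one copy of each distinct row in a temporary table
CREATE TEMPORARY TABLE model_dedup AS
SELECT DISTINCT id, model, model_id, dependent_on_model
FROM model;

-- empty the real table and reload it from the de-duplicated copy
TRUNCATE model;

INSERT INTO model (id, model, model_id, dependent_on_model)
SELECT id, model, model_id, dependent_on_model
FROM model_dedup;

DROP TABLE model_dedup;
As with the approach above, check the temporary table's contents before truncating anything.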

Fixing duplicate rows to adhere to constraint

Is there a way to force existing rows to be unique on a column before adding a unique constraint? I am adding this constraint to my db:
create unique index customfields_name_org_id_key
on CustomFields(name, org_id) where deleted is false;
But I would like to first find all cases where this constraint wouldn't be met and append 1 to the name of one of the rows (appending higher numbers if there are more than 2 colliding rows). So, for example,
SELECT name, org_id, deleted,
row_number() OVER (PARTITION BY name, deleted ORDER BY org_id) AS rnum
FROM customfields ORDER BY org_id;
gives me
name | org_id | deleted | rnum
-----------+--------+---------+------
Another | 1 | f | 1
Bad email | 1 | t | 1
Dog? | 1 | f | 1
New | 1 | f | 1
New | 1 | f | 2
New | 1 | t | 1
New field | 1 | t | 1
and I would like the row
New | 1 | f | 2
to be renamed "New2".
I have written this code:
update CustomFields
set name = case
             when cf.rnum = 1 or cf.deleted
             then cf.name
             else cf.name || rnum
           end
from (SELECT name, org_id, deleted,
             row_number() OVER (PARTITION BY name, deleted ORDER BY org_id) AS rnum
      FROM customfields ORDER BY org_id) as cf;
But it just takes the first row from the select and renames all the names to "Another". How do I alter this code so that the update works on the corresponding rows in cf?
Sample code: https://www.db-fiddle.com/#&togetherjs=iKSKze0tGm
You could update the table by using:
WITH cte AS (
  SELECT *, row_number() OVER (PARTITION BY name, deleted ORDER BY org_id) AS rnum
  FROM customfields
)
UPDATE CustomFields
SET name = (SELECT case when cf.rnum = 1 or cf.deleted then cf.name
                        else cf.name || rnum end
            FROM cte cf
            WHERE cf.id = CustomFields.id);
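A variant of the same idea (a sketch, relying on the same unique id column as above) that only touches the rows that actually need a new name:
WITH cte AS (
  SELECT id,
         row_number() OVER (PARTITION BY name, deleted ORDER BY org_id) AS rnum
  FROM customfields
)
UPDATE CustomFields
SET name = CustomFields.name || cte.rnum
FROM cte
WHERE cte.id = CustomFields.id
  AND cte.rnum > 1
  AND NOT CustomFields.deleted;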

PostgreSQL Group By not working as expected - wants too many inclusions

I have a simple PostgreSQL table that I'm trying to query. Imagine a table like this...
| ID | Account_ID | Iteration |
|----|------------|-----------|
| 1 | 100 | 1 |
| 2 | 101 | 1 |
| 3 | 100 | 2 |
I need to get the ID column for each Account_ID where Iteration is at its maximum value. So, you'd think something like this would work
SELECT "ID", "Account_ID", MAX("Iteration")
FROM "Table_Name"
GROUP BY "Account_ID"
And I expect to get:
| ID | Account_ID | MAX(Iteration) |
|----|------------|----------------|
| 2 | 101 | 1 |
| 3 | 100 | 2 |
But when I do this, Postgres complains:
ERROR: column "ID" must appear in the GROUP BY clause or be used in an aggregate function
Which, when I do that, just destroys the grouping altogether and gives me the whole table!
Is the best way to approach this using the following?
SELECT DISTINCT ON ("Account_ID") "ID", "Account_ID", "Iteration"
FROM "Marketing_Sparks"
ORDER BY "Account_ID" ASC, "Iteration" DESC;
The GROUP BY statement aggregates rows with the same values in the grouped columns into a single row. Because this row isn't the same as any single original row, you can't select a column that is neither in the GROUP BY nor inside an aggregate function. To get what you want, you will probably have to select without the ID column, then join the result back to the original table. I don't know PostgreSQL syntax, but I assume it would be something like the following.
SELECT Table_Name.ID, aggregate.Account_ID, aggregate.MIteration
FROM (SELECT Account_ID, MAX(Iteration) AS MIteration
      FROM Table_Name
      GROUP BY Account_ID) aggregate
LEFT JOIN Table_Name ON aggregate.Account_ID = Table_Name.Account_ID AND
                        aggregate.MIteration = Table_Name.Iteration
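For reference, in PostgreSQL's quoted-identifier syntax from the question, that join-back would look roughly like this ("MaxIteration" is just an alias chosen here):
SELECT t."ID", agg."Account_ID", agg."MaxIteration"
FROM (SELECT "Account_ID", MAX("Iteration") AS "MaxIteration"
      FROM "Table_Name"
      GROUP BY "Account_ID") agg
JOIN "Table_Name" t
  ON agg."Account_ID" = t."Account_ID"
 AND agg."MaxIteration" = t."Iteration";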