Delete duplicate records (effectively) - postgresql

PostgreSQL 10.4
I have a table:
 Column |          Type
--------+------------------------
 id     | integer
 title  | character varying(200)
Indexes:
    "phrases_pkey" PRIMARY KEY, btree (id)
    "phrases_index" btree (title)
The content is as follows:
rinopt=# select count(distinct title) from phrases;
count
---------
9787866
(1 row)
rinopt=# select count(title) from phrases;
count
----------
13573099
(1 row)
I'd like to keep only distinct records:
delete from phrases where phrases.id not in (
select id from (
select distinct on (title) * from phrases
) as phrases_id
)
Well, this command has been running for 16 hours and I can't predict when it will finish.
Almost 14 million records is not a tiny database, but it is not unimaginable either. It seems that I have written a very inefficient select statement.
Could you tell me whether it is possible to write a more optimal command to clean duplicates?

A single subselect should be enough. You can delete every phrase for which another phrase with the same title and a lower id exists:
DELETE FROM phrases p WHERE EXISTS (
SELECT 1 FROM phrases p2 WHERE p.title = p2.title AND p.id > p2.id
);
A JOIN-like delete is also possible:
DELETE FROM phrases p USING phrases p2 WHERE p.title = p2.title AND p.id > p2.id;
Both statements should keep the phrase with the lowest id for each title.
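As a rough illustration of the join-style delete, here is a minimal sketch on a scratch table. The table name phrases_demo and its rows are made up for the example; only the shape of the DELETE comes from the answer.

```sql
-- Hypothetical scratch table standing in for "phrases".
CREATE TEMP TABLE phrases_demo (
    id    integer PRIMARY KEY,
    title varchar(200)
);

INSERT INTO phrases_demo VALUES
    (1, 'alpha'), (2, 'beta'), (3, 'alpha'), (4, 'alpha'), (5, 'gamma');

-- Delete each row that has a sibling with the same title and a lower id.
DELETE FROM phrases_demo p
USING phrases_demo p2
WHERE p.title = p2.title
  AND p.id > p2.id;

-- Remaining ids: 1, 2 and 5 (the lowest id for each title).
SELECT id, title FROM phrases_demo ORDER BY id;
```

Rows whose title is NULL are never matched by p.title = p2.title, so this form leaves them untouched.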

Related

delete duplicates in a table and update references

I have a table with an id column. We recently added a new field, unique_id, calculated from an external source, which made us realize that we actually have duplicates in the database:
Main Table
id | unique_id | ...
---|------------
4 | A |
5 | A
6 | B
We can see: 5 is actually a duplicate of 4, as they both have the same unique_id.
Now this needs to be cleaned up.
I sadly can not simply delete those duplicates (5), as other tables depend on it:
Other Table (OtherTable.main_id REFERENCES MainTable.id)
id | main_id | ...
---|------------
1 | 4 | Blah
2 | 5
3 | 6
To clean up the duplicates here, I would have to run:
UPDATE OtherTable SET main_id = 4 WHERE main_id = 5
How can I do that in an efficient update?
I tried to simply update every reference to the first row with the same unique_id, but that didn't complete in a day.
UPDATE "OtherTable" SET "main_id" = (SELECT "id" FROM "MainTable" WHERE "unique_id" = (SELECT "unique_id" FROM "MainTable" WHERE "id" = "OtherTable"."main_id") LIMIT 1)
If it helps, the MainTable contains about 750,000 entries, the OtherTable contains 12,000,000 rows.
That's probably because the triple-nested select is quite inefficient.
For the simple part, deleting the duplicates (once I have changed the references to point to the first row of its kind), I found this query to work swiftly enough:
DELETE FROM MainTable
WHERE id IN
(SELECT id
FROM
(SELECT id,
ROW_NUMBER() OVER( PARTITION BY unique_id
ORDER BY id ) AS row_num
FROM MainTable ) t
WHERE t.row_num > 1 );
However I need a way to update the references to the non-deleted ones of the duplicates.
Instead of UPDATE with a nested query, I'd suggest using UPDATE FROM for a join, and the same window function as in your DELETE statement:
UPDATE "OtherTable" AS other
SET main_id = main.min_id
FROM (SELECT
id,
first_value(id) OVER (PARTITION BY unique_id ORDER BY id) AS min_id
FROM "MainTable"
) AS main
WHERE main.id = other.main_id
AND main.id <> main.min_id
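To see the two steps together, here is a minimal sketch that replays them on the sample rows from the question. The table names main_demo and other_demo, the note column, and the surrounding transaction are assumptions made for the example, not part of the original schema.

```sql
-- Hypothetical scratch copies of MainTable and OtherTable.
CREATE TEMP TABLE main_demo (
    id        integer PRIMARY KEY,
    unique_id text
);
CREATE TEMP TABLE other_demo (
    id      integer PRIMARY KEY,
    main_id integer REFERENCES main_demo (id),
    note    text
);

INSERT INTO main_demo  VALUES (4, 'A'), (5, 'A'), (6, 'B');
INSERT INTO other_demo VALUES (1, 4, 'Blah'), (2, 5, NULL), (3, 6, NULL);

BEGIN;

-- Step 1: repoint references from each duplicate to the lowest id
-- that shares its unique_id.
UPDATE other_demo AS o
SET main_id = m.min_id
FROM (SELECT id,
             first_value(id) OVER (PARTITION BY unique_id ORDER BY id) AS min_id
      FROM main_demo) AS m
WHERE m.id = o.main_id
  AND m.id <> m.min_id;

-- Step 2: the duplicates are no longer referenced, so they can be deleted.
DELETE FROM main_demo
WHERE id IN (SELECT id
             FROM (SELECT id,
                          row_number() OVER (PARTITION BY unique_id ORDER BY id) AS row_num
                   FROM main_demo) t
             WHERE t.row_num > 1);

COMMIT;

-- other_demo now holds (1, 4), (2, 4), (3, 6); main_demo holds ids 4 and 6.
SELECT id, main_id FROM other_demo ORDER BY id;
```

Running both statements in one transaction is a design choice for the sketch: other sessions then never see references repointed while the duplicates still exist.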

How can I SUM distinct records in a Postgres database where there are duplicate records?

Imagine a table that looks like this:
The SQL to get this data was just SELECT *
The first column is "row_id" the second is "id" - which is the order ID and the third is "total" - which is the revenue.
I'm not sure why there are duplicate rows in the database, but when I do a SUM(total), it includes the second entry even though the order ID is the same. That makes my numbers larger than if I select distinct(id), total, export to Excel and then sum the values manually.
So my question is: how can I SUM over just the distinct order IDs, so that I get the same revenue as if I had exported every distinct order ID row to Excel?
Thanks in advance!
Easy - just divide by the count:
select id, sum(total) / count(id)
from orders
group by id
See live demo.
This also handles any level of duplication, e.g. triplicates.
You can try something like this (with your example):
Table
create table test (
row_id int,
id int,
total decimal(15,2)
);
insert into test values
(6395, 1509, 112), (22986, 1509, 112),
(1393, 3284, 40.37), (24360, 3284, 40.37);
Query
with distinct_records as (
select distinct id, total from test
)
select a.id, b.actual_total, array_agg(a.row_id) as row_ids
from test a
inner join (select id, sum(total) as actual_total from distinct_records group by id) b
on a.id = b.id
group by a.id, b.actual_total
Result
| id | actual_total | row_ids |
|------|--------------|------------|
| 1509 | 112 | 6395,22986 |
| 3284 | 40.37 | 1393,24360 |
Explanation
We do not know the reason why orders and totals appear more than once with different row_id values. So, using a common table expression (CTE) introduced by the with ... clause, we get the distinct id and total.
Under the CTE, we use this distinct data to do totaling. We join ID in the original table with the aggregation over distinct values. Then we comma-separate row_ids so that the information looks cleaner.
SQLFiddle example
http://sqlfiddle.com/#!15/72639/3
Create custom aggregate:
CREATE OR REPLACE FUNCTION sum_func (
double precision, pg_catalog.anyelement, double precision
)
RETURNS double precision AS
$body$
SELECT case when $3 is not null then COALESCE($1, 0) + $3 else $1 end
$body$
LANGUAGE 'sql';
CREATE AGGREGATE dist_sum (
pg_catalog."any",
double precision)
(
SFUNC = sum_func,
STYPE = float8
);
And then calc distinct sum like:
select dist_sum(distinct id, total)
from orders
SQLFiddle
You can use DISTINCT in your aggregate functions:
SELECT id, SUM(DISTINCT total) FROM orders GROUP BY id
Documentation here: https://www.postgresql.org/docs/9.6/static/sql-expressions.html#SYNTAX-AGGREGATES
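One thing that may be worth checking before relying on SUM(DISTINCT total): it deduplicates on the total value, not on the order. The sketch below uses the sample rows from this thread plus one made-up order (id 4000, which happens to have the same total as order 1509) to show the difference; the table name orders_demo is also an assumption.

```sql
-- Hypothetical scratch table; the last row is invented for the example.
CREATE TEMP TABLE orders_demo (
    row_id int,
    id     int,
    total  decimal(15,2)
);

INSERT INTO orders_demo VALUES
    (6395, 1509, 112),   (22986, 1509, 112),
    (1393, 3284, 40.37), (24360, 3284, 40.37),
    (30001, 4000, 112);

-- Per order, DISTINCT inside the aggregate behaves as intended:
-- 1509 -> 112.00, 3284 -> 40.37, 4000 -> 112.00
SELECT id, SUM(DISTINCT total) AS order_total
FROM orders_demo
GROUP BY id
ORDER BY id;

-- Without GROUP BY, equal totals from different orders collapse:
-- 112.00 + 40.37 = 152.37, so order 4000 is lost.
SELECT SUM(DISTINCT total) AS collapsed_total FROM orders_demo;

-- Deduplicating on (id, total) first keeps every order:
-- 112.00 + 40.37 + 112.00 = 264.37
SELECT SUM(total) AS grand_total
FROM (SELECT DISTINCT id, total FROM orders_demo) AS d;
```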
If we can trust that every row for an order carries that order's full total, we can eliminate the duplicates in a sub-query by selecting the MAX of the PK id column for each order. An example:
CREATE TABLE test2 (id int, order_id int, total int);
insert into test2 values (1,1,50);
insert into test2 values (2,1,50);
insert into test2 values (5,1,50);
insert into test2 values (3,2,100);
insert into test2 values (4,2,100);
select order_id, sum(total)
from test2 t
join (
select max(id) as id
from test2
group by order_id) as sq
on t.id = sq.id
group by order_id
sql fiddle
In difficult cases:
select
id,
(
SELECT SUM(value::int4)
FROM jsonb_each_text(jsonb_object_agg(row_id, total))
) as total
from orders
group by id
I would suggest just using a sub-query:
SELECT "a"."id", SUM("a"."total")
FROM (SELECT DISTINCT ON ("id") * FROM "Database"."Schema"."Table") AS "a"
GROUP BY "a"."id"
The above gives you the total for each id.
Use the query below if you want the grand total with duplicates removed:
SELECT SUM("a"."total")
FROM (SELECT DISTINCT ON ("id") * FROM "Database"."Schema"."Table") AS "a"
Using subselect (http://sqlfiddle.com/#!7/cef1c/51):
select sum(total) from (
select distinct id, total
from orders
) as distinct_records
Using CTE (http://sqlfiddle.com/#!7/cef1c/53):
with distinct_records as (
select distinct id, total from orders
)
select sum(total) from distinct_records;

Retrieving Representative Records for Unique Values of Single Column

For Postgresql 8.x, I have an answers table containing (id, user_id, question_id, choice) where choice is a string value. I need a query that will return a set of records (all columns returned) for all unique choice values. What I'm looking for is a single representative record for each unique choice. I also want to have an aggregate votes column that is a count() of the number of records matching each unique choice accompanying each record. I want to force choice to lowercase for this comparison to be made (HeLLo and Hello should be considered equal). I can't GROUP BY lower(choice) because I want all columns in the result-set. Grouping by all columns causes all records to return, including all duplicates.
1. Closest I've gotten
select lower(choice), count(choice) as votes from answers where question_id = 21 group by lower(choice) order by votes desc;
The issue with this is it will not return all columns.
lower | votes
-----------------------------------------------+-------
dancing in the moonlight | 8
pumped up kicks | 7
party rock anthem | 6
sexy and i know it | 5
moves like jagger | 4
2. Trying with all columns
select *, count(choice) as votes from answers where question_id = 21 group by lower(choice) order by votes desc;
Because I am not specifying every column from the SELECT in my GROUP BY, this throws an error telling me to do so.
3. Specifying all columns in the GROUP BY
select *, count(choice) as votes from answers where question_id = 21 group by lower(choice), id, user_id, question_id, choice order by votes desc;
This simply dumps the table with votes column as 1 for all records.
How can I get the vote count and unique representative records from 1., but with all columns from the table returned?
Join the grouped results back to the main table, then keep only one row for each (question, choice) combination,
similar to this:
WITH top5 AS (
select question_id, lower(choice) as choice, count(*) as votes
from answers
where question_id = 21
group by question_id, lower(choice)
order by count(*) desc
limit 5
)
SELECT DISTINCT ON (top5.question_id, top5.choice) answers.*, top5.votes
FROM top5
JOIN answers ON answers.question_id = top5.question_id
AND lower(answers.choice) = top5.choice
ORDER BY top5.question_id, top5.choice, answers.id;
Here's what I ended up with:
SELECT answers.*, cc.votes as votes FROM answers join (
select max(id) as id, count(id) as votes
from answers
group by trim(lower(choice))
) cc
on answers.id = cc.id ORDER BY votes desc, lower(response) asc
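For comparison, a window function can attach the vote count to every row before DISTINCT ON picks one representative for each lowercased choice. This is only a sketch: it assumes PostgreSQL 8.4 or later (where window functions are available), and the table name answers_demo and its rows are made up for the example.

```sql
-- Hypothetical scratch table shaped like "answers".
CREATE TEMP TABLE answers_demo (
    id          serial PRIMARY KEY,
    user_id     int,
    question_id int,
    choice      text
);

INSERT INTO answers_demo (user_id, question_id, choice) VALUES
    (1, 21, 'Pumped Up Kicks'),
    (2, 21, 'pumped up kicks'),
    (3, 21, 'Moves Like Jagger'),
    (4, 22, 'pumped up kicks');

SELECT *
FROM (
    -- count(*) OVER is evaluated before DISTINCT ON discards rows,
    -- so votes still counts every answer in the group.
    SELECT DISTINCT ON (lower(choice))
           a.*,
           count(*) OVER (PARTITION BY lower(choice)) AS votes
    FROM answers_demo a
    WHERE question_id = 21
    ORDER BY lower(choice), id      -- lowest id becomes the representative
) AS reps
ORDER BY votes DESC, lower(choice);

-- Expected rows: id 1 ('Pumped Up Kicks', votes 2), then id 3 ('Moves Like Jagger', votes 1).
```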

SQL Server 2008 De-duping

Long story short, I took over a project and a table in the database is in serious need of de-duping. The table looks like this:
supply_req_id | int | [primary key]
supply_req_dt | datetime |
request_id | int | [foreign key]
supply_id | int | [foreign key]
is_disabled | bit |
The duplication exists among records having the same request_id and supply_id. I'd like to find a best-practice way to de-dupe this table.
[EDIT]
@Kirk_Broadhurst, thanks for the question. Since supply_req_id is not referenced anywhere else, I would answer by saying keep the first, delete any subsequent occurrences.
Happy Holidays
This assigns a row number to each row within its (request_id, supply_id) group, starting with 1 = lowest supply_req_id. Any dupe has a value > 1
;WITH cDupes AS
(
SELECT
supply_req_id,
ROW_NUMBER() OVER (PARTITION BY request_id, supply_id ORDER BY supply_req_id) AS RowNum
FROM
MyTable
)
DELETE
cDupes
WHERE
RowNum > 1
Then add a unique constraint or INDEX
CREATE UNIQUE INDEX IXU_NoDupes ON MyTable (request_id, supply_id)
Seems like there should be a command for this, but maybe that's because I'm used to a different database server. Here's the relevant support doc:
How to remove duplicate rows from a table in SQL Server
http://support.microsoft.com/kb/139444
You need to clarify your rule for determining which record to keep in the case of a 'match': the most recent, the earliest, the one that has is_disabled true, or false?
Once you've identified that rule, the rest is fairly simple:
1. Select the records you want to keep (the distinct records).
2. Join back to the original table to get the ids for those records.
3. Delete everything that is not in the joined dataset.
So let's say you want to keep the most recent records of any 'duplicate' pair. Your query would look like this:
DELETE FROM [table] WHERE supply_req_id NOT IN
(SELECT supply_req_id from [table] t
INNER JOIN
(SELECT MAX(supply_req_dt) dt, request_id, supply_id
FROM [table]
GROUP BY request_id, supply_id) d
ON t.supply_req_dt = d.dt
AND t.request_id = d.request_id
AND t.supply_id = d.supply_id)
The catch is that if the supply_req_dt is also duplicated, then you'll be keeping both of the duplicates. The fix is to do another group by and select the top id
select MAX(supply_req_id), supply_req_dt, request_id, supply_id
from [table]
group by supply_req_dt, request_id, supply_id
as an interim step. But if you don't need to do that, don't bother with it.
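If the rule is "keep the most recent row, and break ties on the date by taking the highest supply_req_id", a single ROW_NUMBER() pass can cover both conditions at once. The following T-SQL sketch assumes that rule; the temp table #supply_req and its rows are made up for the example.

```sql
-- Hypothetical scratch table shaped like the one in the question (SQL Server).
CREATE TABLE #supply_req (
    supply_req_id int PRIMARY KEY,
    supply_req_dt datetime,
    request_id    int,
    supply_id     int,
    is_disabled   bit
);

INSERT INTO #supply_req VALUES
    (1, '20111201', 10, 100, 0),
    (2, '20111205', 10, 100, 0),
    (3, '20111205', 10, 100, 0),   -- same date as row 2: tie broken by id
    (4, '20111202', 11, 100, 0);

WITH ranked AS
(
    SELECT supply_req_id,
           ROW_NUMBER() OVER (PARTITION BY request_id, supply_id
                              ORDER BY supply_req_dt DESC, supply_req_id DESC) AS rn
    FROM #supply_req
)
DELETE FROM ranked
WHERE rn > 1;

-- Remaining rows: supply_req_id 3 and 4.
SELECT supply_req_id FROM #supply_req ORDER BY supply_req_id;
```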

PostgreSQL: Can't use DISTINCT for some data types

I have a table called _sample_table_delme_data_files which contains some duplicates. I want to copy its records, without duplicates, into data_files:
INSERT INTO data_files (SELECT distinct * FROM _sample_table_delme_data_files);
ERROR: could not identify an ordering operator for type box3d
HINT: Use an explicit ordering operator or modify the query.
Problem is, PostgreSQL can not compare (or order) box3d types. How do I supply such an ordering operator so I can get only the distinct into my destination table?
Thanks in advance,
Adam
If you don't add the operator, you could try translating the box3d data to text using its output function, something like:
INSERT INTO data_files (SELECT distinct othercols,box3dout(box3dcol) FROM _sample_table_delme_data_files);
Edit The next step is: cast it back to box3d:
INSERT INTO data_files SELECT othercols, box3din(b) FROM (SELECT distinct othercols,box3dout(box3dcol) AS b FROM _sample_table_delme_data_files) AS s;
(I don't have box3d on my system so it's untested.)
The datatype box3d doesn't have an operator for the DISTINCT-operation. You have to create the operator, or ask the PostGIS-project, maybe somebody has already fixed this problem.
Finally, this was solved by a colleague.
Let's see how many rows there are, duplicates included:
SELECT COUNT(*) FROM _sample_table_delme_data_files ;
count
-------
12728
(1 row)
Now, we shall add another column to the source table to help us differentiate similar rows:
ALTER TABLE _sample_table_delme_data_files ADD COLUMN id2 serial;
We can now see the dups:
SELECT id, id2 FROM _sample_table_delme_data_files ORDER BY id LIMIT 10;
id | id2
--------+------
198748 | 6449
198748 | 85
198801 | 166
198801 | 6530
198829 | 87
198829 | 6451
198926 | 88
198926 | 6452
199062 | 6532
199062 | 168
(10 rows)
And remove them:
DELETE FROM _sample_table_delme_data_files
WHERE id2 IN (SELECT max(id2) FROM _sample_table_delme_data_files
GROUP BY id
HAVING COUNT(*)>1);
Let's check that it worked:
SELECT id FROM _sample_table_delme_data_files GROUP BY id HAVING COUNT(*)>1;
id
----
(0 rows)
Remove the auxiliary column:
ALTER TABLE _sample_table_delme_data_files DROP COLUMN id2;
ALTER TABLE
Insert the remaining rows into the destination table:
INSERT INTO data_files (SELECT * FROM _sample_table_delme_data_files);
INSERT 0 6364
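If rows that share an id can be assumed to be complete duplicates (which the steps above also rely on), DISTINCT ON (id) may be a shorter route, because it only needs to compare the id column and never the box3d column. The sketch below is untested against PostGIS: it uses the built-in point type as a stand-in for box3d, on the assumption that point likewise cannot be used with a plain SELECT DISTINCT. The table names src_demo and dst_demo are hypothetical.

```sql
-- Hypothetical scratch tables; "point" stands in for box3d.
CREATE TEMP TABLE src_demo (id integer, geom point);
CREATE TEMP TABLE dst_demo (id integer PRIMARY KEY, geom point);

INSERT INTO src_demo VALUES
    (198748, point(1, 2)),
    (198748, point(1, 2)),
    (198801, point(3, 4));

-- SELECT DISTINCT * FROM src_demo;
-- is expected to fail here, since DISTINCT would have to compare the point column.

-- DISTINCT ON (id) keeps one row for each id and only compares id values.
INSERT INTO dst_demo
SELECT DISTINCT ON (id) *
FROM src_demo
ORDER BY id;

-- Two rows remain, one for each id.
SELECT count(*) FROM dst_demo;
```

Which of the duplicate rows is kept for each id is arbitrary here, which is harmless only under the stated assumption that they are identical.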