duplicate multi column entries postgresql - postgresql

I have a bunch of data in a postgresql database. I think that two keys should form a unique pair,
so want to enforce that in the database. I try
create unique index key1_key2_idx on table(key1,key2)
but that fails, telling me that I have duplicate entries.
How do I find these duplicate entries so I can delete them?

select key1,key2,count(*)
from table
group by key1,key2
having count(*) > 1
order by 3 desc;
The critical part of the query to determine the duplicates is having count(*) > 1.
There are a whole bunch of neat tricks at the following link, including some examples of removing duplicates: http://postgres.cz/wiki/PostgreSQL_SQL_Tricks
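To try the GROUP BY ... HAVING query without touching a real database, here is a minimal sketch in Python using the standard-library sqlite3 module. The table name items, its columns, and the helper find_duplicate_pairs are made up for illustration; SQLite only stands in for PostgreSQL here, though the query itself is the same.

```python
import sqlite3


def find_duplicate_pairs(conn, table, col_a, col_b):
    """Return (col_a, col_b, count) for every pair that occurs more than once."""
    # Identifiers cannot be bound as parameters, so they are interpolated;
    # only pass trusted table and column names to this hypothetical helper.
    sql = (
        f"SELECT {col_a}, {col_b}, COUNT(*) AS n "
        f"FROM {table} GROUP BY {col_a}, {col_b} "
        f"HAVING COUNT(*) > 1 ORDER BY n DESC, {col_a}, {col_b}"
    )
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (key1 TEXT, key2 TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [("a", "x", "1"), ("a", "x", "2"), ("a", "x", "3"),
     ("b", "y", "4"), ("b", "y", "5"), ("c", "z", "6")],
)

duplicates = find_duplicate_pairs(conn, "items", "key1", "key2")
print(duplicates)  # [('a', 'x', 3), ('b', 'y', 2)]
```

The pair ('c', 'z') occurs once, so HAVING filters it out.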

Assuming you only want to delete the duplicates and keep the original, the accepted answer is inaccurate -- it'll delete your originals as well and only keep records that have one entry from the start. This works on 9.x:
SELECT * FROM tblname WHERE ctid IN
(SELECT ctid FROM
(SELECT ctid, ROW_NUMBER() OVER
(partition BY col1, col2, col3 ORDER BY ctid) AS rnum
FROM tblname) t
WHERE t.rnum > 1);
https://wiki.postgresql.org/wiki/Deleting_duplicates
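A rough way to see what the ROW_NUMBER() approach selects is the sketch below, written in Python with the standard-library sqlite3 module. Two assumptions: SQLite's rowid stands in for PostgreSQL's ctid, and the bundled SQLite is 3.25 or newer (needed for window functions). The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblname (col1 TEXT, col2 TEXT, col3 TEXT)")
conn.executemany(
    "INSERT INTO tblname VALUES (?, ?, ?)",
    [("a", "b", "c"), ("a", "b", "c"), ("d", "e", "f"), ("a", "b", "c")],
)

# Number the rows inside each (col1, col2, col3) group by rowid;
# rnum > 1 marks every copy after the first one.
EXTRA_COPIES = """
    SELECT rid FROM (
        SELECT rowid AS rid,
               ROW_NUMBER() OVER (
                   PARTITION BY col1, col2, col3 ORDER BY rowid) AS rnum
        FROM tblname) t
    WHERE t.rnum > 1
"""

extra_rowids = sorted(row[0] for row in conn.execute(EXTRA_COPIES))

# The same subquery drives the delete, so only the extra copies go away.
conn.execute(f"DELETE FROM tblname WHERE rowid IN ({EXTRA_COPIES})")
remaining = conn.execute(
    "SELECT rowid, col1, col2, col3 FROM tblname ORDER BY rowid"
).fetchall()

print(extra_rowids)  # [2, 4]
print(remaining)     # [(1, 'a', 'b', 'c'), (3, 'd', 'e', 'f')]
```

Running the SELECT first and checking its output before swapping in DELETE is a cheap safety net.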

Related

How to find the distinct row of max value which have unique phone number [duplicate]

I have this table for documents (simplified version here):
id rev content
1 1 ...
2 1 ...
1 2 ...
1 3 ...
How do I select one row per id and only the greatest rev?
With the above data, the result should contain two rows: [1, 3, ...] and [2, 1, ...]. I'm using MySQL.
Currently I use checks in the while loop to detect and over-write old revs from the resultset. But is this the only method to achieve the result? Isn't there a SQL solution?
At first glance...
All you need is a GROUP BY clause with the MAX aggregate function:
SELECT id, MAX(rev)
FROM YourTable
GROUP BY id
It's never that simple, is it?
I just noticed you need the content column as well.
This is a very common question in SQL: find the whole data for the row with some max value in a column per some group identifier. I have heard it a lot during my career. Actually, it was one of the questions I answered in my current job's technical interview.
It is, actually, so common that the Stack Overflow community has created a single tag just to deal with questions like that: greatest-n-per-group.
Basically, you have two approaches to solve that problem:
Joining with simple group-identifier, max-value-in-group Sub-query
In this approach, you first find the group-identifier, max-value-in-group (already solved above) in a sub-query. Then you join your table to the sub-query with equality on both group-identifier and max-value-in-group:
SELECT a.id, a.rev, a.content
FROM YourTable a
INNER JOIN (
SELECT id, MAX(rev) rev
FROM YourTable
GROUP BY id
) b ON a.id = b.id AND a.rev = b.rev
Left Joining with self, tweaking join conditions and filters
In this approach, you left join the table with itself. Equality goes in the group-identifier. Then, 2 smart moves:
1. The second join condition requires the left side value to be less than the right side value.
2. When you do step 1, the row(s) that actually have the max value will have NULL in the right side (it's a LEFT JOIN, remember?). Then, we filter the joined result, showing only the rows where the right side is NULL.
So you end up with:
SELECT a.*
FROM YourTable a
LEFT OUTER JOIN YourTable b
ON a.id = b.id AND a.rev < b.rev
WHERE b.id IS NULL;
Conclusion
Both approaches bring the exact same result.
If you have two rows with max-value-in-group for group-identifier, both rows will be in the result in both approaches.
Both approaches are SQL ANSI compatible, thus, will work with your favorite RDBMS, regardless of its "flavor".
Both approaches are also performance friendly; however, your mileage may vary (RDBMS, DB structure, indexes, etc.). So when you pick one approach over the other, benchmark. And make sure you pick the one which makes the most sense to you.
My preference is to use as little code as possible...
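To check the claims in the conclusion, here is a small sketch in Python with the standard-library sqlite3 module. It loads the question's four rows plus one extra tied row (my addition, to exercise the tie case) and runs both queries. SQLite is only a stand-in for whichever RDBMS you actually use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (id INTEGER, rev INTEGER, content TEXT)")
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?)",
    # The last row is an invented tie for the max rev of id 1.
    [(1, 1, "a"), (2, 1, "b"), (1, 2, "c"), (1, 3, "d"), (1, 3, "tie")],
)

# Approach 1: join against the (id, MAX(rev)) subquery.
JOIN_SQL = """
    SELECT a.id, a.rev, a.content
    FROM YourTable a
    INNER JOIN (SELECT id, MAX(rev) AS rev FROM YourTable GROUP BY id) b
      ON a.id = b.id AND a.rev = b.rev
    ORDER BY a.id, a.content
"""

# Approach 2: left self-join, keep rows that have no "greater" partner.
LEFT_JOIN_SQL = """
    SELECT a.id, a.rev, a.content
    FROM YourTable a
    LEFT OUTER JOIN YourTable b ON a.id = b.id AND a.rev < b.rev
    WHERE b.id IS NULL
    ORDER BY a.id, a.content
"""

by_join = conn.execute(JOIN_SQL).fetchall()
by_left_join = conn.execute(LEFT_JOIN_SQL).fetchall()

print(by_join)       # [(1, 3, 'd'), (1, 3, 'tie'), (2, 1, 'b')]
print(by_left_join)  # same rows
```

Both tied rows for id 1 come back in both approaches, as stated above.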
You can do it using IN
try this:
SELECT *
FROM t1 WHERE (id,rev) IN
( SELECT id, MAX(rev)
FROM t1
GROUP BY id
)
to my mind it is less complicated... easier to read and maintain.
I am flabbergasted that no answer offered SQL window function solution:
SELECT a.id, a.rev, a.content
FROM (SELECT id, rev, content,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY rev DESC) ranked_order
FROM YourTable) a
WHERE a.ranked_order = 1
Added in the ANSI/ISO SQL:2003 standard and later extended in SQL:2008, window (or windowing) functions are now available from all major vendors. There are more ranking functions available to deal with ties: RANK, DENSE_RANK, PERCENT_RANK.
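The difference between the ranking functions on ties can be seen in the sketch below (Python, standard-library sqlite3, assuming the bundled SQLite is 3.25 or newer so that window functions exist). The tied row and the helper top_per_id are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (id INTEGER, rev INTEGER, content TEXT)")
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?)",
    # id 1 has two rows sharing the highest rev (an invented tie).
    [(1, 3, "d"), (1, 3, "tie"), (1, 2, "c"), (2, 1, "b")],
)


def top_per_id(ranking_function):
    """Return the rows ranked first per id by the given window function."""
    sql = f"""
        SELECT id, rev, content FROM (
            SELECT id, rev, content,
                   {ranking_function}() OVER (
                       PARTITION BY id ORDER BY rev DESC) AS ranked_order
            FROM YourTable) a
        WHERE a.ranked_order = 1
        ORDER BY id, content
    """
    return conn.execute(sql).fetchall()


with_row_number = top_per_id("ROW_NUMBER")  # exactly one row per id
with_rank = top_per_id("RANK")              # every tied row per id

print(len(with_row_number))  # 2
print(with_rank)             # [(1, 3, 'd'), (1, 3, 'tie'), (2, 1, 'b')]
```

ROW_NUMBER picks one of the tied rows arbitrarily unless the ORDER BY is extended with a tie-breaker column.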
Yet another solution is to use a correlated subquery:
select yt.id, yt.rev, yt.content
from YourTable yt
where rev =
(select max(rev) from YourTable st where yt.id=st.id)
Having an index on (id,rev) renders the subquery almost as a simple lookup...
The following are comparisons to the solutions in @AdrianCarneiro's answer (subquery, leftjoin), based on MySQL measurements with an InnoDB table of ~1 million records and group sizes of 1-3.
While for full table scans the subquery/leftjoin/correlated timings relate to each other as 6/8/9, when it comes to direct lookups or batches (id in (1,2,3)), subquery is much slower than the others (due to rerunning the subquery). However, I couldn't differentiate between the leftjoin and correlated solutions in speed.
One final note, as leftjoin creates n*(n+1)/2 joins in groups, its performance can be heavily affected by the size of groups...
I can't vouch for the performance, but here's a trick inspired by the limitations of Microsoft Excel. It has some good features
GOOD STUFF
It should force return of only one "max record" even if there is a tie (sometimes useful)
It doesn't require a join
APPROACH
It is a little bit ugly and requires that you know something about the range of valid values of the rev column. Let us assume that we know the rev column is a number between 0.00 and 999 including decimals but that there will only ever be two digits to the right of the decimal point (e.g. 34.17 would be a valid value).
The gist of the thing is that you create a single synthetic column by string concatenating/packing the primary comparison field along with the data you want. In this way, you can force SQL's MAX() aggregate function to return all of the data (because it has been packed into a single column). Then you have to unpack the data.
Here's how it looks with the above example, written in SQL
SELECT id,
CAST(SUBSTRING(max(packed_col) FROM 2 FOR 6) AS float) as max_rev,
SUBSTRING(max(packed_col) FROM 12) AS content_for_max_rev
FROM (SELECT id,
CAST(1000 + rev + .001 as CHAR) || '---' || CAST(content AS char) AS packed_col
FROM yourtable
)
GROUP BY id
The packing begins by forcing the rev column to be a number of known character length regardless of the value of rev so that for example
3.2 becomes 1003.201
57 becomes 1057.001
923.88 becomes 1923.881
If you do it right, string comparison of two numbers should yield the same "max" as numeric comparison of the two numbers and it's easy to convert back to the original number using the substring function (which is available in one form or another pretty much everywhere).
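The packing idea is independent of SQL, so here is a minimal sketch of it in plain Python. The helper names pack and unpack and the sample rows are made up; the assumption, as above, is that rev lies between 0 and 999.99 with at most two decimals.

```python
def pack(rev, content):
    """Build the synthetic column: fixed-width number, separator, payload."""
    # 1000 + rev + 0.001 printed with three decimals is always 8 characters
    # for 0 <= rev <= 999.99, e.g. 3.2 -> "1003.201".
    return f"{1000 + rev + 0.001:.3f}---{content}"


def unpack(packed):
    """Recover (rev, content) from a packed string."""
    rev = float(packed[1:7])  # characters 2..7, like SUBSTRING(... FROM 2 FOR 6)
    content = packed[11:]     # skip the 8-character number and the 3-character "---"
    return rev, content


rows = [(1, 3.2, "old"), (1, 57, "newer"), (1, 923.88, "newest"), (2, 0.5, "only")]

# Plain string comparison plays the role of MAX(packed_col) ... GROUP BY id.
best = {}
for row_id, rev, content in rows:
    packed = pack(rev, content)
    if row_id not in best or packed > best[row_id]:
        best[row_id] = packed

result = {row_id: unpack(packed) for row_id, packed in best.items()}
print(result)  # {1: (923.88, 'newest'), 2: (0.5, 'only')}
```

Because every packed number has the same width, comparing the strings orders them exactly like comparing the numbers.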
Unique Identifiers? Yes! Unique identifiers!
One of the best ways to develop a MySQL DB is to have each id AUTOINCREMENT (Source MySQL.com). This allows a variety of advantages, too many to cover here. The problem with the question is that its example has duplicate ids. This disregards these tremendous advantages of unique identifiers, and at the same time, is confusing to those familiar with this already.
Cleanest Solution
Newer versions of MySQL come with ONLY_FULL_GROUP_BY enabled by default, and many of the solutions here will fail in testing with this condition.
Even so, we can simply select DISTINCT someuniquefield, MAX( whateverotherfieldtoselect ), ( *somethirdfield ), etc., and have no worries understanding the result or how the query works :
SELECT DISTINCT t1.id, MAX(t1.rev), MAX(t2.content)
FROM Table1 AS t1
JOIN Table1 AS t2 ON t2.id = t1.id AND t2.rev = (
SELECT MAX(rev) FROM Table1 t3 WHERE t3.id = t1.id
)
GROUP BY t1.id;
SELECT DISTINCT t1.id, MAX(t1.rev), MAX(t2.content) : Return the DISTINCT id, the MAX() of rev, and the content. The last MAX() is redundant, because I know it's just one row, but it's required by the query.
FROM Table1 AS t1 : Table searched on.
JOIN Table1 AS t2 ON t2.id = t1.id AND t2.rev = (...) : Join the table to itself, because we need the content that belongs to MAX(t1.rev).
GROUP BY t1.id : Force one row per id, the one with the highest rev, to be the returned result.
Note that since "content" was "..." in OP's question, there's no way to test that this works. So, I changed that to "..a", "..b", so, we can actually now see that the results are correct:
id max(Table1.rev) max(Table2.content)
1 3 ..d
2 1 ..b
Why is it clean? DISTINCT(), MAX(), etc., all make wonderful use of MySQL indices. This will be faster. Or, it will be much faster, if you have indexing, and you compare it to a query that looks at all rows.
Original Solution
With ONLY_FULL_GROUP_BY disabled, we can still use GROUP BY, but then we are only using it on the Salary, and not the id:
SELECT *
FROM
(SELECT *
FROM Employee
ORDER BY Salary DESC)
AS employeesub
GROUP BY employeesub.Salary;
SELECT * : Return all fields.
FROM Employee : Table searched on.
(SELECT *...) subquery : Return all people, sorted by Salary.
GROUP BY employeesub.Salary: Force the top-sorted, Salary row of each employee to be the returned result.
Unique-Row Solution
Note the Definition of a Relational Database: "Each row in a table has its own unique key." This would mean that, in the question's example, id would have to be unique, and in that case, we can just do :
SELECT *
FROM Employee
WHERE Employee.id = 12345
ORDER BY Employee.Salary DESC
LIMIT 1
Hopefully this is a solution that solves the problem and helps everyone better understand what's happening in the DB.
Another manner to do the job is using MAX() analytic function in OVER PARTITION clause
SELECT t.*
FROM
(
SELECT id
,rev
,content
,MAX(rev) OVER (PARTITION BY id) as max_rev
FROM YourTable
) t
WHERE t.rev = t.max_rev
The other ROW_NUMBER() OVER PARTITION solution already documented in this post is
SELECT t.*
FROM
(
SELECT id
,rev
,content
,ROW_NUMBER() OVER (PARTITION BY id ORDER BY rev DESC) rank
FROM YourTable
) t
WHERE t.rank = 1
These two SELECT statements work well on Oracle 10g.
The MAX() solution certainly runs faster than the ROW_NUMBER() solution, because MAX() complexity is O(n) while ROW_NUMBER() complexity is at minimum O(n log n), where n represents the number of records in the table!
Something like this?
SELECT yourtable.id, rev, content
FROM yourtable
INNER JOIN (
SELECT id, max(rev) as maxrev
FROM yourtable
GROUP BY id
) AS child ON (yourtable.id = child.id) AND (yourtable.rev = maxrev)
I like to use a NOT EXISTS-based solution for this problem:
SELECT
id,
rev
-- you can select other columns here
FROM YourTable t
WHERE NOT EXISTS (
SELECT * FROM YourTable t2 WHERE t2.id = t.id AND t2.rev > t.rev
)
This will select all records with max value within the group and allows you to select other columns.
SELECT *
FROM Employee
where Employee.Salary in (select max(salary) from Employee group by Employe_id)
ORDER BY Employee.Salary
Note: I probably wouldn't recommend this anymore in MySQL 8+ days. Haven't used it in years.
A third solution I hardly ever see mentioned is MySQL specific and looks like this:
SELECT id, MAX(rev) AS rev
, 0+SUBSTRING_INDEX(GROUP_CONCAT(numeric_content ORDER BY rev DESC), ',', 1) AS numeric_content
FROM t1
GROUP BY id
Yes it looks awful (converting to string and back etc.) but in my experience it's usually faster than the other solutions. Maybe that's just for my use cases, but I have used it on tables with millions of records and many unique ids. Maybe it's because MySQL is pretty bad at optimizing the other solutions (at least in the 5.0 days when I came up with this solution).
One important thing is that GROUP_CONCAT has a maximum length for the string it can build up. You probably want to raise this limit by setting the group_concat_max_len variable. And keep in mind that this will be a limit on scaling if you have a large number of rows.
Anyway, the above doesn't directly work if your content field is already text. In that case you probably want to use a different separator, like \0 maybe. You'll also run into the group_concat_max_len limit quicker.
I think, You want this?
select * from docs where (id, rev) IN (select id, max(rev) as rev from docs group by id order by id)
NOT mySQL, but for other people finding this question and using SQL, another way to resolve the greatest-n-per-group problem is using Cross Apply in MS SQL
WITH DocIds AS (SELECT DISTINCT id FROM docs)
SELECT d2.id, d2.rev, d2.content
FROM DocIds d1
CROSS APPLY (
SELECT Top 1 * FROM docs d
WHERE d.id = d1.id
ORDER BY rev DESC
) d2
I would use this:
select t.*
from test as t
join
(select id, max(rev) as rev
from test
group by id) as o
on o.id = t.id and o.rev = t.rev
A subquery in the SELECT list may not be very efficient, but in a JOIN clause it seems to be usable. I'm not an expert in optimizing queries, but I've tried it on MySQL, PostgreSQL, and Firebird, and it works very well.
You can use this pattern in multiple joins and with a WHERE clause. Here is my working example (solving a problem identical to yours with the table "firmy"):
select *
from platnosci as p
join firmy as f
on p.id_rel_firmy = f.id_rel
join (select max(id_obj) as id_obj
from firmy
group by id_rel) as o
on o.id_obj = f.id_obj and p.od > '2014-03-01'
It is run on tables with tens of thousands of records, and it takes less than 0.01 second on a machine that is really not too strong.
I wouldn't use an IN clause (as mentioned somewhere above). IN is meant to be used with short lists of constants, not as a query filter built on a subquery. That is because a subquery in IN may be executed for every scanned record, which can make the query take a very long time.
Since this is most popular question with regard to this problem, I'll re-post another answer to it here as well:
It looks like there is simpler way to do this (but only in MySQL):
select *
from (select * from mytable order by id, rev desc ) x
group by id
Please credit answer of user Bohemian in this question for providing such a concise and elegant answer to this problem.
Edit: though this solution works for many people it may not be stable in the long run, since MySQL doesn't guarantee that GROUP BY statement will return meaningful values for columns not in GROUP BY list. So use this solution at your own risk!
If you have many fields in select statement and you want latest value for all of those fields through optimized code:
select * from
(select * from table_name
order by id,rev desc) temp
group by id
How about this:
SELECT all_fields.*
FROM (SELECT id, MAX(rev) AS max_rev FROM yourtable GROUP BY id) AS max_recs
LEFT OUTER JOIN yourtable AS all_fields
ON max_recs.id = all_fields.id AND max_recs.max_rev = all_fields.rev
This solution makes only one selection from YourTable, therefore it's faster. It works only for MySQL and SQLite(for SQLite remove DESC) according to test on sqlfiddle.com. Maybe it can be tweaked to work on other languages which I am not familiar with.
SELECT *
FROM ( SELECT *
FROM ( SELECT 1 as id, 1 as rev, 'content1' as content
UNION
SELECT 2, 1, 'content2'
UNION
SELECT 1, 2, 'content3'
UNION
SELECT 1, 3, 'content4'
) as YourTable
ORDER BY id, rev DESC
) as YourTable
GROUP BY id
Here is a nice way of doing that
Use following code :
with temp as (
select count(field1) as summ , field1
from table_name
group by field1 )
select * from temp where summ = (select max(summ) from temp)
I like to do this by ranking the records by some column. In this case, rank rev values grouped by id. Those with higher rev will have lower rankings. So highest rev will have ranking of 1.
select id, rev, content
from
(select
@rowNum := if(@prevValue = id, @rowNum+1, 1) as row_num,
id, rev, content,
@prevValue := id
from
(select id, rev, content from YOURTABLE order by id asc, rev desc) TEMP,
(select @rowNum := 1 from DUAL) X,
(select @prevValue := -1 from DUAL) Y) TEMP
where row_num = 1;
Not sure if introducing variables makes the whole thing slower. But at least I'm not querying YOURTABLE twice.
here is another solution hope it will help someone
Select a.id , a.rev, a.content from Table1 a
inner join
(SELECT id, max(rev) rev FROM Table1 GROUP BY id) x on x.id =a.id and x.rev =a.rev
None of these answers have worked for me.
This is what worked for me.
with score as (select max(score_up) from history)
select history.* from score, history where history.score_up = score.max
Here's another solution to retrieving the records only with a field that has the maximum value for that field. This works for SQL400 which is the platform I work on. In this example, the records with the maximum value in field FIELD5 will be retrieved by the following SQL statement.
SELECT A.KEYFIELD1, A.KEYFIELD2, A.FIELD3, A.FIELD4, A.FIELD5
FROM MYFILE A
WHERE RRN(A) IN
(SELECT RRN(B)
FROM MYFILE B
WHERE B.KEYFIELD1 = A.KEYFIELD1 AND B.KEYFIELD2 = A.KEYFIELD2
ORDER BY B.FIELD5 DESC
FETCH FIRST ROW ONLY)
Sorted the rev field in reverse order and then grouped by id which gave the first row of each grouping which is the one with the highest rev value.
SELECT * FROM (SELECT * FROM table1 ORDER BY id, rev DESC) X GROUP BY X.id;
Tested in http://sqlfiddle.com/ with the following data
CREATE TABLE table1
(`id` int, `rev` int, `content` varchar(11));
INSERT INTO table1
(`id`, `rev`, `content`)
VALUES
(1, 1, 'One-One'),
(1, 2, 'One-Two'),
(2, 1, 'Two-One'),
(2, 2, 'Two-Two'),
(3, 2, 'Three-Two'),
(3, 1, 'Three-One'),
(3, 3, 'Three-Three')
;
This gave the following result in MySql 5.5 and 5.6
id rev content
1 2 One-Two
2 2 Two-Two
3 3 Three-Three
You can make the select without a join when you combine the rev and id into one maxRevId value for MAX() and then split it back to original values:
SELECT maxRevId & ((1 << 32) - 1) as id, maxRevId >> 32 AS rev
FROM (SELECT MAX(((rev << 32) | id)) AS maxRevId
FROM YourTable
GROUP BY id) x;
This is especially fast when there is a complex join instead of a single table. With the traditional approaches the complex join would be done twice.
The above combination is simple with bit functions when rev and id are INT UNSIGNED (32-bit) and the combined value fits into BIGINT UNSIGNED (64-bit). When id and rev are larger than 32-bit values or are made of multiple columns, you need to combine the values into e.g. a binary value with suitable padding for MAX().
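The bit arithmetic behind maxRevId can be checked outside the database. Below is a minimal sketch in plain Python; pack and unpack are hypothetical helper names, and both values are assumed to fit in 32 unsigned bits.

```python
MASK_32 = (1 << 32) - 1


def pack(rev, row_id):
    """Put rev in the high 32 bits and id in the low 32 bits."""
    return (rev << 32) | row_id


def unpack(packed):
    """Split a packed value back into (id, rev)."""
    return packed & MASK_32, packed >> 32


rows = [(1, 1), (2, 1), (1, 2), (1, 3)]  # (id, rev) pairs from the question

# Emulate SELECT MAX((rev << 32) | id) ... GROUP BY id.
max_packed = {}
for row_id, rev in rows:
    packed = pack(rev, row_id)
    max_packed[row_id] = max(max_packed.get(row_id, 0), packed)

latest = sorted(unpack(packed) for packed in max_packed.values())
print(latest)  # [(1, 3), (2, 1)]
```

Since rev occupies the high bits, the largest packed value within a group is always the one with the largest rev.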
Explanation
This is not pure SQL. This will use the SQLAlchemy ORM.
I came here looking for SQLAlchemy help, so I will duplicate Adrian Carneiro's answer with the python/SQLAlchemy version, specifically the outer join part.
This query answers the question of:
"Can you return me the records in this group of records (based on same id) that have the highest version number".
This allows me to duplicate the record, update it, increment its version number, and have the copy of the old version in such a way that I can show change over time.
Code
from sqlalchemy import and_
from sqlalchemy.orm import aliased, join
MyTableAlias = aliased(MyTable)
newest_records = appdb.session.query(MyTable).select_from(join(
MyTable,
MyTableAlias,
onclause=and_(
MyTable.id == MyTableAlias.id,
MyTable.version_int < MyTableAlias.version_int
),
isouter=True
)
).filter(
MyTableAlias.id == None,
).all()
Tested on a PostgreSQL database.
I used the below to solve a problem of my own. I first created a temp table and inserted the max rev value per unique id.
CREATE TABLE #temp1
(
id varchar(20)
, rev int
)
INSERT INTO #temp1
SELECT a.id, MAX(a.rev) as rev
FROM
(
SELECT id, content, SUM(rev) as rev
FROM YourTable
GROUP BY id, content
) as a
GROUP BY a.id
ORDER BY a.id
I then joined these max values (#temp1) to all of the possible id/content combinations. By doing this, I naturally filter out the non-maximum id/content combinations, and am left with the only max rev values for each.
SELECT a.id, a.rev, content
FROM #temp1 as a
LEFT JOIN
(
SELECT id, content, SUM(rev) as rev
FROM YourTable
GROUP BY id, content
) as b on a.id = b.id and a.rev = b.rev
GROUP BY a.id, a.rev, b.content
ORDER BY a.id

Most efficient way to remove duplicates - Postgres

I have always deleted duplicates with this kind of query:
delete from test a
using test b
where a.ctid < b.ctid
and a.col1=b.col1
and a.col2=b.col2
and a.col3=b.col3
Also, I have seen this query being used:
DELETE FROM test WHERE test.ctid NOT IN
(SELECT ctid FROM (
SELECT DISTINCT ON (col1, col2) *
FROM test));
And even this one (repeated until you run out of duplicates):
delete from test ju where ju.ctid in
(select ctid from (
select distinct on (col1, col2) * from test ou
where (select count(*) from test inr
where inr.col1= ou.col1 and inr.col2=ou.col2) > 1
Now I have run into a table with 5 million rows, which have indexes in the columns that are going to match in the where clause. And now I wonder:
Which, of all those methods that apparently do the same, is the most efficient and why?
I just ran the second one and it has been taking over 45 minutes to remove duplicates. I'm just curious about which would be the most efficient one, in case I have to remove duplicates from another huge table. It doesn't matter whether the table has a primary key in the first place; you can always create one.
demo:db<>fiddle
Finding duplicates can be easily achieved by using row_number() window function:
SELECT ctid
FROM(
SELECT
*,
ctid,
row_number() OVER (PARTITION BY col1, col2, col3 ORDER BY ctid)
FROM test
)s
WHERE row_number >= 2
This groups tied rows and adds a row counter within each group. So every row with row_number > 1 is a duplicate which can be deleted:
DELETE
FROM test
WHERE ctid IN
(
SELECT ctid
FROM(
SELECT
*,
ctid,
row_number() OVER (PARTITION BY col1, col2, col3 ORDER BY ctid)
FROM test
)s
WHERE row_number >= 2
)
I don't know if this solution is faster than your attempts, but you could give it a try.
Furthermore, as @a_horse_with_no_name already stated, I would recommend using a dedicated identifier column instead of ctid for performance reasons.
Edit:
For my test data your first version seems to be a little bit faster than my solution. Your second version seems to be slower and your third version does not work for me (after fixing the compiling errors it shows no result).

Postgres Remove records by duplicate control_id [duplicate]

I have a table in a PostgreSQL 8.3.8 database, which has no keys/constraints on it, and has multiple rows with exactly the same values.
I would like to remove all duplicates and keep only 1 copy of each row.
There is one column in particular (named "key") which may be used to identify duplicates, i.e. there should only exist one entry for each distinct "key".
How can I do this? (Ideally, with a single SQL command.)
Speed is not a problem in this case (there are only a few rows).
A faster solution is
DELETE FROM dups a USING (
SELECT MIN(ctid) as ctid, key
FROM dups
GROUP BY key HAVING COUNT(*) > 1
) b
WHERE a.key = b.key
AND a.ctid <> b.ctid
DELETE FROM dupes a
WHERE a.ctid <> (SELECT min(b.ctid)
FROM dupes b
WHERE a.key = b.key);
This is fast and concise:
DELETE FROM dupes T1
USING dupes T2
WHERE T1.ctid < T2.ctid -- delete the older versions
AND T1.key = T2.key; -- add more columns if needed
See also my answer at How to delete duplicate rows without unique identifier which includes more information.
EXISTS is simple and among the fastest for most data distributions:
DELETE FROM dupes d
WHERE EXISTS (
SELECT FROM dupes
WHERE key = d.key
AND ctid < d.ctid
);
From each set of duplicate rows (defined by identical key), this keeps the one row with the minimum ctid.
Result is identical to the currently accepted answer by a_horse. Just faster, because EXISTS can stop evaluating as soon as the first offending row is found, while the alternative with min() has to consider all rows per group to compute the minimum. Speed is of no concern to this question, but why not take it?
You may want to add a UNIQUE constraint after cleaning up, to prevent duplicates from creeping back in:
ALTER TABLE dupes ADD CONSTRAINT constraint_name_here UNIQUE (key);
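To watch the EXISTS delete and the follow-up uniqueness rule work together, here is a small sketch in Python with the standard-library sqlite3 module. The assumptions: SQLite's rowid stands in for ctid, a unique index stands in for the UNIQUE constraint, and the payload column and sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE dupes ("key" TEXT, payload TEXT)')
conn.executemany(
    "INSERT INTO dupes VALUES (?, ?)",
    [("a", "first"), ("a", "second"), ("b", "only"), ("a", "third")],
)

# Delete every row for which an older row (smaller rowid) with the same key exists.
conn.execute(
    """
    DELETE FROM dupes
    WHERE EXISTS (
        SELECT 1 FROM dupes AS older
        WHERE older."key" = dupes."key"
          AND older.rowid < dupes.rowid
    )
    """
)
remaining = conn.execute(
    'SELECT "key", payload FROM dupes ORDER BY rowid'
).fetchall()

# With the duplicates gone, the unique index can be created and rejects new ones.
conn.execute('CREATE UNIQUE INDEX dupes_key_uidx ON dupes ("key")')
try:
    conn.execute("INSERT INTO dupes VALUES ('a', 'again')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(remaining)  # [('a', 'first'), ('b', 'only')]
print(rejected)   # True
```

Creating the index before the cleanup would fail, which is the situation the very first question on this page ran into.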
About the system column ctid:
Is the system column “ctid” legitimate for identifying rows to delete?
If there is any other column defined UNIQUE NOT NULL in the table (like a PRIMARY KEY) then, by all means, use it instead of ctid.
If key can be NULL and you only want one of those, too, use IS NOT DISTINCT FROM instead of =. See:
How do I (or can I) SELECT DISTINCT on multiple columns?
As that's slower, you might instead run the above query as is, and this in addition:
DELETE FROM dupes d
WHERE key IS NULL
AND EXISTS (
SELECT FROM dupes
WHERE key IS NULL
AND ctid < d.ctid
);
And consider:
Create unique constraint with null columns
For small tables, indexes generally do not help performance. And we need not look further.
For big tables and few duplicates, an existing index on (key) can help (a lot).
For mostly duplicates, an index may add more cost than benefit, as it has to be kept up to date concurrently. Finding duplicates without index becomes faster anyway because there are so many and EXISTS only needs to find one. But consider a completely different approach if you can afford it (i.e. concurrent access allows it): Write the few surviving rows to a new table. That also removes table (and index) bloat in the process. See:
How to delete duplicate entries?
I tried this:
DELETE FROM tablename
WHERE id IN (SELECT id
FROM (SELECT id,
ROW_NUMBER() OVER (partition BY column1, column2, column3 ORDER BY id) AS rnum
FROM tablename) t
WHERE t.rnum > 1);
provided by Postgres wiki:
https://wiki.postgresql.org/wiki/Deleting_duplicates
I would use a temporary table:
create table tab_temp as
select distinct f1, f2, f3, fn
from tab;
Then, delete tab and rename tab_temp into tab.
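The copy-and-swap steps can be sketched as follows, in Python with the standard-library sqlite3 module as a stand-in for PostgreSQL. The three columns and the sample rows are invented. Keep in mind that CREATE TABLE ... AS copies only the data, so indexes, constraints, and permissions would have to be recreated on the new table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (f1 TEXT, f2 TEXT, f3 TEXT)")
conn.executemany(
    "INSERT INTO tab VALUES (?, ?, ?)",
    [("x", "y", "z"), ("x", "y", "z"), ("p", "q", "r"), ("x", "y", "z")],
)

# Copy the distinct rows, drop the original, and rename the copy into place.
# The "with" block commits all three steps together or rolls them back on error.
with conn:
    conn.execute("CREATE TABLE tab_temp AS SELECT DISTINCT f1, f2, f3 FROM tab")
    conn.execute("DROP TABLE tab")
    conn.execute("ALTER TABLE tab_temp RENAME TO tab")

rows = conn.execute("SELECT f1, f2, f3 FROM tab ORDER BY f1, f2, f3").fetchall()
print(rows)  # [('p', 'q', 'r'), ('x', 'y', 'z')]
```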
I had to create my own version. The version written by @a_horse_with_no_name is way too slow on my table (21M rows), and the one by @rapimo simply doesn't delete dups.
Here is what I use on PostgreSQL 9.5:
DELETE FROM your_table
WHERE ctid IN (
SELECT unnest(array_remove(all_ctids, actid))
FROM (
SELECT
min(b.ctid) AS actid,
array_agg(ctid) AS all_ctids
FROM your_table b
GROUP BY key1, key2, key3, key4
HAVING count(*) > 1) c);
Another approach (which works only if you have a unique field like id in your table) is to find one id per distinct combination of columns and remove all other ids that are not in that list:
DELETE
FROM users
WHERE users.id NOT IN (SELECT DISTINCT ON (username, email) id FROM users);
PostgreSQL has window functions; you can use rank() to achieve your goal. Sample:
WITH ranked as (
SELECT
id, column1,
"rank" () OVER (
PARTITION BY column1
order by id asc
) AS r
FROM
table1
)
delete from table1 t1
using ranked
where t1.id = ranked.id and ranked.r > 1
Here is another solution, that worked for me.
delete from table_name a using table_name b
where a.id < b.id
and a.column1 = b.column1;
How about:
WITH
u AS (SELECT DISTINCT * FROM your_table),
x AS (DELETE FROM your_table)
INSERT INTO your_table SELECT * FROM u;
I had been concerned about execution order, would the DELETE happen before the SELECT DISTINCT, but it works fine for me.
And has the added bonus of not needing any knowledge about the table structure.
Here is a solution using PARTITION BY and the virtual ctid column, which works like a primary key, at least within a single session:
DELETE FROM dups
USING (
SELECT
ctid,
(
ctid != min(ctid) OVER (PARTITION BY key_column1, key_column2 [...])
) AS is_duplicate
FROM dups
) dups_find_duplicates
WHERE dups.ctid = dups_find_duplicates.ctid
AND dups_find_duplicates.is_duplicate
A subquery is used to mark all rows as duplicates or not, based on whether they share the same "key columns", but not the same ctid, as the "first" one found in the "partition" of rows sharing the same keys.
In other words, "first" is defined as:
min(ctid) OVER (PARTITION BY key_column1, key_column2 [...])
Then, all rows where is_duplicate is true are deleted by their ctid.
From the documentation, ctid represents (emphasis mine):
The physical location of the row version within its table. Note that although the ctid can be used to locate the row version very quickly, a row's ctid will change if it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row identifier. A primary key should be used to identify logical rows.
Well, none of these solutions would work if the id itself is duplicated, which is my use case. In that case the solution is simple:
myTable:
id name
0 value
0 value
0 value
1 value1
1 value1
create table dedupMyTable as select distinct * from myTable;
delete from myTable;
insert into myTable select * from dedupMyTable;
select * from myTable;
id name
0 value
1 value1
Well, you shouldn't have duplicate ids in your table unless it doesn't have a PK constraint or the platform simply doesn't support one, such as Hive/data lake tables.
Better pay attention when loading your data to avoid duplicate ids.
DELETE FROM tracking_order
WHERE
mvd_id IN (---column you need to remove duplicate
SELECT
mvd_id
FROM (
SELECT
mvd_id,thoi_gian_gui,
ROW_NUMBER() OVER (
PARTITION BY mvd_id
ORDER BY thoi_gian_gui desc) AS row_num
FROM
tracking_order
) s_alias
WHERE row_num > 1)
AND thoi_gian_gui in ( --column you used to compare to delete duplicates, eg last update time
SELECT
thoi_gian_gui
FROM (
SELECT
thoi_gian_gui,
ROW_NUMBER() OVER (
PARTITION BY mvd_id
ORDER BY thoi_gian_gui desc) AS row_num
FROM
tracking_order
) s_alias
WHERE row_num > 1)
With my code, I removed all duplicates (7800445 rows) and kept only 1 copy of each row, in 7 min 28 secs.
This worked well for me. I had a table, terms, that contained duplicate values. I ran a query to populate a temp table with the ids of all the duplicate rows. Then I ran a delete statement with those ids from the temp table. value is the column that contained the duplicates.
CREATE TEMP TABLE dupids AS
select id from (
select value, id, row_number()
over (partition by value order by value)
as rownum from terms
) tmp
where rownum >= 2;
delete from terms where id in (select id from dupids);

Remove all records with duplicates in db2. (Not just the duplicate records)

How can I remove all the records with duplicates in db2. I have looked at various answers but they only remove the duplicates leaving one record from that set in the table. This is what I found already.
DELETE FROM
(SELECT ROWNUMBER() OVER (PARTITION BY ONE, TWO, THREE) AS RN
FROM SESSION.TEST) AS A
WHERE RN > 1;
But, I need a query that will remove all the records that contain duplicates not leaving behind one of them in the table.
A A 1 <-- delete this
A A 2 <-- delete this too
B B 3
C C 4
P.S: Using RN >= 1 does not work as it will make the table empty by deleting all records.
Your original statement wouldn't work in any case: it would only delete anything after the first row (and given you seem to list the unique id column in the PARTITION BY, it shouldn't actually delete anything at all).
The following should work in LUW:
DELETE FROM (SELECT col1, col2, col3
FROM <tableName> ot
JOIN (SELECT col1, col2
FROM <tableName>
GROUP BY col1, col2
HAVING COUNT(*) > 1) dt
ON dt.col1 = ot.col1
AND dt.col2 = ot.col2)
(although I have no way to test this)
I believe the following should also work, and be near universal (work on most RDBMSs):
DELETE FROM Temp
WHERE (col1, col2) IN (SELECT col1, col2
FROM Temp
GROUP BY col1, col2
HAVING COUNT(*) > 1)
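For a quick check of the "remove every member of a duplicated group" behaviour, here is a sketch in Python with the standard-library sqlite3 module, using the four sample rows from the question. SQLite (3.15 or newer, for the row-value IN) is only a stand-in for DB2, and the table name records is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (col1 TEXT, col2 TEXT, col3 INTEGER)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [("A", "A", 1), ("A", "A", 2), ("B", "B", 3), ("C", "C", 4)],
)

# Every (col1, col2) pair that occurs more than once is deleted entirely,
# so no representative of a duplicated group survives.
conn.execute(
    """
    DELETE FROM records
    WHERE (col1, col2) IN (
        SELECT col1, col2 FROM records
        GROUP BY col1, col2
        HAVING COUNT(*) > 1
    )
    """
)
remaining = conn.execute(
    "SELECT col1, col2, col3 FROM records ORDER BY col3"
).fetchall()
print(remaining)  # [('B', 'B', 3), ('C', 'C', 4)]
```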

How to prune a table down to the first 5000 records of 50000

I have a rather large table of 50000 records, and I want to cut this down to 5000. How would I write an SQL query to delete the other 45000 records. The basic table structure contains the column of a datetime.
A rough idea of the query I want is the following
DELETE FROM mytable WHERE countexceeded(5000) ORDER BY filedate DESC;
I could write this in C# somehow grabbing the row index number and doing some work around that, however is there a tidy way to do this?
The answer you have accepted is not valid syntax as DELETE does not allow an ORDER BY clause. You can use
;WITH T AS
(
SELECT TOP 45000 *
FROM mytable
ORDER BY filedate
)
DELETE FROM T
DELETE TOP(45000) FROM mytable ORDER BY filedate ASC;
Change the order by to ascending to get the rows in reverse order and then delete the top 45000.
Hope this helps.
Edit:-
I apologize for the invalid syntax. Here is my second attempt.
DELETE FROM myTable a INNER JOIN
(SELECT TOP(45000) * FROM myTable ORDER BY fileDate ASC) b ON a.id = b.id
If you do not have a unique column then please use Martin Smith's CTE answer.
if the table is correctly ordered:
DELETE FROM mytable LIMIT 5000
if not and the table has correctly ordered auto_increment index:
get the id of the first row you want to delete:
SELECT id, filedate FROM mytable ORDER BY id LIMIT 5000, 1;
save the id and then delete:
DELETE FROM mytable WHERE id >= @id;
If the ids are not ordered correctly, you could use filedate instead of id, but if it's a date without a time, you could delete undesired rows from the same date, so be careful with the filedate-based deletion.
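Another way to express "keep N rows, delete the rest" is sketched below in Python with the standard-library sqlite3 module, on a ten-row table pruned to three rows. This is an illustration under assumptions, not one of the answers above: it keeps the newest rows by filedate, the id column breaks ties between equal dates, and it relies on SQLite accepting LIMIT inside an IN subquery, which not every database (MySQL, for instance) allows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, filedate TEXT)")
conn.executemany(
    "INSERT INTO mytable (filedate) VALUES (?)",
    [(f"2024-01-{day:02d}",) for day in range(1, 11)],
)

keep = 3  # stands in for the 5000 rows to keep

# Select the ids of the rows to keep, then delete everything else.
conn.execute(
    """
    DELETE FROM mytable
    WHERE id NOT IN (
        SELECT id FROM mytable
        ORDER BY filedate DESC, id DESC
        LIMIT ?
    )
    """,
    (keep,),
)
remaining = [
    row[0] for row in conn.execute("SELECT filedate FROM mytable ORDER BY filedate")
]
print(remaining)  # ['2024-01-08', '2024-01-09', '2024-01-10']
```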