Remove all records with duplicates in db2. (Not just the duplicate records) - db2

How can I remove all the records that have duplicates in DB2? I have looked at various answers, but they only remove the extra copies, leaving one record from each set in the table. This is what I have found so far:
DELETE FROM
(SELECT ROWNUMBER() OVER (PARTITION BY ONE, TWO, THREE) AS RN
FROM SESSION.TEST) AS A
WHERE RN > 1;
But I need a query that removes every record that has a duplicate, without leaving one of them behind in the table:
A A 1 <-- delete this
A A 2 <-- delete this too
B B 3
C C 4
P.S: Using RN >= 1 does not work as it will make the table empty by deleting all records.

Your original statement wouldn't work in any case: it would only delete anything after the first row (and, given that you seem to list the unique id column in the PARTITION BY, shouldn't actually delete anything at all).
The following should work in LUW:
DELETE FROM (SELECT ot.col1, ot.col2, ot.col3
FROM <tableName> ot
JOIN (SELECT col1, col2
FROM <tableName>
GROUP BY col1, col2
HAVING COUNT(*) > 1) dt
ON dt.col1 = ot.col1
AND dt.col2 = ot.col2)
(although I have no way to test this)
I believe the following should also work, and be near universal (work on most RDBMSs):
DELETE FROM Temp
WHERE (col1, col2) IN (SELECT col1, col2
FROM Temp
GROUP BY col1, col2
HAVING COUNT(*) > 1)
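To see the second statement in action, here is a minimal sketch that runs it against the four sample rows from the question. It uses Python's built-in sqlite3 module as a stand-in, since no DB2 instance is assumed to be at hand. The table name temp_rows is made up for the example, and the row-value IN syntax needs SQLite 3.15 or newer.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for DB2 (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE temp_rows (col1 TEXT, col2 TEXT, col3 INTEGER)")
conn.executemany(
    "INSERT INTO temp_rows VALUES (?, ?, ?)",
    [("A", "A", 1), ("A", "A", 2), ("B", "B", 3), ("C", "C", 4)],
)

# Delete every row whose (col1, col2) pair occurs more than once,
# so no member of a duplicate group survives.
conn.execute(
    """
    DELETE FROM temp_rows
    WHERE (col1, col2) IN (SELECT col1, col2
                           FROM temp_rows
                           GROUP BY col1, col2
                           HAVING COUNT(*) > 1)
    """
)

remaining = conn.execute(
    "SELECT col1, col2, col3 FROM temp_rows ORDER BY col3"
).fetchall()
print(remaining)  # [('B', 'B', 3), ('C', 'C', 4)]
```

Both A A rows are gone, while B B 3 and C C 4 stay, which is the result the question asks for.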

Related

Most efficient way to remove duplicates - Postgres

I have always deleted duplicates with this kind of query:
delete from test a
using test b
where a.ctid < b.ctid
and a.col1=b.col1
and a.col2=b.col2
and a.col3=b.col3
Also, I have seen this query being used:
DELETE FROM test WHERE test.ctid NOT IN
(SELECT ctid FROM (
SELECT DISTINCT ON (col1, col2) *
FROM test));
And even this one (repeated until you run out of duplicates):
delete from test ju where ju.ctid in
(select ctid from (
select distinct on (col1, col2) * from test ou
where (select count(*) from test inr
where inr.col1= ou.col1 and inr.col2=ou.col2) > 1
Now I have run into a table with 5 million rows, which have indexes in the columns that are going to match in the where clause. And now I wonder:
Which, of all those methods that apparently do the same, is the most efficient and why?
I just ran the second one and it has been taking over 45 minutes to remove the duplicates. I'm curious which one would be the most efficient, in case I have to remove duplicates from another huge table. It doesn't matter whether the table has a primary key in the first place; one can always be created if needed.
Finding duplicates can be easily achieved by using row_number() window function:
SELECT ctid
FROM(
SELECT
*,
ctid,
row_number() OVER (PARTITION BY col1, col2, col3 ORDER BY ctid)
FROM test
)s
WHERE row_number >= 2
This groups tied rows, orders them within each group, and adds a row counter. So every row with row_number > 1 is a duplicate which can be deleted:
DELETE
FROM test
WHERE ctid IN
(
SELECT ctid
FROM(
SELECT
*,
ctid,
row_number() OVER (PARTITION BY col1, col2, col3 ORDER BY ctid)
FROM test
)s
WHERE row_number >= 2
)
I don't know if this solution is faster than your attempts, but you could give it a try.
Furthermore, as @a_horse_with_no_name already stated, I would recommend using your own identifier column instead of ctid for performance reasons.
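As a rough illustration of the row_number() approach with an explicit identifier instead of ctid, the sketch below runs the same kind of DELETE on an in-memory SQLite database through Python's sqlite3 module. SQLite has no ctid, so an id column (an assumption added for the example) plays that role. Window functions need SQLite 3.25 or newer, and the sample rows are invented.

```python
import sqlite3

# SQLite stand-in for Postgres (assumption); id replaces ctid here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test (id INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT, col3 TEXT)"
)
conn.executemany(
    "INSERT INTO test (col1, col2, col3) VALUES (?, ?, ?)",
    [
        ("a", "b", "c"),  # id 1, kept (first of its group)
        ("a", "b", "c"),  # id 2, duplicate
        ("a", "b", "c"),  # id 3, duplicate
        ("x", "y", "z"),  # id 4, unique
        ("x", "y", "q"),  # id 5, unique (col3 differs)
    ],
)

# Number the rows inside each (col1, col2, col3) group and delete
# everything except the first row of each group.
conn.execute(
    """
    DELETE FROM test
    WHERE id IN (
        SELECT id
        FROM (
            SELECT id,
                   ROW_NUMBER() OVER (PARTITION BY col1, col2, col3
                                      ORDER BY id) AS rn
            FROM test
        ) AS s
        WHERE rn >= 2
    )
    """
)

survivors = conn.execute(
    "SELECT id, col1, col2, col3 FROM test ORDER BY id"
).fetchall()
print(survivors)
# [(1, 'a', 'b', 'c'), (4, 'x', 'y', 'z'), (5, 'x', 'y', 'q')]
```

This only shows that the statement keeps one row per group; it says nothing about how the variants compare in speed on a 5 million row Postgres table.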
Edit:
For my test data, your first version seems to be a little faster than my solution. Your second version seems to be slower, and your third version does not work for me (after fixing the syntax errors it returns no result).

Postgres Remove records by duplicate control_id [duplicate]

I have a table in a PostgreSQL 8.3.8 database, which has no keys/constraints on it, and has multiple rows with exactly the same values.
I would like to remove all duplicates and keep only 1 copy of each row.
There is one column in particular (named "key") which may be used to identify duplicates, i.e. there should only exist one entry for each distinct "key".
How can I do this? (Ideally, with a single SQL command.)
Speed is not a problem in this case (there are only a few rows).
A faster solution is
DELETE FROM dups a USING (
SELECT MIN(ctid) as ctid, key
FROM dups
GROUP BY key HAVING COUNT(*) > 1
) b
WHERE a.key = b.key
AND a.ctid <> b.ctid
DELETE FROM dupes a
WHERE a.ctid <> (SELECT min(b.ctid)
FROM dupes b
WHERE a.key = b.key);
This is fast and concise:
DELETE FROM dupes T1
USING dupes T2
WHERE T1.ctid < T2.ctid -- delete the older versions
AND T1.key = T2.key; -- add more columns if needed
See also my answer at How to delete duplicate rows without unique identifier which includes more information.
EXISTS is simple and among the fastest for most data distributions:
DELETE FROM dupes d
WHERE EXISTS (
SELECT FROM dupes
WHERE key = d.key
AND ctid < d.ctid
);
From each set of duplicate rows (defined by identical key), this keeps the one row with the minimum ctid.
Result is identical to the currently accepted answer by a_horse. Just faster, because EXISTS can stop evaluating as soon as the first offending row is found, while the alternative with min() has to consider all rows per group to compute the minimum. Speed is of no concern to this question, but why not take it?
You may want to add a UNIQUE constraint after cleaning up, to prevent duplicates from creeping back in:
ALTER TABLE dupes ADD CONSTRAINT constraint_name_here UNIQUE (key);
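The two steps above (EXISTS-based cleanup, then a uniqueness guarantee) can be sketched end to end as follows. This is a hedged illustration on SQLite via Python's sqlite3 module, not Postgres: rowid stands in for ctid, CREATE UNIQUE INDEX stands in for ALTER TABLE ... ADD CONSTRAINT (which SQLite does not support), and the payload column, the index name, and the sample rows are invented.

```python
import sqlite3

# SQLite stand-in for Postgres (assumption); rowid plays the role of ctid.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dupes (key TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO dupes VALUES (?, ?)",
    [("k1", "first"), ("k1", "second"), ("k2", "only"), ("k1", "third")],
)

# Delete every row for which an earlier row with the same key exists.
# The row with the smallest rowid in each group is never matched, so it survives.
conn.execute(
    """
    DELETE FROM dupes
    WHERE EXISTS (
        SELECT 1
        FROM dupes AS older
        WHERE older.key = dupes.key
          AND older.rowid < dupes.rowid
    )
    """
)
kept = conn.execute("SELECT key, payload FROM dupes ORDER BY key").fetchall()
print(kept)  # [('k1', 'first'), ('k2', 'only')]

# Guard against duplicates creeping back in.
conn.execute("CREATE UNIQUE INDEX dupes_key_uq ON dupes (key)")

try:
    conn.execute("INSERT INTO dupes VALUES ('k1', 'again')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)  # True
```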
About the system column ctid:
Is the system column “ctid” legitimate for identifying rows to delete?
If there is any other column defined UNIQUE NOT NULL in the table (like a PRIMARY KEY) then, by all means, use it instead of ctid.
If key can be NULL and you only want one of those, too, use IS NOT DISTINCT FROM instead of =. See:
How do I (or can I) SELECT DISTINCT on multiple columns?
As that's slower, you might instead run the above query as is, and this in addition:
DELETE FROM dupes d
WHERE key IS NULL
AND EXISTS (
SELECT FROM dupes
WHERE key IS NULL
AND ctid < d.ctid
);
And consider:
Create unique constraint with null columns
For small tables, indexes generally do not help performance. And we need not look further.
For big tables and few duplicates, an existing index on (key) can help (a lot).
For mostly duplicates, an index may add more cost than benefit, as it has to be kept up to date concurrently. Finding duplicates without index becomes faster anyway because there are so many and EXISTS only needs to find one. But consider a completely different approach if you can afford it (i.e. concurrent access allows it): Write the few surviving rows to a new table. That also removes table (and index) bloat in the process. See:
How to delete duplicate entries?
I tried this:
DELETE FROM tablename
WHERE id IN (SELECT id
FROM (SELECT id,
ROW_NUMBER() OVER (partition BY column1, column2, column3 ORDER BY id) AS rnum
FROM tablename) t
WHERE t.rnum > 1);
provided by Postgres wiki:
https://wiki.postgresql.org/wiki/Deleting_duplicates
I would use a temporary table:
create table tab_temp as
select distinct f1, f2, f3, fn
from tab;
Then drop tab and rename tab_temp to tab.
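A minimal sketch of that copy, drop and rename sequence, run on an in-memory SQLite database through Python's sqlite3 module (a stand-in for Postgres; the two-column layout and the sample rows are assumptions). Note that CREATE TABLE ... AS SELECT copies only the data, so any indexes or constraints on the original table would have to be recreated afterwards.

```python
import sqlite3

# SQLite stand-in for Postgres (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (f1 INTEGER, f2 TEXT)")
conn.executemany(
    "INSERT INTO tab VALUES (?, ?)",
    [(1, "a"), (1, "a"), (2, "b"), (2, "b"), (3, "c")],
)

# 1. Copy the distinct rows into a new table.
conn.execute("CREATE TABLE tab_temp AS SELECT DISTINCT f1, f2 FROM tab")
# 2. Drop the original table.
conn.execute("DROP TABLE tab")
# 3. Give the deduplicated copy the original name.
conn.execute("ALTER TABLE tab_temp RENAME TO tab")

deduplicated = conn.execute("SELECT f1, f2 FROM tab ORDER BY f1").fetchall()
print(deduplicated)  # [(1, 'a'), (2, 'b'), (3, 'c')]
```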
I had to create my own version. The version written by @a_horse_with_no_name is way too slow on my table (21M rows), and the one by @rapimo simply doesn't delete the dups.
Here is what I use on PostgreSQL 9.5
DELETE FROM your_table
WHERE ctid IN (
SELECT unnest(array_remove(all_ctids, actid))
FROM (
SELECT
min(b.ctid) AS actid,
array_agg(ctid) AS all_ctids
FROM your_table b
GROUP BY key1, key2, key3, key4
HAVING count(*) > 1) c);
Another approach (it works only if your table has a unique field such as id): find one id per distinct combination of columns, then remove every row whose id is not in that list.
DELETE
FROM users
WHERE users.id NOT IN (SELECT DISTINCT ON (username, email) id FROM users);
PostgreSQL has window functions; you can use rank() to achieve your goal. Sample:
WITH ranked as (
SELECT
id, column1,
"rank" () OVER (
PARTITION BY column1
order by id asc -- order by a unique column; ordering by column1 would give every row rank 1
) AS r
FROM
table1
)
delete from table1 t1
using ranked
where t1.id = ranked.id and ranked.r > 1
Here is another solution, that worked for me.
delete from table_name a using table_name b
where a.id < b.id
and a.column1 = b.column1;
How about:
WITH
u AS (SELECT DISTINCT * FROM your_table),
x AS (DELETE FROM your_table)
INSERT INTO your_table SELECT * FROM u;
I had been concerned about execution order (would the DELETE happen before the SELECT DISTINCT?), but it works fine for me.
And has the added bonus of not needing any knowledge about the table structure.
Here is a solution using PARTITION BY and the virtual ctid column, which works like a primary key, at least within a single session:
DELETE FROM dups
USING (
SELECT
ctid,
(
ctid != min(ctid) OVER (PARTITION BY key_column1, key_column2 [...])
) AS is_duplicate
FROM dups
) dups_find_duplicates
WHERE dups.ctid = dups_find_duplicates.ctid
AND dups_find_duplicates.is_duplicate
A subquery is used to mark all rows as duplicates or not, based on whether they share the same "key columns", but not the same ctid, as the "first" one found in the "partition" of rows sharing the same keys.
In other words, "first" is defined as:
min(ctid) OVER (PARTITION BY key_column1, key_column2 [...])
Then, all rows where is_duplicate is true are deleted by their ctid.
From the documentation, ctid represents (emphasis mine):
The physical location of the row version within its table. Note that although the ctid can be used to locate the row version very quickly, a row's ctid will change if it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row identifier. A primary key should be used to identify logical rows.
Well, none of these solutions work if the id itself is duplicated, which is my use case. In that case the solution is simple:
myTable:
id name
0 value
0 value
0 value
1 value1
1 value1
create table dedupMyTable as select distinct * from myTable;
delete from myTable;
insert into myTable select * from dedupMyTable;
select * from myTable;
id name
0 value
1 value1
You shouldn't have duplicate ids in your table unless it has no PK constraint, or the platform simply doesn't support one (such as Hive/data lake tables).
It is better to pay attention when loading your data, to avoid dups on the ids.
DELETE FROM tracking_order
WHERE
mvd_id IN (---column you need to remove duplicate
SELECT
mvd_id
FROM (
SELECT
mvd_id,thoi_gian_gui,
ROW_NUMBER() OVER (
PARTITION BY mvd_id
ORDER BY thoi_gian_gui desc) AS row_num
FROM
tracking_order
) s_alias
WHERE row_num > 1)
AND thoi_gian_gui in ( --column you used to compare to delete duplicates, eg last update time
SELECT
thoi_gian_gui
FROM (
SELECT
thoi_gian_gui,
ROW_NUMBER() OVER (
PARTITION BY mvd_id
ORDER BY thoi_gian_gui desc) AS row_num
FROM
tracking_order
) s_alias
WHERE row_num > 1)
With this code I removed all duplicates from 7800445 rows, keeping only 1 copy of each row, in 7 min 28 secs.
This worked well for me. I had a table, terms, that contained duplicate values. I ran a query to populate a temp table with the ids of all the duplicate rows, then ran a delete statement using the ids in the temp table. value is the column that contained the duplicates.
CREATE TEMP TABLE dupids AS
select id from (
select value, id, row_number()
over (partition by value order by value)
as rownum from terms
) tmp
where rownum >= 2;
delete from terms where id in (select id from dupids);

duplicate multi column entries postgresql

I have a bunch of data in a postgresql database. I think that two keys should form a unique pair,
so want to enforce that in the database. I try
create unique index key1_key2_idx on table(key1,key2)
but that fails, telling me that I have duplicate entries.
How do I find these duplicate entries so I can delete them?
select key1,key2,count(*)
from table
group by key1,key2
having count(*) > 1
order by 3 desc;
The critical part of the query to determine the duplicates is having count(*) > 1.
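For a concrete feel of what that query returns, here is a small sketch on an in-memory SQLite database via Python's sqlite3 module (a stand-in for PostgreSQL). The table name pairs and the sample rows are assumptions for the example. It also reproduces the failing unique index from the question.

```python
import sqlite3

# SQLite stand-in for PostgreSQL (assumption); "pairs" is a made-up table name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pairs (key1 INTEGER, key2 TEXT)")
conn.executemany(
    "INSERT INTO pairs VALUES (?, ?)",
    [(1, "a"), (1, "a"), (1, "a"), (2, "b"), (2, "b"), (3, "c")],
)

# The unique index cannot be built while duplicate pairs exist.
try:
    conn.execute("CREATE UNIQUE INDEX key1_key2_idx ON pairs (key1, key2)")
    index_failed = False
except sqlite3.DatabaseError:
    index_failed = True
print(index_failed)  # True

# List each duplicated (key1, key2) pair with its count, worst offenders first.
duplicates = conn.execute(
    """
    SELECT key1, key2, COUNT(*)
    FROM pairs
    GROUP BY key1, key2
    HAVING COUNT(*) > 1
    ORDER BY 3 DESC
    """
).fetchall()
print(duplicates)  # [(1, 'a', 3), (2, 'b', 2)]
```

The pair (3, 'c') occurs only once, so the HAVING clause filters it out.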
There are a whole bunch of neat tricks at the following link, including some examples of removing duplicates: http://postgres.cz/wiki/PostgreSQL_SQL_Tricks
Assuming you only want to delete the duplicates and keep the original, the accepted answer is inaccurate -- it'll delete your originals as well and only keep records that have one entry from the start. This works on 9.x:
SELECT * FROM tblname WHERE ctid IN
(SELECT ctid FROM
(SELECT ctid, ROW_NUMBER() OVER
(partition BY col1, col2, col3 ORDER BY ctid) AS rnum
FROM tblname) t
WHERE t.rnum > 1);
https://wiki.postgresql.org/wiki/Deleting_duplicates

Entity Framework: View exclusion without primary key

I am using SQL Server where I have designed a view to sum the results of two tables and I want the output to be a single table with the results. My query simplified is something like:
SELECT SUM(col1), col2, col3
FROM Table1
GROUP BY col2, col3
This gives me the data I want, but when updating my EDM the view is excluded because "a primary key cannot be inferred".
With a little research I modified the query to spoof an id column, as follows:
SELECT ROW_NUMBER() OVER (ORDER BY col2) AS 'ID', SUM(col1), col2, col3
FROM Table1
GROUP BY col2, col3
This kind of query gives me a nice increasing set of ids. However, when I attempt to update my model it still excludes my view because it cannot infer a primary key. How can we use views that aggregate records and connect them with Linq-to-Entities?
As already discussed in the comments you can try adding MAX(id) as id to the view. Based on your feedback this would become:
SELECT ISNULL(MAX(id), 0) as ID,
SUM(col1),
col2,
col3
FROM Table1
GROUP BY col2, col3
Another option is to try creating an index on the view:
CREATE UNIQUE CLUSTERED INDEX idx_view1 ON dbo.View1(id)
I use this expression when altering the view:
ISNULL(ROW_NUMBER() OVER(ORDER BY ActionDate DESC), -1) AS RowID
I use this clause in views and queries that join multiple tables.
ROW_NUMBER never returns NULL, so the -1 fallback is never actually produced.
This is all I needed to add in order to import my view into EF6.
select ISNULL(1, 1) keyField

select where not exists excluding identity column

I am inserting only new records that do not exist in a live table from a "dump" table. My issue is that there is an identity column that I don't want to insert into the live table; I want the live table's identity column to take care of incrementing the value. But I am getting the error "Insert Error: Column name or number of supplied values does not match table definition." Is there a way around this, or is the only fix to remove the identity column altogether?
Thanks,
Sam
You need to list all the needed columns in your query, excluding the identity column.
One more reason why you should never use SELECT *.
INSERT liveTable
(col1, col2, col3)
SELECT col1, col2, col3
FROM dumpTable dt
WHERE NOT EXISTS
(
SELECT 1
FROM liveTable lt
WHERE lt.Id = dt.Id
)
Pro tip: You can also achieve the above by using an OUTER JOIN between the dump and live tables and using WHERE liveTable.col1 IS NULL (you will probably need to qualify the selected column names with the dump table alias).
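The OUTER JOIN variant could look roughly like the sketch below. It runs on an in-memory SQLite database through Python's sqlite3 module rather than SQL Server, so INTEGER PRIMARY KEY AUTOINCREMENT stands in for an identity column. Treating col1 as the value that identifies a record is an assumption made for the example, as are the sample rows.

```python
import sqlite3

# SQLite stand-in for SQL Server (assumption); AUTOINCREMENT mimics IDENTITY.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE liveTable ("
    "Id INTEGER PRIMARY KEY AUTOINCREMENT, col1 TEXT, col2 TEXT, col3 TEXT)"
)
conn.execute("CREATE TABLE dumpTable (Id INTEGER, col1 TEXT, col2 TEXT, col3 TEXT)")
conn.executemany(
    "INSERT INTO liveTable (col1, col2, col3) VALUES (?, ?, ?)",
    [("a", "x", "1"), ("b", "y", "2")],
)
conn.executemany(
    "INSERT INTO dumpTable VALUES (?, ?, ?, ?)",
    [(10, "b", "y", "2"), (11, "c", "z", "3"), (12, "d", "w", "4")],
)

# Insert only dump rows with no match in the live table. The Id column is
# left out of the column list, so the live table assigns its own values.
conn.execute(
    """
    INSERT INTO liveTable (col1, col2, col3)
    SELECT dt.col1, dt.col2, dt.col3
    FROM dumpTable AS dt
    LEFT JOIN liveTable AS lt ON lt.col1 = dt.col1
    WHERE lt.col1 IS NULL
    """
)

live_ids = sorted(r[0] for r in conn.execute("SELECT Id FROM liveTable"))
live_col1 = sorted(r[0] for r in conn.execute("SELECT col1 FROM liveTable"))
print(live_ids)   # [1, 2, 3, 4]
print(live_col1)  # ['a', 'b', 'c', 'd']
```

The dump row with col1 = 'b' is skipped because it already exists in the live table, and the dump table's own Id values (10 to 12) are never copied.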
I figured out the issue.... my live table didn't have the ID field set as an identity, somehow when I created it that field wasn't set up correctly.
You can leave that column out of your insert statement, like this:
insert into destination (col2, col3, col4)
select col2, col3, col4 from source
Don't do just
insert into destination
select * from source