PostgreSQL Removing duplicates - postgresql

I am working on a Postgres query to remove duplicates from a table. The table below is dynamically generated, and I want to write a select query that drops a record when its value in the first column is a duplicate.
The table looks something like this:
1st col  2nd col
4        62
6        34
5        26
5        12
I want to write a select query which removes either row 3 or row 4.

There is no need for an intermediate table:
delete from df1
where ctid not in (select min(ctid)
from df1
group by first_column);
If you are deleting many rows from a large table, the approach with an intermediate table is probably faster.
If you just want to get unique values for one column, you can use:
select distinct on (first_column) *
from the_table
order by first_column;
Or simply
select first_column, min(second_column)
from the_table
group by first_column;
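A minimal runnable sketch of the delete above, using Python's built-in sqlite3 module so it can be tried without a server. This is an assumption-laden stand-in: SQLite's implicit rowid plays the role of PostgreSQL's ctid, and the column names first_column and second_column are taken from the answer, not from the asker's real table.

```python
import sqlite3

# Assumption: SQLite's rowid stands in for PostgreSQL's ctid here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df1 (first_column INTEGER, second_column INTEGER)")
conn.executemany(
    "INSERT INTO df1 VALUES (?, ?)",
    [(4, 62), (6, 34), (5, 26), (5, 12)],  # sample rows from the question
)

# Keep the physically first row of each first_column group, delete the rest.
conn.execute(
    """
    DELETE FROM df1
    WHERE rowid NOT IN (SELECT MIN(rowid) FROM df1 GROUP BY first_column)
    """
)

remaining = conn.execute(
    "SELECT first_column, second_column FROM df1 ORDER BY rowid"
).fetchall()
print(remaining)
```

With the sample data, only the second of the two rows with 5 in the first column is removed.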

This returns only the rows whose first value occurs exactly once:
select count(first) as cnt, first, min(second) as second
from df1
group by first
having count(first) = 1;
If you want to keep one of the rows (sorry, I initially missed that you wanted that):
select first, min(second)
from df1
group by first
Where the table's name is df1 and the columns are named first and second.
You can actually leave off the count(first) as cnt if you want.
At the risk of stating the obvious, once you know how to select the data you want (or don't want), deleting the records in any of a dozen ways is simple.
If you want to replace the table or make a new table, you can just use create table as for the deletion:
create table tmp as
select count(first) as cnt, first, min(second) as second
from df1
group by first
having count(first) = 1;
drop table df1;
create table df1 as select * from tmp;
or using DELETE FROM:
DELETE FROM df1 WHERE first NOT IN (SELECT first FROM tmp);
You could also use select into, etc, etc.

If you want to SELECT unique rows:
SELECT * FROM ztable u
WHERE NOT EXISTS ( -- There is no other record
SELECT * FROM ztable x
WHERE x.id = u.id -- with the same id
AND x.ctid < u.ctid -- , but with a different(lower) "internal" rowid
); -- so u.* must be unique
If you want to SELECT the other rows, which were suppressed in the previous query:
SELECT * FROM ztable nu
WHERE EXISTS ( -- another record exists
SELECT * FROM ztable x
WHERE x.id = nu.id -- with the same id
AND x.ctid < nu.ctid -- , but with a different(lower) "internal" rowid
);
If you want to DELETE records, making the table unique (but keeping one record per id):
DELETE FROM ztable d
WHERE EXISTS ( -- another record exists
SELECT * FROM ztable x
WHERE x.id = d.id -- with the same id
AND x.ctid < d.ctid -- , but with a different(lower) "internal" rowid
);
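The NOT EXISTS / EXISTS pair above splits the table into two disjoint sets: the first row per id, and every later twin. A small sketch that checks this on toy data, again with sqlite3 and with rowid assumed as a substitute for ctid. The payload column and the sample rows are hypothetical.

```python
import sqlite3

# Assumption: rowid replaces PostgreSQL's ctid; payload is a made-up column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ztable (id INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO ztable VALUES (?, ?)",
    [(1, "a"), (2, "b"), (1, "c"), (2, "d"), (3, "e")],
)

# Subquery shared by both statements: "an earlier row with the same id exists".
earlier_twin = "SELECT 1 FROM ztable x WHERE x.id = u.id AND x.rowid < u.rowid"

kept = conn.execute(
    f"SELECT id, payload FROM ztable u WHERE NOT EXISTS ({earlier_twin}) ORDER BY u.rowid"
).fetchall()
suppressed = conn.execute(
    f"SELECT id, payload FROM ztable u WHERE EXISTS ({earlier_twin}) ORDER BY u.rowid"
).fetchall()

print(kept)
print(suppressed)
```

Every row lands in exactly one of the two lists, which is why the DELETE variant leaves one record per id.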

So basically I did this:
create temp table t1 as
select first, min(second) as second
from df1
group by first;
select * from df1
inner join t1 on t1.first = df1.first and t1.second = df1.second;
It's a satisfactory answer. Thanks for your help, @Hack-R.

Related

remove duplicate items from postgres

I need help writing a query to SELECT rows which have duplicate product IDs.
The table has 4 columns:
id,property_id,status,price
20,13356,sold,200000
24,78436,sold,730000
12504,13356,sold,200000
...
I currently have the following python script
from psycopg2.extensions import AsIs
import psycopg2
import psycopg2.extras  # needed for DictCursor
conn = psycopg2.connect(...)
cursor = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
def get_dict_sql(cur, query, single=False):
    # Run the query and return the result rows as plain dicts
    cur.execute(query)
    if single:
        return dict(cur.fetchone())
    z = cur.fetchall()
    return [dict(row) for row in z]
columns = ['property_id', 'status', 'price']
seen = set()
rows = get_dict_sql(cursor, "SELECT * FROM listings")
insert_statement = 'insert into listings_temp (%s) values %s'
for row in rows:
    # Skip rows whose product_id has already been copied
    if row['product_id'] in seen:
        continue
    seen.add(row['product_id'])
    values = [row[column] for column in columns]
    # Fill in the column list and the values before executing
    q2 = cursor.mogrify(insert_statement, (AsIs(','.join(columns)), tuple(values)))
    cursor.execute(q2)
conn.commit()
I created a new table to store the new data and started this script 26 hours ago, and it still hasn't finished. Is there a way to SELECT only the rows where product_id is duplicated?
Or, even better, a query which does this directly in Postgres?
The PostgreSQL way to fetch duplicates:
This gives you duplicates:
SELECT
*
FROM (
SELECT
*,
row_number() OVER (PARTITION BY product_id)
FROM
listings
) s
WHERE row_number >= 2
The row_number() window function numbers the rows within each group (the PARTITION, which here is each product_id). With that you can fetch only the rows whose number is >= 2, i.e. every row after the first one in its group.
To remove the fetched records directly, you can combine the SELECT statement with a DELETE statement:
DELETE FROM t
WHERE id IN
(
SELECT
id
FROM (
SELECT
*,
row_number() OVER (PARTITION BY product_id)
FROM
t
) s
WHERE row_number >= 2
);
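As a rough check of the row_number() delete, here is a sketch against an in-memory SQLite database (assuming a SQLite build with window functions, 3.25 or later). Two things differ from the answer and are my assumptions: the window gets an ORDER BY id so that the surviving row is deterministic, and the column is aliased as rn because SQLite does not name it row_number automatically. The column name product_id follows the answer's wording.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings (id INTEGER, product_id INTEGER, status TEXT, price INTEGER)"
)
conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?, ?)",
    [
        (20, 13356, "sold", 200000),
        (24, 78436, "sold", 730000),
        (12504, 13356, "sold", 200000),  # same product_id as id 20
    ],
)

# Number rows per product_id (lowest id first) and delete every row after the first.
conn.execute(
    """
    DELETE FROM listings
    WHERE id IN (
        SELECT id FROM (
            SELECT id,
                   row_number() OVER (PARTITION BY product_id ORDER BY id) AS rn
            FROM listings
        ) s
        WHERE rn >= 2
    )
    """
)

remaining_ids = [r[0] for r in conn.execute("SELECT id FROM listings ORDER BY id")]
print(remaining_ids)
```

Because the whole operation is a single statement, the database does the grouping itself instead of shipping every row to a client script.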

How to select multiple columns from a table using GROUP BY (based on one column), HAVING and COUNT in a Hive query

Requirement :
Using group by A and get records having count > 1
eg:
SELECT count(sk), id, sk
FROM table x
GROUP BY id
HAVING COUNT(sk) > 1
But I am not able to select sk in the select statement. Is there any other way to do this? How can I use PARTITION for this input and output set?
You can do something like this:
select * from (
SELECT count(sk)over(partition by id) as cnt, id, sk
FROM table x) a
where a.cnt >1
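The windowed count keeps every row (and so every sk) while still exposing the group size. A small sketch of the same query shape, run on SQLite rather than Hive as an assumption for portability. The table name x and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (id INTEGER, sk TEXT)")
conn.executemany(
    "INSERT INTO x VALUES (?, ?)",
    [(1, "a"), (1, "b"), (2, "c"), (3, "d"), (3, "e")],
)

# count(...) OVER (PARTITION BY id) attaches the group size to each row,
# so sk can be selected without appearing in a GROUP BY.
rows = conn.execute(
    """
    SELECT cnt, id, sk
    FROM (
        SELECT count(sk) OVER (PARTITION BY id) AS cnt, id, sk
        FROM x
    ) a
    WHERE a.cnt > 1
    ORDER BY id, sk
    """
).fetchall()
print(rows)
```

The row with id 2 is dropped because its group has only one member.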

how to select identical rows in postgresql?

The dataset that I'm looking into has an id for the incident, but a few columns (a_dttm, b_dttm, and c_dttm) have dates and times that appear more than once. I looked into it and found that even though the ids are unique, there are entire rows that look almost identical.
So without having to go through 200 rows of potential identical rows, what can I write in postgres to search for rows that are identical in a_dttm, b_dttm, and c_dttm?
This is what I've been doing to select the identical rows one by one:
SELECT *
FROM data
WHERE a_dttm::timestamp = '2007-01-13 08:29:35'
order by a_dttm desc
I got the timestamp from another query.
I know if these three columns are completely identical, then the rows are for sure duplicates.
Try:
select count(*), a_dttm, b_dttm, c_dttm
from data
group by a_dttm, b_dttm, c_dttm;
This shows how many rows share each combination; any count above 1 means you have duplicates.
This will select all the rows for which (at least one) other row exists, with the same {a_dttm,b_dttm,c_dttm}, but with a different id:
SELECT *
FROM the_table t
WHERE EXISTS (
SELECT *
FROM the_table x
WHERE x.a_dttm = t.a_dttm -- same
AND x.b_dttm = t.b_dttm -- same
AND x.c_dttm = t.c_dttm -- same
AND x.id <> t.id -- different
);
Similar, but now actually DELETING (some of) the duplicates:
DELETE
FROM the_table t
WHERE EXISTS (
SELECT *
FROM the_table x
WHERE x.a_dttm = t.a_dttm -- same
AND x.b_dttm = t.b_dttm -- same
AND x.c_dttm = t.c_dttm -- same
AND x.id > t.id -- different (actually: with a higher id)
);
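To see which rows the two EXISTS queries touch, here is a sketch on three toy rows, run through sqlite3 as an assumed stand-in for PostgreSQL. The timestamps are invented, apart from the one quoted in the question, and are stored as text for simplicity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE the_table (id INTEGER, a_dttm TEXT, b_dttm TEXT, c_dttm TEXT)"
)
conn.executemany(
    "INSERT INTO the_table VALUES (?, ?, ?, ?)",
    [
        (1, "2007-01-13 08:29:35", "2007-01-13 08:40:00", "2007-01-13 09:00:00"),
        (2, "2007-01-13 08:29:35", "2007-01-13 08:40:00", "2007-01-13 09:00:00"),
        (3, "2007-01-14 10:00:00", "2007-01-14 10:05:00", "2007-01-14 10:30:00"),
    ],
)

# Condition shared by both statements; {op} is the comparison on id.
match = """
    SELECT 1 FROM the_table x
    WHERE x.a_dttm = t.a_dttm AND x.b_dttm = t.b_dttm AND x.c_dttm = t.c_dttm
      AND x.id {op} t.id
"""

# Rows that have at least one twin with a different id.
duplicate_ids = [
    r[0]
    for r in conn.execute(
        f"SELECT id FROM the_table t WHERE EXISTS ({match.format(op='<>')}) ORDER BY id"
    )
]

# Delete every row for which a twin with a higher id exists (keeps the highest id).
conn.execute(
    f"DELETE FROM the_table AS t WHERE EXISTS ({match.format(op='>')})"
)
remaining_ids = [r[0] for r in conn.execute("SELECT id FROM the_table ORDER BY id")]

print(duplicate_ids, remaining_ids)
```

One caveat: the `=` comparisons never match when a column is NULL, so rows with a missing timestamp are not reported as duplicates by either query.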

UPDATE column from one table to another

I need to update a column in a table to the latest date/time combination from another table. How can I get the latest date/time combination from the one table and then update a column with that date in another table?
The two tables I am using are called dbo.DueDate and dbo.PurchaseOrders. The JOIN between the two tables is dbo.DueDate.XDORD = dbo.PurchaseOrders.PBPO AND dbo.DueDate.XDLINE = dbo.PurchaseOrders.PBSEQ. The columns from dbo.DueDate that I need the latest date/time from are dbo.DueDate.XDCCTD and dbo.DueDate.XDCCTT.
I need to set dbo.PurchaseOrders.PBDUE = dbo.DueDate.XDCURDT. I can't use an ORDER BY clause in the UPDATE statement, so I'm not sure how to do this. I know row_number sometimes works in these situations, but I'm unsure how to implement it.
The general pattern is:
;WITH s AS
(
SELECT
key, -- may be multiple columns
date_col,
rn = ROW_NUMBER() OVER
(
PARTITION BY key -- again, may be multiple columns
ORDER BY date_col DESC
)
FROM dbo.SourceTable
)
UPDATE d
SET d.date_col = s.date_col
FROM dbo.DestinationTable AS d
INNER JOIN s
ON d.key = s.key -- one more time, may need multiple columns here
WHERE s.rn = 1;
I didn't try to map your table names and columns because (a) I didn't get from your word problem which table was the source and which was the destination and (b) those column names look like alphabet soup and I would have screwed them up anyway.
It did seem, though, that the OP got this specific code working:
;WITH s AS
(
SELECT
XDORD, XDLINE,
XDCURDT,
rn = ROW_NUMBER() OVER
(
PARTITION BY XDORD, XDLINE
ORDER BY XDCCTD DESC, XDCCTT desc
)
FROM dbo.DueDate
)
UPDATE d
SET d.PBDUE = s.XDCURDT
FROM dbo.PurchaseOrders AS d
INNER JOIN s
ON d.PBPO = s.XDORD AND d.PBSEQ = s.XDLINE
WHERE s.rn = 1;
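For readers without SQL Server at hand, the same "latest row per key" idea can be tried on SQLite. This sketch keeps the ROW_NUMBER() CTE but swaps the T-SQL UPDATE ... FROM join for a correlated subquery, which is my substitution and not the answer's syntax. The column types (integer dates and times, text due dates) and all sample values are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE DueDate (XDORD INT, XDLINE INT, XDCCTD INT, XDCCTT INT, XDCURDT TEXT);
    CREATE TABLE PurchaseOrders (PBPO INT, PBSEQ INT, PBDUE TEXT);
    INSERT INTO DueDate VALUES
        (100, 1, 20240101,  90000, '2024-02-01'),
        (100, 1, 20240105,  83000, '2024-02-10'),
        (100, 1, 20240105, 120000, '2024-02-15'),
        (200, 1, 20240103, 100000, '2024-03-01');
    INSERT INTO PurchaseOrders VALUES
        (100, 1, NULL), (200, 1, NULL), (300, 1, '2024-01-01');
    """
)

# rn = 1 marks the newest DueDate row per (XDORD, XDLINE).
conn.execute(
    """
    WITH s AS (
        SELECT XDORD, XDLINE, XDCURDT,
               ROW_NUMBER() OVER (
                   PARTITION BY XDORD, XDLINE
                   ORDER BY XDCCTD DESC, XDCCTT DESC
               ) AS rn
        FROM DueDate
    )
    UPDATE PurchaseOrders
    SET PBDUE = (
        SELECT s.XDCURDT FROM s
        WHERE s.XDORD = PurchaseOrders.PBPO
          AND s.XDLINE = PurchaseOrders.PBSEQ
          AND s.rn = 1
    )
    WHERE EXISTS (  -- leave orders without any DueDate row untouched
        SELECT 1 FROM s
        WHERE s.XDORD = PurchaseOrders.PBPO AND s.XDLINE = PurchaseOrders.PBSEQ
    )
    """
)

orders = conn.execute(
    "SELECT PBPO, PBSEQ, PBDUE FROM PurchaseOrders ORDER BY PBPO"
).fetchall()
print(orders)
```

The EXISTS guard plays the role of the INNER JOIN in the original: purchase orders with no matching DueDate row keep their existing PBDUE.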

Firebird 2.5 Removing Rows with Duplicate Fields

I am trying to remove duplicate values which, for some reason, were imported into a specific table.
There is no primary key in this table.
There are 27797 unique records.
Select distinct txdate, plunumber from itemaudit
gives me the correct records, but of course only displays txdate and plunumber.
If it were possible to select all the fields but apply the distinct only to txdate, plunumber, I could export the values, delete the duplicated ones and re-import.
Or, if it's possible, to delete the duplicate values directly from the table.
If I select the distinct of all fields, the count is incorrect.
To get all information on the duplicates, you simply need to query all information for the duplicate rows using a JOIN:
SELECT b.*
FROM (SELECT COUNT(*) as cnt, txdate, plunumber
FROM itemaudit
GROUP BY txdate, plunumber
HAVING COUNT(*) > 1) a
INNER JOIN itemaudit b ON a.txdate = b.txdate AND a.plunumber = b.plunumber
To delete the duplicates directly, keeping one row per (txdate, plunumber), you can compare Firebird's RDB$DB_KEY pseudo-column:
DELETE FROM itemaudit t1
WHERE EXISTS (
SELECT 1 FROM itemaudit t2
WHERE t1.txdate = t2.txdate and t1.plunumber = t2.plunumber
AND t1.RDB$DB_KEY < t2.RDB$DB_KEY
);