remove duplicate items from postgres - postgresql

I need help writing a query to SELECT rows which have duplicate product IDs.
The table has 4 columns:
id,property_id,status,price
20,13356,sold,200000
24,78436,sold,730000
12504,13356,sold,200000
...
I currently have the following Python script:
import psycopg2
import psycopg2.extras
from psycopg2.extensions import AsIs
conn = psycopg2.connect(...)
cursor = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
def get_dict_sql(cur, query, single=False):
    # run the query and return the result as plain dicts
    cur.execute(query)
    if single:
        return dict(cur.fetchone())
    z = cur.fetchall()
    return [dict(row) for row in z]
columns = ['property_id', 'status', 'price']
seen = set()
rows = get_dict_sql(cursor, "SELECT * FROM listings")
insert_statement = 'insert into listings_temp (%s) values %s'
for row in rows:
    # skip rows whose product_id has already been copied
    if row['product_id'] in seen:
        continue
    seen.add(row['product_id'])
    values = [row[column] for column in columns]
    # fill in the column list and the values, then run the finished statement
    q2 = cursor.mogrify(insert_statement, (AsIs(','.join(columns)), tuple(values)))
    cursor.execute(q2)
conn.commit()
I created a new table to store the new data and started this script 26 hours ago, and it still hasn't finished. Is there a way to SELECT only the rows where product_id is duplicated?
Or, even better, a query which does this directly in Postgres?

The PostgreSQL way to fetch duplicates:
This gives you duplicates:
SELECT
    *
FROM (
    SELECT
        *,
        row_number() OVER (PARTITION BY product_id)
    FROM
        listings
) s
WHERE row_number >= 2
The row_number() window function numbers the rows within each group (the PARTITION, here each product_id). With that you can fetch only the rows whose number is >= 2, i.e. every copy after the first. Because the window has no ORDER BY, which row of a group gets number 1 is arbitrary.
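To see the numbering that row_number() produces, here is a minimal sketch run from Python against an in-memory SQLite database, used only as a stand-in for PostgreSQL (this assumes SQLite 3.25 or newer, which accepts the same window-function syntax). The rows are the three samples from the question, with the column named product_id as in the query above. The ORDER BY id inside the window is an addition, not part of the original query, so that the lowest id in each group is always numbered 1.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL (assumption: SQLite >= 3.25,
# which supports window functions).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings (id INTEGER, product_id INTEGER, status TEXT, price INTEGER)"
)
conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?, ?)",
    [
        (20, 13356, "sold", 200000),
        (24, 78436, "sold", 730000),
        (12504, 13356, "sold", 200000),
    ],
)

# Number the rows inside each product_id group. ORDER BY id is an addition
# to the query above that makes the numbering deterministic.
numbered = conn.execute(
    """
    SELECT id, product_id,
           row_number() OVER (PARTITION BY product_id ORDER BY id) AS rn
    FROM listings
    ORDER BY id
    """
).fetchall()

# Rows numbered 2 or higher are the extra copies.
duplicates = [(row_id, product_id) for row_id, product_id, rn in numbered if rn >= 2]

print(numbered)
print(duplicates)
```

Only the row with id 12504 is reported, because it is the second row of the product_id 13356 group.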
To remove the fetched records directly, you can combine the SELECT statement with a DELETE statement:
DELETE FROM t
WHERE id IN
(
    SELECT
        id
    FROM (
        SELECT
            *,
            row_number() OVER (PARTITION BY product_id)
        FROM
            t
    ) s
    WHERE row_number >= 2
);
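The effect of the DELETE can be checked on a small table. This is a hedged sketch, again using in-memory SQLite from Python as a stand-in for PostgreSQL (assuming SQLite 3.25+). The table contents are made up for illustration, and the window gets an ORDER BY id (an addition) so that the row kept for each product_id is predictably the one with the lowest id.

```python
import sqlite3

# Hypothetical sample data: product_id 10 appears three times, 20 twice.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, product_id INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, 10), (2, 10), (3, 20), (4, 10), (5, 20)],
)

# Delete every row that is numbered 2 or higher within its product_id group.
conn.execute(
    """
    DELETE FROM t
    WHERE id IN (
        SELECT id
        FROM (
            SELECT id,
                   row_number() OVER (PARTITION BY product_id ORDER BY id) AS rn
            FROM t
        ) s
        WHERE rn >= 2
    )
    """
)
conn.commit()

remaining = conn.execute("SELECT id, product_id FROM t ORDER BY id").fetchall()
print(remaining)
```

One row per product_id survives; on a real table it is worth running the inner SELECT first to review what would be deleted.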

Related

Postgres, update statement from jsonb array with sorting

I have a jsonb column in my table. It contains an array of JSON objects, and
one of the fields in these objects is a date.
I have now added a new column of type timestamp to the table,
and I need a statement which helps me update the new column with the most recent date value from the jsonb array column of the same record.
The following statement works great for selecting the most recent date from the jsonb array column of a certain record:
select history.date
from document,
     jsonb_to_recordset(document.history) as history(date date)
where document.id = 'd093d6b0-702f-11eb-9439-0242ac130002'
order by history.date desc
limit 1;
For the update I have tried the following:
update document
set status_recent_change_date = subquery.history.date
from (
    select id, history.date
    from document,
         jsonb_to_recordset(document.history) as history(date date)
) as subquery
where document.id = subquery.id
order by history.date desc
limit 1;
This last statement does not work.
UPDATE document d
SET status_recent_change_date = s.date
FROM (
    SELECT DISTINCT ON (id)
        *
    FROM document,
        jsonb_to_recordset(document.history) AS history(date date)
    ORDER BY id, history.date DESC
) s
WHERE d.id = s.id;
Using LIMIT would not work because you limit the entire output of your SELECT statement. But you want to limit the output of each document.id. This can be done using DISTINCT ON (id).
This result can be used to update each record using their id values.
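The difference between LIMIT 1 and DISTINCT ON (id) can be illustrated without a database. The following is a rough Python sketch of the semantics only, not of how PostgreSQL executes the query; the document ids and dates are hypothetical.

```python
from datetime import date
from itertools import groupby
from operator import itemgetter

# Hypothetical (id, history date) pairs, as produced by expanding the jsonb array.
rows = [
    ("doc-a", date(2021, 1, 5)),
    ("doc-b", date(2021, 2, 1)),
    ("doc-a", date(2021, 3, 9)),
    ("doc-b", date(2020, 12, 31)),
]

# ORDER BY date DESC LIMIT 1: one row for the whole result set.
limit_one = sorted(rows, key=itemgetter(1), reverse=True)[:1]

# ORDER BY id, date DESC: sort by date descending, then stable-sort by id.
ordered = sorted(sorted(rows, key=itemgetter(1), reverse=True), key=itemgetter(0))

# DISTINCT ON (id): keep the first row of each id group in that order.
latest_per_id = {
    doc_id: next(group)[1] for doc_id, group in groupby(ordered, key=itemgetter(0))
}

print(limit_one)
print(latest_per_id)
```

LIMIT yields a single row overall, while the DISTINCT ON emulation yields the most recent date for every id, which is what the UPDATE needs.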
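You most likely don't need the LIMIT clause at all.
Sorting inside the subquery may be enough, though note that when several subquery rows match one document, PostgreSQL does not guarantee which of them the UPDATE uses: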
UPDATE document SET status_recent_change_date = subquery.hdate
FROM (
SELECT id, history.date AS hdate
FROM document, jsonb_to_recordset(document.history) AS history(date date)
ORDER BY history.date DESC
) AS subquery
WHERE document.id = subquery.id

tsql how to get unique rows with first and end date times

When I run the subqueries separately, I get unique dates for each row, but together I am getting repeating rows. How do I correct this?
SELECT a.*
    ,b.*
FROM (
    SELECT DISTINCT serial_number_id
        ,s_start_dttm
        ,x_dept_id AS dept_a
        ,x_dept_name AS dept_name_a
        ,event_type_c AS event_type_a
        ,evnt.NAME AS event_name_a
    FROM TABLE x
    WHERE event_type_c = 3
    ) a
    ,(
    SELECT DISTINCT serial_number_id
        ,s_end_dttm
        ,x_dept_id AS E_x_DEPT_ID
        ,x_DEPT_NAME AS E_x_DEPT_NAME
        ,event_type_c AS event_type_b
        ,event_name AS event_name_b
    FROM TABLE x
    WHERE event_type_c = 4
    ) b
WHERE a.serial_number_id = b.serial_number_id

find rows not followed by the same values in 3 columns

I have a table named raw_data with the following data
As you can see, ids 1 and 2 share the same values in the fields desa, kecamatan and kabupaten, as do ids 3, 4 and 5.
So basically I want to select all rows that do not repeat the values of the previous row. The expected result would be:
I know it's easy to do this in any programming language such as PHP, but I need this in PostgreSQL. Is this doable? Thanks in advance.
Assuming a higher id denotes a later row, and that a row whose three columns match an earlier but non-adjacent row should be kept (because it differs from the row immediately before it, ordered by id or created_date), you can use the analytic lag() function:
select *
from (
    select
        t.*,
        case
            when desa = lag(desa) over (order by id)
                and kecamatan = lag(kecamatan) over (order by id)
                and kabupaten = lag(kabupaten) over (order by id)
            then 0 else 1
        end flag
    from your_table t
) t where flag = 1;
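The lag() approach can be tried on a small data set. This is a sketch under stated assumptions: it runs from Python against in-memory SQLite as a stand-in for PostgreSQL (SQLite 3.25+ provides lag()), and the rows are hypothetical values arranged as the question describes, with ids 1 and 2 alike, ids 3 to 5 alike, plus an added id 6 that repeats the values of id 1 without being adjacent to it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_data (id INTEGER, desa TEXT, kecamatan TEXT, kabupaten TEXT)"
)
# Hypothetical rows: 1-2 share values, 3-5 share values, 6 repeats 1 non-adjacently.
conn.executemany(
    "INSERT INTO raw_data VALUES (?, ?, ?, ?)",
    [
        (1, "A", "K1", "B1"),
        (2, "A", "K1", "B1"),
        (3, "B", "K2", "B1"),
        (4, "B", "K2", "B1"),
        (5, "B", "K2", "B1"),
        (6, "A", "K1", "B1"),
    ],
)

# flag = 0 when all three columns equal those of the previous row (by id).
# For the first row lag() is NULL, the comparison is not true, so flag = 1.
kept_ids = [
    row[0]
    for row in conn.execute(
        """
        SELECT id
        FROM (
            SELECT id,
                   CASE
                       WHEN desa = lag(desa) OVER (ORDER BY id)
                        AND kecamatan = lag(kecamatan) OVER (ORDER BY id)
                        AND kabupaten = lag(kabupaten) OVER (ORDER BY id)
                       THEN 0 ELSE 1
                   END AS flag
            FROM raw_data
        ) t
        WHERE flag = 1
        ORDER BY id
        """
    )
]

print(kept_ids)
```

Row 6 is kept even though it matches row 1, because only the immediately preceding row is compared.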

PostgreSQL Removing duplicates

I am working on a Postgres query to remove duplicates from a table. The following table is dynamically generated, and I want to write a select query which will remove a record if its first column has a duplicate value.
The table looks something like this:
1st col  2nd col
4        62
6        34
5        26
5        12
I want to write a select query which removes either row 3 or row 4.
There is no need for an intermediate table:
delete from df1
where ctid not in (select min(ctid)
from df1
group by first_column);
If you are deleting many rows from a large table, the approach with an intermediate table is probably faster.
If you just want to get unique values for one column, you can use:
select distinct on (first_column) *
from the_table
order by first_column;
Or simply
select first_column, min(second_column)
from the_table
group by first_column;
select count(first) as cnt, first, min(second) as second
from df1
group by first
having(count(first) = 1)
if you want to keep one of the rows (sorry, I initially missed it if you wanted that):
select first, min(second)
from df1
group by first
Where the table's name is df1 and the columns are named first and second.
You can actually leave off the count(first) as cnt if you want.
At the risk of stating the obvious, once you know how to select the data you want (or don't want), deleting the records in any of a dozen ways is simple.
If you want to replace the table or make a new table you can just use create table as for the deletion:
create table tmp as
select count(first) as cnt, first, min(second) as second
from df1
group by first
having(count(first) = 1);
drop table df1;
create table df1 as select * from tmp;
or using DELETE FROM:
DELETE FROM df1 WHERE first NOT IN (SELECT first FROM tmp);
You could also use select into, etc, etc.
if you want to SELECT unique rows:
SELECT * FROM ztable u
WHERE NOT EXISTS ( -- There is no other record
SELECT * FROM ztable x
WHERE x.id = u.id -- with the same id
AND x.ctid < u.ctid -- , but with a different(lower) "internal" rowid
); -- so u.* must be unique
if you want to SELECT the other rows, which were suppressed in the previous query:
SELECT * FROM ztable nu
WHERE EXISTS ( -- another record exists
SELECT * FROM ztable x
WHERE x.id = nu.id -- with the same id
AND x.ctid < nu.ctid -- , but with a different(lower) "internal" rowid
);
if you want to DELETE records, making the table unique (but keeping one record per id):
DELETE FROM ztable d
WHERE EXISTS ( -- another record exists
SELECT * FROM ztable x
WHERE x.id = d.id -- with the same id
AND x.ctid < d.ctid -- , but with a different(lower) "internal" rowid
);
So basically I did this:
create temp table t1 as
select first, min(second) as second
from df1
group by first;
select * from df1
inner join t1 on t1.first = df1.first and t1.second = df1.second;
It's a satisfactory answer. Thanks for your help, @Hack-R.

After doing CTE Select Order By and then Update, Update results are not ordered the same (TSQL)

The code is roughly like this:
WITH cte AS
(
    SELECT TOP 4 id, due_date, check
    FROM table_a a
    INNER JOIN table_b b ON a.linkid = b.linkid
    WHERE
        b.status = 1
        AND due_date > GetDate()
    ORDER BY due_date, id
)
UPDATE cte
SET check = 1
OUTPUT
    INSERTED.id,
    INSERTED.due_date
Note: the actual rows all have the same due_date.
When I ran only the SELECT statement inside the CTE, I got the result in order, e.g. 1, 2, 3, 4.
But after the UPDATE statement, the updated results are: 4, 1, 2, 3
Why does this change of order happen?
How can I keep, or restore, the order 1, 2, 3, 4 within this same single query?
In MSDN https://msdn.microsoft.com/pl-pl/library/ms177564(v=sql.110).aspx you can read that
There is no guarantee that the order in which the changes are applied
to the table and the order in which the rows are inserted into the
output table or table variable will correspond.
That means you can't solve your problem with only one query, but you can still use one batch to do what you need. Because OUTPUT doesn't guarantee the order, you have to save its rows in another table and order them after the update. This code returns the output values in the order you expect:
declare @outputTable table( id int, due_date date);
with cte as (
    select top 4 id, due_date, check
    from table_a a
    inner join table_b b on a.linkid = b.linkid
    where b.status = 1
        and due_date > GetDate()
    order by due_date, id
)
update cte
set check = 1
output inserted.id, inserted.due_date
into @outputTable;
select *
from @outputTable
order by due_date, id;