Handling PostgreSQL similar, but not quite duplicate records

I have a table that contains a number of rows that are duplicated. I can get rid of those with:
DELETE FROM files
WHERE id IN (
SELECT id
FROM (SELECT id, ROW_NUMBER() OVER (partition BY name, active, filesize, start_timestamp, end_timestamp ORDER BY id) AS rnum
FROM files) t
WHERE t.rnum > 1);
This works fine most of the time. However, I have a number of rows where the filesize and end_timestamp change but all of the remaining data stays the same. When such near-duplicate records exist, I would like to set the active attribute to false on the records with the smaller filesize and end_timestamp, leaving only the record with the largest values active.
I'm having a moment and can't seem to figure out how to do that.
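One possible approach, offered as a sketch rather than something from the original thread: rank the rows within each group so that the largest filesize and latest end_timestamp come first, then set active to false on every row ranked lower. The sketch assumes that name and start_timestamp are the columns that identify a group; adjust the PARTITION BY list to match whatever "the remaining data" is in the real table.

```sql
-- Assumption: rows sharing name and start_timestamp are versions of the same file.
UPDATE files
SET active = false
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER (
                   PARTITION BY name, start_timestamp
                   -- Largest filesize and latest end_timestamp rank first;
                   -- id is only a tie-breaker to keep the ranking deterministic.
                   ORDER BY filesize DESC, end_timestamp DESC, id
               ) AS rnum
        FROM files
    ) t
    WHERE t.rnum > 1   -- everything except the top-ranked row in each group
);
```

If filesize or end_timestamp can be NULL, note that PostgreSQL sorts NULLs first under DESC, so adding NULLS LAST to those sort keys may be needed.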

Related

Select from a delete subquery returning values

I'm trying to combine two steps into a single query. I'm trying to remove rows from one table with a particular store ID, and then deactivate employees on another table if they no longer have any matching rows in the first table. Here's what I've got:
UPDATE business.employee
SET active = FALSE
WHERE employee_id IN
(SELECT employee_id FROM (DELETE FROM business.employeeStore
WHERE store_id = 1000
RETURNING employee_id) Deleted
LEFT JOIN business.employeeStore EmployeeStore
ON Deleted.employee_id = EmployeeStore.employee_id
WHERE EmployeeStore.store_id IS NULL)
Logically, I think what I've written is sound, but syntactically, it's not quite there. It seems like this should be possible, since the DELETE FROM subquery is returning a single column table of results, and that subquery works fine by itself. But it tells me there is a syntax error at or near FROM. Even if I don't include the UPDATE portion of the query, and just do the interior SELECT part, it gives me the same error.
UPDATE: I tried using a WITH command to get around the syntax problem as follows:
WITH Deleted AS (DELETE FROM business.employeeStore
WHERE store_id = 1000
RETURNING employee_id)
UPDATE business.employee
SET active = FALSE
WHERE employee_id IN
(SELECT employee_id FROM Deleted
LEFT JOIN business.employeeStore EmployeeStore
ON Deleted.employee_id = EmployeeStore.employee_id
WHERE EmployeeStore.store_id IS NULL)
This doesn't produce any errors, but after playing around with the code for a while, I've determined that while it does get the results from the WITH part, it doesn't actually do the DELETE until after the UPDATE completes. So the SELECT subquery doesn't return any results.

I finally was able to work out how to do this using the WITH. The main issue was needing to handle the table in its pre-DELETE state. I've kept it all in one query like so:
WITH Deleted AS
(DELETE FROM business.employeeStore
WHERE store_id = 1000
RETURNING employee_id)
UPDATE business.employee
SET active = FALSE
WHERE employee_id IN
(SELECT employee_id FROM Deleted)
AND employee_id NOT IN
(SELECT employee_id FROM Deleted
JOIN business.employeeStore EmployeeStore
ON Deleted.employee_id = EmployeeStore.employee_id
WHERE EmployeeStore.store_id != 1000)
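For comparison, here is a sketch of the same idea written with NOT EXISTS, which avoids the surprises NOT IN can produce when a subquery returns NULLs. It relies on the documented PostgreSQL behaviour that all sub-statements of a WITH query see the same snapshot, so business.employeeStore is still read in its pre-DELETE state inside the UPDATE. The aliases e, d and es are introduced here for illustration only.

```sql
WITH deleted AS (
    DELETE FROM business.employeeStore
    WHERE store_id = 1000
    RETURNING employee_id
)
UPDATE business.employee e
SET active = FALSE
FROM (SELECT DISTINCT employee_id FROM deleted) d
WHERE e.employee_id = d.employee_id
  -- employeeStore is seen as it was before the DELETE, so rows for store 1000
  -- are still visible; only rows for other stores should keep an employee active.
  AND NOT EXISTS (
      SELECT 1
      FROM business.employeeStore es
      WHERE es.employee_id = e.employee_id
        AND es.store_id <> 1000
  );
```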

Find time difference between two most recent orders

I am trying to estimate the time of a new order from repeat customers by finding the time difference between the most recent order and the second most recent order, and then adding that difference to the most recent order.
I have been trying limit and offset, but this returns a blanket date for every row. I am thinking I need to do a lateral join, but not sure how to implement it correctly. When I try to do it, I receive no output.
select public.orders.customer_id,
max(public.orders.created_at) as last_order_date,
(select created_at from public.orders group by created_at order by created_at desc limit 1 offset 1) as second_last
from public.orders
inner join
(select
customer_id, count(*)
from public.orders
where status = 'fulfilled'
group by public.orders.customer_id
having count(customer_id) >1) repeat_customers
on public.orders.customer_id = repeat_customers.customer_id
group by public.orders.customer_id;
I wanted the second_last field to be populated by the second most recent date for each customer_id, but the output is the second most recent date for the entire table, resulting in the same date for every entry.

Your second_last subquery is not limited to the current customer, so it returns the second most recent date in the whole table, which is exactly the result you are seeing. Correlating it with the outer query in a WHERE clause, and sorting in descending order so that OFFSET 1 skips the most recent order, should fix this:
(SELECT
created_at
FROM
public.orders po
WHERE
po.customer_id = public.orders.customer_id
ORDER BY
created_at DESC
LIMIT 1 OFFSET 1) AS second_last
I've aliased the inner table as po so that public.orders.customer_id refers unambiguously to the table in the main select. An unqualified customer_id would resolve to the inner table, and the condition would then be true for every row.
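A window-function variant can compute the whole estimate in one pass. This is only a sketch, and it assumes that only orders with status = 'fulfilled' should count and that created_at is a timestamp column; the names estimated_next_order and rn are made up for the example.

```sql
SELECT customer_id,
       last_order_date,
       second_last,
       -- timestamp - timestamp yields an interval; adding it back projects the next order
       last_order_date + (last_order_date - second_last) AS estimated_next_order
FROM (
    SELECT customer_id,
           created_at AS last_order_date,
           -- previous order of the same customer, in chronological order
           lag(created_at) OVER (PARTITION BY customer_id ORDER BY created_at) AS second_last,
           -- 1 marks the most recent order of each customer
           row_number() OVER (PARTITION BY customer_id ORDER BY created_at DESC) AS rn
    FROM public.orders
    WHERE status = 'fulfilled'
) t
WHERE rn = 1
  AND second_last IS NOT NULL;   -- keeps repeat customers only
```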

select last of an item for each user in postgres

I want to get the last entry for each user, but customer_id is a hash ('ASAG#...') and ordering by customer_id breaks the query. Is there an alternative?
Select Distinct On (l.customer_id)
l.customer_id
,l.created_at
,l.text
From likes l
Order By l.customer_id, l.created_at Desc

Your current query already appears to be working, q.v. here:
Demo
I don't know why your current query is not generating the results you expect. It should return one record for every customer, namely the most recent one, given your ORDER BY clause.
In any case, if it does not do what you want, an alternative is to use ROW_NUMBER() with a partition by customer. The inner query numbers the rows within each customer's partition, with the value 1 going to that customer's most recent record. The outer query then retains only the latest record.
SELECT
t.customer_id,
t.created_at,
t.text
FROM
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at DESC) rn
FROM likes
) t
WHERE t.rn = 1
To speed up the inner query which uses ROW_NUMBER() you can try adding a composite index on the customer_id and created_at columns:
CREATE INDEX yourIdx ON likes (customer_id, created_at);

duplicate multi column entries postgresql

I have a bunch of data in a postgresql database. I think that two keys should form a unique pair,
so want to enforce that in the database. I try
create unique index key1_key2_idx on table(key1,key2)
but that fails, telling me that I have duplicate entries.
How do I find these duplicate entries so I can delete them?

select key1,key2,count(*)
from table
group by key1,key2
having count(*) > 1
order by 3 desc;
The critical part of the query to determine the duplicates is having count(*) > 1.
There are a whole bunch of neat tricks at the following link, including some examples of removing duplicates: http://postgres.cz/wiki/PostgreSQL_SQL_Tricks

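Assuming you only want to delete the duplicates and keep the original, the accepted answer is inaccurate: deleting everything it finds would remove your originals as well and keep only the records that had a single entry from the start. The query below works on 9.x and selects only the extra copies, leaving the first row of each group alone: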
SELECT * FROM tblname WHERE ctid IN
(SELECT ctid FROM
(SELECT ctid, ROW_NUMBER() OVER
(partition BY col1, col2, col3 ORDER BY ctid) AS rnum
FROM tblname) t
WHERE t.rnum > 1);
https://wiki.postgresql.org/wiki/Deleting_duplicates
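Since the query above only lists the extra copies, the deletion itself might look like the sketch below. It reuses the placeholder names tblname, col1, col2 and col3 from the answer, and the index name in the last statement is invented for the example. Running the SELECT version first to check what would be removed is a sensible precaution.

```sql
-- Remove every row except the first (lowest ctid) in each group of duplicates.
DELETE FROM tblname
WHERE ctid IN (
    SELECT ctid
    FROM (
        SELECT ctid,
               ROW_NUMBER() OVER (PARTITION BY col1, col2, col3 ORDER BY ctid) AS rnum
        FROM tblname
    ) t
    WHERE t.rnum > 1
);

-- With the duplicates gone, the unique index from the question can be created.
CREATE UNIQUE INDEX tblname_col1_col2_col3_idx ON tblname (col1, col2, col3);
```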

Resequencing a column with identifier in Postgresql

The following code works and creates a temporary table with a sequence number which is restarted for every new name:
with results as (select row_number() over (partition by name order BY name) as mytid,name from telephn_table)
select * from results order by name
My objective however is to insert the new sequence number permanently into the telephone table.
How do I transfer the new sequence number from the results table to the telephone table? I have come across the following for MySql but was not able to convert it to Postgresql.
MySQL: Add sequence column based on another field
Can anyone help?

If memory serves, row_number() returns the number within its own partition. In other words, row_number() over (partition by name order BY name) would return 1 for each row except duplicates. You likely want rank() over (order by name) instead.
After a long discussion:
update telephn_table
set sid = rows.new_sid
from (select pkey,
row_number() over (partition BY name) as new_sid,
name
from telephn_table
) as rows
where rows.pkey = telephn_table.pkey;
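One caveat with the statement above: without an ORDER BY inside the window, the numbering within each name is arbitrary and can differ between runs. A sketch that makes it deterministic, assuming pkey is a unique key that is acceptable as the ordering (the alias numbered is made up for the example):

```sql
UPDATE telephn_table
SET sid = numbered.new_sid
FROM (
    SELECT pkey,
           -- numbering restarts at 1 for every name and follows pkey order within it
           row_number() OVER (PARTITION BY name ORDER BY pkey) AS new_sid
    FROM telephn_table
) AS numbered
WHERE numbered.pkey = telephn_table.pkey;
```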

THIS WORKS! (See my OP link to a previous MySql link. In Postgresql it works without need for a temporary table)
alter table telephn_table add column tid integer default 0;
UPDATE telephn_table set tid=(SELECT count(*)+1 from telephn_table t where t.sid < telephn_table.sid and telephn_table.name=t.name)
