Delete duplicate rows based on columns - PostgreSQL

I have a table called Aircraft and there are many records. The problem is that some are duplicates. I know how to select the duplicates and their counts:
SELECT flight_id, latitude, longitude, altitude, call_sign, measurement_time, COUNT(*)
FROM Aircraft
GROUP BY flight_id, latitude, longitude, altitude, call_sign, measurement_time
HAVING COUNT(*) > 1;
This returns each set of duplicated values along with its count.
Now, what I need to do is remove the duplicates, leaving just one each so that when I run the query again, all counts become 1.
I know that I can use the DELETE keyword, but I'm not sure how to turn that SELECT into a DELETE.
I'm sure I'm missing an easy step, but being a newbie, I don't want to ruin my DB.
How do I do this?

Assuming the table has a unique id column, this selects every row for which another row exists with the same values in all six columns but a smaller id:
SELECT
    flight_id, latitude, longitude, altitude, call_sign, measurement_time
FROM Aircraft a
WHERE EXISTS (
    SELECT *
    FROM Aircraft x
    WHERE x.flight_id = a.flight_id
      AND x.latitude = a.latitude
      AND x.longitude = a.longitude
      AND x.altitude = a.altitude
      AND x.call_sign = a.call_sign
      AND x.measurement_time = a.measurement_time
      AND x.id < a.id
);
If the query above returns the correct rows (the ones to be deleted), you can change it into a DELETE statement:
DELETE
FROM Aircraft a
WHERE EXISTS (
    SELECT *
    FROM Aircraft x
    WHERE x.flight_id = a.flight_id
      AND x.latitude = a.latitude
      AND x.longitude = a.longitude
      AND x.altitude = a.altitude
      AND x.call_sign = a.call_sign
      AND x.measurement_time = a.measurement_time
      AND x.id < a.id
);
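If the table has no usable id column, the same pattern should work in PostgreSQL with the system column ctid (the physical row location) as the tie-breaker; a minimal sketch, assuming the same six columns define a duplicate:
DELETE
FROM Aircraft a
WHERE EXISTS (
    SELECT *
    FROM Aircraft x
    WHERE x.flight_id = a.flight_id
      AND x.latitude = a.latitude
      AND x.longitude = a.longitude
      AND x.altitude = a.altitude
      AND x.call_sign = a.call_sign
      AND x.measurement_time = a.measurement_time
      AND x.ctid < a.ctid  -- ctid gives an ordering of physical rows, so one arbitrary row per group survives
);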

I have always used the CTE method in SQL Server. You decide which columns make up a duplicate, number the rows within each group of identical values, and then delete every row whose number is greater than 1. This is an example of duplicate checking that I do.
WITH CTE AS
(
    SELECT d.UID
         , d.LotKey
         , d.SerialNo
         , d.HotWeight
         , d.MarketValue
         , RN = ROW_NUMBER() OVER (PARTITION BY d.HotWeight, d.SerialNo, d.MarketValue ORDER BY d.SerialNo)
    FROM LotDetail d
    WHERE d.LotKey = ('1~20161019~305')
)
DELETE FROM CTE WHERE RN <> 1;
In my example I am looking at the LotDetail table where d.HotWeight and d.SerialNo are matching. If there is a match, the original gets RN 1 and any duplicates get RN 2 or greater, depending on the number of duplicates. Then the final DELETE statement clears the entries that come up as duplicates. This is really flexible, so you should be able to adapt it to your issue.
Here is an example tailored to your situation.
WITH CTE AS
(
    SELECT d.Flight_ID
         , d.Latitude
         , d.Longitude
         , d.Altitude
         , d.Call_Sign
         , d.Measurement_Time
         , RN = ROW_NUMBER() OVER (PARTITION BY d.Flight_ID, d.Latitude, d.Longitude, d.Altitude, d.Call_Sign, d.Measurement_Time ORDER BY d.Flight_ID)
    FROM Aircraft d
    WHERE d.Flight_ID = ('**INSERT VALUE HERE**')
)
DELETE FROM CTE WHERE RN <> 1;
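Keep in mind that deleting through a CTE like this is SQL Server syntax; PostgreSQL will reject DELETE FROM CTE. A sketch of the same row-numbering idea in PostgreSQL, assuming no primary key and deleting via the system column ctid instead:
WITH ranked AS
(
    SELECT ctid AS row_id,
           ROW_NUMBER() OVER (
               PARTITION BY flight_id, latitude, longitude, altitude, call_sign, measurement_time
               ORDER BY flight_id
           ) AS rn
    FROM Aircraft
)
DELETE FROM Aircraft
WHERE ctid IN (SELECT row_id FROM ranked WHERE rn > 1);  -- rn 1 survives in each group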

If it's a one-time operation you can create a temp table with the same schema and then copy unique rows over like so:
insert into Aircraft_temp
select distinct on (flight_id, measurement_time) Aircraft.*
from Aircraft
order by flight_id, measurement_time;  -- which row survives per pair is arbitrary without further sort keys
Then swap them out by renaming, or truncate Aircraft and copy the temp contents back (truncate Aircraft; insert into Aircraft select * from Aircraft_temp;).
Safer to rename Aircraft to Aircraft_old and Aircraft_temp to Aircraft so you keep your original data until you are sure things are correct. Or at least check that the number of rows from your count query above matches the number of rows in the temp table before doing the truncate.
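For reference, a minimal sketch of the surrounding steps (Aircraft_temp is just a placeholder name):
-- copy columns, defaults, indexes, and constraints without data
CREATE TABLE Aircraft_temp (LIKE Aircraft INCLUDING ALL);

-- ... the insert ... select distinct on ... from above goes here ...

-- swap by renaming, keeping the original until verified
BEGIN;
ALTER TABLE Aircraft RENAME TO Aircraft_old;
ALTER TABLE Aircraft_temp RENAME TO Aircraft;
COMMIT;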
Update 2: With a separate valid primary key (assuming it is called id) you can do a DELETE based on a self-join like this:
delete from Aircraft using (
    select a1.id
    from Aircraft a1
    left join (
        select flight_id, measurement_time, min(id) as id
        from Aircraft
        group by 1, 2
    ) a2 on a1.id = a2.id
    where a2.id is null
) as d
where Aircraft.id = d.id;
This finds the minimum id (you could use max instead to keep the latest record) for each (flight_id, measurement_time) group, then identifies every record in the full set whose id is not that minimum (no match in the join). Those unmatched ids are deleted.
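Whichever variant you use, a cheap sanity check is to compare total rows against distinct groups; once deduplicated, the two counts should match:
SELECT COUNT(*) AS total_rows,
       COUNT(DISTINCT (flight_id, measurement_time)) AS distinct_groups
FROM Aircraft;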

Related

How to select identical rows in PostgreSQL?

The dataset that I'm looking into has an id for the incident, but a few columns (a_dttm, b_dttm, and c_dttm) have dates and times that appear more than once. I looked into it and found that even though the ids are unique, there are entire rows that look almost identical.
So without having to go through 200 rows of potential identical rows, what can I write in postgres to search for rows that are identical in a_dttm, b_dttm, and c_dttm?
This is what I've been doing to select the identical rows one by one:
SELECT *
FROM data
WHERE a_dttm::timestamp = '2007-01-13 08:29:35'
order by a_dttm desc
I got the timestamp from another query.
I know if these three columns are completely identical, then the rows are for sure duplicates.
Try
select count(*), a_dttm, b_dttm, c_dttm
from data
group by a_dttm, b_dttm, c_dttm;
This should tell you how many duplicates you have.
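To list only the combinations that actually occur more than once, add a HAVING clause:
select count(*), a_dttm, b_dttm, c_dttm
from data
group by a_dttm, b_dttm, c_dttm
having count(*) > 1;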
This will select all the rows for which (at least one) other row exists, with the same {a_dttm,b_dttm,c_dttm}, but with a different id:
SELECT *
FROM the_table t
WHERE EXISTS (
    SELECT *
    FROM the_table x
    WHERE x.a_dttm = t.a_dttm  -- same
      AND x.b_dttm = t.b_dttm  -- same
      AND x.c_dttm = t.c_dttm  -- same
      AND x.id <> t.id         -- different
);
Similar, but now actually DELETING (some of) the duplicates:
DELETE
FROM the_table t
WHERE EXISTS (
    SELECT *
    FROM the_table x
    WHERE x.a_dttm = t.a_dttm  -- same
      AND x.b_dttm = t.b_dttm  -- same
      AND x.c_dttm = t.c_dttm  -- same
      AND x.id > t.id          -- different (actually: with a higher id)
);
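As written, this keeps the row with the highest id in each group; flip the comparison to x.id < t.id to keep the lowest instead.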

How to optimize selecting one random row from a set acquired by JOIN

Query in English:
Retrieve a random row from stuff.
row is not mentioned in done.
row belongs to the highest* scored friend.
*if no rows belonging to the highest scored friend are found, take the next friend, and so on.
My current query takes too long to complete, because it is randomly ordering all stuff, while it should randomly order batch after batch.
Here is an sqlfiddle with tables and data.
My query:
WITH ordered_friends AS (
    SELECT *
    FROM friends
    ORDER BY score DESC
)
SELECT s.stuff_id
FROM ordered_friends
INNER JOIN (
    SELECT *
    FROM stuff
    ORDER BY random()
) AS s ON s.owner = ordered_friends.friend
WHERE NOT EXISTS (
    SELECT 1
    FROM done
    WHERE done.me = 42
      AND done.friend = s.owner
      AND done.stuff_id = s.stuff_id
)
-- but it should keep the order of ordered_friends (score)
-- it does not have to reorder all stuff
-- one batch for each friend is enough until a satisfying row is found.
LIMIT 1;
How about this?
SELECT s.stuff_id
FROM friends
CROSS JOIN LATERAL (
    SELECT stuff_id
    FROM stuff
    WHERE stuff.owner = friends.friend
      AND NOT EXISTS (
          SELECT 1
          FROM done
          WHERE done.me = 42
            AND done.friend = stuff.owner
            AND done.stuff_id = stuff.stuff_id
      )
    ORDER BY random()
    LIMIT 1
) s
ORDER BY friends.score DESC
LIMIT 1;
The following indexes would make it fast:
CREATE INDEX ON friends(score); -- for sorting
CREATE INDEX ON stuff(owner); -- for the nested loop
CREATE INDEX ON done(stuff_id, friend); -- for NOT EXISTS
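The lateral subquery runs once per friend, so only that friend's stuff rows are shuffled by random(); a friend with nothing left contributes no row at all, so the outer ORDER BY friends.score DESC LIMIT 1 falls through to the next-highest-scored friend automatically.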

UPDATE column from one table to another

I need to update a column in a table to the latest date/time combination from another table. How can I get the latest date/time combination from the one table and then update a column with that date in another table?
The two tables I am using are called dbo.DueDate and dbo.PurchaseOrders. The join between the two tables is dbo.DueDate.XDORD = dbo.PurchaseOrders.PBPO AND dbo.DueDate.XDLINE = dbo.PurchaseOrders.PBSEQ. The columns from dbo.DueDate that I need the latest date/time from are dbo.DueDate.XDCCTD and dbo.DueDate.XDCCTT.
I need to set dbo.PurchaseOrders.PBDUE = dbo.DueDate.XDCURDT. I can't use an ORDER BY statement in an UPDATE statement, so I'm not sure how to do this. I know ROW_NUMBER sometimes works in these situations, but I'm unsure how to implement it.
The general pattern is:
;WITH s AS
(
    SELECT
        key,  -- may be multiple columns
        date_col,
        rn = ROW_NUMBER() OVER
        (
            PARTITION BY key  -- again, may be multiple columns
            ORDER BY date_col DESC
        )
    FROM dbo.SourceTable
)
UPDATE d
SET d.date_col = s.date_col
FROM dbo.DestinationTable AS d
INNER JOIN s
    ON d.key = s.key  -- one more time, may need multiple columns here
WHERE s.rn = 1;
I didn't try to map your table names and columns because (a) I couldn't tell from your word problem which table was the source and which was the destination, and (b) those column names look like alphabet soup and I would have screwed them up anyway.
It did seem, though, that the OP got this specific code working:
;WITH s AS
(
    SELECT
        XDORD, XDLINE,
        XDCURDT,
        rn = ROW_NUMBER() OVER
        (
            PARTITION BY XDORD, XDLINE
            ORDER BY XDCCTD DESC, XDCCTT DESC
        )
    FROM dbo.DueDate
)
UPDATE d
SET d.PBDUE = s.XDCURDT
FROM dbo.PurchaseOrders AS d
INNER JOIN s
    ON d.PBPO = s.XDORD AND d.PBSEQ = s.XDLINE
WHERE s.rn = 1;
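That's T-SQL; since most of this page is PostgreSQL, here is a hedged sketch of the same pattern in PostgreSQL, using DISTINCT ON instead of ROW_NUMBER (same table and column names assumed):
UPDATE PurchaseOrders d
SET PBDUE = s.XDCURDT
FROM (
    -- keep the latest row per (XDORD, XDLINE), "latest" by date then time
    SELECT DISTINCT ON (XDORD, XDLINE) XDORD, XDLINE, XDCURDT
    FROM DueDate
    ORDER BY XDORD, XDLINE, XDCCTD DESC, XDCCTT DESC
) s
WHERE d.PBPO = s.XDORD AND d.PBSEQ = s.XDLINE;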

Need better summation select statement in postgres function

I've got two tables in my database. One of them, 'orders', contains a set of integer columns representing what the order should contain (like 5 of A and 15 of B). The second table, 'production_work', contains those same order columns plus a date, so whenever somebody completes part of an order, I track it.
So now I need a fast way to know which orders are completed, and I'm hoping to avoid a 'completed' column on the first table, as orders are editable and it's just more logic to keep correct.
This query works, but it's horribly written. What's a better way to do this? There are actually 12 of these columns that go into this query... I'm just showing 3 of them for the example.
SELECT *
FROM orders o
WHERE ud = (SELECT SUM(ud) FROM production_work WHERE order_id = o.ident)
AND dp = (SELECT SUM(dp) FROM production_work WHERE order_id = o.ident)
AND swrv = (SELECT SUM(swrv) FROM production_work WHERE order_id = o.ident)
select o.*
from orders o
inner join (
    select order_id, sum(ud) as ud, sum(dp) as dp, sum(swrv) as swrv
    from production_work
    group by order_id
) pw on o.ident = pw.order_id
where o.ud = pw.ud
  and o.dp = pw.dp
  and o.swrv = pw.swrv;
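This aggregates production_work just once per order instead of once per column, and the inner join naturally drops orders with no production_work rows at all, which by definition aren't complete.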

Simple SELECT, but adding JOIN returns too many rows

The query below returns 9,817 records. Now, I want to SELECT one more field from another table. See the two lines that are commented out, where I've simply selected this additional field and added a JOIN to bind the new column. With these lines added, the query returns 649,200 records and I can't figure out why! I guess something is wrong with my WHERE criteria in conjunction with the JOIN. Please help, thanks.
SELECT DISTINCT dbo.IMPORT_DOCUMENTS.ITEMID, BEGDOC, BATCHID
--, dbo.CATEGORY_COLLECTION_CATEGORY_RESULTS.CATEGORY_ID
FROM IMPORT_DOCUMENTS
--JOIN dbo.CATEGORY_COLLECTION_CATEGORY_RESULTS
--  ON dbo.CATEGORY_COLLECTION_CATEGORY_RESULTS.ITEMID = dbo.IMPORT_DOCUMENTS.ITEMID
WHERE (BATCHID LIKE 'IC0%' OR BATCHID LIKE 'LP0%')
AND dbo.IMPORT_DOCUMENTS.ITEMID IN
    (SELECT dbo.CATEGORY_COLLECTION_CATEGORY_RESULTS.ITEMID
     FROM CATEGORY_COLLECTION_CATEGORY_RESULTS
     WHERE SCORE >= .7 AND SCORE <= .75
       AND CATEGORY_ID IN (SELECT CATEGORY_ID FROM CATEGORY_COLLECTION_CATS WHERE COLLECTION_ID IN (11,16))
       AND Sample_Id > 0)
AND dbo.IMPORT_DOCUMENTS.ITEMID NOT IN
    (SELECT ASSIGNMENT_FOLDER_DOCUMENTS.Item_Id FROM ASSIGNMENT_FOLDER_DOCUMENTS)
One possible reason is that one of your tables contains data at a lower level than your join key; for example, there may be multiple records per ITEMID, so the same ITEMID is repeated several times and the join multiplies the rows. Without knowing the data, try running the modified query below; if the output is not what you're looking for, convert it into a SELECT within a SELECT.
Hope this helps....
Try this SQL:
SELECT DISTINCT a.ITEMID, a.BEGDOC, a.BATCHID, b.CATEGORY_ID
FROM IMPORT_DOCUMENTS a
JOIN (SELECT DISTINCT ITEMID, CATEGORY_ID
      FROM CATEGORY_COLLECTION_CATEGORY_RESULTS
      WHERE SCORE >= .7 AND SCORE <= .75
        AND CATEGORY_ID IN (SELECT DISTINCT CATEGORY_ID FROM CATEGORY_COLLECTION_CATS WHERE COLLECTION_ID IN (11,16))
        AND Sample_Id > 0) b
  ON a.ITEMID = b.ITEMID
WHERE (a.BATCHID LIKE 'IC0%' OR a.BATCHID LIKE 'LP0%')
  AND a.ITEMID NOT IN (SELECT DISTINCT Item_Id FROM ASSIGNMENT_FOLDER_DOCUMENTS)