How to select identical rows in PostgreSQL?

The dataset that I'm looking into has an id for the incident, but a few columns (a_dttm, b_dttm, and c_dttm) have dates and times that appear more than once. I looked into it and found that even though the ids are unique, there are entire rows that look almost identical.
So without having to go through 200 rows of potential identical rows, what can I write in postgres to search for rows that are identical in a_dttm, b_dttm, and c_dttm?
This is what I've been doing to select the identical rows one by one:
SELECT *
FROM data
WHERE a_dttm::timestamp = '2007-01-13 08:29:35'
order by a_dttm desc
I got the timestamp from another query.
I know if these three columns are completely identical, then the rows are for sure duplicates.

Try
select count(*), a_dttm, b_dttm, c_dttm
from data
group by a_dttm, b_dttm, c_dttm;
This shows how many rows share each combination of the three timestamps; any count greater than 1 marks a set of duplicates.
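To list only the combinations that actually repeat, a HAVING clause can be added to the grouped query. Below is a minimal sketch, assuming an in-memory SQLite database as a stand-in for PostgreSQL; the sample rows and the text column types are made up for illustration, and the SQL inside the string should also run on PostgreSQL.

```python
import sqlite3

# In-memory SQLite table standing in for the PostgreSQL table "data" (sample rows are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (id INTEGER PRIMARY KEY, a_dttm TEXT, b_dttm TEXT, c_dttm TEXT)"
)
conn.executemany(
    "INSERT INTO data (id, a_dttm, b_dttm, c_dttm) VALUES (?, ?, ?, ?)",
    [
        (1, "2007-01-13 08:29:35", "2007-01-13 08:40:00", "2007-01-13 09:00:00"),
        (2, "2007-01-13 08:29:35", "2007-01-13 08:40:00", "2007-01-13 09:00:00"),
        (3, "2007-02-01 10:00:00", "2007-02-01 10:05:00", "2007-02-01 10:30:00"),
    ],
)

# Keep only the timestamp combinations that occur more than once.
duplicate_groups = conn.execute(
    """
    SELECT a_dttm, b_dttm, c_dttm, COUNT(*) AS n
    FROM data
    GROUP BY a_dttm, b_dttm, c_dttm
    HAVING COUNT(*) > 1
    """
).fetchall()
print(duplicate_groups)
```

Each returned tuple is one duplicated combination together with the number of rows that share it.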

This will select all the rows for which (at least one) other row exists, with the same {a_dttm,b_dttm,c_dttm}, but with a different id:
SELECT *
FROM the_table t
WHERE EXISTS (
SELECT *
FROM the_table x
WHERE x.a_dttm = t.a_dttm -- same
AND x.b_dttm = t.b_dttm -- same
AND x.c_dttm = t.c_dttm -- same
AND x.id <> t.id -- different
);
Similar, but now actually DELETING (some of) the duplicates:
DELETE
FROM the_table t
WHERE EXISTS (
SELECT *
FROM the_table x
WHERE x.a_dttm = t.a_dttm -- same
AND x.b_dttm = t.b_dttm -- same
AND x.c_dttm = t.c_dttm -- same
AND x.id > t.id -- different (actually: with a higher id)
);
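To see which rows a delete of this shape leaves behind, it can be tried on a throwaway copy first. Below is a minimal sketch, assuming SQLite in memory as a stand-in for PostgreSQL and made-up sample values; the delete target is left unaliased here only to stay portable across SQLite versions.

```python
import sqlite3

# Throwaway table with three identical rows and one unique row (values are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE the_table (id INTEGER PRIMARY KEY, a_dttm TEXT, b_dttm TEXT, c_dttm TEXT)"
)
conn.executemany(
    "INSERT INTO the_table VALUES (?, ?, ?, ?)",
    [
        (1, "t1", "t2", "t3"),
        (2, "t1", "t2", "t3"),
        (3, "t1", "t2", "t3"),
        (4, "u1", "u2", "u3"),
    ],
)

# Delete every row for which a twin with a higher id exists,
# so the row with the highest id in each group survives.
conn.execute(
    """
    DELETE FROM the_table
    WHERE EXISTS (
        SELECT 1 FROM the_table x
        WHERE x.a_dttm = the_table.a_dttm
          AND x.b_dttm = the_table.b_dttm
          AND x.c_dttm = the_table.c_dttm
          AND x.id > the_table.id
    )
    """
)
remaining_ids = [row[0] for row in conn.execute("SELECT id FROM the_table ORDER BY id")]
print(remaining_ids)
```

Flipping the comparison to `x.id < the_table.id` would keep the lowest id per group instead. Note that `=` never matches NULL, so rows with a NULL in any of the three columns are not treated as duplicates by this condition.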

Related

Delete duplicate rows based on columns

I have a table called Aircraft and there are many records. The problem is that some are duplicates. I know how to select the duplicates and their counts:
SELECT flight_id, latitude, longitude, altitude, call_sign, measurement_time, COUNT(*)
FROM Aircraft
GROUP BY flight_id, latitude, longitude, altitude, call_sign, measurement_time
HAVING COUNT(*) > 1;
This returns something like:
Now, what I need to do is remove the duplicates, leaving just one each so that when I run the query again, all counts become 1.
I know that I can use the DELETE keyword, but I'm not sure how to delete it from the SELECT.
I'm sure I am missing an easy step, but I do not want to ruin my DB being a newbie.
How do I do this?
SELECT
flight_id, latitude, longitude, altitude, call_sign, measurement_time
FROM Aircraft a
WHERE EXISTS (
SELECT * FROM Aircraft x
WHERE x.flight_id = a.flight_id
AND x.latitude = a.latitude
AND x.longitude = a.longitude
AND x.altitude = a.altitude
AND x.call_sign = a.call_sign
AND x.measurement_time = a.measurement_time
AND x.id < a.id
)
;
If the query above returns the correct rows (the ones to be deleted), you can change it into a DELETE statement:
DELETE
FROM Aircraft a
WHERE EXISTS (
SELECT * FROM Aircraft x
WHERE x.flight_id = a.flight_id
AND x.latitude = a.latitude
AND x.longitude = a.longitude
AND x.altitude = a.altitude
AND x.call_sign = a.call_sign
AND x.measurement_time = a.measurement_time
AND x.id < a.id
)
;
I have always used the CTE method in SQL Server. Once you have established which columns make up a duplicate, you number the rows within each group of matching values in a CTE, then go back and delete the rows whose number is greater than 1. This is an example of the duplicate checking that I do.
WITH CTE AS
(select d.UID
,d.LotKey
,d.SerialNo
,d.HotWeight
,d.MarketValue
,RN = ROW_NUMBER()OVER(PARTITION BY d.HotWeight, d.serialNo, d.MarketValue order by d.SerialNo)
from LotDetail d
where d.LotKey = ('1~20161019~305')
)
DELETE FROM CTE WHERE RN <> 1
In my example I am looking at the LotDetail table for rows where HotWeight, SerialNo, and MarketValue all match. The original gets RN = 1 and any duplicates get RN = 2 or greater, depending on how many duplicates there are. The final DELETE statement then clears the entries that come up as duplicates. This is really flexible, so you should be able to adapt it to your issue.
Here is an example tailored to your situation.
WITH CTE AS
(select d.Flight_ID
,d.Latitude
,d.Longitude
,d.Altitude
,d.Call_sign
,d.Measurement*
,RN = ROW_NUMBER()OVER(PARTITION BY d.Flight_ID, d.Latitude, d.Longitude, d.Altitude, d.Call_Sign, d.Measurement* order by d.SerialNo)
from Aircraft d
where d.flight_id = ('**INSERT VALUE HERE')
)
DELETE FROM CTE WHERE RN <> 1
If it's a one-time operation you can create a temp table with the same schema and then copy unique rows over like so:
insert into Aircraft_temp
select distinct on (flight_id, measurement_time) Aircraft.* from Aircraft
Then swap them out by renaming, or truncate Aircraft and copy the temp contents back (truncate Aircraft; insert into Aircraft select * from Aircraft_temp;).
It is safer to rename Aircraft to Aircraft_old and Aircraft_temp to Aircraft, so you keep your original data until you are sure things are correct. Or at least check that the number of rows returned by your count query above matches the number of rows in the temp table before doing the truncate.
Update2: With a separate valid primary key (assuming it is called id) you can do a DELETE based on a self join like this:
delete from Aircraft using (
select a1.id
from Aircraft a1
left join (select flight_id, measurement_time, min(id) as id from Aircraft group by 1,2) a2
on a1.id = a2.id
where a2.id is null
) as d
where Aircraft.id=d.id
This finds the minimum id (could do max too for the "latest") for each flight and identifies all the records from the full set having an id that is not the minimum (no match in the join). The unmatched ids are deleted.
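The "keep the minimum id per group" idea can also be written with NOT IN, which is easy to test on sample data. Below is a minimal sketch, assuming SQLite in memory as a stand-in for PostgreSQL; the sample flights are made up, and only flight_id and measurement_time are used as the duplicate key, as in the statement above.

```python
import sqlite3

# Small stand-in for the Aircraft table (rows are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Aircraft (id INTEGER PRIMARY KEY, flight_id TEXT, measurement_time TEXT)"
)
conn.executemany(
    "INSERT INTO Aircraft VALUES (?, ?, ?)",
    [
        (1, "AB123", "2016-10-19 12:00:00"),
        (2, "AB123", "2016-10-19 12:00:00"),
        (3, "AB123", "2016-10-19 12:05:00"),
        (4, "CD456", "2016-10-19 12:00:00"),
        (5, "CD456", "2016-10-19 12:00:00"),
    ],
)

# Delete every row whose id is not the smallest id of its (flight_id, measurement_time) group.
conn.execute(
    """
    DELETE FROM Aircraft
    WHERE id NOT IN (
        SELECT MIN(id) FROM Aircraft GROUP BY flight_id, measurement_time
    )
    """
)
kept_ids = [row[0] for row in conn.execute("SELECT id FROM Aircraft ORDER BY id")]
print(kept_ids)
```

Using MAX(id) in the subquery would keep the latest row of each group instead of the earliest.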

PostgreSQL Removing duplicates

I am working on a Postgres query to remove duplicates from a table. The following table is dynamically generated, and I want to write a select query that drops a record if its value in the first column duplicates another row's.
The table looks something like this
1st col  2nd col
4 62
6 34
5 26
5 12
I want to write a select query that removes either row 3 or row 4.
There is no need for an intermediate table:
delete from df1
where ctid not in (select min(ctid)
from df1
group by first_column);
If you are deleting many rows from a large table, the approach with an intermediate table is probably faster.
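The effect of deleting by physical row identifier can be tried on the four sample rows from the question. Below is a minimal sketch, assuming SQLite in memory, where the implicit `rowid` column plays the role that `ctid` plays in PostgreSQL; that substitution is an assumption made purely to keep the example runnable, and the column names are taken from the answer above.

```python
import sqlite3

# The question's four sample rows, in a table without any key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df1 (first_column INTEGER, second_column INTEGER)")
conn.executemany("INSERT INTO df1 VALUES (?, ?)", [(4, 62), (6, 34), (5, 26), (5, 12)])

# Keep the physically first row of each first_column group and delete the rest.
# SQLite's rowid stands in for PostgreSQL's ctid here.
conn.execute(
    """
    DELETE FROM df1
    WHERE rowid NOT IN (SELECT MIN(rowid) FROM df1 GROUP BY first_column)
    """
)
rows = conn.execute(
    "SELECT first_column, second_column FROM df1 ORDER BY first_column"
).fetchall()
print(rows)
```

Of the two rows with 5 in the first column, the one inserted first survives. Which row counts as "first" is a storage detail, so this only suits cases where it does not matter which duplicate is kept.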
If you just want to get unique values for one column, you can use:
select distinct on (first_column) *
from the_table
order by first_column;
Or simply
select first_column, min(second_column)
from the_table
group by first_column;
select count(first) as cnt, first, min(second) as second
from df1
group by first
having count(first) = 1
if you want to keep one of the rows (sorry, I initially missed it if you wanted that):
select first, min(second)
from df1
group by first
Where the table's name is df1 and the columns are named first and second.
You can actually leave off the count(first) as cnt if you want.
At the risk of stating the obvious, once you know how to select the data you want (or don't want), deleting the records is simple, and there are a dozen ways to do it.
If you want to replace the table or make a new table you can just use create table as for the deletion:
create table tmp as
select count(first) as cnt, first, min(second) as second
from df1
group by first
having count(first) = 1;
drop table df1;
create table df1 as select * from tmp;
or using DELETE FROM:
DELETE FROM df1 WHERE first NOT IN (SELECT first FROM tmp);
You could also use select into, etc, etc.
if you want to SELECT unique rows:
SELECT * FROM ztable u
WHERE NOT EXISTS ( -- There is no other record
SELECT * FROM ztable x
WHERE x.id = u.id -- with the same id
AND x.ctid < u.ctid -- , but with a different(lower) "internal" rowid
); -- so u.* must be unique
if you want to SELECT the other rows, which were suppressed in the previous query:
SELECT * FROM ztable nu
WHERE EXISTS ( -- another record exists
SELECT * FROM ztable x
WHERE x.id = nu.id -- with the same id
AND x.ctid < nu.ctid -- , but with a different(lower) "internal" rowid
);
if you want to DELETE records, making the table unique (but keeping one record per id):
DELETE FROM ztable d
WHERE EXISTS ( -- another record exists
SELECT * FROM ztable x
WHERE x.id = d.id -- with the same id
AND x.ctid < d.ctid -- , but with a different(lower) "internal" rowid
);
So basically I did this
create temp table t1 as
select first, min(second) as second
from df1
group by first;
select * from df1
inner join t1 on t1.first = df1.first and t1.second = df1.second;
It's a satisfactory answer. Thanks for your help, @Hack-R.

UPDATE column from one table to another

I need to update a column in a table to the latest date/time combination from another table. How can I get the latest date/time combination from the one table and then update a column with that date in another table?
The two tables I am using are called dbo.DueDate and dbo.PurchaseOrders. The JOIN between the two tables is dbo.DueDate.XDORD = dbo.PurchaseOrders.PBPO AND dbo.DueDate.XDLINE = dbo.PurchaseOrders.PBSEQ. The columns from dbo.DueDate that I need the latest date/time from are dbo.DueDate.XDCCTD and dbo.DueDate.XDCCTT.
I need to set dbo.PurchaseOrders.PBDUE = dbo.DueDate.XDCURDT. I can't use an ORDER BY clause in the UPDATE statement, so I'm not sure how to do this. I know row_number sometimes works in these situations, but I'm unsure how to implement it.
The general pattern is:
;WITH s AS
(
SELECT
key, -- may be multiple columns
date_col,
rn = ROW_NUMBER() OVER
(
PARTITION BY key -- again, may be multiple columns
ORDER BY date_col DESC
)
FROM dbo.SourceTable
)
UPDATE d
SET d.date_col = s.date_col
FROM dbo.DestinationTable AS d
INNER JOIN s
ON d.key = s.key -- one more time, may need multiple columns here
WHERE s.rn = 1;
I didn't try to map your table names and columns because (a) I didn't get from your word problem which table was the source and which was the destination and (b) those column names look like alphabet soup and I would have screwed them up anyway.
It does seem, though, that the OP got this specific code working:
;WITH s AS
(
SELECT
XDORD, XDLINE,
XDCURDT,
rn = ROW_NUMBER() OVER
(
PARTITION BY XDORD, XDLINE
ORDER BY XDCCTD DESC, XDCCTT desc
)
FROM dbo.DueDate
)
UPDATE d
SET d.PBDUE = s.XDCURDT
FROM dbo.PurchaseOrders AS d
INNER JOIN s
ON d.PBPO = s.XDORD AND d.PBSEQ = s.XDLINE
WHERE s.rn = 1;
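The same "latest row per key" update can be checked on a few sample rows. Below is a minimal sketch, assuming SQLite in memory rather than SQL Server, text columns holding ISO-formatted dates and times, and made-up order data; it swaps ROW_NUMBER for a correlated subquery with ORDER BY ... LIMIT 1, which picks the same row per key.

```python
import sqlite3

# Stand-ins for dbo.DueDate and dbo.PurchaseOrders (column types and rows are hypothetical).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE DueDate (XDORD TEXT, XDLINE INTEGER, XDCCTD TEXT, XDCCTT TEXT, XDCURDT TEXT);
    CREATE TABLE PurchaseOrders (PBPO TEXT, PBSEQ INTEGER, PBDUE TEXT);
    INSERT INTO DueDate VALUES
        ('PO1', 1, '2016-01-01', '08:00:00', '2016-02-01'),
        ('PO1', 1, '2016-01-05', '09:30:00', '2016-02-15'),
        ('PO1', 1, '2016-01-05', '07:00:00', '2016-02-10'),
        ('PO2', 1, '2016-01-02', '10:00:00', '2016-03-01');
    INSERT INTO PurchaseOrders VALUES
        ('PO1', 1, NULL), ('PO2', 1, NULL), ('PO3', 1, '2016-04-01');
    """
)

# For each purchase order line, copy XDCURDT from the DueDate row with the
# latest (XDCCTD, XDCCTT); lines without any DueDate row are left untouched.
conn.execute(
    """
    UPDATE PurchaseOrders
    SET PBDUE = (
        SELECT s.XDCURDT FROM DueDate s
        WHERE s.XDORD = PurchaseOrders.PBPO AND s.XDLINE = PurchaseOrders.PBSEQ
        ORDER BY s.XDCCTD DESC, s.XDCCTT DESC
        LIMIT 1
    )
    WHERE EXISTS (
        SELECT 1 FROM DueDate s
        WHERE s.XDORD = PurchaseOrders.PBPO AND s.XDLINE = PurchaseOrders.PBSEQ
    )
    """
)
due_dates = conn.execute("SELECT PBPO, PBDUE FROM PurchaseOrders ORDER BY PBPO").fetchall()
print(due_dates)
```

The outer WHERE EXISTS matters: without it, order lines that have no DueDate row would have PBDUE overwritten with NULL.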

Firebird 2.5 Removing Rows with Duplicate Fields

I am trying to remove duplicate rows which, for some reason, were imported into a specific table.
There is no primary key in this table.
There are 27797 unique records.
Select distinct txdate, plunumber from itemaudit
This gives me the correct records, but of course it only displays txdate and plunumber.
If it were possible to select all the fields but apply the distinct only to txdate, plunumber, I could export the values, delete the duplicated ones and re-import.
Or perhaps it's possible to delete the duplicate values directly from the table.
If you select the distinct of all fields, the result is incorrect.
To get all information on the duplicates, you simply need to query all information for the duplicate rows using a JOIN:
SELECT b.*
FROM (SELECT COUNT(*) as cnt, txdate, plunumber
FROM itemaudit
GROUP BY txdate, plunumber
HAVING COUNT(*) > 1) a
INNER JOIN itemaudit b ON a.txdate = b.txdate AND a.plunumber = b.plunumber;
DELETE FROM itemaudit t1
WHERE EXISTS (
SELECT 1 FROM itemaudit t2
WHERE t1.txdate = t2.txdate and t1.plunumber = t2.plunumber
AND t1.RDB$DB_KEY < t2.RDB$DB_KEY
);

Selecting non-repeating values in Postgres

SELECT DISTINCT a.s_id, select2Result.s_id, select2Result."mNrPhone",
select2Result."dNrPhone"
FROM "Table1" AS a INNER JOIN
(
SELECT b.s_id, c."mNrPhone", c."dNrPhone" FROM "Table2" AS b, "Table3" AS c
WHERE b.a_id = 1001 AND b.s_id = c.s_id
ORDER BY b.last_name) AS select2Result
ON a.a_id = select2Result.student_id
WHERE a.k_id = 11211
It returns:
1001;1001;"";""
1002;1002;"";""
1002;1002;"2342342232123";"2342342"
1003;1003;"";""
1004;1004;"";""
1002 value is repeated twice, but it shouldn't because I used DISTINCT and no other table has an id repeated twice.
You can use DISTINCT ON like this:
SELECT DISTINCT ON (a.s_id)
a.s_id, select2Result.s_id, select2Result."mNrPhone",
select2Result."dNrPhone"
...
But as others have told you, the "repeated records" really are different.
The qualifier DISTINCT applies to the entire row, not to the first column in the select-list. Since columns 3 and 4 (mNrPhone and dNrPhone) are different for the two rows with s_id = 1002, the DBMS correctly lists both rows. You have to write your query differently if you only want s_id = 1002 to appear once, and you have to decide which auxiliary data you want shown.
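The whole-row behaviour of DISTINCT is easy to reproduce on a few rows. Below is a minimal sketch, assuming SQLite in memory and a hypothetical single table `phones` that holds the joined result from the question; the second query shows one possible way to collapse to a single row per s_id with GROUP BY and MAX.

```python
import sqlite3

# Hypothetical flat table holding rows like the question's joined output.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE phones (s_id INTEGER, mNrPhone TEXT, dNrPhone TEXT)")
conn.executemany(
    "INSERT INTO phones VALUES (?, ?, ?)",
    [
        (1001, "", ""),
        (1002, "", ""),
        (1002, "2342342232123", "2342342"),
        (1003, "", ""),
    ],
)

# DISTINCT compares whole rows, so both 1002 rows survive.
distinct_rows = conn.execute(
    "SELECT DISTINCT s_id, mNrPhone, dNrPhone FROM phones ORDER BY s_id, mNrPhone"
).fetchall()

# One row per s_id: the non-empty phone strings win because '' sorts lowest.
one_per_id = conn.execute(
    "SELECT s_id, MAX(mNrPhone), MAX(dNrPhone) FROM phones GROUP BY s_id ORDER BY s_id"
).fetchall()
print(len(distinct_rows), one_per_id)
```

Each MAX is computed per column, so with more than two rows per s_id the two phone values could come from different source rows; when the values must come from the same row, a row-picking approach is the better fit.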
As an aside, it is strongly recommended that you always use the explicit JOIN notation (which was introduced in SQL-92) in all queries and sub-queries. Do not use the old implicit join notation (which is all that was available in SQL-86 or SQL-89), and especially do not use a mixture of explicit and implicit join notations (where your sub-query uses the implicit join, but the main query uses explicit join). You need to know the old notation so you can understand old queries. You should write new queries in the new notation.
First of all, the query displayed does not work at all, student_id is missing in the sub-query. You use it in the JOIN later.
More interestingly:
Pick a certain row out of a set with DISTINCT
DISTINCT and DISTINCT ON return distinct values by sorting all rows according to the set of columns that must be distinct and then picking the first row from every set. The sort uses all columns for a plain DISTINCT and only the specified columns for DISTINCT ON. Here lies the opportunity to pick certain rows out of a set over others.
For instance if you prefer rows with not-empty "mNrPhone" in your example:
SELECT DISTINCT ON (a.s_id) -- sure you didn't want a.a_id?
a.s_id AS a_s_id -- use aliases to avoid dupe name
,s.s_id AS s_s_id
,s."mNrPhone"
,s."dNrPhone"
FROM "Table1" a
JOIN (
SELECT b.s_id, c."mNrPhone", c."dNrPhone", ??.student_id -- missing!
FROM "Table2" b
JOIN "Table3" c USING (s_id)
WHERE b.a_id = 1001
-- ORDER BY b.last_name -- pointless, DISTINCT will re-order
) s ON a.a_id = s.student_id
WHERE a.k_id = 11211
ORDER BY a.s_id -- first col must agree with DISTINCT ON, could add DESC though
,("mNrPhone" <> '') DESC -- non-empty first
ORDER BY cannot disagree with DISTINCT on the same query level. To get around this you can either use GROUP BY instead or put the whole query in a sub-query and run another SELECT with ORDER BY on it.
The ORDER BY you had in the sub-query is voided now.
In this particular case, if - as it seems - the dupes come only from the sub-query (you'd have to verify), you could instead:
SELECT a.a_id, s.s_id, s."mNrPhone", s."dNrPhone" -- picking a.a_id over s_id
FROM "Table1" a
JOIN (
SELECT DISTINCT ON (b.s_id)
b.s_id, c."mNrPhone", c."dNrPhone", ??.student_id -- missing!
FROM "Table2" b
JOIN "Table3" c USING (s_id)
WHERE b.a_id = 1001
ORDER BY b.s_id, (c."mNrPhone" <> '') DESC -- pick non-empty first
) s ON a.a_id = s.student_id
WHERE a.k_id = 11211
ORDER BY a.a_id -- now you can ORDER BY freely