Find equal twin record in PostgreSQL

I have a table company with 60 columns. The goal is to create a tool to find, compare and eliminate duplicates in this table.
Example: I have a record with id 22 and I know it has a twin because I run this (simplified code):
SELECT min(co_id),co_name,count(*) FROM co
GROUP BY co_name
HAVING count(*) > 1
The result shows there is one twin (count 2), and I get the oldest id with min(co_id).
My question is: how do I search for the twin's co_id? Just by passing the oldest id?
Something like:
SELECT co_id FROM co
WHERE co_name EQUAL TO co_id='22'
LIMIT 2
Sample data:
id co_name
22 Volvo
23 Volvo
24 Ford
25 Ford
I know id 22 and I want to search for the twin 23 based on the content of 22.
The closest I found is this, which is far from generic and a nightmare for comparing 60 fields:
SELECT id,
(SELECT max(b.id) from co b
WHERE a.co_name = b.co_name
LIMIT 1) as twin
FROM co a
WHERE id='22'
How do I do this in a more simple and generic way? I just want the twin record co_id.
Thank you in advance!

select max_co, co_name from (
select max(co_id) max_co, min(co_id) min_co, co_name from co
group by co_name having count(*) > 1) t where min_co = (your old co id as input);

You can join your table with itself:
SELECT c1.*
FROM
co_name c1 INNER JOIN co_name c2
ON c1.co_name=c2.co_name
AND c1.id>c2.id
This will return all duplicated records (but not the original record with the lowest id). Or, since you're using PostgreSQL, you can use a window function:
SELECT *
FROM (
SELECT
id,
co_name,
row_number() OVER (PARTITION by co_name ORDER BY id) as row
FROM
co_name
) s
WHERE
row>1;
Please see an example here.
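As a minimal sketch of the two approaches above, the snippet below runs both queries against the sample data from the question. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL (an assumption made only to keep the example self-contained; window functions need SQLite 3.25 or later), and it names the table co with columns id and co_name, following the question's sample data rather than the table name used in the answer.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE co (id INTEGER PRIMARY KEY, co_name TEXT)")
conn.executemany(
    "INSERT INTO co (id, co_name) VALUES (?, ?)",
    [(22, "Volvo"), (23, "Volvo"), (24, "Ford"), (25, "Ford")],
)

# Self-join: every row that has an older row with the same name.
SELF_JOIN = """
    SELECT c1.id
    FROM co c1
    INNER JOIN co c2
        ON c1.co_name = c2.co_name
       AND c1.id > c2.id
    ORDER BY c1.id
"""

# Window function: number the rows within each name, keep all but the first.
WINDOW = """
    SELECT id
    FROM (
        SELECT id,
               row_number() OVER (PARTITION BY co_name ORDER BY id) AS rn
        FROM co
    ) s
    WHERE rn > 1
    ORDER BY id
"""

self_join_ids = [row[0] for row in conn.execute(SELF_JOIN)]
window_ids = [row[0] for row in conn.execute(WINDOW)]
print(self_join_ids, window_ids)
```

With three or more rows sharing a name, the self-join returns a later row once per older match, so add DISTINCT if that matters; the window version returns each extra row exactly once.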
If you want to compare multiple columns, the JOIN solution would be more flexible. I don't know exactly how you want to compare your columns or how you define "twin" rows, but a query like this should help:
SELECT c1.*
FROM
co_name c1 INNER JOIN co_name c2
ON (
c1.co_name=c2.co_name
OR c1.co_city=c2.co_city
OR c1.co_owner=c2.co_owner
OR ...
) AND c1.id>c2.id
If you just want the duplicated records of id=22, then you can try this:
SELECT c1.*
FROM
co_name c1 INNER JOIN co_name c2
ON c1.co_name=c2.co_name
AND c1.id>c2.id
WHERE
c2.id=22
Or, if you just want a single twin, comparing 60 columns, you can try this query:
SELECT MIN(c1.id) as Twin /* or MAX(c1.id), depending what you're after */
FROM
co_name c1 INNER JOIN co_name c2
ON (
c1.co_name=c2.co_name
OR c1.co_city=c2.co_city
OR c1.co_owner=c2.co_owner
OR ...
) AND c1.id>c2.id
WHERE
c2.id=22

I found one solution that works on 60 columns if I use variables instead of hardcoding the columns in the query. Thanks everybody for all the input. Some of the answers were on the same track.
SELECT id,
(SELECT max(b.id) from co b
WHERE concat(a.co_name,etc) = concat(b.co_name,etc)
LIMIT 1) as twin
FROM co a
WHERE id='22'
It is not the best one, but it fetches one twin at a time. And it is far from generic. Thanks for pointing me in the right direction. A generic solution would be nicer.
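One way to make this more generic is to build the comparison from the table's own column list instead of typing 60 names. The sketch below is only an illustration under stated assumptions, not the asker's method: it uses Python's sqlite3 as a stand-in database, a hypothetical helper named find_twins, and a three-column table in place of the real 60-column one. It compares column by column with a null-safe operator (IS in SQLite; the PostgreSQL counterpart is IS NOT DISTINCT FROM). That also sidesteps a pitfall of concatenation: 'ab' joined with 'c' gives the same string as 'a' joined with 'bc', so concatenated rows can match even when the columns differ.

```python
import sqlite3

def find_twins(conn, table, key, key_value):
    """Return keys of rows equal to the given row in every non-key column.

    Hypothetical helper. `table` and `key` are interpolated into the SQL,
    so they must come from trusted code, never from user input.
    """
    # Column names come from the catalog (PRAGMA table_info in SQLite).
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    compared = [c for c in columns if c != key]
    # "IS" is SQLite's null-safe equality: NULL IS NULL is true.
    condition = " AND ".join(f"a.{c} IS b.{c}" for c in compared)
    sql = (
        f"SELECT b.{key} FROM {table} a JOIN {table} b "
        f"ON {condition} AND b.{key} <> a.{key} "
        f"WHERE a.{key} = ? ORDER BY b.{key}"
    )
    return [row[0] for row in conn.execute(sql, (key_value,))]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE co (id INTEGER PRIMARY KEY, co_name TEXT, co_city TEXT, co_owner TEXT)"
)
conn.executemany(
    "INSERT INTO co VALUES (?, ?, ?, ?)",
    [
        (22, "Volvo", "Gothenburg", None),
        (23, "Volvo", "Gothenburg", None),   # true twin of 22, NULL included
        (24, "VolvoG", "othenburg", None),   # differs, but concatenates the same
    ],
)

twins_of_22 = find_twins(conn, "co", "id", 22)

# Concatenation-based matching (|| stands in for concat) reports a false twin.
concat_matches = [
    row[0]
    for row in conn.execute(
        "SELECT b.id FROM co a JOIN co b "
        "ON a.co_name || a.co_city = b.co_name || b.co_city AND b.id <> a.id "
        "WHERE a.id = 22 ORDER BY b.id"
    )
]
print(twins_of_22, concat_matches)
```

In PostgreSQL the column list could come from information_schema.columns in the same way; the helper above assumes the table has at least one non-key column.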

Related

T-SQL select all IDs that have value A and B

I'm trying to find all IDs in TableA that are mentioned by a set of records in TableB, where that set is defined in TableC. I've got to the point where a set of INNER JOINs provides me with the following result:
TableA.ID | TableB.Code
-----------------------
1 | A
1 | B
2 | A
3 | B
I want to select only the ID where in this case there is an entry for both A and B, but where the values A and B are based on another Query.
I figured this should be possible with a GROUP BY TableA.ID and HAVING = ALL(Subquery on table C).
But that is returning no values.
Since you did not post your original query, I will assume it is inside a CTE. Assuming this, the query you want is something along these lines:
SELECT ID
FROM cte
WHERE Code IN ('A', 'B')
GROUP BY ID
HAVING COUNT(DISTINCT Code) = 2;
It's an extremely poor question, but you probably need to compare distinct counts against table C:
SELECT a.ID
FROM TableA a
GROUP BY a.ID
HAVING COUNT(DISTINCT a.Code) = (SELECT COUNT(*) FROM TableC)
We're guessing though.
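A sketch of that idea, under explicit assumptions since the original query was never posted: the joined result is modelled as a hypothetical table joined(id, code), TableC is given a single code column, and Python's sqlite3 stands in for SQL Server. The query keeps only the codes listed in TableC and then requires each ID to have as many distinct codes as TableC has rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE joined (id INTEGER, code TEXT);   -- stand-in for the INNER JOIN result
    CREATE TABLE TableC (code TEXT PRIMARY KEY);   -- the required set of codes (assumed layout)
    INSERT INTO joined VALUES (1, 'A'), (1, 'B'), (2, 'A'), (3, 'B');
    INSERT INTO TableC VALUES ('A'), ('B');
""")

# Keep only codes that TableC asks for, then demand that all of them are present.
QUERY = """
    SELECT j.id
    FROM joined j
    WHERE j.code IN (SELECT code FROM TableC)
    GROUP BY j.id
    HAVING COUNT(DISTINCT j.code) = (SELECT COUNT(*) FROM TableC)
    ORDER BY j.id
"""

ids_with_all_codes = [row[0] for row in conn.execute(QUERY)]
print(ids_with_all_codes)
```

The WHERE filter matters: without it, an ID with two codes that are not both in TableC could still reach the required count.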

Identifying rows with multiple IDs linked to a unique value

Using ms-sql 2008 r2; am sure this is very straightforward. I am trying to identify where a unique value {ISIN} has been linked to more than 1 Identifier. An example output would be:
isin entity_id
XS0276697439 000BYT-E
XS0276697439 000BYV-E
This is actually an error and I want to look for other instances where there may be more than one entity_id linked to a unique ISIN.
This is my current working but it's obviously not correct:
select isin, entity_id from edm_security_entity_map
where isin is not null
--and isin = ('XS0276697439')
group by isin, entity_id
having COUNT(entity_id) > 1
order by isin asc
Thanks for your help.
Elliot,
I don't have a copy of SQL in front of me right now, so apologies if my syntax isn't spot on.
I'd start by finding the duplicates:
select
x.isin
,count(*)
from edm_security_entity_map as x
group by x.isin
having count(*) > 1
Then join that back to the full table to find where those duplicates come from:
;with DuplicateList as
(
select
x.isin
--,count(*) -- not used elsewhere
from edm_security_entity_map as x
group by x.isin
having count(*) > 1
)
select
map.isin
,map.entity_id
from edm_security_entity_map as map
inner join DuplicateList as dup
on dup.isin = map.isin;
HTH,
Michael
So you're saying that if isin-1 has a row for both entity-1 and entity-2 that's an error, but isin-3, say, linked to entity-3 in two separate rows is OK? The ugly-but-readable solution to that is to prepend another CTE to the previous solution:
;with UniqueValues as
(select distinct
y.isin
,y.entity_id
from edm_security_entity_map as y
)
,DuplicateList as
(
select
x.isin
--,count(*) -- not used elsewhere
from UniqueValues as x
group by x.isin
having count(*) > 1
)
select
map.isin
,map.entity_id
from edm_security_entity_map as map -- or from UniqueValues, depending on your objective.
inner join DuplicateList as dup
on dup.isin = map.isin;
There are better solutions with additional GROUP BY clauses in the final query. If this is going into production I'd be recommending that. Or if your table has a bajillion rows. If you just need to do some analysis the above should suffice, I hope.
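One more compact alternative, sketched here as an assumption rather than as the approach the answer had in mind, is to count distinct entity_id values per ISIN directly. The snippet uses Python's sqlite3 as a stand-in for SQL Server, and the second ISIN and its entity are hypothetical values added to show a repeated but consistent mapping.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edm_security_entity_map (isin TEXT, entity_id TEXT)")
conn.executemany(
    "INSERT INTO edm_security_entity_map VALUES (?, ?)",
    [
        ("XS0276697439", "000BYT-E"),
        ("XS0276697439", "000BYV-E"),   # same ISIN, second entity: the error case
        ("XS0000000001", "000AAA-E"),   # hypothetical ISIN mapped twice ...
        ("XS0000000001", "000AAA-E"),   # ... to the same entity: not an error
        (None, "000ZZZ-E"),             # NULL ISINs are ignored
    ],
)

# An ISIN is a conflict only when it maps to more than one distinct entity.
QUERY = """
    SELECT DISTINCT map.isin, map.entity_id
    FROM edm_security_entity_map AS map
    WHERE map.isin IN (
        SELECT isin
        FROM edm_security_entity_map
        WHERE isin IS NOT NULL
        GROUP BY isin
        HAVING COUNT(DISTINCT entity_id) > 1
    )
    ORDER BY map.isin, map.entity_id
"""

conflicts = conn.execute(QUERY).fetchall()
print(conflicts)
```

Grouping by isin alone is the key difference from the query in the question, which grouped by both isin and entity_id and therefore only counted repeated identical pairs.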

Simple SELECT, but adding JOIN returns too many rows

The query below returns 9,817 records. Now, I want to SELECT one more field from another table. See the 2 lines that are commented out, where I've simply selected this additional field and added a JOIN statement to bring in the new column. With these lines added, the query now returns 649,200 records and I can't figure out why! I guess something is wrong with my WHERE criteria in conjunction with the JOIN statement. Please help, thanks.
SELECT DISTINCT dbo.IMPORT_DOCUMENTS.ITEMID, BEGDOC, BATCHID
--, dbo.CATEGORY_COLLECTION_CATEGORY_RESULTS.CATEGORY_ID
FROM IMPORT_DOCUMENTS
--JOIN dbo.CATEGORY_COLLECTION_CATEGORY_RESULTS ON dbo.CATEGORY_COLLECTION_CATEGORY_RESULTS.ITEMID = dbo.IMPORT_DOCUMENTS.ITEMID
WHERE (BATCHID LIKE 'IC0%' OR BATCHID LIKE 'LP0%')
AND dbo.IMPORT_DOCUMENTS.ITEMID IN
(SELECT dbo.CATEGORY_COLLECTION_CATEGORY_RESULTS.ITEMID FROM
CATEGORY_COLLECTION_CATEGORY_RESULTS
WHERE SCORE >= .7 AND SCORE <= .75 AND CATEGORY_ID IN(
SELECT CATEGORY_ID FROM CATEGORY_COLLECTION_CATS WHERE COLLECTION_ID IN (11,16))
AND Sample_Id > 0)
AND dbo.IMPORT_DOCUMENTS.ITEMID NOT IN
(SELECT ASSIGNMENT_FOLDER_DOCUMENTS.Item_Id FROM ASSIGNMENT_FOLDER_DOCUMENTS)
One possible reason is that one of your tables holds data at a finer granularity than your join key. For example, there may be multiple records per item id, so the same item id is repeated several times and every repeat multiplies the joined rows. Without knowing the data, I would rewrite the query as below. If the output is not what you're looking for, convert it into a SELECT within a SELECT.
Hope this helps....
Try this SQL:
SELECT DISTINCT a.ITEMID, a.BEGDOC, a.BATCHID, b.CATEGORY_ID
FROM IMPORT_DOCUMENTS a
JOIN (SELECT DISTINCT ITEMID, CATEGORY_ID
FROM CATEGORY_COLLECTION_CATEGORY_RESULTS
WHERE SCORE >= .7 AND SCORE <= .75
AND CATEGORY_ID IN (SELECT DISTINCT CATEGORY_ID FROM CATEGORY_COLLECTION_CATS WHERE COLLECTION_ID IN (11,16))
AND Sample_Id > 0) b
ON a.ITEMID = b.ITEMID
WHERE (a.BATCHID LIKE 'IC0%' OR a.BATCHID LIKE 'LP0%')
AND a.ITEMID NOT IN (SELECT DISTINCT Item_Id FROM ASSIGNMENT_FOLDER_DOCUMENTS)
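To see the row multiplication in isolation, here is a small sketch with made-up data. The table names docs and results are simplified stand-ins for IMPORT_DOCUMENTS and CATEGORY_COLLECTION_CATEGORY_RESULTS, and Python's sqlite3 replaces SQL Server; none of the figures come from the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE docs (itemid INTEGER PRIMARY KEY);
    CREATE TABLE results (itemid INTEGER, category_id INTEGER);
    INSERT INTO docs VALUES (1), (2);
    -- several category rows per item: finer granularity than the join key
    INSERT INTO results VALUES (1, 10), (1, 11), (1, 12), (2, 10), (2, 11);
""")

def count_rows(sql):
    """Run a query and return how many rows it produces."""
    return len(conn.execute(sql).fetchall())

# IN (...) only tests membership, so each document appears once.
without_join = count_rows(
    "SELECT DISTINCT d.itemid FROM docs d "
    "WHERE d.itemid IN (SELECT itemid FROM results)"
)

# The JOIN emits one row per matching result row, and DISTINCT cannot
# collapse them because category_id differs.
with_join = count_rows(
    "SELECT DISTINCT d.itemid, r.category_id FROM docs d "
    "JOIN results r ON r.itemid = d.itemid"
)

# Diagnostic: which join keys repeat, and how often?
rows_per_item = conn.execute(
    "SELECT itemid, COUNT(*) FROM results "
    "GROUP BY itemid HAVING COUNT(*) > 1 ORDER BY itemid"
).fetchall()
print(without_join, with_join, rows_per_item)
```

Note also that the commented-out JOIN in the question carries none of the SCORE, CATEGORY_ID or Sample_Id filters from the IN subquery, so it would pull in every category row for an item, not only the ones that qualified it.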

Using fields from select query in where clause in subqueries

I have a list of people and there are 4 types that can occur as well as 5 resolutions for each type. I'm trying to write a single query so that I can pull each type/resolution combination for each person but am running into problems. This is what I have so far:
SELECT person,
TypeRes1 = (SELECT COUNT(*) FROM table1 where table1.status = 45)
JOIN personTbl ON personTbl.personid = table1.personid
WHERE person LIKE 'A0%'
GROUP BY person
I have adjusted column names to make it more...generic, but basically the person table has several hundred people in it and I just want A01 through A09, so the like statement is the easiest way to do this. The problem is that my results end up being something like this:
Person TypeRes1
A06 48
A04 48
A07 48
A08 48
A05 48
Which is incorrect. I can't figure out how to get the column count correct for each person. I tried doing something like:
SELECT person as p,
TypeRes1= (SELECT COUNT(*) FROM table1
JOIN personTbl ON personTbl.personid = table1.personid
WHERE table1.status = 45 AND personTbl.person = p)
FROM table1
JOIN personTbl ON personTbl.personid = table1.personid
WHERE personTbl.person LIKE 'A0%'
GROUP BY personTbl.person
But that gives me the error: Invalid Column name 'p'. Is it possible to pass p into the subquery or is there another way to do it?
EDIT: There are 19 different statuses as well, so there will be 19 different TypeRes columns. For brevity I included just one; if I can get that one working, I think I can do the rest on my own.
Maybe something like this:
SELECT
person,
(
SELECT
COUNT(*)
FROM
table1
WHERE
table1.status = 45
AND personTbl.personid = table1.personid
) AS TypeRes1
FROM
personTbl
WHERE person LIKE 'A0%'
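For the 19 statuses mentioned in the edit, repeating the correlated subquery works, but conditional aggregation is a common alternative that scans table1 once. The sketch below shows both on invented data; it uses Python's sqlite3 in place of SQL Server, and status 46 and the TypeRes2 column are hypothetical additions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE personTbl (personid INTEGER PRIMARY KEY, person TEXT);
    CREATE TABLE table1 (personid INTEGER, status INTEGER);
    INSERT INTO personTbl VALUES (1, 'A01'), (2, 'A02'), (3, 'A03'), (4, 'B01');
    INSERT INTO table1 VALUES (1, 45), (1, 45), (1, 46), (2, 45), (4, 45);
""")

# Correlated subquery: the inner query sees the outer row's personid.
CORRELATED = """
    SELECT person,
           (SELECT COUNT(*)
            FROM table1
            WHERE table1.status = 45
              AND table1.personid = personTbl.personid) AS TypeRes1
    FROM personTbl
    WHERE person LIKE 'A0%'
    ORDER BY person
"""

# Conditional aggregation: one pass, one SUM(CASE ...) per status.
CONDITIONAL = """
    SELECT p.person,
           SUM(CASE WHEN t.status = 45 THEN 1 ELSE 0 END) AS TypeRes1,
           SUM(CASE WHEN t.status = 46 THEN 1 ELSE 0 END) AS TypeRes2
    FROM personTbl p
    LEFT JOIN table1 t ON t.personid = p.personid
    WHERE p.person LIKE 'A0%'
    GROUP BY p.person
    ORDER BY p.person
"""

correlated_rows = conn.execute(CORRELATED).fetchall()
conditional_rows = conn.execute(CONDITIONAL).fetchall()
print(correlated_rows, conditional_rows)
```

The LEFT JOIN keeps people with no rows in table1 (A03 here), so they show up with zero counts in both versions.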

What's the best T-SQL syntax to filter for an ID that has a count of X or at least X or at most X in a joined table?

What's the best way to do something like this in T-SQL?
SELECT DISTINCT ID
FROM Members,
INNER JOIN Comments ON Members.MemberId = Comments.MemberId
WHERE COUNT(Comments.CommentId) > 100
Trying to get the members who have commented more than 100 times. This is obviously invalid code but what's the best way to write this?
This should get you what you're after. I'm not saying this is the absolutely best way of doing it, but it's unlikely you'll find anything better.
SELECT ID
FROM Members
INNER JOIN Comments
ON Members.MemberId = Comments.MemberId
GROUP BY ID
HAVING COUNT(*) > 100
I like using a subquery.
SELECT DISTINCT m.ID
FROM Members m
WHERE (SELECT COUNT(c.CommentID)
FROM Comments c
WHERE c.MemberID = m.MemberID) > 100
Try
SELECT ID
FROM Members
INNER JOIN (SELECT MemberID FROM Comments
GROUP BY MemberID HAVING COUNT(CommentId) > 100)
AS CommentCount ON Members.MemberID = CommentCount.MemberID
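The title also asks about "at least X" and "at most X". A small sketch of the difference, with assumptions spelled out: Python's sqlite3 stands in for SQL Server, the Members table is reduced to a single MemberId column, and the threshold is lowered from 100 so that three members are enough. The point it illustrates is that an upper bound needs a LEFT JOIN, otherwise members with no comments at all drop out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Members (MemberId INTEGER PRIMARY KEY);
    CREATE TABLE Comments (CommentId INTEGER PRIMARY KEY, MemberId INTEGER);
    INSERT INTO Members VALUES (1), (2), (3);
    -- member 1 has three comments, member 2 has one, member 3 has none
    INSERT INTO Comments VALUES (1, 1), (2, 1), (3, 1), (4, 2);
""")

# "More than X": an inner join is enough, members without comments cannot qualify.
MORE_THAN = """
    SELECT m.MemberId
    FROM Members m
    INNER JOIN Comments c ON c.MemberId = m.MemberId
    GROUP BY m.MemberId
    HAVING COUNT(*) > ?
    ORDER BY m.MemberId
"""

# "At most X": LEFT JOIN keeps members with zero comments, and counting
# c.CommentId (not *) makes their count 0 instead of 1.
AT_MOST = """
    SELECT m.MemberId
    FROM Members m
    LEFT JOIN Comments c ON c.MemberId = m.MemberId
    GROUP BY m.MemberId
    HAVING COUNT(c.CommentId) <= ?
    ORDER BY m.MemberId
"""

more_than_2 = [row[0] for row in conn.execute(MORE_THAN, (2,))]
at_most_1 = [row[0] for row in conn.execute(AT_MOST, (1,))]
print(more_than_2, at_most_1)
```

For "at least X", the MORE_THAN query works unchanged with >= in place of >, provided X is 1 or more.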