Modifying Duplicates - tsql

I'm trying to figure out how to do two things:
1. Locate duplicate records in a table. These are typically duplicate names in the 'Name' column, but specifically those where the ParentID is the same. It's fine if I have identical names where the ParentID is different, because these names (or children) belong to different parents.
2. Modify these duplicates. Preferably, I would modify them by appending the 'ID' to the name.
I came up with a query to locate duplicates and then dump them into a temp table:
CREATE TABLE #Dup(
Name varchar(50),
CustNo varchar(7))
insert into #Dup (Name, CustNo)
SELECT [Name],[CustNo]
FROM [02Kids]
GROUP BY [Name], [CustNo]
HAVING Count(*)>1
This seems to work. When I view the data in the temp table, I see the name and the ParentID, confirming that this name appears more than once for that parent ID. It's worth noting that each name appears only once in the temp table. It doesn't show two rows with the same name and ID (perhaps this is part of my problem).
Here's the query I came up with attempting to perform the modification:
select[#Dup].[Name] + ' ' + [02Kids].[ID] as iName, [02Kids].ParentID
from #Dup
inner join [02Kids]
on #Dup.CustNo = [02Kids].ParentID
order by iName asc
Well, this sort of works, except I end up with massive amounts of duplicates. For example, one "Name" that I can confirm only has two duplicates ends up with close to 13 in total from that select query.
I may be way off here with that query (this is practice stuff I'm using to teach myself) but I'm having trouble conceiving a correct means to do this. I am still learning syntax, keywords, functions, etc so maybe there's something I should use I just don't know of yet.

To get only the matches you want in your "modification" query, you need to add a match on name to your join clause. Right now you are matching each duplicate record to every kid for that parent, not just the duplicates. So if one parent has 13 kids, only one of which is a duplicate, you'll get 13 extra records.
inner join [02Kids]
on #Dup.CustNo = [02Kids].ParentID AND
#Dup.Name = [02Kids].Name
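To see the effect of the extra join condition, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for SQL Server. The table `kids`, the temp table `dup`, and the sample rows are made up for illustration; only the join logic mirrors the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-in for [02Kids].
    CREATE TABLE kids (ID INTEGER PRIMARY KEY, Name TEXT, ParentID TEXT);
    INSERT INTO kids (ID, Name, ParentID) VALUES
        (1, 'Ann', 'P1'), (2, 'Ann', 'P1'), (3, 'Bob', 'P1'),
        (4, 'Ann', 'P2'), (5, 'Cid', 'P2');
    -- Same role as #Dup: one row per (Name, ParentID) seen more than once.
    CREATE TEMP TABLE dup AS
        SELECT Name, ParentID AS CustNo FROM kids
        GROUP BY Name, ParentID HAVING COUNT(*) > 1;
""")

# Joining on the parent alone pairs each dup row with every kid of that parent.
parent_only = conn.execute("""
    SELECT d.Name || ' ' || k.ID FROM dup d
    JOIN kids k ON d.CustNo = k.ParentID
    ORDER BY k.ID
""").fetchall()

# Joining on parent AND name keeps only the rows that are actually duplicated.
parent_and_name = conn.execute("""
    SELECT k.Name || ' ' || k.ID FROM dup d
    JOIN kids k ON d.CustNo = k.ParentID AND d.Name = k.Name
    ORDER BY k.ID
""").fetchall()

print(parent_only)      # [('Ann 1',), ('Ann 2',), ('Ann 3',)]
print(parent_and_name)  # [('Ann 1',), ('Ann 2',)]
```

With the parent-only join, Bob (ID 3) is wrongly reported as 'Ann 3'; the second join drops him.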

Does this answer your question?
USE tempdb
GO
CREATE TABLE Person (PersonID INT, FName VARCHAR(25), LName VARCHAR(25))
INSERT INTO Person VALUES
(1, 'Jim', 'Jones'),
(2, 'Rob', 'Smith'),
(3, 'Matt', 'Bridges'),
(4, 'Jim', 'Jones'),
(5, 'Jim', 'Jones'),
(6, 'Alex', 'Door'),
(7, 'Wilhelm', 'Kay')
GO
;WITH DupDetect AS
(
SELECT *
,Occ = ROW_NUMBER() OVER (PARTITION BY FName, LName ORDER BY PersonID)
FROM Person
)
UPDATE DupDetect
SET FName = LTRIM(STR(PersonID)) + FName
WHERE Occ > 1
SELECT *
FROM Person
Resulting in:
PersonID | FName | LName
---------------------------------
1 | Jim | Jones
2 | Rob | Smith
3 | Matt | Bridges
4 | 4Jim | Jones
5 | 5Jim | Jones
6 | Alex | Door
7 | Wilhelm | Kay
I'm unaware of any cleaner or more efficient pattern for the modification or removal of duplicates.
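For the original question's shape (duplicates per Name and ParentID, with the ID appended to the name), a rough equivalent can be sketched as follows. This is not the updatable-CTE pattern above: SQLite, used here through Python's sqlite3 module only to keep the sketch runnable, cannot update through a CTE, so a correlated EXISTS is swapped in. The table `kids` and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-in for [02Kids].
    CREATE TABLE kids (ID INTEGER PRIMARY KEY, Name TEXT, ParentID TEXT);
    INSERT INTO kids (ID, Name, ParentID) VALUES
        (1, 'Ann', 'P1'), (2, 'Ann', 'P1'), (3, 'Bob', 'P1'),
        (4, 'Ann', 'P2'), (5, 'Ann', 'P1');
""")

# Rename every row that has an earlier row (smaller ID) with the same
# Name and ParentID; the first occurrence keeps its original name.
conn.execute("""
    UPDATE kids
    SET Name = Name || ' ' || ID
    WHERE EXISTS (
        SELECT 1 FROM kids AS earlier
        WHERE earlier.Name = kids.Name
          AND earlier.ParentID = kids.ParentID
          AND earlier.ID < kids.ID
    )
""")

rows = conn.execute("SELECT ID, Name, ParentID FROM kids ORDER BY ID").fetchall()
print(rows)
```

Rows 2 and 5 become 'Ann 2' and 'Ann 5'; row 4 is untouched because its parent differs.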

delete duplicates in a table and update references

I have a table with id, we now added a new field where we calculated uniques from an external source, which made us realize we actually have duplicates in the database:
Main Table
id | unique_id | ...
---|------------
4 | A |
5 | A
6 | B
We can see: 5 is actually a duplicate of 4, as they both have the same unique_id.
Now this needs to be cleaned up.
I sadly can not simply delete those duplicates (5), as other tables depend on it:
Other Table (OtherTable.main_id REFERENCES MainTable.id)
id | main_id | ...
---|------------
1 | 4 | Blah
2 | 5
3 | 6
So before deleting, I have to repoint the references from the duplicate to the row that is kept, here:
UPDATE OtherTable SET main_id = 4 WHERE main_id = 5
How can I do that in an efficient update?
I tried to simply update every reference to the first one with that same unique_id, however that didn't complete in a day.
UPDATE "OtherTable" SET "main_id" = (SELECT "id" FROM "MainTable" WHERE "unique_id" = (SELECT "unique_id" FROM "MainTable" WHERE "id" = "OtherTable"."main_id") LIMIT 1)
If it helps, the MainTable contains about 750,000 entries, the OtherTable contains 12,000,000 rows.
That's probably because the triple nested SELECT is quite inefficient.
For the simpler part, deleting the duplicates (once I have changed the references to the first one of its kind), I found this query to work swiftly enough:
DELETE FROM MainTable
WHERE id IN
(SELECT id
FROM
(SELECT id,
ROW_NUMBER() OVER( PARTITION BY unique_id
ORDER BY id ) AS row_num
FROM MainTable ) t
WHERE t.row_num > 1 );
However I need a way to update the references to the non-deleted ones of the duplicates.
Instead of UPDATE with a nested query, I'd suggest using UPDATE FROM for a join, and the same window function as in your DELETE statement:
UPDATE "OtherTable" AS other
SET main_id = main.min_id
FROM (SELECT
id,
first_value(id) OVER (PARTITION BY unique_id ORDER BY id) AS min_id
FROM "MainTable"
) AS main
WHERE main.id = other.main_id
AND main.id <> main.min_id
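The full repoint-then-delete sequence can be sketched end to end on the question's sample rows. This is an illustration under assumptions, not the answer's exact statement: it runs on SQLite through Python's sqlite3 module, uses the lowercase names `main_table` and `other_table`, and materializes the duplicate-to-survivor mapping in a hypothetical temp table `remap` rather than using UPDATE ... FROM.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main_table (id INTEGER PRIMARY KEY, unique_id TEXT);
    CREATE TABLE other_table (id INTEGER PRIMARY KEY, main_id INTEGER);
    INSERT INTO main_table VALUES (4, 'A'), (5, 'A'), (6, 'B');
    INSERT INTO other_table VALUES (1, 4), (2, 5), (3, 6);

    -- Build the duplicate -> survivor mapping once, for duplicates only.
    CREATE TEMP TABLE remap AS
        SELECT m.id AS dup_id, k.min_id
        FROM main_table m
        JOIN (SELECT unique_id, MIN(id) AS min_id
              FROM main_table GROUP BY unique_id) k
          ON k.unique_id = m.unique_id
        WHERE m.id <> k.min_id;

    -- Repoint only the rows that reference a duplicate.
    UPDATE other_table
    SET main_id = (SELECT min_id FROM remap WHERE dup_id = other_table.main_id)
    WHERE main_id IN (SELECT dup_id FROM remap);

    -- The duplicates are now unreferenced and can be deleted.
    DELETE FROM main_table WHERE id IN (SELECT dup_id FROM remap);
""")

references = conn.execute("SELECT id, main_id FROM other_table ORDER BY id").fetchall()
survivors = conn.execute("SELECT id, unique_id FROM main_table ORDER BY id").fetchall()
print(references)  # [(1, 4), (2, 4), (3, 6)]
print(survivors)   # [(4, 'A'), (6, 'B')]
```

Restricting the update to rows that actually reference a duplicate is the same idea as the `main.id <> main.min_id` filter: most of the 12,000,000 rows should never be rewritten.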

How to sum children occurrences from a joining table in Postgres?

I need to count how many consultants are using a skill through a joining table (consultant_skills), and the challenge is to sum the children occurrences to the parents recursively.
Here's the reproduction of what I'm trying to accomplish. The current results are:
skill_id | count
2 | 2
3 | 1
5 | 1
6 | 1
But I need to compute the count to the parents recursively, where the expected result would be:
skill_id | count
1 | 2
2 | 2
3 | 1
4 | 2
5 | 2
6 | 1
Does anyone know how I can do that?
Sqlfiddle Solution
You need to use WITH RECURSIVE, as Mike suggests. His answer is useful, especially in reference to using distinct to eliminate redundant counts for consultants, but it doesn't drive to the exact results you're looking for.
See the working solution in the sqlfiddle above. I believe this is what you are looking for:
WITH RECURSIVE results(skill_id, parent_id, consultant_id)
AS (
SELECT skills.id as skill_id, parent_id, consultant_id
FROM consultant_skills
JOIN skills on skill_id = skills.id
UNION ALL
SELECT skills.id as skill_id, skills.parent_id as parent_id, consultant_id
FROM results
JOIN skills on results.parent_id = skills.id
)
SELECT skill_id, count(distinct consultant_id) from results
GROUP BY skill_id
ORDER BY skill_id
What is happening in the query below the UNION ALL is that we're recursively joining the skills table to itself, but rotating in the previous parent id as the new skill id, and using the new parent id on each iteration. The recursion stops because eventually the parent id is NULL and there is no JOIN because it's an INNER join. Hope that makes sense.
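Since the fiddle data is not reproduced here, the following sketch runs the same recursive query on made-up rows chosen so that the direct and rolled-up counts match the two tables in the question. The skill tree and consultant assignments are assumptions, and SQLite (via Python's sqlite3 module) stands in for Postgres.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE skills (id INTEGER PRIMARY KEY, parent_id INTEGER);
    CREATE TABLE consultant_skills (consultant_id INTEGER, skill_id INTEGER);
    -- Hypothetical tree: 1 -> {2, 3} and 4 -> 5 -> 6.
    INSERT INTO skills VALUES (1, NULL), (2, 1), (3, 1), (4, NULL), (5, 4), (6, 5);
    INSERT INTO consultant_skills VALUES (1, 2), (2, 2), (1, 3), (1, 5), (2, 6);
""")

counts = conn.execute("""
    WITH RECURSIVE results(skill_id, parent_id, consultant_id) AS (
        -- Anchor: each consultant's directly assigned skills.
        SELECT s.id, s.parent_id, cs.consultant_id
        FROM consultant_skills cs
        JOIN skills s ON cs.skill_id = s.id
        UNION ALL
        -- Step: credit the same consultant to the parent skill.
        SELECT s.id, s.parent_id, r.consultant_id
        FROM results r
        JOIN skills s ON r.parent_id = s.id
    )
    SELECT skill_id, COUNT(DISTINCT consultant_id)
    FROM results
    GROUP BY skill_id
    ORDER BY skill_id
""").fetchall()

print(counts)  # [(1, 2), (2, 2), (3, 1), (4, 2), (5, 2), (6, 1)]
```

COUNT(DISTINCT ...) matters for skill 1: consultant 1 reaches it through both skill 2 and skill 3 but is counted once.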

Does String Value Exists in a List of Strings | Redshift Query

I have some interesting data I'm trying to query; however, I cannot get the syntax correct. I have a temporary table (temp_id), which I've filled with the id values I care about. In this example it is only two ids.
CREATE TEMPORARY TABLE temp_id (id bigint PRIMARY KEY);
INSERT INTO temp_id (id) VALUES ( 1 ), ( 2 );
I have another table in production (let's call it foo) which holds multiple ids in a single cell. The ids column looks like this (below), with the ids as a single string separated by "|":
ids
-----------
1|9|3|4|5
6|5|6|9|7
NULL
2|5|6|9|7
9|11|12|99
I want to evaluate each cell in foo.ids and see if any of the ids in it match the ones in my temp_id table.
Expected output
ids |does_match
-----------------------
1|9|3|4|5 |true
6|5|6|9|7 |false
NULL |false
2|5|6|9|7 |true
9|11|12|99 |false
So far I've come up with this, but I can't seem to return anything. Instead of trying to create a new column does_match I tried to filter within the WHERE statement. However, the issue is I cannot figure out how to evaluate all the id values in my temp table to the string blob full of the ids in foo.
SELECT
ids,
FROM foo
WHERE ids = ANY(SELECT LISTAGG(id, ' | ') FROM temp_ids)
Any suggestions would be helpful.
Cheers,
This would work; however, I'm not sure about performance:
SELECT
ids
FROM foo
JOIN temp_id
ON '|'||foo.ids||'|' LIKE '%|'||temp_id.id::varchar||'|%'
You wrap the ids list in a pair of additional separators, so you can always search for |id|, including for the first and the last number. Note that this returns only the matching rows rather than a does_match column.
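The separator-wrapping trick can also produce the does_match column for every row by moving the LIKE test into an EXISTS. The sketch below is an assumption-laden illustration: it runs on SQLite through Python's sqlite3 module rather than on Redshift, so the `::varchar` cast is dropped and the flag comes back as 1 or 0 instead of true or false.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE temp_id (id INTEGER PRIMARY KEY);
    INSERT INTO temp_id (id) VALUES (1), (2);
    CREATE TABLE foo (ids TEXT);
    INSERT INTO foo (ids) VALUES
        ('1|9|3|4|5'), ('6|5|6|9|7'), (NULL), ('2|5|6|9|7'), ('9|11|12|99');
""")

rows = conn.execute("""
    SELECT foo.ids,
           -- Wrap both sides in '|' so that id 1 cannot match inside 11 or 12.
           EXISTS (SELECT 1 FROM temp_id
                   WHERE '|' || foo.ids || '|'
                         LIKE '%|' || temp_id.id || '|%') AS does_match
    FROM foo
    ORDER BY foo.rowid
""").fetchall()

for ids, does_match in rows:
    print(ids, bool(does_match))
```

The NULL row yields 0 because a LIKE against NULL is never true, and '9|11|12|99' yields 0 because neither '|1|' nor '|2|' occurs in the wrapped string.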
The following SQL (I know it's a bit of a hack) returns exactly what you expect as output. I tested it with your sample data, but I don't know how it would behave on your real data, so try it and let me know:
with seq AS ( -- create a sequence CTE to emulate Postgres' unnest
select 1 as i union all -- assuming at most 10 ids in the ids field;
-- feel free to modify this part
select 2 union all
select 3 union all
select 4 union all
select 5 union all
select 6 union all
select 7 union all
select 8 union all
select 9 union all
select 10)
select distinct ids,
case -- since I can't take a max on a boolean field, I used two cases
-- for 1s and 0s and converted them to boolean
when max(case
when t.id in (
select split_part(ids,'|',seq.i) as tt
from seq
join foo f on seq.i <= REGEXP_COUNT(ids, '|') + 1
where tt != '' and k.ids = f.ids)
then 1
else 0
end) = 1
then true
else false
end as does_match
from temp_id t, foo k
group by 1
Please let me know if this works for you!

How to query the data in a join table by two sets of joined records?

I've got three tables: users, courses, and grades, the latter of which joins users and courses with some metadata like the user's score for the course. I've created a SQLFiddle, though the site doesn't appear to be working at the moment. The schema looks like this:
CREATE TABLE users(
id INT,
name VARCHAR,
PRIMARY KEY (ID)
);
INSERT INTO users VALUES
(1, 'Beth'),
(2, 'Alice'),
(3, 'Charles'),
(4, 'Dave');
CREATE TABLE courses(
id INT,
title VARCHAR,
PRIMARY KEY (ID)
);
INSERT INTO courses VALUES
(1, 'Biology'),
(2, 'Algebra'),
(3, 'Chemistry'),
(4, 'Data Science');
CREATE TABLE grades(
id INT,
user_id INT,
course_id INT,
score INT,
PRIMARY KEY (ID)
);
INSERT INTO grades VALUES
(1, 2, 2, 89),
(2, 2, 1, 92),
(3, 1, 1, 93),
(4, 1, 3, 88);
I'd like to know how (if possible) to construct a query which specifies some users.id values (1, 2, 3) and courses.id values (1, 2, 3) and returns those users' grades.score values for those courses:
| name | Algebra | Biology | Chemistry |
|---------|---------|---------|-----------|
| Alice | 89 | 92 | |
| Beth | | 93 | 88 |
| Charles | | | |
In my application logic, I'll be receiving an array of user_ids and course_ids, so the query needs to select those users and courses dynamically by primary key. (The actual data set contains millions of users and tens of thousands of courses—the examples above are just a sample to work with.)
Ideally, the query would:
- use the course titles as dynamic attributes/column headers for the users' score data
- sort the row and column headers alphabetically
- include empty/NULL cells if the user-course pair has no grades relationship
I suspect I may need some combination of JOINs and Postgresql's crosstab, but I can't quite wrap my head around it.
Update: learning that the terminology for this is "dynamic pivot", I found this SO answer which appears to be trying to solve a related problem in Postgres with crosstab()
I think a simple pivot query should work here, since you only have 4 courses in your data set to pivot.
SELECT t1.name,
MAX(CASE WHEN t3.title = 'Biology' THEN t2.score ELSE NULL END) AS Biology,
MAX(CASE WHEN t3.title = 'Algebra' THEN t2.score ELSE NULL END) AS Algebra,
MAX(CASE WHEN t3.title = 'Chemistry' THEN t2.score ELSE NULL END) AS Chemistry,
MAX(CASE WHEN t3.title = 'Data Science' THEN t2.score ELSE NULL END) AS Data_Science
FROM users t1
LEFT JOIN grades t2
ON t1.id = t2.user_id
LEFT JOIN courses t3
ON t2.course_id = t3.id
GROUP BY t1.name
Follow the link below for a running demo. I used MySQL because, as you have noticed, SQLFiddle seems to be perpetually busted for the other databases.
SQLFiddle
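Because the question says the user and course ids arrive as arrays in application code, one way to make the pivot dynamic is to generate the MAX(CASE ...) columns from the requested course ids before running the query. The sketch below is only an illustration: `pivot_scores` is a hypothetical helper, SQLite (through Python's sqlite3 module) stands in for Postgres, and both id lists are assumed to be non-empty.

```python
import sqlite3

def pivot_scores(conn, user_ids, course_ids):
    """Return (header, rows) with one score column per requested course."""
    course_marks = ",".join("?" * len(course_ids))
    courses = conn.execute(
        f"SELECT id, title FROM courses WHERE id IN ({course_marks}) ORDER BY title",
        list(course_ids),
    ).fetchall()
    # One MAX(CASE ...) column per course. Ids are bound as parameters;
    # titles appear only as double-quoted column aliases.
    columns = ", ".join(
        'MAX(CASE WHEN g.course_id = ? THEN g.score END) AS "{}"'.format(
            title.replace('"', '""'))
        for _, title in courses
    )
    user_marks = ",".join("?" * len(user_ids))
    sql = (
        f"SELECT u.name, {columns} FROM users u "
        "LEFT JOIN grades g ON g.user_id = u.id "
        f"WHERE u.id IN ({user_marks}) "
        "GROUP BY u.id, u.name ORDER BY u.name"
    )
    cur = conn.execute(sql, [cid for cid, _ in courses] + list(user_ids))
    return [d[0] for d in cur.description], cur.fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE courses (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE grades (id INTEGER PRIMARY KEY, user_id INT, course_id INT, score INT);
    INSERT INTO users VALUES (1, 'Beth'), (2, 'Alice'), (3, 'Charles'), (4, 'Dave');
    INSERT INTO courses VALUES (1, 'Biology'), (2, 'Algebra'), (3, 'Chemistry'), (4, 'Data Science');
    INSERT INTO grades VALUES (1, 2, 2, 89), (2, 2, 1, 92), (3, 1, 1, 93), (4, 1, 3, 88);
""")

header, rows = pivot_scores(conn, [1, 2, 3], [1, 2, 3])
print(header)  # ['name', 'Algebra', 'Biology', 'Chemistry']
for row in rows:
    print(row)
```

The LEFT JOIN keeps Charles in the output with empty cells, and sorting the course lookup by title gives the alphabetical column order asked for. With tens of thousands of courses the generated column list would hit database column limits, so this only suits requests for a modest number of courses at a time.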

PostgreSQL: Can't use DISTINCT for some data types

I have a table called _sample_table_delme_data_files which contains some duplicates. I want to copy its records, without duplicates, into data_files:
INSERT INTO data_files (SELECT distinct * FROM _sample_table_delme_data_files);
ERROR: could not identify an ordering operator for type box3d
HINT: Use an explicit ordering operator or modify the query.
The problem is that PostgreSQL cannot compare (or order) box3d values. How do I supply such an ordering operator so I can get only the distinct rows into my destination table?
Thanks in advance,
Adam
If you don't add the operator, you could try translating the box3d data to text using its output function, something like:
INSERT INTO data_files (SELECT distinct othercols,box3dout(box3dcol) FROM _sample_table_delme_data_files);
Edit The next step is: cast it back to box3d:
INSERT INTO data_files SELECT othercols, box3din(b) FROM (SELECT distinct othercols,box3dout(box3dcol) AS b FROM _sample_table_delme_data_files) AS sub;
(I don't have box3d on my system so it's untested.)
The datatype box3d doesn't have an operator for the DISTINCT-operation. You have to create the operator, or ask the PostGIS-project, maybe somebody has already fixed this problem.
Finally, this was solved by a colleague.
Let's see how many rows there are, duplicates included:
SELECT COUNT(*) FROM _sample_table_delme_data_files ;
count
-------
12728
(1 row)
Now, we shall add another column to the source table to help us differentiate similar rows:
ALTER TABLE _sample_table_delme_data_files ADD COLUMN id2 serial;
We can now see the dups:
SELECT id, id2 FROM _sample_table_delme_data_files ORDER BY id LIMIT 10;
id | id2
--------+------
198748 | 6449
198748 | 85
198801 | 166
198801 | 6530
198829 | 87
198829 | 6451
198926 | 88
198926 | 6452
199062 | 6532
199062 | 168
(10 rows)
And remove them, one row per duplicated id (enough here, since each id appears at most twice):
DELETE FROM _sample_table_delme_data_files
WHERE id2 IN (SELECT max(id2) FROM _sample_table_delme_data_files
GROUP BY id
HAVING COUNT(*)>1);
Let's check that it worked:
SELECT id FROM _sample_table_delme_data_files GROUP BY id HAVING COUNT(*)>1;
id
----
(0 rows)
Remove the auxiliary column:
ALTER TABLE _sample_table_delme_data_files DROP COLUMN id2;
ALTER TABLE
Insert the remaining rows into the destination table:
INSERT INTO data_files (SELECT * FROM _sample_table_delme_data_files);
INSERT 0 6364
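If an id could appear more than twice, a single pass of the max(id2) delete would leave extra copies behind. A variation that keeps the lowest auxiliary value per id and removes all the others can be sketched as below. This is an assumption-based illustration on SQLite (through Python's sqlite3 module), where the implicit rowid plays the role of id2; the table `src` and its `payload` column are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical source table; id 198801 appears three times.
    CREATE TABLE src (id INTEGER, payload TEXT);
    INSERT INTO src VALUES (198748, 'a'), (198748, 'a'), (198801, 'b'),
                           (198801, 'b'), (198801, 'b'), (198829, 'c');

    -- Keep the first row of each id and delete every other copy in one pass.
    DELETE FROM src
    WHERE rowid NOT IN (SELECT MIN(rowid) FROM src GROUP BY id);
""")

remaining = conn.execute("SELECT id, payload FROM src ORDER BY id").fetchall()
print(remaining)  # [(198748, 'a'), (198801, 'b'), (198829, 'c')]
```

As in the walkthrough above, rows count as duplicates when they share the same id, so the column that cannot be compared is never touched.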