T-SQL query to remove duplicates from large tables using join

I am new to T-SQL and have been trying different solutions to remove duplicate rows from a fairly large table (over 270,000 rows).
The table looks something like:
TableA
-----------
RowID int not null identity(1,1) primary key,
Col1 varchar(50) not null,
Col2 int not null,
Col3 varchar(50) not null
The rows for this table are not perfect duplicates because of the existence of the RowID identity field.
The second table that I need to join with:
TableB
-----------
RowID int not null identity(1,1) primary key,
Col1 int not null,
Col2 varchar(50) not null
In TableA I have something like:
1 | gray | 4 | Angela
2 | red | 6 | Diana
3 | black| 6 | Alina
4 | black| 11 | Dana
5 | gray | 4 | Angela
6 | red | 12 | Dana
7 | red | 6 | Diana
8 | black| 11 | Dana
And in TableB:
1 | 6 | klm
2 | 11 | lmi
The second column of TableB (Col1) is referenced by the foreign key column in TableA (Col2).
I need to remove ONLY the duplicates from TableA that have Col2 = 6, ignoring the other duplicates. The expected result is:
1 | gray | 4 | Angela
2 | red | 6 | Diana
4 | black| 6 | Alina
5 | black| 11 | Dana
6 | gray | 4 | Angela
7 | red | 12 | Dana
8 | black| 11 | Dana
I tried using
DELETE FROM TableA a inner join TableB b on a.Col2=b.Col1
WHERE a.RowId NOT IN (SELECT MIN(RowId) FROM TableA GROUP BY RowId, Col1, Col2, Col3) and b.Col2="klm"
but I still get some of the duplicates that I need to remove.
What is the best way to remove these imperfect duplicate rows using a join?

MIN(RowId) can't exclude anything there: grouping by RowId (the primary key) puts each row in its own group, so the subquery returns every RowId and the NOT IN matches nothing. The RowIDs in your expected output also don't match the sample data.
DELETE FROM TableA a
inner join TableB b
on a.Col2=b.Col1
WHERE a.RowId NOT IN (SELECT MIN(RowId)
FROM TableA GROUP BY RowId, Col1, Col2, Col3)
and b.Col2="klm"
These would be the rows to delete:
select *
from
( select *
, row_number() over (partition by Col1, Col3 order by RowID) as rn
from TableA a
where a.Col2 = 6
) tt
where tt.rn > 1
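If that returns the right rows, the same window can be made the target of a DELETE through a CTE; a sketch, assuming duplicates are defined by (Col1, Col3) within Col2 = 6 and the lowest RowID should survive:
;WITH dups AS
( select RowID
, row_number() over (partition by Col1, Col3 order by RowID) as rn
from TableA
where Col2 = 6
)
delete from dups where rn > 1;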

Another solution is a CTE with ROW_NUMBER() partitioned by every column except the identity column (partitioning by RowID as well would make every row unique and delete nothing):
WITH CTE AS(
SELECT t.RowID, t.Col1, t.Col2, t.Col3,
RN = ROW_NUMBER() OVER (PARTITION BY t.Col1, t.Col2, t.Col3 ORDER BY t.RowID)
FROM TableA t
)
delete from CTE WHERE RN > 1
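Note that this deletes duplicates across the whole table, while the question only wants the ones whose Col2 matches the 'klm' row of TableB; a hedged variant that keeps the CTE deletable by filtering with EXISTS rather than a join:
WITH CTE AS(
SELECT RowID,
RN = ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3 ORDER BY RowID)
FROM TableA a
WHERE EXISTS (SELECT 1 FROM TableB b
WHERE b.Col1 = a.Col2 AND b.Col2 = 'klm')
)
delete from CTE WHERE RN > 1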
regards.


How to drop rows if a variable is less than x, in SQL

I have the following query:
query = """
with double_entry_book as (
SELECT to_address as address, value as value
FROM `bigquery-public-data.crypto_ethereum.traces`
WHERE to_address is not null
AND block_timestamp < '2022-01-01 00:00:00'
AND status = 1
AND (call_type not in ('delegatecall', 'callcode', 'staticcall') or call_type is null)
union all
-- credits
SELECT from_address as address, -value as value
FROM `bigquery-public-data.crypto_ethereum.traces`
WHERE from_address is not null
AND block_timestamp < '2022-01-01 00:00:00'
AND status = 1
AND (call_type not in ('delegatecall', 'callcode', 'staticcall') or call_type is null)
)
SELECT address,
sum(value) / 1000000000000000000 as balance
from double_entry_book
group by address
order by balance desc
LIMIT 15000000
"""
In the last part, I want to drop rows where "balance" is less than, let's say, 0.02, and then group, order, etc. I imagine this should be simple. Any help will be appreciated!
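Since the filter is on the aggregated balance, a HAVING clause after the GROUP BY should be all you need; a minimal sketch of just the final SELECT, leaving the double_entry_book CTE unchanged and using 0.02 as the example threshold:
SELECT address,
sum(value) / 1000000000000000000 as balance
from double_entry_book
group by address
having sum(value) / 1000000000000000000 >= 0.02
order by balance desc
LIMIT 15000000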
We can delete in a CTE and use RETURNING to get the ids of the rows being deleted, but they still exist until the transaction is committed.
CREATE TABLE t (
id serial,
variale int);
insert into t (variale) values
(1),(2),(3),(4),(5);
5 rows affected
with del as
(delete from t
where variale < 3
returning id)
select
t.id,
t.variale,
del.id ids_being_deleted
from t
left join del
on t.id = del.id;
id | variale | ids_being_deleted
-: | ------: | ----------------:
1 | 1 | 1
2 | 2 | 2
3 | 3 | null
4 | 4 | null
5 | 5 | null
select * from t;
id | variale
-: | ------:
3 | 3
4 | 4
5 | 5
db<>fiddle here

How to merge rows on a table and update junction table on postgres

Consider 2 tables (table A and table B) with a many-to-many relationship, each containing a primary key and other attributes. To map this relation there's a third junction table (table C) containing the foreign keys for each table of the relation ( fk_tableA | fk_tableB ).
Table B contains duplicate rows (except for the pk), so I want to merge these together into a single record with whatever unique primary key, just like so:
table B                     table B (after merging duplicates)
1 | Henry | 100.0           1 | Henry | 100.0
2 | Jessi | 97.0            2 | Jessi | 97.0
3 | Henry | 100.0           4 | Erica | 11.2
4 | Erica | 11.2
By merging these records, there may be foreign keys of table C (joint table) pointing to primary keys of table B that no longer exist. My goal is to edit them to point to the merged record:
Before merging:
tableA          table B                table C
id | att1       id | att1  | att2      fk_A | fk_b
-----------     -------------------    ------------
1  | ab123      1  | Henry | 100.0     1    | 1
2  | adawd      2  | Jessi | 97.0      2    | 3
3  | da3wf      3  | Henry | 100.0
                4  | Erica | 11.2
On table C, 2 records from table B are referenced (1 and 3) which happen to be duplicated rows. My goal is to merge those into a single record (in table B) and update the foreign key in table C:
After merging:
tableA          table B                table C
id | att1       id | att1  | att2      fk_A | fk_b
-----------     -------------------    ------------
1  | ab123      1  | Henry | 100.0     1    | 1
2  | adawd      2  | Jessi | 97.0      2    | 1
3  | da3wf      4  | Erica | 11.2

- Note that id=3 was merged/deleted from table B and the same id
  was updated on table C to point to the merged record's id.
So my question is basically: how do I update a junction table when merging records of a table? I am currently using Postgres, and the tables contain millions of rows.
-- \i tmp.sql
CREATE TABLE persons
( id integer primary key
, name text
, weight decimal(4,1)
);
INSERT INTO persons(id,name,weight)VALUES
(1 ,'Henry', 100.0)
,(2 ,'Jessi', 97.0)
,(3 ,'Henry', 100.0)
(4 ,'Erica', 11.2)
;
CREATE TABLE junctiontab
( fk_A integer NOT NULL
, p_id integer REFERENCES persons(id)
, PRIMARY KEY (fk_A,p_id)
);
INSERT INTO junctiontab(fk_A, p_id)VALUES (1 , 1 ),(2 , 3 );
-- find the ids of affected persons.
-- [for simplicity: put them in a temp table]
CREATE TEMP table xlat AS
SELECT * FROM(
SELECT id AS wrong_id
,min(id) OVER (PARTITION BY name, weight ORDER BY id) AS good_id
FROM persons p
) x
WHERE good_id <> wrong_id
;
--show it
SELECT * FROM xlat;
UPDATE junctiontab j
SET p_id = x.good_id
FROM xlat x
WHERE j.p_id = x.wrong_id
-- The good junction-entry *could* already exist...
AND NOT EXISTS (
SELECT * FROM junctiontab nx
WHERE nx.fk_A= j.fk_A
AND nx.p_id= x.good_id
)
;
DELETE FROM junctiontab d
-- if the good junction-entry already existed, we can delete the wrong one now.
WHERE EXISTS (
SELECT * FROM junctiontab g
JOIN xlat x ON g.p_id= x.good_id
AND d.p_id = x.wrong_id
WHERE g.fk_A= d.fk_A
)
;
--show it
SELECT * FROM junctiontab
;
-- Delete the wrong person records
DELETE FROM persons p
WHERE EXISTS (
SELECT * FROM xlat x
WHERE p.id = x.wrong_id
);
--show it
SELECT * FROM persons p;
Result:
CREATE TABLE
INSERT 0 4
CREATE TABLE
INSERT 0 2
SELECT 1
wrong_id | good_id
----------+---------
3 | 1
(1 row)
UPDATE 1
DELETE 0
fk_a | p_id
------+------
1 | 1
2 | 1
(2 rows)
DELETE 1
id | name | weight
----+-------+--------
1 | Henry | 100.0
2 | Jessi | 97.0
4 | Erica | 11.2
(3 rows)
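A side note on the "millions of rows" concern: indexing the temp lookup table before the UPDATE and DELETE steps may help the joins; a small sketch using the names from the script above:
-- Optional: speed up the repeated lookups against xlat on large tables.
CREATE INDEX ON xlat (wrong_id);
ANALYZE xlat;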

Find unique entities with multiple UUID identifiers in redshift

We have an event table with multiple types of UUIDs per user, and we would like to find a way to stitch all those UUIDs together to get the most complete definition of a single user.
For example:
UUID1 | UUID2
1 a
1 a
2 a
2 b
3 c
4 c
There are 2 users here, the first one with uuid1={1,2} and uuid2={a,b}, the second one with uuid1={3,4} and uuid2={c}. These chains could potentially be very long. There are no intersections (i.e. 1c doesn't exist) and all rows are timestamp ordered.
Is there a way in Redshift to generate these unique "guest" identifiers without creating an immense query with many joins?
Thanks in advance!
Create test data table
-- DROP TABLE uuid_test;
CREATE TEMP TABLE uuid_test AS
SELECT 1 row_id, 1::int uuid1, 'a'::char(1) uuid2
UNION ALL SELECT 2 row_id, 1::int uuid1, 'a'::char(1) uuid2
UNION ALL SELECT 3 row_id, 2::int uuid1, 'a'::char(1) uuid2
UNION ALL SELECT 4 row_id, 2::int uuid1, 'b'::char(1) uuid2
UNION ALL SELECT 5 row_id, 3::int uuid1, 'c'::char(1) uuid2
UNION ALL SELECT 6 row_id, 4::int uuid1, 'c'::char(1) uuid2
UNION ALL SELECT 7 row_id, 4::int uuid1, 'd'::char(1) uuid2
UNION ALL SELECT 8 row_id, 5::int uuid1, 'e'::char(1) uuid2
UNION ALL SELECT 9 row_id, 6::int uuid1, 'e'::char(1) uuid2
UNION ALL SELECT 10 row_id, 6::int uuid1, 'f'::char(1) uuid2
UNION ALL SELECT 11 row_id, 7::int uuid1, 'f'::char(1) uuid2
UNION ALL SELECT 12 row_id, 8::int uuid1, 'g'::char(1) uuid2
UNION ALL SELECT 13 row_id, 8::int uuid1, 'h'::char(1) uuid2
;
The actual problem is solved by using strict ordering to find every place where the unique user changes, capturing that as a lookup table and then applying it to the original data.
-- Create lookup table with a from-to range of IDs for each unique user
WITH unique_user AS (
-- Calculate the end of the id range using LEAD() to look ahead
-- Use an inline MAX() to find the ending ID for the last entry
SELECT row_id AS from_id
, NVL(LEAD(row_id,1) OVER (ORDER BY row_id)-1, (SELECT MAX(row_id) FROM uuid_test) ) AS to_id
, unique_uuid
-- Mark a unique user change when neither UUID matches the previous row
FROM (SELECT row_id
,CASE WHEN NVL(LAG(uuid1,1) OVER (ORDER BY row_id), 0) <> uuid1
AND NVL(LAG(uuid2,1) OVER (ORDER BY row_id), '') <> uuid2
THEN MD5(uuid1||uuid2)
ELSE NULL END unique_uuid
FROM uuid_test) t
WHERE unique_uuid IS NOT NULL
ORDER BY row_id
)
-- Apply the unique user value to each row using a range join to the lookup table
SELECT a.row_id, a.uuid1, a.uuid2, b.unique_uuid
FROM uuid_test AS a
JOIN unique_user AS b
ON a.row_id BETWEEN b.from_id AND b.to_id
ORDER BY a.row_id
;
Here's the output:
row_id | uuid1 | uuid2 | unique_uuid
--------+-------+-------+----------------------------------
1 | 1 | a | efaa153b0f682ae5170a3184fa0df28c
2 | 1 | a | efaa153b0f682ae5170a3184fa0df28c
3 | 2 | a | efaa153b0f682ae5170a3184fa0df28c
4 | 2 | b | efaa153b0f682ae5170a3184fa0df28c
5 | 3 | c | 5fcfcb7df376059d0075cb892b2cc37f
6 | 4 | c | 5fcfcb7df376059d0075cb892b2cc37f
7 | 4 | d | 5fcfcb7df376059d0075cb892b2cc37f
8 | 5 | e | 18a368e1052b5aa0388ef020dd9a1e20
9 | 6 | e | 18a368e1052b5aa0388ef020dd9a1e20
10 | 6 | f | 18a368e1052b5aa0388ef020dd9a1e20
11 | 7 | f | 18a368e1052b5aa0388ef020dd9a1e20
12 | 8 | g | 321fcc2447163a81d470b9353e394121
13 | 8 | h | 321fcc2447163a81d470b9353e394121

Hierarchy trees in database and web app

I want to create a web app which will use tree data structures. Users will be able to create, update and delete trees. I have the following table, called nodes, in a PostgreSQL database:
id INTEGER PRIMARY KEY,
name VARCHAR(50) NOT NULL UNIQUE,
parent_id INTEGER NULL REFERENCES nodes(id)
Getting data
I want to get data in the following form:
id | name | children
---|------|--------------
1 | a | [2,3]
2 | b | []
3 | c | [4]
4 | d | []
I created a query which returns data in this form:
id | name | parent_id
---|------|--------------
1 | a |
2 | b | 1
3 | c | 1
4 | d | 3
And here is the code:
WITH RECURSIVE nodes_cte(id, name, parent_id, level) AS (
SELECT nodes.id, nodes.name, nodes.parent_id, 0 AS level
FROM nodes
WHERE name = 'a'
UNION ALL
SELECT nodes.id, nodes.name, nodes.parent_id, level+1
FROM nodes
JOIN nodes_cte
ON nodes_cte.id = nodes.parent_id
)
SELECT * FROM nodes_cte;
Can I change the SQL to get what I want, or should I do that in the app?
Inserting data
I want to know what are the ways to insert data into the table. I think that following approach will work for me:
create sequence in database
increase sequence for number of elements in tree
manually compute ids in app and insert elements in the table
Are there better ways?
CREATE TABLE nodes
( id INTEGER PRIMARY KEY
, name VARCHAR(50) NOT NULL UNIQUE
, parent_id INTEGER NULL REFERENCES nodes(id)
);
-- I created query which returns data in form
INSERT INTO nodes(id,name,parent_id)VALUES
( 1 , 'a' , NULL)
,( 2 , 'b' , 1)
,( 3 , 'c' , 1)
,( 4 , 'd' , 3)
;
SELECT p.id, p.name
, array_agg(c.id) AS children
FROM nodes p
LEFT JOIN nodes c ON c.parent_id = p.id
GROUP BY p.id, p.name
;
Result:
id | name | children
----+------+----------
1 | a | {2,3}
2 | b | {NULL}
3 | c | {4}
4 | d | {NULL}
(4 rows)
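The question asked for empty lists for leaf nodes rather than {NULL}; a sketch of the same aggregation using a FILTER clause (available in PostgreSQL 9.4 and later):
SELECT p.id, p.name
, COALESCE(array_agg(c.id) FILTER (WHERE c.id IS NOT NULL), '{}') AS children
FROM nodes p
LEFT JOIN nodes c ON c.parent_id = p.id
GROUP BY p.id, p.name
;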
Extra: using generate_series() to insert a bunch of records, each record having id/3 as its parent (except when that is zero).
INSERT INTO nodes(id,name,parent_id)
SELECT gs, 'zzz_'|| gs::text, NULLIF(gs/3 , 0)
FROM generate_series ( 5,25) gs
;
INSERTING/UPDATING DATA
Normally, your front-end should not mess with sequences, but leave that to the DBMS. You already have a UNIQUE constraint on name, because it is a natural key. So, your front-end should use that key to address rows in the nodes table, like in:
CREATE TABLE nodes2
( id SERIAL NOT NULL PRIMARY KEY
, name VARCHAR(50) NOT NULL UNIQUE
, parent_id INTEGER NULL REFERENCES nodes2(id)
);
INSERT INTO nodes2(name,parent_id)
SELECT 'Omg_'|| gs::text, NULLIF(gs/3 , 0)
FROM generate_series ( 1,15) gs
;
PREPARE upd (text, text) AS
-- child, parent
UPDATE nodes2 c
SET parent_id = p.id
FROM nodes2 p
WHERE p.name = $2 -- parent
AND c.name = $1 -- child
;
EXECUTE upd( 'Omg_12', 'Omg_11');
EXECUTE upd( 'Omg_15', 'Omg_11');
Result:
CREATE TABLE
INSERT 0 15
PREPARE
UPDATE 1
UPDATE 1
id | name | children
----+--------+-----------
1 | Omg_1 | {3,4,5}
2 | Omg_2 | {6,7,8}
3 | Omg_3 | {9,10,11}
4 | Omg_4 | {13,14}
5 | Omg_5 | {NULL}
6 | Omg_6 | {NULL}
7 | Omg_7 | {NULL}
8 | Omg_8 | {NULL}
9 | Omg_9 | {NULL}
10 | Omg_10 | {NULL}
11 | Omg_11 | {15,12}
12 | Omg_12 | {NULL}
13 | Omg_13 | {NULL}
14 | Omg_14 | {NULL}
15 | Omg_15 | {NULL}
(15 rows)

Joining many tables on same data and returning all rows

UPDATE:
My original attempt to use FULL OUTER JOIN did not work correctly. I have updated the question to reflect the true issue. Sorry for presenting a classic XY problem.
I'm trying to retrieve a dataset from multiple tables in one query, grouped by the year and month of the data.
The final result should look like this:
| Year | Month | Col1 | Col2 | Col3 |
|------+-------+------+------+------|
| 2012 | 11 | 231 | - | - |
| 2012 | 12 | 534 | 12 | 13 |
| 2013 | 1 | - | 22 | 14 |
Coming from data that looks like this:
Table 1:
| Year | Month | Data |
|------+-------+------|
| 2012 | 11 | 231 |
| 2012 | 12 | 534 |
Table 2:
| Year | Month | Data |
|------+-------+------|
| 2012 | 12 | 12 |
| 2013 | 1 | 22 |
Table 3:
| Year | Month | Data |
|------+-------+------|
| 2012 | 12 | 13 |
| 2013 | 1 | 14 |
I tried using FULL OUTER JOIN, but this doesn't quite work: no matter which table I select Year and Month from, there are NULL values.
SELECT
COALESCE(t1.year,t2.year,t3.year)
,COALESCE(t1.month,t2.month,t3.month)
,t1.data as col1
,t2.data as col2
,t3.data as col3
From t1
FULL OUTER JOIN t2
on t1.year = t2.year and t1.month = t2.month
FULL OUTER JOIN t3
on t1.year = t3.year and t1.month = t3.month
The result is something like this (it is too confusing to repeat exactly what I would get using this demo data):
| Year | Month | Col1 | Col2 | Col3 |
|------+-------+------+------+------|
| 2012 | 11 | 231 | - | - |
| 2012 | 12 | 534 | 12 | 13 |
| 2013 | 1 | - | 22 | |
| - | 1 | - | - | 14 |
If your data allows it (not 100 columns), this is usually a clean way of doing it:
select year, month, sum(col1) as col1, sum(col2) as col2, sum(col3) as col3
from (
SELECT t1.year, t1.month, t1.data as col1, 0 as col2, 0 as col3
From t1
union all
SELECT t2.year, t2.month, 0 as col1, t2.data as col2, 0 as col3
From t2
union all
SELECT t3.year, t3.month, 0 as col1, 0 as col2, t3.data as col3
From t3
) as data
group by year, month
If you are using SQL Server 2005 or a later version, you could also try this PIVOT solution:
SELECT
Year,
Month,
Col1,
Col2,
Col3
FROM (
SELECT Year, Month, 'Col1' AS Col, Data FROM t1
UNION ALL
SELECT Year, Month, 'Col2' AS Col, Data FROM t2
UNION ALL
SELECT Year, Month, 'Col3' AS Col, Data FROM t3
) f
PIVOT (
SUM(Data) FOR Col IN (Col1, Col2, Col3)
) p
;
This query can be tested and played with at SQL Fiddle.
Perhaps you are looking for the COALESCE function? It takes a list of expressions and returns the first one that is NOT NULL, or NULL if all arguments are NULL. In your example, you would do something like this:
SELECT COALESCE(t1.data, t2.data)
You would still need to join tables in this case. It would just cut down on the case statements.
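Applied to the original query, the remaining problem is the second FULL OUTER JOIN: it compares t3 only against t1, so a month present in t2 and t3 but missing from t1 splits into two rows. A sketch that joins t3 on the coalesced keys instead:
SELECT
COALESCE(t1.year, t2.year, t3.year) AS year
, COALESCE(t1.month, t2.month, t3.month) AS month
, t1.data AS col1
, t2.data AS col2
, t3.data AS col3
FROM t1
FULL OUTER JOIN t2
ON t1.year = t2.year AND t1.month = t2.month
FULL OUTER JOIN t3
ON COALESCE(t1.year, t2.year) = t3.year
AND COALESCE(t1.month, t2.month) = t3.month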
You could derive the complete list of years and months from all the tables, then join every table to that list (using a left join):
SELECT
f.Year,
f.Month,
t1.data AS col1,
t2.data AS col2,
t3.data AS col3
FROM (
SELECT Year, Month FROM t1
UNION
SELECT Year, Month FROM t2
UNION
SELECT Year, Month FROM t3
) f
LEFT JOIN t1 ON f.year = t1.year and f.month = t1.month
LEFT JOIN t2 ON f.year = t2.year and f.month = t2.month
LEFT JOIN t3 ON f.year = t3.year and f.month = t3.month
;
You can see a live demonstration of this query at SQL Fiddle.
If you are looking for the non-NULL values from either table, then you will have to add t1.data IS NOT NULL as well. I hope that I understand your question.
CREATE VIEW joined_SALES
AS SELECT t1.year, t1.month, t1.data AS col1, t2.data AS col2
FROM table1 t1
JOIN table2 t2
ON t1.year = t2.year
AND t1.month = t2.month
WHERE t1.data IS NOT NULL;
This might be a better way, especially if you are going to do something with the data before returning it. Basically you are translating the table the data came from into a typeId.
declare @temp table
([year] int,
[month] int,
typeId int,
data decimal)
insert into @temp
SELECT t1.year, t1.month, 1, sum(t1.data)
From t1
group by t1.year, t1.month
insert into @temp
SELECT t2.year, t2.month, 2, sum(t2.data)
From t2
group by t2.year, t2.month
insert into @temp
SELECT t3.year, t3.month, 3, sum(t3.data)
From t3
group by t3.year, t3.month
select t.year, t.month,
sum(case when t.typeId = 1 then t.data end) as col1,
sum(case when t.typeId = 2 then t.data end) as col2,
sum(case when t.typeId = 3 then t.data end) as col3
from @temp t
group by t.year, t.month