I have a table in Postgres with a column that should hold distinct alphanumeric values in the pattern 1234P001. However, due to some bug, there are duplicate values in the column, like 1234P001 appearing thrice.
I want to replace the duplicate 1234P001s with 1234P002, 1234P003 and 1234P004. How can I do this in PostgreSQL?
I tried using a sequence, but it didn't work.
This can be done with a temporary table and the row_number() window function. Here is an illustration.
-- Prepare a test case
create table the_table (id integer, the_column text);
insert into the_table values
(1, '1234P001'),
(2, '1235P001'),
(3, '1234P001'),
(4, '1236P001'),
(5, '1235P001'),
(6, '1234P001');
create temporary table the_temp_table as
select *, row_number() over (partition by the_column order by id) ord
from the_table ;
update the_temp_table
set the_column = the_column||'.'||ord::text where ord > 1;
truncate table the_table;
insert into the_table(id, the_column)
select id, the_column from the_temp_table;
select * from the_table order by the_column;
 id | the_column
----+------------
  1 | 1234P001
  3 | 1234P001.2
  6 | 1234P001.3
  2 | 1235P001
  5 | 1235P001.2
  4 | 1236P001
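Note that this scheme appends a '.2'/'.3' suffix instead of producing the 1234P002-style keys the question asked for. A variant of the update step that keeps the 4+1+3 pattern is sketched below; it simply increments the numeric tail by ord - 1 and does not guard against collisions with keys that already exist (the next answer shows how to find genuinely free keys):
update the_temp_table
set the_column = substring(the_column, 1, 5)
              || lpad((substring(the_column, 6)::int + ord - 1)::text, 3, '0')
where ord > 1;  -- e.g. the second '1234P001' becomes '1234P002'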
Using this sample data to illustrate the concept:
create table tab (id varchar(8) );
insert into tab(id) values
('1234P001'),
('1234P001'),
('1234P001'),
('1234P002'),
('1234P004'),
('1234P004'),
('1234P005');
First you need to identify the duplicated keys - use count(*) over:
select id,
count(*) over (partition by id) > 1 is_dup
from tab;
id |is_dup|
--------+------+
1234P001|true |
1234P001|true |
1234P001|true |
1234P002|false |
1234P004|true |
1234P004|true |
1234P005|false |
Assign each duplicated row a unique sequence number (you'll soon see why):
with dup as (
select id,
count(*) over (partition by id) > 1 is_dup
from tab
)
select id,
row_number() over (order by id) dup_idx
from dup
where is_dup;
id |dup_idx|
--------+-------+
1234P001| 1|
1234P001| 2|
1234P001| 3|
1234P004| 4|
1234P004| 5|
Now generate all not-yet-existing keys based on your key schema (here a prefix of length 5 and a 3-digit integer):
with free_key as (
select distinct substring(id,1,5)||lpad(idx::text,3,'0') id
from tab
cross join generate_series(1,10) as t(idx) /* increase the count up to 999 if required */
except
select id from tab)
select id,
row_number() over (order by id) free_id_idx
from free_key;
id |free_id_idx|
--------+-----------+
1234P003| 1|
1234P006| 2|
1234P007| 3|
1234P008| 4|
1234P009| 5|
1234P010| 6|
In the last step, simply join the duplicated keys with the unassigned keys on the unique index to get the resolution: old_id and the unique new_id.
Note that I use an outer join - if you get an empty new_id, there is a problem: there is no free key left in your schema to fix the duplicate.
with dup as (
select id,
count(*) over (partition by id) > 1 is_dup
from tab
),
dup2 as (
select id,
row_number() over (order by id) dup_idx
from dup
where is_dup),
free_key as (
select distinct substring(id,1,5)||lpad(idx::text,3,'0') id
from tab
cross join generate_series(1,10) as t(idx) /* increase the count up to 999 if required */
except
select id from tab),
free_key2 as (
select id,
row_number() over (order by id) free_id_idx
from free_key)
select dup2.id old_id, free_key2.id new_id
from dup2
left outer join free_key2
on dup2.dup_idx = free_key2.free_id_idx;
old_id |new_id |
--------+--------+
1234P001|1234P003|
1234P001|1234P006|
1234P001|1234P007|
1234P004|1234P008|
1234P004|1234P009|
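To actually apply the resolution in place - a sketch that goes beyond the original answer, using the system column ctid to address the otherwise-identical duplicate rows - the same CTEs can feed an UPDATE directly:
with dup as (
    select ctid as row_id, id,
           count(*) over (partition by id) > 1 is_dup
    from tab
),
dup2 as (
    select row_id,
           row_number() over (order by id) dup_idx
    from dup
    where is_dup),
free_key as (
    select distinct substring(id,1,5)||lpad(idx::text,3,'0') id
    from tab
    cross join generate_series(1,10) as t(idx)
    except
    select id from tab),
free_key2 as (
    select id,
           row_number() over (order by id) free_id_idx
    from free_key)
update tab
set id = free_key2.id
from dup2
-- inner join here: rows for which no free key exists are simply left untouched
join free_key2 on dup2.dup_idx = free_key2.free_id_idx
where tab.ctid = dup2.row_id;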
Related
I have a table with three columns: creationTime, number, id, which is populated every 15 seconds or so. I have been using a materialized view to track duplicates, like so:
SELECT number, id, count(*) AS dups_count
FROM my_table
GROUP BY number, id HAVING count(*) > 1;
The table contains thousands of records from the last 1.5 years. Refreshing this materialized view currently takes about 2 minutes. I would like a better solution. There are no fast-refresh materialized views available in PostgreSQL.
At first, I thought creating a trigger on the table to refresh the materialized view could be a solution. But if records come in every 15 seconds and the materialized view takes over 2 minutes to recalculate, it would not be a good idea. In any case, I don't like the idea of recalculating the same data over and over again.
Is there a better solution to it?
A trigger that increments a duplicate count might be a solution:
create table duplicates
(
number int,
id int,
dups_count int,
primary key (number, id)
);
The primary key will allow an efficient "UPSERT" that increments the dups_count in case of duplicates.
Then create a trigger that updates that table each time a row is inserted into the base table:
create function increment_dupes()
returns trigger
as
$$
begin
insert into duplicates (number, id, dups_count)
values (new.number, new.id, 1)
on conflict (number,id)
do update
set dups_count = duplicates.dups_count + 1;
return new;
end
$$
language plpgsql;
create trigger update_dups_count
after insert on my_table
for each row
execute function increment_dupes();
Each time you insert into my_table, either a new row will be created in duplicates, or the current dups_count will be incremented. If you delete or update rows in my_table you will also need triggers for that; a sketch for the delete case follows. Note, though, that updating the count for UPDATEs or DELETEs is not entirely safe for concurrent operations; the INSERT ... ON CONFLICT is.
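A possible companion trigger for deletes might look like this (a sketch, not part of the original answer; as noted above, decrementing the count this way is not fully safe under concurrency):
create function decrement_dupes()
    returns trigger
as
$$
begin
    -- decrement the counter for the key of the deleted row
    update duplicates
    set dups_count = dups_count - 1
    where number = old.number
      and id = old.id;
    return old;
end
$$
language plpgsql;

create trigger decrement_dups_count
    after delete on my_table
    for each row
    execute function decrement_dupes();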
A trigger does have some performance overhead, so you will have to test if the overhead is too big for your requirements.
Whenever there is scope for growth, the best way to scale is to repeat the process on incremental data only.
To explain this, we name the table that has been mentioned 'Tab':

Tab (number, id, creationtime)

with an index on the creationtime column. The key to applying the incremental method is to have a monotonically increasing value; here, 'creationtime' serves that purpose.
(a) Create another table, Tab_duplicate, with an additional column 'last_compute_timestamp':

Tab_duplicate (number, id, duplicate_count, last_compute_timestamp)
(b) Create an index on column 'last_compute_timestamp'.
(c) Run the insert to find the duplicate records and insert them into Tab_duplicate along with the last_compute_timestamp.
(d) For repeated execution, either:
1. Install the pg_cron extension (if it's not there already) and automate the execution of the insert:
https://github.com/citusdata/pg_cron
https://fatdba.com/2021/07/30/pg_cron-probably-the-best-way-to-schedule-jobs-within-postgresql-database/
2. Use a shell script or Python script to execute it on the DB through the OS crontab.
Because last_compute_timestamp is recorded in every iteration and reused in the next, the process is incremental and will always be fast.
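As an illustration of option 1, the capture INSERT (shown in full in Step 4 of the demonstration below) could be scheduled with pg_cron like this - a sketch with an arbitrary every-15-minutes schedule:
select cron.schedule('*/15 * * * *', $$
    INSERT INTO tab_duplicate
    SELECT a.id, a.number, a.duplicate_count, b.last_compute_timestamp
    FROM (SELECT id, number, Count(*) duplicate_count
          FROM tab,
               (SELECT Max(last_compute_timestamp) lct FROM tab_duplicate) max_date
          WHERE creationtime > max_date.lct
          GROUP BY id, number) a,
         (SELECT Max(creationtime) last_compute_timestamp
          FROM tab,
               (SELECT Max(last_compute_timestamp) lct FROM tab_duplicate) max_date
          WHERE creationtime > max_date.lct) b
$$);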
DEMONSTRATION:
Step 1: Production table
create table tab
(
id int,
number int,
creationtime timestamp
);
create index tab_id on tab(creationtime);
Step 2: Duplicate-capture table, with a one-time priming record (this can be removed after the first execution)
create table tab_duplicate
(
id int,
number int,
duplicate_count int,
last_compute_timestamp timestamp);
create index tab_duplicate_idx on tab_duplicate(last_compute_timestamp);
insert into tab_duplicate values(0,0,0,current_timestamp);
Step 3: Some duplicate entry into the production table
insert into tab values(1,10,current_timestamp);
select pg_sleep(1);
insert into tab values(1,10,current_timestamp);
insert into tab values(1,10,current_timestamp);
select pg_sleep(1);
insert into tab values(2,20,current_timestamp);
select pg_sleep(1);
insert into tab values(2,20,current_timestamp);
select pg_sleep(1);
insert into tab values(3,30,current_timestamp);
insert into tab values(3,30,current_timestamp);
select pg_sleep(1);
insert into tab values(4,40,current_timestamp);
Verify records:
postgres=# select * from tab;
id | number | creationtime
----+--------+----------------------------
1 | 10 | 2022-01-23 19:00:37.238865
1 | 10 | 2022-01-23 19:00:38.248574
1 | 10 | 2022-01-23 19:00:38.252622
2 | 20 | 2022-01-23 19:00:39.259584
2 | 20 | 2022-01-23 19:00:40.26655
3 | 30 | 2022-01-23 19:00:41.274673
3 | 30 | 2022-01-23 19:00:41.279298
4 | 40 | 2022-01-23 19:00:52.697257
(8 rows)
Step 4: Duplicates captured and verified.
INSERT INTO tab_duplicate
SELECT a.id,
       a.number,
       a.duplicate_count,
       b.last_compute_timestamp
-- a: per-key counts for the rows that arrived after the last run
FROM (SELECT id,
             number,
             Count(*) duplicate_count
      FROM tab,
           (SELECT Max(last_compute_timestamp) lct
            FROM tab_duplicate) max_date
      WHERE creationtime > max_date.lct
      GROUP BY id,
               number) a,
-- b: the new high-water mark, stored for the next run
     (SELECT Max(creationtime) last_compute_timestamp
      FROM tab,
           (SELECT Max(last_compute_timestamp) lct
            FROM tab_duplicate) max_date
      WHERE creationtime > max_date.lct) b;
Execute:

INSERT 0 4
Verify:
postgres=# select * from tab_duplicate;
id | number | duplicate_count | last_compute_timestamp
----+--------+-----------------+----------------------------
0 | 0 | 0 | 2022-01-23 19:00:25.779671
3 | 30 | 2 | 2022-01-23 19:00:52.697257
1 | 10 | 3 | 2022-01-23 19:00:52.697257
4 | 40 | 1 | 2022-01-23 19:00:52.697257
2 | 20 | 2 | 2022-01-23 19:00:52.697257
(5 rows)
Step 5: Some more duplicates into the production table
insert into tab values(5,50,current_timestamp);
select pg_sleep(1);
insert into tab values(5,50,current_timestamp);
select pg_sleep(1);
insert into tab values(5,50,current_timestamp);
select pg_sleep(1);
insert into tab values(6,60,current_timestamp);
select pg_sleep(1);
insert into tab values(6,60,current_timestamp);
select pg_sleep(1);
Step 6: The same duplicate-capture SQL, executed again, will capture ONLY the incremental records in the production table.
Execute the same INSERT as in Step 4:

INSERT 0 2
Verify:
postgres=# select * from tab_duplicate;
id | number | duplicate_count | last_compute_timestamp
----+--------+-----------------+----------------------------
0 | 0 | 0 | 2022-01-23 19:00:25.779671
3 | 30 | 2 | 2022-01-23 19:00:52.697257
1 | 10 | 3 | 2022-01-23 19:00:52.697257
4 | 40 | 1 | 2022-01-23 19:00:52.697257
2 | 20 | 2 | 2022-01-23 19:00:52.697257
5 | 50 | 3 | 2022-01-23 19:02:37.884417
6 | 60 | 2 | 2022-01-23 19:02:37.884417
(7 rows)
This duplicate capture will always be fast, because of two things:
It works only on the incremental data since the last scheduled run.
Scanning for the maximum timestamp happens on a single-column index (an index-only scan).
From execution plan:
-> Index Only Scan Backward using tab_duplicate_idx on tab_duplicate tab_duplicate_2 (cost=0.15..77.76 rows=1692 width=8)
CAVEAT: If duplicates accumulate in tab_duplicate over a longer period of time, you can dedupe it at a periodic interval, say at the end of the day. That will be fast in any case, because tab_duplicate is a small aggregated table that is offline to your application, whereas tab is your production table with huge accumulated data.
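A possible shape for that end-of-day dedupe (a sketch; it assumes 'dedupe' means collapsing the per-run rows into one aggregated row per key):
begin;
-- aggregate the per-run rows into one row per (id, number)
create temporary table tab_duplicate_agg as
    select id,
           number,
           sum(duplicate_count)        as duplicate_count,
           max(last_compute_timestamp) as last_compute_timestamp
    from tab_duplicate
    group by id, number;
truncate tab_duplicate;
insert into tab_duplicate
    select id, number, duplicate_count, last_compute_timestamp
    from tab_duplicate_agg;
drop table tab_duplicate_agg;
commit;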
Also, a trigger on the production table is a viable solution, but it adds overhead to transactions on the production system, since trigger execution has a cost for every insert.
Two approaches come to mind:
Create a secondary table with (number, id) columns. Add a trigger so that whenever a duplicate row is about to be inserted into my_table, it is also inserted into this secondary table. That way you'll have the data you need in the secondary table as soon as it comes in, and it won't take up too much space unless you have a ton of these duplicates.
Add a new column to my_table, perhaps a timestamp, to differentiate the duplicates. Add a unique constraint to my_table over the (number, id) columns where the new column is null. Then you can change your insert to include an ON CONFLICT clause, so that if a duplicate is being inserted you set its timestamp to now. When you want to search for duplicates, you can then just query using the new column; a sketch follows.
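A minimal sketch of the second approach, assuming a hypothetical marker column dup_ts and example values (none of these names or values appear in the original). Note that with ON CONFLICT ... DO UPDATE the existing row is flagged and the incoming duplicate row itself is not stored:
alter table my_table add column dup_ts timestamp;

-- uniqueness is enforced only for rows not yet flagged as duplicated
create unique index my_table_number_id_uniq
    on my_table (number, id)
    where dup_ts is null;

-- on conflict, flag the existing row; the incoming duplicate is discarded
insert into my_table (creationtime, number, id)
values (now(), 42, 7)
on conflict (number, id) where dup_ts is null
do update set dup_ts = now();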
I got this query,
SELECT s.pos
FROM (SELECT t.guild_id, t.user_id,
             ROW_NUMBER() OVER (ORDER BY t.reputation DESC) AS pos
      FROM users t) s
WHERE (s.guild_id, s.user_id) = ($2, $3)
that gets a user's "rank" in a guild, but I want to filter the results by entries that are in an array of t.user_id values (like {'1', '64', '83'}) and have this affect the resulting pos value. I found FILTER and WITHIN GROUP, but I'm not sure how to fit one of those into this query. How would I do that?
Here's the full table if that helps at all:
Table "public.users"
Column | Type | Collation | Nullable | Default
------------+-----------------------+-----------+----------+---------
guild_id | character varying(20) | | not null |
user_id | character varying(20) | | not null |
reputation | real | | not null | 0
Indexes:
"users_pkey" PRIMARY KEY, btree (guild_id, user_id)
Why not select on those first?
WITH UsersWeCareAbout AS (
SELECT * FROM users u WHERE u.user_id = ANY(subgroup_array)
), RepUsers AS (
SELECT t.guild_id, t.user_id, ROW_NUMBER() OVER(ORDER BY t.reputation DESC) AS pos
FROM UsersWeCareAbout t
) SELECT s.pos FROM RepUsers s WHERE (s.guild_id, s.user_id) = ($2, $3)
(untested if only because I didn't really have enough context to test with)
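If the subgroup arrives as a bind parameter instead of a literal array, the same shape works with a placeholder - an assumption about the call site, since subgroup_array is left unspecified above:
WITH UsersWeCareAbout AS (
    SELECT * FROM users u WHERE u.user_id = ANY($1::varchar[])
), RepUsers AS (
    SELECT t.guild_id, t.user_id,
           ROW_NUMBER() OVER (ORDER BY t.reputation DESC) AS pos
    FROM UsersWeCareAbout t
)
SELECT s.pos FROM RepUsers s WHERE (s.guild_id, s.user_id) = ($2, $3);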
I want to compare all column values of two tables. The two tables are identical: the number of columns is the same and the primary key is the same. Can anyone suggest a query that compares two such tables in Postgres?
The query should give the column name and the two differing values from the two tables, like this:
pkey | column_name | table1_value | table2_value
-----+-------------+--------------+-------------
 123 | bonus       | 1            | 0
To get all different rows you can use:
select *
from table_1 t1
join table_2 t2 on t1.pkey = t2.pkey
where t1 is distinct from t2;
This will only compare rows that exist in both tables. If you also want to find those that are missing in one of them, use a full outer join:
select coalesce(t1.pkey, t2.pkey) as pkey,
case
when t1.pkey is null then 'Missing in table_1'
when t2.pkey is null then 'Missing in table_2'
else 'At least one column is different'
end as status,
*
from table_1 t1
full outer join table_2 t2 on t1.pkey = t2.pkey
where (t1 is distinct from t2)
or (t1.pkey is null)
or (t2.pkey is null);
If you install the hstore extension, you can view the differences as a key/value map:
select coalesce(t1.pkey, t2.pkey) as pkey,
case
when t1.pkey is null then 'Missing in table_1'
when t2.pkey is null then 'Missing in table_2'
else 'At least one column is different'
end as status,
hstore(t1) - hstore(t2) as values_in_table_1,
hstore(t2) - hstore(t1) as values_in_table_2
from table_1 t1
full outer join table_2 t2 on t1.pkey = t2.pkey
where (t1 is distinct from t2)
or (t1.pkey is null)
or (t2.pkey is null);
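For completeness: hstore is a contrib extension, so it has to be enabled once per database before the query above will run:
create extension if not exists hstore;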
Using this sample data:
create table table_1 (pkey integer primary key, col_1 text, col_2 int);
insert into table_1 (pkey, col_1, col_2)
values (1, 'a', 1), (2, 'b', 2), (3, 'c', 3), (5, 'e', 42);
create table table_2 (pkey integer primary key, col_1 text, col_2 int);
insert into table_2 (pkey, col_1, col_2)
values (1,'a', 1), (2, 'x', 2), (3, 'c', 33), (4, 'd', 52);
A possible result would be:
pkey | status | values_in_table_1 | values_in_table_2
-----+----------------------------------+-------------------+------------------
2 | At least one column is different | "col_1"=>"b" | "col_1"=>"x"
3 | At least one column is different | "col_2"=>"3" | "col_2"=>"33"
4 | Missing in table_1 | |
5 | Missing in table_2 | |
Example data:
create table test1(pkey serial primary key, str text, val int);
insert into test1 (str, val) values ('a', 1), ('b', 2), ('c', 3);
create table test2(pkey serial primary key, str text, val int);
insert into test2 (str, val) values ('a', 1), ('x', 2), ('c', 33);
This simple query gives complete information on the differences between two tables (including rows missing in one of them):
(select 1 t, * from test1
except
select 1 t, * from test2)
union all
(select 2 t, * from test2
except
select 2 t, * from test1)
order by pkey, t;
t | pkey | str | val
---+------+-----+-----
1 | 2 | b | 2
2 | 2 | x | 2
1 | 3 | c | 3
2 | 3 | c | 33
(4 rows)
In Postgres 9.5+ you can transpose the result to the expected format using jsonb functions:
select pkey, key as column, val[1] as value_1, val[2] as value_2
from (
select pkey, key, array_agg(value order by t) val
from (
select t, pkey, key, value
from (
(select 1 t, * from test1
except
select 1 t, * from test2)
union all
(select 2 t, * from test2
except
select 2 t, * from test1)
) s,
lateral jsonb_each_text(to_jsonb(s))
group by 1, 2, 3, 4
) s
group by 1, 2
) s
where key <> 't' and val[1] <> val[2]
order by pkey;
pkey | column | value_1 | value_2
------+--------+---------+---------
2 | str | b | x
3 | val | 3 | 33
(2 rows)
I tried all of the above answers. Thanks, guys, for your help. But after googling I found a simple query:
SELECT <common_column_list> from table1
EXCEPT
SELECT <common_column_list> from table2;
It shows all rows of table1 in which any column value differs from the corresponding table2 value.
Not very nice, but fun, and it works :o)
Just replace public.mytable1 and public.mytable2 with the correct tables and
update the "where table_schema='public' and table_name='mytable1'" filter accordingly.
select * from (
select pkey,column_name,t1.col_value table1_value,t2.col_value table2_value from (
select pkey,generate_subscripts(t,1) ordinal_position,unnest(t) col_value from (
select pkey,
(
replace(regexp_replace( -- null fields
'{'||substring(a::character varying,'^.(.*).$') ||'}' -- {} instead of ()
,'([\{,])([,\}])','\1null\2','g'),',,',',null,')
)::TEXT[] t
from public.mytable1 a
) a) t1
left join (
select pkey,generate_subscripts(t,1) ordinal_position,unnest(t) col_value from (
select pkey,
(
replace(regexp_replace( -- null fields
'{'||substring(a::character varying,'^.(.*).$') ||'}' -- {} instead of ()
,'([\{,])([,\}])','\1null\2','g'),',,',',null,')
)::TEXT[] t
from public.mytable2 a
) a) t2 using (pkey,ordinal_position)
join (select * from information_schema.columns where table_schema='public' and table_name='mytable1') c using (ordinal_position)
) final where COALESCE(table1_value,'')!=COALESCE(table2_value,'')
Table name: Table1

id | name
---+-------------------------
 1 | 1-aaa-14 milan road
 2 | 23-abcde-lsd road
 3 | 2-mnbvcx-welcoome street

I want the result like this:

id | name | name1  | name2
---+------+--------+-----------------
 1 | 1    | aaa    | 14 milan road
 2 | 23   | abcde  | lsd road
 3 | 2    | mnbvcx | welcoome street
This function ought to give you what you need.
--Drop Function Dbo.Part
Create Function Dbo.Part
    (@Value Varchar(8000)
    ,@Part Int
    ,@Sep Char(1)='-'
    )Returns Varchar(8000)
As Begin
    Declare @Start Int
    Declare @Finish Int
    Set @Start=1
    Set @Finish=CharIndex(@Sep,@Value,@Start)
    While (@Part>1 And @Finish>0)Begin
        Set @Start=@Finish+1
        Set @Finish=CharIndex(@Sep,@Value,@Start)
        Set @Part=@Part-1
    End
    If @Part>1 Set @Start=Len(@Value)+1 -- Not found
    If @Finish=0 Set @Finish=Len(@Value)+1 -- Last token on line
    Return SubString(@Value,@Start,@Finish-@Start)
End
Usage:
Select ID
,Dbo.Part(Name,1,Default)As Name
,Dbo.Part(Name,2,Default)As Name1
,Dbo.Part(Name,3,Default)As Name2
From Dbo.Table1
It's rather compute-intensive, so if Table1 is very long you ought to write the results to another table, which you could refresh from time to time (perhaps once a day, at night).
Better yet, you could create a trigger, which automatically updates Table2 whenever a change is made to Table1. Assuming that column ID is primary key:
Create Table Dbo.Table2(
ID Int Constraint PK_Table2 Primary Key,
Name Varchar(8000),
Name1 Varchar(8000),
Name2 Varchar(8000))
Create Trigger Trigger_Table1 on Dbo.Table1 After Insert,Update,Delete
As Begin
If (Select Count(*)From Deleted)>0
Delete From Dbo.Table2 Where ID In (Select ID From Deleted)
If (Select Count(*)From Inserted)>0
Insert Dbo.Table2(ID, Name, Name1, Name2)
Select ID
,Dbo.Part(Name,1,Default)
,Dbo.Part(Name,2,Default)
,Dbo.Part(Name,3,Default)
From Inserted
End
Now, do your data manipulation (Insert, Update, Delete) on Table1, but do your Select statements on Table2 instead.
The below solution uses a recursive CTE for splitting the strings, and PIVOT for displaying the parts in their own columns.
WITH Table1 (id, name) AS (
SELECT 1, '1-aaa-14 milan road' UNION ALL
SELECT 2, '23-abcde-lsd road' UNION ALL
SELECT 3, '2-mnbvcx-welcoome street'
),
cutpositions AS (
SELECT
id, name,
rownum = 1,
startpos = 1,
nextdash = CHARINDEX('-', name + '-')
FROM Table1
UNION ALL
SELECT
id, name,
rownum + 1,
nextdash + 1,
CHARINDEX('-', name + '-', nextdash + 1)
FROM cutpositions c
WHERE nextdash < LEN(name)
)
SELECT
id,
[1] AS name,
[2] AS name1,
[3] AS name2
/* add more columns here */
FROM (
SELECT
id, rownum,
part = SUBSTRING(name, startpos, nextdash - startpos)
FROM cutpositions
) s
PIVOT ( MAX(part) FOR rownum IN ([1], [2], [3] /* extend the list here */) ) x
Without additional modifications this query can split names consisting of up to 100 parts (that's the default maximum recursion depth, which can be changed), but it displays no more than 3 of them. You can easily extend it to however many parts you want displayed; just follow the instructions in the comments.
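If names can have more than 100 parts, the recursion cap is lifted with a standard query hint appended to the statement above (0 means unlimited):
PIVOT ( MAX(part) FOR rownum IN ([1], [2], [3]) ) x
OPTION (MAXRECURSION 0);  -- raise or remove the default 100-level recursion cap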
select T.id,
substring(T.Name, 1, D1.Pos-1) as Name,
substring(T.Name, D1.Pos+1, D2.Pos-D1.Pos-1) as Name1,
substring(T.Name, D2.Pos+1, len(T.name)) as Name2
from Table1 as T
cross apply (select charindex('-', T.Name, 1)) as D1(Pos)
cross apply (select charindex('-', T.Name, D1.Pos+1)) as D2(Pos)
Testing performance of suggested solutions
Setup:
create table Table1
(
id int identity primary key,
Name varchar(50)
)
go
insert into Table1
select '1-aaa-14 milan road' union all
select '23-abcde-lsd road' union all
select '2-mnbvcx-welcoome street'
go 10000
Result:
If you will always have exactly 2 dashes, you can do the following by using PARSENAME (which handles at most four parts):
--testing table
CREATE TABLE #test(id INT, NAME VARCHAR(1000))
INSERT #test VALUES(1, '1-aaa-14 milan road')
INSERT #test VALUES(2, '23-abcde-lsd road')
INSERT #test VALUES(3, '2-mnbvcx-welcoome street')
SELECT id,PARSENAME(name,3) AS name,
PARSENAME(name,2) AS name1,
PARSENAME(name,1)AS name2
FROM (
SELECT id,REPLACE(NAME,'-','.') NAME
FROM #test)x
If you have dots in the name column, you have to replace them first and then convert them back to dots at the end. For example, using a tilde to substitute for the dot:
INSERT #test VALUES(3, '5-mnbvcx-welcoome street.')
SELECT id,REPLACE(PARSENAME(name,3),'~','.') AS name,
REPLACE(PARSENAME(name,2),'~','.') AS name1,
REPLACE(PARSENAME(name,1),'~','.') AS name2
FROM (
SELECT id,REPLACE(REPLACE(NAME,'.','~'),'-','.') NAME
FROM #test)x
I am trying to find duplicate rows in my DB, like this:
SELECT emailid, COUNT(emailid) AS NumOccurrences
FROM users
GROUP BY emailid HAVING ( COUNT(emailid) > 1 )
This returns the emailid and the number of matches found. Now what I want do is compare the ID column to another table I have and set a column there with the count.
The other table has a column named duplicates, which should contain the amount of duplicates from the select. So let's say we have 3 rows with the same emailid. The duplicates column has a "3" in all 3 rows. What I want is a "2" in the first 2 and nothing or 0 in the last of the 3 matching ID rows.
Is this possible?
Update:
I managed to have a temporary table now, which looks like this:
mailid | rowcount | AmountOfDups
-------+----------+-------------
643921 |        1 |            3
643921 |        2 |            3
643921 |        3 |            3
Now, how could I decide that only the first 2 should be updated (by mailid) in the other table? The other table has mailid as well.
SELECT ...
ROW_NUMBER() OVER (PARTITION BY email ORDER BY emailid DESC) AS RN
FROM ...
...is a great starting point for such a problem. Never underestimate the power of ROW_NUMBER()!
Using SQL Server 2005+, you could try something like this (full example):
DECLARE #Table TABLE(
ID INT IDENTITY(1,1),
Email VARCHAR(20)
)
INSERT INTO #Table (Email) SELECT 'a'
INSERT INTO #Table (Email) SELECT 'b'
INSERT INTO #Table (Email) SELECT 'c'
INSERT INTO #Table (Email) SELECT 'a'
INSERT INTO #Table (Email) SELECT 'b'
INSERT INTO #Table (Email) SELECT 'a'
; WITH Duplicates AS (
SELECT Email,
COUNT(ID) TotalDuplicates
FROM #Table
GROUP BY Email
HAVING COUNT(ID) > 1
)
, Counts AS (
SELECT t.ID,
ROW_NUMBER() OVER(PARTITION BY t.Email ORDER BY t.ID) EmailID,
d.TotalDuplicates
FROM #Table t INNER JOIN
Duplicates d ON t.Email = d.Email
)
SELECT ID,
CASE
WHEN EmailID = TotalDuplicates
THEN 0
ELSE TotalDuplicates - 1
END Dups
FROM Counts
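To actually write those numbers into the other table the question mentions, the same CTEs can feed an UPDATE ... FROM - a sketch that reuses Duplicates and Counts from the example above (so it must run in the same batch as the @Table declaration); dbo.OtherTable(mailid, duplicates) is a hypothetical stand-in for the real table, joined here on the base table's ID:
;WITH Duplicates AS (
    SELECT Email,
           COUNT(ID) TotalDuplicates
    FROM @Table
    GROUP BY Email
    HAVING COUNT(ID) > 1
), Counts AS (
    SELECT t.ID,
           ROW_NUMBER() OVER(PARTITION BY t.Email ORDER BY t.ID) EmailID,
           d.TotalDuplicates
    FROM @Table t INNER JOIN
         Duplicates d ON t.Email = d.Email
)
UPDATE o
SET o.duplicates = CASE
                       WHEN c.EmailID = c.TotalDuplicates THEN 0  -- last duplicate gets 0
                       ELSE c.TotalDuplicates - 1                 -- the others get count - 1
                   END
FROM dbo.OtherTable o
INNER JOIN Counts c ON c.ID = o.mailid;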