Best way to avoid duplicates in table? - tsql

I've been given a task that requires writing a script to mass change items in a table(ProductArea):
ProductID int
SalesareaID int
One ProductID can only exist once in each SalesareaID so there can't be any duplicates in this table. But one ProductID can be sold in multiple SalesareaID.
So an example would look something like:
ProductID SalesareaID
1 1
1 2
1 3
2 2
3 1
Now, some areas have merged. So, if I try to run a straightforward UPDATE to fix this, like:
UPDATE ProductArea SET SalesareaID = 4 where SalesareaID IN (2, 3)
it will find (1, 2) and change that to (1, 4). Then it will find (1, 3) and try to change that to (1, 4). But that already exists, so it will fail with a "Cannot insert duplicate key..." error.
Is there a best/recommended way to tell my UPDATE to only update if the resulting (ProductID, SalesareaID) doesn't already exist?

This should work. It uses a window function to delete all but one row per product among the areas being merged, so the UPDATE that follows cannot create a duplicate.
declare @T table (prodID int, salesID int, primary key (prodID, salesID));
insert into @T values
(1, 1)
, (1, 2)
, (1, 3)
, (2, 2)
, (3, 1);
-- number the rows of each product within the areas being merged
with cte as
( select t.*
, row_number() over (partition by t.prodID order by t.salesID) as rn
from @T t
where t.salesID in (2, 3)
)
delete cte where rn > 1;
update @T set salesID = 4 where salesID in (2, 3);
select * from @T;

If you are creating a new merged region from existing regions then I think the easiest thing to do would be to treat the merge as two separate operations.
First you insert entries for the new area based on the existing areas.
INSERT INTO ProductArea (ProductID, SalesareaID)
SELECT DISTINCT ProductID, 4 FROM ProductArea
WHERE SalesareaID IN (2, 3)
Then you remove the entries for the existing areas.
DELETE FROM ProductArea WHERE SalesareaID IN (2, 3)
The SalesareaID of 4 would need to be replaced by the id of the new Salesarea. The 2 and 3 would also need to be replaced by the ids of the areas you are merging to create the new Salesarea.
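If the target area could already contain some of the products, the INSERT above would hit the same duplicate-key error. A minimal sketch of one way to guard against that, assuming the ProductArea table from the question, the placeholder ids 2, 3 and 4, and a unique key on (ProductID, SalesareaID):

```sql
-- Sketch only: 2, 3 and 4 are the placeholder area ids used above.
SET XACT_ABORT ON;  -- roll the whole transaction back if any statement fails
BEGIN TRANSACTION;

-- Add each affected product to the new area once, skipping products
-- that already have a row for area 4.
INSERT INTO ProductArea (ProductID, SalesareaID)
SELECT DISTINCT pa.ProductID, 4
FROM ProductArea AS pa
WHERE pa.SalesareaID IN (2, 3)
  AND NOT EXISTS (SELECT 1
                  FROM ProductArea AS x
                  WHERE x.ProductID = pa.ProductID
                    AND x.SalesareaID = 4);

-- Remove the rows for the areas that were merged away.
DELETE FROM ProductArea WHERE SalesareaID IN (2, 3);

COMMIT TRANSACTION;
```

Wrapping both statements in one transaction means other sessions never see a product in both the old areas and the new one.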

Related

PostgreSQL - Can I get the inverse of distinct rows?

I have a table of contacts. Each contact has an associated website. Each website can have multiple contacts.
I ran a query to get one contact per website with Select distinct on (website). This works fine.
But I want to do something with the rest of the data, the rows that were not selected by Select distinct on (website). Is there an inverse command where I can find all records from websites that have NOT been processed?
Use except. Here is an illustration. order by is for clarity.
create temporary table the_table (i integer, tx text);
insert into the_table values
(1, 'one'),
(1, 'one more one'),
(1, 'yet another one'),
(2, 'two'),
(2, 'one more two'),
(2, 'yet another two'),
(3, 'three'),
(3, 'three alternative');
select * from the_table
EXCEPT
select distinct on (i) * from the_table
order by i;
i tx
1 one more one
1 yet another one
2 yet another two
2 one more two
3 three alternative
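Without an ORDER BY inside the distinct on query, which row it keeps per i is not guaranteed, so the rows left over by EXCEPT can vary. A different technique, row_number(), makes the choice explicit. This is a sketch that reuses the_table from the illustration and assumes that ordering by tx is an acceptable way to decide which row counts as "the first":

```sql
-- Sketch: rn = 1 is the row a DISTINCT ON (i) ... ORDER BY i, tx would keep;
-- everything with rn > 1 is "the rest".
select i, tx
from (
    select i, tx,
           row_number() over (partition by i order by tx) as rn
    from the_table
) as ranked
where rn > 1
order by i, tx;
```

Unlike EXCEPT, this does not collapse rows that are exact duplicates of each other.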

Set repeating IDs till first record repeats (bulk load csv file)

I have a file that I imported via bulk-insert and I want to assign group IDs/sequences.
I would like to keep assigning the same ID until the first character of the first record appears again. In this example it's "A".
The challenge I have is how to achieve this and set the IDs as in this example:
ID data
1 A000abcefd
1 E00asoaskdaok
1 C000dasdasok
2 A100abcasds
2 E100aandas
2 C100adsokdas
Here is one way to do it, but given the limited info you provided I will make the following assumptions:
- The data in your table has some order to it. This obviously will not work if that is not the case. I used an ID, you use what you have.
- The first row in the table has the character you are looking for.
CREATE TABLE #tmp(ID int, [data] varchar(20))
INSERT INTO #tmp
VALUES
(1, 'A000abcefd'),
(2, 'E00asoaskdaok'),
(3, 'C000dasdasok'),
(4, 'A100abcasds'),
(5, 'E100aandas'),
(6, 'C100adsokdas')
DECLARE @CHAR varchar(1)
SELECT @CHAR = (SELECT TOP 1 SUBSTRING([data],1,1) FROM #tmp ORDER BY ID)
-- running count of rows that start with @CHAR; it increases by one at the start of each group
SELECT SUM(CASE WHEN SUBSTRING([data],1,1) = @CHAR THEN 1 ELSE 0 END)
OVER(ORDER BY ID ROWS BETWEEN UNBOUNDED PRECEDING and CURRENT ROW) SeqNum
,[data]
FROM #tmp

Concatenate mutliple contiguous rows to single row

I have a huge table with IoT data from a lot of IoT devices. Every device sends data once per minute, but only if the counter input received some signals. If not, no data is sent. So in my database the data looks like
Today I load all this data into my application and aggregate it by iterating row by row, collapsing contiguous rows into one row each (3 rows in this example). Contiguous rows are all rows where the next row is one minute later. It works, but it doesn't feel smart or nice.
Does it make sense to do this aggregation on SQL Server, especially to improve performance?
How would you start?
This is a classic Islands and Gaps problem. I'm still mastering Islands and Gaps so I'd love any feedback on my solution from others in the know (please be gentle). There are at least a couple different ways to solve Islands and Gaps but this is the one that is easiest on my brain. Here's how I got it to work:
DDL to set up data:
IF OBJECT_ID('tempdb..#tmp') IS NOT NULL
DROP TABLE #tmp;
CREATE TABLE #tmp
(IoT_Device INT,
Count INT,
TimeStamp DATETIME);
INSERT INTO #tmp
VALUES
(1, 5, '2021-10-27 14:03'),
(1, 4, '2021-10-27 14:04'),
(1, 7, '2021-10-27 14:05'),
(1, 8, '2021-10-27 14:06'),
(1, 5, '2021-10-27 14:07'),
(1, 4, '2021-10-27 14:08'),
(1, 7, '2021-10-27 14:12'),
(1, 8, '2021-10-27 14:13'),
(1, 5, '2021-10-27 14:14'),
(1, 4, '2021-10-27 14:15'),
(1, 5, '2021-10-27 14:21'),
(1, 4, '2021-10-27 14:22'),
(1, 7, '2021-10-27 14:23');
Islands and Gaps Solution:
;WITH CTE_TIMESTAMP_DATA AS (
SELECT
IoT_Device,
Count,
TimeStamp,
LAG(TimeStamp) OVER
(PARTITION BY IoT_Device ORDER BY TimeStamp) AS previous_timestamp,
LEAD(TimeStamp) OVER
(PARTITION BY IoT_Device ORDER BY TimeStamp) AS next_timestamp,
ROW_NUMBER() OVER
(PARTITION BY IoT_Device ORDER BY TimeStamp) AS island_location
FROM #tmp
)
,CTE_ISLAND_START AS (
SELECT
ROW_NUMBER() OVER (PARTITION BY IoT_Device ORDER BY TimeStamp) AS island_number,
IoT_Device,
TimeStamp AS island_start_timestamp,
island_location AS island_start_location
FROM CTE_TIMESTAMP_DATA
WHERE DATEDIFF(MINUTE, previous_timestamp, TimeStamp) > 1
OR previous_timestamp IS NULL
)
,CTE_ISLAND_END AS (
SELECT
ROW_NUMBER() OVER (PARTITION BY IoT_Device ORDER BY TimeStamp) AS island_number,
IoT_Device,
TimeStamp AS island_end_timestamp,
island_location AS island_end_location
FROM CTE_TIMESTAMP_DATA
WHERE DATEDIFF(MINUTE, TimeStamp, next_timestamp) > 1
OR next_timestamp IS NULL
)
SELECT
S.IoT_Device,
(SELECT SUM(Count)
FROM CTE_TIMESTAMP_DATA
WHERE IoT_Device = S.IoT_Device
AND TimeStamp BETWEEN S.island_start_timestamp AND E.island_end_timestamp) AS Count,
S.island_start_timestamp,
E.island_end_timestamp
FROM CTE_ISLAND_START AS S
INNER JOIN CTE_ISLAND_END AS E
ON E.IoT_Device = S.IoT_Device
AND E.island_number = S.island_number;
The CTE_TIMESTAMP_DATA query pulls the IoT_Device, Count, and TimeStamp along with the TimeStamp before and after each record using LAG and LEAD, and assigns a row number to each record ordered by TimeStamp.
The CTE_ISLAND_START query gets the start of each island.
The CTE_ISLAND_END query gets the end of each island.
The main SELECT at the bottom then uses this data to sum the Count within each island.
This will work with multiple IoT_Devices.
You can read more about Islands and Gaps in numerous places online.
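One of the other ways to solve this is the "row number difference" technique, which needs a single pass and no self-join. This is a sketch against the same #tmp table, assuming at most one row per device per minute. The anchor date '2000-01-01' is arbitrary and only has to be earlier than the data:

```sql
;WITH CTE_GROUPED AS (
    SELECT
        IoT_Device,
        [Count],
        [TimeStamp],
        -- Within a run of consecutive minutes, the minute number and the row
        -- number both grow by 1, so their difference stays constant.
        DATEDIFF(MINUTE, '2000-01-01', [TimeStamp])
          - ROW_NUMBER() OVER (PARTITION BY IoT_Device ORDER BY [TimeStamp]) AS island_key
    FROM #tmp
)
SELECT
    IoT_Device,
    SUM([Count])     AS [Count],
    MIN([TimeStamp]) AS island_start_timestamp,
    MAX([TimeStamp]) AS island_end_timestamp
FROM CTE_GROUPED
GROUP BY IoT_Device, island_key
ORDER BY IoT_Device, island_start_timestamp;
```

For the sample data this should produce three islands (14:03 to 14:08, 14:12 to 14:15, and 14:21 to 14:23) with Count sums of 33, 24, and 16.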

PostgreSQL - How to add a new column with a default conditioning on another column?

I have a table like this
Table 1:
Code
1
2
3
4
1
2
I want to add a new column conditioning on the column code.
1 -> A
2 -> B
3 -> C
4 -> D
My expected result:
Table 1:
Code Name
1 A
2 B
3 C
4 D
1 A
2 B
I'm expecting code like this:
alter table table_1
add column Name varchar(64) set default case when Code = 1 then "A"
Code = 2 then "B"
Code = 3 then "C"
. . .
,
One straightforward way would be to just maintain a table of codes, and then join to it:
CREATE TABLE codes (Code integer, Name varchar(5));
INSERT INTO codes (Code, Name)
VALUES
(1, 'A'),
(2, 'B'),
(3, 'C'),
(4, 'D');
SELECT t1.Code, t2.Name
FROM table_1 t1
INNER JOIN codes t2
ON t1.Code = t2.Code;
Note that I vote against doing this update, because as your underlying code values change, you might be forced to do the update again. The above approach doesn't have that problem, and you get the correct name when you select.
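If the name really does have to live in table_1 itself, note that a column DEFAULT cannot refer to other columns of the row, and that double quotes denote identifiers rather than strings. A stored generated column is one option. This is a sketch that assumes PostgreSQL 12 or later and that Code is an integer column:

```sql
-- Sketch: Name is computed from Code and kept in sync automatically.
ALTER TABLE table_1
    ADD COLUMN Name varchar(64)
    GENERATED ALWAYS AS (
        CASE Code
            WHEN 1 THEN 'A'
            WHEN 2 THEN 'B'
            WHEN 3 THEN 'C'
            WHEN 4 THEN 'D'
        END
    ) STORED;
```

The mapping is then fixed in the table definition, so changing it means altering the table again. The lookup-table join above avoids that.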

Check SQL Server table values against themselves

Imagine I had this table:
declare @tmpResults table ( intItemId int, strTitle nvarchar(100), intWeight float )
insert into @tmpResults values (1, 'Item One', 7)
insert into @tmpResults values (2, 'Item One v1', 6)
insert into @tmpResults values (3, 'Item Two', 6)
insert into @tmpResults values (4, 'Item Two v1', 7)
And a function, which we'll call fn_Lev, that takes two strings, compares them to one another and returns the number of differences between them as an integer (i.e. the Levenshtein distance).
What's the most efficient way to query that table, check the fn_Lev value of each strTitle against all the other strTitles in the table, and delete rows that are similar to one another by a Levenshtein distance of 3, preferring to keep higher intWeights?
So after the delete, @tmpResults should contain
1 Item One 7
4 Item Two v1 7
I can think of ways to do this, but nothing that isn't horribly slow (i.e. iterative). I'm sure there's a faster way?
Cheers,
Matt
SELECT strvalue= CASE
WHEN t1.intweight >= t2.intweight THEN t1.strtitle
ELSE t2.strtitle
END,
dist = dbo.fn_Lev(t1.strtitle, t2.strtitle) -- scalar UDFs need a schema prefix; dbo is assumed here
FROM @tmpResults AS t1
INNER JOIN @tmpResults AS t2
ON t1.intitemid < t2.intitemid
WHERE dbo.fn_Lev(t1.strtitle, t2.strtitle) = 3
This performs a self join that matches each pair of rows only once. It excludes matching a row against itself and the reverse of a previous match, i.e. if A<->B is a match then B<->A isn't returned.
The CASE expression selects the higher-weighted title of each pair.
If I've understood you correctly, you can use a cross join
SELECT t1.intItemId AS Id1, t2.intItemId AS Id2, dbo.fn_Lev(t1.strTitle, t2.strTitle) AS Lev
FROM @tmpResults AS t1
CROSS JOIN @tmpResults AS t2
The cross join gives you every combination of rows between the left and right side of the join (hence it doesn't need an ON clause, as it matches everything to everything else). You can then use the result of the SELECT to choose which rows to delete.
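Turning either query into the actual delete could look like the sketch below. It assumes the table variable is declared as @tmpResults in the same batch, that fn_Lev lives in the dbo schema, that "similar" means a distance of 3 or less, and that ties on intWeight are broken in favour of the lower intItemId. None of these is stated in the question:

```sql
-- Sketch: delete every row that has a similar row which "wins" against it.
DELETE loser
FROM @tmpResults AS loser
WHERE EXISTS (
    SELECT 1
    FROM @tmpResults AS winner
    WHERE winner.intItemId <> loser.intItemId
      AND dbo.fn_Lev(winner.strTitle, loser.strTitle) <= 3   -- assumed threshold
      AND (winner.intWeight > loser.intWeight
           OR (winner.intWeight = loser.intWeight
               AND winner.intItemId < loser.intItemId))      -- assumed tie-break
);
```

A row is removed as soon as any similar, heavier row exists, even if that heavier row is itself removed by another match. The function is still evaluated for every pair, so this stays quadratic in the number of rows.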