Snowflake MERGE: how to keep chronological order

Snowflake has a very useful construct:
MERGE INTO ... WHEN MATCHED THEN UPDATE ... WHEN NOT MATCHED THEN INSERT ...
A very common situation is that you only need to update when the incoming data is newer, and discard it otherwise.
Is there a way to do this in one statement, or would I need to write a stored procedure that reads the existing row and decides in code whether to merge the incoming row?

Assuming we want to achieve SCD Type 1 (overwrite), we could use MERGE like:
MERGE INTO tab trg
USING src_tab src
ON trg.business_key = src.business_key
WHEN NOT MATCHED THEN
    INSERT (business_key, col1, col2)
    VALUES (src.business_key, src.col1, src.col2)
WHEN MATCHED
    -- if nothing has changed in the non-key attributes, do not update
    AND HASH(trg.col1, trg.col2) != HASH(src.col1, src.col2) THEN
    UPDATE SET trg.col1 = src.col1,
               trg.col2 = src.col2,
               trg.sys_load_date = CURRENT_TIMESTAMP();
Sample scenario:
CREATE OR REPLACE TABLE tab(
    sys_key INT IDENTITY(1,1),
    business_key INT NOT NULL,
    col1 STRING,
    col2 STRING,
    sys_load_date DATETIME DEFAULT CURRENT_TIMESTAMP());
CREATE OR REPLACE TABLE src_tab(business_key INT, col1 STRING, col2 STRING);
Sample run:
-- initial data
INSERT OVERWRITE INTO src_tab VALUES (10, NULL, NULL), (20, 'a','a');
--MERGE
SELECT * FROM tab ORDER BY sys_key;
/*
SYS_KEY BUSINESS_KEY COL1 COL2 SYS_LOAD_DATE
1 10 2021-05-11 11:34:31.337
2 20 a a 2021-05-11 11:34:31.337
*/
-- first row does not change, new row
INSERT OVERWRITE INTO src_tab VALUES (10, NULL, NULL), (30, 'b','b');
--MERGE
SELECT * FROM tab ORDER BY sys_key;
/*
SYS_KEY BUSINESS_KEY COL1 COL2 SYS_LOAD_DATE
1 10 2021-05-11 11:34:31.337
2 20 a a 2021-05-11 11:34:31.337
4 30 b b 2021-05-11 11:34:47.298
*/
-- change row 1
INSERT OVERWRITE INTO src_tab VALUES (10, 'x', NULL);
-- MERGE
SELECT * FROM tab ORDER BY sys_key;
/*
SYS_KEY BUSINESS_KEY COL1 COL2 SYS_LOAD_DATE
1 10 x 2021-05-11 11:35:32.122
2 20 a a 2021-05-11 11:34:31.337
4 30 b b 2021-05-11 11:34:47.298
*/

You can add additional logic to the WHEN MATCHED clause using an AND clause:
WHEN MATCHED [ AND <case_predicate> ] THEN ...
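To directly answer the original question (update only when the incoming row is newer) in a single statement, that same AND predicate can compare timestamps. A minimal sketch, assuming src_tab also carries a load_date column (not part of the DDL above):
MERGE INTO tab trg
USING src_tab src
ON trg.business_key = src.business_key
WHEN NOT MATCHED THEN
    INSERT (business_key, col1, col2)
    VALUES (src.business_key, src.col1, src.col2)
WHEN MATCHED
    -- only overwrite when the incoming row is newer than the stored one
    AND src.load_date > trg.sys_load_date THEN
    UPDATE SET trg.col1 = src.col1,
               trg.col2 = src.col2,
               trg.sys_load_date = src.load_date;
Older incoming rows satisfy neither branch's predicate, so they are simply discarded; no stored procedure is needed.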

Related

How to use a declare statement to update a table

I have this DECLARE statement:
declare @ReferralLevelData table([Type of Contact] varchar(10));
insert into @ReferralLevelData values ('f2f'),('nf2f'),('Travel'),('f2f'),('nf2f'),('Travel'),('f2f'),('nf2f'),('Travel');
select (row_number() over (order by [Type of Contact]) % 3) + 1 as [Referral ID]
      ,[Type of Contact]
from @ReferralLevelData
order by [Referral ID]
        ,[Type of Contact];
It does not insert into the table, so I feel this is not working as expected, i.e. it doesn't modify the table.
If it did work, I was hoping to modify the statement to make it update.
At the moment the query just prints this result:
1 f2f
1 nf2f
1 Travel
2 f2f
2 nf2f
2 Travel
3 f2f
3 nf2f
3 Travel
EDIT:
I want to update the table to enter recurring data in groups of three.
I have a table of data; it is duplicated twice in the same table to make three sets.
Its ReferenceID is the primary key. I want to group the three rows with the same ReferenceID and inject the three values 'f2f', 'nf2f', 'Travel' into the column called Type, in any order, but ensure that each ReferenceID gets each of those values exactly once.
Do you mean the following?
declare @ReferralLevelData table(
    [Referral ID] int,
    [Type of Contact] varchar(10)
);
insert into @ReferralLevelData([Referral ID],[Type of Contact])
select
    (row_number() over (order by [Type of Contact]) % 3) + 1 as [Referral ID]
   ,[Type of Contact]
from
(
    values ('f2f'),('nf2f'),('Travel'),('f2f'),('nf2f'),('Travel'),('f2f'),('nf2f'),('Travel')
) v([Type of Contact]);
If it suits you, you can also use the following query to generate the data:
select r.[Referral ID],ct.[Type of Contact]
from
(
values ('f2f'),('nf2f'),('Travel')
) ct([Type of Contact])
cross join
(
values (1),(2),(3)
) r([Referral ID]);
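To actually update the real table as the EDIT asks, a row_number() per ReferenceID group can drive the assignment. This is a sketch only: the table name dbo.ReferralLevel and the columns ReferenceID and [Type] are assumptions, since the real schema was not posted.
with numbered as
(
    select [Type],
           row_number() over (partition by ReferenceID order by (select null)) as rn
    from dbo.ReferralLevel
)
update numbered
set [Type] = case rn
                 when 1 then 'f2f'
                 when 2 then 'nf2f'
                 else 'Travel'
             end;
Updating through the CTE updates the underlying table, and each ReferenceID's three rows receive the three values exactly once.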

Populate random data from another table

update dataset1.test
set column4 = (select column1
from dataset2
order by random()
limit 1
)
I have to update column4 of dataset1 so that each row gets a random entry from dataset2's column1. But with the query above I get only one random entry, repeated in every row of dataset1, whereas I want each row to get its own random value.
SETUP
Let's start by assuming your tables and data are the following ones.
Note that I assume that dataset1 has a primary key (it can be a composite one, but, for the sake of simplicity, let's make it an integer):
CREATE TABLE dataset1
(
id INTEGER PRIMARY KEY,
column4 TEXT
) ;
CREATE TABLE dataset2
(
column1 TEXT
) ;
We fill both tables with sample data
INSERT INTO dataset1
(id, column4)
SELECT
i, 'column 4 for id ' || i
FROM
generate_series(101, 120) AS s(i);
INSERT INTO dataset2
(column1)
SELECT
'SOMETHING ' || i
FROM
generate_series (1001, 1020) AS s(i) ;
Sanity check:
SELECT count(DISTINCT column4) FROM dataset1 ;
| count |
| ----: |
| 20 |
Case 1: number of rows in dataset1 <= rows in dataset2
We'll perform a complete shuffling. Values from dataset2 will be used at most once, never repeated.
EXPLANATION
In order to make an update that shuffles all the values from column4 in a
random fashion, we need some intermediate steps.
First, for the dataset1, we need to create a list (relation) of tuples (id, rn), that
are just:
(id_1, 1),
(id_2, 2),
(id_3, 3),
...
(id_20, 20)
Where id_1, ..., id_20 are the ids present on dataset1.
They can be of any type, they need not be consecutive, and they can be composite.
For dataset2, we need to create another list of (column1, rn) tuples, that looks like:
(column1_1, 17),
(column1_2, 3),
(column1_3, 11),
...
(column1_20, 15)
In this case, the second column contains all the values 1 .. 20, but shuffled.
Once we have the two relations, we JOIN them ON ... rn. This, in practice, produces yet another list of tuples with (id, column1), where the pairing has been done randomly. We use these pairs to update dataset1.
THE REAL QUERY
This can all be done (clearly, I hope) by using some CTEs (a WITH statement) to hold the intermediate relations:
WITH original_keys AS
(
-- This creates tuples (id, rn),
-- where rn increases from 1 to the number of rows
SELECT
id,
row_number() OVER () AS rn
FROM
dataset1
)
, shuffled_data AS
(
-- This creates tuples (column1, rn)
-- where rn moves between 1 and the number of rows, but is randomly shuffled
SELECT
column1,
-- The next statement is what *shuffles* all the data
row_number() OVER (ORDER BY random()) AS rn
FROM
dataset2
)
-- You update your dataset1
-- with the shuffled data, linking back to the original keys
UPDATE
dataset1
SET
column4 = shuffled_data.column1
FROM
shuffled_data
JOIN original_keys ON original_keys.rn = shuffled_data.rn
WHERE
dataset1.id = original_keys.id ;
Note that the trick is performed by means of:
row_number() OVER (ORDER BY random()) AS rn
The row_number() window function produces as many consecutive numbers as there are rows, starting from 1.
These numbers are randomly shuffled because the OVER clause takes all the data and sorts it randomly.
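To see the shuffle in isolation, you can run the body of the shuffled_data CTE by itself; the rn assignments differ on every execution:
SELECT column1,
       row_number() OVER (ORDER BY random()) AS rn
FROM dataset2;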
CHECKS
We can check again:
SELECT count(DISTINCT column4) FROM dataset1 ;
| count |
| ----: |
| 20 |
SELECT * FROM dataset1 ;
id | column4
--: | :-------------
101 | SOMETHING 1016
102 | SOMETHING 1009
103 | SOMETHING 1003
...
118 | SOMETHING 1012
119 | SOMETHING 1017
120 | SOMETHING 1011
ALTERNATIVE
Note that this can also be done with subqueries, by simple substitution, instead of CTEs. That might improve performance in some cases:
UPDATE
dataset1
SET
column4 = shuffled_data.column1
FROM
(SELECT
column1,
row_number() OVER (ORDER BY random()) AS rn
FROM
dataset2
) AS shuffled_data
JOIN
(SELECT
id,
row_number() OVER () AS rn
FROM
dataset1
) AS original_keys ON original_keys.rn = shuffled_data.rn
WHERE
dataset1.id = original_keys.id ;
And again...
SELECT * FROM dataset1;
id | column4
--: | :-------------
101 | SOMETHING 1011
102 | SOMETHING 1018
103 | SOMETHING 1007
...
118 | SOMETHING 1020
119 | SOMETHING 1002
120 | SOMETHING 1016
You can check the whole setup and experiment at dbfiddle here
NOTE: if you do this with very large datasets, don't expect it to be extremely fast. Shuffling a very big deck of cards is expensive.
Case 2: number of rows in dataset1 > rows in dataset2
In this case, values for column4 can be repeated several times.
The easiest possibility I can think of (probably not an efficient one, but easy to understand) is to create a function random_column1, marked as VOLATILE:
CREATE FUNCTION random_column1()
RETURNS TEXT
VOLATILE -- important!
LANGUAGE SQL
AS
$$
SELECT
column1
FROM
dataset2
ORDER BY
random()
LIMIT
1 ;
$$ ;
And use it to update:
UPDATE
dataset1
SET
column4 = random_column1();
This way, some values from dataset2 might not be used at all, whereas others will be used more than once.
dbfiddle here
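A quick way to see the repetition (a sketch; the exact counts vary per run, and repeats only appear once dataset1 has more rows than dataset2):
SELECT column4, count(*) AS times_used
FROM dataset1
GROUP BY column4
ORDER BY times_used DESC;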
A better approach is to reference the outer table from the subquery. Then the subquery has to be evaluated for every row:
update dataset1.test
set column4 = (select
                  -- the self-comparison references the outer row, forcing a
                  -- correlated subquery that is re-evaluated per row
                  case when dataset1.test.column4 = dataset1.test.column4
                       then column1 end
               from dataset2
               order by random()
               limit 1
              )

Summarizing Only Rows with given criteria

Hi all!
Given the following table structure
DECLARE @TempTable TABLE
(
    idProduct INT,
    Layers INT,
    LayersOnPallet INT,
    id INT IDENTITY(1, 1) NOT NULL,
    Summarized BIT NOT NULL DEFAULT(0)
)
and the following insert statement which generates test data
INSERT INTO @TempTable(idProduct, Layers, LayersOnPallet)
SELECT 1, 2, 4
UNION ALL
SELECT 1, 2, 4
UNION ALL
SELECT 1, 1, 4
UNION ALL
SELECT 2, 2, 4
I would like to summarize (on Layers only) those rows that have the same idProduct and whose sum of Layers equals LayersOnPallet.
A picture is worth a thousand words: only the first two rows should be summarized, because they share the same idProduct and SUM(Layers) equals LayersOnPallet.
How can I achieve this? Is there any way to do this only with SELECTs (not with a WHILE loop)?
Thank you!
Perhaps this will do the trick. Note my comments:
-- your sample data
DECLARE @TempTable TABLE
(
    idProduct INT,
    Layers INT,
    LayersOnPallet INT,
    id INT IDENTITY(1, 1) NOT NULL,
    Summarized BIT NOT NULL DEFAULT(0)
)
INSERT INTO @TempTable(idProduct, Layers, LayersOnPallet)
SELECT 1, 2, 4 UNION ALL
SELECT 1, 2, 4 UNION ALL
SELECT 1, 1, 4 UNION ALL
SELECT 2, 2, 4;
-- an intermediate temp table used for processing
IF OBJECT_ID('tempdb..#processing') IS NOT NULL DROP TABLE #processing;
-- let's populate the #processing table with duplicates
SELECT
idProduct,
Layers,
LayersOnPallet,
rCount = COUNT(*)
INTO #processing
FROM @TempTable
GROUP BY
idProduct,
Layers,
LayersOnPallet
HAVING COUNT(*) > 1;
-- Remove the duplicates
DELETE t
FROM @TempTable t
JOIN #processing p
ON p.idProduct = t.idProduct
AND p.Layers = t.Layers
AND p.LayersOnPallet = t.LayersOnPallet
-- Add the new, updated record
INSERT @TempTable (idProduct, Layers, LayersOnPallet, Summarized)
SELECT
    idProduct,
    Layers * rCount,
    LayersOnPallet,
    1
FROM #processing;
DROP TABLE #processing; -- cleanup
-- Final output
SELECT idProduct, Layers, LayersOnPallet, Summarized
FROM @TempTable;
Results:
idProduct Layers LayersOnPallet Summarized
----------- ----------- -------------- ----------
1 4 4 1
1 1 4 0
2 2 4 0
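Since the question asks for a SELECT-only solution, the same output can also be produced without any intermediate table, by grouping identical (idProduct, Layers, LayersOnPallet) rows. A sketch that mirrors the behavior above (duplicates are collapsed whether or not the sum matches LayersOnPallet):
SELECT idProduct,
       SUM(Layers) AS Layers,
       LayersOnPallet,
       CASE WHEN COUNT(*) > 1 THEN 1 ELSE 0 END AS Summarized
FROM @TempTable
GROUP BY idProduct, Layers, LayersOnPallet;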

Postgres function to attach random integers to selected rows

I want a function or a trigger such that, when a row is inserted, all rows with matching criteria are given a random integer between 1 and the number of matching rows, so that a SELECT returns the rows in random order.
E.g. if I have the data
Col1  Col2  Order
A     1
B     2
B     2
B     3
A     2
and I insert another row with Col1 = B and Col2 = 2, then I want to end up with:
Col1  Col2  Order
A     1
B     2     2
B     2     3
B     3
A     2
B     2     1
where Order is a number from 1 to the count of matching rows, with each number appearing only once?
There is no need to store this, you can generate such a number when you retrieve the data.
select col1,
col2,
row_number() over (partition by col1, col2 order by random()) as random_order
from the_table
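If you want to reuse this without storing anything, one option is to wrap it in a view (the view name here is hypothetical); every SELECT against it produces a fresh shuffle:
CREATE VIEW randomized_rows AS
SELECT col1,
       col2,
       row_number() OVER (PARTITION BY col1, col2 ORDER BY random()) AS random_order
FROM the_table;

-- each query reshuffles:
SELECT * FROM randomized_rows;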

T-SQL query, multiple values in a field

I have two tables in a database. The first table, tblTracker, contains many columns, but the column of particular interest is called siteAdmin; each row in that column can contain multiple 5-digit loginIDs, like 21457, 21456, or just one, like 21444. The second table, users, contains columns like LoginID, fname, and lname.
What I would like to do is take the loginIDs contained in tblTracker.siteAdmin and return fname + lname from users. I can do this successfully when there is only one loginID in the row, such as 21444, but I cannot figure out how to do it when there is more than one, like 21457, 21456.
Here is the SQL statement I use when there is one loginID in that column:
SELECT b.FName + ' ' + b.LName AS siteAdminName
FROM tblTracker a
LEFT OUTER JOIN users b ON a.siteAdmin = b.Login_Id
However, this doesn't work when it tries to join a siteAdmin value that contains more than one LoginID.
Thanks!
I prefer the numbers table approach to split a string in TSQL.
For this method to work, you need to do this one-time table setup:
SELECT TOP 10000 IDENTITY(int,1,1) AS Number
INTO Numbers
FROM sys.objects s1
CROSS JOIN sys.objects s2
ALTER TABLE Numbers ADD CONSTRAINT PK_Numbers PRIMARY KEY CLUSTERED (Number)
Once the Numbers table is set up, create this split function:
CREATE FUNCTION [dbo].[FN_ListToTable]
(
     @SplitOn char(1)        --REQUIRED, the character to split the @List string on
    ,@List varchar(8000)     --REQUIRED, the list to split apart
)
RETURNS TABLE
AS
RETURN
(
    ----------------
    --SINGLE QUERY-- --this will not return empty rows
    ----------------
    SELECT
        ListValue
    FROM (SELECT
              LTRIM(RTRIM(SUBSTRING(List2, number + 1, CHARINDEX(@SplitOn, List2, number + 1) - number - 1))) AS ListValue
          FROM (
                SELECT @SplitOn + @List + @SplitOn AS List2
               ) AS dt
          INNER JOIN Numbers n ON n.Number < LEN(dt.List2)
          WHERE SUBSTRING(List2, number, 1) = @SplitOn
         ) dt2
    WHERE ListValue IS NOT NULL AND ListValue != ''
);
GO
GO
You can now easily split a CSV string into a table and join on it:
select * from dbo.FN_ListToTable(',','1,2,3,,,4,5,6777,,,')
OUTPUT:
ListValue
-----------------------
1
2
3
4
5
6777
(6 row(s) affected)
You can now use CROSS APPLY to split every row in your table, like:
DECLARE @users table (LoginID int, fname varchar(5), lname varchar(5))
INSERT INTO @users VALUES (1, 'Sam', 'Jones')
INSERT INTO @users VALUES (2, 'Don', 'Smith')
INSERT INTO @users VALUES (3, 'Joe', 'Doe')
INSERT INTO @users VALUES (4, 'Tim', 'White')
INSERT INTO @users VALUES (5, 'Matt', 'Davis')
INSERT INTO @users VALUES (15,'Sue', 'Me')
DECLARE @tblTracker table (RowID int, siteAdmin varchar(50))
INSERT INTO @tblTracker VALUES (1,'1,2,3')
INSERT INTO @tblTracker VALUES (2,'2,3,4')
INSERT INTO @tblTracker VALUES (3,'1,5')
INSERT INTO @tblTracker VALUES (4,'1')
INSERT INTO @tblTracker VALUES (5,'5')
INSERT INTO @tblTracker VALUES (6,'')
INSERT INTO @tblTracker VALUES (7,'8,9,10')
INSERT INTO @tblTracker VALUES (8,'1,15,3,4,5')
SELECT
    t.RowID, u.LoginID, u.fname + ' ' + u.lname AS YourAdmin
FROM @tblTracker t
CROSS APPLY dbo.FN_ListToTable(',', t.siteAdmin) st
LEFT OUTER JOIN @users u ON st.ListValue = u.LoginID --to get all rows even if missing siteAdmin
--INNER JOIN @users u ON st.ListValue = u.LoginID    --to remove rows without any siteAdmin
ORDER BY t.RowID, u.fname, u.lname
OUTPUT:
RowID LoginID YourAdmin
----------- ----------- -----------
1 2 Don Smith
1 3 Joe Doe
1 1 Sam Jones
2 2 Don Smith
2 3 Joe Doe
2 4 Tim White
3 5 Matt Davis
3 1 Sam Jones
4 1 Sam Jones
5 5 Matt Davis
7 NULL NULL
7 NULL NULL
7 NULL NULL
8 3 Joe Doe
8 5 Matt Davis
8 1 Sam Jones
8 15 Sue Me
8 4 Tim White
(18 row(s) affected)
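On SQL Server 2016+ (database compatibility level 130 or higher), the built-in STRING_SPLIT function can stand in for the custom split function. A sketch against the same sample tables; the inner WHERE filters empty entries so the behavior mirrors FN_ListToTable:
SELECT t.RowID, u.LoginID, u.fname + ' ' + u.lname AS YourAdmin
FROM @tblTracker t
CROSS APPLY (SELECT LTRIM(RTRIM(value)) AS ListValue
             FROM STRING_SPLIT(t.siteAdmin, ',')
             WHERE value <> '') st
LEFT OUTER JOIN @users u ON st.ListValue = u.LoginID
ORDER BY t.RowID, u.fname, u.lname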