Using the MERGE command in Upsolver

I would like to use the Upsolver MERGE command in my new transformations to populate S3/Athena and Snowflake tables. Since Snowflake supports upsert, when defining my transformation job, do I rely on the Snowflake functionality and use an Upsolver INSERT statement, or do I define an Upsolver MERGE transformation the same way I do for Athena, i.e.
CREATE JOB my_job_upsert
    START_FROM = BEGINNING
    ADD_MISSING_COLUMNS = TRUE
    RUN_INTERVAL = 1 MINUTE
AS MERGE INTO default_glue_catalog.upsolver_samples.test_upsert_with_merge AS target
/*
   Use the SELECT statement below to choose your columns and perform the desired transformations.
   In this example, we aggregate the sample orders data by customer and filter it to only include repeat purchasers.
*/
USING (SELECT field1 AS email,
              COUNT(DISTINCT field2) AS count,
              MIN(field3) AS min_number,
              MAX(date) AS last_date
       FROM default_glue_catalog.upsolver_samples.test_raw_data
       WHERE $commit_time BETWEEN run_start_time() AND run_end_time()
       GROUP BY 1
       HAVING COUNT(DISTINCT field2) > 1) source
ON (target.email = source.email) -- primary key
WHEN MATCHED THEN REPLACE -- update if primary keys match
WHEN NOT MATCHED THEN INSERT MAP_COLUMNS_BY_NAME; -- insert if primary key is unique (new record)
It would be nice to know, in general, whether the MERGE command syntax is consistent across the various target platforms.
I already built the Athena transformation and it works as expected.

You can use the same approach you used for Athena. The Upsolver INSERT command will insert new keys (append), and if the table has a primary key defined, the INSERT command will update the existing keys (upsert) as its default behavior.
MERGE is by definition an upsert and can handle deletes as well, and the syntax is consistent across all database/data warehouse/catalog targets.
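For illustration, if the target table is created with a primary key, a plain INSERT job relies on that default upsert behavior. Here is a minimal sketch reusing the names from the question above (the exact job options should be checked against the Upsolver documentation):
CREATE JOB my_job_insert
    START_FROM = BEGINNING
    RUN_INTERVAL = 1 MINUTE
AS INSERT INTO default_glue_catalog.upsolver_samples.test_upsert_with_merge MAP_COLUMNS_BY_NAME
    -- with a primary key of email on the target, matching keys are updated instead of appended
    SELECT field1 AS email,
           COUNT(DISTINCT field2) AS count,
           MIN(field3) AS min_number,
           MAX(date) AS last_date
    FROM default_glue_catalog.upsolver_samples.test_raw_data
    WHERE $commit_time BETWEEN run_start_time() AND run_end_time()
    GROUP BY 1;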

Related

Strategy for selecting records for DataSet

In the most common case, we have two (or more) tables in the database: a master (e.g. SalesOrderHeader) and a child (e.g. SalesOrderDetail).
We can read records from the database with a single SELECT using an INNER JOIN and an additional WHERE constraint to reduce the volume of data loaded from the database (using Adapter.Fill(DataSet)):
#"SELECT d.SalesOrderID, d.SalesOrderDetailID, d.OrderQty,
d.ProductID, d.UnitPrice
FROM Sales.SalesOrderDetail d
INNER JOIN Sales.SalesOrderHeader h
ON d.SalesOrderID = h.SalesOrderID
WHERE DATEPART(YEAR, OrderDate) = #year;"
Do I understand correctly that in this case we receive one table in the DataSet, without primary and foreign keys, and without the possibility of setting a constraint between the master and child tables?
Is this DataSet useful only for further queries against the columns and records that already exist in it?
We can't use DbCommandBuilder to create the SqlCommands for Insert, Update, and Delete based on the SelectCommand that filled the DataSet, and then simply update the data in these tables in the database?
If we want to organize local data modification in the tables using the disconnected layer of ADO.NET, we must populate the DataSet with two SELECTs:
"SELECT *
FROM Sales.SalesOrderHeader;"
"SELECT *
FROM Sales.SalesOrderDetail;"
After that we must create the primary keys for both tables, set the constraint between the master and child table, and create the SqlCommands for Insert, Update, and Delete with DbCommandBuilder.
In that case we can make any data modification in these tables locally and afterwards update the records in the database (using Adapter.Update(DataSet)).
If we use one SelectCommand to load data into two tables in the DataSet, can we use that SelectCommand with DbCommandBuilder to create the other SqlCommands for Update, and update all tables in the DataSet with one Adapter.Update(DataSet), or must we create a separate adapter to update each table?
If, to save resources, I load only part of the records (see below) from a table (e.g. SalesOrderDetail), do I understand correctly that I may then have problems when I send new records to the database (via Update), because new records can conflict with existing ones in the database on the primary key (some records having a different value in the OrderDate field)?
"SELECT *
FROM Sales.SalesOrderDetail
WHERE DATEPART(YEAR, OrderDate) = #year;"
There is nothing preventing you from writing your own Insert, Update and Delete commands for your first select statement with the join. Of course, you will have to determine a way to ensure that the foreign keys exist.
Insert Into SalesOrderDetail (SalesOrderID, OrderQty, ProductID, UnitPrice) Values (@SalesOrderID, @OrderQty, @ProductID, @UnitPrice);
Update SalesOrderDetail Set OrderQty = @OrderQty Where SalesOrderDetailID = @ID;
Delete From SalesOrderDetail Where SalesOrderDetailID = @ID;
You would execute these with ADO.NET commands instead of using the adapter. I wrote the sample code in VB.NET, but I am sure it is easy to change to C# if you prefer.
' Requires: Imports System.Data.SqlClient
Private Sub UpdateQuantity(Quant As Integer, DetailID As Integer)
    Using cn As New SqlConnection("Your connection string"),
          cmd As New SqlCommand("Update SalesOrderDetail Set OrderQty = @OrderQty Where SalesOrderDetailID = @ID;", cn)
        cmd.Parameters.Add("@OrderQty", SqlDbType.Int).Value = Quant
        cmd.Parameters.Add("@ID", SqlDbType.Int).Value = DetailID
        cn.Open()
        cmd.ExecuteNonQuery()
    End Using
End Sub
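As for ensuring the foreign keys exist before the insert, one option is to insert only when the header row is present. A T-SQL sketch (the EXISTS guard is my own addition, not part of the original answer):
Insert Into SalesOrderDetail (SalesOrderID, OrderQty, ProductID, UnitPrice)
Select @SalesOrderID, @OrderQty, @ProductID, @UnitPrice
Where Exists (Select 1 From Sales.SalesOrderHeader h Where h.SalesOrderID = @SalesOrderID);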

Make MERGE statement in BODS

I have SAP BODS as an ETL tool running against Oracle Exadata. I would like to produce a MERGE INTO statement from BODS that includes a WHERE clause, limiting which rows get updated when a match is found.
The merge statement I have today looks like this:
MERGE INTO TargetTable s
USING
  (SELECT columns
   FROM "sourceTable"
  ) n
ON (s.Column = n.Column)
WHEN MATCHED THEN
  UPDATE SET s."Column" = n.Column
  -----MISSING where clause ------
WHEN NOT MATCHED THEN
  INSERT /*+ APPEND */ (s.columns)
  VALUES (n.Columns);
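For what it's worth, Oracle's MERGE accepts a WHERE clause directly on the WHEN MATCHED branch, so a hand-written version of the desired statement would look roughly like this (column names are placeholders carried over from the question):
MERGE INTO TargetTable s
USING (SELECT columns FROM "sourceTable") n
   ON (s.Column = n.Column)
WHEN MATCHED THEN
  UPDATE SET s."Column" = n.Column
  WHERE s."Column" <> n.Column   -- only update rows that actually change
WHEN NOT MATCHED THEN
  INSERT (s.columns)
  VALUES (n.Columns);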
Use the Data Services target table's Auto Correct Load option. There are several options to play with there, and if you set the allow-merge option to 'Yes', the query above will be generated. But please take care that proper keys are set on the target for this to happen.
Cheerz.
Shaz

Update from existing table in Redshift

I would like to update a value in a Redshift table from the results of another table. I'm trying to run the following query but received an error.
update section_translate
set word=t.section_type
from (
select distinct section_type from mr_usage where section_type like '%sディスコ')t
where word = '80sディスコ'
The error I received:
ERROR: Target table must be part of an equijoin predicate
I can't understand what is incorrect in my query.
You need to turn the uncorrelated subquery into a correlated subquery:
update section_translate
set word=t.section_type
from (
select distinct section_type,'80sディスコ' as word from mr_usage where section_type like '%sディスコ')t
where section_translate.word = t.word
Otherwise, every record of the outer query is eligible for the update and the query engine rejects it. The way Postgres (and thus Redshift) evaluates uncorrelated subqueries is slightly different from SQL Server, Oracle, etc.

Redshift Copy and auto-increment does not work

I am using the Redshift COPY command to copy JSON data from S3.
The table definition is as follows:
CREATE TABLE my_raw
(
id BIGINT IDENTITY(1,1),
...
...
) diststyle even;
The COPY command I am using is as follows:
COPY my_raw FROM 's3://dev-usage/my/2015-01-22/my-usage-1421928858909-15499f6cc977435b96e610298919db26' credentials 'aws_access_key_id=XXXX;aws_secret_access_key=XXX' json 's3://bemole-usage/schemas/json_schema' ;
I expect that any newly inserted id will always be greater than select max(id) from my_raw. In fact, that is clearly not the case.
If I issue the above COPY command twice, the first time the ids run from 1 to N, even though the file only creates 114 records (a known issue with Redshift when it has multiple slices). The second time the ids are also between 1 and N, but the load takes free numbers that were not used by the first copy.
See below for a demo:
usagedb=# COPY my_raw FROM 's3://bemole-usage/my/2015-01-22/my-usage-1421930213881-b8afbe07ab34401592841af5f7ddb31c' credentials 'aws_access_key_id=XXXX;aws_secret_access_key=XXXX' json 's3://bemole-usage/schemas/json_schema' COMPUPDATE OFF;
INFO: Load into table 'my_raw' completed, 114 record(s) loaded successfully.
COPY
usagedb=#
usagedb=# select max(id) from my_raw;
max
------
4556
(1 row)
usagedb=# COPY my_raw FROM 's3://bemole-usage/my/2015-01-22/my-usage-1421930213881-b8afbe07ab34401592841af5f7ddb31c' credentials 'aws_access_key_id=XXXX;aws_secret_access_key=XXXX' json 's3://bemole-usage/schemas/my_json_schema' COMPUPDATE OFF;
INFO: Load into table 'my_raw' completed, 114 record(s) loaded successfully.
COPY
usagedb=# select max(id) from my_raw;
max
------
4556
(1 row)
Thx in advance
The only solution I found to make sure the ids are sequential, based on insertion order, is to maintain a pair of tables. The first one is the stage table, into which the items are inserted by the COPY command. The stage table has no id column at all.
Then I have another table that is an exact replica of the stage table, except that it has an additional column for the ids. A job takes care of filling the master table from the stage table using the ROW_NUMBER() function.
In practice, this means executing the following statement after each Redshift COPY is performed:
insert into master
(id,result_code,ct_timestamp,...)
select
#{startIncrement}+row_number() over(order by ct_timestamp) as id,
result_code,...
from stage;
The ids are then guaranteed to be sequential/consecutive in the master table.
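As a rough sketch of that setup (the column names follow the insert statement above; the column types and the MAX(id) lookup standing in for #{startIncrement} are my own assumptions):
-- Stage table: target of the COPY command, no id column
CREATE TABLE stage (
    result_code  VARCHAR(32),
    ct_timestamp TIMESTAMP
    -- ... remaining columns from the JSON ...
) diststyle even;

-- Master table: same columns plus the id
CREATE TABLE master (
    id           BIGINT,
    result_code  VARCHAR(32),
    ct_timestamp TIMESTAMP
    -- ... remaining columns ...
) diststyle even;

-- After each COPY into stage: append with consecutive ids, then empty the stage
INSERT INTO master (id, result_code, ct_timestamp)
SELECT m.max_id + ROW_NUMBER() OVER (ORDER BY s.ct_timestamp) AS id,
       s.result_code,
       s.ct_timestamp
FROM stage s,
     (SELECT COALESCE(MAX(id), 0) AS max_id FROM master) m;

TRUNCATE stage;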
I can't reproduce your problem; however, it is interesting how to get identity columns set correctly in conjunction with COPY. Here is a small summary:
Be aware that you can specify the columns (and their order) for a COPY command:
COPY my_table (col1, col2, col3) FROM s3://...
So if:
the EXPLICIT_IDS flag is NOT set,
no columns are listed as shown above, and
your input file does not contain data for the IDENTITY column,
then the identity values in the table will be set automatically and monotonically, as we all want.
From the docs:
If an IDENTITY column is included in the column list, then EXPLICIT_IDS must also be specified; if an IDENTITY column is omitted, then EXPLICIT_IDS cannot be specified. If no column list is specified, the command behaves as if a complete, in-order column list was specified, with IDENTITY columns omitted if EXPLICIT_IDS was also not specified.

Hive: How to do a SELECT query to output a unique primary key using HiveQL?

I have the following dataset which I want to transform into a table that can be exported to SQL. I am using Hive. The input is as follows:
call_id,stat1,stat2,stat3
1,a,b,c,
2,x,y,z,
3,d,e,f,
1,j,k,l,
The output table needs to have call_id as its primary key so it needs to be unique. The output schema should be
call_id,stat2,stat3,
1,b,c, or (1,k,l)
2,y,z,
3,e,f,
The problem is that when I use the keyword DISTINCT in the Hive query, the DISTINCT applies to all the columns combined. I want to apply the DISTINCT operation only to call_id, something along the lines of:
SELECT DISTINCT(call_id), stat2, stat3 from intable;
However, this is not valid in Hive (and I am not well-versed in SQL either).
The only legal query seems to be:
SELECT DISTINCT call_id, stat2, stat3 from intable;
But this returns multiple rows with the same call_id, because the other columns differ and so the row as a whole is distinct.
NOTE: There is no arithmetic relation between a, b, c, x, y, z, etc., so any trick of averaging or summing is not viable.
Any ideas how I can do this?
One quick idea, not the best one, but it will do the job:
hive> create table temp1(a int, b string);
hive> insert overwrite table temp1
      select call_id, max(concat(stat1,'|',stat2,'|',stat3)) from intable group by call_id;
hive> insert overwrite table intable
      -- split() takes a regular expression, so the pipe delimiter must be escaped
      select a, split(b,'\\|')[0], split(b,'\\|')[1], split(b,'\\|')[2] from temp1;
"I want to apply the DISTINCT operation only to the call_id"
But how will Hive then know which row to eliminate?
Without knowing the amount of data / the size of the stat fields you have, the following query can do the job:
select distinct i1.call_id, i1.stat2, i1.stat3 from (
select call_id, MIN(concat(stat1, stat2, stat3)) as smin
from intable group by call_id
) i2 join intable i1 on i1.call_id = i2.call_id
AND concat(i1.stat1, i1.stat2, i1.stat3) = i2.smin;
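If your Hive version has windowing functions (Hive 0.11 and later), a simpler alternative, offered here only as a sketch and not taken from the answers above, is to number the rows per call_id and keep one of them:
-- keeps one arbitrary row per call_id (here, the one with the smallest stat1)
SELECT call_id, stat2, stat3
FROM (
    SELECT call_id, stat2, stat3,
           ROW_NUMBER() OVER (PARTITION BY call_id ORDER BY stat1) AS rn
    FROM intable
) t
WHERE rn = 1;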