ADF mapping data flow only inserting, never updating - azure-data-factory

I have an ADF data flow that will only insert. It never updates rows.
Below is a screenshot of the flow, and the Alter Row task that sets the insert/Update policies.
data flow
alter row task
There is a source table for new data and a destination table.
A lookup is done against the key of the destination table.
Two columns are then generated, a hash of the source data & hash of the destination data.
In the Alter Row task, the policies are as follows:
Insert: if the lookup found no matching id.
Update: if lookup found a matching id and the checksums do not match (i.e. user exists but data is different between the source and existing record).
Otherwise it should do nothing.
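To make the intended behaviour concrete, here is a minimal Python sketch of the three policies above. It is not ADF code and says nothing about how the data flow engine works internally. The function names, the `id` key and the `name` column are all hypothetical, chosen only for illustration.

```python
import hashlib


def row_hash(row, columns):
    """Hash the compared columns of a row in a fixed column order."""
    joined = "|".join(str(row[c]) for c in columns)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def classify(source_rows, destination_rows, key, columns):
    """Split source rows into inserts, updates and unchanged rows."""
    dest_by_key = {r[key]: r for r in destination_rows}  # the "lookup"
    inserts, updates, unchanged = [], [], []
    for row in source_rows:
        match = dest_by_key.get(row[key])
        if match is None:
            inserts.append(row)      # Insert: no matching id
        elif row_hash(row, columns) != row_hash(match, columns):
            updates.append(row)      # Update: id matches, hashes differ
        else:
            unchanged.append(row)    # Otherwise: do nothing
    return inserts, updates, unchanged


# Hypothetical sample data
destination = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
source = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Robert"},
    {"id": 3, "name": "Cy"},
]
inserts, updates, unchanged = classify(source, destination, "id", ["name"])
```

If a second run still lands every row in the insert bucket, the lookup is the first thing to check, since a lookup that never matches makes every row look new.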
The sink allows inserts and updates.
Even so, on the first run it inserts all records, but on the second run it inserts all the records again, even if they already exist.
I think I am misunderstanding the process, so I would appreciate any expertise or advice.

Thank you, Joel Cochran, for your valuable input. I reproduced the scenario and am posting this as an answer to help other community members.
If you are using the upsert method in the sink, add an Alter Row transformation with an Upsert if condition and write the expression for the upsert condition.
If you are using insert and update as the update methods in the sink, use both Insert if and Update if conditions in the Alter Row transformation, so that rows are inserted or updated in the sink according to those conditions.

Related

Azure Data Factory Copy Pipeline with Geography Data Type

I am trying to copy a geography data type from a production DB to another DB nightly. I really wanted to use upsert as the write behavior, but it seems that geography is not supported with this method. I read a similar post about bringing the data through ADF as well-known text and then converting it, but I keep getting confused about what to do with the data once it has been brought over as well-known text. I would appreciate any advice, thank you.
I tried ADF pipelines and data flows. I also tried converting the data type once the data was in the destination, but then I was not able to run the pipeline again.
I tried to upsert data with the geography data type from one Azure SQL database to another using the copy activity and got an error message.
I then did the upsert using a data flow activity. The steps are below.
A source table is used as the source in the data flow. It is created and populated as follows:
CREATE TABLE SpatialTable
( id int ,
GeogCol1 geography,
GeogCol2 AS GeogCol1.STAsText() );
INSERT INTO SpatialTable (id,GeogCol1)
VALUES (1,geography::STGeomFromText('LINESTRING(-122.360 46.656, -122.343 46.656 )', 4326));
INSERT INTO SpatialTable (id,GeogCol1)
VALUES (2,geography::STGeomFromText('POLYGON((-122.357 47.653 , -122.348 47.649, -122.348 47.658, -122.358 47.658, -122.358 47.653))', 4326));
Next, an Alter Row transformation is added and, under Alter Row Conditions, Upsert if is set to isNull(id)==false(). (The sink table is upserted based on the id column.)
Then, in the Sink, the dataset for the target table is specified. In the sink settings, Allow upsert is selected as the update method and the required key column is given (here, id).
When the pipeline is run for the first time, the data is inserted into the target table.
When the pipeline is run a second time, after updating existing data and inserting new records in the source, the data is upserted correctly.
The source data was changed for id=1 and a new row was inserted with id=3.
The sink data reflects the changes made in the source.
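For readers who want to see the key-based upsert semantics outside ADF, here is a rough Python sketch using an in-memory SQLite database as a stand-in. It is not what ADF or Azure SQL runs internally. The `target` table, the `geog_wkt` column (geography held as well-known text) and the second-run values are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in sink table: id is the key column, geography kept as WKT text.
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, geog_wkt TEXT)")


def upsert(conn, rows):
    """Insert new ids and overwrite existing ids (key-based upsert)."""
    conn.executemany(
        "INSERT INTO target (id, geog_wkt) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET geog_wkt = excluded.geog_wkt",
        rows,
    )
    conn.commit()


# First run: both rows are new, so both are inserted.
upsert(conn, [
    (1, "LINESTRING(-122.360 46.656, -122.343 46.656)"),
    (2, "POLYGON((-122.357 47.653, -122.348 47.649, -122.348 47.658, "
        "-122.358 47.658, -122.357 47.653))"),
])

# Second run: id=1 changed in the source and id=3 is new (made-up values).
upsert(conn, [
    (1, "LINESTRING(-122.360 46.656, -122.300 46.700)"),
    (3, "POINT(-122.349 47.651)"),
])

result = conn.execute(
    "SELECT id, geog_wkt FROM target ORDER BY id"
).fetchall()
```

After the second call, `result` holds three rows: id=1 with the new value, id=2 untouched, and id=3 added. This matches the behaviour described above.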

COPY support with postgreSQL v12 triggers

We have used this trigger and function pair on our PostgreSQL database for a long time. The trigger fires each time a new record is inserted into the main table, and each row is inserted into its monthly partition individually. The trigger definition is as follows:
CREATE TRIGGER partition_mic_teams_endpoint_trg1
BEFORE INSERT ON "mic_teams_endpoint"
FOR EACH ROW EXECUTE
PROCEDURE trg_partition_mic_teams_endpoint('month');
The function we have creates monthly partitions based on a timestamp field in each row.
I have two questions:
If I COPY a bunch of rows from CSV into the main table, will this trigger/function still insert each row individually? Is that efficient?
If that is the case, is it possible to COPY data into the partitions instead of using INSERT?
Thanks,
Note: I am sorry if I did not provide enough information for an answer
Yes, a row level trigger will be called for each row separately, and that will make COPY quite a bit slower.
One thing you could try is a statement level AFTER trigger that uses a transition table, so that you can
INSERT INTO destination SELECT ... FROM transition_table;
That should be faster, but you should test it to be certain.
See the documentation for details.

Access arbitrary metadata in after delete trigger

I'm thinking about creating archive tables in our database.
I can create an AFTER DELETE trigger that moves the row to the archive table, but I need to fill a deleted_by field with the id of the user who removed the data. To be clear, this user is an entity in our application, not an internal Postgres user.
If Postgres had a way to attach some metadata to the transaction, I could use it inside the trigger to fill this field. Maybe I can use variables for that? Is there an existing solution to this problem?
I suggest writing a stored procedure that inserts the row into the archive table and deletes it from the original table. The API should then use only that procedure to delete a row, passing the user id as an argument.
You can still write a trigger that inserts the row into the archive table with a NULL user id if someone uses DELETE directly instead of the procedure. In that case, the row in the archive must hold the primary key from the original table in a UNIQUE NULL (unique, nullable) column to prevent duplicates.
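The "single entry point for deletes" idea can be sketched in Python against an in-memory SQLite database. This is a stand-in for a PostgreSQL stored procedure, not PL/pgSQL. The `items` and `items_archive` tables and the `delete_with_archive` function are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE items_archive (
    id INTEGER PRIMARY KEY,   -- primary key copied from items
    name TEXT NOT NULL,
    deleted_by INTEGER        -- application user id, passed in by the caller
);
INSERT INTO items (id, name) VALUES (1, 'first'), (2, 'second');
""")


def delete_with_archive(conn, item_id, deleted_by):
    """Archive a row and delete it in one transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "INSERT INTO items_archive (id, name, deleted_by) "
            "SELECT id, name, ? FROM items WHERE id = ?",
            (deleted_by, item_id),
        )
        conn.execute("DELETE FROM items WHERE id = ?", (item_id,))


delete_with_archive(conn, 1, deleted_by=42)
archived = conn.execute(
    "SELECT id, name, deleted_by FROM items_archive"
).fetchall()
remaining = conn.execute("SELECT id FROM items").fetchall()
```

Because the user id is an ordinary argument, nothing has to be attached to the transaction. The cost is that callers must go through this one function rather than issue DELETE themselves.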

Need a bulk insert tip

I need to insert data from one table into another. The source table is not guaranteed to have every row correct: some of the NOT NULL fields contain null values. From this source table I need to insert all valid rows into the destination, and find and return all invalid rows that failed to insert.
I know we can do this by validating all rows beforehand. But this is a bulk insert from a CSV parsed by .NET code, so we do not validate on the database side and insert directly.
We could also do this by running a loop, but performance might take a hit.
So my question is: is there any way to use a single statement that inserts the valid rows and skips the rows that have a problem?
BULK INSERT is all-or-nothing. SQL Server does not have the ability to shunt erroneous rows into a separate table, alas.
The best thing you can do is to validate all data thoroughly before inserting it. If the insert still fails (maybe due to a bug) you need to retry all rows one-by-one and log the errors that are occurring.
You can also bulk insert to a temp table and move the rows from there to the final table one-by-one.
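As a variation on the staging-table idea, the valid rows can be moved with one set-based statement and the invalid rows queried separately, rather than moved one by one. The sketch below uses Python with in-memory SQLite as a stand-in for SQL Server. The table names, columns and sample rows are made up, and only the NOT NULL rule is checked.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Staging accepts anything the parser produced (all columns nullable).
CREATE TABLE staging (id INTEGER, name TEXT, email TEXT);
-- The final table enforces the NOT NULL rules.
CREATE TABLE customers (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    email TEXT NOT NULL
);
""")

# Hypothetical rows as they might come out of the CSV parser.
parsed_rows = [
    (1, "Ann", "ann@example.com"),
    (2, None, "bob@example.com"),   # invalid: name is missing
    (3, "Cy", None),                # invalid: email is missing
    (4, "Dee", "dee@example.com"),
]
conn.executemany("INSERT INTO staging VALUES (?, ?, ?)", parsed_rows)

# One predicate defines "valid"; it is reused for both queries.
valid = "name IS NOT NULL AND email IS NOT NULL"

with conn:  # move all valid rows in a single set-based statement
    conn.execute(
        "INSERT INTO customers (id, name, email) "
        f"SELECT id, name, email FROM staging WHERE {valid}"
    )

# Rows that were skipped, to be returned to the caller.
rejected = conn.execute(
    f"SELECT id, name, email FROM staging WHERE NOT ({valid}) ORDER BY id"
).fetchall()
inserted_ids = [r[0] for r in conn.execute("SELECT id FROM customers ORDER BY id")]
```

This only catches rule violations you can express as a predicate up front. Failures you did not anticipate would still need the row-by-row fallback described above.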

Insert data from staging table into multiple, related tables?

I'm working on an application that imports data from Access to SQL Server 2008. Currently, I'm using a stored procedure to import the data individually by record. I can't go with a bulk insert or anything like that because the data is inserted into two related tables...I have a bunch of fields that go into the Account table (first name, last name, etc.) and three fields that will each have a record in an Insurance table, linked back to the Account table by the auto-incrementing AccountID that's selected with SCOPE_IDENTITY in the stored procedure.
Performance isn't very good due to the number of round trips to the database from the application. For this and some other reasons I'm planning to instead use a staging table and import the data from there. Reading up on my options for approaching this, a cursor that executes the same insert stored procedure on the data in the staging table would make sense. However it appears that cursors are evil incarnate and should be avoided.
Is there any way to insert data into one table, retrieve the auto-generated IDs, then insert data for the same records into another table using the corresponding ID, in a set-based operation? Or is a cursor my only option here?
Look at the OUTPUT clause. You should be able to add it to your INSERT statement to do what you want.
BTW, if you need to output columns into the second table that weren't inserted into the first one, then use MERGE instead of INSERT (as suggested in the comment to the original question) as its OUTPUT clause supports referencing other columns from the source table(s). Otherwise, keeping it with an INSERT is more straightforward, and it does give you access to the inserted identity column.
I have been experimenting with inserting multiple records into related tables using data binding, so try this.
Hopefully this is helpful. See the link "How to insert record into related tables" for more information.