Azure Data Factory copy activity mystery (one duplicated row, one missing row from source)

Okay, I am at a total loss here. I have tried tweaking every setting I can think of in the copy activity I am dealing with, but for some reason the sink ends up with one duplicated row and is missing one row that exists in the data source it pulls from.
Data comes in via REST from our CRM system. That part works fine and all records come through, yet the duplicate row still gets inserted (this even occurred when I changed the write behavior from insert to upsert and specified the row's GUID as the key - how is that possible??)
Overall flow is: grab an auth token from the REST source -> grab records via the REST endpoint in a Copy data activity.
Copy data activity - source settings
Copy data activity - sink settings
The settings of the Copy data step are at their defaults - I have tried editing DIUs, parallelism, max concurrent connections, different batch sizes, etc., with no impact.
It always appears to be the 15,000th row that is duplicated, so row 15,001 is a copy of row 15,000. In my mind I am thinking it is something to do with the data being partitioned and not indexed properly? I just don't get how this is happening - it doesn't happen with any of the other 60+ pipelines we run nightly.
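A minimal sketch of how the duplicate and the overall row count can be verified on the sink (assuming a SQL sink reachable via pyodbc; the connection string, table, and column names below are illustrative, not taken from the actual pipeline):

```python
# Minimal sanity check against the sink (all names are illustrative).
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:myserver.database.windows.net,1433;"
    "Database=mydb;Uid=myuser;Pwd=mypassword;Encrypt=yes;"
)
cursor = conn.cursor()

# GUIDs the copy activity inserted more than once.
cursor.execute(
    """
    SELECT RecordGuid, COUNT(*) AS copies
    FROM dbo.CrmRecords
    GROUP BY RecordGuid
    HAVING COUNT(*) > 1
    """
)
for guid, copies in cursor.fetchall():
    print(f"duplicated: {guid} x{copies}")

# Total row count, to compare against the record count the CRM reports.
cursor.execute("SELECT COUNT(*) FROM dbo.CrmRecords")
print("rows in sink:", cursor.fetchone()[0])
```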
Any help or suggestions would be greatly appreciated!

Related

How to configure a Synapse Mapping Data Flow for INSERT/UPDATE/DELETE when the destination table does not yet exist

I am trying to build a generic Mapping Data Flow to do some basic cleansing on tables in my Data Lake. I need it to work both on an ongoing basis, after data already exists in my cleansed tables, and when new tables are added (it would detect them automatically and create and populate the destination). Both the Source and Destination tables will be Delta tables.
The approach I have taken is to have Sources configured to both my actual source and to the target and use either JOIN transformations or EXISTS transformations to identify the new, updated and removed rows.
This works fine for INSERTS and UPDATES; however, my issue is dealing with DELETES when there is no data currently in the destination. Obviously there will be nothing to DELETE - that is as expected. However, because I reference the key column that will exist once data is loaded to the table, I get an error on an initial run that states:
ERROR Dataflow AppManager: name=BatchJobListener.failed, opId=xxx, message=Job 'xxx failed due to reason: DF-SINK-007 at Sink 'cleansedTableWithDeletes': Sink results in 0 output columns. Please ensure at least one column is mapped.
The overall process looks as follows:
Has anyone developed a pattern that works for a generic flow (this one is parameter driven and ensures schema drift is accommodated) or a way for the Data Flow to think that there IS a column in the destination that it can refer to and get past this issue?
In the Source options, check Allow no files found.
You can also provide the date dynamically in the Filter by last modified option.
Refer - https://learn.microsoft.com/en-us/azure/data-factory/data-flow-sink#sink-settings

Azure Data Factory ForEach is seemingly not running data flow in parallel

In Azure Data Factory I am using a Lookup activity to get a list of files to download, then passing it to a ForEach where a data flow processes each file.
I do not have 'Sequential' mode turned on, so I would assume that the data flows should be running in parallel. However, their runtimes are not the same; there is an almost constant gap between them (the first data flow ran 4 mins, the second 6, the third 8, and so on). It seems as if the second data flow is waiting for the first one to finish and then uses its cluster to process the file.
Is that intended behavior? I have a TTL set on the cluster, but that did not help much. If it is, then what is a workaround? I am currently working on creating a list of files first and using that instead of a ForEach, but I am not sure if I am going to see an increase in efficiency.
I have not been able to solve the issue of the parallel data flows not executing in parallel; however, I have managed to change the solution in a way that increases performance.
What I had before: a Lookup activity that gets a list of files to process, passed on to a ForEach loop with a data flow activity.
What I am testing now: a data flow activity that gets the list of files and saves it in a text file in ADLS, then the data flow activity that was previously in the ForEach loop, with its source changed to use "List of Files" and pointed at that list.
The result was an increase in efficiency (using the same cluster, 40 files would take around 80 mins with ForEach and only 2-3 mins with List of Files); however, debugging is not easy now that everything is in 1 data flow.
You can overwrite the list-of-files file each run, or use dynamic expressions and name the file after the pipelineId or something else.

Smartsheet API: is there any way to get a manually deleted row using a Smartsheet API or SDK call?

I am deleting rows from a sheet. On that sheet I have a daily job which needs to recognize the deleted records, and I need a way to identify them using the Smartsheet API or SDK.
Thanks in advance.
I don't believe this scenario (identifying deleted rows) is explicitly supported by the API at this time. Seems like you could still use the API to achieve your goal though, with a bit more work (code) on your part.
Your code would have to get the sheet data (i.e., all sheet rows) at a regular interval and save that data somewhere -- then each time the job runs, get the sheet data again and compare it to the data you saved the previous time the job ran (to identify any rows that have been deleted).
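A minimal sketch of that snapshot-and-compare approach with the Smartsheet Python SDK (the access token environment variable, sheet ID, and snapshot file name are illustrative):

```python
# Snapshot-and-compare: save the row IDs each run and diff them next run.
import json
import os
import smartsheet

SHEET_ID = 1234567890123456          # illustrative sheet ID
SNAPSHOT_FILE = "row_ids.json"       # where the previous run's row IDs are kept

client = smartsheet.Smartsheet(os.environ["SMARTSHEET_ACCESS_TOKEN"])

# Current set of row IDs in the sheet.
sheet = client.Sheets.get_sheet(SHEET_ID)
current_ids = {row.id for row in sheet.rows}

# Row IDs saved by the previous run (empty on the first run).
previous_ids = set()
if os.path.exists(SNAPSHOT_FILE):
    with open(SNAPSHOT_FILE) as f:
        previous_ids = set(json.load(f))

# Anything present last time but gone now was deleted in between runs.
deleted_ids = previous_ids - current_ids
print("rows deleted since last run:", deleted_ids)

# Save the current snapshot for the next run.
with open(SNAPSHOT_FILE, "w") as f:
    json.dump(sorted(current_ids), f)
```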
Edit 9/26: Added Webhooks info
Note that with the approach I've described above, any rows that had been added AND deleted during the interval between job runs would not be detected. If it's important to identify each and every time a row is deleted, a better (and much more efficient) approach would be to use Webhooks. By using webhooks, your application subscribes to notifications for a specified sheet, and then would receive a callback (HTTP POST) from Smartsheet any time the sheet changes. Your application would need to inspect the information in each callback it receives to identify 'deleted row' events (eventType = deleted and objectType = row).
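For the callback handling, something along these lines would do the filtering (a sketch only; the objectType/eventType values match those mentioned above, and the assumption that the row's ID sits in the event's 'id' field is mine):

```python
# Pull the IDs of deleted rows out of a parsed Smartsheet webhook callback.
# `callback` is the JSON body of the HTTP POST from Smartsheet, already
# parsed into a dict; reading the row ID from the event's "id" field is an
# assumption here.
def deleted_row_ids(callback: dict) -> list:
    return [
        event.get("id")
        for event in callback.get("events", [])
        if event.get("objectType") == "row" and event.get("eventType") == "deleted"
    ]
```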
A simple way to do this is to add a checkbox column named "delete" or something similar. With automation you can then move the row to another sheet when the flag is detected: the row will be removed from the original sheet, but you will have a record of the deleted row in a different sheet that you can read or process however you need. This also prevents deletions by mistake, and you can even restore the row if you need to. I don't think you need much code to implement this solution.

How to take data from 2 databases (with the same schema) and copy it into 1 database using Data Factory

I want to take data from 2 databases and copy (coalesce) it into 1 using Data Factory.
The issue is: it seems that multiple inputs are not allowed for copy activities.
So I resorted to having 2 different datasets which are exact copies but with different names... and then putting 2 different activities into the 1 pipeline, each using its specific output dataset.
It just seems odd and wrong to do it this way.
Can I have some help?
This is what my diagram currently looks like:
Is there no way of just copying data from 2 separate databases (which have the same structure but different data) to the 1 database?
The short answer is yes. But you need to work within the constraints of how ADF handles this.
A couple of things to help...
You'll always need at least 2 activities to do this when using the copy type activity. Microsoft of course charges per activity execution in ADF, so they aren't going to allow you to take shortcuts by having many inputs and outputs per single copy activity (a single charge).
The approach you show above is OK, and to pass the ADF validation, as you've found, you simply need to have the output datasets created separately and named differently, even if they still refer to the same underlying target table etc. This is really only a problem for the copy activity. What you could do is first land the data into separate staging tables in the Azure target database just for the copy (1:1), then have a third downstream activity that executes a stored procedure which does the union of the tables. In this case you could have 2 inputs to 1 output in that activity if you want that level of control in ADF.
Like this:
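As a rough sketch of what that union step could do (prototyped here with pyodbc rather than from ADF itself; the staging and target table names are illustrative), the stored procedure called by the downstream activity might contain SQL along these lines:

```python
# Prototype of the downstream union step; in ADF this SQL would live in
# the stored procedure that the third activity executes. All table and
# column names are illustrative.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:mytarget.database.windows.net,1433;"
    "Database=targetdb;Uid=myuser;Pwd=mypassword;Encrypt=yes;"
)
cursor = conn.cursor()

# Combine the two 1:1 staging copies into the final table.
cursor.execute(
    """
    INSERT INTO dbo.TargetTable (Id, Col1, Col2)
    SELECT Id, Col1, Col2 FROM dbo.Staging_SourceA
    UNION ALL
    SELECT Id, Col1, Col2 FROM dbo.Staging_SourceB
    """
)
conn.commit()
```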
Final point, if you don't want the activities to execute in parallel you could chain the datasets to enforce a fake dependency or add a simple 'delay' clause to one of the copy operations. A delay on an activity would be simpler than provisioning a time slice offset.
Hope this helps

Oracle Global Temporary Tables and using stored procedures and functions

We recently changed one of the databases I develop on from Oracle accounts to LDAP login accounts, and all went well for the front end used by the staff who access the system. However, we have a second method of entry restricted to admin staff who load the data onto the database, and a lot of processing is called using dbms_scheduler.
Most of the database tables have a created_by column which defaults to picking up the user name from a sys_context, but when the data loads are run from dbms_scheduler this information is not available, and hence the created_by columns all get populated with APP_GLOBAL.
I have managed to populate a Global Temporary Table (GTT) with the sys_context value and use this to populate created_by from a stored procedure called by dbms_scheduler, so my next logical step was to put this in a function and call it so it could be used throughout the system, or even be referenced from a before-insert trigger.
The problem is that when I put the code into a function, the data from the GTT is not found. The table is set to preserve rows.
I have trawled many a site for an answer but have found nothing to help me. Can anyone here provide a solution?
The scheduler will be using a different session than the session that created the job - preserve rows will not make the GTT data visible in a different session.
I am assuming the created_by columns have a default value like nvl(sys_context(...),'APP_GLOBAL'). Consider passing the user name as a parameter to the job and set the context as the first step in the job.
A weekend off and a closer look at my code showed a fatal flaw in my syntax where the selection of data from the GTT would never happen. A quick tweak and recompile and all is well.
Jack, thanks for your help.