Data from multiple sources and deciding destination based on the Lookup SQL data - azure-data-factory

I am getting data from different sources and trying to copy it to a single destination based on metadata stored in a SQL table. These are the steps I followed:
1. I have 3 REST API calls, and the output of those calls goes as input to a Lookup activity.
2. The Lookup activity queries a SQL DB table that has 3 records and pulls only 2 columns, file_name and table_name.
3. A ForEach activity then iterates over the Lookup output array, and from each item I get item().file_name.
4. For each item, I am trying to use a Switch activity to decide, based on the file name, what the destination of the data should be.
I am not sure how to use the file_name from step 3 as a case in the Switch activity. Can anyone please guide me on that?

You need to create a variable and save the value of file_name in it. Then you can use that variable in the Switch activity. If you do this, make sure the Sequential setting of the ForEach activity is checked.
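A minimal sketch of that layout, written as Python dictionaries that mirror the pipeline JSON. The activity names (LookupMetadata, SetFileName, SwitchOnFileName), the variable name fileName, and the case values customers.csv and orders.csv are all hypothetical, and the empty activities lists are placeholders for whatever copy logic each destination needs.

```python
import json

# Hypothetical names throughout; the property layout follows the ADF pipeline JSON format.
set_file_name = {
    "name": "SetFileName",
    "type": "SetVariable",
    "typeProperties": {
        "variableName": "fileName",
        # Save the current item's file_name into the pipeline variable.
        "value": {"value": "@item().file_name", "type": "Expression"},
    },
}

switch_on_file = {
    "name": "SwitchOnFileName",
    "type": "Switch",
    "dependsOn": [
        {"activity": "SetFileName", "dependencyConditions": ["Succeeded"]}
    ],
    "typeProperties": {
        # The Switch evaluates this expression and matches it against each case value.
        "on": {"value": "@variables('fileName')", "type": "Expression"},
        "cases": [
            {"value": "customers.csv", "activities": []},  # placeholder activities
            {"value": "orders.csv", "activities": []},
        ],
        "defaultActivities": [],
    },
}

for_each = {
    "name": "ForEachMetadataRow",
    "type": "ForEach",
    "typeProperties": {
        # Sequential, because all iterations share the same pipeline variable.
        "isSequential": True,
        "items": {
            "value": "@activity('LookupMetadata').output.value",
            "type": "Expression",
        },
        "activities": [set_file_name, switch_on_file],
    },
}

print(json.dumps(for_each, indent=2))
```

The Sequential setting matters here because a pipeline variable is shared by all iterations, so parallel iterations could overwrite each other's file_name before the Switch reads it.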

Related

Azure Data Factory Overwrite Existing Folder/Partitions in ADLS Gen2

My ADF sink is set to Clear the folder, and the output is partitioned by an ID.
However, this sink already contains other partitions that I do not want to remove.
If an ID comes in, I just want to clear that specific folder/partition, but it is actually clearing the full folder instead of just that partition. Am I missing a setting?
To overwrite only the partitions that appear in new data and keep the rest of the old partition data, you can make use of the pre commands present in the settings tab of the dataflow sink. Look at the following demonstration.
The following is my initial data which I have partitioned based on id.
Now let's say the following is the new data that you are going to write. Here, according to the requirement, you want to overwrite the partitions that are present and keep the rest as it is.
First, we need to get the distinct key column values (id in my case). Then use them in the pre commands of sink settings to remove files only from these partitions.
Take the above data (the 2nd image data) as dataflow1 source. Apply derived column transformation to add a new column with constant value say 'xxx' (to group based on this column and apply collect() aggregate function).
Group by this new column and use the aggregate as distinct(collect(id)).
Now for the sink, choose Cache as the sink type and check write to activity output. When you run this dataflow in the pipeline, the debug output would be:
Send this array value to a parameter created in another dataflow, where you make the necessary changes and overwrite the partitions. Give the following dynamic content:
@activity('Data flow1').output.runStatus.output.sink1.value[0].val
Now in this second dataflow, the source is the same data used in first dataflow. For sink, instead of selecting clear the folder option, scroll down where you can find pre/post commands sections where you give the following dynamic content:
concat('rm /output/id=',toString($parts),'/*')
Now when you run this pipeline, it executes successfully and overwrites only the required partitions while keeping the other partitions.
The following is sample partition data (id=2), showing that the data is overwritten (only one part file with the required data is present).
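The selective overwrite described above can be modelled outside ADF to make the logic concrete. This is only a local-filesystem sketch in Python, not the data flow itself: the function name overwrite_partitions, the part file name, and the sample rows are hypothetical, and a temporary directory stands in for the /output folder.

```python
import tempfile
from pathlib import Path


def overwrite_partitions(root: Path, new_rows: list[dict], key: str = "id") -> set:
    """Replace only the partitions whose key value appears in new_rows."""
    # Distinct key values in the incoming data (the distinct(collect(id)) step).
    keys = {row[key] for row in new_rows}
    for value in keys:
        part_dir = root / f"{key}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        # Clear files in this partition only (the 'rm /output/id=<value>/*' pre command).
        for old_file in part_dir.iterdir():
            if old_file.is_file():
                old_file.unlink()
        # Write the new rows for this partition as a single part file.
        lines = [str(row) for row in new_rows if row[key] == value]
        (part_dir / "part-00000.txt").write_text("\n".join(lines))
    return keys


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "output"
    # Existing data: three partitions, each with one old file.
    for i in (1, 2, 3):
        d = root / f"id={i}"
        d.mkdir(parents=True)
        (d / "old.txt").write_text(f"old data for id {i}")

    # New data arrives for id=2 only.
    touched = overwrite_partitions(root, [{"id": 2, "name": "new"}])
    remaining = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
    print(touched, remaining)
```

Only id=2 is cleared and rewritten; the files under id=1 and id=3 are left untouched, which is the behaviour the pre command is meant to achieve.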
Why not specify the file name and write it to a single file?

ADF - what's the best way to execute one from a list of Data Flow activities based on a condition

I have 20 file formats and 1 Data Flow activity that maps to each one of them. Based on the file name, I know which Data Flow activity to execute. Is the only way to handle this through a "Switch" activity? Is there another way, e.g. can I parameterize which data flow to execute by a variable name?
Unfortunately, there is no option to run one out of a list of data flows based on an input condition.
To perform data migration and transformation for multiple tables, you can use the same data flow and parameterize it, either by providing the table names at runtime or by using a control table that holds all the table names and calling the Data Flow activity inside a ForEach. In the sink settings, use the merge schema option.
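As a rough analogy in plain Python (not ADF code), the control-table approach replaces many near-identical flows with one parameterized routine called in a loop. The control table contents, the run_generic_flow function, and the sample rows below are all hypothetical.

```python
# Hypothetical control table: one row per file format / target table.
control_table = [
    {"file_name": "customers.csv", "table_name": "dbo.Customers"},
    {"file_name": "orders.csv", "table_name": "dbo.Orders"},
]

# Hypothetical incoming data, keyed by file name.
sample_rows = {
    "customers.csv": [
        {"id": 1, "name": "Ann"},
        {"id": 2, "name": "Bob", "city": "Oslo"},
    ],
    "orders.csv": [{"order_id": 10, "id": 1}],
}


def run_generic_flow(table_name: str, rows: list[dict]) -> dict:
    """Stand-in for one parameterized data flow: same logic, different target."""
    # Columns are not hard-coded; they are taken from the incoming rows.
    columns = sorted({col for row in rows for col in row})
    return {"table": table_name, "columns": columns, "row_count": len(rows)}


# Stand-in for the ForEach over the control table.
results = [
    run_generic_flow(entry["table_name"], sample_rows[entry["file_name"]])
    for entry in control_table
]
for result in results:
    print(result)
```

The point of the design is that adding a 21st format means adding a row to the control table rather than another data flow and another Switch case.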

How to configure a Synapse Mapping Data Flow for INSERT/UPDATE/DELETE when the destination table does not yet exist

I am trying to build a generic Mapping Data Flow for some basic cleansing on tables in my Data Lake. I need it to work both on an ongoing basis, after data already exists in my cleansed tables, and when new tables are added (it would detect them automatically and create and populate the destination). Both the Source and Destination tables will be Delta tables.
The approach I have taken is to have Sources configured to both my actual source and to the target and use either JOIN transformations or EXISTS transformations to identify the new, updated and removed rows.
This works fine for INSERTS and UPDATES; however, my issue is dealing with DELETES when there is no data currently in the destination. Obviously there will be nothing to DELETE, which is as expected. However, because I reference the key column that will exist once data is loaded to the table, I get an error on an initial run that states:
ERROR Dataflow AppManager: name=BatchJobListener.failed, opId=xxx, message=Job 'xxx failed due to reason: DF-SINK-007 at Sink 'cleansedTableWithDeletes': Sink results in 0 output columns. Please ensure at least one column is mapped.
The overall process looks as follows:
Has anyone developed a pattern that works for a generic flow (this one is parameter driven and ensures schema drift is accommodated) or a way for the Data Flow to think that there IS a column in the destination that it can refer to and get past this issue?
In Source options check Allow no files found.
You can also provide date dynamically in Filter by last modified option.
Refer - https://learn.microsoft.com/en-us/azure/data-factory/data-flow-sink#sink-settings

AZURE DATA FACTORY - Can I set a variable from within a CopyData task or by using the output?

I have a simple pipeline that has a Copy activity to populate a table. That task is based on a query and will only ever return 1 row.
The problem I am having is that I want to reuse the value from one of the columns (batch number) to set a variable, so that at the end of the pipeline I can use a Stored Procedure to log that the batch was processed. I would rather avoid running the query a second time in a Lookup task, so can I make use of the data already being returned?
I have tried duplicating the column in the Copy activity and mapping it to something like @BatchNo, but that fails. I have even tried adding a Set Variable task, but can't figure out how to take a single column. @{activity('Populate Aleprstw').output} does not error, but I am not sure what that will actually do in this case.
Thanks and sorry if it's a silly question.
Cheers
Mark
I always do it like this:
Generate a batch number (usually with a proc)
Use a lookup to grab it into a variable
Use the batch number in all activities (might be multiple copies, procs etc.)
Write the batch completion
From your description it seems you have the batch embedded in the data copy from the start which is not typical.
If you must do it this way, is there really an issue with running a lookup again?
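The Lookup-into-variable step above can be sketched as follows, again as Python dictionaries mirroring the pipeline JSON. Everything named here is an assumption for illustration: the LookupBatch and SetBatchNo activities, the ControlDb dataset, the dbo.GetNextBatch procedure, the BatchNo column, and the batchNo variable.

```python
import json

# Hypothetical names throughout; the property layout follows the ADF pipeline JSON format.
lookup_batch = {
    "name": "LookupBatch",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "AzureSqlSource",
            # Assumed procedure that generates and returns the batch number.
            "sqlReaderQuery": "EXEC dbo.GetNextBatch",
        },
        "dataset": {"referenceName": "ControlDb", "type": "DatasetReference"},
        # Return a single row so the value is available under output.firstRow.
        "firstRowOnly": True,
    },
}

set_batch_no = {
    "name": "SetBatchNo",
    "type": "SetVariable",
    "dependsOn": [
        {"activity": "LookupBatch", "dependencyConditions": ["Succeeded"]}
    ],
    "typeProperties": {
        "variableName": "batchNo",
        "value": {
            # Converted to a string here on the assumption that batchNo is a String variable.
            "value": "@string(activity('LookupBatch').output.firstRow.BatchNo)",
            "type": "Expression",
        },
    },
}

print(json.dumps([lookup_batch, set_batch_no], indent=2))
```

Later activities, including the final logging procedure, can then read the value with @variables('batchNo').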
Copy activity doesn't return data like that, so you won't be able to capture the results that way. With this design, running the query again in a Lookup is the best option.
Is the query in the Source running on the same Server as the Sink? If so, you could collapse the entire operation into a Stored Procedure that returns the data point you are trying to capture.

How to take data from 2 databases (with same schema) and copy it into 1 database using Data factory

I want to take data from 2 databases and copy (coalesce) it into 1 using Data Factory.
The issue is that multiple inputs do not seem to be allowed for copy activities.
So I resorted to having 2 different datasets, which are exact copies but with different names, and then putting 2 different activities into the 1 pipeline, each using its own output dataset.
It just seems odd and wrong to do it this way.
Can I have some help?
This is what my diagram currently looks like:
Is there no way of just copying data from 2 separate databases (which have the same structure but different data) to the 1 database?
The short answer is yes. But you need to work within the constraints of how ADF handles this.
A couple of things to help...
You'll always need at least 2 activities to do this when using the copy type activity. Microsoft of course charges per activity execution in ADF, so they aren't going to allow you to take shortcuts by having many inputs and outputs per single copy activity (single charge).
The approach you show above is OK. To pass the ADF validation, as you've found, you simply need to have the output datasets created separately and named differently, even if they still refer to the same underlying target table. This is really only a problem for the copy activity. What you could do instead is first land the data in separate staging tables in the Azure target database, just for the copy (1:1). Then have a third downstream activity that executes a stored procedure that does the union of the tables. In this case you could have 2 inputs to 1 output in the activity, if you want to have that level of control in ADF.
Like this:
Final point, if you don't want the activities to execute in parallel you could chain the datasets to enforce a fake dependency or add a simple 'delay' clause to one of the copy operations. A delay on an activity would be simpler than provisioning a time slice offset.
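The staging-then-union approach described above can be sketched locally. This uses an in-memory SQLite database from Python purely as a stand-in for the Azure target database; the table names staging_db1, staging_db2, and target, the columns, and the sample rows are hypothetical, and the INSERT ... UNION ALL statement plays the role of the downstream stored procedure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two staging tables stand in for the 1:1 copy targets, one per source database.
cur.execute("CREATE TABLE staging_db1 (id INTEGER, name TEXT)")
cur.execute("CREATE TABLE staging_db2 (id INTEGER, name TEXT)")
cur.execute("CREATE TABLE target (id INTEGER, name TEXT, source TEXT)")

# Stand-in for the two copy activities landing data in staging.
cur.executemany("INSERT INTO staging_db1 VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
cur.executemany("INSERT INTO staging_db2 VALUES (?, ?)", [(3, "Cy")])

# Stand-in for the stored procedure: union both staging tables into the target.
cur.execute(
    """
    INSERT INTO target (id, name, source)
    SELECT id, name, 'db1' FROM staging_db1
    UNION ALL
    SELECT id, name, 'db2' FROM staging_db2
    """
)
conn.commit()

rows = cur.execute("SELECT id, name, source FROM target ORDER BY id").fetchall()
print(rows)
```

Keeping a source column in the target is optional; it is included here only so the origin of each row stays visible after the union.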
Hope this helps