How to configure a Synapse Mapping Data Flow for INSERT/UPDATE/DELETE when the destination table does not yet exist - azure-data-factory

I am trying to build a generic Mapping Data Flow for some basic cleansing on tables in my Data Lake. I need it to work both on an ongoing basis, after data already exists in my cleansed tables, and when new tables are added (it would detect them automatically and create and populate the destination). Both the Source and Destination tables will be Delta tables.
The approach I have taken is to have Sources configured to both my actual source and to the target and use either JOIN transformations or EXISTS transformations to identify the new, updated and removed rows.
This works fine for INSERTS and UPDATES; however, my issue is dealing with DELETES when there is no data currently in the destination. Obviously there will be nothing to DELETE, which is as expected. However, because I reference the key column that will exist once data is loaded to the table, I get an error on an initial run that states:
ERROR Dataflow AppManager: name=BatchJobListener.failed, opId=xxx, message=Job 'xxx failed due to reason: DF-SINK-007 at Sink 'cleansedTableWithDeletes': Sink results in 0 output columns. Please ensure at least one column is mapped.
The overall process looks as follows:
Has anyone developed a pattern that works for a generic flow (this one is parameter driven and ensures schema drift is accommodated) or a way for the Data Flow to think that there IS a column in the destination that it can refer to and get past this issue?

In the Source options, check Allow no files found.
You can also supply a date dynamically in the Filter by last modified option.
Refer - https://learn.microsoft.com/en-us/azure/data-factory/data-flow-sink#sink-settings
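For reference, the insert/update/delete classification described in the question can be modelled in plain Python. This is only a sketch of the comparison logic, not Data Flow script; the function name classify_changes and the sample rows are hypothetical. It illustrates that on a first run, with an empty destination, the delete set is simply empty and no key column from the destination has to be read.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def classify_changes(
    source: List[Row], target: List[Row], key: str
) -> Tuple[List[Row], List[Row], List[Row]]:
    """Return (inserts, updates, deletes) by comparing rows on `key`."""
    source_by_key = {row[key]: row for row in source}
    # With an empty target this comprehension never touches row[key],
    # so a missing key column in the destination is not a problem.
    target_by_key = {row[key]: row for row in target}

    inserts = [r for k, r in source_by_key.items() if k not in target_by_key]
    updates = [
        r
        for k, r in source_by_key.items()
        if k in target_by_key and r != target_by_key[k]
    ]
    deletes = [r for k, r in target_by_key.items() if k not in source_by_key]
    return inserts, updates, deletes


# Hypothetical sample data: first run against an empty destination,
# then a later run against the rows loaded by the first run.
source_rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
first_run = classify_changes(source_rows, [], "id")
later_run = classify_changes(
    [{"id": 1, "name": "a2"}, {"id": 3, "name": "c"}], source_rows, "id"
)
```

On the first run everything is an insert; on the later run, id 3 is an insert, id 1 an update, and id 2 a delete.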

Related

Azure Data Factory Overwrite Existing Folder/Partitions in ADLS Gen2

I have the settings in my ADF sink set to clear the folder, with the output partitioned by an ID.
However, this sink already contains other partitions that I do not want to remove.
When an ID comes in, I just want to clear that specific folder/partition, but it is actually clearing the full folder rather than just the partition. Am I missing a setting?
To overwrite only the partitions that appear in new data and keep the rest of the old partition data, you can make use of the pre commands present in the settings tab of the dataflow sink. Look at the following demonstration.
The following is my initial data which I have partitioned based on id.
Now let's say the following is the new data that you are going to write. Here, according to the requirement, you want to overwrite the partitions that are present and keep the rest as it is.
First, we need to get the distinct key column values (id in my case). Then use them in the pre commands of sink settings to remove files only from these partitions.
Take the above data (the 2nd image data) as dataflow1 source. Apply derived column transformation to add a new column with constant value say 'xxx' (to group based on this column and apply collect() aggregate function).
Group by this new column and use the aggregate as distinct(collect(id)).
For the sink, choose Cache and check Write to activity output. When you run this data flow in the pipeline, the debug output would be:
Pass this array value to a parameter created in a second data flow, where you make the necessary changes and overwrite the partitions. Use the following dynamic content:
@activity('Data flow1').output.runStatus.output.sink1.value[0].val
Now in this second dataflow, the source is the same data used in first dataflow. For sink, instead of selecting clear the folder option, scroll down where you can find pre/post commands sections where you give the following dynamic content:
concat('rm /output/id=',toString($parts),'/*')
Now when you run this pipeline, it executes successfully and overwrites only the required partitions, while keeping the other partitions intact.
The following is a sample partition data (id=2) to show that the data is overwritten (only one part file with required data will be available).
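The idea behind the two data flows above can be sketched in plain Python: collect the distinct partition keys in the incoming data, then build cleanup commands only for those partitions. This is a rough illustration, not ADF expression language; the function name, the sample rows, and the one-command-per-partition layout are assumptions made for the example.

```python
from typing import Iterable, List, Mapping


def partition_cleanup_commands(
    rows: Iterable[Mapping[str, object]], key: str = "id", base: str = "/output"
) -> List[str]:
    """Build one 'rm' command per distinct partition value found in `rows`."""
    seen = set()
    distinct = []
    for row in rows:
        value = row[key]
        if value not in seen:  # keep first-seen order, drop duplicates
            seen.add(value)
            distinct.append(value)
    # Only partitions present in the new data are targeted;
    # every other partition folder is left untouched.
    return [f"rm {base}/{key}={value}/*" for value in distinct]


# Hypothetical incoming rows touching partitions id=2 and id=5 only.
new_rows = [{"id": 2, "val": "x"}, {"id": 2, "val": "y"}, {"id": 5, "val": "z"}]
commands = partition_cleanup_commands(new_rows)
```

Here commands contains entries for id=2 and id=5 only, so a partition such as id=1 would keep its existing files.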
Why don't you specify the file name and write the output to a single file?

Cannot repopulate ElectrodeGroup datajoint table

I'm a researcher in Loren Frank's lab at UCSF using datajoint and files in the nwb format. I made some changes to our code for defining entries in our ElectrodeGroup table, and was hoping to test those by deleting an entry in the table and regenerating it with the new code. I was able to delete the entry, but cannot repopulate it. In particular, when I run ElectrodeGroup.populate() or ElectrodeGroup.populate({"nwb_file_name": my_file_name}), no changes are made to the table. I confirmed that the electrode group I deleted and am trying to regenerate is defined in the original nwb file. I am seeking input on why the populate command seems to not be working here. Thanks in advance for any help!
This user also contacted our team through another channel. Sharing the solution below for future users, in reference to this schema. In short, the populate process is reserved for unique upstream primary keys.
Since the ElectrodeGroup's only upstream table dependency is Session, the make method will only be called if there are no electrode groups for that session. This is because, from the perspective of DataJoint, the only 'guaranteed' knowledge about what should exist for this table is defined solely by the presence or absence of related upstream records. Since the 'new' primary attribute 'electrode_group_name' is defined by the ElectrodeGroup table itself, DataJoint doesn't know how many entries will be created by make, and so simply invokes make once per Session, expecting that single invocation to fully define all electrode_group_name values the table will use. If at least one entry already exists for that session, DataJoint considers the work done, so no make() invocation occurs.
There are a couple possible solutions:
1. Model the electrode group explicitly, with a table that defines the existence of an electrode group (e.g., ElectrodeGroupConfiguration). ElectrodeGroup would then inherit primary keys from both Session and ElectrodeGroupConfiguration, and its make function would be adjusted to load the unique keys across the upstream tables.
2. Adjust the make function to handle the partial insert/update case, and call the make function directly with the desired primary key when these kinds of 'abnormal' updates need to occur.
Method #1 is 'cleanest' with respect to the DataJoint data model (explicitly modeled data dependencies using make/populate), whereas #2 slightly 'escapes' the DataJoint data model in a controlled way to achieve the desired schema/data result.
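The behaviour described above can be modelled without DataJoint at all. The following is a plain-Python sketch of the selection rule, not DataJoint's actual implementation; the function keys_to_populate and the file name file_a.nwb are hypothetical. It shows why deleting one electrode group out of several leaves nothing for populate to do.

```python
from typing import Dict, List

Key = Dict[str, str]


def keys_to_populate(
    upstream_keys: List[Key], existing_rows: List[Key], upstream_attrs: List[str]
) -> List[Key]:
    """Return upstream keys that have no row at all in the target table."""
    # Project existing rows onto the upstream primary-key attributes only;
    # attributes added by the table itself (electrode_group_name) are ignored.
    covered = {tuple(row[a] for a in upstream_attrs) for row in existing_rows}
    return [
        key
        for key in upstream_keys
        if tuple(key[a] for a in upstream_attrs) not in covered
    ]


# Hypothetical session with two electrode groups.
sessions = [{"nwb_file_name": "file_a.nwb"}]
electrode_groups = [
    {"nwb_file_name": "file_a.nwb", "electrode_group_name": "0"},
    {"nwb_file_name": "file_a.nwb", "electrode_group_name": "1"},
]

# One group deleted: the session is still "covered", so make is not called.
after_partial_delete = keys_to_populate(
    sessions, electrode_groups[:1], ["nwb_file_name"]
)
# All groups for the session deleted: the session key is scheduled again.
after_full_delete = keys_to_populate(sessions, [], ["nwb_file_name"])
```

Under this model, repopulating a single deleted entry requires either removing every entry for that session first or calling make directly, which matches the two options listed above.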

ADF Mapping Data in Sink Table Error - Sink Table remains empty

I have a Data flow in Azure Data Factory which I want to use to combine data from three sources and then sink in a destination table (with some transformation in-between). For the sink table I created a table in SQL, matching the column headers and data types from my Data Flow in Azure.
However when I publish the data flow, the sink table remains empty. The only error I get is under Mapping "At least one incoming column is mapped to a column in the sink dataset schema with a conflicting type, which can cause NULL values or runtime errors." This seems to be inhibiting me from enabling Auto Mapping - so I mapped the columns manually.
So where I'm at the moment:
(screenshot: DataFlowLayout)
I tried manually mapping the columns. The data types in my input and my sink table match each other, but my sink table is still empty.
(screenshot: SinkTableDataMapping)
Under Data Preview for both my source and my sink tables I am able to view a sample of my data, so they are not empty.
(screenshot: SinkTableDataPreview)
Anyone experience something similar?
The details provided above are not sufficient to offer any input. Can you add more details, such as the file formats of those three sources, where you are pulling them from, and additional screenshots? That would really help us guide you better. Thanks.
You can try below steps:
Disable auto mapping of columns in the Sink transformation and map the columns manually.
Check that the Allow insert option is selected under the Sink transformation settings.
Also make sure the data types of all input and output columns of the Sink transformation match, to avoid nulls.
Thanks for the feedback. In the end, the issue was that the pipeline that runs the data flow and sinks the data into the destination table was not properly set up. That is why the data flow was not showing any errors but the sink table still remained empty: the data flow was effectively hanging in the air, with no instruction to actually perform the sink.

Data from multiple sources and deciding destination based on the Lookup SQL data

I am trying to solve the problem below, where I am getting data from different sources and trying to copy that data to a single destination based on metadata stored in a SQL table. These are the steps I followed:
1. I have 3 REST API calls, and the output of those calls goes as input to a Lookup activity.
2. The Lookup activity queries a SQL DB table that has 3 records and pulls only 2 columns, file_name and table_name.
3. A ForEach activity then iterates over the Lookup array output, and from each item I get item().file_name.
4. For each item, I am trying to use a Switch activity to decide, based on the file name, what the destination of the data should be.
I am not sure how to use the file_name from step 3 as a case in the Switch activity. Can anyone please guide me on that?
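You need to create a variable and save the value of file_name in it. Then you can use that variable in the Switch activity. If you do this, make sure the Sequential setting of the ForEach activity is checked, since pipeline variables are shared across iterations and parallel runs would overwrite each other's value.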
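As a rough analogy in Python (not pipeline JSON), the ForEach, Set Variable, and Switch combination behaves like a sequential loop with a lookup table and a default branch. The file names, destination names, and the function route_items are made up for the illustration.

```python
from typing import Dict, List, Tuple


def route_items(
    items: List[Dict[str, str]], cases: Dict[str, str], default: str = "unmatched"
) -> List[Tuple[str, str]]:
    """Pick a destination per item, the way a Switch activity picks a case."""
    routed = []
    for item in items:  # ForEach, executed sequentially
        current_file = item["file_name"]  # Set Variable step
        destination = cases.get(current_file, default)  # Switch with a default case
        routed.append((current_file, destination))
    return routed


# Hypothetical Lookup output with the two columns from the question.
lookup_output = [
    {"file_name": "orders", "table_name": "dbo.Orders"},
    {"file_name": "customers", "table_name": "dbo.Customers"},
    {"file_name": "returns", "table_name": "dbo.Returns"},
]
# Hypothetical Switch cases keyed on file_name.
switch_cases = {"orders": "copy_to_orders", "customers": "copy_to_customers"}
routed = route_items(lookup_output, switch_cases)
```

Any file_name without a matching case falls through to the default branch, just as the Switch activity's Default case would handle it.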

Best practices for parameterizing load of multiple CSV files in Data Factory

I am experimenting with Azure Data Factory to replace some other data-load solutions we currently have, and I'm struggling with finding the best way to organize and parameterize the pipelines to provide the scalability we need.
Our typical pattern is that we build an integration for a particular Platform. This "integration" is essentially the mapping and transform of fields from their data files (CSVs) into our Stage1 SQL database, and by the time the data lands in there, the data types should be set properly and the indexes set.
Within each Platform, we have Customers. Each Customer has their own set of data files that get processed in that Customer context -- within the scope of a Platform, all Customer files follow the same schema (or close to it), but they all get sent to us separately. If you looked at our incoming file store, it might look like (simplified, there are 20-30 source datasets per customer depending on platform):
Platform
    Customer A
        Employees.csv
        PayPeriods.csv
        etc
    Customer B
        Employees.csv
        PayPeriods.csv
        etc
Each customer lands in their own SQL schema. So after processing the above, I should have CustomerA.Employees and CustomerB.Employees tables. (This allows a little bit of schema drift between customers, which does happen on some platforms. We handle it later in our stage 2 ETL process.)
What I'm trying to figure out is:
What is the best way to setup ADF so I can effectively manage one set of mappings per platform, and automatically accommodate any new customers we add to that platform without having to change the pipeline/flow?
My current thinking is to have one pipeline per platform, and one dataflow per file per platform. The pipeline has a variable, "schemaname", which is set using the path of the file that triggered it (e.g. "CustomerA"). Then, depending on file name, there is a branching conditional that will fire the right dataflow. E.g. if it's "employees.csv" it runs one dataflow, if it's "payperiods.csv" it loads a different dataflow. Also, they'd all be using the same generic target sink datasource, the table name being parameterized and those parameters being set in the pipeline using the schema variable and the filename from the conditional branch.
Are there any pitfalls to setting it up this way? Am I thinking about this correctly?
This sounds solid. Just be aware that if you define column-specific mappings with expressions that expect those columns to be present, you may get data flow execution failures when those columns are missing from a customer's source files.
The way to protect against that in ADF Data Flow is to use column patterns. These let you define mappings that are generic and more flexible.
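The idea behind column patterns can be illustrated in plain Python: instead of naming columns, each rule carries a predicate that decides which columns it applies to, so files with slightly different column sets pass through the same mapping. This is only a conceptual sketch, not Data Flow expression syntax; the function, the rule, and the sample rows are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A pattern is (predicate on column name and value, transform for the value).
Pattern = Tuple[Callable[[str, object], bool], Callable[[object], object]]


def apply_column_patterns(row: Dict[str, object], patterns: List[Pattern]) -> Dict[str, object]:
    """Apply the first matching pattern to each column; pass others through."""
    result = {}
    for name, value in row.items():
        for matches, transform in patterns:
            if matches(name, value):
                result[name] = transform(value)
                break
        else:
            result[name] = value  # no pattern matched: keep the value as-is
    return result


# One generic rule: trim every string column, whatever it is called.
patterns: List[Pattern] = [
    (lambda name, value: isinstance(value, str), lambda value: value.strip()),
]

# Hypothetical rows from two customers whose files differ by one column.
customer_a = apply_column_patterns({"EmployeeId": 1, "Name": " Ann "}, patterns)
customer_b = apply_column_patterns(
    {"EmployeeId": 2, "Name": "Bob ", "Nickname": " B "}, patterns
)
```

Because no rule refers to a specific column name, the extra Nickname column in the second row is handled without any change to the mapping.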