Is it possible to merge Azure Data Factory data flows?

I have two separate Data flows in Azure Data Factory, and I want to combine them into a single Data flow.
There is a technique for copying elements from one Data flow to another, as described in this video: https://www.youtube.com/watch?v=3_1I4XdoBKQ
This does not work for Source or Sink stages, though. The Script elements do not contain the Dataset that the Source or Sink is connected to, and if you try to copy them, the designer window closes and the Data flow is corrupted. The details are in the JSON, but I have tried copying and pasting into the JSON and that doesn't work either - the source appears on the canvas, but is not usable.
Does anyone know if there is a technique for doing this, other than just manually recreating the objects on the canvas?

Thanks, Leon, for confirming that this isn't supported. Here is my workaround process:
1. Open the Data Flow that will receive the merged code.
2. Open the Data Flow that contains the code to merge in.
3. Go through the to-be-merged flow and change the names of any transformations that clash with the names of transformations in the target flow.
4. Manually create, in the target flow, any Sources that do not already exist.
5. Copy the entire script out of the to-be-merged flow into a text editor.
6. Remove the Sources and Sinks.
7. Copy the remaining transformations to the clipboard and paste them into the target flow's script editor.
8. Manually create the Sinks, remembering to set all properties such as "Allow Update".
Be prepared: if you make a mistake and paste in something that is not correct, the flow editor window will close and the flow will be unusable. The only way to recover it is to refresh and discard all changes since you last published, so don't do this if you have other unpublished changes you don't want to lose!
I have already established a practice on our team that no mappings are done in Sinks. All mappings are done in Derived Column transformations, and any column name ambiguity is resolved in a Select transformation, so the Sink is always just auto-map. That makes operations like this simpler.
It should be possible to keep the Source definitions in Step 6, remove the Source elements from the target script, and paste the new Sources in to replace them, but that's a little more complex and error-prone.
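To illustrate, here is a minimal sketch of the script format, with made-up column, stream, and transformation names. The source and sink lines are the ones to recreate manually in the designer; everything between them is what can be copied and pasted safely:

    source(output(
            id as integer,
            name as string
        ),
        allowSchemaDrift: true,
        validateSchema: false) ~> SourceCustomers
    SourceCustomers derive(displayName = upper(name)) ~> DeriveNames
    DeriveNames select(mapColumn(id, displayName)) ~> SelectOutput
    SelectOutput sink(allowSchemaDrift: true,
        validateSchema: false) ~> SinkCustomers

In this example, the DeriveNames and SelectOutput lines are the "remaining transformations" of step 7, while the SourceCustomers and SinkCustomers lines are the ones to strip out in step 6.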

Related

How can I pass output from a filter activity directly to a copy activity in ADF?

I have 4,000 files, each averaging 30 KB in size, landing in a folder on our on-premises file system each day. I want to apply conditional logic (several and/or conditions) against details in their file names, to move only the files matching the conditions into another folder.

I have tried linking a Get Metadata activity (which gets all files in the source folder) to a Filter activity (which applies the conditional logic) to a ForEach activity with an embedded Copy activity. This works, but it is taking hours to process the files. When running the pipeline in debug, the output window appears to list each file copied as a line item. I've increased the batch count setting on the ForEach to 50, but it hasn't improved things.

Is there a way to link the Filter activity directly to the Copy activity, without using the ForEach activity? I.e., pass the collection from the filter straight into the copy's source.

Alternatively, some of our other pipelines just use the Copy activity pointing at a source folder, and we configure its filefilter setting with a simple wildcard expression using a combination of * and ?, which is extremely fast. However, in this particular scenario my conditional logic is more complex: I need to compare attributes in each file's name with values to decide whether the file should be moved. The filefilter setting allows dynamic content, so I could remove the Filter activity completely, point the copy at the source folder, and put the conditional logic in the filefilter's dynamic content area, but how would I get a reference to the file name to do the conditional checks?
Here is one solution:
Write the array output as text to a .json file in Blob Storage (or wherever). Here are the steps to make that work:
1. In a Copy Data activity, set the source to the filter's array output and the sink to a .json file in your blob store. This writes the JSON (the array output) to a file that holds the names of the files you want to copy.
2. In a second Copy activity, convert it from JSON to .txt: the source reads the .json file, and the sink will be a .txt file in your Blob.
3. Use that text file in your main copy activity, pointing its "List of files" setting at it.
This should copy over all the files that you identified in your Filter Activity.
I realize this is a workaround, but it really is the only solution for what you are asking. Otherwise, there is no way to link a Filter activity straight to a Copy activity.
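For reference, the setting in that last step is the "List of files" file path type in the copy activity source, which takes the path to the text file via the fileListPath store setting. A rough sketch of the relevant part of the copy activity JSON, with placeholder names and a blob-to-blob copy assumed for simplicity, might look like:

    {
        "name": "CopyFilteredFiles",
        "type": "Copy",
        "typeProperties": {
            "source": {
                "type": "BinarySource",
                "storeSettings": {
                    "type": "AzureBlobStorageReadSettings",
                    "fileListPath": "mycontainer/filelist.txt"
                }
            },
            "sink": {
                "type": "BinarySink",
                "storeSettings": {
                    "type": "AzureBlobStorageWriteSettings"
                }
            }
        }
    }

The array itself can be written out with a dynamic-content expression along the lines of @string(activity('FilterFiles').output.Value), where FilterFiles is whatever your Filter activity is named.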

Azure Data Factory data sink seems to be hidden or collapsed so it can't be validated

I'm quite new to Data Factory, so perhaps this is something very obvious that I'm missing.
I removed one of my datasets from within the list of Factory Resources (the pane on the left) and created a new one so that I could name it correctly. Previously, the dataset I deleted was associated with a data flow sink, which was named crmConstituent - this can be seen at the bottom of my screenshot. I now realise I should have removed the sink before I deleted the dataset, but I got ahead of myself.
I then added a new/replacement sink to my data flow, named SinkConstituents, associated with my recreated dataset. However, it looks like the original sink has persisted in the data flow somewhere and I can't seem to remove it using the interface. To the right of NameColumns in the top row of the flow, I can see my fresh new sink (SinkConstituents) but in between the two, I can also see another blue line that is possibly part of my old sink that I wanted to remove, but it appears to be only partially visible, as though it is collapsed. If I right-click on the "collapsed sink", it comes up with the option to delete it, but this doesn't appear to actually do anything if I select delete; the blue line remains and so does the error. I can't click on the collapsed sink to view the configuration of it in the bottom pane.
I'm currently unable to validate my factory because the original sink named crmConstituent doesn't have a dataset any more (because I am a monster who deletes things without thinking ahead), but I don't seem to be able to remove the sink from my data flow or recreate the original dataset to soothe the sink into thinking nothing is missing after all.

Best practices for parameterizing load of multiple CSV files in Data Factory

I am experimenting with Azure Data Factory to replace some other data-load solutions we currently have, and I'm struggling with finding the best way to organize and parameterize the pipelines to provide the scalability we need.
Our typical pattern is that we build an integration for a particular Platform. This "integration" is essentially the mapping and transform of fields from their data files (CSVs) into our Stage1 SQL database, and by the time the data lands in there, the data types should be set properly and the indexes set.
Within each Platform, we have Customers. Each Customer has their own set of data files that get processed in that Customer context -- within the scope of a Platform, all Customer files follow the same schema (or close to it), but they all get sent to us separately. If you looked at our incoming file store, it might look like (simplified, there are 20-30 source datasets per customer depending on platform):
Platform
    Customer A
        Employees.csv
        PayPeriods.csv
        etc
    Customer B
        Employees.csv
        PayPeriods.csv
        etc
Each customer lands in their own SQL schema. So after processing the above, I should have CustomerA.Employees and CustomerB.Employees tables. (This allows a little bit of schema drift between customers, which does happen on some platforms. We handle it later in our stage 2 ETL process.)
What I'm trying to figure out is:
What is the best way to setup ADF so I can effectively manage one set of mappings per platform, and automatically accommodate any new customers we add to that platform without having to change the pipeline/flow?
My current thinking is to have one pipeline per platform, and one dataflow per file per platform. The pipeline has a variable, "schemaname", which is set using the path of the file that triggered it (e.g. "CustomerA"). Then, depending on file name, there is a branching conditional that will fire the right dataflow. E.g. if it's "employees.csv" it runs one dataflow, if it's "payperiods.csv" it loads a different dataflow. Also, they'd all be using the same generic target sink datasource, the table name being parameterized and those parameters being set in the pipeline using the schema variable and the filename from the conditional branch.
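For concreteness, the generic sink dataset I have in mind would be parameterized roughly like this (the dataset, linked service, and parameter names here are just placeholders), with the pipeline supplying schemaName from the trigger path and tableName from the conditional branch:

    {
        "name": "GenericStage1Table",
        "properties": {
            "type": "AzureSqlTable",
            "linkedServiceName": {
                "referenceName": "Stage1SqlDb",
                "type": "LinkedServiceReference"
            },
            "parameters": {
                "schemaName": { "type": "string" },
                "tableName": { "type": "string" }
            },
            "typeProperties": {
                "schema": {
                    "value": "@dataset().schemaName",
                    "type": "Expression"
                },
                "table": {
                    "value": "@dataset().tableName",
                    "type": "Expression"
                }
            }
        }
    }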
Are there any pitfalls to setting it up this way? Am I thinking about this correctly?
This sounds solid. Just be aware that if you define column-specific mappings with expressions that expect certain columns to be present, you may get data flow execution failures when those columns are missing from a customer's source files.
The way to protect against that in ADF Data Flow is to use column patterns, which let you define mappings that are generic and more flexible.
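As a rough sketch of what that looks like in data flow script (stream names made up): a Derived Column with a column pattern that trims every string column, whichever columns a given customer's file happens to contain, would be something like:

    SourceEmployees derive(each(match(type == 'string'),
        $$ = trim($$))) ~> NormalizeStrings

Here match(type == 'string') selects columns by type rather than by name, and $$ refers to the matched column's value, so the mapping survives customer-to-customer schema drift.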

Azure Data Factory: How to incrementally copy blob data to SQL

I have an Azure blob container where some JSON files with data get put every 6 hours, and I want to use Azure Data Factory to copy them to an Azure SQL DB. The file pattern for the files is like this: "customer_year_month_day_hour_min_sec.json.data.json"
The blob container has other JSON data files as well, so I have to filter for the files I want in the dataset.
First question: how can I set the file path on the blob dataset to only look for the JSON files that I want? I tried the wildcard *.data.json, but that doesn't work. The only filename wildcard I have gotten to work is *.json.
Second question: how can I copy data to Azure SQL only from the new files (with the specific file pattern) that land in the blob storage? I have no control over the process that puts the data in the blob container, and I cannot move the files to another location, which makes this harder.
Please help.
You could use an ADF event trigger to achieve this.
Define your event trigger as 'blob created' and specify the blobPathBeginsWith and blobPathEndsWith properties based on your filename pattern.
For the first question: when an event trigger fires for a specific blob, the event captures the folder path and file name of the blob in the properties @triggerBody().folderPath and @triggerBody().fileName. You need to map those properties to pipeline parameters and pass an @pipeline().parameters.parameterName expression to the fileName in your copy activity.
This also answers the second question: each time the trigger fires, you'll get the folder path and file name of the newest created file in @triggerBody().folderPath and @triggerBody().fileName.
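A minimal sketch of such a trigger definition in JSON, with placeholder storage account, container, and pipeline names:

    {
        "name": "NewDataFileTrigger",
        "properties": {
            "type": "BlobEventsTrigger",
            "typeProperties": {
                "blobPathBeginsWith": "/mycontainer/blobs/customer_",
                "blobPathEndsWith": ".data.json",
                "events": [ "Microsoft.Storage.BlobCreated" ],
                "scope": "/subscriptions/<subscriptionId>/resourceGroups/<resourceGroup>/providers/Microsoft.Storage/storageAccounts/<storageAccount>"
            },
            "pipelines": [
                {
                    "pipelineReference": {
                        "referenceName": "CopyNewBlobToSql",
                        "type": "PipelineReference"
                    },
                    "parameters": {
                        "sourceFolder": "@triggerBody().folderPath",
                        "sourceFile": "@triggerBody().fileName"
                    }
                }
            ]
        }
    }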
Thanks.
I understand your situation. Seems they've used a new platform to recreate a decades-old problem. :)
The pattern I would set up first looks something like:
1. Create a Storage Account trigger that will fire on every new file in the source container.
2. In the triggered pipeline, examine the blob name to see if it fits your parameters. If not, just end, taking no action. If so, binary-copy the blob to an account/container your app owns, leaving the original in place.
3. Create another trigger on your container that runs the import pipeline.
4. Run your import process.
A couple of caveats your management has to understand: you can be very, very reliable, but you cannot guarantee completeness, because there is no transaction/contract between you and the source container. Also, there may be a sequence gap, since a small file can usually finish processing while a larger file is still processing.
If for any reason you do miss a file, all you need to do is copy it into your container, where your process will pick it up. You can load all previous blobs in the same way.
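For the name check in step 2, an If Condition over the trigger-supplied file name could use an expression along these lines (the sourceFile parameter name is hypothetical, matching the question's customer_..._.data.json pattern):

    @and(
        startsWith(pipeline().parameters.sourceFile, 'customer_'),
        endsWith(pipeline().parameters.sourceFile, '.data.json')
    )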

How to add a file to ClearCase database, but not in source control?

On my project I have some files that are generated automatically, so you normally wouldn't put them in source control.
But since generating them takes a long time and they change fairly periodically, I'd rather keep them in the ClearCase database, so as not to impose this process on everyone who wants to compile source that isn't directly related to these files.
So, is there a way that I could add files on ClearCase UCM without creating a version tree?
More directly, I'd like to know if there is a way to keep only one version per branch. As if, when delivering this file to the main branch, it would delete the old version and replace it with the new one.
I know that this is a bit unorthodox, but I ask because I'm not interested in the generated files' history and I'd like to save space on the server.
So, is there a way that I could add files on ClearCase UCM without creating a version tree?
No.
Unless those files are radically different from one generation to the next (or are huge binaries), ClearCase only records the delta between versions, which shouldn't consume too much space.
One trick would be to rename the stream in which the import of the newly generated sources is done, and then create a new stream, so that you don't accumulate a huge version tree over time.