Azure Data Factory - run script on parquet files and output as parquet files

In Azure Data Factory I have a pipeline, created from the built-in copy data task, that copies data for 12 entities (campaign, lead, contact, etc.) from Dynamics CRM (using a linked service) and writes the contents out as parquet files to a storage account. This runs every day, into a folder structure based on the date. The output structure in the container looks something like this:
Raw/CRM/2022/05/28/campaign.parquet
Raw/CRM/2022/05/28/lead.parquet
Raw/CRM/2022/05/29/campaign.parquet
Raw/CRM/2022/05/29/lead.parquet
That's just an example, but there is a folder structure for every year/month/day that the pipeline runs, and a parquet file for each of the 12 entities I'm retrieving.
This involved creating a pipeline, a dataset for the source, and a dataset for the target. I modified the pipeline to add the pipeline's run date/time as a column in the parquet files, called RowStartDate (which I'll need in the next stage of processing).
My next step is to process the data into a staging area, which I'd like to output to a different folder in my container. My plan was to create 12 scripts (one for campaigns, one for leads, one for contacts, etc.) that essentially do the following (a sketch of this logic is shown after the list):
access all of the correct files, using a wildcard path along the lines of Raw/CRM/*/*/*/campaign.parquet
select the columns that I need
rename the column headings
in some cases, just take the most recent data (using RowStartDate)
in some cases, create a slowly changing dimension, ensuring every row has a RowEndDate
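For illustration, the per-entity logic above could look roughly like the following PySpark sketch (the kind of thing a Synapse Spark pool or Databricks notebook would run). The storage URL and the campaign column names are hypothetical, and this is only a sketch of the transformation, not the ADF data flow itself:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read every daily snapshot of one entity via a wildcard path
# (the storage account and container names are placeholders).
raw = spark.read.parquet(
    "abfss://container@account.dfs.core.windows.net/Raw/CRM/*/*/*/campaign.parquet")

# Select only the columns needed and rename the headings
# (campaignid/name are hypothetical source column names).
campaigns = raw.select(
    F.col("campaignid").alias("CampaignId"),
    F.col("name").alias("CampaignName"),
    F.col("RowStartDate"),
)

# Variant 1: keep only the most recent row per campaign, using RowStartDate.
latest = (
    campaigns
    .withColumn("rn", F.row_number().over(
        Window.partitionBy("CampaignId").orderBy(F.col("RowStartDate").desc())))
    .filter("rn = 1")
    .drop("rn")
)

# Variant 2: slowly changing dimension - each row gets a RowEndDate equal to the
# next version's RowStartDate, with a far-future date for the current row.
scd = campaigns.withColumn(
    "RowEndDate",
    F.coalesce(
        F.lead("RowStartDate").over(
            Window.partitionBy("CampaignId").orderBy("RowStartDate")),
        F.lit("9999-12-31").cast("timestamp"),
    ),
)

# Write the curated output (latest or scd, depending on the entity)
# to a different folder in the container.
latest.write.mode("overwrite").parquet(
    "abfss://container@account.dfs.core.windows.net/Staging/CRM/campaign")
```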
I made some progress figuring out how to do this in SQL, by running a query using OPENROWSET with wildcards in the path as per above - but I don't think I can use my SQL script in ADF to move/process the data into a separate folder in my container.
My question is, how can I do this (preferably in ADF pipelines):
for each of my 12 entities, access each occurrence in the container with some sort of Raw/CRM/*/*/*/campaign.parquet statement
Process it as per the logic I've described above - a script of some sort
Output the contents back to a different folder in my container (each script would produce 1 output)
I've tried:
Using Azure Data Factory: when I tell it which dataset to use, I point it to the dataset I created in my original pipeline, but this dataset covers all 12 entities, and the data flow activity produces the error "No value provided for Parameter 'cw_fileName". I don't see any place when configuring the data flow to specify a parameter (it's not under source settings, source options, projection, optimize or inspect).
Using Azure Data Factory, I tried to add a script, but when trying to connect to my SQL script in Synapse, I don't know my service principal key for the Synapse workspace.
Using a Databricks notebook, I tried to mount my container, but got an error along the lines of "adding secret to Databricks scope doesn't work in Standard Tier", so I couldn't proceed.
Using Synapse, but as expected, it wants things in SQL, whereas I'm trying to keep things in a container for now.
Could anybody point me in the right direction? What's the best approach that I should take? And if it's one that I've described above, how do I go about getting past the issue I've described?

Pass the data flow dataset parameter values from the pipeline's Data Flow activity settings: because cw_fileName is a dataset parameter, it is not set inside the data flow editor itself; it is supplied in the pipeline, on the Data Flow activity, where the parameters of the underlying datasets are exposed.

Related

need help in ADF trigger with Blob

I have a pipeline which needs to be triggered when a file is received in a Blob container.
The complex part is that there are 2 files, A.JSON and B.JSON, which will be generated in 2 different locations.
So when A.JSON is generated in location 1, Pipeline A should trigger, and also when B.JSON is generated in location 2, Pipeline A should trigger. I have done the blob trigger using 1 file in 1 location, but I'm not sure how to do it when 2 different files come in 2 different locations.
There are three ways you could do this.
Use ADF directly, with conditions to evaluate whether the triggered file is from a specific path, as per your need.
Set up a Logic App for each of the different paths you want to monitor for created blobs.
Add two different triggers configured for the different paths (best option).
First method (this has the overhead of running every time a file is created in the container):
Edit the trigger to look through the whole storage account or all containers, and select the file type (JSON in your case).
Parameterize the source dataset for a dynamic container and file name.
Create parameters in the pipeline, one for referring to the folder path you want to monitor and one for holding the triggered file name, where receive_trigger_files will be assigned the triggered file name dynamically (for example via @triggerBody().fileName in the trigger's parameter assignment).
I am showing an example here where a Lookup activity evaluates the path and the respective downstream activities execute only if the triggered file path and our monitored paths match - one branch for path1 and another for path2.
For example, a Get Metadata activity, or whatever fits your scenario.
Let's manually debug and check with a file exercise01.json that is stored in path2.
You can also use an If Condition activity similarly, but it would require multiple steps, and monitoring using the activity statuses won't be as clear.
Second method: Set up a blob-triggered Logic App.
Run the ADF pipeline using the "Create a pipeline run" action, and set or pass the appropriate parameters as explained previously.
Third method: Add 2 triggers, one for each path where you wish to monitor blob creation.

Azure Data Flow generic curation framework

I want to create a data curation framework using Data Flow that relies on generic data flow pipelines.
I have multiple data feeds (raw tables) to validate (between 10 and 100) and write to a sink as curated tables:
For each raw data feed, I need to validate the expected schema (based on a parameterized file name).
For each raw data feed, I need to provide the Data Flow Script with validation logic (some columns should not be null, some columns should have specific data types and value ranges, etc.) - a sketch of such checks follows this list.
Using the Python SDK, create the Data Factory and mapping data flow pipelines using the Data Flow Script prepared with the provided parameters (for schema validation).
Trigger the Python code that creates the pipelines for each feed, runs the validation, writes the issues to a Log Analytics workspace, and tears down the resources on specific schedules.
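For illustration only, the validation rules themselves could be expressed roughly like this in PySpark; the schema, column names and ranges below are hypothetical, and in the real framework they would be driven by per-feed parameters or the Data Flow Script rather than hard-coded:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical expectations for one raw feed.
expected_schema = {"customer_id": "bigint", "amount": "double", "created_at": "timestamp"}
not_null_columns = ["customer_id", "created_at"]
value_ranges = {"amount": (0.0, 1000000.0)}

def validate_feed(path):
    """Return a list of human-readable issues found in the feed at `path`."""
    issues = []
    df = spark.read.parquet(path)
    actual = dict(df.dtypes)

    # 1. Schema validation: every expected column must exist with the expected type.
    for column, expected_type in expected_schema.items():
        if column not in actual:
            issues.append(f"missing column: {column}")
        elif actual[column] != expected_type:
            issues.append(f"column {column}: expected {expected_type}, got {actual[column]}")

    # 2. Not-null checks.
    for column in not_null_columns:
        if column in actual:
            nulls = df.filter(F.col(column).isNull()).count()
            if nulls > 0:
                issues.append(f"column {column}: {nulls} null values")

    # 3. Value-range checks.
    for column, (low, high) in value_ranges.items():
        if column in actual:
            bad = df.filter((F.col(column) < low) | (F.col(column) > high)).count()
            if bad > 0:
                issues.append(f"column {column}: {bad} values outside [{low}, {high}]")

    return issues

# The returned issues could then be forwarded to a Log Analytics workspace.
print(validate_feed("abfss://raw@account.dfs.core.windows.net/feeds/feed01/"))
```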
Has anyone done something like this? What is the best approach for the above please?
My overall goal is to reduce the time it takes to validate/curate the data feeds, so I want to prepare the validation logic quickly for each feed and create Python classes or PowerShell scripts scheduled to run them against generic data pipelines at specific times of the day.
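For the orchestration side, a minimal sketch of kicking off a parameterized pipeline run from Python with the azure-mgmt-datafactory SDK could look like this; all resource, pipeline and parameter names are placeholders, and the exact credential class depends on your SDK version:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Placeholder names - replace with your own environment.
subscription_id = "<subscription-id>"
resource_group = "rg-dataplatform"
factory_name = "adf-curation"
pipeline_name = "pl_generic_validation"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Kick off the generic pipeline once per feed, passing feed-specific parameters.
for feed_name in ["feed01", "feed02"]:
    run = adf_client.pipelines.create_run(
        resource_group,
        factory_name,
        pipeline_name,
        parameters={"feedName": feed_name, "fileName": f"{feed_name}.parquet"},
    )
    print(feed_name, run.run_id)
```

This, or its PowerShell equivalent, can then be scheduled to run at the specific times of day you mention.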
many thanks
CK
To validate the schema, you can have a reference dataset which has the same schema (first row) as your main dataset. Then use a "Get Metadata" activity for each dataset to get the structure of each dataset (select Structure in the activity's field list).
You can then use an "If Condition" activity to compare the structure of both datasets using the equals logical function on the two Get Metadata outputs.
If both datasets' structures match, your next required activity (such as copying the dataset to another container) will be performed.
The script which you want to run on your inserted dataset could be run using a "Custom" activity. You again need to create a linked service and its corresponding dataset for the script which you will run to validate the raw data. Please refer: https://learn.microsoft.com/en-us/azure/batch/tutorial-run-python-batch-azure-data-factory
Scheduling the pipeline is handled by triggers in Azure Data Factory. A schedule trigger will cover your requirement to automatically run the pipeline at any specific time.

Can a Mapping Data Flow use a parameterized Parquet dataset?

thanks for coming in.
I am trying to develop a Mapping Data Flow in an Azure Synapse workspace (so I believe this also applies to ADFv2) that takes a Delta input and transforms it straight into Parquet-formatted output, with the relevant detail of using a Parquet dataset pointing to ADLS Gen2 with a parameterized file system and folder, as opposed to a hard-coded file system and folder, because the latter would require creating too many datasets, as there are too many folders of interest in the Data Lake.
When I try to use the parameterized dataset as a Source in my Mapping Data Flow, the debug configuration (as well as the parent pipeline configuration) duly asks for my input on those parameters, which I am happy to enter.
Then, as soon as I try to debug or run the pipeline, I get this error in less than a second:
{
"Message": "ErrorCode=InvalidTemplate, ErrorMessage=The expression 'body('DataFlowDebugExpressionResolver')?.50_DeltaToParquet_xxxxxxxxx?.ParquetCurrent.directory' is not valid: the string character '_' at position '43' is not expected."
}
RunId: xxx-xxxxxx-xxxxxx
This error message is not specific enough to tell me where I should look.
I tried replacing the parameterized Parquet dataset with a hard-coded one, and it works perfectly in both debug and pipeline-run modes. However, this does not get me what I need, which is the ability to reuse my Parquet dataset instead of having to create a specific dataset for each Data Lake folder.
There are also no spaces in the Data Lake file system. Please refer to these parameters, which look a lot like those in my production environment:
File System: prodfs001
Directory: synapse/workspace01/parquet/dim_mydim
Thanks in advance to all of you, folks!
The directory name synapse/workspace01/parquet/dim_mydim has an _ in dim_mydim. Can you try replacing the underscore, or perhaps use dimmydim, to test whether it works?

Azure Data Factory: how to incrementally copy blob data to SQL

I have an Azure blob container where some JSON files with data get put every 6 hours, and I want to use Azure Data Factory to copy them to an Azure SQL DB. The file pattern for the files is like this: "customer_year_month_day_hour_min_sec.json.data.json"
The blob container has other JSON data files as well, so I have to filter for the files I want in the dataset.
The first question is: how can I set the file path on the blob dataset to only look for the JSON files that I want? I tried the wildcard *.data.json, but that doesn't work. The only filename wildcard I have gotten to work is *.json.
The second question is: how can I copy data to Azure SQL only from the new files (with the specific file pattern) that land in the blob storage? I have no control over the process that puts the data in the blob container and cannot move the files to another location, which makes it harder.
Please help.
You could use an ADF event trigger to achieve this.
Define your event trigger as 'blob created' and specify the blobPathBeginsWith and blobPathEndsWith properties based on your filename pattern (for example, blobPathEndsWith set to .data.json).
For the first question: when an event trigger fires for a specific blob, the event captures the folder path and file name of the blob in the properties @triggerBody().folderPath and @triggerBody().fileName. You need to map these properties to pipeline parameters and pass the @pipeline().parameters.parameterName expression to the fileName of your copy activity.
This also answers the second question: each time the trigger fires, you'll get the folder path and file name of the newly created file in @triggerBody().folderPath and @triggerBody().fileName.
Thanks.
I understand your situation. Seems they've used a new platform to recreate a decades-old problem. :)
The pattern I would set up first looks something like this:
Create a Storage Account Trigger that will fire on every new file in the source container.
In the triggered pipeline, examine the blob name to see if it fits your parameters (a sketch of such a check follows this list). If not, just end, taking no action. If so, binary-copy the blob to an account/container your app owns, leaving the original in place.
Create another Trigger on your container that runs the import Pipeline.
Run your import process.
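In ADF itself the blob-name check would be a pipeline expression or an If Condition activity; purely to make its intent concrete, an equivalent check in Python could look like this (the regex is an assumption based on the customer_year_month_day_hour_min_sec.json.data.json pattern from the question):

```python
import re

# Illustrative pattern for names like "customer_2024_01_31_12_30_05.json.data.json";
# the exact separators and digit counts are assumptions.
BLOB_NAME_PATTERN = re.compile(
    r"^customer_\d{4}_\d{2}_\d{2}_\d{2}_\d{2}_\d{2}\.json\.data\.json$")

def should_import(blob_name):
    """Return True if the blob matches the expected data-file naming pattern."""
    return BLOB_NAME_PATTERN.match(blob_name) is not None

print(should_import("customer_2024_01_31_12_30_05.json.data.json"))  # True
print(should_import("somethingelse.json"))                           # False
```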
A couple of caveats your management has to understand: you can be very, very reliable, but you cannot guarantee compliance because there is no transaction/contract between you and the source container. Also, there may be a sequence gap, since a small file can usually finish processing while a larger file is still processing.
If for any reason you do miss a file, all you need to do is copy it to your container, where your process will pick it up. You can load all previous blobs in the same way.

How to pass the Result of my CustomDotNet Activity to my stored procedure

I've created an ADF pipeline for my project, which consists of a Custom activity and a Stored Procedure activity.
Here is where I'm blocked:
My Custom activity obtains the most recently modified file - let's assume xx.txt is the file - from my Azure Blob container.
My stored procedure has a single parameter, FileName. I want to pass to my stored procedure the file name obtained from the Custom activity above.
(Put simply, my Stored Procedure activity's input depends on the Custom activity's output.)
How can I do this in my ADF?
You will need to do this with two separate activities (maybe in two separate pipelines, if you want greater control), using some middle-ground storage that both processes can access.
For example:
1. Input dataset, Blob.
2. Custom activity.
3. Output dataset, SQL DB.
4. Input dataset, SQL DB (same as 3).
5. Stored Proc activity.
6. Output dataset, SQL DB.
It is the staging-area datasets used for points 3 and 4 that are most important here.
This is how ADF will want to handle it, although I understand why you would want to pass the output of the Custom activity straight to the stored procedure without using an additional dataset.
Hope this helps.