ADF - built-in copy task to create folders based on variables - azure-data-factory

I am using Azure Data Factory's built-in Copy Data tool, set on a daily schedule, to copy data into a container in Azure Data Lake Storage Gen2. For my destination, I'm trying to use date variables to create a folder structure for the data. In the resulting pipeline the expression looks like this:
Dir1/Dir2/#{formatDateTime(pipeline().parameters.windowStart,'yyyy')}/#{formatDateTime(pipeline().parameters.windowStart,'MM')}/#{formatDateTime(pipeline().parameters.windowStart,'dd')}
Unfortunately this is throwing an error:
Operation on target ForEach_h33 failed: Activity failed because an inner activity failed; Inner activity name: Copy_h33, Error: The function 'formatDateTime' expects its first parameter to be of type string. The provided value is of type 'Null'.
Everything I've created was generated by the tool; the folder path I entered when following the tool was as suggested:
Dir1/Dir2/{year}/{month}/{day} (I was then able to set the format of each variable - e.g. yyyy, MM, dd - which suggests the tool understood what I was doing.)
The only other thing I can think of is that the folder structure in the container only contains Dir1/Dir2/ - I am expecting the subdirectories to be created as the copy task runs.
I'll also add that everything runs fine if I just use the directory Dir1/Dir2/ - so the issue is with my variables.

There is nothing wrong with your built-in copy task. It will give the correct result when it runs at the scheduled trigger time.
But if you run it with a manual trigger, it will give the error above.
When you run the pipeline manually, you must supply the windowStart parameter value yourself; the error is telling you that this value is null.
Give the value in the format MM/DD/YYYY. The schedule trigger supplies this value automatically when it runs daily, but with manual triggering you have to specify it yourself.
Then click OK. You will no longer get that error, and the folders will be created in the year/month/day format.
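For reference, a minimal sketch of the parameter mapping the Copy Data tool typically generates on the schedule trigger (the pipeline name here is a placeholder, and the exact expression depends on what the tool produced in your factory):

"pipelines": [
  {
    "pipelineReference": { "referenceName": "CopyPipeline_h33", "type": "PipelineReference" },
    "parameters": {
      "windowStart": "@trigger().scheduledTime"
    }
  }
]

When you use Trigger now or Debug instead, no trigger is involved, so ADF prompts you for windowStart and you can enter a value such as 05/28/2022.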

Related

Persistable key value pair storage in Synapse or ADF

I am using Synapse and have a lot of scenarios where I need to read a value at the beginning of a pipeline and then save a value at the end of the pipeline as a key-value pair (KVP). For example, when the pipeline begins I read a value from a KVP store to get the max date from the last time the pipeline ran, and I use that value to get all rows from a table that are greater than or equal to that datetime. When the pipeline finishes doing what it has to do, I save the max modified date from this run. Wash, rinse, repeat. I have a few ideas, like a parquet file or Redis (which seems a bit much). Just trying to see if anyone has come up with a more elegant/simple approach.
You can use Global Parameters, which can be used across different pipelines and whose values can be changed at run time.
Go to Manage in Azure Data Factory and click Global parameters in the left panel, then click + New.
Create a new global parameter.
You can then use this global parameter in any pipeline and change its value at runtime.
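As a minimal sketch (LastMaxModifiedDate is just a hypothetical parameter name), a global parameter is referenced in a pipeline expression like this:

@pipeline().globalParameters.LastMaxModifiedDate

You could use that, for example, in a Lookup or source query expression to filter rows greater than or equal to that datetime.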

Azure Data Factory - run script on parquet files and output as parquet files

In Azure Data Factory I have a pipeline, created from the built-in copy data task, that copies data from 12 entities (campaign, lead, contact etc.) from Dynamics CRM (using a linked service) and outputs the contents as parquet files in a storage account. This is run every day, into a folder structure based on the date. The output structure in the container looks something like this:
Raw/CRM/2022/05/28/campaign.parquet
Raw/CRM/2022/05/28/lead.parquet
Raw/CRM/2022/05/29/campaign.parquet
Raw/CRM/2022/05/29/lead.parquet
That's just an example, but there is a folder structure for every year/month/day that the pipeline runs, and a parquet file for each of the 12 entities I'm retrieving.
This involved creating a pipeline, a dataset for the source and a dataset for the target. I modified the pipeline to add the pipeline's run date/time as a column in the parquet files, called RowStartDate (which I'll need in the next stage of processing).
My next step is to process the data into a staging area, which I'd like to output to a different folder in my container. My plan was to create 12 scripts (one for campaigns, one for leads, one for contacts, etc.) that essentially do the following:
access all of the correct files, using a wildcard path along the lines of Raw/CRM/*/*/*/campaign.parquet
select the columns that I need
rename the column headings
in some cases, just take the most recent data (using RowStartDate)
in some cases, create a slowly changing dimension, ensuring every row has a RowEndDate
I made some progress figuring out how to do this in SQL, by running a query using OPENROWSET with wildcards in the path as per above - but I don't think I can use my SQL script in ADF to move/process the data into a separate folder in my container.
My question is, how can I do this (preferably in ADF pipelines):
for each of my 12 entities, access each occurrence in the container with some sort of Raw/CRM/*/*/*/campaign.parquet statement
Process it as per the logic I've described above - a script of some sort
Output the contents back to a different folder in my container (each script would produce 1 output)
I've tried:
Using Azure Data Factory with a data flow: when I tell it which dataset to use, I point it to the dataset I created in my original pipeline - but that dataset covers all 12 entities, and the data flow activity produces the error "No value provided for Parameter 'cw_fileName'" - and I don't see anywhere when configuring the data flow to specify a parameter (it's not under source settings, source options, projection, optimize or inspect)
Using Azure Data Factory, I tried to add a script - but in trying to connect to my SQL script in Synapse, I don't know my service principal key for the Synapse workspace
Using a Databricks notebook, I tried to mount my container but got an error along the lines of "adding secret to Databricks scope doesn't work in Standard Tier", so I couldn't proceed
Using Synapse - but, as expected, it wants things in SQL, whereas I'm trying to keep things in a container for now
Could anybody point me in the right direction? What's the best approach I should take? And if it's one that I've described above, how do I go about getting past the issue I've described?
Pass the data flow's dataset parameter values from the Data Flow activity settings in the pipeline. The cw_fileName parameter lives on the dataset (it was auto-generated by the Copy Data tool), which is why you don't see it inside the data flow designer - supply its value on the Data Flow activity when the data flow runs.
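For reference, a rough sketch of how that value travels in the pipeline JSON, assuming the source stream in the data flow is called source1 and the data flow name is a placeholder (in practice you just type the value into the dataset parameter box shown on the Data Flow activity):

"typeProperties": {
  "dataflow": {
    "referenceName": "StageCampaignDataflow",
    "type": "DataFlowReference",
    "datasetParameters": {
      "source1": { "cw_fileName": "campaign.parquet" }
    }
  }
}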

need help in ADF trigger with Blob

I have a pipeline which needs to be triggered when a file is received in Blob storage.
But the complex part is that there are 2 files, A.JSON and B.JSON, which will be generated in 2 different locations.
So when A.JSON is generated in location 1, Pipeline A should trigger, and when B.JSON is generated in location 2, Pipeline A should also trigger. I have done a blob trigger using 1 file in 1 location, but I'm not sure what to do when 2 different files arrive in 2 different locations.
There are three ways you could do this.
Use ADF directly, with conditions that evaluate whether the triggered file is from a specific path, as per your need.
Set up a Logic App for each of the different paths you want to monitor for created blobs.
Add two different triggers configured for the different paths (best option).
First method (this has the overhead of running every time a file is created anywhere in the monitored storage):
Edit the trigger to look through the whole storage account or all containers, and select the file type - JSON in your case.
Parameterize the source dataset for a dynamic container and file name.
Create parameters in the pipeline: one for each folder path you want to monitor and one for holding the triggered file name,
where receive_trigger_files will be assigned the triggered file name dynamically, as sketched below.
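A minimal sketch of that assignment in the trigger's pipeline-parameter mapping (receive_trigger_folder is a hypothetical name; @triggerBody().fileName and @triggerBody().folderPath are the values a storage event trigger exposes):

"parameters": {
  "receive_trigger_files": "@triggerBody().fileName",
  "receive_trigger_folder": "@triggerBody().folderPath"
}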
As an example, a Lookup activity can evaluate the path and, if the triggered file path matches one of the monitored paths, execute the respective downstream activities - one check for path1 and another for path2 - for example a Get Metadata activity, or whatever fits your scenario.
Manually debugging with a file exercise01.json stored in path2 shows the path2 branch executing.
You could also use an If Condition activity similarly, but that would require multiple steps, or monitoring via activity statuses would not be as clear.
Second method: set up a blob-triggered Logic App.
Run the ADF pipeline using the Create a pipeline run action, and set or pass the appropriate parameters as explained previously.
Third method: add 2 storage event triggers, one for each path on which you wish to monitor blob creation.
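For this option, a rough sketch of one of the two storage event triggers (container, path, and pipeline names are placeholders; the second trigger is identical apart from the path and file name):

{
  "name": "TriggerLocation1",
  "properties": {
    "type": "BlobEventsTrigger",
    "typeProperties": {
      "blobPathBeginsWith": "/container1/blobs/location1/",
      "blobPathEndsWith": "A.JSON",
      "ignoreEmptyBlobs": true,
      "events": [ "Microsoft.Storage.BlobCreated" ],
      "scope": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"
    },
    "pipelines": [
      { "pipelineReference": { "referenceName": "PipelineA", "type": "PipelineReference" } }
    ]
  }
}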

Using Azure Data Factory output in Logic App

I have a Logic App that starts on a recurrence and first runs an ADF pipeline which outputs a folder of files.
Then, I use a List Blobs action to pull one specific file
from the newly made folder and place its path on a queue.
And once a message is placed on that queue, it triggers the run of
another ADF pipeline.
The issue is that I have not seen a way to get the output of the first ADF pipeline to put on the queue. I have tried to cheat within the List Blobs action that follows the 1st ADF pipeline by explicitly searching for the name of the output folder, because it will be the same every time.
However, even after the 1st ADF pipeline has run and produced the folder, within the first run of this Logic App the List Blobs action can't find the folder and says the file path is not found.
Only after I run the Logic App a second time is the folder finally found, which is not at all optimal. How can I fix this? I'd prefer to keep everything in one Logic App. Are there other Azure tools that can help in addition?
I don't have the details of your implementation, but I am wondering: is the message written by the first pipeline only used as a signal for the second pipeline? If that's the case, why can't you call the second pipeline on completion of the first one? Or maybe these pipelines are in different data factories?
I also suggest you read up on event triggers and see if you can use them.
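If both pipelines are in the same factory, chaining them removes the timing problem entirely. A minimal sketch (pipeline and activity names are placeholders) of a parent pipeline that runs them in sequence:

"activities": [
  {
    "name": "RunFirstPipeline",
    "type": "ExecutePipeline",
    "typeProperties": {
      "pipeline": { "referenceName": "FirstPipeline", "type": "PipelineReference" },
      "waitOnCompletion": true
    }
  },
  {
    "name": "RunSecondPipeline",
    "type": "ExecutePipeline",
    "dependsOn": [ { "activity": "RunFirstPipeline", "dependencyConditions": [ "Succeeded" ] } ],
    "typeProperties": {
      "pipeline": { "referenceName": "SecondPipeline", "type": "PipelineReference" },
      "waitOnCompletion": true
    }
  }
]

The waitOnCompletion flag makes the second run start only after the first pipeline has actually finished producing its folder.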

Talend: How to copy the file with modified date as of today

I have a job in Talend which connects to an FTP folder and looks for files, e.g. ABCD. This file is created every day and placed in the FTP path, and I need to move these files to some other folder. I'm new to Talend and Java. Could you please help me move this file when, and only when, the file's last modified date is the same as the job run date?
You can use tFTPFileProperties to obtain the properties of the remote file, then access those properties in a tJavaRow. In the tJavaRow you can either compare to the current date and put the result in a global variable, or just put the date in a global variable. You then use an IF trigger to join to the tFTPGet component.
The IF trigger will either check the result of your comparison or do the comparison itself, and it will only execute the tFTPGet if it is true.
The overall job structure is tFTPFileProperties feeding a tJavaRow, joined to the tFTPGet with an IF trigger; the file properties expose several fields, including the datetime of the remote file.
In the tJavaRow you obtain the datetime of the remote file and stick it in a global variable (code for that is not shown) so you can use it in your IF trigger code.
When the job is run, that datetime reflects the remote file's last modification.
This points you in the right direction, but you will still need to do some work: you will need to do the comparison in your IF trigger and know how to compare dates.
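As a rough sketch of that comparison (the mtime field name is an assumption - use whichever column your tFTPFileProperties schema actually exposes for the last-modified time), the tJavaRow could contain:

// tJavaRow: compare the remote file's modified date with today's date.
// 'mtime' is assumed to be the last-modified time in milliseconds since the epoch;
// adjust the field name/type to match your tFTPFileProperties schema.
java.text.SimpleDateFormat fmt = new java.text.SimpleDateFormat("yyyy-MM-dd");
String fileDate = fmt.format(new java.util.Date(input_row.mtime));
String today = fmt.format(new java.util.Date());
// Store the result so the IF trigger can read it.
globalMap.put("isModifiedToday", fileDate.equals(today));

The IF trigger condition would then simply be ((Boolean) globalMap.get("isModifiedToday")).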