Issue with Copy Activity Metadata - azure-data-factory

The Copy Data activity, which used to show the number of rows written, isn't showing it any more.
Is there any option in the Copy activity to make sure it reflects the number of rows written?

Whether it is a debug run or a triggered pipeline run, you can check the output of the Copy Data activity to see whether the data read equals the data written.
Let's say it is a triggered pipeline run. Navigate to the Monitor section and click on the pipeline.
The activity run dialog opens up. There, from the activity output, you can check whether the data read equals the data written:
NOTE: The above is for a blob-to-blob copy. For your source and sink, there will be similar activity output data containing the required information (such as rows read and rows written). The following is an example for an Azure SQL Database to blob copy:
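If you need these counts downstream (for example, to log them or to compare them), a minimal sketch, assuming the Copy activity is named 'Copy data1' (a hypothetical name), is to reference the rowsRead and rowsCopied properties of its output with dynamic content:
    @activity('Copy data1').output.rowsRead
    @activity('Copy data1').output.rowsCopied
    @equals(activity('Copy data1').output.rowsRead, activity('Copy data1').output.rowsCopied)
The last expression can drive an If Condition activity when you want to branch or fail the pipeline if the counts differ.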

Related

Azure Data Factory Overwrite Existing Folder/Partitions in ADLS Gen2

I have set my ADF sink to clear the folder, partitioned by an ID.
But this sink already has other existing partitions that I do not want to remove.
When an ID comes in, I want to clear only that specific folder/partition, but it is actually clearing the full folder rather than just the partition. Am I missing a setting?
To overwrite only the partitions that appear in the new data and keep the rest of the old partition data, you can make use of the pre commands in the Settings tab of the data flow sink. Look at the following demonstration.
The following is my initial data, which I have partitioned based on id.
Now let's say the following is the new data that you are going to write. Here, according to the requirement, you want to overwrite the partitions that are present in it and keep the rest as they are.
First, we need to get the distinct key column values (id in my case). Then we use them in the pre commands of the sink settings to remove files only from those partitions.
Take the above data (the data in the second image) as the source of dataflow1. Apply a derived column transformation to add a new column with a constant value, say 'xxx' (so we can group on this column and apply the collect() aggregate function).
Group by this new column and use the aggregate expression distinct(collect(id)).
Now, for the sink, choose Cache and check Write to activity output. When you run this data flow in the pipeline, the debug output would be:
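For illustration, the relevant part of the Data flow activity output has roughly this shape (a sketch only, assuming the cache sink is named sink1 and the aggregate column is named val, which is what the expression below refers to; the id values shown are examples):
    {
      "runStatus": {
        "output": {
          "sink1": {
            "value": [ { "val": [ 1, 2 ] } ]
          }
        }
      }
    }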
Send this array value to a parameter created in another data flow, where you make the necessary changes and overwrite the partitions. Give the parameter the following dynamic content:
@activity('Data flow1').output.runStatus.output.sink1.value[0].val
Now, in this second data flow, the source is the same data used in the first data flow. For the sink, instead of selecting the Clear the folder option, scroll down to the pre/post commands section and give the following dynamic content:
concat('rm /output/id=',toString($parts),'/*')
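As a concrete illustration of the concat, if toString($parts) produced the id value 2, the resulting command would be:
    rm /output/id=2/*
which removes only the part files under that partition folder before the new data is written.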
Now, when you run this pipeline, it executes successfully and overwrites only the required partitions while keeping the other partitions intact.
The following is sample partition data (id=2) showing that the data has been overwritten (only one part file with the required data will be present).
Why don't you specify the filename and write it to a single file?

How can I pass output from a filter activity directly to a copy activity in ADF?

I have 4000 files, each averaging 30 KB in size, landing in a folder on our on-premises file system each day. I want to apply conditional logic (several and/or conditions) against details in their file names to move only the files matching the conditions into another folder.
I have tried linking a Get Metadata activity (which gets all files in the source folder) to a Filter activity (which applies the conditional logic) and then to a ForEach activity with an embedded Copy activity. This works, but it is taking hours to process the files. When running the pipeline in debug, the output window appears to list each file copied as a line item. I've increased the batch count setting in the ForEach to 50, but it hasn't improved things.
Is there a way to link the Filter activity directly to the Copy activity without using a ForEach activity, i.e. pass the collection from the Filter straight into the Copy's source?
Alternatively, some of our other pipelines just use the Copy activity pointing at a source folder, and we configure its file filter setting with a simple wildcard pattern using a combination of * and ?, which is extremely fast. However, in this particular scenario my conditional logic is more complex, and I need to compare attributes in each file's name against values to decide whether the file should be moved. The file filter setting allows dynamic content, so I could remove the Filter activity completely, point the Copy at the source folder and put the conditional logic in the file filter's dynamic content area, but how would I get a reference to the file name to do the conditional checks?
Here is one solution:
Write array output as text to a .json in Blob Storage (or wherever). Here are the steps to make that work:
Copy Data Source:
Copy Data Sink:
Write the JSON (array output) to a text file that contains the names of the files you want to copy.
Copy Activity Source (to get it from JSON to .txt):
Sink will be .txt file in your Blob.
Use that text file in your main copy activity and use the following setting:
This should copy over all the files that you identified in your Filter Activity.
I realize this is a workaround, but it really is the only solution for what you are asking; otherwise there is no way to link a Filter activity straight to a Copy activity.
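The setting in the main copy activity is most likely the source's 'List of files' option (fileListPath), which points at a text file listing one relative file path per line. A rough sketch of the copy source in JSON, where the source type, store settings type and path are hypothetical and should match your own connector and dataset:
    "source": {
        "type": "BinarySource",
        "storeSettings": {
            "type": "FileServerReadSettings",
            "fileListPath": "share/filelists/files-to-copy.txt"
        }
    }
The listed files are then resolved relative to the folder configured in the source dataset.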

Validation checks on dataset ADF vs databricks

I want to perform some file-level and field-level validation checks on the dataset I receive.
Below are some of the checks I want to perform, capturing any issues into audit tables.
File-level checks: file present, size of the file, count of records matches the count present in the control file.
Field-level checks: content in the right format, duplicate key checks, ranges on important fields.
I want to make this a template so that all projects can adopt it. Is it better to perform these checks in ADF or in Databricks? If ADF, any reference to an example data flow/pipeline would be very helpful.
Thanks,
Kumar
You can accomplish these tasks by using various activities in an Azure Data Factory pipeline.
To check for file existence, you can use the Validation activity.
In the Validation activity, you specify several things: the dataset whose existence you want to validate, the sleep interval (how long to wait between retries), and the timeout (how long it should keep trying before giving up and timing out). The minimum size is optional.
Be sure to set the timeout value properly. The default is 7 days, much too long for most jobs.
If the file is found, the activity reports success.
If the file is not found, or is smaller than the minimum size, the activity can time out, which is treated as a failure by downstream dependencies.
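As a rough sketch, a Validation activity definition in pipeline JSON looks something like this (the dataset name and the timeout, sleep and size values are hypothetical examples):
    {
        "name": "Validate file exists",
        "type": "Validation",
        "typeProperties": {
            "dataset": { "referenceName": "SourceFileDataset", "type": "DatasetReference" },
            "timeout": "0.01:00:00",
            "sleep": 60,
            "minimumSize": 1024
        }
    }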
To count the matching records, and assuming that you are using CSV, you could create a generic one-column dataset and run a Copy activity over whatever folders you want to count into a temp folder. Get the row count from the Copy activity output and save it.
At the end, delete everything in your temp folder.
Something like this:
Lookup activity (gets your list of base folders - just for easy rerunning)
For Each (Base Folder)
Copy Recursively to temp folder
Stored Procedure activity which stores the Copy activity's output.rowsCopied (see the expression sketch after this list)
Delete temp files recursively.
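In that Stored Procedure step, the row count can be passed as a parameter with dynamic content along these lines (the activity name 'CopyToTemp' is hypothetical and should match your Copy activity's name):
    @activity('CopyToTemp').output.rowsCopied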
To use the same pipeline repeatedly for multiple datasets, you can make your pipeline dynamic. Refer: https://sqlitybi.com/how-to-build-dynamic-azure-data-factory-pipelines/
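As a minimal illustration of making a pipeline dynamic, you can add a parameter to the dataset (the name folderName here is hypothetical) and reference it in the dataset's folder path, then supply a different value on each invocation:
    @dataset().folderName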

Event based trigger for a sequential run of the same data factory pipeline

I would like to use an event based trigger to run a data factory pipeline.
The trigger will check a folder in a data lake for any new file and start a pipeline once a new CSV file is copied.
The pipeline will then copy the data to an intermediate table to check its consistency (multiple checks using different data flow activities) and if everything's correct, copies it into a stage table.
It is thus very important that the intermediate table will contain the data from only one single CSV file before it is checked.
I have read, though, that the event-based trigger will start as many pipeline runs in parallel as there are (simultaneously) arriving CSV files.
Is this right? In that case, how can I force each pipeline run to wait until the previous one is done?
Thank you for your help.
There is a setting in the pipeline properties (accessible at the top right of the editor pane) called concurrency. Set it to 1 and only one run will execute at a time; any other invocations will be queued until that one finishes.
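In the pipeline JSON this corresponds to the concurrency property, for example (the pipeline name is hypothetical and the activities are omitted):
    {
        "name": "LoadCsvPipeline",
        "properties": {
            "concurrency": 1,
            "activities": []
        }
    }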

How to force an empty output file with Azure Stream Analytics

I have configured a Stream Analytics job so that input data goes to an Azure Data Lake repository every hour.
Sometimes there is no event to track, so there is no output, but my Data Factory run fails because the file doesn't exist.
Is there a way to force Stream Analytics to write an empty output file?
Many thanks!
You can look at our common query patterns here. In particular I think you can use the one named "fill missing values" to generate some events regularly, even when there is no input.
Let me know if it works for you.
Thanks!
JS
Are you using ADF v2?
I didn't find anything built into ADF for this.
But I can see a few workarounds, starting from the simplest one:
In your ASA query, you can use a WITH statement and union your input with a fake empty message - then there will always be output.
As a second output of the ASA job, you can store in some database whether a file was produced. Then, in ADF, you can check whether there are files and run the copy conditionally (see the sketch after this list).
In ADF, run a Web activity (e.g. a Logic App or Function App) to find out whether files exist in the container.
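One concrete way to implement the conditional check inside ADF (a hedged alternative to the database/Web activity lookups above) is a Get Metadata activity on the expected file with the exists field selected, followed by an If Condition whose expression is:
    @activity('Get Metadata1').output.exists
The copy then runs only in the True branch (the activity name 'Get Metadata1' is just the default and can be anything).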
Found a way to do it...
I added an activity using Data Lake Analytics; what I do is run a U-SQL script that reads the data with no transformation and writes it to the output with headers.
That way the activity always writes an output file!
Very easy!