Difference between DataFlow and Pipelines - azure-data-factory

I do not understand the difference between a Data Flow and a Pipeline in Azure Data Factory.
I have read that a Data Flow can transform data without writing a single line of code.
But I have built a pipeline and it seems to do exactly the same thing.
Thanks

A Pipeline is an orchestrator and does not transform data. It manages a series of one or more activities, such as Copy Data or Execute Stored Procedure. Data Flow is one of these activity types and is very different from a Pipeline.
Data Flow performs row and column level transformations, such as parsing values, calculations, adding/renaming/deleting columns, even adding or removing rows. At runtime a Data Flow is executed in a Spark environment, not the Data Factory execution runtime.
A Pipeline can run without a Data Flow, but a Data Flow cannot run without a Pipeline.

Firstly, a Data Flow activity needs to be executed inside a pipeline, so I suspect you are comparing the Copy activity with the Data Flow activity, since both of them are used for transferring data from a source to a sink.
I have read that a Data Flow can transform data without writing a single line of code.
You could look at the overview of Data Flow: it allows data engineers to develop graphical data transformation logic without writing code. All data transformation steps are based on visual interfaces.
But I have built a pipeline and it seems to do exactly the same thing.
The Copy activity can be used for data movement, but it has many limitations around column mapping. So if you just need simple, straightforward data movement, the Copy activity is enough. To meet more advanced needs, you will find many built-in transformations in the Data Flow activity, for example Derived Column, Aggregate, and Sort.
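To give a rough sense of what those transformations do (the table and column names below are purely illustrative), the graphical steps in a Data Flow express the kind of logic you would otherwise write in SQL, for example:

-- Illustrative only: roughly what Derived Column, Aggregate and Sort express graphically.
select
    customer_id,
    upper(country)    as country_code,   -- Derived Column: compute a new column
    sum(order_amount) as total_amount,   -- Aggregate: group and summarize rows
    count(*)          as order_count
from orders
group by customer_id, upper(country)
order by total_amount desc;              -- Sort: order the output rows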

Related

How to implement scd2 in snowflake tables using Azure Data Factory

I want to implement SCD2 on Snowflake tables. My source and target tables are both in Snowflake, and the entire process has to be done using Azure Data Factory.
I went through the documentation provided by Azure for implementing SCD2 using data flows, but when I tried to create a dataset for the Snowflake connection it shows as disabled.
Is there any way, or any documentation, where I can see the steps to create SCD2 in ADF with Snowflake tables?
Thanks
vipendra
SCD2 in ADF can be built and managed graphically via data flows. The Snowflake connector for ADF does not yet work directly with data flows. So for now, you will need to use the Copy activity in an ADF pipeline to stage the dimension data in Blob or ADLS, then build your SCD2 logic in data flows using the staged data.
Your pipeline will look something like this:
[Copy Activity Snowflake-to-Blob] -> [Data Flow SCD2 logic Blob-to-Blob] -> [Copy Activity Blob-to-Snowflake]
We are working on direct connectivity to Snowflake from data flows and hope to land that soon.
If your source and target tables are both in Snowflake, you could use Snowflake Streams to do this. There's a blog post covering this in more detail at https://community.snowflake.com/s/article/Building-a-Type-2-Slowly-Changing-Dimension-in-Snowflake-Using-Streams-and-Tasks-Part-1
However, in short, if you have a source table named source, you can put a stream on it like so:
create or replace stream source_changes on table source;
This will capture all the changes that are made to the source table. You can then build a view on that stream that establishes how you want to feed those changes into the SCD table. (The blog post uses CASE statements to put start and end dates on each row of the view.)
From there, you can use a Snowflake Task to automate the process of loading from the stream into the SCD only when the Stream actually has changes.
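As a rough sketch of that pattern (the table, view, and task names below are assumptions, and the SCD2 logic itself is simplified; in practice you would also end-date the previous version of each changed row):

-- Sketch only: object names are assumptions and the dimension load is simplified.
create or replace stream source_changes on table source;

-- View that shapes the captured changes for the dimension load.
create or replace view v_source_changes as
select
    id,
    attribute,
    current_timestamp() as start_date,
    case when metadata$action = 'DELETE'
         then current_timestamp()
    end                 as end_date
from source_changes;

-- Task that loads the dimension only when the stream actually has data.
-- Remember to resume it after creation: alter task load_scd2 resume;
create or replace task load_scd2
    warehouse = my_wh
    schedule  = '5 minute'
when
    system$stream_has_data('SOURCE_CHANGES')
as
    insert into dim_scd2 (id, attribute, start_date, end_date)
    select id, attribute, start_date, end_date
    from v_source_changes;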

Concurrent file processing in data flow activity Azure Data Factory

When using control flow, it's possible to use a GetMetadata activity to retrieve a list of files in a blob storage account and then pass that list to a ForEach activity with the Sequential flag set to false, so that all files are processed concurrently (in parallel) up to the max batch size, according to the activities defined in the ForEach loop.
However, when reading about data flows in the following article from Microsoft (https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-column-pattern), they indicate the following:
A mapping data flow will execute better when the Source transformation iterates over multiple files instead of looping via the For Each activity. We recommend using wildcards or file lists in your source transformation. The Data Flow process will execute faster by allowing the looping to occur inside the Spark cluster. For more information, see Wildcarding in Source Transformation.
For example, if you have a list of data files from July 2019 that you wish to process in a folder in Blob Storage, below is a wildcard you can use in your Source transformation.
DateFiles/*_201907*.txt
By using wildcarding, your pipeline will only contain one Data Flow activity. This will perform better than a Lookup against the Blob Store that then iterates across all matched files using a ForEach with an Execute Data Flow activity inside.
Based on this finding, I have configured a data flow where the source is a blob directory of files, and it processes all the files in that directory with no control loops. However, I do not see any option to process files concurrently within the data flow, only an Optimize tab where you can set the partitioning option.
Is this option only for processing a single large file into multiple threads or does this control how many files it processes concurrently within the directory where the source is pointing?
Is the documentation assuming the ForEach control loop is set to "Sequential"? (I can see why that would be true if it were, but I have a hard time believing it if the data flow is running one file at a time.)
Inside data flow, each source transformation will read all of the files indicated in the folder or wildcard path and store those contents into data frames in memory for processing.
Setting the partitioning manually from the Optimize tab tells ADF which partitioning scheme you wish to use inside Spark.
To process each file individually, one by one, use the control flow capabilities in the pipeline.
Iterate over each file you wish to process and pass the file name into the data flow via a parameter, inside a ForEach activity with execution set to Sequential.

Do I need a storage (of some sort) when pulling data in Azure Data factory

*Data newbie here*
Currently, to run analytics report on data pulled from Dynamics 365, I use Power BI.
The issue with this is that Power BI is quite slow at processing large data sets. I carry out a number of transformation steps (e.g. merge, join, deleting or renaming columns), so when I run a query in Power BI with those steps, it takes a long time to complete.
So, as a solution, I decided to make use of Azure Data Factory (ADF). The plan is to use ADF to pull the data from CRM (i.e. Dynamics 365), perform transformations, and publish the data. Then I'll use Power BI for visual analytics.
My question is:
What Azure service will I need in addition to Data Factory? Will I need to store the data I pulled from CRM somewhere, like Azure Data Lake or Blob Storage? Or can I do the transformation on the fly, right after the data is ingested?
Initially, I thought I could use the 'copy' activity to ingest data from CRM and start playing with the data, but the copy activity requires a sink (a destination for the data, which has to be storage of some sort).
I also thought I could make use of the 'lookup' activity. I tried to use it, but I get errors (and no exception message is produced).
I have scoured the internet for a similar process (i.e. Dynamics 365 -> Data Factory -> Power BI), but I've not been able to find any.
Most of the processes I've seen, however, utilise some sort of data storage right after data ingestion.
All responses welcome, even if you believe I am going about this the wrong way.
Thanks.
A few things here:
The copy activity just moves data from a source to a sink. It doesn't modify it on the fly.
The lookup activity is just there to look up some attributes to use later in the same pipeline.
ADF cannot publish a dataset to Power BI (although it may be able to push to a streaming dataset).
Your approach is correct, but you need that last step of transforming the data. You have a lot of options here, but since you are already familiar with Power BI you can use Wrangling Data Flows, which allow you to take a file from the data lake, apply some Power Query, and save a new file back in the lake. You can also use Mapping Data Flows, Databricks, or any other data transformation tool.
Lastly, you can pull the files from the data lake with Power BI to build your report on the data in this new file.
Of course, as always in Azure, there are a lot of ways to solve a problem or architect a service; this is the one I consider simplest for you.
Hope this helped!

How to get max of a given column from ADF Copy Data activity

I have a Copy Data activity with an on-premises SQL Server as the source and ADLS Gen2 as the sink. There is a control table to pick up the tableName, watermarkDateColumn, and watermarkDatetime used to pull incremental data from the source database.
After the data is pulled/loaded into the sink, I want to get the max of the watermarkDateColumn in my dataset. Can it be obtained from @activity('copyActivity1').output?
I'm not allowed to use an extra Lookup activity to query the source table for max(watermarkDateColumn) in the pipeline.
The Copy activity can only be used for data movement, not for any aggregation, so @activity('copyActivity1').output won't help. Since you said you can't use a Lookup activity, I'm afraid there is no way to meet your requirement that way so far.
If you prefer not to use additional activities, I suggest using the Data Flow activity instead, which is more flexible: it has a built-in Aggregate transformation.
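To illustrate what that Aggregate transformation computes (the names below are assumptions, and in a data flow this is configured graphically rather than written as SQL), the result over the data landed in the sink is the equivalent of:

-- Illustrative only: the data flow's Aggregate transformation over the loaded
-- dataset computes the same thing as this query, and the result can then be
-- written back to the watermark control table.
select max(watermarkDateColumn) as new_watermark
from   loaded_dataset;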

Parallelisms in Azure Data factory v2 copy activity

We are implementing a solution to achieve functionality similar to SSIS packages for copying data from one database to another (on-premises to Azure SQL). In SSIS we have options to set up parallel processing in different ways, and we can also transfer data in chunks.
Similarly, what is the best way to achieve parallelism in Azure Data Factory version 2? Please consider the scenario of transferring data for only one table.
Have a look at the Copy Activity Performance and Tuning Guide for ways to optimize transferring data into the Cloud with ADF: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance
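For the single-table scenario specifically, the guide's parallel copy options (for example, range partitioning on a SQL source) split the table so that each parallel connection reads one slice. A hedged sketch of the idea, not ADF's exact internals, with assumed table and column names:

-- Conceptually, with parallelCopies = 4 and range partitioning on Id,
-- each parallel connection issues a query over one slice of the table:
select * from dbo.MyTable where Id >= 1      and Id < 250001;   -- copy 1
select * from dbo.MyTable where Id >= 250001 and Id < 500001;   -- copy 2
select * from dbo.MyTable where Id >= 500001 and Id < 750001;   -- copy 3
select * from dbo.MyTable where Id >= 750001 and Id <= 1000000; -- copy 4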