Where does Azure Data Factory Mapping Data Flow make sense? - azure-data-factory

My assumptions about where MDF might be the right fit are as follows:
MDF can be used as a Data Wrangling Tool by end-users
MDF is better suited for SQL Server-based data warehouse architectures, loading data into staging or a data lake in clean form (i.e. preparing the data before it is loaded into the SQL Server DWH, with a proper ETL tool doing the transformations afterwards)
If MDF has to be used for light ELT / ETL tasks directly on Data Lake or DWH, it needs customization for complex transformations...
My question would be:
A) Has anyone used Mapping Data Flow in production for assumptions 2 and 3 above?
B) If assumption 3 is valid, would you suggest going for a Spark-based transformation or an ETL tool rather than patching MDF with customizations, since new versions might not be compatible with them, etc.?

I disagree with most of your assumptions. Data Flow is part of a larger ETL environment, either Data Factory (ADF) or Azure Synapse Pipelines, and you really can't separate it from its host. Data Flow is a UI code generator that executes at runtime as a Spark job. If your end user is a data engineer, then yes, Data Flow is a good tool for them.
ADF is a great service for orchestrating data operations. ADF supports all the things you mentioned (SSIS, Notebooks, Stored Procedures, and many more). It also supports Data Flow, which is absolutely a "proper" tool for transformations and has a very rich feature set. In fact, if you are NOT doing transformations, Data Flow is likely overkill for your solution.

Related

Is Azure Data Factory ELT?

I have a question about Azure Data Factory (ADF). I have read (and heard) contradictory info about ADF being ETL or ELT. So, is ADF ETL? Or is it ELT? To my knowledge, ELT uses the transformation (compute?) engine of the target (whereas ETL uses a dedicated transformation engine). To my knowledge, ADF uses Databricks under the hood, which is really just an on-demand Spark cluster. That Spark cluster is separate from the target. So, that would mean that ADF is ETL. But I'm not confident about this.
Good question.
It all depends on what you use and how you use it.
If it is strictly a copy activity, then it is ELT.
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview
The transform can be a stored procedure (the RDBMS does not matter) and the source/destination are tables. If the landing zone is a data lake, then you want to call a Databricks or Synapse notebook. Again, the source is a file; the target is probably a Delta table. Most people love SQL, and Delta tables give you those ACID properties.
Now, if you are using a mapping or wrangling data flow, then it is ETL (if the pattern is pure). Of course, you can mix and match. Both of these data flows use a Spark engine, and it costs money to keep big Spark clusters running.
https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-overview
https://learn.microsoft.com/en-us/azure/data-factory/wrangling-tutorial
Here is an article from the Azure Architecture Center.
https://learn.microsoft.com/en-us/azure/architecture/data-guide/relational-data/etl
It covers both older (SSIS) and newer (Synapse) technologies.
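As a rough illustration of the ELT pattern described above (copy into a staging table, then transform inside the target with a stored procedure), a simplified ADF pipeline definition might look like the sketch below. The activity, dataset, linked service, and stored procedure names are all placeholders invented for the example:

```json
{
  "name": "EltPipelineSketch",
  "properties": {
    "activities": [
      {
        "name": "CopySourceToStaging",
        "type": "Copy",
        "inputs": [ { "referenceName": "SourceTableDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "StagingTableDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "AzureSqlSource" },
          "sink": { "type": "AzureSqlSink" }
        }
      },
      {
        "name": "TransformInTarget",
        "type": "SqlServerStoredProcedure",
        "dependsOn": [
          { "activity": "CopySourceToStaging", "dependencyConditions": [ "Succeeded" ] }
        ],
        "linkedServiceName": { "referenceName": "TargetSqlLinkedService", "type": "LinkedServiceReference" },
        "typeProperties": { "storedProcedureName": "dbo.usp_LoadDimensions" }
      }
    ]
  }
}
```

The key point is that the "T" happens in the stored procedure, executed by the target database's own engine, which is what makes this ELT rather than ETL.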

What is the difference between polybase and bulk insert, copy methods in azure data factory and when to use them?

Can someone please elaborate on when to use PolyBase versus bulk insert in Azure Data Factory? What are the differences between these two copy methods?
The two options labeled “PolyBase” and the “COPY command” are only applicable to Azure Synapse Analytics (formerly Azure SQL Data Warehouse). They are both fast methods of loading which involve staging data in Azure Storage (if it’s not already in Azure Storage) and using a fast, highly parallel method of loading to each compute node from storage. Especially on large tables these options are preferred due to their scalability, but they do come with some restrictions documented at the link above.
In contrast, on Azure Synapse Analytics a bulk insert is a slower load method which loads data through the control node and is not as highly parallel or performant. It is an order of magnitude slower on large files. But it can be more forgiving in terms of data types and file formatting.
On other Azure SQL databases, bulk insert is the preferred and fast method.
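For reference, these load methods are selected on the copy activity's Synapse sink. A hedged sketch of the relevant `typeProperties` (property names follow the copy activity JSON schema; the staging linked service name and container path are placeholders):

```json
{
  "typeProperties": {
    "source": { "type": "DelimitedTextSource" },
    "sink": {
      "type": "SqlDWSink",
      "allowPolyBase": true,
      "polyBaseSettings": {
        "rejectType": "percentage",
        "rejectValue": 10.0,
        "rejectSampleValue": 100
      }
    },
    "enableStaging": true,
    "stagingSettings": {
      "linkedServiceName": { "referenceName": "StagingBlobStorage", "type": "LinkedServiceReference" },
      "path": "staging-container"
    }
  }
}
```

Setting `"allowCopyCommand": true` instead of `"allowPolyBase": true` selects the COPY command path; leaving both off falls back to the slower bulk insert.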

Do I need a storage (of some sort) when pulling data in Azure Data factory

*Data newbie here*
Currently, to run analytics report on data pulled from Dynamics 365, I use Power BI.
The issue with this is that Power BI is quite slow at processing large data volumes. I carry out a number of transform steps (e.g. merge, join, deleting or renaming columns, etc.), so when I try to run a query in Power BI with said steps, it takes a long time to complete.
So, as a solution, I decided to make use of Azure Data Factory(ADF). The plan is to use ADF to pull the data from CRM (i.e. Dynamics 365), perform transformations and publish the data. Then I'll use Power BI for visual analytics.
My question is:
What azure service will I need in addition to Data Factory? Will I need to store the data I pulled from CRM somewhere - like Azure Data Lake or Blob storage? Or can I do the transformation on the fly, right after the data is ingested?
Initially, I thought I could use the 'copy' activity to ingest data from CRM and start playing with the data. But the copy activity requires a sink (a destination for the data, which has to be storage of some sort).
I also thought I could make use of the 'lookup' activity. I tried to use it, but I am getting errors (no exception message is produced).
I have scoured the internet for a similar process (i.e. Dynamics 365 -> Data Factory -> Power BI), but I've not been able to find any.
Most of the processes I've seen, however, utilise some sort of data storage right after data ingestion.
All response welcome. Even if you believe I am going about this the wrong way.
Thanks.
A few things here:
The copy activity just moves data from a source to a sink. It doesn't modify it on the fly.
The lookup activity just retrieves some values to use later in the same pipeline.
ADF cannot publish a dataset to Power BI (although it may be able to push to a streaming dataset).
Your approach is correct, but you do need that last step of transforming the data. You have a lot of options here, but since you are already familiar with Power BI you can use Wrangling Dataflows, which let you take a file from the data lake, apply some Power Query transformations, and save a new file to the lake. You can also use Mapping Dataflows, Databricks, or any other data transformation tool.
Lastly, you can pull files from the data lake with Power BI to build your report on the data in this new file.
Of course, as always in Azure, there are many ways to solve a problem or architect a service; this is the one I consider simplest for you.
Hope this helped!
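To make the first hop concrete, a copy activity pulling from Dynamics 365 into a data lake file might be sketched as below. The dataset names are placeholders and the FetchXML query is purely illustrative:

```json
{
  "name": "CopyDynamicsToLake",
  "type": "Copy",
  "inputs": [ { "referenceName": "Dynamics365AccountsDataset", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "LakeParquetDataset", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": {
      "type": "DynamicsSource",
      "query": "<fetch><entity name='account'><all-attributes /></entity></fetch>"
    },
    "sink": { "type": "ParquetSink" }
  }
}
```

The output dataset would point at an Azure Data Lake Storage (or Blob) container, which is the storage layer the answer above says you need between ingestion and transformation.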

Parallelisms in Azure Data factory v2 copy activity

We are implementing a solution to achieve functionality similar to SSIS packages for copying data from one database to another (on-premises to Azure SQL). In SSIS we have options to set up parallel processing in different ways, and we can also transfer data in chunks.
Similarly, what is the best way to achieve parallelism in Azure Data Factory version 2? Please consider a scenario of transferring data from only one table.
Have a look at the Copy Activity Performance and Tuning Guide for ways to optimize transferring data into the Cloud with ADF: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance
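Two of the knobs that guide covers, `parallelCopies` and `dataIntegrationUnits`, plus source-side partitioning for a single table, can be sketched on a copy activity roughly like this (property names follow the copy activity JSON schema; the table column and range bounds are made-up placeholders):

```json
{
  "name": "CopyOneTableInParallel",
  "type": "Copy",
  "typeProperties": {
    "source": {
      "type": "SqlServerSource",
      "partitionOption": "DynamicRange",
      "partitionSettings": {
        "partitionColumnName": "OrderId",
        "partitionLowerBound": "1",
        "partitionUpperBound": "1000000"
      }
    },
    "sink": { "type": "AzureSqlSink", "writeBatchSize": 10000 },
    "parallelCopies": 8,
    "dataIntegrationUnits": 16
  }
}
```

`partitionOption` splits reads of the single table into ranges (roughly what chunked transfers do in SSIS), `parallelCopies` controls concurrent copy threads, and `dataIntegrationUnits` scales the compute behind the copy.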

ETL tool or ad-hoc solutions?

I'm designing a data warehouse system with two origin data sources: files (hexadecimal format, record structure known) and a PostgreSQL database.
The ETL phase has to read the content of the two sources (files and DB) and combine/integrate/clean them, then load the data into the DW.
For this purpose, is it better to use a tool (for example, Talend) or an ad-hoc solution (writing custom routines in a programming language)?
I would suggest you use the bulk loader to get your flat file into the DB. This allows you to customize the loading rules and then process/cleanse the resulting data set using regular SQL (no other custom code to write).