Azure Data Factory: Different compute environments

There are a couple of compute environments that can do transformations for me. I have a REST source from which I am getting responses every day, and I have to perform some transformations.
https://learn.microsoft.com/en-us/azure/data-factory/compute-linked-services
I am confused as to what the best way to do this would be. In other words, what is the difference between all the compute environments, i.e. when should I use Azure Batch, stored procedures, HDInsight, etc.?

It really depends on where you have the data. If you are storing the data in a data lake, you won't use a stored procedure. If you are storing the data in Azure SQL, you won't use Data Lake Analytics.
Basically it's like this:
Data Lake -> Data Lake Analytics with U-SQL
Azure SQL (Data Warehouse or just SQL Database) -> stored procedure
HDInsight Hadoop -> Pig, Hive, etc.
None of the above -> custom activity with Azure Batch (see the sketch below)
Hope this helped!
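For the "none of the above" case, a custom activity is just a script or executable that ADF schedules onto an Azure Batch pool. Below is a minimal Python sketch of what such a script could look like for a daily REST source; the endpoint URL, container name, and environment variable are hypothetical placeholders, not anything prescribed by ADF.

```python
# Minimal sketch of a script an ADF Custom Activity could run on an Azure Batch
# pool: pull a daily REST response and land it in Blob Storage.
# The endpoint, container, and environment variable names are placeholders.
import os
import json
from datetime import date

import requests
from azure.storage.blob import BlobServiceClient

API_URL = "https://api.example.com/daily-report"   # hypothetical REST source
CONTAINER = "raw"                                   # hypothetical landing container

def main() -> None:
    # Fetch today's payload from the REST source.
    response = requests.get(API_URL, timeout=60)
    response.raise_for_status()

    # Land the raw JSON in the blob container, partitioned by date.
    blob_service = BlobServiceClient.from_connection_string(
        os.environ["STORAGE_CONNECTION_STRING"]
    )
    blob_name = f"rest-source/{date.today().isoformat()}.json"
    blob_client = blob_service.get_blob_client(CONTAINER, blob_name)
    blob_client.upload_blob(json.dumps(response.json()), overwrite=True)

if __name__ == "__main__":
    main()
```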

Related

Is Azure Data Factory ELT?

I have a question about Azure Data Factory (ADF). I have read (and heard) contradictory information about ADF being ETL or ELT. So, is ADF ETL, or is it ELT? To my knowledge, ELT uses the transformation (compute?) engine of the target (whereas ETL uses a dedicated transformation engine). To my knowledge, ADF uses Databricks under the hood, which is really just an on-demand Spark cluster. That Spark cluster is separate from the target, so that would mean that ADF is ETL. But I'm not confident about this.
Good question.
It all depends on what you use and how you use it.
If it is strictly a copy activity, then it is ELT.
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview
The transform can be a stored procedure (the RDBMS does not matter) and the source/destination are tables. If the landing zone is a data lake, then you want to call a Databricks or Synapse notebook. In that case the source is a file, and the target is probably a Delta table. Most people love SQL, and Delta tables give you those ACID properties.
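As a rough illustration of that landing-zone pattern, here is a minimal PySpark sketch of the kind of notebook cell a Databricks or Synapse activity could call; the paths, column names, and table name are placeholders, not values from the original answer.

```python
# Read the landed file, apply a transform, and write a Delta table.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Source: a raw file dropped in the lake by the copy activity (placeholder path).
raw = spark.read.json("abfss://raw@mylake.dfs.core.windows.net/rest-source/2024-01-01.json")

# Transform: whatever cleansing/reshaping the pipeline needs ("id" is a placeholder key).
clean = (raw
         .withColumn("load_date", F.current_date())
         .dropDuplicates(["id"]))

# Target: a Delta table, which gives you the ACID properties mentioned above.
(clean.write
      .format("delta")
      .mode("append")
      .saveAsTable("curated.daily_report"))
```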
Now, if you are using a mapping or wrangling data flow, then it is ETL, if the pattern is pure. Of course, you can mix and match. Both of these data flows use a Spark engine, and it costs money to have big Spark clusters running.
https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-overview
https://learn.microsoft.com/en-us/azure/data-factory/wrangling-tutorial
Here is an article from MSDN.
https://learn.microsoft.com/en-us/azure/architecture/data-guide/relational-data/etl
It covers both older (SSIS) and newer (Synapse) technologies.

Where does using the Azure Data Factory Mapping Data Flow make sense?

My assumptions about where MDF might be the right fit are as follows:
MDF can be used as a data wrangling tool by end users
MDF is better suited for SQL Server-based data warehouse architectures, to load the data into staging or a data lake in a clean format (prepare the data before loading it into the SQL Server DWH, and then use a proper ETL tool to do transformations)
If MDF has to be used for light ELT / ETL tasks directly on a data lake or DWH, it needs customization for complex transformations...
My question would be:
A) Did anyone use Mapping Data Flow in production for option 2 and 3 above?
B) If assumption 3 is valid, would you suggest going for Spark-based transformations or an ETL tool, rather than patching MDF with customizations, as new versions might not be compatible, etc.?
I disagree with most of your assumptions. Data Flow is part of a larger ETL environment, either Data Factory (ADF) or Azure Synapse Pipelines, and you really can't separate it from its host. Data Flow is a UI code generator that executes at runtime as a Spark job. If your end user is a data engineer, then yes, Data Flow is a good tool for them.
ADF is a great service for orchestrating data operations. ADF supports all the things you mentioned (SSIS, Notebooks, Stored Procedures, and many more). It also supports Data Flow, which is absolutely a "proper" tool for transformations and has a very rich feature set. In fact, if you are NOT doing transformations, Data Flow is likely overkill for your solution.

Azure Table Storage Sink in ADF Data Flow

Here is what my ADF pipeline looks like: in the Data Flow, I read some data from a source, perform a filter & join, and store the data to a sink. My plan was to use Azure Table Storage as the sink. However, according to https://github.com/MicrosoftDocs/azure-docs/issues/34981, ADF Data Flow does not support Azure Table Storage as a sink. Is there an alternative way to use Azure Table Storage as the sink in a Data Flow?
No, it is not possible. Azure Table Storage cannot be the sink of a data flow.
Only six dataset types are allowed as a data flow sink; see the supported sink connectors in the doc linked below.
Those are not the only limits: when used as the sink of a data flow, Azure Blob Storage and Azure Data Lake Storage Gen1 & Gen2 only support four formats: JSON, Avro, Text, and Parquet.
At least for now, your idea is not a viable solution.
For more information, have a look at this official doc:
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-sink#supported-sink-connectors-in-mapping-data-flow
Even today it isn't possible. One option (we are currently solving a similar case) could be to use Blob Storage as a temporary destination.
The data flow stores the result in Blob Storage. The source data is processed by all the different transformations in the data flow and prepared for Table Storage, i.e. PartitionKey, RowKey, and all other columns are there.
A subsequent copy activity then moves the data from Blob Storage into Table Storage easily.
In our pipeline, two activities do exactly this:
'Full Orders' runs the data flow.
The 'to Table Storage' copy activity moves the data into Table Storage.
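Purely for illustration, this is roughly what that final blob-to-table step does, written as a small Python script using the azure-storage-blob and azure-data-tables SDKs. In ADF the copy activity handles this without any code; the container, blob path, and table names below are made up.

```python
# Read the rows the data flow wrote to Blob Storage (CSV here) and upsert them
# into Table Storage using their PartitionKey/RowKey columns.
# Container, blob, and table names are placeholders.
import csv
import io
import os

from azure.storage.blob import BlobServiceClient
from azure.data.tables import TableServiceClient

conn = os.environ["STORAGE_CONNECTION_STRING"]

# Download the data flow output from the temporary blob destination.
blob = (BlobServiceClient.from_connection_string(conn)
        .get_blob_client("staging", "full-orders/part-00000.csv"))
rows = csv.DictReader(io.StringIO(blob.download_blob().readall().decode("utf-8")))

# Upsert each row into the target table; PartitionKey and RowKey must already
# be present as columns, exactly as prepared in the data flow.
table = (TableServiceClient.from_connection_string(conn)
         .get_table_client("Orders"))
for row in rows:
    table.upsert_entity(entity=row)
```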

Can I force flush a Databricks Delta table, so the disk copy has latest/consistent data?

I am accessing Databricks Delta tables from Azure Data Factory, which does not have a native connector to Databricks tables. So, as a workaround, I create the tables with the LOCATION keyword to store them in Azure Data Lake. Then, since I know the table file location, I just read the underlying Parquet files from Data Factory. This works fine.
But... what if there is cached information in the Delta transaction log that has not yet been written to disk? Say, an application updated a row in the table, and the disk does not yet reflect this fact. Then my read from Data Factory will be wrong.
So, two questions...
Could this happen? Are changes held in the log for a while before being written out?
Can I force a transaction log flush, so I know the disk copy is updated?
Azure Data Factory now has built-in Delta Lake support (this was not the case at the time the question was raised).
Delta is available as an inline dataset in an Azure Data Factory data flow activity. To get column metadata, click the Import schema button in the Projection tab. This allows you to reference the column names and data types defined in the Delta table.
ADF supports Delta Lake format as of July 2020:
https://techcommunity.microsoft.com/t5/azure-data-factory/adf-adds-connectors-for-delta-lake-and-excel/ba-p/1515793
The Microsoft Azure Data Factory team is enabling .. and a data flow connector for data transformation using Delta Lake
Delta is currently available in ADF as a public preview in data flows as an inline dataset.
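On the original concern about reading the underlying Parquet files directly: Delta writes its data files to storage and then commits them to the _delta_log, so the risk is less that committed data is "held back" and more that a raw Parquet read can see uncommitted or logically deleted files. A minimal sketch of the difference, assuming a placeholder table path, might look like this in a notebook:

```python
# Compare reading a Delta table through its transaction log vs. reading the
# raw Parquet files underneath it. The path below is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "abfss://lake@mylake.dfs.core.windows.net/tables/my_delta_table"

# Consistent view: the _delta_log decides which Parquet files are current.
current = spark.read.format("delta").load(path)

# Raw view: every Parquet data file under the folder, regardless of commit status.
raw_files = spark.read.format("parquet").load(path)

print(current.count(), raw_files.count())  # raw_files may include stale rows
```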

Parallelism in Azure Data Factory v2 copy activity

We are implementing a solution to achieve functionality similar to SSIS packages, copying data from one database to another (on-premises to Azure SQL). In SSIS we have options to set up parallel processing in different ways, and we can also transfer data in chunks.
Similarly, what is the best way to achieve parallelism in Azure Data Factory version 2? Please consider the scenario of transferring data for only one table.
Have a look at the Copy Activity Performance and Tuning Guide for ways to optimize transferring data into the Cloud with ADF: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance
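The guide covers the copy activity settings (such as parallel copies, Data Integration Units, and source partition options) that control how a single-table copy is split up. Purely to illustrate the underlying idea of range-based parallelism, here is a hand-rolled Python sketch with hypothetical DSNs, table, and key ranges; in ADF you would configure this on the copy activity rather than write code.

```python
# Conceptual sketch of range-based parallel copy for a single table: split the
# table by a numeric key into chunks and copy the chunks concurrently.
# DSNs, table, columns, and ranges are placeholders.
from concurrent.futures import ThreadPoolExecutor

import pyodbc

SOURCE_DSN = "DSN=onprem_sql"   # hypothetical on-premises source
TARGET_DSN = "DSN=azure_sql"    # hypothetical Azure SQL target
CHUNKS = [(1, 100_000), (100_001, 200_000), (200_001, 300_000)]

def copy_chunk(bounds):
    lo, hi = bounds
    with pyodbc.connect(SOURCE_DSN) as src, pyodbc.connect(TARGET_DSN) as dst:
        # Read only this chunk's key range from the source table.
        rows = src.cursor().execute(
            "SELECT Id, Name FROM dbo.Orders WHERE Id BETWEEN ? AND ?", lo, hi
        ).fetchall()
        # Bulk insert the chunk into the target table.
        cur = dst.cursor()
        cur.fast_executemany = True
        cur.executemany("INSERT INTO dbo.Orders (Id, Name) VALUES (?, ?)",
                        [tuple(r) for r in rows])
        dst.commit()

# Copy the chunks in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(copy_chunk, CHUNKS))
```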