Is azure data factory ELT? - azure-data-factory

I have a question about Azure Data Factory (ADF). I have read (and heard) contradictory information about ADF being ETL or ELT. So, is ADF ETL? Or is it ELT? To my knowledge, ELT uses the transformation (compute?) engine of the target (whereas ETL uses a dedicated transformation engine). To my knowledge, ADF uses Databricks under the hood, which is really just an on-demand Spark cluster. That Spark cluster is separate from the target. So, that would mean that ADF is ETL. But I'm not confident about this.

Good question.
It all depends on what you use and how you use it.
If it is strictly a copy activity, then it is ELT.
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview
The transform can be a stored procedure (the RDBMS does not matter) and the source/destination are tables. If the landing zone is a data lake, then you want to call a Databricks or Synapse notebook. In that case the source is a file and the target is probably a Delta table. Most people love SQL, and Delta tables give you those ACID properties.
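For illustration, here is a minimal PySpark sketch of that pattern, assuming hypothetical lake paths and a Databricks/Synapse notebook where a SparkSession named `spark` already exists; the copy activity has already extracted and loaded the raw files into the landing zone, and the notebook does the transform with the lake's own compute.

```python
from pyspark.sql.functions import col

# Hypothetical paths; the copy activity (E + L) has already landed raw CSVs here.
landing_path = "abfss://landing@mylake.dfs.core.windows.net/sales/"
curated_path = "abfss://curated@mylake.dfs.core.windows.net/sales_delta/"

# T: the transform runs on the lake's compute, after the data has been loaded.
raw = spark.read.option("header", "true").csv(landing_path)
clean = (raw.dropDuplicates(["order_id"])
            .withColumn("order_date", col("order_date").cast("date")))

# Write a Delta table so downstream SQL gets those ACID properties.
clean.write.format("delta").mode("overwrite").save(curated_path)
```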
Now, if you are using a mapping or wrangling data flow, then it is ETL, assuming the pattern is pure. Of course, you can mix and match. Both of these data flows use a Spark engine, and it costs money to keep big Spark clusters running.
https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-overview
https://learn.microsoft.com/en-us/azure/data-factory/wrangling-tutorial
Here is an article from MSDN.
https://learn.microsoft.com/en-us/azure/architecture/data-guide/relational-data/etl
It covers both old (SSIS) and new (Synapse) technologies.

Related

GCP Dataflow vs Cloud Functions to automate scraping output and file-on-cloud merge into JSON format to insert in DB

I have two sources:
A csv that will be uploaded to a cloud storage service, probably GCP Cloud Storage.
The output of a scraping process done with Python.
When a user updates 1) (the cloud-stored file), an event should be triggered to execute 2) (the scraping process), and then some transformation should take place to merge these two sources into one in JSON format. Finally, the content of this JSON file should be stored in a DB that is easy to access and low cost. The files the user will update are at most 5 MB and the updates will take place once a week.
From what I've read, I can use GCP Cloud Functions to accomplish this whole process, or I can use Dataflow. I've even considered using both. I've also thought of using MongoDB to store the JSON objects resulting from the final merge of the two sources.
Why should I use Cloud Functions, Dataflow or both? What are your thoughts on the DB? I'm open to different approaches. Thanks.
Regarding the use of Cloud Functions and Dataflow: in your case I would go for Cloud Functions, as you don't have a big volume of data. Dataflow is more complex, more expensive, and you would have to use Apache Beam. If you are comfortable with Python, and taking your scenario into consideration, I would choose Cloud Functions. Easy and convenient.
To trigger a Cloud Function when a Cloud Storage object is updated, you just have to configure the trigger. Pretty easy.
https://cloud.google.com/functions/docs/calling/storage
Regarding the DB: MongoDB is a good option, but if you want something quick and inexpensive, consider Datastore.
As a managed service it will make your life easier, with a lot of native integrations. It also has a very attractive free tier.
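As a rough illustration of the Cloud Functions route, here is a minimal sketch, assuming hypothetical bucket and entity-kind names and a placeholder for the scraping/merge step; it uses a CloudEvent-triggered function fired on the Cloud Storage "object finalized" event (configured at deployment, per the linked docs) and writes the merged records to Datastore.

```python
import csv
import io
import json

import functions_framework
from google.cloud import datastore, storage


def merge_with_scrape_output(rows):
    # Placeholder for the scraping + merge step described in the question.
    return [{"csv_row": row, "scraped": None} for row in rows]


@functions_framework.cloud_event
def on_csv_upload(cloud_event):
    # Fired by the Cloud Storage "object finalized" event on the bucket.
    bucket_name = cloud_event.data["bucket"]
    blob_name = cloud_event.data["name"]

    # The file is at most ~5 MB, so downloading it into memory is fine.
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    rows = list(csv.DictReader(io.StringIO(blob.download_as_text())))

    merged = merge_with_scrape_output(rows)

    # Store each merged record as a Datastore entity (hypothetical kind name).
    client = datastore.Client()
    for record in merged:
        entity = datastore.Entity(key=client.key("merged_record"))
        entity["payload"] = json.dumps(record)
        client.put(entity)
```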

Where does using the Azure Data Factory Mapping Data Flow make sense?

My assumptions about where MDF might be the right fit are as follows:
MDF can be used as a Data Wrangling Tool by end-users
MDF is better suited for SQL Server-based data warehouse architectures, to load the data into staging or a data lake in clean format (prepare the data before loading it into the SQL Server DWH, and then use a proper ETL tool to do the transformations)
If MDF has to be used for light ELT / ETL tasks directly on Data Lake or DWH, it needs customization for complex transformations...
My question would be:
A) Did anyone use Mapping Data Flow in production for options 2 and 3 above?
B) If assumption 3 is valid, would you suggest going for Spark-based transformations or an ETL tool, rather than patching MDF with customizations, since new versions might not be compatible, etc.?
I disagree with most of your assumptions. Data Flow is part of a larger ETL environment, either Data Factory (ADF) or Azure Synapse Pipelines, and you really can't separate it from its host. Data Flow is a UI code generator that executes at runtime as a Spark job. If your end user is a data engineer, then yes, Data Flow is a good tool for them.
ADF is a great service for orchestrating data operations. ADF supports all the things you mentioned (SSIS, Notebooks, Stored Procedures, and many more). It also supports Data Flow, which is absolutely a "proper" tool for transformations and has a very rich feature set. In fact, if you are NOT doing transformations, Data Flow is likely overkill for your solution.

What is the difference between the PolyBase, bulk insert, and COPY methods in Azure Data Factory, and when should each be used?

Can someone please elaborate on when to use PolyBase versus bulk insert in Azure Data Factory? What are the differences between these copy methods?
The two options labeled “PolyBase” and “COPY command” are only applicable to Azure Synapse Analytics (formerly Azure SQL Data Warehouse). They are both fast loading methods that involve staging data in Azure Storage (if it’s not already there) and then using a highly parallel load from storage into each compute node. Especially on large tables these options are preferred due to their scalability, but they do come with some documented restrictions.
In contrast, on Azure Synapse Analytics a bulk insert is a slower load method which loads data through the control node and is not as highly parallel or performant. It is an order of magnitude slower on large files. But it can be more forgiving in terms of data types and file formatting.
On other Azure SQL databases, bulk insert is the preferred and fast method.
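To make the difference concrete, here is a hedged sketch of what the fast path amounts to under the hood: a COPY statement run against a Synapse dedicated pool that loads staged files straight from Azure Storage in parallel, which is essentially what the PolyBase/COPY copy methods do for you (in ADF itself these are just settings on the copy activity sink). The connection string, storage account, container, and table names below are hypothetical.

```python
import pyodbc

# Hypothetical Synapse dedicated pool connection (autocommit avoids wrapping
# the load in an explicit transaction).
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myworkspace.sql.azuresynapse.net;DATABASE=mydwh;"
    "UID=loader;PWD=<secret>",
    autocommit=True,
)

# COPY loads files from Azure Storage into the compute nodes in parallel,
# bypassing the row-by-row path through the control node that bulk insert uses.
copy_sql = """
COPY INTO dbo.sales_staging
FROM 'https://mylakeaccount.blob.core.windows.net/landing/sales/*.parquet'
WITH (
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
)
"""

cursor = conn.cursor()
cursor.execute(copy_sql)
cursor.close()
conn.close()
```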

PySpark - Perform Merge in Synapse using Databricks Spark

We are facing a tricky situation while performing ACID operations using Databricks Spark.
We want to perform an UPSERT on an Azure Synapse table over a JDBC connection using PySpark. We are aware of the save modes Spark provides for writing data; only APPEND and OVERWRITE are useful in our case. Based on these two modes we came up with the options below:
Write the whole dataframe into a stage table, and use this stage table to perform a MERGE operation (~ UPSERT) with the final table. The stage table will be truncated/dropped after that.
Bring the target table data into Spark as well. Inside Spark, perform the MERGE using Delta Lake and generate a final dataframe. This dataframe will be written back to the target table in OVERWRITE mode.
Considering the downsides:
In option 1, we have to use two tables just to write the final data. And if both the stage and target tables are big, performing the MERGE operation inside Synapse is another herculean task and may take time.
In option 2, we have to bring the target table into Spark memory. Even though network IO is not much of a concern, as both Databricks and Synapse will be in the same Azure AZ, it may lead to memory issues on the Spark side.
Are there any other feasible options? Or any recommendations?
The answer depends on many factors not listed in your question. It's a very open-ended question.
(Given the way your question is phrased, I'm assuming you're using Dedicated SQL Pools and not on-demand/serverless Synapse.)
Here are some thoughts:
The MERGE will use Synapse's compute in option 1 and the Spark cluster's compute in option 2. Compare the cost.
Pick the lower cost.
Reads and writes between Spark and Synapse through their connector use the Data Lake as a staging area. I.e., while reading a table from Synapse into a dataframe in Spark, the connector first makes Synapse export the data to the Data Lake (as Parquet, IIRC) and then reads the files in the Data Lake to create the dataframe (a sketch of this follows after the next point). This scales nicely if you're talking about tens of millions or billions of rows, but the staging step could become a performance overhead if row counts are low (tens to hundreds of thousands).
Test and pick the faster one.
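A rough sketch of that staged read/write, using the Azure Synapse connector available on Databricks (format "com.databricks.spark.sqldw"); the JDBC URL, storage container, and table names are hypothetical, and `updates_df` stands for the incoming dataframe from the question.

```python
# Hypothetical connection details for a Synapse dedicated SQL pool.
synapse_url = ("jdbc:sqlserver://myworkspace.sql.azuresynapse.net:1433;"
               "database=mydwh;encrypt=true")
temp_dir = "abfss://staging@mylake.dfs.core.windows.net/synapse-tmp/"

# Read: the connector has Synapse export the table to temp_dir first,
# then builds the dataframe from those staged files.
target_df = (spark.read
             .format("com.databricks.spark.sqldw")
             .option("url", synapse_url)
             .option("tempDir", temp_dir)
             .option("forwardSparkAzureStorageCredentials", "true")
             .option("dbTable", "dbo.target_table")
             .load())

# Write: the dataframe is staged as files in temp_dir and then loaded into Synapse.
(updates_df.write
    .format("com.databricks.spark.sqldw")
    .option("url", synapse_url)
    .option("tempDir", temp_dir)
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "dbo.stage_table")
    .mode("overwrite")
    .save())
```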
Remember that Synapse is not like a traditional MySQL or SQL-Server. It's an MPP DB.
"performing MERGE operation inside Synapse is another herculean task and May take time" is a wrong statement. It scales just like a Spark cluster.
"It may lead to memory issues on the Spark side": yes and no. On one hand, all the data isn't going to be loaded into a single worker node. On the other hand, yes, you do need enough memory for each node to do its own part.
Although Synapse can be scaled up and down dynamically, I've seen it take up to 40 minutes to complete a scale-up. Databricks, on the other hand, is fully on-demand, and you can probably get away with turning on the cluster, doing the upsert, and shutting the cluster down. With Synapse you'll probably have other clients using it, so you may not be able to shut it down.
So with Synapse you'll either have to live with 40-80 minutes of downtime for each upsert (scale up, upsert, scale down), or pay for a high DWU flat rate all the time, even though your usage is only high when you upsert and is otherwise pretty low.
Lastly, remember that MERGE is in preview at the time of writing. That means no Sev-A support cases/immediate support if something breaks in your prod because you're using MERGE.
You can always use DELETE + INSERT instead. This assumes the delta you receive has all the columns of the target table and not just the updated ones.
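A hedged sketch of that DELETE + INSERT route, reusing the connector from the sketch above: the delta is written to a stage table, and the `postActions` option (a semicolon-separated list of SQL statements the connector runs in Synapse after a successful write) performs the set-based swap. Table and key names are hypothetical.

```python
# SQL the connector will run in Synapse after the stage table is loaded.
post_sql = (
    "DELETE FROM dbo.target_table "
    "WHERE order_id IN (SELECT order_id FROM dbo.stage_table); "
    "INSERT INTO dbo.target_table SELECT * FROM dbo.stage_table"
)

(updates_df.write
    .format("com.databricks.spark.sqldw")
    .option("url", synapse_url)
    .option("tempDir", temp_dir)
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "dbo.stage_table")
    .option("postActions", post_sql)
    .mode("overwrite")
    .save())
```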
Did you try creating a checksum so that you merge/upsert only the rows that have an actual data change?
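One way to implement that checksum idea in PySpark (column names are hypothetical, and `updates_df`/`target_df` are the dataframes from the sketches above): hash the non-key columns and keep only the rows whose hash is new or different before doing the expensive upsert.

```python
from pyspark.sql.functions import sha2, concat_ws

# Hash the non-key columns on both sides (hypothetical column names).
hashed_updates = updates_df.withColumn(
    "row_hash", sha2(concat_ws("||", "customer", "amount", "status"), 256))
hashed_target = target_df.withColumn(
    "row_hash", sha2(concat_ws("||", "customer", "amount", "status"), 256))

# Keep only new rows or rows whose contents actually changed.
changed = (hashed_updates.alias("u")
           .join(hashed_target.alias("t"), "order_id", "left")
           .where("t.row_hash IS NULL OR u.row_hash <> t.row_hash")
           .select("u.*"))
```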

Parallelism in Azure Data Factory v2 copy activity

We are implementing a solution to achieve functionality similar to SSIS packages for copying data from one database to another (on-premises to Azure SQL). In SSIS we have options to set up parallel processing in different ways. We can also transfer data in chunks.
Similarly, what is the best way to achieve parallelism in Azure Data Factory version 2? Please consider the scenario of transferring data for only one table.
Have a look at the Copy Activity Performance and Tuning Guide for ways to optimize transferring data into the Cloud with ADF: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance