What is the difference between the PolyBase and bulk insert copy methods in Azure Data Factory, and when to use them? - azure-data-factory

Can someone please elaborate on when to use PolyBase versus bulk insert in Azure Data Factory? What are the differences between these two copy methods?

The two options labeled “Polybase” and the “COPY command” are only applicable to Azure Synapse Analytics (formerly Azure SQL Data Warehouse). They are both fast methods of loading which involve staging data in Azure storage (if it’s not already in Azure Storage) and using a fast, highly parallel method of loading to each compute node from storage. Especially on large tables these options are preferred due to their scalability but they do come with some restrictions documented at the link above.
In contrast, on Azure Synapse Analytics a bulk insert is a slower load method which loads data through the control node and is not as highly parallel or performant. It is an order of magnitude slower on large files. But it can be more forgiving in terms of data types and file formatting.
On other Azure SQL databases, bulk insert is the preferred and fast method.
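As a rough illustration of the guidance above, the choice could be sketched as a small helper that builds the sink part of a copy activity. The property names `SqlDWSink`, `AzureSqlSink`, `allowPolyBase` and `allowCopyCommand` follow the copy activity JSON as I understand it; the function name, the `target` convention and the `large_table` flag are hypothetical, so treat this as a sketch and check it against the current documentation.

```python
def build_sink(target: str, large_table: bool, use_copy_command: bool = True) -> dict:
    """Return a sketch of the copy activity 'sink' settings for a target.

    target: "synapse" means Azure Synapse Analytics; anything else is treated
    as another Azure SQL database (a convention made up for this example).
    """
    if target != "synapse":
        # Other Azure SQL databases: bulk insert is the fast, preferred path.
        return {"type": "AzureSqlSink"}

    sink = {"type": "SqlDWSink"}
    if large_table:
        # Staged, highly parallel loads: pick one of the two options.
        if use_copy_command:
            sink["allowCopyCommand"] = True
        else:
            sink["allowPolyBase"] = True
    # With neither flag set, the load falls back to bulk insert, which goes
    # through the control node and is slower on large files.
    return sink


print(build_sink("synapse", large_table=True))
print(build_sink("synapse", large_table=False))
print(build_sink("azure_sql", large_table=True))
```

Where exactly "large" starts is a judgment call; the answer above only says that bulk insert is about an order of magnitude slower on large files.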

Related

Is azure data factory ELT?

I have a question about Azure Data Factory (ADF). I have read (and heard) contradictory info about ADF being ETL or ELT. So, is ADF ETL? Or is it ELT? To my knowledge, ELT uses the transformation (compute?) engine of the target, whereas ETL uses a dedicated transformation engine. To my knowledge, ADF uses Databricks under the hood, which is really just an on-demand Spark cluster. That Spark cluster is separate from the target. So, that would mean that ADF is ETL. But I'm not confident about this.
Good question.
It all depends on what you use and how you use it.
If it is strictly a copy activity, then it is ELT.
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview
The transform can be a stored procedure (the RDBMS does not matter) and the source/destination are tables. If the landing zone is a data lake, then you want to call a Databricks or Synapse notebook. Again, the source is a file. The target is probably a delta table. Most people love SQL, and delta tables give you those ACID properties.
Now, if you are using a mapping or wrangling data flow, then it is ETL, if the pattern is pure. Of course you can mix and match. Both these data flows use a Spark engine. It costs money to have big Spark clusters running.
https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-overview
https://learn.microsoft.com/en-us/azure/data-factory/wrangling-tutorial
Here is an article from MSDN.
https://learn.microsoft.com/en-us/azure/architecture/data-guide/relational-data/etl
It covers both old (SSIS) and new (Synapse) technologies.
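To make the distinction concrete, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for the target. It is not ADF code; the table names and sample rows are made up. In the ELT variant the raw data is copied first and the target's own SQL engine does the transform (like a stored procedure after a copy activity); in the ETL variant a separate engine (plain Python here, Spark in a data flow) transforms the data before it is loaded.

```python
import sqlite3

RAW_ROWS = [("alice", "10"), ("bob", "25"), ("carol", "7")]  # made-up sample data


def run_elt(rows):
    """ELT: load raw data first, then transform with the target's own engine."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE staging (name TEXT, amount TEXT)")
    db.executemany("INSERT INTO staging VALUES (?, ?)", rows)  # plain copy
    # The transform runs inside the target database.
    db.execute(
        "CREATE TABLE final AS "
        "SELECT UPPER(name) AS name, CAST(amount AS INTEGER) AS amount FROM staging"
    )
    return db.execute("SELECT name, amount FROM final ORDER BY name").fetchall()


def run_etl(rows):
    """ETL: transform in a separate engine (plain Python here), then load."""
    transformed = [(name.upper(), int(amount)) for name, amount in rows]
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE final (name TEXT, amount INTEGER)")
    db.executemany("INSERT INTO final VALUES (?, ?)", transformed)
    return db.execute("SELECT name, amount FROM final ORDER BY name").fetchall()


print(run_elt(RAW_ROWS))
print(run_etl(RAW_ROWS))
```

Both functions end with the same rows in `final`; the only difference is where the transformation work is done, which is the point of the ETL/ELT distinction.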

GCP Dataflow vs Cloud Functions to automate scraping output and file-on-cloud merge into JSON format to insert in DB

I have two sources:
1) A CSV that will be uploaded to a cloud storage service, probably GCP Cloud Storage.
2) The output of a scraping process done with Python.
When a user updates 1) (the cloud-stored file), an event should be triggered to execute 2) (the scraping process), and then some transformation should take place in order to merge these two sources into one in JSON format. Finally, the content of this JSON file should be stored in a DB that is easy to access and low cost. The files the user will update are at most 5 MB and the updates will take place once a week.
From what I've read, I can use GCP Cloud Functions to accomplish this whole process or I can use Dataflow too. I've even considered using both. I've also thought of using MongoDB to store the JSON objects of the two sources final merge.
Why should I use Cloud Functions, Dataflow or both? What are your thoughts on the DB? I'm open to different approaches. Thanks.
Regarding the use of Cloud Functions and Dataflow: in your case I would go for Cloud Functions, as you don't have a big volume of data. Dataflow is more complex and more expensive, and you would have to use Apache Beam. If you are comfortable with Python, and taking your scenario into consideration, I would choose Cloud Functions. Easy, convenient...
To trigger a Cloud Function when a Cloud Storage object is updated, you will have to configure the trigger. Pretty easy.
https://cloud.google.com/functions/docs/calling/storage
Regarding the DB: MongoDB is a good option, but if you want something quick and inexpensive, consider Datastore.
As a managed service, it will make your life easy with a lot of native integrations. It also has a very interesting free tier.
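A minimal sketch of what such a function could look like, under some loud assumptions: the CSV and the scraped data share an `id` column, and `read_file`, `scrape` and `save` are hypothetical callables standing in for the Cloud Storage download, the scraping process and the database write. A deployed storage-triggered function would receive only the event and context arguments and would call the client libraries directly; the extra parameters are here so the sketch runs locally.

```python
import csv
import io
import json


def merge_sources(csv_text: str, scraped: dict) -> str:
    """Merge uploaded CSV rows with scraped records into one JSON document.

    `scraped` maps an id to extra fields; the "id" join column is an
    assumption made for this example.
    """
    merged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        extra = scraped.get(row["id"], {})
        merged.append({**row, **extra})
    return json.dumps(merged)


def on_csv_uploaded(event, context, read_file, scrape, save):
    """Handler sketch: `event` carries the bucket and object name."""
    csv_text = read_file(event["bucket"], event["name"])
    save(merge_sources(csv_text, scrape()))


# Local dry run with stand-in callables.
stored = []
on_csv_uploaded(
    {"bucket": "demo-bucket", "name": "input.csv"},
    None,
    read_file=lambda bucket, name: "id,name\n1,foo\n2,bar\n",
    scrape=lambda: {"1": {"price": 9.5}},
    save=stored.append,
)
print(stored[0])
```

With files of at most 5 MB once a week, everything fits comfortably in memory inside a single function call, which is why no pipeline framework is needed here.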

Where does it make sense to use Azure Data Factory Mapping Data Flow?

My assumptions about where MDF might be the right fit are as follows:
1) MDF can be used as a Data Wrangling Tool by end-users
2) MDF is better suited for SQL Server-based Datawarehouse architectures to load the data into staging or data lake in clean format (prepare the data before loading it to SQL Server DWH and then use a proper ETL tool to do transformations)
3) If MDF has to be used for light ELT / ETL tasks directly on Data Lake or DWH, it needs customization for complex transformations...
My question would be:
A) Has anyone used Mapping Data Flow in production for assumptions 2 and 3 above?
B) If assumption 3 is valid, would you suggest going for a Spark-based transformation or an ETL tool rather than patching MDF with customizations, since new versions might not be compatible with them, etc.?
I disagree with most of your assumptions. Data Flow is part of a larger ETL environment, either Data Factory (ADF) or Azure Synapse Pipelines, and you really can't separate it from its host. Data Flow is a UI code generator that executes at runtime as a Spark job. If your end user is a data engineer, then yes, Data Flow is a good tool for them.
ADF is a great service for orchestrating data operations. ADF supports all the things you mentioned (SSIS, Notebooks, Stored Procedures, and many more). It also supports Data Flow, which is absolutely a "proper" tool for transformations and has a very rich feature set. In fact, if you are NOT doing transformations, Data Flow is likely overkill for your solution.

Why does Snowflake recommend creating an external stage rather than loading it directly from a bucket?

The Snowflake documentation about bulk loading from AWS S3 says:
"You can load directly from the bucket, but Snowflake recommends creating an external stage that references the bucket and using the external stage instead."
So my first question is:
Why does Snowflake recommend creating an external stage rather than loading it directly from a bucket?
Is there a reason for this? Or if you have any documentation explaining why, please let me know. :)
And my second question is:
In the architecture diagram of Bulk Loading from a Local File System, there are arrows(➡) from data files to stage, but in the case of Bulk Loading from Amazon S3, there are no arrows from Data Files to external stage. What is the difference between with and without arrows?
Bulk Loading from Amazon S3:
https://docs.snowflake.com/en/user-guide/data-load-s3.html
Bulk Loading from a Local File System:
https://docs.snowflake.com/en/user-guide/data-load-local-file-system.html
The stage holds all the permissions for the bucket, so a security role can create it and deal with the AWS tokens, and then grant access to the stage for reads/writes to other roles. This separates the two tasks of loading data and securing data.
It also allows the stage's tokens to be changed or updated without impacting the code or users that rely on it, or even a switch to the method (the name escapes me) where a dynamic key exchange happens, so key rotation with S3/AWS is all automatic. That is how we do it; in fact, we have many stages for different sources of data, and the security aspects of business policies do not need to be known or handled by the data engineers who build the ETL code.
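The separation described above can be sketched by comparing the statements each party would write. The helpers below only build SQL strings, following Snowflake's `CREATE STAGE`, `GRANT` and `COPY INTO` syntax as I understand it; the table, stage, role and bucket names are hypothetical and the keys are placeholders.

```python
def direct_copy(table: str, url: str, key_id: str, secret: str) -> str:
    """COPY straight from the bucket: every load statement carries the keys."""
    return (
        f"COPY INTO {table} FROM '{url}' "
        f"CREDENTIALS = (AWS_KEY_ID = '{key_id}' AWS_SECRET_KEY = '{secret}')"
    )


def create_stage(stage: str, url: str, key_id: str, secret: str) -> str:
    """Run once by the security role; the keys live only in the stage."""
    return (
        f"CREATE OR REPLACE STAGE {stage} URL = '{url}' "
        f"CREDENTIALS = (AWS_KEY_ID = '{key_id}' AWS_SECRET_KEY = '{secret}')"
    )


def grant_stage(stage: str, role: str) -> str:
    """Let a loader role use the stage without ever seeing the keys."""
    return f"GRANT USAGE ON STAGE {stage} TO ROLE {role}"


def staged_copy(table: str, stage: str) -> str:
    """What the data engineer writes: no bucket URL, no keys."""
    return f"COPY INTO {table} FROM @{stage}"


URL = "s3://example-bucket/sales/"  # hypothetical bucket
print(direct_copy("sales", URL, "<key-id>", "<secret>"))
print(create_stage("sales_stage", URL, "<key-id>", "<secret>"))
print(grant_stage("sales_stage", "etl_loader"))
print(staged_copy("sales", "sales_stage"))
```

When the keys change, only the stage definition has to be updated; the `COPY INTO ... FROM @sales_stage` statements in the ETL code stay exactly as they are.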

Parallelism in Azure Data Factory v2 copy activity

We are implementing a solution to achieve functionality similar to that of SSIS packages, copying data from one database to another (on-premises to Azure SQL). In SSIS we have the option to set up parallel processing in different ways. We can also transfer data in chunks.
Similarly, what is the best way to achieve parallelism in Azure Data Factory version 2? Please consider the scenario of transferring data for only one table.
Have a look at the Copy Activity Performance and Tuning Guide for ways to optimize transferring data into the Cloud with ADF: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance
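For a single table, the usual idea is to split the table on a numeric key into ranges and copy the ranges in parallel, much like chunking in SSIS. The copy activity has a `parallelCopies` setting, and as far as I know SQL sources also offer range-based partition options that do the splitting for you; the guide has the details. The sketch below only illustrates the splitting step itself; the function names, table and key column are hypothetical, and it is not ADF's own implementation.

```python
def split_range(lower: int, upper: int, parts: int) -> list[tuple[int, int]]:
    """Split the inclusive key range [lower, upper] into up to `parts` chunks."""
    if parts < 1 or upper < lower:
        raise ValueError("need parts >= 1 and upper >= lower")
    total = upper - lower + 1
    parts = min(parts, total)  # never create empty chunks
    base, extra = divmod(total, parts)
    ranges, start = [], lower
    for i in range(parts):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def chunk_queries(table: str, key: str, lower: int, upper: int, parts: int) -> list[str]:
    """One source query per chunk; each could feed its own copy run."""
    return [
        f"SELECT * FROM {table} WHERE {key} BETWEEN {lo} AND {hi}"
        for lo, hi in split_range(lower, upper, parts)
    ]


for query in chunk_queries("dbo.Orders", "OrderId", 1, 10, 3):
    print(query)
```

This works best when the key is evenly distributed; with a skewed key some chunks will carry far more rows than others and the slowest chunk decides the total run time.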