Can anyone please tell me how efficient Azure Databricks Jobs are?

For example, if I schedule a Databricks notebook to run via Databricks Jobs versus Azure Data Factory, which one would be more efficient and why?

There are a few cases where Databricks Workflows (formerly Jobs) are more efficient to use than ADF:
- ADF still uses Jobs API 2.0 to submit ephemeral jobs, which doesn't support setting default access control lists.
- If you have a job consisting of several tasks, Databricks offers cluster reuse, which lets the same cluster(s) run multiple subtasks instead of waiting for new clusters to be created, as happens when the subtasks are scheduled from ADF (see the sketch after this list).
- You can share context between subtasks more efficiently when using Databricks Workflows.
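For illustration, here is a minimal sketch of a multi-task job with a shared job cluster submitted through the Jobs API 2.1; the workspace URL, token, notebook paths, and cluster sizing are all placeholders:

```python
import requests

# Placeholders -- replace with your own workspace URL and token.
HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "dapi..."

# Jobs API 2.1 payload: two tasks sharing one job cluster ("cluster
# reuse"), which ADF's Jobs API 2.0 submission cannot express.
job_spec = {
    "name": "example-multitask-job",
    "job_clusters": [
        {
            "job_cluster_key": "shared",
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 2,
            },
        }
    ],
    "tasks": [
        {
            "task_key": "extract",
            "job_cluster_key": "shared",
            "notebook_task": {"notebook_path": "/Jobs/extract"},
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "extract"}],
            "job_cluster_key": "shared",  # reuses the cluster, no re-provisioning wait
            "notebook_task": {"notebook_path": "/Jobs/transform"},
        },
    ],
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json()["job_id"])
```

For the context-sharing point: within a job run, a notebook in one task can call dbutils.jobs.taskValues.set(key, value) and a downstream task can read it back with dbutils.jobs.taskValues.get(taskKey, key, ...).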

Related

Sagemaker Pre-processing/Training Jobs vs ECS

We are considering using SageMaker jobs/ECS as a resource for a few of our ML jobs. Our jobs are based on a custom Dockerfile (no Spark, just basic ML Python libraries), so all that is required is a resource for the container.
I wanted to know: is there any specific advantage to using SageMaker vs. ECS here? Also, as our use case only requires a resource for running a Docker image, would a Processing Job / Training Job serve the same purpose? Thanks!
Yeah, you could make use of either a Training Job or a Processing Job (assuming the ML jobs are for transient training and/or processing).
The benefit of using SageMaker over ECS is that SageMaker manages the infrastructure. The jobs are also transient, so they are shut down after training/processing, while your artifacts are automatically saved to S3.
With SageMaker Training or Processing Jobs, all you need to do is bring your container (sitting in ECR) and kick off the job with a single API call (CreateTrainingJob, CreateProcessingJob).
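For example, a minimal sketch with boto3; the job name, image URI, role ARN, and bucket below are placeholders, not real resources:

```python
import boto3

sm = boto3.client("sagemaker")

# Kick off a transient Training Job from a custom ECR image.
# SageMaker provisions the instance, runs the container, saves
# artifacts to S3, and tears the instance down afterwards.
sm.create_training_job(
    TrainingJobName="my-custom-container-job",
    AlgorithmSpecification={
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-ml-image:latest",
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/artifacts/"},
    ResourceConfig={
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 30,
    },
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)
```

CreateProcessingJob follows the same pattern with a processing container and ProcessingOutputConfig instead.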

How Data Flow computing differs from Databricks

Knowing that ADF Data Flow transformations run on a Databricks cluster in the background, how different (in terms of cost and performance) would it be to run the same transformations in a Databricks notebook in the same pipeline?
I guess it will depend on how we configure the Databricks cluster, but my question is also about how this cluster runs in the background. Would it be a dedicated cluster or a shared one within the platform?
Each activity in ADF is executed by an Integration Runtime (VM). If you are synchronously monitoring a Databricks job, you will be charged for the Integration Runtime that is monitoring it.
Notebook execution in Databricks is charged as a job cluster. Create a pool and use that pool in ADF; in Databricks you will see the history of ADF-created clusters in the pool overview.
When creating the pool, be careful with the settings, as you can be charged for idle time: minimum idle instances can be 0 and the auto-termination time set to a low value. If your pipeline executes notebooks step by step, reusing the same pool can be quicker and cheaper, because Databricks will not deploy a new machine but will use an existing one from the pool (if it wasn't auto-terminated already). A pool configuration along those lines is sketched below.
(Screenshot: ADF-created job clusters in the pool overview, and the pool's minimum idle setting.)
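A hedged sketch of creating such a pool through the Databricks Instance Pools API; the workspace URL, token, and node type are placeholders:

```python
import requests

# Placeholders -- replace with your own workspace URL and token.
HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "dapi..."

pool_spec = {
    "instance_pool_name": "adf-notebook-pool",
    "node_type_id": "Standard_DS3_v2",
    "min_idle_instances": 0,  # don't pay for idle capacity
    "idle_instance_autotermination_minutes": 10,  # low value to limit idle cost
}

resp = requests.post(
    f"{HOST}/api/2.0/instance-pools/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=pool_spec,
)
resp.raise_for_status()
print(resp.json()["instance_pool_id"])
```

Reference this pool id in the linked service / cluster configuration so ADF-triggered notebooks draw machines from the pool instead of provisioning fresh VMs.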

Is it possible in Airflow to run a single task on multiple worker nodes, i.e. to run a task in a distributed way?

I am using Spring Batch to create a workflow of batch jobs. A single batch job takes 2 hrs to complete (data to be processed: ~1 million records), so I decided to run it in a distributed way, where one task is distributed across multiple worker nodes so it can execute in a shorter time. The other jobs in the workflow (all of which run in a distributed manner) need to run sequentially, one after the other. The jobs are multi-node distributed jobs (master/slave architecture) that need to run one after another.
Now, I was considering deploying the workflow on Airflow. While exploring that, I could not find any way to run a single task distributed across multiple machines. Is that possible in Airflow?
Yes, you can create a task using the Spark framework. Spark allows you to process the data on multiple nodes in a distributed fashion.
You can then use the SparkSubmitOperator to add that task to your DAG.
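A minimal sketch, assuming the apache-airflow-providers-apache-spark package is installed and a spark_default connection points at your cluster; the application path and resource sizes are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# One Airflow task whose work fans out across the Spark cluster's
# executors -- the DAG node is single, the computation is distributed.
with DAG(
    dag_id="distributed_batch",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    process = SparkSubmitOperator(
        task_id="process_records",
        conn_id="spark_default",
        application="/jobs/process_records.py",  # hypothetical PySpark script
        num_executors=10,      # work is split across these executors
        executor_cores=4,
        executor_memory="8g",
    )
```

Downstream jobs that must run sequentially become further tasks chained after this one with the usual >> dependencies.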

What are the best tools to schedule Snowflake tasks or Python scripts on EC2 to load data into Snowflake?

Please share your experiences with orchestrating jobs, run through various tools and programmatic interfaces, that load data into Snowflake:
- Python scripts on EC2 instances, currently scheduled using crontab
- tasks in Snowflake
- Alteryx workflows
Are there any tools with a sophisticated UI for creating job workflows with dependencies? A workflow can have:
- a Python script followed by a task
- an Alteryx workflow followed by a Python script and then a task
If any job fails, it should send emails to the team.
Thanks
We have used both CONTROL-M and Apache Airflow to schedule and orchestrate data loads into Snowflake.
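For instance, the "Python script followed by a Snowflake task, with failure emails" chain from the question might look like this in Airflow, assuming the Snowflake provider package is installed; the connection id, script path, email address, and task name are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

# Email the team whenever any task in the DAG fails.
default_args = {
    "email": ["team@example.com"],
    "email_on_failure": True,
}

with DAG(
    dag_id="snowflake_load",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    # Step 1: the Python script currently run from crontab on EC2.
    run_script = BashOperator(
        task_id="run_python_script",
        bash_command="python /opt/etl/load_stage.py",
    )
    # Step 2: manually trigger a Snowflake task once the script succeeds.
    run_snowflake_task = SnowflakeOperator(
        task_id="execute_snowflake_task",
        snowflake_conn_id="snowflake_default",
        sql="EXECUTE TASK my_db.my_schema.load_task;",
    )
    run_script >> run_snowflake_task
```

An Alteryx workflow step would slot into the same chain, e.g. via Alteryx's job API called from another task.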

Azure Data Factory v2 and data processing in custom activity

I am migrating (extract-load) a large dataset to a LOB service and would like to use Azure Data Factory v2 (ADF v2). This would be the cloud version of the same kind of orchestration typically implemented in SSIS. My source database and dataset, as well as the target platform, are on Azure. That led me to ADF v2 with Azure Batch Service (ABS) and creating a custom activity.
https://learn.microsoft.com/en-us/azure/data-factory/transform-data-using-dotnet-custom-activity
However, I cannot work out from the documentation or the samples provided by Microsoft how ADF v2 creates the job and tasks needed by the Batch service.
As an example, let's say I have a dataset with 10 million records, and a Batch service with 10 cores in a pool. How do I submit 1/10, or even row-for-row, to my command-line app running on each of the cores in the pool? How do I distribute the work? Following the default guide in the ADF v2 docs, I just get a datasets.json file, and it is the same for all my pool nodes, with no "slice" or subset information.
If ADF v2 were not involved, I would create a job in ABS and, for each row or for each X rows, create a task. The nodes would then execute the tasks one by one. How do I achieve something similar with ADF v2?
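For reference, the non-ADF approach described in the last paragraph might look roughly like this with the azure-batch Python SDK; the account, URL, pool id, and worker command line are all hypothetical, and a Windows pool is assumed:

```python
import azure.batch.models as batchmodels
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials

# Placeholders -- replace with your Batch account, key, and URL.
creds = SharedKeyCredentials("mybatchaccount", "<account-key>")
client = BatchServiceClient(
    creds, batch_url="https://mybatchaccount.westeurope.batch.azure.com"
)

# One job on an existing pool...
client.job.add(batchmodels.JobAddParameter(
    id="migration-job",
    pool_info=batchmodels.PoolInformation(pool_id="mypool"),
))

# ...and one task per slice: 10 million rows in 10 slices of 1 million,
# so each of the 10 cores in the pool picks up one slice.
tasks = [
    batchmodels.TaskAddParameter(
        id=f"slice-{i}",
        command_line=f"cmd /c worker.exe --offset {i * 1_000_000} --count 1000000",
    )
    for i in range(10)
]
client.task.add_collection("migration-job", tasks)
```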