Knowing that ADF Data Flow transformations run on a Databricks cluster in the background, how different (in terms of cost and performance) would it be to run the same transformations in a Databricks notebook in the same pipeline?
I guess it depends on how we set up the Databricks cluster, but I would also like to understand how this cluster runs in the background. Would it be a dedicated cluster or a shared one in the platform?
Each activity in ADF is executed by an Integration Runtime (VM). If you are synchronously monitoring a Databricks job, you will be charged for the Integration Runtime that will be monitoring your job.
Notebook execution in Databricks will be charged as a job cluster. Please create a pool and use that pool in ADF. In Databricks you will see the history of ADF-created clusters in the pool overview.
When creating the pool, be careful with the settings, as you can be charged for idle time. Min idle can be 0 and the auto-termination time can be set to a low value. If you have a data flow that executes notebooks step by step, reusing the same pool can be quicker and cheaper, as Databricks will not deploy a new machine but will use an existing machine from the pool (if it hasn't been auto-terminated already).
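A minimal sketch of the pool settings described above, written as a request body for the Databricks Instance Pools API (`POST /api/2.0/instance-pools/create`). The pool name, node type, and the chosen values are placeholders, not recommendations; the helper function itself is hypothetical and only builds the payload, it does not call the API.

```python
import json


def build_pool_request(name, node_type_id, min_idle=0,
                       idle_autotermination_minutes=10, max_capacity=4):
    """Build a request body for creating a Databricks instance pool.

    min_idle=0 means no instances are kept warm (so no idle charges),
    and a low auto-termination value releases unused instances quickly.
    """
    if min_idle < 0 or idle_autotermination_minutes < 0:
        raise ValueError("pool settings must be non-negative")
    return {
        "instance_pool_name": name,
        "node_type_id": node_type_id,
        "min_idle_instances": min_idle,
        "idle_instance_autotermination_minutes": idle_autotermination_minutes,
        "max_capacity": max_capacity,
    }


# Placeholder name and node type, for illustration only.
pool_request = build_pool_request("adf-pool", "Standard_DS3_v2")
print(json.dumps(pool_request, indent=2))
```

The trade-off is that with `min_idle_instances` at 0 the first notebook in a run still waits for a machine to start; later notebooks can reuse it only while it has not yet been auto-terminated.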
[Screenshot: ADF jobs in the pool and the min idle settings]
We are considering using SageMaker jobs or ECS as a resource for a few of our ML jobs. Our jobs are based on a custom Dockerfile (no Spark, just basic ML Python libraries), so all that is required is a resource for the container.
Is there any specific advantage to using SageMaker over ECS here? Also, since our use case only requires a resource for running a Docker image, would a Processing Job or a Training Job serve the same purpose? Thanks!
Yes, you could make use of either a Training Job or a Processing Job (assuming the ML jobs are for transient training and/or processing).
The benefit of using SageMaker over ECS is that SageMaker manages the infrastructure. The Jobs are also transient and as such will be killed after training/processing while your artifacts will be automatically saved to S3.
With SageMaker Training or Processing Jobs, all you need to do is bring your container (sitting in ECR) and kick off the Job with a single API call (CreateTrainingJob or CreateProcessingJob).
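As a rough sketch of what that single call could look like, the helper below builds the arguments for `CreateProcessingJob`. The job name, image URI, role ARN, instance type, and limits are all placeholders; the helper is hypothetical and only assembles the request, so it runs without AWS credentials.

```python
def build_processing_job_request(job_name, image_uri, role_arn,
                                 instance_type="ml.m5.xlarge",
                                 instance_count=1, volume_gb=30,
                                 max_runtime_s=3600):
    """Assemble keyword arguments for SageMaker CreateProcessingJob."""
    return {
        "ProcessingJobName": job_name,
        "RoleArn": role_arn,
        # The custom container image stored in ECR.
        "AppSpecification": {"ImageUri": image_uri},
        "ProcessingResources": {
            "ClusterConfig": {
                "InstanceCount": instance_count,
                "InstanceType": instance_type,
                "VolumeSizeInGB": volume_gb,
            }
        },
        # The job is stopped once this limit is reached.
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_s},
    }


# Placeholder identifiers, for illustration only.
request = build_processing_job_request(
    job_name="my-ml-job",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",
    role_arn="arn:aws:iam::123456789012:role/MySageMakerRole",
)
print(request["ProcessingResources"]["ClusterConfig"])

# With boto3 installed and credentials configured, the job could be started with:
#   import boto3
#   boto3.client("sagemaker").create_processing_job(**request)
```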
If I schedule a Databricks notebook to run via Jobs and via Azure Data Factory, which one would be more efficient and why?
There are a few cases where Databricks Workflows (formerly Jobs) are more efficient to use than ADF:
ADF still uses Jobs API 2.0 to submit ephemeral jobs, and that API doesn't support setting default access control lists.
If you have a job consisting of several tasks, Databricks has a cluster reuse option that allows the same cluster(s) to run multiple subtasks, so you don't wait for new clusters to be created, as happens when subtasks are scheduled from ADF.
You can share context between subtasks more efficiently when using Databricks Workflows.
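To illustrate the cluster reuse point, here is a sketch of a multi-task job definition in the shape accepted by the Databricks Jobs API 2.1 (`POST /api/2.1/jobs/create`), where every task refers to the same `job_cluster_key`. The job name, notebook paths, runtime version, and node type are placeholders, and the helper function is hypothetical; it only builds the payload.

```python
def build_multitask_job(name, notebook_paths, spark_version, node_type_id,
                        num_workers=2):
    """Build a job definition whose tasks all share one job cluster."""
    cluster_key = "shared_cluster"
    tasks = []
    previous_key = None
    for index, path in enumerate(notebook_paths):
        task = {
            "task_key": f"task_{index}",
            # Same key on every task, so the cluster is created once and reused.
            "job_cluster_key": cluster_key,
            "notebook_task": {"notebook_path": path},
        }
        if previous_key is not None:
            # Run the notebooks one after another.
            task["depends_on"] = [{"task_key": previous_key}]
        tasks.append(task)
        previous_key = task["task_key"]
    return {
        "name": name,
        "job_clusters": [{
            "job_cluster_key": cluster_key,
            "new_cluster": {
                "spark_version": spark_version,
                "node_type_id": node_type_id,
                "num_workers": num_workers,
            },
        }],
        "tasks": tasks,
    }


# Placeholder values, for illustration only.
job = build_multitask_job(
    "nightly-etl",
    ["/Repos/etl/extract", "/Repos/etl/transform", "/Repos/etl/load"],
    spark_version="13.3.x-scala2.12",
    node_type_id="Standard_DS3_v2",
)
print([t["task_key"] for t in job["tasks"]])
```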
I have an Azure Data Factory pipeline that runs Lookup (SQL selects) and Copy Data (inserts) activities in a ForEach 5000-1000 times. I want to execute the pipeline nightly, but it currently takes more than 8 hours to finish. Each iteration takes 15min.
I can see from Azure SQL that the CPU, RAM, and IO load metrics are OK.
I'm using a self-hosted integration runtime.
What can I do to speed up Azure Data Factory processing?
How can I find the bottleneck of the solution, and how do I fix it?
You can enhance the scale of processing by the following approaches:
You can scale up the self-hosted IR by increasing the number of concurrent jobs that can run on a node.
Scaling up works only if the processor and memory of the node are less than fully utilized.
You can scale out the self-hosted IR by adding more nodes (machines).
Here are Performance tuning steps that can help you to tune the performance of your service.
You can follow this official documentation to identify and resolve the bottleneck.
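To get a feel for how the scale-up and scale-out options above interact, here is a back-of-the-envelope estimate. It assumes an idealized model that is not from the ADF documentation: every iteration takes the same time, work is spread perfectly across all concurrent job slots, and nothing else (for example the database) becomes the bottleneck. The numbers in the usage lines are hypothetical.

```python
import math


def estimated_wall_time_minutes(iterations, minutes_per_iteration,
                                nodes, concurrent_jobs_per_node):
    """Idealized run time when iterations are processed in parallel waves."""
    slots = nodes * concurrent_jobs_per_node
    if slots < 1:
        raise ValueError("need at least one concurrent job slot")
    # Each wave runs up to `slots` iterations at the same time.
    waves = math.ceil(iterations / slots)
    return waves * minutes_per_iteration


# Hypothetical workload: 100 iterations of 2 minutes each.
print(estimated_wall_time_minutes(100, 2, nodes=1, concurrent_jobs_per_node=4))
print(estimated_wall_time_minutes(100, 2, nodes=2, concurrent_jobs_per_node=4))
```

If the measured run time is far above such an estimate, the limit is probably somewhere other than the integration runtime, which is what the bottleneck analysis in the linked documentation is meant to uncover.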
I have a real-time Spark job that runs in one EMR cluster, and a batch job that runs in another EMR cluster and is triggered at a specific time.
How can I run both of these jobs in one EMR cluster?
Any suggestions?
If the steps in the two EMR clusters do not depend on each other, you can use the EMR feature called Concurrency to solve your use case. This feature simply means that you can run more than one step in parallel.
This feature is available from EMR version 5.28.0. If you are using an older version, you cannot use it.
When launching EMR from the AWS console, this feature is called 'Concurrency' in the UI. You can choose any number from 1 to 256.
If you are launching EMR from the AWS CLI, this feature is called 'StepConcurrencyLevel'.
You can read more about this at multiple steps now in EMR and AWS CLI details.
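For an already running cluster, the same setting can be changed through the EMR `ModifyCluster` API. The sketch below only builds and validates the arguments; the helper function is hypothetical and the cluster ID is a placeholder.

```python
def build_step_concurrency_update(cluster_id, level):
    """Build arguments for EMR ModifyCluster to change step concurrency."""
    # The allowed range quoted above is 1 to 256.
    if not 1 <= level <= 256:
        raise ValueError("StepConcurrencyLevel must be between 1 and 256")
    return {"ClusterId": cluster_id, "StepConcurrencyLevel": level}


# Placeholder cluster ID; a level of 2 lets the streaming step and the
# batch step run side by side.
update = build_step_concurrency_update("j-XXXXXXXXXXXXX", 2)
print(update)

# With boto3 installed and credentials configured, this could be applied with:
#   import boto3
#   boto3.client("emr").modify_cluster(**update)
```

Note that running steps concurrently only helps if the cluster has enough resources for both jobs at the same time.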
I have a simple Flink job that reads from an ActiveMQ source and sinks to a database and to print. I deploy the job on Kubernetes with 2 TaskManagers, each having 10 task slots (taskmanager.numberOfTaskSlots: 10). I configured a parallelism greater than the total task slots available (i.e., 10 in this case).
When I look at the Flink Dashboard, I see that this job runs on only one of the TaskManagers, while the other TaskManager has no jobs. I verified this by checking where every operator is scheduled; also, in the Task Manager UI page, one of the managers has all its slots free. I attach images below for reference.
Did I configure anything wrong? Where is the gap in my understanding? Can someone explain it?
The first task manager has enough slots (10) to fully satisfy the requirements of your job.
The scheduler's default behavior is to fully utilize one task manager's slots before using slots from another task manager. If instead you would prefer that Flink spread out the workload across all available task managers, set cluster.evenly-spread-out-slots: true in flink-conf.yaml. (This option was added in Flink 1.10 to recreate a scheduling behavior similar to what was the default before Flink 1.5.)
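The difference between the two behaviors can be illustrated with a toy model. This is not Flink's actual scheduler code, only a simplified simulation of "fill one task manager first" versus "spread evenly"; the function name and its parameters are made up for the example.

```python
def allocate_slots(required, slots_per_tm, evenly_spread=False):
    """Return how many slots each task manager ends up using."""
    if required > sum(slots_per_tm):
        raise ValueError("not enough slots for the requested parallelism")
    used = [0] * len(slots_per_tm)
    for _ in range(required):
        # Task managers that still have a free slot.
        candidates = [i for i, cap in enumerate(slots_per_tm) if used[i] < cap]
        if evenly_spread:
            # Prefer the task manager with the fewest slots in use.
            target = min(candidates, key=lambda i: used[i])
        else:
            # Default-like behavior: fill the first task manager before the next.
            target = candidates[0]
        used[target] += 1
    return used


# Two task managers with 10 slots each, and a job needing 10 slots.
print(allocate_slots(10, [10, 10]))                      # [10, 0]
print(allocate_slots(10, [10, 10], evenly_spread=True))  # [5, 5]
```

In this model the first call reproduces what the dashboard shows (one task manager fully used, the other idle), and the second shows the kind of distribution that cluster.evenly-spread-out-slots: true is meant to produce.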