I have an Azure Data Factory pipeline that runs a Lookup (SQL SELECT) and a Copy Data (INSERT) activity inside a ForEach loop for 5000-1000 iterations. I want to run the pipeline nightly, but currently it takes more than 8 hours to finish. Each iteration takes 15 minutes.
I can see from the Azure SQL metrics that CPU, RAM, and IO load are fine.
I'm using a Self-Hosted Integration Runtime.
What can I do to speed up the Azure Data Factory processing?
How can I find the bottleneck in this solution, and how can I fix it?
You can increase the processing scale with the following approaches:
You can scale up the self-hosted IR by increasing the number of concurrent jobs that can run on a node (a sketch of doing this through the SDK follows this list).
Scaling up only helps if the node's processor and memory are not already fully utilized.
You can scale out the self-hosted IR by adding more nodes (machines).
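If you prefer to raise the concurrent jobs limit programmatically rather than through the portal, here is a minimal, hedged sketch with the azure-mgmt-datafactory SDK; the resource group, factory, IR and node names, and the limit of 8 are placeholders you would replace with your own values.

```python
# Sketch: raise the concurrent jobs limit of one self-hosted IR node.
# Names and the limit are placeholders; requires azure-identity and azure-mgmt-datafactory.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import UpdateIntegrationRuntimeNodeRequest

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

node = client.integration_runtime_nodes.update(
    resource_group_name="<resource-group>",
    factory_name="<data-factory-name>",
    integration_runtime_name="<self-hosted-ir-name>",
    node_name="<node-machine-name>",
    update_integration_runtime_node_request=UpdateIntegrationRuntimeNodeRequest(
        concurrent_jobs_limit=8  # only useful while CPU/RAM on the node still have headroom
    ),
)
print(node.node_name, node.concurrent_jobs_limit)
```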
Here are the performance tuning steps that can help you tune the performance of the service.
You can follow the official documentation to identify and resolve the bottleneck; the sketch below shows one way to do this programmatically.
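To pin down which activity eats the time, a hedged sketch below queries the activity runs of one pipeline run with the azure-mgmt-datafactory SDK and sorts them by duration; the subscription, resource group, factory name, and run ID are placeholders.

```python
# Sketch: list the slowest activities of one pipeline run (assumes azure-identity
# and azure-mgmt-datafactory are installed; names below are placeholders).
from datetime import datetime, timedelta
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
factory_name = "<data-factory-name>"
run_id = "<pipeline-run-id>"

client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

filters = RunFilterParameters(
    last_updated_after=datetime.utcnow() - timedelta(days=1),
    last_updated_before=datetime.utcnow(),
)

activity_runs = client.activity_runs.query_by_pipeline_run(
    resource_group, factory_name, run_id, filters
).value

# Print activities sorted by duration (milliseconds) to see where the time is spent.
for run in sorted(activity_runs, key=lambda r: r.duration_in_ms or 0, reverse=True):
    print(run.activity_name, run.activity_type, run.duration_in_ms, run.status)
```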
Here is a high-level picture of what I am trying to achieve: I want to train a LightGBM model with Spark as the compute backend, all in SageMaker, using their Training Job API.
To clarify:
I have to use LightGBM; there is no alternative here.
The reason I need the Spark compute backend is that training on the current dataset no longer fits in memory.
I want to use the SageMaker Training Job setup so that I can use an SM hyperparameter optimisation job to find the best hyperparameters for LightGBM. While the LightGBM Spark interface itself offers some hyperparameter tuning capabilities, it does not offer Bayesian HP tuning.
Now, I know the general approach to running custom training in SM: build a container in a certain way, then pull it from ECR and kick off a training job/hyperparameter tuning job through the sagemaker.Estimator API. In this case SM handles resource provisioning for you, creates an instance, and so on. What I am confused about is that, to use a Spark compute backend, I would essentially need an EMR cluster running, so the SDK would have to handle that as well. However, I do not see how this is possible with the API above.
Now, there is also the SageMaker PySpark SDK. However, the SageMakerEstimator API provided by that package does not support on-the-fly cluster configuration either.
Does anyone know a way to run a SageMaker training job that uses an EMR cluster, so that later the same job could be used for hyperparameter tuning activities?
One way I can see is to run an EMR cluster in the background and then create a regular SM Estimator job that connects to the EMR cluster and does the training, essentially running a Spark driver program inside the SM Estimator job.
Has anyone done anything similar in the past?
Thanks
Thanks for your questions. Here are the answers:
The SageMaker PySpark SDK (https://sagemaker-pyspark.readthedocs.io/en/latest/) does the opposite of what you want: it lets you call a (Spark or non-Spark) SageMaker job from a Spark environment. I'm not sure that's what you need here.
Running Spark in SageMaker jobs: while you can use SageMaker Notebooks to connect to a remote EMR cluster for interactive coding, you do not need EMR to run Spark in SageMaker jobs (Training and Processing). You have two options:
SageMaker Processing has a built-in Spark container, which is easy to use but unfortunately not connected to SageMaker Model Tuning (which works with Training only). If you use this, you will have to bring a third-party, external parameter-search library; for example Syne Tune from AWS itself (which supports Bayesian optimization).
SageMaker Training can run custom Docker-based jobs on one or more machines. If you can fit your Spark code within the SageMaker Training spec, you will be able to use SageMaker Model Tuning to tune your Spark code. However, there is no framework container for Spark on SageMaker Training, so you would have to build your own, and I am not aware of any examples. Maybe you could get inspiration from the Processing container code here to build a custom Training container; a sketch of wiring such a container into Model Tuning follows this list.
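As a hedged illustration of the second option, a custom training image (the ECR URI below is a placeholder) could be wrapped in an Estimator and handed to SageMaker Model Tuning with Bayesian search. The hyperparameter names (num_leaves, learning_rate) and the metric regex are assumptions you would adapt to whatever your container actually emits.

```python
# Sketch: Bayesian hyperparameter tuning over a custom (Spark + LightGBM) training
# container. The image URI, role, hyperparameter names and metric regex are
# placeholders / assumptions, not a confirmed setup.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

session = sagemaker.Session()

estimator = Estimator(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/lightgbm-spark:latest",  # your custom image
    role="<sagemaker-execution-role-arn>",
    instance_count=1,          # this job acts as the Spark driver / EMR client
    instance_type="ml.m5.large",
    sagemaker_session=session,
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:auc",
    objective_type="Maximize",
    # The container must print a line this regex can parse, e.g. "validation-auc: 0.93".
    metric_definitions=[{"Name": "validation:auc", "Regex": "validation-auc: ([0-9\\.]+)"}],
    hyperparameter_ranges={
        "num_leaves": IntegerParameter(16, 256),
        "learning_rate": ContinuousParameter(0.01, 0.3),
    },
    strategy="Bayesian",
    max_jobs=20,
    max_parallel_jobs=2,
)

tuner.fit()  # add inputs={"train": "s3://..."} if the container reads SageMaker channels
```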
Your idea of using the Training job as a client to launch an EMR cluster is good and should work (if SM has the right permissions), and will indeed allow you to use SM Model Tuning. I'd recommend the following (see the sketch after this list):
Have each SM job create a new transient cluster (auto-terminating after its step) to keep costs low and to avoid tuning results being polluted by inter-job contention, which could arise if everything ran on the same cluster.
Use the cheapest possible instance type for the SM Estimator, because it needs to stay up for the whole duration of your EMR experiment in order to collect and print your final metric (accuracy, duration, cost...).
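As a hedged illustration of the transient-cluster idea, the training container's entry point could call boto3's EMR API roughly like this; the release label, instance types, roles, and the S3 path of the Spark job are placeholders, and the SageMaker execution role must be allowed to call EMR.

```python
# Sketch: launch a transient EMR cluster from inside the SageMaker training job,
# run one spark-submit step, and let the cluster auto-terminate afterwards.
# Release label, instance types, roles and S3 paths are placeholders.
import boto3

emr = boto3.client("emr", region_name="<region>")

response = emr.run_job_flow(
    Name="lightgbm-spark-training",
    ReleaseLabel="emr-6.9.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.2xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # auto-terminate once the step finishes
    },
    Steps=[{
        "Name": "train-lightgbm",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://<bucket>/code/train_lightgbm.py",
                     "--num-leaves", "64"],  # pass the tuned hyperparameters through
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

cluster_id = response["JobFlowId"]

# Block until the cluster (and therefore the step) has finished, then read the
# metric your Spark job wrote to S3 and print it so SageMaker Model Tuning can parse it.
emr.get_waiter("cluster_terminated").wait(ClusterId=cluster_id)
```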
In the same spirit, I once used SageMaker Training myself to launch Batch Transform jobs, for the sole purpose of leveraging the Bayesian search API to find an inference configuration that minimizes cost.
Knowing that ADF Data Flow transformations run on a Databricks cluster in the background, how different (in terms of cost and performance) would it be to run the same transformations in a Databricks notebook within the same pipeline?
I guess it will depend on how we configure the Databricks cluster, but my question is also about how this background cluster runs. Would it be a dedicated cluster or a shared one on the platform?
Each activity in ADF is executed by an Integration Runtime (VM). If you are synchronously monitoring a Databricks job, you will be charged for the Integration Runtime that is monitoring your job.
Notebook execution in Databricks is charged as a job cluster. Create an instance pool and use that pool in ADF; in Databricks you will then see the history of ADF-created clusters in the pool overview.
When creating the pool, be careful with the settings, as you can be charged for idle time: the minimum idle count can be 0, and the auto-termination time should be set to a low value. If you have a pipeline that executes notebooks step by step, reusing the same pool can be quicker and cheaper, because Databricks will not deploy a new machine but reuse an existing one from the pool (provided it wasn't auto-terminated already).
(Screenshot: ADF-created job clusters in the pool, and the min-idle settings.)
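A hedged sketch of creating such a pool via the Databricks Instance Pools REST API is below; the workspace URL, token, and node type are placeholders, and the sizing values are only assumptions to illustrate the idle-time trade-off.

```python
# Sketch: create a Databricks instance pool that ADF-launched job clusters can draw
# from. Workspace URL, token and node type are placeholders; min_idle_instances=0 and
# a short idle auto-termination keep idle-time charges down, at the cost of cold starts.
import requests

workspace_url = "https://<databricks-instance>.azuredatabricks.net"
token = "<personal-access-token>"

payload = {
    "instance_pool_name": "adf-notebook-pool",
    "node_type_id": "Standard_DS3_v2",
    "min_idle_instances": 0,                       # nothing kept warm when no jobs run
    "max_capacity": 10,
    "idle_instance_autotermination_minutes": 15,   # release idle nodes quickly
}

resp = requests.post(
    f"{workspace_url}/api/2.0/instance-pools/create",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
resp.raise_for_status()
print("Created pool:", resp.json()["instance_pool_id"])
```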
I've created a new ADF pipeline that works well but gives me some concern about performance.
As an example, here's a task from the pipeline that copies a small blob from one container to another container in the same storage account:
Notice that it's queued for 58 seconds.
The pipeline uses the "Managed Virtual Network" integration runtime because it makes use of Azure SQL private endpoints.
Any ideas why the Copy Data tasks are held at "Queued" for so long?
Since your pipeline uses the "Managed Virtual Network" integration runtime, this is expected, as described in Activity execution time using managed virtual network:
By design, an Azure integration runtime in a managed virtual network takes longer queue time than the global Azure integration runtime, as we are not reserving one compute node per data factory, so there is a warm-up for each activity to start, and it occurs primarily on the virtual network join rather than the Azure integration runtime. For non-copy activities, including pipeline activities and external activities, there is a 60-minute Time To Live (TTL) when you trigger them for the first time. Within the TTL, the queue time is shorter because the node is already warmed up.
This 60-minute Time To Live (TTL) feature of the "Managed Virtual Network" IR shortens the queue time because the node is already warmed up, but unfortunately the Copy activity doesn't have TTL support yet.
While searching for a service to migrate our on-premises MongoDB to Azure Cosmos DB with the Mongo API, we came across Azure Databricks. We have a total of 186 GB of data, which we need to migrate to Cosmos DB with as little downtime as possible. How can we improve the data transfer rate? If someone can give some insight into this Spark-based PaaS provided by Azure, it would be very helpful.
Thank you
Have you referred to the article given on our docs page?
In general, you can assume the migration workload will consume the entire provisioned throughput, so the provisioned throughput gives an estimate of the migration speed. You could consider increasing the RUs for the duration of the migration and reducing them afterwards.
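If you go the route of raising RUs during the migration, here is a hedged sketch using the azure-mgmt-cosmosdb SDK; the account, database, and collection names and the RU/s figures are placeholders, and the exact operation name may differ between SDK versions.

```python
# Sketch: temporarily raise the throughput of the target Cosmos DB (Mongo API)
# collection before the migration and lower it again afterwards.
# All names and RU/s values are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.cosmosdb import CosmosDBManagementClient
from azure.mgmt.cosmosdb.models import (
    ThroughputSettingsResource,
    ThroughputSettingsUpdateParameters,
)

client = CosmosDBManagementClient(DefaultAzureCredential(), "<subscription-id>")

def set_collection_throughput(rus: int) -> None:
    poller = client.mongo_db_resources.begin_update_mongo_db_collection_throughput(
        resource_group_name="<resource-group>",
        account_name="<cosmos-account>",
        database_name="appdb",
        collection_name="orders",
        update_throughput_parameters=ThroughputSettingsUpdateParameters(
            resource=ThroughputSettingsResource(throughput=rus)
        ),
    )
    poller.result()  # wait for the scale operation to complete

set_collection_throughput(20000)   # before the migration
# ... run the Databricks copy job ...
set_collection_throughput(1000)    # scale back down afterwards
```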
The migration performance can be adjusted through these configurations (see the sketch after this list):
Number of workers and cores in the Spark cluster
maxBatchSize
Disable indexes during data transfer
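As a hedged sketch of where maxBatchSize fits, the snippet below uses the MongoDB Spark connector from a Databricks notebook to read from the on-premises MongoDB and write into the Cosmos DB Mongo API endpoint; the connection strings, database and collection names, and the batch size are placeholders.

```python
# Sketch (Databricks/PySpark): copy a collection from on-prem MongoDB into
# Cosmos DB's Mongo API using the MongoDB Spark connector. Connection strings,
# database/collection names and the batch size are placeholders to tune.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # on Databricks, the session already exists

source_uri = "mongodb://<on-prem-host>:27017"
target_uri = "mongodb://<cosmos-account>:<key>@<cosmos-account>.mongo.cosmos.azure.com:10255/?ssl=true"

df = (spark.read.format("mongodb")          # "com.mongodb.spark.sql.DefaultSource" on older connectors
      .option("connection.uri", source_uri)
      .option("database", "appdb")
      .option("collection", "orders")
      .load())

(df.write.format("mongodb")
   .option("connection.uri", target_uri)
   .option("database", "appdb")
   .option("collection", "orders")
   .option("maxBatchSize", 512)             # larger batches cut round trips but spike RU consumption
   .mode("append")
   .save())
```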
I want to load about 100 small tables (min 5 records, max 10,000 records) from SQL Server into Google BigQuery on a daily basis. We have created 100 Data Fusion pipelines, one pipeline per source table. When we start one pipeline it takes about 7 minutes to execute: it starts Dataproc, connects to SQL Server, and sinks the data into BigQuery. If we run the pipelines sequentially, it would take 700 minutes. When we try to run the pipelines in parallel, we are limited by the network address range, which allows roughly 256/3 pipelines, since one pipeline starts 3 VMs (one master and two workers). We tried, but performance degrades when we start more than 10 pipelines in parallel.
Question: is this the right approach?
When multiple pipelines run at the same time, there are multiple Dataproc clusters running behind the scenes, with more VMs and more disk required. There are plugins to help with multiple source tables: the right one to use is the CDAP/Google plugin called Multiple Table Plugins, as it allows multiple source tables in a single pipeline.
In the Data Fusion Studio, you can find it under Hub -> Plugins.
To see the full list of available plugins, please visit the official documentation.
Multiple Data Fusion pipelines can also share the same pre-provisioned Dataproc cluster. To do this, create a Remote Hadoop Provisioner compute profile for the Data Fusion instance.
This feature is only available in the Enterprise edition.
See how to set up a compute profile for the Data Fusion instance.
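As a hedged sketch, you could pre-provision the Dataproc cluster that the Remote Hadoop Provisioner profile points at with the google-cloud-dataproc client; the project, region, cluster name, and machine types are placeholders, and the sizing is only an assumption.

```python
# Sketch: pre-provision a small Dataproc cluster to be shared by multiple
# Data Fusion pipelines via a Remote Hadoop Provisioner compute profile.
# Project, region, cluster name and machine types are placeholders.
from google.cloud import dataproc_v1

project_id = "<project-id>"
region = "<region>"  # e.g. "europe-west1"

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "datafusion-shared",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print("Cluster created:", operation.result().cluster_name)
```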