Exponential backoff policy for Spark on Azure Storage - PySpark

I have Spark jobs on k8s that read and write Parquet files from Azure Storage (blobs). I recently learned that Azure enforces limits on the number of transactions per second, and my pipelines are exceeding those limits.
This results in throttling, and some tasks in my jobs take 8-10x the usual time (it isn't data skew). One recommendation was to apply an exponential backoff policy, but I have not found any such setting in the Spark configuration.
Is anyone facing a similar situation? Any help on this would be truly appreciated.
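For readers unfamiliar with the term, here is a minimal application-level sketch of exponential backoff with jitter in Python. It is not a Spark or Hadoop setting, and it does not answer whether the storage connector exposes one. The names `backoff_delays`, `retry_with_backoff`, and `is_throttled` are hypothetical helpers introduced only for illustration.

```python
import random
import time


def backoff_delays(retries=5, base=1.0, factor=2.0, max_delay=60.0):
    """Yield the capped delay in seconds to wait before each retry."""
    for attempt in range(retries):
        yield min(max_delay, base * factor ** attempt)


def retry_with_backoff(operation, is_throttled, retries=5, base=1.0,
                       factor=2.0, max_delay=60.0,
                       sleep=time.sleep, rng=random.random):
    """Call operation(); on a throttling error, wait and try again.

    is_throttled(exc) is a hypothetical predicate supplied by the caller,
    for example one that checks for an HTTP 503 "server busy" response.
    """
    for delay in backoff_delays(retries, base, factor, max_delay):
        try:
            return operation()
        except Exception as exc:
            if not is_throttled(exc):
                raise  # not a throttling error: do not retry
            # "Full jitter": sleep a random fraction of the delay so that
            # many tasks throttled together do not all retry at once.
            sleep(delay * rng())
    # Retries exhausted: make one last attempt and let any error propagate.
    return operation()
```

With the defaults, the waits are capped at 1, 2, 4, 8, and 16 seconds before the final attempt. The random factor matters when many tasks are throttled at the same moment, because it spreads their retries out.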

Related

How to increase performance of Azure Data Factory Pipeline?

I have an Azure Data Factory pipeline that runs Lookup (SQL selects) and Copy Data (inserts) activities in a ForEach loop 5000-1000 times. I want to run the pipeline nightly, but it currently takes more than 8 hours to finish. Each iteration takes 15 min.
I can see from Azure SQL that CPU, RAM, IO load Metrics are ok.
I'm using Self-Hosted Integration runtime.
What can I do to speed up Azure Data Factory processing?
How can I find the bottleneck in this solution, and how do I fix it?
You can enhance the scale of processing by the following approaches:
You can scale up the self-hosted IR, by increasing the number of concurrent jobs that can run on a node.
Scale up works only if the processor and memory of the node are being less than fully utilized.
You can scale out the self-hosted IR, by adding more nodes (machines).
Here are Performance tuning steps that can help you to tune the performance of your service.
You can follow this official documentation to identify and resolve the bottleneck.

How Data Flow computing differs from Databricks

Knowing that ADF Data Flow transformations run on a Databricks cluster in the background, how different (in terms of cost and performance) would it be to run the same transformations in a Databricks notebook in the same pipeline?
I guess it will depend on how we set up the Databricks cluster, but I also want to understand how that background cluster runs. Is it a dedicated cluster or one shared across the platform?
Each activity in ADF is executed by an Integration Runtime (VM). If you are synchronously monitoring a Databricks job, you will be charged for the Integration Runtime that will be monitoring your job.
Notebook execution in Databricks is charged as a job cluster. Create a pool and use that pool in ADF. In Databricks, the pool overview shows the history of clusters created by ADF.
When creating the pool, be careful with the settings, because you can be charged for idle time. Min idle can be 0, and the auto-termination time can be set to a low value. If you have a data flow that executes notebooks step by step, reusing the same pool can be quicker and cheaper, because Databricks will use an existing machine from the pool instead of deploying a new one (if it has not been auto-terminated already).
Screenshot: ADF jobs in the pool and the min idle setting.

How to set spark executor memory in the Azure Data Factory Linked service

My Spark Scala code is failing with a Spark out-of-memory error. I am running the code from an ADF pipeline. On the Databricks cluster, the executor memory is set to 4g. I want to change this value at the ADF level instead of at the cluster level. When creating a linked service, there are additional cluster settings where we can define the cluster Spark configuration. Could someone please let me know how to set the Spark executor memory in the linked service in ADF?
Thank you.
Add Name = spark.executor.memory and Value = 6g
Monitor core configuration settings to ensure your Spark jobs run in a predictable and performant way. These settings help determine the best Spark cluster configuration for your particular workloads.
Also see: https://learn.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-settings

Batch Processing on Kubernetes

Does anyone here have experience with batch processing (e.g. Spring Batch) on Kubernetes? Is it a good idea? How can we prevent batch jobs from processing the same data if we use the Kubernetes autoscaling feature? Thank you.
Does anyone here have experience with batch processing (e.g. Spring Batch) on Kubernetes? Is it a good idea?
For Spring Batch, we (the Spring Batch team) do have some experience on the matter which we share in the following talks:
Cloud Native Batch Processing on Kubernetes, by Michael Minella
Spring Batch on Kubernetes, by me.
Running batch jobs on kubernetes can be tricky:
pods may be re-scheduled by k8s on different nodes in the middle of processing
cron jobs might be triggered twice
etc
This requires additional non-trivial work on the developer's side to make sure the batch application is fault-tolerant (resilient to node failure, pod re-scheduling, etc) and safe against duplicate job execution in a clustered environment.
Spring Batch takes care of this additional work for you and can be a good choice to run batch workloads on k8s for several reasons:
Cost efficiency: Spring Batch jobs maintain their state in an external database, which makes it possible to restart them from the last save point in case of job/node failure or pod re-scheduling
Robustness: Safe against duplicate job executions thanks to a centralized job repository
Fault-tolerance: Retry/Skip failed items in case of transient errors like a call to a web service that might be temporarily down or being re-scheduled in a cloud environment
I wrote a blog post in which I explain all these aspects in detail, with code examples. You can find it here: Spring Batch on Kubernetes: Efficient batch processing at scale
How can we prevent batch jobs from processing the same data if we use the Kubernetes autoscaling feature?
Making each job process a different data set is the way to go (a job per file for example). But there are different patterns that you might be interested in, see Job Patterns from k8s docs.
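As a rough illustration of the "job per file" idea, the Python sketch below gives each worker a disjoint share of a file list. It assumes a fixed number of workers and a file list that does not change while the jobs run; the helper names are hypothetical. `JOB_COMPLETION_INDEX` is the environment variable Kubernetes sets in the pods of an Indexed Job.

```python
import os


def files_for_worker(files, worker_index, worker_count):
    """Return the slice of files that this worker alone should process."""
    if not 0 <= worker_index < worker_count:
        raise ValueError("worker_index must be in [0, worker_count)")
    # Sort first so every worker sees the same order, then take every
    # worker_count-th file starting at worker_index.
    return sorted(files)[worker_index::worker_count]


def worker_index_from_env(default=0):
    """Read the index Kubernetes sets for pods of an Indexed Job."""
    return int(os.environ.get("JOB_COMPLETION_INDEX", default))
```

Because the assignment depends only on the sorted file list and the index, two workers never receive the same file. If the number of workers changes mid-run, as it can with autoscaling, the shares shift, so this sketch alone does not cover that case.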

Spark: long delay between jobs

We are running a Spark job that extracts data, does some expensive data conversion, and writes to several different files. Everything runs fine, but I'm getting random long delays between the end of a resource-intensive job and the start of the next job.
In the picture below, the job that was scheduled at 17:22:02 took 15 min to finish, so I expected the next job to be scheduled around 17:37:02. However, the next job was scheduled at 22:05:59, which is more than 4 hours after the previous job succeeded.
When I dig into the next job's Spark UI, it shows <1 sec scheduler delay, so I'm confused about where this 4-hour delay is coming from.
(Spark 1.6.1 with Hadoop 2)
Updated:
I can confirm that David's answer below is spot on: the way Spark handles I/O operations is a bit unexpected. (It makes sense that a file write essentially does a "collect" behind the curtain before it writes, considering ordering and/or other operations.) But I'm a bit uncomfortable with the fact that I/O time is not included in job execution time. I guess you can see it in the "SQL" tab of the Spark UI, since queries are still running even after all jobs have succeeded, but you cannot dive into it at all.
I'm sure there are more ways to improve this, but the two methods below were sufficient for me:
reduce file count
set parquet.enable.summary-metadata to false
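One possible way to pass the second setting is on the spark-submit command line. The Python sketch below only builds the argument list. It assumes the `spark.hadoop.` prefix is used to forward the property into the Hadoop configuration; `spark_submit_args` and `my_job.py` are hypothetical names used for illustration.

```python
def spark_submit_args(app, conf):
    """Build a spark-submit argument list with one --conf per setting."""
    args = ["spark-submit"]
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


# The spark.hadoop. prefix asks Spark to copy the property into the
# Hadoop configuration, which is where Parquet reads this flag.
conf = {"spark.hadoop.parquet.enable.summary-metadata": "false"}
command = spark_submit_args("my_job.py", conf)  # my_job.py is a placeholder
print(" ".join(command))
```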
I/O operations often come with significant overhead that occurs on the master node. Since this work isn't parallelized, it can take quite a bit of time. And since it is not a job, it does not show up in the resource manager UI. Some examples of I/O tasks that are done by the master node:
Spark will write to temporary s3 directories, then move the files using the master node
Reading of text files often occurs on the master node
When writing parquet files, the master node will scan all the files post-write to check the schema
These issues can be solved by tweaking yarn settings or redesigning your code. If you provide some source code, I might be able to pinpoint your issue.
Discussion of writing I/O Overhead with Parquet and s3
Discussion of reading I/O Overhead "s3 is not a filesystem"
Problem:
I faced a similar issue when writing Parquet data to s3 with pyspark on EMR 5.5.1. All workers would finish writing data to the _temporary directory in the output folder, and the Spark UI would show that all tasks had completed. But the Hadoop Resource Manager UI would neither release the application's resources nor mark it as complete. On checking the s3 bucket, it seemed that the Spark driver was moving the files one by one from the _temporary directory to the output bucket, which was extremely slow, and the whole cluster was idle except the driver node.
Solution:
The solution that worked for me was to use the committer class provided by AWS (EmrOptimizedSparkSqlParquetOutputCommitter) by setting the configuration property spark.sql.parquet.fs.optimized.committer.optimization-enabled to true.
e.g.:
spark-submit ....... --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true
or
pyspark ....... --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true
Note that this property is available in EMR 5.19 or higher.
Result:
After running the Spark job on EMR 5.20.0 with the above solution, it did not create any _temporary directory and all the files were written directly to the output bucket, so the job finished very quickly.
For more details:
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3-optimized-committer.html