Implementing exponential backoff in spark between consecutive attempts - scala

On our AWS EMR spark cluster we have 100-150 jobs running at a given time. Each cluster-task-node is allocated 7 of 8 cores. But at some hour in a day the load average touches 10 --causing task failure/ task lost/ executors lost/ heart beat missed. If we decrease the allocated cores/node the cluster is under utilised (load avg 4)
To avoid executors lost/ tasks lost --we tried increasing maxAttempts, but all the attempts are happening in that hour window only and the job is failing.
So, as intermediary we are thinking of
having try catch in the main method. So that executor lost, max attempts exceeded exceptions can be caught and... re-trigger the code
The EMR-core-nodes have 6 of 8cores allocated to spark --May be for this reason the driver never crashed!
So the questions
Will re-triggering main code in the same JVM several times any side affects? It can happen that there are some dataframes already created in the previous run and now they are getting recreated. Or recached or the SQL is getting re-executed
if this is not the way then any suggestions on how to get exponential backoff?
we are using spark 3.3 on AWS EMR 6.X. The spark job is triggered via sh file.

Related

Scheduling jobs fails with org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException

Thank you for reading this SO question, it may seem long, but I'll try to get as most information as possible in it to help to get the answer.
Summary
We are currently experiencing a scheduling issue with our Flink cluster.
The symptoms are that some/most/all (it depends, the symptoms are not always the same) of our tasks are shown as SCHEDULED but fail after a timeout. The jobs are then shown as RUNNING.
The failing exception is the following one:
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Slot request bulk is not fulfillable! Could not allocate the required slot within slot request timeout
After analysis, we assume (we cannot prove it, as there are not that much logs for that part of the code) that the failure is due to a deadlock/race condition that is happening when several jobs are being submitted at the same time to the Flink cluster, even though we have enough slots available in the cluster.
We actually have the error with 52 available task slots, and have 12 jobs that are not scheduled.
Additional information
Flink version: 1.13.1 commit a7f3192
Flink cluster in session mode
2 Job managers using k8s HA mode (resource requests: 2 CPU, 4Gb Ram, limits sets on memory to 4Gb)
50 task managers with 2 slots each (resource requests: 2 CPUs, 2GB Ram. No limits set).
Our Flink cluster is shut down every night, and restarted every morning. The error seems to occur when a lot of jobs needs to be scheduled. The jobs are configured to restore their state, and we do not see any issues for jobs that are being scheduled and run correctly, it seems to really be related to a scheduling issue.
Questions
May it be that the issue described in FLINK-23409 is actually the same, but occurs only when there is a race condition when scheduling several jobs?
Is there any way to increase logging in the scheduler to debug this issue?
Is it a known issue? If yes, is there any workaround/solution to resolve it?
P.S: a while ago, I asked more or less the same question on the ML, but dropped it, I'm sorry if this is considered as cross-asking, it's not intended t. We are just opening a new thread as we have more information and the issue re-occur.

24 hours performance test execution stopped abruptly running in jmeter pod in AKS

I am running load test of 24 hours using Jmeter in Azure Kubernetes service. I am using Throughput shaping timer in my jmx file. No listener is added as part of jmx file.
My test stopped abruptly after 6 or 7 hrs.
jmeter-server.log file under Jmeter slave pod is giving warning --> WARN k.a.j.t.VariableThroughputTimer: No free threads left in worker pool.
Below is snapshot from jmeter-server.log file.
Using Jmeter version - 5.2.1 and Kubernetes version - 1.19.6
I checked, Jmeter pods for master and slaves are continously running(no restart happened) in AKS.
I provided 2GB memory to Jmeter slave pod still load test is stopped abruptly.
I am using log analytics workspace for logging. Checked ContainerLog table not getting error.
Snapshot of JMX file.
Using following elements -> Thread Group, Throughput Controller, Http request Sampler and Throughput Shaping Timer
Please suggest for same.
It looks like your Schedule Feedback Function configuration is wrong in its last parameter
The warning means that the Throughput Shaping Timer attempts to increase the number of threads to reach/maintain the desired concurrency but it doesn't have enough threads in order to do this.
So either increase this Spare threads ration to be closer to 1 if you're using a float value for percentage or increment the absolute value in order to match the number of threads.
Quote from documentation:
Example function call: ${__tstFeedback(tst-name,1,100,10)} , where "tst-name" is name of Throughput Shaping Timer to integrate with, 1 and 100 are starting threads and max allowed threads, 10 is how many spare threads to keep in thread pool. If spare threads parameter is a float value <1, then it is interpreted as a ratio relative to the current estimate of threads needed. If above 1, spare threads is interpreted as an absolute count.
More information: Using JMeter’s Throughput Shaping Timer Plugin
However it doesn't explain the premature termination of the test so ensure that there are no errors in jmeter/k8s logs, one of the possible reasons is that JMeter process is being terminated by OOMKiller

Zombie giant unkillable task blocks Druid at restart

I'm running Apache Druid 0.17 deploying with nohup ./bin/start-nano-quickstart > mylog.log. As the deep storage I am using s3 and I have parquet extension enabled and all work fine. I could ingest with several small spark partitioned parquet datasources from s3 correctly. All the remaining configurations are untouched.
As I tried loading a giant datasource to test the performance and resource usage the task died after a couple of hours because of OutOfMemory.(It was expected)
2020-02-07T17:32:20,519 INFO [task-runner-0-priority-0] org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver - New segment[arc_2016-09-29T12:00:00.000Z_2016-09-29T13:00:00.000Z_2020-02-07T17:22:45.965Z] for sequenceName[index_parallel_arc_chgindko_2020-02-07T14:59:32.337Z].
Terminating due to java.lang.OutOfMemoryError: GC overhead limit exceeded
Now every time I restart Druid, it starts that giant task and it is impossible to kill it. Even when the task apparently dies or turns in waiting status the CPU usage is about 140% and I cannot submit new tasks to Druid. I tried to access the Derby database manually to find the task and remove it but I was not successful and this solution is really nasty. I know that I can change the database in the configuration so the next time I will have a fresh Druid but it is not a good solution as I will miss all other datasources. How can I get ready of this long running zombie task?

GC in a spark/scala job?

I've inherited two spark scala jobs, which have previously been launched sequentially, but independently:
dse spark-submit <job1 params>
dse spark-submit <job2 params>
Both jobs more-or-less max out the system resources, but run successfully with the current --executor-memory/--driver-memory/--memory-fraction settings.
There is considerable overlap between the two jobs, and we would like to merge them into a single task. It seemed reasonable to define a new App which launches each, individually:
def main(args: Array[String]) {
processJob1(args)
processJob2(args)
}
Watching in the spark console, job 1 runs and completes, then job 2 begins but runs into out of memory errors.
In Java, I believe that an OOM error usually means that GC ran, and that there wasn't enough resulting free space. Being new to spark and ignorant, this raises a few questions.
How does spark 'scope' an Rdd? For example, shouldn't any objects
defined in job 1 should be marked for GC before job 2 begins? Would
the OOM failure in this case then indicate a memory leak in job 1?
Will scala block the workers while a GC is going on? Would it even know, or would the jobs just run more slowly?
GC is run by the JVM. Does this mean that something like System.gc() would only trigger a GC on the master node? Is it possible to trigger a GC at the workers?

Spark: long delay between jobs

So we are running spark job that extract data and do some expansive data conversion and writes to several different files. Everything is running fine but I'm getting random expansive delays between resource intensive job finish and next job start.
In below picture, we can see that job that was scheduled at 17:22:02 took 15 min to finish, which means I'm expecting next job to be scheduled around 17:37:02. However, next job was scheduled at 22:05:59, which is +4 hours after job success.
When I dig into next job's spark UI it show <1 sec scheduler delay. So I'm confused to where does this 4 hours long delay is coming from.
(Spark 1.6.1 with Hadoop 2)
Updated:
I can confirm that David's answer below is spot on about how IO ops are handled in Spark is bit unexpected. (It makes sense to that file write essentially does "collect" behind the curtain before it writes considering ordering and/or other operations.) But I'm bit discomforted by the fact that I/O time is not included in job execution time. I guess you can see it in "SQL" tab of spark UI as queries are still running even with all jobs being successful but you cannot dive into it at all.
I'm sure there are more ways to improve but below two methods were sufficient for me:
reduce file count
set parquet.enable.summary-metadata to false
I/O operations often come with significant overhead that will occur on the master node. Since this work isn't parallelized, it can take quite a bit of time. And since it is not a job, it does not show up in the resource manager UI. Some examples of I/O tasks that are done by the master node
Spark will write to temporary s3 directories, then move the files using the master node
Reading of text files often occur on the master node
When writing parquet files, the master node will scan all the files post-write to check the schema
These issues can be solved by tweaking yarn settings or redesigning your code. If you provide some source code, I might be able to pinpoint your issue.
Discussion of writing I/O Overhead with Parquet and s3
Discussion of reading I/O Overhead "s3 is not a filesystem"
Problem:
I faced similar issue when writing parquet data on s3 with pyspark on EMR 5.5.1. All workers would finish writing data in _temporary bucket in output folder & Spark UI would show that all tasks have completed. But Hadoop Resource Manager UI would not release resources for the application neither mark it as complete. On checking s3 bucket, it seemed like spark driver was moving the files 1 by 1 from _temporary directory to output bucket which was extremely slow & all the cluster was idle except Driver node.
Solution:
The solution that worked for me was to use committer class by AWS ( EmrOptimizedSparkSqlParquetOutputCommitter ) by setting the configuration property spark.sql.parquet.fs.optimized.committer.optimization-enabled to true.
e.g.:
spark-submit ....... --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true
or
pyspark ....... --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true
Note that this property is available in EMR 5.19 or higher.
Result:
After running the spark job on EMR 5.20.0 using above solution, it did not create any _temporary directory & all the files were directly written to the output bucket, hence job finished very quickly.
Fore more details:
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3-optimized-committer.html