Spark streaming error: Issue communicating with driver in heartbeater - scala

I'm experimenting an issue with heartbeating when I running my Spark Streaming app.
I know the meaning of heartbeating, I have tried to increase its value in "spark.executor.heartbeatInterval", but the issue it still remaing.
My config is:
4 executors
4 cores per executor
6GB RAM per executor
Spark streaming time window: 30s
Each batch takes between 2s and 28s to complete
In the logs I can see how, suddenly, executors start to log "Issue communicating with driver in heartbeater" and when the it happen X times, the executor shutdown (as the spark doc says).
In the logs I can't see any exception (such as OOM or something about GC). Simply, some time (some hours after starting), heartbeater fails.
I have read about to repartition data to try to solve the issue, but I can't because it is a Kafka Direct appication and each partition is partial ordered so I don't do repartition anytime.
This is the trace I can see:
2018/12/16 13:44:26:317 WARN org.apache.spark.executor.Executor: Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:785)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:814)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988)
at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:814)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 14 more

Related

Program takes a lot of time to end because of this warning Executor: Issue communicating with driver in heartbeater

I have the standalone spark cluster with one master and four workers which must read the large oracle table and process it. Each node has 30G Ram. In average program time is about 1.7 Hours, but sometimes it takes a lot of time because of this warning:
WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10000 milliseconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:103)
at org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:996)
at org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:212)
at org.apache.spark.executor.Executor$$Lambda$356/321695195.apply$mcV$sp(Unknown Source)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1996)
at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:293)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 14 more
I search and added these options to the spark_submit command.But, still receive the same warning.
--conf spark.sql.broadcastTimeout=3600 --conf spark.rpc.message.maxSize=1024 --conf spark.rpc.askTimeout=600s
Also, I set 28G for driver memory and 24G for executor memory. Moreover, I read this post:
https://stackoverflow.com/a/54038675/6640504
As this comment said, it is not right to increase spark.network.timeout and spark.executor.heartbeatInterval to 10000000 with the same value. I am sure the program takes more time to run with this setting.
Would you please guide me how to solve this problem and improve speed of the program?
Any help is really appreciated.

When Kafka Topic partition reassignment, Flink job fails continuously

env
kafka 1.0.1
flink 1.7.1
trouble
I use topic with 200 partitions. and flink uses this topic.
Recently, I do manual partition reassignment.
When i reassigned partitions, Flink continuosly fails with this error.
error1.
[2021-07-28 18:21:15,926] WARN Attempting to send response via channel for which there is no open connection, connection id ..(kafka.network.Processor)
error2.
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 2 record(s) for -126: 30042 ms has passed since batch creation plus linger time
error3.
java.lang.Exception: Error while triggering checkpoint 656 for Source: Custom Source -> Sink: ... (32/200)
at org.apache.flink.runtime.taskmanager.Task$1.run(Task.java:1174)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.Exception: Could not perform checkpoint 656 for operator Source: Custom Source -> Sink: ... (32/200).
at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:570)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.triggerCheckpoint(SourceStreamTask.java:116)
at org.apache.flink.runtime.taskmanager.Task$1.run(Task.java:1163)
... 5 more
Caused by: java.lang.Exception: Could not complete snapshot 656 for operator Source: Custom Source -> Sink: ... (32/200).
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:422)
at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.checkpointStreamOperator(StreamTask.java:1113)
at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1055)
at org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:729)
at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:641)
at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:564)
... 7 more
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 2 record(s) for ...-86: 30049 ms has passed since batch creation plus linger time
And When i restarted failed job, this error occurs continuously.
ClassLoader info: URL ClassLoader:
file: '/blobStore-29c572a3-4ed4-48a6-b604-d93b7e4a9a10/job_8bd41a7e0690e75bd61d148d89dca963/blob_p-5c10d03a5cbb09c9a9459f1bc2a70804d0b08290-26b5562cbe83b0403b06717637e7ab47' (invalid JAR: /blobStore-29c572a3-4ed4-48a6-b604-d93b7e4a9a10/job_8bd41a7e0690e75bd61d148d89dca963/blob_p-5c10d03a5cbb09c9a9459f1bc2a70804d0b08290-26b5562cbe83b0403b06717637e7ab47 (Too many open files))
Class not resolvable through given classloader.
So I restarted all mesos and flink cluster with zookeeper clearance.
Is there any cause to look for?
There were network issues with certain brokers in the cluster.
If a request for a specific partition is processed slowly due to a network issue, it is expected that the message will be displayed.
Subsequently, the job corresponding to the partition does not work properly, and it seems that the checkpoint issue of flink occurs.
This problem was solved by replacing the equipment of the broker.

Spark streaming, kafka broker error, "Failed to get records for spark-executor- .... after polling for 512"

We have a spark streaming application reading data from Kafka.
Data size: 15 Million
Below errors were seen:
java.lang.AssertionError: assertion failed: Failed to get records for spark-executor- after ...polling for 512 at scala.Predef$.assert(Predef.scala:170)
There were more errors seen pertaining to CachedKafkaConsumer
at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)
at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The spark.streaming.kafka.consumer.poll.ms is set to default 512ms and other Kafka stream timeout settings are default.
"request.timeout.ms"
"heartbeat.interval.ms"
"session.timeout.ms"
"max.poll.interval.ms"
Also, the Kafka is being recently updated to 0.10 from 0.8. In 0.8, this behavior was not seen.
No resource issues are seen.
Any pointers?

kafka couchbase sink connector getting disconnected after dumping some records

I've installed confluent_3.3.0 and started zookeper, schema-registry and kafka broker .
And downloaded couchbase connector from below link
https://github.com/couchbase/kafka-connect-couchbase
Running sink connector using below command
./bin/connect-standalone etc/kafka/connect-standalone.properties /home/nayangiri/couch-connect-test/kafka-connect-couchbase/config/quickstart-couchbase-sink.properties
After running connector, I'm starting publishing JSON using kafka-python library.
The problem is, connector is getting disconnected without dumping all published messages with below error
[2017-11-07 20:12:39,815] WARN This transcoder (JsonBinaryTranscoder) does not support mutation tokens - this method is a stub and needs to be implemented on custom transcoders. (com.couchbase.client.java.transcoder.AbstractTranscoder:150)
[2017-11-07 20:12:44,821] WARN This transcoder (JsonBinaryTranscoder) does not support mutation tokens - this method is a stub and needs to be implemented on custom transcoders. (com.couchbase.client.java.transcoder.AbstractTranscoder:150)
[2017-11-07 20:12:44,821] WARN This transcoder (JsonBinaryTranscoder) does not support mutation tokens - this method is a stub and needs to be implemented on custom transcoders. (com.couchbase.client.java.transcoder.AbstractTranscoder:150)
[2017-11-07 20:12:44,823] ERROR Task test-couchbase-sink-1 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerSinkTask:455)
com.couchbase.client.java.error.CannotRetryException: maximum number of attempts reached after 5 retries
at com.couchbase.client.java.util.retry.RetryWithDelayHandler.call(RetryWithDelayHandler.java:101)
at com.couchbase.client.java.util.retry.RetryWithDelayHandler.call(RetryWithDelayHandler.java:42)
at rx.internal.operators.OnSubscribeMap$MapSubscriber.onNext(OnSubscribeMap.java:69)
at rx.internal.operators.OperatorZip$Zip.tick(OperatorZip.java:252)
at rx.internal.operators.OperatorZip$Zip$InnerSubscriber.onNext(OperatorZip.java:323)
at rx.internal.operators.OnSubscribeMap$MapSubscriber.onNext(OnSubscribeMap.java:77)
at rx.internal.operators.OnSubscribeRedo$3$1.onNext(OnSubscribeRedo.java:302)
at rx.internal.operators.OnSubscribeRedo$3$1.onNext(OnSubscribeRedo.java:284)
at rx.internal.operators.NotificationLite.accept(NotificationLite.java:135)
at rx.subjects.SubjectSubscriptionManager$SubjectObserver.emitNext(SubjectSubscriptionManager.java:253)
at rx.subjects.BehaviorSubject.onNext(BehaviorSubject.java:160)
at rx.observers.SerializedObserver.onNext(SerializedObserver.java:91)
at rx.subjects.SerializedSubject.onNext(SerializedSubject.java:67)
at rx.internal.operators.OnSubscribeRedo$2$1.onError(OnSubscribeRedo.java:237)
at rx.internal.operators.OperatorMerge$MergeSubscriber.reportError(OperatorMerge.java:266)
at rx.internal.operators.OperatorMerge$MergeSubscriber.checkTerminate(OperatorMerge.java:818)
at rx.internal.operators.OperatorMerge$MergeSubscriber.emitLoop(OperatorMerge.java:579)
at rx.internal.operators.OperatorMerge$MergeSubscriber.emit(OperatorMerge.java:568)
at rx.internal.operators.OperatorMerge$InnerSubscriber.onError(OperatorMerge.java:852)
at rx.internal.operators.OnSubscribeMap$MapSubscriber.onError(OnSubscribeMap.java:88)
at rx.internal.operators.OnSubscribeMap$MapSubscriber.onNext(OnSubscribeMap.java:73)
at rx.observers.Subscribers$5.onNext(Subscribers.java:235)
at rx.internal.operators.OnSubscribeDoOnEach$DoOnEachSubscriber.onNext(OnSubscribeDoOnEach.java:101)
at rx.internal.producers.SingleProducer.request(SingleProducer.java:65)
at rx.Subscriber.setProducer(Subscriber.java:211)
at rx.internal.operators.OnSubscribeMap$MapSubscriber.setProducer(OnSubscribeMap.java:102)
at rx.Subscriber.setProducer(Subscriber.java:205)
at rx.Subscriber.setProducer(Subscriber.java:205)
at rx.subjects.AsyncSubject.onCompleted(AsyncSubject.java:103)
at com.couchbase.client.core.endpoint.AbstractGenericHandler.completeResponse(AbstractGenericHandler.java:390)
at com.couchbase.client.core.endpoint.AbstractGenericHandler.access$000(AbstractGenericHandler.java:72)
at com.couchbase.client.core.endpoint.AbstractGenericHandler$1.call(AbstractGenericHandler.java:408)
at rx.internal.schedulers.ScheduledAction.run(ScheduledAction.java:55)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.UnsupportedOperationException
at com.couchbase.connect.kafka.util.JsonBinaryTranscoder.newDocument(JsonBinaryTranscoder.java:40)
at com.couchbase.connect.kafka.util.JsonBinaryTranscoder.newDocument(JsonBinaryTranscoder.java:30)
at com.couchbase.client.java.transcoder.AbstractTranscoder.newDocument(AbstractTranscoder.java:133)
at com.couchbase.client.java.CouchbaseAsyncBucket$16.call(CouchbaseAsyncBucket.java:568)
at com.couchbase.client.java.CouchbaseAsyncBucket$16.call(CouchbaseAsyncBucket.java:560)
at rx.internal.operators.OnSubscribeMap$MapSubscriber.onNext(OnSubscribeMap.java:69)
... 19 more
Caused by: rx.exceptions.OnErrorThrowable$OnNextValue: OnError while emitting onNext value: com.couchbase.client.core.message.kv.UpsertResponse.class
at rx.exceptions.OnErrorThrowable.addValueAsLastCause(OnErrorThrowable.java:118)
at rx.internal.operators.OnSubscribeMap$MapSubscriber.onNext(OnSubscribeMap.java:73)
... 19 more
[2017-11-07 20:12:44,830] ERROR Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerSinkTask:456)
[2017-11-07 20:12:44,830] ERROR Task test-couchbase-sink-1 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:148)
org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:457)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:251)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:180)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:148)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:146)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:190)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[2017-11-07 20:12:44,831] **ERROR Task is being killed and will not recover until manually restarted** (org.apache.kafka.connect.runtime.WorkerTask:149)
[2017-11-07 20:12:44,836] INFO Closed bucket test (com.couchbase.client.core.config.ConfigurationProvider:115)
[2017-11-07 20:12:44,836] INFO Disconnected from Node 10.103.2.76/localhost (com.couchbase.client.core.node.Node:115)
[2017-11-07 20:12:44,839] INFO [null][KeyValueEndpoint]: Got notified from Channel as inactive, attempting reconnect. (com.couchbase.client.core.endpoint.Endpoint:115)
Thank you for Reading
Thanks for raising this issue. This is a regression in version 3.2.0 of the connector. It is being tracked as KAFKAC-83.
The fix is included in version 3.2.1, scheduled for release on November 21, 2017 released on November 8, 2017.
In the meantime you may wish to temporarily downgrade to version 3.1.3, or build the connector from the latest source code.
PSA: The Couchbase forums have a dedicated section for discussion related to the Kafka connector.

How to investigate failing dataproc worker processes?

I'm running a PySpark job and Im having trouble determining the cause of failure on worker processes.
While my job is running I started noticing stack traces in the job output such as:
16/04/10 03:24:21 WARN org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1460240417530_0021_01_000003 on host: cluster-2-w-0.c.my-project.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
[Stage 0:=================================> (19 + 13) / 32]16/04/10 03:26:21 WARN org.apache.spark.rpc.netty.NettyRpcEndpointRef: Error sending message [message = RemoveExecutor(2,Container marked as failed: container_1460240417530_0021_01_000003 on host: cluster-2-w-0.c.my-project.internal. Exit status: -100. Diagnostics: Container released on a *lost* node)] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.removeExecutor(CoarseGrainedSchedulerBackend.scala:359)
at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$receive$1.applyOrElse(YarnSchedulerBackend.scala:176)
at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 11 more
16/04/10 03:26:40 WARN org.apache.spark.rpc.netty.NettyRpcEndpointRef: Error sending message [message = RequestExecutors(23,0,Map())] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185)
at scala.util.Try$.apply(Try.scala:161)
at scala.util.Failure.recover(Try.scala:185)
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at scala.concurrent.Promise$class.complete(Promise.scala:55)
at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.processBatch$1(Future.scala:643)
at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply$mcV$sp(Future.scala:658)
at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at scala.concurrent.Future$InternalCallbackExecutor$Batch.run(Future.scala:634)
at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:685)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:241)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply in 120 seconds
at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:242)
... 7 more
[Stage 0:=================================> (19 + 13) / 32]
Ill also notice the overall CPU usage of the cluster slowly drop as worker nodes fail. These nodes seem to permanently fail and do not re-join the cluster:
I'm using preemtible machines but when I check the status of these machines they are still running and have not been preempted. So I'm guessing its something wrong on the worker:
It could be because of the heavy workload in the workers. Try to increase spark.network.timeout (default 120) to a bigger number.
If that is not resolving the error, most likely causes are garbage collection. Try to run a memory profile with the following options.
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/ -XX:+CMSClassUnloadingEnabled