PySpark on Dataproc stops with SocketTimeoutException

We are currently trying to run a Spark job on a Dataproc cluster using PySpark 2.2.0, but the Spark job stops after a seemingly random amount of time with the following error message:
17/07/25 00:52:48 ERROR org.apache.spark.api.python.PythonRDD: Error while sending iterator
java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
at java.net.ServerSocket.implAccept(ServerSocket.java:545)
at java.net.ServerSocket.accept(ServerSocket.java:513)
at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:702)
The error can take anywhere from a couple of minutes to 3 hours to appear. From personal experience, the Spark job usually runs for about 30 minutes to 1 hour before hitting the error.
Once the Spark job hits the error, it just stops; no matter how long I wait, it outputs nothing. In the YARN ResourceManager, the application status is still labeled "RUNNING", and I must press Ctrl+C to terminate the program. At that point, the application is labeled "FINISHED".
I run the Spark job from the master node's console with /path/to/spark/bin/spark-submit --jars /path/to/jar/spark-streaming-kafka-0-8-assembly_2.11-2.2.0.jar spark_job.py. The JAR file is necessary because the Spark job streams messages from Kafka (running on the same cluster as the Spark job) and pushes some messages back to the same Kafka on a different topic.
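For context, here is a minimal sketch of roughly what such a job looks like (PySpark 2.2 with the 0.8 receiver API); the topic names, hosts, consumer group, and the kafka-python producer used for the write-back are illustrative assumptions, not the actual code:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def push_partition(records):
    # Assumes the kafka-python package is available on the workers.
    from kafka import KafkaProducer
    producer = KafkaProducer(bootstrap_servers="worker-0:9092")
    for _, value in records:
        producer.send("output-topic", value.encode("utf-8"))
    producer.flush()

sc = SparkContext(appName="spark_job")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

# ZooKeeper quorum, consumer group ID, and {topic: number of receiver threads}
stream = KafkaUtils.createStream(
    ssc, "worker-0:2181", "spark-consumer-group", {"input-topic": 1})
stream.foreachRDD(lambda rdd: rdd.foreachPartition(push_partition))

ssc.start()
ssc.awaitTermination()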
I've already looked at some other answers on this site (primarily this and this) and they have been somewhat helpful, but we haven't been able to track down where in the logs the cause of the executor deaths might be recorded. So far, I've monitored the nodes during the job through the YARN ResourceManager and gone through the logs in the /var/logs/hadoop-yarn directory on every node. The only "clue" I could find was org.apache.spark.executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM, which is the only line written to the dead executor's logs.
As a last-ditch effort, we tried increasing the cluster's memory in the hope that the issue would just go away, but it hasn't. Originally, the cluster had 1 master and 2 workers with 4 vCPUs and 15 GB of memory. We created a new Dataproc cluster, this time with 1 master and 3 workers, the workers each having 8 vCPUs and 52 GB of memory (the master has the same specs as before).
What we would like to know is:
1. Where/how can I see the exception that is causing the executors to be terminated?
2. Is this an issue with how Spark is configured?
3. Dataproc image version is "preview". Could that possibly be the cause of the error?
and ultimately,
4. How do we resolve this issue? What other steps can we take?
This Spark job needs to continuously stream from Kafka for an indefinite amount of time so we would like this error to be fixed rather than prolonging the time it takes for the error to occur.
Here are some screenshots from the YARN ResourceManager to demonstrate what we are seeing:
Cluster Metrics
Executor Summary
The screenshots were taken before the Spark job stopped due to the error.
And this is the Spark configuration file, located at /path/to/spark/conf/spark-defaults.conf (we did not change anything from the defaults set by Dataproc):
spark.master yarn
spark.submit.deployMode client
spark.yarn.jars=local:/usr/lib/spark/jars/*
spark.eventLog.enabled true
spark.eventLog.dir hdfs://highmem-m/user/spark/eventlog
# Dynamic allocation on YARN
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 1
spark.executor.instances 10000
spark.dynamicAllocation.maxExecutors 10000
spark.shuffle.service.enabled true
spark.scheduler.minRegisteredResourcesRatio 0.0
spark.yarn.historyServer.address highmem-m:18080
spark.history.fs.logDirectory hdfs://highmem-m/user/spark/eventlog
spark.executor.cores 2
spark.executor.memory 4655m
spark.yarn.executor.memoryOverhead 465
# Overkill
spark.yarn.am.memory 4655m
spark.yarn.am.memoryOverhead 465
spark.driver.memory 3768m
spark.driver.maxResultSize 1884m
spark.rpc.message.maxSize 512
# Add ALPN for Bigtable
spark.driver.extraJavaOptions
spark.executor.extraJavaOptions
# Disable Parquet metadata caching as its URI re-encoding logic does
# not work for GCS URIs (b/28306549). The net effect of this is that
# Parquet metadata will be read both driver side and executor side.
spark.sql.parquet.cacheMetadata=false
# User-supplied properties.
#Mon Jul 24 23:12:12 UTC 2017
spark.executor.cores=4
spark.executor.memory=18619m
spark.driver.memory=3840m
spark.driver.maxResultSize=1920m
spark.yarn.am.memory=640m
spark.executorEnv.PYTHONHASHSEED=0
I'm not quite sure where the User-supplied properties came from.
Edit:
Some additional information about the clusters:
I use the zookeeper, kafka, and jupyter initialization action scripts found at https://github.com/GoogleCloudPlatform/dataproc-initialization-actions in the order of zookeeper -> kafka -> jupyter (unfortunately I don't have enough reputation to post more than 2 links at the moment)
Edit 2:
Prompted by @Dennis's insightful questions, we ran the Spark job while paying particular attention to the executors with higher On Heap Storage Memory usage. What I noticed is that it is always the executors from worker #0 that have significantly higher storage memory usage compared to the other executors. The stdout files for worker #0's executors are always empty. These three lines are repeated many times over in stderr:
17/07/27 16:32:01 INFO kafka.utils.VerifiableProperties: Verifying properties
17/07/27 16:32:01 INFO kafka.utils.VerifiableProperties: Property group.id is overridden to
17/07/27 16:32:01 INFO kafka.utils.VerifiableProperties: Property zookeeper.connect is overridden to
17/07/27 16:32:04 INFO kafka.utils.VerifiableProperties: Verifying properties
17/07/27 16:32:04 INFO kafka.utils.VerifiableProperties: Property group.id is overridden to
17/07/27 16:32:04 INFO kafka.utils.VerifiableProperties: Property zookeeper.connect is overridden to
17/07/27 16:32:07 INFO kafka.utils.VerifiableProperties: Verifying properties
17/07/27 16:32:07 INFO kafka.utils.VerifiableProperties: Property group.id is overridden to
17/07/27 16:32:07 INFO kafka.utils.VerifiableProperties: Property zookeeper.connect is overridden to
17/07/27 16:32:09 INFO kafka.utils.VerifiableProperties: Verifying properties
17/07/27 16:32:09 INFO kafka.utils.VerifiableProperties: Property group.id is overridden to
17/07/27 16:32:09 INFO kafka.utils.VerifiableProperties: Property zookeeper.connect is overridden to
17/07/27 16:32:10 INFO kafka.utils.VerifiableProperties: Verifying properties
17/07/27 16:32:10 INFO kafka.utils.VerifiableProperties: Property group.id is overridden to
17/07/27 16:32:10 INFO kafka.utils.VerifiableProperties: Property zookeeper.connect is overridden to
17/07/27 16:32:13 INFO kafka.utils.VerifiableProperties: Verifying properties
17/07/27 16:32:13 INFO kafka.utils.VerifiableProperties: Property group.id is overridden to
17/07/27 16:32:13 INFO kafka.utils.VerifiableProperties: Property zookeeper.connect is overridden to
17/07/27 16:32:14 INFO kafka.utils.VerifiableProperties: Verifying properties
17/07/27 16:32:14 INFO kafka.utils.VerifiableProperties: Property group.id is overridden to
17/07/27 16:32:14 INFO kafka.utils.VerifiableProperties: Property zookeeper.connect is overridden to
17/07/27 16:32:15 INFO kafka.utils.VerifiableProperties: Verifying properties
17/07/27 16:32:15 INFO kafka.utils.VerifiableProperties: Property group.id is overridden to
17/07/27 16:32:15 INFO kafka.utils.VerifiableProperties: Property zookeeper.connect is overridden to
17/07/27 16:32:18 INFO kafka.utils.VerifiableProperties: Verifying properties
17/07/27 16:32:18 INFO kafka.utils.VerifiableProperties: Property group.id is overridden to
17/07/27 16:32:18 INFO kafka.utils.VerifiableProperties: Property zookeeper.connect is overridden to
It seems to be repeating every 1~3 seconds.
The stdout and stderr files of the executors on the other worker nodes are empty.
Edit 3:
As mentioned in @Dennis's comments, we had kept the Kafka topic the Spark job was consuming from at a replication factor of 1. I also found that I had forgotten to add worker #2 to zookeeper.connect in the Kafka config file, and that I had forgotten to give the Spark consumer that streams messages from Kafka a group ID. I've fixed both (and remade the topic with a replication factor of 3) and observed that the workload now falls mainly on worker #1. Following @Dennis's suggestions, I ran sudo jps after SSH-ing into worker #1 and got the following output:
[Removed this section to save character space; it was only the error messages from a failed call to jmap so it didn't hold any useful information]
Edit 4:
I'm now seeing this in worker #1 executors' stdout files:
2017-07-27 22:16:24
Full thread dump OpenJDK 64-Bit Server VM (25.131-b11 mixed mode):
===Truncated===
Heap
PSYoungGen total 814592K, used 470009K [0x000000063c180000, 0x000000069e600000, 0x00000007c0000000)
eden space 799744K, 56% used [0x000000063c180000,0x0000000657e53598,0x000000066ce80000)
from space 14848K, 97% used [0x000000069d780000,0x000000069e5ab1b8,0x000000069e600000)
to space 51200K, 0% used [0x0000000698200000,0x0000000698200000,0x000000069b400000)
ParOldGen total 574464K, used 180616K [0x0000000334400000, 0x0000000357500000, 0x000000063c180000)
object space 574464K, 31% used [0x0000000334400000,0x000000033f462240,0x0000000357500000)
Metaspace used 49078K, capacity 49874K, committed 50048K, reserved 1093632K
class space used 6054K, capacity 6263K, committed 6272K, reserved 1048576K
and
2017-07-27 22:06:44
Full thread dump OpenJDK 64-Bit Server VM (25.131-b11 mixed mode):
===Truncated===
Heap
PSYoungGen total 608768K, used 547401K [0x000000063c180000, 0x000000066a280000, 0x00000007c0000000)
eden space 601088K, 89% used [0x000000063c180000,0x000000065d09c498,0x0000000660c80000)
from space 7680K, 99% used [0x0000000669b00000,0x000000066a2762c8,0x000000066a280000)
to space 36864K, 0% used [0x0000000665a80000,0x0000000665a80000,0x0000000667e80000)
ParOldGen total 535552K, used 199304K [0x0000000334400000, 0x0000000354f00000, 0x000000063c180000)
object space 535552K, 37% used [0x0000000334400000,0x00000003406a2340,0x0000000354f00000)
Metaspace used 48810K, capacity 49554K, committed 49792K, reserved 1093632K
class space used 6054K, capacity 6263K, committed 6272K, reserved 1048576K
When the error happened, an executor from worker #2 received SIGNAL TERM and was labeled as dead. At this time, it was the only dead executor.
Strangely, the Spark job picked back up again after 10 minutes or so. Looking at the Spark UI, only executors from worker #1 are active and the rest are dead. This is the first time this has happened.
Edit 5:
Again following @Dennis's suggestions (thank you, @Dennis!), this time I ran sudo -u yarn jmap -histo <pid>. These are the top 10 most memory-hogging classes in CoarseGrainedExecutorBackend after about 10 minutes:
num #instances #bytes class name
----------------------------------------------
1: 244824 358007944 [B
2: 194242 221184584 [I
3: 2062554 163729952 [C
4: 746240 35435976 [Ljava.lang.Object;
5: 738 24194592 [Lorg.apache.spark.unsafe.memory.MemoryBlock;
6: 975513 23412312 java.lang.String
7: 129645 13483080 java.io.ObjectStreamClass
8: 451343 10832232 java.lang.StringBuilder
9: 38880 10572504 [Z
10: 120807 8698104 java.lang.reflect.Field
Also, I've encountered a new type of error which caused an executor to die. It produced some failed tasks highlighted in the Spark UI, and I found this in the executor's stderr:
17/07/28 00:44:03 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 6821.0 (TID 2585)
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:156)
at org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
at org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:367)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:366)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:366)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:361)
at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:736)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:342)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
17/07/28 00:44:03 ERROR org.apache.spark.executor.Executor: Exception in task 0.1 in stage 6821.0 (TID 2586)
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:156)
at org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
at org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:367)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:366)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:366)
at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:361)
at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:736)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:342)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
17/07/28 00:44:03 ERROR org.apache.spark.util.Utils: Uncaught exception in thread stdout writer for /opt/conda/bin/python
java.lang.AssertionError: assertion failed: Block rdd_5480_0 is not locked for reading
at scala.Predef$.assert(Predef.scala:170)
at org.apache.spark.storage.BlockInfoManager.unlock(BlockInfoManager.scala:299)
at org.apache.spark.storage.BlockManager.releaseLock(BlockManager.scala:720)
at org.apache.spark.storage.BlockManager$$anonfun$1.apply$mcV$sp(BlockManager.scala:516)
at org.apache.spark.util.CompletionIterator$$anon$1.completion(CompletionIterator.scala:46)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:35)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:333)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
17/07/28 00:44:03 ERROR org.apache.spark.util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[stdout writer for /opt/conda/bin/python,5,main]
java.lang.AssertionError: assertion failed: Block rdd_5480_0 is not locked for reading
at scala.Predef$.assert(Predef.scala:170)
at org.apache.spark.storage.BlockInfoManager.unlock(BlockInfoManager.scala:299)
at org.apache.spark.storage.BlockManager.releaseLock(BlockManager.scala:720)
at org.apache.spark.storage.BlockManager$$anonfun$1.apply$mcV$sp(BlockManager.scala:516)
at org.apache.spark.util.CompletionIterator$$anon$1.completion(CompletionIterator.scala:46)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:35)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:333)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Edit 6:
This time, I took the jmap after 40 minutes of running:
num #instances #bytes class name
----------------------------------------------
1: 23667 391136256 [B
2: 25937 15932728 [I
3: 159174 12750016 [C
4: 334 10949856 [Lorg.apache.spark.unsafe.memory.MemoryBlock;
5: 78437 5473992 [Ljava.lang.Object;
6: 125322 3007728 java.lang.String
7: 40931 2947032 java.lang.reflect.Field
8: 63431 2029792 com.esotericsoftware.kryo.Registration
9: 20897 1337408 com.esotericsoftware.kryo.serializers.UnsafeCacheFields$UnsafeObjectField
10: 20323 975504 java.util.HashMap
These are the results of ps ux:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
yarn 601 0.8 0.9 3008024 528812 ? Sl 16:12 1:17 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Dproc_nodema
yarn 6086 6.3 0.0 96764 24340 ? R 18:37 0:02 /opt/conda/bin/python -m pyspark.daemon
yarn 8036 8.2 0.0 96296 24136 ? S 18:37 0:00 /opt/conda/bin/python -m pyspark.daemon
yarn 8173 9.4 0.0 97108 24444 ? S 18:37 0:00 /opt/conda/bin/python -m pyspark.daemon
yarn 8240 9.0 0.0 96984 24576 ? S 18:37 0:00 /opt/conda/bin/python -m pyspark.daemon
yarn 8329 7.6 0.0 96948 24720 ? S 18:37 0:00 /opt/conda/bin/python -m pyspark.daemon
yarn 8420 8.5 0.0 96240 23788 ? R 18:37 0:00 /opt/conda/bin/python -m pyspark.daemon
yarn 8487 6.0 0.0 96864 24308 ? S 18:37 0:00 /opt/conda/bin/python -m pyspark.daemon
yarn 8554 0.0 0.0 96292 23724 ? S 18:37 0:00 /opt/conda/bin/python -m pyspark.daemon
yarn 8564 0.0 0.0 19100 2448 pts/0 R+ 18:37 0:00 ps ux
yarn 31705 0.0 0.0 13260 2756 ? S 17:56 0:00 bash /hadoop/yarn/nm-local-dir/usercache/<user_name>/app
yarn 31707 0.0 0.0 13272 2876 ? Ss 17:56 0:00 /bin/bash -c /usr/lib/jvm/java-8-openjdk-amd64/bin/java
yarn 31713 0.4 0.7 2419520 399072 ? Sl 17:56 0:11 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -server -Xmx6
yarn 31771 0.0 0.0 13260 2740 ? S 17:56 0:00 bash /hadoop/yarn/nm-local-dir/usercache/<user_name>/app
yarn 31774 0.0 0.0 13284 2800 ? Ss 17:56 0:00 /bin/bash -c /usr/lib/jvm/java-8-openjdk-amd64/bin/java
yarn 31780 11.1 1.4 21759016 752132 ? Sl 17:56 4:31 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -server -Xmx1
yarn 31883 0.1 0.0 96292 27308 ? S 17:56 0:02 /opt/conda/bin/python -m pyspark.daemon
The PID of the CoarseGrainedExecutorBackend is 31780 in this case.
Edit 7:
Increasing heartbeatInterval in the Spark settings did not change anything, which makes sense in hindsight.
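For reference, timeout-related settings like this can be raised either in spark-defaults.conf or programmatically; a rough sketch (the values are illustrative, and spark.network.timeout is included only as a commonly paired knob, not something from our actual test):
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.executor.heartbeatInterval", "60s")  # default is 10s
        .set("spark.network.timeout", "600s"))           # default is 120s; must exceed the heartbeat interval
sc = SparkContext(conf=conf)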
I created a short bash script that reads from Kafka with the console consumer for 5 seconds and writes the messages into a text file. The text file is uploaded to Hadoop, and Spark streams from there. We used this method to test whether the timeout was related to Kafka (a rough sketch of the capture step appears after the results below):
Streaming from Hadoop and outputting to Kafka from Spark caused SocketTimeout
Streaming from Kafka directly and not outputting to Kafka from Spark caused SocketTimeout
Streaming from Hadoop and not outputting to Kafka from Spark caused SocketTimeout
So we moved on with the assumption that Kafka had nothing to do with the timeout.
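For reference, a rough sketch of the capture step described above, written in Python here; the console-consumer path, flags, topic, and HDFS destination are illustrative and depend on the Kafka version installed by the init action:
import subprocess
import time

# Capture roughly 5 seconds of messages from the topic into a text file.
with open("kafka_snapshot.txt", "w") as out:
    consumer = subprocess.Popen(
        ["/usr/lib/kafka/bin/kafka-console-consumer.sh",
         "--zookeeper", "localhost:2181", "--topic", "input-topic"],
        stdout=out)
    time.sleep(5)
    consumer.terminate()
    consumer.wait()

# Upload the snapshot so the Spark job can stream from HDFS instead of Kafka.
subprocess.check_call(["hdfs", "dfs", "-put", "-f", "kafka_snapshot.txt", "/user/test/"])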
We installed Stackdriver Monitoring to see memory usage as the Timeout occurred. Nothing really interesting from the metrics; memory usage looked relatively stable throughout (hovering around 10~15% at most for the busiest nodes).
We guessed that something to do with the communication between the worker nodes might be causing the issue. Right now our data traffic is very low, so even a single worker can handle the entire workload with relative ease.
Running the Spark job on a single-node cluster while streaming from Kafka brokers on a different cluster seemed to stop the SocketTimeout... except that the AssertionError documented above now occurs frequently.
Per @Dennis's suggestion, I created a new cluster (also single node), this time without the jupyter initialization script, which means Spark now runs on Python 2.7.9 (without Anaconda). On the first run, Spark hit the SocketTimeoutException in just 15 seconds. The second run lasted just over 2 hours before failing with the same AssertionError. The third run lasted about 40 minutes and then ran into the SocketTimeoutException again. I'm starting to wonder if this is a problem with Spark's internals.

A client of mine was seeing various production PySpark jobs (Spark version 2.2.1) fail intermittently on Google Cloud Dataproc with a stack trace very similar to yours:
ERROR org.apache.spark.api.python.PythonRDD: Error while sending iterator
java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
at java.net.ServerSocket.implAccept(ServerSocket.java:545)
at java.net.ServerSocket.accept(ServerSocket.java:513)
at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:711)
I found that disabling IPv6 on the Dataproc cluster VMs seemed to fix the issue. One way to do that is to add these lines to a Dataproc init script so they run at cluster creation time:
printf "\nnet.ipv6.conf.default.disable_ipv6 = 1\nnet.ipv6.conf.all.disable_ipv6=1\n" >> /etc/sysctl.conf
sysctl -p

Related

What does the message "The Critical Analyzer detected slow paths on the broker" mean in the Artemis broker?

Setup: I have an Artemis broker HA cluster with 3 brokers. The HA policy is replication. Each broker runs in its own VM.
Problem: When I leave my brokers running for a long time, usually after 5-6 hours, I get the error below.
2022-11-21 21:32:37,902 WARN  [org.apache.activemq.artemis.utils.critical.CriticalMeasure] Component org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager is expired on path 0
2022-11-21 21:32:37,902 INFO  [org.apache.activemq.artemis.core.server] AMQ224107: The Critical Analyzer detected slow paths on the broker. It is recommended that you enable trace logs on org.apache.activemq.artemis.utils.critical while you troubleshoot this issue. You should disable the trace logs when you have finished troubleshooting.
2022-11-21 21:32:37,902 ERROR [org.apache.activemq.artemis.core.server] AMQ224079: The process for the virtual machine will be killed, as component org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager@46d59067 is not responsive
2022-11-21 21:32:37,969 WARN  [org.apache.activemq.artemis.core.server] AMQ222199: Thread dump:
*******************************************************************************
Complete Thread dump
"Thread-517 (ActiveMQ-IO-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$7@437da279)" Id=602 TIMED_WAITING on java.util.concurrent.SynchronousQueue$TransferStack@75f49105
at sun.misc.Unsafe.park(Native Method)
- waiting on java.util.concurrent.SynchronousQueue$TransferStack@75f49105
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
What does this really mean? I understand that the critical analyzer sees an error and halts the broker, but what is causing this error?
You may take a look at the documentation. Basically, you are experiencing some issue that the broker detects, and it shuts itself down before it becomes unresponsive. By setting the critical analyzer policy to LOG you might get more clues about the issue.

Flink job fails continuously when Kafka topic partitions are reassigned

Environment:
kafka 1.0.1
flink 1.7.1
Trouble:
I use a topic with 200 partitions, and Flink consumes from this topic.
Recently, I did a manual partition reassignment.
When I reassigned the partitions, Flink continuously failed with these errors.
error1.
[2021-07-28 18:21:15,926] WARN Attempting to send response via channel for which there is no open connection, connection id ..(kafka.network.Processor)
error2.
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 2 record(s) for -126: 30042 ms has passed since batch creation plus linger time
error3.
java.lang.Exception: Error while triggering checkpoint 656 for Source: Custom Source -> Sink: ... (32/200)
at org.apache.flink.runtime.taskmanager.Task$1.run(Task.java:1174)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.Exception: Could not perform checkpoint 656 for operator Source: Custom Source -> Sink: ... (32/200).
at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:570)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.triggerCheckpoint(SourceStreamTask.java:116)
at org.apache.flink.runtime.taskmanager.Task$1.run(Task.java:1163)
... 5 more
Caused by: java.lang.Exception: Could not complete snapshot 656 for operator Source: Custom Source -> Sink: ... (32/200).
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:422)
at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.checkpointStreamOperator(StreamTask.java:1113)
at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1055)
at org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:729)
at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:641)
at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:564)
... 7 more
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 2 record(s) for ...-86: 30049 ms has passed since batch creation plus linger time
And when I restarted the failed job, this error occurred continuously:
ClassLoader info: URL ClassLoader:
file: '/blobStore-29c572a3-4ed4-48a6-b604-d93b7e4a9a10/job_8bd41a7e0690e75bd61d148d89dca963/blob_p-5c10d03a5cbb09c9a9459f1bc2a70804d0b08290-26b5562cbe83b0403b06717637e7ab47' (invalid JAR: /blobStore-29c572a3-4ed4-48a6-b604-d93b7e4a9a10/job_8bd41a7e0690e75bd61d148d89dca963/blob_p-5c10d03a5cbb09c9a9459f1bc2a70804d0b08290-26b5562cbe83b0403b06717637e7ab47 (Too many open files))
Class not resolvable through given classloader.
So I restarted the entire Mesos and Flink cluster and cleared ZooKeeper.
Is there any cause I should be looking for?
There were network issues with certain brokers in the cluster.
If a request for a specific partition is processed slowly due to a network issue, that message is expected.
Subsequently, the tasks consuming that partition stop working properly, which appears to trigger the Flink checkpoint issue.
The problem was solved by replacing the affected broker's hardware.

Kafka broker v2.1.0 gets into a stuck state after LeaderEpochCache truncation

We are running a 3-broker Kafka cluster (v2.11-2.1.0) on r4.xlarge machines on AWS. We have CPU usage of up to 85% and memory usage nearing 99% (including I/O buffers). We have a 3-node ZooKeeper cluster. We have a load of around 80-90k messages/sec.
broker java env:
java -Xmx7982m -Xms7770m -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -Djava.awt.headless=true -Xloggc:/home/kafka/kafka/bin/../logs/kafkaServer-gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M
Usually producers and consumers don't face any issues, but intermittently the warnings below are written to "server.log":
[2019-05-27 20:21:53,211] WARN [LeaderEpochCache pulse-flattening-errors-23] New epoch entry EpochEntry(epoch=2, startOffset=0) caused truncation of conflicting entries ListBuffer(EpochEntry(epoch=1, startOffset=0)). Cache now contains 1 entries. (kafka.server.epoch.LeaderEpochFileCache)
[2019-05-27 20:23:45,175] WARN [LeaderEpochCache pulse-21-feb-error-message-16] New epoch entry EpochEntry(epoch=4, startOffset=1833) caused truncation of conflicting entries ListBuffer(EpochEntry(epoch=3, startOffset=1833)). Cache now contains 2 entries. (kafka.server.epoch.LeaderEpochFileCache)
[2019-05-27 20:23:45,269] WARN [LeaderEpochCache pulse-21-feb-error-message-28] New epoch entry EpochEntry(epoch=4, startOffset=1525) caused truncation of conflicting entries ListBuffer(EpochEntry(epoch=3, startOffset=1525)). Cache now contains 2 entries. (kafka.server.epoch.LeaderEpochFileCache)
[2019-05-27 20:23:45,339] WARN [LeaderEpochCache pulse-21-feb-error-message-4] New epoch entry EpochEntry(epoch=4, startOffset=1427) caused truncation of conflicting entries ListBuffer(EpochEntry(epoch=3, startOffset=1427)). Cache now contains 2 entries. (kafka.server.epoch.LeaderEpochFileCache)
[2019-05-27 20:23:45,529] WARN [LeaderEpochCache pulse-21-feb-error-message-10] New epoch entry EpochEntry(epoch=4, startOffset=2430) caused truncation of conflicting entries ListBuffer(EpochEntry(epoch=3, startOffset=2430)). Cache now contains 2 entries. (kafka.server.epoch.LeaderEpochFileCache)
[2019-05-27 20:23:45,577] WARN [LeaderEpochCache pulse-21-feb-error-message-22] New epoch entry EpochEntry(epoch=4, startOffset=1802) caused truncation of conflicting entries ListBuffer(EpochEntry(epoch=3, startOffset=1802)). Cache now contains 2 entries. (kafka.server.epoch.LeaderEpochFileCache)
Observations:
The above warnings occur on the leader of a particular partition.
The broker, while still listening on port 9092, stops responding to any producer requests.
Producers time out and are not able to send messages to the cluster.
Other brokers fail replication, hence they also get stuck.
A new leader is not elected.
All producers fail.
I have checked the config for the topics and partitions; nothing seems to be out of the ordinary. We are mostly maintaining the defaults of the Apache Kafka distribution v2.11-2.1.0.
Unclean leader election is also set to false for the cluster as well as the topics.
This has been happening intermittently and repeatedly.
I have checked issues.apache.org for related issues, could not find any relevant matching issue.
This is making our cluster unstable and causes full downtime for producers. We don't have any clue which direction to start looking in.

Losing executors when saving parquet file

I have loaded a dataset which is only around 20 GB in size; the cluster has about 1 TB available, so memory shouldn't be an issue, IMHO.
It is no problem for me to save the original data which consists only of strings:
df_data.write.parquet(os.path.join(DATA_SET_BASE, 'concatenated.parquet'), mode='overwrite')
However, as I transform the data:
df_transformed = df_data.drop('bri').join(
    df_data[['docId', 'bri']].rdd\
        .map(lambda x: (x.docId, json.loads(x.bri))
             if x.bri is not None else (x.docId, dict()))\
        .toDF()\
        .withColumnRenamed('_1', 'docId')\
        .withColumnRenamed('_2', 'bri'),
    ['dokumentId']
)
and then save it:
df_transformed.write.parquet(os.path.join(DATA_SET_BASE, 'concatenated.parquet'), mode='overwrite')
The log output will tell me that the memory limit was exceeded:
18/03/08 10:23:09 WARN TaskSetManager: Lost task 17.0 in stage 18.3 (TID 2866, worker06.hadoop.know-center.at): ExecutorLostFailure (executor 40 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 15.2 GB of 13.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/03/08 10:23:09 WARN TaskSetManager: Lost task 29.0 in stage 18.3 (TID 2878, worker06.hadoop.know-center.at): ExecutorLostFailure (executor 40 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 15.2 GB of 13.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/03/08 10:23:09 WARN TaskSetManager: Lost task 65.0 in stage 18.3 (TID 2914, worker06.hadoop.know-center.at): ExecutorLostFailure (executor 40 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 15.2 GB of 13.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
I'm not quite sure what the problem is. Even setting each executor's memory to 60 GB of RAM does not solve the problem.
So the problem obviously comes from the transformation. Any idea what exactly causes it?
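Since the YARN messages themselves suggest boosting spark.yarn.executor.memoryOverhead, one hedged starting point (the 4096 MB value is purely illustrative) would be to raise it when the session is created, or equivalently via --conf on spark-submit:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("parquet_transform")
         .config("spark.yarn.executor.memoryOverhead", "4096")  # in MB; illustrative value
         .getOrCreate())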

Why are the executors getting killed by the driver?

The first stage of my spark job is quite simple.
It reads from a big number of files (around 30,000 files and 100GB in total) -> RDD[String]
does a map (to parse each line) -> RDD[Map[String,Any]]
filters -> RDD[Map[String,Any]]
coalesces (.coalesce(100, true))
When running it, I observe quite peculiar behavior. The number of executors grows until the limit I specified in spark.dynamicAllocation.maxExecutors (typically 100 or 200 in my application). Then it starts decreasing quickly (at approx. 14000/33428 tasks) and only a few executors remain. They are killed by the driver. When this task is done, the number of executors increases back to its maximum value.
Below is a screenshot of the number of executors at its lowest.
And here is a screenshot of the task summary.
I guess that these executors are killed because they are idle. But in this case, I do not understand why they would become idle. There remain a lot of tasks to do in the stage...
Do you have any idea of why it happens?
EDIT
More details about the driver logs when an executor is killed:
16/09/30 12:23:33 INFO cluster.YarnClusterSchedulerBackend: Disabling executor 91.
16/09/30 12:23:33 INFO scheduler.DAGScheduler: Executor lost: 91 (epoch 0)
16/09/30 12:23:33 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 91 from BlockManagerMaster.
16/09/30 12:23:33 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(91, server.com, 40923)
16/09/30 12:23:33 INFO storage.BlockManagerMaster: Removed 91 successfully in removeExecutor
16/09/30 12:23:33 INFO cluster.YarnClusterScheduler: Executor 91 on server.com killed by driver.
16/09/30 12:23:33 INFO spark.ExecutorAllocationManager: Existing executor 91 has been removed (new total is 94)
Logs on the executor
16/09/30 12:26:28 INFO rdd.HadoopRDD: Input split: hdfs://...
16/09/30 12:26:32 INFO executor.Executor: Finished task 38219.0 in stage 0.0 (TID 26519). 2312 bytes result sent to driver
16/09/30 12:27:33 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
16/09/30 12:27:33 INFO storage.DiskBlockManager: Shutdown hook called
16/09/30 12:27:33 INFO util.ShutdownHookManager: Shutdown hook called
I'm seeing this problem on executors that are killed as a result of an idle timeout. I have an exceedingly demanding computational load, but it's mostly computed in a UDF, invisible to Spark. I believe that there's some spark parameter that can be adjusted.
Try looking through the spark.executor parameters in https://spark.apache.org/docs/latest/configuration.html#spark-properties and see if anything jumps out.
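If the executors are indeed being reclaimed by dynamic allocation while a long-running UDF keeps them looking idle, the idle-timeout knobs would be the likely candidates to experiment with; a hedged sketch (the values shown are illustrative, not recommendations):
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.dynamicAllocation.executorIdleTimeout", "300s")          # default is 60s
        .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "600s"))   # default is infinity
sc = SparkContext(conf=conf)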