kafka.common.OffsetOutOfRangeException - apache-kafka

I am getting the error below in Spark Streaming. It works fine for a few days, but after 4-5 days the error starts appearing. Can anyone please help me with this?
kafka.common.OffsetOutOfRangeException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at java.lang.Class.newInstance(Class.java:374)
at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:102)
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.handleFetchErr(KafkaRDD.scala:184)
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:193)
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:208)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:229)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
I have checked the Kafka properties; log.retention.hours is at its default, which corresponds to 7 days. I also executed the command below.
kafka-topics.sh --zookeeper localhost:2181 --describe --topics-with-overrides
I got the output below for the topic that the job is reading from.
Topic:consumer_events PartitionCount:25 ReplicationFactor:3 Configs:retention.ms=43200000,min.insync.replicas=2,cleanup.policy=delete,compression.type=snappy,retention.bytes=644245094400
Any help would be appreciated.
Thanks
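One thing worth noticing in the describe output above: the topic-level override retention.ms=43200000 works out to 43,200,000 / (1000 × 60 × 60) = 12 hours, and topic-level overrides take precedence over the broker-wide log.retention.hours, so data (and the offsets pointing at it) can be deleted after 12 hours rather than 7 days. If the job falls behind or restarts from offsets older than that, the fetch fails exactly as in the trace. A minimal sketch, assuming the 0.8 direct stream that the stack trace points at (broker address and topic handling are illustrative, not the asker's code):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("consumer-events"), Seconds(10))

val kafkaParams = Map(
  "metadata.broker.list" -> "localhost:9092", // assumption: your broker list
  // Start from the earliest retained offset when no valid offset is available,
  // instead of failing. This does not help if retention deletes data faster
  // than the job consumes it; in that case raise retention.ms or catch up.
  "auto.offset.reset" -> "smallest"
)

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("consumer_events"))

The more durable fix, if the job can ever lag by more than 12 hours, is to raise retention.ms on the topic so that retention comfortably exceeds the worst-case processing delay.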

Related

kafka ignite sink connector

I am trying to replicate data from a Kafka topic to an Apache Ignite cluster. I have added ignite-core.jar along with all the other required files to the plugins directory, and according to the logs it is loaded by the connector. Still, I am facing this error. Any suggestions would be helpful. Thank you.
INFO Instantiated connector string-ignite-connector with version 6.1.1-ccs of type class org.apache.ignite.stream.kafka.connect.IgniteSinkConnector (org.apache.kafka.connect.runtime.Worker:284)
[2021-05-15 08:52:43,717] INFO Finished creating connector string-ignite-connector (org.apache.kafka.connect.runtime.Worker:310)
[2021-05-15 08:52:43,719] ERROR WorkerConnector{id=string-ignite-connector} Error while starting connector (org.apache.kafka.connect.runtime.WorkerConnector:193)
java.lang.NoClassDefFoundError: org/apache/ignite/internal/util/typedef/internal/A
at org.apache.ignite.stream.kafka.connect.IgniteSinkConnector.start(IgniteSinkConnector.java:55)
at org.apache.kafka.connect.runtime.WorkerConnector.doStart(WorkerConnector.java:185)
at org.apache.kafka.connect.runtime.WorkerConnector.start(WorkerConnector.java:210)
at org.apache.kafka.connect.runtime.WorkerConnector.doTransitionTo(WorkerConnector.java:349)
at org.apache.kafka.connect.runtime.WorkerConnector.doTransitionTo(WorkerConnector.java:332)
at org.apache.kafka.connect.runtime.WorkerConnector.doRun(WorkerConnector.java:140)
at org.apache.kafka.connect.runtime.WorkerConnector.run(WorkerConnector.java:117)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.ignite.internal.util.typedef.internal.A
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at org.apache.kafka.connect.runtime.isolation.PluginClassLoader.loadClass(PluginClassLoader.java:104)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 12 more
[2021-05-15 08:52:43,723] ERROR Failed to create job for myconfig/ignite-connector.properties (org.apache.kafka.connect.cli.ConnectStandalone:110)
[2021-05-15 08:52:43,723] ERROR Stopping after connector error (org.apache.kafka.connect.cli.ConnectStandalone:121)
java.util.concurrent.ExecutionException: org.apache.kafka.connect.errors.ConnectException: Failed to transition connector string-ignite-connector to state STARTED
at org.apache.kafka.connect.util.ConvertingFutureCallback.result(ConvertingFutureCallback.java:115)
at org.apache.kafka.connect.util.ConvertingFutureCallback.get(ConvertingFutureCallback.java:99)
at org.apache.kafka.connect.cli.ConnectStandalone.main(ConnectStandalone.java:118)
Caused by: org.apache.kafka.connect.errors.ConnectException: Failed to transition connector string-ignite-connector to state STARTED
at org.apache.kafka.connect.runtime.WorkerConnector.doTransitionTo(WorkerConnector.java:335)
at org.apache.kafka.connect.runtime.WorkerConnector.doRun(WorkerConnector.java:140)
at org.apache.kafka.connect.runtime.WorkerConnector.run(WorkerConnector.java:117)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: org/apache/ignite/internal/util/typedef/internal/A
at org.apache.ignite.stream.kafka.connect.IgniteSinkConnector.start(IgniteSinkConnector.java:55)
at org.apache.kafka.connect.runtime.WorkerConnector.doStart(WorkerConnector.java:185)
at org.apache.kafka.connect.runtime.WorkerConnector.start(WorkerConnector.java:210)
at org.apache.kafka.connect.runtime.WorkerConnector.doTransitionTo(WorkerConnector.java:349)
at org.apache.kafka.connect.runtime.WorkerConnector.doTransitionTo(WorkerConnector.java:332)
... 7 more
Caused by: java.lang.ClassNotFoundException: org.apache.ignite.internal.util.typedef.internal.A
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at org.apache.kafka.connect.runtime.isolation.PluginClassLoader.loadClass(PluginClassLoader.java:104)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 12 more
It looks like you have the ignite-kafka module in your classpath, but not ignite-core and its dependencies, which are also needed. My understanding ends there.
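If it helps, a hedged sketch of the layout the Connect worker typically expects; jar names and versions below are illustrative, not taken from the question. The missing class org.apache.ignite.internal.util.typedef.internal.A lives in ignite-core, so that jar (plus ignite-core's own dependencies, e.g. javax.cache's cache-api) must sit in the same plugin directory that plugin.path points at:

plugins/ignite-sink/
  ignite-kafka-x.y.z.jar     (the connector itself)
  ignite-core-x.y.z.jar      (contains org.apache.ignite.internal.util.typedef.internal.A)
  cache-api-1.0.0.jar        (javax.cache, required by ignite-core)
  ...remaining Ignite dependencies

# connect-standalone.properties
plugin.path=/path/to/plugins

Since the PluginClassLoader in the trace isolates each plugin, the jars generally have to be inside the plugin's own directory rather than elsewhere on the classpath.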

Spark streaming, kafka broker error, "Failed to get records for spark-executor- .... after polling for 512"

We have a Spark Streaming application reading data from Kafka.
Data size: 15 million
The errors below were seen:
java.lang.AssertionError: assertion failed: Failed to get records for spark-executor-... after polling for 512
at scala.Predef$.assert(Predef.scala:170)
More errors were seen pertaining to CachedKafkaConsumer:
at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)
at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
spark.streaming.kafka.consumer.poll.ms is set to its default of 512 ms, and the other Kafka stream timeout settings are also at their defaults (a sketch of raising the poll timeout follows the question):
"request.timeout.ms"
"heartbeat.interval.ms"
"session.timeout.ms"
"max.poll.interval.ms"
Also, Kafka was recently upgraded from 0.8 to 0.10; this behavior was not seen in 0.8.
No resource issues are seen.
Any pointers?
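One hedged first step, as mentioned above: raise the executor-side poll timeout. The config key is Spark's documented spark.streaming.kafka.consumer.poll.ms; the 10-second value is illustrative, not a recommendation:

import org.apache.spark.SparkConf

// CachedKafkaConsumer expects poll() to return the requested records within
// this budget; with the 512 ms default, a slow broker, GC pause, or rebalance
// can trip the assertion seen in the trace above.
val conf = new SparkConf()
  .setAppName("kafka-stream") // hypothetical app name
  .set("spark.streaming.kafka.consumer.poll.ms", "10000")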

Kafka NotLeaderForPartitionException

I am trying to create a simple Kafka client that I run from a Spring Boot application. My requirement is to use Kafka version 0.8.2.2. The application works fine in the beginning, but then I get this exception, and I am unable to find the root cause.
In the Kafka log I see the following exception:
Offset request with correlation id 0 from client testgroup on partition [TESTQA,2] failed due to Leader not local for partition [TESTQA,2] on broker 256
2017-07-21 03:40:54,186 INFO kafka.network.Processor: Closing socket connection to /.
kafka.common.NotLeaderForPartitionException: null
at sun.reflect.GeneratedConstructorAccessor41.newInstance(Unknown Source) ~[na:na]
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[na:1.8.0_131]
at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[na:1.8.0_131]
at java.lang.Class.newInstance(Class.java:442) ~[na:1.8.0_131]
at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:86) ~[kafka_2.10-0.8.2.2.jar:na]
at kafka.consumer.SimpleConsumer.earliestOrLatestOffset(SimpleConsumer.scala:169) ~[kafka_2.10-0.8.2.2.jar:na]
at kafka.consumer.ConsumerFetcherThread.handleOffsetOutOfRange(ConsumerFetcherThread.scala:60) ~[kafka_2.10-0.8.2.2.jar:na]
at kafka.server.AbstractFetcherThread$$anonfun$addPartitions$2.apply(AbstractFetcherThread.scala:177) ~[kafka_2.10-0.8.2.2.jar:na]
at kafka.server.AbstractFetcherThread$$anonfun$addPartitions$2.apply(AbstractFetcherThread.scala:172) ~[kafka_2.10-0.8.2.2.jar:na]
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) ~[scala-library-2.10.4.jar:na]
at scala.collection.immutable.Map$Map2.foreach(Map.scala:130) ~[scala-library-2.10.4.jar:na]
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) ~[scala-library-2.10.4.jar:na]
at kafka.server.AbstractFetcherThread.addPartitions(AbstractFetcherThread.scala:172) ~[kafka_2.10-0.8.2.2.jar:na]
at kafka.server.AbstractFetcherManager$$anonfun$addFetcherForPartitions$2.apply(AbstractFetcherManager.scala:87) ~[kafka_2.10-0.8.2.2.jar:na]
at kafka.server.AbstractFetcherManager$$anonfun$addFetcherForPartitions$2.apply(AbstractFetcherManager.scala:77) ~[kafka_2.10-0.8.2.2.jar:na]
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) ~[scala-library-2.10.4.jar:na]
at scala.collection.immutable.Map$Map2.foreach(Map.scala:130) ~[scala-library-2.10.4.jar:na]
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) ~[scala-library-2.10.4.jar:na]
at kafka.server.AbstractFetcherManager.addFetcherForPartitions(AbstractFetcherManager.scala:77) ~[kafka_2.10-0.8.2.2.jar:na]
at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:95) ~[kafka_2.10-0.8.2.2.jar:na]
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60) [kafka_2.10-0.8.2.2.jar:na]
NotLeaderForPartitionException is logged at the ERROR level with a full stack trace. But really this is just an informational message on the server when a client with stale metadata sends requests to the wrong leader for a partition.
See https://issues.apache.org/jira/browse/KAFKA-816
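Since the 0.8 SimpleConsumer API leaves leader discovery to the application, the usual client-side pattern is to treat NotLeaderForPartitionException as a cue to refresh metadata and retry. A hedged sketch; refreshLeaderMetadata is a hypothetical placeholder for a TopicMetadataRequest round-trip, not part of the Kafka API:

import kafka.common.NotLeaderForPartitionException

// Retry a fetch a bounded number of times, rediscovering the partition
// leader between attempts. `fetch` stands for whatever SimpleConsumer
// call failed (e.g. the offset request from the log above).
def fetchWithRetry[T](retriesLeft: Int)(fetch: () => T): T =
  try fetch()
  catch {
    case _: NotLeaderForPartitionException if retriesLeft > 0 =>
      Thread.sleep(1000)          // give the leader election time to settle
      // refreshLeaderMetadata()  // hypothetical: re-issue a TopicMetadataRequest
      fetchWithRetry(retriesLeft - 1)(fetch)
  }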

Why would Spark Streaming application stall when consuming from Kafka on YARN?

I'm writing a Spark Streaming app in Scala. The goal of the app is to consume the latest records from Kafka and print them to stdout.
The app works perfectly when I run it locally using --master local[n]. However, when I run the app in YARN (and produce to the topic that I am consuming from), the app gets stuck at:
16/11/18 20:53:05 INFO JobScheduler: Added jobs for time 1479502385000 ms
After repeating the line above several times, Spark gives the following error:
16/11/18 20:54:47 WARN TaskSetManager: Lost task 0.0 in stage 9.0 (TID 9, r3d3.hadoop.REDACTED.REDACTED): java.net.ConnectException: Connection timed out
at sun.nio.ch.Net.connect0(Native Method)
at sun.nio.ch.Net.connect(Net.java:454)
at sun.nio.ch.Net.connect(Net.java:446)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
at kafka.network.BlockingChannel.connect(BlockingChannel.scala:57)
at kafka.consumer.SimpleConsumer.connect(SimpleConsumer.scala:44)
at kafka.consumer.SimpleConsumer.getOrMakeConnection(SimpleConsumer.scala:142)
at kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:69)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SimpleConsumer.scala:109)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:109)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:109)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:108)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:107)
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:150)
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:162)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at org.apache.spark.util.NextIterator.foreach(NextIterator.scala:21)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at org.apache.spark.util.NextIterator.to(NextIterator.scala:21)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at org.apache.spark.util.NextIterator.toBuffer(NextIterator.scala:21)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at org.apache.spark.util.NextIterator.toArray(NextIterator.scala:21)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Error from the streaming UI:
org.apache.spark.streaming.dstream.DStream.print(DStream.scala:757)
com.REDACTED.bdp.Main$.main(Main.scala:88)
com.REDACTED.bdp.Main.main(Main.scala)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Errors from YARN application logs (stdout):
java.lang.NullPointerException
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.close(KafkaRDD.scala:158)
at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:66)
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator$$anonfun$1.apply(KafkaRDD.scala:101)
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator$$anonfun$1.apply(KafkaRDD.scala:101)
at org.apache.spark.TaskContextImpl$$anon$1.onTaskCompletion(TaskContextImpl.scala:60)
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:79)
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:77)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:77)
at org.apache.spark.scheduler.Task.run(Task.scala:91)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[2016-11-21 15:57:49,925] ERROR Exception in task 0.1 in stage 33.0 (TID 34) (org.apache.spark.executor.Executor)
org.apache.spark.util.TaskCompletionListenerException
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:91)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Another error from YARN application logs:
[2016-11-21 15:52:32,264] WARN Exception encountered while connecting to the server : (org.apache.hadoop.ipc.Client)
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:375)
at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:558)
at org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:373)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:727)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:723)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:722)
at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:373)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1493)
at org.apache.hadoop.ipc.Client.call(Client.java:1397)
at org.apache.hadoop.ipc.Client.call(Client.java:1358)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2116)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1315)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1311)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1311)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1424)
at org.apache.spark.deploy.yarn.Client$.org$apache$spark$deploy$yarn$Client$$sparkJar(Client.scala:1195)
at org.apache.spark.deploy.yarn.Client$.populateClasspath(Client.scala:1333)
at org.apache.spark.deploy.yarn.ExecutorRunnable.prepareEnvironment(ExecutorRunnable.scala:290)
at org.apache.spark.deploy.yarn.ExecutorRunnable.env$lzycompute(ExecutorRunnable.scala:61)
at org.apache.spark.deploy.yarn.ExecutorRunnable.env(ExecutorRunnable.scala:61)
at org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:80)
at org.apache.spark.deploy.yarn.ExecutorRunnable.run(ExecutorRunnable.scala:68)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The weird part is that about 5% of the time, the app reads from Kafka successfully, for whatever reason.
The cluster and YARN seem to be working properly.
The cluster is secured using Kerberos.
What might be the source of this error?
tl;dr This answer does not offer a solution; it merely suggests a possible next step.
My understanding is that a Lost task event is reported for a streaming job when a task was executed but could not finish, which in your case comes down to a connection issue between a Spark executor and a Kafka broker.
16/11/18 20:54:47 WARN TaskSetManager: Lost task 0.0 in stage 9.0 (TID 9, r3d3.hadoop.REDACTED.REDACTED): java.net.ConnectException: Connection timed out
at sun.nio.ch.Net.connect0(Native Method)
at sun.nio.ch.Net.connect(Net.java:454)
at sun.nio.ch.Net.connect(Net.java:446)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
at kafka.network.BlockingChannel.connect(BlockingChannel.scala:57)
at kafka.consumer.SimpleConsumer.connect(SimpleConsumer.scala:44)
at kafka.consumer.SimpleConsumer.getOrMakeConnection(SimpleConsumer.scala:142)
at kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:69)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SimpleConsumer.scala:109)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:109)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:109)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:108)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:107)
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:150)
The pattern of the error message is as follows:
Lost task [id] in stage [taskSetId] (TID [tid], [host], executor [executorId]): [reason]
which translates, in your case, to the Spark executor running on host r3d3.hadoop.REDACTED.REDACTED.
The [reason] for the failure is what follows:
java.net.ConnectException: Connection timed out
at sun.nio.ch.Net.connect0(Native Method)
at sun.nio.ch.Net.connect(Net.java:454)
at sun.nio.ch.Net.connect(Net.java:446)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
at kafka.network.BlockingChannel.connect(BlockingChannel.scala:57)
at kafka.consumer.SimpleConsumer.connect(SimpleConsumer.scala:44)
at kafka.consumer.SimpleConsumer.getOrMakeConnection(SimpleConsumer.scala:142)
at kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:69)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SimpleConsumer.scala:109)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:109)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:109)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:108)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:107)
I would then ask myself when a Kafka broker could be unavailable to a client (the client here being a Spark Streaming application, a detail that may or may not help in understanding the root cause).
I think it might be unrelated to Apache Spark and would look for more answers in Kafka circles.
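If you want to test the connectivity theory, here is a hedged diagnostic sketch you could run from the executor host (r3d3.hadoop.REDACTED.REDACTED); the broker host and port are assumptions, so substitute your broker's advertised address:

import java.net.{InetSocketAddress, Socket}

// Returns true if a plain TCP connection to host:port opens within timeoutMs.
// If this fails from the executor node but works from your workstation, the
// problem is the network path (firewall, DNS, advertised.host.name), not Spark.
def canReach(host: String, port: Int, timeoutMs: Int = 5000): Boolean = {
  val socket = new Socket()
  try {
    socket.connect(new InetSocketAddress(host, port), timeoutMs)
    true
  } catch {
    case _: java.io.IOException => false
  } finally {
    socket.close()
  }
}

println(canReach("broker.example.com", 9092)) // hypothetical broker address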

kafka.common.KafkaStorageException: Failed to change the log file suffix from to .deleted for log segment 0

Does anybody know what this error means?
kafka.common.KafkaStorageException: Failed to change the log file suffix from to .deleted for log segment 0
at kafka.log.LogSegment.changeFileSuffixes(LogSegment.scala:259)
at kafka.log.Log.kafka$log$Log$$asyncDeleteSegment(Log.scala:729)
at kafka.log.Log.kafka$log$Log$$deleteSegment(Log.scala:720)
at kafka.log.Log$$anonfun$deleteOldSegments$1.apply(Log.scala:488)
at kafka.log.Log$$anonfun$deleteOldSegments$1.apply(Log.scala:488)
at scala.collection.immutable.List.foreach(List.scala:318)
at kafka.log.Log.deleteOldSegments(Log.scala:488)
at kafka.log.LogManager.kafka$log$LogManager$$cleanupExpiredSegments(LogManager.scala:411)
at kafka.log.LogManager$$anonfun$cleanupLogs$3.apply(LogManager.scala:442)
at kafka.log.LogManager$$anonfun$cleanupLogs$3.apply(LogManager.scala:440)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at kafka.log.LogManager.cleanupLogs(LogManager.scala:440)
at kafka.log.LogManager$$anonfun$startup$1.apply$mcV$sp(LogManager.scala:182)
at kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:99)
at kafka.utils.Utils$$anon$1.run(Utils.scala:54)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.runAndReset(Unknown Source)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(Unknown Source)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Kafka seems to work, but I'd like to understand what the message is about.
It seems this happens when Kafka tries to rename the file while it is still open; the issue has been reported upstream.
The workaround is to delete your Kafka logs (see the note after the list for where these paths come from):
C:\tmp\kafka-logs
C:\tmp\zookeeper
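For context (an assumption about the setup, since the question does not show its config): those paths match the defaults shipped in Kafka's quickstart config files, which on Windows resolve under C:\tmp.

# config/server.properties (quickstart default)
log.dirs=/tmp/kafka-logs

# config/zookeeper.properties (quickstart default)
dataDir=/tmp/zookeeper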
At the time of writing this answer, the bug is still outstanding.