Spark job failing in YARN mode - scala

I have a Spark program written in Scala that read a CSV file from HDFS, compute a new column and save it as a parquet file. I am running the program in a YARN cluster. But every time I try to launch it the executors fails at some point with this error.
Could you help me to find what might cause this error ?
Log from on executor
16/10/27 15:58:10 WARN storage.BlockManager: Putting block rdd_12_225 failed due to an exception
16/10/27 15:58:10 WARN storage.BlockManager: Block rdd_12_225 could not be removed as it was not found on disk or in memory
16/10/27 15:58:10 ERROR executor.Executor: Exception in task 225.0 in stage 4.0 (TID 465)
java.io.IOException: Stream is corrupted
at org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:211)
at org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:125)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.readSize(UnsafeRowSerializer.scala:113)
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.<init>(UnsafeRowSerializer.scala:120)
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3.asKeyValueIterator(UnsafeRowSerializer.scala:110)
at org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$3.apply(BlockStoreShuffleReader.scala:66)
at org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$3.apply(BlockStoreShuffleReader.scala:62)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryRelation.scala:118)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryRelation.scala:110)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: net.jpountz.lz4.LZ4Exception: Error decoding offset 15385 of input buffer
at net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:39)
at org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:205)
... 41 more
EDIT :
There is the code used
var df = spark.read.option("header", "true").option("inferSchema", "true").option("treatEmptyValuesAsNulls", "true").csv(hdfsFileURLIn).repartition(nPartitions)
df.printSchema()
df = df.withColumn("ipix", a2p(df.col(deName), df.col(raName))).persist(StorageLevel.MEMORY_AND_DISK)
df.repartition(nPartitions, $"ipix").write.mode("overwrite").option("spark.hadoop.dfs.replication", 1).parquet(hdfsFileURLOut)
the user function a2p is just taking two Double and return an other double
I need to say that this worked well with relatively small CSV (~1Go) but this error happen every times with bigger ones (~15Go)
EDIT 2:
Following the suggestions I disabled the repartition and I used StorageLevel.DISK_ONLY
With this I don't get the Putting block rdd_***** failed due to an exception but there is still an exception related to LZ4 (Stream is corrupted):
16/10/28 07:53:00 ERROR util.Utils: Aborting task
java.io.IOException: Stream is corrupted
at org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:211)
at org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109)
at java.io.BufferedInputStream.read(BufferedInputStream.java:353)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.spark_project.guava.io.ByteStreams.read(ByteStreams.java:899)
at org.spark_project.guava.io.ByteStreams.readFully(ByteStreams.java:733)
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127)
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:254)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1345)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: net.jpountz.lz4.LZ4Exception: Error decoding offset 12966 of input buffer
at net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:39)
at org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:205)
... 25 more
EDIT 3 : I managed to launch it without any errors by removing also the second repartition (the one that repartition using the column ipix) I will look further in the documentation of this method
EDIT 4 : This is strange, occasionally some executors fail with a segmentation fault :
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f48d8a47f2c, pid=3501, tid=0x00007f48cc60c700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_102-b14) (build 1.8.0_102-b14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.102-b14 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# J 4713 C2 org.apache.spark.unsafe.types.UTF8String.hashCode()I (18 bytes) # 0x00007f48d8a47f2c [0x00007f48d8a47e60+0xcc]
#
# Core dump written. Default location: /tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1477580152295_0008/container_1477580152295_0008_01_000006/core or core.3501
#
# An error report file with more information is saved as:
# /tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1477580152295_0008/container_1477580152295_0008_01_000006/hs_err_pid3501.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
#
I checked the memory and all my executors always have plenty of free memory (at least 6Go)
EDIT 4 : So I tested with multiple files and the execution always succeed but sometime some executors fails (with the error above) and are started again by YARN

Which version of lz4-java are you using? This may be related to the problem that was fixed in version 1.1.2 -- see this bug report
Also, I am curious about your function a2p. It should ideally take two Column objects as input, and not just Doubles (unless you registered it as a UDF).

Ran into the same issue.
Symptoms look exactly like this problem: SPARK-18105.
As of 1/29/17 it is not fixed yet.

I replaced lz4-java jar to it's latest version (lz4-java-1.5.0.jar) in jars directory of inside SPARK_HOME path. This worked for me.

Related

Kafka Structured streaming application throwing IllegalStateException when there is a gap in the offset

I have a Structured Streaming application running with Kafka on spark 2.3,
The "spark-sql-kafka-0-10_2.11" version is 2.3.0
The application starts to read messages and process them successfully, then after reaching a specific offset (as shown in the exception message), it throws the following exception:
java.lang.IllegalStateException: Tried to fetch 666 but the returned record offset was 665
at org.apache.spark.sql.kafka010.InternalKafkaConsumer.org$apache$spark$sql$kafka010$InternalKafkaConsumer$$fetchData(KafkaDataConsumer.scala:297)
at org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:163)
at org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:147)
at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
at org.apache.spark.sql.kafka010.InternalKafkaConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:109)
at org.apache.spark.sql.kafka010.InternalKafkaConsumer.get(KafkaDataConsumer.scala:147)
at org.apache.spark.sql.kafka010.KafkaDataConsumer$class.get(KafkaDataConsumer.scala:54)
at org.apache.spark.sql.kafka010.KafkaDataConsumer$CachedKafkaDataConsumer.get(KafkaDataConsumer.scala:362)
at org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:151)
at org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:142)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.streaming.ForeachSink$$anonfun$addBatch$1.apply(ForeachSink.scala:52)
at org.apache.spark.sql.execution.streaming.ForeachSink$$anonfun$addBatch$1.apply(ForeachSink.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:381)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
It always fail on the same offset, looks like this is due to a gap in the offset, because I saw in Kafka UI that after offset 665 there is 667 (it skipped 666 for some reason), and the Kafka client in my Structured Streaming application tries to fetch 666 and fails.
After digging inside Spark's code, I see that they did not expect this exception to happen (according to the comment):
https://github.com/apache/spark/blob/branch-2.3/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala#L297
So I am wondering, am I doing something wrong?
Or is this a bug on the specific version I am using?
There was a longstanding issue in Spark which was fixed in Spark 2.4, which created a bit of an impedance mismatch between Kafka and Spark. Part of the fix was backported to Spark 2.3.1, but is only enabled if the configuration option spark.streaming.kafka.allowNonConsecutiveOffsets is set to true; as you observe, it's quite possible that you're hitting something that wasn't backported, in which case upgrading to Spark 2.4 might be worth considering.

Spark: LeaseExpiredException while writing large dataframe to parquet files

I have a large dataframe which I am writing to parquet files in HDFS. Getting the below exception from logs :
2018-10-15 18:31:32 ERROR Executor:91 - Exception in task 41.0 in stage 0.0 (TID 1321)
org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:369)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /home/prod_out/20181007/_temporary/0/_temporary/attempt_20181015183108_0000_m_000041_0/part-00041-1185b10b-bcb1-4b7e-b732-dd6f71322b7d-c000.snappy.parquet (inode 33628528083): File does not exist. Holder DFSClient_NONMAPREDUCE_179567941_77 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3481)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:3284)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3122)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3082)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:822)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:500)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2206)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2202)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2200)
at org.apache.hadoop.ipc.Client.call(Client.java:1475)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy18.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:418)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy19.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1455)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1251)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:448)
2018-10-15 18:32:06 INFO CoarseGrainedExecutorBackend:54 - Got assigned task 2189
Googled about it but couldn't find any concrete solution. Set the speculation false:
conf.set("spark.speculation","false")
But still didn't help.
It's finishing few tasks, generating few part files and then abruptly stops with this error.
Details:
Spark version : 2.3.1 (This was not happening in 1.6x).
There is only one session running, which rules out the possibility of the same location being accessed by a different session.
Any pointers?
Thanks!
Actually the issue is because before spark writes the data into specified hdfs location, it uploads the data into temporary location.This two stage mechanism is the used to ensure consistency of the final data set when working with file systems. In case of successful write the data is moved from temporary location. And in case of unsuccessful write the data is removed from the temporary location. In your case there might be a different executor thread making changes to the temporary location. And once the original executor thread looks to the temporary location, it is not available and hdfs lease exception is thrown.
In order to avoid this exception,
Make sure you are not using any parallel collections.
Avoid multi-threading if applicable
spark.conf.set("spark.speculation","false")
It may be useful to you this solution: java.lang.OutOfMemoryError: Unable to acquire 100 bytes of memory, got 0
In my case I couldn't write orc files. I removed coalesce option and then it worked!

Kafka can not delete old log segments on Windows

I have run into an issue with Kafka on Windows where it attempts to delete log segments, but it cannot due to another process having access to the files. This is caused by Kafka holding access to the file itself and trying to delete a file it has open. The bug is below for reference.
I have found two JIRA bugs that have been filed on this issue https://issues.apache.org/jira/browse/KAFKA-1194 and https://issues.apache.org/jira/browse/KAFKA-2170. The first being logged under version 0.8.1 and the second for version 0.10.1.
I have personally tried versions 0.10.1 and 0.10.2. Neither of them have the bug fixed in them.
My question is, does anyone know of patch that can fix this issue or know if the Kafka people have a fix for this that will be rolling out soon.
Thanks.
kafka.common.KafkaStorageException: Failed to change the log file suffix from to .deleted for log segment 6711351
at kafka.log.LogSegment.kafkaStorageException$1(LogSegment.scala:340)
at kafka.log.LogSegment.changeFileSuffixes(LogSegment.scala:342)
at kafka.log.Log.kafka$log$Log$$asyncDeleteSegment(Log.scala:981)
at kafka.log.Log.kafka$log$Log$$deleteSegment(Log.scala:971)
at kafka.log.Log$$anonfun$deleteOldSegments$1.apply(Log.scala:673)
at kafka.log.Log$$anonfun$deleteOldSegments$1.apply(Log.scala:673)
at scala.collection.immutable.List.foreach(List.scala:381)
at kafka.log.Log.deleteOldSegments(Log.scala:673)
at kafka.log.Log.deleteRetentionSizeBreachedSegments(Log.scala:717)
at kafka.log.Log.deleteOldSegments(Log.scala:697)
at kafka.log.LogManager$$anonfun$cleanupLogs$3.apply(LogManager.scala:474)
at kafka.log.LogManager$$anonfun$cleanupLogs$3.apply(LogManager.scala:472)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at kafka.log.LogManager.cleanupLogs(LogManager.scala:472)
at kafka.log.LogManager$$anonfun$startup$1.apply$mcV$sp(LogManager.scala:200)
at kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:110)
at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:57)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.file.FileSystemException: c:\kafka-logs\kafka-logs\metric-values-0\00000000000006711351.log -> c:\kafka-logs\kafka-logs\metric-values-0\00000000000006711351.log.deleted: The process cannot access the file because it is being used by another process.
at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:387)
at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287)
at java.nio.file.Files.move(Files.java:1395)
at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:711)
at org.apache.kafka.common.record.FileRecords.renameTo(FileRecords.java:210)
... 28 more
Suppressed: java.nio.file.FileSystemException: c:\kafka-logs\kafka-logs\metric-values-0\00000000000006711351.log -> c:\kafka-logs\kafka-logs\metric-values-0\00000000000006711351.log.deleted: The process cannot access the file because it is being used by another process.
at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:301)
at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287)
at java.nio.file.Files.move(Files.java:1395)
at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:708)
... 29 more
Im having a similar issue running kafka on local , kafka server seems stopping on execution upon failure to delete the log file. For the solution to avoid this to happen i have to increase the log retention for the log to avoid auto deletion.
# The minimum age of a log file to be eligible for deletion due to age
log.retention.hours=500
setting the log to xxx hours will avoid this upon running on local, but for production I think for linux based system , this should not happen.
In case you need to delete the log file , delete it manually where your logs located ,then restart the kafka .

MalformedInputException error from SparkStream App reading from Kafka topic

I am very new to Kafka and SparkStreaming. This Sparkstreaming consumer reads from Kafka (key is string and value is string) and when the data doesn't contain any escaped UTF8 string, it works fine. But when it does, it fails with the following error message:
java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.readLine(BufferedReader.java:324)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72)
at org.apache.spark.rdd.PipedRDD$$anon$1.hasNext(PipedRDD.scala:172)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Here is an example of those lines with escaped UTF8 string (note that it contains two slashes):
Los m\\xC3\\xA1s vendidos en ...
Initially I suspected Kafka streaming encoding configuration but changing it didn't help. In the end we found that the error happens when collect is executed against the data returned from Python UDF.
If more info is needed, please comment. Thank you in advance!
It turned out that our JVM encoding was set to ISO8859-1 in the data collection side (from Python UDF). So we ended up solving this by specifying utf-8 in calling pipe:
rdd.pipe(command, encoding = Codec.UTF8.name)

java.util.concurrent.RejectedExecutionException in Spark although driver/client has precisely same version as Server

A task that works in spark local mode is not working for standalone cluster running on the same machine.
The only difference is:
local[*]
vs
spark://<host>.local:7077
for the master
I am able to run spark pi against the master at the above address and also use the spark gui: so the master address is generally working for spark.
Here is the (normal) spark init code:
val sconf = new SparkConf().setMaster(master).setAppName("EpisCatalog")
val sc = new SparkContext(sconf)
Here is the stacktrace from running the program:
15/12/03 03:39:04.746 main WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/03 03:39:07.706 main WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
15/12/03 03:39:27.739 appclient-registration-retry-thread ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[appclient-registration-retry-thread,5,main]
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask#b649f0b rejected from java.util.concurrent.ThreadPoolExecutor#5ef7a52b[Running, pool size = 1, active threads = 1, queued tasks = 0, completed tasks = 0]
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:103)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:102)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint.tryRegisterAllMasters(AppClient.scala:102)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint.org$apache$spark$deploy$client$AppClient$ClientEndpoint$$registerWithMaster(AppClient.scala:128)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2$$anonfun$run$1.apply$mcV$sp(AppClient.scala:139)
at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1130)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2.run(AppClient.scala:131)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I am running spark 1.6.0-SNAPSHOT. It has been "installed" to local maven repo and I have verified that the client is using the latest local maven repo version.
I had the same problem. It could be solved by using the full host url (can be found on the Master Web UI, port 18080) instead of just the hostname or localhost.
So I had to use mymachine.mycompany.org instead of mymachine
I got the same problem and in my case there was version mismatch. I had Spark Driver written on 1.5.1 version and Spark Cluster setup on 1.6.0.
Maybe you deploy cluster on stable version which was on that time 1.5.1.