I am very new to Kafka and Spark Streaming. This Spark Streaming consumer reads from Kafka (both key and value are strings). When the data doesn't contain any escaped UTF-8 strings it works fine, but when it does, it fails with the following error message:
java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.readLine(BufferedReader.java:324)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72)
at org.apache.spark.rdd.PipedRDD$$anon$1.hasNext(PipedRDD.scala:172)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Here is an example of such a line with an escaped UTF-8 string (note that it contains two backslashes):
Los m\\xC3\\xA1s vendidos en ...
Initially I suspected the Kafka streaming encoding configuration, but changing it didn't help. In the end we found that the error happens when collect is executed against the data returned from a Python UDF.
If more info is needed, please comment. Thank you in advance!
It turned out that our JVM encoding was set to ISO-8859-1 on the data collection side (the Python UDF). We ended up solving this by specifying UTF-8 when calling pipe:
rdd.pipe(command, encoding = Codec.UTF8.name)
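For context, a minimal self-contained sketch of the same fix (the script name and sample data are hypothetical; the point is passing Codec.UTF8.name so the pipe never falls back to the JVM default charset):

import scala.io.Codec
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pipe-utf8-sketch").getOrCreate()
// Sample data containing the escaped UTF-8 sequence from above
val rdd = spark.sparkContext.parallelize(Seq("Los m\\xC3\\xA1s vendidos en ..."))
// Decode the external command's output as UTF-8 instead of the JVM default (ISO-8859-1 in our case)
val piped = rdd.pipe(Seq("python", "my_udf.py"), encoding = Codec.UTF8.name)
piped.collect().foreach(println)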
I have a Structured Streaming application running with Kafka on Spark 2.3; the spark-sql-kafka-0-10_2.11 version is 2.3.0.
The application starts reading messages and processes them successfully; then, after reaching a specific offset (shown in the exception message), it throws the following exception:
java.lang.IllegalStateException: Tried to fetch 666 but the returned record offset was 665
at org.apache.spark.sql.kafka010.InternalKafkaConsumer.org$apache$spark$sql$kafka010$InternalKafkaConsumer$$fetchData(KafkaDataConsumer.scala:297)
at org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:163)
at org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:147)
at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
at org.apache.spark.sql.kafka010.InternalKafkaConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:109)
at org.apache.spark.sql.kafka010.InternalKafkaConsumer.get(KafkaDataConsumer.scala:147)
at org.apache.spark.sql.kafka010.KafkaDataConsumer$class.get(KafkaDataConsumer.scala:54)
at org.apache.spark.sql.kafka010.KafkaDataConsumer$CachedKafkaDataConsumer.get(KafkaDataConsumer.scala:362)
at org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:151)
at org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:142)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.streaming.ForeachSink$$anonfun$addBatch$1.apply(ForeachSink.scala:52)
at org.apache.spark.sql.execution.streaming.ForeachSink$$anonfun$addBatch$1.apply(ForeachSink.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:381)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
It always fails on the same offset. This looks like it is due to a gap in the offsets: in the Kafka UI I saw that after offset 665 comes 667 (666 was skipped for some reason), and the Kafka client in my Structured Streaming application tries to fetch 666 and fails.
After digging into Spark's code, I can see that this exception was not expected to happen (according to the comment):
https://github.com/apache/spark/blob/branch-2.3/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala#L297
So I am wondering, am I doing something wrong?
Or is this a bug on the specific version I am using?
There was a longstanding impedance mismatch between Kafka and Spark here, which was fixed in Spark 2.4. Part of the fix was backported to Spark 2.3.1, but it is only enabled if the configuration option spark.streaming.kafka.allowNonConsecutiveOffsets is set to true. As you observed, it's quite possible that you're hitting something that wasn't backported, in which case upgrading to Spark 2.4 might be worth considering.
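If you do try the backported path on 2.3.1+, a minimal sketch of enabling that flag (the app name is a placeholder, and whether the flag actually covers your case depends on what was backported, as noted above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("structured-streaming-kafka")  // placeholder name
  .config("spark.streaming.kafka.allowNonConsecutiveOffsets", "true")
  .getOrCreate()

// Equivalent on the command line:
//   spark-submit --conf spark.streaming.kafka.allowNonConsecutiveOffsets=true ...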
I followed this tutorial to deploy ZK on JBoss 6, but on server startup I am getting the error below:
ERROR [org.apache.catalina.core.ContainerBase.[jboss.web].[default-host].[/stream-zk]] (ServerService Thread Pool -- 92) JBWEB000289: Servlet zkLoader threw load() exception: java.lang.ClassCastException: org.zkoss.zk.ui.http.DHtmlLayoutServlet cannot be cast to javax.servlet.Servlet
at org.apache.catalina.core.StandardWrapper.loadServlet(StandardWrapper.java:1154) [jbossweb-7.5.7.Final-redhat-1.jar:7.5.7.Final-redhat-1]
at org.apache.catalina.core.StandardWrapper.load(StandardWrapper.java:1100) [jbossweb-7.5.7.Final-redhat-1.jar:7.5.7.Final-redhat-1]
at org.apache.catalina.core.StandardContext.loadOnStartup(StandardContext.java:3593) [jbossweb-7.5.7.Final-redhat-1.jar:7.5.7.Final-redhat-1]
at org.apache.catalina.core.StandardContext.start(StandardContext.java:3802) [jbossweb-7.5.7.Final-redhat-1.jar:7.5.7.Final-redhat-1]
at org.jboss.as.web.deployment.WebDeploymentService.doStart(WebDeploymentService.java:163) [jboss-as-web-7.5.0.Final-redhat-21.jar:7.5.0.Final-redhat-21]
at org.jboss.as.web.deployment.WebDeploymentService.access$000(WebDeploymentService.java:61) [jboss-as-web-7.5.0.Final-redhat-21.jar:7.5.0.Final-redhat-21]
at org.jboss.as.web.deployment.WebDeploymentService$1.run(WebDeploymentService.java:96) [jboss-as-web-7.5.0.Final-redhat-21.jar:7.5.0.Final-redhat-21]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_261]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_261]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_261]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_261]
at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_261]
at org.jboss.threads.JBossThread.run(JBossThread.java:122)
Any help is appreciated.
An error like "...cannot be cast to javax.servlet.Servlet..." is usually caused by a class that doesn't extend HttpServlet, but ZK's DHtmlLayoutServlet really does extend HttpServlet.
So the root cause probably lies in JBoss.
According to this thread, you might be loading two different versions of the servlet API jar.
According to the documentation:
This is a flat namespace and there should not be multiple instances of a class in different deployment JARs. If there are, only the first loaded will be used and the results may not be as expected.
Please check this first.
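If your WAR does turn out to bundle its own copy of the servlet API, a common fix is to mark it as provided so only the container's copy is deployed. A hedged sketch, assuming a Maven build (adjust the version to the servlet spec your JBoss provides):

<!-- Hypothetical pom.xml fragment: keep the servlet API out of WEB-INF/lib -->
<dependency>
    <groupId>javax.servlet</groupId>
    <artifactId>javax.servlet-api</artifactId>
    <version>3.0.1</version>
    <scope>provided</scope>
</dependency>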
I am facing a problem with NiFi writing to HDFS. I am getting an error:
ERROR [Timer-Driven Process Thread-10] o.apache.nifi.processors.hadoop.PutHDFS PutHDFS[id=4af43efa-a8ff-18ac-0000-00002377fba5] Failed to properly initialize Processor. If still scheduled to run, NiFi will attempt to initialize and run the Processor again after the 'Administrative Yield Duration' has elapsed. Failure is due to java.lang.reflect.InvocationTargetException: java.lang.reflect.InvocationTargetException
java.lang.reflect.InvocationTargetException: null
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.nifi.util.ReflectionUtils.invokeMethodsWithAnnotations(ReflectionUtils.java:137)
at org.apache.nifi.util.ReflectionUtils.invokeMethodsWithAnnotations(ReflectionUtils.java:125)
at org.apache.nifi.util.ReflectionUtils.invokeMethodsWithAnnotations(ReflectionUtils.java:70)
at org.apache.nifi.util.ReflectionUtils.invokeMethodsWithAnnotation(ReflectionUtils.java:47)
at org.apache.nifi.controller.StandardProcessorNode.lambda$initiateStart$1(StandardProcessorNode.java:1364)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException: Can not create a Path from an empty string
at org.apache.hadoop.fs.Path.checkPathArg(Path.java:126)
at org.apache.hadoop.fs.Path.<init>(Path.java:134)
at org.apache.nifi.processors.hadoop.AbstractHadoopProcessor.getConfigurationFromResources(AbstractHadoopProcessor.java:225)
at org.apache.nifi.processors.hadoop.AbstractHadoopProcessor.resetHDFSResources(AbstractHadoopProcessor.java:254)
at org.apache.nifi.processors.hadoop.AbstractHadoopProcessor.abstractOnScheduled(AbstractHadoopProcessor.java:205)
... 15 common frames omitted
My HDFS configuration is:
Note: the same configuration was applied to PutFile and it worked perfectly (kafka.topic was not empty).
It seems that when the PutHDFS processor tries to save the file, it looks for the kafka.topic attribute associated with the flowfile, but the attribute doesn't have a value.
Make sure the kafka.topic attribute has a value associated with it; you can use an UpdateAttribute processor before PutHDFS to add the attribute, as sketched below.
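A rough sketch of the two properties involved (the directory path and default value are hypothetical; the point is that PutHDFS only sees a non-empty path if the attribute it references is set):

UpdateAttribute (placed before PutHDFS), added property:
    kafka.topic = my_default_topic

PutHDFS, Directory property (Expression Language):
    /data/${kafka.topic}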
I have the following versions installed:
Spark 2.1.0,
Scala 2.11.6,
MongoDB 3.2.17
I tried to start the pyspark shell with the following command:
./bin/pyspark --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0
After this I started a Spark session as follows:
from pyspark.sql import SparkSession
my_spark = SparkSession.builder.appName("myApp").config("spark.mongodb.input.uri", "mongodb://127.0.0.1/mycollection.dummy").config("spark.mongodb.output.uri", "mongodb://127.0.0.1/mycollection.dummy").getOrCreate()
I wrote to a collection in MongoDB and it executed successfully.
But when I try to read the collection using the command:
df = my_spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri","mongodb://127.0.0.1/mycollection.dummy").load()
it shows the following error:
17/10/13 10:43:33 ERROR executor.Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.analysis.TypeCoercion$.findTightestCommonType()Lscala/Function2;
at com.mongodb.spark.sql.MongoInferSchema$.com$mongodb$spark$sql$MongoInferSchema$$compatibleType(MongoInferSchema.scala:135)
at com.mongodb.spark.sql.MongoInferSchema$$anonfun$3.apply(MongoInferSchema.scala:78)
at com.mongodb.spark.sql.MongoInferSchema$$anonfun$3.apply(MongoInferSchema.scala:78)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:214)
at scala.collection.AbstractIterator.aggregate(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1135)
at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1135)
at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$25.apply(RDD.scala:1136)
at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$25.apply(RDD.scala:1136)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
You seem to have an inconsistency in the DataFrame reading line. First of all, you initialize my_spark but then use spark. You also use "uri" instead of "spark.mongodb.input.uri". Try this:
df = my_spark.read.format("com.mongodb.spark.sql.DefaultSource").option("spark.mongodb.input.uri","mongodb://127.0.0.1/mycollection.dummy").load()
Otherwise, please provide more code so it can be checked as a whole.
I have a Spark program written in Scala that reads a CSV file from HDFS, computes a new column, and saves the result as a Parquet file. I am running the program on a YARN cluster, but every time I try to launch it the executors fail at some point with this error.
Could you help me find what might cause this error?
Log from one executor:
16/10/27 15:58:10 WARN storage.BlockManager: Putting block rdd_12_225 failed due to an exception
16/10/27 15:58:10 WARN storage.BlockManager: Block rdd_12_225 could not be removed as it was not found on disk or in memory
16/10/27 15:58:10 ERROR executor.Executor: Exception in task 225.0 in stage 4.0 (TID 465)
java.io.IOException: Stream is corrupted
at org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:211)
at org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:125)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.readSize(UnsafeRowSerializer.scala:113)
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.<init>(UnsafeRowSerializer.scala:120)
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3.asKeyValueIterator(UnsafeRowSerializer.scala:110)
at org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$3.apply(BlockStoreShuffleReader.scala:66)
at org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$3.apply(BlockStoreShuffleReader.scala:62)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryRelation.scala:118)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryRelation.scala:110)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: net.jpountz.lz4.LZ4Exception: Error decoding offset 15385 of input buffer
at net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:39)
at org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:205)
... 41 more
EDIT:
Here is the code used:
// Read the CSV from HDFS and repartition it
var df = spark.read.option("header", "true").option("inferSchema", "true").option("treatEmptyValuesAsNulls", "true").csv(hdfsFileURLIn).repartition(nPartitions)
df.printSchema()
// Compute the new column with the a2p UDF and cache the result in memory and on disk
df = df.withColumn("ipix", a2p(df.col(deName), df.col(raName))).persist(StorageLevel.MEMORY_AND_DISK)
// Repartition by the new column and write the result as Parquet
df.repartition(nPartitions, $"ipix").write.mode("overwrite").option("spark.hadoop.dfs.replication", 1).parquet(hdfsFileURLOut)
The user function a2p just takes two Doubles and returns another Double.
I should mention that this worked well with relatively small CSV files (~1 GB), but the error happens every time with bigger ones (~15 GB).
EDIT 2:
Following the suggestions, I disabled the repartition and used StorageLevel.DISK_ONLY.
With this I no longer get "Putting block rdd_***** failed due to an exception", but there is still an exception related to LZ4 (Stream is corrupted):
16/10/28 07:53:00 ERROR util.Utils: Aborting task
java.io.IOException: Stream is corrupted
at org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:211)
at org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109)
at java.io.BufferedInputStream.read(BufferedInputStream.java:353)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.spark_project.guava.io.ByteStreams.read(ByteStreams.java:899)
at org.spark_project.guava.io.ByteStreams.readFully(ByteStreams.java:733)
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127)
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:254)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1345)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: net.jpountz.lz4.LZ4Exception: Error decoding offset 12966 of input buffer
at net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:39)
at org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:205)
... 25 more
EDIT 3: I managed to run it without any errors by also removing the second repartition (the one that repartitions using the ipix column). I will look further into the documentation of this method.
EDIT 4: This is strange; occasionally some executors fail with a segmentation fault:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f48d8a47f2c, pid=3501, tid=0x00007f48cc60c700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_102-b14) (build 1.8.0_102-b14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.102-b14 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# J 4713 C2 org.apache.spark.unsafe.types.UTF8String.hashCode()I (18 bytes) # 0x00007f48d8a47f2c [0x00007f48d8a47e60+0xcc]
#
# Core dump written. Default location: /tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1477580152295_0008/container_1477580152295_0008_01_000006/core or core.3501
#
# An error report file with more information is saved as:
# /tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1477580152295_0008/container_1477580152295_0008_01_000006/hs_err_pid3501.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
#
I checked the memory, and all my executors always have plenty of free memory (at least 6 GB).
EDIT 5: I tested with multiple files and the execution always succeeds, but sometimes some executors fail (with the error above) and are restarted by YARN.
Which version of lz4-java are you using? This may be related to the problem that was fixed in version 1.1.2; see this bug report.
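One quick way to check which lz4-java ends up on the classpath (a hedged sketch to run from spark-shell; LZ4Exception is taken from the stack trace above):

// Prints the location of the jar providing the LZ4 classes; the file name usually contains the version.
println(classOf[net.jpountz.lz4.LZ4Exception].getProtectionDomain.getCodeSource.getLocation)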
Also, I am curious about your function a2p. It should ideally take two Column objects as input, and not just Doubles (unless you registered it as a UDF).
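For reference, a minimal sketch of wrapping such a function as a UDF so it can be applied to Column arguments (the function body and column names are placeholders, not the real a2p logic):

import org.apache.spark.sql.functions.udf

// Placeholder for the real a2p logic: a plain (Double, Double) => Double function.
val a2pFun: (Double, Double) => Double = (dec, ra) => dec + ra
// udf() turns it into something that accepts Column arguments:
val a2p = udf(a2pFun)
// df.withColumn("ipix", a2p(df.col(deName), df.col(raName)))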
Ran into the same issue.
Symptoms look exactly like this problem: SPARK-18105.
As of 1/29/17 it is not fixed yet.
I replaced the lz4-java jar with its latest version (lz4-java-1.5.0.jar) in the jars directory inside the SPARK_HOME path. This worked for me.