Spark Structured Streaming recovering from a query exception - scala

Is it possible to recover automatically from an exception thrown during query execution?
Context: I'm developing a Spark application that reads data from a Kafka topic, processes the data, and outputs to S3. However, after running for a couple of days in production, the Spark application hits occasional network hiccups from S3 that cause an exception to be thrown and stop the application. It's also worth mentioning that this application runs on Kubernetes using GCP's Spark k8s Operator.
From what I've seen so far, these exceptions are minor and a simple restart of the application solves the issue. Can we handle those exceptions and restart the structured streaming query automatically?
Here's an example of a thrown exception:
Exception in thread "main" org.apache.spark.sql.streaming.StreamingQueryException: Job aborted.
=== Streaming Query ===
Identifier: ...
Current Committed Offsets: ...
Current Available Offsets: ...
Current State: ACTIVE
Thread State: RUNNABLE
Logical Plan: ...
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:297)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
Caused by: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at io.blahblahView$$anonfun$11$$anonfun$apply$2.apply(View.scala:90)
at io.blahblahView$$anonfun$11$$anonfun$apply$2.apply(View.scala:82)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at io.blahblahView$$anonfun$11.apply(View.scala:82)
at io.blahblahView$$anonfun$11.apply(View.scala:79)
at org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:35)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:537)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:535)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:534)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:281)
... 1 more
Caused by: java.io.FileNotFoundException: No such file or directory: s3a://.../view/v1/_temporary/0
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:993)
at org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:734)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1517)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1557)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:291)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:361)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:334)
at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:166)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:187)
... 47 more
What's the simplest way of taking care of such issues automatically?

After spending too many hours trying to find an elegant fix to this issue, and not finding anything, here's what I came up with.
Some might say it's a hack, but it's simple, it works and solves a complex problem. I tested it in production and it solves the issue of recovering automatically from failure due to an occasional minor exception.
I call it the Query Watchdog. Here's the simplest version, where the watchdog will retry running the query indefinitely:
import org.apache.spark.sql.streaming.StreamingQueryException

val writer = df.writeStream...

// Start the query, block until it terminates, and if it died with a
// StreamingQueryException, log it and start it again.
while (true) {
  val query = writer.start()
  try {
    query.awaitTermination()
  } catch {
    case e: StreamingQueryException =>
      println("Streaming Query Exception caught!: " + e)
  }
}
Some people might want to replace the while(true) with some kind of counter to limit the number of retries (a bounded variant is sketched below). Someone could also extend this code to send notifications through Slack or email whenever a retry happens. Others could simply track the number of retries in Prometheus.
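For illustration, the bounded variant might look like this (a sketch; maxRetries, the log message, and the hook comment are illustrative additions, not part of the original answer):

val maxRetries = 5
var retries = 0
var keepRunning = true
while (keepRunning && retries <= maxRetries) {
  val query = writer.start()
  try {
    query.awaitTermination()
    keepRunning = false // the query terminated normally
  } catch {
    case e: StreamingQueryException =>
      retries += 1
      println(s"Streaming Query Exception caught (retry $retries of $maxRetries): $e")
      // hook point: send a Slack/email notification or increment a Prometheus counter here
  }
}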
Hope it helps,
Cheers

Since you are using the Spark Operator, why not use its restart functionality? If the controller notices that the application has stopped, it will re-submit it automatically.
This will work assuming that the application actually fails, i.e. the driver pod stops. There are some cases where a driver exception is thrown but the driver pod keeps running without doing anything; in that case the Spark Operator will think that the application is still running fine.
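For reference, the restart behaviour is configured via the restartPolicy on the SparkApplication resource. A sketch based on the operator's documented fields (the application name and retry values here are illustrative):

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: my-streaming-app   # hypothetical name
spec:
  restartPolicy:
    type: OnFailure                      # re-submit the app when it fails
    onFailureRetries: 3
    onFailureRetryInterval: 10           # seconds between retries
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20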

No, there is no reliable way to do this. (By the way, "no" is also an answer.)
Logic for catching exceptions generally runs via try/catch on the driver.
Unexpected situations at the executor level are already handled by the Spark framework itself for Structured Streaming, and if the error is non-recoverable, the app/job simply crashes after signalling the error(s) back to the driver, unless you code a try/catch within the various foreachXXX constructs (sketched below).
That said, it is not clear that the micro-batch will be recoverable with such an approach, as far as I can see; some part of the micro-batch is highly likely to be lost. Hard to test, though.
Given that Spark handles these things in a standard way that you cannot hook into, why would it be possible to insert a loop or try/catch in the source of the program? Likewise, broadcast variables are an issue, although some claim to have techniques around this. Either way, it is not in the spirit of the framework.
So, good question, as I wonder(ed) about this (in the past).
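To illustrate the foreachXXX-level try/catch mentioned above, a sketch using foreachBatch (the sink, path, and error handling are hypothetical, and per the caveat, part of the micro-batch may be lost):

import org.apache.spark.sql.DataFrame

val query = df.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    try {
      // hypothetical sink: write each micro-batch out as Parquet
      batch.write.mode("append").parquet("s3a://some-bucket/some-path")
    } catch {
      case e: Exception =>
        // swallowing the error keeps the query alive, at the risk described above
        println(s"Batch $batchId failed: $e")
    }
  }
  .start()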

Depending on your Spark runtime and environment, an alternative, recommended for example in the Databricks documentation, is to simply let the streaming queries fail so that retries can be handled at the Spark job level.
One of the benefits of this is that it decouples the retry policy and related email notifications from your application.

Related

The first element has not yet passed through in 5000 milliseconds

I am new to Lagom with Scala. I created a simple CRUD service to learn its key concepts. First I compiled my code using sbt clean compile and it compiled successfully.
After that I tried to run the project using the sbt clean runAll command. Sometimes the service started without any problem, but sometimes I got a message saying that
java.util.concurrent.TimeoutException: The first element has not yet passed through in 5000 milliseconds.
at akka.stream.impl.Timers$Initial$$anon$1.onTimer(Timers.scala:62)
at akka.stream.stage.TimerGraphStageLogic.onInternalTimer(GraphStage.scala:1601)
at akka.stream.stage.TimerGraphStageLogic.$anonfun$getTimerAsyncCallback$1(GraphStage.scala:1590)
at akka.stream.stage.TimerGraphStageLogic.$anonfun$getTimerAsyncCallback$1$adapted(GraphStage.scala:1590)
at akka.stream.impl.fusing.GraphInterpreter.runAsyncInput(GraphInterpreter.scala:466)
at akka.stream.impl.fusing.GraphInterpreterShell$AsyncInput.execute(ActorGraphInterpreter.scala:497)
at akka.stream.impl.fusing.GraphInterpreterShell.processEvent(ActorGraphInterpreter.scala:599)
at akka.stream.impl.fusing.ActorGraphInterpreter.akka$stream$impl$fusing$ActorGraphInterpreter$$processEvent(ActorGraphInterpreter.scala:768)
at akka.stream.impl.fusing.ActorGraphInterpreter$$anonfun$receive$1.applyOrElse(ActorGraphInterpreter.scala:783)
at akka.actor.Actor.aroundReceive(Actor.scala:534)
at akka.actor.Actor.aroundReceive$(Actor.scala:532)
at akka.stream.impl.fusing.ActorGraphInterpreter.aroundReceive(ActorGraphInterpreter.scala:690)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:573)
at akka.actor.ActorCell.invoke(ActorCell.scala:543)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:269)
at akka.dispatch.Mailbox.run(Mailbox.scala:230)
at akka.dispatch.Mailbox.exec(Mailbox.scala:242)
at java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(Unknown Source)
at java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
at java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
Can anyone please explain the solution? Thanks in advance.
I think that is one of the offset timeouts, either for the read side or for the Kafka client.
If you are running everything on one machine and it is struggling, then getting things running, setting up connections, etc. can take more time, so try tweaking those timeouts to higher values and see if that helps.
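For what it's worth, the error type itself comes from an Akka Streams initial-timeout stage. A minimal sketch reproducing it (assuming Akka 2.6, where an implicit ActorSystem provides the materializer; the delays are illustrative):

import akka.actor.ActorSystem
import akka.stream.scaladsl.Source
import scala.concurrent.duration._

object InitialTimeoutDemo extends App {
  implicit val system: ActorSystem = ActorSystem("demo")
  import system.dispatcher

  Source.tick(10.seconds, 1.second, "late first element") // first element arrives after 10s
    .initialTimeout(5.seconds)                            // same 5000 ms window as in the log
    .runForeach(println)
    .failed
    .foreach { e =>
      println(e) // java.util.concurrent.TimeoutException: The first element has not yet passed through ...
      system.terminate()
    }
}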

Dataflow Tasks are not working with Spring Batch

I have a Spring Batch job that is also a Data Flow task. When I run this job everything seems OK: in Tasks > Executions I can see that the task finished successfully. On the other hand, when I go to the Jobs tab I get this error (on the command line):
java.lang.NullPointerException: null
at org.springframework.cloud.dataflow.server.service.impl.DefaultTaskJobService.getTaskJobExecution(DefaultTaskJobService.java:240) ~[spring-cloud-dataflow-server-core-1.2.2.RELEASE.jar!/:1.2.2.RELEASE]
at org.springframework.cloud.dataflow.server.service.impl.DefaultTaskJobService.getTaskJobExecutionsForList(DefaultTaskJobService.java:233) ~[spring-cloud-dataflow-server-core-1.2.2.RELEASE.jar!/:1.2.2.RELEASE]
at org.springframework.cloud.dataflow.server.service.impl.DefaultTaskJobService.listJobExecutions(DefaultTaskJobService.java:103) ~[spring-cloud-dataflow-server-core-1.2.2.RELEASE.jar!/:1.2.2.RELEASE]
at org.springframework.cloud.dataflow.server.controller.JobExecutionController.list(JobExecutionController.java:91) ~[spring-cloud-dataflow-server-core-1.2.2.RELEASE.jar!/:1.2.2.RELEASE]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_112]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_112]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_112]
Spring Dataflow version = 1.2.2
I just ran into this issue a couple of days ago. The reason you get a NullPointerException is that when you ran your task, its execution got inserted into the batch and task tables, except for the TASK_TASK_BATCH table. When you try to access the Jobs tab in Data Flow, it reads from this table, cannot find the record, and throws the null pointer.
For me the root cause was that I removed the spring-boot-starter-parent from my Spring Boot application (since I needed to provide my own). In the process, I swapped the spring-cloud-starter-task dependency for spring-cloud-task-core, thinking this was all I needed, since it compiled and the task ran just fine. What I eventually found out was that spring-cloud-starter-task contains not only spring-cloud-task-core but also spring-cloud-task-batch and spring-cloud-task-stream. In my case, I was missing spring-cloud-task-batch, which is what inserts the records into the TASK_TASK_BATCH table. So I simply removed spring-cloud-task-core and re-added spring-cloud-starter-task, which solved the issue.
It's worth mentioning that you will also have to fix the data, either by manually adding the missing records to the TASK_TASK_BATCH table or by resetting your batch/task tables and sequences. Hope this helps!

Drools 6.5 ConcurrentModificationException with LinkedHashMap Fact

In our Java application using the Drools 6.5 final release, we use the Disruptor to run the same rules on different user threads. Each thread has its own dedicated session object, while all the sessions are created from a common KieBase. Dev/QA did not see the following error, but we see it in production. The object being inserted is a LinkedHashMap instance, and this object will definitely be processed by only one user thread (based on the hashCode of the immutable object coming with the message), so it is strange that this LinkedHashMap object would be modified by a thread other than the user thread. Any thoughts on what could be the cause?
07:04:15.719 ERROR [RuleHandler6] erf.SupportsProfilingHandlerBase - Exception -
java.util.ConcurrentModificationException
at java.util.LinkedHashMap$LinkedHashIterator.nextNode(LinkedHashMap.java:711)
at java.util.LinkedHashMap$LinkedEntryIterator.next(LinkedHashMap.java:744)
at java.util.LinkedHashMap$LinkedEntryIterator.next(LinkedHashMap.java:742)
at java.util.AbstractMap.hashCode(AbstractMap.java:507)
at org.drools.core.common.EqualityAssertMapComparator.hashCodeOf(EqualityAssertMapComparator.java:46)
at org.drools.core.util.ObjectHashMap.get(ObjectHashMap.java:90)
at org.drools.core.common.ClassAwareObjectStore.getHandleForObject(ClassAwareObjectStore.java:150)
at org.drools.core.common.NamedEntryPoint.getFactHandle(NamedEntryPoint.java:680)
at consolidator.services.DroolsKieContainer$SessionWrapper.internalFire(DroolsKieContainer.java:198)
at consolidator.services.DroolsKieContainer$SessionWrapper.fire(DroolsKieContainer.java:175)
at consolidator.services.DroolsKieService.fire(DroolsKieService.java:153)
at consolidator.disruptor.RuleHandler.handleFIX(RuleHandler.java:88)
at consolidator.disruptor.RuleHandler.onEventCore(RuleHandler.java:68)
at consolidator.disruptor.RuleHandler.onEventCore(RuleHandler.java:15)
at consolidator.perf.SupportsProfilingHandlerBase.onEvent(SupportsProfilingHandlerBase.java:43)
at consolidator.perf.SupportsProfilingHandlerBase.onEvent(SupportsProfilingHandlerBase.java:9)
at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:129)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Spark fails on big shuffle jobs with java.io.IOException: Filesystem closed

I often find Spark fails on large jobs with a rather unhelpful, meaningless exception. The worker logs look normal, with no errors, but they end up in state "KILLED". This is extremely common for large shuffles, i.e. operations like .distinct.
The question is, how do I diagnose what's going wrong, and ideally, how do I fix it?
Given that a lot of these operations are monoidal, I've been working around the problem by splitting the data into, say, 10 chunks, running the app on each chunk, and then running the app on all of the resulting outputs. In other words: meta-map-reduce (sketched below).
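For illustration, that workaround might be sketched like this (assuming a SparkContext sc; the input path, chunk count, and use of distinct are hypothetical):

val data = sc.textFile("hdfs:///input")                  // hypothetical input
val chunks = data.randomSplit(Array.fill(10)(1.0))       // split into ~equal chunks
val partials = chunks.map(_.distinct())                  // run the monoidal operation per chunk
val combined = partials.reduce(_ union _).distinct()     // combine the chunk outputs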
14/06/04 12:56:09 ERROR client.AppClient$ClientActor: Master removed our application: FAILED; stopping client
14/06/04 12:56:09 WARN cluster.SparkDeploySchedulerBackend: Disconnected from Spark cluster! Waiting for reconnection...
14/06/04 12:56:09 WARN scheduler.TaskSetManager: Loss was due to java.io.IOException
java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:779)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:840)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:159)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:143)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:164)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:149)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:27)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:176)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:257)
at scala.collection.AbstractIterator.toList(Iterator.scala:1157)
at $line5.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:13)
at $line5.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:13)
at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450)
at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
at org.apache.spark.scheduler.Task.run(Task.scala:53)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
As of September 1st 2014, this is an "open improvement" in Spark; see https://issues.apache.org/jira/browse/SPARK-3052. As syrza pointed out in that link, the shutdown hooks are likely run in the wrong order when an executor fails, which results in this message. You will have to do a little more investigation to figure out the main cause of the problem (i.e. why your executor failed). If it is a large shuffle, it might be an out-of-memory error that causes the executor failure, which then causes the Hadoop filesystem to be closed in its shutdown hook. So the RecordReaders in the running tasks of that executor throw the "java.io.IOException: Filesystem closed" exception. I guess it will be fixed in a subsequent release and then you will get a more helpful error message :)
Something calls DFSClient.close() or DFSClient.abort(), closing the client. The next file operation then results in the above exception.
I would try to figure out what calls close()/abort(). You could use a breakpoint in your debugger, or modify the Hadoop source code to throw an exception in these methods, so you would get a stack trace.
The exception about “file system closed” can be solved if the Spark job is running on a cluster. You can set properties like spark.executor.cores, spark.driver.cores, and spark.akka.threads to the maximum values w.r.t. your resource availability. I had the same problem when my dataset was pretty large, with about 20 million records of JSON data. I fixed it with those properties and it ran like a charm. In my case, I set them to 25, 25, and 20 respectively. Hope it helps!!
Reference Link:
http://spark.apache.org/docs/latest/configuration.html
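Expressed in code, that configuration might look like this (a sketch; the values are the ones reported above, not general recommendations, and spark.akka.threads only applies to older, pre-2.0 Spark versions):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.cores", "25")
  .set("spark.driver.cores", "25")
  .set("spark.akka.threads", "20")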

Spring Batch Partitioned Step stopped after hours from when a non-skippable exception occured

I want to verify a behaviour of Spring Batch...
When running a partitioned step of a Job I got this exception:
org.springframework.batch.core.JobExecutionException: Partition handler returned an unsuccessful step
at org.springframework.batch.core.partition.support.PartitionStep.doExecute(PartitionStep.java:111)
at org.springframework.batch.core.step.AbstractStep.execute(AbstractStep.java:195)
at org.springframework.batch.core.job.SimpleStepHandler.handleStep(SimpleStepHandler.java:137)
at org.springframework.batch.core.job.flow.JobFlowExecutor.executeStep(JobFlowExecutor.java:64)
at org.springframework.batch.core.job.flow.support.state.StepState.handle(StepState.java:60)
at org.springframework.batch.core.job.flow.support.SimpleFlow.resume(SimpleFlow.java:152)
at org.springframework.batch.core.job.flow.support.SimpleFlow.start(SimpleFlow.java:131)
at org.springframework.batch.core.job.flow.FlowJob.doExecute(FlowJob.java:135)
at org.springframework.batch.core.job.AbstractJob.execute(AbstractJob.java:301)
at org.springframework.batch.core.launch.support.SimpleJobLauncher$1.run(SimpleJobLauncher.java:134)
at org.springframework.core.task.SyncTaskExecutor.execute(SyncTaskExecutor.java:50)
at org.springframework.batch.core.launch.support.SimpleJobLauncher.run(SimpleJobLauncher.java:127)
Only this, no previous exceptions that might have triggered it, and then I got a FAILED result for my job.
When searching the logs from the previous hours/days, I noticed these exceptions (3 of them, in different partitioned steps):
06/05/2014 21:50:51.996 [Step3TaskExecutor-12] [] ERROR AbstractStep - Line (222) Encountered an error executing the step
org.springframework.retry.RetryException: Non-skippable exception in recoverer while processing; nested exception is java.io.FileNotFoundException: Source 'blabla....pdf' does not exist
at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor$2.recover(FaultTolerantChunkProcessor.java:281)
at org.springframework.retry.support.RetryTemplate.handleRetryExhausted(RetryTemplate.java:435)
at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:304)
at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:188)
at org.springframework.batch.core.step.item.BatchRetryTemplate.execute(BatchRetryTemplate.java:217)
at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor.transform(FaultTolerantChunkProcessor.java:290)
at org.springframework.batch.core.step.item.SimpleChunkProcessor.process(SimpleChunkProcessor.java:192)
at org.springframework.batch.core.step.item.ChunkOrientedTasklet.execute(ChunkOrientedTasklet.java:75)
at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:395)
at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:133)
at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:267)
at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:77)
at org.springframework.batch.repeat.support.RepeatTemplate.getNextResult(RepeatTemplate.java:368)
at org.springframework.batch.repeat.support.RepeatTemplate.executeInternal(RepeatTemplate.java:215)
at org.springframework.batch.repeat.support.RepeatTemplate.iterate(RepeatTemplate.java:144)
at org.springframework.batch.core.step.tasklet.TaskletStep.doExecute(TaskletStep.java:253)
at org.springframework.batch.core.step.AbstractStep.execute(AbstractStep.java:195)
at org.springframework.batch.core.partition.support.TaskExecutorPartitionHandler$1.call(TaskExecutorPartitionHandler.java:139)
at org.springframework.batch.core.partition.support.TaskExecutorPartitionHandler$1.call(TaskExecutorPartitionHandler.java:136)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.FileNotFoundException: Source 'blabla....pdf' does not exist
It seemed weird to me that after those exceptions the job continued to run, so I'm thinking that only the slave steps in which this exception occurred failed, and the master step waited for the rest of the slave steps to finish in order to return the first error mentioned.
Can someone verify that this is the problem? It's been driving me crazy for days.
That is correct behavior for Spring Batch's partitioning. The PartitionHandler in the master step evaluates the results of all the slave steps at once, when they have all returned (or timed out). With regard to what happened in the slaves, those logged errors would be the leading cause to me. However, the definitive answer should be in the job repository (assuming you're using a database-backed implementation): when a step fails (even a partitioned slave), the exception is stored there. A sketch of reading it back follows below.
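To illustrate reading those stored exceptions back, a sketch using Spring Batch's JobExplorer (the class name and bean wiring are hypothetical):

import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.explore.JobExplorer;

public class FailedStepInspector {

    private final JobExplorer jobExplorer;

    public FailedStepInspector(JobExplorer jobExplorer) {
        this.jobExplorer = jobExplorer;
    }

    public void printFailures(long jobExecutionId) {
        JobExecution jobExecution = jobExplorer.getJobExecution(jobExecutionId);
        for (StepExecution step : jobExecution.getStepExecutions()) {
            if (step.getStatus().isUnsuccessful()) {
                // the exit description holds the exception stored for the failed slave step
                System.out.println(step.getStepName() + ": "
                        + step.getExitStatus().getExitDescription());
            }
        }
    }
}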
I got this error when the utilization of my CPU was very high.
When I added this bean to my configuration, it worked for me:
@Bean
public TaskExecutor asyncTaskExecutor() {
    // limit concurrent slave-step executions to the number of available cores
    SimpleAsyncTaskExecutor taskExecutor = new SimpleAsyncTaskExecutor();
    taskExecutor.setConcurrencyLimit(numberOfCores);
    return taskExecutor;
}
I use it here:
@Bean
public Step masterStep() throws Exception {
    return stepBuilderFactory.get("masterStep")
            .partitioner(slaveStep().getName(), partitioner())
            .step(slaveStep())
            .gridSize(gridSize)
            .taskExecutor(asyncTaskExecutor())  // use the throttled executor for the partitions
            .build();
}
}