Spring Batch partitioned step stopped hours after a non-skippable exception occurred - spring-batch

I want to verify a behaviour of Spring Batch...
When running a partitioned step of a Job, I got this exception:
org.springframework.batch.core.JobExecutionException: Partition handler returned an unsuccessful step
at org.springframework.batch.core.partition.support.PartitionStep.doExecute(PartitionStep.java:111)
at org.springframework.batch.core.step.AbstractStep.execute(AbstractStep.java:195)
at org.springframework.batch.core.job.SimpleStepHandler.handleStep(SimpleStepHandler.java:137)
at org.springframework.batch.core.job.flow.JobFlowExecutor.executeStep(JobFlowExecutor.java:64)
at org.springframework.batch.core.job.flow.support.state.StepState.handle(StepState.java:60)
at org.springframework.batch.core.job.flow.support.SimpleFlow.resume(SimpleFlow.java:152)
at org.springframework.batch.core.job.flow.support.SimpleFlow.start(SimpleFlow.java:131)
at org.springframework.batch.core.job.flow.FlowJob.doExecute(FlowJob.java:135)
at org.springframework.batch.core.job.AbstractJob.execute(AbstractJob.java:301)
at org.springframework.batch.core.launch.support.SimpleJobLauncher$1.run(SimpleJobLauncher.java:134)
at org.springframework.core.task.SyncTaskExecutor.execute(SyncTaskExecutor.java:50)
at org.springframework.batch.core.launch.support.SimpleJobLauncher.run(SimpleJobLauncher.java:127)
Only this - no earlier exception that might have triggered it - and the job then finished with a FAILED result.
Searching the logs from the previous hours and days, I noticed these exceptions (3 of them, in different partitioned steps):
06/05/2014 21:50:51.996 [Step3TaskExecutor-12] [] ERROR AbstractStep - Line (222) Encountered an error executing the step
org.springframework.retry.RetryException: Non-skippable exception in recoverer while processing; nested exception is java.io.FileNotFoundException: Source 'blabla....pdf' does not exist
at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor$2.recover(FaultTolerantChunkProcessor.java:281)
at org.springframework.retry.support.RetryTemplate.handleRetryExhausted(RetryTemplate.java:435)
at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:304)
at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:188)
at org.springframework.batch.core.step.item.BatchRetryTemplate.execute(BatchRetryTemplate.java:217)
at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor.transform(FaultTolerantChunkProcessor.java:290)
at org.springframework.batch.core.step.item.SimpleChunkProcessor.process(SimpleChunkProcessor.java:192)
at org.springframework.batch.core.step.item.ChunkOrientedTasklet.execute(ChunkOrientedTasklet.java:75)
at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:395)
at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:133)
at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:267)
at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:77)
at org.springframework.batch.repeat.support.RepeatTemplate.getNextResult(RepeatTemplate.java:368)
at org.springframework.batch.repeat.support.RepeatTemplate.executeInternal(RepeatTemplate.java:215)
at org.springframework.batch.repeat.support.RepeatTemplate.iterate(RepeatTemplate.java:144)
at org.springframework.batch.core.step.tasklet.TaskletStep.doExecute(TaskletStep.java:253)
at org.springframework.batch.core.step.AbstractStep.execute(AbstractStep.java:195)
at org.springframework.batch.core.partition.support.TaskExecutorPartitionHandler$1.call(TaskExecutorPartitionHandler.java:139)
at org.springframework.batch.core.partition.support.TaskExecutorPartitionHandler$1.call(TaskExecutorPartitionHandler.java:136)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.FileNotFoundException: Source 'blabla....pdf' does not exist
It seemed weird to me that the job continued to run after those exceptions, so my thinking is that only the slave steps in which this exception occurred failed, and the master step waited for the rest of the slave steps to finish before returning the first error mentioned above.
Can someone verify that this is the problem? It's been driving me crazy for days.

That is correct behavior for Spring Batch's partitioning. The PartitionHandler in the master step evaluates the results of all the slave steps at once, when they have all returned (or timed out). As for what happened in the slaves, those logged errors would be my leading suspect. However, the definitive answer should be in the job repository (assuming you're using a database-backed implementation): when a step fails (even a partitioned slave), the exception is stored there.
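If it helps, here is a minimal sketch (assuming a database-backed job repository and an injected JobExplorer; the method and variable names are illustrative) of pulling each partition's stored exit status and exception out of the repository:

import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.explore.JobExplorer;

public void dumpStepFailures(JobExplorer jobExplorer, long jobExecutionId) {
    JobExecution jobExecution = jobExplorer.getJobExecution(jobExecutionId);
    for (StepExecution stepExecution : jobExecution.getStepExecutions()) {
        // Each partition appears as its own step execution, e.g. "step3:partition12"
        System.out.println(stepExecution.getStepName() + " -> "
                + stepExecution.getExitStatus().getExitCode());
        // The exception recorded for a failed slave ends up in the exit description
        System.out.println(stepExecution.getExitStatus().getExitDescription());
    }
}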

I got this error when my CPU utilization was very high.
When I added this bean to my configuration, it worked for me:
@Bean
public TaskExecutor asyncTaskExecutor() {
    SimpleAsyncTaskExecutor taskExecutor = new SimpleAsyncTaskExecutor();
    // Cap the number of concurrent partition threads at the number of cores
    taskExecutor.setConcurrencyLimit(numberOfCores);
    return taskExecutor;
}
I use it here:
@Bean
public Step masterStep() throws Exception {
    return stepBuilderFactory.get("masterStep")
            .partitioner(slaveStep().getName(), partitioner())
            .step(slaveStep())
            .gridSize(gridSize)
            .taskExecutor(asyncTaskExecutor())
            .build();
}

Related

IllegalArgumentException thrown from org.drools.modelcompiler.builder.generator.expressiontyper.ExpressionTyper

I spent a whole bunch of time today working out a problem that occurred on migrating from Optaplanner 7.48 to Optaplanner 8.3.0. The problem actually seems to be in the 7.50 version of Drools that Optaplanner 8.3 depends on.
rule "applyHomeTeamConstraints"
enabled true
when $f1: ProblemFixture(slot != null, $ht: homeTeam)
$tu: ProblemSoftConstraint(this memberOf $ht.constraints,
matches($f1.slot.dateTime) )
then
scoreHolder.penalize(kcontext, -$tu.getWeight());
end
The rule worked perfectly happily in Optaplanner 7.48/Drools 7.48.
When I use this rule in my Optaplanner 8.3.0/Drools 7.50 project, I get an exceptionally unhelpful message...
java.lang.IllegalStateException: There is an error in a scoreDrl or scoreDrlFile.
at uk.me.selwyn_family.league.scheduler.solver.SolverTest.testBuildSeasonSolver(SolverTest.java:47)
Caused by: java.lang.UnsupportedOperationException
at uk.me.selwyn_family.league.scheduler.solver.SolverTest.testBuildSeasonSolver(SolverTest.java:47)
After many hours of tracing, I found that the exception was actually thrown at line 243 of org.drools.modelcompiler.builder.generator.expressiontyper.ExpressionTyper:
Expression parentLeft = findLeftLeafOfNameExpr(halfPointFreeExpr.getParentNode().orElseThrow(UnsupportedOperationException::new));
and it occurs because the parent node is absent (getParentNode() returns an empty Optional).
I can solve the problem by simply adding this. in front of the matches() call, as follows...
rule "applyHomeTeamConstraints"
enabled true
when $f1: ProblemFixture(slot != null, $ht: homeTeam)
$tu: ProblemSoftConstraint(this memberOf $ht.constraints,
this.matches($f1.slot.dateTime) )
then
scoreHolder.penalize(kcontext, -$tu.getWeight());
end
I am trying to work out whether it only worked by accident in the 7.48 version and actually should never have worked, or whether a bug has crept into Drools 7.50 that should be reported.
Either way a more helpful error message would be good!
Any suggestions gratefully received.
Update: now that I have told Surefire not to trim the stack trace (trimStackTrace=false on the maven-surefire-plugin), here is the full trace...
java.lang.IllegalStateException: There is an error in a scoreDrl or scoreDrlFile.
at org.optaplanner.core.impl.score.director.ScoreDirectorFactoryFactory.buildDroolsScoreDirectorFactory(ScoreDirectorFactoryFactory.java:280)
at org.optaplanner.core.impl.score.director.ScoreDirectorFactoryFactory.decideMultipleScoreDirectorFactories(ScoreDirectorFactoryFactory.java:103)
at org.optaplanner.core.impl.score.director.ScoreDirectorFactoryFactory.buildScoreDirectorFactory(ScoreDirectorFactoryFactory.java:68)
at org.optaplanner.core.impl.solver.DefaultSolverFactory.buildScoreDirectorFactory(DefaultSolverFactory.java:116)
at org.optaplanner.core.impl.solver.DefaultSolverFactory.getScoreDirectorFactory(DefaultSolverFactory.java:73)
at org.optaplanner.core.api.score.ScoreManager.create(ScoreManager.java:59)
at uk.me.selwyn_family.league.scheduler.solver.SeasonSolver.<init>(SeasonSolver.java:31)
at uk.me.selwyn_family.league.scheduler.solver.SolverTest.testBuildSeasonSolver(SolverTest.java:47)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:85)
at org.testng.internal.Invoker.invokeMethod(Invoker.java:696)
at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:882)
at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1189)
at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:124)
at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:108)
at org.testng.TestRunner.privateRun(TestRunner.java:767)
at org.testng.TestRunner.run(TestRunner.java:617)
at org.testng.SuiteRunner.runTest(SuiteRunner.java:348)
at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:343)
at org.testng.SuiteRunner.privateRun(SuiteRunner.java:305)
at org.testng.SuiteRunner.run(SuiteRunner.java:254)
at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:86)
at org.testng.TestNG.runSuitesSequentially(TestNG.java:1224)
at org.testng.TestNG.runSuitesLocally(TestNG.java:1149)
at org.testng.TestNG.run(TestNG.java:1057)
at org.apache.maven.surefire.testng.TestNGExecutor.run(TestNGExecutor.java:136)
at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.executeSingleClass(TestNGDirectoryTestSuite.java:112)
at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.execute(TestNGDirectoryTestSuite.java:99)
at org.apache.maven.surefire.testng.TestNGProvider.invoke(TestNGProvider.java:145)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:428)
at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:162)
at org.apache.maven.surefire.booter.ForkedBooter.run(ForkedBooter.java:562)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:548)
Caused by: java.lang.UnsupportedOperationException
at java.base/java.util.Optional.orElseThrow(Optional.java:408)
at org.drools.modelcompiler.builder.generator.expressiontyper.ExpressionTyper.toTypedExpressionRec(ExpressionTyper.java:243)
at org.drools.modelcompiler.builder.generator.expressiontyper.ExpressionTyper.toTypedExpression(ExpressionTyper.java:153)
at org.drools.modelcompiler.builder.generator.drlxparse.ConstraintParser.getDrlxParseResult(ConstraintParser.java:174)
at org.drools.modelcompiler.builder.generator.drlxparse.ConstraintParser.drlxParse(ConstraintParser.java:104)
at org.drools.modelcompiler.builder.generator.visitor.pattern.PatternDSL.findAllConstraint(PatternDSL.java:139)
at org.drools.modelcompiler.builder.generator.visitor.pattern.PatternDSL.buildPattern(PatternDSL.java:241)
at org.drools.modelcompiler.builder.generator.visitor.ModelGeneratorVisitor.visit(ModelGeneratorVisitor.java:145)
at org.drools.compiler.lang.descr.PatternDescr.accept(PatternDescr.java:303)
at org.drools.modelcompiler.builder.generator.visitor.AndVisitor.visit(AndVisitor.java:50)
at org.drools.modelcompiler.builder.generator.visitor.ModelGeneratorVisitor.visit(ModelGeneratorVisitor.java:86)
at org.drools.modelcompiler.builder.generator.ModelGenerator.processRule(ModelGenerator.java:187)
at org.drools.modelcompiler.builder.generator.ModelGenerator.generateModel(ModelGenerator.java:160)
at org.drools.modelcompiler.builder.ModelBuilderImpl.compileKnowledgePackages(ModelBuilderImpl.java:284)
at org.drools.modelcompiler.builder.ModelBuilderImpl.buildRules(ModelBuilderImpl.java:220)
at org.drools.modelcompiler.builder.ModelBuilderImpl.postBuild(ModelBuilderImpl.java:130)
at org.drools.compiler.builder.impl.CompositeKnowledgeBuilderImpl.build(CompositeKnowledgeBuilderImpl.java:115)
at org.drools.compiler.builder.impl.CompositeKnowledgeBuilderImpl.build(CompositeKnowledgeBuilderImpl.java:99)
at org.drools.compiler.kie.builder.impl.AbstractKieProject.buildKnowledgePackages(AbstractKieProject.java:268)
at org.drools.compiler.kie.builder.impl.AbstractKieProject.buildKnowledgePackages(AbstractKieProject.java:216)
at org.drools.compiler.kie.builder.impl.AbstractKieProject.verify(AbstractKieProject.java:80)
at org.drools.compiler.kie.builder.impl.KieBuilderImpl.buildKieProject(KieBuilderImpl.java:277)
at org.drools.compiler.kie.builder.impl.KieBuilderImpl.buildAll(KieBuilderImpl.java:245)
at org.drools.compiler.kie.builder.impl.KieBuilderImpl.buildAll(KieBuilderImpl.java:202)
at org.kie.internal.utils.KieHelper.getKieContainer(KieHelper.java:100)
at org.kie.internal.utils.KieHelper.build(KieHelper.java:82)
at org.optaplanner.core.impl.score.director.ScoreDirectorFactoryFactory.buildDroolsScoreDirectorFactory(ScoreDirectorFactoryFactory.java:272)
... 36 more
(No wonder it took me a long time to trace!!!)
This sounds like a regression in Drools. Yes, that error message isn't helpful - that too is an issue. (Note that Drools has improved a lot at producing better error messages, and the new way to write constraints, ConstraintStreams, has far more useful stack traces.)
One thing that doesn't make sense to me is your stacktrace:
java.lang.IllegalStateException: There is an error in a scoreDrl or scoreDrlFile.
at uk.me.selwyn_family.league.scheduler.solver.SolverTest.testBuildSeasonSolver(SolverTest.java:47)
Caused by: java.lang.UnsupportedOperationException
at uk.me.selwyn_family.league.scheduler.solver.SolverTest.testBuildSeasonSolver(SolverTest.java:47)
This stack trace doesn't mention any lines from org.optaplanner or org.drools. Can you please share the code of SolverTest.java:47 too, so we can verify it doesn't convert/eat the exception? The message does seem to come from OptaPlanner, but a true OptaPlanner exception would have a chained (= caused by) exception coming from a line in org.drools, hopefully with more information for us to diagnose why it doesn't give a proper error message.

Spark Structured Streaming recovering from a query exception

Is it possible to recover automatically from an exception thrown during query execution?
Context: I'm developing a Spark application that reads data from a Kafka topic, processes the data, and outputs to S3. However, after running in production for a couple of days, the Spark application faces network hiccups from S3 that cause an exception to be thrown and stop the application. It's also worth mentioning that this application runs on Kubernetes using GCP's Spark k8s Operator.
From what I've seen so far, these exceptions are minor and a simple restart of the application solves the issue. Can we handle those exceptions and restart the structured streaming query automatically?
Here's an example of a thrown exception:
Exception in thread "main" org.apache.spark.sql.streaming.StreamingQueryException: Job aborted.
=== Streaming Query ===
Identifier: ...
Current Committed Offsets: ...
Current Available Offsets: ...
Current State: ACTIVE
Thread State: RUNNABLE
Logical Plan: ...
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:297)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
Caused by: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at io.blahblahView$$anonfun$11$$anonfun$apply$2.apply(View.scala:90)
at io.blahblahView $$anonfun$11$$anonfun$apply$2.apply(View.scala:82)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at io.blahblahView$$anonfun$11.apply(View.scala:82)
at io.blahblahView$$anonfun$11.apply(View.scala:79)
at org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:35)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:537)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:535)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:534)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:281)
... 1 more
Caused by: java.io.FileNotFoundException: No such file or directory: s3a://.../view/v1/_temporary/0
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:993)
at org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:734)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1517)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1557)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:291)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:361)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:334)
at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:166)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:187)
... 47 more
What's the simplest way of taking care of such issues automatically?
After spending too many hours trying to find an elegant fix to this issue, and not finding anything, here's what I came up with.
Some might say it's a hack, but it's simple, it works, and it solves a complex problem. I tested it in production and it solves the issue of recovering automatically from a failure caused by an occasional minor exception.
I call it the Query Watchdog. Here's the simplest version, where the watchdog will retry running the query indefinitely:
val writer = df.writeStream...
while (true) {
  val query = writer.start()
  try {
    // Block here until the query stops; this throws if the query failed
    query.awaitTermination()
  } catch {
    case e: StreamingQueryException => println("Streaming Query Exception caught!: " + e)
  }
}
Some people might want to replace the while(true) with some kind of counter to limit the number of retries. Someone could also supplement this code to send notifications through Slack or email whenever a retry happens. Others could simply collect the number of retries in Prometheus.
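For instance, a bounded-retry variant could look like this (a sketch in Java; maxRetries is an assumed budget and buildWriter() is a placeholder for your own DataStreamWriter setup):

import org.apache.spark.sql.streaming.StreamingQuery;

void runWithRetries(int maxRetries) {
    // buildWriter() is assumed to return your configured DataStreamWriter
    for (int attempt = 1; attempt <= maxRetries; attempt++) {
        try {
            StreamingQuery query = buildWriter().start();
            query.awaitTermination(); // blocks until the query stops or fails
            return;                   // clean termination: stop retrying
        } catch (Exception e) {       // e.g. StreamingQueryException
            System.err.println("Streaming query failed (attempt " + attempt + "): " + e);
        }
    }
}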
Hope it helps,
Cheers
Since you are using the Spark Operator, why not use its restart functionality? If the controller notices that the application has stopped, it will re-submit it automatically.
This will work assuming that the application fails, i.e. the driver pod stops. There are some cases where a driver exception is thrown but the driver pod keeps running without doing anything. In that case the Spark Operator will think that the application is still running ok.
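For reference, the restart behaviour is configured on the SparkApplication manifest; a minimal sketch (field names per the operator's v1beta2 API, values illustrative):

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
spec:
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10   # seconds between retries after a failure
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20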
No, there is no reliable way to do this. BTW, "no" is also an answer.
Logic for checking exceptions generally runs via try/catch on the driver.
Unexpected situations at the executor level are already handled in a standard way by the Spark framework itself for Structured Streaming; if the error is non-recoverable, the app/job simply crashes after signalling the error(s) back to the driver, unless you code try/catch within the various foreachXXX constructs.
That said, it is not clear that the micro-batch would be recoverable with such an approach inside the foreachXXX constructs; afaics, some part of the micro-batch is highly likely to be lost. Hard to test, though.
Given that Spark handles these things in standard ways you cannot hook into, why would it be possible to insert a loop or try/catch into the source of the program? Likewise, broadcast variables are an issue - although some say they have techniques around this. But it is not in the spirit of the framework.
So, good question, as I wonder(ed) about this (in the past).
Depending on your Spark runtime and environment, an alternative - recommended for example in the Databricks documentation - is to simply let the streaming queries fail so that the retries can be handled at the Spark job level.
One of the benefits of this is that it decouples the retry policy and related email notifications from your application.

MongoDB Spring connection lost overnight

I'm using Azure Cosmos DB with the Mongo API (Spring Data, MongoRepository).
Each morning, the first request to fetch data from Mongo causes the exception below.
Subsequent requests succeed without any action on my part.
Any idea what might be causing this?
Is there a way to have Spring automatically recover the connection without failing the request?
Thanks
The exception:
org.springframework.data.mongodb.UncategorizedMongoDbException: Exception sending message; nested exception is com.mongodb.MongoSocketWriteException: Exception sending message
at org.springframework.data.mongodb.core.MongoExceptionTranslator.translateExceptionIfPossible(MongoExceptionTranslator.java:107)
at org.springframework.data.mongodb.core.MongoTemplate.potentiallyConvertRuntimeException(MongoTemplate.java:2135)
at org.springframework.data.mongodb.core.MongoTemplate.executeFindMultiInternal(MongoTemplate.java:1978)
at org.springframework.data.mongodb.core.MongoTemplate.doFind(MongoTemplate.java:1784)
at org.springframework.data.mongodb.core.MongoTemplate.doFind(MongoTemplate.java:1767)
at org.springframework.data.mongodb.core.MongoTemplate.find(MongoTemplate.java:641)
at org.springframework.data.mongodb.repository.query.MongoQueryExecution$CollectionExecution.execute(MongoQueryExecution.java:79)
at org.springframework.data.mongodb.repository.query.MongoQueryExecution$ResultProcessingExecution.execute(MongoQueryExecution.java:411)
at org.springframework.data.mongodb.repository.query.AbstractMongoQuery.execute(AbstractMongoQuery.java:94)
at org.springframework.data.repository.core.support.RepositoryFactorySupport$QueryExecutorMethodInterceptor.doInvoke(RepositoryFactorySupport.java:483)
at org.springframework.data.repository.core.support.RepositoryFactorySupport$QueryExecutorMethodInterceptor.invoke(RepositoryFactorySupport.java:461)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
at org.springframework.data.projection.DefaultMethodInvokingMethodInterceptor.invoke(DefaultMethodInvokingMethodInterceptor.java:56)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:92)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
at org.springframework.data.repository.core.support.SurroundingTransactionDetectorMethodInterceptor.invoke(SurroundingTransactionDetectorMethodInterceptor.java:57)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:213)
...
Caused by: com.mongodb.MongoSocketWriteException: Exception sending message
at com.mongodb.connection.InternalStreamConnection.translateWriteException(InternalStreamConnection.java:465)
at com.mongodb.connection.InternalStreamConnection.sendMessage(InternalStreamConnection.java:208)
at com.mongodb.connection.UsageTrackingInternalConnection.sendMessage(UsageTrackingInternalConnection.java:90)
at com.mongodb.connection.DefaultConnectionPool$PooledConnection.sendMessage(DefaultConnectionPool.java:429)
at com.mongodb.connection.CommandProtocol.sendMessage(CommandProtocol.java:189)
at com.mongodb.connection.CommandProtocol.execute(CommandProtocol.java:111)
at com.mongodb.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:168)
at com.mongodb.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:289)
at com.mongodb.connection.DefaultServerConnection.command(DefaultServerConnection.java:176)
at com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:216)
at com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:207)
at com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:113)
at com.mongodb.operation.FindOperation$1.call(FindOperation.java:516)
at com.mongodb.operation.FindOperation$1.call(FindOperation.java:510)
at com.mongodb.operation.OperationHelper.withConnectionSource(OperationHelper.java:431)
at com.mongodb.operation.OperationHelper.withConnection(OperationHelper.java:404)
at com.mongodb.operation.FindOperation.execute(FindOperation.java:510)
at com.mongodb.operation.FindOperation.execute(FindOperation.java:81)
at com.mongodb.Mongo.execute(Mongo.java:836)
at com.mongodb.Mongo$2.execute(Mongo.java:823)
at com.mongodb.DBCursor.initializeCursor(DBCursor.java:870)
at com.mongodb.DBCursor.hasNext(DBCursor.java:142)
at org.springframework.data.mongodb.core.MongoTemplate.executeFindMultiInternal(MongoTemplate.java:1964)
... 129 common frames omitted
Caused by: java.net.SocketException: Broken pipe (Write failed)
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
at sun.security.ssl.OutputRecord.writeBuffer(OutputRecord.java:431)
at sun.security.ssl.OutputRecord.write(OutputRecord.java:417)
at sun.security.ssl.SSLSocketImpl.writeRecordInternal(SSLSocketImpl.java:886)
at sun.security.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:857)
at sun.security.ssl.AppOutputStream.write(AppOutputStream.java:123)
at com.mongodb.connection.SocketStream.write(SocketStream.java:75)
at com.mongodb.connection.InternalStreamConnection.sendMessage(InternalStreamConnection.java:204)
... 150 common frames omitted
The server closes the connection because the maximum connection idle time has been exceeded.
You will need to set this property according to your requirements:
maxConnectionIdleTime
It can be set either in your Mongo configuration or in your application profile.
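For example (a sketch against the 3.x Java driver API that appears in your stack trace; the 60-second value is illustrative and should be tuned to the server-side timeout):

@Bean
public MongoDbFactory mongoDbFactory() throws Exception {
    MongoClientOptions.Builder optionsBuilder = MongoClientOptions.builder()
            // Retire pooled connections before the server-side idle timeout closes them
            .maxConnectionIdleTime(60_000); // milliseconds
    return new SimpleMongoDbFactory(new MongoClientURI(mongoUri, optionsBuilder));
}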
Good luck.
You can set keep-alive like this:
@Bean
public MongoDbFactory mongoDbFactory() throws Exception {
    MongoClientOptions.Builder optionsBuilder = MongoClientOptions.builder();
    // Enable TCP keep-alive on the driver's sockets
    optionsBuilder.socketKeepAlive(true);
    return new SimpleMongoDbFactory(new MongoClientURI(mongoUri, optionsBuilder));
}

Strange exception on initial request to Repository when using Spring Data JPA with EJB/CDI

I've created a small project which combines Spring Data, a JPA Repository, EJB/CDI and either Wildfly Swarm or plain Wildfly.
The REST resource (an EJB) calls a CDI bean which has a Spring Data JPA Repository injected.
The initial request to the repository ends in an exception, but subsequent calls work just fine.
2016-12-15 23:03:09,690 WARN [org.hibernate.engine.jdbc.spi.SqlExceptionHelper] (default task-1) SQL Error: 0, SQLState: null
2016-12-15 23:03:09,690 ERROR [org.hibernate.engine.jdbc.spi.SqlExceptionHelper] (default task-1) javax.resource.ResourceException: IJ000460: Error checking for a transaction
2016-12-15 23:03:10,268 ERROR [io.undertow.request] (default task-1) UT005023: Exception handling request to /Penny: org.jboss.resteasy.spi.UnhandledException: javax.transaction.RollbackException: ARJUNA016053: Could not commit transaction.
...
Caused by: javax.transaction.RollbackException: ARJUNA016053: Could not commit transaction.
...
Caused by: java.lang.Throwable: setRollbackOnly called from:
at com.arjuna.ats.internal.jta.transaction.arjunacore.TransactionImple.setRollbackOnly(TransactionImple.java:339)
at com.arjuna.ats.internal.jta.transaction.arjunacore.BaseTransaction.setRollbackOnly(BaseTransaction.java:159)
at com.arjuna.ats.jbossatx.BaseTransactionManagerDelegate.setRollbackOnly(BaseTransactionManagerDelegate.java:143)
at org.hibernate.jpa.spi.AbstractEntityManagerImpl.markForRollbackOnly(AbstractEntityManagerImpl.java:1509)
at org.hibernate.jpa.spi.AbstractEntityManagerImpl.convert(AbstractEntityManagerImpl.java:1611)
at org.hibernate.jpa.spi.AbstractEntityManagerImpl.buildQueryFromName(AbstractEntityManagerImpl.java:753)
at org.hibernate.jpa.spi.AbstractEntityManagerImpl.createNamedQuery(AbstractEntityManagerImpl.java:730)
at org.springframework.data.jpa.repository.query.NamedQuery.hasNamedQuery(NamedQuery.java:99)
at org.springframework.data.jpa.repository.query.NamedQuery.lookupFrom(NamedQuery.java:121)
at org.springframework.data.jpa.repository.query.JpaQueryLookupStrategy$DeclaredQueryLookupStrategy.resolveQuery(JpaQueryLookupStrategy.java:162)
at org.springframework.data.jpa.repository.query.JpaQueryLookupStrategy$CreateIfNotFoundQueryLookupStrategy.resolveQuery(JpaQueryLookupStrategy.java:212)
at org.springframework.data.jpa.repository.query.JpaQueryLookupStrategy$AbstractQueryLookupStrategy.resolveQuery(JpaQueryLookupStrategy.java:77)
at org.springframework.data.repository.core.support.RepositoryFactorySupport$QueryExecutorMethodInterceptor.<init>(RepositoryFactorySupport.java:435)
at org.springframework.data.repository.core.support.RepositoryFactorySupport.getRepository(RepositoryFactorySupport.java:220)
at org.springframework.data.jpa.repository.cdi.JpaRepositoryBean.create(JpaRepositoryBean.java:73)
at org.springframework.data.repository.cdi.CdiRepositoryBean.create(CdiRepositoryBean.java:372)
at org.springframework.data.repository.cdi.CdiRepositoryBean.create(CdiRepositoryBean.java:170)
at org.jboss.weld.context.AbstractContext.get(AbstractContext.java:96)
at org.jboss.weld.bean.ContextualInstanceStrategy$DefaultContextualInstanceStrategy.get(ContextualInstanceStrategy.java:101)
at org.jboss.weld.bean.ContextualInstance.get(ContextualInstance.java:50)
at org.jboss.weld.bean.proxy.ContextBeanInstance.getInstance(ContextBeanInstance.java:99)
at org.jboss.weld.bean.proxy.ProxyMethodHandler.invoke(ProxyMethodHandler.java:99)
at ch.maxant.demo.swarmproblems.EmployeeService.findByName(EmployeeService.java)
at ch.maxant.demo.swarmproblems.EmployeeService$Proxy$_$$_WeldClientProxy.findByName(Unknown Source)
at ch.maxant.demo.swarmproblems.EmployeeResource.get(EmployeeResource.java)
Debugging, I found out that org.springframework.data.jpa.repository.query.NamedQuery#lookupFrom ends up calling through to org.hibernate.jpa.spi.AbstractEntityManagerImpl#buildQueryFromName, which seems to throw an IllegalArgumentException: No query defined for that name [Employee.findByName]. That kind of makes sense, in that Spring appears to be loading the named query lazily.
Is this a bug in Spring, or is the application not set up properly?
A different project which uses JBoss Wildfly (8) has the same problem: https://github.com/maxant/jee7webappwithspringdata.

Spark fails on big shuffle jobs with java.io.IOException: Filesystem closed

I often find that Spark fails on large jobs with a rather unhelpful, meaningless exception. The worker logs look normal - no errors - but they end up in state "KILLED". This is extremely common for large shuffles, e.g. for operations like .distinct.
The question is, how do I diagnose what's going wrong, and ideally, how do I fix it?
Given that a lot of these operations are monoidal, I've been working around the problem by splitting the data into, say, 10 chunks, running the app on each chunk, and then running the app on all of the resulting outputs. In other words: meta-map-reduce.
14/06/04 12:56:09 ERROR client.AppClient$ClientActor: Master removed our application: FAILED; stopping client
14/06/04 12:56:09 WARN cluster.SparkDeploySchedulerBackend: Disconnected from Spark cluster! Waiting for reconnection...
14/06/04 12:56:09 WARN scheduler.TaskSetManager: Loss was due to java.io.IOException
java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:779)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:840)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:159)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:143)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:164)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:149)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:27)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:176)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:257)
at scala.collection.AbstractIterator.toList(Iterator.scala:1157)
at $line5.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:13)
at $line5.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:13)
at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450)
at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
at org.apache.spark.scheduler.Task.run(Task.scala:53)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
As of September 1st, 2014, this is an "open improvement" in Spark; please see https://issues.apache.org/jira/browse/SPARK-3052. As syrza pointed out in that ticket, the shutdown hooks are likely run in the wrong order when an executor fails, which results in this message. You will have to do a little more investigation to figure out the main cause of the problem (i.e. why your executor failed). If it is a large shuffle, it might be an out-of-memory error that causes the executor failure, which then causes the Hadoop filesystem to be closed in its shutdown hook. So, the RecordReaders in the running tasks of that executor throw the "java.io.IOException: Filesystem closed" exception. I guess it will be fixed in a subsequent release and then you will get a more helpful error message :)
Something calls DFSClient.close() or DFSClient.abort(), closing the client. The next file operation then results in the above exception.
I would try to figure out what calls close()/abort(). You could use a breakpoint in your debugger, or modify the Hadoop source code to throw an exception in these methods, so you would get a stack trace.
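For illustration, the crude source-modification approach could be as small as this (a sketch, not actual Hadoop source - it would go into a locally patched copy of DFSClient):

// At the top of DFSClient.close() and DFSClient.abort() in your patched build:
// dump the call site so the culprit shows up in the executor logs.
new Exception("DFSClient.close() called from:").printStackTrace();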
The "file system closed" exception can be solved if the Spark job is running on a cluster. You can set properties like spark.executor.cores, spark.driver.cores and spark.akka.threads to the maximum values w.r.t. your resource availability. I had the same problem with a pretty large dataset, about 20 million records of JSON data. I fixed it with the above properties and it ran like a charm. In my case, I set those properties to 25, 25 and 20 respectively. Hope it helps!
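For reference, those properties can also be set programmatically when building the context (a sketch; the values are the ones from this answer, and spark.akka.threads only exists on the older, pre-2.0, Akka-based Spark versions in use here):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Set the properties mentioned above on the SparkConf before creating the context
SparkConf conf = new SparkConf()
        .set("spark.executor.cores", "25")
        .set("spark.driver.cores", "25")
        .set("spark.akka.threads", "20");
JavaSparkContext sc = new JavaSparkContext(conf);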
Reference Link:
http://spark.apache.org/docs/latest/configuration.html