I am saving a DStream to Cassandra. The Cassandra table has a column of type map<text, text>. Cassandra does not support null values in maps, but null values can occur in the stream.
I have added a try/catch in case something goes wrong, but despite that the program stopped and I don't see an error message in the log:
try {
  cassandraStream.saveToCassandra("table", "keyspace")
} catch {
  case e: Exception =>
    log.error("Error in saving data in Cassandra" + e.getMessage, e)
}
Exception
Caused by: java.lang.NullPointerException: Map values cannot be null
at com.datastax.driver.core.TypeCodec$AbstractMapCodec.serialize(TypeCodec.java:2026)
at com.datastax.driver.core.TypeCodec$AbstractMapCodec.serialize(TypeCodec.java:1909)
at com.datastax.driver.core.AbstractData.set(AbstractData.java:530)
at com.datastax.driver.core.AbstractData.set(AbstractData.java:536)
at com.datastax.driver.core.BoundStatement.set(BoundStatement.java:870)
at com.datastax.spark.connector.writer.BoundStatementBuilder.com$datastax$spark$connector$writer$BoundStatementBuilder$$bindColumnUnset(BoundStatementBuilder.scala:73)
at com.datastax.spark.connector.writer.BoundStatementBuilder$$anonfun$6.apply(BoundStatementBuilder.scala:84)
at com.datastax.spark.connector.writer.BoundStatementBuilder$$anonfun$6.apply(BoundStatementBuilder.scala:84)
at com.datastax.spark.connector.writer.BoundStatementBuilder$$anonfun$bind$1.apply$mcVI$sp(BoundStatementBuilder.scala:106)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at com.datastax.spark.connector.writer.BoundStatementBuilder.bind(BoundStatementBuilder.scala:101)
at com.datastax.spark.connector.writer.GroupingBatchBuilder.next(GroupingBatchBuilder.scala:106)
at com.datastax.spark.connector.writer.GroupingBatchBuilder.next(GroupingBatchBuilder.scala:31)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at com.datastax.spark.connector.writer.GroupingBatchBuilder.foreach(GroupingBatchBuilder.scala:31)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1.apply(TableWriter.scala:233)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1.apply(TableWriter.scala:210)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:112)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:111)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:145)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:111)
at com.datastax.spark.connector.writer.TableWriter.writeInternal(TableWriter.scala:210)
at com.datastax.spark.connector.writer.TableWriter.insert(TableWriter.scala:197)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:183)
at com.datastax.spark.connector.streaming.DStreamFunctions$$anonfun$saveToCassandra$1$$anonfun$apply$1.apply(DStreamFunctions.scala:54)
at com.datastax.spark.connector.streaming.DStreamFunctions$$anonfun$saveToCassandra$1$$anonfun$apply$1.apply(DStreamFunctions.scala:54)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
... 3 more
I'd like to know why the program stops despite the try/catch block. Why is the exception not caught?
To understand the source of the failure you have to acknowledge that DStreamFunctions.saveToCassandra, like DStream output operations in general, is not an action in the strict sense. In practice it just invokes foreachRDD:
dstream.foreachRDD(rdd => rdd.sparkContext.runJob(rdd, writer.write _))
which in turn (quoting the foreachRDD documentation):
Apply a function to each RDD in this DStream. This is an output operator, so 'this' DStream will be registered as an output stream and therefore materialized.
The difference is subtle but important: the operation is registered, but the actual execution happens in a different context, at a later point in time.
This means there are no runtime failures to catch at the point where you invoke saveToCassandra.
As already pointed out, try or Try would contain the driver exception if applied directly to an action. So you could, for example, re-implement saveToCassandra as:
dstream.foreachRDD(rdd => try {
  rdd.sparkContext.runJob(rdd, writer.write _)
} catch {
  case e: Exception =>
    log.error("Error in saving data in Cassandra" + e.getMessage, e)
})
With this change the stream should be able to proceed, although the current batch will be completely or partially lost.
It is important to note that this is not the same as catching the original exception, which will still be thrown, remain uncaught, and be visible in the log. To catch the problem at its source you'd have to apply a try/catch block directly in the writer, and that is obviously not an option when you execute code over which you have no control.
The take-away message (already stated in this thread) is: make sure to sanitize your data to avoid known sources of failure.
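For example, a minimal sketch of such sanitization, assuming a hypothetical case class Event(id: String, attributes: Map[String, String]) as the stream element type (the names are illustrative, not taken from the question):
import com.datastax.spark.connector.streaming._   // brings saveToCassandra into scope for DStreams

val sanitizedStream = cassandraStream.map { event =>
  // strip null values from the map column so the connector never tries to bind them
  event.copy(attributes = event.attributes.filter { case (_, v) => v != null })
}
sanitizedStream.saveToCassandra("keyspace", "table")   // keyspace name first, then table name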
The problem is that you don't catch the exception you think you do. The code you have will catch a driver exception, and in fact code structured like this will do so.
It doesn't, however, mean that
the program should never stop.
While a driver failure, which would be a consequence of a fatal executor failure, is contained and the driver can exit gracefully, the stream as such is already gone. Therefore your code exits, because there is no more stream to run.
If the code in question were under your control, exception handling should be delegated to the task, but in the case of third-party code there is no such option.
Instead you should validate your data and remove problematic records before they are passed to saveToCassandra.
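For example, a minimal sketch, assuming a hypothetical case class Event(id: String, attributes: Map[String, String]) as the element type: drop any record whose map column contains a null value before writing.
val validStream = cassandraStream.filter { event =>
  event.attributes.valuesIterator.forall(_ != null)   // keep only records the connector can bind
}
validStream.saveToCassandra("keyspace", "table")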
Related
I am building unit tests for a class which runs in an Apache Beam pipeline.
One of the methods I'm testing should throw an exception after attempting some operation on a PCollection. Is there a way to expect this exception with PAssert?
For example:
//Expect SomeClass to throw an exception when instantiated with this argument
val invalidParDo = ParDo.of(new SomeClass("This is an invalid parameter"))
val actualOutput = inputPCollection.apply("",invalidParDo)
//Here I'd like a way to assert that I expected a particular exception
PAssert.that(someWayToExpectExceptionHere)
testPipeline.run()
Or any similar alternative
Note that PAssert is a Beam transform that validates that a given PCollection is equal to one or more given values. Exceptions raised in invalidParDo that are not handled will result in a pipeline failure (after retries), so it's not possible to validate such failures in a subsequent step. But if you update invalidParDo to generate an invalid value instead of raising an exception, you should be able to validate the generation of such invalid values using a PAssert.
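A hedged sketch of that approach, using the Beam Java SDK from Scala: SafeTransform and parseOrError are illustrative names (not from the question), and the final assertion assumes inputPCollection contains a single empty string. Instead of throwing, the DoFn catches the failure and emits an error-marker value that PAssert can then check.
import org.apache.beam.sdk.testing.PAssert
import org.apache.beam.sdk.transforms.DoFn.ProcessElement
import org.apache.beam.sdk.transforms.{DoFn, ParDo}

// Hypothetical DoFn: catches the failure itself and emits a marker value instead of throwing.
class SafeTransform extends DoFn[String, String] {
  @ProcessElement
  def processElement(c: DoFn[String, String]#ProcessContext): Unit =
    try c.output(parseOrError(c.element()))
    catch { case e: Exception => c.output("ERROR: " + e.getMessage) }

  // stand-in for the logic under test
  private def parseOrError(s: String): String =
    if (s.isEmpty) throw new IllegalArgumentException("empty input") else s.toUpperCase
}

val output = inputPCollection.apply("SafeTransform", ParDo.of(new SafeTransform))
PAssert.that(output).containsInAnyOrder("ERROR: empty input")
testPipeline.run()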
I have been researching the proper way of handling exceptions in Apache Spark jobs. I have read through different questions on Stack Overflow but I still haven't reached a conclusion. From my point of view there are three ways of handling exceptions:
A try/catch block surrounding the lambda function that is going to perform the computation. This is tricky because the block has to be placed around the code that triggers the lazy computation. If an error happens, I assume there won't be any RDD to work with (taken from this blog entry):
val lines: RDD[String] = sc.textFile("large_file.txt")
val tokens =
  lines.flatMap(_ split " ")
    .map(s => s(10))
try {
  // This try/catch block catches all the exceptions thrown by the
  // preceding transformations.
  tokens.saveAsTextFile("/some/output/file.txt")
} catch {
  case e: StringIndexOutOfBoundsException =>
    // Do something in response to the exception
}
A try/catch block inside the lambda function: this implies deciding on the correct output for a caught exception inside the lambda function, roughly:
import scala.util.{Failure, Success, Try}

rdd.map { record =>
  Try(fn(record)) match {
    case Success(value) => (Some(value), None)          // successful record
    case Failure(e)     => (None, Some(e.getMessage))   // record with error flag
  }
}.filter { case (_, errorFlag) => errorFlag.isEmpty }
Let the exception propagate. The task will fail and the Spark framework will relaunch the task. This works when the error is caused by reasons outside the code's scope, e.g. a memory leak or a momentary loss of connection to another service.
What's the correct way of handling exceptions? I guess it depends on what you want to achieve with the RDD operation. If an error in one of the RDD records means that the output is not valid, then option 1 is the way to go. If we expect some of the records to fail, we go for option 2. Option 3 does not even require making a choice, as it is the normal behaviour of the platform.
In the past we did not bother with the try/catch approach except for input parameter checking.
For the rest we just relied on checking the return code, as in:
spark-submit --master yarn ... bla bla
ret_val=$?
...
Why? Because in general you need to correct something and then start over again, and it's hard to dynamically correct certain things. Your scheduling tool (Rundeck, Airflow, et al.) can pick this up as well.
More advanced restart options are possible, but they quickly get convoluted, although they could be done, as you allude to in option 2. I've never seen that done, though.
It is easy to handle a particular set of exceptions using the Catch operator. How could we ignore certain exceptions but handle the rest in the Catch block?
Say, for example, I would like to let ArgumentNullException, ArgumentOutOfRangeException and TimeoutException bubble up and report an error, but for anything else I would like to retry.
The Catch version below would catch all exceptions, and there is no way to selectively ignore some types, as the signature requires it to return an IObservable<T>:
source.Catch<T, Exception>(e=> Observable.Return(default(T)))
If I have to retry on a certain exception then I could write something like this (I think):
source.Catch<T, WebException>(e=> source.Retry())
A very useful operator in these situations is RetryWhen.
RetryWhen(selector) gives you an observable of Exception, and expects back an observable that triggers a retry every time it produces a value.
RetryWhen(e => e) is equivalent to Retry() (all exceptions trigger a retry).
RetryWhen(e => e.Take(3)) is equivalent to Retry(3) (first three exceptions trigger a retry).
Now you can simply transform the inner sequence into one that emits only for your desired types of exception. For example, to retry only on TimeoutException:
source.RetryWhen(e => e.Where(exn => exn is TimeoutException))
I would like to know the guarantees of the following pattern:
try {
  // business logic here
} catch {
  case t: Throwable =>
    // try to signal the error
    // shutdown the app
}
I'm interested in catching all unexpected exceptions (those that can be thrown by any framework, library, custom code, etc.), trying to record the error, and shutting down the virtual machine.
In Scala, what are the guarantees of catching Throwable? Are there any differences with the Java Exception hierarchy to take into consideration?
Throwable is defined in the JVM spec:
An exception in the Java Virtual Machine is represented by an instance of the class Throwable or one of its subclasses.
which means that both Scala and Java share the same definition of Throwable. In fact, scala.Throwable is just an alias for java.lang.Throwable. So in Scala, a catch clause that handles Throwable will catch all exceptions (and errors) thrown by the enclosed code, just like in Java.
Are there any differences with the Java Exception hierarchy to take into consideration?
Since Scala uses the same Throwable as Java, Exception and Error represent the same things. The only "difference" (that I know of) is that in Scala, exceptions are sometimes used under the hood for flow control, so if you want to catch non-fatal exceptions (thus excluding Errors), you should use catch NonFatal(e) rather than catch e: Exception. But this doesn't apply when catching Throwable directly.
All harmless Throwables can be caught by:
try {
  // dangerous
} catch {
  case NonFatal(e) => log.error(e, "Something not that bad.")
}
This way, you never catch an exception that a reasonable application should not try to catch.
If a Scala future fails, and there is no continuation that "observes" that failure (or the only continuations use map/flatMap and don't run in case of failure), then errors go undetected. I would like such errors to be at least logged, so I can find bugs.
I use the term "observed error" because in .Net Tasks there is the chance to catch "unobserved task exceptions", when the Task object is collected by the GC. Similarly, with synchronous methods, uncaught exceptions that terminate the thread can be logged.
In Scala futures, to 'observe' a failure would mean that some continuation or other code reads the Exception stored in the future value before that future is disposed. I'm aware that finalization is not deterministic or reliable, and presumably that's why it's not used to catch unhandled errors, although .Net does succeed in doing this.
Is there a way to achieve this in Scala? If not, how should I organize my code to prevent unhandled error bugs?
Today I have andThen checkResult appended to various futures. But it's hard to know when to use this and when not to: if a library method returns a Future, it shouldn't checkResult and log errors itself, because the library user may handle the failure, so the responsibility falls on the user. As I edit code I sometimes need to add checks and sometimes remove them, and such manual management is surely wrong.
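(For reference, a hedged sketch of what that pattern looks like; checkResult here is a hypothetical logging helper, not a standard library method.)
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import scala.util.{Failure, Try}

// stand-in for real error logging
def checkResult[A](result: Try[A]): Unit = result match {
  case Failure(e) => println("Future failed: " + e)
  case _          => ()
}

val f: Future[Int] = Future(42 / 0).andThen { case r => checkResult(r) }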
I have concluded there is no way to generically notice unhandled errors in Scala futures.
You can just use Future.recover in the function that returns the Future.
So, for instance, in the simplest case you could just "log" the error and rethrow the original exception:
import scala.util.control.NonFatal

// assumes an implicit ExecutionContext is in scope
def libraryFunction(): Future[Int] = {
  val f = ...
  f.recover {
    case NonFatal(t) =>
      println("Error : " + t)
      throw t
  }
}
Note the use of NonFatal to match all the exception types it is sensible to catch.
That recover block could equally return an alternative result if you wish.
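For instance, a minimal sketch of that variant (the fallback value -1 is purely illustrative, and an implicit ExecutionContext is assumed to be in scope, as above):
def libraryFunctionWithFallback(): Future[Int] =
  libraryFunction().recover {
    case NonFatal(t) =>
      println("Error, using fallback: " + t)
      -1   // alternative result returned when the future fails
  }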