Exception handling in Apache Spark - Scala

I have been researching the proper way of handling exceptions in Apache Spark jobs. I have read through different questions on Stack Overflow but I still haven't reached a conclusion. From my point of view there are three ways of handling exceptions:
A try/catch block surrounding the action that triggers the computation. This is tricky because the block has to be placed around the code that forces the lazy computation, not around the transformation itself. If an error happens then I assume there won't be any RDD to work with (taken from this blog entry):
val lines: RDD[String] = sc.textFile("large_file.txt")
val tokens =
  lines.flatMap(_ split " ")
       .map(s => s(10))
try {
  // This try/catch block catches all the exceptions thrown by the
  // preceding transformations.
  tokens.saveAsTextFile("/some/output/file.txt")
} catch {
  case e: StringIndexOutOfBoundsException =>
    // Do something in response to the exception
}
A try/catch block inside the lambda function: this implies deciding, inside the lambda, on the correct output for a caught exception (sketched below with a hypothetical recordWithErrorFlag helper):
rdd.map { record =>
  Try(fn(record)) match {
    case Success(result) => result
    case Failure(e)      => recordWithErrorFlag(record, e) // record with error flag
  }
}.filter(_.errorFlag == null)
Let the exception propagate. The task will fail and the Spark framework will relaunch it. This works when the error is caused by something outside the scope of the code, e.g. a memory problem or a momentary loss of connection to another service.
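For reference, the number of automatic task retries is configurable. A minimal sketch (spark.task.maxFailures is a real Spark setting; it defaults to 4):

import org.apache.spark.{SparkConf, SparkContext}

// Spark re-launches a failed task this many times before failing the whole job.
val conf = new SparkConf()
  .setAppName("retry-example")
  .set("spark.task.maxFailures", "4")
val sc = new SparkContext(conf)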
What's the correct way of handling exceptions? I guess it depends on what you want to achieve with the RDD operation. If an error in one of the RDD records means that the output is not valid, then option 1 is the way to go. If we expect some of the records to fail, we go for option 2. Option 3 does not even require a choice, as it is the normal behaviour of the platform.

In the past we did not bother with the try/catch approach, except for input parameter checking.
For the rest we just relied on checking the return code, as in:
spark-submit --master yarn ... bla bla
ret_val=$?
...
Why? Because in general you need to correct something and then start over again, and it's hard to correct certain things dynamically. Your scheduling tool (Rundeck, Airflow, et al.) can pick up the return code as well.
More advanced restart options are possible, as you allude to in option 2, but they simply get convoluted. I have never seen that done, though it could be.

Related

Getting a spurious fibre trace with ZIO and a for comprehension

I have just started to look at ZIO to try to improve my Scala. I have started by trying to update some of my old code. I have wrapped some legacy code which returns a configuration wrapped in an Option, and I'm converting that to a ZIO which I'm then using in a for comprehension.
The code works as expected but if I return None I get:
Fiber failed.
A checked error was not handled.
None
Fiber:Id(1612612180323,1) was supposed to continue to:
a future continuation at tryzio.MyApp$.run(MyApp.scala:94)
a future continuation at zio.ZIO.exitCode(ZIO.scala:543)
Fiber:Id(1612612180323,1) execution trace:
at zio.ZIO$.fromOption(ZIO.scala:3246)
at tryzio.MyApp$.run(MyApp.scala:93)
The code works as expected for both a Some and a None, but I get the spurious Fibre messages for a None.
The code that generates this is really very simple:
def run(args: List[String]) = ({
  for {
    cmdln <- ZIO.fromOption(getConfig(args.toArray))
    _     <- putStrLn("Model Data Builder")
    _     <- putStrLn(s"\nVerbose: ${cmdln.verbose}")
  } yield ()
}).exitCode
But I have clearly missed something obvious! I'm new to ZIO, so please use small words when explaining my lack of understanding. Do I need to join all fibres? I have tried to catch the None and then exit, but then my code never stops running - very strange.
.exitCode caught an error (the empty configuration) that wasn't handled in the code, printed debug information to stderr, and exited the program with status 1. So it works as expected.
I agree that the error message is a bit misleading and should start with something business-related rather than "Fiber failed."
You might file a ticket on GitHub.
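If the goal is simply to exit cleanly when the configuration is missing, one option is to give the empty Option a real error and handle it before .exitCode. A minimal sketch (ZIO 1.x; getConfig and putStrLn as in the question, the error message is made up):

def run(args: List[String]) =
  (for {
    cmdln <- ZIO.fromOption(getConfig(args.toArray))
               .orElseFail(new Exception("no configuration given"))
    _     <- putStrLn("Model Data Builder")
    _     <- putStrLn(s"\nVerbose: ${cmdln.verbose}")
  } yield ())
    .catchAll(e => putStrLn(s"Error: ${e.getMessage}")) // handled, so no fiber dump
    .exitCode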

Scala - difference between eventually timeout and Thread.sleep()

I have some async (ZIO) code, which I need to test. If I create a testing part using Thread.sleep() it works fine and I always get a response:
for {
  saved  <- database.save(smth)
  result <- eventually {
              Thread.sleep(20000)
              database.search(...)
            }
} yield result
But if I implement the same logic using timeout and interval from eventually, then it never works correctly (I get timeouts):
for {
  saved  <- database.save(smth)
  result <- eventually(timeout(Span(20, Seconds)), interval(Span(20, Seconds))) {
              database.search(...)
            }
} yield result
I do not understand why timeout and interval work differently than Thread.sleep. They should be doing exactly the same thing. Can someone explain this to me and tell me how I should change this code so that I don't need to use Thread.sleep()?
Assuming database.search(...) returns a ZIO value: eventually { database.search(...) } most probably succeeds immediately on the first try, because it merely creates a task to query the database.
The database is then queried without any retry logic.
Regarding how to make it work:
val search: ZIO[Any, Throwable, String] = ???
val retried: ZIO[Any with Clock, Throwable, Option[String]] =
  search
    .retry(Schedule.spaced(Duration.fromMillis(1000)))
    .timeout(Duration.fromMillis(20000))
Something like that should work. But I believe that more elegant solutions exist.
The other answer from @simpadjo addresses the "what" quite succinctly. I'll add some additional context as to why you might see this behavior.
for {
  saved  <- database.save(smth)
  result <- eventually {
              Thread.sleep(20000)
              database.search(...)
            }
} yield result
There are three different technologies being mixed here which is causing some confusion.
First is ZIO, an asynchronous programming library that uses its own custom runtime and execution model to perform tasks. Second is eventually, which comes from ScalaTest and is useful for checking asynchronous computations by effectively polling the state of a value. And third, there is Thread.sleep, a Java API that literally suspends the current thread and prevents task progression until the timer expires.
eventually uses a simple retry mechanism that differs based on whether you are using a normal value or a Future from the Scala standard library. Basically, it runs the code in the block, and if it throws, it sleeps the current thread and then retries based on the interval configuration, eventually timing out. Notably, the behavior here is entirely synchronous: as long as the value in the {} doesn't throw an exception, it won't keep retrying.
Thread.sleep is a heavyweight operation, and in this case it effectively blocks the function being passed to eventually from progressing for 20 seconds. By the time database.search is called, the save operation has most likely completed.
The second variant is different: it executes the code in the eventually block immediately, and if that code throws an exception, it attempts it again based on the interval/timeout logic you provide. In this scenario the save may not have completed (or propagated, if the store is eventually consistent). Because you are returning a ZIO, which is designed not to throw, and eventually doesn't understand ZIO, it will simply return the search attempt with no retry logic.
The accepted answer:
val retried: ZIO[Any with Clock, Throwable, Option[String]] =
  search
    .retry(Schedule.spaced(Duration.fromMillis(1000)))
    .timeout(Duration.fromMillis(20000))
works because retry and timeout are built-in ZIO operators that understand how to actually retry and time out a ZIO. If search fails, the retry will handle it until it succeeds.
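For completeness, a minimal sketch of running the retried effect from a plain test body (ZIO 1.x; Runtime.default provides the Clock that retried needs, and the assertion is just an example):

val runtime = zio.Runtime.default

// The retrying and the timeout now happen inside the ZIO runtime,
// not inside eventually; unsafeRun just waits for the final value.
val result: Option[String] = runtime.unsafeRun(retried)
assert(result.nonEmpty)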

GraphX: I've got a NullPointerException inside mapVertices

I want to use GraphX. For now I just launch it locally.
I've got a NullPointerException in these few lines. The first println works fine, the second one fails.
..........
val graph: Graph[Int, Int] = Graph(users, relationships)
println("graph.inDegrees = " + graph.inDegrees.count) // this line works well
graph.mapVertices((id, v) => {
  println("graph.inDegrees = " + graph.inDegrees.count) // but this one fails
  42 // doesn't mean anything
}).vertices.collect
It does not matter which method of the graph object I call. And graph is not null inside mapVertices.
Exception failure in TID 2 on host localhost:
java.lang.NullPointerException
org.apache.spark.graphx.impl.GraphImpl.mapReduceTriplets(GraphImpl.scala:168)
org.apache.spark.graphx.GraphOps.degreesRDD(GraphOps.scala:72)
org.apache.spark.graphx.GraphOps.inDegrees$lzycompute(GraphOps.scala:49)
org.apache.spark.graphx.GraphOps.inDegrees(GraphOps.scala:48)
ololo.MyOwnObject$$anonfun$main$1.apply$mcIJI$sp(Twitter.scala:42)
Reproduced using GraphX 2.10 on Spark 1.0.2. I'll give you a workaround and then explain what I think is happening. This works for me:
val c = graph.inDegrees.count
graph.mapVertices((id, v) => {
  println("graph.inDegrees = " + c)
  42 // as in the question
}).vertices.collect
In general, Spark gets prickly when you try to access an entire RDD or other distributed object (like a Graph) in code that's intended to execute in parallel on a single partition, like the function you're passing into mapVertices. But it's also usually a bad idea even when you can get it to work. (As a separate matter, as you've seen, when it doesn't work it tends to result in really unhelpful behavior.)
The vertices of a Graph are represented as an RDD, and the function you pass into mapVertices runs locally in the appropriate partitions, where it is given access to local vertex data: id and v. You really don't want the entire graph to be copied to each partition. In this case you just need to broadcast a scalar to each partition, so pulling it out solved the problem and the broadcast is really cheap.
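If the shared value were bigger than a scalar, the same fix can be written with an explicit broadcast variable; a sketch (equivalent to the workaround above, since capturing the local c in the closure already ships it cheaply):

// Compute on the driver, broadcast once, read the value inside the task.
val inDegreeCount = sc.broadcast(graph.inDegrees.count)
graph.mapVertices((id, v) => {
  println("graph.inDegrees = " + inDegreeCount.value)
  42
}).vertices.collect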
There are tricks in the Spark APIs for accessing more complex objects in such a situation, but if you use them carelessly they will destroy your performance because they'll tend to introduce lots of communication. Often people are tempted to use them because they don't understand the computation model, rather than because they really need to, although that does happen too.
Spark does not support nested RDDs or user-defined functions that refer to other RDDs, hence the NullPointerException; see this thread on the spark-users mailing list. In this case, you're attempting to call count() on a Graph (which performs an action on a Spark RDD) from inside of a mapVertices() transformation, leading to a NullPointerException when mapVertices() attempts to access data structures that are only callable by the Spark driver.
In a nutshell, only the Spark driver can launch new Spark jobs; you can't call actions on RDDs from inside of other RDD actions.
See https://stackoverflow.com/a/23793399/590203 for another example of this issue.

Catching unhandled errors in Scala futures

If a Scala future fails, and there is no continuation that "observes" that failure (or the only continuations use map/flatMap and don't run in case of failure), then errors go undetected. I would like such errors to be at least logged, so I can find bugs.
I use the term "observed error" because with .NET Tasks there is a chance to catch "unobserved task exceptions" when the Task object is collected by the GC. Similarly, with synchronous methods, uncaught exceptions that terminate the thread can be logged.
In Scala futures, to 'observe' a failure would mean that some continuation or other code reads the Exception stored in the future value before that future is disposed. I'm aware that finalization is not deterministic or reliable, and presumably that's why it's not used to catch unhandled errors, although .Net does succeed in doing this.
Is there a way to achieve this in Scala? If not, how should I organize my code to prevent unhandled error bugs?
Today I have andThen checkResult appended to various futures. But it's hard to know when to use this and when not to: if a library method returns a Future, it shouldn't checkResult and log errors itself, because the library user may handle the failure, so the responsibility falls onto the user. As I edit code I sometimes need to add checks and sometimes to remove them, and such manual management is surely wrong.
I have concluded there is no way to generically notice unhandled errors in Scala futures.
You can just use Future.recover in the function that returns the Future.
So for instance, you could just "log" the error and rethrow the original exception, in the simplest case:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.control.NonFatal

def libraryFunction(): Future[Int] = {
  val f = ... // the future you would otherwise return directly
  f.recover {
    case NonFatal(t) =>
      println("Error : " + t)
      throw t
  }
}
Note the use of NonFatal to match all the exception types it is sensible to catch.
That recover block could equally return an alternative result if you wish.
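If you want the logging without changing the future's result at all, a side-effecting callback works too; a hypothetical helper (observed is not a standard API, just a sketch of the idea):

import scala.concurrent.{ExecutionContext, Future}
import scala.util.Failure

// Logs any failure of the future; downstream code still sees the original result.
def observed[T](f: Future[T])(implicit ec: ExecutionContext): Future[T] = {
  f.onComplete {
    case Failure(t) => println("Unhandled failure: " + t)
    case _          => ()
  }
  f
}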

Scala exception as a function argument design pattern

I am writing a web app where exceptions are used to handle error cases. Often, I find myself writing helpers like this:
def someHelper(...): Boolean = {...}
and then using it like this:
if (!someHelper(...)) {
  throw new SomeException()
}
These exceptions represent things like invalid parameters, and when handled they send a useful error message to the user, e.g.
try {
  ...
} catch {
  case e: SomeException => "Bad user!"
}
Is this a reasonable approach? And how could I pass the exception into the helper function and have it thrown there? I have had trouble constructing a type for such a function.
I use Either most of the time, not exceptions. I generally use exceptions, as you have done or in some similar way, only when the control flow has to go way, way back to some distant point and there's otherwise nothing sensible to do. When the exceptions can be handled fairly locally, I will instead write
def myMethod(...): Either[String, ValidatedInputForm] = {
  ...
  if (!someHelper(...)) Left("Agree button not checked")
  else Right(whateverForm)
}
and then when I call this method, I can
myMethod(blah).fold({ err =>
  doSomething(err)
  saneReturnValue
}, { form =>
  foo(form)
  form.usefulField
})
or match on Left(err) vs Right(form), or various other things.
If I don't want to handle the error right there, but instead want to process the return value, I
myMethod(blah).right.map { form =>
  foo(form)
  bar(form)
}
and I'll get an Either with the error message unchanged as a Left, if it was an error message, or with the result of { foo(form); bar(form) } as a Right if it was okay. You can also chain your error processing using flatMap, e.g. if you wanted to perform an additional check on so-far-correct values and reject some of them, you could
myMethod(blah).right.flatMap { form =>
  if (!checkSomething(form)) Left("Something didn't check out.")
  else Right(form)
}
It's this sort of processing that makes using Either more convenient (and usually better-performing, if exceptions are common) than exceptions, which is why I use them.
(In fact, in very many cases I don't care why something went wrong, only that it went wrong, in which case I just use an Option.)
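A minimal sketch of that Option variant (the validation rule here is made up): None just says "invalid" without carrying a reason.

// Same shape as myMethod above, but the reason for failure is discarded.
def myMethodOpt(input: String): Option[String] =
  if (input.nonEmpty) Some(input)
  else None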
There's nothing special about passing an exception instance to some method:
def someMethod(e: SomeException): Unit = {
  throw e
}

someMethod(new SomeException)
But I have to say that I get a very distinct feeling that your whole idea just smells. If you want to validate user input, just write validators, e.g. a UserValidator with some method like isValid that tests the input and returns a boolean; you can implement some messaging there too. Exceptions are really intended for different purposes.
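A minimal sketch of what such a validator might look like (the class and its rule are made up):

// Hypothetical validator: a boolean test plus an optional message, no exceptions.
class UserValidator {
  def isValid(input: String): Boolean = input.nonEmpty
  def errorMessage(input: String): Option[String] =
    if (isValid(input)) None else Some("Bad user!")
}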
The two most common ways to approach what you're trying to do are to either have the helper create and throw the exception itself, or exactly what you're doing: have the calling code check the result and throw a meaningful exception if needed.
I've never seen a library where you pass in the exception you expect the helper to throw. As I said on another answer, there's a surprisingly substantial cost to simply instantiating an exception, and if you followed this pattern throughout your code you could see an overall performance problem. This could be mitigated through the use of by-name parameters, but if you just forget to put => in a few key functions, you've got a performance problem that's difficult to track down.
At the end of the day, if you want the helper to throw an exception, it makes sense that the helper itself already knows what sort of exception it wants to throw. If I had to choose between A and B:
def helperA(...): Unit = { if (stuff) throw new InvalidStuff() }
def helperB(..., onError: => Exception): Unit = { if (stuff) throw onError }
I would choose A every time.
Now, if I had to choose between A and what you have now, that's a toss up. It really depends on context, what you're trying to accomplish with the helpers, how else they may be used, etc.
On a final note, naming is very important in these sorts of situations. If you go the return-code-helper route, your helpers should have question names, such as isValid. If you have exception-throwing helpers, they should have action names, such as validate. Maybe even give it emphasis, like validate_!.
For an alternative approach you could check out scalaz Validators, which give a lot of flexibility for this kind of case (e.g. should I crash on error, accumulate the errors and report at the end, or ignore them completely?). A few examples might help you decide if this is the right approach for you.
If you find it hard to find a way into the library, this answer gives some pointers to some introductory material.
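A minimal sketch of the error-accumulating style with scalaz 7 Validation (the two checks are made up, and exact method names vary slightly across scalaz versions):

import scalaz._
import Scalaz._

// Each check returns either the value or an error message; |@| combines them,
// accumulating every failure in a NonEmptyList instead of short-circuiting.
def checkName(s: String): ValidationNel[String, String] =
  if (s.nonEmpty) s.successNel else "name is empty".failureNel

def checkAge(n: Int): ValidationNel[String, Int] =
  if (n >= 0) n.successNel else "age is negative".failureNel

val form: ValidationNel[String, (String, Int)] =
  (checkName("Bob") |@| checkAge(42)) { (_, _) }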