Is there a way to assert an expected exception with Apache Beam PAssert? - scala

I am building unit tests for a class which runs in an Apache Beam pipeline.
One of the methods I'm testing should throw an exception after attempting some operation on a PCollection. Is there a way to expect this exception with PAssert?
For example:
// Expect SomeClass to throw an exception when instantiated with this argument
val invalidParDo = ParDo.of(new SomeClass("This is an invalid parameter"))
val actualOutput = inputPCollection.apply("", invalidParDo)
// Here I'd like a way to assert that I expected a particular exception
PAssert.that(someWayToExpectExceptionHere)
testPipeline.run()
Or any similar alternative

Note that PAssert is a Beam transform that validates that a given PCollection is equal to one or more given values. Exceptions raised in invalidParDo that are not handled will result in a pipeline failure (after retries), so it is not possible to validate such failures in a subsequent step. But if you update invalidParDo to generate an invalid value instead of raising an exception, you should be able to validate the generation of such invalid values using a PAssert, as sketched below.
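For illustration, here is a minimal sketch of that approach. The DoFn body, the "INVALID:" marker, and the expected element are assumptions for the example, not the original SomeClass:
import org.apache.beam.sdk.testing.PAssert
import org.apache.beam.sdk.transforms.DoFn.ProcessElement
import org.apache.beam.sdk.transforms.{DoFn, ParDo}

// Hypothetical rewrite of SomeClass: flag invalid input as data instead of throwing
class SomeClass(param: String) extends DoFn[String, String] {
  @ProcessElement
  def processElement(ctx: DoFn[String, String]#ProcessContext): Unit = {
    if (param == "This is an invalid parameter")
      ctx.output("INVALID: " + ctx.element())  // error marker instead of an exception
    else
      ctx.output(ctx.element())
  }
}

val invalidParDo = ParDo.of(new SomeClass("This is an invalid parameter"))
val actualOutput = inputPCollection.apply("CheckParams", invalidParDo)
// The "failure" is now just a value, so PAssert can check for it
PAssert.that(actualOutput).containsInAnyOrder("INVALID: someInputElement")
testPipeline.run()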

Related

Exception handling in Apache Spark

I have been researching the proper way of handling exceptions in Apache Spark jobs. I have read through different questions on Stack Overflow but I still haven't reached a conclusion. From my point of view there are three ways of handling exceptions:
A try/catch block surrounding the lambda function that is going to perform the computation. This is tricky because the block has to surround the code that triggers the lazy computation. If an error happens, I assume there won't be any RDD to work with (taken from this blog entry):
val lines: RDD[String] = sc.textFile("large_file.txt")
val tokens = lines.flatMap(_ split " ").map(s => s(10))
try {
  // This try-catch block catches all the exceptions thrown by the
  // preceding transformations.
  tokens.saveAsTextFile("/some/output/file.txt")
} catch {
  case e: StringIndexOutOfBoundsException =>
    // Doing something in response to the exception
}
A try/catch block inside the lambda function: this implies deciding on the correct output for a caught exception inside the lambda function, for example:
rdd.map { record =>
  // requires: import scala.util.{Try, Success, Failure}
  Try(fn(record)) match {
    case Success(value) => (Some(value), None)          // happy path
    case Failure(e)     => (None, Some(e.getMessage))   // record with error flag
  }
}.filter { case (_, errorFlag) => errorFlag.isEmpty }
Let the exception propagate. The task will fail and the Spark framework will relaunch it. This works when the error is caused by reasons outside the code's scope, e.g. a memory leak or a momentarily lost connection to another service.
What's the correct way of handling exceptions?
I guess it depends on what you want to achieve with the RDD operation. If an error in one of the RDD records means that the output is not valid, then option 1 is the way to go. If we expect some of the records to fail, we go for option 2. Option 3 does not even require a choice, as it is the normal behaviour of the platform.
In the past we did not bother with the try/catch approach except for input parameter checking. For the rest we just relied on checking the return code, as in:
spark-submit --master yarn ... bla bla
ret_val=$?
...
Why? Because in general you need to correct something and then start over again, and it's hard to dynamically correct certain things. Your scheduling tool (Rundeck, Airflow, et al.) can pick this up as well.
More advanced restart options are possible but quickly get convoluted, although they could be done, as you allude to in option 2. I've never seen that done, though.

Mocking a Source in Akka Streams

I have a wrapper class AwsS3Bucket which, when invoked, returns a Source[ByteString, NotUsed]. In my unit test, I have mocked this client and perform the necessary assertions.
val source = Source.fromIterator(() => List(ByteString("some string")).toIterator)
when(awsS3Bucket.getSource(any[String])).thenReturn(source)
However, now I want to test the error scenario wherein I want the getSource method to throw an exception.
I tried the following code,
val error = new RuntimeException("error in source")
when(awsS3Bucket.getSource(any[String])).thenReturn(error)
but it gives me a compilation issue saying that
Cannot resolve overloaded method thenReturn
Can anyone please let me know the correct way of returning an exception from a Source in Akka Streams?
You have to use thenThrow(new RuntimeException("error in source")) to stub an Exception.
That said, you may sometimes run into issues with checked exceptions: Scala treats all exceptions as unchecked, so they aren't declared in the signature, while standard Mockito validates that the exception you stub can actually be thrown by the stubbed method.
In mockito-scala that check has been removed, to acknowledge the fact that all exceptions behave as runtime exceptions in Scala.
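For concreteness, a minimal sketch of that stubbing, using the same awsS3Bucket mock as above:
// Stub the failure with thenThrow instead of thenReturn
when(awsS3Bucket.getSource(any[String]))
  .thenThrow(new RuntimeException("error in source"))
If you would rather have the returned stream itself fail than have the method call throw, stubbing thenReturn(Source.failed(new RuntimeException("error in source"))) is another option, since Source.failed produces a Source that fails as soon as it is run.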

Exception is not caught by the try catch block

I am saving a DStream to Cassandra. There is a column in Cassandra with the map<text, text> datatype. Cassandra does not support null values in a Map, but null values can occur in the stream.
I have added a try/catch in case something goes wrong, but the program stopped despite that and I don't see an error message in the log:
try {
  cassandraStream.saveToCassandra("table", "keyspace")
} catch {
  case e: Exception => log.error("Error in saving data in Cassandra" + e.getMessage, e)
}
Exception
Caused by: java.lang.NullPointerException: Map values cannot be null
at com.datastax.driver.core.TypeCodec$AbstractMapCodec.serialize(TypeCodec.java:2026)
at com.datastax.driver.core.TypeCodec$AbstractMapCodec.serialize(TypeCodec.java:1909)
at com.datastax.driver.core.AbstractData.set(AbstractData.java:530)
at com.datastax.driver.core.AbstractData.set(AbstractData.java:536)
at com.datastax.driver.core.BoundStatement.set(BoundStatement.java:870)
at com.datastax.spark.connector.writer.BoundStatementBuilder.com$datastax$spark$connector$writer$BoundStatementBuilder$$bindColumnUnset(BoundStatementBuilder.scala:73)
at com.datastax.spark.connector.writer.BoundStatementBuilder$$anonfun$6.apply(BoundStatementBuilder.scala:84)
at com.datastax.spark.connector.writer.BoundStatementBuilder$$anonfun$6.apply(BoundStatementBuilder.scala:84)
at com.datastax.spark.connector.writer.BoundStatementBuilder$$anonfun$bind$1.apply$mcVI$sp(BoundStatementBuilder.scala:106)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at com.datastax.spark.connector.writer.BoundStatementBuilder.bind(BoundStatementBuilder.scala:101)
at com.datastax.spark.connector.writer.GroupingBatchBuilder.next(GroupingBatchBuilder.scala:106)
at com.datastax.spark.connector.writer.GroupingBatchBuilder.next(GroupingBatchBuilder.scala:31)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at com.datastax.spark.connector.writer.GroupingBatchBuilder.foreach(GroupingBatchBuilder.scala:31)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1.apply(TableWriter.scala:233)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1.apply(TableWriter.scala:210)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:112)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:111)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:145)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:111)
at com.datastax.spark.connector.writer.TableWriter.writeInternal(TableWriter.scala:210)
at com.datastax.spark.connector.writer.TableWriter.insert(TableWriter.scala:197)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:183)
at com.datastax.spark.connector.streaming.DStreamFunctions$$anonfun$saveToCassandra$1$$anonfun$apply$1.apply(DStreamFunctions.scala:54)
at com.datastax.spark.connector.streaming.DStreamFunctions$$anonfun$saveToCassandra$1$$anonfun$apply$1.apply(DStreamFunctions.scala:54)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
... 3 more
I'd like to know why the program stops despite the try/catch block. Why is the exception not caught?
To understand the source of the failure you have to acknowledge that DStreamFunctions.saveToCassandra, like DStream output operations in general, is not an action in the strict sense. In practice it just invokes foreachRDD:
dstream.foreachRDD(rdd => rdd.sparkContext.runJob(rdd, writer.write _))
which in turn:
Apply a function to each RDD in this DStream. This is an output operator, so 'this' DStream will be registered as an output stream and therefore materialized.
The difference is subtle, but important: the operation is registered here, but the actual execution happens in a different context, at a later point in time.
It means there are no runtime failures to catch at the point where you invoke saveToCassandra.
As already pointed out, try or Try would contain the driver exception if applied directly to an action. So you could, for example, re-implement saveToCassandra as:
dstream.foreachRDD(rdd => try {
  rdd.sparkContext.runJob(rdd, writer.write _)
} catch {
  case e: Exception => log.error("Error in saving data in Cassandra" + e.getMessage, e)
})
With that in place the stream should be able to proceed, although the current batch will be completely or partially lost.
It is important to note that this is not the same as catching the original exception, which will still be thrown, uncaught and visible in the log. To catch the problem at its source you'd have to apply a try/catch block directly in the writer, and this is obviously not an option when you execute code over which you have no control.
The take-away message (already stated in this thread) is: make sure to sanitize your data to avoid known sources of failure.
The problem is that you don't catch the exception you think you do. The code you have catches driver exceptions, and code structured like this does in fact do that.
It doesn't, however, mean that the program should never stop.
While the driver failure, which would be a consequence of a fatal executor failure, is contained and the driver can exit gracefully, the stream as such is already gone. Therefore your code exits, because there is no more stream to run.
If the code in question were under your control, exception handling should be delegated to the task, but in the case of third-party code there is no such option.
Instead you should validate your data and remove problematic records before they are passed to saveToCassandra, as sketched below.
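As a rough sketch of that sanitization (the record type and the attributes field are hypothetical, since the actual schema isn't shown):
// Hypothetical record type with a map<text, text> column named "attributes"
case class Record(id: String, attributes: Map[String, String])

val sanitized = cassandraStream.map { record =>
  // drop entries whose value is null so the Cassandra codec never sees them
  record.copy(attributes = record.attributes.filter { case (_, v) => v != null })
}
sanitized.saveToCassandra("table", "keyspace")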

How to test for two or more exceptions in ScalaTest?

I am using ScalaTest for unit testing. I currently have the following:
f(x) should produce[Exception]
I would like to specify two or more subclasses of Exception, e.g.
f(x) should (produce[ExceptionA] or produce[ExceptionB])
Is this possible? If not, what is the recommended way to proceed?
I would look at restructuring either your code or your tests if you've got a block of code that is non-deterministic in the exception that it will throw. That said, you could use an evaluating block to capture the thrown exception and then check if it's one of the required types. e.g.
val caught = evaluating {
  // code that should throw an exception
} should produce [Exception]
then
assert(caught.isInstanceOf[ExceptionA] || caught.isInstanceOf[ExceptionB])
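A similar pattern, sketched with ScalaTest's intercept, which also works in newer ScalaTest versions:
// intercept returns the caught exception, so its runtime type can be inspected
val caught = intercept[Exception] {
  f(x) // code expected to throw ExceptionA or ExceptionB
}
assert(caught.isInstanceOf[ExceptionA] || caught.isInstanceOf[ExceptionB])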

Catching unhandled errors in Scala futures

If a Scala future fails, and there is no continuation that "observes" that failure (or the only continuations use map/flatMap and don't run in case of failure), then errors go undetected. I would like such errors to be at least logged, so I can find bugs.
I use the term "observed error" because in .Net Tasks there is the chance to catch "unobserved task exceptions", when the Task object is collected by the GC. Similarly, with synchronous methods, uncaught exceptions that terminate the thread can be logged.
In Scala futures, to 'observe' a failure would mean that some continuation or other code reads the Exception stored in the future value before that future is disposed. I'm aware that finalization is not deterministic or reliable, and presumably that's why it's not used to catch unhandled errors, although .Net does succeed in doing this.
Is there a way to achieve this in Scala? If not, how should I organize my code to prevent unhandled error bugs?
Today I have andThen checkResult appended to various futures. But it's hard to know when to use this and when not to: if a library method returns a Future, it shouldn't checkResult and log errors itself, because the library user may handle the failure, so the responsibility falls onto the user. As I edit code I sometimes need to add checks and sometimes to remove them, and such manual management is surely wrong.
I have concluded there is no way to generically notice unhandled errors in Scala futures.
You can just use Future.recover in the function that returns the Future.
So for instance, you could just "log" the error and rethrow the original exception, in the simplest case:
import scala.util.control.NonFatal

def libraryFunction(): Future[Int] = {
  val f = ... // the underlying Future[Int]
  // note: recover needs an implicit ExecutionContext in scope
  f.recover {
    case NonFatal(t) =>
      println("Error : " + t)
      throw t
  }
}
Note the use of NonFatal to match all the exception types it is sensible to catch.
That recover block could equally return an alternative result if you wish.
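A short usage sketch (the caller code is assumed, not part of the answer): the error is logged inside libraryFunction, yet the caller still observes the failed Future and can handle it.
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

libraryFunction().onComplete {
  case Success(n) => println("Got " + n)
  case Failure(t) => println("Caller still sees the failure: " + t)
}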