Spark Task not Serializable with simple accumulator?

I am running this simple code:
val accum = sc.accumulator(0, "Progress")
listFilesPar.foreach { filepath =>
  accum += 1
}
listFilesPar is an RDD[String]
which throws the following error:
org.apache.spark.SparkException: Task not serializable
Right now I don't understand what's happening.
(I use braces rather than parentheses because I will need to write a lengthy function there; right now I am just doing unit testing.)

The typical cause of this is that the closure unexpectedly captures something: something you did not include in your paste, because you would never expect it to be serialized.
You can try reducing your code until you find it, or just turn on serialization debug logging with -Dsun.io.serialization.extendedDebugInfo=true. You will probably see in the output that Spark is trying to serialize something silly.
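For illustration, here is a minimal sketch of the usual culprit (the class and member names are invented, not taken from the question): if the foreach body lives inside a class and touches any member of that class, the closure captures this, so the whole enclosing instance has to be serialized.

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class ProgressJob(sc: SparkContext) {
  val writer = new java.io.PrintWriter("job.log") // not serializable

  def run(listFilesPar: RDD[String]): Unit = {
    val accum = sc.accumulator(0, "Progress")
    listFilesPar.foreach { filepath =>
      accum += 1
      logLine(filepath) // compiles to this.logLine(...), dragging `this` (and `writer`) into the closure
    }
  }

  def logLine(s: String): Unit = writer.println(s)
}

The usual fix is to copy whatever the closure needs into a local val before the foreach (or mark the offending field @transient), so only that local value is captured.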

Related

Getting a spurious fibre trace with ZIO and for comprehension

I have just started to look at ZIO to try to improve my Scala, beginning with updating some of my old code. I have wrapped some legacy code which returns a configuration in an Option, and I'm converting that to a ZIO which I then use in a for comprehension.
The code works as expected but if I return None I get:
Fiber failed.
A checked error was not handled.
None
Fiber:Id(1612612180323,1) was supposed to continue to:
a future continuation at tryzio.MyApp$.run(MyApp.scala:94)
a future continuation at zio.ZIO.exitCode(ZIO.scala:543)
Fiber:Id(1612612180323,1) execution trace:
at zio.ZIO$.fromOption(ZIO.scala:3246)
at tryzio.MyApp$.run(MyApp.scala:93)
The code works as expected for both a Some and a None, but I get the spurious Fibre messages for a None.
The code that generates this is really very simple:
def run(args: List[String]) = ({
  for {
    cmdln <- ZIO.fromOption(getConfig(args.toArray))
    _     <- putStrLn("Model Data Builder")
    _     <- putStrLn(s"\nVerbose: ${cmdln.verbose}")
  } yield ()
}).exitCode
But I have clearly missed something obvious! I'm new to ZIO, so please use small words when explaining my lack of understanding. Do I need to join all fibres? I have tried to catch the None and then exit, but then my code never stops running - very strange.
.exitCode caught an error (the empty configuration) which wasn't handled in your code, printed debug information to stderr, and exited the program with status 1. So it works as expected.
I agree that the error message is a bit misleading and should start with something business-related rather than "Fiber failed."
You might file a ticket on GitHub.
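In the meantime, one way to turn the unhandled None into a business-level error yourself is ZIO#orElseFail (a sketch against ZIO 1.x; the error message is made up, and getConfig is your own method):

def run(args: List[String]) = ({
  for {
    cmdln <- ZIO.fromOption(getConfig(args.toArray))
                .orElseFail(new Exception("no configuration supplied"))
    _     <- putStrLn("Model Data Builder")
    _     <- putStrLn(s"\nVerbose: ${cmdln.verbose}")
  } yield ()
}).exitCode

.exitCode still exits with status 1 on a missing configuration, but the trace now starts from your own exception instead of a bare None.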

Exception handling in Apache Spark

I have been researching the proper way of handling exceptions in Apache Spark jobs. I have read through different questions on Stack Overflow but I still haven't reached a conclusion. From my point of view there are three ways of handling exceptions:
Try/catch block surrounding the lambda function that is going to perform the computation. This is tricky because the block has to be placed around the code that triggers the lazy computation. If an error happens then I assume there won't be any RDD to work with (taken from this blog entry):
val lines: RDD[String] = sc.textFile("large_file.txt")
val tokens =
  lines.flatMap(_ split " ")
       .map(s => s(10))
try {
  // This try-catch block catches all the exceptions thrown by the
  // preceding transformations.
  tokens.saveAsTextFile("/some/output/file.txt")
} catch {
  case e: StringIndexOutOfBoundsException =>
    // Doing something in response to the exception
}
Try/catch block inside the lambda function: this implies deciding on the correct output for a caught exception inside the lambda function. Roughly:
// Sketch: assumes the record type has an errorFlag field (null when processing succeeds).
rdd.map { record =>
  Try(fn(record)) match {
    case Success(result) => result
    case Failure(_)      => record.copy(errorFlag = "error") // record with error flag
  }
}.filter(_.errorFlag == null)
Let the exception propagate. The task will fail and the Spark framework will relaunch it. This works when the error is caused by something outside the code's scope, e.g. a memory leak, or a connection to another service that is lost momentarily.
What's the correct way of handling exceptions? I guess it depends on what you want to achieve with the RDD operation. If an error in one of the RDD records means that the output is not valid, then option 1 is the way to go. If we expect some of the records to fail, we go for option 2. Option 3 does not even require a choice, as it is the normal behaviour of the platform.
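As a footnote to option 3, the size of that retry budget is configurable. A minimal sketch (spark.task.maxFailures is the standard Spark property, defaulting to 4; the values here are illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("retry-example")
  .set("spark.task.maxFailures", "8") // tolerate more transient task failures before failing the job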
In the past we did not bother with the try/catch approach except for input-parameter checking.
For the rest we just relied on checking the return code, as in:
spark-submit --master yarn ... bla bla
ret_val=$?
...
Why? Because in general you need to correct something and then start over again, and it's hard to dynamically correct certain things. Your scheduling tool (Rundeck, Airflow, et al.) can pick this up as well.
More advanced restart options are possible, as you allude to in option 2, but they simply get convoluted. It could be done, but I have never seen it done.

Gatling after method not reached

Right now I'm having difficulty with a custom Gatling feeder, despite the fact that it's circular. I'm getting this error:
java.lang.IllegalStateException: Feeder is now empty, stopping engine
I'm reading that this is the default behavior. However, I want to make sure each user uses a different refUrl from the feeder: refUrlFeederBuffer.
Also, why isn't it running my after method? I need my cleanup procedures to run regardless of the success or failure of the simulation. If I don't clean up, I can't restart the test!
var refUrlFeeder: Array[Map[String, String]] = Array()

before {
  // create stuff and put the refUrls from it in a map
  refUrlFeeder = refUrlFeeder :+ Map("refUrl" -> simpleUrl)
}

after {
  // delete whatever I created in the before method
  // THIS METHOD DOES NOT EXECUTE if the feeder is emptied
  // I need it to execute regardless of errors during the scenario
}

object ImportRecords {
  val someXml = "<some xml>"
  val feeder = RecordSeqFeederBuilder(refUrlFeeder).circular
  val update =
    feed(feeder)
      .exec(http("Update stuff")
        .put("${refUrl}")
        .body(StringBody(someXml))
        .asXML
        .check(status.is(200)))
}

val adminUpdaters = scenario("Admins who do updates").exec(ImportRecords.update)

setUp(adminUpdaters.inject(atOnceUsers(1)).protocols(httpConf))
When the feeder runs out of items, Gatling stops the whole engine. This is an exceptional situation, which is stated in the exception itself:
[error] java.lang.IllegalStateException: Feeder is now empty, stopping engine
The after hook is called only when the simulation completes. It is invoked for logical errors within your simulation, but not for developer bugs, and the hook is not called here because this is a developer bug.
Simply running out of feeder data is a bug: it means that the setUp part of your simulation is not in line with the data your feeder provides.
By the way, what does the setUp part of your simulation look like?
EDIT: Just looking at your code structure, I'm guessing (without seeing the whole simulation) that the initialisation of your ImportRecords object happens before the before hook is called, and thus your val feeder captures an empty array. Making an empty array circular just yields another empty array, hence the exception when Gatling tries to take an element from the feeder. Try adding:
println(refUrlFeeder)
into the initialisation of your object ImportRecords to find out whether this is the case.
Good luck
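If that turns out to be the case, one way around it is to populate the data during simulation construction, before any object that captures it is initialised, rather than in the before hook. A sketch (createStuff is a stand-in for your actual setup logic returning an Array[String]; .circular relies on the implicit conversion from io.gatling.core.Predef; httpConf is your existing protocol config):

import io.gatling.core.Predef._
import io.gatling.http.Predef._

class AdminSimulation extends Simulation {
  // Constructor code runs before ImportRecords is first touched.
  val refUrlFeeder: Array[Map[String, String]] =
    createStuff().map(url => Map("refUrl" -> url))

  object ImportRecords {
    val someXml = "<some xml>"
    val feeder = refUrlFeeder.circular // now sees a populated array
    val update =
      feed(feeder)
        .exec(http("Update stuff")
          .put("${refUrl}")
          .body(StringBody(someXml))
          .asXML
          .check(status.is(200)))
  }

  val adminUpdaters = scenario("Admins who do updates").exec(ImportRecords.update)
  setUp(adminUpdaters.inject(atOnceUsers(1)).protocols(httpConf))
}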

Write multiple lines in Scala

This is my code:
bufferedWriter.write(
  Play.current.configuration.getString("play.compile.c.command").get
    .replace("{objectname}", registerdJobDetails.executableNm.mkString.split('.')(0))
    .replace("{executablename}", getExecutableFolderName(jobId) + registerdJobDetails.executableNm))
bufferedWriter.newLine()
bufferedWriter.write(
  Play.current.configuration.getString("play.runable.c.command").get
    .replace("{objectname}", registerdJobDetails.executableNm.mkString.split('.')(0)))
Only the first line is written; the other lines are not.
I am getting the error:
java.util.NoSuchElementException: None.get
Most likely your problem is that Play.current.configuration.getString("play.runable.c.command") has type Option[String] and you are calling the get method on that Option, which pretty much should never be called. The world would be a better place if this method didn't even exist. But I digress.
If this call to getString returns None, then the call to get throws an exception because there is no value to get.
It seems that Play.current.configuration.getString("play.runable.c.command") is returning a None, which will throw an exception when calling get.
This should never be done! Use a combination of map and getOrElse to avoid unexpected behaviour. For instance:
val runableConfig = Play.current.configuration.getString("play.runable.c.command")
val s = runableConfig.map { runable =>
  runable.replace("{objectname}", registerdJobDetails.executableNm.mkString.split('.')(0))
}.getOrElse("NO CONFIG FOUND")
bufferedWriter.write(s)
Or you can even do the bufferedWriter.write(s) inside the map.
If you are not aware: the map method executes when the Option is defined (it's a Some(x)) and is skipped when the Option is a None.
Note: if getString is returning a None, try to debug that. Verify that you have no typos and that you are reading the right configuration file.
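To do the write inside the map as suggested, Option.foreach keeps it even shorter (a sketch; the same caveat applies: nothing at all is written when the key is missing):

Play.current.configuration.getString("play.runable.c.command").foreach { runable =>
  bufferedWriter.write(
    runable.replace("{objectname}", registerdJobDetails.executableNm.mkString.split('.')(0)))
  bufferedWriter.newLine()
}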

Why am I getting an (inconsistent) compiler error when using a for comprehension with a result of Try[Unit]?

A common pattern I have in my code base is to define methods with Try[Unit] to indicate the method is doing something (like writing a transaction log to a SaaS REST endpoint) where there is no interesting return result, and I want to capture and emit any errors encountered to stdout. I have read on another StackOverflow thread that this is entirely acceptable (also see the 2nd comment on the question).
I am specifically and explicitly avoiding actually throwing and catching an exception, which is an extraordinarily expensive activity on most JVM implementations. This means any answer which uses the Try() idiom won't help resolve my issue.
I have the following code (vastly simplified from my real life scenario):
def someMethod(transactionToLog: String): Try[Unit] = {
  val result =
    for {
      _ <- callToSomeMethodReturningATryInstance
    } yield Unit
  result match {
    case Success(_) =>
      // Unit
    case Failure(e) =>
      println(s"oopsie - ${e.getMessage}")
  }
  result
}
Sometimes this code compiles just fine. Sometimes it doesn't. When it doesn't, it gives me the following compilation error:
Error:(row, col) type mismatch;
found : scala.util.Try[Unit.type]
required: scala.util.Try[Unit]
result
^
Sometimes the IntelliJ syntax highlighter shows the code as fine. Other times it shows an error (roughly the same as the one above). Sometimes I get the compiler error but no highlighter error, and other times it compiles fine and I get only a highlighter error. Thus far, I am having a difficult time finding a perfect "example" that reliably reproduces the compiler error.
I attempted to "cure" the issue by adding .asInstanceOf[Unit] to the result of the for comprehension. That got it past the compiler. However, now I am getting a runtime exception of java.lang.ClassCastException: scala.Unit$ cannot be cast to scala.runtime.BoxedUnit. Naturally, this is infinitely worse than the original compilation error.
So, two questions:
Assuming Try[Unit] isn't valid, what is the Scala idiomatic (or even just preferred) way to specify a Try which returns no useful Success result?
Assuming Try[Unit] is valid, how do I get past the compilation error (described above)?
SIDENOTE:
I must have hit this problem a year ago and didn't remember the details. I created the solution below which I have all over my code bases. However, recently I started using Future[Unit] in a number of places. And when I first tried Try[Unit], it appeared to work. However, the number of times it is now causing both compilation and IDE highlighting issues has grown quite a bit. So, I want to make a final decision about how to proceed which is consistent with whatever exists (or is even emerging) and is Scala idiomatic.
package org.scalaolio.util

/** This package object serves to ease Scala interactions with mutating
 *  methods.
 */
package object Try_ {
  /** Placeholder type for when a mutator method does not intend to
   *  return anything with a Success, but needs to be able to return a
   *  Failure with an unthrown Exception.
   */
  sealed case class CompletedNoException private[Try_] ()

  /** Placeholder instance for when a mutator method needs to indicate
   *  success via Success(completedNoExceptionSingleton)
   */
  val completedNoExceptionSingleton = new CompletedNoException()
}
Replace yield Unit with yield ().
Unit is a type - () is a value (and the only possible value) of type Unit.
That is why for { _ <- expr } yield Unit results in a Try[Unit.type]: in expression position, Unit refers to the companion object scala.Unit, and the type of that singleton object is Unit.type, not the type Unit itself.
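Applied to the method from the question, the whole fix is the last line of the for comprehension (a sketch of the corrected version):

import scala.util.{Failure, Success, Try}

def someMethod(transactionToLog: String): Try[Unit] = {
  val result: Try[Unit] =
    for {
      _ <- callToSomeMethodReturningATryInstance
    } yield () // the value (), not the companion object Unit
  result match {
    case Success(_) => // nothing to report
    case Failure(e) => println(s"oopsie - ${e.getMessage}")
  }
  result
}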