When exactly can a Spark task be serialized? - scala

I read some related questions about this topic, but still cannot understand the following. I have this simple Spark application which reads some JSON records from a file:
import org.apache.spark.{SparkConf, SparkContext}
import org.json4s._
import org.json4s.jackson.JsonMethods.parse // or org.json4s.native.JsonMethods.parse, depending on the json4s backend

// case class Person(...) is defined elsewhere

object Main {
  // implicit val formats = DefaultFormats // OK: here it works
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("Spark Test App")
    val sc = new SparkContext(conf)
    val input = sc.textFile("/home/alex/data/person.json")
    implicit val formats = DefaultFormats // Exception: Task not serializable
    val persons = input.flatMap { line ⇒
      // implicit val formats = DefaultFormats // OK: here it also works
      try {
        val json = parse(line)
        Some(json.extract[Person])
      } catch {
        case e: Exception ⇒ None
      }
    }
  }
}
I suppose the implicit formats is not serializable since it includes a ThreadLocal for the date format. But why does it work when placed as a member of the object Main or inside the closure of flatMap, and not as a plain val inside the main function?
Thanks in advance.

If the formats is inside the flatMap, it's only created as part of executing the mapping function. So the mapper can be serialized and sent to the cluster, since it doesn't contain a formats yet. The flip side is that this creates formats anew every time the mapper runs (i.e. once per row); you might prefer to use mapPartitions rather than flatMap so that the value is created once per partition.
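For illustration, a minimal sketch of that mapPartitions variant, reusing the question's own json4s setup (Person and parse come from the question):
val persons = input.mapPartitions { lines ⇒
  implicit val formats = DefaultFormats // created once per partition, inside the task
  lines.flatMap { line ⇒
    try Some(parse(line).extract[Person])
    catch { case e: Exception ⇒ None }
  }
}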
If formats is outside the flatMap then it's created once on the master machine, and you're attempting to serialize it and send it to the cluster.
I don't understand why formats as a field of Main would work. Maybe objects are magically pseudo-serializable because they're singletons (i.e. their fields aren't actually serialized; rather, what gets serialized is the fact that this is a reference to the single static Main instance)? That's just a guess though.

The best way to answer your question, I think, is with three short answers:
1) Why does it work when placed as a member of the object Main? It works because the val is inside an object, not necessarily the Main object. Why? Because Spark serializes your whole object and sends it to each of the executors; moreover, an object in Scala is compiled like a Java static class, and the initial values of its static fields are stored in the jar, so workers can use them directly. This is not the case if you use a class instead of an object.
2) Why does it work if it's inside the flatMap?
When you run transformations on an RDD (filter, flatMap, etc.), your transformation code is serialized on the driver node, sent to a worker, and once there it is deserialized and executed. So, exactly as in 1), the code is serialized "automatically".
And finally, 3) Why does this not work as a plain val inside the main function? Because that val is not serialized "automatically", but you can test this explanation by making the value itself Serializable, e.g. val yourVal = new YourClass with Serializable (see the sketch below).
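Applied to the question's code, that last suggestion would look roughly like this (a sketch only; whether it actually removes the Task not serializable error depends on your json4s version and on what else the closure captures):
// Sketch: make the val itself Serializable via an anonymous subclass of the DefaultFormats trait.
implicit val formats = new DefaultFormats with Serializable {}

val persons = input.flatMap { line ⇒
  try Some(parse(line).extract[Person])
  catch { case e: Exception ⇒ None }
}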

Related

Variable inside dataframe foreach gives null pointer exception in Scala

I'm having some issues when trying to execute a class function inside a "dataframe.foreach" function. My custom class is persisting the data into a DynamoDB table.
What happens is that if I have the following code, it won't work and will raise a "Null Pointer Exception" that points to the line of code where the "writer.writeRow(r)" is executed:
object writeToDynamoDB extends App {
  val df: DataFrame = ...
  val writer: DynamoDBWriter = new DDBWriter(...)
  df
    .foreach(
      r => writer.writeRow(r)
    )
}
If I use the same code, but put it inside a code block or an if clause, it works:
object writeToDynamoDB extends App {
  val df: DataFrame = ...
  if (true) {
    val writer: DynamoDBWriter = new DDBWriter(...)
    df
      .foreach(
        r => writer.writeRow(r)
      )
  }
}
I guess it has something to do with the variable scope. Even in IntelliJ the color of the variable is purple and italic in the first case and "regular" grey in the second case. I read about it, and we have method, field and local scope in Scala, but I can't relate that to what I'm trying to do.
Some questions after this introduction:
Can anyone explain why Scala and/or Spark behave this way?
The solution here, as far as I know, is to put the code inside a function, a code block,
or a "fake" if clause. Is there any possible issue regarding Spark properties (shuffles, etc.)?
Is there any other way to do this type of operations?
Hope I was clear.
Thanks in advance.
Regards
As said above, your issue is caused by delayed initialization when using the App trait. Spark docs strongly discourage that:
Note that applications should define a main() method instead of extending scala.App. Subclasses of scala.App may not work correctly.
The reason can be found in the Javadocs of the App trait itself:
It should be noted that this trait is implemented using the DelayedInit functionality, which means that fields of the object will not have been initialized before the main method has been executed.
This basically means that writer is still uninitialized (so null) by the time the closure passed to foreach is created.
If you put the respective code into a block, writer becomes a local variable and is initialized at the time the block is evaluated. That way your closure will contain the correct value of writer. In this case it doesn't matter anymore when the code is evaluated, because everything gets evaluated together.
The correct and recommended solution is to use a standard main method for your Spark applications:
object writeToDynamoDB {
  def main(args: Array[String]): Unit = {
    val df: DataFrame = ...
    val writer: DynamoDBWriter = new DDBWriter(...)
    df.foreach(r => writer.writeRow(r))
  }
}

How to update the "bytes written" count in custom Spark data source?

I created a Spark data source that uses the "older" DataSource V1 API to write data in a specific binary format that our measuring devices and some software require, i.e., my DefaultSource extends CreatableRelationProvider.
In the appropriate createRelation method I call my own custom method to write the data from the DataFrame passed in. I am doing this with the help of Hadoop's FileSystem API, initialized from the Hadoop Configuration one can pull out of the supplied DataFrame:
def createRelation(sqlContext: SQLContext,
                   mode: SaveMode,
                   parameters: Map[String, String],
                   data: DataFrame): BaseRelation = {
  val path = ... // taken from parameters; the real code has more preparation, save-mode checks etc.
  MyCustomWriter.write(data, path)
  EchoingRelation(data) // small class that just wraps the data frame into a BaseRelation with TableScan
}
In MyCustomWriter I then do all sorts of things, and in the end I save the data as a side effect of map, mapPartitions and foreachPartition calls on the executors of the cluster, like this:
val confBytes = conf.toByteArray // implicit I wrote that turns Hadoop Writables into a byte array, as Configuration isn't serializable

data
  .select(...)
  .where(...)
  // much more
  .as[Foo]
  .mapPartitions { it =>
    val conf = confBytes.toWritable[Configuration] // the reverse of toByteArray
    val writeResult = customWriteRecords(it, conf) // writes data to disk using the Hadoop FS API
    writeResult.iterator
  }
  // do more stuff
While this approach works fine, I notice that when running this, the Output column in the Spark job UI is not updated. Is it somehow possible to propagate this information or do I have to wrap the data in Writables and use a Hadoop FileOutputFormat approach instead?
I found a hacky approach.
Inside an RDD/DataFrame operation you can get hold of the OutputMetrics:
val metrics = TaskContext.get().taskMetrics().outputMetrics
This has the fields bytesWritten and recordsWritten. However, the setters are package-private to org.apache.spark.executor, so I created a "breakout object" in that package:
package org.apache.spark.executor

object OutputMetricsBreakout {
  def setRecordsWritten(outputMetrics: OutputMetrics,
                        recordsWritten: Long): Unit =
    outputMetrics.setRecordsWritten(recordsWritten)

  def setBytesWritten(outputMetrics: OutputMetrics,
                      bytesWritten: Long): Unit =
    outputMetrics.setBytesWritten(bytesWritten)
}
Then I can use:
val myBytesWritten = ... // calculate written bytes
OutputMetricsBreakout.setBytesWritten(metrics, myBytesWritten + metrics.bytesWritten)
This is a hack but the only "simple" way I could come up with.
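For illustration, a hedged sketch of how this could be wired into the mapPartitions call from the question. Only TaskContext and the breakout object above are real; confBytes, customWriteRecords and the counts obtained from the custom writer are placeholders from the question's own code:
import org.apache.spark.TaskContext

.mapPartitions { it =>
  val conf = confBytes.toWritable[Configuration]
  val writeResult = customWriteRecords(it, conf)
  val bytesJustWritten: Long   = ... // assumed: reported by the custom writer
  val recordsJustWritten: Long = ... // assumed: reported by the custom writer

  // Add this task's contribution to the metrics shown in the Spark UI "Output" column.
  val metrics = TaskContext.get().taskMetrics().outputMetrics
  OutputMetricsBreakout.setBytesWritten(metrics, metrics.bytesWritten + bytesJustWritten)
  OutputMetricsBreakout.setRecordsWritten(metrics, metrics.recordsWritten + recordsJustWritten)

  writeResult.iterator
}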

How to iterate over result of Future List in Scala?

I am new to Scala and am trying my hand at Akka. I am trying to access data from MongoDB in Scala and want to convert it into JSON and XML format.
The code attached below uses the path /getJson and calls the getJson() function to get the data in the form of a Future.
get {
  concat(
    path("getJson") {
      val f = Patterns.ask(actor1, getJson(), 10.seconds)
      val res = Await.result(f, 10.seconds)
      val result = res.toString
      complete(res.toString)
    }
  )
}
The getJson() method is as follows:
def getJson() = {
  val future = collection.find().toFuture()
  future
}
I have a Greeting Case class in file Greeting.scala:
case class Greeting(msg:String,name:String)
And the MyJsonProtocol.scala file for marshalling Scala objects to JSON format is as follows:
trait MyJsonProtocol extends SprayJsonSupport with DefaultJsonProtocol {
  implicit val templateFormat = jsonFormat2(Greeting)
}
I am getting the output of complete(res.toString) in Postman as:
Future(Success(List(
Iterable(
(_id,BsonObjectId{value=5fc73944986ced2b9c2527c4}),
(msg,BsonString{value='Hiiiiii'}),
(name,BsonString{value='Ruchirrrr'})
),
Iterable(
(_id,BsonObjectId{value=5fc73c35050ec6430ec4b211}),
(msg,BsonString{value='Holaaa Amigo'}),
(name,BsonString{value='Pablo'})),
Iterable(
(_id,BsonObjectId{value=5fc8c224e529b228916da59d}),
(msg,BsonString{value='Demo'}),
(name,BsonString{value='RuchirD'}))
)))
Can someone please tell me how to iterate over this output and to display it in JSON format?
When working with Scala, it's very important to know your way around types. The first step towards this is at least knowing the types of your variables and values.
If you look at this method,
def getJson() = {
  val future = collection.find().toFuture()
  future
}
It lacks type information at all levels, which is a really bad practice.
I am assuming that you are using mongo-scala-driver, and that your collection is actually a MongoCollection[Document].
This means that the output of collection.find() should be a FindObservable[Document], hence collection.find().toFuture() should be a Future[Seq[Document]]. So your getJson method should be written as:
def getJson(): Future[Seq[Document]] =
  collection.find().toFuture()
Now, this means that you are passing a Future[Seq[Document]] to your actor1, which is again a bad practice. You should never send any kind of Future value between actors. It also looks like your actor1 does nothing but send the same message back. Why is this actor1 even required when it does nothing?
This means your f is a Future[Future[Seq[Document]]]. Then you are using Await.result to get the result of this future f, which is again an anti-pattern, since Await blocks your thread.
Now your res is a Future[Seq[Document]], and you are converting it to a String and sending that string back with complete.
Your JsonProtocol is not working because you are not even passing it any Greetings.
You have to do the following:
Read raw Bson objects from Mongo.
Convert the raw Bson objects to your Greeting objects.
Complete your result with these Greeting objects. The JsonProtocol should take care of converting these Greeting objects to JSON.
The easiest way to do all this is by using the mongo driver's CodecRegistries.
case class Greeting(msg:String, name:String)
Now, your MongoDAL object will look like the following (it might be missing some imports; fill in any missing imports as you did in your own code).
import scala.concurrent.Future

import org.mongodb.scala.bson.codecs.Macros
import org.mongodb.scala.bson.codecs.DEFAULT_CODEC_REGISTRY
import org.bson.codecs.configuration.CodecRegistries
import org.mongodb.scala.{MongoClient, MongoCollection, MongoDatabase}

object MongoDAL {

  val greetingCodecProvider = Macros.createCodecProvider[Greeting]()

  val codecRegistry = CodecRegistries.fromRegistries(
    CodecRegistries.fromProviders(greetingCodecProvider),
    DEFAULT_CODEC_REGISTRY
  )

  val mongoClient: MongoClient = ... // however you are connecting to mongo and creating a mongo client

  val mongoDatabase: MongoDatabase =
    mongoClient
      .getDatabase("database_name")
      .withCodecRegistry(codecRegistry)

  val greetingCollection: MongoCollection[Greeting] =
    mongoDatabase.getCollection[Greeting]("greeting_collection_name")

  def fetchAllGreetings(): Future[Seq[Greeting]] =
    greetingCollection.find().toFuture()
}
Now, your route can be defined as
get {
  concat(
    path("getJson") {
      val greetingSeqFuture: Future[Seq[Greeting]] = MongoDAL.fetchAllGreetings()

      // I don't see any need for that actor thing,
      // but if you really need to do that, then you can
      // do that by using flatMap to chain future computations
      // (mapTo recovers the element type from ask's Future[Any]).
      val actorResponseFuture: Future[Seq[Greeting]] =
        greetingSeqFuture
          .flatMap(greetingSeq => Patterns.ask(actor1, greetingSeq, 10.seconds).mapTo[Seq[Greeting]])

      // complete can handle futures just fine:
      // it will wait for the future's completion and then
      // convert the Seq of Greetings to JSON using your JsonProtocol
      complete(actorResponseFuture)
    }
  )
}
First of all, don't call toString in complete(res.toString).
As said in the Akka HTTP JSON support guide, if you set everything up right, your case class will be converted to JSON automatically.
But as I see in the output, your res is not an object of the Greeting type. It looks like it is somehow related to Greeting and has the same structure; it seems to be the raw output of the MongoDB request. If that assumption is correct, you should convert the raw output from MongoDB to your Greeting case class.
I guess it could be done in getJson(), after collection.find().
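For illustration, a rough sketch of that conversion, assuming the mongo-scala-driver Document API and the field names visible in the output above (untested):
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

def getJson(): Future[Seq[Greeting]] =
  collection.find().toFuture().map { docs =>
    docs.map { d =>
      Greeting(
        msg  = d("msg").asString().getValue,  // d("msg") yields a BsonValue
        name = d("name").asString().getValue
      )
    }
  }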

How to construct and persist a reference object per worker in a Spark 2.3.0 UDF?

In a Spark 2.3.0 Structured Streaming job I need to append a column to a DataFrame which is derived from the value of the same row of an existing column.
I want to define this transformation in a UDF and use withColumn to build the new DataFrame.
Doing this transform requires consulting a very-expensive-to-construct reference object -- constructing it once per record yields unacceptable performance.
What is the best way to construct and persist this object once per worker node so it can be referenced repeatedly for every record in every batch? Note that the object is not serializable.
My current attempts have revolved around subclassing UserDefinedFunction to add the expensive object as a lazy member and providing an alternate constructor to this subclass that does the init normally performed by the udf function, but so far I've been unable to get it to do the kind of type coercion that udf does -- some deep type inference wants objects of type org.apache.spark.sql.Column, while my transformation lambda works on a String for input and output.
Something like this:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.types.DataType

class ExpensiveReference {
  def ExpensiveReference() = ... // Very slow
  def transformString(in: String) = ... // Fast
}

class PersistentValUDF(f: AnyRef, dataType: DataType, inputTypes: Option[Seq[DataType]])
    extends UserDefinedFunction(f, dataType, inputTypes) {

  lazy val ExpensiveReference = new ExpensiveReference()

  def this() = {
    this(((in: String) => ExpensiveReference.transformString(in)): (String => String), StringType, Some(List(StringType)))
  }
}
The further I dig into this rabbit hole the more I suspect there's a better way to accomplish this that I'm overlooking. Hence this post.
Edit:
I tested initializing a reference lazily in an object declared inside the UDF; this triggers reinitialization. Example code and object:
class IntBox {
  var valu = 0
  def increment {
    valu = valu + 1
  }
  def get: Int = {
    return valu
  }
}

val altUDF = udf((input: String) => {
  object ExpensiveRef {
    lazy val box = new IntBox
    def transform(in: String): String = {
      box.increment
      return in + box.get.toString
    }
  }
  ExpensiveRef.transform(input)
})
The above UDF always appends 1; so the lazy object is being reinitialized per-record.
I found this post whose Option 1 I was able to turn into a workable solution. The end result ended up being similar to Jacek Laskowski's answer, but with a few tweaks:
Pull the object definition outside of the UDF's scope. Even though it is lazy, it will still be reinitialized if it's defined inside the UDF's scope.
Move the transform function off of the object and into the UDF's lambda (required to avoid serialization errors).
Capture the object's lazy member in the closure of the UDF lambda.
Something like this:
object ExpensiveReference {
  lazy val ref = ...
}

val persistentUDF = udf((input: String) => {
  /* transform code that references ExpensiveReference.ref */
})
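A brief usage sketch of applying such a UDF with withColumn, as described in the question ("existing" and "derived" are hypothetical column names):
import org.apache.spark.sql.functions.col

// Sketch only: apply the UDF to derive a new column from an existing one.
val augmented = df.withColumn("derived", persistentUDF(col("existing")))
Because ExpensiveReference lives outside the lambda, every task running in the same executor JVM reuses the single lazily built ref.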
DISCLAIMER Let me have a go at this, but please consider it a work in progress (downvotes are a big no-no :))
What I'd do would be to use a Scala object with a lazy val for the expensive reference.
object ExpensiveReference {
  lazy val ref = ???
  def transform(in: String) = {
    // use ref here
  }
}
With the object, whatever you do on a Spark executor (be it part of a UDF or any other computation) is going to instantiate ExpensiveReference.ref at the very first access. You could access it directly or as part of transform.
Again, it does not really matter whether you do this in a UDF, a UDAF or any other transformation. The point is that once a computation happens on a Spark executor, the construction of the "very-expensive-to-construct reference object" happens only once per executor, not once per record.
It could be in a UDF (just to make it clearer).
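To make that concrete, a hedged sketch of calling the object from a UDF (column names are placeholders, and transform is assumed to return a String so it can be used as a column value):
import org.apache.spark.sql.functions.{col, udf}

// ExpensiveReference.ref is initialized lazily, once per executor JVM, on first use.
val transformUDF = udf((in: String) => ExpensiveReference.transform(in))
val result = df.withColumn("out", transformUDF(col("in")))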

Handling case classes in twitter chill (Scala interface to Kryo)?

Twitter-chill looks like a good solution to the problem of how to serialize efficiently in Scala without excessive boilerplate.
However, I don't see any evidence of how they handle case classes. Does this just work automatically or does something need to be done (e.g. creating a zero-arg constructor)?
I have some experience with the WireFormat serialization mechanism built into Scoobi, which is a Scala Hadoop wrapper similar to Scalding. They have serializers for case classes up to 22 arguments that use the apply and unapply methods and do type matching on the arguments to these functions to retrieve the types. (This might not be necessary in Kryo/chill.)
They generally just work (as long as the component members are also serializable by Kryo):
import com.twitter.chill.ScalaKryoInstantiator
import com.esotericsoftware.kryo.io.{Input, Output}

case class Foo(id: Int, name: String)

// setup
val instantiator = new ScalaKryoInstantiator
instantiator.setRegistrationRequired(false)
val kryo = instantiator.newKryo()

// write
val data = Foo(1, "bob")
val buffer = new Array[Byte](4096)
val output = new Output(buffer)
kryo.writeObject(output, data)

// read
val input = new Input(buffer)
val data2 = kryo.readObject(input, classOf[Foo])
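A quick sanity check of the round trip (a sketch, using the same setup as above):
// The deserialized value should compare equal to the original case class instance.
assert(data2 == Foo(1, "bob"))
println(data2) // prints: Foo(1,bob)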