Variable inside dataframe foreach gives null pointer exception in Scala

I'm having some issues when trying to execute a class function inside a "dataframe.foreach" function. My custom class is persisting the data into a DynamoDB table.
With the following code, it doesn't work and raises a NullPointerException pointing at the line where "writer.writeRow(r)" is executed:
object writeToDynamoDB extends App {
  val df: DataFrame = ...
  val writer: DynamoDBWriter = new DDBWriter(...)
  df
    .foreach(
      r => writer.writeRow(r)
    )
}
If I use the same code but place it inside a code block or an if clause, it works:
object writeToDynamoDB extends App {
  val df: DataFrame = ...
  if (true) {
    val writer: DynamoDBWriter = new DDBWriter(...)
    df
      .foreach(
        r => writer.writeRow(r)
      )
  }
}
I guess it has something to do with variable scope. Even in IntelliJ the colour of the variable is purple and italic in the first case and "regular" grey in the second. I read about it, and we have method, field and local scope in Scala, but I can't relate that to what I'm trying to do.
Some questions after this introduction:
Can anyone explain why Scala and/or Spark behave this way?
As far as I know, the solution is to put the code inside a function, a code block, or a "fake" if clause. Could that approach cause any issues with Spark (shuffles, etc.)?
Is there any other way to do this type of operation?
Hope I was clear.
Thanks in advance.
Regards

As said above, your issue is caused by delayed initialization when using the App trait. Spark docs strongly discourage that:
Note that applications should define a main() method instead of extending scala.App. Subclasses of scala.App may not work correctly.
The reason can be found in the Javadocs of the App trait itself:
It should be noted that this trait is implemented using the DelayedInit functionality, which means that fields of the object will not have been initialized before the main method has been executed.
This basically means that writer is still uninitialized (so null) by the time the closure passed to foreach is created.
If you put the respective code into a block, writer becomes a local variable and is initialized at the time the block is evaluated. That way your closure will contain the correct value of writer. In this case it no longer matters when the code is evaluated, because everything gets evaluated together.
The correct and recommended solution is to use a standard main method for your Spark applications:
object writeToDynamoDB {
  def main(args: Array[String]): Unit = {
    val df: DataFrame = ...
    val writer: DynamoDBWriter = new DDBWriter(...)
    df.foreach(r => writer.writeRow(r))
  }
}
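Another option worth considering, sketched roughly below under the assumption that your DDBWriter/DynamoDBWriter can be constructed on the executors: create the writer once per partition with foreachPartition instead of capturing it in a closure at all. This also means one writer handles all rows of a partition rather than paying any setup cost per row.
// Sketch only; writeToDynamoDBPerPartition is a made-up name and
// DDBWriter/DynamoDBWriter are the classes from the question.
object writeToDynamoDBPerPartition {
  def main(args: Array[String]): Unit = {
    val df: DataFrame = ...   // obtained as in the original snippet

    // The (possibly non-serializable) writer is created on each executor,
    // once per partition, so nothing problematic is serialized.
    df.rdd.foreachPartition { rows =>
      val writer: DynamoDBWriter = new DDBWriter(...)
      rows.foreach(r => writer.writeRow(r))
      // flush/close the writer here if its API requires it
    }
  }
}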

Related

How to construct and persist a reference object per worker in a Spark 2.3.0 UDF?

In a Spark 2.3.0 Structured Streaming job I need to append a column to a DataFrame which is derived from the value of the same row of an existing column.
I want to define this transformation in a UDF and use withColumn to build the new DataFrame.
Doing this transform requires consulting a very-expensive-to-construct reference object -- constructing it once per record yields unacceptable performance.
What is the best way to construct and persist this object once per worker node so it can be referenced repeatedly for every record in every batch? Note that the object is not serializable.
My current attempts have revolved around subclassing UserDefinedFunction to add the expensive object as a lazy member, and providing an alternate constructor to this subclass that does the initialization normally performed by the udf function. But so far I've been unable to get it to do the kind of type coercion that udf does -- some deep type inference wants objects of type org.apache.spark.sql.Column, while my transformation lambda works on a String for input and output.
Something like this:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.types.DataType

class ExpensiveReference {
  def ExpensiveReference() = ... // Very slow
  def transformString(in: String) = ... // Fast
}

class PersistentValUDF(f: AnyRef, dataType: DataType, inputTypes: Option[Seq[DataType]])
    extends UserDefinedFunction(f: AnyRef, dataType: DataType, inputTypes: Option[Seq[DataType]]) {
  lazy val ExpensiveReference = new ExpensiveReference()
  def PersistentValUDF() {
    this(((in: String) => ExpensiveReference.transformString(in)): (String => String), StringType, Some(List(StringType)))
  }
}
The further I dig into this rabbit hole the more I suspect there's a better way to accomplish this that I'm overlooking. Hence this post.
Edit:
I tested initializing a reference lazily in an object declared inside the UDF; this triggers reinitialization. Example code:
class IntBox {
  var valu = 0
  def increment {
    valu = valu + 1
  }
  def get: Int = {
    return valu
  }
}

val altUDF = udf((input: String) => {
  object ExpensiveRef {
    lazy val box = new IntBox
    def transform(in: String): String = {
      box.increment
      return in + box.get.toString
    }
  }
  ExpensiveRef.transform(input)
})
The above UDF always appends 1, so the lazy object is being reinitialized per record.
I found this post whose Option 1 I was able to turn into a workable solution. The end result ended up being similar to Jacek Laskowski's answer, but with a few tweaks:
Pull the object definition outside of the UDF's scope. Even being lazy, it will still reinitialize if it's defined in the scope of the UDF.
Move the transform function off of the object and into the UDF's lambda (required to avoid serialization errors)
Capture the object's lazy member in the closure of the UDF lambda
Something like this:
object ExpensiveReference {
  lazy val ref = ...
}

val persistentUDF = udf((input: String) => {
  /* transform code that references ExpensiveReference.ref */
})
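For concreteness, a rough sketch of that pattern with made-up stand-ins (the lookup Map below is just a placeholder for whatever is genuinely expensive to build):
import org.apache.spark.sql.functions.udf

// Sketch only; the "expensive" lookup map is a made-up stand-in.
object ExpensiveReference {
  // Built at most once per executor JVM, on first access.
  lazy val ref: Map[String, String] = {
    // imagine a very slow construction here
    Map("a" -> "alpha", "b" -> "beta")
  }
}

// The transform logic lives in the UDF's lambda and only touches the lazy member.
val persistentUDF = udf { (input: String) =>
  ExpensiveReference.ref.getOrElse(input, input)
}

// Usage, assuming a DataFrame `df` with a string column "key":
// df.withColumn("resolved", persistentUDF(df("key")))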
DISCLAIMER Let me have a go at this, but please consider it a work in progress (downvotes are a big no-no :))
What I'd do would be to use a Scala object with a lazy val for the expensive reference.
object ExpensiveReference {
  lazy val ref = ???
  def transform(in: String) = {
    // use ref here
  }
}
With the object, whatever you do on a Spark executor (be it part of a UDF or any other computation) will instantiate ExpensiveReference.ref at the very first access. You could access it directly or as part of transform.
Again, it does not really matter whether you do this in a UDF, a UDAF or any other transformation. The point is that once the computation happens on a Spark executor, the "very-expensive-to-construct reference object" is constructed only once per executor, not once per record.
It could be in a UDF (just to make it clearer).

Scala NullPointerException during initialization

Consider the following case (this is a simplified version of what I have; numberOfScenarios is the most important variable here: I usually use a hardcoded number instead, but I'm trying to see whether it's possible to calculate the value):
object ScenarioHelpers {
  val identifierList = (1 to Scenarios.numberOfScenarios).toArray
  val concurrentIdentifierQueue = new ConcurrentLinkedQueue[Int](identifierList.toSeq)
}

abstract class AbstractScenario {
  val identifier = ScenarioHelpers.concurrentIdentifierQueue.poll()
}

object Test1 extends AbstractScenario {
  val scenario1 = scenario("test scenario 1").exec(/* ..steps.. */)
}

object Test2 extends AbstractScenario {
  val scenario2 = scenario("test scenario 2").exec(/* ..steps.. */)
}

object Scenarios {
  val scenarios = List(Test1.scenario1, Test2.scenario2)
  val numberOfScenarios = scenarios.length
}

object TestPreparation {
  val feeder = ScenarioHelpers.identifierList.map(n => Map("counter" -> n))
  val prepScenario = scenario("test preparation")
    .feed(feeder)
    .exec(/* ..steps.. */)
}
Not sure if it matters, but the simulation starts by executing TestPreparation.prepScenario.
I see that this code contains a circular dependency which makes this case impossible in and of itself. But I get a NullPointerException on the line in AbstractScenario where identifier is being initialized.
I don't fully understand all this, but I think it has something to do with the vals being simply declared at first and the initialization does not happen until later. So when identifier is being initialized, the concurrentIdentifierQueue is not yet initialized and is therefore null.
I'm just trying to understand the reasons behind the NullPointerException and also if there's any way to get this working with minimal changes? Thank you.
NPEs during trait initialization are a very common problem.
The most robust way to resolve them is to avoid implementation inheritance altogether.
If that is not possible for some reason, you can make the problematic fields a lazy val or a def instead of a val, as sketched below.
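A minimal sketch of the lazy variant, keeping everything else from the question as is; deferring the poll() means the Test1/Test2 constructors no longer touch ScenarioHelpers while it is still initializing:
abstract class AbstractScenario {
  // Sketch: `lazy` defers the poll() until `identifier` is first read,
  // by which time ScenarioHelpers (and Scenarios) have finished initializing.
  lazy val identifier: Int = ScenarioHelpers.concurrentIdentifierQueue.poll()
}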
You answered that yourself:
I see that this code contains a circular dependency which makes this case impossible in and of itself. But I get a NullPointerException on the line in AbstractScenario where identifier is being initialized.
val feeder = ScenarioHelpers.identifierList... triggers ScenarioHelpers initialization
val identifierList = (1 to Scenarios.numberOfScenarios).toArray triggers Scenarios initialization
val scenarios = List(Test1.scenario1, Test2.scenario2) triggers Test1 initialization, including AbstractScenario
At val identifier = ScenarioHelpers.concurrentIdentifierQueue.poll(), ScenarioHelpers is still initializing, so concurrentIdentifierQueue is null.
You have to obtain numberOfScenarios in a non-cyclic way. Personally, I would remove identifierList and assign identifier some other way, e.g. with an incrementing counter, as sketched below.
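A rough sketch of that counter idea (assuming a simple atomic counter fits the use case); it removes the dependency on Scenarios.numberOfScenarios entirely:
import java.util.concurrent.atomic.AtomicInteger

// Sketch of the "incrementing counter" suggestion: each scenario simply takes
// the next number, so no list sized by Scenarios.numberOfScenarios is needed.
object ScenarioHelpers {
  private val counter = new AtomicInteger(0)
  def nextIdentifier(): Int = counter.incrementAndGet()
}

abstract class AbstractScenario {
  val identifier: Int = ScenarioHelpers.nextIdentifier()
}
The feeder in TestPreparation would then need its own source of numbers (for example a plain range), since identifierList goes away.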

Scala Class parameters

I have recently come across a Scala class in my project that has curried parameter lists. For example:
class A(a: Int, b: Int)(c: Int) {
  /**
   * Some definition
   **/
}
What is the advantage of this kind of parameterization?
When I create an object as
val obj = new A(10,20)
I get a runtime error.
Can anyone explain?
What is the advantage of this kind of parameterization?
I can think of two possible advantages:
1. Delayed construction.
You can construct part of the instance in one area of the code base...
val almostA = new A(10, 20)(_)
...and complete it in a different area of the code.
val realA = almostA(30)
2. Default parameter value.
Your example doesn't do this, but a curried parameter can reference an earlier parameter.
class A(a: Int, b: Int)(c: Int = a) { ...
When I create an object as val obj = new A(10,20) I am getting runtime error.
You should be getting a compile-time error. You either have to complete the constructor parameters, new A(10,20)(30), or leave a placeholder, new A(10,20)(_). You can leave the arguments of the second parameter group out, writing new A(10,20)(), only if it has a default value (see #2 above).
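Putting both points together, a small self-contained illustration (the names and numbers are made up):
// Sketch illustrating the two advantages described above.
class A(a: Int, b: Int)(c: Int = a) {
  def sum: Int = a + b + c
}

object CurriedParamsDemo {
  def main(args: Array[String]): Unit = {
    val fullA: A = new A(10, 20)(30)          // all parameter groups supplied
    val almostA: Int => A = new A(10, 20)(_)  // delayed construction: a function still waiting for c
    val realA: A = almostA(30)
    val defaulted: A = new A(10, 20)()        // second group defaulted to c = a
    println((fullA.sum, realA.sum, defaulted.sum)) // prints (60,60,40)
  }
}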

When exactly a Spark task can be serialized?

I read some related questions about this topic, but still cannot understand the following. I have this simple Spark application which reads some JSON records from a file:
object Main {
  // implicit val formats = DefaultFormats // OK: here it works
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local").setAppName("Spark Test App")
    val sc = new SparkContext(conf)
    val input = sc.textFile("/home/alex/data/person.json")

    implicit val formats = DefaultFormats // Exception: Task not serializable

    val persons = input.flatMap { line ⇒
      // implicit val formats = DefaultFormats // OK: here it also works
      try {
        val json = parse(line)
        Some(json.extract[Person])
      } catch {
        case e: Exception ⇒ None
      }
    }
  }
}
I suppose the implicit formats is not serializable since it includes a ThreadLocal for the date format. But why does it work when placed as a member of the object Main or inside the closure of flatMap, and not as a plain val inside the main function?
Thanks in advance.
If the formats is inside the flatMap, it's only created as part of executing the mapping function. So the mapper can be serialized and sent to the cluster, since it doesn't contain a formats yet. The flipside is that this will create formats anew every time the mapper runs (i.e. once for every row) - you might prefer to use mapPartitions rather than flatMap so that you can have the value created once for each partition.
If formats is outside the flatMap then it's created once on the master machine, and you're attempting to serialize it and send it to the cluster.
I don't understand why formats as a field of Main would work. Maybe objects are magically pseudo-serializable because they're singletons (i.e. their fields aren't actually serialized, rather the fact that this is a reference to the single static Main instance is serialized)? That's just a guess though.
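A rough sketch of that mapPartitions variant, reusing the question's input RDD and Person case class and assuming json4s's parse/extract as in the question; formats is then built once per partition instead of once per row:
import org.json4s.DefaultFormats
import org.json4s.jackson.JsonMethods.parse

// Sketch: the non-serializable `formats` is created inside the partition
// function, so nothing has to be serialized from the driver and it is only
// constructed once per partition.
val persons = input.mapPartitions { lines =>
  implicit val formats: DefaultFormats.type = DefaultFormats
  lines.flatMap { line =>
    try Some(parse(line).extract[Person])
    catch { case _: Exception => None }
  }
}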
I think your question is best answered in three short parts:
1) Why does it work when placed as a member of the object Main? It works because it is inside an object (not necessarily the Main object). Why? Because Spark serializes your whole object and sends it to each of the executors; moreover, an object in Scala is compiled like a Java static class, and the initial values of static fields are stored in the jar, so the workers can use them directly. This is not the same if you use a class instead of an object.
2) Why does it work inside the flatMap? When you run transformations on an RDD (filter, flatMap, etc.), your transformation code is serialized on the driver node, sent to the workers, and there deserialized and executed. As you can see, exactly as in 1), the code gets serialized "automatically".
3) Why does it not work as a plain val inside the main function? Because that val is not serialized "automatically"; you can test this by making the value serializable, e.g. val yourVal = new YourClass with Serializable.

Squeryl session management with 'using'

I'm learning Squeryl and trying to understand the 'using' syntax but can't find documentation on it.
In the following example two databases are created, A contains the word Hello, and B contains Goodbye. The intention is to query the contents of A, then append the word World and write the result to B.
Expected console output is Inserted Message(2,HelloWorld)
object Test {
  def main(args: Array[String]) {
    Class.forName("org.h2.Driver")
    import Library._

    val sessionA = Session.create(DriverManager.getConnection(
      "jdbc:h2:file:data/dbA", "sa", "password"), new H2Adapter)
    val sessionB = Session.create(DriverManager.getConnection(
      "jdbc:h2:file:data/dbB", "sa", "password"), new H2Adapter)

    using(sessionA) {
      drop; create
      myTable.insert(Message(0, "Hello"))
    }
    using(sessionB) {
      drop; create
      myTable.insert(Message(0, "Goodbye"))
    }
    using(sessionA) {
      val results = from(myTable)(s => select(s)) //.toList
      using(sessionB) {
        results.foreach(m => {
          val newMsg = m.copy(msg = (m.msg + "World"))
          myTable.insert(newMsg)
          println("Inserted " + newMsg)
        })
      }
    }
  }

  case class Message(val id: Long, val msg: String) extends KeyedEntity[Long]
  object Library extends Schema { val myTable = table[Message] }
}
As it stands, the code prints Inserted Message(2,GoodbyeWorld), unless toList is added at the end of the val results line.
Is there some way to bind the results query to use sessionA even when evaluated inside the using(sessionB)? This seems preferable to using toList to force the query to evaluate and store the contents in memory.
Update
Thanks to Dave Whittaker's answer, the following snippet fixes it without resorting to 'toList' and corrects my understanding of both 'using' and the running of queries.
val results = from(myTable)(s => select(s))

using(sessionA) {
  results.foreach(m => {
    val newMsg = m.copy(msg = (m.msg + "World"))
    using(sessionB) { myTable.insert(newMsg) }
    println("Inserted " + newMsg)
  })
}
First off, I apologize for the lack of documentation. The using() construct is a new feature that is only available in SNAPSHOT builds. I actually talked to Max about some of the documentation issues for early adopters yesterday and we are working to fix them.
There isn't a way that I can think of to bind a specific Session to a Query. Looking at your example, an easy workaround would be to invert your transactions. When you create a query, Squeryl doesn't actually access the DB; it just creates an AST representing the SQL to be performed, so you don't need to issue your using(sessionA) at that point. Then, when you are ready to iterate over the results, you can wrap the query invocation in a using(sessionA) nested within your using(sessionB). Does that make sense?