Scala NullPointerException during initialization - scala

Consider the following case (this is a simplified version of what I have. The numberOfScenarios is the most important variable here: I usually use a hardcoded number instead of it, but I'm trying to see if it's possible to calculate the value):
object ScenarioHelpers {
val identifierList = (1 to Scenarios.numberOfScenarios).toArray
val concurrentIdentifierQueue = new ConcurrentLinkedQueue[Int](identifierList.toSeq)
}
abstract class AbstractScenario {
val identifier = ScenarioHelpers.concurrentIdentifierQueue.poll()
}
object Test1 extends AbstractScenario {
val scenario1 = scenario("test scenario 1").exec(/..steps../)
}
object Test2 extends AbstractScenario {
val scenario2 = scenario("test scenario 2").exec(/..steps../)
}
object Scenarios {
val scenarios = List(Test1.scenario1, Test2.scenario2)
val numberOfScenarios = scenarios.length
}
object TestPreparation {
val feeder = ScenarioHelpers.identifierList.map(n => Map("counter" -> n))
val prepScenario = scenario("test preparation")
.feed(feeder)
.exec(/..steps../)
}
Not sure if it matters, but the simulation starts with executing the TestPreparation.prepScenario.
I see that this code contains a circular dependency which makes this case impossible in and of itself. But I get a NullPointerException on the line in AbstractScenario where identifier is being initialized.
I don't fully understand all this, but I think it has something to do with the vals being simply declared at first and the initialization does not happen until later. So when identifier is being initialized, the concurrentIdentifierQueue is not yet initialized and is therefore null.
I'm just trying to understand the reasons behind the NullPointerException and also if there's any way to get this working with minimal changes? Thank you.

NPEs during trait initialization is a very common problem.
The most robust way to resolve it is avoiding implementation inheritance at all.
if it is not possible for some reasons you can mark problematic fields lazy val or def instead of val.

You answered that yourself:
I see that this code contains a circular dependency which makes this case impossible in and of itself. But I get a NullPointerException on the line in AbstractScenario where identifier is being initialized.
val feeder = ScenarioHelpers.identifierList... calls ScenarioHelpers initialization
val identifierList = (1 to Scenarios.numberOfScenarios).toArray calls Scenarios initialization
val scenarios = List(Test1.scenario1, Test2.scenario2) calls Test1 inicialization including AbstractScenario
Here val identifier = ScenarioHelpers.concurrentIdentifierQueue.poll() ScenarioHelpers is still initializing and identifierList is null.
You have to get numberOfScenarios in noncyclic way. Personally I would remove identifierList and assign identifier other way - incrementing counter or so.

Related

Variable inside dataframe foreach gives null pointer exception in Scala

I'm having some issues when trying to execute a class function inside a "dataframe.foreach" function. My custom class is persisting the data into a DynamoDB table.
What happens is that if I have the following code, it won't work and will raise a "Null Pointer Exception" that points to the line of code where the "writer.writeRow(r)" is executed:
object writeToDynamoDB extends App {
val df: DataFrame = ...
val writer: DynamoDBWriter = new DDBWriter(...)
df
.foreach(
r => writer.writeRow(r)
)
}
If I use the same code, but having the code inside a code block or an if clause, it will work:
object writeToDynamoDB extends App {
val df: DataFrame = ...
if(true) {
val writer: DynamoDBWriter = new DDBWriter(...)
df
.foreach(
r => writer.writeRow(r)
)
}
}
I guess it has something to do with the variable scope. Even in IntelliJ the color of the variable is purple + Italic in the first case and "regular" grey in the second case. I read about it, and we have the method, field and local scope in Scala, but I'm can't relate that with what I'm trying to do.
Some questions after this introduction:
Can anyone explain why does Scala and/or Spark have this behaviour?
The solution here is to put some code inside a function, code block
or a "fake" if clause as far as I know. Is there any possible issue regarding Spark properties (shuffles, etc)?
Is there any other way to do this type of operations?
Hope I was clear.
Thanks in advance.
Regards
As said above, your issue is caused by delayed initialization when using the App trait. Spark docs strongly discourage that:
Note that applications should define a main() method instead of extending scala.App. Subclasses of scala.App may not work correctly.
The reason can be found in the Javadocs of the App trait itself:
It should be noted that this trait is implemented using the DelayedInit functionality, which means that fields of the object will not have been initialized before the main method has been executed.
This basically means that writer is still uninitialized (so null) by the time the closure passed to foreach is created.
If you put respective code into a block, writer becomes a local variable and is initialized at the time when the block is evaluated. That way your closure will contain the correct value of writer. In this case it doesn't matter anymore when the code is evaluated, because everything get's evaluated together.
The correct and recommended solution is to use a standard main method for your Spark applications:
object writeToDynamoDB {
def main(args: Array[String]): Unit = {
val df: DataFrame = ...
val writer: DynamoDBWriter = new DDBWriter(...)
df.foreach(r => writer.writeRow(r))
}
}

Infinite loop when replacing concrete value by parameter name

I have the two following objects (in scala and using spark):
1. The main object
object Omain {
def main(args: Array[String]) {
odbscan
}
}
2. The object odbscan
object odbscan {
val conf = new SparkConf().setAppName("Clustering").setMaster("local")
conf.set("spark.driver.maxResultSize", "3g")
val sc = new SparkContext(conf)
val param_user_minimal_rating_count = 2
/***Connexion***/
val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
val sql = "SELECT id, data FROM user_profile"
val options = connectMysql.getOptionsMap(sql)
val uSQL = sqlcontext.load("jdbc", options)
val users = uSQL.rdd.map { x =>
val v = x.toString().substring(1, x.toString().size - 1).split(",")
var ap: Map[Int, Double] = Map()
if (v.size > 1)
ap = v(1).split(";").map { y => (y.split(":")(0).toInt, y.split(":")(1).toDouble) }.toMap
(v(0).toInt, ap)
}.filter(_._2.size >= param_user_minimal_rating_count)
println(users.collect().mkString("\n"))
}
When I execute this code I obtain an infinite loop, until I change:
filter(_._2.size >= param_user_minimal_rating_count)
to
filter(_._2.size >= 1)
or any other numerical value, in this case the code work, and I have my result displayed
What I think is happening here is that Spark serializes functions to send them over the wire. And that because your function (the one you're passing to map) calls the accessor param_user_minimal_rating_count of object odbscan, the entire object odbscan will need to get serialized and sent along with it. Deserializing and then using that deserialized object will cause the code in its body to get executed again which causes an infinite loop of serializing-->sending-->deserializing-->executing-->serializing-->...
Probably the easiest thing to do here is changing that val to final val param_user_minimal_rating_count = 2 so the compiler will inline the value. But note that this will only be a solution for literal constants. For more information see constant value definitions and constant expressions.
An other and better solution would be to refactor your code so that no instance variables are used in lambda expressions. Referencing vals that are defined in an object or class will get the whole object serialized. So try to only refer to vals that are local (to a method). And most importantly don't execute your business logic from within a constructor/the body of an object or class.
Your problem is somewhere else.
The only difference between the 2 snippets is the definition of val Eps = 5 outside of the map which does not change at all the control flow of your code.
Please post more context so we can help.

How to get truly atomic update for TrieMap.getOrElseUpdate

As I understand, TrieMap.getOrElseUpdate is still not truly atomic, and this fixes only returned result (it could return different instances for different callers before this fix), so the updater function still might be called several times, as documentation (for 2.11.7) says:
Note: This method will invoke op at most once. However, op may be invoked without the result being added to the map if a concurrent process is also trying to add a value corresponding to the same key k.
*I've checked that manually for 2.11.7, still "at least once"
How to guarantee one-time call (if I use TrieMap for factories)?
I think this solution should work for my requirements:
trait LazyComp { val get: Int }
val map = new TrieMap[String, LazyComp]()
val count = new AtomicInteger() //just for test, you don't need it
def getSingleton(key: String) = {
val v = new LazyComp {
lazy val get = {
//compute something
count.incrementAndGet() //just for test, you don't need it
}
}
map.putIfAbsent(key, v).getOrElse(v).get
}
I believe, lazy val actually uses synchronized inside. And also the code inside get should be safe from exceptions
However, performance could be improved in future: SIP-20
Test:
scala> (0 to 10000000).par.map(_ => getSingleton("zzz")).last
res8: Int = 1
P.S. Java has computeIfAbscent method on ConcurrentHashMap which I could use as well.

Why is using uninitialized variables allowed in Scala? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Scala: forward references - why does this code compile?
object Omg {
class A
class B(val a: A)
private val b = new B(a)
private val a = new A
def main(args: Array[String]) {
println(b.a)
}
}
the following code prints "null". In java. similar construction doesn't compile because of invalid forward reference. The question is - why does it compile well in Scala? Is that by design, described in SLS or simply bug in 2.9.1?
It's not a bug, but a classic error when learning Scala. When the object Omg is initialized, all values are first set to the default value (null in this case) and then the constructor (i.e. the object body) runs.
To make it work, just add the lazy keyword in front of the declaration you are forward referencing (value a in this case):
object Omg {
class A
class B(val a: A)
private val b = new B(a)
private lazy val a = new A
def main(args: Array[String]) {
println(b.a)
}
}
Value a will then be initialized on demand.
This construction is fast (the values are only initialized once for all application runtime) and thread-safe.
The way that I understand it, this has to do with how the Scala classes are created. In Java, the class defined above would be initializing the variables inline, and since a had not been defined yet, it could not be compiled. However, in Scala it is more the equivalent of this in Java (which should also produce null in the same scenario):
class Omg {
private B b = null;
private A a = null;
Omg(){
b = new B(a);
a = new A();
}
}
Alternately, you could make your declaration of b lazy, which would defer setting the value until it is called (at which time a will have been set).
If this is a concern, compile with -Xcheckinit during development and iterate until the exceptions go away.
Spec 5.1 for template body statements executed in order; beginning of 4.0 for forward references in blocks.
Forward References - why does this code compile?
As #paradigmatic states, it's not really a bug. It's the initialization order, which follows the declaration order. It this case, a is null when b is declared/init-ed.
Changing the line private val b = new B(a) to private lazy val b = new B(a) will fix issue, since using lazy will delay the init. of b to it's first usage.
It's very likely that this behavior is described in the SLS.

Scala and forward references [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Scala: forward references - why does this code compile?
object Omg {
class A
class B(val a: A)
private val b = new B(a)
private val a = new A
def main(args: Array[String]) {
println(b.a)
}
}
the following code prints "null". In java. similar construction doesn't compile because of invalid forward reference. The question is - why does it compile well in Scala? Is that by design, described in SLS or simply bug in 2.9.1?
It's not a bug, but a classic error when learning Scala. When the object Omg is initialized, all values are first set to the default value (null in this case) and then the constructor (i.e. the object body) runs.
To make it work, just add the lazy keyword in front of the declaration you are forward referencing (value a in this case):
object Omg {
class A
class B(val a: A)
private val b = new B(a)
private lazy val a = new A
def main(args: Array[String]) {
println(b.a)
}
}
Value a will then be initialized on demand.
This construction is fast (the values are only initialized once for all application runtime) and thread-safe.
The way that I understand it, this has to do with how the Scala classes are created. In Java, the class defined above would be initializing the variables inline, and since a had not been defined yet, it could not be compiled. However, in Scala it is more the equivalent of this in Java (which should also produce null in the same scenario):
class Omg {
private B b = null;
private A a = null;
Omg(){
b = new B(a);
a = new A();
}
}
Alternately, you could make your declaration of b lazy, which would defer setting the value until it is called (at which time a will have been set).
If this is a concern, compile with -Xcheckinit during development and iterate until the exceptions go away.
Spec 5.1 for template body statements executed in order; beginning of 4.0 for forward references in blocks.
Forward References - why does this code compile?
As #paradigmatic states, it's not really a bug. It's the initialization order, which follows the declaration order. It this case, a is null when b is declared/init-ed.
Changing the line private val b = new B(a) to private lazy val b = new B(a) will fix issue, since using lazy will delay the init. of b to it's first usage.
It's very likely that this behavior is described in the SLS.