Spark broadcast variable seems to be uninitialized - Scala

I have a singleton object Main, in which I defined a private var:
private var params: Broadcast[Parameters] = _
Then I initialize it in the main method:
def main(args: Array[String]) = {
  ...
  params = spark.sparkContext.broadcast(new Parameters())
There is a method in this object that takes an RDD argument and tries to read a value from the broadcast variable inside a map transformation:
def testMethod(rdd: RDD[String]): Unit = {
  rdd.map(elem => {
    val currentDate = params.value.CURRENT_DATE
    ...
I get a NullPointerException on the line where I try to read the value from params. How can I fix this code to make it work? I can't understand what is wrong with the initialization.
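(For context, the usual cause: the singleton object is initialized afresh on each executor, where main never runs, so the params field is still null there. A minimal sketch of the common workaround, capturing the broadcast into a local val on the driver so the closure serializes that reference instead of touching the object field:)

def testMethod(rdd: RDD[String]): Unit = {
  val localParams = params // read the field on the driver, once
  rdd.map { elem =>
    // the closure captures localParams, not Main.params,
    // so it works on executors where main() never ran
    val currentDate = localParams.value.CURRENT_DATE
    // ...
  }
}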

Related

How to declare static global values and define them later in Scala?

Primary goal
I want to use some static vals in a class so that I don't have to pass them as function parameters.
My approach
Since I want them to be static, I am declaring them in the companion object. But I cannot assign them values at declaration time, for various reasons, so I am following the approach below.
case class DemoParams(name: String)
class Demo {
  def foo = {
    println("Demo params name is: ", Demo.demoParams.name) // Works fine
    anotherFoo(Demo.demoParams.name) // Throws NPE !
  }
  def anotherFoo(someName: String) = {
    // some code
  }
}
object Demo {
  var demoParams: DemoParams = _ // Declare here
  def apply() = new Demo()
  def run = {
    demoParams = DemoParams(name = "Salmon") // Define here
    val demoObj = Demo()
    demoObj.foo
  }
  def main() = {
    run
  }
}
Demo.main()
I am able to print Demo.demoParams, but surprisingly this throws a NullPointerException when I pass Demo.demoParams to another function while running the Spark app on a cluster.
Questions
Firstly, is this the right way of declaring static values and defining them later? I would prefer not to use vars and to use immutable vals instead. Is there a better alternative?
Second, could you think of any reason I would be getting an NPE while passing Demo.demoParams.name to another function?
Your code works fine and doesn't throw anything (after fixing a few compile errors).
But ... Don't do this, it's ... yucky :/
How about passing params to the class as ... well ... params instead?
case class Demo(params: DemoParams) {
  def foo() = {
    println("Demo params name is: " + params.name)
  }
}
object Demo {
  def run() = {
    val demoParams = DemoParams(name = "Salmon")
    val demoObj = Demo(demoParams)
    demoObj.foo()
  }
}
Not sure this is the best alternative, but consider using a trait, which still keeps you in the FP zone by avoiding the use of var:
case class DemoParams(name: String)
trait Demo {
  val demoParams: DemoParams
}
Then just define it where you need it, and it's ready for use:
object MainApp extends App {
  val demoObj = new Demo {
    override val demoParams: DemoParams = DemoParams(name = "Salmon")
  }
  println("Demo params name is: ", demoObj.demoParams.name) // (Demo params name is: ,Salmon)
  anotherFoo(demoObj.demoParams.name) // Salmon
  def anotherFoo(name: String): Unit = println(name)
}
About the second question: without the actual code one can only guess (this sample code does not throw an NPE). Probably you are using it somewhere without defining it first, because var demoParams: DemoParams = _ just initializes demoParams to the default value of the reference type DemoParams, which is null, so you get an NPE when you try to access the name field of a null object. This is why using var is discouraged.
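To make that default-value behavior concrete, here is a tiny self-contained sketch (the object name is illustrative):

object NullDefaultDemo {
  case class DemoParams(name: String)
  var demoParams: DemoParams = _ // default value for a reference type: null
  def main(args: Array[String]): Unit = {
    println(demoParams)      // prints "null"
    println(demoParams.name) // throws NullPointerException
  }
}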

Is it possible to declare a val before assignment/initialization in Scala?

In general, can you declare a val in Scala before assigning it a value? If not, why not? An example where this might be useful (in my case at least): I want to declare a val that will be available in a larger scope than the one where I assign it. If I cannot do this, how can I achieve the desired behavior?
And I want this to be a val, not a var, because after it is assigned it should NEVER change, so a var isn't ideal.
For example:
object SomeObject {
  val theValIWantToDeclare // I don't have enough info to assign it here
  def main(): Unit = {
    theValIWantToDeclare = "some value"
  }
  def someOtherFunc(): Unit = {
    val blah = someOperationWith(theValIWantToDeclare)
  }
}
object SomeObject {
  private var tviwtdPromise: Option[Int] = None
  lazy val theValIWantToDeclare: Int = tviwtdPromise.get
  private def declareTheVal(v: Int): Unit = {
    tviwtdPromise = Some(v)
    theValIWantToDeclare // force the lazy val now so later changes can't affect it
  }
  def main(args: Array[String]): Unit = {
    declareTheVal(42)
    someOtherFunction()
  }
  def someOtherFunction(): Unit = {
    println(theValIWantToDeclare)
  }
}
It will crash with a NoSuchElementException if you try to use theValIWantToDeclare before fulfilling the "promise" with declareTheVal.
It sounds to me like you need a lazy val.
A lazy val is populated on demand and the result is cached for all subsequent calls.
https://blog.codecentric.de/en/2016/02/lazy-vals-scala-look-hood/
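A minimal sketch of that behavior (names are illustrative):

object LazyValDemo {
  lazy val theVal: String = {
    println("initializing...") // runs only on the first access
    "some value"
  }
  def main(args: Array[String]): Unit = {
    println(theVal) // prints "initializing..." and then "some value"
    println(theVal) // cached: prints only "some value"
  }
}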
Why not define a SomeObjectPartial that is partially constructed, and class SomeObject(theVal) that takes the value as a parameter?
Then your program has two states, one with the partial object, and another with the completed object.
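A rough sketch of that two-state idea, with all names illustrative:

// Before the value is known, only the partial state exists.
object SomeObjectPartial {
  def complete(theVal: String): SomeObject = new SomeObject(theVal)
}
// Once constructed, theValIWantToDeclare is an immutable val.
class SomeObject(val theValIWantToDeclare: String) {
  def someOtherFunc(): Unit = println(s"using $theValIWantToDeclare")
}
// e.g.:
//   val obj = SomeObjectPartial.complete("some value")
//   obj.someOtherFunc()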

Scala - Global variable get the value in a function

I am a beginner in Scala and I've run into a problem right away.
Let's say I have a Vector variable and two functions. The first function calls the second one, and there is a variable in the second function whose value I need to get and append to the Vector (without returning the variable from the second function).
The structure looks like this:
def main(args: Array[String]): Unit = {
  var vectorA = Vector().empty
}
def funcA(): sometype = {
  ...
  ...
  ...
  funcB()
}
def funcB(): sometype = {
  var error = 87
}
How can I add the error variable to the global Vector? I tried writing vectorA :+ error, but in vain.
You could do the following:
def main(args: Array[String]): Unit = {
  val vectorA = funcA(Vector.empty)
}
def funcA(vec: Vector[Int]): Vector[Int] = {
  ...
  ...
  ...
  funcB(vec)
}
def funcB(vec: Vector[Int]): Vector[Int] = {
  // Here you append, which returns a new copy of the Vector
  val error = 87
  vec :+ error
}
Keep in mind that it is advisable to use immutable values. This might not always hold, but for most applications that just involve some CRUD-type logic, immutable values are the better choice.
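This is also why the asker's vectorA :+ error appeared to do nothing: :+ never mutates the Vector, it returns a new one that has to be kept. For example:

val v1 = Vector(1, 2, 3)
val v2 = v1 :+ 87 // returns a new Vector; v1 is untouched
println(v1)       // Vector(1, 2, 3)
println(v2)       // Vector(1, 2, 3, 87)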

Access Spark broadcast variable in different classes

I am broadcasting a value in a Spark Streaming application, but I am not sure how to access that variable in a class different from the one where it was broadcast.
My code looks as follows:
object AppMain {
  def main(args: Array[String]) {
    //...
    val broadcastA = sc.broadcast(a)
    //..
    lines.foreachRDD(rdd => {
      val obj = AppObject1
      rdd.filter(p => obj.apply(p))
      rdd.count
    })
  }
}
object AppObject1 {
  def apply(str: String): Boolean = {
    AnotherObject.process(str)
  }
}
object AnotherObject {
  // I want to use broadcast variable in this object
  val B = broadcastA.value // compilation error here
  def process(): Boolean = {
    //need to use B inside this method
  }
}
Can anyone suggest how to access the broadcast variable in this case?
Ignoring possible serialization issues, there is nothing particularly Spark-specific here. If you want to use some object, it has to be available in the current scope, and you can achieve this the same way as usual:
you can define your helpers in a scope where broadcast is already defined:
{
  ...
  val x = sc.broadcast(1)
  object Foo {
    def foo = x.value
  }
  ...
}
you can use it as a constructor argument:
case class Foo(x: org.apache.spark.broadcast.Broadcast[Int]) {
  def foo = x.value
}
...
Foo(sc.broadcast(1)).foo
or you can pass it as a method argument:
case class Foo() {
  def foo(x: org.apache.spark.broadcast.Broadcast[Int]) = x.value
}
...
Foo().foo(sc.broadcast(1))
or you can even mix it into your helpers like this:
trait Foo {
  val x: org.apache.spark.broadcast.Broadcast[Int]
  def foo = x.value
}
object Main extends Foo {
  val sc = new SparkContext("local", "test", new SparkConf())
  val x = sc.broadcast(1)
  def main(args: Array[String]) {
    sc.parallelize(Seq(None)).map(_ => foo).first
    sc.stop
  }
}
Just a short take on the performance considerations that were introduced earlier.
The options proposed by zero323 are indeed a very elegant way of doing this kind of thing in Scala. At the same time, it is important to understand the implications of using certain patterns in a distributed system.
It is not the best idea to use the mixin approach, or any logic that uses the enclosing class's state. Whenever you use the state of the enclosing class within a lambda, Spark has to serialize the outer object. This is not always true, but you are better off writing safer code than accidentally blowing up the whole cluster one day.
Being aware of this, I would personally go for explicit argument passing to the methods, as this does not result in outer class serialization (the method argument approach).
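The same caution applies when a lambda refers to a field of its enclosing class: a common hedge is to copy the field into a local val first, so that only the local is captured (a sketch with illustrative names):

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

class Worker(bc: Broadcast[Int]) {
  def process(rdd: RDD[Int]): RDD[Int] = {
    val localBc = bc           // read the field on the driver, once
    rdd.map(_ + localBc.value) // the closure captures only localBc,
                               // so Worker itself is never serialized
  }
}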
You can use classes and pass the broadcast variable to them. Your pseudo-code should look like this:
object AppMain {
  def main(args: Array[String]) {
    //...
    val broadcastA = sc.broadcast(a)
    //..
    lines.foreachRDD(rdd => {
      val obj = new AppObject1(broadcastA)
      rdd.filter(p => obj.apply(p))
      rdd.count
    })
  }
}
class AppObject1(bc: Broadcast[String]) {
  val anotherObject = new AnotherObject(bc)
  def apply(str: String): Boolean = {
    anotherObject.process(str)
  }
}
class AnotherObject(bc: Broadcast[String]) {
  // I want to use broadcast variable in this object
  def process(str: String): Boolean = {
    val a = bc.value // use the broadcast value inside this method
    true
  }
}

Scala serialization/deserialization of singleton object

I am quite new to the Scala programming language, and I currently need to do the following. I have a singleton object like this:
object MyObject extends Serializable {
  val map: HashMap[String, Int] = null
  val x: Int = -1
  val foo: String = ""
}
Now I want to avoid having to serialize each field of this object separately, so I was considering writing the whole object to a file and then, in the next execution of the program, reading the file and initializing the singleton object from it. Is there any way to do this?
Basically, what I want is: when the serialization file doesn't exist, the variables should be initialized to new structures; when it exists, the fields should be initialized from the ones in the file. But I want to avoid having to serialize/deserialize every field manually...
UPDATE:
I had to use a custom deserializer as presented here: https://issues.scala-lang.org/browse/SI-2403, since I had issues with a custom class I use as values inside the HashMap.
UPDATE2:
Here is the code I use to serialize:
val store = new ObjectOutputStream(new FileOutputStream(new File("foo")))
store.writeObject(MyData)
store.close()
And the code to deserialize (in a different file):
@transient private lazy val loadedData: MyTrait = {
  if (new File("foo").exists()) {
    val in = new ObjectInputStream(new FileInputStream("foo")) {
      override def resolveClass(desc: java.io.ObjectStreamClass): Class[_] = {
        try { Class.forName(desc.getName, false, getClass.getClassLoader) }
        catch { case ex: ClassNotFoundException => super.resolveClass(desc) }
      }
    }
    val obj = in.readObject().asInstanceOf[MyTrait]
    in.close()
    obj
  } else null
}
No need to serialize an object with only immutable fields manually (the compiler will do it for you...). I will assume that the object provides default values. Here is a way to do this:
Start by writing a trait with all the required fields:
trait MyTrait {
  def map: HashMap[String, Int]
  def x: Int
  def foo: String
}
Then write an object with the defaults:
object MyDefaults extends MyTrait {
  val map = HashMap.empty[String, Int]
  val x = -1
  val foo = ""
}
Finally, write an implementation that deserializes the data if it exists:
object MyData extends MyTrait {
  private lazy val loadedData: Option[MyTrait] = {
    if ( /* filename exists */ ) Some( /* unserialize filename as MyTrait */ )
    else None
  }
  lazy val map = loadedData.getOrElse(MyDefaults).map
  lazy val x = loadedData.getOrElse(MyDefaults).x
  lazy val foo = loadedData.getOrElse(MyDefaults).foo
}