Does method parameter trigger serialization in Spark? - scala

I read the Spark programming guide about passing functions and wonder what happens when the function references an enclosing method's parameter or local variable.
For example, I have this object
object Main {
  def main(args: Array[String]): Unit = {
    val ds: Dataset[String] = ???
    ds.map(_ + args(0))
  }
}
Does Spark have to serialize Main? What if args is a local variable inside main?

No, in both these cases Spark won't serialize the Main object. Method arguments and local variables (which are pretty much the same thing semantically) do not "belong" to the enclosing object or class; they are associated with a particular method invocation and can therefore be captured by the closure directly.
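For example, copying the argument into a local val makes this explicit; in either form, only the String ends up in the closure (a minimal sketch mirroring the question's code):
object Main {
  def main(args: Array[String]): Unit = {
    val ds: Dataset[String] = ???
    val suffix = args(0) // method-local val, captured by value
    ds.map(_ + suffix)   // the closure serializes only the String
  }
}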
As a general rule, if you need to have a reference to some object in order to access some value, then this reference will be captured and, therefore, serialized:
class Application(n: Int) {
  val x = "internal state " + n
  def doSomething(ds: Dataset[String], param: String): Unit = {
    ds.map(_ + x + param)
  }
}
Note that here, in order to access x, which is an instance member, the enclosing instance necessarily has to be available, because x depends on the actual parameters that the instance was constructed with. Another way to see it is to remember that when you use x in the example above, it is actually a shortcut for this.x:
ds.map(_ + this.x + param)
Compared to this, the param value has no such dependency - it is passed as is to the method, and there is no need to access any other enclosing object to use it. Therefore, param will be captured and serialized directly.
That's why there is the advice to copy instance members into local variables so as not to capture the entire object: once you put a value into a local variable, it no longer requires accessing the enclosing instance:
val localX = this.x
ds.map(_ + localX + param)
Of course, if you have references to the enclosing instance inside the object that you want to capture, like here:
class Inner(app: Application)

class Application {
  val x = new Inner(this)
  def doSomething(ds: Dataset[String]): Unit = {
    val localX = x
    ds.map(_ + localX.toString)
  }
}
then storing it to a local variable would not help, because Spark would still need to serialize the app field of the Inner class, which points back to the Application instance. That's why you have to be careful if you have complex object graphs that you use in Spark methods which are going to be sent to executors.
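One way out in such cases, sketched below under the assumption that only the string form of x is needed on the executors, is to extract just the primitive data into a local before building the closure:
class Application {
  val x = new Inner(this)
  def doSomething(ds: Dataset[String]): Unit = {
    val xString = x.toString // evaluated on the driver; a plain String
    ds.map(_ + xString)      // the closure captures only the String
  }
}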

Calling function library scala

I'm looking to call the ATR function from this Scala wrapper for ta-lib, but I can't figure out how to use the wrapper correctly.
package io.github.patceev.talib

import com.tictactec.ta.lib.{Core, MInteger, RetCode}

import scala.concurrent.Future

object Volatility {
  def ATR(
    highs: Vector[Double],
    lows: Vector[Double],
    closes: Vector[Double],
    period: Int = 14
  )(implicit core: Core): Future[Vector[Double]] = {
    val arrSize = highs.length - period + 1
    if (arrSize < 0) {
      Future.successful(Vector.empty[Double])
    } else {
      val begin = new MInteger()
      val length = new MInteger()
      val result = Array.ofDim[Double](arrSize)
      core.atr(
        0, highs.length - 1, highs.toArray, lows.toArray, closes.toArray,
        period, begin, length, result
      ) match {
        case RetCode.Success =>
          Future.successful(result.toVector)
        case error =>
          Future.failed(new Exception(error.toString))
      }
    }
  }
}
Would someone be able to explain how to use the function and print the result to the console?
Many thanks in advance.
Regarding syntax, Scala is one of many languages where you call functions and methods by passing arguments in parentheses (mostly, but let's keep it simple for now):
def myFunction(a: Int): Int = a + 1
myFunction(1) // myFunction is called and returns 2
On top of this, Scala allows you to specify multiple parameter lists, as in the following example:
def myCurriedFunction(a: Int)(b: Int): Int = a + b
myCurriedFunction(2)(3) // myCurriedFunction returns 5
You can also partially apply myCurriedFunction, but again, let's keep it simple for the time being. The main idea is that you can have multiple lists of arguments passed to a function.
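For completeness, here is what partial application looks like (a minimal sketch, using Scala 2 eta-expansion syntax):
val addTwo = myCurriedFunction(2) _ // fixes the first list; addTwo: Int => Int
addTwo(3) // 5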
Built on top of this, Scala allows you to define a list of implicit parameters, which the compiler will automatically retrieve for you based on some scoping rules. Implicit parameters are used, for example, by Futures:
// this defines how and where callbacks are run
// the compiler will automatically "inject" them for you where needed
implicit val ec: ExecutionContext = concurrent.ExecutionContext.global
Future(4).map(_ + 1) // this will eventually result in a Future(5)
Note that both Future and map have a second parameter list that allows you to specify an implicit execution context. By having one in scope, the compiler will "inject" it for you at the call site, without you having to write it explicitly. You could still have done so, and the result would have been
Future(4)(ec).map(_ + 1)(ec)
That said, I don't know the specifics of the library you are using, but the idea is that you have to instantiate a value of type Core and either bind it to an implicit val or pass it explicitly.
The resulting code will be something like the following
val highs: Vector[Double] = ???
val lows: Vector[Double] = ???
val closes: Vector[Double] = ???
implicit val core: Core = ??? // instantiate core
val resultsFuture = Volatility.ATR(highs, lows, closes) // core is passed implicitly
for (results <- resultsFuture; result <- results) {
  println(result)
}
Note that depending on your situation you may have to also use an implicit ExecutionContext to run this code (because you are extracting the Vector[Double] from a Future). Choosing the right execution context is another kind of issue but to play around you may want to use the global execution context.
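For example, just to experiment at the console, you could block on the Future (a sketch; Await is fine for playing around but should be avoided in production code):
import scala.concurrent.Await
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

val results = Await.result(resultsFuture, 10.seconds) // Vector[Double]
results.foreach(println)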
Extra
Regarding some of the points I've left open, here are some pointers that hopefully will turn out to be useful:
Operators
Multiple Parameter Lists (Currying)
Implicit Parameters
Scala Futures

Why is it allowed to put methods inside blocks, and statements inside objects in Scala?

I'm learning Scala and I don't really understand the following example:
object Test extends App {
  def method1 = println("method 1")
  val x = {
    def method2 = "method 2" // method inside a block
    "this is " + method2
  }
  method1 // statement inside an object
  println(x) // same
}
I mean, it feels inconsistent to me because here I see two different concepts:
Objects/Classes/Traits, which contains members.
Blocks, which contains statements, the last statement being the value of the block.
But here we have a method that is part of a block, and statements that are part of an object. So, does it mean that blocks are objects too? And how are the statements that are part of an object handled, are they members too?
Thanks.
Does it mean that blocks are objects too?
No, blocks are not objects. Blocks are used for scoping the binding of variables. Scala enables you not only to define expressions inside blocks but also to define methods. If we take your example and compile it, we can see what the compiler does:
object Test extends Object {
  def method1(): Unit = scala.Predef.println("method 1");
  private[this] val x: String = _;
  <stable> <accessor> def x(): String = Test.this.x;
  final <static> private[this] def method2$1(): String = "method 2";
  def <init>(): tests.Test.type = {
    Test.super.<init>();
    Test.this.x = {
      "this is ".+(Test.this.method2$1())
    };
    Test.this.method1();
    scala.Predef.println(Test.this.x());
    ()
  }
}
What the compiler did was extract method2 into a synthetic method named method2$1 and scope it private[this], i.e. restricted to the current instance of the type.
And how are the statements that are part of an object handled, are they members too?
The compiler took method1 and println and called them inside the constructor, which runs when the type is initialized. So you can see that the initialization of val x and the rest of the method calls happen at construction time.
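You can watch this initialization order with a tiny object (a minimal sketch):
object Demo {
  println("runs when Demo is first accessed") // statement: becomes part of the initializer
  val x = { println("x is initialized here too"); 42 }
}
Demo.x // triggers initialization: both println calls run once, then 42 is returned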
method2 is actually not a method. It is a local function. Scala allows you to create named functions inside local scopes for organizing your code into functions without polluting the namespace.
It is most often used to define local tail-recursive helper functions. Often, when making a function tail-recursive, you need to add an additional parameter to carry the "state" on the call stack, but this additional parameter is a private internal implementation detail and shouldn't be exposed to clients. In languages without local functions, you would make this a private helper alongside the primary method, but then it would still be within the namespace of the class and callable by all other methods of the class, when it is really only useful for that particular method. So, in Scala, you can instead define it locally inside the method:
// non tail-recursive
def length[A](ls: List[A]): Int = ls match {
  case Nil => 0
  case x :: xs => length(xs) + 1
}

// transformation to tail-recursive, Java-style:
def length[A](ls: List[A]): Int = lengthRec(ls, 0)

private def lengthRec[A](ls: List[A], len: Int): Int = ls match {
  case Nil => len
  case x :: xs => lengthRec(xs, len + 1)
}

// tail-recursive, Scala-style:
def length[A](ls: List[A]): Int = {
  // note: lengthRec is nested, so it stays an implementation detail of length;
  // it still takes the remaining list, because the recursion has to walk the tail
  def lengthRec(rest: List[A], len: Int): Int = rest match {
    case Nil => len
    case x :: xs => lengthRec(xs, len + 1)
  }
  lengthRec(ls, 0)
}
Now you might say: well, I see the value in defining local functions inside methods, but what's the value in being able to define local functions in blocks? Scala tries to be as simple as possible and to have as few corner cases as possible. If you can define local functions inside methods, and local functions inside local functions, then why not simplify that rule and just say that local functions behave just like local fields: you can simply define them in any block scope. Then you don't need different scope rules for local fields and local functions, and you have simplified the language.
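So a local function can live in any block, not just directly inside a method (a minimal sketch):
val greeting = {
  def shout(s: String) = s.toUpperCase + "!" // visible only inside this block
  "they said: " + shout("hello")
}
// greeting == "they said: HELLO!"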
The other thing you mentioned, being able to execute code in the body of a template, is actually the primary constructor (so to speak; it's technically more like an initializer). Remember: the primary constructor's signature is defined with parentheses after the class name. But where would you put the code for the constructor then? Well, you put it in the body of the class!

Infinite loop when replacing concrete value by parameter name

I have the two following objects (in scala and using spark):
1. The main object
object Omain {
  def main(args: Array[String]) {
    odbscan
  }
}
2. The object odbscan
object odbscan {
  val conf = new SparkConf().setAppName("Clustering").setMaster("local")
  conf.set("spark.driver.maxResultSize", "3g")
  val sc = new SparkContext(conf)
  val param_user_minimal_rating_count = 2
  /*** Connection ***/
  val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
  val sql = "SELECT id, data FROM user_profile"
  val options = connectMysql.getOptionsMap(sql)
  val uSQL = sqlcontext.load("jdbc", options)
  val users = uSQL.rdd.map { x =>
    val v = x.toString().substring(1, x.toString().size - 1).split(",")
    var ap: Map[Int, Double] = Map()
    if (v.size > 1)
      ap = v(1).split(";").map { y => (y.split(":")(0).toInt, y.split(":")(1).toDouble) }.toMap
    (v(0).toInt, ap)
  }.filter(_._2.size >= param_user_minimal_rating_count)
  println(users.collect().mkString("\n"))
}
When I execute this code I obtain an infinite loop, until I change:
filter(_._2.size >= param_user_minimal_rating_count)
to
filter(_._2.size >= 1)
or any other numeric literal; in that case the code works and my result is displayed.
What I think is happening here is that Spark serializes functions to send them over the wire. Because your function (the one you're passing to map) calls the accessor param_user_minimal_rating_count of object odbscan, the entire odbscan object has to be serialized and sent along with it. Deserializing and then using that deserialized object causes the code in its body to be executed again, which results in an infinite loop of serializing --> sending --> deserializing --> executing --> serializing --> ...
Probably the easiest thing to do here is changing that val to final val param_user_minimal_rating_count = 2 so the compiler will inline the value. But note that this will only be a solution for literal constants. For more information see constant value definitions and constant expressions.
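A minimal sketch of that change, assuming the rest of odbscan stays as it is:
object odbscan {
  final val param_user_minimal_rating_count = 2 // literal constant: inlined at use sites
  // ... the closure then contains the literal 2 and no reference back to odbscan
}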
Another, better solution would be to refactor your code so that no instance variables are used in lambda expressions. Referencing vals that are defined in an object or class will get the whole object serialized. So try to refer only to vals that are local (to a method). And most importantly, don't execute your business logic from within a constructor or the body of an object or class.
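A sketch of such a refactoring (the JDBC load is replaced by dummy in-memory data to keep the example self-contained; the point is that the logic lives in a method and the closure references only a local val):
import org.apache.spark.{SparkConf, SparkContext}

object odbscan {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Clustering").setMaster("local")
    val sc = new SparkContext(conf)
    val minRatings = 2 // method-local: captured by value, no reference to odbscan
    val users = sc
      .parallelize(Seq((1, Map(1 -> 2.0, 2 -> 3.0)), (2, Map.empty[Int, Double])))
      .filter(_._2.size >= minRatings) // serializes only an Int, not the whole object
    println(users.collect().mkString("\n"))
    sc.stop()
  }
}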
Your problem is somewhere else.
The only difference between the two snippets is the definition of val Eps = 5 outside of the map, which does not change the control flow of your code at all.
Please post more context so we can help.

What is the apply function in Scala?

I never understood it from the contrived unmarshalling and verbing-nouns examples (an AddTwo class has an apply that adds two!).
I understand that it's syntactic sugar, so (I deduced from context) it must have been designed to make some code more intuitive.
What meaning does a class with an apply function have? What is it used for, and in what ways does it make code better (unmarshalling, verbing nouns, etc.)?
How does it help when used in a companion object?
Mathematicians have their own little funny ways, so instead of saying "then we call function f passing it x as a parameter" as we programmers would say, they talk about "applying function f to its argument x".
"In mathematics and computer science, Apply is a function that applies functions to arguments." (Wikipedia)
apply serves the purpose of closing the gap between the object-oriented and functional paradigms in Scala. Every function in Scala can be represented as an object. Every function also has an OO type: for instance, a function that takes an Int parameter and returns an Int will have the OO type Function1[Int, Int].
// define a function in scala
(x:Int) => x + 1
// assign an object representing the function to a variable
val f = (x:Int) => x + 1
Since everything is an object in Scala, f can now be treated as a reference to a Function1[Int,Int] object. For example, we can call the toString method inherited from Any, which would have been impossible for a pure function, because functions don't have methods:
f.toString
Or we could define another Function1[Int,Int] object by calling the compose method on f, chaining two different functions together:
val f2 = f.compose((x:Int) => x - 1)
Now if we want to actually execute the function, or as mathematicians say "apply a function to its arguments", we would call the apply method on the Function1[Int,Int] object:
f2.apply(2)
Writing f.apply(args) every time you want to execute a function represented as an object is the object-oriented way, but it would add a lot of clutter to the code without adding much additional information, and it would be nice to be able to use the more standard notation f(args). That's where the Scala compiler steps in: whenever we have a reference f to a function object and write f(args) to apply arguments to the represented function, the compiler silently expands f(args) to the object method call f.apply(args).
Every function in Scala can be treated as an object, and it works the other way too: every object can be treated as a function, provided it has the apply method. Such objects can be used in function notation:
// we will be able to use this object as a function, as well as an object
object Foo {
  var y = 5
  def apply(x: Int) = x + y
}
Foo(1) // using the Foo object in function notation
There are many use cases in which we would want to treat an object as a function. The most common scenario is the factory pattern. Instead of adding clutter to the code with a factory method, we can apply an object to a set of arguments to create a new instance of an associated class:
List(1,2,3) // same as List.apply(1,2,3) but less clutter, functional notation
// the way the factory method invocation would have looked
// in other languages with OO notation - needless clutter
List.instanceOf(1,2,3)
So the apply method is just a handy way of closing the gap between functions and objects in Scala.
It comes from the idea that you often want to apply something to an object. The clearest example is that of factories. When you have a factory, you want to apply parameters to it to create an object.
The Scala designers thought that, as this occurs in many situations, it could be nice to have a shortcut to call apply. Thus, when you give parameters directly to an object, it's desugared as if you passed these parameters to the apply function of that object:
class MyAdder(x: Int) {
  def apply(y: Int) = x + y
}

val adder = new MyAdder(2)
val result = adder(4) // equivalent to adder.apply(4)
It's often used in companion objects to provide a nice factory method for a class or a trait. Here is an example:
trait A {
  val x: Int
  def myComplexStrategy: Int
}

object A {
  def apply(x: Int): A = new MyA(x)

  private class MyA(val x: Int) extends A {
    val myComplexStrategy = 42
  }
}
From the Scala standard library, you might look at how scala.collection.Seq is implemented: Seq is a trait, so new Seq(1, 2) won't compile, but thanks to the companion object and apply, you can call Seq(1, 2) and the implementation is chosen by the companion object.
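A quick way to see this in action (a minimal sketch):
val s = Seq(1, 2, 3) // desugars to Seq.apply(1, 2, 3); the companion picks
                     // a concrete implementation (currently a List)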
Here is a small example for those who want a quick look:
object ApplyExample01 extends App {
  class Greeter1(var message: String) {
    println("A greeter-1 is being instantiated with message " + message)
  }

  class Greeter2 {
    def apply(message: String) = {
      println("A greeter-2 is being instantiated with message " + message)
    }
  }

  val g1: Greeter1 = new Greeter1("hello")
  val g2: Greeter2 = new Greeter2()
  g2("world")
}
output
A greeter-1 is being instantiated with message hello
A greeter-2 is being instantiated with message world
TL;DR for people coming from C++
It's just an overloaded () (parentheses) operator.
So this in Scala:
class X {
  def apply(param1: Int, param2: Int, param3: Int): Int = {
    // Do something
  }
}
is the same as this in C++:
class X {
  int operator()(int param1, int param2, int param3) {
    // do something
  }
};
1 - Treat functions as objects.
2 - The apply method is similar to __call__ in Python, which allows you to use an instance of a given class as a function.
The apply method is what turns an object into a function. The desire is to be able to use function syntax, such as:
f(args)
But Scala has both functional and object-oriented syntax. One or the other needs to be the base of the language. Scala (for a variety of reasons) chooses object-oriented as the base form of the language. That means that any function syntax has to be translated into object-oriented syntax.
That is where apply comes in. Any object that has the apply method can be used with the syntax:
f(args)
The Scala infrastructure then translates that into
f.apply(args)
f.apply(args) has correct object-oriented syntax. Doing this translation would not be possible if the object had no apply method!
In short, having an apply method in an object is what allows Scala to turn the syntax object(args) into the syntax object.apply(args). And object.apply(args) is in a form that can then be executed.
FYI, this implies that all functions in Scala are objects. And it also implies that having the apply method is what makes an object a function!
See the accepted answer for more insight into just how a function is an object, and the tricks that can be played as a result.
To put it crudely, you can just see it as a custom () operator. If a class X has an apply() method, whenever you call X() you will be calling the apply() method.

Scala singleton factories and class constants

OK, in the question about 'Class Variables as constants', I get the fact that the constants are not available until after the 'official' constructor has been run (i.e. until you have an instance). BUT, what if I need the companion singleton to make calls on the class:
object thing {
  val someConst = 42
  def apply(x: Int) = new thing(x)
}

class thing(x: Int) {
  import thing.someConst
  val field = x * someConst
  override def toString = "val: " + field
}
If I define the companion object first, the 'new thing(x)' (in the companion) causes an error. However, if I define the class first, the 'x * someConst' (in the class definition) causes an error.
I also tried placing the class definition inside the singleton.
object thing {
  var someConst = 42
  def apply(x: Int) = new thing(x)

  class thing(x: Int) {
    val field = x * someConst
    override def toString = "val: " + field
  }
}
However, doing this gives me a 'thing.thing' type object
val t = thing(2)
results in
t: thing.thing = val: 84
The only useful solution I've come up with is to create an abstract class, a companion and an inner class (which extends the abstract class):
abstract class thing

object thing {
  val someConst = 42
  def apply(x: Int) = new privThing(x)

  class privThing(x: Int) extends thing {
    val field = x * someConst
    override def toString = "val: " + field
  }
}
val t1 = thing(2)
val tArr: Array[thing] = Array(t1)
OK, 't1' still has the type 'thing.privThing', but it can now be treated as a 'thing'.
However, it's still not an elegant solution, can anyone tell me a better way to do this?
PS. I should mention, I'm using Scala 2.8.1 on Windows 7
First, the error you're seeing (you didn't tell me what it is) isn't a runtime error. The thing constructor isn't called when the thing singleton is initialized -- it's called later when you call thing.apply, so there's no circular reference at runtime.
Second, you do have a circular reference at compile time, but that doesn't cause a problem when you're compiling a scala file that you've saved on disk -- the compiler can even resolve circular references between different files. (I tested. I put your original code in a file and compiled it, and it worked fine.)
Your real problem comes from trying to run this code in the Scala REPL. Here's what the REPL does and why this is a problem in the REPL. You're entering object thing and as soon as you finish, the REPL tries to compile it, because it's reached the end of a coherent chunk of code. (Semicolon inference was able to infer a semicolon at the end of the object, and that meant the compiler could get to work on that chunk of code.) But since you haven't defined class thing it can't compile it. You have the same problem when you reverse the definitions of class thing and object thing.
The solution is to nest both class thing and object thing inside some outer object. This will defer compilation until that outer object is complete, at which point the compiler will see the definitions of class thing and object thing at the same time. You can run import thingwrapper._ right after that to make class thing and object thing available in global scope for the REPL. When you're ready to integrate your code into a file somewhere, just ditch the outer class thingwrapper.
object thingwrapper {
  // you only need a wrapper object in the REPL
  object thing {
    val someConst = 42
    def apply(x: Int) = new thing(x)
  }
  class thing(x: Int) {
    import thing.someConst
    val field = x * someConst
    override def toString = "val: " + field
  }
}
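Then, in the REPL, the usage described above looks like this (a sketch):
import thingwrapper._ // brings class thing and object thing into scope
val t = thing(2)      // t: thing = val: 84
val tArr: Array[thing] = Array(t) // t is statically typed as thing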
Scala 2.12 or later could benefit from SIP-23, which just (August 2016) passed to the next iteration (considered a "good idea", but still a work in progress):
Literal-based singleton types
Singleton types bridge the gap between the value level and the type level and hence allow the exploration in Scala of techniques which would typically only be available in languages with support for full-spectrum dependent types.
Scala’s type system can model constants (e.g. 42, "foo", classOf[String]).
These are inferred in cases like object O { final val x = 42 }. They are used to denote and propagate compile time constants (See 6.24 Constant Expressions and discussion of “constant value definition” in 4.1 Value Declarations and Definitions).
However, there is no surface syntax to express such types. This forces people who need them to create macros that provide workarounds (e.g. shapeless).
This can be changed in a relatively simple way, as the whole machinery to enable this is already present in the scala compiler.
type _42 = 42.type
type Unt = ().type
type _1 = 1 // .type is optional for literals
final val x = 1
type one = x.type // … but mandatory for identifiers