Scala Map as parameters for spark ML models - scala

I have developed a tool using pyspark. In that tool, the user provides a dict of model parameters, which is then passed to a spark.ml model such as LogisticRegression in the form of LogisticRegression(**params).
Since I am moving to Scala now, I was wondering how this can be done in Spark using Scala. Coming from Python, my intuition is to pass a Scala Map such as:
val params = Map("regParam" -> 100)
val model = new LogisticRegression().set(params)
Obviously, it's not as trivial as that. It seems that in Scala we need to set every single parameter separately, like:
val model = new LogisticRegression()
.setRegParam(0.3)
I really want to avoid being forced to iterate over all user input parameters and set the appropriate parameters with tons of if clauses.
Any ideas how to solve this as elegantly as in Python?

According to the LogisticRegression API, you need to set each param individually via its setter:
Users can set and get the parameter values through setters and
getters, respectively.
An idea is to build your own mapping function to dynamically call the corresponding param setter using reflection.
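For example, here is a minimal sketch of that idea (the applyParams helper below is made up for illustration, and it assumes each key follows the usual setter naming convention, e.g. "regParam" -> setRegParam):
import org.apache.spark.ml.classification.LogisticRegression

// Hypothetical helper: derive the conventional setter name from each key
// and invoke it via Java reflection on the estimator.
def applyParams[M](model: M, params: Map[String, Any]): M = {
  params.foreach { case (name, value) =>
    val setterName = "set" + name.capitalize            // "regParam" -> "setRegParam"
    val setter = model.getClass.getMethods
      .find(m => m.getName == setterName && m.getParameterCount == 1)
      .getOrElse(throw new IllegalArgumentException(s"Unknown param: $name"))
    setter.invoke(model, value.asInstanceOf[AnyRef])    // reflection unboxes Int/Double as needed
  }
  model
}

val lr = applyParams(new LogisticRegression(), Map[String, Any]("regParam" -> 0.3, "maxIter" -> 10))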

Scala is a statically typed language and hence, by design, doesn't have anything like Python's **params. As you already considered, you can store them in a Map[K, Any], but the value types are then lost at runtime because of the JVM's type erasure.
Shapeless provides some neat mixed-type features that can circumvent the problem. An alternative is to use Scala's TypeTag to preserve type information, as in the following example:
import scala.reflect.runtime.universe._
case class Params[K](m: Map[(K, TypeTag[_]), Any]) extends AnyVal {
  def add[V](k: K, v: V)(implicit vt: TypeTag[V]) = this.copy(
    m = this.m + ((k, vt) -> v)
  )
  def grab[V](k: K)(implicit vt: TypeTag[V]) = m((k, vt)).asInstanceOf[V]
}
val params = Params[String](Map.empty).
  add[Int]("a", 100).
  add[String]("b", "xyz").
  add[Double]("c", 5.0).
  add[List[Int]]("d", List(1, 2, 3))
// params: Params[String] = Params( Map(
// (a,TypeTag[Int]) -> 100, (b,TypeTag[String]) -> xyz, (c,TypeTag[Double]) -> 5.0,
// (d,TypeTag[scala.List[Int]]) -> List(1, 2, 3)
// ) )
params.grab[Int]("a")
// res1: Int = 100
params.grab[String]("b")
// res2: String = xyz
params.grab[Double]("c")
// res3: Double = 5.0
params.grab[List[Int]]("d")
// res4: List[Int] = List(1, 2, 3)

Related

scala: Is collection object mutable?

This question has no functional value; I'm just trying to get a better understanding of Scala.
All collections inherit Iterable. There is no isMutable method in Iterable.
Soliciting input on whether the isMutable below is the most efficient way to assess mutability. It seems archaic, but I couldn't find an alternative other than testing for all mutable collection classes, which is not ideal since new mutable classes could be added in the future. (I would define the method as an implicit, but didn't for simplicity.)
import scala.collection.{immutable, mutable}

object IsMutable extends App {
  val mutableMap: mutable.Map[String, Int] = mutable.Map("Apples" -> 4, "Pineapples" -> 1, "Oranges" -> 10, "Grapes" -> 7)
  val immutableMap: immutable.Map[String, Int] = Map("Apples" -> 4, "Pineapples" -> 1, "Oranges" -> 10, "Grapes" -> 7)

  def isMutable[A](obj: Iterable[A]): Boolean = obj.getClass.toString.startsWith("class scala.collection.mutable")

  println(isMutable(mutableMap))
  println(isMutable(immutableMap))
}
I think relying on the class name is not a good approach. The approach I'm proposing below is probably not the most elegant way to find out if a collection is mutable either, but you can use this:
import scala.reflect.ClassTag
def isMutable[T](iterable: scala.collection.Iterable[T]): Boolean =
  iterable.isInstanceOf[scala.collection.mutable.Iterable[T]]
I think it would work fine for most types (the weird interface below is because I'm using Ammonite as my REPL, which is pretty cool :D).
# val immutableMap = scala.collection.immutable.Map[String, String]()
immutableMap: Map[String, String] = Map()
# isMutable(immutableMap)
res14: Boolean = false
# val mutableMap = scala.collection.mutable.Map[String, String]()
mutableMap: mutable.Map[String, String] = HashMap()
# isMutable(mutableMap)
res16: Boolean = true
AminMal's idea is fine. Mutable collections do extend mutable.Iterable, so checking whether your collection is an instance of it is self-explanatory.
As an alternative way: mutable collections inherit 2 specific traits that allow them to be mutated internally: Growable and Shrinkable. Growable means a collection can be augmented using the += operator, while Shrinkable means it can be reduced using the -= operator.
On a side note, there is a trick to use these operators on immutable collections too: the reference must be declared as a var to support reassignment. With mutable collections, though, you don't need reassignment, because these operations are supported by the two traits mentioned, which is why mutable collections can be declared using val.
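A quick sketch of that side note with toy values:
val ms = scala.collection.mutable.Set(1, 2, 3)
ms += 4        // compiles with a val: Growable's += mutates the collection in place

var is = Set(1, 2, 3)   // immutable Set, but the reference is a var
is += 4        // compiles only because of the var: desugars to is = is + 4 (a new Set)

val fixed = Set(1, 2, 3)
// fixed += 4  // does not compile: val reference and immutable Set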
Checking if your collection is an instance of either one of these 2 traits means it is mutable:
val myMap: mutable.Map[String, Int] = mutable.Map(
  "Apples" -> 4,
  "Pineapples" -> 1,
  "Oranges" -> 10,
  "Grapes" -> 7
)
val mySet: mutable.Set[Int] = mutable.Set(1, 2, 3)

val myMap2: Map[String, Int] = Map(
  "Apples" -> 4,
  "Pineapples" -> 1,
  "Oranges" -> 10,
  "Grapes" -> 7
)
val mySet2: Set[Int] = Set(1, 2, 3)

println(myMap.isInstanceOf[mutable.Growable[_]])    // true
println(myMap2.isInstanceOf[mutable.Shrinkable[_]]) // false
println(mySet.isInstanceOf[mutable.Shrinkable[_]])  // true
println(mySet2.isInstanceOf[mutable.Shrinkable[_]]) // false
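Wrapped up as a helper in the spirit of the original isMutable (just a sketch following this answer's approach, reusing the values above):
def isMutable[A](coll: Iterable[A]): Boolean =
  coll.isInstanceOf[mutable.Growable[_]] || coll.isInstanceOf[mutable.Shrinkable[_]]

println(isMutable(myMap))  // true
println(isMutable(myMap2)) // false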

Scala: a template for function to accept only a certain arity and a certain output?

I have a class where all of its functions have the same arity and the same output type. (Why? Each function is a separate processor that is applied to a Spark DataFrame and yields another DataFrame.)
So, the class looks like this:
class Processors {
  def p1(df: DataFrame): DataFrame = {...}
  def p2(df: DataFrame): DataFrame = {...}
  def p3(df: DataFrame): DataFrame = {...}
  ...
}
I then apply all the methods to a given DataFrame by mapping over Processors.getClass.getMethods, which allows me to add more processors without changing anything else in the code.
What I'd like to do is define a template to the methods under Processors which will restrict all of them to accept only one DataFrame and return a DataFrame. Is there a way to do this?
Restricting what kind of functions can be added to a "list" is possible by using an appropriate container class, rather than a generic class, to hold the restricted methods. The container of restricted methods can then be part of some new class or object, or part of the main program.
What you lose below by using containers (e.g. a Map with string keys and restricted values) to hold specific kinds of functions is compile-time checking of the names of the methods, e.g. calling triple vs. trilpe.
The restriction of a function to take a type T and return that same type T can be defined as a type F[T] using Function1 from the scala standard library. Function1[A,B] allows any single-parameter function with input type A and output type B, but we want these input/output types to be the same, so:
type F[T] = Function1[T,T]
For a container, I will demonstrate scala.collection.mutable.ListMap[String,F[T]] assuming the following requirements:
string names reference the functions (doThis, doThat, instead of 1, 2, 3...)
functions can be added to the list later (mutable)
though you could choose some other mutable or immutable collection class (e.g. Vector[F[T]] if you only want to number the methods) and still benefit from the restriction of what kind of functions future developers can include into the container.
An abstract type can be defined as:
type TaskMap[T] = ListMap[String, F[T]]
For your specific application you would then instantiate this as:
val Processors: TaskMap[DataFrame] = ListMap(
  "p1" -> ((df: DataFrame) => {...code for p1 goes here...}),
  "p2" -> ((df: DataFrame) => {...code for p2 goes here...}),
  "p3" -> ((df: DataFrame) => {...code for p3 goes here...})
)
and then to call one of these functions you use
Processors("p2")(someDF)
For simplicity of demonstration, let's forget about DataFrames for a moment and consider whether this scheme works with integers.
Consider the short program below. The collection "myTasks" can only contain functions from Int to Int. All of the lines below have been tested in the scala interpreter, v2.11.6, so you can follow along line by line.
import scala.collection.mutable.ListMap

type F[T] = Function1[T,T]
type TaskMap[T] = ListMap[String, F[T]]

val myTasks: TaskMap[Int] = ListMap(
  "negate" -> ((x: Int) => -x),
  "triple" -> ((x: Int) => 3 * x)
)
we can add a new function to the container that adds 7 and name it "add7"
myTasks += ( "add7" -> ((x:Int)=>(x+7)) )
and the scala interpreter responds with:
res0: myTasks.type = Map(add7 -> <function1>, negate -> <function1>, triple -> <function1>)
but we can't add a function named "half" because it would return a Double, and a Double is not an Int, so it should trigger a type error
myTasks += ( "half" -> ((x:Int)=>(0.5*x)) )
Here we get this error message:
scala> myTasks += ( "half" -> ((x:Int)=>(0.5*x)) )
<console>:12: error: type mismatch;
found : Double
required: Int
myTasks += ( "half" -> ((x:Int)=>(0.5*x)) )
^
In a compiled application, this would be found at compile time.
How to call the functions stored this way is a bit more verbose for single calls, but can be very convenient.
Suppose we want to call "triple" on 10.
We can't write
triple(10)
<console>:9: error: not found: value triple
Instead it is
myTasks("triple")(10)
res4: Int = 30
Where this notation becomes more useful is if you have a list of tasks to perform but only want to allow tasks listed in myTasks.
Suppose we want to run all the tasks on the input data "10"
myTasks mapValues { _ apply 10 }
res9: scala.collection.Map[String,Int] =
Map(add7 -> 17, negate -> -10, triple -> 30)
Suppose we want to triple, then add7, then negate
If each result is desired separately, as above, that becomes:
List("triple","add7","negate") map myTasks.apply map { _ apply 10 }
res11: List[Int] = List(30, 17, -10)
But "triple, then add 7, then negate" could also be describing a series of steps to do 10, i.e. we want -((3*10)+7)" and scala can do that too
val myProgram = List("triple","add7","negate")
myProgram map myTasks.apply reduceLeft { _ andThen _ } apply 10
res12: Int = -37
opening the door to writing an interpreter for your own customizable set of tasks because we can also write
val magic = myProgram map myTasks.apply reduceLeft { _ andThen _ }
and magic is then a function from Int to Int that can take arbitrary Ints or otherwise do work as a function should.
scala> magic(1)
res14: Int = -10
scala> magic(2)
res15: Int = -13
scala> magic(3)
res16: Int = -16
scala> List(10,20,30) map magic
res17: List[Int] = List(-37, -67, -97)
Is this what you mean?
class Processors {
  type Template = DataFrame => DataFrame

  val p1: Template = ...
  val p2: Template = ...
  val p3: Template = ...

  def applyAll(df: DataFrame): DataFrame =
    p1(p2(p3(df)))
}

Scala - Multiple ways of initializing containers

I am new to Scala and was wondering what the difference is between the following three ways of initializing a Map data structure:
private val currentFiles: HashMap[String, Long] = new HashMap[String, Long]()
private val currentJars = new HashMap[String, Long]
private val currentVars = Map[String, Long]()
There are two different parts to your question.
First, the difference between using an explicit type or not (cases 1 and 2) applies to any class, not just containers.
val x = 1
Here the type is not explicit, and the compiler will try to figure it out using type inference. The type of x will be Int.
val x: Int = 1
Same as above, but now explicit. If whatever you have to the right of = can't be treated as an Int, you will get a compiler error.
val x: Any = 1
Here we will still store a 1, but the type of the variable will be a parent class, using polymorphism.
The second part of your question is about initialization. The basic initialization is as in Java:
val x = new HashMap[String, Long]()
This calls the class constructor and returns a new instance of that exact class.
Now, there is a special method called .apply that you can define and call with just parentheses, like this:
val x = Seq[Int]()
This is a shortcut for this:
val x = Seq.apply[Int]()
Notice this is a function on the Seq object. The return type is whatever the function wants it to be; it is just another function. That said, it is mostly used to return a new instance of the given type, but there are no guarantees; you need to look at the function's documentation to be sure of the contract.
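For illustration, here is the apply convention on a made-up Point class (not part of the question's code):
class Point(val x: Int, val y: Int)

object Point {
  // companion-object factory: enables Point(1, 2) without `new`
  def apply(x: Int, y: Int): Point = new Point(x, y)
}

val p = Point(1, 2)   // sugar for Point.apply(1, 2)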
That said, in the case of val x = Map[String, Long]() you get the default immutable Map implementation (for maps with several entries this is an immutable.HashMap[String, Long], as the table further below shows).
Map and HashMap are almost equivalent, but not exactly the same thing.
Map is a trait, and HashMap is a class, although under the hood they may be the same thing (scala.collection.immutable.HashMap); more on that later.
When using
private val currentVars = Map[String, Long]()
You get a Map instance. In Scala, () is sugar; under the hood you are actually calling the apply() method of the Map object. This would be equivalent to:
private val currentVars = Map.apply[String, Long]()
Using
private val currentJars = new HashMap[String, Long]()
You get a HashMap instance.
For the version with an explicit type annotation (the question's first statement):
private val currentFiles: HashMap[String, Long] = new HashMap[String, Long]()
you are just not relying on type inference anymore. It is exactly the same as the inferred version:
private val currentFiles: HashMap[String, Long] = new HashMap[String, Long]()
private val currentFiles = new HashMap[String, Long]() // same thing
When / which to use / why
Regarding type inference, I would recommend going with it: IMHO it removes verbosity from the code where it is not really needed. But if you really miss Java-like code, then include the type :).
Now, about the two constructors...
Map vs HashMap
Short answer
You should probably always go with Map(): it is shorter, already imported, and returns a trait (like a Java interface). This last reason is nice because when passing the Map around you won't rely on implementation details, since Map is just an interface for what you want or need.
On the other side, HashMap is an implementation.
Long answer
Map is not always a HashMap.
As seen in Programming in Scala, Map.apply[K, V]() can return a different class depending on how many key-value pairs you pass to it (ref):
Number of elements    Implementation
0                     scala.collection.immutable.EmptyMap
1                     scala.collection.immutable.Map1
2                     scala.collection.immutable.Map2
3                     scala.collection.immutable.Map3
4                     scala.collection.immutable.Map4
5 or more             scala.collection.immutable.HashMap
When you have fewer than 5 elements you get a special class for each of these small collections, and when you have an empty Map you get a singleton object.
This is done mostly to get better performance.
You can try it out in repl:
import scala.collection.immutable.HashMap
val m2 = Map(1 -> 1, 2 -> 2)
m2.isInstanceOf[HashMap[Int, Int]]
// false
val m5 = Map(1 -> 1, 2 -> 2, 3 -> 3, 4 -> 4, 5 -> 5, 6 -> 6)
m5.isInstanceOf[HashMap[Int, Int]]
// true
If you are really curious you can even take a look at the source code.
So even when it comes to performance, you should probably still stick with Map().

Call a function with arguments from a list

Is there a way to call a function with arguments from a list? The equivalent in Python is sum(*args).
// Scala
def sum(x: Int, y: Int) = x + y
val args = List(1, 4)
sum.???(args) // equivalent to sum(1, 4)
sum(args: _*) wouldn't work here.
Please don't suggest changing the declaration of the function. I'm aware of functions with repeated parameters, e.g. def sum(args: Int*).
Well, you can write
sum(args(0), args(1))
But I assume you want this to work for any list length? Then you would go for fold or reduce:
args.reduce(sum) // args must be non empty!
(0 /: args)(sum) // aka args.foldLeft(0)(sum)
These methods assume a pair-wise reduction of the list. For example, foldLeft[B](init: B)(fun: (B, A) => B): B reduces a list of elements of type A to a single element of type B. In this example, A = B = Int. It starts with the initial value init. Since you want to sum, the sum of an empty list would be zero. It then calls the function with the current accumulator (the running sum) and each successive element of the list.
So it's like
var result = 0
result = sum(result, 1)
result = sum(result, 4)
...
The reduce method assumes that the list is non-empty and requires that the element type doesn't change (the function must map from two Ints to an Int).
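As a small illustration with made-up values, here is how those calls expand for a three-element list:
def sum(x: Int, y: Int) = x + y
val args = List(1, 4, 9)

args.reduce(sum)      // sum(sum(1, 4), 9)         == 14
args.foldLeft(0)(sum) // sum(sum(sum(0, 1), 4), 9) == 14
(0 /: args)(sum)      // same as foldLeft(0)(sum)  == 14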
I wouldn't recommend it for most uses, since it's a bit complicated, hard to read, and bypasses compile-time checks; but if you know what you're doing and need to do this, you can use reflection. This should work with arbitrary parameter types. For example, here's how you might call a constructor with arguments from a list:
import scala.reflect.runtime.universe
class MyClass(
  val field1: String,
  val field2: Int,
  val field3: Double)
// Get our runtime mirror
val runtimeMirror = universe.runtimeMirror(getClass.getClassLoader)
// Get the MyClass class symbol
val classSymbol = universe.typeOf[MyClass].typeSymbol.asClass
// Get a class mirror for the MyClass class
val myClassMirror = runtimeMirror.reflectClass(classSymbol)
// Get a MyClass constructor representation
val myClassCtor = universe.typeOf[MyClass].decl(universe.termNames.CONSTRUCTOR).asMethod
// Get an invokable version of the constructor
val myClassInvokableCtor = myClassMirror.reflectConstructor(myClassCtor)
val myArgs: List[Any] = List("one", 2, 3.0)
val myInstance = myClassInvokableCtor(myArgs: _*).asInstanceOf[MyClass]

Why do we need implicit parameters in Scala?

I am new to Scala, and today when I came across this Akka source code I was puzzled:
def traverse[A, B](in: JIterable[A], fn: JFunc[A, Future[B]],
                   executor: ExecutionContext): Future[JIterable[B]] = {
  implicit val d = executor
  scala.collection.JavaConversions.iterableAsScalaIterable(in).foldLeft(
    Future(new JLinkedList[B]())) { (fr, a) ⇒
    val fb = fn(a)
    for (r ← fr; b ← fb) yield { r add b; r }
  }
}
Why is the code intentionally written using an implicit parameter? Why can't it be written as:
scala.collection.JavaConversions.iterableAsScalaIterable(in).foldLeft(
  Future(new JLinkedList[B]())(executor))
without declaring a new implicit variable d? Is there any advantage to doing this? So far I find that implicits only increase the ambiguity of the code.
I can give you 3 reasons.
1) It hides boilerplate code.
Let's sort some lists:
import math.Ordering
List(1, 2, 3).sorted(Ordering.Int) // Fine. I can tell compiler how to sort ints
List("a", "b", "c").sorted(Ordering.String) // .. and strings.
List(1 -> "a", 2 -> "b", 3 -> "c").sorted(Ordering.Tuple2(Ordering.Int, Ordering.String)) // Not so fine...
With implicit parameters:
List(1, 2, 3).sorted // Compiler knows how to sort ints
List(1 -> "a", 2 -> "b", 3 -> "c").sorted // ... and some other types
2) It allows you to create APIs with generic methods:
scala> (70 to 75).map{ _.toChar }
res0: scala.collection.immutable.IndexedSeq[Char] = Vector(F, G, H, I, J, K)
scala> (70 to 75).map{ _.toChar }(collection.breakOut): String // You can change default behaviour.
res1: String = FGHIJK
3) It allows you to focus on what really matters:
Future(new JLinkedList[B]())(executor) // matters: what to do - `new JLinkedList[B]()`; doesn't: how to do it - `executor`
It's not so bad, but what if you need 2 futures:
val f1 = Future(1)(executor)
val f2 = Future(2)(executor) // You have to specify the same executor every time.
Implicit creates "context" for all actions:
implicit val d = executor // All `Future` in this scope will be created with this executor.
val f1 = Future(1)
val f2 = Future(2)
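In ordinary application code the same pattern looks like this (a sketch using the standard library's global execution context):
import scala.concurrent.{ExecutionContext, Future}

implicit val ec: ExecutionContext = ExecutionContext.global

val f1 = Future(1)   // picks up `ec` implicitly
val f2 = Future(2)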
3.5) Implicit parameters allow type-level programming. See shapeless.
About "ambiguity of the code":
You don't have to use implicits; alternatively, you can specify all parameters explicitly. It looks ugly sometimes (see the sorted example), but you can do it.
If you can't find which implicit values are used as parameters, you can ask the compiler:
>echo object Test { List( (1, "a") ).sorted } > test.scala
>scalac -Xprint:typer test.scala
You'll find math.this.Ordering.Tuple2[Int, java.lang.String](math.this.Ordering.Int, math.this.Ordering.String) in output.
In the code from Akka you linked, it is true that executor could simply be passed explicitly. But if there were more than one Future used throughout this method, declaring an implicit parameter would definitely make sense to avoid passing it around many times.
So I would say that in the code you linked, the implicit parameter was used just to follow a code style; it would be ugly to make an exception to it.
Your question intrigued me, so I searched a bit on the net. Here's what I found on this blog: http://daily-scala.blogspot.in/2010/04/implicit-parameters.html
What is an implicit parameter?
An implicit parameter is a parameter to method or constructor that is marked as implicit. This means that if a parameter value is not supplied then the compiler will search for an "implicit" value defined within scope (according to resolution rules.)
Why use an implicit parameter?
Implicit parameters are very nice for simplifying APIs. For example the collections use implicit parameters to supply CanBuildFrom objects for many of the collection methods. This is because normally the user does not need to be concerned with those parameters. Another example is supplying an encoding to an IO library so the encoding is defined once (perhaps in a package object) and all methods can use the same encoding without having to define it for every method call.
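To tie the quoted definition back to code, here is a minimal sketch of declaring and resolving an implicit parameter (the Greeting type and names are made up for illustration):
case class Greeting(text: String)

def greet(name: String)(implicit g: Greeting): String = s"${g.text}, $name!"

implicit val defaultGreeting: Greeting = Greeting("Hello")

greet("Scala")                   // compiler supplies defaultGreeting: "Hello, Scala!"
greet("Scala")(Greeting("Hi"))   // explicit argument overrides it: "Hi, Scala!"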