Precomputing some code in a Scala closure

In Scala, I have the following code:
def isDifferentGroup(artifact2: DefaultArtifact) = getArtifact(artifact1Id).getGroupId != artifact2.getGroupId
val artifacts = getArtifacts().filter(isDifferentGroup)
The function isDifferentGroup is accessing the external variable artifact1Id (a closure).
I'd like to avoid computing getArtifact(artifact1Id) for each item in the list.
I could do as follows:
val artifact1: DefaultArtifact = getArtifact(artifact1Id)
def isDifferentGroup(artifact2: DefaultArtifact) = artifact1.getGroupId != artifact2.getGroupId
val artifacts = getArtifacts().filter(isDifferentGroup)
However, we are creating a variable artifact1 outside the function isDifferentGroup, and that is ugly, because this variable is used only inside the function isDifferentGroup.
how to solve it?
One possibility would be to use a curried function as follows:
def isDifferentGroup(artifact1: DefaultArtifact)(artifact2: DefaultArtifact) = artifact1.getGroupId != artifact2.getGroupId
val artifacts = getArtifacts().filter(isDifferentGroup(getArtifact(artifact1Id)))
However, I have to move the call getArtifact(artifact1Id) outside the isDifferentGroup function, and I don't want that.
how to solve it?

Everything in a function body is evaluated every time the function is called, so you can't mark part of it to be evaluated once and shared (without resorting to some kind of external cache). So you have to separate the function body from the values you want to evaluate in advance and use inside the function.
However, you can enclose those values in a block together with the function so that they form a compact unit. The best approach I can think of is to declare a val with a function type, such as:
val isDifferentGroup: DefaultArtifact => Boolean = {
  val gid = getArtifact(artifact1Id).getGroupId
  (artifact2: DefaultArtifact) => gid != artifact2.getGroupId
}
This way, you can explicitly state which part is evaluated only once, in the outer val block (here gid), and which part is evaluated in response to the artifact2 argument. And you can call isDifferentGroup just as if it were a method:
println(isDifferentGroup(someArtifact))
This is basically just a different way of creating an encapsulating class like
val isDifferentGroup: DefaultArtifact => Boolean =
  new Function1[DefaultArtifact, Boolean] {
    val gid = getArtifact(artifact1Id).getGroupId
    override def apply(artifact2: DefaultArtifact): Boolean =
      gid != artifact2.getGroupId
  }
You can even declare it as a lazy val, in which case gid is evaluated at most once, the first time the function is used.
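As a self-contained sketch of this pattern (with a hypothetical Artifact case class and getArtifact lookup standing in for the types from the question), the lookup runs only once, no matter how many artifacts are filtered:

```scala
object PrecomputeDemo {
  // hypothetical stand-ins for DefaultArtifact / getArtifact
  case class Artifact(id: String, groupId: String)
  val lookups = scala.collection.mutable.ListBuffer[String]()
  def getArtifact(id: String): Artifact = {
    lookups += id // record each lookup so we can see it happens only once
    Artifact(id, "group-" + id.take(1))
  }

  // gid is computed once, when the val block is initialized
  val isDifferentGroup: Artifact => Boolean = {
    val gid = getArtifact("a1").groupId
    artifact2 => gid != artifact2.groupId
  }

  def main(args: Array[String]): Unit = {
    val artifacts = List(Artifact("a2", "group-a"), Artifact("b1", "group-b"))
    println(artifacts.filter(isDifferentGroup).map(_.id)) // only b1 differs
    println(lookups.size) // getArtifact ran once, not once per item
  }
}
```

The closure captures only the precomputed gid, so filtering a list of any length triggers exactly one getArtifact call.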

Why not just create a class to encapsulate both a (private) value and functionality?
This code will probably not compile; it is just to illustrate the idea:
class Artifact(artifact1Id: Id) {
  private val artifact1: DefaultArtifact = getArtifact(artifact1Id)
  def isDifferentGroup(artifact2: DefaultArtifact) =
    artifact1.getGroupId != artifact2.getGroupId
}

Related

How to concat parameter of a function within a variable name

I may not have asked the question correctly, but let's say I want to use the name of whatever parameter I pass into my function to name a variable within that function:
def myFunc(dfName: DataFrame): Unit = {
  val "{dfName}_concatenated" = dfName
}
So if I passed in myFunc(testDf), the variable inside should be named testDf_concatenated.
Values and variables have to be named at compile time, and what you're looking for would have to happen at runtime, so you cannot do it this way. Use a simple name instead, and store the values in a Map("{whatever}_concatenated" -> ...).
I am really not sure what your question is; will this work for you?
def myFunc(dfName: String, anotherParam: Int): Unit = {
  val dfNameConcatenated = s"${dfName}_concatenated"
}
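A minimal sketch of the Map-based workaround suggested above (all names here are made up): instead of generating a variable name at runtime, use the name as a key into a Map:

```scala
object DynamicNameDemo {
  // the map keys play the role of the runtime-generated "variable names"
  def concatenated(dfName: String, value: String): Map[String, String] =
    Map(s"${dfName}_concatenated" -> value)

  def main(args: Array[String]): Unit = {
    val result = concatenated("testDf", "some data")
    println(result("testDf_concatenated")) // prints "some data"
  }
}
```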

Why is it allowed to put methods inside blocks, and statements inside objects in Scala?

I'm learning Scala and I don't really understand the following example:
object Test extends App {
  def method1 = println("method 1")
  val x = {
    def method2 = "method 2" // method inside a block
    "this is " + method2
  }
  method1 // statement inside an object
  println(x) // same
}
I mean, it feels inconsistent to me because here I see two different concepts:
Objects/Classes/Traits, which contain members.
Blocks, which contain statements, the last statement being the value of the block.
But here we have a method as part of a block, and statements as part of an object. So does it mean that blocks are objects too? And how are the statements that are part of an object handled, are they members too?
Thanks.
Does it mean that blocks are objects too?
No, blocks are not objects. Blocks are used for scoping the binding of variables. Scala enables not only defining expressions inside blocks but also to define methods. If we take your example and compile it, we can see what the compiler does:
object Test extends Object {
  def method1(): Unit = scala.Predef.println("method 1");
  private[this] val x: String = _;
  <stable> <accessor> def x(): String = Test.this.x;
  final <static> private[this] def method2$1(): String = "method 2";
  def <init>(): tests.Test.type = {
    Test.super.<init>();
    Test.this.x = {
      "this is ".+(Test.this.method2$1())
    };
    Test.this.method1();
    scala.Predef.println(Test.this.x());
    ()
  }
}
What the compiler did is extract method2 into a compiler-generated method method2$1, scoped private[this], i.e. to the current instance of the type.
And how are the statements that are part of an object handled, are they members too?
The compiler took method1 and println and calls them inside the constructor, when the type is initialized. So val x and the rest of the method calls are invoked at construction time.
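A small sketch (with hypothetical names) showing that statements in an object body run in the generated constructor, i.e. when the object is first touched:

```scala
object InitDemo {
  val log = scala.collection.mutable.ListBuffer[String]()
  log += "initializing"                   // a statement: runs in the constructor
  val x = { log += "computing x"; 40 + 2 } // initialized at construction time
}

object InitDemoMain {
  def main(args: Array[String]): Unit = {
    println(InitDemo.x)                   // first access triggers initialization
    println(InitDemo.log.mkString(", "))  // prints "initializing, computing x"
  }
}
```

Nothing in InitDemo runs until the object is first referenced; then all its body statements execute once, in order.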
method2 is actually not a method; it is a local function. Scala allows you to create named functions inside local scopes, to organize your code into functions without polluting the namespace.
It is most often used to define local tail-recursive helper functions. Often, when making a function tail-recursive, you need to add an additional parameter to carry the "state" on the call stack, but this additional parameter is a private internal implementation detail and shouldn't be exposed to clients. In languages without local functions, you would make this a private helper alongside the primary method, but then it would still be within the namespace of the class and callable by all other methods of the class, when it is really only useful for that particular method. So, in Scala, you can instead define it locally inside the method:
// non tail-recursive (recursive methods need an explicit result type)
def length[A](ls: List[A]): Int = ls match {
  case Nil => 0
  case x :: xs => length(xs) + 1
}
// transformation to tail-recursive, Java-style:
def length[A](ls: List[A]): Int = lengthRec(ls, 0)
private def lengthRec[A](ls: List[A], len: Int): Int = ls match {
  case Nil => len
  case x :: xs => lengthRec(xs, len + 1)
}
// tail-recursive, Scala-style:
def length[A](ls: List[A]): Int = {
  // note: lengthRec is nested, so it is invisible outside `length`
  // and does not pollute the enclosing namespace
  def lengthRec(rest: List[A], len: Int): Int = rest match {
    case Nil => len
    case x :: xs => lengthRec(xs, len + 1)
  }
  lengthRec(ls, 0)
}
Now you might say, well, I see the value in defining local functions inside methods, but what's the value in being able to define local functions in blocks? Scala tries to be as simple as possible and have as few corner cases as possible. If you can define local functions inside methods, and local functions inside local functions, then why not simplify that rule and just say that local functions behave like local fields: you can define them in any block scope. Then you don't need different scoping rules for local fields and local functions, and you have simplified the language.
The other thing you mentioned, being able to execute code in the body of a template: that's actually the primary constructor (so to speak; it's technically more like an initializer). Remember: the primary constructor's signature is defined with parentheses after the class name. But where would you put the code for the constructor then? Well, you put it in the body of the class!

Infinite loop when replacing concrete value by parameter name

I have the two following objects (in scala and using spark):
1. The main object
object Omain {
  def main(args: Array[String]) {
    odbscan
  }
}
2. The object odbscan
object odbscan {
  val conf = new SparkConf().setAppName("Clustering").setMaster("local")
  conf.set("spark.driver.maxResultSize", "3g")
  val sc = new SparkContext(conf)
  val param_user_minimal_rating_count = 2
  /*** Connection ***/
  val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
  val sql = "SELECT id, data FROM user_profile"
  val options = connectMysql.getOptionsMap(sql)
  val uSQL = sqlcontext.load("jdbc", options)
  val users = uSQL.rdd.map { x =>
    val v = x.toString().substring(1, x.toString().size - 1).split(",")
    var ap: Map[Int, Double] = Map()
    if (v.size > 1)
      ap = v(1).split(";").map { y => (y.split(":")(0).toInt, y.split(":")(1).toDouble) }.toMap
    (v(0).toInt, ap)
  }.filter(_._2.size >= param_user_minimal_rating_count)
  println(users.collect().mkString("\n"))
}
When I execute this code I get an infinite loop, until I change:
filter(_._2.size >= param_user_minimal_rating_count)
to
filter(_._2.size >= 1)
or any other numeric literal; in that case the code works and my result is displayed.
What I think is happening here is that Spark serializes functions to send them over the wire. Because your function (the one you're passing to map) calls the accessor param_user_minimal_rating_count of the object odbscan, the entire odbscan object needs to be serialized and sent along with it. Deserializing and then using that deserialized object causes the code in its body to be executed again, which causes an infinite loop of serializing --> sending --> deserializing --> executing --> serializing --> ...
Probably the easiest fix here is changing that val to final val param_user_minimal_rating_count = 2 so the compiler will inline the value. But note that this only works for literal constants. For more information see constant value definitions and constant expressions.
Another, better solution would be to refactor your code so that no instance variables are used in lambda expressions. Referencing vals that are defined in an object or class gets the whole object serialized, so try to refer only to vals that are local to a method. And most importantly, don't execute your business logic from within a constructor or the body of an object or class.
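A minimal sketch of that refactoring (plain collections standing in for the RDD): copy the field into a local val first, so the closure captures only that value rather than the enclosing object:

```scala
object CaptureDemo {
  val param_user_minimal_rating_count = 2 // field on the object

  def run(users: List[(Int, Map[Int, Double])]): List[(Int, Map[Int, Double])] = {
    // local copy: the closure below captures only this Int, so the
    // enclosing object would not need to be serialized by Spark
    val minCount = param_user_minimal_rating_count
    users.filter(_._2.size >= minCount)
  }

  def main(args: Array[String]): Unit = {
    val users = List(
      (1, Map(1 -> 5.0, 2 -> 3.0)), // two ratings: kept
      (2, Map(1 -> 4.0))            // one rating: dropped
    )
    println(run(users).map(_._1)) // prints List(1)
  }
}
```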
Your problem is somewhere else.
The only difference between the two snippets is the definition of val Eps = 5 outside of the map, which does not change the control flow of your code at all.
Please post more context so we can help.

Why Scala needs duplicate constructor? (java.lang.NoSuchMethodException)

I was receiving this error in my Hadoop job:
java.lang.NoSuchMethodException: <PackageName>.<ClassName>.<init>(<parameters>)
In most Scala code you would catch this at compile time, but since this job is invoked at runtime, I was not catching it at compile time.
I would have thought a default parameter would cause constructors with both signatures to be created, one taking a single argument:
class BasicDynamicBlocker(args: Args, evaluation: Boolean = false) extends Job(args) with HiveAccess {
  // I NEEDED THIS TOO:
  def this(args: Args) = {
    this(args, false)
  }
  ...
}
I learned the hard way that I needed to declare the overloaded constructor using this. (I wanted to write this out in case it helps someone else.)
I also have a small question. It still seems redundant to me. Is there a reason Scala's design requires this?
Having a default parameter does not mean you get overloads generated for each possible case. For example, with:
def method(num: Int = 4, str: String = "") = ???
you might expect the compiler to generate
def method(num: Int) = method(num, "")
def method(str: String) = method(4, str)
def method() = method(4, "")
but that is not the case.
Instead you get generated default-value accessors (in the companion object, in the case of constructor defaults), one per default param:
def method$default$1: Int = 4
def method$default$2: String = ""
and whenever you write
method(str = "a")
it is simply rewritten to
method(method$default$1, "a")
So in your case, a constructor with signature this(args: Args) simply did not exist; there was only the two-parameter version.
You can read more here: http://docs.scala-lang.org/sips/completed/named-and-default-arguments.html
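A self-contained sketch of the fix (hypothetical class, with String standing in for Args): the explicit auxiliary constructor is what makes the single-argument signature visible to runtime reflection:

```scala
class Blocker(val args: String, val evaluation: Boolean = false) {
  // Without this auxiliary constructor, getConstructor(classOf[String])
  // would throw NoSuchMethodException: only the two-parameter
  // constructor (String, Boolean) exists in the bytecode.
  def this(args: String) = this(args, false)
}

object BlockerDemo {
  def main(argv: Array[String]): Unit = {
    // reflective lookup, as a framework like Hadoop would do it
    val ctor = classOf[Blocker].getConstructor(classOf[String])
    val b = ctor.newInstance("job-args")
    println(b.args + " " + b.evaluation) // prints "job-args false"
  }
}
```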

scala macro that refers to 'this' object

I am trying to use a macro to eliminate the need for Scala to construct a downward-passed function object. This code is used in the inner loops of our system, and we don't want those loops to allocate objects endlessly; this is creating performance problems for us.
Our original code was this:
dis.withBitLengthLimit(newLimit){... body ...}
And the body was a function that was passed in as a function object.
The problem I have is that the original non-macro version refers to 'this'. My workaround below is to make each place the macro is called pass the 'this' object as another argument, which is ugly:
dis.withBitLengthLimit(dis, newLimit){... body ...}
It's not awful, but it sure seems like passing dis should be unnecessary.
Is there a cleaner way?
Here's the macro below.
object IOMacros {
  /**
   * Used to temporarily vary the bit length limit.
   *
   * Implementing this as a macro eliminates the creation of a downward function
   * object every time it is called.
   *
   * ISSUE: this macro really wants to use a self reference to `this`. But when a
   * macro is expanded, the object that `this` represents changes. Until a better
   * way to do this comes about, we have to pass the `this` object to the `self`
   * argument, which makes calls look like:
   *   dis.withBitLengthLimit(dis, newLimit){... body ...}
   * That looks redundant, and it is, but it's more important to get the allocation
   * of this downward function object out of inner loops.
   */
  def withBitLengthLimitMacro(c: Context)(self: c.Tree, lengthLimitInBits: c.Tree)(body: c.Tree) = {
    import c.universe._
    q"""{
      import edu.illinois.ncsa.daffodil.util.MaybeULong
      val ___dStream = $self
      val ___newLengthLimit = $lengthLimitInBits
      val ___savedLengthLimit = ___dStream.bitLimit0b
      if (!___dStream.setBitLimit0b(MaybeULong(___dStream.bitPos0b + ___newLengthLimit))) false
      else {
        try {
          $body
        } finally {
          ___dStream.resetBitLimit0b(___savedLengthLimit)
        }
        true
      }
    }"""
  }
}
The prefix method on Context provides access to the expression the macro method is called on, which should let you accomplish what you're trying to do. Here's a quick example of how you can use it:
import scala.language.experimental.macros
import scala.reflect.macros.blackbox.Context
class Foo(val i: Int) {
  def bar: String = macro FooMacros.barImpl
}

object FooMacros {
  def barImpl(c: Context): c.Tree = {
    import c.universe._
    val self = c.prefix
    q"_root_.scala.List.fill($self.i + $self.i)(${ self.tree.toString }).mkString"
  }
}
And then:
scala> val foo = new Foo(3)
foo: Foo = Foo@6fd7c13e
scala> foo.bar
res0: String = foofoofoofoofoofoo
Note that there are some issues you need to be aware of: prefix gives you the expression, which may not be a variable name:
scala> new Foo(2).bar
res1: String = new Foo(2)new Foo(2)new Foo(2)new Foo(2)
This means that if the expression has side effects, you have to take care not to include it in the result tree more than once (assuming you don't want them to happen multiple times):
scala> new Qux(1).bar
hey
hey
res2: String = new Qux(1)new Qux(1)
Here the constructor is called twice since we include the prefix expression in the macro's result twice. You can avoid this by defining a temporary variable in the macro:
object FooMacros {
  def barImpl(c: Context): c.Tree = {
    import c.universe._
    val tmp = TermName(c.freshName)
    val self = c.prefix
    q"""
    {
      val $tmp = $self
      _root_.scala.List.fill($tmp.i + $tmp.i)(${ self.tree.toString }).mkString
    }
    """
  }
}
And then:
scala> class Qux(i: Int) extends Foo(i) { println("hey") }
defined class Qux
scala> new Qux(1).bar
hey
res3: String = new Qux(1)new Qux(1)
Note that this approach (using freshName) is much better than just prefixing local variables in the macro with a bunch of underscores, which can cause problems if an included expression happens to contain a variable with the same name.
(Update about that last paragraph: actually I don't remember for sure if you can get yourself into problems with local variable names shadowing names that might be used in included trees. I avoid it myself, but I can't manufacture an example of it causing problems at the moment, so it might be fine.)