Infinite loop when replacing concrete value by parameter name - scala

I have the following two objects (in Scala, using Spark):
1. The main object
object Omain {
  def main(args: Array[String]) {
    odbscan
  }
}
2. The object odbscan
object odbscan {
  val conf = new SparkConf().setAppName("Clustering").setMaster("local")
  conf.set("spark.driver.maxResultSize", "3g")
  val sc = new SparkContext(conf)

  val param_user_minimal_rating_count = 2

  /*** Connection ***/
  val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
  val sql = "SELECT id, data FROM user_profile"
  val options = connectMysql.getOptionsMap(sql)
  val uSQL = sqlcontext.load("jdbc", options)

  val users = uSQL.rdd.map { x =>
    val v = x.toString().substring(1, x.toString().size - 1).split(",")
    var ap: Map[Int, Double] = Map()
    if (v.size > 1)
      ap = v(1).split(";").map { y => (y.split(":")(0).toInt, y.split(":")(1).toDouble) }.toMap
    (v(0).toInt, ap)
  }.filter(_._2.size >= param_user_minimal_rating_count)

  println(users.collect().mkString("\n"))
}
When I execute this code I get an infinite loop, until I change:
filter(_._2.size >= param_user_minimal_rating_count)
to
filter(_._2.size >= 1)
or any other numeric literal. In that case the code works and my result is displayed.

What I think is happening here is that Spark serializes functions to send them over the wire. Because your function (the one you're passing to map) calls the accessor param_user_minimal_rating_count of object odbscan, the entire odbscan object has to be serialized and sent along with it. Deserializing and then using that deserialized object causes the code in its body to be executed again, which results in an infinite loop of serializing --> sending --> deserializing --> executing --> serializing --> ...
Probably the easiest fix here is to change that val to final val param_user_minimal_rating_count = 2 so the compiler will inline the value. But note that this only works for literal constants. For more information see constant value definitions and constant expressions.
Another, better solution is to refactor your code so that no instance variables are used in lambda expressions. Referencing a val defined in an object or class gets the whole object serialized, so try to refer only to vals that are local to a method. And most importantly, don't execute your business logic from within a constructor or the body of an object or class.
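A minimal sketch of that refactoring, assuming the rest of the pipeline stays as in the question (the parallelize data here is made up purely so the example is self-contained): moving the work into main and copying the field into a local val means the closure captures only an Int, not the whole odbscan object.
import org.apache.spark.{SparkConf, SparkContext}

object odbscan {
  val param_user_minimal_rating_count = 2

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Clustering").setMaster("local")
    val sc = new SparkContext(conf)

    // Copy the object field into a local val: the closure below captures only this Int.
    val minRatings = param_user_minimal_rating_count

    val users = sc.parallelize(Seq((1, Map(1 -> 2.0)), (2, Map.empty[Int, Double])))
      .filter(_._2.size >= minRatings)

    println(users.collect().mkString("\n"))
  }
}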

Your problem is somewhere else.
The only difference between the two snippets is the definition of val Eps = 5 outside of the map, which does not change the control flow of your code at all.
Please post more context so we can help.

Related

Calling function library scala

I'm looking to call the ATR function from this Scala wrapper for ta-lib, but I can't figure out how to use the wrapper correctly.
package io.github.patceev.talib

import com.tictactec.ta.lib.{Core, MInteger, RetCode}

import scala.concurrent.Future

object Volatility {

  def ATR(
    highs: Vector[Double],
    lows: Vector[Double],
    closes: Vector[Double],
    period: Int = 14
  )(implicit core: Core): Future[Vector[Double]] = {
    val arrSize = highs.length - period + 1

    if (arrSize < 0) {
      Future.successful(Vector.empty[Double])
    } else {
      val begin = new MInteger()
      val length = new MInteger()
      val result = Array.ofDim[Double](arrSize)

      core.atr(
        0, highs.length - 1, highs.toArray, lows.toArray, closes.toArray,
        period, begin, length, result
      ) match {
        case RetCode.Success =>
          Future.successful(result.toVector)
        case error =>
          Future.failed(new Exception(error.toString))
      }
    }
  }
}
Would someone be able to explain how to use the function and print the result to the console?
Many thanks in advance.
Regarding syntax, Scala is one of many languages where you call functions and methods by passing arguments in parentheses (mostly, but let's keep it simple for now):
def myFunction(a: Int): Int = a + 1
myFunction(1) // myFunction is called and returns 2
On top of this, Scala allows you to specify multiple parameter lists, as in the following example:
def myCurriedFunction(a: Int)(b: Int): Int = a + b
myCurriedFunction(2)(3) // myCurriedFunction returns 5
You can also partially apply myCurriedFunction, but again, let's keep it simple for the time being. The main idea is that you can have multiple lists of arguments passed to a function.
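Just as an illustration of partial application (not needed for the rest of this answer), the trailing underscore turns the partially applied method into a function value:
val addTwo = myCurriedFunction(2) _ // a function of type Int => Int
addTwo(3) // returns 5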
Built on top of this, Scala allows you to define a list of implicit parameters, which the compiler will automatically retrieve for you based on some scoping rules. Implicit parameters are used, for example, by Futures:
import scala.concurrent.{ExecutionContext, Future}

// this defines how and where callbacks are run;
// the compiler will automatically "inject" it for you where needed
implicit val ec: ExecutionContext = ExecutionContext.global

Future(4).map(_ + 1) // this will eventually result in a Future(5)
Note that both Future and map have a second parameter list that allows you to specify an implicit execution context. Because one is in scope, the compiler will "inject" it for you at the call site, without you having to write it explicitly. You could still have passed it explicitly, and the result would have been
Future(4)(ec).map(_ + 1)(ec)
That said, I don't know the specifics of the library you are using, but the idea is that you have to instantiate a value of type Core and either bind it to an implicit val or pass it explicitly.
The resulting code will be something like the following
val highs: Vector[Double] = ???
val lows: Vector[Double] = ???
val closes: Vector[Double] = ???

implicit val core: Core = ??? // instantiate core

val resultsFuture = Volatility.ATR(highs, lows, closes) // core is passed implicitly

for (results <- resultsFuture; result <- results) {
  println(result)
}
Note that depending on your situation you may also have to use an implicit ExecutionContext to run this code (because you are extracting the Vector[Double] from a Future). Choosing the right execution context is a separate issue, but to play around you may want to use the global execution context.
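Putting it all together, a self-contained sketch might look like the following. It assumes com.tictactec.ta.lib.Core has a public no-argument constructor (as in the Java ta-lib bindings) and uses the global execution context and Await purely for demonstration, so treat it as a starting point rather than a definitive usage:
import com.tictactec.ta.lib.Core
import io.github.patceev.talib.Volatility

import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext}

object AtrExample {
  implicit val ec: ExecutionContext = ExecutionContext.global
  implicit val core: Core = new Core() // assumption: no-arg constructor, as in the Java ta-lib API

  def main(args: Array[String]): Unit = {
    // toy data, only to have something to feed the indicator
    val highs  = Vector(10.0, 11.0, 12.5, 13.0, 12.0, 12.8, 13.5, 14.0,
                        13.2, 13.9, 14.5, 15.0, 14.2, 14.8, 15.5, 16.0)
    val lows   = highs.map(_ - 1.0)
    val closes = highs.map(_ - 0.5)

    val resultsFuture = Volatility.ATR(highs, lows, closes) // period defaults to 14, core is passed implicitly

    // Await is acceptable for a quick experiment; prefer map/foreach in real code
    val results = Await.result(resultsFuture, 10.seconds)
    results.foreach(println)
  }
}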
Extra
Regarding some of the points I've left open, here are some pointers that hopefully will turn out to be useful:
Operators
Multiple Parameter Lists (Currying)
Implicit Parameters
Scala Futures

Scala capture method of field without whole instance

I had a piece of code that looked like this:
val foo = df.map(parser.parse) // def parse(str: String): ParsedData = { ... }
However, I found out that this passes a lambda that captures this; I guess Scala treats the code above as:
val foo = df.map(s => /* this. */parser.parse(s))
whereas my intention was to pass a lightweight capture. In Java, I'd write parser::parse, but that isn't available in Scala.
This does the trick:
val tmp = parser // split val read from capture; must be in method body
val foo = df.map(tmp.parse)
but it makes the code a little more unpleasant. There is an answer by #tiago-henrique-engel that could be used like this:
val tmp = parser.parse _
val foo = df.map(tmp)
but it still requires a temporary val.
So, is there a way to pass a lightweight capture to map without first storing it in some val, to keep the code clean?

How to construct and persist a reference object per worker in a Spark 2.3.0 UDF?

In a Spark 2.3.0 Structured Streaming job I need to append a column to a DataFrame whose value is derived from an existing column of the same row.
I want to define this transformation in a UDF and use withColumn to build the new DataFrame.
Doing this transform requires consulting a very-expensive-to-construct reference object -- constructing it once per record yields unacceptable performance.
What is the best way to construct and persist this object once per worker node so it can be referenced repeatedly for every record in every batch? Note that the object is not serializable.
My current attempts have revolved around subclassing UserDefinedFunction to add the expensive object as a lazy member, and providing an alternate constructor to this subclass that does the init normally performed by the udf function. So far, however, I've been unable to get it to do the kind of type coercion that udf does -- some deep type inference wants objects of type org.apache.spark.sql.Column, while my transformation lambda works on a String for input and output.
Something like this:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.types.DataType

class ExpensiveReference {
  def ExpensiveReference() = ... // Very slow
  def transformString(in: String) = ... // Fast
}

class PersistentValUDF(f: AnyRef, dataType: DataType, inputTypes: Option[Seq[DataType]])
    extends UserDefinedFunction(f: AnyRef, dataType: DataType, inputTypes: Option[Seq[DataType]]) {

  lazy val ExpensiveReference = new ExpensiveReference()

  def PersistentValUDF() {
    this(((in: String) => ExpensiveReference.transformString(in)): (String => String), StringType, Some(List(StringType)))
  }
}
The further I dig into this rabbit hole the more I suspect there's a better way to accomplish this that I'm overlooking. Hence this post.
Edit:
I tested initializing a reference lazily in an object declared inside the UDF; this triggers reinitialization. Example code and helper object:
class IntBox {
  var valu = 0
  def increment: Unit = {
    valu = valu + 1
  }
  def get: Int = valu
}
val altUDF = udf((input: String) => {
  object ExpensiveRef {
    lazy val box = new IntBox
    def transform(in: String): String = {
      box.increment
      in + box.get.toString
    }
  }
  ExpensiveRef.transform(input)
})
The above UDF always appends 1; so the lazy object is being reinitialized per-record.
I found this post whose Option 1 I was able to turn into a workable solution. The end result ended up being similar to Jacek Laskowski's answer, but with a few tweaks:
Pull the object definition outside of the UDF's scope. Even if it is lazy, it will still be reinitialized when it's defined in the scope of the UDF.
Move the transform function off of the object and into the UDF's lambda (required to avoid serialization errors)
Capture the object's lazy member in the closure of the UDF lambda
Something like this:
object ExpensiveReference {
lazy val ref = ...
}
val persistentUDF = udf((input:String)=>{
/*transform code that references ExpensiveReference.ref*/
})
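Wiring it in would then look roughly like this (df and the column names "in" and "out" are made-up names just for illustration); ExpensiveReference.ref is constructed once per executor JVM on first access rather than once per record:
import org.apache.spark.sql.functions.col

// hypothetical DataFrame and column names
val withDerived = df.withColumn("out", persistentUDF(col("in")))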
DISCLAIMER Let me have a go at this, but please consider it a work in progress (downvotes are a big no-no :))
What I'd do is use a Scala object with a lazy val for the expensive reference.
object ExpensiveReference {
  lazy val ref = ???
  def transform(in: String) = {
    // use ref here
  }
}
With the object, whatever you do on a Spark executor (be it part of a UDF or any other computation) is going to instantiate ExpensiveReference.ref on the very first access. You could access it directly or as part of transform.
Again, it does not really matter whether you do this in a UDF or a UDAF or any other transformation. The point is that once a computation happens on a Spark executor, the "very-expensive-to-construct reference object" whose per-record construction "yields unacceptable performance" is constructed only once.
It could be in a UDF (just to make it clearer).
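For instance, a minimal sketch (the column names and the fact that transform returns a String are assumptions made just for the example):
import org.apache.spark.sql.functions.{col, udf}

// assumes ExpensiveReference.transform takes and returns a String
val expensiveUdf = udf((in: String) => ExpensiveReference.transform(in))
val transformed = df.withColumn("derived", expensiveUdf(col("in"))) // df, "in", "derived" are made-up names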

Scala NullPointerException during initialization

Consider the following case (this is a simplified version of what I have; numberOfScenarios is the most important variable here: I usually use a hardcoded number instead of it, but I'm trying to see if it's possible to calculate the value):
object ScenarioHelpers {
  val identifierList = (1 to Scenarios.numberOfScenarios).toArray
  val concurrentIdentifierQueue = new ConcurrentLinkedQueue[Int](identifierList.toSeq)
}

abstract class AbstractScenario {
  val identifier = ScenarioHelpers.concurrentIdentifierQueue.poll()
}

object Test1 extends AbstractScenario {
  val scenario1 = scenario("test scenario 1").exec(/..steps../)
}

object Test2 extends AbstractScenario {
  val scenario2 = scenario("test scenario 2").exec(/..steps../)
}

object Scenarios {
  val scenarios = List(Test1.scenario1, Test2.scenario2)
  val numberOfScenarios = scenarios.length
}

object TestPreparation {
  val feeder = ScenarioHelpers.identifierList.map(n => Map("counter" -> n))
  val prepScenario = scenario("test preparation")
    .feed(feeder)
    .exec(/..steps../)
}
Not sure if it matters, but the simulation starts with executing the TestPreparation.prepScenario.
I see that this code contains a circular dependency which makes this case impossible in and of itself. But I get a NullPointerException on the line in AbstractScenario where identifier is being initialized.
I don't fully understand all this, but I think it has something to do with the vals being merely declared at first, with the actual initialization happening later. So when identifier is being initialized, concurrentIdentifierQueue is not yet initialized and is therefore null.
I'm just trying to understand the reasons behind the NullPointerException and also if there's any way to get this working with minimal changes? Thank you.
NPEs during trait initialization are a very common problem.
The most robust way to resolve them is to avoid implementation inheritance altogether.
If that is not possible for some reason, you can mark the problematic fields as lazy val, or use def instead of val.
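As a generic illustration of the pattern this refers to (not the code from the question, just the classic initialization-order pitfall and the lazy fix):
trait Base {
  val name: String              // abstract val, implemented by the subclass
  val length: Int = name.length // NPE: name is still null while Base's initializer runs
}

object Broken extends Base {
  val name = "scenario"
}

trait FixedBase {
  val name: String
  lazy val length: Int = name.length // lazy defers evaluation until name has been initialized
}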
You answered that yourself:
I see that this code contains a circular dependency which makes this case impossible in and of itself. But I get a NullPointerException on the line in AbstractScenario where identifier is being initialized.
val feeder = ScenarioHelpers.identifierList... triggers ScenarioHelpers initialization.
val identifierList = (1 to Scenarios.numberOfScenarios).toArray triggers Scenarios initialization.
val scenarios = List(Test1.scenario1, Test2.scenario2) triggers Test1 initialization, including AbstractScenario.
At this point, in val identifier = ScenarioHelpers.concurrentIdentifierQueue.poll(), ScenarioHelpers is still initializing and concurrentIdentifierQueue is still null.
You have to obtain numberOfScenarios in a non-cyclic way. Personally, I would remove identifierList and assign identifier some other way, e.g. by incrementing a counter.
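A minimal sketch of that counter idea (one possible way to assign identifiers, not necessarily how you want yours numbered): ScenarioHelpers no longer needs numberOfScenarios at all, so the cycle disappears, and the feeder can be built from Scenarios.numberOfScenarios after the scenarios have been initialized.
import java.util.concurrent.atomic.AtomicInteger

object ScenarioHelpers {
  private val counter = new AtomicInteger(0)
  def nextIdentifier(): Int = counter.incrementAndGet() // 1, 2, 3, ...
}

abstract class AbstractScenario {
  val identifier = ScenarioHelpers.nextIdentifier()
}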

Precomputing some code in a Scala closure

In Scala, I have the following code:
def isDifferentGroup(artifact2: DefaultArtifact) = getArtifact(artifact1Id).getGroupId != artifact2.getGroupId
val artifacts = getArtifacts().filter(isDifferentGroup)
The function isDifferentGroup accesses the external String variable artifact1Id (a closure over it).
I'd like to avoid computing getArtifact(artifact1Id) for each item in the list.
I could do it as follows:
val artifact1: DefaultArtifact = getArtifact(artifact1Id)
def isDifferentGroup(artifact2: DefaultArtifact) = artifact1.getGroupId != artifact2.getGroupId
val artifacts = getArtifacts().filter(isDifferentGroup)
However, this creates a variable artifact1 outside the function isDifferentGroup, which is ugly, because that variable is only used inside isDifferentGroup.
How can I solve this?
One possibility would be to use a curried function and partial application, as follows:
def isDifferentGroup(artifact1: DefaultArtifact)(artifact2: DefaultArtifact) = artifact1.getGroupId != artifact2.getGroupId
val artifacts = getArtifacts().filter(isDifferentGroup(getArtifact(artifact1Id)))
However, that would move the code getArtifact(artifact1Id) outside the isDifferentGroup function, and I don't want that.
How can I solve this?
Everything in a function body is evaluated every time the function is called, so you can't mark part of it as being evaluated once up front and shared (without resorting to some kind of external cache). So you have to separate the values you want to evaluate in advance from the function body that uses them.
However, you can enclose those precomputed values together with the function in a single block so that everything forms one compact unit. The best thing I can think of is to declare a val with a function type, such as:
val isDifferentGroup: DefaultArtifact => Boolean = {
  val gid = getArtifact(artifact1Id).getGroupId
  (artifact2: DefaultArtifact) => {
    gid != artifact2.getGroupId
  }
}
This way, you can explicitly state which part is evaluated only once, in the outer val block (here gid), and which part is evaluated in response to an artifact2 argument. And you can call isDifferentGroup just as if it were a method:
println(isDifferentGroup(someArtifact))
This is basically just a different way of creating an encapsulating class like
val isDifferentGroup: DefaultArtifact => Boolean =
  new Function1[DefaultArtifact, Boolean] {
    val gid = getArtifact(artifact1Id).getGroupId
    override def apply(artifact2: DefaultArtifact): Boolean =
      gid != artifact2.getGroupId
  }
You can even declare it as a lazy val, in which case gid is evaluated at most once, the first time the function is used.
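As a quick sketch of that variant (the same code as above, just with the lazy modifier):
lazy val isDifferentGroup: DefaultArtifact => Boolean = {
  val gid = getArtifact(artifact1Id).getGroupId // runs on the first use of isDifferentGroup
  (artifact2: DefaultArtifact) => gid != artifact2.getGroupId
}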
Why not just create a class to encapsulate both the (private) value and the functionality?
This code will probably not compile as-is; it is just to illustrate the idea:
class Artifact(artifact1Id: Id) {
  private val artifact1: DefaultArtifact = getArtifact(artifact1Id)

  def isDifferentGroup(artifact2: DefaultArtifact) =
    artifact1.getGroupId != artifact2.getGroupId
}
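Usage would then look roughly like this (a sketch; getArtifacts() and the Id type come from the question's surrounding code):
val checker = new Artifact(artifact1Id) // getArtifact runs exactly once, in the constructor
val artifacts = getArtifacts().filter(checker.isDifferentGroup)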