Using getOrElseUpdate of TrieMap in Scala

I am using the getOrElseUpdate method of scala.collection.concurrent.TrieMap (from Scala 2.11.6):
// simplified for clarity
import scala.collection.concurrent.TrieMap
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

val trie = new TrieMap[Int, Future[String]]
def foo(): String = ... // a very long process
val fut: Future[String] = trie.getOrElseUpdate(id, Future(foo()))
As I understand it, if I invoke getOrElseUpdate from multiple threads without any synchronization, foo is invoked just once.
Is that correct?

The current implementation is that it will be invoked zero or one times. It may be invoked without the result being inserted, however. (This is standard behavior for CAS-based maps as opposed to ones that use synchronized.)
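Here's a minimal sketch of what that means in practice (slowFoo, the counter, and the .par driver are made up for illustration): racing callers may each evaluate the by-name default, so the long process can run more than once across threads, even though only one Future wins the CAS and is stored.
import java.util.concurrent.atomic.AtomicInteger

import scala.collection.concurrent.TrieMap
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

val trie = new TrieMap[Int, Future[String]]
val invocations = new AtomicInteger(0)

def slowFoo(): String = {
  invocations.incrementAndGet() // counts how many times the "long process" ran
  "result"
}

// Each individual call evaluates the by-name default at most once,
// but several racing calls may each evaluate it. Only one created
// Future is inserted; that single Future is what callers get back.
(1 to 1000).par.foreach(_ => trie.getOrElseUpdate(1, Future(slowFoo())))

// invocations.get may be greater than 1 here, even though trie(1) is one Future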


Conversion of breakOut - use iterator or view?

The Scala 2.13 migration guide contains a note on how to port collection.breakOut:
collection.breakOut no longer exists, use .view and .to(Collection) instead.
and a few paragraphs below, in an overview table, there is:
Description: collection.breakOut no longer exists
Old Code: val xs: List[Int] = ys.map(f)(collection.breakOut)
New Code: val xs = ys.iterator.map(f).to(List)
Automatic Migration Rule: Collection213Upgrade
The scala-collection-migration rewrite rule uses .iterator. What is the difference between the two? Is there a reason to prefer one to the other?
When used like that there is no real difference.
A View can be reused while an Iterator must be discarded after it's been used once.
val list = List(1, 2, 3, 4, 5)

val view = list.view
val viewPlus1 = view.map(_ + 1).toList
view.foreach(println) // works as expected: the view builds a fresh iterator

val it = list.iterator
val itPlus1 = it.map(_ + 1).toList
it.foreach(println) // undefined behavior: the iterator was already consumed by toList
In its simplest form a View[A] is a wrapper around a function () => Iterator[A], so all its methods can create a fresh Iterator[A] and delegate to the appropriate method on that iterator.
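For intuition, here is a minimal sketch of that idea using the Scala 2.13 View.fromIteratorProvider factory:
import scala.collection.View

// Each traversal asks the provider for a fresh Iterator,
// which is exactly why a view can be traversed repeatedly.
val v: View[Int] = View.fromIteratorProvider(() => Iterator(1, 2, 3))

v.map(_ + 1).toList // List(2, 3, 4)
v.map(_ + 1).toList // List(2, 3, 4) again: a new iterator is produced each time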

lazy val function vs def method

When calling a function from an external class, in the case of many calls, which will give better performance: a lazy val function or a def method?
So far, what I understand is:
def method -
Defined on and tied to a class; it needs to be declared inside an object in order to be called Java-static style.
Call-by-name: the body is evaluated only when accessed, and on every access.
lazy val lambda expression -
Tied to a FunctionN object (Function1 through Function22).
Call-by-value: evaluated the first time it is accessed, and only that one time.
Is actually a def apply method tied to a class.
So it may seem that using a lazy val removes the need to re-evaluate the function every time; should it be preferred?
I ran into this while writing UDFs for Spark, and I'm trying to understand which approach is better.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, udf}

object sql {
  def emptyStringToNull(str: String): Option[String] = {
    Option(str).getOrElse("").trim match {
      case "" => None
      case "[]" => None
      case "null" => None
      case _ => Some(str.trim)
    }
  }

  def udfEmptyStringToNull: UserDefinedFunction = udf(emptyStringToNull _)

  def repairColumn_method(dataFrame: DataFrame, colName: String): DataFrame = {
    dataFrame.withColumn(colName, udfEmptyStringToNull(col(colName)))
  }

  lazy val repairColumn_fun: (DataFrame, String) => DataFrame = { (df, colName) =>
    df.withColumn(colName, udfEmptyStringToNull(col(colName)))
  }
}
There's no need for you to use a lazy val in this specific case. When you assign a function to a lazy val, its results are not memoized, as you seem to think they are. Since the function itself is a plain function literal and not the result of an expensive computation (regardless of what goes on inside it), making it lazy is not useful. All it does is add overhead when accessing and calling it. A simple val would be better, but making it a proper method would be best.
If you want memoization, see Is there a generic way to memoize in Scala? instead.
Ignoring your specific example, if the def in question didn't take any arguments and both it and the lazy val were simple values that were expensive to compute, I would go with the lazy val if you're going to call it many times to avoid computing it over and over again.
If they were values that were very cheap to compute and you're not going to call it many times, or if they're expensive to compute but you're only going to call them once, I would go with a def instead. There wouldn't be much difference if you used a lazy val instead, but it would avoid making a couple of fields.
If they're somewhat cheap to compute but they're being called many times, it may be better to use a lazy val simply because they'll be cached. However, you might want to look at your overall design before looking at such micro-optimizations.
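To make the evaluation-timing differences concrete, here is a small self-contained sketch (EvalDemo and its printlns are invented; the printlns stand in for expensive work):
object EvalDemo {
  val strictVal: Int = { println("val: evaluated when EvalDemo is first initialized"); 1 }
  lazy val lazyVal: Int = { println("lazy val: evaluated on first access, then cached"); 2 }
  def defVal: Int = { println("def: evaluated on every single call"); 3 }
}

EvalDemo.defVal  // prints (and triggers strictVal's println, since the object initializes now)
EvalDemo.defVal  // prints again
EvalDemo.lazyVal // prints once
EvalDemo.lazyVal // silent: the cached value is returned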

Calling function library scala

I'm looking to call the ATR function from this Scala wrapper for ta-lib, but I can't figure out how to use the wrapper correctly.
package io.github.patceev.talib

import com.tictactec.ta.lib.{Core, MInteger, RetCode}

import scala.concurrent.Future

object Volatility {
  def ATR(
    highs: Vector[Double],
    lows: Vector[Double],
    closes: Vector[Double],
    period: Int = 14
  )(implicit core: Core): Future[Vector[Double]] = {
    val arrSize = highs.length - period + 1
    if (arrSize < 0) {
      Future.successful(Vector.empty[Double])
    } else {
      val begin = new MInteger()
      val length = new MInteger()
      val result = Array.ofDim[Double](arrSize)
      core.atr(
        0, highs.length - 1, highs.toArray, lows.toArray, closes.toArray,
        period, begin, length, result
      ) match {
        case RetCode.Success =>
          Future.successful(result.toVector)
        case error =>
          Future.failed(new Exception(error.toString))
      }
    }
  }
}
Would someone be able to explain how to use the function and print the result to the console?
Many thanks in advance.
Regarding syntax, Scala is one of many languages where you call functions and methods passing arguments in parentheses (mostly, but let's keep it simple for now):
def myFunction(a: Int): Int = a + 1
myFunction(1) // myFunction is called and returns 2
On top of this, Scala allows you to specify multiple parameter lists, as in the following example:
def myCurriedFunction(a: Int)(b: Int): Int = a + b
myCurriedFunction(2)(3) // myCurriedFunction returns 5
You can also partially apply myCurriedFunction; a quick taste of that follows, but let's otherwise keep it simple for the time being. The main idea is that you can have multiple lists of arguments passed to a function.
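Partial application looks like this (addTwo is just a name for the example):
// Supplying only the first parameter list yields a function that waits for the second
val addTwo: Int => Int = myCurriedFunction(2) _

addTwo(3) // returns 5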
Built on top of this, Scala allows you to define a list of implicit parameters, which the compiler will automatically retrieve for you based on some scoping rules. Implicit parameters are used, for example, by Futures:
import scala.concurrent.{ExecutionContext, Future}

// this defines how and where callbacks are run
// the compiler will automatically "inject" it for you where needed
implicit val ec: ExecutionContext = ExecutionContext.global

Future(4).map(_ + 1) // this will eventually result in a Future(5)
Note that both Future and map have a second parameter list that allows you to specify an implicit execution context. Because one is in scope, the compiler "injects" it for you at the call site, without you having to write it explicitly. You could still have done so explicitly, and the result would have been
Future(4)(ec).map(_ + 1)(ec)
That said, I don't know the specifics of the library you are using, but the idea is that you have to instantiate a value of type Core and either bind it to an implicit val or pass it explicitly.
The resulting code will be something like the following
val highs: Vector[Double] = ???
val lows: Vector[Double] = ???
val closes: Vector[Double] = ???

implicit val core: Core = ??? // instantiate core

val resultsFuture = Volatility.ATR(highs, lows, closes) // core is passed implicitly

for (results <- resultsFuture; result <- results) {
  println(result)
}
Note that depending on your situation you may also have to use an implicit ExecutionContext to run this code (because you are extracting the Vector[Double] from a Future). Choosing the right execution context is another kind of issue, but to play around you may want to use the global execution context.
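For completeness, pulling the global context into scope is a single import from the standard library:
// makes the global ExecutionContext implicitly available to Future combinators
import scala.concurrent.ExecutionContext.Implicits.global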
Extra
Regarding some of the points I've left open, here are some pointers that hopefully will turn out to be useful:
Operators
Multiple Parameter Lists (Currying)
Implicit Parameters
Scala Futures

How to get truly atomic update for TrieMap.getOrElseUpdate

As I understand it, TrieMap.getOrElseUpdate is still not truly atomic, and this fix only makes the returned result consistent (before it, different callers could receive different instances), so the updater function might still be called several times, as the documentation (for 2.11.7) says:
Note: This method will invoke op at most once. However, op may be invoked without the result being added to the map if a concurrent process is also trying to add a value corresponding to the same key k.
*I've checked this manually on 2.11.7; across concurrent callers it is still "at least once".
How to guarantee one-time call (if I use TrieMap for factories)?
I think this solution should work for my requirements:
import java.util.concurrent.atomic.AtomicInteger

import scala.collection.concurrent.TrieMap

trait LazyComp { val get: Int }

val map = new TrieMap[String, LazyComp]()

val count = new AtomicInteger() // just for the test, you don't need it

def getSingleton(key: String) = {
  val v = new LazyComp {
    lazy val get = {
      // compute something
      count.incrementAndGet() // just for the test, you don't need it
    }
  }
  map.putIfAbsent(key, v).getOrElse(v).get
}
I believe lazy val actually uses synchronized internally. Note that the code inside get should also be safe from exceptions: if the initializer throws, the lazy val is re-evaluated on the next access.
However, performance could be improved in future: SIP-20
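For intuition, a lazy val behaves roughly like the following hand-written double-checked-locking cell (a simplified sketch, not the exact code the compiler emits):
class LazyCell(compute: () => Int) {
  @volatile private var initialized = false
  private var value: Int = 0

  def get: Int = {
    if (!initialized) {
      synchronized {
        if (!initialized) {
          value = compute() // runs at most once, provided it does not throw
          initialized = true
        }
      }
    }
    value
  }
}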
Test:
scala> (0 to 10000000).par.map(_ => getSingleton("zzz")).last
res8: Int = 1
P.S. Java has a computeIfAbsent method on ConcurrentHashMap which I could use as well.
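For reference, a sketch of that Java alternative (Java 8+): unlike TrieMap.getOrElseUpdate, ConcurrentHashMap.computeIfAbsent performs the whole computation atomically, so the mapping function is invoked at most once per key.
import java.util.concurrent.ConcurrentHashMap
import java.util.function.{Function => JFunction}

val chm = new ConcurrentHashMap[String, Integer]()

// The mapping function runs at most once per key, even under contention
chm.computeIfAbsent("zzz", new JFunction[String, Integer] {
  def apply(key: String): Integer = 42 // compute something here
})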

Scala Fork-Join-All With Multiple Generic Types and 1 Generic Unit of Work

I'm attempting to write a method which accepts multiple generic types and takes as an argument a unit of work to execute.
The idea is that the unit of work is a common function that itself is generic. For the sake of example, let's say it's something like the following:
def loadModelRdd[T: TypeTag](sc: SparkContext): RDD[T] = {
...
}
loadModelRdd() will construct an RDD of the given type after some internal processing like loading the Model information, etc.
A prototype method I've been hacking on looks something like the following (non-working):
def forkAll[A: Manifest, B: Manifest](work: => RDD[_]): (RDD[A], RDD[B]) = {
  def aFuture = Future { work } // How can I notify that this work call returns type A?
  def bFuture = Future { work } // How can I notify that this work call returns type B?
  val res = for {
    a <- aFuture
    b <- bFuture
  } yield (a.asInstanceOf[A], b.asInstanceOf[B])
  Await.result(res, 10.seconds)
}
This is a shortened version of the code I'm working on, as I'm actually looking at accepting as many as 10 different types.
As you can see, the overall goal of the forkAll method is to wrap the unit of work in a Future, fork-join the execution of the unit of work for each type, then return the results as a tuple. An example consumer statement would be:
val (a, b) = forkAll[ClassA, ClassB](loadModelRdd)
i.e. I want to fork-join at this point and wait for the results, but I want the executions to run in parallel and then be collected back to the Driver (the Spark Driver, to be specific).
The problem is I'm not sure how to coerce the type returned by the unit of work within forkAll when constructing the Future {} blocks. Without the forkAll, the implementation looked like the following:
val resA = loadModelRdd[ClassA](sc)
val resB = loadModelRdd[ClassB](sc)
...
I am looking at doing this for two reasons:
To abstract the details of fork-join for any unit of work which matches this model.
A version of this code, which explicitly states what the unit of work is, is working in production and was responsible for cutting execution of a long-running block by close to half. I have a couple of execution steps where this pattern could be applied.
Is this something that is possible in Scala's type system, or should I look at this problem from a different perspective? I've tried a couple of implementations (including one described here) but I haven't quite found one that fits my current view of the problem.
Please let me know if there is any additional information needed.
Thanks!
Short answer: Scala does not allow functions with type parameters, so what you want is not exactly possible.
You are attempting to pass a method with a type parameter. Although methods are allowed to have type parameters, functions are not. When you try to pass a method, it acts like an anonymous function, so you must specify a type.
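You can see the limitation directly in a tiny sketch (identityM is a made-up example):
def identityM[T](t: T): T = t

val intF: Int => Int = identityM[Int] _ // fine: the type parameter is fixed first
// val polyF = identityM _              // compiles, but T is inferred once (to
//                                      // Nothing here); the resulting Function1
//                                      // cannot stay polymorphic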
However, since methods do allow type parameters, you can take advantage of this by creating an abstract class that will do your fork/join
abstract class ForkJoin {
  protected def work[T]: RDD[T]

  def apply[A, B]: (RDD[A], RDD[B]) = {
    // Write implementation of fork/join here
    (work[A], work[B])
  }
}
then override the generic work method so that it does what you want, such as calling some other predefined method:
val forkJoin = new ForkJoin {
  override protected def work[T]: RDD[T] =
    loadModelRdd[T](sc)
}

val (intRdd, stringRdd) = forkJoin[Int, String]
Check out this for a prototype implementation that compiles and runs without issues.
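Since the linked prototype may not be accessible, here is a self-contained sketch of the same pattern; List stands in for RDD and Futures provide the fork/join (both substitutions are mine):
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

abstract class ForkJoinDemo {
  protected def work[T]: List[T]

  def apply[A, B]: (List[A], List[B]) = {
    // fork: start both units of work concurrently
    val fa = Future(work[A])
    val fb = Future(work[B])
    // join: wait for both to finish and pair up the results
    Await.result(fa.zip(fb), 10.seconds)
  }
}

val demo = new ForkJoinDemo {
  override protected def work[T]: List[T] = List.empty[T] // placeholder work
}

val (ints, strings) = demo[Int, String] // (List[Int], List[String])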