I have the following code that sets an atomic variable (both java.util.concurrent.atomic and monix.execution.atomic behave the same):
class Foo {
  val s = AtomicAny(null: String)
  def foo() = {
    println("called")
    /* Side Effects */
    "foo"
  }
  def get(): String = {
    s.compareAndSet(null, foo())
    s.get
  }
}
val f = new Foo
f.get //Foo.s set from null to foo, print called
f.get //Foo.s not updated, but still print called
The second time compareAndSet is called, it does not update the value, but foo is still called. This causes a problem because foo has side effects (in my real code, it creates an Akka actor, and I get an error because it tries to create duplicate actors).
How can I make sure the second parameter is not evaluated unless it is actually used? (Preferably not using synchronized)
I need to pass an implicit parameter to foo, so a lazy val would not work. E.g.
lazy val s = get() //Error cannot provide implicit parameter
def foo()(implicit context: Context) = {
  println("called")
  /* Side Effects */
  "foo"
}
def get()(implicit context: Context): String = {
  s.compareAndSet(null, foo())
  s.get
}
Updated answer
The quick answer is to put this code inside an actor and then you don't have to worry about synchronisation.
If you are using Akka Actors you should never need to do your own thread synchronisation using low-level primitives. The whole point of the actor model is to limit the interaction between threads to just passing asynchronous messages. This provides all the thread synchronisation that you need and guarantees that an actor processes a single message at a time in a single-threaded manner.
You should definitely not have a function that is accessed simultaneously by multiple threads that creates a singleton actor. Just create the actor when you have the information you need and pass the ActorRef to any other actors that need it using dependency injection or a message. Or create the actor at the start and initialise it when the first message arrives (using context.become to manage the actor state).
Original answer
The simplest solution is just to use a lazy val to hold your instance of foo:
class Foo {
  lazy val foo = {
    println("called")
    /* Side Effects */
    "foo"
  }
}
This will create foo the first time it is used and after that will just return the same value.
If this is not possible for some reason, use an AtomicInteger initialised to 0 and then call incrementAndGet. If this returns 1 then it is the first pass through this code and you can call foo.
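A minimal sketch of the AtomicInteger idea, assuming hypothetical names (OnceFoo, attempts, fooCalls) that are not in the original code. One caveat under these assumptions: a thread that loses the race can read s before the winning thread has stored the value, so get may briefly return null.

```scala
import java.util.concurrent.atomic.{AtomicInteger, AtomicReference}

// Hypothetical illustration, not the original Foo class.
class OnceFoo {
  private val attempts = new AtomicInteger(0)
  private val s = new AtomicReference[String](null)

  // Counts how often foo() actually runs (for demonstration only).
  val fooCalls = new AtomicInteger(0)

  def foo(): String = {
    fooCalls.incrementAndGet()
    /* Side effects would go here */
    "foo"
  }

  def get(): String = {
    // Only the caller that moves the counter from 0 to 1 evaluates foo().
    if (attempts.incrementAndGet() == 1) s.set(foo())
    // May still be null if another thread won the race and has not stored yet.
    s.get
  }
}
```

Calling get() repeatedly on one instance runs foo() once; later calls only bump the counter and read the stored value.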
Explanation:
Atomic operations such as compareAndSet require support from the CPU instruction set, and modern processors have single atomic instructions for such operations. In some cases (e.g. the cache line is held exclusively by this processor) the operation can be very fast. In other cases (e.g. the cache line is also in the cache of another processor) the operation can be significantly slower and can impact other threads.
The result is that the CPU must be holding the new value before the atomic instruction is executed. So the value must be computed before it is known whether it is needed or not.
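Seen from the Scala side, method arguments are evaluated before the call, so compareAndSet(null, foo()) always runs foo(). A hedged workaround sketch is to take the value by name and only evaluate it when the reference still looks empty. LazyCasSketch and setIfNull are hypothetical names; note that two threads can still both observe null and both evaluate the argument, so this narrows the problem rather than removing it.

```scala
import java.util.concurrent.atomic.{AtomicInteger, AtomicReference}

// Hypothetical helper, not part of java.util.concurrent or monix.
object LazyCasSketch {
  val s = new AtomicReference[String](null)

  // Counts how often foo() actually runs (for demonstration only).
  val fooCalls = new AtomicInteger(0)

  def foo(): String = {
    fooCalls.incrementAndGet()
    "foo"
  }

  // `value` is a by-name parameter: it is evaluated only where it is used,
  // i.e. only when the reference currently holds null.
  def setIfNull(ref: AtomicReference[String])(value: => String): String = {
    if (ref.get == null) ref.compareAndSet(null, value)
    ref.get
  }

  def get(): String = setIfNull(s)(foo())
}
```

On a single thread the second call to get() skips foo() entirely, because the null check fails before the by-name argument is touched.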
Related
Say I do the following:
def foo: Future[Int] = ...
var cache: Option[Int] = None
def getValue: Future[Int] = synchronized {
  cache match {
    case Some(value) => Future(value)
    case None =>
      foo.map { value =>
        cache = Some(value)
        value
      }
  }
}
Is there a risk of deadlock with the above code? Or can I assume that the synchronized block applies even within the future map block?
For a deadlock to exist, at least two different locks must be acquired (possibly in a different order by different threads).
From what you show here (though we do not see the implementation of foo), this is not the case. Only one lock exists and it is reentrant (if you try to enter the same synchronized block twice from the same thread, you won't lock yourself out).
Therefore, no deadlock is possible from the code you've shown.
Still, I question this design. Maybe it is a simplification of your actual code, but from what I understand, you have
A function that can generate an Int
You want to call this function only once and cache its result
I'd simplify your implementation greatly if that's the case:
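import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global // or supply your own ExecutionContext
def expensiveComputation(): Int = ???
val foo = Future { expensiveComputation() }
def getValue: Future[Int] = foo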
You'd have a single call to expensiveComputation (per instance of your enclosing object), and a synchronized cache on its return value, because Future is in and of itself a concurrency-safe construct.
Note that Future itself functions as a cache (see GPI's answer). However, GPI's answer isn't quite equivalent to your code: your code will only cache a successful value and will retry, while if the initial call to expensiveComputation in GPI's answer fails, getValue will always fail.
This however, gives us retry until successful:
def foo: Future[Int] = ???
private def retryFoo(): Future[Int] = foo.recoverWith{ case _ => retryFoo() }
lazy val getValue: Future[Int] = retryFoo()
In general, anything related to Futures which is asynchronous will not respect the synchronized block, unless you happen to Await on the asynchronous part within the synchronized block (which kind of defeats the point). In your case, it's absolutely possible for the following sequence (among many others) to occur:
Initial state: cache = None
Thread A calls getValue, obtains lock
Thread A pattern matches to None, calls foo to get a Future[Int] (fA0), schedules a callback to run in some thread B on fA0's successful completion (fA1)
Thread A releases lock
Thread A returns fA1
Thread C calls getValue, obtains lock
Thread C pattern matches to None, calls foo to get a Future[Int] (fC0), schedules a callback to run in some thread D on fC0's successful completion (fC1)
fA0 completes successfully with value 42
Thread B runs callback on fA0, sets cache = Some(42), completes successfully with value 42
Thread C releases lock
Thread C returns fC1
fC0 completes successfully with value 7
Thread D runs callback on fC0, sets cache = Some(7), completes successfully with value 7
The code above can't deadlock, but there's no guarantee that foo will successfully complete exactly once (it could successfully complete arbitrarily many times), nor is there any guarantee as to which particular value of foo will be returned by a given call to getValue.
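The sequence above can be reproduced deterministically with a small sketch. It assumes a hypothetical foo backed by a hand-completed Promise (gate), so that both calls to getValue happen before any callback has had a chance to populate the cache; RaceSketch, gate, fooCalls and demo are illustrative names only.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object RaceSketch {
  // Counts invocations of foo (for demonstration only).
  val fooCalls = new AtomicInteger(0)

  // Completed by hand in demo() so the timing is under our control.
  private val gate = Promise[Int]()

  def foo: Future[Int] = {
    fooCalls.incrementAndGet()
    gate.future
  }

  @volatile private var cache: Option[Int] = None

  // Same shape as the code in the question: the lock is released as soon as
  // the Future is returned, long before the map callback runs.
  def getValue: Future[Int] = synchronized {
    cache match {
      case Some(value) => Future.successful(value)
      case None =>
        foo.map { value =>
          cache = Some(value)
          value
        }
    }
  }

  def demo(): Int = {
    val first = getValue   // cache is None, foo is called
    val second = getValue  // cache is still None, foo is called again
    gate.trySuccess(42)
    Await.result(first, 5.seconds)
    Await.result(second, 5.seconds)
    fooCalls.get
  }
}
```

Under these assumptions demo() reports that foo ran twice even though every call to getValue held the lock, which is the behaviour the step-by-step sequence describes.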
EDIT to add: You could also replace
cache = Some(value)
value
with
cache.synchronized { cache = cache.orElse(Some(value)) }
cache.get
Which would prevent cache from being assigned to multiple times (i.e. it would always contain the value returned by the first map callback to execute on a future returned by foo). It probably still wouldn't deadlock (I find that if I have to reason about a deadlock, my time is probably better spent reasoning about a better abstraction), but is this elaborate/verbose machinery better than just using a retry-on-failure Future as a cache?
No, but synchronized isn't actually doing much here. getValue returns almost immediately with a Future (which may or may not be completed yet), so the lock on getValue is extremely short-lived. It does not wait for foo.map to evaluate before releasing the lock, because that is executed only after foo is completed, which will almost certainly happen after getValue returns.
I have a Weka model stored in S3, around 400MB in size.
Now, I have a set of records on which I want to run the model and perform prediction.
To perform prediction, I have tried the following:
Download and load the model on the driver as a static object, broadcast it to all executors, and perform a map operation on the prediction RDD.
----> Not working: in Weka, performing prediction requires modifying the model object, whereas broadcast requires a read-only copy.
Download and load the model on the driver as a static object and send it to the executor in each map operation.
-----> Working (but not efficient, as each map operation passes a 400MB object)
Download the model on the driver, load it on each executor, and cache it there. (I don't know how to do that)
Does anyone have an idea how I can load the model on each executor once and cache it, so that I don't load it again for other records?
You have two options:
1. Create a singleton object with a lazy val representing the data:
object WekaModel {
  lazy val data = {
    // initialize data here. This will only happen once per JVM process
  }
}
Then, you can use the lazy val in your map function. The lazy val ensures that each worker JVM initializes their own instance of the data. No serialization or broadcasts will be performed for data.
elementsRDD.map { element =>
  // use WekaModel.data here
}
Advantages
More efficient, as it allows you to initialize your data once per JVM instance. This approach is a good choice when you need to initialize a database connection pool, for example.
Disadvantages
Less control over initialization. For example, it's trickier to initialize your object if you require runtime parameters.
You can't really free up or release the object if you need to. Usually, that's acceptable, since the OS will free up the resources when the process exits.
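To illustrate the once-per-JVM behaviour of option 1 without a Spark cluster, here is a minimal plain-Scala sketch. ModelHolder, its tiny Map payload and the loads counter are hypothetical stand-ins for the real 400MB model; the threads stand in for tasks running on one executor.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for the real model holder.
object ModelHolder {
  // Counts how many times the initializer runs (for demonstration only).
  val loads = new AtomicInteger(0)

  // The initializer of a lazy val runs at most once per JVM,
  // even when several threads request it at the same time.
  lazy val data: Map[String, Double] = {
    loads.incrementAndGet()
    Map("weight" -> 0.5)
  }
}

object LazyValDemo {
  def loadFromFourThreads(): Int = {
    val threads = (1 to 4).map { _ =>
      new Thread(new Runnable {
        def run(): Unit = {
          ModelHolder.data.size // forces initialization if not done yet
          ()
        }
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    ModelHolder.loads.get
  }
}
```

All four threads share the same instance, and the initializer runs a single time.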
2. Use the mapPartitions (or foreachPartition) method on the RDD instead of just map.
This allows you to initialize whatever you need for the entire partition.
elementsRDD.mapPartitions { elements =>
  val model = new WekaModel()
  elements.map { element =>
    // use model and element. there is a single instance of model per partition.
  }
}
Advantages:
Provides more flexibility in the initialization and deinitialization of objects.
Disadvantages
Each partition will create and initialize a new instance of your object. Depending on how many partitions you have per JVM instance, it may or may not be an issue.
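The per-partition behaviour of option 2 can be sketched in plain Scala by treating each partition as an Iterator, which is the shape of function that mapPartitions expects. Model, processPartition and the factory argument are hypothetical; no Spark API is used here.

```scala
object PartitionSketch {
  // Hypothetical stand-in for an expensive-to-build model.
  final class Model {
    def predict(x: Double): Double = x * 2
  }

  // Same shape as a function passed to mapPartitions:
  // Iterator in, Iterator out, with one model built per call.
  def processPartition(newModel: () => Model)(elements: Iterator[Double]): Iterator[Double] = {
    val model = newModel()
    elements.map(model.predict)
  }

  def demo(): (List[Double], Int) = {
    var modelsCreated = 0
    val factory = () => { modelsCreated += 1; new Model }
    // Two simulated partitions.
    val partitions = List(List(1.0, 2.0), List(3.0))
    val results = partitions.flatMap(p => processPartition(factory)(p.iterator).toList)
    (results, modelsCreated)
  }
}
```

With two simulated partitions the factory runs twice, once per partition, regardless of how many elements each partition holds.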
Here's what worked for me even better than the lazy initializer. I created an object level pointer initialized to null, and let each executor initialize it. In the initialization block you can have run-once code. Note that each processing batch will reset local variables but not the Object-level ones.
object Thing1 {
  var bigObject : BigObject = null
  def main(args: Array[String]) : Unit = {
    val sc = <spark/scala magic here>
    sc.textFile(infile).map(line => {
      if (bigObject == null) {
        // this takes a minute but runs just once
        bigObject = new BigObject(parameters)
      }
      bigObject.transform(line)
    })
  }
}
This approach creates exactly one big object per executor, rather than the one big object per partition of other approaches.
If you put the var bigObject : BigObject = null inside the main function's scope, it behaves differently. In that case, it runs the bigObject constructor at the beginning of each partition (i.e. batch). If you have a memory leak, then this will eventually kill the executor. Garbage collection would also need to do more work.
Here is what we usually do:
Define a singleton client that does this kind of work, to ensure only one client is present in each executor.
Have a getOrCreate method that creates or fetches the client. Usually you have a common serving platform that serves multiple different models, so you can use something like a ConcurrentMap with computeIfAbsent to keep it thread-safe.
Call the getOrCreate method inside an RDD-level operation such as transform or foreachPartition, so that initialization happens at the executor level.
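A minimal sketch of such a getOrCreate method, assuming a hypothetical ModelRegistry object and a trivial Model case class in place of a real client. It relies only on java.util.concurrent.ConcurrentHashMap, whose computeIfAbsent runs the loader at most once per key.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger
import java.util.function.{Function => JFunction}

// Hypothetical registry; one instance exists per JVM (i.e. per executor).
object ModelRegistry {
  final case class Model(name: String)

  // Counts how many models were actually loaded (for demonstration only).
  val loads = new AtomicInteger(0)

  private val models = new ConcurrentHashMap[String, Model]()

  // Explicit java.util.function.Function so the sketch does not depend on
  // SAM conversion of Scala lambdas.
  private val loader = new JFunction[String, Model] {
    override def apply(name: String): Model = {
      loads.incrementAndGet()
      Model(name) // an expensive load would go here
    }
  }

  // Atomically returns the cached model or loads it on first request.
  def getOrCreate(name: String): Model = models.computeIfAbsent(name, loader)
}
```

Inside a Spark job this would be called from within the partition-level function, so the map is populated on the executor rather than on the driver.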
You can achieve this by broadcasting a case object with a lazy val as follows:
case object localSlowTwo {lazy val value: Int = {Thread.sleep(1000); 2}}
val broadcastSlowTwo = sc.broadcast(localSlowTwo)
(1 to 1000).toDS.repartition(100).map(_ * broadcastSlowTwo.value.value).collect
The event timeline for this on three executors with three threads each looks as follows:
Running the last line again from the same spark-shell session does not initialize any more:
This works for me, and it's thread-safe if you use a singleton and synchronized as shown below:
object singletonObj {
  var data: dataObj = null
  def getDataObj(): dataObj = this.synchronized {
    if (this.data == null) {
      this.data = new dataObj()
    }
    this.data
  }
}
object app {
  def main(args: Array[String]): Unit = {
    lazy val mydata: dataObj = singletonObj.getDataObj()
    df.map(x => { functionA(mydata) })
  }
}
What is the best practice when defining an actor?
Actor state: is it better to define a "var" holding a collection, as in the code below, or a "val" holding a mutable collection? Should we define it as private?
Should we define the methods of an Actor as private?
class FooActor(out:ActorRef)extends Actor {
private var words:List[String] = Nil
override def receive: Receive = ???
def foo()=???
}
On the first point, generally, I would go with neither. Instead, set the receive method to a method taking the collection as a parameter, and update the actor's state when the collection changes using context.become(...). Eg:
class FooActor(out: ActorRef) extends Actor {
  override def receive: Receive = active(Nil)
  def active(words: List[String]): Receive = {
    case word_to_add: String => context.become(active(word_to_add :: words))
    case ...
  }
  private def foo() = ???
}
On the second point, any helper methods are probably only for the actor's own use, so make them private.
To the first point it really depends on how large the collection of items is going to be that you're mutating. Are you going to be adding 100k items to a Map over the course of 100k messages? If this is the case perhaps you should be using a mutable collection so as to avoid the overhead of copying the entire collection to add each item. Make a smart decision based on the use case.
Here's a reference to the performance of mutable vs. immutable collections: http://www.scala-lang.org/docu/files/collections-api/collections.html
To the second point the visibility of the methods doesn't matter in terms of the interface with the Actor. The only way that you should be interacting with an Actor is through asking and telling messages so the visibility of any member methods is of little consequence outside of inferring purpose to the reader.
This is related to another question I posted (scala futures - keeping track of request context when threadId is irrelevant).
When debugging a future, the call stack isn't very informative (as the call context is usually in another thread and at another time).
This is especially problematic when different paths can lead to the same future code (for instance, a DAO called from many places in the code).
Do you know of an elegant solution for this?
I was thinking of passing a token/request ID (for flows started by a web server request), but this would require passing it around, and it also wouldn't include any of the state which you can see in a stack trace.
Perhaps passing a stack around? :)
Suppose you make a class
case class Context(requestId: Int, /* other things you need to pass around */)
There are two basic ways to send it around implicitly:
1) Add an implicit Context parameter to any function that requires it:
def processInAnotherThread(/* explicit arguments */)(
  implicit executionContext: scala.concurrent.ExecutionContext,
  context: Context): Future[Result] = ???
def processRequest = {
  /* ... */
  implicit val context: Context = Context(getRequestId, /* ... */)
  processInAnotherThread(/* explicit parameters */)
}
The drawback is that every function that needs to access Context must have this parameter and it litters the function signatures quite a bit.
2) Put it into a DynamicVariable:
// Context companion object
object Context {
  val context: DynamicVariable[Context] =
    new DynamicVariable[Context](Context(0, /* ... */))
}
def processInAnotherThread(/* explicit arguments */)(
  implicit executionContext: scala.concurrent.ExecutionContext
): Future[Result] = {
  // get requestId from context
  Context.context.value.requestId
  /* ... */
}
def processRequest = {
  /* ... */
  Context.context.withValue(Context(getRequestId, /* ... */)) {
    processInAnotherThread(/* explicit parameters */)
  }
}
The drawbacks are that
it's not immediately clear deep inside the processing that some context is available or what it contains, and referential transparency is broken. I believe it's better to strictly limit the number of available DynamicVariables (preferably no more than 1, or at most 2) and to document their use.
context must either have default values or nulls for all its contents, or it must itself be null by default (new DynamicVariable[Context](null)). Forgetting to initialize Context or its contents before processing may lead to nasty errors.
DynamicVariable is still much better than some global variable and doesn't influence the signatures of the functions that don't use it directly in any way.
In both cases you may update the contents of an existing Context with a copy method of a case class. For example:
def deepInProcessing(/* ... */): Future[Result] =
  Context.context.withValue(
    Context.context.value.copy(someParameter = newParameterValue)
  ) {
    processFurther(/* ... */)
  }
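For reference, a small self-contained sketch of how withValue scopes a DynamicVariable. RequestContext, current and demo are hypothetical names chosen to avoid clashing with the Context class above. The sketch deliberately stays on one thread: DynamicVariable is backed by an inheritable thread-local, so whether a value set here is visible inside a Future body depends on which thread the executor runs it on.

```scala
import scala.util.DynamicVariable

// Hypothetical context type for illustration.
final case class RequestContext(requestId: Int)

object RequestContext {
  // Default value used when no request is being processed.
  val current: DynamicVariable[RequestContext] =
    new DynamicVariable[RequestContext](RequestContext(0))
}

object DynamicVariableSketch {
  // Reads the context without taking it as a parameter.
  def currentId: Int = RequestContext.current.value.requestId

  def demo(): (Int, Int, Int) = {
    val before = currentId
    // The new value is visible only while the block runs.
    val inside = RequestContext.current.withValue(RequestContext(17)) {
      currentId
    }
    val after = currentId
    (before, inside, after)
  }
}
```

The value is restored automatically when the withValue block exits, which is what distinguishes this from a plain global variable.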
My app ends up doing a lot of background processing via Actors, specifically loading Mapper instances and then doing some work upon them. It's very repetitive and I'd like to cache some of these lookups across my Actor code.
I'd typically use a ThreadLocal for this. However, since the thread initialization is handled by the Actor thread pool, it seems like the only place to initialize and subsequently clear the ThreadLocal would be in the actor's PartialFunction which receives incoming messages.
What I'm doing now is to create another method in my Actor, like this:
override def aroundUpdates[T](fn: => T) : T = {
  clientCache.init {
    fn
  }
}
Where the init method handles clearing the ThreadLocal in a finally block. I don't like this approach because aroundUpdates exists only to set up the cache, which feels like a code smell.
Is there a better way to do this?
You don't need to use thread-locals: during a single reaction, you are running in a single thread. Hence you could just use a normal var. What's more, because your reactions are sequential and the actor subsystem manages synchronization for you, you could (if you want) access the state from different reactions:
def act = loop {
  var state : String = null
  def foo = state = "Hello"
  def bar = { println(state + " World"); state = null }
  def baz = println(state + " Oxbow")
  react {
    case MsgA => foo; bar
    case MsgB => baz
  }
}
Hence thread locals make no sense whatsoever to use in your own reactions!