Is it correct to use `Future` to run some loop task which is never finished? - scala

In our project, we need to do a task which listens to a queue and process the coming messages in a loop, which is never finished. The code is looking like:
def processQueue = {
while(true) {
val message = queue.next();
processMessage(message) match {
case Success(_) => ...
case _ => ...
}
}
}
So we want to run it in a separate thread.
I can imagine two ways to do it, one is to use Thread as what we do in Java:
new Thread(new Runnable() { processQueue() }).start();
Another way is use Future (as we did now):
Future { processQueue }
I just wonder if is it correct to use Future in this case, since as I know(which might be wrong), Future is mean to be running some task which will finish or return a result in some time of the future. But our task is never finished.
I also wonder what's the best solution for this in scala.

A Future is supposed to a value that will eventually exist, so I don't think it makes much sense to create one that will never be fulfilled. They're also immutable, so passing information to them is a no-no. And using some externally referenced queue within the Future sounds like dark road to go down.
What you're describing is basically an Akka Actor, which has it's own FIFO queue, with a receive method to process messages. It would look something like this:
import akka.actor._
class Processor extends Actor {
def receive = {
case msg: String => processMessage(msg) match {
case Success(x) => ...
case _ => ...
}
case otherMsg # Message(_, _) => {
// process this other type of message..
}
}
}
Your application could create a single instance of this Processor actor with an ActorSystem (or some other elaborate group of these actors):
val akkaSystem = ActorSystem("myActorSystem")
val processor: ActorRef = akkaSystem.actorOf(Props[Processor], "Processor")
And send it messages:
processor ! "Do some work!"
In short, it's a better idea to use a concurrency framework like Akka than to create your own for processing queues on separate threads. The Future API is most definitely not the way to go.
I suggest perusing the Akka Documentation for more information.

If you are just running one thread (aside from the main thread), it won't matter. If you do this repeatedly, and you really want lots of separate threads, you should use Thread since that is what it is for. Futures are built with the assumption that they'll terminate, so you might run out of pool threads.

Related

Alternative to using Future.sequence inside Akka Actors

We have a fairly complex system developed using Akka HTTP and Actors model. Until now, we extensively used ask pattern and mixed Futures and Actors.
For example, an actor gets message, it needs to execute 3 operations in parallel, combine a result out of that data and returns it to sender. What we used is
declare a new variable in actor receive message callback to store a sender (since we use Future.map it can be another sender).
executed all those 3 futures in parallel using Future.sequence (sometimes its call of function that returns a future and sometimes it is ask to another actor to get something from it)
combine the result of all 3 futures using map or flatMap function of Future.sequence result
pipe a final result to a sender using pipeTo
Here is a code simplified:
case RetrieveData(userId, `type`, id, lang, paging, timeRange, platform) => {
val sen = sender
val result: Future[Seq[Map[String, Any]]] = if (paging.getOrElse(Paging(0, 0)) == Paging(0, 0)) Future.successful(Seq.empty)
else {
val start = System.currentTimeMillis()
val profileF = profileActor ? Get(userId)
Future.sequence(Seq(profileF, getSymbols(`type`, id), getData(paging, timeRange, platform)).map { result =>
logger.info(s"Got ${result.size} news in ${System.currentTimeMillis() - start} ms")
result
}.recover { case ex: Throwable =>
logger.error(s"Failure on getting data: ${ex.getMessage}", ex)
Seq.empty
}
}
result.pipeTo(sen)
}
Function getAndProcessData contains Future.sequence with executing 3 futures in parallel.
Now, as I'm reading more and more on Akka, I see that using ask is creating another actor listener. Questions are:
As we extensively use ask, can it lead to a to many threads used in a system and perhaps a thread starvation sometimes?
Using Future.map much also means different thread often. I read about one thread actor illusion which can be easily broken with mixing Futures.
Also, can this affect performances in a bad way?
Do we need to store sender in temp variable send, since we're using pipeTo? Could we do only pipeTo(sender). Also, does declaring sen in almost each receive callback waste to much resources? I would expect its reference will be removed once operation in complete.
Is there a chance to design such a system in a better way, meadning that we don't use map or ask so much? I looked at examples when you just pass a replyTo reference to some actor and the use tell instead of ask. Also, sending message to self and than replying to original sender can replace working with Future.map in some scenarios. But how it can be designed having in mind we want to perform 3 async operations in parallel and returns a formatted data to a sender? We need to have all those 3 operations completed to be able to format data.
I tried not to include to many examples, I hope you understand our concerns and problems. Many questions, but I would really love to understand how it works, simple and clear
Thanks in advance
If you want to do 3 things in parallel you are going to need to create 3 Future values which will potentially use 3 threads, and that can't be avoided.
I'm not sure what the issue with map is, but there is only one call in this code and that is not necessary.
Here is one way to clean up the code to avoid creating unnecessary Future values (untested!):
case RetrieveData(userId, `type`, id, lang, paging, timeRange, platform) =>
if (paging.forall(_ == Paging(0, 0))) {
sender ! Seq.empty
} else {
val sen = sender
val start = System.currentTimeMillis()
val resF = Seq(
profileActor ? Get(userId),
getSymbols(`type`, id),
getData(paging, timeRange, platform),
)
Future.sequence(resF).onComplete {
case Success(result) =>
val dur = System.currentTimeMillis() - start
logger.info(s"Got ${result.size} news in $dur ms")
sen ! result
case Failure(ex)
logger.error(s"Failure on getting data: ${ex.getMessage}", ex)
sen ! Seq.empty
}
}
You can avoid ask by creating your own worker thread that collects the different results and then sends the result to the sender, but that is probably more complicated than is needed here.
An actor only consumes a thread in the dispatcher when it is processing a message. Since the number of messages the actor spawned to manage the ask will process is one, it's very unlikely that the ask pattern by itself will cause thread starvation. If you're already very close to thread starvation, an ask might be the straw that breaks the camel's back.
Mixing Futures and actors can break the single-thread illusion, if and only if the code executing in the Future accesses actor state (meaning, basically, vars or mutable objects defined outside of a receive handler).
Request-response and at-least-once (between them, they cover at least most of the motivations for the ask pattern) will in general limit throughput compared to at-most-once tells. Implementing request-response or at-least-once without the ask pattern might in some situations (e.g. using a replyTo ActorRef for the ultimate recipient) be less overhead than piping asks, but probably not significantly. Asks as the main entry-point to the actor system (e.g. in the streams handling HTTP requests or processing messages from some message bus) are generally OK, but asks from one actor to another are a good opportunity to streamline.
Note that, especially if your actor imports context.dispatcher as its implicit ExecutionContext, transformations on Futures are basically identical to single-use actors.
Situations where you want multiple things to happen (especially when you need to manage partial failure (Future.sequence.recover is a possible sign of this situation, especially if the recover gets nontrivial)) are potential candidates for a saga actor to organize one particular request/response.
I would suggest instead of using Future.sequence, Souce from Akka can be used which will run all the futures in parallel, in which you can provide the parallelism also.
Here is the sample code:
Source.fromIterator( () => Seq(profileF, getSymbols(`type`, id), getData(paging, timeRange, platform)).iterator )
.mapAsync( parallelism = 1 ) { case (seqIdValue, row) =>
row.map( seqIdValue -> _ )
}.runWith( Sink.seq ).map(_.map(idWithDTO => idWithDTO))
This will return Future[Seq[Map[String, Any]]]

Scala how to use akka actors to handle a timing out operation efficiently

I am currently evaluating javascript scripts using Rhino in a restful service. I wish for there to be an evaluation time out.
I have created a mock example actor (using scala 2.10 akka actors).
case class Evaluate(expression: String)
class RhinoActor extends Actor {
override def preStart() = { println("Start context'"); super.preStart()}
def receive = {
case Evaluate(expression) ⇒ {
Thread.sleep(100)
sender ! "complete"
}
}
override def postStop() = { println("Stop context'"); super.postStop()}
}
Now I run use this actor as follows:
def run {
val t = System.currentTimeMillis()
val system = ActorSystem("MySystem")
val actor = system.actorOf(Props[RhinoActor])
implicit val timeout = Timeout(50 milliseconds)
val future = (actor ? Evaluate("10 + 50")).mapTo[String]
val result = Try(Await.result(future, Duration.Inf))
println(System.currentTimeMillis() - t)
println(result)
actor ! PoisonPill
system.shutdown()
}
Is it wise to use the ActorSystem in a closure like this which may have simultaneous requests on it?
Should I make the ActorSystem global, and will that be ok in this context?
Is there a more appropriate alternative approach?
EDIT: I think I need to use futures directly, but I will need the preStart and postStop. Currently investigating.
EDIT: Seems you don't get those hooks with futures.
I'll try and answer some of your questions for you.
First, an ActorSystem is a very heavy weight construct. You should not create one per request that needs an actor. You should create one globally and then use that single instance to spawn your actors (and you won't need system.shutdown() anymore in run). I believe this covers your first two questions.
Your approach of using an actor to execute javascript here seems sound to me. But instead of spinning up an actor per request, you might want to pool a bunch of the RhinoActors behind a Router, with each instance having it's own rhino engine that will be setup during preStart. Doing this will eliminate per request rhino initialization costs, speeding up your js evaluations. Just make sure you size your pool appropriately. Also, you won't need to be sending PoisonPill messages per request if you adopt this approach.
You also might want to look into the non-blocking callbacks onComplete, onSuccess and onFailure as opposed to using the blocking Await. These callbacks also respect timeouts and are preferable to blocking for higher throughput. As long as whatever is way way upstream waiting for this response can handle the asynchronicity (i.e. an async capable web request), then I suggest going this route.
The last thing to keep in mind is that even though code will return to the caller after the timeout if the actor has yet to respond, the actor still goes on processing that message (performing the evaluation). It does not stop and move onto the next message just because a caller timed out. Just wanted to make that clear in case it wasn't.
EDIT
In response to your comment about stopping a long execution there are some things related to Akka to consider first. You can call stop the actor, send a Kill or a PosionPill, but none of these will stop if from processing the message that it's currently processing. They just prevent it from receiving new messages. In your case, with Rhino, if infinite script execution is a possibility, then I suggest handling this within Rhino itself. I would dig into the answers on this post (Stopping the Rhino Engine in middle of execution) and setup your Rhino engine in the actor in such a way that it will stop itself if it has been executing for too long. That failure will kick out to the supervisor (if pooled) and cause that pooled instance to be restarted which will init a new Rhino in preStart. This might be the best approach for dealing with the possibility of long running scripts.

Trouble using scala actors

I have read that, when using react, all actors can execute in a single thread. I often process a collection in parallel and need to output the result. I do not believe System.out.println is threadsafe so I need some protection. One way (a traditional way) I could do this:
val lock = new Object
def printer(msg: Any) {
lock.synchronized {
println(msg)
}
}
(1 until 1000).par.foreach { i =>
printer(i)
}
println("done.")
How does this first solution compare to using actors in terms of efficiency? Is it true that I'm not creating a new thread?
val printer = actor {
loop {
react {
case msg => println(msg)
}
}
}
(1 until 10000).par.foreach { i =>
printer ! i
}
println("done.")
It doesn't seem to be a good alternative however, because the actor code never completes. If I put a println at the bottom it is never hit, even though it looks like it goes through every iteration for i. What am I doing wrong?
As you have it now with your Actor code, you only have one actor doing all the printing. As you can see from running the code, the values are all printed out sequentially by the Actor whereas in the parallel collection code, they're out of order. I'm not too familiar with parallel collections, so I don't know the performance gains between the two.
However, if your code is doing a lot of work in parallel, you probably would want to go with multiple actors. You could do something like this:
def printer = actor {
loop {
react {
case msg => println(msg)
}
}
}
val num_workers = 10
val worker_bees = Vector.fill(num_workers)(printer)
(1 until 1000).foreach { i =>
worker_bees(i % num_workers) ! i
}
The def is important. This way you're actually creating multiple actors and not just flooding one.
One actor instance will never process more than one message at the time. Whatever thread pool is allocated for the actors, each actor instance will only occupy one thread at the time, so you are guaranteed that all the printing will be processed serially.
As for not finishing, the execution of an actor never returns from a react or a loop, so:
val printer = actor {
loop {
react {
case msg => println(msg)
}
// This line is never reached because of react
}
// This line is never reached because of loop
}
If you replace loop and react with a while loop and receive, you'll see that everything inside the while loop executes as expected.
To fix your actor implementation you need to tell the actor to exit before the program will exit as well.
val printer = actor {
loop {
react {
case "stop" => exit()
case msg => println(msg)
}
}
}
(1 until 1000).par.foreach { printer ! _ }
printer ! "stop"
In both your examples there are thread pools involved backing both the parallels library and the actor library but they are created as needed.
However, println is thread safe as it does indeed have a lock in it's internals.
(1 until 1000).par.foreach { println(_) } // is threadsafe
As for performance, there are many factors. The first is that moving from a lock that multiple threads are contending for to a lock being used by only one thread ( one actor ) will increase performance. Second, if you are going to use actors and want performance, use
Akka. Akka actors are blazingly fast when compared to scala actors. Also, I hope that the stdout that println is writing to is going to a file and not the screen since involving the display drivers is going to kill your performance.
Using the parallels library is wonderful for performance since so you can take advantage of multiple cores for your computation. If each computation is very small then try the actor route for centralized reporting. However if each computation is significant and takes a decent amount of cpu time then stick just using println by itself. You really are not in a contended lock situation.
I'm not sure I can understand your problem correctly. For me your actor code works fine and terminates.
Nevertheless, you can savely use println for parallel collections, so all you really need is something like this:
(1 until 1000).par.foreach { println(_) }
Works like a charm here. I assume you already know that the output order will vary, but I just want to stress it again, because the question comes up ever so often. So don't expect the numbers to scroll down your screen in a successive fashion.

Easiest way to do idle processing in a Scala Actor?

I have a scala actor that does some work whenever a client requests it. When, and only when no client is active, I would like the Actor to do some background processing.
What is the easiest way to do this? I can think of two approaches:
Spawn a new thread that times out and wakes up the actor periodically. A straight forward approach, but I would like to avoid creating another thread (to avoid the extra code, complexity and overhead).
The Actor class has a reactWithin method, which could be used to time out from the actor itself. But the documentation says the method doesn't return. So, I am not sure how to use it.
Edit; a clarification:
Assume that the background task can be broken down into smaller units that can be independently processed.
Ok, I see I need to put my 2 cents. From the author's answer I guess the "priority receive" technique is exactly what is needed here. It is possible to find discussion in "Erlang: priority receive question here at SO". The idea is to accept high priority messages first and to accept other messages only in absence of high-priority ones.
As Scala actors are very similar to Erlang, a trivial code to implement this would look like this:
def act = loop {
reactWithin(0) {
case msg: HighPriorityMessage => // process msg
case TIMEOUT =>
react {
case msg: HighPriorityMessage => // process msg
case msg: LowPriorityMessage => // process msg
}
}
}
This works as follows. An actor has a mailbox (queue) with messages. The receive (or receiveWithin) argument is a partial function and Actor library looks for a message in a mailbox which can be applied to this partial function. In our case it would be an object of HighPriorityMessage only. So, if Actor library finds such a message, it applies our partial function and we are processing a message of high priority. Otherwise, reactWithin with timeout 0 calls our partial function with argument TIMEOUT and we immediately try to process any possible message from the queue (as it waits for a message we cannot exclude a possiblity to get HighPriorityMessage).
It sounds like the problem you describe is not well suited to the actor sub-system. An Actor is designed to sequentially process its message queue:
What should happen if the actor is performing the background work and a new task arrives?
An actor can only find out about this is it is continuously checking its mailbox as it performs the background task. How would you implement this (i.e. how would you code the background tasks as a unit of work so that the actor could keep interrupting and checking the mailbox)?
What should happen if the actor has many background tasks in its mailbox in front of the main task?
Do these background tasks get thrown away, or sent to another actor? If the latter, how can you prevent CPU time being given to that actor to perform the tasks?
All in all, it sounds much more like you need to explore some grid-style software that can run in the background (like Data Synapse)!
Just after asking this question I tried out some completely whacky code and it seems to work fine. I am not sure though if there is a gotcha in it.
import scala.actors._
object Idling
object Processor extends Actor {
start
import Actor._
def act() = {
loop {
// here lie dragons >>>>>
if (mailboxSize == 0) this ! Idling
// <<<<<<
react {
case msg:NormalMsg => {
// do the normal work
reply(answer)
}
case Idling=> {
// do the idle work in chunks
}
case msg => println("Rcvd unknown message:" + msg)
}
}
}
}
Explanation
Any code inside the argument of loop but before the call to react seems to get called when the Actor is about to wait for a message. I am sending a Idling message to self here. In the handler for this message I ensure that the mailbox-size is 0, before doing the processing.

Does this Scala actor block when creating new actor in a handler?

I have the following piece of code:
actor {
loop {
react {
case SomeEvent =>
//I want to submit a piece of work to a queue and then send a response
//when that is finished. However, I don't want *this* actor to block
val params = "Some args"
val f: Future[Any] = myQueue.submitWork( params );
actor {
//await here
val response = f.get
publisher ! response
}
}
}
}
As I understood it, the outer actor would not block on f.get because that is actually being performed by a separate actor (the one created inside the SomeEvent handler).
Is this correct?
Yes, that is correct. Your outer actor will simply create an actor and suspend (wait for its next message). However, be very careful of this kind of thing. The inner actor is started automatically on the scheduler, to be handled by a thread. That thread will block on that Future (that looks like a java.util.concurrent.Future to me). If you do this enough times, you can run into starvation problems where all available threads are blocking on Futures. An actor is basically a work queue, so you should use those semantics instead.
Here's a version of your code, using the Scalaz actors library. This library is much simpler and easier to understand than the standard Scala actors (the source is literally a page and a half). It also leads to much terser code:
actor {(e: SomeEvent) => promise { ... } to publisher }
This version is completely non-blocking.