Alternative to using Future.sequence inside Akka Actors - scala

We have a fairly complex system developed using Akka HTTP and the Actor model. Until now, we have used the ask pattern extensively and mixed Futures with Actors.
For example, an actor gets a message, needs to execute 3 operations in parallel, combines a result out of that data and returns it to the sender. What we do is:
declare a new variable in the actor's receive callback to store the sender (since we use Future.map, sender may refer to a different sender by the time the callback runs).
execute all 3 futures in parallel using Future.sequence (sometimes it is a call to a function that returns a future, and sometimes it is an ask to another actor to get something from it)
combine the results of all 3 futures using map or flatMap on the Future.sequence result
pipe the final result to the sender using pipeTo
Here is the simplified code:
case RetrieveData(userId, `type`, id, lang, paging, timeRange, platform) => {
  val sen = sender
  val result: Future[Seq[Map[String, Any]]] = if (paging.getOrElse(Paging(0, 0)) == Paging(0, 0)) Future.successful(Seq.empty)
  else {
    val start = System.currentTimeMillis()
    val profileF = profileActor ? Get(userId)
    Future.sequence(Seq(profileF, getSymbols(`type`, id), getData(paging, timeRange, platform))).map { result =>
      logger.info(s"Got ${result.size} news in ${System.currentTimeMillis() - start} ms")
      result
    }.recover { case ex: Throwable =>
      logger.error(s"Failure on getting data: ${ex.getMessage}", ex)
      Seq.empty
    }
  }
  result.pipeTo(sen)
}
Function getAndProcessData contains Future.sequence with executing 3 futures in parallel.
Now, as I'm reading more and more about Akka, I see that using ask creates another actor as a listener. My questions are:
As we use ask extensively, can it lead to too many threads being used in the system, and perhaps to thread starvation?
Using Future.map a lot also often means running on a different thread. I read about the single-thread illusion of actors, which can easily be broken by mixing in Futures.
Also, can this affect performance in a bad way?
Do we need to store sender in the temp variable sen, since we're using pipeTo? Could we just do pipeTo(sender)? Also, does declaring sen in almost every receive callback waste too many resources? I would expect its reference to be released once the operation is complete.
Is there a chance to design such a system in a better way, meaning that we don't use map or ask so much? I looked at examples where you just pass a replyTo reference to some actor and then use tell instead of ask. Also, sending a message to self and then replying to the original sender can replace working with Future.map in some scenarios. But how can it be designed, bearing in mind that we want to perform 3 async operations in parallel and return formatted data to the sender? We need all 3 operations to be completed before we can format the data.
I tried not to include too many examples; I hope you understand our concerns and problems. These are many questions, but I would really love to understand how it works, simply and clearly.
Thanks in advance

If you want to do 3 things in parallel you are going to need to create 3 Future values which will potentially use 3 threads, and that can't be avoided.
I'm not sure what the issue with map is, but there is only one call in this code and that is not necessary.
Here is one way to clean up the code to avoid creating unnecessary Future values (untested!):
case RetrieveData(userId, `type`, id, lang, paging, timeRange, platform) =>
  if (paging.forall(_ == Paging(0, 0))) {
    sender ! Seq.empty
  } else {
    // Capture the sender before going asynchronous
    val sen = sender
    val start = System.currentTimeMillis()
    val resF = Seq(
      profileActor ? Get(userId),
      getSymbols(`type`, id),
      getData(paging, timeRange, platform),
    )
    Future.sequence(resF).onComplete {
      case Success(result) =>
        val dur = System.currentTimeMillis() - start
        logger.info(s"Got ${result.size} news in $dur ms")
        sen ! result
      case Failure(ex) =>
        logger.error(s"Failure on getting data: ${ex.getMessage}", ex)
        sen ! Seq.empty
    }
  }
You can avoid ask by creating your own worker thread that collects the different results and then sends the result to the sender, but that is probably more complicated than is needed here.

An actor only consumes a thread in the dispatcher when it is processing a message. Since the number of messages the actor spawned to manage the ask will process is one, it's very unlikely that the ask pattern by itself will cause thread starvation. If you're already very close to thread starvation, an ask might be the straw that breaks the camel's back.
Mixing Futures and actors can break the single-thread illusion, if and only if the code executing in the Future accesses actor state (meaning, basically, vars or mutable objects defined outside of a receive handler).
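As a rough illustration of why state such as sender has to be captured before going asynchronous, here is a minimal sketch in plain Scala without Akka. FakeActor and its currentSender field are hypothetical stand-ins: in Akka, sender() reads mutable actor context that changes with every message, and the var below only simulates that. Promises are used so the ordering is deterministic.

```scala
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical stand-in for an actor: `currentSender` plays the role of
// Akka's `sender()`, which changes with every message being processed.
class FakeActor {
  var currentSender: String = "nobody"

  // Unsafe: the callback reads the mutable field when the future completes.
  def handleUnsafe(from: String, work: Future[Int]): Future[String] = {
    currentSender = from
    work.map(n => s"$n -> $currentSender")
  }

  // Safe: the value is copied into a local val before the callback is created.
  def handleSafe(from: String, work: Future[Int]): Future[String] = {
    currentSender = from
    val sen = currentSender
    work.map(n => s"$n -> $sen")
  }
}

object SenderCaptureDemo {
  def run(): (String, String) = {
    val actor = new FakeActor
    val p1 = Promise[Int]()
    val p2 = Promise[Int]()
    val unsafe = actor.handleUnsafe("alice", p1.future)
    val safe = actor.handleSafe("alice", p2.future)
    // A second "message" from another sender arrives before the futures complete.
    actor.currentSender = "bob"
    p1.success(1)
    p2.success(2)
    (Await.result(unsafe, 5.seconds), Await.result(safe, 5.seconds))
  }
}
```

Here the unsafe variant answers "bob" although the work was requested by "alice", while the variant with the local val keeps the original requester.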
Request-response and at-least-once (between them, they cover at least most of the motivations for the ask pattern) will in general limit throughput compared to at-most-once tells. Implementing request-response or at-least-once without the ask pattern might in some situations (e.g. using a replyTo ActorRef for the ultimate recipient) be less overhead than piping asks, but probably not significantly. Asks as the main entry-point to the actor system (e.g. in the streams handling HTTP requests or processing messages from some message bus) are generally OK, but asks from one actor to another are a good opportunity to streamline.
Note that, especially if your actor imports context.dispatcher as its implicit ExecutionContext, transformations on Futures are basically identical to single-use actors.
Situations where you want multiple things to happen, especially when you need to manage partial failure, are potential candidates for a saga actor that organizes one particular request/response. A Future.sequence followed by recover is a possible sign of this situation, especially if the recover gets nontrivial.
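To make the partial-failure point concrete, here is a minimal sketch in plain Scala (no Akka). Future.sequence fails as a whole if any of its inputs fails, so one common workaround is to lift each Future[T] into a Future[Try[T]] first. The names lift and collectAll are hypothetical helpers introduced for this example, not part of any library.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.{Failure, Success, Try}

object PartialFailure {
  // Turn a Future[T] that may fail into a Future[Try[T]] that always succeeds.
  def lift[T](f: Future[T]): Future[Try[T]] =
    f.map(v => Success(v): Try[T]).recover { case e => Failure(e) }

  // Collect every outcome, successful or not, in input order.
  def collectAll[T](fs: List[Future[T]]): Future[List[Try[T]]] =
    Future.sequence(fs.map(f => lift(f)))
}
```

The caller can then decide per element what a failure means (a default value, a partial response, or a retry) instead of losing all results to a single recover.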

I would suggest using an Akka Streams Source instead of Future.sequence. With mapAsync you can also set the parallelism explicitly.
Here is sample code (untested; the futures start running as soon as they are created, so mapAsync here only bounds how many of them are awaited at once):
Source.fromIterator(() => Seq(profileF, getSymbols(`type`, id), getData(paging, timeRange, platform)).iterator)
  .mapAsync(parallelism = 3)(f => f)
  .runWith(Sink.seq)
This returns a Future containing a Seq of the results, in the same order as the input futures.

Related

Are Akka actors overkill for doing data crunching/uploading?

I'm quite new to Scala as well as Akka actors. I'm really only reading about their use and implementation now. My background is largely js and python with a bit of C#.
A new service I have to write is going to receive REST requests, then do the following:
Open a socket connection to a message broker
Query an external REST service once
Make many big, long REST requests to another internal service, do math on the responses, and send the result out. Messages are sent through the socket connection as progress updates.
Scalability is the primary concern here, as we may normally receive ~10 small requests per minute, but at unknown times receive several jaw-droppingly enormous and long running requests at once.
Using Scala Futures, the very basic implementation would be something like this:
val smallResponse = smallHttpRequest(args)
smallResponse.onComplete {
  case Success(result) => {
    result.data.grouped(10000).toList.foreach(subList => {
      val bigResponse = getBigSlowHttpRequest(subList)
      bigResponse.onSuccess {
        case crunchableStuff => crunchAndDeliver(crunchableStuff)
      }
    })
  }
  case Failure(error) => handleError(error)
}
My understanding is that on a machine with many cores, letting the JVM handle all the threading underneath the above futures would allow for them all to run in parallel.
This could definitely be written using Akka actors, but I don't know what, if any, benefits I would realize in doing so. Would it be overkill to turn the above into an actor based process with a bunch of workers taking chunks of crunching?
For such an operation, I wouldn't go near Akka Actors -- it's way too much for what looks to be a very basic chain of async requests. The Actor system gives you the ability to safely handle and/or accumulate state in an actor, whilst your task can easily be modeled as a type safe stateless flow of data.
So Futures (or preferably one of the many lazy variants such as the Twitter Future, cats.IO, fs2 Task, Monix, etc) would easily handle that.
No IDE to hand, so there's bound to be a huge mistake in here somewhere!
val smallResponse = smallHttpRequest(args)
// flatMap on the outer Future, because the function itself returns a Future
val result: Future[List[CrunchedData]] = smallResponse.flatMap { result =>
  val listOfFutures = result.data
    .grouped(10000)
    .toList
    // List[X] => List[Future[X]]
    .map(subList => getBigSlowHttpRequest(subList))
  // List[Future[X]] => Future[List[X]]
  Future.sequence(listOfFutures)
}
Afterwards you could pass the future back via the controller if using something like Finch, Http4s, Play, Akka Http, etc. Or handle its completion manually, as in your example code.

Akka Http - Host Level Client Side API Source.queue pattern

We started to implement the Source.queue[HttpRequest] pattern mentioned in the docs: http://doc.akka.io/docs/akka-http/current/scala/http/client-side/host-level.html#examples
This is the (reduced) example from the documentation
val poolClientFlow = Http()
.cachedHostConnectionPool[Promise[HttpResponse]]("akka.io")
val queue =
Source.queue[(HttpRequest, Promise[HttpResponse])](
QueueSize, OverflowStrategy.dropNew
)
.via(poolClientFlow)
.toMat(Sink.foreach({
case ((Success(resp), p)) => p.success(resp)
case ((Failure(e), p)) => p.failure(e)
}))(Keep.left)
.run()
def queueRequest(request: HttpRequest): Future[HttpResponse] = {
val responsePromise = Promise[HttpResponse]()
queue.offer(request -> responsePromise).flatMap {
case QueueOfferResult.Enqueued => responsePromise.future
case QueueOfferResult.Dropped => Future.failed(new RuntimeException("Queue overflowed. Try again later."))
case QueueOfferResult.Failure(ex) => Future.failed(ex)
case QueueOfferResult.QueueClosed => Future.failed(new RuntimeException("Queue was closed (pool shut down) while running the request. Try again later."))
}
}
val responseFuture: Future[HttpResponse] = queueRequest(HttpRequest(uri = "/"))
The docs state that using Source.single(request) is an anti-pattern and should be avoided. However, they don't clarify why, or what the implications of using Source.queue are.
At this place we previously showed an example that used the Source.single(request).via(pool).runWith(Sink.head).
In fact, this is an anti-pattern that doesn’t perform well. Please either supply requests using a queue or in a streamed fashion as shown below.
Advantages of Source.queue
The flow is only materialized once ( probably a performance gain? ). However if I understood the akka-http implementation correctly, a new flow is materialized for each connection, so this doesn't seem to be that much of a problem
Explicit backpressure handling with OverflowStrategy and matching over the QueueOfferResult
Issues with Source.queue
These are the questions that came up, when we started implementing this pattern in our application.
Source.queue is not thread-safe
The queue implementation is not thread safe. When we use the queue in different routes / actors we have this scenario that:
An enqueued request can override the latest enqueued request, thus leading to an unresolved Future.
UPDATE
This issue has been addressed in akka/akka/issues/23081. The queue is in fact thread safe.
Filtering?
What happens when requests are being filtered? E.g. when someone changes the implementation to
Source.queue[(HttpRequest, Promise[HttpResponse])](
QueueSize, OverflowStrategy.dropNew)
.via(poolClientFlow)
// only successful responses
.filter(_._1.isSuccess)
// failed won't arrive here
.to(Sink.foreach({
case ((Success(resp), p)) => p.success(resp)
case ((Failure(e), p)) => p.failure(e)
}))
Will the Future not resolve? With a single request flow this is straightforward:
Source.single(request).via(poolClientFlow).runWith(Sink.headOption)
QueueSize vs max-open-request?
The difference between the QueueSize and max-open-requests is not clear. In the end, both are buffers. Our implementation ended up using QueueSize == max-open-requests
What's the downside for Source.single()?
Until now I have found two reasons for using Source.queue over Source.single
Performance - materializing the flow only once. However according to this answer it shouldn't be an issue
Explicitly configuring backpressure and handle failure cases. In my opinion the ConnectionPool has a sufficient handling for too much load. One can map over the resulting future and handle the exceptions.
thanks in advance,
Muki
I'll answer each of your questions directly and then give a general indirect answer to the overall problem.
probably a performance gain?
You are correct that there is a Flow materialized for each IncomingConnection but there is still a performance gain to be had if a Connection has multiple requests coming from it.
What happens when request are being filtered?
In general streams do not have a 1:1 mapping between Source elements and Sink Elements. There can be 1:0, as in your example, or there can be 1:many if a single request somehow spawned multiple responses.
QueueSize vs max-open-request?
This ratio would depend on the speed with which elements are being offered to the queue and the speed with which http requests are being processed into responses. There is no pre-defined ideal solution.
GENERAL REDESIGN
In most cases a Source.queue is used because some upstream function is creating input elements dynamically and then offering them to the queue, e.g.
val queue = ??? //as in the example in your question
queue.offer(httpRequest1)
queue.offer(httpRequest2)
queue.offer(httpRequest3)
This is poor design because whatever entity or function that is being used to create each input element could itself be part of the stream Source, e.g.
val allRequests = List(httpRequest1, httpRequest2, httpRequest3)
//no queue necessary
//(assumes poolClientFlow has been adapted to take HttpRequest and emit HttpResponse)
val allResponses : Future[Seq[HttpResponse]] =
  Source(allRequests)
    .via(poolClientFlow)
    .runWith(Sink.seq[HttpResponse])
Now there is no need to worry about the queue, max queue size, etc. Everything is bundled into a nice compact stream.
Even if the source of requests is dynamic, you can still usually use a Source. Say we are getting the request paths from the console stdin, this can still be a complete stream:
import scala.io.{Source => ioSource}
val consoleLines : () => Iterator[String] =
() => ioSource.stdin.getLines()
Source
.fromIterator(consoleLines)
.map(consoleLine => HttpRequest(GET, uri = Uri(consoleLine)))
.via(poolClientFlow)
.to(Sink.foreach[HttpResponse](println))
.run()
Now, even if each line is typed into the console at random intervals the stream can still behave reactively without a Queue.
The only instance where I've ever seen a queue, or Source.actorRef, as being absolutely necessary is when you have to create a callback function that gets passed into a third party API. This callback function will have to offer the incoming elements to the queue.

Are futures blocked in the receive method of Actors

Just a question:
I have an actor that queries the db (assume that the queries take some time).
All results from the db are returned as a Future.
This is basically the way we do it:
case class BasePay(id:String,baseSalary)
class CalcActor(db:DB) extends Actor{
  override def receive: Receive = {
    case BasePay(id:String,baseSalary) =>
      for{
        person <- db.longQueryToFindPerson(id)
        calc <- db.anotherLongQueryCallCommission(person,baseSalary)
      }yield Foo(person,calc)
  }
}
What happens if I get a lot of BasePay messages before the futures complete?
Is it queued? Are there other failures I should notice here?
What happens if I get a lot of BasePay messages before the futures completes?
A lot of futures will be executed, regardless of when the first one completes.
Is it queued?
No. The only way to have it queue would be to block on the Future result. Since the Future is dispatched asynchronously, the actor is able to continue processing messages.
Are there other failures I should notice here?
This is a broad question. Since that looks like example code, it is difficult to speculate what could go wrong. You could quickly exhaust any sort of connection pool by dispatching many queries at the same time. That can be limited by creating an ExecutionContext with a limited size to throttle how many of the Futures are executed at the same time, but that would not limit the actor from accepting the messages rapidly.
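A minimal sketch of the throttling idea, using only the standard library. The slow query is simulated with Thread.sleep, and both the object name BoundedQueries and the pool size passed in are assumptions made for the example. The method returns the peak number of tasks that were running at the same time, which should never exceed the pool size.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object BoundedQueries {
  // Runs `queries` simulated blocking calls on a pool of `poolSize` threads
  // and returns the highest number of calls observed running concurrently.
  def run(queries: Int, poolSize: Int): Int = {
    val pool = Executors.newFixedThreadPool(poolSize)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    val lock = new Object
    var running = 0
    var peak = 0
    val futures = (1 to queries).toList.map { _ =>
      Future {
        lock.synchronized { running += 1; peak = math.max(peak, running) }
        Thread.sleep(20) // stand-in for a slow, blocking DB query
        lock.synchronized { running -= 1 }
      }
    }
    try {
      Await.result(Future.sequence(futures), 10.seconds)
      lock.synchronized(peak)
    } finally pool.shutdown()
  }
}
```

All futures are created immediately, which mirrors the point above: the bounded context limits how many run at once, not how quickly work is submitted.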
The for comprehension uses an execution context to run your code:
for{
person <- db.longQueryToFindPerson(id)
calc <- db.anotherLongQueryCallCommission(person,baseSalary)
}yield Foo(person,calc)
This actually desugars into:
db.longQueryToFindPerson(id).flatMap(person =>
  db.anotherLongQueryCallCommission(person,baseSalary)
    .map(calc => Foo(person,calc))(aContext) // if no context is passed explicitly, the implicit one in scope is used
)(aContext)
Future's flatMap and map require an execution context to run; given that none is passed explicitly, the implicit context in scope is used.
In this case that is the dispatcher assigned to your actor (assuming it was brought into scope, e.g. via import context.dispatcher). Therefore, your actor will be competing for threads with the futures being executed, and its mailbox will grow until the dispatcher is able to process the futures.
You can specify another dispatcher to run the futures; there are different ways to do it.
implicit val context = ExecutionContext.fromExecutor(/* etc */)
for{
person <- db.longQueryToFindPerson(id)
calc <- db.anotherLongQueryCallCommission(person,baseSalary)
}yield Foo(person,calc)
If this is the default mailbox, i.e. you didn't specify a mailbox in some way, then it is non-blocking and unbounded, so it's OK as long as you don't run out of memory.
Check the documentation for more info.

Scala how to use akka actors to handle a timing out operation efficiently

I am currently evaluating javascript scripts using Rhino in a restful service. I wish for there to be an evaluation time out.
I have created a mock example actor (using scala 2.10 akka actors).
case class Evaluate(expression: String)
class RhinoActor extends Actor {
override def preStart() = { println("Start context'"); super.preStart()}
def receive = {
case Evaluate(expression) ⇒ {
Thread.sleep(100)
sender ! "complete"
}
}
override def postStop() = { println("Stop context'"); super.postStop()}
}
Now I use this actor as follows:
def run {
val t = System.currentTimeMillis()
val system = ActorSystem("MySystem")
val actor = system.actorOf(Props[RhinoActor])
implicit val timeout = Timeout(50 milliseconds)
val future = (actor ? Evaluate("10 + 50")).mapTo[String]
val result = Try(Await.result(future, Duration.Inf))
println(System.currentTimeMillis() - t)
println(result)
actor ! PoisonPill
system.shutdown()
}
Is it wise to use the ActorSystem in a closure like this which may have simultaneous requests on it?
Should I make the ActorSystem global, and will that be ok in this context?
Is there a more appropriate alternative approach?
EDIT: I think I need to use futures directly, but I will need the preStart and postStop. Currently investigating.
EDIT: Seems you don't get those hooks with futures.
I'll try and answer some of your questions for you.
First, an ActorSystem is a very heavy weight construct. You should not create one per request that needs an actor. You should create one globally and then use that single instance to spawn your actors (and you won't need system.shutdown() anymore in run). I believe this covers your first two questions.
Your approach of using an actor to execute javascript here seems sound to me. But instead of spinning up an actor per request, you might want to pool a bunch of the RhinoActors behind a Router, with each instance having its own rhino engine that will be set up during preStart. Doing this will eliminate per-request rhino initialization costs, speeding up your js evaluations. Just make sure you size your pool appropriately. Also, you won't need to send PoisonPill messages per request if you adopt this approach.
You also might want to look into the non-blocking callbacks onComplete, onSuccess and onFailure as opposed to using the blocking Await. These callbacks also respect timeouts and are preferable to blocking for higher throughput. As long as whatever is way way upstream waiting for this response can handle the asynchronicity (i.e. an async capable web request), then I suggest going this route.
The last thing to keep in mind is that even though code will return to the caller after the timeout if the actor has yet to respond, the actor still goes on processing that message (performing the evaluation). It does not stop and move onto the next message just because a caller timed out. Just wanted to make that clear in case it wasn't.
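The same behaviour can be seen with a plain Future, without any actor involved. The following is a minimal sketch: the 300 ms sleep stands in for a slow evaluation and the 50 ms wait for the caller's timeout (both values are arbitrary assumptions). The caller gets a TimeoutException, but the work keeps running and its result is still available later.

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.{Failure, Try}

object TimeoutDemo {
  // Returns (did the first wait time out?, the result obtained by waiting longer).
  def run(): (Boolean, String) = {
    val work: Future[String] = Future {
      Thread.sleep(300) // stand-in for a slow script evaluation
      "complete"
    }
    // The caller gives up after 50 ms, but nothing cancels `work`.
    val timedOut = Try(Await.result(work, 50.millis)) match {
      case Failure(_: TimeoutException) => true
      case _                            => false
    }
    // Waiting again shows that the work carried on and finished.
    val later = Await.result(work, 5.seconds)
    (timedOut, later)
  }
}
```

A timeout on the caller's side therefore only bounds how long the caller waits; bounding the work itself needs cooperation from the code doing the work.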
EDIT
In response to your comment about stopping a long execution, there are some things related to Akka to consider first. You can stop the actor, or send it a Kill or a PoisonPill, but none of these will stop it from processing the message that it's currently processing. They just prevent it from receiving new messages. In your case, with Rhino, if infinite script execution is a possibility, then I suggest handling this within Rhino itself. I would dig into the answers on this post (Stopping the Rhino Engine in middle of execution) and set up your Rhino engine in the actor in such a way that it will stop itself if it has been executing for too long. That failure will kick out to the supervisor (if pooled) and cause that pooled instance to be restarted, which will init a new Rhino in preStart. This might be the best approach for dealing with the possibility of long running scripts.

When could Futures be more appropriate than Actors (or vice versa) in Scala?

Suppose I need to run a few concurrent tasks.
I can wrap each task in a Future and wait for their completion. Alternatively I can create an Actor for each task. Each Actor would execute its task (e.g. upon receiving a "start" message) and send the result back.
I wonder when I should use the former (with Futures) and the latter (with Actors) approach and why the Future approach is considered better for the case described above.
Because it is syntactically simpler.
val tasks: Seq[() => T] = ???
val futures = tasks map {
  t => Future { t() }
}
val results: Future[Seq[T]] = Future.sequence(futures)
The results future you can then wait on using Await.result or you can map it further/use it in for-comprehension or install callbacks on it.
Compare that to instantiating all the actors, sending messages to them, coding their receive blocks, receiving responses from them and shutting them down -- that would generally require more boilerplate.
As a general rule, use the simplest concurrency model that fits your application, rather than the most powerful. Ordering from simplest to most complex would be sequential programming->parallel collections->futures->stateless actors->stateful actors->threads with software transactional memory->threads with explicit locking->threads with lock-free algorithms. Pick the first one in this list that solves your problem. The farther down that list you go, the greater the complexities and risks, so you're better off favoring simplicity over conceptual power.
I tend to think that actors are useful when you have interacting threads. In your case, it appears to be that all the jobs are independent; I would use futures.