How much processing should be done in an Akka actor ? - scala

I've to do some processing (run some jobs) on a regular interval (10min, 30min, 60min ...). I've different actors for each. I'm sending messages to self inside an actor to trigger the processing. The processing done inside an actor could take anywhere from a 10ms to 30s or more in some cases. My question from an Actor design perspective is how "heavy" can the processing inside an Actor receive be ? Or it doesn't really matter ?

It's fine to have actors that take arbitrarily long to process a message. But you wouldn't want to do that in an actor that is responsible for the timely processing of further messages.
A common pattern is to have a manager actor that receives messages and farms out work to worker actors. It's no problem if the worker actors take a long time, because the manager actor can simply create another worker actor if it needs one.
Think about what needs to be responsive and what will take a long time, and make sure a given actor isn't doing both.
But also, if you have something to do that is going to take 30 seconds, you might want to break it up into multiple actors somehow so that multiple cores can work on it in parallel. Another common pattern is for an actor given a task to look at how big it is and if it is over some threshold, spawn multiple actors and give each part of the job. Each of those actors then does the same thing, deciding whether to do the job itself or split up the work among some child actors. In this way trees of actors are formed on the fly based on the size of the job, and when the job is done the actors disappear.

Related

Use akka actors to traverse directory tree

I'm new to the actor model and was trying to write a simple example. I want to traverse a directory tree using Scala and Akka. The program should find all files and perform an arbitrary (but fast) operation on each file.
I wanted to check how can I model recursion using actors?
How do I gracefully stop the actor system when the traversal will be finished?
How can I control the number of actors to protect against out of memory?
Is there a way to keep the mailboxes of the actors from growing too big?
What will be different if the file operation will take long time to execute?
Any help would be appreciated!
Actors are workers. They take work in and give results back, or they supervise other workers. In general, you want your actors to have a single responsibility.
In theory, you could have an actor that processes a directory's contents, working on each file, or spawning an actor for each directory encountered. This would be bad, as long file-processing time would stall the system.
There are several methods for stopping the actor system gracefully. The Akka documentation mentions several of them.
You could have an actor supervisor that queues up requests for actors, spawns actors if below an actor threshold count, and decrementing the count when actors finish up. This is the job of a supervisor actor. The supervisor actor could sit to one side while it monitors, or it could also dispatch work. Akka has actor models the implement both of these approaches.
Yes, there are several ways to control the size of a mailbox. Read the documentation.
The file operation can block other processing if you do it the wrong way, such as a naive, recursive traversal.
The first thing to note is there are two types of work: traversing the file hierarchy and processing an individual file. As your first implementation try, create two actors, actor A and actor B. Actor A will traverse the file system, and send messages to actor B with the path to files to process. When actor A is done, it sends an "all done" indicator to actor B and terminates. When actor B processes the "all done" indicator, it terminates. This is a basic implementation that you can use to learn how to use the actors.
Everything else is a variation on this. Next variation might be creating two actor B's with a shared mailbox. Shutdown is a little more involved but still straightforward. The next variation is to create a dispatcher actor which farms out work to one or more actor B's. The next variation uses multiple actor A's to traverse the file system, with a supervisor to control how many actors get created.
If you follow this development plan, you will have learned a lot about how to use Akka, and can answer all of your questions.

Should Akka Actors do real processing tasks?

I'm writing an application that reads relatively large text files, validates and transforms the data (every line in a text file is an own item, there are around 100M items/file) and creates some kind of output. There already exists a multihreaded Java application (using BlockingQueue between Reading/Processing/Persisting Tasks), but I want to implement a Scala application that does the same thing.
Akka seems to be a very popular choice for building concurrent applications. Unfortunately, due to the asynchronous nature of actors, I still don't understand what a single actor can or can't do, e.g. if I can use actors as traditional workers that do some sort of calculation.
Several documentations say that Actors should never block and I understand why. But the given examples for blocking code always only mention such things as blocking file/network IO.. things that make the actor waiting for a short period of time which is of course a bad thing.
But what if the actor is "blocking" because it actually does something useful instead of waiting? In my case, the processing and transformation of a single line/item of text takes 80ms which is quite a long time (pure processing, no IO involved). Can this work be done by an actor directly or should I use a Future instead (but then, If I have to use Futures anyway, why use Akka in the first place..)?.
The Akka docs and examples show that work can be done directly by actors. But it seems that the authors only do very simplistic work (such as calling filter on a String or incrementing a counter and that's it). I don't know if they do this to keep the docs simple and concise or because you really should not do more that within an actor.
How would you design an Akka-based application for my use case (reading text file, processing every line which takes quite some time, eventually persisting the result)? Or is this some kind of problem that does not suit to Akka?
It all depends on the type of an actor.
I use this rule of thumb: if you don't need to talk to this actor and this actor does not have any other responsibilities, then it's ok to block in it doing actual work. You can treat it as a Future and this is what I would call a "worker".
If you block in an actor that is not a leaf node (worker), i.e. work distributor then the whole system will slow down.
There are a few patterns that involve work pulling/pushing or actor per request model. Either of those could be a fit for your application. You can have a manager that creates an actor for each piece of work and when the work is finished actor sends result back to manager and dies. You can also keep an actor alive and ask for more work from that actor. You can also combine actors and Futures.
Sometimes you want to be able to talk to a worker if your processing is more complex and involves multiple stages. In that case a worker can delegate work yet to another actor or to a future.
To sum-up don't block in manager/work distribution actors. It's ok to block in workers if that does not slow your system down.
disclaimer: by blocking I mean doing actual work, not just busy waiting which is never ok.
Doing computations that take 100ms is fine in an actor. However, you need to make sure to properly deal with backpressure. One way would be to use the work-pulling pattern, where your CPU bound actors request new work whenever they are ready instead of receiving new work items in a message.
That said, your problem description sounds like a processing pipeline that might benefit from using a higher level abstraction such as akka streams. Basically, produce a stream of file names to be processed and then use transformations such as map to get the desired result. I have something like this in production that sounds pretty similar to your problem description, and it works very well provided the data used by the individual processing chunks is not too large.
Of course, a stream will also be materialized to a number of actors. But the high level interface will be more type-safe and easier to reason about.

Num of actor instance

I'm new to akka-actor and confused with some problems:
when I create an actorSystem, and use actorOf(Props(classOf[AX], ...)) to create actor in main method, how many instances are there for my actor AX?
If the answer to Q1 was just one, does this mean whatever data-structure I created in the AX actor class's definition will only appear in one thread and I should not concern about concurrency problems?
What if one of my actor's action (one case in receive method) is a time consuming task and would take quite long time to finish? Will my single Actor instance not responding until it finish that task?
If the answer to Q3 is right, what I am supposed to do to prevent my actor from not responding? Should I start another thread and send another message back to it until finish the task? Is there a best practice there I should follow?
yes, the actor system will only create 1 actor instance for each time you call the 'actorOf' method. However, when using a Router it is possible to create 1 router which spreads the load to any number of actors. So in that case it is possible to construct multiple instances, but 'normally' using actorOf just creates 1 instance.
Yes, within an actor you do not have to worry about concurrency because Akka guarantees that any actor only processes 1 message at the time. You must take care not to somehow mutate the state of the actor from code outside the actor. So whenever exposing the actor state, always do this using an immutable class. Case classes are excellent for this. But also be ware of modifying the actor state when completing a Future from inside the actor. Since the Future runs on it's own thread you could have a concurrency issue when the Future completes and the actor is processing a next message at the same time.
The actor executes on 1 thread at the time, but this might be a different thread each time the actor executes.
Akka is a highly concurrent and distributed framework, everything is asynchronous and non-blocking and you must do the same within your application. Scala and Akka provide several solutions to do this. Whenever you have a time consuming task within an actor you might either delegate the time consuming task to another actor just for this purpose, use Futures or use Scala's 'async/await/blocking'. When using 'blocking' you give a hint to the compiler/runtime a blocking action is done and the runtime might start additional thread to prevent thread starvation. The Scala Concurrent programming book is an excellent guide to learn this stuff. Also look at the concurrent package ScalaDocs and Neophyte's Guide to Scala.
If the actor really has to wait for the time consuming task to complete, then yes, your actor can only respond when that's finished. But this is a very 'request-response' way of thinking. Try to get away from this. The actor could also respond immediately indicating the task has started and send an additional message once the task has been completed.
With time consuming tasks always be sure to use a different threadpool so the ActorSystem will not be blocked because all of it's available threads are used up by time consuming tasks. For Future's you can provide a separate ExecutionContext (do not use the ActorSystem's Dispatch context for this!), but via Akka's configuration you can also configure certain actors to run on a different thread pool.
See 3.
Success!
one instance (if you declare a router in your props then (maybe) more than one)
Yes. This is one of the advantages of actors.
Yes. An Actor will process messages sequentially.
You can use scala.concurrent.Future (do not use actor state in the future) or delegate the work to a child actor (the main actor can manage the state and can respond to messages). Future or child-actor depends on use case.

Stateless Akka actors

I am just starting with Akka and I am trying to split some messy functions in more manageable pieces, each of which would be carried out by a different actor.
My task seems exactly what is suitable for the actor model. I have a tree of objects, which are saved in a database. Each node has some attributes; lets' concentrate on just one and call it wealth. The wealth of the children depends on the wealth of the parent. When one computes the wealth on a node, this should trigger a similar computation on the children. I want to collect all the updated instances of nodes and save them at the same time in the DB.
This seems easy: an actor just performs the computation and then starts another actor for each child of the current node. When all children send back a message with the results of the computation, the actor collects them, adds the current node and sends a message to the actor above him.
The problem is that I do not know a way to be sure that one has received a message from all the children. A trivial way would be to use a counter and increase it whenever a message from a child arrives.
But what happens if two independent parts of the system require such a computation to be performed and the same actor is reused? The actor will spawn twice as many children and the counting will not be reliable anymore. What I would need is to make sure that the same actor is not reused from the external system, but new actors are generated each time the computation is triggered.
Is this even possible? If not, is there some automatic mechanism in Akka to make sure that every child has performed its task?
Am I just using the actor model in a situation that is not suitable here? Essentially I am doing nothing more than could be done with functions - the actors themselves are stateless, but they allow me to simplify and parallelize the computation.
The way that you describe things, I think what you really want is a recursive function that returns a Future[Tree[NodeWealth]], the function would spawn a Future every time it is called and it would call itself for each child in the hierarchy, at the end of the function it would compose the Futures from those calls into a single Future[Result]. When the recursive function terminates and returns the Future[Tree[NodeWealth]] you can then compose it with another Future which updates your DB. Take a look at the Akka documentation here on Futures. And pay particular attention to the section "Composing Futures".
The composition of futures should allow you to avoid state and implement this easily.
Another option would be to use actors and futures, and use the ask pattern on child actors, compose the resulting futures into a single future which you return to the sender (parent actor). This is essentially the same thing, just intertwined with actors. I would probably only go this route if you are already representing your nodes as actors for some other reason.

An Actor "queue"?

In Java, to write a library that makes requests to a server, I usually implement some sort of dispatcher (not unlike the one found here in the Twitter4J library: http://github.com/yusuke/twitter4j/blob/master/twitter4j-core/src/main/java/twitter4j/internal/async/DispatcherImpl.java) to limit the number of connections, to perform asynchronous tasks, etc.
The idea is that N number of threads are created. A "Task" is queued and all threads are notified, and one of the threads, when it's ready, will pop an item from the queue, do the work, and then return to a waiting state. If all the threads are busy working on a Task, then the Task is just queued, and the next available thread will take it.
This keeps the max number of connections to N, and allows at most N Tasks to be operating at the same time.
I'm wondering what kind of system I can create with Actors that will accomplish the same thing? Is there a way to have N number of Actors, and when a new message is ready, pass it off to an Actor to handle it - and if all Actors are busy, just queue the message?
Akka Framework is designed to solve this kind of problems, and is exactly what you're looking for.
Look thru this docu - there're lots of highly configurable dispathers (event-based, thread-based, load-balanced, work-stealing, etc.) that manage actors mailboxes, and allow them to work in conjunction. You may also find interesting this blog post.
E.g. this code instantiates new Work Stealing Dispatcher based on the fixed thread pool, that fulfils load balancing among the actors it supervises:
val workStealingDispatcher = Dispatchers.newExecutorBasedEventDrivenWorkStealingDispatcher("pooled-dispatcher")
workStealingDispatcher
.withNewThreadPoolWithLinkedBlockingQueueWithUnboundedCapacity
.setCorePoolSize(16)
.buildThreadPool
Actor that uses the dispatcher:
class MyActor extends Actor {
messageDispatcher = workStealingDispatcher
def receive = {
case _ =>
}
}
Now, if you start 2+ instances of the actor, dispatcher will balance the load between the mailboxes (queues) of the actors (actor that has too much messages in the mailbox will "donate" some to the actors that has nothing to do).
Well, you have to see about the actors scheduler, as actors are not usually 1-to-1 with threads. The idea behind actors is that you may have many of them, but the actual number of threads will be limited to something reasonable. They are not supposed to be long running either, but rather quickly answering to messages they receive. In short, the architecture of that code seems to be wholly at odds with how one would design an actor system.
Still, each working actor may send a message to a Queue actor asking for the next task, and then loop back to react. This Queue actor would receive either queueing messages, or dequeuing messages. It could be designed like this:
val q: Queue[AnyRef] = new Queue[AnyRef]
loop {
react {
case Enqueue(d) => q enqueue d
case Dequeue(a) if q.nonEmpty => a ! (q dequeue)
}
}