Efficient processing of custom data by actors - scala

I am an Akka newbie trying things out for a particular problem. I am trying to write code for an actor system which would efficiently process custom data coming from multiple clients in the form of events. By custom data I mean that the content and structure of the data would vary between events from the same client (e.g., we might have instrumented to drop 5 events containing 5 different pieces of information for the same client), and between events from different clients (e.g., we might be capturing a completely different set of information from one client vs. another). I am wondering what would be a good way to use actor-based processing for this type of scenario.
These are the alternatives I have thought of so far:
(A) I will write an actor which loads a client-specific processor class through reflection, based on the client whose event is being processed. The client-specific processor class would contain logic corresponding to all the types of events that would be received for that client. I will create 'n' instances of this actor. (A rough sketch of what I mean appears below, after option (B).)
context.actorOf(Props[CustomEventProcessor].withRouter(RoundRobinPool(nrOfInstances = 100)), name = "CustomProcessor")
(B) I will write an actor for each client, each containing logic corresponding to all the types of events that would be received for that client. I will create 'n' instances of each of these actors.
context.actorOf(Props[ClientXEventProcessor].withRouter(RoundRobinPool(nrOfInstances = 50)), name = "ClientXCustomProcessor")
context.actorOf(Props[ClientYEventProcessor].withRouter(RoundRobinPool(nrOfInstances = 50)), name = "ClientYCustomProcessor")
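A rough sketch of what option (A) might look like; the EventProcessor trait, the "processors" package, and the clientId-to-class naming convention are assumptions for illustration only:

import akka.actor.Actor

// Assumed common interface implemented by every client-specific processor
trait EventProcessor {
  def process(event: Any): Unit
}

class CustomEventProcessor extends Actor {
  // Cache resolved processors so reflection runs once per client, not once per event
  private var processors = Map.empty[String, EventProcessor]

  def receive = {
    case (clientId: String, event: Any) =>
      val processor = processors.getOrElse(clientId, {
        // e.g. clientId "X" resolves to a hypothetical class processors.ClientXProcessor
        val p = Class.forName(s"processors.Client${clientId}Processor")
          .getDeclaredConstructor().newInstance().asInstanceOf[EventProcessor]
        processors += clientId -> p
        p
      })
      processor.process(event)
  }
}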
At this point, I have a few questions:
Would [A] be slower compared to [B] because [A] is using reflection? I am assuming that once an actor instance has finished processing a particular event, it dies, so the next actor instance processing an event from the same client would have to start by loading the processor class again. Is this assumption correct?
Given a specific event flow pattern, would a system based on [B] have a heavier runtime memory footprint compared to [A], because each per-client actor can now have multiple instances of it in memory?
Any other way to approach this problem?
Thanks for any pointers.

Well,
It could be a bit slower, but I think not noticeably so. And no, you don't have to kill actors between events.
No, because a single actor takes something like 400 bytes of memory, so you could even create one actor per event, not just one actor per client.
Yes, via Reactive Streams, which I think is a somewhat cleaner solution than actors, but Akka Streams is still experimental, and it may be a bit harder to learn than actors. You do get backpressure for free if it's needed, though.
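For instance, a minimal Akka Streams sketch of such a pipeline (assuming Akka 2.6+, where an implicit ActorSystem is enough to materialize the stream; processEvent is a hypothetical stand-in for the per-client logic):

import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future

object EventStream extends App {
  implicit val system: ActorSystem = ActorSystem("events")
  import system.dispatcher

  def processEvent(e: String): String = e.toUpperCase // stand-in for real per-event work

  Source(List("event-1", "event-2", "event-3"))
    .mapAsync(parallelism = 4)(e => Future(processEvent(e))) // bounded parallelism, backpressure built in
    .runWith(Sink.foreach(println))
    .onComplete(_ => system.terminate())
}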

Related

Should Akka Actors do real processing tasks?

I'm writing an application that reads relatively large text files, validates and transforms the data (each line in a text file is its own item; there are around 100M items/file) and creates some kind of output. There already exists a multithreaded Java application (using a BlockingQueue between the Reading/Processing/Persisting tasks), but I want to implement a Scala application that does the same thing.
Akka seems to be a very popular choice for building concurrent applications. Unfortunately, due to the asynchronous nature of actors, I still don't understand what a single actor can or can't do, e.g. whether I can use actors as traditional workers that do some sort of calculation.
Several pieces of documentation say that Actors should never block, and I understand why. But the given examples of blocking code only ever mention things such as blocking file/network IO: things that leave the actor waiting for a short period of time, which is of course a bad thing.
But what if the actor is "blocking" because it actually does something useful instead of waiting? In my case, the processing and transformation of a single line/item of text takes 80ms, which is quite a long time (pure processing, no IO involved). Can this work be done by an actor directly, or should I use a Future instead (but then, if I have to use Futures anyway, why use Akka in the first place)?
The Akka docs and examples show that work can be done directly by actors. But it seems that the authors only do very simplistic work (such as calling filter on a String or incrementing a counter, and that's it). I don't know if they do this to keep the docs simple and concise, or because you really should not do more than that within an actor.
How would you design an Akka-based application for my use case (reading text file, processing every line which takes quite some time, eventually persisting the result)? Or is this some kind of problem that does not suit to Akka?
It all depends on the type of an actor.
I use this rule of thumb: if you don't need to talk to this actor and this actor does not have any other responsibilities, then it's ok to block in it doing actual work. You can treat it as a Future and this is what I would call a "worker".
If you block in an actor that is not a leaf node (worker), i.e. a work distributor, then the whole system will slow down.
There are a few patterns that involve work pulling/pushing or an actor-per-request model. Any of those could be a fit for your application. You can have a manager that creates an actor for each piece of work; when the work is finished, the actor sends the result back to the manager and dies. You can also keep an actor alive and ask it for more work. You can also combine actors and Futures.
Sometimes you want to be able to talk to a worker if your processing is more complex and involves multiple stages. In that case a worker can delegate work yet to another actor or to a future.
To sum up: don't block in manager/work-distribution actors. It's OK to block in workers if that does not slow your system down.
Disclaimer: by blocking I mean doing actual work, not busy waiting, which is never OK.
Doing computations that take 100ms is fine in an actor. However, you need to make sure to deal properly with backpressure. One way would be to use the work-pulling pattern, where your CPU-bound actors request new work whenever they are ready, instead of receiving new work items in a message.
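A rough sketch of the work-pulling idea (message types and actor names are my own invention): idle workers ask the master for work, so a slow worker's mailbox never fills up:

import akka.actor.{Actor, ActorRef}

case object GiveMeWork
case class Work(line: String)
case class Result(value: String)

class Worker(master: ActorRef) extends Actor {
  master ! GiveMeWork // announce readiness on startup
  def receive = {
    case Work(line) =>
      sender() ! Result(line.trim.toUpperCase) // stand-in for the expensive per-line processing
      sender() ! GiveMeWork                    // pull the next item only when done
  }
}

class Master(lines: Iterator[String]) extends Actor {
  def receive = {
    case GiveMeWork if lines.hasNext => sender() ! Work(lines.next())
    case Result(value)               => () // persist or aggregate the result here
  }
}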
That said, your problem description sounds like a processing pipeline that might benefit from a higher-level abstraction such as Akka Streams. Basically, produce a stream of file names to be processed and then use transformations such as map to get the desired result. I have something like this in production that sounds pretty similar to your problem description, and it works very well provided the data used by the individual processing chunks is not too large.
Of course, a stream will also be materialized to a number of actors. But the high-level interface will be more type-safe and easier to reason about.
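As a hedged sketch of such a pipeline (the file paths, frame size, and transform function are placeholders): stream a file line by line, transform each line, and write the results out, with backpressure handled by the stream itself:

import java.nio.file.Paths
import akka.actor.ActorSystem
import akka.stream.scaladsl.{FileIO, Framing}
import akka.util.ByteString

object FilePipeline extends App {
  implicit val system: ActorSystem = ActorSystem("pipeline")
  import system.dispatcher

  def transform(line: String): String = line.reverse // placeholder for the expensive per-line step

  FileIO.fromPath(Paths.get("input.txt"))
    .via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 8192, allowTruncation = true))
    .map(_.utf8String)
    .map(transform)
    .map(s => ByteString(s + "\n"))
    .runWith(FileIO.toPath(Paths.get("output.txt")))
    .onComplete(_ => system.terminate())
}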

What is Event Driven Concurrency?

I am starting to learn Scala and functional programming. I was reading the book "Programming Scala: Tackle Multi-Core Complexity on the Java Virtual Machine". In the first chapter I came across the terms Event-Driven concurrency and the Actor model. Before I continue reading this book, I want to have an idea of what Event-Driven concurrency and the Actor Model are.
What is Event-Driven concurrency, and how is it related to Actor Model?
An Event-Driven programming model involves registering code to be run when a given event fires. For example, instead of calling a method that returns some data from a database:
val user = db.getUser(1)
println(user.name)
You could instead register a callback to be run when the data is ready:
db.getUser(1, u => println(u.name))
In the first example, no concurrency was happening; the current thread would block until db.getUser(1) returned data from the database. In the second example, db.getUser would return immediately and carry on executing the next code in the program. In parallel to this, the callback u => println(u.name) will be executed at some point in the future.
Some people prefer the second approach, as it doesn't leave memory-hungry threads needlessly sitting around waiting for slow I/O to return.
The Actor Model is an example of how Event-Driven concepts can be used to help the programmer easily write concurrent programs.
From a super high level, Actors are objects that define a series of Event-Driven message handlers that get fired when the Actor receives messages. In Akka, each instance of an Actor is single-threaded; however, when many of these Actors are put together they create a system with concurrency.
For example, Actor A could send messages to Actors B and C in parallel. B and C could fire messages back to Actor A, which would have message handlers to receive these messages and behave as desired.
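A toy sketch of that exchange (all names are mine): A fires messages at B and C without blocking, and handles their replies whenever they arrive:

import akka.actor.{Actor, ActorRef, ActorSystem, Props}

case object Ping
case object Pong

class Replier extends Actor {
  def receive = { case Ping => sender() ! Pong }
}

class ActorA(b: ActorRef, c: ActorRef) extends Actor {
  b ! Ping // both sends return immediately;
  c ! Ping // the replies arrive later as ordinary messages
  def receive = { case Pong => println(s"reply from ${sender().path.name}") }
}

object Demo extends App {
  val system = ActorSystem("demo")
  val b = system.actorOf(Props[Replier], "actorB")
  val c = system.actorOf(Props[Replier], "actorC")
  system.actorOf(Props(new ActorA(b, c)), "actorA")
}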
To learn more about the Actor model I would recommend reading the Akka documentation. It is really well written: http://doc.akka.io/docs/akka/2.1.4/
There is also lots of good documentation around the web about Event-Driven Concurrency that is much more detailed than what I've written here: http://berb.github.io/diploma-thesis/original/055_events.html
Theon's answer provides a good modern overview. I'd like to add some historical perspective.
Tony Hoare and Robin Milner each developed a mathematical algebra for analysing concurrent systems (Communicating Sequential Processes, CSP, and the Calculus of Communicating Systems, CCS, respectively). Both of these look like heavy mathematics to most of us, but the practical application is relatively straightforward. CSP led directly to the Occam programming language amongst others, with Go being the newest example. CCS led to the Pi calculus and the mobility of communicating channel ends, a feature that is part of Go and was added to Occam in the last decade or so.
CSP models concurrency purely by considering autonomous entities ('processes', very lightweight things like green threads) interacting simply by event exchange. The medium for passing events is channels. Processes may have to deal with several inputs or outputs, and they do this by selecting whichever event is ready first. The events usually carry data from the sender to the receiver.
A principal feature of the CSP model is that a pair of processes engage in communication only when both are ready; in practical terms this leads to what is usually called 'synchronous' communication. However, the actual implementations (Go, Occam, Akka) allow channels to be buffered (the normal state in Akka), so that the lock-step exchange of events is often actually decoupled instead.
So in summary, an event-driven CSP-based system is really a data-flow network of processes connected by channels.
Besides the CSP interpretation of event-driven, there have been others. An important example is the 'event-wheel' approach, once popular for modelling concurrent systems whilst actually having a single processing thread. Such systems handle events by putting them into a processing queue and dealing with them in due course, usually via a callback. Java Swing's event processing engine is a good example. There were others, e.g. for time-based simulation engines. One might think of the JavaScript / NodeJS model as fitting into this category as well.
So in summary, an event-wheel was a way to express concurrency but without parallelism.
The irony of this is that the two approaches I've described above are both called event-driven, but what they mean by event-driven is different in each case. In one case, hardware-like entities are wired together; in the other, almost all actions are executed by callbacks. The CSP approach claims to be scalable because it's fully composable; it's also naturally adept at parallel execution. If there are any reasons to favour one over the other, these are probably it.
To understand the answer to this you have to look at event concurrency from the OS layer up. First you start with threads, which are the smallest units of code that can be run by the OS and which ultimately deal with I/O, timing and other kinds of events.
The OS groups threads into a process in which they share the same memory, protection and security permissions. Above that layer you have user programs which typically make I/O requests that are handled by user libraries.
The I/O libraries handle these requests in one of two ways. Unix-like systems use a "reactor" model in which the library registers I/O handlers for all the different types of I/O and events in the system. These handlers are activated when I/O is ready on a specific device. Windows-like systems use an I/O completion model in which I/O requests are made and a callback is triggered when the request is complete.
Both of these models require a significant amount of overhead to manage overall program state if you use them directly. However, some programming tasks (web apps/services) lend themselves to a seemingly more direct implementation with an event model, yet you still need to manage all of that program state. In order to track program logic across dispatches of several related events, you have to manually track state and pass it around to the callbacks. This tracking structure is usually called a state context or baton. As you might imagine, passing batons around all over the place to numerous seemingly unrelated handlers makes for extremely hard-to-read, spaghetti-like code. It's also a pain to write and debug, especially when you're trying to handle the synchronization of various concurrent paths of execution. You start getting into Futures, and then the code becomes really difficult to read.
One well-known event processing library is called libuv. It's a portable event loop that integrates Unix's reactor model with Windows' completion model into a single model usually called a "proactor". It's the event loop that drives NodeJS.
Which brings us to communicating sequential processes.
https://en.wikipedia.org/wiki/Communicating_sequential_processes
Rather than writing asynchronous I/O dispatch and synchronization code using one or more concurrency models (and their often competing conventions), we flip the problem on its head. We use a "coroutine" which looks like normal sequential code.
A simple example is a coroutine that receives a single byte over an event channel from another coroutine that sends a single byte. This effectively synchronizes the I/O producer and consumer, because the writer/sender has to wait for a reader/receiver and vice versa. While either process is waiting, it explicitly yields execution to other processes. When a coroutine yields, its scoped program state is saved on a stack frame, thus saving you from the confusion of managing multi-layered baton state in an event loop.
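To make the rendezvous idea concrete, here is a small Scala illustration (not CSP itself, just analogous semantics) using a SynchronousQueue, where put and take each block until the other side is ready:

import java.util.concurrent.SynchronousQueue

object Rendezvous extends App {
  val channel = new SynchronousQueue[Byte]()

  val producer = new Thread(() => {
    channel.put(42.toByte) // blocks until the consumer is ready to take
    println("sent")
  })
  val consumer = new Thread(() => {
    val b = channel.take() // blocks until the producer has put something
    println(s"received $b")
  })
  producer.start(); consumer.start()
  producer.join(); consumer.join()
}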
Using applications built on these event channels we can construct arbitrary, reusable, concurrent logic and the algorithms no longer look like spaghetti code. In pure CSP systems if you write to a channel and there is no reader, you will be blocked. The channel endpoints are known via handles internally to the program.
Actor systems are different in a couple of ways. First, the endpoints are the actor threads, and they are named and known externally to the mainline program. The second difference is that sends and receives on these channels are buffered. In other words, if you send a message to an actor and there isn't one listening, or it's busy, you aren't blocked waiting until one reads from its input channel. Other differences exist, like the fact that one actor can publish to two different actors concurrently.
As you might guess Actor systems can easily be built from CSP systems. There are other details like waiting for specific event patterns and selecting from them, but that's the basics.
I hope that clarifies things a bit.
Other constructs can be built from these ideas. Various programming systems (Go, Erlang, etc) include CSP implementations within them. Operating systems like Inferno and Node9 use CSPs and Channels as the basis of their distributed computing model.
Go: https://en.wikipedia.org/wiki/Go_(programming_language)
Erlang: https://en.wikipedia.org/wiki/Erlang_(programming_language)
Inferno: https://en.wikipedia.org/wiki/Inferno_(operating_system)
Node9: https://github.com/jvburnes/node9

Handling application level concurrency - what are my options?

We have a thin web layer (Scalatra) that translates incoming HTTP requests into events (case classes) that are sent to a thread-bound event processing actor. Some of the events contain the id of an aggregate root that we need to mutate for various reasons. The total amount of application data is too big to fit in memory, so we need to retrieve the aggregate, by its id, from a data source before operating on it. Of course we don't want the event processing actor to block, so the idea is to spawn a new (event-based?) actor that loads the data, mutates it and stores it back into the data source. Ideally I would like to handle concurrency in the application instead of relying on the ACID capabilities of the data source. Basically I need serialized/transactional access to each aggregate.
Can this be achieved using actors?
What would be the best approach?
Keeping a ConcurrentHashMap inside the event processing actor containing actors keyed on aggregate root id?
Or do we have to involve STMs (ScalaSTM/Akka) or something similar?
You can represent your "aggregate root" as an actor. When you want to mutate the aggregate root, you can send a message to do so from your request-handling actor. You can also have an intermediary broker actor that forwards messages to the correct actor and manages a cache of aggregate root actors (by id), instantiating an actor representing the data on demand and stopping them as needed. STM will be needed if you have to coordinate a mutation across actors that represent data.
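A hedged sketch of that broker idea (message and actor names are mine): one child actor per aggregate id, created on demand, so each aggregate gets serialized access for free:

import akka.actor.{Actor, ActorRef, Props}

case class Mutate(aggregateId: String, change: String)

class AggregateRoot(id: String) extends Actor {
  def receive = {
    case Mutate(_, change) =>
      // load the aggregate from the data source, apply `change`, store it back;
      // no locking needed, since an actor handles one message at a time
      ()
  }
}

class Broker extends Actor {
  private var children = Map.empty[String, ActorRef]
  def receive = {
    case msg @ Mutate(id, _) =>
      val child = children.getOrElse(id, {
        val ref = context.actorOf(Props(new AggregateRoot(id)), s"aggregate-$id")
        children += id -> ref
        ref
      })
      child forward msg
  }
}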

How to handle concurrent access to a Scala collection?

I have an Actor that, in its very essence, maintains a list of objects. It has three basic operations: an add, an update and a remove (where sometimes the remove is called from the add method, but that aside), and it works with a single collection. Obviously, that backing list is accessed concurrently, with add and remove calls constantly interleaving each other.
My first version used a ListBuffer, but I read somewhere it's not meant for concurrent access. I haven't gotten concurrent access exceptions, but I did note that finding & removing objects from it does not always work, possibly due to concurrency.
I was halfway through rewriting it to use a var List, but removing items from Scala's default immutable List is a bit of a pain, and I doubt it's suitable for concurrent access.
So, basic question: What collection type should I use in a concurrent access situation, and how is it used?
(Perhaps secondary: is an Actor actually a multithreaded entity, or is that just a misconception of mine, and does it process messages one at a time in a single thread?)
(Tertiary: In Scala, what collection type is best for inserts and random access (delete / update)?)
Edit: To the kind responders: Excuse my late reply, I'm making a nasty habit out of dumping a question on SO or mailing lists, then moving on to the next problem, forgetting the original one for the moment.
Take a look at the scala.collection.mutable.Synchronized* traits/classes.
The idea is that you mix the Synchronized traits into regular mutable collections to get synchronized versions of them.
For example:
import scala.collection.mutable._
val syncSet = new HashSet[Int] with SynchronizedSet[Int]
val syncArray = new ArrayBuffer[Int] with SynchronizedBuffer[Int]
You don't need to synchronize the state of the actors. The aim of actors is to avoid tricky, error-prone and hard-to-debug concurrent programming.
The actor model ensures that an actor consumes messages one by one, and that you will never have two threads consuming messages for the same actor.
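Because of this, the ListBuffer from the question is fine as long as it lives entirely inside the actor and is only touched from receive. A minimal sketch, with the message names assumed:

import akka.actor.Actor
import scala.collection.mutable.ListBuffer

case class Add(item: String)
case class Remove(item: String)

class ListKeeper extends Actor {
  // Safe without synchronization: only this actor touches it,
  // and messages are processed one at a time
  private val items = ListBuffer.empty[String]

  def receive = {
    case Add(i)    => items += i
    case Remove(i) => items -= i
  }
}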
Scala's immutable collections are suitable for concurrent usage.
As for actors, a couple of things are guaranteed, as explained in the Akka documentation:
The actor send rule: the send of a message to an actor happens before the receive of that message by the same actor.
The actor subsequent processing rule: processing of one message happens before processing of the next message by the same actor.
You are not guaranteed that the same thread processes the next message, but you are guaranteed that the current message will finish processing before the next one starts, and also that at any given time, only one thread is executing the receive method.
So that takes care of a given Actor's persistent state. With regard to shared data, the best approach as I understand it is to use immutable data structures and lean on the Actor model as much as possible. That is, "do not communicate by sharing memory; share memory by communicating."
What collection type should I use in a concurrent access situation, and how is it used?
See #hbatista's answer.
Is an Actor actually a multithreaded entity, or is that just my wrong conception and does it process messages one at a time in a single thread?
The second (though the thread on which messages are processed may change, so don't store anything in thread-local data). That's how the actor can maintain invariants on its state.

Akka - How many instances of an actor should you create?

I'm new to the Akka framework and I'm building an HTTP server application on top of Netty + Akka.
My idea so far is to create an actor for each type of request. E.g. I would have an actor for a POST to /my-resource and another actor for a GET to /my-resource.
Where I'm confused is how I should go about actor creation. Should I:
Create a new actor for every request (by this I mean for every request should I do a TypedActor.newInstance() of the appropriate actor)? How expensive is it to create a new actor?
Create one instance of each actor on server start-up and use that actor instance for every request? I've read that an actor can only process one message at a time, so couldn't this be a bottleneck?
Do something else?
Thanks for any feedback.
Well, you create an Actor for each instance of mutable state that you want to manage.
In your case, that might be just one actor if my-resource is a single object and you want to treat each request serially - that easily ensures that you only return consistent states between modifications.
If (more likely) you manage multiple resources, one actor per resource instance is usually ideal unless you run into many thousands of resources. While you can also run per-request actors, you'll end up with a strange design if you don't think about the state those requests are accessing; e.g. if you just create one Actor per POST request, you'll find yourself worrying about how to keep them from concurrently modifying the same resource, which is a clear indication that you've defined your actors wrongly. (A sketch of the per-resource approach follows below.)
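As a sketch of "one actor per resource instance" (the names are mine): all updates for a given resource funnel through its actor, so they apply serially without any locks:

import akka.actor.{Actor, ActorRef}

case class Update(newValue: String)
case class Get(replyTo: ActorRef)

class ResourceActor(initial: String) extends Actor {
  private var state = initial
  def receive = {
    case Update(v)    => state = v       // mutations are applied one at a time
    case Get(replyTo) => replyTo ! state // reads always see a consistent value
  }
}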
I usually have fairly trivial request/reply actors whose main purpose is to abstract the communication with external systems. Their communication with the "instance" actors is then normally limited to one request/response pair to perform the actual action.
If you are using Akka, you can create an actor per request. Akka is extremely slim on resources and you can create literally millions of actors on a pretty ordinary JVM heap. Also, they will only consume cpu/stack/threads when they actually do something.
A year ago I made a comparison between the resource consumption of the thread-based and event-based standard actors. And Akka is even better than the event-based ones.
One of the big points of Akka in my opinion is that it allows you to design your system as "one actor per usage" where earlier actor systems often forced you to do "use only actors for shared services" due to resource overhead.
I would recommend that you go for option 1.
Options 1) and 2) both have their drawbacks. So instead, let's use option 3): Routing (Akka 2.0+).
A Router is an element which acts as a load balancer, routing requests to other Actors which will perform the task needed.
Akka provides different Router implementations with different logic for routing a message (for example SmallestMailboxPool or RoundRobinPool).
Every Router may have several children, and its task is to supervise their mailboxes to decide where to route each received message.
// This will create 5 instances of the actor ExampleActor,
// managed and supervised by a round-robin router
ActorRef roundRobinRouter = getContext().actorOf(
    Props.create(ExampleActor.class).withRouter(new RoundRobinRouter(5)), "router");
This procedure is well explained in this blog.
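For reference, a Scala sketch of the same thing against the more recent Akka routing API (ExampleActor is carried over from the Java snippet above, with a trivial assumed body):

import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.RoundRobinPool

class ExampleActor extends Actor {
  def receive = { case msg => println(s"handled by ${self.path.name}: $msg") }
}

object RouterDemo extends App {
  val system = ActorSystem("demo")
  // Five ExampleActor instances behind a round-robin pool router
  val router = system.actorOf(RoundRobinPool(5).props(Props[ExampleActor]), "router")
  (1 to 10).foreach(i => router ! s"request-$i")
}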
Regarding 1), it's quite a reasonable option, but whether it's suitable depends on the specifics of your request handling.
Regarding 2), yes, of course it could.
For many cases the best thing to do would be to just have one actor responding to every request (or perhaps one actor per type of request), but the only thing this actor does is to forward the task to another actor (or spawn a Future) which will actually do the job.
For scaling up serial request handling, add a master actor (supervisor) which in turn will delegate to the worker actors (children) in round-robin fashion.