Using a library that is not thread-safe inside of a controller - scala

I'm not sure how threading works inside Play. From what I understand, Netty uses a single thread, but I'm not sure how this translates to how controller actions are called.
class SomeController extends Controller {
  // A single PegDownProcessor instance shared by every request this controller handles
  val processor = new PegDownProcessor()

  def index = Action { request =>
    val result = processor.doSomething()
    Ok("hello")
  }
}
The pegdown documentation says that instantiating a PegDownProcessor can take hundreds of milliseconds, and suggests using a single instance per application:
Note that the first time you create a PegDownProcessor it can take up
to a few hundred milliseconds to prepare the underlying parboiled
parser instance. However, once the first processor has been built all
further instantiations will be fast. Also, you can reuse an existing
PegDownProcessor instance as often as you want, as long as you prevent
concurrent accesses, since neither the PegDownProcessor nor the
underlying parser is thread-safe.
https://github.com/sirthias/pegdown
It also says that it isn't thread-safe.
Is the above usage designed correctly where I use a single instance as a val inside of a controller, and actually use it inside of a controller action?
Please explain whether it is correct (i.e. thread-safe) and, if not, why.

Play actions can be called from multiple threads.
A quick solution that popped into my head:
You could create a pool of processors. The pool would be thread-safe and would contain a given number of processors (you could set the number dynamically or based on the CPU/RAM you have). When a request comes in, the pool puts it in a FIFO queue (of course you should use a thread-safe queue implementation). Each processor operates on its own thread; when one finishes a job, it checks the queue for a new job. The enqueue method of the pool returns a Future which is resolved when the task is processed. Play supports async results for controller methods, so this would fit nicely with Play as well.
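As a rough illustration of the pool idea, here is a minimal sketch using only the standard library. `UnsafeProcessor` is a hypothetical stand-in for a non-thread-safe object such as PegDownProcessor, and `ProcessorPool`, `enqueue` and `PoolDemo` are made-up names. A fixed thread pool supplies the FIFO queue, and a ThreadLocal gives each worker thread its own processor, so no instance is ever shared between threads.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical stand-in for a non-thread-safe library object.
final class UnsafeProcessor {
  private val buffer = new StringBuilder // mutable state, unsafe to share across threads
  def process(input: String): String = {
    buffer.clear()
    buffer.append("<p>").append(input).append("</p>")
    buffer.toString
  }
}

final class ProcessorPool(size: Int) {
  // The fixed pool queues submitted jobs in FIFO order and runs them on `size` threads.
  private val executor = Executors.newFixedThreadPool(size)
  private val ec = ExecutionContext.fromExecutorService(executor)

  // Each worker thread lazily creates and then reuses its own processor.
  private val local = new ThreadLocal[UnsafeProcessor] {
    override def initialValue(): UnsafeProcessor = new UnsafeProcessor
  }

  // Returns a Future that completes once a worker has processed the input.
  def enqueue(input: String): Future[String] =
    Future(local.get().process(input))(ec)

  def shutdown(): Unit = executor.shutdown()
}

object PoolDemo {
  def run(): Seq[String] = {
    val pool = new ProcessorPool(4)
    try {
      val futures = (1 to 100).map(i => pool.enqueue(s"item $i"))
      futures.map(f => Await.result(f, 10.seconds))
    } finally pool.shutdown()
  }
}
```

In a Play action you would return the Future from `enqueue` through an async action rather than awaiting it; the `Await` here only keeps the demo self-contained.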
A similar solution is to use Akka and its actor pool feature that basically implements the above approach in a more generic way. Since actors are single-threaded, each actor would have a single reference to a processor and would simply do the same as you would do on a single thread. Akka allows for advanced options, such as defining the scheduling method, and also fits nicely in the Play stack. Akka has almost no overhead itself, and you can create thousands of actors without any performance issues.

Related

Should Akka Actors do real processing tasks?

I'm writing an application that reads relatively large text files, validates and transforms the data (every line in a text file is its own item; there are around 100M items per file) and creates some kind of output. A multithreaded Java application already exists (using a BlockingQueue between the reading, processing and persisting tasks), but I want to implement a Scala application that does the same thing.
Akka seems to be a very popular choice for building concurrent applications. Unfortunately, due to the asynchronous nature of actors, I still don't understand what a single actor can or can't do, e.g. if I can use actors as traditional workers that do some sort of calculation.
Several pieces of documentation say that actors should never block, and I understand why. But the examples given for blocking code only ever mention things like blocking file or network IO, i.e. things that make the actor wait for a short period of time, which is of course a bad thing.
But what if the actor is "blocking" because it is actually doing something useful instead of waiting? In my case, the processing and transformation of a single line/item of text takes 80ms, which is quite a long time (pure processing, no IO involved). Can this work be done by an actor directly, or should I use a Future instead (but then, if I have to use Futures anyway, why use Akka in the first place)?
The Akka docs and examples show that work can be done directly by actors. But it seems that the authors only do very simplistic work (such as calling filter on a String or incrementing a counter, and that's it). I don't know whether they do this to keep the docs simple and concise or because you really should not do more than that within an actor.
How would you design an Akka-based application for my use case (reading a text file, processing every line, which takes quite some time, and eventually persisting the result)? Or is this the kind of problem that is not suited to Akka?
It all depends on the type of an actor.
I use this rule of thumb: if you don't need to talk to this actor and this actor does not have any other responsibilities, then it's ok to block in it doing actual work. You can treat it as a Future and this is what I would call a "worker".
If you block in an actor that is not a leaf node (worker), e.g. a work distributor, then the whole system will slow down.
There are a few patterns that involve work pulling/pushing or actor per request model. Either of those could be a fit for your application. You can have a manager that creates an actor for each piece of work and when the work is finished actor sends result back to manager and dies. You can also keep an actor alive and ask for more work from that actor. You can also combine actors and Futures.
Sometimes you want to be able to talk to a worker if your processing is more complex and involves multiple stages. In that case a worker can delegate work yet to another actor or to a future.
To sum up: don't block in manager/work distribution actors. It's OK to block in workers if that does not slow your system down.
Disclaimer: by blocking I mean doing actual work, not just busy waiting, which is never OK.
Doing computations that take 100ms is fine in an actor. However, you need to make sure to properly deal with backpressure. One way would be to use the work-pulling pattern, where your CPU bound actors request new work whenever they are ready instead of receiving new work items in a message.
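To make the backpressure point concrete, here is a small sketch that uses plain threads and a bounded queue rather than Akka actors, so it is an analogy to the work-pulling pattern and not the Akka implementation. The names `WorkPulling` and `run` and the queue capacity of 16 are assumptions chosen for illustration. Workers take a new item only when they are idle, and the producer blocks whenever the queue is full.

```scala
import java.util.concurrent.{ArrayBlockingQueue, ConcurrentLinkedQueue}

object WorkPulling {
  def run(lines: Seq[String], workers: Int)(transform: String => String): Set[String] = {
    // Small bound: `put` blocks when the workers fall behind (backpressure).
    val queue = new ArrayBlockingQueue[Option[String]](16)
    val results = new ConcurrentLinkedQueue[String]()

    val threads = (1 to workers).map { _ =>
      val t = new Thread(() => {
        var next = queue.take() // pull the next item only when this worker is free
        while (next.isDefined) {
          results.add(transform(next.get))
          next = queue.take()
        }
      })
      t.start()
      t
    }

    lines.foreach(line => queue.put(Some(line)))
    (1 to workers).foreach(_ => queue.put(None)) // one end-of-input marker per worker
    threads.foreach(_.join())

    // Completion order is not deterministic, so the results are returned as a Set.
    val collected = Set.newBuilder[String]
    val it = results.iterator()
    while (it.hasNext) collected += it.next()
    collected.result()
  }
}
```

With actors, the same effect is usually achieved by having each worker send a "give me work" message to the distributor when it becomes idle, instead of the distributor pushing items into worker mailboxes.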
That said, your problem description sounds like a processing pipeline that might benefit from a higher-level abstraction such as Akka Streams. Basically, produce a stream of file names to be processed and then use transformations such as map to get the desired result. I have something like this in production that sounds pretty similar to your problem description, and it works very well provided the data used by the individual processing chunks is not too large.
Of course, a stream will also be materialized to a number of actors. But the high level interface will be more type-safe and easier to reason about.

Play's execution contexts vs scala global

How does the execution context from
import scala.concurrent.ExecutionContext.Implicits.global
differ from Play's execution contexts:
import play.core.Execution.Implicits.{internalContext, defaultContext}
They are very different.
In Play 2.3.x and prior, play.core.Execution.Implicits.internalContext is a ForkJoinPool with fixed constraints on size, used internally by Play. You should never use it for your application code. From the docs:
Play Internal Thread Pool - This is used internally by Play. No application code should ever be executed by a thread in this thread pool, and no blocking should ever be done in this thread pool. Its size can be configured by setting internal-threadpool-size in application.conf, and it defaults to the number of available processors.
Instead, you would use play.api.libs.concurrent.Execution.Implicits.defaultContext, which uses an ActorSystem.
In 2.4.x, they both use the same ActorSystem. This means that Akka will distribute work among its own pool of threads, but in a way that is invisible to you (other than configuration). Several Akka actors can share the same thread.
scala.concurrent.ExecutionContext.Implicits.global is an ExecutionContext defined in the Scala standard library. It is a special ForkJoinPool that uses the blocking method to mark potentially blocking code, so that the pool can spawn additional threads. You really shouldn't use this in a Play application, as Play will have no control over it. It also has the potential to spawn a lot of threads and use a ton of memory if you're not careful.
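For reference, this is roughly what the blocking hint looks like in a standalone program (not a Play application). `BlockingDemo` and the 200 ms sleep are illustrative assumptions. Wrapping the sleep in `blocking { ... }` tells the global pool that the thread is about to block, which lets it start extra threads to compensate.

```scala
import scala.concurrent.{Await, Future, blocking}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object BlockingDemo {
  def run(tasks: Int): Int = {
    val futures = (1 to tasks).map { i =>
      Future {
        // Marks this section as blocking so the global pool may add threads.
        blocking { Thread.sleep(200) }
        i
      }
    }
    Await.result(Future.sequence(futures), 30.seconds).sum
  }
}
```

This compensation is also why the thread count can grow quickly: every concurrently blocked task may cost one additional thread.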
I've written more about scala.concurrent.ExecutionContext.Implicits.global in this answer.
They are the same and point to the default dispatcher of the underlying actor system in your Play or Akka or combined application.
## Default Play's Context
play.api.libs.concurrent.Execution.Implicits.defaultContext
## Play's Internal Context
play.core.Execution.Implicits.internalContext
## Guice's EC Injected
class ClassA @Inject()(config: Configuration)
                      (implicit ec: ExecutionContext) {
  ...
}
But this is different:
scala.concurrent.ExecutionContext.Implicits.global
Also DB drivers, e.g. if you use slick, may come up with their own Execution Context. Anyway,
Best Practices:
- Don’t use scala.concurrent.ExecutionContext.Implicits.global when you are using the Play or Akka framework; under high load you may end up using more threads than is optimal, so performance may decrease.
- Don’t be afraid to use the default dispatcher as much as you want, everywhere, unless you do some blocking task, for example listening on a network connection or reading from a database explicitly, that makes your current thread wait for the result.
- Start with the default executor, and if you find Play / Akka not responding well under high load, switch to a new thread pool for time-consuming computation tasks.
- Computational tasks that take a long time are not usually considered blocking, for example traversing an auto-completion tree in memory. But you may consider them blocking when you want your control structures to keep functioning while a time-consuming computational task is running.
- The bad thing that may happen when you treat computational tasks as non-blocking is that the Play and Akka message dispatcher will be paused when all threads are busy computing under heavy load. The pro of a separate dispatcher is that the queue processor doesn’t starve. The con of a separate dispatcher is that you may allocate more threads than is optimal, and your overall performance will decrease.
- The difference matters for high-load servers; don’t worry for small projects, use the default.
- Use scala.concurrent.ExecutionContext.Implicits.global when you have no other executor running in your app. Don’t worry, this is safe then.
- When you create Futures, use the default pool; this is the safest way unless you are sure that the future is blocking. In that case use a separate pool, or use the blocking {} construct if possible.
- Create a separate thread pool when:
  - You Await a future
  - You call Thread.sleep
  - You are reading from a stream/socket/HTTP call
  - You are manually querying a database with a blocking driver (usually Slick is safe)
  - You schedule a task to be run in 10 seconds
  - You schedule a task to be run every second
- For map/recover operations on a future, use the default executor; usually this is safe.
- Exception handling is safe with the default dispatcher.
- Always use Akka dispatchers when you are in Play or Akka; Akka has a nice way to define a new dispatcher in application.conf.
PRELUDE: This question is from 6 years ago, and many things have changed since then. I know this is not an answer to the original question, but I was stuck for more than a day on the same confusion that the original question describes, so I decided to share my research results with the community.
The latest update regarding ExecutionContext, which fully applies to Play 2.8.15, is as follows. The Play 2.6 migration guide states:
The play.api.libs.concurrent.Execution class has been deprecated, as it was using global mutable state under the hood to pull the “current” application’s ExecutionContext.
If you want to specify the implicit behavior that you had previously, then you should pass in the execution context implicitly in the constructor.
So you cannot use play.api.libs.concurrent.Execution.Implicits.defaultContext anymore. The no-configuration, out-of-the-box practice is to provide an implicit value of type scala.concurrent.ExecutionContext for the controller, something like:
import scala.concurrent.ExecutionContext
@Singleton
class AsyncController @Inject()(cc: ControllerComponents, actorSystem: ActorSystem)(implicit exec: ExecutionContext) extends AbstractController(cc)
This means that none of the above answers hold for current Play versions, and the question itself is no longer relevant, since play.core.Execution.Implicits.defaultContext is not available anymore.

How to tune Play Framework application with proper threadpools?

I am working with Play Framework (Scala) version 2.3. From the docs:
You can’t magically turn synchronous IO into asynchronous by wrapping it in a Future. If you can’t change the application’s architecture to avoid blocking operations, at some point that operation will have to be executed, and that thread is going to block. So in addition to enclosing the operation in a Future, it’s necessary to configure it to run in a separate execution context that has been configured with enough threads to deal with the expected concurrency.
This has me a bit confused on how to tune my webapp. Specifically, since my app has a good amount of blocking calls: a mix of JDBC calls, and calls to 3rd party services using blocking SDKs, what is the strategy for configuring the execution context and determining the number of threads to provide? Do I need a separate execution context? Why can't I simply configure the default pool to have a sufficient amount of threads (and if I do this, why would I still need to wrap the calls in a Future?)?
I know this ultimately will depend on the specifics of my app, but I'm looking for some guidance on the strategy and approach. The play docs preach the use of non-blocking operations everywhere but in reality the typical web-app hitting a sql database has many blocking calls, and I got the impression from reading the docs that this type of app will perform far from optimally with the default configurations.
[...] what is the strategy for configuring the execution context and
determining the number of threads to provide
Well, that's the tricky part which depends on your individual requirements.
First of all, you probably should choose a basic profile from the docs (pure asynchronous, highly synchronous or many specific thread pools)
The second step is to fine-tune your setup by profiling and benchmarking your application
Do I need a separate execution context?
Not necessarily. But it makes sense to use separate execution contexts if you want to trigger all your blocking IO-calls at once and not in a sequential way (so database call B does not have to wait until database call A is finished).
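A minimal sketch of "triggering the blocking calls at once", assuming two hypothetical blocking lookups `queryA` and `queryB` that stand in for JDBC or SDK calls. Both Futures are created before they are combined, so they run concurrently on the dedicated pool instead of one after the other.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object ParallelCalls {
  // Hypothetical blocking calls; the sleeps simulate waiting on a database.
  def queryA(): Int = { Thread.sleep(300); 1 }
  def queryB(): Int = { Thread.sleep(300); 2 }

  def run(): Int = {
    val pool = Executors.newFixedThreadPool(2)
    // Dedicated execution context for the blocking calls.
    implicit val blockingEc: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      val a = Future(queryA()) // starts immediately
      val b = Future(queryB()) // starts immediately, does not wait for a
      val sum = for (x <- a; y <- b) yield x + y
      Await.result(sum, 5.seconds)
    } finally pool.shutdown()
  }
}
```

If the Futures were created inside the for-comprehension instead, `queryB` would only start after `queryA` had finished.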
Why can't I simply configure the default pool to have a sufficient
amount of threads (and if I do this, why would I still need to wrap
the calls in a Future?)?
You can, check the docs:
play {
  akka {
    akka.loggers = ["akka.event.slf4j.Slf4jLogger"]
    loglevel = WARNING
    actor {
      default-dispatcher = {
        fork-join-executor {
          parallelism-min = 300
          parallelism-max = 300
        }
      }
    }
  }
}
With this approach, you are basically turning Play into a one-thread-per-request model. This is not the idea behind Play, but if you're doing a lot of blocking IO calls, it's the simplest approach. In this case, you don't need to wrap your database calls in a Future.
To put it in a nutshell, you basically have three ways to go:
1. Only use (IO) technologies whose API calls are non-blocking and asynchronous. This allows you to use a small thread pool / default execution context, which suits the nature of Play.
2. Turn Play into a one-thread-per-request framework by drastically increasing the size of the default execution context. No futures needed, just call your blocking database as always.
3. Create specific execution contexts for your blocking IO calls and gain fine-grained control over what you are doing.
Firstly, before diving in and refactoring your app, you should determine whether this is actually a problem for you. Run some benchmarks (gatling is superb) and do a few profiles with something like JProfiler. If you can live with the current performance then happy days.
The ideal is to use a reactive driver which would return you a future that then gets passed all the way back to your controller. Unfortunately async is still an Open ticket for slick. Interacting with REST APIs can be made reactive using the PlayWS library, but if you have to go via a library that your 3rd party provides then you're stuck.
So, assuming that none of these are feasible and that you do need to improve performance, the question is what benefit would Play's suggestion have? I think what they're getting at here is that it's useful to partition your threads into those that block and those that can make use of asynchronous techniques.
If, for instance, only some proportion of your requests are long and blocking then with a single thread pool you risk all threads being used for the blocking operations. Your controller would then not be able to handle any new requests, irrespective of whether that request needs to call a blocking service. If you can allocate enough threads that this never happens then no problem.
If, on the other hand, you are hitting your limit for threads then by using two pools you can keep your fast, non-blocking requests snappy. You would have one pool servicing requests in your controller and calling into services which return futures. Some of these futures would actually be performing work using a separate pool of threads, but only for the blocking operations. If there is any portion of your app which could be made reactive, then your controller could take advantage of this while isolating the controller from the blocking operations.
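The two-pool idea can be sketched outside of Play with plain executors. The pool sizes, the sleep duration and the names `TwoPools`, `fastEc` and `slowEc` are assumptions for illustration. The blocking pool is deliberately saturated, yet the fast task still completes promptly because it runs on its own pool.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object TwoPools {
  def run(): (String, Int) = {
    val fastPool = Executors.newFixedThreadPool(2)
    val slowPool = Executors.newFixedThreadPool(2)
    val fastEc = ExecutionContext.fromExecutorService(fastPool)
    val slowEc = ExecutionContext.fromExecutorService(slowPool)
    try {
      // Six blocking tasks on two threads: the slow pool is saturated and queues work.
      val slow = (1 to 6).map(i => Future { Thread.sleep(200); i }(slowEc))
      // The fast task is not queued behind the blocking ones.
      val fast = Future("pong")(fastEc)
      val pong = Await.result(fast, 1.second)
      val total = slow.map(f => Await.result(f, 5.seconds)).sum
      (pong, total)
    } finally {
      fastPool.shutdown()
      slowPool.shutdown()
    }
  }
}
```

Had both kinds of task shared one two-thread pool, the fast task would have waited behind the sleeping ones.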

How to handle concurrent access to a Scala collection?

I have an Actor that - in its very essence - maintains a list of objects. It has three basic operations, an add, update and a remove (where sometimes the remove is called from the add method, but that aside), and works with a single collection. Obviously, that backing list is accessed concurrently, with add and remove calls interleaving each other constantly.
My first version used a ListBuffer, but I read somewhere it's not meant for concurrent access. I haven't gotten concurrent access exceptions, but I did note that finding & removing objects from it does not always work, possibly due to concurrency.
I was halfway rewriting it to use a var List, but removing items from Scala's default immutable List is a bit of a pain - and I doubt it's suitable for concurrent access.
So, basic question: What collection type should I use in a concurrent access situation, and how is it used?
(Perhaps secondary: Is an Actor actually a multithreaded entity, or is that just my wrong conception and does it process messages one at a time in a single thread?)
(Tertiary: In Scala, what collection type is best for inserts and random access (delete / update)?)
Edit: To the kind responders: Excuse my late reply, I'm making a nasty habit out of dumping a question on SO or mailing lists, then moving on to the next problem, forgetting the original one for the moment.
Take a look at the scala.collection.mutable.Synchronized* traits/classes.
The idea is that you mixin the Synchronized traits into regular mutable collections to get synchronized versions of them.
For example:
import scala.collection.mutable._
val syncSet = new HashSet[Int] with SynchronizedSet[Int]
val syncArray = new ArrayBuffer[Int] with SynchronizedBuffer[Int]
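Note that the Synchronized* traits were deprecated in Scala 2.11 and are no longer available in 2.13. On newer versions, one alternative (swapped in here, not part of the suggestion above) is scala.collection.concurrent.TrieMap from the standard library. The sketch below assumes the items can be keyed by an Int id; `ConcurrentRegistry` is a made-up name.

```scala
import scala.collection.concurrent.TrieMap

object ConcurrentRegistry {
  def run(threads: Int, perThread: Int): Int = {
    // TrieMap is safe for concurrent reads and writes without external locking.
    val items = TrieMap.empty[Int, String]
    val workers = (0 until threads).map { t =>
      new Thread(() => {
        for (i <- 0 until perThread) {
          val key = t * perThread + i
          items.put(key, s"item-$key")
          if (i % 2 == 1) items.remove(key) // interleave removals with additions
        }
      })
    }
    workers.foreach(_.start())
    workers.foreach(_.join())
    items.size
  }
}
```

Each thread adds `perThread` entries and removes every second one, so half of the entries remain once all threads have finished.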
You don't need to synchronize the state of the actors. The aim of the actors is to avoid tricky, error prone and hard to debug concurrent programming.
The actor model ensures that an actor consumes its messages one by one, and that you will never have two threads consuming messages for the same actor at the same time.
Scala's immutable collections are suitable for concurrent usage.
As for actors, a couple of things are guaranteed, as explained in the Akka documentation:
the actor send rule: where the send of the message to an actor happens before the receive of the same actor.
the actor subsequent processing rule: where processing of one message happens before processing of the next message by the same actor.
You are not guaranteed that the same thread processes the next message, but you are guaranteed that the current message will finish processing before the next one starts, and also that at any given time, only one thread is executing the receive method.
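To see why one-message-at-a-time processing removes the need for a synchronized collection, here is a toy stand-in for an actor built on a single-threaded executor. This is not Akka, and `ListKeeper`, `snapshot` and `ListKeeperDemo` are hypothetical names. The ListBuffer is only ever touched by the mailbox thread, so several senders can add and remove concurrently without locks.

```scala
import java.util.concurrent.Executors
import scala.collection.mutable.ListBuffer
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._

// Not Akka: a single-threaded executor plays the role of the actor's mailbox.
final class ListKeeper {
  private val mailbox = Executors.newSingleThreadExecutor()
  private val items = ListBuffer.empty[Int] // accessed only from the mailbox thread

  def add(x: Int): Unit = mailbox.execute(() => { items += x; () })
  def remove(x: Int): Unit = mailbox.execute(() => { items -= x; () })

  // "Ask" style read: the reply is produced on the mailbox thread, after earlier messages.
  def snapshot(): List[Int] = {
    val reply = Promise[List[Int]]()
    mailbox.execute(() => { reply.success(items.toList); () })
    Await.result(reply.future, 10.seconds)
  }

  def stop(): Unit = mailbox.shutdown()
}

object ListKeeperDemo {
  def run(): Int = {
    val keeper = new ListKeeper
    val senders = (0 until 4).map { t =>
      new Thread(() => {
        for (i <- 0 until 500) {
          val v = t * 500 + i
          keeper.add(v)
          if (i % 2 == 1) keeper.remove(v) // remove every second value again
        }
      })
    }
    senders.foreach(_.start())
    senders.foreach(_.join())
    val size = keeper.snapshot().size
    keeper.stop()
    size
  }
}
```

An Akka actor gives the same guarantee for state that is only read and written inside receive, with the difference that successive messages may be processed on different threads.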
So that takes care of a given Actor's persistent state. With regard to shared data, the best approach as I understand it is to use immutable data structures and lean on the Actor model as much as possible. That is, "do not communicate by sharing memory; share memory by communicating."
What collection type should I use in a concurrent access situation, and how is it used?
See @hbatista's answer.
Is an Actor actually a multithreaded entity, or is that just my wrong conception and does it process messages one at a time in a single thread
The second (though the thread on which messages are processed may change, so don't store anything in thread-local data). That's how the actor can maintain invariants on its state.

Actors (scala/akka): is it implied that the receive method will be accessed in a threadsafe manner?

I assume that the messages will be received and processed in a threadsafe manner. However, I have been reading (some) akka/scala docs but I didn't encounter the keyword 'threadsafe' yet.
It is probably because the actor model assumes that each actor instance processes its own mailbox sequentially. That means it should never happen, that two or more concurrent threads execute single actor instance's code. Technically you could create a method in an actor's class (because it is still an object) and call it from multiple threads concurrently, but this would be a major departure from the actor's usage rules and you would do it "at your own risk", because then you would lose all thread-safety guarantees of that model.
This is also one of the reasons, why Akka introduced a concept of ActorRef - a handle, that lets you communicate with the actor through message passing, but not by calling its methods directly.
I think we have it pretty well documented: http://doc.akka.io/docs/akka/2.3.9/general/jmm.html
Actors are 'thread-safe'. The actor system (Akka) provides each actor with its own 'lightweight thread'. This is not a real thread, but Akka gives the developer the impression that an actor is always running in its own thread. This means that any operations performed as a result of acting on a message are, for all practical purposes, thread-safe.
However, you should not undermine Akka by using mutable messages or public state. If you develop your actors to be standalone units of functionality, then they will be thread-safe.
See also:
http://doc.akka.io/docs/akka/2.3.12/general/actors.html#State
and
http://doc.akka.io/docs/akka/2.3.12/general/jmm.html for a more in-depth study of the Akka memory model and how it manages 'thread' issues.