Executing CPU-bound tasks with Scala actors? - scala

Suppose I have to ехеcute several CPU-bound tasks. If I have 4 CPUs, for example, I would probably create a fixed-size thread pool of 4-5 worker threads waiting on a queue and put the tasks in the queue. In Java I can use java.util.concurrent (maybe ThreadPoolExecutor) to implement this mechanism.
How would you implement it with Scala actors?

All actors are basically threads which are executed by a scheduler under the hood. The scheduler creates a thread pool to execute actors roughly bound to your number of cores. This means that you can just create an actor per task you need to execute and leave the rest to Scala:
for(i <- 1 to 20) {
actor {
print(i);
Thread.sleep(1000);
}
}
The disadvantage here is depending on the number of tasks, the cost of creating a thread for each task may be quite expensive since threads are not so cheap in Java.
A simple way to create a bounded pool of worker actors and then distribute the tasks to them via messaging would be something like:
import scala.actors.Actor._
val numWorkers = 4
val pool = (1 to numWorkers).map { i =>
actor {
loop {
react {
case x: String => println(x)
}
}
}
}
for(i <- 1 to 20) {
val r = (new util.Random).nextInt(numWorkers)
pool(r) ! "task "+i
}
The reason we want to create multiple actors is because a single actor processes only one message (i.e. task) at a time so to get parallelism for your tasks you need to create multiple.
A side note: the default scheduler becomes particularly important when it comes to I/O bound tasks, as you will definitely want to change the size of the thread pool in that case. Two good blog posts which go into details about this are: Explore the Scheduling of Scala Actors and Scala actors thread pool pitfall.
With that said, Akka is an Actor framework that provides tools for more advanced workflows with Actors, and it is what I would use in any real application. Here is a load balancing (rather than random) task executor:
import akka.actor.Actor
import Actor._
import akka.routing.{LoadBalancer, CyclicIterator}
class TaskHandler extends Actor {
def receive = {
case t: Task =>
// some computationally expensive thing
t.execute
case _ => println("default case is required in Akka...")
}
}
class TaskRouter(numWorkers: Int) extends Actor with LoadBalancer {
val workerPool = Vector.fill(numWorkers)(actorOf[TaskHandler].start())
val seq = new CyclicIterator(workerPool)
}
val router = actorOf(new TaskRouter(4)).start()
for(i <- 1 to 20) {
router ! Task(..)
}
You can have different types of Load Balancing (CyclicIterator is round-robin distribution), so you can check the docs here for more info.

Well, you usually don't. Part of the attraction of using actors is that they handle such details for you.
If, however, you insist on managing that, you'll need to override the protected scheduler method on your Actor class to return an appropriate IScheduler. See also the scala.actors.scheduler package, and the comments on the Actor trait concerning schedulers.

Related

Scala : Futures with map, flatMap for IO/CPU bound tasks

I am aware that in Java 8 it is not a good idea to start long-running tasks in stream framework's filter, map, etc. methods, as there is no way to tune the underlying fork-join pool and it can cause latency problems and starvings.
Now, my question would be, is there any problem like that with Scala? I tried to google it, but I guess I just can't put this question in a google-able sentence.
Let's say I have a list of objects and I want to save them into a database using a forEach, would that cause any issues? I guess this would not be an issue in Scala, as functional transformations are fundamental building blocks of the language, but anyway...
If you don't see any kind of I/O operations then using futures can be an overhead.
def add(x: Int, y: Int) = Future { x + y }
Executing purely CPU-bound operations in a Future constructor will make your logic slower to execute, not faster. Mapping and flatmapping over them, can increase add fuel to this problem.
In case you want to initialize a Future with a constant/simple calculation, you can use Future.successful().
But all blocking I/O, including SQL queries makes sence be wrapped inside a Future with blocking
E.g :
Future {
DB.withConnection { implicit connection =>
val query = SQL("select * from bar")
query()
}
}
Should be done as,
import scala.concurrent.blocking
Future {
blocking {
DB.withConnection { implicit connection =>
val query = SQL("select * from bar")
query()
}
}
This blocking notifies the thread pool that this task is blocking. This allows the pool to temporarily spawn new workers as needed. This is done to prevent starvation in blocking applications.
The thread pool(by default the scala.concurrent.ExecutionContext.global pool) knows when the code in a blocking is completed.(Since it's a fork join thread pool)
Therefore it will remove the spare worker threads as they completes, and the pool will shrink back down to its expected size with time(Number of cores by default).
But this scenario can also backfire if there is not enough memory to expand the thread pool.
So for your scenario, you can use
images.foreach(i => {
import scala.concurrent.blocking
Future {
blocking {
DB.withConnection { implicit connection =>
val query = SQL("insert into .........")
query()
}
}
})
If you're doing a lot of blocking I/O, then it's a good practice to create a separate thread-pool/execution context and execute all blocking calls in that pool.
References :
scala-best-practices
demystifying-the-blocking-construct-in-scala-futures
Hope this helps.

Idiomatically scheduling background work that dies with the main thread in Scala

I have a scala program that runs for a while and then terminates. I'd like to provide a library to this program that, behind the scenes, schedules an asynchronous task to run every N seconds. I'd also like the program to terminate when the main entrypoint's work is finished without needing to explicitly tell the background work to shut down (since it's inside a library).
As best I can tell the idiomatic way to do polling or scheduled work in Scala is with Akka's ActorSystem.scheduler.schedule, but using an ActorSystem makes the program hang after main waiting for the actors. I then tried and failed to add another actor that joins on the main thread, seemingly because "Anything that blocks a thread is not advised within Akka"
I could introduce a custom dispatcher; I could kludge something together with a polling isAlive check, or adding a similar check inside each worker; or I could give up on Akka and just use raw Threads.
This seems like a not-too-unusual thing to want to do, so I'd like to use idiomatic Scala if there's a clear best way.
I don't think there is an idiomatic Scala way.
The JVM program terminates when all non-daemon thread are finished. So you can schedule your task to run on a daemon thread.
So just use Java functionality:
import java.util.concurrent._
object Main {
def main(args: Array[String]): Unit = {
// Make a ThreadFactory that creates daemon threads.
val threadFactory = new ThreadFactory() {
def newThread(r: Runnable) = {
val t = Executors.defaultThreadFactory().newThread(r)
t.setDaemon(true)
t
}
}
// Create a scheduled pool using this thread factory
val pool = Executors.newSingleThreadScheduledExecutor(threadFactory)
// Schedule some function to run every second after an initial delay of 0 seconds
// This assumes Scala 2.12. In 2.11 you'd have to create a `new Runnable` manually
// Note that scheduling will stop, if there is an exception thrown from the function
pool.scheduleAtFixedRate(() => println("run"), 0, 1, TimeUnit.SECONDS)
Thread.sleep(5000)
}
}
You can also use guava to create a daemon thread factory with new ThreadFactoryBuilder().setDaemon(true).build().
If you use Akka scheduler you will be relying on highly tuned and optimized implementation that is well tested. Bringing up an actor system is a bit heavy weight though, I agree. Additionally you have to bring in a dependency on akka. If you are ok with that you can explicitly call system.shutdown from main when you are done, or wrap it in a function that will do it for you.
Alternatively, you could try something along these lines:
import scala.concurrent._
import ExecutionContext.Implicits.global
object Main extends App {
def repeatEvery[T](timeoutMillis: Int)(f: => T): Future[T] = {
val p = Promise[T]()
val never = p.future
f
def timeout = Future {
Thread.sleep(timeoutMillis)
throw new TimeoutException
}
val failure = Future.firstCompletedOf(List(never, timeout))
failure.recoverWith { case _ => repeatEvery(timeoutMillis)(f) }
}
repeatEvery(1000) {
println("scheduled job called")
}
println("main started doing its work")
Thread.sleep(10000)
println("main finished")
}
Prints:
scheduled job called
main started doing its work
scheduled job called
scheduled job called
scheduled job called
scheduled job called
scheduled job called
scheduled job called
scheduled job called
scheduled job called
scheduled job called
main finished
I don't like that it uses Thread.sleep, but that is done to avoid using any other 3rd party schedulers and Scala Future does not provide timeout options. So you'll be wasting one thread on that scheduling task, but that's what Akka scheduler seems to do anyway. The difference is that perhaps you want a single scheduler for the whole JVM not to waste too many threads. The code I provided albeit simpler will waste a thread per job.

How to implement actor model without Akka?

How to implement simple actors without Akka? I don't need high-performance for many (non-fixed count) actor instances, green-threads, IoC (lifecycle, Props-based factories, ActorRef's), supervising, backpressure etc. Need only sequentiality (queue) + handler + state + message passing.
As a side-effect I actually need small actor-based pipeline (with recursive links) + some parallell actors to optimize the DSP algorithm calculation. It will be inside library without transitive dependencies, so I don't want (and can't as it's a jar-plugin) to push user to create and pass akkaSystem, the library should have as simple and lightweight interface as possible. I don't need IoC as it's just a library (set of functions), not a framework - so it has more algorithmic complexity than structural. However, I see actors as a good instrument for describing protocols and I actually can decompose the algorithm to small amount of asynchronously interacting entities, so it fits to my needs.
Why not Akka
Akka is heavy, which means that:
it's an external dependency;
has complex interface and implementation;
non-transparent for library's user, for example - all instances are managed by akka's IoC, so there is no guarantee that one logical actor is always maintained by same instance, restart will create a new one;
requires additional support for migration which is comparable with scala's migration support itself.
It also might be harder to debug akka's green threads using jstack/jconsole/jvisualvm, as one actor may act on any thread.
Sure, Akka's jar (1.9Mb) and memory consumption (2.5 million actors per GB) aren't heavy at all, so you can run it even on Android. But it's also known that you should use specialized tools to watch and analyze actors (like Typesafe Activator/Console), which user may not be familiar with (and I wouldn't push them to learn it). It's all fine for enterprise project as it almost always has IoC, some set of specialized tools and continuous migration, but this isn't good approach for a simple library.
P.S. About dependencies. I don't have them and I don't want to add any (I'm even avoiding the scalaz, which actually fits here a little bit), as it will lead to heavy maintenance - I'll have to keep my simple library up-to-date with Akka.
Here is most minimal and efficient actor in the JVM world with API based on Minimalist Scala actor from Viktor Klang:
https://github.com/plokhotnyuk/actors/blob/41eea0277530f86e4f9557b451c7e34345557ce3/src/test/scala/com/github/gist/viktorklang/Actor.scala
It is handy and safe in usage but isn't type safe in message receiving and cannot send messages between processes or hosts.
Main features:
simplest FSM-like API with just 3 states (Stay, Become and Die): https://github.com/plokhotnyuk/actors/blob/41eea0277530f86e4f9557b451c7e34345557ce3/src/test/scala/com/github/gist/viktorklang/Actor.scala#L28-L30
minimalistic error handling - just proper forwading to the default exception handler of executor threads: https://github.com/plokhotnyuk/actors/blob/41eea0277530f86e4f9557b451c7e34345557ce3/src/test/scala/com/github/gist/viktorklang/Actor.scala#L52-L53
fast async initialization that takes ~200 ns to complete, so no need for additional futures/actors for time consuming actor initialization: https://github.com/plokhotnyuk/actors/blob/41eea0277530f86e4f9557b451c7e34345557ce3/out0.txt#L447
smallest memory footprint, that is ~40 bytes in a passive state (BTW the new String() spends the same amout of bytes in the JVM heap): https://github.com/plokhotnyuk/actors/blob/41eea0277530f86e4f9557b451c7e34345557ce3/out0.txt#L449
very efficient in message processing with throughput ~90M msg/sec for 4 core CPU: https://github.com/plokhotnyuk/actors/blob/41eea0277530f86e4f9557b451c7e34345557ce3/out0.txt#L466
very efficient in message sending/receiving with latency ~100 ns: https://github.com/plokhotnyuk/actors/blob/41eea0277530f86e4f9557b451c7e34345557ce3/out0.txt#L472
per actor tuning of fairness by the batch parameter: https://github.com/plokhotnyuk/actors/blob/41eea0277530f86e4f9557b451c7e34345557ce3/src/test/scala/com/github/gist/viktorklang/Actor.scala#L32
Example of stateful counter:
def process(self: Address, msg: Any, state: Int): Effect = if (state > 0) {
println(msg + " " + state)
self ! msg
Become { msg =>
process(self, msg, state - 1)
}
} else Die
val actor = Actor(self => msg => process(self, msg, 5))
Results:
scala> actor ! "a"
a 5
scala> a 4
a 3
a 2
a 1
This will use FixedThreadPool (and so its internal task queue):
import scala.concurrent._
trait Actor[T] {
implicit val context = ExecutionContext.fromExecutor(java.util.concurrent.Executors.newFixedThreadPool(1))
def receive: T => Unit
def !(m: T) = Future { receive(m) }
}
FixedThreadPool with size 1 guarantees sequentiality here. Of course it's NOT the best way to manage your threads if you need 100500 dynamically created actors, but it's fine if you need some fixed amount of actors per application to implement your protocol.
Usage:
class Ping(pong: => Actor[Int]) extends Actor[Int] {
def receive = {
case m: Int =>
println(m)
if (m > 0) pong ! (m - 1)
}
}
object System {
lazy val ping: Actor[Int] = new Ping(pong) //be careful with lazy vals mutual links between different systems (objects); that's why people prefer ActorRef
lazy val pong: Actor[Int] = new Ping(ping)
}
System.ping ! 5
Results:
import scala.concurrent._
defined trait Actor
defined class Ping
defined object System
res17: scala.concurrent.Future[Unit] = scala.concurrent.impl.Promise$DefaultPromise#6be61f2c
5
4
3
2
1
0
scala> System.ping ! 5; System.ping ! 7
5
7
4
6
3
5
2
res19: scala.concurrent.Future[Unit] = scala.concurrent.impl.Promise$DefaultPromise#54b053b1
4
1
3
0
2
1
0
This implementation is using two Java threads, so it's "twice" faster than counting without parallelization.

Is it correct to use `Future` to run some loop task which is never finished?

In our project, we need to do a task which listens to a queue and process the coming messages in a loop, which is never finished. The code is looking like:
def processQueue = {
while(true) {
val message = queue.next();
processMessage(message) match {
case Success(_) => ...
case _ => ...
}
}
}
So we want to run it in a separate thread.
I can imagine two ways to do it, one is to use Thread as what we do in Java:
new Thread(new Runnable() { processQueue() }).start();
Another way is use Future (as we did now):
Future { processQueue }
I just wonder if is it correct to use Future in this case, since as I know(which might be wrong), Future is mean to be running some task which will finish or return a result in some time of the future. But our task is never finished.
I also wonder what's the best solution for this in scala.
A Future is supposed to a value that will eventually exist, so I don't think it makes much sense to create one that will never be fulfilled. They're also immutable, so passing information to them is a no-no. And using some externally referenced queue within the Future sounds like dark road to go down.
What you're describing is basically an Akka Actor, which has it's own FIFO queue, with a receive method to process messages. It would look something like this:
import akka.actor._
class Processor extends Actor {
def receive = {
case msg: String => processMessage(msg) match {
case Success(x) => ...
case _ => ...
}
case otherMsg # Message(_, _) => {
// process this other type of message..
}
}
}
Your application could create a single instance of this Processor actor with an ActorSystem (or some other elaborate group of these actors):
val akkaSystem = ActorSystem("myActorSystem")
val processor: ActorRef = akkaSystem.actorOf(Props[Processor], "Processor")
And send it messages:
processor ! "Do some work!"
In short, it's a better idea to use a concurrency framework like Akka than to create your own for processing queues on separate threads. The Future API is most definitely not the way to go.
I suggest perusing the Akka Documentation for more information.
If you are just running one thread (aside from the main thread), it won't matter. If you do this repeatedly, and you really want lots of separate threads, you should use Thread since that is what it is for. Futures are built with the assumption that they'll terminate, so you might run out of pool threads.

Akka for REST polling

I'm trying to interface a large Scala + Akka + PlayMini application with an external REST API. The idea is to periodically poll (basically every 1 to 10 minutes) a root URL and then crawl through sub-level URLs to extract data which is then sent to a message queue.
I have come up with two ways to do this:
1st way
Create a hierarchy of actors to match the resource path structure of the API. In the Google Latitude case, that would mean, e.g.
Actor 'latitude/v1/currentLocation' polls https://www.googleapis.com/latitude/v1/currentLocation
Actor 'latitude/v1/location' polls https://www.googleapis.com/latitude/v1/location
Actor 'latitude/v1/location/1' polls https://www.googleapis.com/latitude/v1/location/1
Actor 'latitude/v1/location/2' polls https://www.googleapis.com/latitude/v1/location/2
Actor 'latitude/v1/location/3' polls https://www.googleapis.com/latitude/v1/location/3
etc.
In this case, each actor is responsible for polling its associated resource periodically, as well as creating / deleting child actors for next-level path resources (i.e. actor 'latitude/v1/location' creates actors 1, 2, 3, etc. for all locations it learns about through polling of https://www.googleapis.com/latitude/v1/location).
2nd way
Create a pool of identical polling actors which receive polling requests (containing the resource path) load-balanced by a router, poll the URL once, do some processing, and schedule polling requests (both for next-level resources and for the polled URL). In Google Latitude, that would mean for instance:
1 router, n poller actors. Initial polling request for https://www.googleapis.com/latitude/v1/location leads to several new (immediate) polling requests for https://www.googleapis.com/latitude/v1/location/1, https://www.googleapis.com/latitude/v1/location/2, etc. and one (delayed) polling request for the same resource, i.e. https://www.googleapis.com/latitude/v1/location.
I have implemented both solutions and can't immediately observe any relevant difference of performance, at least not for the API and polling frequencies I am interested in. I find the first approach to be somewhat easier to reason about and perhaps easier to use with system.scheduler.schedule(...) than the second approach (where I need to scheduleOnce(...)). Also, assuming resources are nested through several levels and somewhat short-lived (e.g. several resources may be added/removed between each polling), akka's lifecycle management makes it easy to kill off a whole branch in the 1st case. The second approach should (theoretically) be faster and the code is somewhat easier to write.
My questions are:
What approach seems to be the best (in terms of performance, extensibility, code complexity, etc.)?
Do you see anything wrong with the design of either approach (esp. the 1st one)?
Has anyone tried to implement anything similar? How was it done?
Thanks!
Why not create a master poller, which then kicks of async resource requests on the schedule?
I'm no expert using Akka, but I gave this a shot:
The poller object that iterates through the list of resources to fetch:
import akka.util.duration._
import akka.actor._
import play.api.Play.current
import play.api.libs.concurrent.Akka
object Poller {
val poller = Akka.system.actorOf(Props(new Actor {
def receive = {
case x: String => Akka.system.actorOf(Props[ActingSpider], name=x.filter(_.isLetterOrDigit)) ! x
}
}))
def start(l: List[String]): List[Cancellable] =
l.map(Akka.system.scheduler.schedule(3 seconds, 3 seconds, poller, _))
def stop(c: Cancellable) {c.cancel()}
}
The actor that reads the resource asynchronously and triggers more async reads. You could put the message dispatch on a schedule rather than call immediately if it was kinder:
import akka.actor.{Props, Actor}
import java.io.File
class ActingSpider extends Actor {
import context._
def receive = {
case name: String => {
println("reading " + name)
new File(name) match {
case f if f.exists() => spider(f)
case _ => println("File not found")
}
context.stop(self)
}
}
def spider(file: File) {
io.Source.fromFile(file).getLines().foreach(l => {
val k = actorOf(Props[ActingSpider], name=l.filter(_.isLetterOrDigit))
k ! l
})
}
}