Running Scala futures somewhat in parallel

I have some scala futures. I can easily run them in parallel with Future.sequence. I can also run them one-after-another with something like this:
def serFut[A, B](l: Iterable[A])(fn: A ⇒ Future[B]): Future[List[B]] =
  l.foldLeft(Future(List.empty[B])) {
    (previousFuture, next) ⇒
      for {
        previousResults ← previousFuture
        next ← fn(next)
      } yield previousResults :+ next
  }
(Described here). Now suppose that I want to run them slightly in parallel, i.e. with the constraint that at most m of them are running at once. The above code does this for the special case of m=1. Is there a nice Scala-idiomatic way of doing it for general m? Then, for extra utility, what's the most elegant way to build a kill-switch into the routine? And could I change m on the fly?
My own solutions to this keep leading me back to procedural code, which feels rather wimpy next to Scala's elegance.

You can achieve it by using an ExecutionContext which uses a pool of max m threads:
How to configure a fine tuned thread pool for futures?
Put the implicit val ec = new ExecutionContext { ... } definition somewhere within the scope of the serFut function so that it is used when the futures are created.

The easiest way is to define your ExecutionContext like
implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(m))
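A minimal sketch of this idea, assuming fn is a plain blocking function A => B rather than the asker's A => Future[B]. The names BoundedPool, boundedEc and parFut are made up for illustration. The pool caps how many bodies run at once only because each body occupies a thread while it runs; if fn merely kicks off asynchronous work elsewhere and returns immediately, the cap does not apply to that work.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, ExecutionContextExecutorService, Future}

object BoundedPool {
  // Hypothetical helper: an ExecutionContext backed by exactly m threads.
  def boundedEc(m: Int): ExecutionContextExecutorService =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(m))

  // Hypothetical helper: starts one Future per element on the given context.
  // All futures are created up front, but at most m bodies execute at a time;
  // the rest wait in the pool's queue. Results keep the input order.
  def parFut[A, B](l: Iterable[A])(fn: A => B)(implicit ec: ExecutionContext): Future[List[B]] =
    Future.traverse(l.toList)(a => Future(fn(a)))
}
```

With implicit val ec = BoundedPool.boundedEc(3) in scope, BoundedPool.parFut(items)(work) runs at most three calls to work at a time. The caller is expected to call ec.shutdown() once the result is in. This sketch does not address the kill-switch or changing m on the fly.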

Related

Reentrant locks within monads in Scala

A colleague of mine stated the following, about using a Java ReentrantReadWriteLock in some Scala code:
Acquiring the lock here is risky. It's "reentrant", but that internally depends on the thread context. F may run different stages of the same computation in different threads. You can easily cause a deadlock.
F here refers to some effectful monad.
Basically what I'm trying to do is to acquire the same reentrant lock twice, within the same monad.
Could somebody clarify why this could be a problem?
The code is split into two files. The outermost one:
val lock: Resource[F, Unit] = for {
  // some other resource
  _ <- store.writeLock
} yield ()
lock.use { _ =>
  for {
    // stuff
    _ <- EitherT(store.doSomething())
    // other stuff
  } yield ()
}
Then, in the store:
import java.util.concurrent.locks.{Lock, ReentrantReadWriteLock}
import cats.effect.{Resource, Sync}
private def lockAsResource[F[_]](lock: Lock)(implicit F: Sync[F]): Resource[F, Unit] =
  Resource.make {
    F.delay(lock.lock())
  } { _ =>
    F.delay(lock.unlock())
  }
private val lock = new ReentrantReadWriteLock
val writeLock: Resource[F, Unit] = lockAsResource(lock.writeLock())
def doSomething(): F[Either[Throwable, Unit]] = writeLock.use { _ =>
  // etc etc
}
The writeLock in the two pieces of code is the same, and it's a cats.effect.Resource[F, Unit] wrapping a ReentrantReadWriteLock's writeLock. There are some reasons why I was writing the code this way, so I wouldn't want to dig into that. I would just like to understand why (according to my colleague, at least), this could potentially break stuff.
Also, I'd like to know if there is some alternative in Scala that would allow something like this without the risk for deadlocks.
IIUC your question:
You expect that, for each interaction with the Resource, the lock.lock and lock.unlock actions happen in the same thread.
1) There is no such guarantee, since you are using an arbitrary effect F here.
It's possible to write an implementation of F that executes every action in a new thread.
2) Even if we assume that F is IO, someone could call IO.shift in the body of doSomething, so the subsequent actions, including unlock, would happen in another thread. That is probably not possible with the current signature of doSomething, but you get the idea.
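To make the thread dependence concrete, here is a small sketch that uses only java.util.concurrent and no effect type. The object and method names are made up for illustration. One thread acquires the write lock; a second thread then tries to re-acquire it and to release it, which is the situation you end up in when F runs a later stage on a different thread.

```scala
import java.util.concurrent.{Callable, Executors}
import java.util.concurrent.locks.ReentrantReadWriteLock

object LockOwnership {
  // Returns (tryLockSucceeded, unlockRejected) as observed from a second
  // thread while a first thread holds the write lock.
  def observeFromOtherThread(): (Boolean, Boolean) = {
    val lock = new ReentrantReadWriteLock().writeLock()
    val owner = Executors.newSingleThreadExecutor()
    val other = Executors.newSingleThreadExecutor()
    try {
      // The owner thread takes the lock and keeps holding it.
      owner.submit(new Runnable { def run(): Unit = lock.lock() }).get()
      val observed = other.submit(new Callable[(Boolean, Boolean)] {
        def call(): (Boolean, Boolean) = {
          // Reentrancy is per thread: a different thread cannot re-acquire.
          val acquired = lock.tryLock()
          // Releasing a lock this thread does not hold is rejected.
          val rejected =
            try { lock.unlock(); false }
            catch { case _: IllegalMonitorStateException => true }
          (acquired, rejected)
        }
      }).get()
      // Only the owner thread can release the lock.
      owner.submit(new Runnable { def run(): Unit = lock.unlock() }).get()
      observed
    } finally {
      owner.shutdown()
      other.shutdown()
    }
  }
}
```

A blocking lock() in place of tryLock() on the second thread would simply wait forever if the owner never gets to run its unlock, which is the deadlock your colleague is warning about.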
Also, I'd like to know if there is some alternative in Scala that would allow something like this without the risk for deadlocks.
You can take a look at scalaz zio STM.

Scala style with Futures

Looking for the best way to write a chain of functions that need to run async one after the other. Given these two options:
Option 1
def operation1(): Unit = {...}
def operation2(): Unit = {...}
def foo(): Future[Unit] =
  Future {
    operation1()
    operation2()
  } onComplete {case _ => println("done!")}
Option 2
def operation1(): Future[Unit] = {...}
def operation2(): Future[Unit] = {...}
def foo(): Future[Unit] = {
  operation1()
    .flatMap {case _ => operation2() }
    .onComplete {case _ => println("done!")}
}
Are there any advantages/disadvantages of one over the other?
I believe that the Option 1 will run the two functions on the same background thread. Is that also the case for the Option 2?
Are there any good practices for this?
Another question, given this function:
def foo: Future[A]
if I want to cast the result to unit, is this the best way to do it:
foo map { _ => () }
Thanks!
The potential advantage of Option 1 over Option 2 is that it guarantees operation2 will run right after operation1 (provided operation1 didn't fail with an exception), whereas in Option 2 the thread pool may have no free threads left when the flatMap is due to run.
Yes, Option 1 will run both operations in the same thread for sure. In Option 2, each operation is submitted to the execution context separately, so they may run on two different threads if enough of them are available.
flatMap[S](f: (T) ⇒ Future[S])(implicit executor: ExecutionContext): Future[S]
You must have declared an implicit execution context, or imported one: that determines which pool you are using. If you imported the default global executor, then your pool is a fork-join based one with, by default, as many threads as your machine has cores.
The first option is like having one thread run both operations, one after the other, whereas the second option runs the first operation in a thread and then asks your ExecutionContext for a thread to run the second operation.
The best practice is to use what you need:
Do you want to make sure that operation2 runs even when no more threads are available in the execution context? If the answer is yes, then use Option 1. Otherwise, you can use Option 2.
In relation to your last question: what you're doing in your proposed snippet is not casting; you are mapping a function which provides a Unit value for any value of type A. The effect is that you get a Future[Unit], which is useful for checking its completion state. That is the best way to get what you seem to want.
Be aware, however, that, as with flatMap, the execution of that "transformation function" is submitted to the implicit executor in your context, so it may run on a different thread. That is why map has an implicit executor parameter too:
def map[S](f: (T) ⇒ S)(implicit executor: ExecutionContext): Future[S]
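For reference, a compilable sketch of both options and of the map-to-Unit idiom. It swaps onComplete for andThen, because onComplete returns Unit and so cannot be the last call in a method declared to return Future[Unit]. The recording helper and the names option1, option2 and discard are assumptions added for the example; they are not from the question.

```scala
import scala.concurrent.{ExecutionContext, Future}

object ChainingStyles {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Hypothetical thread-safe trace of what ran, in order.
  private var events = Vector.empty[String]
  private def record(e: String): Unit = synchronized { events :+= e }
  def trace: Vector[String] = synchronized(events)
  def reset(): Unit = synchronized { events = Vector.empty }

  def operation1(): Unit = record("op1")
  def operation2(): Unit = record("op2")

  // Option 1: both operations run inside a single Future body, on one thread.
  def option1(): Future[Unit] =
    Future { operation1(); operation2() }
      .andThen { case _ => record("done") }

  // Option 2: each operation is its own Future; flatMap submits the second
  // one to the ExecutionContext once the first has completed.
  def option2(): Future[Unit] =
    Future(operation1())
      .flatMap(_ => Future(operation2()))
      .andThen { case _ => record("done") }

  // Discarding a result: map every value of type A to ().
  def discard[A](fa: Future[A]): Future[Unit] = fa.map(_ => ())
}
```

Both variants record op1, op2, done in that order; the difference is only in how many times work is handed to the execution context.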

Using getOrElseUpdate of TrieMap in Scala

I am using the getOrElseUpdate method of scala.collection.concurrent.TrieMap (from 2.11.6)
// simplified for clarity
val trie = new TrieMap[Int, Future[String]]
def foo(): String = ... // a very long process
val fut: Future[String] = trie.getOrElseUpdate(id, Future(foo()))
As I understand it, if I invoke getOrElseUpdate from multiple threads without any synchronization, foo is invoked just once.
Is that correct?
The current implementation is that it will be invoked zero or one times. It may be invoked without the result being inserted, however. (This is standard behavior for CAS-based maps as opposed to ones that use synchronized.)
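A small sketch of what that means in practice, assuming the behavior described above. The object name, slow, lookup and the started counter are invented for the example. Every caller receives the one Future that ended up in the map, but the by-name argument may also have been evaluated by a caller that lost the race, so started can end up greater than 1.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.collection.concurrent.TrieMap
import scala.concurrent.{ExecutionContext, Future}

object TrieMapOnce {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Counts how many times the expensive body was actually started.
  val started = new AtomicInteger(0)
  val cache = TrieMap.empty[Int, Future[String]]

  // Hypothetical stand-in for the "very long process".
  def slow(id: Int): String = {
    started.incrementAndGet()
    s"value-$id"
  }

  // Future(slow(id)) is passed by name: it is only evaluated when the key
  // looks absent, and a losing caller's Future is discarded, not inserted.
  def lookup(id: Int): Future[String] =
    cache.getOrElseUpdate(id, Future(slow(id)))
}
```

If the body must not start more than once (for example because it has side effects), this map alone does not guarantee it; the guarantee is only that all callers share one stored value.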

Unit test async scala code

Experimenting with concurrent execution I was wondering how to actually test it.
The execution flow is of a side-effect nature and futures are created to wrap independent executions/processing.
Been searching for some good examples on how to properly unit test the following scenarios (foo and bar are the methods I wish to test):
scenario #1
def foo : Unit = {
  Future { doSomething }
  Future { doSomethingElse }
}
private def doSomething : Unit = serviceCall1
private def doSomethingElse : Unit = serviceCall2
Scenario motivation
foo immediately returns but invokes 2 futures which perform separate tasks (e.g. save analytics and store record to DB). These service calls can be mocked, but what I'm trying to test is that both these services are called once I wrap them in Futures
scenario #2
def bar : Unit = {
  val futureX = doAsyncX
  val futureY = doAsyncY
  for {
    x <- futureX
    y <- futureY
  } yield {
    noOp(x, y)
  }
}
Scenario motivation
Start with long-running computations that can be executed concurrently (e.g. get the total number of visitors and get the most frequently used User-Agent header for our web site). Combine the results in some other operation (which in this case is a Unit method that simply throws the values away).
Note I'm familiar with actors and testing actors, but given the above code I wonder what should be the most suitable approach (refactoring included)
EDIT What I'm doing at the moment
implicit val context = ExecutionContext.fromExecutor(testExecutor)
def testExecutor = {
  new Executor {
    // Run the task immediately on the calling thread
    def execute(runnable : Runnable) = runnable.run()
  }
}
This ExecutionContext implementation will not run the Future on a separate thread, so the entire execution happens in sequence. This kinda feels like a hack, but based on Electric Monk's answer, it seems like the other solution is more of the same.
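A self-contained sketch of how the same-thread executor above could be used to test something shaped like scenario #1. It assumes foo is refactored to take its service and ExecutionContext as parameters; RecordingService is a hypothetical stand-in for a mock, not part of any library. It relies on Future handing its body straight to the supplied executor, so with this executor the body has already run by the time Future { ... } returns.

```scala
import java.util.concurrent.Executor
import scala.concurrent.{ExecutionContext, Future}

object SameThreadTesting {
  // Runs every task immediately on the calling thread.
  val sameThread: ExecutionContext =
    ExecutionContext.fromExecutor(new Executor {
      def execute(runnable: Runnable): Unit = runnable.run()
    })

  // Hypothetical stand-in for a mocked service: records the calls it receives.
  final class RecordingService {
    private var recorded = Vector.empty[String]
    def call(name: String): Unit = synchronized { recorded :+= name }
    def calls: Vector[String] = synchronized(recorded)
  }

  // Shaped like scenario #1, with the service and context passed in.
  def foo(service: RecordingService)(implicit ec: ExecutionContext): Unit = {
    Future { service.call("serviceCall1") }
    Future { service.call("serviceCall2") }
    ()
  }
}
```

In a test, calling SameThreadTesting.foo(service)(SameThreadTesting.sameThread) and then inspecting service.calls needs no waiting or sleeping, at the cost of not exercising any real concurrency.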
One solution would be to use a DeterministicExecutor. Not a scalaesque solution, but it should do the trick.
If you are using ScalaTest, take a look at: http://doc.scalatest.org/2.0/index.html#org.scalatest.concurrent.Futures
Specs2 also has support for testing Futures:
http://etorreborre.github.io/specs2/guide/org.specs2.guide.Matchers.html
ScalaTest 3.x supports asynchronous non-blocking testing.

Concurrent map/foreach in scala

I have an iterable vals: Iterable[T] and a long-running function without any relevant side effects: f: (T => Unit). Right now this is applied to vals in the obvious way:
vals.foreach(f)
I would like the calls to f to be done concurrently (within reasonable limits). Is there an obvious function somewhere in the Scala base library? Something like:
Concurrent.foreach(8 /* Number of threads. */)(vals, f)
While f is reasonably long running, it is short enough that I don't want the overhead of invoking a thread for each call, so I am looking for something based on a thread pool.
Many of the answers from 2009 still use the old scala.actors.Futures._, which is no longer part of newer Scala versions. While Akka is the preferred way, a much more readable way is to just use parallel (.par) collections:
vals.foreach { v => f(v) }
becomes
vals.par.foreach { v => f(v) }
Alternatively, using parMap can appear more succinct though with the caveat that you need to remember to import the usual Scalaz*. As usual, there's more than one way to do the same thing in Scala!
Scalaz has parMap. You would use it as follows:
import scalaz.Scalaz._
import scalaz.concurrent.Strategy.Naive
This will equip every functor (including Iterable) with a parMap method, so you can just do:
vals.parMap(f)
You also get parFlatMap, parZipWith, etc.
I like the Futures answer. However, while it will execute concurrently, it will also return asynchronously, which is probably not what you want. The correct approach would be as follows:
import scala.actors.Futures._
vals map { x => future { f(x) } } foreach { _() }
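Since scala.actors has been removed from newer Scala versions, here is a rough equivalent using scala.concurrent and a fixed thread pool. The Concurrent object is hypothetical; it only mirrors the signature wished for in the question. Like the snippet above, it starts all calls and then blocks until every one of them has finished.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object Concurrent {
  // Hypothetical helper: applies f to every element on a pool of `threads`
  // threads and returns only after all calls have finished.
  def foreach[T](threads: Int)(vals: Iterable[T], f: T => Unit): Unit = {
    val pool = Executors.newFixedThreadPool(threads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      val all = Future.traverse(vals.toList)(v => Future(f(v)))
      // Blocks the caller; rethrows a failure if any call to f threw.
      Await.result(all, Duration.Inf)
      ()
    } finally pool.shutdown()
  }
}
```

Usage would look like Concurrent.foreach(8)(vals, f). Creating a pool per call keeps the sketch short; for frequent calls, a shared pool would avoid the repeated thread start-up cost.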
I had some issues using scala.actors.Futures in Scala 2.8 (it was buggy when I checked). Using java libs directly worked for me, though:
final object Parallel {
  val cpus = java.lang.Runtime.getRuntime().availableProcessors
  import java.util.{Timer, TimerTask}
  def afterDelay(ms: Long)(op: => Unit) = new Timer().schedule(new TimerTask { override def run = op }, ms)
  def repeat(n: Int, f: Int => Unit) = {
    import java.util.concurrent._
    val e = Executors.newCachedThreadPool // newFixedThreadPool(cpus+1)
    (0 until n).foreach(i => e.execute(new Runnable { def run = f(i) }))
    e.shutdown
    // Block until all submitted tasks have finished
    e.awaitTermination(Long.MaxValue, TimeUnit.SECONDS)
  }
}
I'd use scala.actors.Futures:
vals.foreach(t => scala.actors.Futures.future(f(t)))
The latest release of Functional Java has some higher-order concurrency features that you can use.
import fjs.F._
import fj.control.parallel.Strategy._
import fj.control.parallel.ParModule._
import java.util.concurrent.Executors._
val pool = newCachedThreadPool
val par = parModule(executorStrategy[Unit](pool))
And then...
par.parMap(vals, f)
Remember to shutdown the pool.
You can use the Parallel Collections from the Scala standard library.
They're just like ordinary collections, but their operations run in parallel. You just need to put a par call before you invoke some collections operation.
import scala.collection._
val array = new Array[String](10000)
for (i <- (0 until 10000).par) array(i) = i.toString