FS2 Stream with StateT[IO, _, _], periodically dumping state - scala

I have a program which consumes an infinite stream of data. Along the way I'd like to record some metrics, which form a monoid since they're just simple sums and averages. Periodically, I want to write out these metrics somewhere, clear them, and return to accumulating them. I have essentially:
object Foo {
type MetricsIO[A] = StateT[IO, MetricData, A]
def recordMetric(m: MetricData): MetricsIO[Unit] = {
StateT.modify(_.combine(m))
}
def sendMetrics: MetricsIO[Unit] = {
StateT.modifyF { s =>
val write: IO[Unit] = writeMetrics(s)
write.attempt.map {
case Left(_) => s
case Right(_) => Monoid[MetricData].empty
}
}
}
}
So most of the execution uses IO directly and lifts using StateT.liftF. And in certain situations, I include some calls to recordMetric. At the end of it I've got a stream:
val mainStream: Stream[MetricsIO, Bar] = ...
And I want to periodically, say every minute or so, dump the metrics, so I tried:
val scheduler: Scheduler = ...
val sendStream =
scheduler
.awakeEvery[MetricsIO](FiniteDuration(1, TimeUnit.MINUTES))
.evalMap(_ => Foo.sendMetrics)
val result = mainStream.concurrently(sendStream).compile.drain
And then I do the usual top level program stuff of calling run with the start state and then calling unsafeRunSync.
The issue is, I only ever see empty metrics! I suspect it's something to do with my monoid implicitly providing empty metrics to sendStream, but I can't quite figure out why that should be or how to fix it. Maybe there's a way I can "interleave" these sendMetrics calls into the main stream instead?
Edit: here's a minimal complete runnable example:
import fs2._
import cats.implicits._
import cats.data._
import cats.effect._
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext
import scala.concurrent.duration._
val sec = Executors.newScheduledThreadPool(4)
implicit val ec = ExecutionContext.fromExecutorService(sec)
type F[A] = StateT[IO, List[String], A]
val slowInts = Stream.unfoldEval[F, Int, Int](1) { n =>
StateT(state => IO {
Thread.sleep(500)
val message = s"hello $n"
val newState = message :: state
val result = Some((n, n + 1))
(newState, result)
})
}
val ticks = Scheduler.fromScheduledExecutorService(sec).fixedDelay[F](FiniteDuration(1, SECONDS))
val slowIntsPeriodicallyClearedState = slowInts.either(ticks).evalMap[Int] {
case Left(n) => StateT.liftF(IO(n))
case Right(_) => StateT(state => IO {
println(state)
(List.empty, -1)
})
}
Now if I do:
slowInts.take(10).compile.drain.run(List.empty).unsafeRunSync
Then I get the expected result - the state properly accumulates into the output. But if I do:
slowIntsPeriodicallyClearedState.take(10).compile.drain.run(List.empty).unsafeRunSync
Then I see an empty list consistently printed out. I would have expected partial lists (approx. 2 elements) printed out.

StateT is not safe to use with effect types, because it's not safe in the face of concurrent access: state updates made on one concurrent branch are not visible to the other. Instead, consider using a Ref (from either fs2 or cats-effect, depending on the version).
Something like this:
def slowInts(ref: Ref[IO, List[String]]) = Stream.unfoldEval[IO, Int, Int](1) { n =>
val message = s"hello $n"
ref.modify(message :: _) *> IO {
Thread.sleep(500)
val result = Some((n, n + 1))
result
}
}
val ticks = Scheduler.fromScheduledExecutorService(sec).fixedDelay[IO](FiniteDuration(1, SECONDS))
def slowIntsPeriodicallyClearedState(ref: Ref[IO, List[String]]) =
slowInts(ref).either(ticks).evalMap[Int] {
case Left(n) => IO.pure(n)
case Right(_) =>
// print the state accumulated before the reset, not the freshly cleared one
ref.modify(_ => Nil).flatMap { case Change(previous, now) =>
IO(println(previous)).as(-1)
}
}
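The idea behind Ref, a single shared cell that every concurrent branch updates atomically, can be sketched without fs2 by using java.util.concurrent.atomic.AtomicReference. This is a plain-JVM stand-in rather than the fs2 API, it assumes Scala 2.12 or later for the lambda syntax, and SharedStateSketch, record and drain are hypothetical names chosen for illustration.

```scala
import java.util.concurrent.atomic.AtomicReference

object SharedStateSketch {
  // Hypothetical stand-in for the accumulated state: a list of messages.
  val state = new AtomicReference[List[String]](Nil)

  // Producer side: atomically prepend a message to the shared state.
  def record(message: String): Unit = {
    state.updateAndGet(current => message :: current)
    ()
  }

  // Periodic side: atomically take everything accumulated so far and reset to empty.
  def drain(): List[String] = state.getAndSet(Nil)

  def main(args: Array[String]): Unit = {
    record("hello 1")
    record("hello 2")
    println(drain()) // the messages recorded before the reset
    println(drain()) // nothing is left after the reset
  }
}
```

Because both sides go through the same cell, the periodic drain sees what the producer recorded, which is exactly what the StateT version cannot guarantee once the two streams run concurrently.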

Related

Chain Scala Futures when processing a Seq of objects?

import scala.concurrent.duration.Duration
import scala.concurrent.duration.Duration._
import scala.concurrent.{Await, Future}
import scala.concurrent.Future._
import scala.concurrent.ExecutionContext.Implicits.global
object TestClass {
final case class Record(id: String)
final case class RecordDetail(id: String)
final case class UploadResult(result: String)
val ids: Seq[String] = Seq("a", "b", "c", "d")
def fetch(id: String): Future[Option[Record]] = Future {
Thread sleep 100
if (id != "b" && id != "d") {
Some(Record(id))
} else None
}
def fetchRecordDetail(record: Record): Future[RecordDetail] = Future {
Thread sleep 100
RecordDetail(record.id + "_detail")
}
def upload(recordDetail: RecordDetail): Future[UploadResult] = Future {
Thread sleep 100
UploadResult(recordDetail.id + "_uploaded")
}
def notifyUploaded(results: Seq[UploadResult]): Unit = println("notified " + results)
def main(args: Array[String]): Unit = {
//for each id from ids, call fetch method and if record exists call fetchRecordDetail
//and after that upload RecordDetail, collect all UploadResults into seq
//and call notifyUploaded with that seq and await result, you should see "notified ...." in console
// In the following line of code how do I pass result of fetch to fetchRecordDetail function
val result = Future.traverse(ids)(x => Future(fetch(x)))
// val result: Future[Unit] = ???
Await.ready(result, Duration.Inf)
}
}
My problem is that I don't know what code to put in the main to make it work as written in the comments. To sum up, I have an ids:Seq[String] and I want each id to go through asynchronous methods fetch, fetchRecordDetail, upload, and finally the whole Seq to come to notifyUploaded.
I think the simplest way to do it is:
import scala.util.Success // needed for the andThen pattern match below
def main(args: Array[String]): Unit = {
//for each id from ids, call fetch method and if record exists call fetchRecordDetail
//and after that upload RecordDetail, collect all UploadResults into seq
//and call notifyUploaded with that seq and await result, you should see "notified ...." in console
def runWithOption[A, B](f: A => Future[B], oa: Option[A]): Future[Option[B]] = oa match {
case Some(a) => f(a).map(b => Some(b))
case None => Future.successful(None)
}
val ids: Seq[String] = Seq("a", "b", "c", "d")
val resultSeq: Seq[Future[Option[UploadResult]]] = ids.map(id => {
for (or: Option[Record] <- fetch(id);
ord: Option[RecordDetail] <- runWithOption(fetchRecordDetail, or);
our: Option[UploadResult] <- runWithOption(upload, ord)
) yield our
})
val filteredResult: Future[Seq[UploadResult]] = Future.sequence(resultSeq).map(s => s.collect({ case Some(ur) => ur }))
val result: Future[Seq[UploadResult]] = filteredResult.andThen({ case Success(s) => notifyUploaded(s) })
Await.ready(result, Duration.Inf)
}
The idea is that you first get a Seq[Future[_]] by mapping each id through all the methods (here done with a for-comprehension). The important trick is to actually pass a Seq[Future[Option[_]]]: threading Option[_] through the whole chain via the runWithOption helper simplifies the code a lot, with no need to block until the very last stage.
Then you convert Seq[Future[_]] into a Future[Seq[_]] and filter out results for those ids that failed at the fetch stage. And finally you apply notifyUploaded.
P.S. Note that there is no error handling in this code whatsoever and it is not clear how you expect it to behave in case of errors at different stages.
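As for error handling, one possible policy (purely an assumption, since the question does not specify one) is to treat a failure at any stage for a given id as "no result" for that id, so that one bad upload does not fail the whole batch. The sketch below uses stubs that mirror the question's signatures; ChainSketch, process and processAll are hypothetical names, and the failing upload for "c_detail" is contrived to exercise the recover branch.

```scala
import scala.concurrent.{ExecutionContext, Future}

object ChainSketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  final case class Record(id: String)
  final case class RecordDetail(id: String)
  final case class UploadResult(result: String)

  // Stubs mirroring the question's methods (no sleeps; "b" and "d" do not exist).
  def fetch(id: String): Future[Option[Record]] =
    Future(if (id != "b" && id != "d") Some(Record(id)) else None)

  def fetchRecordDetail(record: Record): Future[RecordDetail] =
    Future(RecordDetail(record.id + "_detail"))

  // Contrived failure: uploading "c_detail" throws.
  def upload(detail: RecordDetail): Future[UploadResult] =
    Future {
      if (detail.id == "c_detail") throw new RuntimeException("upload failed")
      UploadResult(detail.id + "_uploaded")
    }

  // Runs the whole chain for one id; a missing record or a failed stage yields None.
  def process(id: String): Future[Option[UploadResult]] =
    fetch(id).flatMap {
      case Some(record) => fetchRecordDetail(record).flatMap(upload).map(Option(_))
      case None         => Future.successful(Option.empty[UploadResult])
    }.recover { case _: Exception => None }

  // Runs the chain for every id and keeps only the successful uploads.
  def processAll(ids: Seq[String]): Future[Seq[UploadResult]] =
    Future.traverse(ids)(process).map(_.flatten)
}
```

Whether dropping failures silently is acceptable depends on the application; logging inside the recover block, or returning an Either per id, would be natural variations.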

Iterate data source asynchronously in batch and stop while remote return no data in Scala

Let's say we have a fake data source which returns the data it holds in batches:
class DataSource(size: Int) {
private var s = 0
implicit val g = scala.concurrent.ExecutionContext.global
def getData(): Future[List[Int]] = {
s = s + 1
Future {
Thread.sleep(Random.nextInt(s * 100))
if (s <= size) {
List.fill(100)(s)
} else {
List()
}
}
}
} // closes class DataSource
object Test extends App {
val source = new DataSource(100)
implicit val g = scala.concurrent.ExecutionContext.global
def process(v: List[Int]): Unit = {
println(v)
}
def next(f: (List[Int]) => Unit): Unit = {
val fut = source.getData()
fut.onComplete {
case Success(v) => {
f(v)
v match {
case h :: t => next(f)
}
}
}
}
next(process)
Thread.sleep(1000000000)
}
This is my attempt; the problem is that parts of it are impure. Ideally, I would like to wrap the Future for each batch into one big future, and have the wrapper future succeed when the last batch returns an empty list. My situation is a little different from this post: the next() there is a synchronous call, while mine is also async.
Or is it even possible to do what I want? The next batch should only be fetched once the previous one has resolved, and whether to fetch the next batch depends on the size returned.
What's the best way to walk through this type of data sources? Are there any existing Scala frameworks that provide the feature I am looking for? Is play's Iteratee, Enumerator, Enumeratee the right tool? If so, can anyone provide an example on how to use those facilities to implement what I am looking for?
Edit----
With help from chunjef, I tried it out, and it worked for me. However, I made a small change based on his answer.
Source.fromIterator(()=>Iterator.continually(source.getData())).mapAsync(1) (f=>f.filter(_.size > 0))
.via(Flow[List[Int]].takeWhile(_.nonEmpty))
.runForeach(println)
However, can someone give a comparison between Akka Streams and Play Iteratee? Is it worth also trying out Iteratee?
Code snip 1:
Source.fromIterator(() => Iterator.continually(ds.getData)) // line 1
.mapAsync(1)(identity) // line 2
.takeWhile(_.nonEmpty) // line 3
.runForeach(println) // line 4
Code snip 2: Assume getData depends on the output of another flow, and I would like to concatenate it with the flow below. However, this yields a "too many files open" error. I am not sure what causes it; if I understood correctly, mapAsync(1) should limit the throughput to one element at a time.
Flow[Int].mapConcat[Future[List[Int]]](c => {
Iterator.continually(ds.getData(c)).to[collection.immutable.Iterable]
}).mapAsync(1)(identity).takeWhile(_.nonEmpty).runForeach(println)
The following is one way to achieve the same behavior with Akka Streams, using your DataSource class:
import scala.concurrent.Future
import scala.util.Random
import akka.actor.ActorSystem
import akka.stream._
import akka.stream.scaladsl._
object StreamsExample extends App {
implicit val system = ActorSystem("Sandbox")
implicit val materializer = ActorMaterializer()
val ds = new DataSource(100)
Source.fromIterator(() => Iterator.continually(ds.getData)) // line 1
.mapAsync(1)(identity) // line 2
.takeWhile(_.nonEmpty) // line 3
.runForeach(println) // line 4
}
class DataSource(size: Int) {
...
}
A simplified line-by-line overview:
line 1: Creates a stream source that continually calls ds.getData if there is downstream demand.
line 2: mapAsync is a way to deal with stream elements that are Futures. In this case, the stream elements are of type Future[List[Int]]. The argument 1 is the level of parallelism: we specify 1 here because DataSource internally uses a mutable variable, and a parallelism level greater than one could produce unexpected results. identity is shorthand for x => x, which basically means that for each Future, we pass its result downstream without transforming it.
line 3: Essentially, ds.getData is called as long as the result of the Future is a non-empty List[Int]. If an empty List is encountered, processing is terminated.
line 4: runForeach here takes a function List[Int] => Unit and invokes that function for each stream element.
Ideally, I would like to wrap the Future for each batch into a big future, and the wrapper future success when last batch returned 0 size list?
I think you are looking for a Promise.
You would set up a Promise before you start the first iteration.
This gives you promise.future, a Future that you can then use to follow the completion of everything.
In your onComplete, you add a case _ => promise.success(()).
Something like
def loopUntilDone(f: (List[Int]) => Unit): Future[Unit] = {
val promise = Promise[Unit]()
def next(): Unit = source.getData().onComplete {
case Success(v) =>
f(v)
v match {
case h :: t => next()
case _ => promise.success(())
}
case Failure(e) => promise.failure(e)
}
// get going
next()
// return the Future for everything
promise.future
}
// future for everything, this is a `Future[Unit]`
// its `onComplete` will be triggered when there is no more data
val everything = loopUntilDone(process)
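The same loop can also be sketched without an explicit Promise, by letting flatMap chain each batch onto the next; failures then propagate through the returned Future on their own. This is only a sketch: CountingSource is a hypothetical stand-in for the question's DataSource (no random sleeps, batches of 3), and getData is passed in as a function so the example is self-contained.

```scala
import scala.concurrent.{ExecutionContext, Future}

object BatchLoopSketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Hypothetical stand-in for DataSource: `size` non-empty batches, then empty lists.
  class CountingSource(size: Int) {
    private var calls = 0
    def getData(): Future[List[Int]] = synchronized {
      calls += 1
      val n = calls
      Future(if (n <= size) List.fill(3)(n) else Nil)
    }
  }

  // Fetch one batch, process it, and only then decide whether to fetch the next one.
  // The returned Future completes after the first empty batch has been processed.
  def loopUntilDone(getData: () => Future[List[Int]])(f: List[Int] => Unit): Future[Unit] =
    getData().flatMap { batch =>
      f(batch)
      if (batch.nonEmpty) loopUntilDone(getData)(f) else Future.successful(())
    }
}
```

As in the Promise version, f is also called on the final empty batch; move the call inside the nonEmpty branch if that is not wanted.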
You are probably looking for a reactive streams library. My personal favorite (and the one I'm most familiar with) is Monix. This is how it would work with DataSource unchanged:
import scala.concurrent.duration.Duration
import scala.concurrent.Await
import monix.reactive.Observable
import monix.execution.Scheduler.Implicits.global
object Test extends App {
val source = new DataSource(100)
val completed = // <- this is Future[Unit], completes when foreach is done
Observable.repeat(Observable.fromFuture(source.getData()))
.flatten // <- Here it's Observable[List[Int]], it has collection-like methods
.takeWhile(_.nonEmpty)
.foreach(println)
Await.result(completed, Duration.Inf)
}
I just figured out that flatMapConcat achieves what I wanted. There is no point in starting another question since I already have the answer, so I am putting my sample code here in case someone is looking for a similar answer.
This type of API is very common in integrations between traditional enterprise applications. The DataSource mocks the API, while the Test object (which extends App) demonstrates how client code can use Akka Streams to consume it.
In my small project the API was provided via SOAP, and I used scalaxb to turn the SOAP service into Scala async-style calls. With the client calls demonstrated in the Test object, we can consume the API with Akka Streams. Thanks to all for the help.
class DataSource(size: Int) {
private var transactionId: Long = 0
private val transactionCursorMap: mutable.HashMap[TransactionId, Set[ReadCursorId]] = mutable.HashMap.empty
private val cursorIteratorMap: mutable.HashMap[ReadCursorId, Iterator[List[Int]]] = mutable.HashMap.empty
implicit val g = scala.concurrent.ExecutionContext.global
case class TransactionId(id: Long)
case class ReadCursorId(id: Long)
def startTransaction(): Future[TransactionId] = {
Future {
synchronized {
transactionId += 1 // was "+= transactionId", which never moves away from 0
}
val t = TransactionId(transactionId)
transactionCursorMap.update(t, Set(ReadCursorId(0)))
t
}
}
def createCursorId(t: TransactionId): ReadCursorId = {
synchronized {
val c = transactionCursorMap.getOrElseUpdate(t, Set(ReadCursorId(0)))
val currentId = c.foldLeft(0L) { (acc, a) => acc.max(a.id) }
val cId = ReadCursorId(currentId + 1)
transactionCursorMap.update(t, c + cId)
cursorIteratorMap.put(cId, createIterator)
cId
}
}
def createIterator(): Iterator[List[Int]] = {
(for {i <- 1 to 100} yield List.fill(100)(i)).toIterator
}
def startRead(t: TransactionId): Future[ReadCursorId] = {
Future {
createCursorId(t)
}
}
def getData(cursorId: ReadCursorId): Future[List[Int]] = {
synchronized {
Future {
Thread.sleep(Random.nextInt(100))
cursorIteratorMap.get(cursorId) match {
case Some(i) => i.next()
case _ => List()
}
}
}
}
}
object Test extends App {
val source = new DataSource(10)
implicit val system = ActorSystem("Sandbox")
implicit val materializer = ActorMaterializer()
implicit val g = scala.concurrent.ExecutionContext.global
//
// def process(v: List[Int]): Unit = {
// println(v)
// }
//
// def next(f: (List[Int]) => Unit): Unit = {
// val fut = source.getData()
// fut.onComplete {
// case Success(v) => {
// f(v)
// v match {
//
// case h :: t => next(f)
//
// }
// }
//
// }
//
// }
//
// next(process)
//
// Thread.sleep(1000000000)
val s = Source.fromFuture(source.startTransaction())
.map { e =>
source.startRead(e)
}
.mapAsync(1)(identity)
.flatMapConcat(
e => {
Source.fromIterator(() => Iterator.continually(source.getData(e)))
})
.mapAsync(5)(identity)
.via(Flow[List[Int]].takeWhile(_.nonEmpty))
.runForeach(println)
/*
val done = Source.fromIterator(() => Iterator.continually(source.getData())).mapAsync(1)(identity)
.via(Flow[List[Int]].takeWhile(_.nonEmpty))
.runFold(List[List[Int]]()) { (acc, r) =>
// println("=======" + acc + r)
r :: acc
}
done.onSuccess {
case e => {
e.foreach(println)
}
}
done.onComplete(_ => system.terminate())
*/
}

Sequencing Scala Futures with bounded parallelism (without messing around with ExecutorContexts)

Background: I have a function:
def doWork(symbol: String): Future[Unit]
which initiates some side-effects to fetch data and store it, and completes a Future when it's done. However, the back-end infrastructure has usage limits, such that no more than 5 of these requests can be made in parallel. I have a list of N symbols that I need to get through:
var symbols = Array("MSFT",...)
but I want to sequence them such that no more than 5 are executing simultaneously. Given:
val allowableParallelism = 5
my current solution is (assuming I'm working with async/await):
val symbolChunks = symbols.toList.grouped(allowableParallelism).toList
def toThunk(x: List[String]) = () => Future.sequence(x.map(doWork))
val symbolThunks = symbolChunks.map(toThunk)
val done = Promise[Unit]()
def procThunks(x: List[() => Future[List[Unit]]]): Unit = x match {
case Nil => done.success()
case x::xs => x().onComplete(_ => procThunks(xs))
}
procThunks(symbolThunks)
await { done.future }
but, for obvious reasons, I'm not terribly happy with it. I feel like this should be possible with folds, but every time I try, I end up eagerly creating the Futures. I also tried out a version with RxScala Observables, using concatMap, but that also seemed like overkill.
Is there a better way to accomplish this?
Here is an example of how to do it with scalaz-stream. It's quite a lot of code because the Scala Future has to be converted to a scalaz Task (an abstraction for deferred computation), but that conversion only needs to be added to the project once. Another option is to define doWork with Task in the first place. I personally prefer Task for building async programs.
import scala.concurrent.{Future => SFuture}
import scala.util.Random
import scala.concurrent.ExecutionContext.Implicits.global
import scalaz.stream._
import scalaz.concurrent._
val P = scalaz.stream.Process
val rnd = new Random()
def doWork(symbol: String): SFuture[Unit] = SFuture {
Thread.sleep(rnd.nextInt(1000))
println(s"Symbol: $symbol. Thread: ${Thread.currentThread().getName}")
}
val symbols = Seq("AAPL", "MSFT", "GOOGL", "CVX").
flatMap(s => Seq.fill(5)(s).zipWithIndex.map(t => s"${t._1}${t._2}"))
implicit class Transformer[+T](fut: => SFuture[T]) {
def toTask(implicit ec: scala.concurrent.ExecutionContext): Task[T] = {
import scala.util.{Failure, Success}
import scalaz.syntax.either._
Task.async {
register =>
fut.onComplete {
case Success(v) => register(v.right)
case Failure(ex) => register(ex.left)
}
}
}
}
implicit class ConcurrentProcess[O](val process: Process[Task, O]) {
def concurrently[O2](concurrencyLevel: Int)(f: Channel[Task, O, O2]): Process[Task, O2] = {
val actions =
process.
zipWith(f)((data, f) => f(data))
val nestedActions =
actions.map(P.eval)
merge.mergeN(concurrencyLevel)(nestedActions)
}
}
val workChannel = io.channel((s: String) => doWork(s).toTask)
val process = Process.emitAll(symbols).concurrently(5)(workChannel)
process.run.run
Once all of these conversions are in scope, basically all you need is:
val workChannel = io.channel((s: String) => doWork(s).toTask)
val process = Process.emitAll(symbols).concurrently(5)(workChannel)
Quite short and self-describing.
Although you've already got an excellent answer, I thought I might still offer an opinion or two about these matters.
I remember seeing somewhere (on someone's blog) "use actors for state and use futures for concurrency".
So my first thought would be to utilize actors somehow. To be precise, I would have a master actor with a router launching multiple worker actors, with number of workers restrained according to allowableParallelism. So, assuming I have
def doWorkInternal (symbol: String): Unit
which does the work from your doWork taken 'outside of future', I would have something along these lines (very rudimentary, not taking many details into consideration, and practically copying code from the Akka documentation):
import akka.actor._
import akka.routing._ // ActorRefRoutee, Router, RoundRobinRoutingLogic
case class WorkItem (symbol: String)
case class WorkItemCompleted (symbol: String)
case class WorkLoad (symbols: Array[String])
case class WorkLoadCompleted ()
class Worker extends Actor {
def receive = {
case WorkItem (symbol) =>
doWorkInternal (symbol)
sender () ! WorkItemCompleted (symbol)
}
}
class Master extends Actor {
var pending = Set[String] ()
var originator: Option[ActorRef] = None
var router = {
val routees = Vector.fill (allowableParallelism) {
val r = context.actorOf(Props[Worker])
context watch r
ActorRefRoutee(r)
}
Router (RoundRobinRoutingLogic(), routees)
}
def receive = {
case WorkLoad (symbols) =>
originator = Some (sender ())
context become processing
for (symbol <- symbols) {
router.route (WorkItem (symbol), self)
pending += symbol
}
}
def processing: Receive = {
case Terminated (a) =>
router = router.removeRoutee(a)
val r = context.actorOf(Props[Worker])
context watch r
router = router.addRoutee(r)
case WorkItemCompleted (symbol) =>
pending -= symbol
if (pending.size == 0) {
context become receive
originator.get ! WorkLoadCompleted() // send an instance, not the companion object
}
}
}
You could query the master actor with ask and receive a WorkLoadCompleted in a future.
But thinking more about 'state' (of number of simultaneous requests in processing) to be hidden somewhere, together with implementing necessary code for not exceeding it, here's something of the 'future gateway intermediary' sort, if you don't mind imperative style and mutable (used internally only though) structures:
object Guardian
{
private val incoming = new collection.mutable.HashMap[String, Promise[Unit]]()
private val outgoing = new collection.mutable.HashMap[String, Future[Unit]]()
private val pending = new collection.mutable.Queue[String]
def doWorkGuarded (symbol: String): Future[Unit] = {
synchronized {
val p = Promise[Unit] ()
incoming(symbol) = p
if (incoming.size <= allowableParallelism)
launchWork (symbol)
else
pending.enqueue (symbol)
p.future
}
}
private def completionHandler (t: Try[Unit]): Unit = {
synchronized {
for (symbol <- outgoing.keySet.toList) { // copy the keys: the map is modified inside the loop
val f = outgoing (symbol)
if (f.isCompleted) {
incoming (symbol).completeWith (f)
incoming.remove (symbol)
outgoing.remove (symbol)
}
}
for (i <- outgoing.size until allowableParallelism) { // "to" is inclusive and could launch one task too many
if (pending.nonEmpty) {
val symbol = pending.dequeue()
launchWork (symbol)
}
}
}
}
private def launchWork (symbol: String): Unit = {
val f = doWork(symbol)
outgoing(symbol) = f
f.onComplete(completionHandler)
}
}
doWork now is exactly like yours, returning Future[Unit], with the idea that instead of using something like
val futures = symbols.map (doWork (_)).toSeq
val future = Future.sequence(futures)
which would launch futures not regarding allowableParallelism at all, I would instead use
val futures = symbols.map (Guardian.doWorkGuarded (_)).toSeq
val future = Future.sequence(futures)
Think about some hypothetical database access driver with non-blocking interface, i.e. returning futures on requests, which is limited in concurrency by being built over some connection pool for example - you wouldn't want it to return futures not taking parallelism level into account, and require you to juggle with them to keep parallelism under control.
This example is more illustrative than practical, since I wouldn't normally expect the 'outgoing' interface to use futures like this (which is quite OK for the 'incoming' interface).
First, some purely functional wrapper around Scala's Future is needed, because a Future is side-effecting and starts running as soon as it is created. Let's call it Deferred:
import scala.concurrent.Future
import scala.util.control.Exception.nonFatalCatch
class Deferred[+T](f: () => Future[T]) {
def run(): Future[T] = f()
}
object Deferred {
def apply[T](future: => Future[T]): Deferred[T] =
new Deferred(() => nonFatalCatch.either(future).fold(Future.failed, identity))
}
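To see why such a wrapper matters, the following sketch contrasts defining a deferred computation with running it. It uses a plain () => Future[Int] thunk as a simplified stand-in for Deferred (without the nonFatalCatch handling), and EagerVsDeferredSketch, started and work are hypothetical names.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object EagerVsDeferredSketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Counts how many times the side-effecting work has actually been started.
  val started = new AtomicInteger(0)

  // Calling work() constructs a Future, and constructing a Future starts the work.
  def work(): Future[Int] = Future(started.incrementAndGet())

  // A thunk only describes the work; nothing runs until it is invoked.
  val deferred: () => Future[Int] = () => work()

  def main(args: Array[String]): Unit = {
    println(started.get()) // the thunk has been defined, but the work has not started
    Await.result(deferred(), 5.seconds)
    println(started.get()) // the work ran exactly once, when the thunk was invoked
  }
}
```

This is what lets a scheduler hold a whole list of pending operations and decide for itself when each one starts.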
And here is the routine:
import java.util.concurrent.CopyOnWriteArrayList
import java.util.concurrent.atomic.AtomicInteger
import scala.collection.immutable.Seq
import scala.concurrent.{ExecutionContext, Future, Promise}
import scala.util.control.Exception.nonFatalCatch
import scala.util.{Failure, Success}
trait ConcurrencyUtils {
def runWithBoundedParallelism[T](parallelism: Int = Runtime.getRuntime.availableProcessors())
(operations: Seq[Deferred[T]])
(implicit ec: ExecutionContext): Deferred[Seq[T]] =
if (parallelism > 0) Deferred {
val indexedOps = operations.toIndexedSeq // index for faster access
val promise = Promise[Seq[T]]()
val acc = new CopyOnWriteArrayList[(Int, T)] // concurrent acc
val nextIndex = new AtomicInteger(parallelism) // keep track of the next index atomically
def run(operation: Deferred[T], index: Int): Unit = {
operation.run().onComplete {
case Success(value) =>
acc.add((index, value)) // accumulate result value
if (acc.size == indexedOps.size) { // we're done
import scala.collection.JavaConversions._
// in concurrent setting next line may be called multiple times, that's why trySuccess instead of success
promise.trySuccess(acc.view.sortBy(_._1).map(_._2).toList)
} else {
val next = nextIndex.getAndIncrement() // get and inc atomically
if (next < indexedOps.size) { // run next operation if exists
run(indexedOps(next), next)
}
}
case Failure(t) =>
promise.tryFailure(t) // same here (may be called multiple times, let's prevent stdout pollution)
}
}
if (operations.nonEmpty) {
indexedOps.view.take(parallelism).zipWithIndex.foreach((run _).tupled) // run as much as allowed
promise.future
} else {
Future.successful(Seq.empty)
}
} else {
throw new IllegalArgumentException("Parallelism must be positive")
}
}
In a nutshell, we initially run as many operations as allowed, and then on each completion we run the next available operation, if any. So the only difficulty here is maintaining the next operation index and the results accumulator in a concurrent setting. I'm not a concurrency expert, so let me know if there are potential problems in the code above. Notice that the returned value is also a deferred computation that has to be run.
Usage and test:
import org.scalatest.{Matchers, FlatSpec}
import org.scalatest.concurrent.ScalaFutures
import org.scalatest.time.{Seconds, Span}
import scala.collection.immutable.Seq
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import scala.concurrent.duration._
class ConcurrencyUtilsSpec extends FlatSpec with Matchers with ScalaFutures with ConcurrencyUtils {
"runWithBoundedParallelism" should "return results in correct order" in {
val comp1 = mkDeferredComputation(1)
val comp2 = mkDeferredComputation(2)
val comp3 = mkDeferredComputation(3)
val comp4 = mkDeferredComputation(4)
val comp5 = mkDeferredComputation(5)
val compoundComp = runWithBoundedParallelism(2)(Seq(comp1, comp2, comp3, comp4, comp5))
whenReady(compoundComp.run()) { result =>
result should be (Seq(1, 2, 3, 4, 5))
}
}
// increase default ScalaTest patience
implicit val defaultPatience = PatienceConfig(timeout = Span(10, Seconds))
private def mkDeferredComputation[T](result: T, sleepDuration: FiniteDuration = 100.millis): Deferred[T] =
Deferred {
Future {
Thread.sleep(sleepDuration.toMillis)
result
}
}
}
Use Monix Task. An example from the Monix documentation, for parallelism = 10:
val items = 0 until 1000
// The list of all tasks needed for execution
val tasks = items.map(i => Task(i * 2))
// Building batches of 10 tasks to execute in parallel:
val batches = tasks.sliding(10,10).map(b => Task.gather(b))
// Sequencing batches, then flattening the final result
val aggregate = Task.sequence(batches).map(_.flatten.toList)
// Evaluation:
aggregate.foreach(println)
//=> List(0, 2, 4, 6, 8, 10, 12, 14, 16,...
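The same batching idea can be sketched with plain Scala Futures, as long as each Future is only created inside flatMap so that nothing starts early. This is an illustration with the standard library rather than the Monix API; BatchedFuturesSketch and inBatches are hypothetical names.

```scala
import scala.concurrent.{ExecutionContext, Future}

object BatchedFuturesSketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Runs `work` over `items` in consecutive batches of at most `parallelism` elements.
  // A batch is started only after the previous batch has fully completed.
  def inBatches[A, B](items: Seq[A], parallelism: Int)(work: A => Future[B]): Future[List[B]] =
    items.grouped(parallelism).foldLeft(Future.successful(List.empty[B])) { (acc, batch) =>
      acc.flatMap { done =>
        // `work` is only called here, once the previous batches are done.
        Future.traverse(batch.toList)(work).map(done ++ _)
      }
    }
}
```

With the question's names this would be used as inBatches(symbols.toList, allowableParallelism)(doWork). As with any batching approach, each batch waits for its slowest member, so fewer than the allowed number of requests may be in flight towards the end of a batch.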
Akka Streams allows you to do the following:
import akka.NotUsed
import akka.stream.Materializer
import akka.stream.scaladsl.Source
import scala.concurrent.Future
def sequence[A: Manifest, B](items: Seq[A], func: A => Future[B], parallelism: Int)(
implicit mat: Materializer
): Future[Seq[B]] = {
val futures: Source[B, NotUsed] =
Source[A](items.toList).mapAsync(parallelism)(x => func(x))
futures.runFold(Seq.empty[B])(_ :+ _)
}
sequence(symbols, doWork, allowableParallelism)

waiting for "recursive" futures in scala

a simple code sample that describes my problem:
import scala.util._
import scala.concurrent._
import scala.concurrent.duration._
import ExecutionContext.Implicits.global
class LoserException(msg: String, dice: Int) extends Exception(msg) { def diceRoll: Int = dice }
def aPlayThatMayFail: Future[Int] = {
Thread.sleep(1000) //throwing a dice takes some time...
//throw a dice:
(1 + Random.nextInt(6)) match {
case 6 => Future.successful(6) //I win!
case i: Int => Future.failed(new LoserException("I did not get 6...", i))
}
}
def win(prefix: String): String = {
val futureGameLog = aPlayThatMayFail
futureGameLog.onComplete(t => t match {
case Success(diceRoll) => "%s, and finally, I won! I rolled %d !!!".format(prefix, diceRoll)
case Failure(e) => e match {
case ex: LoserException => win("%s, and then i got %d".format(prefix, ex.diceRoll))
case _: Throwable => "%s, and then somebody cheated!!!".format(prefix)
}
})
"I want to do something like futureGameLog.waitForRecursiveResult, using Await.result or something like that..."
}
win("I started playing the dice")
This simple example illustrates what I want to do. Basically, to put it in words, I want to wait for the result of some computation, where I compose different actions depending on whether previous attempts succeeded or failed.
So how would you implement the win method?
My "real world" problem, if it makes any difference, is using dispatch for asynchronous HTTP calls, where I want to keep making HTTP calls whenever the previous one ends, but the actions differ on whether the previous HTTP call succeeded or not.
You can recover your failed future with a recursive call:
def foo(x: Int) = x match {
case 10 => Future.successful(x)
case _ => Future.failed[Int](new Exception)
}
def bar(x: Int): Future[Int] = {
foo(x) recoverWith { case _ => bar(x+1) }
}
scala> bar(0)
res0: scala.concurrent.Future[Int] = scala.concurrent.impl.Promise$DefaultPromise@64d6601
scala> res0.value
res1: Option[scala.util.Try[Int]] = Some(Success(10))
recoverWith takes a PartialFunction[Throwable,scala.concurrent.Future[A]] and returns a Future[A]. You should be careful though, because it will use quite some memory when it does lots of recursive calls here.
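Applied to the dice game from the question, the same pattern might look like the sketch below. It is only an illustration: the play function stands in for aPlayThatMayFail and is passed as a parameter (an assumption made here so the rolls can be scripted), and win returns a Future[String] instead of blocking.

```scala
import scala.concurrent.{ExecutionContext, Future}

object RecoverWithSketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  class LoserException(msg: String, val diceRoll: Int) extends Exception(msg)

  // Plays until a roll succeeds, accumulating the game log in `prefix`.
  def win(prefix: String, play: () => Future[Int]): Future[String] =
    play()
      .map(roll => s"$prefix, and finally, I won! I rolled $roll !!!")
      .recoverWith {
        // A losing roll: extend the log and try again.
        case ex: LoserException => win(s"$prefix, and then i got ${ex.diceRoll}", play)
        // Any other failure ends the game with the "cheated" message.
        case _ => Future.successful(s"$prefix, and then somebody cheated!!!")
      }
}
```

The caller can then either block with Await.result(RecoverWithSketch.win("I started playing the dice", () => aPlayThatMayFail), someTimeout) or attach a callback to the returned Future.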
As drexin answered the part about exception handling and recovering, let me try and answer the part about a recursive function involving futures. I believe using a Promise will help you achieve your goal. The restructured code would look like this:
def win(prefix: String): String = {
val prom = Promise[String]()
def doWin(p:String) {
val futureGameLog = aPlayThatMayFail
futureGameLog.onComplete(t => t match {
case Success(diceRoll) => prom.success("%s, and finally, I won! I rolled %d !!!".format(p, diceRoll))
case Failure(e) => e match {
case ex: LoserException => doWin("%s, and then i got %d".format(p, ex.diceRoll)) // use p, the accumulated log, not the original prefix
case other => prom.failure(new Exception("%s, and then somebody cheated!!!".format(p)))
}
})
}
doWin(prefix)
Await.result(prom.future, someTimeout)
}
Now this isn't true recursion in the sense of building up one long call stack, because the futures are async, but it is similar to recursion in spirit. Using the promise here gives you something to block against while the recursion does its thing, hiding from the caller what's happening behind the scenes.
Now, if I were doing this, I would probably redefine things like so:
def win(prefix: String): Future[String] = {
val prom = Promise[String]()
def doWin(p:String) {
val futureGameLog = aPlayThatMayFail
futureGameLog.onComplete(t => t match {
case Success(diceRoll) => prom.success("%s, and finally, I won! I rolled %d !!!".format(p, diceRoll))
case Failure(e) => e match {
case ex: LoserException => doWin("%s, and then i got %d".format(p, ex.diceRoll)) // use p, the accumulated log, not the original prefix
case other => prom.failure(new Exception("%s, and then somebody cheated!!!".format(p)))
}
})
}
doWin(prefix)
prom.future
}
This way you can defer the decision on whether to block or use async callbacks to the caller of this function. This is more flexible, but it also exposes the caller to the fact that you are doing async computations and I'm not sure that is going to be acceptable for your scenario. I'll leave that decision up to you.
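To make the "caller decides" point concrete, here is a minimal, self-contained sketch. LoserException and aPlayThatMayFail are hypothetical stand-ins invented for illustration (the first three plays lose, later ones win with a 6); only the shape of win follows the code above.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.{Failure, Success}

object WinSketch {
  // Hypothetical stand-in: a losing roll carries its dice value.
  final case class LoserException(diceRoll: Int) extends Exception("lost with " + diceRoll)

  // Hypothetical game: the first three plays lose (rolls 1, 2, 3), later plays win with a 6.
  private val plays = new AtomicInteger(0)
  def aPlayThatMayFail: Future[Int] = {
    val n = plays.incrementAndGet()
    if (n <= 3) Future.failed(LoserException(n)) else Future.successful(6)
  }

  // Same shape as above: retry inside the callback, complete the promise at the end.
  def win(prefix: String): Future[String] = {
    val prom = Promise[String]()
    def doWin(p: String): Unit =
      aPlayThatMayFail.onComplete {
        case Success(roll) => prom.success("%s, and finally, I won! I rolled %d !!!".format(p, roll))
        case Failure(ex: LoserException) => doWin("%s, and then i got %d".format(p, ex.diceRoll))
        case Failure(_) => prom.failure(new Exception("%s, and then somebody cheated!!!".format(p)))
      }
    doWin(prefix)
    prom.future
  }

  def main(args: Array[String]): Unit = {
    val story = win("I sat down to play")
    // One caller stays asynchronous and only registers a callback
    // (it may not get to print if the JVM exits first).
    story.foreach(s => println("async: " + s))
    // Another caller chooses to block, with a timeout of its own choosing.
    println("blocking: " + Await.result(story, 5.seconds))
  }
}
```

Both callers share the same Future; the function itself never blocks.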
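This works for me (after and Scheduler here appear to be Akka's akka.pattern.after and akka.actor.Scheduler):
import akka.actor.Scheduler
import akka.pattern.after
// Retry f up to `retries` more times, waiting `delay` before each new attempt
def retryWithFuture[T](f: => Future[T], retries: Int, delay: FiniteDuration)(implicit ec: ExecutionContext, s: Scheduler): Future[T] = {
  f.recoverWith { case _ if retries > 0 => after[T](delay, s)(retryWithFuture[T](f, retries - 1, delay)) }
}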
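If Akka is not on the classpath, the same retry-with-delay idea can be sketched with the standard library alone. The after helper below is a hypothetical substitute built on a ScheduledExecutorService, not Akka's implementation, and flaky is an invented operation that fails twice before succeeding.

```scala
import java.util.concurrent.{Executors, ScheduledExecutorService, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, ExecutionContext, Future, Promise}
import scala.concurrent.duration._
import scala.util.control.NonFatal

object RetrySketch {
  // Hypothetical stand-in for a delayed future: evaluates `body` once `delay` has elapsed.
  def after[T](delay: FiniteDuration, s: ScheduledExecutorService)(body: => Future[T]): Future[T] = {
    val p = Promise[T]()
    s.schedule(new Runnable {
      def run(): Unit =
        p.completeWith(try body catch { case NonFatal(e) => Future.failed(e) })
    }, delay.toMillis, TimeUnit.MILLISECONDS)
    p.future
  }

  // Re-evaluates the by-name `f` after each failure until it succeeds or retries run out.
  def retryWithFuture[T](f: => Future[T], retries: Int, delay: FiniteDuration)
                        (implicit ec: ExecutionContext, s: ScheduledExecutorService): Future[T] =
    f.recoverWith { case _ if retries > 0 => after(delay, s)(retryWithFuture(f, retries - 1, delay)) }

  def main(args: Array[String]): Unit = {
    implicit val ec: ExecutionContext = ExecutionContext.global
    implicit val scheduler: ScheduledExecutorService = Executors.newSingleThreadScheduledExecutor()
    val calls = new AtomicInteger(0)
    // Hypothetical flaky operation: fails on the first two calls, then succeeds.
    def flaky: Future[Int] =
      if (calls.incrementAndGet() < 3) Future.failed(new RuntimeException("try again"))
      else Future.successful(42)
    try println(Await.result(retryWithFuture(flaky, retries = 5, delay = 50.millis), 5.seconds))
    finally scheduler.shutdown()
  }
}
```

Because f is passed by name, every retry starts a fresh attempt rather than re-reading the already failed future.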

How do I rewrite a for loop with a shared dependency using actors

We have some code which needs to run faster. It has already been profiled, so we would like to make use of multiple threads. Usually I would set up an in-memory queue, and have a number of threads taking jobs off the queue and calculating the results. For the shared data I would use a ConcurrentHashMap or similar.
I don't really want to go down that route again. From what I have read, using actors will result in cleaner code, and if I use Akka, migrating to more than one JVM should be easier. Is that true?
However, I don't know how to think in actors, so I am not sure where to start.
To give a better idea of the problem here is some sample code:
case class Trade(price:Double, volume:Int, stock:String) {
def value(priceCalculator:PriceCalculator) =
(priceCalculator.priceFor(stock) - price) * volume
}
class PriceCalculator {
def priceFor(stock:String) = {
Thread.sleep(20)//a slow operation which can be cached
50.0
}
}
object ValueTrades {
def valueAll(trades:List[Trade],
priceCalculator:PriceCalculator):List[(Trade,Double)] = {
trades.map { trade => (trade,trade.value(priceCalculator)) }
}
def main(args:Array[String]) {
val trades = List(
Trade(30.5, 10, "Foo"),
Trade(30.5, 20, "Foo")
//usually much longer
)
val priceCalculator = new PriceCalculator
val values = valueAll(trades, priceCalculator)
}
}
I'd appreciate it if someone with experience using actors could suggest how this would map on to actors.
This is a complement to my comment on shared results for expensive calculations. Here it is:
import scala.actors._
import Actor._
import Futures._
case class PriceFor(stock: String) // Ask for result
// The following could be an "object" as well, if it's supposed to be singleton
class PriceCalculator extends Actor {
val map = new scala.collection.mutable.HashMap[String, Future[Double]]()
def act = loop {
react {
case PriceFor(stock) => reply(map getOrElseUpdate (stock, future {
Thread.sleep(2000) // a slow operation
50.0
}))
}
}
}
Here's a usage example:
scala> val pc = new PriceCalculator; pc.start
pc: PriceCalculator = PriceCalculator@141fe06
scala> class Test(stock: String) extends Actor {
| def act = {
| println(System.currentTimeMillis().toString+": Asking for stock "+stock)
| val f = (pc !? PriceFor(stock)).asInstanceOf[Future[Double]]
| println(System.currentTimeMillis().toString+": Got the future back")
| val res = f.apply() // this blocks until the result is ready
| println(System.currentTimeMillis().toString+": Value: "+res)
| }
| }
defined class Test
scala> List("abc", "def", "abc").map(new Test(_)).map(_.start)
1269310737461: Asking for stock abc
res37: List[scala.actors.Actor] = List(Test@6d888e, Test@1203c7f, Test@163d118)
1269310737461: Asking for stock abc
1269310737461: Asking for stock def
1269310737464: Got the future back
scala> 1269310737462: Got the future back
1269310737465: Got the future back
1269310739462: Value: 50.0
1269310739462: Value: 50.0
1269310739465: Value: 50.0
scala> new Test("abc").start // Should return instantly
1269310755364: Asking for stock abc
res38: scala.actors.Actor = Test@15b5b68
1269310755365: Got the future back
scala> 1269310755367: Value: 50.0
For simple parallelization, where I throw a bunch of work out to process and then wait for it all to come back, I tend to like to use a Futures pattern.
class ActorExample {
import actors._
import Actor._
class Worker(val id: Int) extends Actor {
def busywork(i0: Int, i1: Int) = {
var sum,i = i0
while (i < i1) {
i += 1
sum += 42*i
}
sum
}
def act() { loop { react {
case (i0:Int,i1:Int) => sender ! busywork(i0,i1)
case None => exit()
}}}
}
val workforce = (1 to 4).map(i => new Worker(i)).toList
def parallelFourSums = {
workforce.foreach(_.start())
val futures = workforce.map(w => w !! ((w.id,1000000000)) );
val computed = futures.map(f => f() match {
case i:Int => i
case _ => throw new IllegalArgumentException("I wanted an int!")
})
workforce.foreach(_ ! None)
computed
}
def serialFourSums = {
val solo = workforce.head
workforce.map(w => solo.busywork(w.id,1000000000))
}
def timed(f: => List[Int]) = {
val t0 = System.nanoTime
val result = f
val t1 = System.nanoTime
(result, t1-t0)
}
def go {
val serial = timed( serialFourSums )
val parallel = timed( parallelFourSums )
println("Serial result: " + serial._1)
println("Parallel result:" + parallel._1)
printf("Serial took %.3f seconds\n",serial._2*1e-9)
printf("Parallel took %.3f seconds\n",parallel._2*1e-9)
}
}
Basically, the idea is to create a collection of workers--one per workload--and then throw all the data at them with !! which immediately gives back a future. When you try to read the future, the sender blocks until the worker's actually done with the data.
You could rewrite the above so that PriceCalculator extended Actor instead, and valueAll coordinated the return of the data.
Note that you have to be careful passing non-immutable data around.
Anyway, on the machine I'm typing this from, if you run the above you get:
scala> (new ActorExample).go
Serial result: List(-1629056553, -1629056636, -1629056761, -1629056928)
Parallel result:List(-1629056553, -1629056636, -1629056761, -1629056928)
Serial took 1.532 seconds
Parallel took 0.443 seconds
(Obviously I have at least four cores; the parallel timing varies rather a bit depending on which worker gets what processor and what else is going on on the machine.)
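Since scala.actors has been deprecated since Scala 2.10, here is a rough sketch of the same two ideas (cache the slow price lookup as a shared future, then fan the trades out and collect the results) using scala.concurrent instead. This is a swapped-in technique rather than the actor code above; the slowCalls counter and the assumption that a trade's value is (calculated price - trade price) * volume are illustrative.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.collection.concurrent.TrieMap
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ValueTradesSketch {
  final case class Trade(price: Double, volume: Int, stock: String)

  class PriceCalculator {
    // Illustrative counter so the caching can be observed.
    val slowCalls = new AtomicInteger(0)
    private val cache = TrieMap.empty[String, Future[Double]]

    private def slowPriceFor(stock: String): Double = {
      slowCalls.incrementAndGet()
      Thread.sleep(20) // a slow operation which can be cached
      50.0
    }

    // Only the caller that wins putIfAbsent starts the slow computation;
    // everyone else receives the same pending future.
    def priceFor(stock: String): Future[Double] = {
      val p = Promise[Double]()
      cache.putIfAbsent(stock, p.future) match {
        case Some(existing) => existing
        case None =>
          p.completeWith(Future(slowPriceFor(stock)))
          p.future
      }
    }
  }

  // Assumed valuation: (calculated price - trade price) * volume.
  def valueAll(trades: List[Trade], calc: PriceCalculator): Future[List[(Trade, Double)]] =
    Future.traverse(trades) { t =>
      calc.priceFor(t.stock).map(p => (t, (p - t.price) * t.volume))
    }

  def main(args: Array[String]): Unit = {
    val trades = List(Trade(30.5, 10, "Foo"), Trade(30.5, 20, "Foo"))
    val calc = new PriceCalculator
    val values = Await.result(valueAll(trades, calc), 5.seconds)
    values.foreach(println)
    println("slow lookups: " + calc.slowCalls.get)
  }
}
```

The cache stores futures rather than finished prices, so a second request for a stock that is still being priced waits on the in-flight computation instead of starting another one.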