Cannot execute parallel computation with functional API - scala

I am trying to implement the countWords function from the parallelism chapter of the red book (Functional Programming in Scala). When I pass a thread pool to the function and modify it to print the thread counting the words, I only ever see the main thread printed. This tells me that I am not able to make this function execute in parallel.
What I currently have:
type Par[A] = ExecutorService => Future[A]
def asyncF[A, B](f: A => B): A => Par[B] = a => lazyUnit(f(a))
def lazyUnit[A](a: => A): Par[A] = fork(unit(a))
def unit[A](a: A): Par[A] = (_: ExecutorService) => UnitFuture(a)
def fork[A](a: => Par[A]): Par[A] =
  es => es.submit(new Callable[A] {
    def call = a(es).get
  })
def countWords(l: List[String]): Par[Int] = map(sequence(l.map(asyncF {
  println(Thread.currentThread())
  s => s.split(" ").length
})))(_.sum)
When I run:
val listPar = List("ab cd", "hg ks", "lh ks", "lh hs")
val es = Executors.newFixedThreadPool(4)
val counts = countWords(listPar)(es)
println(counts.get(100, SECONDS))
I get:
Thread[main,5,main]
8
I would expect to see a thread printed for each element of the list (there are four elements and a thread pool of size 4), however I can only see the main thread printed.
Any suggestions?
Thanks

I want to start with one piece of advice on asking questions - you should always provide an MCVE. Your code doesn't compile; for example, I have no idea where UnitFuture comes from, and I have no idea what implementation of sequence you're using, etc.
Here is a snippet that works with standard Scala. First, the explanation:
The countWords method takes a list of strings to count, and also two services - one for handling Java Futures on different threads, and one for handling Scala Futures on different threads. The Scala one is derived from the Java one via the ExecutionContext.fromExecutor method.
Why both Java and Scala? Well, I wanted to preserve Java because that's how you initially wrote your code, but I don't know how to sequence a Java Future. So what I did was:
for each string in the list:
fork a Java Future task
turn it into a Scala Future
sequence the obtained list of Scala Futures
In case you're not familiar with implicits yet, you will be (if you intend to keep working with Scala). Here I pass the execution context implicitly because it removes a lot of boilerplate - this way I don't have to pass it explicitly when converting to a Scala Future, when mapping / sequencing, etc. (see the commented note after scalaFromJavaFuture in the code below).
And now the code itself:
import java.util.concurrent.{Callable, ExecutorService, Executors}
import java.util.concurrent.{Future => JFuture}
import scala.concurrent.{ExecutionContext, Future}
def scalaFromJavaFuture[A](
  javaFuture: JFuture[A]
)(implicit ec: ExecutionContext): Future[A] =
  Future { javaFuture.get }(ec)
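
// Note on implicits: with an implicit ExecutionContext in scope, these two calls are
// equivalent (hypothetical javaFuture value, shown only to illustrate the point above):
//   scalaFromJavaFuture(javaFuture)      // ec resolved implicitly
//   scalaFromJavaFuture(javaFuture)(ec)  // ec passed explicitly
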
def fork(s: String)(es: ExecutorService): java.util.concurrent.Future[Int] =
  es.submit(new Callable[Int] {
    def call = {
      println(s"Thread: ${Thread.currentThread()}, processing string: $s")
      s.split(" ").size
    }
  })
def countWords(l: List[String])(es: ExecutorService)(implicit ec: ExecutionContext): Future[Int] = {
  val listOfFutures = l.map(elem => scalaFromJavaFuture(fork(elem)(es)))
  Future.sequence(listOfFutures).map(_.sum)
}
val listPar = List("ab cd", "hg ks", "lh ks", "lh hs")
val es = Executors.newFixedThreadPool(4)
implicit val ec = ExecutionContext.fromExecutor(es)
val counts = countWords(listPar)(es)
counts.onComplete(println)
Example output:
Thread: Thread[pool-1-thread-1,5,main], processing string: ab cd
Thread: Thread[pool-1-thread-3,5,main], processing string: hg ks
Thread: Thread[pool-1-thread-2,5,main], processing string: lh ks
Thread: Thread[pool-1-thread-4,5,main], processing string: lh hs
Success(8)
Note that it's up to the execution context to determine which threads are used. Run it a couple of times and you will see for yourself - you might end up with, e.g., only two threads being used:
Thread: Thread[pool-1-thread-1,5,main], processing string: ab cd
Thread: Thread[pool-1-thread-3,5,main], processing string: hg ks
Thread: Thread[pool-1-thread-1,5,main], processing string: lh ks
Thread: Thread[pool-1-thread-1,5,main], processing string: lh hs
Success(8)

Related

How to Promise.allSettled with Scala futures?

I have two scala futures. I want to perform an action once both are completed, regardless of whether they were completed successfully. (Additionally, I want the ability to inspect those results at that time.)
In Javascript, this is Promise.allSettled.
Does Scala offer a simple way to do this?
One last wrinkle, if it matters: I want to do this in a JRuby application.
You can use the transform method to create a Future that will always succeed and return the result or the error as a Try object.
def toTry[A](future: Future[A])(implicit ec: ExecutionContext): Future[Try[A]] =
  future.transform(x => Success(x))
To combine two Futures into one, you can use zip:
def settle2[A, B](fa: Future[A], fb: Future[B])(implicit ec: ExecutionContext)
    : Future[(Try[A], Try[B])] =
  toTry(fa).zip(toTry(fb))
If you want to combine an arbitrary number of Futures this way, you can use Future.traverse:
def allSettled[A](futures: List[Future[A]])(implicit ec: ExecutionContext)
    : Future[List[Try[A]]] =
  Future.traverse(futures)(toTry(_))
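For illustration, a small usage sketch of allSettled (the values and the use of Await are only for demonstration; it assumes the global execution context is in scope):
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

val futures = List(
  Future(1),
  Future[Int](throw new RuntimeException("boom")),
  Future(3)
)

println(Await.result(allSettled(futures), 5.seconds))
// List(Success(1), Failure(java.lang.RuntimeException: boom), Success(3))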
Normally in this case we would use Future.sequence to transform a collection of Futures into one single Future that you can map over, but Scala short-circuits on the first failed Future and doesn't wait for anything after that (one failure is considered a failure for all), which doesn't fit your case.
In this case you need to map the failed ones to successful Futures first, then do the sequence, e.g.:
val settledFuture = Future.sequence(
  List(future1, future2, ...).map(_.recoverWith { case _ => Future.unit })
)
settledFuture.map { _ => /* here everything is settled */ }
EDIT
Since the results need to be kept, instead of mapping to Future.unit, we map the actual result into another layer of Try:
val settledFuture = Future.sequence(
  List(Future(1), Future(throw new Exception))
    .map(_.map(Success(_)).recover { case e => Failure(e) })
)
settledFuture.map(println(_))
//Output: List(Success(1), Failure(java.lang.Exception))
EDIT2
It can be further simplified with transform:
Future.sequence(listOfFutures.map(_.transform(Success(_))))
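For example (values are illustrative; Await is used only to print the result):
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.util.Success

val listOfFutures = List(Future(1), Future[Int](throw new Exception("boom")))
val settled = Future.sequence(listOfFutures.map(_.transform(Success(_))))

println(Await.result(settled, 5.seconds))
// List(Success(1), Failure(java.lang.Exception: boom))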
Perhaps you could use a concurrent counter to keep track of the number of completed Futures and then complete the Promise once all Futures have completed
def allSettled[T](futures: List[Future[T]]): Future[List[Future[T]]] = {
  val p = Promise[List[Future[T]]]()
  val length = futures.length
  val completedCount = new AtomicInteger(0)
  futures foreach {
    _.onComplete { _ =>
      if (completedCount.incrementAndGet == length) p.trySuccess(futures)
    }
  }
  p.future
}
val futures = List(
  Future(-11),
  Future(throw new Exception("boom")),
  Future(42)
)
allSettled(futures).andThen(println(_))
// Success(List(Future(Success(-11)), Future(Failure(java.lang.Exception: boom)), Future(Success(42))))
scastie

Dealing with impure side-effects in FP, IO Monad

Trying to understand how best to deal with side-effects in FP.
I implemented this rudimentary IO implementation:
import scala.io.{Source, StdIn}

trait IO[A] {
  def run: A
}

object IO {
  def unit[A](a: => A): IO[A] = new IO[A] { def run = a }

  def loadFile(fileResourcePath: String) = IO.unit[List[String]] {
    Source.fromResource(fileResourcePath).getLines.toList
  }

  def printMessage(message: String) = IO.unit[Unit] { println(message) }

  def readLine(message: String) = IO.unit[String] { StdIn.readLine() }
}
I have the following use case:
- load lines from log file
- parse each line to BusinessType object
- process each BusinessType object
- print process result
Case 1:
So Scala code may look like this
val load: String => List[String]
val parse: List[String] => List[BusinessType]
val process: List[BusinessType] => String
val output: String => Unit
Case 2:
I decide to use IO above:
val load: String => IO[List[String]]
val parse: IO[List[String]] => List[BusinessType]
val process: List[BusinessType] => IO[Unit]
val output: IO[Unit] => Unit
In case 1, load is impure because it reads from a file, and output is also impure because it writes the result to the console.
To be more functional I use case 2.
Questions:
- Aren't case 1 and 2 really the same thing?
- In case 2 aren't we just delaying the inevitable, since the parse function will need to call the io.run method and cause a side-effect?
- When they say "leave side-effects until the end of the world", how does this apply to the example above? Where is the end of the world here?
Your IO monad seems to lack all the monad stuff, namely the part where you can flatMap over it to build bigger IOs out of smaller ones. That way, everything stays "pure" until you call run at the very end.
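For illustration, a minimal sketch of what that could look like (loosely following the red book's IO; not production code):
trait IO[A] { self =>
  def run: A
  def map[B](f: A => B): IO[B] = new IO[B] { def run = f(self.run) }
  def flatMap[B](f: A => IO[B]): IO[B] = new IO[B] { def run = f(self.run).run }
}

object IO {
  def unit[A](a: => A): IO[A] = new IO[A] { def run = a }
}

// Composing IO values performs no effects; nothing happens until `run` is called:
val greet: IO[Unit] =
  IO.unit(scala.io.StdIn.readLine())
    .map(_.trim)
    .flatMap(name => IO.unit(println(s"Hello, $name")))

// greet.run  // the single effectful call, at the "end of the world"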
In case 2 aren't we just delaying the inevitable?
as the parse function will need call the io.run
method and cause a side effect?
No. The parse function should not call io.run. It should return another IO that you can then combine with its input IO.
when they say "leave side-effects until the end of the world"
how does this apply to the example above? where is the
end of the world here?
The end of the world would be the last thing your program does. You only call run once. The rest of your program "purely" builds one giant IO for that single call.
Something like
def load(): IO[Seq[String]]
def parse(data: Seq[String]): IO[Parsed]   // returns IO, because it has side-effects
def pureComputation(data: Parsed): Result  // no side-effects, no need to use IO
def output(data: Result): IO[Unit]

// combining effects is "pure", so the whole thing
// can be a `val` (or a `def` if it takes some input params)
val program: IO[Unit] = for {
  data   <- load()                    // use <- to "map" over IO
  parsed <- parse(data)
  result  = pureComputation(parsed)   // = instead of <-, no I/O here
  _      <- output(result)
} yield ()

// only `run` at the end produces any effects
def main(): Unit = {
  program.run
}

Multiple futures that may fail - returning both successes and failures?

I have a situation where I need to run a bunch of operations in parallel.
All operations have the same return value (say a Seq[String]).
It's possible that some of the operations may fail while others successfully return results.
I want to return both the successful results, and any exceptions that happened, so I can log them for debugging.
Is there a built-in way, or an easy way through any library (Cats/Scalaz), to do this before I go and write my own class for it?
I was thinking of doing each operation in its own future, then checking each future, and returning a tuple of Seq[String] -> Seq[Throwable] where the left value is the successful results (flattened / combined) and the right value is a list of any exceptions that occurred.
Is there a better way?
Using Await.ready, which you mention in a comment, generally loses most of the benefit of using futures. Instead you can do this just using the normal Future combinators. And let's do the more generic version, which works for any return type; flattening the Seq[String]s can be added easily.
import scala.concurrent.{ExecutionContext, Future}
import scala.util.Success

def successesAndFailures[T](futures: Seq[Future[T]])(implicit ec: ExecutionContext): Future[(Seq[T], Seq[Throwable])] = {
  // first, promote all futures to Either without failures
  val eitherFutures: Seq[Future[Either[Throwable, T]]] =
    futures.map(_.transform(x => Success(x.toEither)))
  // then sequence to flip Future and Seq
  val futureEithers: Future[Seq[Either[Throwable, T]]] =
    Future.sequence(eitherFutures)
  // finally, a Seq of Eithers can be separated into Seqs of Lefts and Rights
  futureEithers.map { seqOfEithers =>
    val (lefts, rights) = seqOfEithers.partition(_.isLeft)
    val failures = lefts.map(_.left.get)
    val successes = rights.map(_.right.get)
    (successes, failures)
  }
}
Scalaz and Cats have a method called separate that simplifies the last step.
The types can be inferred by the compiler, they are shown just to help you see the logic.
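A small usage sketch (the operations and values are made up for illustration, and Await is used only to show the result):
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

val ops: Seq[Future[Seq[String]]] = Seq(
  Future(Seq("a", "b")),
  Future(throw new RuntimeException("db down")),
  Future(Seq("c"))
)

val (successes, failures) = Await.result(successesAndFailures(ops), 5.seconds)
println(successes.flatten)          // e.g. List(a, b, c)
println(failures.map(_.getMessage)) // e.g. List(db down)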
Calling value on your Future returns an Option[Try[T]]. If the Future has not completed then the Option is None. If it has completed then it's easy to unwrap and process.
if (myFutr.isCompleted)
  myFutr.value.map(_.fold(
    (err: Throwable) => ???,  // log the error
    (ss: Seq[String]) => ???  // process the results
  ))
else
  () // do something else, come back later
Sounds like a good use-case for the Try idiom (it's basically similar to the Either monad).
Example of usage from the doc:
import scala.util.{Success, Failure}
val f: Future[List[String]] = Future {
  session.getRecentPosts
}

f onComplete {
  case Success(posts) => for (post <- posts) println(post)
  case Failure(t) => println("An error has occurred: " + t.getMessage)
}
It actually does a little bit more than what you asked because it is fully asynchronous. Does it fit your use-case?
I'd do it this way:
import scala.concurrent.{Future, ExecutionContext}
import scala.util.Success
def eitherify[A](f: Future[A])(implicit ec: ExecutionContext): Future[Either[Throwable, A]] = f.transform(tryResult => Success(tryResult.toEither))
def eitherifyF[A, B](f: A => Future[B])(implicit ec: ExecutionContext): A => Future[Either[Throwable, B]] = { a => eitherify(f(a)) }
// here we need some "cats" magic for `traverse` and `separate`
// instead of `traverse` you can use standard `Future.sequence`
// there is no analogue for `separate` in the standard library
import cats.implicits._
def myProgram[A, B](values: List[A], asyncF: A => Future[B])(implicit ec: ExecutionContext): Future[(List[Throwable], List[B])] = {
  val appliedTransformations: Future[List[Either[Throwable, B]]] = values.traverse(eitherifyF(asyncF))
  appliedTransformations.map(_.separate)
}
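For illustration, building on the definitions above, calling it could look like this (the async function and inputs are made up; Await is only for demonstration):
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

// a made-up async operation that fails on empty strings
def fetchLength(s: String): Future[Int] =
  Future(if (s.isEmpty) throw new IllegalArgumentException("empty input") else s.length)

val (failures, lengths) = Await.result(myProgram(List("foo", "", "quux"), fetchLength), 5.seconds)
println(failures.map(_.getMessage)) // e.g. List(empty input)
println(lengths)                    // e.g. List(3, 4)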

Does the map of Scala Future add any overhead?

Suppose I've got a function like this:
import scala.concurrent._
def plus2Future(fut: Future[Int])
(implicit ec: ExecutionContext): Future[Int] = fut.map(_ + 2)
Now I am using plus2Future to create a new Future:
import scala.concurrent.ExecutionContext.Implicits.global
val fut1 = Future { 0 }
val fut2 = plus2Future(fut1)
Does the plus2 function (the _ + 2 passed to map) always run on the same thread as fut1? I guess it does not.
Does using map in plus2Future add the overhead of thread context switching, creating a new Runnable, etc.?
Does using map in plus2Future add an overhead of thread context
switching, creating a new Runnable etc. ?
map, in the default implementation of Future (via DefaultPromise), is:
def map[S](f: T => S)(implicit executor: ExecutionContext): Future[S] = { // transform(f, identity)
  val p = Promise[S]()
  onComplete { v => p complete (v map f) }
  p.future
}
Where onComplete creates a new CallbackRunnable and eventually invokes dispatchOrAddCallback:
/** Tries to add the callback, if already completed, it dispatches the callback to be executed.
 *  Used by `onComplete()` to add callbacks to a promise and by `link()` to transfer callbacks
 *  to the root promise when linking two promises together.
 */
@tailrec
private def dispatchOrAddCallback(runnable: CallbackRunnable[T]): Unit = {
  getState match {
    case r: Try[_] => runnable.executeWithValue(r.asInstanceOf[Try[T]])
    case _: DefaultPromise[_] => compressedRoot().dispatchOrAddCallback(runnable)
    case listeners: List[_] => if (updateState(listeners, runnable :: listeners)) () else dispatchOrAddCallback(runnable)
  }
}
Which dispatches the call to the underlying execution context. This means that it is up to the implementing ExecutionContext to decide how and where the future is run, so one cannot give a deterministic answer of "it runs here or there". We can see that there is both an allocation of a Promise and of a callback object.
In general, I wouldn't worry about this, unless I've benchmarked the code and found this to be a bottleneck.
No, it cannot (deterministically) use the same thread (unless the EC only has one thread) since an unbounded amount of time can elapse between the two lines of code.
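A quick way to see this for yourself (a toy sketch; the thread names and which threads get used will vary between runs):
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

val fut1 = Future { println(s"fut1 runs on ${Thread.currentThread().getName}"); 0 }
val fut2 = fut1.map { x => println(s"map runs on ${Thread.currentThread().getName}"); x + 2 }

Await.result(fut2, 5.seconds)
// Possible output:
// fut1 runs on scala-execution-context-global-11
// map runs on scala-execution-context-global-12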

Akka streams: dealing with futures within graph stage

Within an akka stream stage FlowShape[A, B], part of the processing I need to do on the A's is to save to / query a datastore with a query built from the A's data. But that datastore driver query gives me a Future, and I am not sure how best to deal with it (my main question here).
case class Obj(a: String, b: Int, c: String)
case class Foo(myobject: Obj, name: String)
case class Bar(st: String)

class SaveAndGetId extends GraphStage[FlowShape[Foo, Bar]] {
  val dao = new DbDao // some dao with an async driver
  override def createLogic(inheritedAttributes: Attributes) = new GraphStageLogic(shape) {
    setHandlers(in, out, new InHandler with OutHandler {
      override def onPush() = {
        val foo = grab(in)
        val add = foo.record.value()
        val result: Future[String] = dao.saveAndGetRecord(add.myobject) // saves and returns id as string

        // the naive approach
        val record = Await.result(result, Duration.Inf)
        push(out, Bar(record)) // ***tests pass every time

        // mapping the future approach
        result.map { x =>
          push(out, Bar(x))
        } // ***tests fail every time
The next stage depends on the id of the db record returned from the query, but I want to avoid Await. I am not sure why the mapping approach fails:
"it should work" in {
val source = Source.single(Foo(Obj("hello", 1, "world")))
val probe = source
.via(new SaveAndGetId))
.runWith(TestSink.probe)
probe
.request(1)
.expectBarwithId("one")//say we know this will be
.expectComplete()
}
private implicit class RichTestProbe(probe: Probe[Bar]) {
def expectBarwithId(expected: String): Probe[Bar] =
probe.expectNextChainingPF{
case r # Bar(str) if str == expected => r
}
}
When run with mapping future, I get failure:
should work ***FAILED***
java.lang.AssertionError: assertion failed: expected: message matching partial function but got unexpected message OnComplete
at scala.Predef$.assert(Predef.scala:170)
at akka.testkit.TestKitBase$class.expectMsgPF(TestKit.scala:406)
at akka.testkit.TestKit.expectMsgPF(TestKit.scala:814)
at akka.stream.testkit.TestSubscriber$ManualProbe.expectEventPF(StreamTestKit.scala:570)
The async side channels example in the docs has the future in the constructor of the stage, as opposed to building the future within the stage, so doesn't seem to apply to my case.
I agree with Ramon. Constructing a new FlowShape is not necessary in this case and it is too complicated. It is much more convenient to use the mapAsync method here:
Here is a code snippet to utilize mapAsync:
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

object MapAsyncExample {
  val parallelism: Int = 10

  def main(args: Array[String]): Unit = {
    Source.repeat(5)
      .mapAsync(parallelism)(x => asyncSquare(x))
      .runWith(Sink.foreach(println))
  }

  // This method returns a Future
  // You can replace this part with your database operations
  def asyncSquare(value: Int): Future[Int] = Future {
    value * value
  }
}
In the snippet above, Source.repeat(5) is a dummy source that emits 5 indefinitely. There is a sample function asyncSquare which takes an integer and calculates its square in a Future. The .mapAsync(parallelism)(x => asyncSquare(x)) line uses that function and emits the output of the Future to the next stage. In this snippet, the next stage is a sink which prints every item.
parallelism is the maximum number of asyncSquare calls that can run concurrently.
I think your GraphStage is unnecessarily overcomplicated. The below Flow performs the same actions without the need to write a custom stage:
val dao = new DbDao
val parallelism = 10 // number of parallel db queries

val SaveAndGetId: Flow[Foo, Bar, _] =
  Flow[Foo]
    .map(foo => foo.record.value().myobject)
    .mapAsync(parallelism)(rec => dao.saveAndGetRecord(rec))
    .map(Bar.apply)
I generally try to treat GraphStage as a last resort, there is almost always an idiomatic way of getting the same Flow by using the methods provided by the akka-stream library.