Scala: Parallel execution with ListBuffer appends doesn't produce expected outcome

I know I'm doing something wrong with mutable.ListBuffer, but I can't figure out how to fix it (or find a proper explanation of the issue).
I simplified the code below to reproduce the behavior.
I'm basically trying to run functions in parallel that add elements to a list as my first list gets processed. I end up "losing" elements.
import java.util.Properties
import scala.collection.mutable.ListBuffer
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}
import scala.concurrent.{ExecutionContext}
import ExecutionContext.Implicits.global
object MyTestObject {
  var listBufferOfInts = new ListBuffer[Int]() // files that are processed
  def runFunction(): Int = {
    listBufferOfInts = new ListBuffer[Int]()
    val inputListOfInts = 1 to 1000
    val fut = Future.traverse(inputListOfInts) { i =>
      Future {
        appendElem(i)
      }
    }
    Await.ready(fut, Duration.Inf)
    listBufferOfInts.length
  }
  def appendElem(elem: Int): Unit = {
    listBufferOfInts ++= List(elem)
  }
}
MyTestObject.runFunction()
MyTestObject.runFunction()
MyTestObject.runFunction()
which returns:
res0: Int = 937
res1: Int = 992
res2: Int = 997
Obviously I would expect 1000 to be returned every time. How can I fix my code to keep the "architecture" but make my ListBuffer "synchronized"?

I don't know what the exact problem is, since you said you simplified it, but you still have an obvious race condition: multiple threads modify a single mutable collection, and ListBuffer is not thread-safe, so concurrent appends can overwrite each other. As other answers pointed out, you need some locking so that only one thread can modify the collection at a time. If your calculations are heavy, appending results to a buffer in a synchronized way shouldn't notably affect performance, but when in doubt, always measure.
But synchronization is not strictly needed; you can do something else instead, without vars and mutable state. Let each Future return its partial result and then merge the results into a list. In fact, Future.traverse does just that.
import scala.concurrent.duration._
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
def runFunction: Int = {
  val inputListOfInts = 1 to 1000
  val fut: Future[List[Int]] = Future.traverse(inputListOfInts.toList) { i =>
    Future {
      // some heavy calculations on i
      i * 4
    }
  }
  val listOfInts = Await.result(fut, Duration.Inf)
  listOfInts.size
}
Future.traverse already gives you an immutable list with all your results combined, no need to append them to a mutable buffer.
Needless to say, you will always get 1000 back.
# List.fill(10000)(runFunction).exists(_ != 1000)
res18: Boolean = false

I'm not sure the code above correctly shows what you are trying to do. Maybe the issue is that you are sharing a var ListBuffer which you reinitialise within runFunction.
When I take the reinitialisation out, I collect all the elements I'm expecting:
import java.util.Properties
import scala.collection.mutable.ListBuffer
import scala.concurrent.duration.Duration
import scala.concurrent.{ Await, Future }
import scala.concurrent.{ ExecutionContext }
import ExecutionContext.Implicits.global
object BrokenTestObject extends App {
  var listBufferOfInts = ( new ListBuffer[Int]() )
  def runFunction(): Int = {
    val inputListOfInts = 1 to 1000
    val fut = Future.traverse(inputListOfInts) { i =>
      Future {
        appendElem(i)
      }
    }
    Await.ready(fut, Duration.Inf)
    listBufferOfInts.length
  }
  def appendElem(elem: Int): Unit = {
    listBufferOfInts.append( elem )
  }
  BrokenTestObject.runFunction()
  BrokenTestObject.runFunction()
  BrokenTestObject.runFunction()
  println(s"collected ${listBufferOfInts.length} elements")
}
If you really have a synchronisation issue you can use something like the following:
import java.util.Properties
import scala.collection.mutable.ListBuffer
import scala.concurrent.duration.Duration
import scala.concurrent.{ Await, Future }
import scala.concurrent.{ ExecutionContext }
import ExecutionContext.Implicits.global
// Wraps the buffer so that only one thread can append at a time
class WrappedListBuffer(val lb: ListBuffer[Int]) {
  def append(i: Int): Unit = this.synchronized {
    lb.append(i)
  }
}
object MyTestObject extends App {
  var listBufferOfInts = new WrappedListBuffer( new ListBuffer[Int]() )
  def runFunction(): Int = {
    val inputListOfInts = 1 to 1000
    val fut = Future.traverse(inputListOfInts) { i =>
      Future {
        appendElem(i)
      }
    }
    Await.ready(fut, Duration.Inf)
    listBufferOfInts.lb.length
  }
  def appendElem(elem: Int): Unit = {
    listBufferOfInts.append( elem )
  }
  MyTestObject.runFunction()
  MyTestObject.runFunction()
  MyTestObject.runFunction()
  println(s"collected ${listBufferOfInts.lb.size} elements")
}

Changing
listBufferOfInts ++= List(elem)
to
synchronized {
  listBufferOfInts ++= List(elem)
}
makes it work. Could this become a performance issue? I'm still interested in an explanation and maybe a better way of doing things!
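On the explanation: ListBuffer is not thread-safe, so appends that interleave on different threads can overwrite each other's update, and an element silently disappears. One possible alternative to wrapping every append in synchronized, sketched here under the assumption that the order of the collected elements does not matter, is a thread-safe collection such as java.util.concurrent.ConcurrentLinkedQueue. The object and value names below are illustrative and not taken from the question.

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global

object ConcurrentAppendSketch {
  // Hypothetical replacement for the shared ListBuffer: a thread-safe queue,
  // created per call so that runs do not share state.
  def runFunction(): Int = {
    val collected = new ConcurrentLinkedQueue[Int]()
    val fut = Future.traverse((1 to 1000).toList) { i =>
      Future {
        collected.add(i) // safe to call from many threads at once
        ()
      }
    }
    Await.ready(fut, Duration.Inf)
    collected.size()
  }
}
```

Whether this is faster than a synchronized block depends on the workload, so it is worth measuring both.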

Related

scala (spark) zio convert future to zio

My objective is to run a number of Spark ML regression models (thousands of times) on one dataset, and I want to do this using ZIO instead of Future because it is running too slowly. Below is a working example using Future.
A distinct list of keys is used to filter the partitioned dataset by key and run the model on each subset. I've set up a thread pool with 8 threads to manage it, but performance quickly degrades.
import scala.concurrent.{Await, ExecutionContext, ExecutionContextExecutorService, Future}
import java.util.concurrent.{Executors, TimeUnit}
import scala.concurrent.duration._
import org.apache.spark.sql.SaveMode
val pool = Executors.newFixedThreadPool(8)
implicit val xc: ExecutionContextExecutorService = ExecutionContext.fromExecutorService(pool)
case class Result(key: String, coeffs: String)
try {
import spark.implicits._
val tasks = {
for (x <- keys)
yield Future {
Seq(
Result(
x.group,
runModel(input.filter(col("group")===x)).mkString(",")
)
).toDS()
.write.mode(SaveMode.Overwrite).option("header", false).csv(
s"hdfs://namenode:8020/results/$x.csv"
)
}
}.toSeq
Await.result(Future.sequence(tasks), Duration.Inf)
}
finally {
pool.shutdown()
pool.awaitTermination(Long.MaxValue, TimeUnit.NANOSECONDS)
}
I've tried to implement this in zio, but I don't know how to implement queues and set a limit of executors like in futures.
Below is my failed attempt so far...
import zio._
import zio.console._
import zio.stm._
import org.apache.spark.sql.{Dataset, SaveMode, SparkSession}
import org.apache.spark.sql.functions.col
//example data/signatures
case class ModelResult(key: String, coeffs: String)
case class Data(key: String, sales: Double)
val keys: Array[String] = Array("100_1", "100_2")
def runModel[T](ds: Dataset[T]): Vector[Double]
object MyApp1 extends App {
val spark = SparkSession
.builder()
.getOrCreate()
import spark.implicits._
val input: Dataset[Data] = Seq(Data("100_1", 1d), Data("100_2", 2d)).toDS
def run(args: List[String]): ZIO[ZEnv, Nothing, Int] = {
for {
queue <- Queue.bounded[Int](8)
_ <- ZIO.foreach(1 to 8) (i => queue.offer(i)).fork
_ <- ZIO.foreach(keys) { k => queue.take.flatMap(_ => readWrite(k, input, queue)) }
} yield 0
}
def writecsv(k: String, v: String) = {
Seq(ModelResult(k, v))
.toDS
.write
.mode(SaveMode.Overwrite).option("header", value = false)
.csv(s"hdfs://namenode:8020/results/$k.csv")
}
def readWrite[T](key: String, ds: Dataset[T], queue: Queue[Int]): ZIO[ZEnv, Nothing, Int] = {
(for {
result <- runModel(ds.filter(col("key")===key)).mkString(",")
_ <- writecsv(key, result)
_ <- queue.offer(1)
_ <- putStrLn(s"successfully wrote output for $key")
} yield 0)
}
}
//to run
MyApp1.run(List[String]())
What is the best way to compute this in ZIO?
To parallelize some workload across, say, 8 threads all you need is
ZIO.foreachParN(8)(1 to 100)(id => zio.blocking.blocking(Task{yourClusterJob(id)}))
But don't expect much of a boost from switching from Futures to ZIO here:
1) The actual workload dominates the coordination overhead, so the difference between ZIO and Future should be marginal.
2) You may not get any boost at all, because the 8 tasks will be competing for the same resource pool in the Spark cluster.

How to produce a Traversable with cats-effect's IO given an async that will call multiple times

What I'm really trying to do is monitor multiple files and when any of them is modified I'd like to update some state and produce a side effect using this state. I imagine what I want is a scan over a Traversable that produces a Traversable[IO[_]]. But I don't see the path there.
As a minimal attempt to produce this, I wrote:
package example
import better.files.{File, FileMonitor}
import cats.implicits._
import com.monovore.decline._
import cats.effect.IO
import java.nio.file.{Files, Path}
import scala.concurrent.ExecutionContext.Implicits.global
object Hello extends CommandApp(
name = "cats-effects-playground",
header = "welcome",
main = {
val filesOpts = Opts.options[Path]("input", help = "input files")
filesOpts.map { files =>
IO.async[File] { cb =>
val watchers = files.map { path =>
new FileMonitor(path, recursive = false) {
override def onModify(file: File, count: Int) = cb(Right(file))
}
}
watchers.toList.foreach(_.start)
}
.flatMap(f => IO { println(f) })
.unsafeRunSync
}
}
)
But this has two major flaws. First, it creates a thread for each file I'm watching, which is a little heavy. More importantly, the program finishes as soon as a single file is modified, even though onModify would be called more times if the program stayed running.
I'm not married to using better-files, it just seemed like the path of least resistance. But I do require using Cats IO.
This solution doesn't solve the issue of creating a bunch of threads, and it doesn't strictly produce a Traversable, but it solves the underlying use case. I'm very open to this being critiqued and a better solution provided.
package example
import better.files.{File, FileMonitor}
import cats.implicits._
import com.monovore.decline._
import cats.effect.IO
import java.nio.file.{Files, Path}
import java.util.concurrent.LinkedBlockingQueue
import scala.concurrent.ExecutionContext.Implicits.global
object Hello extends CommandApp(
name = "cats-effects-playground",
header = "welcome",
main = {
val filesOpts = Opts.options[Path]("input", help = "input files")
filesOpts.map { files =>
val bq: LinkedBlockingQueue[IO[File]] = new LinkedBlockingQueue()
val watchers = files.map { path =>
new FileMonitor(path, recursive = false) {
override def onModify(file: File, count: Int) = bq.put(IO(file))
}
}
def ioLoop(): IO[Unit] = bq.take()
  .flatMap(f => IO(println(f)))
  .flatMap(_ => ioLoop())
watchers.toList.foreach(_.start)
ioLoop().unsafeRunSync()
}
}
)
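The core idea of the solution above, handing multi-shot callbacks to a single consumer through a blocking queue, can be sketched without any libraries. This is only an illustration under stated assumptions: fireEvents is a hypothetical stand-in for the file monitors, and the consumer reads a fixed number of events instead of looping forever.

```scala
import java.util.concurrent.LinkedBlockingQueue

object CallbackBridgeSketch {
  // Hypothetical stand-in for a file watcher: invokes the callback once per event.
  def fireEvents(paths: List[String], onModify: String => Unit): Unit =
    paths.foreach(onModify)

  // Takes exactly `n` events from the queue, blocking until each one arrives.
  def consume(queue: LinkedBlockingQueue[String], n: Int): List[String] =
    List.fill(n)(queue.take())

  def demo(): List[String] = {
    val queue = new LinkedBlockingQueue[String]()
    // The producer thread plays the role of the watcher callbacks.
    val producer = new Thread(() =>
      fireEvents(List("a.txt", "b.txt", "a.txt"), p => queue.put(p)))
    producer.start()
    val seen = consume(queue, 3)
    producer.join()
    seen
  }
}
```

Because the callback can fire any number of times, the queue decouples "an event happened" from "an event was handled", which a single-shot async callback cannot do.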

how to combine the results Future[ Option[ T ] ] into Seq[ T ]

I have a method
def readTree(id: String): Future[Option[CategoryTreeResponse]]
and a list of strings, channels: List[String].
How can I iterate and combine all the results into a non-Future sequence, such as:
def readAllTrees(): Seq[CategoryTreeResponse] = ???
Possibly without blocking.
Coming from the imperative world, I'd do it like this:
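import scala.collection.mutable.ListBuffer
import scala.concurrent.Await
import scala.concurrent.duration._
def readTrees(): Seq[CategoryTreeResponse] = {
  val list = ListBuffer[CategoryTreeResponse]()
  for (id <- channels) {
    // Blocks the calling thread for up to 5 seconds per channel
    val tree = Await.result(readTree(id), 5.seconds)
    if (tree.isDefined) {
      list += tree.get
    }
  }
  list.toList
}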
You could do something like this:
def readAllTrees(channels: List[String]): Future[Seq[CategoryTreeResponse]] = {
  Future.sequence(channels.map(readTree(_))).map(_.flatten)
}
I have changed the signature of readAllTrees to receive the list and return a Future of the sequence.
If you want to access the resulting sequence, you will need to wait until it has finished:
Await.result(readAllTrees(channels), Duration.Inf)
But this is not a very nice way to manage futures, because it blocks the thread that calls Await.result.
Future.sequence and Await.result should help. I agree with Mikel though: it is better to stay async as long as possible, using the map/flatMap/foreach methods of the Future class.
scala> :paste
// Entering paste mode (ctrl-D to finish)
import scala.concurrent._
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
case class CategoryTreeResponse()
val futureResults: List[Future[Option[CategoryTreeResponse]]] = List(
Future.successful(Option(CategoryTreeResponse())),
Future.successful(Option(CategoryTreeResponse())),
Future.successful(None)
)
val futureResult: Future[List[Option[CategoryTreeResponse]]] = Future.sequence(futureResults)
val allResults: List[Option[CategoryTreeResponse]] = Await.result(futureResult, Duration.Inf)
val nonEmptyResults: Seq[CategoryTreeResponse] = allResults.flatMap(_.toSeq)
// Exiting paste mode, now interpreting.
import scala.concurrent._
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
defined class CategoryTreeResponse
futureResults: List[scala.concurrent.Future[Option[CategoryTreeResponse]]] = List(Future(Success(Some(CategoryTreeResponse()))), Future(Success(Some(CategoryTreeResponse()))), Future(Success(None)))
futureResult: scala.concurrent.Future[List[Option[CategoryTreeResponse]]] = Future(Success(List(Some(CategoryTreeResponse()), Some(CategoryTreeResponse()), None)))
allResults: List[Option[CategoryTreeResponse]] = List(Some(CategoryTreeResponse()), Some(CategoryTreeResponse()), None)
nonEmptyResults: Seq[CategoryTreeResponse] = List(CategoryTreeResponse(), CategoryTreeResponse())
scala>
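For comparison, here is a minimal sketch of the "stay async" variant both answers recommend, where the caller receives a Future and never calls Await. The readTree body and the id field on CategoryTreeResponse are hypothetical stand-ins, since the question does not show them.

```scala
import scala.concurrent.{ExecutionContext, Future}

object ReadAllTreesSketch {
  // Hypothetical stand-ins for the type and lookup in the question.
  final case class CategoryTreeResponse(id: String)

  // Assumption for the demo: an empty id means "no tree found".
  def readTree(id: String): Future[Option[CategoryTreeResponse]] =
    Future.successful(if (id.nonEmpty) Some(CategoryTreeResponse(id)) else None)

  // Combines every lookup into one Future and drops the empty results, without blocking.
  def readAllTrees(channels: List[String])(implicit ec: ExecutionContext): Future[List[CategoryTreeResponse]] =
    Future.traverse(channels)(id => readTree(id)).map(_.flatten)
}
```

A caller would then attach further work with map, flatMap, or foreach on the returned Future, and block only at the very edge of the program, if at all.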

Executing More than 3 Futures does not work

I'm using the dispatch library in my sbt project. When I initialize three futures and run them, it works perfectly. But when I add one more future, it goes into a loop.
My code:
// Initializing Futures
def sequenceOfFutures() = {
  var pageNumber: Int = 1
  var list = { Seq(Future {}) }
  for (pageNumber <- 1 to 4) {
    list ++= {
      Seq(
        Future {
          str = getRequestFunction(pageNumber);
          GlobalObjects.sleep(Random.nextInt(1500));
        }
      )
    }
  }
  Future.sequence(list)
}
Await.result(sequenceOfFutures(), Duration.Inf)
And then the getRequestFunction(pageNumber) code:
def getRequestFunction(pageNumber: Int) = {
  val h = Http("scala.org", as_str)
  while (h.isComplete) {
    Thread.sleep(1500);
  }
}
I tried based on one suggestion from How to configure a fine tuned thread pool for futures?
I added this to my code:
import java.util.concurrent.Executors
import scala.concurrent._
implicit val ec = new ExecutionContext {
val threadPool = Executors.newFixedThreadPool(1000);
def execute(runnable: Runnable) {
threadPool.submit(runnable)
}
def reportFailure(t: Throwable) {}
}// Still didn't work
That didn't work either: when I use more than three Futures, it keeps waiting forever. Could someone please suggest how to solve this issue?
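For reference, a fixed-size pool can also be wrapped with the standard ExecutionContext.fromExecutorService helper, which keeps the default failure reporter instead of an empty reportFailure. The sketch below does not diagnose the hang described above. It only shows, with a hypothetical blocking fetchPage in place of the real request, that more futures than pool threads still complete as long as each task eventually returns.

```scala
import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, ExecutionContextExecutorService, Future}

object FixedPoolSketch {
  // Hypothetical stand-in for the blocking page request in the question.
  def fetchPage(pageNumber: Int): String = {
    Thread.sleep(50)
    s"page-$pageNumber"
  }

  def fetchAll(pages: List[Int]): List[String] = {
    // Four threads serve all the futures; extra tasks wait in the pool's queue.
    implicit val ec: ExecutionContextExecutorService =
      ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))
    try {
      val futures = pages.map(p => Future(fetchPage(p)))
      Await.result(Future.sequence(futures), Duration.Inf)
    } finally ec.shutdown()
  }
}
```

If a task never returns, for example because a polling loop's exit condition is never met, no pool size will make Await.result finish.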

Creating a time-based chunking Enumeratee

I want to create a Play 2 Enumeratee that takes in values and outputs them, chunked together, every x seconds/milliseconds. That way, in a multi-user websocket environment with lots of user input, one could limit the number of received frames per second.
I know that it's possible to group a set number of items together like this:
val chunker = Enumeratee.grouped(
Traversable.take[Array[Double]](5000) &>> Iteratee.consume()
)
Is there a built-in way to do this based on time rather than based on the number of items?
I was thinking about doing this somehow with a scheduled Akka job, but at first sight this seems inefficient, and I'm not sure whether concurrency issues would arise.
How about something like this? I hope it helps.
package controllers
import play.api._
import play.api.Play.current
import play.api.mvc._
import play.api.libs.iteratee._
import play.api.libs.concurrent.Akka
import play.api.libs.concurrent.Promise
object Application extends Controller {
def index = Action {
val queue = new scala.collection.mutable.Queue[String]
Akka.future {
while( true ){
Logger.info("hogehogehoge")
queue += System.currentTimeMillis.toString
Thread.sleep(100)
}
}
val timeStream = Enumerator.fromCallback { () =>
Promise.timeout(Some(queue), 200)
}
Ok.stream(timeStream.through(Enumeratee.map[scala.collection.mutable.Queue[String]]({ queue =>
var str = ""
while(queue.nonEmpty){
str += queue.dequeue + ", "
}
str
})))
}
}
This document is also helpful:
http://www.playframework.com/documentation/2.0/Enumerators
UPDATE
This is the version for Play 2.1.
package controllers
import play.api._
import play.api.Play.current
import play.api.mvc._
import play.api.libs.iteratee._
import play.api.libs.concurrent.Akka
import play.api.libs.concurrent.Promise
import scala.concurrent._
import ExecutionContext.Implicits.global
object Application extends Controller {
def index = Action {
val queue = new scala.collection.mutable.Queue[String]
Akka.future {
while( true ){
Logger.info("hogehogehoge")
queue += System.currentTimeMillis.toString
Thread.sleep(100)
}
}
val timeStream = Enumerator.repeatM{
Promise.timeout(queue, 200)
}
Ok.stream(timeStream.through(Enumeratee.map[scala.collection.mutable.Queue[String]]({ queue =>
var str = ""
while(queue.nonEmpty){
str += queue.dequeue + ", "
}
str
})))
}
}
Here I've quickly defined an iteratee that will take values from an input for a fixed time length t measured in milliseconds and an enumeratee that will allow you to group and further process an input stream divided into segments constructed within such length t. It relies on JodaTime to keep track of how much time has passed since the iteratee began.
import org.joda.time.{Instant, Interval}
import play.api.libs.iteratee._
def throttledTakeIteratee[E](timeInMillis: Long): Iteratee[E, List[E]] = {
  var startTime = new Instant()
  def step(state: List[E])(input: Input[E]): Iteratee[E, List[E]] = {
    // Time elapsed since the current chunk was started
    val timePassed = new Interval(startTime, new Instant()).toDurationMillis
    input match {
      case Input.EOF => { startTime = new Instant; Done(state, Input.EOF) }
      case Input.Empty => Cont[E, List[E]](i => step(state)(i))
      case Input.El(e) =>
        if (timePassed >= timeInMillis) { startTime = new Instant; Done(e::state, Input.Empty) }
        else Cont[E, List[E]](i => step(e::state)(i))
    }
  }
  Cont(step(List[E]()))
}
def throttledTake[T](timeInMillis: Long) = Enumeratee.grouped(throttledTakeIteratee[T](timeInMillis))
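The time-window idea behind this iteratee can be illustrated on plain data, independent of Play. The sketch below is a variation rather than the code above: it assumes each element arrives with an explicit timestamp in milliseconds, keeps elements in arrival order, and puts the element that crosses the window boundary into the next chunk. The names chunkByTime and TimeChunkSketch are made up for the example.

```scala
object TimeChunkSketch {
  // Groups (timestampMillis, value) pairs into consecutive chunks, starting a new
  // chunk once `windowMillis` has elapsed since the current chunk's first element.
  def chunkByTime[A](events: List[(Long, A)], windowMillis: Long): List[List[A]] = {
    val (chunks, current, _) =
      events.foldLeft((List.empty[List[A]], List.empty[A], Option.empty[Long])) {
        // Still inside the current window: add to the open chunk.
        case ((done, cur, Some(start)), (ts, value)) if ts - start < windowMillis =>
          (done, value :: cur, Some(start))
        // First element, or window elapsed: close the open chunk and start a new one.
        case ((done, cur, _), (ts, value)) =>
          val closed = if (cur.isEmpty) done else cur.reverse :: done
          (closed, List(value), Some(ts))
      }
    (if (current.isEmpty) chunks else current.reverse :: chunks).reverse
  }
}
```

With a 100 ms window, events at 0, 50, 120, 130 and 400 ms fall into three chunks: the first two, the next two, and the last one alone.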