Scala - Batched Stream from Futures

I have instances of a case class Thing, and I have a bunch of queries to run that return a collection of Things like so:
def queries: Seq[Future[Seq[Thing]]]
I need to collect all Things from all futures (like above) and group them into equally sized collections of 10,000 so they can be serialized to files of 10,000 Things.
def serializeThings(things: Seq[Thing]): Future[Unit]
I want it to be implemented in such a way that I don't wait for all queries to run before serializing. As soon as there are 10,000 Things returned after the futures of the first queries complete, I want to start serializing.
If I do something like:
Future.sequence(queries)
It will collect the results of all the queries, but my understanding is that operations like map won't be invoked until all the queries complete, and that all the Things must fit into memory at once.
What's the best way to implement a batched stream pipeline using Scala collections and concurrent libraries?
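For concreteness, this is roughly the all-at-once shape I'm trying to avoid (a sketch built from the signatures above): nothing downstream of Future.sequence runs until every query has finished.
// sketch: the grouping step cannot start until ALL queries have completed,
// so every Thing is held in memory at once
Future.sequence(queries).flatMap { all: Seq[Seq[Thing]] =>
  val batches = all.flatten.grouped(10000).toSeq
  batches.foldLeft(Future.successful(())) { (acc, batch) =>
    acc.flatMap(_ => serializeThings(batch))
  }
}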

I think I managed to make something. The solution is based on my previous answer. It collects results from the Future[List[Thing]] results until it reaches a threshold of BatchSize, then it calls the serializeThings future; when that finishes, the loop continues with the rest.
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

object BatchFutures extends App {
  case class Thing(id: Int)

  def getFuture(id: Int): Future[List[Thing]] = {
    Future.successful {
      List.fill(3)(Thing(id))
    }
  }

  def serializeThings(things: Seq[Thing]): Future[Unit] = Future.successful {
    //Thread.sleep(2000)
    println("processing: " + things)
  }

  val ids = (1 to 4).toList
  val BatchSize = 5

  val future = ids.foldLeft(Future.successful[List[Thing]](Nil)) {
    case (acc, id) =>
      acc.flatMap { processed =>
        getFuture(id).flatMap { res =>
          val all = processed ++ res
          val (batch, rest) = all.splitAt(BatchSize)
          if (batch.length == BatchSize) { // the futures filled a whole batch
            serializeThings(batch).map { _ =>
              rest // carry the remainder into the next step
            }
          } else {
            Future.successful(all) // we still need more Things for a batch
          }
        }
      }
  }.flatMap { rest =>
    if (rest.nonEmpty) serializeThings(rest) else Future.successful(())
  }

  Await.result(future, Duration.Inf)
}
The result prints:
processing: List(Thing(1), Thing(1), Thing(1), Thing(2), Thing(2))
processing: List(Thing(2), Thing(3), Thing(3), Thing(3), Thing(4))
processing: List(Thing(4), Thing(4))
When the number of Things isn't divisible by BatchSize, the leftover Things are serialized by the final flatMap. I hope it helps! :)

Before you do Future.sequence, do what you want to do with each individual future, and then use Future.sequence.
//this can be used for serializing
def doSomething(): Unit = ???
//do something with the failed future
def doSomethingElse(): Unit = ???
def doSomething(list: List[_]): Unit = ???

val list: List[Future[_]] = List.fill(10000)(Future(doSomething()))

// mapping over a Future is non-blocking, so a plain map is enough here
val newList = list.map { f =>
  f.map { result =>
    doSomething()
  }.recover { case throwable =>
    doSomethingElse()
  }
}

Future.sequence(newList).map(list => doSomething(list)) //wait till all are complete
Instead of building newList by hand you could use Future.traverse:
Future.traverse(list)(f => f.map(x => doSomething()).recover { case th => doSomethingElse() })
  .map(completeListOfValues => doSomething(completeListOfValues))

Related

Scala Futures for-comprehension with a list of values

I need to execute a Future method on the elements of a list simultaneously. My current implementation works sequentially, which is not optimal time-wise. I did this by mapping my list and calling the method on each element, processing the data this way.
My manager shared a link with me showing how to execute Futures simultaneously using a for-comprehension, but I cannot see how to implement this with my List.
The link he shared with me is https://alvinalexander.com/scala/how-use-multiple-scala-futures-in-for-comprehension-loop/
Here is my current code:
private def method1(id: String): Tuple2[Boolean, List[MyObject]] = {
  val workers = List.concat(idleWorkers, activeWorkers.keys.toList)
  var ready = true
  val workerStatus = workers.map { worker =>
    val option = Await.result(method2(worker), 1.second)
    val status = option match {
      case Some((flag, workerId)) => if (workerId == id) flag.toString else "INVALID"
      case None                   => "FAILED"
    }
    if (option.exists(_._1)) {
      ready = false
    }
    MyObject(worker.toString, s"$worker: $status")
  }.filterNot(s => s.status.contains("INVALID"))
  (ready, workerStatus)
}
private def method2(worker: ActorRef): Future[Option[(Boolean, String)]] = Future {
  implicit val timeout: Timeout = 1.second
  Try(Await.result(worker ? GetStatus, 1.second)) match {
    case Success(extractedVal) => extractedVal match {
      case res: (Boolean, String) => Some(res)
      case _ => None
    }
    case Failure(_) => None
  }
}
If someone could suggest how to implement a for-comprehension in this scenario, I would be grateful. Thanks
For method2 there is no need for the Future/Await mix. Just map the Future:
def method2(worker: ActorRef): Future[Option[(Boolean, String)]] =
  (worker ? GetStatus).map {
    case res: (Boolean, String) => Some(res)
    case _ => None
  }
For method1 you likewise need to map the result of method2 and do the processing inside the map. This will make workerStatus a List[Future[MyObject]] and means that everything runs in parallel.
Then use Future.sequence(workerStatus) to turn the List[Future[MyObject]] into a Future[List[MyObject]]. You can then use map again to do the filtering/checking on that List[MyObject]. This will happen when all the individual Futures have completed.
Ideally you would then return a Future from method1 to keep everything asynchronous. You could, if absolutely necessary, use Await.result at this point which would wait for all the asynchronous operations to complete (or fail).
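A sketch of what that might look like, reusing the names from the question (the exact status rules and the shape of MyObject are my assumptions based on the original code):
import scala.concurrent.{ExecutionContext, Future}

// assumed shape, inferred from the filtering in the question
case class MyObject(worker: String, status: String)

def method1(id: String)(implicit ec: ExecutionContext): Future[(Boolean, List[MyObject])] = {
  val workers = List.concat(idleWorkers, activeWorkers.keys.toList)
  // map each worker to a Future; all the lookups run in parallel
  val workerStatus: List[Future[(Boolean, MyObject)]] = workers.map { worker =>
    method2(worker).map { option =>
      val status = option match {
        case Some((flag, workerId)) if workerId == id => flag.toString
        case Some(_)                                  => "INVALID"
        case None                                     => "FAILED"
      }
      (option.exists(_._1), MyObject(worker.toString, s"$worker: $status"))
    }
  }
  // List[Future[...]] => Future[List[...]]; the map runs once all have completed
  Future.sequence(workerStatus).map { results =>
    val ready = !results.exists(_._1)
    (ready, results.map(_._2).filterNot(_.status.contains("INVALID")))
  }
}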

Iterate a data source asynchronously in batches and stop when the remote returns no data in Scala

Let's say we have a fake data source which will return the data it holds in batches:
import scala.concurrent.Future
import scala.util.{Failure, Random, Success}

class DataSource(size: Int) {
  private var s = 0
  implicit val g = scala.concurrent.ExecutionContext.global

  def getData(): Future[List[Int]] = {
    s = s + 1
    Future {
      Thread.sleep(Random.nextInt(s * 100))
      if (s <= size) {
        List.fill(100)(s)
      } else {
        List()
      }
    }
  }
}
object Test extends App {
  val source = new DataSource(100)
  implicit val g = scala.concurrent.ExecutionContext.global

  def process(v: List[Int]): Unit = {
    println(v)
  }

  def next(f: (List[Int]) => Unit): Unit = {
    val fut = source.getData()
    fut.onComplete {
      case Success(v) =>
        f(v)
        v match {
          case h :: t => next(f)
          case _      => // empty batch: stop recursing
        }
      case Failure(_) => // errors are silently dropped in this version
    }
  }

  next(process)
  Thread.sleep(1000000000)
}
I have my version above, but the problem is that parts of it are not pure. Ideally, I would like to wrap the Future for each batch into one big future, with the wrapper future succeeding when the last batch returns a 0-size list. My situation is a little different from this post: the next() there is a synchronous call, while mine is also async.
Or is it even possible to do what I want? The next batch should only be fetched when the previous one has resolved, and in the end whether to fetch the next batch depends on the size returned.
What's the best way to walk through this type of data source? Are there any existing Scala frameworks that provide the feature I am looking for? Is Play's Iteratee/Enumerator/Enumeratee the right tool? If so, can anyone provide an example of how to use those facilities to implement what I am looking for?
Edit----
With help from chunjef, I tried it out, and it actually worked for me. However, there is a small change I made based on his answer.
Source.fromIterator(() => Iterator.continually(source.getData()))
  .mapAsync(1)(f => f.filter(_.size > 0))
  .via(Flow[List[Int]].takeWhile(_.nonEmpty))
  .runForeach(println)
However, can someone give a comparison between Akka Streams and Play's Iteratee? Is it worth also trying out Iteratee?
Code snip 1:
Source.fromIterator(() => Iterator.continually(ds.getData)) // line 1
.mapAsync(1)(identity) // line 2
.takeWhile(_.nonEmpty) // line 3
.runForeach(println) // line 4
Code snip 2: Assuming getData depends on some output of another flow, which I would like to concat with the flow below. However, it yields a "too many open files" error. I'm not sure what causes this error; mapAsync has been limited to a throughput of 1, if I understood correctly.
Flow[Int].mapConcat[Future[List[Int]]](c => {
  Iterator.continually(ds.getData(c)).to[collection.immutable.Iterable]
}).mapAsync(1)(identity)
  .takeWhile(_.nonEmpty)
  .runForeach(println)
The following is one way to achieve the same behavior with Akka Streams, using your DataSource class:
import scala.concurrent.Future
import scala.util.Random
import akka.actor.ActorSystem
import akka.stream._
import akka.stream.scaladsl._
object StreamsExample extends App {
  implicit val system = ActorSystem("Sandbox")
  implicit val materializer = ActorMaterializer()

  val ds = new DataSource(100)

  Source.fromIterator(() => Iterator.continually(ds.getData)) // line 1
    .mapAsync(1)(identity)                                    // line 2
    .takeWhile(_.nonEmpty)                                    // line 3
    .runForeach(println)                                      // line 4
}

class DataSource(size: Int) {
  ...
}
A simplified line-by-line overview:
line 1: Creates a stream source that continually calls ds.getData if there is downstream demand.
line 2: mapAsync is a way to deal with stream elements that are Futures. In this case, the stream elements are of type Future[List[Int]]. The argument 1 is the level of parallelism: we specify 1 here because DataSource internally uses a mutable variable, and a parallelism level greater than one could produce unexpected results. identity is shorthand for x => x, which basically means that for each Future, we pass its result downstream without transforming it.
line 3: Essentially, ds.getData is called as long as the result of the Future is a non-empty List[Int]. If an empty List is encountered, processing is terminated.
line 4: runForeach here takes a function List[Int] => Unit and invokes that function for each stream element.
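As a usage note (my addition, not part of the original answer): runForeach materializes a Future[Done], so you can hang completion logic off it, for example to shut down the ActorSystem when the stream finishes:
import akka.Done
import scala.concurrent.Future
import system.dispatcher // ExecutionContext for the onComplete callback

val done: Future[Done] =
  Source.fromIterator(() => Iterator.continually(ds.getData))
    .mapAsync(1)(identity)
    .takeWhile(_.nonEmpty)
    .runForeach(println)

done.onComplete(_ => system.terminate())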
Ideally, I would like to wrap the Future for each batch into a big future, and the wrapper future success when last batch returned 0 size list?
I think you are looking for a Promise.
You would set up a Promise before you start the first iteration.
This gives you promise.future, a Future that you can then use to follow the completion of everything.
In your onComplete, you add a case _ => promise.success(()).
Something like
import scala.concurrent.{Future, Promise}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

def loopUntilDone(f: (List[Int]) => Unit): Future[Unit] = {
  val promise = Promise[Unit]()

  def next(): Unit = source.getData().onComplete {
    case Success(v) =>
      f(v)
      v match {
        case h :: t => next()
        case _      => promise.success(())
      }
    case Failure(e) => promise.failure(e)
  }

  // get going
  next()

  // return the Future for everything
  promise.future
}

// future for everything, this is a `Future[Unit]`
// its `onComplete` will be triggered when there is no more data
val everything = loopUntilDone(process)
You are probably looking for a reactive streams library. My personal favorite (and the one I'm most familiar with) is Monix. This is how it would work with DataSource unchanged:
import scala.concurrent.duration.Duration
import scala.concurrent.Await
import monix.reactive.Observable
import monix.execution.Scheduler.Implicits.global

object Test extends App {
  val source = new DataSource(100)

  val completed = // <- this is a Future[Unit], completes when foreach is done
    Observable.repeatEval(source.getData()) // re-evaluates getData() for every element
      .concatMap(Observable.fromFuture)     // <- Here it's Observable[List[Int]], it has collection-like methods
      .takeWhile(_.nonEmpty)
      .foreach(println)

  Await.result(completed, Duration.Inf)
}
I just figured out that flatMapConcat can achieve what I wanted. There is no point starting another question as I already have the answer, so I'll put my sample code here in case someone is looking for a similar answer.
This type of API is very common in integrations between traditional enterprise applications. The DataSource mocks the API, while the object Test demonstrates how client code can use Akka Streams to consume it.
In my small project the API was provided as SOAP, and I used scalaxb to transform the SOAP into Scala async style. With the client calls demonstrated in the object Test, we can consume the API with Akka Streams. Thanks to all for the help.
import scala.collection.mutable
import scala.concurrent.Future
import scala.util.Random

class DataSource(size: Int) {
  private var transactionId: Long = 0
  private val transactionCursorMap: mutable.HashMap[TransactionId, Set[ReadCursorId]] = mutable.HashMap.empty
  private val cursorIteratorMap: mutable.HashMap[ReadCursorId, Iterator[List[Int]]] = mutable.HashMap.empty
  implicit val g = scala.concurrent.ExecutionContext.global

  case class TransactionId(id: Long)

  case class ReadCursorId(id: Long)

  def startTransaction(): Future[TransactionId] = {
    Future {
      synchronized {
        transactionId += 1
      }
      val t = TransactionId(transactionId)
      transactionCursorMap.update(t, Set(ReadCursorId(0)))
      t
    }
  }

  def createCursorId(t: TransactionId): ReadCursorId = {
    synchronized {
      val c = transactionCursorMap.getOrElseUpdate(t, Set(ReadCursorId(0)))
      val currentId = c.foldLeft(0L) { (acc, a) => acc.max(a.id) }
      val cId = ReadCursorId(currentId + 1)
      transactionCursorMap.update(t, c + cId)
      cursorIteratorMap.put(cId, createIterator())
      cId
    }
  }

  def createIterator(): Iterator[List[Int]] = {
    (for { i <- 1 to 100 } yield List.fill(100)(i)).toIterator
  }

  def startRead(t: TransactionId): Future[ReadCursorId] = {
    Future {
      createCursorId(t)
    }
  }

  def getData(cursorId: ReadCursorId): Future[List[Int]] = {
    synchronized {
      Future {
        Thread.sleep(Random.nextInt(100))
        cursorIteratorMap.get(cursorId) match {
          case Some(i) if i.hasNext => i.next()
          case _                    => List() // exhausted or unknown cursor: signal end of data
        }
      }
    }
  }
}
import akka.actor.ActorSystem
import akka.stream._
import akka.stream.scaladsl._

object Test extends App {
  val source = new DataSource(10)
  implicit val system = ActorSystem("Sandbox")
  implicit val materializer = ActorMaterializer()
  implicit val g = scala.concurrent.ExecutionContext.global
  val s = Source.fromFuture(source.startTransaction())
    .map { e =>
      source.startRead(e)
    }
    .mapAsync(1)(identity)
    .flatMapConcat { e =>
      Source.fromIterator(() => Iterator.continually(source.getData(e)))
    }
    .mapAsync(5)(identity)
    .via(Flow[List[Int]].takeWhile(_.nonEmpty))
    .runForeach(println)
}

Submitting operations in created future

I have a lazy val Future that obtains some object, and a function that submits operations to that Future.
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

class C {
  def printLn(s: String) = println(s)
}

lazy val futureC: Future[C] = Future { Thread.sleep(3000); new C() }

def func(s: String): Unit = {
  futureC.foreach { c => c.printLn(s) }
}
The problem is that when the Future completes, it executes the operations in the reverse order from the one in which they were submitted. So, for example, if I execute sequentially
func("A")
func("B")
func("C")
I get after Future completion
C
B
A
This order is important to me. Is there a way to preserve it?
Of course I could use an actor that asks for the future and stashes the strings while the future is not ready, but that seems redundant to me.
lazy val futureC: Future[C]
lazy vals in Scala are compiled into code that uses a synchronized block for thread safety.
Here, when func("A") is called, it obtains the lock for the lazy val and that thread goes to sleep.
Therefore func("B") and func("C") are blocked by the lock.
When those blocked threads are resumed, their order cannot be guaranteed.
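To make the locking concrete, this is roughly the shape of what the compiler generates for a lazy val (a simplified sketch, not the actual emitted code, which uses a bitmap field and double-checked locking):
// simplified sketch of the compiled form of `lazy val futureC`
@volatile private var _initialized = false
private var _futureC: Future[C] = _

def futureC: Future[C] = {
  if (!_initialized) this.synchronized {
    if (!_initialized) {
      _futureC = Future { Thread.sleep(3000); new C() } // runs at most once
      _initialized = true
    }
  }
  _futureC
}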
If you do it as below, you'll have the order you expect. This is because the for-comprehension creates a flatMap- and map-based chain that gets executed sequentially.
lazy val futureC: Future[C] = Future {
  Thread.sleep(1000)
  new C()
}

def func(s: String): Future[Unit] = {
  futureC.map { c => c.printLn(s) }
}

val x = for {
  _ <- func("A")
  _ <- func("B")
  _ <- func("C")
} yield ()
The order is preserved even without the lazy keyword. You can remove lazy unless it is really necessary.
Hope this helps.
You can use Future.traverse to ensure the order of execution.
Something like this. I'm not sure how your func has a reference to the correct futureC, so I moved it inside.
def func(s: String): Future[Unit] = {
  lazy val futureC = Future { Thread.sleep(3000); new C() }
  futureC.map { c => c.printLn(s) }
}

def traverse[A, B](xs: Seq[A])(fn: A => Future[B]): Future[Seq[B]] =
  xs.foldLeft(Future(Seq[B]())) { (acc, item) =>
    acc.flatMap { accValue =>
      fn(item).map { itemValue =>
        accValue :+ itemValue
      }
    }
  }

traverse(Seq("A", "B", "C"))(func)

How to spawn an unknown number of Futures and combine the results even if one or more fail?

I want to turn the following sequential code into concurrent code with Futures and need advice on how to structure it.
sequential:
import java.net.URL
val providers = List(
new URL("http://www.cnn.com"),
new URL("http://www.bbc.co.uk"),
new URL("http://www.othersite.com")
)
def download(urls: URL*) = urls.flatMap(url => io.Source.fromURL(url).getLines).distinct
val res = download(providers:_*)
I want to download from all the sources that come in via the varargs of the download method and combine the results into one Seq/List/Set. When one Future fails, say because the server is unreachable, it should take all the others, move on, and return their results nonetheless. firstCompletedOf won't work because I need the results of all of them, except any that failed with an error. I thought about using Future.sequence like below, but I can't get it to work. Here is what I tried...
def download(urls: URL*) = Future.sequence {
  urls.map { url =>
    Future {
      io.Source.fromURL(url).getLines
    }
  }
}
This produces a Seq[Future[Iterator[String]]] which is not compatible with M_[Future[A_]].
A Future[Iterator[String]] is what I want. (I thought I'd return an Iterator because I need to reuse it later via the reset method on Iterator.)
You can use parallel collections:
import java.net.URL
import scala.util.{Failure, Success, Try}

val providers = List(
  new URL("http://www.cnn.com"),
  new URL("http://www.bbc.co.uk"),
  new URL("http://www.othersite.com")
)

def download(urls: URL*) = urls.par.flatMap { url =>
  Try {
    io.Source.fromURL(url).getLines
  } match {
    case Success(e) => e
    case Failure(_) => Seq()
  }
}.seq // convert back to a sequential collection

val res: Seq[String] = download(providers: _*)
val res: Seq[String] = download(providers:_*)
Or if you want the non-blocking version with a Future:
import scala.concurrent.{blocking, Future}
import scala.concurrent.ExecutionContext.Implicits.global

def download(urls: URL*) = Future {
  blocking {
    urls.par.flatMap { url =>
      Try {
        io.Source.fromURL(url).getLines
      } match {
        case Success(e) => e
        case Failure(_) => Seq()
      }
    }.seq
  }
}

val res: Future[Seq[String]] = download(providers: _*)
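Another option, closer to the Future.sequence attempt in the question: recover each individual Future to an empty result, so a single failed download cannot fail the combined Future (a sketch; treating any error as "no lines" is my assumption):
import java.net.URL
import scala.concurrent.{ExecutionContext, Future}

def download(urls: URL*)(implicit ec: ExecutionContext): Future[Seq[String]] =
  Future.sequence(
    urls.map { url =>
      Future(io.Source.fromURL(url).getLines().toSeq)
        .recover { case _ => Seq.empty[String] } // failed download => empty contribution
    }
  ).map(_.flatten.distinct)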

Scala Synchronising Asynchronous calls with Future

I have a method that does a couple of database lookups and performs some logic.
The MyResultType object that I return from the method is as follows:
case class MyResultType(typeId: Long, type1: Seq[Type1], type2: Seq[Type2])
The method definition is like this:
def myMethod(typeId: Long, timeInterval: Interval) = async {
  // 1. check if I can find an entity in the database for typeId
  val myTypeOption = await(db.run(findMyTypeById(typeId))) // I'm getting the headOption on this result
  if (myTypeOption.isDefined) {
    val anotherDbLookUp = await(doSomeDBStuff) // Line A
    // the interval gets split; assume that I get a List of these intervals
    val intervalList = splitInterval(timeInterval)
    // for each interval in the intervalList, I do a database look up
    val results: Seq[(Future[Seq[Type1]], Future[Seq[Type2]])] = for {
      interval <- intervalList
    } yield {
      (getType1Entries(interval), getType2Entries(interval))
    }
    // best way to work with the results so that I can return MyResultType?
  } else {
    None
  }
}
Now getType1Entries(interval) and getType2Entries(interval) each return a Future of Seq[Type1] and Seq[Type2] entries!
My problem now is to get the Seq[Type1] and Seq[Type2] out of the Futures and stuff them into the MyResultType case class.
You could refer to this question you asked
Scala transforming a Seq with Future
so you get the
val results2: Future[Seq[(Iterable[Type1], Iterable[Type2])]] = ???
and then call await on it and you have no Futures at all; you can do what you want.
I hope I understood the question correctly.
Oh, and by the way, you should map myTypeOption instead of checking if it's defined and returning None if it's not:
if (myTypeOption.isDefined) {
  Some(x)
} else {
  None
}
can simply be replaced with
myTypeOption.map { _ => // ignoring what was actually inside the option
  x // return whatever you want, without wrapping it in Some
}
If I understood your question correctly, then this should do the trick.
def myMethod(typeId: Long, timeInterval: Interval): Future[Option[Seq[MyResultType]]] = async {
  // 1. check if I can find an entity in the database for typeId
  val myTypeOption = await(db.run(findMyTypeById(typeId))) // I'm getting the headOption on this result
  if (myTypeOption.isDefined) {
    // the interval gets split; assume that I get a List of these intervals
    val intervalList = splitInterval(timeInterval)
    // for each interval in the intervalList, I do a database look up
    val results: Seq[(Future[Seq[Type1]], Future[Seq[Type2]])] = for {
      interval <- intervalList
    } yield {
      (getType1Entries(interval), getType2Entries(interval))
    }
    // zip the two futures of each pair, build a MyResultType from each zipped
    // result, and await the combined future over all intervals
    Some(
      await(
        Future.sequence(
          results.map {
            case (l, r) =>
              l.zip(r).map {
                case (vl, vr) => MyResultType(typeId, vl, vr)
              }
          })))
  } else {
    None
  }
}
There are two parts to your problem, 1) how to deal with two dependent futures, and 2) how to extract the resulting values.
When dealing with dependent futures, I normally compose them together:
val future1 = Future { 10 }
val future2 = Future { 20 }

// results in a new future with type (Int, Int)
val combined = for {
  a <- future1
  b <- future2
} yield (a, b)

// then you can use foreach/map, Await, or onComplete to do
// something when your results are ready
combined.foreach { case (a, b) =>
  // do something with the result here
}
To extract the results I generally use Await if I need to make a synchronous response, use _.onComplete() if I need to deal with potential failure, and use _.foreach()/_.map() for most other circumstances.
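For illustration, minimal sketches of those three styles applied to combined (the timeout and error handling are placeholders):
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

// 1) synchronous: block until the combined future finishes (use sparingly)
val (a, b) = Await.result(combined, 5.seconds)

// 2) deal with potential failure explicitly
combined.onComplete {
  case Success((x, y)) => println(s"got $x and $y")
  case Failure(err)    => println(s"failed: $err")
}

// 3) stay asynchronous: transform the result into another future
val sum: Future[Int] = combined.map { case (x, y) => x + y }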