Converting a Future[Future[(String,String)]] to Future[(String,String)] in Scala using for comprehension - scala

I've a Future[Future[(String,String)]] and I want to convert it into a Future[(String,String)] using a for comprehension.

Not necessarily with a for comprehension, a simple approach involves the use of flatMap over identity.
Consider for instance
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
def f: Future[Future[(String,String)]] = Future { Future {("a","aa")} }
Then
f.flatMap(identity)
res: scala.concurrent.Future[(String, String)] = Promise$DefaultPromise#1849937

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
val futureOfFutures: Future[Future[(String,String)]] = Future { Future {("a","aa")} }
for(futureOfFuture <- futureOfFutures; futureResult <- futureOfFuture) yield futureResult
Just in case you wanted to use for comprehension. This is how you do it.

Related

scala (spark) zio convert future to zio

My objective is to run a number of spark ml regression models (1000s of times) on one dataset and I want to do this using zio instead of future, because it is running too slow. Below is the working example of using Future.
A distinct list of keys is used to filter the partitioned dataset on key and run the model on. I've set up a thread pool with 8 executors to manage it, but it quickly degrades in performance.
import scala.concurrent.{Await, ExecutionContext, ExecutionContextExecutorService, Future}
import java.util.concurrent.{Executors, TimeUnit}
import scala.concurrent.duration._
import org.apache.spark.sql.SaveMode
val pool = Executors.newFixedThreadPool(8)
implicit val xc: ExecutionContextExecutorService = ExecutionContext.fromExecutorService(pool)
case class Result(key: String, coeffs: String)
try {
import spark.implicits._
val tasks = {
for (x <- keys)
yield Future {
Seq(
Result(
x.group,
runModel(input.filter(col("group")===x)).mkString(",")
)
).toDS()
.write.mode(SaveMode.Overwrite).option("header", false).csv(
s"hdfs://namenode:8020/results/$x.csv"
)
}
}.toSeq
Await.result(Future.sequence(tasks), Duration.Inf)
}
finally {
pool.shutdown()
pool.awaitTermination(Long.MaxValue, TimeUnit.NANOSECONDS)
}
I've tried to implement this in zio, but I don't know how to implement queues and set a limit of executors like in futures.
Below is my failed attempt so far...
import zio._
import zio.console._
import zio.stm._
import org.apache.spark.sql.{Dataset, SaveMode, SparkSession}
import org.apache.spark.sql.functions.col
//example data/signatures
case class ModelResult(key: String, coeffs: String)
case class Data(key: String, sales: Double)
val keys: Array[String] = Array("100_1", "100_2")
def runModel[T](ds: Dataset[T]): Vector[Double]
object MyApp1 extends App {
val spark = SparkSession
.builder()
.getOrCreate()
import spark.implicits._
val input: Dataset[Data] = Seq(Data("100_1", 1d), Data("100_2", 2d)).toDS
def run(args: List[String]): ZIO[ZEnv, Nothing, Int] = {
for {
queue <- Queue.bounded[Int](8)
_ <- ZIO.foreach(1 to 8) (i => queue.offer(i)).fork
_ <- ZIO.foreach(keys) { k => queue.take.flatMap(_ => readWrite(k, input, queue)) }
} yield 0
}
def writecsv(k: String, v: String) = {
Seq(ModelResult(k, v))
.toDS
.write
.mode(SaveMode.Overwrite).option("header", value = false)
.csv(s"hdfs://namenode:8020/results/$k.csv")
}
def readWrite[T](key: String, ds: Dataset[T], queue: Queue[Int]): ZIO[ZEnv, Nothing, Int] = {
(for {
result <- runModel(ds.filter(col("key")===key)).mkString(",")
_ <- writecsv(key, result)
_ <- queue.offer(1)
_ <- putStrLn(s"successfully wrote output for $key")
} yield 0)
}
}
//to run
MyApp1.run(List[String]())
What is the best way to deal with compute this in zio?
To parallelize some workload across, say, 8 threads all you need is
ZIO.foreachParN(8)(1 to 100)(id => zio.blocking.blocking(Task{yourClusterJob(id)}))
But don't expect lots of a boost by switching from Futures to ZIO here:
1) Actual workload dominates coordination overhead so difference between ZIO and Future should be marginal.
2) Maybe you won't get any boost at all because 8 tasks will be fighting for the same resource pool in the Spark cluster.

Scala - How to use a Timer without blocking on Futures with Await.result

I have an Rest API provided by akka-http. In some cases I need to get data from an external database (Apache HBase), and I would like the query to fail if the database takes too long to deliver the data.
One naïve way is to wrap the call inside a Future and then block it with an Await.result with the needed duration.
import scala.concurrent.duration._
import scala.concurrent.{Await, Future}
object AsyncTest1 extends App {
val future = Future {
getMyDataFromDB()
}
val myData = Await.result(future, 100.millis)
}
The seems to be inefficient as this implementation needs two threads. Is There an efficient way to do this ?
I have another use case where I want to send multiple queries in parallel and then aggregates the results, with the same delay limitation.
val future1 = Future {
getMyDataFromDB1()
}
val future2 = Future {
getMyDataFromDB2()
}
val foldedFuture = Future.fold(
Seq(future1, future2))(MyAggregatedData)(myAggregateFunction)
)
val myData = Await.result(foldedFuture, 100.millis)
Same question here, what is the most efficient way to implement this ?
Thanks for your help
One solution would be to use Akka's after function which will let you pass a duration, after which the future throws an exception or whatever you want.
Take a look here. It demonstrates how to implement this.
EDIT:
I guess I'll post the code here in case the link gets broken in future:
import scala.concurrent._
import scala.concurrent.duration._
import ExecutionContext.Implicits.global
import scala.util.{Failure, Success}
import akka.actor.ActorSystem
import akka.pattern.after
val system = ActorSystem("theSystem")
lazy val f = future { Thread.sleep(2000); true }
lazy val t = after(duration = 1 second, using = system.scheduler)(Future.failed(new TimeoutException("Future timed out!")))
val fWithTimeout = Future firstCompletedOf Seq(f, t)
fWithTimeout.onComplete {
case Success(x) => println(x)
case Failure(error) => println(error)
}

how to combine the results Future[ Option[ T ] ] into Seq[ T ]

I have a method
def readTree(id: String): Future[Option[CategoryTreeResponse]]
and a list of String channels:List[String].
How to iterate and combine all the results into a non Future Sequence ? such as :
def readAllTrees(): Seq[CategoryTreeResponse] = ???
Possibly without blocking.
Coming form the imperative world, I'd do like this :
import scala.concurrent.duration._
def readTrees(): Seq[CategoryTreeResponse] = {
val list = ListBuffer[CategoryTreeResponse]()
for (id <- channels) {
val tree = Await.result(readTree(id), 5.seconds)
if (tree.isDefined) {
list += tree.get
}
}
list
}
You could do something like this
def readAllTrees(channels: List[String]): Future[Seq[CategoryTreeResponse]] = {
Future.sequence(channels.map(readTree(_))).map(_.flatten)
}
I have changed the signature of readAllTrees to receive the list and return a Future of the Sequence.
If you want to access to the resulting sequence you will need to wait until is finished doing
Await.result(readAllTrees(channels), Duration.Inf)
But this is not a very nice way to manage futures because it will lock the thread that calls Await.ready
Future.sequence and Await.result should help. I agree with Mikel though, it is better to stay async as long as possible using map/flatMap/foreach etc methods of the Future class
scala> :paste
// Entering paste mode (ctrl-D to finish)
import scala.concurrent._
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
case class CategoryTreeResponse()
val futureResults: List[Future[Option[CategoryTreeResponse]]] = List(
Future.successful(Option(CategoryTreeResponse())),
Future.successful(Option(CategoryTreeResponse())),
Future.successful(None)
)
val futureResult: Future[List[Option[CategoryTreeResponse]]] = Future.sequence(futureResults)
val allResults: List[Option[CategoryTreeResponse]] = Await.result(futureResult, Duration.Inf)
val nonEmptyResults: Seq[CategoryTreeResponse] = allResults.flatMap(_.toSeq)
// Exiting paste mode, now interpreting.
import scala.concurrent._
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
defined class CategoryTreeResponse
futureResults: List[scala.concurrent.Future[Option[CategoryTreeResponse]]] = List(Future(Success(Some(CategoryTreeResponse()))), Future(Success(Some(CategoryTreeResponse()))), Future(Success(None)))
futureResult: scala.concurrent.Future[List[Option[CategoryTreeResponse]]] = Future(Success(List(Some(CategoryTreeResponse()), Some(CategoryTreeResponse()), None)))
allResults: List[Option[CategoryTreeResponse]] = List(Some(CategoryTreeResponse()), Some(CategoryTreeResponse()), None)
nonEmptyResults: Seq[CategoryTreeResponse] = List(CategoryTreeResponse(), CategoryTreeResponse())
scala>

Scala: Parallel execution with ListBuffer appends doesn't produce expected outcome

I know I'm doing something wrong with mutable.ListBuffer but I can't figure out how to fix it (and a proper explanation of the issue).
I simplified the code below to reproduce the behavior.
I'm basically trying to run functions in parallel to add elements to a list as my first list get processed. I end up "losing" elements.
import java.util.Properties
import scala.collection.mutable.ListBuffer
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}
import scala.concurrent.{ExecutionContext}
import ExecutionContext.Implicits.global
object MyTestObject {
var listBufferOfInts = new ListBuffer[Int]() // files that are processed
def runFunction(): Int = {
listBufferOfInts = new ListBuffer[Int]()
val inputListOfInts = 1 to 1000
val fut = Future.traverse(inputListOfInts) { i =>
Future {
appendElem(i)
}
}
Await.ready(fut, Duration.Inf)
listBufferOfInts.length
}
def appendElem(elem: Int): Unit = {
listBufferOfInts ++= List(elem)
}
}
MyTestObject.runFunction()
MyTestObject.runFunction()
MyTestObject.runFunction()
which returns:
res0: Int = 937
res1: Int = 992
res2: Int = 997
Obviously I would expect 1000 to be returned all the time. How can I fix my code to keep the "architecture" but make my ListBuffer "synchronized" ?
I don't know what exact problem is as you said you simplified it, but still you have an obvious race condition, multiple threads modify a single mutable collection and that is very bad. As other answers pointed out you need some locking so that only one thread could modify collection at the same time. If your calculations are heavy, appending result in synchronized way to a buffer shouldn't notably affect the performance but when in doubt always measure.
But synchronization is not needed, you can do something else instead, without vars and mutable state. Let each Future return your partial result and then merge them into a list, in fact Future.traverse does just that.
import scala.concurrent.duration._
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
def runFunction: Int = {
val inputListOfInts = 1 to 1000
val fut: Future[List[Int]] = Future.traverse(inputListOfInts.toList) { i =>
Future {
// some heavy calculations on i
i * 4
}
}
val listOfInts = Await.result(fut, Duration.Inf)
listOfInts.size
}
Future.traverse already gives you an immutable list with all your results combined, no need to append them to a mutable buffer.
Needless to say, you will always get 1000 back.
# List.fill(10000)(runFunction).exists(_ != 1000)
res18: Boolean = false
I'm not sure the above shows what you are trying to do correctly. Maybe the issue is that you are actually sharing a var ListBuffer which you reinitialise within runFunction.
When I take this out I collect all the events I'm expecting correctly:
import java.util.Properties
import scala.collection.mutable.ListBuffer
import scala.concurrent.duration.Duration
import scala.concurrent.{ Await, Future }
import scala.concurrent.{ ExecutionContext }
import ExecutionContext.Implicits.global
object BrokenTestObject extends App {
var listBufferOfInts = ( new ListBuffer[Int]() )
def runFunction(): Int = {
val inputListOfInts = 1 to 1000
val fut = Future.traverse(inputListOfInts) { i =>
Future {
appendElem(i)
}
}
Await.ready(fut, Duration.Inf)
listBufferOfInts.length
}
def appendElem(elem: Int): Unit = {
listBufferOfInts.append( elem )
}
BrokenTestObject.runFunction()
BrokenTestObject.runFunction()
BrokenTestObject.runFunction()
println(s"collected ${listBufferOfInts.length} elements")
}
If you really have a synchronisation issue you can use something like the following:
import java.util.Properties
import scala.collection.mutable.ListBuffer
import scala.concurrent.duration.Duration
import scala.concurrent.{ Await, Future }
import scala.concurrent.{ ExecutionContext }
import ExecutionContext.Implicits.global
class WrappedListBuffer(val lb: ListBuffer[Int]) {
def append(i: Int) {
this.synchronized {
lb.append(i)
}
}
}
object MyTestObject extends App {
var listBufferOfInts = new WrappedListBuffer( new ListBuffer[Int]() )
def runFunction(): Int = {
val inputListOfInts = 1 to 1000
val fut = Future.traverse(inputListOfInts) { i =>
Future {
appendElem(i)
}
}
Await.ready(fut, Duration.Inf)
listBufferOfInts.lb.length
}
def appendElem(elem: Int): Unit = {
listBufferOfInts.append( elem )
}
MyTestObject.runFunction()
MyTestObject.runFunction()
MyTestObject.runFunction()
println(s"collected ${listBufferOfInts.lb.size} elements")
}
Changing
listBufferOfInts ++= List(elem)
to
synchronized {
listBufferOfInts ++= List(elem)
}
Make it work. Probably can become a performance issue? I'm still interested in an explanation and maybe a better way of doing things!

Wait for all nested Futures to complete

val works: Seq[Future[Seq[Future[String]]]] = ...
How would I wait for all these top and nested Futures to complete?
My first idea is:
val result1: Seq[Seq[Future[String]]] = Await.result(
Future.sequence(works), Duration.Inf
)
val result2: Seq[String] = Await.result(
Future.sequence(result1.flatten), Duration.Inf
)
But I guess it is not as effiecient as it could be.
You shouldn't nest multiple Futures, normally you would use map, flatMap, for comprehensions, ... to chain asynchronous operations using Futures together.
If you want to continue with your Seq[Future[Seq[Future[String]]]], you can use the Future.sequence and Seq.flatten functions you were using to create a Future[Seq[String]] :
import scala.concurrent.{Future, Await}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
val works = Seq(Future.successful(Seq(Future.successful("abc"))))
val onlyOneFuture: Future[Seq[String]] =
Future.sequence(works).map(_.flatten).flatMap(Future.sequence(_))
Await.result(onlyOneFuture, 5.seconds)
// Seq[String] = List(abc)