Using scalaz-stream as a real time Writer for asynchronous computations - scala

I have a web-app that does a bunch of slow concurrent work to calculate its result. Instead of leaving the end user hanging I'd like to stream back progress updates via a websocket.
My codebase is built up from compositions of Scalaz disjunctions (\/), like:
type ProcessResult = Error \/ Int
def downloadFile(url: String): Future[Error \/ String] = ???
def doSlowProcessing(data1: String, data2: String): Future[ProcessResult] = ???
/* Very simple however doesn't give any progress update */
def execute(): Future[ProcessResult] = {
  val download1 = downloadFile(...)
  val download2 = downloadFile(...)
  val et = for {
    d1 <- download1
    d2 <- download2
    processed <- doSlowProcessing(d1, d2)
  } yield processed
  et.run
}
This works very well but of course the entire computation needs to be finished before I get anything out of the Future. Even if I stacked on a Writer monad to do logging I would only get the log once finished, not making my end users any happier.
I toyed around with using a scalaz-stream Queue to send the logs as a side effect while the code is running, however the end result is pretty ugly:
def execute(): Process[Task, String \/ ProcessResult] = {
  val (q, src) = async.queue[String \/ ProcessResult]
  val download1 = downloadFile(...)
  val download2 = downloadFile(...)
  val et = for {
    d1 <- q.enqueue("Downloading 1".left); download1
    d2 <- q.enqueue("Downloading 2".left); download2
    processed <- q.enqueue("Doing processing".left); doSlowProcessing(d1, d2)
  } yield processed
  et.run.onSuccess {
    x =>
      q.enqueue(x.right)
      q.close
  }
  src
}
It feels like there should be an idiomatic way to achieve this? Turning my SIP-14 Scala futures into Tasks is possible if necessary.
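For reference, one common way to adapt a SIP-14 Future into a scalaz Task is a small wrapper around Task.async; this is a sketch assuming scalaz 7.x, and futureToTask is not part of either library:

import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Success}
import scalaz.{-\/, \/-}
import scalaz.concurrent.Task

// Hypothetical adapter: complete the Task's callback from the Future's callback.
def futureToTask[A](f: => Future[A])(implicit ec: ExecutionContext): Task[A] =
  Task.async { cb =>
    f.onComplete {
      case Success(a) => cb(\/-(a))
      case Failure(t) => cb(-\/(t))
    }
  }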

I don't think you need to use a queue; one approach is to use non-deterministic merging via wye, i.e.:
type Result = ???
val download1: Process[Task,File] = ???
val download2: Process[Task,File] = ???
val result: Process[Task,(File,File)] = (download1 yip download2).once
val processed: Process[Task, Result] = result.flatMap(doSlowProcessing)
// Run asynchronously
processed.runLast.runAsync {
  case Some(r) => .... // result computed
  case None    => .... // no result, hence download1/2 were empty
}

// or run synchronously, awaiting the result
processed.runLast.run match {
  case Some(r) => .... // result computed
  case None    => .... // no result
}
// to capture error information during the download, use
val withError: Process[Task, Throwable \/ File] = download1.attempt

// or, to log the error and recover with another download
val withError: Process[Task, File] = download1 onFailure { err => Log(err); download3 }
Does that make sense?
Also please note that async.queue has been deprecated since 0.5.0 in favor of async.unboundedQueue.
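For completeness, a minimal sketch of the newer queue API (assuming scalaz-stream 0.5+), where enqueueing is an explicit Task and the dequeue side is a Process:

import scalaz.concurrent.Task
import scalaz.stream.{Process, async}

val q = async.unboundedQueue[String]
val logs: Process[Task, String] = q.dequeue              // consume progress messages as a stream
val publish: Task[Unit] = q.enqueueOne("Downloading 1")  // enqueue one message
val shutdown: Task[Unit] = q.close                       // close the queue when done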

Related

Synchronously go through Flink DataStream[String], but asynchronously do multiple updates to each message

val updatedDataStream = dataStream.map(new MyMapFunction)
I use map() instead of Flink's native AsyncDataStream because I want to go through the messages synchronously. AsyncDataStream.unorderedWait or orderedWait would go through the messages asynchronously.
The code below updates each message in the dataStream with 2 updates; the 2 updates themselves are done asynchronously, so the total time for both together equals the time taken by the slower update.
class MyMapFunction extends RichMapFunction[String, String] {
  private var client: AsyncClient = _

  override def open(parameters: Configuration): Unit = {
    client = new AsyncClient
  }

  override def map(value: String): String = {
    if (value.nonEmpty) {
      // the line below de-serializes the json message to a parsable object
      val a = objectMapper.readValue(value, classOf[Test])
      // the function calls below (firstUpdate and secondUpdate) return Future[String]
      val firstFieldValue = client.firstUpdate()
      val secondFieldValue = client.secondUpdate()

      def updateRecord(r1: String, r2: String): String = {
        a.firstField = r1
        a.secondField = r2
        // the line below serializes the object back to a json String
        objectMapper.writeValueAsString(a)
      }

      val enrichment = for {
        r1 <- firstFieldValue
        r2 <- secondFieldValue
      } yield updateRecord(r1, r2)

      val f = enrichment.onComplete {
        case Success(result) => result
        case Failure(exception) => exception
      }
    } else ""
  }
}
Problem:
This won't work, as onComplete returns Unit. But I want it to return the result (a String) so I can send it back to updatedDataStream.
Since map has a synchronous signature, you'll have to block. Await.result blocks until the future completes.
// instead of val f = enrichment.onComplete ...
Await.result(enrichment, Duration.Inf)
Note that blocking like this may limit throughput, though if r1 and r2 are able to execute in parallel, this period of blocking will likely be shorter than the time the thread invoking map would be blocked if done synchronously.
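Putting it together, the map body can return the awaited result directly. This is only a sketch based on the question's code (Test, firstField, client and objectMapper all come from there), not Flink's own API:

import scala.concurrent.Await
import scala.concurrent.duration.Duration

override def map(value: String): String = {
  if (value.nonEmpty) {
    val a = objectMapper.readValue(value, classOf[Test])
    // both updates are started before either is awaited, so they can run concurrently
    val firstFieldValue = client.firstUpdate()
    val secondFieldValue = client.secondUpdate()
    val enrichment = for {
      r1 <- firstFieldValue
      r2 <- secondFieldValue
    } yield {
      a.firstField = r1
      a.secondField = r2
      objectMapper.writeValueAsString(a)
    }
    // block the map thread until the enriched record is available
    Await.result(enrichment, Duration.Inf)
  } else ""
}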

ZIO watch file system events

Please help me organize a directory scan with ZIO. This is my version, but it doesn't catch all file creation events (it misses some).
object Main extends App {
  val program = for {
    stream <- ZIO.succeed(waitEvents)
    _ <- stream.run(ZSink.foreach(k => putStrLn(k.map(e => (e.kind(), e.context())).mkString("\n"))))
  } yield ()

  val managedWatchService = ZManaged.make {
    for {
      watchService <- FileSystem.default.newWatchService
      path = Path("c:/temp")
      _ <- path.register(watchService,
        StandardWatchEventKinds.ENTRY_CREATE,
        StandardWatchEventKinds.ENTRY_DELETE
      )
    } yield watchService
  }(_.close.orDie)

  val lookKey = ZManaged.make {
    managedWatchService.use(watchService => watchService.take)
  }(_.reset)

  val waitEvents = ZStream.fromEffect {
    lookKey.use(key => key.pollEvents)
  }.repeat(Schedule.forever)

  override def run(args: List[String]): ZIO[zio.ZEnv, Nothing, ExitCode] =
    program
      .provideLayer(Console.live ++ Blocking.live ++ Clock.live)
      .exitCode
}
Thank you for your advice.
You are forcing your WatchService to shut down and be recreated every time you poll for events. Since that probably involves some system handles, it is likely fairly slow, so you are probably missing file events that occur in between. More likely you want to create the WatchService once and then poll it repeatedly. I would suggest something like this instead:
object Main extends App {
  val managedWatchService = ZManaged.make {
    for {
      watchService <- FileSystem.default.newWatchService
      path = Path("c:/temp")
      _ <- path.register(watchService,
        StandardWatchEventKinds.ENTRY_CREATE,
        StandardWatchEventKinds.ENTRY_DELETE
      )
    } yield watchService
  }(_.close.orDie)

  // Convert ZManaged[R, E, ZStream[R, E, A]] into ZStream[R, E, A]
  val waitEvents = ZStream.unwrapManaged(
    managedWatchService.mapM(_.take).map { key =>
      // Use simple effect composition instead of a managed for readability.
      ZStream.repeatEffect(key.pollEvents <* key.reset)
        // Optional: flatten the `List` of values that is returned
        .flattenIterables
    }
  )

  val program = waitEvents
    .map(e => (e.kind(), e.context()).toString)
    .foreach(putStrLn).unit

  override def run(args: List[String]): ZIO[zio.ZEnv, Nothing, ExitCode] =
    program
      .provideLayer(Console.live ++ Blocking.live ++ Clock.live)
      .exitCode
}
Also as a side note, when using ZManaged, you probably don't want to do
ZManaged.make(otherManaged.use(doSomething))(tearDown)
because you will cause the finalizers to execute out of order. ZManaged can already handle the ordering of teardown just through normal flatMap composition.
otherManaged.flatMap { other => ZManaged.make(doSomething(other))(tearDown) }
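A small sketch of the difference (hypothetical names, assuming ZIO 1.x): with flatMap the inner resource is released before the outer one, which is usually what you want.

import zio._
import zio.console._

val outer = ZManaged.make(putStrLn("acquire outer").as("outer"))(_ => putStrLn("release outer").orDie)

val both = outer.flatMap { o =>
  ZManaged.make(putStrLn(s"acquire inner using $o"))(_ => putStrLn("release inner").orDie)
}

// both.use(_ => ZIO.unit) prints:
//   acquire outer, acquire inner, release inner, release outer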

Scala Futures with multiple dependencies

I have to asynchronously compute a set of features that can have multiple dependencies on each other (no cycles). For example:
class FeatureEncoderMock(val n: String, val deps: List[String] = List.empty) {
  def compute = {
    println(s"starting computation feature $n")
    Thread.sleep(r.nextInt(2500)) // r is a scala.util.Random instance
    println(s"end computation feature $n")
  }
}
val registry = Map(
  "feat1" -> new FeatureEncoderMock("feat1", List("factLogA", "factLogB")),
  "factLogA" -> new FeatureEncoderMock("factLogA"),
  "factLogB" -> new FeatureEncoderMock("factLogB"),
  "feat2" -> new FeatureEncoderMock("feat2", List("factLogA")),
  "feat3" -> new FeatureEncoderMock("feat3", List("feat1")),
  "feat4" -> new FeatureEncoderMock("feat4", List("feat3", "factLogB"))
)
What I want to achieve is to call a single function on feat4 that triggers the computation of all dependent features and takes care of the dependencies among them. I tried this:
def run(): Unit = {
  val requested = "feat4"
  val allFeatures = getChainOfDependencies(requested)
  val promises = allFeatures.zip(Seq.fill(allFeatures.size)(Promise[Unit])).toMap

  def computeWithDependencies(f: String) = Future {
    println(s"computing $f")
    val encoder = registry(f)
    if (encoder.deps.isEmpty) {
      promises(f).success(registry(f).compute)
    } else {
      val depTasks = promises.filterKeys(encoder.deps.contains)
      val depTasksFuture = Future.sequence(depTasks.map(_._2.future))
      depTasksFuture.onSuccess({
        case _ =>
          println(s"all deps for $f has been computed")
          promises(f).success(registry(f).compute)
          println(s"done for $f")
      })
    }
  }

  computeWithDependencies(requested)
}
But I cannot understand why the order of execution is not as expected. I am not sure of the proper way to feed a future into a promise; I am fairly sure that part of the code is wrong.
I think you're overthinking it with the promises; Future composition is probably all that you need. Something like this:
import scala.collection.mutable

def computeWithDependencies(s: String, cache: mutable.Map[String, Future[Unit]] = mutable.Map.empty)
                           (implicit ec: ExecutionContext): Future[Unit] = {
  cache.get(s) match {
    case Some(f) => f
    case None => {
      val encoder = registry(s)
      val depsFutures = encoder.deps.map(d => computeWithDependencies(d, cache))
      val result = Future.sequence(depsFutures).flatMap(_ => Future { encoder.compute })
      cache += s -> result
      result
    }
  }
}
The call to flatMap ensures that all of the dependency futures complete before the "current" future executes, even though their result (a List[Unit]) is ignored. The business with the cache is just to prevent recomputation if the dependency graph has a "diamond" in it; it could be left out if the graph has none or if you're OK with recomputing. Anyway, when running this:
val futureResult = computeWithDependencies("feat4")
Await.result(futureResult, 30 seconds)
I see this output:
starting computation feature factLogB
starting computation feature factLogA
end computation feature factLogB
end computation feature factLogA
starting computation feature feat1
end computation feature feat1
starting computation feature feat3
end computation feature feat3
starting computation feature feat4
end computation feature feat4
Which seems correct to me.

Iterate a data source asynchronously in batches and stop when the remote returns no data in Scala

Let's say we have a fake data source which will return the data it holds in batches:
class DataSource(size: Int) {
  private var s = 0
  implicit val g = scala.concurrent.ExecutionContext.global

  def getData(): Future[List[Int]] = {
    s = s + 1
    Future {
      Thread.sleep(Random.nextInt(s * 100))
      if (s <= size) {
        List.fill(100)(s)
      } else {
        List()
      }
    }
  }
}
object Test extends App {
  val source = new DataSource(100)
  implicit val g = scala.concurrent.ExecutionContext.global

  def process(v: List[Int]): Unit = {
    println(v)
  }

  def next(f: (List[Int]) => Unit): Unit = {
    val fut = source.getData()
    fut.onComplete {
      case Success(v) => {
        f(v)
        v match {
          case h :: t => next(f)
        }
      }
    }
  }

  next(process)
  Thread.sleep(1000000000)
}
This is my attempt, but the problem is that parts of it are not pure. Ideally, I would like to wrap the Future for each batch into one big Future, with the wrapper succeeding when the last batch returns a zero-size list. My situation is a little different from this post: the next() there is a synchronous call, while mine is also async.
Or is it even possible to do what I want? The next batch should only be fetched when the previous one has resolved, and whether to fetch another batch depends on the size returned.
What's the best way to walk through this type of data source? Are there any existing Scala frameworks that provide the feature I am looking for? Are Play's Iteratee, Enumerator and Enumeratee the right tools? If so, can anyone provide an example of how to use those facilities to implement what I am looking for?
Edit:
With help from chunjef, I tried it out and it actually worked for me. However, I made a small change based on his answer:
Source.fromIterator(() => Iterator.continually(source.getData())).mapAsync(1)(f => f.filter(_.size > 0))
  .via(Flow[List[Int]].takeWhile(_.nonEmpty))
  .runForeach(println)
However, can someone give a comparison between Akka Streams and Play's Iteratee? Is it worth also trying out Iteratee?
Code snip 1:
Source.fromIterator(() => Iterator.continually(ds.getData)) // line 1
.mapAsync(1)(identity) // line 2
.takeWhile(_.nonEmpty) // line 3
.runForeach(println) // line 4
Code snip 2: Assuming getData depends on some output of another flow, and I would like to concat it with the flow below. However, it yields a "too many open files" error. I'm not sure what causes this error; mapAsync has been limited to a parallelism of 1, if I understood it correctly.
Flow[Int].mapConcat[Future[List[Int]]](c => {
  Iterator.continually(ds.getData(c)).to[collection.immutable.Iterable]
}).mapAsync(1)(identity).takeWhile(_.nonEmpty).runForeach(println)
The following is one way to achieve the same behavior with Akka Streams, using your DataSource class:
import scala.concurrent.Future
import scala.util.Random
import akka.actor.ActorSystem
import akka.stream._
import akka.stream.scaladsl._
object StreamsExample extends App {
  implicit val system = ActorSystem("Sandbox")
  implicit val materializer = ActorMaterializer()

  val ds = new DataSource(100)

  Source.fromIterator(() => Iterator.continually(ds.getData)) // line 1
    .mapAsync(1)(identity)                                     // line 2
    .takeWhile(_.nonEmpty)                                     // line 3
    .runForeach(println)                                       // line 4
}
class DataSource(size: Int) {
...
}
A simplified line-by-line overview:
line 1: Creates a stream source that continually calls ds.getData if there is downstream demand.
line 2: mapAsync is a way to deal with stream elements that are Futures. In this case, the stream elements are of type Future[List[Int]]. The argument 1 is the level of parallelism: we specify 1 here because DataSource internally uses a mutable variable, and a parallelism level greater than one could produce unexpected results. identity is shorthand for x => x, which basically means that for each Future, we pass its result downstream without transforming it.
line 3: Essentially, ds.getData is called as long as the result of the Future is a non-empty List[Int]. If an empty List is encountered, processing is terminated.
line 4: runForeach here takes a function List[Int] => Unit and invokes that function for each stream element.
Ideally, I would like to wrap the Future for each batch into a big future, and the wrapper future success when last batch returned 0 size list?
I think you are looking for a Promise.
You would set up a Promise before you start the first iteration.
This gives you promise.future, a Future that you can then use to follow the completion of everything.
In your onComplete, you add a case _ => promise.success().
Something like
def loopUntilDone(f: (List[Int]) => Unit): Future[Unit] = {
  val promise = Promise[Unit]()
  def next(): Unit = source.getData().onComplete {
    case Success(v) =>
      f(v)
      v match {
        case h :: t => next()
        case _ => promise.success(())
      }
    case Failure(e) => promise.failure(e)
  }
  // get going
  next()
  // return the Future for everything
  promise.future
}
// future for everything, this is a `Future[Unit]`
// its `onComplete` will be triggered when there is no more data
val everything = loopUntilDone(process)
You are probably looking for a reactive streams library. My personal favorite (and the one I'm most familiar with) is Monix. This is how it would work with DataSource unchanged:
import scala.concurrent.duration.Duration
import scala.concurrent.Await
import monix.reactive.Observable
import monix.execution.Scheduler.Implicits.global
object Test extends App {
  val source = new DataSource(100)

  val completed = // <- this is Future[Unit], completes when foreach is done
    Observable.repeat(Observable.fromFuture(source.getData()))
      .flatten // <- Here it's Observable[List[Int]], it has collection-like methods
      .takeWhile(_.nonEmpty)
      .foreach(println)

  Await.result(completed, Duration.Inf)
}
I just figured out that flatMapConcat can achieve what I wanted. There is no point in starting another question since I already have the answer, so I'm putting my sample code here in case someone is looking for a similar solution.
This type of API is very common for integration between traditional enterprise applications. The DataSource mocks the API, while the Test object demonstrates how client code can use Akka Streams to consume it.
In my small project the API was provided over SOAP, and I used scalaxb to transform the SOAP into Scala async style. With the client calls demonstrated in the Test object, we can consume the API with Akka Streams. Thanks to all for the help.
class DataSource(size: Int) {
  private var transactionId: Long = 0
  private val transactionCursorMap: mutable.HashMap[TransactionId, Set[ReadCursorId]] = mutable.HashMap.empty
  private val cursorIteratorMap: mutable.HashMap[ReadCursorId, Iterator[List[Int]]] = mutable.HashMap.empty
  implicit val g = scala.concurrent.ExecutionContext.global

  case class TransactionId(id: Long)
  case class ReadCursorId(id: Long)

  def startTransaction(): Future[TransactionId] = {
    Future {
      synchronized {
        transactionId += 1
      }
      val t = TransactionId(transactionId)
      transactionCursorMap.update(t, Set(ReadCursorId(0)))
      t
    }
  }

  def createCursorId(t: TransactionId): ReadCursorId = {
    synchronized {
      val c = transactionCursorMap.getOrElseUpdate(t, Set(ReadCursorId(0)))
      val currentId = c.foldLeft(0L) { (acc, a) => acc.max(a.id) }
      val cId = ReadCursorId(currentId + 1)
      transactionCursorMap.update(t, c + cId)
      cursorIteratorMap.put(cId, createIterator())
      cId
    }
  }

  def createIterator(): Iterator[List[Int]] = {
    (for { i <- 1 to 100 } yield List.fill(100)(i)).toIterator
  }

  def startRead(t: TransactionId): Future[ReadCursorId] = {
    Future {
      createCursorId(t)
    }
  }

  def getData(cursorId: ReadCursorId): Future[List[Int]] = {
    synchronized {
      Future {
        Thread.sleep(Random.nextInt(100))
        cursorIteratorMap.get(cursorId) match {
          case Some(i) => i.next()
          case _ => List()
        }
      }
    }
  }
}
object Test extends App {
  val source = new DataSource(10)
  implicit val system = ActorSystem("Sandbox")
  implicit val materializer = ActorMaterializer()
  implicit val g = scala.concurrent.ExecutionContext.global
  val s = Source.fromFuture(source.startTransaction())
    .map { e =>
      source.startRead(e)
    }
    .mapAsync(1)(identity)
    .flatMapConcat(
      e => {
        Source.fromIterator(() => Iterator.continually(source.getData(e)))
      })
    .mapAsync(5)(identity)
    .via(Flow[List[Int]].takeWhile(_.nonEmpty))
    .runForeach(println)

  /*
  val done = Source.fromIterator(() => Iterator.continually(source.getData())).mapAsync(1)(identity)
    .via(Flow[List[Int]].takeWhile(_.nonEmpty))
    .runFold(List[List[Int]]()) { (acc, r) =>
      // println("=======" + acc + r)
      r :: acc
    }

  done.onSuccess {
    case e => {
      e.foreach(println)
    }
  }

  done.onComplete(_ => system.terminate())
  */
}

Submitting operations in created future

I have a lazy val Future that obtains some object, and a function which submits operations on the Future.
class C {
  def printLn(s: String) = println(s)
}

lazy val futureC: Future[C] = Future { Thread.sleep(3000); new C() }

def func(s: String): Unit = {
  futureC.foreach { c => c.printLn(s) }
}
The problem is that when the Future completes, it executes the operations in the reverse of the order in which they were submitted. So for example if I execute sequentially
func("A")
func("B")
func("C")
I get, after the Future completes:
C
B
A
This order is important for me. Is there a way to preserve this order?
Of course I can use an actor who asks for future and stashing strings while future is not ready, but it seems redundant for me.
lazy val futureC: Future[C]
lazy vals in Scala compile to code that uses a synchronized block for thread safety.
Here, when func("A") is called, it obtains the lock for the lazy val and that thread goes to sleep.
Therefore func("B") and func("C") are blocked by the lock.
When those blocked threads run, the order cannot be guaranteed.
If you do it like below, you'll have the order you expect. This is because the for comprehension creates a flatMap- and map-based chain that gets executed sequentially.
lazy val futureC: Future[C] = Future {
  Thread.sleep(1000)
  new C()
}

def func(s: String): Future[Unit] = {
  futureC.map { c => c.printLn(s) }
}

val x = for {
  _ <- func("A")
  _ <- func("B")
  _ <- func("C")
} yield ()
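A small usage sketch: awaiting the chained result then shows the lines in submission order.

import scala.concurrent.Await
import scala.concurrent.duration._

Await.result(x, 10.seconds)
// prints:
// A
// B
// C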
The order is preserved even without the lazy keyword, so you can remove it unless it is really necessary.
Hope this helps.
You can use a Future.traverse-style fold to ensure the order of execution.
Something like this. I'm not sure how your func gets a reference to the correct futureC, so I moved it inside. Note that fn(item) is only called inside flatMap, so each future starts only after the previous one has completed.
def func(s: String): Future[Unit] = {
  lazy val futureC = Future { Thread.sleep(3000); new C() }
  futureC.map { c => c.printLn(s) }
}

def traverse[A, B](xs: Seq[A])(fn: A => Future[B]): Future[Seq[B]] =
  xs.foldLeft(Future(Seq[B]())) { (acc, item) =>
    acc.flatMap { accValue =>
      fn(item).map { itemValue =>
        accValue :+ itemValue
      }
    }
  }

traverse(Seq("A", "B", "C"))(func)