ZIO watch file system events - scala

Help me organize a directory watch with ZIO. This is my version, but it doesn't track all file creation events (it misses some).
object Main extends App {
  val program = for {
    stream <- ZIO.succeed(waitEvents)
    _ <- stream.run(ZSink.foreach(k => putStrLn(k.map(e => (e.kind(), e.context())).mkString("\n"))))
  } yield ()

  val managedWatchService = ZManaged.make {
    for {
      watchService <- FileSystem.default.newWatchService
      path = Path("c:/temp")
      _ <- path.register(watchService,
             StandardWatchEventKinds.ENTRY_CREATE,
             StandardWatchEventKinds.ENTRY_DELETE
           )
    } yield watchService
  }(_.close.orDie)

  val lookKey = ZManaged.make {
    managedWatchService.use(watchService => watchService.take)
  }(_.reset)

  val waitEvents = ZStream.fromEffect {
    lookKey.use(key => key.pollEvents)
  }.repeat(Schedule.forever)

  override def run(args: List[String]): ZIO[zio.ZEnv, Nothing, ExitCode] =
    program
      .provideLayer(Console.live ++ Blocking.live ++ Clock.live)
      .exitCode
}
Thank you for your advice.

You are forcing your WatchService to shut down and be recreated every time you poll for events. Since that likely involves system handles, it is probably fairly slow, so you are likely missing file events that occur in between. More likely you want to create the WatchService once and then poll it repeatedly. I would suggest something like this instead:
object Main extends App {
  val managedWatchService = ZManaged.make {
    for {
      watchService <- FileSystem.default.newWatchService
      path = Path("c:/temp")
      _ <- path.register(watchService,
             StandardWatchEventKinds.ENTRY_CREATE,
             StandardWatchEventKinds.ENTRY_DELETE
           )
    } yield watchService
  }(_.close.orDie)

  // Convert ZManaged[R, E, ZStream[R, E, A]] into ZStream[R, E, A]
  val waitEvents = ZStream.unwrapManaged(
    managedWatchService.mapM(_.take).map { key =>
      // Use simple effect composition instead of a managed for readability.
      ZStream.repeatEffect(key.pollEvents <* key.reset)
        // Optional: flatten the `List` of values that is returned
        .flattenIterables
    }
  )

  val program = waitEvents
    .map(e => (e.kind(), e.context()).toString)
    .foreach(putStrLn).unit

  override def run(args: List[String]): ZIO[zio.ZEnv, Nothing, ExitCode] =
    program
      .provideLayer(Console.live ++ Blocking.live ++ Clock.live)
      .exitCode
}
Also as a side note, when using ZManaged, you probably don't want to do
ZManaged.make(otherManaged.use(doSomething))(tearDown)
because you will cause the finalizers to execute out of order. ZManaged can already handle the ordering of teardown just through normal flatMap composition:
otherManaged.flatMap { other => ZManaged.make(doSomething(other))(tearDown) }
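Applied to the watch-service example above, a minimal sketch of that composition (assuming the same zio-nio `take`/`reset` methods used earlier) would be:

// Sketch only: acquire the service, then a key from it; ZManaged releases
// them in reverse order (reset the key first, then close the service).
val managedKey = managedWatchService.flatMap { watchService =>
  ZManaged.make(watchService.take)(_.reset)
}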

Related

FS2 Stream with StateT[IO, _, _], periodically dumping state

I have a program which consumes an infinite stream of data. Along the way I'd like to record some metrics, which form a monoid since they're just simple sums and averages. Periodically, I want to write out these metrics somewhere, clear them, and return to accumulating them. I have essentially:
object Foo {
  type MetricsIO[A] = StateT[IO, MetricData, A]

  def recordMetric(m: MetricData): MetricsIO[Unit] = {
    StateT.modify(_.combine(m))
  }

  def sendMetrics: MetricsIO[Unit] = {
    StateT.modifyF { s =>
      val write: IO[Unit] = writeMetrics(s)
      write.attempt.map {
        case Left(_) => s
        case Right(_) => Monoid[MetricData].empty
      }
    }
  }
}
So most of the execution uses IO directly and lifts using StateT.liftF. And in certain situations, I include some calls to recordMetric. At the end of it I've got a stream:
val mainStream: Stream[MetricsIO, Bar] = ...
And I want to periodically, say every minute or so, dump the metrics, so I tried:
val scheduler: Scheduler = ...
val sendStream =
  scheduler
    .awakeEvery[MetricsIO](FiniteDuration(1, TimeUnit.MINUTES))
    .evalMap(_ => Foo.sendMetrics)
val result = mainStream.concurrently(sendStream).compile.drain
And then I do the usual top level program stuff of calling run with the start state and then calling unsafeRunSync.
The issue is, I only ever see empty metrics! I suspect it's something to do with my monoid implicitly providing empty metrics to sendStream, but I can't quite figure out why that should be or how to fix it. Maybe there's a way I can "interleave" these sendMetrics calls into the main stream instead?
Edit: here's a minimal complete runnable example:
import fs2._
import cats.implicits._
import cats.data._
import cats.effect._
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext
import scala.concurrent.duration._

val sec = Executors.newScheduledThreadPool(4)
implicit val ec = ExecutionContext.fromExecutorService(sec)

type F[A] = StateT[IO, List[String], A]

val slowInts = Stream.unfoldEval[F, Int, Int](1) { n =>
  StateT(state => IO {
    Thread.sleep(500)
    val message = s"hello $n"
    val newState = message :: state
    val result = Some((n, n + 1))
    (newState, result)
  })
}

val ticks = Scheduler.fromScheduledExecutorService(sec).fixedDelay[F](FiniteDuration(1, SECONDS))

val slowIntsPeriodicallyClearedState = slowInts.either(ticks).evalMap[Int] {
  case Left(n) => StateT.liftF(IO(n))
  case Right(_) => StateT(state => IO {
    println(state)
    (List.empty, -1)
  })
}
Now if I do:
slowInts.take(10).compile.drain.run(List.empty).unsafeRunSync
Then I get the expected result - the state properly accumulates into the output. But if I do:
slowIntsPeriodicallyClearedState.take(10).compile.drain.run(List.empty).unsafeRunSync
Then I see an empty list consistently printed out. I would have expected partial lists (approx. 2 elements) printed out.
StateT is not safe to use with effect types, because it's not safe in the face of concurrent access. Instead, consider using a Ref (from either fs2 or cats-effect, depending on which version you're on).
Something like this:
def slowInts(ref: Ref[IO, List[String]]) = Stream.unfoldEval[IO, Int, Int](1) { n =>
  val message = s"hello $n"
  ref.modify(message :: _) *> IO {
    Thread.sleep(500)
    val result = Some((n, n + 1))
    result
  }
}

val ticks = Scheduler.fromScheduledExecutorService(sec).fixedDelay[IO](FiniteDuration(1, SECONDS))

def slowIntsPeriodicallyClearedState(ref: Ref[IO, List[String]]) =
  slowInts(ref).either(ticks).evalMap[Int] {
    case Left(n) => IO.pure(n)
    case Right(_) =>
      ref.modify(_ => Nil).flatMap { case Change(previous, now) =>
        IO(println(previous)).as(-1) // print the state accumulated before the reset
      }
  }
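For newer cats-effect versions, where Ref lives in cats.effect.concurrent and modify has a different signature, a minimal sketch of the same idea (method names from cats-effect 1.x/2.x):

import cats.effect.IO
import cats.effect.concurrent.Ref

// Sketch: atomically swap the accumulated state for an empty list and
// print whatever had been accumulated before the swap.
def dumpAndClear(ref: Ref[IO, List[String]]): IO[Unit] =
  ref.getAndSet(Nil).flatMap(previous => IO(println(previous)))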

Iterate a data source asynchronously in batches and stop when the remote returns no data, in Scala

Let's say we have a fake data source which will return the data it holds in batches:
class DataSource(size: Int) {
  private var s = 0
  implicit val g = scala.concurrent.ExecutionContext.global

  def getData(): Future[List[Int]] = {
    s = s + 1
    Future {
      Thread.sleep(Random.nextInt(s * 100))
      if (s <= size) {
        List.fill(100)(s)
      } else {
        List()
      }
    }
  }
}
object Test extends App {
  val source = new DataSource(100)
  implicit val g = scala.concurrent.ExecutionContext.global

  def process(v: List[Int]): Unit = {
    println(v)
  }

  def next(f: (List[Int]) => Unit): Unit = {
    val fut = source.getData()
    fut.onComplete {
      case Success(v) => {
        f(v)
        v match {
          case h :: t => next(f)
        }
      }
    }
  }

  next(process)
  Thread.sleep(1000000000)
}
I have my version above, but the problem is that parts of it are not pure. Ideally, I would like to wrap the Future for each batch into one big Future, where the wrapper Future succeeds when the last batch returns a zero-size list. My situation is a little different from this post: the next() there is a synchronous call, while mine is async.
Or is it even possible to do what I want? The next batch should only be fetched once the previous one has resolved, and whether to fetch the next batch at all depends on the size returned.
What's the best way to walk through this type of data source? Are there any existing Scala frameworks that provide the feature I am looking for? Are Play's Iteratee, Enumerator, and Enumeratee the right tools? If so, can anyone provide an example of how to use those facilities to implement what I am looking for?
Edit:
With help from chunjef, I tried it out, and it actually worked for me. However, I made some small changes based on his answer.
Source.fromIterator(() => Iterator.continually(source.getData()))
  .mapAsync(1)(f => f.filter(_.size > 0))
  .via(Flow[List[Int]].takeWhile(_.nonEmpty))
  .runForeach(println)
However, can someone give a comparison between Akka Streams and Play's Iteratee? Is it worth me also trying out Iteratee?
Code snip 1:
Source.fromIterator(() => Iterator.continually(ds.getData)) // line 1
.mapAsync(1)(identity) // line 2
.takeWhile(_.nonEmpty) // line 3
.runForeach(println) // line 4
Code snip 2: Assuming getData depends on the output of another flow, I would like to concat it with the flow below. However, it yields a "too many open files" error. I'm not sure what would cause this error; mapAsync has been limited to a throughput of 1, if I understood correctly.
Flow[Int].mapConcat[Future[List[Int]]](c => {
  Iterator.continually(ds.getData(c)).to[collection.immutable.Iterable]
}).mapAsync(1)(identity).takeWhile(_.nonEmpty).runForeach(println)
The following is one way to achieve the same behavior with Akka Streams, using your DataSource class:
import scala.concurrent.Future
import scala.util.Random

import akka.actor.ActorSystem
import akka.stream._
import akka.stream.scaladsl._

object StreamsExample extends App {
  implicit val system = ActorSystem("Sandbox")
  implicit val materializer = ActorMaterializer()

  val ds = new DataSource(100)

  Source.fromIterator(() => Iterator.continually(ds.getData)) // line 1
    .mapAsync(1)(identity)                                    // line 2
    .takeWhile(_.nonEmpty)                                    // line 3
    .runForeach(println)                                      // line 4
}
class DataSource(size: Int) {
...
}
A simplified line-by-line overview:
line 1: Creates a stream source that continually calls ds.getData if there is downstream demand.
line 2: mapAsync is a way to deal with stream elements that are Futures. In this case, the stream elements are of type Future[List[Int]]. The argument 1 is the level of parallelism: we specify 1 here because DataSource internally uses a mutable variable, and a parallelism level greater than one could produce unexpected results. identity is shorthand for x => x, which basically means that for each Future, we pass its result downstream without transforming it.
line 3: Essentially, ds.getData is called as long as the result of the Future is a non-empty List[Int]. If an empty List is encountered, processing is terminated.
line 4: runForeach here takes a function List[Int] => Unit and invokes that function for each stream element.
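As a side note, runForeach materializes to a Future[Done], so a minimal sketch of shutting down the ActorSystem once the stream finishes (under the same setup as above) could look like:

import akka.Done
import scala.concurrent.Future
import scala.util.{Failure, Success}

val done: Future[Done] =
  Source.fromIterator(() => Iterator.continually(ds.getData))
    .mapAsync(1)(identity)
    .takeWhile(_.nonEmpty)
    .runForeach(println)

// Terminate the actor system when the stream completes, either way.
implicit val ec = system.dispatcher
done.onComplete {
  case Success(_) => system.terminate()
  case Failure(e) => println(s"stream failed: $e"); system.terminate()
}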
Ideally, I would like to wrap the Future for each batch into a big future, and the wrapper future success when last batch returned 0 size list?
I think you are looking for a Promise.
You would set up a Promise before you start the first iteration.
This gives you promise.future, a Future that you can then use to follow the completion of everything.
In your onComplete, you add a case _ => promise.success(()).
Something like
def loopUntilDone(f: (List[Int]) => Unit): Future[Unit] = {
  val promise = Promise[Unit]

  def next(): Unit = source.getData().onComplete {
    case Success(v) =>
      f(v)
      v match {
        case h :: t => next()
        case _      => promise.success(())
      }
    case Failure(e) => promise.failure(e)
  }

  // get going
  next()

  // return the Future for everything
  promise.future
}
// future for everything, this is a `Future[Unit]`
// its `onComplete` will be triggered when there is no more data
val everything = loopUntilDone(process)
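For example (a quick sketch; the printlns are only there to show where the hook fires):

import scala.util.{Failure, Success}

everything.onComplete {
  case Success(_) => println("all batches processed")
  case Failure(e) => println(s"processing failed: $e")
}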
You are probably looking for a reactive streams library. My personal favorite (and the one I'm most familiar with) is Monix. This is how it would work with DataSource unchanged:
import scala.concurrent.duration.Duration
import scala.concurrent.Await
import monix.reactive.Observable
import monix.execution.Scheduler.Implicits.global

object Test extends App {
  val source = new DataSource(100)

  val completed = // <- this is Future[Unit], completes when foreach is done
    Observable.repeatEval(source.getData()) // re-evaluates getData() on every iteration
      .flatMap(Observable.fromFuture(_))    // <- Here it's Observable[List[Int]], it has collection-like methods
      .takeWhile(_.nonEmpty)
      .foreach(println)

  Await.result(completed, Duration.Inf)
}
I just figured out that flatMapConcat achieves what I wanted. There is no point starting another question, since I already have the answer; I'm putting my sample code here in case someone is looking for something similar.
This type of API is very common for integration between traditional enterprise applications. The DataSource mocks the API, while object Test demonstrates how client code can use Akka Streams to consume it.
In my small project the API was provided over SOAP, and I used scalaxb to transform the SOAP interface into Scala async style. With the client calls demonstrated in object Test, we can consume the API with Akka Streams. Thanks to all for the help.
class DataSource(size: Int) {
  private var transactionId: Long = 0
  private val transactionCursorMap: mutable.HashMap[TransactionId, Set[ReadCursorId]] = mutable.HashMap.empty
  private val cursorIteratorMap: mutable.HashMap[ReadCursorId, Iterator[List[Int]]] = mutable.HashMap.empty
  implicit val g = scala.concurrent.ExecutionContext.global

  case class TransactionId(id: Long)

  case class ReadCursorId(id: Long)

  def startTransaction(): Future[TransactionId] = {
    Future {
      synchronized {
        transactionId += 1
      }
      val t = TransactionId(transactionId)
      transactionCursorMap.update(t, Set(ReadCursorId(0)))
      t
    }
  }

  def createCursorId(t: TransactionId): ReadCursorId = {
    synchronized {
      val c = transactionCursorMap.getOrElseUpdate(t, Set(ReadCursorId(0)))
      val currentId = c.foldLeft(0L) { (acc, a) => acc.max(a.id) }
      val cId = ReadCursorId(currentId + 1)
      transactionCursorMap.update(t, c + cId)
      cursorIteratorMap.put(cId, createIterator())
      cId
    }
  }

  def createIterator(): Iterator[List[Int]] = {
    (for { i <- 1 to 100 } yield List.fill(100)(i)).toIterator
  }

  def startRead(t: TransactionId): Future[ReadCursorId] = {
    Future {
      createCursorId(t)
    }
  }

  def getData(cursorId: ReadCursorId): Future[List[Int]] = {
    synchronized {
      Future {
        Thread.sleep(Random.nextInt(100))
        cursorIteratorMap.get(cursorId) match {
          case Some(i) => i.next()
          case _       => List()
        }
      }
    }
  }
}
object Test extends App {
  val source = new DataSource(10)
  implicit val system = ActorSystem("Sandbox")
  implicit val materializer = ActorMaterializer()
  implicit val g = scala.concurrent.ExecutionContext.global
  val s = Source.fromFuture(source.startTransaction())
    .map { e =>
      source.startRead(e)
    }
    .mapAsync(1)(identity)
    .flatMapConcat { e =>
      Source.fromIterator(() => Iterator.continually(source.getData(e)))
    }
    .mapAsync(5)(identity)
    .via(Flow[List[Int]].takeWhile(_.nonEmpty))
    .runForeach(println)
}

Submitting operations in created future

I have a lazy val Future that obtains some object, and a function which submits operations to that Future.
class C {
  def printLn(s: String) = println(s)
}

lazy val futureC: Future[C] = Future { Thread.sleep(3000); new C() }

def func(s: String): Unit = {
  futureC.foreach { c => c.printLn(s) }
}
The problem is that when the Future completes, it executes the operations in the reverse order from which they were submitted. So for example if I execute sequentially:
func("A")
func("B")
func("C")
then after the Future completes I get:
C
B
A
This order is important for me. Is there a way to preserve this order?
Of course I could use an actor that asks for the future and stashes the strings while the future is not ready, but that seems redundant to me.
lazy val futureC: Future[C]
lazy vals in Scala are compiled into code that uses a synchronized block for thread safety.
Here, when func("A") is called, it obtains the lock for the lazy val and that thread goes to sleep.
Therefore func("B") and func("C") are blocked by the lock.
When those blocked threads are resumed, the order cannot be guaranteed.
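Roughly, the compiler turns the lazy val into something like this hand-sketched equivalent (simplified; the real encoding uses a bitmap field and double-checked locking):

// Simplified illustration of how `lazy val futureC` behaves at runtime.
private var _futureC: Future[C] = _
private var _initialized = false

def futureC: Future[C] = this.synchronized {
  if (!_initialized) {
    _futureC = Future { Thread.sleep(3000); new C() }
    _initialized = true
  }
  _futureC
}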
If you do it like below, you'll have the order you expect. This is because the for comprehension creates a flatMap- and map-based chain that gets executed sequentially.
lazy val futureC: Future[C] = Future {
  Thread.sleep(1000)
  new C()
}

def func(s: String): Future[Unit] = {
  futureC.map { c => c.printLn(s) }
}

val x = for {
  _ <- func("A")
  _ <- func("B")
  _ <- func("C")
} yield ()
The order is preserved even without the lazy keyword. You can remove the lazy keyword unless it is really necessary.
Hope this helps.
You can use a traverse-style fold to ensure the order of execution. Note that the standard library's Future.traverse applies fn to every element up front, starting all the futures at once (for a strict collection), whereas the hand-rolled version below only calls fn(item) inside flatMap, after the previous future has completed, which is what makes the execution sequential.
Something like this. I'm not sure how your func has a reference to the correct futureC, so I moved it inside.
def func(s: String): Future[Unit] = {
  lazy val futureC = Future { Thread.sleep(3000); new C() }
  futureC.map { c => c.printLn(s) }
}

def traverse[A, B](xs: Seq[A])(fn: A => Future[B]): Future[Seq[B]] =
  xs.foldLeft(Future(Seq[B]())) { (acc, item) =>
    acc.flatMap { accValue =>
      fn(item).map { itemValue =>
        accValue :+ itemValue
      }
    }
  }

traverse(Seq("A", "B", "C"))(func)

Using scalaz-stream as a real time Writer for asynchronous computations

I have a web-app that does a bunch of slow concurrent work to calculate its result. Instead of leaving the end user hanging I'd like to stream back progress updates via a websocket.
My codebase is built up of composition of Scalaz eithers (\/) like:
type ProcessResult = Error \/ Int

def downloadFile(url: String): Future[Error \/ String] = ???
def doSlowProcessing(data1: String, data2: String): Future[ProcessResult] = ???

/* Very simple, however it doesn't give any progress updates */
def execute(): Future[ProcessResult] = {
  val download1 = downloadFile(...)
  val download2 = downloadFile(...)
  val et = for {
    d1 <- download1
    d2 <- download2
    processed <- doSlowProcessing(d1, d2)
  } yield processed
  et.run
}
This works very well, but of course the entire computation needs to finish before I get anything out of the Future. Even if I stacked a Writer monad on top to do logging, I would only get the log once finished, which doesn't make my end users any happier.
I toyed around with using a scalaz-stream Queue to send the logs as a side effect while the code is running, but the end result is pretty ugly:
def execute(): Process[Task, String \/ ProcessResult] = {
  val (q, src) = async.queue[String \/ ProcessResult]
  val download1 = downloadFile(...)
  val download2 = downloadFile(...)
  val et = for {
    d1 <- { q.enqueue("Downloading 1".left); download1 }
    d2 <- { q.enqueue("Downloading 2".left); download2 }
    processed <- { q.enqueue("Doing processing".left); doSlowProcessing(d1, d2) }
  } yield processed
  et.run.onSuccess { x =>
    q.enqueue(x.right)
    q.close
  }
  src
}
It feels like there should be an idiomatic way to achieve this. Turning my SIP-14 Scala futures into Tasks is possible if necessary.
I don't think you need to use a queue. One approach is to use non-deterministic merging via wye, i.e.:
type Result = ???

val download1: Process[Task, File] = ???
val download2: Process[Task, File] = ???

val result: Process[Task, (File, File)] = (download1 yip download2).once

val processed: Process[Task, Result] = result.flatMap(doSlowProcessing)

// Run asynchronously,
processed.runLast.runAsync {
  case Some(r) => .... // result computed
  case None => .... // no result, hence download1/2 were empty
}

// or run synchronously, awaiting the result
processed.runLast.run match {
  case Some(r) => .... // result computed
  case None => .... // no result
}

// to capture error information during the download, use
val withError: Process[Task, Throwable \/ File] = download1.attempt

// or to log and recover to another file download
val withError2: Process[Task, File] = download1 onFailure { err => Log(err); download3 }
Does that make sense?
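As for turning SIP-14 Scala Futures into Tasks, here is a minimal sketch (assuming scalaz 7's Task.async; the by-name parameter delays starting the Future until the Task actually runs):

import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Success}
import scalaz.concurrent.Task
import scalaz.{-\/, \/-}

// Sketch: adapt a standard-library Future into a scalaz Task.
def futureToTask[A](fut: => Future[A])(implicit ec: ExecutionContext): Task[A] =
  Task.async { callback =>
    fut.onComplete {
      case Success(a) => callback(\/-(a))
      case Failure(e) => callback(-\/(e))
    }
  }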
Also, please note that async.queue has been deprecated since 0.5.0 in favor of async.unboundedQueue.
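A minimal sketch of the newer queue API (method names as in scalaz-stream 0.7.x; treat this as an approximation):

import scalaz.concurrent.Task
import scalaz.stream.{async, Process}

// Sketch: enqueue progress messages as Tasks and expose the dequeue
// side as a Process that can be streamed to the websocket.
val q = async.unboundedQueue[String]
val progress: Process[Task, String] = q.dequeue

val work: Task[Unit] = for {
  _ <- q.enqueueOne("Downloading 1")
  // ... actual work here ...
  _ <- q.close
} yield ()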

Scala: Cool way to manage sequential execution of Futures?

I'm trying to write a data module in Scala.
While loading all the data in parallel, some data depends on other data, so the execution sequence has to be managed in an efficient way.
For example, I keep a map with the name of each piece of data and its manifest:
val dataManifestMap = Map(
  "foo" -> manifest[String],
  "bar" -> manifest[Int],
  "baz" -> manifest[Int],
  "foobar" -> manifest[Set[String]], // needs to be loaded after "foo" and "bar" are ready
  "foobarbaz" -> manifest[String] // needs to be loaded after "foobar" and "baz" are ready
)
These data will be stored in a mutable hash map
private var dataStorage = new mutable.HashMap[String, Future[Any]]()
There is some code that loads the data:
def loadAllData(): Future[Unit] = {
  Future.join(
    (dataManifestMap map {
      case (data, m) => loadData(data, m) // function has all the string matching and loading stuff
    }).toSeq
  )
}

def loadData[T](data: String, m: Manifest[T]): Future[Unit] = {
  val d = data match {
    case "foo" => Future.value("foo")
    case "bar" => Future.value(3)
    case "foobar" => // do something with dataStorage("foo") and dataStorage("bar")
    ... // and so forth (in a real example it would be much more complicated for sure)
  }
  d flatMap {
    dVal => { this.synchronized { dataStorage(data) = dVal }; Future.value(Unit) }
  }
}
This way, I cannot make sure "foobar" is loaded only once "foo" and "bar" are ready, and so forth.
How can I manage this in a "cool" way, since I might have hundreds of different data sets?
It would be "awesome" if I could have some kind of data structure holding the information about what has to be loaded after what, with the sequential execution handled by flatMap in a neat way.
Thanks for the help in advance.
All things being equal, I'd tend to use for comprehensions. For example:
def findBucket: Future[Bucket[Empty]] = ???
def fillBucket(bucket: Bucket[Empty]): Future[Bucket[Water]] = ???
def extinguishOvenFire(waterBucket: Bucket[Water]): Future[Oven] = ???
def makeBread(oven: Oven): Future[Bread] = ???
def makeSoup(oven: Oven): Future[Soup] = ???
def eatSoup(soup: Soup, bread: Bread): Unit = ???

def doLunch = {
  for (bucket <- findBucket;
       filledBucket <- fillBucket(bucket);
       oven <- extinguishOvenFire(filledBucket);
       soupFuture = makeSoup(oven);
       breadFuture = makeBread(oven);
       soup <- soupFuture;
       bread <- breadFuture) {
    eatSoup(soup, bread)
  }
}
This chains futures together, and calls the relevant methods once dependencies are satisfied. Note that we use = in the for comprehension to allow us to start two Futures at the same time. As it stands, doLunch returns Unit, but if you replace the last few lines with:
  // ..snip..
       bread <- breadFuture) yield {
    eatSoup(soup, bread)
    oven
  }
}
Then it will return Future[Oven] - which might be useful if you want to use the oven for something else after lunch.
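For reference, a hand-expanded sketch of roughly how those = bindings desugar (not the exact compiler output, and returning Future[Unit] rather than Unit): both futures are created before either is awaited, so makeSoup and makeBread run concurrently.

def doLunchDesugared: Future[Unit] =
  findBucket
    .flatMap(bucket => fillBucket(bucket))
    .flatMap(filledBucket => extinguishOvenFire(filledBucket))
    .flatMap { oven =>
      val soupFuture = makeSoup(oven)   // started immediately
      val breadFuture = makeBread(oven) // started immediately, in parallel
      soupFuture.flatMap(soup => breadFuture.map(bread => eatSoup(soup, bread)))
    }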
As for your code, my first thought would be that you should consider Spray cache, as it looks like it might fit your requirements. If not, my next thought would be to replace the stringly-typed interface you've currently got with something based on typed method calls:
private def save[T](key: String)(value: Future[T]) = this.synchronized {
  dataStorage(key) = value
  value
}

def loadFoo = save("foo") { Future("foo") }
def loadBar = save("bar") { Future(3) }

def loadFooBar = save("foobar") {
  for (foo <- loadFoo;
       bar <- loadBar) yield foo + bar // Or whatever
}

def loadBaz = save("baz") { Future(200L) }

def loadAll = {
  val topLevelFutures = Seq(loadFooBar, loadBaz)
  // Use standard library function to combine futures
  Future.fold(topLevelFutures)(())((u, f) => ())
}

// I don't consider this method necessary, but if you've got a legacy API to support...
def loadData[T](key: String)(implicit manifest: Manifest[T]) = {
  val future = key match {
    case "foo" => loadFoo
    case "bar" => loadBar
    case "foobar" => loadFooBar
    case "baz" => loadBaz
    case "all" => loadAll
  }
  future.mapTo[T]
}