Mixing pure and impure with Cats Effect - scala

Suppose we have a two-stage pure method.
def filterPositivePure(seq: Seq[Int]): Seq[Int] =
  if (seq.nonEmpty) {
    val sorted = seq.sorted
    sorted.filter(_ > 0)
  }
  else seq
We then needed to log the intermediate result of the calculation, or perform some other impure action.
def filterPositiveImpure(seq: Seq[Int]): Seq[Int] =
  if (seq.nonEmpty) {
    val sorted = seq.sorted
    println(sorted)
    sorted.filter(_ > 0)
  }
  else seq
To preserve purity, we then wrapped the log output, the remaining calculation, and the result of the else branch in IO.
def filterPositiveIO(seq: Seq[Int]): IO[Seq[Int]] =
  if (seq.nonEmpty) {
    val sorted = seq.sorted
    IO(println(sorted)) *> IO(sorted.filter(_ > 0))
  }
  else IO(seq)
Is there a more concise way to bring purity back?

I think a better way is to treat the logging step as an effect of its own and write a separate function for it:
import cats.Show
import cats.effect.IO
import cats.syntax.flatMap._
// these three imports are only needed to make show work
import cats.syntax.show._
import cats.instances.list._
import cats.instances.int._
// the function is now more structured and mentions the sorted list only once
def filterPositiveIO(seq: Seq[Int]): IO[Seq[Int]] = {
  if (seq.nonEmpty)
    withLogging(seq.toList.sorted).map(_.filter(_ > 0))
  else
    IO(seq)
}
// performs the logging effect, then returns the value lifted into IO
def withLogging[A: Show](value: A): IO[A] = logging(value.show) >> IO(value)
// put the actual logging effect here
def logging(strExpr: => String): IO[Unit] = ???

Related

FS2 - How to route an element to a specific nested stream/pipe?

I want to run N nested streams/pipes in parallel and send each element to only one of the nested streams. Balance allows me to do this but I want to route elements with the same "key" to the same nested stream or pipe.
I can't see any function that does this, so I wrote a basic POC which broadcasts each element to every stream. Each stream/pipe then keeps only the elements it should handle (see below). This seems quite inefficient. Is there a better way to route elements to specific nested streams?
package io.xxx.streams
import cats.effect.{ExitCode, IO, IOApp}
import fs2.{Pipe, Stream}
object StreamsApp extends IOApp {
  import cats.syntax.functor._
  import scala.concurrent.duration._
  case class StreamMessage(routingKey: Int, value: String)
  // filter elements which belong to the given bin
  def filterAndLog(bin: Int, numBins: Int): IO[Pipe[IO, StreamMessage, Unit]] = IO {
    val predicate = (m: StreamMessage) => m.routingKey % numBins == bin
    in: Stream[IO, StreamMessage] => {
      in.filter(predicate).evalMap(m => IO {
        println(s"bin $bin - ${m.value}")
      })
    }
  }
  override def run(args: List[String]): IO[ExitCode] = {
    val effectsStream = for {
      pipeOne <- Stream.eval(filterAndLog(0, 2))
      pipeTwo <- Stream.eval(filterAndLog(1, 2))
      s <- Stream
        .fixedDelay[IO](100.millis)
        .zipRight(Stream.range(0, 50))
        .map(i => StreamMessage(i, s"message $i"))
        .broadcastThrough(pipeOne, pipeTwo)
    } yield s
    effectsStream.compile.drain.as(ExitCode(0))
  }
}
Messages with the same routing key should be handled by the same stream/pipe

FS2 Stream with StateT[IO, _, _], periodically dumping state

I have a program which consumes an infinite stream of data. Along the way I'd like to record some metrics, which form a monoid since they're just simple sums and averages. Periodically, I want to write out these metrics somewhere, clear them, and return to accumulating them. I have essentially:
object Foo {
  type MetricsIO[A] = StateT[IO, MetricData, A]
  def recordMetric(m: MetricData): MetricsIO[Unit] = {
    StateT.modify(_.combine(m))
  }
  def sendMetrics: MetricsIO[Unit] = {
    StateT.modifyF { s =>
      val write: IO[Unit] = writeMetrics(s)
      write.attempt.map {
        case Left(_) => s
        case Right(_) => Monoid[MetricData].empty
      }
    }
  }
}
So most of the execution uses IO directly and lifts using StateT.liftF. And in certain situations, I include some calls to recordMetric. At the end of it I've got a stream:
val mainStream: Stream[MetricsIO, Bar] = ...
And I want to periodically, say every minute or so, dump the metrics, so I tried:
val scheduler: Scheduler = ...
val sendStream =
  scheduler
    .awakeEvery[MetricsIO](FiniteDuration(1, TimeUnit.MINUTES))
    .evalMap(_ => Foo.sendMetrics)
val result = mainStream.concurrently(sendStream).compile.drain
And then I do the usual top level program stuff of calling run with the start state and then calling unsafeRunSync.
The issue is, I only ever see empty metrics! I suspect it's something to do with my monoid implicitly providing empty metrics to sendStream, but I can't quite figure out why that should be or how to fix it. Maybe there's a way I can "interleave" these sendMetrics calls into the main stream instead?
Edit: here's a minimal complete runnable example:
import fs2._
import cats.implicits._
import cats.data._
import cats.effect._
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext
import scala.concurrent.duration._
val sec = Executors.newScheduledThreadPool(4)
implicit val ec = ExecutionContext.fromExecutorService(sec)
type F[A] = StateT[IO, List[String], A]
val slowInts = Stream.unfoldEval[F, Int, Int](1) { n =>
  StateT(state => IO {
    Thread.sleep(500)
    val message = s"hello $n"
    val newState = message :: state
    val result = Some((n, n + 1))
    (newState, result)
  })
}
val ticks = Scheduler.fromScheduledExecutorService(sec).fixedDelay[F](FiniteDuration(1, SECONDS))
val slowIntsPeriodicallyClearedState = slowInts.either(ticks).evalMap[Int] {
  case Left(n) => StateT.liftF(IO(n))
  case Right(_) => StateT(state => IO {
    println(state)
    (List.empty, -1)
  })
}
Now if I do:
slowInts.take(10).compile.drain.run(List.empty).unsafeRunSync
Then I get the expected result - the state properly accumulates into the output. But if I do:
slowIntsPeriodicallyClearedState.take(10).compile.drain.run(List.empty).unsafeRunSync
Then I see an empty list consistently printed out. I would have expected partial lists (approx. 2 elements) printed out.
StateT is not safe to use with effect types, because it's not safe in the face of concurrent access. Instead, consider using a Ref (from either fs2 or cats-effect, depending what version).
Something like this:
def slowInts(ref: Ref[IO, List[String]]) = Stream.unfoldEval[IO, Int, Int](1) { n =>
  val message = s"hello $n"
  ref.modify(message :: _) *> IO {
    Thread.sleep(500)
    val result = Some((n, n + 1))
    result
  }
}
val ticks = Scheduler.fromScheduledExecutorService(sec).fixedDelay[IO](FiniteDuration(1, SECONDS))
def slowIntsPeriodicallyClearedState(ref: Ref[IO, List[String]]) =
  slowInts(ref).either(ticks).evalMap[Int] {
    case Left(n) => IO.pure(n)
    case Right(_) =>
      // print the list that was accumulated before it was cleared
      ref.modify(_ => Nil).flatMap { case Change(previous, _) =>
        IO(println(previous)).as(-1)
      }
  }
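The accumulate-then-clear idea behind the Ref suggestion can also be seen without any effect library. The following is only a rough sketch: it uses a plain java.util.concurrent.atomic.AtomicReference as a stand-in for Ref, and the names SharedMetrics, record and dumpAndClear are made up for illustration. The point is that every producer and the periodic dumper share one atomic cell, instead of each threading its own copy of the state.

```scala
import java.util.concurrent.atomic.AtomicReference
import java.util.function.UnaryOperator

object SharedMetrics {
  // Stand-in for a Ref: a single atomic cell shared by all callers
  private val state = new AtomicReference[List[String]](Nil)

  // Called by the main stream: atomically prepend a message
  def record(message: String): Unit = {
    state.updateAndGet(new UnaryOperator[List[String]] {
      def apply(current: List[String]): List[String] = message :: current
    })
    ()
  }

  // Called by the periodic tick: swap in an empty list and
  // return whatever had been accumulated up to that point
  def dumpAndClear(): List[String] = state.getAndSet(Nil)
}
```

Because getAndSet returns the previous contents, the dumper sees what was accumulated rather than the freshly cleared value.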

How to make effectful computation referential transparent

I'm learning FP by writing simple apps. And now I'm approaching effect monads (cats.effect.IO/scalaz.IO does not really matter). I have two functions:
def open(path: String): IO[InputStream] = IO {
  new FileInputStream(new File(path))
}
def read(is: InputStream): IO[Option[Array[Byte]]] = IO {
  val buffer = new Array[Byte](4096)
  val bytesRead = is.read(buffer)
  if (bytesRead != -1) {
    val newBuffer = new Array[Byte](bytesRead)
    System.arraycopy(buffer, 0, newBuffer, 0, bytesRead)
    // print only the bytes actually read, not the whole 4096-byte buffer
    print(new String(newBuffer))
    Some(newBuffer)
  }
  else
    None
}
And I can combine them into a stream as follows
import cats.effect.IO
import fs2.Stream
object App {
  def main(args: Array[String]): Unit = logic.unsafeRunSync()
  def logic: IO[Unit] = for {
    is <- open("/tmp/prompthooks.py")
    _ <- fs2.Stream.eval(read(is)).repeat.unNoneTerminate.compile.drain
  } yield ()
}
And it works fine. But the question is whether it is all implemented in pure FP. I have doubts, in the sense that def read(is: InputStream): IO[Option[Array[Byte]]] accepts a stream and tries to read from it. Yes, it suspends the side effect, but val io = read(is) is still sort of stateful (if we call unsafeRunSync on it twice, we get different results).
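To make the concern concrete, here is a small sketch that reproduces the observation with the standard library only. It assumes a plain () => A thunk as a hypothetical stand-in for IO and an in-memory ByteArrayInputStream instead of a file; SuspendedRead is a made-up name. It only illustrates that running the same suspended read repeatedly gives different results, it does not settle whether that counts as pure.

```scala
import java.io.{ByteArrayInputStream, InputStream}

object SuspendedRead {
  // Stand-in for IO: building the thunk reads nothing, running it reads one byte
  def read(is: InputStream): () => Option[Int] = () => {
    val b = is.read()
    if (b != -1) Some(b) else None
  }

  val is: InputStream = new ByteArrayInputStream(Array[Byte](1, 2))

  val io = read(is) // nothing has been read yet

  val first  = io() // Some(1)
  val second = io() // Some(2): same description, different result
  val third  = io() // None: the underlying stream is exhausted
}
```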

How to handle recursion with monix's observable?

Using monix I'm trying to traverse a graph by building an Observable[Node] and using a breadth first algorithm.
However, I have a bit of a recursion problem. Here is a snippet illustrating it:
package gp
import monix.eval.Task
import monix.execution.Scheduler.Implicits.global
import monix.reactive._
object HelloObservable {
  type Node = Int
  //real case fetch next node across the network so the signature
  //has to be Node -> List[Task[Node]]
  def nexts(i : Node) : List[Task[Node]] =
    List(Task(i), Task(i+1))
  def go(i :Node) : Task[Iterator[List[Node]]] =
    Task.sequence(nexts(i).sliding(100,100).map(Task.gatherUnordered))
  def explore(r: Node): Observable[Node] = {
    val firsts = for {
      ilr <- Observable.fromTask(go(r))
      lr <- Observable.fromIterator(ilr)
      r <- Observable.fromIterable(lr)
    } yield r
    firsts ++ firsts.flatMap(explore)
  }
  def main(args : Array[String]) : Unit = {
    val obs = explore(0)
    val cancelable = obs
      .dump("O")
      .subscribe()
    scala.io.StdIn.readLine()
  }
}
The observable stops after the first iteration. Can anyone give me a hint as to why?
I think the issue is not related to recursion. I think it comes from the fact that you use sliding, which returns an Iterator. The major difference between Iterator and Iterable is that you can consume an Iterator only once, and after that all you are left with is an empty Iterator. This means that when you do firsts.flatMap there is nothing left in Observable.fromIterator(ilr), and so nothing is produced.
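The single-use behaviour of Iterator is easy to see in isolation. This is a minimal sketch using only the standard library (no Monix); the object name IteratorOnce and the sample list are made up for illustration:

```scala
object IteratorOnce {
  val nodes = List(1, 2, 3, 4, 5)

  // sliding returns an Iterator: a one-shot cursor over the groups
  val groups: Iterator[List[Int]] = nodes.sliding(2, 2)

  val firstPass: List[List[Int]]  = groups.toList // consumes the iterator
  val secondPass: List[List[Int]] = groups.toList // nothing left to read

  // Materialising the groups once gives a collection that can be traversed repeatedly
  val materialised: List[List[Int]] = nodes.sliding(2, 2).toList
  val sums1: List[Int] = materialised.map(_.sum)
  val sums2: List[Int] = materialised.map(_.sum)
}
```

Here firstPass holds all three groups, secondPass is empty, and the two traversals of materialised agree.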
Fundamentally, I don't think you can do a breadth-first search if you can't hold (most of) the prefix in memory. But since your nexts already returns a List, I assume that you can afford to have two copies of that list in memory, the second copy being the materialized result of sliding. So your fixed code would be something like this:
object HelloObservable {
  import monix.eval.Task
  import monix.execution.Scheduler.Implicits.global
  import monix.reactive._
  type Node = Int
  //real case fetch next node across the network so the signature
  //has to be Node -> List[Task[Node]]
  def nexts(i: Node): List[Task[Node]] = List(Task(i), Task(i + 1))
  def go(i: Node): Task[List[List[Node]]] =
    Task.sequence(nexts(i).sliding(100, 100).toList.map(Task.gatherUnordered))
  def explore(r: Node): Observable[Node] = {
    val firsts = for {
      ilr <- Observable.fromTask(go(r))
      lr <- Observable.fromIterable(ilr)
      r <- Observable.fromIterable(lr)
    } yield r
    firsts ++ firsts.flatMap(explore)
  }
  def main(args: Array[String]): Unit = {
    val obs = explore(0)
    val cancelable = obs
      .dump("O")
      .subscribe()
    scala.io.StdIn.readLine()
  }
}

Sequencing Scala Futures with bounded parallelism (without messing around with ExecutorContexts)

Background: I have a function:
def doWork(symbol: String): Future[Unit]
which initiates some side-effects to fetch data and store it, and completes a Future when it's done. However, the back-end infrastructure has usage limits, such that no more than 5 of these requests can be made in parallel. I have a list of N symbols that I need to get through:
var symbols = Array("MSFT",...)
but I want to sequence them such that no more than 5 are executing simultaneously. Given:
val allowableParallelism = 5
my current solution is (assuming I'm working with async/await):
val symbolChunks = symbols.toList.grouped(allowableParallelism).toList
def toThunk(x: List[String]) = () => Future.sequence(x.map(doWork))
val symbolThunks = symbolChunks.map(toThunk)
val done = Promise[Unit]()
def procThunks(x: List[() => Future[List[Unit]]]): Unit = x match {
  case Nil => done.success(())
  case x::xs => x().onComplete(_ => procThunks(xs))
}
procThunks(symbolThunks)
await { done.future }
but, for obvious reasons, I'm not terribly happy with it. I feel like this should be possible with folds, but every time I try, I end up eagerly creating the Futures. I also tried out a version with RxScala Observables, using concatMap, but that also seemed like overkill.
Is there a better way to accomplish this?
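For what it's worth, the fold the question is reaching for can be written with the standard library alone, as long as the Futures of a chunk are only created inside flatMap. This is a rough sketch, not a drop-in replacement: ChunkedFutures and runInChunks are made-up names, the work function is passed in as a parameter, and, unlike the onComplete version above, a failed chunk stops the chunks after it.

```scala
import scala.concurrent.{ExecutionContext, Future}

object ChunkedFutures {
  // Runs `work` over `items` in consecutive chunks of at most `parallelism` (must be > 0).
  // The futures of a chunk are created inside flatMap, i.e. only after the
  // previous chunk has completed, so nothing is started eagerly.
  def runInChunks[A](items: Seq[A], parallelism: Int)(work: A => Future[Unit])(
      implicit ec: ExecutionContext): Future[Unit] =
    items.grouped(parallelism).foldLeft(Future.successful(())) { (previous, chunk) =>
      previous.flatMap(_ => Future.traverse(chunk)(work).map(_ => ()))
    }
}
```

With the names from the question, the call would look like ChunkedFutures.runInChunks(symbols.toList, allowableParallelism)(doWork). Note that this is chunk-wise, like the original: a slow request holds up the start of the next chunk even when the other four slots are idle.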
I have an example of how to do it with scalaz-stream. It's quite a lot of code, because the Scala Future has to be converted to a scalaz Task (an abstraction for deferred computation). However, this only needs to be added to the project once. Another option is to use Task for defining 'doWork'. I personally prefer Task for building async programs.
import scala.concurrent.{Future => SFuture}
import scala.util.Random
import scala.concurrent.ExecutionContext.Implicits.global
import scalaz.stream._
import scalaz.concurrent._
val P = scalaz.stream.Process
val rnd = new Random()
def doWork(symbol: String): SFuture[Unit] = SFuture {
  Thread.sleep(rnd.nextInt(1000))
  println(s"Symbol: $symbol. Thread: ${Thread.currentThread().getName}")
}
val symbols = Seq("AAPL", "MSFT", "GOOGL", "CVX").
  flatMap(s => Seq.fill(5)(s).zipWithIndex.map(t => s"${t._1}${t._2}"))
implicit class Transformer[+T](fut: => SFuture[T]) {
  def toTask(implicit ec: scala.concurrent.ExecutionContext): Task[T] = {
    import scala.util.{Failure, Success}
    import scalaz.syntax.either._
    Task.async {
      register =>
        fut.onComplete {
          case Success(v) => register(v.right)
          case Failure(ex) => register(ex.left)
        }
    }
  }
}
implicit class ConcurrentProcess[O](val process: Process[Task, O]) {
  def concurrently[O2](concurrencyLevel: Int)(f: Channel[Task, O, O2]): Process[Task, O2] = {
    val actions =
      process.
        zipWith(f)((data, f) => f(data))
    val nestedActions =
      actions.map(P.eval)
    merge.mergeN(concurrencyLevel)(nestedActions)
  }
}
val workChannel = io.channel((s: String) => doWork(s).toTask)
val process = Process.emitAll(symbols).concurrently(5)(workChannel)
process.run.run
Once you have all these transformations in scope, basically all you need is:
val workChannel = io.channel((s: String) => doWork(s).toTask)
val process = Process.emitAll(symbols).concurrently(5)(workChannel)
Quite short and self-describing.
Although you've already got an excellent answer, I thought I might still offer an opinion or two about these matters.
I remember seeing somewhere (on someone's blog) "use actors for state and use futures for concurrency".
So my first thought would be to utilize actors somehow. To be precise, I would have a master actor with a router launching multiple worker actors, with number of workers restrained according to allowableParallelism. So, assuming I have
def doWorkInternal (symbol: String): Unit
which does the work of your doWork, taken 'outside of the future', I would have something along these lines (very rudimentary, not taking many details into consideration, and practically copying code from the Akka documentation):
import akka.actor._
import akka.routing.{ActorRefRoutee, RoundRobinRoutingLogic, Router}
case class WorkItem (symbol: String)
case class WorkItemCompleted (symbol: String)
case class WorkLoad (symbols: Array[String])
case class WorkLoadCompleted ()
class Worker extends Actor {
  def receive = {
    case WorkItem (symbol) =>
      doWorkInternal (symbol)
      sender () ! WorkItemCompleted (symbol)
  }
}
class Master extends Actor {
  var pending = Set[String] ()
  var originator: Option[ActorRef] = None
  var router = {
    val routees = Vector.fill (allowableParallelism) {
      val r = context.actorOf(Props[Worker])
      context watch r
      ActorRefRoutee(r)
    }
    Router (RoundRobinRoutingLogic(), routees)
  }
  def receive = {
    case WorkLoad (symbols) =>
      originator = Some (sender ())
      context become processing
      for (symbol <- symbols) {
        router.route (WorkItem (symbol), self)
        pending += symbol
      }
  }
  def processing: Receive = {
    case Terminated (a) =>
      router = router.removeRoutee(a)
      val r = context.actorOf(Props[Worker])
      context watch r
      router = router.addRoutee(r)
    case WorkItemCompleted (symbol) =>
      pending -= symbol
      if (pending.size == 0) {
        context become receive
        // send an instance of the message, not the companion object
        originator.get ! WorkLoadCompleted ()
      }
  }
}
You could query the master actor with ask and receive a WorkLoadCompleted in a future.
But thinking more about 'state' (of number of simultaneous requests in processing) to be hidden somewhere, together with implementing necessary code for not exceeding it, here's something of the 'future gateway intermediary' sort, if you don't mind imperative style and mutable (used internally only though) structures:
object Guardian
{
  private val incoming = new collection.mutable.HashMap[String, Promise[Unit]]()
  private val outgoing = new collection.mutable.HashMap[String, Future[Unit]]()
  private val pending = new collection.mutable.Queue[String]
  def doWorkGuarded (symbol: String): Future[Unit] = {
    synchronized {
      val p = Promise[Unit] ()
      incoming(symbol) = p
      if (incoming.size <= allowableParallelism)
        launchWork (symbol)
      else
        pending.enqueue (symbol)
      p.future
    }
  }
  private def completionHandler (t: Try[Unit]): Unit = {
    synchronized {
      // iterate over a snapshot of the keys, since entries are removed inside the loop
      for (symbol <- outgoing.keySet.toList) {
        val f = outgoing (symbol)
        if (f.isCompleted) {
          incoming (symbol).completeWith (f)
          incoming.remove (symbol)
          outgoing.remove (symbol)
        }
      }
      // `until`, not `to`: fill only the free slots so the limit is never exceeded
      for (i <- outgoing.size until allowableParallelism) {
        if (pending.nonEmpty) {
          val symbol = pending.dequeue()
          launchWork (symbol)
        }
      }
    }
  }
  private def launchWork (symbol: String): Unit = {
    val f = doWork(symbol)
    outgoing(symbol) = f
    f.onComplete(completionHandler)
  }
}
doWork now is exactly like yours, returning Future[Unit], with the idea that instead of using something like
val futures = symbols.map (doWork (_)).toSeq
val future = Future.sequence(futures)
which would launch futures not regarding allowableParallelism at all, I would instead use
val futures = symbols.map (Guardian.doWorkGuarded (_)).toSeq
val future = Future.sequence(futures)
Think about some hypothetical database access driver with non-blocking interface, i.e. returning futures on requests, which is limited in concurrency by being built over some connection pool for example - you wouldn't want it to return futures not taking parallelism level into account, and require you to juggle with them to keep parallelism under control.
This example is more illustrative than practical, since I wouldn't normally expect the 'outgoing' interface to be using futures like this (which is quite OK for the 'incoming' interface).
First, obviously, some purely functional wrapper around Scala's Future is needed, because Future is side-effecting and starts running as soon as it can. Let's call it Deferred:
import scala.concurrent.Future
import scala.util.control.Exception.nonFatalCatch
class Deferred[+T](f: () => Future[T]) {
  def run(): Future[T] = f()
}
object Deferred {
  def apply[T](future: => Future[T]): Deferred[T] =
    new Deferred(() => nonFatalCatch.either(future).fold(Future.failed, identity))
}
And here is the routine:
import java.util.concurrent.CopyOnWriteArrayList
import java.util.concurrent.atomic.AtomicInteger
import scala.collection.JavaConverters._
import scala.collection.immutable.Seq
import scala.concurrent.{ExecutionContext, Future, Promise}
import scala.util.control.Exception.nonFatalCatch
import scala.util.{Failure, Success}
trait ConcurrencyUtils {
  def runWithBoundedParallelism[T](parallelism: Int = Runtime.getRuntime.availableProcessors())
                                  (operations: Seq[Deferred[T]])
                                  (implicit ec: ExecutionContext): Deferred[Seq[T]] =
    if (parallelism > 0) Deferred {
      val indexedOps = operations.toIndexedSeq // index for faster access
      val promise = Promise[Seq[T]]()
      val acc = new CopyOnWriteArrayList[(Int, T)] // concurrent acc
      val nextIndex = new AtomicInteger(parallelism) // keep track of the next index atomically
      def run(operation: Deferred[T], index: Int): Unit = {
        operation.run().onComplete {
          case Success(value) =>
            acc.add((index, value)) // accumulate result value
            if (acc.size == indexedOps.size) { // we're done
              // in a concurrent setting the next line may be called multiple times, that's why trySuccess instead of success
              promise.trySuccess(acc.asScala.toList.sortBy(_._1).map(_._2))
            } else {
              val next = nextIndex.getAndIncrement() // get and inc atomically
              if (next < indexedOps.size) { // run next operation if exists
                run(indexedOps(next), next)
              }
            }
          case Failure(t) =>
            promise.tryFailure(t) // same here (may be called multiple times, let's prevent stdout pollution)
        }
      }
      if (operations.nonEmpty) {
        indexedOps.view.take(parallelism).zipWithIndex.foreach((run _).tupled) // run as many as allowed
        promise.future
      } else {
        Future.successful(Seq.empty)
      }
    } else {
      throw new IllegalArgumentException("Parallelism must be positive")
    }
}
In a nutshell, we initially run as many operations as allowed, and then on each operation's completion we run the next available operation, if any. So the only difficulty here is maintaining the next operation index and the results accumulator in a concurrent setting. I'm not a concurrency expert, so let me know if there are potential problems in the code above. Notice that the returned value is also a deferred computation that has to be run.
Usage and test:
import org.scalatest.{Matchers, FlatSpec}
import org.scalatest.concurrent.ScalaFutures
import org.scalatest.time.{Seconds, Span}
import scala.collection.immutable.Seq
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import scala.concurrent.duration._
class ConcurrencyUtilsSpec extends FlatSpec with Matchers with ScalaFutures with ConcurrencyUtils {
  "runWithBoundedParallelism" should "return results in correct order" in {
    val comp1 = mkDeferredComputation(1)
    val comp2 = mkDeferredComputation(2)
    val comp3 = mkDeferredComputation(3)
    val comp4 = mkDeferredComputation(4)
    val comp5 = mkDeferredComputation(5)
    val compoundComp = runWithBoundedParallelism(2)(Seq(comp1, comp2, comp3, comp4, comp5))
    whenReady(compoundComp.run()) { result =>
      result should be (Seq(1, 2, 3, 4, 5))
    }
  }
  // increase default ScalaTest patience
  implicit val defaultPatience = PatienceConfig(timeout = Span(10, Seconds))
  private def mkDeferredComputation[T](result: T, sleepDuration: FiniteDuration = 100.millis): Deferred[T] =
    Deferred {
      Future {
        Thread.sleep(sleepDuration.toMillis)
        result
      }
    }
}
Use Monix Task. An example from the Monix documentation, for parallelism = 10:
val items = 0 until 1000
// The list of all tasks needed for execution
val tasks = items.map(i => Task(i * 2))
// Building batches of 10 tasks to execute in parallel:
val batches = tasks.sliding(10,10).map(b => Task.gather(b))
// Sequencing batches, then flattening the final result
val aggregate = Task.sequence(batches).map(_.flatten.toList)
// Evaluation:
aggregate.foreach(println)
//=> List(0, 2, 4, 6, 8, 10, 12, 14, 16,...
Akka Streams allows you to do the following:
import akka.NotUsed
import akka.stream.Materializer
import akka.stream.scaladsl.Source
import scala.concurrent.Future
def sequence[A: Manifest, B](items: Seq[A], func: A => Future[B], parallelism: Int)(
  implicit mat: Materializer
): Future[Seq[B]] = {
  val futures: Source[B, NotUsed] =
    Source[A](items.toList).mapAsync(parallelism)(x => func(x))
  futures.runFold(Seq.empty[B])(_ :+ _)
}
sequence(symbols, doWork, allowableParallelism)