Fan in/fan out concurrency with Monix

I'm trying to learn Scala and having a good bit of fun, but I'm running into this classic problem. It reminds me a lot of nested callback hell in the early days of NodeJS.
Here's my program in pseudocode:
A task to fetch a list of S3 Buckets.
After task one completes I want to batch the processing of buckets in groups of ten.
For each batch:
Get every bucket's region.
Filter out buckets that are not in a given region.
List all the objects in each bucket.
println everything
At one point I wind up with the type: Task[Iterator[Task[List[Bucket]]]]
Essentially:
The outer Task is the initial step that lists all the S3 buckets, and the inner Iterator/Task/List comes from trying to batch Tasks that return lists.
I would hope there's some way to remove/flatten the outer Task to get to Iterator[Task[List[Bucket]]].
When I try to break my processing down into steps, the deep nesting forces me to write many nested maps. Is this the right thing to do, or is there a better way to handle this nesting?

In this particular case, I would suggest something like FS2 with Monix as F:
import cats.implicits._
import monix.eval._, monix.execution._
import monix.execution.Scheduler.Implicits.global // needed for runToFuture below
import fs2._

// use your own types here
type BucketName = String
type BucketRegion = String
type S3Object = String

// use your own implementations as well
val fetchS3Buckets: Task[List[BucketName]] = Task(???)
val bucketRegion: BucketName => Task[BucketRegion] = _ => Task(???)
val listObject: BucketName => Task[List[S3Object]] = _ => Task(???)

Stream.evalSeq(fetchS3Buckets)
  .parEvalMap(10) { name =>
    // checking region, filtering and listing on batches of 10
    bucketRegion(name).flatMap {
      case "my-region" => listObject(name)
      case _           => Task.pure(List.empty)
    }
  }
  .foldMonoid // combines List[S3Object] together
  .compile.lastOrError // turns into Task with result
  .map(list => println(s"Result: $list"))
  .onErrorHandle { case error: Throwable => println(error) }
  .runToFuture // or however you handle it
FS2 underneath uses cats.effect.IO or Monix Task, or whatever else you want, as long as it provides the Cats Effect type classes. It builds a nice, functional DSL for designing streams of data, so you can use reactive streams without Akka Streams.
There is one small problem here: we are printing all results at once, which might be a bad idea if there are more of them than memory can handle. We could do the printing in batches (I wasn't sure if that is what you wanted or not), or make filtering and printing separate steps:
Stream.evalSeq(fetchS3Buckets)
  .parEvalMap(10) { name =>
    bucketRegion(name).map(name -> _)
  }
  .collect { case (name, "my-region") => name }
  .parEvalMap(10) { name =>
    listObject(name).map(list => println(s"Result: $list"))
  }
  .compile
  .drain
While none of that is impossible in bare Monix, FS2 makes such operations much easier to write and maintain, so you should be able to implement your flow with much less effort.
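For comparison, here is a rough sketch of roughly the same flow in bare Monix Observable (an assumption-laden sketch: it presumes Monix 3.x and reuses the stub definitions above, with mapParallelUnordered and foldLeftL standing in for FS2's parEvalMap and foldMonoid):

import monix.eval.Task
import monix.reactive.Observable

val bareMonixVersion: Task[Unit] =
  Observable
    .fromTask(fetchS3Buckets)
    .flatMap(Observable.fromIterable)        // Observable[BucketName]
    .mapParallelUnordered(10) { name =>      // up to 10 buckets in flight at a time
      bucketRegion(name).flatMap {
        case "my-region" => listObject(name)
        case _           => Task.pure(List.empty[S3Object])
      }
    }
    .foldLeftL(List.empty[S3Object])(_ ++ _) // Task[List[S3Object]]
    .map(list => println(s"Result: $list"))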

Related

scala ZIO foreachPar

I'm new to parallel programming and ZIO. I'm trying to get data from an API by making parallel requests.
import sttp.client._
import zio.{Task, ZIO}

ZIO.foreach(files) { file =>
  getData(file)
  Task(file.getName)
}

def getData(file: File) = {
  val data: String = readData(file)
  val request = basicRequest.body(data).post(uri"$url")
    .headers(content -> "text", char -> "utf-8")
    .response(asString)
  implicit val backend: SttpBackend[Identity, Nothing, NothingT] = HttpURLConnectionBackend()
  request.send().body

  request.Response match {
    case Success(value) => {
      val src = new PrintWriter(new File(filename))
      src.write(value.toString)
      src.close()
    }
    case Failure(exception) => log error
  }
}
When I execute the program sequentially it works as expected, but if I try to run it in parallel by changing ZIO.foreach to ZIO.foreachPar, the program terminates prematurely. I get that I'm missing something basic here; any help figuring out the issue is appreciated.
Generally speaking, I wouldn't recommend mixing synchronous blocking code, as you have here, with asynchronous non-blocking code, which is ZIO's primary role. There are some great talks out there on how to effectively use ZIO with the "world", so to speak.
There are two key points I would make: one, ZIO lets you manage resources effectively by attaching allocation and finalization steps; and two, "effects" (we could say "things which actually interact with the world") should be wrapped in the tightest scope possible*.
So let's go through this example a bit. First of all, I would not suggest using the default Identity-backed backend with ZIO; I would recommend the AsyncHttpClientZioBackend instead.
import sttp.client._
import sttp.client.asynchttpclient.zio.AsyncHttpClientZioBackend
import zio.{Managed, Task, UIO, ZIO, ZManaged}
import zio.blocking.effectBlocking
import java.io.{File, IOException, PrintWriter}

// Extract the common elements of the request
val baseRequest = basicRequest.post(uri"$url")
  .headers(content -> "text", char -> "utf-8")
  .response(asString)

// Produces a writer which is wrapped in a `Managed`, allowing it to be properly
// closed after being used
def managedWriter(filename: String): Managed[IOException, PrintWriter] =
  ZManaged.fromAutoCloseable(UIO(new PrintWriter(new File(filename))))

// This returns an effect which produces an `SttpBackend`, thus we flatMap over it
// to extract the backend.
val program = AsyncHttpClientZioBackend().flatMap { implicit backend =>
  ZIO.foreachPar(files) { file =>
    for {
      // Wrap the synchronous reading of data in an effect, but run it on a
      // "blocking" threadpool instead of blocking the main one.
      data <- effectBlocking(readData(file))
      // `send` will return a `Task` because it is using the implicit backend in scope
      resp <- baseRequest.body(data).send()
      // Build the managed writer, then "use" it to produce an effect; at the end
      // of `use` it will automatically close the writer.
      _ <- managedWriter("").use(w => Task(w.write(resp.body.toString)))
    } yield ()
  }
}
At this point you will just have the program value, which you will need to run using one of the unsafe methods, or through the main method if you are using a zio.App.
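For example, a minimal way to run it could look like this (a sketch assuming ZIO 1.x; the Main object is just an illustration, not part of the answer above):

import zio.{App, ExitCode, URIO, ZEnv}

object Main extends App {
  // `program` is the value built above; `exitCode` folds failures into a non-zero exit code
  def run(args: List[String]): URIO[ZEnv, ExitCode] =
    program.exitCode
}

// or, outside of zio.App:
// zio.Runtime.default.unsafeRun(program)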
* Not always possible or convenient, but it is useful because it prevents resource hogging by yielding tasks back to the runtime for scheduling.
When you use a purely functional IO library like ZIO, you must not call any side-effecting functions (like getData) except when calling factory methods like Task.effect or Task.apply.
ZIO.foreach(files) { file =>
  Task {
    getData(file)
    file.getName
  }
}

Order of execution of Future - Making sequential inserts in a db non-blocking

A simple scenario here. I am using Akka Streams to read from Kafka and write to an external store, in my case Cassandra.
The Akka Streams (reactive-kafka) library equips me with backpressure and other nifty things to make this possible.
With Kafka as the Source and Cassandra as the Sink, I get a bunch of events through Kafka which are, in effect, Cassandra queries that are supposed to be executed sequentially (e.g. an INSERT, an UPDATE and a DELETE that must run in that order).
I cannot use mapAsync and execute the statements that way: a Future is eager, and there is a chance that the DELETE or UPDATE gets executed before the INSERT.
So I am forced to use Cassandra's execute, as opposed to executeAsync, which is non-blocking.
There is no way to make a completely async solution to this issue, but is there a more elegant way to do this?
For example: make the Futures lazy and sequential, and offload them to a different execution context of sorts.
mapAsync gives a parallelism option as well.
Can Monix Task be of help here?
This is a general design question: what approaches can one take?
UPDATE:
Flow[In].mapAsync(3)(input => {
  input match {
    case INSERT => //do insert - returns future
    case UPDATE => //do update - returns future
    case DELETE => //delete - returns future
  }
})
The scenario is a little more complex. There could be thousands of inserts, updates and deletes coming in order for specific key(s) (in Kafka).
I would ideally want to execute the 3 futures of a single key in sequence. I believe Monix's Task can help?
If you process things with parallelism of 1, they will get executed in strict sequence, which will solve your problem.
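Concretely, that is just the flow from the question's UPDATE snippet with the parallelism argument dropped to 1 (a sketch that reuses the question's In/INSERT/UPDATE/DELETE placeholders):

Flow[In].mapAsync(1)(input => {
  input match {
    case INSERT => ??? // do insert - returns Future
    case UPDATE => ??? // do update - returns Future
    case DELETE => ??? // do delete - returns Future
  }
})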
But that's not interesting. If you want, you can run operations for different keys in parallel, provided processing for different keys is independent, which, I assume from your description, is the case. To do this, you have to buffer the incoming values and then regroup them. Let's see some code:
import monix.reactive.Observable
import scala.concurrent.duration._
import monix.eval.Task

// Your domain logic - I'll use these stubs
trait Event
trait Acknowledgement // whatever your DB functions return, if you need it

def toKey(e: Event): String = ???

def processOne(event: Event): Task[Acknowledgement] = Task.deferFuture {
  event match {
    case _ => ??? // insert/update/delete
  }
}

// Monix Task.traverse is strictly sequential, which is what you need
def processMany(evs: Seq[Event]): Task[Seq[Acknowledgement]] =
  Task.traverse(evs)(processOne)

def processEventStreamInParallel(source: Observable[Event]): Observable[Acknowledgement] =
  source
    // Process a bunch of events, but don't wait too long for the whole 100. Fine-tune for your data source
    .bufferTimedAndCounted(2.seconds, 100)
    .concatMap { batch =>
      Observable
        .fromIterable(batch.groupBy(toKey).values) // Standard collection methods FTW
        .mapAsync(3)(processMany) // processing up to 3 different keys in parallel - though 3 is not necessary, it depends on your DB throughput
        .flatMap(Observable.fromIterable) // flattening it back
    }
The concatMap operator here will ensure that your chunks are processed sequentially as well. So even if one buffer has key1 -> insert, key1 -> update and the other has key1 -> delete, that causes no problems. In Monix, this is the same as flatMap, but in other Rx libraries flatMap might be an alias for mergeMap which has no ordering guarantee.
This can be done with Futures too, though there's no standard "sequential traverse", so you have to roll your own, something like:
def processMany(evs: Seq[Event]): Future[Seq[Acknowledgement]] =
  evs.foldLeft(Future.successful(Vector.empty[Acknowledgement])) { (acksF, ev) =>
    for {
      acks <- acksF
      next <- processOne(ev)
    } yield acks :+ next
  }
You can use akka-streams subflows to group by key, then merge the substreams if you want to do something with what you get back from your database operations:
def databaseOp(input: In): Future[Out] = input match {
  case INSERT => ...
  case UPDATE => ...
  case DELETE => ...
}

val databaseFlow: Flow[In, Out, NotUsed] =
  Flow[In].groupBy(Int.MaxValue, _.key).mapAsync(1)(databaseOp).mergeSubstreams
Note that the order from the input source won't be preserved in the output, as it would be with a plain mapAsync, but all operations on the same key will still be in order.
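A hedged usage sketch of databaseFlow (the ActorSystem/materializer setup and the inputs value are assumptions added for illustration; in practice the source would be the reactive-kafka consumer):

import akka.Done
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, Materializer}
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future

implicit val system: ActorSystem = ActorSystem("db-writer")
implicit val mat: Materializer = ActorMaterializer() // not needed on Akka 2.6+, where the implicit system is enough

val inputs: List[In] = ??? // the commands coming in from Kafka

val done: Future[Done] =
  Source(inputs)
    .via(databaseFlow) // grouped by key, sequential per key
    .runWith(Sink.ignore)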
You are looking for Future.flatMap:
def doSomething: Future[Unit]
def doSomethingElse: Future[Unit]
val result = doSomething.flatMap { _ => doSomethingElse }
This executes the first function, and then, when its Future is satisfied, starts the second one. The result is a new Future that completes when the result of the second execution is satisfied.
The result of the first future is passed into the function you give to .flatMap, so the second function can depend on the result of the first one. For example:
def getUserID: Future[Int]
def getUser(id: Int): Future[User]
val userName: Future[String] = getUserID.flatMap(getUser).map(_.name)
You can also write this as a for-comprehension:
for {
  id <- getUserID
  user <- getUser(id)
} yield user.name
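For reference, the compiler desugars that for-comprehension into the same nested flatMap/map calls, roughly (renamed here to avoid clashing with the val above):

val userNameDesugared: Future[String] =
  getUserID.flatMap(id => getUser(id).map(user => user.name))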

How to create an Akka flow with backpressure and Control

I need to create a function with the following Interface:
import akka.kafka.scaladsl.Consumer.Control

object ItemConversionFlow {

  def build(config: StreamConfig): Flow[Item, OtherItem, Control] = {
    // Implementation goes here
  }
}
My problem is that I don't know how to define the flow in a way that it fits the interface above.
When I am doing something like this
val flow = Flow[Item]
  .map(item => doConversion(item))
  .filter(_.isDefined)
  .map(_.get)
the resulting type is Flow[Item, OtherItem, NotUsed]. I haven't found anything in the Akka documentation so far. Also, the functions on akka.stream.scaladsl.Flow only offer NotUsed instead of Control. It would be great if someone could point me in the right direction.
Some background: I need to set up several pipelines which differ only in the conversion part. These pipelines are sub-streams of a main stream which might be stopped for some reason (a corresponding message arrives on some Kafka topic). Therefore I need the Control part. The idea is to create a graph template where I just insert the mentioned flow as an argument (a factory returning it). For a specific case we have a solution which works; to generalize it I need this kind of flow.
You actually have backpressure. However, think about what you really need from backpressure... you are not using asynchronous stages to increase your throughput, for example. Backpressure prevents fast producers from overwhelming slow subscribers: https://doc.akka.io/docs/akka/2.5/stream/stream-rate.html. In your sample, don't worry about it; your stream will request new elements from the publisher depending on how long doConversion takes to complete.
In case you want to obtain the result of the stream, use toMat or viaMat. For example, if your stream emits Item and transforms these into OtherItem:
val str = Source.fromIterator(() => List(Item(Some(1))).toIterator)
  .map(item => doConversion(item))
  .filter(_.isDefined)
  .map(_.get)
  .toMat(Sink.fold(List[OtherItem]())((a, b) => {
    // Examine the result of your stream
    b :: a
  }))(Keep.right)
  .run()
str will be Future[List[OtherItem]]. Try to extrapolate this to your case.
Or use viaMat with KillSwitches: "Creates a new [[Graph]] of [[FlowShape]] that materializes to an external switch that allows external completion of that unique materialization. Different materializations result in different, independent switches."
def build(config: StreamConfig): Flow[Item, OtherItem, UniqueKillSwitch] = {
  Flow[Item]
    .map(item => doConversion(item))
    .filter(_.isDefined)
    .map(_.get)
    .viaMat(KillSwitches.single)(Keep.right)
}
val stream =
  Source.fromIterator(() => List(Item(Some(1))).toIterator)
    .viaMat(build(StreamConfig(1)))(Keep.right)
    .toMat(Sink.ignore)(Keep.both).run

// This stops the stream
stream._1.shutdown()

// When it finishes
stream._2 onComplete (_ => println("Done"))

Scala adding elements to seq and handling futures, maps, and async behavior

I'm still a newbie in Scala and don't quite understand the concepts of Futures, maps, flatMaps and Seqs, and how to use them properly.
This is what I want to do (pseudo code):
def getContentComponents: Action[AnyContent] = Action.async {
  contentComponentDTO.list().map( // Future[Seq[ContentComponentModel]]: get all contentComponents
    contentComponents => contentComponents.map( // iterate over Seq[ContentComponentModel]
      contentComponent => contentComponent.typeOf match { // match the type of the contentComponent
        case 1 => contentComponent.pictures :+ contentComponentDTO.getContentComponentPicture(contentComponent.id.get) // Future[Option[ContentComponentPictureModel]]: add to the pictures seq
        case 2 => contentComponent.videos :+ contentComponentDTO.getContentComponentVideo(contentComponent.id.get) // Future[Option[ContentComponentVideoModel]]: add to the videos seq
      }
    )
    Ok(Json.toJson(contentComponents)) // return all the contentComponents in the end
  )
}
I want to add a Future[Option[Foo]] to contentComponent.pictures: Option[Seq[Foo]] like so:
case 2 => contentComponent.pictures :+ contentComponentDTO.getContentComponentPicture(contentComponent.id.get) //contentComponent.pictures is Option[Seq[Foo]]
and return the whole contentComponent back to the front-end via json in the end.
I know this might be far away from the actual code in the end, but I hope you got the idea. Thanks!
I'll ignore your code and focus on what is short and makes sense:
I want to add a Future[Option[Foo]] to contentComponent.pictures: Option[Seq[Foo]] like so:
Let's do this, focusing on code readability:
// what you already have
val someFuture: Future[Option[Foo]] = ???
val pics: Option[Seq[Foo]] = contentComponent.pictures

// what I'm adding
val result: Future[Option[Seq[Foo]]] = someFuture.map {
  case None => pics
  case Some(newElement) =>
    pics match {
      case None => Some(Seq(newElement)) // not sure what you want here if pics is empty...
      case Some(picsSequence) => Some(picsSequence :+ newElement)
    }
}
And to show an example of flatMap, let's say you need the result of the result future inside another future; just do:
val otherFuture: Future[Any] = ???

val everything: Future[Option[Seq[Foo]]] = otherFuture.flatMap { otherResult =>
  // do something with otherResult, i.e. the code above could be pasted in here...
  result
}
My answer will attempt to help with some of the conceptual sub-questions which form parts of your overall larger question.
flatMap and for-yield
One of the points of flatMap is to help with the problem of the Pyramid of Doom. This happens when you have
structures nested within structures nested within structures ...
doA().map { resultOfA =>
  doB(resultOfA).map { resultOfB =>
    doC(resultOfB).map { resultOfC =>
      ...
    }
  }
}
If you use for-yield you get flatMap out of the box and it allows you to
flatten the pyramid
so that your code looks more like a linear structure
for {
  resultOfA <- doA
  resultOfB <- doB(resultOfA)
  resultOfC <- doC(resultOfB)
  ...
} yield {...}
There is a rule of thumb in software engineering that deeply nested structures are harder to debug and reason about, so
we strive to minimise the nesting. You will hit this issue especially when dealing with Futures.
Mapping over Future vs. mapping over sequence
Mapping is usually first thought of in terms of iterating over a sequence, which might lead you to understand mapping over a Future in terms of iterating over a sequence of one. My advice would be not to use the iteration concept when trying to understand mapping over Futures, Options, etc. In these cases it is better to think of mapping as a process of destructuring the structure so that you get at the element inside it. One could visualise mapping as breaking the shell of a walnut to get at the delicious kernel inside, and then rebuilding the shell.
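To make the distinction concrete, here is a minimal comparison (the values are made up purely for illustration):

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Mapping over a sequence: the function is applied to each of many elements.
val numbers: List[Int] = List(1, 2, 3)
val doubled: List[Int] = numbers.map(_ * 2) // List(2, 4, 6)

// Mapping over a Future: the structure is "opened", its single value is
// transformed, and the structure is rebuilt - no iteration involved.
val futureNumber: Future[Int] = Future(21)
val futureDoubled: Future[Int] = futureNumber.map(_ * 2) // eventually 42

// Same idea for Option.
val maybeNumber: Option[Int] = Some(21)
val maybeDoubled: Option[Int] = maybeNumber.map(_ * 2) // Some(42)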
Futures and monads
As you learn more about Futures and begin to deal with types like Future[Option[SomeType]], you will inevitably stumble upon documentation about monads, whose cryptic terminology might scare you away. If this happens, it might help to think of monads (of which Future is a particular instance) as simply something you can stick into a for-yield, so that you can get at the delicious walnut kernels whilst avoiding the pyramid of doom.
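For example, a small sketch of a for-yield over Futures whose values happen to be Options (findUser and findAvatar are hypothetical names, invented purely for illustration):

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

def findUser(id: Int): Future[Option[String]] = Future(Some(s"user-$id"))
def findAvatar(user: String): Future[Option[String]] = Future(Some(s"$user.png"))

val avatar: Future[Option[String]] =
  for {
    maybeUser   <- findUser(1)       // the for-yield unwraps the Future layer...
    maybeAvatar <- maybeUser match { // ...the Option inside still has to be handled explicitly
                     case Some(user) => findAvatar(user)
                     case None       => Future.successful(None)
                   }
  } yield maybeAvatar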

Is there an easy way to get a Stream as output of a RowParser?

Given rowParser of type RowParser[Photo], this is how you would parse a list of rows coming from a table photo, according to the code samples I have seen so far:
def getPhotos(album: Album): List[Photo] = DB.withConnection { implicit c =>
  SQL("select * from photo where album = {album}").on(
    'album -> album.id
  ).as(rowParser *)
}
Where the * operator creates a parser of type ResultSetParser[List[Photo]]. Now, I was wondering if it was equally possible to get a parser that yields a Stream (thinking that being more lazy is always better), but I only came up with this:
def getPhotos(album: Album): Stream[Photo] = DB.withConnection { implicit c =>
  SQL("select * from photo where album = {album}").on(
    'album -> album.id
  )() collect (rowParser(_) match { case Success(photo) => photo })
}
It works, but it seems overly complicated. I could of course just call toStream on the List I get from the first function, but my goal was to only apply rowParser on rows that are actually read. Is there an easier way to achieve this?
EDIT: I know that limit should be used in the query, if the number of rows of interest is known beforehand. I am also aware that, in many cases, you are going to use the whole result anyway, so being lazy will not improve performance. But there might be a case where you save a few cycles, e.g. if for some reason, you have search criteria that you cannot or do not want to express in SQL. So I thought it was odd that, given the fact that anorm provides a way to obtain a Stream of SqlRow, I didn't find a straightforward way to apply a RowParser on that.
I ended up creating my own stream method which corresponds to the list method:
def stream[A](p: RowParser[A]) = new ResultSetParser[Stream[A]] {
  def apply(rows: SqlParser.ResultSet): SqlResult[Stream[A]] = rows.headOption.map(p(_)) match {
    case None => Success(Stream.empty[A])
    case Some(Success(a)) => {
      val s: Stream[A] = a #:: rows.tail.flatMap(r => p(r) match {
        case Success(r) => Some(r)
        case _ => None
      })
      Success(s)
    }
    case Some(Error(msg)) => Error(msg)
  }
}
Note that the Play SqlResult can only be either Success/Error while each row can also be Success/Error. I handle this for the first row only, assuming the rest will be the same. This may or may not work for you.
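For completeness, a hypothetical usage mirroring getPhotos from the question (not verified against a particular anorm version):

def getPhotos(album: Album): Stream[Photo] = DB.withConnection { implicit c =>
  SQL("select * from photo where album = {album}").on(
    'album -> album.id
  ).as(stream(rowParser))
}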
You're better off making smaller (paged) queries using limit and offset.
Anorm would need some modification if you're going to keep your (large) result around in memory and stream it from there. Then the other concern would be the new memory requirements for your JVM. And how would you deal with caching on the service level? See, previously you could easily cache something like photos?page=1&size=10, but now you just have photos, and the caching technology would have no idea what to do with the stream.
Even worse, and possibly at the JDBC level, you could wrap a Stream around limited and offset-ed execute statements and just make multiple calls to the database behind the scenes. But that sounds like it would need a fair bit of work to port the Stream code that Scala generates to Java land (to work with Groovy, JRuby, etc.), and then get it approved for the JDBC 5 or 6 roadmap. This idea will probably be shunned as being too complicated, which it is.
You could wrap Stream around your entire DAO (where the limit and offset trickery would happen), but this almost sounds like more trouble than it's worth :-)
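A rough sketch of the paged approach, reusing the shape of getPhotos from the question (the limit/offset placeholders are an assumption; the exact syntax depends on your database):

def getPhotosPage(album: Album, page: Int, pageSize: Int): List[Photo] =
  DB.withConnection { implicit c =>
    SQL("select * from photo where album = {album} limit {limit} offset {offset}").on(
      'album -> album.id,
      'limit -> pageSize,
      'offset -> page * pageSize
    ).as(rowParser *)
  }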
I ran into a similar situation, but hit a stack overflow exception when the built-in anorm function for converting to Streams attempted to parse the result set.
In order to get around this, I elected to abandon the anorm ResultSetParser paradigm and fall back to the java.sql.ResultSet object.
I wanted to use anorm's internal classes for parsing result set rows, but ever since version 2.4 they have made all of the pertinent classes and methods private to their package, and have deprecated several other methods that would have been more straightforward to use.
I used a combination of Promises and Futures to work around the ManagedResource that anorm now returns. I avoided all deprecated functions.
import anorm._
import java.sql.ResultSet
import scala.concurrent._

def SqlStream[T](sql: SqlQuery)(parse: ResultSet => T)(implicit ec: ExecutionContext): Future[Stream[T]] = {
  val conn = db.getConnection()
  val mr = sql.preparedStatement(conn, false)
  val p = Promise[Unit]()
  val p2 = Promise[ResultSet]()

  Future {
    mr.map({ stmt =>
      p2.success(stmt.executeQuery)
      Await.ready(p.future, duration.Duration.Inf)
    }).acquireAndGet(identity).andThen { case _ => conn.close() }
  }

  def _stream(rs: ResultSet): Stream[T] = {
    if (rs.next()) parse(rs) #:: _stream(rs)
    else {
      p.success(())
      Stream.empty
    }
  }

  p2.future.map { rs =>
    rs.beforeFirst()
    _stream(rs)
  }
}
A rather trivial usage of this function would be something like this:
def getText(implicit ec: ExecutionContext): Future[Stream[String]] = {
  SqlStream(SQL("select FIELD from TABLE")) { rs => rs.getString("FIELD") }
}
There are, of course, drawbacks to this approach, however, this got around my problem and did not require inclusion of any other libraries.