Akka combining Sinks without access to Flows - scala

I am using an API that accepts a single AKKA Sink and fills it with data:
def fillSink(sink:Sink[String, _])
Is there a way, without delving into the depths of akka, to handle the output with two sinks instead of one?
For example
val mySink1:Sink = ...
val mySink2:Sink = ...
//something
fillSink( bothSinks )
If I had access to the Flow used by the fillSink method I could use flow.alsoTo(mySink1).to(mySink2) but the flow is not exposed.
The only workaround at the moment is to pass a single Sink which handles the strings and passes it on to two StringBuilders to replace mySink1/mySink2, but it feels like that defeats the point of AKKA. Without spending a couple days learning AKKA, I can't tell if there is a way to split output from sinks.
Thanks!

The combine Sink operator, which combines two or more Sinks using a provided Int => Graph[UniformFanOutShape[T, U], NotUsed]] function, might be what you're seeking:
def combine[T, U](first: Sink[U, _], second: Sink[U, _], rest: Sink[U, _]*)(strategy: Int => Graph[UniformFanOutShape[T, U], NotUsed]): Sink[T, NotUsed]
A trivialized example:
val doubleSink = Sink.foreach[Int](i => println(s"Doubler: ${i*2}"))
val tripleSink = Sink.foreach[Int](i => println(s" Triper: ${i*3}"))
val combinedSink = Sink.combine(doubleSink, tripleSink)(Broadcast[Int](_))
Source(List(1, 2, 3)).runWith(combinedSink)
// Doubler: 2
// Triper: 3
// Doubler: 4
// Triper: 6
// Doubler: 6
// Triper: 9

Related

Akka streams pass through flow limiting Parallelism / throughput of processing flow

I have a use case where I want to send a message to an external system but the flow that sends this message takes and returns a type I cant use downstream. This is a great use case for the pass through flow. I am using the implementation here. Initially I was worried that if the processingFlow uses a mapAsyncUnordered then this flow wouldn't work. Since the processing flow may reorder messages and the zip with may push out a tuple with the incorrect pair. E.g In the following example.
val testSource = Source(1 until 50)
val processingFlow: Flow[Int, Int, NotUsed] = Flow[Int].mapAsyncUnordered(10)(x => Future {
Thread.sleep(Random.nextInt(50))
x * 10
})
val passThroughFlow = PassThroughFlow(processingFlow, Keep.both)
val future = testSource.via(passThroughFlow).runWith(Sink.seq)
I would expect that the processing flow could reorder its outputs with respect its input and i would get a result such as:
[(30,1), (40,2),(10,3),(10,4), ...]
With the right ( the passed through always being in order) but the left which goes through my mapAsyncUnordered potentially being joined with an incorrect element to make a bad tuple.
Instead i actually get:
[(10,1), (20,2),(30,3),(40,4), ...]
Every time. Upon further investigation I noticed the code was running slow and in fact its not running in parallel at all despite my map async unordered. I tried introducing a buffer before and after as well as an async boundary but it always seems to run sequentially. This explains why it always ordered but I want my processing flow to have a higher throughput.
I came up with the following work around:
object PassThroughFlow {
def keepRight[A, A1](processingFlow: Flow[A, A1, NotUsed]): Flow[A, A, NotUsed] =
keepBoth[A, A1](processingFlow).map(_._2)
def keepBoth[A, A1](processingFlow: Flow[A, A1, NotUsed]): Flow[A, (A1, A), NotUsed] =
Flow.fromGraph(GraphDSL.create() { implicit builder => {
import GraphDSL.Implicits._
val broadcast = builder.add(Broadcast[A](2))
val zip = builder.add(ZipWith[A1, A, (A1, A)]((left, right) => (left, right)))
broadcast.out(0) ~> processingFlow ~> zip.in0
broadcast.out(1) ~> zip.in1
FlowShape(broadcast.in, zip.out)
}
})
}
object ParallelPassThroughFlow {
def keepRight[A, A1](parallelism: Int, processingFlow: Flow[A, A1, NotUsed]): Flow[A, A, NotUsed] =
keepBoth(parallelism, processingFlow).map(_._2)
def keepBoth[A, A1](parallelism: Int, processingFlow: Flow[A, A1, NotUsed]): Flow[A, (A1, A), NotUsed] = {
Flow.fromGraph(GraphDSL.create() { implicit builder =>
import GraphDSL.Implicits._
val fanOut = builder.add(Balance[A](outputPorts = parallelism))
val merger = builder.add(Merge[(A1, A)](inputPorts = parallelism, eagerComplete = false))
Range(0, parallelism).foreach { n =>
val passThrough = PassThroughFlow.keepBoth(processingFlow)
fanOut.out(n) ~> passThrough ~> merger.in(n)
}
FlowShape(fanOut.in, merger.out)
})
}
}
Two questions:
In the original implementation, Why does the zip inside the pass through flow limit the amount of parallelism of the map async unordered?
Is my work around sound or could it be improved? I basically fan out my input the input to multiple stacks of the pass through flow and merge it all back together. It seems to have the properties that I want (parallel yet maintains order even if processing flow reorders) yet something doesn't feel right
The behavior you're witnessing is a result of how broadcast and zip work: broadcast emits downstream when all of its outputs signal demand; zip waits for all of its inputs before signaling demand (and emitting downstream).
broadcast.out(0) ~> processingFlow ~> zip.in0
broadcast.out(1) ~> zip.in1
Consider the movement of the first element (1) through the above graph. 1 is broadcast to both processingFlow and zip. zip immediately receives one of its inputs (1) and waits for its other input (10), which will take a little longer to arrive. Only when zip gets both 1 and 10 does it pull for more elements from upstream, thus triggering the movement of the second element (2) through the stream. And so on.
As for your ParallelPassThroughFlow, I don't know why "something doesn't feel right" to you.

Grouping event with fs2.Stream

I have event stream as follows:
sealed trait Event
val eventStream: fs2.Stream[IO, Event] = //...
I want to group this events received within a single minute (i.e from 0 sec to 59 sec of every minute). This sounds pretty straightforward with fs2
val groupedEventsStream = eventStream groupAdjacentBy {event =>
TimeUnit.MILLISECONDS.toMinutes(System.currentTimeMillis())
}
The problem is that the grouping function is not pure. It uses currentTimeMillis. I can workaroud this as follows:
stream.evalMap(t => IO(System.currentTimeMillis(), t))
.groupAdjacentBy(t => TimeUnit.MILLISECONDS.toMinutes(t._1))
The thing is that adds clumsy boilerplate with tuples I'd like to avoid. Is there any other solutions?
Or maybe using impure function is not that bad for such a case?
You could remove some of the boilerplate by using cats.effect.Clock:
def groupedEventsStream[A](stream: fs2.Stream[IO, A])
(implicit clock: Clock[IO], eq: Eq[Long]): fs2.Stream[IO, (Long, Chunk[(Long, A)])] =
stream.evalMap(t => clock.realTime(TimeUnit.MINUTES).map((_, t)))
.groupAdjacentBy(_._1)

Elegant way of reusing akka-stream flows

I am looking for a way to easily reuse akka-stream flows.
I treat the Flow I intend to reuse as a function, so I would like to keep its signature like:
Flow[Input, Output, NotUsed]
Now when I use this flow I would like to be able to 'call' this flow and keep the result aside for further processing.
So I want to start with Flow emiting [Input], apply my flow, and proceed with Flow emitting [(Input, Output)].
example:
val s: Source[Int, NotUsed] = Source(1 to 10)
val stringIfEven = Flow[Int].filter(_ % 2 == 0).map(_.toString)
val via: Source[(Int, String), NotUsed] = ???
Now this is not possible in a straightforward way because combining flow with .via() would give me Flow emitting just [Output]
val via: Source[String, NotUsed] = s.via(stringIfEven)
Alternative is to make my reusable flow emit [(Input, Output)] but this requires every flow to push its input through all the stages and make my code look bad.
So I came up with a combiner like this:
def tupledFlow[In,Out](flow: Flow[In, Out, _]):Flow[In, (In,Out), NotUsed] = {
Flow.fromGraph(GraphDSL.create() { implicit b =>
import GraphDSL.Implicits._
val broadcast = b.add(Broadcast[In](2))
val zip = b.add(Zip[In, Out])
broadcast.out(0) ~> zip.in0
broadcast.out(1) ~> flow ~> zip.in1
FlowShape(broadcast.in, zip.out)
})
}
that is broadcasting the input to the flow and as well in a parallel line directly -> both to the 'Zip' stage where I join values into a tuple. It then can be elegantly applied:
val tupled: Source[(Int, String), NotUsed] = s.via(tupledFlow(stringIfEven))
Everything great but when given flow is doing a 'filter' operation - this combiner is stuck and stops processing further events.
I guess that is due to 'Zip' behaviour that requires all subflows to do the same - in my case one branch is passing given object directly so another subflow cannot ignore this element with. filter(), and since it does - the flow stops because Zip is waiting for push.
Is there a better way to achieve flow composition?
Is there anything I can do in my tupledFlow to get desired behaviour when 'flow' ignores elements with 'filter' ?
Two possible approaches - with debatable elegance - are:
1) avoid using filtering stages, mutating your filter into a Flow[Int, Option[Int], NotUsed]. This way you can apply your zipping wrapper around your whole graph, as was your original plan. However, the code looks more tainted, and there is added overhead by passing around Nones.
val stringIfEvenOrNone = Flow[Int].map{
case x if x % 2 == 0 => Some(x.toString)
case _ => None
}
val tupled: Source[(Int, String), NotUsed] = s.via(tupledFlow(stringIfEvenOrNone)).collect{
case (num, Some(str)) => (num,str)
}
2) separate the filtering and transforming stages, and apply the filtering ones before your zipping wrapper. Probably a more lightweight and better compromise.
val filterEven = Flow[Int].filter(_ % 2 == 0)
val toString = Flow[Int].map(_.toString)
val tupled: Source[(Int, String), NotUsed] = s.via(filterEven).via(tupledFlow(toString))
EDIT
3) Posting another solution here for clarity, as per the discussions in the comments.
This flow wrapper allows to emit each element from a given flow, paired with the original input element that generated it. It works for any kind of inner flow (emitting 0, 1 or more elements for each input).
def tupledFlow[In,Out](flow: Flow[In, Out, _]): Flow[In, (In,Out), NotUsed] =
Flow[In].flatMapConcat(in => Source.single(in).via(flow).map( out => in -> out))
I came up with an implementation of TupledFlow that works when wrapped Flow uses filter() or mapAsync() and when wrapped Flow emits 0,1 or N elements for every input:
def tupledFlow[In,Out](flow: Flow[In, Out, _])(implicit materializer: Materializer, executionContext: ExecutionContext):Flow[In, (In,Out), NotUsed] = {
val v:Flow[In, Seq[(In, Out)], NotUsed] = Flow[In].mapAsync(4) { in: In =>
val outFuture: Future[Seq[Out]] = Source.single(in).via(flow).runWith(Sink.seq)
val bothFuture: Future[Seq[(In,Out)]] = outFuture.map( seqOfOut => seqOfOut.map((in,_)) )
bothFuture
}
val onlyDefined: Flow[In, (In, Out), NotUsed] = v.mapConcat[(In, Out)](seq => seq.to[scala.collection.immutable.Iterable])
onlyDefined
}
the only drawback I see here is that I am instantiating and materializing a flow for a single entity - just to get a notion of 'calling a flow as a function'.
I didn't do any performance tests on that - however since heavy-lifting is done in a wrapped Flow which is executed in a future - I believe this will be ok.
This implementation passes all the tests from https://gist.github.com/kretes/8d5f2925de55b2a274148b69f79e55ac#file-tupledflowspec-scala

How to deal with source that emits Future[T]?

Let's say I have some iterator:
val nextElemIter: Iterator[Future[Int]] = Iterator.continually(...)
And I want to build a source from that iterator:
val source: Source[Future[Int], NotUsed] =
Source.fromIterator(() => nextElemIter)
So now my source emits Futures. I have never seen futures being passed between stages in Akka docs or anywhere else, so instead, I could always do something like this:
val source: Source[Int, NotUsed] =
Source.fromIterator(() => nextElemIter).mapAsync(1)(identity /* was n => n */)
And now I have a regular source that emits T instead of Future[T]. But this feels hacky and wrong.
What's the proper way to deal with such situations?
Answering your question directly: I agree with Vladimir's comment that there is nothing "hacky" about using mapAsync for the purpose you described. I can't think of any more direct way to unwrap the Future from around your underlying Int values.
Answering your question indirectly...
Try to stick with Futures
Streams, as a concurrency mechanism, are incredibly useful when backpressure is required. However, pure Future operations have their place in applications as well.
If your Iterator[Future[Int]] is going to produce a known, limited, number of Future values then you may want to stick with using the Futures for concurrency.
Imagine you want to filter, map, & reduce the Int values.
def isGoodInt(i : Int) : Boolean = ??? //filter
def transformInt(i : Int) : Int = ??? //map
def combineInts(i : Int, j : Int) : Int = ??? //reduce
Futures provide a direct way of using these functions:
val finalVal : Future[Int] =
Future sequence {
for {
f <- nextElemIter.toSeq //launch all of the Futures
i <- f if isGoodInt(i)
} yield transformInt(i)
} map (_ reduce combineInts)
Compared with a somewhat indirect way of using the Stream as you suggested:
val finalVal : Future[Int] =
Source.fromIterator(() => nextElemIter)
.via(Flow[Future[Int]].mapAsync(1)(identity))
.via(Flow[Int].filter(isGoodInt))
.via(Flow[Int].map(transformInt))
.to(Sink.reduce(combineInts))
.run()

akka split task into smaller and fold results

The question is about Akka actors library. A want to split one big task into smaller tasks and then fold the result of them into one 'big' result. This will give me faster computation profit. Smaller tasks can be computed in parallel if they are independent.
Assume that we need to compute somethig like this. Function count2X is time consuming, so using it several times in one thread is not optimal.
//NOT OPTIMAL
def count2X(x: Int) = {
Thread.sleep(1000)
x * 2
}
val sum = count2X(1) + count2X(2) + count2X(3)
println(sum)
And here goes the question.
How to dispatch tasks and collect results and then fold them, all using akka actors?
Is such functionality already provided by Akka or do I need to implement it myself? What are best practisies in such approach.
Here is 'visual' interpretation of my question:
/-> [SMALL_TASK_1] -\
[BIG_TASK] -+--> [SMALL_TASK_1] --> [RESULT_FOLD]
\-> [SMALL_TASK_1] -/
Below is my scaffold implementation with missing/bad implementation :)
case class Count2X(x: Int)
class Count2XActor extends Actor {
def receive = {
case Count2X(x) => count2X(x); // AND NOW WHAT ?
}
}
case class CountSumOf2X(a: Int, b: Int, c: Int)
class SumOf2XActor extends Actor {
val aCounter = context.actorOf(Props[Count2XActor])
val bCounter = context.actorOf(Props[Count2XActor])
val cCounter = context.actorOf(Props[Count2XActor])
def receive = {
case CountSumOf2X(a, b, c) => // AND NOW WHAT ? aCounter ! Count2X(a); bCounter ! Count2X(b); cCounter ! Count2X(c);
}
}
val aSystem = ActorSystem("mySystem")
val actor = aSystem.actorOf(Props[SumOf2XActor])
actor ! CountSumOf2X(10, 20, 30)
Thanks for any help.
In Akka I would do something like this:
val a = aCounter ? Count2X(10) mapTo[Int]
val b = bCounter ? Count2X(10) mapTo[Int]
val c = cCounter ? Count2X(10) mapTo[Int]
Await.result(Future.sequence(a, b, c) map (_.sum), 1 second).asInstanceOf[Int]
I'm sure there is a better way - here you start summing results after all Future-s are complete in parallel, for simple task it's ok, but generally you shouldn't wait so long
Two things you could do:
1) Use Akka futures. These allow you to dispatch operations and fold on them in an asynchronous manner. Check out http://doc.akka.io/docs/akka/2.0.4/scala/futures.html for more information.
2) You can dispatch work to multiple "worker" actors and then have a "master" actor aggregate them, keeping track of which messages are pending/processed by storing information in the messages themselves. I have a simple stock quote example of this using Akka actors here: https://github.com/ryanlecompte/quotes