Akka Streams, break tuple item apart? - scala

Using the superPool from akka-http, I have a stream that passes down a tuple. I would like to pipeline it to the Alpakka Google Pub/Sub connector. At the end of the HTTP processing, I encode everything for the pub/sub connector and end up with
(PublishRequest, Long) // long is a timestamp
but the interface of the connector is
Flow[PublishRequest, Seq[String], NotUsed]
A first approach is to simply drop one part:
.map{ case(publishRequest, timestamp) => publishRequest }
.via(publishFlow)
Is there an elegant way to create this pipeline while keeping the Long information?
EDIT: added my not-so-elegant solution in the answers. More answers welcome.

I don't see anything inelegant about your solution using GraphDSL.create(), which I think has the advantage of visualizing the stream structure via the diagrammatic ~> clauses. I do see a problem in your code, though. For example, I don't think publisher should be defined by add-ing a flow to the builder.
Below is a skeletal version (briefly tested) of what I believe publishAndRecombine should look like:
val publishFlow: Flow[PublishRequest, Seq[String], NotUsed] = ???
val publishAndRecombine = Flow.fromGraph(GraphDSL.create() { implicit b =>
  import GraphDSL.Implicits._

  val bcast = b.add(Broadcast[(PublishRequest, Long)](2))
  val zipper = b.add(Zip[Seq[String], Long])

  val publisher = Flow[(PublishRequest, Long)]
    .map { case (pr, _) => pr }
    .via(publishFlow)
  val timestamp = Flow[(PublishRequest, Long)]
    .map { case (_, ts) => ts }

  bcast.out(0) ~> publisher ~> zipper.in0
  bcast.out(1) ~> timestamp ~> zipper.in1

  FlowShape(bcast.in, zipper.out)
})
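For completeness, a hypothetical usage sketch (the requests source is made up): the recombined flow emits (Seq[String], Long) pairs, so the timestamp is still available downstream of the publish step.
val requests: Source[(PublishRequest, Long), NotUsed] = ???

requests
  .via(publishAndRecombine)
  .runForeach { case (messageIds, timestamp) => println(s"$timestamp -> $messageIds") }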

There is now a much nicer solution for this which will be released in Akka 2.6.19 (see https://github.com/akka/akka/pull/31123).
In order to use the aforementioned unsafeViaData you would first have to represent (PublishRequest, Long) using FlowWithContext/SourceWithContext. FlowWithContext/SourceWithContext is an abstraction that was specifically designed to solve this problem (see https://doc.akka.io/docs/akka/current/stream/stream-context.html). The problem is that you have a stream with a data part, which is typically what you want to operate on (in your case the PublishRequest), and a context (i.e. metadata) part which you typically just pass along unmodified (in your case the Long).
So in the end you would have something like this:
val myFlow: FlowWithContext[PublishRequest, Long, PublishRequest, Long, NotUsed] =
  FlowWithContext.fromTuples(originalFlowAsTuple) // original flow that has `(PublishRequest, Long)` as an output

myFlow.unsafeViaData(publishFlow)
In contrast to the GraphDSL solution above, not only does this approach involve much less boilerplate (since it's part of Akka), it also retains the materialized value rather than losing it and always ending up with NotUsed.
For those wondering why the method unsafeViaData has unsafe in the name: it's because the Flow that you pass into this method must not add, drop or reorder any of the elements in the stream (doing so would mean that the context no longer properly corresponds to the data part of the stream). Ideally we would use Scala's type system to catch such errors at compile time, but doing so would require a lot of changes to akka-stream, especially if the changes need to remain backwards compatible (which, when dealing with Akka, they do). More details are in the PR mentioned earlier.
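To make the shape of the abstraction concrete, here is a minimal sketch using only the stable SourceWithContext API (the timestamped source is hypothetical); the unsafe method described above is what would then let publishFlow operate on the data side only.
val timestamped: Source[(PublishRequest, Long), NotUsed] = ???

// Attach the Long as context; data-side operations keep it paired automatically.
val withContext: SourceWithContext[PublishRequest, Long, NotUsed] =
  SourceWithContext.fromTuples(timestamped)

// map only sees the PublishRequest; the Long rides along untouched.
val logged = withContext.map { pr => println(pr); pr }

// Drop back to a plain tuple stream whenever you need to.
val tuples: Source[(PublishRequest, Long), NotUsed] = logged.asSource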

My not-so-elegant solution uses a custom flow that recombines things:
val publishAndRecombine = Flow.fromGraph(GraphDSL.create() { implicit b =>
  import GraphDSL.Implicits._

  val bc = b.add(Broadcast[(PublishRequest, Long)](2))
  val publisher = b.add(Flow[(PublishRequest, Long)]
    .map { case (pr, _) => pr }
    .via(publishFlow))
  val zipper = b.add(Zip[Seq[String], Long])

  bc.out(0) ~> publisher ~> zipper.in0
  bc.out(1).map { case (_, long) => long } ~> zipper.in1

  FlowShape(bc.in, zipper.out)
})

Related

How to create an Akka flow with backpressure and Control

I need to create a function with the following interface:
import akka.kafka.scaladsl.Consumer.Control
object ItemConversionFlow {
  def build(config: StreamConfig): Flow[Item, OtherItem, Control] = {
    // Implementation goes here
  }
}
My problem is that I don't know how to define the flow in a way that fits the interface above.
When I am doing something like this
val flow = Flow[Item]
  .map(item => doConversion(item))
  .filter(_.isDefined)
  .map(_.get)
the resulting type is Flow[Item, OtherItem, NotUsed]. I haven't found anything in the Akka documentation so far, and the functions on akka.stream.scaladsl.Flow only offer NotUsed instead of Control. It would be great if someone could point me in the right direction.
Some background: I need to set up several pipelines which differ only in the conversion part. These pipelines are sub-streams of a main stream which might be stopped for some reason (a corresponding message arrives in some Kafka topic). Therefore I need the Control part. The idea would be to create a graph template where I just insert the mentioned flow as an argument (a factory returning it). For a specific case we have a solution that works. To generalize it I need this kind of flow.
You actually have backpressure. However, think about what you really need backpressure for: you are not using asynchronous stages to increase your throughput, for example. Backpressure prevents fast producers from overwhelming slower subscribers (https://doc.akka.io/docs/akka/2.5/stream/stream-rate.html). In your sample, don't worry about it: your stream will ask the publisher for new elements depending on how long doConversion takes to complete.
If you want to obtain the result of the stream, use toMat or viaMat. For example, if your stream emits Items and transforms them into OtherItems:
val str = Source.fromIterator(() => List(Item(Some(1))).toIterator)
  .map(item => doConversion(item))
  .filter(_.isDefined)
  .map(_.get)
  .toMat(Sink.fold(List[OtherItem]())((a, b) => {
    // Examine the result of your stream
    b :: a
  }))(Keep.right)
  .run()
str will be Future[List[OtherItem]]. Try to extrapolate this to your case.
Or use viaMat with KillSwitches, which (quoting the Scaladoc) "Creates a new Graph of FlowShape that materializes to an external switch that allows external completion of that unique materialization. Different materializations result in different, independent switches."
def build(config: StreamConfig): Flow[Item, OtherItem, UniqueKillSwitch] = {
  Flow[Item]
    .map(item => doConversion(item))
    .filter(_.isDefined)
    .map(_.get)
    .viaMat(KillSwitches.single)(Keep.right)
}
val stream =
  Source.fromIterator(() => List(Item(Some(1))).toIterator)
    .viaMat(build(StreamConfig(1)))(Keep.right)
    .toMat(Sink.ignore)(Keep.both)
    .run()

// This stops the stream
stream._1.shutdown()

// When it finishes
stream._2 onComplete (_ => println("Done"))

Understanding NotUsed and Done

I am having a hard time understanding the purpose and significance of NotUsed and Done in Akka Streams.
Let us look at the following two simple examples:
Using NotUsed:
implicit val system = ActorSystem("akka-streams")
implicit val materializer = ActorMaterializer()

val myStream: RunnableGraph[NotUsed] =
  Source.single("stackoverflow")
    .map(s => s.toUpperCase())
    .to(Sink.foreach(println))

val runResult: NotUsed = myStream.run()
Using Done
implicit val system = ActorSystem("akka-streams")
implicit val materializer = ActorMaterializer()

val myStream: RunnableGraph[Future[Done]] =
  Source.single("stackoverflow")
    .map(s => s.toUpperCase())
    .toMat(Sink.foreach(println))(Keep.right)

val runResult: Future[Done] = myStream.run()
When I run these examples, I get the same output in both cases:
STACKOVERFLOW //output
So what exactly are NotUsed and Done? What are the differences, and when should I prefer one over the other?
First of all, the choice you are making is between NotUsed and Future[Done] (not just Done).
Now, you are essentially deciding the materialized value of your graph, by using the different combinators (to and toMat with Keep.right).
The materialized value is a way to interact with your stream while it's running. This choice does not affect the data processed by your stream, and for this reason you see the same output in both cases. The same element (the string "stackoverflow") goes through both streams.
The choice depends on what your main program is supposed to do after running the stream:
in case you are not interested in interacting with it, NotUsed is the right choice. It is just a dummy object, and it conveys the information that no interaction with the stream is either allowed or needed
in case you need to listen for the completion of the stream to perform some other action, you need to expose the Future[Done]. This way you can attach a callback to it using (e.g.) onComplete or map, as in the sketch below.
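A minimal sketch of the second option, assuming the same implicit system and materializer as in the question:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{ Failure, Success }

val done: Future[Done] =
  Source.single("stackoverflow")
    .map(_.toUpperCase)
    .toMat(Sink.foreach(println))(Keep.right)
    .run()

// React to completion (or failure) of the stream.
done.onComplete {
  case Success(_)  => println("stream completed")
  case Failure(ex) => println(s"stream failed: $ex")
}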

Akka Streams: validation of elements being streamed

I'm new to Akka Streams and I'm wondering how to implement some kind of mid-stream validation. Example:
FileIO
  .fromPath(file)
  .via(Framing.delimiter(...))
  .map(_.utf8String)
  .map(_.split("\t", -1))
  .validate(arr => arr.length == 10) // or similar
  ...
I assumed that this scenario is so common that there must be predefined functionality for validating a stream on the fly. However, I wasn't able to find anything about it. Am I on the wrong track here, and is validation something that should not be done this way in Akka Streams?
In my particular scenario, I'm processing a file line by line. If even a single line is invalid, it does not make sense to continue, and processing should be aborted.
I'd probably create a type to represent the constraints, then you can do the assertions when creating instances of that type, as well as know downstream which constraints have been applied.
Example:
object LineItem {
  // Makes it possible to provide the validation before allocating the item
  def apply(string: String): LineItem = {
    require(string.length == 10)
    new LineItem(string) // Call the companion-accessible constructor
  }
}

// private[LineItem] makes sure that `new` only works from the companion object
final case class LineItem private[LineItem](string: String)
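A hypothetical way to plug this into the pipeline from the question (the delimiter and frame length are assumptions): the require failure throws an IllegalArgumentException, which fails the stream and aborts processing.
FileIO
  .fromPath(file)
  .via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 1024))
  .map(_.utf8String)
  .map(LineItem(_)) // an invalid line throws here and fails the stream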
You could use .takeWhile. This will process all elements before the invalid item, and not process any items after it.
FileIO
  .fromPath(file)
  .via(Framing.delimiter(...))
  .map(_.utf8String)
  .map(_.split("\t", -1))
  .takeWhile(arr => arr.length == 10)
  ...
I agree with @Stephen that takeWhile is what you need. You can use it with the inclusive flag set to true if you want the failing element to be passed downstream.
Also, if you want to make your stream the most expressive, you can have the validation flow producing Either[ValidationError, String].
The example below is a bit clunky; I would prefer to use the GraphDSL and Partition (a sketch of that variant follows the example), but hopefully you get the idea.
val errorSink: Sink[TooManyElements, _] = ???
val sink: Sink[Array[String], _] = ???

FileIO
  .fromPath(file)
  .via(Framing.delimiter(...))
  .map(_.utf8String.split("\t", -1))
  .map {
    case arr if arr.length > 10 ⇒ Left(TooManyElements(arr.length))
    case arr ⇒ Right(arr)
  }
  .takeWhile(_.isRight, inclusive = true)
  .alsoTo(Flow[Either[TooManyElements, Array[String]]]
    .collect { case Left(err) ⇒ err }
    .to(errorSink))
  .collect { case Right(arr) ⇒ arr }
  .to(sink)
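For reference, a rough sketch of the GraphDSL/Partition variant mentioned above (untested; rows stands in for the framed and split lines). Note that, unlike takeWhile, Partition keeps consuming after an invalid row and merely routes it to the error sink:
val rows: Source[Array[String], NotUsed] = ??? // the framed and split lines from above

val graph = RunnableGraph.fromGraph(GraphDSL.create() { implicit b =>
  import GraphDSL.Implicits._

  // Port 0 receives invalid rows, port 1 receives valid ones.
  val partition = b.add(Partition[Array[String]](2, arr => if (arr.length > 10) 0 else 1))

  rows ~> partition.in
  partition.out(0).map(arr => TooManyElements(arr.length)) ~> errorSink
  partition.out(1) ~> sink
  ClosedShape
})

graph.run()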

Akka Stream - Splitting flow into multiple Sources

I have a TCP connection in Akka Streams that ends in a Sink. Right now all messages go into one Sink. I want to split the stream into an unknown number of Sinks given some function.
The use case is as follows: from the TCP connection I get a continuous stream of something like List[DeltaValue], and I want to create an actor Sink for each DeltaValue.id so that I can continuously accumulate and implement behaviour per DeltaValue.id. I find this to be a standard use case in stream processing, but I'm not able to find a good example with Akka Streams.
This is what I have right now:
def connect(): ActorRef = tcpConnection
  // SOMEHOW SPLIT HERE and create a ReceiverActor for each message
  .to(Sink.actorRef(system.actorOf(ReceiverActor.props(), ReceiverActor.name), akka.Done))
  .run()
Update:
I now have this. I'm not sure what to say about it; it does not feel super stable, but it should work:
private def spawnActorOrSendMessage(m: ResponseMessage): Unit = {
  implicit val timeout = Timeout(FiniteDuration(1, TimeUnit.SECONDS))

  system.actorSelection("user/" + m.id.toString).resolveOne().onComplete {
    case Success(actorRef) => actorRef ! m
    case Failure(ex) => system.actorOf(ReceiverActor.props(), m.id.toString) ! m
  }
}

def connect(): ActorRef = tcpConnection
  .to(Sink.foreachParallel(10)(spawnActorOrSendMessage))
  .run()
Below is a somewhat improved version of the update in the question. The main improvement is that your actors are kept in a data structure, avoiding an actorSelection resolution for every incoming message.
case class DeltaValue(id: String, value: Double)

val src: Source[DeltaValue, NotUsed] = ???

src.runFold(Map[String, ActorRef]()) {
  case (actors, elem) if actors.contains(elem.id) ⇒
    actors(elem.id) ! elem.value
    actors
  case (actors, elem) ⇒
    val newActor = system.actorOf(ReceiverActor.props(), ReceiverActor.name)
    newActor ! elem.value
    actors.updated(elem.id, newActor)
}
Keep in mind that, when you integrate Akka Streams with bare actors, you lose backpressure support. This is one of the reasons why you should try to implement your logic within the boundaries of Akka Streams whenever possible, although this is not always possible (e.g. when remoting is needed).
In your case, you could consider leveraging groupBy and the concept of substreams. The example below folds the elements of each substream by summing them, just to give an idea:
src.groupBy(maxSubstreams = Int.MaxValue, f = _.id)
  .fold("" → 0d) {
    case ((_, acc), delta) ⇒ delta.id → (acc + delta.value)
  }
  .mergeSubstreams
  .runForeach(println)
EventStream
You can send messages to the ActorSystem's EventStream within a stream Sink and separately have interested Actors subscribe to the EventStream.
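A minimal sketch of that idea (DeltaValue as in the question; the actor-side subscription is shown as a comment):
// Publish every element to the ActorSystem's EventStream.
val eventStreamSink = Sink.foreach[DeltaValue](dv => system.eventStream.publish(dv))

// Inside an interested actor, e.g. in preStart:
//   context.system.eventStream.subscribe(self, classOf[DeltaValue])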
Split At Stream Level
You can split the stream at the stream level using Broadcast. The documentation has a good example of this.
Split At Actor Level
You could also use Sink.actorRef in combination with a BroadcastPool to broadcast the messages to multiple Actors.
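A rough sketch of that approach (the pool size and completion message are arbitrary):
import akka.routing.BroadcastPool

// A router that forwards every incoming message to all of its routees.
val router: ActorRef = system.actorOf(BroadcastPool(5).props(ReceiverActor.props()))

tcpConnection
  .to(Sink.actorRef(router, akka.Done))
  .run()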

Iteratees in Scala that use lazy evaluation or fusion?

I have heard that iteratees are lazy, but how lazy exactly are they? Alternatively, can iteratees be fused with a postprocessing function, so that an intermediate data structure does not have to be built?
Can I in my iteratee for example build a 1 million item Stream[Option[String]] from a java.io.BufferedReader, and then subsequently filter out the Nones, in a compositional way, without requiring the entire Stream to be held in memory? And at the same time guarantee that I don't blow the stack? Or something like that - it doesn't have to use a Stream.
I'm currently using Scalaz 6 but if other iteratee implementations are able to do this in a better way, I'd be interested to know.
Please provide a complete solution, including closing the BufferedReader and calling unsafePerformIO, if applicable.
Here's a quick iteratee example using the Scalaz 7 library that demonstrates the properties you're interested in: constant memory and stack usage.
The problem
First assume we've got a big text file with a string of decimal digits on each line, and we want to find all the lines that contain at least twenty zeros. We can generate some sample data like this:
val w = new java.io.PrintWriter("numbers.txt")
val r = new scala.util.Random(0)
(1 to 1000000).foreach(_ =>
  w.println((1 to 100).map(_ => r.nextInt(10)).mkString)
)
w.close()
Now we've got a file named numbers.txt. Let's open it with a BufferedReader:
val reader = new java.io.BufferedReader(new java.io.FileReader("numbers.txt"))
It's not excessively large (~97 megabytes), but it's big enough for us to see easily whether our memory use is actually staying constant while we process it.
Setting up our enumerator
First for some imports:
import scalaz._, Scalaz._, effect.IO, iteratee.{ Iteratee => I }
And an enumerator (note that I'm changing the IoExceptionOrs into Options for the sake of convenience):
val enum = I.enumReader(reader).map(_.toOption)
Scalaz 7 doesn't currently provide a nice way to enumerate a file's lines, so we're chunking through the file one character at a time. This will of course be painfully slow, but I'm not going to worry about that here, since the goal of this demo is to show that we can process this large-ish file in constant memory and without blowing the stack. The final section of this answer gives an approach with better performance, but here we'll just split on line breaks:
val split = I.splitOn[Option[Char], List, IO](_.cata(_ != '\n', false))
And if the fact that splitOn takes a predicate that specifies where not to split confuses you, you're not alone. split is our first example of an enumeratee. We'll go ahead and wrap our enumerator in it:
val lines = split.run(enum).map(_.sequence.map(_.mkString))
Now we've got an enumerator of Option[String]s in the IO monad.
Filtering the file with an enumeratee
Next for our predicate—remember that we said we wanted lines with at least twenty zeros:
val pred = (_: String).count(_ == '0') >= 20
We can turn this into a filtering enumeratee and wrap our enumerator in that:
val filtered = I.filter[Option[String], IO](_.cata(pred, true)).run(lines)
We'll set up a simple action that just prints everything that makes it through this filter:
val printAction = (I.putStrTo[Option[String]](System.out) &= filtered).run
Of course we've not actually read anything yet. To do that we use unsafePerformIO:
printAction.unsafePerformIO()
Now we can watch the Some("0946943140969200621607610...")s slowly scroll by while our memory usage remains constant. It's slow and the error handling and output are a little clunky, but not too bad I think for about nine lines of code.
Getting output from an iteratee
That was the foreach-ish usage. We can also create an iteratee that works more like a fold—for example gathering up the elements that make it through the filter and returning them in a list. Just repeat everything above up until the printAction definition, and then write this instead:
val gatherAction = (I.consume[Option[String], IO, List] &= filtered).run
Kick that action off:
val xs: Option[List[String]] = gatherAction.unsafePerformIO().sequence
Now go get a coffee (it might need to be pretty far away). When you come back you'll either have a None (in the case of an IOException somewhere along the way) or a Some containing a list of 1,943 strings.
Complete (faster) example that automatically closes the file
To answer your question about closing the reader, here's a complete working example that's roughly equivalent to the second program above, but with an enumerator that takes responsibility for opening and closing the reader. It's also much, much faster, since it reads lines, not characters. First for imports and a couple of helper methods:
import java.io.{ BufferedReader, File, FileReader }
import scalaz._, Scalaz._, effect._, iteratee.{ Iteratee => I, _ }
def tryIO[A, B](action: IO[B]) = I.iterateeT[A, IO, Either[Throwable, B]](
  action.catchLeft.map(
    r => I.sdone(r, r.fold(_ => I.eofInput, _ => I.emptyInput))
  )
)
def enumBuffered(r: => BufferedReader) =
  new EnumeratorT[Either[Throwable, String], IO] {
    lazy val reader = r
    def apply[A] = (s: StepT[Either[Throwable, String], IO, A]) => s.mapCont(
      k =>
        tryIO(IO(reader.readLine())).flatMap {
          case Right(null) => s.pointI
          case Right(line) => k(I.elInput(Right(line))) >>== apply[A]
          case e => k(I.elInput(e))
        }
    )
  }
And now the enumerator:
def enumFile(f: File): EnumeratorT[Either[Throwable, String], IO] =
  new EnumeratorT[Either[Throwable, String], IO] {
    def apply[A] = (s: StepT[Either[Throwable, String], IO, A]) => s.mapCont(
      k =>
        tryIO(IO(new BufferedReader(new FileReader(f)))).flatMap {
          case Right(reader) => I.iterateeT(
            enumBuffered(reader).apply(s).value.ensuring(IO(reader.close()))
          )
          case Left(e) => k(I.elInput(Left(e)))
        }
    )
  }
And we're ready to go:
val action = (
  I.consume[Either[Throwable, String], IO, List] %=
    I.filter(_.fold(_ => true, _.count(_ == '0') >= 20)) &=
    enumFile(new File("numbers.txt"))
).run
Now the reader will be closed when the processing is done.
I should have read a little bit further... this is precisely what enumeratees are for. Enumeratees are defined in Scalaz 7 and Play 2, but not in Scalaz 6.
Enumeratees are for "vertical" composition (in the sense of "vertically integrated industry") while ordinary iteratees compose monadically in a "horizontal" way.
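To make that concrete, here is a minimal sketch of the two directions, reusing the Scalaz 7 imports from the answer above (the predicate is arbitrary):
// "Vertical": an enumeratee adapts the input that an iteratee consumes.
val onlyZeroHeavy = I.filter[String, IO](_.count(_ == '0') >= 20)
val gather = I.consume[String, IO, List] %= onlyZeroHeavy

// "Horizontal": iteratees are monadic, so they sequence with map/flatMap.
val firstTwo = for {
  a <- I.head[String, IO]
  b <- I.head[String, IO]
} yield (a, b)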