Does Akka Streams Implement Join Semantics as Kafka Streams Does?

I am quite new to Akka Streams, but I have some experience with Kafka Streams.
One thing that seems to be missing in Akka Streams is the ability to join two different streams.
Kafka Streams allows joining records coming from two different streams (or tables) using the messages' keys.
Is there something similar in Akka Streams?

The short answer is, unfortunately, no. I would argue that Akka Streams is lower level than Kafka Streams, Spark Streaming, or Flink. In exchange, you have more control over what you are doing, which means you can build your own join operator. Check this discussion at Lightbend.
The idea is to take data from two Sources, merge them, collect the merged elements into a window based on time or number of tuples, compute the join, and emit the result to the Sink. I have written this PoC (still unfinished) following those steps, and it compiles and runs. What is still missing is the join of the data inside the window; currently I just emit the elements as a mini-batch.
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.{Attributes, ClosedShape, FlowShape, Inlet, Outlet}
import akka.stream.scaladsl.{Flow, GraphDSL, Merge, RunnableGraph, Sink, Source}
import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler, OutHandler, TimerGraphStageLogic}
import scala.collection.mutable
import scala.concurrent.duration._
object StreamOpenGraphJoin {
  def main(args: Array[String]): Unit = {
    implicit val system = ActorSystem("StreamOpenGraphJoin")
    val incrementSource: Source[Int, NotUsed] = Source(1 to 10).throttle(1, 1 second)
    val decrementSource: Source[Int, NotUsed] = Source(10 to 20).throttle(1, 1 second)
    def tokenizerSource(key: Int) = {
      Flow[Int].map { value =>
        (key, value)
      }
    }
    // Step 1 - setting up the fundamental for a stream graph
    val switchJoinStrategies = RunnableGraph.fromGraph(
      GraphDSL.create() { implicit builder =>
        import GraphDSL.Implicits._
        // Step 2 - add partition and merge strategy
        val tokenizerShape00 = builder.add(tokenizerSource(0))
        val tokenizerShape01 = builder.add(tokenizerSource(1))
        val mergeTupleShape = builder.add(Merge[(Int, Int)](2))
        val batchFlow = Flow.fromGraph(new BatchTimerFlow[(Int, Int)](5 seconds))
        val sinkShape = builder.add(Sink.foreach[(Int, Int)](x => println(s" > sink: $x")))
        // Step 3 - tying up the components
        incrementSource ~> tokenizerShape00 ~> mergeTupleShape.in(0)
        decrementSource ~> tokenizerShape01 ~> mergeTupleShape.in(1)
        mergeTupleShape.out ~> batchFlow ~> sinkShape
        // Step 4 - return the shape
        ClosedShape
      }
    )
    // run the graph and materialize it
    val graph = switchJoinStrategies.run()
  }
  // step 0: define the shape
  class BatchTimerFlow[T](silencePeriod: FiniteDuration) extends GraphStage[FlowShape[T, T]] {
    // step 1: define the ports and the component-specific members
    val in = Inlet[T]("BatchTimerFlow.in")
    val out = Outlet[T]("BatchTimerFlow.out")
    // step 3: create the logic
    override def createLogic(inheritedAttributes: Attributes): GraphStageLogic = new TimerGraphStageLogic(shape) {
      // mutable state
      val batch = new mutable.Queue[T]
      var open = false
      // step 4: define mutable state implement my logic here
      setHandler(in, new InHandler {
        override def onPush(): Unit = {
          try {
            val nextElement = grab(in)
            batch.enqueue(nextElement)
            Thread.sleep(50) // simulate an expensive computation
            if (open) pull(in) // send demand upstream signal, asking for another element
            else {
              // forward the element to the downstream operator
              emitMultiple(out, batch.dequeueAll(_ => true).to[collection.immutable.Iterable])
              open = true
              scheduleOnce(None, silencePeriod)
            }
          } catch {
            case e: Throwable => failStage(e)
          }
        }
      })
      setHandler(out, new OutHandler {
        override def onPull(): Unit = {
          pull(in)
        }
      })
      override protected def onTimer(timerKey: Any): Unit = {
        open = false
      }
    }
    // step 2: construct a new shape
    override def shape: FlowShape[T, T] = FlowShape[T, T](in, out)
  }
}
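The PoC above stops at emitting each window as a mini-batch. As a rough sketch of the missing join step, the following plain-Scala function takes one window of tagged tuples (tag 0 for the first source, tag 1 for the second, as produced by tokenizerSource) and pairs the values that share a join key. The names WindowJoin, joinWindow, and the keyOf parameter are hypothetical and not part of the PoC or of Akka; the example deliberately uses only the standard library.

```scala
object WindowJoin {
  // Joins one window of tagged elements: tag 0 = left stream, tag 1 = right stream.
  // keyOf extracts the join key from a value (hypothetical; the PoC defines no key yet).
  def joinWindow[K](window: Seq[(Int, Int)], keyOf: Int => K): Seq[(Int, Int)] = {
    val left  = window.collect { case (0, v) => v }
    val right = window.collect { case (1, v) => v }
    // Index the right side by key so each left element needs a single lookup.
    val rightByKey: Map[K, Seq[Int]] = right.groupBy(keyOf)
    for {
      l <- left
      r <- rightByKey.getOrElse(keyOf(l), Seq.empty)
    } yield (l, r)
  }

  def main(args: Array[String]): Unit = {
    val window = Seq((0, 1), (1, 11), (0, 2), (1, 12), (1, 21))
    // Join on the last decimal digit, purely for illustration.
    println(joinWindow[Int](window, _ % 10)) // List((1,11), (1,21), (2,12))
  }
}
```

In an Akka Streams graph, a batch like this could come from an operator such as groupedWithin, with the join applied in a following map stage; that wiring is not shown here.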

Related

How can I merge an arbitrary number of sources in Akka stream?

I have n sources that I'd like to merge by priority in Akka Streams. I'm basing my implementation on the GraphMergePrioritizedSpec, in which three prioritized sources are merged. I attempted to abstract away the number of Sources with the following:
import akka.NotUsed
import akka.stream.{ClosedShape, Graph, Materializer}
import akka.stream.scaladsl.{GraphDSL, MergePrioritized, RunnableGraph, Sink, Source}
import org.apache.activemq.ActiveMQConnectionFactory
class SourceMerger(
  sources: Seq[Source[java.io.Serializable, NotUsed]],
  priorities: Seq[Int],
  private val sink: Sink[java.io.Serializable, _]
) {
  require(sources.size == priorities.size, "Each source should have a priority")
  import GraphDSL.Implicits._
  private def partial(
    sources: Seq[Source[java.io.Serializable, NotUsed]],
    priorities: Seq[Int],
    sink: Sink[java.io.Serializable, _]
  ): Graph[ClosedShape, NotUsed] = GraphDSL.create() { implicit b =>
    val merge = b.add(MergePrioritized[java.io.Serializable](priorities))
    sources.zipWithIndex.foreach { case (s, i) =>
      s.shape.out ~> merge.in(i)
    }
    merge.out ~> sink
    ClosedShape
  }
  def merge(
    sources: Seq[Source[java.io.Serializable, NotUsed]],
    priorities: Seq[Int],
    sink: Sink[java.io.Serializable, _]
  ): RunnableGraph[NotUsed] = RunnableGraph.fromGraph(partial(sources, priorities, sink))
  def run()(implicit mat: Materializer): NotUsed = merge(sources, priorities, sink).run()(mat)
}
However, I get an error when running the following stub:
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, Materializer}
import akka.stream.scaladsl.{Sink, Source}
import org.scalatest.{Matchers, WordSpecLike}
import akka.testkit.TestKit
import scala.collection.immutable.Iterable
class SourceMergerSpec extends TestKit(ActorSystem("SourceMerger")) with WordSpecLike with Matchers {
  implicit val materializer: Materializer = ActorMaterializer()
  "A SourceMerger" should {
    "merge by priority" in {
      val priorities: Seq[Int] = Seq(1,2,3)
      val highPriority = Iterable("message1", "message2", "message3")
      val mediumPriority = Iterable("message4", "message5", "message6")
      val lowPriority = Iterable("message7", "message8", "message9")
      val source1 = Source[String](highPriority)
      val source2 = Source[String](mediumPriority)
      val source3 = Source[String](lowPriority)
      val sources = Seq(source1, source2, source3)
      val subscriber = Sink.seq[java.io.Serializable]
      val merger = new SourceMerger(sources, priorities, subscriber)
      merger.run()
      source1.runWith(Sink.foreach(println))
    }
  }
}
The relevant stacktrace is here:
[StatefulMapConcat.out] is already connected
java.lang.IllegalArgumentException: [StatefulMapConcat.out] is already connected
at akka.stream.scaladsl.GraphDSL$Builder.addEdge(Graph.scala:1304)
at akka.stream.scaladsl.GraphDSL$Implicits$CombinerBase$class.$tilde$greater(Graph.scala:1431)
at akka.stream.scaladsl.GraphDSL$Implicits$PortOpsImpl.$tilde$greater(Graph.scala:1521)
at SourceMerger$$anonfun$partial$1$$anonfun$apply$1.apply(SourceMerger.scala:26)
at SourceMerger$$anonfun$partial$1$$anonfun$apply$1.apply(SourceMerger.scala:25)
It seems that the error comes from this:
sources.zipWithIndex.foreach { case (s, i) =>
  s.shape.out ~> merge.in(i)
}
Is it possible to merge an arbitrary number of Sources in Akka streams Graph DSL? If so, why isn't my attempt successful?
Primary Problem with Code Example
One big issue with the code snippet provided in the question is that source1 is connected to the Sink from the merge call and the Sink.foreach(println). The same Source cannot be connected to multiple Sinks without an intermediate fan-out element.
Removing the Sink.foreach(println) may solve your problem outright.
Simplified Design
The merging can be simplified based on the fact that all messages from a particular Source have the same priority. This means that you can sort the sources by their respective priority and then concatenate them all together:
private def partial(sources: Seq[Source[java.io.Serializable, NotUsed]],
                    priorities: Seq[Int],
                    sink: Sink[java.io.Serializable, _]): RunnableGraph[NotUsed] =
  sources.zip(priorities)
    .sortWith(_._2 < _._2)
    .map(_._1)
    .reduceOption(_ ++ _)
    .getOrElse(Source.empty[java.io.Serializable])
    .to(sink)
Your code runs without the error if I replace
sources.zipWithIndex.foreach { case (s, i) =>
  s.shape.out ~> merge.in(i)
}
with
sources.zipWithIndex.foreach { case (s, i) =>
  s ~> merge.in(i)
}
I admit I'm not quite sure why! At any rate, s.shape is a StatefulMapConcat and that's the point where it's complaining about the out port already being connected. The problem occurs even if you only pass a single source, so the arbitrary number isn't the problem.

Akka Streams - jump over FlowShape

I have the following Graph.
At the inflateFlow stage, I check whether the request has already been processed and stored in the DB. If it has, I want to return a MsgSuccess instead of a RequestProcess, but the next FlowShape won't accept that because it needs a RequestProcess. Is there a way to jump from flowInflate to flowWrap without adding Either everywhere?
GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
  import GraphDSL.Implicits._
  val flowInflate = builder.add(wrapFunctionInFlowShape[MsgRequest, RequestProcess](inflateFlow))
  val flowProcess = builder.add(wrapFunctionInFlowShape[RequestProcess, SuccessProcess](convertFlow))
  val flowWrite = builder.add(wrapFunctionInFlowShape[SuccessProcess, SuccessProcess](writeFlow))
  val flowWrap = builder.add(wrapFunctionInFlowShape[SuccessProcess, MsgSuccess](wrapFlow))
  flowInflate ~> flowProcess ~> flowWrite ~> flowWrap
  FlowShape(flowInflate.in, flowWrap.out)
}
def wrapFunctionInFlowShape[Input, Output](f: Input => Output): Flow[Input, Output, NotUsed] = {
  Flow.fromFunction { input =>
    f(input)
  }
}
// check for cache
def inflateFlow(msgRequest: MsgRequest): Either[RequestProcess, MsgSuccess] = {
  val hash: String = hashMethod(msgRequest)
  if (existisInDataBase(hash))
    Right(MsgSuccess(hash))
  else
    Left(inflate(msgRequest))
}
def convertFlow(requestProcess: RequestProcess): SuccessProcess = ??? // process the request
def writeFlow(successProcess: SuccessProcess): SuccessProcess = ??? // write to DB
def wrapFlow(successProcess: SuccessProcess): MsgSuccess = ??? // wrap and return the message
You can define alternative paths in a stream with a partition. In your case, the PartitionWith stage in the Akka Stream Contrib project could be helpful. Unlike the Partition stage in the standard Akka Streams API, PartitionWith allows the output types to be different: in your case, the output types are RequestProcess and MsgSuccess.
First, to use PartitionWith, add the following dependency to your build.sbt:
libraryDependencies += "com.typesafe.akka" %% "akka-stream-contrib" % "0.8"
Second, replace inflateFlow with the partition:
def split = PartitionWith[MsgRequest, RequestProcess, MsgSuccess] { msgRequest =>
  val hash = hashMethod(msgRequest)
  if (!existisInDataBase(hash))
    Left(inflate(msgRequest))
  else
    Right(MsgSuccess(hash))
}
Then incorporate that stage into your graph:
val flow = Flow.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
  import GraphDSL.Implicits._
  val pw = builder.add(split)
  val flowProcess = builder.add(wrapFunctionInFlowShape[RequestProcess, SuccessProcess](convertFlow))
  val flowWrite = builder.add(wrapFunctionInFlowShape[SuccessProcess, SuccessProcess](writeFlow))
  val flowWrap = builder.add(wrapFunctionInFlowShape[SuccessProcess, MsgSuccess](wrapFlow))
  val mrg = builder.add(Merge[MsgSuccess](2))
  pw.out0 ~> flowProcess ~> flowWrite ~> flowWrap ~> mrg.in(0)
  pw.out1 ~> mrg.in(1)
  FlowShape(pw.in, mrg.out)
})
If an incoming MsgRequest is not found in the database and is converted to a RequestProcess, then that message goes through your original flow path. If an incoming MsgRequest is in the database and resolves to a MsgSuccess, then it bypasses the intermediate steps in the flow. In both cases, the resulting MsgSuccess messages are merged from the two alternative paths into one flow outlet.
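To make the routing rule concrete without any stream machinery, here is a minimal plain-Scala sketch of the same decision: Left takes the full path, Right bypasses it, and both end in a MsgSuccess. The case classes, the alreadyProcessed set, and the collapsed fullPipeline step are hypothetical stand-ins for the question's types and functions, not the real ones.

```scala
object BypassSketch {
  // Hypothetical stand-ins for the question's message types.
  final case class MsgRequest(payload: String)
  final case class RequestProcess(payload: String)
  final case class MsgSuccess(hash: String)

  // Pretend cache of requests that were already processed (assumption for the demo).
  val alreadyProcessed: Set[String] = Set("b")

  // Left = needs the full pipeline, Right = cached result that skips it.
  def split(req: MsgRequest): Either[RequestProcess, MsgSuccess] =
    if (alreadyProcessed(req.payload)) Right(MsgSuccess(req.payload))
    else Left(RequestProcess(req.payload))

  // The convert -> write -> wrap path, collapsed into one step for brevity.
  def fullPipeline(rp: RequestProcess): MsgSuccess = MsgSuccess(rp.payload.toUpperCase)

  // Both branches produce a MsgSuccess, which is what lets the two paths be merged.
  def handle(req: MsgRequest): MsgSuccess = split(req) match {
    case Left(rp)      => fullPipeline(rp)
    case Right(cached) => cached
  }

  def main(args: Array[String]): Unit =
    println(List("a", "b").map(p => handle(MsgRequest(p)))) // List(MsgSuccess(A), MsgSuccess(b))
}
```

The uppercase transformation exists only to make it visible which path a request took.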

Akka Streams: State in a flow

I want to read multiple big files using Akka Streams and process each line. Imagine that each line consists of an (identifier -> value) pair. If a new identifier is found, I want to save it and its value in the database; otherwise, if the identifier has already been seen while processing the stream of lines, I want to save only the value. For that, I think I need some kind of recursive stateful flow that keeps the identifiers already found in a Map. I think this flow would receive a pair of (newLine, contextWithIdentifiers).
I've just started to look into Akka Streams. I think I can manage the stateless processing on my own, but I have no clue how to keep the contextWithIdentifiers. I'd appreciate any pointers in the right direction.
Maybe something like statefulMapConcat can help you:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Sink, Source}
import scala.util.Random._
import scala.math.abs
import scala.concurrent.ExecutionContext.Implicits.global
implicit val system = ActorSystem()
implicit val materializer = ActorMaterializer()
// encapsulating your input
case class IdentValue(id: Int, value: String)
// some randomly generated input
val identValues = List.fill(20)(IdentValue(abs(nextInt()) % 5, "valueHere"))
val stateFlow = Flow[IdentValue].statefulMapConcat { () =>
  // state with already processed ids
  var ids = Set.empty[Int]
  identValue => if (ids.contains(identValue.id)) {
    // save value to DB
    println(identValue.value)
    List(identValue)
  } else {
    // save both to database
    println(identValue)
    ids = ids + identValue.id
    List(identValue)
  }
}
Source(identValues)
  .via(stateFlow)
  .runWith(Sink.seq)
  .onSuccess { case identValue => println(identValue) }
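The key point in the snippet above is that the outer function is a factory: it runs once per stream materialization and returns the per-element function together with its private state. The same shape can be shown in plain Scala without Akka. In this sketch the names SeenTracker and classifier and the two result labels are made up for illustration.

```scala
object SeenTracker {
  // Factory in the style of the statefulMapConcat argument: every call returns
  // a function that owns its own set of identifiers seen so far.
  def classifier(): Int => String = {
    var seen = Set.empty[Int]
    id =>
      if (seen.contains(id)) "value-only" // identifier already known: store only the value
      else {
        seen += id
        "id-and-value" // first occurrence: store identifier and value
      }
  }

  def main(args: Array[String]): Unit = {
    val classify = classifier()
    println(List(1, 2, 1, 3, 2).map(classify))
    // List(id-and-value, id-and-value, value-only, id-and-value, value-only)
  }
}
```

Calling classifier() a second time yields a function with an empty set again, which mirrors how a stream that is run twice does not share the identifiers between runs.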
A few years later, here is an implementation I wrote if you only need a 1-to-1 mapping (not 1-to-N):
import akka.stream.stage.{GraphStage, GraphStageLogic}
import akka.stream.{Attributes, FlowShape, Inlet, Outlet}
object StatefulMap {
  def apply[T, O](converter: => T => O) = new StatefulMap[T, O](converter)
}
class StatefulMap[T, O](converter: => T => O) extends GraphStage[FlowShape[T, O]] {
  val in = Inlet[T]("StatefulMap.in")
  val out = Outlet[O]("StatefulMap.out")
  val shape = FlowShape.of(in, out)
  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic = new GraphStageLogic(shape) {
    val f = converter
    setHandler(in, () => push(out, f(grab(in))))
    setHandler(out, () => pull(in))
  }
}
Test (and demo):
behavior of "StatefulMap"
class Counter extends (Any => Int) {
  var count = 0
  override def apply(x: Any): Int = {
    count += 1
    count
  }
}
it should "not share state among substreams" in {
  val result = await {
    Source(0 until 10)
      .groupBy(2, _ % 2)
      .via(StatefulMap(new Counter()))
      .fold(Seq.empty[Int])(_ :+ _)
      .mergeSubstreams
      .runWith(Sink.seq)
  }
  result.foreach(_ should be(1 to 5))
}
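The reason the substreams do not share a counter is the by-name parameter (converter: => T => O): the expression new Counter() is evaluated again each time converter is read, which in StatefulMap happens once per createLogic call. A small plain-Scala sketch of that mechanism follows; PerUse and Shared are hypothetical helper classes written only for this comparison.

```scala
object ByNameSketch {
  class Counter extends (Any => Int) {
    private var count = 0
    override def apply(x: Any): Int = { count += 1; count }
  }

  // By-name parameter: the argument expression is re-evaluated on every instantiate() call.
  class PerUse[T, O](converter: => T => O) {
    def instantiate(): T => O = converter
  }

  // By-value parameter, for contrast: the expression is evaluated once, at construction.
  class Shared[T, O](converter: T => O) {
    def instantiate(): T => O = converter
  }

  def main(args: Array[String]): Unit = {
    val perUse = new PerUse[Any, Int](new Counter)
    val (a, b) = (perUse.instantiate(), perUse.instantiate())
    println((a("x"), a("y"), b("z"))) // (1,2,1): b has its own counter

    val shared = new Shared[Any, Int](new Counter)
    val (c, d) = (shared.instantiate(), shared.instantiate())
    println((c("x"), d("y"))) // (1,2): c and d are the same counter
  }
}
```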

Monitoring a closed graph Akka Stream

If I have created a RunnableGraph in Akka Streams and run it, how can I know (from the outside)
when all nodes are cancelled due to completion?
when all nodes have been stopped due to an error?
I don't think there is a way to do it for an arbitrary graph, but if you have your graph under control, you just need to attach monitoring sinks to the output of each node which can fail or complete (these are nodes which have at least one output), for example:
import akka.NotUsed
import akka.actor.{Actor, ActorRef, Props, Status}
import akka.stream.ClosedShape
import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, RunnableGraph, Sink, Source}
// obtain graph parts (this can be done inside the graph building as well)
val source: Source[Int, NotUsed] = ...
val flow: Flow[Int, String, NotUsed] = ...
val sink: Sink[String, NotUsed] = ...
// create monitoring actors
val aggregate = actorSystem.actorOf(Props[Aggregate])
val sourceMonitorActor = actorSystem.actorOf(Props(new Monitor("source", aggregate)))
val flowMonitorActor = actorSystem.actorOf(Props(new Monitor("flow", aggregate)))
// create the graph
val graph = GraphDSL.create() { implicit b =>
  import GraphDSL.Implicits._ // brings the ~> operator into scope
  val sourceMonitor = b.add(Sink.actorRef(sourceMonitorActor, Status.Success(())))
  val flowMonitor = b.add(Sink.actorRef(flowMonitorActor, Status.Success(())))
  val bc1 = b.add(Broadcast[Int](2))
  val bc2 = b.add(Broadcast[String](2))
  // main flow
  source ~> bc1 ~> flow ~> bc2 ~> sink
  // monitoring branches
  bc1 ~> sourceMonitor
  bc2 ~> flowMonitor
  ClosedShape
}
// run the graph
RunnableGraph.fromGraph(graph).run()
class Monitor(name: String, aggregate: ActorRef) extends Actor {
  override def receive: Receive = {
    case Status.Success(_) => aggregate ! s"$name completed successfully"
    case Status.Failure(e) => aggregate ! s"$name completed with failure: ${e.getMessage}"
    case _ => // regular stream elements are ignored
  }
}
class Aggregate extends Actor {
  override def receive: Receive = {
    case s: String => println(s)
  }
}
It is also possible to create only one monitoring actor and use it in all monitoring sinks, but in that case you won't be able to differentiate easily between streams which have failed.
There is also a watchTermination() method on sources and flows, which materializes a future that completes when the stream terminates at that point. I think it may be difficult to use with GraphDSL, but with regular stream methods it could look like this:
import akka.Done
import akka.actor.{Actor, Props, Status}
import akka.pattern.pipe
import actorSystem.dispatcher // execution context required by pipeTo
val monitor = actorSystem.actorOf(Props[Monitor])
source
  // the combiner receives (materialized value, termination future), in that order
  .watchTermination()((_, done) => done pipeTo monitor)
  .via(flow).watchTermination()((_, done) => done pipeTo monitor)
  .to(sink)
  .run()
class Monitor extends Actor {
  override def receive: Receive = {
    case Done => println("stream completed")
    case Status.Failure(e) => println(s"stream failed: ${e.getMessage}")
  }
}
You can transform the future before piping its value to the actor to differentiate between streams.
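As a rough illustration of that last remark, the termination future can be mapped to a message that names the stage it belongs to before it is piped. The sketch below uses only scala.concurrent; Future[Unit] stands in for Akka's Future[Done], and the names TaggedTermination and tagged are hypothetical.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object TaggedTermination {
  // Turns a termination future into a message that names the stage it came from.
  // Future[Unit] is a stand-in for Future[Done] (assumption, to avoid the Akka dependency).
  def tagged(name: String, done: Future[Unit]): Future[String] =
    done
      .map(_ => s"$name completed")
      .recover { case e => s"$name failed: ${e.getMessage}" }

  def main(args: Array[String]): Unit = {
    val ok  = tagged("source", Future.successful(()))
    val bad = tagged("flow", Future.failed(new RuntimeException("boom")))
    println(Await.result(ok, 1.second))  // source completed
    println(Await.result(bad, 1.second)) // flow failed: boom
  }
}
```

Because recover turns the failure into an ordinary value, an actor receiving the piped result would get a String in both cases rather than a Status.Failure, so its receive block would need to match on String.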

Idiomatic way to turn an Akka Source into a Spark InputDStream

I'm essentially trying to do the opposite of what is being asked in this question; that is to say, use a Source[A] to push elements into an InputDStream[A].
So far, I've managed to cobble together an implementation that uses a Feeder actor and a Receiver actor similar to the ActorWordCount example, but this seems a bit complex, so I'm curious whether there is a simpler way.
EDIT: Self-accepting after 5 days since there have been no good answers.
I've extracted the Actor-based implementation into a lib, Sparkka-streams, and it's been working for me thus far. When a better solution to this question shows up, I'll either update or deprecate the lib.
Its usage is as follows:
// InputDStream can then be used to build elements of the graph that require integration with Spark
val (inputDStream, feedDInput) = Streaming.connection[Int]()
val source = Source.fromGraph(GraphDSL.create() { implicit builder =>
  import GraphDSL.Implicits._
  val source = Source(1 to 10)
  val bCast = builder.add(Broadcast[Int](2))
  val merge = builder.add(Merge[Int](2))
  val add1 = Flow[Int].map(_ + 1)
  val times3 = Flow[Int].map(_ * 3)
  source ~> bCast ~> add1 ~> merge
  bCast ~> times3 ~> feedDInput ~> merge
  SourceShape(merge.out)
})
val reducedFlow = source.runWith(Sink.fold(0)(_ + _))
whenReady(reducedFlow)(_ shouldBe 230)
val sharedVar = ssc.sparkContext.accumulator(0)
inputDStream.foreachRDD { rdd =>
  rdd.foreach { i =>
    sharedVar += i
  }
}
ssc.start()
eventually(sharedVar.value shouldBe 165)
Ref: http://spark.apache.org/docs/latest/streaming-custom-receivers.html
You can do it like this:
class StreamStopped extends RuntimeException("Stream stopped")
// Serializable factory class
case class SourceFactory(start: Int, end: Int) {
  def source = Source(start to end).map(_.toString)
}
class CustomReceiver(sourceFactory: SourceFactory)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) with Logging {
  implicit val materializer = ....
  def onStart(): Unit = {
    sourceFactory.source.runForeach { e =>
      if (isStopped) {
        // Stop the source
        throw new StreamStopped
      } else {
        store(e)
      }
    } onFailure {
      case _: StreamStopped => // ignore
      case ex: Throwable => reportError("Source exception", ex)
    }
  }
  def onStop(): Unit = {}
}
val customReceiverStream = ssc.receiverStream(new CustomReceiver(SourceFactory(1, 100)))