Idiomatic way to turn an Akka Source into a Spark InputDStream - scala

I'm essentially trying to do the opposite of what is being asked in this question; that is to say, use a Source[A] to push elements into an InputDStream[A].
So far, I've managed to cobble together an implementation that uses a Feeder actor and a Receiver actor, similar to the ActorWordCount example, but this seems a bit complex, so I'm curious whether there is a simpler way.

EDIT: Self-accepting after 5 days since there have been no good answers.
I've extracted the Actor-based implementation into a lib, Sparkka-streams, and it's been working for me thus far. If a better solution to this question shows up, I'll either update or deprecate the lib.
Its usage is as follows:
// InputDStream can then be used to build elements of the graph that require integration with Spark
val (inputDStream, feedDInput) = Streaming.connection[Int]()
val source = Source.fromGraph(GraphDSL.create() { implicit builder =>
import GraphDSL.Implicits._
val source = Source(1 to 10)
val bCast = builder.add(Broadcast[Int](2))
val merge = builder.add(Merge[Int](2))
val add1 = Flow[Int].map(_ + 1)
val times3 = Flow[Int].map(_ * 3)
source ~> bCast ~> add1 ~> merge
bCast ~> times3 ~> feedDInput ~> merge
SourceShape(merge.out)
})
val reducedFlow = source.runWith(Sink.fold(0)(_ + _))
whenReady(reducedFlow)(_ shouldBe 230)
val sharedVar = ssc.sparkContext.accumulator(0)
inputDStream.foreachRDD { rdd =>
rdd.foreach { i =>
sharedVar += i
}
}
ssc.start()
eventually(sharedVar.value shouldBe 165)

Ref: http://spark.apache.org/docs/latest/streaming-custom-receivers.html
You can do it like this:
class StreamStopped extends RuntimeException("Stream stopped")
// Serializable factory class
case class SourceFactory(start: Int, end: Int) {
def source = Source(start to end).map(_.toString)
}
class CustomReceiver(sourceFactory: SourceFactory)
extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) with Logging {
implicit val materializer = ....
def onStart() {
sourceFactory.source.runForeach { e =>
if (isStopped) {
// Stop the source
throw new StreamStopped
} else {
store(e)
}
} onFailure {
case _: StreamStopped => // ignore
case ex: Throwable => reportError("Source exception", ex)
}
}
def onStop() {}
}
val customReceiverStream = ssc.receiverStream(new CustomReceiver(SourceFactory(1, 100)))
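The receiver above leaves onStop() empty and stops the stream by throwing from inside the element handler. One possible alternative, sketched here with plain Akka Streams only (no Spark on the classpath), is to keep a KillSwitch as part of the materialized value and shut it down when the receiver is stopped. The sketch assumes Akka 2.6 or later, where an implicit ActorSystem is enough to run a stream; the tick source and the counter are hypothetical stand-ins for the real source and for Receiver.store.

```scala
import akka.actor.ActorSystem
import akka.stream.KillSwitches
import akka.stream.scaladsl.{Keep, Sink, Source}
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.Await
import scala.concurrent.duration._

object KillSwitchSketch {
  // Runs an endless tick source, stops it through a kill switch, and
  // returns how many elements were "stored" before the shutdown.
  def run(): Int = {
    implicit val system: ActorSystem = ActorSystem("kill-switch-sketch")
    try {
      val stored = new AtomicInteger(0)
      val (killSwitch, done) =
        Source.tick(0.millis, 10.millis, "tick")
          .viaMat(KillSwitches.single)(Keep.right)
          // Stand-in for store(e) in the receiver.
          .toMat(Sink.foreach[String](_ => stored.incrementAndGet()))(Keep.both)
          .run()

      Thread.sleep(200)
      killSwitch.shutdown() // what onStop() could call instead of doing nothing
      Await.result(done, 3.seconds) // completes normally, no exception needed
      stored.get()
    } finally system.terminate()
  }
}
```

With this shape the stream completes normally on shutdown, so the failure handler would only have to deal with genuine source errors.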

Related

Akka - convert Flow into Collection or Publisher

I'm trying to split an Akka Source into two separate ones.
val requestFlow = Flow[BodyPartEntity].to(Sink.seq) // convert to Seq[BodyPartEntity]
val dataFlow = Flow[BodyPartEntity].to(Sink.asPublisher(fanout = false)) // convert to Publisher[BodyPartEntity]
implicit class EitherSourceExtension[L, R, Mat](source: Source[FormData.BodyPart, Mat]) {
def partition(left: Sink[BodyPartEntity, NotUsed], right: Sink[BodyPartEntity, NotUsed]): Graph[ClosedShape, NotUsed] = {
GraphDSL.create() { implicit builder =>
import akka.stream.scaladsl.GraphDSL.Implicits._
val partition = builder.add(Partition[FormData.BodyPart](2, element => if (element.getName == "request") 0 else 1))
source ~> partition.in
partition.out(0).map(_.getEntity) ~> left
partition.out(1).map(_.getEntity) ~> right
ClosedShape
}
}
}
How can I convert requestFlow into a Seq[BodyPartEntity] and dataFlow into a Publisher[BodyPartEntity]?
You could use a BroadcastHub for this. From the docs:
A BroadcastHub can be used to consume elements from a common producer by a dynamic set of consumers.
Simplified code:
val runnableGraph: RunnableGraph[Source[Int, NotUsed]] =
Source(1 to 5).toMat(
BroadcastHub.sink(bufferSize = 4))(Keep.right)
val fromProducer: Source[Int, NotUsed] = runnableGraph.run()
// Process the messages from the producer in two independent consumers
fromProducer.runForeach(msg => println("consumer1: " + msg))
fromProducer.runForeach(msg => println("consumer2: " + msg))
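If the goal is specifically one Seq and one Publisher, as in the question, another option is to pass both sinks into GraphDSL.create so that their materialized values are returned by run(). The following is a minimal sketch of that idea, not the approach from the answer above: it uses Int elements instead of BodyPartEntity, an even/odd predicate as a hypothetical partitioner, and assumes Akka 2.6 or later (newer 2.6 releases may flag this create overload as deprecated).

```scala
import akka.actor.ActorSystem
import akka.stream.ClosedShape
import akka.stream.scaladsl.{GraphDSL, Partition, RunnableGraph, Sink, Source}
import scala.concurrent.Await
import scala.concurrent.duration._

object PartitionSketch {
  // Returns (elements collected by the Seq sink, elements read back from the Publisher).
  def run(): (Seq[Int], Seq[Int]) = {
    implicit val system: ActorSystem = ActorSystem("partition-sketch")
    try {
      val seqSink = Sink.seq[Int]
      val publisherSink = Sink.asPublisher[Int](fanout = false)

      // Importing both sinks keeps their materialized values as a pair.
      val graph = RunnableGraph.fromGraph(
        GraphDSL.create(seqSink, publisherSink)((_, _)) { implicit builder => (toSeq, toPublisher) =>
          import GraphDSL.Implicits._
          val partition = builder.add(Partition[Int](2, n => if (n % 2 == 0) 0 else 1))
          Source(1 to 6) ~> partition.in
          partition.out(0) ~> toSeq.in
          partition.out(1) ~> toPublisher.in
          ClosedShape
        })

      val (evensFuture, oddsPublisher) = graph.run()
      // The publisher side only makes progress once something subscribes to it.
      val oddsFuture = Source.fromPublisher(oddsPublisher).runWith(Sink.seq)
      (Await.result(evensFuture, 3.seconds), Await.result(oddsFuture, 3.seconds))
    } finally system.terminate()
  }
}
```

Note that the Seq side is a Future of a Seq, because the elements are only known once the stream has completed.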

Does Akka Stream Implement the Join Semantic as Kafka Streams Does?

I am quite new to Akka Streams, whereas I have some experience with Kafka Streams.
One thing that seems to be lacking in Akka Streams is the ability to join two different streams.
Kafka Streams allows joining information coming from two different streams (or tables) using the messages' keys.
Is there something similar in Akka Streams?
The short answer is, unfortunately, no. I would argue that Akka Streams is lower level than Kafka Streams, Spark Streaming, or Flink. In exchange, you have more control over what you are doing: you can build your own join operator. See this discussion at Lightbend.
To do that, you take data from two Sources, Merge them, send the merged elements to a window based on time or on a number of tuples, compute the join, and emit the result to the Sink. I have written the PoC below, which is still unfinished but follows the operators just described, and it compiles and runs. I still have to join the data inside the window; currently I just emit the elements as a mini-batch.
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.{Attributes, ClosedShape, FlowShape, Inlet, Outlet}
import akka.stream.scaladsl.{Flow, GraphDSL, Merge, RunnableGraph, Sink, Source}
import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler, OutHandler, TimerGraphStageLogic}
import scala.collection.mutable
import scala.concurrent.duration._
object StreamOpenGraphJoin {
def main(args: Array[String]): Unit = {
implicit val system = ActorSystem("StreamOpenGraphJoin")
val incrementSource: Source[Int, NotUsed] = Source(1 to 10).throttle(1, 1 second)
val decrementSource: Source[Int, NotUsed] = Source(10 to 20).throttle(1, 1 second)
def tokenizerSource(key: Int) = {
Flow[Int].map { value =>
(key, value)
}
}
// Step 1 - setting up the fundamental for a stream graph
val switchJoinStrategies = RunnableGraph.fromGraph(
GraphDSL.create() { implicit builder =>
import GraphDSL.Implicits._
// Step 2 - add partition and merge strategy
val tokenizerShape00 = builder.add(tokenizerSource(0))
val tokenizerShape01 = builder.add(tokenizerSource(1))
val mergeTupleShape = builder.add(Merge[(Int, Int)](2))
val batchFlow = Flow.fromGraph(new BatchTimerFlow[(Int, Int)](5 seconds))
val sinkShape = builder.add(Sink.foreach[(Int, Int)](x => println(s" > sink: $x")))
// Step 3 - tying up the components
incrementSource ~> tokenizerShape00 ~> mergeTupleShape.in(0)
decrementSource ~> tokenizerShape01 ~> mergeTupleShape.in(1)
mergeTupleShape.out ~> batchFlow ~> sinkShape
// Step 4 - return the shape
ClosedShape
}
)
// run the graph and materialize it
val graph = switchJoinStrategies.run()
}
// step 0: define the shape
class BatchTimerFlow[T](silencePeriod: FiniteDuration) extends GraphStage[FlowShape[T, T]] {
// step 1: define the ports and the component-specific members
val in = Inlet[T]("BatchTimerFlow.in")
val out = Outlet[T]("BatchTimerFlow.out")
// step 3: create the logic
override def createLogic(inheritedAttributes: Attributes): GraphStageLogic = new TimerGraphStageLogic(shape) {
// mutable state
val batch = new mutable.Queue[T]
var open = false
// step 4: implement the logic in the handlers, using the mutable state above
setHandler(in, new InHandler {
override def onPush(): Unit = {
try {
val nextElement = grab(in)
batch.enqueue(nextElement)
Thread.sleep(50) // simulate an expensive computation
if (open) pull(in) // send demand upstream signal, asking for another element
else {
// forward the element to the downstream operator
emitMultiple(out, batch.dequeueAll(_ => true).to[collection.immutable.Iterable])
open = true
scheduleOnce(None, silencePeriod)
}
} catch {
case e: Throwable => failStage(e)
}
}
})
setHandler(out, new OutHandler {
override def onPull(): Unit = {
pull(in)
}
})
override protected def onTimer(timerKey: Any): Unit = {
open = false
}
}
// step 2: construct a new shape
override def shape: FlowShape[T, T] = FlowShape[T, T](in, out)
}
}
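The PoC above stops at emitting each window as a mini-batch. As a rough sketch of the missing step, the built-in groupedWithin operator can stand in for the custom timer stage, and the join can then be computed inside each window by matching keys. This is a swapped-in technique rather than a completion of the author's BatchTimerFlow; the sample data, the Either tagging, and the window size are assumptions made for illustration, and Akka 2.6 or later is assumed.

```scala
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Await
import scala.concurrent.duration._

object WindowJoinSketch {
  type Keyed = (Int, String)

  // Joins two keyed streams per window and returns (key, leftValue, rightValue) triples.
  def run(): Seq[(Int, String, String)] = {
    implicit val system: ActorSystem = ActorSystem("window-join-sketch")
    try {
      // Tag every element with the side it came from before merging.
      val left: Source[Either[Keyed, Keyed], NotUsed] =
        Source(List(1 -> "a", 2 -> "b")).map(Left(_))
      val right: Source[Either[Keyed, Keyed], NotUsed] =
        Source(List(1 -> "x", 3 -> "z")).map(Right(_))

      val joined = left.merge(right)
        .groupedWithin(100, 1.second) // window: up to 100 elements or 1 second
        .mapConcat { window =>
          val lefts = window.collect { case Left(kv) => kv }
          val rights = window.collect { case Right(kv) => kv }
          // Inner join on the key, restricted to the current window.
          for {
            (key, l) <- lefts
            (otherKey, r) <- rights
            if key == otherKey
          } yield (key, l, r)
        }
        .runWith(Sink.seq)

      Await.result(joined, 5.seconds)
    } finally system.terminate()
  }
}
```

Elements whose partner arrives in a different window are not joined here, which is the usual trade-off of a tumbling-window join.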

Akka Streams scala DSL and Op-Rabbit

I have started using Akka Streams and Op-Rabbit and am a bit confused.
I need to split the stream based on a predicate and then combine them much like I have done when creating graphs and using the Partition and Merge.
I have been able to do things like this using the GraphDSL.Builder, but can't seem to get it to work with AckedSource/Flow/Sink.
The graph would look like:
                         | --> flow1 --> |
source --> partition --> |               | --> flow3 --> sink
                         | --> flow2 --> |
I'm not sure if splitWhen is what I should use because I always need exactly 2 flows.
This is a sample that does not do the partitioning and does not use the GraphDSL.Builder:
def splitExample(source: AckedSource[String, SubscriptionRef],
queueName: String)
(implicit actorSystem: ActorSystem): RunnableGraph[SubscriptionRef] = {
val toStringFlow: Flow[AckTup[Message], AckTup[String], NotUsed] = Flow[AckTup[Message]]
.map[AckTup[String]](tup => {
val (p,m) = tup
(p, new String(m.data))
})
val printFlow1: Flow[AckTup[String], AckTup[String], NotUsed] = Flow[AckTup[String]]
.map[AckTup[String]](tup => {
val (p, s) = tup
println(s"flow1 processing $s")
tup
})
val printFlow2: Flow[AckTup[String], AckTup[String], NotUsed] = Flow[AckTup[String]]
.map[AckTup[String]](tup => {
val (p, s) = tup
println(s"flow2 processing $s")
tup
})
source
.map(Message.queue(_, queueName))
.via(AckedFlow(toStringFlow))
// partition if string.length < 10
.via(AckedFlow(printFlow1))
.via(AckedFlow(printFlow2))
.to(AckedSink.ack)
}
This is the code that I can't seem to get working:
import GraphDSL.Implicits._
def buildModelAcked(source: AckedSource[String, SubscriptionRef] , queueName: String)(implicit actorSystem: ActorSystem): Graph[ClosedShape, Future[Done]] = {
import GraphDSL.Implicits._
GraphDSL.create(Sink.ignore) { implicit builder: GraphDSL.Builder[Future[Done]] => s =>
import GraphDSL.Implicits._
source.map(Message.queue(_, queueName)) ~> AckedFlow(toStringFlow) ~> AckedSink.ack
// source.map(Message.queue(_, queueName)).via(AckedFlow(toStringFlow)).to(AckedSink.ack)
ClosedShape
}}
The compiler can't resolve the ~> operator
So my questions are:
Is there an example project that uses the scala dsl to build graphs of Acked/Source/Flow/Sink?
Is there an example project that partitions and merges that is similar to what I am trying to do here?
Keep in mind the following definitions when dealing with acked-stream:
AckedSource[Out, Mat] is a wrapper for Source[AckTup[Out], Mat]
AckedFlow[In, Out, Mat] is a wrapper for Flow[AckTup[In], AckTup[Out], Mat]
AckedSink[In, Mat] is a wrapper for Sink[AckTup[In], Mat]
AckTup[T] is an alias for (Promise[Unit], T)
The classic flow combinators operate on the T part of the AckTup.
The .acked combinator completes the Promise[Unit] of an AckedFlow.
The GraphDSL edge operator (~>) works with a number of predefined Akka shapes (see the code for GraphDSL.Implicits), but it won't work with the custom shapes defined by the acked-stream lib.
You have two ways out:
Define your own ~> implicit operator, along the lines of the ones in GraphDSL.Implicits.
Unwrap the acked stages to obtain standard stages. You can access the wrapped stage using .wrappedRepr, which is available on AckedSource, AckedFlow and AckedSink.
Based on Stefano Bonetti's excellent direction, here is a possible solution:
graph:
                      |--> short --|
rabbitMq --> before --|            |--> after
                      |--> long  --|
solution:
val before: Flow[AckTup[Message], AckTup[String], NotUsed] = Flow[AckTup[Message]].map[AckTup[String]](tup => {
val (p,m) = tup
(p, new String(m.data))
})
val short: Flow[AckTup[String], AckTup[String], NotUsed] = Flow[AckTup[String]].map[AckTup[String]](tup => {
val (p, s) = tup
println(s"short: $s")
tup
})
val long: Flow[AckTup[String], AckTup[String], NotUsed] = Flow[AckTup[String]].map[AckTup[String]](tup => {
val (p, s) = tup
println(s"long: $s")
tup
})
val after: Flow[AckTup[String], AckTup[String], NotUsed] = Flow[AckTup[String]].map[AckTup[String]](tup => {
val (p, s) = tup
println(s"all $s")
tup
})
def buildSplitGraph(source: AckedSource[String, SubscriptionRef]
, queueName: String
, splitLength: Int)(implicit actorSystem: ActorSystem): Graph[ClosedShape, Future[Done]] = {
GraphDSL.create(Sink.ignore) { implicit builder: GraphDSL.Builder[Future[Done]] => s =>
val toShort = 0
val toLong = 1
// junctions
val split = builder.add(Partition[AckTup[String]](2, (tup: AckTup[String]) => {
val (p, s) = tup
if (s.length < splitLength) toShort else toLong
}
))
val merge = builder.add(Merge[AckTup[String]](2))
//graph
val beforeSplit = source.map(Message.queue(_, queueName)).wrappedRepr ~> AckedFlow(before).wrappedRepr
beforeSplit ~> split
// must do short, then long since the split goes in that order
split ~> AckedFlow(short).wrappedRepr ~> merge
split ~> AckedFlow(long).wrappedRepr ~> merge
// after the last AckedFlow, be sure to '.acked' so that the message will be removed from the queue
merge ~> AckedFlow(after).acked ~> s
ClosedShape
}}
As Stefano Bonetti said, the key was to use the .wrappedRepr associated with the AckedFlow and then to use the .acked combinator as the last step.
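To see the shape of this solution without RabbitMQ or the acked-stream library, here is a minimal sketch in plain Akka Streams. The AckTup alias is redefined locally as a stand-in, the ack flow is a hand-rolled imitation of what the final .acked step is described as doing (completing the promise), and the sample messages are hypothetical. Akka 2.6 or later is assumed.

```scala
import akka.actor.ActorSystem
import akka.stream.ClosedShape
import akka.stream.scaladsl.{Flow, GraphDSL, Merge, Partition, RunnableGraph, Sink, Source}
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._

object AckSplitSketch {
  // Local stand-in for acked-stream's AckTup; not the library's own type.
  type AckTup[T] = (Promise[Unit], T)

  // Ordinary flow that leaves the promise untouched and labels the payload.
  def tag(label: String) =
    Flow[AckTup[String]].map { case (p, s) => (p, s"$label: $s") }

  // Returns true when every message was acknowledged after passing through the graph.
  def run(splitLength: Int): Boolean = {
    implicit val system: ActorSystem = ActorSystem("ack-split-sketch")
    try {
      val messages: List[AckTup[String]] =
        List("short", "a considerably longer message").map(s => (Promise[Unit](), s))

      // Imitation of the last step: complete the promise, pass the payload on.
      val ack = Flow[AckTup[String]].map { case (p, s) => p.trySuccess(()); s }

      val done = RunnableGraph.fromGraph(
        GraphDSL.create(Sink.ignore) { implicit builder => sink =>
          import GraphDSL.Implicits._
          val split = builder.add(
            Partition[AckTup[String]](2, tup => if (tup._2.length < splitLength) 0 else 1))
          val merge = builder.add(Merge[AckTup[String]](2))
          Source(messages) ~> split
          split ~> tag("short") ~> merge // first free outlet: port 0
          split ~> tag("long") ~> merge  // next free outlet: port 1
          merge ~> ack ~> sink.in
          ClosedShape
        }).run()

      Await.result(done, 3.seconds)
      messages.forall(_._1.isCompleted)
    } finally system.terminate()
  }
}
```

Because the promise travels with each element through both branches, it does not matter which branch an element takes; only the single acknowledging step after the merge completes it.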

Merge and broadcast, building a (simple) Akka graph

The Akka documentation is vast and there are a lot of tutorials, but either they are outdated or they only cover the basics (or maybe I simply can't find the right ones).
What I want to create is a websocket application with multiple clients and multiple sources on the server side. As I don't want to get in over my head from the start, I want to take baby steps and incrementally increase the complexity of the software I am building.
After toying around with some simple flows I wanted to start with a more sophisticated graph now.
What I want is:
Two sources, one that pushes "keepAlive" messages from the server to the client (currently only one) and a second one that actually pushes useful data.
Now for the first one I have this:
val tickingSource: Source[Array[Byte], Cancellable] =
Source.tick(initialDelay = 1 second, interval = 10 seconds, tick = NotUsed)
.zipWithIndex
.map{ case (_, counter) => SomeMessage().toByteArray}
Where SomeMessage is a protobuf type.
Because I can't find an up-to-date way to add an actor as a source, I tried the following for my second source:
val secondSource = Source(1 to 1000)
val secondSourceConverter = Flow[Int].map(x => BigInteger.valueOf(x).toByteArray)
My attempt at the graph:
val g: RunnableGraph[NotUsed] = RunnableGraph.fromGraph(GraphDSL.create()
{
implicit builder =>
import GraphDSL.Implicits._
val sourceMerge = builder.add(Merge[Array[Byte]](2).named("sourceMerge"))
val x = Source(1 to 1000)
val y = Flow[Int].map(x => BigInteger.valueOf(x).toByteArray)
val out = Sink.ignore
tickingSource ~> sourceMerge ~> out
x ~> y ~> sourceMerge
ClosedShape
})
Now g is of type RunnableGraph[NotUsed] while it should be RunnableGraph[Array[Byte]] for my websocket. And I wonder here: am I already doing something completely wrong?
You need to pass the secondSourceConverter into the GraphDSL.create, like the following example taken from their docs. Here they import 2 sinks, but it's the same technique.
RunnableGraph.fromGraph(GraphDSL.create(topHeadSink, bottomHeadSink)((_, _)) { implicit builder =>
(topHS, bottomHS) =>
import GraphDSL.Implicits._
val broadcast = builder.add(Broadcast[Int](2))
Source.single(1) ~> broadcast.in
broadcast.out(0) ~> sharedDoubler ~> topHS.in
broadcast.out(1) ~> sharedDoubler ~> bottomHS.in
ClosedShape
})
Your graph is of type RunnableGraph[NotUsed] because you're using Sink.ignore. And you probably want a RunnableGraph[Future[Array[Byte]]] instead of a RunnableGraph[Array[Byte]]:
val byteSink = Sink.fold[Array[Byte], Array[Byte]](Array[Byte]())(_ ++ _)
val g = RunnableGraph.fromGraph(GraphDSL.create(byteSink) { implicit builder => bSink =>
import GraphDSL.Implicits._
val sourceMerge = builder.add(Merge[Array[Byte]](2))
tickingSource ~> sourceMerge ~> bSink.in
secondSource ~> secondSourceConverter ~> sourceMerge
ClosedShape
})
// RunnableGraph[Future[Array[Byte]]]
I'm not sure how you would like to process incoming messages but here is a simple example. Hope that it'll help you.
path("ws") {
extractUpgradeToWebSocket { upgrade =>
complete {
import scala.concurrent.duration._
val tickSource = Source.tick(1.second, 1.second, TextMessage("ping"))
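// mapMaterializedValue returns a new Source, so chain it; the element type must be given explicitly
val messagesSource = Source.queue[Message](10, OverflowStrategy.backpressure)
.mapMaterializedValue { queue =>
// do something with the out queue,
// like myHandler ! RegisterOutQueue(queue)
}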
val sink = Sink.ignore
val source = tickSource.merge(messagesSource)
upgrade.handleMessagesWithSinkSource(
inSink = sink,
outSource = source
)
}
}

How do you deal with futures in Akka Flow?

I have built an Akka graph that defines a flow. My objective is to reformat my future response and save it to a file. The flow is outlined below:
val g = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
import GraphDSL.Implicits._
val balancer = builder.add(Balance[(HttpRequest, String)](6, waitForAllDownstreams = false))
val merger = builder.add(Merge[Future[Map[String, String]]](6))
val fileSink = FileIO.toPath(outputPath, options)
val ignoreSink = Sink.ignore
val in = Source(seeds)
in ~> balancer.in
for (i <- Range(0,6)) {
balancer.out(i) ~>
wikiFlow.async ~>
// This maps to a Future[Map[String, String]]
Flow[(Try[HttpResponse], String)].map(parseHtml) ~>
merger
}
merger.out ~>
// When we merge we need to map our Map to a file
Flow[Future[Map[String, String]]].map((d) => {
// What is the proper way of serializing future map
// so I can work with it like a normal stream into fileSink?
// I could manually do ->
// d.foreach(someWriteToFileProcess(_))
// with ignoreSink, but this defeats the nice
// akka flow
}) ~>
fileSink
ClosedShape
})
I can hack this workflow to write my future map to a file via foreach, but I'm afraid this could somehow lead to concurrency issues with FileIO and it just doesn't feel right. What is the proper way to handle futures with our akka flow?
The easiest way to create a Flow which involves an asynchronous computation is by using mapAsync.
So, let's say you want to create a Flow which consumes Int and produces String using an asynchronous computation mapper: Int => Future[String], with a parallelism of 5.
val mapper: Int => Future[String] = (i: Int) => Future(i.toString)
val yourFlow = Flow[Int].mapAsync[String](5)(mapper)
Now, you can use this flow in your graph however you want.
An example usage:
val graph = GraphDSL.create() { implicit builder =>
import GraphDSL.Implicits._
val intSource = Source(1 to 10)
val printSink = Sink.foreach[String](s => println(s))
val yourMapper: Int => Future[String] = (i: Int) => Future(i.toString)
val yourFlow = Flow[Int].mapAsync[String](2)(yourMapper)
intSource ~> yourFlow ~> printSink
ClosedShape
}
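Applied to the question's case, where the elements are already Future[Map[String, String]], mapAsync can be used with a function that simply returns the element, which turns a stream of futures into a stream of resolved maps that can then be serialized and written with FileIO. The following is a minimal sketch under assumptions: the sample maps, the key=value line format, and the temporary output file are hypothetical, and Akka 2.6 or later is assumed.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{FileIO, Source}
import akka.util.ByteString
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object FuturesToFileSketch {
  // Writes each resolved map as one line and returns the resulting file contents.
  def run(): String = {
    implicit val system: ActorSystem = ActorSystem("futures-to-file-sketch")
    try {
      val outputPath = Files.createTempFile("parsed", ".txt") // hypothetical output location
      val parsed: List[Future[Map[String, String]]] = List(
        Future.successful(Map("title" -> "Akka")),
        Future.successful(Map("title" -> "Spark")))

      val written = Source(parsed)
        // Future[Map] elements become Map elements; mapAsync keeps the upstream order.
        .mapAsync(parallelism = 4)(future => future)
        .map { fields =>
          ByteString(fields.map { case (k, v) => s"$k=$v" }.mkString(",") + "\n")
        }
        .runWith(FileIO.toPath(outputPath))

      Await.result(written, 5.seconds)
      new String(Files.readAllBytes(outputPath), StandardCharsets.UTF_8)
    } finally system.terminate()
  }
}
```

Since the file sink is the only writer and receives elements one at a time, this avoids the concurrency concern of writing from inside a foreach on each future.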