How to use Futures within Kafka Streams - Scala

When using the org.apache.kafka.streams.KafkaStreams library in Scala, I have been trying to read from an input stream, pass each value to a method validateAll(infoToValidate) that returns a Future, resolve that Future, and then send the result to an output stream.
Example:
builder.stream[String, Object](REQUEST_TOPIC)
  .mapValues(v => ValidateFormat.from(v.asInstanceOf[GenericRecord]))
  .mapValues(infoToValidate => {
    SuccessFailFormat.to(validateAll(infoToValidate))
  })
Is there any documentation on doing this? I have looked into filter() and transform() but am still not sure how to deal with Futures in KStreams.

The answer depends on whether you need to preserve the original order of your messages. If so, you will have to block in one way or another. For example:
import scala.concurrent.Await
import scala.concurrent.duration._

val duration = 10.seconds // whatever your timeout should be, or Duration.Inf

sourceStream
  .mapValues(x => Await.result(validate(x), duration))
  .to(outputTopic)
If, however, the order is not important, you can simply use a Kafka producer directly:
sourceStream
  .mapValues(x => validate(x)) // now you have a KStream[..., Future[...]]
  .foreach { (key, future) =>
    future.foreach { item => // runs on an implicit ExecutionContext
      val record = new ProducerRecord(outputTopic, key, item)
      producer.send(record) // provided you have the implicit serializer
    }
  }
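For reference, a minimal sketch of the standalone producer that snippet assumes (the broker address and serializer classes below are placeholders, not taken from the question):

import java.util.Properties
import org.apache.kafka.clients.producer.KafkaProducer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092") // placeholder broker address
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](props)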

Related

Akka Stream continuously consume websocket

I'm fairly new to Scala and Akka Streams, and I'm trying to get JSON string messages from a websocket and push them to a Kafka topic.
For now I am only working on the "get messages from a websocket" part.
Messages coming from the websocket look like this:
{
  "bitcoin":"6389.06534240",
  "ethereum":"192.93111286",
  "monero":"108.90302506",
  "litecoin":"52.25484165"
}
I want to split this JSON message into multiple messages:
{"coin": "bitcoin", "price": "6389.06534240"}
{"coin": "ethereum", "price": "192.93111286"}
{"coin": "monero", "price": "108.90302506"}
{"coin": "litecoin", "price": "52.25484165"}
And then push each of these messages to a Kafka topic.
Here's what I have achieved so far:
val message_decomposition: Flow[Message, String, NotUsed] = Flow[Message]
  .mapConcat(msg => msg.toString.replaceAll("[{})(]", "").split(",").toList)
  .map { msg =>
    val splitted = msg.split(":")
    s"{'coin': ${splitted(0)}, 'price': ${splitted(1)}}"
  }
val sink: Sink[String, Future[Done]] = Sink.foreach[String](println)

val flow: Flow[Message, Message, Promise[Option[Message]]] =
  Flow.fromSinkAndSourceMat(
    message_decomposition.to(sink),
    Source.maybe[Message])(Keep.right)

val (upgradeResponse, promise) = Http().singleWebSocketRequest(
  WebSocketRequest("wss://ws.coincap.io/prices?assets=ALL"),
  flow)
It's working, and I'm getting the expected output JSON messages, but I was wondering if I could write this producer in a more "Akka-ish" style, like using the GraphDSL. So I have a few questions:
Is it possible to continuously consume a WebSocket using the GraphDSL? If yes, can you show me an example, please?
Is it a good idea to consume the WS using the GraphDSL?
Should I decompose the received JSON message like I'm doing before sending it to Kafka? Or is it better to send it as-is for lower latency?
After producing the messages to Kafka, I am planning to consume them using Apache Storm. Is that a good idea, or should I stick with Akka?
Thanks for reading me. Regards,
Arès
That code is plenty Akka-ish: the scaladsl is just as much Akka as the GraphDSL or implementing a custom GraphStage. The only reason, IMO/E, to go to the GraphDSL is if the actual shape of the graph isn't readily expressible in the scaladsl.
I would personally go the route of defining a CoinPrice class to make the model explicit:
case class CoinPrice(coin: String, price: BigDecimal)
And then have a Flow[Message, CoinPrice, NotUsed] which parses one incoming message into zero or more CoinPrices. Something (using Play JSON here) like:
val toCoinPrices =
  Flow[Message]
    .mapConcat { msg =>
      Json.parse(msg.toString)
        .asOpt[JsObject]
        .toList
        .flatMap { json =>
          json.underlying.flatMap { kv =>
            import scala.util.Try
            kv match {
              case (coin, JsString(priceStr)) =>
                Try(BigDecimal(priceStr)).toOption
                  .map(p => CoinPrice(coin, p))
              case (coin, JsNumber(price)) => Some(CoinPrice(coin, price))
              case _ => None
            }
          }
        }
    }
You might, depending on the size of the JSON in each message, want to break that into separate stream stages to allow for an async boundary between the JSON parsing and the extraction into CoinPrices. For example:
Flow[Message]
  .mapConcat { msg =>
    Json.parse(msg.toString).asOpt[JsObject].toList
  }
  .async
  .mapConcat { json =>
    json.underlying.flatMap { kv =>
      import scala.util.Try
      kv match {
        case (coin, JsString(priceStr)) =>
          Try(BigDecimal(priceStr)).toOption
            .map(p => CoinPrice(coin, p))
        case (coin, JsNumber(price)) => Some(CoinPrice(coin, price))
        case _ => None
      }
    }
  }
In the above, the stages on either side of the async boundary will execute in separate actors and thus, possibly concurrently (if there's enough CPU cores available etc.), at the cost of extra overhead for the actors to coordinate and exchange messages. That extra coordination/communication overhead (cf. Gunther's Universal Scalability Law) is only going to be worth it if the JSON objects are sufficiently large and coming in sufficiently quickly (consistently coming in before the previous one has finished processing).
If your intention is to consume the websocket until the program stops, you might find it clearer to just use Source.never[Message].
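A minimal sketch of that variant, reusing the toCoinPrices flow from above and a println sink as a stand-in for your Kafka producer:

val consumeForever: Flow[Message, Message, NotUsed] =
  Flow.fromSinkAndSource(
    toCoinPrices.to(Sink.foreach(println)), // substitute your Kafka sink here
    Source.never[Message]) // never completes, so the connection stays open

val (upgradeResponse, _) = Http().singleWebSocketRequest(
  WebSocketRequest("wss://ws.coincap.io/prices?assets=ALL"),
  consumeForever)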

Akka HTTP. Streaming source from callback

I am trying to connect Akka HTTP to an old Java library. That library has two methods: one accepting a callback function that receives strings, and one signaling the end of the data stream. The callback function receiving the data can be called multiple times. Consider this snippet:
oldJavaLib.receiveData((s: String) => {
  println("received:" + s)
})

oldJavaLib.dataEnd(() => {
  println("data transmission is over")
})
I want to stream data over Akka HTTP as it is being received by the callback function, but I am not sure of the best way to go about that.
I was thinking of creating a stream and then using it directly in an HTTP route, like this:
def fetchUsers(): Source[User, NotUsed] = Source.fromIterator(() => Iterator.fill(1000000) {
  val id = Random.nextInt()
  dummyUser(id.toString)
})

lazy val routes: Route =
  pathPrefix("test") {
    concat(
      pathEnd {
        concat(
          get {
            complete(fetchUsers())
          }
        )
      }
    )
  }
The fetchUsers() function should return a stream that gets its data from the legacy Java API. Maybe there is a better approach.
I assume that you want to create an Akka stream that emits values from the callback? You can use Source.queue. For the first callback, it would be:
import akka.stream.QueueOfferResult.Enqueued
import akka.stream.scaladsl.{Keep, Sink, Source}

val queue = Source.queue[String](bufferSize = 1000)
  .toMat(Sink.ignore)(Keep.left)
  .run()

oldJavaLib.receiveData((s: String) => {
  queue.offer(s) match {
    case Enqueued => println("received:" + s)
    case _        => println("failed to enqueue:" + s)
  }
})
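The second callback can then close the stream. A minimal sketch using the same queue (complete() ends the source once any buffered elements have been emitted):

oldJavaLib.dataEnd(() => {
  queue.complete() // signals normal completion downstream
})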
Edit after question clarification
If you want to use the source in an HTTP route, you have to prematerialize it. Referring to my previous code, it would look like this:
val (queue, source) = Source.queue[String](bufferSize = 1000).preMaterialize()
The source can then be used in any route.
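For illustration, a hypothetical route serving the prematerialized source as a chunked text response (the path name and newline framing are my own assumptions):

import akka.http.scaladsl.model.{ContentTypes, HttpEntity}
import akka.http.scaladsl.server.Directives._
import akka.http.scaladsl.server.Route
import akka.util.ByteString

val route: Route =
  path("data") {
    get {
      // Stream each queued string as a newline-terminated chunk.
      complete(HttpEntity(
        ContentTypes.`text/plain(UTF-8)`,
        source.map(s => ByteString(s + "\n"))))
    }
  }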

Flow of my spark streaming program

A small question about the flow of my Spark Streaming program.
I have this function:
def parse(msg: String): Seq[String]
It splits a "good" message into multiple strings and, if the message is "bad", returns an empty Seq.
I'm reading the messages from a Kafka topic, and I want to send the results of the parsing to two different topics:
If the message is "good", send the result of the parsing to the topic "good_msg_topic"
If the message is "bad", send the "bad" message to the topic "bad_msg_topic"
To achieve that, I did this:
stream.foreachRDD(rdd => {
  val res = rdd.map(msg => msg.value() -> parse(msg.value()))
  res.foreach(pair => {
    if (pair._2.isEmpty) {
      producer.send(junkTopic, pair._1)
    } else {
      pair._2.foreach(m => producer.send(splitTopic, m))
    }
  })
})
However, I feel like this is not optimal. Pairing each original message with its parse result may slow down the process...
I'm just beginning with Spark and Scala, so I think one could do better.
Any idea on how I could improve that? Changing the signature of the parse function is also possible if you think it's better.
Thank you
I wouldn't be too concerned regarding performance if you haven't already measured this and found a bottleneck.
One thing I can think of which might make this code clearer is to use an ADT to describe the result type:
sealed trait Result
case class GoodResult(seq: Seq[String]) extends Result
case class BadResult(original: String) extends Result
Have parse return a Result:
def parse(s: String): Result
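A minimal sketch of what that parse could look like, assuming the original splitting logic is still available (splitMessage below is a hypothetical stand-in for it):

def parse(s: String): Result = {
  val parts = splitMessage(s) // hypothetical: the original Seq[String]-returning logic
  if (parts.isEmpty) BadResult(s) else GoodResult(parts)
}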
And then use map on the DStream instead of on each RDD:
stream
  .map(msg => parse(msg.value()))
  .foreachRDD { rdd =>
    rdd.foreach {
      case GoodResult(seq) => seq.foreach(value => producer.send(splitTopic, value))
      case BadResult(original) => producer.send(junkTopic, original)
    }
  }

Akka Stream return object from Sink

I've got a SourceQueue. When I offer an element to it, I want it to pass through the stream, and when it reaches the Sink I want the output returned to the code that offered the element (similar to how Sink.head returns an element to the RunnableGraph.run() call).
How do I achieve this? A simple example of my problem would be:
val source = Source.queue[String](100, OverflowStrategy.fail)
val flow = Flow[String].map(element => s"Modified $element")
val sink = Sink.ReturnTheStringSomehow
val graph = source.via(flow).to(sink).run()
val x = graph.offer("foo")
println(x) // Output should be "Modified foo"
val y = graph.offer("bar")
println(y) // Output should be "Modified bar"
val z = graph.offer("baz")
println(z) // Output should be "Modified baz"
Edit: For the example I have given in this question, Vladimir Matveev provided the best answer. However, note that this solution only works if the elements reach the sink in the same order they were offered to the source. If that cannot be guaranteed, the order of the elements in the sink may differ and the outcome might not be what is expected.
I believe it is simpler to use the already existing primitive for pulling values from a stream, called Sink.queue. Here is an example:
import scala.concurrent.Await
import scala.concurrent.duration._

val source = Source.queue[String](128, OverflowStrategy.fail)
val flow = Flow[String].map(element => s"Modified $element")
val sink = Sink.queue[String]().withAttributes(Attributes.inputBuffer(1, 1))

val (sourceQueue, sinkQueue) = source.via(flow).toMat(sink)(Keep.both).run()

def getNext: String = Await.result(sinkQueue.pull(), 1.second).get
sourceQueue.offer("foo")
println(getNext)
sourceQueue.offer("bar")
println(getNext)
sourceQueue.offer("baz")
println(getNext)
It does exactly what you want.
Note that setting the inputBuffer attribute for the queue sink may or may not be important for your use case - if you don't set it, the buffer will be zero-sized and the data won't flow through the stream until you invoke the pull() method on the sink.
sinkQueue.pull() yields a Future[Option[T]], which will be completed successfully with Some if the sink receives an element or with a failure if the stream fails. If the stream completes normally, it will be completed with None. In this particular example I'm ignoring this by using Option.get but you would probably want to add custom logic to handle this case.
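For example, a sketch of handling both completion and failure explicitly instead of calling Option.get (same sinkQueue as above):

import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

sinkQueue.pull().onComplete {
  case Success(Some(element)) => println(element) // got an element
  case Success(None)          => println("stream completed normally")
  case Failure(cause)         => println(s"stream failed: $cause")
}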
Well, you know what the offer() method returns if you take a look at its definition :) What you can do is create a Source.queue[(Promise[String], String)], write a helper function that pushes a pair to the stream via offer (making sure offer doesn't fail because the queue might be full), then complete the promise inside your stream and use the promise's future to catch the completion event in external code.
I do that to throttle the rate to an external API used from multiple places in my project.
Here is how it looked in my project before Typesafe added Hub sources to Akka:
import scala.concurrent.Promise
import scala.concurrent.Future
import java.util.concurrent.ConcurrentLinkedDeque
import akka.stream.scaladsl.{Keep, Sink, Source}
import akka.stream.{OverflowStrategy, QueueOfferResult}
import scala.util.Success

private val queue = Source.queue[(Promise[String], String)](100, OverflowStrategy.backpressure)
  .toMat(Sink.foreach({ case (p, param) =>
    p.complete(Success(param.reverse))
  }))(Keep.left)
  .run()

private val futureDeque = new ConcurrentLinkedDeque[Future[String]]()

private def sendQueuedRequest(request: String): Future[String] = {
  val p = Promise[String]
  val offerFuture = queue.offer(p -> request)

  def addToQueue(future: Future[String]): Future[String] = {
    futureDeque.addLast(future)
    future.onComplete(_ => futureDeque.remove(future))
    future
  }

  offerFuture.flatMap {
    case QueueOfferResult.Enqueued =>
      addToQueue(p.future)
  }.recoverWith {
    case ex =>
      val first = futureDeque.pollFirst()
      if (first != null)
        addToQueue(first.flatMap(_ => sendQueuedRequest(request)))
      else
        sendQueuedRequest(request)
  }
}
I realize that a blocking synchronized queue may be a bottleneck and may grow indefinitely, but because the API calls in my project are made only from other Akka streams, which are backpressured, I never have more than a dozen items in futureDeque. Your situation may differ.
If you create a MergeHub.source[(Promise[String], String)]() instead, you'll get a reusable sink. Then, every time you need to process an item, you create a complete graph and run it. In that case you won't need the hacky Java container to queue requests.
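A minimal sketch of that MergeHub variant, with the same reverse-the-string processing as above (the buffer size is arbitrary):

import akka.NotUsed
import akka.stream.scaladsl.{MergeHub, Sink, Source}

// Running the graph once yields a Sink that can be attached to many times.
val reusableSink: Sink[(Promise[String], String), NotUsed] =
  MergeHub.source[(Promise[String], String)](perProducerBufferSize = 16)
    .to(Sink.foreach { case (p, param) =>
      p.complete(Success(param.reverse))
    })
    .run()

def sendRequest(request: String): Future[String] = {
  val p = Promise[String]
  Source.single(p -> request).runWith(reusableSink) // a tiny graph per request
  p.future
}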

How to use an Akka Streams SourceQueue with PlayFramework

I would like to use a SourceQueue to push elements dynamically into an Akka Streams source.
A Play controller needs a Source in order to stream a result using the chunked method.
Because Play uses its own Akka Streams Sink under the hood, I can't materialize the source queue myself with a Sink: the source would be consumed before it's used by the chunked method (unless I use the following hack).
I'm able to make it work if I pre-materialize the source queue using a reactive-streams publisher, but it's kind of a dirty hack:
def sourceQueueAction = Action {
  val (queue, pub) = Source.queue[String](10, OverflowStrategy.fail)
    .toMat(Sink.asPublisher(false))(Keep.both)
    .run()

  // stupid example to push elements dynamically
  val tick = Source.tick(0.second, 1.second, "tick")
  tick.runForeach(t => queue.offer(t))

  Ok.chunked(Source.fromPublisher(pub))
}
Is there a simpler way to use an Akka Streams SourceQueue with PlayFramework?
Thanks
The solution is to use mapMaterializedValue on the source to get a Future of its queue materialization:
def sourceQueueAction = Action {
  val (queueSource, futureQueue) = peekMatValue(Source.queue[String](10, OverflowStrategy.fail))

  futureQueue.map { queue =>
    Source.tick(0.second, 1.second, "tick")
      .runForeach(t => queue.offer(t))
  }

  Ok.chunked(queueSource)
}

// T is the source type, here String
// M is the materialization type, here a SourceQueue[String]
def peekMatValue[T, M](src: Source[T, M]): (Source[T, M], Future[M]) = {
  val p = Promise[M]
  val s = src.mapMaterializedValue { m =>
    p.trySuccess(m)
    m
  }
  (s, p.future)
}
I'd like to share an insight I got today, though it may not be appropriate to your case with Play.
Instead of thinking of a Source to trigger, one can often turn the problem upside down and provide a Sink to the function that does the sourcing.
In that case, the Sink is the non-materialized "recipe" stage, and we can use Source.queue and materialize it right away: you get the queue, and you get the flow that it runs.
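A minimal sketch of that inversion (runWithSink is a hypothetical helper; the buffer size and overflow strategy are arbitrary):

import akka.stream.scaladsl.{Sink, Source, SourceQueueWithComplete}
import akka.stream.{Materializer, OverflowStrategy}

// Materialize Source.queue against the consumer's Sink right away
// and hand the queue back to the caller.
def runWithSink[T](consumer: Sink[T, _])(implicit mat: Materializer): SourceQueueWithComplete[T] =
  Source.queue[T](10, OverflowStrategy.fail)
    .to(consumer)
    .run()

// Usage: push elements as they arrive.
val queue = runWithSink(Sink.foreach[String](println))
queue.offer("hello")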