Akka Stream: continuously consume a websocket (Scala)

I'm kinda new to Scala and Akka Streams and I'm trying to get JSON string messages from a websocket and push them to a Kafka topic.
For now I am only working on the "get messages from a ws" part.
Messages coming from the websocket look like this:
{
"bitcoin":"6389.06534240",
"ethereum":"192.93111286",
"monero":"108.90302506",
"litecoin":"52.25484165"
}
I want to split this JSON message into multiple messages:
{"coin": "bitcoin", "price": "6389.06534240"}
{"coin": "ethereum", "price": "192.93111286"}
{"coin": "monero", "price": "108.90302506"}
{"coin": "litecoin", "price": "52.25484165"}
And then push each of these messages to a Kafka topic.
Here's what I have achieved so far:
val message_decomposition: Flow[Message, String, NotUsed] =
  Flow[Message]
    .mapConcat(msg => msg.toString.replaceAll("[{})(]", "").split(","))
    .map { msg =>
      val splitted = msg.split(":")
      s"{'coin': ${splitted(0)}, 'price': ${splitted(1)}}"
    }
val sink: Sink[String, Future[Done]] = Sink.foreach[String](println)

val flow: Flow[Message, Message, Promise[Option[Message]]] =
  Flow.fromSinkAndSourceMat(
    message_decomposition.to(sink),
    Source.maybe[Message])(Keep.right)

val (upgradeResponse, promise) = Http().singleWebSocketRequest(
  WebSocketRequest("wss://ws.coincap.io/prices?assets=ALL"),
  flow)
It's working and I'm getting the expected output JSON messages, but I was wondering if I could write this producer in a more "Akka-ish" style, like using the GraphDSL. So I have a few questions:
Is it possible to continuously consume a WebSocket using the GraphDSL? If yes, can you show me an example, please?
Is it a good idea to consume the WS using the GraphDSL?
Should I decompose the received JSON message like I'm doing before sending it to Kafka? Or is it better to send it as-is for lower latency?
After producing the message to Kafka, I am planning to consume it using Apache Storm. Is that a good idea? Or should I stick with Akka?
Thanks for reading. Regards,
Arès

That code is plenty Akka-ish: scaladsl is just as Akka as the GraphDSL or implementing a custom GraphStage. The only reason, IMO/E, to go to the GraphDSL is if the actual shape of the graph isn't readily expressible in the scaladsl.
I would personally go the route of defining a CoinPrice class to make the model explicit
case class CoinPrice(coin: String, price: BigDecimal)
And then have a Flow[Message, CoinPrice, NotUsed] which parses one incoming Message into zero or more CoinPrices. Something (using Play JSON here) like:
val toCoinPrices =
  Flow[Message]
    .mapConcat { msg =>
      Json.parse(msg.toString)
        .asOpt[JsObject]
        .toList
        .flatMap { json =>
          json.underlying.flatMap { kv =>
            import scala.util.Try
            kv match {
              case (coin, JsString(priceStr)) =>
                Try(BigDecimal(priceStr)).toOption
                  .map(p => CoinPrice(coin, p))
              case (coin, JsNumber(price)) => Some(CoinPrice(coin, price))
              case _ => None
            }
          }
        }
    }
You might, depending on the size of the JSON objects in the messages, want to break that into separate stream stages to allow for an async boundary between the JSON parsing and the extraction into CoinPrices. For example:
Flow[Message]
  .mapConcat { msg =>
    Json.parse(msg.toString).asOpt[JsObject].toList
  }
  .async
  .mapConcat { json =>
    json.underlying.flatMap { kv =>
      import scala.util.Try
      kv match {
        case (coin, JsString(priceStr)) =>
          Try(BigDecimal(priceStr)).toOption
            .map(p => CoinPrice(coin, p))
        case (coin, JsNumber(price)) => Some(CoinPrice(coin, price))
        case _ => None
      }
    }
  }
In the above, the stages on either side of the async boundary will execute in separate actors and thus, possibly concurrently (if there's enough CPU cores available etc.), at the cost of extra overhead for the actors to coordinate and exchange messages. That extra coordination/communication overhead (cf. Gunther's Universal Scalability Law) is only going to be worth it if the JSON objects are sufficiently large and coming in sufficiently quickly (consistently coming in before the previous one has finished processing).
If your intention is to consume the websocket until the program stops, you might find it clearer to just use Source.never[Message].
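For example, a minimal sketch of wiring the toCoinPrices flow from above into the websocket client with Source.never (the printing sink is a stand-in for your eventual Kafka producer sink):
// Sketch only: assumes the CoinPrice case class and toCoinPrices flow defined above.
val printSink: Sink[CoinPrice, Future[Done]] = Sink.foreach(println)

val wsClientFlow: Flow[Message, Message, NotUsed] =
  Flow.fromSinkAndSource(
    toCoinPrices.to(printSink), // consume and parse every incoming message
    Source.never[Message])      // never send anything and never complete

val (upgradeResponse, _) = Http().singleWebSocketRequest(
  WebSocketRequest("wss://ws.coincap.io/prices?assets=ALL"),
  wsClientFlow)
Because Source.never never completes, the client keeps the connection open until the server closes it or you shut the actor system down.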

Related

alpakka jms client acknowledgement mode delivery guarantee

I have an Alpakka JMS source -> Kafka sink kind of flow. I'm looking at the Alpakka JMS consumer documentation and trying to figure out what kind of delivery guarantees this gives me.
From https://doc.akka.io/docs/alpakka/current/jms/consumer.html
val result: Future[immutable.Seq[javax.jms.Message]] =
  jmsSource
    .take(msgsIn.size)
    .map { ackEnvelope =>
      ackEnvelope.acknowledge()
      ackEnvelope.message
    }
    .runWith(Sink.seq)
I'm hoping that the way this actually works is that messages will only be ack'ed once the sinking succeeds (for at-least-once delivery guarantees), but I can't rely on wishful thinking.
Given that Alpakka does not seem to keep any state of its own that persists across restarts, I can't see how I'd be able to get exactly-once guarantees à la Flink here, but can I at least count on at-least-once, or would I have to (somehow) ack in a map of a Kafka producer flexiFlow (https://doc.akka.io/docs/alpakka-kafka/current/producer.html#producer-as-a-flow)?
Thanks,
Fil
In that stream, the ack will happen before messages are added to the materialized sequence and before result becomes available for you to do anything with (i.e. before the Future completes). It would therefore be at-most-once.
To delay the ack until some processing has succeeded, the easiest approach is to keep what you're doing with the messages in the flow rather than materialize a future. The Alpakka Kafka producer supports a pass-through element which could be the JMS message:
val topic: String = ???

def javaxMessageToKafkaMessage[Key, Value](
    ae: AckEnvelope,
    kafkaKeyFor: javax.jms.Message => Key,
    kafkaValueFor: javax.jms.Message => Value
): ProducerMessage.Envelope[Key, Value, AckEnvelope] = {
  val key = kafkaKeyFor(ae.message)
  val value = kafkaValueFor(ae.message)
  // the AckEnvelope rides along as the pass-through so it can be acked downstream
  ProducerMessage.single(new ProducerRecord(topic, key, value), ae)
}
// types K and V are unspecified...
jmsSource
  .map(
    javaxMessageToKafkaMessage[K, V](
      _,
      { _ => ??? },
      { _ => ??? }
    )
  )
  .via(Producer.flexiFlow(producerSettings))
  .toMat(
    Sink.foreach { results =>
      val ackEnvelope = results.passThrough
      ackEnvelope.acknowledge()
    }
  )(Keep.both)
Running this stream will materialize into a tuple of a JmsConsumerControl and a Future[Done]. Not being familiar with JMS, I don't know how a shutdown of the consumer control would interact with the acks.
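As a hedged sketch, materializing the graph above and keeping both values could look like this (the shutdown() call is my reading of the Alpakka JMS consumer-control API, so treat it as an assumption):
// Sketch: run the graph built above and keep both materialized values.
val (control, streamDone): (JmsConsumerControl, Future[Done]) =
  jmsSource
    .map(javaxMessageToKafkaMessage[K, V](_, { _ => ??? }, { _ => ??? }))
    .via(Producer.flexiFlow(producerSettings))
    .toMat(Sink.foreach(_.passThrough.acknowledge()))(Keep.both)
    .run()

// Assumption: `control.shutdown()` stops consuming from JMS;
// `streamDone` completes when the stream terminates.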

How to handle backpressure when streaming a file from S3 with actor interop

I am trying to download a large file from S3 and send its data to another actor that performs an HTTP request, and then to persist the response. I want to limit the number of requests sent by that actor, hence I need to handle backpressure.
I tried doing something like this:
S3.download(bckt, bcktKey).map {
  case Some((file, _)) =>
    file
      .via(CsvParsing.lineScanner())
      .map(_.map(_.utf8String)).drop(1) // drop headers
      .map(p => Foo(p.head, p(1)))
      .mapAsync(30) { p =>
        implicit val askTimeout: Timeout = Timeout(10 seconds)
        (httpClientActor ? p).mapTo[Buzz]
      }
      .mapAsync(1) {
        case b @ Buzz(_, _) =>
          (persistActor ? b).mapTo[Done]
      }
      .runWith(Sink.head)
}
The problem is that I see it reads only 30 lines from the file, which is the limit set for parallelism. I am not sure this is the correct way to achieve what I'm looking for.
As Johny notes in his comment, the Sink.head is what causes the stream to only process about 30 elements. What happens is approximately:
Sink.head signals demand for 1 element
this demand propagates up through the second mapAsync
when the demand reaches the first mapAsync, since that one has parallelism 30, it signals demand for 30 elements
the CSV parsing stages emit 30 elements
when the response to the ask with the first element from the client actor is received, the response propagates down to the ask of the persist actor
demand is signaled for one more element from the CSV parsing stages
when the persist actor responds, the response goes to the sink
since the sink is Sink.head which cancels the stream once it receives an element, the stream gets torn down
any asks of the client actor which have been sent but are awaiting a response will still get processed
There's a bit of a race between the persist actor's response and the CSV parsing and sending an ask to the client actor: if the latter is faster, 31 lines might get processed by the client actor.
If you just want a Future[Done] after every element has been processed, Sink.last will work very well with this code.
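For example, a sketch of the same pipeline with just the sink swapped (everything else as in the question's code):
// Identical to the stream in the question, except Sink.head becomes Sink.last.
// Sink.last keeps demanding elements until upstream completes, so every line
// is processed, and the materialized Future completes with the final Done.
S3.download(bckt, bcktKey).map {
  case Some((file, _)) =>
    file
      .via(CsvParsing.lineScanner())
      .map(_.map(_.utf8String)).drop(1) // drop headers
      .map(p => Foo(p.head, p(1)))
      .mapAsync(30) { p =>
        implicit val askTimeout: Timeout = Timeout(10 seconds)
        (httpClientActor ? p).mapTo[Buzz]
      }
      .mapAsync(1) { case b @ Buzz(_, _) =>
        (persistActor ? b).mapTo[Done]
      }
      .runWith(Sink.last) // Future[Done] completing after the last element
}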
If the reason is not the usage of Sink.head as I mentioned in the comment, you can backpressure the stream using Sink.actorRefWithBackpressure.
Sample code:
class PersistActor extends Actor {
  override def receive: Receive = {
    case "init" =>
      println("Initialized")
      sender() ! Done // ack the init message so the stream starts emitting elements
    case "complete" =>
      context.stop(self)
    case message =>
      // Persist Buzz??
      sender() ! Done // ack each element to request the next one
  }
}

val sink = Sink
  .actorRefWithBackpressure(persistActor, "init", Done, "complete", PartialFunction.empty)

S3.download(bckt, bcktKey).map {
  case Some((file, _)) =>
    file
      .via(CsvParsing.lineScanner())
      .map(_.map(_.utf8String)).drop(1) // drop headers
      .map(p => Foo(p.head, p(1)))
      // You could backpressure here too...
      .mapAsync(30) { p =>
        implicit val askTimeout: Timeout = Timeout(10 seconds)
        (httpClientActor ? p).mapTo[Buzz]
      }
      .to(sink)
      .run()
}

Why does Source.tick stop after one hundred HttpRequests?

Using Akka Streams and Akka HTTP, I have created a stream which polls an API every 3 seconds, unmarshals the result to a JsValue object and sends this result to an actor, as can be seen in the following code:
// Source which performs an HTTP request every 3 seconds.
val source = Source.tick(
  0.seconds,
  3.seconds,
  HttpRequest(uri = Uri(path = Path("/posts/1"))))

// Processes the result of the HTTP request.
val flow = Http().outgoingConnectionHttps("jsonplaceholder.typicode.com").mapAsync(1) {
  // Able to reach the API.
  case HttpResponse(StatusCodes.OK, _, entity, _) =>
    // Unmarshal the JSON response.
    Unmarshal(entity).to[JsValue]
  // Failed to reach the API.
  case HttpResponse(code, _, entity, _) =>
    entity.discardBytes()
    Future.successful(code.toString())
}

// Run the stream.
source.via(flow).runWith(
  Sink.actorRef[Any](processJsonActor, akka.actor.Status.Success("Completed stream")))
This works; however, the stream closes after 100 HttpRequests (ticks).
What is the cause of this behaviour?
Definitely something to do with outgoingConnectionHttps. This is a low-level DSL and there could be some misconfigured setting somewhere that is causing this (although I couldn't figure out which one).
Usage of this DSL is actually discouraged by the docs.
Try using a higher-level DSL like the cached host connection pool:
val flow = Http().cachedHostConnectionPoolHttps[NotUsed]("akka.io").mapAsync(1) {
  // Able to reach the API.
  case (Success(HttpResponse(StatusCodes.OK, _, entity, _)), _) =>
    // Unmarshal the JSON response.
    Unmarshal(entity).to[String]
  // Failed to reach the API.
  case (Success(HttpResponse(code, _, entity, _)), _) =>
    entity.discardBytes()
    Future.successful(code.toString())
  case (Failure(e), _) =>
    throw e
}

// Run the stream.
source.map(_ -> NotUsed).via(flow).runWith(...)
A potential issue is that there is no backpressure signal with Sink.actorRef, so the actor's mailbox could be getting full. If the actor, whenever it receives a JsValue object, is doing something that could take a long time, use Sink.actorRefWithAck instead. For example:
val initMessage = "start"
val completeMessage = "done"
val ackMessage = "ack"

source
  .via(flow)
  .runWith(Sink.actorRefWithAck[Any](
    processJsonActor, initMessage, ackMessage, completeMessage))
You would need to change the actor to handle an initMessage and to reply to the stream with an ackMessage (via sender() ! ackMessage) for every stream element. More information on Sink.actorRefWithAck can be found in the Akka Streams documentation.
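A minimal sketch of such an actor, using the message values defined above (the class name and the body of the element handler are placeholders):
class ProcessJsonActor extends Actor {
  override def receive: Receive = {
    case "start" =>            // initMessage: reply so the stream starts emitting
      sender() ! "ack"
    case "done" =>             // completeMessage: the stream has finished
      context.stop(self)
    case json: JsValue =>
      // ... potentially slow processing of the element goes here ...
      sender() ! "ack"         // ackMessage: request the next element
    case other =>
      // non-OK responses arrive as status-code strings from the flow above
      sender() ! "ack"
  }
}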

Flow of my Spark Streaming program

A small question about the flow of my Spark Streaming program.
I have this function :
def parse(msg: String): Seq[String]
which actually splits a "good" message into multiple strings and, if the string is "bad", returns an empty Seq.
I'm reading the messages from a Kafka topic, and I want to send the results of the parsing into two different topics:
If the message is "good", send the result of the parsing to the topic "good_msg_topic"
If the message is "bad", send the "bad" message to the topic "bad_msg_topic"
To achieve that, I did this:
stream.foreachRDD(rdd => {
  val res = rdd.map(msg => msg.value() -> parse(msg.value()))
  res.foreach(pair => {
    if (pair._2.isEmpty) {
      producer.send(junkTopic, pair._1)
    } else {
      pair._2.foreach(m => producer.send(splitTopic, m))
    }
  })
})
However, I feel like this is not optimal. Using a map to associate the original message with the result may slow down the process...
I'm just beginning with Spark and Scala, so I think one could do better.
Any idea on how I could improve that? Changing the signature of the parse function is also possible if you think it's better.
Thank you
I wouldn't be too concerned regarding performance if you haven't already measured this and found a bottleneck.
One thing I can think of which might make this code clearer is to use an ADT to describe the result type:
sealed trait Result
case class GoodResult(seq: Seq[String]) extends Result
case class BadResult(original: String) extends Result
Have parse return Result
def parse(s: String): Result
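One possible way to get there without rewriting the splitting logic is to wrap the existing Seq-returning function (renamed parseRaw here, purely as an illustrative name):
// Hypothetical adapter: parseRaw stands for the original Seq-returning parse.
def parseRaw(msg: String): Seq[String] = ??? // existing splitting logic

def parse(s: String): Result =
  parseRaw(s) match {
    case Seq()  => BadResult(s)   // an empty Seq meant "bad" in the original design
    case pieces => GoodResult(pieces)
  }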
And then use map on DStream instead of RDD:
stream
  .map(msg => parse(msg.value()))
  .foreachRDD { rdd =>
    rdd.foreach { result =>
      result match {
        case GoodResult(seq) => seq.foreach(value => producer.send(splitTopic, value))
        case BadResult(original) => producer.send(junkTopic, original)
      }
    }
  }

Creating an ActorPublisher and ActorSubscriber with the same actor

I'm a newbie to Akka Streams. I'm using Kafka as a source (via the ReactiveKafka library), doing some processing of the data through the flow, and using a subscriber (EsHandler) as the sink.
Now I need to handle errors and push them to a different Kafka queue through an error handler. I'm trying to use EsHandler as both publisher and subscriber. I'm not sure how to include EsHandler as a middle man instead of a sink.
This is my code:
val publisher = Kafka.kafka.consume(topic, "es", new StringDecoder())
val flow = Flow[String].map { elem => JsonConverter.convert(elem.toString()) }
val sink = Sink.actorSubscriber[GenModel](Props(classOf[EsHandler]))

Source(publisher).via(flow).to(sink).run()

class EsHandler extends ActorSubscriber with ActorPublisher[Model] {
  val requestStrategy = WatermarkRequestStrategy(100)

  def receive = {
    case OnNext(msg: Model) =>
      context.actorOf(Props(classOf[EsStorage], self)) ! msg
    case OnError(err: Exception) =>
      context.stop(self)
    case OnComplete =>
      context.stop(self)
    case Response(msg) =>
      if (msg.isError()) onNext(msg.getContent())
  }
}

class ErrorHandler extends ActorSubscriber {
  val requestStrategy = WatermarkRequestStrategy(100)

  def receive = {
    case OnNext(msg: Model) =>
      println(msg)
  }
}
We highly recommend against implementing your own Processor (which is the name the Reactive Streams spec gives to "Subscriber && Publisher"). It is pretty hard to get right, which is why there is no such helper trait exposed directly.
Instead, most of the time you'll want to use Sources/Sinks (or Publishers/Subscribers) provided to you and run your operations between those, as map/filter etc. steps.
In fact, there is an existing implementation of Kafka Sources and Sinks you can use: it's called reactive-kafka and is verified by the Reactive Streams TCK, so you can trust it to be a valid implementation.
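For illustration only, here is a rough sketch of that shape. The storeToEs function and the error-topic subscriber are placeholders (not real reactive-kafka API), and the exact Source/Sink factory methods depend on your akka-stream version; the point is that the error path becomes ordinary map/filter steps ending in a provided Sink instead of a hand-written Processor:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global // any ExecutionContext will do

// Placeholders standing in for your real implementations:
def storeToEs(model: GenModel): Future[Boolean] = ???               // ES write; true if it failed (placeholder)
val errorSubscriber: org.reactivestreams.Subscriber[GenModel] = ??? // e.g. a Kafka sink for the error topic

Source(publisher)
  .map(elem => JsonConverter.convert(elem.toString()))
  .mapAsync(parallelism = 4)(model => storeToEs(model).map(failed => (model, failed)))
  .collect { case (model, failed) if failed => model }   // keep only the failed writes
  .to(Sink.fromSubscriber(errorSubscriber))               // Sink(errorSubscriber) on older akka-stream versions
  .run()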