running akka stream in parallel - scala

I have a stream that:
listens for an HTTP POST receiving a list of events,
mapConcats the list of events into individual stream elements,
converts each event into a Kafka record,
produces the record with reactive kafka (the akka-stream-kafka producer sink).
Here is the simplified code
// flow to split a group of events into individual events
val splitLines = Flow[List[Evt]].mapConcat(list => list)

// sink that produces the Kafka records to Kafka
val kafkaSink: Sink[Evt, Future[Done]] = Flow[Evt]
  .map(evt => new ProducerRecord[Array[Byte], String](evt.eventType, evt.value))
  .toMat(Producer.plainSink(kafka))(Keep.right)
val routes = {
  path("ingest") {
    post {
      (entity(as[List[ReactiveEvent]]) & extractMaterializer) { (eventIngestList, mat) =>
        val ingest = Source.single(eventIngestList).via(splitLines).runWith(kafkaSink)(mat)
        val result = onComplete(ingest) {
          case Success(value) => complete(s"OK")
          case Failure(ex)    => complete((StatusCodes.InternalServerError, s"An error occurred: ${ex.getMessage}"))
        }
        complete("eventList ingested: " + result)
      }
    }
  }
}
Could you highlight what is run in parallel and what is sequential?
I think mapConcat serializes the events in the stream, so how could I parallelize the stream so that after the mapConcat each step is processed in parallel?
Would a simple mapAsyncUnordered be sufficient? Or should I use the GraphDSL with a Balance and a Merge?

In your case it will be sequential, I think. Also, you're reading the whole request before you start pushing data to Kafka. I'd use the extractDataBytes directive, which gives you src: Source[ByteString, Any], and then process it like this:
src
  .via(Framing.delimiter(ByteString("\n"), 1024 /* max size of a line */, allowTruncation = true).map(_.utf8String))
  .mapConcat { line =>
    line.split(",").toList // mapConcat expects an immutable collection, not an Array
  }
  .async
  .runWith(kafkaSink)(mat)
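Not part of the original answer, but since the question asks about mapAsyncUnordered: below is a minimal sketch, assuming the Evt type, the kafka ProducerSettings and an ActorSystem named system from the question's context, of parallelising the per-event conversion step in front of the reactive-kafka sink. Producer.plainSink already pipelines and batches sends internally, so this mainly helps when the per-event transformation itself is expensive, and mapAsyncUnordered gives up ordering of the produced records.

import scala.concurrent.{ExecutionContext, Future}
import akka.Done
import akka.kafka.scaladsl.Producer
import akka.stream.scaladsl.{Flow, Keep, Sink}
import org.apache.kafka.clients.producer.ProducerRecord

// hypothetical parallelism level; tune it to the actual workload
val parallelism = 4
// assumption: reuse the dispatcher of the ActorSystem backing the HTTP route
implicit val ec: ExecutionContext = system.dispatcher

val parallelKafkaSink: Sink[Evt, Future[Done]] = Flow[Evt]
  .mapAsyncUnordered(parallelism) { evt =>
    // run the per-event conversion as a Future so several events are processed at once
    Future(new ProducerRecord[Array[Byte], String](evt.eventType, evt.value))
  }
  .toMat(Producer.plainSink(kafka))(Keep.right)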

Related

Akka streams websocket stream things to a Sink.seq ends with exception SubscriptionWithCancelException$StageWasCompleted

I'm failing to materialize the Sink.seq; when it comes time to materialize, I fail with this exception:
akka.stream.SubscriptionWithCancelException$StageWasCompleted$
Here is the full source code on GitHub: https://github.com/Christewart/bitcoin-s-core/blob/aaecc7c180e5cc36ec46d73d6b2b0b0da87ab260/app/server-test/src/test/scala/org/bitcoins/server/WebsocketTests.scala#L51
I'm attempting to aggregate all elements pushed out of a websocket into a Sink.seq. I have to do a bit of JSON transformation before I aggregate things inside of Sink.seq.
val endSink: Sink[WalletNotification[_], Future[Seq[WalletNotification[_]]]] =
  Sink.seq[WalletNotification[_]]

val sink: Sink[Message, Future[Seq[WalletNotification[_]]]] = Flow[Message]
  .map {
    case message: TextMessage.Strict =>
      //we should be able to parse the address message
      val text = message.text
      val notification: WalletNotification[_] = {
        upickle.default.read[WalletNotification[_]](text)(
          WsPicklers.walletNotificationPickler)
      }
      logger.info(s"Notification=$notification")
      notification
    case msg =>
      logger.error(s"msg=$msg")
      sys.error("")
  }
  .log(s"### endSink ###")
  .toMat(endSink)(Keep.right)

val f: Flow[
  Message,
  Message,
  (Future[Seq[WalletNotification[_]]], Promise[Option[Message]])] = {
  Flow
    .fromSinkAndSourceMat(sink, Source.maybe[Message])(Keep.both)
}

val tuple: (
    Future[WebSocketUpgradeResponse],
    (Future[Seq[WalletNotification[_]]], Promise[Option[Message]])) = {
  Http()
    .singleWebSocketRequest(req, f)
}

val walletNotificationsF: Future[Seq[WalletNotification[_]]] =
  tuple._2._1
val promise: Promise[Option[Message]] = tuple._2._2

logger.info(s"Requesting new address for expectedAddrStr")
val expectedAddressStr = ConsoleCli
  .exec(CliCommand.GetNewAddress(labelOpt = None), cliConfig)
  .get
val expectedAddress = BitcoinAddress.fromString(expectedAddressStr)

promise.success(None)
logger.info(s"before notificationsF")

//hangs here, as the future never gets completed, fails with an exception
for {
  notifications <- walletNotificationsF
  _ = logger.info(s"after notificationsF")
} yield {
  //assertions in here...
}
What am I doing wrong?
To keep the client connection open you need "more code", something like this:
val sourceKickOff = Source
  .single(TextMessage("kick off msg"))
  // Keeps the connection open
  .concatMat(Source.maybe[Message])(Keep.right)
See the full working example, which consumes messages from a server:
https://github.com/pbernet/akka_streams_tutorial/blob/b6d4c89a14bdc5d72c557d8cede59985ca8e525f/src/main/scala/akkahttp/WebsocketEcho.scala#L280
The problem is this line:
Flow.fromSinkAndSourceMat(sink, Source.maybe[Message])(Keep.both)
It needs to be:
Flow.fromSinkAndSourceCoupledMat(sink, Source.maybe[Message])(Keep.both)
When the stream is terminated, the Coupled variant of the materialized flow makes sure the downstream Sink is terminated as well.
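For reference, a minimal sketch of the corrected wiring, reusing the sink and the types from the question:

val f: Flow[
  Message,
  Message,
  (Future[Seq[WalletNotification[_]]], Promise[Option[Message]])] =
  Flow.fromSinkAndSourceCoupledMat(sink, Source.maybe[Message])(Keep.both)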

Azure Databricks Concurrent Job - Avoid Consuming the same eventhub messages in all Jobs

Please help us implement partitioning/grouping when receiving Event Hub messages in an Azure Databricks concurrent job, and the right approach to consuming Event Hub messages in a concurrent job.
We created 3 concurrent jobs in Azure Databricks, uploading the consumer code written in Scala as jar files. In this case we receive the same messages in all 3 concurrent jobs. To overcome this we tried consuming the events by partition, but we receive the same messages in all 3 partitions.
We also tried sending messages based on a partition key and creating consumer groups in Event Hubs, yet we still receive the same messages in all the groups. We are not sure how to handle the Event Hub messages in the concurrent jobs.
EventHub configuration:
Number of partitions is 3 and message retention is 1.
EventHub producer: sending messages to the Event Hub using .NET (C#) works fine.
EventHub consumer: able to receive messages through the Scala program without any issues.
Producer C# Code:
string eventHubName = ConfigurationManager.AppSettings["eventHubname"];
string connectionString = ConfigurationManager.AppSettings["eventHubconnectionstring"];
eventHubClient = EventHubClient.CreateFromConnectionString(connectionString, eventHubName);

for (var i = 0; i < 100; i++)
{
    var sender = "event hub message 1" + i;
    var data = new EventData(Encoding.UTF8.GetBytes(sender));
    Console.WriteLine($"Sending message: {sender}");
    eventHubClient.SendAsync(data);
}

eventHubClient.CloseAsync();
Console.WriteLine("Press ENTER to exit.");
Console.ReadLine();
Consumer Scala Code:
object ReadEvents {
  val spark = SparkSession.builder()
    .appName("eventhub")
    .getOrCreate()
  val sc = spark.sparkContext
  val ssc = new StreamingContext(sc, Seconds(5))

  def main(args: Array[String]): Unit = {
    val connectionString = ConnectionStringBuilder("ConnectionString").setEventHubName("eventhub1").build

    val positions = Map(new NameAndPartition("eventhub1", 0) -> EventPosition.fromStartOfStream)
    val position2 = Map(new NameAndPartition("eventhub1", 1) -> EventPosition.fromEnqueuedTime(Instant.now()))
    val position3 = Map(new NameAndPartition("eventhub1", 2) -> EventPosition.fromEnqueuedTime(Instant.now()))

    val ehConf = EventHubsConf(connectionString).setStartingPositions(positions)
    val ehConf2 = EventHubsConf(connectionString).setStartingPositions(position2)
    val ehConf3 = EventHubsConf(connectionString).setStartingPositions(position3)

    val stream = org.apache.spark.eventhubs.EventHubsUtils.createDirectStream(ssc, ehConf)
    println("Before the loop")
    stream.foreachRDD(rdd => {
      rdd.collect().foreach(rec => {
        println(String.format("Message is first stream ===>: %s", new String(rec.getBytes(), Charset.defaultCharset())))
      })
    })

    val stream2 = org.apache.spark.eventhubs.EventHubsUtils.createDirectStream(ssc, ehConf2)
    stream2.foreachRDD(rdd2 => {
      rdd2.collect().foreach(rec2 => {
        println(String.format("Message second stream is ===>: %s", new String(rec2.getBytes(), Charset.defaultCharset())))
      })
    })

    val stream3 = org.apache.spark.eventhubs.EventHubsUtils.createDirectStream(ssc, ehConf)
    stream3.foreachRDD(rdd3 => {
      println("Inside 3rd stream foreach loop")
      rdd3.collect().foreach(rec3 => {
        println(String.format("Message is third stream ===>: %s", new String(rec3.getBytes(), Charset.defaultCharset())))
      })
    })

    ssc.start()
    ssc.awaitTermination()
  }
}
I expect the Event Hub messages to be partitioned properly when they are received by the concurrent jobs running the Scala program.
The code below helps to iterate through all partitions:
eventHubsStream.foreachRDD { rdd =>
  rdd.foreach { message =>
    if (message != null) {
      callYouMethod(new String(message.getBytes()))
    }
  }
}

Akka streams Source.actorRef vs Source.queue vs buffer, which one to use?

I am using akka-streams-kafka to create a stream consumer from a Kafka topic.
I'm using a broadcast to serve events from the Kafka topic to websocket clients.
I have found the following three approaches to create a stream Source.
Question:
My goal is to serve hundreds/thousands of websocket clients (some of which might be slow consumers). Which approach scales better?
Appreciate any thoughts.
Broadcast lowers the rate down to the slowest consumer.
val BUFFER_SIZE = 100000
Source.actorRef (the source actor does not support backpressure)
val kafkaSourceActorWithBroadcast = {
  val (sourceActorRef, kafkaSource) = Source.actorRef[String](BUFFER_SIZE, OverflowStrategy.fail)
    .toMat(BroadcastHub.sink(bufferSize = 256))(Keep.both).run

  Consumer.plainSource(consumerSettings, Subscriptions.topics(KAFKA_TOPIC))
    .runForeach(record => sourceActorRef ! Util.toJson(record.value()))

  kafkaSource
}
Source.queue
val kafkaSourceQueueWithBroadcast = {
  val (futureQueue, kafkaQueueSource) = Source.queue[String](BUFFER_SIZE, OverflowStrategy.backpressure)
    .toMat(BroadcastHub.sink(bufferSize = 256))(Keep.both).run

  Consumer.plainSource(consumerSettings, Subscriptions.topics(KAFKA_TOPIC))
    .runForeach(record => futureQueue.offer(Util.toJson(record.value())))

  kafkaQueueSource
}
buffer
val kafkaSourceWithBuffer = Consumer.plainSource(consumerSettings, Subscriptions.topics(KAFKA_TOPIC))
  .map(record => Util.toJson(record.value()))
  .buffer(BUFFER_SIZE, OverflowStrategy.backpressure)
  .toMat(BroadcastHub.sink(bufferSize = 256))(Keep.right).run
Websocket route code for completeness:
val streamRoute =
  path("stream") {
    handleWebSocketMessages(websocketFlow)
  }

def websocketFlow(where: String): Flow[Message, Message, NotUsed] = {
  Flow[Message]
    .collect {
      case TextMessage.Strict(msg) => Future.successful(msg)
      case TextMessage.Streamed(stream) =>
        stream.runFold("")(_ + _).flatMap(msg => Future.successful(msg))
    }
    .mapAsync(parallelism = PARALLELISM)(identity)
    .via(logicStreamFlow)
    .map { msg: String => TextMessage.Strict(msg) }
}

private def logicStreamFlow: Flow[String, String, NotUsed] =
  Flow.fromSinkAndSource(Sink.ignore, kafkaSourceActorWithBroadcast)

Kafka producer hangs on send

The logic is that a streaming job, getting data from a custom source, has to write both to Kafka and to HDFS.
I wrote a (very) basic Kafka producer to do this; however, the whole streaming job hangs on the send method.
class KafkaProducer(val kafkaBootstrapServers: String, val kafkaTopic: String, val sslCertificatePath: String, val sslCertificatePassword: String) {

  val kafkaProps: Properties = new Properties()
  kafkaProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaBootstrapServers)
  kafkaProps.put("acks", "1")
  kafkaProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
  kafkaProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
  kafkaProps.put("ssl.truststore.location", sslCertificatePath)
  kafkaProps.put("ssl.truststore.password", sslCertificatePassword)

  val kafkaProducer: KafkaProducer[Long, Array[String]] = new KafkaProducer(kafkaProps)

  def sendKafkaMessage(message: Message): Unit = {
    message.data.foreach(list => {
      val producerRecord: ProducerRecord[Long, Array[String]] = new ProducerRecord[Long, Array[String]](kafkaTopic, message.timeStamp.getTime, list.toArray)
      kafkaProducer.send(producerRecord)
    })
  }
}
And the code calling the producer:
receiverStream.foreachRDD(rdd => {
  val messageRowRDD: RDD[Row] = rdd.mapPartitions(partition => {
    val parser: Parser = new Parser
    val kafkaProducer: KafkaProducer = new KafkaProducer(kafkaBootstrapServers, kafkaTopic, kafkaSslCertificatePath, kafkaSslCertificatePass)

    val newPartition = partition.map(message => {
      Logger.getLogger("importer").error("Writing Message to Kafka...")
      kafkaProducer.sendKafkaMessage(message)
      Logger.getLogger("importer").error("Finished writing Message to Kafka")
      message.data.map(singleMessage => parser.parseMessage(message.timeStamp.getTime, singleMessage))
    })
    newPartition.flatten
  })

  val df = sqlContext.createDataFrame(messageRowRDD, Schema.messageSchema)
  Logger.getLogger("importer").info("Entries-count: " + df.count())
  val row = Try(df.first)
  row match {
    case Success(s) => Persister.writeDataframeToDisk(df, outputFolder)
    case Failure(e) => Logger.getLogger("importer").warn("Resulting DataFrame is empty. Nothing can be written")
  }
})
From the logs I can tell that each executor reaches the "sending to kafka" point but gets no further. All executors hang on that and no exception is thrown.
The Message class is a very simple case class with 2 fields: a timestamp and an array of strings.
This was due to the acks setting in Kafka.
Once acks was set to 1, sends went ahead a lot faster.
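Not part of the original answer, but a minimal sketch (assuming the kafkaProps and the raw Kafka producer from the wrapper class above) of making the acks setting explicit and attaching a send callback, so a slow or failing send surfaces in the logs instead of the job silently appearing to hang:

import org.apache.kafka.clients.producer.{Callback, ProducerConfig, ProducerRecord, RecordMetadata}

// set acks explicitly via the ProducerConfig constant (equivalent to kafkaProps.put("acks", "1"))
kafkaProps.put(ProducerConfig.ACKS_CONFIG, "1")

// hypothetical helper: send with a callback that logs any send failure
def sendWithLogging(record: ProducerRecord[Long, Array[String]]): Unit =
  kafkaProducer.send(record, new Callback {
    override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit =
      if (exception != null)
        Logger.getLogger("importer").error("Kafka send failed", exception)
  })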

Kafka topic to websocket

I am trying to implement a setup where multiple web browsers open a websocket connection to my akka-http server in order to read all messages posted to a Kafka topic.
So the stream of messages should go this way:
kafka topic -> akka-http -> websocket connection 1
                         -> websocket connection 2
                         -> websocket connection 3
For now I have created a path for the websocket:
val route: Route =
  path("ws") {
    handleWebSocketMessages(notificationWs)
  }
Then I have created a consumer for my kafka topic:
val consumerSettings = ConsumerSettings(system, new ByteArrayDeserializer, new StringDeserializer)
  .withBootstrapServers("localhost:9092")
  .withGroupId("group1")
  .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

val source = Consumer
  .plainSource(consumerSettings, Subscriptions.topics("topic1"))
And finally I want to connect this source to the websocket in handleWebSocketMessages:
def handleWebSocketMessages: Flow[Message, Message, Any] =
  Flow[Message].mapConcat {
    case tm: TextMessage =>
      TextMessage(source) :: Nil
    case bm: BinaryMessage =>
      // ignore binary messages but drain content to avoid the stream being clogged
      bm.dataStream.runWith(Sink.ignore)
      Nil
  }
Here is the error I get when I try to use source in the TextMessage:
Error:(77, 9) overloaded method value apply with alternatives:
  (textStream: akka.stream.scaladsl.Source[String,Any])akka.http.scaladsl.model.ws.TextMessage
  (text: String)akka.http.scaladsl.model.ws.TextMessage.Strict
cannot be applied to (akka.stream.scaladsl.Source[org.apache.kafka.clients.consumer.ConsumerRecord[Array[Byte],String],akka.kafka.scaladsl.Consumer.Control])
  TextMessage(source)::Nil
I think I'm making numerous mistakes along the way but I would say that the most blocking part is the handleWebSocketMessages.
The first thing to understand is that source is of type Source[ConsumerRecord[K, V], Control], so it's not something you can pass as an argument to a TextMessage.
Now, let's take the websocket's point of view:
An outgoing message is built for each message in the Kafka source. The message will be a TextMessage built from a String transformation of the Kafka message.
For each incoming message, just println() it.
So, the Flow can be seen as two components: the Source and the Sink.
val incomingMessages: Sink[Message, Future[Done]] =
  Sink.foreach[Message](println(_))

val outgoingMessages: Source[Message, Consumer.Control] =
  source
    // plainSource emits ConsumerRecord directly, so take its value
    .map { consumerRecord => TextMessage(consumerRecord.value) }

val handleWebSocketMessages: Flow[Message, Message, Any] =
  Flow.fromSinkAndSource(incomingMessages, outgoingMessages)
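For completeness, a small usage sketch (my assumption, not from the original answer) showing how this flow plugs into the route from the question, where it plays the role of notificationWs:

val route: Route =
  path("ws") {
    // the akka-http directive, fed with the Flow built above from the Kafka source
    handleWebSocketMessages(Flow.fromSinkAndSource(incomingMessages, outgoingMessages))
  }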
Hope it helps.