Akka Streams recreate stream in case of stage failure - scala

I have a very simple Akka Streams flow which reads a message from Kafka using Alpakka, performs some manipulation on the message and indexes it to Elasticsearch.
I'm using CommittableSource, so I'm following an at-least-once strategy. I commit my offset only when indexing to ES succeeds; if it fails, the message will be read again from the latest committed offset.
val decider: Supervision.Decider = {
  case _: Throwable => Supervision.Restart
  case _            => Supervision.Restart
}

val config: Config = context.system.settings.config.getConfig("akka.kafka.consumer")

val flow: Flow[CommittableMessage[String, String], Done, NotUsed] =
  Flow[CommittableMessage[String, String]]
    .map(msg => Event(msg.committableOffset, Success(Json.parse(msg.record.value()))))
    .mapAsync(10) { event => indexEvent(event.json.get).map(f => event.copy(json = f)) }
    .mapAsync(10)(f => {
      f.json match {
        case Success(_)  => f.committableOffset.commitScaladsl()
        case Failure(ex) => throw new StreamFailedException(ex.getMessage, ex)
      }
    })

val r: Flow[CommittableMessage[String, String], Done, NotUsed] = RestartFlow.onFailuresWithBackoff(
  minBackoff = 3.seconds,
  maxBackoff = 3.seconds,
  randomFactor = 0.2, // adds 20% "noise" to vary the intervals slightly
  maxRestarts = 20    // limits the amount of restarts to 20
)(() => {
  println("Creating flow")
  flow
})

val consumerSettings: ConsumerSettings[String, String] =
  ConsumerSettings(config, new StringDeserializer, new StringDeserializer)
    .withBootstrapServers("localhost:9092")
    .withGroupId("group1")
    .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

val restartSource: Source[CommittableMessage[String, String], NotUsed] = RestartSource.withBackoff(
  minBackoff = 3.seconds,
  maxBackoff = 30.seconds,
  randomFactor = 0.2, // adds 20% "noise" to vary the intervals slightly
  maxRestarts = 20    // limits the amount of restarts to 20
) { () =>
  Consumer.committableSource(consumerSettings, Subscriptions.topics("test"))
}

implicit val mat: ActorMaterializer =
  ActorMaterializer(ActorMaterializerSettings(context.system).withSupervisionStrategy(decider))

restartSource
  .via(flow)
  .toMat(Sink.ignore)(Keep.both).run()
What I would like to achieve is to restart the entire stream (Source -> Flow -> Sink) if for any reason I was not able to index a message in Elasticsearch.
I tried the following:
Supervision.Decider - it looks like the flow was recreated, but no message was pulled from Kafka, obviously because it remembers its offset.
RestartSource - doesn't work either, because the exception happens in the flow stage.
RestartFlow - doesn't help either, because it restarts only the Flow, but I need to restart the Source from the last successful offset.
Is there any elegant way to do that?

You can combine a restartable source, flow and sink. Nothing prevents you from making each part of the graph (source/flow/sink) restartable on its own.
Update:
code example
val sourceFactory = () => Source(1 to 10).via(Flow.fromFunction(x => { println("problematic flow"); x }))
RestartSource.withBackoff(4.seconds, 4.seconds, 0.2)(sourceFactory)
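Applied to the concrete case in the question, a minimal sketch of the same idea (untested; consumerSettings and flow are the definitions from the question). Wrapping the Kafka source and the processing flow in a single RestartSource means a failure anywhere before the sink rebuilds the consumer, which then resumes from the last committed offset:
// Untested sketch: the whole source + flow pipeline is rebuilt on every failure,
// so the new consumer re-reads from the last committed offset.
val restartingStream: Source[Done, NotUsed] =
  RestartSource.onFailuresWithBackoff(
    minBackoff = 3.seconds,
    maxBackoff = 30.seconds,
    randomFactor = 0.2,
    maxRestarts = 20
  ) { () =>
    Consumer
      .committableSource(consumerSettings, Subscriptions.topics("test"))
      .via(flow) // an exception thrown here fails the wrapped source and triggers a restart
  }

restartingStream.runWith(Sink.ignore)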

Related

Akka RestartSource does not restart

object TestSource {
  implicit val ec = ExecutionContext.global

  def main(args: Array[String]): Unit = {
    def buildSource = {
      println("fresh")
      Source(List(() => 1, () => 2, () => 3, () => {
        println("crash")
        throw new RuntimeException(":(((")
      }))
    }

    val restarting = RestartSource.onFailuresWithBackoff(
      minBackoff = Duration(1, SECONDS),
      maxBackoff = Duration(1, SECONDS),
      randomFactor = 0.0,
      maxRestarts = 10
    )(() => {
      buildSource
    })

    implicit val actorSystem: ActorSystem = ActorSystem()
    implicit val executionContext: ExecutionContext = actorSystem.dispatcher

    restarting.runWith(Sink.foreach(e => println(e())))
  }
}
The code above prints: 1,2,3, crash
Why does my source not restart?
This is pretty much a 1:1 copy of the official documentation.
edit:
I also tried
val rs = RestartSink.withBackoff[() => Int](
  Duration(1, SECONDS),
  Duration(1, SECONDS),
  0.0,
  10
)(_)

val rsDone = rs(() => {
  println("???")
  Sink.foreach(e => println(e()))
})

restarting.runWith(rsDone)
but I still get no restarts.
This is because the exception is triggered outside of the Source built by buildSource: it happens inside the Sink.foreach, when you call the functions emitted by the Source.
Try this:
val restarting = RestartSource.onFailuresWithBackoff(
  minBackoff = Duration(1, SECONDS),
  maxBackoff = Duration(1, SECONDS),
  randomFactor = 0.0,
  maxRestarts = 10
)(() => {
  buildSource
    .map(e => e()) // call the functions inside the RestartSource
})
That way your exception will happen inside the inner Source wrapped by RestartSource and the restarting mechanism will kick in.
The source doesn't restart because your source never fails, therefore never needs to restart.
The exception gets thrown when Sink.foreach evaluates the function it received.
As artur noted, if you can move the failing bit into the source, you can wrap everything up to the sink in the RestartSource.
While it won't help in this contrived example (restarting a sink doesn't resend previously sent messages), wrapping the sink in a RestartSink may be useful in real-world cases where this sort of thing can happen; off the top of my head, a stream from Kafka blowing up because the offset commit in a sink failed (e.g. after a rebalance) should be an example of such a case.
Alternatively, if you want to restart the whole stream when any part fails and the stream materializes a Future, you can implement retry-with-backoff on the failed future, as sketched below.
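A minimal sketch of that approach, assuming the stream materializes a Future[Done] (for example via Sink.ignore); the runStreamWithRetry helper and its parameters are made up for illustration:
import akka.Done
import akka.actor.ActorSystem
import akka.pattern.after
import scala.concurrent.Future
import scala.concurrent.duration._

def runStreamWithRetry(remaining: Int, backoff: FiniteDuration)(
    runStream: () => Future[Done])(implicit system: ActorSystem): Future[Done] = {
  import system.dispatcher
  runStream().recoverWith {
    case _ if remaining > 0 =>
      // wait for the backoff, then materialize a brand new stream (source, flow and sink)
      after(backoff, system.scheduler) {
        runStreamWithRetry(remaining - 1, backoff * 2)(runStream)
      }
  }
}

// e.g. runStreamWithRetry(remaining = 10, backoff = 1.second) { () =>
//   buildSource.map(e => e()).runWith(Sink.ignore)
// }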
Source just never crashes, as already said here.
You are actually crashing your sink, not the source, with this statement: e => e().
This happens when the lambda above is applied to the last element of the source:
java.lang.RuntimeException: :(((
Here's the same stream without an unhandled exception in the sink:
...
RestartSource.withBackoff(
...
restarting.runWith(
  Sink.foreach(e => {
    def i: Int = try { e() } catch {
      case t: Throwable =>
        println(t)
        -1
    }
    println(i)
  })
)
Works perfectly.

RestartSource masking the materialized value for the wrapped source?

I am modifying an existing stream graph by adding some retry logic around various pieces of functionality. One of those pieces is the source, which in this case happens to be a Kafka Consumer.committableSource from the Alpakka Kafka connector. Downstream, the graph expects a type of Source[ConsumerMessage.CommittableMessage[String, AnyRef], Control], but when I wrap the committable source in a RestartSource I end up with Source[ConsumerMessage.CommittableMessage[String, AnyRef], NotUsed].
I tried adding (Keep.both) at the end, but ended up with a compile-time error. Here are the two examples for reference:
val restartSource: Source[ConsumerMessage.CommittableMessage[String, AnyRef], NotUsed] =
  RestartSource.onFailuresWithBackoff(
    minBackoff = 3.seconds,
    maxBackoff = 60.seconds,
    randomFactor = .2
  ) { () => Consumer.committableSource(consumerSettings, subscription) }

val s: Source[ConsumerMessage.CommittableMessage[String, AnyRef], Control] =
  Consumer.committableSource(consumerSettings, subscription)
As you have observed, and as discussed in this currently open ticket, the materialized value of the original Source is not exposed in the return value of the wrapping RestartSource. To get around this, try using mapMaterializedValue (disclaimer: I didn't test the following):
val restartSource: Source[ConsumerMessage.CommittableMessage[String, AnyRef], Option[Control]] = {
  @volatile var control: Option[Control] = None

  RestartSource.onFailuresWithBackoff(
    minBackoff = 3.seconds,
    maxBackoff = 60.seconds,
    randomFactor = .2
  ) { () =>
    Consumer
      .committableSource(consumerSettings, subscription)
      .mapMaterializedValue { c =>
        // remember the Control of the most recent (re)materialization of the wrapped source
        control = Some(c)
      }
  }
  // note: the Control only becomes available once the wrapped source has actually started
  .mapMaterializedValue(_ => control)
}
You could preMaterialize the Source which will yield the Control like so:
Pair<Consumer.Control, Source<ConsumerMessage.CommittableOffset, NotUsed>> controlSourcePair =
    origSrc.preMaterialize(materializer);

Source<ConsumerMessage.CommittableOffset, NotUsed> source =
    RestartSource.withBackoff(
        Duration.ofSeconds(1),
        Duration.ofSeconds(10),
        0.2,
        20,
        controlSourcePair::second);

source
    .toMat(Committer.sink(CommitterSettings.create(system).withMaxBatch(1)), Keep.both())
    .mapMaterializedValue(pair ->
        Consumer.createDrainingControl(
            new Pair<>(controlSourcePair.first(), pair.second())))
    .run(materializer);
Apologies for not providing you with the Scala equivalent.
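For reference, a rough Scala sketch of the same preMaterialize idea (untested; consumerSettings and subscription are the values from the question, and an implicit Materializer / ActorSystem needs to be in scope):
val (control, preMaterializedSource) =
  Consumer
    .committableSource(consumerSettings, subscription)
    .preMaterialize()

val restartable: Source[ConsumerMessage.CommittableMessage[String, AnyRef], NotUsed] =
  RestartSource.withBackoff(
    minBackoff = 1.second,
    maxBackoff = 10.seconds,
    randomFactor = 0.2,
    maxRestarts = 20
  )(() => preMaterializedSource)

// `control` (the Consumer.Control) is available up front, independent of the
// RestartSource's materialized value, and can be used for shutdown/draining.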

Akka streams Source.actorRef vs Source.queue vs buffer, which one to use?

I am using akka-streams-kafka to create a stream consumer from a Kafka topic.
I'm using a broadcast to serve events from the Kafka topic to WebSocket clients.
I have found the following three approaches to create a stream Source.
Question:
My goal is to serve hundreds/thousands of WebSocket clients (some of which might be slow consumers). Which approach scales better?
I'd appreciate any thoughts.
A broadcast lowers the rate down to the slowest consumer.
val BUFFER_SIZE = 100000
Source.actorRef (the source actor does not support the backpressure option)
val kafkaSourceActorWithBroadcast = {
  val (sourceActorRef, kafkaSource) = Source.actorRef[String](BUFFER_SIZE, OverflowStrategy.fail)
    .toMat(BroadcastHub.sink(bufferSize = 256))(Keep.both).run
  Consumer.plainSource(consumerSettings, Subscriptions.topics(KAFKA_TOPIC))
    .runForeach(record => sourceActorRef ! Util.toJson(record.value()))
  kafkaSource
}
Source.queue
val kafkaSourceQueueWithBroadcast = {
  val (futureQueue, kafkaQueueSource) = Source.queue[String](BUFFER_SIZE, OverflowStrategy.backpressure)
    .toMat(BroadcastHub.sink(bufferSize = 256))(Keep.both).run
  Consumer.plainSource(consumerSettings, Subscriptions.topics(KAFKA_TOPIC))
    .runForeach(record => futureQueue.offer(Util.toJson(record.value())))
  kafkaQueueSource
}
buffer
val kafkaSourceWithBuffer = Consumer.plainSource(consumerSettings, Subscriptions.topics(KAFKA_TOPIC))
  .map(record => Util.toJson(record.value()))
  .buffer(BUFFER_SIZE, OverflowStrategy.backpressure)
  .toMat(BroadcastHub.sink(bufferSize = 256))(Keep.right).run
Websocket route code for completeness:
val streamRoute =
  path("stream") {
    handleWebSocketMessages(websocketFlow)
  }

def websocketFlow(where: String): Flow[Message, Message, NotUsed] = {
  Flow[Message]
    .collect {
      case TextMessage.Strict(msg) => Future.successful(msg)
      case TextMessage.Streamed(stream) =>
        stream.runFold("")(_ + _).flatMap(msg => Future.successful(msg))
    }
    .mapAsync(parallelism = PARALLELISM)(identity)
    .via(logicStreamFlow)
    .map { msg: String => TextMessage.Strict(msg) }
}

private def logicStreamFlow: Flow[String, String, NotUsed] =
  Flow.fromSinkAndSource(Sink.ignore, kafkaSourceActorWithBroadcast)

Kafka producer hangs on send

The logic is that a streaming job, getting data from a custom source, has to write both to Kafka and to HDFS.
I wrote a (very) basic Kafka producer to do this; however, the whole streaming job hangs on the send method.
class KafkaProducer(val kafkaBootstrapServers: String, val kafkaTopic: String,
                    val sslCertificatePath: String, val sslCertificatePassword: String) {

  val kafkaProps: Properties = new Properties()
  kafkaProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaBootstrapServers)
  kafkaProps.put("acks", "1")
  kafkaProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
  kafkaProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
  kafkaProps.put("ssl.truststore.location", sslCertificatePath)
  kafkaProps.put("ssl.truststore.password", sslCertificatePassword)

  val kafkaProducer: KafkaProducer[Long, Array[String]] = new KafkaProducer(kafkaProps)

  def sendKafkaMessage(message: Message): Unit = {
    message.data.foreach(list => {
      val producerRecord: ProducerRecord[Long, Array[String]] =
        new ProducerRecord[Long, Array[String]](kafkaTopic, message.timeStamp.getTime, list.toArray)
      kafkaProducer.send(producerRecord)
    })
  }
}
And the code calling the producer:
receiverStream.foreachRDD(rdd => {
  val messageRowRDD: RDD[Row] = rdd.mapPartitions(partition => {
    val parser: Parser = new Parser
    val kafkaProducer: KafkaProducer =
      new KafkaProducer(kafkaBootstrapServers, kafkaTopic, kafkaSslCertificatePath, kafkaSslCertificatePass)
    val newPartition = partition.map(message => {
      Logger.getLogger("importer").error("Writing Message to Kafka...")
      kafkaProducer.sendKafkaMessage(message)
      Logger.getLogger("importer").error("Finished writing Message to Kafka")
      message.data.map(singleMessage => parser.parseMessage(message.timeStamp.getTime, singleMessage))
    })
    newPartition.flatten
  })

  val df = sqlContext.createDataFrame(messageRowRDD, Schema.messageSchema)
  Logger.getLogger("importer").info("Entries-count: " + df.count())
  val row = Try(df.first)
  row match {
    case Success(s) => Persister.writeDataframeToDisk(df, outputFolder)
    case Failure(e) => Logger.getLogger("importer").warn("Resulting DataFrame is empty. Nothing can be written")
  }
})
From the logs I can tell that each executor is reaching the "sending to Kafka" point, but not getting any further. All executors hang on that, and no exception is thrown.
The Message class is a very simple case class with two fields: a timestamp and an array of strings.
This was due to the acks setting of the Kafka producer.
Once acks was set to 1, sends went ahead a lot faster.
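For anyone debugging a similar hang, a hedged sketch of how to make a stuck or failing send visible using the standard KafkaProducer API (producerRecord and kafkaProducer refer to the snippet above; this is illustrative, not part of the original fix):
import java.util.concurrent.TimeUnit
import org.apache.kafka.clients.producer.{Callback, RecordMetadata}

// Option 1: block on the returned Future with a timeout so broker/SSL problems
// surface as an exception instead of an indefinite hang
val metadata: RecordMetadata = kafkaProducer.send(producerRecord).get(30, TimeUnit.SECONDS)

// Option 2: attach a callback that logs the outcome of every send
kafkaProducer.send(producerRecord, new Callback {
  override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit =
    if (exception != null) exception.printStackTrace()
})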

AKKA Streams: Performance degradation after connecting to AMQP

I have been trying to create an application that runs a recursive task (a crawler) with Akka Streams and the QPID message broker.
What I have noticed is that the separate parts of the graph perform quite well on their own, but when connected together the performance drops significantly.
Here are the statistics for the graph running on my local machine:
sending messages to the message queue can achieve > 1700 msg/sec;
making HTTP requests: approx. 70 req/sec;
the whole graph, including reading messages from the queue: 2-4 items/sec.
Source code for the pipeline can be found here:
https://gist.github.com/volisoft/3617824b16a3f3b6e01c933a8bdf8049
The pipeline is straightforward:
def main(args: Array[String]): Unit = {
  startBroker()
  val queueName = "amqp-conn-it-spec-simple-queue-" + System.currentTimeMillis()
  val queueDeclaration = QueueDeclaration(queueName)

  val in = AmqpSource(
    NamedQueueSourceSettings(AmqpConnectionDetails("localhost", 5672, Some(AmqpCredentials("guest", "guest"))), queueName)
      .withDeclarations(queueDeclaration),
    bufferSize = 1028
  ).map(_.bytes.utf8String).log(":in")

  val out = AmqpSink.simple(
    AmqpSinkSettings(AmqpConnectionDetails("localhost", 5672, Some(AmqpCredentials("guest", "guest"))))
      .withRoutingKey(queueName).withDeclarations(queueDeclaration))

  val urlsSink = Flow[String].map(ByteString(_)).to(out)

  val g = RunnableGraph.fromGraph(GraphDSL.create(in, urlsSink)((_, _)) { implicit b => (in, urlsSink0) =>
    import GraphDSL.Implicits._

    val pool = Http().superPool[String]()(materializer).log(":pool")

    val download: Flow[String, Document, NotUsed] = Flow[String]
      .map(url => (HttpRequest(method = HttpMethods.GET, Uri(url)), url))
      .via(pool)
      .mapAsyncUnordered(8) { case (Success(response: HttpResponse), url) => parse(response, url) }

    val filter = Flow[String].filter(notVisited).log(":filter")
    val save = Flow[String].map(saveVisited)
    val extractLinks: Flow[Document, String, NotUsed] = Flow[Document].mapConcat(document => getUrls(document))

    in ~> save ~> download ~> extractLinks ~> filter ~> urlsSink0

    ClosedShape
  })

  g.run()

  Source.single(rootUrl).map(s => ByteString(s)).runWith(out)
}
How can this code be optimized to increase performance?