Kafka Ensure At Least Once - scala

First project with Kafka, trying to prove that an event will get processed at least once. So far, not seeing evidence that processing is retried.
Structure of dummy app is simple: subscribe, process, publish, commit; if exception, abort transaction and hope it gets retried. I am logging every message.
I expect to see (1) "process messageX" (2) "error for messageX" (3) "process messageX". Instead, I see processing continue beyond messageX, i.e. it does not get re-processed.
What I see is: (1) "process messageX" (2) "error for messageX" (3) "process someOtherMessage".
Using Kafka 2.7.0, Scala 2.12.
What am I missing? Showing relevant parts of dummy app below.
I also tried by removing the producer from the code (and all references to it).
UPDATE 1: I managed to get records re-processed by using the offsets with consumer.seek(), i.e. sending the consumer back to the start of the batch of records. Not sure why simply NOT reaching consumer.commitSync() (because of an exception) does not do this already.
import com.myco.somepackage.{MyEvent, KafkaConfigTxn}
import org.apache.kafka.clients.consumer.{ConsumerRecords, KafkaConsumer, OffsetAndMetadata}
import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.common.{KafkaException, TopicPartition}
import org.slf4j.LoggerFactory
import java.util
import scala.collection.JavaConverters._
import scala.util.control.NonFatal
// Prove that a message can be re-processed if there is an exception
object TopicDrainApp {
private val logger = LoggerFactory.getLogger(this.getClass)
private val subTopic = "input.topic"
private val pubTopic = "output.topic"
val producer = new KafkaProducer[String, String](KafkaConfigTxn.producerProps)
producer.initTransactions()
val consumer = new KafkaConsumer[String, String](KafkaConfigTxn.consumerProps)
private var lastEventMillis = System.currentTimeMillis
private val pollIntervalMillis = 1000
private val pollDuration = java.time.Duration.ofMillis(pollIntervalMillis)
def main(args: Array[String]): Unit = {
subscribe(subTopic)
}
def subscribe(subTopic: String): Unit = {
consumer.subscribe(util.Arrays.asList(subTopic))
while (System.currentTimeMillis - lastEventMillis < 5000L) {
try {
val records: ConsumerRecords[String, String] = consumer.poll(pollDuration)
records.asScala.foreach { record =>
try {
lastEventMillis = System.currentTimeMillis
val event = MyEvent.deserialize(record.value())
logger.info("ReceivedMyEvent:" + record.value())
producer.beginTransaction()
simulateProcessing(event) // [not shown] throw exception to test re-processing
producer.flush()
val offsetsToCommit = getOffsetsToCommit(records)
//consumer.commitSync() // tried this; does not work
//producer.sendOffsetsToTransaction(offsetsToCommit, "group1") // tried this; does not work
producer.commitTransaction()
} catch {
case e: KafkaException => logger.error(s"rollback ${record.value()}", e)
producer.abortTransaction()
}
}
} catch {
case NonFatal(e) => logger.error(e.getMessage, e)
}
}
}
private def getOffsetsToCommit(records: ConsumerRecords[String, String]): util.Map[TopicPartition, OffsetAndMetadata] = {
records.partitions().asScala.map { partition =>
val partitionedRecords = records.records(partition)
val offset = partitionedRecords.get(partitionedRecords.size - 1).offset
(partition, new OffsetAndMetadata(offset + 1))
}.toMap.asJava
}
}
object KafkaConfigTxn {
// Only relevant properties are shown
def commonProperties: Properties = {
val props = new Properties()
props.put(CommonClientConfigs.CLIENT_ID_CONFIG, "...")
props.put(CommonClientConfigs.GROUP_ID_CONFIG, "...")
props
}
def producerProps: Properties = {
val props = new Properties()
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true") // "enable.idempotence"
props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "...") // "transactional.id"
props.put(ProducerConfig.ACKS_CONFIG, "all")
props.put(ProducerConfig.RETRIES_CONFIG, "3")
commonProperties.asScala.foreach { case (k, v) => props.put(k, v) }
props
}
def consumerProps: Properties = {
val props = new Properties()
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false")
props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed") // "isolation.level"
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
commonProperties.asScala.foreach { case (k, v) => props.put(k, v) }
props
}
}

according to the reference I gave you , you need to use sendOffsetsToTransaction in the process, but again your consumer won't get the message of the aborted transcation as you are reading only committed transcation
Transcations were introduced in order to allow exactly once processing between kafka to kafka, being said that kafka supported from day one delivery semantics of at least once and at most once,
To get at least once behavior you disable auto commit and commit when processing had finished successfully that way next time you call poll() if you had exception before commit you will read again the records from the last commit offset
To get at most once behavior you commit before processing starts that way if exception happens , next time you call poll() you get new messages (but lose the other messages)
Exactly once is the most hard to achieve in plain java (not talking on spring framework which makes everything easier) - it involves saving offsets to external db ( usually where you process is done) and reading from there on startup/rebalance
For transcation usage example in java you might read this excellent guide by baeldung
https://www.baeldung.com/kafka-exactly-once

Figured out the correct combination of method calls (subscribe, beginTransaction, process, commit / abortTransaction, etc.), for a demo app. The core of the code is
def readProcessWrite(subTopic: String, pubTopic: String): Int = {
var lastEventMillis = System.currentTimeMillis
val consumer = createConsumer(subTopic)
val producer = createProducer()
val groupMetadata = consumer.groupMetadata()
var numRecords = 0
while (System.currentTimeMillis - lastEventMillis < 10000L) {
try {
val records: ConsumerRecords[String, String] = consumer.poll(pollDuration)
val offsetsToCommit = getOffsetsToCommit(records)
// println(s">>> PollRecords: ${records.count()}")
records.asScala.foreach { record =>
val currentOffset = record.offset()
try {
numRecords += 1
lastEventMillis = System.currentTimeMillis
println(s">>> Topic: $subTopic, ReceivedEvent: offset=${record.offset()}, key=${record.key()}, value=${record.value()}")
producer.beginTransaction()
val eventOut = simulateProcessing(record.value()) // may throw
publish(producer, pubTopic, eventOut)
producer.sendOffsetsToTransaction(offsetsToCommit, groupMetadata)
consumer.commitSync()
producer.commitTransaction()
} catch {
case e: KafkaException => println(s"---------- rollback ${record.value()}", e)
producer.abortTransaction()
offsetsToCommit.forEach { case (topicPartition, _) =>
consumer.seek(topicPartition, currentOffset)
}
}
}
} catch {
case NonFatal(e) => logger.error(e.getMessage, e)
}
}
consumer.close()
producer.close()
numRecords
}
// Consumer created with props.put("max.poll.records", "1")
I was able to prove that this will process each event exactly once, even when simulateProcessing() throws an exception. To be precise: when processing works fine, each event is processed exactly once. If there is an exception, the event is re-processed until success. In my case, there is no real reason for the exceptions, so re-processing will always end in success.

Related

Akka streams websocket stream things to a Sink.seq ends with exception SubscriptionWithCancelException$StageWasCompleted

I'm failing to materialize the Sink.seq, when it comes time to materialize I fail with this exception
akka.stream.SubscriptionWithCancelException$StageWasCompleted$:
Here is the full source code on github: https://github.com/Christewart/bitcoin-s-core/blob/aaecc7c180e5cc36ec46d73d6b2b0b0da87ab260/app/server-test/src/test/scala/org/bitcoins/server/WebsocketTests.scala#L51
I'm attempting to aggregate all elements being pushed out of a websocket into a Sink.seq. I have to a bit of json transformation before I aggreate things inside of Sink.seq.
val endSink: Sink[WalletNotification[_], Future[Seq[WalletNotification[_]]]] =
Sink.seq[WalletNotification[_]]
val sink: Sink[Message, Future[Seq[WalletNotification[_]]]] = Flow[Message]
.map {
case message: TextMessage.Strict =>
//we should be able to parse the address message
val text = message.text
val notification: WalletNotification[_] = {
upickle.default.read[WalletNotification[_]](text)(
WsPicklers.walletNotificationPickler)
}
logger.info(s"Notification=$notification")
notification
case msg =>
logger.error(s"msg=$msg")
sys.error("")
}
.log(s"### endSink ###")
.toMat(endSink)(Keep.right)
val f: Flow[
Message,
Message,
(Future[Seq[WalletNotification[_]]], Promise[Option[Message]])] = {
Flow
.fromSinkAndSourceMat(sink, Source.maybe[Message])(Keep.both)
}
val tuple: (
Future[WebSocketUpgradeResponse],
(Future[Seq[WalletNotification[_]]], Promise[Option[Message]])) = {
Http()
.singleWebSocketRequest(req, f)
}
val walletNotificationsF: Future[Seq[WalletNotification[_]]] =
tuple._2._1
val promise: Promise[Option[Message]] = tuple._2._2
logger.info(s"Requesting new address for expectedAddrStr")
val expectedAddressStr = ConsoleCli
.exec(CliCommand.GetNewAddress(labelOpt = None), cliConfig)
.get
val expectedAddress = BitcoinAddress.fromString(expectedAddressStr)
promise.success(None)
logger.info(s"before notificationsF")
//hangs here, as the future never gets completed, fails with an exception
for {
notifications <- walletNotificationsF
_ = logger.info(s"after notificationsF")
} yield {
//assertions in here...
}
What am i doing wrong?
To keep the client connection open you need "more code", sth like this:
val sourceKickOff = Source
.single(TextMessage("kick off msg"))
// Keeps the connection open
.concatMat(Source.maybe[Message])(Keep.right)
See full working example, which consumes msgs from a server:
https://github.com/pbernet/akka_streams_tutorial/blob/b6d4c89a14bdc5d72c557d8cede59985ca8e525f/src/main/scala/akkahttp/WebsocketEcho.scala#L280
The problem is this line
Flow.fromSinkAndSourceMat(sink, Source.maybe[Message])(Keep.both)
it needs to be
Flow.fromSinkAndSourceCoupledMat(sink, Source.maybe[Message])(Keep.both)
When the stream is terminated, the Coupled part of the materialized flow will make sure to terminate the Sink downstream.

How to iterate over key values of a Kafka Streams Table

I'm new to Kafka streams and I tried to iterate over items in a kafka Streams table via the keyValueStore:
The idea is to create a Ktable (I've also tried with a globalKTable) with a KeyValueStore.
Then a separated thread is in charge to read the content of the KeyValueStore in order to iterate over last value of each key.
val streamProperties: Properties = {
val p = new Properties()
p.put(StreamsConfig.APPLICATION_ID_CONFIG, "test-application")
p.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, config.getStringList("kafka.bootstrap.servers").toList.mkString(","))
p.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String.getClass.getName)
p.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.ByteArray.getClass.getName)
p.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
p
}
val builder: StreamsBuilder = new StreamsBuilder()
import org.apache.kafka.streams.kstream.Materialized
import org.apache.kafka.streams.state.KeyValueStore
val globalTable = builder.table("test",
Materialized
.as[String, Array[Byte], KeyValueStore[org.apache.kafka.common.utils.Bytes, Array[Byte]]]("TestStore")
.withCachingDisabled()
.withKeySerde(Serdes.String())
.withValueSerde(Serdes.ByteArray())
)
val streams: KafkaStreams = new KafkaStreams(builder.build(), streamProperties)
streams.start()
val ex = new ScheduledThreadPoolExecutor(1)
val task = new Runnable {
def run() = {
println("\n\n\n tick \n\n\n")
try {
val keyValueStore = streams.store(globalTable.queryableStoreName(), QueryableStoreTypes.keyValueStore())
keyValueStore.all().toIterator.map { k =>
print(k.key)
}
} catch {
case _ => println("error")
}
}
}
val f = ex.scheduleAtFixedRate(task, 1, 10, TimeUnit.SECONDS)
}
}
In the thread the keyValueStore stays empty even when I produce items on topic "test".
Is there something I missed or didn't understand?
One thing missing is state directory location config:
p.put(StreamsConfig.STATE_DIR_CONFIG, "/tmp")
Without it Kafka Streams would not throw exception, but stateful things like global KTables would silently not work.

Kafka producer hangs on send

The logic is that a streaming job, getting data from a custom source has to write both to Kafka as well as HDFS.
I wrote a (very) basic Kafka producer to do this, however the whole streaming job hangs on the send method.
class KafkaProducer(val kafkaBootstrapServers: String, val kafkaTopic: String, val sslCertificatePath: String, val sslCertificatePassword: String) {
val kafkaProps: Properties = new Properties()
kafkaProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaBootstrapServers)
kafkaProps.put("acks", "1")
kafkaProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
kafkaProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
kafkaProps.put("ssl.truststore.location", sslCertificatePath)
kafkaProps.put("ssl.truststore.password", sslCertificatePassword)
val kafkaProducer: KafkaProducer[Long, Array[String]] = new KafkaProducer(kafkaProps)
def sendKafkaMessage(message: Message): Unit = {
message.data.foreach(list => {
val producerRecord: ProducerRecord[Long, Array[String]] = new ProducerRecord[Long, Array[String]](kafkaTopic, message.timeStamp.getTime, list.toArray)
kafkaProducer.send(producerRecord)
})
}
}
And the code calling the producer:
receiverStream.foreachRDD(rdd => {
val messageRowRDD: RDD[Row] = rdd.mapPartitions(partition => {
val parser: Parser = new Parser
val kafkaProducer: KafkaProducer = new KafkaProducer(kafkaBootstrapServers, kafkaTopic, kafkaSslCertificatePath, kafkaSslCertificatePass)
val newPartition = partition.map(message => {
Logger.getLogger("importer").error("Writing Message to Kafka...")
kafkaProducer.sendKafkaMessage(message)
Logger.getLogger("importer").error("Finished writing Message to Kafka")
Message.data.map(singleMessage => parser.parseMessage(Message.timeStamp.getTime, singleMessage))
})
newPartition.flatten
})
val df = sqlContext.createDataFrame(messageRowRDD, Schema.messageSchema)
Logger.getLogger("importer").info("Entries-count: " + df.count())
val row = Try(df.first)
row match {
case Success(s) => Persister.writeDataframeToDisk(df, outputFolder)
case Failure(e) => Logger.getLogger("importer").warn("Resulting DataFrame is empty. Nothing can be written")
}
})
From the logs I can tell that each executor is reaching the "sending to kafka" point, however not any further. All executors hang on that and no exception is thrown.
The Message class is a very simple case class with 2 fields, a timestamp and an array of strings.
This was due to the acks field in Kafka.
Acks was set to 1 and sends went ahead a lot faster.

stopping spark streaming after reading first batch of data

I am using spark streaming to consume kafka messages. I want to get some messages as sample from kafka instead of reading all messages. So I want to read a batch of messages, return them to caller and stopping spark streaming. Currently I am passing batchInterval time in awaitTermination method of spark streaming context method. I don't now how to return processed data to caller from spark streaming. Here is my code that I am using currently
def getsample(params: scala.collection.immutable.Map[String, String]): Unit = {
if (params.contains("zookeeperQourum"))
zkQuorum = params.get("zookeeperQourum").get
if (params.contains("userGroup"))
group = params.get("userGroup").get
if (params.contains("topics"))
topics = params.get("topics").get
if (params.contains("numberOfThreads"))
numThreads = params.get("numberOfThreads").get
if (params.contains("sink"))
sink = params.get("sink").get
if (params.contains("batchInterval"))
interval = params.get("batchInterval").get.toInt
val sparkConf = new SparkConf().setAppName("KafkaConsumer").setMaster("spark://cloud2-server:7077")
val ssc = new StreamingContext(sparkConf, Seconds(interval))
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
var consumerConfig = scala.collection.immutable.Map.empty[String, String]
consumerConfig += ("auto.offset.reset" -> "smallest")
consumerConfig += ("zookeeper.connect" -> zkQuorum)
consumerConfig += ("group.id" -> group)
var data = KafkaUtils.createStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](ssc, consumerConfig, topicMap, StorageLevel.MEMORY_ONLY).map(_._2)
val streams = data.window(Seconds(interval), Seconds(interval)).map(x => new String(x))
streams.foreach(rdd => rdd.foreachPartition(itr => {
while (itr.hasNext && size >= 0) {
var msg=itr.next
println(msg)
sample.append(msg)
sample.append("\n")
size -= 1
}
}))
ssc.start()
ssc.awaitTermination(5000)
ssc.stop(true)
}
So instead of saving messages in a String builder called "sample" I want to return to caller.
You can implement a StreamingListener and then inside it, onBatchCompleted you can call ssc.stop()
private class MyJobListener(ssc: StreamingContext) extends StreamingListener {
override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted) = synchronized {
ssc.stop(true)
}
}
This is how you attach your SparkStreaming to the JobListener:
val listen = new MyJobListener(ssc)
ssc.addStreamingListener(listen)
ssc.start()
ssc.awaitTermination()
We can get sample messages using following piece of code
var sampleMessages=streams.repartition(1).mapPartitions(x=>x.take(10))
and if we want to stop after first batch then we should implement our own StreamingListener interface and should stop streaming in onBatchCompleted method.

request-reply with akka-camel and ActiveMQ

Update: It would seem that an even simpler test case is not working: just trying to send a message from an ActiveMQ producer to an ActiveMQ consumer via the in-process broker. Here is the code:
val brokerURL = "vm://localhost?broker.persistent=false"
val connectionFactory = new ActiveMQConnectionFactory(brokerURL)
val connection = connectionFactory.createConnection()
val session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE)
val queue = session.createQueue("foo.bar")
val producer = session.createProducer(queue)
val consumer = session.createConsumer(queue)
val message = session.createTextMessage("marco")
producer.send(message)
val resp = consumer.receive(2000)
assert(resp != null)
I'm trying to implement a very simple request-reply pattern using akka-camel. Here's my (testbench) code which is trying to use activeMQ directly to send a message and expect a response:
val brokerURL = "vm://localhost?broker.persistent=false"
// create in-process broker, session, queue, etc...
val connectionFactory = new ActiveMQConnectionFactory(brokerURL)
val connection = connectionFactory.createConnection()
val session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE)
val queue = session.createQueue("myapp.somequeue")
val producer = session.createProducer(queue)
val tempDest = session.createTemporaryQueue()
val respConsumer = session.createConsumer(tempDest)
val message = session.createTextMessage("marco")
message.setJMSReplyTo(tempDest)
message.setJMSCorrelationID("myCorrelationID")
// create actor system with CamelExtension
val camel = CamelExtension(system)
val camelContext = camel.context
camelContext.addComponent("activemq", ActiveMQComponent.activeMQComponent(brokerURL))
val listener = system.actorOf(Props[Frontend])
// send a message, expect a response
producer.send(message)
val resp: TextMessage = respConsumer.receive(5000).asInstanceOf[TextMessage]
assert(resp.getText() == "polo")
I've tried two different approaches for the Consumer actor. The first is simpler, which attempts to respond using sender !:
class Frontend extends Actor with Consumer {
def endpointUri = "activemq:myapp.somequeue"
override def autoAck = false
def receive = {
case msg: CamelMessage => {
println("received %s" format msg.bodyAs[String])
sender ! "polo"
}
}
}
The second attempts to reply using the CamelTemplate:
class Frontend extends Actor with Consumer {
def endpointUri = "activemq:myapp.somequeue"
override def autoAck = false
def receive = {
case msg: CamelMessage => {
println("received %s" format msg.bodyAs[String])
val replyTo = msg.getHeaderAs("JMSReplyTo", classOf[ActiveMQTempQueue], camelContext)
val correlationId = msg.getHeaderAs("JMSCorrelationID", classOf[String], camelContext)
camel.template.sendBodyAndHeader("activemq:"+replyTo.getQueueName(), "polo", "JMSCorrelationID", correlationId)
}
}
}
I do see the println() output from my actor's receive method, so the ActiveMQ message is getting into the actor, but I get a timeout on the respConsumer.receive() call in the testbench. I've tried lots of combinations of specifying and not specifying headers in the reply. I've also tried enabling and disabling autoAck.
Thanks in advance.
Turns out I needed to call connection.start() in the JMS code.