Producer lost some messages on Kafka restart - apache-kafka

Kafka Client : 0.11.0.0-cp1
Kafka Broker :
During a Kafka broker rolling restart, our application lost some messages while sending to the broker. I believe a rolling restart should not cause any message loss. These are the producer settings we are using (the producer uses asynchronous send() without a callback or inspecting the returned Future):
val acksConfig: String = "all",
val retriesConfig: Int = Int.MAX_VALUE,
val retriesBackOffConfig: Int = 1000,
val batchSize: Int = 32768,
val lingerTime: Int = 1,
val maxBlockTime: Int = Int.MAX_VALUE,
val requestTimeOut: Int = 420000,
val bufferMemory: Int = 33_554_432,
val compressionType: String = "gzip",
val keySerializer: Class<StringSerializer> = StringSerializer::class.java,
val valueSerializer: Class<ByteArraySerializer> = ByteArraySerializer::class.java
I am seeing warnings like this in the logs:
2019-03-19 17:30:59,224 [org.apache.kafka.clients.producer.internals.Sender] [kafka-producer-network-thread | producer-1] (Sender.java:511) WARN org.apache.kafka.clients.producer.internals.Sender - Got error produce response with correlation id 1105790 on topic-partition catapult_on_entitlement_updates_prod-67, retrying (2147483643 attempts left). Error: NOT_LEADER_FOR_PARTITION
But the log says there are retry attempts left, so I am curious why it didn't retry. Does anyone have any idea?

Two things to note:
What is the replication factor of the topic you are producing to, and what is the configured min.insync.replicas?
What do you mean by "the producer lost some messages"? If the producer cannot successfully produce to min.insync.replicas brokers, it will throw an exception and fail (for synchronous production). It is up to the producer/client to retry in case of failure (whether producing synchronously or asynchronously).
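For the asynchronous case, the simplest way to see which records are actually lost is to pass a Callback to send(); otherwise the producer can exhaust its retries (or hit a non-retriable error) and the application never notices. A minimal sketch in Scala, where the broker address and the helper name are assumptions and not part of the question:
import java.util.Properties
import org.apache.kafka.clients.producer.{Callback, KafkaProducer, ProducerConfig, ProducerRecord, RecordMetadata}
import org.apache.kafka.common.serialization.{ByteArraySerializer, StringSerializer}

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker address
props.put(ProducerConfig.ACKS_CONFIG, "all")
props.put(ProducerConfig.RETRIES_CONFIG, Int.MaxValue.toString)
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[ByteArraySerializer].getName)

val producer = new KafkaProducer[String, Array[Byte]](props)

def sendWithVisibility(topic: String, key: String, value: Array[Byte]): Unit =
  producer.send(new ProducerRecord(topic, key, value), new Callback {
    override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit =
      if (exception != null)
        // Retries were exhausted or the error was non-retriable; without this
        // callback the record would disappear silently.
        println(s"Send failed for key=$key: ${exception.getMessage}")
  })
With acks=all and retries=Int.MAX_VALUE, the remaining loss paths are mostly non-retriable failures, and those only become visible through this callback or through the Future returned by send().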

Related

Kafka transaction: Receiving CONCURRENT_TRANSACTIONS on AddPartitionsToTxnRequest

I am trying to publish a message to 16 Kafka partitions on 7 brokers, inside a transaction.
The flow is like this:
open transaction
write a message to 16 partitions
commit transaction
sleep 25 ms
repeat
Sometimes the transaction takes over 1 second to complete, with an average of 50 ms.
After enabling trace logging on the producer's side, I noticed the following error:
TRACE internals.TransactionManager [kafka-producer-network-thread | producer-1] - [Producer clientId=producer-1, transactionalId=cma-2]
Received transactional response AddPartitionsToTxnResponse(errors={modelapp-ecb-0=CONCURRENT_TRANSACTIONS, modelapp-ecb-9=CONCURRENT_TRANSACTIONS, modelapp-ecb-10=CONCURRENT_TRANSACTIONS, modelapp-ecb-11=CONCURRENT_TRANSACTIONS, modelapp-ecb-12=CONCURRENT_TRANSACTIONS, modelapp-ecb-13=CONCURRENT_TRANSACTIONS, modelapp-ecb-14=CONCURRENT_TRANSACTIONS, modelapp-ecb-15=CONCURRENT_TRANSACTIONS, modelapp-ecb-1=CONCURRENT_TRANSACTIONS, modelapp-ecb-2=CONCURRENT_TRANSACTIONS, modelapp-ecb-3=CONCURRENT_TRANSACTIONS, modelapp-ecb-4=CONCURRENT_TRANSACTIONS, modelapp-ecb-5=CONCURRENT_TRANSACTIONS, modelapp-ecb-6=CONCURRENT_TRANSACTIONS, modelapp-ecb-=CONCURRENT_TRANSACTIONS, modelapp-ecb-8=CONCURRENT_TRANSACTIONS}, throttleTimeMs=0)
for request (type=AddPartitionsToTxnRequest, transactionalId=cma-2, producerId=59003, producerEpoch=0, partitions=[modelapp-ecb-0, modelapp-ecb-9, modelapp-ecb-10, modelapp-ecb-11, modelapp-ecb-12, modelapp-ecb-13, modelapp-ecb-14, modelapp-ecb-15, modelapp-ecb-1, modelapp-ecb-2, modelapp-ecb-3, modelapp-ecb-4, modelapp-ecb-5, modelapp-ecb-6, modelapp-ecb-7, modelapp-ecb-8])
The Kafka producer retries sending AddPartitionsToTxnRequest(s) several times until it succeeds, but this leads to delays.
The code looks like this:
Properties producerProperties = PropertiesUtil.readPropertyFile(_producerPropertiesFile);
_producer = new KafkaProducer<>(producerProperties);
_producer.initTransactions();
_producerService = Executors.newSingleThreadExecutor(new NamedThreadFactory(getClass().getSimpleName()));
_producerService.submit(() -> {
    while (!Thread.currentThread().isInterrupted()) {
        try {
            _producer.beginTransaction();
            for (int partition = 0; partition < _numberOfPartitions; partition++)
                _producer.send(new ProducerRecord<>(_producerTopic, partition, KafkaRecordKeyFormatter.formatControlMessageKey(_messageNumber, token), EMPTY_BYTE_ARRAY));
            _producer.commitTransaction();
            _messageNumber++;
            Thread.sleep(_timeBetweenProducedMessagesInMillis);
        } catch (ProducerFencedException | OutOfOrderSequenceException | AuthorizationException | UnsupportedVersionException e) {
            closeProducer();
            break;
        } catch (KafkaException e) {
            _producer.abortTransaction();
        } catch (InterruptedException e) {...}
    }
});
Looking at the broker's code, it seems there are two cases in which this error is thrown, but I cannot tell why I end up in either of them:
object TransactionCoordinator {
  ...
  def handleAddPartitionsToTransaction(...): Unit = {
    ...
    if (txnMetadata.pendingTransitionInProgress) {
      // return a retriable exception to let the client backoff and retry
      Left(Errors.CONCURRENT_TRANSACTIONS)
    } else if (txnMetadata.state == PrepareCommit || txnMetadata.state == PrepareAbort) {
      Left(Errors.CONCURRENT_TRANSACTIONS)
    }
    ...
  }
  ...
}
Thanks in advance for help!
Later edit:
Enabling trace logging on the broker, we were able to see that the broker sends the END_TXN response to the producer before the transaction reaches the CompleteCommit state. The producer is then able to start a new transaction, which is rejected by the broker while it is still in the PrepareCommit -> CompleteCommit transition.
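Since CONCURRENT_TRANSACTIONS is a retriable error, the producer appears to simply wait retry.backoff.ms (100 ms by default) between attempts. If most of the observed commit latency is that client-side back-off rather than the broker-side transition itself, lowering the back-off is one thing to experiment with; this is an assumption to verify, not a confirmed fix. A sketch in Scala, showing only the relevant property:
import java.util.Properties
import org.apache.kafka.clients.producer.ProducerConfig

val producerProperties = new Properties()
// transactional.id, acks and the other settings from the property file stay as they are;
// only the retry back-off is shown here. Assumption: the gap between
// AddPartitionsToTxnRequest retries is governed by retry.backoff.ms.
producerProperties.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, "20")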

Gracefully restart a Reactive-Kafka Consumer Stream on failure

Problem
When I restart/complete/stop the stream, the old consumer does not die/shut down:
[INFO ] a.a.RepointableActorRef -
Message [akka.kafka.KafkaConsumerActor$Internal$Stop$]
from Actor[akka://ufo-sightings/deadLetters]
to Actor[akka://ufo-sightings/system/kafka-consumer-1#1896610594]
was not delivered. [1] dead letters encountered.
Description
I'm building a service that receives a message from a Kafka topic and sends the message to an external service via an HTTP request.
A connection with the external service can be broken, and my service needs to retry the request.
Additionally, if there is an error in the stream, the entire stream needs to restart.
Finally, sometimes I don't need the stream and its corresponding Kafka consumer, and I would like to shut down the entire stream.
So I have a Stream:
Consumer.committableSource(customizedSettings, subscriptions)
  .flatMapConcat(sourceFunction)
  .toMat(Sink.ignore)(Keep.left)
  .run()
The HTTP request is sent in sourceFunction.
I followed the new Kafka consumer restart instructions in the new documentation:
RestartSource.withBackoff(
  minBackoff = 20.seconds,
  maxBackoff = 5.minutes,
  randomFactor = 0.2) { () =>
  Consumer.committableSource(customizedSettings, subscriptions)
    .watchTermination() {
      case (consumerControl, streamComplete) =>
        logger.info(s" Started Watching Kafka consumer id = ${consumer.id} termination: is shutdown: ${consumerControl.isShutdown}, is f completed: ${streamComplete.isCompleted}")
        consumerControl.isShutdown.map(_ => logger.info(s"Shutdown of consumer finally happened id = ${consumer.id} at ${DateTime.now}"))
        streamComplete
          .flatMap { _ =>
            consumerControl.shutdown().map(_ -> logger.info(s"3.consumer id = ${consumer.id} SHUTDOWN at ${DateTime.now} GRACEFULLY:CLOSED FROM UPSTREAM"))
          }
          .recoverWith {
            case _ =>
              consumerControl.shutdown().map(_ -> logger.info(s"3.consumer id = ${consumer.id} SHUTDOWN at ${DateTime.now} ERROR:CLOSED FROM UPSTREAM"))
          }
    }
    .flatMapConcat(sourceFunction)
}
  .viaMat(KillSwitches.single)(Keep.right)
  .toMat(Sink.ignore)(Keep.left)
  .run
There is an open issue that discusses this non-terminating consumer in a complex Akka stream, but there is no solution yet.
Is there a workaround that forces the Kafka consumer to terminate?
How about wrapping the consumer in an Actor and registering a KillSwitch? See: https://doc.akka.io/docs/akka/2.5/stream/stream-dynamic.html#dynamic-stream-handling
Then in the Actor postStop method you can terminate the stream.
By wrapping the Actor in a BackoffSupervisor, you get the exponential backoff.
Example actor: https://github.com/tradecloud/kafka-akka-extension/blob/master/src/main/scala/nl/tradecloud/kafka/KafkaSubscriberActor.scala#L27
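A rough sketch of that approach (Akka 2.5 APIs; customizedSettings, subscriptions and sourceFunction are the values from the question, while the actor and supervisor names are made up for illustration):
import akka.{Done, NotUsed}
import akka.actor.{Actor, Props, Status}
import akka.kafka.{ConsumerMessage, ConsumerSettings, Subscription}
import akka.kafka.scaladsl.Consumer
import akka.pattern.{BackoffSupervisor, pipe}
import akka.stream.{ActorMaterializer, KillSwitches, UniqueKillSwitch}
import akka.stream.scaladsl.{Keep, Sink, Source}
import scala.concurrent.duration._

class KafkaStreamActor[K, V](
    customizedSettings: ConsumerSettings[K, V],
    subscriptions: Subscription,
    sourceFunction: ConsumerMessage.CommittableMessage[K, V] => Source[Any, NotUsed]
) extends Actor {

  private implicit val mat = ActorMaterializer()(context)
  import context.dispatcher

  private var killSwitch: Option[UniqueKillSwitch] = None

  override def preStart(): Unit = {
    val (ks, done) = Consumer
      .committableSource(customizedSettings, subscriptions)
      .flatMapConcat(sourceFunction)
      .viaMat(KillSwitches.single)(Keep.right)
      .toMat(Sink.ignore)(Keep.both)
      .run()
    killSwitch = Some(ks)
    done pipeTo self // completion arrives as Done, failure as Status.Failure
  }

  override def postStop(): Unit =
    // Terminates the stream, which in turn lets the underlying KafkaConsumerActor stop.
    killSwitch.foreach(_.shutdown())

  override def receive: Receive = {
    case Done              => context.stop(self) // stream finished; the supervisor restarts us with backoff
    case Status.Failure(_) => context.stop(self) // stream failed; same treatment
  }
}

// system, customizedSettings, subscriptions and sourceFunction as defined in the question.
val supervisorProps = BackoffSupervisor.props(
  Props(new KafkaStreamActor(customizedSettings, subscriptions, sourceFunction)),
  childName = "kafka-stream",
  minBackoff = 20.seconds,
  maxBackoff = 5.minutes,
  randomFactor = 0.2
)
system.actorOf(supervisorProps, "kafka-stream-supervisor")
Because the whole stream is owned by one actor, stopping that actor (or letting the supervisor restart it) reliably tears down the Kafka consumer via the kill switch in postStop.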

Akka streams with gilt aws kinesis exception: Stream is terminated. SourceQueue is detached

I'm using the Gilt AWS Kinesis stream consumer library to connect to a single-shard Kinesis stream.
Specifically:
...
val streamConfig = KinesisStreamConsumerConfig[String](
  streamName = queueName
  , applicationName = kinesisConsumerApp
  , regionName = Some(awsRegion)
  , checkPointInterval = 5.minutes
  , retryConfig = RetryConfig(initialDelay = 1.second, retryDelay = 1.second, maxRetries = 3)
  , initialPositionInStream = InitialPositionInStream.LATEST
)
implicit val mat = ActorMaterializer()
val flow = Source.queue[String](0, OverflowStrategy.backpressure)
  .to(Sink.foreach {
    msgBody => {
      log.info(s"Flow got message: $msgBody")
      try {
        val workAsJson = parse(msgBody)
        frontEnd ! workAsJson
      } catch {
        case th: Throwable => log.error(s"Exception thrown trying to parse message from Kinesis stream, e.cause: ${th.getCause}, e.message: ${th.getMessage}")
      }
    }
  })
  .run()
val consumer = new KinesisStreamConsumer[String](
  streamConfig,
  KinesisStreamHandler(
    KinesisStreamSource.pumpKinesisStreamTo(flow, 10.second)
  )
)
val ec = Executors.newSingleThreadExecutor()
ec.submit(new Runnable {
  override def run(): Unit = consumer.run()
})
The application runs fine for about 24 hours (I verify occasionally by pushing records with the aws kinesis put-record command line and watching them get consumed by my application), but then suddenly the application starts receiving exceptions each time a new record is pushed to the stream.
Here is the console logging when that happens:
INFO: Sleeping ... [863/1962]
DEBUG[RecordProcessor-0000] KCLRecordProcessorFactory$IRecordProcessorFactoryImpl - Processing 1 records from shard shardId-000000000000
WARN [RecordProcessor-0000] KCLRecordProcessorFactory$IRecordProcessorFactoryImpl - Kinesis shard: shardId-000000000000 :: Stream is terminated. SourceQueue is detached
WARN [RecordProcessor-0000] KCLRecordProcessorFactory$IRecordProcessorFactoryImpl - Kinesis shard: shardId-000000000000 :: Stream is terminated. SourceQueue is detached
WARN [RecordProcessor-0000] KCLRecordProcessorFactory$IRecordProcessorFactoryImpl - Kinesis shard: shardId-000000000000 :: Stream is terminated. SourceQueue is detached
ERROR[RecordProcessor-0000] KCLRecordProcessorFactory$IRecordProcessorFactoryImpl - SKIPPING 1 records from shard shardId-000000000000 :: Kinesis shard: shardId-000000000000 :: Stream is terminated. SourceQueue is detached
com.gilt.gfc.aws.kinesis.client.KCLRecordProcessorFactory$KCLProcessorException: Kinesis shard: shardId-000000000000 :: Stream is terminated. SourceQueue is detached
at com.gilt.gfc.aws.kinesis.client.KCLRecordProcessorFactory$IRecordProcessorFactoryImpl$$anon$1.$anonfun$doRetry$2(KCLRecordProcessorFactory.scala:156)
at com.gilt.gfc.util.Retry$.retryWithExponentialDelay(Retry.scala:67)
at com.gilt.gfc.aws.kinesis.client.KCLRecordProcessorFactory$IRecordProcessorFactoryImpl$$anon$1.doRetry(KCLRecordProcessorFactory.scala:151)
at com.gilt.gfc.aws.kinesis.client.KCLRecordProcessorFactory$IRecordProcessorFactoryImpl$$anon$1.processRecords(KCLRecordProcessorFactory.scala:120)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.V1ToV2RecordProcessorAdapter.processRecords(V1ToV2RecordProcessorAdapter.java:42)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.call(ProcessTask.java:176)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:24)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalStateException: Stream is terminated. SourceQueue is detached
at akka.stream.impl.QueueSource$$anon$1.$anonfun$postStop$1(Sources.scala:57)
at akka.stream.impl.QueueSource$$anon$1.$anonfun$postStop$1$adapted(Sources.scala:56)
at akka.stream.stage.CallbackWrapper.$anonfun$invoke$1(GraphStage.scala:1373)
at akka.stream.stage.CallbackWrapper.locked(GraphStage.scala:1379)
at akka.stream.stage.CallbackWrapper.invoke(GraphStage.scala:1370)
at akka.stream.stage.CallbackWrapper.invoke$(GraphStage.scala:1369)
at akka.stream.impl.QueueSource$$anon$1.invoke(Sources.scala:47)
at akka.stream.impl.QueueSource$$anon$2.offer(Sources.scala:180)
at com.gilt.gfc.aws.kinesis.akka.KinesisStreamSource$.$anonfun$pumpKinesisStreamTo$1(KinesisStreamSource.scala:20)
at com.gilt.gfc.aws.kinesis.akka.KinesisStreamSource$.$anonfun$pumpKinesisStreamTo$1$adapted(KinesisStreamSource.scala:20)
at com.gilt.gfc.aws.kinesis.akka.KinesisStreamHandler$$anon$1.onRecord(KinesisStreamHandler.scala:29)
at com.gilt.gfc.aws.kinesis.akka.KinesisStreamConsumer.$anonfun$run$1(KinesisStreamConsumer.scala:40)
at com.gilt.gfc.aws.kinesis.akka.KinesisStreamConsumer.$anonfun$run$1$adapted(KinesisStreamConsumer.scala:40)
at com.gilt.gfc.aws.kinesis.client.KCLWorkerRunner.$anonfun$runSingleRecordProcessor$2(KCLWorkerRunner.scala:159)
at com.gilt.gfc.aws.kinesis.client.KCLWorkerRunner.$anonfun$runSingleRecordProcessor$2$adapted(KCLWorkerRunner.scala:159)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:52)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at com.gilt.gfc.aws.kinesis.client.KCLWorkerRunner.$anonfun$runSingleRecordProcessor$1(KCLWorkerRunner.scala:159)
at com.gilt.gfc.aws.kinesis.client.KCLWorkerRunner.$anonfun$runSingleRecordProcessor$1$adapted(KCLWorkerRunner.scala:159)
at com.gilt.gfc.aws.kinesis.client.KCLWorkerRunner.$anonfun$runBatchProcessor$1(KCLWorkerRunner.scala:121)
at com.gilt.gfc.aws.kinesis.client.KCLWorkerRunner.$anonfun$runBatchProcessor$1$adapted(KCLWorkerRunner.scala:116)
at com.gilt.gfc.aws.kinesis.client.KCLRecordProcessorFactory$IRecordProcessorFactoryImpl$$anon$1.$anonfun$processRecords$2(KCLRecordProcessorFactory.scala:120)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
at com.gilt.gfc.aws.kinesis.client.KCLRecordProcessorFactory$IRecordProcessorFactoryImpl$$anon$1.$anonfun$doRetry$2(KCLRecordProcessorFactory.scala:153)
... 11 common frames omitted
I'm wondering if that answer might be related. If so, I'd appreciate a simpler explanation/how-to-fix that suits a newbie like myself.
Notes:
This is still in the testing/staging phase, so there is no real load on the stream except for the occasional manual pushes I'm making.
The 24h duration in which the application runs fine was not accurately measured, just an observation.
I'm running the test for a third time (started at 8:42 UTC), but with the difference of increasing the Source.queue buffer size to 100.
If 24h turns out to be accurate, could that be related to Kinesis's default 24h retention period for stream records?
Update:
Application still working fine after 24+ hours of operation.
Update2:
So the application has been running fine for the past 48+ hours; again, the only difference is increasing the stream's Source.queue buffer size to 100.
Could that be the proper fix to the issue?
Will I face similar issue with increased load once we go to production?
Is 100 enough/too much/too few?
Can someone please explain how this change fixed/suppressed/mitigated the error?
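For context on what the error itself means: "Stream is terminated. SourceQueue is detached" is thrown once the stream materialized behind the queue has terminated (for example because the Sink failed or the materializer shut down), after which every offer fails. The gilt library does the offering internally, so the sketch below is only illustrative of how one can watch for that termination and inspect offer results when driving a Source.queue directly; the buffer size of 100 and the message type are assumptions:
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, OverflowStrategy, QueueOfferResult}
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future

implicit val system = ActorSystem("kinesis-test")
implicit val mat = ActorMaterializer()
import system.dispatcher

// A non-zero buffer plus backpressure absorbs bursts; watchCompletion reports
// when and why the stream behind the queue died, so it can be rebuilt instead
// of being offered into while detached.
val queue = Source.queue[String](100, OverflowStrategy.backpressure)
  .to(Sink.foreach(msgBody => println(s"Flow got message: $msgBody")))
  .run()

queue.watchCompletion().onComplete { result =>
  println(s"Queue stream terminated: $result") // rebuild the stream here rather than keep offering
}

def push(msgBody: String): Future[Unit] =
  queue.offer(msgBody).map {
    case QueueOfferResult.Enqueued    => ()
    case QueueOfferResult.Dropped     => println(s"Dropped: $msgBody")
    case QueueOfferResult.Failure(ex) => println(s"Offer failed: $ex")
    case QueueOfferResult.QueueClosed => println("Queue is already closed")
  }
Whether the buffer size of 100 by itself is a real fix is hard to say from the outside; the more reliable signal is finding out why the inner stream terminated in the first place, which watchCompletion makes visible.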

Akka.Kafka - warning message - Resuming partitions

I am continuously getting DEBUG messages about resuming the partitions for all the topics, like below. This message prints every millisecond on my server.
08:44:34.850 [default-akka.kafka.default-dispatcher-10] DEBUG o.a.k.clients.consumer.KafkaConsumer - Resuming partition test222-7
08:44:34.850 [default-akka.kafka.default-dispatcher-10] DEBUG o.a.k.clients.consumer.KafkaConsumer - Resuming partition test222-6
08:44:34.850 [default-akka.kafka.default-dispatcher-10] DEBUG o.a.k.clients.consumer.KafkaConsumer - Resuming partition test222-9
08:44:34.850 [default-akka.kafka.default-dispatcher-10] DEBUG o.a.k.clients.consumer.KafkaConsumer - Resuming partition test222-8
Here is the code:
val zookeeperHost = "localhost"
val zookeeperPort = "9092"
// Kafka queue settings
val consumerSettings = ConsumerSettings(system, new ByteArrayDeserializer, new StringDeserializer)
  .withBootstrapServers(zookeeperHost + ":" + zookeeperPort)
  .withGroupId(groupName)
  .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest")
// Streaming the Messages from Kafka queue
Consumer.committableSource(consumerSettings, Subscriptions.topics(topicName))
  .map(msg => {
    consumed(msg.record.value)
  })
  .runWith(Sink.ignore)
Please help me handle the partitions correctly so that these DEBUG messages stop.
It seems that the reactive-kafka code resumes every partition before starting to fetch:
consumer.assignment().asScala.foreach { tp =>
  if (partitionsToFetch.contains(tp)) consumer.resume(java.util.Collections.singleton(tp))
  else consumer.pause(java.util.Collections.singleton(tp))
}
def tryPoll{...}
checkNoResult(tryPoll(0))
The KafkaConsumer.resume method is a no-op if the partitions were not previously paused.
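So the resume calls are harmless; the noise is just the Kafka consumer's own DEBUG logging. If the goal is only to quiet it, raising the level for that logger is enough. This is usually done in the logging configuration (for example logback.xml); a programmatic equivalent, under the assumption that Logback is the backend (the question does not say), would be:
import ch.qos.logback.classic.{Level, Logger}
import org.slf4j.LoggerFactory

// Silence the per-poll resume/pause DEBUG lines without touching the rest of the logging.
LoggerFactory
  .getLogger("org.apache.kafka.clients.consumer.KafkaConsumer")
  .asInstanceOf[Logger]
  .setLevel(Level.INFO)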

How to shutdown a Kafka ConsumerConnector

I have a system that pulls messages from a Kafka topic, and when it's unable to process messages because some external resource is unavailable, it shuts down the consumer, returns the message to the topic, and waits some time before starting the consumer again. The only problem is, shutting down doesn't work. Here's what I see in my logs:
2014-09-30 08:24:10,918 - com.example.kafka.KafkaConsumer [info] - [application-akka.actor.workflow-context-8] Shutting down kafka consumer for topic new-problem-reports
2014-09-30 08:24:10,927 - clients.kafka.ProblemReportObserver [info] - [application-akka.actor.workflow-context-8] Consumer shutdown
2014-09-30 08:24:11,946 - clients.kafka.ProblemReportObserver [warn] - [application-akka.actor.workflow-context-8] Sending 7410-1412090624000 back to the queue
2014-09-30 08:24:12,021 - clients.kafka.ProblemReportObserver [debug] - [kafka-akka.actor.kafka-consumer-worker-context-9] Message from partition 0: key=7410-1412090624000, msg=7410-1412090624000
There are a few layers at work here, but the important code is:
In KafkaConsumer.scala:
protected def consumer: ConsumerConnector = Consumer.create(config.asKafkaConfig)
def shutdown() = {
  logger.info(s"Shutting down kafka consumer for topic ${config.topic}")
  consumer.shutdown()
}
In the routine that observes messages:
(processor ? ProblemReportRequest(problemReportKey)).map {
  case e: ConnectivityInterruption =>
    val backoff = 10.seconds
    logger.warn(s"Can't connect to essential services, pausing for $backoff", e)
    stop()
    // XXX: Shutdown isn't instantaneous, so returning has to happen after a delay.
    // Unfortunately, there's still a race condition here, plus there's a chance the
    // system will be shut down before the message has been returned.
    system.scheduler.scheduleOnce(100 millis) { returnMessage(message) }
    system.scheduler.scheduleOnce(backoff) { start() }
    false
  case e: Exception => returnMessage(message, e)
  case _ => true
}.recover { case e => returnMessage(message, e) }
And the stop method:
def stop() = {
  if (consumerRunning.get()) {
    consumer.shutdown()
    consumerRunning.compareAndSet(true, false)
    logger.info("Consumer shutdown")
  } else {
    logger.info("Consumer is already shutdown")
  }
  !consumerRunning.get()
}
Is this a bug, or am I doing it wrong?
Because your consumer is a def. Each time you call it, as in consumer.shutdown(), it creates a new ConsumerConnector instance and shuts that new instance down, while the connector that is actually consuming keeps running. Make consumer a val instead.
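A minimal sketch of that change in KafkaConsumer.scala, keeping the names from the question:
// Created once and reused, so shutdown() now stops the same connector
// that is actually consuming messages.
protected val consumer: ConsumerConnector = Consumer.create(config.asKafkaConfig)

def shutdown() = {
  logger.info(s"Shutting down kafka consumer for topic ${config.topic}")
  consumer.shutdown()
}
If the connector must not be created until first use, a lazy val gives the same single-instance behaviour.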