Kafka Streams: TimeoutException: Failed to update metadata after 60000 ms - apache-kafka

I'm running a Kafka Streams application to consume one topic with 20 partitions at a traffic of about 15K records/sec. The application does 1-hour windowing and suppresses the latest result per window. After running for some time, it starts getting TimeoutException and then the instance is marked down by Kafka Streams.
Error trace:
Caused by: org.apache.kafka.streams.errors.StreamsException:
task [2_7] Abort sending since an error caught with a previous record
(key 372716656751 value InterimMessage [xxxxx..]
timestamp 1566433307547) to topic XXXXX due to org.apache.kafka.common.errors.TimeoutException:
Failed to update metadata after 60000 ms.
You can increase producer parameter `retries` and
`retry.backoff.ms` to avoid this error.
I already increased the values of those two configs:
retries = 5
retry.backoff.ms = 80000
Should I increase them again as the error message suggests? What would be good values for these two configs?
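For context, producer-level overrides like these are normally forwarded through the Streams configuration using the producer prefix. A minimal sketch, assuming a hypothetical application id and bootstrap server (not taken from the original post):

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "windowing-app");         // hypothetical
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");      // hypothetical
// Forward producer-level settings with the producer prefix
props.put(StreamsConfig.producerPrefix(ProducerConfig.RETRIES_CONFIG), 5);
props.put(StreamsConfig.producerPrefix(ProducerConfig.RETRY_BACKOFF_MS_CONFIG), 80000);
KafkaStreams streams = new KafkaStreams(topology, props);                // topology built elsewhere

Note that the 60000 ms metadata wait itself is governed by the producer's max.block.ms rather than the retry settings, so that config may also be relevant here.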

Related

FlinkKafkaException: Failed to send data to Kafka: Expiring 19 record(s) for `topic`:120001 ms has passed since batch creation

I'm using Flink 1.13.1 and trying to write data to Kafka at about 10k records per second. I have a Kafka cluster of 30 brokers; my Flink job applies a filter operator and just sinks the data to Kafka. Below are my producer settings. Initially the sink topic had 5 partitions, which I have since increased to 10, but I still get the same issue. I also tried setting request.timeout.ms to 1 minute instead of the Kafka default of 30 seconds, but I still get "120001 ms has passed since batch creation". Because of this error, checkpointing fails. The checkpoint size is just 107 KB, as Flink only commits offsets to Kafka. Job parallelism is 36. Here are the sink Kafka properties and the full stack trace.
private static Properties kafkaSinkProperties() {
    val properties = new Properties();
    properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);
    properties.setProperty("bootstrap.servers", KafkaServers.SINK_KAFKA_BOOTSTRAP_SERVER);
    properties.setProperty("request.timeout.ms", "60000");
    return properties;
}

FlinkKafkaProducer<Map<String, Object>> kafkaProducer =
    new FlinkKafkaProducer<Map<String, Object>>(
        KafkaTopics.TOPIC, jsonSerde, kafkaSinkProperties());
filterStream.addSink(kafkaProducer);
java.lang.Exception: Could not perform checkpoint 6984 for operator Source: source -> Filter -> Sink: Unnamed (42/48)#3.
at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:1000)
at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$triggerCheckpointAsync$7(StreamTask.java:960)
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:93)
at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90)
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:344)
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:330)
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:202)
at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:681)
at org.apache.flink.streaming.runtime.tasks.StreamTask.executeInvoke(StreamTask.java:636)
at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:647)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:620)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:779)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete snapshot 6984 for operator Source: source -> Filter -> Sink: Unnamed (42/48)#3. Failure reason: Checkpoint was declined.
at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:264)
at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:169)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:371)
at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointStreamOperator(SubtaskCheckpointCoordinatorImpl.java:706)
at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.buildOperatorSnapshotFutures(SubtaskCheckpointCoordinatorImpl.java:627)
at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.takeSnapshotSync(SubtaskCheckpointCoordinatorImpl.java:590)
at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState(SubtaskCheckpointCoordinatorImpl.java:312)
at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$8(StreamTask.java:1086)
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:93)
at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:1070)
at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:988)
... 13 more
Caused by: org.apache.flink.streaming.connectors.kafka.FlinkKafkaException: Failed to send data to Kafka: Expiring 20 record(s) for topic-1:120000 ms has passed since batch creation
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.checkErroneous(FlinkKafkaProducer.java:1392)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.flush(FlinkKafkaProducer.java:1095)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.preCommit(FlinkKafkaProducer.java:1002)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.preCommit(FlinkKafkaProducer.java:99)
at org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.snapshotState(TwoPhaseCommitSinkFunction.java:320)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.snapshotState(FlinkKafkaProducer.java:1100)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.trySnapshotFunctionState(StreamingFunctionUtils.java:118)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.snapshotFunctionState(StreamingFunctionUtils.java:99)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.snapshotState(AbstractUdfStreamOperator.java:89)
at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:218)
... 23 more
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 20 record(s) for topic-1:120000 ms has passed since batch creation
Kafka is overloaded; the easiest fix is to add more resources to your Kafka cluster.
There are some minor things you can do as well:
decrease parallelism (less data will be sent to Kafka at once)
do Kafka IO operations less often
set Kafka acks=1 instead of acks=all
increase the checkpointing trigger interval
It is certainly a sign of Flink-to-Kafka sending problems: either Kafka is overloaded or there are significant connectivity issues in your setup. You can of course optimise your software or hardware setup, but you should also try to fully understand the nature of the error.
I think the philosophy of Flink-Kafka interaction is that most of the configuration should be done in the Flink operator itself.
So it is worth considering the following options for the Kafka producer:
delivery.timeout.ms - default 2 minutes
retries - default Integer.MAX_VALUE; you probably do not want to touch it
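Purely as an illustration (reusing the asker's kafkaSinkProperties() shape with hypothetical values, not a verified fix), those suggestions could look roughly like this:

// Sketch: relax producer timeouts and acks in the sink properties.
private static Properties kafkaSinkProperties() {
    Properties properties = new Properties();
    properties.setProperty("bootstrap.servers", KafkaServers.SINK_KAFKA_BOOTSTRAP_SERVER);
    properties.setProperty("request.timeout.ms", "60000");
    properties.setProperty("delivery.timeout.ms", "300000"); // raise the 2-minute default (illustrative value)
    properties.setProperty("acks", "1");                     // instead of acks=all
    return properties;
}

// Checkpoint less often so the sink is flushed less frequently
// (env is the job's StreamExecutionEnvironment).
env.enableCheckpointing(120000); // e.g. every 2 minutes (illustrative value)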

Kafka Streams throwing InvalidProducerEpochException frequently

I have a Kafka Streams application with 4 instances, each running on a separate EC2 instance with 16 threads. Total threads = 16 * 4 = 64. The input topic has only 32 partitions. I understand that some of the threads will remain idle.
I am continuously seeing this exception:
Caused by: org.apache.kafka.common.errors.InvalidProducerEpochException: Producer attempted to produce with an old epoch.
01:57:23.971 [kafka-producer-network-thread | bids_kafka_streams_beta_007-fd78c6fa-62bc-437d-add0-c31f5b7c1901-StreamThread-12-1_6-producer] ERROR org.apach
e.kafka.streams.processor.internals.RecordCollectorImpl - stream-thread [bids_kafka_streams_beta_007-fd78c6fa-62bc-437d-add0-c31f5b7c1901-StreamThread-12] t
ask [1_6] Error encountered sending record to topic kafka_streams_bids_output for task 1_6 due to:
org.apache.kafka.common.errors.InvalidProducerEpochException: Producer attempted to produce with an old epoch.
Written offsets would not be recorded and no more records would be sent since the producer is fenced, indicating the task may be migrated out
The only settings I have changed in the streams config are the producer configs, to reduce CPU usage on the brokers:
linger.ms=10000
commit.interval.ms=10000
Records are windowed into 2-minute windows.
Is it due to rebalancing? Why is it so frequent?
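For context, this is roughly how those two settings land in a Streams configuration (a sketch, not the asker's actual code): commit.interval.ms is a Streams-level config, while linger.ms has to be forwarded to the embedded producer, e.g. via the producer prefix.

// Same imports as the earlier StreamsConfig sketch.
Properties props = new Properties();
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 10000);                        // Streams-level config
props.put(StreamsConfig.producerPrefix(ProducerConfig.LINGER_MS_CONFIG), 10000);  // producer override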

Kafka Streams: Retries

Kafka version - 1.0.1
I am getting the following exception at random intervals. I tried increasing request.timeout.ms to 5 minutes, but it still continues to time out at random intervals (every few hours). It's not clear why the exception arises. A restart seems to recover from where it left off, but that requires a manual step. So I tried enabling retries, but that seems to have had no effect: I don't see any retries in the logs (i.e. a failure, then a first attempt, another failure, then a second attempt, and so on until max retries). Can you please shed some light on the exception below and advise on how we can let the Kafka Streams application continue to run when it occurs, perhaps by retrying? If we need to increase request.timeout.ms to its maximum value, what is the downside we need to be aware of? We should not let the thread hang indefinitely when the broker fails.
props.put(ProducerConfig.RETRIES_CONFIG, 3);
2018-07-05 06:04:25 ERROR Housestream:91 - Unknown Exception occurred
org.apache.kafka.streams.errors.StreamsException: task [1_1] Abort sending since an error caught with a previous record (key GCB21K1X value [L#5e86f18a timestamp 1530783812110)
to topic housestream-digitstore-changelog due to org.apache.kafka.common.errors.TimeoutException: Expiring 201 record(s) for housestream-digitstore-changelog: 30144 ms has passed since last append.
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl$1.onCompletion(RecordCollectorImpl.java:118)
at org.apache.kafka.clients.producer.internals.ProducerBatch.completeFutureAndFireCallbacks(ProducerBatch.java:204)
at org.apache.kafka.clients.producer.internals.ProducerBatch.done(ProducerBatch.java:187)
at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:627)
at org.apache.kafka.clients.producer.internals.Sender.sendProducerData(Sender.java:287)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:238)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:163)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 201 record(s) for housestream-digitstore-changelog: 30144 ms has passed since last append
I tried increasing the request timeout to the maximum integer value, but ran into another TimeoutException.
2018-07-05 12:22:15 ERROR Housestream:179 - Unknown Exception occurred
org.apache.kafka.streams.errors.StreamsException: task [1_0] Exception caught while punctuating processor 'validatequote'
at org.apache.kafka.streams.processor.internals.StreamTask.punctuate(StreamTask.java:267)
at org.apache.kafka.streams.processor.internals.PunctuationQueue.mayPunctuate(PunctuationQueue.java:54)
at org.apache.kafka.streams.processor.internals.StreamTask.maybePunctuateSystemTime(StreamTask.java:619)
at org.apache.kafka.streams.processor.internals.AssignedTasks.punctuate(AssignedTasks.java:430)
at org.apache.kafka.streams.processor.internals.TaskManager.punctuate(TaskManager.java:324)
at org.apache.kafka.streams.processor.internals.StreamThread.punctuate(StreamThread.java:969)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:834)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:774)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:744)
Caused by: org.apache.kafka.streams.errors.StreamsException: task [1_1] Abort sending since an error caught with a previous record (key 32342 value com.int.digital.QUOTE#2c73fa63 timestamp 153083237883) to topic digital_quote due to org.apache.kafka.common.errors.TimeoutException: Failed to allocate memory within the configured max blocking time 60000 ms..
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl$1.onCompletion(RecordCollectorImpl.java:118)
at org.apache.kafka.clients.producer.KafkaProducer.doSend(KafkaProducer.java:819)
at org.apache.kafka.clients.producer.KafkaProducer.send(KafkaProducer.java:760)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.send(RecordCollectorImpl.java:100)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.send(RecordCollectorImpl.java:78)
at org.apache.kafka.streams.processor.internals.SinkNode.process(SinkNode.java:87)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:113)
at org.cox.processor.CheckQuote.handleTasks(CheckQuote.java:122)
at org.cox.processor.CheckQuote$1.punctuate(CheckQuote.java:145)
at org.apache.kafka.streams.processor.internals.ProcessorNode$4.run(ProcessorNode.java:131)
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:208)
at org.apache.kafka.streams.processor.internals.ProcessorNode.punctuate(ProcessorNode.java:134)
at org.apache.kafka.streams.processor.internals.StreamTask.punctuate(StreamTask.java:263)
... 8 more
Caused by: org.apache.kafka.common.errors.TimeoutException: Failed to allocate memory within the configured max blocking time 60000 ms.

UnknownProducerIdException in Kafka streams when enabling exactly once

After enabling exactly once processing on a Kafka streams application, the following error appears in the logs:
ERROR o.a.k.s.p.internals.StreamTask - task [0_0] Failed to close producer
due to the following error:
org.apache.kafka.streams.errors.StreamsException: task [0_0] Abort
sending since an error caught with a previous record (key 222222 value
some-value timestamp 1519200902670) to topic exactly-once-test-topic-
v2 due to This exception is raised by the broker if it could not
locate the producer metadata associated with the producerId in
question. This could happen if, for instance, the producer's records
were deleted because their retention time had elapsed. Once the last
records of the producerId are removed, the producer's metadata is
removed from the broker, and future appends by the producer will
return this exception.
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.recordSendError(RecordCollectorImpl.java:125)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.access$500(RecordCollectorImpl.java:48)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl$1.onCompletion(RecordCollectorImpl.java:180)
at org.apache.kafka.clients.producer.KafkaProducer$InterceptorCallback.onCompletion(KafkaProducer.java:1199)
at org.apache.kafka.clients.producer.internals.ProducerBatch.completeFutureAndFireCallbacks(ProducerBatch.java:204)
at org.apache.kafka.clients.producer.internals.ProducerBatch.done(ProducerBatch.java:187)
at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:627)
at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:596)
at org.apache.kafka.clients.producer.internals.Sender.completeBatch(Sender.java:557)
at org.apache.kafka.clients.producer.internals.Sender.handleProduceResponse(Sender.java:481)
at org.apache.kafka.clients.producer.internals.Sender.access$100(Sender.java:74)
at org.apache.kafka.clients.producer.internals.Sender$1.onComplete(Sender.java:692)
at org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:101)
at org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:482)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:474)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:239)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:163)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.common.errors.UnknownProducerIdException
We've reproduced the issue with a minimal test case where we move messages from a source stream to another stream without any transformation. The source stream contains millions of messages produced over several months. The KafkaStreams object is created with the following StreamsConfig:
StreamsConfig.PROCESSING_GUARANTEE_CONFIG = "exactly_once"
StreamsConfig.APPLICATION_ID_CONFIG = "Some app id"
StreamsConfig.NUM_STREAM_THREADS_CONFIG = 1
ProducerConfig.BATCH_SIZE_CONFIG = 102400
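For reference, a minimal sketch of how a configuration like this might be assembled in code (application id, bootstrap servers, and topology are placeholders rather than the poster's actual values):

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "Some app id");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");    // placeholder
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 1);
props.put(StreamsConfig.producerPrefix(ProducerConfig.BATCH_SIZE_CONFIG), 102400);
KafkaStreams streams = new KafkaStreams(builder.build(), props);       // builder defined elsewhere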
The app is able to process some messages before the exception occurs.
Context information:
we're running a 5 node Kafka 1.1.0 cluster with 5 zookeeper nodes.
there are multiple instances of the app running
Has anyone seen this problem before or can give us any hints about what might be causing this behaviour?
Update
We created a new 1.1.0 cluster from scratch and started to process new messages without problems. However, when we imported old messages from the old cluster, we hit the same UnknownProducerIdException after a while.
Next we tried to set the cleanup.policy on the sink topic to compact while keeping the retention.ms at 3 years. Now the error did not occur. However, messages seem to have been lost. The source offset is 106 million and the sink offset is 100 million.
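For illustration only, a topic config change like that can be applied with the Kafka AdminClient (method name, topic name, and bootstrap server below are placeholders); note that the non-incremental alterConfigs call replaces the topic's dynamic config as a whole, which is why both entries are listed together:

import java.util.Arrays;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

static void makeSinkTopicCompact() throws Exception {
    Properties adminProps = new Properties();
    adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");        // placeholder
    try (AdminClient admin = AdminClient.create(adminProps)) {
        ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "sink-topic"); // placeholder
        Config config = new Config(Arrays.asList(
            new ConfigEntry("cleanup.policy", "compact"),
            new ConfigEntry("retention.ms", "94608000000")));                           // roughly 3 years in ms
        admin.alterConfigs(Collections.singletonMap(topic, config)).all().get();
    }
}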
As explained in the comments, there currently seems to be a bug that may cause problems when replaying messages older than the (maximum configurable?) retention time.
At time of writing this is unresolved, the latest status can always be seen here:
https://issues.apache.org/jira/browse/KAFKA-6817

org.apache.kafka.common.errors.TimeoutException

I have a two-broker Kafka 1.0.0 cluster and I am running a Kafka Streams 1.0.0 application against it. I increased the producer request.timeout.ms to 5 minutes to fix a producer TimeoutException.
Currently I am getting the two types of exceptions below after the application has been running for some time. I am trying to fix them as suggested in "Apache Kafka: TimeoutException and then nothing works", but the solution there was incomplete. Is that solution (decreasing the producer batch.size) recommendable? Please help.
Exception 1
2017-12-08 13:11:55,129 ERROR o.a.k.s.p.i.RecordCollectorImpl [sample-app-0.0.1-156ec0d4-6d7c-40b0-a493-370f8d9a092c-StreamThread-1] task [2_0] Error sending record (key 5a12c529e532af0b84f5d937 value com.kafka.streams.SampleEvent#54a6900d timestamp 1512536799387) to topic abc due to org.apache.kafka.common.errors.TimeoutException: Failed to allocate memory within the configured max blocking time 60000 ms.; No more records will be sent and no more offsets will be recorded for this task.
2017-12-08 13:11:55,131 ERROR o.a.k.s.p.i.AssignedTasks [sample-app-0.0.1-156ec0d4-6d7c-40b0-a493-370f8d9a092c-StreamThread-1] stream-thread [sample-app-0.0.1-156ec0d4-6d7c-40b0-a493-370f8d9a092c-StreamThread-1] Failed to process stream task 2_0 due to the following error: org.apache.kafka.streams.errors.StreamsException: Exception caught in process. taskId=2_0, processor=KSTREAM-SOURCE-0000000004, topic=Sample-Event, partition=0, offset=508417
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:232)
at org.apache.kafka.streams.processor.internals.AssignedTasks.process(AssignedTasks.java:403)
at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:317)
at org.apache.kafka.streams.processor.internals.StreamThread.processAndMaybeCommit(StreamThread.java:942)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:822)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:774)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:744)
Caused by: org.apache.kafka.streams.errors.StreamsException: task [2_0] Abort sending since an error caught with a previous record (key 5a12c529e532af0b84f5d937 value com.kafka.streams.SampleEvent#54a6900d timestamp 1512536799387) to topic abc due to org.apache.kafka.common.errors.TimeoutException: Failed to allocate memory within the configured max blocking time 60000 ms..
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl$1.onCompletion(RecordCollectorImpl.java:118)
at org.apache.kafka.clients.producer.KafkaProducer.doSend(KafkaProducer.java:819)
at org.apache.kafka.clients.producer.KafkaProducer.send(KafkaProducer.java:760)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.send(RecordCollectorImpl.java:100)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.send(RecordCollectorImpl.java:78)
at org.apache.kafka.streams.processor.internals.SinkNode.process(SinkNode.java:87)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:85)
at org.apache.kafka.streams.kstream.internals.KStreamTransform$KStreamTransformProcessor.process(KStreamTransform.java:56)
at org.apache.kafka.streams.processor.internals.ProcessorNode$1.run(ProcessorNode.java:46)
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:208)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:124)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:85)
at org.apache.kafka.streams.processor.internals.SourceNode.process(SourceNode.java:80)
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:216)
... 6 more
Caused by: org.apache.kafka.common.errors.TimeoutException: Failed to allocate memory within the configured max blocking time 60000 ms.
Exception 2
2017-12-11 11:08:35,257 ERROR o.a.k.s.p.i.RecordCollectorImpl [kafka-producer-network-thread | sample-app-0.0.1-030b5133-df00-4abd-a3de-8bfab114f626-StreamThread-1-producer] task [2_0] Error sending record (key 5a12c529e532af0b84f5d937 value com.kafka.streams.SampleEvent#1758de61 timestamp 1512795449471) to topic abc due to org.apache.kafka.common.errors.TimeoutException: Expiring 14 record(s) for abc-0: 122597 ms has passed since last append; No more records will be sent and no more offsets will be recorded for this task.
2017-12-11 11:08:56,001 ERROR o.a.k.s.p.i.AssignedTasks [sample-app-0.0.1-030b5133-df00-4abd-a3de-8bfab114f626-StreamThread-1] stream-thread [sample-app-0.0.1-030b5133-df00-4abd-a3de-8bfab114f626-StreamThread-1] Failed to commit stream task 2_0 due to the following error: org.apache.kafka.streams.errors.StreamsException: task [2_0] Abort sending since an error caught with a previous record (key 5a12c529e532af0b84f5d937 value com.kafka.streams.SampleEvent#1758de61 timestamp 1512795449471) to topic abc due to org.apache.kafka.common.errors.TimeoutException: Expiring 14 record(s) for abc-0: 122597 ms has passed since last append.
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl$1.onCompletion(RecordCollectorImpl.java:118)
at org.apache.kafka.clients.producer.internals.ProducerBatch.completeFutureAndFireCallbacks(ProducerBatch.java:204)
at org.apache.kafka.clients.producer.internals.ProducerBatch.done(ProducerBatch.java:187)
at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:627)
at org.apache.kafka.clients.producer.internals.Sender.sendProducerData(Sender.java:287)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:238)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:163)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 14 record(s) for abc-0: 122597 ms has passed since last append
We faced a similar issue, which we resolved as follows.
The first issue, by setting:
max.block.ms to something higher than the currently configured value.
The second issue, by increasing batch.size and decreasing linger.ms on the Kafka producer side (this might increase latency). Increasing batch.size lets each request carry more records, so fewer, larger requests are sent to the broker.
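As a rough sketch of what those adjustments might look like (values are illustrative, not the ones actually used; in a Streams application the keys would be wrapped with StreamsConfig.producerPrefix(...) as in the earlier examples):

Properties producerProps = new Properties();
producerProps.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 120000);  // above the 60000 ms default (illustrative)
producerProps.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);     // larger batches, fewer requests (illustrative)
producerProps.put(ProducerConfig.LINGER_MS_CONFIG, 5);          // shorter linger (illustrative)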
This looks like something that often happens when an expected topic hasn't been created. Try looking further back in the log files.
You can also explicitly use the admin client to check which topics exist.
The first issue happens for this reason: the producer waits up to 60,000 ms (the default) for topic metadata. If the metadata isn't available within that time, it throws a Streams timeout exception. To fix this, set the Kafka producer config ProducerConfig.MAX_BLOCK_MS_CONFIG to a value greater than 60000 ms. This should resolve the issue.
If you are using NiFi and Kafka with SASL_SSL without Kerberos and you are providing a Kafka client JAAS config, then increase the Metadata Wait Time to 100 sec and the Acknowledgment Wait Time to 100 sec; this should work for you.