Many send requests expire when I use KafkaConsumer.commitAsync to commit offsets - apache-kafka

I have a Kafka consumer that processes about 20,000 messages per second, and I commit offsets manually using the commitAsync method. In this setup I see many failed offset commits, with logs like org.apache.kafka.clients.consumer.RetriableCommitFailedException: Offset commit failed with a retriable exception. You should retry committing the latest consumed offsets.\nCaused by: org.apache.kafka.common.errors.TimeoutException: Failed to send request after 60000 ms, so I read the source code and found that the log is produced by the method org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient#failExpiredRequests.
I modified the parameter max.poll.records=1000, but this had no effect.

Doubling the number of records you consume per poll won't help you commit any of the offsets you've consumed within the timeout window.
The error is suggesting that you either 1) retry the commit (catch it in a loop and commit again), or 2) reduce max.poll.records from the default 500 to something smaller, increase your timeouts (although one minute should be more than enough), or commit more frequently.
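A minimal sketch of one common variant of the first option (the properties, topic name, and process() call are placeholders, not taken from your code): commit asynchronously after each poll, log retriable commit failures because a later commit will cover the same offsets anyway, and fall back to a blocking commitSync() on shutdown so the final offsets are retried until they stick.

import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.RetriableCommitFailedException;

// props, "some-topic", and process() are assumed to exist in your application
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("some-topic"));
try {
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            process(record);
        }
        // non-blocking commit; failures are reported to the callback
        consumer.commitAsync((offsets, exception) -> {
            if (exception instanceof RetriableCommitFailedException) {
                // a retriable failure: the next commit will include newer offsets,
                // so logging is usually enough here
                System.err.println("Async offset commit failed, will be covered by a later commit: " + exception.getMessage());
            }
        });
    }
} finally {
    // blocking commit on shutdown retries internally until it succeeds
    // or hits a non-retriable error, so the final offsets are not lost
    consumer.commitSync();
    consumer.close();
}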

Related

Kafka Producer Retry and Failed record handling

My requirement is as follows -
Apart from broker-metadata-related errors, I try to simulate a RecordTooLargeException while sending a message to the Kafka topic.
For the producer configuration I set acks: all and retries: 5.
I also use the addCallback method when sending the message.
I received org.apache.kafka.common.errors.RecordTooLargeException: The message is 2000103 bytes when serialized which is larger than 1048576, which is the value of the max.request.size configuration.
But I did not notice any retries (5 times) in the log.
My requirement is to retry 5 times, then mark the record as a permanent failure and hand it back to the callback handler for further reprocessing of the failed record (e.g. send it to a DLT or a DB).
How can I achieve this kind of retry and handling?
It's simple: the Kafka Producer API does not retry on RecordTooLargeException, which means it is a non-retriable exception. If you still want to override this and retry anyway, you can match the exception returned from the broker (for example by its message string) and retry from the catch block as many times as you want.
KafkaProducer has two types of errors. Retriable errors are those that can be resolved by sending the message again. For example, a connection error can be resolved because the connection may get reestablished. A “not leader for partition” error can be resolved when a new leader is elected for the partition and the client metadata is refreshed. KafkaProducer can be configured to retry those errors automatically, so the application code will get retriable exceptions only when the number of retries was exhausted and the error was not resolved. Some errors will not be resolved by retrying — for example, “Message size too large.” In those cases, KafkaProducer will not attempt a retry and will return the exception immediately.
-- Kafka: The Definitive Guide 2nd Edition, Chapter 3
RecordTooLargeException is a non-retriable exception; retrying makes no sense if the max.request.size configuration does not change. Therefore the Kafka producer will not attempt a retry and will return the exception immediately, and the callback handler will be triggered for further reprocessing.
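As a sketch of that callback-side handling (the topic names, the key/value variables, and the idea of writing a failure reference to a separate topic are illustrative assumptions, not part of the question):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.RecordTooLargeException;

// producer is an existing KafkaProducer<String, String>; "orders" and
// "orders-failures" are made-up topic names for this sketch
ProducerRecord<String, String> record = new ProducerRecord<>("orders", key, value);
producer.send(record, (metadata, exception) -> {
    if (exception == null) {
        return; // sent successfully
    }
    if (exception instanceof RecordTooLargeException) {
        // permanent failure: resending the same oversized payload through the
        // same producer would fail again, so record a reference to it instead
        // (e.g. the key and the error) for later reprocessing
        producer.send(new ProducerRecord<>("orders-failures", key, exception.getMessage()));
    } else {
        // any other exception surfaces here only after the producer's own
        // retries (for retriable errors) have been exhausted
        exception.printStackTrace();
    }
});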

Kafka retries concept - on what basis are retries stopped in Kafka?

As I am new to Kafka, I am trying to understand the retries concept. On what basis does the retry process complete?
For example, suppose we set the retries parameter to 7. My questions are:
Will Kafka retry all 7 times?
Will it keep retrying until the process succeeds? If so, how does Kafka know the process was successful?
If that depends on some parameter, which parameter is it and how does it work?
In distributed systems, retries are inevitable. From network errors to replication issues and even outages in downstream dependencies, services operating at a massive scale must be prepared to encounter, identify, and handle failure as gracefully as possible.
Kafka will retry until the initiated process completes successfully or the retry count reaches zero.
Kafka maintains the status of each API call (producer, consumer, and Streams), and when an error condition is met, the retry count is decreased.
Please go through the completeBatch method of Sender.java at the following URL for more information.
https://github.com/apache/kafka/blob/68ac551966e2be5b13adb2f703a01211e6f7a34b/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java
I guess you are talking about the producer retrying to send failed messages.
From the Kafka producer retries property documentation -
"Setting a value greater than zero will cause the client to resend any
record whose send fails with a potentially transient error."
This means that the Kafka producer will retry if the error it encountered is considered "retriable". Not all errors are retriable - for example, if the target Kafka topic does not exist, there's no point in trying to send the message again,
but if, for example, the connection was interrupted, it makes sense to try again.
Important to note - retries are only relevant if you have set broker acks != 0.
So, in your example you have 7 retries configured.
I assume that acks is set to a value other than 0, because otherwise no retries will be attempted.
If your message failed with a non-retriable error, the Kafka producer will not try to send the message again (it will actually 'give up' on that message and move on to the next messages).
If your message failed with a retriable error, the Kafka producer will retry sending until the message is successfully sent, or until retries are exhausted (when 7 retries were attempted and none of them succeeded).
The Kafka producer client knows when your message was successfully sent to the broker because when acks is set to 1/all, the broker "acknowledges" every message it receives and informs the producer (a kind of handshake between the producer and the broker).
see acks & retries # https://kafka.apache.org/documentation/#producerconfigs
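As a sketch, the setup described above (7 retries with acknowledgements enabled) would look roughly like this; the bootstrap server and serializers are placeholders:

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.ACKS_CONFIG, "all"); // must not be 0, otherwise retries are irrelevant
props.put(ProducerConfig.RETRIES_CONFIG, 7);  // applies only to retriable errors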
Kafka retries happen for transient exceptions such as NotEnoughReplicasException.
In Kafka versions <= 2.0 the default retries is 0.
In Kafka versions > 2.0 the default retries is Integer.MAX_VALUE.
From Kafka 2.1 onward, retries are bounded by timeouts; there are a couple of producer configurations for this, such as:
delivery.timeout.ms=120000 - by default the producer will keep retrying for 2 minutes; if the send has not succeeded after 2 minutes, the producer gives up and reports the failure, and we have to handle it manually.
retry.backoff.ms=100 - by default the producer waits 100 ms between retries, retrying until delivery.timeout.ms is reached.
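Expressed with the ProducerConfig constants, and using the default values quoted above, that is roughly:

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

Properties props = new Properties();
props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120000); // overall bound: give up on a record after 2 minutes
props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 100);       // wait 100 ms between retry attempts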

How to fix 'Kafka Offset commit failed on partition: The request timed out'

I suddenly got exceptions of this type in production Kafka:
ERROR[pool-XX-thread-YY] org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - [Consumer clientId=someclientid, groupId=somegroup] Offset commit failed on partition SomeTopic-SomePartition at offset SomeOffset: The request timed out.
This occurred for 3.5 seconds across a lot of different services (clients) (different threads/different topics/different partitions) and then just self-healed...
The offset commit configuration is a 5-second auto-commit for all those clients.
I couldn't track anything in the Kafka broker logs except for some rebalancing of one group (out of the 10 that had the issue), which is normal when heartbeating fails. In the metrics server I can see some spikes in commit latency, which I guess is the symptom, and some TCP spikes on 1 broker (out of 3).
How can I start investigating it? What can cause such an issue? Where should I look when something like this occurs?
Attaching photos of some graphs here:
TCP Spike in server-3
Commit latency spike
Group syncs
Heartbeats

Kafka UNKNOWN_PRODUCER_ID exception

I sometimes get an UNKNOWN_PRODUCER_ID exception when using Kafka Streams.
2018-06-25 10:31:38.329 WARN 1 --- [-1-1_0-producer] o.a.k.clients.producer.internals.Sender : [Producer clientId=default-groupz-7bd94946-3bc0-4400-8e73-7126b9b9c0d4-StreamThread-1-1_0-producer, transactionalId=default-groupz-1_0] Got error produce response with correlation id 1996 on topic-partition default-groupz-mplat-five-minute-stat-urlCount-counts-store-changelog-0, retrying (2147483646 attempts left). Error: UNKNOWN_PRODUCER_ID
Referring to the official documentation:
This exception is raised by the broker if it could not locate the
producer metadata associated with the producerId in question. This
could happen if, for instance, the producer's records were deleted
because their retention time had elapsed. Once the last records of the
producerId are removed, the producer's metadata is removed from the
broker, and future appends by the producer will return this exception.
It says one possibility is that a producer was idle for more than the retention time (by default a week), so the producer's metadata was removed from the broker. Are there any other reasons brokers could not locate producer metadata?
You might be experiencing https://issues.apache.org/jira/browse/KAFKA-7190. As it says in that ticket:
When a streams application has little traffic, then it is possible that consumer purging would delete
even the last message sent by a producer (i.e., all the messages sent by
this producer have been consumed and committed), and as a result, the broker
would delete that producer's ID. The next time when this producer tries to
send, it will get this UNKNOWN_PRODUCER_ID error code, but in this case,
this error is retriable: the producer would just get a new producer id and
retries, and then this time it will succeed.
This issue is also being tracked at https://cwiki.apache.org/confluence/display/KAFKA/KIP-360%3A+Improve+handling+of+unknown+producer
Two reasons might delete your producer's metadata:
The log segments are deleted due to hitting the retention time.
The producer state might expire due to inactivity, which is controlled by the setting transactional.id.expiration.ms (defaults to 7 days).
So if your Kafka is < 2.4 you can work around this by increasing the retention time of your topic's log (e.g. 30 days, assuming your system allows that) and increasing the transactional.id.expiration.ms setting (to 24 days) until KIP-360 is released:
log.retention.hours=720
transactional.id.expiration.ms=2073600000
This should guarantee that for low-traffic topics (where messages are written less often than every 7 days), your producer's metadata state will remain stored in the broker's memory for a longer period, decreasing the risk of getting UnknownProducerIdException.

UnknownProducerIdException in Kafka streams when enabling exactly once

After enabling exactly once processing on a Kafka streams application, the following error appears in the logs:
ERROR o.a.k.s.p.internals.StreamTask - task [0_0] Failed to close producer
due to the following error:
org.apache.kafka.streams.errors.StreamsException: task [0_0] Abort
sending since an error caught with a previous record (key 222222 value
some-value timestamp 1519200902670) to topic exactly-once-test-topic-
v2 due to This exception is raised by the broker if it could not
locate the producer metadata associated with the producerId in
question. This could happen if, for instance, the producer's records
were deleted because their retention time had elapsed. Once the last
records of the producerId are removed, the producer's metadata is
removed from the broker, and future appends by the producer will
return this exception.
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.recordSendError(RecordCollectorImpl.java:125)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.access$500(RecordCollectorImpl.java:48)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl$1.onCompletion(RecordCollectorImpl.java:180)
at org.apache.kafka.clients.producer.KafkaProducer$InterceptorCallback.onCompletion(KafkaProducer.java:1199)
at org.apache.kafka.clients.producer.internals.ProducerBatch.completeFutureAndFireCallbacks(ProducerBatch.java:204)
at org.apache.kafka.clients.producer.internals.ProducerBatch.done(ProducerBatch.java:187)
at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:627)
at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:596)
at org.apache.kafka.clients.producer.internals.Sender.completeBatch(Sender.java:557)
at org.apache.kafka.clients.producer.internals.Sender.handleProduceResponse(Sender.java:481)
at org.apache.kafka.clients.producer.internals.Sender.access$100(Sender.java:74)
at org.apache.kafka.clients.producer.internals.Sender$1.onComplete(Sender.java:692)
at org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:101)
at org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:482)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:474)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:239)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:163)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.common.errors.UnknownProducerIdException
We've reproduced the issue with a minimal test case where we move messages from a source stream to another stream without any transformation. The source stream contains millions of messages produced over several months. The KafkaStreams object is created with the following StreamsConfig:
StreamsConfig.PROCESSING_GUARANTEE_CONFIG = "exactly_once"
StreamsConfig.APPLICATION_ID_CONFIG = "Some app id"
StreamsConfig.NUM_STREAM_THREADS_CONFIG = 1
ProducerConfig.BATCH_SIZE_CONFIG = 102400
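Expressed as Java properties, that configuration is roughly the following (the application id and bootstrap servers are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "some-app-id");       // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 1);
props.put(StreamsConfig.producerPrefix(ProducerConfig.BATCH_SIZE_CONFIG), 102400);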
The app is able to process some messages before the exception occurs.
Context information:
we're running a 5 node Kafka 1.1.0 cluster with 5 zookeeper nodes.
there are multiple instances of the app running
Has anyone seen this problem before or can give us any hints about what might be causing this behaviour?
Update
We created a new 1.1.0 cluster from scratch and started to process new messages without problems. However, when we imported old messages from the old cluster, we hit the same UnknownProducerIdException after a while.
Next we tried setting the cleanup.policy on the sink topic to compact while keeping retention.ms at 3 years. Now the error did not occur. However, messages seem to have been lost: the source offset is 106 million and the sink offset is 100 million.
As explained in the comments, there currently seems to be a bug that may cause problems when replaying messages older than the (maximum configurable?) retention time.
At time of writing this is unresolved, the latest status can always be seen here:
https://issues.apache.org/jira/browse/KAFKA-6817