Apache Kafka: Lowering `request.timeout.ms` causes metadata fetch failures?

I have a 9 broker, 5 node Zookeeper Kafka setup.
To reduce the time it takes to report failures, we set request.timeout.ms to 3000. However, with this setting I'm observing some weird behavior.
Occasionally, I see the client (producer) get an error:
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.
This doesn't always happen; some producers work just fine.
When I bumped up the request.timeout.ms value, I didn't see any errors.
Any idea why lowering request.timeout.ms causes metadata fetch timeouts?
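For context, the 60000 ms in that error does not come from request.timeout.ms itself: it is usually the producer's max.block.ms (how long send() may block waiting for metadata), which defaults to 60000. With request.timeout.ms at 3000, individual metadata requests can time out and be retried until max.block.ms is exhausted. A minimal sketch of the two settings involved (broker addresses and values are placeholders, not recommendations):

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092"); // placeholder
// Upper bound for a single request (including metadata requests) before it is retried.
props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 3000);
// How long send() may block waiting for metadata; this is the 60000 ms in the exception.
props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 60000);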

Related

"min.fetch.bytes" not Working Properly when used in Conjunction with "fetch.max.wait.ms" in Kafka Consumer

I have a use case where I need to wait for a specific time interval before fetching records from Kafka. But if a minimum amount of data is already present in the topic, I need to get the records immediately rather than waiting for the interval.
I set the following Kafka consumer config values:
spring:
  kafka:
    consumer:
      auto-offset-reset: latest
      properties:
        security.protocol: SASL_SSL
        sasl.mechanism: PLAIN
        ssl.endpoint.identification.algorithm: https
        max.poll.interval.ms: 3600000
        max.poll.records: 10000
        fetch.min.bytes: 2
        fetch.max.wait.ms: 60000
        request.timeout.ms: 120000
retry:
  interval: 1000
  max.attempts: 3
I am observing the following:
When retryable exceptions occur, the consumer tries to commit the current offsets and re-seek the current position. But that fetch request also takes 60 s, even though fetch.min.bytes is set to 2.
Can someone please help here or explain why this behaviour is observed?
Why are records not returned even though fetch.min.bytes is set to 2, and why does the fetch always wait the full 60 s? This happens especially when retryable exceptions occur.
Added a screenshot of the logs. We can see that after the exception has occurred, it takes at least 1 min again for the message to be retried.
Note: the Spring Kafka version I am using on the consumer side is org.springframework.kafka:spring-kafka=2.8.8.
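For what it's worth, the same two fetch settings can be reproduced outside Spring with a plain KafkaConsumer. This is only a sketch (bootstrap servers, group id, and topic name are placeholders) to check whether a poll returns as soon as 2 bytes are available, which is what fetch.min.bytes=2 should mean, or really waits the full fetch.max.wait.ms:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
props.put(ConsumerConfig.GROUP_ID_CONFIG, "fetch-wait-test");         // placeholder
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 2);        // broker answers once >= 2 bytes are available...
props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 60000);  // ...or after 60 s, whichever comes first

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("my-topic"));         // placeholder topic
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(70));
    System.out.println("Fetched " + records.count() + " records");
}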

Kafka retries concept - on what basis are retries stopped in Kafka?

As I am new to Kafka, I am trying to understand the retries concept in Kafka. On what basis is the retry process completed?
For example, say we set the retries parameter to 7. Now my questions:
Will Kafka retry all 7 times?
Will it keep retrying until the operation succeeds? If so, how does Kafka know it was successful?
If that depends on some parameter, what is that parameter and how does it work?
In distributed systems, retries are inevitable. From network errors to replication issues and even outages in downstream dependencies, services operating at a massive scale must be prepared to encounter, identify, and handle failure as gracefully as possible.
Kafka will retry until the initiated process completes successfully or the retry count reaches zero.
Kafka maintains the status of each API call (producer, consumer, and Streams), and if the error condition is met, the retry count is decremented.
Please go through the completeBatch function of Sender.java at the following URL for more information.
https://github.com/apache/kafka/blob/68ac551966e2be5b13adb2f703a01211e6f7a34b/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java
I guess you are talking about the producer retrying to send failed messages.
From the Kafka producer retries property documentation:
"Setting a value greater than zero will cause the client to resend any
record whose send fails with a potentially transient error."
This means that the Kafka producer will retry if the error it encountered is considered "retriable". Not all errors are retriable - for example, if the target Kafka topic does not exist, there's no point in trying to send the message again.
But if, for example, the connection was interrupted, it makes sense to try again.
Important to note: retries are only relevant if you have set acks != 0.
So, in your example you have 7 retries configured.
I assume that acks is set to a value other than 0, because otherwise no retries will be attempted.
If your message failed with a non-retriable error, the Kafka producer will not try to send the message again (it will actually give up on that message and move on to the next ones).
If your message failed with a retriable error, the Kafka producer will retry sending until the message is successfully sent or until retries are exhausted (when 7 retries were attempted and none of them succeeded).
The Kafka producer client knows when your message was successfully sent to the broker because, when acks is set to 1 or all, the broker acknowledges every message received and informs the producer (a kind of handshake between producer and broker).
see acks & retries # https://kafka.apache.org/documentation/#producerconfigs
Kafka retries happen for transient exceptions such as NotEnoughReplicasException.
In Kafka versions <= 2.0 the default retries is 0.
In Kafka versions > 2.0 the default retries is Integer.MAX_VALUE.
From Kafka 2.1 onwards, retries are bounded by timeouts; there are a couple of producer configurations involved, such as:
delivery.timeout.ms=120000 - by default the producer will retry for 2 minutes; if it is still not successful after 2 minutes, the request is not sent to the broker and we have to handle the failure manually.
retry.backoff.ms=100 - by default the producer retries every 100 ms until delivery.timeout.ms is reached.
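A minimal sketch of these timeout-bounded retry settings (the values are just the defaults mentioned above, not recommendations; note that Kafka requires delivery.timeout.ms to be at least linger.ms + request.timeout.ms):

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

Properties props = new Properties();
// Upper bound on the total time for a send, including all retries.
props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120000);
// Each individual request may take this long before it is considered failed and retried.
props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 30000);
// Wait between retry attempts.
props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 100);
// delivery.timeout.ms must be >= linger.ms + request.timeout.ms.
props.put(ProducerConfig.LINGER_MS_CONFIG, 0);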

Kafka Streams: TimeoutException: Failed to update metadata after 60000 ms

I'm running a Kafka Streams application that consumes one topic with 20 partitions at a traffic of 15K records/sec. The application does 1-hour windowing and suppresses the latest result per window. After running for some time, it starts getting TimeoutException and then the instance is marked down by Kafka Streams.
Error trace:
Caused by: org.apache.kafka.streams.errors.StreamsException:
task [2_7] Abort sending since an error caught with a previous record
(key 372716656751 value InterimMessage [xxxxx..]
timestamp 1566433307547) to topic XXXXX due to org.apache.kafka.common.errors.TimeoutException:
Failed to update metadata after 60000 ms.
You can increase producer parameter `retries` and
`retry.backoff.ms` to avoid this error.
I already increased the values of those two configs.
retries = 5
retry.backoff.ms = 80000
Should I increase them again, as the error message suggests? What would be good values for these two configs?
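Not an answer to what the "right" values are, but for reference: in Kafka Streams, producer-level settings such as retries and retry.backoff.ms are forwarded to the internal producers via StreamsConfig.producerPrefix. A sketch with illustrative values only (application id and bootstrap servers are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

Properties streamsProps = new Properties();
streamsProps.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");    // placeholder
streamsProps.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
// Forward producer-level settings to the producers embedded in the Streams app.
streamsProps.put(StreamsConfig.producerPrefix(ProducerConfig.RETRIES_CONFIG), 10);
streamsProps.put(StreamsConfig.producerPrefix(ProducerConfig.RETRY_BACKOFF_MS_CONFIG), 1000);
streamsProps.put(StreamsConfig.producerPrefix(ProducerConfig.MAX_BLOCK_MS_CONFIG), 120000);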

Kafka UNKNOWN_PRODUCER_ID exception

I sometimes get an UNKNOWN_PRODUCER_ID exception when using Kafka Streams.
2018-06-25 10:31:38.329 WARN 1 --- [-1-1_0-producer] o.a.k.clients.producer.internals.Sender : [Producer clientId=default-groupz-7bd94946-3bc0-4400-8e73-7126b9b9c0d4-StreamThread-1-1_0-producer, transactionalId=default-groupz-1_0] Got error produce response with correlation id 1996 on topic-partition default-groupz-mplat-five-minute-stat-urlCount-counts-store-changelog-0, retrying (2147483646 attempts left). Error: UNKNOWN_PRODUCER_ID
Referring to the official documentation:
This exception is raised by the broker if it could not locate the
producer metadata associated with the producerId in question. This
could happen if, for instance, the producer's records were deleted
because their retention time had elapsed. Once the last records of the
producerId are removed, the producer's metadata is removed from the
broker, and future appends by the producer will return this exception.
It says one possibility is that a producer is idle for more than the retention time (by default a week), so the producer's metadata is removed from the broker. Are there any other reasons why brokers could not locate producer metadata?
You might be experiencing https://issues.apache.org/jira/browse/KAFKA-7190. As it says in that ticket:
When a streams application has little traffic, then it is possible that consumer purging would delete
even the last message sent by a producer (i.e., all the messages sent by
this producer have been consumed and committed), and as a result, the broker
would delete that producer's ID. The next time when this producer tries to
send, it will get this UNKNOWN_PRODUCER_ID error code, but in this case,
this error is retriable: the producer would just get a new producer id and
retries, and then this time it will succeed.
This issue is also being tracked at https://cwiki.apache.org/confluence/display/KAFKA/KIP-360%3A+Improve+handling+of+unknown+producer
Two reasons might delete your producer's metadata:
The log segments are deleted due to hitting retention time.
The producer state might expire due to inactivity, which is controlled by the setting transactional.id.expiration.ms (defaults to 7 days).
So if your Kafka is < 2.4, you can work around this by increasing the retention time of your topic's log (e.g. to 30 days, assuming your system allows that) and by increasing the transactional.id.expiration.ms setting (e.g. to 24 days) until KIP-360 is released:
log.retention.hours=720
transactional.id.expiration.ms=2073600000
This should guarantee that for low-traffic topics (where messages are written less often than every 7 days), your producer's metadata state remains stored in the broker's memory for a longer period, thus decreasing the risk of getting UnknownProducerIdException.
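If you prefer to raise retention for just the affected topic rather than broker-wide, one possible way is through the AdminClient (a sketch only: topic name, broker address, and the 30-day value are placeholders, and incrementalAlterConfigs requires brokers >= 2.3). Note that transactional.id.expiration.ms itself is a broker setting and still belongs in server.properties:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

static void extendRetention(String bootstrapServers, String topicName) throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    try (AdminClient admin = AdminClient.create(props)) {
        ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, topicName);
        AlterConfigOp setRetention = new AlterConfigOp(
                new ConfigEntry("retention.ms", "2592000000"), // 30 days, illustrative
                AlterConfigOp.OpType.SET);
        admin.incrementalAlterConfigs(
                Collections.singletonMap(topic, Collections.singletonList(setRetention)))
             .all().get();
    }
}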

Kafka Producer is not retrying when Broker is Down

I have set up Kafka version 0.9 with a basic configuration:
1 broker, 1 topic, and 1 partition.
Below are the producer configurations I added to enable retries from the producer.
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.RETRIES_CONFIG, 5);                 // retry failed sends up to 5 times
props.put(ProducerConfig.RECONNECT_BACKOFF_MS_CONFIG, 500);  // wait 500 ms between reconnect attempts
props.put(ProducerConfig.ACKS_CONFIG, "all");                // wait for all in-sync replicas to acknowledge
props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 500);          // block at most 500 ms waiting for metadata
props.put(ProducerConfig.METADATA_MAX_AGE_CONFIG, 50);       // refresh metadata every 50 ms
I understand from the documents that
Setting a value greater than zero will cause the client to resend any record whose send fails with a potentially transient error. Note that this retry is no different than if the client resent the record upon receiving the error.
Both my broker and ZooKeeper are down, and the retry operation is not working.
ERROR o.s.k.s.LoggingProducerListener - Exception thrown when sending a message to topic TestTopic1|
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 500 ms.
I need to know if I am missing anything here for the retry to work.
Resend (retry) works only if you have a connection to the broker and something went wrong while sending a message.
So, if your broker is dead, there is no reason to send the message at all - there is no connection. That is what the exception is about.
I think retries should work anyway, even if the broker is down. This is the whole reason to have retries in the first place. Could be a temporary network issue after all.
There is a bug in the Kafka 0.9.0.1 producer which causes retries not to work. See here.
Fixed in 0.9.0.2 (which is not released yet) and 0.10. I'd upgrade the broker to 0.10 and try again.
As @artem answered, the Kafka producer config is not designed to retry when the broker is down. It only retries on transient errors, which is pretty much useless to be honest. It beats me why Spring Kafka did not take care of it.
Anyway, to solve the situation I handled it with @Retry config in Spring Boot, roughly along the lines of the sketch below. Check this SO answer for details: https://stackoverflow.com/a/65248428/6621377
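A minimal sketch of that approach, assuming Spring Retry's @Retryable is what the linked answer uses (class and method names are made up for illustration; @EnableRetry must be present on a configuration class):

import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.retry.annotation.Backoff;
import org.springframework.retry.annotation.Retryable;
import org.springframework.stereotype.Service;

@Service
public class RetryingSender {

    private final KafkaTemplate<String, String> kafkaTemplate;

    public RetryingSender(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    // Spring Retry re-invokes this method when it throws, e.g. because the broker is unreachable.
    @Retryable(maxAttempts = 5, backoff = @Backoff(delay = 2000)) // illustrative values
    public void send(String topic, String payload) throws Exception {
        // Block on the future so a failed send surfaces as an exception and triggers a retry.
        kafkaTemplate.send(topic, payload).get();
    }
}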