Kafka : Error from SyncGroup, The request timed out - apache-kafka

Recently we are experiencing "Error from SyncGroup: The request timed out" frequently with the Java Kafka APIs.
This issue usually happens with few topic or consumer group in Kafka cluster. Does anyone can provide some pointers about this error?
As a workaround, if I change the consumer group name I don't see the error.
Broker Version : 0.9.0
Kafka client version : 0.9.0.1
Exception in thread "main" org.apache.kafka.common.KafkaException: Unexpected error from SyncGroup: The request timed out.
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$SyncGroupRequestHandler.handle(AbstractCoordinator.java:444)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$SyncGroupRequestHandler.handle(AbstractCoordinator.java:411)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:665)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:644)
at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:167)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:133)
at org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:107)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.onComplete(ConsumerNetworkClient.java:380)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:274)

#zer0Id0l
We have had the same problem recently. It happens because some Kafka Streams messages have meta information footprint which is more than a regular one (when you don't use Kafka Streams). To fix the issue, go to __consumer_offsets topic settings and set max.message.bytes param higher than it is by default. For example, in our case we have max.message.bytes = 20971520. That will completely solve your problem.

Related

Spring Cloud Stream Kafka Binder Async Producer messages getting lost if broker goes down

I am using Spring Cloud Stream Kafka Binder to produce message into Kafka. I kept producer sync to false, enabled error channel for producer, unclean leader election true on server side.
spring.cloud.stream.kafka.bindings.outputChannelName.producer.sync: false
error-channel-enabled: true
unclean.leader.election.enabled: true
I subscribed to errorChannel, to log the message those are failed to send.
I saw for async producer till delivery.timeout.ms reached messages are getting retried for UNKNOWN_TOPIC_OR_PARTITION, NOT_LEADER_FOR_PARTITION, NETWORK_EXCEPTION errors if broker goes down. Not for all errors its retrying. And after delivery.timeout.ms exceeds records are getting expired and reached to errorChannel and printed in logs. But still I have some messages are getting lost, those are not visible in success or errorChannel. I don't find any related exception / error even after log level set to Debug.
Can someone faced same issue earlier?
My Query is:
How to track the lost messages?
Is there any specific configuration I am missing?
Is there any specific exception I need to search in logs?
Any suggestions are welcome.

Kafka Admin client unregistered causing metadata issues

After migrating our microservice functionality to Spring Cloud function we have been facing issues with one of the producer topics.
Event of type: abc and key: xxx_yyy could not be sent to kafka org.springframework.messaging.MessageHandlingException: error occurred in message handler [org.springframework.cloud.stream.binder.kafka.KafkaMessageChannelBinder$ProducerConfigurationMessageHandler#2333d598]; nested exception is org.springframework.kafka.KafkaException: Send failed; nested exception is org.apache.kafka.common.errors.TimeoutException: Topic pc-abc not present in metadata after 60000 ms.
o.s.kafka.support.LoggingProducerListener - Exception thrown when sending a message with key='byte[15]' and payload='byte[256]' to topic pc-abc and partition 6: org.apache.kafka.common.errors.TimeoutException: Topic pc-abc not present in metadata after 60000 ms.
FYI: Topics are already created in our staging/prod environment and are not to be created as the application starts.
My producer config:
spring.cloud.stream.bindings.pc-abc-out-0.content-type=application/json
spring.cloud.stream.bindings.pc-abc-out-0.destination=pc-abc
spring.cloud.stream.bindings.pc-abc-out-0.producer.header-mode=headers
***spring.cloud.stream.bindings.pc-abc-out-0.producer.partition-count=5***
spring.cloud.stream.bindings.pc-abc-out-0.producer.partitionKeyExpression=payload.key
spring.cloud.stream.kafka.bindings.pc-abc-out-0.producer.sync=true
I am kind of stuck at this point and exhausted. Has anyone else faced this issue?
Spring Cloud version: 2.5.5
Kafka: 2.7.1
The issue is :
The producer is configured with partition-count=5
and Kafka is looking for partition number 6 , which obviously does not exist , I have commented the auto-add partitions property, but the issue still turns up !! Is it stale configuration? How do I force kafka to take up new configuration.

Kakfa retries Concept - What Basis retries will be stopped in Kafka?

As am new to Kafka , trying to understand the retries concept in Kafka . What basis retries process will be completed ?
Example Retries parameter we set as 7 . Now questions here ,
Kafka will be retried in all 7 times ?
Will be tried until successful process ? If so , How Kafka will come to know about successful ?
If that would be depends upon any parameter what Is that parameter and how ?
In distributed systems, retries are inevitable. From network errors to replication issues and even outages in downstream dependencies, services operating at a massive scale must be prepared to encounter, identify, and handle failure as gracefully as possible.
Kafka will retry until the initiated process is successfully completed or retry count is zero.
Kafka maintains the status of each API call ( producer , consumer, and Streams ), and if the error condition meets then retry count is decreased.
Please go through the completeBatch function of the Sender.java in the following URL to get more information.
https://github.com/apache/kafka/blob/68ac551966e2be5b13adb2f703a01211e6f7a34b/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java
I guess you are talking about producer retrying to send failed messages.
From kafka producer retries property documentation -
"Setting a value greater than zero will cause the client to resend any
record whose send fails with a potentially transient error."
This means that kafka producer will retry if the error it encountered is considered "Retriable". not all errors are retriable - for example, if the target kafka topic does not exist, theres no point in trying to send the message again.
but if for example the connection was interrupted, it makes sense to try again.
Important to note - retries are only relevant if you have set broker ack != 0.
So, in your example you have 7 retries configured.
I assume that ack is set to a value different than 0 because then no retries will be attempted.
If your message failed with a non-retriable error, Kafka producer will not try to send the message again (it will actually 'give-up' on that message and move on to next messages).
If your message failed with a retriable error, Kafka producer will retry sending until message is successfully sent, or until retries are exhausted (when 7 retries were attempted and none of them succeeded).
Kafka client producer knows when your message was successfully sent to broker because when ack is set to 1\all, the kafka broker is "Acknowledging" any message received and informs the producer (in a kind of handshake between the producer and broker).
see acks & retries # https://kafka.apache.org/documentation/#producerconfigs
Kafka reties happens for transient exceptions such as NotEnoughReplicaException.
In Kafka version <=2.0 default retry is 0.
In Kafka version > 2.0 default retry is Integer.MAX
From kafka 2.1 retries are bounded to timeouts, there are couple of producer configuration such as.
delivery.timeout.ms=120000ms - by default producer will retry for 2 mins, if retry is not successful after 2 mins the request will not send to broker and we have to handle manually.
retry.backoff.ms=100ms - by default every 100ms producer will retry till delivery.timeout reaches.

UnknownProducerIdException in Kafka streams when enabling exactly once

After enabling exactly once processing on a Kafka streams application, the following error appears in the logs:
ERROR o.a.k.s.p.internals.StreamTask - task [0_0] Failed to close producer
due to the following error:
org.apache.kafka.streams.errors.StreamsException: task [0_0] Abort
sending since an error caught with a previous record (key 222222 value
some-value timestamp 1519200902670) to topic exactly-once-test-topic-
v2 due to This exception is raised by the broker if it could not
locate the producer metadata associated with the producerId in
question. This could happen if, for instance, the producer's records
were deleted because their retention time had elapsed. Once the last
records of the producerId are removed, the producer's metadata is
removed from the broker, and future appends by the producer will
return this exception.
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.recordSendError(RecordCollectorImpl.java:125)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.access$500(RecordCollectorImpl.java:48)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl$1.onCompletion(RecordCollectorImpl.java:180)
at org.apache.kafka.clients.producer.KafkaProducer$InterceptorCallback.onCompletion(KafkaProducer.java:1199)
at org.apache.kafka.clients.producer.internals.ProducerBatch.completeFutureAndFireCallbacks(ProducerBatch.java:204)
at org.apache.kafka.clients.producer.internals.ProducerBatch.done(ProducerBatch.java:187)
at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:627)
at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:596)
at org.apache.kafka.clients.producer.internals.Sender.completeBatch(Sender.java:557)
at org.apache.kafka.clients.producer.internals.Sender.handleProduceResponse(Sender.java:481)
at org.apache.kafka.clients.producer.internals.Sender.access$100(Sender.java:74)
at org.apache.kafka.clients.producer.internals.Sender$1.onComplete(Sender.java:692)
at org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:101)
at org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:482)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:474)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:239)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:163)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.common.errors.UnknownProducerIdException
We've reproduced the issue with a minimal test case where we move messages from a source stream to another stream without any transformation. The source stream contains millions of messages produced over several months. The KafkaStreams object is created with the following StreamsConfig:
StreamsConfig.PROCESSING_GUARANTEE_CONFIG = "exactly_once"
StreamsConfig.APPLICATION_ID_CONFIG = "Some app id"
StreamsConfig.NUM_STREAM_THREADS_CONFIG = 1
ProducerConfig.BATCH_SIZE_CONFIG = 102400
The app is able to process some messages before the exception occurs.
Context information:
we're running a 5 node Kafka 1.1.0 cluster with 5 zookeeper nodes.
there are multiple instances of the app running
Has anyone seen this problem before or can give us any hints about what might be causing this behaviour?
Update
We created a new 1.1.0 cluster from scratch and started to process new messages without problems. However, when we imported old messages from the old cluster, we hit the same UnknownProducerIdException after a while.
Next we tried to set the cleanup.policy on the sink topic to compact while keeping the retention.ms at 3 years. Now the error did not occur. However, messages seem to have been lost. The source offset is 106 million and the sink offset is 100 million.
As explained in the comments, there currently seems to be a bug that may cause problems when replaying messages older than the (maximum configurable?) retention time.
At time of writing this is unresolved, the latest status can always be seen here:
https://issues.apache.org/jira/browse/KAFKA-6817

Kafka Producer is not retring when Broker is Down

I have setup up Kafka using version 0.9 with the basic configuration as
1 Broker 1 Topic and 1 Partition.
Below are Producer Configurations that I have added to enable the retry from Producer.
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.RETRIES_CONFIG, 5);
props.put(ProducerConfig.RECONNECT_BACKOFF_MS_CONFIG, 500);
props.put(ProducerConfig.ACKS_CONFIG, "all");
props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 500);
props.put(ProducerConfig.METADATA_MAX_AGE_CONFIG, 50);
I understand from the documents that
Setting a value greater than zero will cause the client to resend any record whose send fails with a potentially transient error. Note that this retry is no different than if the client resent the record upon receiving the error.
Both my Broker & Zookeeper are down and the retry operation is not working.
ERROR o.s.k.s.LoggingProducerListener - Exception thrown when sending a message to topic TestTopic1|
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 500 ms.
I need to know if I am missing anything here for the retry to work.
Resend (retry) works only if you have connection to the Broker and something happened during sending a message.
So, if your Broker is dead, there is no any reason to send message at all - no connection. And that is an exception about.
I think retries should work anyway, even if the broker is down. This is the whole reason to have retries in the first place. Could be a temporary network issue after all.
There is a bug in the Kafka 0.9.0.1 producer which causes retries not to work. See here.
Fixed in 0.9.0.2 (which is not released yet) and 0.10. I'd upgrade the broker to 0.10 and try again.
As #artem answered Kafka producer config is not designed to retry when broker is down. It only retries during transient errors which is pretty much useless to be honest. It beats me why spring-Kafka did not take care of it.
Anyways to solve the situation I handled this with #Retry config with springboot. Checkin this SO answer for details : https://stackoverflow.com/a/65248428/6621377