Why is KafkaTemplate losing connection to Kafka after AWS MSK cluster update? - apache-kafka

We have a Spring Boot application producing messages to an AWS MSK Kafka cluster. Every now and then our MSK cluster gets an automatic security update (or similar), and afterwards our KafkaTemplate producer seems to lose its connection to the cluster, so all sends end up timing out. The producer doesn't recover from this automatically and keeps trying to send messages. Subsequent idempotent sends throw an exception:
org.apache.kafka.common.errors.ClusterAuthorizationException: The producer is not authorized to do idempotent sends
Restarting the producer application fixes the issue. Our producer is a very simple application using KafkaTemplate to send messages, without any custom retry logic.
One suggestion was to add a producer reset call to the error handler, but testing that solution is very hard as there seems to be no reliable way to reproduce the issue.
https://docs.spring.io/spring-kafka/api/org/springframework/kafka/core/ProducerFactory.html#reset()
Any ideas why this happens and what is the best way to fix it?
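A minimal sketch of that suggestion, assuming the ProducerFactory bean is injectable with these generics and that the ClusterAuthorizationException surfaces as the cause of the failed send future (both are assumptions, not something verified against MSK):

import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.apache.kafka.common.errors.ClusterAuthorizationException;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.core.ProducerFactory;
import org.springframework.stereotype.Component;

@Component
public class ResettingSender {

    private final KafkaTemplate<String, String> kafkaTemplate;
    private final ProducerFactory<String, String> producerFactory;

    public ResettingSender(KafkaTemplate<String, String> kafkaTemplate,
                           ProducerFactory<String, String> producerFactory) {
        this.kafkaTemplate = kafkaTemplate;
        this.producerFactory = producerFactory;
    }

    public void send(String topic, String payload) {
        try {
            // Block here so the failure is visible; adapt to your async style as needed.
            kafkaTemplate.send(topic, payload).get(30, TimeUnit.SECONDS);
        } catch (ExecutionException | TimeoutException e) {
            if (e.getCause() instanceof ClusterAuthorizationException) {
                // Close the cached producer(s) so the next send creates a fresh one.
                producerFactory.reset();
            }
            throw new IllegalStateException("Send failed", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Send interrupted", e);
        }
    }
}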

We have an open issue to close the producer on any timeout...
https://github.com/spring-projects/spring-kafka/issues/2251
Contributions are welcome.

Related

Kafka : Failed to update metadata after 60000 ms with only one broker down

We have a Kafka producer configured as:
metadata.broker.list=broker1:9092,broker2:9092,broker3:9092,broker4:9092
serializer.class=kafka.serializer.StringEncoder
request.required.acks=1
request.timeout.ms=30000
batch.num.messages=25
message.send.max.retries=3
producer.type=async
compression.codec=snappy
The replication factor is 3 and the total number of partitions is currently 108.
The rest of the properties are defaults.
This producer was running absolutely fine. Then, for some reason, one of the brokers went down and our producer started to log:
"Failed to update metadata after 60000 ms". Nothing else appeared in the log, just this error. At intervals, a few requests were getting blocked, even though the producer was async.
The issue was resolved when the broker came back up.
What can be the reason for this? As I understand it, one broker being down should not affect the system as a whole.
Posting the answer for anyone who might face this issue:
The reason is the older version of the Kafka producer. Kafka producers take the bootstrap servers as a list. In older versions, when fetching metadata, the producer tries to connect to all of the servers in round-robin fashion, so if one of the brokers is down, the requests going to that broker fail and this message appears.
Solution:
Upgrade to a newer producer version (a configuration sketch follows below).
Reduce the metadata.fetch.timeout.ms setting: this ensures the main thread is not blocked and the send fails sooner. The default value is 60000 ms. This is not needed on newer versions.
Note: the Kafka send method blocks until the producer is able to write to its buffer.
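For reference, a rough sketch of the equivalent setup on the newer Java producer API. The property names are from the modern client; the values simply carry over the old configuration above, and the topic name is hypothetical, so treat this as an illustration rather than a drop-in replacement:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class NewProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // bootstrap.servers replaces metadata.broker.list; only one broker needs to be reachable
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
                "broker1:9092,broker2:9092,broker3:9092,broker4:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "1");                  // request.required.acks=1
        props.put(ProducerConfig.RETRIES_CONFIG, 3);                 // message.send.max.retries=3
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy"); // compression.codec=snappy
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 10_000);       // bound how long send() may block on metadata

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace();
                        }
                    });
        }
    }
}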
I got the same error because I forgot to create the topic. Once I created the topic the issue was resolved.

Is it always necessary to restart streams after a Kafka broker outage/failover

We are using the Kafka Streams (0.11.0.1) API to consume events from a topic, but whenever there is a Kafka broker outage/failover we need to restart all Kafka Streams applications to recover from the following error:
"Connection to node 39366 could not be established. Broker may not be available."
Is it really necessary for the streams applications to close and restart? Why are they not able to recover from this automatically? Or are we missing some configuration on the client/broker?
We are now planning to introduce code changes to handle all stream exceptions and trigger an automated restart of the streams, but I am really worried whether that is the right way to handle this scenario.
If you think about a real-world use case where hundreds of clients are connected to the brokers, restarting each of them makes no sense.
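In case it helps to frame the discussion, here is a rough sketch of such a self-restart, using the classic Thread.UncaughtExceptionHandler hook and a hypothetical topic name (API details vary across Kafka Streams versions, so this is illustrative only):

import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;

public class SelfRestartingStreams {

    private final Properties config;   // application.id, bootstrap.servers, default serdes, ...
    private volatile KafkaStreams streams;

    public SelfRestartingStreams(Properties config) {
        this.config = config;
    }

    public synchronized void start() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("events-topic")                       // hypothetical topic name
               .foreach((key, value) -> { /* processing */ });

        streams = new KafkaStreams(builder.build(), config);
        streams.setUncaughtExceptionHandler((thread, error) ->
                // A stream thread died; close the instance and build a fresh one,
                // because a closed KafkaStreams cannot simply be started again.
                new Thread(() -> {
                    streams.close();
                    start();
                }).start());
        streams.start();
    }
}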

Spring Cloud Stream Kafka Binder autoCommitOnError=false get unexpected behavior

I am using Spring Boot 2.1.1.RELEASE and Spring Cloud Greenwich.RC2, and the managed version of spring-cloud-stream-binder-kafka is 2.1.0RC4. The Kafka version is 1.1.0. I have set the following properties, since messages should not be consumed if there is an error.
spring.cloud.stream.bindings.input.group=consumer-gp-1
...
spring.cloud.stream.kafka.bindings.input.consumer.autoCommitOnError=false
spring.cloud.stream.kafka.bindings.input.consumer.enableDlq=false
spring.cloud.stream.bindings.input.consumer.max-attempts=3
spring.cloud.stream.bindings.input.consumer.back-off-initial-interval=1000
spring.cloud.stream.bindings.input.consumer.back-off-max-interval=3000
spring.cloud.stream.bindings.input.consumer.back-off-multiplier=2.0
....
There are 20 partitions in the Kafka topic and Kerberos is used for authentication (not sure if this is relevant).
The Kafka consumer calls a web service for every message it processes, and if the web service is unavailable I expect the consumer to try processing the message 3 times before moving on to the next one. So for my test I disabled the web service, and therefore none of the messages could be processed correctly. From the logs I can see that this is happening.
After a while I stopped and then restarted the Kafka consumer (the web service was still disabled). I was expecting that after the restart the consumer would attempt to process the messages that were not successfully processed the first time around. From the logs (I print out each message with its fields) I couldn't see this happening after the restart. I thought the partitions might be influencing something, but I checked the logs and all 20 partitions were assigned to this single consumer.
Is there a property I have missed? I thought the expected behaviour when I restart the consumer a second time is that the broker would deliver the records that were not successfully processed to the consumer again.
Thanks
Parameters working as expected. See comment.

Kafka producer unable to produce after one broker dies in a cluster

Thank you, everyone, for all your assistance; my project is successfully integrated with Kafka. While testing I came across an issue for which I need a little help. My producer and consumer were both pointing to one Kafka broker, say KB-1, in a cluster of brokers with replication factor 1. KB-1 died due to some internal reason, so we switched the IPs of our producer and consumer to another broker of the same cluster, KB-2. Consuming worked fine: we could consume all the data and raise the necessary alerts. But when I tried to produce data with KB-2's IP in the bootstrap servers, it failed with the following error: org.apache.kafka.common.errors.TimeoutException
Please also explain single point of failure if possible.
Thank you for the help.

Kafka: What happens when the entire Kafka Cluster is down?

We're testing out the Producer and Consumer using Kafka. A few questions:
What happens when all the brokers are down and they're not responding at all?
Does the Producer need to keep pinging the Kafka brokers to know when it is back up online? Or is there a more elegant way for the Producer application to know?
How does Zookeeper help in all this? What if the ZK is down as well?
If one or more brokers are down, the producer will retry for a certain period of time (based on its settings). During this time one or more consumers will not be able to read anything until the respective brokers are back up.
But if the cluster is down for longer than your total retry period, then you probably need to find a way to resend those failed messages.
This is one scenario where Kafka mirroring (the MirrorMaker tool) comes into the picture.
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=27846330
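As a rough illustration of "retry for a certain period of time (based on the settings)": a producer sketch that bounds internal retries with delivery.timeout.ms and collects records that could not be delivered, so the application can resend them later. The queue-and-resend part is an assumption about how one might handle long outages, not a prescribed pattern:

import java.util.Properties;
import java.util.concurrent.ConcurrentLinkedQueue;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ResilientProducer {

    private final ConcurrentLinkedQueue<ProducerRecord<String, String>> failed =
            new ConcurrentLinkedQueue<>();
    private final KafkaProducer<String, String> producer;

    public ResilientProducer(String bootstrapServers) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Keep retrying internally for up to 2 minutes before giving up on a record.
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000);
        // Fail send() quickly if no broker metadata is available at all.
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 10_000);
        producer = new KafkaProducer<>(props);
    }

    public void send(ProducerRecord<String, String> record) {
        producer.send(record, (metadata, exception) -> {
            if (exception != null) {
                // Delivery gave up (e.g. the whole cluster is down); remember the record
                // so the application can resend it once the brokers are back.
                failed.add(record);
            }
        });
    }
}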
The producer will fail because the cluster is unavailable; it will eventually get a non-retriable error from the Kafka client, and depending on how your client handles it, messages will buffer in your application's local send queue.
I'm sure that if ZooKeeper is down your system will not work any more. This is one of the weaknesses of Kafka: it needs ZooKeeper to work.