Kafka Consumer's poll() method gets blocked - apache-kafka

I'm new to Kafka 0.9, and while testing some features I noticed strange behaviour in the Java consumer implementation (KafkaConsumer).
The Kafka broker is located on an external machine managed by Ambari.
Even though I could implement a Producer and start sending messages to the external broker, I have no clue why the consumer gets stuck when it tries to read the events (poll).
I know the producer is working well, since I can consume messages through the console consumer (which runs locally on the Ambari machine). But when I execute the Java consumer, nothing happens; it just gets stuck. Debugging the code, I could see that it blocks at the poll() line:
ConsumerRecords<String, String> records = consumer.poll(100);
The timeout does nothing, by the way. It doesn't matter whether you pass 0, 100 or 1000 ms; the consumer blocks on this line and neither times out nor throws an exception.
I tried all kinds of alternative properties, such as advertised.host.name, advertised.listeners, and so on, with zero luck.
Any help would be highly appreciated. Thanks in advance!

The reason might be that the machine where your consumer code is running is unable to connect to Zookeeper. Try running the same consumer code on the machine where Kafka is installed (I tried this and it worked for me). I also solved the problem by setting the properties below in the server.properties file:
advertised.host.name="ip address which you want to expose"
// In my case, it is the public IP of the EC2 machine, I have kafka and zookeeper installed on the same EC2 machine.
advertised.port=9092
Regarding the statement:
ConsumerRecords<String, String> records = consumer.poll(100);
The above statement doesn't mean the consumer will time out after 100 ms; rather, 100 ms is the polling period. Whatever data arrives within those 100 ms is read into the records collection.
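For reference, here is a minimal consumer sketch under the assumption that the hang comes from an unreachable or unresolvable broker address; the hostname, topic and group id below are placeholders, not values from the original post:
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class BlockingPollExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // The consumer must be able to resolve and reach this address; if the broker
        // advertises a hostname that is unreachable from this machine, poll() appears to hang.
        props.put("bootstrap.servers", "broker.example.com:9092");
        props.put("group.id", "test-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("test-topic"));
            while (true) {
                // 100 ms bounds how long this single poll() call waits for records;
                // it is not an overall consumer timeout.
                ConsumerRecords<String, String> records = consumer.poll(100);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }
}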

In my case, the poll() method ended up stuck in the endless loop in ensureCoordinatorReady(). The word "coordinator" hinted that the group coordinator was running on another host (for testing purposes I had only added one broker host to my /etc/hosts, while there are three brokers in total), so the consumer could not reach the consumer coordinator.
So the solution:
configure a correct /etc/hosts entry for every host running a Kafka broker.
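For example (hypothetical IP addresses and hostnames, just to illustrate the fix), the consumer machine's /etc/hosts should list every broker host, not just one:
10.0.0.11   kafka-broker-1
10.0.0.12   kafka-broker-2
10.0.0.13   kafka-broker-3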

Related

Kafka : Failed to update metadata after 60000 ms with only one broker down

We have a kafka producer configured as -
metadata.broker.list=broker1:9092,broker2:9092,broker3:9092,broker4:9092
serializer.class=kafka.serializer.StringEncoder
request.required.acks=1
request.timeout.ms=30000
batch.num.messages=25
message.send.max.retries=3
producer.type=async
compression.codec=snappy
The replication factor is 3 and the total number of partitions is currently 108.
The rest of the properties are defaults.
This producer was running absolutely fine. Then, for some reason, one of the brokers went down, and our producer started to log
"Failed to update metadata after 60000 ms". Nothing else was in the log; we were only seeing this error. At intervals, some requests were getting blocked, even though the producer was async.
The issue was resolved when the broker was up and running again.
What can be the reason for this? As per my understanding, one broker going down should not affect the system as a whole.
Posting the answer for anyone who might face this issue -
The reason is an older version of the Kafka producer. Kafka producers take the bootstrap servers as a list, and in older versions the producer tries to connect to all of those servers in round-robin fashion when fetching metadata. So if one of the brokers is down, the requests going to that server fail and this message appears.
Solution:
Upgrade to a newer producer version.
Alternatively, reduce the metadata.fetch.timeout.ms setting: this ensures the main thread is not blocked and send fails fast. The default value is 60000 ms. This is not needed in newer versions.
Note: the Kafka send() method blocks until the producer is able to write to its buffer.
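As a rough sketch of the "upgrade" route, the newer Java producer (org.apache.kafka.clients.producer) keeps trying the other bootstrap servers when one broker is down and lets you bound how long send() may block. The topic name below is a placeholder, and max.block.ms is the newer replacement for metadata.fetch.timeout.ms; treat exact property names as version-dependent:
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BoundedBlockingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092,broker4:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "1");
        props.put("retries", 3);
        props.put("compression.type", "snappy");
        // Bound how long send() may block waiting for metadata or buffer space.
        props.put("max.block.ms", 10000);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test-topic", "key", "value"), (metadata, exception) -> {
                if (exception != null) {
                    // Failed sends surface here instead of silently blocking the main thread.
                    exception.printStackTrace();
                }
            });
        }
    }
}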
I got the same error because I had forgotten to create the topic. Once I created the topic, the issue was resolved.

Spring Cloud Stream Kafka Binder autoCommitOnError=false get unexpected behavior

I am using Spring Boot 2.1.1.RELEASE and Spring Cloud Greenwich.RC2, and the managed version for spring-cloud-stream-binder-kafka is 2.1.0RC4. The Kafka version is 1.1.0. I have set the following properties, since messages should not be consumed if there is an error.
spring.cloud.stream.bindings.input.group=consumer-gp-1
...
spring.cloud.stream.kafka.bindings.input.consumer.autoCommitOnError=false
spring.cloud.stream.kafka.bindings.input.consumer.enableDlq=false
spring.cloud.stream.bindings.input.consumer.max-attempts=3
spring.cloud.stream.bindings.input.consumer.back-off-initial-interval=1000
spring.cloud.stream.bindings.input.consumer.back-off-max-interval=3000
spring.cloud.stream.bindings.input.consumer.back-off-multiplier=2.0
....
There are 20 partitions in the Kafka topic, and Kerberos is used for authentication (not sure if this is relevant).
The Kafka consumer calls a web service for every message it processes, and if the web service is unavailable I expect the consumer to try to process the message 3 times before moving on to the next message. So for my test I disabled the web service, and therefore none of the messages could be processed correctly. From the logs I can see that this is happening.
After a while I stopped and then restarted the Kafka consumer (the web service was still disabled). I was expecting that after the restart the consumer would attempt to process the messages that were not successfully processed the first time around. From the logs (I printed out each message with its fields), I couldn't see this happening after the restart. I thought the partitions might be influencing something, but I checked the logs and all 20 partitions were assigned to this single consumer.
Is there a property I have missed? I thought the expected behaviour when I restart the consumer the second time is that the Kafka broker would deliver the records that were not successfully processed to the consumer again.
Thanks
The parameters are working as expected. See the comment.
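For context, here is a minimal listener sketch for the "input" binding above (the class and method names are hypothetical). Throwing from the handler is what triggers the binder's retry template, which re-attempts delivery max-attempts times with the configured back-off before giving up on the message:
import org.springframework.cloud.stream.annotation.EnableBinding;
import org.springframework.cloud.stream.annotation.StreamListener;
import org.springframework.cloud.stream.messaging.Sink;

@EnableBinding(Sink.class)
public class WebServiceForwardingListener {

    @StreamListener(Sink.INPUT)
    public void handle(String payload) {
        // If the downstream web service call fails, let the exception propagate:
        // the binder's retry template then re-attempts delivery (max-attempts times,
        // with the configured back-off) before moving on.
        callWebService(payload);
    }

    private void callWebService(String payload) {
        // Placeholder for the actual web service call from the question.
        throw new IllegalStateException("web service unavailable: " + payload);
    }
}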

Kafka: What happens when the entire Kafka Cluster is down?

We're testing out the Producer and Consumer using Kafka. A few questions:
What happens when all the brokers are down and they're not responding at all?
Does the Producer need to keep pinging the Kafka brokers to know when they are back online? Or is there a more elegant way for the producer application to know?
How does Zookeeper help in all this? What if the ZK is down as well?
If one or more brokers are down, the producer will retry for a certain period of time (based on its settings), and during this time one or more of the consumers will not be able to read anything until the respective brokers are back up.
But if the cluster is down for longer than your total retry period, then you probably need to find a way to resend those failed messages yourself (a sketch follows after the link below).
This is one scenario where Kafka mirroring (the MirrorMaker tool) comes into the picture.
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=27846330
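One possible way (a sketch, not something from the answer above) to keep track of messages that could not be delivered within the retry window is a send callback that stashes failures for a later resend attempt once the cluster is reachable again:
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class FailedSendBuffer {
    private final Queue<ProducerRecord<String, String>> failed = new ConcurrentLinkedQueue<>();

    public void send(Producer<String, String> producer, ProducerRecord<String, String> record) {
        producer.send(record, (metadata, exception) -> {
            if (exception != null) {
                // Delivery failed after the producer's own retries ran out
                // (e.g. the whole cluster was unreachable); keep the record for later.
                failed.add(record);
            }
        });
    }

    public void resend(Producer<String, String> producer) {
        ProducerRecord<String, String> record;
        while ((record = failed.poll()) != null) {
            send(producer, record);   // try again once the cluster is back
        }
    }
}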
The producer will fail because the cluster is unavailable; this means it will get a non-retriable error from the Kafka client implementation and, depending on your client process, messages will buffer in your application's local send queue.
I'm sure that if Zookeeper is down your system will not work anymore. This is one of the weaknesses of Kafka: it needs Zookeeper to work.

What happens if Zookeeper fails completely?

We have set up a Kafka/Zookeeper cluster consisting of 3 brokers. We have one producer sending messages to one specific Kafka topic and a few consumer groups reading from said topic. Those consumers perform a leader election among themselves via Zookeeper (independent of Kafka).
The versions used are:
Kafka: 0.9.0.1
Zookeeper: 3.4.6 (included in the Kafka-Package)
All processes are managed by Supervisor. So far, everything works just fine. What we tried now (for testing purposes) was simply to kill off all Zookeeper processes and see what happens.
As we expected, our consumer processes could no longer connect to Zookeeper. But unexpectedly, the Kafka brokers still worked. Our producer didn't complain at all and was still able to write into the topic. While I couldn't use kafka/bin/kafka-topics.sh or similar, since they all require a zookeeper parameter, I could still see the actual size of the topic log grow. After restarting the Zookeeper processes, everything worked just like before.
What we couldn't figure out is... what actually happened there?
We thought Kafka would require a working Zookeeper connection, and we couldn't find any explanation for this behaviour online.
When you have a single Zookeeper node and the broker cannot contact it, then after the broker discovers that Zookeeper is not reachable, the broker itself also becomes unreachable, and hence so do the producer and consumer.
In the producer's case it starts dropping (rejecting) records. In the consumer's case it can happen that a record which was read but not yet acked ends up being processed again once the broker is up and ready.
With a 3-node Zookeeper ensemble, one node failure is acceptable because quorum is still satisfied, but two node failures cannot be afforded and would lead to the consequences above.

Storm Kafka Spout not commit offset in local cluster , spout retrieves same message repeatedly

I have set up a Storm topology which gets its input data from a Kafka server. I used the kafka-storm package to fetch the data. I have implemented the connection between the Kafka server and the Storm topology successfully in a local cluster, but I am facing some issues retrieving the data from the Kafka server.
The Kafka spout retrieves the same message repeatedly at runtime, even though I set spoutConfig.forceFromStart=false and spoutConfig.startOffsetTime=-1.
Note: when I stop and restart the cluster, the data is sent correctly based on the latest offset.
I figured it out by myself: the issue was with the output collector's ack() method. I had implemented the bolt with BaseBasicBolt, and it didn't acknowledge tuples back to the KafkaSpout. I replaced it with BaseRichBolt and called this.collector.ack(tuple) manually.
Now it works fine.
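For reference, here is a minimal sketch of the BaseRichBolt approach described above (class and field names are placeholders; older Storm releases used the backtype.storm package instead of org.apache.storm). The explicit ack is what lets the Kafka spout commit the offset and stop replaying the tuple:
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class AckingBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        String message = tuple.getString(0);
        collector.emit(tuple, new Values(message));   // anchor the emitted tuple to the input
        collector.ack(tuple);                         // explicit ack lets the Kafka spout commit the offset
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("message"));
    }
}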