Kafka losing messages under intense stress - normal? - apache-kafka

A few details about the context. I have an application with the following flow:
Put 10k messages in the input topic.
Kafka App 1 will consume messages from input topic and write to mid topic (using exactly once with an idempotent producer writing consumer offsets for input topic plus message to mid topic)
Kafka App 2 will consume messages from mid topic and write to output topic (using exactly once with an idempotent producer writing consumer offsets for mid topic plus message to output topic)
Expect 10k messages in the output topic.
So I've configured both Kafka App 1 and App 2 to have exactly once processing, everything went well when processing messages and:
killing with -9 Zk instances and starting again the instance
killing with -9 Kafka brokers and starting again the broker
killing with -9 Kafka App 1 and starting again the app
killing with -9 Kafka App 2 and starting again the app
In above 4 cases exactly once is achieved and I don't lose messages and I don't have duplicates.
However when processing messages and randomly killing with -9 Zk Instances and Kafka Brokers (in parallel), I've saw that I lose messages.
Is this expected?

Related

kafka consumers in consumer group not resuming messages after restart

Hope you are having good day.
I have an issue with kafka consumers on kubernetes. I am running 3 replicas inside a consumer group
I have a topic with 3 partitions and 3 brokers with offsets replication factor set to 3. My offset in consumer group is set to earliest.
When I start the consumer group, all are working fine with each consumer replica taking different partition and processing the data.
Issue: When by any means if a consumer replica inside the consumer group "abc-consumer-group" restarts OR if a broker(leader) restarts, it is not resuming from the point where it stopped. It states that I am up to date and no messages I have to process.
Any suggestions please where to look at?
Tried increasing rebalance, heartbeat, session timeout on broker level, no luck.
And yes whenever any new consumer is added or removed to the consumer group rebalacing is taken care by kafka. I do see it happening but still not consumers are not resuming messages. It states nothing to process.

How to improve kafka consumer performance PHP/RDKafka

I'm consuming messages from kafka using php rdkafka and high-level consuming approach as it is described here https://arnaud.le-blanc.net/php-rdkafka/phpdoc/rdkafka.examples-high-level-consumer.html
The issue I have is that I have to wait for message approx around 3s to be read by consumer, topic has 20 partitions and there are 10 consumers in same group, even if there is only 1 consumer it is taking 3s to read message from kafka

Kafka consumer is not reading from only one partition out of 4

I was using Kafka 0.9 and recently migrated to Kafka 1.0, but the client I am using is still 0.9. Irrespective of this I was facing a problem where our consumers sometimes intermittently stop consuming from one or two of the partitions.
I have 5 consumers reading from 24 partitions, these are consumer JVM threads created from an application deployed in the single server. Frequently one of the consumer (thread) will stop reading from one of the partitions it would be consuming from.
Eg: One consumer thread would be reading from partition 1,2,3,and 4. It will stop reading from partition 1 and end up in building the lag. I have to restart the consumer to start picking those messages from that particular partition.
I want to understand the issue here.
My consumer configuration
session.timeout.ms=150000
request.timeout.ms=300000
max.partition.fetch.bytes=153600

How exactly Apache Nifi ConsumeKafka_1_0 processor works

I have Nifi cluster of and Kafka is also installed there.
Created one topic with 5 partitions, start consuming that topic with one gourp-id. So that each partition will get unique messages.
Now I created the 5 ConsumeKafka_1_0 processors having the intent of getting unique messages on each consumer side. But only 2 of the ConsumeKafka_1_0 are consuming all the messages rest is setting ideal.
Now what I did is started the 5 command line Kafka consumer, and what happened is, I was able to see the all the partitions are getting the messages and able to consume them from command line consumer in round-robin fashion only.
Also, I tried descried the Kafka group and what I saw was only 2 of the Nifi ConsumeKafka_1_0 is consuming all the 5 partitions and rest is ideal, see the snapshot.
Would you please let me what I am doing wrong here with Nifi consumer processor.
Note - i used Nifi version is 1.5 and Kafka version is 1.0.
I've written this article which explains how the integration with Kafka works:
https://bryanbende.com/development/2016/09/15/apache-nifi-and-apache-kafka
The Apache Kafka client (used by NiFi) is what assigns partitions to the consumers.
Typically if you had a 5 node NiFi cluster, with 1 ConsumeKafka processor on the canvas with 1 concurrent task, then each node would be consuming 1 partition.

High Level Consumer Failure in Kafka

I have the following Kafka Setup
Number of producer : 1
Number of topics : 1
Number of partitions : 2
Number of consumers : 3 (with same group id)
Number of Kafka cluster : none(single Kafka server)
Zookeeper.session.timeout : 1000
Consumer Type : High Level Consumer
Producer produces messages without any specific partitioning logic(default partitioning logic).
Consumer 1 consumes message continuously. I am abruptly killing consumer 1 and I would except consumer 2 or consumer 3 to consume the messages after the failure of consumer 1.
In some cases rebalance occurs and consumer 2 starts consuming messages. This is perfectly fine.
But in some cases either consumer 2 or consumer 3 is not at all consuming. I have to manually kill all the consumers and start all three consumers again.
Only after this restart consumer 1 starts consuming again.
Precisely rebalance is successful in some cases while in some cases rebalance is not successful.
Is there any configuration that I am missing.
Kafka uses Zookeeper to coordinate high level consumers.
From http://kafka.apache.org/documentation.html :
Partition Owner registry
Each broker partition is consumed by a single consumer within a given
consumer group. The consumer must establish its ownership of a given
partition before any consumption can begin. To establish its
ownership, a consumer writes its own id in an ephemeral node under the
particular broker partition it is claiming.
/consumers/[group_id]/owners/[topic]/[broker_id-partition_id] -->
consumer_node_id (ephemeral node)
There is a known ephemeral nodes quirk that they can linger up to 30 seconds after ZK client suddenly goes down :
http://developers.blog.box.com/2012/04/10/a-gotcha-when-using-zookeeper-ephemeral-nodes/
So you may be running into this if you expect consumer 2 and 3 to start reading messages immediately after #1 is terminated.
You can also check that /consumers/[group_id]/owners/[topic]/[broker_id-partition_id] contains correct data after rebalancing.