Uneven partition assignment in Kafka Streams

I am experiencing strange assignment behavior with Kafka Streams. I am running a 3-node Kafka Streams cluster. My stream is pretty straightforward: one source topic (24 partitions; all Kafka brokers run on different machines than the Kafka Streams nodes), and the stream graph only takes messages, groups them by key, performs some filtering, and stores everything to a sink topic. Everything runs with 2 stream threads on each node.
However, whenever I do a rolling update of my Kafka Streams app (always shutting down only one instance, so the other two nodes keep running), the cluster ends up with an uneven number of partitions per "node" (usually 16-9-0). Only once I restart node01, and sometimes node02, does the cluster get back to a more even state.
Can somebody offer a hint on how I can achieve a more even distribution without additional restarts?

I assume all instances running the Kafka Streams app use the same group id for consumption.
I suggest you check whether the partition assignment strategy your consumers are using is org.apache.kafka.clients.consumer.RangeAssignor.
If it is, configure it to be org.apache.kafka.clients.consumer.RoundRobinAssignor instead. This way, when the group coordinator receives a JoinGroup request and hands the partitions over to the group leader, the group leader will ensure the spread between the nodes is never uneven by more than 1.
Unless you're using an older version of Kafka Streams, the default is the RangeAssignor, which does not guarantee an even spread across consumers.
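If you manage the consumer configuration yourself, the assignor is set through the partition.assignment.strategy property. Here is a minimal sketch with the plain Java consumer client (the broker address and group id are placeholder assumptions; note that Kafka Streams normally supplies its own internal assignor, so this applies to consumers you configure directly):

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.RoundRobinAssignor;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.util.Properties;

public class RoundRobinConsumerConfig {
    public static KafkaConsumer<String, String> build() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-streams-consumers");  // placeholder group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Ask the group leader to spread partitions round-robin instead of by range.
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, RoundRobinAssignor.class.getName());
        return new KafkaConsumer<>(props);
    }
}
```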

Is your Kafka Streams application stateful? If so, you can possibly thank this well-intentioned KIP: https://cwiki.apache.org/confluence/display/KAFKA/KIP-441%3A+Smooth+Scaling+Out+for+Kafka+Streams
If you want to override this behaviour, you can set acceptable.recovery.lag=9223372036854775807 (Long.MAX_VALUE).
The definition of that config, from https://docs.confluent.io/platform/current/streams/developer-guide/config-streams.html#acceptable-recovery-lag:
The maximum acceptable lag (total number of offsets to catch up from the changelog) for an instance to be considered caught-up and able to receive an active task. Streams only assigns stateful active tasks to instances whose state stores are within the acceptable recovery lag, if any exist, and assigns warmup replicas to restore state in the background for instances that are not yet caught up. Should correspond to a recovery time of well under a minute for a given workload. Must be at least 0.
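A minimal sketch of applying that override in a Streams application (the application id and broker address below are placeholder assumptions):

```java
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

import java.util.Properties;

public class NoWarmupStreamsApp {
    public static KafkaStreams build(StreamsBuilder builder) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");   // placeholder application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");  // placeholder broker address
        // Long.MAX_VALUE treats every instance as caught up, so the KIP-441 warmup logic
        // never withholds active tasks from a freshly restarted node.
        props.put(StreamsConfig.ACCEPTABLE_RECOVERY_LAG_CONFIG, Long.MAX_VALUE);
        return new KafkaStreams(builder.build(), props);
    }
}
```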

Related

In what situation can a Flink 1.15.2 job stop consumption on a single Kafka partition, but continue to consume on the other partitions?

We are running a Flink 1.15.2 cluster with a job that has a Kafka Source and Kafka Sink.
The source topic has 30 partitions. There are 5 TaskManager nodes with a capacity of 4 slots each, and we are running the job with a parallelism of 16, so 4 slots are free. Depending on the slot/node assignment, we can expect each node to have roughly 6-7 partitions assigned.
Our alerting mechanisms notified us that consumer lag was getting built up on a single partition out of the 30 partitions.
As Flink does its own offset management, we had no way of figuring out (through the Flink Web UI or the Kafka console tools) which TaskManager the partition was assigned to.
I would like to know if anyone else has faced this in their experience, and what can be done to proactively monitor and/or mitigate such instances in future. Is it possible for a single partition consumer thread to behave in this manner?
We decided to bounce the Flink TaskManager service one by one hoping that a partition reassignment would jump start consumption again. Bouncing the first node had no impact, but when we bounced the second node, some other TaskManager picked up the lagging partition and started consumption again.
Maybe related to this: https://issues.apache.org/jira/browse/FLINK-28975?
I doubt this is the correct explanation, but perhaps watermark alignment could explain this sort of behavior.

Maximum size of a production Kafka cluster deployment

I am considering how to deploy our Kafka cluster: one big cluster with several broker groups, or several smaller clusters. If we go with one big cluster, I want to know how big a Kafka cluster can be. Kafka has a controller node, and I don't know how many brokers it can support. My other question is about the __consumer_offsets topic: how big can it get, and can we add more partitions to it?
I've personally worked with production Kafka clusters anywhere from 3 brokers to 20 brokers. They've all worked fine; it just depends on what kind of workload you're throwing at them. With Kafka, my general recommendation is that it's better to have a smaller number of larger, more powerful brokers than a bunch of tiny servers.
For a standing cluster, each broker you add increases the "crosstalk" between the nodes, since they have to move partitions around, replicate data, and keep metadata in sync. This additional network chatter can impact how much load each broker can handle. As a general rule, adding brokers adds overall capacity, but you then have to shift partitions around so that the load is balanced properly across the entire cluster. Because of that, it's much better to start with 10 nodes, so that topics and partitions are spread out evenly from the beginning, than to start with 6 nodes and add 4 more later.
Regardless of the size of the cluster, there is always only one controller node at a time. If that node happens to go down, another node will take over as controller, but only one can be active at a given time, assuming the cluster is not in an unstable state.
The __consumer_offsets topic can have as many partitions as you want, but it is created with 50 partitions by default. Since this is a compacted topic, and assuming there is no excessive committing going on (this has happened to me twice already in production environments), the default settings should be enough for almost any scenario. You can look up the configuration settings for the consumer offsets topic by looking for broker properties that start with offsets. in the official Kafka documentation.
You can get more details at the official Kafka docs page: https://kafka.apache.org/documentation/
The size of a cluster can be determined in the following ways.
The most accurate way to model your use case is to simulate the load you expect on your own hardware. You can use the Kafka load-generation tools kafka-producer-perf-test and kafka-consumer-perf-test.
Based on the producer and consumer metrics, you can decide the number of brokers for your cluster.
The other approach, without simulation, is based on the estimated rate at which you receive data and the required data retention period.
From that you can calculate the throughput, and based on the throughput you can decide the number of brokers in your cluster.
Example
If you have 800 messages per second of 500 bytes each, then your throughput is 800*500/(1024*1024) = ~0.4 MB/s. Now if your topic is partitioned and you have 3 brokers up and running with replication factor 3, each broker handles roughly 0.4*3/3 = 0.4 MB/s of write traffic (the produced rate times the number of replicas, divided across the brokers).
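As a rough illustration of that arithmetic (the numbers are just the ones from the example above, not a real capacity model):

```java
// Illustrative back-of-the-envelope sizing calculation.
public class ThroughputEstimate {
    public static void main(String[] args) {
        double messagesPerSecond = 800;
        double messageSizeBytes = 500;
        int replicationFactor = 3;
        int brokers = 3;

        double producedMBps = messagesPerSecond * messageSizeBytes / (1024 * 1024); // ~0.4 MB/s produced
        double clusterWriteMBps = producedMBps * replicationFactor;                 // ~1.2 MB/s written cluster-wide
        double perBrokerWriteMBps = clusterWriteMBps / brokers;                     // ~0.4 MB/s per broker

        System.out.printf("produced=%.2f MB/s, cluster writes=%.2f MB/s, per broker=%.2f MB/s%n",
                producedMBps, clusterWriteMBps, perBrokerWriteMBps);
    }
}
```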
More details about the architecture are available from Confluent.
Within a Kafka cluster, a single broker acts as the controller. If you have a cluster of 100 brokers, then one of them will act as the controller.
Internally, each broker tries to create an ephemeral node in ZooKeeper (/controller). The first one to succeed becomes the controller. The other brokers get a "node already exists" exception and set a watch on the controller node. When the controller dies, the ephemeral node is removed and the watching brokers are notified, which kicks off the controller election process again.
The functionality of the controller is described in the official Kafka documentation.
The __consumer_offsets topic is used to store the offsets committed by consumers. It defaults to 50 partitions, but it can be created with more. To change it, set the offsets.topic.num.partitions broker property.
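If you want to check how many partitions the __consumer_offsets topic actually has on a running cluster, a small AdminClient sketch like the following can describe it (the broker address is a placeholder):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

import java.util.Collections;
import java.util.Properties;

public class OffsetsTopicInspector {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder broker
        try (AdminClient admin = AdminClient.create(props)) {
            // Describe the internal offsets topic and print its partition count.
            TopicDescription description = admin
                    .describeTopics(Collections.singletonList("__consumer_offsets"))
                    .all().get()
                    .get("__consumer_offsets");
            System.out.println("__consumer_offsets partitions: " + description.partitions().size());
        }
    }
}
```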

Kafka - Multiple consumers (only one active) on same group/topic

Is it possible to have multiple copies of an application listen to the same Kafka group/topic so that only one is reading it at a time, but the other ones will start working if the main one crashes/stops reading?
I need to make an application highly available but can't tolerate doubling the traffic to the data store on the other end of the application by having multiple copies actively running.
FYI - Technically I'm using MapR streams but it adheres to the Kafka API and functionality, in case anyone knows a MapR stream-specific feature that helps the situation.
It is possible. If multiple consumers are in the same consumer group, then when the group subscribes to a topic, Kafka does the partition assignment for you: each partition is consumed by only one consumer within the group.
So you could give your topic only one partition; then only one consumer consumes messages and the others stay idle. Once that consumer shuts down, a group rebalance is triggered: Kafka does the partition assignment again, and in your case a new consumer takes over the work, processing messages from the last offset committed by the old consumer.
And if your use case supports parallel processing, you could run many processes (apps) doing the same work and give the topic multiple partitions. They will be assigned different partitions and process different messages, which speeds up processing and also tolerates failures. As said above, if some consumers fail, Kafka takes care of it for you and assigns their partitions to the remaining working consumers.
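Here is a rough sketch of the single-active-consumer setup described above, using the standard Java consumer API (the broker address, group id, and topic name are placeholders); every copy of the application runs this same code:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class StandbyConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "ha-app");                // same group id in every copy
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // With a single-partition topic, only one instance in the group is assigned the
        // partition; the others poll but receive nothing until a rebalance hands the
        // partition over (for example when the active instance dies).
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("single-partition-topic")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}
```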

Why do Kafka consumers connect to zookeeper, and producers get metadata from brokers?

Why is it that consumers connect to zookeeper to retrieve the partition locations, while kafka producers have to connect to one of the brokers to retrieve metadata?
My point is, what exactly is the use of zookeeper when every broker already has all the necessary metadata to tell producers the location to send their messages? Couldn't the brokers send this same information to the consumers?
I can understand why brokers have the metadata, to not have to make a connection to zookeeper each time a new message is sent to them. Is there a function that zookeeper has that I'm missing? I'm finding it hard to think of a reason why zookeeper is really needed within a kafka cluster.
First of all, zookeeper is needed only for the high-level consumer. SimpleConsumer does not require zookeeper to work.
The main reason zookeeper is needed for a high-level consumer is to track consumed offsets and handle load balancing.
Now in more detail.
Regarding offset tracking, imagine the following scenario: you start a consumer, consume 100 messages and shut the consumer down. Next time you start your consumer you'll probably want to resume from your last consumed offset (which is 100), and that means you have to store the maximum consumed offset somewhere. Here's where zookeeper kicks in: it stores offsets for every group/topic/partition. So the next time you start your consumer it can ask "hey zookeeper, what's the offset I should start consuming from?". Kafka is actually moving towards being able to store offsets not only in zookeeper, but in other storages as well (for now only zookeeper and kafka offset storages are available, and I'm not sure the kafka storage is fully implemented).
Regarding load balancing, the volume of messages produced can be too large for one machine to handle, and you'll probably want to add computing power at some point. Let's say you have a topic with 100 partitions and, to handle this volume, 10 machines. Several questions arise here:
how should these 10 machines divide partitions between each other?
what happens if one of machines die?
what happens if you want to add another machine?
And again, here's where zookeeper kicks in: it tracks all consumers in the group, and each high-level consumer is subscribed to changes in this group. The point is that when a consumer appears or disappears, zookeeper notifies all consumers and triggers a rebalance so that they split the partitions near-equally (to balance the load). This way it guarantees that if one consumer dies, the others will continue processing the partitions that were owned by that consumer.
With Kafka 0.9+ the new consumer API was introduced. New consumers do not need a connection to ZooKeeper, since group balancing is provided by Kafka itself.
You are right, consumers haven't needed to connect to ZooKeeper since the Kafka 0.9 release. The API was redesigned and a new consumer client was introduced:
the 0.9 release introduces beta support for the newly redesigned consumer client. At a high level, the primary difference in the new consumer is that it removes the distinction between the “high-level” ZooKeeper-based consumer and the “low-level” SimpleConsumer APIs, and instead offers a unified consumer API.
and
Finally this completes a series of projects done in the last few years to fully decouple Kafka clients from Zookeeper, thus entirely removing the consumer client’s dependency on ZooKeeper.
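For comparison, a minimal post-0.9 consumer configuration needs only the brokers and a group id; there is no zookeeper.connect setting anywhere (the broker address and group name below are placeholders):

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.util.Properties;

public class NewConsumerConfig {
    public static Properties minimal() {
        Properties props = new Properties();
        // Since 0.9, group membership and offsets are handled by the brokers themselves.
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");              // placeholder group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        return props;
    }
}
```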

Kafka Only One Consumer in Consumer Group Getting Messages

In my setup, I have a consumer group with three processes (3 instances of a service) that can consume from Kafka. What I've found to be happening is that the first node receives all of the traffic. If one node is manually killed, the next node picks up all the Kafka traffic, but the last remaining node sits idle.
The behavior desired is that all messages get distributed evenly across all instances within the consumer group, which is what I thought should happen. As I understand, the way Kafka works is that it is supposed to distribute the messages evenly amongst all members of a consumer group. Is my understanding correct? I've been trying to determine why it may be that only one member of the consumer group is getting all traffic with no luck. Any thoughts/suggestions?
You need to make sure that the topic has more than one partition to be able to consume it in parallel. A consumer in a consumer group gets one or more partitions allocated to it by the broker, but a single partition will never be shared across several consumers within the same group unless a consumer goes offline. The number of partitions a topic has equals the maximum number of consumers in a consumer group that can feed from that topic.
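As a small sketch of that point, the standard AdminClient can create a topic with enough partitions for three parallel consumers (the topic name, partition count, and broker address are illustrative assumptions):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class CreateMultiPartitionTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder broker
        try (AdminClient admin = AdminClient.create(props)) {
            // Three partitions so that up to three consumers in the same group can read in parallel.
            NewTopic topic = new NewTopic("service-events", 3, (short) 1);      // placeholder topic, RF=1
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```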