My topic has 115 partitions and around 130 consumers. I expect 115 consumers to be active (a 1-to-1 assignment) and the remaining 15 consumers to be idle.
A few times, I observed high memory usage and the JVM in a hung state, which triggered rebalancing. However, I am unsure whether this causes a full rebalance (i.e., do the healthy nodes' assignments also get changed?) or whether only the dead node's partitions get reassigned to one of the idle nodes.
Also, how does the rebalance behave when the application is restarted (mine is distributed, with one thread/consumer per JVM)? As the nodes start one by one (rolling restart), will the rebalance happen 115 times (i.e., every time a new consumer joins the group), or is some threshold/wait applied before kicking off the rebalance (to ensure all healthy nodes have joined)?
A consumer rebalance is triggered any time a Kafka consumer with the same group ID joins or leaves the group. Leaving the consumer group can happen explicitly by closing a consumer connection, or by timeout if the JVM or server crashes.
So in your case, yes, a rolling restart of the consumers would trigger 115 consumer rebalances. There is no "threshold" or "wait period" before starting a rebalance in Kafka.
By default, RangeAssignor is used, which can mean that even healthy consumers get different partitions assigned to them over and over again when something happens to another node. It can also mean that a partition is taken away from a healthy consumer. You can change this so that a different implementation of the PartitionAssignor interface is used, for example StickyAssignor:
"one advantage of the sticky assignor is that, in general, it reduces the number of partitions that actually move from one consumer to another during a reassignment".
I would also recommend reading https://medium.com/@anyili0928/what-i-have-learned-from-kafka-partition-assignment-strategy-799fdf15d3ab if you want a deeper dive into how it works underneath.
I am experiencing strange assignment behavior with Kafka Streams. I have a 3-node Kafka Streams cluster. My stream is pretty straightforward: one source topic (24 partitions; all Kafka brokers run on machines other than the Kafka Streams nodes), and the stream graph only takes messages, groups them by key, performs some filtering, and stores everything to a sink topic. Everything runs with 2 stream threads on each node.
However, whenever I do a rolling update of my Kafka Streams app (always shutting down only one instance, so the other two nodes keep running), my Kafka Streams cluster ends up with an uneven number of partitions per node (usually 16-9-0). Only after I restart node01, and sometimes node02, does the cluster get back to a more even state.
Can somebody offer a hint on how I can achieve a more even distribution without additional restarts?
I assume all the nodes running the Kafka Streams app have identical group IDs for consumption.
I suggest you check whether the partition assignment strategy your consumers are using is org.apache.kafka.clients.consumer.RangeAssignor.
If this is the case, configure it to be org.apache.kafka.clients.consumer.RoundRobinAssignor. This way, when the group coordinator receives a JoinGroup request and hands the partitions over to the group leader, the group leader will ensure the spread between the nodes isn't uneven by more than 1.
Unless you're using an older version of Kafka Streams, the default is the range assignor, which does not guarantee an even spread across consumers.
Is your Kafka Streams application stateful? If so, you can possibly thank this well-intentioned KIP: https://cwiki.apache.org/confluence/display/KAFKA/KIP-441%3A+Smooth+Scaling+Out+for+Kafka+Streams
If you want to override this behaviour, you can set acceptable.recovery.lag=9223372036854775807 (Long.MAX_VALUE).
The definition of that config from https://docs.confluent.io/platform/current/streams/developer-guide/config-streams.html#acceptable-recovery-lag
The maximum acceptable lag (total number of offsets to catch up from the changelog) for an instance to be considered caught-up and able to receive an active task. Streams only assigns stateful active tasks to instances whose state stores are within the acceptable recovery lag, if any exist, and assigns warmup replicas to restore state in the background for instances that are not yet caught up. Should correspond to a recovery time of well under a minute for a given workload. Must be at least 0.
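If you do want to opt out of that warmup behaviour, a minimal sketch of setting the config in a Kafka Streams application (the application ID and bootstrap servers are placeholders) could look like this:

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    public class RecoveryLagOverride {
        public static Properties streamsProps() {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");    // placeholder
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            // Treat every instance as "caught up", which effectively disables the
            // KIP-441 warmup-based task placement described above.
            props.put("acceptable.recovery.lag", Long.MAX_VALUE);
            return props;
        }
    }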
I am running a Kafka cluster with a set of consumers on a Dockerized Kubernetes infrastructure. The typical workflow is that when a consumer (in the consumer group) dies, a rebalance is triggered and the partitions are reassigned to the remaining consumers (excluding the failed one).
After some time, the Kubernetes controller recreates/restarts the consumer instance that failed/died, and another rebalance is performed.
Is there any way to control the first rebalance (when the consumer dies), e.g. to wait a few seconds without rebalancing until the failed consumer returns or a timeout is reached, and, if the consumer does return, to continue consuming based on the old assignment (i.e., without a new rebalance)?
There are three parameters on the basis of which the group coordinator decides whether a consumer is dead or alive:
session.timeout.ms
max.poll.interval.ms
heartbeat.interval.ms
You can avoid unwanted rebalancing by tuning the above three parameters, plus one rule of thumb: use a separate thread for calling third-party APIs inside the poll loop.
Tuning the above three parameters requires answering the following questions:
What is the value of max.poll.records?
How long, on average, does the application take to process one record (message)?
How long, on average, does the application take to process a complete batch?
Please refer to the Kafka consumer configuration documentation:
https://docs.confluent.io/platform/current/installation/configuration/consumer-configs.html
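As a rough sketch of where those settings (plus max.poll.records) live on a plain Java consumer; the values are placeholders to be tuned based on the questions above:

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;

    public class RebalanceTimeoutTuning {
        public static Properties consumerProps() {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                // placeholder
            // Records returned per poll(); together with per-record time this drives batch time.
            props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 200);
            // Must exceed the worst-case time to process one full batch, or the
            // consumer is kicked out of the group and a rebalance is triggered.
            props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 600_000);
            // Consumer is declared dead if no heartbeat arrives within this window.
            props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 30_000);
            // Usually about one third of session.timeout.ms.
            props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 10_000);
            return props;
        }
    }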
You can also explore cooperative rebalancing:
https://www.confluent.io/blog/incremental-cooperative-rebalancing-in-kafka/
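For a plain Java consumer, cooperative rebalancing can be enabled by switching the assignor to CooperativeStickyAssignor (Kafka 2.4+); a minimal sketch applied to an existing consumer Properties object:

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;

    public class CooperativeRebalanceConfig {
        public static void apply(Properties consumerProps) {
            // Incremental cooperative rebalancing: only the partitions that actually move
            // are revoked, instead of a stop-the-world revocation of every partition.
            consumerProps.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                    CooperativeStickyAssignor.class.getName());
        }
    }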
I got to know that in a recent version of Kafka, a static membership strategy is available for consumer subscription, instead of the earlier dynamic membership detection; this helps in the scenario where a consumer is bounced as part of a rolling deployment. Now, when the consumer comes back up after being bounced, it picks up the same partitions and resumes processing.
My question is: what will happen if we deliberately shut down a consumer? How will the messages in the partitions that consumer was assigned to get processed?
After a consumer has been shut down, the consumer group will undergo a normal rebalance after the consumer's session.timeout.ms has elapsed.
https://kafka.apache.org/10/documentation/streams/developer-guide/config-streams.html#kafka-consumers-and-producer-configuration-parameters
When configuring static membership, it is important to increase session.timeout.ms above the default of 10000 ms so that consumers are not prematurely rebalanced. Set this value high enough to allow workers time to start, restart, deploy, etc. Otherwise, the application may get into a restart cycle if it misses too many heartbeats during normal operations. Setting it too high may cause long periods of partial unavailability if a worker dies and the workload is not rebalanced. Each application will set this value differently based on its own availability needs.
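A minimal sketch of what enabling static membership might look like on a plain Java consumer (Kafka 2.3+; the bootstrap servers, group ID, and the idea of deriving the instance ID from a pod name are assumptions, not from the question):

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;

    public class StaticMembershipConfig {
        public static Properties consumerProps(String podName) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                // placeholder
            // A stable, unique ID per instance (here assumed to come from the pod name)
            // makes this a static member: a restart within session.timeout.ms keeps
            // the same partition assignment without triggering a rebalance.
            props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, podName);
            // Raised well above the 10000 ms default to cover restart/deploy time.
            props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 120_000);
            return props;
        }
    }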
If you manually assign partitions (rather than subscribing), then you have to deal with that scenario in your application code; that's the advantage of automatic subscription: after a rebalance, every partition is assigned to one of the remaining group members.
To cater for consumers permanently leaving the group with manual assignment, I guess you would need to track assignments somewhere and maybe have each consumer ping something to let you know it is alive.
I'm not sure what use cases manual assignment caters for; I would have to go back and check the Javadoc for KafkaConsumer, which is pretty comprehensive. As long as you have no local state in the consumers, automatic subscription seems much safer and more resilient.
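For illustration, a minimal sketch of the two modes on the KafkaConsumer API (the topic name and partition number are placeholders):

    import java.util.List;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class SubscribeVsAssign {
        static void automatic(KafkaConsumer<String, String> consumer) {
            // Group-managed subscription: the coordinator spreads partitions across the
            // group and rebalances them automatically when members join or leave.
            consumer.subscribe(List.of("my-topic")); // placeholder topic
        }

        static void manual(KafkaConsumer<String, String> consumer) {
            // Manual assignment: no group membership and no rebalancing; if this process
            // dies, nothing else takes over partition 0 unless your own code arranges it.
            consumer.assign(List.of(new TopicPartition("my-topic", 0))); // placeholder
        }
    }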
My app has N instances running. The number of instances is always greater than the number of Kafka partitions, e.g. 6 instances in a consumer group consuming from 4 Kafka partitions... so only 4 of the instances are actually consuming at any point.
In this context, can I suspend a Kafka consumer Camel route without causing Kafka to attempt to rebalance to the other potential consumers? My understanding is that the suspended route would stop polling, causing one of the others to pick up the load.
This is not a Camel question but a Kafka question. Rebalancing is handled by Kafka and is triggered whenever a consumer explicitly leaves the consumer group or silently dies (stops sending heartbeats).
Kafka 2.3 introduced a new feature called "Static Membership" to avoid rebalancing just because of a consumer restart.
But in your case (another consumer must take the load of a leaving consumer) I think Kafka must trigger a rebalancing over all consumers due to the protocol used.
See also this article for quite a deep dive into rebalancing and its trade-offs between availability and fault tolerance.
Edit due to comments
If you want to avoid rebalancing, I think you would have to increase both session.timeout.ms (the window within which heartbeats must arrive before the consumer is considered dead) and max.poll.interval.ms (the maximum allowed time between polls).
But even if you set them very high I guess it would not work reliably because route suspension could still happen just before a heartbeat (simply bad timing).
See this q&a for the difference between session.timeout.ms and max.poll.interval.ms.
I have a topic with 20 partitions and 3 processes with consumers (all with the same group ID) consuming messages from the topic.
But I am seeing a discrepancy where, unless one of the processes commits, the consumers in the other processes do not read any messages.
The consumers in the other processes do consume messages when I set auto-commit to true (which is why I suspect the consumers are being assigned to the first partition in each process).
Can someone please help me out with this issue? And also, how can I consume messages in parallel across the processes?
If it is of any use, I am doing this on a Kubernetes pod, where the 3 processes are 3 different Mule runtimes.
Commit shouldn't make any difference because the committed offset is only used when there is a change in group membership. With three processes there would be some rebalancing while they start up but then when all 3 are running they will each have a fair share of the partitions.
Each time they poll, they keep track in memory of which offset they have consumed on each partition and each poll causes them to fetch from that point on. Whether they commit or not doesn't affect that behaviour.
Auto-commit also makes little difference: it just means a commit is done synchronously during a subsequent poll rather than by your application code. The only real reason to commit manually is if you spawn other threads to process messages and therefore need to avoid committing messages that have not actually been processed. Doing that is generally not advisable; it is better to add consumers to increase throughput than to try to share out processing within a consumer.
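For reference, a minimal sketch of a poll loop with manual commits (bootstrap servers, group ID, and topic are placeholders); note that the in-memory position advances with every poll regardless of whether or when you commit:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ManualCommitLoop {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                // placeholder
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("my-topic"));                          // placeholder
                while (true) {
                    // The consumer's in-memory position advances with every poll,
                    // independently of commits.
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        // process the record in this thread before committing
                    }
                    // Commit only after the whole batch has been processed.
                    consumer.commitSync();
                }
            }
        }
    }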
One possible explanation is just infrequent polling. You mention that other consumers are picking up partitions and that committing affects behaviour, so I think it is safe to say that rebalances must be happening. Rebalances are caused either by a change in partitions at the broker (presumably not the case here) or by a change in group membership, caused in turn by the heartbeat thread dying (a pod being stopped) or by a consumer failing to poll for a long time (default 5 minutes, set by max.poll.interval.ms).
After a rebalance, each partition is assigned to a consumer, and if a previous consumer has ever committed an offset for that partition, the new one will poll from that offset. If not, the new one will poll from either the start of the partition or the high watermark, as set by auto.offset.reset (default is latest, i.e. the high watermark).
So, if a consumer polls but doesn't commit and then doesn't poll again within 5 minutes, a rebalance happens: a new consumer picks up the partition and starts from the end (skipping any messages up to that point). Its first poll will return nothing, as it is starting from the end. If it doesn't poll for another 5 minutes, another rebalance happens and the sequence repeats.
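If infrequent polling turns out to be the cause, the two settings involved can be adjusted on the consumer configuration; a minimal sketch with placeholder values:

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;

    public class PollingTuning {
        public static void apply(Properties consumerProps) {
            // Only applies when no committed offset exists for an assigned partition;
            // "earliest" starts from the beginning instead of skipping to the end.
            consumerProps.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // default: "latest"
            // Allow more time between poll() calls before the consumer is considered
            // failed and its partitions are rebalanced away.
            consumerProps.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 600_000); // default: 300000 (5 min)
        }
    }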
That could be the cause. There should be more information about what is going on in your logs; the Kafka consumer code emits plenty of helpful INFO-level logging about rebalances.