Suspending Camel KafkaConsumer - apache-kafka

My app has N instances running. The number of instances is always greater than the number of Kafka partitions, e.g. 6 instances of a consumer group consuming from 4 Kafka partitions... so only 4 of the instances are actually consuming at any point.
In this context, can I suspend a Kafka consumer Camel route without causing Kafka to attempt to rebalance to other potential consumers? My understanding is that the suspended route would stop polling, causing the others to pick up the load.

This is not a Camel question but a Kafka one. Rebalancing is handled by Kafka and is triggered whenever a consumer explicitly leaves the consumer group or silently dies (stops sending heartbeats).
Kafka 2.3 introduced a new feature called "Static Membership" to avoid rebalancing just because of a consumer restart.
But in your case (another consumer must take over the load of a leaving consumer) I think Kafka must trigger a rebalance across all consumers due to the protocol used.
See also this article for a quite deep dive into rebalancing and its trade-offs between availability and fault-tolerance.
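For completeness, here is a minimal sketch of enabling static membership on a plain Java consumer (the Camel Kafka component drives the same underlying client configs). The broker address, group id and instance id are illustrative, not from the question; the key point is group.instance.id, available since Kafka 2.3:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class StaticMemberConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                // illustrative
        // Static membership: a stable per-instance id lets a restarted consumer
        // rejoin within session.timeout.ms without triggering a rebalance.
        props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, "instance-1");     // must be unique per instance
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "30000");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        // subscribe and poll as usual
    }
}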
Edit due to comments
If you want to avoid rebalancing, I think you would have to increase both session.timeout.ms (the session timeout for heartbeats) and max.poll.interval.ms (the processing timeout).
But even if you set them very high I guess it would not work reliably because route suspension could still happen just before a heartbeat (simply bad timing).
See this q&a for the difference between session.timeout.ms and max.poll.interval.ms.
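As a rough illustration of that tuning (the values are made up, and session.timeout.ms must respect the broker's group.min.session.timeout.ms / group.max.session.timeout.ms bounds):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class RebalanceTimeouts {
    static Properties timeoutProps() {
        Properties props = new Properties();
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "120000");   // time without heartbeats before eviction
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "40000"); // typically about 1/3 of session.timeout.ms
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "600000"); // max gap between poll() calls before eviction
        return props;
    }
}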

Related

Kafka incremental sticky rebalancing

I am running Kafka on Kubernetes using the Kafka Strimzi operator. I am using incremental sticky rebalance strategy by configuring my consumers with the following:
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
          org.apache.kafka.clients.consumer.CooperativeStickyAssignor.class.getName());
Each time I scale consumers in my consumer group, all existing consumers in the group generate the following exception:
Exception in thread "main" org.apache.kafka.common.errors.RebalanceInProgressException: Offset commit cannot be completed since the consumer is undergoing a rebalance for auto partition assignment. You can try completing the rebalance by calling poll() and then retry the operation
Any idea on what caused this exception and/or how to resolve it?
Thank you.
The consumer rebalance happens whenever there is a change in the metadata information of a consumer group.
Adding more consumers (scaling in your words) in a group is one such change and triggers a rebalance. During this change, each consumer will be re-assigned partitions and therefore will not know which offsets to commit until the re-assignment is complete. Now, the StickyAssignor does try and ensure that the previous assignment gets preserved as much as possible but the rebalance will still be triggered and even distribution of partitions will take precedence over retaining previous assignment. (Reference - Kafka Documentation)
Beyond that, the exception's message is self-explanatory: while the rebalance is happening, some operations are prohibited.
How to avoid such situations?
This is a tricky one because Kafka needs rebalancing to be able to work effectively. There are a few practices you could use to avoid unnecessary impact:
Increase the poll interval - max.poll.interval.ms - so the chance of hitting these exceptions is reduced.
Decrease the number of poll records - max.poll.records or max.partition.fetch.bytes
Try to use the latest version(s) of Kafka (or upgrade if you're on an old one), as many recent releases have improved the rebalance protocol
Use Static membership protocol to reduce rebalances
You might consider configuring group.initial.rebalance.delay.ms for empty consumer groups (either for the first deployment, or when destroying everything and redeploying again)
These techniques can only help you reduce unnecessary rebalances and exceptions; they will NOT prevent rebalancing completely.
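As the exception message itself suggests, one possible handling pattern is to swallow the failed commit, let the next poll() complete the rebalance, and commit again. A hedged sketch (class and method names are illustrative, not from the question):

import java.time.Duration;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.errors.RebalanceInProgressException;

public class CooperativePollLoop {

    static void pollLoop(KafkaConsumer<String, String> consumer, List<String> topics) {
        consumer.subscribe(topics);
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            records.forEach(CooperativePollLoop::process);
            try {
                consumer.commitSync(); // commit positions of the records just processed
            } catch (RebalanceInProgressException e) {
                // The group is mid-rebalance (cooperative assignor). The next poll()
                // completes the protocol; positions for partitions this instance keeps
                // are committed on the next successful commitSync(). Records from
                // revoked partitions may be redelivered to their new owner
                // (at-least-once semantics).
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // hypothetical processing step
        System.out.println(record.key() + " -> " + record.value());
    }
}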

Selective Kafka rebalancing on Kubernetes infrastructure

I am running a kafka cluster with a set of consumers on a dockerized Kubernetes infrastructure. The typical workflow is that when a certain consumer (of the consumer group) dies, a rebalancing process will be triggered, and a new assignment of the partitions to the set of the consumers (excluding the failed one) is performed.
After some time, the Kubernetes controller will recreate/restart the consumer instance that failed/died, and a new rebalance is performed again.
Is there any way to control the first rebalancing process (when the consumer dies), e.g. to wait a few seconds without rebalancing until the failed consumer returns or a timeout fires, and, if the consumer returns, to continue consuming based on the old partition assignment (i.e., without a new rebalance)?
There are three parameters on the basis of which the group coordinator decides whether a consumer is dead or alive:
session.timeout.ms
max.poll.interval.ms
heartbeat.interval.ms
You can avoid unwanted rebalancing by tuning the above three parameters, plus one rule of thumb: use a separate thread for calling third-party APIs from the poll loop (see the sketch at the end of this answer).
Tuning the above three parameters requires answering the following questions:
What is the value of max.poll.records?
How much time, on average, does the application take to process one record (message)?
How much time, on average, does the application take to process a complete batch?
Please refer to the Kafka consumer configs:
https://docs.confluent.io/platform/current/installation/configuration/consumer-configs.html
You can also explore cooperative rebalancing:
https://www.confluent.io/blog/incremental-cooperative-rebalancing-in-kafka/
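Here is a rough sketch of the "separate thread" rule of thumb mentioned above: the polling thread keeps calling poll() (so the consumer keeps heartbeating and never exceeds max.poll.interval.ms), hands slow third-party work to an executor, and pauses the fetched partitions until the batch is done. All names are illustrative; error handling and rebalance listeners are omitted:

import java.time.Duration;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class DecoupledProcessingLoop {

    static void run(KafkaConsumer<String, String> consumer, List<String> topics) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        consumer.subscribe(topics);
        Future<?> inFlight = null;
        Set<TopicPartition> paused = Set.of();

        while (true) {
            // Keeps polling even while the worker is busy, so the consumer
            // stays in the group; paused partitions return no records.
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(200));

            if (inFlight == null && !records.isEmpty()) {
                paused = records.partitions();
                consumer.pause(paused);                           // stop fetching until the batch is done
                inFlight = worker.submit(() -> callSlowThirdPartyApi(records));
            }

            if (inFlight != null && inFlight.isDone()) {
                consumer.commitSync();                            // commit only after processing finished
                consumer.resume(paused);                          // all consumer calls stay on this thread
                inFlight = null;
            }
        }
    }

    private static void callSlowThirdPartyApi(ConsumerRecords<String, String> records) {
        // hypothetical slow external call per record
        records.forEach(r -> System.out.println("processed " + r.value()));
    }
}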

max.poll.interval.ms set to Integer.MAX_VALUE by default

Apache Kafka documentation states:
The internal Kafka Streams consumer max.poll.interval.ms default value was changed from 300000 to Integer.MAX_VALUE.
Since this value is used to detect when the processing time for a batch of records exceeds a given threshold, is there a reason for such an "unlimited" value?
Does it enable applications to become unresponsive? Or Kafka Streams has a different way to leave the consumer group when the processing is taking too long?
Does it enable applications to become unresponsive? Or Kafka Streams has a different way to leave the consumer group when the processing is taking too long?
Kafka Streams leverages a heartbeat functionality of the Kafka consumer client in this context, and thus decouples heartbeats ("Is this app instance still alive?") from calls to poll(). The two main parameters are session.timeout.ms (for the heartbeat thread) and max.poll.interval.ms (for the processing thread), and their difference is described in more detail at https://stackoverflow.com/a/39759329/1743580.
The heartbeating was introduced so that an application instance may be allowed to spend a lot of time processing a record without being considered "not making progress" and thus "dead". For example, your app can do a lot of crunching on a single record for a minute while still heartbeating to Kafka: "Hey, I'm still alive, and I am making progress. But I'm simply not done with the processing yet. Stay tuned."
Of course you can change max.poll.interval.ms from its default (Integer.MAX_VALUE) to a lower setting if, for example, you actually do want your app instance to be considered "dead" if it takes longer than X seconds in-between polling records, and thus if it takes longer than X seconds to process the latest round of records. It depends on your specific use case whether or not such a configuration makes sense -- in most cases, the default setting is a safe bet.
session.timeout.ms: The timeout used to detect consumer failures when using Kafka's group management facility. The consumer sends periodic heartbeats to indicate its liveness to the broker. If no heartbeats are received by the broker before the expiration of this session timeout, then the broker will remove this consumer from the group and initiate a rebalance. Note that the value must be in the allowable range as configured in the broker configuration by group.min.session.timeout.ms and group.max.session.timeout.ms.
max.poll.interval.ms: The maximum delay between invocations of poll() when using consumer group management. This places an upper bound on the amount of time that the consumer can be idle before fetching more records. If poll() is not called before expiration of this timeout, then the consumer is considered failed and the group will rebalance in order to reassign the partitions to another member.
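A minimal sketch of overriding that default for the internal Streams consumer (the application id, bootstrap servers, and chosen timeout are illustrative, not from the original question):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsTimeoutConfig {
    static Properties streamsProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");     // illustrative
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // illustrative
        // Override the internal consumer's default so an instance that stalls
        // between polls for longer than 5 minutes is considered failed and its
        // tasks get reassigned.
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), "300000");
        return props;
    }
}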

Kafka Rest Proxy Consumer Creation

Let's say I have a service that consumes messages through kafka-rest-proxy, always on the same consumer group. Let's also say that it is consuming from a topic that has one partition. When the service is started, it creates a new consumer in kafka-rest-proxy and uses the generated consumer URL until the service is shut down. When the service comes back up, it creates a new consumer in kafka-rest-proxy and uses the new URL (and new consumer) for consuming.
My Questions
Since Kafka allows at most one active consumer per partition within a group, what will happen in Kafka and kafka-rest-proxy when the consumer is restarted? I.e. a new consumer is created in kafka-rest-proxy, but the old one didn't have a chance to be destroyed, so after 'n' restarts of my service there are 'n' consumers in kafka-rest-proxy, only one of which is actively being consumed from. Will I even be able to consume messages on my new consumer, since there are more consumers than partitions?
Let's make this more complicated and say that I have 5 instances of my service on the same consumer group and 5 partitions in the topic. After 'n' restarts of all 5 instances of my service, would I even be guaranteed to consume all messages without ensuring the proper destruction of the existing consumers? I.e. what do Kafka and kafka-rest-proxy do during consumer creation when the consumers outnumber the partitions?
What is considered to be the kafka-rest-proxy best practice, to ensure stale consumers are always cleaned up? Do you suggest persisting the consumer url? Should I force a kafka-rest-proxy restart to ensure existing consumers are destroyed before starting my service?
* EDIT *
I believe part of my question is answered with this configuration, but not all of it.
consumer.instance.timeout.ms - Amount of idle time before a consumer instance is automatically destroyed.
Type: int
Default: 300000
Importance: low
If you cannot cleanly shut down the consumer, it will stay alive for a period after the last request was made to it. The proxy will garbage-collect stale consumers for exactly this case: if it isn't cleanly shut down, the consumer would hold on to some partitions indefinitely. By automatically garbage-collecting the consumers, you don't need separate durable storage to keep track of your consumer instances. As you discovered, you can control this timeout via the config consumer.instance.timeout.ms.
Since instances will be garbage collected, you are guaranteed to eventually consume all the messages. But during the timeout period, some partitions may still be assigned to the old set of consumers and you will not make any progress on those partitions.
Ideally, unclean shutdown of your app is rare, so the best practice is just to clean up the consumer when your app is shutting down. Even in exceptional cases, you can use the finally block of a try/catch/finally to destroy the consumer. If one is left alive, it will eventually recover. Other than that, consider lowering the consumer.instance.timeout.ms setting if your application can tolerate it. It just needs to be larger than the longest period between calls that use the consumer (and you should keep in mind possible error cases, e.g. if processing a message requires interacting with another system that can become slow or inaccessible, you should account for that when setting this config).
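A possible sketch of that cleanup, assuming the v2 REST Proxy API and using the instance URL returned when the consumer was created (the URL, group and instance names are illustrative):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestProxyConsumerCleanup {

    // Deletes the proxy-side consumer instance, e.g. from a shutdown hook or finally block.
    static void deleteConsumerInstance(String instanceUrl) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(instanceUrl))
                .DELETE()
                .header("Content-Type", "application/vnd.kafka.v2+json")
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("DELETE " + instanceUrl + " -> " + response.statusCode());
    }

    public static void main(String[] args) {
        // Illustrative base_uri as returned by the consumer-creation call.
        String instanceUrl = "http://rest-proxy:8082/consumers/my-group/instances/my-instance";
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                deleteConsumerInstance(instanceUrl);
            } catch (Exception e) {
                // If this fails (e.g. a hard kill), consumer.instance.timeout.ms
                // will eventually garbage-collect the instance on the proxy side.
                e.printStackTrace();
            }
        }));
        // ... consume via the proxy as usual ...
    }
}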
You can persist the URLs, but even that carries some risk of losing track of consumers, since you can't atomically create the consumer and save its URL to some other persistent storage. Also, since completely uncontrolled failures where you have no chance to clean up shouldn't be a common case, it often doesn't benefit you much to do that. If you need really fast recovery from such a failure, the consumer instance timeout can probably be reduced significantly for your application anyway.
Re: forcing a restart of the proxy, this would be fairly uncommon since the REST Proxy is often a shared service and doing so would affect all other applications that are using it.

kafka consumer sessions timing out

We have an application in which a consumer reads a message, and the thread does a number of things, including database accesses, before a message is produced to another topic. The time between consuming and producing the message on the thread can take several minutes. Once the message is produced to the new topic, a commit is done to indicate we are done with work on the consumer queue message. Auto commit is disabled for this reason.
I'm using the high-level consumer, and what I'm noticing is that the ZooKeeper and Kafka sessions time out because it is taking too long before we do anything with the consumer queue, so Kafka ends up rebalancing every time the thread goes back to read more from the consumer queue, and after a while it starts to take a long time before a consumer reads a new message.
I can set the ZooKeeper session timeout very high to avoid the problem, but then I have to adjust the rebalance parameters accordingly and Kafka won't pick up a new consumer for a while, among other side effects.
What are my options to solve this problem? Is there a way to heartbeat to Kafka and ZooKeeper to keep both happy? Would I still have these same issues if I were to use a simple consumer?
It sounds like your problems boil down to relying on the high-level consumer to manage the last-read offset. Using a simple consumer would solve that problem since you control the persistence of that offset. Note that all the high-level consumer commit does is store the last read offset in zookeeper. There's no other action taken and the message you just read is still there in the partition and is readable by other consumers.
With the kafka simple consumer, you have much more control over when and how that offset storage takes place. You can even persist that offset somewhere other than Zookeeper (a data base, for example).
The bad news is that while the simple consumer itself is simpler than the high-level consumer, there's a lot more work you have to do code-wise to make it work. You'll also have to write code to access multiple partitions - something the high-level consumer does quite nicely for you.
I think the issue is that the consumer's poll() method is what triggers the consumer's heartbeat request, so even if you increase session.timeout.ms, the heartbeat will not reach the coordinator while processing between polls takes too long. Because of the skipped heartbeats, the coordinator marks the consumer dead. Consumer rejoining is also very slow, especially in the case of a single consumer.
I faced a similar issue, and to solve it I had to change the following parameters in the consumer config properties:
session.timeout.ms=
request.timeout.ms=more than session timeout
You also have to add the following property in server.properties on the Kafka broker node:
group.max.session.timeout.ms =
You can see the following link for more detail.
http://grokbase.com/t/kafka/users/16324waa50/session-timeout-ms-limit
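For illustration, the relationship between those settings might look like this (the numbers are made up; pick them based on your longest processing time):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class SessionTimeoutTuning {
    static Properties consumerProps() {
        Properties props = new Properties();
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "300000");  // must not exceed the broker's group.max.session.timeout.ms
        props.put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, "305000");  // keep it larger than session.timeout.ms
        return props;
    }
}
// Broker side, in server.properties (illustrative):
// group.max.session.timeout.ms=300000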