Static membership in Apache Kafka for consumers - apache-kafka

I got to know in a recent version of kafka, static membership strategy is available for consumer subscription instead of early dynamic membership detection which helps is scenario when consumer is bounces as part of rolling deployment. Now when consumer is up after getting bounced it catches up with the same partition and starts processing.
My question is what will happen if we have deliberately shutdown consumer ? How message in partition to which particular consumer was subscribed will get processed ?

After a consumer has been shutdown, the Consumer Group will undergo a normal rebalance after the consumer's session.timeout.ms has elapsed.
https://kafka.apache.org/10/documentation/streams/developer-guide/config-streams.html#kafka-consumers-and-producer-configuration-parameters
When configuring Static Membership, it is important to increase the session.timeout.ms higher than the default of 10000 so that consumers are not prematurely rebalanced. Set this value high enough to allow workers time to start, restart, deploy, etc. Otherwise, the application may get into a restart cycle if it misses too many
heartbeats during normal operations. Setting it too high may cause
long periods of partial unavailability if a worker dies, and the
workload is not rebalanced. Each application will set this value
differently based on its own availability needs.

If you manually subscribe then you would have to deal with that scenario in your application code - that's the advantage of automatic subscription, all partitions will be assigned to one of the group after a rebalance.
To cater for consumers permanently leaving the group with manual subscription, I guess you would need to track subscriptions somewhere and maybe have each consumer pinging to let you know it is alive.
I'm not sure what use cases the manual subscription is catering for - I will have to go back and check the Javadoc in KafkaConsumer, which is pretty comprehensive. As long as you have no local state in consumers the automatic subscription seems much safer and more resilient.

Related

How to load balance Kafka consumers within a consumer group

Scenario
10 kafka consumers within a same Consumer Group.
Kafka has 10 partitions => which means each partition is automatically assigned to a single consumer within the group.
Message is sent to partition on a round-robin basis.
Every now and then, a message will take much longer to process than other messages.
In such occasions, there's a chance the next message is assigned to a consumer that is still busy working while there are other free consumers
Question
Does Kafka support a mechanism to automatically send message to a partition whose consumer is free?
If it doesn't, what is the common approach to this scenario?
Although you could implement a custom Assignor class, by default, consumption is only based on assignment, not by load; such information is not communicated back to the group coordinator. Plus, shuffling around constantly based on load would likely cause frequent group rebalances, causing consumption to be even slower
Regarding length-of-processing, I am not aware of any way your consumer would be able to inspect message before partition assignment and polling such records. Therefore, you'd need to decouple your processing logic from the actual poll loop if you'd like to improve processing times.

Can a kafka consumer group freeze during a rebalance

Can a rolling deployment of a Kafka consumer group cause the group to freeze?
So let's consider this scenario,
we start a rolling deployment
one consumer leaves the group
Kafka notices this and triggers a rebalance (hence consumption stops)
rebalance happens but soon a new consumer wants to join
also another consumer leaves
again a new rebalance happens
(loop till deployment is complete)
So if you have a large enough cluster and it takes some time for the deployment to get completed on one machine (which is usually the case), Will this lead to a complete freeze in consumption?
If yes, What are the strategies to do a consumer group update in production
Yes, that's definitely possible. There have been a number of recent improvements to mitigate the downtime during events like this. I'd recommend enabling one or both or the following features:
Static membership was added in 2.3 and can prevent a rebalance from occurring when a known member of the group is bounced. This requires both the client and the broker to be on version 2.3+
Incremental cooperative rebalancing enables the group to have faster rebalances AND allows individual members to continue consuming throughout the rebalance. You'll still see rebalances during a rolling deployment but they won't result in a complete freeze in consumption for the duration. This is completely client side so it will work with any brokers, but your clients should be on version 2.5.1+

Processing kafka messages taking long time

I have a Python process (or rather, set of processes running in parallel within a consumer group) that processes data according to inputs coming in as Kafka messages from certain topic. Usually each message is processed quickly, but sometimes, depending on the content of the message, it may take a long time (several minutes). In this case, Kafka broker disconnects the client from the group and initiates the rebalance. I could set session_timeout_ms to a really large value but it would be like 10 minutes of more, which means if a client dies, the cluster would not be properly rebalanced for 10 minutes. This seems to be a bad idea. Also, most messages (about 98% of them) are fast, so paying such penalty for just 1-2% of messages seems wasteful. OTOH, large messages are frequent enough to cause a lot of rebalances and cost a lot of performance (since while the group is rebalancing, nothing is getting done, and then the "dead" client re-joins again and causes another rebalance).
So, I wonder, are there any other ways for handling messages that take a long time to process? Is there any way to initiate heartbeats manually to tell the broker "it's ok, I am alive, I'm just working on the message"? I thought the Python client (I use kafka-python 1.4.7) was supposed to do that for me but it doesn't seem to happen. Also, the API doesn't seem to even have separate "heartbeat" function at all. And as I understand, calling poll() would actually get me the next messages - while I am not even done with the current one, and would also mess up iterator API for Kafka consumer, which is quite convenient to use in Python.
In case it's important, the Kafka cluster is Confluent, version 2.3 if I remember correctly.
In Kafka, 0.10.1+ Kafka polling and session heartbeat are decoupled to each other.
You can get an explanationhere
max.poll.interval.ms how much time permit to complete processing by consumer instance before time out means if processing time takes more than max.poll.interval.ms time Consumer Group will presume its die remove from Consumer Group and invoke rebalance.
To increase this will increase the interval between expected polls which give consumers more time to handle a batch of records returned from poll(long).
But at the same time, it will also delay group rebalances since the consumer will only join the rebalance inside the call to poll.
session.timeout.ms is the timeout used to identify if the consumer is still alive and sending a heartbeat on a defined interval (heartbeat.interval.ms). In general, the thumb-rule is heartbeat.interval.ms should be 1/3 of session timeout so in case of network failure consumers can miss at most 3-time heartbeat before session timeout.
session.timeout.ms: low value would be good to detect failure more quickly.
max.poll.interval.ms: large value will reduce the risk of failure due to increased processing time however increases the rebalancing time.
Note: A large number of partition and topics consumed by Consumer Group also effect on overall rebalance time
The other approach if you would really want to get rid of rebalancing you can assign partitions on each consumer instance manually, using partition assign. In that case, each consumer instance will be running independently with their own assigned partitions. But in that case, you would not able to leverage the rebalance features to assign partitions automatically.

How to reinstate a Kafka Consumer which has been kicked out of the group?

I have a situation where I have a single kafka consumer which retrieves records from kafka using the poll mechanism. Sometimes this consumer gets kicked out of the consumer group due to failure to call poll within the session.timeout period which I have configured to 30s. My question is if this happens will a poll at some later point of time re-add the consumer to the group or do I need to do something else?
I am using kafka version 0.10.2.1
Edit: Aug 14 2018
Some more info. After I do a poll I never process the records in the same thread. I simply add all the records to a separate queue (serviced by a separate thread pool) for processing.
Poll will initiate a "join group" request, if the consumer is not a member of the group yet, and will result in consumer joining the group (unless some error situation prevents it). Note that depending on the group status (other members in the group, subscribed topics in the group) the consumer may or may not get the same partitions it was consuming from before it was kicked out. This would not be the case if the consumer is the only consumer in the group.
Consumer gets kicked out if it fails to send heart beat in designated time period. Every call to poll sends one heart beat to consumer group coordinator.
You need to look at how much time is it taking to process your single record. Maybe it is exceeding session.timeout.ms value which you have set as 30s. Try increasing that. Also keep max.poll.records to a lower value. This setting determines how many records are fetched after the call to poll method. If you fetch too many records then even if you keep session.timeout.ms to a large value your consumer might still get kicked out and group will enter rebalancing stage.
Vahid already mentioned what happens when a kicked-out consumer rejoins the group. You can also tune the below configuration so that consumer won't be kicked out of the group.
max.poll.records - which gives the pre-defined number of records in the poll loop (default: 500)
max.poll.interval.ms - which gives you the amount of time required to process the messages that are received in the poll. (default: 5 min)
You can see the impacts of updating the above configuration in KIP-62
Alternatively, you can use KafkaConsumer#assign mode as you've mentioned that you're using only one consumer. This mode won't do any re-balance.

Kafka Rest Proxy Consumer Creation

Let's say I have a service that that consumes messages through kafka-rest-proxy and always on the same consumer group. Let's also say that it is consuming on a topic that has one partition. When the service is started, it creates a new consumer in kafka-rest-proxy, and uses the generated consumer url until the service is shutdown. When the service comes back up, it will create a new consumer in kafka-rest-proxy, and use the new url (and new consumer) for consuming.
My Questions
Since kafka can only have at most one consumer per partition. What will happen in kafka and kafka-rest-proxy, when the consumer is restarted? i.e. A new consumer is created in kafka-rest-proxy, but the old one didn't have a chance to be destroyed. So now there are 'n' consumers after 'n' restarts of my service in kafka-rest-proxy, but only one of them is actively being consumed. Will I even be able to consume messages on my new consumer since there are more consumers than partitions?
Let's make this more complicated and say that I have 5 instances of my service on the same consumer group and 5 partitions in the topic. After 'n' restarts of all 5 instances of my service, would I even be guranteed to consume all messages without ensuring the proper destruction of the existing consumers. i.e. What does Kafka and kafka-rest-proxy do during consumer creation, when the consumers out number the partitions?
What is considered to be the kafka-rest-proxy best practice, to ensure stale consumers are always cleaned up? Do you suggest persisting the consumer url? Should I force a kafka-rest-proxy restart to ensure existing consumers are destroyed before starting my service?
* EDIT *
I believe part of my question is answered with this configuration, but not all of it.
consumer.instance.timeout.ms - Amount of idle time before a consumer instance is automatically destroyed.
Type: int
Default: 300000
Importance: low
If you cannot cleanly shutdown the consumer, it will stay alive for a period after last request was made to it. The proxy will garbage collect stale consumers for exactly this case -- if it isn't cleanly shutdown, the consumer would hold on to some partitions indefinitely. By automatically garbage collecting the consumers, you don't need some separate durable storage to keep track of your consumer instances. As you discovered, you can control this timeout via the config consumer.instance.timeout.ms.
Since instances will be garbage collected, you are guaranteed to eventually consume all the messages. But during the timeout period, some partitions may still be assigned to the old set of consumers and you will not make any progress on those partitions.
Ideally unclean shutdown of your app is rare, so best practice is just to clean up the consumer when you're app is shutting down. Even in exceptional cases, you can use the finally block of a try/catch/finally to destroy the consumer. If one is left alive, it will eventually recover. Other than that, consider tweaking the consumer.instance.timeout.ms setting to be lower if your application can tolerate that. It just needs to be larger than the longest period between calls that use the consumer (and you should keep in mind possible error cases, e.g. if processing a message requires interacting with another system and that system can become slow/inaccessible, you should account for that when setting this config).
You can persist the URLs, but even that is at some risk for losing track of consumers since you can't atomically create the consumer and save its URL to some other persistent storage. Also, since completely uncontrolled failures where you have no chance to cleanup shouldn't be a common case, it often doesn't benefit you much to do that. If you need really fast recovery from that failure, the consumer instance timeout can probably be reduced significantly for your application anyway.
Re: forcing a restart of the proxy, this would be fairly uncommon since the REST Proxy is often a shared service and doing so would affect all other applications that are using it.