How to scale up a real-time Kafka consumer without rebalancing latency? - deployment

When scaling consumers up or down (adding consumers to or removing them from the group), a rebalance occurs. This introduces latency that causes events to time out (I saw about 3 seconds of producer-to-consumer latency).
How can I scale up consumers safely without the rebalancing process increasing latency?
I expect seamless consumer deployments, but the actual rebalancing behaviour causes timeouts.

Related

Selective Kafka rebalancing on Kubernetes infrastructure

I am running a Kafka cluster with a set of consumers on a dockerized Kubernetes infrastructure. The typical workflow is that when a certain consumer (of the consumer group) dies, a rebalancing process is triggered and a new assignment of the partitions to the set of consumers (excluding the failed one) is performed.
After some time, the Kubernetes controller recreates/restarts the failed consumer instance and a new rebalance is performed again.
Is there any way to control the first rebalancing process (when the consumer dies), e.g. to wait a few seconds without rebalancing until the failed consumer returns, or until a timeout is triggered? And if the consumer returns, to continue consuming based on the old assignment (i.e., without a new rebalance)?
There are three parameters on the basis of which the group coordinator decides whether a consumer is dead or alive:
session.timeout.ms
max.poll.interval.ms
heartbeat.interval.ms
You can avoid unwanted rebalancing by tuning the above three parameters, plus one rule of thumb: use a separate thread for calling third-party APIs rather than blocking the poll loop.
Tuning these three parameters requires answering the questions below:
What is the value of max.poll.records?
How much time, on average, does the application take to process one record (message)?
How much time, on average, does the application take to process a complete batch?
Please refer to the Kafka consumer configuration documentation:
https://docs.confluent.io/platform/current/installation/configuration/consumer-configs.html
You can also explore cooperative rebalancing:
https://www.confluent.io/blog/incremental-cooperative-rebalancing-in-kafka/
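As a rough sketch (not part of the original answer), the tuning described above maps to consumer configuration roughly as follows; the broker address, group id and concrete values are assumptions and need to be sized from the answers to the three questions:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TunedConsumerFactory {

    public static KafkaConsumer<String, String> build() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group");       // assumed group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // The coordinator marks the consumer dead if no heartbeat arrives within
        // session.timeout.ms; heartbeats are sent every heartbeat.interval.ms
        // (commonly about one third of the session timeout).
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "30000");
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "10000");

        // If processing one poll() batch takes longer than max.poll.interval.ms,
        // the member is evicted and a rebalance is triggered. Size it as
        // max.poll.records * average time per record, plus a safety margin.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "300000");

        // Incremental cooperative rebalancing (Kafka 2.4+): only the partitions
        // that actually move are revoked, instead of stop-the-world revocation.
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                  CooperativeStickyAssignor.class.getName());

        return new KafkaConsumer<>(props);
    }
}

Long-running calls to third-party APIs are still best moved off the poll thread, as the answer suggests.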

Suspending Camel KafkaConsumer

My app has N instances running. The number of instances is always greater than the number of Kafka partitions, e.g. 6 instances of a consumer group consuming from 4 Kafka partitions... so only 4 of the instances are actually consuming at any point.
In this context, can I suspend a Kafka consumer Camel route without causing Kafka to attempt to rebalance to other potential consumers? My understanding is that the suspended route would stop polling, causing one of the other instances to pick up the load.
This is not a Camel but a Kafka question. The rebalancing is handled by Kafka and triggered whenever a consumer explicitly leaves the consumer group or silently dies (stops sending heartbeats).
Kafka 2.3 introduced a new feature called "Static Membership" to avoid rebalancing just because of a consumer restart.
But in your case (another consumer must take the load of a leaving consumer) I think Kafka must trigger a rebalance across all consumers due to the protocol used.
See also this article for a quite deep dive into rebalancing and its trade-offs between availability and fault-tolerance.
Edit due to comments
If you want to avoid rebalancing, I think you would have to increase both session.timeout.ms (the heartbeat-based session timeout) and max.poll.interval.ms (the processing timeout).
But even if you set them very high I guess it would not work reliably because route suspension could still happen just before a heartbeat (simply bad timing).
See this q&a for the difference between session.timeout.ms and max.poll.interval.ms.
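For illustration only (the group id and environment variable are assumptions), the static membership mentioned above boils down to giving each instance a stable group.instance.id, e.g. derived from the Kubernetes pod name:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class StaticMemberConsumer {

    public static KafkaConsumer<String, String> build() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "camel-consumer-group");      // assumed group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // Static membership (Kafka 2.3+): a stable, unique id per instance means a
        // restart does NOT trigger a rebalance, as long as the instance comes back
        // within session.timeout.ms using the same group.instance.id.
        props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, System.getenv("POD_NAME")); // assumed env var

        // The window a static member has to return before its partitions are reassigned.
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "120000");

        return new KafkaConsumer<>(props);
    }
}

As the answer notes, this helps with restarts, but a deliberately suspended route whose load should move to another instance still requires a rebalance.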

Slow consumer impact on a topic partition

Let there be a single Kafka topic with just a single partition configured with an infinite retention policy. Let there be two consumers, Fast and Slow.
The Fast consumer processes messages as they appear and has almost no lag.
The Slow consumer tends to have a significant lag, e.g. two days' worth of messages. Slow will sometimes catch up to Fast, but this happens rarely; there is usually a significant lag.
Will this setup, with two different consumer speeds in the same partition, cause negative side effects on a Kafka broker? Could there be an increased I/O cost to retrieve older messages for Slow consumer from the disk?
A lagging consumer won't be able to read data from the OS page cache, so there will be a disk I/O cost for slow consumers. On the other hand, once your slow consumer starts reading, Kafka performs sequential I/O and the fetched messages end up in the cache; if the lag is not too large, the consumer can find the next message in the cache.
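If you want to see how far behind the slow consumer actually is (and therefore how likely its fetches are to miss the page cache), a minimal lag check with the AdminClient could look like this; the broker address and group name are assumptions:

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class LagCheck {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker

        try (AdminClient admin = AdminClient.create(props)) {
            // Offsets the slow group has committed so far.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("slow-consumer-group")        // assumed group id
                         .partitionsToOffsetAndMetadata().get();

            // Current log-end offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> latestSpec = new HashMap<>();
            committed.keySet().forEach(tp -> latestSpec.put(tp, OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> ends =
                    admin.listOffsets(latestSpec).all().get();

            // Lag per partition = log-end offset minus committed offset.
            committed.forEach((tp, offset) -> {
                long lag = ends.get(tp).offset() - offset.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}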

How does Kafka handle a consumer which is running slower than other consumers?

Let's say I have 20 partitions and five workers. Each partition is assigned a worker. However, one worker is running slower than the other machines. It is still processing (that is, not the slow consumer described here), but at 60% of the rate of the other machines. This could be because the worker is running on a slower VM on AWS EC2, has a broken disk or CPU, or whatnot. Does Kafka handle rebalancing gracefully somehow to give the slow worker fewer partitions?
Kafka doesn't really concern itself with how fast messages are being consumed. It doesn't even get involved with how many consumers there are or how many times each message is read. Kafka just commits messages to partitions and ages them out at the configured time.
It's the responsibility of the group of consumers to make sure that the messages are being read evenly and in a timely fashion. In your case, you have two problems: the reading of one set of partitions lags, and then the processing of the messages from those partitions lags.
For the actual consumption of messages from the topic, you'll have to use the Kafka metadata APIs to track the relative load each consumer faces, whether caused by skewed partitioning or by consumers running at different speeds. You either have to re-allocate partitions to give the slow consumers less work, or randomly re-assign consumers to partitions in the hope of eventually evening out the workload over time.
To better balance the processing of messages, you should factor out the reading of the messages from the processing of the messages - something like the Storm streaming model. You still have to programmatically monitor the backlogs into the processing logic, but you'd have the ability to move work to faster nodes in order to balance the work.
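A rough sketch of that separation with the plain Java consumer (topic name, pool size and process() are placeholders): the poll thread keeps calling poll() so heartbeats and the poll interval stay healthy, while a worker pool does the processing and the partitions stay paused until the batch is done.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class DecoupledConsumer {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "worker-group");            // assumed group id
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        ExecutorService workers = Executors.newFixedThreadPool(4);            // assumed pool size
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));                            // assumed topic

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(200));
                if (records.isEmpty()) {
                    continue;
                }
                // Stop fetching new data while this batch is being processed.
                consumer.pause(consumer.assignment());

                Future<?> batch = workers.submit(() -> {
                    for (ConsumerRecord<String, String> record : records) {
                        process(record);                                      // placeholder processing
                    }
                });

                // Keep polling (returns nothing while paused) so the consumer never
                // exceeds max.poll.interval.ms, however slow the processing is.
                while (!batch.isDone()) {
                    consumer.poll(Duration.ofMillis(200));
                }
                batch.get();            // surface processing exceptions
                consumer.commitSync();  // commit only after the whole batch succeeded
                consumer.resume(consumer.assignment());
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        // application logic goes here
    }
}

Monitoring the backlog per worker and re-allocating partitions, as the answer describes, would sit on top of a loop like this.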

Cost of Rebalancing partitions of a topic in Kafka

I am trying to come up with a design for consuming from Kafka. I am using version 0.8.1.1 of Kafka. I am thinking of designing a system where a consumer is created every few seconds, consumes data from Kafka, processes it, and then quits after committing the offsets to Kafka. At any point in time I expect 250-300 consumers to be active (running as ThreadPools on different machines).
How and when does a rebalance of partitions happen?
How costly is the rebalancing of partitions among the consumers? I am expecting a new consumer to finish up or join the same consumer group every few seconds, so I just want to know the overhead and latency of a rebalancing operation.
Say consumer C1 has partitions P1, P2, P3 assigned to it and is processing message M1 from partition P1. Now consumer C2 joins the group. How are the partitions divided between C1 and C2? Is there a possibility that C1's commit for M1 (which might take some time to reach Kafka) will be rejected, and that M1 will be treated as a fresh message and delivered to someone else? (I know Kafka is an at-least-once delivery model, but I wanted to confirm whether the repartitioning could by any chance cause a redelivery of the same message.)
I'd rethink the design if I were you. Perhaps you need a consumer pool?
Rebalancing happens every time a consumer joins or leaves the group.
Kafka and the current consumer were definitely designed for long-running consumers. The new consumer design (planned for 0.9) will handle short-lived consumers better. Rebalances take 100-500 ms in my experience, depending a lot on how ZooKeeper is doing.
Yes, duplicates happen often during rebalancing; that's why we try to avoid them. You can try to work around that by committing offsets more frequently, but with 300 consumers committing frequently and a lot of consumers joining and leaving, your ZooKeeper may become a bottleneck.
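The question targets the old 0.8 high-level consumer, so purely as an illustration with the modern Java consumer: committing manually right after each processed batch (rather than relying on auto-commit) shrinks the window of messages that get redelivered when a rebalance hits. The broker address, group id, topic and handle() are placeholders.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class FrequentCommitConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "batch-workers");           // assumed group id
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // commit manually
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("work-items"));                        // assumed topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    handle(record);                                           // placeholder processing
                }
                if (!records.isEmpty()) {
                    // Committing after every batch keeps the redelivery window small
                    // if a rebalance (or crash) happens before the next commit.
                    consumer.commitSync();
                }
            }
        }
    }

    static void handle(ConsumerRecord<String, String> record) {
        // application logic goes here
    }
}

A long-running pool of consumers like this also avoids the constant join/leave churn, and the rebalance on every join or leave, that the answer warns about.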