Cost of Rebalancing partitions of a topic in Kafka - apache-kafka

I am trying to come up with a design for consuming from Kafka. I am using Kafka version 0.8.1.1. I am thinking of designing a system where a consumer is created every few seconds, consumes data from Kafka, processes it, and then quits after committing its offsets to Kafka. At any point in time, expect 250-300 consumers to be active (running as thread pools on different machines).
How and when does a rebalance of partitions happen?
How costly is rebalancing partitions among the consumers? I expect a consumer to finish, or a new one to join the same consumer group, every few seconds, so I want to know the overhead and latency of a rebalancing operation.
Say consumer C1 has partitions P1, P2, P3 assigned to it and is processing a message M1 from partition P1. Now consumer C2 joins the group. How are the partitions divided between C1 and C2? Is it possible that C1's commit for M1 (which might take some time to reach Kafka) will be rejected, so that M1 is treated as a fresh message and delivered to someone else? (I know Kafka has an at-least-once delivery model, but I want to confirm whether a repartition can cause redelivery of the same message.)

I'd rethink the design if I were you. Perhaps you need a consumer pool?
Rebalancing happens every time a consumer joins or leaves the group.
Kafka and the current consumer were definitely designed for long-running consumers. The new consumer design (planned for 0.9) will handle short-lived consumers better. Rebalances take 100-500 ms in my experience, depending a lot on how ZooKeeper is doing.
Yes, duplicates happen often during rebalancing. That's why we try to avoid them. You can try to work around that by committing offsets more frequently, but with 300 consumers committing frequently and a lot of consumers joining and leaving, your ZooKeeper may become a bottleneck.
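To make the duplicate scenario concrete, here is a minimal toy model in Python. It is not Kafka client code; the function name and parameters are made up for illustration. It assumes C1 commits after every `commit_every` messages and that, after a rebalance, C2 resumes from the last committed offset rather than from where C1 actually was.

```python
# Toy model of at-least-once redelivery across a rebalance.
# Illustrative only: this is not how the Kafka client is implemented.

def consume_with_rebalance(log, commit_every, rebalance_at):
    """Return (seen_by_c1, seen_by_c2) for a single partition.

    C1 reads from offset 0 and commits after every `commit_every` messages.
    Once C1 has processed `rebalance_at` messages, the partition moves to C2,
    which resumes from the last committed offset.
    """
    committed = 0  # next offset to read, as stored for the group
    seen_by_c1 = []
    for offset in range(rebalance_at):
        seen_by_c1.append(log[offset])
        if (offset + 1) % commit_every == 0:
            committed = offset + 1
    # Rebalance: C2 starts at the committed offset, not at C1's real position.
    seen_by_c2 = log[committed:]
    return seen_by_c1, seen_by_c2


log = [f"M{i}" for i in range(1, 11)]
c1, c2 = consume_with_rebalance(log, commit_every=4, rebalance_at=6)
duplicates = [m for m in c1 if m in c2]
print(duplicates)  # ['M5', 'M6']
```

In this model, committing after every message (`commit_every=1`) removes the overlap, which is the trade-off described above: fewer duplicates at the cost of many more commits.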

Related

Does kafka partition assignment happen across processes?

I have a topic with 20 partitions and 3 processes with consumers (with the same group_id) consuming messages from the topic.
But I am seeing a discrepancy: unless one of the processes commits, the consumers in the other processes do not read any messages.
The consumers in the other processes do consume messages when I set auto-commit to true (which is why I suspect the consumers are being assigned to the first partition in each process).
Can someone please help me with this issue? Also, how can I consume messages in parallel across processes?
If it is of any use, I am doing this on a pod (Kubernetes), where the 3 processes are 3 different mules.
Commit shouldn't make any difference because the committed offset is only used when there is a change in group membership. With three processes there would be some rebalancing while they start up but then when all 3 are running they will each have a fair share of the partitions.
Each time they poll, they keep track in memory of which offset they have consumed on each partition and each poll causes them to fetch from that point on. Whether they commit or not doesn't affect that behaviour.
Autocommit also makes little difference: it just means a commit is done for you during a subsequent poll rather than by your application code. The only real reason to commit manually is if you spawn other threads to process messages and so need to avoid committing messages that have not actually been processed. Doing this is generally not advisable; it is better to add consumers to increase throughput than to share out processing within a consumer.
One possible explanation is just infrequent polling. You mention that other consumers are picking up partitions and that committing affects behaviour, so I think it is safe to say that rebalances must be happening. Rebalances are caused either by a change in partitions at the broker (presumably not the case) or by a change in group membership, which in turn is caused either by the heartbeat thread dying (a pod being stopped) or by a consumer failing to poll for a long time (default 5 minutes, set by max.poll.interval.ms).
After a rebalance, each partition is assigned to a consumer. If a previous consumer has ever committed an offset for that partition, the new one will poll from that offset. If not, the new one will poll from either the start of the partition or the high watermark, as set by auto.offset.reset; the default is latest (high watermark).
So suppose a consumer polls but doesn't commit, and then doesn't poll again for 5 minutes. A rebalance happens and a new consumer picks up the partition, starting from the end (so skipping any messages up to that point). Its first poll will return nothing, as it is starting from the end. If it then doesn't poll for 5 minutes, another rebalance happens and the sequence repeats.
That could be the cause - there should be more information about what is going on in your logs - Kafka consumer code puts in plenty of helpful INFO level logging about rebalances.
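The rule for where a newly assigned consumer starts reading can be sketched as a small Python function. This is a simplified model, not the client's actual code: the function name and arguments are hypothetical, and it ignores cases such as a committed offset that is out of range or `auto.offset.reset=none`.

```python
# Simplified model of how a consumer picks its starting position for a
# partition after a rebalance. Names are illustrative, not a Kafka API.

def starting_offset(committed, log_start, log_end, auto_offset_reset="latest"):
    """Return the offset the newly assigned consumer will read from."""
    if committed is not None:
        # Someone in the group committed before: resume from there.
        return committed
    if auto_offset_reset == "earliest":
        return log_start
    if auto_offset_reset == "latest":
        return log_end
    raise ValueError(f"unsupported auto.offset.reset: {auto_offset_reset}")


# A partition holding offsets 0..99, and a group that never committed:
start = starting_offset(committed=None, log_start=0, log_end=100)
skipped = start - 0
print(start, skipped)  # 100 100
```

With the default of latest and no commit, all 100 existing messages are skipped in this model, which matches the symptom described: the new consumer's first poll returns nothing.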

How does a Kafka Consumer behave if a Producer goes down. What happens to the data in the interval when the producer goes down

I just want to know how the Consumer is able to consume data when the producer is down. Let's say Producer keeps sending logs to the consumer at a steady rate and then the producer goes down from 8AM- 6PM. How does the consumer work in such a case and is there a way that the consumer can get the data that would have been sent during 8am - 6pm if the producer was up.
In Apache Kafka there is no relationship between how the producer and the consumer behave.
Acting as a messaging system, Kafka decouples the producer from the consumer by providing an asynchronous communication channel.
The producer can send messages at its own pace and the consumer can read these messages in real time or later at its own pace (different from the producer one).
The messages are saved in a topic living in the Kafka cluster, and each message has a position in the topic partition (offset).
Of course, it's possible to tune when messages are deleted from the topic, in case the consumer is offline for a long time and not reading them.
You can configure the topic to retain messages for a very long time (days, weeks, months); messages older than the retention time are then deleted.
Furthermore, the consumer is also able to rewind the stream of messages in the topic, actually re-reading the messages if needed.
Finally, the consumer can also seek to a specific position in the topic partition, based on an offset or by specifying a time.
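As a rough illustration of the points above, here is a toy in-memory model of one topic partition in Python. The class and method names are invented for this sketch and are not the Kafka API; it assumes timestamps are appended in non-decreasing order, and it mimics a time-based seek by returning the first offset whose timestamp is at or after the requested time.

```python
import bisect

# Toy model of one topic partition: an append-only list with offsets.
# Illustrative only; real Kafka stores this on brokers, not in a Python list.

class PartitionLog:
    def __init__(self):
        self.timestamps = []  # assumed non-decreasing
        self.values = []

    def append(self, timestamp, value):
        """Producer side: store a message and return its offset."""
        self.timestamps.append(timestamp)
        self.values.append(value)
        return len(self.values) - 1

    def read(self, offset, max_records=100):
        """Consumer side: read from any offset, at the consumer's own pace."""
        return self.values[offset:offset + max_records]

    def offset_for_time(self, timestamp):
        """First offset whose timestamp is >= the given time."""
        return bisect.bisect_left(self.timestamps, timestamp)


log = PartitionLog()
for hour in (6, 7, 18, 19):  # producer is down between 8 and 18
    log.append(hour, f"log-{hour}")

print(log.read(0))                        # ['log-6', 'log-7', 'log-18', 'log-19']
print(log.read(log.offset_for_time(8)))   # ['log-18', 'log-19']
```

The consumer never talks to the producer here. It only reads what is stored, so nothing exists for the 8 to 18 window, and everything written before or after it stays readable until retention removes it.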
The Kafka doc has a nice diagram which I copied below. It shows the novelty of Kafka in a succinct way.
Without Kafka, the situation is something like this. We have multiple servers, e.g. frontend servers, DB servers, chat servers etc. On the other side, we probably have different metrics and monitoring tools (e.g. DB monitor, UI monitor etc.). Direct one-to-one communication between the different servers and collectors might work for smaller systems, but it breaks down pretty quickly, in terms of scalability, once the system has grown past a certain threshold. Kafka solves this problem by decoupling the senders and receivers. Both of them talk through the Kafka brokers instead of talking to each other.
So, in your case the consumer would simply ask the broker whether there is any new data on the topic it subscribes to. As the producer is down, and assuming there is no data in the queue, the broker would reply that there is nothing to be consumed. So the consumer would keep polling at a fixed interval, in an endless loop, and do nothing. Whenever the producer comes up and starts pumping out data, the consumer would start receiving (and processing) it. There are more involved cases where you might lose data, if the retention period for a particular topic expires before the consumer has processed the backlog. But I don't think that's a concern for you at this point in your journey.

Kafka repartitioning

From my understanding, partitions and consumers are tied in a 1:1 relationship in which a single consumer processes a partition. However, is there a way to repartition in the middle of processing?
We are currently trying to optimize a process in which the topic gets consumed across a group, but there are cases in which data processing takes longer on a certain consumer while others are already idle. It's like data cleansing, where a certain partition might no longer need cleansing while others require fuzzy matching, thereby adding complexity to the task a consumer performs.
Your understanding with regards to partitions and consumers is not quite right.
If you have N partitions, then you can have up to N active consumers within the same consumer group, each reading from a single partition. When you have fewer consumers than partitions, some of the consumers will read from more than one partition. Also, if you have more consumers than partitions, then some of the consumers will be inactive and will receive no messages at all.
If you have one consumer per partition, then some of the partitions might receive more messages, and this is why some of your consumers might be idle while others are still processing messages. Note that messages are not always inserted into topic partitions in a round-robin fashion, as messages with the same key are placed into the same partition.
In Kafka, topics are partitioned, and even though you can add partitions to a topic, there is no repartitioning: all the data already written to a partition stays there, and new data is distributed among the partitions (in a round-robin fashion if you do not define keys; otherwise one key will always land in the same partition, as long as you do not add partitions).
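The effect of keys on partition load can be sketched in a few lines of Python. This is only an illustration: Kafka's default partitioner hashes the key bytes with murmur2, whereas this sketch uses `zlib.crc32` as a stand-in, and the keys are made up.

```python
import zlib
from collections import Counter

# Sketch of key-based partitioning: hash the key, take it modulo the number
# of partitions. crc32 is a stand-in for the hash Kafka actually uses.

def partition_for(key, num_partitions):
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# One "hot" key produces most of the traffic.
keys = ["hot-customer"] * 8 + ["a", "b", "c", "d"]
load = Counter(partition_for(k, 3) for k in keys)

hot_partition = partition_for("hot-customer", 3)
print(hot_partition, dict(load))
```

Whichever partition the hot key hashes to receives at least 8 of the 12 messages, so the consumer that owns it stays busy while the others go idle. The mapping also depends on the partition count, so the same key may map to a different partition after partitions are added.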
But if you have a consumer group, and you add or remove consumers to this group, there is a group rebalancing where each consumer receives its share of partitions to exclusively consume from.
So if you have 3 partitions (with messages evenly distributed among them) and 2 consumers (in the same group), one consumer will have twice as many messages to handle as the other; with 3 consumers each one will consume one partition; with 4 consumers one will stay idle...
So, as you already have evenly distributed messages (which is good), you should have as many consumers as you have partitions, and if that is still not fast enough you may add n partitions and n consumers. (Of course, you could also try to optimize the consumer, but that is another story...)
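The 3-partition example can be reproduced with a small Python sketch. It hands out contiguous chunks of partitions to sorted consumers, which is similar in spirit to a range-style assignment; the function name is hypothetical, and the real assignor in your client may distribute partitions differently.

```python
# Sketch of dividing partitions among the consumers of one group.
# Illustrative only: not the Kafka assignor implementation.

def assign(partitions, consumers):
    """Give each consumer a contiguous chunk; earlier consumers get the extras."""
    if not consumers:
        return {}
    consumers = sorted(consumers)
    base, extra = divmod(len(partitions), len(consumers))
    assignment = {}
    start = 0
    for i, consumer in enumerate(consumers):
        count = base + (1 if i < extra else 0)
        assignment[consumer] = partitions[start:start + count]
        start += count
    return assignment


partitions = [0, 1, 2]
print(assign(partitions, ["c1", "c2"]))
# {'c1': [0, 1], 'c2': [2]}
print(assign(partitions, ["c1", "c2", "c3", "c4"]))
# {'c1': [0], 'c2': [1], 'c3': [2], 'c4': []}
```

With 2 consumers, c1 owns two partitions and therefore twice the load; with 4 consumers, c4 gets nothing and stays idle.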
Added to answer comment:
Once a consumer -- from a given group -- is consuming a partition, it will continue to do so and will be the only one from the group consuming this partition, even if a lot of other consumers from the same group are idle. In one group a partition is never shared between consumers. (If the consumer crashes, another one will continue the work, and if a new consumer enters the group a rebalance will occur, but anyway only one consumer will work on one partition at a given time).
So one approach, as said in your comment, would be to distribute the load evenly over the partitions. Another approach would be to have a topic dedicated to expensive jobs, with a lot of partitions and a lot of consumers, and to let the topic for non-expensive jobs have fewer consumers.
A last approach, which I would not recommend, would be to skip the consumer group features and manage consumption from Kafka yourself, using the assign and seek methods of the consumer. (See the KafkaConsumer JavaDoc for more information.) Spark Structured Streaming, for example, uses that approach, but it is much more complex...

Kafka one consumer with two different checkpoints

I have a Kafka consumer project which consumes data from a specific Kafka topic. 90% of the records are processed as soon as I get them, but I have to delay processing some of the records (10%).
Since these records need to be delayed, I can't commit them, which may cause Kafka to reassign the partitions to new nodes. To avoid that, I could read the same topic twice and delay fetching in the second consumer, but that requires deserializing twice, so it comes with an overhead.
Is it possible to read records using a single consumer but keep two separate commits? It would basically be similar to having two different consumers in terms of commits: consumer.poll would be called from a single consumer, but there would be two consumer.commitSync calls for each batch. It would help me avoid the extra deserialization and also the network cost.
Here are the things you can do to achieve this.
Create a pipeline with two topics (T1, T2): push the messages that can be processed immediately (90%) to topic T1 and the remaining messages (10%) to topic T2.
Make your Kafka consumer configurable, i.e. make it easy to pass the polling interval, batch size, and batch timeout whenever you start your consumer.
Work out the logic for when T2 should be consumed; if it is time-based, schedule a cron job that starts and stops the consumer for topic T2 when required.
Regarding consumer groups, you can place both consumers in the same group or in different ones. It's completely your choice.
This way you keep the topics clean, and every time you need to process the messages you can do it easily, having set up the pipeline just once.
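A minimal Python sketch of the splitting step follows. Everything in it is an assumption made for illustration: `needs_delay` is a hypothetical predicate you would supply, and `send` stands in for whatever producer call you use to publish to T1 and T2.

```python
# Sketch of routing one batch of records to two topics.
# `needs_delay` and `send` are placeholders, not part of any Kafka API.

def split_batch(records, needs_delay):
    """Split a batch into (immediate, delayed) lists, preserving order."""
    immediate, delayed = [], []
    for record in records:
        (delayed if needs_delay(record) else immediate).append(record)
    return immediate, delayed


def route(records, needs_delay, send):
    """Publish immediate records to T1 and delayed records to T2."""
    immediate, delayed = split_batch(records, needs_delay)
    for record in immediate:
        send("T1", record)
    for record in delayed:
        send("T2", record)


sent = []
route(range(10), lambda r: r % 10 == 9, lambda topic, r: sent.append((topic, r)))
print([r for topic, r in sent if topic == "T2"])  # [9]
```

Each topic then has its own committed offsets, so the consumer of T2 can lag behind without holding back commits on T1.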

Kafka - Multiple consumers (only one active) on same group/topic

Is it possible to have multiple copies of an application listen to the same Kafka group/topic so that only one is reading it at a time, but the other ones will start working if the main one crashes/stops reading?
I need to make an application highly available but can't tolerate doubling the traffic to the data store on the other end of the application by having multiple copies actively running.
FYI - Technically I'm using MapR streams but it adheres to the Kafka API and functionality, in case anyone knows a MapR stream-specific feature that helps the situation.
It is possible. If multiple consumers are in the same consumer group and the group subscribes to a topic, Kafka performs the partition assignment for your consumers: a partition can be consumed by only one consumer in the same group.
So you could give your topic only one partition; then only one consumer will consume messages and the others will be idle. Once that consumer shuts down, it triggers a group rebalance: Kafka does the partition assignment again, and in your case a new consumer takes over the work. It will process messages starting from the last offset committed by the old consumer.
And if your use case supports parallel processing, you could run many processes (apps) doing the same work and give the topic multiple partitions. They will be assigned different partitions and process different messages. This speeds up processing and also tolerates failover. As said above, if some consumers fail, Kafka takes care of it for you: it assigns their partitions to the other working consumers. So everything will be OK.
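The single-partition, active/standby setup can be modelled with a short Python sketch. It is a toy, not client code: the class is invented for illustration, and the rule that the first member in sorted order owns the partition is an assumption (in practice the group's assignor decides who gets it).

```python
# Toy model of a consumer group on a topic with exactly one partition.
# Only the current owner reads; a standby resumes from the committed offset.

class SinglePartitionGroup:
    def __init__(self, members):
        self.members = sorted(members)
        self.committed = 0  # next offset to read, shared by the group

    @property
    def active(self):
        # Assumption for this sketch: the first member owns the partition.
        return self.members[0] if self.members else None

    def process(self, member, log, count):
        """Read and commit up to `count` messages if `member` is the owner."""
        if member != self.active:
            return []  # standby members receive nothing
        batch = log[self.committed:self.committed + count]
        self.committed += len(batch)
        return batch

    def leave(self, member):
        """Member stops or crashes; the next member becomes the owner."""
        self.members.remove(member)


log = list(range(10))
group = SinglePartitionGroup(["app-1", "app-2", "app-3"])
print(group.process("app-2", log, 4))  # []
print(group.process("app-1", log, 4))  # [0, 1, 2, 3]
group.leave("app-1")
print(group.process("app-2", log, 3))  # [4, 5, 6]
```

Only one copy of the application touches the data store at a time. If the owner had died before committing, the standby would reprocess the uncommitted messages, so the downstream side should still tolerate occasional duplicates.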