In Kafka, should you reduce the number of consumers within a group as the overall lag between the partitions drops? - kubernetes

I have a topic with 100 partitions. Initially, with over 1 billion messages, I had scaled to 100 VMs so that every partition was consumed in parallel. Now the distribution no longer seems to be uniform, as the number of messages is down to just a few million. My question is: does it now make sense to reduce the number of consuming VMs within my consumer group as the lag drops, or to always keep it at 100? My concern is that a lot of rebalancing will start to occur and therefore lower my overall throughput of output messages to my sink.
Let's ignore financial cost within this decision.

As long as you don't use keyed messages in your producer, the messages should balance evenly across your partitions; this applies to billions of messages as well as to millions or fewer.
If you use dynamic partition assignment for your consumers (which is the default), changing the number of consumers will cause rebalances and add complexity to your project.
As long as you take cost out of the discussion, stay with a fixed number of consumers/VMs and make sure the number of partitions is a multiple of it, i.e. 200/300/400 partitions is also alright for your case because a consumer can be assigned multiple partitions.
Remember that even for a smaller number of messages, you don't lose anything by having more consumers as long as you have enough partitions to balance the work; the only cost is unused computation power (a financial consideration).

Related

Increase or decrease Kafka partitions dynamically

I have a system where load is not constant. We may get 1000 requests a day or no requests at all.
We use Kafka to pass on the requests between services. We have kept an average number of Kafka consumers to reduce the cost incurred. Now my Kafka consumers will sit idle if no requests are received that day, and there will be lag if too many requests are received.
We want to keep these consumers in autoscale mode, such that the number of servers (Kafka consumers) will increase if there is a spike in the number of requests. Once the number of requests drops, we will remove the servers. Therefore, the Kafka partitions have to be increased or decreased accordingly.
Kafka allows increasing the number of partitions. In this case, how can we decrease Kafka partitions dynamically?
Is there any other solution to handle this Auto-scaling?
Scaling up partitions will not fix your lag problems in the short term since no data is moved between partitions when you do this, so existing (or new) consumers are still stuck with reading the data in the previous partitions.
It's not possible to decrease partitions, and there is no point in scaling consumers in a group beyond the partition count, because the extra consumers will sit idle.
If you are able to sacrifice processing order for consumption speed, you can separate the consuming threads from the working threads, as hinted at in the KafkaConsumer javadoc; you would then be able to scale these worker threads.
Since you are thinking about modifying the partition counts, I'm guessing processing order isn't a problem.
The javadoc describes this option as follows: have one or more consumer threads that do all data consumption and hands off ConsumerRecords instances to a blocking queue consumed by a pool of processor threads that actually handle the record processing.
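The hand-off described in the javadoc can be sketched without a Kafka client at all. The snippet below is only an illustration under stated assumptions: `run_pipeline`, `process`, and the in-memory `records` list are hypothetical stand-ins for a real poll loop and real business logic, and it says nothing about offset commits, which become harder to get right once processing is decoupled from consumption.

```python
import queue
import threading


def run_pipeline(records, num_workers=4, queue_size=100):
    """Hand records from one 'consumer' thread to a pool of worker threads."""
    work = queue.Queue(maxsize=queue_size)  # bounded, so a slow pool applies back-pressure
    results = []
    results_lock = threading.Lock()
    stop = object()  # sentinel telling a worker to exit

    def process(record):
        # Hypothetical stand-in for slow business logic.
        return record * 2

    def worker():
        while True:
            record = work.get()
            if record is stop:
                break
            out = process(record)
            with results_lock:
                results.append(out)

    def consumer():
        # Stand-in for a poll loop; a real one would fetch from the Kafka client.
        for record in records:
            work.put(record)  # blocks while the queue is full
        for _ in range(num_workers):
            work.put(stop)  # one sentinel per worker

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    threads.append(threading.Thread(target=consumer))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    processed = run_pipeline(list(range(20)), num_workers=4)
    print(len(processed), "records processed")
```

Note that the results come back in whatever order the workers finish, which is exactly the loss of ordering mentioned above.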
A single consumer can consume multiple partitions. Therefore, partition your work for the largest anticipated parallel requirement, and then scale your number of consumers as required.
For example, if you think you need 32 parallel consumers, you would give your Kafka topic 32 partitions.
You can run 32 consumers (each gets one partition), or eight consumers (each gets four partitions) or just one (which gets all 32 partitions). Or any number of consumers in between. Kafka's protocol ensures that within the consumer group all partitions are consumed, and will rebalance as and when consumers are added or removed.
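To make the 32-partition example concrete, here is a small simulation. It is a simplified round-robin model, not Kafka's actual assignor, and `assign_partitions` and the consumer names are made up for illustration.

```python
def assign_partitions(num_partitions, consumers):
    """Spread partition ids over consumers round-robin (simplified model)."""
    assignment = {c: [] for c in consumers}
    if not consumers:
        return assignment
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment


if __name__ == "__main__":
    for n in (32, 8, 1, 40):
        members = [f"consumer-{i}" for i in range(n)]
        sizes = [len(v) for v in assign_partitions(32, members).values()]
        print(n, "consumers: max partitions per consumer =", max(sizes),
              "| idle consumers =", sizes.count(0))
```

With 40 consumers the last 8 receive nothing, which is the "consumers beyond the partition count sit idle" case.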

What happens when a Kafka partition is full?

Does it put data into another partition that was configured before launching Kafka?
What is the benefit of, and the main reason for, having more partitions than the usual 3?
How does it affect read and write performance?
I am having a bit of trouble understanding your question, but I think you may be misinterpreting what a partition actually is. A partition only has a size limit if you specify one in the config; otherwise you can simply think of a partition as a separate stream of data in which each record has an offset. In practice, the limit on the size of a partition (or topic, for that matter) is simply what the disk capacity will allow. Often the data will stay there and get deleted once a specified retention period or maximum size has been reached.
The list of configurations is pretty extensive but you can view it here:
https://kafka.apache.org/documentation/#configuration
As for your other question about the reasoning for having more partitions, it simply comes down to scaling. If you only have a small amount of data, then a small number of partitions will be enough. With Kafka there is no benefit to having multiple consumers per partition within a group. I.e. if you have three partitions, then three consumers is the optimal number of consumers, as each consumer will be assigned a single partition. If you have more consumers, some will sit idle.
So what if we have more data and need to read it faster to avoid lag? Well, we can add more partitions, and by doing so we can also scale up the number of consumers.
Side note: you can have a single consumer that reads from multiple partitions, but this can become an issue because reassigning partitions takes time. If you have too few consumers compared to the number of partitions, they simply won't be able to deal with high loads, because no processing happens while partitions are being reassigned.

Kafka - Best practices in case of slow processing consumer. How to achieve more parallelism?

I'm aware that the maximum number of active consumers in a consumer group is the number of partitions of a topic.
What's the best practice in case of slow processing consumers? How to achieve more parallelism?
An example: A topic with 6 partitions and thousands of messages per second produced from Producers. So I have at most 6 consumers in the group. Consider that processing those messages is complex and the consumers are much slower than the producers. The result is that the consumers are always behind the last offset and the lag is increasing.
In a traditional MQ system, we simply add more and more consumers to stay up to date.
How to achieve this with Kafka, since the total of the consumers in a group is at most the number of partitions? Should I:
Configure the topic to have more partitions allowing more consumers per group?
Route the message from the consumer to a traditional MQ Queue (but lose the ordering)?
What's the best practice for this situation?
In Kafka, partitions are the unit of parallelism.
Without knowing your exact use case and requirements it's hard to come up with precise recommendations, but there are a few options.
First, you should really consider having more partitions. 6 partitions is relatively small; you could easily have 60, 120 or even more partitions (and the corresponding number of consumers). Suddenly the amount of work each consumer has to do is significantly reduced.
If your requirements allow, you can also consume at a fast rate and spread the processing of records across many workers. In solutions like this it's harder to maintain ordering, but if you don't need it then you can consider it.
I'm not sure how routing messages through an MQ queue would really help in this scenario. If you are still reading slower than writing, the amount of data in the queue will grow until you have no disk space left.
Kafka is better designed to serve as a buffer between your producers and consumers, so just ensure you have retention limits on your topics that allow some flexibility on the consumer side without losing data.

Kafka Consumer being Starved because of unbalance

I am new to Kafka and think I am missing something about how partition queues get balanced on a topic.
We have 5 partitions and 2 consumers on a topic. The records have a null key, so I assume Kafka picks the partition for each new record in a round-robin fashion.
This would mean one consumer would be reading from 3 partitions and the other from 2. If my assumption is right (that the records get evenly distributed across partitions), the consumer with 3 partitions would be doing more work (1.5x more). This could lead to one consumer doing nothing while the other keeps working hard.
I think you should have an even divisible number of partitions to consumers.
Am I missing something?
The unit of parallelism in consuming Kafka messages is the partition. The usual scenario is consuming Kafka messages with a stream processing engine such as Apache Flink, Spark, or Storm, all of which distribute processing across CPU cores. The rule is that the maximum level of parallelism for each consumer group is the number of partitions. Each consumer instance of a consumer group (say, a CPU core) can consume one or more partitions; on the other hand, each partition can be consumed by just one consumer instance of each consumer group.
If you have more CPU cores than partitions, some of them will be idle.
If you have fewer CPU cores than partitions, some of them will consume more than one partition.
The optimal case is when the number of CPU cores and the number of Kafka partitions are equal.
The image can describe all well:
If my assumption is right (that the records get evenly distributed across partitions) the consumer with 3 partitions would be doing more work (1.5x more). This could lead to one consumer doing nothing while the other keeps working hard.
Why would one consumer do nothing? It would still process records from those 2 partitions [assuming of course, that both the consumers are in same group]
I think you should have an even divisible number of partitions to consumers.
Yes, that's right. For maximum parallelism, you can have as many consumers as partitions; e.g. in your case 5 consumers would give you maximum parallelism.
There is an assumption built into your understanding that each partition has exactly the same throughput. For most applications, though, that may or may not be true. If you set up your keying/partitioning right, then the partitions should hopefully be close to equal, especially with a large and diverse keyspace, if you average them out over a long period of time. But in a more practical, realistic sense, you'll probably have some skew at any given time anyway, and your stream processing setup will need to tolerate that. So having one more partition assigned to a particular consumer is probably not going to make a big difference.
Your understanding is correct. Maybe there is data skew. You can check how many records there are in each partition by using an offset checker or another tool.
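A quick way to see how partition count and skew interact is to add up per-partition rates per consumer. This is only a back-of-the-envelope model: `consumer_load` is a hypothetical helper, the rates are invented numbers, and the round-robin split is an assumption rather than what your group's assignor will necessarily do.

```python
def consumer_load(partition_rates, num_consumers):
    """Sum per-partition message rates per consumer, splitting partitions round-robin."""
    loads = [0] * num_consumers
    for p, rate in enumerate(partition_rates):
        loads[p % num_consumers] += rate
    return loads


if __name__ == "__main__":
    # 5 partitions, 2 consumers, perfectly even traffic (msgs/s, made-up numbers).
    print(consumer_load([100, 100, 100, 100, 100], 2))
    # Same layout, but partition 1 is hot: the consumer with only 2 partitions
    # now carries more load than the one with 3.
    print(consumer_load([100, 400, 100, 100, 100], 2))
```

In the even case the split is 300 vs 200 (the 1.5x from the question); with one hot partition it flips to 300 vs 500, so skew can matter more than the extra partition.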

How to choose the number of partitions for a Kafka topic?

We have a 3-node zk cluster and 7 brokers. Now we have to create a topic and create partitions for it.
But I did not find any formula for deciding how many partitions I should create for this topic.
Rate of producer is 5k messages/sec and size of each message is 130 Bytes.
Thanks In Advance
I can't give you a definitive answer, there are many patterns and constraints that can affect the answer, but here are some of the things you might want to take into account:
The unit of parallelism is the partition, so if you know the average processing time per message, then you should be able to calculate the number of partitions required to keep up. For example, if each message takes 100ms to process and you receive 5k a second, then you'll need at least 500 partitions (5,000 msgs/s x 0.1 s per message). Add a percentage more than that to cope with peaks and variable infrastructure performance. Queuing theory can give you the math to calculate your parallelism needs.
How bursty is your traffic and what latency constraints do you have? Considering the last point, if you also have latency requirements then you may need to scale out your partitions to cope with your peak rate of traffic.
If you use any data locality patterns or require ordering of messages then you need to consider future traffic growth. For example, you deal with customer data and use your customer id as a partition key, and depend on each customer always being routed to the same partition. Perhaps for event sourcing or simply to ensure each change is applied in the right order. Well, if you add new partitions later on to cope with a higher rate of messages, then each customer will likely be routed to a different partition now. This can introduce a few headaches regarding guaranteed message ordering as a customer exists on two partitions. So you want to create enough partitions for future growth.
Just remember that it is easy to scale consumers out and in, but partitions need some planning, so stay on the safe side and be future-proof.
Having thousands of partitions can increase overall latency.
This old benchmark by a Kafka co-founder is a nice way to understand the magnitudes of scale - https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
The immediate conclusion from this, like Vanlightly said here, is that the consumer handling time is the most important factor in deciding on the number of partitions (since you are not close to challenging the producer throughput).
The maximum concurrency for consuming is the number of partitions, so you want to make sure that:
((processing time for one message in seconds x number of msgs per second) / num of partitions) << 1
If it equals 1, you cannot read faster than you write, and this is without mentioning bursts of messages and failures/downtime of consumers. So you will need it to be significantly lower than 1; how much lower depends on the latency that your system can endure.
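The ratio above is easy to play with in a few lines. This is a rough sketch, assuming the figures quoted earlier in this thread (5k msgs/s, 100 ms per message); `utilization`, `min_partitions`, and the 0.5 target are illustrative choices, not recommendations.

```python
import math


def utilization(processing_ms, msgs_per_second, partitions):
    """(processing time per message x arrival rate) / partitions; must stay well below 1."""
    return processing_ms * msgs_per_second / (1000 * partitions)


def min_partitions(processing_ms, msgs_per_second, target_utilization):
    """Smallest partition count that keeps utilization at or below the target."""
    return math.ceil(processing_ms * msgs_per_second / (1000 * target_utilization))


if __name__ == "__main__":
    for partitions in (50, 500, 1000):
        print(partitions, "partitions ->", utilization(100, 5000, partitions))
    print("partitions for 50% utilization:", min_partitions(100, 5000, 0.5))
```

With these inputs, 50 partitions gives a ratio of 10 (consumers fall further and further behind), 500 gives exactly 1 (no headroom), and roughly 1000 are needed to sit at 0.5.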
It depends on your required throughput, cluster size, hardware specifications:
There is a clear blog about this written by Jun Rao from Confluent:
How to choose the number of topics/partitions in a Kafka cluster?
Also this might be helpful to have an insight:
Apache Kafka Supports 200K Partitions Per Cluster
For example, if you want to be able to read 1000 MB/sec, but your consumer is only able to process 50 MB/sec, then you need at least 20 partitions and 20 consumers in the consumer group. Similarly, if you want to achieve the same for producers, and 1 producer can only write at 100 MB/sec, you need 10 partitions. In this case, if you have 20 partitions, you can maintain 1 GB/sec for producing and consuming messages. You should adjust the exact number of partitions to the number of consumers or producers, so that each consumer and producer achieves its target throughput.
So a simple formula could be:
#Partitions = max(NP, NC)
where:
NP is the number of required producers determined by calculating: TT/TP
NC is the number of required consumers determined by calculating: TT/TC
TT is the total expected throughput for our system
TP is the max throughput of a single producer to a single partition
TC is the max throughput of a single consumer from a single partition
source : https://docs.cloudera.com/runtime/7.2.10/kafka-performance-tuning/topics/kafka-tune-sizing-partition-number.html
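The formula translates directly into code. A minimal sketch, assuming all three throughputs are given in the same unit (MB/s here); `required_partitions` is just an illustrative name, and rounding up with `ceil` is my addition for inputs that don't divide evenly.

```python
import math


def required_partitions(total_throughput, producer_throughput, consumer_throughput):
    """#Partitions = max(NP, NC) with NP = TT/TP and NC = TT/TC, rounded up."""
    np_needed = math.ceil(total_throughput / producer_throughput)
    nc_needed = math.ceil(total_throughput / consumer_throughput)
    return max(np_needed, nc_needed)


if __name__ == "__main__":
    # Numbers from the example above: TT = 1000 MB/s, TP = 100 MB/s, TC = 50 MB/s.
    print(required_partitions(1000, 100, 50))
```

For the numbers in the example this gives NP = 10 and NC = 20, so 20 partitions.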
You could choose the number of partitions as the maximum of {throughput / throughput per producer ; throughput / throughput per consumer}. The throughput is calculated from the message volume per second. Here you have:
Throughput = 5k msgs/s * 130 bytes = 650 KB/s