I am learning Kafka and trying to create a topic for my recent-search application. The volume of data pushed to the Kafka topic is expected to be high.
My Kafka cluster has 3 brokers, and there are already topics created for other requirements.
Now, how many partitions should I choose for my recent-search topic? What happens if I do not provide the partition number explicitly? And what needs to be considered when choosing the partition count?
This will depend on the throughput of your consumers. If you are producing 100 messages a second and each consumer can process 10 messages a second, then you'll want at least 10 partitions (produce rate / consume rate) with 10 instances of your consumer. If you want this topic to be able to handle future growth, then you'll want to increase the partition count even higher so that you can add more consumer instances to handle the new volume.
Another piece of advice would be to make your partition count a highly divisible number so that you can scale up/down consumers while keeping their load balanced. For example, if you choose 10 partitions then you would have to have 1, 2, 5, or 10 instances of your consumer to keep them each processing from the same number of partitions. If you choose 12 partitions instead then you could be balanced with either 1, 2, 3, 4, 6, or 12 instances of your consumer.
I would evaluate two main things before deciding on the number of partitions.
First: how the partitions and the consumers of a consumer group act together. In simple words, one consumer can consume messages from more than one partition, but one partition can't be consumed by more than one consumer within the same group. That means it makes sense to have number of partitions >= number of consumers in a consumer group. Otherwise you will end up with consumers that have no partition assigned.
Second: what your requirements are from a latency vs. throughput point of view.
In simple words,
Latency is the time required to perform some action or to produce some result. Latency is measured in units of time -- hours, minutes, seconds, nanoseconds or clock periods.
Throughput is the number of such actions executed or results produced per unit of time.
Now, coming back to the comparison from a Kafka standpoint: in general, more partitions in a Kafka cluster leads to higher throughput. But be careful with this number if you are really after low latency.
Related
I have a standalone Kafka setup with a single disk, planning to stream over a million records. How do I decide the number of partitions for my topic for better throughput? Does it have to be 1 partition?
Is it recommended to have multiple partitions for a topic on standalone Kafka server?
Yes, you need multiple partitions even for a single-node Kafka cluster. That is because you can only have as many consumers as you have partitions. If you have a single partition then you can only have a single consumer, and that will limit throughput, especially if you want to stream millions of rows (although the time period for those is not specified).
The only real downside is that messages are only consumed in order within the same partition. Other than that, you should go with multiple partitions. You will need to estimate the throughput of a single consumer in order to calculate the partition count, then maybe add one or two on top of that.
You can still add partitions later but it's probably better to try to start with the right amount first and change later as you learn more or as your volume increases/decreases.
There are two main factors to consider:
Number of producers and consumers
Within a consumer group, each partition is assigned to at most one consumer, so the partition count caps your consumer parallelism. On the producer side, each partition's write throughput is limited, so more partitions raise the write ceiling as well. For this reason, the number of partitions must be at least max(number of producers required, number of consumers required).
Throughput
You must determine the throughput to calculate how many consumers should be in the consumer group. The combined read capacity of the consumers should be at least as high as the combined write capacity of the producers.
I have a system where load is not constant. We may get 1000 requests a day or no requests at all.
We use Kafka to pass the requests between services. We have kept an average number of Kafka consumers to reduce the cost incurred. Now my Kafka consumers will sit idle if no requests are received that day, and there will be lag if too many requests come in.
We want to keep these consumers in autoscale mode, so that the number of servers (Kafka consumers) increases if there is a spike in the number of requests; once the number of requests drops, we remove servers. Therefore, the Kafka partitions would have to be increased or decreased accordingly.
Kafka allows increasing partitions. In that case, how can we decrease Kafka partitions dynamically?
Is there any other solution to handle this Auto-scaling?
Scaling up partitions will not fix your lag problems in the short term since no data is moved between partitions when you do this, so existing (or new) consumers are still stuck with reading the data in the previous partitions.
It's not possible to decrease partitions, and it's not possible to scale consumers beyond the partition count.
If you are able to sacrifice processing order for consumption speed, you can separate the consuming threads from the worker threads, as hinted at in the KafkaConsumer javadoc; then you would be able to scale these worker threads independently.
Since you are thinking about modifying the partition counts, then I'm guessing processing order isn't a problem.
Have one or more consumer threads that do all data consumption and hand off ConsumerRecords instances to a blocking queue consumed by a pool of processor threads that actually handle the record processing.
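A minimal sketch of that pattern in Java, assuming a local broker and made-up topic and group names; note that with the default auto-commit, offsets can be committed before a worker finishes a record, so this trades delivery guarantees for speed:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class DecoupledConsumer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption
        props.put("group.id", "search-processors");       // hypothetical group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // A bounded queue gives back-pressure between the polling thread and workers.
        BlockingQueue<ConsumerRecord<String, String>> queue = new ArrayBlockingQueue<>(1000);

        // The worker pool size is independent of the partition count.
        ExecutorService workers = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 8; i++) {
            workers.submit(() -> {
                try {
                    while (true) {
                        process(queue.take());
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("search-topic")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    queue.put(record); // blocks when the workers fall behind
                }
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        // Placeholder for the real per-record work.
        System.out.printf("partition=%d offset=%d%n", record.partition(), record.offset());
    }
}
```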
A single consumer can consume multiple partitions. Therefore partition your work for the largest anticipated parallel requirement, and then scale your number of consumers as required.
For example, if you think you need 32 parallel consumers, you would give your Kafka topic 32 partitions.
You can run 32 consumers (each gets one partition), or eight consumers (each gets four partitions) or just one (which gets all 32 partitions). Or any number of consumers in between. Kafka's protocol ensures that within the consumer group all partitions are consumed, and will rebalance as and when consumers are added or removed.
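For reference, creating such a topic with the Java AdminClient might look like this; the topic name is a placeholder, and the replication factor of 3 assumes a 3-broker cluster:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Properties;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption

        try (AdminClient admin = AdminClient.create(props)) {
            // 32 partitions allow up to 32 parallel consumers in one group.
            NewTopic topic = new NewTopic("search-topic", 32, (short) 3);
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```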
I am new to Kafka and think I am missing something on how partition queues get balanced on a topic
We have 5 partitions and 2 consumers on a topic. The records have null keys, so I assume Kafka picks a partition for each new record in a round-robin fashion.
This would mean one consumer would be reading from 3 partitions and the other from 2. If my assumption is right (that the records get evenly distributed across partitions), the consumer with 3 partitions would be doing more work (1.5x more). This could lead to one consumer doing nothing while the other keeps working hard.
I think you should have an even divisible number of partitions to consumers.
Am I missing something?
The unit of parallelism in consuming Kafka messages is the partition. The routine scenario for consuming Kafka messages is a distributed stream-processing engine like Apache Flink, Spark, or Storm, all of which spread processing across CPU cores. The rule is that the maximum level of parallelism for a consumer group is the number of partitions. Each consumer instance of a consumer group (say, a CPU core) can consume one or more partitions, while each partition can be consumed by just one consumer instance per consumer group.
If you have more CPU cores than partitions, some of them will be idle.
If you have fewer CPU cores than partitions, some of them will consume more than one partition.
The optimal case is when the number of CPU cores equals the number of Kafka partitions.
If my assumption is right (that the records get evenly distributed across partitions) the consumer with 3 partitions would be doing more work (1.5x more). This could lead to one consumer doing nothing while the other keeps working hard.
Why would one consumer do nothing? It would still process records from those 2 partitions (assuming, of course, that both consumers are in the same group).
I think you should have an even divisible number of partitions to consumers.
Yes, that's right. For maximum parallelism you can have as many consumers as partitions; in your case 5 consumers would give you maximum parallelism.
There is an assumption built into your understanding that each partition has exactly the same throughput. For most applications, though, that may or may not be true. If you set up your keying/partitioning right, then the partitions should hopefully be close to equal, especially with a large and diverse keyspace if you average them out over a large period of time. But in a more practical, realistic sense, you'll probably have some skew at any given time anyway, and your stream processing setup will need to tolerate that. So having one more partition assigned to a particular consumer is probably not going to make a big difference.
Your understanding is correct. Maybe there is data skew. You can check how many records are in each partition by using an offset-checking tool.
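One way to check from code is to compare beginning and end offsets per partition with a plain KafkaConsumer; this is a sketch, and the broker address and topic name are placeholders:

```java
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class PartitionSkewCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = new ArrayList<>();
            for (PartitionInfo info : consumer.partitionsFor("search-topic")) { // hypothetical topic
                partitions.add(new TopicPartition(info.topic(), info.partition()));
            }
            Map<TopicPartition, Long> begin = consumer.beginningOffsets(partitions);
            Map<TopicPartition, Long> end = consumer.endOffsets(partitions);
            for (TopicPartition tp : partitions) {
                // end - beginning approximates the retained record count per partition.
                System.out.printf("%s: %d records%n", tp, end.get(tp) - begin.get(tp));
            }
        }
    }
}
```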
We have a 3-node ZooKeeper cluster and 7 brokers. Now we have to create a topic and decide how many partitions it should have.
But I did not find any formula for deciding how many partitions I should create for this topic.
The producer rate is 5k messages/sec and the size of each message is 130 bytes.
Thanks in advance.
I can't give you a definitive answer, there are many patterns and constraints that can affect the answer, but here are some of the things you might want to take into account:
The unit of parallelism is the partition, so if you know the average processing time per message, then you should be able to calculate the number of partitions required to keep up. For example, if each message takes 100ms to process, one consumer handles 10 messages a second, so at 5k messages a second you'll need at least 500 partitions. Add a percentage more than that to cope with peaks and variable infrastructure performance. Queueing theory can give you the math to calculate your parallelism needs.
How bursty is your traffic and what latency constraints do you have? Considering the last point, if you also have latency requirements then you may need to scale out your partitions to cope with your peak rate of traffic.
If you use any data locality patterns or require ordering of messages, then you need to consider future traffic growth. For example, you deal with customer data, use your customer id as a partition key, and depend on each customer always being routed to the same partition, perhaps for event sourcing or simply to ensure each change is applied in the right order. Well, if you add new partitions later on to cope with a higher rate of messages, then each customer will likely be routed to a different partition. This can introduce a few headaches regarding guaranteed message ordering, as a customer's data now exists on two partitions. So you want to create enough partitions for future growth.
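To make the rerouting concrete, here is the hash-modulo rule that Kafka's default partitioner applies to keyed records, using Kafka's own Utils.murmur2 helper; the customer key is a made-up example:

```java
import org.apache.kafka.common.utils.Utils;

import java.nio.charset.StandardCharsets;

public class PartitionDrift {
    // The same computation Kafka's default partitioner uses for keyed messages.
    static int partitionFor(String key, int numPartitions) {
        byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
        return Utils.toPositive(Utils.murmur2(bytes)) % numPartitions;
    }

    public static void main(String[] args) {
        String customerId = "customer-42"; // hypothetical partition key
        System.out.println("10 partitions -> " + partitionFor(customerId, 10));
        System.out.println("12 partitions -> " + partitionFor(customerId, 12));
        // The two results will usually differ, so after a resize a customer's
        // new records land in a different partition than their old ones.
    }
}
```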
Just remember that it is easy to scale consumers out and in, but partitions need some planning, so stay on the safe side and be future-proof.
Having thousands of partitions can increase overall latency.
This old benchmark by a Kafka co-founder is pretty nice for understanding the magnitudes of scale: https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
The immediate conclusion from this, as Vanlightly said above, is that consumer handling time is the most important factor in deciding the number of partitions (since you are nowhere near challenging the producer throughput).
The maximal concurrency for consuming is the number of partitions, so you want to make sure that:
(processing time for one message in seconds x number of msgs per second) / number of partitions << 1
If it equals 1, you cannot read faster than you write, and that is before considering bursts of messages and failures/downtime of consumers. So you will need it to be significantly lower than 1; how much lower depends on the latency your system can endure.
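Turned into a quick sanity check (a minimal sketch; all three inputs are illustrative assumptions you would replace with your own measurements):

```java
public class ConsumerUtilization {
    public static void main(String[] args) {
        double processingTimeSec = 0.100; // assumed time to handle one message
        double msgsPerSecond = 5_000;     // assumed incoming rate
        int partitions = 600;             // candidate partition count

        // The ratio from above: keep it well below 1.0 to leave headroom
        // for bursts and consumer downtime.
        double utilization = (processingTimeSec * msgsPerSecond) / partitions;
        System.out.printf("utilization = %.2f%n", utilization);
        if (utilization >= 1.0) {
            System.out.println("Consumers cannot keep up: add partitions and consumers.");
        }
    }
}
```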
It depends on your required throughput, cluster size, hardware specifications:
There is a clear blog about this written by Jun Rao from Confluent:
How to choose the number of topics/partitions in a Kafka cluster?
Also this might be helpful to have an insight:
Apache Kafka Supports 200K Partitions Per Cluster
Partitions = max(NP, NC)
where:
NP is the number of required producers determined by calculating: TT/TP
NC is the number of required consumers determined by calculating: TT/TC
TT is the total expected throughput for our system
TP is the max throughput of a single producer to a single partition
TC is the max throughput of a single consumer from a single partition
For example, if you want to be able to read 1000 MB/sec, but your consumer is only able to process 50 MB/sec, then you need at least 20 partitions and 20 consumers in the consumer group. Similarly, if you want to achieve the same for producers, and 1 producer can only write at 100 MB/sec, you need 10 partitions. In this case, if you have 20 partitions, you can maintain 1 GB/sec for producing and consuming messages. You should adjust the exact number of partitions to the number of consumers or producers, so that each consumer and producer achieves its target throughput.
Source: https://docs.cloudera.com/runtime/7.2.10/kafka-performance-tuning/topics/kafka-tune-sizing-partition-number.html
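As a rough sketch, the rule of thumb above translates to a few lines of Java; the inputs reuse the 1000 MB/s example and are assumptions to be measured, not given values:

```java
public class PartitionSizing {
    // Rule of thumb: #partitions = max(TT/TP, TT/TC).
    static int partitions(double tt, double tp, double tc) {
        int np = (int) Math.ceil(tt / tp); // producers needed to reach TT
        int nc = (int) Math.ceil(tt / tc); // consumers needed to reach TT
        return Math.max(np, nc);
    }

    public static void main(String[] args) {
        // 1000 MB/s target, 100 MB/s per producer, 50 MB/s per consumer
        // -> max(10, 20) = 20 partitions.
        System.out.println(partitions(1000, 100, 50));
    }
}
```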
You could choose the number of partitions equal to max(total throughput / producer throughput, total throughput / consumer throughput). The total throughput is calculated from the message volume per second. Here you have:
Throughput = 5k messages/sec x 130 bytes = 650 KB/s
I'm reading the Kafka documentation and noticed the following line:
Note however that there cannot be more consumer instances in a consumer group than partitions.
Hmm. How can I auto-scale this?
For example let's say I have a messaging system with hi/lo priorities, so I create a topic for messages and partitions for hi and lo priority messages.
If this were RabbitMQ, I'd have an auto-scalable group of consumers assigned to each partition.
If I understand the Kafka model, I can't have more than 1 consumer per partition in a consumer group, so that design doesn't work for Kafka, right?
OK, so what about more than one consumer group?
That gets around Kafka's limitation, but... if I understand how this works, both consumer groups would be pulling from a partition, for example msg.hi, with their own offsets, so neither would know about the other, meaning messages would likely be delivered twice!
How can I achieve the capability I had in the Rabbit design with Kafka and still maintain the "queue-ness" of the behavior (I don't want to send a message twice)? What am I missing?
TL;DR
A topic is made up of partitions. Partitions decide the max number of consumers you can have in a group.
Scenario 1:
When we have only one consumer, it can read all the messages from all the partitions.
Scenario 2:
In the above setup, when you increase the number of consumers in the group, a partition reassignment happens and, instead of consumer 1 reading all the messages from all the partitions, consumer 2 takes some of the load off consumer 1.
Scenario 3:
What happens if I have more consumers than partitions? Each consumer would be assigned at most 1 partition. Any additional consumers in the group will sit idle unless you increase the number of partitions for the topic.
Summary:
We need to choose the partition count accordingly, since it decides the max number of consumers in the group. Changing the partition count for an existing topic is really NOT recommended, as it could cause issues.
For example, let's assume a producer producing names into a topic where we have 3 partitions: all names starting with A-I go to partition 1, J-R to partition 2 and S-Z to partition 3. Let's also assume that we have already produced 1 million messages. Now if you suddenly increase the number of partitions from 3 to 5, the key ranges are carved up differently, say A-F in partition 1, G-K in partition 2, L-Q in partition 3, R-U in partition 4 and V-Z in partition 5. New records for a given name no longer land in the partition that holds its older records, which affects the ordering of the messages we had before! So you need to be aware of this. If this could be a problem, then we need to choose the partition count accordingly, upfront.
More info is here - http://www.vinsguru.com/kafka-scaling-consumers-out-for-a-consumer-group/
Your assumption about messages being consumed twice is correct (since each group consumes 100% of messages from a topic).
I agree with David. Moreover, I suggest that you create more partitions than you really need, which would leave you some headroom to increase the number of threads in the group when such a need arises.
You can always increase the number of partitions later (and/or add additional brokers), but it's nice to have that already done, so that you only need to increase the number of threads and be done with it (those situations usually require a quick response, so do all the prep you can in advance).
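For reference, raising the partition count of an existing topic can be done with the Java AdminClient; it only works upwards, never down (the broker address and topic name are placeholders):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

import java.util.Map;
import java.util.Properties;

public class IncreasePartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption

        try (AdminClient admin = AdminClient.create(props)) {
            // Raise the partition count of an existing topic to 24.
            // Kafka cannot shrink a topic, so plan the ceiling up front.
            admin.createPartitions(Map.of("search-topic", NewPartitions.increaseTo(24)))
                 .all().get();
        }
    }
}
```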
Just create a bunch of partitions for hi and lo. 12 is a good number. So is 60. Just pick a number of partitions that matches how much maximum parallelization you want.
Honestly, although I personally would make msg.hi and msg.lo be different topics entirely, that's not a requirement -- you can do custom partitioning to divide messages between partitions, as sketched below.
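As a hedged illustration of such custom partitioning, a Partitioner that steers "hi"-prefixed keys into the lower half of the partition range could look like the sketch below; the key convention and the halving scheme are invented for the example, and it assumes the topic has at least 2 partitions:

```java
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

import java.util.Map;

// Hypothetical priority partitioner: "hi" keys go to the lower half of the
// partition range, everything else to the upper half. Consumers can then be
// assigned the hi or lo partitions explicitly.
public class PriorityPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionCountForTopic(topic);
        int half = numPartitions / 2; // assumes numPartitions >= 2
        int h = value == null ? 0 : (value.hashCode() & 0x7fffffff);
        return key != null && key.toString().startsWith("hi") // made-up convention
                ? h % half
                : half + h % (numPartitions - half);
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}
```

Producers would opt in by setting partitioner.class to this class name in their configuration.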
You can also use an AI-based autoscaler like this one: https://www.confluent.io/events/kafka-summit-americas-2021/intelligent-auto-scaling-of-kafka-consumers-with-workload-prediction/
This scaler calculates the right number of consumer pods based on workload prediction and target KPI metrics.