I understand that the Kafka consumer group concept enables parallel processing of partitions by the consumers within the same consumer group: if I spin up multiple consumer objects that all belong to the same consumer group id, the load is balanced across those consumer instances according to one of the natively available partition assignment strategies.
We were trying out the RoundRobin assignment strategy. We created an application that subscribes to 2 topics (say topic-1 and topic-2, each having 10 partitions) and creates 10 consumer objects per topic, so that we have 20 consumer objects, one for each of the 20 partitions across the 2 topics. So far so good: when we run one instance of this application, each consumer is attached to one partition of a topic.
When we spin up another instance of this application, it does the same thing and attaches itself to the same consumer group id, so we now have 40 consumer objects overall. They get load balanced in such a way that 10 consumers from the earlier instance (1) release their partition assignments to the new instance (2), and the distribution appears equal, with each instance taking half of the load (in each instance, 10 consumers process messages from 10 partitions and 10 consumers stay idle).
Now the fun starts when we spin up a 3rd instance of the application. We expected the load balance to be something like 7, 7 and 6 across the application instances; however, it turns out to be 7, 8 and 5, and sometimes even seemingly random allocations, which I don't consider a fair, equal distribution.
We understand that we cannot have more active consumers than there are partitions; we are looking for finer load balancing across the application instances sharing the same consumer group id, so as not to overburden one particular instance.
Are we missing a config here, or some fundamental understanding? Please guide. Many thanks!
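A possible explanation, offered as a sketch rather than a definitive answer: the assignors balance partitions across individual consumers (group members), not across application instances. With 60 consumers and 20 partitions, 20 consumers get one partition each, and which application instance those consumers live in depends on how the member ids happen to sort. The simulation below assumes member ids are effectively random (here a UUID tagged with a hypothetical instance number) and that partitions are dealt out in sorted member-id order. The class and method names are made up for illustration; this is not Kafka's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.UUID;

class InstanceShareSketch {
    // Counts how many partitions land on each application instance when every
    // partition goes to the next member in sorted member-id order.
    static int[] partitionsPerInstance(int instances, int consumersPerInstance, int partitions) {
        List<String> members = new ArrayList<>();
        for (int i = 0; i < instances; i++) {
            for (int c = 0; c < consumersPerInstance; c++) {
                // Hypothetical member id: a random UUID, tagged with the owning instance after '@'.
                members.add(UUID.randomUUID() + "@" + i);
            }
        }
        Collections.sort(members);
        int[] perInstance = new int[instances];
        for (int p = 0; p < partitions; p++) {
            String member = members.get(p % members.size());
            int instance = Integer.parseInt(member.substring(member.indexOf('@') + 1));
            perInstance[instance]++;
        }
        return perInstance;
    }

    public static void main(String[] args) {
        // 3 instances x 20 consumers each, 20 partitions in total (2 topics x 10).
        System.out.println(Arrays.toString(partitionsPerInstance(3, 20, 20)));
    }
}
```

Each run can print a different split of the 20 partitions across the 3 instances. Under these assumptions, every consumer is treated fairly (0 or 1 partition each), while the per-instance totals are left to chance.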
Related
I would like to ask for some input on the following question. I'm using a Consumer.committableSource in my application. During tests I discovered that instead of going round-robin among the partitions of the Kafka topic, the application drains a given partition until it consumes the latest entry before switching to the next partition. This is not ideal for my application, as it cares about the temporal order in which the events are put on Kafka. This exhaustive way of reading partitions is like going back and forth in time.
Any ideas on how I can tune the consumer to favor round-robin on partition consumption instead?
Thank you!
You can handle this scenario in 2 ways; the first is preferable, as it achieves parallelism and high throughput with minimal latency.
Create multiple instances of the same consumer. They will work as a consumer group, and all instances will share the partition load in parallel.
E.g. if you have 4 partitions and you use 2 instances, then in the ideal case each instance will consume 2 partitions. If you increase the instances to 4, then in the ideal case each instance will consume 1 partition. In that case, partition rebalancing is managed by the consumer's group management.
You can also assign a list of partitions to the consumer by using the API below:
public void assign(java.util.Collection<TopicPartition> partitions)
This manually assigns a list of partitions to the consumer, so the consumer will consume only the assigned partitions. This does not use the consumer group rebalance.
I will have 10 servers accessing a single Kafka topic. How should I ensure that every server (consumer) gets distinct records to process?
I will have 10 or more instances of my code running, so every instance will act as a consumer.
Per the Kafka consumer group protocol, you're guaranteed that no partition will be assigned to more than one consumer within the same consumer group.
This means that if there are 10+ partitions on this single topic, and each of the 10 servers shares the same group.id consumer setting, then every server gets distinct events, though not necessarily unique events (a record can still be delivered more than once).
If you have fewer than 10 partitions in the topic, then some servers will sit idle, reading nothing, until one of the other servers crashes.
Regarding exactly once: you must disable auto commits, manage offsets yourself, and look at the documentation around transactional producers and the isolation.level consumer setting. Otherwise, you're going to get at-least-once delivery.
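As a rough illustration of the consumer-side settings mentioned above, here is a minimal sketch that only builds the configuration. The keys group.id, enable.auto.commit and isolation.level are standard consumer settings; the class name, the helper method, the broker address and the group name are hypothetical. These settings alone do not give exactly-once processing; the offset handling and the transactional producer side still have to be designed around them.

```java
import java.util.Properties;

class ConsumerConfigSketch {
    // Builds consumer properties for a group member that commits offsets itself
    // and only reads records from committed transactions.
    static Properties manualCommitProps(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("group.id", groupId);
        // Offsets are committed by the application, not automatically on a timer.
        props.setProperty("enable.auto.commit", "false");
        // Skip records that belong to aborted or still-open transactions.
        props.setProperty("isolation.level", "read_committed");
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical broker address and group name, for illustration only.
        Properties props = manualCommitProps("localhost:9092", "my-server-group");
        props.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

All 10 servers would use the same group.id so that the partitions are split among them.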
I have a requirement in my IoT project: a custom Java application called "NorthBound" (NB) can manage a maximum of 3000 devices. Devices send data to SouthBound (SB, a Java application), SB sends the data to Kafka, and NB consumes the messages from Kafka.
To manage around 100K devices, I am planning to start multiple instances (around 35) of NorthBound, but I want the same instance to receive the messages from the same devices, e.g. Device1 always sends data to NB_instance1, Device2 to NB_instance2, etc.
To handle this, I am thinking of creating 35 partitions of the same topic (Device-Messages) so that each NB instance consumes one partition and a given device's data always goes to the same NB instance. Is this the right approach, or is there a better way?
How many partitions can we make in a Kafka cluster, and what is a recommended value considering 3 nodes (brokers) in a cluster?
Currently, we have only 1 node in Kafka. Can we continue with a single node and 35 partitions?
Say on startup I might have only 5-6K devices; then I will have only 2 partitions with 2 NB instances. Gradually, as we add more devices, we will keep adding more partitions and NB instances. Can we do that without restarting Kafka? Is it possible to create partitions dynamically?
Regards,
Krishan
As you can imagine the number of partitions you can have depends on a number of factors.
Assuming you have recent hardware, since Kafka 1.1 you can have thousands of partitions per broker. Moreover, Kafka has been tested with over 100,000 partitions in a cluster.
As a rule of thumb, it's recommended to over-partition a bit in order to allow for future growth in traffic/usage. Kafka allows adding partitions at runtime, but that changes the partitioning of keyed messages, which can be an issue depending on your use case.
Finally, it's not recommended to run a single broker for production workloads: if it were to crash or fail, you'd be exposed to an outage and possibly data loss. It's best to have at least 2 of them with a replication factor of 2, even with only 35 partitions.
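To see why adding partitions at runtime matters for the "same device, same NB instance" requirement, here is a small sketch of key-based partitioning. It assumes the device id is used as the message key and that the partition is derived from a hash of the key modulo the partition count. String.hashCode() is used purely as a stand-in and is not the hash Kafka's default partitioner uses; the class and method names are hypothetical.

```java
class KeyedPartitionSketch {
    // Stand-in for the producer's key hashing: String.hashCode() is for
    // illustration only and is NOT the hash Kafka's default partitioner uses.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Counts the keys whose partition differs between two partition counts.
    static int movedKeys(String[] keys, int before, int after) {
        int moved = 0;
        for (String key : keys) {
            if (partitionFor(key, before) != partitionFor(key, after)) {
                moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        String[] keys = new String[10];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = "device-" + i; // hypothetical device ids used as message keys
        }
        // Growing the topic from 2 to 3 partitions remaps some of the keys.
        System.out.println(movedKeys(keys, 2, 3) + " of " + keys.length + " keys change partition");
    }
}
```

With a fixed partition count, a given key always maps to the same partition, which is what keeps one device on one consumer. Once the count changes, some devices start landing on a different partition, which is one reason to create the full number of partitions up front.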
I'm creating a new service which will be a consumer of Kafka topic. It's Spring app so I'm using spring-kafka.
The topic has 20 partitions. In the beginning, there are two instances in Kubernetes. In the future, depending on load, we want to scale and run additional instances. What should be the appropriate value of kafka.consumer.concurrency in my case? I bet that 10, but am I right?
When there are only two service instances, each one runs 10 threads and each thread reads from one partition. But what if I would like to scale the service? What will happen if I run two additional instances? As far as I know, when a new consumer joins a consumer group, the set of consumers attempts to "rebalance" the load to assign partitions to each consumer.
Does it mean that the two existing instances will reduce their thread count to 5 and listen on only 5 partitions (so that each instance handles 5 partitions)?
Is my understanding correct?
If not, what should be the appropriate value in my case?
Documentation says:
if you have more partitions than you have threads, some threads will receive data from multiple partitions
Just to make sure: if I set concurrency to e.g. 5, each thread will read from two partitions. Will it affect service performance?
When a new consumer is added to the same group, Kafka will perform a rebalance; if there are more consumers than partitions, there is no guarantee that each instance will get 5 partitions. Kafka just sees 40 consumers, and the 20 partitions will be distributed among them. However, it probably depends on the configured assignor; the default RangeAssignor seems to do it that way.
However, when you exceed the number of partitions, the containers will have idle threads (assigned no partitions).
Generally, the best practice is to over-provision the number of partitions and let each consumer handle multiple partitions; that way, when you scale out, you won't end up with idle consumers.
If not, what should be the appropriate value in my case?
It depends entirely on your application.
Bottom line: if you start with 2x10 consumers, and you expect you might end up requiring 10x10, you should start out with 100 partitions.
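The arithmetic behind this advice can be sketched as follows. The class and method names are hypothetical, and the per-consumer figure assumes the assignor spreads partitions as evenly as possible across all consumer threads in the group, which depends on the configured assignor and the number of subscribed topics.

```java
class ConcurrencySketch {
    // Total consumer threads in the group: instances x concurrency.
    static int totalConsumers(int instances, int concurrency) {
        return instances * concurrency;
    }

    // Consumer threads left without any partition.
    static int idleConsumers(int partitions, int instances, int concurrency) {
        int total = totalConsumers(instances, concurrency);
        return total - Math.min(partitions, total);
    }

    // Largest number of partitions one thread handles, assuming an even spread.
    static int maxPartitionsPerConsumer(int partitions, int instances, int concurrency) {
        int total = totalConsumers(instances, concurrency);
        return (partitions + total - 1) / total; // ceiling division
    }

    public static void main(String[] args) {
        // 20 partitions, 2 instances x 10 threads: every thread is busy.
        System.out.println("2x10, 20 partitions, idle: " + idleConsumers(20, 2, 10));
        // 20 partitions, 4 instances x 10 threads: half of the threads are idle.
        System.out.println("4x10, 20 partitions, idle: " + idleConsumers(20, 4, 10));
        // 100 partitions, 2 instances x 10 threads: room to scale out to 10x10.
        System.out.println("2x10, 100 partitions, max per thread: " + maxPartitionsPerConsumer(100, 2, 10));
    }
}
```

With 100 partitions, the same two instances start at about 5 partitions per thread and can scale out to 10 instances before any thread goes idle.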
I'm confused about the degree to which partition assignment is a client-side concern (partition.assignment.strategy) and what part is handled by Kafka.
For example, say I have one kafka topic with 100 partitions.
If I make 1 app that runs 5 consumer threads with a partition.assignment.strategy of RangeAssignor, then I should get 5 consumers each consuming 20 partitions.
Now if I scale this app by deploying it 4 times, using the same consumer group, will Kafka first allot 25 partitions to each of these apps on its side, with those 25 partitions only then being further subdivided by the app using the partition assignment strategy?
That would result neatly in 4 apps with 5 consumers each, each consumer consuming 5 partitions.
The behavior of the default Assignors is well documented in the Javadocs.
RangeAssignor is the default assignor; see its Javadoc for examples of the assignments it generates: http://kafka.apache.org/21/javadoc/org/apache/kafka/clients/consumer/RangeAssignor.html
If you have 20 consumers using RangeAssignor that are consuming from a topic with 100 partitions, each consumer will be assigned 5 partitions.
Because RangeAssignor assigns partitions topic by topic, it can create really unbalanced assignments if you have topics with very few partitions. In that case, RoundRobinAssignor works better.
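The difference between the two strategies can be illustrated with a simplified simulation. This is a sketch of the idea, not Kafka's implementation: it assumes every consumer subscribes to every topic and that the consumer list and topic map are already in sorted order. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AssignorSketch {
    static Map<String, List<String>> emptyAssignment(List<String> consumers) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (String c : consumers) {
            out.put(c, new ArrayList<>());
        }
        return out;
    }

    // Range-style: each topic is split into contiguous ranges, one per consumer;
    // the first consumers get one extra partition when the split is uneven.
    static Map<String, List<String>> range(List<String> consumers, Map<String, Integer> topics) {
        Map<String, List<String>> out = emptyAssignment(consumers);
        for (Map.Entry<String, Integer> t : topics.entrySet()) {
            int perConsumer = t.getValue() / consumers.size();
            int extra = t.getValue() % consumers.size();
            int next = 0;
            for (int i = 0; i < consumers.size(); i++) {
                int count = perConsumer + (i < extra ? 1 : 0);
                for (int k = 0; k < count; k++) {
                    out.get(consumers.get(i)).add(t.getKey() + "-" + next++);
                }
            }
        }
        return out;
    }

    // Round-robin-style: all partitions of all topics are dealt out one at a time.
    static Map<String, List<String>> roundRobin(List<String> consumers, Map<String, Integer> topics) {
        Map<String, List<String>> out = emptyAssignment(consumers);
        int i = 0;
        for (Map.Entry<String, Integer> t : topics.entrySet()) {
            for (int p = 0; p < t.getValue(); p++) {
                out.get(consumers.get(i++ % consumers.size())).add(t.getKey() + "-" + p);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> consumers = Arrays.asList("C0", "C1");
        Map<String, Integer> topics = new LinkedHashMap<>();
        topics.put("t0", 3); // topic name -> partition count
        topics.put("t1", 3);
        System.out.println("range:      " + range(consumers, topics));
        System.out.println("roundRobin: " + roundRobin(consumers, topics));
    }
}
```

With two consumers and two topics of 3 partitions each, the range-style split gives C0 four partitions and C1 two, because the odd partition of each topic goes to the first consumer every time. The round-robin-style split gives each consumer three. For a single topic with 100 partitions and 20 consumers, the range-style split gives each consumer 5 partitions, regardless of how many application instances those consumers are spread over.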
As part of group management, the consumer will keep track of the list of consumers that belong to a particular group and will trigger a rebalance operation if any one of the following events occurs:
1. The number of partitions changes for any of the subscribed topics.
2. A subscribed topic is created or deleted.
3. An existing member of the consumer group is shut down or fails.
4. A new member is added to the consumer group.
Most likely point no. 4 is your case, and the strategy used will be the same (partition.assignment.strategy). Note that this is not applicable if you have explicitly specified the partitions to be consumed by your consumer.