Kafka topic is not load balancing across Spring Boot consumer applications - apache-kafka

I am new to Kafka and would like some help.
I have a topic XXXX and several applications consuming it, all in the same consumer group:
spring.cloud.stream.bindings.aaa_bbb.destination=XXXX
spring.cloud.stream.bindings.aaa_bbb.group=XXXX_group
Topic XXXX has only one partition.
When I send 1000 messages to topic XXXX, only one application consumes all of them.
When I add a second partition to topic XXXX, the messages are divided between 2 applications, and the remaining applications still receive nothing.
I repeat the process and add another partition to topic XXXX.
Now the topic has 3 partitions and the messages are divided among 3 applications.
It looks like each consumer needs its own partition,
which doesn't make much sense to me, or I don't understand it.
Is there a way to make this load balancing work without having to create a partition for each consumer?
Can someone explain how this relationship works?

That is a fundamental of Kafka: within a consumer group, each partition is consumed by only one consumer at a time. A topic partition is a simple log; it is not like a queue in JMS or RabbitMQ.
Kafka maintains only a single committed offset for each group/topic/partition.
The only way to add concurrency within a group is to increase the number of partitions.
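A minimal sketch of the rule above, assuming four application instances in one group (the question does not say how many there are) and a simple deal-in-turn assignment rather than Kafka's actual assignors. The class and method names are hypothetical and it uses no Kafka libraries; it only shows why consumers beyond the partition count stay idle.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of how one consumer group shares a topic's partitions.
// This is not Kafka's real assignor; all names are hypothetical.
class GroupAssignmentSketch {

    // Returns, for each consumer index, the partition numbers it owns.
    // Every partition is given to exactly one consumer, dealt out in turn.
    static List<List<Integer>> assign(int partitions, int consumers) {
        List<List<Integer>> owned = new ArrayList<>();
        for (int c = 0; c < consumers; c++) {
            owned.add(new ArrayList<>());
        }
        for (int p = 0; p < partitions; p++) {
            owned.get(p % consumers).add(p);
        }
        return owned;
    }

    // Counts consumers that received no partition and therefore stay idle.
    static long idleConsumers(int partitions, int consumers) {
        return assign(partitions, consumers).stream().filter(List::isEmpty).count();
    }

    public static void main(String[] args) {
        int consumers = 4; // assumed number of application instances in XXXX_group
        for (int partitions = 1; partitions <= 4; partitions++) {
            System.out.println(partitions + " partition(s): " + assign(partitions, consumers)
                    + ", idle consumers: " + idleConsumers(partitions, consumers));
        }
    }
}
```

With one partition, three of the four consumers own nothing, which matches the behaviour described in the question.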

To increase throughput you need more than one partition. When events are written to the log, the key of the event determines which partition the message is delivered to.
Kafka only guarantees ordering among messages with the same key (that is, within a partition), not over the entire topic.
I normally recommend having more than one partition even if you have a single node, as this allows the cluster to be scaled in the future for improved performance.
You can increase the number of partitions after the topic has been created, but you cannot decrease it, and increasing it changes which partition a given key maps to, so existing per-key ordering is affected.
In your case I'd start with 3 per node, up to a maximum of 9 if you had 3 nodes in the cluster. Please test this yourself.
Within a consumer group there's a limit of 1 consumer per partition, which is the behaviour you're seeing.

Related

Best ways to design a kafka consumer

I need help finding the best design for creating Kafka consumers.
We will have multiple topics, which can be grouped, for example:
10 topics used to send out emails (10 was chosen because we expect more client traffic and want to dedicate one topic per client, so that one client's load does not delay the others)
10 topics to process business logic; the reasoning for the count of 10 is the same as above.
Given this usage, what is the best way to design the Kafka consumers? A consumer dedicated to each topic? Or is there a way to scale consumers up dynamically by passing in the topic each one should subscribe to? We will certainly deploy this in containers, but I would like suggestions on how to get started on the consumer side with dynamic scalability and common code. And what is the best technology for implementing this type of Kafka consumer (dotnet/java/python)?
Also, please say whether partitions make sense in this kind of design, so that we can leverage consumer groups.
Consumers belonging to the same consumer group are each assigned partitions of a topic.
In Kafka, a topic can have multiple partitions. The consumers consume the messages of a particular topic from their assigned partition(s). The messages in a partition are ordered by sequential offsets.
If topic-wide record order is not important, you generally want to start with a higher number of partitions in a topic. Let's say you start with 100 partitions. Your data will be distributed across the 100 partitions of the topic, assuming null keys or at least 100 unique key values with non-colliding hashes, as record keys determine partitioning. If topic order is important, you're limited to one partition, and therefore one consumer thread; however, this thread can separate consumption from processing by loading records into an alternative data structure (a queue) for processing.
You can now have 10 consumers consuming from 100 partitions. Each consumer will be assigned about 10 partitions and will read from its assigned partitions in turn.
If you want to scale out, you simply increase the number of consumers. If you double the number of consumers to 20, each consumer will process about 5 partitions, so throughput roughly doubles, provided the consumers are the bottleneck.
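For the single-partition case mentioned above, where one consumer thread hands records to a queue so that polling is decoupled from processing, a rough sketch could look like the following. It assumes a plain list of strings as a stand-in for polled records and uses only the JDK; the class name, the stop marker, and the queue capacity are hypothetical. A single worker is used so that the original order is preserved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: one "poll" loop feeds a bounded queue, one worker drains it in order.
class QueueHandOffSketch {
    private static final String STOP = "__stop__"; // hypothetical end-of-stream marker

    // Returns the records in the order the worker processed them.
    static List<String> run(List<String> polledRecords) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(100);
        List<String> processed = new ArrayList<>();

        Thread worker = new Thread(() -> {
            try {
                for (String r = queue.take(); !r.equals(STOP); r = queue.take()) {
                    processed.add(r); // stand-in for slow business logic
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();

        try {
            // Stand-in for the consumer poll loop: it only enqueues records.
            for (String record : polledRecords) {
                queue.put(record);
            }
            queue.put(STOP);
            worker.join(); // after join, the worker's writes are visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during hand-off", e);
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("m1", "m2", "m3")));
    }
}
```

In a real consumer, offsets should only be committed once the worker has finished with the corresponding records; the sketch leaves that out.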

Kafka message partitioning by key

We have a business process/workflow that starts when an initial event message is received and closes when the last message is processed. We have up to 100,000 processes executed each day. My problem is that the messages belonging to a specific process have to be processed in the order in which they were received. If one of the messages fails, that process has to freeze until the problem is fixed, while all other processes have to continue. For this kind of situation I am thinking of using Kafka. The first solution that came to mind was to partition the topic by message key, with the ProcessId as the key. This way I could be sure that all messages of a process go to the same partition and Kafka would guarantee their order. As I am new to Kafka, what I have managed to figure out is that partitions have to be created in advance, and that makes everything too difficult. So my questions are:
1) when i produce message to kafka's topic that does not exist, the topic is created on runtime. Is it possible to have same behavior for topic partitions?
2) there can be more than 100,000 active partitions on the topic, is that a problem?
3) can partition be deleted after all messages from that topic were read?
4) maybe you can suggest other approaches to my problem?
When i produce message to kafka's topic that does not exist, the topic is created on runtime. Is it possible to have same behavior for topic partitions?
You need to specify the number of partitions when creating the topic. New partitions won't be created automatically (unlike topic creation); you have to change the number of partitions using the topic tool.
More Info: https://kafka.apache.org/documentation/#basic_ops_modify_topi
As soon as you increase the number of partitions, producers and consumers will learn about the new partitions through a metadata refresh, and the consumer group will rebalance. Once that has happened, producers and consumers will start producing to and consuming from the new partitions.
there can be more than 100,000 active partitions on the topic, is that a problem?
Yes, having this many partitions will increase overall latency.
See how-choose-number-topics-partitions-kafka-cluster for how to decide on the number of partitions.
can partition be deleted after all messages from that topic were read?
Deleting a partition would lead to data loss, and the remaining data's keys would no longer be distributed correctly, so new messages would not be directed to the same partitions as existing messages with the same key. That's why Kafka does not support decreasing the partition count of a topic.
Also, the Kafka documentation states:
Kafka does not currently support reducing the number of partitions for a topic.
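The key remapping mentioned above can be illustrated with a small sketch. It assumes String.hashCode() as a stand-in for the producer's key hash (Kafka's default partitioner hashes the serialized key with murmur2, which is not reproduced here), and the class name, method names, and "process-N" keys are hypothetical.

```java
// Sketch: how many keys change partition when the partition count changes.
class PartitionRemapSketch {

    // Stand-in for key hashing; floorMod keeps the result non-negative.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Counts keys that land on a different partition after the count changes.
    static int movedKeys(int keyCount, int before, int after) {
        int moved = 0;
        for (int i = 0; i < keyCount; i++) {
            String key = "process-" + i; // hypothetical ProcessId values
            if (partitionFor(key, before) != partitionFor(key, after)) {
                moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        System.out.println("keys moved going from 3 to 4 partitions: "
                + movedKeys(1000, 3, 4) + " of 1000");
    }
}
```

Any key that moves will have its older messages in one partition and its newer messages in another, so ordering for that key is no longer guaranteed across the change.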
I suppose you chose the wrong feature to solve your task.
In general, partitioning is used for load balancing.
Incoming messages are distributed across the given number of partitions according to the partitioning strategy, which is configured on the producer. In short, the default strategy for keyed messages just calculates i = key_hash mod number_of_partitions and puts the message into the i-th partition. More about strategies you could read here
Message ordering is guaranteed only within a partition. For two messages from different partitions, there is no guarantee which reaches the consumer first.
You could probably use groups instead. The group is a consumer option.
Each group consumes all messages from the topic independently.
A group can consist of one consumer, or more if you need them.
You can have many groups and add a new group (in fact, add a new consumer with a new groupId) dynamically.
As you can stop/pause any consumer, you can manually stop all consumers belonging to a specific group. I suppose there is no single command to do that, but I'm not sure. Anyway, if you have a single consumer in each group, you can stop it easily.
If you want to remove a group, you just shut down and drop its consumers. No action on the broker side is needed.
As a drawback, you'll get 100,000 consumers reading a (single) topic. That's a heavy network load at least.

Apache Kafka Scaling Topics using partitions

We started to use Apache Kafka to persist Timeseries data into a Timeseries database. What we started with was to just have a single topic, a producer writing to this topic and a single consumer reading from this topic and dumping the data to the Timeseries database.
We had 3 broker instances and what we noticed in the first try was that the producer was pretty fast in writing messages to the topic. Within a matter of 30 minutes, we had around 1.5 million messages. The consumer was just doing 300 messages per second.
Our next approach was to partition the topic and have more consumer instances (equal to the number of partitions). This definitely improved on the consumer write speed. Now my questions are:
What happens if I set my topic's partition count to 6 but have only 3 broker instances? Which broker instance would be the leader for partitions 1 to 6?
Is there a formula to determine how many partitions I would need? Since this was our test environment, we could play with it and scale it. We might not be able to do the same in our production environment. So how do we determine the number of partitions?
The partitions get distributed amongst your brokers. It's impossible to know which broker will be elected leader of a given partition -- and it can change over time. Depending on which version of Kafka and which Consumer API you use, your consumer may or may not discover partition leaders on its own. With the SimpleConsumer you have to find partition leaders on your own, and respond to new leader election in your code (instead of having it handled by the API automatically).
As to the number of partitions -- there's no real "formula" other than this: you can have no more parallelism than you have partitions. If you have 4 partitions and 5 consumers, one of the consumers will sit idle. I usually use numbers like 12 or 60 or multiples thereof for the number of partitions for large topics. Something that divides easily and cleanly among variable numbers of consumers.
Also, note that you can later on change the number of partitions, with some caveats. See this answer for how and what the caveats are.
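To see why counts like 12 or 60 are convenient, here is a small sketch that lists the consumer group sizes among which a given partition count divides with no remainder. The class and method names are hypothetical; it is plain arithmetic and does not touch Kafka.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: which consumer counts share a partition count perfectly evenly.
class PartitionCountSketch {

    // Returns every group size (1..partitions) that divides the count exactly,
    // so each consumer would own the same number of partitions.
    static List<Integer> evenGroupSizes(int partitions) {
        List<Integer> sizes = new ArrayList<>();
        for (int consumers = 1; consumers <= partitions; consumers++) {
            if (partitions % consumers == 0) {
                sizes.add(consumers);
            }
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println("12 partitions: " + evenGroupSizes(12));
        System.out.println("60 partitions: " + evenGroupSizes(60));
        System.out.println("7 partitions:  " + evenGroupSizes(7));
    }
}
```

A prime count such as 7 only balances perfectly with 1 or 7 consumers, while 12 balances with 1, 2, 3, 4, 6, or 12.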

Kafka distributing messages from a partition among consumers

I have a Kafka topic which currently has 3 partitions. I want my consumers to read from the same partition but each message should go to a different consumer in a round-robin fashion. Is it possible to achieve this?
To do that, you have to use a consumer group. It's provided out of the box with Kafka. You just have to specify the same group.id for your three consumers.
[edit] But each consumer will read from a different Kafka partition. I don't think it is possible to make different consumers of the same group read from the same partition if you're using only the Kafka API.
See more in the documentation : http://kafka.apache.org/documentation.html#intro_consumers
How about this: at the producer, the messages are routed based on some key. It is possible to route message 1 to partition 1, message 2 to partition 2, and message 3 to partition 3. Then you should put three consumers in one group. It is possible to make consumer 1 consume partition 1, consumer 2 consume partition 2, and consumer 3 consume partition 3.
By the way, how to implement it depends on which Kafka client you are using and what the messages are. You should give more details.
What you are describing defeats the purpose of partitions. Partitions are not designed for simple load balancing in Kafka. If you really want that, you have two options.
If you have control over the producer writing to the topic, do a simple mod 3 hash partitioning, so the messages are distributed equally across the 3 partitions. Each of your consumers will then consume from one partition. This effectively means each consumer reads every third message. That solves your problem.
If you cannot control the producer, consume from the topic in the normal way, then write a producer with simple mod 3 hash partitioning that republishes the messages to a new topic, and consume from that topic. From there it works as in the first case.
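As a rough illustration of the mod 3 idea, the sketch below spreads message sequence numbers over partitions by taking the sequence number modulo the partition count. It assumes the producer can attach a running sequence number to each message; the class and method names are hypothetical and no Kafka client is involved.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: message n goes to partition n % numPartitions.
class ModuloPartitionerSketch {

    // Returns, per partition, the sequence numbers that would land in it.
    static List<List<Integer>> distribute(int messageCount, int numPartitions) {
        List<List<Integer>> partitions = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            partitions.add(new ArrayList<>());
        }
        for (int seq = 0; seq < messageCount; seq++) {
            partitions.get(seq % numPartitions).add(seq);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // With 3 partitions and one consumer per partition,
        // each consumer ends up with every third message.
        System.out.println(distribute(9, 3));
    }
}
```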

Kafka how to consume one topic parallel

I have read the Kafka documentation, but I still don't know how to consume one topic in parallel.
Suppose:
I have one topic like "something happened" (don't split this topic), and I have many customers that want to consume it.
So what should I do so that multiple customers can consume it in parallel? Should I use partitioning and customer groups?
I have one idea about this, but I'm not sure whether it is right.
Create many partitions for the same topic and give each customer one partition, so the producer must write the same message to all of these partitions, with every customer in a different customer group. Is that right?
Using partitions is the way to parallelize the consumption of a topic. Let's say you have 10 partitions for your topic; then you can have 10 consumers in the same consumer group reading one partition each. If you have fewer consumers than partitions, they will be responsible for more than one partition each. If you have more consumers than partitions, some consumers will not get any partition assigned and will have nothing to do except being available to replace another consumer that has died.
Each topic in Kafka can be organized into many partitions. Partition allows for parallel consumption increasing throughput.
The producer publishes a message to a topic using the Kafka producer client library, which balances the messages across the available partitions using a Partitioner. In current clients, the producer sends each message directly to the broker that leads the target partition, using cluster metadata it fetches from the brokers. Consumers using Kafka's older high-level consumer library (which implicitly handles broker leader changes, manages offset info in ZooKeeper, figures out partition owner info, etc.) consume messages from partitions in streams; each stream may be mapped to a few partitions depending on how the consumer chooses to create the message streams.
For example, if there are 10 partitions for a topic and 3 consumer instances (C1,C2,C3 started in that order) all belonging to the same Consumer Group, we can have different consumption models that allow read parallelism as below
Each consumer uses a single stream.
In this model, when C1 starts all 10 partitions of the topic are mapped to the same stream and C1 starts consuming from that stream. When C2 starts, Kafka rebalances the partitions between the two streams. So, each stream will be assigned to 5 partitions(depending on the rebalance algorithm it might also be 4 vs 6) and each consumer consumes from its stream. Similarly, when C3 starts, the partitions are again rebalanced between the 3 streams. Note that in this model, when consuming from a stream assigned to more than one partition, the order of messages will be jumbled between partitions.
Each consumer uses more than one stream
(say C1 uses 3, C2 uses 3 and C3 uses 4). In this model, when C1 starts, all the 10 partitions are assigned to the 3 streams and C1 can consume from the 3 streams concurrently using multiple threads. When C2 starts, the partitions are rebalanced between the 6 streams and similarly when C3 starts, the partitions are rebalanced between the 10 streams. Each consumer can consume concurrently from multiple streams. Note that the number of streams and partitions here are equal. In case the number of streams exceed the partitions, some streams will not get any messages as they will not be assigned any partitions.
Just to add to the list of answers: Confluent has a library that does this for you, similar to Rapids. The project is here:
https://github.com/confluentinc/parallel-consumer
It's open source.
Note: I'm the author.
@Lundahl covered the theory; I'll give you a practical sample.
Create a topic with a meaningful name, e.g. news_events, with the parallelism (partitions) your consumers will need. You can calculate that from the time to process one message, the number of messages you will have, and the time within which you want all the messages processed.
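One way to read the calculation described above, as a rough sketch: total processing time divided by the time window, rounded up. It assumes each partition is handled by one consumer working sequentially and ignores network and broker overhead. The class name, method name, and example numbers are hypothetical.

```java
// Sketch: estimate the partition count from workload and deadline.
class PartitionEstimateSketch {

    // messages:         how many messages must be processed
    // millisPerMessage: time one consumer needs for one message
    // deadlineMillis:   window in which all messages should be done
    static long partitionsNeeded(long messages, long millisPerMessage, long deadlineMillis) {
        long totalWorkMillis = messages * millisPerMessage;
        // Ceiling division, with a floor of one partition.
        return Math.max(1, (totalWorkMillis + deadlineMillis - 1) / deadlineMillis);
    }

    public static void main(String[] args) {
        // Example: 1,000,000 messages, 5 ms each, to be finished within 10 minutes.
        System.out.println(partitionsNeeded(1_000_000, 5, 600_000));
    }
}
```

Treat the result as a lower bound and add headroom, since the partition count also caps how far the consumers can be scaled out later.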
Let's create consumers for that topic. You want to read the news, and so does your sister or brother, each at your own pace, so everyone needs their own consumer group id. This way Kafka will know that threads a, b, c belong to one consumer group and threads d, e, f to the second. Every consumer group receives the same messages, processes them at its own pace, and doesn't affect the others.
A message goes to one partition or another, never to two. For messages without a key, Kafka by default spreads them across the partitions. Remember, all consumer groups can connect to and read data from all the same partitions.
I would suggest using rapids-kafka-client, a library which does the parallelism work for you: choose a number of threads equal to the number of partitions you have, choose a consumer group, and see the magic happen.
public static void main(String[] args){
  ConsumerConfig.<String, String>builder()
    .prop(KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName())
    .prop(VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName())
    .prop(GROUP_ID_CONFIG, "news-app")
    .topics("news_events")
    .consumers(7)
    .callback((ctx, record) -> {
      System.out.printf("status=consumed, value=%s%n", record.value());
    })
    .build()
    .consume()
    .waitFor();
}
You can read more about consumer groups, topics and partitions here
I assume what you want is parallel consumption between customers in a publish/subscribe fashion.
Besides that, you can also have parallel consumption within a single customer in order to scale the consumer application.
Parallel consumption between customers
If by "customers" you mean different organizations which are interested in consuming topic's messages independently, all you need is consumer groups.
This is a simple publish/subscribe pattern where each customer runs its own application and reads all of the topic's messages without interfering with the others.
Each customer application can be seen as a consumer group, made up of one or more Kafka consumers (whether running on a single node or spread across a cluster), all of them sharing the consumer group's identifier.
You achieve this goal regardless of partitions. If the topic is partitioned, you don't need to worry about writing the same message to all partitions. Remember that in Kafka messages are durable: a message read by a Kafka consumer is not deleted and remains available to Kafka consumers from a different consumer group (until it expires). Furthermore, partitions are not meant to work like this; they help scale the storage of data (at a certain point all of a topic's data wouldn't fit on just one node) and scale consumer applications, as you can see below.
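The point that reading does not delete messages, and that each consumer group tracks its own position, can be modelled in a few lines. This is a toy in-memory model of a single partition, not the Kafka API; the class name, method names, and group ids are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model: one append-only log plus one committed offset per consumer group.
class ConsumerGroupOffsetsSketch {
    private final List<String> log = new ArrayList<>();
    private final Map<String, Integer> committed = new HashMap<>(); // group id -> next offset

    void append(String message) {
        log.add(message);
    }

    // Returns everything the group has not read yet and advances only
    // that group's offset; the log itself is never modified by reads.
    List<String> poll(String groupId) {
        int from = committed.getOrDefault(groupId, 0);
        List<String> batch = new ArrayList<>(log.subList(from, log.size()));
        committed.put(groupId, log.size());
        return batch;
    }

    public static void main(String[] args) {
        ConsumerGroupOffsetsSketch topic = new ConsumerGroupOffsetsSketch();
        topic.append("something happened #1");
        topic.append("something happened #2");
        System.out.println("customer-a: " + topic.poll("customer-a"));
        System.out.println("customer-b: " + topic.poll("customer-b"));
    }
}
```

Both hypothetical customers receive both messages, because each group id has its own offset.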
Parallel consumption within single customer
You can further parallelize, or better, scale the consumption of messages within a consumer group by adding more Kafka consumers to it.
Imagine the topic is huge, producers write into it at a high rate, and the consumer group has only one consumer: this poor consumer may struggle to keep up with the message arrival rate, especially if message processing is time-consuming too.
That's the case where you need partitions and more consumers in your consumer group, so that Kafka will assign partitions to consumers to distribute the reading load among them.
How partition assignment works has been already explained in other answers here, but basically for a given consumer group:
each of the topic's partitions is assigned exclusively to one consumer,
a consumer might be assigned more than one partition,
if there are more consumers than partitions, some of them will stay idle as they won't be assigned any partition to consume from.
Remember that message ordering in Kafka is guaranteed only at the partition level, so if you have many partitions and ordering matters, you need to choose the right message key to partition data according to your requirements.
For example, if you want messages to be ordered by device, device_id would be your key, which guarantees that messages from the same device are written to the same partition.
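A minimal sketch of the device_id example, assuming events encoded as "deviceId:payload" strings and String.hashCode() as a stand-in for the producer's key hash (the real default partitioner uses murmur2 on the serialized key). All names are hypothetical and the partitions are plain in-memory lists.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: routing by device id keeps each device's events in one partition, in order.
class KeyedOrderingSketch {

    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Appends each "deviceId:payload" event to the partition chosen by its device id.
    static List<List<String>> route(List<String> events, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            partitions.add(new ArrayList<>());
        }
        for (String event : events) {
            String deviceId = event.substring(0, event.indexOf(':'));
            partitions.get(partitionFor(deviceId, numPartitions)).add(event);
        }
        return partitions;
    }

    // Returns one device's events in the order its partition holds them.
    static List<String> seenFor(String deviceId, List<String> events, int numPartitions) {
        List<String> seen = new ArrayList<>();
        int p = partitionFor(deviceId, numPartitions);
        for (String event : route(events, numPartitions).get(p)) {
            if (event.startsWith(deviceId + ":")) {
                seen.add(event);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> events = List.of("d1:on", "d2:on", "d1:dim", "d2:off", "d1:off");
        System.out.println(route(events, 3));
        System.out.println(seenFor("d1", events, 3));
    }
}
```

Events from different devices may interleave differently across partitions, but each device's own sequence is preserved.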