Kafka Topic Rebalance - apache-kafka

I have a problem consuming Kafka messages. Some topics cannot be consumed because of a partition problem, so I want to delete all topics and recreate them with the same names but fewer partitions. Is that possible, or will the partition count in Kafka remain the same because the topic name is the same? I manage the partition count in the YAML file of a Spring Boot Kafka application.
I have tried restarting Kafka, but the issue persists. I also made the partition count in Kafka match the one in the application YAML, which differed before.

Deleting a topic is an asynchronous operation, because the brokers still have to remove its log segments. Re-creating a topic with the same name immediately afterwards may cause problems.
Besides, if you want to preserve existing data, you instead need to mirror the topic within the cluster to a topic with a different name but a smaller partition count.
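One way to avoid re-creating too early is to poll the cluster until the topic name is no longer listed. The following is only a minimal sketch: wait_until_deleted is a hypothetical helper, and it assumes you pass in some callable that returns the current topic names (with the kafka-python package, the list_topics method of KafkaAdminClient could play that role). Absence from the topic list is used here as a proxy for "deletion finished", which is an assumption rather than a guarantee.

```python
import time
from typing import Callable, Iterable


def wait_until_deleted(
    list_topics: Callable[[], Iterable[str]],
    topic: str,
    timeout_s: float = 60.0,
    poll_s: float = 1.0,
) -> bool:
    """Poll until `topic` is no longer listed; return False on timeout.

    `list_topics` is any callable returning the current topic names.
    It is injected so the helper does not depend on a specific client.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if topic not in set(list_topics()):
            return True  # topic is gone from the listing
        if time.monotonic() >= deadline:
            return False  # still listed when the timeout expired
        time.sleep(poll_s)
```

Only once this returns True would you go on to create the topic again with the smaller partition count.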

Related

Auto increase Kafka partitions based on load on kubernetes

We have installed Kafka on Kubernetes and want the Kafka broker partitions to increase automatically based on the load from the input source. Is there any possible way to increase partitions based on load on Kubernetes?
Brokers don't have partitions that scale; topics do.
You'll have to build your own scripts for this, because there are some caveats:
Adding partitions to a topic doesn't move data and will not reduce producer load. It might even increase the load on the broker cluster, because more data can now be written and read in parallel. The only way to reduce broker load is to add more brokers, not just partitions.
If you add brokers, you'll need to have scripts that move partitions across brokers since that's not automatic either
It'll also remove any ordering guarantee for consumers, if they were relying on it (for example, compacted topics)

Kafka message partitioning by key

We have a business process/workflow that starts when an initial event message is received and closes when the last message is processed. We execute up to 100,000 processes each day. My problem is that the messages belonging to a specific process have to be processed in the same order they were received. If one of the messages fails, that process has to freeze until the problem is fixed, while all other processes continue. For this kind of situation I am thinking of using Kafka. The first solution that came to mind was topic partitioning by message key, where the key of the message would be the ProcessId. This way I could be sure that all messages of a process land in the same partition and Kafka would guarantee their order. As I am new to Kafka, what I have managed to figure out is that partitions have to be created in advance, and that makes everything too difficult. So my questions are:
1) when i produce message to kafka's topic that does not exist, the topic is created on runtime. Is it possible to have same behavior for topic partitions?
2) there can be more than 100,000 active partitions on the topic, is that a problem?
3) can partition be deleted after all messages from that topic were read?
4) maybe you can suggest other approaches to my problem?
When i produce message to kafka's topic that does not exist, the topic is created on runtime. Is it possible to have same behavior for topic partitions?
You need to specify the number of partitions when creating the topic. New partitions won't be created automatically (unlike topics); you have to change the number of partitions using the topic tool.
More Info: https://kafka.apache.org/documentation/#basic_ops_modify_topi
After you increase the number of partitions, producers and consumers pick up the new partitions from updated metadata, and consumer groups rebalance. Once that has happened, producers write to and consumers read from the new partitions.
there can be more than 100,000 active partitions on the topic, is that a problem?
Yes, having this many partitions will increase overall latency.
Go through how-choose-number-topics-partitions-kafka-cluster on how to decide number of partitions.
can partition be deleted after all messages from that topic were read?
Deleting a partition would lead to data loss and also the remaining data's keys would not be distributed correctly so new messages would not get directed to the same partitions as old existing messages with the same key. That's why Kafka does not support decreasing partition count on topic.
Also, Kafka doc states that
Kafka does not currently support reducing the number of partitions for a topic.
I suppose you have chosen the wrong feature to solve your task.
In general, partitioning is used for load balancing.
Incoming messages are distributed over the given number of partitions according to the partitioning strategy, which is configured on the producer. In short, the default strategy for keyed messages just calculates i = key_hash mod number_of_partitions and puts the message into the i-th partition. More about strategies you could read here
Message ordering is guaranteed only within partition. With two messages from different partitions you have no guarantees which come first to the consumer.
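The key-to-partition rule described above can be sketched in a few lines. This is only an illustration: partition_for is a hypothetical helper, and CRC32 is a stand-in hash chosen because it ships with Python, not the hash Kafka's clients actually use, so the partition numbers will not match a real cluster.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash (assumption): CRC32 instead of Kafka's own key hash.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical event stream: (ProcessId, payload), in arrival order.
events = [
    ("process-1", "start"),
    ("process-2", "start"),
    ("process-1", "step"),
    ("process-2", "end"),
    ("process-1", "end"),
]

# Append each event to the log of the partition its key maps to.
partitions = {}
for key, payload in events:
    partitions.setdefault(partition_for(key, 6), []).append((key, payload))

# All events of one process share a partition and keep their arrival order.
p1 = partitions[partition_for("process-1", 6)]
print([payload for key, payload in p1 if key == "process-1"])
```

So per-process ordering does not need one partition per process; it only needs every message of a process to carry the same key.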
You could probably use consumer groups instead. The group is an option on the consumer.
Each group consumes all messages from the topic independently.
A group can consist of one consumer, or more if you need it.
You can use many groups and add a new group dynamically (in fact, you add a new consumer with a new groupId).
Since you can stop or pause any consumer, you can manually stop all consumers belonging to a given group. I suppose there is no single command to do that, but I'm not sure. Anyway, if you have a single consumer in each group, you can stop it easily.
If you want to remove a group, you just shut down and drop its consumers. No action on the broker side is needed.
As a drawback, you'll get 100,000 consumers reading a single topic. That is a heavy network load at the least.

Kafka - is it possible to alter Topic's partition count while keeping the change transparent to Producers and Consumers?

I am investigating on Kafka to assess its suitability for our use case. Can you please help me understand how flexible is Kafka with changing the number of partitions for an existing Topic?
Specifically,
Is it possible to change the number of partitions without tearing down the cluster?
And is it possible to do that without bringing down the topic?
Will adding/removing partitions automatically take care of redistributing messages across the new partitions?
Ideally, I would want the change to be transparent to the producers and consumers. Does Kafka ensure this?
Update:
From my understanding so far, it looks like Kafka's design cannot allow this, because the mapping of consumer groups to partitions would have to be altered. Is that correct?
1.Is it possible to change the number of partitions without tearing down the cluster?
Yes, Kafka supports increasing the number of partitions at runtime, but by design it doesn't support decreasing the number of partitions.
2.And is it possible to do that without bringing down the topic?
Yes, provided you are increasing partitions.
3.Will adding/removing partitions automatically take care of redistributing messages across the new partitions?
As mentioned earlier, removing partitions is not supported.
When you increase the number of partitions, the existing messages remain in the same partitions as before; only new messages are considered for the new partitions (also depending on your partitioner logic). Increasing the partitions of a topic triggers a rebalance, in which consumers and producers are notified with the updated metadata of the topic. Producers start sending messages to the new partitions after receiving the updated metadata, and the consumer rebalance redistributes the partitions among the consumers in each group, which then resume consumption from the last committed offset. All this happens under the hood, so you won't have to make any changes on the client side.
Yes, it is perfectly possible. You just execute the following command against the topic of your choice: bin/kafka-topics.sh --zookeeper zk_host:port --alter --topic <your_topic_name> --partitions <new_partition_count>. Remember, Kafka only allows increasing the number of partitions, because decreasing it would result in data loss.
There's a catch here. Kafka doc says the following:
Be aware that one use case for partitions is to semantically partition
data, and adding partitions doesn't change the partitioning of
existing data so this may disturb consumers if they rely on that
partition. That is if data is partitioned by hash(key) %
number_of_partitions then this partitioning will potentially be
shuffled by adding partitions but Kafka will not attempt to
automatically redistribute data in any way.
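To get a feel for the "shuffled" partitioning the documentation warns about, you can count how many keys map to a different partition once the count grows. This is a rough sketch under stated assumptions: partition_for and moved_keys are hypothetical helpers, the keys are made up, and CRC32 stands in for the real client hash, so the exact numbers say nothing about a real cluster.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash (assumption): CRC32 instead of Kafka's own key hash.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def moved_keys(keys, old_count: int, new_count: int):
    """Return the keys whose partition changes between the two counts."""
    return [
        k for k in keys
        if partition_for(k, old_count) != partition_for(k, new_count)
    ]


keys = [f"process-{i}" for i in range(1000)]  # made-up keys
moved = moved_keys(keys, 6, 8)
print(f"{len(moved)} of {len(keys)} keys map to a different partition")
```

For every moved key, older messages stay in the old partition while new messages go to another one, which is why consumers that rely on per-key ordering can be disturbed.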
Yes, if by bringing down the topic you mean deleting the topic.
Once you've increased the partition count, Kafka would trigger a rebalance, for consumers who are subscribing to that topic, and on subsequent polls, the partitions would get distributed across the consumers. It's transparent to the client code, you don't have to worry about it.
NOTE: As I mentioned before, you can only add partitions, removing is not possible.
One more thing: if your clients use stateful operations such as aggregations (which use a state store), a change in the partition count will kill all the stream threads in the consumer. This is expected, as an increase in partitions may corrupt stateful applications. So be careful when changing the partition count; it may break stateful consumers connected to the topic.
Good read: Why does kafka streams threads die when the source topic partitions changes ? Can anyone point to reading material around this?

Increase the Number of partitions

We are working with Confluent Platform and are still getting to know its internals, but we have implemented some generic use cases. We are now trying to optimize our cluster.
In my use case, I need to increase the number of partitions of a topic. What is the impact of that? Can you please share some details?
Sure, you can increase partitions.
However,
Increasing partitions does not move existing data. If using Confluent Enterprise, you could use confluent-rebalancer, or if not, then kafka-reassign-partitions CLI tool. So, you'll definitely want to rebalance a topic to "optimize" the cluster.
During the retention period of the topic (read: for the existing data), if you previously had a producer sending data to partition N, and now had N+1 partitions, then you lose ordering of those messages that solely existed in partition N. New messages could be spread across new partitions that a new producer calculates with the DefaultPartitioner. If you don't send keys with messages, then this isn't a problem.

Clickhouse kafka table engine with many consumer

I'm planning to do some test with Clickhouse by ingesting my kafka topics into a SummingMergeTree using this method: https://clickhouse.yandex/docs/en/table_engines/kafka/
For my test in a dev environment I'm not worried about the volume, but in production we already consume those topics, and we need many consumers to read messages as fast as they are pushed. My question is: is there a way in ClickHouse to have many Kafka consumers on one table with the Kafka engine?
Thanks,
Romaric
Reading the documentation it seems that the num_consumers parameter in the Kafka engine is exactly what you need:
num_consumers – The number of consumers per table. Default: 1. Specify
more consumers if the throughput of one consumer is insufficient. The
total number of consumers should not exceed the number of partitions
in the topic, since only one consumer can be assigned per partition.
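The reason for that limit can be illustrated with a toy assignment. This is only a sketch: assign is a hypothetical helper that hands out partitions round-robin, which is an assumption for illustration and not how ClickHouse or Kafka actually pick an assignment.

```python
def assign(num_partitions: int, num_consumers: int) -> dict:
    """Toy round-robin assignment: each partition goes to exactly one consumer."""
    assignment = {c: [] for c in range(num_consumers)}
    for p in range(num_partitions):
        assignment[p % num_consumers].append(p)
    return assignment


# 4 partitions, 6 consumers: two consumers end up with nothing to read.
result = assign(4, 6)
idle = [c for c, parts in result.items() if not parts]
print(result)
print("idle consumers:", idle)
```

With more consumers than partitions, the extra consumers simply stay idle, so raising num_consumers beyond the partition count would not add throughput.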