Decide when to create new topic or increase partition count - apache-kafka

Say I have a Kafka topic with 10 partitions. When the data rate increases, I can increase the partition count to speed up my processing logic.
But my doubt is whether increasing the partitions is good, or whether I should split the topic instead (that is, based on my application logic, some data would go to topic 1 and some to topic 2, splitting the data rate across two topics).
Does choosing a new topic rather than increasing partitions, or increasing partitions rather than creating a new topic, have any performance impact on the Kafka cluster?
Which one is the best solution?

It depends!
It is usually recommended to slightly over-partition topics that are likely to increase in throughput so you don't have to add partitions when this happens.
The main reason is that if you're using keyed messages, adding partitions will change the key-partition mappings. So after adding partitions, messages with a given key won't go to the same partition as before. If you need ordering per key, this can be problematic.
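As a quick illustration of that remapping problem, here is a minimal Python sketch. It uses a toy hash in place of the murmur2 hash the default Java producer actually uses, so the specific partition numbers are illustrative only:

```python
# Sketch: how adding partitions changes key-to-partition mapping.
# Kafka's default Java producer computes murmur2(key) % num_partitions;
# a simple character-sum hash stands in for murmur2 here so the
# example is deterministic and reproducible.

def partition_for(key: str, num_partitions: int) -> int:
    h = sum(ord(c) for c in key)
    return h % num_partitions

keys = ["user-1", "user-2", "user-3", "user-42"]

before = {k: partition_for(k, 10) for k in keys}   # topic with 10 partitions
after = {k: partition_for(k, 12) for k in keys}    # after adding 2 partitions

moved = [k for k in keys if before[k] != after[k]]
print("keys that changed partition:", moved)
```

Any key that lands on a different partition after the change breaks per-key ordering guarantees for in-flight and historical data, which is exactly why over-partitioning up front is recommended for keyed topics.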
Adding partitions is usually easier, as consumers and producers won't need updates. You will just be able to add consumers to scale. You also keep all events together and only have to worry about a single topic. Depending on the size of your cluster, with only 10 partitions you probably still have a lot of leeway to add partitions. From Kafka's point of view, 10 partitions is pretty small and you can easily have 50 or even more.
On the other hand, when creating new topics, clients will need to be updated to use them. Nevertheless, that could be a solution if over time you start receiving more types of events and want to reorganize them across several topics.

Related

Kafka - Standalone server - How to decide partitions?

I have a standalone Kafka setup with a single disk, and I'm planning to stream over a million records. How do I decide the number of partitions for my topic for better throughput? Does it have to be 1 partition?
Is it recommended to have multiple partitions for a topic on standalone Kafka server?
Yes, you need multiple partitions even for a single-node Kafka cluster. That is because you can only have as many consumers as you have partitions. If you have a single partition then you can only have a single consumer, and that will limit throughput, especially if you want to stream millions of records (although the time period for those is not specified).
The only real downside to this is that messages are only consumed in order within the same partition. Other than that, you should go with multiple partitions. You will need to estimate the throughput of a single consumer in order to calculate the number of partitions, then maybe add one or two on top of that.
You can still add partitions later but it's probably better to try to start with the right amount first and change later as you learn more or as your volume increases/decreases.
There are two main factors to consider:
Number of producers and consumers
Each partition can be consumed by at most one consumer within a group, so the number of partitions caps your consumer parallelism: you need at least as many partitions as consumers you want running in parallel. Producers are not constrained this way; a single producer can write to all partitions.
Throughput
You must determine the throughput to calculate how many consumers should be in the consumer group. The combined reading capacity of the consumers should be at least as high as the combined writing capacity of the producers.
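That sizing rule can be sketched as a small calculation. The per-partition and per-consumer throughput figures below are invented for illustration; you would measure your own:

```python
import math

# Rough partition-count sizing: you need enough partitions that
# consumers, in aggregate, can keep up with the producers. Both the
# write side and the read side impose a minimum; take the larger.

def min_partitions(target_mb_s: float,
                   per_partition_produce_mb_s: float,
                   per_consumer_mb_s: float) -> int:
    writes = math.ceil(target_mb_s / per_partition_produce_mb_s)
    # Each partition is read by at most one consumer in a group,
    # so slow consumers force more partitions.
    reads = math.ceil(target_mb_s / per_consumer_mb_s)
    return max(writes, reads)

# Example: 100 MB/s incoming, each partition sustains 10 MB/s of
# writes, each (slow) consumer processes 5 MB/s -> reads dominate.
print(min_partitions(100, 10, 5))  # -> 20
```

In practice you would round this up further to leave headroom for growth, since (as discussed above) adding partitions later disturbs keyed data.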

What happens when a Kafka partition is full?

Does it put data into another partition that was configured before launching Kafka?
What is the benefit and major reason to have more partitions than the usual 3?
How does it affect read and write performance?
I am having a bit of trouble understanding your question, but I think you may be misinterpreting what a partition actually is. A partition only has a size limit if you specify one in the config; otherwise you can simply think of a partition as a separate stream of data with its own offsets. In practice, the limit on the size of a partition (or topic, for that matter) is simply what the disk capacity allows. Often the data will stay there and get deleted once a specified retention period or maximum size limit has been reached.
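For a back-of-the-envelope feel for size-based retention, here is a small calculation. The 50 GiB `retention.bytes` cap and the 2 MB/s write rate are both invented for illustration:

```python
# Back-of-the-envelope: how long data survives in one partition when
# size-based retention (retention.bytes) kicks in before time-based
# retention. All numbers are assumptions for illustration.

retention_bytes = 50 * 1024**3    # 50 GiB cap on this partition
write_rate_mb_s = 2               # MB/s flowing into this partition

seconds_retained = retention_bytes / (write_rate_mb_s * 1024**2)
hours_retained = seconds_retained / 3600
print(round(hours_retained, 1))   # roughly 7.1 hours of history
```

Whichever limit is hit first (time-based or size-based) determines when old log segments are deleted, so a burst of writes can shorten your effective retention window.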
The list of configurations is pretty extensive but you can view it here:
https://kafka.apache.org/documentation/#configuration
As for your other question about the reasoning for having more partitions, it simply comes down to scaling. If you only have a small amount of data then a small number of partitions will be enough. With Kafka there is no benefit to having multiple consumers per partition; i.e., if you have three partitions, then three consumers is the optimal number, as each consumer will be assigned a single partition. If you have more consumers, some will sit idle.
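A small sketch of that consumer/partition math. The real assignment is done by the group coordinator using a configured strategy such as range or round-robin; this just mimics a round-robin hand-out to show the idle-consumer effect:

```python
# Sketch: group assignment when consumers outnumber partitions.
# Mimics a round-robin strategy; each partition goes to exactly one
# consumer in the group, so extra consumers get nothing.

def assign(partitions: list, consumers: list) -> dict:
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

result = assign(list(range(3)), ["c1", "c2", "c3", "c4"])
idle = [c for c, parts in result.items() if not parts]
print(result)  # c1, c2, c3 each own one partition
print(idle)    # c4 sits idle with no partition to read
```

Running more consumers than partitions wastes resources but is otherwise harmless; the idle ones act as hot standbys that take over if an active consumer dies.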
So what if we have more data and need to read it faster to avoid lag? Well we can add more partitions and by doing so we can also scale up the number of consumers.
Side note: you can have a single consumer that reads from multiple partitions, but this can become an issue because reassigning partitions takes time, and if you have too few consumers compared to the number of partitions, they simply won't be able to deal with high loads during the reassignment window when no processing is happening.

Kafka - Best practices in case of slow processing consumer. How to achieve more parallelism?

I'm aware that the maximum number of active consumers in a consumer group is the number of partitions of a topic.
What's the best practice in case of slow processing consumers? How to achieve more parallelism?
An example: A topic with 6 partitions and thousands of messages per second produced from Producers. So I have at most 6 consumers in the group. Consider that processing those messages is complex and the consumers are much slower than the producers. The result is that the consumers are always behind the last offset and the lag is increasing.
In a traditional MQ system, we simply add more and more consumers to stay up to date.
How to achieve this with Kafka, since the total of the consumers in a group is at most the number of partitions? Should I:
Configure the topic to have more partitions allowing more consumers per group?
Route the message from the consumer to a traditional MQ Queue (but lose the ordering)?
What's the best practice for this situation?
In Kafka, partitions are the unit of parallelism.
Without knowing your exact use case and requirements it's hard to come up with precise recommendations, but there are a few options.
First, you should really consider having more partitions. 6 partitions is relatively small; you could easily have 60, 120 or even more partitions (and the corresponding number of consumers). Suddenly the amount of work each consumer has to do is significantly reduced.
Also, if your requirements allow, you can consume at a fast rate and spread the processing of records across many workers. In solutions like this it's harder to maintain ordering, but if you don't need ordering then it's worth considering.
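One way to sketch that worker-pool pattern is below. This is a pure-Python simulation with no broker involved; with a real Kafka consumer you would also pause partitions or delay offset commits until workers finish, which is omitted here. Routing by partition keeps per-partition ordering even though processing is parallel:

```python
import queue
import threading

# Sketch: one "poll loop" feeding a pool of workers, with one queue
# per worker and records routed by partition so that records from
# the same partition are always processed in order by the same worker.

NUM_WORKERS = 3
queues = [queue.Queue() for _ in range(NUM_WORKERS)]
processed = []
lock = threading.Lock()

def worker(q: queue.Queue) -> None:
    while True:
        record = q.get()
        if record is None:            # shutdown signal
            break
        with lock:
            processed.append(record)  # stand-in for slow processing

threads = [threading.Thread(target=worker, args=(q,)) for q in queues]
for t in threads:
    t.start()

# Simulated poll results: (partition, value) pairs from 6 partitions.
records = [(p, f"msg-{p}-{i}") for i in range(5) for p in range(6)]
for partition, value in records:
    queues[partition % NUM_WORKERS].put((partition, value))

for q in queues:
    q.put(None)                       # tell each worker to stop
for t in threads:
    t.join()

print(len(processed))                 # all 30 records processed
```

The poll loop stays fast (it only enqueues), while the expensive processing happens in parallel across workers, which is often enough to close a consumer lag without adding partitions.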
I'm not sure how routing messages through an MQ queue would really help in this scenario. If you are still reading slower than you are writing, the amount of data in the queue will grow until you have no disk space left.
Kafka is better designed to serve as buffer between your producers and consumers so just ensure you have retention limits on your topics that allow some flexibility on the consumer side without losing data.

kafka consumer rebalancing in case of manual/assigned partitioning

I have some doubts regarding rebalancing. Right now, I am manually assigning partitions to consumers. As per the docs, there will be no rebalancing in case a consumer leaves/crashes in a consumer group.
Let's say there are 3 partitions and 3 consumers in the same group, and each partition is manually assigned to a consumer. After some time, the 3rd consumer goes down. Since there is no rebalancing, what measures can I take to ensure minimum downtime?
Do I need to change the config of either of the first two consumers to start consuming from the 3rd partition, or something else?
Well, I don't know why you would assign partitions to consumers manually.
I think you need to implement a ConsumerRebalanceListener: https://kafka.apache.org/0100/javadoc/org/apache/kafka/clients/consumer/ConsumerRebalanceListener.html
My advice: just let Kafka decide which consumer listens to which partition, and you won't have to worry about this.
Although there might be context that would make the approach valid, as written, I question your approach a little bit.
The best way to ensure minimum downtime is to let the Kafka brokers and ZooKeeper do what they're good at: managing your workload (partitions) among your consumers, which includes reassigning partitions when a consumer goes down.
Your best path is likely to use the onPartitionsRevoked and onPartitionsAssigned callbacks to handle whatever logic you need in order to take over a new partition (see JR's link for more detailed information on these events).
I'll describe a recent use-case I've had, in the hope it is relevant to your use-case.
I recently had 5 consumers that required an in-memory cache of 50 million objects. Without partition-aware caching, each consumer had its own full cache, resulting in 250 million objects.
To reduce that number to the original 50 million, we could use the onPartitionsRevoked event to clear the cache and onPartitionsAssigned to repopulate it with the entries relevant to the assigned partitions.
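A sketch of that cache idea is below. It mirrors the shape of Kafka's ConsumerRebalanceListener callbacks; the class name CachingRebalanceListener and the load_for_partition loader are hypothetical stand-ins, and wiring this into an actual consumer subscription is omitted:

```python
# Sketch: keep an in-memory cache scoped to the partitions this
# consumer currently owns, using rebalance callbacks to drop and
# load entries. load_for_partition is a hypothetical loader you
# would implement against your own data source.

class CachingRebalanceListener:
    def __init__(self, load_for_partition):
        self.cache = {}                      # partition -> cached objects
        self.load_for_partition = load_for_partition

    def on_partitions_revoked(self, revoked):
        # Drop cached state for partitions this consumer no longer owns
        for partition in revoked:
            self.cache.pop(partition, None)

    def on_partitions_assigned(self, assigned):
        # Populate the cache only for partitions now owned here
        for partition in assigned:
            self.cache[partition] = self.load_for_partition(partition)

listener = CachingRebalanceListener(lambda p: {f"obj-{p}"})
listener.on_partitions_assigned([0, 1, 2])
listener.on_partitions_revoked([2])
print(sorted(listener.cache))  # only partitions 0 and 1 remain cached
```

Each consumer then caches only its own slice of the keyspace, so the total cache size across the group stays at one copy of the data instead of one copy per consumer.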
Short of using those two handlers, if you really want to manually assign your partitions, you're going to have to do all of the orchestration yourself:
Something to monitor if one of the other consumers is down
Something to pick up the dead consumer's partition and process it
Orchestrate communication between the consumers so that when the dead consumer comes back up, it can start working again.
As you can probably tell from the list, you're in for a real world of hurt if you force yourself down that path, and you probably won't do a better job than the Kafka brokers. There's an entire business whose entire focus is developing and maintaining Kafka so you don't have to handle all of that complexity.

Kafka - is it possible to alter Topic's partition count while keeping the change transparent to Producers and Consumers?

I am investigating on Kafka to assess its suitability for our use case. Can you please help me understand how flexible is Kafka with changing the number of partitions for an existing Topic?
Specifically,
Is it possible to change the number of partitions without tearing down the cluster?
And is it possible to do that without bringing down the topic?
Will adding/removing partitions automatically take care of redistributing messages across the new partitions?
Ideally, I would want the change to be transparent to the producers and consumers. Does Kafka ensure this?
Update:
From my understanding so far, it looks like Kafka's design cannot allow this, because the mapping of consumer groups to partitions would have to be altered. Is that correct?
1. Is it possible to change the number of partitions without tearing down the cluster?
Yes, Kafka supports increasing the number of partitions at runtime, but doesn't support decreasing the number of partitions due to its design.
2. And is it possible to do that without bringing down the topic?
Yes, provided you are increasing partitions.
3. Will adding/removing partitions automatically take care of redistributing messages across the new partitions?
As mentioned earlier, removing partitions is not supported.
When you increase the number of partitions, the existing messages will remain in the same partitions as before; only new messages will be considered for the new partitions (also depending on your partitioner logic). Increasing the partitions for a topic will trigger a cluster rebalance, in which the consumers and producers get notified with the updated metadata of the topic. Producers will start sending messages to the new partitions after receiving the updated metadata, and the consumer rebalancer will redistribute the partitions among the consumer groups and resume consumption from the last committed offset. All this happens under the hood, so you won't have to make any changes on the client side.
Yes, it is perfectly possible. You just execute the following command against the topic of your choice: bin/kafka-topics.sh --zookeeper zk_host:port --alter --topic <your_topic_name> --partitions <new_partition_count>. Remember, Kafka only allows increasing the number of partitions, because decreasing it would cause data loss.
There's a catch here. The Kafka docs say the following:
Be aware that one use case for partitions is to semantically partition data, and adding partitions doesn't change the partitioning of existing data, so this may disturb consumers if they rely on that partition. That is, if data is partitioned by hash(key) % number_of_partitions then this partitioning will potentially be shuffled by adding partitions, but Kafka will not attempt to automatically redistribute data in any way.
Yes; the topic stays available throughout. The only sense in which you could "bring down" a topic would be deleting it.
Once you've increased the partition count, Kafka triggers a rebalance for the consumers subscribed to that topic, and on subsequent polls the partitions get distributed across the consumers. It's transparent to the client code; you don't have to worry about it.
NOTE: As I mentioned before, you can only add partitions, removing is not possible.
One more thing: if your clients use stateful operations like aggregations (making use of a state store), a change in partition count will kill all the stream threads in the consumer. This is expected, as an increase in partitions may corrupt stateful applications. So beware of changing the partition count; it may break stateful consumers connected to the topic.
Good read: Why do Kafka Streams threads die when the source topic partitions change? Can anyone point to reading material around this?