Issue of having more Topic Partitions than Consumers - apache-kafka

I am trying to create an active/active cluster, so my Consumers need to subscribe to the original as well as the replicated topics. The number of partitions will therefore be double what is initially used in a single-cluster setup. Will this increase in partitions cause issues for my consumers? Sometimes the replicated topic might not receive messages.
The Number of Consumers will always be less than the number of Partitions.
Will consumers subscribing to idle partitions (partitions that might not receive messages) cause latency or any other performance issue?

cause issues for my consumers
No. You can still run as many active consumer instances as the total number of partitions being consumed (across all topics).
Number of Consumers will always be less than the number of Partitions
That is not the optimal way to guarantee minimal lag.
cause latency or any performance issue?
No more than without active/active replication.
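If it helps, here is a minimal sketch of one consumer covering both the original and the replicated topic with a single regex subscription. The "primary." prefix, topic name, group id, and broker address are assumptions (MirrorMaker 2, for instance, prefixes remote topics with the source cluster alias), so adjust the pattern to your replication tool's naming. Partitions that receive no messages simply come back as empty polls and add little overhead.

    // Sketch: one consumer subscribed to both the original topic and its replicated
    // counterpart via a regex. Topic names, the "primary." prefix, broker address and
    // group id are assumptions for illustration.
    import java.time.Duration;
    import java.util.Properties;
    import java.util.regex.Pattern;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ActiveActiveConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumption
            props.put("group.id", "orders-processor");           // assumption
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // Matches "orders" and "primary.orders"; idle partitions just return
                // empty polls and do not meaningfully slow the consumer down.
                consumer.subscribe(Pattern.compile("^(primary\\.)?orders$"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> r : records) {
                        System.out.printf("topic=%s partition=%d offset=%d%n",
                            r.topic(), r.partition(), r.offset());
                    }
                }
            }
        }
    }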

Related

Advice on how I can decrease Kafka Lag

I'm relatively new to working with Kafka; below is a sample of what my current setup is.
Kafka Setup
Multiple topics that all have one partition each. 2 Consumer Groups with each group containing one consumer.
The issue I am seeing is that the lag is enormous, sometimes upwards of 8-10 hours of waiting before messages are consumed; the load is about 100-200 million messages a day.
What steps should I look at in order to address this? Is it as simple as reassigning partitions or creating new partitions for the 3 topics that are being consumed by the two consumers? I've also looked at compressing the contents of the producer with gzip, but it doesn't really help in terms of the lag. I've looked at network connections and don't feel that this has anything to do with it. If anyone could point me in the direction of Kafka and low-latency documents, that would be good also.
Generally, the approach is to parallelize your consumption by increasing the number of partitions and the number of consumers in the consumer groups that subscribe to those topics (Nconsumers <= Npartitions).
And distribute your topics by increasing the number of brokers in your cluster.
So from topic considerations:
Fewer partitions per topic result in:
producer and/or consumer lag;
starved or overloaded brokers and consumers.
But take into account that more partitions per topic result in:
more broker resources (file handles and memory);
overhead with each additional partition (the number of partitions a broker can handle is limited);
overhead from the replication load.
Then increase the number of consumers in those consumer groups.
Try increasing the partitions per topic, but by itself that will not help! You will also need to increase the number of consumers in your consumer group. Are those single consumers or consumer groups in your diagram? How many consumers are in your consumer group vs. partitions on the topic that they are subscribed to?
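As a rough sketch of what "more consumers in the consumer group" looks like in code: each instance below joins the same group, so Kafka spreads the topic's partitions across them. The topic name, group id, broker address, and thread count are placeholder assumptions; the count should not exceed the number of partitions.

    // Sketch: N consumers sharing one group.id so the topic's partitions are split
    // across them. Each thread owns its own KafkaConsumer (the client is not thread-safe).
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ConsumerGroupScaling {
        static final int CONSUMER_COUNT = 3;   // keep <= number of partitions on the topic

        public static void main(String[] args) {
            for (int i = 0; i < CONSUMER_COUNT; i++) {
                final int id = i;
                new Thread(() -> runConsumer(id), "consumer-" + id).start();
            }
        }

        static void runConsumer(int id) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumption
            props.put("group.id", "events-processors");          // same group => shared partitions
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("events"));   // assumption
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    System.out.printf("consumer %d got %d records%n", id, records.count());
                }
            }
        }
    }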
From this line in your message:
I've also looked at compressing the contents of the producer with gzip but it doesn't really help in terms of the lag.
I get the idea that your messages may be huge! Is that so? If yes, try to keep messages small (for example, by excluding BLOBs and keeping only external links to them).
Still, the issue may be somewhere else, like bad configs, consumer commit handling (acknowledgments), etc.
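On the commit/acknowledgment side, these are the consumer settings I would check first. The sketch below uses placeholder values (not recommendations) and commits manually only after a batch has been processed.

    // Sketch of commit/poll settings worth reviewing when lag is high; all values are
    // placeholders to tune for your workload. Manual commits after processing avoid
    // losing work, at the cost of possible reprocessing on failure.
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class CommitTuningExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumption
            props.put("group.id", "lagging-app");                // assumption
            props.put("enable.auto.commit", "false");            // commit explicitly below
            props.put("max.poll.records", "500");                // batch size per poll
            props.put("max.poll.interval.ms", "300000");         // must exceed worst-case batch processing time
            props.put("fetch.min.bytes", "1");                   // raise to trade latency for throughput
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("my-topic"));   // assumption
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    // ... process the whole batch here ...
                    if (!records.isEmpty()) {
                        consumer.commitSync();   // acknowledge only after processing succeeded
                    }
                }
            }
        }
    }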
So, I highly advise you to read the article Fine-tune Kafka performance with the Kafka optimization theorem.
I also advise you to go through the Apache Kafka courses on the Confluent website.
This should be added as a comment, but I don't have the permissions to do so. The provided info is very limited and the diagram is incorrect, which limits the ability to provide an adequate, helpful answer. If possible, please correct your diagram and add more details about your setup, like:
broker configuration, file attached;
consumer setup (consumer commit handling);
producer setup;
topic setup;
Kafka version (the defaults differ between major/minor versions).
The provided diagram is not correct with respect to the topic-partition relationship, so I assume it is a typo and Partition 0 must be substituted with Broker 0, right?
Kafka's topics are divided into several partitions. While the topic is a logical concept in Kafka, a partition is the smallest storage unit that holds a subset of records owned by a topic...
Then there is the open question of the number of partitions in each topic and the number of topics on each broker, as well as the number of brokers in your cluster!

Kafka - Standalone server - How to decide partitions?

I have a standalone Kafka setup with a single disk, planning to stream over a million records. How do I decide the number of partitions for my topic for better throughput? Does it have to be 1 partition?
Is it recommended to have multiple partitions for a topic on a standalone Kafka server?
Yes, you need multiple partitions even for a single-node Kafka cluster. That is because you can only have as many consumers as you have partitions. If you have a single partition, then you can only have a single consumer, and that will limit throughput, especially if you want to stream millions of rows (although the period for those is not specified).
The only real downside to this is that messages are only consumed in order within the same partition. Other than that, you should go with multiple partitions. You will need to estimate the throughput of a single consumer in order to calculate the partitions, then maybe add one or 2 on top of that.
You can still add partitions later but it's probably better to try to start with the right amount first and change later as you learn more or as your volume increases/decreases.
There are two main factors to consider:
Number of producers and consumers
Producers can write to any partition, but within a consumer group each partition can be consumed by at most one consumer. For this reason, the number of partitions must be at least the maximum number of consumers you want running in parallel.
Throughput
You must determine the throughput to calculate how many consumers should be in the consumer group. The combined read capacity of the consumers should be at least as high as the combined write capacity of the producers.
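One commonly cited rule of thumb is to take at least max(T/P, T/C) partitions, where T is the target throughput, P the measured per-partition producer throughput, and C the per-consumer throughput. The numbers below are illustrative assumptions; measure your own.

    // Back-of-the-envelope partition sizing using the heuristic
    // partitions >= max(target/producerPerPartition, target/consumerPerPartition).
    // All throughput figures are made-up placeholders.
    public class PartitionSizing {
        public static void main(String[] args) {
            double targetMBps = 100.0;              // desired aggregate throughput
            double producerPerPartitionMBps = 10.0; // measured single-partition write rate
            double consumerPerPartitionMBps = 20.0; // measured single-consumer read rate

            int forProducers = (int) Math.ceil(targetMBps / producerPerPartitionMBps);
            int forConsumers = (int) Math.ceil(targetMBps / consumerPerPartitionMBps);
            int partitions = Math.max(forProducers, forConsumers);

            // 100/10 = 10, 100/20 = 5 -> at least 10 partitions (plus headroom if desired)
            System.out.println("Suggested minimum partitions: " + partitions);
        }
    }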

Increase or decrease Kafka partitions dynamically

I have a system where load is not constant. We may get 1000 requests a day or no requests at all.
We use Kafka to pass on the requests between services. We have kept an average number of Kafka consumers to reduce the cost incurred. Now my Kafka consumers will sit idle if no requests are received that day, and there will be lag if too many requests are received.
We want to keep these consumers in autoscale mode, such that the number of servers (Kafka consumers) will increase if there is a spike in the number of requests. Once the number of requests goes down, we will remove the servers. Therefore, the Kafka partitions have to be increased or decreased accordingly.
Kafka allows increasing partitions. In this case, how can we decrease Kafka partitions dynamically?
Is there any other solution to handle this Auto-scaling?
Scaling up partitions will not fix your lag problems in the short term since no data is moved between partitions when you do this, so existing (or new) consumers are still stuck with reading the data in the previous partitions.
It's not possible to decrease partitions, and it's not possible to scale consumers beyond the partition count.
If you are able to sacrifice processing order for consumption speed, you can separate the consuming threads and working threads, as hinted at in the KafkaConsumer javadoc, then you would be able to scale these worker threads.
Since you are thinking about modifying the partition counts, then I'm guessing processing order isn't a problem.
have one or more consumer threads that do all data consumption and hands off ConsumerRecords instances to a blocking queue consumed by a pool of processor threads that actually handle the record processing.
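A minimal sketch of that decoupled consume/process pattern, assuming placeholder topic, group, queue size and pool size: a single polling thread feeds a bounded queue that a worker pool drains. Offset handling is left to auto-commit here for brevity, which weakens delivery guarantees; a real version would track and commit offsets only after processing.

    // Sketch: decouple consumption from processing. One thread polls Kafka and hands
    // batches to a bounded queue; a fixed pool of workers does the actual processing,
    // and can be scaled independently of the partition count.
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class DecoupledConsumer {
        public static void main(String[] args) {
            BlockingQueue<ConsumerRecords<String, String>> queue = new ArrayBlockingQueue<>(100);

            // Worker pool: scale this number independently of the partition count.
            ExecutorService workers = Executors.newFixedThreadPool(8);
            Runnable worker = () -> {
                try {
                    while (true) {
                        ConsumerRecords<String, String> batch = queue.take();  // blocks until work arrives
                        for (ConsumerRecord<String, String> r : batch) {
                            process(r);
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            };
            for (int i = 0; i < 8; i++) {
                workers.submit(worker);
            }

            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumption
            props.put("group.id", "autoscaled-app");             // assumption
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

            // Only this thread ever touches the KafkaConsumer.
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("requests"));    // assumption
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    if (!records.isEmpty()) {
                        queue.put(records);   // blocks when workers fall behind (backpressure)
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        static void process(ConsumerRecord<String, String> record) {
            // placeholder for the actual work
            System.out.println("processed offset " + record.offset());
        }
    }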
A single consumer can consume multiple partitions. Therefore partition your work for the largest anticipated parallel requirement, and then scale your number of consumers as required.
For example, if you think you need 32 parallel consumers, you would give your Kafka topic 32 partitions.
You can run 32 consumers (each gets one partition), or eight consumers (each gets four partitions) or just one (which gets all 32 partitions). Or any number of consumers in between. Kafka's protocol ensures that within the consumer group all partitions are consumed, and will rebalance as and when consumers are added or removed.
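As a tiny illustration of those splits, assuming the default range-style assignment over a single 32-partition topic, each consumer gets an even share and any remainder goes one extra partition per consumer to the first few:

    // Illustration of how 32 partitions of one topic split across different group sizes.
    public class PartitionSplit {
        public static void main(String[] args) {
            int partitions = 32;
            for (int consumers : new int[] {1, 5, 8, 32}) {
                int base = partitions / consumers;    // partitions every consumer gets
                int extra = partitions % consumers;   // first 'extra' consumers get one more
                System.out.printf("%d consumers -> %d get %d partitions, %d get %d%n",
                    consumers, extra, base + 1, consumers - extra, base);
            }
        }
    }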

Kafka - is it possible to alter Topic's partition count while keeping the change transparent to Producers and Consumers?

I am investigating Kafka to assess its suitability for our use case. Can you please help me understand how flexible Kafka is with changing the number of partitions for an existing Topic?
Specifically,
Is it possible to change the number of partitions without tearing down the cluster?
And is it possible to do that without bringing down the topic?
Will adding/removing partitions automatically take care of redistributing messages across the new partitions?
Ideally, I would want the change to be transparent to the producers and consumers. Does Kafka ensure this?
Update:
From my understanding so far, it looks like Kafka's design cannot allow this, because the mapping of consumer groups to partitions would have to be altered. Is that correct?
1. Is it possible to change the number of partitions without tearing down the cluster?
Yes, Kafka supports increasing the number of partitions at runtime, but doesn't support decreasing the number of partitions due to its design.
2. And is it possible to do that without bringing down the topic?
Yes, provided you are increasing partitions.
3. Will adding/removing partitions automatically take care of redistributing messages across the new partitions?
As mentioned earlier, removing partitions is not supported.
When you increase the number of partitions, the existing messages will remain in the same partitions as before; only new messages will be considered for the new partitions (also depending on your partitioner logic). Increasing the partitions for a topic will trigger a consumer group rebalance, wherein the consumers and producers get notified with the updated metadata of the topic. Producers will start sending messages to the new partitions after receiving the updated metadata, and the consumer rebalance will redistribute the partitions among the consumers in the group and resume consumption from the last committed offset. All this happens under the hood, so you won't have to make any changes on the client side.
Yes, it is perfectly possible. You just execute the following command against the topic of your choice: bin/kafka-topics.sh --zookeeper zk_host:port --alter --topic <your_topic_name> --partitions <new_partition_count>. Remember, Kafka only allows increasing the number of partitions, because decreasing it would cause data loss.
There's a catch here. Kafka doc says the following:
Be aware that one use case for partitions is to semantically partition data, and adding partitions doesn't change the partitioning of existing data so this may disturb consumers if they rely on that partition. That is if data is partitioned by hash(key) % number_of_partitions then this partitioning will potentially be shuffled by adding partitions but Kafka will not attempt to automatically redistribute data in any way.
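To make that caveat concrete, here is a toy illustration. The real default partitioner hashes the serialized key with murmur2 rather than using String.hashCode(), but the effect is the same: the same key can land in a different partition once the partition count changes, and Kafka will not move the old records.

    // Toy illustration of the hash(key) % number_of_partitions caveat quoted above.
    // Keys and partition counts are arbitrary examples.
    public class KeyRemappingDemo {
        public static void main(String[] args) {
            String[] keys = {"user-1", "user-2", "user-3"};
            int before = 4, after = 6;   // partition counts before/after the alter
            for (String key : keys) {
                int oldPartition = Math.floorMod(key.hashCode(), before);
                int newPartition = Math.floorMod(key.hashCode(), after);
                System.out.printf("%s: partition %d -> %d%n", key, oldPartition, newPartition);
            }
        }
    }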
Yes, if by bringing down the topic you mean deleting the topic.
Once you've increased the partition count, Kafka would trigger a rebalance for the consumers subscribing to that topic, and on subsequent polls, the partitions would get distributed across the consumers. It's transparent to the client code; you don't have to worry about it.
NOTE: As I mentioned before, you can only add partitions, removing is not possible.
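The command shown above targets the older --zookeeper interface; on newer Kafka versions the same increase can be done with the --bootstrap-server flag or programmatically with the Java AdminClient, roughly as sketched below. The topic name, broker address, and new count are placeholders, and the count can only go up.

    // Sketch of increasing a topic's partition count with the Java AdminClient.
    // Existing data is not redistributed; only new messages use the new partitions.
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewPartitions;

    public class IncreasePartitions {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumption

            try (AdminClient admin = AdminClient.create(props)) {
                admin.createPartitions(
                        Collections.singletonMap("your_topic_name", NewPartitions.increaseTo(12)))
                     .all()
                     .get();   // block until the change is applied
            }
        }
    }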
One more thing: if you are using stateful operations in clients, like aggregations (making use of a state store), a change in partitions will kill all the stream threads in the consumer. This is expected, as an increase in partitions may corrupt stateful applications. So beware of changing the partition count; it may break stateful consumers connected to the topic.
Good read: Why does kafka streams threads die when the source topic partitions changes? Can anyone point to reading material around this?

Kafka Consumer being Starved because of imbalance

I am new to Kafka and think I am missing something about how partition assignments get balanced across consumers on a topic.
We have 5 partitions and 2 consumers on a topic. The topic has a null key, so I assume Kafka picks a new partition to add each new record to in a round-robin fashion.
This would mean one consumer would be reading from 3 partitions and the other from 2. If my assumption is right (that the records get evenly distributed across partitions), the consumer with 3 partitions would be doing more work (1.5x more). This could lead to one consumer doing nothing while the other keeps working hard.
I think you should have an even divisible number of partitions to consumers.
Am I missing something?
The unit of parallelism in consuming Kafka messages is the partition. The routine scenario for consuming Kafka messages is to consume them with a distributed stream-processing engine like Apache Flink, Spark, or Storm, all of which distribute the processing across CPU cores. The rule is that the maximum level of parallelism for each consumer group is the number of partitions. Each consumer instance of a consumer group (say, a CPU core) can consume one or more partitions, while each partition can be consumed by just one consumer instance of each consumer group.
If you have more CPU cores than the number of partitions, some of them will be idle.
If you have fewer CPU cores than the number of partitions, some of them will consume more than one partition.
And the optimal case is when the number of CPU cores and Kafka partitions are equal.
If my assumption is right (that the records get evenly distributed across partitions) the consumer with 3 partitions would be doing more work (1.5x more). This could lead to one consumer doing nothing while the other keeps working hard.
Why would one consumer do nothing? It would still process records from those 2 partitions [assuming, of course, that both consumers are in the same group].
I think you should have an even divisible number of partitions to consumers.
Yes, that's right. For maximum parallelism, you can have as many consumers as there are partitions, e.g. in your case 5 consumers would give you maximum parallelism.
There is an assumption built into your understanding that each partition has exactly the same throughput. For most applications, though, that may or may not be true. If you set up your keying/partitioning right, then the partitions should hopefully be close to equal, especially with a large and diverse keyspace if you average them out over a large period of time. But in a more practical, realistic sense, you'll probably have some skew at any given time anyway, and your stream processing setup will need to tolerate that. So having one more partition assigned to a particular consumer is probably not going to make a big difference.
Your understanding is correct. Maybe there is data skew. You can check how many records are in each partition by using the offset checker or another tool.
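If you would rather do that check from code than with the CLI offset checker, one rough sketch is to compare each partition's beginning and end offsets; the topic name and broker address below are assumptions. The same idea with committed offsets vs. end offsets gives per-partition lag.

    // Sketch: per-partition record counts (end offset minus beginning offset) to spot skew.
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class PartitionSkewCheck {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumption
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                String topic = "my-topic";                        // assumption
                List<TopicPartition> partitions = new ArrayList<>();
                consumer.partitionsFor(topic).forEach(
                    info -> partitions.add(new TopicPartition(topic, info.partition())));

                Map<TopicPartition, Long> begin = consumer.beginningOffsets(partitions);
                Map<TopicPartition, Long> end = consumer.endOffsets(partitions);
                for (TopicPartition tp : partitions) {
                    System.out.printf("%s: %d records%n", tp, end.get(tp) - begin.get(tp));
                }
            }
        }
    }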