Number of Messages Per Transaction in Kafka - apache-kafka

Are there guidelines on how many messages and/or partitions can be involved in a Kafka Producer Transaction before performance really starts to suffer?
Obviously, the more partitions are involved, the more coordination is required. But in Kafka Streams, for example, the default commit interval is 100ms. In that time, normally hundreds of messages can be processed by a Stream thread. And depending on the topology, that may involve many different output topics (and by extension, partitions). Does that mean that it's safe to push transactions with hundreds of messages and dozens of partitions?
I can't find anything about this in the documentation.

https://www.confluent.io/blog/transactions-apache-kafka/
I found this in Confluent's blog (linked above). It turns out that the overhead doesn't increase per message within a transaction; rather, you actually increase throughput if you have more messages per transaction.
In fact, the average overhead per message decreases as you add more messages to the transaction.
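For illustration, here is a minimal sketch of a producer that batches many records into a single transaction; the bootstrap address, transactional id, topic names and record count are made up, and error handling (e.g. aborting on failure) is omitted for brevity.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class TransactionalBatchProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("transactional.id", "demo-producer-1");   // required for transactions
            props.put("enable.idempotence", "true");
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.initTransactions();
                producer.beginTransaction();
                // Hundreds of records, spread over several topics (and therefore partitions),
                // are covered by a single commit, so the per-message transaction overhead shrinks.
                for (int i = 0; i < 500; i++) {
                    producer.send(new ProducerRecord<>("output-topic-" + (i % 3), "key-" + i, "value-" + i));
                }
                producer.commitTransaction();
            }
        }
    }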

Related

Does kafka support millions of partitions?

Will we have any problem if we have millions of partitions for one topic?
Due to our business requirement, we are thinking if we can make a partition for every user in kafka.
We have millions of users.
Any insight would be appreciated!
Yes, I think you will end up having problems if you have millions of partitions for several reasons:
(Most importantly!) Customers come and go, so you would constantly need to change the number of partitions or keep plenty of unused partitions around (because you cannot reduce the number of partitions of a topic).
More Partitions Requires More Open File Handles: More Partitions means more directories and segment files on disk.
More Partitions May Increase Unavailability: Planned failures move Leaders off of a Broker one at a time, with minimal downtime per partition. In a hard failure all the leaders are immediately unavailable.
More Partitions May Increase End-to-end Latency: For the message to be seen by a Consumer it must be committed. The Broker replicates data from the leader with a single thread, resulting in overhead per Partition.
More Partitions May Require More Memory In the Client
More details are provided in Confluent's blog post "How to choose the number of topics/partitions in a Kafka cluster?".
In addition, Confluent's training material for Kafka developers recommends:
"The current limits (2-4K Partitions/Broker, 100s K Partitions per cluster) are maximums. Most environments are well below these values (typically in the 1000-1500 range or less per Broker)."
This blog explains that "Apache Kafka Supports 200K Partitions Per Cluster".
This might change with the replacement of ZooKeeper (KIP-500) but, looking at the first point above, it would still be an unhealthy software design.
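As an aside, the usual way to get per-user ordering without a partition per user is to key records by the user ID and let the default partitioner hash that key onto a fixed number of partitions. A minimal sketch, with a made-up topic name and user id:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class KeyedUserProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                String userId = "user-42";  // hypothetical user id
                // The default partitioner hashes the key, so all events of one user land on
                // the same partition and stay ordered, even if the topic only has, say,
                // 50 partitions instead of millions.
                producer.send(new ProducerRecord<>("user-events", userId, "user logged in"));
            }
        }
    }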

Kafka Should Number of Consumer Threads equal number of Topic Partitions

Pretend you determined that you wanted to use exactly 8 consumer threads for your application.
Would there be any difference in processing if a Kafka topic was set up as having 8 partitions vs 16 partitions?
In the first case, each thread is assigned to a single partition with twice the data, and in the second case each thread is assigned to two partitions with half the data each. It seems to me that there is no difference between these two setups.
I believe that, on the consumer side, there could be a difference if your threads are not CPU-constrained (and the network is not at capacity). Assuming there is always data available on the broker, or a lagging consumer: since each thread is consuming from two partitions in your second example, the broker is able to send more data than if each thread had only one partition assigned. Kafka caps how many bytes can be returned per partition in a single fetch (max.partition.fetch.bytes in the consumer config), so if you double the partitions, you can increase fetch capacity, assuming the data is available.
When configured properly, and assuming ideal conditions, Kafka will serve data from page cache, so it can blast data down to consumers, and 90% of the time the bottleneck will be the number of partitions / available CPU on the consumer side. In general, the more partitions you have, the faster you can consume from Kafka, until you are CPU or bandwidth constrained on the consumer, at which point it won't matter whether you have more or fewer partitions, since you're consuming data as fast as you can anyway.
An additional thing to take into account is that there could be more consumer offset commits being sent back to the brokers, since there are now more partitions, which means some additional overhead/crosstalk in the cluster. It's probably not 2x the commits, but likely more than in the first scenario.
An important thing to remember is to, whenever possible, do the actual message processing on your consumer off-thread. That is, do not process the inbound messages on the same thread that is consuming/polling from Kafka. It might work at first, but you're going to start running into issues if your processing takes longer, there's a delay, huge volume increase on the inbound side, etc. Whenever possible, throw the inbound messages on a queue, and let another thread worry about processing/parsing them.
Finally, you don't want to take this to the extreme, and configure 1000 partitions if you don't have to. Each partition requires overhead on commits, zookeeper znodes, consumer rebalancing time, startup time, etc. So, I would suggest benchmarking different scenarios, and seeing what works best for you. In general, anything from 2-4 partitions per consumer thread has worked well for me in the past, even with very high message loads (topics with 50K+ messages per second, each ~1KB).
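As an illustration of the off-thread pattern described above, here is a rough sketch where the poll loop only hands records to a bounded queue that worker threads drain; the topic name, group id, queue size and thread count are arbitrary, and offset management and shutdown are omitted for brevity.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class OffThreadConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "processing-app");
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            BlockingQueue<ConsumerRecord<String, String>> work = new ArrayBlockingQueue<>(10_000);

            // Worker threads parse/process records without blocking the poll loop.
            for (int i = 0; i < 4; i++) {
                Thread worker = new Thread(() -> {
                    try {
                        while (true) {
                            ConsumerRecord<String, String> record = work.take();
                            // ... expensive processing goes here ...
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
                worker.start();
            }

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("input-topic"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(200));
                    for (ConsumerRecord<String, String> record : records) {
                        work.put(record);  // backpressure: blocks when the queue is full
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }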

Kafka - Best practices in case of slow processing consumer. How to achieve more parallelism?

I'm aware that the maximum number of active consumers in a consumer group is the number of partitions of a topic.
What's the best practice in case of slow processing consumers? How to achieve more parallelism?
An example: A topic with 6 partitions and thousands of messages per second produced from Producers. So I have at most 6 consumers in the group. Consider that processing those messages is complex and the consumers are much slower than the producers. The result is that the consumers are always behind the last offset and the lag is increasing.
In a traditional MQ system, we simply add more and more consumers to stay up to date.
How to achieve this with Kafka, since the number of consumers in a group is at most the number of partitions? Should I:
Configure the topic to have more partitions allowing more consumers per group?
Route the message from the consumer to a traditional MQ Queue (but lose the ordering)?
What's the best practice for this situation?
In Kafka, partitions are the unit of parallelism.
Without knowing your exact use case and requirements it's hard to come up with precise recommendations, but there are a few options.
First you should really consider having more partitions. 6 partitions is relatively small; you could easily have 60, 120 or even more partitions (and the corresponding number of consumers). Suddenly the amount of work each consumer has to do is significantly reduced.
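For reference, the partition count of an existing topic can be raised with the admin client; the topic name and target count below are made up, and keep in mind that adding partitions changes the key-to-partition mapping for keyed data.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewPartitions;

    public class IncreasePartitions {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // Grow the hypothetical topic "slow-topic" to 60 partitions in total.
                admin.createPartitions(Collections.singletonMap("slow-topic", NewPartitions.increaseTo(60)))
                     .all()
                     .get();
            }
        }
    }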
Also, if your requirements allow, you can consume at a fast rate and spread the processing of records across many workers. In solutions like this it's harder to maintain ordering, but if you don't need ordering then you can consider it.
I'm not sure how routing messages through an MQ queue would really help in this scenario. If you are still reading more slowly than writing, the amount of data in the queue will grow until you have no disk space left.
Kafka is better designed to serve as a buffer between your producers and consumers, so just ensure you have retention limits on your topics that allow some flexibility on the consumer side without losing data.

Does the number of consumer groups impact Kafka performance

While trying to get a deep understanding of the Kafka distribution model, one sentence here from StackOverflow got me buzzing, and I can't get a confirmation nor deny.
So, the more subscriber groups you have, the lower the performance is, as kafka needs to replicate the messages to all those groups and guarantee the total order.
As far as I understood from the Kafka docs, multiple consumer groups act similarly to singular consumers. There is no replication done within the brokers, since each consumer has its own offset for a given partition. The number of groups should, then, not add any significant overhead; all of the data is in one place, only the offsets differ. Is that correct?
If this is correct, then there is no way of actually introducing multiple disjoint consumers without impacting throughput, since all consumers always query all of the partitions, and some kind of copying is introduced. Note that this is not related to the number of consumer threads: threads only improve consumer performance, they don't interfere with broker operations as far as I can tell.
I've found an answer myself, it's located within the new consumer API docs for Kafka 0.9 and after:
Conceptually you can think of a consumer group as being a single logical subscriber that happens to be made up of multiple processes. As a multi-subscriber system, Kafka naturally supports having any number of consumer groups for a given topic without duplicating data (additional consumers are actually quite cheap).
Bottom line: no, multiple consumer groups do not decrease performance, at least not significantly.
It does not affect the Kafka brokers' performance, but since 2 or more consumer groups mean 2 or more times as much data read from the Kafka servers, it does affect network utilization in outgoing traffic if you have lots of consumer groups. Besides that, the data is mostly read from memory and does not affect performance, because RAM is way faster than network communication.
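To make that concrete, two consumers that differ only in group.id each receive the full stream independently; the broker keeps a separate committed offset per group, not a separate copy of the data. A small sketch with made-up topic and group names:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class TwoGroupsDemo {
        static KafkaConsumer<String, String> consumerFor(String groupId) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", groupId);              // only the group id differs
            props.put("auto.offset.reset", "earliest");
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());
            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            consumer.subscribe(Collections.singletonList("events"));
            return consumer;
        }

        public static void main(String[] args) {
            // Both groups read every record of "events"; the only broker-side state
            // that differs is the committed offset per group.
            try (KafkaConsumer<String, String> billing = consumerFor("billing-group");
                 KafkaConsumer<String, String> audit = consumerFor("audit-group")) {
                System.out.println(billing.poll(Duration.ofSeconds(1)).count());
                System.out.println(audit.poll(Duration.ofSeconds(1)).count());
            }
        }
    }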

kafka log deletion and load balancing across consumers

Say a consumer does time-intensive processing. In order to scale consumer-side processing, I would like to spawn multiple consumers and consume messages from a Kafka topic in a round-robin fashion. Based on the documentation, it seems like if I create multiple consumers and add them to one consumer group, only one consumer will get the messages. If I add consumers to different consumer groups, each consumer will get the same message. So, in order to achieve the above objective, is the only solution to partition the topic?

This seems like an odd design choice, because the consumer scalability is now bleeding into topic and even producer design. Ideally, if a topic does not need partitioning, there should be no need to partition it. This puts unnecessary logic on the producer and also causes other consumer types to consume from these partitions that may only make sense to one type of consumer. Plus it limits the use case where a certain consumer type may want ordering over the messages, so splitting a topic into partitions may not be possible.

Second, if I set "cleanup.policy" to compact, does it mean that the Kafka log will keep growing, since it will maintain the latest value for each key? If not, how can I get both log deletion and compaction?
UPDATE:
It seems like I have two options to achieve scalability on the consumer side, which are independent of topic scaling.
Create consumer groups and have them consume odd and even offsets respectively. This logic would have to be built into the consumers to discard unneeded messages. It also doubles the network requirements.
Create a hierarchy of topics, where the root topic gets all the messages. Then some job classifies the logs and publishes them again to more fine-grained topics. In this case, strong ordering can be achieved at the root, and more fine-grained topics can be constructed for consumer scaling (see the sketch after this update).
In 0.8, Kafka maintains the consumer offset, so publishing messages round-robin across various consumers is not too far-fetched a requirement given its design.
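The second option above boils down to a small consume-and-republish job. A rough sketch, with hypothetical topic names and a made-up classify() rule:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class TopicClassifier {
        // Hypothetical routing rule: pick a fine-grained topic per record.
        static String classify(String value) {
            return value.startsWith("ERROR") ? "logs-errors" : "logs-other";
        }

        public static void main(String[] args) {
            Properties cprops = new Properties();
            cprops.put("bootstrap.servers", "localhost:9092");
            cprops.put("group.id", "classifier");
            cprops.put("key.deserializer", StringDeserializer.class.getName());
            cprops.put("value.deserializer", StringDeserializer.class.getName());

            Properties pprops = new Properties();
            pprops.put("bootstrap.servers", "localhost:9092");
            pprops.put("key.serializer", StringSerializer.class.getName());
            pprops.put("value.serializer", StringSerializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cprops);
                 KafkaProducer<String, String> producer = new KafkaProducer<>(pprops)) {
                consumer.subscribe(Collections.singletonList("logs-root"));
                while (true) {
                    // Read from the root topic and republish to the fine-grained topics.
                    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                        producer.send(new ProducerRecord<>(classify(record.value()),
                                record.key(), record.value()));
                    }
                }
            }
        }
    }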
Partitions are the unit of parallelism in Kafka by design, and not just for consumption: Kafka distributes the partitions across the cluster, which has other benefits like sharing load among different servers, replication management to ensure no data loss, and letting the log scale beyond a size that will fit on a single server.
Ordering of messages is a key factor: if you do not need strong ordering, then dividing a topic into multiple partitions will allow you to evenly distribute the load while producing (this will be handled by the producer itself). And while using a consumer group you just need to add more consumer instances in the same group in order to consume messages in parallel.
Plus it limits the use case where a certain consumer type may want ordering over the messages, so splitting a topic into partitions may not be possible.
True, from the doc:
However, if you require a total order over messages this can be achieved with a topic that has only one partition, though this will mean only one consumer process.
Maintaining ordering while consuming in a distributed manner requires the messaging system to maintain per-message state to keep track of message acknowledgement. But this will involve a lot of expensive random I/O in the system. So clearly there is a trade-off.
Ideally, if a topic does not need partitioning, there should be no need to partition it. This puts unnecessary logic on the producer and also causes other consumer types to consume from these partitions that may only make sense to one type of consumer
Distributing messages across partitions is typically handled by the producer itself, without any intervention from the programmer's end (assuming you don't want to categorize messages using a key). And for the consumers, as you just mentioned, the better choice would be to use the simple/low-level consumer, which will allow you to consume only a subset of the partitions in a topic.
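In the current consumer API, the role of the old simple/low-level consumer is covered by manual partition assignment: the consumer assigns itself specific partitions and skips group management entirely. A rough sketch, with a made-up topic and arbitrary partition numbers:

    import java.time.Duration;
    import java.util.Arrays;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class SubsetOfPartitions {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("enable.auto.commit", "false");   // no group, so offsets are managed by this process
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // Read only partitions 0 and 1 of the topic: no consumer group, no rebalancing.
                consumer.assign(Arrays.asList(
                        new TopicPartition("raw-events", 0),
                        new TopicPartition("raw-events", 1)));
                consumer.seekToBeginning(consumer.assignment());
                System.out.println(consumer.poll(Duration.ofSeconds(1)).count());
            }
        }
    }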
This seems like an odd design choice, because the consumer scalability is now bleeding into topic and even producer design
I believe that a system like Kafka, which focuses on high throughput (handling hundreds of megabytes of reads and writes per second from thousands of clients) while ensuring scalability, strong durability, and fault-tolerance guarantees, might not be a good fit for someone with totally different business requirements.
Topic partitioning is primarily a way to scale out consumers and brokers, so if you need many consumers to keep up then you need to partition the topic and add multiple consumer instances in the same consumer group. The producer API will manage partitions transparently. If you need certain consumers to subscribe only to some partitions, then you need to use the simple consumer API instead of the high-level API; in that case you don't have the consumer group concept and have to coordinate consumption yourself.
Message ordering is guaranteed within partitions but not between partitions, so if this is a requirement it needs to be dealt with on the consumer side.
Setting cleanup.policy=compact means that the Kafka brokers will keep the latest version of each message key indefinitely; use cases like that are more about recording data updates for things you intend to keep around, rather than the log-stream buffering use case.
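For example, a compacted topic could be created like this; the topic name, partition count and replication factor below are arbitrary.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;
    import org.apache.kafka.common.config.TopicConfig;

    public class CompactedTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // 6 partitions, replication factor 3, log compaction instead of time-based deletion.
                NewTopic topic = new NewTopic("user-profiles", 6, (short) 3)
                        .configs(Collections.singletonMap(
                                TopicConfig.CLEANUP_POLICY_CONFIG,
                                TopicConfig.CLEANUP_POLICY_COMPACT));
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }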
You need to factor out the reading of Kafka messages from the subsequent processing of those messages. You can use partitions and consumer groups to make reading messages as fast as possible, but if you process the messages as part of your consumer logic then you'll just slow down your consumers. By streaming the messages from consumers to other classes that will perform your processing you can adjust the parallelism of the consumers and of the processors independently. You'll see this approach in technologies like Spark and Storm.
This approach does add one complication: the consumer ends up committing the message offset before the message has been processed. You may have to track the messages in flight to ensure exactly-once execution.
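One way to keep offsets from running ahead of processing is to disable auto-commit and commit only after a batch has actually been handled; a minimal sketch, with made-up topic and group names (the hand-off/wait step is left as a placeholder):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class CommitAfterProcessing {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "safe-processor");
            props.put("enable.auto.commit", "false");   // we decide when offsets move forward
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("input-topic"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        // ... hand off and wait for processing to finish ...
                    }
                    // Offsets are committed only after the whole batch was processed, so a
                    // crash reprocesses the batch (at-least-once) instead of losing it.
                    if (!records.isEmpty()) {
                        consumer.commitSync();
                    }
                }
            }
        }
    }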