Maximum subscription limit of Kafka Topics Per Consumer - apache-kafka

What is maximum limit of topics can a consumer subscribe to in Kafka. Am not able to find this value documented anywhere.
If consumer subscribes 500000 or more topics, will there be downgrade in performance.

500,000 or more topics in a single Kafka cluster would be a bad design from the broker point of view. You typically want to keep the number of topic partitions down to the low tens of thousands.
If you find yourself thinking you need that many topics in Kafka you might instead want to consider creating a smaller number of topics and having 500,000 or more keys instead. The number of keys in Kafka is unlimited.

To be technical the "maximum" number of topics you could be subscribed to would be constrained by the available memory space for your consumer process (if your topics are listed explicitly then a very large portion of the Java String pool will be your topics). This seems the less likely limiting factor (listing that many topics explicitly is prohibitive).
Another consideration is how the Topic assignment data structures are setup at Group Coordinator Brokers. They could run out of space to record the topic assignment depending on how they do it.
Lastly, which is the most plausible, is the available memory on your Apache Zookeeper node. ZK keeps ALL data in memory for fast retrieval. ZK is also not sharded, meaning all data MUST fit onto one node. This means there is a limit to the number of topics you can create, which is constrained by the available memory on a ZK node.

Consumption is initiated by the consumers. The act of subscribing to a topic does not mean the consumer will start receiving messages for that topic. So as long as the consumer can poll and process data for that many topics, Kafka should be fine as well.

Consumer is fairly independent entity than Kafka cluster, unless you are talking about build in command line consumer that is shipped with Kafka
That said logic of subscribing to a kafka topic, how many to subscribe to and how to handle that data is upto the consumer. So scalability issue here lies with consumer logic
Last but not the least, I am not sure it is a good idea to consumer too many topics within a single consumer. The vary purpose of pub sub mechanism that Kafka provides through the segregation of messages into various topics is to facilitate the handling of specific category of messages using separate consumers. So I think if you want to consume many topics like few 1000s of them using a single consumer, why divide the data into separate topics first using Kafka.

Related

Advice on how I can decrease Kafka Lag

I'm relatively new to working with Kafka, below is a sample of what my current set up is.
Kafka Setup
Multiple topics that all have one partition each. 2 Consumer Groups with each group containing one consumer.
The issue I am seeing is that the Lag is enormous, sometimes upwards of 8-10 hours waiting for consuming, the load is about 100-200 million messages a day
What steps should I look at in order to address this? Is it as simple as reassigning partitions or creating new partitions for the 3 topics that are being consumed by the two consumers? - I've also looked at compressing the contents of the producer with gzip but it doesn't really help in terms of the lag. I've looked at network connections and don't feel that it is anything got to do with this. If anyone could point me in the direction of Kafka and Low Latency documents that would be good also.
Generally the flow is to parallelize your consumption through the increase on the number of partitions and consumers in consumer groups that subscribe to those topics with increased partitions (Nconsumers <= Npartitions).
And distribute your topics with increase on the number of brokers in your cluster.
So from topic considerations:
Less partition per topic result:
in producer and/or consumer lag
starved or overloaded brokers and consumers.
(But take into account) More partition per topic result in:
More broker resources – file handlers and memory.
There is an overhead with each additional partition and a number of partitions a broker can handle is limited.
Overhead of replication load
Then increase the number of consumers in that consumer groups.
Try increasing partition per topic, but by itself it should not help! You also will need to increase the number of consumers in your consumer group. Is that single consumers or consumer groups on your diagram? How many consumers in your consumer group vs partitions on the topic that they are subscibed to.
From this in your message:
I've also looked at compressing the contents of the producer with gzip but it doesn't really help in terms of the lag.
I get an idead that your messages may be huge! Is it so? In case yes, try to keep messages small (for example by excluding BLOBs and keep external links to them)
Still the issue may be somewhere else like bad configs, consumer commit messages (acknowledgment handling), etc.
So, I highly advice you to read article Fine-tune Kafka performance with the Kafka optimization theorem
I also advise you to go through Apache Kafka courses on Confluent web-page
This should be added as a comment, but I haven't had permissions to do so. The provided info is very limited with incorrect diagram, which limits the ability to provide an adequate helpfull answer. If possible please correct your diagram and add more details about your set-up like:
broker configuration, file attached;
consumer set-up (Consumer commit messages);
producer set-up;
topic set-up;
kafka version (the defaults differ with major/minor versions)
The provided diagram is not correct in the notion of topic - partition relationship, so I assume it is a mistype and Partition 0 must be substituded with Broker 0, right?
Kafka's topics are divided into several partitions. While the topic is a logical concept in Kafka, a partition is the smallest storage unit that holds a subset of records owned by a topic...
Then there is an open question on the number of partiotions in each topic and the number of topics in each broker, as well as the number of brokers in your cluster!

Best ways to design a kafka consumer

Need a help in getting the best design solution for creating Kafka consumers.
Will be having multiple topics and those can be like groups say for example
10 topics that are used to send out emails (10 count is chosen because will be getting more client traffic and want to dedicate a topic per client like each topic for one client so that others will not be delayed or waited)
10 topics to process a business logic and the 10 count explanation is same as above.
Now with this usage what's the best way to design Kafka consumers? Consumer dedicated to each topic ? or is there a way where we can scale up consumer dynamically by passing in which topic it needs to subscribe? For sure will be deploying this in containers but want suggestions on how to get started with consumer part with dynamic scalability and common code. And what's the best technology to implement this type of kafka consumers? (dotnet/java/python) ?
Also please do suggest if partitions make sense in this kind of design so that we can leverage consumer groups.
Consumers belonging to a same consumer group are assigned partitions in a topic.
In kafka, a topic can have multiple partitions. The consumers consume the messages of a particular topic from their assigned partition(s). The messages in the partition are ordered by sequential offsets.
Now, topic-wide record order is not important, you generally want to start with a higher number of partitions in a topic. Let's say start with 100 partitions. Your data will be distributed across the 100 partitions in a topic, assuming null keys or at least 100 unique key values with non colliding hashes, as record keys determine partitioning. If topic order is important, you're limited to one partition, and therefore one consumer thread; however, this thread can separate consumption from processing by loading records into alternative data structures (a queue) for processing.
You can now have 10 consumers consuming from 100 partitions. Each consumer will be assigned to about 10 partitions, and they will consume the messages in a round-robin fashion.
If you want to scale out, you simply increase the number of consumers. If you double the number of consumer to 20 then each consumer will process 5 partitions, thus, you get double throughput.

Spring Kafka, subscribing to large number of topics using topic pattern

Are there any known limitations with the number of Kafka topics and the topics distribution between consumer instances while subscribing to Kafka topics using topicPattern.
Our use case is that we need to subscribe to a large number of topics (potentially few thousands of topics). All the topics follow a naming convention and have only 1 partition. We don't know the list of topic names before hand and new topics that match the pattern can be created anytime. Our consumers should be able to consume messages from all the topics that match the pattern.
Do we have any limit on the number of topics that can be subscribed this way using the topic pattern.
Can we scale up the number of consumer instances to read from more topics
Will the topics be distributed among all the running consumer instances. In our local testing it has been observed that all the topics are read by a single consumer despite having multiple consumer instances running in parallel. Only when the serving consumer instance is shutdown, the topics are picked by another instance (not distributed among available instances)
All the topics mentioned are single partition topics.
I don't believe there is a hard limit but...
No; you can't change concurrency at runtime, but you can set a larger concurrency than needed and you will have idle consumers waiting for assignment.
You would need to provide a custom partition assigner, or select one of the alternate provided ones e.g. RoundRobinAssignor will probably work for you https://kafka.apache.org/documentation/#consumerconfigs_partition.assignment.strategy

Scaling out with 200+ Kafka topics

I'm trying to understand how to dynamically scale out application which consumes a huge number of topics (unfortunately I can't reduce their number - by design each topic is for particular type of data).
I want my application cluster to share the load from all 200+ topics. E.g when a new app node added to the cluster, it should "steal" some topics subscriptions from old nodes, so the load become evenly distributed again.
As far as I understand, Kafka partinions/consumer groups help to parallelize a topic, not to share a load between multiple topics.
You need to make sure that all your App instances use the same Kafka Consumer Group (via group.id). In this case you actually have an even distribution you want. When a new App instance is added, consumer group is going to rebalance and make sure the load is distributed.
Also, when a new topic/partition is created it'll take consumer up to "metadata.max.age.ms" (default is 5 minutes) to start consuming from it. Make sure to set "auto.offset.reset" to "earliest" to not miss any data.
Finally, you might want to use a regex to subscribe to all those topics (if possible).
A Kafka Topic is a grouping of messages of a similar type, so you probably have 200+ types of messages that have be consumed by 200+ types of consumers (even if one consumer may be able to handle several types, logically you have 200+ different handlings).
Kafka Partitions is a way to parallelize the consumption of the messages from one Topic. Each Partition will be fully consumed by one consumer in a consumer group bound to the topic, therefore the total number of partitions for a topic needs to be at least the same as the number of consumers in a consumer group to make sense of the partitioning feature.
So here you would have 200+ Topics, each having N partitions (where N greater or equal to your expected Max number of applications) and each application should consume from all 200+ Topics. Consumers have to label themselves with a consumer group name, each record published to a topic is delivered to one consumer instance within each subscribing consumer group. All consumers can use the same consumer group.
See Kafka documentation for an even better explanation...

kafka log deletion and load balancing across consumers

Say a consumer does a time intensive processing. In order to scale consumer side processing, i would like to spawn multiple consumers and consumer messages from kafka topic in a round robin fashion. Based on the documentation, it seems like if i create multiple consumers and add them in one consumer group, only one consumer will get the messages. If i add consumers to different consumer groups, each consumer will get the same message. So, in order to achieve the above objective, is the only solution to partition the topic ? This seems like an odd design choice, because the consumer scalability is now bleeding into topic and even producer design. Ideally, if a topic does not partitioning, there should be no need to partition it. This puts un-necessary logic on producer and also causes other consumer types to consume from these partitions that may only make sense to one type of consumer. Plus it limits the usecase, where a certain consumer type may want ordering over the messages, so splitting a topic into partitions may not be possible.
Second if i choose "cleanup.policy" to compact, does it mean that kafka log will keep increasing as it will maintain the latest value for each key? If not, how can i get log deletion and compaction?
UPDATE:
It seems like i have two options to achieve scalability on consumer side, which are independent of topic scaling.
Create consumer groups and have them consume odd and even offsets. This logic would have to be built into the consumers to discard un-needed messages. Also doubles the network requirements
Create a hierarchy of topics, where the root topic gets all the messages. Then some job classifies the logs and publish them again to more fine grained topics. In this case, the strong ordering can be achieved at root and more fine grained topics for consumer scaling can be constructed.
In 0.8, kafka maintains the consumer offset, so publishing messages in a round robin across various consumers is not a too far fetched requirement from their design.
Partitions are the unit of parallelism in Kafka by design. Not just for consumtion but kafka distributes the partiotions accross cluster which has different other benifits like sharing load among different servers, replication management for ensuring no Data loss, managing log to scale beyond a size that will fit on a single server etc.
Ordering of messages is a key factor as if you do not need a storng ordering then diving topics with multiple partitions will allow you to evenly distribute the load while producing (this will be handled by the producer itself). And while using consumer group you just need to add more consumer instances in the same group in order to consume them parallely.
Plus it limits the usecase, where a certain consumer type may want ordering over the messages, so splitting a topic into partitions may not be possible.
True,from the doc
However, if you require a total order over messages this can be achieved with a topic that has only one partition, though this will mean only one consumer process.
Maintaining ordering whiile consuming in distributed manner requires the messaging system to maintain per-message state to keep track of message acknowledgement. But this will involve a lot of expensive random I/O in the system. So clearly there is a trade-off.
Ideally, if a topic does not partitioning, there should be no need to partition it. This puts un-necessary logic on producer and also causes other consumer types to consume from these partitions that may only make sense to one type of consumer
Distributing messages across partitions is typically handled by the producer it self without any intervention from the programmers end (assuming you don't want to categories messages using key). And for the consumers as you just mentioned here the better choice would be to use Simple/Low level consumers which will allow you to consume only a subset of the partitions in a topic.
This seems like an odd design choice, because the consumer scalability is now bleeding into topic and even producer design
I believe for a system like Kafka which focuses on high throughput ( handle hundreds of megabytes of reads and writes per second from thousands of clients ), ensuring scalability and strong durability and fault-tolerance guarantees might not be a good fit for someone having totally a different business requirements.
Topic partitioning is primarily a way to scale out consumers and brokers so if you need many consumers to keep up then you need to partition the topic and add multiple consumer instances in the same consumer group. The producer API will manage partitions transparently. If you need to have certain consumers subscribing only to some partitions, then you need to use the simple consumer API instead of the high level API and in this case you don't have the consumer group concept and have to coordinate consumption yourself.
Message ordering is guaranteed within partitions but not between partitions so if this is a requirement it needs to be dealt with on consumer side.
Setting cleanup.policy=compact means that the Kafka brokers will keep the latest version of a message key indefinitely and use cases like that should be more for recording of data updates for things you intend to keep around rather than the log stream buffering use case.
You need to factor out the reading of Kafka messages from the subsequent processing of those messages. You can use partitions and consumer groups to make reading messages as fast as possible, but if you process the messages as part of your consumer logic then you'll just slow down your consumers. By streaming the messages from consumers to other classes that will perform your processing you can adjust the parallelism of the consumers and of the processors independently. You'll see this approach in technologies like Spark and Storm.
This approach does add one complication and that is that the consumer has to commit the message offset before the message has been processed. You may have to track the messages in flight to insure execute-exactly-once.