Spring Kafka, subscribing to a large number of topics using topic pattern - apache-kafka

Are there any known limitations on the number of Kafka topics, or on how topics are distributed between consumer instances, when subscribing to Kafka topics using topicPattern?
Our use case is that we need to subscribe to a large number of topics (potentially a few thousand). All the topics follow a naming convention and have only one partition. We don't know the list of topic names beforehand, and new topics that match the pattern can be created at any time. Our consumers should be able to consume messages from all the topics that match the pattern.
Is there any limit on the number of topics that can be subscribed to this way using the topic pattern?
Can we scale up the number of consumer instances to read from more topics?
Will the topics be distributed among all the running consumer instances? In our local testing we observed that all the topics are read by a single consumer despite having multiple consumer instances running in parallel. Only when the serving consumer instance is shut down are the topics picked up by another instance (not distributed among the available instances).
All the topics mentioned are single-partition topics.

I don't believe there is a hard limit but...
No; you can't change concurrency at runtime, but you can set a larger concurrency than needed and you will have idle consumers waiting for assignment.
You would need to provide a custom partition assignor, or select one of the alternative provided ones; e.g. the RoundRobinAssignor will probably work for you: https://kafka.apache.org/documentation/#consumerconfigs_partition.assignment.strategy
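For illustration, here's a minimal Spring Kafka sketch, assuming Spring Boot auto-configuration builds the listener container from this ConsumerFactory; the topic pattern, group id, and bootstrap server are placeholders, not taken from the question. It sets the RoundRobinAssignor so the single-partition topics are spread across all consumers in the group:

import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.RoundRobinAssignor;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;

@Configuration
public class PatternConsumerConfig {

    @Bean
    public ConsumerFactory<String, String> consumerFactory() {
        Map<String, Object> props = new HashMap<>();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        // Spread the single-partition topics across all consumers in the group
        // instead of the default range assignment.
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                RoundRobinAssignor.class.getName());
        return new DefaultKafkaConsumerFactory<>(props);
    }

    // Subscribes to every topic matching the pattern; newly created matching
    // topics are picked up after the next metadata refresh.
    @KafkaListener(topicPattern = "orders\\..*", groupId = "pattern-group")     // pattern and group are placeholders
    public void listen(String message) {
        System.out.println("Received: " + message);
    }
}

The same consumer property can also be set through spring.kafka.consumer.properties.partition.assignment.strategy instead of defining the factory in code.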

Related

Publisher which subscribes to its own topic

I'm currently designing an application which will have hundreds of log-compacted topics. Each topic is related to a failover group and should have a dynamic set of producers and consumers (i.e., one that can be changed on demand).
For example, let's say I have 3 failover instances related to topic T1. Each of those failover instances should have the same data / state (eventually consistent). And each of the instances may consume and produce messages on that topic.
As I understand it, I need to assign different group IDs to each consumer/producer in order to have every instance read the topic entirely.
Though, given that the number of readers and writers for a topic is not fixed, how is it possible to avoid reading one's own messages on that topic?
Sure, I could add a source ID to the message and simply dismiss the message when the consumer figures out it is about to read a message it previously produced itself. But I'd rather avoid the data transfer entirely.
Producers and consumers are independent processes. If you subscribe to the same topic that's being produced to, then without some extra processing logic you'll end up with an infinite loop.
You also cannot have more consumers than partitions, so the dynamic consumer amount will be limited by that.
need to assign different group IDs for each consumer/producer in order to have every instance read the topic entirely
Not necessarily. You've mentioned you have compacted topics, so I assume you are using Kafka Streams. In the Streams API, you can set num.standby.replicas to copy state store data across instances of the same application.id.
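A minimal sketch of that setting, assuming a Kafka Streams application with String data; the application id, bootstrap server, and topic name are placeholders:

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class FailoverGroupApp {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "failover-group-t1");     // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");     // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Keep one warm copy of every state store on another instance of the
        // same application.id, so a standby instance already has the data
        // when a failover happens.
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);

        StreamsBuilder builder = new StreamsBuilder();
        builder.table("T1");                                                      // compacted topic, placeholder name

        new KafkaStreams(builder.build(), props).start();
    }
}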

What is the correlation between Kafka streams/tables, GlobalKTable, brokers and partitions?

I am studying Kafka streams, tables, GlobalKTable, etc. Now I am confused about them.
What exactly is GlobalKTable?
But overall, if I have a topic with N partitions and one Kafka Streams application, after I send some data to the topic, how many streams (partitions?) will I have?
I made some tests and noticed that the mapping is 1:1. But what if I make the topic replicated over different brokers?
Thank you all
I'll try to answer your questions as you have them listed here.
A GlobalKTable has all partitions available in each instance of your Kafka Streams application, whereas a KTable is partitioned over all of the instances of your application. In other words, all instances of your Kafka Streams application have access to all records in the GlobalKTable; hence it is used for more static data and mostly for looking up records in joins.
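A sketch of the difference in the Streams DSL; the topic names and join logic below are made up for illustration:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class TableExamples {

    static Topology buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();

        // KTable: each running instance only materializes the partitions it is assigned.
        KTable<String, String> counts =
                builder.table("user-counts", Consumed.with(Serdes.String(), Serdes.String()));

        // GlobalKTable: every instance materializes ALL partitions, which is why it
        // suits small, fairly static lookup data.
        GlobalKTable<String, String> regions =
                builder.globalTable("regions", Consumed.with(Serdes.String(), Serdes.String()));

        KStream<String, String> orders =
                builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()));

        // Join against the GlobalKTable without re-partitioning the stream.
        orders.join(regions,
                    (orderKey, orderValue) -> orderKey,                  // key to look up in the table
                    (orderValue, region) -> orderValue + " @ " + region)
              .to("orders-with-regions", Produced.with(Serdes.String(), Serdes.String()));

        return builder.build();
    }
}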
As for a topic with N partitions: if you have one Kafka Streams application, it will consume and work with all records from the input topic. If you were to spin up another instance of your streams application, then each instance would process half of the partitions, giving you higher throughput due to the parallelization of the work.
For example, if you have input topic A with four partitions and one Kafka Streams application, then the single application processes all records. But if you were to launch two instances of the same Kafka Streams application, then each instance would process records from two partitions; the workload is split across all running instances with the same application-id.
Topics can be replicated across different brokers in Kafka; a replication factor of 3 is a common choice (the broker-side default is 1 unless configured otherwise). A replication factor of 3 means the records for a given partition are stored on the lead broker for that partition and on two other follower brokers (assuming a cluster of at least three brokers).
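The partition count and replication factor are set when the topic is created, for example via the admin client; a sketch with a made-up topic name and bootstrap server:

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // 4 partitions for parallelism; each partition copied to 3 brokers.
            NewTopic topic = new NewTopic("A", 4, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}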
Hope this clears things up some.
-Bill

One KafkaConsumer listening to multiple partitions VS multiple KafkaConsumers listening to multiple partitions

I have ten Kafka producers, each writing to a different partition of a topic.
I cannot tell which is more effective:
having one consumer listening to all ten partitions, or having ten consumers, each listening to a different partition?
There is no real difference between these two approaches, but remember that when you have ten consumers there is overhead in connecting each consumer to Kafka.
If a single consumer is able to keep up with all the partitions, it is probably performant enough.
Typically, if you have multiple consumers, you'll be able to get more throughput, since you'll have multiple threads/applications pulling data from the Kafka cluster, which means you'll be able to parallelize across multiple cores, and maybe multiple servers.
However, you also need to take into account what you're trying to accomplish. Does one process/application need to look at all the data? Are the messages independent of each other? All of this will inform how your application should be designed.
In a default configuration, all of the available partitions for a topic will be distributed evenly across all consumers with the same group id. So you could have one consumer, and it will automatically grab all ten partitions for that topic. Or you could instantiate ten consumers, and each consumer will get exactly one partition in this case.
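A plain-consumer sketch of that behaviour (topic name, group id, and bootstrap server are assumptions): run one copy of this process and it takes all ten partitions, run ten copies and each gets exactly one.

import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupedConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");           // placeholder
        // Same group.id in every process: the ten partitions are divided among
        // however many of these processes are running (from one up to ten).
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                          // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("ten-partition-topic"));                         // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition %d offset %d: %s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}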

Scaling out with 200+ Kafka topics

I'm trying to understand how to dynamically scale out an application which consumes a huge number of topics (unfortunately I can't reduce their number - by design, each topic is for a particular type of data).
I want my application cluster to share the load from all 200+ topics. E.g. when a new app node is added to the cluster, it should "steal" some topic subscriptions from the old nodes, so the load becomes evenly distributed again.
As far as I understand, Kafka partitions/consumer groups help to parallelize a single topic, not to share load between multiple topics.
You need to make sure that all your app instances use the same Kafka consumer group (via group.id). In that case you will actually get the even distribution you want: when a new app instance is added, the consumer group rebalances and makes sure the load is distributed.
Also, when a new topic/partition is created, it'll take the consumer up to metadata.max.age.ms (the default is 5 minutes) to start consuming from it. Make sure to set auto.offset.reset to earliest so you don't miss any data.
Finally, you might want to use a regex to subscribe to all those topics (if possible).
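Putting those three points together, a minimal sketch of such a consumer (the pattern, group id, and bootstrap server are assumptions); run the same code on every app node:

import java.time.Duration;
import java.util.Properties;
import java.util.regex.Pattern;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class MultiTopicConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "app-cluster");               // same group.id on every node
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");         // don't miss data on newly created topics
        props.put(ConsumerConfig.METADATA_MAX_AGE_CONFIG, "60000");             // discover new topics within a minute

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // The pattern is an assumption; adjust it to your topic naming convention.
            consumer.subscribe(Pattern.compile("data\\..*"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> System.out.printf("%s-%d: %s%n", r.topic(), r.partition(), r.value()));
            }
        }
    }
}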
A Kafka topic is a grouping of messages of a similar type, so you probably have 200+ types of messages that have to be consumed by 200+ types of consumers (even if one consumer may be able to handle several types, logically you have 200+ different handlings).
Kafka partitions are a way to parallelize the consumption of messages from one topic. Each partition will be fully consumed by one consumer in a consumer group bound to the topic, so the total number of partitions for a topic needs to be at least the number of consumers in the group for the partitioning feature to make sense.
So here you would have 200+ topics, each with N partitions (where N is greater than or equal to your expected maximum number of application instances), and each application should consume from all 200+ topics. Consumers label themselves with a consumer group name; each record published to a topic is delivered to one consumer instance within each subscribing consumer group. All your consumers can use the same consumer group.
See Kafka documentation for an even better explanation...

Maximum subscription limit of Kafka Topics Per Consumer

What is the maximum number of topics a consumer can subscribe to in Kafka? I am not able to find this value documented anywhere.
If a consumer subscribes to 500,000 or more topics, will there be a degradation in performance?
500,000 or more topics in a single Kafka cluster would be a bad design from the broker point of view. You typically want to keep the number of topic partitions down to the low tens of thousands.
If you find yourself thinking you need that many topics in Kafka you might instead want to consider creating a smaller number of topics and having 500,000 or more keys instead. The number of keys in Kafka is unlimited.
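A small sketch of that alternative (topic name, key, and payload are made up): instead of one topic per entity, use one shared topic and put the entity identifier in the record key. Records with the same key always land on the same partition, so per-key ordering is preserved.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // What would have been 500,000 topics becomes 500,000 keys on one topic.
            String key = "device-42";                                            // hypothetical entity id
            producer.send(new ProducerRecord<>("device-events", key, "{\"temp\": 21.5}"));
        }
    }
}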
To be technical, the "maximum" number of topics you could be subscribed to would be constrained by the available memory space for your consumer process (if your topics are listed explicitly, then a very large portion of the Java String pool will be your topics). This seems the least likely limiting factor (listing that many topics explicitly is prohibitive anyway).
Another consideration is how the topic assignment data structures are set up on the group coordinator brokers. They could run out of space to record the topic assignment, depending on how they do it.
Lastly, and most plausibly, there is the available memory on your Apache ZooKeeper node. ZK keeps ALL data in memory for fast retrieval, and it is not sharded, meaning all data MUST fit onto one node. This means there is a limit to the number of topics you can create, constrained by the available memory on a ZK node.
Consumption is initiated by the consumers. The act of subscribing to a topic does not mean the consumer will start receiving messages for that topic. So as long as the consumer can poll and process data for that many topics, Kafka should be fine as well.
A consumer is a fairly independent entity from the Kafka cluster, unless you are talking about the built-in command line consumer that ships with Kafka.
That said, the logic of subscribing to Kafka topics, how many to subscribe to, and how to handle that data is up to the consumer, so the scalability issue here lies with the consumer logic.
Last but not least, I am not sure it is a good idea to consume too many topics within a single consumer. The very purpose of the pub-sub mechanism that Kafka provides, through the segregation of messages into various topics, is to facilitate the handling of specific categories of messages using separate consumers. So if you want to consume many topics, like a few thousand of them, with a single consumer, why divide the data into separate topics in the first place?