I have the following scenario:
4 wearable sensors attached to each individual.
A potentially unlimited number of individuals.
A Kafka cluster.
I have to perform real-time processing on the data streams on a cluster with a running instance of Apache Flink.
Kafka is the data hub between the Flink cluster and the sensors.
Moreover, the subjects' streams are totally independent, and the different streams belonging to the same subject are also independent of each other.
This is the setup I have in mind:
I create a dedicated topic for each subject, and each topic has 4 partitions, one for each sensor on that person.
With this layout I thought of setting up a consumer group for every topic.
My data volume is actually not very large, but my interest is in building an easily scalable system. One day I might have hundreds of individuals, for instance...
My questions are:
Is this setup good? What do you think about it?
In this way I will have 4 Kafka brokers, each handling one partition, right (not considering potential backups)?
Destroy me guys,
and thanks in advance
You can't have an infinite number of topics in a Kafka cluster, so if you plan to scale to 10,000 or more topics you should consider another design. Instead of giving each individual a dedicated topic, you can use the individual's ID as a key and publish the data as key/value pairs to a smaller number of topics. In Kafka you can have an (almost) infinite number of keys.
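A minimal sketch of the keyed design, in Python with only the standard library. It does not talk to a real cluster: partition_for is a hypothetical stand-in for the producer's partitioner (Kafka's default partitioner hashes the serialized key with murmur2; zlib.crc32 is used here only to keep the example dependency-free), and the 12 partitions and the subject-N IDs are made-up values.

```python
import zlib

NUM_PARTITIONS = 12  # assumed size of one shared topic (hypothetical value)

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stand-in for a producer partitioner: hash the key, take it modulo the partition count."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Many subjects share a fixed set of partitions; no per-subject topic is needed.
subjects = [f"subject-{i}" for i in range(1000)]
placement = {s: partition_for(s) for s in subjects}

# The same key always maps to the same partition, so one subject's records stay in order.
assert partition_for("subject-42") == placement["subject-42"]
print(len(set(placement.values())), "partitions used by", len(subjects), "subjects")
```

The sensor ID could be carried in the value or appended to the key (for example subject-42/sensor-3) if the four streams of one subject should be spread out as well.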
Also consider more partitions. Each of your 4 brokers can handle many partitions. If you only have 4 partitions in a topic then you can only have at most 4 consumers working together in parallel in a consumer group (in your case in Flink)
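To illustrate the limit, here is a rough sketch that deals partitions out to the members of one group round-robin. The real assignment is done by the group coordinator and a configurable assignor; assign_round_robin is a hypothetical helper that only mimics the outcome that matters here: a partition has exactly one owner within a group, so members beyond the partition count sit idle.

```python
def assign_round_robin(num_partitions: int, consumers: list[str]) -> dict[str, list[int]]:
    """Give each partition to exactly one consumer of the group, round-robin."""
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

def active_consumers(assignment: dict[str, list[int]]) -> int:
    """Count the consumers that own at least one partition."""
    return sum(1 for parts in assignment.values() if parts)

six = [f"consumer-{i}" for i in range(6)]
print(active_consumers(assign_round_robin(4, six)))   # 4 partitions: two of the six consumers stay idle
print(active_consumers(assign_round_robin(12, six)))  # 12 partitions: all six consumers have work
```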
I am studying Kafka Streams, KTable, GlobalKTable, etc., and I am confused about them.
What exactly is GlobalKTable?
More generally, if I have a topic with N partitions and one Kafka Streams application, how many streams (one per partition?) will I have after I send some data to the topic?
I ran some tests and noticed that the match is 1:1. But what if the topic is replicated across different brokers?
Thank you all
I'll try to answer your questions as you have them listed here.
A GlobalKTable has all partitions available in each instance of your Kafka Streams application, whereas a KTable is partitioned over all of the instances of your application. In other words, every instance of your Kafka Streams application has access to all records in the GlobalKTable; hence it is typically used for more static data and for looking up records in joins.
As for a topic with N partitions, if you have one Kafka Streams application, it will consume and work with all records from the input topic. If you were to spin up another instance of your streams application, then each instance would process half of the partitions, giving you higher throughput due to the parallelization of the work.
For example, if you have input topic A with four partitions and one Kafka Streams application, then the single application processes all records. But if you were to launch two instances of the same Kafka Streams application, then each instance will process records from 2 partitions; the workload is split across all running instances with the same application.id.
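The difference can be pictured with a toy model in plain Python (no Kafka Streams involved). Everything here is hypothetical: topic_a is a made-up four-partition topic, and split_partitions spreads partitions round-robin, whereas the real task assignment is decided by Kafka Streams and may pick different partitions per instance.

```python
from collections import defaultdict

NUM_PARTITIONS = 4

# Toy topic "A": partition number -> (key, value) records stored in that partition.
topic_a = {
    0: [("k0", "v0")],
    1: [("k1", "v1")],
    2: [("k2", "v2")],
    3: [("k3", "v3")],
}

def split_partitions(instances: list[str]) -> dict[str, list[int]]:
    """Spread the partitions evenly over the running instances (same application.id)."""
    owned = defaultdict(list)
    for p in range(NUM_PARTITIONS):
        owned[instances[p % len(instances)]].append(p)
    return dict(owned)

def ktable_view(partitions: list[int]) -> dict[str, str]:
    """A KTable on an instance holds only the records of that instance's partitions."""
    return {k: v for p in partitions for k, v in topic_a[p]}

def global_ktable_view() -> dict[str, str]:
    """A GlobalKTable holds the records of every partition on every instance."""
    return {k: v for records in topic_a.values() for k, v in records}

for instances in (["app-1"], ["app-1", "app-2"]):
    for name, parts in split_partitions(instances).items():
        print(name, "partitions", parts,
              "KTable keys", sorted(ktable_view(parts)),
              "GlobalKTable keys", sorted(global_ktable_view()))
```

With one instance both views contain all four keys; with two instances each KTable view shrinks to two keys while the GlobalKTable view stays complete.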
Replication is configured per topic. The broker setting default.replication.factor is 1 out of the box, but 3 is the level of replication commonly used in production. A replication level of 3 means the records for a given partition are stored on the lead broker for that partition and two other follower brokers (assuming a three-node broker cluster). Replication does not change the number of partitions, so the 1:1 match you observed still holds.
Hope this clears things up some.
-Bill
As usual, it's a bit confusing to see the benefits of one splitting method over another.
I can't see the difference/Pros-Cons between having
Topic1 -> P0 and Topic 2 -> P0
over Topic 1 -> P0, P1
with a consumer pulling from either 2 topics or a single topic with 2 partitions, where P0 and P1 hold different event types or entities.
The only benefit I can see is that if another consumer needs the Topic 2 data, it is easy to consume.
Regarding automatic topic creation, are there any benefits to it, or will it get out of hand after some time?
Thanks
I would say this decision depends on multiple factors:
Logic/Separation of Concerns: You can decide whether to use multiple topics over multiple partitions based on the logic you are trying to implement. Normally, you need distinct topics for distinct entities. For example, say you want to stream users and companies. It doesn't make much sense to create a single topic with two partitions where the first partition holds users and the second one holds companies. Also, dedicating each partition of a single topic to one entity type won't allow you to implement, for example, per-user message ordering, which can only be achieved using keyed messages (messages with the same key are placed in the same partition).
Host storage capabilities: A partition must fit in the storage of the host machine, while a topic can be distributed across the whole Kafka cluster by splitting it into multiple partitions. The Kafka docs can shed some more light on this:
The partitions in the log serve several purposes. First, they allow
the log to scale beyond a size that will fit on a single server. Each
individual partition must fit on the servers that host it, but a topic
may have many partitions so it can handle an arbitrary amount of data.
Second they act as the unit of parallelism—more on that in a bit.
Throughput: If you have high throughput, it makes more sense to create different topics per entity and split them into multiple partitions so that multiple consumers can join the consumer group. Don't forget that the level of parallelism in Kafka is defined by the number of partitions (and obviously active consumers).
Retention Policy: Message retention in Kafka works on partition/segment level and you need to make sure that the partitioning you've made in conjunction with the desired retention policy you've picked will support your use case.
Coming to your second question now, I am not sure what your requirement is and how this question relates to the first one. When a producer attempts to write a message to a Kafka topic that does not exist, the topic is created automatically if auto.create.topics.enable is set to true. Otherwise, the topic won't get created and your producer will fail.
auto.create.topics.enable: Enable auto creation of topic on the server
Again, this decision should be dependent on your requirements and the desired behaviour. Normally, auto.create.topics.enable should be set to false in production environments in order to mitigate any risks.
Just adding some things on top of Giorgos' answer:
By choosing the second approach over the first one, you would lose a lot of the features that Kafka offers per topic, such as data balancing across brokers, removing topics, consumer groups, ACLs, joins with Kafka Streams, etc.
I think that this flag can be easily compared with automatically creating tables in your database. It's handy to do it in your dev environments but you never want it to happen in production.
Suppose I have a Kafka topic with, say, about 10 partitions. I understand that every consumer group should have 10 consumers reading from the topic at any given time to achieve maximum parallelism.
However, I wanted to know if there is also a direct rule for the number of consumer groups a topic can handle at any given time. (I was asked this in an interview recently.) To the best of my knowledge, it depends on the broker configuration, that is, on how many connections the broker can handle at any given time.
So I just wanted to know: what is the maximum number of consumer groups (each with 10 consumers) that a topic can be scaled to at a given time?
As was said above, up to a few thousand should be okay.
For those who land here (like me) wondering about many thousands of connections (e.g. connecting IoT devices directly to Kafka), it seems that Kafka wasn't designed for that, at least according to this blog.
In Kafka, there is no explicit limit on the number of consumer groups that can be instantiated for a particular topic. However, you should be aware that the more the consumer groups, the bigger the impact on network utilisation.
Conceptually you can think of a consumer group as being a single logical subscriber
that happens to be made up of multiple processes. As a multi-subscriber system,
Kafka naturally supports having any number of consumer groups for a given topic
without duplicating data (additional consumers are actually quite cheap).
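A toy illustration of why extra groups are cheap, in plain Python with no Kafka client: the log is stored once, and each group only adds its own committed offset. The group names and the poll helper are hypothetical and merely mimic the idea; they are not the consumer API.

```python
# One partition's log, stored once on the broker side.
log = ["m0", "m1", "m2", "m3", "m4"]

# Each consumer group keeps its own committed offset into the same log.
committed = {"billing": 0, "analytics": 0, "alerts": 0}

def poll(group: str, max_records: int) -> list[str]:
    """Return the next records for a group and advance only that group's offset."""
    start = committed[group]
    batch = log[start:start + max_records]
    committed[group] = start + len(batch)
    return batch

print(poll("billing", 5))    # reads everything
print(poll("analytics", 2))  # reads at its own pace, unaffected by "billing"
print(committed)             # three offsets, still a single copy of the log
```

What does grow with the number of groups is the read traffic, since every group fetches the full stream, which is the network impact mentioned above.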
As stated in the API docs for Kafka 0.9, Kafka can support any number of consumer groups for a given topic.
Link : http://kafka.apache.org/090/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html
In my IoT project, a custom Java application called "NorthBound" (NB) can manage at most 3000 devices. Devices send data to SouthBound (SB, also a Java application), SB sends the data to Kafka, and NB consumes the messages from Kafka.
To manage around 100K devices, I am planning to start multiple instances (around 35) of NorthBound, but I want the same instance to always receive the messages from the same devices, e.g. Device1's data goes to NB_instance1, Device2's data goes to NB_instance2, etc.
To handle this, I am thinking of creating one topic (Device-Messages) with 35 partitions, so that each NB instance consumes one partition and the same device's data always goes to the same NB instance. Is this the right approach? Or is there a better way?
How many partitions can we create in a Kafka cluster? And what is a recommended value for a cluster of 3 nodes (brokers)?
Currently, we have only 1 node in Kafka. Can we continue with a single node and 35 partitions?
Say on startup I have only 5-6K devices; then I will have only 2 partitions with 2 NB instances. As we gradually add more devices, we will keep adding more partitions and NB instances. Can we do this without restarting Kafka? Is it possible to create partitions dynamically?
Regards,
Krishan
As you can imagine the number of partitions you can have depends on a number of factors.
Assuming you have recent hardware, since Kafka 1.1, you can have 1000s of partitions per broker. Moreover Kafka has been tested with over 100000 partitions in a cluster. Link 1
As a rule of thumb, it's recommended to over-partition a bit in order to allow for future growth in traffic/usage. Kafka allows adding partitions at runtime, but that changes the partitioning of keyed messages, which can be an issue depending on your use case.
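A small sketch of that issue, in plain Python. It assumes the usual hash-modulo scheme for keyed messages; partition_for is a hypothetical stand-in that uses zlib.crc32 instead of the murmur2 hash of Kafka's default partitioner, and the device IDs are made up. Growing the topic from 2 to 35 partitions sends most keys to a different partition than before.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Stand-in partitioner: hash of the key modulo the partition count."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

devices = [f"device-{i}" for i in range(1000)]

before = {d: partition_for(d, 2) for d in devices}   # start-up: 2 partitions
after = {d: partition_for(d, 35) for d in devices}   # later: grown to 35 partitions

moved = [d for d in devices if before[d] != after[d]]
print(f"{len(moved)} of {len(devices)} devices now map to a different partition")
```

Creating the topic with the target partition count from the start avoids the remapping; while there are only 2 NB instances, each one simply consumes several partitions.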
Finally, it's not recommended to run a single broker for production workloads because, if it were to crash or fail, you'd be exposed to an outage and possibly data loss. It's best to have at least 2 brokers with a replication factor of 2, even with only 35 partitions.
What is the maximum number of topics a consumer can subscribe to in Kafka? I am not able to find this value documented anywhere.
If a consumer subscribes to 500,000 or more topics, will performance degrade?
500,000 or more topics in a single Kafka cluster would be a bad design from the broker point of view. You typically want to keep the number of topic partitions down to the low tens of thousands.
If you find yourself thinking you need that many topics in Kafka you might instead want to consider creating a smaller number of topics and having 500,000 or more keys instead. The number of keys in Kafka is unlimited.
To be technical the "maximum" number of topics you could be subscribed to would be constrained by the available memory space for your consumer process (if your topics are listed explicitly then a very large portion of the Java String pool will be your topics). This seems the less likely limiting factor (listing that many topics explicitly is prohibitive).
Another consideration is how the topic assignment data structures are set up at the Group Coordinator brokers. They could run out of space to record the topic assignment, depending on how they do it.
Lastly, and most plausibly, there is the available memory on your Apache ZooKeeper node. ZK keeps ALL data in memory for fast retrieval. ZK is also not sharded, meaning all data MUST fit onto one node. This means there is a limit to the number of topics you can create, which is constrained by the available memory on a ZK node.
Consumption is initiated by the consumers. The act of subscribing to a topic does not mean the consumer will start receiving messages for that topic. So as long as the consumer can poll and process data for that many topics, Kafka should be fine as well.
A consumer is a fairly independent entity from the Kafka cluster, unless you are talking about the built-in command-line consumer that ships with Kafka.
That said, the logic of subscribing to Kafka topics, how many to subscribe to, and how to handle the data is up to the consumer. So the scalability issue here lies with the consumer logic.
Last but not least, I am not sure it is a good idea to consume too many topics within a single consumer. The very purpose of the pub/sub mechanism that Kafka provides through the segregation of messages into various topics is to let separate consumers handle specific categories of messages. So if you want to consume many topics, say a few thousand of them, with a single consumer, why divide the data into separate topics in the first place?