Partitions on Message-Hub Service from Bluemix - ibm-cloud

I know this feature doesn't exist yet, but will it be provided in the future?

Do you mean multiple partitions on a single topic? Right now Message Hub only allows a single partition per topic. This restriction will remain for some time after we exit Beta, but we do aim to allow a flexible number of partitions per topic in the future, although we do not yet have a date for when this will be available.

It is now possible to create multiple partitions for a topic. There is a limit of 100 partitions per Bluemix space; to create more partitions, you need to use a new Bluemix space. It's important to mention that those 100 partitions can be divided among different topics.
e.g. if we have 5 topics in that instance/space, the total number of partitions across those 5 topics can be at most 100.
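As an illustration, here is a minimal sketch of creating a multi-partition topic with the standard Kafka AdminClient, assuming a cluster that accepts the standard admin API; the broker address, topic name and replication factor are placeholders, and on Message Hub the actual bootstrap servers and SASL credentials come from the service bindings:

import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder address: a real Message Hub instance supplies its own
        // bootstrap servers plus SASL credentials in the service bindings.
        props.put("bootstrap.servers", "kafka01.example.net:9093");

        try (AdminClient admin = AdminClient.create(props)) {
            // 10 partitions, replication factor 3; all 10 partitions count
            // against the 100-partition-per-space quota described above.
            NewTopic topic = new NewTopic("recent-search", 10, (short) 3);
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}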

Related

Kafka - Standalone sever - How to decide partitions?

I have a standalone Kafka setup with a single disk and plan to stream over a million records. How do I decide the number of partitions for my topic for better throughput? Does it have to be 1 partition?
Is it recommended to have multiple partitions for a topic on a standalone Kafka server?
Yes, you need multiple partitions even for a single-node Kafka cluster. That is because you can only have as many consumers in a group as you have partitions. If you have a single partition then you can only have a single consumer, and that will limit throughput, especially if you want to stream millions of rows (although the period for those is not specified).
The only real downside is that messages are only consumed in order within the same partition. Other than that, you should go with multiple partitions. You will need to estimate the throughput of a single consumer in order to calculate the number of partitions, then maybe add one or two on top of that.
You can still add partitions later, but it's probably better to start with the right amount and change later as you learn more or as your volume increases/decreases.
There are two main factors to consider:
Number of consumers
Within a consumer group, each partition is assigned to at most one consumer (producers, by contrast, can write to any partition). For this reason, the number of partitions must be at least the number of consumers you want reading in parallel.
Throughput
You must determine the throughput to calculate how many consumers should be in the consumer group. The combined reading capacity of the consumers should be at least as high as the combined writing capacity of the producers.
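To make the throughput factor concrete, here is a back-of-the-envelope sketch; the message rates are made-up numbers, not measurements:

public class PartitionMath {
    /** Minimum partitions so a consumer group can keep up with the producers. */
    static int minPartitions(double producedPerSec, double consumedPerSecPerConsumer) {
        // Each partition is read by at most one consumer in the group, so we
        // need at least ceil(total write rate / per-consumer read rate).
        return (int) Math.ceil(producedPerSec / consumedPerSecPerConsumer);
    }

    public static void main(String[] args) {
        // Hypothetical rates: producers write 5,000 msg/s in total and a
        // single consumer instance processes 400 msg/s.
        System.out.println(minPartitions(5_000, 400)); // prints 13
    }
}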

What is the ideal number of partitions in kafka topic?

I am learning Kafka and trying to create a topic for my recent-search application. The volume of data being pushed to Kafka topics is assumed to be high.
My Kafka cluster has 3 brokers, and there are already topics created for other requirements.
Now, how many partitions should I choose for my recent-search topic? What happens if I do not provide the partition number explicitly? What needs to be considered when choosing the partition count?
This will depend on the throughput of your consumers. If you are producing 100 messages a second and each consumer can process 10 messages a second, then you'll want at least 10 partitions (produce rate / consume rate) with 10 instances of your consumer. If you want this topic to be able to handle future growth, increase the partition count even higher so that you can add more instances of your consumer to handle the new volume.
Another piece of advice would be to make your partition count a highly divisible number so that you can scale up/down consumers while keeping their load balanced. For example, if you choose 10 partitions then you would have to have 1, 2, 5, or 10 instances of your consumer to keep them each processing from the same number of partitions. If you choose 12 partitions instead then you could be balanced with either 1, 2, 3, 4, 6, or 12 instances of your consumer.
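As a small sketch of that divisibility advice, one could round the required partition count up to the next entry in a hand-picked table of highly divisible values (the table here is just an illustrative choice):

import java.util.List;

public class DivisiblePartitions {
    // Counts with many divisors, so the consumer fleet can be resized to
    // several different instance counts while staying evenly balanced.
    private static final List<Integer> CANDIDATES = List.of(6, 12, 24, 36, 60, 120);

    static int roundUp(int required) {
        for (int c : CANDIDATES) {
            if (c >= required) return c;
        }
        return required; // beyond the table, fall back to the raw requirement
    }

    public static void main(String[] args) {
        System.out.println(roundUp(10)); // 12: balanced with 1, 2, 3, 4, 6 or 12 consumers
    }
}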
I would consider evaluating two main things before deciding on the number of partitions.
The first point is how the partitions and the consumers of a consumer group act together. In simple words, one consumer can consume messages from more than one partition, but within a consumer group one partition can't be consumed by more than one consumer. That means it makes sense to have number of partitions >= number of consumers in a consumer group; otherwise you will end up with consumers that have no partition assigned.
The second point is your requirement from a latency vs. throughput point of view.
In simple words,
Latency is the time required to perform some action or to produce some result. Latency is measured in units of time -- hours, minutes, seconds, nanoseconds or clock periods.
Throughput is the number of such actions executed or results produced per unit of time.
Now, coming back to the comparison from a Kafka standpoint: in general, more partitions in a Kafka cluster leads to higher throughput. But you should be careful with this number if you are really looking for low latency.

How many consumer groups can a Kafka topic handle?

Suppose I have a Kafka topic with, say, about 10 partitions. I understand that every consumer group should have 10 consumers reading from the topic at any given time to achieve maximum parallelism.
However, I wanted to know whether there is any direct rule for the number of consumer groups a topic can handle at any given point in time. (I was asked this in an interview recently.) To the best of my knowledge, it depends on the broker's configuration, i.e. how many connections it can handle at any given point in time.
Still, how many consumer groups (each with 10 consumers) can be running at a given point in time?
As was said above, up to a few thousand should be okay.
For those who land here (like me) wondering about many thousands of connections (e.g. connecting IoT devices directly to Kafka), it seems that Kafka wasn't designed for that, at least according to this blog.
In Kafka, there is no explicit limit on the number of consumer groups that can be instantiated for a particular topic. However, you should be aware that the more the consumer groups, the bigger the impact on network utilisation.
Conceptually you can think of a consumer group as being a single logical subscriber that happens to be made up of multiple processes. As a multi-subscriber system, Kafka naturally supports having any number of consumer groups for a given topic without duplicating data (additional consumers are actually quite cheap).
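As a minimal sketch of that multi-subscriber behaviour, the two consumer groups below would each receive every message on the topic, because each group.id commits its own offsets; the group names, topic and broker address are all placeholders:

import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TwoGroupsDemo {
    static Runnable consumerFor(String groupId) {
        return () -> {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder
            props.put("group.id", groupId);
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("events"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> r : records) {
                        // Both groups log the same messages, independently.
                        System.out.printf("[%s] %s%n", groupId, r.value());
                    }
                }
            }
        };
    }

    public static void main(String[] args) {
        new Thread(consumerFor("analytics")).start();
        new Thread(consumerFor("billing")).start();
    }
}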
As given in the API docs for Kafka 0.9, Kafka can support any number of consumer groups for a given topic.
Link : http://kafka.apache.org/090/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html

Can Kafka scale to tens of millions of topics?

I'm currently planning the development of a Device Server and am keen to use Kafka; however, I'm unsure whether it's capable of supporting a paradigm where there is one topic per device, when there could be 10 million+ devices.
I would expect only one partition per topic and limited required storage (<1MB) per topic. If it makes any difference, one topic with millions of partitions could also be considered.
Is anyone able to provide clarification of the scaling limits and expectations of Kafka at this level? In particular, I'm keen to understand the overheads per topic and the effectiveness/feasibility of an individual consumer consuming from ~10k subscribed topics over a single connection.
Any advice much appreciated,
Many thanks
Kafka best practice would be to use keys rather than topics for that many devices. Kafka scales to an unlimited number of keys, but not an unlimited number of topics.
Having one topic with many partitions has some advantages. First of all you can use keys, as already said, to identify the device sending each message. The number of partitions doesn't need to equal the number of devices; it can be far fewer than that. Thanks to the key usage, messages from the same device (same key) always go to the same partition, in order. On the consumer side you gain the ability to run more consumers in the same consumer group, working on different partitions and sharing the message load; you can scale up to as many consumers as there are partitions.
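Here is a minimal sketch of that keyed-producer idea, assuming a local broker and a hypothetical device-readings topic; using the device ID as the record key lets Kafka's default partitioner route every reading from the same device to the same partition:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DeviceProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("device-42") is hashed to pick the partition, so this
            // device's readings stay ordered without needing its own topic.
            producer.send(new ProducerRecord<>("device-readings", "device-42", "{\"temp\": 21.5}"));
        }
    }
}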

How can I scale Kafka consumers?

I'm reading the Kafka documentation and noticed the following line:
Note however that there cannot be more consumer instances in a consumer group than partitions.
Hmm. How can I auto-scale this?
For example, let's say I have a messaging system with hi/lo priorities, so I create a topic for messages with partitions for hi- and lo-priority messages.
If this was RabbitMQ, I'd have an auto-scalable group of consumers assigned to each partition.
If I understand the Kafka model, I can't have more than one consumer per partition in a consumer group, so that design doesn't work for Kafka, right?
OK, so what about more than one consumer group?
That gets around Kafka's limitation, but... if I understand how this works, both consumer groups would be pulling from a partition, for example msg.hi, with their own offsets, so neither would know about the other -- meaning messages would likely be delivered twice!
How can I achieve the capability I had in the Rabbit design with Kafka and still maintain the "queue-ness" of the behavior (I don't want to send a message twice)? What am I missing?
TL;DR
A topic is made up of partitions, and the partition count decides the max number of consumers you can have in a group.
Scenario 1:
When we have only one consumer, it reads all the messages from all the partitions.
Scenario 2:
When you increase the number of consumers in the group, partition reassignment happens: instead of consumer 1 reading all the messages from all the partitions, consumer 2 takes over some of the partitions and shares the load with consumer 1.
Scenario 3:
What happens if I have more consumers than partitions? Each partition is assigned to exactly one consumer, and any additional consumers in the group sit idle unless you increase the number of partitions for the topic.
Summary:
We need to choose the partition count accordingly, since it decides the max number of consumers in the group. Changing the partitions of an existing topic is really NOT recommended, as it could cause issues.
That is, let's assume a producer producing names into a topic with 3 partitions, where all names starting with A-I go to partition 1, J-R to partition 2 and S-Z to partition 3. Let's also assume that we have already produced 1 million messages. Now if you suddenly increase the number of partitions from 3 to 5, the ranges are redrawn: say A-F in partition 1, G-K in partition 2, L-Q in partition 3, R-U in partition 4 and V-Z in partition 5. Names that used to share a partition can now be split across partitions, so the per-key ordering guarantee no longer connects the old messages with the new ones! You need to be aware of this; if it could be a problem, choose the partition count accordingly upfront.
More info is here - http://www.vinsguru.com/kafka-scaling-consumers-out-for-a-consumer-group/
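To see why the remapping happens, note that the partition is derived from a hash of the key modulo the partition count (Kafka's default partitioner actually uses murmur2 over the serialized key; the simplified hash below is only for illustration):

public class RepartitionEffect {
    // Simplified stand-in for the default partitioner's key -> partition step.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"Alice", "Rahul", "Zoe"}) {
            System.out.printf("%s: 3 partitions -> p%d, 5 partitions -> p%d%n",
                    name, partitionFor(name, 3), partitionFor(name, 5));
        }
        // Keys that shared a partition (and therefore an ordering) with 3
        // partitions can land on different partitions once there are 5.
    }
}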
Your assumption about messages being consumed twice is correct (since each group consumes 100% of messages from a topic).
I agree with David. Moreover, I suggest that you create more partitions than you really need, which leaves you some headroom to increase the number of threads in the group when such a need arises.
You can always increase the number of partitions later (and/or add additional brokers), but it's nice to have that already done, so that you only have to increase the number of threads and be done with it (those situations usually require a quick response, so you should do all the preparation you can in advance).
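For reference, growing a topic later is a one-liner with the AdminClient; the broker address and topic name are placeholders. Note this only adds partitions, so keyed messages produced afterwards may map to different partitions than before:

import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewPartitions;

public class AddPartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // Grow the "messages" topic to 24 partitions in place.
            admin.createPartitions(Map.of("messages", NewPartitions.increaseTo(24)))
                 .all().get();
        }
    }
}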
Just create a bunch of partitions for hi and lo. 12 is a good number. So is 60. Just pick a number of partitions that matches how much maximum parallelization you want.
Honestly, although I personally would make msg.hi and msg.lo be different topics entirely, that's not a requirement -- you can do custom partitioning to divide messages between partitions.
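A hypothetical sketch of such a partitioner, reserving roughly a quarter of the partitions for hi-priority messages (the "hi" key prefix and the one-quarter split are invented conventions for illustration):

import java.util.Map;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class PriorityPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (numPartitions <= 1) return 0;
        // Hypothetical convention: keys starting with "hi" are high priority.
        int hiCount = Math.max(1, numPartitions / 4);
        boolean hi = key != null && key.toString().startsWith("hi");
        int bucket = value == null ? 0 : (value.hashCode() & 0x7fffffff);
        return hi ? bucket % hiCount
                  : hiCount + bucket % (numPartitions - hiCount);
    }

    @Override public void close() {}
    @Override public void configure(Map<String, ?> configs) {}
}

The producer would pick it up via props.put("partitioner.class", PriorityPartitioner.class.getName()).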
You can also use an AI-based autoscaler like this one: https://www.confluent.io/events/kafka-summit-americas-2021/intelligent-auto-scaling-of-kafka-consumers-with-workload-prediction/
This scaler calculates the right number of consumer pods based on workload prediction and target KPI metrics.