Can I use Kafka as the messaging queue in a system that receives end-user events as messages? There will be short bursts of load (lasting a couple of hours to a day) each month.
The objective is to add additional consumers (workers) during these peak-load periods and remove them when they are no longer being utilized (scaling up and down).
However, it seems that each partition in a topic can be read by at most one consumer in a consumer group, so the maximum number of consumers processing the topic in parallel is limited to the number of partitions in the topic.
Is Kafka suitable for use cases like this, where I would like to increase the number of workers 100-fold for a short burst?
Related
I have a standalone Kafka setup with a single disk and am planning to stream over a million records. How do I decide the number of partitions for my topic for better throughput? Does it have to be 1 partition?
Is it recommended to have multiple partitions for a topic on a standalone Kafka server?
Yes, you need multiple partitions even for a single-node Kafka cluster. That is because you can only have as many consumers as you have partitions. If you have a single partition then you can only have a single consumer, and that will limit throughput, especially if you want to stream millions of records (although the time period for those is not specified).
The only real downside is that messages are only consumed in order within the same partition. Other than that, you should go with multiple partitions. You will need to estimate the throughput of a single consumer in order to calculate the number of partitions, then maybe add one or two on top of that.
You can still add partitions later, but it's probably better to start with the right number and adjust later as you learn more or as your volume increases/decreases.
There are two main factors to consider:
Number of producers and consumers
Within a consumer group, each partition is assigned to at most one consumer. For this reason, the number of partitions must be at least the number of consumers you want running in parallel. (Producers are not constrained this way: a single producer can write to any or all partitions.)
Throughput
You must determine the throughput to calculate how many consumers should be in the consumer group. The combined read capacity of the consumers should be at least as high as the combined write capacity of the producers.
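A worked example with illustrative numbers: if producers write a combined 10 MB/s and a single consumer can process 2 MB/s, you need at least 10 / 2 = 5 consumers in the group to keep up, and therefore at least 5 partitions. More generally, a common rule of thumb is to size the partition count as max(target throughput ÷ per-partition producer throughput, target throughput ÷ per-consumer throughput).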
I have a system where load is not constant. We may get 1000 requests a day or no requests at all.
We use Kafka to pass the requests between services. We have provisioned an average number of Kafka consumers to reduce cost. Now my Kafka consumers will sit idle if no requests are received that day, and there will be lag if too many requests are received.
We want to keep these consumers in autoscale mode, such that the number of servers (Kafka consumers) increases if there is a spike in the number of requests. Once the number of requests goes down, we remove the servers. Therefore, the Kafka partitions would have to be increased or decreased accordingly.
Kafka allows increasing the number of partitions. In that case, how can we decrease Kafka partitions dynamically?
Is there any other solution to handle this Auto-scaling?
Scaling up partitions will not fix your lag problems in the short term, since no data is moved between partitions when you do this; existing (or new) consumers are still stuck reading the data in the previous partitions.
It's not possible to decrease partitions, and it's not possible to scale consumers beyond the partition count.
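For reference, increasing the partition count programmatically is a one-line Admin API call. A minimal sketch (topic name, target count, and broker address are assumptions for illustration):

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

public class IncreasePartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // Raise "requests" to 12 partitions in total. This cannot be undone,
            // and records already written are NOT redistributed to new partitions.
            admin.createPartitions(Map.of("requests", NewPartitions.increaseTo(12)))
                 .all().get();
        }
    }
}
```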
If you are able to sacrifice processing order for consumption speed, you can separate the consuming threads from the working threads, as hinted at in the KafkaConsumer javadoc; then you would be able to scale these worker threads.
Since you are thinking about modifying the partition count, I'm guessing processing order isn't a problem for you.
have one or more consumer threads that do all data consumption and hands off ConsumerRecords instances to a blocking queue consumed by a pool of processor threads that actually handle the record processing.
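A minimal sketch of that pattern, assuming a local broker, a topic named events, and a group id burst-workers (all placeholders), with the offset-handling caveats simplified:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DecoupledConsumer {
    public static void main(String[] args) throws InterruptedException {
        // Bounded hand-off queue: the polling loop blocks on put() when workers
        // fall behind, which gives natural back-pressure.
        BlockingQueue<ConsumerRecord<String, String>> queue = new LinkedBlockingQueue<>(10_000);

        // The worker pool is what you scale up and down; its size is independent
        // of the topic's partition count.
        ExecutorService workers = Executors.newFixedThreadPool(16);
        for (int i = 0; i < 16; i++) {
            workers.execute(() -> {
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        process(queue.take());
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "burst-workers");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");

        // Caveat: with auto-commit (the default), offsets may be committed before
        // a worker finishes a record, so a crash can drop in-flight work.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    queue.put(record);
                }
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // Application-specific work goes here.
    }
}
```

With this shape, a topic with only a handful of partitions can still fan work out to dozens of worker threads, at the cost of losing per-partition ordering across workers and taking on manual offset management if you need at-least-once guarantees.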
A single consumer can consume multiple partitions. Therefore partition your work for the largest anticipated parallel requirement, and then scale your number of consumers as required.
For example, if you think you need 32 parallel consumers, you would give your Kafka topic 32 partitions.
You can run 32 consumers (each gets one partition), or eight consumers (each gets four partitions) or just one (which gets all 32 partitions). Or any number of consumers in between. Kafka's protocol ensures that within the consumer group all partitions are consumed, and will rebalance as and when consumers are added or removed.
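If it helps, creating the topic in this example programmatically might look like the following sketch with the Java Admin API (topic name, replication factor, and broker address are assumptions):

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // 32 partitions = up to 32 consumers in one group working in parallel.
            admin.createTopics(List.of(new NewTopic("events", 32, (short) 3)))
                 .all().get();
        }
    }
}
```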
Is there any limit on the number of consumers or consumer groups in Kafka?
I am planning to push 200 MB of data every 10 mins to a topic and have 200+ distinct consumers listen and consume from this topic. Is there any other recommended way to do this?
As Rohit's answer states, there's no such limit.
Regarding your issue, it seems like you want to achieve some kind of parallelization of consumption. If you launch 200 consumers with 200 different consumer groups, each consumer will read all the data independently, so you'll have 200 threads reading the same 200 MB every 10 minutes (200 × 200 MB = 40 GB received every 10 minutes). I guess you wanted every consumer to read 1 MB every 10 minutes with your approach, but that's not how it works.
If the logic implemented by each consumer is the same, you shouldn't declare more than one consumer group. If you declare two consumer groups, each one will read the same data, and you'll just repeat the job, duplicating the output. Set different consumer groups only if the job to be done on the topic's records is different: for example, one consumer group must store the records in a database, while another consumer group must visualize the data in Grafana. Those are two different processing mechanisms, so each one must read all the data on its own. This is not the only reason to declare different consumer groups, just one example; there are multiple justifications for declaring more than one consumer group for a topic.
Imagine a scenario where the only job is storing the messages in a database. If you declare two consumer groups and launch your consumers, what you'll get is duplicate values stored in your database, as the first consumer group is doing the same work as the second. Not only are you re-reading from Kafka, you are re-storing the same messages in the database.
In order to launch multiple consumers that efficiently share the work (so that, for example, 4 consumers each read 50 MB), you must partition your topic.
Only one consumer thread from the same consumer group can read from a specific partition. If you have 4 partitions in that topic and 4 consumer threads that share the same consumer group, launching them will lead to each thread reading from one partition. If you launch two consumers, each will be assigned 2 partitions.
In this scenario, you do have a limit on the number of consumers concurrently reading if they share the same consumer group: the number of partitions of that topic. If you launch a 5th consumer thread, it will block/wait, because it wasn't assigned any partition. In the example, consumer 5 waits until a partition becomes available for it (which may be forever).
What I suggest is: decide how many consumer threads you'll need to consume the data and partition the topic based on that. If you, for example, partition the topic into 8 partitions, you'll be able to launch 8 consumers from the same consumer group. Each one will then read, more or less (depending on the producer's partitioner), 25 MB (200/8) of the incoming data, efficiently sharing the workload: each consumer will read from its own partition.
If you launch 200 consumers with 200 different consumer groups, you'll just multiply the work to be done by 200, as every single consumer will read the data from start to end.
If you launch 200 consumers with the same consumer group and the topic has a single partition, you'll have one thread doing all the work and 199 idle consumers.
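In code, the difference between those two outcomes comes down to the group.id each consumer instance is started with. A small sketch (class and group names are illustrative):

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupIdDemo {
    // Consumers created with the SAME groupId split the topic's partitions
    // between them, so each record is processed once. Consumers created with
    // DISTINCT groupIds each re-read the entire topic independently.
    static KafkaConsumer<String, String> newConsumer(String groupId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        return new KafkaConsumer<>(props);
    }
}
```

Starting all parallel workers via newConsumer("db-writer") makes them divide the partitions among themselves; a second, differently named group (say "grafana-feed") should exist only for a genuinely different job.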
In Kafka, there is no limit on the number of consumer groups for a particular topic. However, increasing the number of consumer groups increases network utilization.
Worth noting that newer versions of Kafka store offsets in an internal Kafka topic called __consumer_offsets.
I am learning Kafka and trying to create a topic for my recent-search application. The volume of data being pushed to Kafka topics is assumed to be high.
My Kafka cluster has 3 brokers, and there are already topics created for other requirements.
Now, what number of partitions should I choose for my recent-search topic? What happens if I do not provide the partition number explicitly? What needs to be considered when choosing the partition count?
This will depend on the throughput of your consumers. If you are producing 100 messages a second and your consumers can each process 10 messages a second, then you'll want at least 10 partitions (produce rate ÷ consume rate) with 10 instances of your consumer. If you want this topic to be able to handle future growth, then you'll want to increase the partition count even higher so that you can add more instances of your consumer to handle the new volume.
Another piece of advice would be to make your partition count a highly divisible number so that you can scale up/down consumers while keeping their load balanced. For example, if you choose 10 partitions then you would have to have 1, 2, 5, or 10 instances of your consumer to keep them each processing from the same number of partitions. If you choose 12 partitions instead then you could be balanced with either 1, 2, 3, 4, 6, or 12 instances of your consumer.
I would consider evaluating two main things before deciding on the number of partitions.
First: how the partitions and the consumers of a consumer group act together. In simple words, one consumer can consume messages from more than one partition, but one partition can't be consumed by more than one consumer in the same group. That means it makes sense to have number of partitions >= number of consumers in a consumer group; otherwise you will end up with consumers that have no partition assigned.
Second: what your requirements are from a latency vs. throughput point of view.
In simple words,
Latency is the time required to perform some action or to produce some result. Latency is measured in units of time -- hours, minutes, seconds, nanoseconds or clock periods.
Throughput is the number of such actions executed or results produced per unit of time.
Now, coming back to the comparison from a Kafka standpoint: in general, more partitions in a Kafka cluster lead to higher throughput. But you should be careful with this number if you are really looking for low latency.
Let's say we have one topic "topic-1" in Kafka with 5 partitions.
Consumer Group-A has 5 consumers attached to "topic-1", one per partition. Due to a large workload, a large number of messages get published. Now we want to scale up / add more consumers to Group-A to process the messages.
How can we add consumers on demand to the same group?
Is there any way to do it from code, so that each message still gets consumed by a single consumer?
Once the load decreases, shut down a few consumers from the same group.
What I would suggest is having some partitions as a buffer for when the load increases.
For example, if having 5 partitions is enough for normal load, I would suggest having 15 partitions for that topic but only 5 consumers at the start.
Then, when the load increases, keep adding consumers, preferably on other machines, until the load decreases.
You can have Kubernetes do the autoscaling for you.
Kafka's design suggests that the number of consumers correspond to the number of partitions. Increasing the number of consumers beyond that will not help, as you will have at most one consumer per partition anyway and the rest will remain idle. If you need to speed things up, you can read the data from Kafka and process it in other threads. You can scale the number of processing threads, but you will need to program it yourself.