How does Kafka handle a consumer which is running slower than other consumers? - apache-kafka

Let's say I have 20 partitions and five workers. Each partition is assigned a worker. However, one worker is running slower than the other machines. It's still processing (that is, not slow consumer described here), but at 60% rate of the other machines. This could be because the worker is running on a slower VM on AWS EC2, a broken disk or CPU or whatnot. Does Kafka handle rebalancing gracefully somehow to give the slow worker fewer partitions?

Kafka doesn't really concern itself with how fast messages are being consumed. It doesn't even get involved with how many consumers there are or how many times each message is read. Kafka just commits messages to partitions and ages them out at the configured time.
It's the responsibility of the group of consumers to make sure that the messages are being read evenly and in a timely fashion. In your case, you have two problems: The reading of one set of partitions lags and then then processing of the messages from those partitions lags.
For the actual consumption of messages from the topic, you'll have to use the Kafka metadata API's to track the relative loads each consumer faces, whether by skewed partitioning or because the consumers are running at different speeds. You either have to re-allocate partitions to consumers to give the slow consumers less work or randomly re-assign consumers to partitions in the hope of eventually evening out the workload over time.
To better balance the processing of messages, you should factor out the reading of the messages from the processing of the messages - something like the Storm streaming model. You still have to programmatically monitor the backlogs into the processing logic, but you'd have the ability to move work to faster nodes in order to balance the work.

Related

Consuming messages in a Kafka topic ASAP

Imagine a scenario in which a producer is producing 100 messages per second, and we're working on a system that consuming messages ASAP matters a lot, even 5 seconds delay might result in a decision not to take care of that message anymore. also, the order of messages does not matter.
So I don't want to use a basic queue and a single pod listening on a single partition to consume messages, since in order to consume a message, the consumer needs to make multiple remote API calls and this might take time.
In such a scenario, I'm thinking of a single Kafka topic, with 100 partitions. and for each partition, I'm gonna have a separate machine (pod) listening for partitions 0 to 99.
Am I thinking right? this is my first project with Kafka. this seems a little weird to me.
For your use case, think of partitions = max number of instances of the service consuming data. Don't create extra partitions if you'll have 8 instances. This will have a negative impact if consumers need to be rebalanced and probably won't give you any performace improvement. Also 100 messages/s is very, very little, you can make this work with almost any technology.
To get the maximum performance I would suggest:
Use a round robin partitioner
Find a Parallel consumer implementation for your platform (for jvm)
And there a few producer and consumer properties that you'll need to change, but they depend your environment. For example batch.size, linger.ms, etc. I would also check about the need to set acks=all as it might be ok for you to lose data if a broker dies given that old data is of no use.
One warning: In Java, the standard kafka consumer is single threaded. This surprises many people and I'm not sure if the same is true for other platforms. So having 100s of partitions won't give any performance benefit with these consumers, and that's why it's important to use a Parallel Consumer.
One more warning: Kafka is a complex broker. It's trivial to start using it, but it's a very bumpy journey to use it correctly.
And a note: One of the benefits of Kafka is that it keeps the messages rather than delete them once they are consumed. If messages older than 5 seconds are useless for you, Kafka might be the wrong technology and using a more traditional broker might be easier (activeMQ, rabbitMQ or go to blazing fast ones like zeromq)
Your bottleneck is your application processing the event, not Kafka.
when you have ten consumers, there is overhead for connecting each consumer to Kafka so it will lower the performance.
I advise focusing on your application performance rather than message broker.
Kafka p99 Latency is 5 ms with 200 MB/s load.
https://developer.confluent.io/learn/kafka-performance/

Increase or decrease Kafka partitions dynamically

I have a system where load is not constant. We may get 1000 requests a day or no requests at all.
We use Kafka to pass on the requests between services. We have kept average number of Kafka consumers to reduce cost incurred. Now my consumers of Kafka will sit ideal if there are no requests received that day, and there will be lag if too many requests are received.
We want to keep these consumers on Autoscale mode, such that my number of servers(Kafka consumer) will increase if there is a spike in number of requests. Once the number of requests get reduced, we will remove the servers. Therefore, the Kafka partitions have to be increased or decreased accordingly
Kafka allows increasing partition. In this case, how can we decrease Kafka partitions dynamically?
Is there any other solution to handle this Auto-scaling?
Scaling up partitions will not fix your lag problems in the short term since no data is moved between partitions when you do this, so existing (or new) consumers are still stuck with reading the data in the previous partitions.
It's not possible to decrease partitions and its not possible to scale consumers beyond the partition count.
If you are able to sacrifice processing order for consumption speed, you can separate the consuming threads and working threads, as hinted at in the KafkaConsumer javadoc, then you would be able to scale these worker threads.
Since you are thinking about modifying the partition counts, then I'm guessing processing order isn't a problem.
have one or more consumer threads that do all data consumption and hands off ConsumerRecords instances to a blocking queue consumed by a pool of processor threads that actually handle the record processing.
A single consumer can consume multiple partitions. Therefore partition your work for the largest anticipated parallel requirement, and then scale your number of consumers as required.
For example, if you think you need 32 parallel consumers you would give you Kafka topic 32 partitions.
You can run 32 consumers (each gets one partition), or eight consumers (each gets four partitions) or just one (which gets all 32 partitions). Or any number of consumers in between. Kafka's protocol ensures that within the consumer group all partitions are consumed, and will rebalance as and when consumers are added or removed.

Kafka Should Number of Consumer Threads equal number of Topic Partitions

Pretend you determined that you wanted to use exactly 8 consumer threads for your application.
Would there be any difference in processing if a Kafka topic was set up as having 8 partitions vs 16 partitions?
In the first case, each thread is assigned to a single partition with twice the data, and in the second case each thread is assigned to two partitions with half the data each. It seems to me that there is no difference between these two setups.
I believe that, on the consumer side there could be a difference, if your threads are not CPU-constrained (and network is not at capacity). Assuming infinite data on the Kafka broker, or a lagging consumer, since each thread is consuming from two partitions in your second example, the kafka broker is able to send more data than if each thread had only one partition assigned. Kafka has a limit on the maximum amount of bytes that can be retrieved per fetch (replica.fetch.max.bytes in the config), so if you 2x the partitions, you can increase capacity, assuming the data is available.
When configured properly, and assuming ideal conditions, Kafka will serve data from page cache, so it can blast data down to consumers, and 90% of the time, the bottleneck will be the amount of partitions/available CPU on the consumer side. In general, the more partitions you have, the faster you can consume from Kafka, until you are CPU or bandwidth constrained on the consumer, at which point it won't matter if you have more or less partitions, since you're consuming data as fast as you can anyway.
An additional thing to take into account is that there could be more consumer commits being sent back to the brokers, since there are now more partitions, which means some additional overhead/crosstalk in the cluster. It's probably not 2x the commits, but probably higher than 1x the commits from the first scenario.
An important thing to remember is to, whenever possible, do the actual message processing on your consumer off-thread. That is, do not process the inbound messages on the same thread that is consuming/polling from Kafka. It might work at first, but you're going to start running into issues if your processing takes longer, there's a delay, huge volume increase on the inbound side, etc. Whenever possible, throw the inbound messages on a queue, and let another thread worry about processing/parsing them.
Finally, you don't want to take this to the extreme, and configure 1000 partitions if you don't have to. Each partition requires overhead on commits, zookeeper znodes, consumer rebalancing time, startup time, etc. So, I would suggest benchmarking different scenarios, and seeing what works best for you. In general, anything from 2-4 partitions per consumer thread has worked well for me in the past, even with very high message loads (topics with 50K+ messages per second, each ~1KB).

Kafka - Best practices in case of slow processing consumer. How to achieve more parallelism?

I'm aware that the maximum number of active consumers in a consumer group is the number of partitions of a topic.
What's the best practice in case of slow processing consumers? How to achieve more parallelism?
An example: A topic with 6 partitions and thousands of messages per second produced from Producers. So I have at most 6 consumers in the group. Consider that processing those messages is complex and the consumers are much slower than the producers. The result is that the consumers are always behind the last offset and the lag is increasing.
In a traditional MQ system, we simply add more and more consumers to stay up to date.
How to achieve this with Kafka, since the total of the consumers in a group is at most the number of partitions? Should I:
Configure the topic to have more partitions allowing more consumers per group?
Route the message from the consumer to a traditional MQ Queue (but lose the ordering)?
What's the best practice for this situation?
In Kafka, partitions are the unit of parallelism.
Without knowing our exact use case and requirements it's hard to come up with precise recommendations but there are a few options.
First you should really consider having more partitions. 6 partitions is relatively small, you could easily have 60, 120 or even more partitions (and the corresponding number of consumers). Suddenly the amount of work each consumers has to do is significantly reduced.
Also if your requirements allow, you can also consume at a fast rate and spread the processing of records across many workers. In solutions like this it's harder to maintain ordering but if you don't need it then you can consider it.
I'm not sure how routing messages through a MQ Queue would really help in this scenario. If you are still reading slower than writing the amount of data in the queue will grow till you have no disk space left.
Kafka is better designed to serve as buffer between your producers and consumers so just ensure you have retention limits on your topics that allow some flexibility on the consumer side without losing data.

Kafka Random Access to Logs

I am trying to implement a way to randomly access messages from Kafka, by using KafkaConsumer.assign(partition), KafkaConsumer.seek(partition, offset).
And then read poll for a single message.
Yet i can't get past 500 messages per second in this case. In comparison if i "subscribe" to the partition i am getting 100,000+ msg/sec. (#1000 bytes msg size)
I've tried:
Broker, Zookeeper, Consumer on the same host and on different hosts. (no replication is used)
1 and 15 partitions
default threads configuration in "server.properties" and increased to 20 (io and network)
Single consumer assigned to a different partition each time and one consumer per partition
Single thread to consume and multiple threads to consume (calling multiple different consumers)
Adding two brokers and a new topic with it's partitions on both brokers
Starting multiple Kafka Consumer Processes
Changing message sizes 5k, 50k, 100k -
In all cases the minimum i get is ~200 msg/sec. And the maximum is 500 if i use 2-3 threads. But going above, makes the ".poll()", call take longer and longer (starting from 3-4 ms on a single thread to 40-50 ms with 10 threads).
My naive kafka understanding is that the consumer opens a connection to the broker and sends a request to retrieve a small portion of it's log. While all of this has some involved latency, and retrieving a batch of messages will be much better - i would imagine that it would scale with the number of receivers involved, with the expense of increased server usage on both the VM running the consumers and the VM running the broker. But both of them are idling.
So apparently there is some synchronization happening on broker side, but i can't figure out if it is due to my usage of Kafka or some inherent limitation of using .seek
I would appreaciate some hints of whether i should try something else, or this is all i can get.
Kafka is a streaming platform by design. It means there are many, many things has been developed for accelerating sequential access. Storing messages in batches is just one thing. When you use poll() you utilize Kafka in such way and Kafka do its best. Random access is something for what Kafka don't designed.
If you want fast random access to distributed big data you would want something else. For example, distributed DB like Cassandra or in-memory system like Hazelcast.
Also you could want to transform Kafka stream to another one which would allow you to use sequential way.