Slow consumer impact on a topic partition

Let there be a single Kafka topic with just a single partition configured with an infinite retention policy. Let there be two consumers, Fast and Slow.
The Fast consumer processes messages as they appear and has almost no lag.
The Slow consumer tends to have a significant lag, e.g. two days' worth of messages. Slow occasionally catches up to Fast, but this happens rarely; there is usually a significant lag.
Will this setup, with two consumers reading the same partition at very different speeds, cause negative side effects on a Kafka broker? Could there be an increased I/O cost to retrieve older messages from disk for the Slow consumer?

A lagging consumer won't be able to read its data from the OS page cache, so there will be a disk I/O cost for slow consumers. On the other hand, once the slow consumer starts reading, its reads are sequential, so the OS read-ahead pulls the following messages into the page cache. As long as the delay between reads is not too large, the consumer can find the next messages in the cache.
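A toy model can make the cache argument concrete. The sketch below is a hypothetical LRU page cache with read-ahead (ToyPageCache, the page counts and the read-ahead size are all assumptions, not Kafka or kernel internals): the producer's writes keep the tail of the log cached, so the Fast consumer hits the cache, while the Slow consumer's first read of old data goes to disk and read-ahead then serves most of its following sequential reads.

```python
from collections import OrderedDict


class ToyPageCache:
    """Hypothetical LRU page cache with sequential read-ahead (a model, not Kafka code)."""

    def __init__(self, capacity_pages, readahead_pages):
        self.capacity = capacity_pages
        self.readahead = readahead_pages
        self.pages = OrderedDict()  # page number -> True, least recently used first
        self.disk_reads = 0

    def _insert(self, page):
        self.pages[page] = True
        self.pages.move_to_end(page)
        while len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict the least recently used page

    def write(self, page):
        # Producer appends land in the page cache, so the log tail stays "hot".
        self._insert(page)

    def read(self, page, log_end):
        if page in self.pages:
            self.pages.move_to_end(page)
            return "hit"
        # Miss: one disk read that also pulls the following pages in (read-ahead).
        self.disk_reads += 1
        for p in range(page, min(page + self.readahead + 1, log_end)):
            self._insert(p)
        return "miss"


cache = ToyPageCache(capacity_pages=100, readahead_pages=8)
log_end = 1000
for p in range(log_end):
    cache.write(p)

fast = cache.read(log_end - 1, log_end)             # Fast reads the tail of the log
slow = [cache.read(p, log_end) for p in range(16)]  # Slow reads old pages sequentially
print(fast, slow.count("miss"), cache.disk_reads)
```

In this model, 16 sequential reads of old data cost only 2 disk reads, which is the "sequential I/O" effect described above; the real savings depend on the OS and on memory pressure.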

Related

Consuming messages in a Kafka topic ASAP

Imagine a scenario in which a producer produces 100 messages per second, and we're working on a system where consuming messages ASAP matters a lot: even a 5-second delay might lead to a decision not to handle that message at all. Also, the order of messages does not matter.
So I don't want to use a basic queue with a single pod listening on a single partition, since to consume a message the consumer needs to make multiple remote API calls, which might take time.
In such a scenario, I'm thinking of a single Kafka topic with 100 partitions, and a separate machine (pod) for each partition, listening on partitions 0 to 99.
Am I thinking right? This is my first project with Kafka, and this seems a little weird to me.
For your use case, think of the number of partitions as the maximum number of service instances consuming the data. Don't create extra partitions if you'll only have 8 instances: they will have a negative impact when consumers need to be rebalanced and probably won't give you any performance improvement. Also, 100 messages/s is very, very little; you can make this work with almost any technology.
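A simplified round-robin assignment shows why the partition count caps the useful number of consumers. assign_round_robin is a hypothetical helper; Kafka's real assignors (range, round-robin, sticky) differ in detail, but in all of them each partition is owned by at most one consumer in the group. Consumers beyond the partition count sit idle, and a partition count that doesn't divide evenly leaves some instances with more work than others.

```python
def assign_round_robin(num_partitions, consumers):
    """Spread partitions 0..num_partitions-1 over consumers in turn (simplified model)."""
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment


eight = assign_round_robin(8, [f"c{i}" for i in range(8)])    # one partition each
ten = assign_round_robin(8, [f"c{i}" for i in range(10)])     # two consumers get nothing
idle = [c for c, parts in ten.items() if not parts]
uneven = assign_round_robin(100, [f"c{i}" for i in range(8)])  # 12 or 13 partitions each
print(idle, sorted({len(parts) for parts in uneven.values()}))
```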
To get the maximum performance I would suggest:
Use a round-robin partitioner
Find a parallel consumer implementation for your platform (there is one for the JVM)
Tune a few producer and consumer properties, such as batch.size, linger.ms, etc.; the right values depend on your environment. I would also reconsider whether you need acks=all, since it might be OK for you to lose data if a broker dies, given that old data is of no use to you.
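The first suggestion can be illustrated with a toy partitioner that ignores the key and cycles through the partitions. This is a sketch of the idea only, and ToyRoundRobinPartitioner is hypothetical. In the Java producer, the built-in org.apache.kafka.clients.producer.RoundRobinPartitioner can be selected through the partitioner.class setting, but check its behavior in your client version.

```python
import itertools
from collections import Counter


class ToyRoundRobinPartitioner:
    """Hypothetical stand-in for a round-robin partitioner: ignores keys, cycles partitions."""

    def __init__(self, num_partitions):
        self._next = itertools.cycle(range(num_partitions))

    def partition(self, key, value):
        # The key is deliberately unused: ordering does not matter in this use case.
        return next(self._next)


partitioner = ToyRoundRobinPartitioner(num_partitions=100)
counts = Counter(partitioner.partition(None, f"msg-{i}") for i in range(1000))
print(len(counts), set(counts.values()))
```

Every partition receives the same share of messages, so no single consumer becomes a hot spot.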
One warning: in Java, the standard Kafka consumer is single-threaded. This surprises many people, and I'm not sure whether the same is true on other platforms. So having hundreds of partitions won't give any performance benefit with these consumers, which is why it's important to use a parallel consumer.
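The core difficulty a parallel consumer has to solve is committing offsets when records from one partition finish out of order. A minimal sketch, assuming a thread pool stands in for the parallel workers and handle() stands in for the remote API calls (both hypothetical; this is not the API of any parallel consumer library). Since a committed Kafka offset is the position of the next record to read, only the offset just past the longest run of finished records is safe to commit.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def committable_offset(start, completed):
    """Offset to commit for a partition: one past the highest contiguous completed offset."""
    offset = start
    while offset in completed:
        offset += 1
    return offset


def handle(offset, value):
    # Stand-in for the slow remote API calls; tasks finish in a random order.
    time.sleep(random.uniform(0, 0.01))
    return offset


def process_in_parallel(records, start, workers=10):
    """Process records from ONE partition concurrently, tracking what is safe to commit."""
    completed = set()
    commits = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(handle, off, val) for off, val in records]
        for fut in as_completed(futures):
            completed.add(fut.result())
            commits.append(committable_offset(start, completed))
    return commits


records = [(offset, f"value-{offset}") for offset in range(100, 120)]
commits = process_in_parallel(records, start=100)
print(commits[-1])
```

If offset 102 is still running while 103 to 110 are done, the commit position stays at 102; on a crash those later records are reprocessed, which is the usual at-least-once trade-off.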
One more warning: Kafka is a complex broker. It's trivial to start using it, but it's a very bumpy journey to use it correctly.
And a note: one of the benefits of Kafka is that it keeps messages rather than deleting them once they are consumed. If messages older than 5 seconds are useless to you, Kafka might be the wrong technology, and a more traditional broker (ActiveMQ, RabbitMQ) or a blazing-fast option like ZeroMQ might be easier.
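If Kafka is kept anyway, the 5-second rule from the question can be enforced on the consumer side by comparing each record's timestamp (milliseconds since the epoch) with the current time. A minimal sketch; is_stale and MAX_AGE_MS are hypothetical names, and the check assumes producer and consumer clocks are reasonably synchronized when records carry the producer's create time.

```python
import time

MAX_AGE_MS = 5_000  # the 5-second cutoff from the question


def is_stale(record_timestamp_ms, now_ms=None, max_age_ms=MAX_AGE_MS):
    """True if a record is older than max_age_ms and should be skipped."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return now_ms - record_timestamp_ms > max_age_ms


now = 1_700_000_000_000
print(is_stale(now - 1_000, now), is_stale(now - 6_000, now))
```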
Your bottleneck is your application processing the event, not Kafka.
When you have ten consumers, there is overhead in connecting each consumer to Kafka, so it will lower performance.
I advise focusing on your application's performance rather than on the message broker.
Kafka's p99 latency is 5 ms at a 200 MB/s load (per Confluent's benchmark):
https://developer.confluent.io/learn/kafka-performance/
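The point that processing, not Kafka, is the bottleneck can be quantified with Little's law: the average number of messages in flight equals the arrival rate times the time each message spends being processed. The 0.5 s per message below is an assumed figure for the remote API calls, not a number from the question.

```python
import math


def required_concurrency(arrival_rate_per_s, processing_time_s):
    """Little's law: average messages in flight = arrival rate * time in the system."""
    return math.ceil(arrival_rate_per_s * processing_time_s)


# 100 messages/s from the question; 0.5 s of remote API calls per message is an assumption.
print(required_concurrency(100, 0.5))
```

At 100 messages/s and 0.5 s each, about 50 calls need to be in flight at once; that concurrency can come from threads or asynchronous calls inside a few consumers rather than from 100 partitions.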

Number of Messages Per Transaction in Kafka

Are there guidelines on how many messages and/or partitions can be involved in a Kafka Producer Transaction before performance really starts to suffer?
Obviously, the more partitions are involved, the more coordination is required. But in Kafka Streams, for example, the default commit interval is 100ms. In that time, normally hundreds of messages can be processed by a Stream thread. And depending on the topology, that may involve many different output topics (and by extension, partitions). Does that mean that it's safe to push transactions with hundreds of messages and dozens of partitions?
I can't find anything about this in the documentation.
https://www.confluent.io/blog/transactions-apache-kafka/
I found this in Confluent's blog. It turns out that the overhead doesn't grow with the number of messages in a transaction; rather, you actually increase throughput if you include more messages per transaction.
In fact, the average overhead per message decreases as you add more messages to the transaction.
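A simple cost model shows why: if a transaction has a fixed cost plus a cost per participating partition, and no per-message cost, then that overhead is spread over more messages as the transaction grows. The millisecond figures below are made-up illustrations, not measurements.

```python
def overhead_per_message_ms(messages, partitions, fixed_ms=10.0, per_partition_ms=1.0):
    """Average transaction overhead per message under a simple fixed-cost model.

    fixed_ms and per_partition_ms are illustrative assumptions, not Kafka measurements:
    the transaction's cost depends on the partitions involved, not on the message count.
    """
    total_overhead = fixed_ms + per_partition_ms * partitions
    return total_overhead / messages


samples = [overhead_per_message_ms(n, partitions=20) for n in (10, 100, 1000)]
print(samples)
```

Under this model, the per-partition term is also why transactions spanning many partitions cost more, as the question suspects, while the message count only dilutes the overhead.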

Kafka Should Number of Consumer Threads equal number of Topic Partitions

Pretend you determined that you wanted to use exactly 8 consumer threads for your application.
Would there be any difference in processing if a Kafka topic was set up as having 8 partitions vs 16 partitions?
In the first case, each thread is assigned to a single partition with twice the data, and in the second case each thread is assigned to two partitions with half the data each. It seems to me that there is no difference between these two setups.
I believe that on the consumer side there could be a difference, if your threads are not CPU-constrained (and the network is not at capacity). Assuming infinite data on the Kafka broker, or a lagging consumer, each thread in your second example consumes from two partitions, so the Kafka broker is able to send it more data than if it had only one partition assigned. Kafka limits the maximum number of bytes returned per partition in a fetch (max.partition.fetch.bytes in the consumer config; replica.fetch.max.bytes is the corresponding broker setting for replication), so if you double the partitions, you can increase capacity, assuming the data is available.
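The arithmetic behind this can be sketched with the consumer defaults (max.partition.fetch.bytes = 1 MiB per partition; fetch.max.bytes = 50 MiB for the whole fetch response). This is a simplification: both limits are soft, since a record batch larger than the limit is still returned so the consumer can make progress, and a real fetch only covers partitions led by the broker it is sent to.

```python
MAX_PARTITION_FETCH_BYTES = 1_048_576  # consumer default for max.partition.fetch.bytes (1 MiB)
FETCH_MAX_BYTES = 52_428_800           # consumer default for fetch.max.bytes (50 MiB)


def max_bytes_per_fetch(partitions_per_thread,
                        per_partition=MAX_PARTITION_FETCH_BYTES,
                        total_cap=FETCH_MAX_BYTES):
    """Upper bound on data one fetch can return when every partition has a backlog."""
    return min(partitions_per_thread * per_partition, total_cap)


print(max_bytes_per_fetch(1), max_bytes_per_fetch(2))
```

Going from one to two partitions per thread doubles the per-fetch ceiling, until the overall fetch.max.bytes cap takes over.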
When configured properly, and under ideal conditions, Kafka will serve data from the page cache, so it can blast data down to consumers, and 90% of the time the bottleneck will be the number of partitions or the available CPU on the consumer side. In general, the more partitions you have, the faster you can consume from Kafka, until you are CPU- or bandwidth-constrained on the consumer, at which point it won't matter whether you have more or fewer partitions, since you're already consuming data as fast as you can.
An additional thing to take into account is that more consumer commits may be sent back to the brokers, since there are now more partitions, which means some additional overhead/crosstalk in the cluster. It's probably not twice as many commits, but probably more than in the first scenario.
An important thing to remember is to, whenever possible, do the actual message processing on your consumer off-thread. That is, do not process the inbound messages on the same thread that is consuming/polling from Kafka. It might work at first, but you're going to start running into issues if your processing takes longer, there's a delay, huge volume increase on the inbound side, etc. Whenever possible, throw the inbound messages on a queue, and let another thread worry about processing/parsing them.
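A minimal sketch of that pattern, using only Python's standard library: one thread polls and enqueues, worker threads process, and a bounded queue.Queue applies backpressure. The batches list stands in for repeated poll() calls (an assumption, since no Kafka client is used here). With a real client, only the polling thread should touch the consumer object (the Java KafkaConsumer is not thread-safe), it must keep calling poll() within max.poll.interval.ms, and offsets should only be committed for records the workers have finished.

```python
import queue
import threading

STOP = object()  # sentinel telling a worker to exit


def poll_loop(fetch_batches, work_queue, num_workers):
    """Polling thread: only fetches and enqueues; never does the heavy processing."""
    for batch in fetch_batches:      # stand-in for repeated consumer.poll() calls
        for record in batch:
            work_queue.put(record)   # blocks when the queue is full (backpressure)
    for _ in range(num_workers):
        work_queue.put(STOP)         # one sentinel per worker


def worker(work_queue, results, lock):
    while True:
        record = work_queue.get()
        if record is STOP:
            break
        processed = record.upper()   # placeholder for parsing/processing
        with lock:
            results.append(processed)


batches = [["a", "b", "c"], ["d", "e"], ["f"]]
work_queue = queue.Queue(maxsize=4)
results, lock = [], threading.Lock()
workers = [threading.Thread(target=worker, args=(work_queue, results, lock)) for _ in range(3)]
for t in workers:
    t.start()
poller = threading.Thread(target=poll_loop, args=(batches, work_queue, len(workers)))
poller.start()
poller.join()
for t in workers:
    t.join()
print(sorted(results))
```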
Finally, you don't want to take this to the extreme and configure 1000 partitions if you don't have to. Each partition adds overhead: commits, ZooKeeper znodes, consumer rebalancing time, startup time, etc. So I would suggest benchmarking different scenarios and seeing what works best for you. In general, anything from 2-4 partitions per consumer thread has worked well for me in the past, even with very high message loads (topics with 50K+ messages per second, each ~1KB).

Kafka - Best practices in case of slow processing consumer. How to achieve more parallelism?

I'm aware that the maximum number of active consumers in a consumer group is the number of partitions of a topic.
What's the best practice in case of slow processing consumers? How to achieve more parallelism?
An example: a topic with 6 partitions and thousands of messages per second written by producers. So I have at most 6 consumers in the group. Consider that processing those messages is complex and the consumers are much slower than the producers. The result is that the consumers are always behind the latest offset and the lag keeps increasing.
In a traditional MQ system, we simply add more and more consumers to stay up to date.
How to achieve this with Kafka, since the total of the consumers in a group is at most the number of partitions? Should I:
Configure the topic to have more partitions allowing more consumers per group?
Route the message from the consumer to a traditional MQ Queue (but lose the ordering)?
What's the best practice for this situation?
In Kafka, partitions are the unit of parallelism.
Without knowing your exact use case and requirements, it's hard to come up with precise recommendations, but there are a few options.
First, you should really consider having more partitions. 6 partitions is relatively few; you could easily have 60, 120 or even more partitions (and the corresponding number of consumers). Suddenly the amount of work each consumer has to do is significantly reduced.
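A rough sizing rule follows from this: the partition count bounds the number of active consumers, so it has to be at least the produce rate divided by what one consumer can process. The rates and the headroom factor below are hypothetical, and the estimate assumes consumers scale linearly (no shared downstream bottleneck).

```python
import math


def partitions_needed(produce_rate, per_consumer_rate, headroom=1.0):
    """Minimum partitions (= max active consumers) so total consumption keeps up."""
    return math.ceil(produce_rate * headroom / per_consumer_rate)


# Hypothetical numbers: 6,000 msg/s produced, each consumer handles 80 msg/s.
print(partitions_needed(6_000, 80))
print(partitions_needed(6_000, 80, headroom=1.5))
```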
Also, if your requirements allow, you can consume at a fast rate and spread the processing of records across many workers. In solutions like this it's harder to maintain ordering, but if you don't need it, you can consider it.
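When only per-key ordering matters, a common compromise is to hash each record's key to a fixed worker lane: different keys are processed in parallel, while each lane is drained in order by a single worker, so records with the same key stay in sequence. A minimal sketch; lane_for and split_into_lanes are hypothetical helpers, and crc32 is used because it is stable across processes.

```python
import zlib
from collections import defaultdict


def lane_for(key, num_lanes):
    """Stable key -> lane mapping (crc32 is deterministic across processes)."""
    return zlib.crc32(key.encode("utf-8")) % num_lanes


def split_into_lanes(records, num_lanes):
    """Group (key, value) records into per-lane FIFO lists; each lane gets one worker."""
    lanes = defaultdict(list)
    for key, value in records:
        lanes[lane_for(key, num_lanes)].append((key, value))
    return lanes


records = [("user-1", 1), ("user-2", 1), ("user-1", 2), ("user-3", 1), ("user-1", 3)]
lanes = split_into_lanes(records, num_lanes=4)
user1 = lanes[lane_for("user-1", 4)]
print([v for k, v in user1 if k == "user-1"])
```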
I'm not sure how routing messages through an MQ queue would really help in this scenario. If you are still reading slower than you are writing, the amount of data in the queue will grow until you run out of disk space.
Kafka is better designed to serve as a buffer between your producers and consumers, so just ensure your topics have retention limits that allow some flexibility on the consumer side without losing data.
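How much flexibility a retention limit buys can be estimated. Assuming the consumer starts caught up and then runs at a steady rate below the produce rate, the oldest unread record ages by (1 - consume_rate / produce_rate) for every unit of elapsed time, so data starts expiring unread once that age reaches the retention period. The rates below are hypothetical.

```python
def days_until_data_loss(produce_rate, consume_rate, retention_days):
    """How long a consumer slower than its producer can run before unread data expires.

    After t days the consumer has read everything produced up to
    t * consume_rate / produce_rate, so the oldest unread record is
    t * (1 - consume_rate / produce_rate) days old.
    """
    if consume_rate >= produce_rate:
        return float("inf")  # lag never grows, so retention is never exceeded
    return retention_days / (1 - consume_rate / produce_rate)


# Hypothetical: 1,000 msg/s produced, 600 msg/s consumed, 7-day retention.
print(days_until_data_loss(1_000, 600, 7))
```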

How does Kafka handle a consumer which is running slower than other consumers?

Let's say I have 20 partitions and five workers. Each partition is assigned to a worker. However, one worker is running slower than the other machines. It's still processing (that is, this is not the slow-consumer case described here), but at 60% of the rate of the other machines. This could be because the worker is running on a slower VM on AWS EC2, has a broken disk or CPU, or whatnot. Does Kafka handle rebalancing gracefully somehow to give the slow worker fewer partitions?
Kafka doesn't really concern itself with how fast messages are being consumed. It doesn't even get involved with how many consumers there are or how many times each message is read. Kafka just commits messages to partitions and ages them out at the configured time.
It's the responsibility of the group of consumers to make sure that the messages are read evenly and in a timely fashion. In your case, you have two problems: the reading of one set of partitions lags, and then the processing of the messages from those partitions lags.
For the actual consumption of messages from the topic, you'll have to use the Kafka metadata APIs to track the relative load each consumer faces, whether it comes from skewed partitioning or from consumers running at different speeds. You either have to reallocate partitions to give the slow consumers less work, or randomly reassign consumers to partitions in the hope of eventually evening out the workload over time.
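One way to reallocate partitions so slow consumers get less work is a greedy assignment weighted by measured speed: take partitions from heaviest to lightest and give each to the worker that would finish its total work soonest. This is a sketch under assumptions: assign_by_speed is hypothetical, and the partition loads (e.g. lag or message rate) and worker speeds would have to come from your own monitoring. In the Java client, logic like this would live in a custom assignor implementing ConsumerPartitionAssignor.

```python
def assign_by_speed(partition_loads, worker_speeds):
    """Greedy assignment: heaviest partition goes to the worker that would finish
    its total work soonest (load / speed). Hypothetical helper, not a Kafka assignor.
    """
    assigned = {w: [] for w in worker_speeds}
    load = {w: 0.0 for w in worker_speeds}
    for partition, work in sorted(partition_loads.items(), key=lambda kv: -kv[1]):
        best = min(worker_speeds, key=lambda w: (load[w] + work) / worker_speeds[w])
        assigned[best].append(partition)
        load[best] += work
    return assigned


loads = {p: 1.0 for p in range(20)}  # 20 equally busy partitions
speeds = {"w1": 1.0, "w2": 1.0, "w3": 1.0, "w4": 1.0, "slow": 0.6}
result = assign_by_speed(loads, speeds)
counts = {w: len(ps) for w, ps in result.items()}
print(counts)
```

With 20 equally busy partitions and one worker at 60% speed, the slow worker ends up with clearly fewer partitions than the others.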
To better balance the processing of messages, you should factor out the reading of the messages from the processing of the messages - something like the Storm streaming model. You still have to programmatically monitor the backlogs into the processing logic, but you'd have the ability to move work to faster nodes in order to balance the work.
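Once reading is separated from processing, balancing can happen naturally if processing workers pull from a shared queue instead of owning fixed partitions: a faster worker comes back for more work sooner. A small discrete-event simulation illustrates this; simulate_shared_queue is a hypothetical helper, and the task cost and worker speeds are assumptions.

```python
import heapq
from collections import Counter


def simulate_shared_queue(num_tasks, worker_speeds, task_cost=1.0):
    """Discrete-event model: idle workers pull the next task from one shared queue.

    A worker with speed s needs task_cost / s time units per task, so faster workers
    return to the queue sooner and end up doing more of the work automatically.
    """
    done = Counter()
    # (time the worker becomes free, worker name); all workers start idle at t = 0
    ready = [(0.0, name) for name in worker_speeds]
    heapq.heapify(ready)
    for _ in range(num_tasks):
        free_at, name = heapq.heappop(ready)
        done[name] += 1
        heapq.heappush(ready, (free_at + task_cost / worker_speeds[name], name))
    return done


speeds = {"w1": 1.0, "w2": 1.0, "w3": 1.0, "w4": 1.0, "slow": 0.6}
done = simulate_shared_queue(100, speeds)
print(done)
```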