Kafka Topic Lag is keep increasing gradually when Message Size is huge - apache-kafka

I am using the Kafka Streams Processor API to construct a Kafka Streams application to retrieve messages from a Kafka topic. I have two consumer applications with the same Kafka Streams configuration. The difference is only in the message size. The 1st one has messages with 2000 characters (3KB) while 2nd one has messages with 34000 characters (60KB).
Now in my second consumer application I am getting too much lag which increases gradually with the traffic while my first application is able to process the messages at the same time without any lag.
My Stream configuration parameters are as below,
application.id=Application1
default.key.serde=org.apache.kafka.common.serialization.Serdes$StringSerde
default.value.serde=org.apache.kafka.common.serialization.Serdes$StringSerde
num.stream.threads=1
commit.interval.ms=10
topology.optimization=all
Thanks

In order to consume messages faster, you need to increase the number of partitions (if it's not yet done, depending on the current value), and do one of the following two options:
1) increase the value for the config num.stream.threads within your application
or
2) start several applications with the same consumer group (the same application.id).
as for me, increasing num.stream.threads is preferable (until you reach the number of CPUs of the machine your app runs on). Try gradually increasing this value, e.g go from 4 over 6 to 8, and monitor the consumer lag of your application.
By increasing num.stream.threads your app will be able to consume messages in parallel, assuming you have enough partitions.

Related

Kafka consumer can't read from multiple partitions at the same time

I use Confluent.Kafka 1.9.2 C# library to create single Kafka consumer for listening topic with several partitions. Currently consumer drain out all messages from first partition and only then goes to next. As I know from KIP, I can avoid such behavior and achieve round-robin by changing max.partition.fetch.bytes parameter to lower value. I changed this value to 5000 bytes and pushed 10000 messages to first partition and 1000 to second, average size of messages is 2000 bytes, so consumer should to move between partitions every 2-3 messages (if I understand correctly). But it still drains out first partition before consuming second one. My only guess why it don't work as should is latest comment here that such approach can't work with several brokers, btw Kafka server that I use just has 6 brokers. Could it be the reason or maybe something else?

Increase or decrease Kafka partitions dynamically

I have a system where load is not constant. We may get 1000 requests a day or no requests at all.
We use Kafka to pass on the requests between services. We have kept average number of Kafka consumers to reduce cost incurred. Now my consumers of Kafka will sit ideal if there are no requests received that day, and there will be lag if too many requests are received.
We want to keep these consumers on Autoscale mode, such that my number of servers(Kafka consumer) will increase if there is a spike in number of requests. Once the number of requests get reduced, we will remove the servers. Therefore, the Kafka partitions have to be increased or decreased accordingly
Kafka allows increasing partition. In this case, how can we decrease Kafka partitions dynamically?
Is there any other solution to handle this Auto-scaling?
Scaling up partitions will not fix your lag problems in the short term since no data is moved between partitions when you do this, so existing (or new) consumers are still stuck with reading the data in the previous partitions.
It's not possible to decrease partitions and its not possible to scale consumers beyond the partition count.
If you are able to sacrifice processing order for consumption speed, you can separate the consuming threads and working threads, as hinted at in the KafkaConsumer javadoc, then you would be able to scale these worker threads.
Since you are thinking about modifying the partition counts, then I'm guessing processing order isn't a problem.
have one or more consumer threads that do all data consumption and hands off ConsumerRecords instances to a blocking queue consumed by a pool of processor threads that actually handle the record processing.
A single consumer can consume multiple partitions. Therefore partition your work for the largest anticipated parallel requirement, and then scale your number of consumers as required.
For example, if you think you need 32 parallel consumers you would give you Kafka topic 32 partitions.
You can run 32 consumers (each gets one partition), or eight consumers (each gets four partitions) or just one (which gets all 32 partitions). Or any number of consumers in between. Kafka's protocol ensures that within the consumer group all partitions are consumed, and will rebalance as and when consumers are added or removed.

Kafka Should Number of Consumer Threads equal number of Topic Partitions

Pretend you determined that you wanted to use exactly 8 consumer threads for your application.
Would there be any difference in processing if a Kafka topic was set up as having 8 partitions vs 16 partitions?
In the first case, each thread is assigned to a single partition with twice the data, and in the second case each thread is assigned to two partitions with half the data each. It seems to me that there is no difference between these two setups.
I believe that, on the consumer side there could be a difference, if your threads are not CPU-constrained (and network is not at capacity). Assuming infinite data on the Kafka broker, or a lagging consumer, since each thread is consuming from two partitions in your second example, the kafka broker is able to send more data than if each thread had only one partition assigned. Kafka has a limit on the maximum amount of bytes that can be retrieved per fetch (replica.fetch.max.bytes in the config), so if you 2x the partitions, you can increase capacity, assuming the data is available.
When configured properly, and assuming ideal conditions, Kafka will serve data from page cache, so it can blast data down to consumers, and 90% of the time, the bottleneck will be the amount of partitions/available CPU on the consumer side. In general, the more partitions you have, the faster you can consume from Kafka, until you are CPU or bandwidth constrained on the consumer, at which point it won't matter if you have more or less partitions, since you're consuming data as fast as you can anyway.
An additional thing to take into account is that there could be more consumer commits being sent back to the brokers, since there are now more partitions, which means some additional overhead/crosstalk in the cluster. It's probably not 2x the commits, but probably higher than 1x the commits from the first scenario.
An important thing to remember is to, whenever possible, do the actual message processing on your consumer off-thread. That is, do not process the inbound messages on the same thread that is consuming/polling from Kafka. It might work at first, but you're going to start running into issues if your processing takes longer, there's a delay, huge volume increase on the inbound side, etc. Whenever possible, throw the inbound messages on a queue, and let another thread worry about processing/parsing them.
Finally, you don't want to take this to the extreme, and configure 1000 partitions if you don't have to. Each partition requires overhead on commits, zookeeper znodes, consumer rebalancing time, startup time, etc. So, I would suggest benchmarking different scenarios, and seeing what works best for you. In general, anything from 2-4 partitions per consumer thread has worked well for me in the past, even with very high message loads (topics with 50K+ messages per second, each ~1KB).

How to increase KAFKA consumer on demand in single Group

Lets say we have one topic "topic-1" in kafka with partition 5.
Consumer Group-A with 5 consumer attached to "topic-1" each partition. Due to large workload large number of message get publish. Now we want to scale up consumer / add more consumer in Group-A to process message.
How can we increase consumer ON_DEMAND in same group?
Is any way to do it from coding ? so that single message get consumed by each consumer.
Once load is decrease shut-down few consumer from same group.
What I would suggest is having some partitions as buffer for when the
load increases.
For eg. if having 5 partitions is enough for normal load, I would
suggest having 15 partitions for that topic but only 5 consumers
for them at the start.
Then, when the load increases, keep adding consumers, preferably in other machines, until the load decreases
You can have kubernetes do the autoscaling for you
Kafka framework suggest that the number of the consumers corresponds to the number of the partitions. Increasing the number of the consumers will not help as you will have one consumer per partition anyway and the rest will remain idle. If you need to speed it up you can read the data from Kafka and process them in another thread. You can scale with this number of processing threads and you will need to program it yourself.

Kafka Random Access to Logs

I am trying to implement a way to randomly access messages from Kafka, by using KafkaConsumer.assign(partition), KafkaConsumer.seek(partition, offset).
And then read poll for a single message.
Yet i can't get past 500 messages per second in this case. In comparison if i "subscribe" to the partition i am getting 100,000+ msg/sec. (#1000 bytes msg size)
I've tried:
Broker, Zookeeper, Consumer on the same host and on different hosts. (no replication is used)
1 and 15 partitions
default threads configuration in "server.properties" and increased to 20 (io and network)
Single consumer assigned to a different partition each time and one consumer per partition
Single thread to consume and multiple threads to consume (calling multiple different consumers)
Adding two brokers and a new topic with it's partitions on both brokers
Starting multiple Kafka Consumer Processes
Changing message sizes 5k, 50k, 100k -
In all cases the minimum i get is ~200 msg/sec. And the maximum is 500 if i use 2-3 threads. But going above, makes the ".poll()", call take longer and longer (starting from 3-4 ms on a single thread to 40-50 ms with 10 threads).
My naive kafka understanding is that the consumer opens a connection to the broker and sends a request to retrieve a small portion of it's log. While all of this has some involved latency, and retrieving a batch of messages will be much better - i would imagine that it would scale with the number of receivers involved, with the expense of increased server usage on both the VM running the consumers and the VM running the broker. But both of them are idling.
So apparently there is some synchronization happening on broker side, but i can't figure out if it is due to my usage of Kafka or some inherent limitation of using .seek
I would appreaciate some hints of whether i should try something else, or this is all i can get.
Kafka is a streaming platform by design. It means there are many, many things has been developed for accelerating sequential access. Storing messages in batches is just one thing. When you use poll() you utilize Kafka in such way and Kafka do its best. Random access is something for what Kafka don't designed.
If you want fast random access to distributed big data you would want something else. For example, distributed DB like Cassandra or in-memory system like Hazelcast.
Also you could want to transform Kafka stream to another one which would allow you to use sequential way.