Kafka Reduce Lag for Consumer - apache-kafka

I am setting up the new Kafka cluster and for testing purpose I created the topic with 1 partition and 3 replicas.
Now, I am firing the messages via producer in parallel say 50K messages per Second. And I have create One Consumer inside a Group and its only able to fetch 30K messages per second.
I can change topic level, partition level, consumer level configurations.
I am monitoring everything via grafana + prometheus.
Any Idea which configuration or something else can help me to consumer more data??
Thanks In Advance

A Kafka consumer polls the broker for messages and fetches whatever messages are available for consumption, depending upon the consumer configuration used. In general, it is efficient to transfer as much data is possible in a single poll request if increasing throughput is your aim. But how much data is transferred in a single poll is determined by the size of messages, number of records and some parameters which control how much time to wait for messages to be available.
In general, you can influence throughput using one or more of the following consumer configurations:
fetch.min.bytes
max.partition.fetch.bytes
fetch.max.bytes
max.poll.records
fetch.max.wait.ms

Related

Increase in Consumer Groups will increase delay in Producer Response?

I have an use case where i have to test the saturation point of my Kafka (3 node) Cluster with high no of Consumer groups.(To find the saturation point for our production use case) Producer ack=all.
I created many consumer groups more than 10000 , there is no Problem(No load Just created Consumer groups not consuming).
So i started load testing, I created 3 topics (1 partition) with 3 replication factor,Each broker is leader for a topic(i made sure by kafka-topic describe).
I planned to constantly produce 4.5MBps for each topic and increase consumer groups from zero.100KB size of 45 records to a topic.
When i produce data for no consumer groups in the cluster the producer response time is just 2 ms/record.
For 100 Consumer groups per record it taking 7ms.
When increasing consumer groups for a topic to 200 the time increasing to 28-30ms with that i cant able to produce 4.5MBps .When increasing more Consumer groups the producer response is decreasing.
The broker have 15 I/O threads and 10 Network threads.
Analysis done for above scenario
With grafana JMX metrics there is no spike in request and response queue.
There is no delay in I/O picking up by checking request queue time.
The network thread average idle percentage is 0.7 so network thread is not a bottleneck.
When reading some articles Socket buffer can be bottle neck for high bandwidth throughput so increased it from 100KB to 4MB but no use.
There is no spike in GC,file descriptor,heap space
By this there is no problem with I/O threads,Network Threads,Socket Buffer
So what can be a bottleneck here?
I thought it would be because of producing data to single partition. So i created more topic with 1 partition and parallel try to produced 4.5MBps per each topic ended up same delay in producer response.
What can be really bottleneck here? Because producer is decoupled from Consumer.
But when i increasing more no of Consumer groups to broker, The producer response why affecting?
As we know the common link between the Producer and consumer is Partition where the data remains and is being read and Write by consumers and producers respectively There are 3 things that we need to consider here
Relationship between Producer to Partition : I understand that you need to have the correct no. of partition created to send some message with consistent speed and here is the calculation we use to optimize the number of partitions for a Kafka implementation.
Partitions = Desired Throughput / Partition Speed
Conservatively, you can estimate that a single partition for a single Kafka topic runs at 10 MB/s.
As an example, if your desired throughput is 5 TB per day. That figure comes out to about 58 MB/s. Using the estimate of 10 MB/s per partition, this example implementation would require 6 partitions. I believe its not about creating more topics with one partition but it is about creating a topic with optimized no of partitions
Since you are sending the message consistently with 1 partition then this could be the issue. Also since you have chosen acks=all, this can be the reason for increased latency that every message that you pass to the topic has to make sure that it gets the acknowledgment from leader as well as the followers hence introducing the latency. As the message keeps on increasing, its likely that there must be some increase in latency factor as well. This could be in actual the reason for increased response time for producer. To have that addressed you can do below things:
a) Send the Messages in Batch
b) Compress the Data
Partition : Each partition has a leader. Also, with multiple replicas, most partitions are written into leaders. However, if the leaders are not balanced properly, it might be possible that one might be overworked, compared to others causing the latency again. So again the optimized number of partitions are the key factors.
Relationship between consumer to Partition : From your example I understand that you are increasing the consumer groups from Zero to some number. Please note that when you keep on increasing the consumer group , there is the rebalancing of the partition that takes place.
Rebalancing is the process to map each partition to precisely one consumer. While Kafka is rebalancing, all involved consumers processing is blocked
If you want to get more details
https://medium.com/bakdata/solving-my-weird-kafka-rebalancing-problems-c05e99535435
And when that rebalancing happens, there is the partition movement as well that happens which is nothing but again an overhead.
In conclusion I understand that the producer throughput might have been increasing because of below factors
a) No of partitions not getting created correctly w.r.t. messaging speed
b) Messages not being send in Batches with proper size and with a proper compression type
c) Increase in consumer group causing more of rebalancing hence movement of partition
d) Since due to rebalancing the consumer, the consumer blocks the functioning for partition reassignment we can say the message fetching also gets stopped hence causing the partition to store more of the data while accepting the new Data.
e) Acks= all which should be causing latency as well
In continuation to your query, trying to visualize it
Let us assume as per your condition
1 Topic having 1 partition(with RF as 3) and 1 to 100/500 consumers groups having 1 single consumer in each group(No rebalancing) subscribing to same topic
Here only one server(Being leader) in the cluster would be actively participating to do following functions that can result in the processing delays since the other 2 brokers are just the followers and will act if the leader goes down.

Kafka fetch max bytes doesn't work as expected

I have a topic worth 1 GB of messages. A. Kafka consumer decides to consume these messages. What could I do to prohibit the consumer from consuming all messages at once? I tried to set the
fetch.max.bytes on the broker
to 30 MB to allow only 30 MB of messages in each poll. The broker doesn't seem to honor that and tries to give all messages at once to the consumer causing Consumer out of memory error. How can I resolve this issue?
Kafka configurations can be quite overwhelming. Typically in Kafka, multiple configurations can work together to achieve a result. This brings flexibility, but flexibility comes with a price.
From the documentation of fetch.max.bytes:
Records are fetched in batches by the consumer, and if the first record batch in the first non-empty partition of the fetch is larger than this value, the record batch will still be returned to ensure that the consumer can make progress.
Only on the consumer side, there are more configurations to consider for bounding the consumer memory usage, including:
max.poll.records: limits the number of records retrieved in a single call to poll. Default is 500.
max.partition.fetch.bytes: limits the number of bytes fetched per partition. This should not be a problem as the default is 1MB.
As per the information in KIP-81, the memory usage in practice should be something like min(num brokers * max.fetch.bytes, max.partition.fetch.bytes * num_partitions).
Also, in the same KIP:
The consumer (Fetcher) delays decompression until the records are returned to the user, but because of max.poll.records, it may end up holding onto the decompressed data from a single partition for a few iterations.
I'd suggest you to also tune these parameters and hopefully this will get you into the desired state.

Kafka not sending enough messages to consumer

I've a kafka topic, 3 partitions, only one consumer with batch. I am using spring kafka on the consumer side with following consumer props:
max.poll.records=10000
fetch.min.bytes=2000000
fetch.max.bytes=15000000
fetch.max.wait.ms=1000
max.poll.interval.ms=300000
auto.offset.reset.config=earliest
idle.event.interval=120000
Even tho there are thousands of messages (GBs of data) waiting in the queue, kafka consumer receives around 10 messages (total size around 1MB) on each poll. The consumer should fetch batches of fetch.max.bytes(in my prop ~15MB) or max.poll.records (10000 in my case) . What's the problem?
There are several scenarios which may cause this, do the following changes:
Increase fetch.min.bytes- the consumer also may fetch batches of fetch.min.bytes, which is 1.9MB.
Increase fetch.max.wait.ms- the poll function waits for fetch.min.bytes or fetch.max.wait.ms to trigger, whatever comes first.
fetch.max.wait.ms is 1 second in your configuration, sounds alright but increase it just in case this is the problem.
Increase max.partition.fetch.bytes- default is 1MB, it can decrease the poll size for small partitioned topics like yours (limit up to 3MB poll for 3 partitions topic with a single consumer).
Try to use these values:
fetch.min.bytes=12000000
fetch.max.wait.ms=5000
max.partition.fetch.bytes=5000000
Deeper explanation:
https://www.oreilly.com/library/view/kafka-the-definitive/9781491936153/ch04.html

How to increase KAFKA consumer on demand in single Group

Lets say we have one topic "topic-1" in kafka with partition 5.
Consumer Group-A with 5 consumer attached to "topic-1" each partition. Due to large workload large number of message get publish. Now we want to scale up consumer / add more consumer in Group-A to process message.
How can we increase consumer ON_DEMAND in same group?
Is any way to do it from coding ? so that single message get consumed by each consumer.
Once load is decrease shut-down few consumer from same group.
What I would suggest is having some partitions as buffer for when the
load increases.
For eg. if having 5 partitions is enough for normal load, I would
suggest having 15 partitions for that topic but only 5 consumers
for them at the start.
Then, when the load increases, keep adding consumers, preferably in other machines, until the load decreases
You can have kubernetes do the autoscaling for you
Kafka framework suggest that the number of the consumers corresponds to the number of the partitions. Increasing the number of the consumers will not help as you will have one consumer per partition anyway and the rest will remain idle. If you need to speed it up you can read the data from Kafka and process them in another thread. You can scale with this number of processing threads and you will need to program it yourself.

Consume messages without committing from Kafka 10 consumer

I have a requirement to read messages from a topic, batch them and push the batch to an external system. If the batch fails for any reason, I need to consume the same set of messages again and repeat the process. So for every batch, the from and to offsets for each partition are stored in a database. In order to achieve this, I am creating one Kafka consumer per partition by assigning partition to the reader, based on the previous offsets stored, the consumers seek to that position and start reading. I have turned off auto commit and I dont commit offsets from the consumer. For every batch, I create a new consumer per partition, read messages from the last offset stored and publish to the external system. Do you see any problems in consuming messages without committing offsets and using the same consumer group across batches, but at any point there won't be more than one consumer per partition ?
Your design seems reasonable to me.
Committing offsets to Kafka is just a convenient built-in mechanism within Kafka to keep track of offsets. However, there is no requirement whatsoever to use it -- you can use any other mechanism to track offsets, too (like using a DB as in your case).
Furthermore, if you assign partitions manually, there will be no group management anyway. So parameter group.id has no effect. See http://docs.confluent.io/current/clients/consumer.html for more details.
In kafka version two i achieved this behaviour without the need for a database to store the offsets.
The following is a configuration for spring-boot-kafka but it should also work with any kafka consumer api
spring:
kafka:
bootstrap-servers: ...
consumer:
value-deserializer: ...
max-poll-records: 1000
enable-auto-commit: false
fetch-min-size: 262144 # 1/4 mb..
group-id: ...
fetch-max-wait: 10000 # we will consume every 10s or when 1/4 mb or 1000 records are accumulated.
auto-offset-reset: earliest
listener:
type: batch
concurrency: 7
ack-mode: manual
This gives me the messages in batches of max. 1000 records (dependent on load). I then write these records asynchronously to a database and count how many success callbacks i get. If the successful writes equals the received batch size i acknowledge the batch, e.g. i commit the offset. This design was very reliable even in a high-load production environment.