Can offsets be skipped (almost 12,000 in some cases) for a particular partition on the Kafka broker? - apache-kafka

Kafka image: confluentinc/cp-kafka:5.2.1
Kafka client: apache.kafka.client 2.5.0
Load: one thread is handling 4 partitions
We are noticing that some offsets are getting skipped (missing) in a partition. At first we thought it was an issue on the consumer side, but the offsets are skipped in the logs of both consumer groups (another observation: the consumer thread takes a significant amount of time to jump over the skipped offsets, which is causing lag). This happened when the load on the cluster was high. We are not using transactional Kafka or idempotent configurations. What can be the possible reasons for this?
Producer properties:
ACKS_CONFIG=1
RETRIES_CONFIG=1
BATCH_SIZE_CONFIG=16384
LINGER_MS_CONFIG=1
BUFFER_MEMORY_CONFIG=33554432
KEY_SERIALIZER_CLASS_CONFIG= StringSerializer
VALUE_SERIALIZER_CLASS_CONFIG= ByteArraySerializer
RECONNECT_BACKOFF_MS_CONFIG=1000
Consumer properties:
HEARTBEAT_INTERVAL_MS_CONFIG=6 seconds
SESSION_TIMEOUT_MS_CONFIG=30 seconds
MAX_POLL_RECORDS_CONFIG=10
FETCH_MAX_WAIT_MS_CONFIG=200
MAX_PARTITION_FETCH_BYTES_CONFIG=1048576
FETCH_MAX_BYTES_CONFIG=31457280
AUTO_OFFSET_RESET_CONFIG=latest
ENABLE_AUTO_COMMIT_CONFIG=false
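For context, a minimal Java sketch of how the consumer side of those properties would be wired up; the bootstrap servers, group id, and deserializers are placeholders, since they are not part of the configuration listed above:

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.util.Properties;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");          // placeholder
props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                      // placeholder
props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 6000);               // 6 seconds
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 30000);                 // 30 seconds
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 10);
props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 200);
props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 1048576);
props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 31457280);
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props);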
Edit
There were no consumer rebalances; we checked the logs during this time and found none.
We have two consumer groups (in different data centers) and both groups skipped these offsets, so we are ruling out any issue on the consumer side.
- Both showed the same pattern, e.g. both consumer threads stopped consuming at offset 112300 and, after 30 minutes, resumed after skipping about 12k offsets. During that time the threads were still consuming other partitions. This only happened for 3-4 partitions.
- So what I'm wondering: is it normal to have such huge offset gaps during high load? I didn't find anything concrete when going through the docs. And what can cause this issue - is it the broker side or the producer side that creates these ghost offsets?

Related

Kafka - message not getting consumed from only one of the partitions

We have a single-node Kafka service running Kafka 2.13. For one of the topics we have configured 20 partitions, and there is just one consumer group consuming from this topic. This setup had been working fine for a long time. Recently we have been seeing issues where Kafka rebalances this consumer group frequently. After a while the consumers start consuming again, but on one of the partitions the current offset doesn't move forward at all, indicating the messages are stuck in that partition. Messages from the other partitions get consumed without any issues. The logs from the Kafka service don't show any issue. Any hints on what is going wrong and how to identify / rectify it?
A possible explanation is that the specific partition has a message that takes a long time to process; this could explain both the frequent rebalancing and the stuck offset.
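One way to check that hypothesis (a rough sketch assuming a 2.x Java client; the topic, group id, and threshold are placeholders) is to time each record inside the poll loop and log anything unusually slow, keeping the processing time well under max.poll.interval.ms (default 5 minutes) so the group does not rebalance:

import org.apache.kafka.clients.consumer.*;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");   // placeholder
props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");               // placeholder
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("my-topic"));       // placeholder
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            long start = System.currentTimeMillis();
            // ... your per-record processing goes here ...
            long tookMs = System.currentTimeMillis() - start;
            if (tookMs > 10_000) {                                    // arbitrary "slow" threshold
                System.out.printf("Slow record %s-%d@%d took %d ms%n",
                        record.topic(), record.partition(), record.offset(), tookMs);
            }
        }
    }
}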

Will an increase in Consumer Groups increase the delay in Producer Response?

I have a use case where I have to test the saturation point of my 3-node Kafka cluster with a high number of consumer groups (to find the saturation point for our production use case). Producer acks=all.
I created many consumer groups, more than 10000, and there was no problem (no load; the consumer groups were just created, not consuming).
So I started load testing. I created 3 topics (1 partition each) with replication factor 3, and each broker is the leader for one topic (I verified this with kafka-topics describe).
I planned to constantly produce 4.5 MBps to each topic and increase the consumer groups from zero: 45 records of 100 KB each to a topic.
When I produce data with no consumer groups in the cluster, the producer response time is just 2 ms/record.
With 100 consumer groups it takes 7 ms per record.
When increasing the consumer groups for a topic to 200, the time increases to 28-30 ms, and with that I can't produce 4.5 MBps. As I add more consumer groups, the producer response keeps degrading.
The brokers have 15 I/O threads and 10 network threads.
Analysis done for the above scenario:
With Grafana JMX metrics there is no spike in the request and response queues.
There is no delay in I/O thread pickup, based on the request queue time.
The network thread average idle percentage is 0.7, so the network threads are not a bottleneck.
Some articles say the socket buffer can be a bottleneck for high-bandwidth throughput, so I increased it from 100 KB to 4 MB, but it made no difference.
There is no spike in GC, file descriptors, or heap space.
From this, there is no problem with the I/O threads, network threads, or socket buffer.
So what can be the bottleneck here?
I thought it might be because of producing data to a single partition, so I created more topics with 1 partition each and tried to produce 4.5 MBps to each topic in parallel, but I ended up with the same delay in producer response.
What can really be the bottleneck here? The producer is supposed to be decoupled from the consumer.
But when I add more consumer groups to the broker, why is the producer response affected?
As we know, the common link between the producer and the consumer is the partition, where the data lives and is written by producers and read by consumers. There are 3 things that we need to consider here:
Relationship between producer and partition: I understand that you need to have the correct number of partitions created to send messages at a consistent speed, and here is the calculation we use to optimize the number of partitions for a Kafka implementation:
Partitions = Desired Throughput / Partition Speed
Conservatively, you can estimate that a single partition for a single Kafka topic runs at 10 MB/s.
As an example, if your desired throughput is 5 TB per day, that figure comes out to about 58 MB/s. Using the estimate of 10 MB/s per partition, this example implementation would require 6 partitions. I believe it's not about creating more topics with one partition but about creating a topic with an optimized number of partitions.
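A quick sketch of the same arithmetic (the 5 TB/day and 10 MB/s figures are the example values above):

long bytesPerDay = 5_000_000_000_000L;                            // 5 TB per day
double desiredMBps = bytesPerDay / 1_000_000.0 / 86_400.0;        // ~58 MB/s
double perPartitionMBps = 10.0;                                   // conservative per-partition estimate
int partitions = (int) Math.ceil(desiredMBps / perPartitionMBps); // = 6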
Since you are consistently sending messages to 1 partition, this could be the issue. Also, since you have chosen acks=all, that can be a reason for increased latency: every message you send to the topic has to be acknowledged by the leader as well as the followers, which introduces latency. As the message volume keeps increasing, it is likely that the latency increases as well. This could be the actual reason for the increased producer response time. To address it, you can do the following (a rough producer config sketch follows the list):
a) Send the messages in batches
b) Compress the data
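A minimal sketch of (a) and (b) on the producer side; the batch size, linger time, and compression codec are assumed values that would need tuning against your ~100 KB records:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092"); // placeholder
props.put(ProducerConfig.ACKS_CONFIG, "all");                 // as in the question
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 262144);          // 256 KB batches (assumed)
props.put(ProducerConfig.LINGER_MS_CONFIG, 10);               // wait up to 10 ms to fill a batch (assumed)
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");     // compress whole batches (assumed codec)
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props);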
Partition: each partition has a leader, and with multiple replicas the writes go to the leader. However, if the leaders are not balanced properly, one broker might be overworked compared to the others, again causing latency. So again, an optimized number of partitions is a key factor.
Relationship between consumer and partition: from your example I understand that you are increasing the consumer groups from zero to some number. Please note that when you keep increasing the consumer groups, rebalancing of the partitions takes place.
Rebalancing is the process of mapping each partition to precisely one consumer in a group. While Kafka is rebalancing, processing is blocked for all involved consumers.
If you want more details:
https://medium.com/bakdata/solving-my-weird-kafka-rebalancing-problems-c05e99535435
And when that rebalancing happens, partitions also move between consumers, which is again an overhead.
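To see this pause directly, a consumer can register a rebalance listener and log when partitions are revoked and reassigned; a sketch, where consumer is an already-configured KafkaConsumer and the topic name is a placeholder:

import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.common.TopicPartition;
import java.util.Collection;
import java.util.Collections;

consumer.subscribe(Collections.singletonList("load-test-topic"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Called before a rebalance; consumption is paused from here until reassignment completes.
        System.out.println(System.currentTimeMillis() + " revoked " + partitions);
    }
    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        System.out.println(System.currentTimeMillis() + " assigned " + partitions);
    }
});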
In conclusion, I understand that the producer response time might have been increasing because of the factors below:
a) The number of partitions not being sized correctly with respect to the messaging speed
b) Messages not being sent in batches of a proper size and with a proper compression type
c) The increase in consumer groups causing more rebalancing and hence partition movement
d) Since the consumer is blocked for partition reassignment during rebalancing, message fetching also stops, so the partition accumulates more data while still accepting new data
e) acks=all, which adds latency as well
In continuation of your query, trying to visualize it:
Let us assume, as per your setup:
1 topic with 1 partition (RF 3) and 1 to 100/500 consumer groups, each with a single consumer (no rebalancing), subscribing to the same topic.
Here only one server in the cluster (the leader) would be actively serving all of this traffic, which can result in processing delays, since the other 2 brokers are just followers and only step in if the leader goes down.

Can we have a retention period of zero in the Kafka broker?

Does a retention period of zero make sense in the Kafka broker?
We want to quickly forward messages from producer to consumer via the Kafka broker, serving them from the buffer cache/page cache on the broker machine without flushing to disk. We do not need replication and assume our broker will never crash.
When a message is produced to a Kafka topic it is written to disk. Once the message has been consumed, the offset of this message is committed by the consumer (if you are using the high-level consumer API); however, there is no functionality that deletes only the messages that have been consumed (many consumers may subscribe to the same topic, and some of them might have consumed that message while others might not have).
What I would suggest in your case is to set a short retention period (the default is 7 days) but still allow a reasonable amount of time for your consumers to consume the messages. To do this, you simply need to configure the following parameter in server.properties:
log.retention.ms=X
Note that there is no guarantee that the deleted message(s) have been successfully consumed by your consumer(s). For example, if you set the retention period to 2 seconds (i.e. log.retention.ms=2000) and your consumer crashes, then every message which is sent to the topic while the consumer is down will be lost.
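If you only need the short retention on this one topic rather than broker-wide, the same limit can be set per topic via retention.ms. A sketch using the Java AdminClient, assuming a broker and client on Kafka 2.3 or newer; the topic name and bootstrap address are placeholders:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;

Properties props = new Properties();
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");                        // placeholder
try (AdminClient admin = AdminClient.create(props)) {
    ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "fast-forward-topic"); // placeholder
    AlterConfigOp setRetention = new AlterConfigOp(
            new ConfigEntry("retention.ms", "60000"),                                         // e.g. keep messages for 1 minute
            AlterConfigOp.OpType.SET);
    Collection<AlterConfigOp> ops = Collections.singletonList(setRetention);
    admin.incrementalAlterConfigs(Collections.singletonMap(topic, ops)).all().get();          // blocks until applied (exception handling omitted)
}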

Kafka Stream - CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member

I am running a Kafka Streams application which consumes data from 2 topics and outputs the joined/merged result into a 3rd topic.
The Kafka topics have 15 partitions and a replication factor of 3. We have 5 Kafka brokers and 5 ZooKeepers.
I am running 15 instances of the Kafka Streams application so each instance can handle 1 partition.
Kafka version: 0.11.0.0
I am getting the below exception in my Kafka Streams application:
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot
be completed since the group has already rebalanced and assigned the
partitions to another member. This means that the time between
subsequent calls to poll() was longer than the configured
max.poll.interval.ms, which typically implies that the poll loop is
spending too much time message processing. You can address this either
by increasing the session timeout or by reducing the maximum size of
batches returned in poll() with max.poll.records.
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.sendOffsetCommitRequest(ConsumerCoordinator.java:725)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.commitOffsetsSync(ConsumerCoordinator.java:604)
at org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(KafkaConsumer.java:1173)
at org.apache.kafka.streams.processor.internals.StreamTask.commitOffsets(StreamTask.java:307)
at org.apache.kafka.streams.processor.internals.StreamTask.access$000(StreamTask.java:49)
at org.apache.kafka.streams.processor.internals.StreamTask$1.run(StreamTask.java:268)
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:187)
at org.apache.kafka.streams.processor.internals.StreamTask.commitImpl(StreamTask.java:259)
at org.apache.kafka.streams.processor.internals.StreamTask.suspend(StreamTask.java:362)
at org.apache.kafka.streams.processor.internals.StreamTask.suspend(StreamTask.java:346)
at org.apache.kafka.streams.processor.internals.StreamThread$3.apply(StreamThread.java:1118)
at org.apache.kafka.streams.processor.internals.StreamThread.performOnStreamTasks(StreamThread.java:1448)
at org.apache.kafka.streams.processor.internals.StreamThread.suspendTasksAndState(StreamThread.java:1110)
at org.apache.kafka.streams.processor.internals.StreamThread.access$1800(StreamThread.java:73)
at org.apache.kafka.streams.processor.internals.StreamThread$RebalanceListener.onPartitionsRevoked(StreamThread.java:218)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinPrepare(ConsumerCoordinator.java:422)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:353)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:310)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:297)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1078)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1043)
at org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:582)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:553)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:527)
2017-08-09 14:50:49 - [ERROR] [click-live-StreamThread-1]
[org.apache.kafka.streams.processor.internals.StreamThread.performOnStreamTasks:1453]
:
Can someone please help with what the cause and the solution could be?
Also, when 1 of my Kafka brokers is down, why is my Kafka Streams application not connecting to another broker?
I have set brokers.list=broker1:9092,broker2:9092,broker3:9092,broker4:9092,broker5:9092
Based on the information, this is the most likely solution route:
Try to follow the suggestions in the message:
"You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records"

kafka partition rebalancing (assignment) is taking too much time

We are using a Kafka cluster with 3 servers, all of which also run ZooKeeper. We have 7 topics, each with 6 partitions, and 3 Java consumers for each topic. When I start a consumer it takes almost 3-5 minutes to assign partitions to the consumers. The same behavior occurs when we stop one of the consumers and start it again. How can I control or reduce this?
Please note that I am using Kafka 0.9 with the new consumer.
I have added the below properties in server.properties of each Kafka broker:
auto.leader.rebalance.enable=true
leader.imbalance.check.interval.seconds=10
Let me know if you need more information.
Thanks
Check the value your consumer is using for 'session.timeout.ms'.
The default is 30 seconds and the coordinator won't trigger a rebalance until this time has passed, e.g. no heartbeat for 30 seconds.
The danger in making this lower is that if you take too long to process messages, a rebalance might occur because the coordinator will think your consumer is dead.
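For example, with the 0.9 new consumer the relevant settings look like this; the values are illustrative, and heartbeat.interval.ms should stay well below session.timeout.ms:

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.util.Properties;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092"); // placeholder
props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                                        // placeholder
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 10000);   // lower than the 30-second default, so dead consumers are detected sooner
props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 3000); // typically around 1/3 of the session timeout
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);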