Kafka rebalancing taking a long time on "Deleting obsolete state directory" - apache-kafka

Whenever I try to spin up a consumer for my consumer group, Kafka takes a long time to rebalance and gets stuck on this log:
Deleting obsolete state directory 0_45 for task 0_45 as 601021ms has elapsed (cleanup delay is 600000ms).
I have 8 streaming threads for the topic I want to subscribe to, and that topic has 64 partitions.
Settings like max.poll.records and session.timeout.ms are all at their Kafka default values. I tried to find a resolution for this but couldn't find anything.
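For reference, the setup being described would look roughly like the sketch below, assuming the Java StreamsConfig API; the application id and bootstrap address are placeholders. The "cleanup delay is 600000ms" in the log is the default state.cleanup.delay.ms, and raising it keeps the cleaner from deleting local state directories while tasks are still being reassigned during a slow rebalance:

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsCleanupConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");  // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");  // placeholder
        // 8 stream threads reading a 64-partition topic, as described above.
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 8);
        // Default is 600000 ms (10 minutes), which matches the log line above.
        // A larger value delays deletion of obsolete local state directories.
        props.put(StreamsConfig.STATE_CLEANUP_DELAY_MS_CONFIG, 30 * 60 * 1000L);
        return props;
    }
}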

Related

In Kafka, what happens to the record which takes more time than max.poll.interval.ms?

What happens to a long-processing record when max.poll.interval.ms is exceeded? Will it keep running in the background while rebalancing is triggered?
As per my limited understanding, the Kafka consumer (Spring KafkaListener) service gets halted/restarted and the records get assigned to other consumers in the group during rebalancing.
You will have records left in memory being processed if the application or processing logic doesn't stop with the consumer thread.
If their offsets were committed beforehand, those records would effectively be skipped after a rebalance. Otherwise, their offsets ideally shouldn't be committed post-processing, because other consumers may try to process those records again after the rebalance, potentially resulting in data duplication.
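A minimal sketch of the commit-after-processing pattern the answer is describing, assuming a plain Java KafkaConsumer; the broker address, group id and topic name are placeholders:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CommitAfterProcessing {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");             // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit manually, after processing

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // the whole batch must finish within max.poll.interval.ms
                }
                // Offsets are committed only once the batch is fully processed, so a
                // rebalance can cause re-delivery (duplicates) but not skipped records.
                consumer.commitSync();
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // application-specific processing
    }
}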

Can offsets be skipped (almost 12000 in some cases) for a particular partition in the Kafka broker?

Kafka image: confluentinc/cp-kafka:5.2.1
Kafka client: apache.kafka.client 2.5.0
Load: one thread is handling 4 partitions
We are noticing that some offsets are getting skipped (missing) in a partition. At first I thought it was an issue on the consumer side, but offsets are getting skipped in the logs of both consumer groups. Another observation: the consumer thread takes a significant amount of time to jump over the skipped offsets, which is causing lag. This happened when the load on the cluster was high. We are not using transactional Kafka or idempotent configurations. What can be the possible reasons for this?
Producer properties:
ACKS_CONFIG=1
RETRIES_CONFIG=1
BATCH_SIZE_CONFIG=16384
LINGER_MS_CONFIG=1
BUFFER_MEMORY_CONFIG=33554432
KEY_SERIALIZER_CLASS_CONFIG= StringSerializer
VALUE_SERIALIZER_CLASS_CONFIG= ByteArraySerializer
RECONNECT_BACKOFF_MS_CONFIG=1000
Consumer properties:
HEARTBEAT_INTERVAL_MS_CONFIG=6 seconds
SESSION_TIMEOUT_MS_CONFIG=30 sec
MAX_POLL_RECORDS_CONFIG=10
FETCH_MAX_WAIT_MS_CONFIG=200
MAX_PARTITION_FETCH_BYTES_CONFIG=1048576
FETCH_MAX_BYTES_CONFIG=31457280
AUTO_OFFSET_RESET_CONFIG=latest
ENABLE_AUTO_COMMIT_CONFIG=false
Edit
There were no consumer rebalances; I checked the logs during this time and found none.
We have two consumer groups (in different data centers) and both groups skipped these offsets, so I'm ruling out any issue on the consumer side.
- Both showed the same pattern, e.g. both consumer threads stopped consuming at offset 112300 and after 30 minutes started consuming again, skipping 12k offsets. Meanwhile the threads kept consuming other partitions. This only happened for 3-4 partitions.
- So what I'm wondering is: is it normal to have such huge offset gaps during high load? I didn't find anything concrete when going through the docs. And what can cause this issue - is it the broker side or the producer side that produces these ghost offsets?
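One way to narrow this down is to check whether the gap exists in the partition itself rather than in the consumer's view: read the raw partition with a standalone (group-less) consumer and print the offsets the broker actually returns around the suspected gap. A minimal diagnostic sketch, assuming the Java client; the broker address, topic name and partition number are placeholders:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OffsetGapProbe {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        TopicPartition tp = new TopicPartition("my-topic", 7); // placeholder topic/partition
        long gapStart = 112300L; // offset where consumption reportedly stalled

        try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp)); // no group, so no rebalancing is involved
            consumer.seek(tp, gapStart);
            // Print the offsets the broker actually returns around the suspected gap.
            // Offsets missing here would point at the broker/producer side, not the consumer.
            for (ConsumerRecord<String, byte[]> record : consumer.poll(Duration.ofSeconds(5)).records(tp)) {
                System.out.println("offset=" + record.offset());
            }
        }
    }
}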

Does Kafka partition assignment happen across processes?

I have a topic with 20 partitions and 3 processes with consumers(with the same group_id) consuming messages from the topic.
But I am seeing a discrepancy: unless one of the processes commits, the other consumers (in the other processes) are not reading any messages.
The consumers in the other processes do consume messages when I set auto-commit to true (which is why I suspect the consumers are being assigned to the first partition in each process).
Can someone please help me out with this issue? And also, how do I consume messages in parallel across processes?
If it is of any use, I am doing this on a pod (Kubernetes), where the 3 processes are 3 different Mule instances.
Commit shouldn't make any difference because the committed offset is only used when there is a change in group membership. With three processes there would be some rebalancing while they start up but then when all 3 are running they will each have a fair share of the partitions.
Each time they poll, they keep track in memory of which offset they have consumed on each partition and each poll causes them to fetch from that point on. Whether they commit or not doesn't affect that behaviour.
Autocommit also makes little difference - it just means a commit is done synchronously during a subsequent poll rather than your application code doing it. The only real reason to manually commit is if you spawn other threads to process messages and so need to avoid committing messages that have not actually been processed - doing this is generally not advisable - better to add consumers to increase throughput rather than trying to share out processing within a consumer.
One possible explanation is just infrequent polling. You mention that other consumers are picking up partitions, and committing affects behaviour, so I think it is safe to say that rebalances must be happening. Rebalances are caused by either a change in partitions at the broker (presumably not the case) or a change in group membership, caused by either the heartbeat thread dying (a pod being stopped) or a consumer failing to poll for a long time (default 5 minutes, set by max.poll.interval.ms).
After a rebalance, each partition is assigned to a consumer, and if a previous consumer has ever committed an offset for that partition, then the new one will poll from that offset. If not, then the new one will poll from either the start of the partition or the high watermark - set by auto.offset.reset - the default is latest (high watermark).
So, if you have a consumer that polls but doesn't commit and then doesn't poll again for 5 minutes, a rebalance happens: a new consumer picks up the partition and starts from the end (skipping any messages up to that point). Its first poll will return nothing, as it is starting from the end. If it doesn't poll for 5 minutes, another rebalance happens and the sequence repeats.
That could be the cause - there should be more information about what is going on in your logs - Kafka consumer code puts in plenty of helpful INFO level logging about rebalances.
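For reference, these are the consumer settings the answer is talking about. A minimal sketch, assuming the plain Java client; the broker address and group id are placeholders, and the values are only illustrative, not recommendations:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupConsumerSettings {
    public static Properties build() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");             // same group id in all 3 processes
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // A rebalance is triggered if poll() is not called within this interval (default 5 minutes).
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300000);
        // Where a consumer starts when the group has no committed offset for a partition.
        // "latest" (the default) starts at the high watermark and so skips older messages;
        // "earliest" starts at the beginning of the partition.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        return props;
    }
}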

Kafka messages not getting purged

I am new to Kafka. I am doing some experiments on how to purge messages in a Kafka topic. I found that if we set the "retention.ms" property for a topic to some small value, let's say 1 second, then after 1 second the messages in the topic will be purged, as per my understanding.
I ran 1 producer which produced a few messages to the topic and stopped it after some time. At the same time I ran a console consumer, so it got the generated messages.
I started another console consumer for the same topic after the retention time had elapsed, let's say after 1-2 minutes. But to my surprise I was still able to get the messages on that topic.
I started the console consumer again after 2 more minutes, and only then did I finally not see any messages in the topic. It took almost 3-4 minutes for Kafka to purge the messages.
Are there any additional settings required in Kafka so that messages will be purged instantly?
Setting retention.ms will not guarantee that a message will be deleted from the topic immediately, even though it will be marked for deletion.
If your messages are in the form of key-value pairs (a compacted topic), then setting the retention time alone is not good enough. You have to set the following parameters also:
log.cleanup.policy
log.cleaner.min.compaction.lag.ms
log.cleaner.enable
Another set of parameters controls the deletion of messages, if they are present in your config:
log.retention.ms
log.roll.hours
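As a side note, topic-level overrides can also be applied programmatically with the AdminClient. A minimal sketch, assuming a reasonably recent Java client (2.3+); the broker address and topic name are placeholders. Deletion is still not instant, because only rolled (closed) segments are deleted and the broker only checks for deletable segments periodically (log.retention.check.interval.ms, 5 minutes by default):

import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class TopicRetentionOverride {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic"); // placeholder
            Collection<AlterConfigOp> ops = Arrays.asList(
                // delete (rather than compact) old log segments
                new AlterConfigOp(new ConfigEntry("cleanup.policy", "delete"), AlterConfigOp.OpType.SET),
                // records older than 1 second become eligible for deletion
                new AlterConfigOp(new ConfigEntry("retention.ms", "1000"), AlterConfigOp.OpType.SET),
                // roll segments quickly: only closed segments are ever deleted,
                // which is part of why purging is not instant with the defaults
                new AlterConfigOp(new ConfigEntry("segment.ms", "60000"), AlterConfigOp.OpType.SET));
            Map<ConfigResource, Collection<AlterConfigOp>> updates = Collections.singletonMap(topic, ops);
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}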

Kafka group re-balancing after consumer failed. org.apache.kafka.clients.consumer.internals.ConsumerCoordinator

I'm running a Kafka cluster with 4 nodes, 1 producer and 1 consumer. It was working fine until the consumer failed. Now, after I restart the consumer, it starts consuming new messages, but after some minutes it throws this error:
[WARN ]: org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Auto offset commit failed for group eventGroup: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
And it starts consuming the same messages again and loops forever.
I increased the session timeout and tried to change the group id, and it still does the same thing.
Also, is the client version of the Kafka consumer a big deal?
I'd suggest you decouple the consumer and the processing logic, to start with. E.g. let the Kafka consumer only poll messages and, maybe after sanitizing the messages (if necessary), delegate the actual processing of each record to a separate thread, then see if the same error still occurs. The error says you're spending too much time between subsequent polls, so this might resolve your issue. Also, please mention the version of Kafka you're using. Kafka had a different heartbeat management policy before version 0.10, which could make this issue easier to reproduce.
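A minimal sketch of that decoupling, assuming the plain Java client; the broker address and topic name are placeholders, and auto-commit is switched off so the poll thread commits only after the worker thread has finished a batch (a design choice, not something stated in the original answer):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class DecoupledProcessing {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "eventGroup");           // group name from the question
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit from the poll thread instead

        ExecutorService worker = Executors.newSingleThreadExecutor();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events")); // placeholder topic
            Future<?> inFlight = null;
            while (true) {
                // poll() keeps being called even while a batch is processing,
                // so the group never sees this member as stuck.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(200));
                if (!records.isEmpty()) {
                    consumer.pause(consumer.assignment()); // stop fetching until the batch is done
                    inFlight = worker.submit(() -> records.forEach(DecoupledProcessing::process));
                }
                if (inFlight != null && inFlight.isDone()) {
                    consumer.commitSync();                 // commit only after the batch has finished
                    consumer.resume(consumer.assignment());
                    inFlight = null;
                }
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // slow, application-specific processing goes here
    }
}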