I am creating a service that polls messages from Kafka topics and hands over each message received during the poll interval to a worker thread from the thread pool. The worker thread processes the message by talking to another service.
How should I handle committing Kafka offsets in this case? If I wait for all the threads to complete before committing, processing speed decreases. On the other hand, once a message reaches a worker thread it is guaranteed that either the processing completes successfully or the message is added to a dead letter topic to be looked at later if an error occurs, provided the host the service is running on doesn't go down. So I could commit the offsets as soon as I have submitted the messages to the thread pool, but then I risk losing messages if the host crashes. How should I prevent losing messages here, or should I use some other strategy for committing/maintaining offsets?
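For illustration, here is a minimal sketch of the first option described above (wait for the whole batch to finish, then commit), using the plain Java consumer with enable.auto.commit=false. The pool size, poll timeout and process() body are placeholders, and the dead-letter publishing is only indicated in a comment:

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class BatchThenCommitLoop {

    private static final ExecutorService pool = Executors.newFixedThreadPool(8);

    public static void run(KafkaConsumer<String, String> consumer) throws Exception {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            List<Future<?>> inFlight = new ArrayList<>();
            for (ConsumerRecord<String, String> record : records) {
                // each record is processed (or sent to the dead letter topic) by a worker
                inFlight.add(pool.submit(() -> process(record)));
            }
            for (Future<?> f : inFlight) {
                f.get(); // wait for the whole batch to finish
            }
            // offsets are committed only after every worker is done,
            // so a host crash cannot lose records that were never processed
            consumer.commitSync();
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // call the downstream service; on failure publish to the dead letter topic
    }
}

This keeps at-least-once semantics at the cost of the throughput hit mentioned above, since the poll loop stalls on the slowest record in each batch.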
Related
Is this a possible case of data loss? If, due to an underlying hardware issue, Kafka has its request queue backed up, and at that point we shut down/bounce that Kafka broker, what will happen to the follower?
What will happen to the messages in the queue?
kafka.network:type=RequestChannel,name=RequestQueueSize
Size of the request queue. A congested request queue will not be able to process incoming or outgoing requests
Based on what I have learned about Kafka, this queue lives in the network layer. Does that mean the messages in the queue will be dropped? Is this a case of data loss?
A message still sitting in the request queue has not yet been appended to the log nor replicated to the replicas.
Depending on your producer (mainly acks attribute) and broker configuration (min.insync.replicas), you're risking data loss.
Set acks to a higher value to ensure that your request has been processed.
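For example, a minimal sketch of the producer side (the bootstrap address and topic name are placeholders; min.insync.replicas is configured separately on the broker or topic):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DurableProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // acks=all: the send is acknowledged only after the leader and all in-sync replicas have the record;
        // combined with min.insync.replicas >= 2 on the broker/topic, an under-replicated write fails loudly
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // get() waits for the broker's acknowledgement (or throws) instead of fire-and-forget
            producer.send(new ProducerRecord<>("my-topic", "key", "value")).get();
        }
    }
}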
I am referring to this article:
https://medium.com/trendyol-tech/how-to-implement-retry-logic-with-spring-kafka-710b51501ce2
It says that if we use the following:
factory.setErrorHandler(new SeekToCurrentErrorHandler(new DeadLetterPublishingRecoverer(kafkaTemplate), 3));
It will block the main consumer while it's waiting for the retry. (https://medium.com/trendyol-tech/how-to-implement-retry-logic-with-spring-kafka-710b51501ce2#:~:text=Also%20it%20blocks%20the%20main%20consumer%20while%20its%20waiting%20for%20the%20retry)
So my question is: do we really need to retry on the main topic, or can we move the failed messages to a retry topic and process them there, so that our main topic is non-blocking?
Can we achieve non-blocking retry using STCH (SeekToCurrentErrorHandler)?
Non-blocking retries were added in the recent 2.7 release.
https://docs.spring.io/spring-kafka/docs/current/reference/html/#retry-topic
Achieving non-blocking retry / DLT functionality with Kafka usually requires setting up extra topics and creating and configuring the corresponding listeners. Since 2.7, Spring for Apache Kafka offers support for that via the @RetryableTopic annotation and the RetryTopicConfiguration class to simplify that bootstrapping.
If message processing fails, the message is forwarded to a retry topic with a back off timestamp. The retry topic consumer then checks the timestamp and if it’s not due it pauses the consumption for that topic’s partition. When it is due the partition consumption is resumed, and the message is consumed again. If the message processing fails again the message will be forwarded to the next retry topic, and the pattern is repeated until a successful processing occurs, or the attempts are exhausted, and the message is sent to the Dead Letter Topic (if configured).
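A minimal sketch of what that looks like with the annotation (the topic name, attempt count and back off values are placeholders, and the usual Spring Boot Kafka auto-configuration is assumed):

import org.springframework.kafka.annotation.DltHandler;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.annotation.RetryableTopic;
import org.springframework.retry.annotation.Backoff;
import org.springframework.stereotype.Component;

@Component
public class OrderListener {

    // retries happen on auto-created retry topics, so the main topic's consumer is not blocked
    @RetryableTopic(attempts = "4", backoff = @Backoff(delay = 1000, multiplier = 2.0))
    @KafkaListener(topics = "orders")
    public void listen(String message) {
        // throwing here forwards the record to the next retry topic with a back off timestamp
        process(message);
    }

    @DltHandler
    public void handleDlt(String message) {
        // called once all attempts are exhausted and the record lands on the DLT
    }

    private void process(String message) {
        // actual business logic
    }
}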
I am using Spring Kafka and want to know when a Kafka consumer gets evicted from the group. Does it get evicted when the processing time exceeds the poll interval? If so, isn't the purpose of the heartbeat to indicate that the consumer is alive? In that case the consumer should never be evicted unless the process itself fails.
You are correct that the heartbeat thread tells the group that the consumer process is still alive. The reason for additionally considering a consumer to be gone when there is excessive time between polls is to prevent livelock.
Without this, a consumer might never poll and so would hold on to its partitions without making any progress through them.
The question then is really why there is a heartbeat and session timeout at all. The heartbeat thread actually does other work as well (pre-fetching), but I assume the reason it is used to check that consumers are alive is that it generally talks to the broker more frequently than the polling thread, since the latter has to process messages, so a failed consumer process is spotted earlier.
In short, there are three things that can trigger a rebalance: a change in the number of partitions at the broker end, polling taking longer than max.poll.interval.ms, and a gap between heartbeats longer than session.timeout.ms.
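For reference, a configuration fragment showing where those two timeouts (and the batch size) are set on the consumer; the addresses and values below are only placeholders:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
// heartbeat gap after which the coordinator considers the member dead
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 10_000);
// maximum allowed time between calls to poll() before the member is evicted
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300_000);
// fewer records per poll means less processing time between polls
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 100);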
Suppose one of my programs consumes messages from a Kafka topic. While processing a message, the consumer accesses a DB, and the DB access fails for some reason. We don't want to abandon the message; we need to park it for later processing. In JMS, when message processing fails, the application container puts the message back on the queue, so it is not lost. In Kafka, once a message is received the offset moves on and the next message arrives. How do I handle this?
There are two approaches to achieve this.
Set the Kafka acknowledge mode to manual and, in case of error, terminate the consumer thread without committing the offset (if group management is enabled, a new consumer will be added after rebalancing is triggered and will poll the same batch).
The second approach is simpler: have an error topic and publish messages to it whenever an error occurs, so that later you can consume them or keep track of them.
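A minimal sketch of the second approach using Spring Kafka (topic names are placeholders, a KafkaTemplate bean is assumed to exist, and the listener container's ack mode is assumed to be set to MANUAL):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.support.Acknowledgment;
import org.springframework.stereotype.Component;

@Component
public class ParkingListener {

    private final KafkaTemplate<String, String> kafkaTemplate;

    public ParkingListener(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    @KafkaListener(topics = "orders")
    public void listen(ConsumerRecord<String, String> record, Acknowledgment ack) {
        try {
            process(record); // e.g. the DB call that may fail
        } catch (Exception e) {
            // park the message on an error topic instead of losing it
            kafkaTemplate.send("orders.error", record.key(), record.value());
        }
        // commit the offset either way so the main topic keeps moving
        ack.acknowledge();
    }

    private void process(ConsumerRecord<String, String> record) {
        // business logic / DB access
    }
}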
I'm running a Kafka cluster with 4 nodes, 1 producer, and 1 consumer. It was working fine until the consumer failed. Now, after I restart the consumer, it starts consuming new messages, but after some minutes it throws this error:
[WARN ]: org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Auto offset commit failed for group eventGroup: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
And it starts consuming the same messages again and loops forever.
I increased the session timeout and tried changing the group id, but it still does the same thing.
Also, is the client version of the Kafka consumer a big deal?
I'd suggest you decouple the consumer and the processing logic, to start with. For example, let the Kafka consumer only poll messages and, perhaps after sanitizing them (if necessary), delegate the actual processing of each record to a separate thread, then see if the same error still occurs. The error says you're spending too much time between subsequent polls, so this might resolve your issue. Also, please mention the version of Kafka you're using; Kafka had a different heartbeat management policy before version 0.10, which could make this issue easier to reproduce.