How to handle kafka consumer failures - apache-kafka

I am trying to understand how to handle failed consumer records. How do we
know there is a record failure? What I am seeing is that when record
processing fails in the consumer with a runtime exception, the consumer
keeps retrying. But when the next record is available to process, it
commits the offset of the latest record, which is expected. My question is:
how do we know about the failed record? In older messaging systems, failed
messages are rolled back to the queue and processing stops there. Then we
know the queue is down and we can take action.
I can record the failed record in some DB table, but what happens if this recording fails?
I can move failures to an error/dead letter queue, but again, what happens if this move fails?
I am using Kafka 2.6 with Spring Boot 2.3.4. Any help would be appreciated.

Sounds like you would need to disable auto commits and manually commit the offsets yourself once your scope of "successfully processed" is achieved. If you include external processes like a database, then you will also need to increase the Kafka client timeouts so it doesn't think the consumer is dead while waiting on error logging/handling.
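
For illustration, here is a minimal sketch of that idea with the plain Kafka consumer (the topic name, group id, timeout value and the process()/recordFailure() helpers are made up, not from the question): auto commit is disabled and each record's offset is committed only after the record has been handled one way or the other.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class ManualCommitConsumer {

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");     // no auto commit
            props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "600000");  // room for slow error handling
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("my-topic"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        try {
                            process(record);               // your business logic
                        } catch (RuntimeException ex) {
                            recordFailure(record, ex);     // e.g. DB table or dead-letter topic
                        }
                        // commit this record's offset only after it has been handled
                        consumer.commitSync(Collections.singletonMap(
                                new TopicPartition(record.topic(), record.partition()),
                                new OffsetAndMetadata(record.offset() + 1)));
                    }
                }
            }
        }

        private static void process(ConsumerRecord<String, String> record) { /* business logic */ }

        private static void recordFailure(ConsumerRecord<String, String> record, Exception ex) { /* log/store */ }
    }

With Spring Kafka you can get the same effect by setting the container's AckMode to MANUAL and acknowledging in the listener; the plain client just shows the mechanics more directly.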

Related

Error handling in SpringBoot kafka in Batch mode

I am trying to figure out whether there is any way to send failed records to a dead letter topic in Spring Boot Kafka in batch mode.
I don't want the records to be sent in duplicate, since they are consumed in a batch and a few have already been processed.
I saw this link on spring-kafka consumer batch error handling with Spring Boot version 2.3.7.
I thought about a use case of stopping the container and starting it again without using a DLT, but the duplication issue will come up again in batch mode.
@Garry Russel, can you please provide a small code example for batch error handling?
The RecoveringBatchErrorHandler was added in spring-kafka version 2.5 (which comes with Boot 2.3).
The listener must throw an exception to indicate which record in the batch failed (either the complete record, or the index in the list).
Offsets for the records before the failed one are committed and the failed record can be retried and/or sent to the dead letter topic.
See https://docs.spring.io/spring-kafka/docs/current/reference/html/#recovering-batch-eh
There is a small example there.
The RetryingBatchErrorHandler was added in 2.3.7, but it sends the entire batch to the dead letter topic, which is typically not what you want (hence we added the RecoveringBatchErrorHandler).
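
A rough sketch of wiring that up (assuming spring-kafka 2.5+ and String-keyed/valued ConsumerFactory and KafkaTemplate beans; the topic name, bean names and process() helper are illustrative, not taken from the docs example):

    import java.util.List;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.kafka.annotation.KafkaListener;
    import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
    import org.springframework.kafka.core.ConsumerFactory;
    import org.springframework.kafka.core.KafkaTemplate;
    import org.springframework.kafka.listener.BatchListenerFailedException;
    import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
    import org.springframework.kafka.listener.RecoveringBatchErrorHandler;
    import org.springframework.util.backoff.FixedBackOff;

    @Configuration
    public class BatchDltConfig {

        @Bean
        public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
                ConsumerFactory<String, String> consumerFactory, KafkaTemplate<String, String> template) {
            ConcurrentKafkaListenerContainerFactory<String, String> factory =
                    new ConcurrentKafkaListenerContainerFactory<>();
            factory.setConsumerFactory(consumerFactory);
            factory.setBatchListener(true);
            // retry the failed record, then publish just that record to <topic>.DLT
            factory.setBatchErrorHandler(new RecoveringBatchErrorHandler(
                    new DeadLetterPublishingRecoverer(template), new FixedBackOff(1000L, 2L)));
            return factory;
        }

        @KafkaListener(topics = "my-topic", containerFactory = "kafkaListenerContainerFactory")
        public void listen(List<ConsumerRecord<String, String>> records) {
            for (int i = 0; i < records.size(); i++) {
                try {
                    process(records.get(i));   // hypothetical business logic
                } catch (RuntimeException ex) {
                    // tells the error handler which record failed, so offsets for the
                    // earlier records in the batch are committed rather than reprocessed
                    throw new BatchListenerFailedException("record processing failed", ex, i);
                }
            }
        }

        private void process(ConsumerRecord<String, String> record) { /* business logic */ }
    }

Records before the failed index are committed; the failed record is retried per the BackOff and then published on its own to my-topic.DLT by the recoverer.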

Kafka : Failed to update metadata after 60000 ms with only one broker down

We have a kafka producer configured as -
metadata.broker.list=broker1:9092,broker2:9092,broker3:9092,broker4:9092
serializer.class=kafka.serializer.StringEncoder
request.required.acks=1
request.timeout.ms=30000
batch.num.messages=25
message.send.max.retries=3
producer.type=async
compression.codec=snappy
The replication factor is 3 and the total number of partitions is currently 108.
The rest of the properties are defaults.
This producer was running absolutely fine. Then, for some reason, one of the brokers went down, and our producer started logging
"Failed to update metadata after 60000 ms". Nothing else was in the log; we were only seeing this error. At intervals, a few requests were getting blocked, even though the producer was async.
The issue was resolved when the broker was up and running again.
What could be the reason for this? As per my understanding, one broker going down should not affect the system as a whole.
Posting the answer for anyone who might face this issue -
The reason is an older version of the Kafka producer. Kafka producers take the bootstrap servers as a list. In older versions, when fetching metadata, the producer tries to connect to all of the servers in round-robin fashion. So if one of the brokers is down, the requests going to that server fail, and this message appears.
Solution:
Upgrade to a newer producer version.
Reduce the metadata.fetch.timeout.ms setting: this ensures the main thread is not blocked and send fails fast. The default value is 60000 ms. This is not needed in newer versions.
Note: The Kafka send method blocks until the producer is able to write to the buffer.
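
For reference, a rough sketch of the equivalent settings on the newer Java producer (org.apache.kafka.clients), where max.block.ms is what bounds the metadata/buffer wait; the broker list is taken from the question above, everything else is illustrative:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class NewProducerConfigExample {

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
                    "broker1:9092,broker2:9092,broker3:9092,broker4:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.ACKS_CONFIG, "1");                   // request.required.acks=1
            props.put(ProducerConfig.RETRIES_CONFIG, 3);                  // message.send.max.retries=3
            props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");  // compression.codec=snappy
            // caps how long send() can block waiting for metadata or buffer space (default 60000 ms)
            props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 10000);

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("my-topic", "key", "value"));
            }
        }
    }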
I got the same error because I forgot to create the topic. Once I created the topic the issue was resolved.

Kafka commitTransaction acknowledgement failure

According to Kafka's commitTransaction documentation, commitTransaction will fail with a TimeoutException if it doesn't receive a response within a certain time:
Note that this method will raise TimeoutException if the transaction cannot be committed before expiration of max.block.ms. Additionally, it will raise InterruptException if interrupted. It is safe to retry in either case, but it is not possible to attempt a different operation (such as abortTransaction) since the commit may already be in the progress of completing. If not retrying, the only option is to close the producer.
Consider an application in which a Kafka producer sends a group of records as transaction A.
After the records have been successfully sent to the topic, the Kafka producer executes commitTransaction.
The Kafka cluster receives the commit transaction request and successfully commits the records that are part of transaction A. The cluster sends an acknowledgement of the successful commit.
However, due to some issue this acknowledgement is lost, causing a TimeoutException at the Kafka producer's commitTransaction call. So even though the records have been committed on the Kafka cluster, from the producer's perspective the commit failed.
Generally in such a scenario the application would retry sending the transaction A records in a new transaction B, but this would lead to duplication of records, as they were already committed as part of transaction A.
Is the above described scenario possible?
How do you handle the loss of the commitTransaction acknowledgement and the duplication of records it eventually causes?
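
For what it's worth, a rough sketch of the "safe to retry" advice from the javadoc quoted above (the producer setup and the batch list are assumed, not from the question): on TimeoutException the same commit is retried rather than resending the records in a new transaction, so the retry adds no duplicate records; non-retriable errors fall through to abort or close.

    import java.util.List;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.KafkaException;
    import org.apache.kafka.common.errors.AuthorizationException;
    import org.apache.kafka.common.errors.InterruptException;
    import org.apache.kafka.common.errors.OutOfOrderSequenceException;
    import org.apache.kafka.common.errors.ProducerFencedException;
    import org.apache.kafka.common.errors.TimeoutException;

    public class TransactionalSend {

        // Assumes the producer was created with a transactional.id and
        // initTransactions() has already been called once.
        static void sendTransactionally(KafkaProducer<String, String> producer,
                                        List<ProducerRecord<String, String>> batch) {
            try {
                producer.beginTransaction();
                for (ProducerRecord<String, String> record : batch) {
                    producer.send(record);
                }
                commitWithRetry(producer);
            } catch (ProducerFencedException | OutOfOrderSequenceException | AuthorizationException fatal) {
                producer.close();             // fatal: this producer can no longer be used
            } catch (KafkaException e) {
                producer.abortTransaction();  // not committed; safe to resend the batch later
            }
        }

        static void commitWithRetry(KafkaProducer<String, String> producer) {
            // In practice you might cap the retries and close the producer, per the javadoc.
            while (true) {
                try {
                    producer.commitTransaction();
                    return;
                } catch (TimeoutException | InterruptException e) {
                    // The broker may already have committed the transaction; retrying the same
                    // commit does not resend records, so no duplicates are produced.
                }
            }
        }
    }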

How to log messages which have failed to acknowledged in Apache Storm

I am reading messages from Kafka and processing them in Storm.
I can see that some messages are failing in the Storm UI.
I want to log these messages and figure out why they failed. There are no messages in the logs for this as such.
If they have failed, then they are supposed to be replayed by the spout again. The KafkaSpout should have a fail method which you can use to identify the failed message IDs, e.g. by overriding it as sketched below.
This might provide some direction: the logs of every topology execution are stored in the logs folder (apache-storm-0.10.0/logs).
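
If it helps, a rough sketch of overriding fail() on the old storm-kafka spout (storm.kafka.KafkaSpout, as shipped around Storm 0.10); the package names, constructor and logging choice here are assumptions, so adjust them to the spout version you actually use:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import storm.kafka.KafkaSpout;
    import storm.kafka.SpoutConfig;

    public class LoggingKafkaSpout extends KafkaSpout {

        private static final Logger LOG = LoggerFactory.getLogger(LoggingKafkaSpout.class);

        public LoggingKafkaSpout(SpoutConfig spoutConfig) {
            super(spoutConfig);
        }

        @Override
        public void fail(Object msgId) {
            // msgId identifies the partition/offset of the tuple that failed or timed out,
            // so log it here before letting the spout schedule the replay as usual
            LOG.warn("Kafka tuple failed: {}", msgId);
            super.fail(msgId);
        }
    }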

Connection not cleaned up if client failure occurs abruptly without closing the resources

I am using hornetq-2.2.14.Final, and the connection-ttl configured in hornetq-jms.xml is 60000 ms. I have a publisher program which sends messages to a topic and a consumer program which consumes messages from the topic. My consumer program exited abruptly without closing the resources. I waited one minute, since the TTL is 60000 ms, but the server is not cleaning up the resources even after that. Can anyone help me resolve this issue, if it is a configuration issue?
Sometimes it can take as long as 2X the TTL, depending on how the failure happened. We recently had a fix on master to make sure it is always close to the configured TTL.