How to use exponential backoff with manual acknowledgment of Kafka messages - apache-kafka

I have a Spring Boot project that receives messages on a Kafka topic. The application itself is fairly simple:
Receive a message
Perform a REST request to another service.
Acknowledge the message upon success.
If, for whatever reason, the REST request fails, an indefinite exponential backoff keeps retrying it (messages have an expiration time; once that is exceeded, the message is discarded so nothing retries permanently).
Scenario:
A message is received that fails the REST request and is in the retry loop (Offset 1).
Another message is received that succeeds (Offset 2). Because it succeeds, this second message is acknowledged, which commits the partition offset at 2.
If the application crashes or is brought down in this scenario, when it comes back up the message at offset 1 is lost.
What can I do to ensure the acknowledgments are sequential? Replaying messages is not a concern because there is deduplication logic implemented backed by a database that tracks message state.

an indefinite exponential backoff is implemented to keep trying the REST request
Another message is received that succeeds (Offset 2).
That won't happen if you implement the indefinite retry via an ExponentialBackOff configured in the listener container's SeekToCurrentErrorHandler.
You will always get offset 1 re-delivered until successful; you will not see offset 2 before that.
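For reference, a minimal sketch of what that configuration might look like (bean names, ack mode, and back-off intervals are assumptions, not taken from the question):

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.listener.ContainerProperties.AckMode;
import org.springframework.kafka.listener.SeekToCurrentErrorHandler;
import org.springframework.util.backoff.ExponentialBackOff;

@Configuration
public class KafkaRetryConfig {

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        factory.getContainerProperties().setAckMode(AckMode.MANUAL);

        // ExponentialBackOff has no max elapsed time by default, so the failed
        // record is re-seeked and retried indefinitely; offset 2 is never
        // delivered ahead of offset 1.
        ExponentialBackOff backOff = new ExponentialBackOff(1_000L, 2.0);
        backOff.setMaxInterval(60_000L);
        factory.setErrorHandler(new SeekToCurrentErrorHandler(backOff));
        return factory;
    }
}
```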

Related

Should the consumer or the client produce the retry event?

Let's say we have a Kafka consumer polling from a heavily loaded topic, and for each event it makes a client call to a service. The client call's duration varies, sometimes fast and sometimes slow; we have a retry topic, so whenever the client call has an issue we produce a retry event.
Here is an interesting design question: which domain should be responsible for producing the retry event?
If we let the consumer handle producing the retry, it has to wait for the client call to finish, which risks consumer lag because event processing becomes slow.
If we let the service handle producing the retry, that solves the consumer-lag issue because the consumer can just send and forget. However, if the service tries to produce a retry event and fails, the retry record might be lost forever within that client call.
I also thought of adding a database to persist retry events, but that raises a further concern: if the DB write fails, we lose the retry just as we would on a Kafka produce error.
The expectation is to keep this resilient so that every failed event gets a chance to be retried, while also avoiding consumer lag.
I'm not sure I completely understand the question, but I will give it a shot. To summarise, you want to ensure the producer retries if sending the event fails.
The default for the producer's retries configuration is 2147483647, so if a produce request fails it will keep retrying.
However, produce requests will fail before the retries are exhausted if the timeout configured by delivery.timeout.ms expires before a successful acknowledgement. The default for delivery.timeout.ms is 2 minutes, so you might want to increase it.
To ensure the producer always sends the record, you also want to look at the acks producer configuration.
If acks=all, all replicas in the ISR must acknowledge the record before it is considered successful. This guarantees that the record will not be lost as long as at least one in-sync replica remains alive. This is the strongest available guarantee.
The above can cause duplicate messages. If you wanted to avoid duplicates, I can also let you know how to do that.
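A rough sketch of those producer settings (the broker address, serializers, and the raised timeout value are placeholders, not from the question):

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

// retries defaults to 2147483647; delivery.timeout.ms (default 2 minutes)
// is what actually bounds how long the producer keeps trying.
props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 300_000);

// acks=all: every in-sync replica must acknowledge before the send succeeds.
props.put(ProducerConfig.ACKS_CONFIG, "all");

// enable.idempotence=true avoids the duplicates mentioned above.
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
```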
With Spring for Apache Kafka, the DeadLetterPublishingRecoverer (which can be used to publish to your "retry" topic) has a property failIfSendResultIsError.
When this is true (default), the recovery operation fails and the DefaultErrorHandler will detect the failure and re-seek the failed consumer record so that it will continue to be retried.
The non-blocking retry mechanism uses this recoverer internally so the same behavior will occur there too.
https://docs.spring.io/spring-kafka/docs/current/reference/html/#retry-topic
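A sketch of wiring this up with the DefaultErrorHandler (the retry topic name and back-off values are illustrative, not from the question):

```java
import org.apache.kafka.common.TopicPartition;
import org.springframework.context.annotation.Bean;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<Object, Object> template) {
    // Publish failed records to an illustrative "orders.retry" topic.
    DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(template,
            (record, ex) -> new TopicPartition("orders.retry", record.partition()));

    // true is the default: if publishing the retry record fails, recovery fails
    // and the error handler re-seeks the original record so it is retried.
    recoverer.setFailIfSendResultIsError(true);

    return new DefaultErrorHandler(recoverer, new FixedBackOff(1_000L, 3L));
}
```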

retry logic blocks the main consumer while it's waiting for the retry in Spring

I am referring to:
https://medium.com/trendyol-tech/how-to-implement-retry-logic-with-spring-kafka-710b51501ce2
And it says that if we use the below:
factory.setErrorHandler(new SeekToCurrentErrorHandler(new DeadLetterPublishingRecoverer(kafkaTemplate), 3));
It will block the main consumer while it's waiting for the retry (https://medium.com/trendyol-tech/how-to-implement-retry-logic-with-spring-kafka-710b51501ce2#:~:text=Also%20it%20blocks%20the%20main%20consumer%20while%20its%20waiting%20for%20the%20retry).
So my question is: do we really need to retry on the main topic, or can we move the failed messages to a retry topic and process them there so that our main topic is non-blocking?
Can we achieve non-blocking retry using STCH (SeekToCurrentErrorHandler)?
Non-blocking retries were added in the 2.7 release.
https://docs.spring.io/spring-kafka/docs/current/reference/html/#retry-topic
Achieving non-blocking retry / DLT functionality with Kafka usually requires setting up extra topics and creating and configuring the corresponding listeners. Since 2.7, Spring for Apache Kafka offers support for this via the @RetryableTopic annotation and the RetryTopicConfiguration class to simplify that bootstrapping.
If message processing fails, the message is forwarded to a retry topic with a back off timestamp. The retry topic consumer then checks the timestamp and if it’s not due it pauses the consumption for that topic’s partition. When it is due the partition consumption is resumed, and the message is consumed again. If the message processing fails again the message will be forwarded to the next retry topic, and the pattern is repeated until a successful processing occurs, or the attempts are exhausted, and the message is sent to the Dead Letter Topic (if configured).
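As a rough sketch of that support (the topic name, number of attempts, and delays are made up for illustration), a listener might look like:

```java
import org.springframework.kafka.annotation.DltHandler;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.annotation.RetryableTopic;
import org.springframework.retry.annotation.Backoff;
import org.springframework.stereotype.Component;

@Component
public class OrderListener {

    // Spring creates the retry topics and the DLT for "orders" and handles
    // the back-off between them; the main listener is not blocked.
    @RetryableTopic(attempts = "4", backoff = @Backoff(delay = 1000, multiplier = 2.0))
    @KafkaListener(topics = "orders")
    public void listen(String message) {
        process(message); // throwing here forwards the record to the next retry topic
    }

    @DltHandler
    public void handleDlt(String message) {
        // the record exhausted all retries
    }

    private void process(String message) {
        // business logic
    }
}
```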

When message processing fails, can the consumer put the message back on the same topic?

Suppose one of my programs is consuming messages from a Kafka topic. During processing of a message, the consumer accesses a database, and that db access fails for whatever reason. We don't want to abandon the message; we need to park it for later processing. In JMS, when message processing fails, the application container puts the message back on the queue, so it is not lost. In Kafka, once a message is received the offset advances and the next message comes. How do I handle this?
There are two approaches to achieve this.
Set the Kafka acknowledge mode to manual and, in case of error, terminate the consumer thread without committing the offset (if group management is enabled, a new consumer will be added after a rebalance is triggered and will poll the same batch).
The second approach is simpler: have an error topic and publish messages to it in case of any error, so you can consume them later or keep track of them.
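A minimal sketch of the second approach (topic names are placeholders; the container factory is assumed to be configured with manual acknowledgment):

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.support.Acknowledgment;
import org.springframework.stereotype.Component;

@Component
public class ParkingListener {

    private final KafkaTemplate<String, String> template;

    public ParkingListener(KafkaTemplate<String, String> template) {
        this.template = template;
    }

    @KafkaListener(topics = "work-topic", groupId = "worker")
    public void listen(ConsumerRecord<String, String> record, Acknowledgment ack) {
        try {
            process(record.value());
            ack.acknowledge(); // commit the offset only on success
        } catch (Exception ex) {
            // park the record on an error topic for later processing,
            // then commit so the main topic keeps moving
            template.send("work-topic.errors", record.key(), record.value());
            ack.acknowledge();
        }
    }

    private void process(String value) {
        // db access, etc.
    }
}
```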

How to recover from exceptions sent by producer.send() in Spring Cloud Stream

We experienced the following scenario :
We have a Kafka cluster composed of 3 nodes, each topic created has 3 partitions
A message is sent through MessageChannel.send(), producing a record for, let's say, partition 1
The broker acting as the partition leader for that partition fails
By default, MessageChannel.send() returns true and doesn't throw any exception, even if the KafkaProducer ultimately can't send the message. About 30 seconds after this call, we observe the following message in the logs: Expiring 10 record(s) for helloworld-topic-1 due to 30008 ms has passed since batch creation plus linger time
In our case, this is not acceptable, as we have to be sure that all messages have been delivered to Kafka by the time the call to MessageChannel.send() returns.
We set spring.cloud.stream.kafka.bindings.<channelName>.producer.sync to true, which does exactly what the documentation describes: it blocks the caller until the producer acknowledges success or failure of the delivery (MessageTimeoutException, InterruptedException, ExecutionException), all of this controlled by KafkaProducerMessageHandler. It seems to be the best approach for us, as the performance impact is negligible in our case.
But do we need to take care of the retry ourselves if an exception is thrown (in our client code, with @Retryable for instance)?
Here is a simple project to experiment : https://github.com/phdezann/spring-cloud-bus-kafka-helloworld
If the send() is performed on the @StreamListener thread and the exception is thrown back to the binder, the binder retry configuration will perform retries.
However, since you are doing the send on an HTTP thread, you will need to do your own retry (call send within the scope of a RetryTemplate) or make the controller method @Retryable.
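A sketch of the second option, retrying the send from the HTTP-facing code (this assumes producer.sync=true and @EnableRetry on a configuration class; the class and attempt values are illustrative):

```java
import org.springframework.messaging.MessageChannel;
import org.springframework.messaging.support.MessageBuilder;
import org.springframework.retry.annotation.Backoff;
import org.springframework.retry.annotation.Retryable;
import org.springframework.stereotype.Service;

@Service
public class HelloPublisher {

    private final MessageChannel output; // the bound output channel

    public HelloPublisher(MessageChannel output) {
        this.output = output;
    }

    // With producer.sync=true, send() blocks until the broker acknowledges and
    // throws on failure, so @Retryable provides the retries on this HTTP thread.
    @Retryable(maxAttempts = 5, backoff = @Backoff(delay = 1000, multiplier = 2.0))
    public void publish(String payload) {
        output.send(MessageBuilder.withPayload(payload).build());
    }
}
```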

activemq constantly retrying error messages and not picking up new messages

I have an ActiveMQ instance set up with Tomcat for background message processing. It is set up to retry failed messages every 10 minutes over a retry period.
Now some dirty data has entered the system, because of which some messages are failing. This is OK and can be fixed in the future. However, the problem is that none of the new, correct incoming messages are getting processed; the error messages are constantly being retried instead.
Any tips on what might be the issue, or how the priority is set? I haven't controlled the priority of the messages manually.
Thanks for your help.
-Pulkit
EDIT: I was able to solve the problem. The issue was that by the time all the dirty messages had been handled, it was time for them to be retried again, so none of the new messages were being consumed from the queue.
A dirty message was basically a message that threw an exception due to some dirty data in the system. The redelivery settings were to redeliver every 10 minutes for 1 day:
maximumRedeliveries=144
redeliveryDelayInMillis=600000
acknowledge.mode=transacted
ActiveMQ determines redelivery for a consumer based on the configuration of the RedeliveryPolicy that's assigned to the ActiveMQConnectionFactory. Local redelivery halts new message dispatch until the rolled-back transaction's messages are successfully redelivered, so if you have a message that causes an error, such that you throw an exception or roll back the transaction, it will be redelivered up to the maximum-redeliveries setting in the policy. Since your question doesn't provide much information on your setup and what you consider an error message, I can't really direct you to a solution.
You should look at the settings available in the RedeliveryPolicy. You can also configure redelivery not to block new message dispatch using the setNonBlockingRedelivery method.
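As a sketch of that configuration (the broker URL is a placeholder; the values mirror the 10-minute/1-day window described above):

```java
import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.activemq.RedeliveryPolicy;

ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");

RedeliveryPolicy policy = factory.getRedeliveryPolicy();
policy.setMaximumRedeliveries(144);          // 144 x 10 minutes = 1 day
policy.setInitialRedeliveryDelay(600_000L);  // 10 minutes
policy.setRedeliveryDelay(600_000L);
policy.setUseExponentialBackOff(false);

// Redeliver rolled-back messages without blocking dispatch of new messages.
factory.setNonBlockingRedelivery(true);
```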