How can I implement delay retry message pattern mentioned in this article https://jack-vanlightly.com/blog/2017/3/24/rabbitmq-delayed-retry-approaches-that-work. I don't want to loose the messages from the dead letter queue and also want retry with backoff. Once number of retries are exhausted I want the messages to be queued in DLQ.
There is a similar question asked on stack overflow but as per the solution it will continuously retry after dlq-ttl expiry. Also, I do not want to loose messages from DLQ after retry attempts are exhausted.
I believe you are looking for the functionality provided by the RetryTemplate. Is that right?
Related
I am referring:
https://medium.com/trendyol-tech/how-to-implement-retry-logic-with-spring-kafka-710b51501ce2
And it says that if we use below:
factory.setErrorHandler(new SeekToCurrentErrorHandler(new DeadLetterPublishingRecoverer(kafkaTemplate), 3));
It will block the main consumer while its waiting for the retry. (https://medium.com/trendyol-tech/how-to-implement-retry-logic-with-spring-kafka-710b51501ce2#:~:text=Also%20it%20blocks%20the%20main%20consumer%20while%20its%20waiting%20for%20the%20retry)
So, my question is do we really need retry on main topic or can we move the failed messages to a retry topic and then process messages there so that our main topic is non-blocking.
Can we achieve non-blocking retry using STCH?
Non-blocking retries were recently added to the new 2.7 release.
https://docs.spring.io/spring-kafka/docs/current/reference/html/#retry-topic
Achieving non-blocking retry / dlt functionality with Kafka usually requires setting up extra topics and creating and configuring the corresponding listeners. Since 2.7 Spring for Apache Kafka offers support for that via the #RetryableTopic annotation and RetryTopicConfiguration class to simplify that bootstrapping.
If message processing fails, the message is forwarded to a retry topic with a back off timestamp. The retry topic consumer then checks the timestamp and if it’s not due it pauses the consumption for that topic’s partition. When it is due the partition consumption is resumed, and the message is consumed again. If the message processing fails again the message will be forwarded to the next retry topic, and the pattern is repeated until a successful processing occurs, or the attempts are exhausted, and the message is sent to the Dead Letter Topic (if configured).
Most places I look recommend that to prevent data loss you should create a "retry" topic. If a consumer fails it should send a message to the "retry" topic which would wait a set period of time and then send a message back to the "main" topic.
Isn't this an anti-pattern since when it goes back to the "main" topic all the services subscribed to the "main" topic would reprocess the failed message even though only one of the services failed to process it initially?
Is there a conventional way of solving this such as putting the clientId in the headers for messages that are the result of a retry? Am I missing something?
Dead-letter queues (DLQ), in themselves, are not an anti-pattern. Cycling it back through the main topic might be, but that is subjective.
The alternative would be to "stop the world" and update the consumer code to resolve the errors before the topic retention deletes the messages you care about. OR, make your downstream consumers also read from the DLQ topic(s), but handle them differently from the main topic.
Is there a conventional way of solving this such as putting the clientId in the headers
Maybe if you wanted to track lineage somehow, but re-introducing those previously bad messages would introduce lag and ordering issues that interfere with the "good" messages.
Related question
What is the best practice to retry messages from Dead letter Queue for Kafka
We would like to create a retry kafka mechanism for failures. I saw many introduced a way of have.multiple 'retry' topics. Was wondering why cant i simplify the flow by clone the message add into it a retry-counter field and just re-produce it on the same topic until reached X times and then exhausted.
What do I miss with that mechanism?
Basically you're missing a configurable delay that should be present in retrying strategy. The simple approach you have presented will lead to very high CPU usage during some outage (for example some service you depend on is unavailable for some minutes or hours). The best approach is to have exponential backoff strategy - with each retry you increase a delay after which a message processing is retried again. And this "delivery delay" is something Kafka doesn't support.
Not Sure if I understand the question correctly. Nevertheless, I would suggest that you have some Kafka 'retry' strategies.
Messages in 'retry' topics are already sorted in ‘retry_timestamp’
order
Because postponing message processing in case of failure is not a trivial process
If you would like to postpone processing of
some messages, you can republish them to separate topics, one for
each with some delay value
The failed messages processing can be
achieved by cloning the message and later republishing it to one of
the retry topics
Consumers of retry topics could block the thread
(unless it is time to process the message)
In the kafka consumer documentation https://kafka.apache.org/10/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html it states that care needs to taken to make sure poll is called every so often or the broker will assume the consumer is dead.
The most reliable procedure was pretty complicated:
For use cases where message processing time varies unpredictably,
neither of these options may be sufficient. The recommended way to
handle these cases is to move message processing to another thread,
which allows the consumer to continue calling poll while the processor
is still working. Some care must be taken to ensure that committed
offsets do not get ahead of the actual position. Typically, you must
disable automatic commits and manually commit processed offsets for
records only after the thread has finished handling them (depending on
the delivery semantics you need). Note also that you will need to
pause the partition so that no new records are received from poll
until after thread has finished handling those previously returned.
Does spring kafka handle this for me under the hood?
The heartbeat is mentioned very brief in the documentation. Apparently the heartbeat is managed by Spring-Kafka on a different thread.
Since version 0.10.1.0 heartbeats are sent on a background thread
You can also read this github issue to read more about the heartbeat.
In my kafka consumer threads(high level), after I consumed a message I am applying some business logic to this message and forwarding this to a WS. But this webservice may be down sometimes and since I consumed this object from kafka and offset is moved forward, i would missed this object.
One way get rid of from this problem is to disabling autocommit in zookeeper and committing offset by calling programmaticaly but i expect that this is a very costly operation. I will be producing to kafka at about 2000 tps and may increase later times.
Another way - which i am not sure if it is a good idea - is if i face with any problem, producing this consumed object to kafka again but i didn't see any post related to this across all my googleings. Is this a thing which is even not considerable?
Can you please give me some insights about handling this situation.
Thanks
You can post back the failed message to the same topic or another of your choice.
If you use the same topic, you will push the messages at the end of the topic and they will be picked up after the others (so if order matters to you don't do this). Also if the action that you perform before sending the message is not idempotent you will have to something to identifying this records so they don't perform the action twice.
If you use a failed_topic, you can push the messages that you can't send to this topic and when the WS is healthy again you need to create a consumer that consumes all the messages there and sends them to the WS.
Hope it helps!
Moving such messages to an error queue and retrying them later is a well known approach.
See Dead letter channel