We would like to create a Kafka retry mechanism for failures. I have seen many designs that introduce multiple 'retry' topics. I was wondering why I can't simplify the flow by cloning the message, adding a retry-counter field to it, and simply re-producing it onto the same topic until it has been retried X times and is then considered exhausted.
What am I missing with that mechanism?
Basically, you're missing a configurable delay, which should be part of any retry strategy. The simple approach you have presented will lead to very high CPU usage during an outage (for example, when some service you depend on is unavailable for minutes or hours). The best approach is an exponential backoff strategy: with each retry you increase the delay after which message processing is attempted again. And this "delivery delay" is something Kafka doesn't support.
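For illustration, a minimal sketch of computing such a backoff; the base delay and cap are arbitrary example values, not something from the question:

    // Standalone sketch of an exponential backoff calculation with a cap.
    public class Backoff {

        static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
            long delay = baseMillis * (1L << Math.min(attempt, 20)); // doubles on each retry
            return Math.min(delay, maxMillis);
        }

        public static void main(String[] args) {
            // attempt 0 -> 1s, 1 -> 2s, 2 -> 4s, ... capped at 10 minutes
            for (int attempt = 0; attempt < 12; attempt++) {
                System.out.println(attempt + " -> " + backoffMillis(attempt, 1_000, 600_000) + " ms");
            }
        }
    }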
I'm not sure if I understand the question correctly. Nevertheless, I would suggest a few Kafka 'retry' strategies:
- Postponing message processing in case of failure is not a trivial process. If you would like to postpone the processing of some messages, you can republish them to separate retry topics, one for each delay value.
- Handling a failed message can then be achieved by cloning it and republishing it to one of those retry topics.
- Messages in the retry topics are already sorted in 'retry_timestamp' order, so consumers of the retry topics can simply block the thread until it is time to process the message (see the sketch below).
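A rough sketch of such a retry-topic consumer, written with plain kafka-clients; the topic and header names ("orders-retry-5m", "retry_timestamp") and the config values are assumptions:

    import java.nio.charset.StandardCharsets;
    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.header.Header;

    public class RetryTopicConsumer {

        public static void main(String[] args) throws InterruptedException {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // placeholder
            props.put("group.id", "orders-retry-5m-consumer"); // placeholder
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            // max.poll.interval.ms must exceed this topic's delay, because we block below.
            props.put("max.poll.interval.ms", "600000");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("orders-retry-5m"));
                while (true) {
                    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                        Header ts = record.headers().lastHeader("retry_timestamp");
                        long retryAt = Long.parseLong(new String(ts.value(), StandardCharsets.UTF_8));
                        long wait = retryAt - System.currentTimeMillis();
                        if (wait > 0) {
                            Thread.sleep(wait); // block until it is time to process the message
                        }
                        process(record);        // re-run the original processing logic
                    }
                }
            }
        }

        private static void process(ConsumerRecord<String, String> record) {
            // original handler; on failure, republish to the next retry topic or a dead letter topic
        }
    }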
Related
Let's say we have a Kafka consumer polling from a heavily loaded normal topic, and for each event it makes a client call to a service. The duration of the client call may vary, sometimes fast, sometimes slow. We have a retry topic, so whenever the client call has an issue we produce a retry event.
Here is an interesting design question: which domain should be responsible for producing the retry event?
If we let the consumer handle producing the retry event, we have to make the consumer wait until the client call finishes, which brings the risk of consumer lag because our event processing becomes slow.
If we let the service handle producing the retry event, that solves the consumer lag issue because the consumer just fires and forgets. However, when the service tries to produce a retry event but fails, the retry record for the current client call might be lost forever.
I have also thought of adding a DB for persisting retry events, but that raises the concern of what happens if the DB write fails: we might lose the retry just as we would with a Kafka produce error.
The expectation is to keep this resilient so that every failed event gets a chance to be retried, while at the same time avoiding the consumer lag issue.
I'm not sure I completely understand the question, but I will give it a shot. To summarise, you want to ensure the producer retries if the produce request fails.
The default for the producer's retries configuration is 2147483647. If the produce request fails, it will keep retrying.
However, produce requests will fail before the number of retries is exhausted if the timeout configured by delivery.timeout.ms expires first, before a successful acknowledgement. The default for delivery.timeout.ms is 2 minutes, so you might want to increase it.
To ensure the producer always sends the record, you also want to focus on the producer's acks configuration.
If acks=all, all replicas in the ISR must acknowledge the record before it is considered successful. This guarantees that the record will not be lost as long as at least one in-sync replica remains alive. This is the strongest available guarantee.
The above can cause duplicate messages. If you wanted to avoid duplicates, I can also let you know how to do that.
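Putting those settings together, a minimal sketch; the broker address, topic name, and the raised timeout value are placeholders:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ReliableRetryProducer {

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.ACKS_CONFIG, "all");                  // strongest durability guarantee
            props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);   // the default, shown explicitly
            props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 300_000); // raised from the 2 minute default

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("retry-topic", "key", "payload"),
                        (metadata, exception) -> {
                            if (exception != null) {
                                // gave up after delivery.timeout.ms; log, alert, or fall back here
                            }
                        });
            }
        }
    }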
With Spring for Apache Kafka, the DeadLetterPublishingRecoverer (which can be used to publish to your "retry" topic) has a property failIfSendResultIsError.
When this is true (default), the recovery operation fails and the DefaultErrorHandler will detect the failure and re-seek the failed consumer record so that it will continue to be retried.
The non-blocking retry mechanism uses this recoverer internally so the same behavior will occur there too.
https://docs.spring.io/spring-kafka/docs/current/reference/html/#retry-topic
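For example, a minimal sketch of wiring the recoverer into a DefaultErrorHandler; the ".retry" topic naming and the back-off values are assumptions, not something prescribed by the framework:

    import org.apache.kafka.common.TopicPartition;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.kafka.core.KafkaTemplate;
    import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
    import org.springframework.kafka.listener.DefaultErrorHandler;
    import org.springframework.util.backoff.FixedBackOff;

    @Configuration
    public class RetryErrorHandlerConfig {

        @Bean
        public DefaultErrorHandler errorHandler(KafkaTemplate<Object, Object> template) {
            // Publish failed records to "<topic>.retry" on the same partition (hypothetical naming).
            DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(template,
                    (record, ex) -> new TopicPartition(record.topic() + ".retry", record.partition()));

            // true is the default: if publishing the recovery record fails, the error handler
            // re-seeks the original record so it keeps being retried instead of being lost.
            recoverer.setFailIfSendResultIsError(true);

            // Retry in place 3 times, 1 second apart, before handing the record to the recoverer.
            return new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 3L));
        }
    }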
Most places I look recommend that, to prevent data loss, you should create a "retry" topic. If a consumer fails, it should send a message to the "retry" topic, which would wait a set period of time and then send the message back to the "main" topic.
Isn't this an anti-pattern, since when the message goes back to the "main" topic, all the services subscribed to it would reprocess the failed message even though only one of them failed to process it initially?
Is there a conventional way of solving this such as putting the clientId in the headers for messages that are the result of a retry? Am I missing something?
Dead-letter queues (DLQ), in themselves, are not an anti-pattern. Cycling it back through the main topic might be, but that is subjective.
The alternative would be to "stop the world" and update the consumer code to resolve the errors before the topic retention deletes the messages you care about. OR, make your downstream consumers also read from the DLQ topic(s), but handle them differently from the main topic.
Is there a conventional way of solving this such as putting the clientId in the headers
Maybe if you wanted to track lineage somehow, but re-introducing those previously bad messages would introduce lag and ordering issues that interfere with the "good" messages.
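If you did go down the header route, a rough sketch of what it could look like; the header name and the filtering convention are made up for illustration, not an established pattern:

    import java.nio.charset.StandardCharsets;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.header.Header;

    public final class RetryHeaders {

        public static final String RETRY_FOR_GROUP = "x-retry-for-group"; // made-up header name

        // Tag a record republished to the main topic with the consumer group that failed on it.
        public static ProducerRecord<String, String> tagForRetry(ConsumerRecord<String, String> failed,
                                                                 String failedGroupId) {
            ProducerRecord<String, String> retry =
                    new ProducerRecord<>(failed.topic(), failed.key(), failed.value());
            retry.headers().add(RETRY_FOR_GROUP, failedGroupId.getBytes(StandardCharsets.UTF_8));
            return retry;
        }

        // Process normal records everywhere; process retried records only in the group that failed.
        public static boolean shouldProcess(ConsumerRecord<String, String> record, String myGroupId) {
            Header h = record.headers().lastHeader(RETRY_FOR_GROUP);
            return h == null || myGroupId.equals(new String(h.value(), StandardCharsets.UTF_8));
        }
    }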
Related question
What is the best practice to retry messages from Dead letter Queue for Kafka
Let's imagine a simple message processing pipeline like the one described below:
A group of consumers listens to a topic, picks messages one by one, does some sort of processing and sends them over to the next topic.
Some messages crash the consumer or make it stuck forever (so then a liveness probe kills the consumer after timeout).
In this case the consumer is not able to commit the offset, so the poison message gets picked up by another consumer, and makes that one crash as well.
Ideally we want to move the message to a dead letter topic after N such attempts.
This can be achieved by introducing a shared storage:
But this creates coupling between the services and introduces a Single Point of Failure (SPOF) which is the shared database.
I'm looking for ideas on how to work this around with stateless services.
If this approach fits your context (that's something you should judge, as I'm only trying to give a suggestion), please consider decoupling the consumption and the processing.
In your case, the consumer is stopped not because it was unable to read from Kafka, or because the Kafka broker wasn't able to provide messages, but because the processing of the message was too slow and/or unsuccessful.
The consumer was, in fact, receiving the messages correctly. It was the processing of them that caused it to be declared dead.
First of all, read the KafkaConsumer javadoc section regarding this (just above the constructor summary). The second option is the one quoted here:
2. Decouple Consumption and Processing
Another alternative is to have one or more consumer threads that do
all data consumption and hands off ConsumerRecords instances to a
blocking queue consumed by a pool of processor threads that actually
handle the record processing. This option likewise has pros and cons:
PRO: This option allows independently scaling the number of consumers
and processors. This makes it possible to have a single consumer that
feeds many processor threads, avoiding any limitation on partitions.
CON: Guaranteeing order across the processors requires particular care
as the threads will execute independently an earlier chunk of data may
actually be processed after a later chunk of data just due to the luck
of thread execution timing. For processing that has no ordering
requirements this is not a problem.
CON: Manually committing the position becomes harder as it requires
that all threads co-ordinate to ensure that processing is complete for
that partition.
Essentially, it works like this: the consumer keeps reading and hands off the responsibility for processing and process-timeout management to the processor threads.
The error handling of the message processing would be the responsibility of the processor threads as well. For example, if a timeout expires or an exception occurs, the processor will send the message to your defined "dead" queue, or perform whatever handling you wish, without involving the consumer. Regardless of whether the processor threads succeed or fail, the consumer will continue its job and never be considered dead for not calling poll() within the specified timeout.
You should control the number of messages the consumer retrieves in each poll call so that the processors are not saturated. It is a balance between how fast the processors finish their job, how many messages the consumer retrieves (max.poll.records) at each iteration, and what the specified timeout for the consumer is.
Decoupled workflow
The first element to note is the queue (with a limited size, which you should also manage so it doesn't fill up and cause an OOM).
This queue would be the link between the consumer and the processor threads: essentially a buffer that could dynamically grow or shrink depending on the workload at any given time. It would absorb overloads, somewhat like a dam or barrier, to draw a comparison.
                                         ----->WORKERTHREAD1
KAFKA <------> CONSUMER ----> QUEUE -----|
                                         ----->WORKERTHREAD2
What you get is a second queue-lag mechanism:
1. Kafka Consumer LAG (the messages still to be read from the partition/topic)
2. Queue LAG (received messages still need to be processed)
                                                 --->WORKERTHREAD1
KAFKA <--(LAG)--> CONSUMER ----> QUEUE --(LAG)--|
                                                 --->WORKERTHREAD2
The queue could be some kind of synchronized queue, such as a ConcurrentLinkedQueue, for example. Or you could manage the synchronization yourself with a customized queue.
Essentially, the duties would be divided, and the consumer is given the easiest one (as it's the one that is most crucial).
Responsibilities:
Consumer
    consume --> send to queue
Workers
    read from queue |--> [manage timeout]
                    |==> PROCESS MESSAGE ==> send to topic
                    |--> [handle failed messages]
You should also handle the case where the processor threads die or deadlock, but such mechanisms are usually already implemented in most ThreadPool variants.
I suggest the workers share a single KafkaProducer; the producer is thread safe, and since the output topic would be the same for the group of consumers, this would also improve performance. Also, from the KafkaProducer javadoc:
The producer is thread safe and sharing a single producer instance
across threads will generally be faster than having multiple
instances.
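As a rough sketch of the whole arrangement (topic names, pool size, and queue capacity are placeholders; back-pressure comes from a bounded BlockingQueue):

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class DecoupledPipeline {

        // Bounded buffer between the consumer thread and the worker pool (the "dam").
        private static final BlockingQueue<ConsumerRecord<String, String>> QUEUE =
                new LinkedBlockingQueue<>(1_000);

        public static void main(String[] args) {
            Properties consumerProps = new Properties();
            consumerProps.put("bootstrap.servers", "localhost:9092"); // placeholder
            consumerProps.put("group.id", "pipeline");                // placeholder
            consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            Properties producerProps = new Properties();
            producerProps.put("bootstrap.servers", "localhost:9092"); // placeholder
            producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            // One producer shared by all workers (it is thread safe).
            KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps);

            ExecutorService workers = Executors.newFixedThreadPool(4);
            for (int i = 0; i < 4; i++) {
                workers.submit(() -> {
                    while (!Thread.currentThread().isInterrupted()) {
                        try {
                            ConsumerRecord<String, String> record = QUEUE.take();
                            try {
                                String result = process(record.value()); // slow and/or failing work
                                producer.send(new ProducerRecord<>("output-topic", record.key(), result));
                            } catch (Exception e) {
                                // Failed or timed-out processing goes to the dead letter topic,
                                // without ever involving the consumer thread.
                                producer.send(new ProducerRecord<>("dead-letter-topic", record.key(), record.value()));
                            }
                        } catch (InterruptedException ie) {
                            Thread.currentThread().interrupt();
                        }
                    }
                });
            }

            // The consumer thread only polls and enqueues, so it keeps calling poll()
            // well within max.poll.interval.ms even when processing is slow.
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
                consumer.subscribe(List.of("input-topic"));
                while (true) {
                    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                        QUEUE.put(record); // blocks (back-pressure) when the buffer is full
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        private static String process(String value) {
            return value.toUpperCase(); // placeholder for the real processing
        }
    }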
In summary, each consumer thread feeds n processor threads. Some variants could be:
- 1 consumer - 1 worker (no processing parallelization, just division of duties)
- 1 consumer - 2 workers
- 1 consumer - 4 workers
- 2 consumers - 4 workers (2 for each)
- 2 consumers - 8 workers (4 for each)
...
Read the pros and cons of this mechanism carefully in the javadoc, and judge whether it could be a solution to your specific case.
In my opinion, there's a PRO that doesn't get reflected in the docs, and it is the root of this answer/suggestion:
Consumption shouldn't be affected by processing. This approach avoids any consumer thread being considered dead due to slow processing of the messages, and offers an extra "safety window" thanks to the queue. I'm not saying that if all processors fail for every message, or the queue hits its maximum size, the consumer would continue happily as if nothing were wrong; it will in fact be stopped by processing eventually, but much, much later and for bigger reasons that couldn't be avoided. This approach buys some extra time, or an extra shield, before that happens. Just like a dam can fail if it can't hold any more water.
Well, I hope you take this as a suggestion, and may it be helpful somehow. It may avoid most of the dead consumer issues you're having. If well managed, it's a good approach for a 24/7 real-time data workflow.
I am wondering if there is something I am missing about my setup to facilitate long running jobs.
For my purposes it is OK to have at-most-once message delivery; this means it is not required to think about committing offsets (or at least it is OK to commit each message's offset upon receiving it).
I have the following in order to achieve the competing consumer pattern:
A topic
X consumers in the same group
P partitions in a topic (where P >= X always)
My problem is that I have messages that can take ~15 minutes to process (but this may fluctuate by up to 50%, let's say). In order to avoid consumers having their partition assignments revoked, I have increased the value of max.poll.interval.ms to reflect this.
However this comes with some negative consequences:
if some message exceeds this length of time, then in the worst-case scenario the consumer processing this message will have to wait up to the value of max.poll.interval.ms for a rebalance
if I need to scale up and increase the number of consumers based on load, then any new consumers might also have to wait up to max.poll.interval.ms for a rebalance to occur before processing any new messages
As it stands at the moment I see that I can proceed as follows:
Set max.poll.interval.ms to be a small value and accept that every consumer processing every message will time out and go through the process of having assignments revoked and waiting a small amount of time for a rebalance
However I do not like this, and am considering looking at alternative technology for my message queue as I do not see any obvious way around this.
Admittedly I am new to Kafka, and it is just a gut feeling that the above is not desirable.
I have used RabbitMQ in the past for these scenarios, however we need Kafka in our architecture for other purposes at the moment and it would be nice not to have to introduce another technology if Kafka can achieve this.
I appreciate any advice that anybody can offer on this subject.
Using Kafka as a job queue for scheduling long running processes is not a good idea, as Kafka is not a queue in the strictest sense and its semantics for failure handling and retries are limited. Though you might be able to reach a compromise by playing around with certain configurations for rebalancing or timeouts, it is likely to remain a brittle design. The simple answer is that Kafka was not designed for this kind of use case.
The idea of max.poll.interval.ms is to prevent a livelock situation, but in your case the consumer will send a false positive to the Kafka broker and trigger a rebalance, as there is no way to distinguish between a livelock and a legitimate long-running process.
You should think about the tradeoffs between living with the negative consequences you mentioned vs. introducing a new technology that helps you model a job queue in a better way. For a more complex use case, check out how Slack does it.
The way we got around the issues we were having was as suggested in the comments.
We decided to decouple the message processing from the consumer polling.
On each worker/consumer there were 2 threads, one for doing the actual processing and the other for phoning home to Kafka periodically.
We also did some work with trying to reduce the processing times for messages.
However some messages still take time that can be measured in minutes.
This has worked for us now for some time with no issues.
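For anyone hitting the same issue, a rough sketch of the "phoning home" idea, under the assumption that the thread owning the consumer pauses its partitions and keeps polling while a worker thread runs the long job; the topic name and config values are placeholders, not the actual setup described above:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class LongJobConsumer {

        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder
            props.put("group.id", "long-jobs");               // placeholder
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("max.poll.records", "1");               // one long job at a time

            ExecutorService worker = Executors.newSingleThreadExecutor();

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("jobs"));
                while (true) {
                    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                        consumer.commitSync();                 // at-most-once is acceptable here
                        Future<?> job = worker.submit(() -> runLongJob(record.value()));

                        consumer.pause(consumer.assignment()); // stop fetching while the job runs
                        while (!job.isDone()) {
                            // Keep calling poll() so max.poll.interval.ms is never exceeded;
                            // paused partitions return no records.
                            consumer.poll(Duration.ofSeconds(5));
                        }
                        consumer.resume(consumer.assignment());
                    }
                }
            }
        }

        private static void runLongJob(String payload) {
            // ~15 minutes of work in the real system
        }
    }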
Thanks for the suggestions in the comments, @Donal.
I am using Apache Storm to read messages from Kafka and insert data into a database. The messages correspond to transactions that happened in an Oracle DB. Since we are using Storm there is no guarantee on processing order, and some messages that fail on the first attempt might succeed on retry. We need to limit the retries to a threshold value, say 10, and beyond this, if the failure persists, the messages are moved to an error queue.
To be specific, our requirement is to retry bolt execution (on exception) a specified number of times. After these retries the messages should be moved to the database. We were doing a fieldsGrouping and maintaining a hash map to track the retry count. This was required to make sure that retried messages use the same worker; however, it is causing a major performance impact. We would have to do shuffleGrouping for maximum throughput. What is the best possible way to implement this behaviour?
The following options are considered.
1) Maintain the retry count status in the database (might be a performance issue).
2) Maintain the retry count status in an external cache (like ElastiCache in Amazon AWS); however, it might be expensive.
3) Maintain the retry count status in the message itself and re-post the message to the Kafka queue, also swallowing the exception that occurred in the bolt so the tuple is treated as a success.
We are not considering exponential retry inside the bolt itself, as it might also cause performance issues.
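For illustration, option 3 could look something like the sketch below, written with plain kafka-clients and a record header rather than a field in the payload; the names, the threshold handling, and the error topic are assumptions, and in Storm this would be invoked from the bolt's exception handler:

    import java.nio.charset.StandardCharsets;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.header.Header;

    public class RetryCounter {

        private static final String RETRY_HEADER = "retry-count"; // made-up header name
        private static final int MAX_RETRIES = 10;

        // On failure, republish to the same topic with an incremented count,
        // or divert to the error topic once the threshold is reached.
        public static void retryOrFail(KafkaProducer<String, String> producer,
                                       ConsumerRecord<String, String> failed) {
            Header h = failed.headers().lastHeader(RETRY_HEADER);
            int attempts = (h == null) ? 0 : Integer.parseInt(new String(h.value(), StandardCharsets.UTF_8));

            String target = attempts < MAX_RETRIES ? failed.topic() : "error-topic";
            ProducerRecord<String, String> out = new ProducerRecord<>(target, failed.key(), failed.value());
            out.headers().add(RETRY_HEADER, String.valueOf(attempts + 1).getBytes(StandardCharsets.UTF_8));
            producer.send(out);
        }
    }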
Please advise.