resilience4J for aggregated async results - retry on failure and/or timeout? - apache-kafka

I have a scenario where I need to track the results of a group of Kafka messages. Each message is sent, and we expect a SUCCESS or FAILURE message to return after a period of time. If any message of the group fails, or if a message does not return, we should retry.
I'd like to leverage the Retry logic of 'resilience4J' but am unsure how to configure the RetryConfig so that it aggregates the responses correctly.
In my scenario I have this pseudocode for submitting the messages:
KafkaClient client;
KafkaRequest message; // 10 of these
And I then want to configure a RetryConfig in a manner similar to:
RetryConfig config = RetryConfig.<String>custom()
    .retryOnResult(response -> "FAILURE".equals(response))
    .retryExceptions(IOException.class, TimeoutException.class)
    .build();
I guess I'm unsure which data structure / Future I should add the message details to, so that the callback logic from Kafka and the Retry code can process the same result set.
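One possible shape for that shared data structure, sketched with plain JDK futures rather than resilience4j's API: keep one CompletableFuture per message, let the reply listener complete it, and resend only the messages whose future ended in something other than SUCCESS. The class name GroupRetry, the send hook, and the "TIMEOUT" marker are all hypothetical names for illustration, and completeOnTimeout needs Java 9 or later.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical helper: sends a group of messages, waits for every reply,
// and resends only the messages that failed or timed out.
class GroupRetry {

    // 'send' is an assumed hook: it submits one message and returns a future
    // that the reply listener later completes with "SUCCESS" or "FAILURE".
    // Messages are used as map keys, so they must be distinct.
    static <M> List<M> sendGroup(List<M> messages,
                                 Function<M, CompletableFuture<String>> send,
                                 int maxAttempts,
                                 long timeoutMillis) {
        List<M> pending = new ArrayList<>(messages);
        for (int attempt = 1; attempt <= maxAttempts && !pending.isEmpty(); attempt++) {
            // One future per message that is still outstanding.
            Map<M, CompletableFuture<String>> inFlight = new LinkedHashMap<>();
            for (M message : pending) {
                inFlight.put(message, send.apply(message)
                        .completeOnTimeout("TIMEOUT", timeoutMillis, TimeUnit.MILLISECONDS)
                        .exceptionally(error -> "FAILURE"));
            }
            // Wait for the whole group, then keep only the non-successful messages.
            CompletableFuture.allOf(inFlight.values().toArray(new CompletableFuture[0])).join();
            List<M> retry = new ArrayList<>();
            inFlight.forEach((message, reply) -> {
                if (!"SUCCESS".equals(reply.join())) {
                    retry.add(message);
                }
            });
            pending = retry;
        }
        return pending; // messages that never reported SUCCESS
    }
}
```

An empty return value means the whole group succeeded within maxAttempts. The same per-message futures could also be wrapped in a single supplier and handed to a resilience4j Retry, but that wiring is not shown here.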


Kafka producer callback Exception

When we produce messages we can define a callback, this callback can expect an exception:
kafkaProducer.send(producerRecord, new Callback() {
    public void onCompletion(RecordMetadata recordMetadata, Exception e) {
        if (e == null) {
            // OK
        } else {
            // handle the exception
        }
    }
});
Considering the built-in retry logic in the producer, I wonder which kinds of exception developers should deal with explicitly.
According to the Callback Javadoc, the following exceptions can occur during the callback:
The exception thrown during processing of this record. Null if no error occurred. Possible thrown exceptions include:
Non-Retriable exceptions (fatal, the message will never be sent):
Retriable exceptions (transient, may be covered by increasing #.retries):
Maybe this is an unsatisfactory answer, but in the end which exceptions to handle, and how, depends entirely on your use case and business requirements.
Handling Producer Retries
However, as a developer you also need to deal with the retry mechanism itself of the Kafka Producer. The retries are mainly driven by:
retries: Setting a value greater than zero will cause the client to resend any record whose send fails with a potentially transient error. Note that this retry is no different than if the client resent the record upon receiving the error. Allowing retries without setting max.in.flight.requests.per.connection (default: 5) to 1 will potentially change the ordering of records because if two batches are sent to a single partition, and the first fails and is retried but the second succeeds, then the records in the second batch may appear first. Note additionally that produce requests will be failed before the number of retries has been exhausted if the timeout configured by delivery.timeout.ms expires first before successful acknowledgement. Users should generally prefer to leave this config unset and instead use delivery.timeout.ms to control retry behavior.
retry.backoff.ms: The amount of time to wait before attempting to retry a failed request to a given topic partition. This avoids repeatedly sending requests in a tight loop under some failure scenarios.
request.timeout.ms: The configuration controls the maximum amount of time the client will wait for the response of a request. If the response is not received before the timeout elapses the client will resend the request if necessary or fail the request if retries are exhausted. This should be larger than replica.lag.time.max.ms (a broker configuration) to reduce the possibility of message duplication due to unnecessary producer retries.
The recommendation is to keep the default values of those three configurations above and rather focus on the hard upper time limit defined by delivery.timeout.ms:
delivery.timeout.ms: An upper bound on the time to report success or failure after a call to send() returns. This limits the total time that a record will be delayed prior to sending, the time to await acknowledgement from the broker (if expected), and the time allowed for retriable send failures. The producer may report failure to send a record earlier than this config if either an unrecoverable error is encountered, the retries have been exhausted, or the record is added to a batch which reached an earlier delivery expiration deadline. The value of this config should be greater than or equal to the sum of request.timeout.ms and linger.ms.
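As a rough illustration of that recommendation, the sketch below leaves the retry settings alone and only sets the overall time limit. The property names delivery.timeout.ms, request.timeout.ms and linger.ms are taken from the Kafka producer configuration documentation as I understand it. The numeric values and the broker address are illustrative assumptions, not recommendations, and ProducerTimeouts is a hypothetical helper class.

```java
import java.util.Properties;

// Hypothetical helper that builds producer settings and checks the
// "delivery timeout covers request timeout plus linger" constraint.
class ProducerTimeouts {

    // Illustrative values only; 'retries' and 'retry.backoff.ms' are
    // deliberately left unset so the client defaults apply.
    static Properties build() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed address
        props.setProperty("linger.ms", "5");
        props.setProperty("request.timeout.ms", "30000");
        props.setProperty("delivery.timeout.ms", "120000");
        return props;
    }

    // Returns true when delivery.timeout.ms >= request.timeout.ms + linger.ms.
    static boolean isConsistent(Properties props) {
        long linger = Long.parseLong(props.getProperty("linger.ms"));
        long request = Long.parseLong(props.getProperty("request.timeout.ms"));
        long delivery = Long.parseLong(props.getProperty("delivery.timeout.ms"));
        return delivery >= request + linger;
    }
}
```

The resulting Properties object can be passed to a KafkaProducer constructor once the serializer settings are added.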
You may get BufferExhaustedException or TimeoutException
Just bring your Kafka broker down after the producer has produced one record, and then continue producing records. After some time, you should see exceptions in the callback.
This is because the metadata is fetched when you send the first record; after that, records are batched and buffered, and they eventually expire after some timeout, at which point you may see these exceptions.
I suppose that the timeout is which when expired give you a TimeoutException exception there.
Trying to add more info to @Mike's answer, I think only a few exceptions are enumerated in the Callback interface.
Here you can see the whole list: kafka.common.errors
And here you can see which ones are retriable and which ones are not: kafka protocol guide
And the code could be something like this:
producer.send(record, callback)

def callback: Callback = new Callback {
  override def onCompletion(recordMetadata: RecordMetadata, e: Exception): Unit = {
    if (null != e) {
      // ".." stands for the remaining non-retriable exception types
      if (e.isInstanceOf[RecordTooLargeException] || e.isInstanceOf[UnknownServerException] /* || .. */) {
        log.error("Winter is coming") // it's non-retriable
      } else {
        log.warn("It's not that cold") // it's retriable
      }
    } else {
      log.debug("It's summer. Everything is fine")
    }
  }
}

Does receiving a message id (future) indicate the message is sent in PubSub?

I am sending messages to google pubsub using a callback function which reads back the message id from the future. Using the following code:
"""Publishes multiple messages to a Pub/Sub topic with an error handler."""
import time

from google.cloud import pubsub_v1

# ...

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_name)


def get_callback(f, data):
    def callback(f):
        try:
            print(f.result())
        except:  # noqa
            print('Please handle {} for {}.'.format(f.exception(), data))

    return callback


for i in range(10):
    data = str('message')
    # When you publish a message, the client returns a future.
    future = publisher.publish(
        topic_path, data=data.encode('utf-8')  # data must be a bytestring.
    )
    # Publish failures shall be handled in the callback function.
    future.add_done_callback(get_callback(future, data))

print('Published message with error handler.')
I am receiving back the message id successfully with no errors/exceptions, however I find that some messages are not read into pubsub (when viewing them from the GCP console).
The message id is printed in the line print(f.result()) within the callback function.
My question is:
Is it safe to assume that messages are sent to Pubsub successfully following the receipt of a messageid?
If so, what could be the cause for 'dropped' messages?
If publish has returned successfully with a message ID, then yes, Cloud Pub/Sub guarantees to deliver the message to subscribers. If you do not see the message, there are several things that could cause this:
The subscription did not exist at the time the message was published. Cloud Pub/Sub only delivers messages to subscribers if the subscription was created before the message was published.
Another subscriber for the subscription was already running and received the message. If you are using the GCP console to fetch messages while a subscriber is up and running, it is possible that subscriber got the messages.
If you got the message on the console once and then reloaded to get the messages again, you may not see the message again until the ack deadline has passed, which defaults to 10 seconds.
A single pull request (which is what the GCP console "View Messages" feature does), may not be sufficient to retrieve the message. If you click on it multiple times or start up a basic subscriber, you will likely see the message.

Handling REST request with Message Queue

I have two applications as stated below:
Spring boot application - Acts as rest end point, publishes the request to message queue. ( Apache Pulsar )
Heron (Storm) topology - which processes the message received from Message queue ( PULSAR ) and has all logic for processing.
My requirement: I need to serve different user queries through the Spring Boot application, which emits each query to the message queue, where it is consumed at the spout. Once the spout and bolts process the request, a message is published again from a bolt. That response from the bolt is handled at Spring Boot (consumer), which replies to the user request. Typically as shown below:
To serve the same request, I'm right now caching the DeferredResult object in memory ( I set a reqID on each message which is sent to the topology, and I also maintain a key, value pair for ), and when the message arrives I parse the request id and set the result on the DeferredResult (I know this is a bad design, HOW SHOULD ONE SOLVE THIS ISSUE ?).
How can I proceed to serve the response back to the same request in this scenario, where the order of messages received from the topology is not sequential (as each request that is processed takes its own time, and the producer bolt fires a response as and when it receives one)?
I'm kind of stuck with this design and not able to proceed further.
public DeferredResult<ResponseEntity<?>> process(/* some input */) {
    DeferredResult<ResponseEntity<?>> result = new DeferredResult<>(config.getTimeout());
    CompletableFuture<String> serviceResponse = service.processAsync(inputSource);
    serviceResponse.whenComplete((response, exception) -> {
        if (!ObjectUtils.isEmpty(exception)) {
            result.setErrorResult(exception);
        } else {
            result.setResult(ResponseEntity.ok(response));
        }
    });
    return result;
}

// In Service
public CompletableFuture<String> processAsync(/* input */) {
    CompletableFuture<String> result = new CompletableFuture<>();
    // consumer has a listener as shown below
    // **I want to avoid below line, how can I redesign this**
    map.put(id, result);
    return result;
}

// in same service, a listener is present for consumer for reading the messages
consumerListener(Message msg) {
    int reqID = msg.getRequestID();
    // look up the CompletableFuture stored under reqID and complete it with the message
}
As shown above, as soon as I get a message I get the CompletableFuture object and set the result, which internally calls the DeferredResult object and returns the response to the user.
How can I proceed to serve the response back to the same request in this scenario, where the order of messages received from the topology is not sequential (as each request that is processed takes its own time, and the producer bolt fires a response as and when it receives one)?
It sounds like you are looking for the Correlation Identifier messaging pattern. In broad strokes, you compute/create an identifier that gets attached to the message sent to pulsar, and arrange that Heron copies that identifier from the request it receives to the response it sends.
Thus, when your Spring Boot component is consuming messages from pulsar at step 5, you match the correlation id to the correct http request, and return the result.
Using the original requestId() as your correlation identifier should be fine, as far as I can tell.
To serve the same request, I'm right now caching the DeferredResult object in memory ( I set a reqID on each message which is sent to the topology, and I also maintain a key, value pair for ), and when the message arrives I parse the request id and set the result on the DeferredResult (I know this is a bad design, HOW SHOULD ONE SOLVE THIS ISSUE ?).
Ultimately, you are likely to be doing that at some level; which is to say that the consumer at step 5 is going to be using the correlation id to look up something that was stored by the producer. Trying to pass the original request across four different process boundaries is likely to end in tears.
The more general form is to store a callback, rather than a CompletableFuture, in the map; but in this case the callback probably just completes the future.
The one thing I would want to check carefully in the design: you want to be sure that the consumer at step 5 sees the future it is supposed to use before the message arrives. In other words, there should be a happens-before memory barrier somewhere to ensure that the map lookup at step 5 doesn't fail.
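A minimal sketch of that lookup, assuming string correlation ids and string payloads: the entry is registered in a ConcurrentHashMap before the request is published, which gives the consumer thread the visibility guarantee mentioned above, and entries are dropped on completion or timeout so the map does not grow without bound. CorrelationRegistry and its method names are hypothetical, and orTimeout needs Java 9 or later.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.TimeUnit;

// Hypothetical registry that pairs each outgoing request with its future reply.
class CorrelationRegistry {

    private final ConcurrentMap<String, CompletableFuture<String>> pending =
            new ConcurrentHashMap<>();

    // Call this BEFORE publishing the request, so the entry is visible to the
    // consumer thread by the time any response can arrive.
    CompletableFuture<String> register(String correlationId, long timeoutMillis) {
        CompletableFuture<String> future = new CompletableFuture<>();
        pending.put(correlationId, future);
        // Fail the future after the timeout, and drop the map entry once the
        // future completes for any reason.
        future.orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
              .whenComplete((value, error) -> pending.remove(correlationId, future));
        return future;
    }

    // Call this from the message listener. Returns false for unknown,
    // already answered, or expired correlation ids.
    boolean complete(String correlationId, String payload) {
        CompletableFuture<String> future = pending.remove(correlationId);
        return future != null && future.complete(payload);
    }

    int pendingCount() {
        return pending.size();
    }
}
```

In the Spring Boot component, the future returned by register would be the one whose whenComplete sets the DeferredResult, and the Pulsar listener would call complete with the id copied through the topology.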

How to implement a microservice Event Driven architecture with Spring Cloud Stream Kafka and Database per service

I am trying to implement an event driven architecture to handle distributed transactions. Each service has its own database and uses Kafka to send messages to inform other microservices about the operations.
An example:
Order service -------> | Kafka | -------> Payment Service
      |                                          |
Orders MariaDB DB                     Payment MariaDB Database
Order receives an order request. It has to store the new Order in its DB and publish a message so that Payment Service realizes it has to charge for the item:
private OrderBusiness orderBusiness;

public Order createOrder(@RequestBody Order order){
    try {
        //a.- Save the order in the DB
        //b. Publish in the topic so that Payment Service charges for the item.
    } catch(Exception e){
        logger.error("{}", e);
    }
    return order;
}
These are my doubts:
Steps a.- (save in Order DB) and b.- (publish the message) should be performed in a transaction, atomically. How can I achieve that?
This is related to the previous one: I send the message with: orderSource.output().send(MessageBuilder.withPayload(order).build()); This operation is asynchronous and ALWAYS returns true, no matter if the Kafka broker is down. How can I know that the message has reached the Kafka broker?
Steps a.- (save in Order DB) and b.- (publish the message) should be performed in a transaction, atomically. How can I achieve that?
Kafka currently does not support transactions (and thus also no rollback or commit), which you'd need to synchronize something like this. So in short: you can't do what you want to do. This will change in the near-ish future, when KIP-98 is merged, but that might take some time yet. Also, even with transactions in Kafka, an atomic transaction across two systems is a very hard thing to do. Everything that follows will only be improved upon by transactional support in Kafka; it will still not entirely solve your issue. For that you would need to look into implementing some form of two-phase commit across your systems.
You can get somewhat close by configuring producer properties, but in the end you will have to choose between at least once or at most once for one of your systems (MariaDB or Kafka).
Let's start with what you can do in Kafka to ensure delivery of a message, and further down we'll dive into your options for the overall process flow and what the consequences are.
Guaranteed delivery
You can configure how many brokers have to confirm receipt of your messages, before the request is returned to you, with the parameter acks: by setting this to all you tell the broker to wait until all in-sync replicas have acknowledged your message before returning an answer to you. This is still no 100% guarantee that your message will not be lost, since at that point it has only been written to the page cache, and there are theoretical scenarios with a broker failing before it is persisted to disk, where the message might still be lost. But this is as good a guarantee as you are going to get.
You can further reduce the risk of data loss by lowering the interval at which brokers force an fsync to disk, but please be aware that such settings can bring heavy performance penalties with them.
In addition to these settings you will need to wait for your Kafka producer to return the response for your request to you and check whether an exception occurred. This sort of ties into the second part of your question, so I will go into that further down.
If the response is clean, you can be as sure as possible that your data got to Kafka and start worrying about MariaDB.
Everything we have covered so far only addresses how to ensure that Kafka got your messages, but you also need to write data into MariaDB, and this can fail as well, which would make it necessary to recall a message you potentially already sent to Kafka - and this you can't do.
So basically you need to choose one system in which you are better able to deal with duplicates/missing values (depending on whether or not you resend partial failures) and that will influence the order you do things in.
Option 1
In this option you initialize a transaction in MariaDB, then send the message to Kafka, wait for a response and if the send was successful you commit the transaction in MariaDB. Should sending to Kafka fail, you can rollback your transaction in MariaDB and everything is dandy.
If however, sending to Kafka is successful and your commit to MariaDB fails for some reason, then there is no way of getting back the message from Kafka. So you will either be missing a message in MariaDB or have a duplicate message in Kafka, if you resend everything later on.
Option 2
This is pretty much just the other way around, but you are probably better able to delete a message that was written in MariaDB, depending on your data model.
Of course you can mitigate both approaches by keeping track of failed sends and retrying just these later on, but all of that is more of a bandaid on the bigger issue.
Personally I'd go with approach 1, since the chance of a commit failing should be somewhat smaller than the send itself and implement some sort of dupe check on the other side of Kafka.
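The control flow of option 1 can be sketched roughly as follows. Tx is a hypothetical stand-in for a database transaction (for example a JDBC connection with auto-commit disabled), and publish is an assumed hook that returns a future completed when the broker acknowledges the send. Neither is a real Kafka or Spring API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

class OrderFlow {

    // Hypothetical stand-in for an open database transaction.
    interface Tx {
        void save(String order);
        void commit();
        void rollback();
    }

    // Returns true only if the send was acknowledged and the commit was issued.
    static boolean saveThenPublish(Tx tx, String order,
                                   Function<String, CompletableFuture<?>> publish,
                                   long sendTimeoutMillis) {
        tx.save(order); // a. write inside the still-open transaction
        try {
            // b. publish and block until the ack arrives or the timeout expires
            publish.apply(order)
                   .orTimeout(sendTimeoutMillis, TimeUnit.MILLISECONDS)
                   .join();
        } catch (RuntimeException sendFailure) {
            // No ack was seen. On a timeout the message may still have reached
            // the broker, which is one source of the duplicates discussed above.
            tx.rollback();
            return false;
        }
        // If this commit throws, the message is already in Kafka and cannot be recalled.
        tx.commit();
        return true;
    }
}
```

The dupe check on the consuming side is still needed, because neither the timeout case nor a failed commit can be undone from here.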
This is related to the previous one: I send the message with: orderSource.output().send(MessageBuilder.withPayload(order).build()); This operation is asynchronous and ALWAYS returns true, no matter if the Kafka broker is down. How can I know that the message has reached the Kafka broker?
Now first off, I'll admit I am unfamiliar with Spring, so this may not be of use to you, but the following code snippet illustrates one way of checking produce responses for exceptions.
By calling flush you block until all sends have finished (and either failed or succeeded) and then check the results.
Producer<String, String> producer = new KafkaProducer<>(myConfig);
final ArrayList<Exception> exceptionList = new ArrayList<>();

for (MessageType message : messages) {
    producer.send(new ProducerRecord<String, String>("myTopic", message.getKey(), message.getValue()), new Callback() {
        public void onCompletion(RecordMetadata metadata, Exception exception) {
            if (exception != null) {
                exceptionList.add(exception);
            }
        }
    });
}

// block until all sends have either failed or succeeded
producer.flush();

if (!exceptionList.isEmpty()) {
    // do stuff
}
I think the proper way to implement Event Sourcing is to have Kafka filled directly from events pushed by a plugin that reads the RDBMS binlog, e.g. using Confluent Bottled Water or the more active Debezium. Consuming microservices can then listen to those events, consume them, and act on their respective databases, being eventually consistent with the RDBMS database.

How can a Kafka consumer doing infinite retries recover from a bad incoming message?

I am a Kafka newbie and, as I was reading the docs, I had this design question about the Kafka consumer.
A Kafka consumer reads messages from the Kafka stream, which is made up of one or more partitions from one or more servers.
Let's say one of the incoming messages is corrupt and, as a result, the consumer fails to process it. When processing event logs you don't want to drop any events, so you retry indefinitely to ride out transient errors during processing. With infinite retries, how can the consumer move forward? Is there a way to blacklist this message for the next retry?
I'd think it needs manual intervention: we log some message metadata (I don't know what exactly yet) to see which message is failing, and have logic in place where each consumer checks Redis (or someplace else?) after n retries to see if this message needs to be skipped. The blacklist doesn't have to be stored forever in Redis either, only until the consumer can skip it. Here's pseudocode for what I just described:
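while (errorState) {
    if (msg in blacklist) {
        // skip this message: commit its offset and move on
        break;
    } else {
        errorState = processMessage(msg);
        if (!errorState) {
            // processed successfully: commit the offset
        } else {
            // log this msg so that we can add to blacklist
        }
    }
}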
I'd like to hear from more experienced folks to see if there are better ways to do this.
We had a requirement in our project where the processing of an incoming message to update a record was dependent on the record being present. Due to a race condition, the update sometimes arrived before the insert. For such cases, we implemented a couple of approaches.
A. Manual retry with a predefined delay. The code checks if the insert has arrived. If so, processing goes on as normal. Otherwise, it sleeps for 500ms, then tries again. This repeats up to 10 times. At the end, if the message is still not processed, the code logs the message, commits the offset and moves forward. The processing of a message is always done in a thread from a pool, so it doesn't block the main thread either. However, in the worst case each message would take 5 seconds of application time.
B. Recently, we refined the above solution to use a message scheduler based on Kafka. So now if the insert has not arrived before the update, the system sends it to a separate scheduler which operates on Kafka. This scheduler replays the message after some time. After 3 retries, we again log the message and stop scheduling or retrying. This gives us the benefit of not blocking the application threads and lets us manage when we would like to replay the message.
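Approach A could look roughly like the sketch below. BoundedRetry and the process predicate are hypothetical names: the predicate is assumed to return true once the message has been handled, for example once the insert it depends on is present.

```java
import java.util.function.Predicate;

class BoundedRetry {

    // Tries 'process' up to maxAttempts times, sleeping between attempts.
    // Returns true if the message was processed, false if the caller should
    // log it, commit the offset, and move on.
    static <M> boolean processWithRetry(M message, Predicate<M> process,
                                        int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (process.test(message)) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException interrupted) {
                    // Preserve the interrupt for the caller and give up early.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }
}
```

With the numbers from approach A this would be called with maxAttempts = 10 and delayMillis = 500, from a pool thread rather than from the consumer's polling thread.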