Kafka producer callback Exception - apache-kafka

When we produce messages we can define a callback, and this callback can receive an exception:
kafkaProducer.send(producerRecord, new Callback() {
    public void onCompletion(RecordMetadata recordMetadata, Exception e) {
        if (e == null) {
            // OK
        } else {
            // NOT OK
        }
    }
});
Considering the built-in retry logic in the producer, I wonder which kinds of exceptions developers should handle explicitly?

According to the Callback Javadoc, the following exceptions can be passed to the callback:
The exception thrown during processing of this record. Null if no error occurred. Possible thrown exceptions include:
Non-Retriable exceptions (fatal, the message will never be sent):
InvalidTopicException
OffsetMetadataTooLargeException
RecordBatchTooLargeException
RecordTooLargeException
UnknownServerException
Retriable exceptions (transient, may be covered by increasing #.retries):
CorruptRecordException
InvalidMetadataException
NotEnoughReplicasAfterAppendException
NotEnoughReplicasException
OffsetOutOfRangeException
TimeoutException
UnknownTopicOrPartitionException
Maybe this is an unsatisfactory answer, but in the end which exceptions to handle, and how to handle them, relies completely on your use case and business requirements.
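In practice, one way to branch on the two groups is to check whether the exception derives from Kafka's org.apache.kafka.common.errors.RetriableException base class. The sketch below assumes the kafkaProducer, producerRecord and log objects from the question; sendToDeadLetterStore is a hypothetical helper:
kafkaProducer.send(producerRecord, new Callback() {
    public void onCompletion(RecordMetadata recordMetadata, Exception e) {
        if (e == null) {
            // acknowledged according to the acks setting
        } else if (e instanceof RetriableException) {
            // transient error that still failed after the producer's own retries
            // (retries/delivery.timeout.ms exhausted) - alert or re-send yourself
            log.warn("Retriable send failure", e);
        } else {
            // fatal for this record: it will never be sent as-is
            log.error("Non-retriable send failure", e);
            sendToDeadLetterStore(producerRecord); // hypothetical helper
        }
    }
});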
Handling Producer Retries
However, as a developer you also need to deal with the retry mechanism of the Kafka producer itself. The retries are mainly driven by:
retries: Setting a value greater than zero will cause the client to resend any record whose send fails with a potentially transient error. Note that this retry is no different than if the client resent the record upon receiving the error. Allowing retries without setting max.in.flight.requests.per.connection (default: 5) to 1 will potentially change the ordering of records because if two batches are sent to a single partition, and the first fails and is retried but the second succeeds, then the records in the second batch may appear first. Note additionally that produce requests will be failed before the number of retries has been exhausted if the timeout configured by delivery.timeout.ms expires first before successful acknowledgement. Users should generally prefer to leave this config unset and instead use delivery.timeout.ms to control retry behavior.
retry.backoff.ms: The amount of time to wait before attempting to retry a failed request to a given topic partition. This avoids repeatedly sending requests in a tight loop under some failure scenarios.
request.timeout.ms: The configuration controls the maximum amount of time the client will wait for the response of a request. If the response is not received before the timeout elapses the client will resend the request if necessary or fail the request if retries are exhausted. This should be larger than replica.lag.time.max.ms (a broker configuration) to reduce the possibility of message duplication due to unnecessary producer retries.
The recommendation is to keep the default values of those three configurations above and instead focus on the hard upper time limit defined by
delivery.timeout.ms: An upper bound on the time to report success or failure after a call to send() returns. This limits the total time that a record will be delayed prior to sending, the time to await acknowledgement from the broker (if expected), and the time allowed for retriable send failures. The producer may report failure to send a record earlier than this config if either an unrecoverable error is encountered, the retries have been exhausted, or the record is added to a batch which reached an earlier delivery expiration deadline. The value of this config should be greater than or equal to the sum of request.timeout.ms and linger.ms.
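Putting that together, a producer configuration that keeps the retry defaults and bounds the total send time via delivery.timeout.ms might look like the sketch below (the values are illustrative, not recommendations):
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
// leave retries, retry.backoff.ms and request.timeout.ms at their defaults and
// bound the total time (buffering + awaiting acks + retries) instead:
props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120000); // 2 minutes (the default)
// delivery.timeout.ms must be >= request.timeout.ms + linger.ms
KafkaProducer<String, String> producer = new KafkaProducer<>(props);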

You may get BufferExhaustedException or TimeoutException
Just bring your Kafka broker down after the producer has produced one record, and then continue producing records. After some time, you should see exceptions in the callback.
This is because, when you send the first record, the metadata is fetched; after that, the records are batched and buffered, and they eventually expire after some timeout, at which point you see these exceptions in the callback.
I suppose that the timeout is delivery.timeout.ms which, when it expires, gives you a TimeoutException in the callback.
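Note that the two exceptions surface in different places: a full buffer (or missing metadata) shows up synchronously from send() after max.block.ms, while expired batches show up asynchronously in the callback after delivery.timeout.ms. A rough sketch, assuming producer, record and log exist and TimeoutException is org.apache.kafka.common.errors.TimeoutException:
try {
    producer.send(record, (metadata, e) -> {
        if (e instanceof TimeoutException) {
            // batch expired in the buffer: no ack within delivery.timeout.ms
            log.error("Record expired before delivery", e);
        }
    });
} catch (TimeoutException e) {
    // thrown directly by send(): buffer full or metadata unavailable for
    // longer than max.block.ms (BufferExhaustedException is a subclass)
    log.error("send() blocked for too long", e);
}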

Trying to add more info to @Mike's answer: I think only a few exceptions are enumerated in the Callback interface Javadoc.
Here you can see the whole list: kafka.common.errors
And here, you can see which ones are retriable and which ones are not: kafka protocol guide
And the code could be something like this:
producer.send(record, callback)
def callback: Callback = new Callback {
  override def onCompletion(recordMetadata: RecordMetadata, e: Exception): Unit = {
    if (e != null) {
      e match {
        case _: RecordTooLargeException | _: UnknownServerException => // add further non-retriable types here
          log.error("Winter is coming") // it's non-retriable
          writeDiscardRecordsToSomewhereElse
        case _ =>
          log.warn("It's not that cold") // it's retriable
      }
    } else {
      log.debug("It's summer. Everything is fine")
    }
  }
}

Related

resilience4J for aggregated async results - retry on failure and/or timeout?

I have a scenario where I need to track the results of a group of Kafka messages. Each message is sent and we expect a SUCCESS or FAILURE message to return after a period of time. If any message of the group fails, or if a message does not return, we should retry.
I'd like to leverage the Retry logic of 'resilience4J' but am unsure how to correctly configure the RetryConfig so that it aggregates the responses correctly.
In my scenario I have this pseudocode for submitting the messages:
KafkaClient client;
KafkaRequest message; // 10 of these
And I then want to configure a RetryConfig in a manner similar to:
RetryConfig config = RetryConfig.custom()
    .maxAttempts(2)
    .waitDuration(Duration.ofMillis(100))
    //.intervalFunction()
    .retryOnResult(response -> "FAILURE".equals(response))
    .retryExceptions(IOException.class, TimeoutException.class)
    .build();
I guess I'm unsure which data structure / Future I should add the message details to, so that the callback logic from Kafka and the Retry code can process the same result set.
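One possible shape for this, sketched under the assumption that there is a blocking helper (here a hypothetical client.sendAndAwaitReply(message)) which sends one KafkaRequest and returns its "SUCCESS"/"FAILURE" reply, and that messages stands for the collection of the 10 requests, is to build one Retry per message from the config above and decorate that call:
RetryRegistry registry = RetryRegistry.of(config); // the RetryConfig from above

for (KafkaRequest message : messages) {            // the 10 requests
    Retry retry = registry.retry("request-" + message);
    Supplier<String> decorated =
        Retry.decorateSupplier(retry, () -> client.sendAndAwaitReply(message)); // hypothetical helper
    String result = decorated.get();               // "SUCCESS", or "FAILURE" once attempts are exhausted
}
How the reply gets back (a CompletableFuture completed from the Kafka callback, or a poll against a store) is exactly the data-structure decision the question is about; Retry only needs the decorated call to return the value its retryOnResult predicate inspects.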

Distinguish how to handle exceptions in async Kafka producer

When producing messages to Kafka you can get two kinds of errors: retriable and non-retriable. How should you differentiate between them when handling them?
I want to produce records asynchronously, saving to another topic (or HBase) those for which the callback object receives a non-retriable exception, and letting the producer handle for me all those that receive a retriable exception (up to a maximum number of attempts; once that maximum is reached, they are treated like the first group).
My question is: will the producer still handle the retriable exceptions by itself even though a callback object is registered?
Because the Callback interface Javadoc says:
Retriable exceptions (transient, may be covered by increasing #.retries)
Could the code be something like this?
producer.send(record, callback)
def callback: Callback = new Callback {
  override def onCompletion(recordMetadata: RecordMetadata, e: Exception): Unit = {
    if (e != null) {
      e match {
        case _: RecordTooLargeException | _: UnknownServerException => // add further non-retriable types here
          log.error("Winter is coming")
          writeDiscardRecordsToSomewhereElse
        case _ =>
          log.warn("It's not that cold") // it's retriable. The producer will keep trying by itself?
      }
    } else {
      log.debug("It's summer. Everything is fine")
    }
  }
}
Kafka version: 0.10.0
Any light will be appreciated! :)
As the Kafka bible (aka Kafka: The Definitive Guide) says:
The drawback is that while commitSync() will retry the commit until it
either succeeds or encounters a nonretriable failure, commitAsync()
will not retry.
The reason it does not retry is that by the time commitAsync() receives a
response from the server, there may have been a later commit that was
already successful.
Imagine that we sent a request to commit offset 2000. There is a temporary
communication problem, so the broker never gets the request and therefore
never responds. Meanwhile, we processed another batch and successfully
committed offset 3000. If commitAsync() now retries the previously failed
commit, it might succeed in committing offset 2000 after offset 3000 was
already processed and committed. In the case of a rebalance, this will
cause more duplicates.
Besides that, you can still maintain an increasing sequence number, which you increment every time you commit, and attach that number to the Callback object. When the time to retry comes, just check whether the current value of the sequence is equal to the number you gave to the callback. If it is, it is safe and you can perform the commit. Otherwise, there has been a newer commit and you should not retry the commit of this offset.
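A minimal sketch of that sequence-number idea applied to commitAsync() (consumer side, matching the book's discussion; the consumer object and the surrounding poll loop are assumed) could look like this:
AtomicLong sequence = new AtomicLong();

// inside the poll loop, after processing a batch:
long thisCommit = sequence.incrementAndGet();
consumer.commitAsync((offsets, exception) -> {
    if (exception != null && thisCommit == sequence.get()) {
        // no newer commit has been started since this one, so retrying is safe
        consumer.commitSync(offsets);
    }
    // otherwise a later commit already superseded this one: do not retry
});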
It seems like a lot of trouble, and that is because if you are thinking about doing this, you should probably change your strategy.

Difference between KafkaProducer.close() and KafkaProducer.flush()

Looking at the documentation, I'm not sure if I understand the difference between using close() and flush().
This is the doc for flush()
Invoking this method makes all buffered records immediately available to send (even if linger.ms is greater than 0) and blocks on the completion of the requests associated with these records. The post-condition of flush() is that any previously sent record will have completed (e.g. Future.isDone() == true). A request is considered completed when it is successfully acknowledged according to the acks configuration you have specified or else it results in an error.
And the doc for close():
This method waits up to timeout for the producer to complete the sending of all incomplete requests. If the producer is unable to complete all requests before the timeout expires, this method will fail any unsent and unacknowledged records immediately.
Does this mean that:
If I use close() and there are some records pending in the in-memory buffer, they won't even be attempted (compared to flush, which would attempt to send them)?
If I use flush(), it might block "forever" if the retries are large? While with close(), I have an upper bound for how long this is going to take?
I suppose if I'm right in 1., a producer with acks=0 would get a confirmation for a record that might never even be attempted, if it is "unlucky" enough to be placed in the in-memory queue right before close() is called.
If you want to keep using the producer and wait for messages to be sent, use flush(); otherwise use close(). close() with a timeout value will also wait for the buffered messages to be sent and their acks to be received (as per the acks config), up to the timeout value, and then close the producer.
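In code, a typical shutdown sequence therefore looks something like the sketch below (close(Duration) exists in newer client versions; older ones use close(long, TimeUnit)):
producer.flush();                        // block until every buffered record has completed (ack or error)
producer.close(Duration.ofSeconds(10));  // release resources; wait at most 10s for anything still outstanding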

Azure Function and queue

I have a function:
public async static Task Run(
    [QueueTrigger("efs-api-call-last-datetime", Connection = "StorageConnectionString")] DateTime queueItem,
    [Queue("efs-api-call-last-datetime", Connection = "StorageConnectionString")] CloudQueue inputQueue,
    TraceWriter log)
{
Then I have a long-running process for handling the message from the queue. The problem is that the message becomes visible in the queue again after 30 seconds, while I am still processing it. I don't want the message to be added back and processed twice.
I would like to have code like:
try
{
    // long operation
}
catch (Exception ex)
{
    // something went wrong: re-add this message with a 1 minute delay
    await inputQueue.AddMessageAsync(new CloudQueueMessage(
        JsonConvert.SerializeObject(queueItem)),
        timeToLive: null,
        initialVisibilityDelay: TimeSpan.FromMinutes(1),
        options: null,
        operationContext: null
    );
}
and prevent it from being re-added automatically. Any way to do it?
There are a couple of things here.
1) When there are multiple queue messages waiting, the queue trigger retrieves a batch of messages and invokes function instances concurrently to process them. By default, the batch size is 16, but this is configurable in host.json (see the sketch after this list); you can set the batch size to 1 if you want to minimize parallel execution. The Microsoft documentation explains this.
2) As it is a long-running process, it seems your message is not completed in time, so the function might time out and the message becomes visible again. You should try to break your function down into smaller functions. Then you can use Durable Functions to chain the work you have to do.
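For reference, in the version 1.x Functions runtime (which matches the TraceWriter/CloudQueue types in the question) the queue batch size is set in host.json roughly as sketched below; treat the exact layout as an assumption to verify against your runtime version, since v2+ nests these settings under "extensions":
{
  "queues": {
    "batchSize": 1,
    "newBatchThreshold": 0
  }
}
With newBatchThreshold at 0, the runtime should not fetch a new batch until the current message has finished, which keeps processing effectively sequential.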
Yes, you can dequeue the same message twice.
Reasons:
1. Worker A dequeues Message B and the invisibility timeout expires. Message B becomes visible again and Worker C dequeues Message B, invalidating Worker A's pop receipt. Worker A finishes its work, goes to delete Message B, and an error is thrown. This is the most common case.
2. The lock on the original message that triggers the first Azure Function to execute is likely expiring. This causes the queue to assume that processing the message failed, and it then uses that message to trigger the Function to execute again.
3. In certain conditions (very frequent queue polling) you can get the same message twice on a GetMessage. This is a type of race condition that, while rare, does occur: Workers A and B are polling very quickly, hit the queue simultaneously and both get the same message. This used to be much more common (SDK 1.0 time frame) under high-polling scenarios, but it has become much rarer with later storage updates (can't recall seeing it recently).
1 and 3 only happen when you have more than 1 worker.
Workaround:
Install version 1.0.11015.0 of azure-webjobs-sdk (visible on the 'Settings' page of the Functions portal). For more details, you could refer to fixing queue visibility renewals.

How can a Kafka consumer doing infinite retries recover from a bad incoming message?

I am a Kafka newbie, and as I was reading the docs, I had this design question related to the Kafka consumer.
A kafka consumer reads messages from the kafka stream which is made up
of one or more partitions from one or more servers.
Let's say one of the incoming messages is corrupt and as a result the consumer fails to process it. But when processing event logs you don't want to drop any events, so you do infinite retries to cover transient errors during processing. In such a case of infinite retries, how can the consumer move forward? Is there a way to blacklist this message for the next retry?
I'd think it needs manual intervention, where we log some message metadata (don't know what exactly yet) to see which message is failing, and have logic in place where each consumer checks Redis (or someplace else?) after n retries to see if this message needs to be skipped. The blacklist doesn't have to be stored in Redis forever either, only until the consumer can skip the message. Here's pseudocode of what I just described:
while (errorState) {
    if (msg in blacklist) {
        // skip
        commitOffset()
    } else {
        errorState = processMessage(msg);
        if (!errorState) {
            commitOffset();
        } else {
            // log this msg so that we can add it to the blacklist
            logger.info(msg)
        }
    }
}
I'd like to hear from more experienced folks to see if there are better ways to do this.
We had a requirement in our project where the processing of an incoming message to update a record was dependent on the record being present. Due to a race condition, the update sometimes arrived before the insert. For such cases, we implemented a couple of approaches.
A. Manual retry with a predefined delay (a sketch of this follows after item B below). The code checks if the insert has arrived. If so, processing goes on as normal. Otherwise, it sleeps for 500 ms, then tries again. This repeats 10 times. At the end, if the message is still not processed, the code logs the message, commits the offset and moves forward. The processing of a message is always done in a thread from a pool, so it doesn't block the main thread either. However, in the worst case each message takes up to 5 seconds of application time.
B. Recently, we refined the above solution to use a message scheduler based on Kafka. So now, if the insert has not arrived before the update, the system sends it to a separate scheduler which operates on Kafka. This scheduler replays the message after some time. After 3 retries, we again log the message and stop scheduling or retrying. This gives us the benefit of not blocking the application threads and of controlling when we would like to replay the message again.
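A rough sketch of approach A, where isInsertPresent, process and logSkippedMessage are hypothetical placeholders for the project-specific pieces, run on a worker thread so the poll loop is not blocked:
void handleUpdate(ConsumerRecord<String, String> record) throws InterruptedException {
    int maxAttempts = 10;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        if (isInsertPresent(record)) {   // hypothetical lookup for the prerequisite insert
            process(record);             // hypothetical business logic
            return;
        }
        Thread.sleep(500);               // fixed delay; worst case ~5 seconds per message
    }
    logSkippedMessage(record);           // hypothetical: log it, then commit the offset and move forward
}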