InvalidPidMappingException causes Kafka Streams application to shut down - apache-kafka

I have this application written in Kafka Streams. Every now and then it throws an InvalidPidMappingException.
Caused by: org.apache.kafka.common.KafkaException: org.apache.kafka.common.errors.InvalidPidMappingException: The producer attempted to use a producer id which is not currently assigned to its transactional id.
And I have this snippet of code which sets the Uncaught Exception Handler for the app
streams.setUncaughtExceptionHandler(
    (Thread thread, Throwable exp) -> {
        log.error("Unhandled exception in thread {}", thread.getName(), exp);
        SpringApplication.exit(applicationContext, () -> 1);
    }
);
I understand that this exception occurs when the coordinator expires the producer's transactional id after not receiving any transaction status updates from it.
I have a few questions regarding this exception:
I thought that after this exception, the producer would retry to sync its transactional id with the coordinator and resume without the Kafka Streams thread being killed. Even if I change the above code snippet to not exit the Spring application on InvalidPidMappingException, it still kills the stream thread. Is there a way to avoid the death of the streams thread on InvalidPidMappingException? I have seen the desired behaviour with UnknownProducerIdException. Or am I missing something here?
Other than transactional id expiration, can there be any other reason for this exception to occur?
Why is InvalidPidMappingException handled differently from UnknownProducerIdException? The former kills the stream thread while the latter recovers just fine.
I am using the following versions of the libraries:
spring-kafka-version = '2.5.5.RELEASE'
apache-kafka-clientVersion = '2.5.1'
confluent-version = '5.4.2'

A few months late, but the discussion on this Apache Kafka Jira is helpful.
To summarise: with v2.8.0 and above of the Apache Kafka Streams library, you can use the new setUncaughtExceptionHandler() method on org.apache.kafka.streams.KafkaStreams to handle any uncaught exception and keep the application running. Returning org.apache.kafka.streams.errors.StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse.REPLACE_THREAD terminates the current thread and creates a new one for future processing.
e.g.
kafkaStreams.setUncaughtExceptionHandler(e -> {
    return StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse.REPLACE_THREAD;
});
Javadocs for KafkaStreams::setUncaughtExceptionHandler method added in 2.8.0

Related

Spring Kafka Consumer KafkaListenerErrorHandler vs ErrorHandler. What is the difference?

I am having a hard time wrapping my head around the roles and responsibilities of KafkaListenerErrorHandler and ErrorHandler. Here is my understanding of each one of these so far. Please correct me if I am wrong.
Assumption: Using default Spring Kafka configuration out of the box.
KafkaListenerErrorHandler
This handler gets invoked whenever an exception occurs in a method annotated with @KafkaListener. According to the documentation, this occurs at the listener level.
ErrorHandler
This handler gets invoked whenever an exception is thrown at the container level. This will commit the offset (since by default isAckAfterHandle() returns true) after handling the error (which is simply a log message).
My confusions
Why is there a separate KafkaListenerErrorHandler when we already have the ErrorHandler at the container level and a listener belongs to a container?
Does this mean that when we use the @KafkaListener annotation, the error is never handled by the ErrorHandler at the container level, but rather by the KafkaListenerErrorHandler at the listener level?
If the ErrorHandler at the container level never gets invoked when using the @KafkaListener annotation, then how do the offsets get committed? Is the KafkaListenerErrorHandler responsible for that? How many times will a KafkaListenerErrorHandler retry a failed message?
How does retry/recovery work for KafkaListenerErrorHandler?
The KafkaListenerErrorHandler is mostly used for request/reply scenarios - it allows sending some error indication to the caller by returning some value instead of propagating the exception.
It could also be used, for example, to log the converted Message<?> and re-throw the exception so the container error handler (which now defaults to a SeekToCurrentErrorHandler) can handle the exception. The container error handler does not have access to the converted message.
If the listener error handler does not rethrow the exception, then retry is cancelled.
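For illustration, a minimal sketch of that log-and-rethrow pattern (the configuration class, bean name, topic, and SLF4J logger are assumptions, not part of the original answer):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.listener.KafkaListenerErrorHandler;

@Configuration
public class ListenerErrorConfig {

    private static final Logger log = LoggerFactory.getLogger(ListenerErrorConfig.class);

    @Bean
    public KafkaListenerErrorHandler loggingErrorHandler() {
        return (message, exception) -> {
            // The listener error handler sees the converted Message<?>,
            // which the container error handler never does.
            log.error("Listener failed for payload: {}", message.getPayload(), exception);
            // Re-throw so the container error handler still runs its retry
            // logic; swallowing the exception here cancels retry.
            throw exception;
        };
    }
}

// Referenced by bean name on the listener:
// @KafkaListener(topics = "some-topic", errorHandler = "loggingErrorHandler")
// public void listen(String payload) { ... }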
Retry in the adapter is deprecated in favor of the container error handler because, now, the SeekToCurrentErrorHandler supports back off, exception classification etc.
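A minimal sketch of configuring that back off on the container error handler (the interval, attempt count, and factory wiring are assumptions):

import org.springframework.kafka.listener.SeekToCurrentErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

// Seek back and retry each failed record twice, 1 second apart, then give up:
SeekToCurrentErrorHandler errorHandler =
        new SeekToCurrentErrorHandler(new FixedBackOff(1000L, 2L));

// Wired into the listener container factory, e.g.:
// factory.setErrorHandler(errorHandler);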

How to implement retry and recover logic with Spring Reactive Kafka

We are using the https://github.com/reactor/reactor-kafka project for implementing Spring Reactive Kafka. But we want to utilize Kafka retry and recover logic with reactive Kafka.
Can anyone provide some sample code?
Since you are using the Spring ecosystem, you can use Spring Retry for retry and recovery; have a look at the Spring Retry documentation. There are plenty of references available on the web.
The sample class below consumes messages from a Kafka topic and processes them.
The consuming method is marked @Retryable, so if processing throws an exception it will be retried, and if the retries do not succeed the corresponding recovery method will be called.
public class BookConsumer {

    @KafkaListener(topics = "books-topic", id = "group-1")
    @Retryable(maxAttempts = 3, value = Exception.class)
    public void consuming(String message) {
        // Message processing goes here.
        // Whenever this method throws an exception:
        // - it will be retried, 3 attempts in total
        // - if we still get an exception after the retries, it is handed off
        //   to the recovery method below, recoverConsuming
    }

    @Recover
    public void recoverConsuming(Exception exception, String message) {
        // Recovery logic: implement your recovery scenario here.
    }
}
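If you want to stay fully reactive instead, below is a minimal sketch using reactor-kafka's own retry operators; the receiver configuration and the process/recover methods are assumptions for illustration, not library API:

import java.time.Duration;
import reactor.core.publisher.Mono;
import reactor.kafka.receiver.KafkaReceiver;
import reactor.kafka.receiver.ReceiverRecord;
import reactor.util.retry.Retry;

public class ReactiveBookConsumer {

    private final KafkaReceiver<String, String> receiver; // assumed configured elsewhere

    public ReactiveBookConsumer(KafkaReceiver<String, String> receiver) {
        this.receiver = receiver;
    }

    public void start() {
        receiver.receive()
            .concatMap(record -> process(record)
                // Retry the failed record up to 3 times with exponential back off
                .retryWhen(Retry.backoff(3, Duration.ofSeconds(1)))
                // Once the retries are exhausted, fall back to recovery
                .onErrorResume(e -> recover(record, e))
                // Acknowledge so the offset can be committed
                .then(Mono.fromRunnable(() -> record.receiverOffset().acknowledge())))
            .subscribe();
    }

    private Mono<Void> process(ReceiverRecord<String, String> record) {
        return Mono.fromRunnable(() -> { /* business logic */ });
    }

    private Mono<Void> recover(ReceiverRecord<String, String> record, Throwable e) {
        return Mono.fromRunnable(() -> { /* e.g. write to a dead-letter topic */ });
    }
}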

Kafka Transaction in case multi Threading

I am trying to use a Kafka producer in a transaction, i.e. I want to write a group of messages, and if any one of them fails I want to roll back all of them.
kafkaProducer.initTransactions(); // called once before the first transaction

kafkaProducer.beginTransaction();
try
{
    // code to produce to kafka topic
    kafkaProducer.commitTransaction();
}
catch(Exception e)
{
    kafkaProducer.abortTransaction();
}
The problem is that the above works just fine for a single thread, but when multiple threads write it throws an exception:
Invalid transaction attempted from state IN_TRANSITION to IN_TRANSITION
While debugging I found that if thread1's transaction is in progress and thread2 also calls beginTransaction, it throws this exception. What I can't find is how to solve this issue. One possible approach I could find is creating a pool of producers.
Is there an already available API for a Kafka producer pool, or will I have to create my own?
Below is the improvement Jira already reported for this:
https://issues.apache.org/jira/browse/KAFKA-6278
Any other suggestion will be really helpful.
You can only have a single transaction in progress at a time with a producer instance.
If you have multiple threads doing separate processing and they all need exactly once semantics, you should have a producer instance per thread.
Not sure if this was resolved, but you can use Apache Commons Pool 2 to create a producer instance pool.
In the create() method of the factory implementation you can generate and assign a unique transactional.id to avoid conflicts (ProducerFencedException), as in the sketch below.
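A minimal sketch of such a factory, assuming Commons Pool 2 is on the classpath (the bootstrap server, serializers, and transactional id scheme are assumptions):

import java.util.Properties;
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.commons.pool2.BasePooledObjectFactory;
import org.apache.commons.pool2.PooledObject;
import org.apache.commons.pool2.impl.DefaultPooledObject;
import org.apache.commons.pool2.impl.GenericObjectPool;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

public class TransactionalProducerFactory
        extends BasePooledObjectFactory<KafkaProducer<String, String>> {

    private final AtomicInteger counter = new AtomicInteger();

    @Override
    public KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        // A unique transactional.id per pooled producer avoids
        // ProducerFencedException between pool members.
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG,
                "tx-producer-" + counter.getAndIncrement());
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions();
        return producer;
    }

    @Override
    public PooledObject<KafkaProducer<String, String>> wrap(KafkaProducer<String, String> producer) {
        return new DefaultPooledObject<>(producer);
    }
}

// Usage: borrow a producer, run exactly one transaction on it, return it:
// GenericObjectPool<KafkaProducer<String, String>> pool =
//         new GenericObjectPool<>(new TransactionalProducerFactory());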

Kafka producer callback Exception

When we produce messages, we can define a callback that can receive an exception:
kafkaProducer.send(producerRecord, new Callback() {
    public void onCompletion(RecordMetadata recordMetadata, Exception e) {
        if (e == null) {
            // OK
        } else {
            // NOT OK
        }
    }
});
Considering the built-in retry logic in the producer, I wonder which kinds of exceptions developers should deal with explicitly.
According to the Callback Javadocs, the following exceptions can happen during the callback:
The exception thrown during processing of this record. Null if no error occurred. Possible thrown exceptions include:
Non-Retriable exceptions (fatal, the message will never be sent):
InvalidTopicException
OffsetMetadataTooLargeException
RecordBatchTooLargeException
RecordTooLargeException
UnknownServerException
Retriable exceptions (transient, may be covered by increasing retries):
CorruptRecordException
InvalidMetadataException
NotEnoughReplicasAfterAppendException
NotEnoughReplicasException
OffsetOutOfRangeException
TimeoutException
UnknownTopicOrPartitionException
Maybe this is an unsatisfactory answer, but in the end, which exceptions to handle and how completely relies on your use case and business requirements.
Handling Producer Retries
However, as a developer you also need to deal with the retry mechanism of the Kafka producer itself. The retries are mainly driven by:
retries: Setting a value greater than zero will cause the client to resend any record whose send fails with a potentially transient error. Note that this retry is no different than if the client resent the record upon receiving the error. Allowing retries without setting max.in.flight.requests.per.connection (default: 5) to 1 will potentially change the ordering of records because if two batches are sent to a single partition, and the first fails and is retried but the second succeeds, then the records in the second batch may appear first. Note additionally that produce requests will be failed before the number of retries has been exhausted if the timeout configured by delivery.timeout.ms expires first before successful acknowledgement. Users should generally prefer to leave this config unset and instead use delivery.timeout.ms to control retry behavior.
retry.backoff.ms: The amount of time to wait before attempting to retry a failed request to a given topic partition. This avoids repeatedly sending requests in a tight loop under some failure scenarios.
request.timeout.ms: The configuration controls the maximum amount of time the client will wait for the response of a request. If the response is not received before the timeout elapses the client will resend the request if necessary or fail the request if retries are exhausted. This should be larger than replica.lag.time.max.ms (a broker configuration) to reduce the possibility of message duplication due to unnecessary producer retries.
The recommendation is to keep the default values of those three configurations above and rather focus on the hard upper time limit defined by
delivery.timeout.ms: An upper bound on the time to report success or failure after a call to send() returns. This limits the total time that a record will be delayed prior to sending, the time to await acknowledgement from the broker (if expected), and the time allowed for retriable send failures. The producer may report failure to send a record earlier than this config if either an unrecoverable error is encountered, the retries have been exhausted, or the record is added to a batch which reached an earlier delivery expiration deadline. The value of this config should be greater than or equal to the sum of request.timeout.ms and linger.ms.
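As a minimal sketch of that recommendation (the broker address is an assumption):

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// Leave retries, retry.backoff.ms and request.timeout.ms at their defaults;
// bound the total time a send() may take instead:
props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120000); // 2 minutes (the default)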
You may get a BufferExhaustedException or a TimeoutException.
Just bring your Kafka broker down after the producer has produced one record, and then continue producing records. After some time, you should see exceptions in the callback.
This is because when you send the first record, the metadata is fetched; after that, the records are batched and buffered, and they eventually expire after some timeout, during which you may see these exceptions.
I suppose that the timeout is delivery.timeout.ms, which when expired gives you a TimeoutException there.
Trying to add more info to @Mike's answer, I think only a few exceptions are enumerated in the Callback interface.
Here you can see the whole list: kafka.common.errors
And here you can see which ones are retriable and which ones are not: kafka protocol guide
And the code could be something like this:
producer.send(record, callback)

def callback: Callback = new Callback {
  override def onCompletion(recordMetadata: RecordMetadata, e: Exception): Unit = {
    e match {
      case null =>
        log.debug("It's summer. Everything is fine")
      case _: RecordTooLargeException | _: UnknownServerException =>
        log.error("Winter is coming") // non-retriable
        writeDiscardRecordsToSomewhereElse
      case _ =>
        log.warn("It's not that cold") // retriable
    }
  }
}

Clearing Messages From a persistent Akka journal

I am testing a persistent Akka actor. I am using in-memory persistence for this. The test starts and I see the actor recovering persisted messages. I try the following:
I send a message to the actor that makes it trigger deleteMessages(LastMessage). I was hoping this message would cause the journal to be cleared.
The actor does not seem to process this message as the messages being recovered had previously run into an exception. It thus throws the exception and does not proceed to process the message.
How can I clear the persisted journal?
I also thought the in-memory persistence did not recover previous tests' messages from the journal.
For a more capable in-memory journal implementation to use in tests, I'd recommend using https://github.com/dnvriend/akka-persistence-inmemory.
It supports clearing the journal (and snapshots): https://github.com/dnvriend/akka-persistence-inmemory#clearing-journal-and-snapshot-messages, as well as ReadJournal.