spring-kafka - error handling via group rebalance

I'm writing a Kafka listener which should just forward the messages from a topic to a JMS queue. I need to stop processing new messages only for a custom exception, JmsBrokerConnectionException, but I want to continue processing new messages for any other exception (e.g. invalid data) and send the failed messages to a DLT.
I am using spring-kafka 2.2.7 and cannot upgrade it.
I currently have a solution (sketched below) which uses:
a SeekToCurrentErrorHandler (configured with 0 retries and a DeadLetterPublishingRecoverer)
a retry template used in the @KafkaListener method, configured with Integer.MAX_VALUE retries, which retries only for JmsBrokerConnectionException
MANUAL_IMMEDIATE ack
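Roughly, the relevant configuration looks like the sketch below (bean names, the JMS call and the retry-template construction are illustrative rather than my exact code, and some constructor/enum locations moved around between spring-kafka versions):

import java.util.Collections;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.listener.ContainerProperties;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.SeekToCurrentErrorHandler;
import org.springframework.kafka.support.Acknowledgment;
import org.springframework.retry.policy.SimpleRetryPolicy;
import org.springframework.retry.support.RetryTemplate;

@Configuration
public class ForwardingConfig {

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
            ConsumerFactory<String, String> cf, KafkaTemplate<Object, Object> template) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(cf);
        factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.MANUAL_IMMEDIATE);
        // "0 retries": the record is recovered (published to the DLT) as soon as the error handler sees it fail
        factory.setErrorHandler(
                new SeekToCurrentErrorHandler(new DeadLetterPublishingRecoverer(template), 1));
        return factory;
    }

    // retries only JmsBrokerConnectionException, effectively forever
    private final RetryTemplate retryTemplate = buildRetryTemplate();

    private static RetryTemplate buildRetryTemplate() {
        RetryTemplate template = new RetryTemplate();
        template.setRetryPolicy(new SimpleRetryPolicy(Integer.MAX_VALUE,
                Collections.singletonMap(JmsBrokerConnectionException.class, true)));
        return template;
    }

    @KafkaListener(topics = "input-topic")
    public void listen(String message, Acknowledgment ack) {
        retryTemplate.execute(ctx -> {
            forwardToJms(message); // may throw JmsBrokerConnectionException
            return null;
        });
        ack.acknowledge();
    }

    private void forwardToJms(String message) {
        // placeholder for the actual JMS send
    }
}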
The solution seems to do the job, but it has the drawback that, for long outages of the JMS broker, it would cause a rebalance every max.poll.interval.ms (i.e. 5 minutes).
The question:
Is it a good idea to let max.poll.interval.ms expire and have a group rebalance to handle error conditions for which you want to stop message consumption?
I don't have high-throughput requirements.
The input topic has 10 partitions and I will have 2 consumers.
I know there are other solutions using stateful retry or pausing/resuming the container, but I'd like to keep using the current solution unless I am missing any major drawbacks about it.

I am using spring-kafka 2.2.7 and cannot upgrade it.
That version is no longer supported.
Version 2.3 added backoff and exception classification to the STCEH, eliminating the need for a retry template at the listener level.
That said, you can use stateful retry (https://docs.spring.io/spring-kafka/docs/current/reference/html/#stateful-retry) with a STCEH that always retries, and do the dead letter publishing in the RecoveryCallback at the listener level. The consumer record is available in the retry context with the RetryingMessageListenerAdapter.CONTEXT_RECORD key.
Since you are doing manual acks, you will also need to commit the offset via the CONTEXT_CONSUMER key.
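Something along these lines (an untested sketch; adjust the retry classification and bean names to your setup):

import java.util.Collections;

import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.SeekToCurrentErrorHandler;
import org.springframework.kafka.listener.adapter.RetryingMessageListenerAdapter;
import org.springframework.retry.policy.SimpleRetryPolicy;
import org.springframework.retry.support.RetryTemplate;

@Configuration
public class ForwardingRetryConfig {

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
            ConsumerFactory<String, String> cf, KafkaTemplate<Object, Object> kafkaTemplate) {

        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(cf);

        // STCEH with its default (unlimited) behavior: always re-seek the failed record, so each
        // retry attempt is a fresh delivery and max.poll.interval.ms is not exceeded.
        factory.setErrorHandler(new SeekToCurrentErrorHandler());

        // Stateful retry: the exception is re-thrown to the container between attempts
        // and the retry state survives the re-delivery.
        factory.setStatefulRetry(true);
        RetryTemplate retryTemplate = new RetryTemplate();
        retryTemplate.setRetryPolicy(new SimpleRetryPolicy(Integer.MAX_VALUE,
                Collections.singletonMap(JmsBrokerConnectionException.class, true)));
        factory.setRetryTemplate(retryTemplate);

        // When retries are exhausted (any exception other than the broker outage here),
        // publish to the DLT and commit the offset using the consumer from the retry context.
        DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(kafkaTemplate);
        factory.setRecoveryCallback(context -> {
            ConsumerRecord<?, ?> record = (ConsumerRecord<?, ?>) context
                    .getAttribute(RetryingMessageListenerAdapter.CONTEXT_RECORD);
            Consumer<?, ?> consumer = (Consumer<?, ?>) context
                    .getAttribute(RetryingMessageListenerAdapter.CONTEXT_CONSUMER);
            recoverer.accept(record, (Exception) context.getLastThrowable());
            consumer.commitSync(Collections.singletonMap(
                    new TopicPartition(record.topic(), record.partition()),
                    new OffsetAndMetadata(record.offset() + 1)));
            return null;
        });
        return factory;
    }
}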

Related

Should the consumer or the client produce the retry event?

Let's say we have a Kafka consumer polling from a normal topic that is heavily loaded, and for each event it makes a client call to a service. The duration of the client call may vary, sometimes fast and sometimes slow. We have a retry topic, so whenever the client call has an issue we produce a retry event.
Here is an interesting design question: which domain should be responsible for producing the retry event?
If we let the consumer handle producing the retry event, the consumer has to wait for the client call to finish, which risks consumer lag because event processing becomes slow.
If we let the service handle producing the retry event, that solves the consumer lag issue since the consumer just sends and forgets. However, if the service tries to produce a retry event and fails, the retry record might be lost forever for that client call.
I also thought of adding a DB to persist retry events, but that raises a similar concern: if the DB write fails, we lose the retry just as we would if the Kafka produce errored out.
The expectation is to keep things resilient so that every failed event gets a chance to be retried, while at the same time avoiding the consumer lag issue.
I'm not sure I completely understand the question, but I will give it a shot. To summarise, you want to ensure the producer retries if the event failed.
The producer's retries config defaults to 2147483647, so if a produce request fails it will keep retrying.
However, produce requests will fail before the retries are exhausted if the timeout configured by delivery.timeout.ms expires first, before a successful acknowledgement. The default for delivery.timeout.ms is 2 minutes, so you might want to increase this.
To ensure the producer always sends the record, you also want to look at the producer configuration acks.
If acks=all, all replicas in the ISR must acknowledge the record before it is considered successful. This guarantees that the record will not be lost as long as at least one in-sync replica remains alive. This is the strongest available guarantee.
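For example, the settings above in plain producer code (topic, broker address, and values here are just illustrative):

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ReliableRetryProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);   // the default; keep retrying
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 600_000); // raise from the 2-minute default
        props.put(ProducerConfig.ACKS_CONFIG, "all");                  // wait for all in-sync replicas

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("retry-topic", "key", "value"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            // reached only after the retries / delivery timeout are exhausted
                            exception.printStackTrace();
                        }
                    });
        }
    }
}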
The above can cause duplicate messages. If you wanted to avoid duplicates, I can also let you know how to do that.
With Spring for Apache Kafka, the DeadLetterPublishingRecoverer (which can be used to publish to your "retry" topic) has a property failIfSendResultIsError.
When this is true (default), the recovery operation fails and the DefaultErrorHandler will detect the failure and re-seek the failed consumer record so that it will continue to be retried.
The non-blocking retry mechanism uses this recoverer internally so the same behavior will occur there too.
https://docs.spring.io/spring-kafka/docs/current/reference/html/#retry-topic
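For example (2.8+ APIs; the retry-topic name and the kafkaTemplate variable are illustrative, and you would attach the handler with factory.setCommonErrorHandler(errorHandler)):

// A recoverer that publishes failed records to "my-retry-topic", wrapped in a DefaultErrorHandler.
DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(kafkaTemplate,
        (record, ex) -> new TopicPartition("my-retry-topic", record.partition()));
// true is the default: if publishing to the retry topic fails, the recovery itself fails
// and the error handler re-seeks the record, so it is redelivered rather than lost
recoverer.setFailIfSendResultIsError(true);
DefaultErrorHandler errorHandler = new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 2L));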

Kafka incremental sticky rebalancing

I am running Kafka on Kubernetes using the Kafka Strimzi operator. I am using incremental sticky rebalance strategy by configuring my consumers with the following:
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
        org.apache.kafka.clients.consumer.CooperativeStickyAssignor.class.getName());
Each time I scale consumers in my consumer group, all existing consumers in the group throw the following exception:
Exception in thread "main" org.apache.kafka.common.errors.RebalanceInProgressException: Offset commit cannot be completed since the consumer is undergoing a rebalance for auto partition assignment. You can try completing the rebalance by calling poll() and then retry the operation
Any idea on what caused this exception and/or how to resolve it?
Thank you.
The consumer rebalance happens whenever there is a change in the metadata information of a consumer group.
Adding more consumers (scaling in your words) in a group is one such change and triggers a rebalance. During this change, each consumer will be re-assigned partitions and therefore will not know which offsets to commit until the re-assignment is complete. Now, the StickyAssignor does try and ensure that the previous assignment gets preserved as much as possible but the rebalance will still be triggered and even distribution of partitions will take precedence over retaining previous assignment. (Reference - Kafka Documentation)
As for the rest, the exception message is self-explanatory: while the rebalance is in progress, some operations are prohibited.
How to avoid such situations?
This is a tricky one because Kafka needs rebalancing to be able to work effectively. There are a few practices you could use to avoid unnecessary impact:
Increase the polling time - max.poll.interval.ms - so the possibility of experiencing these exceptions is reduced.
Decrease the number of poll records - max.poll.records or max.partition.fetch.bytes
Try and utilise the latest version(s) of Kafka (or upgrade if you're using an old one) as many of the latest upgrades so far have made improvements to the rebalance protocol
Use Static membership protocol to reduce rebalances
Might consider configuring group.initial.rebalance.delay.ms for empty consumer groups (either for the first-time deployment or when destroying everything and redeploying again)
These techniques can only help you reduce the unnecessary rebalances or exceptions, but they will NOT prevent rebalancing completely.
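For illustration, the consumer-side settings mentioned above might look like this (the props variable and values are examples only; group.initial.rebalance.delay.ms is a broker-side setting and is not shown):

props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
        CooperativeStickyAssignor.class.getName());
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 600_000); // more time per poll loop
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 100);         // smaller batches per poll
props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, "my-consumer-0"); // static membership: must be unique and stable per instance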

Retry logic blocks the main consumer while it's waiting for the retry in Spring

I am referring:
https://medium.com/trendyol-tech/how-to-implement-retry-logic-with-spring-kafka-710b51501ce2
And it says that if we use the following:
factory.setErrorHandler(new SeekToCurrentErrorHandler(new DeadLetterPublishingRecoverer(kafkaTemplate), 3));
It will block the main consumer while it's waiting for the retry (https://medium.com/trendyol-tech/how-to-implement-retry-logic-with-spring-kafka-710b51501ce2#:~:text=Also%20it%20blocks%20the%20main%20consumer%20while%20its%20waiting%20for%20the%20retry).
So, my question is: do we really need to retry on the main topic, or can we move the failed messages to a retry topic and process them there so that our main topic is non-blocking?
Can we achieve non-blocking retry using the STCEH?
Non-blocking retries were added in the 2.7 release.
https://docs.spring.io/spring-kafka/docs/current/reference/html/#retry-topic
Achieving non-blocking retry / DLT functionality with Kafka usually requires setting up extra topics and creating and configuring the corresponding listeners. Since 2.7, Spring for Apache Kafka offers support for that via the @RetryableTopic annotation and the RetryTopicConfiguration class to simplify that bootstrapping.
If message processing fails, the message is forwarded to a retry topic with a back off timestamp. The retry topic consumer then checks the timestamp and if it’s not due it pauses the consumption for that topic’s partition. When it is due the partition consumption is resumed, and the message is consumed again. If the message processing fails again the message will be forwarded to the next retry topic, and the pattern is repeated until a successful processing occurs, or the attempts are exhausted, and the message is sent to the Dead Letter Topic (if configured).
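A rough sketch of what that looks like (topic, class, and method names here are illustrative):

import org.springframework.kafka.annotation.DltHandler;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.annotation.RetryableTopic;
import org.springframework.kafka.support.KafkaHeaders;
import org.springframework.messaging.handler.annotation.Header;
import org.springframework.retry.annotation.Backoff;
import org.springframework.stereotype.Component;

@Component
public class OrderListener {

    // Failed records are forwarded through auto-created retry topics with a growing back off,
    // and finally to a dead letter topic if all attempts fail.
    @RetryableTopic(backoff = @Backoff(delay = 1000, multiplier = 2.0))
    @KafkaListener(topics = "orders")
    public void listen(String message, @Header(KafkaHeaders.RECEIVED_TOPIC) String topic) {
        // throwing here sends the record to the next retry topic; "orders" itself is not blocked
        process(message);
    }

    @DltHandler
    public void handleDlt(String message, @Header(KafkaHeaders.RECEIVED_TOPIC) String topic) {
        // attempts exhausted; the message arrived on the DLT
        System.err.println("Gave up on " + message + " from " + topic);
    }

    private void process(String message) {
        // placeholder for the actual business logic
    }
}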

Is the consumer offset committed even when failing to post to the output topic in Kafka Streams?

If I have a Kafka Streams application that fails to post to a topic (because the topic does not exist), does it commit the consumer offset and continue, or will it loop on the same message until it can resolve the output topic? The application merely prints an error and otherwise runs fine from what I can observe.
An example of the error when trying to post to topic:
Error while fetching metadata with correlation id 80 : {super.cool.test.topic=UNKNOWN_TOPIC_OR_PARTITION}
In my mind it would just spin on the same message until the issue is resolved in order to not lose data? I could not find a clear answer on what the default behavior is. We haven't set autocommit to off or anything like that, most of the settings are set to the default.
I am asking as we don't want to end up in a situation where the health check is fine (application is running while printing errors to log) and we are just throwing away tons of Kafka messages.
Kafka Streams will not commit the offsets for this case, as it provides at-least-once processing guarantees (in fact, it's not even possible to reconfigure Kafka Streams differently -- only stronger exactly-once guarantees are possible). Also, Kafka Streams disables auto-commit on the consumer always (and does not allow you to enable it), as Kafka Streams manages committing offset itself.
If you run with default settings, the producer should actually throw an exception and the corresponding thread should die -- you can get a callback when a thread dies by registering a handler via KafkaStreams#setUncaughtExceptionHandler().
You can also observe KafkaStreams#state() (or register a callback via KafkaStreams#setStateListener()). The state will go to DEAD if all threads are dead (note, there was a bug in older versions for which the state was still RUNNING in this case: https://issues.apache.org/jira/browse/KAFKA-5372).
Hence, the application should not be in a healthy state; Kafka Streams will not retry the input message but stop processing, and you would need to restart the client. On restart, it would re-read the failed input message and retry writing to the output topic.
If you want Kafka Streams to retry, you need to increase the producer config retries so that the producer retries writing internally instead of throwing an exception. This may eventually "block" further processing if the producer's write buffer becomes full.
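For example, a sketch of wiring those callbacks into a health check (it uses the older Thread-based uncaught exception handler that matches the versions this answer refers to; topic names reuse the ones from the question, everything else is illustrative):

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class CopyApp {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("super.cool.test.topic"); // output topic may not exist

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "copy-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        KafkaStreams streams = new KafkaStreams(builder.build(), props);

        // called when a stream thread dies, e.g. after the producer gave up on the missing topic
        streams.setUncaughtExceptionHandler((thread, throwable) ->
                System.err.println("Stream thread " + thread.getName() + " died: " + throwable));

        // watch the overall client state and wire it into your health check
        streams.setStateListener((newState, oldState) -> {
            if (newState == KafkaStreams.State.ERROR || newState == KafkaStreams.State.NOT_RUNNING) {
                System.err.println("Kafka Streams stopped: " + newState);
            }
        });

        streams.start();
    }
}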

Kafka group re-balancing after consumer failed. org.apache.kafka.clients.consumer.internals.ConsumerCoordinator

I'm running a Kafka cluster with 4 nodes, 1 producer and 1 consumer. It was working fine until the consumer failed. Now, after I restart the consumer, it starts consuming new messages, but after some minutes it throws this error:
[WARN ]: org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Auto offset commit failed for group eventGroup: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
And it starts consuming the same messages again and loops forever.
I increased the session timeout and tried changing the group id, but it still does the same thing.
Also, does the client version of the Kafka consumer matter much here?
I'd suggest decoupling the consumer from the processing logic, to start with. That is, let the Kafka consumer only poll messages and, maybe after sanitizing them (if necessary), delegate the actual processing of each record to a separate thread; then see if the same error still occurs. The error says you're spending too much time between subsequent polls, so this might resolve your issue. Also, please mention the version of Kafka you're using; Kafka had a different heartbeat management policy before version 0.10, which could make this issue easier to reproduce.
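Something along these lines (a sketch with the plain Java consumer and a newer client that supports poll(Duration); broker address, topic, group and pool size are illustrative, and careful offset handling after the delegated work finishes is left out for brevity):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class DecoupledConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "eventGroup");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        ExecutorService workers = Executors.newFixedThreadPool(4);
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("eventTopic"));
            while (true) {
                // the poll loop only fetches and hands off, so it returns quickly and
                // the poll-interval deadline is comfortably met
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    workers.submit(() -> process(record)); // slow work happens off the polling thread
                }
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // placeholder for the real (slow) processing
    }
}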