How to handle failure of Kafka Cluster - apache-kafka

We are going to implement a Kafka Publish Subscribe system.
Now, in the worst case: if all the Kafka brokers for a given topic go down -- what happens?
I tried this out; the publisher detects it after the default metadata-fetch timeout and throws an exception if the fetch is not successful.
In this case, we can monitor for the exception and restart the publisher after fixing Kafka.
But, what about the consumers -- they don't seem to get any exceptions once Kafka goes down. We simply can't ask "all" the consumers to restart their systems. Any better way to solve this problem?

But, what about the consumers -- they don't seem to get any exceptions once Kafka goes down. We simply can't ask "all" the consumers to restart their systems. Any better way to solve this problem?
Yes, the consumer won't get any exceptions; this behavior works as designed. However, you don't need to restart all the consumers; just make sure your logic calls the poll() method regularly. The consumer is designed so that it is not affected even if no cluster is alive. Consider the following steps to understand what will actually happen:
1: All brokers are down, so there is no active cluster.
2: consumer.poll(timeout) // This is called from your portion of the code
3: Inside the poll() method call in KafkaConsumer.java, the following sequence of calls takes place:
poll() --> pollOnce() --> ensureCoordinatorKnown() --> awaitMetaDataUpdate()
These are the main method calls that are reached after the internal logical checks. At this point your consumer will wait until the cluster is up again.
4: The cluster comes up again or is restarted.
5: The consumer is notified and starts working again as it did before the cluster went down.
Note: the consumer will start receiving messages from the last committed offset; messages that were already received and committed won't be duplicated.
The described behavior applies to the 0.9.x client.

If the consumer (0.9.x version) is polling and the cluster goes down, it should get the following exception:
java.net.ConnectException: Connection refused
You can keep polling until the cluster is back again; there is no need to restart the consumer, as it will re-establish the connection.
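Both answers boil down to the same pattern: keep the poll loop alive and treat broker unavailability as transient. A minimal sketch with the Java client (bootstrap servers, group id, and topic name are placeholders; older 0.9.x clients use poll(long) instead of the Duration overload):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ResilientConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "my-group");                // placeholder
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // placeholder
            while (true) {
                // While the cluster is down, poll() simply returns no records and the
                // client logs connection warnings; once brokers return, the consumer
                // reconnects on its own and resumes from the last committed offset.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}
```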

Related

Should the consumer or the client produce the retry event?

Let's say we have a Kafka consumer polling from a heavily loaded topic and, for each event, making a client call to a service. The duration of the client call may vary: sometimes fast, sometimes slow. We have a retry topic, so whenever a client call has an issue we produce a retry event.
Here is an interesting design question: which domain should be responsible for producing the retry event?
If we let the consumer handle the retry produce, it has to wait for the client call to finish, which risks consumer lag because event processing slows down.
If we let the service handle the retry produce, this solves the consumer-lag issue because the consumer just sends and forgets. However, if the service tries to produce a retry event and fails, the retry record might be lost forever for the current client call.
I also thought of adding a database for persisting retry events, but this raises the concern of what happens if the DB write fails: we might lose the retry just as when the Kafka produce errors out.
The expectation is to keep this resilient so that every failed event gets a chance to be retried, while at the same time avoiding consumer lag.
I'm not sure I completely understand the question, but I will give it a shot. To summarise, you want to ensure the producer retries if the event failed.
The producer's retries config defaults to 2147483647; if a produce request fails, it will keep retrying.
However, produce requests will fail before the retries are exhausted if the timeout configured by delivery.timeout.ms expires before a successful acknowledgement. The default for delivery.timeout.ms is 2 minutes, so you might want to increase it.
To ensure the producer always sends the record, you also want to look at the acks producer configuration.
If acks=all, all replicas in the ISR must acknowledge the record before it is considered successful. This guarantees that the record will not be lost as long as at least one in-sync replica remains alive. This is the strongest available guarantee.
The above can cause duplicate messages. If you wanted to avoid duplicates, I can also let you know how to do that.
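As a rough illustration of those settings, here is a sketch of a producer configured for durability; the broker address, topic, key/value types, and the enable.idempotence setting (one way to avoid retry duplicates) are my additions, not part of the original answer:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RetryingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        // acks=all: all in-sync replicas must acknowledge before the send is considered successful.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // retries already defaults to Integer.MAX_VALUE on recent clients; shown here for clarity.
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        // Give the producer longer than the default 2 minutes to get the record through.
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 600_000);
        // Idempotence avoids the duplicates that retries can otherwise introduce.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("retry-topic", "key", "value"), // placeholder topic
                    (metadata, exception) -> {
                        if (exception != null) {
                            // Only reached once retries/delivery timeout are exhausted.
                            System.err.println("Giving up on record: " + exception);
                        }
                    });
        }
    }
}
```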
With Spring for Apache Kafka, the DeadLetterPublishingRecoverer (which can be used to publish to your "retry" topic) has a property failIfSendResultIsError.
When this is true (default), the recovery operation fails and the DefaultErrorHandler will detect the failure and re-seek the failed consumer record so that it will continue to be retried.
The non-blocking retry mechanism uses this recoverer internally so the same behavior will occur there too.
https://docs.spring.io/spring-kafka/docs/current/reference/html/#retry-topic
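For reference, a minimal sketch of wiring a DeadLetterPublishingRecoverer into a DefaultErrorHandler with Spring for Apache Kafka; the ".retry" topic naming and the back-off values are just examples, not something the answer prescribes:

```java
import org.apache.kafka.common.TopicPartition;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

@Configuration
public class RetryTopicConfig {

    @Bean
    public DefaultErrorHandler errorHandler(KafkaTemplate<Object, Object> template) {
        // Publish failed records to a "<topic>.retry" topic (naming is just an example).
        DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(template,
                (record, ex) -> new TopicPartition(record.topic() + ".retry", record.partition()));

        // Default is true: if publishing the retry record itself fails, recovery is considered
        // failed and the original record is re-seeked and retried by the error handler.
        recoverer.setFailIfSendResultIsError(true);

        // Retry the listener a few times, 1s apart, before handing the record to the recoverer;
        // register this handler on your listener container factory.
        return new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 3L));
    }
}
```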

How to choose Kafka transactional.id in a Kubernetes (Producer side only transaction) set up

I have a Kafka wrapper library that uses transactions on the produce side only. The library does not cover the consumer. The producer publishes to multiple topics. The goal is to achieve transactionality, so the produce should either succeed, meaning exactly one copy of the message is written to each topic, or fail, meaning the message was not written to any topic. The users of the library are applications that run on Kubernetes pods. Hence, the pods could fail or restart frequently. Also, the partition is not going to be explicitly set when sending the message.
My question is, how should I choose the transactional.id for producers? My first idea is to simply choose UUID upon object initiation, as well as setting a transaction.timeout.ms to some reasonable time (a few seconds). That way, if a producer gets terminated due to pod restart, the consumers don't get locked on the transaction forever.
Are there any flaws with this strategy? Is there a smarter way to do this? Also, I cannot ask the library user for some kind of id.
A UUID can be used in your library to generate the transactional.id for your producers. I am not really sure what you mean by: "That way, if a producer gets terminated due to pod restart, the consumers don't get locked on the transaction forever."
The consumer is never really "stuck". Say the producer goes down after writing a message to one topic (and hence the transaction is not yet committed); then the consumer will behave in one of the following ways:
If isolation.level is set to read_committed, consumer will never process the message (since the message is not committed). It will still read the next committed message that comes along.
If isolation.level is set to read_uncommitted, the message will be read and processed (defeating the purpose of transaction in the first place).
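A minimal sketch of the producer-only transaction described above, assuming a plain Java client; the transactional.id prefix, the timeout value, the topic names, and the simplified error handling are illustrative only:

```java
import java.util.Properties;
import java.util.UUID;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;

public class TransactionalWrapper {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        // Random transactional.id per producer instance, as proposed in the question.
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "my-lib-" + UUID.randomUUID());
        // Keep the timeout short so an abandoned transaction (e.g. after a pod restart)
        // is aborted by the broker quickly and does not block read_committed consumers.
        props.put(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG, 10_000);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            try {
                producer.beginTransaction();
                producer.send(new ProducerRecord<>("topic-a", "key", "value")); // placeholder topics
                producer.send(new ProducerRecord<>("topic-b", "key", "value"));
                producer.commitTransaction(); // all-or-nothing across both topics
            } catch (KafkaException e) {
                // Error handling trimmed for brevity; fatal errors should close the producer instead.
                producer.abortTransaction();
                throw e;
            }
        }
    }
}
```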

Does consumer recover after Kafka cluster was down for a long time

I start Kafka consumer on server start up. What happens to the consumer if Kafka cluster goes down for a long time (couple of hours)? Will it receive messages after Kafka is up again?
The way the Kafka protocol works, it is always the clients (consumers in this case) that initiate communication with the cluster, in a request/reply fashion.
This means that if the cluster goes down, the consumer will learn about it only on its next request (maybe a metadata request or a fetch request). There is no push mechanism from brokers to clients to tell them that the cluster came back up. So it depends on the logic of your consumer and how often it polls for messages; as already mentioned by #cricket_007, it will log errors.
It ultimately depends on how you handle connection exceptions in the client.
If you have circuit-breaker-type logic that retries and then fails after a few attempts, the consume loop would stop. Otherwise, if you blindly do a while(true) loop, the consumer would keep trying to read messages and keep logging errors on each request (heartbeat, fetch, or poll).
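As one way to illustrate the circuit-breaker option, here is a hedged sketch that probes cluster reachability with a metadata call and stops the loop after a few consecutive failures; the probe, the threshold, and all names are placeholders rather than a prescribed pattern:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.errors.TimeoutException;

public class CircuitBreakingConsumer {
    private static final int MAX_CONSECUTIVE_FAILURES = 5; // illustrative threshold

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "my-group");                // placeholder
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // placeholder
            int failures = 0;
            while (failures < MAX_CONSECUTIVE_FAILURES) {
                try {
                    // Cheap reachability probe: throws TimeoutException if cluster
                    // metadata cannot be fetched within the timeout.
                    consumer.listTopics(Duration.ofSeconds(5));
                    failures = 0;
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    records.forEach(r -> System.out.println(r.value()));
                } catch (TimeoutException e) {
                    failures++;
                    System.err.println("Cluster unreachable (" + failures + "/" + MAX_CONSECUTIVE_FAILURES + ")");
                }
            }
            System.err.println("Circuit opened: giving up after repeated failures");
        }
    }
}
```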
Kafka stores messages on disk for a period defined by the broker's log retention config (log.retention.hours, default 168 hours). During the downtime your consumers will get a "Broker may not be available" error each time they try to poll. When Kafka is up again, if the downtime was shorter than the retention period, your consumers will continue to receive and consume messages without any loss.

Is consumer offset committed even when failing to post to output topic in Kafka Streams?

If I have a Kafka Streams application that fails to post to a topic (because the topic does not exist), does it commit the consumer offset and continue, or will it loop on the same message until it can resolve the output topic? The application merely prints an error and otherwise runs fine, from what I can observe.
An example of the error when trying to post to topic:
Error while fetching metadata with correlation id 80 : {super.cool.test.topic=UNKNOWN_TOPIC_OR_PARTITION}
In my mind it would just spin on the same message until the issue is resolved in order to not lose data? I could not find a clear answer on what the default behavior is. We haven't set autocommit to off or anything like that, most of the settings are set to the default.
I am asking as we don't want to end up in a situation where the health check is fine (application is running while printing errors to log) and we are just throwing away tons of Kafka messages.
Kafka Streams will not commit the offsets for this case, as it provides at-least-once processing guarantees (in fact, it's not even possible to configure Kafka Streams more weakly -- only the stronger exactly-once guarantee is available). Also, Kafka Streams always disables auto-commit on the consumer (and does not allow you to enable it), as it manages committing offsets itself.
If you run with default settings, the producer should actually throw an exception and the corresponding thread should die -- you can get a callback when a thread dies by registering a handler via KafkaStreams#setUncaughtExceptionHandler().
You can also observe KafkaStreams#state() (or register a callback via KafkaStreams#setStateListener()). The state will go to DEAD if all threads are dead (note, there was a bug in older versions for which the state was still RUNNING in this case: https://issues.apache.org/jira/browse/KAFKA-5372).
Hence, the application should not be in a healthy state, and Kafka Streams will not retry the input message but stop processing; you would need to restart the client. On restart, it would re-read the failed input message and retry writing to the output topic.
If you want Kafka Streams to retry, you need to increase the producer config retries so that the producer does not throw an exception but keeps retrying the write internally. This may eventually "block" further processing if the producer's write buffer becomes full.
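A small sketch of the monitoring hooks mentioned above; the topology, the input topic name, and the decision to merely log are placeholders (the output topic name comes from the question's error message, and newer clients register a StreamsUncaughtExceptionHandler instead of the two-argument handler shown here):

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsHealthCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");            // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("input-topic").to("super.cool.test.topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);

        // Called whenever a stream thread dies, e.g. because a write to the output topic
        // failed fatally; a real application would flag itself unhealthy here.
        streams.setUncaughtExceptionHandler((thread, throwable) ->
                System.err.println("Stream thread " + thread.getName() + " died: " + throwable));

        // Called on every state transition; a health check can also poll streams.state()
        // and report unhealthy once the instance reaches a terminal state.
        streams.setStateListener((newState, oldState) ->
                System.out.println("State changed from " + oldState + " to " + newState));

        streams.start();
    }
}
```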

Kafka reset partition re-consume or not

If I consume from my topic and manage the offset myself, some records I process are successful and I move the offset onwards, but occasionally I process records that throw an exception. I still need to move the offset onwards, but at a later point I will need to reset the offset and re-process the failed records. Is it possible, when advancing the offset, to set a flag saying that if I consume that event again it should be ignored or consumed?
The best way to handle these records is not by resetting the offsets, but by using a dead-letter queue: essentially, posting them to another Kafka topic for reprocessing later. That way, your main consumer can focus on processing the records that don't throw exceptions, and some other consumer can constantly be listening and trying to handle the records that are throwing errors.
If that second consumer is still throwing exceptions when trying to reprocess the messages, you can either opt to repost them to the same queue, if the exception is caused by a transient issue (system temporarily unavailable, database issue, network blip, etc), or simply opt to log the message ID and content, as well as the best guess as to what the problem is, for someone to manually look at later.
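A bare-bones sketch of that dead-letter-queue approach with the plain Java clients; the topic names, group id, and ".dlq" suffix are made up for illustration:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DeadLetterConsumer {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092"); // placeholder
        consumerProps.put("group.id", "main-consumer");           // placeholder
        consumerProps.put("enable.auto.commit", "false");
        consumerProps.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092"); // placeholder
        producerProps.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(Collections.singletonList("main-topic")); // placeholder
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    try {
                        process(record); // your business logic
                    } catch (Exception e) {
                        // Park the bad record on a dead-letter topic and move on.
                        producer.send(new ProducerRecord<>("main-topic.dlq", record.key(), record.value()));
                    }
                }
                // Offsets advance regardless; failed records live on in the DLQ for later retry.
                consumer.commitSync();
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // placeholder for real processing
    }
}
```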
Actually - no, this is not possible. Kafka records are read only. I've seen this use case in practice and I will try to give you some suggestions:
if you experience an error, just copy the message in a separate error topic and move on. This will allow you to replay all error messages at any time from the error topic. That would definitely be my preferred solution - flexible and performant.
when there is an error, just hang your consumer: preferably enter an infinite loop with an exponential backoff, rereading the same message over and over again. We used this strategy together with good monitoring/alerting and log compaction. When something goes wrong, we either fix the broken consumer and redeploy our service, or, if the message itself was broken, the producer fixes its bug and republishes the message with the same key, and log compaction kicks in. The faulty message is deleted (log compaction), and we can move our consumers forward at that point. This requires manual interaction in most cases. If the reason for the fault is a networking issue (e.g. database down), the consumer may recover by itself.
use local storage (e.g. a database) to store which offsets failed. Then reset the offset and ignore the successfully processed records. This is my least preferred solution.