Not committing a message from Apache Kafka if an exception occurs during processing in Apache Flink - apache-kafka

Sometimes, during processing or in the sink, I get a few exceptions because of network issues. I want these messages not to be committed so that they can be reprocessed at some point in the future.
Is there any way to exclude an individual message from the offset commit so that it can be processed again in the future?
One solution I am currently following is to sink those messages to another topic, which we process later.

If an exception occurs during the processing of a message, the task is restarted until that message has eventually been processed. Offsets are only committed for messages that have been fully processed.
So if you don't change anything about the error handling in the source and sink, you will get exactly the behavior you want (also known as the at-least-once guarantee).
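For reference, here is a minimal sketch of a Flink job set up that way; the topic, group id and bootstrap servers are placeholders, and it assumes the universal flink-connector-kafka FlinkKafkaConsumer. Offsets are committed back to Kafka only when a checkpoint completes, so a failure simply restarts the job from the last checkpoint and the affected messages are read again:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class AtLeastOnceJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Offsets are committed back to Kafka only when a checkpoint completes, so a failure
        // before the checkpoint means the affected messages are read again after the restart.
        env.enableCheckpointing(60_000);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
        props.setProperty("group.id", "my-consumer-group");       // placeholder

        FlinkKafkaConsumer<String> source =
                new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props);
        source.setCommitOffsetsOnCheckpoints(true); // the default once checkpointing is enabled

        env.addSource(source)
           .map(value -> process(value)) // may throw; the job restarts and the message is retried
           .print();                     // stand-in for the real sink

        env.execute("at-least-once-example");
    }

    private static String process(String value) {
        return value; // placeholder for the real processing logic
    }
}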
Btw, I recommend fixing your tags.

Your approach of writing messages that fail during processing to a "dead letter queue" is a common and useful pattern. It's also pretty simple and straightforward. I wouldn't change anything.
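For the Flink side of that pattern, here is a minimal, self-contained sketch (class, topic and input values are made up for illustration) that routes records that fail processing to a side output instead of failing the job; in a real job the side output would be written to a dead-letter Kafka topic rather than printed:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class DeadLetterSideOutputJob {

    // Tag for records that fail processing; they are routed here instead of failing the job.
    private static final OutputTag<String> DEAD_LETTER = new OutputTag<String>("dead-letter") {};

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> input = env.fromElements("1", "2", "oops", "4"); // stand-in for a Kafka source

        SingleOutputStreamOperator<Integer> processed = input.process(new ProcessFunction<String, Integer>() {
            @Override
            public void processElement(String value, Context ctx, Collector<Integer> out) {
                try {
                    out.collect(Integer.parseInt(value)); // the "real" processing, which may throw
                } catch (Exception e) {
                    ctx.output(DEAD_LETTER, value);       // divert the failed record to the side output
                }
            }
        });

        processed.print();                                // normal sink, e.g. a Kafka producer in practice
        processed.getSideOutput(DEAD_LETTER).print();     // dead-letter stream, e.g. a "dlq" topic in practice

        env.execute("dead-letter-side-output-example");
    }
}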

Related

Kafka "retry" topic anti-pattern

Most places I look recommend that to prevent data loss you should create a "retry" topic. If a consumer fails it should send a message to the "retry" topic which would wait a set period of time and then send a message back to the "main" topic.
Isn't this an anti-pattern since when it goes back to the "main" topic all the services subscribed to the "main" topic would reprocess the failed message even though only one of the services failed to process it initially?
Is there a conventional way of solving this such as putting the clientId in the headers for messages that are the result of a retry? Am I missing something?
Dead-letter queues (DLQ), in themselves, are not an anti-pattern. Cycling messages back through the main topic might be, but that is subjective.
The alternative would be to "stop the world" and update the consumer code to resolve the errors before the topic retention deletes the messages you care about. OR, make your downstream consumers also read from the DLQ topic(s), but handle them differently from the main topic.
Is there a conventional way of solving this such as putting the clientId in the headers
Maybe if you wanted to track lineage somehow, but re-introducing those previously bad messages would introduce lag and ordering issues that interfere with the "good" messages.
Related question
What is the best practice to retry messages from Dead letter Queue for Kafka

How to ensure exactly once semantics while processing kafka messages in Apache Storm

I need exactly-once delivery in my app. I explored Kafka and realised that to have a message produced exactly once, I have to set idempotence=true in the producer config. This also sets acks=all, making the producer resend messages until all replicas have committed them. To ensure that the consumer does not do duplicate processing or leave any message unprocessed, it is advised to commit the processing output and the offset to an external database in the same database transaction, so that either both are persisted or neither is, avoiding duplicate and missed processing.
On the consumer side, a message is left unprocessed if the consumer first commits it but fails before processing it, and a message is processed more than once if the consumer first processes it but fails before committing it.
Q1. Now I was wondering how I can imitate the same with Apache Storm. I guess exactly-once production of messages can be ensured by setting idempotence=true in the KafkaBolt. Am I right?
I was also wondering how I can avoid missed and duplicate message processing in Storm. For example, this doc page says that if I anchor a tuple (by passing it as the first parameter to OutputCollector.emit()) and then pass the tuple to OutputCollector.ack() or OutputCollector.fail(), Storm will avoid data loss. This is what it exactly says:
Now that you understand the reliability algorithm, let's go over all the failure cases and see how in each case Storm avoids data loss:
A tuple isn't acked because the task died: In this case the spout tuple ids at the root of the trees for the failed tuple will time out and be replayed.
Acker task dies: In this case all the spout tuples the acker was tracking will time out and be replayed.
Spout task dies: In this case the source that the spout talks to is responsible for replaying the messages. For example, queues like Kestrel and RabbitMQ will place all pending messages back on the queue when a client disconnects.
Q2. I guess this ensures that a message is not left unprocessed, but it does not avoid duplicate processing of messages. Am I correct about this? Also, is there anything else that Storm offers to ensure exactly-once semantics, like Kafka does, that I am missing?
Regarding Q1: Yes, you can get the same behavior from the KafkaBolt by setting that property; the KafkaBolt simply wraps a KafkaProducer.
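For illustration, a sketch of wiring that up with the storm-kafka-client KafkaBolt; the bootstrap server, output topic and tuple field names are assumptions:

import java.util.Properties;

import org.apache.storm.kafka.bolt.KafkaBolt;
import org.apache.storm.kafka.bolt.mapper.FieldNameBasedTupleToKafkaMapper;
import org.apache.storm.kafka.bolt.selector.DefaultTopicSelector;

public class IdempotentKafkaBoltConfig {

    public static KafkaBolt<String, String> buildBolt() {
        // The KafkaBolt hands these properties to the KafkaProducer it wraps, so enabling
        // idempotence works exactly as it does with a plain producer (it implies acks=all).
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092"); // placeholder
        producerProps.put("enable.idempotence", "true");
        producerProps.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        return new KafkaBolt<String, String>()
                .withProducerProperties(producerProps)
                .withTopicSelector(new DefaultTopicSelector("output-topic")) // placeholder topic
                .withTupleToKafkaMapper(new FieldNameBasedTupleToKafkaMapper<>("key", "message"));
    }
}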
Regarding semantics on the consuming side, you have the same options with Storm as you do with Kafka. When you read a message from Kafka, you can choose to commit before or after you do your processing (e.g. write to a database). If you do it before, and the program crashes, you will lose the message. Let's call this at-most-once processing. If you do it after, you risk processing the same message twice if the program crashes after the processing but before the commit, called at-least-once processing.
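As a plain kafka-clients illustration of that choice (topic and group names are placeholders), committing before versus after the processing step is the whole difference between the two guarantees:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CommitOrderingExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "example-group");            // placeholder
        props.put("enable.auto.commit", "false");          // commit manually to control the ordering
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("input-topic")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));

                // At-most-once: commit first, then process. A crash after the commit
                // but before processing loses the batch.
                // consumer.commitSync();
                // records.forEach(CommitOrderingExample::process);

                // At-least-once: process first, then commit. A crash after processing
                // but before the commit means the batch is processed again on restart.
                records.forEach(CommitOrderingExample::process);
                consumer.commitSync();
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.println(record.value()); // placeholder for real processing (e.g. a database write)
    }
}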
So, regarding Q2: Yes, using anchored tuples and acking will provide you with at-least-once semantics. Not using anchored tuples would give you at-most-once.
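A minimal sketch of such a bolt, assuming Storm 2.x APIs: the input tuple is passed as the anchor when emitting, and acked or failed depending on whether processing succeeded:

import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class AnchoredProcessingBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            String result = input.getString(0).toUpperCase(); // placeholder for real processing
            collector.emit(input, new Values(result));         // anchoring: pass the input tuple as the first argument
            collector.ack(input);                              // success: the spout's tuple tree can complete
        } catch (Exception e) {
            collector.fail(input);                             // failure: the spout replays the message
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("result"));
    }
}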
Yes, there is something else Storm offers to ensure exactly-once semantics, called Trident, but it requires you to write your topology differently, and your data store has to be adapted to it so message deduplication can happen. See the documentation at https://storm.apache.org/releases/2.0.0/Trident-tutorial.html.
Also, just to caution you: when documentation for Storm (or Kafka) talks about exactly-once semantics, there are some assumptions made about what kind of processing you'll do. For example, when Storm's Trident docs talk about exactly-once, there's an assumption that you'll adapt your database so you can decide, for a given message, whether it has already been stored. When Kafka's documentation talks about exactly-once, the assumption is that your processing will be reading from Kafka, doing some computation (most likely with no side effects) and writing back to Kafka.
This is just to say that for some types of processing, you may still need to pick between at-least-once and at-most-once. If you can make your processing idempotent, at-least-once is a good option.
Finally if your processing fits the "read from Kafka, do computation, write to Kafka" model, you can likely get nicer semantics out of Kafka Streams than Storm, as Storm can't provide the exactly-once semantics Kafka can provide in that case.
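For completeness, here is a minimal Kafka Streams sketch of that read-compute-write model with exactly-once processing turned on; topic names are placeholders, and newer Kafka Streams versions expose the same setting as EXACTLY_ONCE_V2:

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class ExactlyOnceStreamsApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "exactly-once-example"); // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Turns on transactional, exactly-once processing for the whole
        // read-from-Kafka / transform / write-to-Kafka pipeline.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("input-topic")          // placeholder topic
               .mapValues(value -> value.toUpperCase())        // side-effect-free computation
               .to("output-topic");                            // placeholder topic

        new KafkaStreams(builder.build(), props).start();
    }
}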

How to safely skip messages using Lagom Kafka Message Broker API?

We've defined a basic subscriber that skips over failed messages (i.e., messages we are not going to handle, for some business-logic reason) by throwing an exception and relying on Akka Streams' stream supervision to resume the Flow:
someLagomService
  .someTopic()
  .subscribe
  .withGroupId("lagom-service")
  .atLeastOnce(
    Flow[Int]
      .mapAsync(1)(el => {
        // Exception may occur here or can map to Done
      })
      .withAttributes(ActorAttributes.supervisionStrategy({
        case t => Supervision.Resume
      }))
  )
This seems to work fine for basic use cases under very little load, but we have noticed very strange things for larger numbers of messages (e.g., very frequent re-processing of messages).
Digging into the code, we saw that Lagom's broker.Subscriber.atLeastOnce documentation states:
The flow may pull more elements from upstream but it must emit exactly one Done message for each message that it receives. It must also emit them in the same order that the messages were received. This means that the flow must not filter or collect a subset of the messages, instead it must split the messages into separate streams and map those that would have been dropped to Done.
Additionally, in the implementation of Lagom's KafkaSubscriberActor, we see that the private atLeastOnce implementation essentially unzips the message payload and offset and then re-zips them back up after our user flow maps messages to Done.
These two tidbits above seem to imply that by using stream supervisors and skipping elements, we can end up in a situation where the committable offsets no longer zip up evenly with the Dones that are to be produced per Kafka message.
Example: If we stream 1, 2, 3, 4 and map 1, 2, and 4 to Done but throw an exception on 3, we have 3 Dones and 4 committable offsets?
Is this correct / expected? Does this mean we should AVOID using stream supervisors here?
What sorts of behavior can the uneven zipping cause?
What is the recommended approach for error handling when it comes to consuming messages off of Kafka via the Lagom message broker API? Is the right thing to do to map / recover failures to Done?
Using Lagom 1.4.10
Is this correct / expected? Does this mean we should AVOID using stream supervisors here?
The official API documentation says that
If the Kafka Lagom message broker module is being used, then by default the stream is automatically restarted when a failure occurs.
So there is no need to add your own supervisionStrategy to manage error handling; the stream will be restarted by default, and you should not have to think about "skipped" Done messages.
What sorts of behavior can the uneven zipping cause?
This is exactly why the documentation says:
This means that the flow must not filter or collect a subset of the messages
The uneven zipping can end up committing the wrong (lower) offset, and on restart you might get already-processed messages replayed from that committed lower offset.
What is the recommended approach for error handling when it comes to consuming messages off of Kafka via the Lagom message broker API? Is the right thing to do to map / recover failures to Done?
Lagom takes care of the exception handling by dropping the message that caused the error and restarting the stream. Mapping / recovering failures to Done won't change any of this.
In case you need access to these messages later on, you could consider using Try {}, for example, i.e. not throwing an exception, and collecting the messages with errors by sending them to a different topic. This will give you the chance to monitor the number of errors and to replay the messages that caused them when the conditions are right, i.e. when the bug is fixed.

Kafka reset partition re-consume or not

If I consume from my topic and manage the offsets myself, I process some records successfully and move the offset onwards, but occasionally I process records that throw an exception. I still need to move the offset onwards, but at a later point I will need to reset the offset and re-process the failed records. Is it possible, when advancing the offset, to set a flag saying that if I consume over that event again I should ignore it or consume it?
The best way to handle these records is not by resetting the offsets, but by using a dead-letter queue; essentially, by posting them to another Kafka topic for reprocessing later. That way, your main consumer can focus on processing the records that don't throw exceptions, and some other consumer can constantly be listening and trying to handle the records that are throwing errors.
If that second consumer is still throwing exceptions when trying to reprocess the messages, you can either opt to repost them to the same queue, if the exception is caused by a transient issue (system temporarily unavailable, database issue, network blip, etc.), or simply opt to log the message ID and content, as well as your best guess as to what the problem was, for someone to look at manually later.
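A minimal sketch of such a dead-letter publisher using plain kafka-clients; the class, topic and header names are made up for illustration:

import java.nio.charset.StandardCharsets;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DeadLetterPublisher {
    private final KafkaProducer<String, String> producer;
    private final String deadLetterTopic;

    public DeadLetterPublisher(String bootstrapServers, String deadLetterTopic) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
        this.deadLetterTopic = deadLetterTopic;
    }

    // Called from the main consumer loop when processing a record throws.
    public void publish(String key, String value, Exception cause) {
        ProducerRecord<String, String> record = new ProducerRecord<>(deadLetterTopic, key, value);
        // Keep the failure reason alongside the payload so the retry consumer (or a human) can triage it.
        record.headers().add("error-reason", cause.toString().getBytes(StandardCharsets.UTF_8));
        producer.send(record);
    }
}

In the main consumer's catch block you would call publish(...) and then commit the offset as usual, so the main topic keeps moving while the failed record waits in the dead-letter topic for the retry consumer.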
Actually - no, this is not possible. Kafka records are read-only. I've seen this use case in practice, and I will try to give you some suggestions:
if you experience an error, just copy the message to a separate error topic and move on. This will allow you to replay all the error messages at any time from the error topic. That would definitely be my preferred solution - flexible and performant.
when there is an error, just hang your consumer - preferably enter an infinite loop with an exponential backoff, re-reading the same message over and over again. We used this strategy together with good monitoring/alerting and log compaction. When something goes wrong, we either fix the broken consumer and redeploy our service, or, if the message itself was broken, the producer fixes its bug and republishes the message with the same key, and log compaction kicks in. The faulty message is deleted (log compaction), and we can move our consumers forward at that point. This requires manual interaction in most cases. If the reason for the fault is a networking issue (e.g. the database is down), the consumer may recover by itself.
use local storage (e.g. a database) to record which offsets failed. Then reset the offset and ignore the successfully processed records. This is my least preferred solution.

Kafka Consumes unprocessable messages - How to reprocess broken messages later?

We are implementing a Kafka Consumer using Spring Kafka. If I understand correctly, when processing of a single message fails, there is the option to
Don't care and just ACK
Do some retry handling using a RetryTemplate
If even this doesn't work do some custom failure handling using a RecoveryCallback
I am wondering what your best practices are for that. I am thinking of simple application exceptions, such as a DeserializationException (for JSON-formatted messages), or longer local storage downtime, etc. That means some extra work is needed, like a hotfix deployment, to fix the broken application so the faulty messages can be re-processed.
Since losing messages (i.e. not processing them) is not an option for us, the only option left is, in my opinion, to store the faulty messages in some persistence store, e.g. another "faulty messages" Kafka topic, so that those events can be processed again at a later time and there is no need to stop event processing entirely.
How do you handle these scenarios?
One example is Spring Cloud Stream, which can be configured to publish failed messages to another topic errors.foo; users can then copy them back to the original topic to try again later.
This logic is done in the recovery callback.
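As an illustration of the same idea in plain Spring Kafka, here is a sketch assuming the Spring Kafka 2.3-era SeekToCurrentErrorHandler / DeadLetterPublishingRecoverer API (newer versions expose the equivalent via DefaultErrorHandler): failed records are retried a couple of times in place and then published to a dead-letter topic instead of being dropped:

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.SeekToCurrentErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

@Configuration
public class KafkaErrorHandlingConfig {

    // Retries each failed record a few times, then publishes it to a dead-letter
    // topic (by default "<original-topic>.DLT") instead of dropping it.
    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
            ConsumerFactory<String, String> consumerFactory,
            KafkaTemplate<Object, Object> kafkaTemplate) {

        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);

        DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(kafkaTemplate);
        // 3 attempts in total: the original delivery plus 2 retries, 1 second apart.
        factory.setErrorHandler(new SeekToCurrentErrorHandler(recoverer, new FixedBackOff(1000L, 2)));
        return factory;
    }
}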
We have a use case where we can't drop any messages at all, not even faulty ones. So when we encounter a faulty message, we send a default message in its place and at the same time send the original message to a failed topic for a later retry.