How to safely skip messages using Lagom Kafka Message Broker API? - scala

We've defined a basic subscriber that skips failed messages (i.e., messages that, for some business-logic reason, we are not going to handle) by throwing an exception and relying on Akka Streams' stream supervision to resume the Flow:
someLagomService
  .someTopic()
  .subscribe
  .withGroupId("lagom-service")
  .atLeastOnce(
    Flow[Int]
      .mapAsync(1)(el => {
        // Exception may occur here or can map to Done
      })
      // Resume on any failure: the failing element is dropped and the stream continues
      .withAttributes(ActorAttributes.supervisionStrategy({
        case t =>
          Supervision.Resume
      }))
  )
This seems to work fine for basic use cases under very little load, but we have noticed very strange behavior with larger numbers of messages (e.g., very frequent re-processing of messages).
Digging into the code, we saw that Lagom's broker.Subscriber.atLeastOnce documentation states:
The flow may pull more elements from upstream but it must emit
exactly one Done message for each message that it receives. It must
also emit them in the same order that the messages were received. This
means that the flow must not filter or collect a subset of the
messages, instead it must split the messages into separate streams and
map those that would have been dropped to Done.
Additionally, in the implementation of Lagom's KafkaSubscriberActor, we see that the private atLeastOnce essentially unzips the message payload and offset, and then re-zips them after our user flow maps the messages to Done.
These two tidbits above seem to imply that by using stream supervisors and skipping elements, we can end up in a situation where the committable offsets no longer zip up evenly with the Dones that are to be produced per Kafka message.
Example: If we stream 1, 2, 3, 4 and map 1, 2, and 4 to Done but throw an exception on 3, we have 3 Dones and 4 committable offsets?
Is this correct / expected? Does this mean we should AVOID using stream supervisors here?
What sorts of behavior can the uneven zipping cause?
What is the recommended approach for error handling when it comes to consuming messages off of Kafka via the Lagom message broker API? Is the right thing to do to map / recover failures to Done?
Using Lagom 1.4.10

Is this correct / expected? Does this mean we should AVOID using
stream supervisors here?
The official API documentation says:
If the Kafka Lagom message broker module is being used, then by
default the stream is automatically restarted when a failure occurs.
So there is no need to add your own supervisionStrategy for error handling. The stream is restarted by default, so you should not need to worry about "skipped" Done messages.
What sorts of behavior can the uneven zipping cause?
Exactly because of this the documentation says:
This means that the flow must not filter or collect a subset of the
messages
It can commit the wrong offset, one that lags behind what was actually processed. On restart, already-processed messages may then be replayed from the lower committed offset.
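To make the drift concrete, here is a minimal plain-Scala simulation of the 1, 2, 3, 4 example from the question. It is only a sketch: ZipDrift, processedWithResume and committed are hypothetical names, Done is a local stand-in for akka.Done, and ordinary collections stand in for Lagom's internal stream stages, on the assumption that offsets and Dones are paired purely by position.

```scala
object ZipDrift {
  // Local stand-in for akka.Done so the sketch has no dependencies.
  case object Done

  /** Simulates a user flow under Supervision.Resume: failing elements are silently dropped. */
  def processedWithResume(messages: Seq[Int], fails: Int => Boolean): Seq[(Int, Done.type)] =
    messages.filterNot(fails).map(m => (m, Done))

  /** Pairs offsets with emitted Dones by position; returns (offset, message that produced the Done). */
  def committed(offsets: Seq[Long], dones: Seq[(Int, Done.type)]): Seq[(Long, Int)] =
    offsets.zip(dones).map { case (offset, (message, _)) => (offset, message) }
}
```

Calling ZipDrift.committed(Seq(1L, 2L, 3L, 4L), ZipDrift.processedWithResume(Seq(1, 2, 3, 4), _ == 3)) pairs offset 3 with the Done produced by message 4 and leaves offset 4 unpaired, so in this simulation every skipped element pushes the commit position one step further behind.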
What is the recommended approach for error handling when it comes to
consuming messages off of Kafka via the Lagom message broker API? Is
the right thing to do to map / recover failures to Done?
Lagom takes care of exception handling by dropping the message that caused the error and restarting the stream. Mapping or recovering failures to Done does not change this.
If you need access to these messages later on, you could use Try {} (i.e., not throw an exception) and collect the failed messages by sending them to a different topic. This lets you monitor the number of errors and replay the messages that caused them when the conditions are right, e.g., once the bug is fixed.
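A minimal sketch of that Try-based approach, in plain Scala so it runs without Lagom on the classpath. SkipSafely, businessLogic, handle and the onError callback are hypothetical names; Done is a local stand-in for akka.Done, and onError stands in for whatever publishes to your error topic.

```scala
import scala.util.{Failure, Success, Try}

object SkipSafely {
  // Local stand-in for akka.Done so the sketch has no dependencies.
  case object Done

  // Hypothetical business logic: rejects the value 3, accepts everything else.
  def businessLogic(el: Int): Int =
    if (el == 3) throw new IllegalArgumentException(s"cannot handle $el") else el * 10

  /** Always yields exactly one Done per element; failures are reported instead of thrown. */
  def handle(el: Int, onError: (Int, Throwable) => Unit): Done.type =
    Try(businessLogic(el)) match {
      case Success(_) => Done
      case Failure(e) =>
        onError(el, e) // assumption: this would publish the element to an error topic
        Done
    }
}
```

Because every input element produces one Done, the count and order that atLeastOnce documents are preserved. Inside mapAsync, the equivalent would be to call recover on the returned Future so that a failure becomes Done rather than a stream error.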

Related

Reconsume Kafka Message that failed during processing due to DB error

I am new to Kafka and would like to seek advice on what is the best practice to handle such scenario.
Scenario:
I have a Spring Boot application with a consumer method that listens for messages via the @KafkaListener annotation. When a message arrives, the consumer method processes it, which simply means updating different database tables via JdbcTemplate.
If the updates to the tables are successful, I will manually commit the message by calling the acknowledge() method. If the database update fails, instead of calling the acknowledge() method, I will call the nack() method with a given duration (E.g. 10 seconds) such that the message will reappear again to be consumed.
Things to note
I am not concerned with the ordering of the messages. Whatever event comes I just have to consume and process it, that's all.
I am only given a topic (no retryable topic and no dead letter topic)
Here is the problem
If I do the above, my consumer becomes inconsistent. Say I call the nack() method with a duration of 1 minute, meaning that after 1 minute the same message will reappear.
Within this minute, there could be "x" incoming messages to be consumed and processed. What I observed is that none of these messages are consumed or processed.
What I want to know
Hence, I hope someone can advise me on what I am doing wrong and what the best practice is for handling such scenarios.
Thanks!
Records are always received in order; there is no way to defer the current record until later, but continue to process other records after this one when consuming from a single topic.
Kafka topics are a linear log and not a queue.
You would need to send it to another topic; the @RetryableTopic (non-blocking retries) feature is specifically designed for this use case.
https://docs.spring.io/spring-kafka/docs/current/reference/html/#retry-topic
You could also increase the container concurrency so at least you could continue to process records from other partitions.

Kafka reset partition re-consume or not

If I consume from my topic and manage the offset myself, I move the offset onwards after records that process successfully, but occasionally a record throws an exception. I still need to move the offset onwards, but at a later point I will need to reset the offset and re-process the failed records. When advancing the offset, is it possible to set a flag that says whether an event should be ignored or consumed if I consume over it again?
The best way to handle these records is not by resetting the offsets, but by using a dead-letter queue, essentially, by posting them to another kafka topic for reprocessing later. That way, your main consumer can focus on processing the records that don't throw exceptions, and some other consumer can constantly be listening and trying to handle the records that are throwing errors.
If that second consumer is still throwing exceptions when trying to reprocess the messages, you can either opt to repost them to the same queue, if the exception is caused by a transient issue (system temporarily unavailable, database issue, network blip, etc), or simply opt to log the message ID and content, as well as the best guess as to what the problem is, for someone to manually look at later.
Actually - no, this is not possible. Kafka records are read only. I've seen this use case in practice and I will try to give you some suggestions:
if you experience an error, just copy the message in a separate error topic and move on. This will allow you to replay all error messages at any time from the error topic. That would definitely be my preferred solution - flexible and performant.
when there is an error - just hang your consumer - preferably enter an infinite loop with an exponential backoff rereading the same message over and over again. We used this strategy together with good monitoring/alerting and log compaction. When something goes wrong we either fix the broken consumer and redeploy our service or if the message itself was broken the producer will fix its bug, republish the message with the same key and log compaction will kick in. The faulty message will be deleted (log compaction). We will be able to move our consumers forward at this point. This requires manual interaction in most cases. If the reason for the fault is a networking issue (e.g. database down) the consumer may recover by itself.
use local storage (e.g. a database) to store which offsets failed. Then reset the offset and ignore the successfully processed records. This is my least preferred solution.
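The second suggestion (block on the failing record and retry with exponential backoff) could be sketched roughly as follows. BlockingRetry, backoffMillis and processUntilSuccess are hypothetical names, the base delay and cap are arbitrary assumptions, and the handler stands in for your own record processing; no Kafka client calls are shown.

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

object BlockingRetry {
  /** Delay before retrying after the given 0-based attempt: base * 2^attempt, capped at maxMillis. */
  def backoffMillis(attempt: Int, baseMillis: Long = 100L, maxMillis: Long = 30000L): Long =
    math.min(maxMillis, baseMillis * (1L << math.min(attempt, 20)))

  /** Retries the same record until the handler succeeds; returns the number of failed attempts. */
  @tailrec
  def processUntilSuccess[A](
      record: A,
      attempt: Int = 0,
      sleep: Long => Unit = (ms: Long) => Thread.sleep(ms)
  )(handler: A => Unit): Int =
    Try(handler(record)) match {
      case Success(_) => attempt
      case Failure(_) =>
        sleep(backoffMillis(attempt)) // the consumer deliberately makes no progress here
        processUntilSuccess(record, attempt + 1, sleep)(handler)
    }
}
```

The sleep function is a parameter only so the loop can be exercised without real waiting. As the answer notes, this strategy relies on monitoring and alerting, because the loop never gives up on its own.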

Kafka Consumes unprocessable messages - How to reprocess broken messages later?

We are implementing a Kafka consumer using Spring Kafka. If I understand correctly, when processing of a single message fails, the options are to:
Don't care and just ACK
Do some retry handling using a RetryTemplate
If even this doesn't work do some custom failure handling using a RecoveryCallback
I am wondering what your best practices are for that. I am thinking of simple application exceptions, such as a DeserializationException (for JSON-formatted messages), or longer local storage downtime, etc.; that is, cases where some extra work, like a hotfix deployment, is needed to fix the broken application before the faulty messages can be re-processed.
Since losing messages (i. e. not processing them) is not an option for us, the only option left is IMO to store the faulty messages in some persistence store, e. g. another "faulty messages" Kafka topic for example, so that those events can be processed again at a later time and there is no need to stop event processing totally.
How do you handle these scenarios?
One example is Spring Cloud Stream, which can be configured to publish failed messages to another topic errors.foo; users can then copy them back to the original topic to try again later.
This logic is done in the recovery callback.
We have a use case where we can't drop any messages at all, even for faulty messages. So when we encounter a faulty message, we will send a default message in place of that faulty record and at the same time send the message to a failed-topic for retry later.

Recreating caches from Kafka

I have decided to use Kafka for an event sourcing implementation and there are a few things I am still not quite sure about. One is finding a good way of recreating my materialized views (stored in a Postgres database) in case of failures.
I am building a messaging application so consider the example of a service receiving a REST request to create a new message. It will validate the request and then create an event in Kafka (e.g. "NewMessageCreated"). The service (and possibly other services as well) will then pick up that event in order to update its local database. Let's assume however that the database has crashed so saving the order in the database fails. If I understand correctly how to deal with this situation I should empty the database and try to recreate it by replaying all Kafka events.
If my assumption is correct I can see the following issues:
1) I need to enforce ordering by userId for my "messages" topic (so all messages from a particular user are consumed in order) so this means that I cannot use Kafka's log compaction feature for that topic. This means I will always have to replay all events from Kafka no matter how big my application becomes! Is there a way to address this in a better way?
2) Each time I replay any events from Kafka they may trigger the creation of new events (e.g. a consumer might do some processing and then generate a new event before committing). This sounds really problematic so I am thinking if instead of just replaying the events when rebuilding my caches, I should be processing the events but disable generation of new events (even though this would require extra code and seems cumbersome).
3) When an error occurs (e.g. due to some resource failure or due to a bug) while consuming some message, should I commit the message and generate an error in a Kafka topic, or should I not commit at all? In the latter case this will mean that subsequent messages in the same partition cannot be committed either (otherwise they will implicitly commit the previous one as well).
Any ideas how to address these issues?
Thanks.
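On point 3, a committed Kafka offset is a single position per partition (the next offset to read), which is why committing a later record implicitly covers every earlier one. A minimal sketch of what that implies for "do not commit on failure", using hypothetical names CommitWatermark and safeCommit and a plain Set of successfully processed offsets:

```scala
object CommitWatermark {
  /**
   * Returns the offset that can safely be committed: one past the longest
   * contiguous run of successfully processed offsets starting at `from`.
   */
  def safeCommit(from: Long, succeeded: Set[Long]): Long =
    Iterator.iterate(from)(_ + 1).dropWhile(succeeded.contains).next()
}
```

With offsets 0, 1, 3 and 4 processed and offset 2 failed, safeCommit(0L, Set(0L, 1L, 3L, 4L)) stays at 2 until offset 2 is dealt with, which is the blocking effect the question describes. Committing anyway and recording the failure in an error topic is the alternative that keeps the partition moving.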

multiplexing consumer and producer in kafka

In my (high-level) Kafka consumer threads, after I consume a message I apply some business logic to it and forward it to a web service (WS). But this web service may sometimes be down, and since I have already consumed the object from Kafka and the offset has moved forward, I would lose this object.
One way to avoid this problem is to disable auto-commit in ZooKeeper and commit the offset programmatically, but I expect this to be a very costly operation. I will be producing to Kafka at about 2000 TPS, and that may increase later.
Another way, which I am not sure is a good idea, is to produce the consumed object to Kafka again if I run into any problem, but I didn't see any post about this in all my googling. Is this something that should not even be considered?
Can you please give me some insights about handling this situation.
Thanks
You can post back the failed message to the same topic or another of your choice.
If you use the same topic, the messages will be appended to the end of the topic and picked up after the others (so don't do this if order matters to you). Also, if the action you perform before sending the message is not idempotent, you will have to do something to identify these records so they don't perform the action twice.
If you use a failed_topic, you can push the messages that you can't send to this topic and when the WS is healthy again you need to create a consumer that consumes all the messages there and sends them to the WS.
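One way to "identify these records" is to track message IDs that have already had the action applied. The sketch below is an in-memory illustration only: IdempotentHandler, Deduplicator and once are hypothetical names, and a real service would need a durable store rather than a mutable Set.

```scala
import scala.collection.mutable

object IdempotentHandler {
  /** Remembers which IDs have been handled so a re-posted message does not repeat the action. */
  final class Deduplicator[K] {
    private val seen = mutable.Set.empty[K]

    /** Runs `action` only the first time `id` is seen; returns whether it ran. */
    def once(id: K)(action: => Unit): Boolean =
      if (seen.add(id)) { action; true } else false
  }
}
```

For example, wrapping the web service call in dedup.once(messageId) { ... } means a message that was re-posted to the topic after a partial failure skips the non-idempotent step the second time around.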
Hope it helps!
Moving such messages to an error queue and retrying them later is a well known approach.
See Dead letter channel