Kafka KTable corrupt message handling - apache-kafka

We are using KTable for aggregation in Kafka; it is a very basic use case and we followed the Kafka docs.
I am trying to work out how, if consuming a message fails during aggregation, we can move that message to an error topic or DLQ.
I found something similar for KStream but could not find anything for KTable, and I was not able to simply extend the KStream solution to KTable.
Reference for KStream
Handling bad messages using Kafka's Streams API
My use case is very simple: for any kind of exception, just move the message to an error topic and move on to the next message.

There is no built-in support for what you ask at the moment (Kafka 2.2), so you need to make sure that your application code does not throw any exceptions. All handlers that can be configured are for exceptions thrown by the Kafka Streams runtime; those handlers are provided because otherwise the user would have no chance at all to react to such exceptions.
Feel free to create a feature request Jira.
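Because the configurable handlers only cover exceptions thrown by the Kafka Streams runtime, a common workaround is to guard the per-record logic yourself and branch records that cannot be processed to an error topic before the aggregation ever sees them. A minimal sketch of that idea, assuming string values that should be numeric (the topic names and the parsing check are illustrative assumptions, not from the question):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class SafeAggregationTopology {

    public StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-topic");

        // Split records before the aggregation so the aggregator itself never throws:
        // branch 0 = records we can parse, branch 1 = everything else.
        @SuppressWarnings("unchecked")
        KStream<String, String>[] branches = input.branch(
                (key, value) -> isParsable(value),
                (key, value) -> true
        );

        // Unparsable records go to an error topic instead of killing the stream thread.
        branches[1].to("error-topic");

        // Only clean records reach the KTable aggregation.
        KTable<String, Long> sums = branches[0]
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                .aggregate(
                        () -> 0L,
                        (key, value, aggregate) -> aggregate + Long.parseLong(value),
                        Materialized.with(Serdes.String(), Serdes.Long()));

        sums.toStream().to("output-topic", Produced.with(Serdes.String(), Serdes.Long()));
        return builder;
    }

    private static boolean isParsable(String value) {
        try {
            Long.parseLong(value);
            return true;
        } catch (RuntimeException e) {
            return false;
        }
    }
}

The idea is the same however the KTable is built: validate in a KStream first, send the rejects to the error topic, and only let records that are known to be safe reach the aggregation.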

Related

What are all the possible exceptions I can expect when I produce a message onto a Kafka topic?

I'm trying to figure out all the possible exceptions I can expect when I produce a message onto a Kafka topic. I looked at the Apache producer documentation but didn't find much. So, where can I find a consolidated list of the exceptions that a producer can cause/throw?
https://kafka.apache.org/0110/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html
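The KafkaProducer javadoc linked above documents which exceptions send() can throw. At runtime they surface in two places: thrown synchronously from send() (e.g. serialization or buffer-full problems) or delivered asynchronously through the Callback / returned Future. A minimal sketch of inspecting both (broker address and topic name are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.RetriableException;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerErrorHandling {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("some-topic", "key", "value"), (metadata, exception) -> {
                if (exception == null) {
                    return; // record was acknowledged by the broker
                }
                if (exception instanceof RetriableException) {
                    // Transient problems (e.g. TimeoutException, NotEnoughReplicasException).
                    // The producer already retried internally, so reaching here means
                    // the retries were exhausted.
                    System.err.println("Retriable failure: " + exception);
                } else {
                    // Fatal for this record (e.g. RecordTooLargeException, authorization errors).
                    System.err.println("Non-retriable failure: " + exception);
                }
            });
        }
        // Note: send() itself can also throw synchronously, e.g. SerializationException
        // from the key/value serializer.
    }
}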

Kafka Streams Retry PAPI and dead letter

I am trying to implement retry logic within a Kafka Streams processor topology for the case where an exception occurs while producing to a sink topic.
I am using a custom ProductionExceptionHandler to be able to catch exceptions that happen on "producer.send" to the sink topic upon context.forward.
What criteria should I use to be able to resend the message to an alternate sink topic if there was an exception sending to the original sink topic? Could this be deduced from the type of exception in the production exception handler without compromising the transactional nature of the internal producer in Kafka Streams?
If we decide to produce to a dead letter queue from the production exception handler on some unrecoverable errors, can this be done within the context of the EOS guarantee, or does it have to be a custom producer not known to the topology?
Kafka Streams has no built-in support for a dead letter queue. Hence, you are "on your own" to implement it.
What criteria should I use to be able to resend the message to an alternate sink topic if there was an exception sending to the original sink topic?
Not sure what you mean by this? Can you elaborate?
Could this be deduced from the type of exception in the production exception handler?
Also not sure about this part.
without compromising the transactional nature of the internal producer in Kafka streams.
That is not possible. You have no access to the internal producer.
If we decide to produce to a dead letter queue from the production exception handler on some unrecoverable errors, can this be done within the context of the EOS guarantee, or does it have to be a custom producer not known to the topology?
You would need to maintain your own producer, and thus it is out of scope for EOS.
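To illustrate the "own producer" idea, a ProductionExceptionHandler that ships failed records to a dead-letter topic could look roughly like the sketch below. The DLQ topic name and the way the producer is configured are assumptions, and because that producer is separate from the Streams-internal one, its writes are not covered by the EOS transaction (at-least-once at best):

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.streams.errors.ProductionExceptionHandler;

public class DeadLetterProductionExceptionHandler implements ProductionExceptionHandler {

    private KafkaProducer<byte[], byte[]> dlqProducer;

    @Override
    public void configure(Map<String, ?> configs) {
        // Build a separate, plain producer from the Streams config.
        // It is NOT part of any Streams transaction.
        Map<String, Object> producerConfig = new HashMap<>();
        producerConfig.put("bootstrap.servers", configs.get("bootstrap.servers"));
        producerConfig.put("key.serializer", ByteArraySerializer.class);
        producerConfig.put("value.serializer", ByteArraySerializer.class);
        dlqProducer = new KafkaProducer<>(producerConfig);
    }

    @Override
    public ProductionExceptionHandlerResponse handle(ProducerRecord<byte[], byte[]> record,
                                                     Exception exception) {
        // For unrecoverable errors, park the record on a DLQ topic and keep the application
        // running. Returning FAIL instead would stop the stream thread.
        dlqProducer.send(new ProducerRecord<>("streams-dlq-topic", record.key(), record.value()));
        return ProductionExceptionHandlerResponse.CONTINUE;
    }
}

The handler is registered via the default.production.exception.handler config (StreamsConfig.DEFAULT_PRODUCTION_EXCEPTION_HANDLER_CLASS_CONFIG). Note that it is only invoked for errors that happen while producing, not for exceptions thrown by your processing code.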

Kafka Stream delivery semantic for a simple forwarder

I have a stateless Kafka Streams application that consumes from a topic and publishes into a different queue (Cloud PubSub) within a foreach. The topology does not end by producing into a new Kafka topic.
How do I know which delivery semantics I can guarantee? Knowing that it is just a message forwarder and no deserialisation or any other transformation is applied: are there any cases in which I could get duplicates or miss messages?
I'm thinking about the following scenarios and their impact on how offsets are committed:
Sudden application crash
Error occurring on publish
Thanks guys
If you consider the Kafka-to-Kafka loop that a Kafka Streams application usually creates, setting the property:
processing.guarantee=exactly_once
is enough to get exactly-once semantics, also in failure scenarios.
Under the hood Kafka uses a transaction to guarantee that the consume - process - produce - commit offsets cycle is executed with all-or-nothing guarantees.
Writing a sink connector with exactly-once semantics from Kafka to Google PubSub would mean solving the same issues Kafka already solves for the Kafka-to-Kafka scenario.
The producer.send() could result in duplicate writes of message B due to internal retries. This is addressed by the idempotent producer and is not the focus of the rest of this post.
We may reprocess the input message A, resulting in duplicate B messages being written to the output, violating the exactly once processing semantics. Reprocessing may happen if the stream processing application crashes after writing B but before marking A as consumed. Thus when it resumes, it will consume A again and write B again, causing a duplicate.
Finally, in distributed environments, applications will crash or—worse!—temporarily lose connectivity to the rest of the system. Typically, new instances are automatically started to replace the ones which were deemed lost. Through this process, we may have multiple instances processing the same input topics and writing to the same output topics, causing duplicate outputs and violating the exactly once processing semantics. We call this the problem of “zombie instances.”
Assuming your producer logic to Cloud PubSub does not suffer from problem 1, just like Kafka producers when using enable.idempotence=true, you are still left with problems 2 and 3.
Without solving these issues, your processing semantics will be the delivery semantics of your consumer: at-least-once, if you choose to manually commit the offsets.
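For reference, here is the pattern under discussion stripped down, with a hypothetical publishToPubSub call standing in for the real Cloud PubSub client. The offset for a record is only committed after foreach has processed it, so a crash between the external publish and the commit means the record is forwarded again on restart, i.e. at-least-once:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class PubSubForwarder {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pubsub-forwarder");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // processing.guarantee=exactly_once would only protect a Kafka-to-Kafka topology;
        // the external publish below stays at-least-once regardless.

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("source-topic")
               // A crash after publishToPubSub() but before the offset commit means the
               // same record is consumed and published again on restart (duplicate).
               .foreach((key, value) -> publishToPubSub(value));

        new KafkaStreams(builder.build(), props).start();
    }

    // Placeholder for the real Cloud PubSub publisher (hypothetical).
    private static void publishToPubSub(String value) {
        System.out.println("forwarding to PubSub: " + value);
    }
}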

How to handle the exception when the DB is down while reading the message from a Kafka topic

In my Spring Boot application I am reading messages from a Kafka topic and saving them into HBase.
In case the DB is down and the message has already been consumed from the topic, how should I ensure that the message is not lost? Can someone share a code sample?
If your code encounters an error during the processing of a record, you, as the developer, are responsible for handling retries or catching errors. spring-kafka can't capture errors outside of the Kafka API for you.
That being said, Kafka will not remove the record just because it was consumed; it stays on the topic until it expires. You should definitely set enable.auto.commit to false and commit your own offsets after a successful database action, at the expense of potentially duplicated records in HBase.
I would also like to point out that you should probably be using Kafka Connect, which is meant to integrate external systems with Kafka, rather than a plain consumer.
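If you do stay with a plain consumer, a minimal sketch of the manual-commit approach with spring-kafka could look like this; it assumes the listener container is in manual ack mode (spring.kafka.listener.ack-mode=manual and spring.kafka.consumer.enable-auto-commit=false), and the HBase call is a placeholder:

import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.support.Acknowledgment;
import org.springframework.stereotype.Component;

@Component
public class HbaseForwardingListener {

    // Assumes enable-auto-commit=false and ack-mode=manual in the application properties.
    @KafkaListener(topics = "my-topic", groupId = "hbase-writer")
    public void onMessage(String message, Acknowledgment ack) {
        // If the save throws (e.g. HBase is down), acknowledge() is never reached and the
        // offset stays uncommitted, so the record is not silently lost.
        saveToHbase(message);
        ack.acknowledge(); // commit the offset only after a successful database write
    }

    private void saveToHbase(String message) {
        // Placeholder for the actual HBase write; may throw when the DB is down.
    }
}

Whether the failed record is retried immediately depends on the container's error handler (e.g. a seek-to-current style handler re-seeks so the record is polled again); and, as noted above, records that were already saved may be written to HBase a second time after a restart.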

How to safely skip messages using Lagom Kafka Message Broker API?

We've defined a basic subscriber that skips over failed messages (i.e. messages that, for some business logic reason, we are not going to handle) by throwing an exception and relying on Akka Streams' stream supervision to resume the Flow:
someLagomService
  .someTopic()
  .subscribe
  .withGroupId("lagom-service")
  .atLeastOnce(
    Flow[Int]
      .mapAsync(1)(el => {
        // Exception may occur here or can map to Done
        Future.successful(Done)
      })
      .withAttributes(ActorAttributes.supervisionStrategy({
        case t => Supervision.Resume
      }))
  )
This seems to work fine for basic use cases under very little load, but we have noticed very strange things for larger numbers of messages (e.g. very frequent re-processing of messages, etc.).
Digging into the code, we saw that Lagom's broker.Subscriber.atLeastOnce documentation states:
The flow may pull more elements from upstream but it must emit exactly one Done message for each message that it receives. It must also emit them in the same order that the messages were received. This means that the flow must not filter or collect a subset of the messages, instead it must split the messages into separate streams and map those that would have been dropped to Done.
Additionally, in the implementation of Lagom's KafkaSubscriberActor, we see that the private atLeastOnce implementation essentially unzips the message payload and offset and then zips them back up after our user flow maps messages to Done.
These two tidbits above seem to imply that by using stream supervisors and skipping elements, we can end up in a situation where the committable offsets no longer zip up evenly with the Dones that are to be produced per Kafka message.
Example: If we stream 1, 2, 3, 4 and map 1, 2, and 4 to Done but throw an exception on 3, we have 3 Dones and 4 committable offsets?
Is this correct / expected? Does this mean we should AVOID using stream supervisors here?
What sorts of behavior can the uneven zipping cause?
What is the recommended approach for error handling when it comes to consuming messages off of Kafka via the Lagom message broker API? Is the right thing to do to map / recover failures to Done?
Using Lagom 1.4.10
Is this correct / expected? Does this mean we should AVOID using stream supervisors here?
The official API documentation says:
If the Kafka Lagom message broker module is being used, then by default the stream is automatically restarted when a failure occurs.
So, there is no need to add your own supervisionStrategy to manage error handling. The stream will be restarted by default and you should not think about "skipped" Done messages.
What sorts of behavior can the uneven zipping cause?
Exactly because of this the documentation says:
This means that the flow must not filter or collect a subset of the messages
It can commit the wrong offset. And on restart, you might get already processed messages replayed from the lower committed offset.
What is the recommended approach for error handling when it comes to consuming messages off of Kafka via the Lagom message broker API? Is the right thing to do to map / recover failures to Done?
Lagom is taking care of the exception handling by dropping the message that caused the error and restarting the stream. Mapping / recovering failures to Done won't change this.
In case you need access to these messages later on, you could consider using Try {} for example, i.e. not throwing an exception, and collecting the messages that caused errors by sending them to a different topic. This gives you a chance to monitor the number of errors and to replay the messages that caused them when the conditions are right, i.e. when the bug is fixed.