How to handle the exception when the DB is down while reading the message from a Kafka topic - apache-kafka

In my Spring Boot application I am reading messages from a Kafka topic and saving them into HBase.
If the DB is down and the message has already been consumed from the topic, how should I ensure that the message is not lost? Can someone share a sample code?

If your code encounters an error during the processing of a record, you, as the developer, are responsible for handling retries or error catching. spring-kafka can't capture errors outside of the Kafka API for you.
That being said, Kafka will not remove a record just because it has been consumed; it stays on the topic until it expires from retention. You should definitely set enable.auto.commit to false and commit your own offsets only after a successful database action, at the expense of potential duplicate records in HBase.
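To make that concrete, here is a minimal Spring Kafka sketch of "commit only after the HBase write succeeds". The topic name, group id, and the HBaseWriter interface are placeholders for whatever your application actually uses, and the listener container must be configured with AckMode.MANUAL (or MANUAL_IMMEDIATE) and enable.auto.commit=false.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.support.Acknowledgment;
import org.springframework.stereotype.Component;

@Component
public class HBaseSavingListener {

    private final HBaseWriter hbaseWriter; // placeholder for your own HBase access code

    public HBaseSavingListener(HBaseWriter hbaseWriter) {
        this.hbaseWriter = hbaseWriter;
    }

    @KafkaListener(topics = "my-topic", groupId = "hbase-writer") // placeholder names
    public void listen(ConsumerRecord<String, String> record, Acknowledgment ack) {
        // If this throws (e.g. HBase is down), the offset is never acknowledged, so it is
        // not committed and the record will be redelivered by the error handler or on restart.
        hbaseWriter.save(record.key(), record.value());
        ack.acknowledge(); // commit the offset only after the write succeeded
    }
}

interface HBaseWriter {
    void save(String key, String value);
}

Since the same record can be redelivered, also make the HBase write idempotent (e.g. use the message key as the row key) so duplicates simply overwrite the same row.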
I would also like to point out that you should probably be using Kafka Connect, which is meant for integrating external systems with Kafka, rather than a plain consumer.

Related

What happens to the kafka messages if the microservice crashes before kafka commit?

I am new to Kafka. I have a Java microservice using Kafka Streams that consumes messages from a Kafka topic produced by a producer and processes them. The commit interval has been set using auto.commit.interval.ms. My question is: if the microservice crashes before the commit, what will happen to the messages that got processed but didn't get committed? Will there be duplicated records? And how do I resolve this duplication, if it happens?
Kafka offers exactly-once semantics, which guarantees that records are processed only once. Take a look at this section of Spring Kafka's docs for more details on the Spring support for that. Also, see this section for the support for transactions.
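Since the question mentions a Kafka Streams microservice, here is a minimal sketch of turning on exactly-once processing there, assuming a plain Streams topology with String serdes; the application id, broker address, and topic names are placeholders.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class ExactlyOnceStreamsApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-processing-app"); // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Exactly-once processing for Kafka-to-Kafka topologies (needs brokers 2.5+).
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("input-topic")       // placeholder topics
               .mapValues(value -> value.toUpperCase())
               .to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}

Note that this guarantee covers Kafka-to-Kafka processing; writes to an external database still need to be made idempotent on your side.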
Kafka provides various delivery semantics, and you can choose among them based on the use case you've implemented.
If you're concerned that your messages should not get lost by the consumer service, you should go with the at-least-once delivery semantic.
Now, answering your question on the basis of at-least-once delivery semantics:
If your consumer service crashes before committing the offset for a Kafka message, it will receive the message again once it is back up and running, because the offset for that partition was not committed. The offset is committed only after the consumer has processed the message; in simple words, a committed offset tells Kafka that the message has been processed, and Kafka will not redeliver it for that partition.
At-least-once delivery semantics are usually good enough for use cases where data duplication is not a big issue or deduplication is possible on the consumer side. For example, with a unique key in each message, a duplicate write can be rejected by the database.
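As a rough sketch of that deduplication idea, the snippet below treats the record key as the unique message id and does an idempotent upsert; the table name, column names, the PostgreSQL ON CONFLICT syntax, and the DataSource wiring are all assumptions, not something from the original question.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import javax.sql.DataSource;
import org.apache.kafka.clients.consumer.ConsumerRecord;

public class DeduplicatingWriter {

    private final DataSource dataSource; // assumed to be configured elsewhere

    public DeduplicatingWriter(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Uses the record key (a unique message id) as the primary key, so replaying
    // the same message overwrites the same row instead of creating a duplicate.
    public void write(ConsumerRecord<String, String> record) throws SQLException {
        String upsert = "INSERT INTO messages (message_id, payload) VALUES (?, ?) "
                      + "ON CONFLICT (message_id) DO UPDATE SET payload = EXCLUDED.payload"; // PostgreSQL syntax
        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(upsert)) {
            ps.setString(1, record.key());
            ps.setString(2, record.value());
            ps.executeUpdate();
        }
    }
}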
There are mainly three types of delivery semantics:
At most once -
Offsets are committed as soon as the message is received at the consumer.
It's a bit risky: if the processing goes wrong, the message is lost.
At least once -
Offsets are committed after the message is processed, so it's usually the preferred one.
If the processing goes wrong, the message will be read again since it has not been committed.
The problem with this is duplicate processing of messages, so make sure your processing is idempotent (yes, your application should handle duplicates; Kafka won't help here).
Idempotent means that processing the same message again will not impact your system. (A plain-consumer sketch of this is shown below.)
Exactly once -
Can be achieved for Kafka-to-Kafka communication using the Kafka Streams API.
It's not your case.
You can choose the semantics above as per your requirement.
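To make the at-least-once option concrete, here is a minimal plain-consumer sketch that disables auto-commit and commits offsets only after a batch has been processed; the broker address, group id, and topic name are placeholders.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // we commit manually

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // if this throws, nothing is committed and the batch is re-read
                }
                consumer.commitSync(); // commit only after the whole batch was processed
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("key=%s value=%s%n", record.key(), record.value());
    }
}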

Logs and statements with Snowflake and Kafka Connector

Is there a possibility that we may lose some messages if we use the Snowflake Kafka connector? For example, if the connector reads a message and commits the offset before the message is written to the variant table, then we would lose that message. Is this a scenario that can happen if we use Kafka Connect?
If you have any examples, these are welcome as well, thank you!
According to the documentation from Snowflake:
Both Kafka and the Kafka connector are fault-tolerant. Messages are neither duplicated nor silently dropped. Messages are delivered exactly once, or an error message will be generated. If an error is detected while loading a record (for example, the record was expected to be a well-formed JSON or Avro record, but wasn’t well-formed), then the record is not loaded; instead, an error message is returned.
Limitations are listed as well. Arguably, nothing is impossible, but if you don't trust Kafka I'd not use Kafka at all.
How and where you could lose messages also depends on your overall architecture, e.g. how records are written into the Kafka topics you're consuming and how you partition them.

Using kafka for CQRS

I've been reading a lot about Kafka's use as an event store and as a potentially good candidate for CQRS.
I was wondering: since messages in Kafka have a limited retention time, how will events be replayed after the messages have been deleted from the disk where Kafka retains them?
Logically, storing these messages outside of Kafka (after reading them from Kafka topics) in a DB (SQL/NoSQL) would make more sense as an event store than Kafka itself.
Given the above, and assuming my understanding is correct, what is the real use case for Kafka in CQRS, when the actual intent of Kafka was just to be a high-throughput messaging system?
You can use Kafka as an event store for CQRS. You can use Kafka Streams to process all events generated by commands, keep a snapshot of your entities in a changelog topic, and materialize that changelog into one or more NoSQL databases that meet your requirements. All events can also be stored in a database (e.g. PostgreSQL). What's important to know is that Kafka can be used as a store (it stores files in a highly available way) or as a message queue.
Retention time: You can set the retention time as long as you want or even keep messages forever in the topic.
Using Kafka as the data store: sure, you can. There is a feature named log compaction. Consider the following scenario:
Insert product with ID=10, Name=Apple, Price=10
Insert product with ID=20, Name=Orange, Price=20
Update product with ID=10, Price becomes 30
When log compaction is turned on for a topic, a background job periodically cleans up messages on that topic. For messages that share the same key, it keeps only the latest one. With the above scenario, the messages written to Kafka will have the following format:
Message 1: Key=10, Name=Apple, Price=10
Message 2: Key=20, Name=Orange, Price=20
Message 3: Key=10, Name=Apple, Price=30 (every update includes all fields, so each message is self-contained)
After the log compaction, the topic will become:
Message 1: Key=20, Name=Orange, Price=20
Message 2: Key=10, Name=Apple, Price=30 (only the latest record for Key=10 is kept)
In reality, it is this log compaction feature that lets Kafka be used as persistent data storage.
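As an illustration of the two knobs mentioned above (retention and log compaction), here is a minimal sketch that creates a compacted topic with the Java AdminClient; the topic name, partition/replica counts, and broker address are placeholders.

import java.util.Map;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        try (AdminClient admin = AdminClient.create(
                Map.of(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"))) { // placeholder
            NewTopic products = new NewTopic("products", 3, (short) 1) // placeholder name/sizing
                    .configs(Map.of(
                            // Keep only the latest record per key instead of deleting by time.
                            TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT));
            // For a plain event-history topic you could instead keep messages forever
            // by setting retention.ms to -1 rather than enabling compaction.
            admin.createTopics(Set.of(products)).all().get();
        }
    }
}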

Can I make sure Kafka doesn't accept two copies of the same message?

I'm writing messages along with timestamps to Kafka. If I retry, the timestamp might change, and so might the producer that's writing, but the message content and the message id stay the same. The message id is generated before the message gets here, and it's a UUID.
How can I make sure Kafka doesn't accept the second copy if the first write to the topic succeeded but the ack got lost, so the service up the chain retries? The consumers must never see the duplicate message.
In general there are two cases when the same message can be sent to Kafka:
During normal operation your application intentionally sends messages with the same uuid to Kafka and you want Kafka to do deduplication.
While you are sending a message to Kafka, your code or the Kafka brokers fail, and you want to make sure the message you try to send again isn't duplicated, but also isn't lost.
I assume you are interested in case 2. The Kafka developers call case 2 exactly-once delivery. The latest versions of Kafka support an idempotent producer and transactions in order to enable exactly-once delivery. A complete explanation of how Kafka does this, along with a code snippet, can be found in this article by Confluent (the Kafka company).
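For the lost-ack-and-retry part of case 2, the broker can deduplicate the producer's own retries when idempotence is enabled. A minimal sketch, assuming String keys/values and a placeholder broker address and topic:

import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class IdempotentProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // The broker drops duplicates caused by this producer's own retries (e.g. a lost ack).
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // required by idempotence

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String messageId = UUID.randomUUID().toString();
            producer.send(new ProducerRecord<>("my-topic", messageId, "payload")); // placeholder topic
            producer.flush();
        }
    }
}

Note that this only covers retries within the same producer session; if the retry comes from a service further up the chain (a new producer instance), you still need consumer-side deduplication on the message id, as in case 1.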

Making Kafka producer and Consumer synchronous

I have one Kafka producer and one consumer. The producer is publishing to one topic, where the data is taken and some processing is done. The consumer is reading from another topic that indicates whether the processing of the data from topic 1 was successful or not, i.e. topic 2 has success or failure messages. Now I am starting my consumer and then publishing the data to topic 1. I want to make the producer and consumer synchronous, i.e. once the producer publishes the data, the consumer should read the success or failure message for that data, and only then should the producer proceed with the next set of data.
Apache Kafka, and publish/subscribe messaging in general, seeks to de-couple producers and consumers through the use of streaming async events. What you are describing is more like a batch job or a synchronous Remote Procedure Call (RPC) where the producer and consumer are explicitly coupled together. The standard Apache Kafka Producer/Consumer APIs do not support this message exchange pattern, but you can always write your own simple wrapper on top of the Kafka APIs that uses correlation IDs, consumption ACKs, and request/response messages to make your own interface that behaves as you wish.
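As an illustration of such a wrapper (not an official Kafka API), here is a minimal sketch that attaches a correlation-id header to the request and blocks until a reply with the same correlation id shows up on the reply topic. The topic names, the String serialization, and the absence of a timeout are all simplifications.

import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.List;
import java.util.UUID;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;

public class BlockingRequestReply {

    private final KafkaProducer<String, String> producer;
    private final KafkaConsumer<String, String> replyConsumer;

    public BlockingRequestReply(KafkaProducer<String, String> producer,
                                KafkaConsumer<String, String> replyConsumer) {
        this.producer = producer;
        this.replyConsumer = replyConsumer;
        this.replyConsumer.subscribe(List.of("topic2")); // placeholder reply topic
    }

    // Sends one request and blocks until the matching success/failure reply arrives.
    public String sendAndWait(String payload) throws Exception {
        String correlationId = UUID.randomUUID().toString();
        ProducerRecord<String, String> request = new ProducerRecord<>("topic1", payload); // placeholder
        request.headers().add("correlationId", correlationId.getBytes(StandardCharsets.UTF_8));
        producer.send(request).get(); // wait for the broker ack before polling for the reply

        // NOTE: a real implementation should give up after a timeout instead of looping forever.
        while (true) {
            for (ConsumerRecord<String, String> reply : replyConsumer.poll(Duration.ofMillis(500))) {
                Header header = reply.headers().lastHeader("correlationId");
                if (header != null && correlationId.equals(new String(header.value(), StandardCharsets.UTF_8))) {
                    return reply.value(); // e.g. "SUCCESS" or "FAILURE"
                }
            }
        }
    }
}

A production version would also need to handle rebalances and duplicate replies; Spring Kafka's ReplyingKafkaTemplate implements this request/reply pattern for you.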
Short answer: you can't do that; Kafka doesn't provide that support out of the box.
Long answer: as Hans explained, the publish/subscribe messaging model keeps publishers and subscribers completely unaware of each other, and I believe that is where the power of this model lies. A producer can produce without worrying about whether there is any consumer, and a consumer can consume without worrying about how many producers there are.
The closest you can get is to make your producer synchronous, which means you wait until your message is received and acknowledged by the broker.
If you want to do that, block on the send's result (or flush) after every send.
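A minimal sketch of such a synchronous send, with a placeholder broker address and topic (it waits for the broker's acknowledgement, not for the downstream consumer's success/failure reply):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class SynchronousSend {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for all in-sync replicas

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Blocking on get() makes the send synchronous: the call returns only
            // after the broker has acknowledged the write (or throws on failure).
            RecordMetadata metadata =
                    producer.send(new ProducerRecord<>("topic1", "key", "payload")).get();
            System.out.printf("written to %s-%d@%d%n",
                    metadata.topic(), metadata.partition(), metadata.offset());
        }
    }
}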