Logs and statements with Snowflake and Kafka Connector - apache-kafka

Is there a possibility that we may lose some messages if we use the Snowflake Kafka connector? For example, if the Kafka connector reads a message and commits the offset before the message is written to the variant table, then we will lose that message. Is this a scenario that can happen if we use Kafka Connect?
If you have any examples, these are welcome as well, thank you!

According to the documentation from Snowflake:
Both Kafka and the Kafka connector are fault-tolerant. Messages are neither duplicated nor silently dropped. Messages are delivered exactly once, or an error message will be generated. If an error is detected while loading a record (for example, the record was expected to be a well-formed JSON or Avro record, but wasn’t well-formed), then the record is not loaded; instead, an error message is returned.
Limitations are listed as well. Arguably, nothing is impossible, but if you don't trust Kafka I'd not use Kafka at all.
How and where you could lose messages also depends on your overall architecture, e.g. how records are written into the Kafka topics you're consuming and how you partition.

Related

How to handle the exception when the DB is down while reading the message from a Kafka topic

In my Spring Boot application I am reading messages from a Kafka topic and saving them into HBase.
In case the DB is down when a message is consumed from the topic, how should I ensure that the message is not lost? Can someone share a sample of code?
If your code encounters an error during the processing of a record, you, as the developer, are responsible for handling retries or catching errors. spring-kafka can't capture errors outside of the Kafka API for you.
That being said, Kafka will not remove the record just because it's consumed; it stays until it fully expires off the topic. You should definitely set enable.auto.commit to false and commit your own offsets only after a successful database action, at the expense of potential duplicated records in HBase.
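A minimal sketch of that idea with a plain KafkaConsumer (not the asker's actual code; the broker address, topic name, and the saveToHBase helper are placeholders): offsets are committed only after the database write succeeds, so a failed write means the batch is re-read later instead of being lost.

```java
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class DbAwareConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "hbase-writer");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit manually
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    saveToHBase(record.value()); // may throw if the DB is down
                }
                // Only reached if every record in the batch was persisted;
                // on failure, offsets are not committed and the batch is re-consumed.
                consumer.commitSync();
            }
        }
    }

    private static void saveToHBase(String value) {
        // placeholder for the actual HBase write
    }
}
```

In spring-kafka, the rough equivalent is a manual AckMode where Acknowledgment#acknowledge() is only called after the HBase write succeeds.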
I would also like to point out that you should probably be using Kafka Connect, which is meant to integrate external systems to Kafka, not a plain consumer.

How to make a restartable producer?

The latest versions of Kafka support exactly-once semantics (EoS). To support this notion, extra details are added to each message. This means that at your consumer, if you print offsets of messages, they won't necessarily be sequential. This makes it harder to poll a topic to read the last committed message.
In my case, the consumer printed something like this:
Offset-0 0
Offset-2 1
Offset-4 2
Problem: In order to write a restartable producer, I poll the topic and read the content of the last message. In this case, the last message would be offset #5, which is not a valid consumer record. Hence, I see errors in my code.
I can use the solution provided at: Getting the last message sent to a kafka topic. The only problem is that instead of using consumer.seek(partition, last_offset-1), I would use consumer.seek(partition, last_offset-2). This can immediately resolve my issue, but it's not an ideal solution.
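For illustration, a rough sketch of that workaround (everything here is illustrative, not the asker's code; "big_topic" and the single partition are assumptions): seek two offsets back from the end, because on a transactional topic the last offset can be a commit marker rather than a data record.

```java
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class LastMessageReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed"); // skip aborted records

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("big_topic", 0);
            consumer.assign(Collections.singletonList(tp));
            long end = consumer.endOffsets(Collections.singletonList(tp)).get(tp);
            // end - 1 may be a transaction marker, so step back one more offset
            consumer.seek(tp, Math.max(0, end - 2));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            records.forEach(r -> System.out.println(r.offset() + " " + r.value()));
        }
    }
}
```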
What would be the most reliable and best solution to get the last committed message for a consumer written in Java? OR
Is it possible to use a local state store for a partition? OR
What is the most recommended way to store the last message to withstand producer failure? OR
Are Kafka connectors restartable? Is there any specific API that I can use to make producers restartable?
FYI - I am not looking for a quick fix.
In my case, multiple producers push data to one big topic, so reading the entire topic would be a nightmare.
The solution that I found is to maintain another topic, i.e. "P1_Track", where a producer can store metadata. Within a transaction, the producer sends data both to the big topic and to P1_Track, as sketched below.
When I restart a producer, it reads P1_Track and figures out where to start from.
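A minimal sketch of this approach (the topic names "big_topic" and "P1_Track" come from this answer; the broker address, keys, and values are illustrative): the data record and its tracking metadata are sent in one transaction, so after a commit either both are visible or neither is.

```java
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class TrackedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // the transactional id must stay stable across restarts of this producer
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "producer-1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            try {
                producer.beginTransaction();
                producer.send(new ProducerRecord<>("big_topic", "key-42", "payload"));
                // metadata that lets this producer find its position again after a restart
                producer.send(new ProducerRecord<>("P1_Track", "producer-1", "last-key=key-42"));
                producer.commitTransaction();
            } catch (Exception e) {
                producer.abortTransaction(); // simplified; fatal errors require closing the producer instead
                throw e;
            }
        }
    }
}
```

On restart, the producer (or a consumer with read_committed isolation) reads the tail of P1_Track to decide where to resume.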
I am also thinking about storing the last committed message in a database and using it when the producer process restarts.

How reliable are Kafka Connectors in case of failures?

I'm thinking of using a Kafka Connector vs creating my own Kafka Consumers/Producers to move some data from/to Kafka, and I see the value Kafka Connectors provide in terms of scalability and fault tolerance. However, I haven't been able to find how exactly connectors behave if the "Task" fails for some reason. Here are a couple of scenarios:
For a sink connector (S3 sink), if it (the task) fails (after all retries) to successfully send the data to the destination (for example, due to a network issue), what happens to the worker? Does it crash? Is it able to re-consume the same data from Kafka later on?
For a source connector (JDBC Source), if it fails to send to Kafka, does it re-process the same data later on? Does it depend on what the source is?
Does answer to the above questions depend on which connector we are talking about?
In Kafka 2.0, I think, they introduced the concept of graceful error handling, which can skip over bad messages or write them to a DLQ topic (an example configuration is shown after this answer).
1) The S3 sink can fail, and it'll just stop processing data. However, if you fix the problem (for the various edge cases that may arise), the sink itself provides exactly-once delivery to S3. The consumed offsets are stored as regular consumer offsets and will not be committed to Kafka until the file upload completes. However, obviously, if you don't fix the issue before the retention period of the topic passes, you're losing data.
2) Yes, it depends on the source. I don't know the semantics of the JDBC connector, but it really depends on which query mode you're using. For example, with the incrementing timestamp mode, if you run a query every 5 seconds for all rows within a range, I do not believe it'll retry old, missed time windows.
Overall, the failure-recovery scenarios all depend on the systems being connected to. Some errors are recoverable, and some are not (for example, your S3 access keys get revoked, and it won't write files until you get a new credential set).
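For reference, the error-handling settings mentioned above (introduced with KIP-298 in Kafka Connect 2.0) look roughly like this for a sink connector; the DLQ topic name and the retry values are placeholders, and not every connector failure is routed through this mechanism:

```properties
errors.tolerance=all
errors.deadletterqueue.topic.name=s3-sink-dlq
errors.deadletterqueue.topic.replication.factor=3
errors.deadletterqueue.context.headers.enable=true
errors.retry.timeout=300000
errors.retry.delay.max.ms=60000
errors.log.enable=true
errors.log.include.messages=true
```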

Kafka Streams application does not consume data after restart

After I restarted our Kafka cluster, my Kafka Streams application didn't receive messages from the input topic and I got a "can't create internal topic" exception. After some research, I did a reset (of the input topic and the application) with the Kafka tool kafka-streams-application-reset.sh.
Unfortunately, it didn't resolve the problem and I got the exception again.
From the error message, you can infer that the topic already exists and thus cannot be created. The reason for the failure is that the existing topic does not have the expected number of partitions (it has 1 instead of 150); if the number of partitions matched, Kafka Streams would just use the existing topic.
This can happen, if you have topic auto-create enabled at the brokers (and the topic was created with a wrong number of partitions), or if the number of partitions of your input topic changed. Kafka Streams does not automatically change the number of partitions for the repartition topic, because this might result in data corruption and thus lead to incorrect results.
One way to fix this is to manually delete this topic; note that this might result in data loss and you should only do this if you know that it is what you want.
Another (better) way would be to reset the application cleanly using bin/kafka-streams-application-reset.sh in combination with KafkaStreams#cleanUp().
Because you need to clean up the application, and users should be aware of the implications, Kafka Streams fails with an error to make the user aware of the issue, instead of "automagically" taking actions that might be undesired from the user's point of view.
Check out the docs for more details. There is also a blog post that explains application reset in details:
https://kafka.apache.org/11/documentation/streams/developer-guide/app-reset-tool.html
https://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/
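For illustration, a typical invocation of the reset tool might look like the following; the application id and topic name are placeholders, and flag names can differ slightly between Kafka versions, so check --help for your release:

```sh
bin/kafka-streams-application-reset.sh \
  --application-id my-streams-app \
  --input-topics my-input-topic \
  --bootstrap-servers localhost:9092
```

On the next start of the application you would also call KafkaStreams#cleanUp() before KafkaStreams#start(), so the local state stores are wiped as well.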

Can I make sure Kafka doesn't accept two copies of the same message?

I'm writing messages along with timestamps to Kafka. If I retry, the timestamp might change, and so might the producer that's writing, but the message content and message id are the same. The message id is generated before the message gets here, and it's a UUID.
How can I make sure Kafka doesn't accept the second copy if the first write to the topic succeeded but the ack got lost, so the service up the chain retries? The consumers must never see the duplicate message.
In general there are two cases when the same message can be sent to Kafka:
During normal operation your application intentionally sends messages with the same uuid to Kafka and you want Kafka to do deduplication.
While you are sending a message to Kafka your code or Kafka brokers fail and you want to make sure the message you try to send again isn't duplicated, and also isn't lost.
I assume you are interested in case 2. The Kafka developers call case 2 exactly-once delivery. The latest versions of Kafka support transactions in order to enable exactly-once delivery. A complete explanation of how Kafka does this, along with a code snippet, can be found in this article by Confluent (the Kafka company).
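As a sketch of the producer-side part of case 2 (the broker address and topic name are placeholders, not from the question): with idempotence enabled, the broker deduplicates the producer's own retries, e.g. when an ack is lost, using the producer id and per-partition sequence numbers.

```java
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class IdempotentProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true"); // broker-side dedup of retries
        props.put(ProducerConfig.ACKS_CONFIG, "all");                // required for idempotence
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.toString(Integer.MAX_VALUE));

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // using the message id (a UUID) as the key; internal retries of this send
            // cannot create a second copy on the broker
            producer.send(new ProducerRecord<>("events", "some-uuid-message-id", "payload"));
        }
    }
}
```

Note that this only covers retries inside the producer itself; a re-send issued by the service further up the chain is a new produce request, and for that end-to-end picture the linked article on transactions is the place to look.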