Apache Kafka messages got archived - is it possible to retrieve the messages - apache-kafka

We are using Apache Kafka and we process more than 30 million messages per day. We have a retention policy of 30 days, but our messages were deleted before the 30 days elapsed.
Is there a way we could retrieve the deleted messages?
Is it possible to reset the "start index" to older index to retrieve the data through query?
What other options do we have?
If we have "disk backup", could we use that for retrieving the data?
Thank You

I'm assuming your messages got deleted by the Kafka cluster here.
In general, no - if the records got deleted due to duration / size related policies, then they have been removed.
Theoretically, if you have access to backups you might copy the Kafka log-segment files back into the broker's data directory, but the behaviour is undefined. Trying that on a fresh cluster with unlimited size/time retention policies (so nothing gets purged immediately) might work and let you consume the records again.

In my experience, until the general availability of Tiered Storage, there is no free/easy way to recover data (via the Kafka Consumer protocol).
For example, you can use a Kafka Connect sink connector to write to some external, more durable storage. But then you would need a job that reads that data back. Sure, you could have a SQL table of (STRING topic, INT timestamp, BLOB key, BLOB value) and track "consumer offsets" separately from that, but with that design Kafka doesn't really seem useful: you'd be reimplementing various parts of it when you could have just added more storage to the Kafka cluster.
Is it possible to reset the "start index" to older index to retrieve the data through query?
That is what auto.offset.reset=earliest will do for a consumer group with no committed offsets, or kafka-consumer-groups --reset-offsets --to-earliest for an existing group.
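For an existing group, a sketch of the CLI invocation (group and topic names are placeholders; the group must be inactive while its offsets are being reset):

```shell
kafka-consumer-groups --bootstrap-server localhost:9092 \
  --group my-group --topic my-topic \
  --reset-offsets --to-earliest --execute
```

Without --execute the tool only prints a dry-run plan, which is a safe way to preview the reset first.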
If we have "disk backup", could we use that for retrieving the data?
With caution, maybe. You can copy old broker log segments onto a server, but I know of no tooling that will retroactively discover the new "low watermark" of each topic (maybe the broker finds this upon restart; I haven't tested). You'd need to copy this data to each broker manually, I believe, since the replicas wouldn't know about the old segments (again, they might after a full cluster restart).
Plus, the consumer offsets would already be reading way past that data, unless you stop all consumers and reset them.
I'm also not sure what happens if you have gaps in the segment files. E.g. your current oldest segment is N and you copy N-2, but not N-1... You might then run into an error, or the consumer may simply apply the auto.offset.reset policy and seek to the next available offset or to the very end of the topic.

Related

How can I retrieve old data from Kafka to store in Mongodb

How can I transfer old data from Kafka to a MongoDB database? I had changed the Kafka topic, so the old data did not arrive in MongoDB. How can I get it across?
If I try to retrieve data from old topic then I get poll timeout error.
Unclear how you define "old".
Kafka actively removes data from topics with retention policies. If data is deleted, you can never consume it.
While it is still retained, you simply set auto.offset.reset=earliest (in Connect, you add the consumer. prefix to this and put it in the worker properties file, not in the connector config) and start polling. Timeout errors are a separate issue and have nothing to do with the topic's retention time. By default, any consumer or connector sets this property to latest and will not see previously retained data.
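As a sketch, the worker-level override mentioned above would look like this in the Connect worker's properties file (the consumer. prefix mechanism is real; it is a worker-wide setting applied to every sink task's underlying consumer):

```properties
# Worker properties file (not the connector JSON/config):
# the consumer. prefix passes this through to each sink task's consumer.
consumer.auto.offset.reset=earliest
```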

kafka message expiry event - how to capture

I am a beginner to Kafka, and recently started using in my projects at work. One important thing that I wanna know is, whether it is possible to capture event(s) when messages expire in kafka. The intent is to trap these expired messages and back them up in a backup store.
I believe the goal you want to achieve is similar to Apache Kafka Tiered Storage, which is still under development in open-source Apache Kafka.
Messages don't expire as such; there are two different scenarios in which messages appear to expire.
A topic is configured with cleanup.policy = delete. After retention.ms, or once retention.bytes is exceeded, it looks like messages expire. What actually happens is that a whole log segment is deleted once its newest message is older than retention.ms or the partition's retention.bytes is exceeded. A segment is only considered for deletion if it is not the active segment that Kafka is currently writing to.
A topic is configured with cleanup.policy = compact. When two log segments are merged, Kafka makes sure that only the latest version of each distinct key is kept. To "delete" a message, one sends a message with the key of the target message and an empty value - also called a tombstone.
There's no hook or event you could subscribe to in order to find out if either of these two cases will happen. You'd have to take care of that logic on the client side, which is hard because the Kafka API does not expose any details about the log segments within a partition.
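As a toy illustration of what compaction keeps, here is a minimal Python sketch (names are illustrative; real compaction operates on whole log segments, not individual records, and tombstones are only removed after a delay):

```python
def compact(records):
    """Keep only the latest value per key; drop keys whose latest
    value is a tombstone (None), mimicking cleanup.policy=compact."""
    latest = {}
    for key, value in records:
        latest[key] = value  # later records overwrite earlier ones
    # Keys whose latest value is a tombstone are eventually removed entirely.
    return {k: v for k, v in latest.items() if v is not None}

log = [("user1", "v1"), ("user2", "v1"), ("user1", "v2"), ("user2", None)]
state = compact(log)  # only user1's latest value survives
```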

Apache Kafka: large retention time vs. fast read of last value

Dear Apache Kafka friends,
I have a use case for which I am looking for an elegant solution:
Data is published in a Kafka-Topic at a relatively high rate. There are two competing requirements
all records should be kept for 7 days (which is configured by min.compaction.lag)
applications should read the "last status" from the topic during their initialization phase
LogCompaction is enabled in order for the "last state" to be available in the topic.
Now comes the problem. If an application wants to initialize itself from the topic, it has to read a lot of records to get the last state for all keys (the entire topic content must be processed). With this volume of records, that is not feasible performance-wise.
Idea
A streaming process streams the data of the topic into a corresponding ShortTerm topic which has a much shorter min.compaction.lag time (1 hour). The applications initialize themselves from this topic.
Risk
The streaming process is a potential source of errors. If it temporarily fails, the applications will no longer receive the latest status.
My Question
Are there any other possible solutions to satisfy the two requirements? Did I maybe miss a Kafka concept that helps handle these competing requirements?
Any contribution is welcome. Thank you all.
If you don't have a strict guarantee on how frequently each key is updated, you cannot do much other than what you proposed.
To avoid the risk that the downstream app misses new updates (because the data replication job stalls), I would recommend only bootstrapping an app from the short-term topic and letting it consume from the original topic afterwards. To not miss any updates, you can synchronize the switch-over as follows:
On app startup, get the replication job's committed offsets from the original topic.
Get the short term topic's current end-offsets (because the replication job will continue to write data, you just need a fixed stopping point).
Consume the short term topic from beginning to the captured end offsets.
Resume consuming from the original topic using the captured committed offsets (from step 1) as start point.
This way, you might read some messages twice, but you won't lose any updates.
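The switch-over above can be sketched with plain Python data structures: lists stand in for the two topics, list indices for offsets, and all names are illustrative, not a real client API.

```python
# Toy model of the bootstrap-then-switch-over consumption pattern.
# Topics are lists of (key, value) records; offsets are list indices.

def bootstrap_and_switch(original, short_term, replicated_upto):
    """Consume short_term fully, then resume original from the
    replication job's committed offset. Some records may be read
    twice, but no update is ever missed."""
    committed = replicated_upto       # step 1: replication job's committed offset
    end_offset = len(short_term)      # step 2: fixed stopping point on short-term topic
    # Step 3: consume the short-term topic up to the captured end offset.
    state = dict(short_term[:end_offset])   # latest value per key
    # Step 4: resume the original topic from the committed offset.
    for key, value in original[committed:]:
        state[key] = value
    return state

original = [("a", 1), ("b", 1), ("a", 2), ("b", 2)]
short_term = [("a", 2), ("b", 1)]   # compacted copy of offsets 0..2
state = bootstrap_and_switch(original, short_term, replicated_upto=3)
```

The end state equals what a full replay of the original topic would produce, which is exactly the "read twice, lose nothing" guarantee described above.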
To me, the two requirements you mentioned, together with the requirement for new consumers, are not competing. In fact, I do not see any reason to keep a message with an outdated key in your topic for 7 days, because
New consumers are only interested in the latest message of a key.
Already existing consumers will have processed the message within 1 hour (as taken from your comments).
Therefore, my understanding is that your requirement "all records should be kept for 7 days" can be replaced by "each consumer should have enough time to consume the message & the latest message for each key should be kept for 7 days".
Please correct me if I am wrong and explain which consumer actually does need "all records for 7 days".
If that is the case you could do the following:
Enable log compaction as well as time-based retention to 7 days for this topic
Fine-tune the compaction frequency to be very eager, i.e. keep as few outdated messages per key as possible.
Set min.compaction.lag to 1 hour so that all consumers have a chance to keep up.
That way, new consumers will read (almost) only the latest message for each key. If that is not performant enough, you can try increasing the partitions and consumer threads of your consumer groups.
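As an illustrative sketch (the topic name is a placeholder), the settings above could be applied like this: 7 days = 604800000 ms, 1 hour = 3600000 ms, and a low min.cleanable.dirty.ratio makes compaction eager. I believe recent brokers accept the bracketed list form for cleanup.policy:

```shell
kafka-configs --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name my-topic \
  --add-config 'cleanup.policy=[compact,delete],retention.ms=604800000,min.compaction.lag.ms=3600000,min.cleanable.dirty.ratio=0.01'
```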

How Kafka Connectors are reliable in case of failures?

I'm thinking of using a Kafka Connector vs creating my own Kafka Consumers/Producers to move some data from/to Kafka, and I see the value Kafka Connectors provide in terms of scalability and fault tolerance. However, I haven't been able to find how exactly connectors behave if the "Task" fails for some reason. Here are a couple of scenarios:
For a sink connector (S3-Sink), if it (the Task) fails (after all retries) to successfully send the data to the destination (for example due to a network issue), what happens to the worker? Does it crash? Is it able to re-consume the same data from Kafka later on?
For a source connector (JDBC Source), if it fails to send to Kafka, does it re-process the same data later on? Does it depend on what the source is?
Does answer to the above questions depend on which connector we are talking about?
In Kafka 2.0, I think, they introduced the concept of graceful error handling, which can skip over bad messages or write them to a DLQ topic.
1) The S3 sink can fail, and it'll just stop processing data. However, if you fix the problem (for the various edge cases that may arise), the sink itself is exactly-once delivery to S3. The consumed offsets are stored as regular consumer offsets and are not committed to Kafka until the file upload completes. Obviously, though, if you don't fix the issue before the topic's retention period elapses, you're losing data.
2) Yes, it depends on the source. I don't know the semantics of the JDBC connector, but it really depends on which query mode you're using. For example, with the incrementing-timestamp mode, if you try to run a query every 5 seconds for all rows within a range, I do not believe it'll retry old, missed time windows.
Overall, the failure-recovery scenarios all depend on the systems being connected. Some errors are recoverable and some are not (for example, if your S3 access keys get revoked, it won't write files until you supply a new credential set).

Is there any way to ensure that duplicate records are not inserted in kafka topic?

I have been trying to implement a queuing mechanism using Kafka where I want to ensure that duplicate records are not inserted into the topic I created.
I found that iteration is possible in the consumer. Is there any way we can do this in the producer thread as well?
This is known as exactly-once processing.
You might be interested in the first part of the Kafka FAQ, which describes some approaches to avoiding duplication during data production (i.e. on the producer side):
Exactly-once semantics has two parts: avoiding duplication during data production and avoiding duplicates during data consumption.
There are two approaches to getting exactly-once semantics during data production:
Use a single writer per partition, and every time you get a network error, check the last message in that partition to see if your last write succeeded.
Include a primary key (UUID or something) in the message and deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be duplicate-free. However, reading without duplicates depends on some co-operation from the consumer too. If the consumer is periodically checkpointing its position, then if it fails and restarts it will restart from the checkpointed position. Thus if the data output and the checkpoint are not written atomically, it will be possible to get duplicates here as well. This problem is particular to your storage system. For example, if you are using a database you could commit these together in a transaction. The HDFS loader Camus that LinkedIn wrote does something like this for Hadoop loads. The other alternative that doesn't require a transaction is to store the offset with the data loaded and deduplicate using the topic/partition/offset combination.
I think there are two improvements that would make this a lot easier:
Producer idempotence could be done automatically and much more cheaply by optionally integrating support for this on the server.
The existing high-level consumer doesn't expose a lot of the more fine-grained control of offsets (e.g. to reset your position). We will be working on that soon.
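The offset-based deduplication the FAQ describes can be sketched like this; a minimal in-memory model where a real sink would persist the seen set (or the highest processed offset) atomically alongside the data:

```python
def deduplicate(records, seen=None):
    """Drop records whose (topic, partition, offset) was already processed."""
    if seen is None:
        seen = set()
    out = []
    for topic, partition, offset, value in records:
        key = (topic, partition, offset)
        if key not in seen:   # a redelivered record has the same coordinates
            seen.add(key)
            out.append(value)
    return out

batch = [
    ("orders", 0, 41, "a"),
    ("orders", 0, 42, "b"),
    ("orders", 0, 42, "b"),  # redelivered after a producer retry / consumer restart
]
values = deduplicate(batch)  # the duplicate at offset 42 is dropped
```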