Kafka: How to delete data which has already been consumed by a consumer? - apache-kafka

I set
log.retention.minutes=8
in server.properties so that data under kafka-logs/ is cleaned up automatically every 8 minutes.
Is it possible to make the cleaner delete only the data which has already been consumed,
so that data not yet consumed by a consumer is retained?
Thanks!

No. Kafka messages are appended to log segment files, which roll over every x hours or when they reach a certain size (depending on configuration). Once rolled over, those files are immutable: you cannot delete individual records. Log files are cleaned up when the last write access to a file exceeds the retention time.
In other words: the retention time is the minimum time a message is kept. A message with a retention time of minutes can end up being retained for weeks (depending on other configuration settings).
The concept of "consumer offsets" is the mechanism Kafka uses to avoid reconsumption of messages. Kafka 0.11 will also include exactly-once capabilities.
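To illustrate that mechanism, here is a minimal sketch of relying on consumer offsets instead of deletion: the consumer commits the offsets it has processed and resumes from there on restart, so already-consumed records are simply skipped even though they are still on disk. The topic name and group id below are placeholders, not from the question.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class OffsetTrackingConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");           // hypothetical group id
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");    // commit manually after processing
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("my-topic"));                     // hypothetical topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        process(record);                                     // your processing logic
                    }
                    consumer.commitSync(); // records stay on disk; the committed offset marks them as consumed
                }
            }
        }

        private static void process(ConsumerRecord<String, String> record) {
            System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
        }
    }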

Related

Is it possible to make a Kafka Consumer override/ignore its configurations when doing a records poll?

I have a Kafka consumer that should consume a minimum of 1MB worth of records at each poll. This data is then written to file and stored partitioned by date - for example, records consumed during 2022.09.22 should be written to a file stored in the date_id=20220922 folder. The file size should be a minimum of 1MB.
The configuration properties fetch.min.bytes and fetch.max.wait.ms are tuned to get the desired behavior. The problem, though, arises when a new day starts. On a day change, the consumer should consume the remaining records on the topic (less than 1MB) without having to wait for the poll size threshold to be met or for the wait time to expire. The consumer should do a kind of "force fetch" of the remaining records available on the topic.
Is it possible to override the configuration of the consumer to achieve this behavior?
The properties are what they are - you cannot change them at runtime without stopping the consumer and creating a new one with different config settings.
Worth mentioning that the HDFS/S3 sink connectors from Confluent already support date-based directory partitioning. They also work for local storage, but distributed storage makes more sense when your Kafka consumers are distributed.
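As a rough sketch of the "stop and recreate" approach (not a drop-in implementation): on a day change you close the current consumer and build a new one with relaxed fetch settings so the leftover records are fetched immediately. The group id, bootstrap server, and the specific values below are assumptions for illustration.

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.ByteArrayDeserializer;

    public class FetchConfigSwitcher {

        // Build a consumer with the given fetch.min.bytes / fetch.max.wait.ms values.
        static KafkaConsumer<byte[], byte[]> buildConsumer(int fetchMinBytes, int fetchMaxWaitMs) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "file-writer");        // hypothetical group id
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
            props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, fetchMinBytes);
            props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, fetchMaxWaitMs);
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
            return new KafkaConsumer<>(props);
        }

        // On a day change: commit, close the "bulk" consumer (fetch.min.bytes = 1MB)
        // and replace it with an "eager" one (fetch.min.bytes = 1) to drain what is left.
        static KafkaConsumer<byte[], byte[]> switchToEagerConsumer(KafkaConsumer<byte[], byte[]> bulkConsumer) {
            bulkConsumer.commitSync();
            bulkConsumer.close();
            return buildConsumer(1, 500);
        }
    }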

Apache Kafka messages got archived - is it possible to retrieve the messages

We are using Apache Kafka and we process more than 30 million messages per day. We have a retention policy of 30 days. However, our messages got archived before the 30 days were up.
Is there a way we could retrieve the deleted messages?
Is it possible to reset the "start index" to older index to retrieve the data through query?
What other options do we have?
If we have "disk backup", could we use that for retrieving the data?
Thank You
I'm assuming your messages got deleted by the Kafka cluster here.
In general, no - if the records got deleted due to duration / size related policies, then they have been removed.
Theoretically, if you have access to backups, you might move the Kafka data/log files back into a broker's log directory, but the behaviour is undefined. Trying that with a fresh cluster with infinite size/time retention policies (so nothing gets purged immediately) might work and let you consume the data again.
In my experience, until Tiered Storage becomes generally available, there is no free/easy way to recover data (via the Kafka consumer protocol).
For example, you can use a Kafka Connect sink connector to write to some external, more persistent storage. But then, would you want to write a job that scrapes that data? Sure, you could have a SQL database table of STRING topic, INT timestamp, BLOB key, BLOB value, and maybe track "consumer offsets" separately from it. With that design, though, Kafka doesn't really seem useful, as you'd be reimplementing various parts of it when you could have just added more storage to the Kafka cluster.
Is it possible to reset the "start index" to older index to retrieve the data through query?
That is what auto.offset.reset=earliest will do, or kafka-consumer-groups --reset-offsets --to-earliest
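For reference, a minimal consumer-side sketch of that setting (group id and topic are placeholders); note that auto.offset.reset only kicks in when the group has no committed offsets, otherwise the kafka-consumer-groups reset is the way to go:

    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class ReadFromEarliest {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "recovery-group");    // a fresh group with no committed offsets
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // start from the oldest retained offset
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("my-topic"));                    // hypothetical topic
                // poll() will now begin at the earliest offset still present on the brokers
            }
        }
    }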
have "disk backup", could we use that
With caution, maybe. For example, you can copy old broker log segments into a server, but there aren't any tools I know of that will retroactively discover the new "low watermark" of each topic (maybe the broker finds this upon restart; I haven't tested). You'd need to copy this data for each broker manually, I believe, since the replicas wouldn't know about the old segments (again, maybe they would after a full cluster restart).
Plus, the consumer offsets would already be reading way past that data, unless you stop all consumers and reset them.
I'm also not sure what happens if you have gaps in the segment files. E.g. your current oldest segment is N and you copy N-2, but not N-1... You might then run into an error, or the consumer might simply apply the auto.offset.reset policy and seek to the next available offset or to the very end of the topic.

kafka automatic deletion based on message consumption

Like we have in MQ solutions, is it possible to have a message automatically deleted in Kafka once it is consumed?
Since I have no control over when a message will be consumed, it's not possible to define retention by time / byte size.
You can override the time-based retention on a per-topic basis, and even set it to -1 so that nothing is deleted by time at all. Size-based retention is unlimited by default, so you don't have to use it. That said, I am not sure Kafka is best suited for your use case, since it is meant for real-time, high-performance stream processing. On another note, you can use a compacted topic and send a tombstone message to delete a record once it has been processed, but fundamentally Kafka does not have automatic deletion on consumption.
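A minimal sketch of the tombstone approach on a compacted topic (topic and key names are placeholders): after processing a record, produce a message with the same key and a null value, and log compaction will eventually remove the old value along with the tombstone.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class TombstoneSender {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // null value = tombstone: compaction will eventually drop all earlier
                // messages with this key (topic must use cleanup.policy=compact)
                producer.send(new ProducerRecord<>("orders-compacted", "order-42", null));
                producer.flush();
            }
        }
    }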

Apache Kafka: large retention time vs. fast read of last value

Dear Apache Kafka friends,
I have a use case for which I am looking for an elegant solution:
Data is published in a Kafka-Topic at a relatively high rate. There are two competing requirements
all records should be kept for 7 days (which is configured by min.compaction.lag)
applications should read the "last status" from the topic during their initialization phase
LogCompaction is enabled in order for the "last state" to be available in the topic.
Now comes the problem. If an application wants to initialize itself from the topic, it has to read a lot of records to get the last state for all keys (the entire topic content must be processed). With this volume of records, that is not feasible in a performant way.
Idea
A streaming process streams the data of the topic into a corresponding ShortTerm topic which has a much shorter min.compaction.lag time (1 hour). The applications initialize themselves from this topic.
Risk
The streaming process is a potential source of errors. If it temporarily fails, the applications will no longer receive the latest status.
My Question
Are there any other possible solutions to satisfy the two requirements? Did I maybe miss a Kafka concept that helps handle these competing requirements?
Any contribution is welcome. Thank you all.
If you don't have a strict guarantee on how frequently each key will be updated, there is not much you can do other than what you proposed.
To avoid the risk that the downstream app does not get new updates (because the data replication job stalls), I would recommend only bootstrapping an app from the short-term topic, and letting it consume from the original topic afterwards. To not miss any updates, you can sync the switch-over as follows:
1. On app startup, get the replication job's committed offsets for the original topic.
2. Get the short-term topic's current end offsets (because the replication job will continue to write data, you just need a fixed stopping point).
3. Consume the short-term topic from the beginning up to the captured end offsets.
4. Resume consuming from the original topic, using the committed offsets captured in step 1 as the starting point.
This way, you might read some messages twice, but you won't lose any updates.
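A rough sketch of that switch-over in Java follows. All topic names, the replication job's group id, and the bootstrap server are assumptions for illustration; it also assumes the replication job only commits offsets for the original topic.

    import java.time.Duration;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import java.util.concurrent.ExecutionException;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class BootstrapThenResume {

        static final String SHORT_TERM_TOPIC = "state-topic-short"; // hypothetical
        static final String REPLICATION_GROUP = "replication-job";  // hypothetical group id of the copy job

        public static void main(String[] args) throws ExecutionException, InterruptedException {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            // Step 1: capture the replication job's committed offsets on the original topic.
            Map<TopicPartition, OffsetAndMetadata> committed;
            try (Admin admin = Admin.create(props)) {
                committed = admin.listConsumerGroupOffsets(REPLICATION_GROUP)
                                 .partitionsToOffsetAndMetadata().get();
            }

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // Step 2: capture the short-term topic's current end offsets as a fixed stopping point.
                List<TopicPartition> shortTermPartitions = new ArrayList<>();
                consumer.partitionsFor(SHORT_TERM_TOPIC).forEach(
                        p -> shortTermPartitions.add(new TopicPartition(p.topic(), p.partition())));
                Map<TopicPartition, Long> endOffsets = consumer.endOffsets(shortTermPartitions);

                // Step 3: bootstrap from the short-term topic up to the captured end offsets.
                consumer.assign(shortTermPartitions);
                consumer.seekToBeginning(shortTermPartitions);
                while (!caughtUp(consumer, endOffsets)) {
                    consumer.poll(Duration.ofMillis(500)).forEach(r -> { /* build local state from r */ });
                }

                // Step 4: resume from the original topic at the offsets captured in step 1.
                consumer.assign(committed.keySet());
                committed.forEach((tp, om) -> consumer.seek(tp, om.offset()));
                while (true) {
                    consumer.poll(Duration.ofMillis(500)).forEach(r -> { /* apply live updates from r */ });
                }
            }
        }

        // True once the consumer's position has reached the captured end offset on every partition.
        static boolean caughtUp(KafkaConsumer<?, ?> consumer, Map<TopicPartition, Long> endOffsets) {
            return endOffsets.entrySet().stream()
                    .allMatch(e -> consumer.position(e.getKey()) >= e.getValue());
        }
    }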
To me, the two requirements you have mentioned, together with the requirement for new consumers, are not competing. In fact, I do not see any reason why you should keep a message of an outdated key in your topic for 7 days, because
New consumers are only interested in the latest message of a key.
Already existing consumers will have processed the message within 1 hour (as taken from your comments).
Therefore, my understanding is that your requirement "all records should be kept for 7 days" can be replaced by "each consumer should have enough time to consume the message & the latest message for each key should be kept for 7 days".
Please correct me if I am wrong and explain which consumer actually does need "all records for 7 days".
If that is the case you could do the following:
Enable log compaction as well as time-based retention to 7 days for this topic
Fine-tune the compaction frequency to be very eager, meaning to keep as few outdated messages per key as possible.
Set min.compaction.lag to 1 hour such that all consumers have the chance to keep up.
That way, new consumers will read (almost) only the latest message for each key. If that is not performant enough, you can try increasing the partitions and consumer threads of your consumer groups.
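As a sketch of what such a topic configuration might look like (created here with the Java AdminClient; the topic name, partition/replica counts, and the exact tuning values are assumptions, not from the question):

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CompactedTopicWithRetention {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (Admin admin = Admin.create(props)) {
                NewTopic topic = new NewTopic("device-status", 12, (short) 3)  // hypothetical name/sizing
                        .configs(Map.of(
                                "cleanup.policy", "compact,delete",   // compaction plus time-based deletion
                                "retention.ms", "604800000",          // 7 days
                                "min.compaction.lag.ms", "3600000",   // keep every update for at least 1 hour
                                "min.cleanable.dirty.ratio", "0.01",  // compact eagerly
                                "segment.ms", "600000"));             // roll segments often so they become compactable
                admin.createTopics(List.of(topic)).all().get();
            }
        }
    }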

Kafka: Messages disappearing from topics, largestTime=0

We have messages disappearing from topics on Apache Kafka with versions 2.3, 2.4.0, 2.4.1 and 2.5.0. We noticed this when we make a rolling deployment of our clusters and unfortunately it doesn't happen every time, so it's very inconsistent.
Sometimes we lose all messages inside a topic, other times we lose all messages inside a partition. When this happens the following log is a constant:
[2020-04-27 10:36:40,386] INFO [Log partition=test-lost-messages-5, dir=/var/kafkadata/data01/data] Deleting segments List(LogSegment(baseOffset=6, size=728, lastModifiedTime=1587978859000, largestTime=0)) (kafka.log.Log)
There is also a previous log saying this segment hit the retention time breach of 24 hours. In this example, the message was produced ~12 minutes before the deployment.
Notice that all messages that are wrongly deleted have largestTime=0, while the ones that are properly deleted have a valid timestamp there. From what we read in the documentation and code, it looks like largestTime is used to calculate whether a given segment has reached the time breach or not.
Since we can observe this in multiple versions of Kafka, we think it might be related to something external to Kafka, e.g. Zookeeper.
Does anyone have any ideas of why this could be happening? We are using Zookeeper 3.6.0.
We found out that the cause was not related to Kafka itself but to the volume where we stored the logs. Still, the following explanation might be useful for educational purposes:
In detail, it was a permission problem where Kafka was not able to read the .timeindex files when the log cleaner was triggered. This caused largestTime to be 0 and led to some messages being deleted well before the retention time.
Each topic partition is divided into several segments, which are stored as separate .log files containing the actual messages. For each .log file there is a .timeindex file containing a map from offset to lastModifiedTime.
When Kafka needs to check whether a segment is deletable, it looks up the lastModifiedTime of the most recent offset and stores it as largestTime. Then it checks whether the retention limit has been reached: currentTime - largestTime > retentionTime.
If so, it deletes the segment and the respective messages.
Since Kafka was not able to read the file, largestTime was 0, and the check degenerated to currentTime > retentionTime, which was always true for our 1-day retention.
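To make the arithmetic concrete, here is a small sketch of the retention check described above (simplified; not the actual broker code):

    public class RetentionCheckSketch {

        // Simplified deletion criterion: a segment is deletable once its newest
        // timestamp is older than the retention window.
        static boolean isDeletable(long currentTimeMs, long largestTimeMs, long retentionMs) {
            return currentTimeMs - largestTimeMs > retentionMs;
        }

        public static void main(String[] args) {
            long now = System.currentTimeMillis();
            long oneDayMs = 24L * 60 * 60 * 1000;

            // Healthy segment written 12 minutes ago: not deletable for another ~23h48m.
            System.out.println(isDeletable(now, now - 12 * 60 * 1000, oneDayMs)); // false

            // Broken segment where largestTime could not be read and defaulted to 0:
            // now - 0 is vastly larger than one day, so the segment is deleted immediately.
            System.out.println(isDeletable(now, 0L, oneDayMs)); // true
        }
    }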
Ensure the date is synced across all Kafka brokers and ZooKeeper nodes.
Bash command: date.
Compare the year, day, hour and minute.