Undelete messages from kafka - apache-kafka

I have mixed up the retention.ms and delete.retention.ms properties in Kafka, with the result that some messages I'm interested in are now in a deleted status. The messages are not deleted from disk, as the delete.retention.ms property is big enough.
So I can see that the segment files are on disk, but for Kafka the earliest message is only 1 day old, even though there are files from 4 or 5 months ago in the directory.
Is there a way to tell Kafka to move the "earliest offset" backwards and make those messages available again, not subject to a possible flush?
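For reference, the two properties mixed up here do different things; a sketch of the per-topic settings (the values are illustrative, not from the question):

```properties
# How long closed log segments are kept before becoming eligible
# for deletion (applies with cleanup.policy=delete)
retention.ms=604800000
# How long tombstone (null-value) markers are retained on a
# compacted topic (cleanup.policy=compact); it does NOT keep
# deleted segments recoverable
delete.retention.ms=86400000
```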

Related

kafka topic with compact,delete does not remove old data

I have a Kafka changelog topic with a cleanup policy of COMPACT_DELETE.
It has a retention.ms of 86520000, which is approximately 1 day.
However, I've observed that partitions in this topic hold data that is over a month old.
I would expect that, since DELETE is part of the cleanup policy, there should be no messages older than 1 day in any of the partitions of this topic.
The major problem is that this topic is constantly growing and never settling down, which is causing disk issues on the Kafka broker side.
I'd like to understand why retention.ms isn't kicking in for COMPACT_DELETE topics.
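One detail worth noting here, as an assumption about this setup rather than a confirmed diagnosis: the active (currently written) segment is never compacted or deleted, so records can linger well past retention.ms until the segment rolls. A sketch of the relevant topic configuration, using the value from the question:

```properties
cleanup.policy=compact,delete
# ~1 day; retention applies only to closed (rolled) segments
retention.ms=86520000
# until the active segment rolls (broker default is 7 days),
# its data cannot be compacted or deleted
segment.ms=604800000
```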

Will Kafka Reuse an old disk for writes after a new disk has been added?

I have a question about using multiple disks per Kafka broker.
Assume that a Kafka broker has 3 disks associated with it.
i) Disk-1 filled up in 5 days.
ii) Disk-2 is nearing 40% usage after the next 3 days.
Now, once log.retention.hours = 168 (7 days) has elapsed, let's say the data on Disk-1 was deleted, so Disk-1 is free again and Disk-2 is 40% used.
Will Kafka reuse Disk-1 for new writes, or will it only write to the other disks, i.e. Disk-2, Disk-3, and so on?
Basically, my question is: will Kafka write to an older disk again if there is enough free space on it due to message deletion after the maximum retention period?
When a partition is created, each broker that is a replica will select a log directory to put data for that partition in. On a broker, data for a specific partition is only stored in that selected log directory.
Log directories are specified in the broker configuration via the log.dirs setting.
If you have multiple log directories, when creating a partition, the log directory with the fewest partitions is picked.
When producing messages to a partition, the data goes into the log directory where that partition is.
In short, the answer to your specific question is "it depends", but hopefully I've described the process clearly enough for you to figure out the answer for your exact situation.
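The selection rule described above can be sketched in Python. This is a simplified model of the behaviour, not Kafka's actual implementation (which also skips offline directories, for example):

```python
# Model: among the configured log.dirs, a new partition goes to the
# directory currently holding the fewest partitions. Note this counts
# partitions, not free bytes -- so a freshly emptied disk with zero
# partitions would be picked for the next new partition.

def pick_log_dir(partition_counts):
    """partition_counts: dict mapping log dir path -> partition count."""
    return min(partition_counts, key=partition_counts.get)

dirs = {"/data/disk1": 0, "/data/disk2": 12, "/data/disk3": 7}
print(pick_log_dir(dirs))  # /data/disk1
```

Note that this choice happens only at partition creation time; existing partitions stay in their directory, and their new messages keep landing on the same disk.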

Kafka: Messages disappearing from topics, largestTime=0

We have messages disappearing from topics on Apache Kafka with versions 2.3, 2.4.0, 2.4.1 and 2.5.0. We noticed this when we perform a rolling deployment of our clusters, and unfortunately it doesn't happen every time, so it's very inconsistent.
Sometimes we lose all messages inside a topic, other times we lose all messages inside a partition. When this happens the following log is a constant:
[2020-04-27 10:36:40,386] INFO [Log partition=test-lost-messages-5, dir=/var/kafkadata/data01/data] Deleting segments List(LogSegment(baseOffset=6, size=728, lastModifiedTime=1587978859000, largestTime=0)) (kafka.log.Log)
There is also an earlier log line saying this segment breached the retention time of 24 hours. In this example, the message was produced ~12 minutes before the deployment.
Notice that all messages that are wrongly deleted have largestTime=0, while the ones that are properly deleted have a valid timestamp there. From what we read in the documentation and code, it looks like largestTime is used to calculate whether a given segment has reached the time breach or not.
Since we can observe this in multiple versions of Kafka, we think this might be related to something external to Kafka, e.g. ZooKeeper.
Does anyone have any ideas of why this could be happening? We are using Zookeeper 3.6.0.
We found out that the cause was not related to Kafka itself but to the volume where we stored the logs. Still, the following explanation might be useful for educational purposes:
In detail, it was a permission problem where Kafka was not able to read the .timeindex files when the log cleaner was triggered. This caused largestTime to be 0 and lead to some messages being deleted way before the retention time.
Each topic partition is divided into several segments, which are stored in different .log files containing the actual messages. For each .log file there is a .timeindex file containing a map between offset and lastModifiedTime.
When Kafka needs to check whether a segment is deletable, it looks up the lastModifiedTime of the most recent offset and stores it as largestTime. Then it checks whether the retention limit has been reached: currentTime - largestTime > retentionTime.
If so, it deletes the segment and the respective messages.
Since Kafka was not able to read the file, largestTime was 0 and the check currentTime - 0 > retentionTime was always true for our 1-day retention.
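The failure mode described above can be sketched as a tiny Python model of the retention check (a simplification of the real logic, for illustration only):

```python
# A segment is deletable when current_time - largest_time > retention_ms.
# When the .timeindex cannot be read, largestTime falls back to 0, and the
# check degenerates to current_time > retention_ms -- which is true for a
# 1-day retention on any realistic wall clock.

def segment_deletable(current_time_ms, largest_time_ms, retention_ms):
    return current_time_ms - largest_time_ms > retention_ms

now = 1_588_000_000_000          # some wall-clock epoch time in ms
one_day = 86_400_000

# Healthy segment written 12 minutes ago: kept
print(segment_deletable(now, now - 12 * 60 * 1000, one_day))  # False
# Segment whose .timeindex could not be read (largestTime=0): deleted
print(segment_deletable(now, 0, one_day))                     # True
```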
Ensure the date is synced between all Kafka brokers and ZooKeeper nodes: run the Bash command date on each node and compare the year, day, hour and minute.

Kafka Log Retention does not work for topic

For a Kafka topic I set segment.ms and retention.ms to 86400000 ms (1 day).
With this topic config, my assumption is that Kafka will roll a log segment after 1 day and also delete it, because retention.ms is also set to one day.
However, nothing happens. The segments are not rolled and therefore not deleted; at least I see nothing in the server logs, and the free space is shrinking continuously.
Mysteriously, if I set segment.ms and retention.ms to a smaller value, for example 1800000 ms (30 minutes), everything works perfectly.
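A point worth keeping in mind when reasoning about this (general Kafka behaviour, not a confirmed diagnosis of this case): retention only applies to closed segments, so a record's on-disk lifetime is bounded by roughly segment.ms plus retention.ms, not retention.ms alone. A trivial sketch of that arithmetic:

```python
# With time-based rolling, a record first waits up to segment.ms for its
# segment to roll, and only then starts counting against retention.ms.

def max_lifetime_ms(segment_ms, retention_ms):
    return segment_ms + retention_ms

one_day = 86_400_000
print(max_lifetime_ms(one_day, one_day) / 3_600_000)  # 48.0 (hours)
```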

kafka : How to delete data which already been consumed by consumer?

In server.properties I set
log.retention.minutes=8
to clean the data under kafka-logs/ automatically every 8 minutes.
Is it possible to have the cleaner only clean up data that has already been consumed, so that data not yet consumed by any consumer is retained?
Thanks!
No. Kafka messages are appended to log files which roll over every x hours or when they reach a certain size (depending on configuration). Once rolled over, those files are immutable (you cannot delete individual records). Log files are cleaned up when the last write access to a file exceeds the retention time.
In other words: the retention time is the time a message is kept at least. It is possible for a message with retention time of minutes to last for weeks (depending on other configuration settings).
The concept of "consumer offsets" is the mechanism Kafka uses to avoid reconsumption of messages. Kafka 0.11 will also include exactly-once capabilities.
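The relationship described in this answer can be sketched with a toy model (not Kafka's API, just an illustration of the concept): consuming never removes records; each consumer group merely advances its own committed offset, while whole segments age out by retention independently.

```python
# Toy model: an immutable partition log plus per-group committed offsets.
log = ["m0", "m1", "m2", "m3"]        # records are append-only, immutable
committed = {"group-a": 0}            # consumer group -> next offset

def poll(group, max_records=2):
    """Return up to max_records from the group's position and commit."""
    start = committed[group]
    records = log[start:start + max_records]
    committed[group] = start + len(records)
    return records

print(poll("group-a"))  # ['m0', 'm1']
print(poll("group-a"))  # ['m2', 'm3']
# The records are still in the log; only the group's offset moved.
print(log)              # ['m0', 'm1', 'm2', 'm3']
```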