Kafka get timestamp based on offset and partition for monitoring time lag - apache-kafka

I have code in place to find offsets and TopicPartition from the KafkaConsumer, but can't find a way to just retrieve the timestamp based on that information.
I have looked at ConsumerRecord, but since this is a monitoring service I do not think I should .poll(), as I might cause some records to fall through if my monitoring service polls Kafka directly.
I know the kafka-console-consumer CLI can fetch the timestamp of a message based on partition and offset, but I'm not sure whether there is an SDK available for that.
Does anyone have any insights or readings I can go through to try to get time lag? I have been trying to find an SDK or any type of API that can do this.

There is no other way (as of Kafka 3.1) to do this; you have to call consumer.poll(). If you only want to access a single record, set max.poll.records to 1 so you don't waste effort. A consumer can basically be treated as an accessor to a remote record array: what you are doing is just accessing record[offset] and reading that record's timestamp.
So to sum it up:
get timestamp out of offset -> seek + poll with max.poll.records=1 (see the sketch below),
get offset out of timestamp -> offsetsForTimes.
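A minimal sketch of the seek-then-poll approach, assuming a broker at localhost:9092; the topic name, partition, and offset are placeholders:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class OffsetTimestampLookup {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("max.poll.records", "1");       // fetch only the record we seek to
        props.put("enable.auto.commit", "false"); // monitoring only: never commit offsets

        TopicPartition tp = new TopicPartition("my-topic", 0); // placeholder topic/partition
        long offset = 42L;                                     // placeholder offset to inspect

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp)); // manual assignment, no group membership
            consumer.seek(tp, offset);
            ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<byte[], byte[]> record : records) {
                System.out.printf("offset=%d timestamp=%d type=%s%n",
                        record.offset(), record.timestamp(), record.timestampType());
            }
        }
    }
}

Because the consumer uses assign() rather than subscribe() and never commits offsets, it does not join a consumer group, so polling here cannot make records "fall through" for the real consumers.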

If I understood your question: given a ConsumerRecord, from Kafka 0.10+ all records have a .timestamp() method.
Alternatively, given a topic, a (list of) partition(s), and offset(s), you'd need to seek a consumer with max.poll.records=1, then extract the timestamp from each polled partition after the seeked position.
The Confluent Monitoring Interceptors already do something very similar to what you're asking, but for Control Center.

Related

How to Retrieve Kafka message fast based on key?

I have a scenario where I need to test a Kafka message when a transaction is completed. How can I retrieve the message quickly using Java? I know the first 10 digits of the key, which are unique.
Currently I am reading every partition and offset of the relevant topic, which is not efficient (the worst case takes 2 minutes to find a key).
This is not really possible with Kafka: each Kafka partition is an append-only log that uses an offset to specify a position, and the key is not used when reading the partition.
The only way to "seek" to a specific message in a partition is through its offset, so instead of reading the whole partition, if you know the message is roughly from one hour ago (or another timeframe) you can consume just that slice of the log.
See this answer on how to initialize a consumer at a specific offset based on a timestamp in Java; a sketch of that approach follows below.
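A hedged sketch of that idea, combining offsetsForTimes() with a key-prefix filter; the broker address, topic name, partition, and key prefix are all placeholders:

import java.time.Duration;
import java.time.Instant;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

public class KeySearchByTime {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        String keyPrefix = "1234567890"; // placeholder: the known 10-digit key prefix
        long searchFrom = Instant.now().minusSeconds(3600).toEpochMilli(); // roughly one hour ago

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("transactions", 0); // placeholder topic/partition
            consumer.assign(Collections.singletonList(tp));

            // Map the timestamp to the earliest offset at or after it, then seek there.
            Map<TopicPartition, Long> query = new HashMap<>();
            query.put(tp, searchFrom);
            OffsetAndTimestamp start = consumer.offsetsForTimes(query).get(tp);
            consumer.seek(tp, start != null ? start.offset() : 0L);

            // Scan forward and filter by key prefix instead of reading the whole partition.
            // A real search would keep polling until past the target time range.
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(5))) {
                if (record.key() != null && record.key().startsWith(keyPrefix)) {
                    System.out.println("found: " + record);
                }
            }
        }
    }
}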

Options for getting a specific item's produce and consume time stamps?

Suppose I'm debugging an issue involving a single specific message that was produced and consumed. And I want to know when this message was produced and when it was consumed. What are my options for getting this info?
I guess when I construct a message I could include within it the current time. And when my consumer gets a message, it could write out a log entry.
But suppose I have many producer and consumer classes, and none of the code is doing these things. Is there something already existing in Kafka that could support finding this info about a specific message without having to touch the implementation of these producers and consumers, something like the __consumer_offsets topic?
Kafka has built-in timestamp support for messages, and this timestamp can be accessed via the timestamp method of ConsumerRecord (link).
It can be configured with a broker config (log.message.timestamp.type) or a topic-level config (message.timestamp.type); see the sketch below. Its default value is CreateTime; you can also set it to LogAppendTime.
CreateTime: the timestamp is assigned when the producer record is created (before sending).
LogAppendTime: the broker will override the timestamp with its current local time and append the message to the log.
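For example, the topic-level config can be set with the Java AdminClient; a minimal sketch, assuming a broker at localhost:9092 and a Kafka 2.3+ cluster (the topic name is a placeholder):

import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetTimestampType {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker

        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic"); // placeholder
            AlterConfigOp op = new AlterConfigOp(
                    new ConfigEntry("message.timestamp.type", "LogAppendTime"),
                    AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> configs =
                    Collections.singletonMap(topic, Collections.singleton(op));
            admin.incrementalAlterConfigs(configs).all().get(); // block until the change is applied
        }
    }
}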
IMHO, for the consume timestamp your only option is to get the current system time after message processing is finished.
For more information about timestamp you can check this.
When it comes to consumption, there is no explicit way of specifying when a message was consumed (also bear in mind that a single message can be consumed multiple times, e.g. by consumers in different consumer groups).
However, there are a few possible ways to track it on your own:
register the timestamp after the record has been received, i.e. after the .poll(...) call returns (see the sketch after this list),
if using consumer groups, monitor the consumer group's offsets or read the values in __consumer_offsets (that would require you to deserialize the internal format; see this answer for details, and bear in mind that the timestamps of records in this topic correspond to the consumers' commit timestamps, so they need to commit often enough to provide the right granularity),
log compaction + a custom implementation: send a message with the same key and a value that marks the consumption timestamp (however, the message might still be re-read before the compaction happens).
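A minimal sketch of the first option, assuming an already-configured consumer with String keys and values:

import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumeTimeTracker {
    // Records the wall-clock time right after poll() returns and compares it
    // to each record's produce timestamp.
    static void pollOnce(KafkaConsumer<String, String> consumer) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        long consumedAt = System.currentTimeMillis(); // consume timestamp for this batch
        for (ConsumerRecord<String, String> record : records) {
            System.out.printf("offset=%d producedAt=%d consumedAt=%d deltaMs=%d%n",
                    record.offset(), record.timestamp(), consumedAt,
                    consumedAt - record.timestamp());
        }
    }
}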

Is it possible to filter Apache Kafka messages by retention time?

From an abstract point of view, Apache Kafka stores data in topics, and this data can be read by a consumer.
I'd like to have a (monitoring) consumer that detects records of a certain age. The monitor should send a warning to subsystems that records are still unread and will be discarded by Kafka once they reach their retention time.
I couldn't find a suitable way to do this until now.
You can use KafkaConsumer.offsetsForTimes() to map messages to dates.
For example, if you call it with the date of yesterday and it returns offset X, then any messages with an offset smaller than X are older than yesterday.
Then your logic can figure out, from the current positions of your consumers, whether you are at risk of having unprocessed records discarded (see the sketch below).
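A hedged sketch of such a check, assuming a Kafka 2.4+ client configured with the monitored group's group.id so that committed() returns its offsets; the retention and margin values are placeholders:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

public class RetentionLagMonitor {
    // Warns when the group's committed offset is behind the first offset that is
    // "safe", i.e. young enough not to expire within the warning margin.
    static void checkPartition(KafkaConsumer<?, ?> consumer, TopicPartition tp,
                               long retentionMs, long warningMarginMs) {
        long threshold = System.currentTimeMillis() - retentionMs + warningMarginMs;

        // Earliest offset whose record is still younger than the threshold.
        Map<TopicPartition, Long> query = new HashMap<>();
        query.put(tp, threshold);
        OffsetAndTimestamp boundary = consumer.offsetsForTimes(query).get(tp);

        OffsetAndMetadata committed = consumer.committed(Collections.singleton(tp)).get(tp);
        if (boundary != null && committed != null && committed.offset() < boundary.offset()) {
            System.out.println("WARNING: unread records on " + tp
                    + " will expire within the warning margin");
        }
    }
}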
Note that there is currently a KIP under discussion to expose metrics to track that: https://cwiki.apache.org/confluence/display/KAFKA/KIP-223+-+Add+per-topic+min+lead+and+per-partition+lead+metrics+to+KafkaConsumer
http://kafka.apache.org/10/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#offsetsForTimes-java.util.Map-

How to configure kafka such that we have an option to read from the earliest, latest and also from any given offset?

I know about configuring Kafka to read from the earliest or the latest message.
How do we include an additional option for the case where I need to read from a previous offset?
The reason I need to do this is that some earlier messages need to be processed again, due to a mistake in the original processing logic.
In the Java Kafka client, the consumer has methods that can be used to specify the next consume position.
public void seek(TopicPartition partition, long offset)
Overrides the fetch offsets that the consumer will use on the next poll(timeout). If this API is invoked for the same partition more than once, the latest offset will be used on the next poll(). Note that you may lose data if this API is arbitrarily used in the middle of consumption to reset the fetch offsets.
This is enough, and there are also seekToBeginning and seekToEnd (see the sketch below).
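A minimal sketch of rewinding an already-created consumer; the topic, partition, and offset are placeholders:

import java.util.Collections;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class Rewind {
    static void rewind(KafkaConsumer<String, String> consumer) {
        TopicPartition tp = new TopicPartition("my-topic", 0); // placeholder
        consumer.assign(Collections.singletonList(tp));

        consumer.seek(tp, 1000L); // resume from a specific earlier offset
        // consumer.seekToBeginning(Collections.singletonList(tp)); // or from the earliest offset
        // consumer.seekToEnd(Collections.singletonList(tp));       // or from the latest offset
        // The next poll() will fetch from the chosen position.
    }
}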
I'm trying to answer a similar, but not quite the same, question, so let's see if my information may help you.
First, I have been working from this other SO question/answer.
In short, you want to commit your offsets, and the most common solution for that is ZooKeeper. So if your consumer encounters an error or needs to shut down, it can resume where it left off.
Myself, I'm working with a high-volume stream that is extremely large, and my consumer (for a test) needs to start from the very tail each time. The documentation indicates I must use KafkaConsumer.seek() to declare my starting point.
I'll try to update my findings here once they are successful and reliable. For sure this is a solved problem.

Retrieve Timestamp based data from Kafka

How can I get messages or data from the Kafka cluster for a specified day, for example 13 September? Can anyone provide me with code for this? I have googled it and found only theory, but I want the code.
There is no access method for this. Also, before Kafka v0.10, messages did not contain any timestamp information; thus, it was impossible to know when a message was written into a topic.
As of Kafka v0.10, each message contains a metadata timestamp attribute that is either set by the producer at message creation time or by the broker at message insertion time. A time-based index is planned, but not available yet. Thus, you need to consume the whole topic and check the timestamp field (and ignore all messages you are not interested in). To find the beginning, you could also do a binary search with regard to offsets and timestamps to find the first message faster.
Update:
Kafka 0.10.1 adds a time-based index. It allows you to seek to the first record with a timestamp equal to or larger than a given timestamp. You can use it via KafkaConsumer#offsetsForTimes(). This returns the corresponding offsets, which you can feed into KafkaConsumer#seek(). You can then just consume the data and check the record's timestamp field via ConsumerRecord#timestamp() to see when you can stop processing.
Note that data is strictly ordered by offset but not by timestamp. Thus, during processing you might get "late" records with a smaller timestamp than your start timestamp (you could simply skip over those records).
A more difficult problem is late-arriving records at the end of your search interval. After you see the first record with a timestamp larger than your search interval, there might still be records with timestamps that belong to your search interval later on (if those records got appended to the topic "late"). There is no way to know that, though. Thus, you might want to keep reading "some more" records and check whether there are "late" records among them. How many "some more" records means is a design decision you need to make yourself.
There is no general guideline, though; if you have additional knowledge about your "write pattern", it can help to define a good strategy for how many records to consume after your search interval "ended". There are two default strategies: (1) stop at the very first record with a timestamp larger than your search interval (effectively ignoring any late-arriving records; if you use the "log append time" configuration, this is a safe strategy); (2) read to the end of the log, which is the safest strategy with regard to completeness but might result in prohibitive overhead (also note that records can be appended at any time, and if a record's "delay" can be arbitrarily large, a late record might even be appended after you reach end-of-log).
In practice, it might be a good idea to think about a "max expected delay" and read until you get a record with a timestamp larger than this upper delay bound (a sketch follows below).
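A hedged sketch of that approach, assuming Kafka 0.10.1+; the topic, day boundaries, and delay bound are placeholders:

import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

public class DayRangeReader {
    static void readDay(KafkaConsumer<String, String> consumer, TopicPartition tp,
                        long dayStartMs, long dayEndMs, long maxExpectedDelayMs) {
        consumer.assign(Collections.singletonList(tp));

        // Seek to the first offset with a timestamp >= start of the day.
        Map<TopicPartition, Long> query = new HashMap<>();
        query.put(tp, dayStartMs);
        OffsetAndTimestamp start = consumer.offsetsForTimes(query).get(tp);
        if (start == null) return; // no records at or after the day start
        consumer.seek(tp, start.offset());

        boolean done = false;
        while (!done) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            if (records.isEmpty()) break; // simplification: treat an empty poll as end-of-log
            for (ConsumerRecord<String, String> record : records) {
                if (record.timestamp() >= dayStartMs && record.timestamp() < dayEndMs) {
                    System.out.println(record); // record is inside the search interval
                }
                // Keep reading past the interval up to the "max expected delay"
                // bound, to catch late-arriving records.
                if (record.timestamp() >= dayEndMs + maxExpectedDelayMs) {
                    done = true;
                    break;
                }
            }
        }
    }
}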
Add --property print.timestamp=true to the console-consumer command. That will print the timestamp, e.g. CreateTime:1609917197764.
Example:
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic topicName --property print.timestamp=true --from-beginning
Getting data for a specific day from Kafka is NOT efficient, because the data is stored linearly on each broker's storage system. Therefore, even if you have a timestamp inside each of your messages, or use the Kafka message metadata, which can contain a timestamp in later Kafka versions (>= 0.10), you still have to scan the entire topic on each partition to get the data, because the data inside Kafka is indexed only by offset, not by date.
Remember, Kafka is a queue, NOT a database. If you want this date-based retrieval pattern, you might want to consider storing Kafka messages in another, more suitable database system and using the timestamp as your index.
I am new to Kafka and the solution looks hacky to me, but I would like to add at least one solution for this question:
In my case I use kafka-python==2.0.2.
This code reads all messages starting from April 5, 2022.
You can find a "till offset" in the same fashion.
from kafka import KafkaConsumer, TopicPartition

TOPIC = 'test'
FROM_TIMESTAMP = 1649152610480  # April 5, 2022 (milliseconds since the epoch)

consumer = KafkaConsumer(bootstrap_servers='localhost:9092')

# Assign all partitions of the topic manually, so that seek() can be
# called before the first poll (seek requires assigned partitions).
partitions = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]
consumer.assign(partitions)

# Seek each partition to the first offset with a timestamp >= FROM_TIMESTAMP.
for tp in partitions:
    offset_position = consumer.offsets_for_times({tp: FROM_TIMESTAMP})[tp]
    if offset_position:
        offset = offset_position.offset
    else:
        # No message at or after the timestamp: start from the end of the partition.
        offset = consumer.end_offsets([tp])[tp]
    consumer.seek(tp, offset)

for msg in consumer:
    print(msg)