Retrieve timestamp-based data from Kafka - apache-kafka

How can I get messages or data from the Kafka cluster for a specified day, for example 13 September? Can anyone provide me code for this? I have googled it and found only theory, but I want the code.

There is no access method for this. Also, before Kafka v0.10, messages do not contain any timestamp information, so it is impossible to know when a message was written to a topic.
As of Kafka v0.10, each message contains a metadata timestamp attribute that is either set by the producer at message creation time or by the broker at message insertion time. A time-based index is planned, but not available yet. Thus, you need to consume the whole topic and check the timestamp field (ignoring all messages you are not interested in). To find the beginning faster, you could also do a binary search over offsets and timestamps to locate the first relevant message.
Update:
Kafka 0.10.1 adds a time-based index. It allows you to seek to the first record with a timestamp equal to or larger than a given timestamp. You can use it via KafkaConsumer#offsetsForTimes(). This returns the corresponding offsets, which you can feed into KafkaConsumer#seek(). You then just consume the data and check each record's timestamp field via ConsumerRecord#timestamp() to see when you can stop processing.
Note that data is strictly ordered by offset but not by timestamp. Thus, during processing, you might get "late" records with a timestamp smaller than your start timestamp (you can simply skip over those records).
A more difficult problem is late-arriving records at the end of your search interval. After you get the first record with a timestamp larger than your search interval, there might still be records later on with timestamps that fall within your search interval (if those records were appended to the topic "late"). There is no way to know this, though. Thus, you might want to keep reading "some more" records and check whether there are "late" records among them. How many records "some more" means is a design decision you need to make yourself.
There is no general guideline though -- if you have additional knowledge about your "write pattern", it can help to define a good strategy for how many records to consume after your search interval has "ended". Of course there are two default strategies: (1) stop at the very first record with a timestamp larger than your search interval (effectively ignoring any late-arriving records -- if you use the "log append time" configuration this is of course a safe strategy); (2) read to the end of the log -- this is the safest strategy with regard to completeness but might result in prohibitive overhead (also note that, since records can be appended at any time and a record's "delay" could be arbitrarily large, a late record might even be appended after you reach end-of-log).
In practice, it might be a good idea to think about a "maximum expected delay" and read until you get a record with a timestamp larger than this upper delay bound.
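A minimal Java sketch of this offsetsForTimes() + seek() approach, reading a time range and then continuing for a "maximum expected delay" past its end. The topic name, timestamps, and delay bound are placeholder assumptions, not values from the question:

import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.*;
import java.util.stream.Collectors;

public class TimeRangeReader {
    public static void main(String[] args) {
        long startTs = 1631491200000L;          // start of the search interval (placeholder)
        long endTs   = 1631577600000L;          // end of the search interval (placeholder)
        long maxExpectedDelayMs = 60_000L;      // "maximum expected delay" for late records (assumption)

        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");  // no group.id, so no auto-commit

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // assign all partitions of the topic explicitly (no consumer group needed here)
            List<TopicPartition> partitions = consumer.partitionsFor("my-topic").stream()
                    .map(pi -> new TopicPartition(pi.topic(), pi.partition()))
                    .collect(Collectors.toList());
            consumer.assign(partitions);

            // look up the first offset with timestamp >= startTs for every partition
            Map<TopicPartition, Long> query = new HashMap<>();
            partitions.forEach(tp -> query.put(tp, startTs));
            Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(query);

            Set<TopicPartition> active = new HashSet<>();
            for (Map.Entry<TopicPartition, OffsetAndTimestamp> e : offsets.entrySet()) {
                if (e.getValue() != null) {
                    consumer.seek(e.getKey(), e.getValue().offset());
                    active.add(e.getKey());
                }
            }
            // partitions with no record at/after startTs are not read at all
            consumer.pause(partitions.stream()
                    .filter(tp -> !active.contains(tp))
                    .collect(Collectors.toList()));

            boolean done = false;
            while (!done && !active.isEmpty()) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    long ts = r.timestamp();
                    if (ts >= startTs && ts <= endTs) {
                        System.out.printf("%s-%d@%d ts=%d value=%s%n",
                                r.topic(), r.partition(), r.offset(), ts, r.value());
                    }
                    // simplified stop condition: quit once any record passes the delay bound;
                    // a more careful version would track this per partition
                    if (ts > endTs + maxExpectedDelayMs) {
                        done = true;
                    }
                }
            }
        }
    }
}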

Add --property print.timestamp=true to the console consumer command. That will print the timestamp, e.g. CreateTime:1609917197764.
Example:
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic topicName --property print.timestamp=true --from-beginning

Getting data for a specific day from Kafka is NOT efficient, as the data is stored linearly inside Kafka on each broker's storage system. Therefore, even if you have a timestamp inside each of your messages, or use Kafka's message metadata, which can contain a timestamp in later Kafka message versions (>= 0.10), you still have to scan the entire topic on each partition to get the data, because the data inside Kafka is indexed only by offset, not by date.
Remember, Kafka is a queue, NOT a database. If you want this date-based retrieval pattern, you might want to consider storing Kafka messages in another, more suitable database system and using the timestamp as your index.

I am new to Kafka and this solution looks a bit hacky to me, but I would like to add at least one solution for this question:
In my case I use kafka-python==2.0.2
This code reads all messages starting from April 5, 2022.
You can find a 'till offset' in the same fashion.
from kafka import KafkaConsumer, TopicPartition

TOPIC = 'test'
FROM_TIMESTAMP = 1649152610480  # April 5, 2022

consumer = KafkaConsumer(TOPIC)

# seek each partition to the offset that corresponds to FROM_TIMESTAMP
for p in consumer.partitions_for_topic(TOPIC):
    tp = TopicPartition(TOPIC, p)
    start_offset = consumer.beginning_offsets([tp])[tp]
    end_offset = consumer.end_offsets([tp])[tp]

    # first offset whose timestamp is >= FROM_TIMESTAMP (None if no such record)
    for_time = consumer.offsets_for_times({tp: FROM_TIMESTAMP})
    offset_position = for_time[tp]

    # if no record is new enough, seek to the end of the partition
    offset = end_offset
    if offset_position:
        offset = offset_position.offset

    consumer.seek(tp, offset)

for msg in consumer:
    print(msg)

Related

Kafka get timestamp based on offset and partition for monitoring time lag

I have code in place to find offsets and TopicPartition from the KafkaConsumer, but can't find a way to just retrieve the timestamp based on that information.
I have looked through ConsumerRecord, but since this is a monitoring service I don't think I should call .poll(), as I might cause some records to fall through if my monitoring service polls directly from Kafka.
I know there's the kafka-console-consumer CLI, which can fetch the timestamp of a message based on partition and offset, but I'm not sure if there is an SDK available for that.
Does anyone have any insights or readings I can go through to try to get time lag? I have been trying to find an SDK or any type of API that can do this.
There is no other way (as of 3.1) to do this: you have to consumer.poll(). Of course, if you want to access only one record, you should set the max poll records property to 1 so you don't waste effort. A consumer can basically be treated as an accessor to a remote record array; what you are doing is just accessing record[offset] and reading that record's timestamp.
So to sum it up:
get timestamp out of offset -> seek + poll 1,
get offset out of timestamp -> offsetsForTimes.
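A small Java sketch of the seek + poll-one-record approach; the bootstrap server, topic, partition, and offset here are placeholder assumptions:

import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class TimestampForOffset {
    // returns the stored timestamp of the record at the given partition/offset, or -1 if none was fetched
    static long timestampAt(String bootstrap, String topic, int partition, long offset) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1);        // fetch a single record only
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // no group.id, so no auto-commit

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition(topic, partition);
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, offset);                                // jump straight to the offset of interest

            ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<byte[], byte[]> r : records) {
                return r.timestamp();                                 // CreateTime or LogAppendTime, per topic config
            }
            return -1L;
        }
    }

    public static void main(String[] args) {
        System.out.println(timestampAt("localhost:9092", "my-topic", 0, 42L));
    }
}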
If I understood your question, given a ConsumerRecord, from Kafka 0.11+, all records have a .timestamp() method.
Alternatively, given a topic, a (list of) partition(s), and offset(s), you'd need to seek a consumer with max.poll.records=1, then extract the timestamps from each polled partition after the seeked position.
The Confluent Monitoring Interceptors already do something very similar to what you're asking, but for Control Center.

Get kafka record timestamp by partition and offset

There are many threads describing how to get records from Kafka starting from the specified timestamp.
So I think Kafka 'knows' the timestamp for every record it stores.
I need to get the timestamp for the record with a specified partition and offset. Is it possible?
An information system put wrong data into Kafka (an incorrect product id in a client's order) and I need to analyze log files to find out the cause. It would be much easier to do that knowing the timestamp of that record.
org.apache.kafka.clients.consumer.ConsumerRecord has a method called timestamp().
Is this what you are looking for?

Options for getting a specific item's produce and consume time stamps?

Suppose I'm debugging an issue involving a single specific message that was produced and consumed. And I want to know when this message was produced and when it was consumed. What are my options for getting this info?
I guess when I construct a message I could include within it the current time. And when my consumer gets a message, it could write out a log entry.
But suppose I have many producer and consumer classes and none of the code is doing these things. Is there something already existing in Kafka that could support finding out this info about a specific message, without having to touch the implementation of these producers and consumers, something like the __consumer_offsets topic?
Kafka has built-in timestamp support for messages, and this timestamp can be accessed via the timestamp() method of ConsumerRecord.
It can be configured with a broker config (log.message.timestamp.type) or a topic-level config (message.timestamp.type). Its default value is CreateTime. You can also set it to LogAppendTime.
CreateTime: the timestamp is assigned when the producer record is created (before sending).
LogAppendTime: the broker overrides the timestamp with its current local time when it appends the message to the log.
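As a sketch of the topic-level option, the config can also be changed programmatically with the Admin client (the topic name and bootstrap server are assumptions; the equivalent kafka-configs.sh command works just as well):

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class SetTimestampType {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // switch the (assumed) topic from the default CreateTime to broker-assigned LogAppendTime
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
            AlterConfigOp setLogAppendTime = new AlterConfigOp(
                    new ConfigEntry("message.timestamp.type", "LogAppendTime"),
                    AlterConfigOp.OpType.SET);

            Map<ConfigResource, Collection<AlterConfigOp>> changes =
                    Collections.singletonMap(topic, Collections.singletonList(setLogAppendTime));
            admin.incrementalAlterConfigs(changes).all().get();
        }
    }
}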
IMHO, for the consume timestamp your only option is to get the current system time after message processing has finished.
For more information about timestamp you can check this.
When it comes to consumption, there's no explicit way of specifying when a message was consumed (also bear in mind that a single message can be consumed multiple times, e.g. by consumers in different consumer groups).
However there are a few possible ways to track it on your own:
register the timestamps after the record has been received (after the .poll(...) call returns) -- see the sketch after this list,
if using consumer groups, monitor the consumer group's offset or see the value in __consumer_offsets (that would require you to deserialize the internal format -- see this answer for details; bear in mind that the timestamps of records in this topic correspond to consumers' commit timestamps, so they need to commit often enough to provide the right granularity),
log compaction + custom implementation: send a message with the same key and a value that marks the consumption timestamp (however, the message might still be re-read before the compaction happens).
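A minimal sketch of the first option, recording the wall-clock time right after poll() returns and comparing it with the record's produce-side timestamp (the consumer setup and output format are assumptions):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;

public class ConsumeTimeLogger {
    static <K, V> void consumeLoop(KafkaConsumer<K, V> consumer) {
        while (true) {
            ConsumerRecords<K, V> records = consumer.poll(Duration.ofMillis(500));
            long consumedAt = System.currentTimeMillis();  // wall-clock time when this batch arrived
            for (ConsumerRecord<K, V> r : records) {
                // r.timestamp() is the produce-side timestamp (CreateTime or LogAppendTime)
                System.out.printf("%s-%d@%d produced=%d consumed=%d lag=%dms%n",
                        r.topic(), r.partition(), r.offset(),
                        r.timestamp(), consumedAt, consumedAt - r.timestamp());
            }
        }
    }
}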

Use Kafka offsets to calculate written messages statistics

I want to get some statistics from a Kafka topic:
total written messages
total written messages in the last 12 hours, last hour, ...
Can I safely assume that reading the offsets for each partition in a topic for a given timestamp (using getOffsetsByTimes) should give me the number of messages written in that specific time?
I can sum all the offsets for every partitions and then calculate the difference between a timestamp 1 and a timestamp 2. With these data I should be able to calculate a lot of statistics.
Are there situations where these data can give me wrong results? I don't need 100% precision, but I expect a reliable solution. Of course, assuming the topic is not deleted/reset.
Are there other alternatives that don't require third-party tools? (I cannot install other tools easily and I need the data inside my app.)
(using getOffsetsByTimes) should give me the number of messages written in that specific time?
Kafka: The Definitive Guide mentions that getOffsetsByTime is not message-based, it is segment-file based. Meaning the time-index lookup won't read into a segment file; rather, it gets the first segment containing the time you are interested in. (This may have changed in newer Kafka releases since the book was released.)
If you don't need accuracy, this should be fine. Do note that compacted topics don't have sequentially ordered offsets one after the other (there are gaps), so a simple abs(offset#time2 - offset#time1) won't quite work for "total existing messages in a topic".
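A rough Java sketch of the offset-difference idea, looking up offsets for two timestamps with offsetsForTimes() and summing the per-partition differences; the topic name is an assumption, the fallback to endOffsets() covers partitions with no record newer than a timestamp, and offset gaps (compaction, transaction markers) make the result approximate:

import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;

import java.util.*;
import java.util.stream.Collectors;

public class WrittenBetween {
    // approximate number of records written to the topic between ts1 and ts2 (epoch milliseconds)
    static long countBetween(KafkaConsumer<?, ?> consumer, String topic, long ts1, long ts2) {
        List<TopicPartition> partitions = consumer.partitionsFor(topic).stream()
                .map(pi -> new TopicPartition(pi.topic(), pi.partition()))
                .collect(Collectors.toList());

        Map<TopicPartition, Long> q1 = new HashMap<>();
        Map<TopicPartition, Long> q2 = new HashMap<>();
        for (TopicPartition tp : partitions) { q1.put(tp, ts1); q2.put(tp, ts2); }

        Map<TopicPartition, OffsetAndTimestamp> at1 = consumer.offsetsForTimes(q1);
        Map<TopicPartition, OffsetAndTimestamp> at2 = consumer.offsetsForTimes(q2);
        Map<TopicPartition, Long> end = consumer.endOffsets(partitions);  // fallback when no newer record exists

        long total = 0;
        for (TopicPartition tp : partitions) {
            long o1 = at1.get(tp) != null ? at1.get(tp).offset() : end.get(tp);
            long o2 = at2.get(tp) != null ? at2.get(tp).offset() : end.get(tp);
            total += Math.abs(o2 - o1);  // offset gaps make this an approximation, not an exact count
        }
        return total;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            long twelveHoursAgo = System.currentTimeMillis() - 12 * 60 * 60 * 1000L;
            System.out.println(countBetween(consumer, "my-topic", twelveHoursAgo, System.currentTimeMillis()));
        }
    }
}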
Otherwise, plenty of JMX metrics are exposed by the brokers like bytes-in and message rates, which you can aggregate and plot over time using Grafana, for example.

Kafka Streams TimestampExtractor

Hi everybody, I have a question about TimestampExtractor and Kafka Streams.
In our application there is a possibility of receiving out-of-order events, so I'd like to order the events depending on a business date inside the payload instead of the point in time they were placed in the topic.
For this purpose I programmed a custom TimestampExtractor to be able to pull the timestamp from the payload. Everything up to this point worked perfectly, but when I built the KTable on this topic, I noticed that the event I received out-of-order (from the business point of view it is not the last event, but it arrived last) was displayed as the last state of the object, even though the ConsumerRecord had the timestamp from the payload.
I don't know, maybe it was my mistake to assume Kafka Streams would fix this out-of-order problem via the TimestampExtractor.
Then, during debugging, I saw that if the TimestampExtractor returns -1, Kafka Streams ignores the message, and the TimestampExtractor is also given the timestamp of the last accepted event. So I built logic that performs the following check: if (payloadTimestamp < previousTimestamp) return -1, which achieves what I want, but I am not sure whether I am sailing in dangerous waters.
Am I allowed to use logic like this, or what other ways exist to deal with out-of-order events in Kafka Streams?
Thanks for any answers.
Currently (Kafka 2.0), KTables don't consider timestamps when they are updated, because the assumption is that there is no out-of-order data in the input topic. The reason for this assumption is the "single writer principle" -- it's assumed that, for a compacted KTable input topic, there is only one producer per key, and thus there won't be any out-of-order data with regard to single keys.
It's a known issue: https://issues.apache.org/jira/browse/KAFKA-6521
For your fix: it's not 100% correct or safe to do this "hack":
First, assume you have two different messages with two different keys, <key1, value1, 5> and <key2, value2, 3>. The second record, with timestamp 3, is later compared to the first record with timestamp 5. However, the two have different keys, so you actually want to put the second record into the KTable. Only if you have two records with the same key do you want to drop late-arriving data, IMHO.
Second, if you have two records with the same key, the second one is out-of-order, and you crash before processing the second one, the TimestampExtractor loses the timestamp of the first record. Thus, on restart, it would not discard the out-of-order record.
To get this right, you will need to filter "manually" in your application logic instead of relying on the stateless and key-agnostic TimestampExtractor. Instead of reading the data via builder#table(), you can read it as a stream and apply .groupByKey().reduce() to build the KTable. In your Reducer logic, you compare the timestamps of the new and old records and return the record with the larger timestamp, as sketched below.
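A sketch of this groupByKey().reduce() approach against a recent Kafka Streams API; the topic names, the Order value type carrying a business timestamp, and its serde are all assumptions for illustration:

import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;

public class LatestByPayloadTimestamp {
    // hypothetical value type: the payload is assumed to carry its own business timestamp
    public static class Order {
        public long businessTimestamp;
        public String payload;
    }

    public static void build(StreamsBuilder builder, Serde<Order> orderSerde) {
        KTable<String, Order> latest = builder
                .stream("orders", Consumed.with(Serdes.String(), orderSerde))
                .groupByKey(Grouped.with(Serdes.String(), orderSerde))
                .reduce((oldValue, newValue) ->
                        // keep whichever record has the larger business timestamp, so that
                        // late-arriving (out-of-order) updates do not overwrite newer state
                        newValue.businessTimestamp >= oldValue.businessTimestamp ? newValue : oldValue,
                        Materialized.with(Serdes.String(), orderSerde));

        // 'latest' now always holds, per key, the record with the greatest business timestamp
        latest.toStream().to("orders-latest", Produced.with(Serdes.String(), orderSerde));
    }
}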