Message created time in Kafka

When I consume a message in Kafka, how can I know when this message was produced/created?
There is no created method in MessageAndMetadata.
And how do I set the offset in Kafka when necessary? I would prefer to set it in code or from the command line.
Thanks.

There is currently no per-message timestamping in Apache Kafka 0.7 or 0.8, but it would be handy.
What exists is a time-based offset-lookup in the Offset API:
Time - Used to ask for all messages before a certain time (ms). There are two special values. Specify -1 to receive the latest offset (i.e. the offset of the next coming message) and -2 to receive the earliest available offset. Note that because offsets are pulled in descending order, asking for the earliest offset will always return you a single element.
https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-OffsetAPI
It relies on the filesystem ctime of the topic+partition segment files on the broker, so it is a bit crude (depending on segment rotation intervals, etc.) and, in the case of replication, may be outright incorrect (reflecting when the segment was replicated rather than when it was originally created).
The solution is to embed the message creation time in the message itself.
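A minimal sketch of that workaround, shown with the modern Java producer for illustration (the topic name events and the <millis>|<body> encoding are arbitrary choices of this sketch, not anything Kafka prescribes):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class TimestampedProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Embed the creation time in the payload itself: "<epoch-millis>|<body>"
                String value = System.currentTimeMillis() + "|" + "some event payload";
                producer.send(new ProducerRecord<>("events", value));
            }
        }
    }

On the consumer side, Long.parseLong(value.substring(0, value.indexOf('|'))) recovers the creation time.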

Related

Is there a common offset value that spans across Kafka partitions?

I am experimenting with Kafka as an SSE holder on the server side, and I want "replay capability". Say each Kafka topic is of the form events.<username> and has a retention policy that deletes items older than some time X.
Now what I want is an API that looks like
GET /events/offset=n
offset would be the last offset processed by the client. If it is not specified, it defaults to the latest offset + 1, which means no new results. It can also be earliest, which represents the earliest available entry. The offset needs to actually exist, as a security-through-obscurity check.
My suspicion is that for this to work correctly, the topic must remain in ONE partition and cannot scale horizontally. However, because the topics are tied to a username, distribution across brokers is handled by the fact that the topics are different.
If you want to retain event sequence for each of the per-user topics, then yes, you have to use one partition per user only. Kafka cannot guarantee message delivery order with multiple partitions.
The earliest and latest options you mention are already supported by any basic Kafka consumer configuration. The specific-offset case you would have to handle manually: issue a request for the given offset, and return nothing if the first message you receive does not match the requested offset.
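A sketch of that manual check with the Java consumer, assuming one partition per events.<username> topic as discussed above (the replay() helper and the auto.offset.reset=earliest setting are assumptions of this sketch):

    import java.util.Collections;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class ReplayEndpoint {
        // Returns records from `offset` onward, or null if that offset is no
        // longer stored (the security-through-obscurity check). With
        // auto.offset.reset=earliest, seeking to a deleted offset makes Kafka
        // reset to the oldest stored one, which the offset comparison detects.
        static ConsumerRecords<String, String> replay(
                KafkaConsumer<String, String> consumer, String username, long offset) {
            TopicPartition tp = new TopicPartition("events." + username, 0);
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, offset);
            ConsumerRecords<String, String> records = consumer.poll(1000);
            if (!records.isEmpty() && records.iterator().next().offset() != offset) {
                return null; // requested offset was deleted or never existed
            }
            return records;
        }
    }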

Reading messages for specific timestamp in kafka

I want to read all the messages starting from a specific time in Kafka.
Say I want to read all messages between 0600 and 0800.
Request messages between two timestamps from Kafka
suggests using offsetsForTimes as the solution.
The problem with that solution is:
Say my consumer is switched on every day at 1300. The consumer would not have read any messages that day, which effectively means no offset was committed at or after 0600, so offsetsForTimes(<partitionname>, <0600 for that day in millis>) will return null.
Is there any way I can read a message that was published to the Kafka queue at a certain time, irrespective of offsets?
offsetsForTimes() returns the offsets of messages that were produced at the requested time. It works regardless of whether any offsets were committed, because the offsets are fetched directly from the partition logs.
So yes, you should use this method to find the first offset produced after 0600, seek to that position, and consume messages until you reach 0800.
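A sketch of that with the Java consumer (requires brokers and clients at 0.10.1 or later, where offsetsForTimes() was introduced; the topic name and the simplistic end-of-log check are assumptions):

    import java.time.LocalDate;
    import java.time.ZoneId;
    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
    import org.apache.kafka.common.TopicPartition;

    public class TimeRangeReader {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("enable.auto.commit", "false");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            ZoneId zone = ZoneId.systemDefault();
            long start = LocalDate.now().atTime(6, 0).atZone(zone).toInstant().toEpochMilli();
            long end = LocalDate.now().atTime(8, 0).atZone(zone).toInstant().toEpochMilli();

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                TopicPartition tp = new TopicPartition("my-topic", 0);
                consumer.assign(Collections.singletonList(tp));

                // First offset with a timestamp >= 0600; null if no such message exists.
                Map<TopicPartition, OffsetAndTimestamp> offsets =
                        consumer.offsetsForTimes(Collections.singletonMap(tp, start));
                OffsetAndTimestamp startPos = offsets.get(tp);
                if (startPos == null) return;

                consumer.seek(tp, startPos.offset());
                while (true) {
                    boolean sawAny = false;
                    for (ConsumerRecord<String, String> record : consumer.poll(1000)) {
                        if (record.timestamp() > end) return; // past 0800, stop
                        System.out.printf("%d %s%n", record.timestamp(), record.value());
                        sawAny = true;
                    }
                    if (!sawAny) return; // simplistic: treat an empty poll as end of log
                }
            }
        }
    }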

How to get current offset in Kafka 0.10.x without shell?

Is there any way to get the current offset in Kafka 0.10.x? I do not want to use shell commands.
When I use the consumer.endOffsets(...) API, I can get the last offsets (the log size). However, consumer.position(...) does not give me the current offset!
All in all, I want to get the current offset, the log size, and the lag for one partition.
You can use KafkaConsumer#committed() to get the latest committed position. Thus, if you disable auto-commit and commit manually, you can compute the exact lag each time you commit.
On the other hand, each record you process provides its offset via ConsumerRecord#offset(), so you can also compute the lag after reading a single record (for that record's partition).
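A sketch combining the two calls for one partition (the topic partition is a placeholder; endOffsets() needs 0.10.1+):

    import java.util.Collections;
    import java.util.Map;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    public class LagCheck {
        // Lag = log-end offset ("log size") minus the last committed offset.
        static long lag(KafkaConsumer<String, String> consumer, TopicPartition tp) {
            Map<TopicPartition, Long> ends =
                    consumer.endOffsets(Collections.singletonList(tp));
            long logEnd = ends.get(tp);
            OffsetAndMetadata committed = consumer.committed(tp); // null if never committed
            long current = (committed == null) ? 0L : committed.offset();
            return logEnd - current;
        }
    }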

Can I retrieve the latest available offset for a Kafka partition without retrieving all the messages?

Looking at the latest (v0.10) Kafka Consumer documentation:
"The position of the consumer gives the offset of the next record that will be given out. It will be one larger than the highest offset the consumer has seen in that partition. It automatically advances every time the consumer receives data calls poll(long) and receives messages."
Is there a way to query for the largest offset available for the partition on the server side, without retrieving all the messages?
The logic I am trying to implement is as follows:
query every second for the amount (A) of pending messages in a topic
if A > threshold, wake up a processor that would go ahead retrieving all the messages, and processing them
otherwise do nothing (sleep 1)
The motivation is that I need to do some batch processing, but I want the processor to wake up only when there is enough data (and I don't want to retrieve all the data twice).
You can use the Consumer.seekToEnd() method, then Consumer.poll(0) to make the seek take effect while returning immediately, and finally Consumer.position() to find the positions of all subscribed (or assigned) topic partitions. These will be the current end offsets for all partitions. This also starts fetching some data from the brokers at those offsets, but any returned data is ignored if you subsequently seek back to a different position.
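In code, roughly (a sketch with the 0.10 Java consumer; lastProcessedOffset is a placeholder, and position() is what forces the lazy seekToEnd() to resolve):

    import java.util.Collections;
    import java.util.List;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class PendingCheck {
        // How many messages sit between lastProcessedOffset and the end of the
        // log, without actually consuming any of them.
        static long pending(KafkaConsumer<String, String> consumer,
                            TopicPartition tp, long lastProcessedOffset) {
            List<TopicPartition> tps = Collections.singletonList(tp);
            consumer.assign(tps);
            consumer.seekToEnd(tps);                // lazy: resolved on the next position()/poll()
            long endOffset = consumer.position(tp); // forces the seek, returns the end offset
            consumer.seek(tp, lastProcessedOffset); // rewind so a later poll() resumes here
            return endOffset - lastProcessedOffset;
        }
    }

If the returned count exceeds your threshold, wake the processor and poll from lastProcessedOffset as usual.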
Currently the alternative, as mentioned by serejja, is to use the old simple consumer, although the process is quite a bit more complicated as you need to manually find the leader for each partition.
Sadly, I don't see how this is possible with the 0.10 consumer.
However, it is doable if you have a lower-level Kafka client (sorry, but I'm not sure one exists for the JVM; there are plenty of them for other languages).
So if you have some time and inspiration to implement this, here's the way to go: every FetchResponse (the response to each "give me messages" request) contains a field called HighwaterMarkOffset, which is essentially the offset at the end of the partition (https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-FetchResponse). The trick is to send a FetchRequest that returns immediately (i.e. does not block waiting) with nothing but the HighwaterMarkOffset.
To do so your FetchRequest should have:
MaxWaitTime set to 0, which means "return immediately if you cannot fetch at least MinBytes bytes".
MinBytes set to 0, which means "I'm OK if you return me an empty response".
FetchOffset doesn't matter in this case; if I'm not mistaken, it might even be an invalid offset, but it is probably better to use a valid one.
MaxBytes set to 0, which means "give me no more than 0 bytes of data", i.e. nothing.
This way the request returns immediately with no data, but still with the high watermark offset set to the proper value. Once you have the high watermark offset, you can compare it to your current offset and figure out how far behind you are.
Hope this helps.
You can use the method public OffsetAndMetadata committed(TopicPartition partition) from the API below to get the last committed offset:
https://kafka.apache.org/0100/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html

Kafka Topic Partition

Does a Kafka topic partition's offset position always start from 0, or can it be a random value? And how can I ensure that a consumer record is the first record in the partition? Is there any way to find out? If so, please let me know. Thanks.
Yes and no.
When you start a new topic, the offsets start at zero. Depending on the Kafka version you are using, the offsets are
logical, incremented message by message (since 0.8.0: https://issues.apache.org/jira/browse/KAFKA-506), or
physical, i.e., increased by the number of bytes of each message.
Furthermore, old log entries are cleared by configurable conditions:
retention time: e.g., keep only the messages of the last week
retention size: e.g., use at most 10GB of storage; delete old messages that can no longer be stored
log-compaction (since 0.8.1): you only preserve the latest value for each key (see https://cwiki.apache.org/confluence/display/KAFKA/Log+Compaction)
Thus, the first offset might not be zero if old messages got deleted. Furthermore, if you turn on log-compaction, some offsets might be missing.
In any case, you can always seek to any offset safely, as Kafka can figure out whether the offset is valid. For an invalid offset, it automatically advances to the next valid offset. Thus, if you seek to offset zero, you will always get the oldest message that is stored.
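To answer the original question concretely, a sketch with a current Java consumer that finds the oldest stored offset and checks whether the first record read is indeed the first one still in the partition (the topic partition is a placeholder):

    import java.util.Collections;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class FirstRecordCheck {
        static void check(KafkaConsumer<String, String> consumer, TopicPartition tp) {
            consumer.assign(Collections.singletonList(tp));
            consumer.seekToBeginning(Collections.singletonList(tp));
            long earliest = consumer.position(tp); // 0 only if nothing was deleted yet
            ConsumerRecords<String, String> records = consumer.poll(1000);
            if (!records.isEmpty()) {
                ConsumerRecord<String, String> first = records.iterator().next();
                System.out.println("oldest stored offset = " + earliest
                        + ", first record read at offset " + first.offset());
            }
        }
    }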
Yes, a partition's offsets start from 0. In very old Kafka versions the offset advanced by the byte length of each record, with the next record picking up the offset from there onward; since 0.8.0 it is a logical counter that increases by one per message.
And since Kafka is distributed, consumers are only guaranteed to receive data in order within a single partition, not across partitions.