I have been googling and reading the Kafka documentation, but I couldn't find the maximum value of a consumer offset, or whether the offset wraps around after reaching the max value.
I understand the offset is an Int64 (signed 64-bit) value, so the max value is 0x7FFFFFFFFFFFFFFF (2^63 - 1).
If there is wraparound, how does Kafka handle this situation?
According to this post, the offset is not reset:
We don't roll back offset at this moment. Since the offset is a long, it
can last for a really long time. If you write 1TB a day, you can keep going
for about 4 million days.
Plus, you can always use more partitions (each partition has its own
offset).
So as Luciano said, probably not worth worrying about.
So this is not really "handled". But taking into account that the offset is per partition, it is probably not something we need to worry about :)
Please see http://search-hadoop.com/m/uyzND1uRn8D1sSH322/rollover/v=threaded
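To put the quoted numbers in perspective, here is a rough back-of-the-envelope calculation (my own, not from the thread), assuming a single partition written at one million records per second:

```java
public class OffsetExhaustion {
    public static void main(String[] args) {
        long maxOffset = Long.MAX_VALUE;          // 2^63 - 1; Kafka offsets are a signed 64-bit long
        long recordsPerSecond = 1_000_000L;       // assumed write rate for a single partition
        long secondsPerYear = 365L * 24 * 60 * 60;

        long years = maxOffset / recordsPerSecond / secondsPerYear;
        System.out.println("Years until this partition's offset overflows: " + years); // roughly 292,000
    }
}
```

Even at that rate, a signed 64-bit offset lasts on the order of 290,000 years per partition, so wraparound is not a practical concern.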
Can somebody explain how Kafka's current offset mechanism works from the consumer's point of view? I have a huge topic (several gigabytes), divided into 2 partitions. In some (rare) business cases I need to pick N random records within a partition and read them.
My colleague says that the Kafka consumer does not know anything about offsets; it just receives a bunch of records on every poll(), with the offset attached to each record as meta-information. I.e. the "seek" mechanism supposedly works as follows: the consumer requests records and ignores them until the target offset has been met.
Is that true? In my understanding, such "rewinding" is a waste of consumer resources and network traffic. I think there MUST be a way to point at a specific offset, so that the broker can send the record at that specific offset immediately on poll(), without that kind of "spin-loop" behaviour.
You can seek to a specific offset. But it's the consumer group / offsets topic that stores that information, not the consumer itself.
Hopping around to "random" offsets is indeed not efficient.
The size of the topic doesn't matter.
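To make the answer concrete, here is a minimal sketch of seeking directly to an offset, assuming a broker at localhost:9092, a hypothetical topic "my-topic" with partition 0, and target offset 42. The broker starts serving from exactly that offset, so the consumer does not fetch and discard everything before it:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class SeekExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // manual assignment, no group/commits needed

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("my-topic", 0); // hypothetical topic and partition
            consumer.assign(Collections.singletonList(tp));        // assign manually, no consumer group required
            consumer.seek(tp, 42L);                                 // jump straight to offset 42

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}
```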
I have code in place to find offsets and TopicPartition from the KafkaConsumer, but can't find a way to just retrieve the timestamp based on that information.
I have looked through ConsumerRecord, but since this is a monitoring service I do not think I should call .poll(), as I might cause some records to fall through if my monitoring service is polling Kafka directly.
I know the kafka-console-consumer CLI can fetch the timestamp of a message based on partition and offset, but I'm not sure whether there is an SDK call for that.
Does anyone have any insights or reading I can go through to try to get the time lag? I have been trying to find an SDK or any type of API that can do this.
There is no other way (as of 3.1) to do this; you have to use consumer.poll. Of course, if you only want to access a single record, you should set max.poll.records to 1 so you don't waste effort. A consumer can basically be treated as an accessor to a remote record array: what you are doing is just accessing record[offset] and reading that record's timestamp.
So to sum it up:
get timestamp out of offset -> seek + poll(1),
get offset out of timestamp -> offsetsForTimes.
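A minimal sketch of both lookups, assuming a consumer that already has the target partition assigned and is configured with max.poll.records=1 (the class and method names here are just illustrative):

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.Collections;
import java.util.Map;

public class OffsetTimestampLookup {

    // timestamp out of offset -> seek + poll(1); assumes the partition is already assigned
    // and max.poll.records=1 so only the single target record is fetched
    static long timestampAtOffset(KafkaConsumer<String, String> consumer, TopicPartition tp, long offset) {
        consumer.seek(tp, offset);
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
        for (ConsumerRecord<String, String> record : records) {
            return record.timestamp(); // record timestamp in epoch milliseconds
        }
        throw new IllegalStateException("no record found at offset " + offset);
    }

    // offset out of timestamp -> offsetsForTimes
    static long offsetAtTimestamp(KafkaConsumer<String, String> consumer, TopicPartition tp, long timestampMs) {
        Map<TopicPartition, OffsetAndTimestamp> result =
                consumer.offsetsForTimes(Collections.singletonMap(tp, timestampMs));
        OffsetAndTimestamp oat = result.get(tp);
        return oat != null ? oat.offset() : -1L; // null when no record exists at or after that timestamp
    }
}
```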
If I understood your question: given a ConsumerRecord, from Kafka 0.11+ all records have a .timestamp() method.
Alternatively, given a topic, a (list of) partition(s), and offset(s), you'd need to seek a consumer with max.poll.records=1, then extract the timestamps from each polled partition after the seeked position.
The Confluent Monitoring Interceptors already do something very similar to what you're asking, but for Control Center.
I'm kind of new to Kafka but need to implement logic for a consumer to consume from a particular topic based on timestamp. Another use case is being able to consume a particular time range (for example from 10:00 to 10:20). The range will always be divisible by 5 minutes, meaning I won't need to consume from, for example, 10:00 to 10:04. The logic I was thinking of would be as follows:
create a table where I store timestamp and Kafka messageId (timestamp | id)
create a console service which does the following every 5 minutes:
Get all partitions for a topic
Query all partitions for min offset value (a starting point)
Store the offset and timestamp in the table
Now if everything is alright I should have something like this in the table:
10:00 | 0
10:05 | 100
10:10 | 200
HH:mm | (some number)
Now having this I could start the consumer at any time and knowing the offsets I should be able to consume just what I need.
Does it look right or have I made a flaw somewhere? Or maybe there is a better way of achieving the required result? Any thoughts or suggestions would be highly appreciated.
P.S.: one of my colleagues suggested using partitions and working with each partition separately... Meaning if I have a topic and the replica count is, for example, 5, then I'd need to save offsets 5 times for my topic for every interval (once per partition). And then the consumer would also need to account for the partitions and consume based on the offsets I have for each partition. But that would add extra complexity which I am trying to avoid...
Thanks in advance!
BR,
Mike
No need for tables.
You can use the seek method of a Consumer instance to move each partition to a specific offset.
Partitioning might work... 12 partitions of 5 minute message intervals
I don't think replication addresses your problem.
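Here is a sketch of the no-table approach, using offsetsForTimes to find the starting offset per partition and filtering by record timestamp. The stopping condition and the assumption that the whole time range already exists in the log are simplifications, not a production-ready implementation:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TimeRangeReader {

    // Print every record of `topic` whose timestamp falls in [startMs, endMs).
    static void readRange(KafkaConsumer<String, String> consumer, String topic, long startMs, long endMs) {
        // 1. discover all partitions of the topic and assign them manually
        List<TopicPartition> partitions = new ArrayList<>();
        for (PartitionInfo info : consumer.partitionsFor(topic)) {
            partitions.add(new TopicPartition(topic, info.partition()));
        }
        consumer.assign(partitions);

        // 2. ask the broker for the first offset at or after startMs in every partition
        Map<TopicPartition, Long> query = new HashMap<>();
        partitions.forEach(tp -> query.put(tp, startMs));
        Map<TopicPartition, OffsetAndTimestamp> startOffsets = consumer.offsetsForTimes(query);

        // 3. seek each partition to that offset; pause partitions with nothing at or after startMs
        for (TopicPartition tp : partitions) {
            OffsetAndTimestamp oat = startOffsets.get(tp);
            if (oat != null) {
                consumer.seek(tp, oat.offset());
            } else {
                consumer.pause(Collections.singletonList(tp));
            }
        }

        // 4. consume until every partition is either paused (past endMs) or exhausted
        while (consumer.paused().size() < partitions.size()) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(2));
            if (records.isEmpty()) {
                break; // simplification: assumes the whole range is already in the log
            }
            for (ConsumerRecord<String, String> record : records) {
                if (record.timestamp() >= endMs) {
                    // this partition has moved past the range; stop fetching from it
                    consumer.pause(Collections.singletonList(
                            new TopicPartition(record.topic(), record.partition())));
                } else if (record.timestamp() >= startMs) {
                    System.out.printf("partition=%d offset=%d ts=%d value=%s%n",
                            record.partition(), record.offset(), record.timestamp(), record.value());
                }
            }
        }
    }
}
```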
Does a Kafka topic partition's offset always start from 0, or can it start from some random value? And how can I ensure that a consumer record is the first record in the partition? Is there any way to find out? If there is, please let me know. Thanks.
Yes and no.
When you start a new topic, the offsets start at zero. Depending on the Kafka version you are using, the offsets are
logical – incremented message by message (since 0.8.0: https://issues.apache.org/jira/browse/KAFKA-506) – or
physical – i.e., the offset is increased by the number of bytes of each message.
Furthermore, old log entries are cleared by configurable conditions:
retention time: e.g., keep only the messages of the last week
retention size: e.g., use at most 10GB of storage; delete old messages that cannot be stored any more
log-compaction (since 0.8.1): you only preserve the latest value for each key (see https://cwiki.apache.org/confluence/display/KAFKA/Log+Compaction)
Thus, the first offset might not be zero if old messages got deleted. Furthermore, if you turn on log-compaction, some offsets might be missing.
In any case, you can always seek to any offset safely, as Kafka can figure out whether the offset is valid or not. For an invalid offset, it automatically advances to the next valid offset. Thus, if you seek to offset zero, you will always get the oldest message that is stored.
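To answer the original question of identifying the first record currently in a partition, a small sketch (topic and partition names are placeholders): beginningOffsets returns the earliest offset still stored, which is 0 only if retention or compaction has not deleted anything yet.

```java
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.util.Collections;
import java.util.Map;

public class FirstOffsetCheck {

    // Returns the earliest offset still stored for the given partition.
    // A record whose offset() equals this value is the first record currently in the partition.
    static long firstAvailableOffset(KafkaConsumer<String, String> consumer, String topic, int partition) {
        TopicPartition tp = new TopicPartition(topic, partition);
        Map<TopicPartition, Long> beginning = consumer.beginningOffsets(Collections.singletonList(tp));
        return beginning.get(tp);
    }
}
```

Alternatively, seekToBeginning(...) positions the consumer at that same offset before the next poll().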
Yes, Kafka offsets start from 0 for a new partition and increase by one for each appended record (in very old versions the offset was physical, i.e. it advanced by the byte length of each record, and the next record picked up from there).
Because a topic is distributed across partitions, ordering is only guaranteed within a single partition, not across the whole topic.
Are there any best practices when selecting value of auto.commit.interval.ms?
I read here that:
In general, it is not recommended to keep this interval too small because it vastly increases the read/write rates in zookeeper and zookeeper gets slowed down because it's strongly consistent across its quorum.
What is too small? Is this still an issue with Kafka versions >= 0.9.0?
The question is not what is too small, but rather what you can live with. If you can live with re-processing several minutes' worth of messages after a crash of your consumer, you can set the interval to a few minutes, because that is what it is all about: after a (re)start, messages are processed from the last committed offset. (Also, since 0.9 the new consumer commits offsets to the internal __consumer_offsets topic rather than to ZooKeeper, so ZooKeeper load is no longer a concern.)
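For reference, a minimal sketch of a consumer configuration using auto-commit; the 30-second interval, broker address, and group id are illustrative assumptions, not recommendations:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.util.Properties;

public class CommitIntervalConfig {
    static Properties consumerProps() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                  // hypothetical group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");
        // Commit automatically every 30 seconds: after a crash, up to ~30s of already-processed
        // records may be re-delivered, which is the trade-off described above.
        props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "30000");
        return props;
    }
}
```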