How the Kafka Topic Offsets works - apache-kafka

I have a question about how the topic offsets works in Kafka, are they stored B-Tree like structure in Kafka?
The specific reason I ask for it, lets say I have a Topic with 10 millions records in Topic, that will mean 10 millions offset if no compaction occurred or it is turned off, now if I use consumer.seek(5000000), it will work like LinkList by that I mean, it will go to 0 offset and will try to hop from there to 5000000th offset or it does have index like structure will tell exactly where is the 5000000th record in the log?
Thx for answers?

Kafka records are stored sequentially in the logs. The exact format is well described in the documentation.
Kafka usually expects read to be sequential, as Consumers fetch records in order. However when a random access is required (via seek or to restart from a specific position), Kafka uses index files to quickly find a record based on its offset.
A Kafka log is made of several segments. Each segments has an index and a timeindex file associated which map offsets and timestamp to file position. The frequency at which entries are added to the indexes can be configured using index.interval.bytes. Using these files Kafka is able to immediately seek to the nearby position and avoid re-reading all messages.
You may have noticed after an unclean shutdown that Kafka is rebuilding indexes for a few minutes. It's these indexes used to file position lookups that are being rebuilt.

Related

Kafka Consumer - Random access fetching of partition range

The question:
How can I randomly fetch an old chunk of messages with a given range definition of [partition, start offset, end offset]. Hopefully ranges from multiple partitions at once (one range for each partition). This needs to be supported in a concurrent environment too.
My ideas for solution so far
I guess I can use a pool of consumers for the concurrency, and for each fetch, use Consumer.seek and Consumer.poll with max.poll.records. But this seems wrong. No promise that I will get the same exact chunk, for example in a case when a message get deleted (using log compact). As a whole this seek + poll method not seems like the right fit for one time random fetch.
My use case:
Like the typical consumer, mine reads 10MB chunks of messages and processes it.
In order to process that chunk I am pushing 3-20 jobs to different topics, in some kind of workflow.
Now, my goal is to avoid pushing the same chunk into the other topics again and again. Seems to me that it is better to push a reference to that chunk. e.g. [Topic X / partition Y, start offset, end offset]. Then, on the processing of the jobs, it will fetch the exact chunk again.
Your idea seems fine, and is practically the only solution with the Consumer API. There's nothing you can do once messages are removed between offsets.
If you really needed every single message between each and every possible offset range, then you should consider consuming that data as it's actively produced into some externally indexable destination where offset scans are also a common operation. Plenty of Kafka Connectors exist, and lots of databases or filesystems. But the takeaway here is that, I think you might have to reconsider your options for these "reprocessing" jobs

Get latest values from a topic on consumer start, then continue normally

We have a Kafka producer that produces keyed messages in a very high frequency to topics whose retention time = 10 hours. These messages are real-time updates and the used key is the ID of the element whose value has changed. So the topic is acting as a changelog and will have many duplicate keys.
Now, what we're trying to achieve is that when a Kafka consumer launches, regardless of the last known state (new consumer, crashed, restart, etc..), it will somehow construct a table with the latest values of all the keys in a topic, and then keeps listening for new updates as normal, keeping the minimum load on Kafka server and letting the consumer do most of the job. We tried many ways and none of them seems the best.
What we tried:
1 changelog topic + 1 compact topic:
The producer sends the same message to both topics wrapped in a transaction to assure successful send.
Consumer launches and requests the latest offset of the changelog topic.
Consumes the compacted topic from beginning to construct the table.
Continues consuming the changelog since the requested offset.
Cons:
Having duplicates in compacted topic is a very high possibility even with setting the log compaction frequency the highest possible.
x2 number of topics on Kakfa server.
KSQL:
With KSQL we either have to rewrite a KTable as a topic so that consumer can see it (Extra topics), or we will need consumers to execute KSQL SELECT using to KSQL Rest Server and query the table (Not as fast and performant as Kafka APIs).
Kafka Consumer API:
Consumer starts and consumes the topic from beginning. This worked perfectly, but the consumer has to consume the 10 hours change log to construct the last values table.
Kafka Streams:
By using KTables as following:
KTable<Integer, MarketData> tableFromTopic = streamsBuilder.table("topic_name", Consumed.with(Serdes.Integer(), customSerde));
KTable<Integer, MarketData> filteredTable = tableFromTopic.filter((key, value) -> keys.contains(value.getRiskFactorId()));
Kafka Streams will create 1 topic on Kafka server per KTable (named {consumer_app_id}-{topic_name}-STATE-STORE-0000000000-changelog), which will result in a huge number of topics since we a big number of consumers.
From what we have tried, it looks like we need to either increase the server load, or the consumer launch time. Isn't there a "perfect" way to achieve what we're trying to do?
Thanks in advance.
By using KTables, Kafka Streams will create 1 topic on Kafka server per KTable, which will result in a huge number of topics since we a big number of consumers.
If you are just reading an existing topic into a KTable (via StreamsBuilder#table()), then no extra topics are being created by Kafka Streams. Same for KSQL.
It would help if you could clarify what exactly you want to do with the KTable(s). Apparently you are doing something that does result in additional topics being created?
1 changelog topic + 1 compact topic:
Why were you thinking about having two separate topics? Normally, changelog topics should always be compacted. And given your use case description, I don't see a reason why it should not be:
Now, what we're trying to achieve is that when a Kafka consumer launches, regardless of the last known state (new consumer, crashed, restart, etc..), it will somehow construct a table with the latest values of all the keys in a topic, and then keeps listening for new updates as normal [...]
Hence compaction would be very useful for your use case. It would also prevent this problem you described:
Consumer starts and consumes the topic from beginning. This worked perfectly, but the consumer has to consume the 10 hours change log to construct the last values table.
Note that, to reconstruct the latest table values, all three of Kafka Streams, KSQL, and the Kafka Consumer must read the table's underlying topic completely (from beginning to end). If that topic is NOT compacted, this might indeed take a long time depending on the data volume, topic retention settings, etc.
From what we have tried, it looks like we need to either increase the server load, or the consumer launch time. Isn't there a "perfect" way to achieve what we're trying to do?
Without knowing more about your use case, particularly what you want to do with the KTable(s) once they are populated, my answer would be:
Make sure the "changelog topic" is also compacted.
Try KSQL first. If this doesn't satisfy your needs, try Kafka Streams. If this doesn't satisfy your needs, try the Kafka Consumer.
For example, I wouldn't use the Kafka Consumer if it is supposed to do any stateful processing with the "table" data, because the Kafka Consumer lacks built-in functionality for fault-tolerant stateful processing.
Consumer starts and consumes the topic from beginning. This worked
perfectly, but the consumer has to consume the 10 hours change log to
construct the last values table.
During the first time your application starts up, what you said is correct.
To avoid this during every restart, store the key-value data in a file.
For example, you might want to use a persistent map (like MapDB).
Since you give the consumer group.id and you commit the offset either periodically or after each record is stored in the map, the next time your application restarts it will read it from the last comitted offset for that group.id.
So the problem of taking a lot of time occurs only initially (during first time). So long as you have the file, you don't need to consume from beginning.
In case, if the file is not there or is deleted, just seekToBeginning in the KafkaConsumer and build it again.
Somewhere, you need to store this key-values for retrieval and why cannot it be a persistent store?
In case if you want to use Kafka streams for whatever reason, then an alternative (not as simple as the above) is to use a persistent backed store.
For example, a persistent global store.
streamsBuilder.addGlobalStore(Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore(topic), keySerde, valueSerde), topic, Consumed.with(keySerde, valueSerde), this::updateValue);
P.S: There will be a file called .checkpoint in the directory which stores the offsets. In case if the topic is deleted in the middle you get OffsetOutOfRangeException. You may want to avoid this, perhaps by using UncaughtExceptionHandler
Refer to https://stackoverflow.com/a/57301986/2534090 for more.
Finally,
It is better to use Consumer with persistent file rather than Streams for this, because of simplicity it offers.

kafka partition has lots of log segments

One topic has 20 partitions, almost everyone has more than 20,000 log segment files, most of them are created months ago. Even after I config the retention.ms to very short, the segments are not deleted. While other topics can recycle normal.
I am wondering what's the issue inside, and how to solve it. Because I'm worry about the number of total segments will keep increasing that larger than OS vm.max_map_count, which will damage kafka process itself. Following image is the describe about the abnormal topic.
Not sure what the issue is exactly, but some things to consider:
Broker vs topic-specific configs. Check to make sure your topic actually has the configs you think it has, and is not inheriting them from the broker settings.
Configs related to retention. As mentioned by Girogos Myrianthous, you can look at log.retention.check.interval.ms and log.cleanup.policy. I would also look at the roll related settings, like log.roll.hours. I believe that in some cases, Kafka will not delete a segment until its partition rolls, even if the segment is old. And rolling follows the following behavior:
The log rolling time is no longer depending on log segment create time. Instead it is now based on the timestamp in the messages. More specifically. if the timestamp of the first message in the segment is T, the log will be rolled out when a new message has a timestamp greater than or equal to T + log.roll.ms (http://kafka.apache.org/20/documentation.html)
So make sure to consider the record timestamps, not just the segment files' age.
Finally:
What version of Kafka are you using?
Have you looked carefully at the broker logs? Broker logs is how I've solved all such problems that I've encountered.

Is it possible to filter Apache Kafka messages by retention time?

At an abstract point of view Apache Kafka stores data in topics. This data could be read by a consumer.
I'd like to have a (monitor)-consumer which greps data with a certain age. The monitor should send a warning to subsystems that records are still unread and would be discarded by Kafka if they reach retention time.
I couldn't find a suitable way until now.
You can use KafkaConsumer.offsetsForTimes() to map messages to dates.
For example, if you call it with the date of yesterday and it returns offset X, then any messages with an offset smaller than X are older than yesterday.
Then your logic can figure out from the current positions of your consumers if you are at risk of having unprocessed records discarded.
Note that there is currently a KIP under discussion to expose metrics to track that: https://cwiki.apache.org/confluence/display/KAFKA/KIP-223+-+Add+per-topic+min+lead+and+per-partition+lead+metrics+to+KafkaConsumer
http://kafka.apache.org/10/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#offsetsForTimes-java.util.Map-

How is Apache Kafka offset generated?

Went through
How is the kafka offset value computed?
From the kafka documentation on replication:
The purpose of adding replication in Kafka is for stronger durability and higher availability. We want to guarantee that any successfully published message will not be lost and can be consumed, even when there are server failures. Such failures can be caused by machine error, program error, or more commonly, software upgrades.
From the kafka documentation on Efficiency:
The message log maintained by the broker is itself just a directory of files, each populated by a sequence of message sets that have been written to disk in the same format used by the producer and consumer. Maintaining this common format allows optimization of the most important operation: network transfer of persistent log chunks.
I did not see anywhere details regarding how the offset is generated for a topic. Will be offsets be generated by a single machine in the cluster in which case there is one master or Kafka has distributed logging that relies on some kind of clock synchronization and generates messages in a consistent order among all the nodes.
Any pointers or additional information will be helpful.
Offsets are not generated explicitly for each message and messages do also no store their offset.
A topic consists of partitions, and messages are written to partitions in junks, called segments (on the file system, there will be a folder for a topic, with subfolders for each partition -- a segment corresponds to a file within a partitions folder).
Furthermore, a index is maintained per partitions and stored along with the segment files, that uses the offset of the first message per segment as key and point to the segment. For all consecutive messages within a segment, the offset of a message can be computed by it's logical position within the segment (including the offset of the first messages).
If you start a new topic or actually a new partition, a first segment is generated and its start offset zero is inserted into the index. Message get written to the segment until it's full. A new segment is started and it's start offset get's added to the index -- the start offset of the new segment can easily be computed by the start offset of the latest segment plus the number of message within this segment.
Thus, for each partitions, the broker that hosts this partitions (ie, the leader for this partition) tracks the offset for this partitions by maintaining the index. If segments are deleted because retention time passed, the segment file get's deleted and the entry in the index is removed.