How is the Kafka offset value computed?
From the Kafka documentation on replication:
The purpose of adding replication in Kafka is for stronger durability and higher availability. We want to guarantee that any successfully published message will not be lost and can be consumed, even when there are server failures. Such failures can be caused by machine error, program error, or more commonly, software upgrades.
From the Kafka documentation on Efficiency:
The message log maintained by the broker is itself just a directory of files, each populated by a sequence of message sets that have been written to disk in the same format used by the producer and consumer. Maintaining this common format allows optimization of the most important operation: network transfer of persistent log chunks.
I did not see any details about how the offset is generated for a topic. Are offsets generated by a single machine in the cluster (in which case there is one master), or does Kafka have distributed logging that relies on some kind of clock synchronization and generates messages in a consistent order across all the nodes?
Any pointers or additional information will be helpful.
Offsets are not generated explicitly for each message, and messages do not store their own offset.
A topic consists of partitions, and messages are written to partitions in chunks, called segments (on the file system, there is a directory per partition, named after the topic and the partition number, and a segment corresponds to a file within that directory).
Furthermore, an index is maintained per partition and stored alongside the segment files; it uses the offset of the first message of each segment as the key and points to that segment. For all subsequent messages within a segment, the offset of a message can be computed from its logical position within the segment plus the start offset of the segment.
If you start a new topic (or rather, a new partition), a first segment is created and its start offset of zero is inserted into the index. Messages get written to the segment until it is full. Then a new segment is started, and its start offset gets added to the index -- the start offset of the new segment is simply the start offset of the previous segment plus the number of messages within that segment.
Thus, for each partition, the broker that hosts it (i.e., the leader for this partition) tracks the offsets for this partition by maintaining the index. If a segment is deleted because the retention time has passed, the segment file gets deleted and its entry is removed from the index.
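To illustrate the bookkeeping described above, here is a toy model in C# (not Kafka's actual code) of a base-offset index: new segments start at the previous start offset plus the previous segment's message count, and a lookup finds the segment whose start offset is the largest one not exceeding the requested offset.

// Toy model of the per-partition index described above -- not Kafka's actual code.
using System;
using System.Collections.Generic;

class SegmentIndexModel
{
    // Base offsets of the segments, i.e. the offset of the first message in each
    // segment, kept in ascending order. A new partition starts with one segment
    // whose base offset is 0.
    private readonly List<long> baseOffsets = new() { 0 };

    // Rolling to a new segment: its base offset is the previous base offset plus
    // the number of messages written into the previous (now full) segment.
    public void RollSegment(long messagesInPreviousSegment) =>
        baseOffsets.Add(baseOffsets[^1] + messagesInPreviousSegment);

    // Locating a message by offset: find the segment with the largest base offset
    // that is <= the requested offset; the message's position inside that segment
    // is the difference between the two.
    public (long SegmentBaseOffset, long RelativePosition) Locate(long offset)
    {
        int i = baseOffsets.BinarySearch(offset);
        if (i < 0) i = ~i - 1;   // not an exact match: take the previous base offset
        return (baseOffsets[i], offset - baseOffsets[i]);
    }
}

After two RollSegment(1000) calls, Locate(2345) returns base offset 2000 and relative position 345, i.e. the 346th message of the third segment.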
Related
I've been developing a Kafka consumer application (C# in Kubernetes) and have been running it as a single node for a while, consuming from a single topic.
I noticed today that the topic I have been consuming from was quite full - I was doing continuous processing and was at offsets around ~38k (in general, agnostic of partition), but records my producer was putting on the topic (again, ignoring partition differences) were around offsets ~58k.
I decided to scale up another consumer pod - same code and config all around (group id, etc.).
When it came online, it logged that it was processing messages in the ~58k offset range. I considered that this was maybe just a different partition, but I can see the same partition in both logs (with different offsets).
I was under the impression that if multiple consumers had the same group id, message consumption would be balanced between them, in order.
In other words, why wouldn't my second (or n-th) consumer come online and process messages in the same offset range as my first consumer which has been running for days?
I did eye some of the IConsumer settings such as:
https://docs.confluent.io/platform/current/clients/confluent-kafka-dotnet/_site/api/Confluent.Kafka.ConsumerConfig.html#Confluent_Kafka_ConsumerConfig_QueuedMinMessages
which seems to specify the minimum number of messages to keep in a "local consumer queue" (default: 100,000), but I don't know if this actually means that ConsumerA has laid claim to 100k+ messages and ConsumerB is naturally starting +100k down the line.
Other notes:
What limited access I have to the Administrative Tools (Control Center) shows that my consumer group id is about 900k messages behind.
Control Center says my topic has 60 partitions.
Autocommit is not turned off (the default is true).
Regardless of the autocommit setting, I am still doing a _consumer.Commit(msg) in the finally{} block after processing each individual message.
I don't want to kill my long-running consumer (that's still processing like a champ) in case there's an offset retention problem and I would "miss" all messages in the delta between these two offsets.
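For reference, a minimal sketch of the kind of setup described above, using Confluent.Kafka (broker address, topic, and group id are placeholders). When a second pod runs the same code with the same GroupId, the group coordinator rebalances the topic's partitions across the instances, and the assignment handler shows which partitions each pod ends up with:

using System;
using Confluent.Kafka;

class ConsumerPodSketch
{
    static void Main()
    {
        var config = new ConsumerConfig
        {
            BootstrapServers = "localhost:9092",      // placeholder
            GroupId = "my-consumer-group",            // same group id on every pod
            EnableAutoCommit = true,                  // left at the default, as above
            AutoOffsetReset = AutoOffsetReset.Earliest
        };

        using var consumer = new ConsumerBuilder<string, string>(config)
            // Logs which of the topic's partitions this instance is assigned
            // after each rebalance (e.g. when a second pod joins the group).
            .SetPartitionsAssignedHandler((c, partitions) =>
                Console.WriteLine($"Assigned: {string.Join(", ", partitions)}"))
            .Build();

        consumer.Subscribe("my-topic");               // placeholder

        while (true)
        {
            var result = consumer.Consume(TimeSpan.FromSeconds(1));
            if (result == null) continue;
            try
            {
                // ... process result.Message.Value ...
            }
            finally
            {
                consumer.Commit(result);              // explicit per-message commit
            }
        }
    }
}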
Let us say I have a partition (partition-0) with 4 segments that are committed and eligible for compaction. So none of these segments will have any duplicate data, since compaction has been done on all 4 segments.
Now, there is an active segment which is still not closed. Meanwhile, if the consumer starts reading the data from the partition-0, does it also read the messages from active segment?
Note: My goal is to not provide duplicate data to the consumer for a particular key.
Your concern is valid, as the consumer will also read messages from the active segment. Log compaction does not guarantee that you have exactly one value for a particular key, but rather at least one.
Here is how Log Compaction is introduced in the documentation:
Log compaction ensures that Kafka will always retain at least the last known value for each message key within the log of data for a single topic partition.
However, you can try to get the compaction running more frequently to keep your active and non-compacted segments as small as possible. This, however, comes at a cost, as running the compaction log cleaner takes up resources.
There are a lot of configurations at the topic level that are related to log compaction. Here are the most important ones (all details can be looked up in the Kafka topic configuration documentation; see the sketch below for how they can be set):
delete.retention.ms
max.compaction.lag.ms
min.cleanable.dirty.ratio
min.compaction.lag.ms
segment.bytes
However, I am quite convinced that you will not be able to guarantee that your consumer never gets any duplicates with a log-compacted topic.
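For illustration, a sketch of creating a topic with these settings via the Confluent.Kafka admin client (the topic name and the values are just placeholders, not recommendations):

using System.Collections.Generic;
using System.Threading.Tasks;
using Confluent.Kafka;
using Confluent.Kafka.Admin;

class CompactedTopicSketch
{
    static async Task Main()
    {
        using var admin = new AdminClientBuilder(
            new AdminClientConfig { BootstrapServers = "localhost:9092" }).Build();

        await admin.CreateTopicsAsync(new[]
        {
            new TopicSpecification
            {
                Name = "my-compacted-topic",          // placeholder
                NumPartitions = 1,
                ReplicationFactor = 3,
                Configs = new Dictionary<string, string>
                {
                    { "cleanup.policy", "compact" },
                    { "min.cleanable.dirty.ratio", "0.1" },   // compact more eagerly
                    { "max.compaction.lag.ms", "60000" },     // bound how long a record stays uncompacted
                    { "min.compaction.lag.ms", "0" },
                    { "delete.retention.ms", "86400000" },    // how long tombstones are kept
                    { "segment.bytes", "1048576" }            // small segments -> small active segment
                }
            }
        });
    }
}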
In designing a stream processing pipeline, what cost might be incurred if I were to have many topics, each with at least one partition but potentially no data going into it?
As an example, with one consumer, I could choose to have one "mega topic" which contains all of the data and many partitions, or I could choose to split that data (by tenant, account, user, etc.) into many topics with, by default, a single partition. My worry about the second case is that there would be many topics/partitions which would see no data. So, is an unused partition costing anything, or is there no cost incurred by an unused topic?
First of all, there is no difference between one fat topic with lots of partitions and several topics containing a few partitions each. A topic is just a logical distinction between events; Kafka only cares about the number of partitions.
Secondly, having lots of partitions can lead to some problems:
Too many open files: each partition maps to a directory in the file system in the broker. Within that log directory, there will be two files (one for the index and another for the actual data) per log segment.
More partitions require more memory on both the broker and consumer sides: brokers allocate a buffer of replica.fetch.max.bytes for each partition they replicate. If replica.fetch.max.bytes is set to 1 MiB and you have 1000 partitions, about 1 GiB of RAM is required (see the sketch at the end of this answer).
More partitions may increase unavailability: if the broker that is the controller fails, ZooKeeper elects another broker as controller. At that point, the newly elected controller has to read the metadata for every partition from ZooKeeper during initialization. For example, if there are 10,000 partitions in the Kafka cluster and initializing the metadata from ZooKeeper takes 2 ms per partition, this can add 20 more seconds to the unavailability window.
You may get more information from these links:
https://www.confluent.io/blog/how-choose-number-topics-partitions-kafka-cluster/
https://docs.cloudera.com/documentation/kafka/latest/topics/kafka_performance.html
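To put a rough number on the memory point above, a sketch that counts partitions from the client metadata and multiplies by the example fetch buffer size (the broker address is a placeholder, and 1 MiB is just the figure used in the quote):

using System;
using System.Linq;
using Confluent.Kafka;

class PartitionCountSketch
{
    static void Main()
    {
        using var admin = new AdminClientBuilder(
            new AdminClientConfig { BootstrapServers = "localhost:9092" }).Build();

        var metadata = admin.GetMetadata(TimeSpan.FromSeconds(10));
        int partitions = metadata.Topics.Sum(t => t.Partitions.Count);

        const double replicaFetchMaxBytes = 1024 * 1024;   // the 1 MiB example value
        Console.WriteLine(
            $"{partitions} partitions -> roughly " +
            $"{partitions * replicaFetchMaxBytes / (1024 * 1024 * 1024):F2} GiB " +
            "of replication fetch buffers on a broker replicating all of them");
    }
}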
Assuming the mentioned topics are not compacted, there is the initial overhead of retaining any initially produced data, after which an empty topic is just:
metadata in zookeeper
metadata in any consumer group coordinator, and wasted processing by any active consumer threads
empty directories on disk
For the first two points, having lots of topics may increase request latency, leading to an unhealthy cluster.
All the Kafka tutorials I've read so far mention that "Kafka partitions are immutable". However, I also read from this site https://towardsdatascience.com/log-compacted-topics-in-apache-kafka-b1aa1e4665a7 that from time to time Kafka will remove older messages in the partition (depending on the retention time you set in the log-compact command). The screenshots there show that data within the partition has clearly changed after the duplicate keys were removed from the partition.
So my question is what exactly does it mean to say "Kafka partitions are immutable"?
The Kafka partitions are defined as "immutable" in reference to the fact that a producer can only append messages to a partition and cannot change the value of an existing one (i.e., one with the same key). The partition itself is a commit log that works in append-only mode from a producer's point of view.
Of course, this means that without mechanisms like deletion (by retention time) and compaction, the partition size could grow endlessly.
At this point you could think, "so it's not immutable!", as you mentioned.
Well, as I said, the immutability is from a producer's point of view. Deletion and compaction are administrative operations.
For example, deleting records is also possible using the Admin Client API ... but we are always talking about administrative stuff, not producer/consumer related stuff.
If you think about compaction and how it works: the producer initially sends, for example, a message with key = A and payload = "Hello". After a while, in order to "update" the value, it sends a new message with the same key = A and payload = "Hi" ... but actually this is a genuinely new message appended at the end of the partition log; it is the compaction thread in the broker that does the work of deleting the old message with the "Hello" payload, leaving just the new one.
In the same way, a producer can send a message with key = A and payload = null. This is the way to actually delete the message (a null payload is called a "tombstone"). Even so, the producer is still appending a new message to the partition; it is again the compaction thread that will delete the last message with key = A when it sees the tombstone.
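A sketch of the producer side of that example with Confluent.Kafka (topic name and broker address are placeholders); each call simply appends a record, and it is the broker-side compaction that later removes the superseded ones:

using System;
using System.Threading.Tasks;
using Confluent.Kafka;

class CompactionProducerSketch
{
    static async Task Main()
    {
        var config = new ProducerConfig { BootstrapServers = "localhost:9092" };  // placeholder
        using var producer = new ProducerBuilder<string, string>(config).Build();

        // Initial value for key A -- appended at some offset n.
        await producer.ProduceAsync("my-compacted-topic",
            new Message<string, string> { Key = "A", Value = "Hello" });

        // The "update" -- in reality a brand-new record appended after n;
        // the compaction thread will eventually drop the "Hello" record.
        await producer.ProduceAsync("my-compacted-topic",
            new Message<string, string> { Key = "A", Value = "Hi" });

        // The tombstone -- a null payload, again just appended; compaction will
        // eventually remove key A entirely once it processes the tombstone.
        await producer.ProduceAsync("my-compacted-topic",
            new Message<string, string> { Key = "A", Value = null });

        producer.Flush(TimeSpan.FromSeconds(10));
    }
}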
Individual messages are immutable.
Compaction or retention will drop messages, but it doesn't alter existing messages or their offsets.
Data in Kafka is stored in topics; topics are partitioned; each partition is further divided into segments; and finally each segment has a log file to store the actual messages, an index file to store the positions of the messages in the log file, and a timeindex file. For example:
$ ls -l /mnt/data/kafka/*consumer*/00000000004618814867*
-rw-r--r-- 1 kafka kafka 10485760 Oct 3 23:41 /mnt/data/kafka/__consumer_offsets-7/00000000004618814867.index
-rw-r--r-- 1 kafka kafka 8189913 Oct 3 23:41 /mnt/data/kafka/__consumer_offsets-7/00000000004618814867.log
-rw-r--r-- 1 kafka kafka 10485756 Oct 3 23:41 /mnt/data/kafka/__consumer_offsets-7/00000000004618814867.timeindex
In the scenario where log.cleanup.policy (or cleanup.policy on a particular topic) is set to delete, complete log segments (one or more) are simply deleted.
In the scenario where the parameter is set to compact, compaction is done in the background by periodically recopying log segments: it recopies the log from beginning to end, removing keys which have a later occurrence in the log. New, clean segments are swapped into the log immediately, so the additional disk space required is just one additional log segment (not a full copy of the log). In other words, the old segment is replaced by a new, compacted segment.
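As a toy model of what the cleaner effectively retains (not the broker's actual implementation, and ignoring details like how long tombstones are kept via delete.retention.ms):

using System.Collections.Generic;
using System.Linq;

static class CompactionModel
{
    // records: the partition's log in offset order as (key, value) pairs,
    // where a null value is a tombstone.
    // Returns what a fully compacted log would retain, keeping the original order.
    public static List<(string Key, string Value)> Compact(
        IEnumerable<(string Key, string Value)> records)
    {
        var latest = new Dictionary<string, (int Position, string Value)>();
        int position = 0;
        foreach (var record in records)
            latest[record.Key] = (position++, record.Value);   // a later record wins

        return latest
            .Where(kv => kv.Value.Value != null)    // tombstoned keys disappear
            .OrderBy(kv => kv.Value.Position)       // preserve log order
            .Select(kv => (kv.Key, kv.Value.Value))
            .ToList();
    }
}

For example, for a log A="Hello", B="x", A="Hi", B=null (a tombstone), the model returns only A="Hi".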
See more about distributed logs:
https://kafka.apache.org/documentation.html#compaction
https://medium.com/@durgaswaroop/a-practical-introduction-to-kafka-storage-internals-d5b544f6925f
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
https://bookkeeper.apache.org/distributedlog/docs/0.5.0/user_guide/architecture/main
https://bravenewgeek.com/building-a-distributed-log-from-scratch-part-1-storage-mechanics/
Immutability is a property of the records stored within the partitions themselves. When a source (documentation or articles) states immutability in the context of topics or partitions, it is usually referring to one of two things, both of which are correct in a limited context:
Records are immutable. Once a record is written, its contents can never be altered. A record can be deleted by the broker when either (a) the contents of the partition are pruned due to the retention limit, (b) a new record is added for the same key that supersedes the original record and compaction takes place, or (c) a record is added for the same key with a null value, which acts as a tombstone record, deleting the original without adding a replacement.
Partitions are append-only from a client's perspective, in that a client is not permitted to modify records or directly remove records from a partition, only append to the partition. This is somewhat debatable, because a client can induce the deletion of a record through the compaction feature, although this operation is asynchronous and the client cannot specify precisely which record should be deleted.
While trying to implement exactly-once semantics, I found this in the official Kafka documentation:
Exactly-once delivery requires co-operation with the destination storage system but Kafka provides the offset which makes implementing this straight-forward.
Does this mean that I can use the (topic, partition, offset) tuple as a unique primary identifier to implement deduplication?
An example implementation would be to use an RDBMS and this tuple as a primary key for an insert operation within a big processing transaction, where the transaction fails if the insertion is no longer possible because of an already existing primary key.
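Sketched in code, the idea could look roughly like this (the consumer settings, the events table, and its columns are hypothetical; the primary key on topic/partition/offset is what rejects a redelivered record):

using System;
using Confluent.Kafka;
using Microsoft.Data.SqlClient;   // any RDBMS client would do; this one is just for illustration

class ExactlyOnceSinkSketch
{
    static void Main()
    {
        using var consumer = new ConsumerBuilder<string, string>(new ConsumerConfig
        {
            BootstrapServers = "localhost:9092",   // placeholder
            GroupId = "sink-group",                // placeholder
            EnableAutoCommit = false
        }).Build();
        consumer.Subscribe("my-topic");            // placeholder

        using var db = new SqlConnection("...");   // placeholder connection string
        db.Open();

        while (true)
        {
            var r = consumer.Consume(TimeSpan.FromSeconds(1));
            if (r == null) continue;

            using var tx = db.BeginTransaction();
            try
            {
                // Hypothetical table with PRIMARY KEY (topic, kafka_partition, kafka_offset);
                // a redelivered record violates the key and the whole transaction rolls back.
                using var cmd = new SqlCommand(
                    "INSERT INTO events(topic, kafka_partition, kafka_offset, payload) " +
                    "VALUES (@t, @p, @o, @v)", db, tx);
                cmd.Parameters.AddWithValue("@t", r.Topic);
                cmd.Parameters.AddWithValue("@p", r.Partition.Value);
                cmd.Parameters.AddWithValue("@o", r.Offset.Value);
                cmd.Parameters.AddWithValue("@v", r.Message.Value);
                cmd.ExecuteNonQuery();

                // ... the rest of the processing, inside the same transaction ...
                tx.Commit();
            }
            catch (SqlException)
            {
                tx.Rollback();   // in practice, check the error number for a key violation
            }
            consumer.Commit(r);
        }
    }
}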
I think the question is equivalent to:
Does a producer use the same offset for a message when retrying to send it after detecting a possible failure or does every retry attempt get its own offset?
If the offset is reused when retrying, consumers obviously see multiple messages with the same offset.
Another question, maybe somewhat related:
With single or multiple producers producing to the same topic, can there be "gaps" in the offset number sequence seen by one consumer?
Another possibility could be that the offset is determined only at the moment the message reaches the leader, which does the job (implying that, unless the leader listens to something like a producer-suggested offset, there are probably no gaps/offset jumps, but also that duplicate messages get different offsets, and I would have to use my own unique identifier within the application's message at the application level).
To answer my own question:
The offset is generated solely by the server (more precisely: by the leader of the corresponding partition), not by the producing client. It is then sent back to the producer in the produce response. So:
Does a producer use the same offset for a message when retrying to send it after detecting a possible failure or does every retry attempt get its own offset?
No. (See update below!) The producer does not determine offsets and two identical/duplicate application messages can have different offsets. So the offset cannot be used to identify messages for producer deduplication purposes and a custom UID has to be defined in the application message. (Source)
With single or multiple producers producing to the same topic, can there be "gaps" in the offset number sequence seen by one consumer?
Due to the fact that there is only a single leader for every partition, which maintains the current offset, and the fact that (with the default configuration) this leadership is only transferred to an active in-sync replica in case of a failure, I assume that the latest used offset is always communicated correctly when electing a new leader for a partition, and therefore there should not be any offset gaps or jumps initially. However, because of the log compaction feature, there are cases (assuming log compaction is enabled) where there can indeed be gaps in the stream of offsets when consuming already committed messages of a partition once again after compaction has kicked in. (Source)
Update (Kafka >= 0.11.0)
Starting from Kafka version 0.11.0, producers now additionally send a sequence number with their requests, which is then used by the leader to deduplicate requests by this number and the producer's ID. So with 0.11.0, the precondition on the producer side for implementing exactly once semantics is given by Kafka itself and there's no need to send another unique ID or sequence number within the application's message.
Therefore, the answer to question 1 could now also be yes, in a sense.
However, note that exactly-once semantics are still only possible if the consumer never fails. Once the consumer can fail, one still has to watch out for duplicate message processing on the consumer side.
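For completeness, a sketch of the producer-side setting the update above refers to, using Confluent.Kafka (broker address and topic are placeholders):

using System;
using System.Threading.Tasks;
using Confluent.Kafka;

class IdempotentProducerSketch
{
    static async Task Main()
    {
        var config = new ProducerConfig
        {
            BootstrapServers = "localhost:9092",   // placeholder
            // Enables producer ids and per-partition sequence numbers (Kafka >= 0.11),
            // so broker-side retries of the same batch do not create duplicates.
            EnableIdempotence = true
        };

        using var producer = new ProducerBuilder<string, string>(config).Build();
        var result = await producer.ProduceAsync("my-topic",
            new Message<string, string> { Key = "A", Value = "Hello" });

        // The broker-assigned offset comes back in the delivery report.
        Console.WriteLine($"Delivered to {result.TopicPartitionOffset}");
    }
}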