A kafka message has:
key, value, compression type, headers(key-value pairs,optional),
partition+offset, timestamp
Key is hashed to partition to find which partition producer would write to.
Then why do we need partition as part of message.
Also, how does producer know the offset as offset seems more like a property of kafka server? And doesn't it cause coupling between server and producer?
And how would it work if multiple producers are writing to a topic, as offset send by them may clash?
why do we need partition as part of message.
It's optional for the client to set the record partition. The partition is still needed in the protocol because the key is not hashed server-side, then rerouted.
how does producer know the offset as offset seems more like a property of kafka server?
The producer would need a callback to get the OffsetMetadata, but it's not known when a batch is sent
And doesn't it cause coupling between server and producer?
Yes? It's the Kafka protocol. The consumer is also "coupled" with the server because it must understand how to communicate with it.
multiple producers are writing to a topic, as offset send by them may clash?
If max.inflight.connections is more than 1 and retires are enabled, then yes, batches may get rearranged, but send requests are initially ordered, and clients do not set the record offset, the broker does.
Related
From the kafka faq page
In Kafka producer, a partition key can be specified to indicate the destination partition of the message. By default, a hashing-based partitioner is used to determine the partition id given the key
So all the messages with a particular key will always go to the same partition in a topic:
How does the consumer know which partition the producer wrote to, so it can consume directly from that partition?
If there are more producers than partitions, and multipe producers are writing to the same partition, how are the offsets ordered so that the consumers can consume messages from specific producers?
How does the consumer know which partition the producer wrote to
Doesn't need to, or at least shouldn't, as this would create a tight coupling between clients. All consumer instances should be responsible for handling all messages for the subscribed topic. While you can assign a Consumer to a list of TopicPartition instances, and you can call the methods of the DefaultPartitioner for a given key to find out what partition it would have gone to, I've personally not run across a need for that. Also, keep in mind, that Producers have full control over the partitioner.class setting, and do not need to inform Consumers about this setting.
If there are more producers than partitions, and multipe producers are writing to the same partition, how are the offsets ordered...
Number of producers or partitions doesn't matter. Batches are sequentially written to partitions. You can limit the number of batches sent at once per Producer client (and you only need one instance per application) with max.in.flight.requests, but for separate applications, you of course cannot control any ordering
so that the consumers can consume messages from specific producers?
Again, this should not be done.
Kafka is distributed event streaming, one of its use cases is decoupling services from producers to consumers, the producer producing/one application messages to topics and consumers /another application reads from topics,
If you have more then one producer, the order that data would be in the kafka/topic/partition is not guaranteed between producers, it will be the order of the messages that are written to the topic, (even with one producer there might be issues in ordering , read about idempotent producer)
The offset is atomic action which will promise that no two messages will get same offset.
The offset is running number, it has a meaning only in the specific topic and specfic partition
If using the default partioner it means you are using murmur2 algorithm to decide to which partition to send the messages, while sending a record to kafka that contains a key , the partioner in the producer runs the hash function which returns a value, the value is the number of the partition that this key would be sent to, this is same murmur2 function, so for the same key, with different producer you'll keep getting same partition value
The consumer is assigned/subscribed to handle topic/partition, it does not know which key was sent to each partition, there is assignor function which decides in consumer group, which consumer would handle which partition
Consider your producers create messages for the users of a system and the order of them is important in the user level.
My producers, add messages to the topic which have two partitions and I am using hashing against the user_id to put all the messages of each user in the same partition to guarantee the order.
How can I scale up the system and add more partitions to the topic while keeping the order of the messages?
How Kafka treat the messages that are already produced before partitioning?
What will happen to the messages that consume but not committed back to the Kafka to update the offset?
1.use a treeset(ordered set) cache messages at consumer client, keep 1 minute(or less); kafka only guarantee one partition's order, and I think producer also cannot guarantee order。
2.if you not commit offset manually, in the next fetch request ,will get same message. anyway, at consumer client, you should ensure message idempotency, even you conmmited offset.
I have been studying Apache Kafka for a while now.
Lets consider the following example.
Consider I have a topic with 3 partitions. I have a single producer and single consumer. I am producing my messages without specifying the key attribute.
So i know on the producer side, when i publish a message, the strategy used by kafka to assign a message to either of those partitions would be Round-Robin.
Now, what i want to know is when I start a single consumer belonging to a certain consumer group listening to that same topic, what strategy will it use to pull the messages from the different partitons(as there are 3)?
Would it follow the a similar round-robin model, where it will send a fetch request to a leader of a partition 1, wait for a response, get the response, return the records to process. Then, send a fetch request to the leader of a partition 2 and so on?
If it follows some other strategy/algorithm, I would love to know what it is?
Thank you in advance.
There is no ordering guarantee outside of a partition so in a way that algorithm used is moot to the end user and subject to change.
Today, there is nothing terribly complex that happens in this instance. The protocol shows you that a fetch request includes a partition so you get a fetch per partition. That means the order depends on the consumer. A partition won't be starved because fetch requests will happen for all partitions assigned to the consumer.
During creating kafka producer, we can assign a client id. What is it used for? Can I get the producer client id in a consumer? For example, to see which producer produced the message?
No, a consumer cannot get the producer's client-id.
From the Kaka documentation, client-ids are:
An id string to pass to the server when making requests. The purpose
of this is to be able to track the source of requests beyond just
ip/port by allowing a logical application name to be included in
server-side request logging.
They are only used for identifying clients in the broker logs.
No, you'd have to pass it on as part of the key or value if you need it at the consumer side.
Kafka's philosophy is to decouple producers and consumers. A topic can be read by 0-n consumers and be written to by 0-n producers. Kafka is usually used for communication between (micro)service boundaries where services don't care about who produced a message, just about its contents.
While trying to implement exactly-once semantics, I found this in the official Kafka documentation:
Exactly-once delivery requires co-operation with the destination
storage system but Kafka provides the offset which makes implementing
this straight-forward.
Does this mean that I can use the (topic, partiton, offset) tuple as a unique primary identifier to implement deduplication?
An example implementation would be to use an RDBMS and this tuple as a primary key for an insert operation within a big processing transaction where the transaction fails if the insertion is not possible anymore because of an already existing primary key.
I think the question is equivalent to:
Does a producer use the same offset for a message when retrying to send it after detecting a possible failure or does every retry attempt get its own offset?
If the offset is reused when retrying, consumers obviously see multiple messages with the same offset.
Other question, maybe somehow related:
With single or multiple producers producing to the same topic, can there be "gaps" in the offset number sequence seen by one consumer?
Another possibility could be that the offset is determined e.g. solely by or as recently as the message reaches the leader which does the job (implying that - if not listening to something like a producer's suggested offset - there are probably no gaps/offset jumps, but also different offsets for duplicate messages and I would have to use my own unique identifier within the application's message on application level).
To answer my own question:
The offset is generated solely by the server (more precisely: by the leader of the corresponding partition), not by the producing client. It is then sent back to the producer in the produce response. So:
Does a producer use the same offset for a message when retrying to
send it after detecting a possible failure or does every retry attempt
get its own offset?
No. (See update below!) The producer does not determine offsets and two identical/duplicate application messages can have different offsets. So the offset cannot be used to identify messages for producer deduplication purposes and a custom UID has to be defined in the application message. (Source)
With single or multiple producers producing to the same topic, can there be "gaps" in the offset number sequence seen by one consumer?
Due to the fact that there is only a single leader for every partition which maintains the current offset and the fact that (with the default configuration) this leadership is only transfered to active in-sync replica in case of a failure, I assume that the latest used offset is always communicated correctly when electing a new leader for a partition and therefore there are should not be any offset gaps or jumps initially. However, because of the log compaction feature, there are cases (assuming log compaction being enabled) where there can indeed be gaps in a stream of offsets when consuming already committed messages of a partition once again after the compaction has kicked in. (Source)
Update (Kafka >= 0.11.0)
Starting from Kafka version 0.11.0, producers now additionally send a sequence number with their requests, which is then used by the leader to deduplicate requests by this number and the producer's ID. So with 0.11.0, the precondition on the producer side for implementing exactly once semantics is given by Kafka itself and there's no need to send another unique ID or sequence number within the application's message.
Therefore, the answer to question 1 could now also be yes, somehow.
However, note that exactly once semantics are still only possible with the consumer never failing. Once the consumer can fail, one still has to watch out for duplicate message processings on consumer side.