I am implementing a Kafka producer for a single topic with multiple partitions. I choose the partition for each message based on a particular value in the message (the feedName property of the message JSON), and I maintain an SQL table for the feedName to partition ID mapping. My question is: will the partition ID be the same for the leader as well as the replicas?
If it is different, how can I identify a partition uniquely across all brokers?
The partition ID is the same across all brokers. If it were not, things would get really confusing.
Partition IDs are maintained in ZooKeeper, and all brokers have access to ZooKeeper. This is what it's used for: so that all the brokers have the same view of topics and partitions (and brokers, for that matter).
A partition ID identifies a partition, and a partition is an immutable sequence of messages. You can find the same in the Kafka documentation:
Each partition is an ordered, immutable sequence of messages that is continually appended to—a commit log.
In your use case you don't need to worry about the mapping between partition ID and feedName.
Hope this helps!
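Because the partition ID belongs to the topic rather than to any one broker, a feedName lookup only has to return a single integer per feed. Here is a minimal sketch of such a lookup using an in-memory SQLite table. The table name feed_partition, its columns, and the sample feed names are assumptions, since the question does not show the real schema.

```python
import sqlite3

# Hypothetical schema: the question does not show the real table or columns.
def build_mapping(conn, rows):
    conn.execute(
        "CREATE TABLE feed_partition ("
        "feed_name TEXT PRIMARY KEY, partition_id INTEGER NOT NULL)"
    )
    conn.executemany("INSERT INTO feed_partition VALUES (?, ?)", rows)
    conn.commit()

def partition_for(conn, feed_name):
    """Return the partition ID stored for feed_name, or None if unmapped."""
    row = conn.execute(
        "SELECT partition_id FROM feed_partition WHERE feed_name = ?",
        (feed_name,),
    ).fetchone()
    return row[0] if row else None

conn = sqlite3.connect(":memory:")
build_mapping(conn, [("sports", 0), ("weather", 1), ("finance", 2)])
print(partition_for(conn, "weather"))
```

The integer returned here is what would be set as the explicit partition on the outgoing record. It refers to the same partition whichever broker currently leads it.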
I have a requirement to read data from partition 1 of topic A and partition 1 of topic B in the same consumer. I have a group of consumers running in different Kubernetes pods. Both topics have 5 partitions each, and both use a key-based partitioning strategy.
So, assuming partition 1 of topic A and partition 1 of topic B hold the same key values, would they both be colocated on the same consumer or pod? If so, I can cross-reference data from one topic using the key of the other topic's message.
Keys are only relevant to the producer partitioner.
There is no guarantee that a consumer will be assigned the same partitions across two topics. The ConsumerPartitionAssignor works only per topic. You might get lucky and have consumers assigned the partitions with the same keys across topics, but after a rebalance that may no longer be true.
If you must consume the same partition of multiple topics, you can assign() those partitions to the consumer instance rather than subscribe()-ing to the whole topics.
However, if you want to join data across topics, the more appropriate way to do this would be to use Kafka Streams / KSQL joins.
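To see why colocation can hold at one moment and break after a rebalance, here is a small sketch of a round-robin style assignment. It is a simplified model written for illustration, not Kafka's assignor code, and the pod names are made up.

```python
def round_robin_assign(consumers, topics):
    """Deal all topic-partitions out to the sorted consumers in circular order.

    topics maps topic name -> partition count. Simplified model of a
    round-robin style assignor, not Kafka's actual implementation.
    """
    members = sorted(consumers)
    partitions = [(t, p) for t in sorted(topics) for p in range(topics[t])]
    assignment = {m: [] for m in members}
    for i, tp in enumerate(partitions):
        assignment[members[i % len(members)]].append(tp)
    return assignment

def owner(assignment, topic, partition):
    """Return the consumer that was given (topic, partition)."""
    return next(m for m, tps in assignment.items() if (topic, partition) in tps)

topics = {"A": 5, "B": 5}

# Five pods: partition 1 of both topics happens to land on the same pod.
five = round_robin_assign([f"pod-{i}" for i in range(1, 6)], topics)
print(owner(five, "A", 1), owner(five, "B", 1))

# After scaling down to two pods, the same two partitions are split up.
two = round_robin_assign(["pod-1", "pod-2"], topics)
print(owner(two, "A", 1), owner(two, "B", 1))
```

In this model the five-pod layout colocates A-1 and B-1 only by coincidence of the counts, which is the "you might get lucky" case. Manual assign() avoids depending on that coincidence.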
Yes, if you configure routing by key for both topics, and both topics have the same number of partitions, the same key will be sent to the same partition number in each topic. Have a look at the documentation here: https://kafka.apache.org/documentation/#design_loadbalancing
"For example if the key chosen was a user id then all data for a given user would be sent to the same partition. This in turn will allow consumers to make locality assumptions about their consumption. This style of partitioning is explicitly designed to allow locality-sensitive processing in consumers."
I'm trying to wrap my head around Kafka, and the thing that confuses me is partitions. In most of the examples I have seen, the consumers/producers seem to have implicit knowledge of the partitions (which partition to write messages to, which partition to read messages from). Is this correct? I initially thought that partitions were internal to the system and that consumers/producers don't need to know partition information. If they do need to know it, aren't we exposing the inner structure of the topic to the outside world to a certain extent?
In Kafka, every partition of a topic is replicated across a set of brokers, with at most one leader broker per partition. Within a consumer group you cannot usefully have more consumers than partitions, because the extra consumers would sit idle. A single consumer can read multiple partitions, but within a group a partition cannot be read by multiple consumers. So the number of partitions must be chosen according to the throughput you expect. The number of partitions of a topic can be increased, but never decreased. When consumers read from a partition, they actually connect to the leader broker of that partition.
The partition leader can change, however. The consumer then gets an error and has to send a metadata request to the cluster to find the new partition leader. At consumer startup, partitions are assigned according to the consumer parameter partition.assignment.strategy. And if consumers in the same consumer group start at different times, there will be a partition rebalance.
In short, a client does need quite a lot of information about the Kafka cluster structure.
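The leader-change handling described above can be sketched with a toy model. ToyCluster, ToyConsumer, and NotLeaderError are hypothetical names invented for this illustration and are not part of any Kafka client. The sketch only shows the pattern of caching leader metadata, hitting an error when it goes stale, and refreshing before a retry.

```python
class NotLeaderError(Exception):
    """Raised by the toy cluster when a broker no longer leads the partition."""

class ToyCluster:
    """Hypothetical stand-in for a cluster: tracks one leader per partition."""
    def __init__(self, leaders):
        self.leaders = dict(leaders)          # partition -> broker id

    def metadata(self):
        return dict(self.leaders)             # snapshot of the current leaders

    def fetch(self, broker, partition):
        if self.leaders[partition] != broker:
            raise NotLeaderError(partition)
        return f"records from partition {partition} via broker {broker}"

class ToyConsumer:
    def __init__(self, cluster):
        self.cluster = cluster
        self.cached = cluster.metadata()      # leaders as known at startup
        self.refreshes = 0

    def poll(self, partition):
        try:
            return self.cluster.fetch(self.cached[partition], partition)
        except NotLeaderError:
            # Stale cache: refresh metadata, then retry against the new leader.
            self.cached = self.cluster.metadata()
            self.refreshes += 1
            return self.cluster.fetch(self.cached[partition], partition)

cluster = ToyCluster({0: 101, 1: 102})
consumer = ToyConsumer(cluster)
cluster.leaders[0] = 103                      # leadership moves to broker 103
print(consumer.poll(0))
```

Real clients do this bookkeeping internally, which is why application code rarely deals with leaders directly even though the client library tracks them.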
I have two Kafka topics on the same brokers. Both topics use the same UUID as the partitioning key, and the UUID determines which consumer the records get sent to. If the same UUIDs are used across both topics, does that guarantee the records for both topics arrive at the same consumers? I assume not.
If the topics have the same number of partitions, then the partitioner logic will map records with the same key to the same partition number in each topic.
If you're simply subscribing consumers to topics rather than using specific partition assignments, then there are no guarantees about which consumer reads which partitions.
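The dependence on partition count can be checked with a short sketch. CRC32 is used here only as a stand-in for a stable hash, because Kafka's default partitioner uses murmur2, so the partition numbers will not match what a real producer computes. The keys are made-up placeholders for UUIDs.

```python
import zlib

def partition_for_key(key, num_partitions):
    """Stable key -> partition mapping (CRC32 stand-in, not Kafka's murmur2)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def aligned(keys, partitions_a, partitions_b):
    """True if every key maps to the same partition number in both topics."""
    return all(
        partition_for_key(k, partitions_a) == partition_for_key(k, partitions_b)
        for k in keys
    )

keys = [f"id-{i}" for i in range(100)]   # placeholder keys standing in for UUIDs

print(aligned(keys, 5, 5))   # both topics have 5 partitions
print(aligned(keys, 5, 8))   # partition counts differ
```

With equal partition counts the mapping lines up for every key. With different counts it generally does not, and in either case this says nothing about which consumer ends up reading each partition.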
I'm getting started with Kafka and trying to understand exactly what would happen in a case when a partition of a topic in a Kafka cluster fills up beyond its limit. I understand that a partition resides in the same node but different partitions reside in different nodes. If I use a custom partition logic in the Kafka producer where a certain key always goes to a certain partition, then what happens when that partition becomes full? Are messages with this key sent to a random partition?
No, messages with that key keep going to the same partition. Kafka cleans up the partition by deleting old messages once they exceed the retention limits, which makes room for new messages. This "retention" policy is controlled by the "log.retention.*" broker parameters.
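Time-based retention can be illustrated with a toy log. This is a model for intuition only: Kafka deletes whole log segments rather than individual records, and ToyPartitionLog and its retention_ms argument are names invented for this sketch.

```python
from collections import deque

class ToyPartitionLog:
    """Toy append-only log with time-based retention (a model, not Kafka code)."""
    def __init__(self, retention_ms):
        self.retention_ms = retention_ms
        self.records = deque()                # (offset, timestamp_ms, value)
        self.next_offset = 0

    def append(self, value, now_ms):
        self.records.append((self.next_offset, now_ms, value))
        self.next_offset += 1

    def enforce_retention(self, now_ms):
        # Drop records from the head while they are older than the window.
        while self.records and now_ms - self.records[0][1] > self.retention_ms:
            self.records.popleft()

log = ToyPartitionLog(retention_ms=1000)
log.append("a", now_ms=0)
log.append("b", now_ms=600)
log.append("c", now_ms=1200)
log.enforce_retention(now_ms=1300)
print([offset for offset, _, _ in log.records], log.next_offset)
```

Only the oldest record is dropped, and offsets keep increasing, so new messages for the same key continue to land in the same partition.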
Just wanna understand the basics properly.
Let's say I've a topic called "myTopic" that has 3 partitions P0, P1 & P2.
Each of these partitions will have a leader and the data (messages) for this topic is distributed across these partitions.
1. Producer will always writes to the leader of the partition in a round robin fashion based on the load on the broker. Is that right?
2. How do the producer know the leader of the partition?
3. Consumer reading a particular topic should read all partitions of that topic? Is that correct?
Appreciate your help.
Producer will always writes to the leader of the partition in a round robin fashion based on the load on the broker. Is that right?
By default, yes.
That said, a producer can also decide to use a custom partitioning scheme, i.e. a different strategy for choosing which partitions data is written to.
How do the producer know the leader of the partition?
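Through the Kafka protocol: the producer requests cluster metadata from a broker, and that metadata includes the current leader of each partition.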
Consumer reading a particular topic should read all partitions of that topic? Is that correct?
By default, yes.
That said, you can also write consumer applications with custom logic, e.g. a "sampling" consumer that only reads from 1 out of N partitions.
Producer will always writes to the leader of the partition
Yes, always.
in a round robin fashion based on the load on the broker
No. If a partition is explicitly set on a ProducerRecord, then that partition is used. Otherwise, if a custom partitioner implementation is provided, that determines the partition. Otherwise, if the msg key is not null, the hash of the key is used to consistently send msgs with the same key to the same partition. Only if the msg key is null is the msg sent to any partition in a round-robin fashion. Even then, this is irrespective of the load on the broker.
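The precedence described above can be written as one small function. This is a sketch of the decision order only, not the producer's real code: choose_partition and its parameters are hypothetical, CRC32 stands in for the murmur2 hash Kafka uses, and the null-key branch follows the round-robin behaviour described in this answer (newer client versions may handle null keys differently).

```python
import itertools
import zlib

_round_robin = itertools.count()   # shared counter for records without a key

def choose_partition(num_partitions, key=None, partition=None,
                     custom_partitioner=None):
    """Pick a partition following the precedence described in the answer."""
    if partition is not None:                  # 1. explicit partition on the record
        return partition
    if custom_partitioner is not None:         # 2. user-supplied partitioner
        return custom_partitioner(key, num_partitions)
    if key is not None:                        # 3. hash of the key
        # CRC32 is a stand-in; Kafka's default partitioner hashes with murmur2.
        return zlib.crc32(key.encode("utf-8")) % num_partitions
    return next(_round_robin) % num_partitions # 4. null key: round-robin

print(choose_partition(3, key="user-1", partition=2))        # explicit wins
print(choose_partition(3, key="user-1") ==
      choose_partition(3, key="user-1"))                     # same key, same partition
print([choose_partition(3) for _ in range(4)])               # null keys rotate
```

Nothing in this decision looks at broker load, which is the point the answer makes.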
How do the producer know the leader of the partition?
By periodically asking the broker for metadata.
Consumer reading a particular topic should read all partitions of that topic? Is that correct?
Consumers form consumer groups. If there are multiple consumer instances in a consumer group, each consumes a subset of the partitions. But the consumer group as a whole consumes from all partitions. That is, unless you decide to go "low-level" and manage that yourself, which you can do.