Kafka partitioner question, two topics same partition key

I have two Kafka topics on the same brokers. Both topics use the same UUID as the partition key, and the UUID determines which consumer the records get sent to. If the same UUIDs are used across both topics, does that guarantee that the records for both topics arrive at the same consumers? I assume not.

If the topics have the same number of partitions, then the partitioner logic will map records with the same key to the same partition number in each topic.
If you're simply subscribing consumers to the topics rather than using specific partition assignments, though, there is no guarantee of which consumer reads which partitions.
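To make the first point concrete, here's a minimal sketch, assuming both producers use the default partitioner and the key is the UUID string (the UUID value and partition count of 5 are placeholders). The default partitioner hashes the key bytes with murmur2 and takes the result modulo the partition count, so two topics with the same partition count map a given key to the same partition number:

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

public class KeyToPartitionDemo {
    // Mirrors the default partitioner's keyed path:
    // murmur2(keyBytes), forced positive, modulo the partition count.
    static int partitionForKey(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    public static void main(String[] args) {
        String uuid = "f47ac10b-58cc-4372-a567-0e02b2c3d479"; // example key
        // Same key and same partition count -> same partition number in both topics.
        System.out.println("topicA -> partition " + partitionForKey(uuid, 5));
        System.out.println("topicB -> partition " + partitionForKey(uuid, 5));
    }
}
```

Which consumer ends up with that partition is a separate matter, decided by the group's assignor, not by the key.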

Related

Are partitions on different Kafka topics co-located within same consumer (k8s pod)

I have a requirement where I want to be able to read data from partition 1 of topic A and partition 1 of topic B with the same consumer. I have a group of consumers running in different Kubernetes pods. Both topics have 5 partitions each and use a key-based partitioning strategy.
So, assuming partition 1 on topic A and partition 1 on topic B are keyed with the same key value, would they both be co-located on the same consumer or pod? If that's the case, then I can cross-reference data from one topic using the key of the other topic's message.
Keys are only relevant to the producer partitioner.
There is no guarantee that a consumer will be assigned the same partitions across two topics. The built-in ConsumerPartitionAssignor implementations operate per-topic. You might get lucky with consumers assigned partitions with the same keys across topics, but after a rebalance, it will no longer be true.
If you must consume the same partition of multiple topics, you can assign() those specific TopicPartitions to the consumer instance rather than subscribe()-ing to the whole topic, as in the sketch below.
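A minimal sketch of that manual assignment (broker address, topic names, and the class name are placeholders):

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SamePartitionTwoTopics {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("enable.auto.commit", "false"); // no group.id, so no offset commits

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Manual assignment: no consumer group, no rebalancing, so this
            // instance is guaranteed to own partition 1 of both topics.
            consumer.assign(Arrays.asList(
                    new TopicPartition("topicA", 1),
                    new TopicPartition("topicB", 1)));
            while (true) {
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofMillis(500))) {
                    System.out.printf("%s[%d] key=%s value=%s%n",
                            r.topic(), r.partition(), r.key(), r.value());
                }
            }
        }
    }
}
```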
However, if you want to join data across topics, the more appropriate way to do this would be to use Kafka Streams / KSQL joins.
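For completeness, a rough sketch of a Kafka Streams join over two co-partitioned topics (application id, topic names, and the 5-minute window are placeholders; String keys and values are assumed):

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;

public class CoPartitionedJoin {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "join-demo"); // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> a = builder.stream("topicA"); // both topics must have
        KStream<String, String> b = builder.stream("topicB"); // the same partition count

        // Streams checks co-partitioning at startup and assigns matching
        // partitions of both topics to the same stream task.
        a.join(b,
               (left, right) -> left + "|" + right,
               JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(5)))
         .to("joined-output");

        new KafkaStreams(builder.build(), props).start();
    }
}
```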
Yes, if you configure routing by key for both topics, the same key will be sent to the same partition. Have a look at the documentation here: https://kafka.apache.org/documentation/#design_loadbalancing
"For example if the key chosen was a user id then all data for a given user would be sent to the same partition. This in turn will allow consumers to make locality assumptions about their consumption. This style of partitioning is explicitly designed to allow locality-sensitive processing in consumers."

Kafka default partitioner behavior when number of producers more than partitions

From the Kafka FAQ page:
In the Kafka producer, a partition key can be specified to indicate the destination partition of the message. By default, a hashing-based partitioner is used to determine the partition id given the key.
So all the messages with a particular key will always go to the same partition in a topic:
How does the consumer know which partition the producer wrote to, so it can consume directly from that partition?
If there are more producers than partitions, and multiple producers are writing to the same partition, how are the offsets ordered so that the consumers can consume messages from specific producers?
How does the consumer know which partition the producer wrote to
Doesn't need to, or at least shouldn't, as this would create a tight coupling between clients. All consumer instances should be responsible for handling all messages of the subscribed topic. While you can assign a consumer to a list of TopicPartition instances, and you can call the methods of the DefaultPartitioner for a given key to find out which partition it would have gone to, I've personally not run across a need for that. Also keep in mind that producers have full control over the partitioner.class setting and do not need to inform consumers about it.
If there are more producers than partitions, and multiple producers are writing to the same partition, how are the offsets ordered...
The number of producers or partitions doesn't matter. Batches are written to partitions sequentially. You can limit the number of batches sent at once per producer client (and you only need one instance per application) with max.in.flight.requests.per.connection, but for separate applications you of course cannot control any relative ordering.
so that the consumers can consume messages from specific producers?
Again, this should not be done.
Kafka is distributed event streaming; one of its use cases is decoupling services, from producers to consumers: one application produces messages to topics, and another application reads from those topics.
If you have more than one producer, the order of the data in the topic partition is not guaranteed between producers; it will be the order in which the messages were written to the topic. (Even with one producer there might be ordering issues; read about the idempotent producer.)
Offset assignment is atomic, which guarantees that no two messages get the same offset.
The offset is a running number; it has meaning only within a specific topic and a specific partition.
If you are using the default partitioner, the murmur2 algorithm decides which partition each message is sent to. When a record that contains a key is sent to Kafka, the partitioner in the producer runs the hash function on the key, and the returned value is the number of the partition that the key will be sent to. Since it is the same murmur2 function everywhere, you will keep getting the same partition value for the same key, even with different producers.
The consumer is assigned/subscribed to topic partitions; it does not know which key was sent to each partition. Within a consumer group, an assignor function decides which consumer will handle which partition.

How does co-partitioning ensure that partitions from 2 different topics end up assigned to the same Kafka Streams task?

While I understand the prerequisite of having co-partitioning as explained here: Why does co-partitioning of two Kstreams in kafka require same number of partitions for both the streams?, I do not understand the mechanism that makes sure that the partitions of each topic that have the same key get assigned to the same Kafka Streams task. I do not see how Kafka's consumer group would enable that.
The way I understand it is that we have 2 independent consumer groups, which actually may have the same name because it is the same Kafka Streams application, although the subscription to each topic is independent of the other.
Somehow, the consumers in each consumer group get assigned to partitions that contain the same key. I did not know that consumer assignment to partitions could be related to the content of the partitions. So far I thought it was random.
Can someone explain that part?
The way I understand it is that we have 2 independent consumer groups, which actually may have the same name because it is the same Kafka Streams application, although the subscription to each topic is independent of the other.
All members of a consumer group have the same "name" (ie, group.id) -- it is not possible to have two consumer groups with the same name. It would be one consumer group.
although the subscription to each topic is independent of the other
For KafkaConsumer it's possible to have different subscriptions for different members in the group (even if this should be a very rare scenario). For Kafka Streams, however, it is required that all members of the group (ie, application instances) execute the exact same Topology with the exact same input topics (ie, their subscriptions must be the same).
I did not know that consumer assignment to partitions could be related to the content of the partitions. So far I thought it was random.
That is correct.
From your own answer:
In other words, if the number of partitions is the same, and the partitioning strategy of each producer to the topics is the same, messages with the same key will be placed in the same way across each topic's partition range, and those partitions are assigned to consumers in the same way, i.e. as a consecutive subset of partitions from each topic. Hence the same stream task will always hold the partitions of both topics that carry the same keys.
That is also correct.
Note that Kafka Streams uses a special partition assignor (not the default ones the consumer offers) to ensure co-partitioning, stickiness (ie, state-store awareness), and to assign standby tasks.
After refreshing my memory, I found the following two statements that explain it all:
A consumer group has a unique id. Each consumer group is a subscriber to one or more Kafka topics.
Hence a consumer group may involve multiple topics, with their partitions, and a strategy to assign them to the consumers of the group.
partition.assignment.strategy (from Kafka: The Definitive Guide)
A PartitionAssignor is a class that, given consumers and topics they subscribed to, decides which partitions will be assigned to which consumer. By default, Kafka has two assignment strategies:
Range: Assigns to each consumer a consecutive subset of partitions from each topic it subscribes to. So if consumers C1 and C2 are subscribed to two topics, T1 and T2, and each of the topics has three partitions, then C1 will be assigned partitions 0 and 1 from topics T1 and T2, while C2 will be assigned partition 2 from those topics. Because each topic has an uneven number of partitions and the assignment is done for each topic independently, the first consumer ends up with more partitions than the second. This happens whenever Range assignment is used and the number of consumers does not divide the number of partitions in each topic neatly.
In other words, if the number of partitions is the same, and the partitioning strategy of each producer to the topics is the same, messages with the same key will be placed in the same way across each topic's partition range, and those partitions are assigned to consumers in the same way, i.e. as a consecutive subset of partitions from each topic. Hence the same stream task will always hold the partitions of both topics that carry the same keys.
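If you want to pin that behavior down with plain consumers, the strategy can be set explicitly. A sketch (broker address, group id, and topic names are placeholders):

```java
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RangeAssignedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");    // placeholder
        props.put("group.id", "co-partitioned-readers");     // placeholder group id
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        // Range assignment hands each member the same partition numbers from
        // every subscribed topic, e.g. {T1-0, T1-1, T2-0, T2-1} to one member.
        props.put("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.RangeAssignor");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Arrays.asList("T1", "T2"));
        // ... poll as usual; co-partitioned keys now arrive at the same member.
    }
}
```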

Kafka - Topic & Partitions & Consumer

I just want to understand the basics properly.
Let's say I have a topic called "myTopic" that has 3 partitions P0, P1 & P2.
Each of these partitions will have a leader, and the data (messages) for this topic is distributed across these partitions.
1. The producer always writes to the leader of the partition, in a round-robin fashion based on the load on the broker. Is that right?
2. How does the producer know the leader of the partition?
3. A consumer reading a particular topic should read all partitions of that topic? Is that correct?
Appreciate your help.
The producer always writes to the leader of the partition, in a round-robin fashion based on the load on the broker. Is that right?
By default, yes.
That said, a producer can also decide to use a custom partitioning scheme, i.e. a different strategy for deciding which partitions data is written to.
How does the producer know the leader of the partition?
Through the Kafka protocol.
A consumer reading a particular topic should read all partitions of that topic? Is that correct?
By default, yes.
That said, you can also implement consumer applications with custom logic, e.g. a "sampling" consumer that only reads from 1 out of N partitions.
The producer always writes to the leader of the partition
Yes, always.
in a round-robin fashion based on the load on the broker
No. If a partition is explicitly set on a ProducerRecord, then that partition is used. Otherwise, if a custom partitioner implementation is provided, it determines the partition. Otherwise, if the message key is not null, the hash of the key is used to consistently send messages with the same key to the same partition. Only if the message key is null will the message be sent to partitions in a round-robin fashion. In all cases, this is irrespective of the load on the broker.
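A sketch of those cases using the standard ProducerRecord constructors (topic name, key, and values are placeholders):

```java
import org.apache.kafka.clients.producer.ProducerRecord;

public class PartitionSelectionDemo {
    public static void main(String[] args) {
        // 1. Explicit partition: always wins; the key is not used for routing.
        ProducerRecord<String, String> explicit =
                new ProducerRecord<>("myTopic", 2, "some-key", "value");

        // 2./3. No partition, non-null key: the (custom or default) partitioner
        //       hashes the key, so equal keys always land on the same partition.
        ProducerRecord<String, String> keyed =
                new ProducerRecord<>("myTopic", "some-key", "value");

        // 4. No partition, null key: spread across partitions (round-robin /
        //    sticky batching in newer clients), regardless of broker load.
        ProducerRecord<String, String> unkeyed =
                new ProducerRecord<>("myTopic", "value");

        System.out.println(explicit.partition()); // 2
        System.out.println(keyed.partition());    // null until the partitioner runs
        System.out.println(unkeyed.partition());  // null until the partitioner runs
    }
}
```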
How does the producer know the leader of the partition?
By periodically asking the broker for metadata.
A consumer reading a particular topic should read all partitions of that topic? Is that correct?
Consumers form consumer groups. If there are multiple consumer instances in a consumer group, each consumes a subset of the partitions, but the consumer group as a whole consumes from all partitions. That is, unless you decide to go "low-level" and manage the assignment yourself, which you can do.
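A minimal group-consumer sketch (broker address and group id are placeholders): run one copy and it polls all three partitions of myTopic; run a second copy with the same group.id and the group coordinator splits P0..P2 between the two instances.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupConsumerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "myTopic-readers"); // same id in every instance
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("myTopic"));
            while (true) {
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofMillis(500))) {
                    System.out.printf("p%d@%d: %s%n", r.partition(), r.offset(), r.value());
                }
            }
        }
    }
}
```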

How is message sequence preserved for topic with many partitions?

I would like information/an explanation on how Kafka maintains message ordering when messages are written to a topic with multiple partitions.
For example, I have multiple message producers, each producing messages sequentially and writing to a Kafka topic with more than one partition. In this case, how will the consumer group consume the messages?
Kafka only provides a total order over records within a partition, not between different partitions in a topic. Even within one partition, you can still encounter out-of-order events if retries is enabled and max.in.flight.requests.per.connection is larger than 1.
A work-around is to create the topic with only one partition, although that means only one consumer process per consumer group.
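If a single partition is too restrictive, the usual producer-side knobs for per-partition ordering look like this (a sketch; the topic, key, and broker address are placeholders, and enabling idempotence is one reasonable combination, not the only one):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderedProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // Idempotence keeps retries from reordering or duplicating writes
        // within a partition (it requires acks=all and caps in-flight requests at 5).
        props.put("enable.idempotence", "true");
        props.put("acks", "all");
        // Stricter alternative without idempotence:
        // props.put("max.in.flight.requests.per.connection", "1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 5; i++) {
                // Same key -> same partition -> ordered relative to each other.
                producer.send(new ProducerRecord<>("events", "order-123", "step-" + i));
            }
        }
    }
}
```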
Kafka stores messages in the partitions according to the message key given to the producer. If none is given, the messages are written to the partitions in a round-robin style. To keep ordering for a topic, you need to make sure the ordered sequence has the same key, or that the topic has only one partition.