Kafka - Topic & Partitions & Consumer - apache-kafka

Just wanna understand the basics properly.
Let's say I've a topic called "myTopic" that has 3 partitions P0, P1 & P2.
Each of these partitions will have a leader and the data (messages) for this topic is distributed across these partitions.
1. Will the producer always write to the leader of the partition, in a round-robin fashion based on the load on the broker? Is that right?
2. How does the producer know the leader of the partition?
3. Should a consumer reading a particular topic read all partitions of that topic? Is that correct?
Appreciate your help.

Will the producer always write to the leader of the partition, in a round-robin fashion based on the load on the broker?
By default, yes.
That said, a producer can also use a custom partitioning scheme, i.e. a different strategy for choosing which partition each message is written to.
How does the producer know the leader of the partition?
Through the Kafka protocol: the client requests cluster metadata from a broker, and the response names the current leader of each partition.
Should a consumer reading a particular topic read all partitions of that topic?
By default, yes.
That said, you can also write consumer applications with custom logic, e.g. a "sampling" consumer that only reads from 1 out of N partitions.

Producer always writes to the leader of the partition
Yes, always.
in a round robin fashion based on the load on the broker
No. If a partition is explicitly set on a ProducerRecord, that partition is used. Otherwise, if a custom partitioner implementation is provided, it determines the partition. Otherwise, if the message key is not null, the hash of the key is used, so messages with the same key are consistently sent to the same partition. Only if the message key is null is the message sent to any partition in a round-robin fashion (since Kafka 2.4 the default for null keys is a "sticky" variant that fills a batch for one partition before moving on to the next). However, this is irrespective of the load on the broker.
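To make that order of precedence concrete, here is a minimal Python sketch. It is not Kafka's actual code: the function name choose_partition is made up for illustration, crc32 stands in for the murmur2 hash the real default partitioner uses, and the null-key case is modelled as plain rotation.

```python
import itertools
import zlib

# Shared counter used only for records without a key (hypothetical helper).
_rotation = itertools.count()


def choose_partition(num_partitions, key=None, partition=None,
                     custom_partitioner=None):
    """Mimic the selection order described above (illustrative only)."""
    if partition is not None:
        # 1. An explicit partition on the record always wins.
        return partition
    if custom_partitioner is not None:
        # 2. A user-supplied partitioner decides next.
        return custom_partitioner(key, num_partitions)
    if key is not None:
        # 3. Hash of the key. crc32 is a stand-in, NOT Kafka's murmur2.
        return zlib.crc32(key) % num_partitions
    # 4. Null key: rotate over the partitions, ignoring broker load.
    return next(_rotation) % num_partitions


explicit = choose_partition(3, key=b"user-1", partition=2)
keyed_a = choose_partition(3, key=b"user-1")
keyed_b = choose_partition(3, key=b"user-1")
unkeyed = [choose_partition(3) for _ in range(6)]

print(explicit)            # 2
print(keyed_a == keyed_b)  # True
print(unkeyed)             # [0, 1, 2, 0, 1, 2]
```

Note that nothing in this function looks at broker load, which is the point of the answer above.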
How does the producer know the leader of the partition?
By periodically asking the broker for metadata.
Should a consumer reading a particular topic read all partitions of that topic? Is that correct?
Consumers form consumer groups. If there are multiple consumer instances in a consumer group, each consumes a subset of the partitions. But the consumer group as a whole consumes from all partitions. That is, unless you decide to go "low-level" and manage that yourself, which you can do.
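As a rough illustration of how a group divides partitions, here is a small Python sketch. It is only loosely modelled on a range-style assignment; the function assign_partitions and the consumer names are hypothetical, and in a real cluster the assignment is made through the group protocol, not by your own code.

```python
def assign_partitions(partitions, consumers):
    """Split partitions into contiguous ranges, one range per consumer."""
    members = sorted(consumers)
    base, extra = divmod(len(partitions), len(members))
    assignment, start = {}, 0
    for i, member in enumerate(members):
        # The first `extra` members get one partition more than the rest.
        size = base + (1 if i < extra else 0)
        assignment[member] = partitions[start:start + size]
        start += size
    return assignment


partitions = ["P0", "P1", "P2"]
two = assign_partitions(partitions, ["c1", "c2"])
four = assign_partitions(partitions, ["c1", "c2", "c3", "c4"])

print(two)   # {'c1': ['P0', 'P1'], 'c2': ['P2']}
print(four)  # {'c1': ['P0'], 'c2': ['P1'], 'c3': ['P2'], 'c4': []}
```

With two consumers the group still covers all three partitions between them; with four consumers, one of them ends up with nothing to read.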

Related

Are partitions on different Kafka topics co-located within same consumer (k8s pod)

I have a requirement to read data from partition 1 of topic A and partition 1 of topic B in the same consumer. I have a group of consumers running in different Kubernetes pods. Both topics will have 5 partitions each, and both use a key-based partitioning strategy.
So, assuming partition 1 of topic A and partition 1 of topic B are keyed with the same key value, would they both be co-located on the same consumer or pod? If so, I can cross-reference data from one topic using the key of the other topic's message.
Keys are only relevant to the producer partitioner.
There is no guarantee that a consumer will be assigned the same partitions across two topics. The ConsumerPartitionAssignor linked below is only per-topic. You might get lucky with consumers assigned partitions with the same keys across topics, but after a rebalance, that may no longer be true.
If you must consume the same partition of multiple topics, you can assign() those partitions to the consumer instance rather than subscribe()-ing to the whole topic.
However, if you want to join data across topics, the more appropriate way to do this would be to use Kafka Streams / KSQL joins.
Yes, if you configure routing by key for both topics, the same key will be sent to the same partition number in each. Have a look at the documentation here: https://kafka.apache.org/documentation/#design_loadbalancing
"For example if the key chosen was a user id then all data for a given user would be sent to the same partition. This in turn will allow consumers to make locality assumptions about their consumption. This style of partitioning is explicitly designed to allow locality-sensitive processing in consumers."

kafka should the consumer and producer have knowledge of the partitions

I'm trying to wrap my head around Kafka, and the thing that confuses me is the partitions. In all/most of the examples I have seen, the consumers/producers seem to have implicit knowledge of the partitions (which partition to write messages to, which partition to read messages from). Is this correct? I initially thought that partitions are internal to the system and that consumers/producers don't need to know partition information. If they do need to know partition information, then aren't we exposing the inner structure of the topic to the outside world to a certain extent?
In Kafka, every partition of a topic has a set of brokers, and at most one leader broker per partition. Within a consumer group, you cannot usefully have more consumers than partitions, because the extra consumers would be inactive. A single consumer can read multiple partitions, but a single partition cannot be read by multiple consumers of the same group. So the number of partitions must be chosen according to the throughput you expect. The number of partitions of a topic can be increased, but never decreased. When consumers read a partition, they actually connect to the leader broker of that partition to consume messages.
The partition leader can change, however. In that case the consumer gets an error and has to request fresh metadata from the cluster (any broker can answer a metadata request, not only the controller) to learn the new partition leader. At consumer startup, partitions are assigned according to the Kafka parameter partition.assignment.strategy. Of course, if consumers of the same consumer group start at different times, there will be a partition rebalance.
In the end, the client needs a lot of information about the structure of the Kafka cluster.

Kafka default partitioner behavior when number of producers more than partitions

From the kafka faq page
In Kafka producer, a partition key can be specified to indicate the destination partition of the message. By default, a hashing-based partitioner is used to determine the partition id given the key
So all the messages with a particular key will always go to the same partition in a topic:
How does the consumer know which partition the producer wrote to, so it can consume directly from that partition?
If there are more producers than partitions, and multiple producers are writing to the same partition, how are the offsets ordered so that the consumers can consume messages from specific producers?
How does the consumer know which partition the producer wrote to
Doesn't need to, or at least shouldn't, as this would create a tight coupling between clients. All consumer instances should be responsible for handling all messages for the subscribed topic. While you can assign a Consumer to a list of TopicPartition instances, and you can call the methods of the DefaultPartitioner for a given key to find out what partition it would have gone to, I've personally not run across a need for that. Also, keep in mind, that Producers have full control over the partitioner.class setting, and do not need to inform Consumers about this setting.
If there are more producers than partitions, and multiple producers are writing to the same partition, how are the offsets ordered...
The number of producers or partitions doesn't matter. Batches are written sequentially to partitions. You can limit the number of unacknowledged requests a Producer client keeps in flight per broker connection with max.in.flight.requests.per.connection (and you only need one Producer instance per application), but for separate applications you of course cannot control any ordering.
so that the consumers can consume messages from specific producers?
Again, this should not be done.
Kafka is a distributed event streaming platform. One of its use cases is decoupling producers from consumers: one application (the producer) writes messages to topics, and another application (the consumer) reads from those topics.
If you have more than one producer, the order of the data in a topic partition is not guaranteed between producers; it is simply the order in which the messages were written to the partition. (Even with a single producer there can be ordering issues; read about the idempotent producer.)
Assigning an offset is an atomic action, which guarantees that no two messages in a partition get the same offset.
The offset is a running number; it has meaning only within a specific topic and a specific partition.
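A toy model may help here. The sketch below treats each partition as an append-only list and hands out the next index as the offset; the Topic class and the producer names are invented for illustration and say nothing about how the broker is really implemented.

```python
class Topic:
    """Toy model: each partition is an append-only list."""

    def __init__(self, num_partitions):
        self.logs = [[] for _ in range(num_partitions)]

    def append(self, partition, producer, value):
        log = self.logs[partition]
        offset = len(log)  # next free position in this partition only
        log.append((producer, value))
        return offset


topic = Topic(num_partitions=2)
offsets = [
    topic.append(0, "producer-A", "a1"),
    topic.append(0, "producer-B", "b1"),
    topic.append(1, "producer-A", "a2"),
    topic.append(0, "producer-A", "a3"),
]

print(offsets)        # [0, 1, 0, 2]
print(topic.logs[0])  # [('producer-A', 'a1'), ('producer-B', 'b1'), ('producer-A', 'a3')]
```

Offset 0 appears twice because it was handed out by two different partitions, and within partition 0 the two producers are simply interleaved in arrival order.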
If you use the default partitioner, the murmur2 algorithm decides which partition a message is sent to. When you send a record that contains a key, the partitioner in the producer hashes the key, and the hash modulo the number of partitions is the partition that the key is sent to. Every producer uses the same murmur2 function, so for the same key, different producers keep getting the same partition.
The consumer is assigned or subscribed to topic partitions; it does not know which key was sent to which partition. Within a consumer group, an assignor function decides which consumer handles which partition.

How is message sequence preserved for topic with many partitions?

I'm looking for any information or explanation of how Kafka maintains message sequence when messages are written to a topic with multiple partitions.
For example, I have multiple message producers, each producing messages sequentially and writing to a Kafka topic with more than one partition. In this case, how will a consumer group consume the messages?
Kafka only provides a total order over records within a partition, not between different partitions in a topic. Even within one partition, you can still encounter out-of-order events if retries are enabled and max.in.flight.requests.per.connection is larger than 1.
A workaround is to create a topic with only one partition, although that means only one consumer process per consumer group can consume it.
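To see why retries combined with more than one in-flight request can reorder records inside a single partition, here is a deliberately simplified Python simulation. The deliver function and the batch names are hypothetical, and the model ignores everything the real producer does beyond "send a window of batches, retry the ones that failed".

```python
def deliver(batches, fail_first_attempt, max_in_flight):
    """Return the order in which batches land in the partition log."""
    log, pending = [], list(batches)
    failed_once = set()
    while pending:
        # Send up to max_in_flight batches without waiting for acks.
        window, pending = pending[:max_in_flight], pending[max_in_flight:]
        retries = []
        for batch in window:
            if batch in fail_first_attempt and batch not in failed_once:
                failed_once.add(batch)  # first attempt is lost
                retries.append(batch)   # will be re-sent later
            else:
                log.append(batch)       # broker appends it
        pending = retries + pending
    return log


batches = ["b1", "b2", "b3"]
reordered = deliver(batches, fail_first_attempt={"b1"}, max_in_flight=2)
in_order = deliver(batches, fail_first_attempt={"b1"}, max_in_flight=1)

print(reordered)  # ['b2', 'b1', 'b3']
print(in_order)   # ['b1', 'b2', 'b3']
```

With two batches in flight, b2 is written while b1 is still waiting for its retry; with one in flight, b1 is retried before b2 is ever sent.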
Kafka will store messages in the partitions according to the message key given to the producer. If none is given, then the messages will be written in a round-robin style into the partitions. To keep ordering for a topic, you need to make sure the ordered sequence has the same key, or that the topic has only one partition.

Kafka distributing messages from a partition among consumers

I have a Kafka topic which currently has 3 partitions. I want my consumers to read from the same partition but each message should go to a different consumer in a round-robin fashion. Is it possible to achieve this?
In order to do that, you have to use a consumer group. It's provided out of the box with Kafka. You just have to specify the same group.id for your three consumers.
[edit] However, each consumer will read from a different Kafka partition. I don't think it is possible to make different consumers of the same group read from the same partition if you're using only the Kafka API.
See more in the documentation: http://kafka.apache.org/documentation.html#intro_consumers
How about this: at the producer, route the messages based on some key. It is possible to route message 1 to partition 1, message 2 to partition 2, and message 3 to partition 3. Then put three consumers in one group, so that consumer 1 consumes partition 1, consumer 2 consumes partition 2, and consumer 3 consumes partition 3.
By the way, how to implement it depends on which Kafka client you are using and what the messages are. You should give more details.
What you are describing defeats the purpose of partitions. Partitions are not designed for simple load balancing in Kafka. If you really want that, you have two options.
If you have control over the producer writing to the topic, do a simple mod 3 hash partitioning, so that the messages are distributed equally across the 3 partitions. Each of your consumers then consumes from one partition, which effectively means each consumer reads every third message. That solves your problem.
If you cannot control the producer, consume from the topic in the normal way, then write a producer with simple mod 3 hash partitioning that republishes the messages to a new topic. Consume from that new topic, and you are back in the first case.
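A small Python sketch of the mod 3 idea, assuming the producer can attach a running sequence number to each message. The function mod_partitioner and the message names are made up for illustration; with a real client you would plug equivalent logic into a custom partitioner or set the partition on each record.

```python
def mod_partitioner(sequence_number, num_partitions=3):
    """Pick the partition from a running message counter."""
    return sequence_number % num_partitions


messages = [f"m{i}" for i in range(9)]
partitions = {0: [], 1: [], 2: []}
for i, message in enumerate(messages):
    partitions[mod_partitioner(i)].append(message)

# One consumer per partition: each one sees every third message.
print(partitions[0])  # ['m0', 'm3', 'm6']
print(partitions[1])  # ['m1', 'm4', 'm7']
print(partitions[2])  # ['m2', 'm5', 'm8']
```

Keep in mind that ordering across the three consumers is no longer guaranteed, since each partition is consumed independently.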