Are data split across partitions? - apache-kafka

I have read the Kafka documentation, but I am still confused when people talk about data and partitions.
The documentation says that a client sends a message to a partition.
The partition then replicates the message to its replicas (across brokers).
And a consumer reads data from a partition.
I have a topic with 2 partitions.
Let's say I have one producer, which sends messages to partition#1.
But I have 2 consumers: one reads from partition#1 and the other from partition#2.
Does this mean that partition#1 will hold 50% of the messages and partition#2 the other 50%? Or, when a client sends data to partition#1, is that data replicated not only across brokers but also across partitions?

About your specific example ...
If your producer sends messages without a key, the default partitioner (in the producer itself) spreads them across the partitions. Older clients use a round-robin algorithm: message 1 to partition 1, message 2 to partition 2, message 3 to partition 1, and so on. (Since Kafka 2.4 the default partitioner instead sticks to one partition per batch, but over many messages the load still evens out.) So you are right: partition 1 will get roughly 50% of the messages. One consumer reading from partition 1 will get that 50%; the other 50% will go to the other consumer reading from partition 2. Data is never copied between partitions; each message lives in exactly one partition. This is how Kafka achieves higher throughput and handles more consumers.
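The round-robin split described above can be sketched in a few lines of plain Python. This is only a simulation, not Kafka client code: the function name `round_robin_partitions` is made up for illustration, and partitions are numbered from 0 here (as Kafka does), whereas the answer counts them from 1.

```python
from collections import Counter
from itertools import cycle


def round_robin_partitions(num_messages, num_partitions):
    """Return the partition index chosen for each keyless message, in send order."""
    chooser = cycle(range(num_partitions))
    return [next(chooser) for _ in range(num_messages)]


# Hypothetical scenario from the question: one producer, a topic with 2 partitions.
assignments = round_robin_partitions(num_messages=10, num_partitions=2)
counts = Counter(assignments)

print(assignments)  # [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
print(counts[0], counts[1])  # 5 5
```

Each message appears in exactly one partition's count, which is the point: the topic's data is split between partitions, not duplicated across them.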
It's important to add that when a partition has more than one replica, one of them is designated the "leader" and the others are "followers". Message exchange always goes through the "leader". The "followers" are just copies. They are used when the broker hosting the "leader" goes down: a broker hosting a "follower" is then elected as the new "leader".
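As a rough illustration of that failover, here is a toy sketch. It is a deliberate simplification: the helper `elect_leader` and the broker ids are hypothetical, and a real cluster elects the new leader from the in-sync replicas through its controller rather than by walking a list like this.

```python
def elect_leader(replicas, live_brokers):
    """Pick the first replica whose broker is alive, or None if all are down."""
    for broker in replicas:
        if broker in live_brokers:
            return broker
    return None


# Hypothetical partition replicated on brokers 1, 2 and 3 (broker 1 listed first).
replicas = [1, 2, 3]

print(elect_leader(replicas, live_brokers={1, 2, 3}))  # 1
print(elect_leader(replicas, live_brokers={2, 3}))     # 2 (broker 1 went down)
print(elect_leader(replicas, live_brokers=set()))      # None
```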
I hope this helps.

Related

kafka should the consumer and producer have knowledge of the partitions

I'm trying to wrap my head around Kafka, and the thing that confuses me is the partitions. In most of the examples I have seen, the consumers/producers seem to have implicit knowledge of the partitions (which partition to write messages to, which partition to read messages from). Is this correct? I initially thought that partitions are internal to the system and that consumers/producers don't need to know partition information. If they do need to know it, aren't we exposing the inner structure of the topic, to a certain extent, to the outside world?
In Kafka, every partition of a topic is hosted on a set of brokers, with at most one leader broker per partition. Within a consumer group you cannot usefully have more consumers than partitions, because the extra consumers would sit idle. A single consumer can read multiple partitions, but within one consumer group a partition cannot be read by multiple consumers. So the number of partitions must be chosen according to the throughput you expect. The number of partitions of a topic can be increased, but never decreased. When consumers read from a partition, they actually connect to the partition's leader broker to consume messages.
The partition leader can change, though. The consumer then gets an error and has to send a metadata request to the cluster to find out the new partition leader. At consumer startup, partitions are assigned according to the Kafka parameter partition.assignment.strategy. Of course, if consumers in the same consumer group start at different times, the partitions are rebalanced.
So in the end, as a client, you do need a fair amount of information about the Kafka cluster structure.
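To make the rebalance concrete, the following sketch recomputes a group's assignment each time a consumer joins. It is a toy model under stated assumptions: `assign_partitions` is a hypothetical helper that spreads partitions round-robin over sorted member names, which only resembles what a real assignor configured through partition.assignment.strategy would do.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the members of one consumer group, round-robin."""
    members = sorted(consumers)
    if not members:
        return {}
    assignment = {member: [] for member in members}
    for i, partition in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(partition)
    return assignment


partitions = [0, 1, 2, 3]
group = []

# Consumers start at different times; every join triggers a new assignment.
for newcomer in ["c1", "c2", "c3"]:
    group.append(newcomer)
    print(assign_partitions(partitions, group))
# {'c1': [0, 1, 2, 3]}
# {'c1': [0, 2], 'c2': [1, 3]}
# {'c1': [0, 3], 'c2': [1], 'c3': [2]}
```

Every partition always has exactly one owner in the group, and ownership moves as the membership changes.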

Clarification on producer while posting message to the topic

I am a beginner learning Kafka and was going through topics and producers. As per my understanding:
A topic is just a logical name for a group of partitions, and the partitions are spread across the nodes.
Is my understanding correct that if a given topic has, let's say, 5 partitions, then all 5 partitions will be on 5 different brokers? And if there is another topic with 5 partitions, its 5 partitions will also be on those 5 brokers? Effectively, in this configuration, each of the 5 brokers would host two partitions, one from each topic. Am I right?
Another point, about producing and consuming: the producer has a list of brokers configured and posts the message to a topic via that list of brokers. The message is always written to the leader partition, i.e. one of the partitions on a broker. The message is then replicated to all the other partitions on other brokers. Now, if the producer is configured with only one broker, is the message still posted to the leader partition, even when the configured broker is not the one hosting the leader partition for that topic? Example: topic name events, with 5 partitions on 5 brokers; broker-2 contains the leader partition, but the producer is configured with broker-1 alone.
I also read that the producer can specify the partition when posting the message. If so, doesn't that contradict the idea that the producer always posts to the leader partition? And if the producer posts the message to a custom partition and the broker hosting that partition is down, the message will not be posted. Also, in distributed systems it is not a best practice to pin to a specific partition. Am I missing something here?
Does the consumer also read from the leader partition, or does the consumer group assign different consumers to different partitions?

How can I control request/message send by a Kafka cluster?

Suppose I have 3 Kafka brokers, a ZooKeeper, 50 producers, 50 consumers, and 1 topic (testTopic1).
All the consumers are subscribed to testTopic1. Now I send 50 messages at the same time from the 50 producers to the same topic (testTopic1). I want the Kafka cluster to send no more than 40 messages at the same time to the consumers. The remaining 10 should either stay in the queue or be dropped.
Maybe this is load balancing in Kafka.
I do not understand how to do this. I'm new to Kafka, please help.
Kafka brokers are dumb. They can't limit or remove messages published to Kafka.
If all the Kafka consumers are part of the same consumer group, and there are 50 consumers, then the consumers may or may not receive those 50 messages at the same time, depending on the key. If multiple messages have the same key, they all go to the same partition and are read by a single consumer, one by one. If all 50 messages have distinct keys, they may or may not be read by the same consumer, depending on the hash of each key.
Can you explain your use case in more detail?
A Kafka broker cannot drop messages randomly. But you can implement logic within the consumer to drop messages while processing.
If you have a single topic with a single partition, only one of the consumers in a given consumer group will process all your messages, since ordering is guaranteed per partition on the consumer end.
If you have 10 consumer groups with 5 consumers each and a single partition for the topic, 10 consumers (one per group) process each message from the topic. If a consumer from consumer-group-1 fails, another consumer from the same consumer group takes over the partition and processes the messages.
If you need to randomly drop 1 out of 10 messages while processing, you can do that with logic on the consumer end. Note that, as far as the consumer group's committed offsets are concerned, those messages still count as processed, if the system is configured to keep offsets on the broker side.
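A minimal sketch of such consumer-side logic, in plain Python with no Kafka client involved. Both helpers are hypothetical names: `split_batch` caps how many messages are handled at once (the asker's 40 out of 50), and `drop_every_nth` is a deterministic stand-in for the "drop 1 out of 10" rule, used here instead of a random choice so the result is easy to check.

```python
def split_batch(messages, limit):
    """Return (accepted, overflow): the first `limit` messages and the rest."""
    return messages[:limit], messages[limit:]


def drop_every_nth(messages, n):
    """Keep all messages except every n-th one (counting positions from 1)."""
    return [m for position, m in enumerate(messages, start=1) if position % n != 0]


# Hypothetical batch of 50 messages already fetched by a consumer.
batch = [f"msg-{i}" for i in range(1, 51)]

accepted, overflow = split_batch(batch, limit=40)
print(len(accepted), len(overflow))  # 40 10

kept = drop_every_nth(batch, n=10)
print(len(kept))  # 45
```

What happens to the overflow (process later or discard) is an application decision; the broker itself keeps every message either way.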

Kafka Consumer group - No of partition - No of replication

Trying to understand the relationship between replication factor and consumer group. Example: number of partitions = 2, replication factor = 3, number of consumers in the consumer group = 4. In this case:
How many consumers will receive the message?
How will this replication impact the number of consumers that receive it?
For your first question, since you have two partitions in your example, only 2 of the 4 consumers will actually get data. The other two consumers will not have any partitions assigned to them, because there aren't any partitions left for that consumer group. If you had a different consumer group, then those consumers would still be assigned partitions.
Additionally, in this case, you mention there's only a single message coming through. Depending on which partition it's assigned to, the message will only be sent to that partition. So in this case, only one of the four consumers will get the message, the one that had that partition assigned to it.
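The 2-partition, 4-consumer case can be simulated like this. It is a sketch, not real assignor behaviour: `owners_by_partition` and the consumer names are invented for the example, and which two members actually receive partitions depends on the assignment strategy in use.

```python
def owners_by_partition(num_partitions, consumers):
    """Map each partition to the single group member that owns it."""
    members = sorted(consumers)
    return {p: members[p % len(members)] for p in range(num_partitions)}


group = ["consumer-1", "consumer-2", "consumer-3", "consumer-4"]
owners = owners_by_partition(num_partitions=2, consumers=group)
idle = sorted(set(group) - set(owners.values()))

print(owners)  # {0: 'consumer-1', 1: 'consumer-2'}
print(idle)    # ['consumer-3', 'consumer-4']

# A single message lands in exactly one partition, so exactly one member sees it.
message_partition = 1
print(owners[message_partition])  # consumer-2

# A second, independent consumer group gets its own full assignment.
print(owners_by_partition(num_partitions=2, consumers=["other-1"]))
# {0: 'other-1', 1: 'other-1'}
```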
As for your second question, replication factor configuration in Kafka doesn't impact the number of messages consumers receive. Replication, as far as consumers and producers are concerned, is an internal kafka cluster detail that they don't need to worry about. As long as they're producing/consuming to/from the leader of the partition, that's all they need to know. A topic could have replication factor 2, and another one could have replication factor 10, and they would both behave identically to producers and consumers.
There are a few more details in the official Kafka documentation: https://kafka.apache.org/documentation/#theconsumer
To give some additional details on the replication factor: it has no relation whatsoever to the number of consumers receiving messages from the topic. Replication serves one major purpose, and that is high availability. So, let's say you have 3 brokers in a cluster, and for a topic my-topic you've set the replication factor to 2. Now, if at most one broker goes down at some point in time, you'd still be okay, as the messages are replicated on another broker for the topic.
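The "one broker can go down" arithmetic can be checked by brute force. The sketch below assumes a simplified model in which a partition stays available as long as any broker holding a replica is alive (real clusters also take in-sync replicas into account); the function names and the replica placement are hypothetical.

```python
from itertools import combinations


def survives(replica_brokers, failed_brokers):
    """True if at least one replica sits on a broker that is still up."""
    return any(b not in failed_brokers for b in replica_brokers)


def max_tolerated_failures(brokers, replica_brokers):
    """Largest k such that ANY k simultaneous broker failures leave a replica alive."""
    for k in range(1, len(brokers) + 1):
        if not all(survives(replica_brokers, set(down)) for down in combinations(brokers, k)):
            return k - 1
    return len(brokers)


brokers = [1, 2, 3]
my_topic_replicas = [1, 2]  # replication factor 2; placement chosen for illustration

print(max_tolerated_failures(brokers, my_topic_replicas))  # 1
print(max_tolerated_failures(brokers, [1, 2, 3]))          # 2 (replication factor 3)
```

In this model the tolerated number of failures is the replication factor minus one, regardless of how many consumers read the topic.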

How does offset work when I have multiple topics on one partition in Kafka?

I am trying to develop a better understanding of how Kafka works. To keep things simple, I am currently running Kafka with one ZooKeeper and 3 brokers, and one partition with a replication factor of 3. I learned that, in general, it's better to have number of partitions ~= number of consumers.
Question 1: Do topics share offsets in the same partition?
I have multiple topics (e.g. dogs, cats, dinosaurs) on one partition (e.g. partition 0). Now my producers have produced a message to each of the topics: "msg: bark" to dogs, "msg: meow" to cats and "msg: rawr" to dinosaurs. I noticed that if I specify dogs[0][0], I get back bark, and if I do the same on cats and dinosaurs, I get back each message respectively. This is an awesome feature, but it contradicts my understanding. I thought an offset is specific to a partition. If I have pushed three messages into a partition sequentially, shouldn't the messages be indexed with 0, 1, and 2? Now it seems to me that the offset is specific to a topic.
This is how I imagined it
['bark', 'meow', 'rawr']
In reality, it looks like this
['bark']
['meow']
['rawr']
But that can't be it. There must be something keeping track of offset and the actual physical location of where the message is in the log file.
Question 2: How do you manage your messages if you were to have multiple partitions for one topic?
In question 1, I have multiple topics in one partition; now let's say I have multiple partitions for one topic. For example, I have 4 partitions for the dogs topic and I have 100 messages to push to my Kafka cluster. Do I distribute the messages evenly across partitions, like 25 go in partition 1, 25 go in partition 2, and so on?
If a consumer wants to consume all those 100 messages at once, he/she needs to hit all four partitions. How is this different from hitting 1 partition with 100 messages? Does network bandwidth impose a bottleneck?
Thank you in advance
For your question 1: It is impossible to have multiple topics on one partition. A partition is conceptually part of a topic. You can have 3 topics, each with only one partition, so you have 3 partitions in total. Each of those partitions keeps its own offsets starting at 0, which explains the behavior you observed.
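A toy model of this, using the asker's dogs/cats/dinosaurs example. `TinyLog` is a hypothetical in-memory stand-in, not how a broker stores data on disk; it only shows that offsets are counted separately for each (topic, partition) pair.

```python
from collections import defaultdict


class TinyLog:
    """Toy in-memory stand-in for a broker's logs, keyed by (topic, partition)."""

    def __init__(self):
        self._logs = defaultdict(list)

    def append(self, topic, partition, value):
        """Append a record and return the offset it was written at."""
        log = self._logs[(topic, partition)]
        log.append(value)
        return len(log) - 1

    def read(self, topic, partition, offset):
        """Return the record stored at `offset` in the given topic partition."""
        return self._logs[(topic, partition)][offset]


log = TinyLog()
print(log.append("dogs", 0, "bark"))       # 0
print(log.append("cats", 0, "meow"))       # 0
print(log.append("dinosaurs", 0, "rawr"))  # 0
print(log.append("dogs", 0, "woof"))       # 1

print(log.read("dogs", 0, 0))  # bark
print(log.read("cats", 0, 0))  # meow
```

"Partition 0" of dogs and "partition 0" of cats are different logs that merely share an index, which is why dogs[0][0] and cats[0][0] return different messages.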
For your question 2: At the producer side, if a valid partition number is specified, that partition is used when sending the record. If no partition is specified but a key is present, a partition is chosen using a hash of the key. If neither key nor partition is present, a partition is assigned in a round-robin fashion (newer clients, since Kafka 2.4, instead stick to one partition per batch, which still balances out over time). Now, the number of partitions decides the maximum parallelism. There is a concept called consumer group: multiple consumers in the same group can consume the same topic. In the example you gave, if your topic has only one partition, the maximum parallelism is one, and only one consumer in the consumer group will receive messages (all 100 of them). But if you have 4 partitions, you can have up to 4 consumers, one for each partition, each receiving about 25 messages.
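The three-step choice (explicit partition, then key hash, then round robin) can be sketched as follows. This is an illustration only: `make_partitioner` is a made-up helper, and `zlib.crc32` is swapped in as the hash purely for convenience; the real Java client hashes keys with murmur2, so actual partition numbers will differ.

```python
import zlib
from itertools import count


def make_partitioner(num_partitions):
    """Build a chooser that mimics the precedence: partition > key > round robin."""
    keyless_counter = count()

    def choose(key=None, partition=None):
        if partition is not None:
            if not 0 <= partition < num_partitions:
                raise ValueError(f"invalid partition {partition}")
            return partition  # 1) an explicit partition always wins
        if key is not None:
            # 2) same key -> same hash -> same partition
            return zlib.crc32(key.encode("utf-8")) % num_partitions
        # 3) no key, no partition: rotate through the partitions
        return next(keyless_counter) % num_partitions

    return choose


choose = make_partitioner(num_partitions=4)

print(choose(partition=2))                      # 2
print(choose(key="rex") == choose(key="rex"))   # True
print([choose() for _ in range(8)])             # [0, 1, 2, 3, 0, 1, 2, 3]
```

With 100 keyless messages and 4 partitions, this rotation puts 25 messages in each partition, matching the numbers in the answer.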