Two different types of partitions in kafka producer - apache-kafka

In a Kafka producer, I am sending two different sets of data to a topic that has two partitions. The first set is sent with a key and the second without one. As far as I know, the key is used to choose the partition for the data. If the key is absent, null is sent and the partition is chosen by round-robin scheduling.
My question: if I alternate between sending data with and without a key over some period of time, what will happen?
Will round-robin scheduling cover only the partitions not selected by the key, or will it cover both partitions?

Kafka selects the partition according to the following rules:
1. If a custom partitioner is configured, the partition is selected by the custom partitioner's logic.
2. If no custom partitioner is configured, Kafka uses DefaultPartitioner:
a. If the key is null, the partition is selected round-robin.
b. If the key is non-null, it uses a Murmur2 hash of the key, modulo the number of partitions of the topic.
So with the DefaultPartitioner and no custom partitioner defined, messages with either kind of key (null or non-null) can be published to both partitions. Round-robin does not skip the partition that keyed messages hash to.
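To make the rules above concrete, here is a minimal simulation of alternating keyed and unkeyed sends on a two-partition topic. It is only a sketch: the class name, the key "sensor-1", and the use of Arrays.hashCode in place of Kafka's murmur2 are assumptions for illustration, not the real producer code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AlternatingSendSketch {
    private int counter = 0; // round-robin counter used by every null-key send

    // Stand-in for the default partitioner's choice. Arrays.hashCode is an
    // assumption that replaces Kafka's murmur2 hash for this illustration.
    int choosePartition(String key, int numPartitions) {
        if (key == null) {
            return (counter++ & 0x7fffffff) % numPartitions;
        }
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        return (hash & 0x7fffffff) % numPartitions;
    }

    // Sends `rounds` pairs of records: one with `key`, then one with a null key.
    // Returns two arrays of per-partition counts: [0] keyed, [1] unkeyed.
    static int[][] simulate(String key, int numPartitions, int rounds) {
        AlternatingSendSketch producer = new AlternatingSendSketch();
        int[] keyed = new int[numPartitions];
        int[] unkeyed = new int[numPartitions];
        for (int i = 0; i < rounds; i++) {
            keyed[producer.choosePartition(key, numPartitions)]++;
            unkeyed[producer.choosePartition(null, numPartitions)]++;
        }
        return new int[][] {keyed, unkeyed};
    }

    public static void main(String[] args) {
        int[][] counts = simulate("sensor-1", 2, 100);
        System.out.println("keyed records per partition:   " + Arrays.toString(counts[0]));
        System.out.println("unkeyed records per partition: " + Arrays.toString(counts[1]));
    }
}
```

In this sketch the keyed records all pile up on one partition, while the unkeyed records are split across both, including the partition that the key hashes to.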
To publish a message to a specific partition, you can use one of the methods below.
Pass the partition explicitly when publishing a message:
/**
 * Creates a record to be sent to a specified topic and partition
 */
public ProducerRecord(String topic, Integer partition, K key, V value) {
    this(topic, partition, null, key, value, null);
}
Alternatively, you can create a custom Partitioner and implement your own logic to select the partition:
https://kafka.apache.org/10/javadoc/org/apache/kafka/clients/producer/Partitioner.html

One correction: you said that the key is used to make partitions for the data. The key is sent with a message mainly to preserve message ordering for a specific field.
If key=null, data is sent round-robin (to a different partition, possibly on a different broker in a distributed environment, and of course to the same topic).
If a key is sent, then all messages for that key will always go to the same partition.
Explanation and example
The key can be a string, an integer, etc. Take an integer employee_id as the key.
For example, employee_id 123 might always go to partition 0 and employee_id 345 always to partition 1. The mapping is decided by the key hashing algorithm and depends on the number of partitions.
If you don't send a key, the message can go to any partition, chosen round-robin.
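A small sketch of the point about the number of partitions: the same key maps to a fixed partition for a given partition count, but the mapping can change when the count changes. String.hashCode is used here as a stand-in hash (an assumption; Kafka hashes the serialized key bytes with murmur2), so the actual partition numbers will differ from a real cluster.

```java
class KeyToPartitionSketch {
    // Stand-in hash: String.hashCode of the decimal id. This is an assumption
    // for illustration only, not the hash Kafka uses.
    static int partitionFor(int employeeId, int numPartitions) {
        int hash = Integer.toString(employeeId).hashCode();
        return (hash & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        for (int numPartitions : new int[] {4, 5}) {
            for (int id : new int[] {123, 345}) {
                System.out.printf("partitions=%d employee_id=%d -> partition %d%n",
                        numPartitions, id, partitionFor(id, numPartitions));
            }
        }
    }
}
```

This is why adding partitions to an existing topic can break the "same key, same partition" expectation for records produced before and after the change.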

Kafka is systematic about how records are sent to and stored in partitions. As you mentioned, the key ensures that records with the same key go to the same partition. This helps maintain the chronological order of those messages within the topic.
In your case, the two partitions will store the data as follows:
Partition 1 (assuming your key hashes to it): stores the records that carry that key. Records with this key will always go to this partition. This is the default hash-based partitioning; no custom partitioner is involved. Records with a null key will also land here, because round-robin distribution covers every partition.
Partition 2: receives its round-robin share of the records sent without a key, i.e. where the key is null.

Related

How Kafka Handles Keyed Message Related to Partition

Can anyone explain:
How does Kafka actually store keyed messages? Is a partition assigned to only one key? In other words, is it possible for a partition to store messages with multiple keys?
If the answer to the first question is yes, what happens when the number of keys exceeds the number of partitions available?
My use case: I am considering sending a lot of ship data to the brokers, keyed by ship_id (MMSI, if you know it). The problem is that I don't know how many ships will be received, so I can't define the number of partitions in advance.
Is it possible that a partition stores messages with multiple keys?
Yes. The murmur2 hash (the algorithm used by Kafka) of different keys, taken modulo the number of partitions in the topic, can produce the same number. For example, if you have only one partition, every key obviously goes to that partition.
What if the number of keys is greater than the number of partitions available?
The hash is taken modulo the number of partitions, so a key is always assigned a valid partition.
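As a rough illustration of many keys sharing few partitions, the sketch below spreads 1000 hypothetical MMSI-like ids over 6 partitions. The id range and the String.hashCode stand-in hash are assumptions; the real producer uses murmur2 on the serialized key.

```java
import java.util.Arrays;

class ManyKeysFewPartitionsSketch {
    // Stand-in for hash(key) % numPartitions; String.hashCode is an assumption.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    // Counts how many distinct keys end up on each partition.
    static int[] keysPerPartition(int numKeys, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (int i = 0; i < numKeys; i++) {
            String shipId = Integer.toString(211000000 + i); // hypothetical ids
            counts[partitionFor(shipId, numPartitions)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(keysPerPartition(1000, 6)));
    }
}
```

Every key gets a valid partition index, and with more keys than partitions at least one partition necessarily holds several keys.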
Now, if you have a well-defined key, you are guaranteed ordering of the messages for that key within its partition, so the choice of partition count really comes down to how much throughput a single partition can handle. There is no short answer: how much data are you sending, and how fast can one consumer read that data from one partition at "peak" consumption? Run appropriate performance tests, then scale the partition number up over new topics to handle potential future load.
You'll also need to consider "hot" and "cold" data. For example, if you had 10 partitions mapped to the first digit of the ID and all your IDs started with even digits, half of the partitions would end up empty.
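The hot/cold example can be checked with a few lines. This sketch assumes a hypothetical scheme where the partition is simply the first digit of a string ID, and generates made-up IDs that all start with an even digit.

```java
import java.util.Arrays;

class FirstDigitSkewSketch {
    // Hypothetical scheme from the example: partition = first digit of the id.
    static int partitionFor(String id) {
        return id.charAt(0) - '0';
    }

    // Counts records per partition for made-up ids starting with an even digit.
    static int[] countPerPartition() {
        int[] counts = new int[10];
        for (char first : new char[] {'0', '2', '4', '6', '8'}) {
            for (int i = 0; i < 100; i++) {
                String id = first + String.format("%03d", i);
                counts[partitionFor(id)]++;
            }
        }
        return counts;
    }

    static int emptyPartitions(int[] counts) {
        int empty = 0;
        for (int c : counts) {
            if (c == 0) empty++;
        }
        return empty;
    }

    public static void main(String[] args) {
        int[] counts = countPerPartition();
        System.out.println(Arrays.toString(counts));
        System.out.println("empty partitions: " + emptyPartitions(counts));
    }
}
```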
1. Kafka messages consist of a key and a value and are stored in topics. Topics are divided into multiple partitions, and each partition is further divided into segments. Each segment has a log file that stores the actual messages in key-value form, along with an index of message offsets.
The key is optional and is used to determine which partition stores the message. If the key is null, messages are distributed round-robin; if the key is not null, the partitioner hashes the key and takes the result modulo the number of partitions, which guarantees that one of the partitions is chosen.
e.g.
hash(key) % num_partitions
public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
    List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
    int numPartitions = partitions.size();
    if (keyBytes == null) {
        int nextValue = nextValue(topic);
        List<PartitionInfo> availablePartitions = cluster.availablePartitionsForTopic(topic);
        if (availablePartitions.size() > 0) {
            int part = Utils.toPositive(nextValue) % availablePartitions.size();
            return availablePartitions.get(part).partition();
        } else {
            // no partitions are available, give a non-available partition
            return Utils.toPositive(nextValue) % numPartitions;
        }
    } else {
        // hash the keyBytes to choose a partition
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }
}
Because it uses modulo, a message is always stored within the range of available partitions, which is also why multiple keys may go to the same partition. The main benefit of a message key is bucketing: messages with the same key go to the same partition.
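For readers without the Kafka client on the classpath, here is a standalone sketch of the keyed branch. The murmur2 body is transcribed from memory of Kafka's Utils.murmur2 and should be treated as an approximation; verify it against the Kafka source before relying on exact hash values. The class name and the partitionFor helper are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class Murmur2Sketch {
    // Approximate port of a 32-bit murmur2 hash (assumption: matches Kafka's).
    static int murmur2(byte[] data) {
        int length = data.length;
        int seed = 0x9747b28c;
        final int m = 0x5bd1e995;
        final int r = 24;
        int h = seed ^ length;
        int length4 = length / 4;
        for (int i = 0; i < length4; i++) {
            int i4 = i * 4;
            int k = (data[i4] & 0xff) + ((data[i4 + 1] & 0xff) << 8)
                    + ((data[i4 + 2] & 0xff) << 16) + ((data[i4 + 3] & 0xff) << 24);
            k *= m;
            k ^= k >>> r;
            k *= m;
            h *= m;
            h ^= k;
        }
        // Mix in the 1 to 3 trailing bytes; the fall-through is intentional.
        switch (length % 4) {
            case 3:
                h ^= (data[(length & ~3) + 2] & 0xff) << 16;
            case 2:
                h ^= (data[(length & ~3) + 1] & 0xff) << 8;
            case 1:
                h ^= data[length & ~3] & 0xff;
                h *= m;
        }
        h ^= h >>> 13;
        h *= m;
        h ^= h >>> 15;
        return h;
    }

    // Clears the sign bit so the modulo result is never negative.
    static int toPositive(int number) {
        return number & 0x7fffffff;
    }

    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return toPositive(murmur2(keyBytes)) % numPartitions;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"ship-1", "ship-2", "ship-3"}) {
            System.out.println(key + " -> partition " + partitionFor(key, 3));
        }
    }
}
```

Masking the sign bit instead of calling Math.abs avoids the edge case where the hash is Integer.MIN_VALUE, whose absolute value is still negative.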
2. So you don't need to define the number of partitions based on the number of keys. As mentioned above, the key is used to bucket messages into partitions according to the default partitioner's logic. The number of partitions mainly serves to parallelize processing for higher throughput.
Note: Keying the data may cause an uneven distribution across partitions. If you don't need key-based grouping, just leave the key null, which selects the partition round-robin.
Another approach is to create a custom partitioner to further refine the partition selection logic.

Uneven Distribution of messages in Kafka Partitions

I have a topic with 10 partitions, 1 consumer group with 4 consumers, and a worker size of 3.
I can see an uneven distribution of messages across the partitions: one partition has a lot of data while another is empty.
How can I make my producer distribute the load evenly across all partitions, so that all of them are properly utilized?
According to the JavaDoc comment in the DefaultPartitioner class itself, the default partitioning strategy is:
If a partition is specified in the record, use it.
If no partition is specified but a key is present choose a partition based on a hash of the key.
If no partition or key is present choose a partition in a round-robin fashion.
https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/producer/internals/DefaultPartitioner.java
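The three rules quoted above can be read as a simple precedence chain. The sketch below mirrors that order in plain Java; the class and method names are hypothetical, and Arrays.hashCode stands in for the real key hash.

```java
import java.util.Arrays;

class PartitionPrecedenceSketch {
    private int roundRobin = 0; // counter for records with no partition and no key

    int choose(Integer explicitPartition, byte[] keyBytes, int numPartitions) {
        if (explicitPartition != null) {
            return explicitPartition; // rule 1: the record names its partition
        }
        if (keyBytes != null) {
            // rule 2: hash of the key (stand-in hash, an assumption)
            return (Arrays.hashCode(keyBytes) & 0x7fffffff) % numPartitions;
        }
        // rule 3: round-robin over all partitions
        return (roundRobin++ & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        PartitionPrecedenceSketch sketch = new PartitionPrecedenceSketch();
        byte[] key = {1, 2, 3};
        System.out.println("explicit:    " + sketch.choose(3, key, 10));
        System.out.println("keyed:       " + sketch.choose(null, key, 10));
        System.out.println("round-robin: " + sketch.choose(null, null, 10));
    }
}
```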
So here are two possible reasons that may be causing the uneven distribution, depending on whether you are specifying a key while producing the message or not:
If you are specifying a key and you are getting an uneven distribution using the DefaultPartitioner, the most likely explanation is that the same key (or a small set of keys) is being used for many messages.
If you are not specifying a key and using the DefaultPartitioner, a non-obvious behavior could be happening. According to the above you would expect round-robin distribution of messages, but this is not necessarily the case. An optimization introduced in 0.8.0 could be causing the same partition to be used. Check this link for a more detailed explanation: https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Whyisdatanotevenlydistributedamongpartitionswhenapartitioningkeyisnotspecified? .
Instead of relying on the default partitioner, you can give the producer record a partition number so that the message goes directly to the specified partition:
ProducerRecord<String, String> record = new ProducerRecord<String, String>(topicName, partitionNumber, key, value);
It seems like your problem is uneven consumption of messages rather than uneven production of messages to the Kafka topic. In other words, the number of reading threads doesn't match the number of partitions you have (they do not need to match 1:1, though; each consumer thread only needs to read from the same number of partitions).
See short explanation for more details.
You can make use of the key parameter of the producer record. For a specific key, the data always goes to the same partition. I don't know the structure of your producer record, but since you have 10 partitions, you can simply use n % 10 as your producer record key.
Here n is a running counter, so the key takes the values 0 to 9. For record 0 the key will be 0, and Kafka will hash the key and put the record in some partition, say partition 0; for record 1 the key will be 1 and the record will go to, say, partition 1; and so on. Note that the partition is derived from the hash of the key, so two different keys can land on the same partition and a perfectly even spread is not guaranteed.
This way you can approximate round-robin on your producer records. The key is independent of the fields in your record, so you can keep a variable n and use n % 10 as the key.
Or you can specify the partition in your producer record directly. So use either the key or the partition field of the producer record.
Suppose you have defined a partitioner that works from the record: say the Kafka key is a string and the value is a Student POJO.
Say I want records to go to a specific partition based on the student's country field. Imagine there are 10 partitions in the topic and that, for the value "India", the partitioner yields partition number 5.
Whenever the country is "India", Kafka will assign partition 5, and that record always goes to partition 5 (as long as the partitions have not changed).
If your pipeline receives a lot of records with the country "India", all of them will go to partition 5, and you will see an uneven distribution across the Kafka partitions.
In my case, I used the default partitioner but still had many more records in one partition than in the others. The problem was that I unexpectedly had many records with the same key. Check your keys!
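One way to check your keys is to count how often each key occurs in a sample of records before producing them. The sketch below is a minimal, hypothetical helper; the sample keys are made up.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HotKeySketch {
    // Counts how many records carry each key.
    static Map<String, Integer> countByKey(List<String> keys) {
        Map<String, Integer> counts = new HashMap<>();
        for (String key : keys) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    // Returns the key with the highest count, or null for an empty sample.
    static String hottestKey(List<String> keys) {
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : countByKey(keys).entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("IN", "US", "IN", "DE", "IN");
        System.out.println(countByKey(sample));
        System.out.println("hottest key: " + hottestKey(sample));
    }
}
```

If one key dominates the sample, the partition it hashes to will dominate the topic no matter which hash function is used.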
As I was unable to resolve this with Faust, the approach I am using is to implement the 'round-robin' distribution myself.
I iterate over the records I want to produce and do, for example:
for index, message in enumerate(messages):
    topic.send(message, partition=index % num_partitions)
I.e. I bound my index to the range of partitions I have.
There could still be unevenness: if you run this repeatedly with fewer records than num_partitions, your first partitions will keep getting the major share of messages. You can avoid this issue by adding a random offset:
import random
initial_partition = random.randrange(0, num_partitions)
for index, message in enumerate(messages):
    topic.send(message, partition=(initial_partition + index) % num_partitions)
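For a Java producer, the same idea can be sketched as a small helper that computes the partition index for each message in a batch, starting from a random offset. The class and method names are hypothetical; each returned index would then be passed as the partition argument of a ProducerRecord.

```java
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

class OffsetRoundRobinSketch {
    // Returns the partition index for each of numMessages messages, cycling
    // through the partitions starting at initialPartition.
    static int[] assign(int numMessages, int numPartitions, int initialPartition) {
        int[] partitions = new int[numMessages];
        for (int i = 0; i < numMessages; i++) {
            partitions[i] = (initialPartition + i) % numPartitions;
        }
        return partitions;
    }

    public static void main(String[] args) {
        int numPartitions = 10;
        // A random start avoids always favouring the first partitions on small batches.
        int initial = ThreadLocalRandom.current().nextInt(numPartitions);
        System.out.println(Arrays.toString(assign(4, numPartitions, initial)));
    }
}
```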

Kafka: Write custom partitioner or just use Key?

So I have a Kafka topic with multiple partitions and on it I'm producing messages. I want my messages to be partitioned based on user id. I can achieve this either by using UserId as the message key or by writing a custom partitioner. How do I figure out which is the right solution, what are the pros and cons?
As you know, by using the user id as the key you can be sure that messages with the same user id will always be delivered to the same partition, but you can't choose the partition itself. The default partitioner computes a hash of the key modulo the number of partitions to get the destination partition.
If your application needs messages with a specific user id to go to a specific partition (e.g. user ids beginning with "A" should go to partition 0), you need to write a custom partitioner.
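The kind of rule a custom partitioner would encode can be sketched as a plain function. Only the "ids beginning with A go to partition 0" rule comes from the example above; falling back to a hash for every other id, and using String.hashCode for it, are assumptions. In a real producer this logic would live in the partition(...) method of a class implementing Kafka's Partitioner interface.

```java
class UserIdPartitionRuleSketch {
    static int partitionFor(String userId, int numPartitions) {
        if (userId.startsWith("A")) {
            return 0; // rule from the example: ids beginning with "A" go to partition 0
        }
        // Assumed fallback: hash every other id over all partitions.
        return (userId.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        for (String userId : new String[] {"Alice", "Adam", "bob", "carol"}) {
            System.out.println(userId + " -> partition " + partitionFor(userId, 4));
        }
    }
}
```

Note that with this fallback other ids can also hash to partition 0; reserving partition 0 exclusively would need the fallback to map into partitions 1 and above.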
If you have no such restrictions, I think the default partitioner with the user id as the key will work fine for you.
In any case, you get information about the partition both after sending and on receiving.

Why is data not evenly distributed among partitions when a partitioning key is not specified?

Is this explanation still valid in Kafka 10?
In Kafka producer, a partition key can be specified to indicate the destination partition of the message. By default, a hashing-based partitioner is used to determine the partition id given the key, and people can use customized partitioners also.
To reduce # of open sockets, in 0.8.0 (https://issues.apache.org/jira/browse/KAFKA-1017), when the partitioning key is not specified or null, a producer will pick a random partition and stick to it for some time (default is 10 mins) before switching to another one. So, if there are fewer producers than partitions, at a given point of time, some partitions may not receive any data. To alleviate this problem, one can either reduce the metadata refresh interval or specify a message key and a customized random partitioner. For more detail see this thread http://mail-archives.apache.org/mod_mbox/kafka-dev/201310.mbox/%3CCAFbh0Q0aVh%2Bvqxfy7H-%2BMnRFBt6BnyoZk1LWBoMspwSmTqUKMg%40mail.gmail.com%3E
From here https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Whyisdatanotevenlydistributedamongpartitionswhenapartitioningkeyisnotspecified?
The new producer has changed to use round-robin policy. That's to say, messages will be delivered to all partitions evenly if no keys are specified.
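To see why the older behaviour could leave partitions empty, here is a toy simulation contrasting "pick one random partition and stick to it" with plain round-robin. It models only the effect described in the FAQ text, within a single refresh interval; it is not the actual producer code, and all names and parameters are assumptions.

```java
import java.util.Random;

class StickyVsRoundRobinSketch {
    // Returns how many partitions receive at least one message.
    static int partitionsHit(int producers, int partitions, int messagesPerProducer,
                             boolean sticky, long seed) {
        Random random = new Random(seed);
        boolean[] hit = new boolean[partitions];
        for (int p = 0; p < producers; p++) {
            int stuck = random.nextInt(partitions); // chosen once per producer
            for (int m = 0; m < messagesPerProducer; m++) {
                int target = sticky ? stuck : m % partitions;
                hit[target] = true;
            }
        }
        int count = 0;
        for (boolean h : hit) {
            if (h) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("sticky:      " + partitionsHit(2, 10, 1000, true, 42L));
        System.out.println("round-robin: " + partitionsHit(2, 10, 1000, false, 42L));
    }
}
```

With two sticky producers and ten partitions, at most two partitions can receive data during the interval, while round-robin reaches all ten.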

Kafka - Difference between DefaultPartitioner with MessageKey vs Custom Partitioner?

I am trying to send a message with KafkaProducer using a ProducerRecord.
new ProducerRecord(topicName, messageKey, message)
This uses DefaultPartitioner, which uses the hash of the key to ensure that all messages for the same key go to the same partition.
What is the difference between this and using a custom partitioner? I assume a custom partitioner is also used to send messages to the same partition based on the key.
The default partitioning strategy is
If a partition is specified in the record, use it
If no partition is specified but a key is present choose a partition based on a hash of the key
If no partition or key is present choose a partition in a round-robin fashion
(This is pulled from the DefaultPartitioner source code)
A custom partitioner just lets you set your own strategy. For example, you could assign partitions randomly, or, if you somehow have prior knowledge of how large each partition will be, assign them based on that. The "default" part of DefaultPartitioner is more about the round-robin strategy. I'd imagine that in most or all situations options 1 and 2 would be considered the norm.