How to make Kafka distribute the data equally among all partitions with multiple keys? - apache-kafka

A Kafka topic was created with 10 partitions and a producer produced multiple messages with 12 different keys (labeled key_1, key_2, key_3, ..., key_12).
It was observed that all the messages were sent to only 2 partitions, with most of the messages in one partition and the remaining few in another. 8 out of 10 partitions remained empty.
How can Kafka be made to distribute the data equally among all 10 partitions based on the keys?

You'd need to write your own partitioner class to make even distribution a guarantee.
Otherwise, the computed hashes of the keys you sent, taken modulo the number of partitions, may happen to fall into only 2 partitions.
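As a quick illustration (a hedged sketch, not from the question itself, assuming String keys named key_1 .. key_12 and the Java client's default murmur2-based hashing), you can preview where each key would land:

import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

public class PartitionPreview {
    public static void main(String[] args) {
        int numPartitions = 10;
        for (int i = 1; i <= 12; i++) {
            String key = "key_" + i;
            // Same bytes a StringSerializer would produce for this key
            byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
            // Same formula the DefaultPartitioner applies: murmur2 hash, made positive, modulo partition count
            int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
            System.out.println(key + " -> partition " + partition);
        }
    }
}

If several of the printed partition numbers coincide, that is exactly the skew described above.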

Since you have 12 distinct keys and 10 partitions, it is impossible to guarantee uniform distribution based on the key values. The reason is simple: the partitioner is a function, and {f(key1), f(key2), ..., f(key12)} is a subset of {p1, p2, ..., p10} in which some partitions may not appear at all and some may appear multiple times.
You have the following choices:
Write a custom partitioner that maps keys 1-10 to partitions 1-10 and keys 11 and 12 to, say, partitions 1 and 2. Alternatively, increase the number of partitions from 10 to 12 and write the partitioner so that it puts each key into its own partition (see the sketch below).
Remove the key from your messages so that the round-robin algorithm is used. However, messages that used to share the same key may and will end up in different partitions.
(Not recommended) Write your own partitioner that ignores the message key and puts each message into a random partition, for example in a round-robin way.
For details on how to implement a partitioner, look at Kafka's default one, org.apache.kafka.clients.producer.internals.DefaultPartitioner, on GitHub.
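Here is a minimal sketch of the first option, assuming the topic has 12 partitions and the keys literally look like "key_1" .. "key_12" (the class name is made up):

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class KeyIndexPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        // Parse the numeric suffix, e.g. "key_7" -> 7, and map each key to its own partition
        int keyIndex = Integer.parseInt(key.toString().substring("key_".length()));
        return (keyIndex - 1) % numPartitions;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}

Register it on the producer with the partitioner.class property (ProducerConfig.PARTITIONER_CLASS_CONFIG).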

Related

Do we need to know the number of partitions for a topic beforehand?

We want to put messages/records of different customers on different partitions of a Kafka topic.
But the number of customers is not known in advance. So how can we set the partition count for the Kafka topic in this case? Or is there some other way where the partition count changes at runtime based on keys (customer_id in this case)? Thanks in advance.
need to know number of partitions
Assuming Java, use the AdminClient.describeTopics() method and get the partitions from each response object.
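A minimal sketch (the bootstrap server and topic name are placeholders):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class PartitionCount {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker address
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription description = admin.describeTopics(Collections.singletonList("my-topic"))
                    .all().get()
                    .get("my-topic");
            System.out.println("Partition count: " + description.partitions().size());
        }
    }
}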
Regarding the rest of the question, consumer instances automatically distribute partition assignment when subscribing to topics.
Producers should not know about consumers, so you don't "put records on partitions" based on any factor of (possible) consumers.
partition count changes at runtime based on keys (customer_id)
Unclear what this means. The partition count can only increase, and if you do increase it, records with the same key may start going to a different partition, which breaks per-key ordering, so you should consider how large your keyspace is before creating the topic. For example, if you have a numeric ID and use the first two digits as the partition value, then you could create a topic with up to 100 partitions.
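To illustrate that last idea (a hedged sketch only; the producer, topic name, and customer ID are assumptions, and IDs are assumed to have at least two digits):

long customerId = 4217L;                                                       // hypothetical customer ID
int partition = Integer.parseInt(Long.toString(customerId).substring(0, 2));   // first two digits -> 42
producer.send(new ProducerRecord<>("customers", partition, Long.toString(customerId), "payload"));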

Kafka topics - How to ensure a one-key-one-partition relationship

I am working on a project where I would like to ensure I have 1 and only 1 key per partition in my topic.
The problem is that I don't know the number of distinct keys I will produce data for (it could be 1, 2, or 1000! The number of different keys streamed varies over time).
In a topic, I know we have to specify an initial number of partitions, but we can add more after creation.
What would be the best solution to ensure 1 key, 1 partition?
I have some leads...
I could create a topic with 3000 partitions in advance so I have a buffer, but that is definitely not optimal. What about disk space consumption? What about the impact on performance?
I could add partitions once I run out of available ones, so I would only ever have as many partitions as keys, but there would be collisions and it would threaten the continuity of the event stream, as a key could potentially be reassigned to a different partition.
I could override the default Partitioner used by my producer services to ensure there are no collisions between keys and that resizing does not affect existing assignments, but how will my consumer know which partition the partitioner chose? And how do I ensure no other producer assigns the same partition number to another key?
Many thanks for your help !

Key and value avro messages distribution in Kafka topic partitions

We use a Kafka topic with 6 partitions, and the incoming messages from producers have 4 keys (key1, key2, key3, key4) and their corresponding values. I see that the values are distributed across only 3 partitions, and the remaining partitions remain empty.
Is the distribution of the messages based on the hash values of the keys?
Let us say the hash value of key1 is XXXX; to which partition does it go among the 6 partitions?
I am using the Kafka Connect HDFS connector to write the data to HDFS, and I know that it uses the hash values of the keys to distribute the messages to the partitions. Is that the same way Kafka distributes the messages?
Yes, the distribution of messages across partitions is determined by a hash of the serialized message key modulo the total partition count of the topic (the Java client uses the murmur2 hash for this). E.g. if you're sending a message m with key k to a topic mytopic that has p partitions, then m goes to the partition hash(k) % p in mytopic. I think that answers your second question too. In your case, two of the resulting hash values are getting mapped to the same partition.
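If you want to check empirically which partition a given key ends up on, you can inspect the RecordMetadata returned by the producer (a sketch assuming a String-keyed producer and a topic named mytopic):

RecordMetadata metadata = producer.send(new ProducerRecord<>("mytopic", "key1", "some value")).get();
System.out.println("key1 went to partition " + metadata.partition());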
If my memory serves me correctly, the Kafka HDFS connector takes care of consuming from a Kafka topic and writing the data into HDFS. You don't need to worry about the partitions there; it is abstracted away.

Uneven Distribution of messages in Kafka Partitions

I have a topic with 10 partitions, 1 consumer group with 4 consumers and worker size is 3.
I can see there is an uneven distribution of messages across the partitions: one partition has a large amount of data while another is empty.
How can I make my producer to evenly distribute the load into all the partitions, so that all partitions are being utilized properly?
According to the JavaDoc comment in the DefaultPartitioner class itself, the default partitioning strategy is:
If a partition is specified in the record, use it.
If no partition is specified but a key is present choose a partition based on a hash of the key.
If no partition or key is present choose a partition in a round-robin fashion.
https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/producer/internals/DefaultPartitioner.java
So here are two possible reasons that may be causing the uneven distribution, depending on whether you are specifying a key while producing the message or not:
If you are specifying a key and you are getting an uneven distribution using the DefaultPartitioner, the most apparent explanation would be that you are specifying the same key multiple times.
If you are not specifying a key and using the DefaultPartitioner, a non-obvious behavior could be happening. According to the above you would expect round-robin distribution of messages, but this is not necessarily the case. An optimization introduced in 0.8.0 could be causing the same partition to be used. Check this link for a more detailed explanation: https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Whyisdatanotevenlydistributedamongpartitionswhenapartitioningkeyisnotspecified? .
Instead of going for the default partitioner class, you can give the producer record a partition number so that the message goes directly to the specified partition:
ProducerRecord<String, String> record = new ProducerRecord<String, String>(topicName, partitionNumber, key, value);
Seems like your problem is uneven consumption of messages rather than uneven producing of messages to the Kafka topic. In other words, your number of consuming threads doesn't match the number of partitions you have (they do not need to match 1:1, but each consumer thread should end up reading from roughly the same number of partitions).
See short explanation for more details.
You can make use of the key parameter of the producer record. The point is that, for a specific key, the data always goes to the same partition. I don't know the structure of your producer record, but since you said you have 10 partitions, you can simply use n % 10 as the producer record key.
Here n is a counter from 0 to 9: record 0 gets key 0, Kafka hashes that key and puts it in some partition, say partition 0; record 1 gets key 1 and goes to another partition, and so on.
This way the key is independent of the fields in your record, so you can keep a variable n and use n % 10 as the key. Note, however, that the hashes of these ten key values are not guaranteed to cover all ten partitions.
Alternatively, you can specify the partition in your producer record directly. So you either use the key or the partition field of the producer record (a sketch of both options follows).
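A hedged sketch of both options in Java (the topic name, producer, and records list are assumptions):

for (int n = 0; n < records.size(); n++) {
    // Option 1: use a counter-derived key; each key value sticks to one partition,
    // but the hashes of "0".."9" are not guaranteed to cover all 10 partitions
    producer.send(new ProducerRecord<>("mytopic", String.valueOf(n % 10), records.get(n)));

    // Option 2: set the partition explicitly and keep whatever key you like
    // producer.send(new ProducerRecord<>("mytopic", n % 10, "some-key", records.get(n)));
}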
If you have a partitioner that is driven by the record itself, say the Kafka key is a string and the value is a Student POJO, and the partition is chosen based on the student's country field: imagine there are 10 partitions in the topic and the country "India" in the value maps to partition number 5.
Whenever the country is "India", Kafka will pick partition 5, and that record will always go to partition 5 (as long as the partition count has not changed).
If lots of the records coming through your pipeline have the country "India", all of those records will go to partition 5, and you will see an uneven distribution across the Kafka partitions (a minimal sketch of such a value-based partitioner is shown below).
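A minimal sketch of such a value-based partitioner (the Student class and its getCountry() accessor are assumptions):

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

public class CountryPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        Student student = (Student) value;
        // Every record with the same country lands on the same partition,
        // so a dominant country (e.g. "India") skews one partition
        return Utils.toPositive(student.getCountry().hashCode()) % numPartitions;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}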
In my case, I used the default partitioner but still had far more records in one partition than in the others. The problem was that I unexpectedly had many records with the same key. Check your keys!
As I was unable to resolve this with Faust, the approach I am using is to implement the 'round-robin' distribution myself.
I iterate over my records to produce and do for example:
for index, message in enumerate(messages):
    topic.send(message, partition=index % num_partitions)
I.e. bound my index to within the range of partitions I have.
There could still be unevenness - consider the case where you run this repeatedly but your number of records is less than num_partitions - then your first partitions will keep getting the major share of messages. You can avoid this issue by adding a random offset:
import random
initial_partition = random.randrange(0, num_partitions)
for index, message in enumerate(messages):
    topic.send(message, partition=(initial_partition + index) % num_partitions)

Why is data not evenly distributed among partitions when a partitioning key is not specified?

Is this explanation still valid in Kafka 0.10?
In Kafka producer, a partition key can be specified to indicate the destination partition of the message. By default, a hashing-based partitioner is used to determine the partition id given the key, and people can use customized partitioners also.
To reduce # of open sockets, in 0.8.0 (https://issues.apache.org/jira/browse/KAFKA-1017), when the partitioning key is not specified or null, a producer will pick a random partition and stick to it for some time (default is 10 mins) before switching to another one. So, if there are fewer producers than partitions, at a given point of time, some partitions may not receive any data. To alleviate this problem, one can either reduce the metadata refresh interval or specify a message key and a customized random partitioner. For more detail see this thread http://mail-archives.apache.org/mod_mbox/kafka-dev/201310.mbox/%3CCAFbh0Q0aVh%2Bvqxfy7H-%2BMnRFBt6BnyoZk1LWBoMspwSmTqUKMg%40mail.gmail.com%3E
From here https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Whyisdatanotevenlydistributedamongpartitionswhenapartitioningkeyisnotspecified?
The new producer has changed to use a round-robin policy. That is to say, messages will be delivered to all partitions evenly if no keys are specified.
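In other words, with the current Java producer you can simply omit the key and let the producer pick the partition itself (a sketch; the topic name and producer are assumptions):

// No key: the producer spreads records across partitions on its own
producer.send(new ProducerRecord<>("mytopic", "some value"));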