Key and value Avro message distribution in Kafka topic partitions - apache-kafka

We use a Kafka topic with 6 partitions, and the incoming messages from producers have 4 keys (key1, key2, key3, key4) and their corresponding values. I see that the values are distributed across only 3 partitions, and the remaining partitions stay empty.
Is the distribution of the messages based on the hash values of the keys?
Let us say the hash value of key1 is XXXX; to which partition does it go among the total of 6 partitions?
I am using the Kafka Connect HDFS connector to write the data to HDFS, and I know that it uses the hash values of the keys to distribute the messages to the partitions. Is this the same way Kafka distributes messages?

Yes, the partition a message goes to is determined by a hash of the message key modulo the total partition count of the topic. With the default Java producer, if you send a message m with key k to a topic mytopic that has p partitions, then m goes to the partition given by the (positive) murmur2 hash of the serialized key modulo p. I think that answers your second question too. In your case, two of the four keys are hashing to the same partition, which is why only 3 of the 6 partitions receive data.
If my memory serves me correctly, the Kafka HDFS connector takes care of consuming from a Kafka topic and writing the data into HDFS. You don't need to worry about the partitions there; that is abstracted away.
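For illustration, here is a minimal sketch of that computation, mirroring the default partitioner's formula (assuming, for simplicity, that the keys serialize to their UTF-8 bytes; Avro-serialized keys yield different bytes and therefore different hash values):

    import java.nio.charset.StandardCharsets;
    import org.apache.kafka.common.utils.Utils;

    public class PartitionForKey {
        public static void main(String[] args) {
            int numPartitions = 6;
            for (String key : new String[]{"key1", "key2", "key3", "key4"}) {
                byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
                // Same formula the default partitioner uses: positive murmur2 hash, mod p
                int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
                System.out.println(key + " -> partition " + partition);
            }
        }
    }

With only 4 distinct keys, a collision like the one you observe (two keys landing in the same partition while other partitions stay empty) is simply a property of the hash values.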

Related

How to make Kafka distribute the data equally among all partitions with multiple keys?

A Kafka topic was created with 10 partitions, and a producer produced multiple messages with 12 different keys (labeled key_1, key_2, key_3, ..., key_12).
It was observed that all the messages were sent to only 2 partitions, with most of the messages in one partition and the remaining few in the other. 8 out of 10 partitions remained empty.
How can Kafka be made to distribute the data equally among all 10 partitions based on the keys?
You'd need to write your own partitioner class to make an even distribution a guarantee.
Otherwise, the computed hashes of the keys you sent, taken modulo the number of partitions, may well collapse into only 2 of the 10 partitions.
Since you have 12 distinct keys and 10 partitions, it is impossible to guarantee a uniform distribution based on key values. The reason is simple: the partitioner is a function, and {f(key_1), f(key_2), ..., f(key_12)} is a subset of {p1, p2, ..., p10} in which some partitions may not appear at all and some may appear multiple times.
You have the following choices:
Write a custom partitioner that maps keys 1-10 to partitions 1-10 and keys 11 and 12 to, say, partitions 1 and 2. Alternatively, increase the number of partitions from 10 to 12 and write the partitioner so that it puts each key into its own partition.
Remove the key from your messages so that the round-robin algorithm is used. However, messages that previously shared a key (which is now gone) may and will end up in different partitions.
As a quick-and-dirty option, write your own partitioner that ignores the message key and puts each message into a random partition, for example in a round-robin way.
For details on how to implement a partitioner, look at Kafka's default one, org.apache.kafka.clients.producer.internals.DefaultPartitioner, on GitHub; a sketch of a custom one follows.
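Here is a minimal sketch of the first option, assuming the topic has been grown to 12 partitions and the keys follow the key_<n> naming from the question (the class name and the suffix parsing are hypothetical):

    import java.util.Map;
    import org.apache.kafka.clients.producer.Partitioner;
    import org.apache.kafka.common.Cluster;

    // Hypothetical partitioner that pins key_1 .. key_12 to partitions 0 .. 11.
    public class KeyIndexPartitioner implements Partitioner {

        @Override
        public int partition(String topic, Object key, byte[] keyBytes,
                             Object value, byte[] valueBytes, Cluster cluster) {
            int numPartitions = cluster.partitionCountForTopic(topic);
            // Parse the numeric suffix of "key_<n>" and map it to partition n-1.
            int n = Integer.parseInt(((String) key).substring("key_".length()));
            return (n - 1) % numPartitions;
        }

        @Override
        public void close() {}

        @Override
        public void configure(Map<String, ?> configs) {}
    }

The producer would pick it up via props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, KeyIndexPartitioner.class.getName()).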

Kafka Streams Co-Partitioning is required while joining two KStreams

Recently I started reading about Kafka Streams for an upcoming project and stumbled upon the concept that co-partitioning is required if we want to join two streams. All I was able to understand is that if we have two topics A and B, both must have the same number of partitions, and for a key 'X' the partition number must also be the same for both topics. Say we have
Topic A with partitions A0, A1, A2
Topic B with partitions B0, B1, B2
then a message with key 'X' must be published to A0 and B0 respectively.
Question: why must the partition number be the same for both topics (for key 'X'), and what issues might we face if the two topics have the same number of partitions but some partitions are idle, i.e. messages are not distributed evenly across partitions?
When you use Kafka Streams, a Kafka consumer group is used under the hood, so your topic partitions are assigned according to Kafka's partition assignment strategies; the default is the range assignor.
To join two streams, both messages with the same key must be available to the same consumer instance; otherwise your streaming consumer cannot find the other message to join with. To ensure that, the partition number must be the same for both topics and the key must be the same.
When the partition number is the same for both topics, the range assignor makes sure the same partition of each topic is assigned to the same instance.
That is the Kafka perspective. On the application side, your producer should make sure to produce messages using the hash partitioner, which is the default. Then, given the same number of partitions in both topics, hashing guarantees that the same key goes to the same partition number in both topics.
Kafka Streams enforces co-partitioning precisely to catch the cases where your topics do not satisfy these conditions.
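To make this concrete, here is a minimal Kafka Streams sketch of a KStream-KStream join; the topic names topic-a, topic-b and joined-output are hypothetical, and both input topics are assumed to be co-partitioned (same partition count, default hash partitioning on the producer side):

    import java.time.Duration;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.JoinWindows;
    import org.apache.kafka.streams.kstream.KStream;

    public class CoPartitionedJoin {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> a = builder.stream("topic-a");
            KStream<String, String> b = builder.stream("topic-b");

            // Records with the same key arriving within 5 minutes of each other are joined.
            // This only works because key 'X' lives in the same partition number of both
            // topics, so the same stream task sees both sides of the join.
            KStream<String, String> joined = a.join(
                    b,
                    (va, vb) -> va + "+" + vb,
                    JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(5)));
            joined.to("joined-output");

            System.out.println(builder.build().describe());
        }
    }

If the partition counts of the two input topics differ, Kafka Streams refuses to run the join and reports that the topics are not co-partitioned.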

How are messages distributed among Kafka partitions?

Suppose we have one topic with 4 partitions in Kafka, and there are 4 publishers which publish messages to the same topic.
Each publisher publishes a different number of messages: publisher1 publishes W messages, publisher2 publishes X messages, publisher3 publishes Y messages and publisher4 publishes Z messages.
How many messages end up in each partition?
Unless your producers specifically write to certain partitions (by providing the partition number while constructing the ProducerRecord), the message produced by each producer will - by default - land in one of the partitions based on its key. Internally, the following logic is used:
org.apache.kafka.common.utils.Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
where keyBytes is the byte representation of your key and numPartitions is 4 in your case. If you are not using any key, messages will be distributed in a round-robin fashion.
Therefore, it is not possible to predict how many messages end up in each partition without knowing the keys being used (if keys are used at all).
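The three cases - explicit partition, keyed, and keyless - can be seen side by side in a small producer sketch (the broker address and topic name are assumptions):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ProduceModes {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Explicit partition 2: the key hash is bypassed entirely.
                producer.send(new ProducerRecord<>("mytopic", 2, "some-key", "v1"));
                // Keyed: partition = toPositive(murmur2(keyBytes)) % numPartitions.
                producer.send(new ProducerRecord<>("mytopic", "some-key", "v2"));
                // No key: records are spread across partitions by the partitioner.
                producer.send(new ProducerRecord<>("mytopic", "v3"));
            }
        }
    }

Note that newer client versions replace plain round-robin for keyless records with a 'sticky' strategy that fills a batch for one partition before moving on; the spread across partitions still evens out over time.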

Autoscaling with Kafka and non-transactional databases

Say I have an application that reads a batch of data from Kafka, uses the keys of the incoming messages to query HBase (reading the current data from HBase for those keys), does some computation, and writes data back to HBase for the same set of keys. For example:
{K1, V1}, {K2, V2}, {K3, V3} (incoming messages from Kafka) --> my application reads the current values of K1, K2 and K3 from HBase, uses the incoming values V1, V2 and V3 in some computation, and writes the new values for K1 (V1+x), K2 (V2+y) and K3 (V3+z) back to HBase after the processing is complete.
Now, let's say I have one partition for the Kafka topic and 1 consumer. My application has one consumer thread that is processing the data.
The problem is that, say, HBase goes down, at which point my application stops processing messages and a huge lag builds up in Kafka. Even though I have the ability to increase the number of partitions and, correspondingly, the consumers, I cannot increase either of them because of race conditions in HBase. HBase doesn't support row-level locking, so if I increase the number of partitions, the same key could go to two different partitions and, correspondingly, to two different consumers, who may end up in a race condition where whoever writes last wins. I would have to wait until all the messages are processed before I could increase the number of partitions.
For example:
HBase goes down --> initially I have one partition for the topic with an unprocessed message {K3, V3} in partition 0 --> now I increase the number of partitions, and messages with key K3 are now present in, let's say, partitions 0 and 1 --> then the consumer consuming from partition 0 and another consumer consuming from partition 1 will end up competing to write to HBase.
Is there a solution to this problem? Of course, locking the key K3 in the consumer processing the message is not a solution, since we are dealing with big data.
When you increase the number of partitions, only new messages are routed to the newly added partitions; messages already written stay in the partition they were written to. (Exactly-once processing is a separate concern, which Kafka addresses with its transactional APIs.)
A message will only ever appear in one and only one Kafka partition. The producer applies a hash function to the message key, modulo the number of partitions. I believe this guarantee solves your problem.
But bear in mind that if you change the number of partitions, the same message key could be allocated to a different partition. That matters if you care about the ordering of messages, which is only guaranteed per partition. If you care about message ordering, repartitioning (e.g. increasing the number of partitions) is not an option.
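A quick sketch of that caveat, using the key "K3" from the question and assuming UTF-8 key bytes with the default murmur2-based partitioner:

    import java.nio.charset.StandardCharsets;
    import org.apache.kafka.common.utils.Utils;

    public class RepartitionEffect {
        public static void main(String[] args) {
            int hash = Utils.toPositive(Utils.murmur2("K3".getBytes(StandardCharsets.UTF_8)));
            // With a single partition, every key maps to partition 0 ...
            System.out.println("1 partition:  K3 -> partition " + (hash % 1));
            // ... but after growing the topic the same key may map elsewhere,
            // while its already-written messages stay behind in partition 0.
            System.out.println("2 partitions: K3 -> partition " + (hash % 2));
        }
    }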
As Vassilis mentioned, Kafka guarantees that a single key will only ever be in one partition (for a given partition count).
There are different strategies for distributing keys over partitions.
When you increase the partition count or change the partitioning strategy, a rebalance can occur, which may affect working consumers. If you stop the consumers for a while, you can avoid the possibility of the same key being processed by two consumers.

Kafka partition performance issue

I have 6 partitions for a topic and created a consumer group with 6 consumers, so each consumer reads data from 1 partition. The problem is that one partition holds far more data than the others, and due to this data skew the consumer on that partition performs slowly. How do I handle this situation?
You need to know which keys you are sending into the topic. If the keys are null, this skew shouldn't be possible, since keyless records are spread evenly across partitions.
If the distribution of non-null keys is skewed towards a few values, then the partitions those values hash to will be larger.
Otherwise, you are welcome to write your own implementation of the Partitioner interface.
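To confirm which partitions are oversized, one option is to compare each partition's beginning and end offsets; a minimal sketch, assuming a local broker and the hypothetical topic name mytopic:

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import java.util.stream.Collectors;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.ByteArrayDeserializer;

    public class PartitionSkewCheck {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);

            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                List<TopicPartition> partitions = consumer.partitionsFor("mytopic").stream()
                        .map(p -> new TopicPartition(p.topic(), p.partition()))
                        .collect(Collectors.toList());
                Map<TopicPartition, Long> begin = consumer.beginningOffsets(partitions);
                Map<TopicPartition, Long> end = consumer.endOffsets(partitions);
                for (TopicPartition tp : partitions) {
                    // Retained message count per partition (approximate; ignores compaction).
                    System.out.println(tp + ": ~" + (end.get(tp) - begin.get(tp)) + " messages");
                }
            }
        }
    }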