Kafka REST Proxy API: uneven distribution of messages

I have 2 partitions and messages in Avro format. I send the messages via the Kafka REST Proxy API and use a key for each message. The key is a string; for example, here are my keys:
41:46-300, 41:45-300, 41:44-300, 41:43-300, 41:42-300.
But the messages are unevenly distributed. In partition 0 there are messages with keys 41:46-300, 41:45-300, 41:44-300, 41:43-300, and in partition 1 there are only messages with the key 41:42-300.
Kafka version: 2.4
Could you explain why this happens?

Kafka uses Murmur2 hashing of the message key to choose a partition, not an evenly distributing round-robin mechanism.
So all of the keys that ended up in the same partition simply have hashes that, taken modulo the partition count, map to that partition.
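To see why all five keys can cluster like this, you can reproduce the producer's choice yourself. Below is a minimal Python sketch: a port of the murmur2 hash used by Kafka's DefaultPartitioner (`org.apache.kafka.common.utils.Utils` in the Java client; the REST Proxy delegates to the same producer). The helper names are mine, and the output should be treated as illustrative.

```python
def murmur2(data: bytes) -> int:
    """Port of Kafka's 32-bit murmur2 (seed 0x9747b28c), as used by the
    default partitioner in the Java producer."""
    length = len(data)
    m = 0x5bd1e995
    mask = 0xffffffff
    h = (0x9747b28c ^ length) & mask
    # process the key four bytes at a time, little-endian
    for i in range(length // 4):
        k = int.from_bytes(data[i * 4:i * 4 + 4], "little")
        k = (k * m) & mask
        k ^= k >> 24
        k = (k * m) & mask
        h = (h * m) & mask
        h ^= k
    # handle the trailing 1-3 bytes (mirrors the Java switch fall-through)
    left = length % 4
    tail = length - left
    if left >= 3:
        h ^= (data[tail + 2] & 0xff) << 16
    if left >= 2:
        h ^= (data[tail + 1] & 0xff) << 8
    if left >= 1:
        h ^= data[tail] & 0xff
        h = (h * m) & mask
    h ^= h >> 13
    h = (h * m) & mask
    h ^= h >> 15
    return h

def partition_for(key: str, num_partitions: int) -> int:
    # toPositive(murmur2(keyBytes)) % numPartitions, as in the Java client
    return (murmur2(key.encode("utf-8")) & 0x7fffffff) % num_partitions

keys = ["41:46-300", "41:45-300", "41:44-300", "41:43-300", "41:42-300"]
for k in keys:
    print(k, "-> partition", partition_for(k, 2))
```

With only 2 partitions and 5 keys, it is quite likely by chance alone that four of the five hashes land in the same bucket; more keys or more partitions smooth this out.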

Related

Apache Kafka PubSub

How does pub/sub work in Kafka?
I was reading about Kafka topic-partition theory, and it mentioned that within one consumer group, each partition will be processed by only one consumer. Now there are two cases:
1. If the producer didn't specify a partition or a message key, the messages will be evenly distributed across the partitions of the topic. If this is the case, and there can be only one consumer (or subscriber, in pub/sub terms) per partition, how do all the subscribers receive the same message?
2. If the producer produced to a specific partition, how do the other consumers (or subscribers) receive the message?
How does pub/sub work in each of the above cases? If only a single consumer can be attached to a specific partition, how do the other consumers receive the same message?
Kafka prevents more than one consumer in a group from reading a single partition. If you have a use case where multiple consumers in a consumer group need to process a particular event, then Kafka is probably the wrong tool. Otherwise, you need to write code external to the Kafka API to transmit one consumer's events to other services via other protocols. The Kafka Streams Interactive Query feature (with an RPC layer) is one example of this.
Or you would need many unique consumer groups to read the same event.
The answer doesn't change when producers send data to specific partitions, since "evenly distributed" partitions are still pre-computed as far as the consumer is concerned. The consumer API is assigned to specific partitions and does not coordinate the assignment with any producer.
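The fan-out rule above (every group sees every partition; within a group, each partition has exactly one owner) can be modelled in a few lines. This is a toy simulation, not the Kafka API; real assignors (range, round-robin, sticky) are more elaborate, and the names here are mine.

```python
from collections import defaultdict

def assign(partitions, groups):
    """Toy model of consumer-group assignment: every group collectively
    covers all partitions, but inside a group each partition goes to
    exactly one member (simple round-robin here)."""
    assignment = defaultdict(list)  # (group, member) -> [partitions]
    for group, members in groups.items():
        for i, p in enumerate(partitions):
            assignment[(group, members[i % len(members)])].append(p)
    return dict(assignment)

partitions = [0, 1, 2, 3]
groups = {"billing": ["c1", "c2"], "audit": ["c3"]}
print(assign(partitions, groups))
```

Two groups subscribed to the same topic each receive every message once; adding a second member to a group only splits that group's partitions, it never duplicates delivery within the group.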

Kafka default partitioner behavior when number of producers more than partitions

From the Kafka FAQ page:
In the Kafka producer, a partition key can be specified to indicate the destination partition of the message. By default, a hashing-based partitioner is used to determine the partition id given the key.
So all the messages with a particular key will always go to the same partition in a topic.
How does the consumer know which partition the producer wrote to, so it can consume directly from that partition?
If there are more producers than partitions, and multiple producers are writing to the same partition, how are the offsets ordered so that the consumers can consume messages from specific producers?
How does the consumer know which partition the producer wrote to
Doesn't need to, or at least shouldn't, as this would create a tight coupling between clients. All consumer instances should be responsible for handling all messages for the subscribed topic. While you can assign a Consumer to a list of TopicPartition instances, and you can call the methods of the DefaultPartitioner for a given key to find out what partition it would have gone to, I've personally not run across a need for that. Also, keep in mind, that Producers have full control over the partitioner.class setting, and do not need to inform Consumers about this setting.
If there are more producers than partitions, and multiple producers are writing to the same partition, how are the offsets ordered...
The number of producers or partitions doesn't matter. Batches are written sequentially to partitions. You can limit the number of batches sent at once per producer client (and you only need one instance per application) with max.in.flight.requests.per.connection, but for separate applications you of course cannot control any ordering.
so that the consumers can consume messages from specific producers?
Again, this should not be done.
Kafka is a distributed event-streaming platform, and one of its use cases is decoupling services: one application produces messages to topics and another application reads from them.
If you have more than one producer, the order of the data within a topic partition is not guaranteed between producers; it is simply the order in which the messages were written to the topic. (Even with one producer there can be ordering issues; read about the idempotent producer.)
Offset assignment is atomic: no two messages in a partition will get the same offset.
The offset is a running number, and it has meaning only within a specific topic and a specific partition.
If you use the default partitioner, the murmur2 algorithm decides which partition each message is sent to. When a record containing a key is sent to Kafka, the partitioner in the producer runs the hash function over the key, and the returned value is the number of the partition that key will be sent to. Because it is the same murmur2 function everywhere, the same key maps to the same partition even across different producers.
The consumer is assigned/subscribed to topic partitions; it does not know which key was sent to which partition. Within a consumer group, an assignor function decides which consumer will handle which partition.
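The "offset is a running number per partition" point can be shown with a toy broker-side log. This is a sketch of the bookkeeping only, under my own naming; real brokers append whole batches, but the offset rule is the same.

```python
class PartitionLog:
    """Toy model of a single partition's log: the broker appends records
    in arrival order, and the offset is just a running number."""
    def __init__(self):
        self.records = []

    def append(self, value):
        offset = len(self.records)  # next offset, unique within this partition
        self.records.append(value)
        return offset

log = PartitionLog()
# two producers interleaving on the same partition: offsets reflect
# arrival order at the broker, not which producer sent what
assert log.append("producer-A:m1") == 0
assert log.append("producer-B:m1") == 1
assert log.append("producer-A:m2") == 2
```

Consumers therefore cannot (and should not) filter by producer via offsets; if per-producer routing matters, it belongs in the message payload or key, not the offset.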

Will Kafka allow "unbalanced" partitions?

One question raised during system design: if the message key is chosen such that one value occurs very frequently in the stream of data, does that mean that a single topic partition will receive all of these messages, even if that creates an imbalance in how the partitions are filled with data?
Does Kafka have a mechanism to "split" messages with the same key among several partitions, sacrificing ordering in this case?
Or are there no exceptions to the key-to-partition allocation, regardless of how it impacts partition sizes?
To answer the question in the title: yes, Kafka will allow unbalanced partitions.
You can define your own partitioner class to decide where messages are sent. The default partitioner uses the murmur2 algorithm to decide where to send each key, so the same key always lands in the same partition. If your use case does not require ordering between events, you may not need to send a key at all; the messages will then be distributed across the partitions. (In recent updates, Kafka also "batches" keyless messages from a producer to the same partition for better throughput.)
To make it clear: Kafka does not require you to send a key with a message.
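Kafka has no built-in key splitting, but a custom partitioner can "salt" a hot key itself, trading per-key ordering for balance. A hedged sketch of the idea follows; the helper names are mine, and `zlib.crc32` merely stands in for murmur2 to keep the example dependency-free.

```python
import random
import zlib

def stable_partition(key, num_partitions):
    # crc32 stands in for murmur2 here; real Kafka hashes the
    # serialized key bytes with murmur2
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def salted_partition(key, num_partitions, fanout=2):
    """Spread one hot key over up to `fanout` partitions by appending a
    random salt before hashing. Ordering is preserved only per salted
    sub-key, no longer per original key."""
    salt = random.randrange(fanout)
    return stable_partition(f"{key}#{salt}", num_partitions)

# a hot key now lands on at most `fanout` distinct partitions
print({salted_partition("hot-key", 6) for _ in range(100)})
```

Consumers that need to reassemble per-key order would have to merge the salted sub-streams themselves, which is exactly the "sacrificing order" trade-off the question describes.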

How are messages distributed in the kafka partition?

Say we have one topic with 4 partitions in Kafka, and 4 publishers which publish messages to that same topic.
Each publisher publishes a different number of messages: publisher1 publishes W messages, publisher2 publishes X messages, publisher3 publishes Y messages, and publisher4 publishes Z messages.
How many messages end up in each partition?
Unless your producers specifically write to certain partitions (by providing the partition number while constructing the ProducerRecord), the message produced by each producer will, by default, land in one of the partitions based on its key. Internally, the following logic is used:
org.apache.kafka.common.utils.Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
where keyBytes is the byte representation of your key and numPartitions is 4 in your case. If you are not using any key, messages are distributed in a round-robin fashion.
Therefore, it is not possible to predict how many messages end up in each partition without knowing the keys being used (if keys are used at all).
More on the partitioning of messages is given here.
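For the keyless case, the classic round-robin behavior is easy to model: each producer cycles through the partitions independently. A sketch under my own naming follows; note that since Kafka 2.4 the default for null keys is actually the sticky partitioner, which fills one batch per partition at a time rather than rotating per message.

```python
from collections import Counter

def round_robin_counts(messages_per_producer, num_partitions):
    """Model of keyless sends under the classic round-robin partitioner:
    each producer cycles through partitions independently, starting at 0."""
    totals = Counter()
    for produced in messages_per_producer:   # e.g. W, X, Y, Z
        for i in range(produced):
            totals[i % num_partitions] += 1  # i-th message -> partition i mod p
    return totals

# four producers sending 10, 7, 5 and 3 keyless messages to 4 partitions
print(round_robin_counts([10, 7, 5, 3], 4))  # Counter({0: 8, 1: 7, 2: 6, 3: 4})
```

Even in this idealized model the counts are only near-even, because each producer's remainder messages pile up on the low-numbered partitions; real distributions also depend on batching and timing.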

Key and value avro messages distribution in Kafka topic partitions

We use a Kafka topic with 6 partitions, and the incoming messages from producers have 4 keys (key1, key2, key3, key4) and their corresponding values. I see that the values are distributed across only 3 partitions while the remaining partitions stay empty.
Is the distribution of the messages based on the hash values of the keys?
Let's say the hash value of key1 is XXXX; to which partition does it go among the total of 6 partitions?
I am using the Kafka Connect HDFS connector to write the data to HDFS, and I know that it uses the hash values of the keys to distribute the messages to the partitions. Is this the same way Kafka distributes the messages?
Yes, the distribution of messages across partitions is determined by the hash of the message key modulo the total partition count of the topic. E.g. if you send a message m with key k to a topic mytopic that has p partitions, then m goes to partition murmur2(serializedBytes(k)) % p (taken as a positive value) in mytopic; note that it is the murmur2 hash of the serialized key bytes, not Java's hashCode. I think that answers your second question too. In your case, two of the resulting hash values are mapping to the same partition.
If my memory serves me correctly, the Kafka HDFS connector takes care of consuming from a Kafka topic and putting the data into HDFS. You don't need to worry about the partitions there; that is abstracted away.