So I have a Kafka topic with multiple partitions, and I'm producing messages to it. I want my messages to be partitioned based on user id. I can achieve this either by using the user id as the message key or by writing a custom partitioner. How do I figure out which is the right solution, and what are the pros and cons of each?
Using the user id as the key guarantees that messages with the same user id are always delivered to the same partition, but you can't choose that partition yourself: the default partitioner computes a hash of the key modulo the number of partitions to pick the destination partition.
If your application needs messages with a specific user id to go to a specific partition (e.g. you want user ids beginning with "A" to go to partition 0), you need to write a custom partitioner.
If you have no such restrictions, I think the default partitioner with the user id as key works fine for you.
In any case, you get the partition information back after sending (in the producer metadata) and on receiving (on each consumed record).
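The hash-and-modulo behaviour described above can be sketched in plain Java. Note this is only an illustration: the real DefaultPartitioner hashes the serialized key bytes with murmur2, whereas this sketch uses String.hashCode() as a stand-in, and the class name is made up.

```java
// Illustration of key-based partition selection: hash(key) % numPartitions.
// The real DefaultPartitioner uses murmur2 on the serialized key bytes;
// String.hashCode() here is only a stand-in for the idea.
public class KeyPartitioningSketch {
    public static int partitionFor(String userId, int numPartitions) {
        // Mask off the sign bit so the result is never negative.
        return (userId.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```

The same user id always maps to the same partition, but which partition that is falls out of the hash; you don't get to pick it.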
Related
One question raised during system design: if the message key is chosen in such a way that one key value occurs very frequently in the stream, does that mean a single topic partition will receive all of those messages, even if that leaves the partitions unevenly filled with data?
Does Kafka have a mechanism to "split" messages with the same key among several partitions, sacrificing ordering in that case?
Or is there no exception to the key -> partition allocation, regardless of how it impacts partition sizes?
To answer the question in your title: yes, Kafka will allow unbalanced partitions.
You can define your own partitioner class to decide where each message is sent. By default the producer uses the murmur2 algorithm on the key to decide where to send it, so records with the same key land in the same partition. If your use case does not require ordering between events, you might not need to send a key at all, and the messages will then be distributed across the partitions. In recent versions, Kafka also "batches" keyless messages from the producer to the same partition for even better throughput.
To make it clear: Kafka does not require you to send a key with a message.
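The "batching" of keyless messages mentioned above (the sticky partitioner in newer producer versions) can be sketched roughly as follows. The class name and rollover rule are simplifications for illustration: the real producer switches partitions when a batch fills or linger time expires, and picks the next partition at random, not sequentially.

```java
// Rough sketch of "sticky" partition assignment for records without a key:
// stay on one partition until a batch fills, then move to the next one,
// instead of alternating record by record. The real producer rolls over
// on batch size or linger time and picks the next partition at random;
// this sketch increments deterministically, for illustration only.
public class StickyAssignmentSketch {
    private final int numPartitions;
    private final int batchSize;
    private int current = 0;
    private int recordsInBatch = 0;

    public StickyAssignmentSketch(int numPartitions, int batchSize) {
        this.numPartitions = numPartitions;
        this.batchSize = batchSize;
    }

    public int next() {
        if (recordsInBatch == batchSize) {
            current = (current + 1) % numPartitions; // batch full: move on
            recordsInBatch = 0;
        }
        recordsInBatch++;
        return current;
    }
}
```

Filling one batch per partition at a time gives the broker larger requests than strict record-by-record round-robin, which is where the throughput gain comes from.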
In a Kafka producer, I am sending two different sets of data. I have two partitions for the topic. The first set is sent with a key and the second one without a key. As far as I know, the key is used to choose the partition for the data. If the key is absent, null is sent and partitioning happens by round-robin scheduling.
But the question is: if I am sending data with and without a key alternately for some period of time, what will happen?
Will round-robin scheduling happen only for the partitions excluding the one chosen by the key, or will it happen across both partitions?
Kafka selects the partition according to the rules below.
If a custom partitioner is configured, the partition is chosen by the custom partitioner's logic.
If no custom partitioner is configured, Kafka uses the DefaultPartitioner:
a. If the key is null, the partition is selected round-robin.
b. If the key is non-null, it uses a Murmur2 hash of the key, modulo the number of partitions, to identify the partition for the topic.
So with the DefaultPartitioner and no custom partitioner defined, messages (whether their key is null or not) can end up on both partitions; a key does not reserve a partition for itself.
To publish a message to a specific partition, you can use the methods below.
Pass the partition explicitly while publishing a message:
/**
 * Creates a record to be sent to a specified topic and partition
 */
public ProducerRecord(String topic, Integer partition, K key, V value) {
    this(topic, partition, null, key, value, null);
}
Or you can create a custom partitioner and implement the logic to select the partition:
https://kafka.apache.org/10/javadoc/org/apache/kafka/clients/producer/Partitioner.html
I want to correct you: you said that the key is used to make partitions for the data. Actually, the key sent with a message is there to guarantee message ordering for a specific field.
If key = null, data is sent round-robin (to a different partition and to a different broker in a distributed environment, and of course to the same topic).
If a key is sent, then all messages for that key will always go to the same partition.
Explanation and example
The key can be any string or integer, etc. Take an integer employee_id as the key, for example.
Then employee_id 123 will always go to one particular partition, say partition 0, and employee_id 345 will always go to another, say partition 1. This is decided by the key hashing algorithm, which also depends on the number of partitions.
If you don't send any key, the message can go to any partition, chosen by a round-robin technique.
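If you want to see exactly how a key lands on a partition, the hash can be reproduced. The sketch below ports Kafka's murmur2 hash (from org.apache.kafka.common.utils.Utils) and applies the DefaultPartitioner's final "positive hash modulo partition count" step. The concrete partition numbers (0, 1) in the example above are illustrative; the real mapping depends on the key bytes and the partition count.

```java
import java.nio.charset.StandardCharsets;

// Re-implementation of Kafka's murmur2 hash (ported from
// org.apache.kafka.common.utils.Utils) plus the DefaultPartitioner's
// final step: toPositive(murmur2(keyBytes)) % numPartitions.
public class Murmur2PartitionerSketch {

    public static int partition(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return (murmur2(keyBytes) & 0x7fffffff) % numPartitions;
    }

    static int murmur2(byte[] data) {
        int length = data.length;
        int seed = 0x9747b28c;
        final int m = 0x5bd1e995;
        final int r = 24;
        int h = seed ^ length;
        int length4 = length / 4;
        for (int i = 0; i < length4; i++) {
            final int i4 = i * 4;
            int k = (data[i4] & 0xff) + ((data[i4 + 1] & 0xff) << 8)
                  + ((data[i4 + 2] & 0xff) << 16) + ((data[i4 + 3] & 0xff) << 24);
            k *= m;
            k ^= k >>> r;
            k *= m;
            h *= m;
            h ^= k;
        }
        // Handle the last few bytes of the input array (intentional fallthrough).
        switch (length % 4) {
            case 3: h ^= (data[(length & ~3) + 2] & 0xff) << 16;
            case 2: h ^= (data[(length & ~3) + 1] & 0xff) << 8;
            case 1: h ^= data[length & ~3] & 0xff;
                    h *= m;
        }
        h ^= h >>> 13;
        h *= m;
        h ^= h >>> 15;
        return h;
    }
}
```

Note that the mapping changes if the number of partitions changes, which is one reason adding partitions to a topic breaks key-to-partition stability.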
Kafka has a very organized scenario when it comes to sending and storing records in partitions. As you have mentioned, the key is used so that records with the same key go to the same partition. This helps maintain the chronological order of those messages within that partition.
In your case, the two partitions will store the data as:
Partition 1: stores the data that carries the particular key. Records with this key will always go to this partition; that is decided by the hash of the key, so no custom partitioner is needed for it. Apart from that, some records with a null key will also land on this partition, since null-key records are distributed over all partitions in a round-robin fashion.
Partition 2: this partition will receive only records sent without a key (i.e. the key is null), via the same round-robin distribution.
I am new to Kafka and Micronaut and I do not understand the usage of @KafkaKey. What I found on the internet is:
The Kafka key can be specified by providing a parameter annotated with
@KafkaKey. If no such parameter is specified the record is sent with a null key.
So what exactly does that mean? How will it affect me if I do not use it?
The most important effect of Kafka message keys is partitioning. For example, if the key chosen is a user id, then all data for a given user is sent to the same partition. If you don't specify a key for your messages, Kafka uses a round-robin strategy for message distribution.
Kafka preserves order within a partition. When you specify a key for a particular message type, all messages with that key are bound to the particular partition associated with it. Since the order of messages is preserved within a partition, you can preserve message order by specifying a key. This is particularly useful if you are working with state machines.
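The ordering guarantee can be illustrated with a small simulation (a hypothetical class, not part of any Kafka API): because every record for a key is appended to the same partition, replaying that partition returns the key's events in the order they were sent.

```java
import java.util.ArrayList;
import java.util.List;

// Simulation of per-partition ordering: records are appended to the
// partition chosen by their key, so replaying one partition yields each
// key's records in the exact order they were sent. Hypothetical class,
// with String.hashCode() standing in for the real key hash.
public class OrderingSketch {
    private final List<List<String>> partitions = new ArrayList<>();

    public OrderingSketch(int numPartitions) {
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    public int send(String key, String value) {
        int p = (key.hashCode() & 0x7fffffff) % partitions.size();
        partitions.get(p).add(value); // append-only, like a Kafka log
        return p;
    }

    public List<String> readPartition(int p) {
        return partitions.get(p);
    }
}
```

For a state machine this matters because "created, paid, shipped" replayed out of order would produce an invalid transition.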
All of the examples of Kafka producers show the ProducerRecord's key/value pair as not only being the same type (all examples show <String, String>), but also the same value. For example:
producer.send(new ProducerRecord<String, String>("someTopic", Integer.toString(i), Integer.toString(i)));
But in the Kafka docs, I can't seem to find where the key/value concept (and its underlying purpose/utility) is explained. In traditional messaging (ActiveMQ, RabbitMQ, etc.) I've always fired a message at a particular topic/queue/exchange. But Kafka is the first broker that seems to require key/value pairs instead of just a regular ol' string message.
So I ask: What is the purpose/usefulness of requiring producers to send KV pairs?
Kafka uses the abstraction of a distributed log that consists of partitions. Splitting a log into partitions allows the system to scale out.
Keys are used to determine the partition within a log to which a message gets appended, while the value is the actual payload of the message. The examples are actually not very good in this regard; usually you would have a complex type as the value (like a tuple type, JSON, or similar) and you would extract one field to use as the key.
See: http://kafka.apache.org/intro#intro_topics and http://kafka.apache.org/intro#intro_producers
In general, the key and/or value can be null, too. If the key is null, a partition is selected in a round-robin fashion. If the value is null, it can have special "delete" semantics if you enable log compaction instead of the log-retention policy for a topic (http://kafka.apache.org/documentation#compaction).
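The "delete" semantics of null values under log compaction can be sketched as a simplified compaction pass. This is only the per-key end result: real compaction runs in the broker's log cleaner, works on log segments, and retains tombstones for a configurable period before removing them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified log compaction: keep only the latest value per key, and
// treat a null value as a tombstone that deletes the key entirely.
public class CompactionSketch {
    // Each record is a {key, value} pair; a null value is a tombstone.
    public static Map<String, String> compact(String[][] log) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (String[] record : log) {
            String key = record[0], value = record[1];
            if (value == null) {
                latest.remove(key); // tombstone: "delete" semantics
            } else {
                latest.put(key, value); // newer value supersedes older one
            }
        }
        return latest;
    }
}
```

This is why compacted topics make good changelogs: replaying one yields the latest state per key, with deleted keys absent.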
Late addition... Specifying the key so that all messages with the same key go to the same partition is very important for proper ordering of message processing when you have multiple consumers in a consumer group on a topic.
Without a key, two related messages could go to different partitions and be processed by different consumers in the group out of order.
Another interesting use case
We could use the key attribute in Kafka topics for sending user_ids and then plug in a consumer to fetch streaming events (the events stored in the value attribute). This could allow you to process any max-history of user event sequences for creating features in your machine learning models.
I still have to find out whether this is possible or not. I will keep updating my answer with further details.
I am trying to send a message to the KafkaProducer using a ProducerRecord:
new ProducerRecord<>(topicName, messageKey, message)
This uses the DefaultPartitioner, which uses a hash of the key to ensure that all messages with the same key go to the same partition.
What is the difference between this and using a custom partitioner? I assume a custom partitioner is also used to send messages to the same partition based on the key.
The default partitioning strategy is:
If a partition is specified in the record, use it
If no partition is specified but a key is present choose a partition based on a hash of the key
If no partition or key is present choose a partition in a round-robin fashion
(This is pulled from the DefaultPartitioner source code)
The custom partitioner just lets you set your own strategy. So you could, for example, assign partitions randomly, or, if you somehow have prior knowledge of how large each partition will become, assign them based on that. The "default" part of DefaultPartitioner is mostly about the round-robin strategy. I'd imagine in most/all situations options 1 and 2 would be considered the norm.
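The three rules above can be written out directly as a sketch (hypothetical class; the real DefaultPartitioner hashes serialized key bytes with murmur2, and newer producers use sticky rather than pure round-robin assignment for keyless records):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the default partitioning strategy's three rules, in order.
// String.hashCode() stands in for the real murmur2 key hash.
public class DefaultStrategySketch {
    private final int numPartitions;
    private final AtomicInteger roundRobin = new AtomicInteger(0);

    public DefaultStrategySketch(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    public int partition(Integer explicitPartition, String key) {
        if (explicitPartition != null) {
            return explicitPartition;                          // rule 1: use it
        }
        if (key != null) {
            return (key.hashCode() & 0x7fffffff) % numPartitions; // rule 2: hash key
        }
        return roundRobin.getAndIncrement() % numPartitions;   // rule 3: round-robin
    }
}
```

A custom partitioner would simply replace this method body with its own logic.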