I'm trying to understand how Kafka works. I've read that by default Kafka will distribute the messages from a producer in a round-robin fashion among the partitions.
But why are messages always put in the same partition when they have the same key? (No partitioning strategy is configured.)
For instance, using the code below the messages are always put in the same partition:
KafkaProducer<String, String> producer = new KafkaProducer<>(properties);
String key = properties.getProperty("dev.id");
producer.send(new ProducerRecord<String, String>(properties.getProperty("kafka.topic"), key, value), new EventGeneratorCallback(key));
With a different key, messages are distributed in a round-robin fashion:
KafkaProducer<String, String> producer = new KafkaProducer<>(properties);
String key = properties.getProperty("dev.id") + UUID.randomUUID().toString();
producer.send(new ProducerRecord<String, String>(properties.getProperty("kafka.topic"), key, value), new EventGeneratorCallback(key));
This is exactly how a Kafka producer works. This behaviour is defined by the DefaultPartitioner class, which you can find in the official repo.
If no key is specified, the producer sends messages round-robin across all partitions of the topic. If a key is specified, the partitioner computes a hash of the key modulo the number of partitions, so messages with the same key go to the same partition. Because Kafka guarantees ordering at the partition level (not the topic level), this is also a way to keep all messages with the same key in one partition, so that the consumer receives them in the order they were sent.
Finally, the producer can specify the destination partition itself: in that case neither the key nor round-robin is used, and the message is sent exactly to the partition the producer names.
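The three cases described here can be summarised in a small model. This is only a sketch in plain Java, not the Kafka client: PartitionChoiceModel and choosePartition are hypothetical names, Arrays.hashCode stands in for Kafka's real key hash, and the keyless case uses a simple counter to mimic the round-robin behaviour.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of the producer's partition choice; not the real Kafka client code.
class PartitionChoiceModel {
    private final AtomicInteger counter = new AtomicInteger(0);

    // explicitPartition and key may be null, mirroring an optional partition/key on a record.
    int choosePartition(Integer explicitPartition, String key, int numPartitions) {
        if (explicitPartition != null) {
            // The producer named the partition: use it as-is.
            return explicitPartition;
        }
        if (key != null) {
            byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
            int hash = Arrays.hashCode(keyBytes); // stand-in hash (assumption, not murmur2)
            // Same key -> same hash -> same partition.
            return (hash & 0x7fffffff) % numPartitions;
        }
        // No key and no partition: rotate through the partitions.
        return (counter.getAndIncrement() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        PartitionChoiceModel model = new PartitionChoiceModel();
        System.out.println("keyed:    " + model.choosePartition(null, "dev-42", 3));
        System.out.println("keyed:    " + model.choosePartition(null, "dev-42", 3));
        System.out.println("no key:   " + model.choosePartition(null, null, 3));
        System.out.println("no key:   " + model.choosePartition(null, null, 3));
        System.out.println("explicit: " + model.choosePartition(2, "dev-42", 3));
    }
}
```

The two keyed calls print the same partition, which is the behaviour the question observed with a fixed "dev.id" key.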
This is the expected behavior. All messages with the same key are put in the same partition. If you want round-robin assignment for all messages, you should not provide a key. To Kafka, the purpose of a key is to distribute data across partitions, and putting identical keys in different partitions would break this contract.
I am new to Kafka, so I have some questions about basic things. I want to distribute all messages equally across all partitions.
As far as I know, the producer chooses the partition by hashing the key (if a key is available) using the default partitioner's hash algorithm (Random, Consistent, Murmur2, sticky etc.). Which is great. But I want to distribute the messages across all partitions. Like:
Topic: "Test"
Partition: 3
Now, if I produce messages (with a key available), I want to distribute those messages equally, like:
Partition 1: 1,4,7,10
Partition 2: 2,5,8
Partition 3: 3,6,9
So, how can I distribute messages equally across all partitions?
The default partitioner chooses partition based on the hash of key if a key is available and no partition is specified in the record itself. Otherwise (i.e. no key is present and no partition is specified) it chooses the partition in a round-robin fashion (Kafka<2.4, read below).
public int partition(String key, int partitionNum) {
    byte[] keyBytes = key.getBytes();
    // toPositive and murmur2 are the helpers from Kafka's Utils class
    return toPositive(murmur2(keyBytes)) % partitionNum;
}
With only a handful of keys, the default partitioner may not give you an even data distribution, as toPositive(murmur2(keyBytes)) % numberOfPartitions will have collisions. The best way is for the producer to take responsibility and decide which partition to send each message to, using a custom partitioner based on your business use case.
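To see how a few keys can end up sharing partitions, the hash-then-modulo step can be tried outside Kafka. The sketch below is illustrative only: the murmur2 method is transcribed to mirror the hash used by Kafka's default partitioner and should be checked against the Kafka source before relying on it, and KeySkewDemo and the "order-N" keys are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: counts how many keys land on each partition after hash % numPartitions.
class KeySkewDemo {
    // 32-bit murmur2, written to mirror the variant in Kafka's Utils class (assumption).
    static int murmur2(byte[] data) {
        int length = data.length;
        final int seed = 0x9747b28c;
        final int m = 0x5bd1e995;
        final int r = 24;
        int h = seed ^ length;
        for (int i = 0; i < length / 4; i++) {
            int i4 = i * 4;
            int k = (data[i4] & 0xff) + ((data[i4 + 1] & 0xff) << 8)
                    + ((data[i4 + 2] & 0xff) << 16) + ((data[i4 + 3] & 0xff) << 24);
            k *= m;
            k ^= k >>> r;
            k *= m;
            h *= m;
            h ^= k;
        }
        // Mix in the 1 to 3 trailing bytes; the fall-through is intentional.
        switch (length % 4) {
            case 3:
                h ^= (data[(length & ~3) + 2] & 0xff) << 16;
            case 2:
                h ^= (data[(length & ~3) + 1] & 0xff) << 8;
            case 1:
                h ^= data[length & ~3] & 0xff;
                h *= m;
        }
        h ^= h >>> 13;
        h *= m;
        h ^= h >>> 15;
        return h;
    }

    static int partitionFor(String key, int numPartitions) {
        // Clearing the sign bit plays the role of toPositive.
        return (murmur2(key.getBytes(StandardCharsets.UTF_8)) & 0x7fffffff) % numPartitions;
    }

    // Returns partition -> number of keys mapped to it (partitions with no keys show 0).
    static Map<Integer, Integer> keysPerPartition(String[] keys, int numPartitions) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int p = 0; p < numPartitions; p++) {
            counts.put(p, 0);
        }
        for (String key : keys) {
            counts.merge(partitionFor(key, numPartitions), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] keys = {"order-1", "order-2", "order-3"}; // hypothetical keys
        System.out.println(keysPerPartition(keys, 3));
    }
}
```

Any partition that shows a count of 0 while another shows 2 or more is exactly the skew described here; nothing in the hash guarantees one key per partition.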
Kafka guarantees that any consumer of a given topic-partition will always read that partition's events in exactly the same order as they were written.
https://kafka.apache.org/documentation.html#introduction
One thing to note here: although eliminating data skew is important, messages that go to different partitions of a topic are not guaranteed to be in order relative to each other, which may have consequences depending on your use case. Within a single partition they are stored in order, so keep related messages in the same partition.
For example, in an e-commerce delivery environment, messages related to an order ID should arrive in order (you don't want "Out-For-Delivery" to come after "Delivered"), so all messages for a specific order_id should go into the same partition.
Update:
As mentioned in the comment, Kafka ≥ v2.4 uses Sticky Partitioner as the default partitioner.
The sticky partitioner addresses the problem of spreading out records without keys into smaller batches by picking a single partition to send all non-keyed records. Once the batch at that partition is filled or otherwise completed, the sticky partitioner randomly chooses and “sticks” to a new partition. That way, over a larger period of time, records are about evenly distributed among all the partitions while getting the added benefit of larger batch sizes.
https://www.confluent.io/blog/apache-kafka-producer-improvements-sticky-partitioner/
This means a Kafka producer doesn't send records immediately. It keeps a batch of records for a given topic that have no key and no assigned partition, and keeps adding to the same partition until the batch is ready to be sent. When a new batch is created, a new partition is chosen.
Effectively, the partitioner assigns records to the same partition until the batch is sent, which is governed by batch.size and linger.ms. Once that batch is sent, a new partition is used. Thus messages may not necessarily be evenly distributed.
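A toy simulation makes the resulting unevenness easier to see. This is a sketch of the idea only, not Kafka's implementation: StickySimulation is a hypothetical class, and a batch is treated as full after a fixed number of records, whereas a real producer closes batches based on batch.size (bytes) and linger.ms.

```java
import java.util.Arrays;
import java.util.Random;

// Toy simulation of sticky partitioning for records without keys (not Kafka's code).
class StickySimulation {
    // Returns the partition chosen for each of `records` keyless records.
    // Assumption: a batch counts as complete after `batchRecords` records.
    static int[] assign(int records, int numPartitions, int batchRecords, long seed) {
        Random random = new Random(seed);
        int[] chosen = new int[records];
        int current = random.nextInt(numPartitions);
        int inBatch = 0;
        for (int i = 0; i < records; i++) {
            if (inBatch == batchRecords) {
                // Batch complete: stick to a different, randomly chosen partition.
                int next = current;
                while (numPartitions > 1 && next == current) {
                    next = random.nextInt(numPartitions);
                }
                current = next;
                inBatch = 0;
            }
            chosen[i] = current;
            inBatch++;
        }
        return chosen;
    }

    public static void main(String[] args) {
        int[] counts = new int[3];
        for (int partition : assign(10, 3, 4, 42L)) {
            counts[partition]++;
        }
        // With 10 records and batches of 4, the split can never be an even 4/3/3.
        System.out.println(Arrays.toString(counts));
    }
}
```

Over many batches the counts even out, but over a short run whole batches land on one partition at a time.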
Further Reading:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-480%3A+Sticky+Partitioner
https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner#KIP794:StrictlyUniformStickyPartitioner-UniformStickyBatchSize
https://www.confluent.io/blog/5-things-every-kafka-developer-should-know/#tip-2-new-sticky-partitioner
https://aiven.io/blog/balance-data-across-kafka-partitions#challenge-of-uneven-record-distribution
I think this answers your question best:
https://rajatjain-ix.medium.com/whats-wrong-with-kafka-b53d0549677a
So, there are two solutions available:
Don't specify any partition key. In this case, the DefaultPartitioner automatically spreads the messages across the partitions (round-robin before Kafka 2.4, sticky batching since then).
Use (an incrementing id) % (count of partitions) as the partition number in the Producer API. This way you are manually telling it to round-robin the messages across partitions.
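The second option can be sketched as a small counter-based chooser. RoundRobinChooser is a hypothetical helper, not part of the Kafka API; in a real producer the returned index would be passed as the partition argument of the ProducerRecord constructor that accepts a partition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of option 2: pick the partition number yourself from a running counter.
class RoundRobinChooser {
    private final AtomicLong sequence = new AtomicLong(0);
    private final int numPartitions;

    RoundRobinChooser(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    // Thread-safe: each call returns the next partition index in 0..numPartitions-1.
    int nextPartition() {
        return (int) (sequence.getAndIncrement() % numPartitions);
    }

    public static void main(String[] args) {
        RoundRobinChooser chooser = new RoundRobinChooser(3);
        List<List<Integer>> partitions = new ArrayList<>();
        for (int p = 0; p < 3; p++) {
            partitions.add(new ArrayList<>());
        }
        for (int message = 1; message <= 10; message++) {
            partitions.get(chooser.nextPartition()).add(message);
        }
        // prints [[1, 4, 7, 10], [2, 5, 8], [3, 6, 9]]
        System.out.println(partitions);
    }
}
```

This reproduces the layout asked for in the question, at the cost that messages sharing a key are no longer kept together, so per-key ordering is lost.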
Ronak explained very precisely.
You could achieve distribution of the messages over partitions evenly by implementing Partitioner interface regardless of the key.
New sticky version
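import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.clients.producer.internals.StickyPartitionCache;
import org.apache.kafka.common.Cluster;
public class SimplePartitioner implements Partitioner {
    private final StickyPartitionCache stickyPartitionCache = new StickyPartitionCache();
    @Override
    public void configure(Map<String, ?> configs) {
    }
    // The key is ignored: every record goes to the current sticky partition.
    @Override
    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
        return partition(topic, key, keyBytes, value, valueBytes, cluster, cluster.partitionsForTopic(topic).size());
    }
    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster,
            int numPartitions) {
        return stickyPartitionCache.partition(topic, cluster);
    }
    // Without this, the cache would never move on to another partition
    // (onNewBatch is available in the client versions that ship StickyPartitionCache).
    @Override
    public void onNewBatch(String topic, Cluster cluster, int prevPartition) {
        stickyPartitionCache.nextPartition(topic, cluster, prevPartition);
    }
    @Override
    public void close() {
    }
}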
Old version - see this link: https://github.com/sharefeel/kafka-simple-partitioner/blob/0.8.2/SimplePartitioner.java
Keep this in mind: the target partitions of SimplePartitioner and DefaultPartitioner are not the same, although they often coincide.
If a key is given, DefaultPartitioner returns a number from 0 to numPartitions-1 derived from the key.
SimplePartitioner, however, always returns the value of stickyPartitionCache.partition().
If a partition is unavailable (all replicas of that partition are down), producing will fail with DefaultPartitioner, whereas SimplePartitioner can still produce successfully.
I tested this with the old version of SimplePartitioner but not with the newer one.
From the kafka faq page
In Kafka producer, a partition key can be specified to indicate the destination partition of the message. By default, a hashing-based partitioner is used to determine the partition id given the key
So all the messages with a particular key will always go to the same partition in a topic:
How does the consumer know which partition the producer wrote to, so it can consume directly from that partition?
If there are more producers than partitions, and multiple producers are writing to the same partition, how are the offsets ordered so that the consumers can consume messages from specific producers?
How does the consumer know which partition the producer wrote to
Doesn't need to, or at least shouldn't, as this would create a tight coupling between clients. All consumer instances should be responsible for handling all messages for the subscribed topic. While you can assign a Consumer to a list of TopicPartition instances, and you can call the methods of the DefaultPartitioner for a given key to find out what partition it would have gone to, I've personally not run across a need for that. Also, keep in mind, that Producers have full control over the partitioner.class setting, and do not need to inform Consumers about this setting.
If there are more producers than partitions, and multiple producers are writing to the same partition, how are the offsets ordered...
The number of producers or partitions doesn't matter. Batches are written sequentially to partitions. You can limit the number of batches sent at once per Producer client (and you only need one instance per application) with max.in.flight.requests.per.connection, but across separate applications you of course cannot control any ordering.
so that the consumers can consume messages from specific producers?
Again, this should not be done.
Kafka is a distributed event streaming platform. One of its use cases is decoupling services: one application (the producer) writes messages to topics, and another application (the consumer) reads from those topics.
If you have more than one producer, the order of the data in a topic partition is not guaranteed between producers. It will simply be the order in which the messages are written to the topic (even with one producer there can be ordering issues; read about the idempotent producer).
Assigning an offset is an atomic action, which guarantees that no two messages in a partition get the same offset.
The offset is a running number. It has meaning only within a specific topic and a specific partition.
If you use the default partitioner, the murmur2 algorithm decides which partition a message is sent to. When a record that contains a key is sent to Kafka, the partitioner in the producer hashes the key and takes the result modulo the number of partitions; that value is the partition the record is sent to. Every producer uses the same murmur2 function, so for the same key you keep getting the same partition, whichever producer sends it.
The consumer is assigned or subscribed to topic partitions; it does not know which key was sent to which partition. Within a consumer group, an assignor function decides which consumer handles which partition.
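The point about offsets can be illustrated with a toy model of a single partition. This is a sketch, not broker code: PartitionLog is a hypothetical class, and the synchronized method stands in for the broker serialising appends to one partition.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of one partition's log: an append from any producer gets the next offset.
class PartitionLog {
    private final List<String> records = new ArrayList<>();

    // synchronized stands in for the broker writing to a partition one append at a time.
    synchronized long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    synchronized List<String> snapshot() {
        return new ArrayList<>(records);
    }

    public static void main(String[] args) throws InterruptedException {
        PartitionLog log = new PartitionLog();
        Thread producerA = new Thread(() -> {
            for (int i = 0; i < 3; i++) log.append("A-" + i);
        });
        Thread producerB = new Thread(() -> {
            for (int i = 0; i < 3; i++) log.append("B-" + i);
        });
        producerA.start();
        producerB.start();
        producerA.join();
        producerB.join();
        // How A and B interleave varies between runs; the offsets are always 0..5.
        List<String> stored = log.snapshot();
        for (int offset = 0; offset < stored.size(); offset++) {
            System.out.println(offset + " -> " + stored.get(offset));
        }
    }
}
```

Each producer's own records stay in their send order, but nothing fixes how the two producers interleave, and the offsets carry no information about which producer wrote a record.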
A partition can be assigned by the producer using a number. For example:
kafkaTemplate.send(topic, 1, "[" + LocalDateTime.now() + "]" + "Message to partition 1");
The second parameter, 1, is the id of the partition I want my message sent to. The consumer can then consume this message:
TopicPartition partition1 = new TopicPartition(topic, 1);
consumer1.assign(Arrays.asList(partition1));
But how do I achieve this when the producer chooses the partition from the hash of the key, using the DefaultPartitioner? Example:
kafkaTemplate.send(topic, "forpartition1", "testkey");
Here the key is "forpartition1". How do I assign my consumer to consume from the partition derived from the hash of "forpartition1"? Do I compute the hash for that key again in the consumer, or are there other ways to achieve that? I am pretty new to this technology.
Based on the information that you are new to Kafka, I am tempted to guess you are unintentionally trying an advanced use case and that is probably not what you want.
The common use case is that you publish messages to a topic. Each message is assigned a partition based on its key, and all messages with the same key end up in the same partition.
On the consumer, you subscribe to the whole topic (without explicitly asking for a partition) and Kafka will handle the distribution of partitions between all the consumers available.
This gives you the guarantee that all messages with a specific key will be processed by the same consumer (they all go to the same partition and only one consumer handles each partition) and in the same order they were sent.
If you really want to choose the partition yourself, you can write a partitioner class and configure your producer to use it by setting partitioner.class configuration.
From the Kafka documentation:
Name: partitioner.class
Description: Partitioner class that implements the org.apache.kafka.clients.producer.Partitioner interface.
Type: class
Default: org.apache.kafka.clients.producer.internals.DefaultPartitioner
Valid values: (empty)
Importance: medium
A few example tutorials on how to do it can be found online. Here's a sample for reference:
Write An Apache Kafka Custom Partitioner
Apache Kafka Foundation Course - Custom Partitioner
I am trying to store Messages with different key to different partition.
For example:
ProducerRecord<String, String> rec1 = new ProducerRecord<String, String>("topic", "key1", line);
ProducerRecord<String, String> rec2 = new ProducerRecord<String, String>("topic", "key2", line);
producer.send(rec1);
producer.send(rec2);
But when I run my Producer class, the messages are always stored in a single partition.
As per the documentation, DefaultPartitioner uses the hash code of the message key to find the partition.
I also saw the question "Kafka partition key not working properly", but I cannot find the ByteArrayPartitioner class in the 0.9.x version of the Kafka client library.
props.put("partitioner.class", "kafka.producer.ByteArrayPartitioner")
Update: I am creating the topic on the fly using code.
If i create a topic with partitions manually, then its working fine.
If topics are created "on the fly", they are created with the number of partitions given by the num.partitions parameter (default value 1). And if you have only one partition, all data will go to this single partition.
However, keep in mind that even if you have multiple partitions, a partition can still have several different keys assigned to it! Even if num-partitions == num-distinct-keys, there might be hash collisions that assign two different keys to the same partition (and leave some partitions empty).
If you want to ensure that different keys always go to different partitions, you need to use a custom partitioner or specify the partition number directly.
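One way to picture "specify the partition number directly" is an explicit key-to-partition table kept by the producing application. This is a minimal sketch under stated assumptions: DedicatedPartitionAssigner is a hypothetical helper, the mapping lives only in the memory of one producer process, and it simply refuses new keys once every partition is taken.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: give each distinct key its own partition, in order of first appearance.
class DedicatedPartitionAssigner {
    private final Map<String, Integer> assigned = new HashMap<>();
    private final int numPartitions;

    DedicatedPartitionAssigner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    // Returns the partition reserved for this key, reserving a fresh one on first use.
    synchronized int partitionFor(String key) {
        Integer existing = assigned.get(key);
        if (existing != null) {
            return existing;
        }
        if (assigned.size() >= numPartitions) {
            throw new IllegalStateException("more distinct keys than partitions");
        }
        int partition = assigned.size();
        assigned.put(key, partition);
        return partition;
    }

    public static void main(String[] args) {
        DedicatedPartitionAssigner assigner = new DedicatedPartitionAssigner(2);
        System.out.println("key1 -> " + assigner.partitionFor("key1")); // prints key1 -> 0
        System.out.println("key2 -> " + assigner.partitionFor("key2")); // prints key2 -> 1
        System.out.println("key1 -> " + assigner.partitionFor("key1")); // prints key1 -> 0
    }
}
```

Unlike hashing, this needs shared state: several producers would have to agree on the same table, which is why it only suits a small, known set of keys.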
I am new to Kafka, so apologies if I sound stupid, but this is what I have understood so far: a stream of messages can be defined as a topic, like a category. Every topic is divided into one or more partitions (each partition can have multiple replicas), so they act in parallel.
From the Kafka main site they say
The producer is able to chose which message to assign to which partition within the topic.
This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the message).
Does this mean while consuming I will be able to choose the message offset from particular partition?
While running multiple partitions is it possible to choose from one specific partition i.e partition 0?
In Kafka 0.7 quick start they say
Send a message with a partition key. Messages with the same key are sent to the same partition.
And the key can be provided while creating the producer as below
ProducerData<String, String> data = new ProducerData<String, String>("test-topic", "test-key", "test-message");
producer.send(data);
Now how do I consume messages based on this key? What is the actual impact of using this key while producing in Kafka?
When creating a producer in 0.8beta, we can provide the partitioner class attribute through the config file.
A custom partitioner class can presumably be created by implementing the Kafka Partitioner interface.
But I'm a little confused about how exactly it works, and the 0.8 docs don't explain much either. Any advice, or am I missing something?
This is what I've found so far ..
Define our own custom partitioner class by implementing the Kafka Partitioner interface. The implemented method takes two arguments: first the key that we provide from the producer, and second the number of partitions available. So we can define our own logic to decide which message key goes to which partition.
Now, while creating the producer, we can specify our own partitioner class using the "partitioner.class" attribute:
props.put("partitioner.class", "path.to.custom.partitioner.class");
If we don't set it, Kafka will use its default class and try to distribute messages evenly among the available partitions.
Also tell Kafka how to serialize the key:
props.put("key.serializer.class", "kafka.serializer.StringEncoder");
Now, if we send a message with a key from the producer, the message is delivered to a specific partition (based on the logic written in the custom partitioner class), and at the consumer (SimpleConsumer) level we can specify the partition from which to retrieve those messages.
If we need to pass a String as the key, the custom partitioner class has to handle that itself (for example, take the hash value of the key and then its first two digits).
Each topic in Kafka is split into many partitions. Partition allows for parallel consumption increasing throughput.
The producer publishes messages to a topic using the Kafka producer client library, which balances the messages across the available partitions using a Partitioner. The broker that the producer connects to takes care of sending the message to the broker that is the leader of that partition, using the partition owner information in ZooKeeper. Consumers use Kafka's high-level consumer library (which implicitly handles broker leader changes, manages offset info in ZooKeeper, figures out partition owner info, and so on) to consume messages from partitions in streams; each stream may be mapped to a few partitions, depending on how the consumer chooses to create the message streams.
For example, if there are 10 partitions for a topic and 3 consumer instances (C1,C2,C3 started in that order) all belonging to the same Consumer Group, we can have different consumption models that allow read parallelism as below
Each consumer uses a single stream.
In this model, when C1 starts all 10 partitions of the topic are mapped to the same stream and C1 starts consuming from that stream. When C2 starts, Kafka rebalances the partitions between the two streams. So, each stream will be assigned to 5 partitions(depending on the rebalance algorithm it might also be 4 vs 6) and each consumer consumes from its stream. Similarly, when C3 starts, the partitions are again rebalanced between the 3 streams. Note that in this model, when consuming from a stream assigned to more than one partition, the order of messages will be jumbled between partitions.
Each consumer uses more than one stream (say C1 uses 3, C2 uses 3 and C3 uses 4).
In this model, when C1 starts, all the 10 partitions are assigned to the 3 streams and C1 can consume from the 3 streams concurrently using multiple threads. When C2 starts, the partitions are rebalanced between the 6 streams and similarly when C3 starts, the partitions are rebalanced between the 10 streams. Each consumer can consume concurrently from multiple streams. Note that the number of streams and partitions here are equal. In case the number of streams exceed the partitions, some streams will not get any messages as they will not be assigned any partitions.
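The arithmetic behind both models can be sketched as a simple range split. This is only an illustration: StreamAssignmentModel is a hypothetical class, and as noted above the real rebalance algorithm may distribute partitions differently (for instance 4 vs 6 instead of 5 and 5).

```java
import java.util.ArrayList;
import java.util.List;

// Toy model: spread partitions 0..numPartitions-1 over streams in contiguous ranges.
class StreamAssignmentModel {
    static List<List<Integer>> assign(int numPartitions, int numStreams) {
        List<List<Integer>> streams = new ArrayList<>();
        int base = numPartitions / numStreams;   // every stream gets at least this many
        int extra = numPartitions % numStreams;  // the first `extra` streams get one more
        int next = 0;
        for (int s = 0; s < numStreams; s++) {
            int size = base + (s < extra ? 1 : 0);
            List<Integer> owned = new ArrayList<>();
            for (int i = 0; i < size; i++) {
                owned.add(next++);
            }
            streams.add(owned);
        }
        return streams;
    }

    public static void main(String[] args) {
        System.out.println(assign(10, 1));  // C1 alone: one stream owns all 10 partitions
        System.out.println(assign(10, 2));  // C2 joins: 5 and 5
        System.out.println(assign(10, 3));  // prints [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
        System.out.println(assign(10, 12)); // more streams than partitions: two stay empty
    }
}
```

The last call shows the case mentioned above where streams outnumber partitions and some streams never receive messages.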
Does this mean while consuming I will be able to choose the message offset from particular partition? While running multiple partitions is it possible to choose from one specific partition i.e partition 0?
Yes, your consumer can read messages from one specific partition, but if you want that partition to be identified dynamically, it depends on how you have implemented the Partitioner class in your producer.
Now how do I consume message based on this key? what is the actual impact of using this key while producing in Kafka ?
There are two ways of consuming messages. One is using a ZooKeeper host and the other is using a static host. A ZooKeeper host consumes messages from all partitions. However, if you are using a static host, then you can give it the broker together with the partition numbers that need to be consumed.
Please check below example of Kafka 0.8
Producer
KeyedMessage<String, String> data = new KeyedMessage<String, String>(<<topicName>>, <<KeyForPartition>>, <<Message>>);
Partition Class
public int partition(Object arg0, int arg1) {
    // arg0 is the key given while producing, arg1 is the number of
    // partitions the broker has
    long organizationId = Long.parseLong((String) arg0);
    // if the given key is less than the number of partitions available, send
    // it according to the key; otherwise send it to the last partition
    // (valid partition ids are 0..arg1-1, so a key equal to arg1 is also out of range)
    if (organizationId >= arg1) {
        return (arg1 - 1);
    }
    // return (int) (organizationId % arg1);
    return (int) organizationId;
}
So the partitioner class decides where to send the message based on your logic.
Consumer (NB: I have used the Storm Kafka 0.8 integration)
HostPort hosts = new HostPort("10.**.**.***",9092);
GlobalPartitionInformation gpi = new GlobalPartitionInformation();
gpi.addPartition(0, hosts);
gpi.addPartition(2, hosts);
StaticHosts statHost = new StaticHosts(gpi);
SpoutConfig spoutConf = new SpoutConfig(statHost, <<topicName>>, "/kafkastorm", <<spoutConfigId>>);