Kafka - Message with different key stored in same partition - apache-kafka

I am trying to store messages with different keys in different partitions.
For example:
ProducerRecord<String, String> rec1 = new ProducerRecord<String, String>("topic", "key1", line);
ProducerRecord<String, String> rec2 = new ProducerRecord<String, String>("topic", "key2", line);
producer.send(rec1);
producer.send(rec2);
But when I run my Producer class, the messages are always stored in a single partition.
As per the documentation, the DefaultPartitioner uses the hash of the message key to find the partition.
I also saw this question, Kafka partition key not working properly, but I cannot find the ByteArrayPartitioner class in the 0.9.x version of the Kafka client library.
props.put("partitioner.class", "kafka.producer.ByteArrayPartitioner")
Update: I am creating the topic on the fly from code.
If I create the topic with partitions manually, it works fine.

If topics are created "on the fly", they are created with the number of partitions given by the num.partitions broker parameter (default value 1). And if you have only one partition, all data will go to that single partition.
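If you create the topic from code, one option is to create it with an explicit partition count instead of relying on broker-side auto-creation. A minimal sketch using AdminClient (available in client versions 0.11+; the 0.9.x client mentioned above does not ship it, so this applies only after upgrading; topic name, partition count, and replication factor are placeholders):

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Create "topic" with 2 partitions and replication factor 1.
            NewTopic topic = new NewTopic("topic", 2, (short) 1);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}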
However, keep in mind that even if you have multiple partitions, a partition can still get different keys assigned! Even if num-partitions == num-distinct-keys, there might be hash collisions that assign two different keys to the same partition (and leave some partitions empty).
If you want to ensure that different keys always go to different partitions, you need to use a custom partitioner or specify the partition number directly, as in the sketch below.
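For the example in the question, a minimal sketch that pins each record to its own partition (assuming the topic already has at least two partitions):

// Pin each record to a partition explicitly instead of relying on key hashing.
ProducerRecord<String, String> rec1 = new ProducerRecord<>("topic", 0, "key1", line);
ProducerRecord<String, String> rec2 = new ProducerRecord<>("topic", 1, "key2", line);
producer.send(rec1);
producer.send(rec2);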

Related

Distribute messages equally into partitions in kafka

I am new to Kafka, so I have some questions about the basics. I want to distribute all messages equally across all partitions.
As I understand it, the producer chooses the partition based on key hashing (if a key is available) using the default partitioner's hash algorithm (random, consistent, murmur2, sticky, etc.), which is great. But I want to distribute the messages across all partitions. Like:
Topic: "Test"
Partition: 3
Now, If i produce messages (Key Available) then I want to distribute those messages equally like:
Partition 1: 1,4,7,10
Partition 2: 2,5,8
Partition 3: 3,6,9
So, how can I distribute messages equally to all partitions?
The default partitioner chooses partition based on the hash of key if a key is available and no partition is specified in the record itself. Otherwise (i.e. no key is present and no partition is specified) it chooses the partition in a round-robin fashion (Kafka<2.4, read below).
public int partition(String key, int partitionNum) {
    byte[] keyBytes = key.getBytes();
    return toPositive(murmur2(keyBytes)) % partitionNum;
}
For a handful of keys, the default partitioner may not give you an even data distribution, as toPositive(murmur2(keyBytes)) % numberOfPartitions will have collisions. The best approach is for the producer to take responsibility and decide which partition to send each message to, using a custom Partitioner based on your business use case; see the sketch below.
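For illustration, a minimal sketch of such a custom partitioner. The class name and the "premium-" key prefix are made-up examples; it assumes non-null String keys and a topic with at least two partitions:

import java.util.Map;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class BusinessPartitioner implements Partitioner {

    @Override
    public void configure(Map<String, ?> configs) {
    }

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        String k = (String) key;
        // Route important traffic to a dedicated partition...
        if (k.startsWith("premium-")) {
            return 0;
        }
        // ...and spread everything else over the remaining partitions.
        return 1 + Math.abs(k.hashCode() % (numPartitions - 1));
    }

    @Override
    public void close() {
    }
}

You would register it with props.put("partitioner.class", "your.package.BusinessPartitioner").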
Kafka guarantees that any consumer of a given topic-partition will always read that partition's events in exactly the same order as they were written.
https://kafka.apache.org/documentation.html#introduction
One thing to note here: although eliminating data skew is important, messages going to different partitions of a topic are not ordered relative to each other, which may have consequences depending on your use case. Within a partition, however, messages are stored in order, so keep related messages in the same partition.
For example, in an e-commerce delivery environment, messages related to an order ID should arrive in order (you don't want "Out-For-Delivery" to come after "Delivered"), so messages for a specific order_id should go to the same partition.
Update:
As mentioned in the comment, Kafka ≥ v2.4 uses Sticky Partitioner as the default partitioner.
The sticky partitioner addresses the problem of spreading out records without keys into smaller batches by picking a single partition to send all non-keyed records. Once the batch at that partition is filled or otherwise completed, the sticky partitioner randomly chooses and “sticks” to a new partition. That way, over a larger period of time, records are about evenly distributed among all the partitions while getting the added benefit of larger batch sizes.
https://www.confluent.io/blog/apache-kafka-producer-improvements-sticky-partitioner/
This means Kafka producers don't immediately send records; instead, for records with no key and no assigned partition, the producer keeps a batch for a specific topic and sends everything to the same partition until the batch is ready to be sent. When a new batch is created, a new partition is chosen.
Effectively, the partitioner assigns records to the same partition until the batch is sent (as governed by batch.size and linger.ms); once that batch is sent, a new partition is used. Thus messages may not necessarily be evenly distributed.
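As a rough illustration, these are the two producer settings that control when a batch completes (values below are only illustrative, not recommendations):

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// Larger batches / longer linger keep the sticky partitioner on one
// partition longer before it moves on.
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16384); // bytes per batch (the default)
props.put(ProducerConfig.LINGER_MS_CONFIG, 10);     // wait up to 10 ms to fill a batch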
Further Reading:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-480%3A+Sticky+Partitioner
https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner#KIP794:StrictlyUniformStickyPartitioner-UniformStickyBatchSize
https://www.confluent.io/blog/5-things-every-kafka-developer-should-know/#tip-2-new-sticky-partitioner
https://aiven.io/blog/balance-data-across-kafka-partitions#challenge-of-uneven-record-distribution
I think this answers your question best:
https://rajatjain-ix.medium.com/whats-wrong-with-kafka-b53d0549677a
So, there are two solutions available:
1. You don't specify any partition key. In this case, the DefaultPartitioner will automatically round-robin the messages across the partitions.
2. You use an (incrementing id) % (count of partitions) as the partition number in the Producer API. This way you are manually telling it to round-robin the messages across partitions; see the sketch below.
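A minimal sketch of the second solution (the topic name and the values collection are placeholders):

// An incrementing counter modulo the partition count, passed as the
// explicit partition of each record; the key is left null.
int numPartitions = producer.partitionsFor("my-topic").size();
long counter = 0;
for (String value : values) {
    int partition = (int) (counter++ % numPartitions);
    producer.send(new ProducerRecord<>("my-topic", partition, null, value));
}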
Ronak explained it very precisely.
You can distribute messages evenly over partitions, regardless of the key, by implementing the Partitioner interface.
New sticky version
import java.util.Map;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.clients.producer.internals.StickyPartitionCache;
import org.apache.kafka.common.Cluster;

public class SimplePartitioner implements Partitioner {
    private final StickyPartitionCache stickyPartitionCache = new StickyPartitionCache();

    @Override
    public void configure(Map<String, ?> configs) {
    }

    @Override
    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
        return partition(topic, key, keyBytes, value, valueBytes, cluster, cluster.partitionsForTopic(topic).size());
    }

    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster,
                         int numPartitions) {
        // Ignore the key and let the sticky cache choose the partition.
        return stickyPartitionCache.partition(topic, cluster);
    }

    @Override
    public void close() {
    }
}
Old version - see this link: https://github.com/sharefeel/kafka-simple-partitioner/blob/0.8.2/SimplePartitioner.java
Don't forget this: the target partitions chosen by SimplePartitioner and DefaultPartitioner are not always the same, though they normally are.
If a key is given, DefaultPartitioner will return a number from 0 to numPartitions-1.
But SimplePartitioner always returns the value of stickyPartitionCache.partition().
If there is an unavailable partition (all replicas of that partition down), producing will fail with DefaultPartitioner, but SimplePartitioner can still produce successfully.
I tested this with the old version of SimplePartitioner but not with the newer one.

Kafka Streams: pipe one topic into another

I'm new to Kafka Streams and I'm using it to make an exact copy of a topic into another one with a different name. The topic has several partitions and my producers use custom partitioners. The output topic is created beforehand with the same number of partitions as the input topic.
In my app, I did (I'm using Kotlin):
val builder = StreamsBuilder()
builder
    .stream<Any, Any>(inputTopic)
    .to(outputTopic)
This works, except for the partitioning (because, of course, I'm using a custom partitioner). Is there a simple way to copy input records to the output topic using the same partition as the input record?
I checked the Processor API, which allows access to the partition of the input record through a ProcessorContext, but I was unable to manually set the partition of the output record.
Apparently, I could use a custom partitioner in the sink, but that would imply deserializing and serializing the records just to recalculate the output partition with my custom partitioner.
Produced (which is one of the KStream::to arguments) has a StreamPartitioner as one of its members.
You could try the following code:
builder.stream("input", Consumed.with(Serdes.ByteArray(), Serdes.ByteArray()))
.to("output", Produced.with(Serdes.ByteArray(), Serdes.ByteArray(), (topicName, key, value, numberOfPartitions) -> calculatePartition(topicName, key, value, numberOfPartitions));
In the above code only ByteArray Serdes are used, so the records pass through without any deserialization or serialization.
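A hypothetical calculatePartition for this setup, assuming the upstream producers partition by a murmur2 hash of the serialized key; substitute your own custom partitioner's logic, applied to the raw bytes (non-null keys assumed):

import org.apache.kafka.common.utils.Utils;

// Mirrors the producers' partitioning on the raw key bytes, so output
// records land on the same partition number as their input records.
private static Integer calculatePartition(String topic, byte[] keyBytes,
                                          byte[] valueBytes, int numPartitions) {
    return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
}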
Firstly, messages are distributed among partitions based on the key: messages with the same key always go to the same partition.
So if your messages have keys, you don't need to worry about it at all. As long as you have the same number of partitions as your original topic, it will be taken care of.
Secondly, if you are copying data to another topic as-is, you should consider consuming from the original topic instead. Kafka has the notion of consumer groups.
For example, if you have a topic 'transactions', you can have consumer groups such as 'credit card processor', 'mortgage payment processor', 'apple pay processor', and so on. The consumer groups all read the same topic, and each filters out the events that are meaningful to it and processes them.
You could also create 3 topics and achieve the same result, though that is not an optimal solution. You can find more information at https://kafka.apache.org/documentation/.

Uneven Distribution of messages in Kafka Partitions

I have a topic with 10 partitions, 1 consumer group with 4 consumers, and a worker size of 3.
I can see an uneven distribution of messages across the partitions: one partition has a lot of data while another is empty.
How can I make my producer distribute the load evenly across all partitions, so that all of them are utilized properly?
According to the JavaDoc comment in the DefaultPartitioner class itself, the default partitioning strategy is:
If a partition is specified in the record, use it.
If no partition is specified but a key is present choose a partition based on a hash of the key.
If no partition or key is present choose a partition in a round-robin fashion.
https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/producer/internals/DefaultPartitioner.java
So here are two possible reasons that may be causing the uneven distribution, depending on whether you are specifying a key while producing the message or not:
If you are specifying a key and you are getting an uneven distribution using the DefaultPartitioner, the most apparent explanation would be that you are specifying the same key multiple times.
If you are not specifying a key and using the DefaultPartitioner, a non-obvious behavior could be happening. According to the above you would expect round-robin distribution of messages, but this is not necessarily the case. An optimization introduced in 0.8.0 could be causing the same partition to be used. Check this link for a more detailed explanation: https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Whyisdatanotevenlydistributedamongpartitionswhenapartitioningkeyisnotspecified? .
Instead of going with the default partitioner class, you can give the producer record a partition number so that the message goes directly to the specified partition:
ProducerRecord<String, String> record = new ProducerRecord<String, String>(topicName, partitionNumber, key, value);
It seems your problem is uneven consumption of messages rather than uneven production of messages to the Kafka topic. In other words, the number of reading threads doesn't match the number of partitions you have (they don't need to match 1:1, but each consumer thread should read from about the same number of partitions).
See this short explanation for more details.
You can make use of the key parameter of the producer record. Data for a given key always goes to the same partition. Since you have 10 partitions, you can simply use n % 10 as your producer record key, where n runs over your records: record 0 gets key 0, record 1 gets key 1, and so on, cycling through keys 0 to 9. Kafka hashes each key and assigns it to some partition (note that key n does not necessarily land on partition n, and hash collisions can still cause some skew).
This way you approximate round-robin over your producer records: the key is independent of the fields in your record, so you can keep a counter n and use n % 10 as the key.
Or you can specify the partition in your producer record directly. So you use either the key or the partition field of the producer record; a sketch of the key-based approach follows.
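A minimal sketch of the key-based approach (topic name and values collection are placeholders; again, key n does not necessarily land on partition n):

int n = 0;
for (String value : values) {
    // The key cycles through "0".."9", spreading records over the partitions.
    String key = String.valueOf(n % 10);
    producer.send(new ProducerRecord<>("my-topic", key, value));
    n++;
}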
Suppose you have defined a custom partitioner over the record, say with a String key and a Student POJO value.
Say that based on the student's country field you want the record to go to a specific partition. Imagine there are 10 partitions in a topic and that, for the value's country "India", the partitioner computes partition number 5.
Whenever the country is "India", Kafka will allocate partition number 5, and that record always goes to partition 5 (as long as the partitioning logic has not changed).
If lots of records coming through your pipeline have the country "India", they will all go to partition number 5, and you will see an uneven distribution across the Kafka partitions.
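A sketch of the partition method of such a value-based partitioner (Student and its getCountry() are assumed types, not part of any Kafka API):

@Override
public int partition(String topic, Object key, byte[] keyBytes,
                     Object value, byte[] valueBytes, Cluster cluster) {
    int numPartitions = cluster.partitionsForTopic(topic).size();
    Student student = (Student) value;
    // Every record with the same country lands on the same partition,
    // which is exactly what produces the skew described above.
    return Math.abs(student.getCountry().hashCode() % numPartitions);
}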
In my case, I used the default partitioner but still had far more records in one partition than in the others. The problem was that I unexpectedly had many records with the same key. Check your keys!
As I was unable to resolve this with Faust, the approach I am using is to implement the round-robin distribution myself.
I iterate over the records to produce and do, for example:
for index, message in enumerate(messages):
    topic.send(message, partition=index % num_partitions)
I.e., I bound the index to within the range of partitions I have.
There could still be unevenness: if you run this repeatedly but your number of records is less than num_partitions, the first partitions will keep getting the major share of messages. You can avoid this issue by adding a random offset:
import random

initial_partition = random.randrange(0, num_partitions)
for index, message in enumerate(messages):
    topic.send(message, partition=(initial_partition + index) % num_partitions)

KafkaProducer round-robin distribution not working for the same key

I'm trying to understand how Kafka works. I've read that by default Kafka distributes the messages from a producer among the partitions in a round-robin fashion.
But why are the messages always put in the same partition if they have the same key? (No partitioning strategy has been configured.)
For instance, using the code below, the messages are always put in the same partition:
KafkaProducer<String, String> producer = new KafkaProducer<>(properties);
String key = properties.getProperty("dev.id");
producer.send(new ProducerRecord<String, String>(properties.getProperty("kafka.topic"), key, value), new EventGeneratorCallback(key));
With a different key, messages are distributed in a round-robin fashion:
KafkaProducer<String, String> producer = new KafkaProducer<>(properties);
String key = properties.getProperty("dev.id") + UUID.randomUUID().toString();
producer.send(new ProducerRecord<String, String>(properties.getProperty("kafka.topic"), key, value), new EventGeneratorCallback(key));
This is exactly how a Kafka producer works. This behaviour is defined by the DefaultPartitioner class, which you can find in the official repo.
If the key isn't specified, the producer sends messages across all of the topic's partitions in a round-robin fashion; if the key is specified, the partitioner computes a hash of the key modulo the number of partitions, so messages with the same key go to the same partition. Because Kafka guarantees ordering at the partition level (not the topic level), this is also the way to have all messages with the same key land in the same partition and therefore be received by a consumer in the same order they were sent.
Finally, the producer can also specify the destination partition itself: in this case there is no key and no round-robin; the message is sent exactly to the partition specified by the producer.
This is the expected behavior. All messages with the same key are put in the same partition. If you want round-robin assignment for all messages, you should not provide a key; see the sketch below. To Kafka, the reason to have a key is to determine the partition a record lands in, and putting identical keys in different partitions would break this contract.
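For instance, dropping the key from the first snippet above (a minimal sketch; the callback is omitted for brevity):

// No key: the DefaultPartitioner spreads these records across partitions
// (round-robin before Kafka 2.4, sticky batching from 2.4 on).
producer.send(new ProducerRecord<String, String>(properties.getProperty("kafka.topic"), value));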

Kafka partitioner class, assign message to partition within topic using key

I am new to Kafka, so apologies if I sound stupid, but what I have understood so far is: a stream of messages can be defined as a topic, like a category. Every topic is divided into one or more partitions (each partition can have multiple replicas), so they act in parallel.
From the Kafka main site they say
The producer is able to choose which message to assign to which partition within the topic.
This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the message).
Does this mean that, while consuming, I will be able to choose the message offset from a particular partition?
While running multiple partitions, is it possible to choose from one specific partition, i.e. partition 0?
In Kafka 0.7 quick start they say
Send a message with a partition key. Messages with the same key are sent to the same partition.
And the key can be provided when creating the message, as below:
ProducerData<String, String> data = new ProducerData<String, String>("test-topic", "test-key", "test-message");
producer.send(data);
Now how do I consume messages based on this key? What is the actual impact of using this key while producing in Kafka?
When creating the producer in 0.8 beta, we can provide the partitioner class attribute through the config file.
A custom partitioner class can be created by implementing the Kafka Partitioner interface.
But I'm a little confused about how exactly it works; the 0.8 docs also don't explain much. Any advice, or am I missing something?
This is what I've found so far:
Define your own custom partitioner class by implementing the Kafka Partitioner interface. The implemented method takes two arguments: first, the key we provide from the producer, and second, the number of partitions available. So we can define our own logic to decide which key goes to which partition.
Now while creating the producer we can specify our own partitioner class using the "partitioner.class" attribute
props.put("partitioner.class", "path.to.custom.partitioner.class");
If we don't specify it, Kafka will use its default class and try to distribute messages evenly among the available partitions.
Also tell Kafka how to serialize the key:
props.put("key.serializer.class", "kafka.serializer.StringEncoder");
Now if we send a message using a key, the message will be delivered to a specific partition (based on our logic in the custom partitioner class), and at the consumer (SimpleConsumer) level we can specify the partition from which to retrieve the messages.
If we need to pass a String as a key, the same should be handled in the custom partitioner class (e.g., take the hash value of the key and reduce it into the partition range); a sketch follows.
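A minimal sketch of that idea against the old 0.8 Partitioner interface (the class name is made up; the VerifiableProperties constructor is required so the producer can instantiate the class reflectively):

import kafka.producer.Partitioner;
import kafka.utils.VerifiableProperties;

public class StringHashPartitioner implements Partitioner {

    public StringHashPartitioner(VerifiableProperties props) {
    }

    @Override
    public int partition(Object key, int numPartitions) {
        // Mask off the sign bit so negative hash codes stay in range.
        return (key.toString().hashCode() & 0x7fffffff) % numPartitions;
    }
}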
Each topic in Kafka is split into many partitions. Partitioning allows for parallel consumption, increasing throughput.
The producer publishes messages to a topic using the Kafka producer client library, which balances the messages across the available partitions using a Partitioner. The broker the producer connects to takes care of forwarding the message to the broker that is the leader of that partition, using the partition owner information in ZooKeeper. Consumers use Kafka's high-level consumer library (which implicitly handles broker leader changes, manages offset info in ZooKeeper, figures out partition owner info, etc.) to consume messages from partitions in streams; each stream may be mapped to a few partitions depending on how the consumer chooses to create the message streams.
For example, if there are 10 partitions for a topic and 3 consumer instances (C1, C2, C3, started in that order) all belonging to the same consumer group, we can have different consumption models that allow read parallelism, as below:
Each consumer uses a single stream.
In this model, when C1 starts, all 10 partitions of the topic are mapped to the same stream and C1 starts consuming from that stream. When C2 starts, Kafka rebalances the partitions between the two streams, so each stream is assigned 5 partitions (depending on the rebalance algorithm it might also be 4 vs 6) and each consumer consumes from its own stream. Similarly, when C3 starts, the partitions are again rebalanced between the 3 streams. Note that in this model, when consuming from a stream assigned to more than one partition, the order of messages will be jumbled between partitions.
Each consumer uses more than one stream (say C1 uses 3, C2 uses 3, and C3 uses 4).
In this model, when C1 starts, all 10 partitions are assigned to its 3 streams, and C1 can consume from the 3 streams concurrently using multiple threads. When C2 starts, the partitions are rebalanced between the 6 streams, and similarly, when C3 starts, the partitions are rebalanced between the 10 streams. Each consumer can consume concurrently from multiple streams. Note that the number of streams and partitions here are equal. If the number of streams exceeds the number of partitions, some streams will not get any messages, as they will not be assigned any partitions. A sketch of this multi-stream model follows.
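A sketch with the old high-level consumer API (classes from kafka.consumer and kafka.javaapi.consumer), showing one consumer process requesting 3 streams; property values are placeholders:

Properties props = new Properties();
props.put("zookeeper.connect", "localhost:2181");
props.put("group.id", "my-group");
ConsumerConnector consumer = Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
topicCountMap.put("my-topic", 3); // ask for 3 streams for this topic
Map<String, List<KafkaStream<byte[], byte[]>>> streams =
        consumer.createMessageStreams(topicCountMap);
// Each of the 3 streams can now be consumed from its own thread.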
Does this mean while consuming I will be able to choose the message offset from particular partition? While running multiple partitions is it possible to choose from one specific partition i.e partition 0?
Yes, you can choose messages from one specific partition in your consumer, but if you want the partition to be determined dynamically, that depends on the logic you implemented in the Partitioner class in your producer.
Now how do I consume message based on this key? what is the actual impact of using this key while producing in Kafka ?
There are two ways of consuming messages. One is using a ZooKeeper host and the other is a static host. A ZooKeeper-based consumer consumes messages from all partitions. However, if you are using a static host, you can tell it the broker and the partition number to consume from.
Please check the Kafka 0.8 example below.
Producer
KeyedMessage<String, String> data = new KeyedMessage<String, String>(<<topicName>>, <<KeyForPartition>>, <<Message>>);
Partitioner class
import kafka.producer.Partitioner;
import kafka.utils.VerifiableProperties;

public class CustomPartitioner implements Partitioner {

    public CustomPartitioner(VerifiableProperties props) {
    }

    // arg0 is the key given while producing; arg1 is the number of
    // partitions the topic has
    public int partition(Object arg0, int arg1) {
        long organizationId = Long.parseLong((String) arg0);
        // If the given key exceeds the number of partitions available,
        // send the message to the last partition; otherwise use the key
        // itself as the partition number.
        if (arg1 < organizationId) {
            return arg1 - 1;
        }
        // return (int) (organizationId % arg1);
        return Integer.parseInt((String) arg0);
    }
}
So the partitioner class decides where to send a message based on your logic.
Consumer (note: I have used the Storm Kafka 0.8 integration)
HostPort hosts = new HostPort("10.**.**.***", 9092);
GlobalPartitionInformation gpi = new GlobalPartitionInformation();
gpi.addPartition(0, hosts);
gpi.addPartition(2, hosts);
StaticHosts statHost = new StaticHosts(gpi);
SpoutConfig spoutConf = new SpoutConfig(statHost, <<topicName>>, "/kafkastorm", <<spoutConfigId>>);