Kafka topics - How to ensure a one-key-one-partition relationship - apache-kafka

I am working on a project where I would like to ensure I have 1 and only 1 key per partition in my topic.
The problem is that I don't know how many distinct keys I will produce data for (it could be 1, 2, or 1000, and the number of distinct keys being streamed varies over time).
For a topic, I know we have to specify an initial number of partitions, and that we can add more after creation.
What would be the best way to ensure one key per partition?
I have some leads:
I could create the topic with 3000 partitions in advance so that I have a buffer, but that is clearly not optimal. What about disk space consumption? What about the performance impact?
I could add partitions whenever I run out of free ones, so that I only ever have as many partitions as keys, but there would be collisions, and it would threaten the continuity of the event stream because a key could be reassigned to a different partition.
I could override the default Partitioner used by my producer services to ensure that keys never collide and that resizing does not move them, but how would my consumers know which partition the partitioner chose? And how would I ensure that no other producer assigns the same partition number to a different key?
Many thanks for your help!

Related

Do we need to know number of partitions for a topic beforehand?

We want to put the messages/records of different customers on different partitions of a Kafka topic.
But the number of customers is not known in advance. So how can we set the partition count for the Kafka topic in this case? Do we need some other approach in which the partition count changes at runtime based on keys (customer_id in this case)? Thanks in advance.
Regarding "need to know number of partitions":
Assuming Java, call AdminClient.describeTopics() and read partitions() from each TopicDescription in the result.
Regarding the rest of the question, partitions are automatically distributed among the consumer instances of a group when they subscribe to topics.
Producers should not know about consumers, so you don't "put records on partitions" based on any factor of (possible) consumers.
Regarding "partition count changes at runtime based on keys (customer_id)":
It is unclear what this means. The partition count can only increase, and if you do increase it, existing keys may map to different partitions, so per-key ordering is no longer preserved. You should therefore consider how large your keyspace is before creating the topic. For example, if you have a numeric ID and use its first two digits as the partition value, you could create a topic with up to 100 partitions.
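A minimal sketch of the "first two digits" idea, in Python. The function name partition_for_customer and the sample IDs are hypothetical; the returned integer is what you would pass as the explicit partition when producing the record.

```python
def partition_for_customer(customer_id: int, num_partitions: int = 100) -> int:
    """Map a numeric ID to a partition using its first two decimal digits."""
    if customer_id < 0:
        raise ValueError("customer_id must be non-negative")
    prefix = int(str(customer_id)[:2])  # single-digit IDs map to 0..9
    # The modulo keeps the result valid if the topic has fewer than 100 partitions.
    return prefix % num_partitions


print(partition_for_customer(4711))    # first two digits are 47
print(partition_for_customer(470000))  # also 47, so it shares that partition
print(partition_for_customer(7))       # single digit: 7
```

Note that many IDs share the same prefix, so this bounds the partition count but does not give one key per partition.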

Kafka topic with multiple sources

Suppose I have 1 Kafka topic with 1 partition, and multiple sources post to that same partition. What happens if 2 servers try to post to the partition at the same time? Would the information from the two servers be mixed, or would one of them wait until the other finishes?
Messages from the producers will be interleaved in the partition.
In theory, events are guaranteed to be appended in order per partition per producer. With multiple producers there is no ordering guarantee across them, and even for a single producer the behaviour depends on the producer configuration, in particular max.in.flight.requests.per.connection = 1. The reason is that if several requests are in flight and the first one fails and is retried, the second one gets appended to the log first, which breaks the ordering.
Have a glance at https://blog.softwaremill.com/does-kafka-really-guarantee-the-order-of-messages-3ca849fd19d2
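The reordering described above can be illustrated with a toy model in Python. This is not Kafka client code: simulate_appends and its parameters are made up for illustration, and the model assumes a failed batch is simply retried after the other batches in the same in-flight window have been appended.

```python
def simulate_appends(batches, failing, max_in_flight):
    """Return the order in which batches land in the log.

    batches: batch names in the order the producer sends them.
    failing: batches whose first send attempt fails (they succeed on retry).
    max_in_flight: how many batches may be outstanding at once.
    """
    log = []
    pending = list(batches)
    failed_once = set()
    while pending:
        window, pending = pending[:max_in_flight], pending[max_in_flight:]
        retry = []
        for batch in window:
            if batch in failing and batch not in failed_once:
                failed_once.add(batch)
                retry.append(batch)  # will be sent again in the next window
            else:
                log.append(batch)
        pending = retry + pending
    return log


print(simulate_appends(["A", "B", "C"], failing={"A"}, max_in_flight=2))
print(simulate_appends(["A", "B", "C"], failing={"A"}, max_in_flight=1))
```

With two batches in flight, B is appended before the retried A; with one in flight, the send order is preserved.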
If the key is somehow the same for both sources and for every record, all records will be written to the same partition (the other partitions will remain empty).
If each source has its own key and uses it for every message it sends, messages from different sources can be written to different partitions (provided the partition count is no less than the source count and the keys do not hash to the same partition).
If each value has a different key, regardless of source, Kafka will still mix them across partitions, as far as I know.
In short, the key determines the partition of a message. Values with the same key go to the same partition. If every record has a unique key, hashing the keys spreads incoming messages across the partitions, and each partition ends up with roughly the same number of records.
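To make the key-to-partition mapping concrete, here is a small Python sketch. It assumes the usual "hash of the key modulo the partition count" scheme, but uses zlib.crc32 as a stand-in hash (the Java client's default partitioner uses murmur2), so the actual partition numbers will differ from what Kafka would pick. The key names are invented.

```python
import zlib
from collections import defaultdict


def partition_for_key(key: str, num_partitions: int) -> int:
    # Stand-in hash for illustration only; Kafka's Java client uses murmur2.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def group_by_partition(keys, num_partitions):
    """Group keys by the partition they hash to."""
    by_partition = defaultdict(list)
    for key in keys:
        by_partition[partition_for_key(key, num_partitions)].append(key)
    return dict(by_partition)


keys = [f"source-{i}" for i in range(6)]
layout = group_by_partition(keys, num_partitions=4)
for partition, assigned in sorted(layout.items()):
    print(partition, assigned)
```

A key always maps to the same partition, but with 6 keys and 4 partitions at least two keys must share one, which is why hashing alone cannot give a strict one-key-one-partition layout.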

In Kafka, if I increase the number of partitions in a topic then will order of messages be broken? (I used a key to partition)

Recently, I started to study Kafka and have been thinking about how to adopt it in my service. Some of my messages must be processed in strict order, so I chose to use a key for partitioning on the producer. However, even though we only need one partition right now, we might increase the number of partitions in the near future. So, in Kafka, if I increase the number of partitions in a topic, will consumers still get messages in order?
Thanks in advance.
If you increase the partition count, there is no guarantee that future records with equal keys will land in their prior partition, so for a temporary period, determined by topic retention, you will have keys spanning more than one partition (by default).
One workaround is to make sure you have consumed all messages, stop all clients interacting with the topic, then empty the topic and increase the count.
Or you can start with a higher partition count from the beginning, so that keys are already spread over multiple partitions and you do not need to resize later.
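The effect of resizing can be sketched in Python. As an assumption for illustration, the partition is computed as hash(key) modulo the partition count, with zlib.crc32 standing in for the murmur2 hash the Java client uses; the key names are invented.

```python
import zlib


def partition_for_key(key: str, num_partitions: int) -> int:
    # Stand-in hash for illustration only; Kafka's Java client uses murmur2.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def moved_keys(keys, old_count, new_count):
    """Keys whose partition differs between the old and new partition counts."""
    return [
        key
        for key in keys
        if partition_for_key(key, old_count) != partition_for_key(key, new_count)
    ]


keys = [f"order-{i}" for i in range(1000)]
moved = moved_keys(keys, old_count=1, new_count=4)
print(f"{len(moved)} of {len(keys)} keys change partition")
```

Every key that moves has older records in one partition and newer records in another, so a consumer can no longer rely on partition order for that key until the old records expire.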

Uneven Distribution of messages in Kafka Partitions

I have a topic with 10 partitions and 1 consumer group with 4 consumers; the worker size is 3.
I can see an uneven distribution of messages across the partitions: one partition holds a lot of data while another is empty.
How can I make my producer distribute the load evenly across all partitions, so that all of them are properly utilized?
According to the JavaDoc comment in the DefaultPartitioner class itself, the default partitioning strategy is:
If a partition is specified in the record, use it.
If no partition is specified but a key is present choose a partition based on a hash of the key.
If no partition or key is present choose a partition in a round-robin fashion.
https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/producer/internals/DefaultPartitioner.java
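The three rules quoted above can be sketched as a small Python class. This is an illustration of the decision order only, not the Java implementation: SketchPartitioner is a hypothetical name, and zlib.crc32 is a stand-in for the murmur2 hash used by the real DefaultPartitioner.

```python
import itertools
import zlib
from typing import Optional


class SketchPartitioner:
    """Illustrates the decision order of the quoted partitioning strategy."""

    def __init__(self, num_partitions: int) -> None:
        self.num_partitions = num_partitions
        self._counter = itertools.count()  # drives the round-robin case

    def partition(self, key: Optional[bytes] = None,
                  partition: Optional[int] = None) -> int:
        if partition is not None:
            # Rule 1: an explicit partition in the record wins.
            return partition
        if key is not None:
            # Rule 2: hash the key (stand-in hash, see the note above).
            return zlib.crc32(key) % self.num_partitions
        # Rule 3: no partition and no key, so rotate through the partitions.
        return next(self._counter) % self.num_partitions


p = SketchPartitioner(num_partitions=3)
print(p.partition(key=b"customer-1", partition=2))  # explicit partition: 2
print([p.partition() for _ in range(6)])            # [0, 1, 2, 0, 1, 2]
```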
So here are two possible reasons that may be causing the uneven distribution, depending on whether you are specifying a key while producing the message or not:
If you are specifying a key and you are getting an uneven distribution using the DefaultPartitioner, the most apparent explanation would be that you are specifying the same key multiple times.
If you are not specifying a key and using the DefaultPartitioner, a non-obvious behavior could be happening. According to the above you would expect round-robin distribution of messages, but this is not necessarily the case. An optimization introduced in 0.8.0 could be causing the same partition to be used. Check this link for a more detailed explanation: https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Whyisdatanotevenlydistributedamongpartitionswhenapartitioningkeyisnotspecified? .
Instead of relying on the default partitioner, you can give the producer record an explicit partition number so that the message goes directly to the specified partition:
ProducerRecord<String, String> record = new ProducerRecord<>(topicName, partitionNumber, key, value);
It seems like your problem is uneven consumption of messages rather than uneven production of messages to the Kafka topic. In other words, the number of reading threads does not match the number of partitions you have (they do not need to match 1:1, though; each consumer thread only needs the same number of partitions to read from).
See short explanation for more details.
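To see why 10 partitions cannot be shared equally by 4 consumers, here is a simplified Python sketch of an even, contiguous split. The function assign_partitions and the consumer names are hypothetical; the real assignment depends on the partition assignor configured for the consumer group.

```python
def assign_partitions(num_partitions, consumers):
    """Split partition numbers over consumers as evenly as possible."""
    base, extra = divmod(num_partitions, len(consumers))
    assignment = {}
    start = 0
    for index, consumer in enumerate(consumers):
        # The first `extra` consumers each take one additional partition.
        size = base + (1 if index < extra else 0)
        assignment[consumer] = list(range(start, start + size))
        start += size
    return assignment


assignment = assign_partitions(10, ["c1", "c2", "c3", "c4"])
for consumer, partitions in assignment.items():
    print(consumer, partitions)
```

Two consumers end up with 3 partitions and two with 2, so even perfectly balanced partitions give uneven work per consumer; 8 or 12 partitions would divide evenly among 4 consumers.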
You can make use of the key parameter of the producer record. For a given key, the data always goes to the same partition. I don't know the structure of your producer record, but since you said you have 10 partitions, you can simply use n % 10 as your producer record key.
Here n is a counter, so the key cycles through 0 to 9. For record 0 the key is 0; Kafka hashes the key and puts the record in some partition, say partition 0. For record 1 the key is 1, and the record goes to another partition, say partition 1, and so on. Note that two different keys can hash to the same partition, so the spread is not guaranteed to be perfectly even.
This way you can apply a round-robin of sorts to your producer records. The key is independent of the fields in your record, so you can keep a variable n and use n % 10 as the key.
Or you can specify the partition in your producer record. So use either the key or the partition field of the producer record.
Suppose you have defined a partitioner that derives the partition from the record. Let's say the Kafka key is a string and the value is a Student POJO.
In the Student POJO, let's say I want to choose the partition based on the student's country field. Imagine that there are 10 partitions in the topic and that, for example, the country value "India" maps to partition number 5.
Whenever the country is "India", the record is assigned to partition 5, and it always goes there (as long as the partition count has not changed).
If many of the records coming through your pipeline have the country "India", all of them will go to partition 5, and you will see an uneven distribution across the Kafka partitions.
In my case, I used the default partitioner but still had far more records in one partition than in the others. The problem was that I unexpectedly had many records with the same key. Check your keys!
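One quick way to check your keys is to count how often each one occurs in a sample of records. A minimal Python sketch follows; the sample data is invented, and in practice the keys would come from a batch of messages you have consumed or are about to produce.

```python
from collections import Counter


def hottest_keys(keys, top=3):
    """Return the most frequent keys with their counts, most frequent first."""
    return Counter(keys).most_common(top)


# Invented sample: one key dominates the stream.
sample_keys = ["India"] * 7 + ["France"] * 2 + ["Brazil"]
print(hottest_keys(sample_keys))  # [('India', 7), ('France', 2), ('Brazil', 1)]
```

If one key accounts for most of the records, the partition it hashes to will hold most of the data no matter which partitioner you use.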
As I was unable to resolve this with Faust, the approach I am using is to implement the 'round-robin' distribution myself.
I iterate over my records to produce and do for example:
for index, message in enumerate(messages):
    topic.send(message, partition=index % num_partitions)
I.e. I bound my index to the range of partitions I have.
There could still be unevenness: if you run this repeatedly and the number of records in each run is smaller than num_partitions, the first partitions will keep getting the major share of messages. You can avoid this issue by adding a random offset:
import random

initial_partition = random.randrange(0, num_partitions)
for index, message in enumerate(messages):
    topic.send(message, partition=(initial_partition + index) % num_partitions)

Why is data not evenly distributed among partitions when a partitioning key is not specified?

Is this explanation still valid in Kafka 10?
In Kafka producer, a partition key can be specified to indicate the destination partition of the message. By default, a hashing-based partitioner is used to determine the partition id given the key, and people can use customized partitioners also.
To reduce # of open sockets, in 0.8.0 (https://issues.apache.org/jira/browse/KAFKA-1017), when the partitioning key is not specified or null, a producer will pick a random partition and stick to it for some time (default is 10 mins) before switching to another one. So, if there are fewer producers than partitions, at a given point of time, some partitions may not receive any data. To alleviate this problem, one can either reduce the metadata refresh interval or specify a message key and a customized random partitioner. For more detail see this thread http://mail-archives.apache.org/mod_mbox/kafka-dev/201310.mbox/%3CCAFbh0Q0aVh%2Bvqxfy7H-%2BMnRFBt6BnyoZk1LWBoMspwSmTqUKMg%40mail.gmail.com%3E
From here https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Whyisdatanotevenlydistributedamongpartitionswhenapartitioningkeyisnotspecified?
The new producer has changed to use a round-robin policy. That is to say, messages will be delivered to all partitions evenly if no keys are specified.