Do we need to know number of partitions for a topic beforehand? - apache-kafka

We want to put messages/records of a different customers on different partitions of a kafka topic.
But number of customers is not known in prior. So how can we set partition count for kafka topic in this case? Do we need any other way where partition count changes at runtime based on keys (customer_id in this case). Thanks in advance.

need to know number of partitions
Assuming Java, use AdminClient.describeTopics() method call and get partitions of each response object.
Regarding the rest of the question, consumer instances automatically distribute partition assignment when subscribing to topics.
Producers should not know about consumers, so you don't "put records on partitions" based on any factor of (possible) consumers.
partition count changes at runtime based on keys (customer_id)
Unclear what this means. Partition count can only increase, and if you do increase it, then your partitions will become unordered, so you should consider how large your keyspace is before creating the topic. For example, if you have a numeric ID, and use the first two digits as the partition value, then you could create a topic up to 100 partitions.

Related

What does lexicografic Order of Consumers in Kafka exactly mean?

The question seems to be very easy, but I haven't found an explanation on it.
The default partition assignement strategy in kafka is with the RangeAssignor. How this assignor is working is explained as:
"The range assignor works on a per-topic basis. For each topic, we lay out the available partitions in numeric order and the consumers in lexicographic order. We then divide the number of partitions by the total number of consumers to determine the number of partitions to assign to each consumer. If it does not evenly divide, then the first few consumers will have one extra partition." https://kafka.apache.org/21/javadoc/org/apache/kafka/clients/consumer/RangeAssignor.html
How it works is clear so far. Unclear is on what attribute the lexicographic order is done. Is done by the id of a Consumer?
Can anybody give an example for a lexicografic order of consumers?
Greetings,
maudeees
Since consumer client id is not required and group id should only be used for offset management, I assume it means by topic name, when the consumer is subscribed to multiple topics. If you are only using one topic, then only partitions are numerically ordered

In Kafka, if I increase the number of partitions in a topic then will order of messages be broken? (I used a key to partition)

Recently, I started to study Kafka and have been thinking how to adopt it into my service. Some of my messages should be processed in strict order, so I chose to use a key for partitioning on producer. However, even though we just need one partition right now, we might increase the number of partitions in the near future. So, in Kafka, if I increase the number of partitions in a topic then will consumers get messages in order?
Thanks in advance.
If you increase partitions, there's no guarantee that future, equal keys will land in their prior partition, so you'll experience a temporary period, based on topic retention, where you'll have keys spanning more than one partition (by default)
One workaround is to ensure you've consumed all messages, stop all clients interacting with the topic, then empty the topic and increase the count
Or you can start with an increased count to begin with and continue having all equal keys distributed over multiple partitions

Kafka partiton performance issue

I have 6 partitions for some topic and created a consumer group with 6 consumers. So, Here each consumer reading data from 1 partition. Here the problem is one partition having more data and other partitions having minimal data, due to data skew in one partition one consumer performing slow. How to handle this situation.
You need to know which keys you are sending into the topic. If the keys are null, then this shouldn't be possible.
If there is a distribution of non-null keys that is skewed towards a few values, then those partitions will be greater.
You are welcome to write your own Partitioner interface, otherwise.

Uneven Distribution of messages in Kafka Partitions

I have a topic with 10 partitions, 1 consumer group with 4 consumers and worker size is 3.
I could see there is an uneven distribution of messages in the partitions, One partition is having so much data and another one is free.
How can I make my producer to evenly distribute the load into all the partitions, so that all partitions are being utilized properly?
According to the JavaDoc comment in the DefaultPartitioner class itself, the default partitioning strategy is:
If a partition is specified in the record, use it.
If no partition is specified but a key is present choose a partition based on a hash of the key.
If no partition or key is present choose a partition in a round-robin fashion.
https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/producer/internals/DefaultPartitioner.java
So here are two possible reasons that may be causing the uneven distribution, depending on whether you are specifying a key while producing the message or not:
If you are specifying a key and you are getting an uneven distribution using the DefaultPartitioner, the most apparent explanation would be that you are specifying the same key multiple times.
If you are not specifying a key and using the DefaultPartitioner, a non-obvious behavior could be happening. According to the above you would expect round-robin distribution of messages, but this is not necessarily the case. An optimization introduced in 0.8.0 could be causing the same partition to be used. Check this link for a more detailed explanation: https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Whyisdatanotevenlydistributedamongpartitionswhenapartitioningkeyisnotspecified? .
Instead of going for the default partitioner class you can assign the producer with a partition number so that message directly goes to the specified partition,
ProducerRecord<String, String> record = new ProducerRecord<String, String>(topicName, partitionNumber,key, value);
Seems like your problem is uneven consumption of messages rather than uneven producing of messages to Kafka topic. In other words, your amount of reading threads doesn't match amount of partitions you have (they do not need to match 1:1 though, only be the same amout of partitions to read from per each consumer thread).
See short explanation for more details.
You can make use of the key parameter of the producer record. Here is a thing that for a specific key the data goes in to the same partition always now, I don’t know the structure of your producer record but as you said you have 10 partition then you can use simply n%10 as your producer record key.
Where n is 0 to 9 now your for record 0 key will be 0 and then kafka will generate a hash key and put it in some partition say partition 0, and for record 1 it will be one and then it will go into the 1st partition and so on.
This way you will able to apply round robin on your producer record your key will be independent from the fields in your record so you can have a variable n and key as n%10.
Or you can specify the partition in your producer record. So either you use the key or the partition field of the producer record.
If you have defined partitioner from record let's say in Kafka key is string and value is student Pojo.
In student Pojo let's say based on student country field, I want to go in a specific partition. Imagine that there is 10 partitions in a topic and for example, in value, "India" is a country and based on "India" we got partition number 5.
Whenever country is "India", Kafka will allocate the 5 number partition and that record goes to the partition number 5 always (if the partition has not changed).
Let's say that in your pipeline there are lots of records which are coming and have a country "India", all those records will go to partition number 5, and you will see uneven distribution in Kafka partition.
In my case, I used the default partitioner but still had much much more records in one partition than in others. The problem was I unexpectedly had many records with the same key. Check your keys!
As I was unable to resolve this with Faust, the approach I am using is to implement the 'round-robin' distribution myself.
I iterate over my records to produce and do for example:
for index, message in enumerate(messages):
topic.send(message, partition=index % num_partitions)
I.e. bound my index to within the range of partitions I have.
There could still be unevenness - consider you repeatedly run this but your number of records is less than your num_partitions - then your first partitions will keep getting the major share of messages. You can avoid this issue by adding a random offset:
import random
initial_partition = random.randrange(0, num_partitions)
for index, message in enumerate(messages):
topic.send(message, partition=(initial_partition + index) % num_partitions)

Topics, partitions and keys

I am looking for some clarification on the subject.
In Kafka documentations I found the following:
Kafka only provides a total order over messages within a partition,
not between different partitions in a topic. Per-partition ordering
combined with the ability to partition data by key is sufficient for
most applications. However, if you require a total order over messages
this can be achieved with a topic that has only one partition, though
this will mean only one consumer process per consumer group.
So here are my questions:
Does it mean if i want to have more than 1 consumer (from the same group) reading from one topic I need to have more than 1 partition?
Does it mean I need same amount of partitions as amount of consumers for the same group?
How many consumers can read from one partition?
Also have some questions regarding relationship between keys and partitions with regard to API. I only looked at .net APIs (especially one from MS) but looks like the mimic Java API.
I see when using a producer to send a message to a topic there is a key parameter. But when consumer reads from a topic there is a partition number.
How are partitions numbered? Starting from 0 or 1?
What exactly relationship between a key and partition?
As I understand some function on key will determine a partition. is that correct?
If I have 2 partitions in a topic and want some particular messages go to one partition and other messages go to another I should use a specific key for one specific partition, and the rest for another?
What if I have 3 partitions and one type of messages to one particular partition and the rest to other 2?
How in general I send messages to a particular partition in order to know for a consumer from where to read?
Or I better off with multiple topics?
Thanks in advance.
Does it mean if i want to have more than 1 consumer (from the same
group) reading from one topic I need to have more than 1 partition?
Let's see the following properties of kafka:
each partition is consumed by exactly one consumer in the group
one consumer in the group can consume more than one partition
the number of consumer processes in a group must be <= number
of partitions
With these properties, kafka is smartly able to provide both ordering guarantees and load balancing over a pool of consumer processes.
To answer your question, yes, in the context of the same group, if you want to have N consumers, you have to have at least N partitions.
Does it mean I need same amount of partitions as amount of consumers
for the same group?
I think this has been explained in the first answer.
How many consumers can read from one partition?
The number of consumers that can read from one partition is always equal to the number of consumer groups subscribing to that topic.
Relationship between keys and partitions with regard to API
First, we must understand that the producer is responsible for choosing which record to assign to which partition within the topic.
Now, lets see how producer does so. First, lets see the class definition of ProducerRecord.java :
public class ProducerRecord<K, V> {
private final String topic;
private final Integer partition;
private final Headers headers;
private final K key;
private final V value;
private final Long timestamp;
}
Here, the field that we have to understand from the class is partition.
From the ProducerRecord docs,
If a valid partition number is specified, that partition will be used when sending the record.
If no partition is specified but a key is present a partition will be chosen using a hash of the key.
If neither key nor partition is present a partition will be assigned in a round-robin fashion.
Partitions increase parallelism of Kafka topic. Any number of consumers/producers can use the same partition. Its up to application layer to define the protocol. Kafka guarantees delivery. Regarding the API, you may want to look at Java docs as they may be more complete. Based on my experience:
Partitions start from 0
Keys may be used to send messages to the same partition. For example hash(key)%num_partition. The logic is pluggable to Producer. https://kafka.apache.org/090/javadoc/index.html?org/apache/kafka/clients/producer/Partitioner.html
Yes. but be careful not to end up with some key that will result in the "dedicated" partition. For this, you may want to have dedicated topic. For example, control topic and data topic
This seems to be the same question as 3.
I believe consumers should not make assumptions of the data based on partition. The typical approach is to have consumer group that can read from multiple partitions of a topic. If you want to have dedicated channels, it is better (safer/maintainable) to use separate topics.