Topics, partitions and keys - apache-kafka

I am looking for some clarification on the subject.
In Kafka documentations I found the following:
Kafka only provides a total order over messages within a partition,
not between different partitions in a topic. Per-partition ordering
combined with the ability to partition data by key is sufficient for
most applications. However, if you require a total order over messages
this can be achieved with a topic that has only one partition, though
this will mean only one consumer process per consumer group.
So here are my questions:
Does it mean that if I want to have more than one consumer (from the same group) reading from one topic, I need to have more than one partition?
Do I need the same number of partitions as consumers in the same group?
How many consumers can read from one partition?
I also have some questions about the relationship between keys and partitions with regard to the API. I have only looked at the .NET APIs (especially the one from MS), but they appear to mimic the Java API.
I see that when using a producer to send a message to a topic there is a key parameter, but when a consumer reads from a topic there is a partition number.
How are partitions numbered? Starting from 0 or 1?
What exactly is the relationship between a key and a partition?
As I understand it, some function of the key determines the partition. Is that correct?
If I have two partitions in a topic and want certain messages to go to one partition and the rest to the other, should I use one specific key for the first partition and any other key for the second?
What if I have three partitions and want one type of message to go to one particular partition and the rest to the other two?
How, in general, do I send messages to a particular partition so that a consumer knows where to read them from?
Or am I better off with multiple topics?
Thanks in advance.

Does it mean that if I want to have more than one consumer (from the same
group) reading from one topic, I need to have more than one partition?
Let's look at the following properties of Kafka:
each partition is consumed by exactly one consumer in the group
one consumer in the group can consume more than one partition
the number of consumer processes in a group must be <= the number of partitions
With these properties, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes.
To answer your question: yes, within the same group, if you want to have N consumers, you have to have at least N partitions.
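As a minimal sketch of what each of those N consumer processes would look like (the broker address, topic name, and group id here are made up), every process runs the same code and shares the same group.id, and Kafka assigns the partitions among them automatically:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "my-group");                // shared by all N processes
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Each partition is served by exactly one consumer in the group,
            // so starting more processes than partitions leaves some idle.
            consumer.subscribe(List.of("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            r.partition(), r.offset(), r.value());
                }
            }
        }
    }
}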
Does it mean I need same amount of partitions as amount of consumers
for the same group?
This follows from the first answer: you need at least as many partitions as consumers, but the counts need not be equal, because one consumer can handle several partitions.
How many consumers can read from one partition?
Within a single consumer group, at most one consumer reads from a given partition. Across groups, each subscribing group reads the partition independently, so the total number of consumers reading from one partition equals the number of consumer groups subscribed to that topic.
Relationship between keys and partitions with regard to API
First, we must understand that the producer is responsible for choosing which record to assign to which partition within the topic.
Now, let's see how the producer does that, starting with the class definition of ProducerRecord.java:
public class ProducerRecord<K, V> {
    private final String topic;
    private final Integer partition;
    private final Headers headers;
    private final K key;
    private final V value;
    private final Long timestamp;
}
The field to focus on here is partition.
From the ProducerRecord docs,
If a valid partition number is specified, that partition will be used when sending the record.
If no partition is specified but a key is present a partition will be chosen using a hash of the key.
If neither key nor partition is present a partition will be assigned in a round-robin fashion.
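Those three rules map directly onto the ProducerRecord constructors. A minimal sketch (broker address, topic name, and keys are made up):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PartitionChoiceDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // 1. Valid partition given: the record goes to partition 0, no hashing involved.
            producer.send(new ProducerRecord<>("my-topic", 0, "user-42", "explicit partition"));
            // 2. Key but no partition: partition = hash(serialized key) % number of partitions.
            producer.send(new ProducerRecord<>("my-topic", "user-42", "partition chosen by key"));
            // 3. Neither key nor partition: partitions are picked round-robin
            //    (batched/"sticky" round-robin in newer client versions).
            producer.send(new ProducerRecord<>("my-topic", "no key, no partition"));
        }
    }
}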

Partitions increase the parallelism of a Kafka topic. Any number of consumers/producers can use the same partition; it is up to the application layer to define the protocol. Kafka guarantees delivery. Regarding the API, you may want to look at the Java docs, as they may be more complete. Based on my experience:
Partitions start from 0.
Keys may be used to send messages to the same partition, e.g. hash(key) % num_partitions. The logic is pluggable via the producer's Partitioner interface (a sketch of a custom partitioner follows this list): https://kafka.apache.org/090/javadoc/index.html?org/apache/kafka/clients/producer/Partitioner.html
Yes, but be careful that no other key hashes onto the "dedicated" partition. To avoid that risk, you may want a dedicated topic instead, for example a control topic and a data topic.
This seems to be the same question as 3.
I believe consumers should not make assumptions about the data based on the partition. The typical approach is to have a consumer group that reads from multiple partitions of a topic. If you want dedicated channels, it is better (safer, more maintainable) to use separate topics.
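Here is the custom-partitioner sketch promised above. The Partitioner interface and the Utils hashing helpers are real Kafka client classes; the control-vs-data routing policy and the class name are made up for illustration, and it assumes every record is keyed:

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

// Hypothetical policy: reserve partition 0 for "control" messages and spread
// all other keys over the remaining partitions so they never collide with it.
public class ControlAwarePartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionCountForTopic(topic);
        if ("control".equals(key)) {
            return 0; // the "dedicated" partition
        }
        // Assumes keyBytes is non-null, i.e. every record carries a key.
        return 1 + Utils.toPositive(Utils.murmur2(keyBytes)) % (numPartitions - 1);
    }

    @Override public void configure(Map<String, ?> configs) {}
    @Override public void close() {}
}

You would plug it in with props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, ControlAwarePartitioner.class.getName()) on the producer.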

Related

Do we need to know number of partitions for a topic beforehand?

We want to put messages/records of different customers on different partitions of a Kafka topic.
But the number of customers is not known in advance. So how can we set the partition count for the Kafka topic in this case? Is there some other way to have the partition count change at runtime based on keys (customer_id in this case)? Thanks in advance.
need to know number of partitions
Assuming Java, use the AdminClient.describeTopics() method call and read the partitions of each response object.
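A minimal sketch of that call (broker address and topic name are made up; on clients older than 3.1, use .all() instead of .allTopicNames()):

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;

public class PartitionCount {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(List.of("my-topic"))
                    .allTopicNames().get()   // KafkaFuture<Map<String, TopicDescription>>
                    .get("my-topic");
            System.out.println("partitions: " + desc.partitions().size());
        }
    }
}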
Regarding the rest of the question, consumer instances automatically distribute partition assignment when subscribing to topics.
Producers should not know about consumers, so you don't "put records on partitions" based on any factor of (possible) consumers.
partition count changes at runtime based on keys (customer_id)
It is unclear what this means. The partition count can only be increased, and if you do increase it, keys get remapped, so per-key ordering across old and new messages is lost. You should therefore consider how large your keyspace is before creating the topic. For example, if you have a numeric ID and use the first two digits as the partition value, you could create the topic with up to 100 partitions.

Are partitions on different Kafka topics co-located within same consumer (k8s pod)

I have a requirement where I want to be able to read data from partition 1 of topic A and partition 1 of topic B in the same consumer. I have a group of consumers running in different Kubernetes pods. Both topics have 5 partitions each, and both use a key-based partitioning strategy.
So, assuming partition 1 of topic A and partition 1 of topic B hold messages keyed with the same key values, would they both be colocated on the same consumer, i.e. the same pod? If so, I could cross-reference data from one topic using the key of the other topic's message.
Keys are only relevant to the producer partitioner.
There is no guarantee that a consumer will be assigned the same partitions across two topics. The ConsumerPartitionAssignor interface works per topic. You might get lucky and see consumers assigned same-keyed partitions across topics, but after a rebalance it will no longer be true.
If you must consume the same partition of multiple topics, you can assign() those specific partitions to the consumer instance rather than subscribe()-ing to the whole topics (see the sketch below).
However, if you are wanting to join data across topics, the more appropriate way to do this would be to use Kafka Streams / KSQL joins.
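To illustrate the assign() approach mentioned above, here is a sketch pinning one consumer to partition 1 of both topics (broker address and topic names are made up). Manual assignment opts out of group rebalancing entirely, so the co-location can never be taken away:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class PinnedPartitionsConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // assign() instead of subscribe(): no group coordination, no rebalances.
            consumer.assign(List.of(
                    new TopicPartition("topic-A", 1),
                    new TopicPartition("topic-B", 1)));
            while (true) {
                consumer.poll(Duration.ofMillis(500)).forEach(r ->
                        System.out.printf("%s-%d@%d: %s%n",
                                r.topic(), r.partition(), r.offset(), r.value()));
            }
        }
    }
}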
Yes, if you configure routing by key for both topics, the same key will be sent to the same partition. Have a look at the documentation here: https://kafka.apache.org/documentation/#design_loadbalancing
"For example if the key chosen was a user id then all data for a given user would be sent to the same partition. This in turn will allow consumers to make locality assumptions about their consumption. This style of partitioning is explicitly designed to allow locality-sensitive processing in consumers."

Kafka Offset and Partition identification

I had a few questions from Kafka. Please help me in understanding the problem.
As per the official documentation, each message within a partition has a unique sequential id called an offset.
How are offset numbers generated: based on message arrival into a partition, or are they generated when the partition is created?
Can the same offset number exist in another partition, given that partitions are independent of each other?
If the same offset can occur in another partition, how does a consumer uniquely identify a message across multiple partitions?
How does the consumer know which partition a particular offset belongs to? Please explain both situations: a message with a key and a message without a key.
Each partition maintains the messages it has received in the sequential order in which they arrived, identified by offsets. The offset is a sequential number that is automatically generated and assigned to each message as it is appended to the partition.
Yes, this is correct. Message ordering is guaranteed only at the partition level. This means that if you have a topic with multiple partitions, messages in different partitions might have the same offset. Therefore, an offset has a true meaning only within a single partition.
3/4. Consumers subscribe to topics, but behind the scenes they are assigned particular partitions (if you have a single consumer in the consumer group, it will be assigned all of the partitions). Therefore, when a consumer reads messages from a particular partition, it can uniquely identify them by their offsets, which are unique within that partition. As already mentioned, message order is guaranteed only within a single partition.
Note that messages without a key are distributed evenly across the partitions of the topic, in a round-robin fashion. On the other hand, messages with the same key are stored in the same partition, so you can use the key to order messages that share it. For example, if you need to process users and want an ordering guarantee per distinct user, you can use the userID as the key, so that all events of that user are stored in the same partition. Later on, you will be able to consume these user-specific messages in the order they were originally received.
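In other words, the triple (topic, partition, offset) is the unique coordinate of a message, and a consumer can jump straight to it with seek(). A sketch, with a made-up broker address, topic, partition, and offset:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReadOneRecord {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Offset 42 may exist in partition 0 AND in partition 1; naming the
            // partition is what makes the coordinate unambiguous.
            TopicPartition tp = new TopicPartition("my-topic", 1);
            consumer.assign(List.of(tp));
            consumer.seek(tp, 42L);
            consumer.poll(Duration.ofSeconds(1)).forEach(r ->   // may need more than one poll
                    System.out.printf("%s-%d@%d: %s%n",
                            r.topic(), r.partition(), r.offset(), r.value()));
        }
    }
}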

Kafka message partitioning by key

We have a business process/workflow that starts when an initial event message is received and closes when the last message is processed. We have up to 100,000 processes executing each day. My problem is that the messages belonging to a specific process have to be processed in the order they were received. If one of the messages fails, that process has to freeze until the problem is fixed, while all other processes continue. For this kind of situation I am thinking of using Kafka. The first solution that came to my mind was to partition the topic by message key, with the ProcessId as the key. That way I could be sure that all of a process's messages end up in the same partition, and Kafka would guarantee their order. As I am new to Kafka, what I have managed to figure out is that partitions have to be created in advance, and that makes everything difficult. So my questions are:
1) When I produce a message to a Kafka topic that does not exist, the topic is created at runtime. Is it possible to have the same behavior for topic partitions?
2) There can be more than 100,000 active partitions on the topic; is that a problem?
3) Can a partition be deleted after all messages from it have been read?
4) Maybe you can suggest other approaches to my problem?
When I produce a message to a Kafka topic that does not exist, the topic is created at runtime. Is it possible to have the same behavior for topic partitions?
You need to specify the number of partitions when creating a topic. New partitions won't be created automatically (as is the case with topic creation); you have to change the number of partitions using the topic tool.
More info: https://kafka.apache.org/documentation/#basic_ops_modify_topic
As soon as you increase the number of partitions, producers and consumers are notified of the new partitions, and consumers rebalance. Once rebalanced, producers and consumers start producing to and consuming from the new partitions.
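Besides the command-line topic tool, the partition count can also be increased programmatically. A sketch using the Admin API (broker address, topic, and target count are made up):

import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewPartitions;

public class AddPartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // Grow my-topic to 12 partitions. Existing records stay where they are,
            // so hash(key) % numPartitions will now map old keys differently.
            admin.createPartitions(Map.of("my-topic", NewPartitions.increaseTo(12)))
                 .all().get();
        }
    }
}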
There can be more than 100,000 active partitions on the topic; is that a problem?
Yes, having that many partitions will increase overall latency.
Go through how-choose-number-topics-partitions-kafka-cluster for guidance on deciding the number of partitions.
Can a partition be deleted after all messages from it have been read?
Deleting a partition would lead to data loss, and the keys of the remaining data would no longer be distributed correctly, so new messages would not be directed to the same partitions as existing messages with the same key. That is why Kafka does not support decreasing the partition count of a topic.
Also, the Kafka docs state that
Kafka does not currently support reducing the number of partitions for a topic.
I think you have chosen the wrong feature for your task.
In general, partitioning is used for load balancing.
Incoming messages are distributed over the given number of partitions according to the partitioning strategy configured on the producer. In short, the default strategy just calculates i = hash(key) mod number_of_partitions and puts the message into the i-th partition.
Message ordering is guaranteed only within a partition. For two messages from different partitions, you have no guarantee which one reaches the consumer first.
You would probably use consumer groups instead (a consumer-side concept):
Each group consumes all messages from the topic independently.
A group can consist of one consumer, or more if you need it.
You can have many groups and add a new group (in fact, add a new consumer with a new groupId) dynamically.
Since you can stop/pause any consumer, you can manually stop all consumers belonging to a given group. I suppose there is no single command to do that, but I'm not sure; in any case, if you have a single consumer in each group you can stop it easily (see the sketch below).
If you want to remove a group, you just shut down and drop the related consumers. No action on the broker side is needed.
As a drawback, you'll end up with 100,000 consumers reading a (single) topic, which is a heavy network load at least.
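On freezing a single process: there is indeed no broker-side command, but the consumer client can stop fetching without leaving its group. pause()/resume() are real KafkaConsumer methods; the helper class below and its names are just a sketch:

import org.apache.kafka.clients.consumer.KafkaConsumer;

public final class FreezeControl {
    // Pause fetching on all currently assigned partitions without leaving the
    // group; poll() must still be called regularly to keep the membership alive.
    public static void freeze(KafkaConsumer<?, ?> consumer) {
        consumer.pause(consumer.assignment());
    }

    // Resume everything that was previously paused.
    public static void thaw(KafkaConsumer<?, ?> consumer) {
        consumer.resume(consumer.paused());
    }
}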

Is there any way to maintain message ordering between partitions of a kafka topic with a single consumer?

We are developing a Kafka-based streaming system in which the producer would produce to multiple partitions within its topic and a single consumer would consume from the topic. I know that Kafka maintains message order within partitions, but can we maintain a global message order between partitions within a topic?
Short answer:
no, Kafka does not provide any ordering guarantees between partitions.
Long answer:
I don't quite understand your problem. If you have only one consumer consuming your topic, why would you have more than one partition in that topic and reinvent the wheel trying to maintain order between partitions? If you want to leave some room for future growth, e.g. adding another consumer to consume a subset of the partitions, then you'll have to rethink your "global message order" idea.
Do you really need ALL messages to be processed in order? Or could you partition by client/application/whatever and maintain order per partition? In most cases you don't really need a global message order; you just have to partition your data properly.
Maintaining order across multiple consumers is a really tough problem to solve, and even solved correctly it would just negate all of Kafka's benefits.
You can't benefit from Kafka if you want global ordering across more than one partition; Kafka only supports message ordering within a single partition. In our company, we only need messages of the same category to go to the same partition, which is easy to achieve by using the category id as the partition key.
The purpose of partitions in Kafka is to create a partial order of messages in a broader topic, where the messages follow a strict total order in any given partition. So the answer is 'no', it would defeat the purpose of partitions if any notion of cross-partition order were to be introduced.
I would suggest instead focusing on how messages (records, in Kafka parlance) are keyed, which effectively determines how they are mapped to a partition. Which partition specifically doesn't matter, as long as the mapping is deterministic and repeatable — all you should care about is that identically keyed records will always appear on the same partition and, hence, will not be assigned to multiple consumers at the same time (within the same consumer group).
If you are publishing updates to persisted entities, the primary key of the entity is typically a good starting point for a Kafka record key. If there needs to be some order of updates across a connected graph of entities, then taking the ID root of the graph and making it the key will likely satisfy your ordering needs.
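As a closing sketch of that advice (topic name, entity id, and payloads are made up): keying every update with the entity's primary key keeps all updates for that entity on one partition, and therefore in order for whichever consumer owns it:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EntityUpdates {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String customerId = "customer-1138"; // entity primary key as the record key
            // Same key => same partition => per-entity ordering for consumers.
            producer.send(new ProducerRecord<>("customer-updates", customerId, "created"));
            producer.send(new ProducerRecord<>("customer-updates", customerId, "address-changed"));
            producer.send(new ProducerRecord<>("customer-updates", customerId, "closed"));
        }
    }
}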