I'm getting started with Kafka and trying to understand exactly what would happen in a case when a partition of a topic in a Kafka cluster fills up beyond its limit. I understand that a partition resides in the same node but different partitions reside in different nodes. If I use a custom partition logic in the Kafka producer where a certain key always goes to a certain partition, then what happens when that partition becomes full? Are messages with this key sent to a random partition?
The partition is just cleaned and all the messages are deleted so that new messages can be sent to the same partition. The "retention" policy is handled by all the "log.retention.*" broker parameters.
Related
I'm trying to wrap my head around kafka and the thing that confuses me are the partitions. From all/most of the examples I have seen the consumers/products seem to have implicit knowledge of the partitions, (which partition to write messages to, which partition to read messages from). Is this correct, I initially thought that partitions are internal to the system and the consumers/producers dont need to know partition information. If they need to know partition information then aren't we exposing the inner structure of the topic to a certain extent to the outside world?
In kafka every partition in a topic has a set of brokers, and at most one broker leader per partition. You cannot have more consumers of a topic than the number of partitions because otherwise some consumer would be inactive.You can have multiple partitions for a single consumer, but cannot have multiple consumers for a single partition. So the number of partitions must be chosen according to the throughput you expect. The number of partitions can be increased on a topic, but never decreased. When consumers connect to a partition they actually connect to the broker leader to consume messages.
Anyway the partition leader could change, so the consumer would get an error and should send the request for meta-data to the cluster controller in order to get the info on the new partition leader. At consumer startup partitions are assigned according to the kafka parameter partition.assignment.strategy. Of course if consumers start at different times on the same consumer group there will be partition rebalance.
Finally you need a lot of info on the kafka cluser structure as a client.
From the kafka faq page
In Kafka producer, a partition key can be specified to indicate the destination partition of the message. By default, a hashing-based partitioner is used to determine the partition id given the key
So all the messages with a particular key will always go to the same partition in a topic:
How does the consumer know which partition the producer wrote to, so it can consume directly from that partition?
If there are more producers than partitions, and multipe producers are writing to the same partition, how are the offsets ordered so that the consumers can consume messages from specific producers?
How does the consumer know which partition the producer wrote to
Doesn't need to, or at least shouldn't, as this would create a tight coupling between clients. All consumer instances should be responsible for handling all messages for the subscribed topic. While you can assign a Consumer to a list of TopicPartition instances, and you can call the methods of the DefaultPartitioner for a given key to find out what partition it would have gone to, I've personally not run across a need for that. Also, keep in mind, that Producers have full control over the partitioner.class setting, and do not need to inform Consumers about this setting.
If there are more producers than partitions, and multipe producers are writing to the same partition, how are the offsets ordered...
Number of producers or partitions doesn't matter. Batches are sequentially written to partitions. You can limit the number of batches sent at once per Producer client (and you only need one instance per application) with max.in.flight.requests, but for separate applications, you of course cannot control any ordering
so that the consumers can consume messages from specific producers?
Again, this should not be done.
Kafka is distributed event streaming, one of its use cases is decoupling services from producers to consumers, the producer producing/one application messages to topics and consumers /another application reads from topics,
If you have more then one producer, the order that data would be in the kafka/topic/partition is not guaranteed between producers, it will be the order of the messages that are written to the topic, (even with one producer there might be issues in ordering , read about idempotent producer)
The offset is atomic action which will promise that no two messages will get same offset.
The offset is running number, it has a meaning only in the specific topic and specfic partition
If using the default partioner it means you are using murmur2 algorithm to decide to which partition to send the messages, while sending a record to kafka that contains a key , the partioner in the producer runs the hash function which returns a value, the value is the number of the partition that this key would be sent to, this is same murmur2 function, so for the same key, with different producer you'll keep getting same partition value
The consumer is assigned/subscribed to handle topic/partition, it does not know which key was sent to each partition, there is assignor function which decides in consumer group, which consumer would handle which partition
I am starting to learn Kafka for enterprise solution purposes.
During my readings, some questions came to my mind:
Are all the partitions on a topic of the same size?
What will happen if a Kafka message is sent on a topic and the topic is overloaded. There is no memory for any new messages on the topic?
Are all the partitions on a topic of the same size?
No, they are usually not all of the same size. Mainly, it depends how you distribute the messages accross partitions. If you are using keys in your messages they will be hashed and used to partition your data. Without a key in a message the data will be assigned round robin accross partitions.
In addition, it also depends on how many messages the producer is wrapping into a batch before sending it to the broker and which compression type you are applying.
However, you can say that all replicas of a TopicPartition have the same size as soon as they are in-sync.
What will happen if a Kafka message is sent on a topic and the topic is overloaded. There is no memory for any new messages on the topic?
A topic itself cannot be overloaded, but rather the entire Kafka broker. If a broker has no free disk space left it will shut itself down and you will in that situation not be able to write any more data into a TopicPartition if the leader was on that broker. However, if only one broker goes down and you have more than one replication configured in your topic those replicas will take over on another broker.
To make sure that this never happens you can apply the volume based retention policies that is available in Kafka if you set your cleanup.policy to delete and set the retention.bytes such that with the number of topics/replicas you will not exceed your storage limit.
We have a business process/workflow that is being started when initial event message is received and closed when the last message is processed. We have up to 100,000 processes executed each day. My problem is that the order of the messages that come to specific process has to be processed by the same order messages were received. If one of the messages fails, the process has to freeze until the problem is fixed, despite that all other processes has to continue. For this kind of situation i am thinking of using Kafka. first solution that came to my mind was to use Topic partitioning by message key. The key of the message would be the ProcessId. This way i could be sure that all process messages would be partitioned and kafka would guarantee the order. As i am new to Kafka what i managed to figure out that partitions has to be created in advance and that makes everything to difficult. so my questions are:
1) when i produce message to kafka's topic that does not exist, the topic is created on runtime. Is it possible to have same behavior for topic partitions?
2) there can be more than 100,000 active partitions on the topic, is that a problem?
3) can partition be deleted after all messages from that topic were read?
4) maybe you can suggest other approaches to my problem?
When i produce message to kafka's topic that does not exist, the topic is created on runtime. Is it possible to have same behavior for topic partitions?
You need to specify number of partitions while creating topic. New Partitions won't be create automatically(as is the case with topic creation), you have to change number of partitions using topic tool.
More Info: https://kafka.apache.org/documentation/#basic_ops_modify_topi
As soon as you increase number of partitions, producer and consumer will be notified of new paritions, thereby leading them to rebalance. Once rebalanced, producer and consumer will start producing and consuming from new partition.
there can be more than 100,000 active partitions on the topic, is that a problem?
Yes, having this much partitions will increase overall latency.
Go through how-choose-number-topics-partitions-kafka-cluster on how to decide number of partitions.
can partition be deleted after all messages from that topic were read?
Deleting a partition would lead to data loss and also the remaining data's keys would not be distributed correctly so new messages would not get directed to the same partitions as old existing messages with the same key. That's why Kafka does not support decreasing partition count on topic.
Also, Kafka doc states that
Kafka does not currently support reducing the number of partitions for a topic.
I suppose you choose wrong feature to solve you task.
In general, partitioning is used for load balancing.
Incoming messages will be distributed on given number of partition according to the partitioning strategy which defined at broker start. In short, default strategy just calculate i=key_hash mod number_of_partitions and put message to ith partition. More about strategies you could read here
Message ordering is guaranteed only within partition. With two messages from different partitions you have no guarantees which come first to the consumer.
Probably you would use group instead. It's option for consumer
Each group consumes all messages from topic independently.
Group could consist of one consumer or more if you need it.
You could assign many groups and add new group (in fact, add new consumer with new groupId) dynamically.
As you could stop/pause any consumer, you could manually stop all consumers related to specified group. I suppose there is no single command to do that but I'm not sure. Anyway, if you have single consumer in each group you could stop it easily.
If you want to remove the group you just shutdown and drop out related consumers. No actions on broker side is needed.
As a drawback you'll get 100,000 consumers which read (single) topic. It's heavy network load at least.
I want any information/explanation on how Kafka maintains a message sequence when messages are written to topic with multiple partition.
For e.g. I have multiple message producer each producing messages sequentially and writing on the Kafka topic with more than 1 partition. In this case, how consumer group will work to consume messages.
Kafka only provides a total order over records within a partition, not between different partitions in a topic. Even within one partition, you still could encounter the out-of-order events if retries is enabled and max.in.flight.requests.per.connection is larger than 1.
Work-around is create a topic with only one partition although it means only one consumer process per consumer group.
Kafka will store messages in the partitions according to the message key given to the producer. If none is given, then the messages will be written in a round-robin style into the partitions. To keep ordering for a topic, you need to make sure the ordered sequence has the same key, or that the topic has only one partition.