Is it possible to create a kafka topic with dynamic partition count? - apache-kafka

I am using kafka to stream the events of page visits by the website users to an analytics service. Each event will contain the following details for the consumer:
user id
IP address of the user
I need very high throughput, so I decided to partition the topic using userId-ipAddress as the partition key.
For example, for userId 1000 and IP address 10.0.0.1, the event will have
the partition key "1000-10.0.0.1".
In this use case the partition key is dynamic, so I cannot specify the number of partitions upfront when creating the topic.
Is it possible to create a topic in Kafka with a dynamic partition count?
Is this kind of partitioning good practice, or is there another way to achieve this?

It's not possible to create a Kafka topic with dynamic partition count. When you create a topic you have to specify the number of partitions. You can change it later manually using Replication Tools.
But I don't understand why you need a dynamic partition count in the first place. The partition key is not related to the number of partitions. You can use your partition key with ten partitions or with a thousand partitions. When you send a message to a Kafka topic, Kafka must send it to a specific partition. Every partition is identified by its ID, which is simply a number. Kafka computes something like this
partition_id = hash(partition_key) % number_of_partition
and it sends the message to partition partition_id. If you have far more users than partitions you should be OK. More suggestions:
Use userId as the partition key. You probably don't need the IP address as part of the partition key. What is it good for? Typically you need all messages from a single user to end up in a single partition. If the IP address is part of the partition key, then the messages from a single user could end up in multiple partitions. I don't know your use case, but in general that's not what you want.
Measure how many partitions you need to process all messages. Then create, let's say, ten times more partitions. You can create more partitions than you actually need; a moderate surplus costs little, although very large partition counts do add overhead (open file handles, recovery time, latency). See How to choose the number of topics/partitions in a Kafka cluster?
Right now you should be able to process all messages in your system. If traffic grows you can add more Kafka brokers and you can use Replication tools to change leaders/replicas for partitions. If the traffic grows more than ten times you must create new partitions.
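The hash(partition_key) % number_of_partition rule above can be sketched in a few lines of Java. This is only an illustration, not Kafka's actual code: the class and method names are hypothetical, and String.hashCode() stands in for the murmur2 hash that Kafka's default partitioner applies to the serialized key bytes.

```java
// Hypothetical sketch of key-to-partition mapping with a fixed partition count.
// Assumption: String.hashCode() replaces Kafka's murmur2 hash of the key bytes.
class KeyToPartitionSketch {
    static int partitionFor(String key, int numPartitions) {
        // floorMod keeps the result in [0, numPartitions) even for negative hashes
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 10;
        String[] userIds = {"1000", "1001", "1000"};
        for (String userId : userIds) {
            // The same userId always lands on the same partition
            System.out.println(userId + " -> partition " + partitionFor(userId, numPartitions));
        }
    }
}
```

The number of distinct keys never appears in the calculation, which is why the key space can grow freely while the partition count stays fixed.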

Related

Do we need to know number of partitions for a topic beforehand?

We want to put the messages/records of different customers on different partitions of a Kafka topic.
But the number of customers is not known in advance. So how can we set the partition count for the Kafka topic in this case? Do we need some other approach where the partition count changes at runtime based on keys (customer_id in this case)? Thanks in advance.
need to know number of partitions
Assuming Java, use AdminClient.describeTopics() method call and get partitions of each response object.
Regarding the rest of the question, consumer instances automatically distribute partition assignment when subscribing to topics.
Producers should not know about consumers, so you don't "put records on partitions" based on any factor of (possible) consumers.
partition count changes at runtime based on keys (customer_id)
Unclear what this means. The partition count can only increase, and if you do increase it, existing keys may start mapping to different partitions, so per-key ordering is no longer guaranteed across the change. You should therefore consider how large your keyspace is before creating the topic. For example, if you have a numeric ID and use the first two digits as the partition value, then you could create a topic with up to 100 partitions.
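To see why increasing the partition count disturbs existing keys, here is a small sketch. It assumes numeric keys placed with a plain modulo (a stand-in for hash(key) % partition count); the class name and the numbers 1200, 4 and 6 are hypothetical.

```java
// Hypothetical sketch: how many keys change partition when the count grows.
// Assumption: keys are integers and placement is key % partitionCount.
class RepartitionSketch {
    static int partitionFor(int key, int partitionCount) {
        return Math.floorMod(key, partitionCount);
    }

    // Counts keys in [0, totalKeys) whose partition differs between the two counts.
    static int countMoved(int totalKeys, int oldCount, int newCount) {
        int moved = 0;
        for (int key = 0; key < totalKeys; key++) {
            if (partitionFor(key, oldCount) != partitionFor(key, newCount)) {
                moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        // prints "800 of 1200 keys change partition"
        System.out.println(countMoved(1200, 4, 6) + " of 1200 keys change partition");
    }
}
```

In this example two thirds of the keys would be routed to a different partition after going from 4 to 6 partitions, so newer records for a key can end up in a different partition than its older records.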

Kafka message partitioning by key

We have a business process/workflow that is started when an initial event message is received and closed when the last message is processed. We have up to 100,000 processes executed each day. My problem is that the messages belonging to a specific process have to be processed in the same order in which they were received. If one of the messages fails, that process has to freeze until the problem is fixed, while all other processes have to continue. For this kind of situation I am thinking of using Kafka. The first solution that came to my mind was to partition the topic by message key. The key of the message would be the ProcessId. This way I could be sure that all messages of a process end up in the same partition, and Kafka would guarantee the order. As I am new to Kafka, what I have managed to figure out is that partitions have to be created in advance, and that makes everything too difficult. So my questions are:
1) When I produce a message to a Kafka topic that does not exist, the topic is created at runtime. Is it possible to have the same behavior for topic partitions?
2) There can be more than 100,000 active partitions on the topic. Is that a problem?
3) Can a partition be deleted after all messages from that topic have been read?
4) Maybe you can suggest other approaches to my problem?
When i produce message to kafka's topic that does not exist, the topic is created on runtime. Is it possible to have same behavior for topic partitions?
You need to specify the number of partitions when creating the topic. New partitions won't be created automatically (unlike topics, which can be auto-created); you have to change the number of partitions using the topic tool.
More Info: https://kafka.apache.org/documentation/#basic_ops_modify_topi
As soon as you increase the number of partitions, producers and consumers will learn about the new partitions through a metadata refresh, and consumer groups will rebalance. After that, producers and consumers will start producing to and consuming from the new partitions.
there can be more than 100,000 active partitions on the topic, is that a problem?
Yes, having this many partitions will increase overall latency.
See how-choose-number-topics-partitions-kafka-cluster for how to decide on the number of partitions.
can partition be deleted after all messages from that topic were read?
Deleting a partition would lead to data loss and also the remaining data's keys would not be distributed correctly so new messages would not get directed to the same partitions as old existing messages with the same key. That's why Kafka does not support decreasing partition count on topic.
Also, Kafka doc states that
Kafka does not currently support reducing the number of partitions for a topic.
I suppose you chose the wrong feature to solve your task.
In general, partitioning is used for load balancing.
Incoming messages are distributed over the given number of partitions according to the partitioning strategy, which is configured on the producer. In short, the default strategy just calculates i = key_hash mod number_of_partitions and puts the message into the i-th partition. More about strategies you could read here
Message ordering is guaranteed only within a partition. For two messages from different partitions you have no guarantee about which one reaches the consumer first.
You would probably be better off using consumer groups instead. The group is a consumer-side setting.
Each group consumes all messages from the topic independently.
A group can consist of one consumer, or more if you need it.
You can have many groups and add a new group (in fact, add a new consumer with a new groupId) dynamically.
As you can stop/pause any consumer, you can manually stop all consumers belonging to a specific group. I suppose there is no single command to do that, but I'm not sure. Anyway, if you have a single consumer in each group, you can stop it easily.
If you want to remove a group, you just shut down and drop its consumers. No action on the broker side is needed.
As a drawback, you'll get 100,000 consumers that all read the (single) topic. That is a heavy network load at the very least.

Kafka used as Delivery Mechanism in News Feed

Can I create topics called update_i for different kinds of updates and partition them using user_id in a Kafka MQ? I've been through this post by confluent.io: https://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/ . Also, I know that I cannot create a topic with a dynamic number of partitions. These two facts (the post and the static number of Kafka partitions) seem to rule this approach out. What is the alternative delivery mechanism?
Can I create topics called update_i for different kinds of updates and partition them using user_id in a Kafka MQ ?
If I understand you correctly, the answer is Yes.
What you would need to do in a nutshell:
Topic configuration: Determine the required number of partitions for your topic(s). Usually, the number of partitions is determined based on (1) anticipated scale/volume of the incoming data, i.e. the Write-side of scaling, and/or (2) the required parallelism when consuming the messages for processing, i.e. the Read-side of scaling. See https://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/ for details.
Writing messages to these Kafka topics (aka the side of the "Kafka producer"): In Kafka, messages are key-value pairs. In your case, you would set the message key to be the user_id. Then, when using Kafka's default "partitioner", messages for the same message key (here: user_id) would automatically be sent to the same partition -- which is what you want to achieve.
As a possible solution I would suggest creating a fixed number of partitions and then setting up the producers to select the partition using the following rule
user_id mod <number_of_partitions>
That will allow you to keep order of messages for particular user_id.
Then, if you need a consumer that processes only the messages for a particular user_id, you can write a (low-level) consumer that reads a particular partition, processes only the messages sent for that customer, and ignores all other messages.

Topics, partitions and keys

I am looking for some clarification on the subject.
In Kafka documentations I found the following:
Kafka only provides a total order over messages within a partition,
not between different partitions in a topic. Per-partition ordering
combined with the ability to partition data by key is sufficient for
most applications. However, if you require a total order over messages
this can be achieved with a topic that has only one partition, though
this will mean only one consumer process per consumer group.
So here are my questions:
Does it mean if i want to have more than 1 consumer (from the same group) reading from one topic I need to have more than 1 partition?
Does it mean I need same amount of partitions as amount of consumers for the same group?
How many consumers can read from one partition?
I also have some questions regarding the relationship between keys and partitions with regard to the API. I only looked at the .NET APIs (especially the one from MS), but it looks like they mimic the Java API.
I see when using a producer to send a message to a topic there is a key parameter. But when consumer reads from a topic there is a partition number.
How are partitions numbered? Starting from 0 or 1?
What exactly is the relationship between a key and a partition?
As I understand it, some function on the key will determine the partition. Is that correct?
If I have 2 partitions in a topic and want some particular messages to go to one partition and other messages to go to the other, should I use a specific key for one specific partition, and the rest for the other?
What if I have 3 partitions and want one type of message to go to one particular partition and the rest to the other 2?
How, in general, do I send messages to a particular partition so that a consumer knows where to read from?
Or am I better off with multiple topics?
Thanks in advance.
Does it mean if i want to have more than 1 consumer (from the same
group) reading from one topic I need to have more than 1 partition?
Consider the following properties of Kafka:
each partition is consumed by exactly one consumer in the group
one consumer in the group can consume more than one partition
the number of consumer processes in a group should be <= the number of partitions, because any extra consumers sit idle
With these properties, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes.
To answer your question: yes, within a single group, if you want N consumers to all receive messages, you need at least N partitions.
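The properties above can be illustrated with a toy assignment routine. This is a simplified round-robin stand-in, not one of Kafka's real assignors; the class name, method name and consumer ids are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: spread partitions over the consumers of one group so that
// every partition has exactly one owner within that group.
class GroupAssignmentSketch {
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        for (int partition = 0; partition < numPartitions; partition++) {
            // Round-robin: partition i goes to consumer i % groupSize
            String owner = consumers.get(partition % consumers.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Three consumers, two partitions: the third consumer gets nothing.
        // prints {c1=[0], c2=[1], c3=[]}
        System.out.println(assign(Arrays.asList("c1", "c2", "c3"), 2));
    }
}
```

With fewer consumers than partitions, some consumers simply own several partitions; with more consumers than partitions, the surplus consumers receive no data.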
Does it mean I need same amount of partitions as amount of consumers
for the same group?
This is covered by the first answer: you need at least as many partitions as consumers in the group, not exactly the same number.
How many consumers can read from one partition?
Within a consumer group, only one consumer reads from a given partition at a time. Across groups, the number of consumers that can read from one partition equals the number of consumer groups subscribing to that topic.
Relationship between keys and partitions with regard to API
First, we must understand that the producer is responsible for choosing which record to assign to which partition within the topic.
Now, let's see how the producer does so. First, let's look at the class definition of ProducerRecord.java:
public class ProducerRecord<K, V> {
    private final String topic;
    private final Integer partition;   // optional: null means "let the partitioner decide"
    private final Headers headers;
    private final K key;
    private final V value;
    private final Long timestamp;
    // constructors and accessors omitted
}
Here, the field that we have to understand from the class is partition.
From the ProducerRecord docs,
If a valid partition number is specified, that partition will be used when sending the record.
If no partition is specified but a key is present a partition will be chosen using a hash of the key.
If neither key nor partition is present a partition will be assigned in a round-robin fashion.
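The three rules quoted above can be sketched as a small decision function. This is a simplified model of the documented behaviour, not the Kafka producer's code: the class and method names are hypothetical, String.hashCode() stands in for Kafka's hash of the serialized key, and a plain counter models the round-robin case.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the partition choice described in the ProducerRecord docs.
class PartitionChoiceSketch {
    private final int numPartitions;
    private final AtomicInteger roundRobinCounter = new AtomicInteger();

    PartitionChoiceSketch(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    int choose(Integer partition, String key) {
        if (partition != null) {
            // Rule 1: an explicit, valid partition number wins
            if (partition < 0 || partition >= numPartitions) {
                throw new IllegalArgumentException("invalid partition: " + partition);
            }
            return partition;
        }
        if (key != null) {
            // Rule 2: no partition but a key, so hash the key
            return Math.floorMod(key.hashCode(), numPartitions);
        }
        // Rule 3: neither key nor partition, so rotate through the partitions
        return Math.floorMod(roundRobinCounter.getAndIncrement(), numPartitions);
    }

    public static void main(String[] args) {
        PartitionChoiceSketch sketch = new PartitionChoiceSketch(3);
        System.out.println(sketch.choose(2, "user-1"));    // explicit partition
        System.out.println(sketch.choose(null, "user-1")); // derived from the key
        System.out.println(sketch.choose(null, null));     // round-robin
    }
}
```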
Partitions increase the parallelism of a Kafka topic. Any number of consumers/producers can use the same partition. It's up to the application layer to define the protocol. Kafka guarantees delivery. Regarding the API, you may want to look at the Java docs, as they may be more complete. Based on my experience:
Partitions start from 0
Keys may be used to send messages to the same partition. For example hash(key)%num_partition. The logic is pluggable to Producer. https://kafka.apache.org/090/javadoc/index.html?org/apache/kafka/clients/producer/Partitioner.html
Yes, but be careful not to end up with a key that effectively creates a "dedicated" partition. For that, you may want a dedicated topic instead, for example a control topic and a data topic.
This seems to be the same question as 3.
I believe consumers should not make assumptions of the data based on partition. The typical approach is to have consumer group that can read from multiple partitions of a topic. If you want to have dedicated channels, it is better (safer/maintainable) to use separate topics.

Data Modeling with Kafka? Topics and Partitions

One of the first things I think about when using a new service (such as a non-RDBMS data store or a message queue) is: "How should I structure my data?".
I've read and watched some introductory materials. In particular, take, for example, Kafka: a Distributed Messaging System for Log Processing, which writes:
"a Topic is the container with which messages are associated"
"the smallest unit of parallelism is the partition of a topic. This implies that all messages that ... belong to a particular partition of a topic will be consumed by a consumer in a consumer group."
Knowing this, what would be a good example that illustrates how to use topics and partitions? When should something be a topic? When should something be a partition?
As an example, let's say my (Clojure) data looks like:
{:user-id 101 :viewed "/page1.html" :at #inst "2013-04-12T23:20:50.22Z"}
{:user-id 102 :viewed "/page2.html" :at #inst "2013-04-12T23:20:55.50Z"}
Should the topic be based on user-id? viewed? at? What about the partition?
How do I decide?
When structuring your data for Kafka, it really depends on how it's meant to be consumed.
In my mind, a topic is a grouping of messages of a similar type that will be consumed by the same type of consumer. So in the example above, I would just have a single topic, and if you decide to push some other kind of data through Kafka, you can add a new topic for that later.
Topics are registered in ZooKeeper which means that you might run into issues if trying to add too many of them, e.g. the case where you have a million users and have decided to create a topic per user.
Partitions, on the other hand, are a way to parallelize the consumption of the messages. The total number of partitions in a broker cluster needs to be at least the same as the number of consumers in a consumer group for the partitioning feature to make sense. Consumers in a consumer group split the burden of processing the topic between themselves according to the partitioning, so that each consumer is only concerned with messages in the partitions it is "assigned to".
Partitioning can either be explicitly set using a partition key on the producer side or if not provided, a random partition will be selected for every message.
Once you know how to partition your event stream, the topic name will be easy, so let's answer that question first.
@Ludd is correct - the partition structure you choose will depend largely on how you want to process the event stream. Ideally you want a partition key which means that your event processing is partition-local.
For example:
If you care about users' average time-on-site, then you should partition by :user-id. That way, all the events related to a single user's site activity will be available within the same partition. This means that a stream processing engine such as Apache Samza can calculate average time-on-site for a given user just by looking at the events in a single partition. This avoids having to perform any kind of costly partition-global processing
If you care about the most popular pages on your website, you should partition by the :viewed page. Again, Samza will be able to keep a count of a given page's views just by looking at the events in a single partition
Generally, we are trying to avoid having to rely on global state (such as keeping counts in a remote database like DynamoDB or Cassandra), and instead be able to work using partition-local state. This is because local state is a fundamental primitive in stream processing.
If you need both of the above use-cases, then a common pattern with Kafka is to first partition by say :user-id, and then to re-partition by :viewed ready for the next phase of processing.
On topic names - an obvious one here would be events or user-events. To be more specific you could go with events-by-user-id and/or events-by-viewed.
This is not exactly related to the question, but in case you already have decided upon the logical segregation of records based on topics, and want to optimize the topic/partition count in Kafka, this blog post might come handy.
Key takeaways in a nutshell:
In general, the more partitions there are in a Kafka cluster, the higher the throughput one can achieve. Let the max throughput achievable on a single partition for production be p and for consumption be c. Let's say your target throughput is t. Then you need to have at least max(t/p, t/c) partitions.
Currently, in Kafka, each broker opens a file handle of both the index and the data file of every log segment. So, the more partitions, the higher that one needs to configure the open file handle limit in the underlying operating system. E.g. in our production system, we once saw an error saying too many files are open, while we had around 3600 topic partitions.
When a broker is shut down uncleanly (e.g., kill -9), the observed unavailability could be proportional to the number of partitions.
The end-to-end latency in Kafka is defined by the time from when a message is published by the producer to when the message is read by the consumer. As a rule of thumb, if you care about latency, it’s probably a good idea to limit the number of partitions per broker to 100 x b x r, where b is the number of brokers in a Kafka cluster and r is the replication factor.
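As a worked example of the sizing rules in the list above, the sketch below evaluates max(t/p, t/c) and the 100 x b x r rule of thumb. The class name, method names and all input numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```java
// Hypothetical sketch of the two sizing formulas quoted in the takeaways.
class PartitionCountSketch {
    // Minimum partition count for target throughput t, given per-partition
    // producer throughput p and consumer throughput c (same unit, e.g. MB/s).
    static int minPartitions(double t, double p, double c) {
        return (int) Math.ceil(Math.max(t / p, t / c));
    }

    // Rule-of-thumb limit from the list above: 100 x b x r.
    static int latencyRuleOfThumb(int brokers, int replicationFactor) {
        return 100 * brokers * replicationFactor;
    }

    public static void main(String[] args) {
        // t = 100 MB/s, p = 10 MB/s, c = 20 MB/s: max(10, 5) = 10 partitions
        System.out.println(minPartitions(100, 10, 20)); // prints 10
        // b = 3 brokers, r = 2 replicas: 100 x 3 x 2 = 600
        System.out.println(latencyRuleOfThumb(3, 2));   // prints 600
    }
}
```

The first number is a lower bound driven by throughput and the second an upper bound driven by latency, so a reasonable partition count sits between the two.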
I think a topic name summarizes a kind of message: producers publish messages to the topic, and consumers receive them by subscribing to it.
A topic can have many partitions. Partitions are good for parallelism. The partition is also the unit of replication, so in Kafka leaders and followers are defined at the partition level. A partition is effectively an ordered queue whose order is the order in which the messages arrived, and, put simply, a topic is composed of one or more such queues. This is useful when modeling our structure.
Kafka was developed at LinkedIn for log aggregation and delivery, and that scenario makes a good example.
The user's events on your website or app can be logged by your web server and then sent to a Kafka broker through the producer. In the producer you can specify the partitioning method, for example by event type (different events are saved in different partitions), by event time (split a day into periods according to your app logic), by user type, or with no logic at all, simply balancing all logs across the partitions.
For the case in your question, you can create one topic called "page-view-event" with N partitions and use hashed keys to distribute the logs evenly across all partitions. Or you could choose your own partitioning logic to distribute the logs as you see fit.