=== Assume everything from consumer point of view ===
I was reading a couple of Kafka articles and saw that the number of partitions is coupled to the number of microservice instances. For example, say I have one topic with one partition for my serviceA. The producer pushes messages to topic T1, partition P1, and on the consumer side ServiceA1 reads from T1, P1. If I spin up a new pod (ServiceA2) for higher throughput, the second instance will never receive any message, because Kafka/ZooKeeper assigns an id to each consumer and partition P1 is already taken by ServiceA1. So ServiceA2 and any further instances stay idle. To avoid this, Kafka recommends adding more partitions, so that the number of consumers can be increased or decreased as needed.
I was also able to test this from the command line, and service2 never consumed any message. When I shut down service1, service2 picked up new messages. So if I spin up more pods, fail-safety/availability increases but throughput always stays the same.
Is my assumption correct? Am I missing anything? Now I feel like any standard messaging system will have the same problem. How do you scale out message-oriented systems themselves?
Every topic has at least one partition; if you don't define the partition count, the topic is created with the default, which is a single partition. In your case, you have a consumer group that consists of two consumers. Within a group, each partition is read by exactly one consumer. So the first consumer reads from the first (and only) partition, and there is no partition left for the second consumer to read from, so it stays idle. Only when the first consumer goes down does the second consumer start reading from that partition, beginning at the last committed offset.
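To make the idle-consumer behaviour concrete, here is a minimal sketch that simulates how partitions are handed out within one consumer group. It is a toy model, not Kafka's actual assignor code: the function name `assign_partitions` and the consumer names are assumptions for illustration, and the only rule it encodes is the one described above (each partition goes to exactly one consumer in the group).

```python
def assign_partitions(partitions, consumers):
    """Toy range-style assignment: every partition goes to exactly one consumer."""
    assignment = {c: [] for c in consumers}
    if not consumers:
        return assignment
    per_consumer, extra = divmod(len(partitions), len(consumers))
    start = 0
    for i, consumer in enumerate(sorted(consumers)):
        # The first `extra` consumers take one additional partition.
        count = per_consumer + (1 if i < extra else 0)
        assignment[consumer] = partitions[start:start + count]
        start += count
    return assignment


# One partition, two consumers: the second consumer gets nothing.
one_partition = assign_partitions([0], ["serviceA1", "serviceA2"])
# Three partitions, two consumers: both consumers have work.
three_partitions = assign_partitions([0, 1, 2], ["serviceA1", "serviceA2"])
# serviceA1 is gone: the remaining consumer takes over the partition.
after_failure = assign_partitions([0], ["serviceA2"])

print(one_partition)
print(three_partitions)
print(after_failure)
```

With a single partition the second instance only adds availability; it starts contributing throughput once the topic has at least as many partitions as the group has consumers.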
Please check below blogs and videos. It explains the topic, consumer, and consumer group in kafka.
https://www.javatpoint.com/apache-kafka-consumer-and-consumer-groups
http://cloudurable.com/blog/kafka-architecture-consumers/index.html
https://docs.confluent.io/platform/current/clients/consumer.html
https://www.youtube.com/watch?v=lAdG16KaHLs
I hope this will give you idea about the consumer and consumer group.
A broad solution to this is to decouple consumption of a message (i.e. receiving a message from Kafka and perhaps deserializing it and validating that it conforms to the schema) from processing it (interpreting the message). If the consumption is simple enough, being limited to no more consuming instances than there are partitions need not be a constraint.
One way to accomplish this is to have a Kafka consumption service which sends an HTTP request (perhaps through a load balancer or whatever) to a processing service which has arbitrarily many members.
Note that depending on what you're using Kafka for, there may be a requirement that certain messages always be in the same partition as one another in order to ensure that they get handled in a deterministic order (since ordering across partitions is not guaranteed). A typical example of this would be if the messages are change events for a particular record. If you're accomplishing this via some hash of the message key (or a portion of the key if using a custom partitioner), then simply changing the number of partitions might not be viable (you would need to introduce some sort of migration or have the producers know which records have to be routed to the old partitions and only route to the new partitions if the record has never been seen before).
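The effect of changing the partition count on key-based routing can be shown with a small sketch. The hash below is an assumption made for illustration: Kafka's Java producer uses murmur2 for keyed messages, while this sketch uses CRC32 only because it is in the standard library and deterministic. The key names are hypothetical too.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in for the producer's "hash(key) mod partition count" routing.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = [f"record-{i}" for i in range(1000)]

# Compare where each key lands with 4 partitions versus 6 partitions.
moved = [k for k in keys if partition_for(k, 4) != partition_for(k, 6)]

print(f"{len(moved)} of {len(keys)} keys change partition when going from 4 to 6 partitions")
```

Any key in `moved` would have its older events in one partition and its newer events in another, which is exactly the ordering problem described above.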
We just started replacing messaging with Kafka.
In a traditional MQ setup there is a cluster with one or more MQs inside it.
The MQ cluster/coordinator service delivers the messages to the clients.
There can be 10 services/clients consuming messages from a single MQ.
So if there are 10 messages in the MQ, each service/consumer/client can read and process 1 message.
This is not possible in Kafka, which I now understand is by design.
To achieve similar functionality in Kafka, I have to add at least as many partitions as there are clients/consumers/pods.
Related
Scenario
10 kafka consumers within a same Consumer Group.
Kafka has 10 partitions => which means each partition is automatically assigned to a single consumer within the group.
Message is sent to partition on a round-robin basis.
Every now and then, a message will take much longer to process than other messages.
In such occasions, there's a chance the next message is assigned to a consumer that is still busy working while there are other free consumers
Question
Does Kafka support a mechanism to automatically send message to a partition whose consumer is free?
If it doesn't, what is the common approach to this scenario?
Although you could implement a custom Assignor class, by default, consumption is only based on assignment, not by load; such information is not communicated back to the group coordinator. Plus, shuffling around constantly based on load would likely cause frequent group rebalances, causing consumption to be even slower
Regarding length-of-processing, I am not aware of any way your consumer would be able to inspect message before partition assignment and polling such records. Therefore, you'd need to decouple your processing logic from the actual poll loop if you'd like to improve processing times.
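A minimal sketch of decoupling the poll loop from processing is shown below. It does not talk to Kafka: the list of records stands in for one polled batch, and `process` and `poll_loop` are hypothetical names. The record `"slow"` simulates the occasional long-running message.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def process(record):
    # Hypothetical handler; "slow" stands for a message that takes much longer.
    time.sleep(0.2 if record == "slow" else 0.01)
    return f"done:{record}"


def poll_loop(records, workers=4):
    """Hand each polled record to a worker pool instead of processing it inline."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process, r) for r in records]
        # Collect results in submission order once the workers finish.
        return [f.result() for f in futures]


results = poll_loop(["slow", "a", "b", "c", "d"])
print(results)
```

While one worker is busy with the slow record, the others keep working through the rest of the batch. The trade-off is that records from the same partition may now finish out of order, and offsets should only be committed once the corresponding work has completed.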
We have a business process/workflow that starts when an initial event message is received and closes when the last message is processed. We have up to 100,000 processes executed each day. My problem is that the messages belonging to a specific process have to be processed in the same order in which they were received. If one of the messages fails, that process has to freeze until the problem is fixed, while all other processes have to continue. For this kind of situation I am thinking of using Kafka. The first solution that came to my mind was to use topic partitioning by message key. The key of the message would be the ProcessId. This way I could be sure that all messages of a process end up in the same partition, and Kafka would guarantee the order. As I am new to Kafka, what I have managed to figure out is that partitions have to be created in advance, and that makes everything too difficult. So my questions are:
1) When I produce a message to a Kafka topic that does not exist, the topic is created at runtime. Is it possible to have the same behavior for topic partitions?
2) there can be more than 100,000 active partitions on the topic, is that a problem?
3) can partition be deleted after all messages from that topic were read?
4) maybe you can suggest other approaches to my problem?
When I produce a message to a Kafka topic that does not exist, the topic is created at runtime. Is it possible to have the same behavior for topic partitions?
You need to specify the number of partitions when creating the topic. New partitions won't be created automatically (as is the case with topic creation); you have to change the number of partitions using the topic tool.
More Info: https://kafka.apache.org/documentation/#basic_ops_modify_topi
As soon as you increase the number of partitions, producers and consumers learn about the new partitions through a metadata refresh, and the consumer group rebalances. After that, producers and consumers start producing to and consuming from the new partitions.
there can be more than 100,000 active partitions on the topic, is that a problem?
Yes, having this many partitions will increase overall latency.
Go through how-choose-number-topics-partitions-kafka-cluster on how to decide number of partitions.
can partition be deleted after all messages from that topic were read?
Deleting a partition would lead to data loss and also the remaining data's keys would not be distributed correctly so new messages would not get directed to the same partitions as old existing messages with the same key. That's why Kafka does not support decreasing partition count on topic.
Also, Kafka doc states that
Kafka does not currently support reducing the number of partitions for a topic.
I suppose you chose the wrong feature to solve your task.
In general, partitioning is used for load balancing.
Incoming messages are distributed across the given number of partitions according to the partitioning strategy, which is configured on the producer (not on the broker). In short, the default strategy for keyed messages calculates i = key_hash mod number_of_partitions and puts the message in the i-th partition. More about strategies you could read here
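The "key_hash mod number_of_partitions" rule also means that many keys can share a small, fixed set of partitions while each key's messages stay in order. The sketch below illustrates this with hypothetical process ids and a hypothetical partition count of 12; CRC32 is again only a stand-in for the producer's real hash function.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 12  # hypothetical; far fewer partitions than processes


def partition_for(process_id: str) -> int:
    # i = key_hash mod number_of_partitions
    return zlib.crc32(process_id.encode("utf-8")) % NUM_PARTITIONS


# Each partition is modelled as an append-only list.
log = defaultdict(list)
events = [
    ("proc-1", "start"),
    ("proc-2", "start"),
    ("proc-1", "step"),
    ("proc-2", "step"),
    ("proc-1", "end"),
]
for process_id, event in events:
    log[partition_for(process_id)].append((process_id, event))

# All events of proc-1 sit in one partition, in the order they were produced.
proc1_events = [ev for pid, ev in log[partition_for("proc-1")] if pid == "proc-1"]
print(proc1_events)  # ['start', 'step', 'end']
```

So keying by ProcessId does not require one partition per process; it only requires that the partition count stays stable while a process is in flight.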
Message ordering is guaranteed only within partition. With two messages from different partitions you have no guarantees which come first to the consumer.
You should probably use consumer groups instead. A group is a consumer-side option.
Each group consumes all messages from the topic independently.
A group can consist of one consumer, or more if you need them.
You can have many groups and add a new group dynamically (in fact, by adding a new consumer with a new groupId).
Since you can stop or pause any consumer, you can manually stop all consumers belonging to a given group. I suppose there is no single command to do that, but I'm not sure. Anyway, if you have a single consumer in each group, you can stop it easily.
If you want to remove a group, you just shut down and drop the related consumers. No action on the broker side is needed.
As a drawback, you'll get 100,000 consumers reading a (single) topic. That's a heavy network load at the very least.
I have been studying Apache Kafka for a while now.
Lets consider the following example.
Consider I have a topic with 3 partitions. I have a single producer and single consumer. I am producing my messages without specifying the key attribute.
So I know that on the producer side, when I publish a message, the strategy Kafka uses to assign the message to one of those partitions is round-robin.
Now, what I want to know is: when I start a single consumer belonging to a certain consumer group and listening to that same topic, what strategy will it use to pull the messages from the different partitions (as there are 3)?
Would it follow a similar round-robin model, where it sends a fetch request to the leader of partition 1, waits for the response, gets the response, and returns the records to process; then sends a fetch request to the leader of partition 2, and so on?
If it follows some other strategy/algorithm, I would love to know what it is?
Thank you in advance.
There is no ordering guarantee outside of a partition, so in a way the algorithm used is moot to the end user and subject to change.
Today, there is nothing terribly complex that happens in this instance. The protocol shows you that a fetch request includes a partition so you get a fetch per partition. That means the order depends on the consumer. A partition won't be starved because fetch requests will happen for all partitions assigned to the consumer.
I am trying to get my head around Kafka consumers, and I'd like to know if the following use case can be solved using Kafka.
My use case is basically this one:
I have a stream that I'd like to be consumed in sync by several consumers. In other words, I have a first consumer that starts to consume the stream, then another consumer arrives later. I'd like this second consumer to start consuming the stream at the offset where the first consumer currently is.
I know that I need to have the consumers in two different groups. But it is not clear for me :
on how or if it is possible to coordinate the groups offset
if I would expect a latency for such coordination task
You do not need two different groups, all consumers can check one topic. Or as many as they like, for that matter.
offset
Messages are identified by their offset within a partition, and they also carry a timestamp, so a client can either track the last offset it read or ask the broker "my last visit was at 10:00, give me all new messages". So all each client needs to keep track of is where (or when) each individual topic partition was last read.
latency
This is kind of "out of scope" at this point. Of course there will be latency, but it depends on the environment, like "how many consumers", "how many topics", "message format" etc.
so can your usecase be solved using kafka
In short: yes. As for "can one consumer continue where another has left off": the consumers could exchange the latest offset between each other; of course that would require some internal synchronization. Kafka tracks committed offsets per consumer group, but it does not synchronize positions between different groups, so you need to do that work yourself. Another possibility, with a traditional queue rather than Kafka, would be to actually consume the messages (i.e. delete them from the queue once consumed), so each time another consumer hits the queue it is guaranteed to receive the messages from where another consumer left off. Kafka does not delete messages when they are read, so that would depend on your use case: can you actually delete your messages from the queue?
This is not a problem Kafka addresses directly (a consumer group exists to distribute partitions among its members, not to give them the same offset), but you can do something about it. You could simply create another topic, where consumer1 would post either the offset or a copy of each message it has read (so consumer1 would need both a consumer and a producer), and your other synchronized consumer would react to that. Of course, there would be some latency with this approach.
What is your use case behind this? Why can't you consume at different offsets? Couldn't you instead have one consumer that dispatches each message it reads to the different processes, so that they are indeed synchronized (with no latency)?
What do you mean by synchronized: should consumer2 (and 3 and more) only consume the same message as consumer1 (i.e. never consume faster, which is what I assume in both previous solutions)? While this is possible, it would really be better to know the reason behind this; maybe there is a better way for you to process the data.
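The offset-sharing idea from this answer can be sketched without a broker. Everything here is a toy model: `InMemoryLog` stands in for a single partition and `Reader` for a consumer that tracks its own position, and the shared position is passed through a plain variable where the answer suggests a second topic. Real Kafka clients expose a seek operation for starting at a given offset; that call is not shown here.

```python
class InMemoryLog:
    """Toy stand-in for one partition: an append-only list addressed by offset."""

    def __init__(self):
        self.records = []

    def append(self, value):
        self.records.append(value)
        return len(self.records) - 1  # offset of the appended record


class Reader:
    """Toy consumer that keeps its own position in the log."""

    def __init__(self, log, position=0):
        self.log = log
        self.position = position

    def poll(self):
        if self.position >= len(self.log.records):
            return None  # nothing new to read
        value = self.log.records[self.position]
        self.position += 1
        return value


stream = InMemoryLog()
for i in range(5):
    stream.append(f"msg-{i}")

consumer1 = Reader(stream)
consumer1.poll()  # reads msg-0
consumer1.poll()  # reads msg-1

# consumer1 publishes its position; in the answer's design this goes to another topic.
shared_offset = consumer1.position

# A late-arriving consumer starts from the published position, not from offset 0.
consumer2 = Reader(stream, position=shared_offset)
first_seen_by_consumer2 = consumer2.poll()
print(first_seen_by_consumer2)  # msg-2
```

The latency mentioned in the answer corresponds to the delay between consumer1 publishing `shared_offset` and consumer2 reading it.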
One of the first things I think about when using a new service (such as a non-RDBMS data store or a message queue) is: "How should I structure my data?".
I've read and watched some introductory materials. In particular, take, for example, Kafka: a Distributed Messaging System for Log Processing, which writes:
"a Topic is the container with which messages are associated"
"the smallest unit of parallelism is the partition of a topic. This implies that all messages that ... belong to a particular partition of a topic will be consumed by a consumer in a consumer group."
Knowing this, what would be a good example that illustrates how to use topics and partitions? When should something be a topic? When should something be a partition?
As an example, let's say my (Clojure) data looks like:
{:user-id 101 :viewed "/page1.html" :at #inst "2013-04-12T23:20:50.22Z"}
{:user-id 102 :viewed "/page2.html" :at #inst "2013-04-12T23:20:55.50Z"}
Should the topic be based on user-id? viewed? at? What about the partition?
How do I decide?
When structuring your data for Kafka, it really depends on how it's meant to be consumed.
In my mind, a topic is a grouping of messages of a similar type that will be consumed by the same type of consumer. So in the example above, I would just have a single topic, and if you decide to push some other kind of data through Kafka, you can add a new topic for that later.
Topics are registered in ZooKeeper which means that you might run into issues if trying to add too many of them, e.g. the case where you have a million users and have decided to create a topic per user.
Partitions, on the other hand, are a way to parallelize the consumption of messages. The number of partitions in a topic needs to be at least the number of consumers in a consumer group for the partitioning feature to make sense; otherwise some consumers have nothing to read. Consumers in a consumer group split the burden of processing the topic between themselves according to the partitioning, so that each consumer is only concerned with messages in the partitions it is "assigned to".
Partitioning can either be explicitly set using a partition key on the producer side or if not provided, a random partition will be selected for every message.
Once you know how to partition your event stream, the topic name will be easy, so let's answer that question first.
@Ludd is correct - the partition structure you choose will depend largely on how you want to process the event stream. Ideally you want a partition key which means that your event processing is partition-local.
For example:
If you care about users' average time-on-site, then you should partition by :user-id. That way, all the events related to a single user's site activity will be available within the same partition. This means that a stream processing engine such as Apache Samza can calculate average time-on-site for a given user just by looking at the events in a single partition. This avoids having to perform any kind of costly partition-global processing
If you care about the most popular pages on your website, you should partition by the :viewed page. Again, Samza will be able to keep a count of a given page's views just by looking at the events in a single partition
Generally, we are trying to avoid having to rely on global state (such as keeping counts in a remote database like DynamoDB or Cassandra), and instead be able to work using partition-local state. This is because local state is a fundamental primitive in stream processing.
If you need both of the above use-cases, then a common pattern with Kafka is to first partition by say :user-id, and then to re-partition by :viewed ready for the next phase of processing.
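The two examples above can be sketched in a few lines. This is a toy model rather than Samza or Kafka code: partitions are plain lists, CRC32 stands in for the real partitioner hash, the partition count is arbitrary, and the last two events are hypothetical additions to the two events from the question.

```python
import zlib
from collections import Counter, defaultdict

NUM_PARTITIONS = 4  # arbitrary for the sketch


def partition_for(key) -> int:
    return zlib.crc32(str(key).encode("utf-8")) % NUM_PARTITIONS


events = [
    {"user-id": 101, "viewed": "/page1.html"},
    {"user-id": 102, "viewed": "/page2.html"},
    {"user-id": 101, "viewed": "/page2.html"},  # hypothetical extra event
    {"user-id": 103, "viewed": "/page1.html"},  # hypothetical extra event
]


def partition_by(events, field):
    """Route each event to a partition based on the chosen key field."""
    partitions = defaultdict(list)
    for event in events:
        partitions[partition_for(event[field])].append(event)
    return partitions


def local_counts(partitions, field):
    """Count per key inside each partition, using only partition-local state."""
    return {p: Counter(e[field] for e in evs) for p, evs in partitions.items()}


# Phase 1: partition by user, so per-user metrics are partition-local.
by_user = partition_by(events, "user-id")
views_per_user = local_counts(by_user, "user-id")

# Phase 2: re-partition the same events by page for page popularity.
by_page = partition_by(events, "viewed")
views_per_page = local_counts(by_page, "viewed")

print(views_per_user)
print(views_per_page)
```

Because every event for a given key lands in one partition, each per-partition counter is already complete for the keys it holds and no cross-partition merge is needed.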
On topic names - an obvious one here would be events or user-events. To be more specific you could go with events-by-user-id and/or events-by-viewed.
This is not exactly related to the question, but in case you already have decided upon the logical segregation of records based on topics, and want to optimize the topic/partition count in Kafka, this blog post might come handy.
Key takeaways in a nutshell:
In general, the more partitions there are in a Kafka cluster, the higher the throughput one can achieve. Let the max throughput achievable on a single partition for production be p and for consumption be c. Let’s say your target throughput is t. Then you need to have at least max(t/p, t/c) partitions.
Currently, in Kafka, each broker opens a file handle of both the index and the data file of every log segment. So, the more partitions, the higher that one needs to configure the open file handle limit in the underlying operating system. E.g. in our production system, we once saw an error saying too many files are open, while we had around 3600 topic partitions.
When a broker is shut down uncleanly (e.g., kill -9), the observed unavailability could be proportional to the number of partitions.
The end-to-end latency in Kafka is defined by the time from when a message is published by the producer to when the message is read by the consumer. As a rule of thumb, if you care about latency, it’s probably a good idea to limit the number of partitions per broker to 100 x b x r, where b is the number of brokers in a Kafka cluster and r is the replication factor.
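The two rules of thumb quoted above are simple arithmetic, sketched here as helper functions. The function names and all input numbers are hypothetical; the formulas are taken as stated in the takeaways, with the partition count rounded up since partitions come in whole numbers.

```python
import math


def min_partitions(target, producer_per_partition, consumer_per_partition):
    """At least max(t/p, t/c) partitions, rounded up to a whole partition."""
    return math.ceil(max(target / producer_per_partition,
                         target / consumer_per_partition))


def latency_partition_cap(brokers, replication_factor):
    """Rule-of-thumb cap of 100 x b x r, as quoted in the takeaways."""
    return 100 * brokers * replication_factor


# Hypothetical figures, all in the same unit (e.g. MB/s):
# target t = 100, per-partition produce rate p = 10, consume rate c = 20.
needed = min_partitions(target=100, producer_per_partition=10, consumer_per_partition=20)

# Hypothetical cluster: b = 3 brokers, replication factor r = 2.
cap = latency_partition_cap(brokers=3, replication_factor=2)

print(needed)  # 10
print(cap)     # 600
```

Here the producer side is the bottleneck (t/p = 10 versus t/c = 5), so it determines the minimum partition count.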
I think a topic name is a label for one kind of message: producers publish messages to the topic, and consumers receive them by subscribing to the topic.
A topic can have many partitions. Partitions are good for parallelism. A partition is also the unit of replication, so in Kafka, leader and follower are defined at the level of the partition. A partition is really an ordered queue, ordered by message arrival. Put simply, a topic is composed of one or more such queues. This is useful when modelling our structure.
Kafka was developed at LinkedIn for log aggregation and delivery, so that scenario makes a good example.
The users' events on your website or app can be logged by your web server and then sent to the Kafka broker through the producer. In the producer, you can specify the partitioning method, for example: event type (different events are saved in different partitions), event time (split a day into periods according to your app logic), user type, or no logic at all, just balancing all logs across many partitions.
As for the case in your question, you can create one topic called "page-view-event" and create N partitions, hashing keys to distribute the logs evenly across all partitions. Or you could choose your own partitioning logic to distribute the logs as you see fit.