Kafka - Synchronized Consumer Groups - apache-kafka

I am trying to wrap my head around Kafka consumers, and I'd like to know whether the following use case can be solved using Kafka.
My use case is basically this one:
I have a stream that I'd like to be consumed in sync by several consumers. In other words, a first consumer starts to consume the stream, then another consumer arrives later. I'd like this second consumer to start consuming the stream at the offset where the first consumer currently is.
I know that I need to have the consumers in two different groups. But it is not clear to me:
how, or whether, it is possible to coordinate the groups' offsets
whether I should expect latency from such a coordination task

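You do not strictly need two different groups: all consumers can read the same topic (or as many topics as they like, for that matter), provided each one assigns partitions and tracks its position itself. Consumers that join one shared group split the partitions between them, so they would not all see the same messages.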
offset
Messages are identified by their offset within a partition (a lookup by timestamp is also possible), so each client tells the broker "my last position was offset N, give me everything after it". So all each client needs to keep track of is the last offset it read for each topic partition.
latency
This is kind of out of scope at this point. Of course there will be latency, but it depends on the environment: how many consumers, how many topics, the message format, etc.
so can your usecase be solved using kafka
In short: yes. For "can one consumer continue where another has left off", the consumers could exchange the latest offset between each other; of course that would require some synchronization of your own. Kafka stores committed offsets per consumer group, but it does not share one group's position with another group, so you need to do that work yourself. Another possibility, in a traditional queue, would be to actually consume the messages (that is, delete them from the queue once consumed), so each time another consumer hits the queue it is guaranteed to receive the messages where the other consumer left off. Kafka itself does not delete records on consumption (they expire by retention), so that option depends on your use case and on whether you can actually delete your messages from the queue.

This is not a problem handled by Kafka directly (a consumer group is there to distribute partitions among its members, not to give them the same offset), but you can do something about it. You could simply create another topic, where consumer1 would post either the offset or a copy of each message it has read (so it would need to be both a consumer and a producer), and your other synchronized consumers would react to this. Of course there would be some latency.
What is your use case behind this? Why can't you consume at different offsets? Couldn't you instead have one consumer, which would then dispatch each message it reads to the different processes, so that they are indeed synchronized (with no latency)?
What do you mean by synchronized: should consumer2 (and 3 and more) only consume the same message as consumer1 (i.e. they can't consume faster, which is what I assume in both previous solutions)? While this is possible, it would really be better to know the reason behind it; maybe there is a better way for you to process the data.
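To make the side-topic idea a bit more concrete, here is a minimal in-memory sketch in Python. It does not talk to Kafka at all: `Log`, `Leader` and `Follower` are hypothetical names, each `Log` stands in for a single-partition topic, and the assumption is that consumer1 publishes the next offset it will read after every message.

```python
class Log:
    """Toy append-only log standing in for a single-partition topic."""

    def __init__(self):
        self.records = []

    def append(self, value):
        self.records.append(value)

    def latest(self, default=None):
        # Most recent record, or `default` if the log is empty.
        return self.records[-1] if self.records else default


class Leader:
    """consumer1: reads the data log and publishes its position to a sync log."""

    def __init__(self, data, sync):
        self.data, self.sync, self.position = data, sync, 0

    def poll(self):
        if self.position >= len(self.data.records):
            return None
        value = self.data.records[self.position]
        self.position += 1
        self.sync.append(self.position)  # next offset the leader will read
        return value


class Follower:
    """A late consumer: starts at the leader's position and never runs ahead."""

    def __init__(self, data, sync):
        self.data, self.sync = data, sync
        self.position = sync.latest(default=0)

    def poll(self):
        limit = self.sync.latest(default=0)
        if self.position >= limit:
            return None  # the leader has not read this record yet
        value = self.data.records[self.position]
        self.position += 1
        return value
```

A follower created after the leader has read three records starts at offset 3 and only receives the fourth record once the leader has read it. The extra hop through the sync log is where the latency mentioned above comes from.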

Related

Why is Kafka called pub-sub and can we read randomly from an offset in Kafka

I have been reading about Kafka for weeks now but have some doubts which I was not able to resolve by going through multiple resources. Sorry if these are lame questions.
Kafka is a pub-sub system, but the consumer pulls data from the Kafka broker. I have read that pulling is better than pushing (with some cons), but if we are pulling the data, why do we call it a pub-sub system? Here Kafka is not notifying the consumers who have subscribed; rather, the consumer is pulling the data explicitly. (There are resources which say it is called pub-sub because data is not deleted after a consumer reads it, which is what happens in a queue. However, the pub-sub name is still confusing to me.)
If the consumer is pulling data from the broker, then I understand that the consumer needs to commit its offset to the broker (delivery semantics), but then why do we say that in Kafka we can start reading from whichever offset we want? I mean, the broker is keeping track of the consumer offset, so to resume reading from a random offset do I need to provide another offset in the API, and will the new offset be reset at the broker's end, or how will this happen?
Kafka is a distributed log. The servers are called brokers, and the clients are producers and consumers. Calling it "pub sub" makes developers aware of what group of problems/applications it can solve/support. It doesn't describe how it's different from other systems in the same group... More importantly, I don't think "pub sub" is ever written in the official documentation.
it is called pub-sub because data is not deleted after a consumer reads it
If data is deleted, that describes a non-persistent queue, not a pub-sub system.
The main distinction is that there are "Publishers" and "Subscribers" that are not communicating point-to-point. It doesn't matter if the subscription mechanism is push or pull based. From Wikipedia -
In software architecture, publish–subscribe is a messaging pattern where senders of messages, called publishers, do not program the messages to be sent directly to specific receivers, called subscribers, but instead categorize published messages into classes without knowledge of which subscribers, if any, there may be. Similarly, subscribers express interest in one or more classes and only receive messages that are of interest, without knowledge of which publishers, if any, there are.
So, Kafka producers write to ("categorized") topics located on brokers, rather than directly to the consumers ("specific receivers, called subscribers"). Consumers can start reading from topics that don't (yet) exist. And Producers can send data to topics that have no consumer(s).
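As a rough illustration of why pull-based consumption still fits the publish-subscribe pattern, here is a toy in-memory model in Python. `ToyBroker` and `ToySubscriber` are made-up names, not part of any Kafka client; the only assumptions are that a topic is an append-only list and that each subscriber keeps its own offset.

```python
from collections import defaultdict


class ToyBroker:
    """Keeps one append-only list per topic; reading never deletes anything."""

    def __init__(self):
        self.topics = defaultdict(list)

    def publish(self, topic, value):
        # The publisher names a topic, never a specific subscriber.
        self.topics[topic].append(value)

    def fetch(self, topic, offset, max_records=10):
        # Pull-based read: the caller says where it wants to start.
        return self.topics[topic][offset:offset + max_records]


class ToySubscriber:
    """Pulls from a topic and tracks its own position."""

    def __init__(self, broker, topic):
        self.broker, self.topic, self.offset = broker, topic, 0

    def poll(self):
        batch = self.broker.fetch(self.topic, self.offset)
        self.offset += len(batch)
        return batch
```

Two subscribers on the same topic each receive every record, because reading only advances the subscriber's own offset. Neither side knows about the other, which is the decoupling the Wikipedia definition describes.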
Back to your question -
to resume reading from a random offset do I need to provide another offset in the API, and will the new offset be reset at the broker's end, or how will this happen?
First, consumers aren't required to commit offsets for a consumer group. For any new or expired group, the auto.offset.reset config determines the start position; otherwise, the committed offset for the assigned group/topic/partition combination is used. All of this happens before the consumer gets the chance to seek individual partitions to arbitrary offsets.
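The lookup order described here can be sketched as a small Python function. This is only a model of the decision, not client code: `resolve_start_offset` and its parameters are hypothetical, and it assumes the three classic `auto.offset.reset` values (`earliest`, `latest`, `none`).

```python
def resolve_start_offset(committed, auto_offset_reset, log_start, log_end):
    """Pick the position a consumer starts from for one partition.

    committed: last committed offset for this group/topic/partition, or None
    log_start, log_end: first available offset and next offset to be written
    """
    # A committed offset that still lies inside the log wins.
    if committed is not None and log_start <= committed <= log_end:
        return committed
    # Otherwise fall back to the reset policy.
    if auto_offset_reset == "earliest":
        return log_start
    if auto_offset_reset == "latest":
        return log_end
    # "none" (or anything else in this toy model): surface the problem.
    raise LookupError("no valid committed offset and no reset policy")
```

An explicit seek would simply overwrite whatever this returns. Nothing changes on the broker side until the consumer commits again.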

How to scale to thousands of producer-consumer pairs in Kafka?

I have a usecase where I want to have thousands of producers writing messages which will be consumed by thousands of corresponding consumers. Each producer's message is meant for exactly one consumer.
Going through the core concepts here and here, it seems like each consumer-producer pair should have its own topic. Is this a correct understanding? I also looked into consumer groups, but it seems they are more for parallelizing consumption.
Right now I have multiple producer-consumer pairs sharing very few topics, but because of that (I think) I am having to read a lot of messages in the consumer and filter them out for the specific producer's messages by the key. As my system scales this might take a lot of time. Also, in the event I have to delete the checkpoint, this will be even more problematic, as it starts reading from the very beginning.
Is creating thousands of topics the solution for this? Or is there any other way to use concepts like partitions, consumer groups etc? Both producers and consumers are spark streaming/batch applications. Thanks.
Each producer's message is meant for exactly one consumer
Assuming you commit the offsets and don't allow retries, this is the expected behavior of all Kafka consumers (or rather, consumer groups).
seems like each consumer-producer pair should have its own topic
Not really. As you said, you have many-to-many relationship of clients. You do not need to have a known pair ahead of time; a producer could send data with no expected consumer, then any consumer application(s) in the future should be able to subscribe to that topic for the data they are interested in.
sharing very few topics, but because of that (I think) I am having to read a lot of messages in the consumer and filter them out for the specific producer's messages by the key. As my system scales this might take a lot of time
The consumption would take linearly more time on a higher production rate, yes, and partitions are the way to solve for that. Beyond that, you need faster network and processing. You still need to consume and deserialize in order to filter, so the filter is not the bottleneck here.
Is creating thousands of topics the solution for this?
Ultimately depends on your data, but I'm guessing not.
Is creating thousands of topics the solution for this? Or is there any other way to use concepts like partitions, consumer groups etc? Both producers and consumers are spark streaming/batch applications.
What's the reason you want to have thousands of consumers, or want to have an explicit 1-to-1 relationship? As mentioned earlier, only one consumer within a consumer group will process a message. This is normal.
If, however, you are trying to make your record processing extremely concurrent, then instead of using very high partition counts or very large consumer groups, you could use something like Parallel Consumer (PC).
By using PC, you can process all your keys in parallel, regardless of how long each takes to process, and you can be as concurrent as you wish.
PC directly solves for this, by sub partitioning the input partitions by key and processing each key in parallel.
It also tracks per record acknowledgement. Check out Parallel Consumer on GitHub (it's open source BTW, and I'm the author).
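The general idea of sub-partitioning by key can be sketched in a few lines of Python. To be clear, this is not Parallel Consumer's API (PC is a separate library with its own interface); `process_by_key` is a hypothetical helper that only shows how records can be grouped by key and each key handled concurrently while order within a key is kept. Per-record acknowledgement is not modelled.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def process_by_key(records, handler, max_workers=8):
    """Process (key, value) records with one task per key.

    Values sharing a key are handled sequentially, in arrival order;
    different keys may run in parallel on the thread pool.
    Returns {key: [handler(value), ...]}.
    """
    by_key = defaultdict(list)
    for key, value in records:
        by_key[key].append(value)

    def run_key(values):
        return [handler(v) for v in values]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {key: pool.submit(run_key, values)
                   for key, values in by_key.items()}
        return {key: future.result() for key, future in futures.items()}
```

With this shape the degree of concurrency is bounded by the number of distinct keys and the pool size, not by the number of partitions.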

If I use Kafka as simple message. Does it really worth

=== Assume everything from consumer point of view ===
I was reading a couple of Kafka articles and I saw that the number of partitions is coupled to the number of microservice instances. Ex: say I have 1 topic with 1 partition for my serviceA. The producer pushes messages to topicT1, partitionP1, and on the consumer side (ServiceA1) I can read from t1, p1. If I spin up a new pod (ServiceA2) to get higher throughput, the second instance will never receive any message, because Kafka/ZooKeeper assigns an id to each consumer and partition1 is already taken by serviceA1. So serviceA2++ stays idle. To avoid such a hassle, Kafka recommends adding more partitions, so that the number of consumers can be increased/decreased based on need.
I was also able to test this through the command line, and service2 never consumed any message. When I shut down service1, service2 was able to pick up new messages. So if I spin up more pods, fail-safety/availability increases but throughput always stays the same.
Is my assumption correct? Am I missing anything? Now I feel like any standard messaging system will have the same problem. How do you extend message-oriented systems themselves?
Every topic has at least one partition; by default it comes with only one partition if you don't define the partition count. In your case, you have a consumer group that consists of two consumers. Every consumer reads the log from the partitions assigned to it. Here, the first consumer reads from the first (and only) partition, and there is no partition left to feed data to the second consumer, so it becomes idle. Only once the first consumer goes down does the second consumer start reading from that partition, from the last committed offset.
Please check below blogs and videos. It explains the topic, consumer, and consumer group in kafka.
https://www.javatpoint.com/apache-kafka-consumer-and-consumer-groups
http://cloudurable.com/blog/kafka-architecture-consumers/index.html
https://docs.confluent.io/platform/current/clients/consumer.html
https://www.youtube.com/watch?v=lAdG16KaHLs
I hope this will give you an idea about consumers and consumer groups.
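The idle second consumer can be reproduced with a tiny Python model of group assignment. This is a simplified round-robin sketch, not Kafka's actual assignor logic; `assign_partitions` is a hypothetical helper and the consumer names are just the ones from the question.

```python
def assign_partitions(partitions, consumers):
    """Give every partition to exactly one consumer of the group."""
    assignment = {consumer: [] for consumer in consumers}
    if not consumers:
        return assignment
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# One partition, two instances: the second one gets nothing.
print(assign_partitions([0], ["serviceA1", "serviceA2"]))
# After serviceA1 goes away, a new assignment hands the partition over.
print(assign_partitions([0], ["serviceA2"]))
# With three partitions both instances have work.
print(assign_partitions([0, 1, 2], ["serviceA1", "serviceA2"]))
```

Since a partition has a single owner within a group, adding instances beyond the partition count only adds standby capacity, which matches what the question observed.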
A broad solution to this is to decouple consumption of a message (i.e. receiving a message from Kafka and perhaps deserializing it and validating that it conforms to the schema) from processing it (interpreting the message). If the consumption is simple enough, being limited to no more consuming instances than there are partitions need not be a constraint.
One way to accomplish this is to have a Kafka consumption service which sends an HTTP request (perhaps through a load balancer or whatever) to a processing service which has arbitrarily many members.
Note that depending on what you're using Kafka for, there may be a requirement that certain messages always be in the same partition as one another in order to ensure that they get handled in a deterministic order (since ordering across partitions is not guaranteed). A typical example of this would be if the messages are change events for a particular record. If you're accomplishing this via some hash of the message key (or a portion of the key if using a custom partitioner), then simply changing the number of partitions might not be viable (you would need to introduce some sort of migration or have the producers know which records have to be routed to the old partitions and only route to the new partitions if the record has never been seen before).
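The repartitioning concern can be seen with a short Python experiment. Note the assumptions: `partition_for` and `moved_keys` are hypothetical helpers, and CRC32 is used purely as a stand-in hash (Kafka's default partitioner uses a different hash function), so only the modulo effect carries over.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition by hashing it (stand-in hash: CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def moved_keys(keys, old_count, new_count):
    """Keys whose partition changes when the partition count changes."""
    return [key for key in keys
            if partition_for(key, old_count) != partition_for(key, new_count)]


keys = [f"record-{i}" for i in range(1000)]
moved = moved_keys(keys, old_count=4, new_count=6)
print(f"{len(moved)} of {len(keys)} keys land on a different partition")
```

Any key in `moved` would have its older events on one partition and its newer events on another, which is exactly the ordering problem that calls for a migration or for routing that is aware of the old layout.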
We just started replacing our messaging system with Kafka.
In a traditional MQ setup there is a cluster with one or more MQs inside.
So the MQ cluster/coordinator service delivers the messages to clients.
Now there can be 10 services/clients consuming messages from a single MQ.
So if there are 10 messages in the MQ, then each service/consumer/client can read/process 1 message.
As I now understand, this is not possible in Kafka by design.
To achieve similar functionality in Kafka, I have to add at least as many partitions as there are clients/consumers/pods.

Producer-consumer with side constraints in Kafka (or others)

We have a bunch of producers that send messages/events to a bunch of consumers. Each message must be consumed by exactly one consumer. We know that this common scenario can easily be achieved by using consumer groups in Kafka. However, we also have a couple of additional constraints: not every consumer can consume every message. Messages have (arbitrary) requirements attached to them, and only consumers that fulfil these requirements may process them. This would still be possible with a consumer group where a consumer first looks at the message and re-submits it if it does not meet the requirements. However, there is no guarantee that a message will be seen by every consumer at least once, so messages may bounce around indefinitely although there may be a matching consumer. We also cannot set up multiple topics, because the requirements for consumers are arbitrarily complex boolean formulas defined by the user and not the application. This can result in a combinatorial explosion of topics.
Additionally we want to be able to dynamically add and remove consumers from the group in case more processing resources are needed. As far as I understood Kafka, this can lead to consumers not getting any messages if there are not enough partitions and dynamically re-partitioning is also not really possible (without admin interaction).
Is there any way to make this work in Kafka? Maybe Kafka is also not the right technology, are there others that are more suitable? We also looked at RabbitMQ but also there we did not find a way that guarantees that every consumer is seeing a message so that it can evaluate the requirements.
You could commit offsets manually, after identifying the desired events, by setting ENABLE_AUTO_COMMIT_CONFIG to false in your consumer configs, but your use case would trigger excessive rebalances, each of which pauses consumption. I don't think Kafka is the appropriate infrastructure for this.
However, if you could mark your events with a finite number of keys, you can dictate which partition they are produced to. A consumer assigned to that partition is then guaranteed to see every event with that key. Note that, for each key to get a partition of its own, you need as many partitions in your topic as there are unique keys.
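As a rough sketch of the finite-keys idea, the mapping from key to partition can be made explicit. `KeyedRouter` and the example keys are hypothetical; in a real producer this role would be played by a custom partitioner or by passing the partition number explicitly, which is not shown here.

```python
class KeyedRouter:
    """Fixed one-to-one mapping between a finite key set and partitions."""

    def __init__(self, keys):
        # Sort for a stable mapping regardless of input order.
        self.partition_of = {key: index
                             for index, key in enumerate(sorted(set(keys)))}

    @property
    def num_partitions(self):
        # One partition per unique key.
        return len(self.partition_of)

    def route(self, key):
        # Raises KeyError for a key outside the agreed set.
        return self.partition_of[key]


router = KeyedRouter(["gpu", "cpu", "tpu"])
print(router.num_partitions, router.route("gpu"))
```

A consumer that can only handle "gpu" events would then be assigned the partition `router.route("gpu")` and nothing else. This only works while the key set stays finite and known, which the arbitrary boolean requirements in the question may not allow.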

Can consumer group remember which all topics it is subscribed to

I am new to Kafka and I am trying to build a multiple-producer subscribe functionality.
Let's say there are N producers called P1, P2, P3... and M consumers C1, C2, C3.
Now C1 needs to subscribe to P1 and P2, and at some point in time it needs to subscribe to P3 as well. Hence C1 has a dynamic list of topics it needs to subscribe to.
I was hoping this could be achieved using the high level consumer, where we can name our consumer group and Kafka will store the offset up to which we have read. But then what I noticed is that we also need to give the topic names while creating the high level consumer. In my case I have something like 1000 topics I need to subscribe to, and this list is dynamically updated.
Is there a way for the Kafka high level consumer to remember the topics it has subscribed to and listen to them when brought up, rather than us providing the names of all the topics it was subscribed to in the past?
I don't think that the Kafka architecture you outlined would work. The main issue, given that a Kafka topic is a point of asynchrony between producers and consumers, is that you cannot do a clean-cut switch with your "dynamic list of topics you need to subscribe to" (as you put it), since some amount of messages will presumably always be in "the queue".
Besides that, it's not exactly trivial to dynamically change the topic (and partition) in consumer clients. AFAIK Kafka is not meant to be used this way.
A better option would be to use a special message field that would tell your consumer clients whether the message is for them or not.
So you can use dedicated topics for messages that don't require this dynamic nature (in order to avoid doing this check for all messages, if possible) and a separate topic where you'd mix all messages that do require it.
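The message-field approach could look roughly like this in Python. Everything here is illustrative: `FilteringConsumer`, the `source` and `payload` field names, and the producer ids are assumptions, and the messages are plain dicts rather than records read from a real topic.

```python
class FilteringConsumer:
    """Reads every message from a shared topic and keeps only its own."""

    def __init__(self, name, interests):
        self.name = name
        self.interests = set(interests)  # producer ids this consumer wants
        self.received = []

    def subscribe(self, producer_id):
        # The interest list can change at any time without touching topics.
        self.interests.add(producer_id)

    def handle(self, message):
        """Return True if the message was for this consumer."""
        if message["source"] in self.interests:
            self.received.append(message["payload"])
            return True
        return False


c1 = FilteringConsumer("C1", {"P1", "P2"})
c1.handle({"source": "P1", "payload": "hello"})
c1.handle({"source": "P3", "payload": "skipped"})
c1.subscribe("P3")
c1.handle({"source": "P3", "payload": "now accepted"})
print(c1.received)
```

The interest set is plain application state, so it can be persisted and reloaded on startup. That gives the "remembering" the question asks for without changing the topic subscription itself.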