Allow consumption of same message by different instances of the same service from a Kafka topic - apache-kafka

I have several instances of the same service subscribed to a Kafka topic. A producer publishes 1 message to a topic. I want this message to be consumed by all instances. When instance is started, the messages should be read from the end of topic/partitions. I don't want the instances to receive messages which were published before service is started (but these won't be a big problem if some old messages are processed by the service). I don't want the instances to lose messages if the instances are disconnected from Kafka for some time or Kafka is down which mean that I need to commit offsets periodically. Message can be processed twice, it is not a big problem.
Is the following the best way to archive the described behavior: generate new Kafka group id using new Guid or timestamp for each instance each time instance is started?
What are disadvantages of the approach described in item 1 above?

It is enough to do two things. First, each instance of the service should have its own group.id. That guarantees that each of them will read all published messages, and will receive published messages after reconnecting. This id is per instance and there is no need to regenerate it on start. Second, each instance should have the property auto.offset.reset=latest, which is also the default. This guarantees that the consumer will not read messages, which were published before the first start of the instance.
Of course, your instances need to commit offsets after processing of the messages.

Related

Why is Kakfa called pub-sub and can we read randomly from an offset in Kafka

I have been reading about Kafka for weeks now but have some doubts which I was not able to resolve by going through multiple resources. Sorry if these are lame questions.
Kafka is a pub-sub system but the consumer pulls data from kafka broker - I have read that pull will be better than pushing (with some cons) but if we are pulling the data why do we call it a pub-sub system? Here Kafka is not notifying the consumers who have subscribed, rather consumer is pulling the data explicitly. (There are resources which say it is called pub-sub because data is not deleted after a consumer reads it (which happen in a queue). However the pub-sub name is still confusing to me).
If consumer is pulling data from the broker, then I understand that consumer needs to commit it to the broker (Delivery semantics), but then why do we say that in Kafka, we can start reading from whichever offset we want. I mean broker is keep a track of consumer offset, then to resume reading from a random offset do I need to provide another offset in the API and will the new offset will be reset at the broker's end or how this will happen.
Kafka is a distributed log. The servers are called brokers, and clients are producers and consumers. Calling it "pub sub" makes developers aware what group of problems/applications it can solve/support. It doesn't describe how it's different from other systems in the same group... More importantly, I don't think think "pub sub" is ever written in the official documentation.
it is called pub-sub because data is not deleted after a consumer reads it
If data is deleted, that describes a non-persistent queue, not a pub-sub system.
The main distinction is that there are "Publishers" and "Subscribers" that are not communicating point-to-point. It doesn't matter if the subscription mechanism is push or pull based. From Wikipedia -
In software architecture, publish–subscribe is a messaging pattern where senders of messages, called publishers, do not program the messages to be sent directly to specific receivers, called subscribers, but instead categorize published messages into classes without knowledge of which subscribers, if any, there may be. Similarly, subscribers express interest in one or more classes and only receive messages that are of interest, without knowledge of which publishers, if any, there are.
So, Kafka producers write to ("categorized") topics located on brokers, rather than directly to the consumers ("specific receivers, called subscribers"). Consumers can start reading from topics that don't (yet) exist. And Producers can send data to topics that have no consumer(s).
Back to your question -
to resume reading from a random offset do I need to provide another offset in the API and will the new offset will be reset at the broker's end or how this will happen.
First, consumers aren't required to commit to a consumer group. For any new/expired group, the auto.offset.reset config will be used to determine start position, otherwise, the committed offset for an assigned group/topic/partition combination will be used. This is prior to the consumer being able to seek individual partitions to random offsets.

Kafka with multiple instances of microservices and end-users

This is more of a design/architecture question.
We have a microservice A (MSA) with multiple instances (say 2) running of it behind LB.
The purpose of this microservice is to get the messages from Kafka topic and send to end users/clients. Both instances use same consumer group id for a particular client/user so as messages are not duplicated. And we have 2 (or =#instances) partitions of Kafka topic
End users/clients connect to LB to fetch the message from MSA. Long polling is used here.
Request from client can land to any instance. If it lands to MSA1, it will pull the data from kafka partion1 and if it lands to MSA2, it will pull the data from partition2.
Now, a producer is producing the messages, we dont have high messages count. So, lets say producer produce msg1 and it goes to partition1. End user/client will not get this message unless it's request lands to MSA1, which might not happen always as there are other requests coming to LB.
We want to solve this issue. We want that client gets the message near realtime.
One of the solution can be having a distributed persistent queue (e.g. ActiveMQ) where both MSA1 and MSA2 keep on putting the messages after reading from Kafka and client just fetch the message from queue. But this will cause separate queue for every end-user/client/groupid.
Is this a good solution, can we go ahead with this? Anything that we should change here. We are deploying our system on AWS, so if any AWS managed service can help here e.g. SNS+SQS combination?
Some statistics:
~1000 users, one group id per user
2-4 instances of microservice
long polling every few seconds (~20s)
average message size ~10KB
Broadly you have three possible approaches:
You can dispense with using Kafka's consumer group functionality and allow each instance to consume from all partitions.
You can make the instances of each service aware of each other. For example, an instance which gets a request which can be fulfilled by another instance will forward the request there. This is most effective if the messages can be partitioned by client on the producer end (so that a request from a given client only needs to be routed to an instance). Even then, the consumer group functionality introduces some extra difficulty (rebalances mean that the consumer currently responsible for a given partition might not have seen all the messages in the partition). You may want to implement your own variant of the consumer group coordination protocol, only on rebalance, the instance starts from some suitably early point regardless of where the previous consumer got to.
If you can't reliably partition by client in the producer (e.g. the client is requesting a stream of all messages matching arbitrary criteria) then Kafka is really not going to be a fit and you probably want a database (with all the expense and complexity that implies).

If I use Kafka as simple message. Does it really worth

=== Assume everything from consumer point of view ===
I was reading couple of Kafka articles and I saw that the number of partitions is coupled to number of micro-service instances.... Ex: If I say 1topic 1partition for my serviceA.. Producer pushes message to topicT1, partitionP1, and from consumerSide(ServiceA1) I can read from t1,p1. If I spin new pod(ServiceA2) to have highThroughput then second instance will never receive any message because Kafka/ZooKeeper assigns id to each Consumer and partition1 is already taken by serviceA1. So serviceA2++ stays idle... To avoid such a hassle Kafka recommends to add more partition, so that number of consumers can be increased/decreased based on need.
I was also able to test through commandLine and service2 never consumed any message. If I shut service1 then service2 was able to pick new message... So if I spin more pod then FailSafe/Availability increases but throughput is same always...
Is my assumption is correct. Am I missing anything. Now I feel like any standard messaging will have the same problem...How to extend message-oriented systems itself.
Every topic has a partition, by default it comes with only one partition if you don't define the partition count value. In your case, you have a consumer group that consists of two consumers. Every consumer read the log from the partition. In your case, first consumer read the log from the first partition(we have the only partition), and for second consumer there will be no partition to the consumer the data so it become idle. Once first consumer gets down then only the second consumer starts reading the data from the first partition from the last committed offset.
Please check below blogs and videos. It explains the topic, consumer, and consumer group in kafka.
https://www.javatpoint.com/apache-kafka-consumer-and-consumer-groups
http://cloudurable.com/blog/kafka-architecture-consumers/index.html
https://docs.confluent.io/platform/current/clients/consumer.html
https://www.youtube.com/watch?v=lAdG16KaHLs
I hope this will give you idea about the consumer and consumer group.
A broad solution to this is to decouple consumption of a message (i.e. receiving a message from Kafka and perhaps deserializing it and validating that it conforms to the schema) and processing it (interpreting the message). If the consumption is simple enough, being limited to no more instances consuming than there are partitions need not constrain.
One way to accomplish this is to have a Kafka consumption service which sends an HTTP request (perhaps through a load balancer or whatever) to a processing service which has arbitrarily many members.
Note that depending on what you're using Kafka for, there may be a requirement that certain messages always be in the same partition as one another in order to ensure that they get handled in a deterministic order (since ordering across partitions is not guaranteed). A typical example of this would be if the messages are change events for a particular record. If you're accomplishing this via some hash of the message key (or a portion of the key if using a custom partitioner), then simply changing the number of partitions might not be viable (you would need to introduce some sort of migration or have the producers know which records have to be routed to the old partitions and only route to the new partitions if the record has never been seen before).
We just started replacing messaging with Kafka.
In a traditional MQ there will be a cluster and 1orMQ will be there inside.
So the MQ cluster/co-ordinator service will deliver the message to clients.
Now there can be 10 services/clients which can consume message from single MQ.
So if there are 10 messages in MQ then each service/consumer/client can read/process 1 message
Now this case is not possible in Kafka which I understood now as per design
To achieve similar functionality in Kafka I have add equal or more number of partition as client/consumer/pods.

Graphql subscriptions in a distributed system with Kafka (and spring boot)

I have the following situation:
I have 5 instances of the same service, all in the same kafka consumer group. One of them has a websocket connection to the client (the graphql subscription). I use graphql-java and Spring Boot.
When that connection is opened, I produce events from any of the 5 instances (with a message key defined so they go to the same partition and ordered) and I need for all those events to be consumed by the same instance that opened that connection. Not by the other 4.
Even if the partition assignment plays in my favor, a reassignment can by done at any time, leaving me without luck
My implementation is using reactor-kafka but I think it's just an implementation detail.
The options I see are:
Start to listen on that topic with a new group id each time, so that service always receives the messages from that topic (but the 5 in the other group id too)
Create a new topic for each websocket connection, so only the producer knows that topic (but the topic id should be sent in the kafka events so that the producers of those events know where to publish them)
If I receive the message and I'm not the one with the connection, don't ACK it. But this would make things slow and seems hacky
Start using something different altogether like Redis PubSub to receive all messages in all consumers and check for the connection.
I see there's an implementation for node but I don't see how it is solving the problem.
A similar question explains how to program a subscription but doesn't talk about this distributed thing.
Is the cleanest approach any of the one I suggested? Is there an approach with Kafka that I'm not seeing? Or am I misunderstanding some piece?
I ended up using 1 consumer group id per listener with a topic specifically for those events.

Kafka consumer was not able to poll

If one kafka consumer application is reading message from kafka, another was not able to read and vice-versa.
We are running two independent applications, one will process message and another will read and put into a database.
Message which is been processing in first application is not available in the second application
Without seeing the code I can only guess ... :-)
I think that you have a topic with only one partition and both consumer applications are in the same consumer group. In this case only one consumer gets messages from the only one partition in the topic.
If you want both applications receiving messages from same topic, you need to put them into different consumer groups (group.id parameter for the consumer properties).