If a service produces an event to one topic, only that service should consume the processed event from another topic (Kafka) - apache-kafka

I have to implement services in an event-driven architecture with Kafka (Java tech stack).
I drew an example:
Imagine that I have 3 external producers (Ms1, Ms2, Ms3) that send events to one topic, which my service reads. After receiving an event, my service runs some business logic and then pushes an event to another topic. Ms1, Ms2 and Ms3 subscribe to this topic and listen for what comes in. My goal is: if Ms1 sent an event to topic-1, only Ms1 must receive the response event from topic-2 (even though other consumers are listening to this topic too, they must not receive events that belong to Ms1). If Ms2 sent an event to topic-1, then only Ms2 must receive the event from topic-2.
And I don't know how many consumers/producers there will be. The number is not fixed: today it can be 3 external producers/consumers, tomorrow maybe 30, and so on. They can subscribe and unsubscribe.

Kafka records shouldn't "belong" to particular services, IMO. Which service produced a record is mostly metadata about data lineage; maybe that information will be valuable for some other consumer use case that you've not considered yet.
If you have multiple consumers on one topic, there's no logic outside of filtering and explicit partition assignments that would get "all Ms1 producer events to all Ms1 consumers".
If you want to lock down access to topics to particular clients, use ACLs and certificates. Otherwise, there's nothing stopping new consumer groups from subscribing to whatever topics they want.
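The filtering option mentioned above could look roughly like the following. This is a plain-Java model with no Kafka client involved; the `Event` class, its `originId` field and `ReplyFilter` are hypothetical names introduced only for illustration. In a real consumer the origin would more likely travel in a record header or in the payload, and every service would still receive the other services' records before discarding them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reply event: originId names the service that sent the request.
final class Event {
    final String originId;
    final String payload;

    Event(String originId, String payload) {
        this.originId = originId;
        this.payload = payload;
    }
}

// Client-side filter: keeps only the replies addressed to one service.
final class ReplyFilter {
    static List<Event> forOrigin(List<Event> polled, String myId) {
        List<Event> mine = new ArrayList<>();
        for (Event e : polled) {
            if (myId.equals(e.originId)) {
                mine.add(e);
            }
        }
        return mine;
    }

    public static void main(String[] args) {
        // Stand-in for one batch polled from topic-2.
        List<Event> topic2 = List.of(
                new Event("Ms1", "reply-a"),
                new Event("Ms2", "reply-b"),
                new Event("Ms1", "reply-c"));
        for (Event e : forOrigin(topic2, "Ms1")) {
            System.out.println(e.originId + " " + e.payload);
        }
    }
}
```

Filtering like this only hides records by convention. If other services must not be able to read them at all, the ACL route is the one that applies.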

Related

Why is Kafka called pub-sub and can we read randomly from an offset in Kafka

I have been reading about Kafka for weeks now but still have some doubts that I was not able to resolve by going through multiple resources. Sorry if these are lame questions.
Kafka is a pub-sub system, but the consumer pulls data from the Kafka broker. I have read that pulling is better than pushing (with some cons), but if we are pulling the data, why do we call it a pub-sub system? Kafka is not notifying the consumers who have subscribed; rather, the consumer is pulling the data explicitly. (Some resources say it is called pub-sub because data is not deleted after a consumer reads it, which is what happens in a queue. However, the pub-sub name is still confusing to me.)
If the consumer is pulling data from the broker, then I understand that the consumer needs to commit its offset to the broker (delivery semantics). But then why do we say that in Kafka we can start reading from whichever offset we want? The broker keeps track of the consumer offset, so to resume reading from a random offset, do I need to provide another offset in the API, and will the offset then be reset at the broker's end? How does this work?
Kafka is a distributed log. The servers are called brokers, and the clients are producers and consumers. Calling it "pub sub" tells developers what group of problems/applications it can solve/support. It doesn't describe how it's different from other systems in the same group... More importantly, I don't think "pub sub" is ever written in the official documentation.
"it is called pub-sub because data is not deleted after a consumer reads it"
If data is deleted, that describes a non-persistent queue, not a pub-sub system.
The main distinction is that there are "Publishers" and "Subscribers" that are not communicating point-to-point. It doesn't matter if the subscription mechanism is push or pull based. From Wikipedia -
In software architecture, publish–subscribe is a messaging pattern where senders of messages, called publishers, do not program the messages to be sent directly to specific receivers, called subscribers, but instead categorize published messages into classes without knowledge of which subscribers, if any, there may be. Similarly, subscribers express interest in one or more classes and only receive messages that are of interest, without knowledge of which publishers, if any, there are.
So, Kafka producers write to ("categorized") topics located on brokers, rather than directly to the consumers ("specific receivers, called subscribers"). Consumers can start reading from topics that don't (yet) exist. And Producers can send data to topics that have no consumer(s).
Back to your question -
"to resume reading from a random offset do I need to provide another offset in the API and will the new offset will be reset at the broker's end or how this will happen."
First, consumers aren't required to commit offsets to a consumer group. For any new or expired group, the auto.offset.reset config determines the start position; otherwise, the committed offset for the assigned group/topic/partition combination is used. All of this happens before the consumer gets the chance to seek individual partitions to arbitrary offsets.
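The start-position rule described above can be modelled in a few lines. This is a simplified sketch in plain Java, not Kafka client code: `StartPosition` and its parameters are hypothetical names, and cases such as a committed offset that has fallen out of range are ignored. In the real client, `KafkaConsumer.seek(TopicPartition, long)` then moves only the consumer's own fetch position; the offset stored for the group changes only when the consumer commits.

```java
import java.util.OptionalLong;

// Simplified model of how a consumer's initial position is chosen.
final class StartPosition {
    // committed: offset stored for this group/topic/partition, if any.
    // reset: value of auto.offset.reset ("earliest", "latest" or "none").
    // beginning/end: first offset and next-to-be-written offset of the partition.
    static long resolve(OptionalLong committed, String reset, long beginning, long end) {
        if (committed.isPresent()) {
            return committed.getAsLong(); // resume where the group left off
        }
        switch (reset) {
            case "earliest":
                return beginning;
            case "latest":
                return end;
            default:
                // "none": the model fails when no offset was committed
                throw new IllegalStateException(
                        "no committed offset and auto.offset.reset=" + reset);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve(OptionalLong.of(42), "latest", 0, 100));
        System.out.println(resolve(OptionalLong.empty(), "earliest", 0, 100));
        System.out.println(resolve(OptionalLong.empty(), "latest", 0, 100));
    }
}
```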

Kafka with multiple instances of microservices and end-users

This is more of a design/architecture question.
We have a microservice A (MSA) with multiple instances (say 2) running behind a load balancer (LB).
The purpose of this microservice is to get messages from a Kafka topic and send them to end users/clients. Both instances use the same consumer group id for a particular client/user so that messages are not duplicated. The Kafka topic has 2 partitions (one per instance).
End users/clients connect to the LB to fetch messages from MSA. Long polling is used here.
A request from a client can land on any instance. If it lands on MSA1, it will pull the data from partition1, and if it lands on MSA2, it will pull the data from partition2.
Now a producer is producing messages; we don't have a high message count. So, let's say the producer produces msg1 and it goes to partition1. The end user/client will not get this message unless its request lands on MSA1, which might not always happen, as there are other requests coming to the LB.
We want to solve this issue: the client should get the message in near real time.
One solution could be a distributed persistent queue (e.g. ActiveMQ) into which both MSA1 and MSA2 keep putting messages after reading them from Kafka, so the client just fetches messages from the queue. But this would require a separate queue for every end user/client/group id.
Is this a good solution? Can we go ahead with this, or is there anything we should change? We are deploying our system on AWS, so could any AWS managed service help here, e.g. an SNS+SQS combination?
Some statistics:
~1000 users, one group id per user
2-4 instances of microservice
long polling every few seconds (~20s)
average message size ~10KB
Broadly you have three possible approaches:
You can dispense with using Kafka's consumer group functionality and allow each instance to consume from all partitions.
You can make the instances of each service aware of each other. For example, an instance which gets a request that can be fulfilled by another instance will forward the request there. This is most effective if the messages can be partitioned by client on the producer end (so that a request from a given client only needs to be routed to one instance). Even then, the consumer group functionality introduces some extra difficulty (rebalances mean that the consumer currently responsible for a given partition might not have seen all the messages in the partition). You may want to implement your own variant of the consumer group coordination protocol in which, on rebalance, the instance starts from some suitably early point regardless of where the previous consumer got to.
If you can't reliably partition by client in the producer (e.g. the client is requesting a stream of all messages matching arbitrary criteria) then Kafka is really not going to be a fit and you probably want a database (with all the expense and complexity that implies).

ActiveMQ Artemis JMS Shared Subscription

I have a single-node ActiveMQ instance with two competing consumers connected to a topic. The topic subscription is shared as per the JMS 2.0 specification. A shared subscription does guarantee that only one of the subscribers (using the same subscription name) gets each message. But what I noticed is that it does not guarantee that the second message is delivered only after the first one is acknowledged. If the first consumer takes time to acknowledge the message, the second message is delivered to the free consumer even before the first consumer has sent its acknowledgement to the broker. Is this standard behaviour? And is there a way to stop the broker from delivering the second message before the first one is acknowledged?
ActiveMQ Artemis supports exclusive queues. They are special queues which route all messages to only one consumer at a time.
Obviously, exclusive queues have the drawback that you cannot scale out the consumers to improve consumption, as only one consumer would technically be active.
However, I would suggest taking a look at message grouping to scale out your solution. Message groups are useful when you want all messages with a certain value of a property to be processed serially by the same consumer, without stopping the delivery of messages with different values of the property to other consumers.

How to process events which are out of order using Kafka Streams

I have an application where events are sent to a Kafka topic based on user actions such as User Login, the user's Intermediate actions (optional) and User Logout. Each event carries some information in an event object along with the userId; for example, a Login event has loginTime, an Add Note event (an Intermediate action) has notes, and a Logout event has logoutTime. The requirement is to aggregate the information from all these events into one object after receiving the Logout event for each user and send it downstream.
For various reasons (network delay, multiple event producers) events may not arrive in order (the User Logout event may come before an Intermediate event). So the question is how to handle such scenarios. I cannot simply wait for Intermediate events after receiving the User Logout event, since Intermediate events are optional and depend on the user's actions.
The only option I can think of is to wait for some time after receiving the User Logout event, process any Intermediate events received within that wait time, and then send the processed event, but I am not sure how to achieve this.
Kafka does not guarantee order across a topic; it guarantees order within a partition. One topic can have more than one partition, and the partitions are spread across the consumers in a group, which is how Kafka achieves scalability. So what you are experiencing is normal behavior (it isn't a bug, and it isn't necessarily related to network delay or anything like that).
What you can do is make sure that all messages that you want to process in order are sent to the same partition. You can do that by setting the number of partitions to 1, which is the crudest way. When you send a message with a producer, by default Kafka looks at the key, hashes it, and uses that hash to decide which partition the message goes to. So you can make sure that all messages that must stay in order share the same key: the hashes will then be equal and all those messages will go to the same partition. You can also implement a custom partitioner and override the default way Kafka chooses the partition for a message. Either way, the messages will be consumed in the order in which they were written to the partition.
If you cannot do any of this, then you will receive events out of order and you will have to think about a way to consume them out of order, but that is not a question related to Kafka.
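To illustrate why equal keys end up in the same partition, here is a dependency-free sketch. It is only a model: Kafka's default partitioner hashes the serialized key with murmur2, whereas `String.hashCode()` is used here as a stand-in, and `KeyPartitioning` is a hypothetical name.

```java
// Model of key-based partition selection (not Kafka's actual hash function).
final class KeyPartitioning {
    // Maps a key to a partition in the range [0, numPartitions).
    // floorMod keeps the result non-negative even for negative hash codes.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int partitions = 6;
        // The same key always yields the same partition, so per-key order is kept.
        System.out.println(partitionFor("user-42", partitions));
        System.out.println(partitionFor("user-42", partitions));
        // A different key may or may not share that partition.
        System.out.println(partitionFor("user-7", partitions));
    }
}
```

The mapping depends on the partition count, so adding partitions to an existing topic changes where new records for a key are written.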
If you are not able to preserve the order of events (so that Logout is the last event), you can achieve your requirements using the Processor API from Kafka Streams. The Kafka Streams DSL can be combined with the Processor API (more details here).
You can have several partitions, but all events for a particular user have to be sent to the same partition.
You have to implement a custom Processor/Transformer.
Your processor will put each event/activity in a state store (aggregating all events from a particular user under the same key).
The Processor API gives you the ability to create a kind of scheduler (a Punctuator).
You can schedule a check every X seconds over the events for each user. If the Logout was long enough ago, you take all of that user's events/activities, aggregate them and send the result downstream.
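The buffering and scheduled check described above can be sketched without any Kafka Streams types. Everything here is an assumption made for illustration: `SessionBuffer`, the in-memory maps standing in for the state store, and the `punctuate` method standing in for the Punctuator callback. Events are reduced to their type names to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Plain-Java model of the processor: buffer per user, flush some time after Logout.
final class SessionBuffer {
    private final long graceMs;
    // Stand-in for the state store: events collected so far, per user.
    private final Map<String, List<String>> eventsByUser = new HashMap<>();
    // Time at which each user's Logout event was seen.
    private final Map<String, Long> logoutSeenAt = new HashMap<>();

    SessionBuffer(long graceMs) {
        this.graceMs = graceMs;
    }

    // Called for every incoming event, in whatever order it arrives.
    void process(String userId, String eventType, long nowMs) {
        eventsByUser.computeIfAbsent(userId, k -> new ArrayList<>()).add(eventType);
        if ("Logout".equals(eventType)) {
            logoutSeenAt.put(userId, nowMs);
        }
    }

    // Stand-in for the scheduled check: returns and removes the buffered events
    // of every user whose Logout was seen at least graceMs ago.
    Map<String, List<String>> punctuate(long nowMs) {
        Map<String, List<String>> ready = new HashMap<>();
        Iterator<Map.Entry<String, Long>> it = logoutSeenAt.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMs - entry.getValue() >= graceMs) {
                ready.put(entry.getKey(), eventsByUser.remove(entry.getKey()));
                it.remove();
            }
        }
        return ready;
    }

    public static void main(String[] args) {
        SessionBuffer buffer = new SessionBuffer(1_000);
        buffer.process("u1", "Login", 0);
        buffer.process("u1", "Logout", 100);
        buffer.process("u1", "AddNote", 300); // late Intermediate event
        System.out.println(buffer.punctuate(500));   // still within the grace period
        System.out.println(buffer.punctuate(1_100)); // u1 is flushed with all three events
    }
}
```

An event that arrives after its user has already been flushed would start a new buffer entry in this model, so a real implementation needs a policy for such late arrivals.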
As said in other answers, in Kafka order is maintained on a per-partition basis.
Since you are talking about user events, why don't you use the UserID as your Kafka record key? That way, all events related to a specific user will always be ordered (provided they are produced by a single producer).
You should ensure (by design) that only one Kafka producer pushes all the user change events to the given topic. In this way, you can avoid out-of-order messages due to multiple producers.
From Streams, you might also want to look at windows in Kafka Streams. Tumbling windows, for example, are non-overlapping and of fixed size. You aggregate records over a period of time.
You may then want to sort the aggregated records by their timestamp (you said you have logout time, login time, etc.) and act accordingly.
Simple and effective solution
Use synchronous send and set delivery.timeout.ms and retries to a maximum value.
To ensure fault tolerance set acks=all with min.insync.replicas=2 (topic configuration) and use a single producer to push to that topic.
You should also set max.block.ms to some max value so that your send() does not return immediately if there is an error in fetching the metadata (for example, when Kafka is down).
Benchmark the synchronous send with your rate and check to see if it meets your requirements or benchmark number.
This ensures that a message that came first is sent first to Kafka and then the next message is not sent until the previous message is successfully acknowledged.
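The producer settings listed above could be collected as follows. The property keys are standard Kafka producer configuration names; the class name, the method and the idea of passing `maxBlockMs` in as a parameter are assumptions of this sketch. `min.insync.replicas=2` is not set here because it belongs to the topic (or broker) configuration.

```java
import java.util.Properties;

// Sketch of the producer settings from the answer, built with java.util.Properties only.
final class OrderedProducerConfig {
    static Properties build(String bootstrapServers, long maxBlockMs) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        // Wait for all in-sync replicas to acknowledge each record.
        props.put("acks", "all");
        // Retry for as long as the client allows.
        props.put("retries", Integer.toString(Integer.MAX_VALUE));
        props.put("delivery.timeout.ms", Integer.toString(Integer.MAX_VALUE));
        // How long send() may block, for example while metadata is unavailable.
        props.put("max.block.ms", Long.toString(maxBlockMs));
        return props;
    }

    public static void main(String[] args) {
        // "localhost:9092" and 60000 are placeholder values, not recommendations.
        build("localhost:9092", 60_000L).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```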
If your benchmark figure is not met, try adding a back-pressure mechanism such as an in-memory/persistent queue.
Add event to a queue in Thread-1
Peek (not dequeue) event from the queue in Thread-2
Call producer.send(...).get() in Thread-2
Dequeue the event in Thread-2
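The four steps above can be sketched like this. `BlockingSender` is a hypothetical interface standing in for a blocking call such as `producer.send(...).get()`, and `OrderedForwarder` is likewise an illustrative name; the in-memory queue here would lose its contents on a crash, which is where the persistent variant comes in.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical stand-in for a synchronous producer call.
interface BlockingSender {
    void sendAndWait(String event) throws Exception;
}

final class OrderedForwarder {
    private final Queue<String> queue = new ConcurrentLinkedQueue<>();
    private final BlockingSender sender;

    OrderedForwarder(BlockingSender sender) {
        this.sender = sender;
    }

    // Thread-1: accept events in arrival order.
    void enqueue(String event) {
        queue.add(event);
    }

    // Thread-2: forward queued events one at a time and return how many were sent.
    // An event is removed only after its send succeeded, so a failed send
    // leaves it at the head of the queue for the next attempt.
    int drain() {
        int sent = 0;
        String head;
        while ((head = queue.peek()) != null) {
            try {
                sender.sendAndWait(head);
            } catch (Exception e) {
                break; // keep the event queued and stop for now
            }
            queue.poll();
            sent++;
        }
        return sent;
    }

    int pending() {
        return queue.size();
    }

    public static void main(String[] args) {
        OrderedForwarder forwarder =
                new OrderedForwarder(event -> System.out.println("sent " + event));
        forwarder.enqueue("login");
        forwarder.enqueue("logout");
        forwarder.drain();
    }
}
```

This assumes a single draining thread. With several, two threads could peek the same head and send it twice.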
The key is to make your frontend tracker send ordered events to the backend service, which then produces the events to Kafka.
You can achieve that by batching the events, and sending the batched events to the backend only after the previous batched events are successfully delivered.

GraphQL subscriptions in a distributed system with Kafka (and Spring Boot)

I have the following situation:
I have 5 instances of the same service, all in the same Kafka consumer group. One of them has a websocket connection to the client (the GraphQL subscription). I use graphql-java and Spring Boot.
When that connection is opened, I produce events from any of the 5 instances (with a message key defined so that they go to the same partition and stay ordered), and I need all those events to be consumed by the same instance that opened that connection, not by the other 4.
Even if the partition assignment plays in my favor, a reassignment can be done at any time, leaving me out of luck.
My implementation is using reactor-kafka but I think it's just an implementation detail.
The options I see are:
Start to listen on that topic with a new group id each time, so that service always receives the messages from that topic (but the 5 in the other group id too)
Create a new topic for each websocket connection, so only the producer knows that topic (but the topic id should be sent in the kafka events so that the producers of those events know where to publish them)
If I receive the message and I'm not the one with the connection, don't ACK it. But this would make things slow and seems hacky
Start using something different altogether like Redis PubSub to receive all messages in all consumers and check for the connection.
I see there's an implementation for node but I don't see how it is solving the problem.
A similar question explains how to program a subscription but doesn't talk about this distributed thing.
Is the cleanest approach one of those I suggested? Is there an approach with Kafka that I'm not seeing? Or am I misunderstanding some piece?
I ended up using one consumer group id per listener, with a topic specifically for those events.
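A rough idea of what "one consumer group id per listener" might look like as consumer configuration. The keys `group.id` and `auto.offset.reset` are standard Kafka consumer settings; the class name, the `listener-` prefix and the choice of `latest` (on the assumption that a freshly opened connection only cares about new events) are illustrative and not taken from the original setup.

```java
import java.util.Properties;

// Sketch: each listener gets its own consumer group, so it sees every record of the topic.
final class ListenerConsumerConfig {
    static Properties forListener(String bootstrapServers, String listenerId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        // A distinct group id per listener means no partition sharing with other listeners.
        props.put("group.id", "listener-" + listenerId);
        // Assumption: a new listener starts at the end of the topic.
        props.put("auto.offset.reset", "latest");
        return props;
    }

    public static void main(String[] args) {
        // Placeholder address and listener id.
        System.out.println(forListener("localhost:9092", "ws-1").getProperty("group.id"));
    }
}
```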