There is a Consumer A consuming from Topic A.
Consumer A is going to be deprecated (it belongs to a different team, and they don't handle the use case anymore). It is not possible to simply transfer ownership of the component to our team, because we already have a Consumer B. Consumer B currently listens to Topic B. We want to leverage Consumer B to read from Topic A as well.
So, we want to migrate the business logic in Consumer A to Consumer B.
Before going ahead with the full-fledged migration, we want to validate the logic on a small part of the traffic.
Is there a standard way of migrating business logic from Consumer A to Consumer B?
For that small part of the traffic, is it possible to just process it in Consumer B?
In an ideal world, your business logic code should be separated from the consumer code, because your business logic is most likely going to outlive your broker technology's life span. If you are in that ideal world, you can route a subset of the messages to the new business logic library. For example, if you decide to route 10% of the traffic, route messages whose offset % 10 == 0 to the new logic. If you are not in an ideal world, you can try manual partition assignment and take only those messages arriving on a specific partition.
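The offset-based sampling idea above can be sketched as a small dispatch function. This is illustrative only (the function names are made up, and in a real consumer each branch would call into a separate business-logic library); it just shows the modulo decision:

```python
def route_to_new_logic(offset: int, sample_percent: int = 10) -> bool:
    """Decide, from the record's partition offset alone, whether it
    should go through the migrated (Consumer B) business logic.
    With sample_percent = 10, roughly every 10th record is routed."""
    bucket = 100 // sample_percent  # sample_percent=10 -> every 10th record
    return offset % bucket == 0


def handle(offset: int) -> str:
    # Illustrative dispatch between the old and new code paths.
    return "new-logic" if route_to_new_logic(offset) else "old-logic"
```

Note that offsets are per-partition, so this only approximates a percentage when averaged over many records.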
This is more of a design/architecture question.
We have a microservice A (MSA) with multiple instances (say 2) running behind a load balancer (LB).
The purpose of this microservice is to get messages from a Kafka topic and send them to end users/clients. Both instances use the same consumer group id for a particular client/user so that messages are not duplicated, and the Kafka topic has 2 partitions (or as many partitions as there are instances).
End users/clients connect to the LB to fetch messages from the MSA. Long polling is used here.
A request from a client can land on any instance. If it lands on MSA1, it will pull data from Kafka partition 1, and if it lands on MSA2, it will pull from partition 2.
Now, a producer is producing messages; we don't have a high message count. So let's say the producer produces msg1 and it goes to partition 1. The end user/client will not get this message unless its request lands on MSA1, which might not always happen, as there are other requests coming to the LB.
We want to solve this issue so that the client gets the message in near real time.
One possible solution is a distributed persistent queue (e.g. ActiveMQ) where both MSA1 and MSA2 keep putting the messages they read from Kafka, and the client just fetches messages from that queue. But this would require a separate queue for every end user/client/group id.
Is this a good solution that we can go ahead with? Is there anything we should change here? We are deploying our system on AWS, so is there any AWS managed service that can help here, e.g. an SNS+SQS combination?
Some statistics:
~1000 users, one group id per user
2-4 instances of microservice
long polling every few seconds (~20s)
average message size ~10KB
Broadly you have three possible approaches:
You can dispense with using Kafka's consumer group functionality and allow each instance to consume from all partitions.
You can make the instances of each service aware of each other. For example, an instance which gets a request that can be fulfilled by another instance will forward the request there. This is most effective if the messages can be partitioned by client on the producer end (so that a request from a given client only ever needs to be routed to one instance). Even then, the consumer group functionality introduces some extra difficulty (rebalances mean that the consumer currently responsible for a given partition might not have seen all the messages in the partition). You may want to implement your own variant of the consumer group coordination protocol, except that on rebalance the instance starts from some suitably early point regardless of where the previous consumer got to.
If you can't reliably partition by client in the producer (e.g. the client is requesting a stream of all messages matching arbitrary criteria) then Kafka is really not going to be a fit and you probably want a database (with all the expense and complexity that implies).
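The second approach hinges on a stable client-to-partition mapping that the producer and all MSA instances agree on, so a forwarding decision can be computed locally. A stdlib-only sketch of that mapping (the function names are made up, and md5 stands in for Kafka's actual default partitioner, which uses murmur2 over the record key):

```python
import hashlib


def partition_for_client(client_id: str, num_partitions: int) -> int:
    """Deterministically map a client id to a partition, so the producer
    and every MSA instance compute the same answer."""
    digest = hashlib.md5(client_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def instance_for_request(client_id: str, partition_owners: dict) -> str:
    """Given the current partition -> instance assignment, pick the
    instance that should serve (or receive a forwarded) request."""
    partition = partition_for_client(client_id, len(partition_owners))
    return partition_owners[partition]
```

With `owners = {0: "MSA1", 1: "MSA2"}`, the same client id always resolves to the same instance, which is what makes forwarding (or sticky routing at the LB) feasible.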
I have an API endpoint that accepts events with a specific user ID and some other data. I want those events broadcasted to some external locations and I wanted to explore using Kafka as a solution for that.
I have the following requirements:
Events with the same UserID should be delivered in order to the external locations.
Events should be persisted.
If a single external location is failing, that shouldn't delay delivery to other locations.
Initially, from some reading I did, it felt like I would want N consumers, where N is the number of external locations I want to broadcast to. That should fulfill requirement (3). I also probably want one producer, my API, which will push events to my Kafka cluster. Requirement (2) should come automatically with Kafka.
I was more confused about how to model the Kafka cluster side of things. Again, from the reading I did, it sounds like it's bad practice to have millions of topics, so having a single topic per userID is not an option. The other option I read about is having one partition per userID (let's say M partitions). That would allow requirement (1) to be met out of the box, if I understand correctly. But would that also mean I need M brokers? That also sounds unreasonable.
What would be the best way to fulfill all requirements? As a start, I plan on hosting this with a local Kafka cluster.
You are correct that one topic per user is not ideal.
Partition count does not depend on broker count, so partitioning by userID is the better design.
If a single external location is failing, that shouldn't delay delivery to other locations.
This is standard consumer-group behavior, not topic/partition design.
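Requirement (1) is usually met by producing every event with the UserID as the record key: all events for the same key land on the same partition and are therefore consumed in order. A stdlib-only simulation of that behavior (sha1 stands in for Kafka's actual default partitioner, which uses murmur2 over the serialized key; the function names are made up):

```python
import hashlib


def partition_for_key(user_id: str, num_partitions: int) -> int:
    """Events with the same key always map to the same partition,
    which is what preserves per-user ordering."""
    digest = hashlib.sha1(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def assign(events, num_partitions):
    """Simulate appending keyed events to partitions in produce order."""
    partitions = {p: [] for p in range(num_partitions)}
    for user_id, payload in events:
        partitions[partition_for_key(user_id, num_partitions)].append(payload)
    return partitions
```

Because appends happen in produce order and one key never spans two partitions, any single user's events keep their relative order, regardless of how many partitions (far fewer than users) the topic has.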
Avro-encoded messages on a single Kafka topic with a single partition. Each of these messages is to be consumed by a specific consumer only. For example, with messages a1, a2, b1 and c1 on this topic and three consumers named A, B and C, each consumer would get all the messages, but ultimately A should consume only a1 and a2, B only b1, and C only c1.
I want to know how this is typically solved when using Avro on Kafka:
leave it to the consumers to deserialize each message, then use some application logic to decide whether to consume or drop the message
use partitioning logic to make each message go to a particular partition, then set up each consumer to listen to only that single partition
set up another 3 topics and a tiny Kafka Streams application that does the filtering + routing from the main topic into these 3 specific topics
make use of Kafka headers to inject an identifier for downstream consumers to filter on
It looks like each of the options has its pros and cons. I want to know if there is a convention that people follow, or some other way of solving this.
It depends...
If you only have a single-partition topic, the only option is to let each consumer read all the data and filter, client side, for the data it is interested in. In this case, each consumer would need to use a different group.id to isolate the consumers from each other.
Option 2 is certainly possible, if you can control the input topic you are reading from. You might still use different group.ids for each consumer, as it seems the consumers represent different applications that should be isolated from each other. The question remains whether this is a good model: the idea of partitions is to provide horizontal scale-out and data-parallel processing, so if each application reads from only one partition, it does not really align with this model. You also need to know which data goes into which partition, on both the producer side and the consumer side, to get the mapping right. Hence, it implies a coordination between producer and consumer that seems undesirable.
Option 3 seems to indicate that you cannot control the input topic and thus want to branch the data into multiple topics. This is a good approach in general, as topics are a logical categorization of data; however, it would be even better to have three topics for the different data to begin with. If you cannot have three input topics from the start, Option 3 still gives a good conceptual setup, but it won't provide much performance benefit, because the Kafka Streams application is required to read and write each record once. The saving is that each application then consumes from only one topic, so redundant reads are avoided. If you had, let's say, 100 applications (each interested in only 1/100 of the data), you could cut the load significantly, from a 99x read overhead down to a 1x read and 1x write overhead. For your case you don't really cut down much, as you go from a 2x read overhead to a 1x read plus 1x write overhead. Additionally, you need to manage the Kafka Streams application itself.
Option 4 is orthogonal to the others, because it answers the question of how the filtering works; headers can be used with Option 1 and Option 3 to do the actual filtering/branching.
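The header-based filtering of Option 4 combined with the client-side filtering of Option 1 can be sketched with plain data structures (the record shape and the `target` header name are made up for illustration; real records and headers would come from the consumer's poll loop):

```python
def filter_for_consumer(records, consumer_name):
    """Client-side filtering: keep only the records whose 'target'
    header names this consumer, and drop the rest without ever
    deserializing the (Avro) payload."""
    return [payload for headers, payload in records
            if headers.get("target") == consumer_name]
```

This is the main practical advantage of headers here: the routing decision never has to touch the Avro payload, so dropped messages cost no deserialization.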
The data in the topic is just bytes, so Avro shouldn't matter here.
Since you only have one partition, only one consumer of a given group can actively read the data.
If you only want to process certain offsets, you must either seek to them manually or skip over messages in your poll loop and commit those offsets.
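The skip-in-the-poll-loop pattern can be sketched without a broker (the function is illustrative; in a real consumer, `records` would come from `poll()` and the returned offset would be passed to a commit call):

```python
def process_selected(records, wanted_offsets):
    """Walk every record a poll returned, process only the wanted
    offsets, and track the offset to commit. Kafka commits the *next*
    offset to read, i.e. last seen offset + 1."""
    processed = []
    commit_offset = None
    for offset, value in records:
        if offset in wanted_offsets:
            processed.append(value)
        commit_offset = offset + 1  # advance past skipped records too
    return processed, commit_offset
```

Committing past the skipped records is what prevents them from being redelivered after a restart or rebalance.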
We are working on an application in which we are using Kafka.
The components of the Application are as follows,
We have a microservice which gets the requests and pushes messages to a Kafka topic. (Let's call it ServiceA.)
Another microservice consumes the messages from the topic and pushes the data to a datastore. (Let's call it ServiceB.)
I am clear about the ServiceA part of the application but have some design confusion about the ServiceB part.
For ServiceB we are planning a REST API.
Is it good to bundle the consumer and controllers in a single application?
For the consumer I am planning to go with a consumer group with multiple consumers to achieve more throughput. Is there any better and more efficient approach?
Should I take the consumer part out of ServiceB and make it a separate, independent service?
If we are bundling it inside ServiceB, should I configure the consumer as a listener? (We are going with Spring Boot for the microservices.)
Thanks in advance!
Is it good to bundle the consumer and controllers in a single application?
It's good to bundle things together by context; having a listener which just forwards to another service for control makes no sense in my opinion. But consider splitting up controllers by context if necessary. As Martin Fowler says: start with a monolith first and then split up (https://martinfowler.com/bliki/MonolithFirst.html).
For the consumer I am planning to go with a consumer group with multiple consumers to achieve more throughput. Is there any better and more efficient approach?
A consumer group makes sense if you are thinking about scaling your ServiceB out. If you want to keep that possibility open for the future, start with one instance of ServiceB inside the consumer group. If you use something like Kubernetes, it is simple to deploy more instances of your service later if required. But do not invest too much in an eventual future: start simple, do some monitoring, and if you identify bottlenecks, then act. One more thing to keep in mind is that Kafka by default keeps messages for a long time (7 days by default), so if you think in a classical message-broker style you could see a lot of duplicates of your messages. Think, for example, of an update message raised when something changes, which is produced whenever your ServiceA starts. Maybe reducing retention.ms would be an option, but take care not to lose messages.
Should I take the consumer part out of ServiceB and make it a separate, independent service?
No, I think not.
If we are bundling it inside ServiceB, should I configure the consumer as a listener? (We are going with Spring Boot for the microservices.)
Yes :-)
I am new to Kafka and I am trying to build a multiple-producer/multiple-subscriber setup.
Let's say there are N producers called P1, P2, P3... and M consumers C1, C2, C3...
Now C1 needs to subscribe to P1 and P2, and at some point it also needs to subscribe to P3. Hence C1 has a dynamic list of topics it needs to subscribe to.
I was hoping this could be achieved using the high-level consumer, where we can name our consumer group and Kafka will store the offset up to which we have read. But then I noticed that we also need to give the topic names while creating the high-level consumer. In my case there are around 1000 topics I need to subscribe to, and this list is dynamically updated.
Is there a way for the Kafka high-level consumer to remember the topics it has subscribed to and listen to them when brought up, rather than our providing the names of all the topics it subscribed to in the past?
I don't think the Kafka architecture you outlined would work. The main issue, given that a Kafka topic is a point of asynchrony between producers and consumers, is that you cannot do a clean-cut switch with your "dynamic list of topics you need to subscribe to" (as you put it), since some messages will presumably always still be in "the queue".
Besides that, it's not exactly trivial to dynamically change the topic (and partition) in consumer clients. AFAIK Kafka is not meant to be used this way.
A better option would be to use a special message field that would tell your consumer clients whether the message is for them or not.
So you can use dedicated topics for messages that don't require this dynamic nature (in order to avoid doing this check for all messages, if possible) and a separate topic where you'd mix all messages that do require it.
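The split above can be sketched as one decision function (the topic name "mixed", the "recipients" field, and the consumer ids are made up for illustration; a real message field would live in the serialized payload):

```python
def accepts(message, consumer_topics, consumer_id):
    """Decide whether a consumer should process a message: messages on
    one of its dedicated topics are always consumed, while messages on
    the shared 'mixed' topic carry a 'recipients' field that each
    consumer checks client side."""
    topic = message["topic"]
    if topic in consumer_topics:
        return True
    if topic == "mixed":
        return consumer_id in message.get("recipients", [])
    return False
```

This keeps the per-message check confined to the one topic that actually needs the dynamic routing, which is the point of splitting the traffic in the first place.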