Kafka Best Way to Filter Messages - apache-kafka

One application (the producer) publishes messages that are consumed by another application with multiple consumers. The producer sends data with a country field, and each consumer in our application will subscribe to a specific country.
From what I have read so far, there are two approaches to filtering messages:
1. Filter data on the consumer side: the producer adds the country to the message header. Each consumer receives all the data and keeps only the country it needs by checking the header. I am not sure whether we can or should have multiple consumers, each filtering on a different country, or just one consumer that filters on the list of countries, with us doing the aggregation by country ourselves.
2. One topic with a separate partition per country: a custom partitioner on the producer sends each message to a specific partition, and each consumer is directed to the right partition to consume that country's messages.
My question is: should we choose option 1 or 2? We expect to receive hundreds of messages every few seconds.

In my experience typically the first approach is used.
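A minimal sketch of the first approach (consumer-side filtering on a header). It deliberately avoids the Kafka client library so it runs on its own: the `Message` class is a simplified stand-in for a consumed record, and the header name `country` is an assumption. In a real consumer the value would come from the record's headers (for example `ConsumerRecord.headers().lastHeader("country")`).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CountryFilter {
    // Assumed header name; the question only says "country in message header".
    static final String COUNTRY_HEADER = "country";

    /** Simplified stand-in for a consumed record: string headers plus a payload. */
    static final class Message {
        final Map<String, String> headers;
        final String payload;

        Message(Map<String, String> headers, String payload) {
            this.headers = headers;
            this.payload = payload;
        }
    }

    /** Keeps only the messages whose country header is in the wanted set. */
    static List<Message> filter(List<Message> batch, Set<String> wanted) {
        List<Message> kept = new ArrayList<>();
        for (Message m : batch) {
            String country = m.headers.get(COUNTRY_HEADER);
            // Messages without the header are skipped rather than treated as a match.
            if (country != null && wanted.contains(country)) {
                kept.add(m);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Message> batch = List.of(
            new Message(Map.of(COUNTRY_HEADER, "DE"), "order-1"),
            new Message(Map.of(COUNTRY_HEADER, "FR"), "order-2"),
            new Message(Map.of(), "order-3"));
        for (Message m : filter(batch, Set.of("DE"))) {
            System.out.println(m.payload);
        }
    }
}
```

Passing a set of countries covers both variants from the question: one consumer per country uses a single-element set, while a single consumer handling several countries passes the whole list.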
The second option is problematic. What if you add a new country? You will need to add a partition to the topic, which is possible but not straightforward. You will also need to change the logic on both the producer and the consumer side. If consumers simply subscribe to the topic, then on failure the partitions are automatically reassigned to the remaining live consumers in the consumer group. If you pin each consumer to a specific partition instead, you will need to handle failures in your own code.
Another approach is to have a topic per country.
One more approach is to publish all the data to a single topic and then use a Kafka Streams application to distribute it to other topics (one per consumer). If the requirements change, you change the implementation of the Kafka Streams app.

Related

As kafka publisher how can I differentiate consumer groups for different set of data for the same topic?

I have a Kafka publisher that sends data to a topic, and two consumer groups consuming data from that topic. I want to send some common data for both consumer groups to consume, but also some data specific to each individual consumer group. Can this be achieved? If yes, how?
I believe you would need to use two separate topics for that kind of scenario.
Or, you could send all to both and filter out the unneeded records for one of the consumers.
See Filtering Records (if your consumers are Spring consumers).

One to One and Group Messaging using Kafka

As Kafka has a topic based pub-sub architecture how can I handle One-to-One and Group Messaging part of web application using Kafka?
I am using SpringBoot+Angular stack and Docker Kafka server.
I'll write another answer here.
Based on my experience with a chat service, you only need one topic for all the messages, with a well-designed message body:
public class Message {
    private String from; // sender's user id
    private String to;   // recipient's user id, or a group id
}
Then you can create, say, 100 partitions for this topic and start with two consumers (50 partitions each).
Then, if your system hits a bottleneck, you can easily scale out by adding X more consumers to handle the load.
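To make the arithmetic concrete, here is a small sketch of how a fixed set of partitions can be spread over a growing number of consumers in one group. It is only an illustration in the spirit of a range-style assignment, not Kafka's actual assignor code, and the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionSplit {
    /**
     * Spreads partitions 0..numPartitions-1 over numConsumers in contiguous
     * ranges, as evenly as possible. When the division is not exact, the
     * first (numPartitions % numConsumers) consumers get one extra partition.
     */
    static List<List<Integer>> assign(int numPartitions, int numConsumers) {
        List<List<Integer>> result = new ArrayList<>();
        int base = numPartitions / numConsumers;
        int extra = numPartitions % numConsumers;
        int next = 0;
        for (int c = 0; c < numConsumers; c++) {
            int count = base + (c < extra ? 1 : 0);
            List<Integer> owned = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                owned.add(next++);
            }
            result.add(owned);
        }
        return result;
    }

    public static void main(String[] args) {
        for (int consumers : new int[] {2, 4, 8}) {
            StringBuilder sizes = new StringBuilder();
            for (List<Integer> owned : assign(100, consumers)) {
                sizes.append(owned.size()).append(' ');
            }
            System.out.println(consumers + " consumers: " + sizes.toString().trim());
        }
    }
}
```

With 100 partitions, two consumers own 50 each and four own 25 each. With eight, the first four own 13 and the rest own 12. Adding more consumers than partitions would leave the extra consumers idle, which is why the partition count is chosen generously up front.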
As for how to distribute the messages inside the consumer: I was sending messages to a mobile app, so every app instance held a long-lived connection to the server, and the server pushed messages to the app over that channel. For group chat, I kept a Redis cache of the active users in each group, so I could easily look up the users who belong to a group and send them the messages.
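A rough sketch of that routing step, assuming the `from`/`to` fields of the message body above. An in-memory map stands in for the Redis cache of active group members, and two details are assumptions of this example rather than something the answer states: a `to` value that matches no known group is treated as a user id, and the sender is excluded from a group delivery.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ChatRouter {
    // Stand-in for the Redis cache: group id -> active user ids (insertion order kept).
    private final Map<String, Set<String>> activeMembers = new HashMap<>();

    /** Marks a user as active in a group. */
    void join(String groupId, String userId) {
        activeMembers.computeIfAbsent(groupId, g -> new LinkedHashSet<>()).add(userId);
    }

    /** Resolves the "to" field of a message to the user ids that should receive it. */
    List<String> recipients(String from, String to) {
        Set<String> members = activeMembers.get(to);
        if (members == null) {
            // Assumption: not a known group, so "to" is a single user id (one-to-one).
            return List.of(to);
        }
        List<String> out = new ArrayList<>();
        for (String user : members) {
            if (!user.equals(from)) { // assumption: do not echo back to the sender
                out.add(user);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        ChatRouter router = new ChatRouter();
        router.join("g1", "alice");
        router.join("g1", "bob");
        router.join("g1", "carol");
        System.out.println(router.recipients("alice", "g1"));  // group message
        System.out.println(router.recipients("alice", "bob")); // one-to-one message
    }
}
```

The consumer would then push the message over each recipient's open connection. Kafka itself only carries the messages and knows nothing about groups.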
One more thing: Kafka should stay decoupled from your business logic. It should act only as a messaging system that transfers messages. If you tie your business logic to Kafka, for example by creating a topic for each one-to-one conversation and deleting it when the conversation ends, Kafka will become very messy.
By One-to-One, I suppose you mean one producer and one consumer, i.e. using it as a queue.
This is certainly possible with Kafka. You can have one consumer subscribe to a topic and restrict the others by not giving them authorization. See Authorization in Kafka.
Note that once a message is consumed, it is not deleted; rather, its offset is committed so that the same consumer group will not consume it again.
By Group Messaging, I suppose you mean one producer > multiple consumers or multiple producers > multiple consumers.
This is also possible, a producer can produce messages to a topic and multiple consumers can consume them.
If all the consumers have the same group id, then each consumer in the group gets only a subset of messages.
If they have different group ids then each consumer will get all messages.
Multiple producers also can produce to the same topic.
A consumer can also subscribe to multiple topics.
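The group-id rules above can be illustrated with a small simulation. This is not Kafka client code: the placement rules (message i goes to partition i % numPartitions, and partition p is owned by consumer p % groupSize) are simplifications chosen for the example, and the group names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupFanOut {
    /**
     * Simulates delivery of messages to several consumer groups.
     * Every group sees every message, but inside a group each message
     * reaches only the one consumer that owns its partition.
     * Returns "group/consumerIndex" -> messages received, sorted by key.
     */
    static Map<String, List<String>> deliver(List<String> messages, int numPartitions,
                                             Map<String, Integer> groupSizes) {
        Map<String, List<String>> received = new TreeMap<>();
        for (int i = 0; i < messages.size(); i++) {
            int partition = i % numPartitions;
            for (Map.Entry<String, Integer> group : groupSizes.entrySet()) {
                int owner = partition % group.getValue();
                String consumer = group.getKey() + "/" + owner;
                received.computeIfAbsent(consumer, k -> new ArrayList<>()).add(messages.get(i));
            }
        }
        return received;
    }

    public static void main(String[] args) {
        List<String> messages = List.of("m0", "m1", "m2", "m3");
        // Hypothetical groups: "billing" has two consumers, "audit" has one.
        Map<String, Integer> groups = Map.of("billing", 2, "audit", 1);
        System.out.println(deliver(messages, 2, groups));
    }
}
```

Here the single consumer in "audit" receives all four messages, while the two consumers in "billing" split them (m0 and m2 to one, m1 and m3 to the other).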
OK, it's a very complicated question, so I'll try to give some simple, basic information.
Kafka topics are divided into a number of partitions. Partitions allow you to parallelize a topic by splitting its data across multiple brokers: each partition can be placed on a separate machine, which allows multiple consumers to read from a topic in parallel.
So if you are using partitions, it means you can have multiple consumers consuming in parallel.
Consumer groups for a given topic: each partition is read by exactly one consumer within the group, and the group as a whole consumes all messages from the entire topic.
Basically, if you have only one group, each message is handed to a single consumer in that group, so it is not processed by two consumers in the same group. Note that this is not the same as exactly-once delivery: by default a message can still be redelivered after a failure or a rebalance.
If you need two consumer groups, think about why you need two. Do the consumers in the two groups handle different logic?
There is more; please check the official documentation, or ask a smaller question.

Kafka: Can we have consumers subscribing to same topic but have different pipelines inside the topic?

I have 200 Kafka consumers who can do either of these things,
1. They can subscribe to 200 different topics, and will consume messages that are sensitive.
2. All 200 consumers can subscribe to a single topic.
Problem:
1. Is it a good design to create 200 or some other large number of topics?
2. In the second scenario, how do we implement things so that a message published to the topic is sent only to a particular consumer, based on some parameter?
Is it a good design to create 200 or some other large number of topics?
Kafka stores each topic partition as a replicated log on disk, and consumers read by offset, so the number of topics by itself has no direct effect on performance. 200 topics is not a problem; what adds overhead is the total number of partitions across the cluster.
In the second scenario, how do we implement things so that a message published to the topic is sent only to a particular consumer, based on some parameter?
You cannot do this based on a parameter. If you need a message to be delivered to exactly one consumer, put all the consumers in a single consumer group and have the whole group subscribe to the topic. That way each message will be consumed by only one consumer in that group.
If you need sequential (ordered) consumption across all messages, create your Kafka topic with only 1 partition, because Kafka only guarantees ordering within a partition.

Can a consumer group remember all the topics it is subscribed to?

I am new to Kafka and I am trying to build a multiple-producer, multiple-subscriber setup.
Let's say there are N producers called P1, P2, P3, ... and M consumers called C1, C2, C3, ...
Now C1 needs to subscribe to P1 and P2, and at some point it needs to subscribe to P3 as well. Hence C1 has a dynamic list of topics it needs to subscribe to.
I was hoping this could be achieved using the high-level consumer, where we can name our consumer group and Kafka will store the offset up to which we have read. But then I noticed that we also need to give the topic names when creating the high-level consumer. In my case I have around 1000 topics I need to subscribe to, and this list is updated dynamically.
Is there a way for the Kafka high-level consumer to remember the topics it has subscribed to and listen to them when it is brought up, rather than us providing the names of all the topics it was subscribed to in the past?
I don't think the Kafka architecture you outlined would work. The main issue, given that a Kafka topic is a point of asynchrony between producers and consumers, is that you cannot do a clean-cut switch with your "dynamic list of topics you need to subscribe to" (as you put it), since some messages will presumably always be in "the queue".
Besides that, it's not exactly trivial to dynamically change the topic (and partition) in consumer clients. AFAIK Kafka is not meant to be used this way.
A better option would be to use a special message field that would tell your consumer clients whether the message is for them or not.
So you can use dedicated topics for messages that don't require this dynamic nature (in order to avoid doing this check for all messages, if possible) and a separate topic where you'd mix all messages that do require it.

Kafka: Is it possible to share data among consumers in a consumer group?

I have multiple messages (more specifically log messages) in a certain topic which have the same id for a block of messages (these id's keep changing but remain same for a certain block of messages) and I need to find a way to group all the messages with that id or share the data contained in those messages with the same id between all the consumers in a consumer group.
So is there any way I could share data among various consumers in a consumer group?
This sounds like a sessionization use case to me. Kafka doesn't provide any means of grouping or nesting messages together so you'd have to do that yourself by keeping state in the consumer while processing and wrap the group of messages with some kind of header. Then you could push this to a new topic of wrapped message groups.
A better approach would probably be to make use of an external database or other system with more flexible means of selecting or organizing data based on fields. You can have a look at this blogpost for an example using Spark streaming + HBase.
There are two ways you can do that:
1. When you publish a message, give it a partition key, so that all messages with the same id go to a single partition. On the consumer side they will then always be consumed by a single consumer. [https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+Producer+Example]
2. If you use Spark Streaming on the consumer side, you can use the sliding window concept to group all messages with the same id. [http://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations]
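The idea behind the partition key can be sketched as follows. Kafka's default partitioner hashes the serialized key bytes (with murmur2) and takes the result modulo the partition count. Here `String.hashCode()` is used as a stand-in hash so the example needs no Kafka dependency, which means the actual partition numbers will differ from what a real producer would pick. The ids in `main` are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class KeyPartitioning {
    /** Maps a key to a partition; the same key always yields the same partition. */
    static int partitionFor(String key, int numPartitions) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /** Groups message ids by the partition they would land on. */
    static Map<Integer, List<String>> groupByPartition(List<String> ids, int numPartitions) {
        Map<Integer, List<String>> byPartition = new TreeMap<>();
        for (String id : ids) {
            byPartition.computeIfAbsent(partitionFor(id, numPartitions), p -> new ArrayList<>())
                       .add(id);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        // Hypothetical log-block ids; repeated ids end up in the same partition.
        List<String> ids = List.of("block-1", "block-2", "block-1", "block-3", "block-2");
        System.out.println(groupByPartition(ids, 4));
    }
}
```

Because every message with a given id lands on one partition, and each partition is read by one consumer in the group, that consumer sees the whole block in order and no data needs to be shared between consumers.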