Kafka Message Filtering Based on ENVs

I have a consumer application deployed in several ENVs (dev, test, stage & preprod). They are all consuming the same Kafka topic (i.e. they work as multiple consumers of the same topic).
I have separate producer applications for each ENV (dev, test, stage & preprod). Every message payload contains a field that identifies the producer's ENV.
Our requirement is that the dev ENV's consumer should only consume messages produced by the dev ENV's producer application, and the same goes for the other ENVs.
My question is: should I go with consumer-side filtering? Will this meet our requirement, and how?
Thanks in advance.

You have multiple options for dealing with this requirement. However, I don't think it is generally a good idea to have one topic for different environments; looking at data protection and access permissions, it doesn't sound like a good design.
Anyway, I see the following options.
Option 1:
Use the environment (dev, test, ...) as the message key and tell the consumer to filter by key (see the sketch after this list).
Option 2:
Write producers that send data from each environment to individual partitions and tell the consumers for each environment to only read from a particular partition.
But before implementing Option 2, I would rather do
Option 3:
Have a topic for each environment and let the producers/consumers write to/read from the different topics.
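A minimal consumer-side sketch of Option 1 in Java, assuming the producer sets the environment as the record key; the topic name, group id and the "dev" key value are made up:
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "dev-consumer");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("my_source_topic"));
    while (true) {
        for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
            if (!"dev".equals(record.key())) {
                continue;   // drop messages produced by other environments
            }
            // process the dev message here
        }
    }
}
Note that with consumer-side filtering every environment's consumer still reads (and then discards) every other environment's messages, which is part of why a topic per environment (Option 3) is usually the cleaner design.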

I agree with mike that using a single topic across environments is not a good idea.
However, if you are going to do this, then I would suggest you use a stream processor to create separate topics for your consumers. You can do this with Kafka Streams, ksqlDB, etc.
In ksqlDB it would look like this:
-- Declare stream over existing topic
CREATE STREAM FOO_ALL_ENVS WITH (KAFKA_TOPIC='my_source_topic', VALUE_FORMAT='AVRO');
-- Create derived stream & new topic populated with messages just for DEV
-- You can explicitly provide the target Kafka topic name.
CREATE STREAM FOO_DEV WITH (KAFKA_TOPIC='foo_dev') AS SELECT * FROM FOO_ALL_ENVS WHERE ENV='DEV';
-- Create derived stream & new topic populated with messages just for PROD
-- If you don't specify a Kafka topic name it will inherit from the
-- stream name (i.e. `FOO_PROD`)
CREATE STREAM FOO_PROD AS SELECT * FROM FOO_ALL_ENVS WHERE ENV='PROD';
-- etc
Now you have your producer writing to a single topic (if you must), but your consumers can consume from a topic that is specific to their environment. The ksqlDB statements are continuous queries, so they will process all existing messages in the source topic as well as every new message that arrives.
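And if you would rather use Kafka Streams (also mentioned above), a roughly equivalent sketch could look like the following; how the ENV field is read from the payload is an assumption, shown as a placeholder envOf() helper:
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "env-splitter");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> all = builder.stream("my_source_topic");
// envOf() is a placeholder for however you extract the ENV field from the payload
all.filter((key, value) -> "DEV".equals(envOf(value))).to("foo_dev");
all.filter((key, value) -> "PROD".equals(envOf(value))).to("foo_prod");
new KafkaStreams(builder.build(), props).start();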

Related

Kafka Consumer and Producer

Can I have the consumer act as a producer (publisher) as well? I have a use case where a consumer (C1) polls a topic and pulls messages. After processing a message and performing a commit, it needs to notify another process to carry on the remaining work. Given this use case, is it a valid design for consumer C1 to publish a message to a different topic, i.e. for C1 to also act as a producer?
Yes, this is a valid use case. We have many production applications doing the same: consuming events from a source topic, performing data enrichment/transformation and publishing the output to another topic for further processing.
Again, the implementation pattern depends on which tech stack you are using, but if you are after a Spring Boot application, you can have a look at https://medium.com/geekculture/implementing-a-kafka-consumer-and-kafka-producer-with-spring-boot-60aca7ef7551
Totally valid scenario. For example, you can have a source connector or a producer which simply pushes raw data to a topic.
The receiver is loosely coupled to your publisher, so they cannot communicate with each other directly.
Then you need C1 (a mediator) to consume messages from the source topic, transform the data and publish the new data format to a different topic.
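A sketch of that mediator (C1) with the plain Java Consumer/Producer API; the topic names and the transform step are illustrative, and the usual bootstrap/serializer configuration is assumed to be in consumerProps and producerProps:
import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
     KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
    consumer.subscribe(Collections.singletonList("source-topic"));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            String transformed = transform(record.value());   // placeholder for your enrichment logic
            producer.send(new ProducerRecord<>("target-topic", record.key(), transformed));
        }
        producer.flush();       // make sure the output is on its way...
        consumer.commitSync();  // ...before committing the consumed offsets
    }
}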
If you're using a JVM-based client, this is precisely the use case for Kafka Streams rather than the base Consumer/Producer API.
Kafka Streams applications must consume from an initial topic and can then convert (map), filter, aggregate, split, etc. into other topics.
https://kafka.apache.org/documentation/streams/

How to achieve a leadership notion using a Kafka consumer group?

Requirement:
Module1 publish data and module2 consumes it.
Here I can have multiple instances of module2, of which one node should act as a leader, consume the data from the topic, process it and add it to its in-memory store. This node has the responsibility of replicating its in-memory state to the rest of the module2 instances, which act as passive nodes. One of the requirements here is that the processing order should remain the same.
How to design this in Kafka?
My thought is that Module1 publishes to sample_topic (having a single partition) and each instance of module2 uses the same consumer group name and subscribes to sample_topic. Since any instance of the consumer group can receive a message, the concept of a leader is not available.
Is there any way to achieve the leadership concept, similar to how brokers work in Kafka?
As you pointed out in the question - this is not the default behavior of a consumer group.
A consumer group will distribute the partitions across the consumers, so they will not receive the same messages.
What you seem to need is a way to manage global state i.e. you want all consumers to be aware of and have reference to the same data.
There might be a way to hack this with the consumer API - but what I would suggest you look into is the Kafka Streams API.
More specifically, within the Kafka Streams API there is an interface called GlobalKTable:
A KTable distributes the data between all running Kafka Streams instances, while a GlobalKTable has a full copy of all data on each instance.
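A minimal GlobalKTable sketch (the store name and String serdes are assumptions, and the interactive-query call assumes a recent Kafka Streams version) that gives every module2 instance its own full copy of sample_topic:
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.*;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.*;

StreamsBuilder builder = new StreamsBuilder();
GlobalKTable<String, String> table = builder.globalTable(
        "sample_topic",
        Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("module2-store"));

KafkaStreams streams = new KafkaStreams(builder.build(), props);   // props as in any Streams app
streams.start();

// Each instance can then query its local, complete copy of the data:
ReadOnlyKeyValueStore<String, String> store = streams.store(
        StoreQueryParameters.fromNameAndType("module2-store", QueryableStoreTypes.keyValueStore()));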
You can also just have each consumer subscribe to the same topic from individual consumer groups, unless it is a requirement that the consumer group must be the same for scaling purposes.

Kafka Streams - How Elastic Scaling Works

I was reading about Kafka Streams' elastic scaling features.
This means Kafka Streams can hand a task over to another instance, and the task's state gets recreated from the changelog. It is mentioned that the instances coordinate with each other to achieve a rebalance.
But no detail is given on how exactly the rebalance works.
Is it the same as how a consumer group works, or a different mechanism, given that Kafka Streams instances are not exactly like consumers in a consumer group?
Visit this article for a more thorough explanation.
..."In a nutshell, running instances of your application will automatically become aware of new instances joining the group, and will split the work with them; and vice versa, if any running instances are leaving the group (e.g. because they were stopped or they failed), then the remaining instances will become aware of that, too, and will take over their work. More specifically, when you are launching instances of your Streams API based application, these instances will share the same Kafka consumer group id. The group.id is a setting of Kafka’s consumer configuration, and for a Streams API based application this consumer group id is derived from the application.id setting in the Kafka Streams configuration."...

Dynamically create and change Kafka topics with Flink

I'm using Flink to read and write data from different Kafka topics.
Specifically, I'm using the FlinkKafkaConsumer and FlinkKafkaProducer.
I'd like to know if it is possible to change the Kafka topics I'm reading from and writing to 'on the fly' based on either logic within my program, or the contents of the records themselves.
For example, if a record with a new field is read, I'd like to create a new topic and start diverting records with that field to the new topic.
Thanks.
If you have your topics following a generic naming pattern, for example "topic-n*", your Flink Kafka consumer can automatically read from "topic-n1", "topic-n2", ... and so on as they are added to Kafka.
Flink 1.5 (FlinkKafkaConsumer09) added support for dynamic partition discovery and topic discovery based on a regex pattern. This means that the Flink Kafka consumer can pick up new Kafka partitions without needing to restart the job, while maintaining exactly-once guarantees.
Consumer constructor that accepts subscriptionPattern: link.
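A hedged sketch of that usage with the universal FlinkKafkaConsumer (the version-specific variants such as FlinkKafkaConsumer09/010 have equivalent constructors); the topic pattern, discovery interval and schema are illustrative:
import java.util.Properties;
import java.util.regex.Pattern;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");
props.setProperty("group.id", "flink-reader");
// check for new topics/partitions every 10s (discovery is off by default)
props.setProperty("flink.partition-discovery.interval-millis", "10000");

FlinkKafkaConsumer<String> source = new FlinkKafkaConsumer<>(
        Pattern.compile("topic-n.*"),        // subscribe to every topic matching the pattern
        new SimpleStringSchema(),
        props);

DataStream<String> stream = env.addSource(source);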
Thinking more about the requirement,
Step 1 - You start with one topic (for simplicity) and spawn more topics at runtime based on the data provided, directing the respective messages to those topics. This is entirely possible and the code is not complicated. Use the ZkClient API to check whether a topic name exists; if it does not, create a model topic with the new name and start pushing messages into it through a new producer tied to this new topic. You don't need to restart the job to produce messages to a specific topic.
Your initial consumer then becomes a producer (for the new topics) plus a consumer (of the old topic).
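A sketch of that check-and-create step, here using the Kafka AdminClient rather than the ZkClient API mentioned above; the topic name, partition count and replication factor are assumptions:
import java.util.Collections;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.*;

static void ensureTopic(String topicName) throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    try (AdminClient admin = AdminClient.create(props)) {
        Set<String> existing = admin.listTopics().names().get();
        if (!existing.contains(topicName)) {
            // create the "model" topic with assumed partition/replication settings
            admin.createTopics(Collections.singleton(new NewTopic(topicName, 3, (short) 1)))
                 .all().get();
        }
    }
    // ...then attach a producer to topicName and start pushing messages into it
}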
Step 2 - You want to consume messages from the new topics. One way could be to spawn a new job entirely. You can do this by creating a thread pool initially and supplying arguments to it.
Again, be careful with this; more automation can lead to overloading the cluster in case of a looping bug. Think about the possibility of too many topics being created after some time if the input data is not controlled or is simply dirty. There could be better architectural approaches, as mentioned in the comments above.

Using Kafka to Transfer Files between two clients

I have a Kafka cluster set up between two machines (machine#1 and machine#2) with the following configuration:
1) Each machine is configured to have one broker and one ZooKeeper node running.
2) Server and ZooKeeper properties are configured for a multi-broker, multi-node ZooKeeper setup.
I currently have the following understanding of KafkaProducer and KafkaConsumer:
1) If I send a file from machine#1 to machine#2, it's broken down into lines using some default delimiter (LF or \n).
2) Therefore, if machine#1 publishes 2 different files to the same topic, that doesn't mean that machine#2 will receive the two files as units. Instead, every line will be appended to the topic's log partitions and machine#2 will read them from the log partitions in the order of arrival, i.e. the order is not guaranteed to be:
file1-line1
file1-line2
end-of-file1
file2-line1
file2-line2
end-of-file2
but it might be something like:
file1-line1
file2-line1
file1-line2
end-of-file1
file-2-line2
end-of-file2
Assuming that the above is correct (I'm happy to be wrong), I believe simple producer/consumer usage to transfer files is not the correct approach (probably the Connect API is the solution here). Since the Kafka website says that "log aggregation" is a very popular use case, I was wondering if someone has any example projects or websites which demonstrate file exchange using Kafka.
P.S. I know that, by definition, the Connect API is for reliable data exchange between Kafka and "other" systems - but I don't see why the other system cannot be Kafka as well. So I am hoping that my question doesn't have to focus on "other" non-Kafka systems.
Your understanding is correct. However, if you want to preserve the order you can use just 1 partition for that topic.
So the order in which machine#2 reads will be the same as what you sent.
However, this will be inefficient and will lack the parallelism for which Kafka is widely used.
Kafka has an ordering guarantee within a partition. Quoting the documentation:
Kafka only provides a total order over records within a partition, not between different partitions in a topic.
In order to send all the lines from a file to only one partition, send an additional key with each message from the producer client; the key will be hashed so that every message with that key lands in the same partition.
This will make sure you receive the events from one file in the same order on machine#2. If you have any questions feel free to ask, as we use Kafka in production to guarantee the ordering of events generated from multiple sources, which is basically your use case as well.
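For example, a producer sketch (file and topic names are made up) that keys every line with the file name, so all of file1's lines hash to the same partition and arrive on machine#2 in order:
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "machine1:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
     BufferedReader reader = Files.newBufferedReader(Paths.get("file1.txt"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        // same key ("file1") => same partition => per-file ordering is preserved
        producer.send(new ProducerRecord<>("file-topic", "file1", line));
    }
    producer.send(new ProducerRecord<>("file-topic", "file1", "end-of-file1"));
}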