Kafka Streams - How Elastic Scaling Works - apache-kafka

I was reading about the Kafka Streams elastic scaling feature.
It means Kafka Streams can hand over a task to another instance, and the task's state is rebuilt from the changelog. It is mentioned that instances coordinate with each other to achieve the rebalance.
But no detail is given on how exactly this rebalance works.
Is it the same as how a consumer group works, or a different mechanism, since Kafka Streams instances are not exactly like consumers in a consumer group?

Visit this article for a more thorough explanation.
..."In a nutshell, running instances of your application will automatically become aware of new instances joining the group, and will split the work with them; and vice versa, if any running instances are leaving the group (e.g. because they were stopped or they failed), then the remaining instances will become aware of that, too, and will take over their work. More specifically, when you are launching instances of your Streams API based application, these instances will share the same Kafka consumer group id. The group.id is a setting of Kafka’s consumer configuration, and for a Streams API based application this consumer group id is derived from the application.id setting in the Kafka Streams configuration."...

Related

Kafkajs - multiple consumers reading from same topic and partition

I'm planning to use Kafkajs (https://kafka.js.org/) and implement it in a Node.js server.
I would like to know what the expected behavior is if I have 2 (or more) instances of the server running, each of them having a consumer configured with the same group id and topic.
Does this mean that they might read the same messages?
Should I specify a unique consumer group for each server instance?
I read this - Multiple consumers consuming from same topic - but I'm not sure whether it applies to Kafkajs.
It's not possible for a single consumer group to have multiple consumers reading from overlapping partitions in any Kafka library. If your topic only has one partition, only one instance of your application will be consuming it, and if that instance dies, the group will rebalance and the other instance will take over (potentially re-reading some of the same data, due to the nature of at-least-once delivery, but never at the same time as the other instance).
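As an illustration of that behavior, here is a minimal sketch in Java (the group id, topic name and broker address are placeholders); the semantics are the same in Kafkajs, where you would pass the same groupId when creating each consumer. Two processes running this code share the partitions of the topic, and each record is delivered to only one of them:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class GroupMember {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // All server instances use the same group id, so Kafka assigns each
        // partition of the topic to at most one of them at a time.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-service");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                // If the topic has only one partition, only one instance in the group
                // receives records; the other sits idle until a rebalance.
                records.forEach(r -> System.out.printf("partition=%d offset=%d value=%s%n",
                        r.partition(), r.offset(), r.value()));
            }
        }
    }
}
```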

How to achieve leadership notion using consumer group of kafka?

Requirement:
Module1 publishes data and module2 consumes it.
Here I can have multiple instances of module2, in which one node should act as a leader, consume the data from the topic, process it, and add it to its in-memory store. This node is responsible for replicating its in-memory state to the rest of the module2 instances, which act as passive nodes. One of the requirements here is that the processing order should remain the same.
How do I design this in Kafka?
My thoughts are that module1 publishes to sample_topic (having a single partition) and each instance of module2 uses the same consumer group name and subscribes to sample_topic. Since any instance of the same consumer group can receive a message, the concept of a leader is not available.
Is there any way to achieve the leadership concept, similar to how brokers work in Kafka?
As you pointed out in the question - this is not the default behavior of a consumer group.
A consumer group will distribute the partitions across each consumer and you will not receive the same messages.
What you seem to need is a way to manage global state i.e. you want all consumers to be aware of and have reference to the same data.
There might be a way to hack this with the consumer API - but what I would suggest you look into is the Kafka Streams API.
More specifically, within the Kafka Streams API there is an interface called GlobalKTable:
A KTable distributes the data between all running Kafka Streams instances, while a GlobalKTable has a full copy of all data on each instance.
You can also just have each consumer subscribe to the same topic from individual consumer groups, unless it is a requirement that the consumer group must be the same for scaling purposes.
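As a rough sketch of the GlobalKTable approach (the application id and topic names other than sample_topic are made up for this example, and the join is only there to show one way of using the table), every module2 instance materializes the full contents of the topic locally, so no instance has to act as a leader that replicates state to the others:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class Module2 {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "module2");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // Every module2 instance keeps a FULL local copy of this topic,
        // so all instances see the same data.
        GlobalKTable<String, String> reference = builder.globalTable(
                "sample_topic",
                Consumed.with(Serdes.String(), Serdes.String()));

        // Example use: enrich another stream by looking up the global table.
        KStream<String, String> events = builder.stream(
                "events_topic", Consumed.with(Serdes.String(), Serdes.String()));
        events.join(reference,
                        (eventKey, eventValue) -> eventKey,            // map each event to a table key
                        (eventValue, refValue) -> eventValue + "|" + refValue)
              .to("enriched_topic", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Note that with a single-partition source topic the processing order is preserved, which matches the ordering requirement in the question.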

What is the difference between group.id, application.id and client.id in Kafka?

I am new to Kafka, so I am just clearing up my Kafka concepts.
I have created a simple streaming application which streams data from a single topic that has two partitions. I have two instances of this application (I am saying this on the basis of the same application.id in both projects). When I started a third instance of the application, I got an error. From this, I understood that application.id in Kafka is treated as the consumer group id, where a single consumer can read from a single partition of a topic; the third consumer does not get any partition, so it failed to get a store against the topic.
I have also tried another scenario where I changed application.id in one of my applications. By doing this, the third instance of the application also started working fine. So it confirmed my hypothesis that application.id is treated as the consumer group id.
But I have also noticed that group.id and client.id also exist, which is confusing me. What is the purpose of using group.id and client.id in our projects, what are these properties, and how do they work? I have set up the same group id for all three applications.
In short:
client.id (for producer and for consumer) sets the name of an individual Kafka producer or consumer client.
group.id sets the name of the Kafka consumer group that an individual Kafka consumer client belongs to.
application.id is a setting only used by Kafka Streams to name an application, i.e., an application that uses the Kafka Streams library (which can be run on one or more app instances). Behind the scenes, the application.id is also used to generate group.id and client.ids. See the application.id documentation for further information.
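A small sketch showing where each setting is used (the ids and names here are made up for the example): you set group.id and client.id directly on a plain consumer, whereas in Kafka Streams you only set application.id and the library derives the rest:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

import java.util.Properties;

public class IdSettings {
    public static void main(String[] args) {
        // Plain consumer: group.id groups consumers together for partition
        // assignment; client.id just names this one client (visible in metrics,
        // logs and quotas).
        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-consumers");
        consumerProps.put(ConsumerConfig.CLIENT_ID_CONFIG, "orders-consumer-1");

        // Kafka Streams: you set only application.id; the library derives the
        // consumer group.id from it and uses it to build internal client.ids.
        Properties streamsProps = new Properties();
        streamsProps.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-streams-app");

        System.out.println(consumerProps);
        System.out.println(streamsProps);
    }
}
```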

Producer to multiple subscriber paradigm

We have a dispatcher which receives a message - and then 'fans' it out to multiple downstream environments.
Each set of downstream environment needs to consume this message.
Will it suffice to tag the different sets of environments with different group.id values to force all the environments to consume the same messages (one producer, multiple subscriber broadcast)?
If a particular environment (group) crashes, will it be possible to replay the messages to that particular group only?
Yes, this is typically how you achieve such a data flow.
If you have multiple consumer groups subscribed to the same topics, they will all consume all messages. As you said, you use the group.id configuration to identify each consumer group.
In addition each consumer group tracks its own offsets. So you can easily make a particular group replay part of the log without impacting the other groups. This can be achieved for example by using the kafka-consumer-groups.sh tool with one of the reset options.
Yes, that's how Kafka works. So long as the retention for the topic is configured such, then any particular consumer group can re-consume from any offset in the log, whether the beginning or just the last point from which it successfully read. All other consumers are unaffected.
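A minimal sketch of the fan-out side (broker address, topic and group names are placeholders): the dispatcher produces to one topic, and each downstream environment runs consumers created with its own group id, so every environment independently receives every message:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.util.Collections;
import java.util.Properties;

public class EnvironmentSubscriber {

    // Each downstream environment passes its own group id ("env-a", "env-b", ...),
    // so every environment gets its own copy of the message stream and its own
    // independently tracked offsets.
    static KafkaConsumer<String, String> consumerFor(String environmentGroupId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, environmentGroupId);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("dispatch-topic"));
        // To replay for one environment only, reset just that group's offsets
        // (for example with the kafka-consumer-groups.sh reset options mentioned
        // above); the offsets of the other groups are untouched.
        return consumer;
    }
}
```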

Dynamically create and change Kafka topics with Flink

I'm using Flink to read and write data from different Kafka topics.
Specifically, I'm using the FlinkKafkaConsumer and FlinkKafkaProducer.
I'd like to know if it is possible to change the Kafka topics I'm reading from and writing to 'on the fly' based on either logic within my program, or the contents of the records themselves.
For example, if a record with a new field is read, I'd like to create a new topic and start diverting records with that field to the new topic.
Thanks.
If you have your topics following a generic naming pattern, for example, "topic-n*", your Flink Kafka consumer can automatically read from "topic-n1", "topic-n2", ... and so on as they are added to Kafka.
Flink 1.5 (FlinkKafkaConsumer09) added support for dynamic partition discovery & topic discovery based on regex. This means that the Flink-Kafka consumer can pick up new Kafka partitions without needing to restart the job and while maintaining exactly-once guarantees.
Consumer constructor that accepts subscriptionPattern: link.
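A rough sketch of that constructor in use (broker address, group id, pattern and discovery interval are example values; the exact connector class depends on your Flink/connector version):

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

import java.util.Properties;
import java.util.regex.Pattern;

public class DynamicTopicsJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "flink-dynamic-topics");
        // Re-check the pattern periodically so topics and partitions created later
        // are picked up without restarting the job.
        props.setProperty("flink.partition-discovery.interval-millis", "10000");

        FlinkKafkaConsumer<String> source = new FlinkKafkaConsumer<>(
                Pattern.compile("topic-n.*"),      // subscribe by regex instead of a fixed topic list
                new SimpleStringSchema(),
                props);

        env.addSource(source).print();
        env.execute("dynamic-topic-discovery");
    }
}
```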
Thinking more about the requirement,
Step 1 - You start from one topic (for simplicity), spawn more topics during runtime based on the data provided, and direct the respective messages to these topics. This is entirely possible and will not be complicated code. Use the ZkClient API to check whether the topic name exists; if it does not exist, create a new topic with the new name and start pushing messages into it through a new producer tied to this new topic (see the sketch below). You don't need to restart the job to produce messages to a specific topic.
Your initial consumer becomes a producer (for the new topics) plus a consumer (for the old topic).
Step 2 - You want to consume messages from the new topics. One way could be to spawn a new job entirely. You can do this by creating a thread pool initially and supplying arguments to it.
Again, be careful with this: more automation can lead to overloading the cluster in case of a looping bug. Think about the possibility of too many topics being created after some time if the input data is not controlled or is simply dirty. There could be better architectural approaches, as mentioned above in the comments.
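For the "create the topic if it does not exist" part of step 1, here is a sketch that uses the Kafka AdminClient rather than the ZkClient mentioned above (the broker address, partition count and replication factor are placeholder values):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

public class TopicCreator {

    // Creates the topic if it does not exist yet; returns once it is known to exist.
    static void ensureTopic(String name) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            boolean exists = admin.listTopics().names().get().contains(name);
            if (!exists) {
                // 1 partition, replication factor 1 -- example values only.
                admin.createTopics(
                        Collections.singleton(new NewTopic(name, 1, (short) 1)))
                     .all().get();
            }
        }
    }
}
```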