Kafka streams with multiple application instances - apache-kafka

I have 2 instances of my application for Kafka Streams, consuming from 2 partitions in a single topic.
Will a single partition's data go to only one application instance, or to both? And if one application instance goes down, will I have issues? How do interactive queries solve this?
Do I need to use a GlobalKTable?

Each Kafka Streams application instance is mapped to one or more partitions, based on how many partitions the input topics have.
If you run 2 instances against an input topic with 2 partitions, each instance will consume from one partition. If one instance goes down, Kafka Streams will rebalance the workload onto the remaining instance, which will then consume from both partitions.
You can refer to the architecture in detail here: https://docs.confluent.io/current/streams/architecture.html
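The assignment and failover behavior described above can be sketched with a toy model. This is a simplified illustration, not Kafka Streams' actual assignor (which also handles stateful tasks and standby replicas); all names here are hypothetical.

```python
# Toy model of how Kafka Streams spreads topic partitions over live
# application instances (simplified; real assignment also considers
# state stores and standby tasks).

def assign_partitions(partitions, instances):
    """Spread partition numbers round-robin over the live instance ids."""
    assignment = {instance: [] for instance in instances}
    for i, partition in enumerate(partitions):
        assignment[instances[i % len(instances)]].append(partition)
    return assignment

# Two instances, two partitions: each instance gets one partition.
print(assign_partitions([0, 1], ["instance-a", "instance-b"]))
# → {'instance-a': [0], 'instance-b': [1]}

# If instance-b dies, a rebalance hands both partitions to instance-a.
print(assign_partitions([0, 1], ["instance-a"]))
# → {'instance-a': [0, 1]}
```

The key point: the set of partitions is fixed by the topic, and the group rebalances them over however many instances happen to be alive.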

Related

Kafkajs - multiple consumers reading from same topic and partition

I'm planning to use Kafkajs https://kafka.js.org/ and implement it in a NodeJs server.
I would like to know what is the expected behavior in case I have 2 (or more) instances of the server running, each of them having a consumer which is configured with the same group id and topic?
Does this mean that they might read the same messages?
Should I specify a unique consumer group for each server instance?
I read this - Multiple consumers consuming from same topic - but I'm not sure it applies to Kafkajs.
It's not possible in any Kafka library for a single consumer group to have multiple consumers reading from overlapping partitions. If your topic only has one partition, only one instance of your application will consume it. If that instance dies, the group will rebalance and the other instance will take over, potentially re-reading some of the same data due to the nature of at-least-once delivery, but never at the same time as the other instance.
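The "potentially reading some of the same data" part comes from offset commits. A toy sketch of why a takeover can reprocess records, assuming the failed consumer had processed further than it had committed (all names here are hypothetical):

```python
# Toy model of at-least-once redelivery across a rebalance: the new
# partition owner resumes from the last committed offset, not from
# wherever the failed consumer actually got to.

def reprocessed_offsets(processed_through, last_committed):
    """Offsets that the new partition owner will see a second time."""
    return list(range(last_committed, processed_through + 1))

# The failed instance processed up to offset 5 but committed only offset 3,
# so the instance that takes over re-reads offsets 3..5.
print(reprocessed_offsets(processed_through=5, last_committed=3))
# → [3, 4, 5]
```

This is why consumers in a group should be idempotent, or commit offsets close to (or atomically with) their side effects.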

Spring Kafka, subscribing to large number of topics using topic pattern

Are there any known limitations on the number of Kafka topics, or on how topics are distributed between consumer instances, when subscribing to Kafka topics using a topicPattern?
Our use case is that we need to subscribe to a large number of topics (potentially a few thousand). All the topics follow a naming convention and have only 1 partition. We don't know the list of topic names beforehand, and new topics matching the pattern can be created at any time. Our consumers should be able to consume messages from all the topics that match the pattern.
Is there any limit on the number of topics that can be subscribed to this way using the topic pattern?
Can we scale up the number of consumer instances to read from more topics?
Will the topics be distributed among all the running consumer instances? In our local testing we observed that all the topics were read by a single consumer despite multiple consumer instances running in parallel. Only when the serving consumer instance is shut down are the topics picked up by another instance (not distributed among the available instances).
All the topics mentioned are single partition topics.
I don't believe there is a hard limit but...
No; you can't change concurrency at runtime, but you can set a larger concurrency than needed and you will have idle consumers waiting for assignment.
You would need to provide a custom partition assigner, or select one of the alternative provided ones; e.g. RoundRobinAssignor will probably work for you: https://kafka.apache.org/documentation/#consumerconfigs_partition.assignment.strategy
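The observed behavior (all topics on one consumer) is what a range-style assignor produces for single-partition topics, since range assignment works per topic and every topic's partition 0 goes to the first consumer. A simplified sketch of the difference between range and round-robin assignment; this is a toy model of the two strategies, not Kafka's actual assignor code, and the topic and consumer names are hypothetical:

```python
# Toy comparison of range vs round-robin assignment for a group of
# consumers subscribed to many single-partition topics.

def range_assign(topics, consumers):
    """Range assignment is computed per topic: partition 0 of every
    single-partition topic lands on the first consumer in sorted order,
    leaving the other consumers idle."""
    first = sorted(consumers)[0]
    return {topic: first for topic in topics}

def round_robin_assign(topics, consumers):
    """Round-robin spreads all topic-partitions across the whole group."""
    consumers = sorted(consumers)
    return {topic: consumers[i % len(consumers)]
            for i, topic in enumerate(sorted(topics))}

topics = ["events-1", "events-2", "events-3", "events-4"]
print(range_assign(topics, ["c1", "c2"]))        # everything on c1
print(round_robin_assign(topics, ["c1", "c2"]))  # split between c1 and c2
```

With the default range-style strategy all four topics go to `c1`; round-robin alternates them across `c1` and `c2`, which is why switching the `partition.assignment.strategy` fixes the skew for single-partition topics.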

What is the correlation in kafka stream/table, globalktable, borkers and partition?

I am studying Kafka streams, tables, GlobalKTable, etc., and I am now confused about them.
What exactly is a GlobalKTable?
Overall, if I have a topic with N partitions and one Kafka Streams application, after I send some data to the topic, how many streams (partitions?) will I have?
I did some experiments and noticed that the mapping is 1:1. But what if I replicate the topic over different brokers?
Thank you all
I'll try to answer your questions as you have them listed here.
A GlobalKTable has all partitions available in each instance of your Kafka Streams application, whereas a KTable is partitioned over all of the instances of your application. In other words, every instance of your Kafka Streams application has access to all records in the GlobalKTable; hence it is used for more static data and mostly for looking up records in joins.
As for a topic with N partitions: if you have one Kafka Streams application instance, it will consume and process all records from the input topic. If you spin up another instance of your streams application, each instance will process half of the partitions, giving you higher throughput due to the parallelization of the work.
For example, if you have input topic A with four partitions and one Kafka Streams application instance, that single instance processes all records. But if you launch two instances of the same Kafka Streams application, each instance processes records from 2 partitions; the workload is split across all running instances with the same application-id.
Topics in Kafka can be replicated across different brokers; a replication factor of 3 is the common recommendation. A replication factor of 3 means the records for a given partition are stored on the lead broker for that partition and on two other follower brokers (assuming a three-node broker cluster).
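Replication is worth separating from partitioning in the question above: replicas are copies for fault tolerance and do not change how many stream tasks you get. A toy sketch of replica placement for one partition; this is an illustration of the leader/follower idea only, not Kafka's actual placement algorithm, and the broker ids are hypothetical:

```python
# Toy sketch of replica placement for one partition with replication
# factor 3 on a three-broker cluster: one leader copy plus followers
# on the next brokers in the list.

def place_replicas(leader, brokers, replication_factor):
    """Return the brokers holding copies of the partition, leader first."""
    start = brokers.index(leader)
    return [brokers[(start + i) % len(brokers)]
            for i in range(replication_factor)]

# Brokers 1, 2, 3 with broker 2 leading the partition: copies on 2, 3, 1.
print(place_replicas(leader=2, brokers=[1, 2, 3], replication_factor=3))
# → [2, 3, 1]
```

Consumers and stream tasks only ever read from the leader (in older Kafka versions at least), so adding replicas improves durability, not the 1:1 partition-to-task mapping observed in the question.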
Hope this clears things up some.
-Bill

One KafkaConsumer listening to multiple partitions VS multiple KafkaConsumers listening to multiple partitions

I have ten Kafka producers, each writing to a different partition of a topic.
I cannot tell which is more effective:
having one consumer listening to the ten partitions, or having ten consumers each listening to a different partition?
Functionally there is no difference between these two approaches, but remember that when you have ten consumers there is overhead in connecting each consumer to Kafka.
Since a single consumer is capable of consuming from multiple partitions, it will probably be performant enough on its own.
Typically, if you have multiple consumers, you'll be able to get more throughput, since you'll have multiple threads/applications pulling data from the kafka cluster, which means you'll be able to parallelize across multiple cores, and maybe multiple servers.
However, you also need to take into account what you're trying to accomplish. Does one process/application need to look at all the data? Are the messages independent of each other? All of this will inform how your application should be designed.
In the default configuration, all of the available partitions for a topic will be distributed evenly across all consumers with the same group id. So you could have one consumer, and it will automatically grab all partitions for that topic. Or you could instantiate ten consumers, and each consumer will get exactly one partition in this case.

Apache Kafka Multiple Consumer Instances

I have a consumer that is supposed to read messages from a topic. This consumer actually reads the messages and writes them to a time series database. We have multiple instances of the time series database running as a cluster on multiple physical machines.
Our plan is to deploy the consumer on all those machines where the time series service is running. So if I have 5 nodes on which the time series service is running, I will install one consumer instance per node. All those consumer instances belong to the same consumer group. In pictures, the setup looks like below:
As you can see, the Producer P1 and P2 write into 2 partitions namely partition 1 and partition 2 of the kafka topic. I then have 4 instances of the time series service where one consumer is running per instance. How should I read using my consumer properly such that I do not end up with duplicate messages in my time series database?
Edit: After reading through the Kafka documentation, I came across these two statements:
If all the consumer instances have the same consumer group, then this works just like a traditional queue balancing load over the consumers.
If all the consumer instances have different consumer groups, then this works like publish-subscribe and all messages are broadcast to all consumers.
So in my case above, it is behaving like a Queue? Is my understanding correct?
If all consumers belong to one group (have the same groupId), then the Kafka topic will behave as a queue for you.
Important: there is no reason to have more consumers than partitions, as (out-of-the-box) Kafka consumers are scaled by partitions.
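The "no more consumers than partitions" point applies directly to the 4-instance, 2-partition setup in the question: the extra group members join but receive nothing. A toy sketch, using the same simplified round-robin assignment idea as the real group coordinator (consumer names are hypothetical):

```python
# Toy model: with more consumers than partitions in one group, the
# extra consumers sit idle until a rebalance frees up a partition.

def assign(partitions, consumers):
    """Spread partitions over group members; surplus members get nothing."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 2 partitions, 4 consumers in the same group: c1 and c2 each get one
# partition, while c3 and c4 are idle standbys.
result = assign([0, 1], ["c1", "c2", "c3", "c4"])
idle = [c for c, parts in result.items() if not parts]
print(idle)
# → ['c3', 'c4']
```

Idle members aren't useless, though: they act as hot standbys that take over a partition immediately when an active consumer fails.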