How can a Kafka consumer choose to consume from the nearest broker? - apache-kafka

I deployed a Kafka cluster on three hosts and deployed consumers on the same hosts.
How can I make each consumer consume from the nearest broker's partitions? For example, the consumer on host A should consume only the partitions that belong to host A.

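By default, Kafka doesn't work that way. The clients will connect to all three brokers and produce to and consume from all three brokers in parallel, based on which nodes are currently the leader for each topic partition. (Since Kafka 2.4, a consumer can set client.rack to fetch from a replica in its own rack if the brokers are configured with a rack-aware replica selector, but produce requests still go to the partition leader.)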
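To make the leader-based routing concrete, here is a minimal sketch in plain Python. It does not use a Kafka client; the host names and the partition-to-leader table are made up for illustration, and a real client would obtain this mapping from cluster metadata.

```python
# Hypothetical metadata: partition number -> broker that currently leads it.
PARTITION_LEADERS = {
    0: "host-a", 1: "host-b", 2: "host-c",
    3: "host-a", 4: "host-b", 5: "host-c",
}

def brokers_contacted(assigned_partitions, leaders):
    """Return the set of brokers a client must talk to for its partitions."""
    return {leaders[p] for p in assigned_partitions}

def partitions_led_by(broker, leaders):
    """Return the partitions whose current leader is the given broker."""
    return sorted(p for p, b in leaders.items() if b == broker)

# A consumer that owns every partition ends up talking to all three brokers.
all_brokers = brokers_contacted(PARTITION_LEADERS.keys(), PARTITION_LEADERS)

# Partitions that happen to be led by host-a right now.
local_to_a = partitions_led_by("host-a", PARTITION_LEADERS)
```

A consumer restricted to the partitions in local_to_a would only talk to host-a, but only for as long as host-a remains their leader; leadership can move after a broker restart or failure.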

Related

Kafka 3.3.1 Active/Active consumers and producers

We have two different Kafka clusters with 10 brokers each, and each cluster has its own ZooKeeper cluster. We have also set up MirrorMaker 2 (MM2) to sync data between the clusters. With MM2, offsets are synced along with the data.
I would like to set up Active/Active for both my consumer application and my producer application.
Let's say the clusters are DC1 and DC2.
The topic name is test-mm.
With the MM2 setup, the topics are:
In DC1:
  test-mm
  test-mm-DC2 (mirror of DC2)
In DC2:
  test-mm
  test-mm-DC1 (mirror of DC1)
Consumer Active/Active
In DC1, I have an application consuming data from test-mm and test-mm-DC2 with the consumer group name group1-test.
In DC2, the same application is consuming data from test-mm and test-mm-DC1 with the consumer group name group1-test.
Application is running as Active/Active on both DCs.
Now producer in DC1 is producing to the topic test-mm in DC1 and it gets mirrored to the topic test-mm-DC1 in DC2. My assumption here is, the offset gets synced so, with the same consumer group name, we can run consumer application on both DCs and only one consumer will get and process the message. Also, when the consumer application in DC1 goes down, the consumer application in DC2 will start processing and we can achieve the real active/active for consumers. Is this correct?
Producer Active/Active
This may not be possible with one producer in DC1 and a second producer in DC2, as ordering may not be maintained across two different producers. I'm not sure whether Active/Active can be achieved for producers.
You will want two producers: one producing to test-mm in DC1 and the other producing to test-mm in DC2. Messages produced to test-mm in DC1 are replicated to test-mm-DC1 in DC2, and vice versa. This achieves active/active because the data exists in both DCs, your consumers consume from both DCs, and if one DC fails the producer and consumer in the other DC continue as normal. Please let me know if this has not answered your question.
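A toy in-memory model of that layout may help. This is not MirrorMaker 2: ToyCluster, mirror and consume_all are hypothetical helpers, the mirror topic names simply follow the naming used in the question, and the copy happens in one shot rather than continuously.

```python
from collections import defaultdict

class ToyCluster:
    """In-memory stand-in for one Kafka cluster: topic name -> list of records."""
    def __init__(self, name):
        self.name = name
        self.topics = defaultdict(list)

    def produce(self, topic, record):
        self.topics[topic].append(record)

def mirror(source, target, topic):
    """Copy a topic into the other cluster as '<topic>-<source name>'."""
    target.topics[f"{topic}-{source.name}"] = list(source.topics[topic])

def consume_all(cluster, topics):
    """Read every record from the listed topics of one cluster."""
    return [record for t in topics for record in cluster.topics[t]]

dc1, dc2 = ToyCluster("DC1"), ToyCluster("DC2")
dc1.produce("test-mm", "order-1 (produced in DC1)")  # producer 1, local to DC1
dc2.produce("test-mm", "order-2 (produced in DC2)")  # producer 2, local to DC2
mirror(dc1, dc2, "test-mm")  # creates test-mm-DC1 in DC2
mirror(dc2, dc1, "test-mm")  # creates test-mm-DC2 in DC1

seen_in_dc1 = consume_all(dc1, ["test-mm", "test-mm-DC2"])
seen_in_dc2 = consume_all(dc2, ["test-mm", "test-mm-DC1"])
```

In this toy model the consumer at each site sees both records, which is what lets either site carry on alone if the other one fails.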
Hopefully my comment answers your question about exactly once processing with MM2. The Stack Overflow post I linked takes the following paragraph from the IBM guide: https://ibm-cloud-architecture.github.io/refarch-eda/technology/kafka-mirrormaker/#record-duplication
This Cloudera blog also mentions that exactly once processing does not apply across multiple clusters: https://blog.cloudera.com/a-look-inside-kafka-mirrormaker-2/
Cross-cluster Exactly Once Guarantee
Kafka provides support for exactly-once processing but that guarantee
is provided only within a given Kafka cluster and does not apply
across multiple clusters. Cross-cluster replication cannot directly
take advantage of the exactly-once support within a Kafka cluster.
This means MM2 can only provide at least once semantics when
replicating data across the source and target clusters which implies
there could be duplicate records downstream.
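One common way to live with at-least-once delivery is to make the consumer idempotent. The sketch below assumes every record carries a unique "id" field set by the producer, which is my assumption and not something MM2 provides; process_once is a hypothetical helper, and the in-memory set would have to be a durable store in a real application.

```python
def process_once(records, handler, seen_ids=None):
    """Call handler only for records whose id has not been seen before."""
    seen_ids = set() if seen_ids is None else seen_ids
    handled = []
    for record in records:
        record_id = record["id"]
        if record_id in seen_ids:
            continue  # duplicate delivery: skip it
        seen_ids.add(record_id)
        handled.append(handler(record))
    return handled

# The third record simulates a duplicate created during replication.
replicated = [
    {"id": "a1", "value": 10},
    {"id": "a2", "value": 20},
    {"id": "a1", "value": 10},
]
results = process_once(replicated, lambda r: r["value"])
```

Here results contains one entry per distinct id, so the duplicated record is handled only once.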
Now with regards to the below question:
Now producer in DC1 is producing to the topic test-mm in DC1 and it
gets mirrored to the topic test-mm-DC1 in DC2. My assumption here is,
the offset gets synced so, with the same consumer group name, we can
run consumer application on both DCs and only one consumer will get
and process the message. Also, when the consumer application in DC1
goes down, the consumer application in DC2 will start processing and
we can achieve the real active/active for consumers. Is this correct?
See this post here, they ask a similar question: How are consumers setup in Active - Active Kafka setup
I've not configured MM2 in an active/active architecture before so can't confirm whether you would have two active consumers for each DC or one. Hopefully another member will be able to answer this question for you.

Impact of publishing messages to one Kafka broker within a cluster

I have a Kafka cluster with three brokers and ZooKeeper instances. I kept a replication factor of 2 for each partition.
I want to understand the impact of publishing messages to a single node in the cluster by giving only one broker address. Will this broker forward messages to other brokers if the messages belong to partitions held by those brokers?
Can someone explain how the internal sync works, or point me to resources?
giving one broker address
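Even if you give one address, the bootstrap protocol returns all brokers to the client.
The partitioner logic determines which partition, and therefore which broker, to send the data to. In the client you target partitions, not brokers, and the producer sends each record directly to the leader of its partition, so the bootstrap broker does not forward messages on your behalf.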
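As a rough sketch of that idea: hash the key, take it modulo the number of partitions, then look up the leader for that partition. The Java client's default partitioner hashes keys with murmur2; zlib.crc32 is used here only as a stand-in, and the LEADERS table is hypothetical.

```python
import zlib

# Hypothetical metadata returned after bootstrapping: partition -> leader broker.
LEADERS = {0: "broker-1", 1: "broker-2", 2: "broker-3"}

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition (crc32 stands in for Kafka's murmur2)."""
    return zlib.crc32(key) % num_partitions

def route(key: bytes):
    """Return (partition, leader broker) for a record key."""
    partition = choose_partition(key, len(LEADERS))
    return partition, LEADERS[partition]

partition, broker = route(b"user-42")
```

The same key always lands on the same partition as long as the partition count does not change, regardless of which broker address was used to bootstrap.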

Kafka Producer, Consumer, Broker in same host?

Are there any downsides to running the same producer and consumer code for all nodes in the cluster? If there are 8 nodes in the cluster (8 consumer, 8 kafka broker, and 8 producers), would 8 producers be running at the same time in the cluster then? Is there a way to modify cluster so that only one producer runs at a time?
A Kafka cluster is simply a set of Kafka brokers running under distributed consensus. The cluster is agnostic about the number of producers and consumers around it; both are clients of the cluster. Producers stream data to Kafka and consumers consume data from it. Within the cluster, data is organized into topics, and topics are sharded into partitions. If multiple consumers belong to the same consumer group, they can work in a self-healing fashion.
Is there a way to modify cluster so that only one producer runs at a
time?
If you intend to run a single producer at a given point in time, you don't need to make any change within the cluster.
Are there any downsides to running the same producer and consumer code for all nodes in the cluster?
The primary downsides here would be scalability and memory usage.
Producers and Consumers are not required to run on Brokers. Producers should be deployed where data is being generated (or running as separate hosts, like Kafka Connect workers).
Consumers should be scaled out independently based on the throughput and ordering guarantees that you need in your downstream systems.
There is nothing that says 8 brokers require 8 producers and 8 consumers; partitions are what matter more.
If you have N partitions in a topic, you can only scale to N active consumers per consumer group anyway, but to any number of producers.
8 brokers can hold many partitions for any given topic.
Running a single producer is something your own code has to implement. The broker cannot force it.
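The "N partitions means at most N active consumers" point can be sketched with a simplified round-robin assignment. This is not the code of any real Kafka assignor; assign_round_robin and the consumer names are made up for illustration.

```python
def assign_round_robin(partitions, consumers):
    """Give each partition to exactly one consumer in the group, in turn."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment

four_partitions = [0, 1, 2, 3]
eight_consumers = [f"consumer-{i}" for i in range(8)]

assignment = assign_round_robin(four_partitions, eight_consumers)
# Consumers that received no partition sit idle.
idle = [consumer for consumer, owned in assignment.items() if not owned]
```

With 4 partitions and 8 consumers in one group, half of the consumers get nothing to read, so adding consumers beyond the partition count does not add throughput.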

Kafka Topic Distribution among brokers

When creating topics, can we determine which broker will be the leader for the topic? Are topics balanced across brokers in Kafka? (Considering the topics have just one partition)
Kafka does manage this internally and you don't need to worry about this in general: http://kafka.apache.org/documentation/#basic_ops_leader_balancing
If you create a new topic, Kafka assigns its partitions to brokers automatically (by default it spreads replicas across the brokers rather than measuring load). If a topic has only one partition, it will be hosted on a single broker (plus followers if you have multiple replicas), because a partition cannot be split across multiple brokers in Kafka.
Nevertheless, you can find out which broker hosts which topic, and you can also "move" topics and partitions: http://kafka.apache.org/documentation/#basic_ops_cluster_expansion

One Kafka broker connects to multiple zookeepers

I'm new to Kafka, zookeeper and Storm.
In our environment we have one Kafka broker connecting to multiple ZooKeepers. Is there an advantage to having the producer send messages to a specific topic and partition on one broker connected to multiple ZooKeepers, versus multiple brokers connected to multiple ZooKeepers?
Yes there is. Kafka allows you to scale by adding brokers. When you use a Kafka cluster with a single broker, as you have, all partitions reside on that single broker. But when you have multiple brokers, Kafka will split the partitions between them. So, broker A may be elected leader for partitions 1 and 2 of your topic, and broker B leader for partition 3. So, when you publish messages to the topic, the client will split the messages between the various partitions on the two brokers.
Note that I also mentioned leader election. Adding brokers to your Kafka cluster gives you replication. Kafka uses ZooKeeper to elect a leader for each partition as I mentioned in my example. Once a leader is elected, the client splits messages among partitions and sends each message to the leader for the appropriate partition. Depending on the topic configuration, the leader may synchronously replicate messages to a backup. So, in my example, if the replication factor for the topic is 2 then broker A will synchronously replicate messages for partitions 1 and 2 to broker B and broker B will synchronously replicate messages for partition 3 to broker A.
So, that's all to say that adding brokers gives you both scalability and fault-tolerance.
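Here is a small sketch of why extra brokers buy fault tolerance. The placement rule below is a simplified round-robin that I made up for illustration, so the exact leaders differ from the example above, and place_replicas and survives_failure are hypothetical helpers rather than Kafka APIs.

```python
def place_replicas(num_partitions, brokers, replication_factor):
    """Assign a leader and followers to each partition, round-robin."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the broker count")
    layout = {}
    for partition in range(num_partitions):
        replicas = [
            brokers[(partition + offset) % len(brokers)]
            for offset in range(replication_factor)
        ]
        layout[partition] = {"leader": replicas[0], "followers": replicas[1:]}
    return layout

def survives_failure(layout, failed_broker):
    """True if every partition still has at least one replica elsewhere."""
    return all(
        any(b != failed_broker for b in [entry["leader"], *entry["followers"]])
        for entry in layout.values()
    )

two_brokers = place_replicas(3, ["A", "B"], replication_factor=2)
one_broker = place_replicas(3, ["A"], replication_factor=1)
```

With two brokers and a replication factor of 2, losing either broker still leaves a copy of every partition; with a single broker, losing it takes every partition offline.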