Kafka 3.3.1 Active/Active consumers and producers - apache-kafka

We have 2 diff kafka clusters with 10 brokers in each cluster and each cluster has its own Zookeeper cluster. We also have setup MirrorMaker 2 to sync data between the clusters. With MM2, the offset is also being synced along with data.
Looking forward to setup Active/Active for my consumer application as well as producer application.
Lets say the clusters are DC1 & DC2.
Topic name is test-mm.
With MM2 setup,
In DC1,
test-mm
test-mm-DC2(Mirror of DC2)
In DC2,
test-mm
test-mm-DC1(Mirror of DC1)
Consumer Active/ Active
In DC1, I have an application consuming data from test-mm & test-mm-DC2 with the consumer group name group1-test.
In DC2, The same application is consuming data from test-mm & test-mm-DC1 with the consumer group name group1-test.
Application is running as Active/Active on both DCs.
Now producer in DC1 is producing to the topic test-mm in DC1 and it gets mirrored to the topic test-mm-DC1 in DC2. My assumption here is, the offset gets synced so, with the same consumer group name, we can run consumer application on both DCs and only one consumer will get and process the message. Also, when the consumer application in DC1 goes down, the consumer application in DC2 will start processing and we can achieve the real active/active for consumers. Is this correct?
Producer active/active,
It may not be possible with Producer in DC1 and Producer 2 in DC2 as the sequence may not be maintained with 2 different producer. Not sure if Active/Active can be achieved with producer.

You will want two producers, one producing to test-mm in DC1 and the other producing to test-mm in DC2. Once messages have been produced to test-mm in DC1 this will be replicated to test-mm-DC1 in DC2 and vice versa. This is achieving active / active as the data will exist on both DCs, your consumers are also consuming from both DCs and if one DC fails the other producer and consumer will continue as normal. Please let me know if this has not answered your question.

Hopefully my comment answers your question about exactly once processing with MM2. The Stack Overflow post I linked takes the following paragraph from the IBM guide: https://ibm-cloud-architecture.github.io/refarch-eda/technology/kafka-mirrormaker/#record-duplication
This Cloudera blog also mentions that exactly once processing does not apply across multiple clusters: https://blog.cloudera.com/a-look-inside-kafka-mirrormaker-2/
Cross-cluster Exactly Once Guarantee
Kafka provides support for exactly-once processing but that guarantee
is provided only within a given Kafka cluster and does not apply
across multiple clusters. Cross-cluster replication cannot directly
take advantage of the exactly-once support within a Kafka cluster.
This means MM2 can only provide at least once semantics when
replicating data across the source and target clusters which implies
there could be duplicate records downstream.
Now with regards to the below question:
Now producer in DC1 is producing to the topic test-mm in DC1 and it
gets mirrored to the topic test-mm-DC1 in DC2. My assumption here is,
the offset gets synced so, with the same consumer group name, we can
run consumer application on both DCs and only one consumer will get
and process the message. Also, when the consumer application in DC1
goes down, the consumer application in DC2 will start processing and
we can achieve the real active/active for consumers. Is this correct?
See this post here, they ask a similar question: How are consumers setup in Active - Active Kafka setup
I've not configured MM2 in an active/active architecture before so can't confirm whether you would have two active consumers for each DC or one. Hopefully another member will be able to answer this question for you.

Related

How to make producer idempotence in kafka cluster among two DC?

I have non-trival problem with kafka cluster spreaded among 2 DC. I wanna to have at the same time: 1) kafka producer idempotence and 2) async replication from DC1 to DC2. As known kafka producer idempotence require enabled acks=all in its properties. Thats requires acknoledgements from all brokers in DC1 and in DC2 too.
My question is: How I can change kafka cluster archetecture to achive ability of use idempotented producer and high aviability of brokers in DC1 and DC2? Prefering brokers from DC1.
Parameter min.insync.replicas helps solve the problem. It is mean how much replicas must saved to send ask to producer, even when asks=all is configured.
link 1
link 2

Does a kafka consumer machine need to run zookeeper?

So my question is this: If i have a server running Kafka (And zookeeper), and another machine only consuming messages, does the consumer machine need to run zookeeper too? Or does the server take care of all?
No.
Role of Zookeeper in Kafka is:
Broker registration: (cluster membership) with heartbeats mechanism to keep the list current
Storing topic configuration: which topics exist, how many partitions each
has, where are the replicas, who is the preferred leader, list of ISR for
partitions
Electing controller: The controller is one of the brokers and is responsible for maintaining the leader/follower relationship for all the partitions.
So Zookeeper is required only for kafka broker. There is no need to have Zookeper on the producer or consumer side.
The consumer does not need zookeeper
You have not mentioned which version of Kafka or the clients you're using.
Kafka consumers using 0.8 store their offsets in Zookeeper, so it is required for them. However, no, you would not run Zookeeper and consumers on the same server
From 0.9 and later, clients are separate from needing it (unless you want to manage external connections to Zookeeper on your own for storing data)

Kafka Producer, Consumer, Broker in same host?

Are there any downsides to running the same producer and consumer code for all nodes in the cluster? If there are 8 nodes in the cluster (8 consumer, 8 kafka broker, and 8 producers), would 8 producers be running at the same time in the cluster then? Is there a way to modify cluster so that only one producer runs at a time?
Kafka cluster is nothing but Kafka brokers running under a distributed consensus. Kafka cluster is agnostic about number of producers and consumers running around it. Producers and consumers are clients of the Kafka cluster. Producers will stream data to Kafka and consumers consume the data from Kafka. Within Kafka cluster data will be distributed within topics. Topics are sharded using partitions. If multiple consumers belong to the same consumer group consumers can work in a self healing fashion.
Is there a way to modify cluster so that only one producer runs at a
time?
If you intend to run a single producer at certain point of time, you don't need to make any change within cluster.
Are there any downsides to running the same producer and consumer code for all nodes in the cluster?
The primary downsides here would be scalability and memory usage.
Producers and Consumers are not required to run on Brokers. Producers should be deployed where data is being generated (or running as separate hosts, like Kafka Connect workers).
Consumers should be scaled out independently based on the throughput and ordering guarantees that you need in your downstream systems.
There is nothing that says 8 brokers requires 8 producers and 8 consumers; partitions are what matters more
If you have N partitions in a topic, you can only scale to N active consumers anyway, and infinitely many producers
8 brokers can hold lots of partitions for any given topic
Running a single producer is an implementation of your own code. The broker cannot force it.

how producers find kafka reader

The producers send messages by setting up a list of Kafka Broker as follows.
props.put("bootstrap.servers", "127.0.0.1:9092,127.0.0.1:9092,127.0.0.1:9092");
I wonder "producers" how to know that which of the three brokers knew which one had a partition leader.
For a typical distributed server, either you have a load bearing server or have a virtual IP, but for Kafka, how is it loaded?
Does the producers program try to connect to one broker at random and look for a broker with a partition leader?
A Kafka cluster contains multiple broker instances. At any given time, exactly one broker is the leader while the remaining are the in-sync-replicas (ISR) which contain the replicated data. When the leader broker is taken down unexpectedly, one of the ISR becomes the leader.
Kafka chooses one broker’s partition’s replicas as leader using ZooKeeper. When a producer publishes a message to a partition in a topic, it is forwarded to its leader.
According to Kafka documentation:
The partitions of the log are distributed over the servers in the
Kafka cluster with each server handling data and requests for a share
of the partitions. Each partition is replicated across a configurable
number of servers for fault tolerance.
Each partition has one server which acts as the "leader" and zero or
more servers which act as "followers". The leader handles all read and
write requests for the partition while the followers passively
replicate the leader. If the leader fails, one of the followers will
automatically become the new leader. Each server acts as a leader for
some of its partitions and a follower for others so load is well
balanced within the cluster.
You can find topic and partition leader using this piece of code.
EDIT:
The producer sends a meta request with a list of topics to one of the brokers you supplied when configuring the producer.
The response from the broker contains a list of partitions in those topics and the leader for each partition. The producer caches this information and therefore, it knows where to redirect the messages.
It's quite an old question but I have the same question and after researched, I want to share the answer cuz I hope it can help others.
To determine leader of a partition, producer uses a request type called a metadata request, which includes a list of topics the producer is interested in.
The broker will response specifies which partitions exist in the topics, the replicas for each partition, and which replica is the leader.
Metadata requests can be sent to any broker because all brokers have a metadata cache that contains this information.

correct architecture to consume message from another datacenter in apache kafka

We have 2 different data centers DC1 and DC2. DC1 is active and DC2 is passive.
Now we have installed Apache Kafka in DC1 and created topics, wrote producers and consumers and able to push the data correctly from source to sink.
Now we have the following requirement.
we need to keep the sink of DC2 also in synch with DC1. it means the data which is pushed to topic A by producer need to be consumed by two consumers. The first consumer which is already working is from DC1 itself and the other consumer has to be from DC2.
We thought of coming up with a solution like this
Create another consumer group in DC2 which listens to the same topic in DC1.
We are not sure on how it is going to work and how we can make DC2 consumer group listen to DC1 topic.
What is the correct way of handling it and morrow it is possible that DC2 can become active and DC1 be passive to handle DR.
We read on MirrorMaker tool but not sure on how to use it and is that the correct solution to address our problem statement.
I guess the key question here is
Is DC2 full disaster recovery solution ? (
I mean, in case DC1 kafka fails, should DC2 have all the data and resources needed to continue operation ) ?
Option 1 (prefferred) : If the answer is YES, I would set up two different kafka clusters for DC1 and DC2. And use the MirrorMaker tool to consume topics from DC1 into DC2.
Take into account that you might have topics with "intermediate" data in kafka and if you are running the same processes in parallel in the two DCs you could have duplicate data in that topics if you replicate them in Mirror Maker.
Be very careful with the processes to recover DC1. Probably the easiest way to do this is to have DC1 as a passive copy when DC2 is taking the lead and replicate then data to DC1 with MirrorMaker.
Option 2 (more complicated): If the answer is NO, AND YOU ARE VERY DISCIPLINED / HAVE VERY STRICT PROCESSES and you AUDIT periodically/automatically your infrastructure then you can setup kafka brokers with rack aware replication (setting broker.rack=DC1 or DC2) and the cluster will send replicas to the brokers in the second DC <-- BUT your kafka installation in DC2 won't be "passive" at all.
Caveat here: You must set up always a minimum of two replicas for each topic (to avoid mistakes put default.replication.factor = [number of different DCs you have] in your kafka broker config - but be aware that this can be overriden).
If you have kafka in different DCs, I would also assign Kafka broker IDs that reflect which DC every broker "lives". For instance, for DC2 I would start numbering brokers with the number "200" and for DC1 brokers should start with "100".