How to make producer idempotence in kafka cluster among two DC? - apache-kafka

I have non-trival problem with kafka cluster spreaded among 2 DC. I wanna to have at the same time: 1) kafka producer idempotence and 2) async replication from DC1 to DC2. As known kafka producer idempotence require enabled acks=all in its properties. Thats requires acknoledgements from all brokers in DC1 and in DC2 too.
My question is: How I can change kafka cluster archetecture to achive ability of use idempotented producer and high aviability of brokers in DC1 and DC2? Prefering brokers from DC1.

Parameter min.insync.replicas helps solve the problem. It is mean how much replicas must saved to send ask to producer, even when asks=all is configured.
link 1
link 2

Related

Kafka 3.3.1 Active/Active consumers and producers

We have 2 diff kafka clusters with 10 brokers in each cluster and each cluster has its own Zookeeper cluster. We also have setup MirrorMaker 2 to sync data between the clusters. With MM2, the offset is also being synced along with data.
Looking forward to setup Active/Active for my consumer application as well as producer application.
Lets say the clusters are DC1 & DC2.
Topic name is test-mm.
With MM2 setup,
In DC1,
test-mm
test-mm-DC2(Mirror of DC2)
In DC2,
test-mm
test-mm-DC1(Mirror of DC1)
Consumer Active/ Active
In DC1, I have an application consuming data from test-mm & test-mm-DC2 with the consumer group name group1-test.
In DC2, The same application is consuming data from test-mm & test-mm-DC1 with the consumer group name group1-test.
Application is running as Active/Active on both DCs.
Now producer in DC1 is producing to the topic test-mm in DC1 and it gets mirrored to the topic test-mm-DC1 in DC2. My assumption here is, the offset gets synced so, with the same consumer group name, we can run consumer application on both DCs and only one consumer will get and process the message. Also, when the consumer application in DC1 goes down, the consumer application in DC2 will start processing and we can achieve the real active/active for consumers. Is this correct?
Producer active/active,
It may not be possible with Producer in DC1 and Producer 2 in DC2 as the sequence may not be maintained with 2 different producer. Not sure if Active/Active can be achieved with producer.
You will want two producers, one producing to test-mm in DC1 and the other producing to test-mm in DC2. Once messages have been produced to test-mm in DC1 this will be replicated to test-mm-DC1 in DC2 and vice versa. This is achieving active / active as the data will exist on both DCs, your consumers are also consuming from both DCs and if one DC fails the other producer and consumer will continue as normal. Please let me know if this has not answered your question.
Hopefully my comment answers your question about exactly once processing with MM2. The Stack Overflow post I linked takes the following paragraph from the IBM guide: https://ibm-cloud-architecture.github.io/refarch-eda/technology/kafka-mirrormaker/#record-duplication
This Cloudera blog also mentions that exactly once processing does not apply across multiple clusters: https://blog.cloudera.com/a-look-inside-kafka-mirrormaker-2/
Cross-cluster Exactly Once Guarantee
Kafka provides support for exactly-once processing but that guarantee
is provided only within a given Kafka cluster and does not apply
across multiple clusters. Cross-cluster replication cannot directly
take advantage of the exactly-once support within a Kafka cluster.
This means MM2 can only provide at least once semantics when
replicating data across the source and target clusters which implies
there could be duplicate records downstream.
Now with regards to the below question:
Now producer in DC1 is producing to the topic test-mm in DC1 and it
gets mirrored to the topic test-mm-DC1 in DC2. My assumption here is,
the offset gets synced so, with the same consumer group name, we can
run consumer application on both DCs and only one consumer will get
and process the message. Also, when the consumer application in DC1
goes down, the consumer application in DC2 will start processing and
we can achieve the real active/active for consumers. Is this correct?
See this post here, they ask a similar question: How are consumers setup in Active - Active Kafka setup
I've not configured MM2 in an active/active architecture before so can't confirm whether you would have two active consumers for each DC or one. Hopefully another member will be able to answer this question for you.

Does a kafka consumer machine need to run zookeeper?

So my question is this: If i have a server running Kafka (And zookeeper), and another machine only consuming messages, does the consumer machine need to run zookeeper too? Or does the server take care of all?
No.
Role of Zookeeper in Kafka is:
Broker registration: (cluster membership) with heartbeats mechanism to keep the list current
Storing topic configuration: which topics exist, how many partitions each
has, where are the replicas, who is the preferred leader, list of ISR for
partitions
Electing controller: The controller is one of the brokers and is responsible for maintaining the leader/follower relationship for all the partitions.
So Zookeeper is required only for kafka broker. There is no need to have Zookeper on the producer or consumer side.
The consumer does not need zookeeper
You have not mentioned which version of Kafka or the clients you're using.
Kafka consumers using 0.8 store their offsets in Zookeeper, so it is required for them. However, no, you would not run Zookeeper and consumers on the same server
From 0.9 and later, clients are separate from needing it (unless you want to manage external connections to Zookeeper on your own for storing data)

Doubts Regarding Kafka Cluster Setup

I have a use case I want to set up a Kafka cluster initially at the starting I have 1 Kafka Broker(A) and 1 Zookeeper Node. So below mentioned are my queries:
On adding a new Kafka Broker(B) to the cluster. Will all data present on broker A will be distributed automatically? If not what I need to do distribute the data.
Not let's suppose somehow the case! is solved my data is distributed on both the brokers. Now due to some maintenance issue, I want to take down the server B.
How to transfer the data of Broker B to the already existing broker A or to a new Broker C.
How can I increase the replication factor of my brokers at runtime
How can I change the zookeeper IPs present in Kafka Broker Config at runtime without restarting Kafka?
How can I dynamically change the Kafka Configuration at runtime
Regarding Kafka Client:
Do I need to specify all Kafka broker IP to kafkaClient for connection?
And each and every time a broker is added or removed does I need to add or remove my IP in Kafka Client connection String. As it will always require to restart my producer and consumers?
Note:
Kafka Version: 2.0.0
Zookeeper: 3.4.9
Broker Size : (2 core, 8 GB RAM) [4GB for Kafka and 4 GB for OS]
To run a topic from a single kafka broker you will have to set a replication factor of 1 when creating that topic (explicitly, or implicitly via default.replication.factor). This means that the topic's partitions will be on a single broker, even after increasing the number of brokers.
You will have to increase the number of replicas as described in the kafka documentation. You will also have to pay attention that the internal __consumer_offsets topic has enough replicas. This will start the replication process and eventually the original broker will be the leader of every topic partition, and the other broker will be the follower and fully caught up. You can use kafka-topics.sh --describe to check that every partition has both brokers in the ISR (in-sync replicas).
Once that is done you should be able to take the original broker offline and kafka will elect the new broker as the leader of every topic partition. Don't forget to update the clients so they are aware of the new broker as well, in case a client needs to restart when the original broker is down (otherwise it won't find the cluster).
Here are the answers in brief:
Yes, the data present on broker A will also be distributed in Kafka broker B
You can set up three brokers A, B and C so if A fails then B and C will, and if B fails then, C will take over and so on.
You can increase the replication factor of your broker
you could create increase-replication-factor.json and put this content in it:
{"version":1,
"partitions":[
{"topic":"signals","partition":0,"replicas":[0,1,2]},
{"topic":"signals","partition":1,"replicas":[0,1,2]},
{"topic":"signals","partition":2,"replicas":[0,1,2]}
]}
To increase the number of replicas for a given topic, you have to:
Specify the extra partitions to the existing topic with below command(let us say the increase from 2 to 3)
bin/kafktopics.sh --zookeeper localhost:2181 --alter --topic topic-to-increase --partitions 3
There is zoo.cfg file where you can add the IP and configuration related to ZooKeeper.

Why is my kafka topic not consumable with a broker down?

My issue is that I have a three broker Kafka Cluster and an availability requirement to have access to consume and produce to a topic when one or two of my three brokers is down.
I also have a reliability requirement to have a replication factor of 3. These seem to be conflicting requirements to me. Here is how my problem manifests:
I create a new topic with replication factor 3
I send several messages to that topic
I kill one of my brokers to simulate a broker issue
I attempt to consume the topic I created
My consumer hangs
I review my logs and see the error:
Number of alive brokers '2' does not meet the required replication factor '3' for the offsets topic
If I set all my broker's offsets.topic.replication.factor setting to 1, then I'm able to produce and consume my topics, even if I set the topic level replication factor to 3.
Is this an okay configuration? Or can you see any pitfalls in setting things up this way?
You only need as many brokers as your replication factor when creating the topic.
I'm guessing in your case, you start with a fresh cluster and no consumers have connected yet. In this case, the __consumer_offsets internal topic does not exist as it is only created when it's first needed. So first connect a consumer for a moment and then kill one of the brokers.
Apart from that, in order to consume you only need 1 broker up, the leader for the partition.

correct architecture to consume message from another datacenter in apache kafka

We have 2 different data centers DC1 and DC2. DC1 is active and DC2 is passive.
Now we have installed Apache Kafka in DC1 and created topics, wrote producers and consumers and able to push the data correctly from source to sink.
Now we have the following requirement.
we need to keep the sink of DC2 also in synch with DC1. it means the data which is pushed to topic A by producer need to be consumed by two consumers. The first consumer which is already working is from DC1 itself and the other consumer has to be from DC2.
We thought of coming up with a solution like this
Create another consumer group in DC2 which listens to the same topic in DC1.
We are not sure on how it is going to work and how we can make DC2 consumer group listen to DC1 topic.
What is the correct way of handling it and morrow it is possible that DC2 can become active and DC1 be passive to handle DR.
We read on MirrorMaker tool but not sure on how to use it and is that the correct solution to address our problem statement.
I guess the key question here is
Is DC2 full disaster recovery solution ? (
I mean, in case DC1 kafka fails, should DC2 have all the data and resources needed to continue operation ) ?
Option 1 (prefferred) : If the answer is YES, I would set up two different kafka clusters for DC1 and DC2. And use the MirrorMaker tool to consume topics from DC1 into DC2.
Take into account that you might have topics with "intermediate" data in kafka and if you are running the same processes in parallel in the two DCs you could have duplicate data in that topics if you replicate them in Mirror Maker.
Be very careful with the processes to recover DC1. Probably the easiest way to do this is to have DC1 as a passive copy when DC2 is taking the lead and replicate then data to DC1 with MirrorMaker.
Option 2 (more complicated): If the answer is NO, AND YOU ARE VERY DISCIPLINED / HAVE VERY STRICT PROCESSES and you AUDIT periodically/automatically your infrastructure then you can setup kafka brokers with rack aware replication (setting broker.rack=DC1 or DC2) and the cluster will send replicas to the brokers in the second DC <-- BUT your kafka installation in DC2 won't be "passive" at all.
Caveat here: You must set up always a minimum of two replicas for each topic (to avoid mistakes put default.replication.factor = [number of different DCs you have] in your kafka broker config - but be aware that this can be overriden).
If you have kafka in different DCs, I would also assign Kafka broker IDs that reflect which DC every broker "lives". For instance, for DC2 I would start numbering brokers with the number "200" and for DC1 brokers should start with "100".