Correct architecture to consume messages from another datacenter in Apache Kafka

We have two different data centers, DC1 and DC2. DC1 is active and DC2 is passive.
We have installed Apache Kafka in DC1, created topics, written producers and consumers, and are able to push data correctly from source to sink.
Now we have the following requirement:
We need to keep the sink in DC2 in sync with DC1 as well. That means the data pushed to topic A by the producer needs to be consumed by two consumers: the first, which is already working, is in DC1 itself, and the other has to be in DC2.
We thought of a solution like this:
Create another consumer group in DC2 that listens to the same topic in DC1.
We are not sure how this would work, or how we can make the DC2 consumer group listen to the DC1 topic.
What is the correct way of handling this? Also, tomorrow it should be possible for DC2 to become active and DC1 passive, to handle DR.
We have read about the MirrorMaker tool, but we are not sure how to use it, or whether it is the correct solution to our problem statement.

I guess the key question here is:
Is DC2 a full disaster recovery solution? (I mean, in case the DC1 Kafka fails, should DC2 have all the data and resources needed to continue operation?)
Option 1 (preferred): If the answer is YES, I would set up two different Kafka clusters for DC1 and DC2, and use the MirrorMaker tool to replicate topics from DC1 into DC2.
Take into account that you might have topics with "intermediate" data in Kafka; if you run the same processes in parallel in the two DCs and replicate those topics with MirrorMaker, you could end up with duplicate data in them.
Be very careful with the process to recover DC1. Probably the easiest approach is to treat DC1 as the passive copy while DC2 is taking the lead, and replicate the data back to DC1 with MirrorMaker.
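As a sketch of Option 1, assuming the legacy MirrorMaker (MM1) that ships with Kafka: the tool is started with a consumer config pointing at the source cluster and a producer config pointing at the target. The file names and topic name here are placeholders:

```shell
# consumer.properties points at DC1 (source),
# producer.properties points at DC2 (target)
bin/kafka-mirror-maker.sh \
  --consumer.config consumer.properties \
  --producer.config producer.properties \
  --whitelist "topicA"
```

Newer deployments would typically use MirrorMaker 2 (started via connect-mirror-maker.sh) instead, which also supports consumer-offset translation between clusters.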
Option 2 (more complicated): If the answer is NO, AND you are very disciplined / have very strict processes and you audit your infrastructure periodically/automatically, then you can set up the Kafka brokers with rack-aware replication (setting broker.rack=DC1 or DC2) and the cluster will place replicas on the brokers in the second DC. BUT your Kafka installation in DC2 won't be "passive" at all.
Caveat here: you must always set up a minimum of two replicas for each topic (to avoid mistakes, put default.replication.factor = [number of different DCs you have] in your Kafka broker config, but be aware that this can be overridden per topic).
If you have Kafka in different DCs, I would also assign broker IDs that reflect which DC each broker "lives" in. For instance, I would number the DC1 brokers starting at 100 and the DC2 brokers starting at 200.
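A minimal sketch of the Option 2 broker settings; the IDs and values are illustrative, not a tested configuration:

```properties
# server.properties for a broker in DC1
broker.id=101
broker.rack=DC1
default.replication.factor=2

# server.properties for a broker in DC2
broker.id=201
broker.rack=DC2
default.replication.factor=2
```

With broker.rack set, Kafka's rack-aware replica assignment spreads each partition's replicas across the two "racks" (here, the DCs), so every topic created with two or more replicas gets a copy in each DC.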

Related

Kafka 3.3.1 Active/Active consumers and producers

We have two different Kafka clusters with 10 brokers each, and each cluster has its own ZooKeeper ensemble. We have also set up MirrorMaker 2 to sync data between the clusters. With MM2, the offsets are synced along with the data.
We are looking to set up Active/Active for our consumer application as well as our producer application.
Let's say the clusters are DC1 and DC2.
The topic name is test-mm.
With the MM2 setup, the topics look like this:
In DC1:
test-mm
test-mm-DC2 (mirror of DC2)
In DC2:
test-mm
test-mm-DC1 (mirror of DC1)
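That topic layout corresponds to an MM2 configuration along these lines (the cluster aliases and bootstrap addresses are illustrative):

```properties
# mm2.properties - bidirectional replication for active/active
clusters = DC1, DC2
DC1.bootstrap.servers = dc1-broker1:9092
DC2.bootstrap.servers = dc2-broker1:9092

DC1->DC2.enabled = true
DC2->DC1.enabled = true
DC1->DC2.topics = test-mm
DC2->DC1.topics = test-mm

# translate committed consumer-group offsets to the target cluster
sync.group.offsets.enabled = true
```

This is started with bin/connect-mirror-maker.sh mm2.properties. Note that MM2's default naming prefixes the source cluster alias (e.g. DC2.test-mm); the test-mm-DC2 names in the question suggest a custom replication policy is in use.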
Consumer Active/Active
In DC1, I have an application consuming data from test-mm and test-mm-DC2 with the consumer group name group1-test.
In DC2, the same application is consuming data from test-mm and test-mm-DC1 with the consumer group name group1-test.
The application is running as Active/Active in both DCs.
Now a producer in DC1 produces to the topic test-mm in DC1, and it gets mirrored to the topic test-mm-DC1 in DC2. My assumption is: since the offsets get synced, with the same consumer group name we can run the consumer application in both DCs and only one consumer will receive and process each message. Also, when the consumer application in DC1 goes down, the consumer application in DC2 will start processing, and we achieve real active/active for consumers. Is this correct?
Producer Active/Active
It may not be possible with one producer in DC1 and another in DC2, as the message ordering may not be maintained across two different producers. I am not sure whether Active/Active can be achieved with producers.
You will want two producers, one producing to test-mm in DC1 and the other producing to test-mm in DC2. Once messages have been produced to test-mm in DC1, they will be replicated to test-mm-DC1 in DC2, and vice versa. This achieves active/active: the data exists in both DCs, your consumers are consuming from both DCs, and if one DC fails, the other producer and consumer will continue as normal. Please let me know if this has not answered your question.
Hopefully my comment answers your question about exactly-once processing with MM2. The Stack Overflow post I linked quotes the following paragraph from the IBM guide: https://ibm-cloud-architecture.github.io/refarch-eda/technology/kafka-mirrormaker/#record-duplication
This Cloudera blog also mentions that exactly-once processing does not apply across multiple clusters: https://blog.cloudera.com/a-look-inside-kafka-mirrormaker-2/
Cross-cluster Exactly Once Guarantee
Kafka provides support for exactly-once processing, but that guarantee is provided only within a given Kafka cluster and does not apply across multiple clusters. Cross-cluster replication cannot directly take advantage of the exactly-once support within a Kafka cluster. This means MM2 can only provide at-least-once semantics when replicating data across the source and target clusters, which implies there could be duplicate records downstream.
Now with regards to the question below:
"Now producer in DC1 is producing to the topic test-mm in DC1 and it gets mirrored to the topic test-mm-DC1 in DC2. My assumption here is, the offset gets synced so, with the same consumer group name, we can run consumer application on both DCs and only one consumer will get and process the message. Also, when the consumer application in DC1 goes down, the consumer application in DC2 will start processing and we can achieve the real active/active for consumers. Is this correct?"
See this post, where they ask a similar question: How are consumers setup in Active - Active Kafka setup
I've not configured MM2 in an active/active architecture before, so I can't confirm whether you would have two active consumers for each DC or one. Hopefully another member will be able to answer this question for you.

How to achieve producer idempotence in a Kafka cluster spread across two DCs?

I have a non-trivial problem with a Kafka cluster spread across two DCs. I want to have, at the same time: 1) Kafka producer idempotence and 2) async replication from DC1 to DC2. As is known, Kafka producer idempotence requires acks=all in the producer properties. That requires acknowledgements from the brokers in DC1, and in DC2 too.
My question is: how can I change the Kafka cluster architecture so that I can use an idempotent producer while keeping high availability of the brokers in DC1 and DC2? Preferring the brokers from DC1.
The parameter min.insync.replicas helps solve the problem. It sets the minimum number of in-sync replicas that must have persisted a write before it is acknowledged to the producer when acks=all is configured.
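A sketch of the relevant settings, assuming topics with three replicas and at least two of them kept in DC1 (the values are illustrative):

```properties
# producer side
enable.idempotence=true
acks=all

# broker/topic side
min.insync.replicas=2
```

Note that acks=all waits only for the replicas currently in the ISR, and min.insync.replicas sets the minimum ISR size for a write to be accepted. If the DC2 replicas lag and drop out of the ISR, writes can still be acknowledged by the DC1 replicas alone, which is what makes the DC1-to-DC2 replication effectively asynchronous.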

Does a kafka consumer machine need to run zookeeper?

So my question is this: if I have a server running Kafka (and ZooKeeper), and another machine only consuming messages, does the consumer machine need to run ZooKeeper too? Or does the server take care of it all?
No.
The role of ZooKeeper in Kafka is:
Broker registration (cluster membership), with a heartbeat mechanism to keep the list current
Storing topic configuration: which topics exist, how many partitions each has, where the replicas are, who the preferred leader is, and the list of ISRs per partition
Electing the controller: the controller is one of the brokers and is responsible for maintaining the leader/follower relationship for all the partitions.
So ZooKeeper is required only by the Kafka brokers. There is no need to have ZooKeeper on the producer or consumer side.
The consumer does not need ZooKeeper.
You have not mentioned which version of Kafka or of the clients you're using.
Kafka consumers on 0.8 store their offsets in ZooKeeper, so it is required for them. However, no, you would not run ZooKeeper and the consumers on the same server.
From 0.9 onwards, clients do not need it (unless you want to manage your own external connections to ZooKeeper for storing data).
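To illustrate: a modern (0.9+) consumer configuration only needs the broker addresses, with no zookeeper.connect entry (the host names here are illustrative):

```properties
# consumer configuration - brokers only, no ZooKeeper
bootstrap.servers=kafka-broker1:9092,kafka-broker2:9092
group.id=my-group
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=org.apache.kafka.common.serialization.StringDeserializer
```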

Is it possible to produce to a kafka topic when only 1 of the brokers is reachable?

Is it possible to produce to a Kafka topic when only 1 of the brokers is reachable from the producer, none of the zookeeper nodes are reachable from the producer, but all of the brokers are healthy and are reachable from each other?
For example, this would be required if I were to produce messages via an SSH tunnel. If this were for a temporary push I could possibly create the topic with replication factor 1 and have all partitions assigned to the broker in question, and reassign the partitions after the fact, but I'm hoping there is a more flexible setup.
This is all using the Java client.
Producers don't interact with ZooKeeper, so that's not an issue.
The only requirement for producers is to be able to connect to the brokers that are the leaders for the partitions they want to use.
If the broker you connect to is the leader for the partitions you want to use, then yes, you can produce to it.
Otherwise it's not going to work. Also, creating a topic may not help, as its partitions could be assigned to any brokers. In addition, in order to create a topic, a client has to connect to the controller, which may not be the broker you can reach.
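To check which broker leads the partitions you care about, you can describe the topic through the broker you can reach (the topic and host are placeholders; on older Kafka versions this tool took --zookeeper instead of --bootstrap-server):

```shell
bin/kafka-topics.sh --describe --topic my-topic \
  --bootstrap-server reachable-broker:9092
# The "Leader:" field shows the broker ID leading each partition.
```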
If you can only connect to 1 "thing", you may want to consider using something like a REST Proxy. Your "isolated" environment could send REST requests to the proxy which is able to connect to all brokers in the cluster.

Kafka mirror maker behaviour

Say we have two datacenters, DC1 and DC2. I am mirroring Kafka data from DC1 to DC2 using Kafka MirrorMaker. Only DC1 is active, and DC2 will become active once DC1 goes down.
As far as I know, both the Kafka topic and the offsets topic will be mirrored to DC2.
For example, I have produced 100 msgs to T1 in DC1, and around 80 msgs have been mirrored to DC2. In DC1, I have consumed around 90 msgs. Now DC1 goes down and I start consuming from DC2. My consumer would request the 91st message, but only 80 msgs have been mirrored. What will happen in this case? Since the required offset is not available, will this behave according to the value of auto.offset.reset?
In another case, say I consumed 90 msgs from DC1 but all 100 msgs have been mirrored to DC2. In this case, if I start consuming from DC2, 10 msgs will be duplicated, right?
What will happen if the offsets topic mirroring has not completed after successful processing?
MirrorMaker doesn't replicate the offsets. The source and destination can have different numbers of partitions and different offsets.
If you want to ensure exactly-once delivery to DC2 and no data loss, you need to have producer.properties and consumer.properties configured properly.
There are valid scenarios in which the consumer consumes a record from the source but the producer fails to write it to the destination. In that scenario, if enable.auto.commit is set to true, the offset will be committed periodically even if the event wasn't written to the destination. To avoid that, it should be set to false.
To ensure no data loss:
In consumer.properties, set enable.auto.commit=false
In the producer, add the following properties:
max.in.flight.requests.per.connection=1
retries=Int.MaxValue
acks=-1
block.on.buffer.full=true
For MirrorMaker, set --abort.on.send.failure true
Here are some best practices for mirror maker.
https://community.hortonworks.com/articles/79891/kafka-mirror-maker-best-practices.html
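Putting the list above into the two config files that kafka-mirror-maker.sh takes (the bootstrap addresses are illustrative; retries=Int.MaxValue is written out as 2147483647):

```properties
# consumer.properties (source cluster, DC1)
bootstrap.servers=dc1-broker1:9092
group.id=mirror-maker-group
enable.auto.commit=false

# producer.properties (destination cluster, DC2)
bootstrap.servers=dc2-broker1:9092
max.in.flight.requests.per.connection=1
retries=2147483647
acks=-1
```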
As for the consumers running on the destination cluster: they don't care how many records were consumed from the source cluster. They have their own __consumer_offsets. So on the first run, a consumer group starts from offset 0, and subsequent runs read from the last offset you committed.
If you want to read from offset 0, you can always set auto.offset.reset to "earliest".