Kafka (MSK) and MirrorMaker2 disaster recovery procedures for producers and consumers

I have two questions I'm hoping someone with experience in MSK/Kafka and MirrorMaker2 can help with.
Currently we have a production MSK 2.7.0 cluster with 3 brokers and roughly 1 TB of topic data. We use the Debezium plugin for most things, a few JDBC/MySQL sink connectors, and a handful of miscellaneous consumers so far. For DR purposes, I'm considering adding a second MSK cluster of the same size and using MirrorMaker2 to replicate everything to it. I've done a fair amount of searching and reading about how others approach DR for Kafka, and MM2 seems to be the standard.
I've seen conflicting views on whether active/standby or active/active is recommended. Active/active seems ideal, but it comes with a lot of considerations for producers and consumers, mostly when event ordering is important. I'm curious if anyone can elaborate on that and how realistic it would be to set up that topology. Event ordering is important for most of our use cases.
For an active/standby configuration, it's not clear to me from what I've read what to plan for if the primary cluster goes down permanently and all of the consumers/producers have to migrate to the standby cluster. There's a lot written about how MM2 replicates its own offset data, but I'm not finding much about what a consumer needs to account for when it's moved over to the replicated topic. I'm especially interested in what it would mean to move the Debezium connectors over, and whether there is a built-in mechanism for such a thing or what I should expect.
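For context on the offset part of that question: MM2's checkpoint feature can be consumed through the connect-mirror-client library to translate consumer group offsets from the primary onto the replicated topics of the standby. The sketch below is one hedged way a failover step could look; the broker address, the cluster alias "primary" and the group id are placeholders, not anything confirmed from the setup above, and it only covers consumer group offsets (Debezium source connectors track their position in Kafka Connect's offset-storage topic, which is a separate concern).

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.mirror.RemoteClusterUtils;

public class FailoverOffsets {
    public static void main(String[] args) throws Exception {
        // Connection properties for the standby cluster, where MM2 has been
        // writing its checkpoint topics. Address is a placeholder.
        Map<String, Object> standbyProps = new HashMap<>();
        standbyProps.put("bootstrap.servers", "standby-broker:9092");

        // Translate the group's committed offsets from the primary cluster into
        // offsets that are valid on the replicated topics of the standby cluster.
        // "primary" must match the source cluster alias used in the MM2 config.
        Map<TopicPartition, OffsetAndMetadata> translated =
                RemoteClusterUtils.translateOffsets(
                        standbyProps, "primary", "my-consumer-group", Duration.ofSeconds(30));

        // Seed the (currently inactive) consumer group on the standby cluster,
        // then restart the consumers pointing at the standby bootstrap servers.
        try (AdminClient admin = AdminClient.create(standbyProps)) {
            admin.alterConsumerGroupOffsets("my-consumer-group", translated).all().get();
        }
    }
}
```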

Related

Running Kafka cost-effectively at the expense of lower resilience

Let's say I have a cheap and less reliable datacenter A, and an expensive and more reliable datacenter B. I want to run Kafka in the most cost-effective way, even if that means risking data loss and/or downtime. I can run any number of brokers in either datacenter, but remember that costs need to be as low as possible.
For this scenario, assume that no costs are incurred if brokers are not running. Also assume that producers/consumers run completely reliably with no concern for their cost.
Two ideas I have are as follows:
Provision two completely separate Kafka clusters, one in each datacenter, but keep the cluster in the more expensive datacenter (B) powered off. Upon detecting an outage in A, power on the cluster in B. Producers/consumers will have logic to switch between clusters.
Run the ZooKeeper cluster in B, with powered-on brokers in A and powered-off brokers in B. If there is an outage in A, then brokers in B come online to pick up where A left off.
Option 1 would be cheaper, but requires more complexity in the producers/consumers. Option 2 would be more expensive, but requires less complexity in the producers/consumers.
Is Option 2 even possible? If there is an outage in A, is there any way to have brokers in B come online, get elected as leaders for the topics, and have the producers/consumers seamlessly start sending to them? Again, data loss is okay and so is switchover downtime, but whichever option I choose must not require manual intervention.
Is there any other approach that I can consider?
Neither is feasible.
Topics and their records are unique to each cluster, and only one leader replica can exist for any Kafka partition within a cluster.
With these two pieces of information, example scenarios include:
Producers cut over to the new cluster and find the new leaders until the old cluster comes back.
Even if the above could happen instantaneously, or with minimal retries, where are consumers then supposed to read from? A single consumer cannot aggregate data from more than one bootstrap.servers at a time (see the sketch after this list).
So now you end up in a situation where both clusters always need to be available, with N consumer threads for the N partitions in the other cluster and M threads for the original cluster.
Meanwhile, producers are back to writing to the appropriate (cheaper) cluster, so data will potentially be out of order, since you have no control over which consumer threads process which data first.
Only after you track the consumer lag of the more expensive cluster's consumers will you be able to reasonably stop those threads and shut down that cluster, once you reach zero lag across all consumers.
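To make the bootstrap.servers point concrete: each consumer instance is bound to exactly one cluster, so "reading everything" during a failback means running one consumer per cluster and living with the interleaving. A minimal sketch, with broker addresses, topic and group names as placeholders:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DualClusterConsumers {
    // One KafkaConsumer only talks to the cluster named in bootstrap.servers.
    static KafkaConsumer<String, String> consumerFor(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers); // placeholder addresses
        props.put("group.id", "my-app"); // groups are per-cluster, so this is two separate groups
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        return new KafkaConsumer<>(props);
    }

    public static void main(String[] args) {
        KafkaConsumer<String, String> cheap = consumerFor("dc-a-broker:9092");
        KafkaConsumer<String, String> expensive = consumerFor("dc-b-broker:9092");
        cheap.subscribe(Collections.singletonList("events"));
        expensive.subscribe(Collections.singletonList("events"));

        // The application has to poll both and has no control over interleaving,
        // which is exactly where the ordering problem described above comes from.
        while (true) {
            ConsumerRecords<String, String> fromA = cheap.poll(Duration.ofMillis(200));
            ConsumerRecords<String, String> fromB = expensive.poll(Duration.ofMillis(200));
            fromA.forEach(r -> System.out.println("A: " + r.value()));
            fromB.forEach(r -> System.out.println("B: " + r.value()));
        }
    }
}
```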
Another thing to keep in mind is that topic creation/update/delete events aren't automatically synced across clusters, so Kafka Streams apps, especially, will be unable to maintain state with this approach.
You can use tools like MirrorMaker or Confluent Replicator / Cluster Linking to help with all this, but I've personally never seen the client failover piece handled very well, especially when record order and idempotent processing matter.
Ultimately, this is what availability zones are for. From what I understand, the chances of a cloud provider losing more than one availability zone at a time are extremely slim. So you'd set up one Kafka cluster across 3 or more availability zones and configure "rack awareness" so Kafka accounts for its installation locations.
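A minimal sketch of what that rack awareness looks like in practice, assuming three AZs: the broker side is a server.properties setting, and (optionally, Kafka 2.4+) consumers can be told which AZ they live in so they fetch from the nearest replica. AZ names and addresses below are placeholders.

```java
import java.util.Properties;

public class RackAwarenessExample {
    public static void main(String[] args) {
        // Broker side (one entry per broker's server.properties):
        //   broker.rack=us-east-1a   <- the AZ this broker runs in
        // With replication.factor=3 and brokers spread over 3 AZs, Kafka places
        // the replicas of each partition in different racks/AZs.
        //
        // Optionally, to let consumers read from the nearest replica (KIP-392):
        //   replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector

        // Consumer side: tell the client which "rack" (AZ) it lives in.
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-az-a:9092"); // placeholder
        props.put("group.id", "my-app");
        props.put("client.rack", "us-east-1a"); // matches broker.rack of the local AZ
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        // A KafkaConsumer created with these props would prefer the replica in its own AZ.
    }
}
```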
If you want to keep the target/passive cluster shut down while it's not needed and then spin it up on demand, you should be OK as long as you don't need any history and don't care about the consumer lag gap in the source cluster. Obviously this is use-case dependent.
MM2, or any sort of asynchronous directional replication, requires the target cluster to be running all the time.
A stretch cluster is not really doable because of the two-DC constraint; whether you use Raft or ZooKeeper, you need a third DC for quorum, and that would probably be your most expensive option.
Redpanda can offload all of your log segments to S3 and index them so they can be used by other clusters. So if you constantly wrote one copy of your log segments to a storage array with an S3 interface in your standby DC, it might be palatable: whenever needed, you spin up a cluster on demand in the target DC, point it at the object store, and you can immediately start producing and consuming with your new clients.

Backing up Kafka in a Non-Idempotent Architecture

We're heading up the architecture of an event-sourced solution leveraging both Kafka Streams and vanilla Kafka producing/consuming of messages. The "workers" are not idempotent; we are instead relying on consumer group offsets. Kafka acts as our single source of truth.
Looking through options for true backup (i.e., replication is not backup) solutions for Kafka, we've seen the variety of S3 connectors out there, and we checked out Confluent's Replicator as well as the upcoming MirrorMaker 2. None of them – to my knowledge – offers true protection against topic deletion, nor point-in-time restoration.
As a small-ish startup, we're keen on not operating Kafka in-house, although disk-level snapshots (not offered by any vendor known to me) seem to be the way forward.
Having looked at AWS MSK, Confluent and CloudKarafka, I am interested in hearing about your experiences, pros/cons and solutions in similar (or at least comparable) architectures...

By what criteria should Kafka partitions be split?

I have a few questions that came up while I was preparing a Kafka service.
First question.
What is the recommended criterion for partitioning busy services?
I understand it is a good idea to decide the number of partitions based on the memory of the producers and the consumers.
Are there any criteria to determine the number of partitions?
An account of your own experience would also be a great help.
Second question.
Sometimes only one broker ends up being busy while the Kafka service is running.
How do I fix this?
Is there any way to prevent it?
Third question.
Is there any way I can detect a dirty shutdown of a broker?
In general, the more partitions you have in your Kafka cluster, the higher the throughput. However, note that having too many partitions in total has a negative impact on things like availability and latency. This article from Confluent can shed some light regarding your first question.
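The rule of thumb from that Confluent article boils down to a small calculation: measure the throughput a single partition gives you on the producer side (p) and on the consumer side (c), and size for the target throughput you want (t). A rough sketch with made-up numbers:

```java
public class PartitionSizing {
    public static void main(String[] args) {
        // Numbers below are placeholders; measure p and c on your own hardware
        // with the producer/consumer perf tools before trusting any of this.
        double targetMbPerSec = 100.0;              // t: throughput you want to sustain
        double producerMbPerSecPerPartition = 10.0; // p: measured single-partition produce rate
        double consumerMbPerSecPerPartition = 20.0; // c: measured single-partition consume rate

        // Rule of thumb: #partitions >= max(t/p, t/c), then add headroom for growth,
        // since adding partitions later changes the key-to-partition mapping.
        int partitions = (int) Math.ceil(Math.max(
                targetMbPerSec / producerMbPerSecPerPartition,
                targetMbPerSec / consumerMbPerSecPerPartition));

        System.out.println("Suggested minimum partitions: " + partitions); // 10 with these numbers
    }
}
```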
Coming to your second question: a topic is made up of at least one partition, and a Kafka broker hosts multiple partitions of various topics. Some of these partitions are leaders and some are follower replicas of partitions on other brokers, so a broker might have some active partitions (leaders) and some inactive ones (replicas). I guess that in your case only a single broker holds leader partitions, so you need to check your replication and partitioning strategies.
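One way to check whether the leaders are skewed onto a single broker is to describe your topics and count leaders per broker. A minimal sketch with the Java AdminClient, where the bootstrap address and topic names are placeholders:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;

public class LeaderSkewCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            Map<String, TopicDescription> topics = admin
                    .describeTopics(Arrays.asList("orders", "payments")) // placeholder topics
                    .all().get();

            // Count how many partition leaders each broker currently hosts.
            Map<Integer, Integer> leadersPerBroker = new HashMap<>();
            topics.values().forEach(td -> td.partitions().forEach(p ->
                    leadersPerBroker.merge(p.leader().id(), 1, Integer::sum)));

            // If one broker dominates this map, revisit partition counts, replication
            // factor and keying strategy (or run a preferred-leader election).
            leadersPerBroker.forEach((broker, count) ->
                    System.out.println("broker " + broker + " leads " + count + " partitions"));
        }
    }
}
```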
Regarding your last question, you need to consider a Kafka cluster monitoring tool such as Confluent's Control Centre, or Landoop's Kafka LENSES.

What is the best approach to keep two Kafka clusters in sync?

I have to set up two Kafka clusters in two different data centers (DCs), which have the same topics and configuration. The reason is that the connectivity between the two data centers is poor, so we cannot create a single global cluster.
We have producers and consumers that publish and subscribe to the topics in each DC.
The problem is that I need to keep both clusters in sync.
Let's say all messages written to the first DC should eventually be replicated to the second, and the other way around.
I am evaluating the Kafka MirrorMaker tool, creating the mirror by consuming messages from the first cluster and producing them to the second one. However, it is also required to replicate data from the second cluster to the first, because writing is allowed in both clusters.
I don't think the Kafka MirrorMaker tool fits our case.
I'd appreciate any suggestions.
Thanks in advance.
Depending on your exact requirements, you can use MirrorMaker for your use case.
One option would be to just have two separate topics, let's call them topic1 on cluster 1 and topic2 on cluster 2. All your producing threads write to the "local" topic and you use MirrorMaker to replicate this topic to the remote cluster.
For your consumers, you simply subscribe to both topics on whatever cluster is closest to you; that way you will get all records that were written on either cluster.
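A minimal sketch of that consumer side, assuming the topic1/topic2 names from above and using a placeholder bootstrap address for whichever cluster is local:

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class LocalPlusMirroredConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "dc1-broker:9092"); // the cluster closest to this consumer
        props.put("group.id", "my-app");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // topic1 is produced locally, topic2 is the mirrored copy from the remote DC,
            // so subscribing to both yields everything written in either data center.
            consumer.subscribe(Arrays.asList("topic1", "topic2"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> System.out.printf("%s-%d@%d: %s%n",
                        r.topic(), r.partition(), r.offset(), r.value()));
            }
        }
    }
}
```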
Alternatively, you could create aggregation topics on both clusters and use MirrorMaker to replicate data into them; this would enable you to have all data in one topic for consumption.
You would have duplicate data on the same cluster this way, but you could take care of this with lower retention settings on the input topics.
In order for this to work, you will need to configure MirrorMaker to replicate a topic into a topic with a different name, which is not a standard thing for it to do. I have written a small blog post on how to do this, if you want to investigate this option further.

Is Kafka useful if we have fewer messages to process?

Is Kafka useful if we have fewer messages to process? If I have 1,000 messages per second to process, is Kafka feasible?
As any experienced software engineer will say, it depends ;-). There are many factors to consider. Here is just a sample:
Do you need to have these messages persisted? If not, then probably Kafka is not what you're looking for.
Even if you require persistence, it doesn't mean that Kafka can handle your throughput requirements (although my gut feeling says it can cope with your volume). The only way to determine that is to run performance tests with your message volumes against Kafka and see how it copes. It's also quite possible that other brokers like ActiveMQ can handle your volumes as well. Then it comes down to how appropriate the broker is for your use case (e.g., event sourcing). Check out Kafka's docs to see how Kafka is used in the industry.
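As a starting point for such a test, a sketch like the one below pushes a minute's worth of traffic at your stated 1,000 messages per second and reports what the producer actually achieved. The broker address, topic name, payload size and acks setting are placeholders to adjust; Kafka also ships kafka-producer-perf-test for the same purpose.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleThroughputTest {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092"); // placeholder
        props.put("acks", "all");                      // test with the durability you actually need
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");

        byte[] payload = new byte[1024]; // ~1 KB messages; adjust to your real message size
        int total = 60_000;              // one minute's worth at 1,000 msg/s

        long start = System.nanoTime();
        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < total; i++) {
                producer.send(new ProducerRecord<>("perf-test", Integer.toString(i), payload));
            }
            producer.flush(); // make sure everything is actually on the brokers before timing
        }
        double seconds = (System.nanoTime() - start) / 1_000_000_000.0;
        System.out.printf("Sent %d messages in %.1fs (%.0f msg/s)%n",
                total, seconds, total / seconds);
    }
}
```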
You have to keep in mind that Kafka is currently not as popular as other brokers such as ActiveMQ. So even if Kafka is useful for your scenario, you could have a hard time finding help with the Kafka questions/issues you'll have along the way.