Kafka leader election in multi-dc with an arbiter/witness/observer - apache-kafka

I would like to deploy a Kafka cluster in two datacenters with the same number of nodes on each DC. The first DC is used in active mode while the second is in passive mode.
For example, let's say that both datacenters have 3 nodes, with 2 in-sync replicas (ISR) in the first DC and one ISR in the second DC.
Is it possible to have a third DC containing an arbiter/witness/observer node such that, if one DC fails, a leader election can still succeed with the correct outcome in terms of consistency? MongoDB has such a feature, called a Replica Set Arbiter.
What about deploying ZooKeeper on the three datacenters? From my understanding ZooKeeper does not hold the Kafka data and it should not be contacted for each new record in the Kafka topic, i.e. you do not pay the latency to the third DC for each new record.

A Kafka Summit 2017 presentation, One Data Center is Not Enough: Scaling Apache Kafka Across Multiple Data Centers, discusses this setup. There is also some interesting information in a Confluent whitepaper, Disaster Recovery for Multi-Datacenter Apache Kafka® Deployments.
It says this could work, and calls it an observer node, but it also says no one has ever tried it.
ZooKeeper keeps track of the following metadata for Kafka (0.9.0+):
Electing a controller - The controller is one of the brokers and is responsible for maintaining the leader/follower relationship for all the partitions. When a node shuts down, it is the controller that tells other replicas to become partition leaders to replace the partition leaders on the node that is going away. ZooKeeper is used to elect a controller, make sure there is only one, and elect a new one if it crashes.
Cluster membership - which brokers are alive and part of the cluster? This is also managed through ZooKeeper.
Topic configuration - what overrides are there for that topic, where are the partitions located etc.
Quotas - how much data is each client allowed to read and write
ACLs - who is allowed to read and write to which topic
More detail on the dependency between Kafka and ZooKeeper can be found in the Kafka FAQ and in an answer on Quora from a Kafka committer working at Confluent.
From the resources I have read, a setup with two DCs (Kafka plus ZooKeeper) and an arbiter/witness/observer ZooKeeper node in a third DC with high latency could work, but I haven't found any resource that has actually experimented with it.
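To make the quorum reasoning behind this setup concrete, here is a minimal Python sketch. It assumes ZooKeeper's majority rule (the ensemble stays available only while more than half of its voting members can reach each other). The datacenter names and node counts are illustrative assumptions, not a tested deployment.

```python
def has_quorum(voters_per_dc, failed_dc):
    """Return True if a strict majority of voting ZooKeeper nodes
    survives the loss of one whole datacenter."""
    total = sum(voters_per_dc.values())
    alive = total - voters_per_dc.get(failed_dc, 0)
    return alive > total // 2


# Hypothetical layouts: number of voting ZooKeeper nodes per datacenter.
two_dc = {"dc1": 3, "dc2": 3}
with_third_dc = {"dc1": 2, "dc2": 2, "dc3": 1}

for name, layout in (("two_dc", two_dc), ("with_third_dc", with_third_dc)):
    for dc in layout:
        print(name, "losing", dc, "keeps quorum:", has_quorum(layout, dc))
```

With an even split over two DCs, losing either DC leaves exactly half of the voters, which is not a majority. Adding a single voter in a third DC breaks the tie. Note that this only helps if the third-DC node is a voting member; a ZooKeeper "observer" in the strict sense does not vote.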

Related

Setting up multi datacenter Kafka cluster

I am working on setting up a Kafka cluster across multiple DCs. The intention is to ensure that if one DC goes down, both producers and consumers can continue operating without any issues. I came across two options, but I am not sure what the difference is or how each works.
Option 1: Setting up multiple zookeeper cluster (one cluster per DC)
Setting up multiple ZooKeeper clusters, where each ZooKeeper cluster has a set of brokers in a DC. In this scenario, will I really get both Active-Active and Disaster Recovery? If one DC goes down, what will happen to consumers?
Option 2: Setting up Mirror maker with source and target
I understand it replicates one cluster to another. But how do I point to both clusters from a consumer or producer perspective? Will it be handled automatically, or is it something I should do manually?
Any explanation of these options is appreciated.

Kafka setup strategy for replication?

I have two VM servers (say S1 and S2) and need to install Kafka in cluster mode, with a topic that has only one partition and two replicas (one the leader, the other a follower) for reliability.
I got the high-level idea from this cluster setup and want to confirm whether the strategy below is correct.
First set up ZooKeeper as a cluster on both nodes for high availability (HA). If I set up ZK on a single node only and that node goes down, the complete cluster will be down, right? Is ZK still mandatory in the latest Kafka version? It looks like it is a must for older versions: Is Zookeeper a must for Kafka?
Start the Kafka broker on both nodes. They can use the same port, as they are hosted on different nodes.
Create the topic on any node with 1 partition and a replication factor of two.
ZooKeeper will select a broker on one node as leader and the other as follower.
A producer will connect to any broker and start publishing messages.
If the leader goes down, ZooKeeper will automatically select another node as leader. I am not sure how a replication factor of 2 will be maintained now, as there is only one live node.
Is the above strategy correct?
Useful resources
ISR
ISR vs replication factor
First set up zookeeper as cluster on both nodes for high
availability(HA). If I do setup zk on single node only and then that
node goes down, complete cluster will be down. Right ? Is it mandatory
to use zk in latest kafka version also ? Looks it is must for older
version Is Zookeeper a must for Kafka?
Answer: Yes. ZooKeeper is still required until KIP-500 is released. ZooKeeper is responsible for electing the controller, storing metadata about the Kafka cluster, and managing broker membership. Ideally there should be at least 3 ZooKeeper nodes; that way you can tolerate one node failure (the 2 healthy ZooKeeper nodes still form a majority of the ensemble and can elect a controller). You should also consider running the ZooKeeper cluster on machines other than those where Kafka is installed, so that the failure of one server won't take down both a ZooKeeper node and a Kafka node.
Start the kafka broker on both nodes . It can be on same port as it is
hosted on different nodes.
Answer: Start the ZooKeeper cluster first, then the Kafka cluster. Using the same port on different nodes is fine.
Create Topic on any node with partition 1 and replica as two.
Answer: Partitions are used for horizontal scalability. If you don't need that, one partition is okay. With a replication factor of 2, one of the nodes will be the leader and the other the follower at any time. But that is not enough to avoid data loss completely while also providing HA. The ideal configuration for avoiding data loss without compromising HA is at least 3 Kafka brokers, a replication factor of 3 for topics, min.insync.replicas=2 as broker config, and acks=all as producer config.
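As a rough illustration of why replication factor 3 with min.insync.replicas=2 is recommended over replication factor 2, here is a simplified Python model. It only captures the rule that, with acks=all, a write is rejected when the in-sync replica set is smaller than min.insync.replicas; the function names are made up for this sketch and it does not talk to a real cluster.

```python
def write_accepted(isr_size, min_insync_replicas, acks):
    """Simplified model of whether the partition leader accepts a write."""
    if isr_size < 1:
        return False  # no in-sync replica left, so no leader to write to
    if acks == "all":
        return isr_size >= min_insync_replicas
    return True  # acks=0 or acks=1 only need a live leader


def tolerated_broker_failures(replication_factor, min_insync_replicas):
    """Replicas that may drop out while acks=all writes are still accepted."""
    return replication_factor - min_insync_replicas


# Replication factor 3, min.insync.replicas=2: one broker can fail.
print(tolerated_broker_failures(3, 2), write_accepted(2, 2, "all"))
# Replication factor 2, min.insync.replicas=2: any failure blocks acks=all writes.
print(tolerated_broker_failures(2, 2), write_accepted(1, 2, "all"))
```

With only two brokers you must choose: min.insync.replicas=2 loses availability on the first failure, while min.insync.replicas=1 keeps accepting writes that exist on a single broker.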
zookeeper will select any broker on one node as leader and another as
follower
Answer: The controller broker is responsible for maintaining the leader/follower relationship for all the partitions. One broker will be the partition leader and the other will be the follower. You can check partition leaders/followers with this command:
bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic my-replicated-topic
Producer will connect to any broker and start publishing the message.
Answer: Yes. Listing a single broker in bootstrap.servers is enough to connect to the Kafka cluster, but for redundancy you should provide more than one broker in bootstrap.servers.
bootstrap.servers: A list of host/port pairs to use for establishing
the initial connection to the Kafka cluster. The client will make use
of all servers irrespective of which servers are specified here for
bootstrapping—this list only impacts the initial hosts used to
discover the full set of servers. This list should be in the form
host1:port1,host2:port2,.... Since these servers are just used for the
initial connection to discover the full cluster membership (which may
change dynamically), this list need not contain the full set of
servers (you may want more than one, though, in case a server is
down).
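For illustration, the host1:port1,host2:port2 format described above can be parsed and validated with a few lines of Python. This is only a sketch of the format, not how any Kafka client parses it internally, and the hostnames are placeholders.

```python
def parse_bootstrap_servers(value):
    """Split a 'host1:port1,host2:port2' string into (host, port) tuples."""
    servers = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate stray commas or whitespace
        host, sep, port = entry.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {entry!r}")
        servers.append((host, int(port)))
    return servers


# Placeholder hostnames: two brokers listed so one being down is tolerated.
print(parse_bootstrap_servers("s1.example.com:9092, s2.example.com:9092"))
```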
If leader goes down, zookeeper will select another node as leader
automatically . Not sure how replica of 2 will be maintained now as
there is only one node live now ?
Answer: If the controller broker goes down, ZooKeeper will select another broker as the new controller. If the broker that is the leader of your partition goes down, one of the in-sync replicas will become the new leader (the controller broker is responsible for this). But of course, if you have just two brokers and one of them is down, replication won't be possible, because the surviving broker has no other broker to replicate to. That's why you should have at least 3 brokers in your Kafka cluster.
Yes - ZooKeeper is still needed on Kafka 2.4, but you can read about KIP-500 which plans to remove the dependency on ZooKeeper in the near future and start using the Raft algorithm in order to create the quorum.
As you already understood, if you install ZK on a single node it will work in standalone mode and you won't have any resiliency. The classic ZK ensemble consists of 3 nodes, which allows you to lose 1 ZK node.
After pointing your Kafka brokers to the right ZK cluster you can start your brokers and the cluster will be up and running.
In your example, I would suggest adding another node in order to gain better resiliency and meet the replication factor you wanted, while still being able to lose one node without losing data.
Bear in mind that using a single partition means you are limited to a single active consumer per consumer group. The rest of the consumers in the group will be idle.
I suggest reading this blog about Kafka Best Practices and how to choose the number of topics/partitions in a Kafka cluster.
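The point about idle consumers can be seen in a small Python sketch. It uses a plain round-robin assignment as a stand-in; Kafka's real assignors are more involved, but they share the constraint that each partition is owned by exactly one consumer in a group. The consumer names are made up.

```python
def assign_partitions(partitions, consumers):
    """Round-robin sketch: every partition goes to exactly one group member."""
    if not consumers:
        raise ValueError("a consumer group needs at least one member")
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


group = ["consumer-a", "consumer-b", "consumer-c"]
print(assign_partitions([0], group))        # one partition: two members stay idle
print(assign_partitions([0, 1, 2], group))  # three partitions: all members busy
```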

Why does a Kafka cluster need a Controller when it already uses Zookeeper?

A Kafka cluster has a controller node and a ZooKeeper cluster, each with its own set of responsibilities. Why is the controller needed when we already have ZooKeeper?
For example: controller election is performed by ZooKeeper, while partition leader election is done by the controller. Why doesn't Kafka also use ZooKeeper for partition leader election, when it already has the information about which partitions are on which nodes and which nodes are actually active?
In short, I am struggling to understand why a controller is required even though ZooKeeper is present. It would be really helpful if someone could explain the reason for this design choice and its advantages.
kafka uses zookeeper for a few things:
cluster membership - the live brokers of a cluster are those who have ephemeral ZK nodes
leader election - election of the kafka broker that acts as a controller
state storage - some (mostly the older) state is stored in ZK - the configuration for topics, for example. some state that used to be in ZK has been migrated to special topics (consumer offsets) and some newer functionality was written to store state entirely in kafka (transaction logs, for example).
the general trend is to stop using state in ZK and instead self-host it (although older parts of the code have never been migrated out).
as for why not use ZK for partition leader election - one reason is there is logic involved. when electing a cluster leader broker there's no preference - any broker will do. this fits well with how ZK-based leader-election works (1st member to create and own an ephemeral znode wins).
when choosing a partition leader, however, you need a little bit more logic. for example - you'd like to elect the leader with the "highest watermark" (with the most up to date data, remember replication is generally async). there's also logic around unclean leader election. ZK alone cannot do that, hence it is done by the controller.
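to show the kind of logic a controller needs that a "first to create a znode wins" election cannot express, here is a simplified Python sketch. it is an assumption-laden model, not Kafka's actual code: it prefers a live replica that is in the ISR, and only falls back to an out-of-sync live replica when unclean leader election is allowed. it does not model offsets or watermarks.

```python
def elect_partition_leader(assigned_replicas, isr, live_brokers, allow_unclean=False):
    """Pick a leader for one partition, or None if no eligible replica exists."""
    # Clean election: first assigned replica that is both alive and in sync.
    for broker in assigned_replicas:
        if broker in live_brokers and broker in isr:
            return broker
    # Unclean election: accept a live but out-of-sync replica (may lose data).
    if allow_unclean:
        for broker in assigned_replicas:
            if broker in live_brokers:
                return broker
    return None


# Broker ids are illustrative. Broker 1 (old leader) has died, broker 2 is in sync.
print(elect_partition_leader([1, 2, 3], isr={1, 2}, live_brokers={2, 3}))
# Only the out-of-sync broker 3 is alive: no leader unless unclean election is on.
print(elect_partition_leader([1, 2, 3], isr={1, 2}, live_brokers={3}))
print(elect_partition_leader([1, 2, 3], isr={1, 2}, live_brokers={3}, allow_unclean=True))
```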
Zookeeper works as a coordination service and Kafka is using zookeeper for the same purpose.
Zookeeper is required by Kafka's design, because Zookeeper is responsible for a kind of management of the Kafka cluster. It holds the list of all Kafka brokers, and the controller of the cluster is selected through Zookeeper and stored there.
Kafka stores minimum information on Zookeeper.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-500%3A+Replace+ZooKeeper+with+a+Self-Managed+Metadata+Quorum
The controller exists in order to protect ZooKeeper. Without a controller, ZooKeeper would need to trigger too many listeners (one per broker), and most of these listeners would be useless, which is a potential risk for ZooKeeper. With a controller, only the controller needs to interact with ZooKeeper for this.

Zookeeper failures in Kafka 0.9 and above

This is based on the answer given in Is Zookeeper a must for Kafka?.
The responsibility of Zookeeper in Kafka 0.9 and above is clear from that answer.
I just wanted to understand what the impact will be if the Zookeeper cluster goes down completely.
kafka uses ZK for membership (figure out what brokers exist and which of them are alive) and leader election (elect the one broker that is controller for the cluster at any moment).
simply put - if ZK fails kafka dies.
if ZK sneezes (say a particularly long GC pause or a short network connectivity issue) a kafka cluster may temporarily "lose" any number of members and/or the controller. by the time this settles you may have a new controller and new leader brokers for all partitions (which may or may not cause loss of acknowledged data, see "unclean leader election"). I'm not sure if all ongoing transactions would be rolled back - never tried.

Kafka Producer, multi DC failover support

I have two distinct Kafka clusters located in different data centers, DC1 and DC2. How do I organize Kafka producer failover between the two DCs? If the primary Kafka cluster (DC1) becomes unavailable, I want the producer to switch to the failover Kafka cluster (DC2) and continue publishing to it. The producer should also be able to switch back to the primary cluster once it is available again. Any good patterns, existing libs, approaches, code examples?
Each partition of the Kafka topic your producer is publishing to has a separate leader, often spread across multiple brokers in the cluster, so the producer is connected to many “primary” brokers simultaneously. Should any one of them fail, another In Sync Replica (ISR) will be elected as leader and automatically take over. You do not need to do anything in your client app for it to reconnect to the new leader(s), retry any failed requests, and continue.
If this is for Multi-Data Center (MDC) failover then things get much more complicated depending on if the client apps die as well or if they keep running and need just their cluster connections to failover. Offsets are not preserved across multiple Kafka clusters so while producers are simpler, consumers need to call GetOffsetsForTimes() upon failover.
For a great write-up of the MDC failover modes and best practices, see the MDC Whitepaper here: https://www.confluent.io/white-paper/disaster-recovery-for-multi-datacenter-apache-kafka-deployments/
Since you asked only about producers, your app can detect if the primary cluster is down (say for a certain number of retries) and then instead of attempting to reconnect, it can instead connect to another brokerlist from the secondary cluster. Alternatively you can redirect the dns name of the brokerlist hosts to point to the secondary cluster.
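A minimal sketch of that retry-then-switch approach in Python follows. Everything here is hypothetical scaffolding: send_primary and send_secondary stand for whatever functions wrap your real producers for each cluster, primary_is_up is a health probe you would have to supply, and ConnectionError stands in for whatever error your client raises when the cluster is unreachable.

```python
class FailoverSender:
    """Send to the primary cluster; after `max_failures` consecutive errors
    switch to the secondary, and switch back once the primary is healthy."""

    def __init__(self, send_primary, send_secondary, primary_is_up, max_failures=3):
        self.send_primary = send_primary
        self.send_secondary = send_secondary
        self.primary_is_up = primary_is_up
        self.max_failures = max_failures
        self.failures = 0
        self.on_secondary = False

    def send(self, record):
        # Fail back as soon as the health probe reports the primary reachable.
        if self.on_secondary and self.primary_is_up():
            self.on_secondary = False
            self.failures = 0
        if self.on_secondary:
            return self.send_secondary(record)
        try:
            result = self.send_primary(record)
            self.failures = 0
            return result
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.max_failures:
                # Threshold reached: fail over and deliver this record to DC2.
                self.on_secondary = True
                return self.send_secondary(record)
            raise  # below the threshold the caller decides whether to retry
```

Records written to DC2 during the outage are not automatically present in DC1 after failing back, so consumers and any cross-cluster replication still need the care described in the whitepaper.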