If I have one etcd cluster separate from the masters, one per region (three regions), do the masters write to each etcd cluster at the same time, or is there one active node with the other nodes on standby? Do the nodes exchange data between themselves?
The way etcd stores a Kubernetes cluster's key-value data is based on the Raft consensus algorithm. It provides a consistent way to distribute configuration, state and metadata within the cluster and to watch for any changes to that data.
Assuming the master node hosts all of the core components, it acts as the main manager of the etcd database and carries the leader's responsibility of keeping a consistent state across the other etcd members, following the quorum-based distributed consensus algorithm.
However, a single-master configuration does not protect the cluster against outages; a multi-master setup is the more robust way to achieve high availability for etcd storage, since it keeps a consistent set of etcd member replicas distributed across separate nodes.
To keep the data safe, it is important to back up the etcd cluster periodically, either with the built-in etcdctl command-line tool (etcdctl snapshot save) or by taking a snapshot of the volume where the etcd data is stored.
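For example, here is a minimal sketch of taking a snapshot programmatically with the Go client's Maintenance API, which is roughly what etcdctl snapshot save does for you; the endpoint is a placeholder and TLS configuration is omitted:

```go
package main

import (
	"context"
	"io"
	"log"
	"os"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	// Stream the backend database from the member we are connected to.
	rc, err := cli.Snapshot(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer rc.Close()

	out, err := os.Create("etcd-backup.db")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	// Copy the snapshot stream to a local file that can be restored later.
	if _, err := io.Copy(out, rc); err != nil {
		log.Fatal(err)
	}
	log.Println("snapshot written to etcd-backup.db")
}
```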
You can find more specific information about etcd in the project's documentation on GitHub.
In HA Kubernetes clusters we configure multiple control planes (master nodes), but how do those control planes keep their data in sync? When we create a pod using a kubectl command, the request goes through the cloud load balancer to one of the control planes. I want to understand how the other control planes sync their data with the one that received the new request.
First of all, please note that the API server is the only component that talks directly with etcd.
Every change made to the Kubernetes cluster (e.g. kubectl create) creates an appropriate entry in the etcd database, and everything you get back from a kubectl get command is read from etcd.
In this article you can find a detailed explanation of the communication between the API server and etcd.
etcd uses the Raft protocol for leader election, and that leader handles all client requests that need cluster consensus (requests that do not require consensus can be processed by any cluster member):
etcd is built on the Raft consensus algorithm to ensure data store consistency across all nodes in a cluster—table stakes for a fault-tolerant distributed system.
Raft achieves this consistency via an elected leader node that manages replication for the other nodes in the cluster, called followers. The leader accepts requests from the clients, which it then forwards to follower nodes. Once the leader has ascertained that a majority of follower nodes have stored each new request as a log entry, it applies the entry to its local state machine and returns the result of that execution—a ‘write’—to the client. If followers crash or network packets are lost, the leader retries until all followers have stored all log entries consistently.
More information about etcd and the Raft consensus algorithm can be found in this documentation.
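If you want to see the leader in practice, here is a small sketch (the endpoints are placeholders and TLS configuration is omitted) that asks each member for its status through the Go client's Maintenance API and prints which member ID it reports as the current Raft leader:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	endpoints := []string{"http://etcd-1:2379", "http://etcd-2:2379", "http://etcd-3:2379"}

	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	for _, ep := range endpoints {
		// Every member answers Status, but they should all agree on the same leader ID.
		st, err := cli.Status(ctx, ep)
		if err != nil {
			log.Printf("%s: %v", ep, err)
			continue
		}
		fmt.Printf("%s  member=%x  leader=%x  raftTerm=%d\n",
			ep, st.Header.MemberId, st.Leader, st.RaftTerm)
	}
}
```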
To my knowledge, etcd uses Raft as a consensus and leader selection algorithm to maintain a leader that is in charge of keeping the ensemble of etcd nodes in sync with data changes within the etcd cluster. Among other things, this allows etcd to recover from node failures in the cluster where etcd runs.
But what about etcd managing other clusters, i.e. clusters other than the one where etcd runs?
For example, say we have an etcd cluster and separately, a DB (e.g. MySQL or Redis) cluster comprised of master (read and write) node/s and (read-only) replicas. Can etcd manage node roles for this other cluster?
More specifically:
Can etcd elect a leader for clusters other than the one running etcd and make that information available to other clusters and nodes?
To make this more concrete, using the example above, say a master node in the MySQL DB cluster mentioned in the above example goes down. Note again, that the master and replicas for the MySQL DB are running on a different cluster from the nodes running and hosting etcd data.
Does etcd provide capabilities to automatically detect this type of node failure on clusters other than its own? If yes, how is this done in practice? (e.g. MySQL DB or any other cluster where nodes can take on different roles).
After detecting such failure, can etcd re-arrange node roles in this separate cluster (i.e. designate new master and replicas), and would it use the Raft leader selection algorithm for this as well?
Once it has done so, can etcd also notify the client (application) nodes that depend on this DB so they can update their configuration accordingly?
Finally, does any of the above require Kubernetes? Or can etcd manage external clusters all by its own?
In case it helps, here's a similar question for ZooKeeper.
etcd's master election is strictly for electing a leader for etcd itself.
Other clusters, however, can use a distributed, strongly consistent key-value store (such as etcd) to implement their own failure detection and leader election, and to let clients of that cluster react.
Etcd doesn't manage clusters other than its own. It's not magic awesome sauce.
If you want to use etcd to manage a MySQL cluster, you will need a component that manages the MySQL nodes and stores the cluster state in etcd. Clients can watch for changes in that cluster state and adjust.
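As a sketch of what such a component could look like, here is a hypothetical manager that campaigns for leadership of a MySQL cluster through etcd's concurrency API, with clients observing the same election prefix to learn who the current primary is (the key prefix and endpoint are made up for illustration):

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// The session keeps a lease alive; if this manager process dies,
	// its candidacy expires and another candidate can win the election.
	sess, err := concurrency.NewSession(cli, concurrency.WithTTL(10))
	if err != nil {
		log.Fatal(err)
	}
	defer sess.Close()

	e := concurrency.NewElection(sess, "/mysql-cluster/primary") // hypothetical key prefix
	host, _ := os.Hostname()

	// Block until this candidate is elected; at that point it would promote
	// the MySQL node it manages to primary.
	if err := e.Campaign(context.Background(), host); err != nil {
		log.Fatal(err)
	}
	log.Printf("%s is now the designated primary manager", host)

	// Clients (or other managers) can observe the same prefix to learn about changes.
	for resp := range e.Observe(context.Background()) {
		if len(resp.Kvs) > 0 {
			log.Printf("current primary: %s", resp.Kvs[0].Value)
		}
	}
}
```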
I want to understand the possible impact of a master node failure in a k8s cluster with only one master node and an internal etcd store.
As per my understanding, all deployed workload containers (including stateless workloads and stateful sets with persistent volume claims) running on worker nodes would keep running until any container needs to be recreated, since they don't have a direct functional dependency on the master node and the etcd store for their core functions, and the unavailability of the master node only affects control plane operations for the cluster.
Is my understanding correct? If not, could you please explain the impact of the master node failure on my workload running on that cluster?
I understand that the best way to achieve HA for a k8s cluster is to set up a multi-master cluster, possibly with externalized etcd stores as well for decoupling. This question is to understand the exact impact of a master node failure, to make an informed call before configuring a multi-master cluster.
Etcd operates on a quorum system, so as long as the cluster sees a majority it will continue operating. If the failed node was the current leader, the others trigger an election after the heartbeat timeout.
For kube-apiserver, it's a horizontal service, so losing a node is not interesting, just like any other webapp. Some (most) controllers are singletons, but they run on every control plane node and use kube-apiserver for leader election, so, as with etcd, if the leader dies then a few seconds later another copy will get the leader lock and take over.
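For illustration, here is a minimal sketch of that Lease-based leader election using the same client-go machinery the core controllers rely on; the lease name is hypothetical, and it assumes in-cluster credentials with permission to manage Leases in kube-system:

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	id, _ := os.Hostname() // identity of this replica in the election

	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "demo-controller", Namespace: "kube-system"}, // hypothetical lease
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // timings similar to the core controllers' defaults
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				log.Println("this replica is now the active one")
			},
			OnStoppedLeading: func() {
				log.Println("lost the lease; another replica will take over")
			},
			OnNewLeader: func(identity string) {
				log.Printf("current leader: %s", identity)
			},
		},
	})
}
```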
I was going through the differences between Swarm and K8s; one of the cons of Swarm is that it has limited fault-tolerance functionality. How does K8s achieve fault tolerance? Is it via K8s multi-master? Please share your inputs.
Yes! To achieve fault tolerance in Kubernetes it is recommended to have multiple control plane (master) nodes, and if you are running on a cloud provider, multiple availability zones are recommended as well.
The Control Plane’s components make global decisions about the cluster (for example, scheduling), as well as detecting and responding to cluster events (for example, starting up a new pod when a deployment’s replicas field is unsatisfied).
Basically, the control plane is composed of these components:
kube-apiserver - Exposes the Kubernetes API; it is the front end for the Kubernetes control plane.
etcd - Key-value store used as Kubernetes' backing store for all cluster data.
kube-scheduler - Watches for newly created pods with no assigned node and selects a node for them to run on.
kube-controller-manager - Runs the controllers; their responsibilities include maintaining the correct number of pods for every replication controller, populating Endpoints objects, and responding when nodes go down.
cloud-controller-manager - Interacts with the underlying cloud provider.
Every cluster needs at least one worker node; the worker nodes are responsible for running your workloads.
Here’s the diagram of a Kubernetes cluster with all the components tied together:
For more info see here
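To tie this to a real cluster, here is a small client-go sketch (it assumes a kubeconfig at the default location and the node-role.kubernetes.io/control-plane label used by recent Kubernetes versions) that lists the nodes and shows which of them are control plane nodes and which are workers:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Load credentials from ~/.kube/config (assumed default location).
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	nodes, err := client.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, n := range nodes.Items {
		role := "worker"
		if _, ok := n.Labels["node-role.kubernetes.io/control-plane"]; ok {
			role = "control-plane"
		}
		fmt.Printf("%-30s %s\n", n.Name, role)
	}
}
```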
Yes, all Kubernetes control plane components are either clustered (etcd), run a leader election (controllers), or are flat (apiserver). Traditionally you run three control plane nodes, but you can do 5 in some complex topologies.
I see that most of the K8s master components have a leader election process, except the API server. If only one node is the leader at any point in time, why would we need more than 3 masters for a bigger k8s cluster?
The requirement of a minimum of 3 hosts comes from the fact that a Kubernetes HA cluster uses etcd for storing and syncing configuration, and etcd requires a minimum of 3 nodes to ensure HA. In general, an etcd cluster needs 2n+1 members to tolerate the failure of n of them, which is why an odd number of nodes is used when deploying a Kubernetes HA cluster.
In a single-master setup, the master node manages the etcd database, API server, controller manager and scheduler, along with the worker nodes. However, if that single master node fails, the control plane is lost: the cluster can no longer be managed, and if the etcd data cannot be recovered, the entire cluster state is lost with it.
In a multi-master setup, by contrast, the cluster gains high availability, and it can also perform better because all the masters act together as a unified control plane.
A multi-master setup protects against a wide range of failure modes, from the loss of a single worker node to the failure of one master node's etcd service. By providing redundancy, a multi-master cluster offers a highly available system to your end users.
Do not use a cluster with two master replicas. Consensus on a two-replica cluster requires both replicas to be running when changing persistent state. As a result, both replicas are needed, and a failure of either replica puts the cluster into a majority-failure state. A two-replica cluster is thus inferior, in terms of HA, to a single-replica cluster.
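The arithmetic behind this is the quorum rule: a cluster of n members needs floor(n/2) + 1 of them to agree before it can commit a change. A tiny sketch makes it clear why two replicas buy you nothing:

```go
package main

import "fmt"

// Quorum arithmetic for an etcd cluster: quorum = floor(n/2) + 1,
// and the number of member failures the cluster can survive is n - quorum.
func main() {
	for _, n := range []int{1, 2, 3, 4, 5} {
		quorum := n/2 + 1
		fmt.Printf("members=%d  quorum=%d  tolerable failures=%d\n", n, quorum, n-quorum)
	}
}

// Output:
// members=1  quorum=1  tolerable failures=0
// members=2  quorum=2  tolerable failures=0   <- no better than 1 member, but twice as likely that one fails
// members=3  quorum=2  tolerable failures=1
// members=4  quorum=3  tolerable failures=1
// members=5  quorum=3  tolerable failures=2
```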
Here is some useful documentation: kubernetes-ha-cluster, creating-ha-cluster.
Articles: ha-cluster, ha.