Fault Tolerance and Kubernetes StatefulSet

As I understand it, most databases enable the use of replicas that can take over from a leader in case the leader is unavailable.
I'm wondering about the necessity of these replicas in a Kubernetes environment, when using, say, a StatefulSet. Once the pod becomes unresponsive, Kubernetes will restart it, right? And the PVC will make sure the data isn't lost.
Is it that leader election is a faster process than bringing up a new application?
Or is it that the only advantage of the replicas is to provide load balancing for read queries?

As I understand it, most databases enable the use of replicas that can take over from a leader in case the leader is unavailable.
I'm wondering about the necessity of these replicas in a Kubernetes environment, when using, say, a StatefulSet.
There has been a move from single-node databases to distributed databases. Distributed databases typically run with 3 or 5 replicas / instances in a cluster. The primary purpose of this is high availability and fault tolerance against e.g. node or disk failure. This is the same when the database is run on Kubernetes.
the PVC will make sure the data isn't lost.
The purpose of PVCs is to decouple the application configuration from the selection of the storage system. This allows you to deploy the same application on Google Cloud, AWS and Minikube without any different configuration, even though you will use different storage systems. It does not change how the storage systems themselves work.
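As a sketch, a claim like the one below only states what the application needs; which storage system actually backs it is decided per cluster. The `standard` storage class name is an assumption and varies between clusters:

```yaml
# Hypothetical claim: the application only declares its needs;
# the cluster maps "standard" to GCE PD, EBS, a local path, etc.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard  # assumption: cluster-specific
```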
Is it that leader election is a faster process than bringing up a new application?
Many different things can fail, the node, the storage system or the network can be partitioned so that you cannot reach a certain node.
Leader election is just one piece of the mitigation against these problems in a clustered setup; you also need replication of all data in a consistent way. The Raft consensus algorithm is a common solution for this in modern distributed databases.
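The arithmetic behind the usual 3- or 5-replica recommendation can be sketched as follows. This is just an illustration of Raft's majority-quorum rule, not database-specific code:

```python
def quorum(replicas: int) -> int:
    """Smallest majority needed to commit a write or elect a leader."""
    return replicas // 2 + 1

def tolerable_failures(replicas: int) -> int:
    """How many replicas can fail while the cluster stays available."""
    return replicas - quorum(replicas)

for n in (1, 3, 5):
    print(f"{n} replicas: quorum={quorum(n)}, tolerates {tolerable_failures(n)} failure(s)")
```

This is why a single instance (even one that Kubernetes restarts) tolerates zero failures without downtime, while 3 replicas tolerate one and 5 tolerate two.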
Or is it that the only advantage of the replicas is to provide load balancing for read queries?
This can be an advantage of distributed databases, yes. But in my experience it is seldom the primary reason for using them.

Highly available stateful session implementation with K8S

How to implement memory state/session replications with K8S? For instance, a web shopping cart system replicates the user HTTP sessions among cluster nodes over the network so that if a node is down, a process in another node can take over the user sessions.
K8S has StatefulSet, which uses disk storage to assure state persistence, I think. If a pod is down, the restarted pod takes over the state from the disk. However, the overhead of persisting in-memory user sessions to disk is high and may not be fast enough.
I suppose the solution could be using a memory cache server or etcd-like systems. Is that the established practice? In my understanding, K8S is good for stateless processing at scale, and StatefulSet was introduced to address stateful situations, but I am not sure it is a good fit for situations where fast stateful handover is required.
Please advise.
How to implement memory state/session replications with K8S? For instance, a web shopping cart system replicates the user HTTP sessions among cluster nodes over the network so that if a node is down, a process in another node can take over the user sessions.
To store the state, it's best to use Redis or another in-memory database.
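The pattern is to keep sessions out of pod memory entirely and in a shared store, so any pod can serve any request. A minimal sketch, using a plain dict as a stand-in for Redis (with a real client you would back this with Redis commands instead of an in-process dict):

```python
import json

class SessionStore:
    """Stand-in for a shared store such as Redis; all pods see the same data."""
    def __init__(self):
        # In a real deployment this lives in Redis, not in the pod's process.
        self._data = {}

    def save(self, session_id: str, session: dict) -> None:
        self._data[session_id] = json.dumps(session)

    def load(self, session_id: str) -> dict:
        raw = self._data.get(session_id)
        return json.loads(raw) if raw is not None else {}

# Any pod can pick up where another left off, because the state is external.
store = SessionStore()
store.save("user-42", {"cart": ["book", "mug"]})
print(store.load("user-42")["cart"])  # → ['book', 'mug']
```

The key point is that losing a pod loses no sessions; a pod only holds a session for the duration of one request.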
K8S has StatefulSet, which uses disk storage to assure state persistence, I think. If a pod is down, the restarted pod takes over the state from the disk. However, the overhead of persisting in-memory user sessions to disk is high and may not be fast enough.
You are right, but maybe you have not tried it before. I have been using Redis in production on K8s with millions of users and have never faced issues. Redis has two options for backing up keys if you deploy it on K8s:
RDB and append-only (AOF). So far I have never seen Redis crash, except due to running out of memory, so make sure your key eviction policy is set properly, e.g. to LRU.
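As a sketch, the settings mentioned above live in `redis.conf`; the values here are placeholders to adjust for your workload:

```
# RDB: snapshot to disk if at least 1 key changed within 900 s (example schedule)
save 900 1

# AOF: log every write for more durable recovery
appendonly yes
appendfsync everysec

# Evict least-recently-used keys instead of failing writes when memory is full
maxmemory 2gb
maxmemory-policy allkeys-lru
```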
In my understanding, K8S is good for stateless processing at scale
You are right, but people have also been using Deployments and StatefulSets to run Redis clusters and Elasticsearch clusters in K8s, with all the backup and scaling options.
It's easy to configure and manage a DB with K8s, while with VMs there is not much scalability.
We have been running StatefulSets with Redis, Elasticsearch and RabbitMQ in production for a long time and have not seen many issues. Make sure you attach a high-IOPS SSD disk to the Pod and you are good to go.
Nice example : https://github.com/loopbackio/loopback4-example-shopping/blob/master/kubernetes/README.md

Does kubernetes support non distributed applications?

Our store applications are not distributed applications. We deploy them on each node and configure them to store node-specific details. So they are tightly coupled to the node. Can I use Kubernetes for this use case? Would I get benefits from it?
Our store applications are not distributed applications. We deploy them on each node and configure them to store node-specific details. So they are tightly coupled to the node. Can I use Kubernetes for this use case?
Based on this information alone, it is hard to tell. But Kubernetes is designed so that it should be easy to migrate existing applications. E.g. you can use a PersistentVolumeClaim for the directories where your application stores information.
That said, it will probably be challenging. A cluster administrator wants to treat the Nodes in the cluster as "cattle" and throw them away when it's time to upgrade. If your app only has one instance, it will have some downtime, and your PersistentVolume should be backed by a storage system over the network - otherwise the data will be lost when the node is thrown away.
If you want to run more than one instance for fault tolerance, the app needs to be stateless - but it is likely not stateless if it stores local data on disk.
There are several ways to have applications running on fixed nodes of the cluster. It really depends on how those applications behave and why do they need to run on a fixed node of the cluster.
Usually such applications are stateful and may need to interact with a specific node's resources, or to write directly to a volume mounted on specific nodes for performance reasons, and so on.
It can be achieved with a simple nodeSelector or with node affinity ( https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/ )
Or with local persistent volumes ( https://kubernetes.io/blog/2019/04/04/kubernetes-1.14-local-persistent-volumes-ga/ )
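As a sketch, pinning a Pod to a labeled node with `nodeSelector` looks like this; the `disktype: ssd` label and the image are placeholders you would replace with your own:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: store-app
spec:
  nodeSelector:
    # Hypothetical label, applied beforehand with:
    #   kubectl label nodes <node-name> disktype=ssd
    disktype: ssd
  containers:
    - name: store-app
      image: example/store-app:1.0  # placeholder image
```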
With this said, if all the applications that need to run on the Kubernetes cluster are apps that must run on a single node, you lose a lot of the benefits, as Kubernetes works really well with stateless applications, which can be moved between nodes to obtain high availability and strong resilience to node failure.
The thing is that Kubernetes is complex and gives you a lot of tools to work with, but if you end up using only a small portion of them, I think it's overkill.
I would weigh the benefits you could get from adopting Kubernetes (an easy way to check the whole cluster's health; easy monitoring of logs, metrics and resource usage; strong resilience to node failure for stateless applications; load balancing of requests; and a lot more) against the cons and complexity it may bring (in particular, migrating to it can require a good amount of effort if you weren't already using containers to host your applications, and so on).

MariaDB Server vs MariaDB Galera Cluster HA Replication

I am planning to deploy HA database cluster on my kubernetes cluster. I am new to database and I am confused by the various database terms. I have decided on MariaDB and I have found two charts, MariaDB and MariaDB Galera Cluster.
I understand that both can achieve the same goal, but what are the main differences between the two? Under what scenarios should I use one or the other?
Thanks in advance!
I'm not an expert, so take my explanation with caution (and double-check it).
The main difference between the MariaDB's Chart and the MariaDB Galera Cluster's Chart is that the first one will deploy the standard master-slave (or primary-secondary) database, while the second one is a resilient master-master (or primary-primary) database cluster.
In more detail, this means the following:
MariaDB Chart will deploy a Master StatefulSet and a Slave StatefulSet which will spawn (with default values) one master Pod and 2 slave Pods. Once your database is up and running, you can connect to the master and write or read data, which is then replicated on the slaves, so that you have safe copies of your data available.
The copies can be used to read data, but only the master Pod can write new data to the database. Should the master Pod crash, or the Kubernetes cluster node where it is running malfunction, you will not be able to write new data until the master Pod is up and running again (which may require manual intervention), or until you perform a failover, promoting one of the other Pods to be the new temporary master (which also requires manual intervention, or some setup with proxies or virtual IPs and so on).
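To make the manual-failover step concrete: the promotion decision usually boils down to picking the most up-to-date replica. A hypothetical sketch, with replication positions simplified to plain integers (real tooling compares GTIDs or binlog positions):

```python
def pick_new_master(replicas: dict[str, int]) -> str:
    """Promote the replica with the highest replication position,
    so the least amount of committed data is lost on failover."""
    if not replicas:
        raise ValueError("no replicas available to promote")
    return max(replicas, key=replicas.get)

# Hypothetical replica names and replication positions
replicas = {"mariadb-slave-0": 1042, "mariadb-slave-1": 1057}
print(pick_new_master(replicas))  # → mariadb-slave-1
```

After the promotion, clients (or a proxy) still have to be repointed at the new master, which is exactly the step that makes master-slave failover a manual or semi-automated process.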
The Galera Cluster Chart, instead, will deploy something more resilient. With default values, it will create a single StatefulSet with 3 Pods, and each one of these Pods will be able to both read and write data, acting virtually as a master.
This means that if one of the Pods stops working for whatever reason, the other 2 will continue serving the database as if nothing happened, making the whole thing much more resilient. When the Pod that stopped working comes back up, it will obtain the new / changed data from the other Pods and get back in sync.
In exchange for the resilience of the whole infrastructure (it would be too easy if the Galera Cluster solution offered extreme resilience with no drawbacks), there are some cons to a multi-master setup, the most common being some added latency on operations, required to keep everything in sync and consistent, and added complexity, which can bring headaches.
There are several other limitations with Galera Cluster, such as explicit LOCKs of tables not working, or that all tables must declare a primary key. You can find the full list here (https://mariadb.com/kb/en/mariadb-galera-cluster-known-limitations/)
Deciding between the two solutions mostly depends on the following question:
Should one of your Kubernetes cluster nodes fail, do you need the database to keep working (and remain usable by your apps) as if nothing happened, even if one of its Pods was running on that particular node?

Can etcd detect problems and elect leaders for other clusters?

To my knowledge, etcd uses Raft as a consensus and leader selection algorithm to maintain a leader that is in charge of keeping the ensemble of etcd nodes in sync with data changes within the etcd cluster. Among other things, this allows etcd to recover from node failures in the cluster where etcd runs.
But what about etcd managing other clusters, i.e. clusters other than the one where etcd runs?
For example, say we have an etcd cluster and separately, a DB (e.g. MySQL or Redis) cluster comprised of master (read and write) node/s and (read-only) replicas. Can etcd manage node roles for this other cluster?
More specifically:
Can etcd elect a leader for clusters other than the one running etcd and make that information available to other clusters and nodes?
To make this more concrete, using the example above, say a master node in the MySQL DB cluster mentioned in the above example goes down. Note again, that the master and replicas for the MySQL DB are running on a different cluster from the nodes running and hosting etcd data.
Does etcd provide capabilities to detect this type of node failures on clusters other than etcd's automatically? If yes, how is this done in practice? (e.g. MySQL DB or any other cluster where nodes can take on different roles).
After detecting such failure, can etcd re-arrange node roles in this separate cluster (i.e. designate new master and replicas), and would it use the Raft leader selection algorithm for this as well?
Once it has done so, can etcd also notify client (application) nodes that depend on this DB and configuration accordingly?
Finally, does any of the above require Kubernetes? Or can etcd manage external clusters all by its own?
In case it helps, here's a similar question for ZooKeeper.
etcd's master election is strictly for electing a leader for etcd itself.
Other clusters, however, can use a distributed, strongly-consistent key-value store (such as etcd) to implement their own failure detection and leader election, and to allow clients of that cluster to respond accordingly.
Etcd doesn't manage clusters other than its own. It's not magic awesome sauce.
If you want to use etcd to manage a mysql cluster, you will need a component which manages the mysql nodes and stores cluster state in etcd. Clients can watch for changes in that cluster state and adjust.
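As a sketch of that pattern: a management component writes the current master into a key, and clients watch the key and reconnect when it changes. The in-memory store below is a stand-in for etcd, and the key name and addresses are hypothetical (with a real etcd client you would use its put and watch APIs, typically with leases for failure detection):

```python
class FakeKV:
    """In-memory stand-in for etcd: a key-value store with watch callbacks."""
    def __init__(self):
        self._data = {}
        self._watchers = {}

    def put(self, key, value):
        self._data[key] = value
        for callback in self._watchers.get(key, []):
            callback(value)  # notify watchers of the change

    def watch(self, key, callback):
        self._watchers.setdefault(key, []).append(callback)

kv = FakeKV()
seen_masters = []

# Client side: follow the "/mysql/master" key and "reconnect" on change.
kv.watch("/mysql/master", lambda addr: seen_masters.append(addr))

# Management component: detects a failure and publishes the new master.
kv.put("/mysql/master", "mysql-0.db:3306")  # initial master (hypothetical address)
kv.put("/mysql/master", "mysql-1.db:3306")  # failover to a replica

print(seen_masters[-1])  # → mysql-1.db:3306
```

Note that etcd only provides the consistent storage and notifications; the logic that decides a MySQL node is dead and picks its successor is the extra component you have to build or deploy.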

Architecture of Cluster ETCD at multi zones

If I have one etcd cluster separate from the masters, with one member per region (three regions), do the masters write to each etcd member at the same time, or is there one active node with the other nodes on standby? Do the nodes exchange data between themselves?
The main mechanism by which ETCD stores key-value data across a Kubernetes cluster is based on the Raft consensus algorithm. It is a comprehensive way of distributing configuration, state and metadata information within a cluster and of monitoring for any changes to that data.
Assuming that the Master node handles all the core components, it plays the role of the main contributor in managing the ETCD database, and the leader has the responsibility of keeping a consistent state across the other ETCD members, according to the distributed consensus algorithm based on a quorum model.
However, a single Master node configuration does not guarantee cluster resistance to all possible outages; therefore a multi-master setup is a more effective way to achieve high availability for ETCD storage, as it provides a consistent replica set of ETCD members distributed across separate nodes.
In order to establish data reliability, it is important to periodically back up the ETCD cluster via the built-in etcdctl command-line tool, or by making a snapshot of the volume where the ETCD storage is located.
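For example, a backup with etcdctl (v3 API) can look like the following; the endpoint, certificate paths and output path are placeholders for your cluster:

```
# Take a snapshot of the etcd keyspace (endpoint and paths are placeholders)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/etcd/ca.crt --cert=/etc/etcd/client.crt --key=/etc/etcd/client.key \
  snapshot save /var/backups/etcd-snapshot.db

# Verify the snapshot afterwards
ETCDCTL_API=3 etcdctl snapshot status /var/backups/etcd-snapshot.db
```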
You might be able to find more specific information about ETCD in the relevant GitHub project documentation.