MongoDB replica set loses database when updating primary replica (Kubernetes + Bitnami Helm chart)

I am using MicroK8s and the Bitnami Helm chart here.
I set up a replica set of 3 replicas:
mongo-0 (by definition this is the primary), mongo-1 and mongo-2.
Bitnami configures the replica set to always use mongo-0 (if available) as the primary replica. However, the following can happen: I find out I need to update the nodes, let's say to increase storage. To do this, I would need to:
Drain the node running mongo-0. This automatically triggers a new election, and let's say mongo-1 is the new primary.
Add a new node (with more capacity) to the cluster.
This makes the replica set assign a mongo-0 pod to the new node. However, the new node is empty, so the persistent volume where I store the database (let's say /mnt/mongo) is empty.
I would expect the current primary replica to finish populating the database on the new replica (mongo-0, and therefore its persistent volume) and ONLY when that is done, make mongo-0 the primary.
However, I saw that mongo-0 becomes primary without any data being copied to it from the previous primary, effectively deleting the whole database, since the primary node now states that the database is empty.
How is that possible? What am I missing here?

I am not familiar with your exact management tools, but the scaling process you described is wrong. You should not be removing 1 of the 3 nodes from the replica set at any point, at least not in a production environment.
To replace an RS node:
1. Add a fourth node with the desired parameters.
2. Set node priorities such that the node you want to remove has a lower priority than any of the other nodes.
3. Wait for the newly added node to have an acceptable replication lag.
4. Ensure the primary is not the node you want to remove.
5. Remove that node from the RS.
Expecting generic software to automatically figure out when step 3 completes and move on to step 4 correctly is, I would say, rather optimistic. Maybe MongoDB Ops Manager would do that.
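For concreteness, a rough mongosh sketch of steps 1-5 could look like the following; the hostnames and the headless-service name are made-up placeholders and this is not a Bitnami-specific procedure, so adapt it to your own deployment:

```javascript
// Run in mongosh against the current primary.
// "mongo-new" / "mongo-old" and "mongo-headless" are illustrative placeholders.

// Step 1: add the new member. Starting it with priority 0 and votes 0 keeps it
// from being elected before it has finished its initial sync.
rs.add({ host: "mongo-new.mongo-headless:27017", priority: 0, votes: 0 });

// Step 3: watch replication lag until the new member has caught up.
rs.printSecondaryReplicationInfo();

// Step 2 (applied once the sync is done): give the new member normal
// priority/votes and lower the priority of the member you want to remove.
cfg = rs.conf();
cfg.members.forEach(function (m) {
  if (m.host.startsWith("mongo-new")) { m.priority = 1; m.votes = 1; }
  if (m.host.startsWith("mongo-old")) { m.priority = 0; }
});
rs.reconfig(cfg);

// Step 4: if the member you want to remove is still the primary, make it step
// down (run this while connected to that member).
rs.stepDown();

// Step 5: remove the old member.
rs.remove("mongo-old.mongo-headless:27017");
```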
Your post contains a number of other incorrect statements about how MongoDB operates. For example, a node that has no data cannot become a primary in a replica set with data. Perhaps you have other configuration issues going on and you actually have independent nodes/multiple replica sets in what you think is a single deployment.

Related

Running DB as Kubernetes Deployment or StatefulSet?

I would like to run a single pod of MongoDB in my Kubernetes cluster. I would be using a node selector to get the pod scheduled on a specific node.
Since Mongo is a database and I am using a node selector, is there any reason for me not to use a Kubernetes Deployment over a StatefulSet? Please elaborate on whether we should never use a Deployment.
You should not run a database (or any other stateful workload) as a Deployment; use a StatefulSet for those.
They have different semantics while updating or when a pod becomes unreachable. StatefulSets use at-most-X semantics and Deployments use at-least-X semantics, where X is the number of replicas.
For example, if the node becomes unreachable (say, due to a network issue), a Deployment will create a new Pod on a different node (to satisfy your desired count of 1 replica), whereas a StatefulSet will make sure the existing Pod is terminated before creating a new one, so that there is never more than 1 (when your desired number of replicas is 1).
If you run a database, I assume you want the data to be consistent, so you don't want duplicate instances with different data (though you should probably run a distributed database instead).
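To make the difference concrete, here is a minimal StatefulSet sketch for a single MongoDB pod pinned with a node selector; the names, labels, image tag, node label and storage size are assumptions for illustration, not values from the question:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mongo
spec:
  clusterIP: None            # headless Service, gives the pod a stable DNS name
  selector:
    app: mongo
  ports:
    - port: 27017
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mongo
spec:
  serviceName: mongo         # ties the stable identity (mongo-0) to the Service above
  replicas: 1
  selector:
    matchLabels:
      app: mongo
  template:
    metadata:
      labels:
        app: mongo
    spec:
      nodeSelector:
        kubernetes.io/hostname: node-1   # example pin to a specific node; adjust
      containers:
        - name: mongo
          image: mongo:6.0
          ports:
            - containerPort: 27017
          volumeMounts:
            - name: data
              mountPath: /data/db
  volumeClaimTemplates:      # gives the pod its own PersistentVolumeClaim
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```

The headless Service plus volumeClaimTemplates are what give the pod its stable identity (mongo-0) and its own PersistentVolumeClaim; a Deployment gives you neither.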

MariaDB Server vs MariaDB Galera Cluster HA Replication

I am planning to deploy an HA database cluster on my Kubernetes cluster. I am new to databases and I am confused by the various database terms. I have decided on MariaDB and I have found two charts, MariaDB and MariaDB Galera Cluster.
I understand that both can achieve the same goal, but what are the main differences between the two? Under what scenarios should I use one or the other?
Thanks in advance!
I'm not an expert, so take my explanation with a grain of salt (and double-check it).
The main difference between the MariaDB chart and the MariaDB Galera Cluster chart is that the first one deploys a standard master-slave (or primary-secondary) database, while the second one deploys a resilient master-master (or primary-primary) database cluster.
In more detail, this means the following:
The MariaDB chart will deploy a master StatefulSet and a slave StatefulSet, which will spawn (with default values) one master Pod and 2 slave Pods. Once your database is up and running, you can connect to the master and write or read data, which is then replicated to the slaves, so that you have safe copies of your data available.
The copies can be used to read data, but only the master Pod can write new data to the database. Should that Pod crash, or the Kubernetes cluster node where it is running malfunction, you will not be able to write new data until the master Pod is up and running again (which may require manual intervention), or until you perform a failover, promoting one of the other Pods to be the new temporary master (which also requires manual intervention or some setup with proxies, virtual IPs and so on).
The Galera Cluster chart instead will deploy something more resilient. With default values, it will create a single StatefulSet with 3 Pods, and each of these Pods will be able to both read and write data, acting virtually as a master.
This means that if one of the Pods stops working for whatever reason, the other 2 will continue serving the database as if nothing happened, making the whole thing much more resilient. When the Pod that stopped working comes back up, it will obtain the new/changed data from the other Pods and get back in sync.
In exchange for the resilience of the whole infrastructure (it would be too easy if the Galera Cluster solution offered extreme resilience with no drawbacks), there are some cons to a multi-master setup, the most common being some added latency on operations, required to keep everything in sync and consistent, and added complexity, which can often bring headaches.
There are several other limitations with Galera Cluster, such as explicit table LOCKs not working or the requirement that all tables declare a primary key. You can find the full list here (https://mariadb.com/kb/en/mariadb-galera-cluster-known-limitations/).
Deciding between the two solutions mostly depends on the following question:
Do you need the database to keep working (and remain usable by your apps) as if nothing happened should one of your Kubernetes cluster nodes fail, even if one of its Pods was running on that particular node?

Accessing replica set information from within the pod

I want to access the number of replicas and also the current replica id for a given pod, from inside the pod itself.
For example, if there are 3 replicas of any given pod, say foo_A, foo_B and foo_C, created in that specific order, is it possible to have the total number of replicas and the index of the pod within the replica set available inside the pod?
Also, I understand that, with old pods getting killed and new ones coming up, the index of a pod within the replica set can change dynamically.
I know this can be achieved using the Downward API, but which fields should I access?
Could anyone please help?
As mentioned in comments, you can use StatefulSets:
StatefulSet Pods have a unique identity that is comprised of an ordinal, a stable network identity, and stable storage. The identity sticks to the Pod, regardless of which node it's (re)scheduled on.
As you can see here, your pods will be created in an ordinal sequence:
For a StatefulSet with N replicas, when Pods are being deployed, they are created sequentially, in order from {0..N-1}.
When Pods are being deleted, they are terminated in reverse order, from {N-1..0}.
Before a scaling operation is applied to a Pod, all of its predecessors must be Running and Ready.
Before a Pod is terminated, all of its successors must be completely shutdown.
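As a concrete sketch, assuming a StatefulSet named foo (the image and variable names are my own choices): the Downward API can inject the pod name (foo-0, foo-1, ...), from which the ordinal can be parsed, but it does not expose the replica count, so that value has to be supplied separately (hard-coded, templated by Helm/kustomize, or looked up via the API server). A matching headless Service named foo is assumed and omitted here.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: foo
spec:
  serviceName: foo
  replicas: 3
  selector:
    matchLabels:
      app: foo
  template:
    metadata:
      labels:
        app: foo
    spec:
      containers:
        - name: app
          image: busybox:1.36
          # POD_NAME is e.g. "foo-2"; ${POD_NAME##*-} strips everything up to the
          # last "-", leaving the ordinal index ("2").
          command: ["sh", "-c", "echo \"replica ${POD_NAME##*-} of ${REPLICAS}\"; sleep 3600"]
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name   # Downward API: the pod's own name
            - name: REPLICAS
              value: "3"                     # not exposed by the Downward API; set or template it yourself
```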

Running multiple pods for postgres in kubernetes. Is it safe for production?

My Kubernetes cluster has 3 pods for Postgres. I have configured a persistent volume outside of the cluster, on a separate virtual machine. Now, as per Kubernetes design, multiple pods will be responding to clients' read/write requests. Are there any deadlock or multiple-write issues that can occur between multiple Postgres pods?
You would need a leader election system between them. There can be only one active primary in a Postgres cluster at a time (give or take very very niche cases). I would recommend https://github.com/zalando-incubator/postgres-operator instead.
I agree with the previous answer. In the case you've described, it is better to use a Postgres cluster where only one instance acts as primary and the others act as secondaries. When the primary fails, one of the secondaries becomes primary, and later, when the failed primary comes back, it is added as a secondary of the new primary instance. Leader election is responsible for promoting a secondary to be the new primary. That's how the cluster is managed.
Besides the previous one, you can also use KubeDB for Kubernetes.
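If you go the operator route, such a cluster is declared with a single custom resource and the operator handles leader election and failover (via Patroni) for you. A minimal manifest for the Zalando operator looks roughly like the sketch below; the names and values are placeholders, and the schema should be double-checked against the operator's own example manifests:

```yaml
apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  name: acid-minimal-cluster
spec:
  teamId: acid
  numberOfInstances: 3        # 1 primary + 2 replicas, failover handled by Patroni
  volume:
    size: 10Gi
  users:
    app_user:                 # role to create
      - createdb
  databases:
    app_db: app_user          # database name -> owning role
  postgresql:
    version: "15"
```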

Kubernetes Citus setup with individual hostname/ip

I am in the process of learning Kubernetes with a view to setting up a simple cluster with Citus DB and I'm having a little trouble with getting things going, so would be grateful for any help.
I have a Docker image containing my base Debian image configured with Citus for the project, and I want to set it up at this point with one master that should mount a GCP master disk with a Postgres DB, which I'll then distribute among the other containers, each mounted with an individual separate disk holding empty tables (configured with the Citus extension) to receive what gets distributed to each. I'd like to automate this further at some point, but for now I'm aiming for just a master container and eight nodes. My plan is to create a deployment that opens ports 5432 and 80 on each node, and I thought I could create two pods, one to hold the master and one to hold the eight nodes. Ideally I'd want to mount all the disks and then run a post-mount script on the master that will find all the node containers (by IP or hostname??), add them as Citus nodes, then run create_distributed_table to distribute the data.
My confusion at present is about how to label all the individual nodes so they keep their internal address or hostname, and so that if one goes down it will be replaced and resume with the data on the PD. I've read about ConfigMaps and setting hostname aliases, but I'm still unclear about how to proceed. Is this possible, or is this the wrong way to approach this kind of setup?
You are looking for a StatefulSet. That lets you have a known number of pod replicas, with attached storage (PersistentVolumes) and consistent DNS names. In the pod spec I would launch only a single copy of the server and use the StatefulSet's replica count to control the number of "nodes" (also a Kubernetes term); if the replica is #0, then it's the master.
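A rough sketch of what that could look like, using a headless Service so every pod keeps a stable DNS name (e.g. citus-0.citus) across rescheduling; the image, labels and storage size are assumptions, and citus-0 would play the master/coordinator role:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: citus
spec:
  clusterIP: None              # headless: each pod gets citus-N.citus as a stable name
  selector:
    app: citus
  ports:
    - port: 5432
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: citus
spec:
  serviceName: citus
  replicas: 9                  # citus-0 as master/coordinator + 8 worker nodes
  selector:
    matchLabels:
      app: citus
  template:
    metadata:
      labels:
        app: citus
    spec:
      containers:
        - name: citus
          image: citusdata/citus:12.1    # assumed image; use your own Debian-based image instead
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:          # one PersistentVolumeClaim per pod, reattached on rescheduling
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi
```

Your post-mount script on citus-0 could then register the workers by their stable names rather than pod IPs (Citus's master_add_node, called citus_add_node in newer releases, takes a hostname and port), so a replaced pod re-attaches to its PersistentVolume and keeps the same address.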