Safely unbind statefulset/my-pod-0 pvc without affecting the other replicas - kubernetes

I have a statefulset with 2 pods, running on separate dedicated nodes, each with its own pvc.
I need to perform some maintenance on the PVs bound to each of the PVCs used by the StatefulSet's pods.
So far I've been able to scale the StatefulSet down to 1 replica, which took statefulset/pod-1 offline and let me perform the maintenance task on its volume.
Now I need to perform the same task to statefulset/pod-0, without taking offline statefulset/pod-1.
What are my options?
Remember, statefulset/pod-0's pv must be unmounted in order to start the maintenance task.

I don't think this is possible to achieve with one StatefulSet because of the Deployment and Scaling Guarantees a StatefulSet provides. To unmount the volume of a pod, you must shut down/delete the pod first, and before this can be done to a pod in a StatefulSet, "all of its successors must be completely shutdown."
My advice is to plan and communicate a maintenance window with the stakeholders of the application and scale down the statefulset entirely to apply the changes to the volume in your storage backends. Moving volumes across storage backends is not a task to perform on a regular basis, so I think it is reasonable to ask for a one-time maintenance to do so.
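For the maintenance window itself, a minimal sketch (the StatefulSet name is a placeholder):

    # take both replicas offline, do the PV work, then restore
    kubectl scale statefulset my-statefulset --replicas=0
    # ... perform the maintenance on both PVs in the storage backend ...
    kubectl scale statefulset my-statefulset --replicas=2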

I was able to perform the task on statefulset/pod-0 by:
Cordon the node
Delete statefulset/pod-0
Once the task was complete, uncordon the node and the pod started automatically without any issues.
Note that this only works if the pod has a nodeAffinity pinning it to that node; otherwise, deleting it would simply reschedule it onto another (uncordoned) node instead of leaving it Pending.
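The same sequence as kubectl commands, with placeholder names and the nodeAffinity caveat above (it is what keeps the recreated pod Pending on the cordoned node rather than landing elsewhere):

    kubectl cordon node-a                 # no new pods can schedule onto node-a
    kubectl delete pod my-statefulset-0   # the controller recreates it, but it stays Pending
    # ... perform the maintenance on the now-unmounted PV ...
    kubectl uncordon node-a               # the pending pod now schedules and starts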

...to move the cinder pv from one backend to another
With Cinder CSI you can snapshot or clone the volume into a new PV and move it to another backend.
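A sketch of that route via the generic CSI snapshot API; the snapshot class and claim names are placeholders, and the VolumeSnapshotClass is assumed to point at the Cinder CSI driver:

    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshot
    metadata:
      name: data-snapshot
    spec:
      volumeSnapshotClassName: cinder-snapclass      # placeholder class
      source:
        persistentVolumeClaimName: data-my-statefulset-0
    ---
    # restore the snapshot into a new PVC whose StorageClass targets the new backend
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: data-restored
    spec:
      storageClassName: new-backend-sc               # placeholder class
      dataSource:
        name: data-snapshot
        kind: VolumeSnapshot
        apiGroup: snapshot.storage.k8s.io
      accessModes: [ReadWriteOnce]
      resources:
        requests:
          storage: 10Gi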

Related

Azure Kubernetes Service: What about PV and PVC in Stateful workload migration to new nodepool of different machine size

Context:
[Regarding cost reduction for compute.] In an AKS cluster, I recently observed that compute resources are under-utilised, so my plan is:
Create a new node pool (with lower CPU and memory, and so lower cost) and attach it to the same AKS cluster.
And then
Cordon the old node pool and drain it, so the workload moves to the new node pool (thanks to nodeSelector).
Question:
What about k8s resources like a StatefulSet (e.g. Redis) that are in the old node pool and have a PV and PVC? Do we have to take a backup of those PVCs and restore them for the new node pool? (My thinking is that Kubernetes will take care of detaching and attaching the PVCs, since all of this activity happens within a single Kubernetes cluster.)
You are right! Kubernetes will take care of detaching and attaching the PV and PVC, since all of this activity stays within a single Kubernetes cluster (assuming the volumes are network-attached storage such as Azure Disk, rather than local to the nodes). You don't need a backup.
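A sketch of the cordon-and-drain step; the pool name is a placeholder and the agentpool label is an assumption based on the default labels AKS puts on its nodes:

    # cordon every node in the old pool, then drain them one at a time
    kubectl cordon -l agentpool=oldpool
    for node in $(kubectl get nodes -l agentpool=oldpool -o name); do
      kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
    done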

When should I use a StatefulSet? Can I deploy a database in a StatefulSet?

I heard that a StatefulSet is suitable for databases.
But a StatefulSet will create a different PVC for each pod.
If I set replicas=3, then I get 3 pods and 3 different PVCs with different data.
Database users only want one database, not 3 databases.
So it seems clear we should not use a StatefulSet in this situation.
But then, when should we use a StatefulSet?
A StatefulSet does three big things differently from a Deployment:
It creates a new PersistentVolumeClaim for each replica;
It gives the pods sequential names, starting with statefulsetname-0; and
It starts the pods in a specific order (ascending numerically).
This is useful when the database itself knows how to replicate data between different copies of itself. In Elasticsearch, for example, indexes are broken up into shards. There are by default two copies of each shard. If you have five Pods running Elasticsearch, each one will have a different fraction of the data, but internally the database system knows how to route a request to the specific server that has the datum in question.
I'd recommend using a StatefulSet in preference to manually creating a PersistentVolumeClaim. For database workloads that can't be replicated, you can't set replicas: greater than 1 in either case, but the PVC management is valuable. You usually can't have multiple databases pointing at the same physical storage, containers or otherwise, and most types of Volumes can't be shared across Pods.
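A minimal sketch of those three behaviors, with placeholder names and image:

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: db
    spec:
      serviceName: db                  # headless Service (sketched further below)
      replicas: 3
      selector:
        matchLabels:
          app: db
      template:
        metadata:
          labels:
            app: db
        spec:
          containers:
          - name: db
            image: elasticsearch:8.13.0          # placeholder image/tag
            volumeMounts:
            - name: data
              mountPath: /usr/share/elasticsearch/data
      volumeClaimTemplates:            # one PVC per replica: data-db-0, data-db-1, data-db-2
      - metadata:
          name: data
        spec:
          accessModes: [ReadWriteOnce]
          resources:
            requests:
              storage: 10Gi

The pods come up in order as db-0, db-1, db-2, each bound to its own data-db-N claim that survives pod deletion and rescheduling.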
We can deploy a database to Kubernetes as a stateful application. Usually, when we deploy pods they have their own storage, but that storage is ephemeral: if the container is killed, its storage is gone with it.
So Kubernetes has an object to tackle that scenario: when we want our data to persist, we attach the pod to a PersistentVolumeClaim. That way, if our container dies, the data remains in the cluster, and the replacement pod can access it again.
Some limitations of using a StatefulSet are:
1. It requires a persistent volume provisioner to provision storage for the pods based on the requested storage class.
2. Deleting or scaling down the replicas will not delete the volumes attached to the StatefulSet. This ensures the safety of the data.
3. StatefulSets currently require a Headless Service to be responsible for the network identity of the pods (see the sketch after this list).
4. A StatefulSet doesn't guarantee that all pods are deleted when the StatefulSet itself is deleted, unlike a Deployment, which deletes all pods associated with it. You have to scale the replicas down to 0 before deleting the StatefulSet.
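For item 3, a headless Service is simply a Service with clusterIP: None; a minimal sketch matching the StatefulSet example above:

    apiVersion: v1
    kind: Service
    metadata:
      name: db
    spec:
      clusterIP: None      # headless: gives each pod a stable DNS name such as db-0.db
      selector:
        app: db
      ports:
      - port: 9200         # placeholder port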
A StatefulSet is useful for running applications that store state.
A database run as a StatefulSet has multiple replicas of the pod, each with its own PVC; the data sync across the pods and PVCs is handled by the database's own replication, not by Kubernetes.
So it's a good option to use StatefulSets with multiple replicas to get an HA database.
It then depends on your use case which database you want to use, and whether it supports replication, clustering, etc.
Here is a MySQL example with replication details: https://kubernetes.io/docs/tasks/run-application/run-replicated-stateful-application/

StatefulSet and Local Persistent Volume when the kube node is gone

This question is about StatefulSet and Local Persistent Volume.
If we deploy a StatefulSet whose pods use local persistent volumes, then when the node hosting a persistent volume is gone, the corresponding pod becomes unschedulable. My question is: how can an operator reliably detect this problem?
I can't find any documentation talking about this. Can the operator receive a notification or something?
What I observed is that when a node hosting a PV is deleted, the corresponding pod gets stuck in the Pending state. One way I can think of to detect this is to find the PVC for the pod, then the PV for the PVC, then the node the PV is on, and finally query to see if the node is still there.
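A sketch of that lookup chain with kubectl, assuming a local PV whose nodeAffinity uses the usual kubernetes.io/hostname key (names are placeholders):

    POD=my-sts-0
    PVC=$(kubectl get pod "$POD" -o jsonpath='{.spec.volumes[0].persistentVolumeClaim.claimName}')
    PV=$(kubectl get pvc "$PVC" -o jsonpath='{.spec.volumeName}')
    NODE=$(kubectl get pv "$PV" -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0]}')
    kubectl get node "$NODE" || echo "node hosting the PV is gone"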
But the problem is that inspecting PVs and nodes requires cluster-level privileges, which ideally should not be given to an operator that is only supposed to manage namespace-level resources.
Plus, I am not sure that following the Pod -> PVC -> PV -> Node chain captures all possible situations in which the physical storage becomes inaccessible.
What is the proper way to detect this situation? Once it is detected, it is pretty easy to fix.
Thank you very much!

In Kubernetes, can draining a node temporarily scale up a deployment?

Does kubernetes ever create pods as the result of draining a node? I'm not sure whether this is a feature I'm missing, or whether I just haven't found the right docs for it.
So here's the problem: I've got services that want to be always-on, but typically want to be a single container (for various stupid reasons having to do with them being more stateful than they should be). It's ok to run two containers temporarily during deployment or maintenance activities, though. So in ECS, I would say "desired capacity 1, maximumPercent 200%, minimumHealthyPercent 100%". Then, if I needed to replace a cluster node, ECS would automatically scale the service up, and once the new task was passing health checks, it would stop the old task and the node could continue draining.
In kubernetes, the primitives seem to be less-flexible: I can set a pod disruption budget to prevent a pod from being evicted. But I don't see a way to get a deployment to temporarily scale up as a result of a node being drained. The pod disruption budget object in kubernetes, being mostly independent of a deployment or replica set, seems to mainly act as a blocker to draining a node, but not as a way to eagerly trigger scale-up.
In Kubernetes, Deployments only create new pods when the current replicas are below the desired replicas. In other words, the creation of a new pod is triggered after the disruption.
By design, Deployments observe neither disruption events (which is probably not possible anyway, as there are many kinds of voluntary actions) nor the Eviction API directly. Hence Deployments never scale up pre-emptively.
You may be looking for something like the Horizontal Pod Autoscaler; however, that only scales based on resource consumption.
I would deploy at least 2 replicas and use a PodDisruptionBudget, since your service (application) is critical and should run 24/7/365. A pod can come down or be rescheduled not only for node maintenance, but for many other reasons (voluntary & involuntary).
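A minimal sketch of that setup (names are placeholders), paired with replicas: 2 on the Deployment:

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: my-app-pdb
    spec:
      minAvailable: 1        # a drain may evict one pod only while the other stays up
      selector:
        matchLabels:
          app: my-app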

Kubernetes PodDisruptionBudget, HorizontalPodAutoscaler & RollingUpdate Interaction?

If I have the following Kubernetes objects:
Deployment with rollingUpdate.maxUnavailable set to 1.
PodDisruptionBudget with maxUnavailable set to 1.
HorizontalPodAutoscaler setup to allow auto scaling.
Cluster auto-scaling is enabled.
If the cluster was under load and is in the middle of scaling up, what happens:
During a rolling update? Do the new Pods added due to the scale-up use the new version of the Pod?
When a node needs to be restarted or replaced? Does the PodDisruptionBudget stop the restart completely? Does the HorizontalPodAutoscaler scale up the number of nodes before taking down another node?
When the pod affinity is set to avoid placing two Pods from the same Deployment on the same node?
As in the documentation:
Pods which are deleted or unavailable due to a rolling upgrade to an application do count against the disruption budget, but controllers (like deployment and stateful-set) are not limited by PDBs when doing rolling upgrades – the handling of failures during application updates is configured in the controller spec.
So it partially depends on the controller's configuration and implementation. I believe new pods added by the autoscaler will use the new version of the Pod, because that's the version present in the Deployment's definition at that point.
That depends on how you execute the node restart. If you just cut the power, nothing can be done ;) If you execute a proper drain before shutting the node down, then the PodDisruptionBudget will be taken into account and the draining procedure won't violate it. The disruption budget is respected by the Eviction API, but can be violated by low-level operations like manual pod deletion. It is more like a suggestion that some APIs respect than a hard limit enforced by Kubernetes as a whole.
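Concretely (node and pod names are placeholders):

    kubectl drain node-1 --ignore-daemonsets     # goes through the Eviction API, respects the PDB
    kubectl delete pod my-app-6d4b75cb6d-abcde   # plain deletion, bypasses the PDB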
According to the official documentation, if the affinity is a "soft" one, the pods may still be scheduled onto the same node anyway. If it's "hard", then the Deployment will get stuck, unable to schedule the required number of pods. A rolling update will still be possible, but the HPA won't be able to grow the pod pool any further.
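A sketch of the "hard" variant from point 3, set on the Deployment's pod template (the label key/value is a placeholder); the preferredDuringSchedulingIgnoredDuringExecution form is the "soft" variant:

    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:   # "hard": stay Pending rather than co-locate
        - labelSelector:
            matchLabels:
              app: my-app
          topologyKey: kubernetes.io/hostname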