StatefulSet and Local Persistent Volume when the kube node is gone

This question is about StatefulSet and Local Persistent Volume.
If we deploy a StatefulSet whose pods use local persistent volumes, then when the Kube node hosting a persistent volume is gone, the corresponding pod becomes unschedulable. My question is: how can an operator reliably detect this problem?
I can’t find any documentation talking about this. Can the operator receive a notification or something?
What I observed is that when a node hosting a PV is deleted, the corresponding pod gets stuck in the Pending state. One way I can think of to detect this is to find the PVC for the pod, then find the PV for the PVC, then find the node the PV is on, and finally query whether that node still exists.
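Roughly, the lookup chain I have in mind looks like the sketch below, written against the official Kubernetes Python client (the function name and error handling are mine, not a prescribed API):

```python
from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_incluster_config()  # assumes the operator runs inside the cluster
v1 = client.CoreV1Api()

def local_pv_node_missing(pod_name: str, namespace: str) -> bool:
    """Follow Pod -> PVC -> PV -> Node and report whether the node is gone."""
    pod = v1.read_namespaced_pod(pod_name, namespace)
    for vol in pod.spec.volumes or []:
        if vol.persistent_volume_claim is None:
            continue
        pvc = v1.read_namespaced_persistent_volume_claim(
            vol.persistent_volume_claim.claim_name, namespace)
        pv = v1.read_persistent_volume(pvc.spec.volume_name)
        if pv.spec.node_affinity is None:
            continue  # not a node-pinned (local) PV
        # A local PV records its node in spec.nodeAffinity (kubernetes.io/hostname).
        for term in pv.spec.node_affinity.required.node_selector_terms:
            for expr in term.match_expressions or []:
                if expr.key != "kubernetes.io/hostname":
                    continue
                for node_name in expr.values:
                    try:
                        v1.read_node(node_name)
                    except ApiException as e:
                        if e.status == 404:
                            return True  # the node backing this PV is gone
                        raise
    return False
```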
But the problem is that inspecting PVs and nodes requires cluster-level privileges, which ideally should not be given to an operator that is only supposed to manage namespace-level resources.
Plus, I am not sure that following the Pod -> PVC -> PV -> Node sequence captures all possible situations in which the physical storage becomes inaccessible.
What is the proper way to detect the situation? Once the situation is detected, it is pretty easy to fix.
Thank you very much!

Related

Kubernetes CSI Driver: Mounting of volumes when pods run on different nodes

I am currently using the Hetzner CSI-Driver (https://github.com/hetznercloud/csi-driver) in Kubernetes, which works fine for the most part.
But sometimes I run into the issue that two pods using the same persistentVolumeClaim get scheduled onto different nodes. Since the persistentVolume is only mounted onto one node, all pods running on the other node fail with the error 'Unable to attach or mount volumes'.
That makes sense to me, but I can't wrap my head around what the correct solution would be. I thought that CSI drivers which mount volumes told Kubernetes in some way "oh, this pod needs that volumeClaim? Then you need to schedule it onto that node, because the mounted volume is currently in use there by another pod", so I don't understand why pods using the same claim even get scheduled onto different nodes.
Is my understanding of CSI drivers in general incorrect, or is there some way in which I can enforce that behaviour? Or am I using this wrong altogether and should I change the underlying configuration?
Any help is appreciated.
Currently I simply restart the pod until I get lucky and it is moved to the correct node and then everything works fine. But I assume that there is a more elegant solution.

Safely unbind statefulset/my-pod-0 pvc without affecting the other replicas

I have a statefulset with 2 pods, running on separate dedicated nodes, each with its own pvc.
I need to perform some maintenance to the PVs assigned to each of the PVCs assigned to the statefulset's pods.
So far I've been able to scale down the statefulset to 1 replica, which took statefulset/pod-1 offline; this way I've been able to perform the maintenance task.
Now I need to perform the same task to statefulset/pod-0, without taking offline statefulset/pod-1.
What are my options?
Remember, statefulset/pod-0's pv must be unmounted in order to start the maintenance task.
I don't think that this is possible to achieve with one statefulset because of the Deployment and Scaling Guarantees a statefulset provides. To unmount the volume of a pod, you must shutdown/delete the pod first and before this can be done to a pod in a statefulset, "all of its successors must be completely shutdown."
My advice is to plan and communicate a maintenance window with the stakeholders of the application and scale down the statefulset entirely to apply the changes to the volume in your storage backends. Moving volumes across storage backends is not a task to perform on a regular basis, so I think it is reasonable to ask for a one-time maintenance to do so.
I was able to perform the task on statefulset/pod-0 by:
Cordoning the node
Deleting statefulset/pod-0
Once the task was complete, I uncordoned the node and the pod started again automatically without any issues (the steps are sketched below).
This will only work if the pod contains a specific nodeAffinity.
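A rough sketch of those steps with the official Python client (the node, pod, and namespace names are placeholders; `kubectl cordon` / `delete` / `uncordon` does the same thing):

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

NODE = "worker-0"                        # placeholder: node hosting statefulset/pod-0
POD, NS = "my-statefulset-0", "default"  # placeholder pod and namespace

# Cordon the node so the recreated pod cannot be scheduled back onto it yet.
v1.patch_node(NODE, {"spec": {"unschedulable": True}})

# Delete the pod; the StatefulSet controller recreates it, but the pod stays
# Pending while its target node (pinned via nodeAffinity) is unschedulable.
v1.delete_namespaced_pod(POD, NS)

# ... perform the PV maintenance here ...

# Uncordon the node; the Pending pod gets scheduled and starts again.
v1.patch_node(NODE, {"spec": {"unschedulable": False}})
```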
...to move the cinder pv from one backend to another
With Cinder CSI you can take a snapshot of the volume, or clone it into a new PV, and move it to another backend.

Is there an intermediate layer/cache between Kubernetes pod and Persistent Volume, or does a pod access PV directly

Recently I ran into a strange problem. We have two pods running in an OpenShift cluster that share a persistent volume (GlusterFS) between them.
Now for the sake of this explanation, let's say one of the pods is PodA and the other is PodB. In this case, PodB had been running for three months; there is automation in PodA which creates/updates files in the shared persistent volume, and PodB reads them and performs some operation based on the input.
Now coming to the problem: whenever PodA created a new file in the shared PV, it was visible and accessible from PodB. However, there were a few files that PodA was updating periodically, and those changes were not reflected in PodB; in PodB we could only see the old version of those files. To solve the problem, we forcefully deleted PodB, OpenShift recreated it, and the problem was gone.
I thought that with the PV mechanism Kubernetes mounts external storage/directories into the pod (container), and that there is no intermediate layer or cache or anything like that. From what we have experienced so far, it seems every container (or pod) creates a local copy of those files, or maybe there is a cache between the PV and the pod.
I have searched for this on Google and could not find a detailed explanation of how this PV mount works in Kubernetes, so I would love to know the actual reason behind this problem.
There is no caching mechanism for PVs provided by Kubernetes, so the problem you are observing must be located in either the GlusterFS CSI driver or GlusterFS itself.

Why would anyone want to use GCE Persistent Disk or EBS with Kubernetes?

These disks are accessible by only a single node.
Each node would have different data.
Also, any node can be terminated at any time, so you would have to find a way to reattach the volume to a new node that replaces the old one. How would you do that?
And after scaleup, a new node might not have one of these disks available to attach to, so you would need a new disk.
And why might anyone want to do all this? Just for temporary space? For that, they could use an EC2 instance store or GCE boot disk (though I guess that might be enough).
I'm specifically familiar with EBS; I assume GCE persistent disks work the same way. The important detail is that an EBS volume is not tied to a specific node; while it can only be attached to one node at a time, it can be moved to another node, and Kubernetes knows how to do this.
An EBS volume can be dynamically attached to an EC2 instance. In Kubernetes, generally there is a dynamic volume provisioner that's able to create PersistentVolume objects that are backed by EBS volumes in response to PersistentVolumeClaim objects. Critically, if a Pod uses a PVC that references an EBS-volume PV, the storage driver knows that, wherever the Pod is scheduled, it can dynamically attach the EBS volume to that EC2 instance.
That means that an EBS-volume PersistentVolume isn't actually "locked" to a single node. If the Pod is deleted and a new one uses the PersistentVolumeClaim, the volume can "move" to the node that runs the new Pod. If the Node is removed, all of its Pods can be rescheduled elsewhere, and the EBS volumes can go somewhere else too.
An EBS volume can only be attached to one instance at a time; in Kubernetes volume terminology, it can only have a ReadWriteOnce access mode. If it could be attached to many instances (as, for instance, an EFS NFS-based filesystem could be) it could be ReadOnlyMany or ReadWriteMany.
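As a concrete illustration (a rough sketch, not a prescription: the storage class name, size, and namespace are placeholders, and a dynamic provisioner such as the EBS CSI driver is assumed to be installed), creating such a ReadWriteOnce claim through the official Python client could look like this:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# A claim that a dynamic provisioner (for example the EBS CSI driver) can satisfy.
# Storage class name, size, and namespace are placeholders.
pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "data"},
    "spec": {
        "accessModes": ["ReadWriteOnce"],  # one node at a time, like an EBS volume
        "storageClassName": "gp2",
        "resources": {"requests": {"storage": "10Gi"}},
    },
}
v1.create_namespaced_persistent_volume_claim(namespace="default", body=pvc)
```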
This makes EBS a reasonably good default choice for persistent data storage, if your application actually needs it. It's not actually host-specific and it can move around the cluster as needed. It won't work if two Pods need to share files, but that is generally a complex and fragile setup anyway, and it's better to design your application to not need it.
The best setup is if your application doesn't need persistent local storage at all. This makes it easy to scale Deployments, because the data is "somewhere else". The data could be in a database; the data could be in a managed database, such as RDS; or it could be in an object-storage system like S3. Again, this requires changes in your application to not use local files for data storage.

How do I mount data into persisted storage on Kubernetes and share the storage amongst multiple pods?

I am new to Kubernetes and am trying to understand the most efficient and secure way to handle sensitive persisted data that interacts with a k8s pod. I have the following requirements when I start a pod in a k8s cluster:
The pod should have persisted storage.
Data inside the pod should be persistent even if the pod crashes or restarts.
I should be able to easily add or remove data from hostPath into the pod. (Not sure if this is feasible since I do not know how the data will behave if the pod starts on a new node in a multi node environment. Do all nodes have access to the data on the same hostPath?)
Currently, I have been using StatefulSets with a persistent volume claim on GKE. The image that I am using has a couple of constraints as follows:
I have to mount a configuration file into the pod before it starts. (I am currently using configmaps to pass the configuration file)
The pod that comes up creates its own TLS certificates, which I need to pass to other pods. (Currently I do not have a process in place to do this and thus have been manually copy-pasting these certificates into other pods.)
So, how do I maintain a common persisted storage that handles sensitive data between multiple pods and how do I add pre-configured data to this storage? Any guidance or suggestions are appreciated.
I believe this documentation on creating a persistent disk with multiple readers [1] is what you are looking for. You will, however, only be able to have the pods read from the disk, since GCE persistent disks do not support the ReadWriteMany access mode [2].
Regarding hostPath volumes, the mount point is in the pod, while the volume itself is a directory on the node. I believe a hostPath is confined to an individual node.
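Very roughly, the claim side of that read-only setup could look like this with the official Python client (a sketch only: the PV is assumed to already exist and point at the pre-populated persistent disk, and all names and sizes are placeholders):

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Claim a pre-created PV backed by an existing GCE persistent disk, readable
# from many nodes. The PV name, size, and namespace are placeholders.
pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "shared-data"},
    "spec": {
        "accessModes": ["ReadOnlyMany"],
        "storageClassName": "",          # bind to the pre-created PV, no provisioning
        "volumeName": "shared-data-pv",  # placeholder name of the existing PV
        "resources": {"requests": {"storage": "10Gi"}},
    },
}
v1.create_namespaced_persistent_volume_claim(namespace="default", body=pvc)
```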
[1] https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/readonlymany-disks
[2] https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes