Kubernetes CSI Driver: Mounting of volumes when pods run on different nodes - kubernetes

I am currently using the Hetzner CSI-Driver (https://github.com/hetznercloud/csi-driver) in Kubernetes, which works fine for the most part.
But sometimes I run into the issue that two pods using the same persistentVolumeClaim get scheduled onto different nodes. Since the persistentVolume is only mounted onto one node, all pods running on the other node fail with the error 'Unable to attach or mount volumes'.
That makes sense to me but I can't wrap my head around what the correct solution would be. I thought that CSI-Drivers which mount volumes told Kubernetes in some way "oh, this pod needs that volumeClaim? Then you need to schedule it onto that node because the mounted volume is currently in use there by another pod", so I don't understand why pods using the same claim even get scheduled onto different nodes.
Is my understanding of CSI drivers in general incorrect, or is there some way in which I can enforce that behaviour? Or am I using this wrong altogether and should I change the underlying configuration?
Any help is appreciated.
Currently I simply restart the pod until I get lucky and it lands on the correct node, and then everything works fine. But I assume there is a more elegant solution.
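I imagine something like the following podAffinity rule could force the pods onto the same node, but I'm not sure this is the intended approach (the names and labels below are made up for illustration, not from my actual setup):

```yaml
# Hypothetical sketch: the second pod declares podAffinity towards the first,
# so both land on the same node and can share the ReadWriteOnce volume.
apiVersion: v1
kind: Pod
metadata:
  name: pod-b                     # placeholder name
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: pod-a          # label carried by the pod that already mounts the volume
          topologyKey: kubernetes.io/hostname
  containers:
    - name: app
      image: example/app:latest   # placeholder image
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: shared-pvc     # the claim both pods use
```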

Related

Is there an intermediate layer/cache between a Kubernetes pod and a persistent volume, or does a pod access the PV directly?

Recently I ran into a strange problem. We have two pods running in an OpenShift cluster that share a persistent volume (GlusterFS) between them.
Now for the sake of this explanation, let's assume one of the pods was PodA and the other was PodB. In this case, PodB had been running for three months; there is automation in PodA which creates/updates files in the shared persistent volume, and PodB reads them and performs some operation based on the input.
Now coming to the problem: whenever PodA created a new file in the shared PV, it was visible and accessible from PodA. However, there were a few files that PodA was updating periodically, and those changes were not reflected in PodB, so in PodB we could only see the old version of those files. To solve the problem, we forcefully deleted PodB; OpenShift then recreated it and the problem was gone.
I thought that with the PV mechanism Kubernetes mounts the external storage/folder directly into the pod (container), with no intermediate layer or cache. From what we have experienced so far, it seems every container (or pod) keeps a local copy of those files, or maybe there is a cache between the PV and the pod.
I have searched for this on Google and could not find a detailed explanation of how this PV mount works in Kubernetes. I would love to know the actual reason behind this problem.
There is no caching mechanism for PVs provided by Kubernetes, so the problem you are observing must be located in either the GlusterFS CSI driver or GlusterFS itself.
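For context, the kind of setup described just mounts the same claim straight into both pods, roughly like this (names are illustrative); there is nothing Kubernetes puts between the container filesystem and the GlusterFS mount:

```yaml
# Illustrative only: both pods reference the same PVC, and the volume is
# mounted straight into each container with no Kubernetes-level cache.
apiVersion: v1
kind: Pod
metadata:
  name: pod-b                           # placeholder name
spec:
  containers:
    - name: reader
      image: example/reader:latest      # placeholder image
      volumeMounts:
        - name: shared
          mountPath: /shared
  volumes:
    - name: shared
      persistentVolumeClaim:
        claimName: glusterfs-shared-pvc # placeholder claim name
```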

StatefulSet and Local Persistent Volume when the kube node is gone

This question is about StatefulSet and Local Persistent Volume.
If we deploy a StatefulSet whose pods use local persistent volumes, then when the Kube node hosting a persistent volume is gone, the corresponding pod becomes unschedulable. My question is: how can an operator reliably detect this problem?
I can’t find any documentation talking about this. Can the operator receive a notification or something?
What I observed is that when a node hosting a PV is deleted, the corresponding pod gets stuck in the Pending state. One way I can think of to detect this problem is to find the PVC for the pod, then find the PV for the PVC, then find the node the PV is on, and then query to see if that node is still there.
But the problem is that inspecting PVs and nodes requires cluster-level privileges, which ideally should not be given to an operator that is only supposed to manage namespace-level resources.
Plus, I am not sure that following the Pod->PVC->PV->Node sequence captures all possible situations in which the physical storage becomes inaccessible.
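For context, the local PersistentVolumes are pinned to a node through a nodeAffinity stanza, roughly like the following (values are made up for illustration), which is where the Pod->PVC->PV->Node chain ends up:

```yaml
# Illustrative local PV: the nodeAffinity below binds the volume to one node,
# so when that node is gone the consuming pod can no longer be scheduled.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-example            # placeholder name
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/vol1           # placeholder path on the node
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - worker-node-1     # the node hosting the disk
```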
What is the proper way to detect the situation? Once the situation is detected, it is pretty easy to fix.
Thank you very much!

Assuming imagePullPolicy:IfNotPresent and the image can't be pulled, will Kubernetes reschedule the pod to a node which _does_ have the image?

I'm wondering whether using imagePullPolicy: IfNotPresent would provide any resiliency to the temporary loss of our private registry.
Background
I have a multi-master, multi-worker bare-metal cluster, and all my pods are using images pulled from a local private registry, running outside the cluster on a single node. The cluster is therefore a lot more fault-tolerant than the registry itself.
What would happen if I set my workloads to imagePullPolicy: Always, and the registry failed? I wouldn't be able to (for example) scale up/down my pods, since I wouldn't be able to pull the image from the registry.
If I used imagePullPolicy: IfNotPresent, then provided the image existed on the nodes already, I could happily scale up/down, even in the absence of the registry.
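For illustration, the pull policy sits on each container in the workload spec, something like this (names and the image are placeholders):

```yaml
# Illustrative snippet: with IfNotPresent the kubelet only contacts the
# registry when the image is not already cached on the node.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                                     # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: registry.local:5000/my-app:1.2.3  # placeholder private-registry image
          imagePullPolicy: IfNotPresent
```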
The question
The question is, if a pod couldn't start because the image couldn't be pulled, would Kubernetes ever try to reschedule that pod on a different node (which may have the image cached), or once it's scheduled (and failing) to one node, will it remain there until deleted/drained?
Cheers!
D
That is correct! If the registry is down and you use imagePullPolicy: Always, any new pod would fail to create, timing out on the image pull.
And no, it wouldn't reschedule the pod. It would stay there in the ImagePullBackOff state until the registry comes back up.
You could trick it into rescheduling the pod (though not based on registry availability), but most probably the scheduler would place it on the same node again, unless something has changed resource-wise.
AFAIK the Kube scheduler doesn't handle the scenario where the image is present on one node and not present on another.
You can write your own scheduler to handle this scenario. Refer: https://kubernetes.io/docs/tasks/administer-cluster/configure-multiple-schedulers/
You can also use a tool like https://github.com/uber/kraken, which is a P2P Docker registry, i.e. if an image is present on one of the nodes it can be pulled by the other nodes too. This would also make your registry more fault-tolerant.

DaemonSet with privileged pods

I am working on a client requirement that the worker nodes need to have a specific time zone configured for their apps to run properly. We have tried things such as using the TZ environment variable and also mounting a volume on /etc/localtime that points to the right file in /usr/share/zoneinfo// - these work to some extent, but it seems I will need to use DaemonSets to modify the node configuration for some of the apps.
The concern I have is that the specific pod that needs to make this change on the nodes will have to run with host privileges, and leaving such pods running on all nodes doesn't sound good. The documentation says that pods in a DaemonSet must have a restart policy of Always, so I can't have them exit after making the changes either.
I believe I can address this specific concern with an init container that runs with host privileges, makes the appropriate changes on the node, and exits. The other containers in the DaemonSet pod will run after the init container completes successfully, and finally all the other pods get scheduled on the node. I also believe this sequence works the same way when I add another node to the cluster.
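A rough, untested sketch of what I have in mind (images and the example time zone are placeholders, not my actual values):

```yaml
# Rough sketch, untested: a privileged init container adjusts the node's
# time zone, then a minimal pause-style container keeps the DaemonSet pod alive.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-timezone                    # placeholder name
spec:
  selector:
    matchLabels:
      app: node-timezone
  template:
    metadata:
      labels:
        app: node-timezone
    spec:
      initContainers:
        - name: set-timezone
          image: alpine:3.19             # placeholder image
          securityContext:
            privileged: true
          command:
            - sh
            - -c
            # Europe/Berlin is only an example; the symlink target resolves
            # against the node's own /usr/share/zoneinfo.
            - ln -sf /usr/share/zoneinfo/Europe/Berlin /host/etc/localtime
          volumeMounts:
            - name: host-etc
              mountPath: /host/etc
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9   # keeps the pod running after init
      volumes:
        - name: host-etc
          hostPath:
            path: /etc
```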
Does that sound about right? Are there better approaches?

Steps involved in creating a pod in Kubernetes

How does Kubernetes create Pods?
I.e. what are the sequential steps involved in creating a Pod, and how is it implemented in Kubernetes?
Any code reference in Kubernetes repo would also be helpful.
A Pod is described in a definition file and run as a set of Docker containers on a given host that is part of the Kubernetes cluster, much like docker-compose does, but with several differences.
More precisely, a Pod always contains more Docker containers than the user defined, even though only the user-defined containers are usually visible through the API: a Pod has one placeholder (infrastructure, or "pause") container that holds the network namespace and IP for the Pod, so that when a Pod's containers are restarted, it's actually only the application containers that restart; the placeholder container remains and keeps the same IP, unlike in plain Docker or docker-compose, where recreating a composition or container changes the IP.
How Pods are scheduled, created, started, restarted if needed, re-scheduled, etc. is a much longer story and a very broad question.
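For reference, a minimal Pod definition file looks roughly like this (names are placeholders); the placeholder/pause container mentioned above never appears in it:

```yaml
# Minimal illustrative Pod definition: only the user-defined container shows up
# here and in the API; the pause/infra container is added behind the scenes.
apiVersion: v1
kind: Pod
metadata:
  name: example-pod          # placeholder name
spec:
  containers:
    - name: app
      image: nginx:1.25      # placeholder image
      ports:
        - containerPort: 80
```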