Trying to mount a Lustre filesystem in a pod: attempts and problems

For our data science platform we have user directories on a Lustre filesystem (HPC world) that we want to present inside Jupyter notebooks running on OpenShift (they are spawned on demand by JupyterHub after authentication and other custom actions).
We tried the following approaches:
Mounting the Lustre filesystem in a privileged sidecar container in the same pod as the notebook, sharing the mount through an emptyDir volume (that way the notebook container does not need to be privileged). The Lustre mount/umount happens in the sidecar via postStart and preStop hooks in the pod lifecycle. The problem is that sometimes when we delete the pod, the umount fails or hangs for whatever reason, and since emptyDirs are "destroyed" on pod deletion, the cleanup wipes everything in Lustre because the filesystem is still mounted. A really bad result...
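For reference, a minimal sketch of this layout (image names, the MGS address and paths are placeholders, not our actual configuration):

apiVersion: v1
kind: Pod
metadata:
  name: notebook-with-lustre
spec:
  volumes:
    - name: lustre-share
      emptyDir: {}                            # shared between sidecar and notebook
  containers:
    - name: notebook
      image: jupyter/base-notebook            # placeholder image
      volumeMounts:
        - name: lustre-share
          mountPath: /home/jovyan/lustre
          mountPropagation: HostToContainer   # receive the mount made by the sidecar
    - name: lustre-mounter
      image: lustre-client:latest             # placeholder; must ship the Lustre client
      command: ["sleep", "infinity"]
      securityContext:
        privileged: true                      # only the sidecar is privileged
      volumeMounts:
        - name: lustre-share
          mountPath: /mnt/lustre
          mountPropagation: Bidirectional     # so the mount propagates out of this container
      lifecycle:
        postStart:
          exec:
            command: ["mount", "-t", "lustre", "mgs@tcp:/fsname", "/mnt/lustre"]   # placeholder MGS/fsname
        preStop:
          exec:
            command: ["umount", "/mnt/lustre"]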
Mounting the Lustre filesystem directly on the node, then creating a hostPath volume in the pod and a volumeMount in the notebook container. It kind of works, but only if the container runs as privileged, which of course we don't want.
We tried more specific SCCs that authorize the hostPath volume (hostaccess, hostmount-anyuid), and also made a custom one with a hostPath volume and allowHostDirVolumePlugin: true, but with no success. We can see the mount from the container, but even with everything wide open (777) we get "permission denied" for whatever we try on the mount (ls, touch, ...). Again, it only works if the container is privileged. It does not seem to be SELinux-related; at least we get no alerts.
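For reference, a rough sketch of the kind of custom SCC described above (the name and exact field values here are illustrative, not the one we actually used):

kind: SecurityContextConstraints
apiVersion: security.openshift.io/v1
metadata:
  name: notebook-hostpath            # illustrative name
allowHostDirVolumePlugin: true       # allow hostPath volumes
allowPrivilegedContainer: false      # notebooks should stay unprivileged
runAsUser:
  type: RunAsAny
seLinuxContext:
  type: RunAsAny
fsGroup:
  type: RunAsAny
supplementalGroups:
  type: RunAsAny
volumes:
  - hostPath
  - emptyDir
  - secret
  - configMap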
Does anyone see where the problem is, or have another suggestion we could try?

Related

Kubernetes pod went down

I am pretty new to Kubernetes, so I don't have much of an idea. The other day a pod went down and I was wondering whether I could recover the tmp folder.
So basically I want to know: when a pod in Kubernetes goes down, does it lose the contents of the /tmp folder?
Unless you configure otherwise, this folder will be considered storage within the container, and the contents will be lost when the container terminates.
This is similar to how you can run a container in Docker, write something to the filesystem within the container, then stop and remove the container, start a new one, and find that the file you wrote is no longer there.
If you want to keep the /tmp folder contents between restarts, you'll need to attach a persistent volume to it and mount it as /tmp within the container, but with the caveat that if you do that, you cannot use that same volume with other replicas in a deployment unless you use a read-write-many capable filesystem underneath, like NFS.
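A minimal sketch of that setup, with placeholder names and sizes:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tmp-data                  # placeholder claim name
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: busybox              # placeholder image
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: tmp-volume
          mountPath: /tmp         # /tmp contents now survive container restarts
  volumes:
    - name: tmp-volume
      persistentVolumeClaim:
        claimName: tmp-data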

Debugging nfs volume "Unable to attach or mount volumes for pod"

I've set up an NFS server that serves a ReadWriteMany PV, following the example at https://github.com/kubernetes/examples/tree/master/staging/volumes/nfs
This setup works fine for me in lots of production environments, but in one specific GKE cluster, mounting stopped working after the pods restarted.
In the kubelet logs I see the following, repeated many times:
Unable to attach or mount volumes for pod "api-bf5869665-zpj4c_default(521b43c8-319f-425f-aaa7-e05c08282e8e)": unmounted volumes=[shared-mount], unattached volumes=[geekadm-net deployment-role-token-6tg9p shared-mount]: timed out waiting for the condition; skipping pod
Error syncing pod 521b43c8-319f-425f-aaa7-e05c08282e8e ("api-bf5869665-zpj4c_default(521b43c8-319f-425f-aaa7-e05c08282e8e)"), skipping: unmounted volumes=[shared-mount], unattached volumes=[geekadm-net deployment-role-token-6tg9p shared-mount]: timed out waiting for the condition
Manually mounting the NFS share on any of the nodes works just fine: mount -t nfs <service ip>:/ /tmp/mnt
How can I further debug the issue? Are there any other logs I could look at besides kubelet?
If the pod gets kicked off the node because the mount is too slow, you may see messages like this in the logs; the kubelet itself reports the issue. Sample kubelet log:
Setting volume ownership for /var/lib/kubelet/pods/c9987636-acbe-4653-8b8d-aa80fe423597/volumes/kubernetes.io~gce-pd/pvc-fbae0402-b8c7-4bc8-b375-1060487d730d and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
Cause:
The pod.spec.securityContext.fsGroup setting causes the kubelet to run chown and chmod on all the files in the volumes mounted for the given pod. This can be very time-consuming for big volumes with many files.
By default, Kubernetes recursively changes ownership and permissions for the contents of each volume to match the fsGroup specified in a Pod's securityContext when that volume is mounted (from the documentation).
Solution:
You can deal with it in the following ways.
Reduce the number of files in the volume.
Stop using the fsGroup setting.
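If you still need fsGroup, newer Kubernetes versions also let you limit the recursive chown with fsGroupChangePolicy (not mentioned in this answer); a minimal sketch with placeholder names:

apiVersion: v1
kind: Pod
metadata:
  name: api                                # placeholder name
spec:
  securityContext:
    fsGroup: 2000
    fsGroupChangePolicy: "OnRootMismatch"  # only chown/chmod recursively when the volume
                                           # root does not already have the expected group
  containers:
    - name: api
      image: nginx                         # placeholder image
      volumeMounts:
        - name: shared-mount
          mountPath: /data
  volumes:
    - name: shared-mount
      persistentVolumeClaim:
        claimName: shared-pvc              # placeholder claim name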
Did you specify an NFS version when mounting from the command line? I had the same issue on AKS, but inspired by https://stackoverflow.com/a/71789693/1382108 I checked the NFS versions and noticed my PV had vers=3. When I tried mounting from the command line with mount -t nfs -o vers=3, the command just hung; with vers=4.1 it worked immediately. I changed the version in my PV and the next Pod worked just fine.
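A minimal sketch of pinning the NFS version on the PV through mountOptions (server and path are placeholders):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv                     # placeholder name
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  mountOptions:
    - vers=4.1                     # was vers=3, which hung; 4.1 mounted immediately
  nfs:
    server: 10.0.0.10              # placeholder server address
    path: /exports                 # placeholder export path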

K8s - an NFS share is only mounted read-only in the pod

Environment: an external NFS share for persistent storage, accessible to all, R/W; CentOS 7 VMs (NFS share and K8s cluster); NFS utils installed on all workers.
Mounting on a VM, e.g. a K8s worker node, works correctly; the share is R/W.
Deployed in the K8s cluster: PV, PVC, Deployment (volumes referencing the PVC, volumeMount).
The structure of the YAML files corresponds to the various instructions and postings, including the postings here on the site.
The pod starts and the share is mounted. Unfortunately, it is read-only. None of the suggestions from the postings I have found about this have worked so far.
Any idea what else I could look out for, what else I could try?
Thanks. Thomas
After digging deeper, I found the cause of the problem. Apparently, the syntax of the NFS export is very sensitive; a single extra space can be problematic.
On the NFS server, two export entries had ended up in the kernel tables: the first R/O and the second R/W. I don't know whether this is a CentOS bug triggered by the syntax in /etc/exports.
On another CentOS machine I was able to mount the share without any problems (R/W). In the container (Debian-based image), however, it was mounted only R/O. I have not investigated whether this is due to Kubernetes or whether Debian behaves differently.
After correcting the /etc/exports file and restarting the NFS server, there was only one, correct, entry in the kernel table. After that, mounting R/W worked both on a CentOS machine and in the Debian-based container inside K8s.
Here are the files / table:
previous /etc/exports:
/nfsshare 172.16.6.* (rw,sync,no_root_squash,no_all_squash,no_acl)
==> kernel:
/nfsshare 172.16.6.*(ro, ...
/nfsshare *(rw, ...
corrected /etc/exports (w/o the blank):
/nfsshare *(rw,sync,no_root_squash,no_all_squash,no_acl)
In principle, the idea of using an init container is a good one; thank you for reminding me of it.
I have tried it.
Unfortunately, it doesn't change the basic problem: the filesystem is mounted read-only by Kubernetes. The init container logs the following error:
chmod: /var/opt/dilax/enumeris: Read-only file system
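For reference, a minimal sketch of the init-container approach that was tried; image and claim name are placeholders, and the path is taken from the error above. On the read-only mount it fails exactly as shown in the log:

apiVersion: v1
kind: Pod
metadata:
  name: dilax-app                  # placeholder name
spec:
  initContainers:
    - name: fix-perms
      image: busybox               # placeholder image
      command: ["sh", "-c", "chmod -R g+rwX /var/opt/dilax/enumeris"]
      volumeMounts:
        - name: nfs-data
          mountPath: /var/opt/dilax/enumeris
  containers:
    - name: app
      image: my-app:latest         # placeholder image
      volumeMounts:
        - name: nfs-data
          mountPath: /var/opt/dilax/enumeris
  volumes:
    - name: nfs-data
      persistentVolumeClaim:
        claimName: nfs-pvc         # placeholder claim name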

Mounting container filesystem into sidecar in k8s

I'd like to run perf record and perf script on a process running in a container in Kubernetes (actually on OpenShift). Following the approach from this blog post I was able to get perf record working in the sidecar. However, perf script cannot read symbols in the sidecar because they are present only in the main container.
Therefore I'd like to mount the complete filesystem of the main container into the sidecar, e.g. under /main, and then run perf script --symfs=/main. I don't want to copy the complete filesystem into an emptyDir. I've found another nice blog post about using an overlay filesystem; however, as far as I understand, I would need to create the overlay in the main container, and I don't want to run that as a privileged container or require commands (like mount) to be present there.
Is there any way to create a sort of reverse mount, exposing part of one container's filesystem so it can be mounted by the other containers within the same pod?
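A sketch (not taken from this thread) of the layout being described: with shareProcessNamespace enabled, containers in the pod can see each other's processes, and each container's root filesystem is reachable at /proc/<pid>/root, which could serve as a --symfs target. Images and capabilities below are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: profiled-app               # placeholder name
spec:
  shareProcessNamespace: true      # sidecar can see the main container's processes
  containers:
    - name: main
      image: my-app:latest         # placeholder; the process to profile runs here
    - name: perf-sidecar
      image: perf-tools:latest     # placeholder; must contain perf
      command: ["sleep", "infinity"]
      securityContext:
        capabilities:
          add: ["SYS_PTRACE", "SYS_ADMIN"]   # typically needed for perf and /proc/<pid>/root access

With something like this, perf script --symfs=/proc/<pid>/root in the sidecar could, in principle, resolve symbols from the main container's filesystem.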

Kubernetes: fsGroup has different impact on hostPath versus pvc and different impact on nfs versus cifs

Many of my workflows use pod IAM roles. As documented here, I must include fsGroup in order for non-root containers to read the generated identity token. The problem is that when I additionally include PVCs that point to CIFS PVs, the volumes fail to mount because they time out. Seemingly this is because the kubelet tries to chown all of the files on the volume, which takes too much time and causes the timeout. Questions:
Why doesn't Kubernetes try to chown all of the files when hostPath is used instead of a PVC? All of the workflows were fine until I switched from hostPath to PVCs, and now the timeout issue happens.
Why does this problem happen with CIFS PVCs but not NFS PVCs? I have noticed that the NFS PVCs continue to mount just fine, and fsGroup seemingly doesn't take effect, as I don't see the group id change on any of the files. However, the CIFS PVCs can no longer be mounted, seemingly due to the timeout issue. If it matters, I am using the native NFS PV support and this CIFS FlexVolume plugin, which has worked great up until now.
Overall, the goal of this post is to better understand how Kubernetes determines when to chown all of the files on a volume when fsGroup is included, in order to make a good design decision going forward. Thanks for any help you can provide!
Kubernetes Chowning Files References
https://learn.microsoft.com/en-us/azure/aks/troubleshooting
Since gid and uid are mounted as root or 0 by default. If gid or uid are set as non-root, for example 1000, Kubernetes will use chown to change all directories and files under that disk. This operation can be time consuming and may make mounting the disk very slow.
https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#configure-volume-permission-and-ownership-change-policy-for-pods
By default, Kubernetes recursively changes ownership and permissions for the contents of each volume to match the fsGroup specified in a Pod's securityContext when that volume is mounted. For large volumes, checking and changing ownership and permissions can take a lot of time, slowing Pod startup.
I posted this question on the Kubernetes Repo a while ago and it was recently answered in the comments.
The gist is that fsGroup support is implemented and decided per volume plugin. It is ignored for NFS, which is why I have never seen the kubelet chown files on NFS PVCs. A FlexVolume plugin can opt out of fsGroup-based permission changes by returning FSGroup false. So that is why the kubelet was trying to chown the CIFS PVCs -- the FlexVolume plugin I am using does not return FSGroup false.
So, in the end, you don't need to worry about this for NFS, and if you are using a FlexVolume plugin for a shared filesystem, you should make sure it returns FSGroup false if you don't want the kubelet to chown all of the files.