Kubernetes: fsGroup has a different impact on hostPath versus PVC and a different impact on NFS versus CIFS

Many of my workflows use pod IAM roles. As documented here, I must include fsGroup in order for non-root containers to read the generated identity token. The problem is that when I additionally include PVCs that point to CIFS PVs, the volumes fail to mount because they time out. Seemingly this is because the kubelet tries to chown all of the files on the volume, which takes too much time and causes the timeout. Questions…
Why doesn't Kubernetes try to chown all of the files when hostPath is used instead of a PVC? All of the workflows were fine until I switched from hostPath to PVCs, and now the timeout issue happens.
Why does this problem happen on CIFS PVCs but not NFS PVCs? I have noticed that the NFS PVCs continue to mount just fine, and fsGroup seemingly doesn't take effect, as I don't see the group id change on any of the files. However, the CIFS PVCs can no longer be mounted, seemingly due to the timeout issue. If it matters, I am using the native NFS PV support and this CIFS flexVolume plugin, which has worked great up until now.
Overall, the goal of this post is to better understand how Kubernetes decides when to chown all of the files on a volume when fsGroup is included, so that I can make a good design decision going forward. Thanks for any help you can provide!
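For context, here is a minimal sketch of the kind of pod spec involved (all names are made up, and the token projection details are omitted):
apiVersion: v1
kind: Pod
metadata:
  name: example-worker                # made-up name
spec:
  securityContext:
    runAsUser: 1000
    fsGroup: 1000                     # needed so the non-root user can read the projected identity token
  containers:
  - name: app
    image: example/app:latest         # made-up image
    volumeMounts:
    - name: shared
      mountPath: /data
  volumes:
  - name: shared
    persistentVolumeClaim:
      claimName: cifs-share-pvc       # made-up claim, bound to a CIFS flexVolume PV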
Kubernetes Chowning Files References
https://learn.microsoft.com/en-us/azure/aks/troubleshooting
Since gid and uid are mounted as root or 0 by default, if gid or uid
are set as non-root, for example 1000, Kubernetes will use chown to
change all directories and files under that disk. This operation can
be time consuming and may make mounting the disk very slow.
https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#configure-volume-permission-and-ownership-change-policy-for-pods
By default, Kubernetes recursively changes ownership and permissions
for the contents of each volume to match the fsGroup specified in a
Pod's securityContext when that volume is mounted. For large volumes,
checking and changing ownership and permissions can take a lot of
time, slowing Pod startup.
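That page also documents fsGroupChangePolicy, which can limit the recursive chown; a minimal sketch of how it would be set (assuming a Kubernetes version that supports the field):
securityContext:
  runAsUser: 1000
  fsGroup: 1000
  fsGroupChangePolicy: "OnRootMismatch"  # skip the recursive chown when the volume root already matches fsGroup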

I posted this question on the Kubernetes Repo a while ago and it was recently answered in the comments.
The gist is that fsGroup support is implemented and decided per volume plugin. It is ignored for NFS, which is why I have never seen the kubelet chown files on NFS PVCs. For FlexVolume plugins, a plugin can opt out of fsGroup-based permission changes by returning FSGroup false. So, that is why the kubelet was trying to chown the CIFS PVCs -- the FlexVolume plugin I am using does not return FSGroup false.
So, in the end you don't need to worry about this for NFS, and if you are using a FlexVolume plugin for a shared file system, you should make sure it returns FSGroup false if you don't want the kubelet to chown all of the files.
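For reference, the opt-out is reported by the driver's init call; the response looks roughly like this (field names are from memory, so treat this as a sketch rather than the exact contract):
{"status": "Success", "capabilities": {"attach": false, "fsGroup": false}}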

Related

How to delete files from EFS mounted into K8s pod?

I have a Kubernetes deployment which generates hundreds of thousands of files. I've mounted an EFS instance into my pod with a persistent volume and persistent volume claim. I've tried running my deployment but ran into an issue and now I need to wipe the persistent volume. What's the best way to do this?
I've tried exec-ing into my pod and running rm -rf, but that didn't seem to make any progress after 30 minutes. I also tried using rsync, but that was also incredibly slow.
Does EFS offer a mechanism to delete files from the console or command line? Does k8s offer a mechanism to wipe a persistent volume (claim)? What's the best way to give my pod a fresh slate to start working with again?
EDIT: I tried deleting and recreating the PVC but that didn't seem to work since my pod crashlooped once the deployment was restarted with the new PVC.
EDIT 2: I was mounting my PVC with a subPath - changing the subPath gave my pod a fresh new directory to work with. This was a nice workaround but I still would like to delete the old data in the EFS volume so I don't have to pay for it.
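For illustration, the workaround was just a different subPath on the same claim, roughly like this (names are made up):
volumeMounts:
- name: efs-storage        # made-up volume name
  mountPath: /output
  subPath: run-2           # was run-1; pointing at a new subPath gave the pod an empty directory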

Kubernetes configMap or persistent volume?

What is the best approach to passing multiple configuration files into a POD?
Assume that we have a legacy application that we have to dockerize and run in a Kubernetes environment. This application requires more than 100 configuration files to be passed. What is the best solution to do that? Create a hostPath volume and mount it to some directory containing config files on the host machine? Or maybe ConfigMaps allow passing everything as a single compressed file, and then extracting it into a pod volume?
Maybe Helm allows iterating over some directory and automatically creating one big ConfigMap that will act as a directory?
Any suggestions are welcome.
Create hostPath volume and mount it to some directory containing config files on the host machine
This should be avoided.
Accessing hostPaths may not always be allowed. Kubernetes may use PodSecurityPolicies (soon to be replaced by OPA/Gatekeeper/whatever admission controller you want ...), and OpenShift has similar SecurityContextConstraints objects, allowing you to define policies for which user can do what. As a general rule: accessing hostPaths would be forbidden.
Besides, hostPath volumes are local to one of your nodes. You won't be able to schedule your Pod someplace else if there's any outage. Either you've set a nodeSelector restricting its deployment to a single node, and your application will be down for as long as that node is; or there's no placement rule, and your application may restart without its configuration.
Now you could say: "if I mount my volume from an NFS share of some sort, ...". Which is true. But then, you would probably be better off using a PersistentVolumeClaim.
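For example, a minimal sketch of an NFS-backed PV/PVC pair (server, path, and size are placeholders, not a recommendation for your exact setup):
apiVersion: v1
kind: PersistentVolume
metadata:
  name: legacy-config-pv            # placeholder
spec:
  capacity:
    storage: 1Gi
  accessModes: ["ReadWriteMany"]
  nfs:
    server: nfs.example.local       # placeholder
    path: /exports/legacy-config    # placeholder
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: legacy-config-pvc           # placeholder
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: ""              # bind to the statically created PV above
  resources:
    requests:
      storage: 1Gi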
Create automatically one big configMap that will act as a directory
This could be an option. Although, as noted by larsks in the comments to your post: beware that ConfigMaps are limited in terms of size, and manipulating large objects (frequent edits/updates) could grow your etcd database size.
If you really have ~100 files, ConfigMaps may not be the best choice here.
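If you do go the Helm route the question mentions, iterating over a directory is indeed possible; a sketch (assuming the files sit under config/ inside the chart):
# templates/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: legacy-app-config           # placeholder
data:
{{ (.Files.Glob "config/*").AsConfig | indent 2 }}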
What next?
There's no one good answer, not knowing exactly what we're talking about.
If you want to allow editing those configurations without restarting containers, it would make sense to use some PersistentVolumeClaim.
If that's not needed, ConfigMaps could be helpful, if you can somewhat limit their volume and stick with non-critical data, while Secrets could be used to store passwords or any sensitive configuration snippets.
Some emptyDir could also be used, assuming you can figure out a way to automate provisioning of those configurations during container startup (e.g. a git clone in some initContainer, and/or some shell script contextualizing your configuration based on some environment variables).
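A rough sketch of that approach (repository URL, image, and paths are placeholders):
spec:
  volumes:
  - name: config
    emptyDir: {}
  initContainers:
  - name: fetch-config
    image: alpine/git                 # placeholder image whose entrypoint is git
    args: ["clone", "--depth=1", "https://git.example.com/legacy-config.git", "/config"]
    volumeMounts:
    - name: config
      mountPath: /config
  containers:
  - name: app
    image: legacy-app:1.0             # placeholder
    volumeMounts:
    - name: config
      mountPath: /etc/legacy-app      # wherever the application expects its files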
If there are files that are not expected to change over time, or whose lifecycle is closely related to that of the application version shipping in your container image: I would consider adding them to my Dockerfile. Maybe even add some startup script -- something you could easily call from an initContainer, generating whichever configuration you couldn't ship in the image.
Depending on what you're dealing with, you could combine PVC, emptyDirs, ConfigMaps, Secrets, git stored configurations, scripts, ...

fsGroup vs supplementalGroups

I'm running my deployment on OpenShift, and found that I need to have a GID of 2121 to have write access.
I still don't seem to have write access when I try this:
security:
  podSecurityContext:
    fsGroup: 2121
This gives me a "2121 is not an allowed group" error.
However, this does seem to be working for me:
security:
  podSecurityContext:
    fsGroup: 100010000 # original fsGroup value
    supplementalGroups: [2121]
I am wondering what the difference of fsGroup and supplementalGroups is.
I've read the documentation here and have also looked at kubectl explain deployment.spec.template.spec.securityContext, but I still can't quite understand the difference.
Could I get some clarification on what are the different use cases?
FSGroup is used to set the group that owns the pod volumes. This group will be used by Kubernetes to change the permissions of all files in volumes, when volumes are mounted by a pod.
The owning GID will be the FSGroup
The setgid bit is set (new files created in the volume will be owned by FSGroup)
The permission bits are OR'd with rw-rw----
If unset, the Kubelet will not modify the ownership and permissions of
any volume.
Some caveats when using FSGroup:
Changing the ownership of a volume for slow and/or large file systems
can cause delays in pod startup.
This can harm other processes using the same volume if their
processes do not have permission to access the new GID.
SupplementalGroups - controls which supplemental group ID can be assigned to processes in a pod.
A list of groups applied to the first process run in each container,
in addition to the container's primary GID. If unspecified, no groups
will be added to any container.
Additionally from the OpenShift documentation:
The recommended way to handle NFS access, assuming it is not an option
to change permissions on the NFS export, is to use supplemental
groups. Supplemental groups in OpenShift Container Platform are used
for shared storage, of which NFS is an example. In contrast, block
storage such as iSCSI uses the fsGroup SCC strategy and the fsGroup
value in the securityContext of the pod.
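Putting the two together, a sketch using the values from your question (whether 100010000 is actually required depends on your project's allowed range):
securityContext:
  fsGroup: 100010000          # applied to volumes whose plugin supports fsGroup (e.g. block storage)
  supplementalGroups: [2121]  # extra GID of the container processes; what a shared NFS export checks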

K8s - an NFS share is only mounted read-only in the pod

Environment: external NFS share for persistent storage, accessible to all, R/W, Centos7 VMs (NFS share and K8s cluster), NFS utils installed on all workers.
Mount on a VM, e.g. a K8s worker node, works correctly, the share is R/W
Deployed in the K8s cluster: PV, PVC, Deployment (Volumes - referenced to PVC, VolumeMount)
The structure of the YAML files corresponds to the various instructions and postings, including the postings here on the site.
The pod starts, the share is mounted. Unfortunately, it is read-only. All the suggestions from the postings I have found about this did not work so far.
Any idea what else I could look out for, what else I could try?
Thanks. Thomas
After digging deep, I found the cause of the problem. Apparently, the syntax for the NFS export is very sensitive. One more space can be problematic.
On the NFS server, two export entries were stored in the kernel tables. The first R/O and the second R/W. I don't know whether this is a Centos bug because of the syntax in /etc/exports.
On another CentOS machine I was able to mount the share without any problems (R/W). In the container (Debian-based image), however, it mounted read-only. I have not investigated whether this is due to Kubernetes or whether Debian behaves differently.
After correcting the /etc/exports file and restarting the NFS server, there was only one correct entry in the kernel table. After that, mounting R/W worked on a Centos machine as well as in the Debian-based container inside K8s.
Here are the files / table:
previous /etc/exports:
/nfsshare 172.16.6.* (rw,sync,no_root_squash,no_all_squash,no_acl)
==> kernel:
/nfsshare 172.16.6.*(ro, ...
/nfsshare *(rw, ...
corrected /etc/exports (w/o the blank):
/nfsshare *(rw,sync,no_root_squash,no_all_squash,no_acl)
In principle, the idea of using an init container is a good one. Thank you for reminding me of this.
I have tried it.
Unfortunately, it doesn't change the basic problem. The file system is mounted "read-only" by Kubernetes. The init container returns the following error message (from the log):
chmod: /var/opt/dilax/enumeris: Read-only file system

How to mimic Docker ability to pre-populate a volume from a container directory with Kubernetes

I am migrating my previous deployment made with docker-compose to Kubernetes.
In my previous deployment, some containers do have some data made at build time in some paths and these paths are mounted in persistent volumes.
Therefore, as the Docker volume documentation states, the persistent volume (not a bind mount) will be pre-populated with the container directory content.
I'd like to achieve this behavior with Kubernetes and its persistent volumes. How can I do that? Do I need to add some kind of logic using scripts in order to copy my container's files to the mounted path when data is not present the first time the container starts?
Possibly related question: Kubernetes mount volume on existing directory with files inside the container
I think your options are
ConfigMap (are "some data" configuration files?)
Init containers (as mentioned; see the sketch after this list)
CSI Volume Cloning (clone combining an init or your first app container)
there used to be a gitRepo volume; it was deprecated in favour of init containers, from which you can clone your config and data
HostPath volume mount is an option too
An NFS volume is probably a very reasonable option and similar from an approach point of view to your Docker Volumes
Storage types: NFS, iSCSI, awsElasticBlockStore, gcePersistentDisk and others can be pre-populated. There are constraints. NFS is probably the most flexible for sharing bits & bytes.
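For the init container option above, a rough sketch of seeding a PVC with the image's build-time content (image name, paths, and claim name are placeholders):
spec:
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: app-data                 # placeholder
  initContainers:
  - name: seed-data
    image: my-app:latest                  # placeholder; ships the build-time files under /app/seed
    command: ["sh", "-c", "if [ -z \"$(ls -A /data)\" ]; then cp -a /app/seed/. /data/; fi"]
    volumeMounts:
    - name: data
      mountPath: /data                    # copy only when the volume is still empty
  containers:
  - name: app
    image: my-app:latest                  # placeholder
    volumeMounts:
    - name: data
      mountPath: /app/seed                # mount over the same path, mimicking Docker's pre-population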
FYI
The subPath might be of interest too depending on your use case, and PodPreset might help in streamlining the op across the fleet of your pods.
HTH