fsGroup vs supplementalGroups - kubernetes

I'm running my deployment on OpenShift, and found that I need to have a GID of 2121 to have write access.
I still don't seem to have write access when I try this:
security:
  podSecurityContext:
    fsGroup: 2121
This gives me a "2121 is not an allowed group" error.
However, this does seem to be working for me:
security:
  podSecurityContext:
    fsGroup: 100010000 # original fsGroup value
    supplementalGroups: [2121]
I am wondering what the difference between fsGroup and supplementalGroups is.
I've read the documentation here and have also looked at kubectl explain deployment.spec.template.spec.securityContext, but I still can't quite understand the difference.
Could I get some clarification on what are the different use cases?

FSGroup is used to set the group that owns the pod's volumes. This group will be used by Kubernetes to change the permissions of all files in the volumes when they are mounted by a pod.
The owning GID will be the FSGroup
The setgid bit is set (new files created in the volume will be owned by FSGroup)
The permission bits are OR'd with rw-rw----
If unset, the Kubelet will not modify the ownership and permissions of
any volume.
Some caveats when using FSGroup:
Changing the ownership of a volume for slow and/or large file systems
can cause delays in pod startup.
It can also break other processes using the same volume if they do not have permission to access the new GID.
SupplementalGroups - controls which supplemental group IDs can be assigned to the processes in a pod.
A list of groups applied to the first process run in each container,
in addition to the container's primary GID. If unspecified, no groups
will be added to any container.
Additionally from the OpenShift documentation:
The recommended way to handle NFS access, assuming it is not an option
to change permissions on the NFS export, is to use supplemental
groups. Supplemental groups in OpenShift Container Platform are used
for shared storage, of which NFS is an example. In contrast, block
storage such as iSCSI uses the fsGroup SCC strategy and the fsGroup
value in the securityContext of the pod.
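Putting the two together, a minimal pod-level sketch might look like the following (the pod name, image, and claim name are placeholders; the GIDs are the ones from the question):
apiVersion: v1
kind: Pod
metadata:
  name: gid-demo
spec:
  securityContext:
    fsGroup: 100010000          # group that will own mounted volumes (block storage)
    supplementalGroups: [2121]  # extra GIDs added to every container process (shared storage)
  containers:
    - name: app
      image: registry.example.com/app:latest
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-pvc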

Related

set the readOnlyRootFilesystem flag to 'true' on the security context of the container's spec

For our containers we are getting a vulnerability finding; to fix it we need to set the readOnlyRootFilesystem flag to 'true'. But when we set this flag to true, all containers start to fail. Can anyone please help me out here?
First, understand why, how, and what your applications are writing to the root file system.
If you use a Deployment resource, your application should be stateless.
If it needs to persist data, prefer another resource, like a StatefulSet, and mount a PersistentVolumeClaim or a Volume where your Pod/container needs write access.
If your application only needs write access for log files, prefer to redirect logs to stdout, or add a Kubernetes emptyDir volume (ephemeral storage) and point those log files at it.
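A hedged sketch of that pattern (pod name, image, and paths are placeholders): the root filesystem is locked down and emptyDir volumes provide the only writable paths.
apiVersion: v1
kind: Pod
metadata:
  name: readonly-demo
spec:
  containers:
    - name: app
      image: registry.example.com/app:latest
      securityContext:
        readOnlyRootFilesystem: true   # container root filesystem becomes read-only
      volumeMounts:
        - name: tmp                    # writable scratch space
          mountPath: /tmp
        - name: logs                   # writable directory for log files
          mountPath: /var/log/app
  volumes:
    - name: tmp
      emptyDir: {}
    - name: logs
      emptyDir: {}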
docs:
ephemeral storage: https://kubernetes.io/docs/concepts/storage/ephemeral-volumes/
deployment: https://kubernetes.io/docs/tasks/run-application/run-stateless-application-deployment/
statefulsets: https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/

Limit/Controlled kubectl exec command

I am still a bit of a Kubernetes newbie, but I am looking for a way to give developers controlled access to the kubectl exec command. I want to let them run most read-only commands but prevent some high-risk commands, and also prevent interactive downloads/installs, etc. I also want to log all their actions during sessions for audit purposes.
I am not seeing any straightforward way to do it using RBAC, and I don't see those options in Rancher either. I am looking for some guidance and direction to achieve such a capability.
I am sure some of you have achieved it in some way.
Kubernetes RBAC can only validate whether you can or cannot exec into pods (by checking the create verb on the pods/exec resource). After that it switches to the SPDY protocol and passes your input to, and returns output from, the equivalent of docker exec on your container runtime, without actually caring about what goes in or out.
With RBAC you also have to specify the pod name, which might be problematic if you are using Deployments, where each new revision generates different pod names. Since pattern matching is not implemented in RBAC, you would have to change your role every time a new pod name is generated.
So the answer is "No, you can't do it with RBAC".
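To illustrate how coarse that check is, a Role like the following sketch (namespace and names are placeholders) can only say whether exec is allowed at all, never which command gets run:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dev-exec
  namespace: dev
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]       # let developers find the pods
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create"]            # all-or-nothing: RBAC never sees the command itself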
An alternative solution would be to use some kind of CI/CD tool (Jenkins, GitLab CI, etc.) or orchestration tool (Rundeck, Ansible Tower, etc.) where you create some kind of script, controlled by you, to which your developers pass arguments as a job, i.e.
kubectl exec deploy/foo -- /bin/bar baz "$DEV_ARGUMENT"
Which essentially means that you would be responsible for managing access to that job/script, creating and maintaining a serviceAccount for that script, etc.
If you are afraid of image mutability, i.e. you don't want your developers to install something in a running container, but otherwise are okay with giving them a shell on it (remember, they can still read any secrets/env vars/configMaps and even the serviceAccount token the pod uses if it is mounted by default), you should consider the following:
Don't run your containers as root. Try to use images that support rootless operation, and then either specify a correct non-root UID in the runAsUser field of the securityContext, or set the runAsNonRoot: true flag to deny containers running as root.
A better general solution would be to utilize PodSecurityPolicy (deprecated, removed in 1.25), Pod Security Admission, or a 3rd-party admission controller like OPA Gatekeeper to deny containers running as root in your namespace.
You can also make your pods immutable by using readOnlyRootFilesystem in the security context, which will deny write operations to the pod's ephemeral storage (but any volume you mounted read-write will still be writable). The feasibility of this approach depends on whether your apps use temporary files or not.
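A minimal sketch combining those settings (the UID and image are placeholders, assuming the image supports running as that non-root user):
apiVersion: v1
kind: Pod
metadata:
  name: hardened-demo
spec:
  securityContext:
    runAsUser: 1000               # non-root UID supported by the image
    runAsNonRoot: true            # refuse to start the container as UID 0
  containers:
    - name: app
      image: registry.example.com/app:latest
      securityContext:
        readOnlyRootFilesystem: true   # block writes to the container's root filesystem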
Relevant links:
kubernetes RBAC role verbs to exec to pod
https://github.com/kubernetes/kubernetes/issues/44703#issuecomment-324826356 - issue, discussing current rbac limitations
https://itnext.io/how-it-works-kubectl-exec-e31325daa910
https://erkanerol.github.io/post/how-kubectl-exec-works/ - both links explaining how exec actually works

Kubernetes configMap or persistent volume?

What is the best approach to passing multiple configuration files into a POD?
Assume that we have a legacy application that we have to dockerize and run in a Kubernetes environment. This application requires more than 100 configuration files to be passed. What is the best solution to do that? Create hostPath volume and mount it to some directory containing config files on the host machine? Or maybe config maps allow passing everything as a single compressed file, and then extracting it in the pod volume?
Maybe Helm somehow allows iterating over some directory, and automatically creating one big configMap that will act as a directory?
Any suggestions are welcome.
Create hostPath volume and mount it to some directory containing config files on the host machine
This should be avoided.
Accessing hostPaths may not always be allowed. Kubernetes may use PodSecurityPolicies (soon to be replaced by OPA/Gatekeeper/whatever admission controller you want ...), and OpenShift has similar SecurityContextConstraints objects, which allow defining policies for which user can do what. As a general rule: accessing hostPaths would be forbidden.
Besides, hostPath volumes are local to one of your nodes. You won't be able to schedule your Pod somewhere else if there's an outage. Either you've set a nodeSelector restricting its deployment to a single node, and your application will be down for as long as that node is. Or there's no placement rule, and your application may restart without its configuration.
Now you could say: "if I mount my volume from an NFS share of some sort, ...". Which is true. But then, you would probably be better using a PersistentVolumeClaim.
Create automatically one big configMap that will act as a directory
This could be an option. Although, as noted by @larsks in the comments to your post: beware that ConfigMaps are limited in size (about 1 MiB), and frequently editing/updating large objects can grow your etcd database size.
If you really have ~100 files, ConfigMaps may not be the best choice here.
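On the Helm idea from the question: Helm can glob a directory of chart files into a single ConfigMap. A hedged template sketch (the config/ directory and names are placeholders; the ~1 MiB ConfigMap limit still applies):
# templates/app-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-app-config
data:
{{ (.Files.Glob "config/*").AsConfig | indent 2 }}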
What next?
There's no one good answer, not knowing exactly what we're talking about.
If you want to allow editing those configurations without restarting containers, it would make sense to use some PersistentVolumeClaim.
If that's not needed, ConfigMaps could be helpful, if you can somewhat limit their volume and stick with non-critical data, while Secrets could be used for storing passwords or any sensitive configuration snippet.
Some emptyDir could also be used, assuming you can figure out a way to automate provisioning of those configurations during container startup (eg: a git clone in some initContainer, as sketched at the end of this answer, and/or some shell script contextualizing your configuration based on environment variables).
If there are files that are not expected to change over time, or whose lifecycle is closely related to that of the application version shipping in your container image: I would consider adding them to my Dockerfile. Maybe even add some startup script -- something you could easily call from an initContainer, generating whichever configuration you couldn't ship in the image.
Depending on what you're dealing with, you could combine PVC, emptyDirs, ConfigMaps, Secrets, git stored configurations, scripts, ...
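A rough sketch of the emptyDir + initContainer idea mentioned above (the git image, repository URL, and mount paths are all placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: config-fetch-demo
spec:
  volumes:
    - name: config
      emptyDir: {}                 # ephemeral directory shared between containers
  initContainers:
    - name: fetch-config
      image: alpine/git            # any image with a git client would do
      args: ["clone", "--depth", "1", "https://git.example.com/app-config.git", "/config"]
      volumeMounts:
        - name: config
          mountPath: /config
  containers:
    - name: app
      image: registry.example.com/app:latest
      volumeMounts:
        - name: config
          mountPath: /etc/app      # the legacy app reads its configuration files here
          readOnly: true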

Kubernetes: fsGroup has different impact on hostPath versus pvc and different impact on nfs versus cifs

Many of my workflows use pod IAM roles. As documented here, I must include fsGroup in order for non-root containers to read the generated identity token. The problem is that when I additionally include PVCs that point to CIFS PVs, the volumes fail to mount because they time out. Seemingly this is because the Kubelet tries to chown all of the files on the volume, which takes too much time and causes the timeout. Questions…
Why doesn't Kubernetes try to chown all of the files when hostPath is used instead of a PVC? All of the workflows were fine until I switched from hostPath to PVCs, and now the timeout issue happens.
Why does this problem happen on CIFS PVCs but not NFS PVCs? I have noticed that NFS PVCs continue to mount just fine, and fsGroup seemingly doesn't take effect as I don't see the group ID change on any of the files. However, the CIFS PVCs can no longer be mounted, seemingly due to the timeout issue. If it matters, I am using the native NFS PV support and this CIFS flexVolume plugin, both of which have worked great up until now.
Overall, the goal of this post is to better understand how Kubernetes determines when to chown all of the files on a volume when fsGroup is included in order to make a good design decision going forward. Thanks for any help you can provide!
Kubernetes Chowning Files References
https://learn.microsoft.com/en-us/azure/aks/troubleshooting
Since gid and uid are mounted as root or 0 by default. If gid or uid
are set as non-root, for example 1000, Kubernetes will use chown to
change all directories and files under that disk. This operation can
be time consuming and may make mounting the disk very slow.
https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#configure-volume-permission-and-ownership-change-policy-for-pods
By default, Kubernetes recursively changes ownership and permissions
for the contents of each volume to match the fsGroup specified in a
Pod's securityContext when that volume is mounted. For large volumes,
checking and changing ownership and permissions can take a lot of
time, slowing Pod startup.
I posted this question on the Kubernetes Repo a while ago and it was recently answered in the comments.
The gist is that fsGroup support is implemented and decided on per volume plugin. It is ignored for NFS, which is why I have never seen the Kubelet chown files on NFS PVCs. A FlexVolume plugin can opt out of fsGroup-based permission changes by returning fsGroup false in its capabilities. So that is why the Kubelet was trying to chown the CIFS PVCs -- the FlexVolume plugin I am using does not return fsGroup false.
So, in the end, you don't need to worry about this for NFS, and if you are using a FlexVolume plugin for a shared file system, you should make sure it returns fsGroup false if you don't want the Kubelet to chown all of the files.
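For volume plugins that do honor fsGroup, newer Kubernetes releases can also skip the recursive chown when the volume root already matches, via fsGroupChangePolicy. A minimal sketch (names and GID are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: fsgroup-policy-demo
spec:
  securityContext:
    fsGroup: 2000
    fsGroupChangePolicy: "OnRootMismatch"   # only chown/chmod if the volume root doesn't already match
  containers:
    - name: app
      image: registry.example.com/app:latest
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-pvc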

What prerequisites do I need for Kubernetes to mount an EBS volume?

The documentation doesn't go into detail. I imagine I would at least need an IAM role.
This is what we have done and it worked well.
I was on kubernetes 1.7.2 and trying to provision storage (dynamic/static) for kubernetes pods on AWS. Some of the things mentioned below may not be needed if you are not looking for dynamic storage classes.
Made sure that the DefaultStorageClass admission controller is enabled on the API server. (DefaultStorageClass is among the comma-delimited, ordered list of values for the --enable-admission-plugins flag of the API server component.)
I have given the options --cloud-provider=aws and --cloud-config=/etc/aws/aws.conf (while starting the apiserver, controller-manager, and kubelet).
(The file /etc/aws/aws.conf is present on the instance with the contents below.)
$ cat /etc/aws/aws.conf
[Global]
Zone = us-west-2a
Created an IAM policy, added it to a role (as in the link below), created an instance profile for it, and attached the profile to the instances. (NOTE: I missed attaching the instance profile at first and it did not work.)
https://medium.com/@samnco/using-aws-elbs-with-the-canonical-distribution-of-kubernetes-9b4d198e2101
For dynamic provisioning:
Created storage class and made it default.
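Roughly what a default class looked like with the in-tree AWS EBS provisioner of that era (a sketch; newer clusters would use the ebs.csi.aws.com CSI driver instead):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"   # mark it as the cluster default
provisioner: kubernetes.io/aws-ebs   # in-tree AWS EBS provisioner
parameters:
  type: gp2
  fsType: ext4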
Let me know if this did not work.
Regards
Sudhakar
This is the one used by kubespray, and is very likely indicative of a rational default:
https://github.com/kubernetes-incubator/kubespray/blob/v2.5.0/contrib/aws_iam/kubernetes-minion-policy.json
with the tl;dr of that link being to create an Allow for the following actions:
s3:*
ec2:Describe*
ec2:AttachVolume
ec2:DetachVolume
route53:*
(although I would bet that s3:* is too wide, I don't have the information handy to provide a more constrained version; similar observation on the route53:*)
All of the Resource keys for those are * except the s3: one which restricts the resource to buckets beginning with kubernetes-* -- unknown if that's just an example, or there is something special in the kubernetes prefixed buckets. Obviously you might have a better list of items to populate the Resource keys to genuinely restrict attachable volumes (just be careful with dynamically provisioned volumes, as would be created by PersistentVolume resources)