How to enforce MustRunAsNonRoot policy in K8S cluster in AKS

I have a K8s cluster running in the Azure AKS service.
I want to enforce the MustRunAsNonRoot policy. How do I do that?
The following policy is created:
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restrict-root
spec:
  privileged: false
  allowPrivilegeEscalation: false
  runAsUser:
    rule: MustRunAsNonRoot
  seLinux:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  volumes:
  - '*'
It is deployed in the cluster:
$ kubectl get psp
NAME            PRIV    CAPS   SELINUX    RUNASUSER          FSGROUP    SUPGROUP   READONLYROOTFS   VOLUMES
restrict-root   false          RunAsAny   MustRunAsNonRoot   RunAsAny   RunAsAny   false            *
Admission controller is running in the cluster:
$ kubectl get pods -n gatekeeper-system
NAME READY STATUS RESTARTS AGE
gatekeeper-audit-7b4bc6f977-lvvfl 1/1 Running 0 32d
gatekeeper-controller-5948ddcd54-5mgsm 1/1 Running 0 32d
gatekeeper-controller-5948ddcd54-b59wg 1/1 Running 0 32d
Nevertheless, it is possible to run a simple pod as root:
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
  - name: mypod
    image: busybox
    args: ["sleep", "10000"]
    securityContext:
      runAsUser: 0
The pod is running:
$ kubectl describe po mypod
Name: mypod
Namespace: default
Priority: 0
Node: aks-default-31327534-vmss000001/10.240.0.5
Start Time: Mon, 08 Feb 2021 23:10:46 +0100
Labels: <none>
Annotations: <none>
Status: Running
Why is MustRunAsNonRoot not applied? How can I enforce it?
EDIT: It looks like the AKS engine does not support PodSecurityPolicy (list of supported policies). The question is still the same: how do I enforce the MustRunAsNonRoot rule on workloads?

You shouldn't use PodSecurityPolicy on an Azure AKS cluster, as it has been set for deprecation as of May 31st, 2021 in favor of Azure Policy for AKS. Check the official docs for further details:
Warning
The feature described in this document, pod security policy (preview), is set for deprecation and will no longer be available after May 31st, 2021 in favor of Azure Policy for AKS.
The deprecation date has been extended from the previous date of October 15th, 2020.
So currently you should use Azure Policy for AKS instead. Among its built-in policies, grouped into initiatives (an initiative in Azure Policy is a collection of policy definitions tailored towards achieving a single overarching goal), you can find a policy whose goal is to disallow running privileged containers on your AKS cluster.
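If the Azure Policy add-on is not enabled yet, it can be switched on for an existing cluster; as a rough sketch (resource names are placeholders, check the current az CLI docs for your setup):
$ az aks enable-addons --addons azure-policy --name myAKSCluster --resource-group myResourceGroup
Once the add-on is running (it is what deploys the Gatekeeper pods you already see in gatekeeper-system), you assign the relevant built-in policy or initiative, e.g. the one disallowing privileged containers or the one requiring containers to run as non-root, to the cluster's scope in the Azure portal or with az policy assignment create.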
As for PodSecurityPolicy, for the time being it should still work. Please check the docs to make sure you didn't forget anything, e.g. that you set up the corresponding ClusterRole and ClusterRoleBinding to allow the policy to be used.
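A minimal sketch of that RBAC wiring for the restrict-root policy above, binding it to all authenticated users (adjust the subjects to whatever should be covered):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: psp-restrict-root
rules:
- apiGroups: ['policy']
  resources: ['podsecuritypolicies']
  verbs: ['use']
  resourceNames: ['restrict-root']
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: psp-restrict-root-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-restrict-root
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:authenticated
Note also that on AKS the PSP admission controller only runs if the pod security policy preview feature was enabled on the cluster (the --enable-pod-security-policy flag of az aks create/update); if it wasn't, PSP objects are silently ignored, which would explain why your root pod was still admitted.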

Related

How can I get more Replicas of Istio Running?

I am trying to upgrade the nodes in my Kubernetes cluster. When I go to do that, I get a notification saying:
PDB istio-ingressgateway in namespace istio-system allows 0 pod disruptions
PDB is a Pod Disruption Budget. Basically, Istio is saying that it can't lose that pod and keep things working correctly.
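You can inspect the budget that is blocking the drain directly:
$ kubectl get pdb istio-ingressgateway -n istio-system
With a single ingress gateway replica and (typically) minAvailable: 1, allowed disruptions will be 0, which is what blocks the node drain.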
There is a really long discussion about this over on the Istio GitHub issues. The issue has been ongoing for over 2 years, and most of the discussion centers around the defaults being wrong. There are a few workaround suggestions, but most of them are pre-1.4 (before the introduction of Istiod). The closest workaround I could find that might be compatible with the current version is to add additional replicas via the IstioOperator.
I tried that with a patch operation (run in PowerShell):
kubectl patch IstioOperator installed-state --patch $(Get-Content istio-ha-patch.yaml -Raw) --type=merge -n istio-system
Where istio-ha-patch.yaml is:
spec:
  components:
    egressGateways:
    - enabled: true
      k8s:
        hpaSpec:
          minReplicas: 2
      name: istio-egressgateway
    ingressGateways:
    - enabled: true
      k8s:
        hpaSpec:
          minReplicas: 2
      name: istio-ingressgateway
    pilot:
      enabled: true
      k8s:
        hpaSpec:
          minReplicas: 2
I applied that and checked the IstioOperator's YAML; the patch was indeed applied to the resource. But the replica count for the ingress pod did not go up (it stayed at 1 of 1).
At this point, my only option seems to be to uninstall Istio, apply my update, then re-install Istio. (Yuck.)
Is there any way to get the replica count of Istio's ingress gateway up so that I can keep it running while I do a rolling node upgrade?
It turns out that if you did not install Istio using the Istio Kubernetes Operator, you cannot use the option I tried.
Once I uninstalled Istio and reinstalled it using the Operator, I was able to get it to work.
I did not use the patch operation though; I just did a kubectl apply -f istio-operator-spec.yaml, where istio-operator-spec.yaml is:
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: istio-controlplane
  namespace: istio-system
spec:
  components:
    ingressGateways:
    - enabled: true
      k8s:
        hpaSpec:
          minReplicas: 2
      name: istio-ingressgateway
    pilot:
      enabled: true
      k8s:
        hpaSpec:
          minReplicas: 2
  profile: default
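To confirm the change took effect you can watch the gateway scale up and re-check the PDB (the gateway pods are normally labelled app=istio-ingressgateway):
$ kubectl get pods -n istio-system -l app=istio-ingressgateway
$ kubectl get pdb istio-ingressgateway -n istio-system
Once two replicas are running, the PDB should report 1 allowed disruption and the node drain can proceed.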

Subnetting within Kubernetes Cluster

I have a couple of deployments - say Deployment A and Deployment B. The K8s subnet is 10.0.0.0/20.
My requirement: is it possible to have all pods in Deployment A get IPs from 10.0.1.0/24 and pods in Deployment B from 10.0.2.0/24?
This keeps the networking clean, and a particular deployment can be identified from its IP alone.
A Deployment in Kubernetes is a high-level abstraction that relies on controllers to build basic objects. That is different from an object itself, such as a pod or a service.
If you take a look at the deployment spec in the Kubernetes API overview, you will notice that there is no such thing as defining subnets or IP addresses specific to a deployment, so you cannot specify subnets for deployments.
The Kubernetes idea is that pods are ephemeral. You should not try to identify resources by IP address, as IPs are randomly assigned; if a pod dies, it will get another IP address. You could look at something like StatefulSets if you are after unique, stable network identifiers.
While Kubernetes does not support this feature, I found a workaround using Calico's migrate pools feature.
First you need to have calicoctl installed. There are several ways to do that, mentioned in the install calicoctl docs.
I chose to install calicoctl as a Kubernetes pod:
kubectl apply -f https://docs.projectcalico.org/manifests/calicoctl.yaml
To make working with it faster, you can set up an alias:
alias calicoctl="kubectl exec -i -n kube-system calicoctl /calicoctl -- "
I created two YAML files to set up the IP pools:
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: pool1
spec:
  cidr: 10.0.0.0/24
  ipipMode: Always
  natOutgoing: true

apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: pool2
spec:
  cidr: 10.0.1.0/24
  ipipMode: Always
  natOutgoing: true
Then you have to apply the configuration. Since my YAML files were on the host filesystem rather than in the calicoctl pod itself, I piped the YAML into the command:
➜ cat ippool1.yaml | calicoctl apply -f-
Successfully applied 1 'IPPool' resource(s)
➜ cat ippool2.yaml | calicoctl apply -f-
Successfully applied 1 'IPPool' resource(s)
Listing the IP pools, you will notice the newly added ones:
➜ calicoctl get ippool -o wide
NAME                  CIDR             NAT    IPIPMODE   VXLANMODE   DISABLED   SELECTOR
default-ipv4-ippool   192.168.0.0/16   true   Always     Never       false      all()
pool1                 10.0.0.0/24      true   Always     Never       false      all()
pool2                 10.0.1.0/24      true   Always     Never       false      all()
Then you can specify which pool you want to use for your deployment:
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: deployment1-pool1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      annotations:
        cni.projectcalico.org/ipv4pools: "[\"pool1\"]"
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
---
I created a similar deployment called deployment2 that used pool2, with the results below:
deployment1-pool1-6d9ddcb64f-7tkzs 1/1 Running 0 71m 10.0.0.198 acid-fuji
deployment1-pool1-6d9ddcb64f-vkmht 1/1 Running 0 71m 10.0.0.199 acid-fuji
deployment2-pool2-79566c4566-ck8lb 1/1 Running 0 69m 10.0.1.195 acid-fuji
deployment2-pool2-79566c4566-jjbsd 1/1 Running 0 69m 10.0.1.196 acid-fuji
It is also worth mentioning that while testing this I found that if a deployment has many replicas and its pool runs out of IPs, Calico will fall back to using a different pool.
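If you want to avoid that fallback, one option (a sketch that reuses the calicoctl apply workflow above, not something from my original test) is to mark the default pool as disabled so it is no longer used for new allocations:
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 192.168.0.0/16
  ipipMode: Always
  natOutgoing: true
  disabled: true
Existing pods keep their addresses; only new IP allocations are affected.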

Pod stuck in Pending state when trying to schedule it on AWS Fargate

I have an EKS cluster to which I've added support for running in hybrid mode (in other words, I've added a Fargate profile to it). My intention is to run only specific workloads on AWS Fargate while keeping the EKS worker nodes for other kinds of workloads.
To test this out, my Fargate profile is defined to be:
Restricted to a specific namespace (let's say: mynamespace)
Has a specific label that pods need to match in order to be scheduled on Fargate (the label is: fargate: myvalue)
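For reference, a profile like that can be declared with eksctl; this is just an illustrative sketch with placeholder cluster/region values rather than my exact profile:
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster
  region: eu-west-1
fargateProfiles:
  - name: myprofile
    selectors:
      - namespace: mynamespace
        labels:
          fargate: myvalue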
For testing, I'm trying to deploy a simple nginx deployment which looks like this:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  namespace: mynamespace
  labels:
    fargate: myvalue
spec:
  selector:
    matchLabels:
      app: nginx
      version: 1.7.9
      fargate: myvalue
  replicas: 1
  template:
    metadata:
      labels:
        app: nginx
        version: 1.7.9
        fargate: myvalue
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80
When I try to apply this resource, I get the following:
$ kubectl get pods -n mynamespace -o wide
NAME                                READY   STATUS    RESTARTS   AGE   IP       NODE     NOMINATED NODE                                READINESS GATES
nginx-deployment-596c594988-x9s6n   0/1     Pending   0          10m   <none>   <none>   07c651ad2b-7cf85d41b2424e529247def8bda7bf38   <none>
The pod stays in the Pending state and is never scheduled onto AWS Fargate.
This is the pod describe output:
$ kubectl describe pod nginx-deployment-596c594988-x9s6n -n mynamespace
Name:               nginx-deployment-596c594988-x9s6n
Namespace:          mynamespace
Priority:           2000001000
PriorityClassName:  system-node-critical
Node:               <none>
Labels:             app=nginx
                    eks.amazonaws.com/fargate-profile=myprofile
                    fargate=myvalue
                    pod-template-hash=596c594988
                    version=1.7.9
Annotations:        kubernetes.io/psp: eks.privileged
Status:             Pending
IP:
Controlled By:      ReplicaSet/nginx-deployment-596c594988
NominatedNodeName:  9e418415bf-8259a43075714eb3ab77b08049d950a8
Containers:
  nginx:
    Image:        nginx:1.7.9
    Port:         80/TCP
    Host Port:    0/TCP
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-784d2 (ro)
Volumes:
  default-token-784d2:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-784d2
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>
One thing that I can conclude from this output is that the correct Fargate profile was chosen:
eks.amazonaws.com/fargate-profile=myprofile
Also, I see that some value was added to the NOMINATED NODE field, but I'm not sure what it represents.
Any ideas, or usual problems that might be worth troubleshooting in this case? Thanks
It turns out the problem was in the networking setup of the private subnets associated with the Fargate profile all along.
To give more info, here is what I initially had:
An EKS cluster with several worker nodes, where only public subnets were assigned to the EKS cluster itself
When I tried to add a Fargate profile to the EKS cluster, it was not possible to associate the profile with public subnets because of a current limitation on Fargate. To solve this, I created private subnets with the same tags as the public ones so that the EKS cluster was aware of them
What I forgot was that I needed to enable connectivity from the VPC private subnets to the outside world (I was missing a NAT gateway). So I created a NAT gateway in the public subnet that is associated with EKS, and added an additional entry to the private subnets' routing table that looks like this:
0.0.0.0/0 nat-xxxxxxxx
This solved the problem I had above, although I'm not sure of the real reason why an AWS Fargate profile needs to be associated only with private subnets.
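For completeness, the same manual fix can be expressed with the AWS CLI; this is only a rough sketch with placeholder IDs, not the exact commands I ran:
$ aws ec2 allocate-address --domain vpc
$ aws ec2 create-nat-gateway --subnet-id subnet-PUBLIC --allocation-id eipalloc-EXAMPLE
$ aws ec2 create-route --route-table-id rtb-PRIVATE --destination-cidr-block 0.0.0.0/0 --nat-gateway-id nat-xxxxxxxx
The first two commands allocate an Elastic IP and create the NAT gateway in a public subnet; the last one adds the 0.0.0.0/0 route shown above to the private subnets' route table.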
If you use the community terraform-aws-modules/vpc module, all of this can be taken care of by the following config:
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "2.44.0"
name = "vpc-module-demo"
cidr = "10.0.0.0/16"
azs = slice(data.aws_availability_zones.available.names, 0, 3)
private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
public_subnets = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
single_nat_gateway = true # needed for fargate (https://docs.aws.amazon.com/eks/latest/userguide/eks-ug.pdf#page=135&zoom=100,96,764)
enable_nat_gateway = true # needed for fargate (https://docs.aws.amazon.com/eks/latest/userguide/eks-ug.pdf#page=135&zoom=100,96,764)
enable_vpn_gateway = false
enable_dns_hostnames = true # needed for fargate (https://docs.aws.amazon.com/eks/latest/userguide/eks-ug.pdf#page=135&zoom=100,96,764)
enable_dns_support = true # needed for fargate (https://docs.aws.amazon.com/eks/latest/userguide/eks-ug.pdf#page=135&zoom=100,96,764)
tags = {
"Name" = "terraform-eks-demo-node"
"kubernetes.io/cluster/${var.cluster-name}" = "shared"
}
}

Kubernetes cluster cannot attach and mount automatically created google cloud platform disks to worker nodes

Basically I'm trying to deploy a cluster on GCE via kubeadm with StorageClass support (without using Google Kubernetes Engine).
Say I deployed a cluster with the master node in Tokyo and 3 worker nodes in Hong Kong, Taiwan and Oregon:
NAME STATUS ROLES AGE VERSION
k8s-node-hk Ready <none> 3h35m v1.14.2
k8s-node-master Ready master 3h49m v1.14.2
k8s-node-oregon Ready <none> 3h33m v1.14.2
k8s-node-tw Ready <none> 3h34m v1.14.2
kube-controller-manager and kubelet were both started with --cloud-provider=gce, and I can now apply a StorageClass and a PersistentVolumeClaim, have the disk created automatically on GCP (say, a disk in Taiwan), and get the PV and PVC bound.
kubectl get pvc:
NAME      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
eth-pvc   Bound    pvc-bf35e3c9-81e2-11e9-8926-42010a920002   10Gi       RWO            slow           137m
kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM             STORAGECLASS   REASON   AGE
pvc-bf35e3c9-81e2-11e9-8926-42010a920002   10Gi       RWO            Delete           Bound    default/eth-pvc   slow                    137m
However, kube-controller-manager cannot find the Taiwan node and attach the disk to the node in the same zone. It logged the following (we can see the zone asia-northeast1-a is not correct):
I0529 07:25:46.366500 1 reconciler.go:288] attacherDetacher.AttachVolume started for volume "pvc-bf35e3c9-81e2-11e9-8926-42010a920002" (UniqueName: "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-bf35e3c9-81e2-11e9-8926-42010a920002") from node "k8s-node-tokyo"
E0529 07:25:47.296824 1 attacher.go:102] Error attaching PD "kubernetes-dynamic-pvc-bf35e3c9-81e2-11e9-8926-42010a920002" to node "k8s-node-tokyo": GCE persistent disk not found: diskName="kubernetes-dynamic-pvc-bf35e3c9-81e2-11e9-8926-42010a920002" zone="asia-northeast1-a"
The kubelet on each node was started with --cloud-provider=gce, but I couldn't find a way to configure the zone. When I checked the kubelet's log, I found that this flag is already deprecated in Kubernetes v1.14.2 (the latest as of May 2019):
May 29 04:36:03 k8s-node-tw kubelet[29971]: I0529 04:36:03.623704 29971 server.go:417] Version: v1.14.2
May 29 04:36:03 k8s-node-tw kubelet[29971]: W0529 04:36:03.624039 29971 plugins.go:118] WARNING: gce built-in cloud provider is now deprecated. The GCE provider is deprecated and will be removed in a future release
However, kubelet did label the k8s-node-tw node with the correct zone and region:
May 29 04:36:05 k8s-node-tw kubelet[29971]: I0529 04:36:05.157665 29971 kubelet_node_status.go:331] Adding node label from cloud provider: beta.kubernetes.io/instance-type=n1-standard-1
May 29 04:36:05 k8s-node-tw kubelet[29971]: I0529 04:36:05.158093 29971 kubelet_node_status.go:342] Adding node label from cloud provider: failure-domain.beta.kubernetes.io/zone=asia-east1-a
May 29 04:36:05 k8s-node-tw kubelet[29971]: I0529 04:36:05.158201 29971 kubelet_node_status.go:346] Adding node label from cloud provider: failure-domain.beta.kubernetes.io/region=asia-east1
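To double-check, the zone and region labels can also be printed directly:
$ kubectl get nodes -L failure-domain.beta.kubernetes.io/zone -L failure-domain.beta.kubernetes.io/region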
Thanks for reading this far. My question is:
If it's possible, how can I configure kubelet or kube-controller-manager correctly so that GCP's storage classes are supported and the created disks are attached and mounted successfully?
==================K8s config files======================
Deployment (related part):
volumes:
  - name: data
    persistentVolumeClaim:
      claimName: eth-pvc
  - name: config
    configMap:
      name: {{ template "ethereum.fullname" . }}-geth-config
  - name: account
    secret:
      secretName: {{ template "ethereum.fullname" . }}-geth-account
PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: eth-pvc
spec:
  storageClassName: slow
  resources:
    requests:
      storage: 10Gi
  accessModes:
  - ReadWriteOnce
SC:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: slow
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-standard
  replication-type: none
  zone: asia-east1-a
After several days of research, I found that the reason is:
The master node is in asia-northeast1-a (Tokyo)
The worker nodes are in asia-east1-a (Taiwan) and other zones
cloud-provider-gcp only searches for disks within one region (normally the master node's zone, but you can specify it by setting local-zone in the cloud config file), which means that by default it can only support one zone, or multiple zones within a single region
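For reference, the zone the provider searches comes from the cloud config file passed to the controller manager via --cloud-config; a minimal sketch of such a file (key names as used by the in-tree GCE provider, values are placeholders):
[global]
project-id = my-gcp-project
local-zone = asia-northeast1-a
multizone = true
Setting multizone = true makes the provider consider all zones of that one region, but it still does not cross region boundaries, which is exactly the limitation described above.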
Conclusion:
In order to support multiple zones across multiple regions, we need to modify the GCE provider code or its configuration, e.g. add another field to configure which zones should be searched.
==========================UPDATE=========================
I modified the k8s code to add an extra-zones config field, like this diff on GitHub, to make it work for my use case.

Using sysctls in Google Kubernetes Engine (GKE)

I'm running a k8s cluster - 1.9.4-gke.1 - on Google Kubernetes Engine (GKE).
I need to set sysctl net.core.somaxconn to a higher value inside some containers.
I've found this official k8s page: Using Sysctls in a Kubernetes Cluster - which seemed to solve my problem. The solution was to add an annotation to my pod spec like the following:
annotations:
  security.alpha.kubernetes.io/sysctls: net.core.somaxconn=1024
But when I tried to create my pod:
Status: Failed
Reason: SysctlForbidden
Message: Pod forbidden sysctl: "net.core.somaxconn" not whitelisted
So I've tried to create a PodSecurityPolicy like the following:
---
apiVersion: extensions/v1beta1
kind: PodSecurityPolicy
metadata:
  name: sites-psp
  annotations:
    security.alpha.kubernetes.io/sysctls: 'net.core.somaxconn'
spec:
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
  - '*'
... but it didn't work either.
I've also found that I can use a kubelet argument on every node to whitelist the specific sysctl: --experimental-allowed-unsafe-sysctls=net.core.somaxconn
I've added this argument to the KUBELET_TEST_ARGS setting on my GCE machine and restarted it. From what I can see in the output of the ps command, the option was successfully added to the kubelet process at startup:
/home/kubernetes/bin/kubelet --v=2 --kube-reserved=cpu=60m,memory=960Mi --experimental-allowed-unsafe-sysctls=net.core.somaxconn --allow-privileged=true --cgroup-root=/ --cloud-provider=gce --cluster-dns=10.51.240.10 --cluster-domain=cluster.local --pod-manifest-path=/etc/kubernetes/manifests --experimental-mounter-path=/home/kubernetes/containerized_mounter/mounter --experimental-check-node-capabilities-before-mount=true --cert-dir=/var/lib/kubelet/pki/ --enable-debugging-handlers=true --bootstrap-kubeconfig=/var/lib/kubelet/bootstrap-kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --anonymous-auth=false --authorization-mode=Webhook --client-ca-file=/etc/srv/kubernetes/pki/ca-certificates.crt --cni-bin-dir=/home/kubernetes/bin --network-plugin=kubenet --volume-plugin-dir=/home/kubernetes/flexvolume --node-labels=beta.kubernetes.io/fluentd-ds-ready=true,cloud.google.com/gke-nodepool=temp-pool --eviction-hard=memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5% --feature-gates=ExperimentalCriticalPodAnnotation=true
The problem is that I keep receiving a message telling me that my pod cannot be started because sysctl net.core.somaxconn is not whitelisted.
Is there some limitation on GKE so that I cannot whitelist a sysctl? Am I doing something wrong?
Until sysctl support becomes better integrated, you can put this in your pod spec:
spec:
  initContainers:
  - name: sysctl-buddy
    image: busybox:1.29
    securityContext:
      privileged: true
    command: ["/bin/sh"]
    args:
    - -c
    - sysctl -w net.core.somaxconn=4096 vm.overcommit_memory=1
    resources:
      requests:
        cpu: 1m
        memory: 1Mi
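To check that the init container actually took effect, you can read the value back from inside the running pod (assuming the container image ships sysctl, as busybox does):
$ kubectl exec <pod-name> -- sysctl net.core.somaxconn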
This is an intentional Kubernetes limitation. There is an open PR to add net.core.somaxconn to the whitelist here: https://github.com/kubernetes/kubernetes/pull/54896
As far as I know, there isn't a way to override this behavior on GKE.