Limit the number of pods per node - kubernetes

I'm trying to limit the number of pods per each node from my cluster.
I managed to add a global limit per node from kubeadm init with config file:
apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
networking:
  podSubnet: <subnet>
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 10
This is not quite what I want, because the limit is applied even on the master node (where multiple kube-system pods are running and the number of pods there may grow beyond 10).
I would like to keep the default value at init and set a custom value at join on each node.
I have found something:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 10
---
apiVersion: kubeadm.k8s.io/v1beta1
kind: JoinConfiguration
discovery:
  bootstrapToken:
    apiServerEndpoint: "<api_endpoint>"
    token: "<token>"
    unsafeSkipCAVerification: true
but, even if no error/warning appears, it seems that the value of maxPods is ignored. I can create more than 10 pods for that specific node.
Also kubectl get node <node> -o yaml returns status.capacity.pods with its default value (110).
How can I proceed in order to have this pods limit applied per each node?
I would like to mention that I have basic/limited knowledge related to Kubernetes.
Thank you!

There is a config.yaml file at /var/lib/kubelet. This config file is generated from the kubelet ConfigMap in the kube-system namespace when you run kubeadm join. Partial content of the file is as below.
apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
  webhook:
    cacheTTL: 0s
    enabled: true
  x509:
    clientCAFile: /etc/kubernetes/pki/ca.crt
authorization:
  mode: Webhook
  webhook:
    cacheAuthorizedTTL: 0s
    cacheUnauthorizedTTL: 0s
clusterDNS:
- 10.96.0.10
maxPods: 10
You can edit that file, add the maxPods parameter, and then restart the kubelet on the node.
sudo systemctl restart kubelet
Currently in kubeadm join there is no way to pass a kubelet config file.
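A minimal sketch of that workflow on a worker node (paths assume a default kubeadm installation; <node> is a placeholder):
# On the worker node: set maxPods in the kubelet config (append it if it is not present yet)
sudo grep -q '^maxPods:' /var/lib/kubelet/config.yaml \
  && sudo sed -i 's/^maxPods:.*/maxPods: 10/' /var/lib/kubelet/config.yaml \
  || echo 'maxPods: 10' | sudo tee -a /var/lib/kubelet/config.yaml
sudo systemctl restart kubelet
# From a machine with kubectl access: confirm the node now reports the new capacity
kubectl get node <node> -o jsonpath='{.status.capacity.pods}'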

You can also set the maximum number of pods per node with the kubelet --max-pods option.
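For example, on a kubeadm-provisioned node a hedged way to pass that flag is via the KUBELET_EXTRA_ARGS environment file that the kubeadm systemd drop-in sources (/etc/default/kubelet on Debian/Ubuntu, /etc/sysconfig/kubelet on RHEL/CentOS):
# Set the flag and restart the kubelet; note that command-line flags override values from the kubelet config file
echo 'KUBELET_EXTRA_ARGS=--max-pods=10' | sudo tee /etc/default/kubelet
sudo systemctl daemon-reload && sudo systemctl restart kubelet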

Related

Whitelisting sysctl parameters for helm chart

I have a helm chart that deploys an app but also needs to reconfigure some sysctl parameters in order to run properly. When I install the helm chart and run kubectl describe pod/pod_name on the pod that was deployed, I get forbidden sysctl: "kernel.sem" not whitelisted. I have added a PodSecurityPolicy like so, but with no luck.
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: policy
spec:
  allowedUnsafeSysctls:
  - kernel.sem
  - kernel.shmmax
  - kernel.shmall
  - fs.mqueue.msg_max
  seLinux:
    rule: 'RunAsAny'
  runAsUser:
    rule: 'RunAsAny'
  supplementalGroups:
    rule: 'RunAsAny'
  fsGroup:
    rule: 'RunAsAny'
---UPDATE---
I also tried to set the kubelet parameters via a config file in order to allow unsafe sysctls, but I get the error no kind "KubeletConfiguration" is registered for version "kubelet.config.k8s.io/v1beta1".
Here's the configuration file:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
allowedUnsafeSysctls:
- "kernel.sem"
- "kernel.shmmax"
- "kernel.shmall"
- "fs.mqueue.msg_max"
The kernel.sem sysctl is considered an unsafe sysctl and is therefore disabled by default (only safe sysctls are enabled by default). You can allow one or more unsafe sysctls on a node-by-node basis; to do so, you need to add the --allowed-unsafe-sysctls flag to the kubelet.
Look at "Enabling Unsafe Sysctls"
I've created a simple example to illustrate how it works.
First I added the --allowed-unsafe-sysctls flag to the kubelet.
In my case I use kubeadm, so I need to add this flag to the
/etc/systemd/system/kubelet.service.d/10-kubeadm.conf file:
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --allowed-unsafe-sysctls=kernel.sem"
...
NOTE: You have to add this flag on every node you want to run Pod with kernel.sem enabled.
Then I reloaded the systemd manager configuration and restarted the kubelet with the command below:
# systemctl daemon-reload && systemctl restart kubelet
Next I created a simple Pod using this manifest file:
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: web
  name: web
spec:
  securityContext:
    sysctls:
    - name: kernel.sem
      value: "250 32000 100 128"
  containers:
  - image: nginx
    name: web
Finally we can check if it works correctly:
# sysctl -a | grep "kernel.sem"
kernel.sem = 32000 1024000000 500 32000 // on the worker node
# kubectl get pod
NAME   READY   STATUS    RESTARTS   AGE
web    1/1     Running   0          110s
# kubectl exec -it web -- bash
root@web:/# cat /proc/sys/kernel/sem
250     32000   100     128        // inside the Pod
Your PodSecurityPolicy doesn't work as expected because, as you can see in the documentation:
Warning: If you allow unsafe sysctls via the allowedUnsafeSysctls field in a PodSecurityPolicy, any pod using such a sysctl will fail to start if the sysctl is not allowed via the --allowed-unsafe-sysctls kubelet flag as well on that node.

Subnetting within Kubernetes Cluster

I have a couple of deployments - say Deployment A and Deployment B. The K8s subnet is 10.0.0.0/20.
My requirement: is it possible to have all pods in Deployment A get IPs from 10.0.1.0/24 and pods in Deployment B from 10.0.2.0/24?
This keeps the networking clean, and a particular deployment can be identified from the IP itself.
A Deployment in Kubernetes is a high-level abstraction that relies on controllers to build basic objects. That is different from a basic object itself, such as a Pod or a Service.
If you take a look at the Deployment spec in the Kubernetes API Overview, you will notice that there is no such thing as defining subnets or IP addresses specific to a deployment, so you cannot specify subnets for deployments.
The Kubernetes idea is that a pod is ephemeral. You should not try to identify resources by IP address, as IPs are randomly assigned; if a pod dies, it will get another IP address. You could look at something like StatefulSets if you are after unique, stable network identifiers.
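For completeness, a minimal StatefulSet sketch (all names here are illustrative): each replica gets a stable, ordinal identity such as web-0 and web-1, reachable through the headless Service:
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  clusterIP: None      # headless Service: gives pods stable DNS names like web-0.web
  selector:
    app: web
  ports:
  - port: 80
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web     # must reference the headless Service above
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx
        image: nginx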
While Kubernetes does not support this natively, I found a workaround using Calico's Migrate pools feature.
First you need to have calicoctl installed. There are several ways to do that mentioned in the install calicoctl docs.
I chose to install calicoctl as a Kubernetes pod:
kubectl apply -f https://docs.projectcalico.org/manifests/calicoctl.yaml
To make working with it faster, you can set up an alias:
alias calicoctl="kubectl exec -i -n kube-system calicoctl /calicoctl -- "
I have created two YAML files to set up the IP pools:
# ippool1.yaml
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: pool1
spec:
  cidr: 10.0.0.0/24
  ipipMode: Always
  natOutgoing: true

# ippool2.yaml
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: pool2
spec:
  cidr: 10.0.1.0/24
  ipipMode: Always
  natOutgoing: true
Then you have to apply the configuration. Since my YAML files were on my host filesystem and not in the calicoctl pod itself, I piped the YAML into the command:
➜ cat ippool1.yaml | calicoctl apply -f-
Successfully applied 1 'IPPool' resource(s)
➜ cat ippool2.yaml | calicoctl apply -f-
Successfully applied 1 'IPPool' resource(s)
Listing the IP pools, you will notice the newly added ones:
➜ calicoctl get ippool -o wide
NAME                  CIDR             NAT    IPIPMODE   VXLANMODE   DISABLED   SELECTOR
default-ipv4-ippool   192.168.0.0/16   true   Always     Never       false      all()
pool1                 10.0.0.0/24      true   Always     Never       false      all()
pool2                 10.0.1.0/24      true   Always     Never       false      all()
Then you can specify which pool you want to use for your deployment:
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: deployment1-pool1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      annotations:
        cni.projectcalico.org/ipv4pools: "[\"pool1\"]"
      labels:
        app: nginx
    # rest of the pod template (containers, etc.) omitted
---
I have created a similar one called deployment2 that uses pool2, with the results below:
deployment1-pool1-6d9ddcb64f-7tkzs   1/1   Running   0   71m   10.0.0.198   acid-fuji
deployment1-pool1-6d9ddcb64f-vkmht   1/1   Running   0   71m   10.0.0.199   acid-fuji
deployment2-pool2-79566c4566-ck8lb   1/1   Running   0   69m   10.0.1.195   acid-fuji
deployment2-pool2-79566c4566-jjbsd   1/1   Running   0   69m   10.0.1.196   acid-fuji
It is also worth mentioning that, while testing this, I found that if a deployment has many replicas and its pool runs out of IPs, Calico will fall back to a different pool.
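A quick way to keep an eye on that (assuming the same calicoctl alias as above) is to check pool utilisation:
# Show overall IPAM usage per pool; add --show-blocks for per-block detail
calicoctl ipam show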

Fails to run kubeadm init

With reference to https://github.com/kubernetes/kubeadm/issues/1239. How do I configure and start the latest kubeadm successfully?
kubeadm_new.config is generated by config migration:
kubeadm config migrate --old-config kubeadm_default.config --new-config kubeadm_new.config. Content of kubeadm_new.config:
apiEndpoint:
  advertiseAddress: 1.2.3.4
  bindPort: 6443
apiVersion: kubeadm.k8s.io/v1alpha3
bootstrapTokens:
- groups:
  - system:bootstrappers:kubeadm:default-node-token
  token: abcdef.0123456789abcdef
  ttl: 24h0m0s
  usages:
  - signing
  - authentication
kind: InitConfiguration
nodeRegistration:
  criSocket: /var/run/dockershim.sock
  name: khteh-t580
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
---
apiVersion: kubeadm.k8s.io/v1alpha3
auditPolicy:
  logDir: /var/log/kubernetes/audit
  logMaxAge: 2
  path: ""
certificatesDir: /etc/kubernetes/pki
clusterName: kubernetes
controlPlaneEndpoint: ""
etcd:
  local:
    dataDir: /var/lib/etcd
    image: ""
imageRepository: k8s.gcr.io
kind: ClusterConfiguration
kubernetesVersion: v1.12.2
networking:
  dnsDomain: cluster.local
  podSubnet: ""
  serviceSubnet: 10.96.0.0/12
unifiedControlPlaneImage: ""
I changed "kubernetesVersion: v1.12.2" in kubeadm_new.config and it seems to progress further and now stuck at the following error:
failed to run Kubelet: Running with swap on is not supported, please disable swap! or set --fail-swap-on flag to false.
How do I set fail-swap-on to FALSE to get it going?
Kubeadm comes with a command which prints default configuration, so you can check each of the assigned default values with:
kubeadm config print-default
In your case, if you want to disable the swap check in the kubelet, you have to add the following lines to your current kubeadm config:
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failSwapOn: false
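Assuming the block above is appended to kubeadm_new.config (the file name from the question), you would then re-run init against the merged file:
# Hedged sketch: kubeadm picks up the extra KubeletConfiguration document from the same config file
sudo kubeadm init --config kubeadm_new.config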
You haven't mentioned why you chose to disable swap.
I wouldn't consider it as a first option - not because memory swap is a bad practice (it is a useful and basic kernel mechanism) but because it seems the Kubelet is not designed to work properly with swap enabled.
K8S is very clear about this topic as you can see in the Kubeadm installation:
Swap disabled. You MUST disable swap in order for the kubelet to work
properly.
I would recommend reading about evicting end-user pods and the relevant features that K8S provides to prioritize the memory of pods:
1) The 3 QoS classes - make sure that your high-priority workloads run with the Guaranteed (or at least Burstable) class; see the sketch below.
2) Pod Priority and Preemption.
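For reference, a minimal sketch of a pod that lands in the Guaranteed QoS class (requests equal to limits for every container; names and values are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-example
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:          # requests == limits for both cpu and memory => Guaranteed QoS
        cpu: "500m"
        memory: "256Mi"
      limits:
        cpu: "500m"
        memory: "256Mi"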

Taint a node in kubernetes live cluster

How can I achieve the same command with a YAML file such that I can do a kubectl apply -f? The command below works and it taints but I can't figure out how to use it via a manifest file.
$ kubectl taint nodes \
    172.4.5.2-3a1d4eeb \
    kops.k8s.io/instancegroup=loadbalancer:NoSchedule
Use the -o yaml option and save the resulting YAML (make sure to remove the status and some other extra fields). This applies the taint and also gives you YAML that you can later reuse with kubectl apply -f and keep in version control. (Even if you create the resource from the command line and later get the YAML and apply it, it will not re-create the resource, so this is perfectly fine.)
Note: most commands support --dry-run, which just generates the YAML without creating the resource, but in this case I could not make it work with --dry-run; maybe this command does not support that flag.
C02W84XMHTD5:~ iahmad$ kubectl taint node minikube dedicated=foo:PreferNoSchedule -o yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: 2018-10-16T21:44:03Z
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    kubernetes.io/hostname: minikube
    node-role.kubernetes.io/master: ""
  name: minikube
  resourceVersion: "291136"
  selfLink: /api/v1/nodes/minikube
  uid: 99a1a304-d18c-11e8-9334-f2cf3c1f0864
spec:
  externalID: minikube
  taints:
  - effect: PreferNoSchedule
    key: dedicated
    value: foo
then use the yaml with kubectl apply:
apiVersion: v1
kind: Node
metadata:
  name: minikube
spec:
  taints:
  - effect: PreferNoSchedule
    key: dedicated
    value: foo
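For example (the file name is illustrative):
kubectl apply -f minikube-taint.yaml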
I have two nodes in my cluster; please look at the labels:
kubectl get nodes --show-labels
NAME          STATUS   ROLES   AGE    VERSION   LABELS
172.16.2.53   Ready    node    7d4h   v1.19.7   type=primary
172.16.2.89   Ready    node    33m    v1.19.7   type=secondary
Let's say I want to taint the node named 172.16.2.89:
kubectl taint node 172.16.2.89 type=secondary:NoSchedule
node/172.16.2.89 tainted
Example -
kubectl taint node <node-name> <label-key>=<value>:NoSchedule
NoExecute means the pod will be evicted from the node.
NoSchedule means the scheduler will not place the pod onto the node.
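Conversely, a pod that should still be scheduled onto (or keep running on) the tainted node needs a matching toleration. A minimal sketch for the type=secondary:NoSchedule taint created above:
apiVersion: v1
kind: Pod
metadata:
  name: tolerant-pod
spec:
  tolerations:
  - key: "type"
    operator: "Equal"
    value: "secondary"
    effect: "NoSchedule"
  containers:
  - name: nginx
    image: nginx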

How to assign a namespace to certain nodes?

Is there any way to configure nodeSelector at the namespace level?
I want to run a workload only on certain nodes for this namespace.
To achieve this you can use the PodNodeSelector admission controller.
First, you need to enable it in your kube-apiserver:
Edit /etc/kubernetes/manifests/kube-apiserver.yaml, find the --enable-admission-plugins= flag, and add the PodNodeSelector plugin to it.
Now you can specify the scheduler.alpha.kubernetes.io/node-selector option in the annotations of your namespace, for example:
apiVersion: v1
kind: Namespace
metadata:
  name: your-namespace
  annotations:
    scheduler.alpha.kubernetes.io/node-selector: env=test
spec: {}
status: {}
After these steps, all the pods created in this namespace will have this section automatically added:
nodeSelector:
  env: test
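You can verify this by launching a throwaway pod in the namespace and inspecting its spec (the pod name is arbitrary):
kubectl run test --image=nginx -n your-namespace
kubectl get pod test -n your-namespace -o jsonpath='{.spec.nodeSelector}'
# should show the injected env=test selector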
More information about the PodNodeSelector you can find in the official Kubernetes documentation:
https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#podnodeselector
kubeadm users
If you deployed your cluster using kubeadm and if you want to make this configuration persistent, you have to update your kubeadm config file:
kubectl edit cm -n kube-system kubeadm-config
specify extraArgs with custom values under apiServer section:
apiServer:
  extraArgs:
    enable-admission-plugins: NodeRestriction,PodNodeSelector
then update your kube-apiserver static manifest on all control-plane nodes:
# Kubernetes 1.22 and newer:
kubectl get configmap -n kube-system kubeadm-config -o=jsonpath="{.data.ClusterConfiguration}" > kubeadm-config.yaml
# Before Kubernetes 1.22:
# ("kubeadm config view" was deprecated in 1.19 and removed in 1.22)
# Reference: https://github.com/kubernetes/kubeadm/issues/2203
kubeadm config view > kubeadm-config.yaml
# Regenerate the manifest from the file produced by either command above
kubeadm init phase control-plane apiserver --config kubeadm-config.yaml
kubespray users
You can just use the kube_apiserver_enable_admission_plugins variable in your api-server configuration variables:
kube_apiserver_enable_admission_plugins:
  - PodNodeSelector
I totally agree with the @kvaps answer, but something is missing: it is necessary to add a label to your node:
kubectl label node <yournode> env=test
That way, pods created in the namespace with scheduler.alpha.kubernetes.io/node-selector: env=test will be schedulable only on nodes that have the env=test label.
To dedicate nodes to only hosting resources belonging to a namespace, you also have to prevent the scheduling of other resources onto those nodes.
This can be achieved by a combination of a node selector and a taint, injected via the admission controller when you create resources in the namespace. In this way, you don't have to manually label and add tolerations to each resource; it is sufficient to create them in the namespace.
What each piece does:
the node selector forces scheduling of the namespace's resources only onto the selected nodes
the taint denies scheduling of any resource outside the namespace onto those nodes
Configuration of nodes/node pool
Add a taint to the nodes you want to dedicate to the namespace:
kubectl taint nodes project.example.com/GPUsNodePool=true:NoSchedule -l=nodesWithGPU=true
This example adds the taint to the nodes that already have the label nodesWithGPU=true. You can taint nodes also individually by name: kubectl taint node my-node-name project.example.com/GPUsNodePool=true:NoSchedule
Add a label:
kubectl label nodes project.example.com/GPUsNodePool=true -l=nodesWithGPU=true
The same is done if, for example, you use Terraform and AKS. The node pool configuration:
resource "azurerm_kubernetes_cluster_node_pool" "GPUs_node_pool" {
name = "gpusnp"
kubernetes_cluster_id = azurerm_kubernetes_cluster.clustern_name.id
vm_size = "Standard_NC12" # https://azureprice.net/vm/Standard_NC12
node_taints = [
"project.example.com/GPUsNodePool=true:NoSchedule"
]
node_labels = {
"project.example.com/GPUsNodePool" = "true"
}
node_count = 2
}
Namespace creation
Create then the namespace with instructions for the admission controller:
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-namespace
  annotations:
    scheduler.alpha.kubernetes.io/node-selector: "project.example.com/GPUsNodePool=true"  # poorly documented: format has to be "selector-label=label-val"
    scheduler.alpha.kubernetes.io/defaultTolerations: '[{"operator": "Equal", "value": "true", "effect": "NoSchedule", "key": "project.example.com/GPUsNodePool"}]'
    project.example.com/description: 'This namespace is dedicated only to resources that need a GPU.'
Done! Create resources in the namespace and the admission controller together with the scheduler will do the rest.
Testing
Create a sample pod with no label or toleration, but in the namespace:
kubectl run test-dedicated-ns --image=nginx --namespace=gpu-namespace
# get pods in the namespace
kubectl get po -n gpu-namespace
# get node name
kubectl get po test-dedicated-ns -n gpu-namespace -o jsonpath='{.spec.nodeName}'
# check running pods on a node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node>
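To also confirm that the admission controller injected the default toleration and node selector defined in the namespace annotations, you can inspect the pod spec:
kubectl get po test-dedicated-ns -n gpu-namespace -o jsonpath='{.spec.nodeSelector}'
kubectl get po test-dedicated-ns -n gpu-namespace -o jsonpath='{.spec.tolerations}'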