Is it possible to add swap space on kubernetes nodes?


I am trying to add swap space on a Kubernetes node to prevent out-of-memory issues. Is it possible to add swap space on a node (previously known as a minion)? If so, what procedure should I follow, and how does it affect the pod acceptance tests?

Kubernetes doesn't support container memory swap. Even if you add swap space, kubelet will create the container with --memory-swappiness=0 (when using Docker). There have been discussions about adding support, but the proposal was not approved. https://github.com/kubernetes/kubernetes/issues/7294

Technically you can do it.
There is a broad discussion about whether to give K8S users the option of enabling swap.
I'll first refer directly to your question and then continue with the discussion.
If you run K8S on Kubeadm and you've added swap to your nodes - follow the steps below:
1 ) Reset the current cluster setup and then add the fail-swap-on=false flag to the kubelet configuration:
kubeadm reset
echo 'Environment="KUBELET_EXTRA_ARGS=--fail-swap-on=false"' >> /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
(*) If you're running on Ubuntu, replace the kubelet config path (the /etc/systemd/system/kubelet.service.d/10-kubeadm.conf file used above) with /etc/default/kubelet.
2 ) Reload the service:
systemctl daemon-reload
systemctl restart kubelet
3 ) Initialize the cluster settings again and ignore the swap error:
kubeadm init --ignore-preflight-errors Swap
OR:
If you prefer working with kubeadm-config.yaml:
1 ) Add the failSwapOn flag:
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failSwapOn: false # <---- Here
2 ) And run:
kubeadm init --config /etc/kubernetes/kubeadm-config.yaml --ignore-preflight-errors=Swap
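Before choosing either path, it can help to confirm whether swap is actually active on the node. Here is a minimal sketch in Python, assuming a Linux node where /proc/swaps contains a header line followed by one line per swap area. The function name active_swap_areas is illustrative and not part of any Kubernetes tooling.

```python
from pathlib import Path
from typing import List


def active_swap_areas(proc_swaps_text: str) -> List[str]:
    """Return the names of active swap areas from /proc/swaps-style text."""
    lines = proc_swaps_text.strip().splitlines()
    # The first line is the column header (Filename, Type, Size, ...).
    return [line.split()[0] for line in lines[1:] if line.strip()]


if __name__ == "__main__":
    swaps = Path("/proc/swaps")
    areas = active_swap_areas(swaps.read_text()) if swaps.exists() else []
    print("swap enabled:" if areas else "swap disabled", areas)
```

An empty list means the node has no active swap, in which case the failSwapOn setting makes no practical difference.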
Returning to the discussion of whether to allow swapping or not.
On the one hand, K8S is very clear about this: the kubelet is not designed to support swap. You can see it mentioned in the Kubeadm link I shared above:
Swap disabled. You MUST disable swap in order for the kubelet to work
properly
On the other hand, you can see users reporting cases where their deployments require swap to be enabled.
I would suggest that you first try without enabling swap.
(Not because swap is a function that the kernel can't manage, but merely because it is not recommended by Kube - probably related to the design of Kubelet).
Make sure that you are familiar with the features that K8S provides to prioritize memory of pods:
1 ) The three QoS classes - make sure that your high-priority workloads are running with the Guaranteed (or at least Burstable) class.
2 ) Pod Priority and Preemption.
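To make the QoS classes concrete, here is a rough Python sketch of how the class follows from the CPU and memory requests and limits of a pod's containers. It is a simplified model, not the kubelet's code: the qos_class function and the dictionary layout are assumptions made for illustration, and quantities are compared as plain strings rather than parsed.

```python
from typing import Dict, List

Container = Dict[str, Dict[str, str]]  # {"requests": {...}, "limits": {...}}
RESOURCES = ("cpu", "memory")


def qos_class(containers: List[Container]) -> str:
    """Approximate the QoS class Kubernetes would assign to a pod."""
    any_set = False
    all_guaranteed = True
    for c in containers:
        requests = c.get("requests", {})
        limits = c.get("limits", {})
        for res in RESOURCES:
            limit = limits.get(res)
            # A missing request defaults to the limit, as in Kubernetes.
            request = requests.get(res, limit)
            if request is not None or limit is not None:
                any_set = True
            # Simplification: quantities are compared as plain strings.
            if limit is None or request != limit:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"
```

For example, a container with cpu and memory requests equal to its limits yields Guaranteed, while a container that only sets a memory request yields Burstable.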
I would recommend also reading Evicting end-user Pods:
If the kubelet is unable to reclaim sufficient resource on the node,
kubelet begins evicting Pods.
The kubelet ranks Pods for eviction first by whether or not their
usage of the starved resource exceeds requests, then by Priority, and
then by the consumption of the starved compute resource relative to
the Pods' scheduling requests.
As a result, kubelet ranks and evicts Pods in the following order:
BestEffort or Burstable Pods whose usage of a starved resource exceeds its request. Such pods are ranked by Priority, and then usage
above request.
Guaranteed pods and Burstable pods whose usage is beneath requests are evicted last. Guaranteed Pods are guaranteed only when requests
and limits are specified for all the containers and they are equal.
Such pods are guaranteed to never be evicted because of another Pod's
resource consumption. If a system daemon (such as kubelet, docker, and
journald) is consuming more resources than were reserved via
system-reserved or kube-reserved allocations, and the node only has
Guaranteed or Burstable Pods using less than requests remaining, then
the node must choose to evict such a Pod in order to preserve node
stability and to limit the impact of the unexpected consumption to
other Pods. In this case, it will choose to evict pods of Lowest
Priority first.
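The ranking described in the quote can be sketched as a sort key. This is only an illustration of the three criteria (usage above request, then Priority, then amount above request). The PodUsage type and eviction_order function are hypothetical names, and the real kubelet logic has more cases than this.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PodUsage:
    name: str
    priority: int   # higher value = more important
    request: int    # requested amount of the starved resource (e.g. MiB)
    usage: int      # current usage in the same unit


def eviction_order(pods: List[PodUsage]) -> List[str]:
    """Return pod names, first-to-evict first, per the quoted ranking."""
    def key(p: PodUsage):
        exceeds = p.usage > p.request
        return (
            not exceeds,              # pods over their request come first
            p.priority,               # then lowest priority first
            -(p.usage - p.request),   # then largest usage above request
        )
    return [p.name for p in sorted(pods, key=key)]
```

A pod that stays below its request sorts last regardless of priority, which is why setting realistic requests on important workloads matters more than adding swap.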
Good luck (:
A few relevant discussions:
Kubelet/Kubernetes should work with Swap Enabled
[ERROR Swap]: running with swap on is not supported. Please disable swap
Kubelet needs to allow configuration of container memory-swap

Kubernetes 1.22 introduced swap as an alpha feature.
More at:
https://kubernetes.io/blog/2021/08/09/run-nodes-with-swap-alpha/
https://kubernetes.io/docs/concepts/architecture/nodes/#swap-memory

Related

Kubernetes: avoid pod being evicted by DiskPressure

I am looking for best practices to avoid the pod error "The node had condition: [DiskPressure]".
I'm running a full database export of all our views, which is massive. At some point the pod runs into the DiskPressure error and Kubernetes decides to evict and kill it.
What would be the best practice to handle this? There is 7GB of free space, which may not be enough. Is raising that the best way to go about it, or are there other mechanisms to handle this type of work?
Hope my question makes sense

The error message "Pod The node had a condition: [DiskPressure]" appears when the node is under disk pressure: the kubelet won't admit new pods on the node, which means they won't start. Node disk pressure means that the disks attached to the node are running low on available space.
Common reasons for node disk pressure are that Kubernetes has not cleaned up unused images or that logs are building up. If you have a long-running container that writes a lot of logs, they may grow enough to exceed the capacity of the node disk.
Troubleshooting Node Disk Pressure:
To troubleshoot node disk pressure, you need to figure out which files are taking up the most space. You can either SSH into each Kubernetes node manually or use a DaemonSet; you can do that from this link.
After installing you can start looking at the logs of the pods that are running by executing kubectl logs -l app=disk-checker. You will see a list of files and their sizes, which will give you greater insight into what is taking up space on your nodes.
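If you SSH into a node instead, the same kind of report can be produced by hand. The following Python sketch is a hypothetical stand-in for the disk-checker output, not the DaemonSet's actual code. It lists the largest files under a directory you choose; /var/log is an assumed example path.

```python
import os
from typing import List, Tuple


def largest_files(root: str, top_n: int = 10) -> List[Tuple[int, str]]:
    """Return (size_in_bytes, path) for the top_n largest files under root."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat avoids following symlinks out of the tree.
                sizes.append((os.lstat(path).st_size, path))
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return sorted(sizes, reverse=True)[:top_n]


if __name__ == "__main__":
    for size, path in largest_files("/var/log"):
        print(f"{size:>12}  {path}")
```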
Possible solutions:
If the issue is caused by necessary application data, the files cannot be deleted. In this case, you will have to increase the size of the node disks to ensure that there's sufficient room for the application files.
Otherwise, find the applications that have produced a lot of files that are no longer needed and simply delete the unnecessary files.
Adding more for your information:
1) To avoid DiskPressure crashing the node:
DiskPressure is triggered when the available disk space or inodes on either the node's root filesystem or the image filesystem satisfy an eviction threshold, which in turn causes pod eviction; refer to these Node conditions.
Based on the Node conditions, you should consider adjusting the kubelet parameters --image-gc-high-threshold and --image-gc-low-threshold (and --low-diskspace-threshold-mb) so that there is always enough space for normal operations, and consider provisioning more space for your nodes, depending on your requirements.
2) To reduce the DiskPressure condition:
Use the kubelet command line arg:
--eviction-hard mapStringString: A set of eviction thresholds (e.g. memory.available<1Gi) that if met would trigger a pod eviction.
DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See Set Kubelet parameters via a config file for more information.
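To see how such a threshold string is read, here is a small Python sketch that splits a signal<value list and checks one threshold against available space. It is not the kubelet's parser: parse_thresholds and threshold_met are illustrative names, and only percentages and the Ki/Mi/Gi suffixes are handled.

```python
from typing import Dict

UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_thresholds(spec: str) -> Dict[str, str]:
    """Split 'signal<value,signal<value' into {signal: value}."""
    thresholds = {}
    for item in spec.split(","):
        if not item.strip():
            continue  # ignore empty entries
        signal, _, value = item.strip().partition("<")
        thresholds[signal] = value
    return thresholds


def threshold_met(value: str, available: int, capacity: int) -> bool:
    """True when 'available' bytes fall below a quantity or percentage threshold."""
    if value.endswith("%"):
        return available < capacity * float(value[:-1]) / 100
    for suffix, factor in UNITS.items():
        if value.endswith(suffix):
            return available < float(value[: -len(suffix)]) * factor
    return available < float(value)  # plain byte count
```

With a percentage threshold, a fixed amount of free space can be fine on a small disk and still trip the threshold on a larger one: under nodefs.available<10%, 7GiB free passes on a 50GiB disk but not on a 100GiB disk.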

How to delete a pod from Kubernetes master node?

Does anyone know how to delete a pod from a Kubernetes master node? I have a single master node on a bare-metal Ubuntu server. When I try to delete the pod with "kubectl delete pod .." or force-delete it as described at https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/, it doesn't work. The pod is created again and again...

The pods in a StatefulSet are managed by the StatefulSet controller and will be recreated if the current and desired replica counts defined in the spec do not match.
The document you linked explains how to kill the pods forcefully, skipping graceful shutdown, which can have unexpected consequences depending on the application.
The link clearly states the pods will be recreated in the section:
Force deletions do not wait for confirmation from the kubelet that the Pod has been terminated. Irrespective of whether a force deletion is successful in killing a Pod, it will immediately free up the name from the apiserver. This would let the StatefulSet controller create a replacement Pod with that same identity; this can lead to the duplication of a still-running Pod, and if said Pod can still communicate with the other members of the StatefulSet, will violate the at most one semantics that StatefulSet is designed to guarantee.
If you want the pods to be stopped without new pods being created for the StatefulSet, you need to scale the StatefulSet down by setting replicas to 0.
You can read the official docs for how to scale the StatefulSet replicas.

The key to figuring out how to kill the pod is to understand how it was created. For example, if the pod is part of a deployment with a declared replica count of 1, then once you kill or force-kill it, Kubernetes detects a mismatch between the desired state (the number of replicas defined in the deployment configuration) and the current state, and creates a new pod to replace the one that was deleted. Therefore, in this example you will need to either scale the deployment to 0 or delete the deployment.
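The desired-versus-current comparison can be pictured with a toy reconcile function in Python. This is a deliberately simplified model of what a controller does, not Kubernetes code, and the reconcile name and pod naming scheme are made up for the example.

```python
from typing import List


def reconcile(desired_replicas: int, running: List[str], prefix: str = "pod") -> List[str]:
    """Toy model of a controller loop: create or remove pods until counts match."""
    pods = list(running)
    counter = 0
    while len(pods) < desired_replicas:
        candidate = f"{prefix}-{counter}"
        counter += 1
        if candidate not in pods:
            pods.append(candidate)  # replacement pod for a missing replica
    # Scale-down: drop surplus pods from the end.
    return pods[:desired_replicas]
```

Deleting the only pod while the desired count is still 1 gives reconcile(1, []), which brings a pod back. Only reconcile(0, ["pod-0"]) leaves nothing running, which matches scaling to 0.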

If we need to kill any pod we can just scale down the replica set.
kubectl scale deploy <deployment_name> --replicas=<expected_no_of_replicas>

The way to delete a pod depends on how you created it. If you created it individually (not as part of a ReplicaSet/ReplicationController/Deployment), you can delete the pod directly. Otherwise, the only option is to scale it down. In production setups, I believe most people use Deployments rather than bare ReplicaSets or ReplicationControllers (please refer to the documentation to understand the differences between the three).

About k8s metrics server: only some resources can be monitored

Version
k8s version: v1.19.0
metrics server: v0.3.6
I set up a k8s cluster and metrics server. It can report nodes and pods on the master node, but worker nodes cannot be seen; it returns unknown.
NAME   CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
u-29   1160m        14%    37307Mi         58%
u-31   2755m        22%    51647Mi         80%
u-32   4661m        38%    32208Mi         50%
u-34   1514m        12%    41083Mi         63%
u-36   1570m        13%    40400Mi         62%
When the pod is running on the client (worker) node, it returns: unable to fetch pod metrics for pod default/nginx-7764dc5cf4-c2sbq: no metrics known for pod
When the pod is running on the master node, it returns CPU and memory:
NAME                     CPU(cores)   MEMORY(bytes)
nginx-7cdd6c99b8-6pfg2   0m           2Mi

This is a community wiki answer based on OP's comment posted for better visibility. Feel free to expand it.
The issue was caused by using different versions of docker on different nodes. After upgrading docker to v19.3 on both nodes and executing kubeadm reset the issue was resolved.

Generally the metrics server receives the metrics via the kubelet.
Maybe there is a problem in retrieving the information from that.
You will need to look at the following configuration options mentioned in the readme.
Configuration
Depending on your cluster setup, you may also need to change flags passed to the Metrics Server container. Most useful flags:
--kubelet-preferred-address-types - The priority of node address types used when determining an address for connecting to a particular node (default [Hostname,InternalDNS,InternalIP,ExternalDNS,ExternalIP])
--kubelet-insecure-tls - Do not verify the CA of serving certificates presented by Kubelets. For testing purposes only.
--requestheader-client-ca-file - Specify a root certificate bundle for verifying client certificates on incoming requests.
You could try the configuration changes below.
--kubelet-preferred-address-types=InternalIP
--kubelet-insecure-tls
You can refer to this ticket for more information.

What happens when you drain nodes in a Kubernetes cluster?

I'd like to get some clarification for preparation for maintenance when you drain nodes in a Kubernetes cluster:
Here's what I know when you run kubectl drain MY_NODE:
Node is cordoned
Pods are gracefully shut down
You can opt to ignore Daemonset pods because if they are shut down, they'll just be re-spawned right away again.
I'm confused as to what happens when a node is drained though.
Questions:
What happens to the pods? As far as I know, there's no 'live migration' of pods in Kubernetes.
Will the pod be shut down and then automatically started on another node? Or does this depend on my configuration? (i.e. could a pod be shut down via drain and not start up on another node)
I would appreciate some clarification on this and any best practices or advice as well. Thanks in advance.

By default, kubectl drain is non-destructive; you have to override the defaults to change that behaviour. It runs with the following defaults:
--delete-local-data=false
--force=false
--grace-period=-1
--ignore-daemonsets=false
--timeout=0s
Each of these safeguards deals with a different category of potential destruction (local data, bare pods, graceful termination, daemonsets). It also respects pod disruption budgets to preserve workload availability. Any non-bare pod will be recreated on a new node by its respective controller (e.g. daemonset controller, replication controller).
It's up to you whether you want to override that behaviour (for example, you might have a bare pod if you are running a Jenkins job; if you override by setting --force=true, it will delete that pod and it won't be recreated). If you don't override it, the node will stay in drain mode indefinitely (--timeout=0s).

I just want to add a few things to eamon1234's answer:
You may find this useful as well:
Link to official documentation (in case default flags change etc.). According to it:
The 'drain' evicts or deletes all pods except mirror pods (which
cannot be deleted through the API server). If there are
DaemonSet-managed pods, drain will not proceed without
--ignore-daemonsets, and regardless it will not delete any DaemonSet-managed pods, because those pods would be immediately
replaced by the DaemonSet controller, which ignores unschedulable
markings. If there are any pods that are neither mirror pods nor
managed by ReplicationController, ReplicaSet, DaemonSet, StatefulSet
or Job, then drain will not delete any pods unless you use --force.
--force will also allow deletion to proceed if the managing resource of one or more pods is missing.
Simple chart illustrating what actually happens when using kubectl drain.
Using kubectl drain with --dry-run option may be also a good idea so you can see its outcome before any actual changes are applied e.g.:
kubectl drain foo --force --dry-run
However, it will not show any errors about existing local data or daemonsets, which you can see when running without the --dry-run flag:
... error: cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore) ...
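The rules quoted from the documentation can be summarised in a short Python sketch. It models only the decisions described in that quote (mirror pods, DaemonSet pods, unmanaged pods) and leaves out pod disruption budgets, local data, and timeouts. The plan_drain function and the "Mirror" owner label are assumptions made for the example, not kubectl internals.

```python
from typing import Dict, List, Optional

MANAGED_KINDS = {"ReplicationController", "ReplicaSet", "StatefulSet", "Job"}


def plan_drain(pods: Dict[str, Optional[str]], ignore_daemonsets: bool = False,
               force: bool = False) -> List[str]:
    """Return pod names drain would remove, or raise if it would refuse to proceed.

    `pods` maps pod name -> owner kind ("Mirror" for mirror pods, None for bare pods).
    """
    to_remove = []
    for name, owner in pods.items():
        if owner == "Mirror":
            continue  # mirror pods cannot be deleted through the API server
        if owner == "DaemonSet":
            if not ignore_daemonsets:
                raise RuntimeError("DaemonSet-managed pods present; use --ignore-daemonsets")
            continue  # never deleted: the DaemonSet controller would replace them
        if owner not in MANAGED_KINDS and not force:
            raise RuntimeError(f"unmanaged pod {name}; use --force")
        to_remove.append(name)
    return to_remove
```

In this model a bare pod is only removed when force is set, and nothing recreates it afterwards, which matches the warning about bare pods in the first answer.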

We can use kubectl drain to safely evict all of our pods from a node before we perform maintenance on the node.
If you want to update, patch, or perform any other kind of maintenance on the hardware/node, you should first drain all the pods with kubectl drain so that they are rescheduled onto other nodes.
When kubectl drain returns successfully, that indicates that all of the pods have been safely evicted. It is then safe to bring down the node.
After the maintenance work, we can use kubectl uncordon to tell Kubernetes that it can resume scheduling new pods onto the node.

Jenkins X builds fail with "The node was low on resource: [DiskPressure]."

My Jenkins X installation, mid-project, is now becoming very unstable. (Mainly) Jenkins pods are failing to start due to disk pressure.
Commonly, many pods are failing with
The node was low on resource: [DiskPressure].
or
0/4 nodes are available: 1 Insufficient cpu, 1 node(s) had disk pressure, 2 node(s) had no available volume zone.
Unable to mount volumes for pod "jenkins-x-chartmuseum-blah": timeout expired waiting for volumes to attach or mount for pod "jx"/"jenkins-x-chartmuseum-blah". list of unmounted volumes=[storage-volume]. list of unattached volumes=[storage-volume default-token-blah]
Multi-Attach error for volume "pvc-blah" Volume is already exclusively attached to one node and can't be attached to another
This may have become more pronounced with more preview builds for projects with npm and the massive node-modules directories it generates. I'm also not sure if Jenkins is cleaning up after itself.
Rebooting the nodes helps, but not for very long.

Let's approach this from the Kubernetes side.
There are a few things you could do to fix this:
As mentioned by #Vasily check what is causing disk pressure on nodes. You may also need to check logs from:
kubectl logs: kube-scheduler events logs
journalctl -u kubelet: kubelet logs
/var/log/kube-scheduler.log
More about why those logs below.
Check your Eviction Thresholds. Adjust Kubelet and Kube-Scheduler configuration if needed. See what is happening with both of them (logs mentioned earlier might be useful now). More info can be found here
Check if you got a correctly running Horizontal Pod Autoscaler: kubectl get hpa
You can use standard kubectl commands to setup and manage your HPA.
Finally, the volume-related errors that you receive indicate that there might be a problem with the PVC and/or PV. Make sure your volume is in the same zone as the node. If you want to mount the volume to a specific container, make sure it is not exclusively attached to another one. More info can be found here and here
I did not test it myself because more info is needed in order to reproduce the whole scenario but I hope that above suggestion will be useful.
Please let me know if that helped.
