How can I OOM all BestEffort Pods in Kubernetes?

To demonstrate the kubelet's eviction behaviour, I am trying to deploy a Kubernetes workload that will consume memory to the point that the kubelet evicts all BestEffort Pods due to memory pressure but does not kill my workload (or at least not before the BestEffort Pods).
My best attempt is below. It writes to two tmpfs volumes (since, by default, the limit of a tmpfs volume is half of the Node's total memory). The 100 comes from the fact that --eviction-hard=memory.available<100Mi is set on the kubelet:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fallocate
  namespace: developer
spec:
  selector:
    matchLabels:
      app: fallocate
  template:
    metadata:
      labels:
        app: fallocate
    spec:
      containers:
      - name: alpine
        image: alpine
        command:
        - /bin/sh
        - -c
        - |
          count=1
          while true
          do
            AVAILABLE_DISK_KB=$(df /cache-1 | grep /cache-1 | awk '{print $4}')
            AVAILABLE_DISK_MB=$(( $AVAILABLE_DISK_KB / 1000 ))
            AVAILABLE_MEMORY_MB=$(free -m | grep Mem | awk '{print $4}')
            MINIMUM=$(( $AVAILABLE_DISK_MB > $AVAILABLE_MEMORY_MB ? $AVAILABLE_MEMORY_MB : $AVAILABLE_DISK_MB ))
            fallocate -l $(( $MINIMUM - 100 ))MB /cache-1/$count
            AVAILABLE_DISK_KB=$(df /cache-2 | grep /cache-2 | awk '{print $4}')
            AVAILABLE_DISK_MB=$(( $AVAILABLE_DISK_KB / 1000 ))
            AVAILABLE_MEMORY_MB=$(free -m | grep Mem | awk '{print $4}')
            MINIMUM=$(( $AVAILABLE_DISK_MB > $AVAILABLE_MEMORY_MB ? $AVAILABLE_MEMORY_MB : $AVAILABLE_DISK_MB ))
            fallocate -l $(( $MINIMUM - 100 ))MB /cache-2/$count
            count=$(( $count+1 ))
            sleep 1
          done
        resources:
          requests:
            memory: 2Gi
            cpu: 100m
          limits:
            cpu: 100m
        volumeMounts:
        - name: cache-1
          mountPath: /cache-1
        - name: cache-2
          mountPath: /cache-2
      volumes:
      - name: cache-1
        emptyDir:
          medium: Memory
      - name: cache-2
        emptyDir:
          medium: Memory
The intention of this script is to use up memory until Node memory usage crosses the hard eviction threshold, causing the kubelet to start evicting. It evicts some BestEffort Pods, but in most cases my workload is killed before all BestEffort Pods are evicted. Is there a better way of doing this?
I am running on GKE with cluster version 1.9.3-gke.0.
EDIT:
I also tried using simmemleak:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: simmemleak
  namespace: developer
spec:
  selector:
    matchLabels:
      app: simmemleak
  template:
    metadata:
      labels:
        app: simmemleak
    spec:
      containers:
      - name: simmemleak
        image: saadali/simmemleak
        resources:
          requests:
            memory: 1Gi
            cpu: 1m
          limits:
            cpu: 1m
But this workload keeps dying before any evictions. I think the issue is that it is being killed by the kernel before the kubelet has time to react.

To avoid the system OOM killer acting before the kubelet can evict, you could limit the memory of the kubepods cgroup using --system-reserved and --enforce-node-allocatable (read more in the Node Allocatable docs).
For example, if the Node has 32Gi of memory, you could configure the kubepods cgroup to be limited to roughly 20Gi, combined with:
--eviction-hard=memory.available<500Mi
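A minimal sketch of kubelet flags that would produce that split; the reservation value is an illustrative assumption, not from the original answer. With --enforce-node-allocatable=pods, the kubelet caps the kubepods cgroup at roughly Node capacity minus the reserved memory, so on a 32Gi node reserving 12Gi leaves about 20Gi for pods:
--enforce-node-allocatable=pods
--system-reserved=memory=12Gi
--eviction-hard=memory.available<500Mi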

I found this in the Kubernetes docs; I hope it helps:
kubelet may not observe memory pressure right away
The kubelet currently polls cAdvisor to collect memory usage stats at a regular interval.
If memory usage increases within that window rapidly, the kubelet may not observe MemoryPressure fast enough, and the OOMKiller will still be invoked.
We intend to integrate with the memcg notification API in a future release to reduce this latency, and instead have the kernel tell us when a threshold has been crossed immediately.
If you are not trying to achieve extreme utilization, but a sensible measure of overcommit, a viable workaround for this issue is to set eviction thresholds at approximately 75% capacity.
This increases the ability of this feature to prevent system OOMs, and promote eviction of workloads so cluster state can rebalance.
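For instance (an illustrative value, not taken from the quoted docs), expressing the hard threshold as a percentage makes eviction kick in at roughly 75% memory utilization:
--eviction-hard=memory.available<25%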
==EDIT==: As there seems to be a race between the OOM killer and the kubelet, and your script allocates memory faster than the kubelet can notice that pods need to be evicted, it might be wise to allocate memory more slowly inside your script (see the sketch below).
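A minimal sketch of a gentler loop, assuming the same alpine image and /cache-1 tmpfs mount as in the question; it grabs memory in small fixed steps so the kubelet's stats polling can keep up:
count=1
while true
do
  # allocate 50MB per iteration instead of everything at once;
  # back off for a while if the tmpfs is already full
  fallocate -l 50MB /cache-1/$count || sleep 30
  count=$(( $count + 1 ))
  sleep 5
done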

Related

When does the kubelet report that the node is under DiskPressure?

I know that the kubelet reports that the node is under DiskPressure if there is not enough space on the node, but I want to know the exact threshold for DiskPressure.
Could you point me to the kubelet source code related to this, or to the official Kubernetes documentation (or anything else that helps)?
Thanks again!
The kubelet running on the node reports DiskPressure based on the hard or soft eviction threshold values that have been set; if nothing is set, the default values apply. Please refer to the Kubernetes documentation.
The signals that feed the DiskPressure condition are:
DiskPressure - nodefs.available, nodefs.inodesFree, imagefs.available, or imagefs.inodesFree
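For reference, the documented default hard-eviction thresholds on Linux nodes are approximately:
--eviction-hard=memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%,imagefs.available<15%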
Disk pressure is a condition indicating that a node is using too much disk space or is using disk space too fast, according to the thresholds you have set in your Kubernetes configuration.
A DaemonSet can deploy an app to multiple nodes in a single step. Like Deployments, DaemonSets must be applied with kubectl before they take effect.
Since Kubernetes runs on Linux, this is easily done with the du command. You can either manually SSH into each Kubernetes node, or use a DaemonSet like the following (an example apply command is shown after the manifest):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: disk-checker
  labels:
    app: disk-checker
spec:
  selector:
    matchLabels:
      app: disk-checker
  template:
    metadata:
      labels:
        app: disk-checker
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      containers:
      - resources:
          requests:
            cpu: 0.15
        securityContext:
          privileged: true
        image: busybox
        imagePullPolicy: IfNotPresent
        name: disk-checked
        command: ["/bin/sh"]
        args: ["-c", "du -a /host | sort -n -r | head -n 20"]
        volumeMounts:
        - name: host
          mountPath: "/host"
      volumes:
      - name: host
        hostPath:
          path: "/"
DiskPressure means that available disk space or inodes on either the node's root filesystem or image filesystem have crossed an eviction threshold; check the complete Node Conditions documentation for more details.
Ways to set kubelet options:
1) Command-line options like --eviction-hard.
2) A config file (see the sketch after this list).
3) More recently, dynamic configuration.
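A minimal sketch of option 2, a KubeletConfiguration file passed to the kubelet with --config; the threshold values are illustrative:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
  imagefs.available: "15%"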
When you run into node disk pressure, your first suspects should be garbage-collection problems or runaway log files. Of course, the better answer here is to clean up unused files and free up some disk space.
So monitor your clusters, get notified when any node's disk approaches pressure, and resolve the issue before it starts evicting other pods in the cluster.
Edit:
AFAIK there is no magic trick to know the exact threshold of DiskPressure. You need to start with reasonable limits and requests and refine them by trial and error.
Refer to this SO answer for more information on how to set the DiskPressure threshold.

GKE Autopilot - Containers stuck in init phase on particular node

I'm using a GKE Autopilot cluster to run some Kubernetes workloads. Pods scheduled onto one particular node get stuck in the Init phase for around 10 minutes; the same pod on a different node is up in seconds.
deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jobs
spec:
  replicas: 1
  selector:
    matchLabels:
      app: job
  template:
    metadata:
      labels:
        app: job
    spec:
      volumes:
      - name: shared-data
        emptyDir: {}
      initContainers:
      - name: init-volume
        image: gcr.io/dummy_image:latest
        imagePullPolicy: Always
        resources:
          limits:
            memory: "1024Mi"
            cpu: "1000m"
            ephemeral-storage: "10Gi"
        volumeMounts:
        - name: shared-data
          mountPath: /data
        command: ["/bin/sh","-c"]
        args:
        - cp -a /path /data;
      containers:
      - name: job-server
        resources:
          requests:
            ephemeral-storage: "5Gi"
          limits:
            memory: "1024Mi"
            cpu: "1000m"
            ephemeral-storage: "10Gi"
        image: gcr.io/jobprocessor:latest
        imagePullPolicy: Always
        volumeMounts:
        - name: shared-data
          mountPath: /ebdata1
This happens only if the Pod has an init container. In my case, I'm copying some data from a dummy container into a shared volume that I then mount into the actual container.
Whenever a pod gets scheduled to this particular node, it gets stuck in the Init phase for around 10 minutes and then resolves on its own. I couldn't see any errors in the event logs.
kubectl describe node problematic-node
Events:
  Type     Reason      Age  From            Message
  ----     ------      ---- ----            -------
  Warning  SystemOOM   52m  kubelet         System OOM encountered, victim process: cp, pid: 477887
  Warning  OOMKilling  52m  kernel-monitor  Memory cgroup out of memory: Killed process 477887 (cp) total-vm:2140kB, anon-rss:564kB, file-rss:768kB, shmem-rss:0kB, UID:0 pgtables:44kB oom_score_adj:-997
The only message is the warning above. Is this issue caused by some misconfiguration on my side?
The best recommendation is for you to manage container compute resources properly within your Kubernetes cluster. When creating a Pod, you can optionally specify how much CPU and memory (RAM) each Container needs to avoid OOM situations.
When Containers have resource requests specified, the scheduler can make better decisions about which nodes to place Pods on. And when Containers have their limits specified, contention for resources on a node can be handled in a specified manner. CPU specifications are in units of cores, and memory is specified in units of bytes.
An event is produced each time the scheduler fails; use the command below to see the status of events:
$ kubectl describe pod <pod-name> | grep Events
Also, read the official Kubernetes guide on “Configure Out Of Resource Handling”. Always make sure to:
- reserve 10-20% of memory capacity for system daemons like the kubelet and the OS kernel;
- identify pods which can be evicted at 90-95% memory utilization, to reduce thrashing and the incidence of system OOMs.
To facilitate this kind of scenario, the kubelet would be launched with options like below:
--eviction-hard=memory.available<xMi
--system-reserved=memory=yGi
Having Heapster container monitoring in place should be helpful for visualization.
Further reading: Kubernetes and Docker Administration.

Kubernetes cron job oomkilled

I have a Rails app deployed on K8S. Inside my web app there is a cron job that runs every day at 8 pm and takes 6 hours to finish. I noticed that an OOMKilled error occurs a few hours after the cron job starts. I also increased the pod's memory, but the error still happens.
This is my yaml file:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: sync-data
spec:
  schedule: "0 20 * * *" # At 20:00:00pm every day
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 100
      template:
        spec:
          serviceAccountName: sync-data
          containers:
          - name: sync-data
            resources:
              requests:
                memory: 2024Mi # OOMKilled
                cpu: 1000m
              limits:
                memory: 2024Mi # OOMKilled
                cpu: 1000m
            image: xxxxxxx.dkr.ecr.ap-northeast-1.amazonaws.com/path
            imagePullPolicy: IfNotPresent
            command:
            - "/bin/sh"
            - "-c"
            - |
              rake xxx:yyyy # Will take ~6 hours to finish
          restartPolicy: Never
Are there any best practices for running long, resource-consuming cron jobs on K8S?
Any help is welcome!
OOMKilled can happen for 2 reasons.
Your pod is taking more memory than the specified limit. In that case, you obviously need to increase the limit.
If all the pods on the node are taking more memory than they requested, then Kubernetes will kill some pods to free up space. In that case, you can give this pod a higher priority.
You should have monitoring in place to actually determine the reason. Proper monitoring will show you which pods are performing as expected and which are not. You could also use node selectors for long-running pods and set a priority class so that non-cron pods are evicted first (a sketch follows).
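A minimal sketch of the priority-class idea; the class name and value are illustrative assumptions, not from the original answer:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: long-running-cron        # hypothetical name
value: 1000000                   # pods with higher priority are evicted later under pressure
globalDefault: false
description: "Priority for the long-running sync-data cron job"
The CronJob's pod template would then reference it with priorityClassName: long-running-cron.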
Honestly, there is no universally correct resource request/limit in Kubernetes, because it depends entirely on what your pod is doing. One thing I would suggest is to deploy the Vertical Pod Autoscaler and observe what resource requests/limits it recommends for your cron job (a sketch follows the link below). Here is a very nice article you can start with to learn how to use it for your requirement:
https://medium.com/infrastructure-adventures/vertical-pod-autoscaler-deep-dive-limitations-and-real-world-examples-9195f8422724
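A minimal sketch of a VPA object in recommendation-only mode, assuming the VPA components are installed in the cluster and targeting the sync-data CronJob from the question:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: sync-data-vpa
spec:
  targetRef:
    apiVersion: batch/v1beta1
    kind: CronJob
    name: sync-data
  updatePolicy:
    updateMode: "Off"            # only produce recommendations, never evict pods
After the job has run a few times, kubectl describe vpa sync-data-vpa shows the recommended requests.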

How to make cluster auto-upscaling work on GKE/DigitalOcean for a Job kind with varying requested memory requirements?

I have a 1-node K8s cluster on DigitalOcean with 1 CPU / 2 GB RAM,
and a 3-node cluster on Google Cloud with 1 CPU / 2 GB RAM per node.
I ran two jobs separately on each cloud platform with auto-scaling enabled.
The first job had a memory request of 200Mi:
apiVersion: batch/v1
kind: Job
metadata:
  name: scaling-test
spec:
  parallelism: 16
  template:
    metadata:
      name: scaling-test
    spec:
      containers:
      - name: debian
        image: debian
        command: ["/bin/sh","-c"]
        args: ["sleep 300"]
        resources:
          requests:
            cpu: "100m"
            memory: "200Mi"
      restartPolicy: Never
More (1 CPU / 2 GB RAM) nodes were added to the cluster automatically, and after job completion the extra nodes were deleted automatically.
After that, I ran a second job with a memory request of 4500Mi:
apiVersion: batch/v1
kind: Job
metadata:
  name: scaling-test2
spec:
  parallelism: 3
  template:
    metadata:
      name: scaling-test2
    spec:
      containers:
      - name: debian
        image: debian
        command: ["/bin/sh","-c"]
        args: ["sleep 5"]
        resources:
          requests:
            cpu: "100m"
            memory: "4500Mi"
      restartPolicy: Never
Checking later, the job remained in the Pending state. I checked the pod's events log and I'm seeing the following errors:
0/5 nodes are available: 5 Insufficient memory **source: default-scheduler**
pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 Insufficient memory **source:cluster-autoscaler**
The cluster did not auto-scale to provide the resources requested by the job. Is this possible with Kubernetes?
The Cluster Autoscaler (CA) doesn't add nodes to the cluster if doing so wouldn't make a pod schedulable, and it only considers adding nodes to node groups for which it was configured. So one reason it doesn't scale up may be that the pod's request is too large for any configured node type (e.g. 4500Mi of memory on 2 GB nodes). Another possible reason is that all suitable node groups are already at their maximum size. A common fix is sketched below.
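A hedged example of the usual fix on GKE: add an autoscaled node pool whose machines are large enough to fit the 4500Mi request (the pool name, cluster name, and machine type are illustrative assumptions):
gcloud container node-pools create big-mem-pool \
  --cluster=my-cluster \
  --machine-type=e2-standard-2 \
  --enable-autoscaling --min-nodes=0 --max-nodes=3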

How to avoid creating a pod on a node that doesn't have enough storage?

I'm setting up my application with Kubernetes. I have 2 Docker images (Oracle and WebLogic). I have 2 Kubernetes nodes, Node1 (20 GB storage) and Node2 (60 GB storage).
When I run kubectl apply -f oracle.yaml, it tries to create the Oracle pod on Node1 and after a few minutes it fails due to lack of storage. How can I force Kubernetes to check the node's free storage before creating the pod there?
Thanks
First of all, you probably want to give Node1 more storage.
But if you don't want the pod to start at all, you can run a check in an initContainer that checks how much space you are using with something like du or df. It could be a script that checks a threshold and exits unsuccessfully if there is not enough space. Something like this:
#!/bin/bash
# Exit with an error if <dir> is already using more than 10000 KB
# (du reports usage in KB; the last line is the total for <dir>)
if [ `du <dir> | tail -1 | awk '{print $1}'` -gt "10000" ]; then exit 1; fi
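A hedged sketch of wiring that check in as an init container; the image, mount path, and hostPath directory are illustrative assumptions:
spec:
  initContainers:
  - name: check-disk
    image: busybox
    command: ["/bin/sh", "-c"]
    # fail the Pod early if the checked directory already uses more than 10000 KB
    args: ["if [ $(du -s /host-data | awk '{print $1}') -gt 10000 ]; then exit 1; fi"]
    volumeMounts:
    - name: host-data
      mountPath: /host-data
  volumes:
  - name: host-data
    hostPath:
      path: /data                # hypothetical directory on the node to check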
Another alternative is to use a persistent volume (PV) with a persistent volume claim (PVC) that has enough space, together with the default StorageClass admission controller, and allocate the appropriate space in your volume definition:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: myclaim
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 40Gi
  storageClassName: mytype
Then on your Pod:
kind: Pod
apiVersion: v1
metadata:
  name: mypod
spec:
  containers:
  - name: mycontainer
    image: nginx
    volumeMounts:
    - mountPath: "/var/www/html"
      name: mypd
  volumes:
  - name: mypd
    persistentVolumeClaim:
      claimName: myclaim
The Pod will not start if your claim cannot be allocated (i.e. there isn't enough space).
You may try to specify an ephemeral-storage requirement for the pod:
resources:
  requests:
    ephemeral-storage: "40Gi"
  limits:
    ephemeral-storage: "40Gi"
Then it would be scheduled only on nodes with sufficient ephemeral storage available.
You can verify the amount of ephemeral storage available on each node in the output of "kubectl describe node".
$ kubectl describe node somenode | grep -A 6 Allocatable
Allocatable:
  attachable-volumes-gce-pd:  64
  cpu:                        3920m
  ephemeral-storage:          26807024751
  hugepages-2Mi:              0
  memory:                     12700032Ki
  pods:                       110