Kubernetes cluster hangs when pod consumes too much memory - kubernetes

I have a job I run on a k8s node that quickly grows to 32 GB+ of memory. The node has 32 GB of memory.
I would expect Kubernetes to evict the pod, but instead the node becomes completely unreachable and the only way to get it back is with a reboot. Looking at htop immediately after deploying the job, all CPUs and memory are maxed out.
Even when I set a memory limit of 16000Mi in the job configuration, the same thing happens.
This is what the job config looks like:
apiVersion: batch/v1
kind: Job
metadata:
  name: mem-test
spec:
  template:
    spec:
      containers:
      - name: mem-test
        image: gcr.io/foo/mem-test:latest
        resources:
          limits:
            memory: 16000Mi
      restartPolicy: Never
What am I missing?

Related

AKS with Keda: pod are removed during execution

I tried KEDA with AKS and I really like that pods are automatically instantiated based on the Azure DevOps job queue for releases and builds.
However, I noticed something strange: AKS/KEDA often removes a pod while it is still processing, which makes the workflow fail.
Message reads
We stopped hearing from agent aks-linux-768d6647cc-ntmh4. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610
Expected behavior: a pod should finish its task before KEDA/AKS removes it.
I'm sharing my KEDA YAML file:
# deployment.yaml
apiVersion: apps/v1        # The API resource where this workload resides
kind: Deployment           # The kind of workload we're creating
metadata:
  name: aks-linux          # This will be the name of the deployment
spec:
  selector:                # Define the wrapping strategy
    matchLabels:           # Match all pods with the defined labels
      app: aks-linux       # Labels follow the `name: value` template
  replicas: 1
  template:                # This is the template of the pod inside the deployment
    metadata:              # Metadata for the pod
      labels:
        app: aks-linux
    spec:
      nodeSelector:
        agentpool: linux
      containers:          # Here we define all containers
      - image: <My image here>
        name: aks-linux
        env:
        - name: "AZP_URL"
          value: "<myURL>"
        - name: "AZP_TOKEN"
          value: "<MyToken>"
        - name: "AZP_POOL"
          value: "<MyPool>"
        resources:
          requests:        # Minimum amount of resources requested
            cpu: 2
            memory: 4096Mi
          limits:          # Maximum amount of resources allowed
            cpu: 4
            memory: 8192Mi
I'm using the latest versions of AKS and KEDA. Any idea?
Check the official Keda docs:
When running your agents as a deployment you have no control on which pod gets killed when scaling down.
So, to solve it you need to use ScaledJob:
If you run your agents as a Job, KEDA will start a Kubernetes job for each job that is in the agent pool queue. The agents will accept one job when they are started and terminate afterwards. Since an agent is always created for every pipeline job, you can achieve fully isolated build environments by using Kubernetes jobs.
See there how to implement it.
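For illustration, here is a minimal ScaledJob sketch using KEDA's azure-pipelines trigger. It reuses the pod template from the Deployment above; the resource name, maxReplicaCount, and placeholder values are assumptions, so check the KEDA docs for the current schema before using it.

# scaledjob.yaml -- illustrative sketch, not a drop-in replacement
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: aks-linux-agent
spec:
  jobTargetRef:
    template:                 # same pod template as the Deployment above
      spec:
        nodeSelector:
          agentpool: linux
        containers:
        - name: aks-linux
          image: <My image here>
          env:
          - name: "AZP_URL"
            value: "<myURL>"
          - name: "AZP_TOKEN"
            value: "<MyToken>"
          - name: "AZP_POOL"
            value: "<MyPool>"
        restartPolicy: Never
  maxReplicaCount: 5          # assumption: cap on concurrent agent jobs
  triggers:
  - type: azure-pipelines
    metadata:
      poolName: "<MyPool>"
      organizationURLFromEnv: "AZP_URL"
      personalAccessTokenFromEnv: "AZP_TOKEN"

With a ScaledJob, each queued pipeline job gets its own Kubernetes Job that runs one build and then terminates, so KEDA never has to pick a running pod to kill when scaling down.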

Kubernetes scheduling: control preference and remove pods in priority on a node

Context
I have a Kubernetes cluster with 2 pools:
the first one, bare-metal, is composed of a finite number of bare metal instances. Every node in this pool has a label node-type: bare-metal
the second one, vms, is composed of virtual machines. This is used if the cluster needs to autoscale, since provisioning VMs is faster than provisioning a new bare metal instance. Every node in this pool has a label node-type: vms
For performance reasons, I would like that, when possible, all pods run on a bare-metal node. If not possible, and only if not possible (not enough memory, too many pods already existing), that's ok for me to run pods on a vms node. I have 2 questions (tightly related) about this.
About preferredDuringSchedulingIgnoredDuringExecution behavior
To schedule pods in priority on bare-metal, I've used preferredDuringSchedulingIgnoredDuringExecution. Example with a simple nginx deployment with 90 replicas (a bare-metal pool node could hold all of them):
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  replicas: 90
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
        resources:
          requests:
            cpu: 50m
            memory: 50M
          limits:
            cpu: 50m
            memory: 50M
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: node-type
                operator: In
                values:
                - bare-metal
This works fine, although I don't really understand why, out of the 90 pods, 3 are scheduled on a node of the vms pool (even though scheduling them on the bare-metal pool is still possible). I understand that this is a preference rather than a requirement, but I'm wondering whether this "preference" can be tuned.
Question: Can I configure my deployment to tell like "preferred only if there is no other solution"?
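For contrast with the manifest above, here is a hedged sketch of the required variant the question alludes to. It does not implement a tunable "only if there is no other solution" preference; it hard-restricts scheduling to bare-metal nodes, so pods stay Pending instead of spilling onto the vms pool when bare-metal is full:

# Hard requirement instead of a preference -- pods stay Pending if bare-metal is full
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node-type
          operator: In
          values:
          - bare-metal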
Scaling down
When I scale down my Deployment, I would like the pods scheduled on the vms pool to be deleted first, because my pods should preferentially run on bare-metal nodes. However, if I run:
kubectl scale deployment nginx --replicas 3
I can observe that pods on bare-metal pool nodes are deleted first, leaving only 3 pods on a vms pool node.
Question: Is there a way to configure my deployment to tell Kubernetes to remove the pods on vms pool first?
Edit: It seems that the proposal introducing PodDeletionCost could solve this, but it is not implemented yet.
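If your cluster version supports the controller.kubernetes.io/pod-deletion-cost annotation from that proposal (it is beta and enabled by default since Kubernetes 1.22), a rough sketch of how it could apply here: annotate the pods currently running on vms-pool nodes with a lower deletion cost so the ReplicaSet controller prefers to remove them first. The pod placeholder and the -100 value below are my own assumptions.

kubectl annotate pod <pod-on-vms-node> controller.kubernetes.io/pod-deletion-cost=-100 --overwrite

Pods without the annotation default to a cost of 0, and lower-cost pods are deleted first, so annotating only the pods on vms nodes with a negative value should be enough.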

GKE Autopilot - Containers stuck in init phase on particular node

I'm using a GKE Autopilot cluster to run some Kubernetes workloads. Pods scheduled to one particular node stay stuck in the init phase for around 10 minutes; the same pod on a different node is up in seconds.
deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jobs
spec:
  replicas: 1
  selector:
    matchLabels:
      app: job
  template:
    metadata:
      labels:
        app: job
    spec:
      volumes:
      - name: shared-data
        emptyDir: {}
      initContainers:
      - name: init-volume
        image: gcr.io/dummy_image:latest
        imagePullPolicy: Always
        resources:
          limits:
            memory: "1024Mi"
            cpu: "1000m"
            ephemeral-storage: "10Gi"
        volumeMounts:
        - name: shared-data
          mountPath: /data
        command: ["/bin/sh","-c"]
        args:
        - cp -a /path /data;
      containers:
      - name: job-server
        resources:
          requests:
            ephemeral-storage: "5Gi"
          limits:
            memory: "1024Mi"
            cpu: "1000m"
            ephemeral-storage: "10Gi"
        image: gcr.io/jobprocessor:latest
        imagePullPolicy: Always
        volumeMounts:
        - name: shared-data
          mountPath: /ebdata1
This happens only if the pod has an init container. In my case, I'm copying some data from a dummy container to a shared volume that I mount into the actual container.
Whenever a pod gets scheduled to this particular node, it gets stuck in the init phase for around 10 minutes and then resolves on its own. I couldn't see any errors in the event logs.
kubectl describe node problematic-node
Events:
  Type     Reason      Age  From            Message
  ----     ------      ---  ----            -------
  Warning  SystemOOM   52m  kubelet         System OOM encountered, victim process: cp, pid: 477887
  Warning  OOMKilling  52m  kernel-monitor  Memory cgroup out of memory: Killed process 477887 (cp) total-vm:2140kB, anon-rss:564kB, file-rss:768kB, shmem-rss:0kB, UID:0 pgtables:44kB oom_score_adj:-997
The only message is the warning above. Is this issue caused by some misconfiguration on my side?
The best recommendation is for you to manage container compute resources properly within your Kubernetes cluster. When creating a Pod, you can optionally specify how much CPU and memory (RAM) each Container needs to avoid OOM situations.
When Containers have resource requests specified, the scheduler can make better decisions about which nodes to place Pods on. And when Containers have their limits specified, contention for resources on a node can be handled in a specified manner. CPU specifications are in units of cores, and memory is specified in units of bytes.
An event is produced each time the scheduler fails to place a pod; use the command below to see these events:
$ kubectl describe pod <pod-name> | grep Events
Also, read the official Kubernetes guide on “Configure Out Of Resource Handling”. Always make sure to:
- reserve 10-20% of memory capacity for system daemons like the kubelet and the OS kernel
- identify pods which can be evicted at 90-95% memory utilization to reduce thrashing and the incidence of system OOM
To handle this kind of scenario, the kubelet would be launched with options like the ones below:
--eviction-hard=memory.available<xMi
--system-reserved=memory=yGi
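For a concrete, purely illustrative version of those flags as a kubelet configuration file; the reservation and eviction values below are assumptions to tune to your node size, not figures from the original answer, and on a managed product like GKE Autopilot they are set by the platform and cannot be changed:

# kubelet-config.yaml -- values are illustrative
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"    # evict pods before the node itself runs out of memory
systemReserved:
  memory: "1Gi"                # headroom for the OS and system daemons
kubeReserved:
  memory: "512Mi"              # headroom for the kubelet and container runtime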
Having container monitoring in place (such as Heapster, since deprecated in favor of metrics-server) is helpful for visualization.
Further reading: Kubernetes and Docker administration.

Kubernetes cron job oomkilled

I have a Rails app deployed on K8s. Inside my web app, there is a cron job that runs every day at 8 pm and takes 6 hours to finish. I noticed an OOMKilled error occurs a few hours after the cron job starts. I increased the pod's memory, but the error still happens.
This is my yaml file:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: sync-data
spec:
  schedule: "0 20 * * *"   # At 20:00 every day
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 100
      template:
        spec:
          serviceAccountName: sync-data
          containers:
          - name: sync-data
            resources:
              requests:
                memory: 2024Mi   # OOMKilled
                cpu: 1000m
              limits:
                memory: 2024Mi   # OOMKilled
                cpu: 1000m
            image: xxxxxxx.dkr.ecr.ap-northeast-1.amazonaws.com/path
            imagePullPolicy: IfNotPresent
            command:
            - "/bin/sh"
            - "-c"
            - |
              rake xxx:yyyy   # Will take ~6 hours to finish
          restartPolicy: Never
Are there any best practices for running long, resource-intensive cron jobs on K8s?
Any help is welcome!
OOMKilled can happen for two reasons:
Your pod is using more memory than the specified limit. In that case, you obviously need to increase the limit.
All the pods on the node are using more memory than they requested. In that case Kubernetes will kill some pods to free up space, and you can give this pod a higher priority so others are killed first.
You should have monitoring in place to determine the actual reason. Proper monitoring will show you which pods are performing as expected and which are not. You could also use node selectors for long-running pods and set a priority class so that non-cron pods are evicted first.
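A minimal sketch of that priority-class idea, assuming a class name of long-running-cron (my own placeholder): define a PriorityClass and reference it from the CronJob's pod template so that, under memory pressure, lower-priority pods are evicted before this one.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: long-running-cron         # hypothetical name
value: 1000000                    # higher value = higher priority
globalDefault: false
description: "Priority for long-running cron jobs"

The CronJob would then set priorityClassName: long-running-cron in spec.jobTemplate.spec.template.spec.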
Honestly, there is no universally correct resource request/limit in Kubernetes, because it depends entirely on what your pod is doing. One thing I would suggest is to deploy the Vertical Pod Autoscaler and observe what resource requests/limits it recommends for your cron job. Here is a very nice article to start with that shows how you can apply it to your use case:
https://medium.com/infrastructure-adventures/vertical-pod-autoscaler-deep-dive-limitations-and-real-world-examples-9195f8422724
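A hedged sketch of that VPA suggestion, assuming the VPA components are installed in the cluster: running it in recommendation-only mode ("Off") against the CronJob produces suggested requests/limits without evicting pods mid-run. The VPA object name is my own placeholder.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: sync-data-vpa              # hypothetical name
spec:
  targetRef:
    apiVersion: batch/v1beta1      # matches the CronJob above
    kind: CronJob
    name: sync-data
  updatePolicy:
    updateMode: "Off"              # only produce recommendations, never evict

The recommendations then show up under the object's status, e.g. via kubectl describe vpa sync-data-vpa.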

How to make auto cluster upscaling work GKE/digitalocean for a job kind with varied requested memory requirement?

I have a 1-node K8s cluster on DigitalOcean with 1 CPU / 2 GB RAM,
and a 3-node cluster on Google Cloud with 1 CPU / 2 GB RAM per node.
I ran two jobs separately on each cloud platform with autoscaling enabled.
The first job had a memory request of 200Mi:
apiVersion: batch/v1
kind: Job
metadata:
  name: scaling-test
spec:
  parallelism: 16
  template:
    metadata:
      name: scaling-test
    spec:
      containers:
      - name: debian
        image: debian
        command: ["/bin/sh","-c"]
        args: ["sleep 300"]
        resources:
          requests:
            cpu: "100m"
            memory: "200Mi"
      restartPolicy: Never
More (1 CPU / 2 GB RAM) nodes were added to the cluster automatically, and after the job completed the extra nodes were deleted automatically.
After that, I ran a second job with a memory request of 4500Mi:
apiVersion: batch/v1
kind: Job
metadata:
  name: scaling-test2
spec:
  parallelism: 3
  template:
    metadata:
      name: scaling-test2
    spec:
      containers:
      - name: debian
        image: debian
        command: ["/bin/sh","-c"]
        args: ["sleep 5"]
        resources:
          requests:
            cpu: "100m"
            memory: "4500Mi"
      restartPolicy: Never
When I checked later, the job remained in the Pending state. I checked the pod's events log and saw the following errors:
0/5 nodes are available: 5 Insufficient memory (source: default-scheduler)
pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 Insufficient memory (source: cluster-autoscaler)
The cluster did not autoscale to provide the resources requested by the job. Is this possible with Kubernetes?
The cluster autoscaler (CA) doesn't add nodes to the cluster if doing so wouldn't make a pod schedulable, and it only considers adding nodes to the node groups it was configured with. So one reason it doesn't scale up may be that the pod's request is too large for any node it could add (e.g. the 4500Mi memory request). Another possible reason is that all suitable node groups are already at their maximum size.
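In this specific case the arithmetic points to the first reason: every node group here uses 1 CPU / 2 GB machines, whose allocatable memory is well below 4500Mi, so no node the autoscaler could add would ever fit the pod. A hedged sketch of one way out on GKE is a second autoscaled node pool with larger machines; the pool name, cluster name, and machine type below are my own placeholders.

# Illustrative only -- names and limits are assumptions, adjust to your cluster.
# e2-standard-2 offers 2 vCPU / 8 GB, leaving enough allocatable memory for a 4500Mi request.
gcloud container node-pools create big-mem-pool \
  --cluster=my-cluster \
  --machine-type=e2-standard-2 \
  --enable-autoscaling --min-nodes=0 --max-nodes=3

On DigitalOcean the equivalent would be adding a larger, autoscaling-enabled node pool to the cluster.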