GKE Autopilot - Containers stuck in init phase on a particular node - kubernetes

I'm using a GKE Autopilot cluster to run some Kubernetes workloads. Pods scheduled to one particular node get stuck in the init phase for around 10 minutes; the same pod on a different node is up in seconds.
deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jobs
spec:
  replicas: 1
  selector:
    matchLabels:
      app: job
  template:
    metadata:
      labels:
        app: job
    spec:
      volumes:
        - name: shared-data
          emptyDir: {}
      initContainers:
        - name: init-volume
          image: gcr.io/dummy_image:latest
          imagePullPolicy: Always
          resources:
            limits:
              memory: "1024Mi"
              cpu: "1000m"
              ephemeral-storage: "10Gi"
          volumeMounts:
            - name: shared-data
              mountPath: /data
          command: ["/bin/sh","-c"]
          args:
            - cp -a /path /data;
      containers:
        - name: job-server
          resources:
            requests:
              ephemeral-storage: "5Gi"
            limits:
              memory: "1024Mi"
              cpu: "1000m"
              ephemeral-storage: "10Gi"
          image: gcr.io/jobprocessor:latest
          imagePullPolicy: Always
          volumeMounts:
            - name: shared-data
              mountPath: /ebdata1
This happens only when the pod has an init container. In my case, I'm copying some data from a dummy container into a shared volume, which I then mount in the actual container.
But whenever pods get scheduled to this particular node, they get stuck in the init phase for around 10 minutes and then resolve on their own. I couldn't see any errors in the event logs.
kubectl describe node problematic-node
Events:
Type     Reason      Age   From            Message
----     ------      ----  ----            -------
Warning  SystemOOM   52m   kubelet         System OOM encountered, victim process: cp, pid: 477887
Warning  OOMKilling  52m   kernel-monitor  Memory cgroup out of memory: Killed process 477887 (cp) total-vm:2140kB, anon-rss:564kB, file-rss:768kB, shmem-rss:0kB, UID:0 pgtables:44kB oom_score_adj:-997
The only message is the warning above. Is this issue caused by some misconfiguration on my side?

The best recommendation is for you to manage container compute resources properly within your Kubernetes cluster. When creating a Pod, you can optionally specify how much CPU and memory (RAM) each Container needs to avoid OOM situations.
When Containers have resource requests specified, the scheduler can make better decisions about which nodes to place Pods on. And when Containers have their limits specified, contention for resources on a node can be handled in a specified manner. CPU specifications are in units of cores, and memory is specified in units of bytes.
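For instance, a minimal sketch (placeholder numbers, image name taken from the question) of what the init container would look like with explicit requests as well as limits, so that both the scheduler and the memory cgroup know what the cp step needs:
# Sketch only: give the init container explicit requests, not just limits.
initContainers:
  - name: init-volume
    image: gcr.io/dummy_image:latest   # image from the question
    resources:
      requests:
        memory: "256Mi"                # placeholder values; size to what cp actually needs
        cpu: "250m"
        ephemeral-storage: "10Gi"
      limits:
        memory: "1024Mi"
        cpu: "1000m"
        ephemeral-storage: "10Gi"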
An event is produced each time the scheduler fails to place a Pod; use the command below to see a Pod's events:
$ kubectl describe pod <pod-name> | grep -A 10 Events
Also, read the official Kubernetes guide on “Configure Out Of Resource Handling”. Always make sure to:
- reserve 10-20% of memory capacity for system daemons like the kubelet and the OS kernel
- identify pods which can be evicted at 90-95% memory utilization to reduce thrashing and the incidence of system OOM.
To facilitate this kind of scenario, the kubelet would be launched with options like below:
--eviction-hard=memory.available<xMi
--system-reserved=memory=yGi
Having container monitoring such as Heapster in place is helpful for visualization.
Further reading: Kubernetes and Docker Administration.

Related

When does the kubelet report that the node is in DiskPressure?

I know that the kubelet reports that the node is in DiskPressure if there is not enough space on the node, but I want to know the exact threshold of DiskPressure.
Please point me to the kubelet source code related to this if you can, or to the official Kubernetes documentation or anything else that would help.
Thanks again!!
The kubelet running on the node reports DiskPressure based on the hard or soft eviction threshold values that have been set; if nothing is set, the defaults are used. Please refer to the Kubernetes documentation.
Below are the signals that are used:
DiskPressure - nodefs.available, nodefs.inodesFree, imagefs.available, or imagefs.inodesFree
Disk pressure is a condition indicating that a node is using too much disk space or is using disk space too fast, according to the thresholds you have set in your Kubernetes configuration.
A DaemonSet can deploy apps to multiple nodes in a single step. Like Deployments, DaemonSets must be applied using kubectl before they can take effect.
Since Kubernetes runs on Linux, this is easily done by running the du command. You can either manually SSH into each Kubernetes node, or use a DaemonSet as follows:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: disk-checker
  labels:
    app: disk-checker
spec:
  selector:
    matchLabels:
      app: disk-checker
  template:
    metadata:
      labels:
        app: disk-checker
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      containers:
        - resources:
            requests:
              cpu: 0.15
          securityContext:
            privileged: true
          image: busybox
          imagePullPolicy: IfNotPresent
          name: disk-checked
          command: ["/bin/sh"]
          args: ["-c", "du -a /host | sort -n -r | head -n 20"]
          volumeMounts:
            - name: host
              mountPath: "/host"
      volumes:
        - name: host
          hostPath:
            path: "/"
DiskPressure means that available disk space or inodes on either the node's root filesystem or image filesystem have crossed an eviction threshold; check the complete list of Node Conditions for more details.
Ways to set kubelet options:
1) Command-line flags like --eviction-hard.
2) A config file (see the sketch below).
3) More recently, dynamic configuration.
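As an illustration of the config-file route, here is a minimal KubeletConfiguration sketch with explicit hard-eviction thresholds; the values mirror the commonly documented defaults, but treat the exact numbers as an assumption to be tuned for your nodes:
# Minimal KubeletConfiguration sketch (config-file equivalent of --eviction-hard).
# The thresholds mirror the documented defaults; adjust them for your nodes.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"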
When you run into node disk pressure, your first suspects should be garbage-collection problems or log files filling the disk. Of course, the better answer here is to clean up unused files and free up some disk space.
So monitor your clusters, get notified when any node's disk is approaching pressure, and resolve the issue before it starts killing other pods in the cluster.
Edit:
AFAIK there is no magic trick to know the exact threshold of DiskPressure. You need to start with reasonable limits and requests and refine them by trial and error.
Refer to this SO answer for more information on how to set the threshold of DiskPressure.

How to make cluster auto-upscaling work on GKE/DigitalOcean for a Job kind with varying memory requests?

I have a 1-node K8s cluster on DigitalOcean with 1 CPU / 2 GB RAM
and a 3-node cluster on Google Cloud with 1 CPU / 2 GB RAM per node.
I ran two jobs separately on each cloud platform with auto-scaling enabled.
The first job had a memory request of 200Mi:
apiVersion: batch/v1
kind: Job
metadata:
  name: scaling-test
spec:
  parallelism: 16
  template:
    metadata:
      name: scaling-test
    spec:
      containers:
        - name: debian
          image: debian
          command: ["/bin/sh","-c"]
          args: ["sleep 300"]
          resources:
            requests:
              cpu: "100m"
              memory: "200Mi"
      restartPolicy: Never
More nodes (1 CPU / 2 GB RAM) were added to the cluster automatically, and after the job completed the extra nodes were deleted automatically.
After that, I ran a second job with a memory request of 4500Mi:
apiVersion: batch/v1
kind: Job
metadata:
  name: scaling-test2
spec:
  parallelism: 3
  template:
    metadata:
      name: scaling-test2
    spec:
      containers:
        - name: debian
          image: debian
          command: ["/bin/sh","-c"]
          args: ["sleep 5"]
          resources:
            requests:
              cpu: "100m"
              memory: "4500Mi"
      restartPolicy: Never
When I checked later, the job remained in the Pending state. I checked the pod's event log and I'm seeing the following errors:
0/5 nodes are available: 5 Insufficient memory **source: default-scheduler**
pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 Insufficient memory **source:cluster-autoscaler**
The cluster did not auto-scale to satisfy the requested resources for the job. Is this possible using Kubernetes?
The CA doesn't add nodes to the cluster if doing so wouldn't make a pod schedulable, and it only considers adding nodes to node groups it has been configured for. So one reason it doesn't scale up may be that the pod's request is too large to fit on any node type it can add (e.g. 4500Mi of memory on 1 CPU / 2 GB nodes). Another possible reason is that all suitable node groups are already at their maximum size.
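As a rough illustration (the node size and numbers here are assumptions based on the 1 CPU / 2 GB nodes in the question), a scale-up only helps if the request fits within the allocatable resources of a single new node, which you can check with kubectl describe node. Something like the following could still be scheduled on a fresh 2 GB node, whereas 4500Mi never fits:
# Hypothetical sizing: a 2 GB node has well under 2Gi allocatable after system
# reservations, so the per-pod memory request must stay below that figure.
resources:
  requests:
    cpu: "100m"
    memory: "800Mi"   # placeholder; keep it <= allocatable memory of the largest configured node type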

How to avoid creating pod on node that doesn't have enough storage?

I'm setting up my application with Kubernetes. I have 2 Docker images (Oracle and WebLogic) and 2 Kubernetes nodes: Node1 (20 GB storage) and Node2 (60 GB storage).
When I run kubectl apply -f oracle.yaml it tries to create the oracle pod on Node1, and after a few minutes it fails due to lack of storage. How can I force Kubernetes to check the free storage of a node before creating the pod there?
Thanks
First of all, you probably want to give Node1 more storage.
But if you don't want the pod to start at all, you can run a check with an initContainer that measures how much space is in use with something like du or df. It could be a script that exits unsuccessfully if a threshold is exceeded. Something like this:
#!/bin/bash
# Exit with an error if <dir> is using more than 10000 1K-blocks (~10 MB)
if [ `du <dir> | tail -1 | awk '{print $1}'` -gt "10000" ]; then exit 1; fi
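For illustration, here is a sketch of how such a check could be wired into the Pod as an initContainer (the image, mount path, volume name and threshold are assumptions; the main containers only start if the init container exits 0):
# Hypothetical init container that fails the Pod early when the volume is too full.
initContainers:
  - name: check-disk
    image: busybox
    command: ["/bin/sh", "-c"]
    args:
      - |
        # fail if less than ~10 GB is free on the volume mounted at /data
        FREE_KB=$(df /data | tail -1 | awk '{print $4}')
        if [ "$FREE_KB" -lt 10485760 ]; then
          echo "not enough free space on /data"
          exit 1
        fi
    volumeMounts:
      - name: data          # assumed volume name
        mountPath: /data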
Another alternative is to use a persistent volume (PV) with a persistent volume claim (PVC) that requests enough space, together with the default StorageClass admission controller, and allocate the appropriate space in your volume definition.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: myclaim
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 40Gi
  storageClassName: mytype
Then on your Pod:
kind: Pod
apiVersion: v1
metadata:
  name: mypod
spec:
  containers:
    - name: mycontainer
      image: nginx
      volumeMounts:
        - mountPath: "/var/www/html"
          name: mypd
  volumes:
    - name: mypd
      persistentVolumeClaim:
        claimName: myclaim
The Pod will not start if your claim cannot be bound (i.e. there isn't enough space).
You may try specifying an ephemeral-storage requirement for the pod:
resources:
  requests:
    ephemeral-storage: "40Gi"
  limits:
    ephemeral-storage: "40Gi"
Then it would be scheduled only on nodes with sufficient ephemeral storage available.
You can verify the amount of ephemeral storage available on each node in the output of "kubectl describe node".
$ kubectl describe node somenode | grep -A 6 Allocatable
Allocatable:
  attachable-volumes-gce-pd:  64
  cpu:                        3920m
  ephemeral-storage:          26807024751
  hugepages-2Mi:              0
  memory:                     12700032Ki
  pods:                       110

Google Kubernetes Engine (GKE) CPU/pod

On GKE I have created a cluster with 1 node and the n1-standard-1 instance type (vCPU: 1, RAM: 3.75 GB). The main purpose of the cluster is to host an application that has 3 pods (mysql, backend and frontend) in the default namespace. I can deploy mysql with no problem. After that, when I try to deploy the backend, it just remains in the "Pending" state saying that not enough CPU is available. The message is very verbose.
So my question is: is it not possible to have 3 pods running using 1 CPU unit? What I want is to reduce cost and let those pods share the same CPU. Is it possible to achieve that? If yes, then how?
The "Pending" state by itself is not that informative. Could you please run
kubectl get pods
to get your pod name, and then run
kubectl describe pod <podname>
so you can see the actual error message.
By the way, you can run 3 pods on a single CPU.
Yes, it is possible to have multiple pods, or 3 in your case, on a single CPU unit.
If you want to manage your memory resources, consider putting constraints such as those described in the official docs. Below is an example.
apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  containers:
    - name: db
      image: mysql
      env:
        - name: MYSQL_ROOT_PASSWORD
          value: "password"
      resources:
        requests:
          memory: "64Mi"
          cpu: "250m"
        limits:
          memory: "128Mi"
          cpu: "500m"
One would need more information about your deployment to answer your queries in more detail. Please consider providing it.

How can I OOM all BestEffort Pods in Kubernetes?

To demonstrate the kubelet's eviction behaviour, I am trying to deploy a Kubernetes workload that will consume memory to the point that the kubelet evicts all BestEffort Pods due to memory pressure but does not kill my workload (or at least not before the BestEffort Pods).
My best attempt is below. It writes to two tmpfs volumes (since, by default, the limit of a tmpfs volume is half of the Node's total memory). The 100 comes from the fact that --eviction-hard=memory.available<100Mi is set on the kubelet:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fallocate
  namespace: developer
spec:
  selector:
    matchLabels:
      app: fallocate
  template:
    metadata:
      labels:
        app: fallocate
    spec:
      containers:
        - name: alpine
          image: alpine
          command:
            - /bin/sh
            - -c
            - |
              count=1
              while true
              do
                AVAILABLE_DISK_KB=$(df /cache-1 | grep /cache-1 | awk '{print $4}')
                AVAILABLE_DISK_MB=$(( $AVAILABLE_DISK_KB / 1000 ))
                AVAILABLE_MEMORY_MB=$(free -m | grep Mem | awk '{print $4}')
                MINIMUM=$(( $AVAILABLE_DISK_MB > $AVAILABLE_MEMORY_MB ? $AVAILABLE_MEMORY_MB : $AVAILABLE_DISK_MB ))
                fallocate -l $(( $MINIMUM - 100 ))MB /cache-1/$count
                AVAILABLE_DISK_KB=$(df /cache-2 | grep /cache-2 | awk '{print $4}')
                AVAILABLE_DISK_MB=$(( $AVAILABLE_DISK_KB / 1000 ))
                AVAILABLE_MEMORY_MB=$(free -m | grep Mem | awk '{print $4}')
                MINIMUM=$(( $AVAILABLE_DISK_MB > $AVAILABLE_MEMORY_MB ? $AVAILABLE_MEMORY_MB : $AVAILABLE_DISK_MB ))
                fallocate -l $(( $MINIMUM - 100 ))MB /cache-2/$count
                count=$(( $count+1 ))
                sleep 1
              done
          resources:
            requests:
              memory: 2Gi
              cpu: 100m
            limits:
              cpu: 100m
          volumeMounts:
            - name: cache-1
              mountPath: /cache-1
            - name: cache-2
              mountPath: /cache-2
      volumes:
        - name: cache-1
          emptyDir:
            medium: Memory
        - name: cache-2
          emptyDir:
            medium: Memory
The intention of this script is to use up memory to the point that node memory usage is within the hard eviction threshold boundary, causing the kubelet to start evicting. It evicts some BestEffort Pods, but in most cases the workload is killed before all BestEffort Pods are evicted. Is there a better way of doing this?
I am running on GKE with cluster version 1.9.3-gke.0.
EDIT:
I also tried using simmemleak:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: simmemleak
  namespace: developer
spec:
  selector:
    matchLabels:
      app: simmemleak
  template:
    metadata:
      labels:
        app: simmemleak
    spec:
      containers:
        - name: simmemleak
          image: saadali/simmemleak
          resources:
            requests:
              memory: 1Gi
              cpu: 1m
            limits:
              cpu: 1m
But this workload keeps dying before any evictions. I think the issue is that it is being killed by the kernel before the kubelet has time to react.
To avoid the system OOM killer acting before kubelet eviction, you could limit the memory available to kubepods by configuring --system-reserved and --enforce-node-allocatable (read more in the Kubernetes docs on reserving compute resources).
For example, if a node has 32Gi of memory, you could configure the kubelet to cap kubepods memory at roughly 20Gi, combined with an eviction threshold such as:
--eviction-hard=memory.available<500Mi
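A minimal sketch of what that could look like as a kubelet config file; the 11.5Gi figure is an assumption chosen so that roughly 20Gi of the 32Gi stays allocatable to pods after the system reservation and the 500Mi eviction threshold:
# Hypothetical KubeletConfiguration for a 32Gi node:
# Allocatable ≈ 32Gi - 11.5Gi (systemReserved) - 0.5Gi (evictionHard) ≈ 20Gi
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:
  memory: "11.5Gi"
evictionHard:
  memory.available: "500Mi"
enforceNodeAllocatable:
  - pods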
I found this on the Kubernetes docs, I hope it helps:
kubelet may not observe memory pressure right away
The kubelet currently polls cAdvisor to collect memory usage stats at a regular interval.
If memory usage increases within that window rapidly, the kubelet may not observe MemoryPressure fast enough, and the OOMKiller will still be invoked.
We intend to integrate with the memcg notification API in a future release to reduce this latency, and instead have the kernel tell us when a threshold has been crossed immediately.
If you are not trying to achieve extreme utilization, but a sensible measure of overcommit, a viable workaround for this issue is to set eviction thresholds at approximately 75% capacity.
This increases the ability of this feature to prevent system OOMs, and promote eviction of workloads so cluster state can rebalance.
EDIT: As there seems to be a race between the OOM killer and the kubelet, and the memory allocated by your script grows faster than the kubelet can notice that pods need to be evicted, it might be wise to allocate memory more slowly inside your script.
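For example, the allocation loop could grab memory in small fixed chunks with a pause between them; the chunk size and sleep below are assumptions, chosen only to give the kubelet's housekeeping interval time to notice MemoryPressure and evict:
# Hypothetical replacement for the container command in the fallocate DaemonSet:
# allocate 100 MB per iteration instead of everything at once.
command:
  - /bin/sh
  - -c
  - |
    count=1
    while true
    do
      fallocate -l 100MB /cache-1/$count   # small fixed chunk
      count=$(( $count + 1 ))
      sleep 10                             # give the kubelet time to observe pressure and evict
    done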