Why are Kubernetes pod limits not being applied?

I am trying to apply pod CPU limits on my cluster. I added requests and limits to my deployment.yaml and applied the yaml. Below are the observations:
I don't get any error when I apply the file (my existing app pod is terminated and a new one is spun up)
I can see a QoS class being applied when I do a describe on my pod (QoS: Burstable)
The issue is that the limits are not being honored: I can see in my metrics server that the pod CPU usage is above 300m (whereas the limit set is 200m)
I have an Istio sidecar container attached to the pod (but I only want to apply the limit to my app and not to Istio)
Snippet of yaml file:
resources:
  limits:
    cpu: 200m
    memory: 100Mi
  requests:
    cpu: 50m
    memory: 50Mi
Any ideas what else I need to check here? I went through the documentation and get no errors, but the limits are not being applied. Thanks in advance!

Pod CPU includes all containers in the pod, while the limits you've specified apply only to the app container.
If you query metrics for the container alone, you will probably find that it's honouring the limits you've enforced upon it.
Here's an example Prometheus query you can use if you're running Prometheus on your cluster. It shows the ratio between each container's actual CPU usage and its CPU requests:
max(sum(irate(container_cpu_usage_seconds_total{container=~"<container_name>", image!="", namespace="<namespace>", pod=~"<deployment_name>"}[5m])) by (pod, namespace))
/
max(sum(kube_pod_container_resource_requests_cpu_cores{namespace="<namespace>", container=~"<container_name>", pod=~"<deployment_name>"}) by (pod, namespace))
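If Prometheus isn't available, a quick sanity check is kubectl top with the --containers flag (requires metrics-server); the pod and namespace names are placeholders and the numbers in the sample output are illustrative. It breaks the usage down per container, so the app and the istio-proxy sidecar show up separately:
kubectl top pod <pod-name> -n <namespace> --containers
# POD          NAME          CPU(cores)   MEMORY(bytes)
# <pod-name>   app           180m         70Mi          <- within the 200m limit
# <pod-name>   istio-proxy   150m         60Mi          <- unconstrained sidecar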

Related

Kubernetes: CPU request and total resources doubts

To better explain my doubts, I will give an example.
Example:
We have one worker node with 3 allocatable CPUs, and Kubernetes has scheduled three pods on it:
pod_1 with 500m cpu request
pod_2 with 700m cpu request
pod_3 with 300m cpu request
On this worker node I can't schedule any other pods.
But if I check the real usage:
pod_1: cpu usage is 300m
pod_2: cpu usage is 600m
My question is:
Can pod_3 have a real usage of 500m, or will the requests of the other pods limit its CPU usage?
Thanks
Pietro
It doesn't matter what the real usage is: the "request" is the amount of resources guaranteed to be available to the pod. Your workload might be using only a fraction of the requested resources, but what really counts is the "request" itself.
Example: let's say you have a node with 1 CPU core.
Pod A - 100m Request
Pod B - 200m Request
Pod C - 700m Request
Now, no further pod can be scheduled on the node, because the whole 1 CPU is already requested by the 3 pods. It doesn't matter what fraction of the requested resources each pod is actually using at any given time.
Another point worth noting is the "limit". A workload may exceed its requested resources, but it can never exceed its "limit". This is a very important mechanism to understand.
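Here is a minimal sketch of the three pods from the example above as manifests; the pod names and the image are placeholders, and only the CPU requests are set to keep the sketch short:
apiVersion: v1
kind: Pod
metadata:
  name: pod-a            # Pod A - requests 100m
spec:
  containers:
  - name: app
    image: nginx         # placeholder image
    resources:
      requests:
        cpu: "100m"
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-b            # Pod B - requests 200m
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "200m"
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-c            # Pod C - requests 700m
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "700m"
Once all three are scheduled on the 1-core node, its entire allocatable CPU is accounted for by requests, so a fourth pod with any CPU request will not fit there, no matter how idle the first three are.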
Kubernetes schedules pods based on the requests you configure for the container(s) of the pod (via the spec of the respective Deployment or other kinds).
Here's an example:
For simplicity, let's assume only one container for the pod.
containers:
- name: "foobar"
  resources:
    requests:
      cpu: "300m"
      memory: "256Mi"
    limits:
      cpu: "500m"
      memory: "512Mi"
If you ask for 300 millicpus as your request, Kubernetes will place the pod on a node that has at least 300 millicpus allocatable for that pod. If a node has less allocatable CPU available, the pod will not be placed on that node. You can set the memory request in the same way.
The limit caps how much of a resource the container may use. In the example above, if the container ends up using more than 512MiB of memory, it is OOM-killed and restarted by the kubelet according to the pod's restart policy; it is not rescheduled onto another node. CPU usage above 500m is throttled rather than killed. (A pod that cannot be scheduled at all, because no node has the requested 300 millicpus available, stays in Pending state with FailedScheduling as the reason until a node with sufficient capacity is available.)
Do note that the resource request matters only at scheduling time, not at runtime: the actual consumption of resources will not trigger a rescheduling of the pod, even if the container uses more than it requested, as long as it stays below the limit (if one is specified).
https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#how-pods-with-resource-requests-are-scheduled
So, in summary,
The total of all your requests is what counts as allocated on a node, regardless of the actual runtime utilization of your pods (as long as the limit is not crossed)
You can request 300 millicpus but use only 100 millicpus, or 400 millicpus; Kubernetes will still show the "allocated" value as 300
If your container crosses its memory limit, it gets OOM-killed and restarted (CPU usage above the limit is throttled instead)
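If you want to see that accounting on a node, kubectl describe node prints an "Allocated resources" section that sums the requests and limits of every pod scheduled there; the node name and the numbers below are placeholders, and the exact output format varies slightly between versions:
kubectl describe node <node-name>
# ...
# Allocated resources:
#   (Total limits may be over 100 percent, i.e., overcommitted.)
#   Resource   Requests     Limits
#   --------   --------     ------
#   cpu        300m (30%)   500m (50%)
#   memory     256Mi (6%)   512Mi (13%)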

Kubernetes Pod OOMKilled Issue

The scenario is that we run some web sites based on an nginx image.
Our cluster was originally set up with nodes of 2 cores and 4GB RAM each.
The pods had the following configuration: cpu: 40m and memory: 100MiB.
Later, we upgraded our cluster to nodes of 4 cores and 8GB RAM each.
But we kept getting OOMKilled in every pod.
So we increased the memory on every pod to around 300MiB, and then everything seemed to work fine.
My question is why this happens and how do I solve it.
P.S. If we revert back to each node having 2 cores and 4GB RAM, the pods work just fine with the decreased resource setting of 100MiB.
Any help would be highly appreciated.
Regards.
For each container in Kubernetes you can configure resources for both CPU and memory, like the following:
resources:
  limits:
    cpu: 100m
    memory: "200Mi"
  requests:
    cpu: 50m
    memory: "100Mi"
According to documentation
When you specify the resource request for Containers in a Pod, the scheduler uses this information to decide which node to place the Pod on. When you specify a resource limit for a Container, the kubelet enforces those limits so that the running container is not allowed to use more of that resource than the limit you set.
So if you set memory: "100Mi" under resources.limits and your container consumes more than 100MiB of memory, it will be killed with an out-of-memory (OOM) error.
For more details about requests and limits on resources, click here.
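To confirm that a container was killed for exceeding its memory limit, kubectl describe pod shows the last termination state; the pod name is a placeholder and the output is trimmed:
kubectl describe pod <pod-name>
# ...
#    Last State:     Terminated
#      Reason:       OOMKilled
#      Exit Code:    137
#    Restart Count:  3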

How to fix ephemeral local storage problem?

I'm running a deployment on EKS (Kubernetes 1.16), and after ~5 minutes my pod gets evicted with the following message:
Pod ephemeral local storage usage exceeds the total limit of containers 1Gi.
My node has 20Gi ephemeral storage.
My QoS class is Guaranteed, and no matter what amount of ephemeral-storage I configure in my yaml, I see the same error with the amount I configure.
Do you have a clue what can be done?
My yaml file is here: https://slexy.org/view/s2096sex7L
It's because you're putting an upper limit on ephemeral-storage usage by setting resources.limits.ephemeral-storage to 1Gi. Remove limits.ephemeral-storage if that is safe, or change the value according to your requirement.
resources:
  limits:
    memory: "61Gi"
    cpu: "7500m"
    ephemeral-storage: "1Gi"    # <----- here
  requests:
    memory: "61Gi"
    cpu: "7500m"
    ephemeral-storage: "1Gi"
Requests and limits
If the node where a Pod is running has enough of a resource available, it’s possible (and allowed) for a container to use more resources than its request for that resource specifies. However, a container is not allowed to use more than its resource limit.
If you're reading this and you're using GKE Autopilot, there is a hard limit of 10G for ephemeral storage in Autopilot. I would recommend moving your storage to a Volume.
See Autopilot documentation here
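If moving the data off the node's ephemeral storage is an option, a minimal sketch is to back the write-heavy path with a PersistentVolumeClaim; the names, size, and mount path below are placeholders:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: scratch-data
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 20Gi            # placeholder size
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: nginx               # placeholder image
    volumeMounts:
    - name: scratch
      mountPath: /var/data     # placeholder path where the heavy writes go
  volumes:
  - name: scratch
    persistentVolumeClaim:
      claimName: scratch-data
Writes under /var/data then land on the persistent volume instead of counting against the container's ephemeral-storage limit.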

How can I find out that the pod was recreated because it exceeded its memory limit?

If a pod runs out of memory against its limits, which are defined as follows:
resources:
  limits:
    memory: 80Gi
    cpu: 10
Kubernetes will recreate the pod, but how can I find out that the pod was recreated because it exceeded its memory limit?
Do any logs record this situation?
The simplest way is to use Heapster for monitoring cluster resource usage.
Using a Grafana setup with InfluxDB as storage backends for Heapster gives you the CPU and Memory usage of the entire cluster, individual pods and containers.
When a pod gets restarted due to reaching its memory limit, you should see a sawtooth wave on the memory graph for this pod.
More useful information about monitoring tools and how to set them up can be found here.
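Independently of the monitoring stack, you can also check the container's last termination state directly with kubectl; if the pod was restarted after hitting its memory limit, the reason shows up as OOMKilled (the pod name is a placeholder):
# Reason for the last container termination
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# OOMKilled
# How many times the container has been restarted
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].restartCount}'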

Kubernetes pods failing on "Pod sandbox changed, it will be killed and re-created"

On a Google Container Engine (GKE) cluster, I sometimes see a pod (or more) not starting, and looking at its events I can see the following:
Pod sandbox changed, it will be killed and re-created.
If I wait, it just keeps retrying.
If I delete the pod, and allow them to be recreated by the Deployment's Replica Set, it will start properly.
The behavior is inconsistent.
Kubernetes versions 1.7.6 and 1.7.8
Any ideas?
In my case it happened because the memory and CPU limits were too low.
For example, in your manifest file, increase the limits and requests from:
limits:
  cpu: 100m
  memory: 128Mi
requests:
  cpu: 100m
  memory: 128Mi
to this:
limits:
  cpu: 1000m
  memory: 2048Mi
requests:
  cpu: 500m
  memory: 1024Mi
I can see the following message posted on the Google Cloud Status Dashboard:
"We are investigating an issue affecting Google Container Engine (GKE) clusters where after docker crashes or is restarted on a node, pods are unable to be scheduled.
The issue is believed to be affecting all GKE clusters running Kubernetes v1.6.11, v1.7.8 and v1.8.1.
Our Engineering Team suggests: If nodes are on release v1.6.11, please downgrade your nodes to v1.6.10. If nodes are on release v1.7.8, please downgrade your nodes to v1.7.6. If nodes are on v1.8.1, please downgrade your nodes to v1.7.6.
Alternative workarounds are also provided by the Engineering team in this doc. These workarounds are applicable to customers that are unable to downgrade their nodes."
I was affected by the same issue on one node in a GKE 1.8.1 cluster (other nodes were fine). I did the following:
Make sure your node pool has some headroom to receive all the pods scheduled on the affected node. When in doubt, increase the node pool by 1.
Drain the affected node following this manual:
kubectl drain <node>
You may run into warnings about DaemonSets or pods with local storage; proceed with the operation (the command sketch after these steps shows the relevant flags).
Power down the affected node in Compute Engine. GKE should schedule a replacement node if your pool size is smaller than specified in the pool description.
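A sketch of the drain and power-down steps as commands; the node name and zone are placeholders, and the drain flags shown match kubectl from the 1.7/1.8 era (newer versions rename --delete-local-data to --delete-emptydir-data):
# Drain the node, acknowledging the DaemonSet / local-storage warnings
kubectl drain <node-name> --ignore-daemonsets --delete-local-data
# Power down the underlying Compute Engine instance
gcloud compute instances stop <node-name> --zone <zone>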