"Limits" property ignored when deploying a container in a Kubernetes cluster - kubernetes

I am deploying a container in Google Kubernetes Engine with this YAML fragment:
spec:
  containers:
  - name: service
    image: registry/service-go:latest
    resources:
      requests:
        memory: "20Mi"
        cpu: "20m"
      limits:
        memory: "100Mi"
        cpu: "50m"
But the container keeps taking 120m. Why is the "limits" property being ignored? Everything else works correctly: if I request 200m, 200m are reserved, but the limit keeps being ignored.
My Kubernetes version is 1.10.7-gke.1
I only have the default namespace, and when executing
kubectl describe namespace default
Name: default
Labels: <none>
Annotations: <none>
Status: Active
No resource quota.
Resource Limits
Type Resource Min Max Default Request Default Limit Max Limit/Request Ratio
---- -------- --- --- --------------- ------------- -----------------------
Container cpu - - 100m - -

Considering Resources Request Only
The Google Cloud console is working correctly. You most likely have multiple containers in your pod; the value shown is the sum of the resource requests declared in your (truncated) YAML file. You can verify this easily with kubectl.
First, verify the number of containers in your pod.
kubectl describe pod service-85cc4df46d-t6wc9
Then look at the description of the node via kubectl; you should see the same information as the console shows.
kubectl describe node gke-default-pool-abcdefgh...
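If you prefer machine-readable output, a jsonpath query along these lines (the pod name is taken from the question; adjust it to your deployment) lists each container's CPU request and limit so you can sum them yourself:
kubectl get pod service-85cc4df46d-t6wc9 -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.requests.cpu}{"\t"}{.resources.limits.cpu}{"\n"}{end}'
If the per-container requests add up to the 120m you see, the console is simply summing them.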
What is the difference between resource requests and limits?
You can imagine your cluster as a big square box. This is the total of your allocatable resources. When you drop a pod into the big box, Kubernetes checks whether there is empty space for the pod's requested resources (does the small box fit in the big box?). If there is enough space available, it schedules your workload on the selected node.
Resource limits are not taken into account by the scheduler. They are enforced at the kernel level with cgroups. Their goal is to prevent workloads from taking all the CPU or memory on the node they are scheduled on.
If your resource requests == resource limits, workloads cannot escape their "box" and cannot use spare CPU/memory next to them. In other words, the resources are guaranteed for the pod.
But if the limits are greater than your requests, this is called overcommitting resources. You are betting that all the workloads on the same node will not be fully loaded at the same time (which is generally the case).
I recommend not overcommitting the memory resource: do not let the pod escape the "box" in terms of memory, as it can lead to OOM killing.
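As a minimal sketch (values are illustrative, not taken from the question's manifest), a container whose requests equal its limits cannot burst outside its box, and Kubernetes classifies such a pod as Guaranteed QoS:
resources:
  requests:
    memory: "100Mi"
    cpu: "50m"
  limits:
    memory: "100Mi"
    cpu: "50m"
With requests below limits, as in the question, the pod is Burstable: the scheduler reserves only the requests, and anything up to the limit is opportunistic use of spare capacity.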

You can try logging into the node running your pod and run:
ps -Af | grep docker
You'll see the full command line that the kubelet sends to Docker. The memory limit shows up as something like --memory. Note that the memory request value is only used by the Kubernetes scheduler, to determine whether the node's allocatable memory would be exceeded by all the pods/containers running on it.
The CPU request is represented by the --cpu-shares flag. In this case it is not a hard limit; it is again a way for the Kubernetes scheduler to avoid placing containers/pods past that amount when running multiple containers/pods on a specific node. You can learn more about cpu-shares here and from the Kubernetes side here. So in essence, if there aren't enough other workloads on the node, the container will always go over its CPU share if it needs to, and that's probably what you are seeing.
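As a rough sketch of that mapping (the kubelet of that era converted a CPU request of m millicores to roughly m × 1024 / 1000 shares; treat the exact flags as illustrative):
# 20m  request -> --cpu-shares=20    (20  * 1024 / 1000 ≈ 20)
# 200m request -> --cpu-shares=204   (200 * 1024 / 1000 ≈ 204)
cpu-shares only set a relative weight under contention, so on an otherwise idle node the container can consume well beyond its request, which matches the 120m you observe.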
Docker has other ways of restricting CPU, such as cpu-period/cpu-quota and cpuset-cpus, but these were not used by Kubernetes as of this writing. In this regard, I believe Mesos handles CPU/memory reservations and quotas somewhat better.
Hope it helps.

Related

My kubernetes pods are Evicting with ephemeral-storage issue

I am running a k8s cluster with 8 workers and 3 master nodes, and my pods are being evicted repeatedly with ephemeral-storage issues.
Below is the error I am getting on Evicted pods:
Message: The node was low on resource: ephemeral-storage. Container xpaas-logger was using 30108Ki, which exceeds its request of 0. Container wso2am-gateway-am was using 406468Ki, which exceeds its request of 0.
To overcome the above error, I have added ephemeral-storage limits and requests to my namespace.
apiVersion: v1
kind: LimitRange
metadata:
  name: ephemeral-storage-limit-range
spec:
  limits:
  - default:
      ephemeral-storage: 2Gi
    defaultRequest:
      ephemeral-storage: 130Mi
    type: Container
Even after adding the above limits and requests to my namespace, my pod reaches its limit and is then evicted.
Message: Pod ephemeral local storage usage exceeds the total limit of containers 2Gi.
How can I monitor my ephemeral storage? Where is it stored on my instance?
How can I configure Docker log rotation for my ephemeral storage based on size? Any suggestions?
"Ephemeral storage" here refers to space being used in the container filesystem that's not in a volume. Something inside your process is using a lot of local disk space. In the abstract this is relatively easy to debug: use kubectl exec to get a shell in the pod, and then use normal Unix commands like du to find where the space is going. Since it's space inside the pod, it's not directly accessible from the nodes, and you probably can't use tools like logrotate to try to manage it.
One specific cause of this I've run into in the past is processes configured to log to a file. In Kubernetes you should generally set your logging setup to log to stdout instead. This avoids this specific ephemeral-storage problem, but also avoids a number of practical issues around actually getting the log file out of the pod. kubectl logs will show you these logs and you can set up cluster-level tooling to export them to another system.
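To make the du approach above concrete (the container name is taken from the error message in the question; replace the pod name with your own):
kubectl exec -it <pod-name> -c wso2am-gateway-am -- sh
# then, inside the container:
du -sh /* 2>/dev/null
du -sh /var/log/* 2>/dev/null   # application log files are a common culprit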

Kubernetes pod memory limit using metrics

I am trying to fetch the actual and total amount of memory allocated to a pod using API.
I am able to fetch the actual memory consumption using the metrics-server API.
How can I fetch the total memory assigned to a pod using the metrics-server API?
I am developing a dashboard in which I need to show pod memory and CPU. The UI graph takes the actual and total amounts as inputs. I can fetch the actual memory used with kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/default/pods/, but how can I get the total memory of a pod?
All the available APIs are described in the Kubernetes API Guide. It is available for the different K8s versions (just pay attention to the URL).
Nick, how can I fetch this value using the API?
At first glance, even /api/v1/namespaces/<namespace>/pods/ will do the job.
Please see my example (I decided to test it myself).
$ cat pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: server-go-lim
  ...
spec:
  containers:
  - image: nkolchenko/enea:server_go_latest
    resources:
      ...
      limits:
        memory: 1024Mi # Here is the Limit
      ...
$ kubectl create -f pod.yaml
pod/server-go-lim created
$ kubectl get --raw /api/v1/namespaces/default/pods/server-go-lim | json_pp
{
  ...
  "spec" : {
    "containers" : [
      {
        "resources" : {
          "limits" : {
            "memory" : "1Gi"
          }
As we can see, the API returns the limits for the pod.
Let me know if that's the one you've been looking for.
You can simply run kubectl describe pod <pod> and look under .Containers.<container>.Limits.
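If you need the value programmatically for the dashboard rather than from describe output, a jsonpath query along these lines should work (pod name reused from the example above):
kubectl get pod server-go-lim -o jsonpath='{.spec.containers[*].resources.limits.memory}'
The same field is available under .spec.containers[].resources.limits in the raw API response shown earlier.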
The total amount of memory allocated to a pod is bounded by the memory limit you assign to the pod's containers. https://kubernetes.io/docs/tasks/configure-pod-container/assign-memory-resource/#specify-a-memory-request-and-a-memory-limit
To specify a memory request for a Container, include the resources:requests field in the Container's resource manifest. To specify a memory limit, include resources:limits.
If you don't specify a memory limit, the container has no upper bound. https://kubernetes.io/docs/tasks/configure-pod-container/assign-memory-resource/#if-you-do-not-specify-a-memory-limit
If you do not specify a memory limit for a Container, one of the following situations applies:
• The Container has no upper bound on the amount of memory it uses. The Container could use all of the memory available on the Node where it is running which in turn could invoke the OOM Killer. Further, in case of an OOM Kill, a container with no resource limits will have a greater chance of being killed.
• The Container is running in a namespace that has a default memory limit, and the Container is automatically assigned the default limit. Cluster administrators can use a LimitRange to specify a default value for the memory limit.
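For that second case, a minimal LimitRange sketch (the name and values here are illustrative, not prescribed by the docs) looks like this:
apiVersion: v1
kind: LimitRange
metadata:
  name: default-memory-limit
spec:
  limits:
  - default:            # limit applied to containers that don't set one
      memory: 512Mi
    defaultRequest:     # request applied to containers that don't set one
      memory: 256Mi
    type: Container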

AutoScaling work loads without running out of memory

I have a number of pods running, with a Horizontal Pod Autoscaler assigned to target them. The cluster I am using can also add and remove nodes automatically based on the current load.
But we recently had the cluster go offline with OOM errors, and this caused a disruption in service.
Is there a way to monitor the load on each node so that, if usage reaches, say, 80% of the memory on a node, Kubernetes does not schedule more pods on that node but waits for another node to come online?
The pending pods are what you should monitor, and you should define resource requests, which affect scheduling.
The Scheduler uses resource request information when scheduling a pod to a node. Each node has a certain amount of CPU and memory it can allocate to pods. When scheduling a pod, the Scheduler will only consider nodes with enough unallocated resources to meet the pod’s resource requirements. If the amount of unallocated CPU or memory is less than what the pod requests, Kubernetes will not schedule the pod to that node, because the node can’t provide the minimum amount required by the pod. The new pods will remain in the Pending state until new nodes come into the cluster.
Example:
apiVersion: v1
kind: Pod
metadata:
  name: requests-pod
spec:
  containers:
  - image: busybox
    command: ["dd", "if=/dev/zero", "of=/dev/null"]
    name: main
    resources:
      requests:
        cpu: 200m
        memory: 10Mi
When you don’t specify a request for CPU, you’re saying you don’t care how much CPU time the process running in your container is allotted. In the worst case, it may not get any CPU time at all (this happens when a heavy demand by other processes exists on the CPU). Although this may be fine for low-priority batch jobs, which aren’t time-critical, it obviously isn’t appropriate for containers handling user requests.
Short answer: add resource requests but don't add limits; otherwise, you will face the throttling issue.
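As a minimal sketch of that recommendation, reusing the values from the example above:
resources:
  requests:
    cpu: 200m
    memory: 10Mi
  # no limits block: the container can use spare CPU without being throttled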

What happens with memory requests/limits specified in K8s job (container) when job is completed?

I have a multi-environment k8s cluster (EKS), and I'm trying to set up accurate values for ResourceQuotas.
One interesting thing I've noticed is that the requests/limits specified for CPU/memory stay "occupied" in the cluster when a job completes successfully, even though the pod has effectively released the CPU/memory resources it was using.
Since I expect a lot of jobs to be executed in this environment, this causes a problem for me. Of course, I've added a cleanup cronjob for successfully executed jobs, but that is just one part of the solution.
I'm aware of the TTL feature in k8s: https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#ttl-mechanism-for-finished-jobs which is still in alpha state and as such not available on an EKS k8s cluster.
I would expect the requests/limits specified on that specific pod (container/s) to be "released" as well, but when looking at the k8s metrics in Grafana, I see that this is not the case.
This is an example (green line marks current resource usage, yellow marks resource request while blue marks resource limit):
My question is:
Is this expected behaviour?
If yes, what are the technical reasons why request/limits are not released as well after job (pod) execution is completed?
I've done a "load" test on my environment to check whether the requests/limits that remain assigned to a completed job (pod) indeed count against the ResourceQuota that I've set.
This is what my ResourceQuota looks like:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: mem-cpu-quota
spec:
  hard:
    requests.cpu: "1"
    requests.memory: 2Gi
    limits.cpu: "2"
    limits.memory: 3Gi
This is the request/limit for CPU/memory that exists on each k8s job (to be precise, on the container running in the Pod that is spun up by the Job):
resources:
  limits:
    cpu: 250m
    memory: 250Mi
  requests:
    cpu: 100m
    memory: 100Mi
Results of testing:
Number of jobs running at the same time: 66
Expected sum of CPU requests (if the assumption from the question is correct) ~= 66 × 100m = 6.6 CPU
Expected sum of memory requests (if the assumption from the question is correct) ~= 66 × 100Mi = 6600Mi ≈ 6.45Gi
Expected sum of CPU limits (if the assumption from the question is correct) ~= 66 × 250m = 16.5 CPU
Expected sum of memory limits (if the assumption from the question is correct) ~= 66 × 250Mi = 16500Mi ≈ 16.1Gi
I've created Grafana graphs that show following:
CPU usage/requests/limits for jobs in one namespace
sum(rate(container_cpu_usage_seconds_total{namespace="${namespace}", container="myjob"}[5m]))
sum(kube_pod_container_resource_requests_cpu_cores{namespace="${namespace}", container="myjob"})
sum(kube_pod_container_resource_limits_cpu_cores{namespace="${namespace}", container="myjob"})
Memory usage/requests/limits for jobs in one namespace
sum(rate(container_memory_usage_bytes{namespace="${namespace}", container="myjob"}[5m]))
sum(kube_pod_container_resource_requests_memory_bytes{namespace="${namespace}", container="myjob"})
sum(kube_pod_container_resource_limits_memory_bytes{namespace="${namespace}", container="myjob"})
This is what the graphs look like:
According to this graph, requests/limits get accumulated and go well beyond the ResourceQuota thresholds. However, I'm still able to run new jobs without a problem.
At this point, I started doubting what the metrics were showing and opted to check a different part of the metrics. To be specific, I used the following set of metrics:
CPU:
sum (rate(container_cpu_usage_seconds_total{namespace="$namespace"}[1m]))
kube_resourcequota{namespace="$namespace", resource="limits.cpu", type="hard"}
kube_resourcequota{namespace="$namespace", resource="requests.cpu", type="hard"}
kube_resourcequota{namespace="$namespace", resource="limits.cpu", type="used"}
kube_resourcequota{namespace="$namespace", resource="requests.cpu", type="used"}
Memory:
sum (container_memory_usage_bytes{image!="",name=~"^k8s_.*", namespace="$namespace"})
kube_resourcequota{namespace="$namespace", resource="limits.memory", type="hard"}
kube_resourcequota{namespace="$namespace", resource="requests.memory", type="hard"}
kube_resourcequota{namespace="$namespace", resource="limits.memory", type="used"}
kube_resourcequota{namespace="$namespace", resource="requests.memory", type="used"}
This is what the graph looks like:
Conclusion:
From this screenshot it is clear that, once the load test completes and the jobs go into the Completed state, the CPU/memory requests/limits are released, even though the pods are still around (with READY: 0/1 and STATUS: Completed), and they no longer represent a constraint that counts against the ResourceQuota thresholds.
This can be seen by observing following data on the graph:
CPU allocated requests
CPU allocated limits
Memory allocated requests
Memory allocated limits
all of which increase at the point in time when load hits the system and new jobs are created, but go back to the previous state as soon as the jobs are completed (even though they are not deleted from the environment).
In other words, resource usage/requests/limits for both CPU and memory are taken into account only while the job (and its corresponding pod) is in the Running state.
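A simple cross-check that does not depend on Prometheus metric naming is to ask the API server for the quota status directly; the Used column should only include pods whose resources still count against the quota:
kubectl describe resourcequota mem-cpu-quota -n <namespace>
kubectl get resourcequota mem-cpu-quota -n <namespace> -o jsonpath='{.status.used}'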
If you do kubectl get pod you can see the pod that was created by the Job still exists in the list, for example:
NAME READY STATUS RESTARTS AGE
cert-generator-11b35c51b71ea3086396a780dbf20b5cd695b25d-wvb7t 0/1 Completed 0 57d
Thus, any resource requests/limits are still utilized by the pod. To release the resource, you can manually delete the pod. It will be re-created the next time the job runs.
You can also configure the Job (and hence the pod) to be auto-deleted upon success and/or failure by setting .spec.ttlSecondsAfterFinished on the Job. But you would then lose the record of whether the job succeeded or failed.
Or, if your Job is actually created by a CronJob, you can configure the Jobs (and hence the pods) to be auto-deleted via .spec.successfulJobsHistoryLimit and .spec.failedJobsHistoryLimit on the CronJob.
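For illustration, the two options look roughly like this (API versions match the era of the question; names, image, schedule, and values are placeholders):
# Standalone Job, auto-deleted 5 minutes after it finishes
# (requires the alpha TTLAfterFinished feature to be enabled):
apiVersion: batch/v1
kind: Job
metadata:
  name: myjob
spec:
  ttlSecondsAfterFinished: 300
  template:
    spec:
      containers:
      - name: myjob
        image: busybox
        command: ["true"]
      restartPolicy: Never
# CronJob that keeps only a bounded history of finished Jobs:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: myjob
spec:
  schedule: "*/5 * * * *"
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: myjob
            image: busybox
            command: ["true"]
          restartPolicy: Never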

Node not ready, pods pending

I am running a cluster on GKE and sometimes I get into a hanging state. Right now I am working with just two nodes and have allowed the cluster to autoscale. One of the nodes has a NotReady status and simply stays in it. Because of that, half of my pods are Pending due to insufficient CPU.
How I got there
I deployed a pod which has quite high CPU usage from the moment it starts. When I scaled it to 2, I noticed CPU usage was at 1.0; the moment I scaled the Deployment to 3 replicas, I expected to have the third one in Pending state until the cluster adds another node, then schedule it there.
What happened instead is the node switched to a NotReady status and all pods that were on it are now Pending.
However, the node does not restart or anything - it is just not used by Kubernetes. GKE then thinks that there are enough resources, since the VM shows 0 CPU usage, and won't scale up to a third node.
I cannot manually SSH into the instance from console - it is stuck in the loading loop.
I can manually delete the instance and then it starts working - but I don't think that's the idea of fully managed.
One thing I noticed - not sure if related: in GCE console, when I look at VM instances, the Ready node is being used by the instance group and the load balancer (which is the service around an nginx entry point), but the NotReady node is only in use by the instance group - not the load balancer.
Furthermore, in kubectl get events, there was a line:
Warning CreatingLoadBalancerFailed {service-controller } Error creating load balancer (will retry): Failed to create load balancer for service default/proxy-service: failed to ensure static IP 104.199.xx.xx: error creating gce static IP address: googleapi: Error 400: Invalid value for field 'resource.address': '104.199.xx.xx'. Specified IP address is already reserved., invalid
I specified loadBalancerIP: 104.199.xx.xx in the definition of the proxy-service to make sure that on each restart the service gets the same (reserved) static IP.
Any ideas on how to prevent this from happening? So that if a node gets stuck in NotReady state it at least restarts - but ideally doesn't get into such state to begin with?
Thanks.
The first thing I would do is to define Resources and Limits for those pods.
Resources tell the cluster how much memory and CPU you think that the pod is going to use. You do this to help the scheduler to find the best location to run those pods.
Limits are crucial here: they are set to prevent your pods from damaging the stability of the nodes. It's better to have a pod killed by the OOM killer than to have a pod bring a node down through resource starvation.
For example, in this case you're saying that you want 200m of CPU (20% of a core) for your pod, but if by any chance it goes above 300m (30%), you want it reined in: CPU over the limit gets throttled, and memory over the limit gets the container killed and restarted.
spec:
  containers:
  - image: nginx
    imagePullPolicy: Always
    name: nginx
    resources:
      limits:
        cpu: 300m
        memory: 200Mi
      requests:
        cpu: 200m
        memory: 100Mi
You can read more here: http://kubernetes.io/docs/admin/limitrange/
I can speak for AWS. You can create dynamic scaling policies based on CPU and memory utilization.
The node goes into the NotReady state because it is out of memory or perhaps short on CPU. You can create a custom memory metric that collects the memory usage of all the worker nodes in the cluster and pushes it to CloudWatch.
You can follow this documentation: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/mon-scripts.html
A CPU metric is already there, so there is no need to create it; only the memory metric needs to be created for your cluster.
You can then create an alarm for when it goes above a certain threshold. Next, go to the Auto Scaling Group in the AWS console, add a scaling policy for your Auto Scaling group, select the alarm you created, and set the number of instances to add accordingly.
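A hedged sketch of those steps with the AWS CLI (the group name, alarm name, and threshold are placeholders; MemoryUtilization in the System/Linux namespace is what the monitoring scripts from the linked documentation publish):
# 1. Simple scaling policy on the node Auto Scaling Group (returns a policy ARN):
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name my-k8s-node-asg \
  --policy-name scale-out-on-memory \
  --adjustment-type ChangeInCapacity \
  --scaling-adjustment 1
# 2. CloudWatch alarm that triggers the policy when average memory exceeds 80%:
aws cloudwatch put-metric-alarm \
  --alarm-name k8s-nodes-memory-high \
  --namespace System/Linux \
  --metric-name MemoryUtilization \
  --dimensions Name=AutoScalingGroupName,Value=my-k8s-node-asg \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions <policy-arn-from-step-1>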