Trying to understand the meaning of averageUtilization in Kubernetes autoscaling

The docs say:
For per-pod resource metrics (like CPU), the controller fetches the metrics from the resource metrics API for each Pod targeted by the HorizontalPodAutoscaler. Then, if a target utilization value is set, the controller calculates the utilization value as a percentage of the equivalent resource request on the containers in each Pod. If a target raw value is set, the raw metric values are used directly. The controller then takes the mean of the utilization or the raw value (depending on the type of target specified) across all targeted Pods, and produces a ratio used to scale the number of desired replicas.
Assume I have a Pod with:
resources:
  limits:
    cpu: "0.3"
    memory: 500M
  requests:
    cpu: "0.01"
    memory: 40M
and now I have an autoscaling definition as:
type: Resource
resource:
  name: cpu
  target:
    type: Utilization
    averageUtilization: 60
Which according to the docs:
With this metric the HPA controller will keep the average utilization of the pods in the scaling target at 60%. Utilization is the ratio between the current usage of resource to the requested resources of the pod
So, I'm not understanding something here. If request is the minimum resources required to run the app, how would scaling be based on this value? 60% of 0.01 is nothing, and the service would be constantly scaling.

Your misunderstanding might be that the value of request is not necessarily the minimum your app needs to run.
It is what you (the developer, admin, DevOps) request from the Kubernetes cluster for a pod in your application to run, and it helps the scheduler pick the right node for your workload (say, one that has sufficient resources available). So don't pick this value too small or too high.
Apart from that, autoscaling works as you described it. In this case, the cluster calculates how much of your requested CPU is actually used and will scale out when more than 60% of it is in use. Keep in mind that Kubernetes does not look at every single pod but at the average across all pods in that group.
For example, given two pods running, one pod could be at 100% of its requests and the other one at (almost) 0%. The average would be around 50%, so no autoscaling happens in the case of the Horizontal Pod Autoscaler.
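To make the math concrete: the HPA docs give the formula desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]. As a rough sketch with the request from the question (10m CPU per pod, target 60%): if two pods currently use 9m and 3m, the average utilization is (90% + 30%) / 2 = 60%, which matches the target, so nothing happens; if both use 12m, the average is 120% and the controller computes ceil(2 * 120 / 60) = 4 replicas.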
In production, I personally start with a guess at the right values, then look at the metrics and adjust them to my real-world workload. Prometheus is your friend, or at least the metrics server:
https://github.com/prometheus-operator/kube-prometheus
https://github.com/kubernetes-sigs/metrics-server
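For example, with the metrics server installed, a quick way to compare actual usage against your requests is (the namespace is just a placeholder):
kubectl top pods -n my-namespace
kubectl top nodes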

Related

Can Horizontal Pod Scaling work with one node only?

I'm new to Kubernetes, and I have a question about horizontal pod autoscaling. Can I apply HPA with just one node? If so, what are the benefits of HPA with one node only?
If I use the metrics below, the target says averageUtilization 50% of cpu. Does that imply that I need a new node after the value is reached?
metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 50
Any advice?
Here are some notes that might help you to sort things out:
Yes, you can use horizontal pod autoscaling on one node only.
The benefit of running multiple pods is parallelism: More instances of your app can handle more load - in that regard it doesn't matter if you run the pods on one or several nodes.
But if you have more pods of your application, you might end up in a situation where you need additional nodes to handle the load.
To determine how many pods can run on one node, Kubernetes uses the concept of resource requests and limits.
HPA will spawn new pods if the actual utilization of your pods hits the target utilization - but it doesn't ensure that your node can handle more pods - you need to configure this using resource requests and limits.
Scaling up the nodes of your cluster is not handled by HPA, you need to use the kubernetes cluster autoscaler for that.
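Putting the pieces together, a minimal HorizontalPodAutoscaler manifest for the 50% CPU target from the question might look roughly like this (the name and the replica bounds are placeholders):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50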

kubernetes node shows limit, but no limit set

I have not configured any limit range or pod limit,
but my nodes show requests and limits. Is that a limit, or the max-seen value?
I have around 20 active nodes, all of them the same hardware size - but each node shows a different limit with kubectl describe node nodeXX.
Does that mean I cannot use more than the limit?
If you check the result of kubectl describe node nodeXX again more carefully, you can see that each pod has the columns CPU Requests, CPU Limits, Memory Requests and Memory Limits. The total Requests and Limits shown in your screenshot should be the sum of your pods' requests and limits.
If you haven't configured limits for your pods, then they will show 0%. However, I can see in your screenshot that you have a node-exporter pod on your node. You probably also have pods in the kube-system namespace that you haven't scheduled yourself but that are essential for Kubernetes to work.
About your question:
Does that mean I cannot use more than the limit?
This article is great at explaining about requests and limits:
Requests are what the container is guaranteed to get. If a container
requests a resource, Kubernetes will only schedule it on a node that
can give it that resource.
Limits, on the other hand, make sure a container never goes above a
certain value. The container is only allowed to go up to the limit,
and then it is restricted.
For example: if your pod requests 1000Mi of memory and your node only has 500Mi of allocatable memory left, the pod will never be scheduled. If your pod requests 300Mi and has a limit of 1000Mi, it will be scheduled, and Kubernetes will try not to allocate more than 1000Mi of memory to it.
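In manifest form, the second case of that example would look something like this (values copied from the sentence above):
resources:
  requests:
    memory: "300Mi"
  limits:
    memory: "1000Mi"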
It may be OK to surpass 100% in limits, especially in development environments, where we trade performance for capacity. Example:
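As an illustrative sketch (not output from this question), the Allocated resources section of kubectl describe node on an overcommitted node might look roughly like this, with the Limits column above 100%:
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests      Limits
  --------  --------      ------
  cpu       950m (47%)    2600m (130%)
  memory    1200Mi (40%)  4000Mi (133%)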

Using Horizontal Pod Autoscaling along with resource requests and limits

Say we have the following deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  ...
spec:
  replicas: 2
  template:
    spec:
      containers:
      - image: ...
        ...
        resources:
          requests:
            cpu: 100m
            memory: 50Mi
          limits:
            cpu: 500m
            memory: 300Mi
And we also create a HorizontalPodAutoscaler object which automatically scales up/down the number of pods based on CPU average utilization. I know that the HPA will compute the number of pods based on the resource requests, but what if I want the containers to be able to request more resources before scaling horizontally?
I have two questions:
1) Are resource limits even used by K8s when a HPA is defined?
2) Can I tell the HPA to scale based on resource limits rather than requests? Or as a means of implementing such a control, can I set the targetUtilization value to be more than 100%?
No, the HPA does not look at limits at all. You can set the target utilization to any value, even higher than 100%.
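For example, assuming the deployment above (100m CPU request, 500m limit), a sketch of a target above 100% of the request - here scaling out only once average usage passes roughly 150m - could look like this:
metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 150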
Hi, in a deployment we have resource requests and limits. As per the documentation here, those parameters act before the HPA takes its main role as autoscaler:
When you create a Pod, the Kubernetes scheduler selects a node for
the Pod to run on. Each node has a maximum capacity for each of the
resource types: the amount of CPU and memory it can provide for
Pods.
When the kubelet starts a Container of a Pod, it passes the CPU and memory limits to the container runtime.
If a Container exceeds its memory limit, it might be terminated. If it is restartable, the kubelet will restart it, as with any other type of runtime failure.
If a Container exceeds its memory request, it is likely that its Pod will be evicted whenever the node runs out of memory.
On the other hand:
The Horizontal Pod Autoscaler is implemented as a control loop, with a period controlled by the controller manager (with a default value of 15 seconds).
The controller manager queries the resource utilization against the metrics specified in each HorizontalPodAutoscaler definition.
Note:
Please note that if some of the pod’s containers do not have the relevant resource request set, CPU utilization for the pod will not be defined and the autoscaler will not take any action for that metric.
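In other words, for CPU-based autoscaling every container in the targeted Pods should have a CPU request set, roughly like this sketch (container names, images and values are only placeholders):
containers:
- name: app
  image: example/app:latest
  resources:
    requests:
      cpu: 100m
- name: sidecar
  image: example/sidecar:latest
  resources:
    requests:
      cpu: 50m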
Hope this helps.

"Limits" property ignored when deploying a container in a Kubernetes cluster

I am deploying a container in Google Kubernetes Engine with this YAML fragment:
spec:
  containers:
  - name: service
    image: registry/service-go:latest
    resources:
      requests:
        memory: "20Mi"
        cpu: "20m"
      limits:
        memory: "100Mi"
        cpu: "50m"
But it keeps taking 120m. Why is the "limits" property being ignored? Everything else is working correctly. If I request 200m, 200m are reserved, but the limit keeps being ignored.
My Kubernetes version is 1.10.7-gke.1
I only have the default namespace and when executing
kubectl describe namespace default
Name: default
Labels: <none>
Annotations: <none>
Status: Active
No resource quota.
Resource Limits
Type Resource Min Max Default Request Default Limit Max Limit/Request Ratio
---- -------- --- --- --------------- ------------- -----------------------
Container cpu - - 100m - -
Considering Resource Requests Only
The Google Cloud console is working correctly; I think you have multiple containers in your pod, and that is why. The value shown above is the sum of the resource requests declared in your truncated YAML file. You can verify it easily with kubectl.
First, verify the number of containers in your pod.
kubectl describe pod service-85cc4df46d-t6wc9
Then look at the description of the node via kubectl; you should see the same information as the console shows.
kubectl describe node gke-default-pool-abcdefgh...
What is the difference between resource requests and limits?
You can imagine your cluster as a big square box. This is the total of your allocatable resources. When you drop a Pod into the big box, Kubernetes checks whether there is empty space for the requested resources of the pod (does the small box fit in the big box?). If there is enough space available, it will schedule your workload on the selected node.
Resource limits are not taken into account by the scheduler. All of that is done at the kernel level with cgroups. The goal is to restrict workloads from taking all the CPU or memory on the node they are scheduled on.
If your resource requests == resource limits, then workloads cannot escape their "box" and are not able to use available CPU/memory next to them. In other words, the resources are guaranteed for the pod.
But if the limits are greater than your requests, this is called overcommitting resources. You bet that all the workloads on the same node are not fully loaded at the same time (which is generally the case).
I recommend not overcommitting the memory resource: do not let the pod escape the "box" in terms of memory, as it can lead to OOMKilling.
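As a sketch of the two situations described above (the values are just placeholders):
# requests == limits: the workload cannot escape its "box" (Guaranteed QoS)
resources:
  requests:
    cpu: 500m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 256Mi
# limits > requests: overcommitted, the workload may burst if the node has spare capacity (Burstable QoS)
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi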
You can try logging into the node running your pod and run:
ps -Af | grep docker
You'll see the full command line that the kubelet sends to docker. The memory limit is represented by something like --memory. Note that the request value for memory is only used by the Kubernetes scheduler to decide whether a node still has room for the pods/containers running on it.
The CPU request is represented by the --cpu-shares flag. In this case the limit is not a hard limit but, again, a way for the Kubernetes scheduler not to allocate containers/pods past that limit when running multiple containers/pods on a specific node. You can learn more about cpu-shares here and from the Kubernetes side here. So in essence, if you don't have enough workloads on the node, a container will always go over its CPU share if it needs to, and that's probably what you are seeing.
Docker has other ways of restricting the CPUs, such as cpu-period/cpu-quota and cpuset-cpus, but they are not used by Kubernetes as of this writing. In this respect, I believe Mesos handles CPU/memory reservations and quotas somewhat better.
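As a rough illustration of that mapping (the exact behaviour depends on the kubelet and runtime version): the kubelet converts the CPU request from millicores into cpu shares at roughly 1024 shares per core, so a request of 250m becomes about 256 shares and a request of 1000m becomes 1024 shares.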
Hope it helps.

Ensuring availability in Kubernetes with high-variance memory / CPU load?

Problem: the code we're running on Kubernetes Pods has a very high variance across its runtime; specifically, it has occasional CPU & memory spikes when certain conditions are triggered. These triggers involve user queries with hard realtime requirements (the system has to respond within <5 seconds).
Under conditions where the node serving the spiking pod doesn't have enough CPU/RAM, Kubernetes responds to these excessive requests by killing the pod altogether, which results in no output at all.
In what way can we ensure that these spikes are taken into account when pods are allocated and, more critically, that no pod shutdown happens for these reasons?
Thanks!
High availability of pods with load can be achieved in two ways:
Configuring More CPU/Memory
As the application requires more CPU/memory during peak times, configure the Pod in such a way that its allocated resources will take care of the extra load. Configure the Pod something like this:
resources:
  requests:
    memory: "64Mi"
    cpu: "250m"
  limits:
    memory: "128Mi"
    cpu: "500m"
You can increase the limits based on the usage, but doing it this way can cause two issues:
1) Underutilized resources
Because a large amount of resources is allocated, much of it may be wasted unless there is a spike in traffic.
2) Deployment failure
Pod deployment may fail because the Kubernetes node does not have enough resources to satisfy the request.
For more info : https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/
Autoscaling
The ideal way is to autoscale the Pods based on traffic.
kubectl autoscale deployment <DEPLOY-APP-NAME> --cpu-percent=50 --min=1 --max=10
Configure cpu-percent based on your requirement (otherwise it is 80% by default). min and max are the bounds on the number of Pods and can be configured accordingly.
So each time the Pods hit the CPU target of 50%, a new pod is launched, up to the maximum of 10 Pods, and the same applies for the scale-down scenario.
For more info: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/
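For reference, a roughly equivalent declarative object (the deployment name here is a placeholder) would be:
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: deploy-app-name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: deploy-app-name
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50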
Limit is a limit, it's expected to do that, period.
What you can do is either run without a limit - it will then behave like in any other situation when running on the node - OOM will happen when the Node, not the Pod, reaches its memory limit. But this sounds like asking for trouble. And mind that even if you set a high limit, it's the request that actually guarantees some resources to the pod, so even with a limit of 2Gi on the Pod it can OOM at 512Mi if the request was 128Mi.
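To restate that last point as a manifest sketch (the numbers match the sentence above):
# only the request is guaranteed; with usage above it (e.g. 512Mi) the pod can
# still be killed under node memory pressure, even though the limit is 2Gi
resources:
  requests:
    memory: "128Mi"
  limits:
    memory: "2Gi"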
You should design your app in a way that will not generate such spikes, or that will tolerate OOMs on pods. It's hard to tell what your software does exactly, but some things that come to mind that could help crack this are request throttling, the Horizontal Pod Autoscaler, or running asynchronously with some kind of message queue.