How to set up Kubernetes HPA to scale based on maximum available memory in a given pod?

I'd like to autoscale the pods not based on the average memory, but rather on the largest amount of available memory in a given pod.
Example:
Let’s say the target maximum available memory is 50%.
If we already have 7 pods and 6 of them are at 90% occupied memory while a single pod is at 40%, that satisfies my criterion and we don't need to scale up. But the moment that last pod drops below 50% available memory, we scale up.
I know this isn't a wise scaling criterion in the majority of cases, but in my particular circumstance it fits.
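For context, a minimal sketch of what the built-in HPA expresses today (assuming the autoscaling/v2 API, or v2beta2 on older clusters, with metrics-server installed; the resource names are hypothetical). Note that the memory target is averaged across all pods, which is exactly why it does not capture the per-pod criterion described above:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 50   # averaged across pods, not the worst-case pod

Getting the "scale when the single best-off pod crosses 50%" behaviour would likely need a custom or external metric, or a custom controller.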

Related

Set cpu requests in K8s for fluctuating load

I have a service deployed in Kubernetes and I am trying to optimize the requested cpu resources.
For now, I have deployed 10 instances and set spec.containers[].resources.limits.cpu to 0.1, based on the "average" use. However, it became obvious that this average is rather useless in practice: under constant load, usage increases significantly (to 0.3-0.4 as far as I can tell).
The consequence is that when multiple instances are deployed on the same node, that node becomes heavily overloaded: pods are no longer responsive, get killed and restarted, and so on.
What is the best practice to find a good value? My current best guess is to increase the requested cpu to 0.3 or 0.4; I'm looking at Grafana visualizations and see that the pods on the heavily loaded node(s) converge there under continuous load.
However, how can I know whether they would use even more CPU if they could, given that they become unresponsive once the node is overloaded?
I'm actually trying to understand how to approach this in general. I would expect an "ideal" service (presuming it is CPU-bound) to use close to 0.0 CPU when there is no load and close to 1.0 when requests are constantly coming in. With that assumption, should I set cpu.requests to 1.0, i.e. assume constant usage?
I have read some Kubernetes best practice guides, but none of them seem to address how to set the actual value for cpu requests in practice in more depth than "find an average".
Basically come up with a number that is your lower acceptable bound for how much the process runs. Setting a request of 100m means that you are okay with a lower limit of your process running 0.1 seconds for every 1 second of wall time (roughly). Normally that should be some kind of average utilization, usually something like a P99 or P95 value over several days or weeks. Personally I usually look at a chart of P99, P80, and P50 (median) over 30 days and use that to decide on a value.
Limits are a different beast: they set your CPU timeslice quota. This subsystem in Linux has some persistent bugs, so unless you've specifically vetted your kernel as correct, I don't recommend using it for anything but the most hostile of programs.
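As a hedged illustration of the advice above (names and numbers are made up), a container spec that sets a CPU request derived from those percentiles and deliberately leaves the CPU limit off:

apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  containers:
  - name: app
    image: example/app:1.0        # hypothetical image
    resources:
      requests:
        cpu: 100m                 # roughly 0.1s of CPU per second of wall time
        memory: 256Mi
      # no cpu limit set, per the caveat about the CPU quota subsystem above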
In a nutshell: the main goal is to understand how much traffic a pod can handle and how many resources it consumes to do so.
CPU limits are hard to understand and can be harmful; you might want to avoid them (see the static policy documentation and the relevant GitHub issue).
To dimension your CPU requests you will first want to understand how much a pod can consume during high load. In order to do this you can (a sketch of the resulting setup follows this list):
disable all kinds of autoscaling (HPA, Vertical Pod Autoscaler, ...)
set the number of replicas to one
lift the CPU limits
request the highest amount of CPU you can get on a node (usually around 3.2 on 4-CPU nodes)
send as much traffic as you can to the application (you can build simple load-test scenarios with Locust, for example)
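A minimal sketch of such a load-test setup, assuming a Deployment running on 4-CPU nodes (all names are hypothetical):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: load-test-target
spec:
  replicas: 1                     # single replica, all autoscalers disabled
  selector:
    matchLabels:
      app: load-test-target
  template:
    metadata:
      labels:
        app: load-test-target
    spec:
      containers:
      - name: app
        image: example/app:1.0    # hypothetical image
        resources:
          requests:
            cpu: "3200m"          # close to a whole 4-CPU node
          # no cpu limit, so the pod can burst to whatever the node has left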
You will eventually end up with a ratio of clients-or-requests-per-second to CPU consumed. You can assume the relation is linear (this might not hold if your workload complexity is O(n^2) in the number of connected clients, but that is not the usual case).
You can then choose the pod resource requests based on the ratio you measured. For example, if the pod consumes 1.2 CPU at 1000 requests per second, you know that giving each pod 1 CPU lets it handle up to roughly 800 requests per second.
Once you know how much a pod can consume under its maximal load, you can start setting up CPU-based autoscaling; 70% is a good first target that can be refined if you encounter issues such as latency or pods not autoscaling fast enough. This will keep your nodes from running out of CPU when the load increases.
There are a few gotchas. For example, single-threaded applications cannot consume more than one CPU, so if you give such a pod 1.5 CPU it will run out of CPU, but you won't be able to see that in the metrics, because they suggest it could still consume another 0.5 CPU.

Kubernetes HPA Auto Scaling Velocity

We have defined HPA for an application to have min 1 and max 4 replicas with 80% cpu as the threshold.
What we wanted was, if the pod cpu goes beyond 80%, the app needs to be scaled up 1 at a time.
Instead what is happening is the application is getting scaled up to max number of replicas.
How can we define the scaling velocity so that it scales one pod at a time? And then, if one of the pods again consumes more than 80% CPU, scale up by one more pod, but not straight to the maximum number of replicas. How do we achieve this?
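For reference, the kind of HPA being described here looks roughly like this in the autoscaling/v1 form (the Deployment name is hypothetical):

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: example-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  minReplicas: 1
  maxReplicas: 4
  targetCPUUtilizationPercentage: 80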
First of all, the 80% CPU utilisation is not a threshold but a target value.
The HPA algorithm for calculating the desired number of replicas is based on the following formula:
X = N * (C/T)
Where:
X: desired number of replicas
N: current number of replicas
C: current value of the metric
T: target value for the metric
In other words, the algorithm aims at calculating a replica count that keeps the observed metric value as close as possible to the target value.
In your case, this means if the average CPU utilisation across the pods of your app is below 80%, the HPA tends to decrease the number of replicas (to make the CPU utilisation of the remaining pods go up). On the other hand, if the average CPU utilisation across the pods is above 80%, the HPA tends to increase the number of replicas, so that the CPU utilisation of the individual pods decreases.
The number of replicas that are added or removed in a single step depends on how far apart the current metric value is from the target value and on the current number of replicas. This decision is internal to the HPA algorithm and you can't directly influence it. The only contract that the HPA has with its users is to keep the metric value as close as possible to the target value.
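As a worked example with made-up numbers: with N = 4 replicas, an observed average CPU utilisation of C = 90% and a target of T = 80%, the desired count is X = 4 * (90/80) = 4.5, which the HPA rounds up to 5 replicas; at C = 40% it would compute 4 * (40/80) = 2 and scale down to 2 replicas.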
If you need a very specific autoscaling behaviour, you can write a custom controller (or operator) to autoscale your application instead of using the HPA.
This page - https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#algorithm-details - explains the algorithm the HPA uses, including the formula for calculating the number of "desired replicas".
If I recall, there were some (positive) changes to the HPA algo with v1.12.
As of today, the HPA has total control over scale-up. You can only fine-tune the scale-down operation with the following parameter.
--horizontal-pod-autoscaler-downscale-stabilization
The good news is that there is a proposal, "Configurable scale up/down velocity for HPA".
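For what it's worth, on clusters where that proposal has since landed (the behavior field of autoscaling/v2, introduced as v2beta2 around Kubernetes 1.18), the "one pod at a time" request can be sketched roughly like this, as a fragment added under the HPA spec:

  behavior:
    scaleUp:
      policies:
      - type: Pods
        value: 1              # add at most one pod
        periodSeconds: 60     # per 60-second window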

How to make Horizontal Pod Autoscaler scale down pod replicas on a percentage decrease threshold?

I am looking for the syntax/condition of a percentage-decrease threshold to be inserted in the HPA.yaml file, which would allow the Horizontal Pod Autoscaler to start decreasing the pod replicas when the CPU utilization falls by that particular percentage.
Consider this scenario:
I set the option targetCPUUtilizationPercentage to 50, minReplicas to 1 and maxReplicas to 5.
Now let's assume the CPU utilization goes above 50 and climbs to 100, making the HPA create 2 replicas. Even if the utilization then decreases to 51%, the HPA will not terminate a replica.
Is there any way to condition the scale-down on a percentage decrease in CPU utilization?
Just like targetCPUUtilizationPercentage, I would like to be able to specify something like targetCPUUtilizationPercentageDecrease and assign it the value 30, so that when the CPU utilization falls from 100% to 70%, the HPA terminates one pod replica, and after a further 30% decrease in CPU utilization (when it reaches 40%), the other remaining pod replica gets terminated.
As per online resources, this topic is still in progress in the community: "Configurable HorizontalPodAutoscaler options".
I haven't tried it, but as a workaround you can create custom metrics, e.g. using the Prometheus Adapter (see "Horizontal pod autoscaling by using custom metrics"), in order to have more control over the provided limits.
At the moment you can use the --horizontal-pod-autoscaler-downscale-stabilization option:
The value for this option is a duration that specifies how long the autoscaler has to wait before another downscale operation can be performed after the current one has completed. The default value is 5 minutes (5m0s).
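For example, to stretch that window from the default 5 minutes to 10, the flag is passed to the kube-controller-manager (how you set controller-manager flags depends on how your control plane is deployed):

kube-controller-manager --horizontal-pod-autoscaler-downscale-stabilization=10m0s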
From another point of view, this is expected given the premise of the HPA: applications that process very important data events should scale up as fast as possible (to reduce the data processing time) and scale down as soon as possible (to reduce cost).
Hope this helps.

Query on kubernetes metrics-server metrics values

I am using metrics-server (https://github.com/kubernetes-incubator/metrics-server/) to collect the core metrics from containers in a Kubernetes cluster.
I could fetch 2 resource usage metrics per container.
cpu usage
memory usage
However, it's not clear to me whether these metrics are accumulated over time or sampled over a particular time window (1 minute / 30 seconds / ...).
What are the units for the above metric values? For CPU usage, is it the number of cores or milliseconds? For memory usage I assume it's bytes.
When computing the CPU usage metric value, does metrics-server already take care of dividing the container usage by the host system usage?
Also, if I have to compare these metrics with the Docker API metrics, how do I compute the CPU usage percentage for a given container?
Thanks!
Metrics are scraped periodically from kubelets. The default resolution duration is 60s, which can be overridden with the --metric-resolution=<duration> flag.
The value and unit (cpu - cores in decimal SI, memory - bytes in binary SI) are arrived at by using the Quantity serializer in the k8s apimachinery package. You can read about it in the comments in the source code.
No, the CPU metric is not relative to the host system usage, as you can tell from the fact that it's not a percentage value. It represents the rate of change of the total CPU seconds consumed by the container, per core. If this value increases by 1 within one second, the pod consumed 1 CPU core (or 1000 millicores) in that second.
To arrive at a relative value, depending on your use case, you can divide the CPU metric for a pod by that for the node, since metrics-server exposes both /pods and /nodes endpoints.
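For a quick look at both endpoints (and at the units discussed above), the kubectl wrappers and the raw API paths are, for example:

kubectl top nodes
kubectl top pods -n <namespace>
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/<namespace>/pods

As an illustration with made-up numbers: a pod reporting 250m of CPU on a node with 4000m of allocatable CPU is using 250/4000 ≈ 6.25% of that node.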

Choosing the compute resources of the nodes in the cluster with horizontal scaling

Horizontal scaling means that we scale by adding more machines into the pool of resources. Still, there is a choice of how much power (CPU, RAM) each node in the cluster will have.
When a cluster is managed with Kubernetes it is extremely easy to set any CPU and memory limit for Pods. How do you choose the optimal CPU and memory size for the cluster nodes (or for the Pods in Kubernetes)?
For example, there are 3 nodes in a cluster with 1 vCPU and 1GB RAM each. To handle more load there are 2 options:
Add the 4th node with 1 vCPU and 1GB RAM
Add to each of the 3 nodes more power (e.g. 2 vCPU and 2GB RAM)
A straightforward solution is to calculate the throughput and cost of each option and choose the cheaper one. Are there any more advanced approaches for choosing the compute resources of the nodes in a cluster with horizontal scalability?
For this particular example I would go for 2 vCPUs per node instead of another 1-vCPU node, but that is mainly because I believe running an OS for anything serious on a single vCPU is just wrong. For the system to behave decently it needs 2+ cores available; otherwise it's too easy to overwhelm that one vCPU and grind the node into the ground. There is no ideal algorithm for this, though. It will depend on your budget, on the characteristics of your workloads, etc.
As a rule of thumb, don't stick to instances that are too small, as there is a bunch of stuff that always has to run on them regardless of their size, and the more nodes, the more overhead. 3x 4 vCPU + 16/32 GB RAM sounds like a nice plan for starters, but again... it depends on what you want, need and can afford.
The answer is related to such performance metrics as latency and throughput:
Latency is a time interval between sending request and receiving response.
Throughput is a request processing rate (requests per second).
Latency has influence on throughput: bigger latency = less throughput.
If a business transaction consists of multiple sequential calls to services that can't be parallelized, then the compute resources (CPU and memory) have to be chosen based on the desired latency value. Adding more instances of the services (horizontal scaling) will not have any positive influence on latency in this case.
Adding more instances of the service increases throughput, allowing more requests to be processed in parallel (if there are no bottlenecks).
In other words, allocate CPU and memory so that the service has the desired response time, and add more service instances (scale horizontally) to handle more requests in parallel.
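As a rough illustration with made-up numbers: if a request spends 200 ms in sequential service calls that cannot be parallelised, an instance that handles one request at a time tops out around 5 requests per second, and each request still takes 200 ms. Three instances raise the ceiling to roughly 15 requests per second, but the 200 ms latency of an individual request does not change.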