Kubernetes HPA (with custom metrics) scaling policies - kubernetes

Starting from Kubernetes v1.18 the v2beta2 API allows scaling behavior to be configured through the Horizontal Pod Autoscalar (HPA) behavior field. I'm planning to apply HPA with custom metrics to a StatefulSet.
The use case I'm looking at is scaling out using a custom metric (e.g. number of user sessions on my application), but the HPA will not scale down at all. This use case is also described by K8s SIG-Autoscaling enhancements - "Configurable scale velocity for HPA >> Story 4: Scale Up As Usual, Do Not Scale Down".
behavior:
scaleDown:
policies:
- type: pods
value: 0
The user sessions could stay active for minutes to hours. Starting with 1 replica of the StatefulSet, as the number of user sessions hit an upper limit (exposed using Prometheus collector and later configured using HPA custom metric option), the application pods will scale-out. The new pods will start serving new users.
Since this is a StatefulSet and cannot just abruptly scale down, I'm seeking help on ways to scale down when the user sessions on the new replicas go down to 0. The above link says that the scale down can be controlled by a separate process. Not sure how to do this? Looking for some pointers.
Thanks.

You can use periodSeconds and stabilizationWindowSeconds values to manage how much time will pass between termination of pods, for example:
behavior:
scaleDown:
stabilizationWindowSeconds: 10
policies:
- type: Pods
value: 1
periodSeconds: 20
This way it will scale down 1 pod every ~30 seconds (or whatever value will be used in periodSeconds and stabilizationWindowSeconds). Time may vary depending on stabilizationWindowSeconds values over time.
periodSeconds describes how much time will pass between termination of each pod, maximum value is 1800 second (30 minutes).
stabilizationWindowSeconds when metrics indicate that target should be scaled down, this algorithm takes a look into previously calculated desired states and uses highest value from specified interval. For scale down default value is 300, maximum value is 3600 (one hour).

Related

Scheduled scaling for PODs in Kubernetes

I have a scaled deployment with predictable load change depends on time. How can I make my deployment prepared to the load (for example, I want to double pods number every evening from 16:00 to 23:00). Does Kubernetes provides such tool?
I know Kubernetes pods are scaling with Horizontal Pod Autoscaler, which scales the number of pods based on CPU utilisation or custom metric. But it is reactive approach, I'm looking for proactive.
A quick google search would direct you here: https://github.com/kubernetes/kubernetes/issues/49931
In essence the best solution as of now, is to either run a sidecar container for your pod's main container, which could use the kubernetes api to scale itself up based on a time period with a simple bash script, or write a CRD yourself that reacts to time based events (it is 6pm), something like this one:
https://github.com/amelbakry/kube-schedule-scaler
which watches annotations with cron-like specs on deployments and reacts accordingly.
If you are looking for a more advanced Auto scaler then you can give Keda Keda.sh a try. It has the support for cron based auto scale up & down.
Plus it also support some other event driven based auto scaling like what I have done based on Consumer group's lag in Apache Kafka for particular topic.
There are multiple event source supported, check it out here
Horizontal Pod Autoscaler of Kubernetes is not a re-active approach, but in fact it is a proactive scaling approach. Let I explain its algorithm using its default setting:
The cool time is 5 minutes
Resource utilization tracing for every 15 seconds
It means that the system traces resource utilization (depend on what metrics the end-users set, e.g., CPU, storage...etc.) for every 15 seconds.
Until every 5 minutes of cooling down (no scaling actions), the controller will calculate the resource utilization in the past 5 minutes (which uses the historical data traces in every 15 seconds above). Then it estimates number of resource (i.e., number of replicas) requires for the next 5-min time window by the equation:
desiredReplicas = ceil[currentReplicas * ( currentMetricValue /
desiredMetricValue )]
Other pro-active auto-scaler also works in the similar manner. Different points is that they may apply different techniques (queue theory, machine learning, or time series model) to estimate desiredReplicas as what done in the above equation.

Kubernetes HPA Auto Scaling Velocity

We have defined HPA for an application to have min 1 and max 4 replicas with 80% cpu as the threshold.
What we wanted was, if the pod cpu goes beyond 80%, the app needs to be scaled up 1 at a time.
Instead what is happening is the application is getting scaled up to max number of replicas.
How can we define the scale velocity to scale 1 pod at a time. And again if one of the pod consumes more than 80% cpu then scale one more pod up but not maximum replicas.
Let me know how do we achieve this.
First of all, the 80% CPU utilisation is not a threshold but a target value.
The HPA algorithm for calculating the desired number of replicas is based on the following formula:
X = N * (C/T)
Where:
X: desired number of replicas
N: current number of replicas
C: current value of the metric
T: target value for the metric
In other words, the algorithm aims at calculating a replica count that keeps the observed metric value as close as possible to the target value.
In your case, this means if the average CPU utilisation across the pods of your app is below 80%, the HPA tends to decrease the number of replicas (to make the CPU utilisation of the remaining pods go up). On the other hand, if the average CPU utilisation across the pods is above 80%, the HPA tends to increase the number of replicas, so that the CPU utilisation of the individual pods decreases.
The number of replicas that are added or removed in a single step depends on how far apart the current metric value is from the target value and on the current number of replicas. This decision is internal to the HPA algorithm and you can't directly influence it. The only contract that the HPA has with its users is to keep the metric value as close as possible to the target value.
If you need a very specific autoscaling behaviour, you can write a custom controller (or operator) to autoscale your application instead of using the HPA.
This - https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#algorithm-details - expains the algorithm HPA uses, including the formula to calculate the number of "desired replicas".
If I recall, there were some (positive) changes to the HPA algo with v1.12.
HPA has total control on scale up as of today. You can only fine tune scale down operation with the following parameter.
--horizontal-pod-autoscaler-downscale-stabilization
The good news is that there is a proposal for Configurable scale up/down velocity for HPA

How to make Horizontal Pod Autoscaler scale down pod replicas on a percentage decrease threshold?

I am looking for a syntax/condition of percentage decrease threshold to be inserted in HPA.yaml file which would allow the Horizontal Pod Autoscaler to start decreasing the pod replicas when the CPU utilization falls that particular percentage threshold.
Consider this scenario:-
I mentioned an option targetCPUUtilizationPercentage and assigned it with value 50. minReplicas to be 1 and MaxReplicas to be 5.
Now lets assume the CPU utilization went above 50, and went till 100, making the HPA to create 2 replicas. If the utilization decreases to 51% also, HPA will not terminate 1 pod replica.
Is there any way to conditionize the scale down on the basis of % decrease in CPU utilization?
Just like targetCPUUtilizationPercentage, I could be able to mention targetCPUUtilizationPercentageDecrease and assign it value 30, so that when the CPU utilization falls from 100% to 70%, HPA terminates a pod replica and further 30% decrease in CPU utilization, so that when it reaches 40%, the other remaining pod replica gets terminated.
As per on-line resources, this topic is still under community progress "Configurable HorizontalPodAutoscaler options"
I didn't try but as workaround you can try to create custom metrics f.e. using Prometheus Adapter, Horizontal pod auto scaling by using custom metrics
in order to have more control about provided limits.
At the moment you can use horizontal-pod-autoscaler-downscale-stabilization:
--horizontal-pod-autoscaler-downscale-stabilization option to control
The value for this option is a duration that specifies how long the autoscaler has to wait before another downscale operation can be performed after the current one has completed. The default value is 5 minutes (5m0s).
On the other point of view this is expected due to the basis of HPA:
Applications that process very important data events. These should scale up as fast as possible (to reduce the data processing time), and scale down as soon as possible (to reduce cost).
Hope this help.

Why Kubernetes HPA convert custom metric?

Kubernetes Horisontal Pod Autoscaling (HPA) modifies my custom metric: StackDriver displays correct metric, but HPA shows another number.
For example, StackDrives value is 118K, but HPA displays 1656144.
I understand that HPA use some conversation for floating numbers, but my metric is integer: Unit: number Kind: Gauge Value type: Int64.
Running in GKE 1.11.7.
Any ideas?
if you specify targetValue it will be a whole number, so there won't be scaling down of pods.
If you use targetAverageValue it will calculate based on the number of pods created.
In your HPA manifest you did not specify value of --horizontal-pod-autoscaler-sync-period flag. As default it is set to 15 seconds.
In your case it means that HPA value is amout of whole deployment queue in last 15 seconds. More information can be found in HPA Documentation.
As you mentioned in StackDriver you used GAUGE metric which measures a value at a particular point in time - Stackdriver
In short, StackDriver shows current value in the exact time, HPA values is amount of last 15 seconds.

How does open-faas deployed on kubernetes determine when to scale a function up or down?

In Kubernetes, I am a little unclear of what criteria needs to be met for open-faas to scale a function's replicas up or down.
According to the documentation:
Auto-scaling in OpenFaaS allows a function to scale up or down depending on demand represented by different metrics.
It sounds like, by default, a reason for scaling would be requests/second increasing/decreasing.
OpenFaaS ships with a single auto-scaling rule defined in the mounted configuration file for AlertManager. AlertManager reads usage (requests per second) metrics from Prometheus in order to know when to fire an alert to the API Gateway.
And this "alert" sent to the API Gateway would cause a function's replica count to scale up.
I don't see in the documentation, or the AlertManager, where the threshold for requests/second is set to scale up/down at.
My overall questions:
What is the default threshold of requests/second that would cause a scale up?
Is this threshold configurable? If so, how?