Horizontal Pod Autoscaler: scale only if CPU load remains high for a given (5 min) duration - kubernetes

I have a k8s cluster deployed in AWS EKS, running Kubernetes 1.14.
We want the Horizontal Pod Autoscaler to scale up only if the CPU load remains high for a given duration (5 min).
That is, we want to make the scaling decision after 4-5 minutes, and only if the load stayed high for that whole period.
If the load drops after 3-4 minutes, it should not scale up. Currently we cannot find any way to do that.
horizontal-pod-autoscaler-upscale-delay is deprecated.
So we are looking for a parameter that lets us set the CPU usage duration for the HPA.

horizontal-pod-autoscaler-upscale-delay has been removed, though it might still work on your version. You can add it to the kube-controller-manager arguments and check.

Related

Can I set the pod to use max request CPU from the beginning?

I am using OpenShift 4 with a CPU request of 0.2 and a limit of 0.4.
From the monitoring, I can see that CPU usage starts at 0.1 and increases gradually. Is that because there is a mechanism to prevent over-reserving CPU?
Can I set up the pod to use the full requested CPU from the beginning, and ramp up to the limit as fast as possible?
The max limit is already available from the beginning (presuming that the node has the CPU available to give). OCP is using CFS to enforce that limit, and CFS doesn't have anything that gradually kicks in, CFS only has one thing it considers: the configured limit.
As for why you are seeing this in your monitoring, I'm not sure. But my first guess would be that that graph is using a moving average. (And thus, since it's a moving average it will converge towards the actual usage.)
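To illustrate the moving-average guess, here is a minimal sketch. The usage trace and the 4-sample window are made-up values, not taken from any real OpenShift dashboard. It shows how a process that jumps straight to its limit still appears to ramp up gradually once the graph is smoothed.

```python
from collections import deque


def moving_average(samples, window):
    """Return the trailing moving average of `samples` over `window` points."""
    buf = deque(maxlen=window)  # keeps only the last `window` samples
    out = []
    for s in samples:
        buf.append(s)
        out.append(sum(buf) / len(buf))
    return out


# Hypothetical usage trace in cores: idle, then an immediate jump to the 0.4 limit.
actual = [0.0] * 4 + [0.4] * 6
smoothed = moving_average(actual, window=4)

for a, s in zip(actual, smoothed):
    print(f"actual={a:.1f}  graphed={s:.2f}")
```

The graphed value only reaches the real usage once the whole window is filled with samples taken after the jump.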

Scaling Kafka streams application using kubernetes horizontal pod scaling

We have a Kafka Streams application with 3 pods. Scaling the application is a heavy operation for us (because of its large state), so I would like to add pods only if it is absolutely necessary. For example, only if the application's utilization stays above a threshold for, let's say, 10 minutes.
Again, I don't need to scale my application up or down for a sudden burst (a few seconds) of messages.
I am looking for a configuration like:
window : 15 mins
average cpu : 1000 milli
So, I would like to scale the application if the average CPU over a 15-minute window is greater than 1000 milli.
You can take a look into HPA policies.
There is stabilizationWindowSeconds:
StabilizationWindowSeconds is the number of seconds for which past
recommendations should be considered while scaling up or scaling down.
StabilizationWindowSeconds must be greater than or equal to zero and
less than or equal to 3600 (one hour). If not set, use the default
values: - For scale up: 0 (i.e. no stabilization is done)
The target average CPU utilization can be set in the metric target object, under averageUtilization.
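As a rough sketch of what such a configuration could look like, the snippet below builds an HPA manifest as a Python dict and prints it as JSON (which kubectl also accepts). The names streams-app, minReplicas 3 and maxReplicas 6 are placeholders, and the autoscaling/v2 API version is an assumption: older clusters expose the behavior field under a beta version, and very old ones not at all. Note that the stabilization window works on past recommendations, so it approximates the "average over 15 minutes" request without being the same thing.

```python
import json


def build_hpa(name, target_deployment, window_seconds, avg_cpu):
    """Build a HorizontalPodAutoscaler manifest as a plain dict."""
    return {
        "apiVersion": "autoscaling/v2",  # assumption: cluster serves this version
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": target_deployment,
            },
            "minReplicas": 3,  # placeholder
            "maxReplicas": 6,  # placeholder
            "metrics": [{
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    # Average CPU per pod, as an absolute value.
                    "target": {"type": "AverageValue", "averageValue": avg_cpu},
                },
            }],
            "behavior": {
                # Past recommendations from this window are considered
                # before a scale-up is applied.
                "scaleUp": {"stabilizationWindowSeconds": window_seconds},
            },
        },
    }


# 15 minutes = 900 seconds, within the documented 0..3600 range.
manifest = build_hpa("streams-app", "streams-app", 900, "1000m")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied like any other manifest.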

Set cpu requests in K8s for fluctuating load

I have a service deployed in Kubernetes and I am trying to optimize the requested cpu resources.
For now, I have deployed 10 instances and set spec.containers[].resources.limits.cpu to 0.1, based on the "average" use. However, it became obvious that this average is rather useless in practice, because under sustained load the CPU usage increases significantly (to 0.3-0.4 as far as I can tell).
What happens consequently, when multiple instances are deployed on the same node, is that this node is heavily overloaded; pods are no longer responsive, are killed and restarted etc.
What is the best practice to find a good value? My current best guess is to increase the requested cpu to 0.3 or 0.4; I'm looking at Grafana visualizations and see that the pods on the heavily loaded node(s) converge there under continuous load.
However, how can I know whether they would use even more CPU if they could, before they become unresponsive because the node is overloaded?
I'm actually trying to understand how to approach this in general. I would expect an "ideal" service (presuming it is CPU-focused) to use close to 0.0 when there is no load, and close to 1.0 when requests are constantly coming in. With that assumption, should I set the cpu.requests to 1.0, taking a perspective where actual constant usage is assumed?
I have read some Kubernetes best practice guides, but none of them seem to address how to set the actual value for cpu requests in practice in more depth than "find an average".
Basically come up with a number that is your lower acceptable bound for how much the process runs. Setting a request of 100m means that you are okay with a lower limit of your process running 0.1 seconds for every 1 second of wall time (roughly). Normally that should be some kind of average utilization, usually something like a P99 or P95 value over several days or weeks. Personally I usually look at a chart of P99, P80, and P50 (median) over 30 days and use that to decide on a value.
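A minimal sketch of that percentile check, assuming you have exported per-pod CPU usage samples (in cores) from your monitoring system. The sample values below are invented, and the nearest-rank method is only one of several percentile definitions.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def to_millicores(cores):
    """Format a CPU value in cores as a Kubernetes millicore string."""
    return f"{round(cores * 1000)}m"


# Hypothetical CPU usage samples in cores, e.g. exported from monitoring.
usage = [0.05] * 10 + [0.10] * 6 + [0.30] * 3 + [0.40]

for p in (50, 80, 99):
    print(f"P{p}: {to_millicores(percentile(usage, p))}")
```

Comparing the three values shows how spiky the workload is: a P99 far above the median suggests that an average-based request will be too low under load.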
Limits are a different beast, they are setting your CPU timeslice quota. This subsystem in Linux has some persistent bugs so unless you've specifically vetted your kernel as correct, I don't recommend using it for anything but the most hostile of programs.
In a nutshell: the main goal is to understand how much traffic a pod can handle and how much resource it consumes to do so.
CPU limits are hard to understand and can be harmful. You might want to avoid them; see the static policy documentation and the relevant GitHub issue.
To size your CPU requests, you first need to understand how much CPU a pod can consume under high load. To do this, you can:
disable all kind of autoscaling (HPA, vertical pod autoscaler, ...)
set the number of replicas to one
lift the CPU limits
request the highest amount of CPU you can on a node (3.2 usually on 4cpu nodes)
send as much traffic as you can on the application (you can achieve simple Load Tests scenarios with locust for example)
You will eventually end up with a ratio clients-or-requests-per-sec/cpu-consumed. You can suppose the relation is linear (this might not be true if your workload complexity is O(n^2) with n the number of clients connected, but this is not the nominal case).
You can then choose the pod resource requests based on the ratio you measured. For example, if you consume 1.2 CPU for 1000 requests per second, you know that a pod given 1 CPU will handle roughly 800 requests per second (1000 / 1.2 ≈ 833).
Once you know how much a pod can consume under its maximal load, you can start setting up CPU-based autoscaling. 70% is a good first target that can be refined if you encounter issues like latency or pods not autoscaling fast enough. This keeps your nodes from running out of CPU if the load increases.
There are a few gotchas. For example, single-threaded applications cannot consume more than one CPU. If you give such an application 1.5 CPU, it will saturate at 1 CPU, but you won't see this in the metrics, because it will look as if it could still consume another 0.5 CPU.
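The sizing arithmetic can be sketched as follows, assuming the linear relation between traffic and CPU described above. The 5000 requests per second peak is an invented figure for illustration.

```python
import math


def requests_per_core(measured_rps, measured_cpu):
    """Throughput per CPU core, from a single-replica load test."""
    return measured_rps / measured_cpu


def pod_capacity(cpu_request, rps_per_core):
    """Requests per second one pod can serve with the given CPU request."""
    return cpu_request * rps_per_core


def replicas_needed(peak_rps, per_pod_rps, target_utilization=0.7):
    """Replicas needed so that each pod runs at the target utilization at peak."""
    return math.ceil(peak_rps / (per_pod_rps * target_utilization))


per_core = requests_per_core(measured_rps=1000, measured_cpu=1.2)
per_pod = pod_capacity(cpu_request=1.0, rps_per_core=per_core)
replicas = replicas_needed(peak_rps=5000, per_pod_rps=per_pod)

print(f"{per_core:.0f} req/s per core, {per_pod:.0f} req/s per pod, "
      f"{replicas} replicas for the assumed peak")
```

If the load test shows the relation is not linear, the ratio should be measured at several traffic levels instead of extrapolated from one.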

Scheduled scaling for PODs in Kubernetes

I have a scaled deployment whose load changes predictably with the time of day. How can I prepare my deployment for the load (for example, I want to double the number of pods every evening from 16:00 to 23:00)? Does Kubernetes provide such a tool?
I know Kubernetes pods can be scaled with the Horizontal Pod Autoscaler, which scales the number of pods based on CPU utilisation or a custom metric. But that is a reactive approach; I'm looking for a proactive one.
A quick google search would direct you here: https://github.com/kubernetes/kubernetes/issues/49931
In essence the best solution as of now, is to either run a sidecar container for your pod's main container, which could use the kubernetes api to scale itself up based on a time period with a simple bash script, or write a CRD yourself that reacts to time based events (it is 6pm), something like this one:
https://github.com/amelbakry/kube-schedule-scaler
which watches annotations with cron-like specs on deployments and reacts accordingly.
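The time-based decision such a scaler has to make can be sketched in a few lines. This is only the scheduling logic under the assumptions of the question (double the pods between 16:00 and 23:00); the function name and defaults are hypothetical, and it deliberately does not talk to the Kubernetes API. A real scaler would apply the result, for example through kubectl scale or a client library.

```python
from datetime import datetime, time


def scheduled_replicas(now, base_replicas, start=time(16, 0), end=time(23, 0), factor=2):
    """Return base_replicas * factor while `now` falls in [start, end),
    and base_replicas otherwise."""
    if start <= now.time() < end:
        return base_replicas * factor
    return base_replicas


# Example: a deployment that normally runs 4 pods.
print(scheduled_replicas(datetime(2024, 1, 1, 17, 30), base_replicas=4))
print(scheduled_replicas(datetime(2024, 1, 1, 9, 0), base_replicas=4))
```

This simple comparison assumes the window does not cross midnight and ignores time zones; both would need handling in practice.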
If you are looking for a more advanced autoscaler, you can give KEDA (keda.sh) a try. It supports cron-based scaling up and down.
It also supports other event-driven autoscaling. For example, I have used it to scale on a consumer group's lag for a particular Apache Kafka topic.
Multiple event sources are supported; see the KEDA documentation.
The Horizontal Pod Autoscaler of Kubernetes is not a reactive approach; it is in fact a proactive scaling approach. Let me explain its algorithm using its default settings:
The cooldown time is 5 minutes
Resource utilization is traced every 15 seconds
This means that the system samples resource utilization (depending on which metrics the end users set, e.g. CPU, storage, etc.) every 15 seconds.
After every 5 minutes of cooling down (no scaling actions), the controller calculates the resource utilization over the past 5 minutes (using the historical 15-second samples above). It then estimates the amount of resources (i.e., the number of replicas) required for the next 5-minute window with the equation:
desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]
Other proactive autoscalers work in a similar manner. The difference is that they may apply other techniques (queueing theory, machine learning, or time series models) to estimate desiredReplicas in place of the equation above.
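The equation above can be tried out with a small sketch. The tolerance parameter is an addition not mentioned in this answer: the Kubernetes documentation describes a globally configurable tolerance (0.1 by default) within which no scaling happens. The metric values in the example are invented.

```python
import math


def desired_replicas(current_replicas, current_metric, desired_metric, tolerance=0.1):
    """desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue),
    skipping any change when the ratio is within `tolerance` of 1.0."""
    ratio = current_metric / desired_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to the target: no scaling
    return math.ceil(current_replicas * ratio)


# 4 replicas, target 60% CPU utilization.
print(desired_replicas(4, current_metric=90, desired_metric=60))  # above target
print(desired_replicas(4, current_metric=63, desired_metric=60))  # within tolerance
print(desired_replicas(4, current_metric=30, desired_metric=60))  # below target
```

Because the result is driven only by the current metric value, any prediction of future load has to come from the metric itself or from an external scaler.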

How does the Kubernetes HPA behave with 2 or more metrics, especially when calculating the number of replicas?

We have configured to use 2 metrics for HPA
CPU Utilization
App specific custom metrics
When testing, we observed scaling happening, but how the number of replicas is calculated is not very clear. I am not able to locate any documentation on this.
Questions:
Can someone point to documentation or code on the calculation part?
Is it a good practice to use multiple metrics for scaling?
Thanks in Advance!
From https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#how-does-the-horizontal-pod-autoscaler-work
If multiple metrics are specified in a HorizontalPodAutoscaler, this calculation is done for each metric, and then the largest of the desired replica counts is chosen. If any of those metrics cannot be converted into a desired replica count (e.g. due to an error fetching the metrics from the metrics APIs), scaling is skipped.
Finally, just before HPA scales the target, the scale recommendation is recorded. The controller considers all recommendations within a configurable window choosing the highest recommendation from within that window. This value can be configured using the --horizontal-pod-autoscaler-downscale-stabilization-window flag, which defaults to 5 minutes. This means that scaledowns will occur gradually, smoothing out the impact of rapidly fluctuating metric values
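The two quoted rules can be modelled in a few lines. This is a simplified sketch based only on the description above, not the controller's actual code; the metric values and the recommendation history are invented.

```python
import math


def desired_for_metric(current_replicas, current_value, target_value):
    """Replica count proposed by a single metric."""
    return math.ceil(current_replicas * current_value / target_value)


def combine_metrics(current_replicas, metrics):
    """metrics: list of (current_value, target_value) pairs.
    The largest proposed replica count wins."""
    return max(desired_for_metric(current_replicas, cur, tgt) for cur, tgt in metrics)


def stabilized_scale_down(history, now, window_seconds=300):
    """history: list of (timestamp_seconds, recommendation) pairs.
    Return the highest recommendation recorded within the window."""
    recent = [rec for ts, rec in history if now - ts <= window_seconds]
    if not recent:
        raise ValueError("no recommendations inside the window")
    return max(recent)


# 4 replicas; CPU at 80% against a 50% target, custom metric at 120 against 100.
print(combine_metrics(4, [(80, 50), (120, 100)]))

# Recommendations recorded over time: load drops, but the earlier peak
# keeps the replica count up until it leaves the 5-minute window.
history = [(0, 7), (60, 6), (120, 3), (180, 3)]
print(stabilized_scale_down(history, now=180))
print(stabilized_scale_down(history, now=400))
```

With several metrics configured, the most demanding one effectively drives scale-up, while scale-down only happens once every recommendation in the window has dropped.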