Kubernetes HPA - How to avoid scaling-up for CPU utilisation spike

HPA - How to avoid scaling-up for CPU utilization spike (not on startup)
When the business configuration is loaded for a different country, CPU load increases for about one minute, but we want to avoid scaling up during that minute.
Also: is CurrentMetricValue just the current value of the metric, or is it an average over the interval between the last poll and the current poll (--horizontal-pod-autoscaler-sync-period)?

The default HPA check interval is 30 seconds. This can be configured, as you mentioned, by changing the value of the --horizontal-pod-autoscaler-sync-period flag of the controller manager.
The Horizontal Pod Autoscaler is implemented as a control loop, with a period controlled by the controller manager’s --horizontal-pod-autoscaler-sync-period flag.
During each period, the controller manager queries the resource utilization against the metrics specified in each HorizontalPodAutoscaler definition. The controller manager obtains the metrics from either the resource metrics API (for per-pod resource metrics), or the custom metrics API (for all other metrics).
In order to change/add flags in kube-controller-manager, you need access to the /etc/kubernetes/manifests/ directory on the master node and must be able to modify the parameters in /etc/kubernetes/manifests/kube-controller-manager.yaml.
Note: you are not able to do this on GKE, EKS and other managed clusters.
What's more, I recommend increasing --horizontal-pod-autoscaler-downscale-stabilization (the replacement for --horizontal-pod-autoscaler-upscale-delay).
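For illustration, a minimal sketch of what that looks like in a kubeadm-style static pod manifest (the image tag and the flag values below are just examples, not recommendations):

```yaml
# /etc/kubernetes/manifests/kube-controller-manager.yaml (only the relevant part shown)
apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - name: kube-controller-manager
    image: registry.k8s.io/kube-controller-manager:v1.28.0      # example tag
    command:
    - kube-controller-manager
    - --horizontal-pod-autoscaler-sync-period=1m                 # poll metrics less often (example value)
    - --horizontal-pod-autoscaler-downscale-stabilization=10m    # wait longer before scaling down (example value)
    # ...keep all the other existing flags as they are
```

The kubelet watches the manifests directory, so saving the file restarts the controller manager with the new flags.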
If you're worried about long outages, I would recommend setting up a custom metric (1 if the network was down in the last ${duration}, 0 otherwise) and setting the target value of the metric to 1 (in addition to CPU-based autoscaling). This way:
If the network was down in the last ${duration}, the recommendation based on the custom metric will be equal to the current size of your deployment. The max of this recommendation and a very low CPU recommendation will be equal to the current size of the deployment, so there will be no scale-downs until connectivity is restored (plus a few minutes after that because of the scale-down stabilization window).
If the network is available, the recommendation based on the metric will be 0. Maxed with the CPU recommendation, it will be equal to the CPU recommendation and the autoscaler will operate normally.
I think this solves your issue better than limiting the size of the autoscaling step. Limiting the step size only slows down the rate at which the number of pods decreases, so a longer network outage will still result in your deployment shrinking to the minimum allowed size.
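As a rough sketch of what the combined HPA could look like (this assumes the autoscaling/v2 API and that the outage indicator is exposed through the custom metrics API as a per-pod metric; the names network_down and my-app are made up):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app                  # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app                # hypothetical Deployment
  minReplicas: 2
  maxReplicas: 20
  metrics:
  # Normal CPU-based autoscaling.
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  # Assumed custom metric: 1 while the network was down in the last ${duration}, 0 otherwise.
  - type: Pods
    pods:
      metric:
        name: network_down
      target:
        type: AverageValue
        averageValue: "1"
```

With multiple metrics the HPA computes a recommendation per metric and acts on the largest one, which is the max-of-recommendations behaviour this answer relies on.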
You can also use memory-based scaling.
Since it was not possible to create a memory-based HPA in Kubernetes at the time, a script was written to achieve the same. You can find our script here by clicking on this link:
https://github.com/powerupcloud/kubernetes-1/blob/master/memory-based-autoscaling.sh
Clone the repository :
https://github.com/powerupcloud/kubernetes-1.git
and then go to the Kubernetes directory. Execute the help command to get the instructions:
./memory-based-autoscaling.sh --help
Read more here: memory-based-autoscaling.
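For reference, newer Kubernetes versions also accept memory as a resource metric directly in the autoscaling/v2 API, so a script is no longer strictly required; a minimal sketch (the names and numbers are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-memory           # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app                # hypothetical Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70   # percent of the pods' memory requests
```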

Related

Is it possible to specify a sorting criterion within a Kubernetes horizontal pod autoscaler to define which pods are chosen to terminate?

I have some HPAs defined within a Kubernetes cluster and the scaling functionality works as expected. However, I've observed that the specific pods chosen to be scaled down seem to be picked pretty arbitrarily.
So the question is: can I define criteria somewhere to choose which pods are preferred for termination when a scale-down event happens, without explicitly making the pods scale on those criteria?
For example, I mainly care about CPU and scale such that the CPU percentage is maintained at 50% or less, but when scaling down I would prefer that older pods are terminated rather than newer ones, or that pods consuming the most memory are terminated in preference to those consuming less.
I'm aware that I can explicitly scale on multiple criteria like CPU and memory, but this can be problematic and can prevent downward scaling unnecessarily, for example when memory is allocated to a cache but CPU usage has decreased.
As per this official doc, you can add the annotation controller.kubernetes.io/pod-deletion-cost with a value in the range [-2147483647, 2147483647], and pods with a lower value will be killed first. The default is 0, so anything negative on one pod will cause that pod to be killed first during downscaling.
See this GitHub issue about the implementation of this feature: Scale down a deployment by removing specific pods (PodDeletionCost) #2255.
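For illustration, the annotation is set on individual Pods (not on the Deployment template); the pod name and image below are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app-6c9f7d-abcde            # hypothetical pod name
  annotations:
    # Pods with a lower pod-deletion-cost are preferred for removal on scale-down.
    controller.kubernetes.io/pod-deletion-cost: "-100"
spec:
  containers:
  - name: app
    image: my-app:latest               # hypothetical image
```

In practice you would typically have a small controller or cron job rank the running pods (by age, memory usage, etc.) and apply the annotation with kubectl annotate or a PATCH request, rather than editing pods by hand.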
You can also use Pod Priority and Preemption. Refer to this official doc for more information.
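If you explore the priority route, a PriorityClass is declared roughly like this (the name and value are illustrative), and pods opt in via spec.priorityClassName:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority              # hypothetical name
value: 1000
globalDefault: false
description: "Pods that can be evicted or preempted before higher-priority workloads."
```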

Is there an HPA configuration that could autoscale based on previous CPU usage?

We currently have a GKE environment with several HPAs for different deployments. All of them work just fine out of the box, but sometimes our users still experience some delay during peak hours.
Usually this delay is the time it takes the new instances to start and become ready.
What I'd like is a way to have an HPA that could predict usage and scale eagerly before it is needed.
The simplest implementation I could think of is an HPA that takes the average usage of previous days and scales up or down in advance (say, 10 minutes earlier) based on the historic usage for the current time frame.
Is there anything like that in vanilla k8s or GKE? I was unable to find anything like that in GCP's docs.
If you want to scale your applications based on events/custom metrics, you can use KEDA (Kubernetes-based Event Driven Autoscaler), which supports scaling based on GCP Stackdriver, Datadog or Prometheus metrics (and many other scalers).
What you need to do is create some queries to get the CPU usage at the moment CURRENT_TIMESTAMP - 23H50M (or the aggregated value for the last week), then define some thresholds to scale your application up or down.
If you have trouble doing this with your monitoring tool, you can create a custom metrics API that queries the monitoring API and aggregates the values (with the time shift) before sending them to the metrics-api scaler.
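A rough sketch of what that could look like with KEDA's Prometheus scaler (the deployment name, Prometheus address and query are all assumptions; the point is just the time-shifted query):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-app-predictive              # hypothetical name
spec:
  scaleTargetRef:
    name: my-app                       # hypothetical Deployment
  minReplicaCount: 2
  maxReplicaCount: 30
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090   # assumed address
      # Average CPU usage of the app at roughly this time yesterday (offset 23h50m),
      # so scaling reacts ~10 minutes before the historical peak.
      query: avg(rate(container_cpu_usage_seconds_total{pod=~"my-app-.*"}[5m] offset 23h50m))
      threshold: "0.5"                 # assumed per-replica target, in cores
```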

How Target capacity % is calculated in AWS ECS capacity provider

While creating a capacity provider in AWS ECS, we fill in a Target capacity % value, and after crossing this value our cluster scales in. I am curious how this value is calculated for the current cluster, and if I want to check the cluster's current value, where can I check it? I have not found any data on the CloudWatch side.
See this blog post and related documentation.
You can see the "Capacity Provider Reservation" value in CloudWatch Metrics under "AWS/ECS/ManagedScaling"
For ECS Capacity Providers using managed scaling you will have an Autoscaling group associated with the Capacity Provider. The Autoscaling group will have a Target Tracking scaling policy associated with it which tracks a metric (often CPU utilization but could be what ever is most appropriate for your solution).
A Target tracking autoscaling policy tracks a target value for the metric. When using ECS Capacity providers with managed scaling the Target Capacity % you configure for the Capacity Provider is used as the Target Value for the Target Tracking scaling policy.
So as an example if your Target Tracking autoscaling policy is tracking CPUUtilization and you specify a Target Tracking % of 60% then the Capacity Provider will work on a best efforts basis to keep aggregate CPU Utilization at 60%. This will result in a scale out event when the CPUUtilization is greater than 60% and a scale in event when it is less than 60%.
You can see the scaling events in the AWS CloudWatch Management console Alarms view as the scale out or scale in actions are triggered. You will be able to see the metric your Target Tracking autoscaling policy is tracking in the AWS loudWatch console Metrics view.
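On the question of how the value itself is computed: per the linked blog post, the CapacityProviderReservation metric that the Target capacity % is compared against is roughly M / N × 100, where M is the number of instances the capacity provider estimates it needs to place all tasks and N is the number of instances currently running. For example, if 10 instances are needed but only 8 are running, the metric is 125%; with a Target capacity of 100% that is above target and triggers a scale-out, while a value below the target triggers a scale-in.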

How Kubernetes computes CPU utilization for HPA?

I want to understand how HPA computes CPU utilization across Pods.
According to this doc it takes the average of CPU utilization of a pod (average across the last 1 minute) divided by the CPU requested by the pod. Then it computes the arithmetic mean of all the pods' CPU.
Unfortunately the doc contains some information that is outdated; for example, it says that --horizontal-pod-autoscaler-sync-period is set to 30 seconds by default, whereas in the official doc the default value is 15 seconds.
When I tested, I noticed that the HPA scales up even before the average CPU reaches the threshold I set (which is 90%), which made me think that maybe it takes the maximum CPU across pods and not the average.
My question is: where can I find updated documentation to understand exactly how the HPA works?
Note that I don't have a Kubernetes cluster at hand; this is a theoretical answer based on the source code of k8s.
See if this actually matches your experience.
Kubernetes is open source; here seems to be the HPA code.
The functions GetResourceReplica and calcPlainMetricReplicas (for non-utilization percentages) compute the number of replicas given the current metrics.
Both use the usageRatio returned by GetMetricUtilizationRatio; this value is multiplied by the number of currently ready pods in the ReplicaSet to get the new number of pods:
New_number_of_pods = Old_number_of_ready_pods * usageRatio
There is a tolerance check (i.e. if the usageRatio falls close enough to 1, nothing is done), and pending and unknown-state pods are ignored (considered to use 0% of the resource), while pods without metrics are considered to use 100% of the resource.
The usageRatio is computed by GetResourceUtilizationRatio, which is passed the metrics and the (resource) requests of all the pods; it goes as follows:
utilization = Total_sum_resource_usage_all_pods / Total_sum_resource_requests_all_pods
usageRatio = utilization * 100 / targetUtilization
Where targetUtilization comes from the HPA spec.
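A small worked example of the formulas above (the numbers are made up): suppose 4 ready pods each request 500m of CPU (2000m in total) and together currently use 1800m. Then utilization = 1800 / 2000 = 90%; with a target of 50%, usageRatio = 90 / 50 = 1.8, and the new number of pods is 4 × 1.8 = 7.2, which the controller rounds up to 8.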
The code is easier to read than this summary of mine; in this context, the term request means "resource request" (that's an educated guess).
So I'd say that the 90% is the resource usage across all pods, computed as if they were all a single pod requesting the sum of each pod's requests and collecting the metrics as if they were all running on a single dedicated node.
According to https://github.com/kubernetes/kubernetes/issues/78988#issuecomment-502106361 this is configuration dependent and an issue of the metrics server and the kubelet reporting; the HPA should rather only be using the information from:
https://kubernetes.io/docs/tasks/debug-application-cluster/resource-metrics-pipeline/#cpu
I think the duration should be defined by the kubelet's --housekeeping-interval, which defaults to 10 seconds.

Is there any tool for GKE nodes autoscaling base on total pods requested in kubernetes?

When I resize a replication controller using kubectl, if the cluster does not have enough resources, one or more pods will stay in Pending.
Is there any tool that will automatically resize a GKE cluster when resources are running out?
I had a similar requirement (for the Go build system): I wanted to know when the ratio of scheduled to available CPU or memory was > 1, and to scale out nodes when that was true (or, more accurately, when it was around 0.8). There's no built-in metric for this, but as you suggest you can do it with a custom metric.
This was all done in Go, but it will give you the basic idea:
Create the metrics (memory and CPU, in my case)
Put values to the metrics
The key takeaway IMO is that you have to iterate over each pod in the cluster to determine how much capacity is consumed, then iterate over each node in the cluster to determine how much capacity is available. It's then just a matter of pointing your autoscaler to the custom metric(s).
Big big big thing worth noting: I ultimately determined that scaling on the built-in CPU utilization metric was just as good as (if not better than, but more on that in a bit) the custom metric. Each pod we scheduled pegged the CPU, so when pods were maxed out so was CPU. The built-in CPU utilization metric is probably better because you don't have the latency that comes with periodically putting custom metrics.
You can turn on autoscaling for the Instance Group that your GKE nodes belong to.