HPA creates more pods than expected - kubernetes

I created an HPA on our k8s cluster that should auto-scale at 90% memory utilization. However, it scales up without hitting the target percentage. I use the following config:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  namespace: {{ .Values.namespace }}
  name: {{ include "helm-generic.fullname" . }}
  labels:
    {{- include "helm-generic.labels" . | nindent 4 }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ include "helm-generic.fullname" . }}
  minReplicas: 1
  maxReplicas: 2
  metrics:
  - type: Resource
    resource:
      name: memory
      targetAverageUtilization: 90
So with this config it creates 2 pods, which is the maxReplicas number. If I set maxReplicas to 4 it creates 3.
This is what I get from kubectl describe hpa:
$ kubectl describe hpa -n trunkline
Name: test-v1
Namespace: trunkline
Labels: app.kubernetes.io/instance=test-v1
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=helm-generic
app.kubernetes.io/version=0.0.0
helm.sh/chart=helm-generic-0.1.3
Annotations: meta.helm.sh/release-name: test-v1
meta.helm.sh/release-namespace: trunkline
CreationTimestamp: Wed, 12 Oct 2022 17:36:54 +0300
Reference: Deployment/test-v1
Metrics: ( current / target )
resource memory on pods (as a percentage of request): 59% (402806784) / 90%
resource cpu on pods (as a percentage of request): 11% (60m) / 80%
Min replicas: 1
Max replicas: 2
Deployment pods: 2 current / 2 desired
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True ReadyForNewScale recommended size matches current size
ScalingActive True ValidMetricFound the HPA was able to successfully calculate a replica count from memory resource utilization (percentage of request)
ScalingLimited False DesiredWithinRange the desired count is within the acceptable range
Events: <none>
As you can see, the pods' memory utilization is 59% against a target of 90%, which I expect to produce only 1 pod.

This is working as intended.
targetAverageUtilization is a target for the average utilization across all matching Pods.
The idea of HPA is:
scale up? We have 2 Pods and the average memory utilization is only 59%, which is under 90%, so there is no need to scale up.
scale down? 59% is the average for 2 Pods under the current load, so if there were only one Pod taking all the load it would rise to roughly 59% * 2 = 118% utilization, which is over 90% and would force a scale-up again, so the HPA does not scale down.

The horizontal pod autoscaler has a very specific formula for calculating the target replica count:
desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]
With the output you show, currentMetricValue is 59% and desiredMetricValue is 90%. Multiplying that by the currentReplicas of 2, you get about 1.3 replicas, which gets rounded up to 2.
This formula, and especially the ceil() round-up behavior, can make HPA very slow to scale down, especially with a small number of replicas.
More broadly, autoscaling on Kubernetes-observable memory might not work the way you expect. Most programming languages are garbage-collected (C, C++, and Rust are the most notable exceptions), and garbage collectors as a rule tend to allocate a large block of operating-system memory and reuse it, rather than return it to the operating system when load decreases. If you have a pod that reaches 90% memory from the Kubernetes point of view, it's possible that memory usage will never decrease. You might need to autoscale on a different metric, or attach an external metrics system like Prometheus to get more detailed memory-manager statistics you can act on.
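If you do go the custom-metrics route, a minimal sketch of what that could look like is below, assuming a cluster where autoscaling/v2beta2 is available and a metrics adapter such as prometheus-adapter is installed; the per-Pod metric name jvm_heap_used_ratio and its target value are hypothetical, standing in for whatever memory-manager gauge your application actually exposes:
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: test-v1
  namespace: trunkline
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: test-v1
  minReplicas: 1
  maxReplicas: 2
  metrics:
  - type: Pods
    pods:
      metric:
        name: jvm_heap_used_ratio   # hypothetical per-Pod gauge served by the metrics adapter
      target:
        type: AverageValue
        averageValue: 900m          # target an average heap-usage ratio of 0.9 across Pods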

Related

How to implement Kubernetes horizontal pod autoscaling with scale up/down policies?

Kubernetes v1.19 in AWS EKS
I'm trying to implement horizontal pod autoscaling in my EKS cluster, and am trying to mimic what we do now with ECS. With ECS, we do something similar to the following:
scale up when CPU >= 90% after 3 consecutive 1-min periods of sampling
scale down when CPU <= 60% after 5 consecutive 1-min periods of sampling
scale up when memory >= 85% after 3 consecutive 1-min periods of sampling
scale down when memory <= 70% after 5 consecutive 1-min periods of sampling
I'm trying to use the HorizontalPodAutoscaler kind, and helm create gives me this template. (Note I modified it to suit my needs, but the metrics stanza remains.)
{{- if .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: {{ include "microserviceChart.Name" . }}
  labels:
    {{- include "microserviceChart.Name" . | nindent 4 }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ include "microserviceChart.Name" . }}
  minReplicas: {{ include "microserviceChart.minReplicas" . }}
  maxReplicas: {{ include "microserviceChart.maxReplicas" . }}
  metrics:
    {{- if .Values.autoscaling.targetCPUUtilizationPercentage }}
    - type: Resource
      resource:
        name: cpu
        targetAverageUtilization: {{ .Values.autoscaling.targetCPUUtilizationPercentage }}
    {{- end }}
    {{- if .Values.autoscaling.targetMemoryUtilizationPercentage }}
    - type: Resource
      resource:
        name: memory
        targetAverageUtilization: {{ .Values.autoscaling.targetMemoryUtilizationPercentage }}
    {{- end }}
{{- end }}
However, how do I fit the scale up/down behavior shown in Horizontal Pod Autoscaling into the above template, to match the behavior that I want?
The Horizontal Pod Autoscaler automatically scales the number of Pods in a replication controller, deployment, replica set or stateful set based on observed metrics (like CPU or Memory).
There is an official walkthrough focusing on HPA and its scaling:
Kubernetes.io: Docs: Tasks: Run application: Horizontal pod autoscale: Walkthrough
The algorithm that scales the number of replicas is the following:
desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]
An example of (already rendered) autoscaling can be implemented with a YAML manifest like the one below:
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: HPA-NAME
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: DEPLOYMENT-NAME
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 75
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 75
A side note!
HPA will calculate both metrics and choose the one that yields the bigger desiredReplicas!
Addressing a comment I wrote under the question:
I think we misunderstood each other. It's perfectly okay to "scale up when CPU >= 90" but due to the logic behind the formula I don't think it will be possible to say "scale down when CPU <= 70". According to the formula it would be something like: scale up when CPU >= 90 and scale down when CPU <= 45.
This example could be misleading and is not 100% true in all scenarios. Take a look at the following example:
HPA set to averageUtilization of 75%.
Quick calculations with some degree of approximation (default tolerance for HPA is 0.1):
2 replicas:
  scale-up (by 1) should happen when currentMetricValue is >= 80%:
    x = ceil[2 * (80/75)], x = ceil[2.1(3)], x = 3
  scale-down (by 1) should happen when currentMetricValue is <= 33%:
    x = ceil[2 * (33/75)], x = ceil[0.88], x = 1
8 replicas:
  scale-up (by 1) should happen when currentMetricValue is >= 76%:
    x = ceil[8 * (76/75)], x = ceil[8.10(6)], x = 9
  scale-down (by 1) should happen when currentMetricValue is <= 64%:
    x = ceil[8 * (64/75)], x = ceil[6.82(6)], x = 7
Following this example, having 8 replicas with their currentMetricValue at 55 (desiredMetricValue set to 75) should scale-down to 6 replicas.
More information describing the decision-making of the HPA (for example, why it doesn't scale) can be found by running:
$ kubectl describe hpa HPA-NAME
Name: nginx-scaler
Namespace: default
Labels: <none>
Annotations: <none>
CreationTimestamp: Sun, 07 Mar 2021 22:48:58 +0100
Reference: Deployment/nginx-scaling
Metrics: ( current / target )
resource memory on pods (as a percentage of request): 5% (61903667200m) / 75%
resource cpu on pods (as a percentage of request): 79% (199m) / 75%
Min replicas: 1
Max replicas: 10
Deployment pods: 5 current / 5 desired
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True ReadyForNewScale recommended size matches current size
ScalingActive True ValidMetricFound the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
ScalingLimited False DesiredWithinRange the desired count is within the acceptable range
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedGetResourceMetric 4m48s (x4 over 5m3s) horizontal-pod-autoscaler did not receive metrics for any ready pods
Normal SuccessfulRescale 103s horizontal-pod-autoscaler New size: 2; reason: cpu resource utilization (percentage of request) above target
Normal SuccessfulRescale 71s horizontal-pod-autoscaler New size: 4; reason: cpu resource utilization (percentage of request) above target
Normal SuccessfulRescale 71s horizontal-pod-autoscaler New size: 5; reason: cpu resource utilization (percentage of request) above target
HPA scaling procedures can be modified by the changes introduced in Kubernetes 1.18 and newer, which added:
Support for configurable scaling behavior
Starting from v1.18 the v2beta2 API allows scaling behavior to be configured through the HPA behavior field. Behaviors are specified separately for scaling up and down in scaleUp or scaleDown section under the behavior field. A stabilization window can be specified for both directions which prevents the flapping of the number of the replicas in the scaling target. Similarly specifying scaling policies controls the rate of change of replicas while scaling.
Kubernetes.io: Docs: Tasks: Run application: Horizontal pod autoscale: Support for configurable scaling behavior
I reckon you could use the newly introduced fields like behavior and stabilizationWindowSeconds to tune your workload to your specific needs, for example as sketched below.
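A minimal sketch, assuming Kubernetes 1.18+ so the autoscaling/v2beta2 behavior field is available; the thresholds and windows are illustrative assumptions rather than exact equivalents of the ECS rules, and the scale-down point still follows the HPA formula rather than a separate "CPU <= 60%" rule:
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: MICROSERVICE-NAME
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: MICROSERVICE-NAME
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 90
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 85
  behavior:
    scaleUp:
      # only act on load that has been sustained for roughly 3 one-minute periods
      stabilizationWindowSeconds: 180
      policies:
      - type: Pods
        value: 1
        periodSeconds: 60
    scaleDown:
      # wait roughly 5 one-minute periods before removing Pods, at most one Pod per minute
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 1
        periodSeconds: 60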
I also recommend checking the EKS documentation for more references, metrics support, and examples.

How do I stop replicaSet terminating requested number of pods

I have a minimum number of replicas set to 3, maximum 10 and the number of requested replicas as 6. When I deploy, the replicaSet looks good, and I have 6 pods running as expected.
However, after a few minutes I get this message- "Scaled down replica set my-first-app to 3". It then terminates my pods so I am only left with 3. How do I stop it doing this? I want to have 6 replicas.
The HorizontalPodAutoscaler (which I assume you are talking about, as the ReplicaSet resource would not actually be able to scale your Pods horizontally) automatically scales down resources based on the configured metrics (e.g. CPU usage, Memory usage).
High level it uses the following algorithm to determine the number of replicas:
desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]
It then clamps desiredReplicas between the configured minimum and maximum number of replicas.
See the docs on the Horizontal Pod Autoscaler for more details.
To answer your question: If you always want 6 Pods at the very minimum, just make sure the minReplicas are set to 6, e.g.:
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-example
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: deployment-example
  minReplicas: 6
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70

Kubernetes HPA based on available healthy pods

Is it possible to have the HPA scale based on the number of available running pods?
I have set up a readiness probe that cuts a pod out based on its internal state (idle, working, busy). When a pod is 'busy', it no longer receives new requests, but its CPU and memory demands are low.
I don't want to scale based on cpu, mem, or other metrics.
Seeing as the readiness probe removes it from active service, can I scale based on the average number of active (not busy) pods? When that number drops below a certain point more pods are scaled.
TIA for any suggestions.
You can create a custom metric, e.g. the number of busy pods, for the HPA.
That is, the application should emit a metric value when it is busy, and you use that metric to create the HorizontalPodAutoscaler.
Something like this:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: custom-metric-sd
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: custom-metric-sd
  minReplicas: 1
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metricName: busy-pods
      targetAverageValue: 4
Here is another reference for HPA with custom metrics.

Horizontal pod Autoscaler scales custom metric too aggressively on GKE

I have the below Horizontal Pod Autoscaler configuration on Google Kubernetes Engine to scale a deployment by a custom metric - the RabbitMQ messages-ready count for a specific queue: foo-queue.
It picks up the metric value correctly.
When inserting 2 messages it scales the deployment to the maximum 10 replicas.
I expect it to scale to 2 replicas since the targetValue is 1 and there are 2 messages ready.
Why does it scale so aggressively?
HPA configuration:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: foo-hpa
  namespace: development
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: foo
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metricName: "custom.googleapis.com|rabbitmq_queue_messages_ready"
      metricSelector:
        matchLabels:
          metric.labels.queue: foo-queue
      targetValue: 1
I think you did a great job explaining how targetValue works with HorizontalPodAutoscalers. However, based on your question, I think you're looking for targetAverageValue instead of targetValue.
In the Kubernetes docs on HPAs, it mentions that using targetAverageValue instructs Kubernetes to scale pods based on the average metric exposed by all Pods under the autoscaler. While the docs aren't explicit about it, an external metric (like the number of jobs waiting in a message queue) counts as a single data point. By scaling on an external metric with targetAverageValue, you can create an autoscaler that scales the number of Pods to match a ratio of Pods to jobs.
Back to your example:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: foo-hpa
  namespace: development
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: foo
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metricName: "custom.googleapis.com|rabbitmq_queue_messages_ready"
      metricSelector:
        matchLabels:
          metric.labels.queue: foo-queue
      # Aim for one Pod per message in the queue
      targetAverageValue: 1
will cause the HPA to try keeping one Pod around for every message in your queue (with a max of 10 pods).
As an aside, targeting one Pod per message is probably going to cause you to start and stop Pods constantly. If you end up starting a ton of Pods and process all of the messages in the queue, Kubernetes will scale your Pods down to 1. Depending on how long it takes to start your Pods and how long it takes to process your messages, you may have lower average message latency by specifying a higher targetAverageValue. Ideally, given a constant amount of traffic, you should aim to have a constant number of Pods processing messages (which requires you to process messages at about the same rate that they are enqueued).
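For instance, a hedged variant of the metrics block above (the value of 5 is only an illustrative assumption; tune it to your startup time and processing rate) that aims for one Pod per roughly 5 queued messages instead of one per message:
  metrics:
  - type: External
    external:
      metricName: "custom.googleapis.com|rabbitmq_queue_messages_ready"
      metricSelector:
        matchLabels:
          metric.labels.queue: foo-queue
      # One Pod for every ~5 ready messages; larger values mean fewer, busier Pods
      targetAverageValue: 5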
According to https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
From the most basic perspective, the Horizontal Pod Autoscaler controller operates on the ratio between desired metric value and current metric value:
desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]
From the above I understand that as long as the queue has messages, the k8s HPA will continue to scale up, since currentReplicas is part of the desiredReplicas calculation.
For example if:
currentReplicas = 1
currentMetricValue / desiredMetricValue = 2/1
then:
desiredReplicas = 2
If the metric stays the same, in the next HPA cycle currentReplicas will become 2 and desiredReplicas will be raised to 4.
Try following this guide, which describes horizontal autoscaling settings for RabbitMQ in k8s:
Kubernetes Workers Autoscaling based on RabbitMQ queue size
In particular, targetValue: 20 for the metric rabbitmq_queue_messages_ready is recommended instead of targetValue: 1:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: workers-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: my-workers
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metricName: "custom.googleapis.com|rabbitmq_queue_messages_ready"
      metricSelector:
        matchLabels:
          metric.labels.queue: myqueue
      targetValue: 20
Now our deployment my-workers will grow if the RabbitMQ queue myqueue has more than 20 unprocessed jobs in total.
I'm using the same Prometheus metrics from RabbitMQ (I'm using Celery with RabbitMQ as broker).
Did anyone here considered using rabbitmq_queue_messages_unacked metric rather than rabbitmq_queue_messages_ready?
The thing is that rabbitmq_queue_messages_ready decreases as soon as a message is pulled by a worker, and I'm afraid that a long-running task might be killed by the HPA, while rabbitmq_queue_messages_unacked stays until the task is completed.
For example, I have a message that triggers a new pod (celery-worker) to run a task that takes 30 minutes. The rabbitmq_queue_messages_ready will decrease while the pod is running, and the HPA cooldown/delay will terminate the pod.
EDIT: it seems like a third metric, rabbitmq_queue_messages, is the right one - it is the sum of both unacked and ready:
sum of ready and unacknowledged messages - total queue depth
documentation
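If that total-depth metric is exported along the same path as in the examples above, a minimal sketch would only swap the metric name in the HPA (the exact exported metric name depends on your exporter setup, so treat it as an assumption):
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: foo-hpa
  namespace: development
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: foo
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      # ready + unacknowledged, i.e. total queue depth, so long-running tasks keep counting
      metricName: "custom.googleapis.com|rabbitmq_queue_messages"
      metricSelector:
        matchLabels:
          metric.labels.queue: foo-queue
      targetAverageValue: 1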

Kubernetes : GKE | HPA is not scaling pods even though the memory utilization is greater/equal to target value

We have a GKE cluster (1.11) and implemented HPA based on memory utilization for pods. During our testing activity, we have observed that HPA behavior is not consistent: HPA is not scaling pods even though the target value is met. We have also noticed that HPA events do not give us any response data (either scaling or downscaling related info).
Example
kubectl get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
com-manh-cp-organization Deployment/com-manh-cp-organization 95%/90% 1 25 1 1d
kubectl describe hpa com-manh-cp-organization
Name: com-manh-cp-organization
Namespace: default
Labels: app=com-manh-cp-organization
stereotype=REST
Annotations: kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"autoscaling/v2beta1","kind":"HorizontalPodAutoscaler","metadata":{"annotations":{},"labels":{"app":"com-manh-cp-organizatio...
CreationTimestamp: Tue, 12 Feb 2019 18:02:12 +0530
Reference: Deployment/com-manh-cp-organization
Metrics: ( current / target )
resource memory on pods (as a percentage of request): 95% (4122087424) / 90%
Min replicas: 1
Max replicas: 25
Deployment pods: 1 current / 1 desired
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True ReadyForNewScale the last scale time was sufficiently old as to warrant a new scale
ScalingActive True ValidMetricFound the HPA was able to successfully calculate a replica count from memory resource utilization (percentage of request)
ScalingLimited False DesiredWithinRange the desired count is within the acceptable range
Events: <none>
Cluster version : 1.11.6
Cloud service : GKE
Metric : memory
Target : targetAverageUtilization
Any inputs will be much appreciated; also let us know how we can debug the HPA implementation.
Thanks.
There is a tolerance on the threshold values the HPA uses when calculating the replica count, as specified in this link.
This tolerance is 0.1 by default, and in your configuration you might not be hitting the threshold with a 90% target because of it. I would recommend changing the target to 80% and seeing whether it works.
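To make that concrete, a rough back-of-the-envelope check against the output shown in the question (1 replica, 95% current, 90% target):
currentMetricValue / desiredMetricValue = 95 / 90 ≈ 1.06
|1.06 - 1.0| = 0.06, which is within the default 0.1 tolerance, so the HPA skips scaling
with an 80% target instead: 95 / 80 ≈ 1.19, outside the tolerance, so desiredReplicas = ceil[1 * 1.19] = 2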