I have a cluster with Cluster Autoscaler activated and HPA for one of my deployments.
This is the HPA definition:
kind: HorizontalPodAutoscaler
metadata:
name: hpa-resource-metrics-cpu
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: ReplicationController
name: hello-hpa-cpu
minReplicas: 1
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
targetAverageUtilization: 50
Now in a situation where my cluster is being used very lightly, that means this deployment will only have 1 available replica.
And since the cluster is not under high usage, it could be the case that the node containing that replica is scheduled for deletion (downscaling).
In that case, it would make my deployment have a downtime (when the cluster node is deleted, the only replica for the deployment is deleted as well, so it needs to be rescheduled in a new pod). I don't want that to happen (the downtime).
From this issue: https://github.com/kubernetes/kubernetes/issues/48307, it seems that Pod Disruption Budgets are not applicable to deployments with only 1 replica.
So the only solution to my problem would be to have minReplicas set to 2?
Or is there something else I could do to prevent this downtime, and still let minReplicas as 1?
Kubernetes has the notion of a disruption. The cluster autoscaler (or an administrator) taking a node offline is a "voluntary" disruption (as distinct from, say, the node losing power) and so you have some control over it. If you create a pod disruption budget:
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
name: hello-pdb
spec:
minAvailable: 1
selector:
matchLabels:
app: hello
You have specified that there shouldn't be fewer than one pod, with a label app: hello, when the cluster tries to perform a voluntary disruption.
Doing this can prevent the cluster autoscaler from actually deleting the node. The examples in the PDB documentation generally have multiple replicas and can tolerate some of them being offline, so it's possible to delete 1 replica of 3 and recreate it on a different node. There is an extended example where there's not capacity in the cluster to start a rescheduled pod, and this blocks destroying a node. You might set the HPA to minReplicas: 3 to avoid this case, even if it means your system will be overprovisioned at the quietest times.
Related
Helo i am running a .NET application in Azure Kubernetes Services as a 3 pod cluster (1 pod per node).
I am trying to understand how can i make my cluster elastic depending on load ?
How can i configure the deployment.yaml so that after a certain % of the cpu utilization and/or % of memory per pod it spawns another pod? The same thing when load decreases, how do i shut down instances.
Is there any guide/tutorial to set this up based on percentage (ideally) ?
The basic feature you need to use is called HorizontalPodAutoscaler or for short HPA. There you can configure cpu or memory limits and if the limit is exceeded, the pod replica number will be increased. E.g. from this walkthrough:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: php-apache
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: php-apache
minReplicas: 1
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 50
This will scale out the php-apache deployment, as soon as the pods cpu utilization is greater than 50 %. Be aware that calculating the resource utilization and the resulting number of replicas is not as intuitive, as it might seam. Also see docs (the whole page should be quite interesting too). You can also combine criteria for scale out.
There are also addons that help you scale based on other parameters, like the number of messages in a queue. Check out keda, they provide different scalers, like RabbitMQ, Kafka, AWS CloudWatch, Azure Monitor, etc.
And since you wrote
1 pod per node
you might be running a DaemonSet. In that case your only option to scale out would be to add additional nodes, since with daemonsets there is always exactly one pod per node. If that's the case you could think about using a Deployment combined with a PodAntiAffinity instead, see docs. By that you can configure pods to preferably run on nodes where pods of the same deployment are not running yet, e.g.:
[...]
spec:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: security
operator: In
values:
- S2
topologyKey: topology.kubernetes.io/zone
[...]
From docs:
The anti-affinity rule says that the scheduler should try to avoid scheduling the Pod onto a node that is in the same zone as one or more Pods with the label security=S2. More precisely, the scheduler should try to avoid placing the Pod on a node that has the topology.kubernetes.io/zone=R label if there are other nodes in the same zone currently running Pods with the Security=S2 Pod label.
That would make scale out more flexible as it is with a daemonset, yet you have a similar effect of pods being equally distributed through out the cluster.
If you want/need to stick to a daemonset you can check out the AKS Cluster Autoscaler, that can be used to automatically add/remove additional nodes from your cluster, based on resource consumption.
Is it possible to have the HPA scale based on the number of available running pods?
I have set up a readiness probe that cuts out a pod based it's internal state (idle, working, busy). When a pod is 'busy', it no longer receives new requests. But the cpu, and memory demands are low.
I don't want to scale based on cpu, mem, or other metrics.
Seeing as the readiness probe removes it from active service, can I scale based on the average number of active (not busy) pods? When that number drops below a certain point more pods are scaled.
TIA for any suggestions.
You can create custom metrics, a number of busy-pods for HPA.
That is, the application should emit a metric value when it is busy. And use that metric to create HorizontalPodAutoscaler.
Something like this:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
name: custom-metric-sd
namespace: default
spec:
scaleTargetRef:
apiVersion: apps/v1beta1
kind: Deployment
name: custom-metric-sd
minReplicas: 1
maxReplicas: 20
metrics:
- type: Pods
pods:
metricName: busy-pods
targetAverageValue: 4
Here is another reference for HPA with custom metrics.
I have the below Horizontal Pod Autoscaller configuration on Google Kubernetes Engine to scale a deployment by a custom metric - RabbitMQ messages ready count for a specific queue: foo-queue.
It picks up the metric value correctly.
When inserting 2 messages it scales the deployment to the maximum 10 replicas.
I expect it to scale to 2 replicas since the targetValue is 1 and there are 2 messages ready.
Why does it scale so aggressively?
HPA configuration:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
name: foo-hpa
namespace: development
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: foo
minReplicas: 1
maxReplicas: 10
metrics:
- type: External
external:
metricName: "custom.googleapis.com|rabbitmq_queue_messages_ready"
metricSelector:
matchLabels:
metric.labels.queue: foo-queue
targetValue: 1
I think you did a great job explaining how targetValue works with HorizontalPodAutoscalers. However, based on your question, I think you're looking for targetAverageValue instead of targetValue.
In the Kubernetes docs on HPAs, it mentions that using targetAverageValue instructs Kubernetes to scale pods based on the average metric exposed by all Pods under the autoscaler. While the docs aren't explicit about it, an external metric (like the number of jobs waiting in a message queue) counts as a single data point. By scaling on an external metric with targetAverageValue, you can create an autoscaler that scales the number of Pods to match a ratio of Pods to jobs.
Back to your example:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
name: foo-hpa
namespace: development
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: foo
minReplicas: 1
maxReplicas: 10
metrics:
- type: External
external:
metricName: "custom.googleapis.com|rabbitmq_queue_messages_ready"
metricSelector:
matchLabels:
metric.labels.queue: foo-queue
# Aim for one Pod per message in the queue
targetAverageValue: 1
will cause the HPA to try keeping one Pod around for every message in your queue (with a max of 10 pods).
As an aside, targeting one Pod per message is probably going to cause you to start and stop Pods constantly. If you end up starting a ton of Pods and process all of the messages in the queue, Kubernetes will scale your Pods down to 1. Depending on how long it takes to start your Pods and how long it takes to process your messages, you may have lower average message latency by specifying a higher targetAverageValue. Ideally, given a constant amount of traffic, you should aim to have a constant number of Pods processing messages (which requires you to process messages at about the same rate that they are enqueued).
According to https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
From the most basic perspective, the Horizontal Pod Autoscaler controller operates on the ratio between desired metric value and current metric value:
desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]
From the above I understand that as long as the queue has messages the k8 HPA will continue to scale up since currentReplicas is part of the desiredReplicas calculation.
For example if:
currentReplicas = 1
currentMetricValue / desiredMetricValue = 2/1
then:
desiredReplicas = 2
If the metric stay the same in the next hpa cycle currentReplicas will become 2 and desiredReplicas will be raised to 4
Try to follow this instruction that describes horizontal autoscale settings for RabbitMQ in k8s
Kubernetes Workers Autoscaling based on RabbitMQ queue size
In particular, targetValue: 20 of metric rabbitmq_queue_messages_ready is recommended instead of targetValue: 1:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
name: workers-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1beta1
kind: Deployment
name: my-workers
minReplicas: 1
maxReplicas: 10
metrics:
- type: External
external:
metricName: "custom.googleapis.com|rabbitmq_queue_messages_ready"
metricSelector:
matchLabels:
metric.labels.queue: myqueue
**targetValue: 20
Now our deployment my-workers will grow if RabbitMQ queue myqueue has more than 20 non-processed jobs in total
I'm using the same Prometheus metrics from RabbitMQ (I'm using Celery with RabbitMQ as broker).
Did anyone here considered using rabbitmq_queue_messages_unacked metric rather than rabbitmq_queue_messages_ready?
The thing is, that rabbitmq_queue_messages_ready is decreasing as soon the message pulled by a worker and I'm afraid that long-running task might be killed by HPA, while rabbitmq_queue_messages_unacked stays until the task completed.
For example, I have a message that will trigger a new pod (celery-worker) to run a task that will take 30 minutes. The rabbitmq_queue_messages_ready will decrease as the pod is running and the HPA cooldown/delay will terminate pod.
EDIT: seems like a third one rabbitmq_queue_messages is the right one - which is the sum of both unacked and ready:
sum of ready and unacknowledged messages - total queue depth
documentation
I'm working on deploying the Thanos monitoring system and one of its components, the metric compactor, warns that there should never be more than one compactor running at the same time. If this constraint is violated it will likely lead to corruption of metric data.
Is there any way to codify "Exactly One" pod via Deployment/StatefulSet/etc, aside from "just set replicas: 1 and never scale"? We're using Rancher as an orchestration layer and it's real easy to hit that + button without thinking about it.
Limit the replicas in your deployment to 1
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: my-app
spec:
replicas: 1
Be careful with Deployment, because they can be configured with two update strategy:
RollingUpdate: new pods are added while and old pods are terminated. This mean that, depending on the maxSurge option, if you set your replicas to 1, you may still be have at most 2 pods.
Recreate: all the previous pods are terminated before any new pods are created.
Instead, Statefulsets guarantee that there will never be more than 1 instance of a pod at any given time.
apiVersion: apps/v1beta1
kind: StatefulSet
spec:
replicas: 1
Unlike Deployments, pods are not replaced until the previous has been terminated.
I have created a horizontal auto-scaler based on the cpu usage and it works fine. I want to know how I can configure the autoscaler in a way that it just scales up without scaling down? The reason I want such a thing is when I have high load/request I create some operators but I want to keep them alive even if for some amount of time they don't do anything but auto-scaler kills the pods and scaling down to the minimum replicas after sometime if there is no load.
My autoscaler:
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
name: gateway
namespace: default
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: gateway
minReplicas: 1
maxReplicas: 10
targetCPUUtilizationPercentage: 20
Edit:
By operator, I mean small applications/programs that are running in a pod.
You can add --horizontal-pod-autoscaler-downscale-stabilization flag to kube-controller-manager as described in docs. Default delay is set to 5 minutes.
To add flag to kube-controller-manager edit /etc/kubernetes/manifests/kube-controller-manager.yaml on master node, pod will be then recreated.