How to make a Kubernetes cluster elastic?

Hello, I am running a .NET application in Azure Kubernetes Service as a 3-pod cluster (1 pod per node).
I am trying to understand how I can make my cluster elastic depending on load.
How can I configure the deployment.yaml so that after a certain % of CPU utilization and/or % of memory per pod, another pod is spawned? And the same thing when load decreases: how do I shut instances down?
Is there any guide/tutorial to set this up based on percentages (ideally)?

The basic feature you need is the HorizontalPodAutoscaler, or HPA for short. There you can configure a CPU or memory utilization target, and when it is exceeded, the number of pod replicas is increased. E.g. from this walkthrough:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
This will scale out the php-apache deployment as soon as the pods' average CPU utilization exceeds 50 %. Be aware that calculating the resource utilization and the resulting number of replicas is not as intuitive as it might seem; also see the docs (the whole page is worth reading). You can also combine several criteria for scale-out.
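To make the replica calculation concrete, the controller roughly applies the formula described in the algorithm section of the docs:

# desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)
# e.g. 3 replicas averaging 80 % CPU against the 50 % target above:
#   ceil(3 * 80 / 50) = ceil(4.8) = 5 replicas

To combine criteria, list several entries under metrics; the HPA then scales to the largest replica count proposed by any single metric. A memory entry analogous to the CPU one above could look like this (the 70 % target is illustrative):

- type: Resource
  resource:
    name: memory
    target:
      type: Utilization
      averageUtilization: 70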
There are also add-ons that help you scale based on other parameters, like the number of messages in a queue. Check out KEDA; it provides different scalers, like RabbitMQ, Kafka, AWS CloudWatch, Azure Monitor, etc.
And since you wrote
1 pod per node
you might be running a DaemonSet. In that case your only option to scale out would be to add additional nodes, since with DaemonSets there is always exactly one pod per node. If that's the case, you could think about using a Deployment combined with a podAntiAffinity instead, see docs. With that you can configure pods to preferably run on nodes where pods of the same Deployment are not yet running, e.g.:
[...]
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S2
          topologyKey: topology.kubernetes.io/zone
[...]
From docs:
The anti-affinity rule says that the scheduler should try to avoid scheduling the Pod onto a node that is in the same zone as one or more Pods with the label security=S2. More precisely, the scheduler should try to avoid placing the Pod on a node that has the topology.kubernetes.io/zone=R label if there are other nodes in the same zone currently running Pods with the Security=S2 Pod label.
That would make scale-out more flexible than with a DaemonSet, yet you get a similar effect of pods being distributed evenly throughout the cluster.
If you want or need to stick with a DaemonSet, check out the AKS Cluster Autoscaler, which can automatically add and remove nodes from your cluster based on resource consumption.
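For reference, on AKS the cluster autoscaler can be enabled on an existing cluster via the Azure CLI, roughly like this (resource group, cluster name, and node counts are placeholders):

az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 5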

Related

Horizontal Auto-scaling based on other pod healthcheck

Is it possible to scale a pod down to 0 replicas when another pod is down? I'm familiar with the basics of the horizontal autoscaling concept, but as I understand it, it scales pods up or down only when the demand for resources (CPU, memory) changes.
My CI pipeline follows a blue/green pattern, so when the new version of the application is being deployed, the second one is scaled down to 0 replicas, leaving other pods belonging to the same environment up and wasting resources. Do you have any idea how to solve this using Kubernetes or Helm features?
Thanks
If you have a CI pipeline, you can simply run a kubectl command to scale the idle deployment down before deploying the new blue/green version; that way no resources are wasted.
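For example (deployment name and namespace are placeholders):

kubectl scale deployment/my-green-app --replicas=0 -n my-environment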
That said, yes, you can also scale the deployment or application up/down based on custom metrics.
I would recommend checking out the cloud-native project KEDA: https://keda.sh/
KEDA:
KEDA is a Kubernetes-based Event Driven Autoscaler. With KEDA, you can drive the scaling of any container in Kubernetes based on the number of events needing to be processed.
Example
apiVersion: keda.k8s.io/v1alpha1
kind: ScaledObject
metadata:
  name: {scaled-object-name}
spec:
  scaleTargetRef:
    deploymentName: {deployment-name} # must be in the same namespace as the ScaledObject
    containerName: {container-name}   # Optional. Default: deployment.spec.template.spec.containers[0]
  pollingInterval: 30  # Optional. Default: 30 seconds
  cooldownPeriod: 300  # Optional. Default: 300 seconds
  minReplicaCount: 0   # Optional. Default: 0
  maxReplicaCount: 100 # Optional. Default: 100
  triggers:
  # {list of triggers to activate the deployment}
ScaledObject spec reference: https://keda.sh/docs/1.4/concepts/scaling-deployments/#scaledobject-spec
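As a sketch of what a trigger could look like, here is a RabbitMQ queue-length trigger; the host reference, queue name, and threshold are placeholders, and the exact fields depend on your KEDA version, so check the scaler docs:

triggers:
- type: rabbitmq
  metadata:
    host: RabbitMqHost  # name of an env var on the container holding the AMQP connection string
    queueName: myqueue
    queueLength: "20"   # target number of messages per replica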

Cluster Autoscaler and Horizontal Pod Autoscaler working together

I have a cluster with Cluster Autoscaler activated and HPA for one of my deployments.
This is the HPA definition:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-resource-metrics-cpu
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: ReplicationController
    name: hello-hpa-cpu
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      targetAverageUtilization: 50
Now, in a situation where my cluster is used very lightly, this deployment will have only 1 available replica.
And since the cluster is not under high usage, it could be the case that the node containing that replica is scheduled for deletion (downscaling).
In that case, my deployment would have downtime: when the cluster node is deleted, the only replica of the deployment is deleted as well and needs to be rescheduled as a new pod. I don't want that (the downtime) to happen.
From this issue: https://github.com/kubernetes/kubernetes/issues/48307, it seems that Pod Disruption Budgets are not applicable to deployments with only 1 replica.
So is the only solution to my problem to set minReplicas to 2?
Or is there something else I could do to prevent this downtime while leaving minReplicas at 1?
Kubernetes has the notion of a disruption. The cluster autoscaler (or an administrator) taking a node offline is a "voluntary" disruption (as distinct from, say, the node losing power) and so you have some control over it. If you create a pod disruption budget:
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: hello-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: hello
you have specified that there must never be fewer than one pod with the label app: hello when the cluster tries to perform a voluntary disruption.
Doing this can prevent the cluster autoscaler from actually deleting the node. The examples in the PDB documentation generally have multiple replicas and can tolerate some of them being offline, so it is possible to delete 1 replica of 3 and recreate it on a different node. There is an extended example in which there is no capacity left in the cluster to start a rescheduled pod, and this blocks destroying a node. You might set the HPA to minReplicas: 3 to avoid this case, even if it means your system will be overprovisioned at the quietest times.
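A middle ground, assuming the hello deployment from the example above, is to pair the PDB's minAvailable: 1 with a slightly higher HPA floor, so the autoscaler can always evict one pod while another keeps serving:

# HPA fragment: with two replicas guaranteed, evicting one never violates minAvailable: 1
minReplicas: 2
maxReplicas: 10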

Distribute pods for a deployment across different node pools

In my GKE Kubernetes cluster, I have 2 node pools: one with regular nodes and the other with preemptible nodes. I'd like some of the pods to be on preemptible nodes so I can save costs, while keeping at least 1 pod on a regular, non-preemptible node to reduce the risk of downtime.
I'm aware of using podAntiAffinity to encourage pods to be scheduled on different nodes, but is there a way to have k8s schedule the pods of a single deployment across both pools?
Yes 💡! You can use Pod Topology Spread Constraints, based on a label 🏷️ key on your nodes. For example, the label could be type and the values could be regular and preemptible. Then you can have something like this:
kind: Pod
apiVersion: v1
metadata:
  name: mypod
  labels:
    foo: bar
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: type
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        foo: bar
  containers:
  - name: app
    image: myimage
You can also tune maxSkew, which is the maximum permitted difference in the number of matching pods between any two values of the topology key (the two node types here).
You can also combine multiple Pod Topology Spread Constraints, and use them together with PodAffinity/PodAntiAffinity and NodeAffinity, all depending on what best fits your use case.
Note: this feature is alpha in 1.16 and beta in 1.18. Beta features are enabled by default, but for alpha features you need an alpha cluster in GKE.
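To get the type label onto the nodes in the first place, you could create the pools with node labels, roughly like this (pool and cluster names are placeholders):

gcloud container node-pools create regular-pool --cluster=my-cluster \
  --node-labels=type=regular
gcloud container node-pools create preemptible-pool --cluster=my-cluster \
  --preemptible --node-labels=type=preemptible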
☮️✌️

Disable auto rescheduling for a pod

On a k8s cluster (GCP), my pods are rescheduled automatically during node auto-scaling. The main problem is that they perform computations and keep the results in memory during auto-scaling; because of rescheduling, the pods lose all results and tasks.
I want to disable rescheduling for specific pods. I know a few possible solutions:
nodeSelector (not very flexible due to the dynamic nature of a cluster)
pod disruption budget (PDB)
I have tried a PDB with minAvailable = 1, but it didn't work. I found that you can also set maxUnavailable = 0; would that be more effective? I don't understand exactly how maxUnavailable behaves when it's set to 0. Could you explain it in more detail? Thank you!
Link for more details - https://github.com/dask/dask-kubernetes/issues/112
Setting maxUnavailable to 0 is a way to go, and using node pools can also be a good workaround:
gcloud container node-pools create <nodepool> --node-taints=app=dask-scheduler:NoSchedule
gcloud container node-pools create <nodepool> --node-labels app=dask-scheduler
This will create the node pool with the label app=dask-scheduler; afterwards, in the pod spec, you can do this:
nodeSelector:
  app: dask-scheduler
And put the dask scheduler on a node-pool that doesn't autoscale.
There's an object called a PodDisruptionBudget (PDB), in whose spec you can set maxUnavailable.
With maxUnavailable=1, if you had 100 pods defined, the eviction machinery would always make sure that only one pod is removed/drained/rescheduled at a time.
With maxUnavailable=0, even if you have only 2 pods, voluntary disruptions (e.g. a node being drained during scale-down) will never remove your pods.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: zookeeper
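For the case asked about in the question, the spec would instead use the variant below; with this in place, voluntary evictions of matching pods (e.g. during a node drain) are blocked entirely (the label is illustrative):

spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: dask-scheduler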
Are you specifying resource requests and limits?
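For reference, requests and limits are set per container in the pod spec; a minimal sketch with illustrative values:

resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi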

Ignore pod anti affinity during deployment update

Imagine a master-node-node setup where you deploy a service with pod anti-affinity on the nodes: an update of the Deployment will cause another pod to be created, but the scheduler will be unable to schedule it, because both nodes already run a pod matching the anti-affinity rule.
Q: How could one set the anti-affinity more flexibly, to allow the update?
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - api
      topologyKey: kubernetes.io/hostname
It fails with the error:
No nodes are available that match all of the following predicates:: MatchInterPodAffinity (2), PodToleratesNodeTaints (1).
Look at maxSurge.
If you set maxSurge: 0, you are telling Kubernetes that it may not create more pods than the number of replicas you have set up for the deployment. This forces Kubernetes to remove a pod before starting a new one, thereby making room for the new pod first and getting you around the podAntiAffinity issue. I've used this mechanism myself, with great success.
Config example
apiVersion: extensions/v1beta1
kind: Deployment
...
spec:
  replicas: <any number larger than 1>
  ...
  strategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
  ...
  template:
    ...
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - api
            topologyKey: kubernetes.io/hostname
Warning: Don't do this if you only have one replica, as it will cause downtime; the only pod is removed before a new one is added. If you have a huge number of replicas, deployments will be slow because Kubernetes can only upgrade 1 pod at a time; in that case you can crank up maxUnavailable to let Kubernetes remove more pods at once.
Here are a few methods which may work:
Modify your deployment's rollingUpdate strategy so that the maxUnavailable parameter is at least 1. This allows one pod of the old deployment to be destroyed immediately, making room for one pod of the new deployment to be created.
Max Unavailable
.spec.strategy.rollingUpdate.maxUnavailable is an optional field that specifies the maximum number of Pods that can be unavailable during the update process. The value can be an absolute number (for example, 5) or a percentage of desired Pods (for example, 10%). The absolute number is calculated from percentage by rounding down. The value cannot be 0 if .spec.strategy.rollingUpdate.maxSurge is 0. The default value is 25%.
For example, when this value is set to 30%, the old ReplicaSet can be scaled down to 70% of desired Pods immediately when the rolling update starts. Once new Pods are ready, old ReplicaSet can be scaled down further, followed by scaling up the new ReplicaSet, ensuring that the total number of Pods available at all times during the update is at least 70% of the desired Pods.
Max Surge
.spec.strategy.rollingUpdate.maxSurge is an optional field that specifies the maximum number of Pods that can be created over the desired number of Pods. The value can be an absolute number (for example, 5) or a percentage of desired Pods (for example, 10%). The value cannot be 0 if MaxUnavailable is 0. The absolute number is calculated from the percentage by rounding up. The default value is 25%.
Modify the deployment pod spec to use "soft" anti-affinity instead of "hard".
Pod Anti-Affinity
There are currently two types of node affinity, called requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution. You can think of them as “hard” and “soft” respectively, in the sense that the former specifies rules that must be met for a pod to be scheduled onto a node (just like nodeSelector but using a more expressive syntax), while the latter specifies preferences that the scheduler will try to enforce but will not guarantee.
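Applied to the example from the question, the "soft" variant would look roughly like this (the weight value is illustrative):

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - api
        topologyKey: kubernetes.io/hostname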