In my GKE Kubernetes cluster, I have 2 node pools; one with regular nodes and the other with pre-emptible nodes. I'd like some of the pods to be on pre-emptible nodes so I can save costs while I have at least 1 pod on a regular non-pre-emptible node to reduce the risk of downtime.
I'm aware of using podAntiAffinity to encourage pods to be scheduled on different nodes, but is there a way to have k8s schedule pods for a single deployment across both pools?
Yes 💡! You can use Pod Topology Spread Constraints, based on a label 🏷️ key on your nodes. For example, the label could be type and the values could be regular and preemptible. Then you can have something like this:
kind: Pod
apiVersion: v1
metadata:
name: mypod
labels:
foo: bar
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: type
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
foo: bar
containers:
- name: app
image: myimage
You can also identify a maxSkew which means the maximum differentiation of a number of pods that one label value (node type) can have.
You can also combine multiple 'Pod Topology Spread Constraints' and also together with PodAffinity/AntiAffinity and NodeAffinity. All depending on what best fits your use case.
Note: This feature is alpha in 1.16 and beta in 1.18. Beta features are enabled by default but with alpha features, you need an alpha cluster in GKE.
☮️✌️
Related
I have 4 node with different zone:
Node A : zone a
Node B : zone b
Node C : zone c
Node D : zone c
I want to spread the pod to Node A, B and C. I have Deployment that have 3 replicas to spread across those node, each pod each node. My deployments using kustomization and ArgoCD to deploy. Using the topologySpreadConstraint need to be update the label but labels are immutable on this case.
Current deployment condition using this
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-apps
spec:
replicas: 3
revisionHistoryLimit: 0
template:
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: app
operator: In
values:
- my-apps
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: my-apps
version: v1
...
I've done add label for those 3 nodes and this configuration works well at first time. But when it comes to update the deployment and rolling update, The pods on nodes going to imbalance.
zone a : 2 pod
zone b : 1 pod
zone c : 0 pod
I've done playing with podAntiAffinity but its return as pending if I use hard affinity and still imbalance if I use soft affinity. Any suggestion best practice for this case? did I missed something?
Rolling updates cause some pods to be removed from nodes and others to be added. This can cause the pods on nodes to become imbalanced, as the pods that were already on the nodes will remain, but the pods that are added during the update will likely be different. To prevent this from happening, it is important to use the maxUnavailable option in the rolling update strategy. This allows you to specify the maximum number of pods that can be removed from a node during the rolling update, ensuring that the pods on each node remain balanced.
kubectl apply -f deployment.yaml --strategy=RollingUpdate --strategy-rolling-update-maxUnavailable=1
This command will create or update a deployment with the rolling update strategy, and the maxUnavailable option set to 1.This will ensure that no more than 1 pod is removed from a node during the rolling update, thus keeping the pods across nodes balanced.Try it and let me know if this works
If you are scaling down the pods, as per official doc limitations:
There's no guarantee that the constraints remain satisfied when Pods are removed. For example, scaling down a Deployment may result in an imbalanced Pods distribution.You can use a tool such as the Descheduler to rebalance the Pods distribution.
I understand that the nodeSelector will help to move the pods to specific nodes with labels. But, say if I know the names for pods in advance and based on these names, how do I move these pods to different nodes having specific labels.
I am unable to understand as to how to use nodeSelector, affinity, antiAffinity in this case.
What would an example values.yaml look like?
I have labelled three nodes. Then, when I launch the 6 pods, they are equally divided among the nodes. Each pod has an index value at the end of the pod name. mypod-0, mypod-1 until mypod-5.
I want to have mypod-0 and mypod-3 on node 1, mypod-1 and mypod-4 on node 2 and so on.
We can use Pod Topology Constraints to decide how the pods need to spread across your cluster.
For Example as per you question we can deploy 2 pods in 1 node and other 2 pods in different node. To make this possible we need to set the topologyKey for nodes and use them while deploying the pod. This sample yaml shows the syntax of pod topology constraints.
kind: Pod
apiVersion: v1
metadata:
name: mypod
labels:
foo: bar
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
foo: bar
containers:
- name: nginx
image: nginx
Since you are using labels for naming the pods you can use nodeAffinity and nodeSelectors along with pod topology constraints. For more information regarding the combination of topology and affinity refer to this official k8 document
Helo i am running a .NET application in Azure Kubernetes Services as a 3 pod cluster (1 pod per node).
I am trying to understand how can i make my cluster elastic depending on load ?
How can i configure the deployment.yaml so that after a certain % of the cpu utilization and/or % of memory per pod it spawns another pod? The same thing when load decreases, how do i shut down instances.
Is there any guide/tutorial to set this up based on percentage (ideally) ?
The basic feature you need to use is called HorizontalPodAutoscaler or for short HPA. There you can configure cpu or memory limits and if the limit is exceeded, the pod replica number will be increased. E.g. from this walkthrough:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: php-apache
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: php-apache
minReplicas: 1
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 50
This will scale out the php-apache deployment, as soon as the pods cpu utilization is greater than 50 %. Be aware that calculating the resource utilization and the resulting number of replicas is not as intuitive, as it might seam. Also see docs (the whole page should be quite interesting too). You can also combine criteria for scale out.
There are also addons that help you scale based on other parameters, like the number of messages in a queue. Check out keda, they provide different scalers, like RabbitMQ, Kafka, AWS CloudWatch, Azure Monitor, etc.
And since you wrote
1 pod per node
you might be running a DaemonSet. In that case your only option to scale out would be to add additional nodes, since with daemonsets there is always exactly one pod per node. If that's the case you could think about using a Deployment combined with a PodAntiAffinity instead, see docs. By that you can configure pods to preferably run on nodes where pods of the same deployment are not running yet, e.g.:
[...]
spec:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: security
operator: In
values:
- S2
topologyKey: topology.kubernetes.io/zone
[...]
From docs:
The anti-affinity rule says that the scheduler should try to avoid scheduling the Pod onto a node that is in the same zone as one or more Pods with the label security=S2. More precisely, the scheduler should try to avoid placing the Pod on a node that has the topology.kubernetes.io/zone=R label if there are other nodes in the same zone currently running Pods with the Security=S2 Pod label.
That would make scale out more flexible as it is with a daemonset, yet you have a similar effect of pods being equally distributed through out the cluster.
If you want/need to stick to a daemonset you can check out the AKS Cluster Autoscaler, that can be used to automatically add/remove additional nodes from your cluster, based on resource consumption.
I'm trying to understand, if it is possible ( and how ) to pin pods of statefulsets to specific availibility zones. In the example above I would like to explicitly configure, that elastic pod 1 runs in availibility zone 1, pod 2 in availibility zone 2 and so forth. I also don't want the pods to run outside of their availibility zones, if one goes down.
I read this doc on the matter. If I understand it correctly, I can only specify, that a statefulset shouldn't run it's pods in the same availibility zone, but not, that it always runs the pod in a specific availibility zone.
Thanks to anyone who can educate me in this matter.
aws overview
You can use Pod Topology Spread constraints.
topology spread constraints control how Pods are spread across your
cluster among failure-domains such as regions, zones, nodes, and other
user-defined topology domains. This can help to achieve high
availability as well as efficient resource utilization.
Taking the example from here
kind: Pod
apiVersion: v1
metadata:
name: mypod
labels:
foo: bar
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
foo: bar
containers:
- name: pause
image: k8s.gcr.io/pause:3.1
Is there anyway I can tell Kuberbetes how to schedule the replicas in the statefulset? For example, I have nodes divided into 3 different availability zones (AZ). I have labeled these nodes accordingly. Now I want K8s to put 1 replica in each AZ based on node label. Thanks
Pods will always try to be scheduled across different nodes, to achieve what you are looking for you can try to use DaemonSet, which will allow only one of these kind of pods in each node.
Also, you can use anti affinity based on the already scheduled pods in that node.
https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
The feature you are looking for is called Pod Anti-Affinity and can be specified as such:
apiVersion: apps/v1
kind: StatefulSet
[..]
spec:
template:
spec:
affinity:
nodeAffinity: {}
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- podAffinityTerm:
labelSelector:
matchLabels:
app: myapp
topologyKey: az
weight: 100
[..]
Since Kubernetes 1.18, there is also Pod Topology Spread Constraints, which is a nicer way to specify these anti-affinity rules.