Best practice to spread pods evenly across nodes - Kubernetes

I have 4 nodes in different zones:
Node A : zone a
Node B : zone b
Node C : zone c
Node D : zone c
I want to spread the pods across Node A, B and C. I have a Deployment with 3 replicas that should be spread across those nodes, one pod per node. My deployments are deployed with Kustomize and ArgoCD. Using topologySpreadConstraints requires updating the labels, but labels are immutable in this case.
The current Deployment looks like this:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-apps
spec:
  replicas: 3
  revisionHistoryLimit: 0
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: app
                operator: In
                values:
                - my-apps
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: my-apps
            version: v1
...
I've added the label to those 3 nodes, and this configuration works well the first time. But when the Deployment is updated and a rolling update runs, the pods across the nodes become imbalanced:
zone a : 2 pod
zone b : 1 pod
zone c : 0 pod
I've tried podAntiAffinity, but the pods stay Pending if I use hard affinity and are still imbalanced if I use soft affinity. Any suggestions on best practice for this case? Did I miss something?

During a rolling update, old pods are removed and new ones are created in steps. While the scheduler places a new pod, the old pods that still match the labelSelector are counted toward the spread, so new pods can land in whichever zone looks emptiest at that moment, and the result is imbalanced once the old pods are gone. One thing to tune is the rolling update strategy. maxUnavailable is a field under .spec.strategy.rollingUpdate in the Deployment (kubectl apply has no flags for it). It caps how many pods of the Deployment may be unavailable at once during the update. It is a Deployment-wide limit, not a per-node one, so it reduces churn but does not by itself guarantee balance.
kubectl patch deployment my-apps -p '{"spec":{"strategy":{"type":"RollingUpdate","rollingUpdate":{"maxUnavailable":1}}}}'
This command sets the RollingUpdate strategy on the existing Deployment with maxUnavailable set to 1, so at most one pod is taken down at a time during the update. Since you deploy with Kustomize and ArgoCD, the same fields belong in the manifest itself so that ArgoCD does not revert them. Try it and let me know if this works.
If you are scaling down the pods, as per official doc limitations:
There's no guarantee that the constraints remain satisfied when Pods are removed. For example, scaling down a Deployment may result in an imbalanced Pods distribution. You can use a tool such as the Descheduler to rebalance the Pods distribution.
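A minimal sketch of how the strategy settings could sit in the manifest from the question, since the Deployment is managed through Kustomize and ArgoCD rather than by hand. The selector, template labels, container name and image are assumptions added only to make the manifest complete, and the version label is dropped from the spread selector so that nothing has to change between releases. The matchLabelKeys field is an addition not mentioned in the answer: on clusters where it is available (it is enabled by default from roughly Kubernetes v1.27, so check your version), the scheduler counts only pods of the same ReplicaSet revision when computing skew, which keeps leftover pods of the old revision from distorting placement during a rollout.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-apps
spec:
  replicas: 3
  revisionHistoryLimit: 0
  selector:
    matchLabels:
      app: my-apps              # assumed; the selector is elided in the question
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0               # assumption: remove an old pod before adding a new one
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app: my-apps            # assumed pod label
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: app
                operator: In
                values:
                - my-apps
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: my-apps
        matchLabelKeys:         # not in the original manifest; needs a recent cluster
        - pod-template-hash     # label the Deployment controller adds per revision
      containers:
      - name: my-apps           # hypothetical container
        image: example.com/my-apps:1.0.0   # hypothetical image
```

If the cluster is too old for matchLabelKeys, the Descheduler mentioned above remains the fallback for rebalancing after the rollout.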

Related

How to install pods (based on pod name) to specific nodes in K8s cluster?

I understand that nodeSelector helps to place pods on specific nodes with labels. But if I know the pod names in advance, how do I place these pods, based on their names, on different nodes that have specific labels?
I am unable to understand how to use nodeSelector, affinity and antiAffinity in this case.
What would an example values.yaml look like?
I have labelled three nodes. Then, when I launch the 6 pods, they are equally divided among the nodes. Each pod has an index value at the end of the pod name. mypod-0, mypod-1 until mypod-5.
I want to have mypod-0 and mypod-3 on node 1, mypod-1 and mypod-4 on node 2 and so on.
We can use Pod Topology Spread Constraints to decide how the pods should be spread across your cluster.
For example, as per your question, we can deploy 2 pods on one node and the other 2 pods on a different node. To make this possible, the nodes need a label to serve as the topologyKey, which is then referenced when deploying the pod. This sample YAML shows the syntax of pod topology spread constraints.
kind: Pod
apiVersion: v1
metadata:
  name: mypod
  labels:
    foo: bar
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        foo: bar
  containers:
  - name: nginx
    image: nginx
Since you have already labelled the nodes, you can use nodeAffinity and nodeSelector along with pod topology spread constraints. For more information regarding the combination of topology and affinity refer to this official Kubernetes document
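As a rough sketch of that combination, assuming the pods come from a StatefulSet (the mypod-0 to mypod-5 naming suggests one) and that the three nodes carry a hypothetical label pool=mypod-nodes: the nodeSelector restricts the pods to the labelled nodes, and a spread constraint on kubernetes.io/hostname with maxSkew 1 keeps the count at two pods per node. The label names and the Service name are illustrative only. Note that this balances the number of pods per node; it does not pin a particular index such as mypod-0 to a particular node.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mypod
spec:
  serviceName: mypod            # assumed headless Service name
  replicas: 6
  selector:
    matchLabels:
      app: mypod                # hypothetical pod label
  template:
    metadata:
      labels:
        app: mypod
    spec:
      nodeSelector:
        pool: mypod-nodes       # hypothetical label on the three nodes
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname   # spread per node
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: mypod
      containers:
      - name: nginx
        image: nginx
```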

Schedule Kubernetes pods in the same failure zone

We have a deployment with a large replicas number ( > 1 ) that we must deploy in the same zone.
We stumbled upon this documentation section: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#an-example-of-a-pod-that-uses-pod-affinity
which explains how to schedule pods in zones that already have other pods that match certain labels.
However, there are no other pods that our deployment depends upon. All other workloads are replicated and spread across multiple zones, and this is the first deployment that we would like to keep in a single zone.
We also thought about explicitly setting the zone for this deployment, but in case of a zone failure it would be unavailable until we notice and explicitly move it to another zone. So setting the exact zone won't work here.
Any insights here? Thanks!
Pod Affinity affects how the pod is scheduled based on the presence or absence of other pods within the node. That would probably not serve the purpose you're trying to achieve.
You're probably better off using node affinity (it's on the same link you provided)
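That would allow you to force the pods into a zone, because each GKE node has a failure-domain label, which you can find by running this and looking through the results (on current Kubernetes versions the equivalent labels are topology.kubernetes.io/region and topology.kubernetes.io/zone; the failure-domain.beta ones are deprecated):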
kubectl get node {name-of-node} -o json | jq ".metadata.labels"
The labels will read something like this:
"failure-domain.beta.kubernetes.io/region": "europe-west2",
"failure-domain.beta.kubernetes.io/zone": "europe-west2-b",
You can then combine this with nodeAffinity in your deployment yaml (parts snipped for brevity):
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    ...
  annotations:
    ...
  name: my-deployment
spec:
  replicas: 1
  strategy: {}
  selector:
    matchLabels:
      ...
  template:
    metadata:
      annotations:
        ...
      labels:
        ...
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: failure-domain.beta.kubernetes.io/zone
                operator: In
                values:
                - europe-west2-b
This will force the pods generated by the deployment to all go onto nodes sitting in europe-west2-b
I could change this and make it like this:
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: failure-domain.beta.kubernetes.io/zone
            operator: In
            values:
            - europe-west2-b
            - europe-west2-c
This allows it to schedule in two zones (but, as a consequence, it would not be able to schedule onto the europe-west2-a zone).
I do not think there is a direct way to achieve this. I can think of two ways this can work.
Using Affinity on Pod and Node
Adding node affinity with preferredDuringSchedulingIgnoredDuringExecution for the regions you would want to target.
Adding pod affinity to itself with preferredDuringSchedulingIgnoredDuringExecution for pods to prefer to be with each other.
With this, when the first pod is about to be spun up, it matches none of its preferred pod affinity, but the scheduler will still schedule it. Once one pod is running, the remaining pods have a pod with the matching affinity to join, so they should all spin up in the same zone. The challenge is a possible race condition: multiple pods may be scheduled at the same time, and the scheduler may put them in different locations once your first preferred zone is out.
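A sketch of what this first option could look like in the Deployment's pod template. The app: my-deployment label, the weights and the zone value are illustrative assumptions, and the zone key assumes nodes carry the standard topology.kubernetes.io/zone label. Both rules are soft preferences, so the scheduler can still fall back to another zone when the preferred one is unavailable.

```yaml
spec:
  template:
    metadata:
      labels:
        app: my-deployment            # hypothetical label the pods select on
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 50                # illustrative weight (valid range 1-100)
            preference:
              matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - europe-west2-b      # example preferred zone
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100               # prefer zones that already run these pods
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: my-deployment
              topologyKey: topology.kubernetes.io/zone
```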
Using Webhooks
You can use some mutating webhook to check the node label and add requiredDuringSchedulingIgnoredDuringExecution affinity to pods based on what zones you have still available.
The challenge here is you would most likely need to write and maintain this webhook yourself. I am not sure if you will find your exact usecase solved by someone else in open source. A quick search shows me this repo. I have not tested this but might give you a start.

Distribute pods for a deployment across different node pools

In my GKE Kubernetes cluster, I have 2 node pools; one with regular nodes and the other with pre-emptible nodes. I'd like some of the pods to be on pre-emptible nodes so I can save costs while I have at least 1 pod on a regular non-pre-emptible node to reduce the risk of downtime.
I'm aware of using podAntiAffinity to encourage pods to be scheduled on different nodes, but is there a way to have k8s schedule pods for a single deployment across both pools?
Yes 💡! You can use Pod Topology Spread Constraints, based on a label 🏷️ key on your nodes. For example, the label could be type and the values could be regular and preemptible. Then you can have something like this:
kind: Pod
apiVersion: v1
metadata:
  name: mypod
  labels:
    foo: bar
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: type
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        foo: bar
  containers:
  - name: app
    image: myimage
You also set maxSkew, which is the maximum allowed difference in the number of matching pods between any two label values (here, node types).
You can also combine multiple 'Pod Topology Spread Constraints' and also together with PodAffinity/AntiAffinity and NodeAffinity. All depending on what best fits your use case.
Note: This feature was alpha in 1.16 and beta in 1.18, and has been stable since 1.19. Beta features are enabled by default, but with alpha features you need an alpha cluster in GKE.
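Since the question is about a Deployment, here is a sketch of the same constraint placed in a Deployment's pod template. It assumes every node in both pools has been given the type label (for example through node pool labels); the Deployment name and replica count are hypothetical. With DoNotSchedule, nodes that lack the type label are not eligible for these pods.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp                   # hypothetical name
spec:
  replicas: 4                   # with maxSkew 1 over two values: 2 pods per type
  selector:
    matchLabels:
      foo: bar
  template:
    metadata:
      labels:
        foo: bar
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: type       # node label with values regular / preemptible
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            foo: bar
      containers:
      - name: app
        image: myimage
```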

Two pods force deploy to different ICP workers

I have a Kubernetes cluster on IBM Cloud Private with two workers.
I have one deployment which creates two pods. How can I force the deployment to place its pods on two different workers? That way, if I lose one ICP worker, I will always have the other one with the needed pod.
If you want pods to not schedule on the same node, the correct concept that you will want to use is inter-pod anti-affinity. https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#inter-pod-affinity-and-anti-affinity-beta-feature
Observe:
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - my-app
            topologyKey: kubernetes.io/hostname
You can create your pods as a Kubernetes DaemonSet. A DaemonSet ensures that all (or some) Nodes run a copy of a Pod. See the link below for details.
https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/
In addition to @Santiclause's answer: affinity scheduling rules come in two different modes:
requiredDuringSchedulingIgnoredDuringExecution
preferredDuringSchedulingIgnoredDuringExecution.
With requiredDuringSchedulingIgnoredDuringExecution, all rules must be met for a pod to be scheduled.
If, for example, there are not enough nodes to place all pods, the remaining pods stay Pending until enough nodes become available.
With preferredDuringSchedulingIgnoredDuringExecution, the scheduler tries to place each replica on the node with the highest score, computed from the combination of the defined rules and their weights.
Weight is a parameter attached to a rule, and each rule can have a different weight. The score for a node is calculated with the following logic:
For every node, we iterate through the rules defined in the configuration (i.e. resource request, requiredDuringScheduling, affinity expressions, etc.). If a rule is matched, we add its weight value to the score for that node. Once all rules for all nodes are processed, we have a list of all nodes with their final scores. The node(s) with the highest score are the most preferred.
To summarize, a higher weight increases the importance of a rule and helps the scheduler decide which node to choose.
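To make the role of weights concrete, here is a sketch with two soft anti-affinity rules for the app: my-app pods from the first answer. The weights (100 and 50) are arbitrary illustrative values within the allowed 1-100 range, and the zone key assumes nodes carry the standard topology.kubernetes.io/zone label. The intent is that avoiding a node which already runs a matching pod matters more than avoiding a zone which already runs one.

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100                       # stronger preference: a different node
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: my-app
        topologyKey: kubernetes.io/hostname
    - weight: 50                        # weaker preference: a different zone
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: my-app
        topologyKey: topology.kubernetes.io/zone
```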

Ignore pod anti affinity during deployment update

Imagine a Master-Node-Node setup where you deploy a service with pod anti-affinity on the Nodes: an update of the Deployment causes another pod to be created, but the scheduler cannot schedule it because both Nodes are already excluded by the anti-affinity rule.
Q: How could one more flexibly set the anti-affinity to allow the update?
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - api
      topologyKey: kubernetes.io/hostname
With an error
No nodes are available that match all of the following predicates:: MatchInterPodAffinity (2), PodToleratesNodeTaints (1).
Look at Max Surge
If you set Max Surge = 0, you are telling Kubernetes that you won't allow it to create more pods than the number of replicas you have setup for the deployment. This basically forces Kubernetes to remove a pod before starting a new one, and thereby making room for the new pod first, getting you around the podAntiAffinity issue. I've utilized this mechanism myself, with great success.
Config example
apiVersion: apps/v1  # extensions/v1beta1 no longer serves Deployments since Kubernetes 1.16
kind: Deployment
...
spec:
  replicas: <any number larger than 1>
  ...
  strategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
  ...
  template:
    ...
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - api
            topologyKey: kubernetes.io/hostname
Warning: Don't do this if you only have one replica, as it will cause downtime because the only pod will be removed before a new one is added. If you have a huge number of replicas, which makes deployments slow because Kubernetes can only upgrade 1 pod at a time, you can raise maxUnavailable to let Kubernetes remove more pods at once.
Here are a few methods which may work:
Modify the 'Max Unavailable' parameter of your deployment's rollingUpdate strategy to be at least 1, which allows one pod of the old deployment to be destroyed immediately, making room for one pod of the new deployment to be created.
Max Unavailable
.spec.strategy.rollingUpdate.maxUnavailable is an optional field that specifies the maximum number of Pods that can be unavailable during the update process. The value can be an absolute number (for example, 5) or a percentage of desired Pods (for example, 10%). The absolute number is calculated from percentage by rounding down. The value cannot be 0 if .spec.strategy.rollingUpdate.maxSurge is 0. The default value is 25%.
For example, when this value is set to 30%, the old ReplicaSet can be scaled down to 70% of desired Pods immediately when the rolling update starts. Once new Pods are ready, old ReplicaSet can be scaled down further, followed by scaling up the new ReplicaSet, ensuring that the total number of Pods available at all times during the update is at least 70% of the desired Pods.
Max Surge
Modify the deployment pod spec to use "soft" anti-affinity instead of "hard".
Pod Anti-Affinity
There are currently two types of node affinity, called requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution. You can think of them as “hard” and “soft” respectively, in the sense that the former specifies rules that must be met for a pod to be scheduled onto a node (just like nodeSelector but using a more expressive syntax), while the latter specifies preferences that the scheduler will try to enforce but will not guarantee.
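For the second method, a sketch of the hard rule from the question converted to its soft form. The weight of 100 is an assumed value. The trade-off is that the extra pod created during an update may land on a node that already runs an api pod, and the scheduler does not move it afterwards.

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100                 # assumed weight; valid range is 1-100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - api
        topologyKey: kubernetes.io/hostname
```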