Schedule Kubernetes pods in the same failure zone - kubernetes

We have a deployment with a large replicas number ( > 1 ) that we must deploy in the same zone.
We stumbled upon this documentation section: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#an-example-of-a-pod-that-uses-pod-affinity
which explains how to schedule pods in zones that already have other pods that match certain labels.
however, there are no other pods that our deployment depends upon. all other workloads are replicated and spread across multiple zones, and this is the first deployment that we would like to keep in a single zone.
also, we thought about explicitly setting the zone for this deployment, but in case of zone failure, it will become unavailable until we notice and explicitly set it to another zone. so setting the exact zone won't work here.
any insights here? and thanks!

Pod Affinity affects how the pod is scheduled based on the presence or absence of other pods within the node. That would probably not serve the purpose you're trying to achieve.
You're probably better off using node affinity (it's on the same link you provided)
That would allow you to force to a zone, because each GKE node will have a failure-domain label which you can get doing this and looking through the results:
kubectl get node {name-of-node} -o json | jq ".metadata.labels"
The labels will read something like this:
"failure-domain.beta.kubernetes.io/region": "europe-west2",
"failure-domain.beta.kubernetes.io/zone": "europe-west2-b",
You can then combine this with nodeAffinity in your deployment yaml (parts snipped for brevity):
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
...
annotations:
...
name: my-deployment
spec:
replicas: 1
strategy: {}
selector:
matchLabels:
...
template:
metadata:
annotations:
...
labels:
...
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: failure-domain.beta.kubernetes.io/zone
operator: In
values:
- europe-west2-b
This will force the pods generated by the deployment to all go onto nodes sitting in europe-west2-b
I could change this and make it like this:
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: failure-domain.beta.kubernetes.io/zone
operator: In
values:
- europe-west2-b
- europe-west2-c
To allow it schedule in two zones (but it would not be able to schedule on to the europe-west2-a zone as a consequence)

I do not think there is a direct way to achieve this. I can think of two ways this can work.
Using Affinity on Pod and Node
Adding node affinity with preferredDuringSchedulingIgnoredDuringExecution for the regions you would want to target.
Adding pod affinity to itself with preferredDuringSchedulingIgnoredDuringExecution for pods to prefer to be with each other.
With this what should happen is when the first pod is about to be spun up it would match none of its preferred affinity but the scheduler will still schedule it. But once one is running for the rest of the pod there will be a pod with the correct affinity and they should all spin up. The challenge is there is a possibility of a race condition where multiple pods try to get scheduled and scheduler puts them in different locations once your first preferred zone is out.
Using Webhooks
You can use some mutating webhook to check the node label and add requiredDuringSchedulingIgnoredDuringExecution affinity to pods based on what zones you have still available.
The challenge here is you would most likely need to write and maintain this webhook yourself. I am not sure if you will find your exact usecase solved by someone else in open source. A quick search shows me this repo. I have not tested this but might give you a start.

Related

Best practice spread pod across the node evenly

I have 4 node with different zone:
Node A : zone a
Node B : zone b
Node C : zone c
Node D : zone c
I want to spread the pod to Node A, B and C. I have Deployment that have 3 replicas to spread across those node, each pod each node. My deployments using kustomization and ArgoCD to deploy. Using the topologySpreadConstraint need to be update the label but labels are immutable on this case.
Current deployment condition using this
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-apps
spec:
replicas: 3
revisionHistoryLimit: 0
template:
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: app
operator: In
values:
- my-apps
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: my-apps
version: v1
...
I've done add label for those 3 nodes and this configuration works well at first time. But when it comes to update the deployment and rolling update, The pods on nodes going to imbalance.
zone a : 2 pod
zone b : 1 pod
zone c : 0 pod
I've done playing with podAntiAffinity but its return as pending if I use hard affinity and still imbalance if I use soft affinity. Any suggestion best practice for this case? did I missed something?
Rolling updates cause some pods to be removed from nodes and others to be added. This can cause the pods on nodes to become imbalanced, as the pods that were already on the nodes will remain, but the pods that are added during the update will likely be different. To prevent this from happening, it is important to use the maxUnavailable option in the rolling update strategy. This allows you to specify the maximum number of pods that can be removed from a node during the rolling update, ensuring that the pods on each node remain balanced.
kubectl apply -f deployment.yaml --strategy=RollingUpdate --strategy-rolling-update-maxUnavailable=1
This command will create or update a deployment with the rolling update strategy, and the maxUnavailable option set to 1.This will ensure that no more than 1 pod is removed from a node during the rolling update, thus keeping the pods across nodes balanced.Try it and let me know if this works
If you are scaling down the pods, as per official doc limitations:
There's no guarantee that the constraints remain satisfied when Pods are removed. For example, scaling down a Deployment may result in an imbalanced Pods distribution.You can use a tool such as the Descheduler to rebalance the Pods distribution.

How can I pin pods of a statefulset to a specific availibility zones

I'm trying to understand, if it is possible ( and how ) to pin pods of statefulsets to specific availibility zones. In the example above I would like to explicitly configure, that elastic pod 1 runs in availibility zone 1, pod 2 in availibility zone 2 and so forth. I also don't want the pods to run outside of their availibility zones, if one goes down.
I read this doc on the matter. If I understand it correctly, I can only specify, that a statefulset shouldn't run it's pods in the same availibility zone, but not, that it always runs the pod in a specific availibility zone.
Thanks to anyone who can educate me in this matter.
aws overview
You can use Pod Topology Spread constraints.
topology spread constraints control how Pods are spread across your
cluster among failure-domains such as regions, zones, nodes, and other
user-defined topology domains. This can help to achieve high
availability as well as efficient resource utilization.
Taking the example from here
kind: Pod
apiVersion: v1
metadata:
name: mypod
labels:
foo: bar
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
foo: bar
containers:
- name: pause
image: k8s.gcr.io/pause:3.1

Statefulset's replica scheduling questions

Is there anyway I can tell Kuberbetes how to schedule the replicas in the statefulset? For example, I have nodes divided into 3 different availability zones (AZ). I have labeled these nodes accordingly. Now I want K8s to put 1 replica in each AZ based on node label. Thanks
Pods will always try to be scheduled across different nodes, to achieve what you are looking for you can try to use DaemonSet, which will allow only one of these kind of pods in each node.
Also, you can use anti affinity based on the already scheduled pods in that node.
https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
The feature you are looking for is called Pod Anti-Affinity and can be specified as such:
apiVersion: apps/v1
kind: StatefulSet
[..]
spec:
template:
spec:
affinity:
nodeAffinity: {}
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- podAffinityTerm:
labelSelector:
matchLabels:
app: myapp
topologyKey: az
weight: 100
[..]
Since Kubernetes 1.18, there is also Pod Topology Spread Constraints, which is a nicer way to specify these anti-affinity rules.

Two pods force deploy to different ICP workers

There is a cluster Kubernetes and IBM Cloud Private with two workers.
I have one deployment which creates two pods. How can I force deployment to install its pods on two different workers? In this case if I lost one icp worker I always have other with need pod.
If you want pods to not schedule on the same node, the correct concept that you will want to use is inter-pod anti-affinity. https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#inter-pod-affinity-and-anti-affinity-beta-feature
Observe:
spec:
replicas: 2
selector:
app: my-app
template:
metadata:
labels:
app: my-app
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- my-app
topologyKey: kubernetes.io/hostname
You can create your pods as kubernetes DaemonSet. A DaemonSet ensures that all (or some) Nodes run a copy of a Pod. You can access below link to see details.
https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/
In addition to #Santiclause answer regarding scheduling policy in affinity mode there are two different mods of affinity:
requiredDuringSchedulingIgnoredDuringExecution
preferredDuringSchedulingIgnoredDuringExecution.
While using requiredDuringSchedulingIgnoredDuringExecution affinity scheduler we need to make sure that all rules are met for a pod to be scheduled.
If you will have i.e. not enough nodes to spawn all pods the scheduler will wait forever until there will be enough nodes available.
If you use preferredDuringSchedulingIgnoredDuringExecution affinity scheduler it will try to spawn all replicas based on the highest score the nodes gets from the combination of defined rules and their weight.
Weight is a parameter used along with a rule, each rule can have a different weight. In order to calculate a Score for a node we use following logic:
For every node, we iterate through rules defined in the configuration (i.e. resource request, requiredDuringScheduling, affinity expressions, etc.). In case the rule is matched we add the weight value to the score for that node. Once all rules for all nodes are processed we will have a list of all nodes with their final score. The node(s) with the highest score are the most preferred.
Just to summarize, higher weight value will increase importance of a rule and will help scheduler to decide which node to choose.

How to specify pod to node affinity in kubernetes

How can I configure a specific pod to run on a multi-node kubernetes cluster so that it would restrict the containers of the POD to a subset of the nodes.
E.g. let's say I have A, B, C three nodes running mu kubernetes cluster.
How to limit a Pod to run its containers only on A & B, and not on C?
You can add label to nodes that you want to run pod on and add nodeSelector to pod configuration. The process is described here:
http://kubernetes.io/docs/user-guide/node-selection/
So basically you want to
kubectl label nodes A node_type=foo
kubectl label nodes B node_type=foo
And you want to have this nodeSelector in your pod spec:
nodeSelector:
node_type: foo
Firstly, you need to add label to nodes. You can refer to Nebril's answer of how to add label.
Then I recommend you to use the node affinity feature to constrain pods to nodes with particular labels. Here is an example.
apiVersion: v1
kind: Pod
metadata:
name: with-node-affinity
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: node_type
operator: In
values:
- foo
containers:
- name: with-node-affinity
image: gcr.io/google_containers/pause:2.0
Compared to nodeSelector, the affinity/anti-affinity feature greatly expands the types of constraints you can express.
For more detailed information about node affinity, you can refer to the following link: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
I am unable to post a comment to previous replies but I upvoted the answer that is complete.
nodeSelector is not strictly enforced and relying on it may cause some grief including Master getting overwhelmed with requests as well as IOs from pods scheduled on Master. At the time the answer was provided it may still have been the only option but in the later versions that is not so.
Definitely use nodeAffinity or nodeAntiAffinity with requiredDuringSchedulingIgnoredDuringExecution or preferredDuringSchedulingIgnoredDuringExecution
For more expressive filters for scheduling peruse: Assigning pods to nodes