Is there any way I can tell Kubernetes how to schedule the replicas in a StatefulSet? For example, I have nodes divided into 3 different availability zones (AZs), and I have labeled these nodes accordingly. Now I want K8s to put 1 replica in each AZ based on the node label. Thanks
By default the scheduler tries to spread a workload's pods across different nodes. To achieve what you are looking for, you can try a DaemonSet, which runs exactly one pod of this kind on each node (note that this is per node, not per AZ).
Also, you can use pod anti-affinity, which places a pod based on the pods that are already scheduled on a node.
https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
The feature you are looking for is called Pod Anti-Affinity and can be specified as such:
apiVersion: apps/v1
kind: StatefulSet
[..]
spec:
  template:
    spec:
      affinity:
        nodeAffinity: {}
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: myapp
              topologyKey: az
            weight: 100
[..]
Since Kubernetes 1.18, there is also Pod Topology Spread Constraints, which is a nicer way to specify these anti-affinity rules.
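As a rough sketch of that alternative, assuming the nodes carry the same `az` label and the pods the same `app: myapp` label as in the anti-affinity example, the constraint goes into the pod template. The StatefulSet name, the `serviceName`, the replica count and the image are placeholders, not taken from the question:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: myapp                  # hypothetical name
spec:
  serviceName: myapp           # hypothetical headless Service
  replicas: 3                  # assumed: one replica per AZ
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      topologySpreadConstraints:
      - maxSkew: 1                         # per-AZ pod counts may differ by at most 1
        topologyKey: az                    # node label that identifies the AZ
        whenUnsatisfiable: DoNotSchedule   # hard rule; ScheduleAnyway makes it a preference
        labelSelector:
          matchLabels:
            app: myapp
      containers:
      - name: app
        image: myimage                     # placeholder image
```

With 3 replicas and 3 labeled zones this should give one replica per zone, provided each zone has a node the pod can actually run on.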
Related
We have a deployment with multiple replicas (> 1) that we must deploy in the same zone.
We stumbled upon this documentation section: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#an-example-of-a-pod-that-uses-pod-affinity
which explains how to schedule pods in zones that already have other pods that match certain labels.
However, there are no other pods that our deployment depends upon. All other workloads are replicated and spread across multiple zones, and this is the first deployment that we would like to keep in a single zone.
We also thought about explicitly setting the zone for this deployment, but in case of a zone failure it would become unavailable until we notice and explicitly move it to another zone, so setting the exact zone won't work here.
Any insights here? Thanks!
Pod Affinity affects how the pod is scheduled based on the presence or absence of other pods within the node. That would probably not serve the purpose you're trying to achieve.
You're probably better off using node affinity (it's described on the same link you provided).
That would allow you to pin the pods to a zone, because each GKE node has a failure-domain label, which you can find by running this and looking through the results:
kubectl get node {name-of-node} -o json | jq ".metadata.labels"
The labels will read something like this:
"failure-domain.beta.kubernetes.io/region": "europe-west2",
"failure-domain.beta.kubernetes.io/zone": "europe-west2-b",
You can then combine this with nodeAffinity in your deployment yaml (parts snipped for brevity):
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    ...
  annotations:
    ...
  name: my-deployment
spec:
  replicas: 1
  strategy: {}
  selector:
    matchLabels:
      ...
  template:
    metadata:
      annotations:
        ...
      labels:
        ...
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: failure-domain.beta.kubernetes.io/zone
                operator: In
                values:
                - europe-west2-b
This will force all the pods generated by the deployment onto nodes sitting in europe-west2-b.
I could change this and make it like this:
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: failure-domain.beta.kubernetes.io/zone
            operator: In
            values:
            - europe-west2-b
            - europe-west2-c
This allows it to schedule in two zones (but, as a consequence, it would not be able to schedule onto the europe-west2-a zone).
I do not think there is a direct way to achieve this. I can think of two ways this can work.
Using Affinity on Pod and Node
Adding node affinity with preferredDuringSchedulingIgnoredDuringExecution for the regions you would want to target.
Adding pod affinity to itself with preferredDuringSchedulingIgnoredDuringExecution for pods to prefer to be with each other.
With this, when the first pod is about to be spun up it matches none of its preferred pod affinity, but the scheduler will still schedule it. Once one pod is running, the remaining pods see a pod with the matching label and should all spin up alongside it. The challenge is a possible race condition: if several pods are scheduled at the same time after your first preferred zone goes down, the scheduler may put them in different locations.
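A minimal sketch of this first approach, untested and under stated assumptions: the Deployment name, the `app: single-zone-app` label, the weights, the image and the preferred zone are all made up for illustration. It uses the `topology.kubernetes.io/zone` node label; older clusters expose the same information as `failure-domain.beta.kubernetes.io/zone`.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: single-zone-app        # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: single-zone-app
  template:
    metadata:
      labels:
        app: single-zone-app
    spec:
      affinity:
        nodeAffinity:
          # Soft preference for one zone; other zones remain allowed as a fallback.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 50
            preference:
              matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - europe-west2-b           # assumed preferred zone
        podAffinity:
          # Soft preference to land in a zone that already runs a pod of this Deployment.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: single-zone-app
              topologyKey: topology.kubernetes.io/zone
      containers:
      - name: app
        image: myimage                     # placeholder image
```

Because both rules are only preferences, the scheduler can still split the pods across zones under pressure, which is the race condition described above.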
Using Webhooks
You can use a mutating webhook to check the node labels and add a requiredDuringSchedulingIgnoredDuringExecution affinity to pods based on which zones you still have available.
The challenge here is that you would most likely need to write and maintain this webhook yourself. I am not sure you will find your exact use case solved by someone else in open source. A quick search shows me this repo. I have not tested it, but it might give you a start.
I have a pod that has both node affinity and pod affinity. Could someone help me understand how things would behave in such a scenario?
Node 1:
  label:
    schedule-on: gpu
Node 2:
  label:
    schedule-on: gpu
Node 3:
  label:
    schedule-on: non-gpu
Manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/name: test
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app.kubernetes.io/name: test
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: schedule-on
                operator: In
                values:
                - gpu
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                - test
            topologyKey: schedule-on
The output of the above is:
Pods are getting scheduled on different nodes [node1, node2]
Ideal output: both pods should be scheduled on the same node [node1]
Here is my finding.
Finding 1: I believe node affinity is taking precedence and pod affinity is getting ignored
Both rules apply together: node affinity and pod affinity must each be satisfied. Since node1 and node2 share the same value for the topology key (schedule-on: gpu), they belong to the same topology domain. The pods are therefore co-located at the domain level, but can still be scheduled on different nodes within that domain.
When matching the topology key to place a pod, the value of the key is also considered.
I would like to add some definitions to your answer.
First of all, from this article:
Node affinity is a set of rules. It is used by the scheduler to decide where a pod can be placed in the cluster. The rules are defined using labels on nodes and label selectors specified in pods definition. Node affinity allows a pod to specify an affinity towards a group of nodes it can be scheduled on.
And from here:
Pod affinity can tell the scheduler to locate a new pod on the same node as other pods if the label selector on the new pod matches the label on the current pod.
In your scenario, as you wrote, both pods have the same topology key domain. This places them in the same co-location domain, so the pods can be scheduled on different nodes while still counting as co-located.
Here are also some words about topologyKey:
TopologyKey is the key of node labels. If two Nodes are labelled with this key and have identical values for that label, the scheduler treats both Nodes as being in the same topology. The scheduler tries to place a balanced number of Pods into each topology domain.
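If the goal is to have both pods on the same node, one option that follows from the topologyKey definition is to use a key whose value is unique per node. The fragment below is an untested sketch of the `affinity` section from the question's manifest with only the topology key changed to `kubernetes.io/hostname`, a label Kubernetes sets on every node:

```yaml
affinity:
  nodeAffinity:
    # Unchanged from the question: only nodes labeled schedule-on=gpu are eligible.
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: schedule-on
          operator: In
          values:
          - gpu
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app.kubernetes.io/name
          operator: In
          values:
          - test
      # Each node is its own topology domain, so matching pods must share a node.
      topologyKey: kubernetes.io/hostname
```

Keep in mind that a required rule like this leaves later replicas Pending if the chosen node runs out of capacity.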
I have an EKS node group with 2 nodes for compute workloads. I use a taint on these nodes and tolerations in the deployment. I have a deployment with 2 replicas, and I want these two pods to be spread across the two nodes, one pod on each node.
I tried using:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 1
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - appname
One pod is placed on each node, but if I update the deployment file, for example by changing its image name, it fails to schedule a new pod.
I also tried:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: type
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      type: compute
but the pods aren't spread evenly; for example, both pods end up on one node.
Try adding:
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
By default K8s tries to scale the new ReplicaSet up first, before it starts scaling the old replicas down. Since it cannot schedule the new replicas (because of the anti-affinity rule), they are stuck in the Pending state.
Once you set the deployment's maxSurge=0, you tell K8s that you don't want the deployment to scale up first during an update. As a result it can only scale down, making room for the new replicas to be scheduled.
Setting maxUnavailable=1 tells K8s to replace only one pod at a time.
You can use a DaemonSet instead of a Deployment. A DaemonSet ensures that all (or some) Nodes run a copy of a Pod. As nodes are added to the cluster, Pods are added to them. As nodes are removed from the cluster, those Pods are garbage collected. Deleting a DaemonSet will clean up the Pods it created.
See the documentation for DaemonSet
I was having the same problem with pods failing to schedule and getting stuck in pending state while rolling out new versions while my goal was to run exactly 3 pods at all times, 1 on each of the 3 available nodes.
That means I could not use maxUnavailable: 1 because that would temporarily result in less than 3 pods during the rollout.
Instead of using the app name label for matching anti-affinity, I ended up using a label with a random value ("version") on each deployment. This means new deployments will happily schedule pods to nodes where a previous version is still running, but the new versions will always be spread evenly.
Something like this:
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 3
  template:
    metadata:
      labels:
        deploymentVersion: v1
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: deploymentVersion
                operator: In
                values:
                - v1
            topologyKey: "kubernetes.io/hostname"
v1 can be anything that's a valid label and changes on every deployment attempt.
I'm using envsubst to have dynamic variables in yaml files:
DEPLOYMENT_VERSION=$(date +%s) envsubst < deploy.yaml | kubectl apply -f -
And then the config looks like this:
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 3
  template:
    metadata:
      labels:
        deploymentVersion: v${DEPLOYMENT_VERSION}
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: deploymentVersion
                operator: In
                values:
                - v${DEPLOYMENT_VERSION}
            topologyKey: "kubernetes.io/hostname"
I wish Kubernetes offered a more straightforward way to achieve this.
Goal : Have one pod (namely 'log-scraper') get scheduled on every node at least once but no more than once
Assume a cluster has the following nodes
Nodes
master/control-plane
worker-1
worker-2
worker-3
Pod I'm working with
apiVersion: v1
kind: Pod
metadata:
  name: log-scraper
spec:
  volumes:
  - name: container-log-dir
    hostPath:
      path: /var/log/containers
  containers:
  - image: "logScraper:latest"
    name: log-munger
    volumeMounts:
    - name: container-log-dir
      mountPath: /var/log/logging-app
Adding affinity to select only 'worker' nodes (or non-master nodes)
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: "worker"
          operator: In
          values:
          - "true"
Question 1: How do I ensure every node runs ONE-AND-ONLY-ONE pod of type log-scraper
Question 2: What other manifests should be applied/added to achieve this?
You should probably use a DaemonSet, which is made exactly for this purpose: it schedules one pod per node, and pods are automatically added to new nodes when the cluster autoscales.
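As an untested sketch, the Pod from the question could be wrapped in a DaemonSet roughly like this. The `name: log-scraper` pod label is introduced here only so that the selector has something to match; the node affinity, volume and container are copied from the question:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-scraper
spec:
  selector:
    matchLabels:
      name: log-scraper          # hypothetical label, must match the template labels
  template:
    metadata:
      labels:
        name: log-scraper
    spec:
      affinity:
        nodeAffinity:
          # Restrict the DaemonSet to nodes labeled worker=true, as in the question.
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: worker
                operator: In
                values:
                - "true"
      volumes:
      - name: container-log-dir
        hostPath:
          path: /var/log/containers
      containers:
      - name: log-munger
        # Image copied from the question; registries normally require lowercase names.
        image: "logScraper:latest"
        volumeMounts:
        - name: container-log-dir
          mountPath: /var/log/logging-app
```

The DaemonSet controller creates exactly one such pod on every eligible node, so no replica count or anti-affinity rule is needed.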
Concept
There are two important things when it comes to assigning Pods to Nodes - "Affinity" and "AntiAffinity".
Affinity will basically select based on given criteria while anti-affinity will avoid based on given criteria.
With affinity and anti-affinity, you can use operators like In, NotIn, Exists, DoesNotExist, Gt and Lt (Gt and Lt apply to node affinity only). When you use NotIn or DoesNotExist, the rule behaves as anti-affinity.
Now, in Affinity/Antiaffinity, you have 2 choices - Node affinity/antiaffinity and Inter-pod affinity/antiaffinity
Node affinity/antiaffinity
Node affinity is conceptually similar to nodeSelector -- it allows you to constrain which nodes your pod is eligible to be scheduled on, based on labels on the node.
Inter-pod affinity/antiaffinity
Inter-pod affinity and anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled based on labels on pods that are already running on the node rather than based on labels on nodes.
Your Solution
Basically what you need is anti-affinity, and specifically pod anti-affinity rather than node anti-affinity. Your solution should look something like the below (please note that I do not have a 3-node cluster, so I couldn't test this; there is a small chance you will need minor adjustments):
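affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    # Each entry is a term with its own labelSelector and a mandatory topologyKey.
    - labelSelector:
        matchExpressions:
        # The pod itself must carry this label (worker: log-scraper) for the rule to match.
        - key: worker
          operator: In
          values:
          - log-scraper
      # One topology domain per node, so at most one matching pod per node.
      topologyKey: kubernetes.io/hostname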
Read more over here, and especially go through example over here.
Using Pod Topology Spread Constraints
Another way to do it is using Pod Topology Spread Constraints.
You will set up taints and tolerations as usual to control which nodes the pods can be scheduled on. Then add some labels to the pod; I will use the pod label id: foo-bar in the example. Then, to spread the pods of a ReplicaSet, Deployment or other controller so that each node gets a single one, add the following to the pod spec.
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      id: foo-bar
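topologyKey is the key of a node label; kubernetes.io/hostname is a default label set on every node. Put the pod labels inside matchLabels. Create the resources, and kube-scheduler should place at most one matching pod per node as long as there are at least as many eligible nodes as replicas. With more replicas than nodes, maxSkew: 1 only keeps the per-node counts within one of each other.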
To learn more, check out the documentation here and also this excellent blog post.
In my GKE Kubernetes cluster, I have 2 node pools; one with regular nodes and the other with pre-emptible nodes. I'd like some of the pods to be on pre-emptible nodes so I can save costs while I have at least 1 pod on a regular non-pre-emptible node to reduce the risk of downtime.
I'm aware of using podAntiAffinity to encourage pods to be scheduled on different nodes, but is there a way to have k8s schedule pods for a single deployment across both pools?
Yes 💡! You can use Pod Topology Spread Constraints, based on a label 🏷️ key on your nodes. For example, the label could be type and the values could be regular and preemptible. Then you can have something like this:
kind: Pod
apiVersion: v1
metadata:
  name: mypod
  labels:
    foo: bar
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: type
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        foo: bar
  containers:
  - name: app
    image: myimage
The maxSkew value sets the maximum allowed difference in the number of matching pods between any two values of the label (here, the node types).
You can also combine multiple 'Pod Topology Spread Constraints' and also together with PodAffinity/AntiAffinity and NodeAffinity. All depending on what best fits your use case.
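As an illustration of combining constraints, here is an untested sketch of a Deployment that spreads its pods across the two pools and, as a softer preference, across individual nodes. It assumes the nodes were labeled with `type` (regular or preemptible) as suggested above; the Deployment name and replica count are made up:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp                    # hypothetical name
spec:
  replicas: 4                    # assumed replica count
  selector:
    matchLabels:
      foo: bar
  template:
    metadata:
      labels:
        foo: bar
    spec:
      topologySpreadConstraints:
      # Hard rule: pod counts in the two pools may differ by at most 1.
      - maxSkew: 1
        topologyKey: type
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            foo: bar
      # Soft rule: also prefer an even spread across individual nodes.
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            foo: bar
      containers:
      - name: app
        image: myimage
```

Nodes that lack the `type` label are not counted as a topology domain for the first constraint, so both pools need the label for the split to work.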
Note: This feature is alpha in 1.16 and beta in 1.18. Beta features are enabled by default but with alpha features, you need an alpha cluster in GKE.
☮️✌️