Let's say we use the requiredDuringSchedulingIgnoredDuringExecution type of node affinity and specify some node selector terms, and a pod gets created according to those terms. But what if the pod gets evicted and is rescheduled?
Would those requiredDuringSchedulingIgnoredDuringExecution node affinity/node selector terms be taken into account when the new pod is created?
The requiredDuringSchedulingIgnoredDuringExecution type of nodeAffinity, unofficially called a "hard" rule, guarantees that the pod will be created/RECREATED only on nodes matching the rules you defined.
Example from official kubernetes documentation:
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/e2e-az-name
            operator: In
            values:
            - e2e-az1
            - e2e-az2
The spec above will always (re)create the pod ONLY on nodes whose kubernetes.io/e2e-az-name label has the value e2e-az1 or e2e-az2.
Scheduling is done by the Kubernetes scheduler, and every time it schedules or reschedules a pod it considers the affinity and anti-affinity defined in the pod spec.
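For completeness, a minimal full Pod manifest carrying the same rule could look like the sketch below; the pod name and the pause image are placeholders, not from the question:
apiVersion: v1
kind: Pod
metadata:
  name: az-pinned-pod                     # placeholder name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/e2e-az-name
            operator: In
            values:
            - e2e-az1
            - e2e-az2
  containers:
  - name: main
    image: k8s.gcr.io/pause:2.0           # placeholder image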
I have a nodepool in AKS with autoscaling enabled. Here is a simple scenario I am trying to achieve.
I have 1 node running a single pod (A) with label deployment=x. I have another pod (B) with a podAntiAffinity rule to avoid nodes running pods with deployment=x label.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: deployment
          operator: In
          values:
          - x
      topologyKey: kubernetes.io/hostname
The behavior I am seeing is that Pod B gets scheduled to the same node as Pod A. I would have expected Pod B to stay in the Pending state until the autoscaler added a new node that satisfies the podAntiAffinity rule, and for Pod B to then be scheduled to that new node. Is this possible to do?
Kubernetes Version: 1.22.6
-- EDIT --
This example does trigger a node scale-up, so it's doing what I expect it to do, and it is working for me.
The podAntiAffinity example I posted works as expected.
You should change the key you search on. As I understand it, when you select key: deployment you are looking at deployments, which is not related to the node. Try this example:
podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchExpressions:
      - key: app.kubernetes.io/instance
        operator: In
        values:
        - name-of-deployment
    topologyKey: "kubernetes.io/hostname"
I cannot really understand the purpose and usage of topologyKey in pod affinity. The documentation says:
topologyKey is the key of node labels. If two Nodes are labelled with this key and have identical values for that label, the scheduler treats both Nodes as being in the same topology. The scheduler tries to place a balanced number of Pods into each topology domain.
And example usage is as follows:
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: topology.kubernetes.io/zone
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S2
          topologyKey: topology.kubernetes.io/zone
  containers:
  - name: with-pod-affinity
    image: k8s.gcr.io/pause:2.0
So where does topology.kubernetes.io/zone come from? How can I know what value I should provide for the topologyKey field in my YAML file, and what happens if I just put a random string here? Should I label my nodes and use the key of that label in the topologyKey field?
Thank you.
Required as part of an affinity.podAffinity or affinity.podAntiAffinity spec section, the topologyKey field is used by the scheduler to determine the domain for Pod placement.
The topologyKey domain determines the placement of the Pods being scheduled relative to the Pods identified by the ...labelSelector.matchExpressions section.
With podAffinity, a Pod will be scheduled in the same domain as the Pods that match the expression.
Two common label options are topology.kubernetes.io/zone and kubernetes.io/hostname. Others can be found in the Kubernetes Well-Known Labels, Annotations and Taints documentation.
topology.kubernetes.io/zone: Pods will be scheduled in the same zone as a Pod that matches the expression.
kubernetes.io/hostname: Pods will be scheduled on the same hostname as a Pod that matches the expression.
For podAntiAffinity, the opposite is true: Pods will not be scheduled in the same domain as the Pods that match the expression.
The Kubernetes documentation Assigning Pods to Nodes (Inter-pod affinity and anti-affinity section) provides additional explanation.
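For instance, a minimal podAntiAffinity sketch that keeps matching Pods on different hostnames could look like the following; the pod label key app and value my-app are assumptions, not from the question:
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app             # assumed pod label key
          operator: In
          values:
          - my-app             # assumed pod label value
      topologyKey: kubernetes.io/hostname   # the domain is an individual node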
A topology key is effectively just a label that you assign to your nodes or that a cloud provider has already assigned.
The intent is to indicate certain topology characteristics, for example the availability zone or server rack, but they are actually arbitrary.
It is documented here.
For example, say you want to spread pods across 3 different availability zones. A topology key can help you achieve this, as it prevents them from being randomly scheduled in the same zone.
Here are 2 examples from the docs:
For example, you could use requiredDuringSchedulingIgnoredDuringExecution affinity to tell the scheduler to co-locate Pods of two services in the same cloud provider zone because they communicate with each other a lot. Similarly, you could use preferredDuringSchedulingIgnoredDuringExecution anti-affinity to spread Pods from a service across multiple cloud provider zones.
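As a sketch of the second case (spreading a service's Pods across zones with a soft anti-affinity rule), the pod label app: my-service below is an assumption:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: my-service                          # assumed label of the service's pods
        topologyKey: topology.kubernetes.io/zone     # prefer placing replicas in different zones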
topology.kubernetes.io/zone will be set only if you are using a cloud provider. However, you should consider setting this label on nodes yourself if it makes sense in your topology.
It is documented here.
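For example, to set the label manually on a node (the node name worker-1 and the zone value zone-a are placeholders):
# Label one node with a zone; repeat with different values for the other nodes
kubectl label node worker-1 topology.kubernetes.io/zone=zone-a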
# This is an example of the values from an AWS cluster
❯ k get nodes --show-labels | awk '{print $6}' | tr ',' '\n' | grep topology
topology.kubernetes.io/region=eu-central-1
topology.kubernetes.io/zone=eu-central-1a
topology.kubernetes.io/region=eu-central-1
topology.kubernetes.io/zone=eu-central-1a
topology.kubernetes.io/region=eu-central-1
topology.kubernetes.io/zone=eu-central-1a
topology.kubernetes.io/region=eu-central-1
topology.kubernetes.io/zone=eu-central-1a
topology.kubernetes.io/region=eu-central-1
topology.kubernetes.io/zone=eu-central-1b
topology.kubernetes.io/region=eu-central-1
topology.kubernetes.io/zone=eu-central-1c
You could use beta.kubernetes.io/instance-type=t3.medium, for example, if it makes sense in your affinity rule and you want to treat all nodes of instance-type=t3.medium as the same topology.
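As a sketch only (the pod label app: my-app is an assumption), an anti-affinity term using that key would treat every node of the same instance type as one domain:
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: my-app                                   # assumed pod label
      topologyKey: beta.kubernetes.io/instance-type     # all nodes sharing an instance type count as one domain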
Goal: have one pod (namely 'log-scraper') scheduled on every node, at least once but no more than once.
Assume a cluster has the following nodes
Nodes
master/control-plane
worker-1
worker-2
worker-3
Pod I'm working with
apiVersion: v1
kind: Pod
metadata:
  name: log-scraper
spec:
  volumes:
  - name: container-log-dir
    hostPath:
      path: /var/log/containers
  containers:
  - image: "logScraper:latest"
    name: log-munger
    volumeMounts:
    - name: container-log-dir
      mountPath: /var/log/logging-app
Adding affinity to select only 'worker' nodes (or non-master nodes):
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: "worker"
          operator: In
          values:
          - "true"
Question 1: How do I ensure every node runs ONE-AND-ONLY-ONE pod of type log-scraper?
Question 2: What other manifests should be applied/added to achieve this?
You should probably use a DaemonSet, which is made exactly for this purpose: it schedules one pod per node, and pods get automatically added to new nodes in case of cluster autoscaling. A minimal sketch follows below.
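Here is the log-scraper Pod from the question rewritten as a DaemonSet sketch; the app: log-scraper label is an assumption used only to wire up the selector:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-scraper
spec:
  selector:
    matchLabels:
      app: log-scraper            # assumed label, must match the pod template below
  template:
    metadata:
      labels:
        app: log-scraper
    spec:
      containers:
      - name: log-munger
        image: "logScraper:latest"
        volumeMounts:
        - name: container-log-dir
          mountPath: /var/log/logging-app
      volumes:
      - name: container-log-dir
        hostPath:
          path: /var/log/containers
If the control-plane node carries the usual NoSchedule taint, the DaemonSet pods will typically not be placed there unless you add a toleration, which also covers the "worker nodes only" requirement.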
Concept
There are two important things when it comes to assigning Pods to Nodes - "Affinity" and "AntiAffinity".
Affinity will basically select based on given criteria while anti-affinity will avoid based on given criteria.
With affinity and anti-affinity, you can use operators like In, NotIn, Exists, DoesNotExist, Gt and Lt. When you use NotIn or DoesNotExist, it becomes anti-affinity, as in the sketch below.
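For instance, a minimal nodeAffinity sketch using DoesNotExist effectively acts as node anti-affinity; node-role.kubernetes.io/control-plane is the conventional control-plane label key, so treat it as an assumption for your particular cluster:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node-role.kubernetes.io/control-plane   # avoid nodes carrying this label
          operator: DoesNotExist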
Now, in Affinity/Antiaffinity, you have 2 choices - Node affinity/antiaffinity and Inter-pod affinity/antiaffinity
Node affinity/antiaffinity
Node affinity is conceptually similar to nodeSelector -- it allows you to constrain which nodes your pod is eligible to be scheduled on, based on labels on the node.
Inter-pod affinity/antiaffinity
Inter-pod affinity and anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled based on labels on pods that are already running on the node rather than based on labels on nodes.
Your Solution
Basically what you need is "anti-affinity", and specifically "pod anti-affinity" rather than node anti-affinity. So your solution should look something like the below (please note that since I do not have a 3-node cluster I couldn't test this, so there is a slight chance you might have to make minor adjustments):
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: worker
          operator: In
          values:
          - log-scraper
      topologyKey: kubernetes.io/hostname   # required for a hard pod anti-affinity rule
Read more over here, and especially go through the example over here.
Using Pod Topology Spread Constraints
Another way to do it is using Pod Topology Spread Constraints.
You set up taints and tolerations as usual to control which nodes the pods can be scheduled on. Then add some labels to the pod; I will use the pod label id: foo-bar in the example. Then, to allow only a single pod from a ReplicaSet, Deployment, or other controller to be scheduled per node, add the following to the pod spec:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      id: foo-bar
topologyKey refers to a node label key; kubernetes.io/hostname is a default label set on every node. Put the pod labels inside matchLabels. Create the resources and kube-scheduler should schedule a single pod with the matching labels per node.
To learn more, check out the documentation here and also this excellent blog post.
I am facing an issue while trying to deploy my redis pods in a k3s cluster.
I have updated my Chart.yaml to add the redis dependency as below:
...
dependencies:
  - name: redis
    version: 10.2.3
    repository: https://charts.bitnami.com/bitnami
...
but when I tried to apply a nodeAffinity rule in values.yaml as below:
redis:
  master:
    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
              - worker4
However, we see that it is not being scheduled to worker4. Can someone please tell me which rule is incorrect, or should I use pod affinity instead?
preferredDuringSchedulingIgnoredDuringExecution is a soft constraint. The scheduler takes your preference into consideration, but it need not honor it if some other node ends up with a higher priority score after the rest of the scheduler logic runs. You have also given it weight: 1, which is the minimal weight.
If you want to force the pod to run on worker4, you can create a hard constraint using requiredDuringSchedulingIgnoredDuringExecution instead of preferredDuringSchedulingIgnoredDuringExecution. This means that if there is no node matching the label kubernetes.io/hostname: worker4, your pod becomes unschedulable.
If you want to keep preferredDuringSchedulingIgnoredDuringExecution so that your pod can still be scheduled to another node in the event that worker4 is not available, you can try increasing the weight; weight takes a value in the range 1-100.
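A minimal sketch of the hard-constraint variant for the values.yaml above, assuming the same bitnami redis chart structure as in the question:
redis:
  master:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
              - worker4        # the pod stays Pending if worker4 cannot host it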
How can I configure a specific pod on a multi-node Kubernetes cluster so that its containers are restricted to a subset of the nodes?
E.g. let's say I have three nodes A, B, C running my Kubernetes cluster.
How do I limit a Pod to run its containers only on A and B, and not on C?
You can add a label to the nodes that you want to run the pod on and add a nodeSelector to the pod configuration. The process is described here:
http://kubernetes.io/docs/user-guide/node-selection/
So basically you want to
kubectl label nodes A node_type=foo
kubectl label nodes B node_type=foo
And you want to have this nodeSelector in your pod spec:
nodeSelector:
  node_type: foo
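Putting it together, a minimal Pod sketch using that selector; the pod name and image are placeholders:
apiVersion: v1
kind: Pod
metadata:
  name: nodeselector-demo            # placeholder name
spec:
  nodeSelector:
    node_type: foo                   # matches the label applied to nodes A and B
  containers:
  - name: main
    image: gcr.io/google_containers/pause:2.0   # placeholder image, as used elsewhere in this thread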
Firstly, you need to add a label to the nodes. You can refer to Nebril's answer for how to add a label.
Then I recommend using the node affinity feature to constrain pods to nodes with particular labels. Here is an example.
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node_type
            operator: In
            values:
            - foo
  containers:
  - name: with-node-affinity
    image: gcr.io/google_containers/pause:2.0
Compared to nodeSelector, the affinity/anti-affinity feature greatly expands the types of constraints you can express.
For more detailed information about node affinity, you can refer to the following link: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
I am unable to post a comment to the previous replies, but I upvoted the answer that is complete.
nodeSelector is not strictly enforced, and relying on it may cause some grief, including the master getting overwhelmed with requests as well as IO from pods scheduled on the master. At the time the answer was provided it may still have been the only option, but in later versions that is no longer the case.
Definitely use nodeAffinity or nodeAntiAffinity with requiredDuringSchedulingIgnoredDuringExecution or preferredDuringSchedulingIgnoredDuringExecution.
For more expressive filters for scheduling, peruse: Assigning Pods to Nodes.