What is topologyKey in pod affinity? - kubernetes

I cannot really understand the purpose and usage of topologyKey in pod affinity. The documentation says:
topologyKey is the key of node labels. If two Nodes are labelled with this key and have identical values for that label, the scheduler treats both Nodes as being in the same topology. The scheduler tries to place a balanced number of Pods into each topology domain.
An example usage is as follows:
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: topology.kubernetes.io/zone
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S2
          topologyKey: topology.kubernetes.io/zone
  containers:
  - name: with-pod-affinity
    image: k8s.gcr.io/pause:2.0
So where does topology.kubernetes.io/zone come from? How do I know what value I should provide for this topologyKey field in my YAML file, and what happens if I just put a random string here? Should I label my nodes and use the key of that label in the topologyKey field?
Thank you.

Required as part of an affinity.podAffinity or affinity.podAntiAffinity spec section, the topologyKey field is used by the scheduler to determine the domain for Pod placement.
The topologyKey domain determines the placement of the Pods being scheduled relative to the Pods identified by the ...labelSelector.matchExpressions section.
With podAffinity, a Pod will be scheduled in the same domain as the Pods that match the expression.
Two common label options are topology.kubernetes.io/zone and kubernetes.io/hostname. Others can be found in the Kubernetes Well-Known Labels, Annotations and Taints documentation.
topology.kubernetes.io/zone: Pods will be scheduled in the same zone as a Pod that matches the expression.
kubernetes.io/hostname: Pods will be scheduled on the same hostname as a Pod that matches the expression.
For podAntiAffinity, the opposite is true: Pods will not be scheduled in the same domain as the Pods that match the expression.
The Kubernetes Assigning Pods to Nodes documentation (Inter-pod affinity and anti-affinity section) provides additional explanation.
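As a minimal sketch (not taken from the answer above; the app: web pod label and selector values are hypothetical), a pod anti-affinity term with kubernetes.io/hostname as the topologyKey keeps a new Pod off any node that already runs a matching Pod:

# Hedged sketch: the app: web label is a hypothetical example.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - web
      # Every node has a unique kubernetes.io/hostname value, so the
      # topology domain here is a single node.
      topologyKey: kubernetes.io/hostname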

A topology key is effectively just a node label key that you assign to your nodes, or that a cloud provider has already assigned.
The intent is to indicate a topology characteristic, such as the availability zone or the server rack, but the keys are actually arbitrary.
It is documented here.
For example, say you want to spread pods across 3 different availability zones. A topology key helps you achieve this because, combined with anti-affinity, it prevents the pods from being scheduled into the same zone at random.
Here are 2 examples from the docs:
For example, you could use requiredDuringSchedulingIgnoredDuringExecution affinity to tell the scheduler to co-locate Pods of two services in the same cloud provider zone because they communicate with each other a lot. Similarly, you could use preferredDuringSchedulingIgnoredDuringExecution anti-affinity to spread Pods from a service across multiple cloud provider zones.
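If your nodes do not already carry a suitable label, you can create the topology yourself. A rough sketch, assuming hypothetical node names and a made-up rack label key:

# Hypothetical node names and label key, shown only to illustrate the idea.
kubectl label nodes node-a node-b topology.example.com/rack=rack-1
kubectl label nodes node-c node-d topology.example.com/rack=rack-2

You could then use topology.example.com/rack as the topologyKey in an affinity or anti-affinity rule, and the scheduler would treat each rack as one topology domain.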

topology.kubernetes.io/zone will be set only if you are using a cloud provider. However, you should consider setting this label on your nodes yourself if it makes sense in your topology.
It is documented here.
# This is an example of the values from an AWS cluster
❯ k get nodes --show-labels | awk '{print $6}' | tr ',' '\n' | grep topology
topology.kubernetes.io/region=eu-central-1
topology.kubernetes.io/zone=eu-central-1a
topology.kubernetes.io/region=eu-central-1
topology.kubernetes.io/zone=eu-central-1a
topology.kubernetes.io/region=eu-central-1
topology.kubernetes.io/zone=eu-central-1a
topology.kubernetes.io/region=eu-central-1
topology.kubernetes.io/zone=eu-central-1a
topology.kubernetes.io/region=eu-central-1
topology.kubernetes.io/zone=eu-central-1b
topology.kubernetes.io/region=eu-central-1
topology.kubernetes.io/zone=eu-central-1c
You could, for example, use beta.kubernetes.io/instance-type as the topologyKey if it makes sense in your affinity rule and you want to treat all nodes of the same instance type (e.g. t3.medium) as one topology domain.
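A rough sketch of what that could look like (the app: cache label and the weight are hypothetical; only the topologyKey comes from the point above):

# Sketch only: prefer co-locating with pods labelled app: cache (hypothetical)
# on nodes of the same instance type.
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 50
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: cache
        topologyKey: beta.kubernetes.io/instance-type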

Related

Kubernetes : How to ensure one pod gets scheduled on each worker node?

Goal : Have one pod (namely 'log-scraper') get scheduled on every node at least once but no more than once
Assume a cluster has the following nodes
Nodes
master/control-plane
worker-1
worker-2
worker-3
Pod I'm working with
apiVersion: v1
kind: Pod
metadata:
  name: log-scraper
spec:
  volumes:
  - name: container-log-dir
    hostPath:
      path: /var/log/containers
  containers:
  - image: "logScraper:latest"
    name: log-munger
    volumeMounts:
    - name: container-log-dir
      mountPath: /var/log/logging-app
Adding affinity to select only 'worker' nodes (or non-master nodes)
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: "worker"
          operator: In
          values:
          - "true"
Question 1: How do I ensure every node runs ONE-AND-ONLY-ONE pod of type log-scraper
Question 2: What other manifests should be applied/added to achieve this?
You should probably use a DaemonSet, which is made exactly for this purpose of scheduling one pod per node; its pods also get added automatically to new nodes in case of cluster autoscaling.
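A rough sketch of the same workload wrapped in a DaemonSet (the app: log-scraper label is an assumption added here; the container and volume come from the question):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-scraper
spec:
  selector:
    matchLabels:
      app: log-scraper          # hypothetical label, must match the template labels below
  template:
    metadata:
      labels:
        app: log-scraper
    spec:
      containers:
      - name: log-munger
        image: "logScraper:latest"
        volumeMounts:
        - name: container-log-dir
          mountPath: /var/log/logging-app
      volumes:
      - name: container-log-dir
        hostPath:
          path: /var/log/containers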
Concept
There are two important things when it comes to assigning Pods to Nodes - "Affinity" and "AntiAffinity".
Affinity will basically select based on given criteria while anti-affinity will avoid based on given criteria.
With affinity and anti-affinity, you can use operators like In, NotIn, Exists, DoesNotExist, Gt and Lt. Using NotIn or DoesNotExist gives you anti-affinity behaviour.
Now, in Affinity/Antiaffinity, you have 2 choices - Node affinity/antiaffinity and Inter-pod affinity/antiaffinity
Node affinity/antiaffinity
Node affinity is conceptually similar to nodeSelector -- it allows you to constrain which nodes your pod is eligible to be scheduled on, based on labels on the node.
Inter-pod affinity/antiaffinity
Inter-pod affinity and anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled on, based on labels on pods that are already running on the node rather than based on labels on nodes.
Your Solution
Basically what you need is "anti-affinity", and specifically "pod anti-affinity" rather than node anti-affinity. So your solution should look something like the example below (please note that since I do not have a 3-node cluster I could not test this, so there is a small chance you might have to make minor adjustments):
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        # This selector matches pod labels, so the log-scraper pods must carry
        # the label worker: log-scraper for the rule to apply.
        matchExpressions:
        - key: worker
          operator: In
          values:
          - log-scraper
      # topologyKey is required for a pod anti-affinity term; kubernetes.io/hostname
      # makes the domain a single node, so at most one matching pod lands per node.
      topologyKey: kubernetes.io/hostname
Read more over here, and especially go through the example over here.
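Note that the labelSelector in a pod anti-affinity term matches labels on Pods, not on nodes, so for the rule above to have any effect the log-scraper Pods themselves need a matching label, for example (hypothetical key/value, matching the term above):

metadata:
  name: log-scraper
  labels:
    worker: log-scraper   # must match the key/values used in the anti-affinity term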
Using Pod Topology Spread Constraints
Another way to do it is using Pod Topology Spread Constraints.
You set up taints and tolerations as usual to control which nodes the pods can be scheduled on. Then add some labels to the pod; I will use the pod label id: foo-bar in the example. Then, to allow only a single pod from a ReplicaSet, Deployment or other controller to be scheduled per node, add the following to the pod spec.
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      id: foo-bar
topologyKey is the key of a node label. kubernetes.io/hostname is a default label set on every node. Put the pod labels inside matchLabels. Create the resources and kube-scheduler should schedule a single pod with the matching labels per node.
To learn more, check out the documentation here and also this excellent blog post.
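A rough sketch of how the label and the constraint fit together in a Deployment (the Deployment name, replica count and image are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: foo-bar
spec:
  replicas: 3
  selector:
    matchLabels:
      id: foo-bar
  template:
    metadata:
      labels:
        id: foo-bar                 # the label the spread constraint selects on
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            id: foo-bar
      containers:
      - name: app
        image: k8s.gcr.io/pause:2.0   # placeholder image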

Assign affinity for distributing the kubernetes pods across all nodes?

What rules should be used to assign affinity to Kubernetes pods for distributing the pods across all Availability Zones?
I have a region with 3 Availability Zones and Nodes in each of these. I want to make sure that each of the 3 pods are spread across all the 3 Availability Zones.
You should be able to use the label topology.kubernetes.io/zone (as the topologyKey, for example) and add anti-affinity rules.
This is part of the anti-affinity example:
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 100
    podAffinityTerm:
      labelSelector:
        matchExpressions:
        - key: security
          operator: In
          values:
          - S2
      topologyKey: failure-domain.beta.kubernetes.io/zone
The result of the example is documented as:
The pod anti-affinity rule says that the pod cannot be scheduled onto a node if that node is in the same zone as a pod with label having key "security" and value "S2".
Instead of the label security in the example, you can use e.g. app-name: <your-app-name> as the label and use that in your matchExpressions.
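A rough sketch of how that could look in a Deployment's pod template (the app-name: my-service label and the replica count are hypothetical):

spec:
  replicas: 3
  selector:
    matchLabels:
      app-name: my-service
  template:
    metadata:
      labels:
        app-name: my-service
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app-name
                  operator: In
                  values:
                  - my-service
              # Prefer spreading the replicas across availability zones.
              topologyKey: topology.kubernetes.io/zone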

Will requiredDuringSchedulingIgnoredDuringExecution nodeAffinity persist if a pod gets evicted and rescheduled?

So, let's say we assign the requiredDuringSchedulingIgnoredDuringExecution type to node affinity and specify some node selector terms. Suppose a pod gets created according to the terms. But what if the pod gets evicted and is rescheduled?
Would those requiredDuringSchedulingIgnoredDuringExecution node affinity/node selector terms be taken into account when the new pod is created?
The requiredDuringSchedulingIgnoredDuringExecution type of nodeAffinity, unofficially called a "hard rule", guarantees that the pod will be created/RECREATED exactly according to the rules you defined.
Example from the official Kubernetes documentation:
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/e2e-az-name
            operator: In
            values:
            - e2e-az1
            - e2e-az2
The spec above will always (re)create the pod ONLY on nodes whose kubernetes.io/e2e-az-name label has the value e2e-az1 or e2e-az2.
Scheduling is done by the Kubernetes scheduler, and every time it schedules or reschedules a pod it will consider the affinity and anti-affinity defined in the pod spec.

Two pods force deploy to different ICP workers

There is a Kubernetes / IBM Cloud Private cluster with two workers.
I have one deployment which creates two pods. How can I force the deployment to place its pods on two different workers? In this case, if I lose one ICP worker I always have the other one with the needed pod.
If you want pods to not schedule on the same node, the correct concept that you will want to use is inter-pod anti-affinity. https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#inter-pod-affinity-and-anti-affinity-beta-feature
Observe:
spec:
  replicas: 2
  selector:
    app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - my-app
            topologyKey: kubernetes.io/hostname
You can create your pods as a Kubernetes DaemonSet. A DaemonSet ensures that all (or some) nodes run a copy of a pod. See the link below for details.
https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/
In addition to @Santiclause's answer regarding scheduling policy in affinity mode, there are two different modes of affinity:
requiredDuringSchedulingIgnoredDuringExecution
preferredDuringSchedulingIgnoredDuringExecution.
When using requiredDuringSchedulingIgnoredDuringExecution affinity, the scheduler needs all rules to be met for a pod to be scheduled.
If, for example, there are not enough nodes to place all the pods, the scheduler will keep the remaining pods pending until enough nodes become available.
If you use preferredDuringSchedulingIgnoredDuringExecution affinity, the scheduler will try to place all replicas based on the highest score each node gets from the combination of the defined rules and their weights.
Weight is a parameter used along with a rule; each rule can have a different weight. To calculate a score for a node, the following logic is used:
For every node, we iterate through the rules defined in the configuration (i.e. resource requests, requiredDuringScheduling, affinity expressions, etc.). If a rule matches, its weight value is added to the score for that node. Once all rules for all nodes are processed, we have a list of all nodes with their final scores. The node(s) with the highest score are the most preferred.
To summarize, a higher weight value increases the importance of a rule and helps the scheduler decide which node to choose.
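A rough sketch of two weighted preferred rules (the zone value and the disktype label are hypothetical); the zone rule contributes more to a node's score than the disktype rule:

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 80
      preference:
        matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - eu-central-1a
    - weight: 20
      preference:
        matchExpressions:
        - key: disktype            # hypothetical custom node label
          operator: In
          values:
          - ssd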

How to specify pod to node affinity in kubernetes

How can I configure a specific pod in a multi-node Kubernetes cluster so that its containers are restricted to a subset of the nodes?
E.g. let's say I have three nodes A, B, C running my Kubernetes cluster.
How do I limit a pod to run its containers only on A and B, and not on C?
You can add a label to the nodes you want to run the pod on and add a nodeSelector to the pod configuration. The process is described here:
http://kubernetes.io/docs/user-guide/node-selection/
So basically you want to
kubectl label nodes A node_type=foo
kubectl label nodes B node_type=foo
And you want to have this nodeSelector in your pod spec:
nodeSelector:
  node_type: foo
Firstly, you need to add labels to the nodes. You can refer to Nebril's answer for how to add a label.
Then I recommend you to use the node affinity feature to constrain pods to nodes with particular labels. Here is an example.
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node_type
            operator: In
            values:
            - foo
  containers:
  - name: with-node-affinity
    image: gcr.io/google_containers/pause:2.0
Compared to nodeSelector, the affinity/anti-affinity feature greatly expands the types of constraints you can express.
For more detailed information about node affinity, you can refer to the following link: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
I am unable to post a comment on the previous replies, but I upvoted the answer that is complete.
nodeSelector is not strictly enforced and relying on it may cause some grief, including the master getting overwhelmed with requests as well as I/O from pods scheduled on the master. At the time that answer was provided it may still have been the only option, but in later versions that is no longer the case.
Definitely use nodeAffinity or node anti-affinity with requiredDuringSchedulingIgnoredDuringExecution or preferredDuringSchedulingIgnoredDuringExecution.
For more expressive scheduling filters, see: Assigning Pods to Nodes
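As a rough sketch, the same node_type=foo constraint from the earlier answer expressed as a soft preference instead of a hard requirement:

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: node_type
          operator: In
          values:
          - foo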