What rules should be used to assign affinity to Kubernetes pods for distributing the pods across all Availability Zones?
I have a region with 3 Availability Zones and Nodes in each of these. I want to make sure that each of the 3 pods are spread across all the 3 Availability Zones.
You should be able to use the label topology.kubernetes.io/zone (e.g. as the topologyKey) and add anti-affinity rules.
This is part of the anti-affinity example in the documentation:
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
            - key: security
              operator: In
              values:
                - S2
        topologyKey: topology.kubernetes.io/zone
The result of the example is documented as:
The pod anti-affinity rule says that the pod cannot be scheduled onto a node if that node is in the same zone as a pod with a label having key "security" and value "S2".
Instead of the security label used in the example, you can use e.g. app-name: <your-app-name> as the label and reference it in your matchExpressions.
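For instance, a minimal sketch spreading pods across zones using a hypothetical app-name: my-app label:
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
            - key: app-name
              operator: In
              values:
                - my-app              # hypothetical app name
        topologyKey: topology.kubernetes.io/zone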
I cannot really understand the purpose and usage of topologyKey in pod affinity. The documentation says:
topologyKey is the key of node labels. If two Nodes are labelled with this key and have identical values for that label, the scheduler treats both Nodes as being in the same topology. The scheduler tries to place a balanced number of Pods into each topology domain.
An example usage is as follows:
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: security
                operator: In
                values:
                  - S1
          topologyKey: topology.kubernetes.io/zone
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: security
                  operator: In
                  values:
                    - S2
            topologyKey: topology.kubernetes.io/zone
  containers:
    - name: with-pod-affinity
      image: k8s.gcr.io/pause:2.0
So where does topology.kubernetes.io/zone come from? How can I know what value I should provide for this topologyKey field in my YAML file, and what happens if I just put a random string here? Should I label my nodes and use the key of that label in the topologyKey field?
Thank you.
Required as part of an affinity.podAffinity or affinity.podAntiAffinity spec section, the topologyKey field is used by the scheduler to determine the domain for Pod placement.
The topologyKey domain determines the placement of the Pods being scheduled relative to the Pods identified by the ...labelSelector.matchExpressions section.
With podAffinity, a Pod will be scheduled in the same domain as the Pods that match the expression.
Two common label options are topology.kubernetes.io/zone and kubernetes.io/hostname. Others can be found in the Kubernetes Well-Known Labels, Annotations and Taints documentation.
topology.kubernetes.io/zone: Pods will be scheduled in the same zone as a Pod that matches the expression.
kubernetes.io/hostname: Pods will be scheduled on the same hostname as a Pod that matches the expression.
For podAntiAffinity, the opposite is true: Pods will not be scheduled in the same domain as the Pods that match the expression.
The Kubernetes documentation Assigning Pods to Nodes (Inter-pod affinity and anti-affinity section) provides additional explanation.
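As a hedged sketch of the difference between the two keys (the app: my-app pod label is an assumption):
podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: my-app                              # assumed pod label
      topologyKey: topology.kubernetes.io/zone     # at most one matching pod per zone
      # for at most one matching pod per node, use instead:
      # topologyKey: kubernetes.io/hostname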
A topology key is effectively just a label that you assign to your nodes, or one that a cloud provider has already assigned.
The intent is to indicate certain topology characteristics, like the availability zone or server rack, for example. But they are actually arbitrary.
It is documented here.
For example, say you want to spread out pods across 3 different availability zones. The topology key can help you achieve this, as it prevents them from being randomly scheduled in the same zone.
Here are 2 examples from the docs:
For example, you could use requiredDuringSchedulingIgnoredDuringExecution affinity to tell the scheduler to co-locate Pods of two services in the same cloud provider zone because they communicate with each other a lot. Similarly, you could use preferredDuringSchedulingIgnoredDuringExecution anti-affinity to spread Pods from a service across multiple cloud provider zones.
topology.kubernetes.io/zone will be set only if you are using a cloud provider. However, you should consider setting this on nodes if it makes sense in your topology.
It is documented here.
# This is an example of the values from an AWS cluster
❯ k get nodes --show-labels | awk '{print $6}' | tr ',' '\n' | grep topology
topology.kubernetes.io/region=eu-central-1
topology.kubernetes.io/zone=eu-central-1a
topology.kubernetes.io/region=eu-central-1
topology.kubernetes.io/zone=eu-central-1a
topology.kubernetes.io/region=eu-central-1
topology.kubernetes.io/zone=eu-central-1a
topology.kubernetes.io/region=eu-central-1
topology.kubernetes.io/zone=eu-central-1a
topology.kubernetes.io/region=eu-central-1
topology.kubernetes.io/zone=eu-central-1b
topology.kubernetes.io/region=eu-central-1
topology.kubernetes.io/zone=eu-central-1c
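If you are not on a cloud provider and these labels are absent, you can set them yourself; a minimal sketch, with hypothetical node and zone names:
# Label nodes manually, e.g. on a bare-metal cluster
kubectl label nodes node-1 topology.kubernetes.io/zone=zone-a
kubectl label nodes node-2 topology.kubernetes.io/zone=zone-b
kubectl label nodes node-3 topology.kubernetes.io/zone=zone-c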
You could use node.kubernetes.io/instance-type=t3.medium, for example, if it makes sense in your affinity rule and you want to treat all nodes of instance type t3.medium as the same topology.
Goal: Have one pod (namely 'log-scraper') scheduled on every node at least once, but no more than once.
Assume a cluster has the following nodes
Nodes
master/control-plane
worker-1
worker-2
worker-3
The Pod I'm working with:
apiVersion: v1
kind: Pod
metadata:
  name: log-scraper
spec:
  volumes:
    - name: container-log-dir
      hostPath:
        path: /var/log/containers
  containers:
    - image: "logScraper:latest"
      name: log-munger
      volumeMounts:
        - name: container-log-dir
          mountPath: /var/log/logging-app
Adding affinity to select only 'worker' nodes (or non-master nodes):
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: "worker"
              operator: In
              values:
                - "true"
Question 1: How do I ensure every node runs ONE-AND-ONLY-ONE pod of type log-scraper?
Question 2: What other manifests should be applied/added to achieve this?
You should probably use a DaemonSet, which is made exactly for this purpose: it schedules one pod per node and automatically adds pods to new nodes in case of cluster autoscaling.
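For illustration, a minimal DaemonSet sketch reusing the Pod spec from the question; the app: log-scraper label is an assumption, added because a DaemonSet selector needs a pod label to match:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-scraper
spec:
  selector:
    matchLabels:
      app: log-scraper          # assumed label; must match the template below
  template:
    metadata:
      labels:
        app: log-scraper
    spec:
      volumes:
        - name: container-log-dir
          hostPath:
            path: /var/log/containers
      containers:
        - image: "logScraper:latest"
          name: log-munger
          volumeMounts:
            - name: container-log-dir
              mountPath: /var/log/logging-app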
Concept
There are two important things when it comes to assigning Pods to Nodes - "Affinity" and "AntiAffinity".
Affinity will basically select based on given criteria while anti-affinity will avoid based on given criteria.
With affinity and anti-affinity, you can use operators like In, NotIn, Exists, DoesNotExist, Gt and Lt. When you use NotIn or DoesNotExist, it becomes anti-affinity.
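For instance, a minimal sketch where NotIn turns a node-affinity rule into avoidance; the disktype node label is hypothetical:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: disktype          # hypothetical node label
              operator: NotIn        # avoid, rather than select, these nodes
              values:
                - hdd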
Now, within affinity/anti-affinity, you have 2 choices - node affinity/anti-affinity and inter-pod affinity/anti-affinity.
Node affinity/antiaffinity
Node affinity is conceptually similar to nodeSelector -- it allows you to constrain which nodes your pod is eligible to be scheduled on, based on labels on the node.
Inter-pod affinity/antiaffinity
Inter-pod affinity and anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled based on labels on pods that are already running on the node rather than based on labels on nodes.
Your Solution
Basically what you need is anti-affinity, and specifically pod anti-affinity rather than node anti-affinity. So your solution should look something like the below (please note that since I do not have a 3-node cluster I could not test this, so there is a small chance you might have to make minor adjustments):
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: worker             # matches pods carrying the label worker=log-scraper
              operator: In
              values:
                - log-scraper
        topologyKey: kubernetes.io/hostname   # required field; one matching pod per node
Read more over here, and especially go through the examples over here.
Using Pod Topology Spread Constraints
Another way to do it is using Pod Topology Spread Constraints.
You set up taints and tolerations as usual to control which nodes the pods can be scheduled on. Then add some labels to the pod; I will use the pod label id: foo-bar in the example. Then, to allow only a single pod from a ReplicaSet, Deployment, or other controller to be scheduled per node, add the following to the pod spec:
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        id: foo-bar
topologyKey is the key of a node label; kubernetes.io/hostname is a default label that is set on every node. Put the pod's labels inside matchLabels. Create the resources and kube-scheduler should schedule a single pod with the matching labels per node.
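For context, here is a minimal Deployment sketch showing where the constraint sits; the nginx image is a placeholder:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: foo-bar
spec:
  replicas: 3
  selector:
    matchLabels:
      id: foo-bar
  template:
    metadata:
      labels:
        id: foo-bar              # must match the constraint's labelSelector
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              id: foo-bar
      containers:
        - name: foo-bar
          image: nginx           # placeholder image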
To learn more, check out the documentation here and also this excellent blog post.
Background:
While performance testing an application, I was getting inconsistent results when scaling the replicas for my php-fpm containers, and I realized that 3 out of 4 pods were scheduled on the same node.
I then configured anti-affinity rules so that pods would not be scheduled on the same node. I quickly realized that using requiredDuringSchedulingIgnoredDuringExecution was not an option, because I could not have the number of replicas exceed the number of nodes, so I configured preferredDuringSchedulingIgnoredDuringExecution.
For the most part, it looks like my pods are scheduled evenly across all my nodes; however, sometimes (seen during a rolling upgrade) I see pods on the same node. I feel like the weight value, which is currently set to 100, is a factor.
Here is the YAML I am using (Helm):
{{- if .Values.podAntiAffinity }}
{{- if .Values.podAntiAffinity.enabled }}
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: "{{ .Values.deploymentName }}"
          topologyKey: "kubernetes.io/hostname"
{{- end }}
{{- end }}
Questions:
The way I read the documentation, the weight number will be added to a calculated score for the node based on how busy it is (simplified); however, what I don't understand is how a weight of 1 vs. 100 would make any difference.
Why are pods sometimes scheduled on the same node despite this rule? Is it because the total score for the node that the pod wasn't scheduled on was too low (as it was too busy)?
Is there a way to see a log/event of how the pod was scheduled on a particular node? I'd expect kubectl describe pod to have those details but seemingly it does not (except in an error scenario).
preferredDuringSchedulingIgnoredDuringExecution is not guaranteed.
There are two types of node affinity, called requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution. You can think of them as "hard" and "soft" respectively, in the sense that the former specifies rules that must be met for a pod to be scheduled onto a node (just like nodeSelector but using a more expressive syntax), while the latter specifies preferences that the scheduler will try to enforce but will not guarantee.
The weight you set gives your preference an edge, but there are other parameters (set by the user and by Kubernetes) with their own weights. The example below should give a better picture of how the weight you set matters:
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
            - key: example.com/myLabel
              operator: In
              values:
                - a
        weight: 40
      - preference:
          matchExpressions:
            - key: example.com/myLabel
              operator: In
              values:
                - b
        weight: 35
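As a rough, simplified illustration of the arithmetic: with the rules above, a node labelled example.com/myLabel=a collects 40 points from these preferences, a node labelled b collects 35, and a node with neither label collects 0. Those points are then combined with the scores from the scheduler's other scoring plugins (resource balance, image locality, and so on), so the b node can still win overall if it is, say, far less loaded. This is also why a preferred anti-affinity with weight 1 is much easier to outvote than one with weight 100.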
I have 3 nodes in Kubernetes and I'm running Kafka (a 3-node cluster).
While deploying ZooKeeper/broker/REST proxy, the pods are not getting deployed across all the available nodes. How can I make sure that all pods are deployed on different nodes? Do I need to use node affinity or pod affinity?
If you want all pods to run on different nodes, you must use podAntiAffinity. If this is a hard requirement, you must use a requiredDuringSchedulingIgnoredDuringExecution rule; if it's not, use preferredDuringSchedulingIgnoredDuringExecution.
topologyKey should be kubernetes.io/hostname.
In labelSelector put your pod's labels.
I recommend using soft anti-affinity, which will look like this:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - <your app label>
          topologyKey: kubernetes.io/hostname
        weight: 100
Here I explained the difference between the anti-affinity types, with examples applied to a live cluster:
https://blog.verygoodsecurity.com/posts/kubernetes-multi-az-deployments-using-pod-anti-affinity/
I am interested in knowing how pervasively labels/selectors are used in Kubernetes. Is it a widely used feature in the field for segregating container workloads?
If not, what other ways are used to segregate workloads in Kubernetes?
I have been running Kubernetes in production for some months now and use labels on some pods to spread them out over the nodes with podAntiAffinity rules, so that these pods aren't all located on a single node. Mind you, I'm running a small cluster of three nodes.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app
              operator: In
              values:
                - your-deployment-name
        topologyKey: "kubernetes.io/hostname"
I've found this a useful way to use labels.