Background:
While performance testing an application, I was getting inconsistent results when scaling the replicas for my php-fpm containers, and I realized that 3 of the 4 pods were scheduled on the same node.
I then configured anti-affinity rules so that pods would not be scheduled on the same node. I quickly realized that requiredDuringSchedulingIgnoredDuringExecution was not an option, because it would not allow # of replicas > # of nodes, so I configured preferredDuringSchedulingIgnoredDuringExecution instead.
For the most part, my pods appear to be scheduled evenly across all my nodes; however, sometimes (observed during a rolling upgrade) I see pods on the same node. I suspect the weight value, which is currently set to 100, is a factor.
Here is the yaml I am using (helm):
{{- if .Values.podAntiAffinity }}
{{- if .Values.podAntiAffinity.enabled }}
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: "{{ .Values.deploymentName }}"
        topologyKey: "kubernetes.io/hostname"
{{- end }}
{{- end }}
Questions:
The way I read the documentation, the weight is added to a calculated score for the node based on how busy it is (simplified). What I don't understand is how a weight of 1 vs. 100 would make any difference.
Why are pods sometimes scheduled on the same node with this rule? Is it because the total score for the node that the pod wasn't scheduled on is too low (as it is too busy)?
Is there a way to see a log/event of how the pod was scheduled on a particular node? I'd expect kubectl describe pod to have those details but seemingly it does not (except in an error scenario).
preferredDuringSchedulingIgnoredDuringExecution is not guaranteed.
There are two types of node affinity, called requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution. You can think of them as "hard" and "soft" respectively, in the sense that the former specifies rules that must be met for a pod to be scheduled onto a node (just like nodeSelector but using a more expressive syntax), while the latter specifies preferences that the scheduler will try to enforce but will not guarantee.
The weight you set gives an edge, but there are other parameters (set by the user and by Kubernetes) with their own weights. The example below should give a better picture of where the weight you set matters:
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - preference:
        matchExpressions:
        - key: example.com/myLabel
          operator: In
          values:
          - a
      weight: 40
    - preference:
        matchExpressions:
        - key: example.com/myLabel
          operator: In
          values:
          - b
      weight: 35
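Roughly speaking, every preferred term that a node satisfies adds that term's weight to the node's node-affinity score: in the example above, a node labelled example.com/myLabel=a would get 40 points from this plugin, a node labelled b would get 35, and a node with neither would get 0. That score is then normalized and combined with the scores produced by the scheduler's other scoring plugins (resource balancing, image locality, pod spreading, and so on), which is why a weight of 1 is easily drowned out by the rest of the scoring while a weight of 100 has a much better chance of tipping the final decision.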
Related
I was trying to spread pods evenly in all zones, but couldn't make it work properly.
In my k8s cluster, nodes are spread across 3 AZs. Now suppose the minimum node count is 1 and there are 2 nodes at the moment, the first of which is completely full of pods. When I create a deployment (2 replicas) with a topology spread constraint set to ScheduleAnyway, both pods are deployed on the second node, since it has enough resources. I don't want that. I tried changing the condition to DoNotSchedule, but since I only have 3 AZs, I am only able to schedule 3 pods, and a new node is triggered for all 3 of them. I want to make sure that the replicas are spread across all 3 AZs.
Here is a snippet from the deployment spec. Does anyone know what the way out should be?
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchExpressions:
    - key: app
      operator: In
      values:
      - "my-app"
You need to tweak the maxSkew attribute and assign it a higher value.
Refer to the example given at:
https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/
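For example, a sketch that keeps the constraint from the question but relaxes it (label values copied from the question):
topologySpreadConstraints:
- maxSkew: 2                                 # allow up to 2 pods of imbalance between zones
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchExpressions:
    - key: app
      operator: In
      values:
      - "my-app"
With DoNotSchedule the constraint stays hard, but the higher maxSkew gives the scheduler more room to place replicas in the zones that already have capacity before it refuses and triggers a scale-up.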
If I understand your problem correctly, you can use a nodeAffinity rule together with the maxSkew field.
Please take a look at this answer of mine, or see it below. In it, I describe how you can force your pods to be split between nodes. All you need to do is set the key and value parameters in the matchExpressions section accordingly.
Additionally, you may find the requiredDuringSchedulingIgnoredDuringExecution field and the preferredDuringSchedulingIgnoredDuringExecution field very useful.
Look at this example yaml file:
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        example: app
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - worker-1
            - worker-2
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50
        preference:
          matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - worker-1
The idea of this configuration: I'm using nodeAffinity here to indicate which nodes the pod can be placed on:
- key: kubernetes.io/hostname
and
values:
- worker-1
- worker-2
It is important to set the following line:
- maxSkew: 1
According to the documentation:
maxSkew describes the degree to which Pods may be unevenly distributed. It must be greater than zero.
Thanks to this, the difference in the number of assigned pods between nodes will be at most 1.
This section:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 50
  preference:
    matchExpressions:
    - key: kubernetes.io/hostname
      operator: In
      values:
      - worker-1
is optional; however, it allows you to fine-tune the pod distribution across the free nodes even further. Here is a description of the differences between requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution:
Thus an example of requiredDuringSchedulingIgnoredDuringExecution would be "only run the pod on nodes with Intel CPUs" and an example preferredDuringSchedulingIgnoredDuringExecution would be "try to run this set of pods in failure zone XYZ, but if it's not possible, then allow some to run elsewhere".
Goal: have one pod (namely 'log-scraper') scheduled on every node at least once but no more than once.
Assume a cluster has the following nodes
Nodes
master/control-plane
worker-1
worker-2
worker-3
Pod I'm working with
apiVersion: v1
kind: Pod
metadata:
  name: log-scraper
spec:
  volumes:
  - name: container-log-dir
    hostPath:
      path: /var/log/containers
  containers:
  - image: "logScraper:latest"
    name: log-munger
    volumeMounts:
    - name: container-log-dir
      mountPath: /var/log/logging-app
Adding affinity to select only 'worker' nodes (or non-master nodes):
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: "worker"
          operator: In
          values:
          - "true"
Question 1: How do I ensure every node runs ONE-AND-ONLY-ONE pod of type log-scraper?
Question 2: What other manifests should be applied/added to achieve this?
You should probably use a DaemonSet, which is made exactly for this purpose: it schedules one pod per node, and pods are automatically added to new nodes in case of cluster autoscaling.
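For illustration, a minimal DaemonSet adapted from the log-scraper Pod in the question might look like the sketch below (the app: log-scraper label and the nodeAffinity for worker=true nodes are assumptions carried over from the question):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-scraper
spec:
  selector:
    matchLabels:
      app: log-scraper
  template:
    metadata:
      labels:
        app: log-scraper
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: "worker"
                operator: In
                values:
                - "true"
      volumes:
      - name: container-log-dir
        hostPath:
          path: /var/log/containers
      containers:
      - image: "logScraper:latest"
        name: log-munger
        volumeMounts:
        - name: container-log-dir
          mountPath: /var/log/logging-app
The DaemonSet controller places exactly one such Pod on every node that matches the affinity, so no replica count or anti-affinity bookkeeping is needed.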
Concept
There are two important things when it comes to assigning Pods to Nodes - "Affinity" and "AntiAffinity".
Affinity basically selects based on given criteria, while anti-affinity avoids based on given criteria.
With affinity and anti-affinity, you can use operators like In, NotIn, Exists, DoesNotExist, Gt and Lt. When you use NotIn or DoesNotExist, it becomes anti-affinity.
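For instance, a small sketch of anti-affinity expressed through DoesNotExist (the control-plane label name is an assumption; it varies between distributions):
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node-role.kubernetes.io/control-plane   # assumed label; DoesNotExist takes no values
          operator: DoesNotExist
This keeps the Pod off any node that carries that label.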
Now, within affinity/anti-affinity, you have 2 choices: node affinity/anti-affinity and inter-pod affinity/anti-affinity.
Node affinity/antiaffinity
Node affinity is conceptually similar to nodeSelector -- it allows you to constrain which nodes your pod is eligible to be scheduled on, based on labels on the node.
Inter-pod affinity/antiaffinity
Inter-pod affinity and anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled based on labels on pods that are already running on the node rather than based on labels on nodes.
Your Solution
Basically what you need is "anti-affinity", and specifically "pod anti-affinity" rather than node anti-affinity. So your solution should look something like the below (please note that since I do not have a 3-node cluster I couldn't test this, so there is a slight chance you might have to make minor adjustments). Note that the labelSelector in pod anti-affinity matches Pod labels, so the log-scraper Pods need a matching label (e.g. app: log-scraper), and a topologyKey is required:
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - log-scraper
      topologyKey: kubernetes.io/hostname
Read more over here, and especially go through the example over here.
Using Pod Topology Spread Constraints
Another way to do it is using Pod Topology Spread Constraints.
You set up taints and tolerations as usual to control which nodes the pods can be scheduled on. Then add some labels to the pod; I will use the pod label id: foo-bar in the example. To allow only a single pod from a ReplicaSet, Deployment, or other controller to be scheduled per node, add the following to the pod spec:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      id: foo-bar
topologyKey is a node label; kubernetes.io/hostname is a default label set on every node. Put the pod labels inside matchLabels. Create the resources, and kube-scheduler should schedule a single pod with the matching labels per node.
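As a sketch (Deployment name and image are placeholders), the label only has to appear on the Pod template so the constraint's selector can match it:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: foo-bar                  # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      id: foo-bar
  template:
    metadata:
      labels:
        id: foo-bar              # matched by the labelSelector in topologySpreadConstraints
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            id: foo-bar
      containers:
      - name: app
        image: nginx             # placeholder image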
To learn more, check out the documentation here and also this excellent blog post.
I am facing an issue while trying to deploy my redis pods in a k3s cluster.
I have updated my Chart.yaml to add the redis dependency as below:
...
dependencies:
- name: redis
  version: 10.2.3
  repository: https://charts.bitnami.com/bitnami
...
But when I try to apply a nodeAffinity rule in values.yaml as below:
redis:
  master:
    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
              - worker4
However, we see that the pod is not being scheduled to worker4. Can someone please tell me which rule is incorrect, or should I use pod affinity instead?
preferredDuringSchedulingIgnoredDuringExecution is a soft constraint. The scheduler takes your preference into consideration, but it need not honor it if some other node ends up with a higher priority score after the rest of the scheduling logic runs. You have also given it weight: 1, which is the minimum weight.
If you want to force the pod to run on worker4, you can create a hard constraint using requiredDuringSchedulingIgnoredDuringExecution instead of preferredDuringSchedulingIgnoredDuringExecution. This means that if there is no node matching the label kubernetes.io/hostname: worker4, your pod will become unschedulable.
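As an untested sketch against the same values structure, the hard-constraint variant would be:
redis:
  master:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
              - worker4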
If you want to use preferredDuringSchedulingIgnoredDuringExecution so that your pod can still be scheduled to another node in the event that worker4 is not available, you can try increasing the weight. weight takes a value in the range 1-100.
What rules should be used to assign affinity to Kubernetes pods for distributing the pods across all Availability Zones?
I have a region with 3 Availability Zones and Nodes in each of these. I want to make sure that each of the 3 pods are spread across all the 3 Availability Zones.
You should be able to use the label topology.kubernetes.io/zone (e.g. as the topologyKey) and add anti-affinity rules.
This is part of the anti-affinity example:
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 100
    podAffinityTerm:
      labelSelector:
        matchExpressions:
        - key: security
          operator: In
          values:
          - S2
      topologyKey: failure-domain.beta.kubernetes.io/zone
The result of the example is documented as:
The pod anti-affinity rule says that the pod cannot be scheduled onto a node if that node is in the same zone as a pod with label having key "security" and value "S2".
Instead of the label security from the example, you can use e.g. app-name: <your-app-name> as the label and reference it in your matchExpressions.
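Adapted to that, a sketch (the app-name label is just an example; replace the value with your app's label) could be:
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 100
    podAffinityTerm:
      labelSelector:
        matchLabels:
          app-name: <your-app-name>   # placeholder label value
      topologyKey: topology.kubernetes.io/zone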
I've got an AKS cluster configured with two fairly small VM worker nodes, and then a virtual node to use ACI. What I really want to happen is for pods to get scheduled on the two VM nodes until they are full, then use the virtual node, but I cannot get this to work.
I've tried using node affinity, as suggested here, but it just doesn't work: pods get scheduled on the virtual node first. If I use a required node affinity, they do get scheduled only on the VM nodes, but that is not what I want. I'm guessing the issue is that the resource availability on my VM nodes is significantly lower than on the virtual node (as you would expect), so the virtual node gets a much higher score, which counteracts the affinity rule, but I don't really know, as I can't see any way to inspect this scoring.
So, does anyone have a way to make this scenario work?
https://kubernetes.io/docs/concepts/scheduling/kube-scheduler/ goes over the different scoring options used by the scheduler and https://github.com/kubernetes/examples/blob/master/staging/scheduler-policy/scheduler-policy-config.json shows how to customize them.
I suspect what you want is a preferred affinity combined with increasing the scoring factor for NodeAffinityPriority.
nodeAffinity is the right way to go, but you have to use the requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution parameters correctly.
For example:
apiVersion: v1
kind: Pod
metadata:
  name: node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: KEY-FOR-ALL-THREE-NODES
            operator: In
            values:
            - VALUE-FOR-ALL-THREE-NODES
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: KEY-FOR-THE-TWO-SMALL-NODES
            operator: In
            values:
            - VALUE-FOR-THE-TWO-SMALL-NODES
  containers:
  - name: nginx
    image: nginx
This pod can run only on nodes with the key: value stated in the required section (so all three nodes), but you are giving it a preference to run on the small nodes (if there is room), with a weight of 100. The weight is a relative thing, so it should work much the same with 1 as with 100.
Also, since you only have these three nodes, you can skip the required part and set only the preference.
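A preference-only sketch, dropping the required block, would be:
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: KEY-FOR-THE-TWO-SMALL-NODES
          operator: In
          values:
          - VALUE-FOR-THE-TWO-SMALL-NODES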