Kubernetes node affinity behavior - kubernetes

I'm trying to assign pods to a specific node using affinities, but I'm seeing a strange behavior I can't understand.
Let me explain my node setup. I have "x" nodes which all have the label kali=true, two of those nodes additionally have the label kali-app=true, and one of these two also has the label kali-app-1=true.
Now I deploy a Deployment with 2 replicas, which I expect to result in both pods being placed on the kali-app-1 node:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kali
          operator: In
          values: ['true']
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 1
      preference:
        matchExpressions:
        - key: kali-app
          operator: In
          values: ['true']
    - weight: 2
      preference:
        matchExpressions:
        - key: kali-app-1
          operator: In
          values: ['true']
The result is that the first pod lands on the kali-app-1 node, but the second one is placed on the node which only has the kali-app label (specifically the one labelled kali-app-2=true, which is another label I use in my cluster).
Can anyone explain this behavior?

Based on the nodeAffinity documentation this works as expected, since the preferredDuringSchedulingIgnoredDuringExecution field is used for the 2nd and 3rd conditions.
The way it works is:
There are currently two types of node affinity, called
requiredDuringSchedulingIgnoredDuringExecution and
preferredDuringSchedulingIgnoredDuringExecution. You can think of
them as "hard" and "soft" respectively, in the sense that the former
specifies rules that must be met for a pod to be scheduled onto a node
(similar to nodeSelector but using a more expressive syntax), while
the latter specifies preferences that the scheduler will try to
enforce but will not guarantee.
The weight field in preferredDuringSchedulingIgnoredDuringExecution
is in the range 1-100. For each node that meets all of the scheduling
requirements (resource request, RequiredDuringScheduling affinity
expressions, etc.), the scheduler will compute a sum by iterating
through the elements of this field and adding "weight" to the sum if
the node matches the corresponding MatchExpressions. This score is
then combined with the scores of other priority functions for the
node. The node(s) with the highest total score are the most
preferred.
That means that after the first pod is scheduled on the required node, another node may score higher for placing the second pod (e.g. if the deployment is heavy and has high CPU/memory requests). In addition, the Kubernetes scheduler generally tries to spread pods across nodes for high availability. Please get familiar with the NodeAffinity preferredDuringSchedulingIgnoredDuringExecution doesn't work well GitHub issue.
Therefore, if the pods from this deployment have to be scheduled on the same node with the label kali-app-1, you need to use the requiredDuringSchedulingIgnoredDuringExecution field to enforce this assignment and not give the scheduler any other options.
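A minimal sketch of such a hard rule, assuming the same kali-app-1=true label from the question:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        # Hard requirement: only nodes carrying kali-app-1=true are eligible
        - key: kali-app-1
          operator: In
          values: ['true']
```

With this in place, the scheduler has no alternative nodes to score, so both replicas land on the kali-app-1 node (or stay Pending if it cannot fit them).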

Related

Restrict maximum pod replica count on k8s node

I have a k8s cluster with 2 node groups (worker group and task runner group for heavy tasks).
I have a deployment with N pod replicas and I want to assign a maximum of 2 replicas per node.
I found a way to restrict the replica count to 1 per node by describing pod anti-affinity rules, and I tried adding Topology Spread Constraints with the maxSkew parameter, but this didn't work.
Is there any way to restrict the maximum pod replica count per node?
The key is to use topologyKey.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - podAffinityTerm:
        labelSelector:
          matchLabels:
            name: deployment-name
        topologyKey: kubernetes.io/hostname
      weight: 100
A bit about topologyKey:
In principle, the topologyKey can be any allowed label key with the following exceptions for performance and security reasons:
For Pod affinity and anti-affinity, an empty topologyKey field is not allowed in both requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution.
For requiredDuringSchedulingIgnoredDuringExecution Pod anti-affinity rules, the admission controller LimitPodHardAntiAffinityTopology limits topologyKey to kubernetes.io/hostname.
You can modify or disable the admission controller if you want to allow custom topologies.
Pod Topology Spread Constraints
https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/
topologyKey is the key of node labels.
If two Nodes are labelled with this key and have identical values for that label, the scheduler treats both Nodes as being in the same topology.
The scheduler tries to place a balanced number of Pods into each topology domain.
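For comparison, here is a topologySpreadConstraints sketch using the same hypothetical deployment-name label as above; note that it evens out the pod count across nodes (keeping the difference within maxSkew) rather than enforcing a hard per-node maximum:

```yaml
topologySpreadConstraints:
- maxSkew: 1                        # max allowed difference in pod count between nodes
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule  # hard constraint: block scheduling instead of skewing
  labelSelector:
    matchLabels:
      name: deployment-name         # hypothetical label; match your deployment's pod labels
```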

When scaling a pod, will Kubernetes start new pods on more available nodes?

I tried to find a clear answer to this, and I'm sure it's in the kubernetes documentation somewhere but I couldn't find it.
If I have 4 nodes in my cluster, and am running a compute intensive pod which presumably uses up all/most of the resources on a single node, and I scale that pod to 4 replicas- will Kubernetes put those new pods on the unused nodes automatically?
Or, in the case where that compute intensive pod is not actually crunching numbers at the time but I scale the pod to 4 replicas (ie: node running the original pod has plenty of 'available' resources) will kubernetes still see that there are 3 other totally free nodes and start the pods on them?
Kubernetes will schedule the new pods on any node with sufficient resources. So one of the newly created pods could end up on a node where one is already running, if that node has enough resources left.
You can use anti-affinity to prevent pods from scheduling on the same node as a pod from the same deployment, e.g. by using the deployment's label:
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 100
    podAffinityTerm:
      labelSelector:
        matchExpressions:
        - key: resources-group
          operator: In
          values:
          - high
      # hostname topology spreads per node; use topology.kubernetes.io/zone to spread per zone
      topologyKey: kubernetes.io/hostname
Read docs for more on that topic.
As far as I remember, it depends on the scheduler configuration.
If you are running an on-premises Kubernetes cluster and you have access to the scheduler process, you can tune the policy as you prefer (see documentation).
Otherwise, you can only play with the pod resource requests/limits and anti-affinity (see here).
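For example, sizing the requests so that a node can only fit a limited number of replicas will force the remaining replicas onto other nodes (the values below are purely illustrative):

```yaml
# Fragment of the pod template's container spec
resources:
  requests:
    cpu: "500m"      # the scheduler reserves this much per pod when placing it
    memory: "512Mi"
  limits:
    cpu: "1"
    memory: "1Gi"
```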

Kubernetes Pod anti-affinity - evenly spread pods based on a label?

We are finding that our Kubernetes cluster tends to have hot-spots where certain nodes get far more instances of our apps than other nodes.
In this case, we are deploying lots of instances of Apache Airflow, and some nodes have 3x more web or scheduler components than others.
Is it possible to use anti-affinity rules to force a more even spread of pods across the cluster?
E.g. "prefer the node with the least pods of label component=airflow-web?"
If anti-affinity does not work, are there other mechanisms we should be looking into as well?
Try adding this to the Deployment/StatefulSet .spec.template:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: "component"
            operator: In
            values:
            - airflow-web
        topologyKey: "kubernetes.io/hostname"
Have you tried configuring the kube-scheduler?
kube-scheduler selects a node for the pod in a 2-step operation:
Filtering: finds the set of Nodes where it's feasible to schedule the Pod.
Scoring: ranks the remaining nodes to choose the most suitable Pod placement.
Scheduling Policies: can be used to specify the predicates and priorities that the kube-scheduler runs to filter and score nodes.
kube-scheduler --policy-config-file <filename>
Sample config file
One of the priorities for your scenario is:
BalancedResourceAllocation: Favors nodes with balanced resource usage.
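A minimal sketch of such a policy file, using the legacy kube-scheduler Policy API (the official examples are JSON; the equivalent YAML is shown here, and the weights are illustrative). Note this policy file mechanism is deprecated in recent Kubernetes versions in favor of scheduler profiles:

```yaml
apiVersion: v1
kind: Policy
priorities:
# Raise the relative weight of the balanced-usage score
- name: BalancedResourceAllocation
  weight: 5
- name: LeastRequestedPriority
  weight: 1
```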
The right solution here is pod topology spread constraints: https://kubernetes.io/blog/2020/05/introducing-podtopologyspread/
Anti-affinity only helps until each node has at least 1 pod; spread constraints actually balance based on the pod count per node.
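A sketch of such a spread constraint for the airflow-web pods from the question, added to the pod template spec:

```yaml
topologySpreadConstraints:
- maxSkew: 1                         # keep per-node counts within 1 of each other
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway  # soft: prefer balance, don't block scheduling
  labelSelector:
    matchLabels:
      component: airflow-web
```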

Can Kubernetes be forced to restart a failed pod on a different node?

When running a Kubernetes job I've set spec.spec.restartPolicy: OnFailure and spec.backoffLimit: 30. When a pod fails, it's sometimes doing so because of a hardware incompatibility (a MATLAB segfault on some hardware). Kubernetes restarts the pod each time on the same node, so it has no chance of avoiding the problem.
Can I instruct Kubernetes to try a different node on restart?
Once a Pod is scheduled, it cannot be moved to another Node.
The Job controller can create a new Pod if you specify spec.spec.restartPolicy: Never.
There is a chance that this new Pod will be scheduled on a different Node.
I did a quick experiment with podAntiAffinity:, but it looks like it's ignored by the scheduler (which makes sense, as the previous Pod is in the Error state).
BTW: if you can add labels to the failing nodes, it becomes possible to avoid them by using node selection.
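One sketch of this: since a plain nodeSelector only matches labels positively, a NotIn node-affinity expression against a hypothetical label applied to the failing nodes does the avoidance:

```yaml
# Assuming the failing nodes have been labelled first, e.g.:
#   kubectl label nodes bad-node-1 hardware=incompatible
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: hardware          # hypothetical label on the failing nodes
          operator: NotIn
          values: ['incompatible']
```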
restartPolicy only refers to restarts of the Containers by the Kubelet on the same node.
Setting restartPolicy: OnFailure will prevent the never-ending creation of pods, because the kubelet will just restart the failing container on the same node.
If you want new pods to be created on failure with restartPolicy: Never, you can limit them by setting activeDeadlineSeconds. However, pods may still be recreated on the same node as the failed ones. Upon reaching the deadline without success, the job will have a status with reason: DeadlineExceeded; no more pods will be created, and existing pods will be deleted.
.spec.backoffLimit is just the number of retries.
The Job controller recreates the failed Pods (associated with the Job) with an exponential back-off delay, and this delay is managed by the Job controller.
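Putting those fields together in a Job manifest (names, image, and values are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  backoffLimit: 30            # number of retries before the Job is marked failed
  activeDeadlineSeconds: 600  # give up after 10 minutes regardless of retries
  template:
    spec:
      restartPolicy: Never    # each failure creates a fresh pod, which may land on another node
      containers:
      - name: worker
        image: example/worker:latest   # hypothetical image
```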
Take a look: pod-lifecycle.
However, as a workaround, you may want your Pods to end up on specific nodes which are working properly.
These scenarios are addressed by a number of primitives in Kubernetes:
nodeSelector — a simple Pod scheduling feature that allows scheduling a Pod onto a node whose labels match the nodeSelector labels specified.
Node Affinity — an enhanced version of nodeSelector which offers a more expressive syntax for fine-grained control of how Pods are scheduled to specific nodes.
There are two types of affinity in Kubernetes: node affinity and Pod affinity. Similarly to nodeSelector, node affinity attracts a Pod to certain nodes, while Pod affinity attracts a Pod to certain Pods. In addition, Kubernetes supports Pod anti-affinity, which repels a Pod from other Pods.
Here's an example of a pod that uses node affinity:
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/e2e-az-name
            operator: In
            values:
            - e2e-az1
            - e2e-az2
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: another-node-label-key
            operator: In
            values:
            - another-node-label-value
  containers:
  - name: with-node-affinity
    image: k8s.gcr.io/pause:2.0
This node affinity rule says the pod can only be placed on a node with a label whose key is kubernetes.io/e2e-az-name and whose value is either e2e-az1 or e2e-az2. In addition, among nodes that meet that criteria, nodes with a label whose key is another-node-label-key and whose value is another-node-label-value should be preferred.
To label nodes you can use command:
$ kubectl label nodes <your-node-name> key=value
See definition: scheduling-pods.
As another workaround, you may taint the specific nodes that are not working properly: taints allow a Node to repel a set of Pods.
See more: taint-nodes-kubernetes.
Taints make it possible to mark a node as NoSchedule: by default, pods cannot be scheduled on this node unless you add tolerations to the pods, which allow the scheduler to place them on nodes whose taints match the toleration configuration. The command below:
$ kubectl taint nodes example-node key=value:NoSchedule
places a taint on node example-node. The taint has key key, value value, and taint effect NoSchedule. This means that no pod will be able to schedule onto example-node unless it has a matching toleration.
See: node-taint.
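A pod that should still be allowed onto the tainted node would carry a toleration matching the taint above, e.g.:

```yaml
# Fragment of the pod spec; matches the key=value:NoSchedule taint
tolerations:
- key: "key"
  operator: "Equal"
  value: "value"
  effect: "NoSchedule"
```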

When and Where to use Kubernetes Pod Affinity Rules

I'm trying to understand if it is good practice to use a podAntiAffinity rule to prefer that Pods in my Deployment avoid being scheduled on the same node, thus spreading the Pods out across my Kubernetes cluster.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: "app.kubernetes.io/name"
            operator: In
            values:
            - "foo"
        topologyKey: "kubernetes.io/hostname"
The documentation suggests avoiding the use of podAntiAffinity for clusters with hundreds of nodes which suggests that there is a performance impact to using them.
Also, if I don't use them, isn't the default scheduler behaviour to space out Pods anyway?
I suppose it also matters what the Deployment is for. It makes sense to use podAntiAffinity for a Redis cache, for example, but wouldn't it make even more sense to use a DaemonSet in that case? Also, what is the recommendation for a web server Pod?
You use Pod/Node affinity rules if you want to schedule pods onto certain nodes by matching specified conditions with more expressive methods. I am not sure you can use them to avoid being scheduled on the same node.
If you don't use an affinity rule, kube-scheduler will look for a feasible node to schedule the pod on, and this is generally not the same node.
You make kube-scheduler "think" more by defining affinity rules, and it is normal that in big clusters this can affect performance.
Also, to understand how kube-scheduler iterates over Nodes by default, you can check this documentation.