Kubernetes - Enable automatic pod rescheduling on taint/toleration

In the following scenario:
Pod X has a toleration for a taint
However, node A with that taint does not exist yet
Pod X gets scheduled on a different node B in the meantime
Node A with the proper taint becomes Ready
Here, Kubernetes does not trigger an automatic rescheduling of pod X onto node A, since it is already running fine on node B. Is there a way to enable that automatic rescheduling to node A?

Natively, probably not, unless you:
change the taint on node B to NoExecute (it was probably already set that way):
NoExecute - the pod will be evicted from the node (if it is already running on the node), and will not be scheduled onto the node (if it is not yet running on the node).
update the toleration of the pod
That is:
You can put multiple taints on the same node and multiple tolerations on the same pod.
The way Kubernetes processes multiple taints and tolerations is like a filter: start with all of a node’s taints, then ignore the ones for which the pod has a matching toleration; the remaining un-ignored taints have the indicated effects on the pod. In particular,
if there is at least one un-ignored taint with effect NoSchedule then Kubernetes will not schedule the pod onto that node
If that is not possible, then using Node Affinity could help (keep in mind it is a different mechanism from taints: taints only repel pods from nodes, they never attract them).
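For illustration, a minimal sketch of a pod that both tolerates the taint and is steered towards node A through a preferred node affinity. The taint key/value (dedicated=node-a) and the node label are made up for the example; substitute whatever taint and label node A actually carries:

apiVersion: v1
kind: Pod
metadata:
  name: pod-x
spec:
  # Tolerate the NoExecute taint so the pod may run (and keep running) on node A.
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "node-a"
    effect: "NoExecute"
  # Prefer node A via a label; a toleration alone never pulls a pod onto a node.
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: dedicated
            operator: In
            values: ["node-a"]
  containers:
  - name: app
    image: nginx

Note that even with this spec, Kubernetes will not move a pod that is already running on node B; the scheduling decision is only re-evaluated when the pod is recreated.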

Related

What order are different kubernetes affinities/anti affinities checked when scheduling pod

The title sort of says it all, but to give an example I have a pod failing to schedule.
0/19 nodes are available: 1 node(s) didn't match pod affinity rules, 1 node(s) didn't match pod affinity/anti-affinity, 15 node(s) had a volume node affinity conflict, 3 node(s) had taint {node-role.kubernetes.io/controlplane: true}, that the pod didn't tolerate
Okay first oddity, I see 20 nodes listed as failing when I have only 19 nodes. Since I don't set any pod anti affinity at all I presume the 1 node that didn't match pod affinity is the same node that didn't match pod affinity/anti-affinity, though why it's referenced twice I have no clue.
The rest I'm confused about. I would expect the pod I am trying to schedule to fail for all but one node due to both its pod affinity and its volume node affinity, though the 3 tainted nodes are expected as well.
What I wouldn't expect is only 1 node to fail pod affinity and a bunch of others to fail volume node affinity. I'd assumed it would check pod affinity first, fail 18 nodes based off of that, then go on and check volume affinity. Perhaps that means my pod affinity check isn't working right since only one failed?
But the key word there is assume. I don't actually know the order that affinities are checked when there are multiple types of affinity, and as such I don't know which affinity I should be looking at as the culprit for my scheduling failure.
So are there any clearly defined rules for how affinity/anti affinity checks are made and which takes priority when scheduling?

Kubernetes Pod won't start - 1 node(s) had a volume affinity conflict

I have a pod that won't start with a volume affinity conflict. This is a bare-metal cluster so it's unrelated to regions. The pod has 4 persistent volume claims which are all reporting bound so I'm assuming it's not one of those. There are 4 nodes, one of them is tainted so that the pod will not start on it, one of them is tainted specifically so that the pod WILL start on it. That's the only affinity I have set up to my knowledge. The message looks like this:
0/4 nodes are available: 1 node(s) had taint {XXXXXXX},
that the pod didn't tolerate, 1 node(s) had volume node
affinity conflict, 2 Insufficient cpu, 2 Insufficient memory.
This is what I would have expected apart from the volume affinity conflict. There are no other affinities set other than to point it at this node. I'm really not sure why it's doing this or where to even begin. The message isn't super helpful. It does NOT say which node or which volume there is a problem with. The one thing I don't really understand is how binding works. One of the PVC's is mapped to a PV on another node however it is reporting as bound so I'm not completely certain if that's the problem. I am using local-storage as the storage class. I'm wondering if that's the problem but I'm fairly new to Kubernetes and I'm not sure where to look.
You have 4 nodes, but none of them are available for scheduling due to different conditions. Note that each node can be affected by multiple issues, so the numbers can add up to more than the total number of nodes. Let's address these issues one by one:
Insufficient memory: Execute kubectl describe node <node-name> to check how much free memory is available there. Check the requests and limits of your pods. Note that Kubernetes reserves the full amount of memory a pod requests, regardless of how much the pod actually uses.
Insufficient cpu: Analogous to the above.
node(s) had volume node affinity conflict: Check whether the nodeAffinity of your PersistentVolume (kubectl describe pv) matches a label on the node (kubectl get nodes --show-labels). Check that any nodeSelector in your pod also matches. Make sure you set up the Affinity and/or AntiAffinity rules correctly; see the sketch after this list for what to compare.
node(s) had taint {XXXXXXX}, that the pod didn't tolerate: You can use kubectl describe node <node-name> to check taints and kubectl taint nodes <node-name> <taint-name>- to remove them. See the Taints and Tolerations documentation for more details.
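As a sketch of what to compare for the volume node affinity conflict, assuming a local-storage PersistentVolume and a node named node-2 (both names are illustrative): the PV's nodeAffinity must select a label that actually exists on the node where the data lives.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-example
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  storageClassName: local-storage
  local:
    path: /mnt/data
  # This selector must match a label on the node that holds the data;
  # if it points at a node the pod cannot land on, the scheduler reports
  # a "volume node affinity conflict".
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values: ["node-2"]

With local-storage, each PV is pinned to one node this way, so a pod bound to several local PVs can only run on a node that satisfies all of their nodeAffinity selectors at once.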

What is the optimal scheduling strategy for K8s pods?

Here is what I am working with.
I have 3 nodepools on GKE
n1s1 (3.75GB)
n1s2 (7.5GB)
n1s4 (15GB)
I have pods that will require any of the following memory requests. Assume limits are very close to requests.
1GB, 2GB, 4GB, 6GB, 8GB, 10GB, 12GB, 14GB
How best can I associate a pod to a nodepool for max efficiency?
So far I have 3 strategies.
For each pod config, determine the “rightful nodepool”. This is the smallest nodepool that can accommodate the pod config in an ideal world.
So for 2GB pod it's n1s1 but for 4GB pod it'd be n1s2.
Schedule a pod only on its rightful nodepool.
Schedule a pod only on its rightful nodepool or one nodepool higher than that.
Schedule a pod only on any nodepool where it can currently go.
Which of these or any other strategies will minimize wasting resources?
Why would you have 3 pools like that in the first place? You generally want to use the largest instance type you can that gets you under 110 pods per node (which is the default hard cap). The job of the scheduler is to optimize the packing for you, and it's pretty good at that with the default settings.
I would use a mix of Taints and Tolerations and Node affinity.
Taints and tolerations work together to ensure that pods are not scheduled onto inappropriate nodes. One or more taints are applied to a node; this marks that the node should not accept any pods that do not tolerate the taints. Tolerations are applied to pods, and allow (but do not require) the pods to schedule onto nodes with matching taints.
You can set a taint on a node: kubectl taint nodes node1 key=value:NoSchedule
The taint has key key, value value, and taint effect NoSchedule. This means that no pod will be able to schedule onto node1 unless it has a matching toleration.
When writing the pod YAML, you can add a toleration to the PodSpec that matches the taint created on node1; a pod with either of the following tolerations will be able to schedule onto node1:
tolerations:
- key: "key"
operator: "Equal"
value: "value"
effect: "NoSchedule"
or
tolerations:
- key: "key"
operator: "Exists"
effect: "NoSchedule"
Taints and tolerations are a flexible way to steer pods away from nodes or evict pods that shouldn’t be running. A few of the use cases are
Dedicated Nodes: If you want to dedicate a set of nodes for exclusive use by a particular set of users, you can add a taint to those nodes (say, kubectl taint nodes nodename dedicated=groupName:NoSchedule) and then add a corresponding toleration to their pods (this would be done most easily by writing a custom admission controller). The pods with the tolerations will then be allowed to use the tainted (dedicated) nodes as well as any other nodes in the cluster. If you want to dedicate the nodes to them and ensure they only use the dedicated nodes, then you should additionally add a label similar to the taint to the same set of nodes (e.g. dedicated=groupName), and the admission controller should additionally add a node affinity to require that the pods can only schedule onto nodes labeled with dedicated=groupName (see the sketch after this list).
Nodes with Special Hardware: In a cluster where a small subset of nodes have specialized hardware (for example GPUs), it is desirable to keep pods that don’t need the specialized hardware off of those nodes, thus leaving room for later-arriving pods that do need the specialized hardware. This can be done by tainting the nodes that have the specialized hardware (e.g. kubectl taint nodes nodename special=true:NoSchedule or kubectl taint nodes nodename special=true:PreferNoSchedule) and adding a corresponding toleration to pods that use the special hardware. As in the dedicated nodes use case, it is probably easiest to apply the tolerations using a custom admission controller. For example, it is recommended to use Extended Resources to represent the special hardware, taint your special hardware nodes with the extended resource name and run the ExtendedResourceToleration admission controller. Now, because the nodes are tainted, no pods without the toleration will schedule on them. But when you submit a pod that requests the extended resource, the ExtendedResourceToleration admission controller will automatically add the correct toleration to the pod and that pod will schedule on the special hardware nodes. This will make sure that these special hardware nodes are dedicated for pods requesting such hardware and you don’t have to manually add tolerations to your pods.
Taint based Evictions: A per-pod-configurable eviction behavior when there are node problems, described further in the Taints and Tolerations documentation.
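A minimal sketch of the dedicated-nodes pattern from the list above, reusing the dedicated=groupName key/value from the quote; the toleration and affinity are written into the pod spec by hand here, standing in for what an admission controller would inject.

On the nodes:

# Taint and label the dedicated nodes.
kubectl taint nodes nodename dedicated=groupName:NoSchedule
kubectl label nodes nodename dedicated=groupName

And in the PodSpec:

tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "groupName"
  effect: "NoSchedule"
affinity:
  nodeAffinity:
    # The toleration lets the pod onto the tainted nodes; the required
    # node affinity makes sure it can only land on those nodes.
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: dedicated
          operator: In
          values: ["groupName"]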
As for node affinity:
is conceptually similar to nodeSelector – it allows you to constrain which nodes your pod is eligible to be scheduled on, based on labels on the node.
There are currently two types of node affinity, called requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution. You can think of them as “hard” and “soft” respectively, in the sense that the former specifies rules that must be met for a pod to be scheduled onto a node (just like nodeSelector but using a more expressive syntax), while the latter specifies preferences that the scheduler will try to enforce but will not guarantee. The “IgnoredDuringExecution” part of the names means that, similar to how nodeSelector works, if labels on a node change at runtime such that the affinity rules on a pod are no longer met, the pod will still continue to run on the node. In the future we plan to offer requiredDuringSchedulingRequiredDuringExecution which will be just like requiredDuringSchedulingIgnoredDuringExecution except that it will evict pods from nodes that cease to satisfy the pods’ node affinity requirements.
Thus an example of requiredDuringSchedulingIgnoredDuringExecution would be “only run the pod on nodes with Intel CPUs” and an example of preferredDuringSchedulingIgnoredDuringExecution would be “try to run this set of pods in failure zone XYZ, but if it’s not possible, then allow some to run elsewhere”.
Node affinity is specified as field nodeAffinity of field affinity in the PodSpec.
...
The new node affinity syntax supports the following operators: In, NotIn, Exists, DoesNotExist, Gt, Lt. You can use NotIn and DoesNotExist to achieve node anti-affinity behavior, or use node taints to repel pods from specific nodes.
If you specify both nodeSelector and nodeAffinity, both must be satisfied for the pod to be scheduled onto a candidate node.
If you specify multiple nodeSelectorTerms associated with nodeAffinity types, then the pod can be scheduled onto a node if one of the nodeSelectorTerms is satisfied.
If you specify multiple matchExpressions associated with a single nodeSelectorTerm, then the pod can be scheduled onto a node only if all the matchExpressions are satisfied.
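Tying this back to the nodepool question, a sketch of the "rightful nodepool or one above it" strategy for a 4GB pod, assuming the cloud.google.com/gke-nodepool label that GKE applies to its nodes and the pool names from the question:

apiVersion: v1
kind: Pod
metadata:
  name: pod-4gb
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "4Gi"
  affinity:
    nodeAffinity:
      # Hard rule: a 4GB pod may only land on the 7.5GB or 15GB pools.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/gke-nodepool
            operator: In
            values: ["n1s2", "n1s4"]
      # Soft rule: prefer the smaller ("rightful") pool when it has room.
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: cloud.google.com/gke-nodepool
            operator: In
            values: ["n1s2"]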

Stop scheduling pods on kubernetes master

For testing purposes, I have enabled pod scheduling on the kubernetes master node with the following command
kubectl taint nodes --all node-role.kubernetes.io/master-
Now I have added worker node to the cluster and I would like to stop scheduling pods on master. How do I do that?
You simply taint the node again.
kubectl taint nodes master node-role.kubernetes.io/master=:NoSchedule
Taints and tolerations work together to ensure that pods are not scheduled onto inappropriate nodes. One or more taints are applied to a node; this marks that the node should not accept any pods that do not tolerate the taints. Tolerations are applied to pods, and allow (but do not require) the pods to schedule onto nodes with matching taints.
Even with a taint placed on the master node, you can specify a toleration for a pod in the PodSpec, and that pod will still be able to schedule onto the master node:
tolerations:
- key: node-role.kubernetes.io/master
effect: NoSchedule
To learn more, see Taints and Tolerations.
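To confirm the taint is back on the control-plane node afterwards (the node name master here is just a placeholder for your own node name):

kubectl describe node master | grep -i taints
# Expected once the taint is applied:
# Taints: node-role.kubernetes.io/master:NoSchedule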

What's the purpose of Kubernetes DaemonSet when replication controllers have node anti-affinity

DaemonSet is a Kubernetes beta resource that can ensure that exactly one pod is scheduled to a group of nodes. The group of nodes is all nodes by default, but can be limited to a subset using nodeSelector or the Alpha feature of node affinity/anti-affinity.
It seems that DaemonSet functionality can be achieved with replication controllers/replica sets with proper node affinity and anti-affinity.
Am I missing something? If that's correct, should DaemonSet be deprecated before it even leaves beta?
As you said, DaemonSet guarantees one pod per node for a subset of the nodes in the cluster. If you use ReplicaSet instead, you need to
use the node affinity/anti-affinity and/or node selector to control the set of nodes to run on (similar to how DaemonSet does it).
use inter-pod anti-affinity to spread the pods across the nodes.
make sure the number of pods > number of nodes in the set, so that every node has one pod scheduled.
However, ensuring (3) is a chore as the set of nodes can change over time. With DaemonSet, you don't have to worry about that, nor would you need to create extra, unschedulable pods. On top of that, DaemonSet does not rely on the scheduler to assign its pods, which makes it useful for cluster bootstrap (see How Daemon Pods are scheduled).
See the "Alternative to DaemonSet" section in the DaemonSet doc for more comparisons. DaemonSet is still the easiest way to run a per-node daemon without external tools.