k8s multiple taints and labels - relationship - kubernetes

I created a node group with two labels and two taints.
spec:
  labels:
    key1: "val1"
    key2: "val2"
  taints:
  - effect: NoSchedule
    key: key1
    value: val2
  - effect: NoSchedule
    key: key2
    value: val2
I have 2 kinds of pods:
either
nodeSelector:
  key1: "val1"
tolerations:
- effect: NoSchedule
  key: key1
  value: val1
or
nodeSelector:
  key2: "val2"
tolerations:
- effect: NoSchedule
  key: key2
  value: val2
None of the pods carries both key1=val1 and key2=val2.
Apparently no pod is being scheduled on my new node group.
Is it because the logic between the label and taint keys is AND?
Is it possible to define OR between the labels and between the taints, so that my pods will be scheduled?

Because of the taints, your pods are not able to be scheduled on the nodes in that node group.
A pod must tolerate both taints, key1 and key2, because the scheduler requires every taint on a node to be tolerated.
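For example, a minimal sketch of a pod spec that keeps your first nodeSelector and tolerates both taints (using operator: Exists so the tolerations match the taints regardless of their values):
spec:
  nodeSelector:
    key1: "val1"
  tolerations:
  # tolerate the key1 taint, whatever its value
  - effect: NoSchedule
    key: key1
    operator: Exists
  # tolerate the key2 taint, whatever its value
  - effect: NoSchedule
    key: key2
    operator: Exists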
Here is my understanding: nodeSelector is a feature to place a pod on a node that has the label you specify, whereas a taint is a feature that prevents pods without a matching toleration from being scheduled on a node.
For example, if you modify a kernel parameter on a node and don't want ordinary pods to land there, you simply don't add the toleration to those pods, and they will not be scheduled on that node. If you do want a particular pod on that node, you add a nodeSelector for the node's label and a toleration for the taint, and then the pod will be scheduled there.
According to the Kubernetes documentation, the use cases for taints are as follows (*1). I don't think any of them matches what you tried.
Dedicated Nodes: If you want to dedicate a set of nodes for exclusive use by a particular set of users, you can add a taint to those nodes (say, kubectl taint nodes nodename dedicated=groupName:NoSchedule) and then add a corresponding toleration to their pods (this would be done most easily by writing a custom admission controller). The pods with the tolerations will then be allowed to use the tainted (dedicated) nodes as well as any other nodes in the cluster. If you want to dedicate the nodes to them and ensure they only use the dedicated nodes, then you should additionally add a label similar to the taint to the same set of nodes (e.g. dedicated=groupName), and the admission controller should additionally add a node affinity to require that the pods can only schedule onto nodes labeled with dedicated=groupName.
Nodes with Special Hardware: In a cluster where a small subset of nodes have specialized hardware (for example GPUs), it is desirable to keep pods that don't need the specialized hardware off of those nodes, thus leaving room for later-arriving pods that do need the specialized hardware. This can be done by tainting the nodes that have the specialized hardware (e.g. kubectl taint nodes nodename special=true:NoSchedule or kubectl taint nodes nodename special=true:PreferNoSchedule) and adding a corresponding toleration to pods that use the special hardware. As in the dedicated nodes use case, it is probably easiest to apply the tolerations using a custom admission controller. For example, it is recommended to use Extended Resources to represent the special hardware, taint your special hardware nodes with the extended resource name and run the ExtendedResourceToleration admission controller. Now, because the nodes are tainted, no pods without the toleration will schedule on them. But when you submit a pod that requests the extended resource, the ExtendedResourceToleration admission controller will automatically add the correct toleration to the pod and that pod will schedule on the special hardware nodes. This will make sure that these special hardware nodes are dedicated for pods requesting such hardware and you don't have to manually add tolerations to your pods.
Taint based Evictions: A per-pod-configurable eviction behavior when there are node problems, which is described in the next section.
*1: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#example-use-cases

Related

Kubernetes Restrict Node to run labeled pods only

We would like to merge 2 Kubernetes clusters because we need to establish communication between the pods, and it should also be cheaper.
Cluster 1 should stay intact and cluster 2 will be deleted. The pods in cluster 2 have very high resource requirements, and we would like to create a node pool dedicated to these pods.
So the idea is to label the new nodes, and also label the pods that were part of cluster 2 before, to enforce that they run on these nodes.
What I cannot find an answer to is the following question: how can I ensure that no other pod is scheduled to run on the new node pool, without having to redeploy all pods and assign labels to them?
There are 2 problems you have to solve:
Stop cluster 1 pods from running on cluster 2 nodes
Stop cluster 2 pods from running on cluster 1 nodes
Given your question, it looks like you can make changes to cluster 2 deployments, but don't want to update existing cluster 1 deployments.
The solution to problem 1 is to use taints and tolerations. You can taint your cluster 2 nodes to stop all pods from being scheduled there then add tolerations to your cluster 2 deployments to allow them to ignore this taint. This means that cluster 1 pods cannot be deployed to cluster 2 nodes and problem 1 is solved.
You add a taint like this:
kubectl taint nodes node1 key1=value1:NoSchedule
and tolerate it in your cluster 2 pod/deployment spec like this:
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
Problem 2 cannot be solved the same way because you don't want to change deployments for cluster 1 pods. This is a shame because taints are the easiest solution to this. If you could make that change, then you'd simply add a taint to cluster 1 nodes and tolerate it only in cluster 1 deployments.
Given these constraints, the solution is to use node affinity. You'd need to use the requiredDuringSchedulingIgnoredDuringExecution form to ensure that the rules are always followed. The rules themselves can be as simple as a node selector based on labels. A shorter version of the example from the linked docs:
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: a-node-label-key
            operator: In
            values:
            - a-node-label-value
  containers:
  - name: with-node-affinity
    image: k8s.gcr.io/pause:2.0
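Note that for the affinity rule to match, the cluster 2 nodes must actually carry that label; assuming the example key/value above, that would be something like:
kubectl label nodes <your-cluster-2-node> a-node-label-key=a-node-label-value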

Can Kubernetes be forced to restart a failed pod on a different node?

When running a Kubernetes job I've set .spec.template.spec.restartPolicy: OnFailure and .spec.backoffLimit: 30. When a pod fails, it is sometimes because of a hardware incompatibility (a MATLAB segfault on some hardware). Kubernetes restarts the pod each time on the same node, so it has no chance of getting past the problem.
Can I instruct Kubernetes to try a different node on restart?
Once a Pod is scheduled, it cannot be moved to another Node.
The Job controller can create a new Pod if you specify .spec.template.spec.restartPolicy: Never.
There is a chance that this new Pod will be scheduled on a different Node.
I did a quick experiment with podAntiAffinity:, but it looks like it is ignored by the scheduler (which makes sense, as the previous Pod is in the Error state).
BTW: if you can add a label to the failing nodes, you can keep new Pods away from them with a node affinity NotIn rule on that label (or label the healthy nodes and use a plain nodeSelector: <label>).
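As a sketch of that last idea (the bad-hardware label name, Job name and image are only placeholders): label the failing node, then make the Job's pod template require nodes without that label:
# hypothetical label applied to the node that keeps segfaulting:
# kubectl label nodes <failing-node> bad-hardware=true
apiVersion: batch/v1
kind: Job
metadata:
  name: matlab-job           # hypothetical name
spec:
  backoffLimit: 30
  template:
    spec:
      restartPolicy: Never
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              # never schedule onto nodes labeled bad-hardware=true
              - key: bad-hardware
                operator: NotIn
                values:
                - "true"
      containers:
      - name: worker
        image: busybox       # placeholder for the real workload image
        command: ["sh", "-c", "exit 0"]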
restartPolicy only refers to restarts of the Containers by the Kubelet on the same node.
Setting restartPolicy: OnFailure will prevent the never-ending creation of pods, because the kubelet will just restart the failing one on the same node.
If you want new pods to be created on failure with restartPolicy: Never, you can limit them by setting activeDeadlineSeconds. However, the new pods may still land on the same node as the failed ones. Upon reaching the deadline without success, the job will have status with reason: DeadlineExceeded. No more pods will be created, and existing pods will be deleted.
.spec.backoffLimit is just the number of retries.
The Job controller recreates the failed Pods (associated with the Job) with an exponentially increasing back-off delay; this delay is managed by the Job controller.
Take a look: pod-lifecycle.
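Putting those Job-level fields together, a minimal sketch (the deadline value and image are arbitrary placeholders):
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  backoffLimit: 30            # number of retries before the Job is marked failed
  activeDeadlineSeconds: 3600 # caps how long the Job may keep creating new pods
  template:
    spec:
      restartPolicy: Never    # failed pods are replaced with new ones instead of restarted in place
      containers:
      - name: worker
        image: busybox        # placeholder image
        command: ["sh", "-c", "exit 0"]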
However, as a workaround, you may want your Pods to end up on specific nodes which are working properly.
These scenarios are addressed by a number of primitives in Kubernetes:
nodeSelector — This is a simple Pod scheduling feature that allows scheduling a Pod onto a node whose labels match the nodeSelector labels specified
Node Affinity — is the enhanced version of the nodeSelector which offers a more expressive syntax for fine-grained control of how Pods are scheduled to specific nodes.
There are two types of affinity in Kubernetes: node affinity and Pod affinity. Similarly to nodeSelector, node affinity attracts a Pod to certain nodes, the Pod affinity attracts a Pod to certain Pods. In addition to that, Kubernetes supports Pod anti-affinity, which repels a Pod from other Pods.
Here's an example of a pod that uses node affinity:
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/e2e-az-name
            operator: In
            values:
            - e2e-az1
            - e2e-az2
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: another-node-label-key
            operator: In
            values:
            - another-node-label-value
  containers:
  - name: with-node-affinity
    image: k8s.gcr.io/pause:2.0
This node affinity rule says the pod can only be placed on a node with a label whose key is kubernetes.io/e2e-az-name and whose value is either e2e-az1 or e2e-az2. In addition, among nodes that meet that criteria, nodes with a label whose key is another-node-label-key and whose value is another-node-label-value should be preferred.
To label nodes you can use the command:
$ kubectl label nodes <your-node-name> key=value
See definition: scheduling-pods.
As another workaround, you may taint the specific malfunctioning nodes: taints allow a Node to repel a set of Pods.
See more: taint-nodes-kubernetes.
Taints let you mark a node as NoSchedule: by default, pods cannot be scheduled on that node until you add tolerations to them, which allows the scheduler to place those pods on nodes carrying the taints named in the toleration configuration. The command below:
$ kubectl taint nodes example-node key=value:NoSchedule
places a taint on node example-node. The taint has key key, value value, and taint effect NoSchedule. This means that no pod will be able to schedule onto example-node unless it has a matching toleration.
See: node-taint.
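A pod that should still be able to land on example-node would then carry a matching toleration, for example:
tolerations:
- key: "key"
  operator: "Equal"
  value: "value"
  effect: "NoSchedule"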

What is the optimal scheduling strategy for K8s pods?

Here is what I am working with.
I have 3 nodepools on GKE
n1s1 (3.75GB)
n1s2 (7.5GB)
n1s4 (15GB)
I have pods that will require any of the following memory requests. Assume limits are very close to requests.
1GB, 2GB, 4GB, 6GB, 8GB, 10GB, 12GB, 14GB
How best can I associate a pod to a nodepool for max efficiency?
So far I have 3 strategies.
For each pod config, determine the "rightful nodepool": the smallest nodepool that can accommodate the pod config in an ideal world. So for a 2GB pod it's n1s1, but for a 4GB pod it'd be n1s2. The strategies:
1. Schedule a pod only on its rightful nodepool.
2. Schedule a pod only on its rightful nodepool or one nodepool larger than that.
3. Schedule a pod on any nodepool where it can currently fit.
Which of these or any other strategies will minimize wasting resources?
Why would you have 3 pools like that in the first place? You generally want to use the largest instance type you can that gets you under 110 pods per node (which is the default hard cap). The job of the scheduler is to optimize the packing for you, and it's pretty good at that with the default settings.
I would use a mix of Taints and Tolerations and Node affinity.
Taints and tolerations work together to ensure that pods are not scheduled onto inappropriate nodes. One or more taints are applied to a node; this marks that the node should not accept any pods that do not tolerate the taints. Tolerations are applied to pods, and allow (but do not require) the pods to schedule onto nodes with matching taints.
You can set a taint on a node kubectl taint nodes node1 key=value:NoSchedule
The taint has key key, value value, and taint effect NoSchedule. This means that no pod will be able to schedule onto node1 unless it has a matching toleration.
When writing a pod YAML, you can add a toleration to the PodSpec that matches the taint created on node1; a pod carrying either of the following tolerations can be scheduled onto node1:
tolerations:
- key: "key"
  operator: "Equal"
  value: "value"
  effect: "NoSchedule"
or
tolerations:
- key: "key"
  operator: "Exists"
  effect: "NoSchedule"
Taints and tolerations are a flexible way to steer pods away from nodes or evict pods that shouldn't be running. A few of the use cases are:
Dedicated Nodes: If you want to dedicate a set of nodes for exclusive use by a particular set of users, you can add a taint to those nodes (say, kubectl taint nodes nodename dedicated=groupName:NoSchedule) and then add a corresponding toleration to their pods (this would be done most easily by writing a custom admission controller). The pods with the tolerations will then be allowed to use the tainted (dedicated) nodes as well as any other nodes in the cluster. If you want to dedicate the nodes to them and ensure they only use the dedicated nodes, then you should additionally add a label similar to the taint to the same set of nodes (e.g. dedicated=groupName), and the admission controller should additionally add a node affinity to require that the pods can only schedule onto nodes labeled with dedicated=groupName.
Nodes with Special Hardware: In a cluster where a small subset of nodes have specialized hardware (for example GPUs), it is desirable to keep pods that don’t need the specialized hardware off of those nodes, thus leaving room for later-arriving pods that do need the specialized hardware. This can be done by tainting the nodes that have the specialized hardware (e.g. kubectl taint nodes nodename special=true:NoSchedule or kubectl taint nodes nodename special=true:PreferNoSchedule) and adding a corresponding toleration to pods that use the special hardware. As in the dedicated nodes use case, it is probably easiest to apply the tolerations using a custom admission controller. For example, it is recommended to use Extended Resources to represent the special hardware, taint your special hardware nodes with the extended resource name and run the ExtendedResourceToleration admission controller. Now, because the nodes are tainted, no pods without the toleration will schedule on them. But when you submit a pod that requests the extended resource, the ExtendedResourceToleration admission controller will automatically add the correct toleration to the pod and that pod will schedule on the special hardware nodes. This will make sure that these special hardware nodes are dedicated for pods requesting such hardware and you don’t have to manually add tolerations to your pods.
Taint based Evictions: A per-pod-configurable eviction behavior when there are node problems, which is described in the next section.
As for node affinity:
is conceptually similar to nodeSelector – it allows you to constrain which nodes your pod is eligible to be scheduled on, based on labels on the node.
There are currently two types of node affinity, called requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution. You can think of them as “hard” and “soft” respectively, in the sense that the former specifies rules that must be met for a pod to be scheduled onto a node (just like nodeSelector but using a more expressive syntax), while the latter specifies preferences that the scheduler will try to enforce but will not guarantee. The “IgnoredDuringExecution” part of the names means that, similar to how nodeSelector works, if labels on a node change at runtime such that the affinity rules on a pod are no longer met, the pod will still continue to run on the node. In the future we plan to offer requiredDuringSchedulingRequiredDuringExecution which will be just like requiredDuringSchedulingIgnoredDuringExecution except that it will evict pods from nodes that cease to satisfy the pods’ node affinity requirements.
Thus an example of requiredDuringSchedulingIgnoredDuringExecution would be “only run the pod on nodes with Intel CPUs” and an example preferredDuringSchedulingIgnoredDuringExecution would be “try to run this set of pods in failure zone XYZ, but if it’s not possible, then allow some to run elsewhere”.
Node affinity is specified as field nodeAffinity of field affinity in the PodSpec.
...
The new node affinity syntax supports the following operators: In, NotIn, Exists, DoesNotExist, Gt, Lt. You can use NotIn and DoesNotExist to achieve node anti-affinity behavior, or use node taints to repel pods from specific nodes.
If you specify both nodeSelector and nodeAffinity, both must be satisfied for the pod to be scheduled onto a candidate node.
If you specify multiple nodeSelectorTerms associated with nodeAffinity types, then the pod can be scheduled onto a node only if all nodeSelectorTerms can be satisfied.
If you specify multiple matchExpressions associated with nodeSelectorTerms, then the pod can be scheduled onto a node if one of the matchExpressions is satisfied.
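As a rough sketch of how this could look for the node pools in the question (the taint key workload-size and value medium are assumptions; the cloud.google.com/gke-nodepool label is added by GKE automatically): taint the n1s2 pool's nodes, then give the pods meant for that pool a matching toleration plus a node affinity pinning them to it:
# assumed taint on the n1s2 pool's nodes:
# kubectl taint nodes <n1s2-node> workload-size=medium:NoSchedule
spec:
  tolerations:
  - key: "workload-size"
    operator: "Equal"
    value: "medium"
    effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          # GKE labels every node with the name of its pool
          - key: cloud.google.com/gke-nodepool
            operator: In
            values:
            - n1s2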

Kubernetes worker node only for a specific type of pod

I have a requirement where I want to schedule a specific type of pod on a particular node, and no other type of pod should get scheduled on that node. For example,
assuming that I have 3 worker nodes: w1, w2 and w3,
I want pods of a certain type (say POD-w2) to always get scheduled on w2, and no other type of pod to get scheduled on w2.
Add a label type=w2 to worker 2.
Use a node selector or node affinity to schedule the required pods on that node.
For the other pods, use node anti-affinity to prevent them from getting scheduled onto worker 2 (see the sketch below).
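A sketch of the node "anti-affinity" for the other pods, assuming worker 2 carries the type=w2 label mentioned above:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        # keep this pod off any node labeled type=w2
        - key: type
          operator: NotIn
          values:
          - w2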
To exclusively use a node for a specific type of pod, you should taint your node as described here. Then, create a toleration in your deployment/pod definition for the node taint to ensure that only that type of pod can be scheduled on the tainted node.
To achieve this, we have to taint the node and also label it for affinity. The required pod should tolerate the taint and also satisfy the affinity. That way, the pod will get scheduled ONLY on the dedicated node.
example:
kubectl taint nodes <dedicated_node_name> dedicated=myservice:NoSchedule
kubectl label node <dedicated_node_name> dedicated=myservice
then use a toleration and affinity in the deployment spec:
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: dedicated
            operator: In
            values:
            - myservice
and
tolerations:
- effect: NoSchedule
  key: dedicated
  operator: Equal
  value: myservice

Kubernetes :: Run POD on node without GPU

I'm using Kubernetes to orchestrate my micro-services.
In my K8S cluster, I have CPU-Only instances and other instances with GPU.
I would like to know how I could force specific pods to run on the instances without a GPU.
Thank you
As explained here you can use taints and tolerations to ensure that some pods will not be scheduled on nodes with GPUs.
All nodes with GPU can be tainted like this:
kubectl taint nodes <nodename> hasgpu=true:NoSchedule
Now add the following to the spec of the pods which do need a GPU. This will ensure that any pod which does not have this toleration will not go to an instance with a GPU attached.
tolerations:
- key: "hasgpu"
  operator: "Equal"
  value: "true"
  effect: "NoSchedule"
You can check out a detailed explanation and examples of taints and tolerations in this blog.
Adding the toleration to each YAML file is not as clean, though; instead, you can use an admission controller to add tolerations dynamically. It will add tolerations to pods which request specific resources such as GPUs. You can find more details here; this solution is more elegant but relatively more work.
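For illustration, assuming the GPU nodes are tainted with the extended resource name (e.g. nvidia.com/gpu) and the ExtendedResourceToleration admission controller is enabled, a pod that needs a GPU only has to request the resource and the toleration is injected for it:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-worker
    image: k8s.gcr.io/pause:2.0   # placeholder image
    resources:
      limits:
        # requesting the extended resource; the admission controller
        # adds the matching toleration automatically
        nvidia.com/gpu: 1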