What is the optimal scheduling strategy for K8s pods?

Here is what I am working with.
I have 3 nodepools on GKE:
n1s1 (3.75GB)
n1s2 (7.5GB)
n1s4 (15GB)
I have pods that will require any of the following memory requests. Assume limits are very close to requests.
1GB, 2GB, 4GB, 6GB, 8GB, 10GB, 12GB, 14GB
How best can I associate a pod with a nodepool for maximum efficiency?
So far I have 3 strategies.
For each pod config, determine the "rightful nodepool": the smallest nodepool that can accommodate the pod config in an ideal world. So for a 2GB pod it's n1s1, but for a 4GB pod it'd be n1s2.
1. Schedule a pod only on its rightful nodepool.
2. Schedule a pod only on its rightful nodepool or one nodepool higher than that.
3. Schedule a pod on any nodepool where it can currently fit.
Which of these, or any other strategy, will minimize wasted resources?

Why would you have 3 pools like that in the first place? You generally want to use the largest instance type you can that gets you under 110 pods per node (which is the default hard cap). The job of the scheduler is to optimize the packing for you, and it's pretty good at that with the default settings.

I would use a mix of taints and tolerations together with node affinity.
Taints and tolerations work together to ensure that pods are not scheduled onto inappropriate nodes. One or more taints are applied to a node; this marks that the node should not accept any pods that do not tolerate the taints. Tolerations are applied to pods, and allow (but do not require) the pods to schedule onto nodes with matching taints.
You can set a taint on a node with kubectl taint nodes node1 key=value:NoSchedule.
The taint has key key, value value, and taint effect NoSchedule. This means that no pod will be able to schedule onto node1 unless it has a matching toleration.
While writing the pod YAML you can add a toleration to the PodSpec that matches the taint created on node1; a pod with either of the following tolerations will be able to schedule onto node1:
tolerations:
- key: "key"
  operator: "Equal"
  value: "value"
  effect: "NoSchedule"
or
tolerations:
- key: "key"
  operator: "Exists"
  effect: "NoSchedule"
Taints and tolerations are a flexible way to steer pods away from nodes or evict pods that shouldn’t be running. A few of the use cases are:
Dedicated Nodes: If you want to dedicate a set of nodes for exclusive use by a particular set of users, you can add a taint to those nodes (say, kubectl taint nodes nodename dedicated=groupName:NoSchedule) and then add a corresponding toleration to their pods (this would be done most easily by writing a custom admission controller). The pods with the tolerations will then be allowed to use the tainted (dedicated) nodes as well as any other nodes in the cluster. If you want to dedicate the nodes to them and ensure they only use the dedicated nodes, then you should additionally add a label similar to the taint to the same set of nodes (e.g. dedicated=groupName), and the admission controller should additionally add a node affinity to require that the pods can only schedule onto nodes labeled with dedicated=groupName.
Nodes with Special Hardware: In a cluster where a small subset of nodes have specialized hardware (for example GPUs), it is desirable to keep pods that don’t need the specialized hardware off of those nodes, thus leaving room for later-arriving pods that do need the specialized hardware. This can be done by tainting the nodes that have the specialized hardware (e.g. kubectl taint nodes nodename special=true:NoSchedule or kubectl taint nodes nodename special=true:PreferNoSchedule) and adding a corresponding toleration to pods that use the special hardware. As in the dedicated nodes use case, it is probably easiest to apply the tolerations using a custom admission controller. For example, it is recommended to use Extended Resources to represent the special hardware, taint your special hardware nodes with the extended resource name and run the ExtendedResourceToleration admission controller. Now, because the nodes are tainted, no pods without the toleration will schedule on them. But when you submit a pod that requests the extended resource, the ExtendedResourceToleration admission controller will automatically add the correct toleration to the pod and that pod will schedule on the special hardware nodes. This will make sure that these special hardware nodes are dedicated for pods requesting such hardware and you don’t have to manually add tolerations to your pods.
Taint based Evictions: a per-pod-configurable eviction behavior when there are node problems, described further in the Kubernetes taint-and-toleration documentation.
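For the dedicated-nodes pattern above, a minimal sketch (the node name node1 and group name groupName are placeholders) would be to taint and label the node, then give the pods a matching toleration and nodeSelector:

kubectl taint nodes node1 dedicated=groupName:NoSchedule
kubectl label nodes node1 dedicated=groupName

and in the PodSpec:

spec:
  nodeSelector:
    dedicated: groupName
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "groupName"
    effect: "NoSchedule"

The toleration lets the pod onto the tainted nodes, while the nodeSelector keeps it off every other node.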
As for node affinity: it is conceptually similar to nodeSelector – it allows you to constrain which nodes your pod is eligible to be scheduled on, based on labels on the node.
There are currently two types of node affinity, called requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution. You can think of them as “hard” and “soft” respectively, in the sense that the former specifies rules that must be met for a pod to be scheduled onto a node (just like nodeSelector but using a more expressive syntax), while the latter specifies preferences that the scheduler will try to enforce but will not guarantee. The “IgnoredDuringExecution” part of the names means that, similar to how nodeSelector works, if labels on a node change at runtime such that the affinity rules on a pod are no longer met, the pod will still continue to run on the node. In the future we plan to offer requiredDuringSchedulingRequiredDuringExecution which will be just like requiredDuringSchedulingIgnoredDuringExecution except that it will evict pods from nodes that cease to satisfy the pods’ node affinity requirements.
Thus an example of requiredDuringSchedulingIgnoredDuringExecution would be “only run the pod on nodes with Intel CPUs” and an example preferredDuringSchedulingIgnoredDuringExecution would be “try to run this set of pods in failure zone XYZ, but if it’s not possible, then allow some to run elsewhere”.
Node affinity is specified as field nodeAffinity of field affinity in the PodSpec.
...
The new node affinity syntax supports the following operators: In, NotIn, Exists, DoesNotExist, Gt, Lt. You can use NotIn and DoesNotExist to achieve node anti-affinity behavior, or use node taints to repel pods from specific nodes.
If you specify both nodeSelector and nodeAffinity, both must be satisfied for the pod to be scheduled onto a candidate node.
If you specify multiple nodeSelectorTerms associated with nodeAffinity types, then the pod can be scheduled onto a node if one of the nodeSelectorTerms is satisfied (the terms are ORed).
If you specify multiple matchExpressions within a single nodeSelectorTerm, then the pod can be scheduled onto a node only if all the matchExpressions are satisfied (the expressions are ANDed).
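To make the syntax concrete, here is a minimal sketch of a pod (the zone values and the node-size label are illustrative, not from the question) that must run in one of two zones and prefers nodes labelled node-size=small:

apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      # hard requirement: one of the listed zones
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - zone-a
            - zone-b
      # soft preference: smaller nodes if available
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: node-size
            operator: In
            values:
            - small
  containers:
  - name: app
    image: nginx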

Related

k8s multiple taints and labels - relationship

I created a node group with two labels and two taints.
spec:
  labels:
    key1: "val1"
    key2: "val2"
  taints:
  - effect: NoSchedule
    key: key1
    value: val2
  - effect: NoSchedule
    key: key2
    value: val2
I have 2 kinds of pods:
either
nodeSelector:
  key1: "val1"
tolerations:
- effect: NoSchedule
  key: key1
  value: val1
or
nodeSelector:
  key2: "val2"
tolerations:
- effect: NoSchedule
  key: key2
  value: val2
None of the pods arrives with both key1 val1 and key2 val2.
Apparently no pod is being scheduled on my new node group.
Is it because the logic between label and taint keys is AND?
Is it possible to define OR between labels and between taints, so that my pods will be scheduled?
Because of the taints, your pods are not able to be scheduled on the node.
You need to tolerate both taints, key1 and key2.
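Given the taints defined above, either pod would need tolerations covering both keys. A minimal sketch (using the Exists operator so the exact taint values don't matter) would be:

tolerations:
- key: "key1"
  operator: "Exists"
  effect: "NoSchedule"
- key: "key2"
  operator: "Exists"
  effect: "NoSchedule"

The nodeSelector on each pod can stay as it is; only the tolerations need to cover both taints.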
Here is my understanding: nodeSelector is a feature that places a pod on a node that carries the label you specified, whereas a taint is a feature that keeps pods without a matching toleration from being scheduled.
For example, if you modify a kernel parameter on a node and don't want ordinary pods scheduled there, you can taint the node; pods without a toleration will not be scheduled on it. If you do want a pod on that node, give the pod a nodeSelector for the node's label and a toleration for the taint, and it will be scheduled there.
According to the Kubernetes documentation *1, the use cases for taints are Dedicated Nodes, Nodes with Special Hardware, and Taint based Evictions (quoted in full earlier on this page). I don't think there is a use case matching what you tried.
*1: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#example-use-cases

Match Deployment to specific nodepool

I am looking to find out if there is a way I can assign a specific Deployment to a specific node pool.
I am planning to deploy a big-size application using kubernetes. I am wondering if there is a way we can assign deployments to specific node pools. In other words, we have 3 types of services:
General services, low performance and low replica count
Monitor services, high I/O and high performance servers needed
Module services, most demanding services, we are aiming to allocate the biggest part of our budget for this.
So obviously we would like to allocate nodes to specific deployments so that no resources go wasted: for example, the low-tier node pool X would only be utilized by General service deployments, the high-tier node pool Y only by the Monitor services, and the highest-tier pool only by the Module services.
I understand that there is a huge number of articles that talk about pod affinity and other related things, but I can't seem to find anything that matches the following:
How to assign Deployment to specific node pool
Thanks in advance!
Another way (in addition to what Yayotrón proposed) would be to work with NodeAffinity and AntiAffinity. For more information check the official documentation here: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/
Taints and tolerations are very strict, and scheduling on other nodes would not be possible at all.
With affinity and anti-affinity you can specify whether you want a hard requirement (requiredDuringSchedulingIgnoredDuringExecution) or a soft preference (preferredDuringSchedulingIgnoredDuringExecution).
This can be achieved using Taints and Tolerations. A quick summary of what they are (from their documentation):
Node affinity, is a property of Pods that attracts them to a set of nodes (either as a preference or a hard requirement). Taints are the opposite -- they allow a node to repel a set of pods.
Tolerations are applied to pods, and allow (but do not require) the pods to schedule onto nodes with matching taints.
Taints and tolerations work together to ensure that pods are not scheduled onto inappropriate nodes. One or more taints are applied to a node; this marks that the node should not accept any pods that do not tolerate the taints.
Or simply by using nodeSelector.
When you register a node to join the Kubernetes cluster you can specify taints and labels with kubelet --register-with-taints=label=value:NoSchedule --node-labels=label2=value2 (a taint always needs an effect such as NoSchedule).
Or you can use kubectl taint for already registered nodes.
Then when you're going to deploy a pod/deployment/statefulset, you can specify its nodeSelector and tolerations:
spec:
  nodeSelector:
    label2: value2
  tolerations:
  - key: "label"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"

Kubernetes - Enable automatic pod rescheduling on taint/toleration

In the following scenario:
Pod X has a toleration for a taint
However, node A with such a taint does not exist
Pod X gets scheduled on a different node B in the meantime
Node A with the proper taint becomes Ready
Here, Kubernetes does not trigger an automatic rescheduling of the pod X on node A as it is properly running on node B. Is there a way to enable that automatic rescheduling to node A?
Natively, probably not, unless you:
change the taint on node B to NoExecute (if it was not already set that way):
NoExecute - the pod will be evicted from the node (if it is already running on the node), and will not be scheduled onto the node (if it is not yet running on the node).
update the toleration of the pod
That is:
You can put multiple taints on the same node and multiple tolerations on the same pod.
The way Kubernetes processes multiple taints and tolerations is like a filter: start with all of a node’s taints, then ignore the ones for which the pod has a matching toleration; the remaining un-ignored taints have the indicated effects on the pod. In particular,
if there is at least one un-ignored taint with effect NoSchedule then Kubernetes will not schedule the pod onto that node
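As a sketch of that route (the key maintenance is a placeholder, not from the question): tainting node B with NoExecute evicts pod X unless the pod tolerates the taint, and tolerationSeconds can bound how long a tolerating pod stays after the taint appears:

kubectl taint nodes nodeB maintenance=true:NoExecute

and in the pod:

tolerations:
- key: "maintenance"
  operator: "Equal"
  value: "true"
  effect: "NoExecute"
  tolerationSeconds: 3600

Without the toleration, the pod is evicted immediately; with it, the pod may stay bound for up to 3600 seconds after the taint is added.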
If that is not possible, then using Node Affinity could help (but that differs from taints)

How to convert Daemonsets to kind Deployment

I have already deployed pods using a DaemonSet with a nodeSelector. My requirement is that I need to use kind Deployment, but at the same time I want to retain the DaemonSet behavior: I have a nodeSelector defined so that the same pod is installed on each labelled node.
Any help on how to achieve this is appreciated.
My requirement is that pods should be placed automatically based on the nodeSelector, but with kind Deployment.
In other words: using a replication controller, when I schedule 2 (two) replicas of a pod I expect 1 (one) replica on each node (VM). Instead I find both replicas created on the same node. This makes one node a single point of failure, which I need to avoid.
I have labelled two nodes properly, and yet I see both pods spawned on a single node. How can I ensure the two pods always schedule onto both nodes?
Look into affinity and anti-affinity, specifically, inter-pod affinity and anti-affinity.
From official documentation:
Inter-pod affinity and anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled based on labels on pods that are already running on the node rather than based on labels on nodes. The rules are of the form “this pod should (or, in the case of anti-affinity, should not) run in an X if that X is already running one or more pods that meet rule Y”.
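A minimal sketch of a Deployment whose replicas repel each other by hostname (the app: my-app label and the image are placeholders), so that 2 replicas must land on 2 different nodes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          # hard rule: never co-locate two pods with this label on one node
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - my-app
            topologyKey: kubernetes.io/hostname
      containers:
      - name: app
        image: nginx

Your existing nodeSelector can sit alongside the affinity block to keep the pods on the labelled nodes.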

What's the purpose of Kubernetes DaemonSet when replication controllers have node anti-affinity

DaemonSet is a Kubernetes beta resource that can ensure that exactly one pod is scheduled to a group of nodes. The group of nodes is all nodes by default, but can be limited to a subset using nodeSelector or the Alpha feature of node affinity/anti-affinity.
It seems that DaemonSet functionality can be achieved with replication controllers/replica sets with proper node affinity and anti-affinity.
Am I missing something? If that's correct should DaemonSet be deprecated before it even leaves Beta?
As you said, DaemonSet guarantees one pod per node for a subset of the nodes in the cluster. If you use ReplicaSet instead, you need to:
1. use node affinity/anti-affinity and/or a node selector to control the set of nodes to run on (similar to how DaemonSet does it);
2. use inter-pod anti-affinity to spread the pods across the nodes;
3. make sure the number of pods > number of nodes in the set, so that every node has one pod scheduled.
However, ensuring (3) is a chore as the set of nodes can change over time. With DaemonSet, you don't have to worry about that, nor would you need to create extra, unschedulable pods. On top of that, DaemonSet does not rely on the scheduler to assign its pods, which makes it useful for cluster bootstrap (see How Daemon Pods are scheduled).
See the "Alternative to DaemonSet" section in the DaemonSet doc for more comparisons. DaemonSet is still the easiest way to run a per-node daemon without external tools.