What is the difference between NoExecute, NoSchedule, PreferNoSchedule? - kubernetes

https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/ The docs are not very clear as to what exactly the values represent.
the system will try to avoid placing a pod that does not tolerate the
taint on the node, but it is not required
What does 'try' imply? If I said a function will try to sort a list of numbers, it would not be very clear what that means.

Although there is a slight difference, I prefer Google's explanation of what node taints are to the one in the Kubernetes docs:
A node taint lets you mark a node so that the scheduler avoids or prevents using it for certain Pods. A complementary feature, tolerations, lets you designate Pods that can be used on "tainted" nodes.
Node taints are key-value pairs associated with an effect. Here are the available effects:
NoSchedule: Pods that do not tolerate this taint are not scheduled on the node.
PreferNoSchedule: Kubernetes avoids scheduling Pods that do not tolerate this taint onto the node. This one basically means: avoid the node if possible.
NoExecute: Pod is evicted from the node if it is already running on the node, and is not scheduled onto the node if it is not yet running on the node.
Note that the difference between NoSchedule and NoExecute is that the first one prevents a pod from being scheduled, but does not kill a pod that is already running on the node. The last one also evicts the running pod; if a controller such as a Deployment manages it, a replacement is scheduled on another node.
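The three effects can be summarised in a small decision function. This is only an illustrative sketch of the rules quoted above, not scheduler code; the function name and the outcome strings are made up for the example.

```python
def taint_outcome(effect: str, tolerated: bool, already_running: bool) -> str:
    """Describe what happens to a pod on a node that carries one taint.

    effect          -- "NoSchedule", "PreferNoSchedule" or "NoExecute"
    tolerated       -- True if the pod has a matching toleration
    already_running -- True if the pod was running on the node before
                       the taint was added
    """
    if tolerated:
        # A matching toleration means the taint does not affect the pod.
        return "unaffected"
    if effect == "NoSchedule":
        # Hard rule for new pods, running pods are left alone.
        return "keeps running" if already_running else "not scheduled"
    if effect == "PreferNoSchedule":
        # Soft rule: the scheduler avoids the node but may still use it.
        return "keeps running" if already_running else "scheduled only as a last resort"
    if effect == "NoExecute":
        # Hard rule for new pods, and running pods are evicted.
        return "evicted" if already_running else "not scheduled"
    raise ValueError(f"unknown taint effect: {effect}")


for effect in ("NoSchedule", "PreferNoSchedule", "NoExecute"):
    print(effect, "->", taint_outcome(effect, tolerated=False, already_running=True))
```

The only case in which a running pod is removed is NoExecute without a matching toleration.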

Your question, as I understand it, is about PreferNoSchedule. It is not a strict requirement: the scheduler tries to place pods that do not tolerate the taint on other nodes, but it may still place them on the tainted node if no other node is suitable. A pod that is already running on that node won't be removed.
I would suggest diving into the design proposal docs if you are still confused; they clarified a lot of concepts for me.

Related

Kubernetes scheduler

Does the Kubernetes scheduler assign the pods to the nodes one by one in a queue (not in parallel)?
Based on this, I guess that might be the case since it is mentioned that the nodes are iterated round robin.
I want to make sure that the pod scheduling is not being done in parallel.
Short answer
Taking into consideration all the processes kube-scheduler performs when it's scheduling the pod, the answer is yes.
Scheduler and pods
For every newly created pod or other unscheduled pods, kube-scheduler
selects an optimal node for them to run on. However, every container
in pods has different requirements for resources and every pod also
has different requirements. Therefore, existing nodes need to be
filtered according to the specific scheduling requirements.
In a cluster, Nodes that meet the scheduling requirements for a Pod
are called feasible nodes. If none of the nodes are suitable, the pod
remains unscheduled until the scheduler is able to place it.
The scheduler finds feasible Nodes for a Pod and then runs a set of
functions to score the feasible Nodes and picks a Node with the
highest score among the feasible ones to run the Pod. The scheduler
then notifies the API server about this decision in a process called
binding.
Reference - kube-scheduler.
The scheduler determines which Nodes are valid placements for each Pod
in the scheduling queue according to constraints and available
resources.
Reference - kube-scheduler - synopsis.
In short, kube-scheduler picks up pods one by one, assesses each pod and its requests, then proceeds to find appropriate feasible nodes to schedule it on.
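The filter, score and bind steps described above can be sketched as a simple sequential loop. This is a toy model, not the kube-scheduler implementation: the only resource is CPU, the scoring rule (prefer the node with the most free CPU) is an assumption chosen for brevity, and all names are hypothetical.

```python
def schedule_one(pod, nodes):
    """Pick a node for a single pod, or return None if no node is feasible."""
    # Filtering: keep only nodes that can satisfy the pod's request.
    feasible = [n for n in nodes if n["free_cpu"] >= pod["cpu"]]
    if not feasible:
        return None  # the pod stays unscheduled
    # Scoring: an assumed rule that prefers the node with the most free CPU.
    best = max(feasible, key=lambda n: n["free_cpu"])
    # Binding: record the decision so the next pod sees the updated state.
    best["free_cpu"] -= pod["cpu"]
    return best["name"]


def schedule_queue(pods, nodes):
    """Process the queue strictly one pod at a time."""
    placements = {}
    for pod in pods:
        placements[pod["name"]] = schedule_one(pod, nodes)
    return placements


nodes = [{"name": "a", "free_cpu": 4}, {"name": "b", "free_cpu": 2}]
pods = [
    {"name": "p1", "cpu": 3},
    {"name": "p2", "cpu": 2},
    {"name": "p3", "cpu": 2},
]
print(schedule_queue(pods, nodes))
```

Because each placement changes the free capacity seen by the next pod, the order of the queue matters in this model: p3 remains unscheduled only because p1 and p2 were placed first.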
Scheduler and nodes
The link you mention is about nodes: they are iterated round robin so that all feasible nodes get a fair chance to run pods.
Nodes in a cluster that meet the scheduling requirements of a Pod are
called feasible Nodes for the Pod
The information here applies to the default kube-scheduler. Other schedulers can be used instead, and it is even possible to implement your own. It is also possible to run multiple schedulers in a cluster.
Useful links:
Node selection in kube-scheduler
Kubernetes scheduler

Can a pod run on multiple nodes?

I have one Kubernetes master and three Kubernetes nodes. I made one pod, which is running on a specific node. I want to run that pod on 2 nodes. How can I achieve this? Does the replica concept help me? If yes, how?
Yes, you can assign pods to one or more nodes of your cluster, and here are some options to achieve this:
nodeSelector
nodeSelector is the simplest recommended form of node selection constraint. nodeSelector is a field of PodSpec. It specifies a map of key-value pairs. For the pod to be eligible to run on a node, the node must have each of the indicated key-value pairs as labels (it can have additional labels as well). The most common usage is one key-value pair.
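The matching rule described above (every key-value pair in the selector must appear among the node's labels, extra labels are allowed) can be sketched in a few lines. The function name and the example labels are hypothetical.

```python
def matches_node_selector(node_selector: dict, node_labels: dict) -> bool:
    """Return True if the node carries every key-value pair in the selector."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


selector = {"disktype": "ssd"}
print(matches_node_selector(selector, {"disktype": "ssd", "zone": "a"}))
print(matches_node_selector(selector, {"disktype": "hdd"}))
```

An empty selector matches every node in this sketch, which mirrors a pod that sets no nodeSelector at all.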
affinity and anti-affinity
Node affinity is conceptually similar to nodeSelector -- it allows you to constrain which nodes your pod is eligible to be scheduled on, based on labels on the node.
nodeSelector provides a very simple way to constrain pods to nodes with particular labels. The affinity/anti-affinity feature greatly expands the types of constraints you can express. The key enhancements are:
The affinity/anti-affinity language is more expressive. The language offers more matching rules besides exact matches created with a logical AND operation;
you can indicate that the rule is "soft"/"preference" rather than a hard requirement, so if the scheduler can't satisfy it, the pod will still be scheduled;
you can constrain against labels on other pods running on the node (or other topological domain), rather than against labels on the node itself, which allows rules about which pods can and cannot be co-located
DaemonSet
A DaemonSet ensures that all (or some) Nodes run a copy of a Pod. As nodes are added to the cluster, Pods are added to them. As nodes are removed from the cluster, those Pods are garbage collected. Deleting a DaemonSet will clean up the Pods it created.
Some typical uses of a DaemonSet are:
running a cluster storage daemon on every node
running a logs collection daemon on every node
running a node monitoring daemon on every node
Please check this link to read more about how to assign pods to nodes.
It's not good practice to run pods directly on the nodes, as the nodes/pods can crash at any time. It's better to use the K8S controllers, as mentioned in the K8S documentation here.
K8S supports multiple controllers, and the appropriate one can be chosen depending on the requirement. From the original post alone it's difficult to say which controller to use.
You can use a DaemonSet if you want to run the pod on each node.
From what I see, you are trying to deploy a pod on each node. It's better to let the scheduler decide where the pod should be deployed, based on the available resources.
This works best in worst-case scenarios,
I mean in case of node failures.

Kubernetes Autoscaler: no downtime for deployments when downscaling is possible?

In a project, I'm enabling the cluster autoscaler functionality from Kubernetes.
According to the documentation (How does scale down work), I understand that when a node has been used at less than 50% of its capacity for a given time, it is removed together with all of its pods, which are recreated on a different node if needed.
But the following problem can happen: what if all the pods related to a specific deployment are contained in a node that is being removed? That would mean users might experience downtime for the application of this deployment.
Is there a way to avoid that the scale down deletes a node whenever there is a deployment which only contains pods running on that node?
I have checked the documentation, and one possible (but not good) solution, is to add an annotation to all of the pods containing applications here, but this clearly would not down scale the cluster in an optimal way.
In the same documentation:
What happens when a non-empty node is terminated? As mentioned above, all pods should be migrated elsewhere. Cluster Autoscaler does this by evicting them and tainting the node, so they aren't scheduled there again.
What is eviction?
The eviction subresource of a pod can be thought of as a kind of policy-controlled DELETE operation on the pod itself.
OK, but what if all pods on the node get evicted at the same time?
You can use a Pod Disruption Budget to make sure a minimum number of replicas is always running:
What is a PDB?
A PDB limits the number of Pods of a replicated application that are down simultaneously from voluntary disruptions.
In k8s docs you can also read:
A PodDisruptionBudget has three fields:
A label selector .spec.selector to specify the set of pods to which it applies. This field is required.
.spec.minAvailable which is a description of the number of pods from that set that must still be available after the eviction, even in the absence of the evicted pod. minAvailable can be either an absolute number or a percentage.
.spec.maxUnavailable (available in Kubernetes 1.7 and higher) which is a description of the number of pods from that set that can be unavailable after the eviction. It can be either an absolute number or a percentage.
So if you use a PDB for your deployment, its pods should not all be evicted at once.
But please note that if the node fails for some other reason (e.g. hardware failure), you will still experience downtime. If you really care about high availability, consider using pod anti-affinity to make sure the pods are not all scheduled on one node.
The same document you referred to has this:
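To make the effect of minAvailable and maxUnavailable concrete, here is a simplified sketch of how many voluntary evictions such a budget would still permit. It assumes percentages are resolved against the expected replica count and rounded up; the function names are hypothetical, and the real disruption controller takes more state into account.

```python
def resolve(value, expected: int) -> int:
    """Turn an absolute number or a percentage string such as "50%" into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        # Integer ceiling of expected * percent / 100.
        return (expected * percent + 99) // 100
    return int(value)


def allowed_disruptions(expected: int, healthy: int,
                        min_available=None, max_unavailable=None) -> int:
    """Number of pods that may be evicted right now without violating the budget."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        required_healthy = resolve(min_available, expected)
    else:
        required_healthy = expected - resolve(max_unavailable, expected)
    return max(0, healthy - required_healthy)


# Three replicas, all healthy, at least two must stay available.
print(allowed_disruptions(expected=3, healthy=3, min_available=2))
```

With three healthy replicas and minAvailable set to 2, only one pod may be evicted at a time, so draining a node that holds all three replicas has to wait for replacements to become ready elsewhere.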
How is Cluster Autoscaler different from CPU-usage-based node autoscalers? Cluster Autoscaler makes sure that all pods in the
cluster have a place to run, no matter if there is any CPU load or
not. Moreover, it tries to ensure that there are no unneeded nodes in
the cluster.
CPU-usage-based (or any metric-based) cluster/node group autoscalers
don't care about pods when scaling up and down. As a result, they may
add a node that will not have any pods, or remove a node that has some
system-critical pods on it, like kube-dns. Usage of these autoscalers
with Kubernetes is discouraged.

what is the benefit of the taint model over node selector

I am learning Kubernetes and ran into a conceptual question: what is the benefit of the new taint model over the simple node selector?
The documentation talks about a use case where a group of devs might have exclusive rights to a set of nodes through a taint like dedicated=groupA:NoSchedule. But I thought we could do the same thing with a simple nodeSelector.
To be more specific, what is the role of the effect on this taint? Why not simply use a label, like the rest of Kubernetes?
A node selector affects a single pod template, asking the scheduler to place it on a set of nodes. A NoSchedule taint affects all pods, asking the scheduler to keep every pod that does not tolerate it off the node.
A node selector is useful when the pod needs something from the node. For example, requesting a node that has a GPU. A node taint is useful when the node needs to be reserved for special workloads. For example, a node that should only be running pods that will use the GPU (so the GPU node isn't filled with pods that aren't using it).
Sometimes they are useful together as in the example above, too. You want the node to only have pods that use the GPU, and you want the pod that needs a GPU to be scheduled to a GPU node. In that case you may want to taint the node with dedicated=gpu:NoSchedule and add both a taint toleration and node selector to the pod template.
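The interplay can be sketched with a small feasibility check. This is a toy model under stated assumptions: tolerations are matched only on an exact key, value and effect, the node label hardware=gpu is made up for the example, and only the dedicated=gpu:NoSchedule taint comes from the text above.

```python
HARD_EFFECTS = {"NoSchedule", "NoExecute"}


def tolerates(tolerations, taint) -> bool:
    """Exact-match toleration check (a simplification of the real matching rules)."""
    return any(
        t["key"] == taint["key"]
        and t["value"] == taint["value"]
        and t["effect"] == taint["effect"]
        for t in tolerations
    )


def can_schedule(pod, node) -> bool:
    """A pod fits if the node has its selector labels and it tolerates all hard taints."""
    selector_ok = all(node["labels"].get(k) == v for k, v in pod["node_selector"].items())
    taints_ok = all(
        tolerates(pod["tolerations"], taint)
        for taint in node["taints"]
        if taint["effect"] in HARD_EFFECTS
    )
    return selector_ok and taints_ok


gpu_taint = {"key": "dedicated", "value": "gpu", "effect": "NoSchedule"}
gpu_node = {"labels": {"hardware": "gpu"}, "taints": [gpu_taint]}
plain_node = {"labels": {}, "taints": []}

gpu_pod = {"node_selector": {"hardware": "gpu"}, "tolerations": [gpu_taint]}
plain_pod = {"node_selector": {}, "tolerations": []}
toleration_only = {"node_selector": {}, "tolerations": [gpu_taint]}

print(can_schedule(gpu_pod, gpu_node), can_schedule(gpu_pod, plain_node))
print(can_schedule(plain_pod, gpu_node), can_schedule(plain_pod, plain_node))
print(can_schedule(toleration_only, gpu_node), can_schedule(toleration_only, plain_node))
```

The taint keeps ordinary pods off the GPU node, while the selector keeps the GPU pod off ordinary nodes. A toleration on its own only permits the GPU node, it does not require it, which is why both are added to the pod template.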

How do I schedule the same pod on different nodes using kubectl scale?

New to kubernetes. Can I use kubectl scale --replicas=N and start pods on different nodes?
By default the scheduler attempts to spread pods across nodes, so that you don't have multiple pods of the same type on the same node. So there's nothing special required if you're just aiming for best-effort pod spreading.
If you want to express the requirement that the pod must not run on a node that already has a pod of that type on it, you can use pod anti-affinity (an Alpha feature at the time of writing).
If you want to ensure that all nodes (or all nodes matching a certain selector) have that pod on them you can use a DaemonSet.
Scaling a Deployment (or RC) tells the controller-manager to create more pods; the new pods are then subject to scheduling. The K8S scheduler will attempt to find the most reasonable placement for your pods. This does not guarantee that pods will launch on different nodes, but makes it a rather likely scenario if you have the required resources. Unfortunately, it also means that if all pods can fit on one node, there are situations where the scheduler might actually do just that (e.g. all other nodes are in an unschedulable state for some reason). If that happens, the pods will not be rescheduled when conditions change.
To have a solid guarantee that pods will not get colocated on the same node, you have two options:
legacy hack: define a hostPort in your pod template. As a given host port is a resource that can be assigned only once per node, your pods will never exist more than once per node
alpha feature: you can look into Pod AntiAffinity, which at the time of writing was quite early and not really battle-proven yet
The first one has a disadvantage: you can never have more than one pod of this type per node, so it affects, for example, rolling deployments and limits your capacity for scaling (you can never have more active pods than the number of nodes)
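For the anti-affinity option, the rule lives in the affinity.podAntiAffinity field of the pod spec. The sketch below builds such a fragment as a Python dictionary and prints it as JSON. The helper name and the app=web label are hypothetical; the field names and the kubernetes.io/hostname topology key are the ones used in the Kubernetes pod spec.

```python
import json


def one_pod_per_node_affinity(pod_labels: dict) -> dict:
    """Build an affinity block that forbids two pods with these labels on one node."""
    return {
        "podAntiAffinity": {
            # "required..." makes this a hard rule rather than a preference.
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": dict(pod_labels)},
                    # One topology domain per node, keyed by the node's hostname label.
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    }


# Hypothetical label carried by the pods of the Deployment being scaled.
affinity = one_pod_per_node_affinity({"app": "web"})
print(json.dumps(affinity, indent=2))
```

The resulting block would go under spec.template.spec.affinity of the Deployment. As with the hostPort approach, replicas beyond the number of eligible nodes stay pending when the rule is a hard requirement.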