I'm trying to get pods scheduled on the master node. I successfully untainted the node:
kubectl taint node mymasternode node-role.kubernetes.io/master:NoSchedule-
node/mymasternode untainted
But after changing replicas to 4 in deploy.yaml and applying it, all the pods are still scheduled on the nodes that were already workers.
Is there an extra step needed to get pods scheduled on the master node as well?
To get pods scheduled on control plane nodes that have a taint applied (which most Kubernetes distributions do), add a toleration to your manifests, as described in the Kubernetes documentation, rather than untainting the control plane node. Untainting it can be dangerous: if that node runs out of resources, your cluster's operation is likely to suffer.
Something like the following should work:
tolerations:
- key: node-role.kubernetes.io/master
  effect: NoSchedule
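For the Deployment in the question, the toleration belongs in the pod template (spec.template.spec), not at the top level of the Deployment. A minimal sketch, assuming a hypothetical Deployment named my-app that runs nginx (neither name comes from the question). The second toleration is an assumption for clusters built with newer kubeadm releases, which taint control plane nodes with node-role.kubernetes.io/control-plane instead of the master key; kubectl describe node shows which taint your nodes actually carry.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app            # hypothetical name
spec:
  replicas: 4
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: nginx      # placeholder image
      tolerations:
      # Allow (but do not force) scheduling onto nodes carrying the master taint.
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      # Assumption: newer clusters may use this key instead.
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
```

A toleration only permits placement on the tainted node. The scheduler may still put all four replicas on the workers if they have room.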
If you're looking to get a pod scheduled to every node, usually the approach is to create a daemonset with that toleration applied.
If you need to have a pod scheduled to a control plane node, without using a daemonset, it's possible to combine a toleration with scheduling information to get it assigned to a specific node. The simplest approach to this is to specify the target node name in the manifest.
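A minimal sketch of the node-name approach, assuming the control plane node is called mymasternode as in the question. The pod name and image are placeholders. Setting spec.nodeName pins the pod to that one node:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pinned-nginx      # hypothetical name
spec:
  nodeName: mymasternode  # node name taken from the question
  containers:
  - name: nginx
    image: nginx          # placeholder image
  tolerations:
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule
```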
This isn't a very flexible approach, so for example if you wanted to assign pods to any control plane node, you could apply a label to those nodes and use a node selector combined with the toleration to get the workloads assigned there.
By default the master is tainted so that no pods are scheduled on it. Adding a toleration allows pods to be scheduled on the master, but does not guarantee it. To make sure a pod is scheduled on the master only, we also add a nodeSelector:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  tolerations:
  - key: "node-role.kubernetes.io/master"
    operator: "Exists"
    effect: "NoSchedule"
  nodeSelector:
    node-role.kubernetes.io/master: ""
Proof of concept:
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  8s    default-scheduler  Successfully assigned default/nginx to controlplane
  Normal  Pulled     7s    kubelet            Container image "nginx" already present on machine
  Normal  Created    7s    kubelet            Created container nginx
  Normal  Started    6s    kubelet            Started container nginx
I have a Kubernetes cluster running 3 nodes, but I want to run my app on only two of them. Can I run the other pods (Kubernetes extensions) on a single node only?
node = Only Kubernetes pods
node = my app
node = my app
Yes, you can run the application pods on only two nodes and the other Kubernetes extension pods on a single node.
By Kubernetes extension pods I assume you mean external third-party pods such as the NGINX ingress controller, not default system components such as kube-proxy or the kubelet, which need to run on every available node.
Option 1
You can use node affinity to schedule pods on specific nodes.
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          # kubernetes.io/hostname is the well-known label holding the node's hostname
          - key: kubernetes.io/hostname
            operator: In
            values:
            - node-1
            - node-2
  containers:
  - name: with-node-affinity
    image: nginx
Option 2
You can use taints and tolerations to schedule the pods on specific nodes.
Certain kube-system pods such as kube-proxy, the CNI pods (Cilium/Flannel) and other DaemonSets must run on every worker node; you cannot stop them. If that is not a concern for you, a node can be tainted with NoSchedule using the command below.
kubectl taint nodes <node-name> type=<a_node_label>:NoSchedule
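Pods that should still land on the tainted node then need a matching toleration. A minimal sketch, assuming the taint was applied as type=extensions:NoSchedule (the value extensions, the pod name and the image are made up for illustration):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: extension-pod     # hypothetical name
spec:
  containers:
  - name: extension
    image: nginx          # placeholder image
  tolerations:
  # Must match the taint's key, value and effect.
  - key: type
    operator: Equal
    value: extensions     # assumed taint value
    effect: NoSchedule
```

The toleration only lets the pod onto the tainted node. To keep it off the other nodes as well, combine it with the node affinity from Option 1.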
For further options, see https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/
When running a Kubernetes job I've set spec.template.spec.restartPolicy: OnFailure and spec.backoffLimit: 30. When a pod fails, it sometimes does so because of a hardware incompatibility (MATLAB segfault on some hardware). Kubernetes restarts the pod on the same node each time, so it has no chance of correcting the problem.
Can I instruct Kubernetes to try a different node on restart?
Once a Pod is scheduled, it cannot be moved to another Node.
The Job controller can create a new Pod if you specify spec.template.spec.restartPolicy: Never.
There is a chance that this new Pod will be scheduled on a different Node.
I did a quick experiment with podAntiAffinity, but it looks like it is ignored by the scheduler (which makes sense, as the previous Pod is in the Error state).
BTW: If you can add labels to the failing nodes, you can avoid them with a node affinity rule that uses the NotIn or DoesNotExist operator on that label. A plain nodeSelector can only select nodes, so with nodeSelector: <label> you would label the working nodes instead.
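A minimal sketch of steering a Job away from labelled nodes, assuming the bad nodes have been labelled hardware-compat=bad (a made-up label) and using placeholder Job and image names. Because nodeSelector can only require a label, the sketch uses node affinity with NotIn. restartPolicy: Never makes the Job controller create a fresh Pod for each retry, and the affinity rule keeps those Pods off the labelled nodes:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: matlab-job                  # hypothetical name
spec:
  backoffLimit: 30
  template:
    spec:
      restartPolicy: Never          # failed Pods are replaced, not restarted in place
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              # Skip every node labelled hardware-compat=bad (assumed label).
              - key: hardware-compat
                operator: NotIn
                values:
                - bad
      containers:
      - name: worker
        image: example.com/matlab-worker:latest   # placeholder image
```

Nodes without the label still match NotIn, so only the nodes you explicitly mark as bad are excluded.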
restartPolicy only refers to restarts of the containers by the kubelet on the same node.
Setting restartPolicy: OnFailure prevents the never-ending creation of pods, because the failing pod is simply restarted on the same node.
If you want new pods to be created on failure with restartPolicy: Never, you can limit them by setting activeDeadlineSeconds. However, the new pods may still be created on the same node as the failed ones. Upon reaching the deadline without success, the Job gets a status with reason: DeadlineExceeded. No more pods will be created, and existing pods will be deleted.
.spec.backoffLimit is just the number of retries.
The Job controller recreates the failed Pods (associated with the Job) with an exponentially increasing delay. This delay is set by the Job controller.
Take a look: pod-lifecycle.
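The two limits can be combined in one Job spec. A minimal sketch with made-up values and placeholder names: activeDeadlineSeconds bounds the total run time of the Job, while backoffLimit bounds the number of retries.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: bounded-job               # hypothetical name
spec:
  backoffLimit: 30                # give up after 30 retries
  activeDeadlineSeconds: 3600     # or after one hour in total (illustrative value)
  template:
    spec:
      restartPolicy: Never        # each retry is a new Pod
      containers:
      - name: worker
        image: busybox            # placeholder image
        command: ["sh", "-c", "exit 1"]   # always fails, to exercise the retry logic
```

Whichever limit is reached first ends the Job.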
However, as a workaround, you may want your Pods to end up on specific nodes that are working properly.
These scenarios are addressed by a number of primitives in Kubernetes:
nodeSelector: a simple Pod scheduling feature that allows scheduling a Pod onto a node whose labels match the nodeSelector labels specified.
Node affinity: an enhanced version of nodeSelector that offers a more expressive syntax for fine-grained control of how Pods are scheduled to specific nodes.
There are two types of affinity in Kubernetes: node affinity and Pod affinity. Similarly to nodeSelector, node affinity attracts a Pod to certain nodes, while Pod affinity attracts a Pod to certain Pods. In addition, Kubernetes supports Pod anti-affinity, which repels a Pod from other Pods.
Here's an example of a pod that uses node affinity:
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/e2e-az-name
            operator: In
            values:
            - e2e-az1
            - e2e-az2
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: another-node-label-key
            operator: In
            values:
            - another-node-label-value
  containers:
  - name: with-node-affinity
    image: k8s.gcr.io/pause:2.0
This node affinity rule says the pod can only be placed on a node with a label whose key is kubernetes.io/e2e-az-name and whose value is either e2e-az1 or e2e-az2. In addition, among nodes that meet those criteria, nodes with a label whose key is another-node-label-key and whose value is another-node-label-value should be preferred.
To label nodes, you can use the command:
$ kubectl label nodes <your-node-name> key=value
See definition: scheduling-pods.
As another workaround, you may taint the specific nodes that are not working: taints allow a Node to repel a set of Pods.
See more: taint-nodes-kubernetes.
Taints make it possible to mark a node as NoSchedule: by default, pods cannot be scheduled on such a node unless they carry tolerations that match the node's taints. The command below:
$ kubectl taint nodes example-node key=value:NoSchedule
places a taint on node example-node. The taint has key key, value value, and taint effect NoSchedule. This means that no pod will be able to schedule onto example-node unless it has a matching toleration.
See: node-taint.
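For reference, a toleration matching that taint could look like the following sketch. The pod name and image are placeholders; the key, value and effect mirror the kubectl taint command above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tolerant-pod      # hypothetical name
spec:
  containers:
  - name: app
    image: nginx          # placeholder image
  tolerations:
  - key: "key"
    operator: "Equal"     # key and value must both match the taint
    value: "value"
    effect: "NoSchedule"
```

Pods created by the Job carry no such toleration unless you add one, so they stay off the tainted node. Only pods that should still run there need it.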
I am thinking about partitioning my Kubernetes cluster into zones of dedicated nodes for exclusive use by dedicated sets of users as discussed here. I am wondering how tainting nodes would affect DaemonSets, including those that are vital to cluster operation (e.g. kube-proxy, kube-flannel-ds-amd64)?
The documentation says daemon pods respect taints and tolerations. But if so, how can the system schedule e.g. kube-proxy pods on nodes tainted with kubectl taint nodes node-x zone=zone-y:NoSchedule when the pod (which is not under my control but owned by Kubernetes' own DaemonSet kube-proxy) does not carry a corresponding toleration?
What I have found empirically so far is that Kubernetes 1.14 reschedules a kube-proxy pod regardless (after I have deleted it on the tainted node-x), which seems to contradict the documentation. On the other hand, this does not seem to be the case for my own DaemonSet. When I kill its pod on node-x, it only gets rescheduled after I remove the node's taint (or presumably after I add a toleration to the pod's spec inside the DaemonSet).
So how do DaemonSets and tolerations interoperate in detail? Could it be that certain DaemonSets (such as kube-proxy, kube-flannel-ds-amd64) are treated specially?
Your kube-proxy and flannel daemonsets will have many tolerations defined in their manifest that mean they will get scheduled even on tainted nodes.
Here are a couple from my canal daemonset:
tolerations:
- effect: NoSchedule
  operator: Exists
- key: CriticalAddonsOnly
  operator: Exists
- effect: NoExecute
  operator: Exists
Here are the taints from one of my master nodes:
taints:
- effect: NoSchedule
  key: node-role.kubernetes.io/controlplane
  value: "true"
- effect: NoExecute
  key: node-role.kubernetes.io/etcd
  value: "true"
Even though most workloads won't be scheduled on the master because of its NoSchedule and NoExecute taints, a canal pod will run there because the daemonset's tolerations (operator: Exists with no key) match every taint with those effects.
The doc you already linked to goes into detail.
I had the same issue. My daemonset needed to run its pods on every node (critical pod definition). I had this tolerations definition in the daemonset:
spec:
  template:
    spec:
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
And it was running only on the one node with no taint definition...
I checked my kube-proxy, which differed by just one line:
spec:
  template:
    spec:
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - operator: Exists
So I added this "- operator: Exists" line (I do not really understand what it does or how) to my daemonset definition, and now it works fine. My daemonset starts pods on every node of my cluster...
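As for what that extra line does: a toleration with operator: Exists and no key matches every taint, so the pods tolerate anything a node is tainted with. The annotated sketch below contrasts the three forms seen in this thread (the comments are explanatory additions, not part of the original manifests):

```yaml
tolerations:
# Matches only taints whose key is CriticalAddonsOnly (any value, any effect).
- key: CriticalAddonsOnly
  operator: Exists
# No key, but an effect: matches every taint with the NoSchedule effect.
- effect: NoSchedule
  operator: Exists
# No key and no effect: matches every taint on every node.
- operator: Exists
```

That is why the daemonset with only the CriticalAddonsOnly toleration stayed on the untainted node, while the one with the bare Exists entry runs everywhere.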
I have an application pod that will be deployed on a k8s cluster.
The Kubernetes scheduler decides which node this pod runs on.
I want to dynamically add a NoSchedule taint to the node where my application pod is running, so that no new pods will be scheduled on that node.
I know that I can use kubectl taint node with NoSchedule if I know the node name, but I want to achieve this dynamically, based on which node this application pod is running on.
The reason is that this is a critical application pod which shouldn't have downtime, and for good reasons I have only 1 pod for this application across the cluster.
Please suggest.
In addition to @Rico's answer:
You can use a feature called node affinity. This was still in beta at the time of writing, but some functionality is already implemented.
You should add a label to your node, for example test-node-affinity: test. Once this is done, you can add the nodeAffinity field under affinity in the PodSpec.
spec:
  ...
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: test-node-affinity
            operator: In
            values:
            - test
This means the pod will look for a node with key test-node-affinity and value test and will be deployed there.
I recommend reading the blog post "Taints and tolerations, pod and node affinities demystified" by Toader Sebastian.
Also familiarise yourself with Taints and Tolerations from the Kubernetes docs.
You can get the node where your pod is running with something like this:
$ kubectl get pod myapp-pod -o=jsonpath='{.spec.nodeName}'
Then you can taint it:
$ kubectl taint nodes <node-name-from-above> key=value:NoSchedule
or the whole thing in one command:
$ kubectl taint nodes $(kubectl get pod myapp-pod -o=jsonpath='{.spec.nodeName}') key=value:NoSchedule
I am running a cluster with 1 master and 1 node. Now, when I run a DaemonSet, it only shows 1 desired node, while it should be 2. There is no error I could find anywhere in the describe/logs, but the DaemonSet only chooses 1 node to run on. I am using Kubernetes 1.9.1.
Any idea what I could be doing wrong? Or how to debug it?
TIA.
This happens if the k8s master node has the node-role.kubernetes.io/master: NoSchedule taint and the daemonset has no toleration for it.
The node-role.kubernetes.io/master: NoSchedule toleration is needed in k8s 1.6 or later to schedule daemonsets on master nodes.
Add the following toleration for the daemonset in the YAML file to make k8s schedule daemonsets on the master node too:
...
kind: DaemonSet
spec:
  ...
  template:
    ...
    spec:
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
Taints of the master node can be checked by:
kubectl describe node <master node>
Tolerations of a pod can be checked by:
kubectl describe pod <pod name>
More info about daemonsets is in https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/.
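Filled out into a complete manifest, the toleration fragment above might look like the sketch below. The DaemonSet name, labels, image and command are placeholders, and the second toleration is an assumption for newer clusters that taint control plane nodes with node-role.kubernetes.io/control-plane rather than the master key.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent          # hypothetical name
spec:
  selector:
    matchLabels:
      app: node-agent
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      containers:
      - name: agent
        image: busybox      # placeholder image
        command: ["sh", "-c", "while true; do sleep 3600; done"]
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      # Assumption: key used by newer clusters for the same purpose.
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
```

After applying it, kubectl get daemonset should report a desired count that includes the master node.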
By default, your cluster will not schedule pods on the master for security reasons. If you want to be able to schedule pods on the master, e.g. for a single-machine Kubernetes cluster for development, run:
kubectl taint nodes --all node-role.kubernetes.io/master-