So I have this problem, and try to implement podAffinity to solve it.
I have 3 nodes and want to deploy 2 pods on the same node. In the Deployment YAML files I have service:git under metadata.labels, and the following is the affinity setting:
affinity:
podAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: service
operator: In
values:
- git
topologyKey: kubernetes.io/hostname
But the pods failed to deploy, I got the following error:
0/3 nodes are available: 3 node(s) didn't match pod affinity rules, 3 node(s) didn't match pod affinity/anti-affinity.
Are there any problems with my configuration?
If not, I guess maybe it is because when the first pod is deployed, the system will try to find a node that contains a pod with the label service: git and fail (because it is the first one), and another pod also fail because of the same reason. Is this correct?
But then how to solve the problem (without resorting to workarounds)?
You are using "requiredDuringSchedulingIgnoredDuringExecution:" so it will be looking a "running(already)" pod which has a label "service: git" and itseems you do not already have any pod with that label. so following a is a quick workaround where a test pod will be created with label "service: git" . so that podAffinity rule will find a destination node ( that would be the node where this testpod will be running)
kubectl run testpod --image=busybox --labels="service=git" -- sleep infinite
Once above pod is UP .. all the pods in your deployment also should get created.
if not delete the deployment and re-apply it.
If you need a elegant solution then you can consider using "preferredDuringSchedulingIgnoredDuringExecution" instead of "requiredDuringSchedulingIgnoredDuringExecution"
Update Sept 2022:
The "requiredDuringSchedulingIgnoredDuringExecution:" is effectively ignored when it is the first pod of the deployment and the pod does in fact get scheduled - otherwise if no pods are there then not even the first one will be able to be deployed. The second pod will then obviously see that first pod running and satisfy the rule and so on. This has been confirmed by testing.
Related
I have a kubernetes cluster which has two worker nodes. Each worker node will have one pod. I have configured in helm chart, the hostname of those pods will be pod-0.test.com and pod-1.test.com. I have pointed the coredns to forward any DNS requests that matches ".com" domain to a remote machine where unbound is running which will take of actual DNS resolution.
.com:53 {
errors
cache 30
forward . <remote machine IP>
}
Let's take worker-0 node IP is 10.x.y.z and worker-1 node IP is 10.a.b.c and Let's say, pod-0.test.com sits in worker-0 and pod-1.test.com sits in worker-1. I have DNS entries configured in unbound of remote machine which will resolve as below:
pod-0.test.com -> 10.x.y.z
pod-1.test.com -> 10.a.b.c
When I uninstall pods and reinstall it, there are chances where pod-0.test.com will sit in worker-1 and pod-1.test.com will sit in worker-0. So if the pods gets swapped between worker nodes, I need to again change the unbound configuration and restart unbound. I will come to know which pod sits in which worker node only after pod gets installed but I need to have proper DNS entries in the remote machine configured in prior to this otherwise pods will gets restart when the pod hostname is resolved to wrong IP.
So what I am looking for is to overcome this issue somehow by automating to have proper DNS entries configured according to the worker node IP where the pod sits in. Is there any way to achieve this? Or Is there a possibility that pod or coredns will itself add proper DNS entry in the remote machine (which is configured in forward directive of coredns) before it is coming up like pre-install step? I need to have this pod hostname to worker node IP resolution should happen properly in both remote machine as well as inside the pods.
It would be really helpful if someone has an approach to handle this issue. Thanks in advance!
there is a work around for this issue you can use node selectors for deploying your pods on the same node. If you don’t want to do it in this way, if you are implementing this via a pipeline you can add a few steps to your pipeline for making the entries the flow goes as below.
Trigger CI/CD pipeline → Pod getting deployed → execute kubectl command for getting pods on each node → ssh into the remote machine give sudo privileges if required & change the required config files.
Use the below command to get details of pods running on a particular node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node>
Have you tried using Node Affinity. You can schedule a given pod always to the same Node using node labels. Simply you can use kubernetes.io/hostname label key to select the Node as below:
First Pod
apiVersion: v1
kind: Pod
metadata:
name: pod-0
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- worker1
hostname: pod-0.test.com
containers:
...
...
Second Pod
apiVersion: v1
kind: Pod
metadata:
name: pod-1
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- worker2
hostname: pod-1.test.com
containers:
...
...
we would like to merge 2 kubernetes cluster because we need to establish a communication between the pods and it should also be cheaper.
Cluster 1 should stay intact and cluster 2 will be deleted. The pods in cluster 2 have very high requirements for resources and we would like to create node pool dedicated to these pods.
So the idea is to label the new nodes and also label the pods that were part of cluster 2 before to enforce that they run on these nodes.
What I cannot find an answer for is the following question: How can I ensure that no other pod is scheduled to run on the new node pool without having to redeploy all pods and assigning labels to them?
There are 2 problems you have to solve:
Stop cluster 1 pods from running on cluster 2 nodes
Stop cluster 2 pods from running on cluster 1 nodes
Given your question, it looks like you can make changes to cluster 2 deployments, but don't want to update existing cluster 1 deployments.
The solution to problem 1 is to use taints and tolerations. You can taint your cluster 2 nodes to stop all pods from being scheduled there then add tolerations to your cluster 2 deployments to allow them to ignore this taint. This means that cluster 1 pods cannot be deployed to cluster 2 nodes and problem 1 is solved.
You add a taint like this:
kubectl taint nodes node1 key1=value1:NoSchedule-
and tolerate it in your cluster 2 pod/deployment spec like this:
tolerations:
- key: "key1"
operator: "Equal"
value: "value1"
effect: "NoSchedule"
Problem 2 cannot be solved the same way because you don't want to change deployments for cluster 1 pods. This is a shame because taints are the easiest solution to this. If you could make that change, then you'd simply add a taint to cluster 1 nodes and tolerate it only in cluster 1 deployments.
Given these constraints, the solution is to use node affinity. You'd need to use the requiredDuringSchedulingIgnoredDuringExecution form to ensure that the rules are always followed. The rules themselves can be as simple as a node selector based on labels. A shorter version of the example from the linked docs:
apiVersion: v1
kind: Pod
metadata:
name: with-node-affinity
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: a-node-label-key
operator: In
values:
- a-node-label-value
containers:
- name: with-node-affinity
image: k8s.gcr.io/pause:2.0
When running a Kubernetes job I've set spec.spec.restartPolicy: OnFailure and spec.backoffLimit: 30. When a pod fails it's sometimes doing so because of a hardware incompatibility (matlab segfault on some hardware). Kubernetes is restarting the pod each time on the same node, having no chance of correcting the problem.
Can I instruct Kubernete to try a different node on restart?
Once Pod is scheduled it cannot be moved to another Node.
The Job controller can create a new Pod if you specify spec.spec.restartPolicy: Never.
There is a chance that this new Pod will be scheduled on different Node.
I did a quick experiment with podAntiAffinity: but it looks like it's ignored by scheduler (makes sense as the previous Pod is in Error state).
BTW: If you can add labels to failing nodes it will be possible to avoid them by using nodeSelector: <label>.
restartPolicy only refers to restarts of the Containers by the Kubelet on the same node.
Setting restartPolicy: OnFailure will prevent the neverending creation of pods because it will just restart the failing one on the same node.
If you want to create new pods on failure with restartPolicy: Never, you can limit them by setting activeDeadlineSeconds However pods also will be recreated on the same node as failed ones. Upon reaching the deadline without success, the job will have status with reason: DeadlineExceeded. No more pods will be created, and existing pods will be deleted.
.spec.backoffLimit is just the number of retries.
The Job controller recreates the failed Pods (associated with the Job) in an exponential delay. And of course, this delay time is set by the Job controller
Take a look: pod-lifecycle.
However as a workaround you may want your Pods to end up on specific nodes which are properly working.
These scenarios are addressed by a number of primitives in Kubernetes:
nodeSelector — This is a simple Pod scheduling feature that allows scheduling a Pod onto a node whose labels match the nodeSelector labels specified
Node Affinity — is the enhanced version of the nodeSelector which offers a more expressive syntax for fine-grained control of how Pods are scheduled to specific nodes.
There are two types of affinity in Kubernetes: node affinity and Pod affinity. Similarly to nodeSelector, node affinity attracts a Pod to certain nodes, the Pod affinity attracts a Pod to certain Pods. In addition to that, Kubernetes supports Pod anti-affinity, which repels a Pod from other Pods.
Here's an example of a pod that uses node affinity:
apiVersion: v1
kind: Pod
metadata:
name: pod-with-node-affinity
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/e2e-az-name
operator: In
values:
- e2e-az1
- e2e-az2
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
preference:
matchExpressions:
- key: another-node-label-key
operator: In
values:
- another-node-label-value
containers:
- name: with-node-affinity
image: k8s.gcr.io/pause:2.0
This node affinity rule says the pod can only be placed on a node with a label whose key is kubernetes.io/e2e-az-name and whose value is either e2e-az1 or e2e-az2. In addition, among nodes that meet that criteria, nodes with a label whose key is another-node-label-key and whose value is another-node-label-value should be preferred.
To label nodes you can use command:
$ kubectl label nodes <your-node-name> key=value
See definition: scheduling-pods.
As another workaround you may taint the specific, not working nodes - taints allow a Node to repel a set of Pods.
See more: taint-nodes-kubernetes.
Taints get a possibility to mark a node as NoSchedule - pods by default cannot be spawned on this node until you will add tolerations to pods which will allow scheduler to create pods on nodes with taints specified in toleration configuration. Command below:
$ kubectl taint nodes example-node key=value:NoSchedule
places a taint on node example-node. The taint has key key, value value, and taint effect NoSchedule. This means that no pod will be able to schedule onto node1 unless it has a matching toleration.
See: node-taint.
Pods doesn't balance in node pool. why doesn't spread to each node?
I have 9 instance in 1 node pool. In the past, I’ve tried add to 12 instance. Pods doesn't balance.
image description here
Would like to know if there is any solution that can help solve this problem and used 9 instance in 1 node pool?
Pods are scheduled to run on nodes by the kube-scheduler. And once they are scheduled, they are not rescheduled unless they are removed.
So if you add more nodes, the already running pods won't reschedule.
There is a project in incubator that solves exactly this problem.
https://github.com/kubernetes-incubator/descheduler
Scheduling in Kubernetes is the process of binding pending pods to
nodes, and is performed by a component of Kubernetes called
kube-scheduler. The scheduler's decisions, whether or where a pod can
or can not be scheduled, are guided by its configurable policy which
comprises of set of rules, called predicates and priorities. The
scheduler's decisions are influenced by its view of a Kubernetes
cluster at that point of time when a new pod appears first time for
scheduling. As Kubernetes clusters are very dynamic and their state
change over time, there may be desired to move already running pods to
some other nodes for various reasons:
Some nodes are under or over utilized.
The original scheduling decision does not hold true any more, as taints or labels are added to or removed from nodes, pod/node
affinity requirements are not satisfied any more.
Some nodes failed and their pods moved to other nodes.
New nodes are added to clusters.
You should look into inter-pod anti-affinity. This feature allows you to constrain where your pods should not be scheduled based on the labels of the pods running on a node. In your case, given your app has label app-label, you can use it to ensure pods do not get scheduled on nodes that have pods with the label app-label. For example:
apiVersion: apps/v1
kind: Deployment
...
spec:
selector:
matchLabels:
label-key: label-value
template:
metadata:
labels:
label-key: label-value
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: label-key
operator: In
values:
- label-value
topologyKey: "kubernetes.io/hostname"
...
PS: If you use requiredDuringSchedulingIgnoredDuringExecution, you can have at most as many pods as you have nodes. If you expect to have more pods than nodes available, you will have to use preferredDuringSchedulingIgnoredDuringExecution, which makes antiaffinity be a preference, rather than an obligation.
I have an application pod which will be deployed on k8s cluster
But as Kubernetes scheduler decides on which node this pod needs to run
Now I want to add taint to the node dynamically where my application pod is running with NOschedule so that no new pods will be scheduled on this node
I know that we can use kubectl taint node with NOschedule if I know the node name but I want to achieve this dynamically based on which node this application pod is running
The reason why I want to do this is this is critical application pod which shouldn’t have down time and for good reasons I have only 1 pod for this application across the cluster
Please suggest
In addition to #Rico answer.
You can use feature called node affinity, this is still a beta but some functionality is already implemented.
You should add a label to your node, for example test-node-affinity: test. Once this is done you can Add the nodeAffinity of field affinity in the PodSpec.
spec:
...
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: test-node-affinity
operator: In
values:
- test
This will mean the POD will look for a node with key test-node-affinity and value test and will be deployed there.
I recommend reading this blog Taints and tolerations, pod and node affinities demystified by Toader Sebastian.
Also familiarise yourself with Taints and Tolerations from Kubernetes docs.
You can get the node where your pod is running with something like this:
$ kubectl get pod myapp-pod -o=jsonpath='{.spec.nodeName}'
Then you can taint it:
$ kubectl taint nodes <node-name-from-above> key=value:NoSchedule
or the whole thing in one command:
$ kubectl taint nodes $(kubectl get pod myapp-pod -o=jsonpath='{.spec.nodeName}') key=value:NoSchedule