Kubernetes Restrict Node to run labeled pods only - kubernetes

we would like to merge 2 kubernetes cluster because we need to establish a communication between the pods and it should also be cheaper.
Cluster 1 should stay intact and cluster 2 will be deleted. The pods in cluster 2 have very high requirements for resources and we would like to create node pool dedicated to these pods.
So the idea is to label the new nodes and also label the pods that were part of cluster 2 before to enforce that they run on these nodes.
What I cannot find an answer for is the following question: How can I ensure that no other pod is scheduled to run on the new node pool without having to redeploy all pods and assigning labels to them?

There are 2 problems you have to solve:
Stop cluster 1 pods from running on cluster 2 nodes
Stop cluster 2 pods from running on cluster 1 nodes
Given your question, it looks like you can make changes to cluster 2 deployments, but don't want to update existing cluster 1 deployments.
The solution to problem 1 is to use taints and tolerations. You can taint your cluster 2 nodes to stop all pods from being scheduled there then add tolerations to your cluster 2 deployments to allow them to ignore this taint. This means that cluster 1 pods cannot be deployed to cluster 2 nodes and problem 1 is solved.
You add a taint like this:
kubectl taint nodes node1 key1=value1:NoSchedule-
and tolerate it in your cluster 2 pod/deployment spec like this:
tolerations:
- key: "key1"
operator: "Equal"
value: "value1"
effect: "NoSchedule"
Problem 2 cannot be solved the same way because you don't want to change deployments for cluster 1 pods. This is a shame because taints are the easiest solution to this. If you could make that change, then you'd simply add a taint to cluster 1 nodes and tolerate it only in cluster 1 deployments.
Given these constraints, the solution is to use node affinity. You'd need to use the requiredDuringSchedulingIgnoredDuringExecution form to ensure that the rules are always followed. The rules themselves can be as simple as a node selector based on labels. A shorter version of the example from the linked docs:
apiVersion: v1
kind: Pod
metadata:
name: with-node-affinity
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: a-node-label-key
operator: In
values:
- a-node-label-value
containers:
- name: with-node-affinity
image: k8s.gcr.io/pause:2.0

Related

Install Kubernetes-embedded pods run only specific node

I have a Kubernetes cluster, and running 3 nodes. But I want to run my app on only two nodes. So I want to ask, Can I run other pods (Kubernetes extensions) in the Kubernetes cluster only on a single node?
node = Only Kubernetes pods
node = my app
node = my app
Yes, you can run the application POD on only two nodes and other extension Kubernetes POD on a single node.
When you say Kubernetes extension POD by that consider some external third-party PODs like Nginx ingress controller and other not default system POD like kube-proxy, kubelet, etc those should require to run each available node.
Option 1
You can use the Node affinity to schedule PODs on specific nodes.
apiVersion: v1
kind: Pod
metadata:
name: with-node-affinity
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: topology.kubernetes.io/hostname
operator: In
values:
- node-1
- node-2
containers:
- name: with-node-affinity
image: nginx
Option 2
You can use the taint & toleration to schedule the PODs on specific nodes.
Certain kube-system pods like kube-proxy, the CNI pods (cilium/flannel) and other daemonSet must run on each of the worker node, you can not stop them. If that is not the case for you, a node can be taint to noSchedule using below command.
kubectl taint nodes type=<a_node_label>:NoSchedule
The further enhancement you can explore https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/

Kubernetes EKS deployment set soft node affinity to split pods 50/50 per nodegroup

I have an EKS cluster with two nodegroups each in different AZ. One deployment Deployment1 is running on 2 namespaces for redundancy, one copy per namespace and each of them run in separate AZs/nodegroup. Also there is another deployment Deployment2 that does not have any node affinity set and K8s manages where pods get scheduled.
Both deployments are huge with lots of pods. I have a subnet of 250 IPs available to me for each node group.
The problem is that while Deployment1 is fine on it's own, and gets split almost equally per AZ/Nodegroup, the Deployment2 tends to schedule most pods in one of the nodegroups and that ends when there are no more IPs available. This is a problem for Deployment1 since one namespace of it is tied to that nodegroup and no new pods can be scheduled there if load changes.
Can I somehow balance Deployment2 so it has 'soft affinity' that would split it 50/50 per each nodegroup, but if needed, can schedule pods in the other nodegroup?
If you're using Kubernetes 1.19 or later you can use topologySpreadConstraints, adding this to the pod template:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: ScheduleAnyway
labelSelector:
matchLabels:
foo: bar
where maxSkew define how uneven pods can be scheduled, topologyKey is the key of node labels and the labelSelector matches a label of your deployment. See docs
If your on an older Kubernetes version, you can look at pod anti affinity.

Can Kubernetes be forced to restart a failed pod on a differet node?

When running a Kubernetes job I've set spec.spec.restartPolicy: OnFailure and spec.backoffLimit: 30. When a pod fails it's sometimes doing so because of a hardware incompatibility (matlab segfault on some hardware). Kubernetes is restarting the pod each time on the same node, having no chance of correcting the problem.
Can I instruct Kubernete to try a different node on restart?
Once Pod is scheduled it cannot be moved to another Node.
The Job controller can create a new Pod if you specify spec.spec.restartPolicy: Never.
There is a chance that this new Pod will be scheduled on different Node.
I did a quick experiment with podAntiAffinity: but it looks like it's ignored by scheduler (makes sense as the previous Pod is in Error state).
BTW: If you can add labels to failing nodes it will be possible to avoid them by using nodeSelector: <label>.
restartPolicy only refers to restarts of the Containers by the Kubelet on the same node.
Setting restartPolicy: OnFailure will prevent the neverending creation of pods because it will just restart the failing one on the same node.
If you want to create new pods on failure with restartPolicy: Never, you can limit them by setting activeDeadlineSeconds However pods also will be recreated on the same node as failed ones. Upon reaching the deadline without success, the job will have status with reason: DeadlineExceeded. No more pods will be created, and existing pods will be deleted.
.spec.backoffLimit is just the number of retries.
The Job controller recreates the failed Pods (associated with the Job) in an exponential delay. And of course, this delay time is set by the Job controller
Take a look: pod-lifecycle.
However as a workaround you may want your Pods to end up on specific nodes which are properly working.
These scenarios are addressed by a number of primitives in Kubernetes:
nodeSelector — This is a simple Pod scheduling feature that allows scheduling a Pod onto a node whose labels match the nodeSelector labels specified
Node Affinity — is the enhanced version of the nodeSelector which offers a more expressive syntax for fine-grained control of how Pods are scheduled to specific nodes.
There are two types of affinity in Kubernetes: node affinity and Pod affinity. Similarly to nodeSelector, node affinity attracts a Pod to certain nodes, the Pod affinity attracts a Pod to certain Pods. In addition to that, Kubernetes supports Pod anti-affinity, which repels a Pod from other Pods.
Here's an example of a pod that uses node affinity:
apiVersion: v1
kind: Pod
metadata:
name: pod-with-node-affinity
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/e2e-az-name
operator: In
values:
- e2e-az1
- e2e-az2
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
preference:
matchExpressions:
- key: another-node-label-key
operator: In
values:
- another-node-label-value
containers:
- name: with-node-affinity
image: k8s.gcr.io/pause:2.0
This node affinity rule says the pod can only be placed on a node with a label whose key is kubernetes.io/e2e-az-name and whose value is either e2e-az1 or e2e-az2. In addition, among nodes that meet that criteria, nodes with a label whose key is another-node-label-key and whose value is another-node-label-value should be preferred.
To label nodes you can use command:
$ kubectl label nodes <your-node-name> key=value
See definition: scheduling-pods.
As another workaround you may taint the specific, not working nodes - taints allow a Node to repel a set of Pods.
See more: taint-nodes-kubernetes.
Taints get a possibility to mark a node as NoSchedule - pods by default cannot be spawned on this node until you will add tolerations to pods which will allow scheduler to create pods on nodes with taints specified in toleration configuration. Command below:
$ kubectl taint nodes example-node key=value:NoSchedule
places a taint on node example-node. The taint has key key, value value, and taint effect NoSchedule. This means that no pod will be able to schedule onto node1 unless it has a matching toleration.
See: node-taint.

How to fix "pods is not balanced" in Kubernetes cluster

Pods doesn't balance in node pool. why doesn't spread to each node?
I have 9 instance in 1 node pool. In the past, I’ve tried add to 12 instance. Pods doesn't balance.
image description here
Would like to know if there is any solution that can help solve this problem and used 9 instance in 1 node pool?
Pods are scheduled to run on nodes by the kube-scheduler. And once they are scheduled, they are not rescheduled unless they are removed.
So if you add more nodes, the already running pods won't reschedule.
There is a project in incubator that solves exactly this problem.
https://github.com/kubernetes-incubator/descheduler
Scheduling in Kubernetes is the process of binding pending pods to
nodes, and is performed by a component of Kubernetes called
kube-scheduler. The scheduler's decisions, whether or where a pod can
or can not be scheduled, are guided by its configurable policy which
comprises of set of rules, called predicates and priorities. The
scheduler's decisions are influenced by its view of a Kubernetes
cluster at that point of time when a new pod appears first time for
scheduling. As Kubernetes clusters are very dynamic and their state
change over time, there may be desired to move already running pods to
some other nodes for various reasons:
Some nodes are under or over utilized.
The original scheduling decision does not hold true any more, as taints or labels are added to or removed from nodes, pod/node
affinity requirements are not satisfied any more.
Some nodes failed and their pods moved to other nodes.
New nodes are added to clusters.
You should look into inter-pod anti-affinity. This feature allows you to constrain where your pods should not be scheduled based on the labels of the pods running on a node. In your case, given your app has label app-label, you can use it to ensure pods do not get scheduled on nodes that have pods with the label app-label. For example:
apiVersion: apps/v1
kind: Deployment
...
spec:
selector:
matchLabels:
label-key: label-value
template:
metadata:
labels:
label-key: label-value
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: label-key
operator: In
values:
- label-value
topologyKey: "kubernetes.io/hostname"
...
PS: If you use requiredDuringSchedulingIgnoredDuringExecution, you can have at most as many pods as you have nodes. If you expect to have more pods than nodes available, you will have to use preferredDuringSchedulingIgnoredDuringExecution, which makes antiaffinity be a preference, rather than an obligation.

Prefer dispatch replicas among nodes

I'm running a Kubernetes cluster of 3 nodes with GKE. I ask Kubernetes for 3 replicas of backend pods. The 3 pods are not well dispatched among the nodes to provide a high-availability service, they are usually all on 2 nodes.
I would like Kubernetes to dispatch the pods as much as possible to have a pod on each node, but not fail the deployment/scale-up if they are more backend pods than nodes.
Is it possible to do that with preferredDuringSchedulingIgnoredDuringExecution?
Try setting up an preferred antiAffinity rule like so:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: "app"
operator: In
values:
- "my_app_name"
topologyKey: "kubernetes.io/hostname"
This will try to schedule pods onto nodes which do not already have a pod of the same label running on them. After that it's a free for all (so it won't evenly spread them after making sure at least 1 is running on each node). This means that after scaling up you might end up with a node with 5 pods, and other nodes with 1 pod each.