Kubernetes EKS deployment set soft node affinity to split pods 50/50 per nodegroup - kubernetes

I have an EKS cluster with two nodegroups, each in a different AZ. One deployment, Deployment1, runs in 2 namespaces for redundancy, one copy per namespace, and each copy runs in a separate AZ/nodegroup. There is also another deployment, Deployment2, that has no node affinity set, so K8s decides where its pods get scheduled.
Both deployments are huge with lots of pods. I have a subnet of 250 IPs available to me for each node group.
The problem is that while Deployment1 is fine on its own and gets split almost equally per AZ/nodegroup, Deployment2 tends to schedule most of its pods in one of the nodegroups, and that only stops when there are no more IPs available. This is a problem for Deployment1, since one of its namespaces is tied to that nodegroup and no new pods can be scheduled there if load changes.
Can I somehow balance Deployment2 so it has a 'soft affinity' that splits it 50/50 across the nodegroups, but can still schedule pods in the other nodegroup if needed?

If you're using Kubernetes 1.19 or later you can use topologySpreadConstraints, adding this to the pod template:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      foo: bar
where maxSkew defines how unevenly pods may be distributed, topologyKey is the node label key that defines the topology domains, and labelSelector matches the labels on your deployment's pods. See the docs.
If you're on an older Kubernetes version, you can look at pod anti-affinity.
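For context, here is a minimal sketch of where the constraint sits inside a Deployment's pod template. The Deployment name, the app: deployment2 label, the replica count, and the image are hypothetical placeholders rather than values from the question; the labelSelector has to match the labels set on the pod template for the pods to be counted.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment2                # hypothetical name
spec:
  replicas: 100                    # hypothetical; use your own count
  selector:
    matchLabels:
      app: deployment2
  template:
    metadata:
      labels:
        app: deployment2           # must match the labelSelector below
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway   # soft: prefer balance, never block scheduling
        labelSelector:
          matchLabels:
            app: deployment2
      containers:
      - name: app
        image: registry.k8s.io/pause:3.9    # placeholder image
```

Because whenUnsatisfiable is ScheduleAnyway, the scheduler treats the spread as a preference, so pods can still land in the other nodegroup when one zone is full. With one nodegroup per AZ, spreading by zone should amount to spreading by nodegroup.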

Related

Kubernetes Restrict Node to run labeled pods only

We would like to merge 2 Kubernetes clusters because we need to establish communication between the pods, and it should also be cheaper.
Cluster 1 should stay intact and cluster 2 will be deleted. The pods in cluster 2 have very high resource requirements, and we would like to create a node pool dedicated to these pods.
So the idea is to label the new nodes and also label the pods that were part of cluster 2 before to enforce that they run on these nodes.
What I cannot find an answer for is the following question: How can I ensure that no other pod is scheduled to run on the new node pool without having to redeploy all pods and assigning labels to them?
There are 2 problems you have to solve:
1. Stop cluster 1 pods from running on cluster 2 nodes
2. Stop cluster 2 pods from running on cluster 1 nodes
Given your question, it looks like you can make changes to cluster 2 deployments, but don't want to update existing cluster 1 deployments.
The solution to problem 1 is to use taints and tolerations. You can taint your cluster 2 nodes to stop all pods from being scheduled there then add tolerations to your cluster 2 deployments to allow them to ignore this taint. This means that cluster 1 pods cannot be deployed to cluster 2 nodes and problem 1 is solved.
You add a taint like this:
kubectl taint nodes node1 key1=value1:NoSchedule
and tolerate it in your cluster 2 pod/deployment spec like this:
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
Problem 2 cannot be solved the same way because you don't want to change deployments for cluster 1 pods. This is a shame because taints are the easiest solution to this. If you could make that change, then you'd simply add a taint to cluster 1 nodes and tolerate it only in cluster 1 deployments.
Given these constraints, the solution is to use node affinity. You'd need to use the requiredDuringSchedulingIgnoredDuringExecution form to ensure that the rules are always followed. The rules themselves can be as simple as a node selector based on labels. A shorter version of the example from the linked docs:
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: a-node-label-key
            operator: In
            values:
            - a-node-label-value
  containers:
  - name: with-node-affinity
    image: k8s.gcr.io/pause:2.0
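Putting the two pieces together, a cluster 2 workload would carry both the toleration and the required node affinity. The sketch below reuses the key1=value1 taint and the a-node-label-key label from the snippets above; the Deployment name, the app label, and the image are hypothetical.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster2-app               # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: cluster2-app
  template:
    metadata:
      labels:
        app: cluster2-app
    spec:
      # Problem 1: lets these pods onto the tainted cluster 2 nodes.
      tolerations:
      - key: "key1"
        operator: "Equal"
        value: "value1"
        effect: "NoSchedule"
      # Problem 2: keeps these pods off every node without the label.
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: a-node-label-key
                operator: In
                values:
                - a-node-label-value
      containers:
      - name: app
        image: registry.k8s.io/pause:3.9   # placeholder image
```

The toleration alone only permits scheduling on the tainted nodes; it is the node affinity that forces it, which is why both are needed.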

Schedule few statefulset pods on one node and rest on other node in a kubernetes cluster

I have a kubernetes cluster of 3 worker nodes where I need to deploy a statefulset app having 6 replicas.
My requirement is to make sure in every case, each node should get exactly 2 pods out of 6 replicas. Basically,
node1 - 2 pods of app
node2 - 2 pods of app
node3 - 2 pods of app
========================
Total 6 pods of app
Any help would be appreciated!
You should use Pod Anti-Affinity to make sure that the pods are spread to different nodes.
Since you will have more than one pod per node, use preferredDuringSchedulingIgnoredDuringExecution.
An example for an app with the label app: mydb (use what fits your case):
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 100
    podAffinityTerm:
      labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - mydb
      topologyKey: "kubernetes.io/hostname"
"each node should get exactly 2 pods out of 6 replicas"
Try not to think of the pods as pinned to a certain node. The idea with Kubernetes workloads is that the workload is independent of the underlying infrastructure, such as nodes. What you really want, I assume, is to spread the pods to increase availability, e.g. if one node goes down, your system should still be available.
If you are running at a cloud provider, you should probably design the anti-affinity such that the pods are scheduled to different Availability Zones and not only to different Nodes - but it requires that your cluster is deployed in a Region (consisting of multiple Availability Zones).
Spread pods across Availability Zones
After even distribution, all 3 nodes (scattered over three zones) will have 2 pods each. That is OK. The hard requirement is that if 1 node (say node-1) goes down, its 2 pods should not be rescheduled on the other nodes. When node-1 is restored, those 2 pods should be scheduled back on it. So we can say that all 3 pairs of pods have a different node/zone affinity. Any ideas on this?
This can be done with PodAffinity, but is more likely done using TopologySpreadConstraints and you will probably use topologyKey: topology.kubernetes.io/zone but this depends on what labels your nodes have.
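As a rough sketch of the TopologySpreadConstraints approach, the fragment below would go into the StatefulSet's pod template spec. It assumes the pods carry the app: mydb label used earlier and that the three nodes sit in three different zones labelled with topology.kubernetes.io/zone; swap in kubernetes.io/hostname if the nodes share a zone.

```yaml
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule   # hard: keep a pod Pending rather than unbalance the spread
  labelSelector:
    matchLabels:
      app: mydb                      # assumed pod label
```

With 6 replicas and 3 eligible zones, a maxSkew of 1 with DoNotSchedule only admits a 2/2/2 split. How the pods behave while a node is down depends on whether the node is still registered in the cluster, so treat this as a starting point to test rather than a guarantee.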

How to make sure Kubernetes autoscaler not deleting the nodes which runs specific pod

I am running a Kubernetes cluster (on AWS EKS) with the Autoscaler pod, so that the cluster autoscales according to the resource requests within the cluster.
The cluster also shrinks the number of nodes when the load is reduced. As I observed, the Autoscaler can delete any node in this process.
I want to control this behavior, for example by asking the Autoscaler not to delete nodes that run a specific pod.
For example, if a node runs the Jenkins pod, the Autoscaler should skip that node and delete another matching node from the cluster.
Is there a way to achieve this? Please share your thoughts.
You can set the "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" annotation on the pod:
...
template:
  metadata:
    labels:
      app: jenkins
    annotations:
      "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
  spec:
    nodeSelector:
      failure-domain.beta.kubernetes.io/zone: us-west-2b
...
You should set a pod disruption budget that references specific pods by label. If you want to ensure that there is at least one Jenkins worker pod running at all times, for example, you could create a PDB like
apiVersion: policy/v1  # use policy/v1beta1 on clusters older than Kubernetes 1.21
kind: PodDisruptionBudget
metadata:
  name: jenkins-worker-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: jenkins
      component: worker
(adapted from the basic example in Specifying a Disruption Budget in the Kubernetes docs).
Doing this won't prevent nodes from being destroyed; the cluster autoscaler is still free to scale things down. What it will do is temporarily delay destroying a node until the disruption budget can be met again.
For example, say you've configured your Jenkins setup so that there are three workers. Two get scheduled on the same node, and the autoscaler takes that node offline. The ordinary Kubernetes Deployment system will create two new replicas on nodes that still exist. If the autoscaler also decides it wants to destroy the node that has the last worker, the pod disruption budget above will prevent it from doing so until at least one other worker is running.
When you say "the Jenkins pod" in the question, there are two other important implications to this. One is that you should almost always configure your applications using higher-level objects like Deployments or StatefulSets and not bare Pods. The other is that it is generally useful to run multiple copies of things for redundancy if nothing else. Even absent the cluster autoscaler, disks fail, Amazon occasionally arbitrarily decommissions EC2 instances, and nodes otherwise can go offline outside of your control; you often don't want just one copy of something running in your cluster, especially if you're considering it a critical service.
In the Cluster Autoscaler FAQ on GitHub you can read the following:
What types of pods can prevent CA from removing a node?
Pods with restrictive PodDisruptionBudget.
Kube-system pods that:
    are not run on the node by default, *
    don't have a pod disruption budget set or their PDB is too restrictive (since CA 0.6).
Pods that are not backed by a controller object (so not created by deployment, replica set, job, stateful set etc). *
Pods with local storage. *
Pods that cannot be moved elsewhere due to various constraints (lack of resources, non-matching node selectors or affinity, matching anti-affinity, etc)
Pods that have the following annotation set: "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
*Unless the pod has the following annotation (supported in CA 1.0.3 or later): "cluster-autoscaler.kubernetes.io/safe-to-evict": "true"
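The first item in that list suggests another option: a PodDisruptionBudget that allows no voluntary disruptions at all. The sketch below assumes the Jenkins pod carries the app: jenkins label from the earlier snippet; the PDB name is hypothetical, and policy/v1 assumes Kubernetes 1.21 or later.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: jenkins-no-evict     # hypothetical name
spec:
  maxUnavailable: 0          # no voluntary eviction of matching pods is allowed
  selector:
    matchLabels:
      app: jenkins
```

The trade-off is that this also blocks manual node drains (kubectl drain) for the node running that pod until the PDB is relaxed or removed.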

Kubernetes AntiAffinity - limit max number of same pods per node

I have a Kubernetes cluster with 4 nodes. I have a pod deployed as a deployment, with 8 replicas. When I deployed this, Kubernetes sometimes scheduled 4 pods on node1 and the remaining 4 pods on node2. In this case node3 and node4 don't have this container running (but have other containers running there).
I do understand pod affinity and anti-affinity, where they have the Zookeeper example for pod anti-affinity, which is great. This would make sure that no 2 pods are deployed on the same node.
This is fine; however, my requirement is slightly different: I want to restrict the maximum number of pods k8s can deploy to one node with node anti-affinity.
I need to make sure that no more than 3 instances of the same pod are deployed on a node in my example above. I thought of setting a memory/CPU limit on pods, but that seemed like a bad idea as I have nodes with different configurations. Is there any way to achieve this?
( Update 1 ) - I understand that my question wasn't clear enough. To clarify further, what I want is to limit the instances of a pod to a maximum of 3 per node for a particular deployment. For example, how do I tell k8s not to deploy more than 3 instances of the nginx pod per node? The restriction should only be applied to the nginx deployment and not to other deployments.
( Update 2 ) - To further explain with a scenario.
A k8s cluster, with 4 worker nodes.
2 Deployments
A nginx deployment -> replicas = 10
A custom user agent deployment -> Replicas 10
Requirement - Hey Kubernetes, I want to schedule 10 pods of the "custom user agent" deployment (Pod #2 in this example) on 4 nodes, but I want to make sure that each node has a maximum of 3 pods of the 'custom user agent'. For the 'nginx' pod there shouldn't be any such restriction, which means I don't mind if k8s schedules 5 nginx pods on one node and the other 5 on a second node.
I didn't find official documentation for this myself, but I think you can use podAntiAffinity with the preferredDuringSchedulingIgnoredDuringExecution option. This makes k8s try to avoid placing the same pods on a single node, but if that is not possible it will select the most eligible existing node. Official doc here.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - podAffinityTerm:
        labelSelector:
          matchLabels:
            name: deployment-name
        topologyKey: kubernetes.io/hostname
      weight: 100
So setting a bare minimum number of pods for each node can be achieved with topologyKey.
Yes, you can get a Deployment to spawn a pod on every node by using pod anti-affinity with topologyKey set to "kubernetes.io/hostname".
With the above example, you will have the following behaviour:
I hope that's what you are looking for.
That feature is in alpha. I believe it is called topologyKey; depending on your Kubernetes version, you may be able to use it. https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/
I believe what you want to achieve can be done via maxSkew parameter of pod topology spread constraints. Please check the original documentation https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/
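A sketch of that maxSkew idea for the scenario in the question, placed in the pod template spec of the 'custom user agent' deployment only. The app: custom-user-agent label is a hypothetical name. Note that maxSkew bounds the difference between the most and least loaded nodes rather than setting an absolute cap, so the limit of 3 follows from the numbers here (10 replicas over 4 eligible nodes) and not from the setting itself.

```yaml
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname   # one topology domain per node
  whenUnsatisfiable: DoNotSchedule      # hard constraint: do not exceed the skew
  labelSelector:
    matchLabels:
      app: custom-user-agent            # hypothetical pod label
```

With 10 replicas on 4 nodes this allows at most a 3/3/2/2 split. The nginx deployment is unaffected because it does not carry the constraint.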

How to fix "pods is not balanced" in Kubernetes cluster

Pods are not balanced across the node pool. Why don't they spread to each node?
I have 9 instances in 1 node pool. In the past, I tried increasing this to 12 instances, but the pods still didn't balance.
I would like to know if there is any solution that can help solve this problem while still using 9 instances in 1 node pool.
Pods are scheduled to run on nodes by the kube-scheduler. And once they are scheduled, they are not rescheduled unless they are removed.
So if you add more nodes, the already running pods won't reschedule.
There is a project in the Kubernetes incubator that addresses exactly this problem:
https://github.com/kubernetes-incubator/descheduler
Scheduling in Kubernetes is the process of binding pending pods to nodes, and is performed by a component of Kubernetes called kube-scheduler. The scheduler's decisions, whether or where a pod can or can not be scheduled, are guided by its configurable policy which comprises of set of rules, called predicates and priorities. The scheduler's decisions are influenced by its view of a Kubernetes cluster at that point of time when a new pod appears first time for scheduling. As Kubernetes clusters are very dynamic and their state change over time, there may be desired to move already running pods to some other nodes for various reasons:
Some nodes are under or over utilized.
The original scheduling decision does not hold true any more, as taints or labels are added to or removed from nodes, pod/node affinity requirements are not satisfied any more.
Some nodes failed and their pods moved to other nodes.
New nodes are added to clusters.
You should look into inter-pod anti-affinity. This feature allows you to constrain where your pods should not be scheduled based on the labels of the pods running on a node. In your case, given your app has label app-label, you can use it to ensure pods do not get scheduled on nodes that have pods with the label app-label. For example:
apiVersion: apps/v1
kind: Deployment
...
spec:
  selector:
    matchLabels:
      label-key: label-value
  template:
    metadata:
      labels:
        label-key: label-value
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: label-key
                operator: In
                values:
                - label-value
            topologyKey: "kubernetes.io/hostname"
...
PS: If you use requiredDuringSchedulingIgnoredDuringExecution, you can have at most as many pods as you have nodes. If you expect to have more pods than nodes available, you will have to use preferredDuringSchedulingIgnoredDuringExecution, which makes anti-affinity a preference rather than an obligation.
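For reference, a sketch of the preferred variant, using the same label-key/label-value placeholders as the Deployment above. It replaces the affinity block in the pod template; the weight of 100 is an arbitrary choice within the allowed 1 to 100 range.

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100                      # higher weight means a stronger preference
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: label-key
            operator: In
            values:
            - label-value
        topologyKey: "kubernetes.io/hostname"
```

Note the extra weight and podAffinityTerm wrapper, which the required form does not have.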