Kubernetes AntiAffinity - limit max number of same pods per node

I have a Kubernetes cluster with 4 nodes. I have a pod deployed as a deployment with 8 replicas. When I deployed this, Kubernetes sometimes scheduled 4 pods on node1 and the remaining 4 pods on node2. In that case node3 and node4 don't run this container (though other containers are running there).
I do understand pod affinity and anti-affinity, where the docs have the Zookeeper example for pod anti-affinity, which is great. That would make sure that no 2 pods are deployed on the same node.
This is fine, but my requirement is slightly different: I want to restrict the maximum number of pods k8s can deploy to one node.
I need to make sure that no more than 3 instances of the same pod are deployed on a node in the example above. I thought of setting memory/cpu limits on the pods, but that seemed like a bad idea since my nodes have different configurations. Is there any way to achieve this?
( Update 1 ) - I understand that my question wasn't clear enough. To clarify further, what I want is to limit the instances of a pod to a maximum of 3 per node for a particular deployment. For example, how do I tell k8s to deploy no more than 3 instances of the nginx pod per node? The restriction should apply only to the nginx deployment and not to other deployments.
( Update 2 ) - To further explain with a scenario.
A k8s cluster with 4 worker nodes.
2 deployments:
An nginx deployment -> replicas = 10
A custom user agent deployment -> replicas = 10
Requirement - Hey Kubernetes, I want to schedule the 10 pods of the "custom user agent" deployment (Pod #2 in this example) across the 4 nodes, but I want to make sure that each node has at most 3 'custom user agent' pods. For the 'nginx' pods there shouldn't be any such restriction; I don't mind if k8s schedules 5 nginx pods on one node and the other 5 on a second node.

I myself didn't find official documentation for this, but I think you can use podAntiAffinity with the preferredDuringSchedulingIgnoredDuringExecution option. This will prevent k8s from placing pods of the same deployment on a single node, but if that is not possible it will select the most eligible node anyway. Official docs: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - podAffinityTerm:
        labelSelector:
          matchLabels:
            name: deployment-name
        topologyKey: kubernetes.io/hostname
      weight: 100

So setting a bare minimum number of pods for each node can be achieved via the topologyKey.
Yes, with a Deployment object you can get a pod to spawn on every node by using pod anti-affinity with topologyKey set to "kubernetes.io/hostname".
With the above example, the scheduler will prefer to spread the replicas across nodes and co-locate them only when no other node is eligible.
I hope that's what you are looking for.

That feature is in alpha; I believe it is called topologyKey. Depending on your Kubernetes version you may be able to use it: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/

I believe what you want to achieve can be done via the maxSkew parameter of pod topology spread constraints. Please check the official documentation: https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/
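For the scenario in the question (10 "custom user agent" replicas on 4 nodes, at most 3 per node), a minimal sketch of what that could look like in the pod template, assuming the pods carry a hypothetical app: custom-user-agent label:
topologySpreadConstraints:
- maxSkew: 1                          # pod counts per node may differ by at most 1
  topologyKey: kubernetes.io/hostname # spread across individual nodes
  whenUnsatisfiable: DoNotSchedule    # hard requirement, not just a preference
  labelSelector:
    matchLabels:
      app: custom-user-agent          # assumed label on the deployment's pods
With 10 replicas on 4 nodes, a skew of at most 1 only allows a 3/3/2/2 distribution, which satisfies the "maximum 3 per node" requirement; the nginx deployment simply omits the constraint.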

Related

Kubernetes EKS deployment set soft node affinity to split pods 50/50 per nodegroup

I have an EKS cluster with two nodegroups, each in a different AZ. One deployment, Deployment1, runs in 2 namespaces for redundancy, one copy per namespace, and each copy runs in a separate AZ/nodegroup. There is also another deployment, Deployment2, that has no node affinity set, so K8s manages where its pods get scheduled.
Both deployments are huge with lots of pods. I have a subnet of 250 IPs available to me for each node group.
The problem is that while Deployment1 is fine on its own and gets split almost equally per AZ/nodegroup, Deployment2 tends to schedule most of its pods in one of the nodegroups, which eventually exhausts the available IPs. This is a problem for Deployment1, since one of its namespaces is tied to that nodegroup and no new pods can be scheduled there if the load changes.
Can I somehow balance Deployment2 so it has a 'soft affinity' that splits it 50/50 across the nodegroups, but can still schedule pods in the other nodegroup if needed?
If you're using Kubernetes 1.19 or later you can use topologySpreadConstraints, adding this to the pod template:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      foo: bar
where maxSkew defines how unevenly the pods may be scheduled, topologyKey is the node label key to spread across, and the labelSelector matches a label of your deployment. See the docs: https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/
If you're on an older Kubernetes version, you can look at pod anti-affinity instead.
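A minimal sketch of that fallback, assuming Deployment2's pods carry a hypothetical app: deployment2 label; with the zone label as the topologyKey, the scheduler prefers not to pile replicas into the same zone/nodegroup:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: deployment2                      # assumed pod label
        topologyKey: topology.kubernetes.io/zone  # one zone per nodegroup here
Being a preference rather than a hard rule, this still lets the scheduler fall back to the other nodegroup when one runs out of capacity.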

When scaling a pod, will Kubernetes start new pods on more available nodes?

I tried to find a clear answer to this, and I'm sure it's in the Kubernetes documentation somewhere, but I couldn't find it.
If I have 4 nodes in my cluster and am running a compute-intensive pod which presumably uses up all/most of the resources on a single node, and I scale that pod to 4 replicas, will Kubernetes put those new pods on the unused nodes automatically?
Or, in the case where that compute-intensive pod is not actually crunching numbers at the time but I scale it to 4 replicas (i.e. the node running the original pod has plenty of 'available' resources), will Kubernetes still see that there are 3 other totally free nodes and start the pods on them?
Kubernetes will schedule the new pods on any node with sufficient resources. So one of the newly created pods could end up on a node where one is already running, if that node has enough resources left.
You can use anti-affinity to prevent pods from being scheduled on the same node as a pod from the same deployment, e.g. by using the deployment's labels:
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 100
    podAffinityTerm:
      labelSelector:
        matchExpressions:
        - key: resources-group
          operator: In
          values:
          - high
      topologyKey: kubernetes.io/hostname  # spread per node; use topology.kubernetes.io/zone to spread per zone instead
Read the docs for more on that topic.
As far as I remember, it depends on the scheduler configuration.
If you are running an on-premises Kubernetes cluster and have access to the scheduler process, you can tune the policy as you prefer (see documentation).
Otherwise, you can only play with the pods' resource requests/limits and anti-affinity (see here).
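For example, on a cluster where you control the scheduler, a sketch of a KubeSchedulerConfiguration that makes the scheduler spread pods across nodes by default; this is a hedged sketch, and the exact apiVersion depends on your Kubernetes version:
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: PodTopologySpread
    args:
      defaultingType: List              # use the constraints listed below as cluster-wide defaults
      defaultConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway  # soft preference, not a hard rule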

Schedule few statefulset pods on one node and rest on other node in a kubernetes cluster

I have a Kubernetes cluster of 3 worker nodes where I need to deploy a statefulset app with 6 replicas.
My requirement is to make sure that, in every case, each node gets exactly 2 of the 6 replicas. Basically,
node1 - 2 pods of app
node2 - 2 pods of app
node3 - 2 pods of app
========================
Total 6 pods of app
Any help would be appreciated!
You should use pod anti-affinity to make sure that the pods are spread across different nodes.
Since you will have more than one pod per node, use preferredDuringSchedulingIgnoredDuringExecution.
For example, when the app has the label app: mydb (use whatever fits your case):
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 100
    podAffinityTerm:
      labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - mydb
      topologyKey: "kubernetes.io/hostname"
each node should get exactly 2 pods out of 6 replicas
Try not to think of the pods as pinned to certain nodes. The idea with Kubernetes workloads is that they are independent of the underlying infrastructure, such as nodes. What you really want, I assume, is to spread the pods to increase availability: if one node goes down, your system should still be available.
If you are running at a cloud provider, you should probably design the anti-affinity so that the pods are scheduled to different Availability Zones and not only to different nodes, but that requires your cluster to be deployed in a Region (consisting of multiple Availability Zones).
Spread pods across Availability Zones
After even distribution, all 3 nodes (spread over three zones) will have 2 pods each. That is OK. The hard requirement is: if one node (say node-1) goes down, its 2 pods should not be rescheduled on the other nodes; when node-1 is restored, those 2 pods should be scheduled back onto it. In other words, each of the 3 pairs of pods has its own node/zone affinity. Any idea how to do this?
This can be done with pod affinity, but it is more commonly done with topology spread constraints; you would probably use topologyKey: topology.kubernetes.io/zone, but it depends on what labels your nodes have.
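A minimal sketch of such a constraint for the 6-replica StatefulSet, assuming its pods carry the app: mydb label from the example above:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone  # or kubernetes.io/hostname to spread per node
  whenUnsatisfiable: DoNotSchedule          # hard requirement: forces the even split
  labelSelector:
    matchLabels:
      app: mydb
With 6 replicas and 3 zones (one node per zone here), a maximum skew of 1 only allows the 2/2/2 distribution.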

Kubernetes: Evenly distribute the replicas across the cluster

We can use a DaemonSet object to deploy one replica on each node. How can we deploy, say, 2 or 3 replicas per node? How can we achieve that? Please let us know.
There is no way to force x pods per node the way a DaemonSet does. However, with some planning, you can force a fairly even pod distribution across your nodes using pod anti-affinity.
Let's say we have 10 nodes. First, we need a ReplicaSet (Deployment) with 30 pods (3 per node). Next, we set the pod anti-affinity to use preferredDuringSchedulingIgnoredDuringExecution with a relatively high weight and match the deployment's labels. This causes the scheduler to prefer not to schedule a pod where the same pod already exists. Once there is 1 pod per node, the cycle starts over: a node with 2 pods scores lower than one with 1 pod, so the next pod should go to the less-loaded node.
Note this is not as precise as a DaemonSet and may run into some limitations when it comes time to scale the cluster up or down.
A more reliable way, if the cluster will be scaled, is to simply create multiple DaemonSets with different names but identical in every other way. Since the DaemonSets' pods can share labels, they can all be exposed through the same service, as the sketch below shows.
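A hedged sketch of that pattern with two DaemonSets (all names, labels, and the image are hypothetical); each DaemonSet keeps a unique copy label in its own selector, while a shared app label lets one Service front all the pods:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: myapp-a          # the second DaemonSet is identical except name: myapp-b and copy: b
spec:
  selector:
    matchLabels:
      app: myapp
      copy: a
  template:
    metadata:
      labels:
        app: myapp       # shared label, matched by the Service
        copy: a          # unique per DaemonSet, keeps the selectors from overlapping
    spec:
      containers:
      - name: myapp
        image: myapp:1.0 # hypothetical image
---
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp           # matches pods from both DaemonSets
  ports:
  - port: 80
Two such DaemonSets give 2 pods per node, three give 3 per node, and so on.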
By default, the Kubernetes scheduler will prefer to schedule pods on different nodes.
The scheduler first determines all possible nodes where a pod can be deployed based on your affinity/anti-affinity rules, resource limits, etc.
Afterward, it finds the best node where the pod can be deployed. The scheduler will automatically place the pods on separate availability zones and on separate nodes, if this is possible of course.
You can try this on your own. For example, if you have 3 nodes, try deploying 9 replicas of a pod. You will see that each node ends up with 3 pods running.

How to fix "pods is not balanced" in Kubernetes cluster

The pods aren't balanced across the node pool. Why don't they spread to each node?
I have 9 instances in 1 node pool. In the past, I tried scaling up to 12 instances; the pods still didn't balance.
Would like to know if there is any solution that can help solve this problem while keeping 9 instances in 1 node pool?
Pods are scheduled onto nodes by the kube-scheduler, and once they are scheduled, they are not rescheduled unless they are removed.
So if you add more nodes, the already-running pods won't be rescheduled onto them.
There is a project in the incubator that solves exactly this problem:
https://github.com/kubernetes-incubator/descheduler
Scheduling in Kubernetes is the process of binding pending pods to nodes, and is performed by a component of Kubernetes called kube-scheduler. The scheduler's decisions, whether or where a pod can or can not be scheduled, are guided by its configurable policy which comprises of set of rules, called predicates and priorities. The scheduler's decisions are influenced by its view of a Kubernetes cluster at that point of time when a new pod appears first time for scheduling. As Kubernetes clusters are very dynamic and their state change over time, there may be desired to move already running pods to some other nodes for various reasons:
Some nodes are under or over utilized.
The original scheduling decision does not hold true any more, as taints or labels are added to or removed from nodes, pod/node affinity requirements are not satisfied any more.
Some nodes failed and their pods moved to other nodes.
New nodes are added to clusters.
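As a hedged sketch, a descheduler policy that evicts duplicate pods (and drains overutilized nodes) so the evicted pods can be rescheduled onto the new nodes could look like this; this uses the v1alpha1 strategy format, so check the project's README for the format your version expects:
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
    enabled: true        # evict extra copies so only one pod of a ReplicaSet runs per node
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:      # nodes below all of these are considered underutilized
          cpu: 20
          memory: 20
          pods: 20
        targetThresholds: # nodes above any of these are candidates for eviction
          cpu: 50
          memory: 50
          pods: 50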
You should look into inter-pod anti-affinity. This feature allows you to constrain where your pods should not be scheduled, based on the labels of the pods already running on a node. In your case, given that your app has the label app-label, you can use it to ensure pods do not get scheduled onto nodes that already have pods with the label app-label. For example:
apiVersion: apps/v1
kind: Deployment
...
spec:
  selector:
    matchLabels:
      label-key: label-value
  template:
    metadata:
      labels:
        label-key: label-value
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: label-key
                operator: In
                values:
                - label-value
            topologyKey: "kubernetes.io/hostname"
...
PS: If you use requiredDuringSchedulingIgnoredDuringExecution, you can have at most as many pods as you have nodes. If you expect more pods than available nodes, you will have to use preferredDuringSchedulingIgnoredDuringExecution, which makes the anti-affinity a preference rather than an obligation.