I have a k8s cluster with 2 node groups (worker group and task runner group for heavy tasks).
I have a deployment with N pod replicas and I want to assign a maximum of 2 replicas per node.
I found a way to restrict the replica count to one per node by describing pod anti-affinity rules, and I tried adding Topology Spread Constraints with the maxSkew parameter, but that didn't work.
Is there any way to restrict the maximum pod replica count per node?
The key is to use topologyKey.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - podAffinityTerm:
        labelSelector:
          matchLabels:
            name: deployment-name
        topologyKey: kubernetes.io/hostname
      weight: 100
A bit about topologyKey:
In principle, the topologyKey can be any allowed label key with the following exceptions for performance and security reasons:
For Pod affinity and anti-affinity, an empty topologyKey field is not allowed in both requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution.
For requiredDuringSchedulingIgnoredDuringExecution Pod anti-affinity rules, the admission controller LimitPodHardAntiAffinityTopology limits topologyKey to kubernetes.io/hostname.
You can modify or disable the admission controller if you want to allow custom topologies.
Pod Topology Spread Constraints
https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/
topologyKey is the key of node labels.
If two Nodes are labelled with this key and have identical values for that label, the scheduler treats both Nodes as being in the same topology.
The scheduler tries to place a balanced number of Pods into each topology domain.
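Since the question asks for a cap of two replicas per node rather than one, a hard topology spread constraint may come closer than preferred anti-affinity. A minimal sketch, reusing the name: deployment-name label from the example above: with maxSkew: 1 and whenUnsatisfiable: DoNotSchedule, the per-node counts of matching pods stay within one of each other, so (assuming all nodes stay schedulable) no node holds more than about ceil(replicas / nodes) pods, i.e. at most 2 while replicas ≤ 2 × nodes.
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      name: deployment-name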
I'm trying to assign pods to a specific node using affinities, but it results in a strange behavior I can't understand.
Let me explain my node setup. I have "x" nodes that all have the label kali=true; two of them additionally have the label kali-app=true, and one of those two also has the label kali-app-1=true.
Now I deploy a Deployment with 2 replicas, which should result in both pods being placed on the kali-app-1 node:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kali
          operator: In
          values: ['true']
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 1
      preference:
        matchExpressions:
        - key: kali-app
          operator: In
          values: ['true']
    - weight: 2
      preference:
        matchExpressions:
        - key: kali-app-1
          operator: In
          values: ['true']
The result is that the first pod is placed on the kali-app-1 node, but the second one is placed on the node that only has the kali-app label (that node also carries kali-app-2=true, which is another label in my cluster).
Can anyone explain this behavior?
Based on the nodeAffinity documentation, it works as expected, since the preferredDuringSchedulingIgnoredDuringExecution field is used for the 2nd and 3rd conditions.
The way it works is:
There are currently two types of node affinity, called requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution. You can think of them as "hard" and "soft" respectively, in the sense that the former specifies rules that must be met for a pod to be scheduled onto a node (similar to nodeSelector but using a more expressive syntax), while the latter specifies preferences that the scheduler will try to enforce but will not guarantee.
The weight field in preferredDuringSchedulingIgnoredDuringExecution is in the range 1-100. For each node that meets all of the scheduling requirements (resource request, RequiredDuringScheduling affinity expressions, etc.), the scheduler will compute a sum by iterating through the elements of this field and adding "weight" to the sum if the node matches the corresponding MatchExpressions. This score is then combined with the scores of other priority functions for the node. The node(s) with the highest total score are the most preferred.
That means that after the first pod is scheduled on the preferred node, another node may end up with a higher total score for the second pod (for example, if the deployment has high CPU/memory requests). Additionally, the Kubernetes scheduler generally tries to spread pods across nodes for high availability. See the "NodeAffinity preferredDuringSchedulingIgnoredDuringExecution doesn't work well" GitHub issue.
Therefore, if the pods from this deployment must be scheduled on the single node with the label kali-app-1, you need to use the requiredDuringSchedulingIgnoredDuringExecution field to enforce this assignment and leave the scheduler no other options.
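A minimal sketch of that hard requirement, reusing the kali-app-1 label from the question and deliberately dropping the preferred terms so the scheduler has no other options:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kali-app-1
          operator: In
          values: ['true']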
I have an EKS cluster with two nodegroups, each in a different AZ. One deployment, Deployment1, runs in 2 namespaces for redundancy, one copy per namespace, with each copy in a separate AZ/nodegroup. There is also another deployment, Deployment2, that has no node affinity set, so K8s decides where its pods get scheduled.
Both deployments are huge, with lots of pods. I have a subnet of 250 IPs available for each node group.
The problem is that while Deployment1 is fine on its own and gets split almost equally per AZ/nodegroup, Deployment2 tends to schedule most of its pods in one of the nodegroups, until that nodegroup runs out of IPs. This is a problem for Deployment1, since one of its namespaces is tied to that nodegroup and no new pods can be scheduled there if the load changes.
Can I somehow balance Deployment2 so it has a 'soft affinity' that splits it 50/50 between the nodegroups, but can still schedule pods in the other nodegroup if needed?
If you're using Kubernetes 1.19 or later you can use topologySpreadConstraints, adding this to the pod template:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      foo: bar
where maxSkew defines how unevenly pods may be distributed, topologyKey is the key of the node labels, and the labelSelector matches a label of your deployment. See the docs.
If you're on an older Kubernetes version, you can look at pod anti-affinity instead.
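If anti-affinity is your only option, a soft rule along these lines should bias the scheduler away from zones that already run the deployment's pods (a sketch only; foo: bar again stands in for your deployment's actual label):
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            foo: bar
        topologyKey: topology.kubernetes.io/zone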
We are finding that our Kubernetes cluster tends to have hot-spots where certain nodes get far more instances of our apps than other nodes.
In this case, we are deploying lots of instances of Apache Airflow, and some nodes have 3x more web or scheduler components than others.
Is it possible to use anti-affinity rules to force a more even spread of pods across the cluster?
E.g. "prefer the node with the least pods of label component=airflow-web?"
If anti-affinity does not work, are there other mechanisms we should be looking into as well?
Try adding this to the Deployment/StatefulSet .spec.template:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: "component"
            operator: In
            values:
            - airflow-web
        topologyKey: "kubernetes.io/hostname"
Have you tried configuring the kube-scheduler?
kube-scheduler selects a node for the pod in a 2-step operation:
Filtering: finds the set of Nodes where it's feasible to schedule the Pod.
Scoring: ranks the remaining nodes to choose the most suitable Pod placement.
Scheduling Policies: can be used to specify the predicates and priorities that the kube-scheduler runs to filter and score nodes.
kube-scheduler --policy-config-file <filename>
Sample config file
One of the priorities for your scenario is:
BalancedResourceAllocation: Favors nodes with balanced resource usage.
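A minimal sketch of what such a policy file might look like (an assumption: this is the legacy JSON Policy format that --policy-config-file accepted before it was deprecated in favor of scheduler configuration profiles; note that listing priorities explicitly replaces the scheduler's defaults):
{
  "kind": "Policy",
  "apiVersion": "v1",
  "priorities": [
    {"name": "BalancedResourceAllocation", "weight": 1}
  ]
}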
The right solution here is pod topology spread constraints: https://kubernetes.io/blog/2020/05/introducing-podtopologyspread/
Anti-affinity only works until each node has at least 1 pod. Spread constraints actually balance based on the pod count per node.
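For the Airflow case above, a minimal sketch, assuming the pods carry the component: airflow-web label from the question; maxSkew: 1 asks the scheduler to keep the per-node count of matching pods within one of each other:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      component: airflow-web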
Pods don't balance across the node pool. Why don't they spread to each node?
I have 9 instances in 1 node pool. In the past, I tried scaling up to 12 instances, but the pods didn't balance.
Is there any solution that can help solve this problem while using 9 instances in 1 node pool?
Pods are scheduled onto nodes by the kube-scheduler, and once they are scheduled, they are not rescheduled unless they are removed.
So if you add more nodes, the already-running pods won't be rescheduled.
There is a project in the Kubernetes incubator that solves exactly this problem.
https://github.com/kubernetes-incubator/descheduler
Scheduling in Kubernetes is the process of binding pending pods to nodes, and is performed by a component of Kubernetes called kube-scheduler. The scheduler's decisions, whether or where a pod can or can not be scheduled, are guided by its configurable policy, which comprises a set of rules called predicates and priorities. The scheduler's decisions are influenced by its view of the Kubernetes cluster at the point in time when a new pod first appears for scheduling. As Kubernetes clusters are very dynamic and their state changes over time, it may be desirable to move already-running pods to other nodes for various reasons:
- Some nodes are under- or over-utilized.
- The original scheduling decision no longer holds true, as taints or labels are added to or removed from nodes, and pod/node affinity requirements are no longer satisfied.
- Some nodes failed and their pods moved to other nodes.
- New nodes are added to clusters.
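A minimal sketch of a descheduler policy that evicts duplicate pods so they can be rescheduled onto the new nodes (an assumption: this follows the descheduler/v1alpha1 policy format from the project's README; check the repo for the format matching your version). The descheduler is typically run as a Job or CronJob inside the cluster, pointed at this policy:
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
    enabled: true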
You should look into inter-pod anti-affinity. This feature lets you constrain which nodes your pods should not be scheduled onto, based on the labels of the pods already running on those nodes. In your case, given that your app has the label app-label, you can use it to ensure pods do not get scheduled onto nodes that already have pods with the label app-label. For example:
apiVersion: apps/v1
kind: Deployment
...
spec:
  selector:
    matchLabels:
      label-key: label-value
  template:
    metadata:
      labels:
        label-key: label-value
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: label-key
                operator: In
                values:
                - label-value
            topologyKey: "kubernetes.io/hostname"
      ...
PS: If you use requiredDuringSchedulingIgnoredDuringExecution, you can have at most as many pods as you have nodes. If you expect to have more pods than available nodes, you will have to use preferredDuringSchedulingIgnoredDuringExecution, which makes anti-affinity a preference rather than an obligation; see the sketch below.
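A minimal sketch of that preferred form (note the schema change: each entry gains a weight and wraps the selector in podAffinityTerm; label-key/label-value are the same placeholders as above):
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: label-key
            operator: In
            values:
            - label-value
        topologyKey: "kubernetes.io/hostname"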
I'm running a Kubernetes cluster of 3 nodes with GKE. I ask Kubernetes for 3 replicas of backend pods. The 3 pods are not well distributed among the nodes to provide a highly available service; they usually all end up on 2 nodes.
I would like Kubernetes to spread the pods as much as possible so there is a pod on each node, but not fail the deployment/scale-up if there are more backend pods than nodes.
Is it possible to do that with preferredDuringSchedulingIgnoredDuringExecution?
Try setting up a preferred anti-affinity rule like so (note that entries under preferredDuringSchedulingIgnoredDuringExecution need a weight and a podAffinityTerm wrapper):
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: "app"
            operator: In
            values:
            - "my_app_name"
        topologyKey: "kubernetes.io/hostname"
This will try to schedule pods onto nodes that do not already have a pod with the same label running on them. After that, it's a free-for-all (it won't spread them evenly once at least one pod is running on each node), so after scaling up you might end up with one node running 5 pods while the other nodes run 1 pod each.
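If you also want an even spread beyond the first pod per node and you're on Kubernetes 1.19 or later, a topology spread constraint (covered earlier in this thread) is a closer fit. A minimal sketch, reusing the app: my_app_name label from the question:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: my_app_name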