We are finding that our Kubernetes cluster tends to have hot-spots where certain nodes get far more instances of our apps than other nodes.
In this case, we are deploying lots of instances of Apache Airflow, and some nodes have 3x more web or scheduler components than others.
Is it possible to use anti-affinity rules to force a more even spread of pods across the cluster?
E.g. "prefer the node with the least pods of label component=airflow-web?"
If anti-affinity does not work, are there other mechanisms we should be looking into as well?
Try adding this to the Deployment/StatefulSet .spec.template:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: "component"
            operator: In
            values:
            - airflow-web
        topologyKey: "kubernetes.io/hostname"
Have you tried configuring the kube-scheduler?
kube-scheduler selects a node for the pod in a 2-step operation:
Filtering: finds the set of Nodes where it's feasible to schedule the Pod.
Scoring: ranks the remaining nodes to choose the most suitable Pod placement.
Scheduling Policies can be used to specify the predicates and priorities that kube-scheduler runs to filter and score nodes:
kube-scheduler --policy-config-file <filename>
Sample config file
One of the priorities for your scenario is:
BalancedResourceAllocation: Favors nodes with balanced resource usage.
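Note that --policy-config-file is the legacy tuning mechanism; on recent Kubernetes releases the equivalent is a scheduler configuration profile. A rough sketch of raising the weight of the balanced-allocation score plugin is shown below; the apiVersion and the disable/re-enable pattern are assumptions you should verify against your cluster version:
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      # Assumption: re-register the default plugin with a higher weight
      disabled:
      - name: NodeResourcesBalancedAllocation
      enabled:
      - name: NodeResourcesBalancedAllocation
        weight: 2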
The right solution here is pod topology spread constraints: https://kubernetes.io/blog/2020/05/introducing-podtopologyspread/
Anti-affinity only works until each node has at least 1 pod. Spread constraints actually balance based on the pod count per node.
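As a sketch, a spread constraint in the Pod template spec might look like this, reusing the component=airflow-web label from the question (the exact label is an assumption):
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      component: airflow-web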
I have a k8s cluster with 2 node groups (worker group and task runner group for heavy tasks).
I have a deployment with N pod replicas and I want to assign a maximum of 2 replicas per node.
I found a way to restrict the replica count to 1 per node by describing anti-affinity rules for the pods, and I tried to add Topology Spread Constraints with the maxSkew parameter, but this didn't work.
Is there any way to restrict the maximum pod replica count per node?
The key is to use topologyKey.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - podAffinityTerm:
        labelSelector:
          matchLabels:
            name: deployment-name
        topologyKey: kubernetes.io/hostname
      weight: 100
A bit about topologyKey:
In principle, the topologyKey can be any allowed label key with the following exceptions for performance and security reasons:
For Pod affinity and anti-affinity, an empty topologyKey field is not allowed in both requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution.
For requiredDuringSchedulingIgnoredDuringExecution Pod anti-affinity rules, the admission controller LimitPodHardAntiAffinityTopology limits topologyKey to kubernetes.io/hostname.
You can modify or disable the admission controller if you want to allow custom topologies.
Pod Topology Spread Constraints
https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/
topologyKey is the key of node labels.
If two Nodes are labelled with this key and have identical values for that label, the scheduler treats both Nodes as being in the same topology.
The scheduler tries to place a balanced number of Pods into each topology domain.
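To approximate "at most 2 replicas per node", a spread constraint with maxSkew: 2 is the closest fit, though strictly it bounds the difference in matching pod count between nodes rather than imposing a hard per-node cap. A sketch, assuming the pods carry the name: deployment-name label used above:
topologySpreadConstraints:
- maxSkew: 2
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      name: deployment-name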
I'm trying to assign pods to a specific node using affinities, but it ends with a strange behavior I can't understand.
Let me explain my node setup. I have x nodes which all have the label kali=true; two of those nodes additionally have the label kali-app=true, and one of those two also has the label kali-app-1=true.
Now I try to deploy a Deployment with 2 replicas, which should result in both pods being placed on the kali-app-1 node:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kali
          operator: In
          values: ['true']
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 1
      preference:
        matchExpressions:
        - key: kali-app
          operator: In
          values: ['true']
    - weight: 2
      preference:
        matchExpressions:
        - key: kali-app-1
          operator: In
          values: ['true']
The result is that the first pod is placed on the kali-app-1 node, but the second one is placed on the node that only has the kali-app label (specifically the node labelled kali-app-2=true, which is another label I have in my cluster).
Can anyone explain this behavior?
Based on the nodeAffinity documentation this works as expected, since the preferredDuringSchedulingIgnoredDuringExecution field is used for the 2nd and 3rd conditions.
The way it works is:
There are currently two types of node affinity, called requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution. You can think of them as "hard" and "soft" respectively, in the sense that the former specifies rules that must be met for a pod to be scheduled onto a node (similar to nodeSelector but using a more expressive syntax), while the latter specifies preferences that the scheduler will try to enforce but will not guarantee.
The weight field in preferredDuringSchedulingIgnoredDuringExecution is in the range 1-100. For each node that meets all of the scheduling requirements (resource request, RequiredDuringScheduling affinity expressions, etc.), the scheduler will compute a sum by iterating through the elements of this field and adding "weight" to the sum if the node matches the corresponding MatchExpressions. This score is then combined with the scores of other priority functions for the node. The node(s) with the highest total score are the most preferred.
That means that after the first pod is scheduled on the required node, another node may be more preferred for placing the second pod (e.g. if the deployment is heavy and has high CPU/memory requests). In addition, the Kubernetes scheduler generally tries to spread pods across nodes for high availability. Please get familiar with the NodeAffinity preferredDuringSchedulingIgnoredDuringExecution doesn't work well GitHub issue.
Therefore, if these pods from the deployment have to be scheduled on the same node with the label kali-app-1, you need to use the requiredDuringSchedulingIgnoredDuringExecution field to enforce this assignment and not give the scheduler any other options.
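A sketch of such a hard rule, reusing the kali-app-1 label from the question:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kali-app-1
          operator: In
          values: ['true']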
I tried to find a clear answer to this, and I'm sure it's in the Kubernetes documentation somewhere, but I couldn't find it.
If I have 4 nodes in my cluster and am running a compute-intensive pod which presumably uses up all/most of the resources on a single node, and I scale that pod to 4 replicas, will Kubernetes put those new pods on the unused nodes automatically?
Or, in the case where that compute-intensive pod is not actually crunching numbers at the time but I scale it to 4 replicas (i.e. the node running the original pod has plenty of 'available' resources), will Kubernetes still see that there are 3 other totally free nodes and start the pods on them?
Kubernetes will schedule the new pods on any node with sufficient resources. So one of the newly created pods could end up on a node where one is already running, if that node has enough resources left.
You can use an anti-affinity rule to prevent pods from being scheduled on the same node as a pod from the same deployment, e.g. by using the deployment's label:
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 100
    podAffinityTerm:
      labelSelector:
        matchExpressions:
        - key: resources-group
          operator: In
          values:
          - high
      topologyKey: kubernetes.io/hostname
Read docs for more on that topic.
As far as I remember, it depends on the scheduler configuration.
If you are running your own on-premises Kubernetes cluster and have access to the scheduler process, you can tune the policy as you prefer (see documentation).
Otherwise, you can only play with the pod resource requests/limits and anti-affinity (see here).
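On the requests/limits side, the idea is that if each replica requests a large share of a node's allocatable resources, the scheduler cannot co-locate two replicas even without affinity rules. A rough sketch with arbitrary values:
resources:
  requests:
    cpu: "3"
    memory: 12Gi
  limits:
    cpu: "3"
    memory: 12Gi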
I'm trying to understand if it is good practice to use a podAntiAffinity rule so that Pods in my Deployment prefer not to be scheduled on the same node, thus spreading the Pods out across my Kubernetes cluster.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: "app.kubernetes.io/name"
            operator: In
            values:
            - "foo"
        topologyKey: "kubernetes.io/hostname"
The documentation suggests avoiding the use of podAntiAffinity in clusters with hundreds of nodes, which implies there is a performance cost to using it.
Also, if I don't use it, isn't the default scheduler behaviour to spread Pods out anyway?
I suppose it also matters what the Deployment is for. It makes sense to use podAntiAffinity for a Redis cache, for example, but wouldn't it make even more sense to use a DaemonSet in that case? Also, what is the recommendation for a web server Pod?
You use Pod/Node affinity rules if you want to schedule pods on certain nodes by matching specified conditions in a more expressive way. I am not sure you can use them to avoid being scheduled on the same node.
If you don't use an affinity rule, kube-scheduler will look for a feasible node to schedule the pod on, and this is generally not the same node.
You make kube-scheduler "think" more by defining affinity rules, and it is normal that in big clusters this can affect performance.
Also, to understand how kube-scheduler iterates over Nodes by default, you can check this documentation.
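If the concern is scheduler performance in large clusters, the usual knob is percentageOfNodesToScore in the scheduler configuration. A minimal sketch, assuming you control the kube-scheduler flags and a recent config apiVersion (verify against your release):
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
# Stop searching for more feasible nodes once 50% of the cluster's nodes
# have been found feasible; lower values speed up scheduling in big clusters
percentageOfNodesToScore: 50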
I'm running a Kubernetes cluster of 3 nodes on GKE. I ask Kubernetes for 3 replicas of backend pods. The 3 pods are not well spread across the nodes to provide a highly available service; they are usually all on 2 nodes.
I would like Kubernetes to spread the pods as much as possible so that there is a pod on each node, but not fail the deployment/scale-up if there are more backend pods than nodes.
Is it possible to do that with preferredDuringSchedulingIgnoredDuringExecution?
Try setting up a preferred anti-affinity rule like so:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: "app"
            operator: In
            values:
            - "my_app_name"
        topologyKey: "kubernetes.io/hostname"
This will try to schedule pods onto nodes which do not already have a pod with the same label running on them. After that it's a free-for-all (it won't keep spreading them evenly once at least 1 is running on each node). This means that after scaling up you might end up with one node running 5 pods while the other nodes run 1 pod each.