I'm trying to understand if it is good practice to use a podAntiAffinity rule to prefer that Pod's in my Deployment avoid being scheduled on the same node. Thus spreading the Pod's out on my Kubernetes cluster.
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: "app.kubernetes.io/name"
operator: In
values:
- "foo"
topologyKey: "kubernetes.io/hostname"
The documentation suggests avoiding the use of podAntiAffinity for clusters with hundreds of nodes which suggests that there is a performance impact to using them.
Also, if I don't use them, isn't the default scheduler behaviour to space out Pod's anyway?
I suppose it also matters what the Deployment is for. It makes sense to use a podAntiAffinity for a Redis cache for example but wouldn't it make even more sense to use a DaemonSet for that case? Also, what is the recommendation for a web server Pod?
You use Pod/Node Affinity rules if you want to schedule pods on some nodes by matching specified condition in more expressive methods. I am not sure if you can use it to avoid being scheduled on same node.
If you don't use affinity rule, then kube-scheduler will look for feasible node to schedule pod on it and this is generally is not the same node.
You make kube-scheduler "to think" more by defining affinity rules and this is normal that in big clusters it can affect to performance.
Also to understand how the kube-scheduler iterates over Nodes by default, you can check this documentation.
Related
I'm trying to affect pods to a specific node using affinities but it ends with a strange behavior I can't understand.
Let explain my nodes setup. I have "x" nodes which all have the labels kali=true, two nodes have in addition the labels kali-app=true and one of these nodes have the label kali-app-1=true.
Now I try to deploy a Deployment with a replicas of 2 pods which should result with putting these pods on the kali-app-1 node:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kali
operator: In
values: ['true']
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
preference:
matchExpressions:
- key: kali-app
operator: In
values: ['true']
- weight: 2
preference:
matchExpressions:
- key: kali-app-1
operator: In
values: ['true']
The result is the first pod is put on the kali-app-1 node but the second one is put on the node which only have the kali-app node (namely kali-app-2=true which is an other label I have in my cluster).
Does anyone can explain me this behavior?
Based on the nodeAffinity documentation it works as expected since preferredDuringSchedulingIgnoredDuringExecution field is used for the 2nd and 3rd conditions.
They way it works is:
There are currently two types of node affinity, called
requiredDuringSchedulingIgnoredDuringExecution and
preferredDuringSchedulingIgnoredDuringExecution. You can think of
them as "hard" and "soft" respectively, in the sense that the former
specifies rules that must be met for a pod to be scheduled onto a node
(similar to nodeSelector but using a more expressive syntax), while
the latter specifies preferences that the scheduler will try to
enforce but will not guarantee.
The weight field in preferredDuringSchedulingIgnoredDuringExecution
is in the range 1-100. For each node that meets all of the scheduling
requirements (resource request, RequiredDuringScheduling affinity
expressions, etc.), the scheduler will compute a sum by iterating
through the elements of this field and adding "weight" to the sum if
the node matches the corresponding MatchExpressions. This score is
then combined with the scores of other priority functions for the
node. The node(s) with the highest total score are the most
preferred.
That means that after first pod is scheduled on the required node, another node is more preferred for placing the second pod. (e.g. if the deployment is heavy and has high cpu/memory requests) + generally kubernetes scheduler tries to spread pods across nodes to comply with high availability. Please get familiar with NodeAffinity preferredDuringSchedulingIgnoredDuringExecution doesn't work well GitHub issue.
Therefore if these pods from the deployment have to be scheduled on the same one node with label kali-app-1, you need to use requiredDuringSchedulingIgnoredDuringExecution field to enforce this assignment and do not provide the scheduler with other possible options.
I tried to find a clear answer to this, and I'm sure it's in the kubernetes documentation somewhere but I couldn't find it.
If I have 4 nodes in my cluster, and am running a compute intensive pod which presumably uses up all/most of the resources on a single node, and I scale that pod to 4 replicas- will Kubernetes put those new pods on the unused nodes automatically?
Or, in the case where that compute intensive pod is not actually crunching numbers at the time but I scale the pod to 4 replicas (ie: node running the original pod has plenty of 'available' resources) will kubernetes still see that there are 3 other totally free nodes and start the pods on them?
Kubernetes will schedule the new pods on any node with sufficient resources. So one of the newly created pods could end up on a node were one is already running, if the node has enough resources left.
You can use an anti affinity to prevent pods from scheduling on the same node as a pod from the same deployment, e.g. by using the development's label
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: resources-group
operator: In
values:
- high
topologyKey: topology.kubernetes.io/zone
Read docs for more on that topic.
As far as I remember, it depends on the scheduler configuration.
If you are running your on-premise kubernetes cluster and you have access to the scheduler process, you can tune the policy as you prefer (see documentation).
Otherwise, you can only play with the pod resource requests/limits and anti-affinity (see here).
We are finding that our Kubernetes cluster tends to have hot-spots where certain nodes get far more instances of our apps than other nodes.
In this case, we are deploying lots of instances of Apache Airflow, and some nodes have 3x more web or scheduler components than others.
Is it possible to use anti-affinity rules to force a more even spread of pods across the cluster?
E.g. "prefer the node with the least pods of label component=airflow-web?"
If anti-affinity does not work, are there other mechanisms we should be looking into as well?
Try adding this to the Deployment/StatefulSet .spec.template:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: "component"
operator: In
values:
- airflow-web
topologyKey: "kubernetes.io/hostname"
Have you tried configuring the kube-scheduler?
kube-scheduler selects a node for the pod in a 2-step operation:
Filtering: finds the set of Nodes where it's feasible to schedule the Pod.
Scoring: ranks the remaining nodes to choose the most suitable Pod placement.
Scheduling Policies: can be used to specify the predicates and priorities that the kube-scheduler runs to filter and score nodes.
kube-scheduler --policy-config-file <filename>
Sample config file
One of the priorities for your scenario is:
BalancedResourceAllocation: Favors nodes with balanced resource usage.
The right solution here is pod topology spread constraints: https://kubernetes.io/blog/2020/05/introducing-podtopologyspread/
Anti-affinity only works until each node has at least 1 pod. Spread constraints actually balances based on the pod count per node.
Pods doesn't balance in node pool. why doesn't spread to each node?
I have 9 instance in 1 node pool. In the past, I’ve tried add to 12 instance. Pods doesn't balance.
image description here
Would like to know if there is any solution that can help solve this problem and used 9 instance in 1 node pool?
Pods are scheduled to run on nodes by the kube-scheduler. And once they are scheduled, they are not rescheduled unless they are removed.
So if you add more nodes, the already running pods won't reschedule.
There is a project in incubator that solves exactly this problem.
https://github.com/kubernetes-incubator/descheduler
Scheduling in Kubernetes is the process of binding pending pods to
nodes, and is performed by a component of Kubernetes called
kube-scheduler. The scheduler's decisions, whether or where a pod can
or can not be scheduled, are guided by its configurable policy which
comprises of set of rules, called predicates and priorities. The
scheduler's decisions are influenced by its view of a Kubernetes
cluster at that point of time when a new pod appears first time for
scheduling. As Kubernetes clusters are very dynamic and their state
change over time, there may be desired to move already running pods to
some other nodes for various reasons:
Some nodes are under or over utilized.
The original scheduling decision does not hold true any more, as taints or labels are added to or removed from nodes, pod/node
affinity requirements are not satisfied any more.
Some nodes failed and their pods moved to other nodes.
New nodes are added to clusters.
You should look into inter-pod anti-affinity. This feature allows you to constrain where your pods should not be scheduled based on the labels of the pods running on a node. In your case, given your app has label app-label, you can use it to ensure pods do not get scheduled on nodes that have pods with the label app-label. For example:
apiVersion: apps/v1
kind: Deployment
...
spec:
selector:
matchLabels:
label-key: label-value
template:
metadata:
labels:
label-key: label-value
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: label-key
operator: In
values:
- label-value
topologyKey: "kubernetes.io/hostname"
...
PS: If you use requiredDuringSchedulingIgnoredDuringExecution, you can have at most as many pods as you have nodes. If you expect to have more pods than nodes available, you will have to use preferredDuringSchedulingIgnoredDuringExecution, which makes antiaffinity be a preference, rather than an obligation.
I'm running a Kubernetes cluster of 3 nodes with GKE. I ask Kubernetes for 3 replicas of backend pods. The 3 pods are not well dispatched among the nodes to provide a high-availability service, they are usually all on 2 nodes.
I would like Kubernetes to dispatch the pods as much as possible to have a pod on each node, but not fail the deployment/scale-up if they are more backend pods than nodes.
Is it possible to do that with preferredDuringSchedulingIgnoredDuringExecution?
Try setting up an preferred antiAffinity rule like so:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: "app"
operator: In
values:
- "my_app_name"
topologyKey: "kubernetes.io/hostname"
This will try to schedule pods onto nodes which do not already have a pod of the same label running on them. After that it's a free for all (so it won't evenly spread them after making sure at least 1 is running on each node). This means that after scaling up you might end up with a node with 5 pods, and other nodes with 1 pod each.