Kubernetes pod affinity - Scheduling pods on different nodes

We are running our application with 3 pods on a 3-node Kubernetes cluster. When we deploy our application, the pods sometimes get scheduled onto the same Kubernetes node.
We want our pods to be scheduled in such a way that they are spread across nodes (no 2 pods of the same application should be on the same node). In fact, as per the documentation (https://kubernetes.io/docs/concepts/configuration/assign-pod-node/), Kubernetes already does a good job of this. However, if it doesn't find enough resources, it schedules pods onto the same node. How do I make it a hard constraint?
Requirement:
We want the deployment to fail or stay in a Pending state if the pods don't obey the constraint (no 2 pods of the same application on the same node).

I think this one will work:
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - <VALUE>
      topologyKey: "kubernetes.io/hostname"
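With required anti-affinity, any replica that cannot be placed on a node free of matching pods stays Pending, which is the hard-constraint behavior asked for. For context, here is a minimal sketch of where that snippet sits in a full manifest, assuming a hypothetical Deployment whose pods carry the label app: my-app (substitute your own label for <VALUE>):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app              # hypothetical name for illustration
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app         # the anti-affinity rule matches this label
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - my-app
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: my-app
        image: nginx        # placeholder image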
For more reference, you can visit: https://thenewstack.io/implement-node-and-pod-affinity-anti-affinity-in-kubernetes-a-practical-example/

Related

EKS autoscaling for pod with node affinity

I'm running an EKS cluster with a nodegroup in all of the AZs in us-west-2 - us-west-2a, us-west-2b, us-west-2c, us-west-2d.
I recently deployed a pod with the following node affinity:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.ebs.csi.aws.com/zone
          operator: In
          values:
          - us-west-2c
I noticed that the cluster had to do 2 scale-ups. The first scale-up created a node in the us-west-2d AZ and the second created one in us-west-2c. It seems to me that the autoscaler didn't take into account the affinity of the pod that's trying to be scheduled.
Is this expected behavior? Is there a way to somehow configure more intelligent auto scaling in EKS such that new nodes are brought up to satisfy the affinities of currently unscheduled pods?
I should probably also ask, should I instead create an AZ specific node group with some other label (probably one that's related to this deployment) and use that as the affinity?
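The AZ-specific node group approach from that last paragraph would look roughly like the sketch below, assuming a hypothetical nodegroup label applied to a node group that only launches instances in us-west-2c; the pod then selects on that label instead of the EBS CSI zone label:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: nodegroup            # hypothetical label set on the AZ-specific node group
          operator: In
          values:
          - backend-us-west-2c      # hypothetical node group name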

Kubernetes Pod anti-affinity - evenly spread pods based on a label?

We are finding that our Kubernetes cluster tends to have hot-spots where certain nodes get far more instances of our apps than other nodes.
In this case, we are deploying lots of instances of Apache Airflow, and some nodes have 3x more web or scheduler components than others.
Is it possible to use anti-affinity rules to force a more even spread of pods across the cluster?
E.g. "prefer the node with the least pods of label component=airflow-web?"
If anti-affinity does not work, are there other mechanisms we should be looking into as well?
Try adding this to the Deployment/StatefulSet .spec.template:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: "component"
            operator: In
            values:
            - airflow-web
        topologyKey: "kubernetes.io/hostname"
Have you tried configuring the kube-scheduler?
kube-scheduler selects a node for the pod in a 2-step operation:
Filtering: finds the set of Nodes where it's feasible to schedule the Pod.
Scoring: ranks the remaining nodes to choose the most suitable Pod placement.
Scheduling Policies can be used to specify the predicates and priorities that kube-scheduler runs to filter and score nodes.
kube-scheduler --policy-config-file <filename>
Sample config file
One of the priorities for your scenario is:
BalancedResourceAllocation: Favors nodes with balanced resource usage.
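As a rough illustration, a policy file for the legacy --policy-config-file flag looked like the JSON sketch below; the predicate and priority names are from the old Policy API, which has since been deprecated in favor of scheduler configuration profiles, so treat this as a sketch rather than a recommendation:
{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {"name": "PodFitsResources"},
    {"name": "PodFitsHostPorts"}
  ],
  "priorities": [
    {"name": "BalancedResourceAllocation", "weight": 1},
    {"name": "LeastRequestedPriority", "weight": 1}
  ]
}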
The right solution here is pod topology spread constraints: https://kubernetes.io/blog/2020/05/introducing-podtopologyspread/
Anti-affinity only works until each node has at least 1 pod; spread constraints actually balance based on the pod count per node.
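A minimal sketch of such a constraint in the pod template, assuming the component: airflow-web label from the question; maxSkew: 1 keeps per-node pod counts within 1 of each other, and ScheduleAnyway keeps it a soft preference:
topologySpreadConstraints:
- maxSkew: 1                          # allowed difference in pod count between nodes
  topologyKey: "kubernetes.io/hostname"
  whenUnsatisfiable: ScheduleAnyway   # prefer balance, but never block scheduling
  labelSelector:
    matchLabels:
      component: airflow-web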

Schedule few statefulset pods on one node and rest on other node in a kubernetes cluster

I have a Kubernetes cluster of 3 worker nodes where I need to deploy a StatefulSet app with 6 replicas.
My requirement is to make sure that in every case, each node gets exactly 2 of the 6 replicas. Basically,
node1 - 2 pods of app
node2 - 2 pods of app
node3 - 2 pods of app
========================
Total 6 pods of app
Any help would be appreciated!
You should use Pod Anti-Affinity to make sure that the pods are spread across different nodes.
Since you will have more than one pod on each node, use preferredDuringSchedulingIgnoredDuringExecution.
Example when the app has the label app: mydb (use what fits your case):
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 100
    podAffinityTerm:
      labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - mydb
      topologyKey: "kubernetes.io/hostname"
each node should get exactly 2 pods out of 6 replicas
Try not to think of the pods as pinned to certain nodes. The idea with Kubernetes workloads is that the workload is independent of the underlying infrastructure, such as nodes. What you really want - I assume - is to spread the pods to increase availability, e.g. if one node goes down, your system should still be available.
If you are running at a cloud provider, you should probably design the anti-affinity such that the pods are scheduled to different Availability Zones and not only to different Nodes - but this requires that your cluster is deployed in a Region (consisting of multiple Availability Zones).
Spread pods across Availability Zones
After even distribution, all 3 nodes (scattered over three zones) will have 2 pods each. That is OK. The hard requirement is: if 1 node (say node-1) goes down, its 2 pods need not be re-scheduled on other nodes. When node-1 is restored, those 2 pods will be scheduled back onto it. So, we can say all 3 pairs of pods have different node/zone affinity. Any idea around this?
This can be done with podAffinity, but it is more likely done using topologySpreadConstraints, and you will probably use topologyKey: topology.kubernetes.io/zone - but this depends on what labels your nodes have.
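For this case, here is a sketch of a zone-based spread constraint in the StatefulSet's pod template, assuming the app: mydb label from the answer above and nodes carrying the standard topology.kubernetes.io/zone label; with 6 replicas over 3 zones, maxSkew: 1 with DoNotSchedule forces exactly 2 pods per zone:
topologySpreadConstraints:
- maxSkew: 1                        # zones may differ by at most 1 pod
  topologyKey: "topology.kubernetes.io/zone"
  whenUnsatisfiable: DoNotSchedule  # hard constraint: pods stay Pending rather than skew
  labelSelector:
    matchLabels:
      app: mydb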

How to fix "pods is not balanced" in Kubernetes cluster

Pods aren't balanced across the node pool. Why don't they spread to each node?
I have 9 instances in 1 node pool. In the past, I've tried adding up to 12 instances. The pods still didn't balance.
I would like to know if there is any solution that can help solve this problem while using 9 instances in 1 node pool.
Pods are scheduled to run on nodes by the kube-scheduler. And once they are scheduled, they are not rescheduled unless they are removed.
So if you add more nodes, the already running pods won't reschedule.
There is a project in the Kubernetes incubator that solves exactly this problem.
https://github.com/kubernetes-incubator/descheduler
Scheduling in Kubernetes is the process of binding pending pods to nodes, and is performed by a component of Kubernetes called kube-scheduler. The scheduler's decisions, whether or where a pod can or cannot be scheduled, are guided by its configurable policy, which comprises a set of rules called predicates and priorities. The scheduler's decisions are influenced by its view of the Kubernetes cluster at the point in time when a new pod first appears for scheduling. As Kubernetes clusters are very dynamic and their state changes over time, it may be desirable to move already-running pods to other nodes for various reasons:
Some nodes are under- or over-utilized.
The original scheduling decision does not hold true any more, as taints or labels added to or removed from nodes mean pod/node affinity requirements are no longer satisfied.
Some nodes failed and their pods moved to other nodes.
New nodes are added to clusters.
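The descheduler itself is driven by a policy file. A minimal sketch using the project's v1alpha1 policy format (strategy names and thresholds here are illustrative; check the README of your descheduler version for the supported options):
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":        # evict duplicate pods of the same owner from one node
    enabled: true
  "LowNodeUtilization":      # move pods from over-utilized to under-utilized nodes
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:          # below these percentages a node counts as under-utilized
          "cpu": 20
          "memory": 20
          "pods": 20
        targetThresholds:    # above these percentages a node counts as over-utilized
          "cpu": 50
          "memory": 50
          "pods": 50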
You should look into inter-pod anti-affinity. This feature allows you to constrain where your pods should not be scheduled based on the labels of the pods running on a node. In your case, given your app has label app-label, you can use it to ensure pods do not get scheduled on nodes that have pods with the label app-label. For example:
apiVersion: apps/v1
kind: Deployment
...
spec:
  selector:
    matchLabels:
      label-key: label-value
  template:
    metadata:
      labels:
        label-key: label-value
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: label-key
                operator: In
                values:
                - label-value
            topologyKey: "kubernetes.io/hostname"
...
PS: If you use requiredDuringSchedulingIgnoredDuringExecution, you can have at most as many pods as you have nodes. If you expect to have more pods than nodes available, you will have to use preferredDuringSchedulingIgnoredDuringExecution, which makes anti-affinity a preference rather than an obligation.

Prefer dispatching replicas among nodes

I'm running a Kubernetes cluster of 3 nodes with GKE. I ask Kubernetes for 3 replicas of backend pods. The 3 pods are not well dispatched among the nodes to provide a high-availability service; they usually all end up on 2 nodes.
I would like Kubernetes to dispatch the pods as much as possible to have one pod on each node, but not fail the deployment/scale-up if there are more backend pods than nodes.
Is it possible to do that with preferredDuringSchedulingIgnoredDuringExecution?
Try setting up a preferred anti-affinity rule like so:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: "app"
            operator: In
            values:
            - "my_app_name"
        topologyKey: "kubernetes.io/hostname"
This will try to schedule pods onto nodes which do not already have a pod of the same label running on them. After that, it's a free-for-all (so it won't evenly spread them after making sure at least 1 is running on each node). This means that after scaling up, you might end up with a node with 5 pods and other nodes with 1 pod each.