How to distribute K8s Deployments evenly across nodes with Kubernetes?

I have 12 K8s Deployments that should be distributed somewhat evenly across 3 nodes based on resource utilization (like the uptime command reports). I expected Kubernetes to automatically choose the node with the lowest utilization at the time of pod creation, which should result in a roughly even distribution. To my surprise, Kubernetes is creating the majority of the Pods on a single node that is barely handling the load, while the other nodes are hardly utilized at all.
I heard about using topologySpreadConstraints like so
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        type: wordpress-3
But I can't get it to work properly. What is the correct way to achieve the even distribution of deployments that I am looking for? Thanks!
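For reference, a minimal sketch of where such a constraint sits inside a full Deployment manifest, assuming the pod template actually carries the type: wordpress-3 label that the labelSelector counts; the name, replica count, and image here are illustrative, not taken from the question:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: wordpress-3            # illustrative name
spec:
  replicas: 4
  selector:
    matchLabels:
      type: wordpress-3
  template:
    metadata:
      labels:
        type: wordpress-3      # must match the labelSelector in the constraint
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway   # use DoNotSchedule to enforce the skew strictly
          labelSelector:
            matchLabels:
              type: wordpress-3
      containers:
        - name: wordpress
          image: wordpress:6   # illustrative image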

Are you using Bitnami's WordPress chart?
If so, you can update the values.yaml you pass into the chart and set pod anti-affinity like this:
# Affinity
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            app.kubernetes.io/instance: {name of your Wordpress release}
This will force Kubernetes to allow at most one WordPress pod per host (i.e. per node). I use this setup on my own WordPress installations, and it means that if one node goes down it doesn't take out the site, because the other replicas will still be running on separate nodes.

Schedule Kubernetes pods in the same failure zone

We have a Deployment with a large number of replicas (> 1) that we must deploy in the same zone.
We stumbled upon this documentation section: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#an-example-of-a-pod-that-uses-pod-affinity
which explains how to schedule pods in zones that already have other pods that match certain labels.
However, there are no other pods that our deployment depends upon. All other workloads are replicated and spread across multiple zones, and this is the first deployment that we would like to keep in a single zone.
We also thought about explicitly setting the zone for this deployment, but in case of a zone failure it would become unavailable until we noticed and explicitly moved it to another zone, so pinning an exact zone won't work here.
Any insights here? Thanks!
Pod affinity affects how a pod is scheduled based on the presence or absence of other pods on a node. That would probably not serve the purpose you're trying to achieve.
You're probably better off using node affinity (it's covered on the same page you linked).
That would allow you to force the pods into a zone, because each GKE node carries a failure-domain label, which you can find by running the following and looking through the results:
kubectl get node {name-of-node} -o json | jq ".metadata.labels"
The labels will read something like this:
"failure-domain.beta.kubernetes.io/region": "europe-west2",
"failure-domain.beta.kubernetes.io/zone": "europe-west2-b",
You can then combine this with nodeAffinity in your deployment yaml (parts snipped for brevity):
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    ...
  annotations:
    ...
  name: my-deployment
spec:
  replicas: 1
  strategy: {}
  selector:
    matchLabels:
      ...
  template:
    metadata:
      annotations:
        ...
      labels:
        ...
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: failure-domain.beta.kubernetes.io/zone
                    operator: In
                    values:
                      - europe-west2-b
This will force the pods generated by the deployment to all go onto nodes sitting in europe-west2-b.
I could change this and make it like this:
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: failure-domain.beta.kubernetes.io/zone
                operator: In
                values:
                  - europe-west2-b
                  - europe-west2-c
This allows it to schedule into either of the two zones (but, as a consequence, it would not be able to schedule onto nodes in the europe-west2-a zone).
I do not think there is a direct way to achieve this. I can think of two ways this can work.
Using Affinity on Pod and Node
Adding node affinity with preferredDuringSchedulingIgnoredDuringExecution for the zones you want to target.
Adding pod affinity to itself with preferredDuringSchedulingIgnoredDuringExecution so that the pods prefer to be scheduled with each other.
With this, when the first pod is about to be spun up it matches none of its preferred affinities, but the scheduler will still schedule it. Once one pod is running, the rest of the pods will see a pod with the correct affinity and should all land alongside it (see the sketch below). The challenge is the possibility of a race condition where multiple pods are scheduled at the same time and the scheduler puts them in different locations once your first preferred zone is out.
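A minimal sketch of that combination (a fragment of the pod template spec), assuming the deployment's pods carry an app: my-app label; the zone value and weights are illustrative, and the failure-domain.beta.kubernetes.io/zone key matches the earlier examples (newer clusters use topology.kubernetes.io/zone):
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: failure-domain.beta.kubernetes.io/zone
                operator: In
                values:
                  - europe-west2-b        # illustrative preferred zone
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: my-app               # the deployment's own pods
            topologyKey: failure-domain.beta.kubernetes.io/zone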
Using Webhooks
You can use a mutating webhook to check the node labels and add a requiredDuringSchedulingIgnoredDuringExecution affinity to pods based on which zones are still available.
The challenge here is that you would most likely need to write and maintain this webhook yourself. I am not sure you will find your exact use case already solved by someone else in open source. A quick search shows me this repo; I have not tested it, but it might give you a start.

StatefulSet replica scheduling questions

Is there any way I can tell Kubernetes how to schedule the replicas of a StatefulSet? For example, I have nodes divided into 3 different availability zones (AZs) and have labeled these nodes accordingly. Now I want K8s to put one replica in each AZ based on the node label. Thanks.
Pods of the same workload are spread across different nodes on a best-effort basis; to guarantee what you are looking for you can try a DaemonSet, which runs exactly one such pod on each node.
Also, you can use anti-affinity based on the pods already scheduled on a node.
https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
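A minimal DaemonSet sketch along those lines; the names and image (myapp, myapp:1.0) are placeholders, not from the question:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myapp:1.0   # placeholder image; one pod of it runs on every node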
The feature you are looking for is called Pod Anti-Affinity and can be specified as such:
apiVersion: apps/v1
kind: StatefulSet
[..]
spec:
  template:
    spec:
      affinity:
        nodeAffinity: {}
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: myapp
                topologyKey: az
              weight: 100
[..]
Since Kubernetes 1.18, there are also Pod Topology Spread Constraints, which are a nicer way to specify these anti-affinity rules; see the sketch below.
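A minimal sketch of that approach, assuming the StatefulSet's pods are labeled app: myapp and the nodes carry the standard topology.kubernetes.io/zone label (swap in your own AZ label key if you labeled the nodes yourself):
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone   # or your custom AZ label key
          whenUnsatisfiable: DoNotSchedule           # strictly one-per-zone spread
          labelSelector:
            matchLabels:
              app: myapp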

Preferred inter-pod affinity never being respected in kubernetes

I have a Jenkins pod with the label app: jenkins-master.
It resides in the jenkins namespace.
I want an nginx pod of a deployment (in another namespace, default) to be collocated with the above pod.
So I add the following in its spec:
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            namespaces:
              - all
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values:
                    - jenkins-master
            topologyKey: "kubernetes.io/os"
I have a GKE cluster of 8 nodes.
I have created and deleted the deployment 5-6 times, and the nginx pod has never landed on the same node as jenkins-master.
I know it is preferred scheduling, but is this behavior normal?
Working on GKE with "v1.15.9-gke.24"
Edit 1: I have changed to topologyKey: "kubernetes.io/hostname" as suggested in a couple of the answers below, but that didn't help much either.
Edit 2: These are the allocated resources for the node that the jenkins-master pod is scheduled on:
Resource   Requests       Limits
cpu        1691m (43%)    5013m (127%)
memory     4456Mi (33%)   8902Mi (66%)
Since scheduling is based on requests, I don't understand how the following deployment fails to collocate; the requests I am making are minimal:
resources:
  limits:
    memory: "1Gi"
    cpu: "100m"
  requests:
    memory: "100Mi"
    cpu: "50m"
I think you made a mistake using topologyKey: "kubernetes.io/os", which is only useful if you are mixing operating systems in your cluster (for example, mixing Linux and Windows nodes).
You should be using topologyKey: "kubernetes.io/hostname"; the kubelet populates this label with the node's hostname.
I assume you know that topology refers to labels that are automatically given to nodes when the cluster is initialized.
Topology groups nodes together (through these labels), so when you say topologyKey: "kubernetes.io/os" you are saying: choose a node that is part of this group and schedule the pod on it. Since all of your nodes probably have the same OS, every node is a valid candidate to the scheduler. So yes, it is intended behavior.
Note that this is still only a preference, but the scheduler will still try to schedule the pod on the right node if there are enough resources.
What you have to do is what omricoco is suggesting: topologyKey: "kubernetes.io/hostname". You need to let the scheduler group by hostname, so each group contains only one node, and the pod will be scheduled onto the same node as jenkins-master.
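A minimal corrected sketch of the asker's affinity block under that suggestion; only the topologyKey changes, and note that the namespaces field takes actual namespace names, so jenkins (where the target pod lives) is used here rather than the literal "all":
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            namespaces:
              - jenkins            # the namespace of the jenkins-master pod
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values:
                    - jenkins-master
            topologyKey: "kubernetes.io/hostname"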

Use of Labels in Kubernetes deployments

I am interested in knowing how pervasively labels / selectors are used in Kubernetes. Is it a widely used feature in the field for segregating container workloads?
If not, what other ways are used to segregate workloads in Kubernetes?
I've been running Kubernetes in production for some months and use labels on some pods to spread them out over the nodes using podAntiAffinity rules, so that these pods aren't all located on a single node. Mind you, I'm running a small cluster of three nodes.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app
              operator: In
              values:
                - your-deployment-name
        topologyKey: "kubernetes.io/hostname"
I've found this a useful way to use labels.
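Another common pattern for segregating workloads with labels is labelling the nodes themselves and steering pods to them with nodeSelector. A minimal sketch under that assumption; the workload-type label key, the batch-jobs Deployment, and the node name are illustrative, not taken from the answer above:
# Label a node first (illustrative node name and label):
#   kubectl label node worker-1 workload-type=batch
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-jobs               # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: batch-jobs
  template:
    metadata:
      labels:
        app: batch-jobs
    spec:
      nodeSelector:
        workload-type: batch     # only schedule on nodes carrying this label
      containers:
        - name: worker
          image: busybox:1.36
          command: ["sleep", "3600"]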

Force two pods to deploy to different ICP workers

There is a Kubernetes cluster on IBM Cloud Private (ICP) with two workers.
I have one deployment which creates two pods. How can I force the deployment to place its pods on two different workers? That way, if I lose one ICP worker, I always have the other one running the needed pod.
If you want pods to not schedule on the same node, the correct concept that you will want to use is inter-pod anti-affinity. https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#inter-pod-affinity-and-anti-affinity-beta-feature
Observe:
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - my-app
              topologyKey: kubernetes.io/hostname
You can create your pods as a Kubernetes DaemonSet. A DaemonSet ensures that all (or some) nodes run a copy of a pod. See the link below for details.
https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/
In addition to @Santiclause's answer regarding scheduling policy in affinity mode, there are two different modes of affinity:
requiredDuringSchedulingIgnoredDuringExecution
preferredDuringSchedulingIgnoredDuringExecution
When using requiredDuringSchedulingIgnoredDuringExecution, all rules must be met for a pod to be scheduled.
If, for example, there are not enough nodes to spawn all the pods, the scheduler will wait forever until enough nodes become available.
If you use preferredDuringSchedulingIgnoredDuringExecution, the scheduler will try to place all replicas based on the highest score each node gets from the combination of the defined rules and their weights.
Weight is a parameter used along with a rule; each rule can have a different weight. To calculate a score for a node, the following logic is used:
For every node, we iterate through the rules defined in the configuration (i.e. resource requests, requiredDuringScheduling, affinity expressions, etc.). If a rule is matched, its weight value is added to the score for that node. Once all rules for all nodes are processed, we have a list of all nodes with their final scores. The node(s) with the highest score are the most preferred.
To summarize, a higher weight value increases the importance of a rule and helps the scheduler decide which node to choose.
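For illustration, a minimal sketch of two preferred anti-affinity rules with different weights; the app: my-app label follows the example above, the weights are arbitrary, and topology.kubernetes.io/zone is the standard zone label on newer clusters. When both rules are satisfiable, the hostname rule contributes 100 to a node's score and the zone rule only 50, so spreading across nodes weighs more heavily than spreading across zones:
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100                     # stronger preference: avoid co-locating on the same node
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: my-app
            topologyKey: kubernetes.io/hostname
        - weight: 50                      # weaker preference: also try to spread across zones
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: my-app
            topologyKey: topology.kubernetes.io/zone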