Preferred inter-pod affinity never being respected in Kubernetes

I have a jenkins pod with the label app: jenkins-master.
It resides in the jenkins namespace.
I want an nginx pod of a deployment (in another namespace, default) to be co-located with the above pod.
So I add the following in its spec:
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          namespaces:
          - all
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - jenkins-master
          topologyKey: "kubernetes.io/os"
I have a GKE cluster of 8 nodes.
In the 5-6 times I have created and deleted the deployment, the nginx pod has never actually landed on the same node as jenkins-master.
I know it is preferred scheduling, but is this behavior normal?
Working on GKE with "v1.15.9-gke.24"
Edit 1: I changed topologyKey to "kubernetes.io/hostname" as suggested in a couple of answers below, but that didn't help much either.
Edit 2: These are the allocated resources on the node that the jenkins-master pod is scheduled on:
Resource  Requests      Limits
cpu       1691m (43%)   5013m (127%)
memory    4456Mi (33%)  8902Mi (66%)
Since scheduling is based on requests, I don't understand how the following deployment fails to be co-located; the requests I am making are minimal:
resources:
  limits:
    memory: "1Gi"
    cpu: "100m"
  requests:
    memory: "100Mi"
    cpu: "50m"

I think you made a mistake by using topologyKey: "kubernetes.io/os", which is meant for clusters that mix operating systems (for example, Linux and Windows nodes).
You should be using topologyKey: "kubernetes.io/hostname"; the kubelet populates this label with the node's hostname.

I assume you know that topology refers to some labels that are given to the nodes automatically upon initialization of the cluster.
So, topology groups nodes together (through these labels); when you set topologyKey: "kubernetes.io/os", you are saying "choose a node that is part of this group and schedule the pod on it". Since all your nodes probably run the same OS, every node is a valid candidate for the scheduler. So yes, this is the intended behavior.
Note that this is still a preference, but the scheduler will still try to place the pod on the right node if there are enough resources.
What you have to do is what omricoco suggests: topologyKey: "kubernetes.io/hostname". That makes the scheduler group by hostname, so each group contains exactly one node, and the pod will be scheduled onto the same node as jenkins-master.
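For reference, a minimal sketch of the corrected affinity block (assuming the Jenkins pod keeps the app: jenkins-master label; note that the namespaces field of a podAffinityTerm takes literal namespace names, so the jenkins namespace is listed explicitly rather than "all"):

affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        namespaces:
        - jenkins
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - jenkins-master
        topologyKey: "kubernetes.io/hostname"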

Related

Kubernetes scheduling: control preference and remove pods in priority on a node

Context
I have a Kubernetes cluster with 2 pools:
the first one, bare-metal, is composed of a finite number of bare metal instances. Every node in this pool has a label node-type: bare-metal
the second one, vms, is composed of virtual machines. This is used if the cluster needs to autoscale, since provisioning VMs is faster than provisioning a new bare metal instance. Every node in this pool has a label node-type: vms
For performance reasons, I would like all pods to run on a bare-metal node whenever possible. Only if that is not possible (not enough memory, too many pods already) is it OK for pods to run on a vms node. I have two tightly related questions about this.
About preferredDuringSchedulingIgnoredDuringExecution behavior
To schedule pods preferentially on bare-metal, I've used preferredDuringSchedulingIgnoredDuringExecution. Here is an example with a simple nginx deployment with 90 replicas (a node in the bare-metal pool could hold all of them):
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  replicas: 90
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
        resources:
          requests:
            cpu: 50m
            memory: 50M
          limits:
            cpu: 50m
            memory: 50M
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: node-type
                operator: In
                values:
                - bare-metal
This works fine, even if I don't really understand why, out of the 90 pods, 3 are scheduled on a node of the vms pool (even though scheduling them on the bare-metal pool is still possible). I understand that this is preferred rather than required, but I'm wondering if this "preference" can be tuned.
Question: Can I configure my deployment to say "preferred only if there is no other solution"?
Scaling down
When I scale down my Deployment, I would like the pods scheduled on the vms pool to be deleted first, because my pods should run on a bare-metal pool node in priority. However, if I run:
kubectl scale deployment nginx --replicas 3
I can observe that pods on bare-metal pool nodes are deleted first, leaving only 3 pods on a vms pool node.
Question: Is there a way to configure my deployment to tell Kubernetes to remove the pods on vms pool first?
Edit: It seems that the PodDeletionCost proposal could solve this, but it is not implemented for now.
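For reference, newer Kubernetes releases (alpha in 1.21, beta in 1.22) implement this proposal as the controller.kubernetes.io/pod-deletion-cost annotation on pods: pods with a lower cost are removed first when a ReplicaSet scales down. A sketch of how it could be applied here (the pod name is an illustrative placeholder):

kubectl annotate pod <nginx-pod-on-vms-node> controller.kubernetes.io/pod-deletion-cost=-100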

How to distribute K8 Deployments evenly across nodes with Kubernetes?

I have 12 K8s Deployments that should be distributed somewhat evenly across 3 K8s nodes based on resource utilization (like the uptime command). I expected Kubernetes to automatically choose the node that is utilizing the least resources at the time of pod creation, which I would think should result in a somewhat even distribution. To my surprise, Kubernetes creates the majority of the pods on the same single node, which is barely handling the load, while the other nodes are hardly utilized at all.
I heard about using topologySpreadConstraints, like so:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      type: wordpress-3
But I can't get it to work properly. What is the correct way to achieve the even distribution of Deployments that I am looking for? Thanks!
Are you using Bitnami's WordPress chart?
If so, you can update the values.yaml you pass into the chart and set anti-affinity like this:
# Affinity
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - topologyKey: kubernetes.io/hostname
      labelSelector:
        matchLabels:
          app.kubernetes.io/instance: {name of your Wordpress release}
This will force Kubernetes to allow only one WordPress pod per host (i.e. node). I use this setup on my own WordPress installations; it means that if one node goes down, it doesn't take out the site, because the other replicas will still be running on separate nodes.
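If you would rather stay with topology spread constraints instead of hard anti-affinity, a sketch of the snippet from the question with whenUnsatisfiable: DoNotSchedule, which enforces the skew rather than treating it as a soft preference (the type: wordpress-3 label is taken from the question and must match the pod template's labels):

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule   # hard constraint instead of ScheduleAnyway
  labelSelector:
    matchLabels:
      type: wordpress-3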

Kubernetes : How to ensure one pod gets scheduled on each worker node?

Goal: have one pod (namely 'log-scraper') scheduled on every node at least once, but no more than once.
Assume a cluster has the following nodes:
Nodes
master/control-plane
worker-1
worker-2
worker-3
Pod I'm working with
apiVersion: v1
kind: Pod
metadata:
  name: log-scraper
spec:
  volumes:
  - name: container-log-dir
    hostPath:
      path: /var/log/containers
  containers:
  - image: "logScraper:latest"
    name: log-munger
    volumeMounts:
    - name: container-log-dir
      mountPath: /var/log/logging-app
Adding affinity to select only 'worker' nodes (or non-master nodes):
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: "worker"
          operator: In
          values:
          - "true"
Question 1: How do I ensure every node runs ONE-AND-ONLY-ONE pod of type log-scraper?
Question 2: What other manifests should be applied/added to achieve this?
You should probably use a DaemonSet, which is made exactly for this purpose: it schedules one pod per node and automatically adds pods to new nodes when the cluster scales out.
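A minimal DaemonSet sketch built from the pod spec in the question (the app: log-scraper label is added here so the selector has something to match; the node-affinity block from the question can be added to the template spec to exclude the control-plane node):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-scraper
spec:
  selector:
    matchLabels:
      app: log-scraper
  template:
    metadata:
      labels:
        app: log-scraper
    spec:
      volumes:
      - name: container-log-dir
        hostPath:
          path: /var/log/containers
      containers:
      - image: "logScraper:latest"
        name: log-munger
        volumeMounts:
        - name: container-log-dir
          mountPath: /var/log/logging-app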
Concept
There are two important things when it comes to assigning Pods to Nodes - "Affinity" and "AntiAffinity".
Affinity will basically select based on given criteria while anti-affinity will avoid based on given criteria.
With affinity and anti-affinity you can use operators like In, NotIn, Exists, DoesNotExist, Gt and Lt. When you use NotIn or DoesNotExist, it becomes anti-affinity.
Now, for affinity/anti-affinity you have two choices: node affinity/anti-affinity and inter-pod affinity/anti-affinity.
Node affinity/antiaffinity
Node affinity is conceptually similar to nodeSelector -- it allows you to constrain which nodes your pod is eligible to be scheduled on, based on labels on the node.
Inter-pod affinity/antiaffinity
Inter-pod affinity and anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled on, based on labels on pods that are already running on the node rather than on labels on the nodes themselves.
Your Solution
Basically what you need is anti-affinity, and specifically pod anti-affinity rather than node anti-affinity. Your solution should look something like the below (please note that since I do not have a 3-node cluster I couldn't test this, so you might have to make minor adjustments):
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        # the pod must carry a matching label, e.g. app: log-scraper
        - key: app
          operator: In
          values:
          - log-scraper
      # topologyKey is required; hostname spreads the pods one per node
      topologyKey: kubernetes.io/hostname
Read more in the documentation, and especially go through the examples there.
Using Pod Topology Spread Constraints
Another way to do it is using Pod Topology Spread Constraints.
You will set up taints and tolerations as usual to control which nodes the pods can be scheduled on. Then add some labels to the pod; I will use the pod label id: foo-bar in the example. Then, to allow only a single pod from a ReplicaSet, Deployment, or similar to be scheduled per node, add the following to the pod spec:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      id: foo-bar
topologyKey is a node label; kubernetes.io/hostname is a default label set on every node. Put the pod labels inside matchLabels. Create the resources, and kube-scheduler should schedule a single pod with the matching labels per node.
To learn more, check out the documentation here and also this excellent blog post.

Kubernetes pod transferring on insufficient memory

I am trying to find an elegant way to resolve the following scenario.
We have an AWS Kubernetes cluster with 6 nodes of 16 GB RAM each.
The cluster has various pods with different resource requirements, between 1 GB and 6 GB of minimum requested memory.
There is a scenario where a pod goes pending due to insufficient memory.
It happens when we need to upgrade a few pods with different memory requirements: the pod requesting 6 GB is pending since no node has 6 GB available.
What I would expect from Kubernetes is to rearrange pods between nodes in order to free 6 GB on a specific node, rather than keeping 5 GB free on two different nodes (10 GB in total) and returning insufficient memory.
Is there a way to instruct Kubernetes to allocate memory better and handle this automatically?
I was thinking about the pod priority capability: the smaller the memory request, the lower the priority. I'm wondering whether, based on this setting, Kubernetes would be able to restart the less important (smaller) pods once the bigger one is deployed, and thereby rearrange them between nodes.
Any idea will be appreciated.
There is no silver-bullet solution, but there is a combination of things you can do, perhaps using pod affinity/anti-affinity, node affinity, and pod topology spread constraints. It also depends on your workload priorities.
If you have 6 nodes you can have something like this:
NAME    STATUS   ROLES    AGE     VERSION   LABELS
node1   Ready    <none>   4m26s   v1.16.0   node=node1,type=heavy
node2   Ready    <none>   3m58s   v1.16.0   node=node2,type=heavy
node3   Ready    <none>   3m17s   v1.16.0   node=node3,type=heavy
node4   Ready    <none>   2m43s   v1.16.0   node=node4,type=light
node5   Ready    <none>   3m17s   v1.16.0   node=node5,type=light
node6   Ready    <none>   2m43s   v1.16.0   node=node6,type=light
Then, in your 6G Pod spec, add something like the following: the topology spread constraint spreads the workload: heavy pods across the type topology with a maximum skew of 3, and the required pod affinity keeps heavy pods together in the same type domain.
kind: Pod
apiVersion: v1
metadata:
  name: mypod
  labels:
    workload: heavy
spec:
  topologySpreadConstraints:
  - maxSkew: 3
    topologyKey: type
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        workload: heavy
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: workload
            operator: In
            values:
            - heavy
        topologyKey: type
  containers:
  - name: myheavyapp
    image: myapp:latest
    ...
Then you can use NodeAffinity just to schedule your light 1G pods on the light nodes only.
kind: Pod
apiVersion: v1
metadata:
  name: mylightpod
  labels:
    workload: light
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: type
            operator: In
            values:
            - light
  ...
This is just an example; you can change the labels and skews to fit your use case.
Additionally, to prevent downtime you can configure a PodDisruptionBudget.
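A minimal PodDisruptionBudget sketch (the workload: heavy selector and minAvailable value are illustrative; use policy/v1 on recent clusters, policy/v1beta1 on older ones):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: heavy-pdb
spec:
  minAvailable: 1          # keep at least one heavy pod up during voluntary disruptions
  selector:
    matchLabels:
      workload: heavy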
By default, the Kubernetes scheduler will never kill running containers to accommodate new ones, because doing so could force running containers to reschedule onto other nodes, which is undesirable. Kubernetes respects the current state of the cluster and tries to keep the environment stable.
What you can do about this issue is: when you deploy the 6G app, deploy it first and then delete the pods requesting 1G, so the scheduler can place the bigger app on the available nodes first and then place the other pods on other nodes. This also matches the scheduler's general behavior: it tries to place the bigger pieces first so it can fit the smaller ones better.
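Regarding the pod-priority idea in the question: scheduler preemption via a PriorityClass can evict lower-priority pods when a higher-priority pod cannot otherwise be scheduled, which is the closest built-in mechanism to the rearranging described. A minimal sketch (names and values are illustrative):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: large-memory-high
value: 100000          # higher value = higher priority; may preempt lower-priority pods
globalDefault: false
description: "For large-memory pods that should preempt smaller ones."

The 6G pod would then reference it via priorityClassName: large-memory-high in its spec, while the 1G pods keep the default (lower) priority.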

How to prevent a GCE Kubernetes pod from working on a GPU instance?

I use Google Cloud Platform for my project.
Now, I have a cluster with 4 node pools:
- "micro-pool": with minimal machines for managing the cluster
- "cpu-pool": with cpu-only machines for processes that don't need a GPU
- 2 "gpu-pools": two pools with machines that have GPUs attached.
Now, what I need is for my CPU processes to never run on a GPU machine, because they take a long time and running them on a GPU machine just costs money for nothing.
I run my pods with the following command:
kubectl run dc-1 --image={image-name} --replicas=1 --restart=Never --limits="nvidia.com/gpu=0,cpu=4000m,memory=2Gi" -- bash -c "command to execute"
Now, this works fine if there are no GPU machines left over from previous GPU runs. But if there was a very recent GPU run, this command will run on that instance, because it satisfies the minimum CPU and memory requirements. I thought --limits="nvidia.com/gpu=0" would do the trick, but obviously it didn't.
What should I do?
If you want to assign the pod to a particular instance or node, you can use the Kubernetes node selector.
For example:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    disktype: ssd
Here it will assign the pod based on the node selector, which is disktype.
You can also check this URL for further documentation: https://kubernetes.io/docs/concepts/configuration/assign-pod-node
Edit 1:
As you are on GCP, you can also do it this way:
nodeSelector:
  # <labelname>: <value>
  cloud.google.com/gke-nodepool: pool-highcpu8   # pool name
Edit 2:
If you are familiar with affinity and anti-affinity, you can use those as well. For GPU nodes:
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/node-type
            operator: In
            values:
            - gpu
For CPU-only pods:
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: resources
            operator: In
            values:
            - cpu-only
        # topologyKey is required for pod anti-affinity
        topologyKey: kubernetes.io/hostname
This is a good use case for taints and tolerations. You can taint the GPU nodes with NoSchedule; this will prevent pods (even system pods) that don't have a toleration for that taint from running on the GPU nodes:
kubectl taint nodes gpuNode1 nodetype=gpu:NoSchedule
Then, on pods you do want to run on these nodes, you can add a toleration for the taint:
tolerations:
- key: "nodetype"
  operator: "Equal"
  value: "gpu"
  effect: "NoSchedule"
I'm not sure on GCP, but on Azure's AKS you can configure the taint when you create the cluster and the node pools.
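On GKE the taint can likewise be applied when the node pool is created; a sketch, assuming the standard gcloud flag (cluster and pool names are illustrative):

gcloud container node-pools create gpu-pool \
  --cluster=my-cluster \
  --node-taints=nodetype=gpu:NoSchedule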
Edit:
You will want to combine this with Harsh Manvar's suggestion of node selectors and/or affinity. Just because your pod can tolerate the taint doesn't mean it will be scheduled on the GPU nodes for sure; it just makes sure other pods are not.