Using Kubernetes, I have a set of high-CPU nodes, and I am using an affinity policy for a given deployment to specifically target these high-CPU nodes:
# deployment.yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: high-cpu-node
            operator: In
            values:
            - "true"
That works, however it does not prevent all the rest of the deployments from scheduling pods on these high cpu nodes. How do I specify that these high cpu nodes should ONLY run pods where high-cpu-node=true? Is it possible to do this without going and modifying all the other deployment configurations (I have dozens of deployments)?
To get this behaviour you should taint nodes and use tolerations on deployments: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
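You do have to modify the deployment that should run on these nodes so that it tolerates the taint, but deployments without the toleration are kept off the tainted nodes as they are. It's not possible to achieve this simply via labels.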
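A minimal sketch of the taint-and-toleration approach, assuming the taint reuses the high-cpu-node=true key and value from the question. The node name high-cpu-node-1 is hypothetical, and the fields are shown under spec.template.spec, which is where a Deployment keeps its pod spec. In this sketch the toleration goes on the deployment that targets the high-CPU nodes, next to the existing node affinity.

```yaml
# Taint each high-cpu node once (node name is hypothetical):
#   kubectl taint nodes high-cpu-node-1 high-cpu-node=true:NoSchedule
#
# deployment.yaml for the workload that should run on the high-cpu nodes
spec:
  template:
    spec:
      # Allows these pods onto the tainted nodes
      tolerations:
      - key: high-cpu-node
        operator: Equal
        value: "true"
        effect: NoSchedule
      # Still needed: the toleration alone does not attract the pods
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: high-cpu-node
                operator: In
                values:
                - "true"
```

Pods that do not carry the toleration are not scheduled onto nodes with a NoSchedule taint, while pods already running there are left in place.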
Goal : Have one pod (namely 'log-scraper') get scheduled on every node at least once but no more than once
Assume a cluster has the following nodes
Nodes
master/control-plane
worker-1
worker-2
worker-3
Pod I'm working with
apiVersion: v1
kind: Pod
metadata:
  name: log-scraper
spec:
  volumes:
  - name: container-log-dir
    hostPath:
      path: /var/log/containers
  containers:
  - image: "logScraper:latest"
    name: log-munger
    volumeMounts:
    - name: container-log-dir
      mountPath: /var/log/logging-app
Adding affinity to select only 'worker' nodes (or non-master nodes)
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: "worker"
          operator: In
          values:
          - "true"
Question 1: How do I ensure every node runs ONE-AND-ONLY-ONE pod of type log-scraper
Question 2: What other manifests should be applied/added to achieve this?
You should probably use a DaemonSet, which is made exactly for this purpose: it schedules one pod per node, and pods are automatically added to new nodes in case of cluster autoscaling.
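As a rough sketch, the Pod from the question could be wrapped in a DaemonSet like this. The app: log-scraper label is an assumption introduced here so that the selector has something to match; the volumes, container and worker affinity are copied from the question.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-scraper
spec:
  selector:
    matchLabels:
      app: log-scraper        # assumed label, must match the template labels
  template:
    metadata:
      labels:
        app: log-scraper
    spec:
      # Optional: restrict the DaemonSet to nodes labeled worker=true
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: "worker"
                operator: In
                values:
                - "true"
      volumes:
      - name: container-log-dir
        hostPath:
          path: /var/log/containers
      containers:
      - image: "logScraper:latest"
        name: log-munger
        volumeMounts:
        - name: container-log-dir
          mountPath: /var/log/logging-app
```

The DaemonSet controller creates exactly one pod on each eligible node, so no replica count or anti-affinity rule is needed.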
Concept
There are two important things when it comes to assigning Pods to Nodes - "Affinity" and "AntiAffinity".
Affinity will basically select based on given criteria while anti-affinity will avoid based on given criteria.
With affinity and anti-affinity, you can use operators like In, NotIn, Exists, DoesNotExist, Gt and Lt (Gt and Lt are available for node affinity only). When you use NotIn or DoesNotExist, it becomes anti-affinity.
Now, in Affinity/Antiaffinity, you have 2 choices - Node affinity/antiaffinity and Inter-pod affinity/antiaffinity
Node affinity/antiaffinity
Node affinity is conceptually similar to nodeSelector -- it allows you to constrain which nodes your pod is eligible to be scheduled on, based on labels on the node.
Inter-pod affinity/antiaffinity
Inter-pod affinity and anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled on, based on labels on pods that are already running on the node rather than on labels on nodes.
Your Solution
Basically, what you need is anti-affinity, and specifically pod anti-affinity rather than node anti-affinity. Your solution should look something like the example below (please note that I do not have a 3-node cluster, so I couldn't test this, and there is a small chance you will have to make minor adjustments):
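affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    # each term is a list item with its own labelSelector and a mandatory topologyKey
    - labelSelector:
        matchExpressions:
        - key: worker
          operator: In
          values:
          - log-scraper
      topologyKey: kubernetes.io/hostname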
Read more over here, and especially go through example over here.
Using Pod Topology Spread Constraints
Another way to do it is using Pod Topology Spread Constraints.
Set up taints and tolerations as usual to control which nodes the pods can be scheduled on. Then add some labels to the pod; I will use the pod label id: foo-bar in the example. Then, to allow only a single pod from a ReplicaSet, Deployment or other controller to be scheduled per node, add the following to the pod spec.
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      id: foo-bar
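topologyKey is a node label key. kubernetes.io/hostname is a default label set on every node. Put the pod labels inside matchLabels. Create the resources and kube-scheduler should spread the pods with the matching labels across nodes: with maxSkew: 1 the pod counts on any two eligible nodes differ by at most one, so each node gets a single pod as long as the replica count does not exceed the number of eligible nodes.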
To learn more, check out the documentation here and also this excellent blog post.
I am facing an issue while trying to deploy my redis pods in a k3s cluster.
I have updated my Chart.yaml to add the redis dependency as below:
...
dependencies:
- name: redis
  version: 10.2.3
  repository: https://charts.bitnami.com/bitnami
...
I then applied a nodeAffinity rule in values.yaml as below:
redis:
  master:
    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
              - worker4
However, the pod is not being scheduled to worker4. Can someone tell me which rule is incorrect, or whether I should use pod affinity instead?
preferredDuringSchedulingIgnoredDuringExecution is a soft constraint. The scheduler takes your preference into consideration, but it need not honor it if some other node gets a higher score after running through the other scheduler logic. You have also given it weight: 1, which is the minimum weight.
If you want to force the pod to run on worker4, you can create a hard constraint using requiredDuringSchedulingIgnoredDuringExecution instead of preferredDuringSchedulingIgnoredDuringExecution. This means that if no node matching the label kubernetes.io/hostname: worker4 is available, your pod becomes unschedulable.
If you want to keep preferredDuringSchedulingIgnoredDuringExecution so that your pod can still be scheduled to another node when worker4 is not available, you can try increasing the weight. weight takes a value in the range 1-100.
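For illustration, the hard-constraint variant of the values.yaml from the question might look like the sketch below. It assumes, as the question does, that the chart passes redis.master.affinity through to the pod spec unchanged; that has not been checked against chart version 10.2.3.

```yaml
redis:
  master:
    affinity:
      nodeAffinity:
        # Hard constraint: the pod stays Pending if worker4 cannot take it
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
              - worker4
```

Note the different shape: the required form uses nodeSelectorTerms and has no weight or preference fields.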
I've got an AKS cluster configured with two fairly small VM worker nodes, and then a virtual node to use ACI. What I really want to happen is for pods to get scheduled on the two VM nodes until they are full, then use the virtual node, but I cannot get this to work.
I've tried using node affinity, as suggested here, but this just doesn't work, pods get scheduled on the virtual node first. If I use a required node affinity, then they do get scheduled only on the VM nodes, but that is not what I want. I am guessing the issue here is that the resource availability on my VM nodes is significantly lower than the virtual node (as you would expect) so the virtual node is getting much more weight applied to it, which counteracts the affinity rule, but I don't really know as I can't see any way to see this weight.
So, does anyone have a way to make this scenario work?
https://kubernetes.io/docs/concepts/scheduling/kube-scheduler/ goes over the different scoring options used by the scheduler and https://github.com/kubernetes/examples/blob/master/staging/scheduler-policy/scheduler-policy-config.json shows how to customize them.
I suspect what you want is a preferred affinity combined with increasing the scoring factor for NodeAffinityPriority.
nodeAffinity is the right way to go, but you have to combine the requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution parameters correctly.
For example:
apiVersion: v1
kind: Pod
metadata:
  name: node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: KEY-FOR-ALL-THREE-NODES
            operator: In
            values:
            - VALUE-FOR-ALL-THREE-NODES
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: KEY-FOR-THE-TWO-SMALL-NODES
            operator: In
            values:
            - VALUE-FOR-THE-TWO-SMALL-NODES
  containers:
  - name: nginx
    image: nginx
This pod can run only on nodes that have the key: value pair stated as a requirement (so all three nodes), but you are giving it a preference to run on the small nodes (if there is room), with a weight of 100. The weight is relative to other preferences, so with a single preference it should work the same with 1 as with 100.
Also, since you have three nodes, you can skip requirement part, and only set a preference.
I use Google Cloud Platform for my project.
Now, I have a cluster with 4 node pools:
- "micro-pool": with minimal machines for managing the cluster
- "cpu-pool": with cpu-only machines for processes that don't need a GPU
- 2 "gpu-pools": two pools with machines that have GPUs attached.
Now, what I need is for my CPU processes to never work on a GPU machine because they take so much time and doing that on a GPU machine is just costing money for nothing.
I run my pods using the following command:
kubectl run dc-1 --image={image-name} --replicas=1 --restart=Never --limits="nvidia.com/gpu=0,cpu=4000m,memory=2Gi" -- bash -c "command to execute"
Now, this works fine if there were no "GPU-machines" created from previous GPU runs. But if there was a very recent GPU run, this command will run on that instance because it has the minimum cpu and memory requirements. I thought the --limits="nvidia.com/gpu=0 would do the trick but obviously it didn't.
What should I do?
If you want to assign the pod to a particular instance or node, you can use the Kubernetes node selector.
For example:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    disktype: ssd
Here the pod is assigned based on the node selector, which is the disk type.
You can also check this URL for further documentation: https://kubernetes.io/docs/concepts/configuration/assign-pod-node
Edit 1 :
As you are on GCP, you can also do it this way:
nodeSelector:
  # <labelname>: <value>; pool-highcpu8 is the node pool name
  cloud.google.com/gke-nodepool: pool-highcpu8
Edit 2 :
If you are familiar with affinity and anti-affinity, you can use those as well.
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/node-type
            operator: In
            values:
            - gpu
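For cpu:
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      # required terms take labelSelector and topologyKey directly;
      # weight/podAffinityTerm are only valid under preferredDuringSchedulingIgnoredDuringExecution
      - labelSelector:
          matchExpressions:
          - key: resources
            operator: In
            values:
            - cpu-only
        topologyKey: kubernetes.io/hostname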
This is a good use case for taints and tolerations. You can taint the GPU nodes with NoSchedule. This will prevent pods (even system pods) that don't have a toleration for that taint from running on the GPU nodes:
kubectl taint nodes gpuNode1 nodetype=gpu:NoSchedule
Then, on pods you do want to run on these nodes, you can add a toleration for the taint:
tolerations:
- key: "nodetype"
  operator: "Equal"
  value: "gpu"
  effect: "NoSchedule"
I'm not sure on GCP, but on Azure's AKS you can configure the taint when you create the cluster and the node pools.
Edit:
You will want to combine this with Harsh Manvar's suggestion of node selectors and/or affinity. Just because your pod can tolerate the taint doesn't mean it will be scheduled on the GPU nodes for sure; the taint only makes sure that other pods are not.
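A sketch of that combination for a GPU workload, assuming the nodetype=gpu:NoSchedule taint from above. The pod name, container name, image and the node pool name gpu-pool-1 are hypothetical placeholders; cloud.google.com/gke-nodepool is the GKE node pool label used in the earlier answer.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job                                   # hypothetical name
spec:
  # Attracts the pod to the GPU pool
  nodeSelector:
    cloud.google.com/gke-nodepool: gpu-pool-1     # hypothetical pool name
  # Allows the pod onto nodes tainted with nodetype=gpu:NoSchedule
  tolerations:
  - key: "nodetype"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  containers:
  - name: main                                    # placeholder container
    image: nginx
```

CPU-only pods need neither field: without the toleration they cannot land on the tainted GPU nodes.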
I am interested in knowing how pervasively labels and selectors are used in Kubernetes. Is it a widely used feature in the field for segregating container workloads?
If not, what other ways are used to segregate workloads in Kubernetes?
I've been running Kubernetes in production for some months and use labels on some pods to spread them out over the nodes with podAntiAffinity rules, so that these pods aren't all located on a single node. Mind you, I'm running a small cluster of three nodes.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - your-deployment-name
      topologyKey: "kubernetes.io/hostname"
I've found this a useful way to use labels.
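For context, here is a sketch of how such a rule might sit inside a full Deployment. The replica count, container name and image are assumptions for illustration; the point is that the labelSelector in the anti-affinity rule has to match the labels on the pod template.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: your-deployment-name
spec:
  replicas: 2                          # assumed value
  selector:
    matchLabels:
      app: your-deployment-name
  template:
    metadata:
      labels:
        app: your-deployment-name      # the label the anti-affinity rule selects on
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - your-deployment-name
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: app                      # placeholder container
        image: nginx
```

Because the rule is a hard requirement, any replica that cannot find a node without a matching pod stays Pending.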