I've got an AKS cluster configured with two fairly small VM worker nodes, and then a virtual node to use ACI. What I really want to happen is for pods to get scheduled on the two VM nodes until they are full, then use the virtual node, but I cannot get this to work.
I've tried using node affinity, as suggested here, but it just doesn't work: pods get scheduled on the virtual node first. If I use a required node affinity, they do get scheduled only on the VM nodes, but that is not what I want either. My guess is that because the resource availability on my VM nodes is significantly lower than on the virtual node (as you would expect), the virtual node receives a much higher score, which counteracts the affinity preference; I can't confirm this, though, since there is no obvious way to inspect that score.
So, does anyone have a way to make this scenario work?
https://kubernetes.io/docs/concepts/scheduling/kube-scheduler/ goes over the different scoring options used by the scheduler and https://github.com/kubernetes/examples/blob/master/staging/scheduler-policy/scheduler-policy-config.json shows how to customize them.
I suspect what you want is a preferred affinity combined with increasing the scoring factor for NodeAffinityPriority.
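For illustration, here is a minimal policy sketch in the style of that example file, raising the weight of NodeAffinityPriority so the affinity preference can outweigh the resource-based scoring (the exact weight values are assumptions to tune for your cluster):

{
    "kind": "Policy",
    "apiVersion": "v1",
    "priorities": [
        {
            "name": "NodeAffinityPriority",
            "weight": 10
        },
        {
            "name": "LeastRequestedPriority",
            "weight": 1
        }
    ]
}

With NodeAffinityPriority weighted well above the resource priorities, a preferred node affinity toward the VM nodes should dominate the virtual node's larger free capacity.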
nodeAffinity is the right way to go, but you have to combine the requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution parameters correctly.
For example:
apiVersion: v1
kind: Pod
metadata:
  name: node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: KEY-FOR-ALL-THREE-NODES
            operator: In
            values:
            - VALUE-FOR-ALL-THREE-NODES
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: KEY-FOR-THE-TWO-SMALL-NODES
            operator: In
            values:
            - VALUE-FOR-THE-TWO-SMALL-NODES
  containers:
  - name: nginx
    image: nginx
This pod can run only on nodes with the key/value stated as a requirement (so all three nodes), but you are giving it a preference to run on the small nodes (if there is room), with a weight of 100. The weight is relative, so with a single preference rule it should behave the same with a weight of 1 as with 100.
Also, since the requirement matches all three nodes anyway, you can skip the requirement part and set only the preference.
Related
Goal: Have one pod (namely 'log-scraper') get scheduled on every node at least once, but no more than once
Assume a cluster has the following nodes:
Nodes
master/control-plane
worker-1
worker-2
worker-3
Pod I'm working with
apiVersion: v1
kind: Pod
metadata:
  name: log-scraper
spec:
  volumes:
  - name: container-log-dir
    hostPath:
      path: /var/log/containers
  containers:
  - image: "logScraper:latest"
    name: log-munger
    volumeMounts:
    - name: container-log-dir
      mountPath: /var/log/logging-app
Adding affinity to select only 'worker' nodes (or non-master nodes):
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: "worker"
          operator: In
          values:
          - "true"
Question 1: How do I ensure every node runs ONE-AND-ONLY-ONE pod of type log-scraper?
Question 2: What other manifests should be applied/added to achieve this?
You should probably use DaemonSets, which are made exactly for this purpose: they schedule one pod per node, and pods are automatically added to new nodes when the cluster scales out.
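For illustration, here is a minimal sketch of the log-scraper pod rewritten as a DaemonSet (the image and volume names come from the question; the app: log-scraper label is an assumption):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-scraper
spec:
  selector:
    matchLabels:
      app: log-scraper    # assumed label; must match the template labels below
  template:
    metadata:
      labels:
        app: log-scraper
    spec:
      volumes:
      - name: container-log-dir
        hostPath:
          path: /var/log/containers
      containers:
      - image: "logScraper:latest"
        name: log-munger
        volumeMounts:
        - name: container-log-dir
          mountPath: /var/log/logging-app

By default this schedules one pod on every schedulable node; combine it with the nodeAffinity block from the question if you also need to exclude the control plane.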
Concept
There are two important things when it comes to assigning Pods to Nodes - "Affinity" and "AntiAffinity".
Affinity will basically select based on given criteria while anti-affinity will avoid based on given criteria.
With affinity and anti-affinity, you can use operators like In, NotIn, Exists, DoesNotExist, Gt and Lt. When you use NotIn or DoesNotExist, the rule becomes anti-affinity, as in the sketch below.
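For instance, a match expression like this one acts as anti-affinity by steering pods away from matching nodes (the disktype label is a hypothetical example):

matchExpressions:
- key: disktype      # hypothetical node label
  operator: NotIn
  values:
  - hdd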
Now, within affinity/anti-affinity, you have two choices - node affinity/anti-affinity and inter-pod affinity/anti-affinity.
Node affinity/antiaffinity
Node affinity is conceptually similar to nodeSelector -- it allows you to constrain which nodes your pod is eligible to be scheduled on, based on labels on the node.
Inter-pod affinity/antiaffinity
Inter-pod affinity and anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled based on labels on pods that are already running on the node rather than based on labels on nodes.
Your Solution
Basically what you need is anti-affinity - specifically pod anti-affinity rather than node anti-affinity. Your solution should look something like the snippet below (note that I do not have a 3-node cluster to test on, so there is a slim chance you may need minor adjustments):
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app            # assumes the pod is labeled app: log-scraper
          operator: In
          values:
          - log-scraper
      topologyKey: kubernetes.io/hostname
Read more over here, and especially go through the example over here.
Using Pod Topology Spread Constraints
Another way to do it is using Pod Topology Spread Constraints.
You set up taints and tolerations as usual to control which nodes the pods can be scheduled on. Then add some labels to the pod; I will use the pod label id: foo-bar in the example. Then, to allow only a single pod from a ReplicaSet, Deployment or other controller to be scheduled per node, add the following to the pod spec.
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      id: foo-bar
topologyKey is a node label key; kubernetes.io/hostname is a default label set on every node. Put the pod's labels inside matchLabels. Create the resources, and kube-scheduler should schedule a single pod with the matching labels per node.
To learn more, check out the documentation here and also this excellent blog post.
I am facing an issue while trying to deploy my redis pods in a k3s cluster.
I have updated my Chart.yaml to add the redis dependency as below:
...
dependencies:
- name: redis
  version: 10.2.3
  repository: https://charts.bitnami.com/bitnami
...
but when I tried to apply a nodeAffinity rule in values.yaml as below:
redis:
  master:
    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
              - worker4
However, the pod is not being scheduled to worker4. Can someone please tell me which rule is incorrect, or should I use pod affinity instead?
preferredDuringSchedulingIgnoredDuringExecution is a soft constraint. The scheduler takes your preference into consideration, but it need not honor it if some other node ends up with a higher priority score after the rest of the scheduling logic runs. You have also given it a weight of 1, which is the minimum weight.
If you want to enforce the pod to run on worker4, you can create a hard constraint using requiredDuringSchedulingIgnoredDuringExecution instead of preferredDuringSchedulingIgnoredDuringExecution. This would mean if there is no other node matching the label kubernetes.io/hostname: worker4, your pod would become unschedulable.
If you want to use preferredDuringSchedulingIgnoredDuringExecution so that your pod can still be scheduled to another node when worker4 is unavailable, you can try increasing the weight. weight takes a value in the range 1-100.
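For reference, a sketch of the hard-constraint variant in the same values.yaml layout (assuming the chart passes master.affinity through to the generated pod spec unchanged):

redis:
  master:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
              - worker4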
Using Kubernetes, I have a set of high-CPU nodes, and I am using an affinity policy for a given deployment to specifically target these nodes:
# deployment.yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: high-cpu-node
            operator: In
            values:
            - "true"
That works; however, it does not prevent all the rest of the deployments from scheduling pods on these high-CPU nodes. How do I specify that these nodes should ONLY run pods where high-cpu-node=true? Is it possible to do this without modifying all the other deployment configurations (I have dozens of deployments)?
To get this behaviour you should taint the nodes and use tolerations on the deployments: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
But, unfortunately, you would have to modify the deployments; it's not possible to achieve this via labels alone.
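A minimal sketch of that approach (the node name is hypothetical): first taint the high-CPU nodes,

kubectl taint nodes my-high-cpu-node high-cpu-node=true:NoSchedule

then add a matching toleration to the pod template of only the deployments that should run there:

tolerations:
- key: "high-cpu-node"
  operator: "Equal"
  value: "true"
  effect: "NoSchedule"

Untolerated pods from the other deployments will no longer be scheduled onto the tainted nodes, while your high-CPU deployment keeps its nodeAffinity to land on them.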
There is a Kubernetes cluster on IBM Cloud Private with two workers.
I have one deployment which creates two pods. How can I force the deployment to place its pods on two different workers? That way, if I lose one ICP worker, I still have the other one running the needed pod.
If you want pods to not schedule on the same node, the correct concept that you will want to use is inter-pod anti-affinity. https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#inter-pod-affinity-and-anti-affinity-beta-feature
Observe:
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - my-app
            topologyKey: kubernetes.io/hostname
You can create your pods as a Kubernetes DaemonSet. A DaemonSet ensures that all (or some) nodes run a copy of a pod. You can follow the link below for details.
https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/
In addition to @Santiclause's answer regarding scheduling policy in affinity mode, there are two different modes of affinity:
requiredDuringSchedulingIgnoredDuringExecution
preferredDuringSchedulingIgnoredDuringExecution.
When using requiredDuringSchedulingIgnoredDuringExecution affinity, the scheduler must satisfy all of the rules before a pod can be scheduled.
If, for example, there are not enough matching nodes to place all the pods, the scheduler will wait indefinitely until enough nodes become available.
If you use preferredDuringSchedulingIgnoredDuringExecution affinity, the scheduler will try to place all replicas according to the highest score each node gets from the combination of the defined rules and their weights.
Weight is a parameter used along with a rule; each rule can have a different weight. To calculate a score for a node, the following logic is used:
For every node, we iterate through the rules defined in the configuration (e.g. resource requests, requiredDuringScheduling, affinity expressions, etc.). Whenever a rule matches, its weight value is added to that node's score. Once all rules have been processed for all nodes, we have a list of nodes with their final scores, and the node(s) with the highest score are the most preferred.
Just to summarize: a higher weight value increases the importance of a rule and helps the scheduler decide which node to choose.
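As an illustrative sketch of that arithmetic (both label keys are hypothetical): with the two weighted preferences below, a node matching both rules scores 1 + 50 = 51, a node matching only the disktype rule scores 50, and the node with the highest total wins (other scheduler priority functions also contribute to the real totals).

preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
  preference:
    matchExpressions:
    - key: zone            # hypothetical node label
      operator: In
      values:
      - us-east-1a
- weight: 50
  preference:
    matchExpressions:
    - key: disktype        # hypothetical node label
      operator: In
      values:
      - ssd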
How can I configure a specific pod in a multi-node Kubernetes cluster so that its containers are restricted to a subset of the nodes?
E.g., let's say I have three nodes A, B, C running my Kubernetes cluster.
How to limit a Pod to run its containers only on A & B, and not on C?
You can add a label to the nodes you want the pod to run on and add a nodeSelector to the pod configuration. The process is described here:
http://kubernetes.io/docs/user-guide/node-selection/
So basically you want to run:
kubectl label nodes A node_type=foo
kubectl label nodes B node_type=foo
And you want to have this nodeSelector in your pod spec:
nodeSelector:
  node_type: foo
First, you need to add labels to the nodes. You can refer to Nebril's answer for how to do that.
Then I recommend you to use the node affinity feature to constrain pods to nodes with particular labels. Here is an example.
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node_type
            operator: In
            values:
            - foo
  containers:
  - name: with-node-affinity
    image: gcr.io/google_containers/pause:2.0
Compared to nodeSelector, the affinity/anti-affinity feature greatly expands the types of constraints you can express.
For more detailed information about node affinity, you can refer to the following link: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
I am unable to post a comment to the previous replies, but I upvoted the answer that is complete.
nodeSelector is not strictly enforced, and relying on it may cause some grief, including the master getting overwhelmed with requests as well as I/O from pods scheduled on the master. At the time the answer was provided it may still have been the only option, but in later versions that is no longer the case.
Definitely use nodeAffinity or nodeAntiAffinity with requiredDuringSchedulingIgnoredDuringExecution or preferredDuringSchedulingIgnoredDuringExecution
For more expressive filters for scheduling, peruse: Assigning pods to nodes