Kubernetes: How To Ensure That Pod A Only Ends Up On Nodes Where Pod B Is Running

I have two use cases where teams only want Pod A to end up on a Node where Pod B is running. They often have many copies of Pod B running on a Node, but they only want one copy of Pod A running on that same Node.
Currently they are using DaemonSets to manage Pod A, which is not effective because Pod A then ends up on a lot of nodes where Pod B is not running. I would prefer not to restrict the nodes they can end up on with labels, because that would limit the Node capacity for Pod B (i.e. if we have 100 nodes and 20 are labeled, then Pod B's possible capacity is only 20).
In short, how can I ensure that one copy of Pod A runs on any Node with at least one copy of Pod B running?

The current scheduler doesn’t really have anything like this. You would need to write something yourself.

As I understand it, you have a Kubernetes cluster with N nodes, and some pods of type B are scheduled on them. Now you want exactly one pod of type A on every node where at least one pod of type B is scheduled. I assume that A<=N and A<=B, while B can be greater or less than N.
You are using a DaemonSet controller to schedule the A pods at the moment, and it doesn't work as you want. But you can fix that by forcing the DaemonSet to be scheduled by the default scheduler instead of the DaemonSet controller, which schedules its pods without considering pod priority and preemption.
ScheduleDaemonSetPods allows you to schedule DaemonSets using the default scheduler instead of the DaemonSet controller, by adding the NodeAffinity term to the DaemonSet pods, instead of the .spec.nodeName term. The default scheduler is then used to bind the pod to the target host. If node affinity of the DaemonSet pod already exists, it is replaced. The DaemonSet controller only performs these operations when creating or modifying DaemonSet pods, and no changes are made to the spec.template of the DaemonSet.
In addition, node.kubernetes.io/unschedulable:NoSchedule toleration is added automatically to DaemonSet Pods. The default scheduler ignores unschedulable Nodes when scheduling DaemonSet Pods.
So if we add podAffinity/podAntiAffinity to a DaemonSet, N replicas (one per node) will be created, but the pods will only be scheduled on nodes that satisfy the (anti-)affinity condition; the rest of the pods will remain in the Pending state.
Here is an example of such a DaemonSet:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ds-splunk-sidecar
  namespace: default
  labels:
    k8s-app: ds-splunk-sidecar
spec:
  selector:
    matchLabels:
      name: ds-splunk-sidecar
  template:
    metadata:
      labels:
        name: ds-splunk-sidecar
    spec:
      affinity:
        # podAntiAffinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - splunk-app
            topologyKey: "kubernetes.io/hostname"
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: ds-splunk-sidecar
        image: nginx
      terminationGracePeriodSeconds: 30
The output of kubectl get pods -o wide | grep splunk:
ds-splunk-sidecar-26cpt 0/1 Pending 0 4s <none> <none> <none> <none>
ds-splunk-sidecar-8qvpx 1/1 Running 0 4s 10.244.2.87 kube-node2-2 <none> <none>
ds-splunk-sidecar-gzn7l 0/1 Pending 0 4s <none> <none> <none> <none>
ds-splunk-sidecar-ls56g 0/1 Pending 0 4s <none> <none> <none> <none>
splunk-7d65dfdc99-nz6nz 1/2 Running 0 2d 10.244.2.16 kube-node2-2 <none> <none>
The output of kubectl get pod ds-splunk-sidecar-26cpt -o yaml (which is in the Pending state). The nodeAffinity section is automatically added to pod.spec without affecting the parent DaemonSet configuration:
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2020-04-02T13:10:23Z"
  generateName: ds-splunk-sidecar-
  labels:
    controller-revision-hash: 77bfdfc748
    name: ds-splunk-sidecar
    pod-template-generation: "1"
  name: ds-splunk-sidecar-26cpt
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: ds-splunk-sidecar
    uid: 4fda6743-74e3-11ea-8141-42010a9c0004
  resourceVersion: "60026611"
  selfLink: /api/v1/namespaces/default/pods/ds-splunk-sidecar-26cpt
  uid: 4fdf96d5-74e3-11ea-8141-42010a9c0004
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - kube-node2-1
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - splunk-app
        topologyKey: kubernetes.io/hostname
  containers:
  - image: nginx
    imagePullPolicy: Always
    name: ds-splunk-sidecar
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-mxvh9
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  volumes:
  - name: default-token-mxvh9
    secret:
      defaultMode: 420
      secretName: default-token-mxvh9
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-04-02T13:10:23Z"
    message: '0/4 nodes are available: 1 node(s) didn''t match pod affinity rules,
      1 node(s) didn''t match pod affinity/anti-affinity, 3 node(s) didn''t match
      node selector.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: BestEffort
Alternatively, you can achieve similar results using a Deployment controller:
Since we can only autoscale Deployments based on Pod metrics (unless you write your own HPA), we have to set the number of A replicas equal to N manually. If there is a node without any Pod B, one Pod A will stay in the Pending state.
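For example, assuming the Deployment below is named deplA, its replica count could be kept in sync with the node count manually with something like:
kubectl scale deployment deplA --replicas=$(kubectl get nodes --no-headers | wc -l)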
There is an almost exact example of the setup described in the question using the requiredDuringSchedulingIgnoredDuringExecution directive. Please see the section "More Practical Use-cases: Always co-located in the same node" of the "Assigning Pods to Nodes" documentation page:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deplA
spec:
  selector:
    matchLabels:
      app: deplA
  replicas: N #<---- Number of nodes in the cluster <= replicas of deplB
  template:
    metadata:
      labels:
        app: deplA
    spec:
      affinity:
        podAntiAffinity: # prevents scheduling more than one PodA on the same node
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - deplA
            topologyKey: "kubernetes.io/hostname"
        podAffinity: # ensures that PodA is scheduled only if PodB is present on the same node
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - deplB
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: web-app
        image: nginx:1.16-alpine
There is one problem, common to both cases: if Pod B gets rescheduled to a different node for any reason and no Pod B remains on a node, Pod A will not be evicted from that node automatically.
That problem could be solved by scheduling a CronJob with a kubectl image and a proper service account specified, which every ~5 minutes deletes all Pod A instances on nodes where no corresponding Pod B is present. (Please search for an existing solution on Stack Overflow or ask another question about the script content.)
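For illustration only, here is a minimal sketch of such a CronJob, assuming Pod A / Pod B carry the labels app=deplA / app=deplB from the example above and that a poda-reaper service account with RBAC permissions to list and delete pods already exists (the image and all names here are assumptions):
apiVersion: batch/v1
kind: CronJob
metadata:
  name: poda-reaper
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: poda-reaper    # assumed SA allowed to list/delete pods
          restartPolicy: OnFailure
          containers:
          - name: reaper
            image: bitnami/kubectl:latest    # any image that ships kubectl and a shell
            command:
            - /bin/sh
            - -c
            - |
              # for every node running a PodA, delete PodA there if no PodB runs on that node
              for node in $(kubectl get pods -l app=deplA -o jsonpath='{.items[*].spec.nodeName}' | tr ' ' '\n' | sort -u); do
                if [ -z "$(kubectl get pods -l app=deplB --field-selector spec.nodeName=$node -o name)" ]; then
                  kubectl delete pods -l app=deplA --field-selector spec.nodeName=$node
                fi
              done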

As already explained by coderanger, the current scheduler doesn't support this function. The ideal solution would be to create your own scheduler to support such functionality.
However, you can use podAffinity to partially achieve this and schedule the pods on the same nodes.
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - <your_value>
          topologyKey: "kubernetes.io/hostname"
It will try to schedule pods as tightly as possible.

Related

Can a Pod with an affinity for one node's label, but without a toleration for that node's taint, mount to that node?

Say you have Node1 with taint node1taint:NoSchedule and label node1specialkey=asdf.
And Node2 with no taints.
Then you create PodA with affinity to Node1:
apiVersion: v1
kind: Pod
metadata:
  labels:
    name: PodA
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node1specialKey
            operator: Exists
  containers:
  - image: busybox
    name: PodA
Which node will the pod be scheduled to? Will the affinity override the taint?
Thanks!
The pod will not schedule anywhere, because it does not tolerate Node1's taint and it does not have an affinity for Node2.
Here is the missing pod toleration that would, in combination with the affinity, successfully schedule PodA on Node1.
tolerations:
- key: "node1taint"
  operator: "Exists"
  effect: "NoSchedule"
A taint is more powerful than an affinity. The pod needs the toleration, too, because affinity alone is not strong enough here in Kubernetes-land.
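For completeness, the taint and label from this example would be applied with commands along these lines (node names must be lowercase, so node1 is assumed here):
kubectl taint nodes node1 node1taint:NoSchedule
kubectl label nodes node1 node1specialkey=asdf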

How to achieve 1 node = 1 pod in Kubernetes

I have 5 nodes in a cluster (Test).
I have labeled them as follows:
Nodes:
namespace=A
namespace=A
namespace=B
namespace=B
namespace=C
I applied taints and tolerations, nodeAffinity and podAntiAffinity. Our nodes have autoscaling enabled. However, the nodes weren't scaling up; all the pods were going onto one node.
I have read in this link, Kubernetes: Evenly distribute the replicas across the cluster, that using podAntiAffinity, nodeAffinity, taints and tolerations does not guarantee this requirement. Our requirement is that the pods should be deployed evenly, one per node, and should scale up accordingly.
What am I missing?
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
  labels:
    app: prometheus
spec:
  serviceName: "prometheus"
  selector:
    matchLabels:
      name: prometheus
  template:
    metadata:
      labels:
        name: prometheus
        app: prometheus
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: namespace
                operator: In
                values:
                - A
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - prometheus
              topologyKey: kubernetes.io/hostname
            weight: 100
      containers:
      - name: prometheus
        image: quay.io/prometheus/prometheus:v2.6.0
        resources:
          limits:
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 200Mi
      terminationGracePeriodSeconds: 30
      tolerations:
      - key: namespace
        operator: Equal
        value: A
Did you try nodeSelector which will schedule the pod on the node that you attached the label to?
nodeSelector:
  namespace: A

Kubernetes pods scheduled to non-tainted node

I have created a GKE Kubernetes cluster with two workloads deployed on it, and there are separate node pools for each workload. The node pool for the celery workload is tainted with celery-node-pool=true.
The pod's spec has the following toleration:
tolerations:
- key: "celery-node-pool"
  operator: "Exists"
  effect: "NoSchedule"
Despite having the node taint and the toleration, some of the pods from the celery workload are deployed to non-tainted nodes. Why is this happening, and am I doing something wrong? What other taints and tolerations should I add to keep the pods on specific nodes?
Using Taints:
Taints allow a node to repel a set of pods. You have not specified the effect in the taint; it should be celery-node-pool=true:NoSchedule. Also, your other nodes need to repel these pods, so you need to add a different taint to the other nodes and not give the celery pods a toleration for it.
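As a rough sketch with hypothetical node names: give the celery pool a taint with an effect, and give the other pool a taint that the celery pods do not tolerate:
# celery pool: only pods tolerating this taint can land here
kubectl taint nodes celery-node-1 celery-node-pool=true:NoSchedule
# other pool: repels the celery pods, which have no toleration for this taint
kubectl taint nodes other-node-1 reserved-for=other-workload:NoSchedule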
Using Node Selector:
You can constrain a Pod to only be able to run on particular Node(s), or to prefer to run on particular nodes.
You can label the node
kubectl label nodes kubernetes-foo-node-1.c.a-robinson.internal node-pool=true
Add node selector in the pod spec:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    node-pool: "true"
Using Node Affinity
nodeSelector provides a very simple way to constrain pods to nodes with particular labels. The affinity/anti-affinity feature greatly expands the types of constraints you can express.
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node-pool
            operator: In
            values:
            - "true"
  containers:
  - name: with-node-affinity
    image: k8s.gcr.io/pause:2.0
What other taints and tolerations should I add to keep the pods on specific nodes?
You should also add a node selector to pin your pods to the tainted node; otherwise the pod is free to go to a non-tainted node if the scheduler so decides.
kubectl taint node node01 hostname=node01:NoSchedule
If I taint node01 and want my pods to be placed on it, they need a node selector in addition to the toleration.
nodeSelector provides a very simple way to constrain (affinity) pods to nodes with particular labels.
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  tolerations:
  - key: "hostname"
    operator: "Equal"
    value: "node01"
    effect: "NoSchedule"
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    kubernetes.io/hostname: node01

Is it possible to assign a pod of StatefulSet to a specific node of a Kubernetes cluster?

I have a 5 node cluster (1 master / 4 workers). Is it possible to configure a StatefulSet where I can make a pod (or pods) run on a given node, knowing it has sufficient capacity, rather than the Kubernetes scheduler making this decision?
Let's say my StatefulSet creates 4 pods (replicas: 4) as myapp-0, myapp-1, myapp-2 and myapp-3. Now what I am looking for is:
myapp-0 pod-- get scheduled over---> worker-1
myapp-1 pod-- get scheduled over---> worker-2
myapp-2 pod-- get scheduled over---> worker-3
myapp-3 pod-- get scheduled over---> worker-4
Please let me know if it can be achieved somehow. Because if I add a toleration to the pods of a StatefulSet, it will be the same for all the pods, and all of them could get scheduled onto a single node matching the taint.
Thanks, J
You can delegate responsibility for scheduling arbitrary subsets of pods to your own custom scheduler(s) that run(s) alongside, or instead of, the default Kubernetes scheduler.
You can write your own custom scheduler. A custom scheduler can be written in any language and can be as simple or complex as you need. Below is a very simple example of a custom scheduler written in Bash that assigns a node randomly. Note that you need to run this along with kubectl proxy for it to work.
SERVER='localhost:8001'
while true;
do
    for PODNAME in $(kubectl --server $SERVER get pods -o json | jq '.items[] | select(.spec.schedulerName == "my-scheduler") | select(.spec.nodeName == null) | .metadata.name' | tr -d '"');
    do
        NODES=($(kubectl --server $SERVER get nodes -o json | jq '.items[].metadata.name' | tr -d '"'))
        NUMNODES=${#NODES[@]}
        CHOSEN=${NODES[$[$RANDOM % $NUMNODES]]}
        curl --header "Content-Type:application/json" --request POST --data '{"apiVersion":"v1", "kind": "Binding", "metadata": {"name": "'$PODNAME'"}, "target": {"apiVersion": "v1", "kind": "Node", "name": "'$CHOSEN'"}}' http://$SERVER/api/v1/namespaces/default/pods/$PODNAME/binding/
        echo "Assigned $PODNAME to $CHOSEN"
    done
    sleep 1
done
Then, in your StatefulSet configuration file, under the pod spec section, you will have to add the schedulerName: your-scheduler line.
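As a minimal sketch, assuming the scheduler filters on the name my-scheduler as in the Bash example above, the relevant part of the StatefulSet would look like:
spec:
  template:
    spec:
      schedulerName: my-scheduler   # must match the name the custom scheduler selects on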
You can also use pod affinity:
Example:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-cache
spec:
  selector:
    matchLabels:
      app: store
  replicas: 3
  template:
    metadata:
      labels:
        app: store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: redis-server
        image: redis:3.2-alpine
The below YAML snippet of the web-server StatefulSet has podAntiAffinity and podAffinity configured. It informs the scheduler that all its replicas are to be co-located with pods that have the selector label app=store. It also ensures that no two web-server replicas are co-located on a single node.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web-server
spec:
  selector:
    matchLabels:
      app: web-store
  replicas: 3
  template:
    metadata:
      labels:
        app: web-store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-store
            topologyKey: "kubernetes.io/hostname"
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: web-app
        image: nginx:1.12-alpine
If we create the above two deployments, our three node cluster should look like below.
node-1 node-2 node-3
webserver-1 webserver-2 webserver-3
cache-1 cache-2 cache-3
The above example uses a podAntiAffinity rule with topologyKey: "kubernetes.io/hostname" to deploy the redis cluster so that no two instances are located on the same host.
You can also simply define the replicas of a specific pod and pin them in a particular pod configuration file, e.g. with nodeName:
nodeName is the simplest form of node selection constraint, but due to its limitations it is typically not used. nodeName is a field of PodSpec. If it is non-empty, the scheduler ignores the pod and the kubelet running on the named node tries to run the pod. Thus, if nodeName is provided in the PodSpec, it takes precedence over the above methods for node selection.
Here is an example of a pod config file using the nodeName field:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  nodeName: kube-worker-1
More information about schedulers: custom-scheduler.
Take a look at this article: assigning-pods-kubernetes.
You can use the following KubeMod ModRule:
apiVersion: api.kubemod.io/v1beta1
kind: ModRule
metadata:
  name: statefulset-pod-node-affinity
spec:
  type: Patch
  match:
    # Select pods named myapp-xxx.
    - select: '$.kind'
      matchValue: Pod
    - select: '$.metadata.name'
      matchRegex: myapp-.*
  patch:
    # Patch the selected pods such that their node affinity matches nodes that contain a label with the name of the pod.
    - op: add
      path: /spec/affinity/nodeAffinity/requiredDuringSchedulingIgnoredDuringExecution
      value: |-
        nodeSelectorTerms:
          - matchExpressions:
              - key: accept-pod/{{ .Target.metadata.name }}
                operator: In
                values:
                  - 'true'
The above ModRule will monitor for the creation of pods named myapp-* and will inject a nodeAffinity section into their resource manifest before they get deployed. This will instruct the scheduler to schedule the pod to a node which has a label accept-pod/<pod-name> set to true.
Then you can assign future pods to nodes by adding labels to the nodes:
kubectl label node worker-1 accept-pod/myapp-0=true
kubectl label node worker-2 accept-pod/myapp-1=true
kubectl label node worker-3 accept-pod/myapp-2=true
...
After the above ModRule is deployed, creating the StatefulSet will trigger the creation of its pods, which will be intercepted by the ModRule. The ModRule will dynamically inject the nodeAffinity section using the name of the pod.
If, later on, the StatefulSet is deleted, deploying it again will lead to the pods being scheduled on the same exact nodes as they were before.
You can do this using nodeSelector and node affinity (take a look at this guide https://kubernetes.io/docs/concepts/configuration/assign-pod-node/); either can be used to run pods on specific nodes. But if the node has taints (restrictions), then you need to add tolerations for those nodes (more can be found here https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/). Using this approach, you can specify a list of nodes to be used for your pod's scheduling. The catch is that if you specify, for example, 3 nodes and you have 5 pods, then you have no control over how many pods will run on each of those nodes; they get distributed as per kube-scheduler.
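A brief sketch of that combination, with placeholder node, label and taint names: label the nodes you want to use, reference the label from the pod spec, and add a toleration only if those nodes are also tainted.
kubectl label nodes worker-1 worker-2 worker-3 app-pool=myapp

apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
spec:
  nodeSelector:
    app-pool: myapp            # placeholder label applied above
  tolerations:                 # only needed if the selected nodes carry this (hypothetical) taint
  - key: "dedicated"
    operator: "Equal"
    value: "myapp"
    effect: "NoSchedule"
  containers:
  - name: myapp
    image: nginx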
Another relevant use case: if you want to run one pod on each of the specified nodes, you can create a DaemonSet and select the nodes using nodeSelector.
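A sketch of that DaemonSet variant, again with assumed names; only nodes carrying the app-pool=myapp label will run a copy:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: myapp-ds
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      nodeSelector:
        app-pool: myapp        # assumed label on the chosen nodes
      containers:
      - name: myapp
        image: nginx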
You can use podAntiAffinity to distribute replicas to different nodes.
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: "nginx"
  replicas: 4
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - nginx
            topologyKey: "kubernetes.io/hostname"
This would deploy web-0 on worker1, web-1 on worker2, web-2 on worker3 and web-3 on worker4.
Take a look at this guideline: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
However, what you are looking for is the nodeSelector directive, which should be placed in the pod spec.

In k8s, how to let the nodes choose by itself what kind of pods they would accept

I want one of my nodes to accept only certain kinds of pods.
So I wonder, is there a way to make one node only accept pods with some specific labels?
You have two options:
Node Affinity: a property of Pods which attracts them to a set of nodes.
Taints & Tolerations: taints are the opposite of node affinity; they allow a node to repel a set of Pods.
Using Node Affinity
You need to label your nodes:
kubectl label nodes node1 mylabel=specialpods
Then when you launch Pods specify the affinity:
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: mylabel
            operator: In
            values:
            - specialpods
  containers:
  - name: nginx-container
    image: nginx
Using Taints & Tolerations
Taints and tolerations work together: you taint a node and then specify the toleration for a pod; only those Pods whose toleration "matches" the taint will be scheduled on the node:
Taint: kubectl taint nodes node1 mytaint=specialpods:NoSchedule
Add the toleration in the Pod spec:
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  tolerations:
  - key: "mytaint"
    operator: "Equal"
    value: "specialpods"
    effect: "NoSchedule"
  containers:
  - name: nginx-container
    image: nginx