I have a pod that has both node affinity and pod affinity. Could someone help me understand how things would behave in such a scenario?
Node 1:
  label:
    schedule-on: gpu
Node 2:
  label:
    schedule-on: gpu
Node 3:
  label:
    schedule-on: non-gpu
Manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/name: test
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app.kubernetes.io/name: test
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: schedule-on
                operator: In
                values:
                - gpu
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                - test
            topologyKey: schedule-on
The output of the above is:
Pods are getting scheduled on different nodes [node1, node2].
Ideal output: pods need to be scheduled on the same node [node1].
Here is my finding.
Finding 1: I believe node affinity is taking precedence and pod affinity is getting ignored
It's the union of node affinity and pod affinity. Since both pods share the same topology key domain, they end up in the same co-location, so the pods can be scheduled on different nodes while still being in the same co-location.
When matching the topology key and placing the pod, the value of the key is also considered.
I would like to add some definitions to your answer.
First of all, from this article:
Node affinity is a set of rules. It is used by the scheduler to decide where a pod can be placed in the cluster. The rules are defined using labels on nodes and label selectors specified in pods definition. Node affinity allows a pod to specify an affinity towards a group of nodes it can be scheduled on.
And from here:
Pod affinity can tell the scheduler to locate a new pod on the same node as other pods if the label selector on the new pod matches the label on the current pod.
In your scenario, as you wrote, both pods have the same topology key domain. This fact places them in the same co-location, so the pods can be scheduled on different nodes but within the same co-location.
Here are also some words about topologyKey:
TopologyKey is the key of node labels. If two Nodes are labelled with this key and have identical values for that label, the scheduler treats both Nodes as being in the same topology. The scheduler tries to place a balanced number of Pods into each topology domain.
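If the goal is to co-locate both replicas on one single node, the topologyKey in the podAffinity term has to identify an individual node rather than the shared schedule-on group. A minimal sketch, keeping the nodeAffinity from the manifest above unchanged and only swapping the podAffinity topologyKey for the standard per-node kubernetes.io/hostname label:

podAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchExpressions:
      - key: app.kubernetes.io/name
        operator: In
        values:
        - test
    # kubernetes.io/hostname is unique per node, so "same topology domain"
    # now means "same node" instead of "same schedule-on group"
    topologyKey: kubernetes.io/hostname

With this, node affinity still restricts scheduling to the gpu-labelled nodes, and pod affinity keeps all replicas of test on whichever gpu node receives the first replica.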
Related
I have 4 nodes in different zones:
Node A: zone a
Node B: zone b
Node C: zone c
Node D: zone c
I want to spread the pods across nodes A, B and C. I have a Deployment with 3 replicas to spread across those nodes, one pod per node. My deployments use Kustomize and ArgoCD to deploy. To use topologySpreadConstraints the labels would need to be updated, but the labels are immutable in this case.
Current deployment configuration:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-apps
spec:
  replicas: 3
  revisionHistoryLimit: 0
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: app
                operator: In
                values:
                - my-apps
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: my-apps
            version: v1
...
I've added the label to those 3 nodes and this configuration worked well the first time. But when it comes to updating the deployment with a rolling update, the pods on the nodes become imbalanced:
zone a: 2 pods
zone b: 1 pod
zone c: 0 pods
I've played around with podAntiAffinity, but the pods end up Pending if I use hard anti-affinity and are still imbalanced if I use soft anti-affinity. Any suggestions for best practice in this case? Did I miss something?
Rolling updates cause some pods to be removed and others to be added. This can leave the pods imbalanced across nodes, because the pods that were already running stay where they are, while the pods added during the update may land elsewhere. To limit this, use the maxUnavailable option in the rolling update strategy. It specifies the maximum number of pods that can be unavailable during the rolling update, so pods are replaced gradually and the distribution has a better chance of staying balanced.
kubectl apply has no flags for the update strategy; set maxUnavailable under spec.strategy.rollingUpdate in deployment.yaml and apply the manifest as usual:
kubectl apply -f deployment.yaml
This creates or updates the deployment with the rolling update strategy and maxUnavailable set to 1, ensuring that no more than 1 pod is taken down at a time during the rolling update and keeping the pods across nodes more balanced. Try it and let me know if this works.
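For reference, a minimal sketch of that strategy block in deployment.yaml (maxSurge: 0 is my own assumption here, added so the update does not create an extra pod in an already-full zone before an old one is removed):

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # replace at most one pod at a time
      maxSurge: 0         # assumption: no surge pods during the update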
If you are scaling down the pods, note this limitation from the official docs:
There's no guarantee that the constraints remain satisfied when Pods are removed. For example, scaling down a Deployment may result in an imbalanced Pods distribution. You can use a tool such as the Descheduler to rebalance the Pods distribution.
I have an EKS node group with 2 nodes for compute workloads. I use a taint on these nodes and tolerations in the deployment. I have a deployment with 2 replicas, and I want these two pods to be spread across the two nodes, one pod on each node.
I tried using:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 1
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - appname
Each pod is placed on a separate node, but if I update the deployment, for example by changing its image name, it fails to schedule a new pod.
I also tried:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: type
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      type: compute
but the pods aren't spread evenly; for example, 2 pods end up on one node.
Try adding:
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
By default Kubernetes tries to scale the new ReplicaSet up before it starts scaling down the old replicas. Since it cannot schedule the new replicas (because of the anti-affinity), they get stuck in the Pending state.
Once you set the deployment's maxSurge=0, you tell Kubernetes that you don't want the deployment to scale up first during an update; as a result, it can only scale down, which makes room for the new replicas to be scheduled.
Setting maxUnavailable=1 tells Kubernetes to replace only one pod at a time.
You can use a DaemonSet instead of a Deployment. A DaemonSet ensures that all (or some) Nodes run a copy of a Pod. As nodes are added to the cluster, Pods are added to them. As nodes are removed from the cluster, those Pods are garbage collected. Deleting a DaemonSet will clean up the Pods it created.
See the documentation for DaemonSet.
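A minimal sketch of what such a DaemonSet could look like for this scenario; the taint key/value (compute=true:NoSchedule), the node label, the image and the app label are assumptions, since they aren't shown in the question:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: appname
spec:
  selector:
    matchLabels:
      app: appname
  template:
    metadata:
      labels:
        app: appname
    spec:
      # assumption: the compute nodes are tainted compute=true:NoSchedule
      tolerations:
      - key: compute
        operator: Equal
        value: "true"
        effect: NoSchedule
      # assumption: the compute nodes carry a type=compute node label,
      # so the DaemonSet only runs on the 2 compute nodes
      nodeSelector:
        type: compute
      containers:
      - name: appname
        image: appname:latest   # placeholder image

Because a DaemonSet runs exactly one pod per eligible node, a rolling update replaces the pod on each node in place, so the scheduling problem during rollouts goes away.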
I was having the same problem with pods failing to schedule and getting stuck in the Pending state while rolling out new versions, while my goal was to run exactly 3 pods at all times, 1 on each of the 3 available nodes.
That means I could not use maxUnavailable: 1, because that would temporarily result in fewer than 3 pods during the rollout.
Instead of using the app name label for matching the anti-affinity, I ended up using a label with a random value ("version") on each deployment. This means new deployments will happily schedule pods to nodes where a previous version is still running, but the new versions will always be spread evenly.
Something like this:
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 3
  template:
    metadata:
      labels:
        deploymentVersion: v1
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: deploymentVersion
                operator: In
                values:
                - v1
            topologyKey: "kubernetes.io/hostname"
v1 can be anything that's a valid label value and changes on every deployment attempt.
I'm using envsubst to have dynamic variables in yaml files:
DEPLOYMENT_VERSION=$(date +%s) envsubst < deploy.yaml | kubectl apply -f -
And then the config looks like this:
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 3
  template:
    metadata:
      labels:
        deploymentVersion: v${DEPLOYMENT_VERSION}
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: deploymentVersion
                operator: In
                values:
                - v${DEPLOYMENT_VERSION}
            topologyKey: "kubernetes.io/hostname"
I wish Kubernetes offered a more straightforward way to achieve this.
We have 1000 store nodes and need to deploy an application image on every Kubernetes node, rolling it out in the order below, and we would like to specify the target node details during the deployment. Is there a way to specify node details on the command line when we execute kubectl create or kubectl apply deployment commands?
This application image would be configured with store/node-specific details during container/pod creation.
1 node on day 1,
10 nodes on day 2,
100 nodes on day 3, etc.
Answering the question from the title:
How to do controlled rollout using Kubernetes deployment
You can create a Deployment that will have specific fields in its manifest that will configure the way Kubernetes handles it.
With fields like podAntiAffinity and requiredDuringSchedulingIgnoredDuringExecution you can ensure that Kubernetes distributes the Pods equally across the cluster Nodes. You can read more about it in the documentation below:
Kubernetes.io: Docs: Concepts: Scheduling eviction: Assign Pod to Node
Having in mind the following rollout schedule:
Day 1: 1 replica
Day 2: 10 replicas
Day 3: 100 replicas
Day 4: 1000 replicas
You could use CI/CD tools (for example Jenkins) to roll out (change) the number of replicas of your Deployment on a specific schedule.
You could create a Jenkins pipeline with a deploy stage where you put your own command with its scheduler (or delay).
An example of such a Deployment that could be used with Jenkins is the following:
cat << EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: ${REPLICAS_COUNT}
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - nginx
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: nginx
        image: nginx
EOF
This Deployment will assign Pods to Nodes that aren't already running a replica of this Deployment (i.e. 1 Pod = 1 Node). If the number of Pods exceeds the number of Nodes, the extra Pods will remain in the Pending state.
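As a usage sketch (my own addition, not from the original answer), the pipeline's deploy stage could simply re-run the snippet above with a different REPLICAS_COUNT each day, or scale the already-created Deployment directly:

# day 2 of the schedule, for example
kubectl scale deployment nginx --replicas=10
# wait until the new replicas are scheduled and ready
kubectl rollout status deployment nginx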
Additional resources:
Jenkins.io: Doc: Pipeline: Tour: Environment
Kubernetes.io: Docs: Concepts: Workloads: Controllers: Deployment
Goal: Have one pod (namely 'log-scraper') get scheduled on every node at least once, but no more than once.
Assume a cluster has the following nodes:
Nodes
master/control-plane
worker-1
worker-2
worker-2
Pod I'm working with
apiVersion: v1
kind: Pod
metadata:
  name: log-scraper
spec:
  volumes:
  - name: container-log-dir
    hostPath:
      path: /var/log/containers
  containers:
  - image: "logScraper:latest"
    name: log-munger
    volumeMounts:
    - name: container-log-dir
      mountPath: /var/log/logging-app
Adding affinity to select only 'worker' nodes (or non-master nodes):
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: "worker"
          operator: In
          values:
          - "true"
Question 1: How do I ensure every node runs ONE-AND-ONLY-ONE pod of type log-scraper?
Question 2: What other manifests should be applied/added to achieve this?
You should probably use a DaemonSet, which is made exactly for this purpose of scheduling one pod per node; pods are automatically added to new nodes, for example when the cluster autoscales.
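A minimal sketch of what that could look like for the Pod above; the app: log-scraper label is an assumption added so the DaemonSet selector has something to match, and the node affinity from the question is carried over to keep it off the control-plane node:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-scraper
spec:
  selector:
    matchLabels:
      app: log-scraper          # assumed label, not in the original Pod manifest
  template:
    metadata:
      labels:
        app: log-scraper
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: "worker"
                operator: In
                values:
                - "true"
      volumes:
      - name: container-log-dir
        hostPath:
          path: /var/log/containers
      containers:
      - image: "logScraper:latest"
        name: log-munger
        volumeMounts:
        - name: container-log-dir
          mountPath: /var/log/logging-app

The DaemonSet controller creates exactly one such Pod per matching node, which answers Question 1, and no additional manifests are needed for Question 2.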
Concept
There are two important things when it comes to assigning Pods to Nodes - "Affinity" and "AntiAffinity".
Affinity will basically select based on given criteria while anti-affinity will avoid based on given criteria.
With affinity and anti-affinity, you can use operators like In, NotIn, Exists, DoesNotExist, Gt and Lt. When you use NotIn or DoesNotExist, it becomes anti-affinity.
Now, in Affinity/Antiaffinity, you have 2 choices - Node affinity/antiaffinity and Inter-pod affinity/antiaffinity
Node affinity/antiaffinity
Node affinity is conceptually similar to nodeSelector -- it allows you to constrain which nodes your pod is eligible to be scheduled on, based on labels on the node.
Inter-pod affinity/antiaffinity
Inter-pod affinity and anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled based on labels on pods that are already running on the node rather than based on labels on nodes.
Your Solution
Basically what you need is anti-affinity, and specifically pod anti-affinity rather than node anti-affinity. So your solution should look something like the snippet below (please note that since I do not have a 3-node cluster I couldn't test this, so there is a small chance you might have to make minor code adjustments):
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app                # assumes the log-scraper Pods carry an app: log-scraper label
          operator: In
          values:
          - log-scraper
      topologyKey: "kubernetes.io/hostname"   # at most one matching Pod per node
Read more over here, and especially go through example over here.
Using Pod Topology Spread Constraints
Another way to do it is using Pod Topology Spread Constraints.
You set up taints and tolerations as usual to control which nodes the pods can be scheduled on. Then add some labels to the pod; I will use the pod label id: foo-bar in the example. Then, to allow only a single pod from a ReplicaSet, Deployment, or other controller to be scheduled per node, add the following to the pod spec.
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      id: foo-bar
topologyKey is the key of a node label. kubernetes.io/hostname is a label set on every node by default. Put the pod labels inside matchLabels. Create the resources and the kube-scheduler should schedule a single pod with the matching labels per node.
To learn more, check out the documentation here and also this excellent blog post.
Is there any way I can tell Kubernetes how to schedule the replicas in a StatefulSet? For example, I have nodes divided into 3 different availability zones (AZs) and I have labeled these nodes accordingly. Now I want Kubernetes to put 1 replica in each AZ based on the node label. Thanks.
Pods will always try to be scheduled across different nodes; to achieve what you are looking for you can try to use a DaemonSet, which allows only one pod of this kind on each node.
Also, you can use anti-affinity based on the pods already scheduled on a node.
https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
The feature you are looking for is called Pod Anti-Affinity and can be specified as such:
apiVersion: apps/v1
kind: StatefulSet
[..]
spec:
  template:
    spec:
      affinity:
        nodeAffinity: {}
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: myapp
              topologyKey: az
            weight: 100
[..]
Since Kubernetes 1.18 there are also Pod Topology Spread Constraints, which are a nicer way to specify these anti-affinity rules; a sketch follows below.
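A minimal sketch of the equivalent spread constraint, assuming the nodes carry the well-known topology.kubernetes.io/zone label and the StatefulSet's Pods are labelled app: myapp (both assumptions, since the question only says the nodes are labelled accordingly):

spec:
  template:
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone   # or whatever zone label the nodes actually carry
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: myapp

With 3 replicas and 3 zones, maxSkew: 1 combined with DoNotSchedule should place one replica in each zone when the pods are first scheduled.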