I want the pods of my API deployment to be spread across all of the cluster's nodes.
So I came up with this:
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - api
        topologyKey: "kubernetes.io/hostname"
But this allows at most one pod per node.
My problem is that when I roll out an update, Kubernetes leaves the newly created pod in the "Pending" state.
How can I change requiredDuringSchedulingIgnoredDuringExecution to preferredDuringSchedulingIgnoredDuringExecution?
I have tried, but I got many errors, probably because preferredDuringSchedulingIgnoredDuringExecution requires a different configuration structure than requiredDuringSchedulingIgnoredDuringExecution.
This is the right implementation:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - api
        topologyKey: kubernetes.io/hostname
With this rule the scheduler tries to spread the pods across the nodes but still allows more than one pod per node. So you can deploy 6 replicas to a cluster of 3 nodes without a problem. You can also roll out an update, even though a rollout creates an extra pod before shutting down the old one.
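For context, here is a minimal sketch of how the rule above might sit inside a complete Deployment. The Deployment name, the image, and the maxSurge/maxUnavailable values are assumptions for illustration and do not come from the question.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                          # hypothetical name
spec:
  replicas: 6                        # more replicas than nodes is fine with a preferred rule
  selector:
    matchLabels:
      app: api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                    # assumed value: one extra pod during a rollout
      maxUnavailable: 0              # assumed value
  template:
    metadata:
      labels:
        app: api                     # must match the labelSelector in the anti-affinity term
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - api
              topologyKey: kubernetes.io/hostname
      containers:
      - name: api
        image: example.com/api:1.0   # placeholder image
```

Because the rule is only a preference, the surge pod created during a rollout can still be scheduled when every node already runs a matching pod.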
I'm running a managed Kubernetes cluster in GCP with 2 node pools, one on regular VMs and one on spot VMs. Autoscaling is configured for both of them.
Currently I'm running batch jobs and async tasks on spot VMs and web apps on regular VMs, but to reduce costs I'd like to move most of the web app pods to spot VMs. I usually have 3-5 pods of each app running, so I'd like to leave 1 on regular VMs and move the other 2-4 to spot.
I found the nodeAffinity and podAffinity settings and set a preferred pod placement using preferredDuringSchedulingIgnoredDuringExecution with a spot VM node selector, but now all my pods have moved to spot VMs.
Try something like:
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/type
            operator: In
            values:
            - regular
            - spot/preemptible
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - app-label
          topologyKey: spot-node-label
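The label key and values in the snippet above read as placeholders. As a sketch of one way to fill them in, assume both node pools carry a custom node label `pool-type` with the values `regular` and `spot`, and that the web app pods are labelled `app: web`. All three names are hypothetical.

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: pool-type             # hypothetical node label set on both pools
          operator: In
          values:
          - regular
          - spot
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - web                    # hypothetical pod label
        topologyKey: pool-type       # each pool is treated as one topology domain
```

Since the anti-affinity is only preferred, this nudges replicas apart across the two pools but does not guarantee a particular split, such as exactly one pod on regular VMs and the rest on spot.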
I have a nodepool in AKS with autoscaling enabled. Here is a simple scenario I am trying to achieve.
I have 1 node running a single pod (A) with label deployment=x. I have another pod (B) with a podAntiAffinity rule to avoid nodes running pods with deployment=x label.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: deployment
          operator: In
          values:
          - x
      topologyKey: kubernetes.io/hostname
The behavior I am seeing is that Pod B gets scheduled to the same node as Pod A. I wanted Pod B to stay in the Pending state until the autoscaler added a new node that satisfies the podAntiAffinity rule, and then be scheduled to that new node. Is this possible?
Kubernetes Version: 1.22.6
-- EDIT --
This example does trigger a node scale-up, so it is doing what I expect and works for me.
The podAntiAffinity example I posted works as expected.
You should change the key used in the selector. The labelSelector matches labels on pods, not on Deployment objects or nodes, so the key has to be a label that the other pods actually carry. Try this example:
podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchExpressions:
      - key: app.kubernetes.io/instance
        operator: In
        values:
        - name-of-deployment
    topologyKey: "kubernetes.io/hostname"
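Whichever key is used, the selector only matches if the other workload's pod template carries that label. A minimal sketch of Pod A's side for the question's `deployment: x` rule, assuming Pod A is created by a Deployment; the name `deployment-x` and the image are hypothetical.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-x                 # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      deployment: x
  template:
    metadata:
      labels:
        deployment: x                # the pod label that Pod B's anti-affinity rule selects
    spec:
      containers:
      - name: main
        image: example.com/app-a:1.0 # placeholder image
```

A label placed only in the Deployment's own `metadata.labels` would not appear on the pods, so the anti-affinity term would never match it.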
I was trying to spread pods evenly across all zones, but couldn't make it work properly.
In my k8s cluster, nodes are spread across 3 AZs. Suppose the minimum node count is 1 and there are currently 2 nodes, the first of which is completely full of pods. When I create a deployment (2 replicas) with topology spread constraints set to ScheduleAnyway, both pods are deployed to the 2nd node because it has enough resources. I don't want that. I tried changing the condition to DoNotSchedule, but since I have only 3 AZs, I am only able to schedule 3 pods, and it triggers a new node for all 3 pods. I want to make sure that replicas are spread across all 3 AZs.
Here is a snippet from the deployment spec. Does anyone know a way out?
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchExpressions:
    - key: app
      operator: In
      values:
      - "my-app"
You need to tweak the maxSkew attribute.
Assign it a higher value.
Refer to the example given at:
https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/
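A sketch of what raising maxSkew might look like in the question's constraint. The value 2 is an assumption chosen for illustration, and DoNotSchedule is used here because maxSkew is only strictly enforced in that mode. With DoNotSchedule, a pod can be placed in a zone only if that zone's count of matching pods, after placement, exceeds the smallest zone's count by no more than maxSkew.

```yaml
topologySpreadConstraints:
- maxSkew: 2                         # assumed value; the question used 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchExpressions:
    - key: app
      operator: In
      values:
      - "my-app"
```

A larger maxSkew loosens the constraint, so more pods can be scheduled before a new node is needed, at the cost of a less even distribution.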
If I understand your problem correctly, you can use a node affinity rule together with the maxSkew field.
Please take a look at my answer to a similar question, reproduced below. In it, I describe how you can force your pods to be split between nodes. All you need to do is set the key and values parameters in the matchExpressions section accordingly.
Additionally, you may find the requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution fields very useful.
Look at this example YAML file:
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        example: app
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - worker-1
            - worker-2
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50
        preference:
          matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - worker-1
The idea of this configuration: I'm using nodeAffinity here to indicate which nodes the pod can be placed on:
- key: kubernetes.io/hostname
and
  values:
  - worker-1
  - worker-2
It is important to set the following line:
- maxSkew: 1
According to the documentation:
maxSkew describes the degree to which Pods may be unevenly distributed. It must be greater than zero.
Thanks to this, the difference in the number of pods assigned to each node will never exceed 1.
This section:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 50
  preference:
    matchExpressions:
    - key: kubernetes.io/hostname
      operator: In
      values:
      - worker-1
is optional. However, it lets you fine-tune how pods are distributed across the available nodes. Here is a description of the difference between requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution:
Thus an example of requiredDuringSchedulingIgnoredDuringExecution would be "only run the pod on nodes with Intel CPUs" and an example preferredDuringSchedulingIgnoredDuringExecution would be "try to run this set of pods in failure zone XYZ, but if it's not possible, then allow some to run elsewhere".
I have the following anti-affinity rule configured in my k8s Deployment:
spec:
  ...
  selector:
    matchLabels:
      app: my-app
      environment: qa
  ...
  template:
    metadata:
      labels:
        app: my-app
        environment: qa
        version: v0
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - my-app
            topologyKey: kubernetes.io/hostname
With this rule I say that no Pod replica should be scheduled on a node of my k8s cluster that already runs a Pod of the same application. So, for instance, given:
nodes(a,b,c) = 3
replicas(1,2,3) = 3
replica_1 scheduled in node_a, replica_2 scheduled in node_b and replica_3 scheduled in node_c
As such, each Pod is scheduled on a different node.
However, I was wondering if there is a way to specify "I want to spread my Pods across at least 2 nodes" to guarantee high availability without spreading every Pod to its own node, for example:
nodes(a,b,c) = 3
replicas(1,2,3) = 3
replica_1 scheduled in node_a, replica_2 scheduled in node_b and replica_3 scheduled (again) in node_a
So, to sum up, I would like a softer constraint that guarantees high availability by spreading the Deployment's replicas across at least 2 nodes, without having to launch a node for each Pod of a given application.
Thanks!
I think I found a solution to your problem. Look at this example YAML file:
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        example: app
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - worker-1
            - worker-2
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50
        preference:
          matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - worker-1
The idea of this configuration:
I'm using nodeAffinity here to indicate which nodes the pod can be placed on:
- key: kubernetes.io/hostname
and
  values:
  - worker-1
  - worker-2
It is important to set the following line:
- maxSkew: 1
According to the documentation:
maxSkew describes the degree to which Pods may be unevenly distributed. It must be greater than zero.
Thanks to this, the difference in the number of pods assigned to each node will never exceed 1.
This section:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 50
  preference:
    matchExpressions:
    - key: kubernetes.io/hostname
      operator: In
      values:
      - worker-1
is optional. However, it lets you fine-tune how pods are distributed across the available nodes. Here is a description of the difference between requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution:
Thus an example of requiredDuringSchedulingIgnoredDuringExecution would be "only run the pod on nodes with Intel CPUs" and an example preferredDuringSchedulingIgnoredDuringExecution would be "try to run this set of pods in failure zone XYZ, but if it's not possible, then allow some to run elsewhere".
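To make the maxSkew reasoning concrete, here is a worked sketch for a Deployment with 3 replicas, assuming only worker-1 and worker-2 are eligible (as restricted by the nodeAffinity rule above) and both start with no matching pods. Skew is taken as the pod count on the candidate node after placement minus the lowest count among eligible nodes.

```yaml
# Eligible nodes: worker-1, worker-2 (assumed empty of matching pods at the start)
#
# placement (worker-1 / worker-2)   skew   allowed with maxSkew: 1?
#   1 / 0                            1      yes
#   2 / 0                            2      no, the second pod must go to worker-2
#   1 / 1                            0      yes
#   2 / 1                            1      yes
#   3 / 0                            3      no
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      example: app
```

Under these assumptions the three replicas end up split 2/1 across the two nodes, which is the "at least 2 nodes" outcome the question asks for.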
I have a multizone (3 zones) GKE cluster (1.10.7-gke.1) of 6 nodes and want each zone to have at least one replica of my application.
So I've tried preferred podAntiAffinity:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: component
            operator: In
            values:
            - app
        topologyKey: failure-domain.beta.kubernetes.io/zone
Everything looks good the first time I install my application (scaling from 1 to 3 replicas). After the next rolling update, everything gets mixed up and I can end up with 3 copies of my application in one zone, because additional replicas are created while the old ones are being terminated.
When I try the same term with requiredDuringSchedulingIgnoredDuringExecution, everything looks good, but rolling updates don't work because new replicas can't be scheduled (pods with "component" = "app" already exist in each zone).
How can I configure my deployment to be sure I have a replica in each availability zone?
UPDATED:
My workaround now is to have hard anti-affinity and deny additional pods (more than 3) during the rolling update:
replicaCount: 3
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: component
          operator: In
          values:
          - app
      topologyKey: failure-domain.beta.kubernetes.io/zone
deploymentStrategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 0
    maxUnavailable: 1
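The keys replicaCount and deploymentStrategy above look like chart values rather than Deployment fields (an assumption; the question does not say). In a plain apps/v1 Deployment the same settings would sit under spec.replicas and spec.strategy, roughly as in this sketch. The Deployment name and the image are hypothetical.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app                          # hypothetical name
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0                    # never create a 4th pod during a rollout
      maxUnavailable: 1              # remove one old pod first, freeing its zone
  selector:
    matchLabels:
      component: app
  template:
    metadata:
      labels:
        component: app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: component
                operator: In
                values:
                - app
            topologyKey: failure-domain.beta.kubernetes.io/zone
      containers:
      - name: app
        image: example.com/app:1.0   # placeholder image
```

With maxSurge set to 0, each replacement pod is created only after an old pod has been removed, so a zone is free for it. The trade-off is that capacity drops to 2 replicas while each pod is replaced.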
I don't think the Kubernetes scheduler provides a way to guarantee pods in all availability zones. I believe it's a best-effort approach when it comes to that, and there may be some limitations.
I've opened an issue to check whether this can be supported through either NodeAffinity or PodAffinity/PodAntiAffinity.
If you have two nodes in each zone, you can use the affinity rules below to make sure rolling updates work and that you still have a pod in each zone.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: component
          operator: In
          values:
          - app
      topologyKey: "kubernetes.io/hostname"
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: component
            operator: In
            values:
            - app
        topologyKey: failure-domain.beta.kubernetes.io/zone
The key issue here is the rolling update: during a rolling update, the old replica is kept until the new one is launched, but the new one can't be scheduled because it conflicts with the old replica.
So if rolling updates aren't a concern, a workaround is to change the strategy type to Recreate:
apiVersion: apps/v1
kind: Deployment
...
spec:
  ...
  strategy:
    type: Recreate
  ...
Then applying podAntiAffinity/requiredDuringSchedulingIgnoredDuringExecution rules would work.