How to spread pods evenly among all zones in kubernetes?

I was trying to spread pods evenly across all zones, but couldn't make it work properly.
In my k8s cluster, nodes are spread across 3 AZs. Now suppose the minimum node count is 1 and there are 2 nodes at the moment, and the first one is completely full of pods. When I create a deployment (2 replicas) with a topology spread constraint set to ScheduleAnyway, then since the 2nd node has enough resources, both pods are deployed on that node. I don't want that. I tried changing the condition to DoNotSchedule, but since I only have 3 AZs, I am only able to schedule 3 pods, and it triggers a new node for all 3 pods. I want to make sure that the replicas are spread across all 3 AZs.
Here is a snippet from the deployment spec. Does anyone know the way out?
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchExpressions:
    - key: app
      operator: In
      values:
      - "my-app"

You need to tweak the maxSkew attribute.
Assign it a higher value.
Refer to the example given at:
https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/
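For example, a minimal sketch of the constraint from the question with maxSkew raised (the value 2 is purely illustrative; whether a higher maxSkew helps depends on your cluster):
topologySpreadConstraints:
- maxSkew: 2                                  # illustrative: allows up to 2 pods of imbalance between zones
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule            # or keep ScheduleAnyway for a soft constraint
  labelSelector:
    matchLabels:
      app: my-app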

If I understand your problem correctly, you can use a nodeAffinity rule and the maxSkew field.
Please take a look at my answer to the related question below, or at the copy of it here. In it, I have described how you can force your pods to split between nodes. All you need to do is set the key and value parameters in the matchExpressions section accordingly.
Additionally, you may find the requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution fields very useful.
Look at this example YAML file:
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        example: app
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - worker-1
            - worker-2
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50
        preference:
          matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - worker-1
Idea of this configuration: I'm using nodeAffinity here to indicate on which nodes the pods can be placed:
- key: kubernetes.io/hostname
and
values:
- worker-1
- worker-2
It is important to set the following line:
- maxSkew: 1
According to the documentation:
maxSkew describes the degree to which Pods may be unevenly distributed. It must be greater than zero.
Thanks to this, the difference in the number of assigned pods between nodes will always be at most 1.
This section:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 50
  preference:
    matchExpressions:
    - key: kubernetes.io/hostname
      operator: In
      values:
      - worker-1
is optional; however, it will allow you to fine-tune the pod distribution across the free nodes even better. Here you can find a description of the differences between requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution:
Thus an example of requiredDuringSchedulingIgnoredDuringExecution would be "only run the pod on nodes with Intel CPUs" and an example preferredDuringSchedulingIgnoredDuringExecution would be "try to run this set of pods in failure zone XYZ, but if it's not possible, then allow some to run elsewhere".
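Coming back to the zone spreading asked about in the original question, a rough sketch of the same idea using the zone topology key and the my-app label from the question might look like this (illustrative only, not a drop-in manifest):
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone  # spread by availability zone instead of hostname
    whenUnsatisfiable: DoNotSchedule          # hard constraint; may require the autoscaler to add a node in an empty zone
    labelSelector:
      matchLabels:
        app: my-app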

Related

Kubernetes how to spread pods to nodes but with preferredDuringSchedulingIgnoredDuringExecution

I want my api deployment's pods to be spread across all of the cluster's nodes.
So I came up with this:
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - api
        topologyKey: "kubernetes.io/hostname"
But this allows exactly one pod on each node, and no more.
My problem is that when I want to roll out an update, Kubernetes keeps the newly created pod in the Pending state.
How can I change requiredDuringSchedulingIgnoredDuringExecution to preferredDuringSchedulingIgnoredDuringExecution?
I have tried, but I got many errors, since preferredDuringSchedulingIgnoredDuringExecution apparently requires a different configuration from requiredDuringSchedulingIgnoredDuringExecution.
This is the right implementation:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - api
        topologyKey: kubernetes.io/hostname
This will spread the pods evenly across the nodes and will allow more than one on each node. So basically you can deploy 6 replicas to a cluster of 3 nodes without a problem. Also, you can roll out an update even though it creates an extra new pod before shutting down the old one.
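For completeness, a minimal sketch of how this might sit inside a Deployment, assuming the pods carry the app: api label from the question (the name and image are placeholders):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                        # illustrative name
spec:
  replicas: 6                      # more replicas than nodes is fine with a soft rule
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - api
              topologyKey: kubernetes.io/hostname
      containers:
      - name: api
        image: example/api:latest  # placeholder image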

Assigning Pods to different nodepools with NodeAffinity

I am trying to assign a cluster of pods to node pools, and I would like those node pools to scale based on the resources requested by the cluster pods. However, I'd like the pods to prefer the smaller of the node pools (worker) and ignore the larger nodes (lgworker), so as not to trigger a scale-up.
extraPodConfig:
  tolerations:
  - key: toleration_label
    value: worker
    operator: Equal
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: a_node_selector_label
            operator: In
            values:
            - worker
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: node_label
            operator: In
            values:
            - worker
      - weight: 90
        preference:
          matchExpressions:
          - key: node_label
            operator: In
            values:
            - lgworker
The cluster pods' default resource requests will fit easily on the smaller nodes, so I want those to be used first. The larger node pool should only be triggered when more resources are requested than would fit on the smaller node.
I have tried weighting the preferences; however, the default cluster pods are still being scheduled onto the larger node pool.
Is there something I am missing that would help me properly assign pods to the smaller nodes over the larger ones?
Using appropriate weighting helps to prefer the correct node; however, when enough Dask workers are requested, a number of those workers may end up on the lgworker nodes. The fix for this would be to update the kube-scheduler to consider 100% of nodes when scheduling. By default, the kube-scheduler considers only N% of nodes (dynamically determined) at a time when filtering and scoring (see the v1.21 kube-scheduler documentation).
NodeAffinity will only go so far, and because it does not guarantee that preferences are enforced, pods can still be scheduled on unpreferred nodes.
Node Affinity v1:
The scheduler will prefer to schedule pods to nodes that satisfy the affinity expressions specified by this field, but it may choose a node that violates one or more of the expressions. The node that is most preferred is the one with the greatest sum of weights, i.e. for each node that meets all of the scheduling requirements (resource request, requiredDuringScheduling affinity expressions, etc.), compute a sum by iterating through the elements of this field and adding "weight" to the sum if the node matches the corresponding matchExpressions; the node(s) with the highest sum are the most preferred.
extraPodConfig:
  tolerations:
  - key: node_toleration
    value: worker
    operator: Equal
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node_label
            operator: In
            values:
            - worker
            - lgworker
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: node_label
            operator: In
            values:
            - worker
      - weight: 1
        preference:
          matchExpressions:
          - key: node_label
            operator: In
            values:
            - lgworker
So, adjusting the kube-scheduler would involve updating its configuration:
Example
apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
algorithmSource:
  provider: DefaultProvider
...
percentageOfNodesToScore: 100
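How you hand this file to the scheduler depends on how the control plane is managed; on a kubeadm-style control plane it is typically passed via the scheduler's --config flag in the static Pod manifest. A rough sketch (paths are illustrative, and the file also needs to be mounted into the scheduler container):
# /etc/kubernetes/manifests/kube-scheduler.yaml (excerpt)
spec:
  containers:
  - name: kube-scheduler
    command:
    - kube-scheduler
    - --config=/etc/kubernetes/scheduler-config.yaml  # points at the KubeSchedulerConfiguration above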

Kubernetes anti-affinity rule to spread Deployment Pods to at least 2 nodes

I have the following anti-affinity rule configured in my k8s Deployment:
spec:
  ...
  selector:
    matchLabels:
      app: my-app
      environment: qa
  ...
  template:
    metadata:
      labels:
        app: my-app
        environment: qa
        version: v0
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - my-app
            topologyKey: kubernetes.io/hostname
With this I am saying that I do not want any Pod replica to be scheduled onto a node of my k8s cluster where a Pod of the same application is already present. So, for instance, having:
nodes(a,b,c) = 3
replicas(1,2,3) = 3
replica_1 scheduled in node_a, replica_2 scheduled in node_b and replica_3 scheduled in node_c
As such, each Pod is scheduled on a different node.
However, I was wondering if there is a way to specify "I want to spread my Pods across at least 2 nodes", to guarantee high availability without spreading all the Pods across different nodes, for example:
nodes(a,b,c) = 3
replicas(1,2,3) = 3
replica_1 scheduled in node_a, replica_2 scheduled in node_b and replica_3 scheduled (again) in node_a
So, to sum up, I would like to have a softer constraint that allows me to guarantee high availability by spreading the Deployment's replicas across at least 2 nodes, without having to launch a node for each Pod of a certain application.
Thanks!
I think I found a solution to your problem. Look at this example YAML file:
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        example: app
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - worker-1
            - worker-2
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50
        preference:
          matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - worker-1
Idea of this configuration:
I'm using nodeAffinity here to indicate on which nodes the pods can be placed:
- key: kubernetes.io/hostname
and
values:
- worker-1
- worker-2
It is important to set the following line:
- maxSkew: 1
According to the documentation:
maxSkew describes the degree to which Pods may be unevenly distributed. It must be greater than zero.
Thanks to this, the difference in the number of assigned pods between nodes will always be at most 1.
This section:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 50
  preference:
    matchExpressions:
    - key: kubernetes.io/hostname
      operator: In
      values:
      - worker-1
is optional; however, it will allow you to fine-tune the pod distribution across the free nodes even better. Here you can find a description of the differences between requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution:
Thus an example of requiredDuringSchedulingIgnoredDuringExecution would be "only run the pod on nodes with Intel CPUs" and an example preferredDuringSchedulingIgnoredDuringExecution would be "try to run this set of pods in failure zone XYZ, but if it's not possible, then allow some to run elsewhere".

How does weight affect pod scheduling when affinity rules are set?

Background:
While performance testing an application, I was getting inconsistent results when scaling the replicas for my php-fpm containers, and I realized that 3 out of 4 pods were scheduled on the same node.
I then configured anti-affinity rules so that pods would not be scheduled on the same node. I quickly realized that using requiredDuringSchedulingIgnoredDuringExecution was not an option, because I could not have more replicas than nodes, so I configured preferredDuringSchedulingIgnoredDuringExecution.
For the most part, it looks like my pods are scheduled evenly across all my nodes; however, sometimes (as seen during a rolling upgrade) I see pods on the same node. I feel like the weight value, which is currently set to 100, is playing a part.
Here is the YAML I am using (Helm):
{{- if .Values.podAntiAffinity }}
{{- if .Values.podAntiAffinity.enabled }}
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: "{{ .Values.deploymentName }}"
        topologyKey: "kubernetes.io/hostname"
{{- end }}
{{- end }}
Questions:
The way I read the documentation, the weight number is added to a calculated score for the node based on how busy it is (simplified); however, what I don't understand is how a weight of 1 vs. 100 would make any difference.
Why are pods sometimes scheduled on the same node with this rule? Is it because the total score for the node that the pod wasn't scheduled on is too low (as it is too busy)?
Is there a way to see a log/event of how the pod was scheduled on a particular node? I'd expect kubectl describe pod to have those details but seemingly it does not (except in an error scenario).
preferredDuringSchedulingIgnoredDuringExecution is not guaranteed.
two types of node affinity, called requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution. You can think of them as “hard” and “soft” respectively, in the sense that the former specifies rules that must be met for a pod to be scheduled onto a node (just like nodeSelector but using a more expressive syntax), while the latter specifies preferences that the scheduler will try to enforce but will not guarantee.
The weight you set gives an edge, but there are other parameters (set by the user and by Kubernetes) with their own weights. The example below should give a better picture of how the weight you set matters:
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - preference:
        matchExpressions:
        - key: example.com/myLabel
          operator: In
          values:
          - a
      weight: 40
    - preference:
        matchExpressions:
        - key: example.com/myLabel
          operator: In
          values:
          - b
      weight: 35
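As a rough illustration of how these preference weights interact with the rest of the node score (all numbers below are hypothetical; the real scheduler also normalizes and weights each scoring plugin's result):
# Hypothetical per-node scoring for one pod:
#   node-1: example.com/myLabel=a  -> +40 from the affinity preference
#           other scoring plugins (resource balance, image locality, ...) -> +50
#           total: 90
#   node-2: example.com/myLabel=b  -> +35 from the affinity preference
#           other scoring plugins -> +70
#           total: 105   <- wins despite the lower affinity weight
This is also why, with a soft rule, pods can occasionally land on the same node: a preferred node that is much busier can still lose the overall score.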

What is the recommended way to deploy kafka to make it deploy in all the available nodes?

I have 3 nodes in k8s and I'm running Kafka (a cluster of 3).
While deploying zk/broker/rest-proxy, they are not getting deployed across all the available nodes. How can I make sure that all pods are deployed on different nodes? Do I need to use node affinity or pod affinity?
If you want all pods to run on different nodes, you must use podAntiAffinity. If this is a hard requirement, use a requiredDuringSchedulingIgnoredDuringExecution rule. If it's not, use preferredDuringSchedulingIgnoredDuringExecution.
topologyKey should be kubernetes.io/hostname.
In the labelSelector, put your pod's labels.
I recommend using soft anti-affinity, which will look like this:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - <your app label>
        topologyKey: kubernetes.io/hostname
      weight: 100
Here I explained the difference between anti-affinity types with examples applied to a live cluster:
https://blog.verygoodsecurity.com/posts/kubernetes-multi-az-deployments-using-pod-anti-affinity/