Pod affinity rule to schedule pods across all nodes - Kubernetes

We are running a 6-node K8s cluster. Two of the six nodes run RabbitMQ, Redis & Prometheus; we used a node selector and cordoned those nodes so that no other pods get scheduled on them.
The application pods run on the remaining 4 nodes; we have around 18-19 microservices.
For GKE there is one open issue in the official K8s repo regarding automatic scale-down: https://github.com/kubernetes/kubernetes/issues/69696#issuecomment-651741837. People there suggest setting a PDB, and we tested that on Dev/Staging.
What we are looking for now is to pin pods to a particular node pool which does not scale, as we are running single replicas of some services.
As of now, we are thinking of applying affinity to the services which run with a single replica and have no requirement to scale,
while for scalable services we won't specify any rule, so by default the K8s scheduler will schedule their pods across different nodes; this way, if any node scales down, we don't face downtime for a service with a single running replica.
Affinity example:
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: do-not-scale
          operator: In
          values:
          - 'true'
We are planning to use affinity type preferredDuringSchedulingIgnoredDuringExecution instead of requiredDuringSchedulingIgnoredDuringExecution.
Note: Here K8s does not create a new replica on another node first during a node drain (scale-down of any node), as we are running single replicas with a rolling update & minAvailable: 25% strategy.
Why: If a PodDisruptionBudget is not specified and we have a deployment with one replica, the pod will be terminated and then a new pod will be scheduled on a new node.
To make sure the application stays available during the node-draining process, we have to specify a PodDisruptionBudget and create more replicas. If we have 1 pod with minAvailable: 30%, it will refuse to drain the node (scale down).
Please point out any mistakes you see and suggest a better option.

First of all, defining a PodDisruptionBudget does not make much sense when you have only one replica. minAvailable expressed as a percentage is rounded up to an integer, as it represents the minimum number of Pods which need to be available all the time.
Keep in mind that you have no guarantee of any high availability when launching only one-replica Deployments.
Why: If PodDisruptionBudget is not specified and we have a deployment with one replica, the pod will be terminated and then a new pod will be scheduled on a new node.
If you didn't explicitly define the value of maxUnavailable in your Deployment's spec, by default it is set to 25%, which, rounded up to an integer (representing a number of Pods/replicas), equals 1. It means that 1 out of 1 replicas is allowed to be unavailable.
If we have 1 pod with minAvailable: 30% it will refuse to drain node (scaledown).
A single replica with minAvailable: 30% is rounded up to 1 anyway. 1/1 should still be up and running, so the Pod cannot be evicted and the node cannot be drained in this case.
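For illustration, such a PDB might look like the sketch below (the name and label are placeholders, and the policy/v1 API is assumed); with a single replica, minAvailable: 30% still rounds up to 1:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb            # hypothetical name
spec:
  minAvailable: "30%"          # rounded up to 1 when only 1 replica exists
  selector:
    matchLabels:
      app: example             # hypothetical label on the single-replica Deployment's pods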
You can try the following solution, however I'm not 100% sure it will work when your Pod is re-scheduled to another node due to its eviction from the one it is currently running on.
But if you re-create your Pod, e.g. because you update its image to a new version, you can guarantee that at least one replica stays up and running (the old Pod won't be deleted unless the new one enters the Ready state) by setting maxUnavailable: 0. As per the docs, by default it is set to 25%, which is rounded up to 1. So by default you allow one of your replicas (which in your case happens to be 1/1) to become unavailable during the rolling update. If you set it to zero, the old Pod won't be deleted until the new one becomes Ready. At the same time, maxSurge: 2 allows up to 2 Pods above the desired number to exist temporarily during the update, so the old and the new replica can run side by side.
Your Deployment definition may begin as follows:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0 👈
      maxSurge: 2
  selector:
    ...
Compare it with this answer, provided by mdaniel, where I originally found it.

Related

Can a deployment kind with a replica count = 1 ever result in two Pods in the 'Running' phase?

From what I understand, with the above configuration, it is possible to have 2 pods that exist in the cluster associated with the deployment. However, the old Pod is guaranteed to be in the 'Terminated' state. An example scenario is updating the image version associated with the deployment.
There should not be any scenario where there are 2 Pods that are associated with the deployment and both are in the 'Running' phase. Is this correct?
In the scenarios I tried, for example Pod eviction or updating the Pod spec, the existing Pod enters the 'Terminating' state and a new Pod is deployed.
This is what I expected. I just wanted to make sure that all possible scenarios around updating the Pod spec or Pod eviction cannot end up with two Pods in the 'Running' state, as that would violate the replica count = 1 config.
It depends on your update strategy. Many times it's desired to have the new pod running and healthy before you shut down the old pod, otherwise you have downtime which may not be acceptable as per business requirements. By default, it does rolling updates.
The defaults look like the below, so if you don't specify anything, that's what will be used.
apiVersion: apps/v1
kind: Deployment
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
So usually, you would have a moment where both pods are running. But Kubernetes will terminate the old pod as soon as the new pod becomes ready, so it will be hard, if not impossible, to literally see both in the Ready state.
You can read about it in the docs: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#updating-a-deployment
Deployment ensures that only a certain number of Pods are down while they are being updated. By default, it ensures that at least 75% of the desired number of Pods are up (25% max unavailable).
Deployment also ensures that only a certain number of Pods are created above the desired number of Pods. By default, it ensures that at most 125% of the desired number of Pods are up (25% max surge).
For example, if you look at the above Deployment closely, you will see that it first creates a new Pod, then deletes an old Pod, and creates another new one. It does not kill old Pods until a sufficient number of new Pods have come up, and does not create new Pods until a sufficient number of old Pods have been killed. It makes sure that at least 3 Pods are available and that at max 4 Pods in total are available. In case of a Deployment with 4 replicas, the number of Pods would be between 3 and 5.
This is also explained here: https://kubernetes.io/docs/tutorials/kubernetes-basics/update/update-intro/
Users expect applications to be available all the time and developers are expected to deploy new versions of them several times a day. In Kubernetes this is done with rolling updates. Rolling updates allow Deployments' update to take place with zero downtime by incrementally updating Pods instances with new ones. The new Pods will be scheduled on Nodes with available resources.
To get the behaviour you described, you would set spec.strategy.type to Recreate.
All existing Pods are killed before new ones are created when .spec.strategy.type==Recreate.
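For reference, a minimal strategy stanza for that behaviour would look like the sketch below (only the relevant part of the Deployment is shown):
apiVersion: apps/v1
kind: Deployment
spec:
  strategy:
    type: Recreate   # all old Pods are killed before new ones are created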

Kubernetes AKS PodDisruptionBudget, HorizontalPodAutoscaler & RollingUpdate Interaction - Without Load

My question is similar to this one about PDB, HPA and drain.
Bear in mind that most of the time my pod is not under load and so has only 1 replica, but at times it scales up to ~7 or 8 replicas.
I have the following Kubernetes objects in an AKS cluster (a rough sketch of them follows the list):
Deployment with rollingUpdate.maxUnavailable set to 0 and rollingUpdate.maxSurge set to 1
PodDisruptionBudget with minAvailable set to 1.
HorizontalPodAutoscaler set up to allow autoscaling, with bounds of 1-10 (lower-upper).
Cluster autoscaling is enabled.
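A minimal sketch of the budget and autoscaler described above, assuming the Deployment is called my-app and its pods carry the label app: my-app (both names are placeholders), might look like:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1          # with only 1 replica running, this blocks voluntary eviction
  selector:
    matchLabels:
      app: my-app
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # illustrative target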
When I trigger an update of the cluster (not an update of the service pod),
AKS first triggers a drain on the node that hosts my pod.
The update fails because, at this time, the number of replicas is 1 and the PodDisruptionBudget requires 1 to stay available, so the eviction of the pod fails, the drain fails, and AKS eventually fails the update.
Other pods on the same node behave well: new nodes are created, new pods are scheduled there, and then the old pods and the old node are terminated; those pods don't specify a PodDisruptionBudget at all.
My question is: is this just an AKS-specific thing in how they implemented the cluster upgrade, or do these constraints work in other Kubernetes implementations?
I would expect that the drain should allow my pod to be evicted but scheduled onto another node, creating the new pod first, as per the rollingUpdate strategy stipulated.
The HorizontalPodAutoscaler should allow a second replica to be created via rollingUpdate.maxSurge=1.
I am assuming that the rollingUpdate is never invoked when I upgrade the cluster, though I am not sure.
The quickest solution I can think of is setting the HorizontalPodAutoscaler bounds to 2-10 so that there is never only one replica left, but this is wasteful most of the time.
What is my best option?

Kubernetes EKS deployment set soft node affinity to split pods 50/50 per nodegroup

I have an EKS cluster with two nodegroups, each in a different AZ. One deployment, Deployment1, runs in 2 namespaces for redundancy, one copy per namespace, and each copy runs in a separate AZ/nodegroup. There is also another deployment, Deployment2, that does not have any node affinity set, so K8s manages where its pods get scheduled.
Both deployments are huge with lots of pods. I have a subnet of 250 IPs available to me for each node group.
The problem is that while Deployment1 is fine on its own and gets split almost equally per AZ/nodegroup, Deployment2 tends to schedule most of its pods in one of the nodegroups, and that ends with no more IPs being available there. This is a problem for Deployment1, since one namespace of it is tied to that nodegroup and no new pods can be scheduled there if the load changes.
Can I somehow balance Deployment2 so it has a 'soft affinity' that would split it 50/50 per nodegroup but, if needed, can still schedule pods in the other nodegroup?
If you're using Kubernetes 1.19 or later you can use topologySpreadConstraints, adding this to the pod template:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      foo: bar
where maxSkew defines how unevenly pods can be scheduled, topologyKey is the key of the node labels, and the labelSelector matches a label of your deployment. See the docs.
If you're on an older Kubernetes version, you can look at pod anti-affinity.
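A rough sketch of such a soft anti-affinity rule, reusing the foo: bar label from the example above and added to the pod template, might look like:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            foo: bar
        topologyKey: topology.kubernetes.io/zone
Because it is only preferred, the scheduler will try to keep the pods apart per zone but can still place them in the same zone if it has to.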

How to make sure the Kubernetes autoscaler does not delete nodes that run a specific pod

I am running a Kubernetes cluster (an AWS EKS one) with the cluster autoscaler pod, so that the cluster autoscales according to the resource requests within it.
The cluster will also shrink the number of nodes when the load is reduced. As I observed, the autoscaler can delete any node in this process.
I want to control this behaviour, e.g. by asking the autoscaler to stop deleting nodes that run a specific pod.
For example, if a node runs the Jenkins pod, the autoscaler should skip that node and delete another matching node from the cluster.
Is there a way to achieve this requirement? Please give your thoughts.
You can use "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
...
template:
  metadata:
    labels:
      app: jenkins
    annotations:
      "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
  spec:
    nodeSelector:
      failure-domain.beta.kubernetes.io/zone: us-west-2b
...
You should set a pod disruption budget that references specific pods by label. If you want to ensure that there is at least one Jenkins worker pod running at all times, for example, you could create a PDB like
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: jenkins-worker-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: jenkins
      component: worker
(adapted from the basic example in Specifying a Disruption Budget in the Kubernetes docs).
Doing this won't prevent nodes from being destroyed; the cluster autoscaler is still free to scale things down. What it will do is temporarily delay destroying a node until the disruption budget can be met again.
For example, say you've configured your Jenkins setup so that there are three workers. Two get scheduled on the same node, and the autoscaler takes that node offline. The ordinary Kubernetes Deployment system will create two new replicas on nodes that still exist. If the autoscaler also decides it wants to destroy the node that has the last worker, the pod disruption budget above will prevent it from doing so until at least one other worker is running.
When you say "the Jenkins pod" in the question, there are two other important implications to this. One is that you should almost always configure your applications using higher-level objects like Deployments or StatefulSets and not bare Pods. The other is that it is generally useful to run multiple copies of things for redundancy if nothing else. Even absent the cluster autoscaler, disks fail, Amazon occasionally arbitrarily decommissions EC2 instances, and nodes otherwise can go offline outside of your control; you often don't want just one copy of something running in your cluster, especially if you're considering it a critical service.
In the cluster autoscaler FAQ on GitHub you can read the following:
What types of pods can prevent CA from removing a node?
- Pods with restrictive PodDisruptionBudget.
- Kube-system pods that:
  - are not run on the node by default, *
  - don't have a pod disruption budget set or their PDB is too restrictive (since CA 0.6).
- Pods that are not backed by a controller object (so not created by deployment, replica set, job, stateful set etc). *
- Pods with local storage. *
- Pods that cannot be moved elsewhere due to various constraints (lack of resources, non-matching node selectors or affinity, matching anti-affinity, etc)
- Pods that have the following annotation set: "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
*Unless the pod has the following annotation (supported in CA 1.0.3 or later): "cluster-autoscaler.kubernetes.io/safe-to-evict": "true"

How to fix "pods is not balanced" in Kubernetes cluster

Pods don't balance across the node pool. Why don't they spread to each node?
I have 9 instances in 1 node pool. In the past, I tried adding up to 12 instances. The pods don't balance.
I would like to know if there is any solution that can help solve this problem while using 9 instances in 1 node pool.
Pods are scheduled to run on nodes by the kube-scheduler. And once they are scheduled, they are not rescheduled unless they are removed.
So if you add more nodes, the already running pods won't reschedule.
There is a project in the incubator that solves exactly this problem:
https://github.com/kubernetes-incubator/descheduler
Scheduling in Kubernetes is the process of binding pending pods to nodes, and is performed by a component of Kubernetes called kube-scheduler. The scheduler's decisions, whether or where a pod can or cannot be scheduled, are guided by its configurable policy, which comprises a set of rules called predicates and priorities. The scheduler's decisions are influenced by its view of the Kubernetes cluster at the point in time when a new pod first appears for scheduling. As Kubernetes clusters are very dynamic and their state changes over time, it may be desirable to move already running pods to other nodes for various reasons:
Some nodes are under or over utilized.
The original scheduling decision does not hold true any more, as taints or labels are added to or removed from nodes, or pod/node affinity requirements are not satisfied any more.
Some nodes failed and their pods moved to other nodes.
New nodes are added to clusters.
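The descheduler is typically run in the cluster as a Job or CronJob together with a policy file; a minimal sketch of such a policy (using the descheduler/v1alpha1 format, with illustrative thresholds) might look like:
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:          # nodes below all of these are considered underutilized
          "cpu": 20
          "memory": 20
          "pods": 20
        targetThresholds:    # nodes above any of these may have pods evicted
          "cpu": 50
          "memory": 50
          "pods": 50
Evicted pods then go back through kube-scheduler, which can place them on the newly added nodes.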
You should look into inter-pod anti-affinity. This feature allows you to constrain where your pods should not be scheduled based on the labels of the pods running on a node. In your case, given your app has label app-label, you can use it to ensure pods do not get scheduled on nodes that have pods with the label app-label. For example:
apiVersion: apps/v1
kind: Deployment
...
spec:
  selector:
    matchLabels:
      label-key: label-value
  template:
    metadata:
      labels:
        label-key: label-value
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: label-key
                operator: In
                values:
                - label-value
            topologyKey: "kubernetes.io/hostname"
...
PS: If you use requiredDuringSchedulingIgnoredDuringExecution, you can have at most as many pods as you have nodes. If you expect to have more pods than nodes available, you will have to use preferredDuringSchedulingIgnoredDuringExecution, which makes the anti-affinity a preference rather than an obligation.
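For comparison, a sketch of the preferred (soft) form of the same rule, keeping the illustrative label-key: label-value labels, could be:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: label-key
            operator: In
            values:
            - label-value
        topologyKey: "kubernetes.io/hostname"
With this variant, the scheduler tries to put each pod on a node that does not already run one, but will co-locate pods once all nodes are occupied.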