Kubernetes AKS PodDisruptionBudget, HorizontalPodAutoscaler & RollingUpdate Interaction - Without Load - kubernetes

My question is similar to this one, about PDB, HPA and drain.
Bear in mind that most of the time my pod is not under load and so has only 1 replica, but at times it scales up to ~7 or 8 replicas.
I have the following Kubernetes objects in an AKS cluster:
Deployment with rollingUpdate.maxUnavailable set to 0, rollingUpdate.maxSurge set to 1
PodDisruptionBudget with minAvailable set to 1.
HorizontalPodAutoscaler set up to allow autoscaling, with bounds 1-10 (lower-upper)
Cluster auto-scaling is enabled.
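For reference, the three objects listed above might look roughly like the sketch below. The name my-service, the app label, the image, the CPU request and the 80% CPU target are placeholders that are not given in the question, and the PDB is written as minAvailable: 1, which is the reading consistent with the failed eviction.

```yaml
# Sketch only: names, labels, image and the CPU target are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  # replicas is left out so that the HorizontalPodAutoscaler owns the count
  selector:
    matchLabels:
      app: my-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # during a rollout, keep the old Pod until a new one is Ready
      maxSurge: 1         # during a rollout, allow one extra Pod
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
      - name: my-service
        image: registry.example.com/my-service:1.0   # placeholder image
        resources:
          requests:
            cpu: 100m     # a CPU request is needed for CPU-utilization scaling
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-service
spec:
  minAvailable: 1         # with 1 replica this leaves 0 allowed disruptions
  selector:
    matchLabels:
      app: my-service
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: my-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 80   # assumed metric; not stated in the question
```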
When I trigger an update of the cluster (not an update of the service pod), AKS first triggers a drain on the node that hosts my pod.
The update fails because at that time the number of replicas is 1 and the PodDisruptionBudget requires 1 available pod, so the eviction of the pod fails, the drain fails, and AKS eventually fails the update.
Other pods on the same node, which don't specify a PodDisruptionBudget at all, are handled fine: new nodes are created, new pods are scheduled there, and then the old pods and the old node are terminated.
My question is: is this just an AKS-specific thing in how they implemented the cluster upgrade? Do these constraints work in other Kubernetes implementations?
I would expect that the drain should allow my pod to be evicted, but scheduled onto another node, creating the new pod first, as per the rollingUpdate strategy stipulated.
The HorizontalPodAutoscaler should allow a second replica to be created by rollingUpdate.maxSurge=1.
I am assuming that the rollingUpdate is never called when I upgrade the cluster, though I am not sure.
The quickest solution I can think of is HorizontalPodAutoscaler = 2-10, so that there's never only one left, but this is wasteful most of the time.
What is my best option?

Related

Can a deployment kind with a replica count = 1 ever result in two Pods in the 'Running' phase?

From what I understand, with the above configuration it is possible to have 2 Pods in the cluster that are associated with the deployment. However, the old Pod is guaranteed to be in the 'Terminated' state. An example scenario is updating the image version associated with the deployment.
There should not be any scenario where 2 Pods are associated with the deployment and both are in the 'Running' phase. Is this correct?
In the scenarios I tried, for example Pod eviction or updating the Pod spec, the existing Pod enters the 'Terminating' state and a new Pod is deployed.
This is what I expected. I just wanted to make sure that no scenario around updating the Pod spec or Pod eviction can end up with two Pods in the 'Running' state, as it would violate the replica count = 1 config.
It depends on your update strategy. Many times it's desired to have the new pod running and healthy before you shut down the old pod, otherwise you have downtime which may not be acceptable as per business requirements. By default, it's doing rolling updates.
The defaults look like the below, so if you don't specify anything, that's what will be used.
apiVersion: apps/v1
kind: Deployment
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
So usually, you would have a moment where both pods are running. But Kubernetes will terminate the old pod as soon as the new pod becomes ready, so it will be hard, if not impossible, to literally see both in the state ready.
You can read about it in the docs: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#updating-a-deployment
Deployment ensures that only a certain number of Pods are down while they are being updated. By default, it ensures that at least 75% of the desired number of Pods are up (25% max unavailable).
Deployment also ensures that only a certain number of Pods are created above the desired number of Pods. By default, it ensures that at most 125% of the desired number of Pods are up (25% max surge).
For example, if you look at the above Deployment closely, you will see that it first creates a new Pod, then deletes an old Pod, and creates another new one. It does not kill old Pods until a sufficient number of new Pods have come up, and does not create new Pods until a sufficient number of old Pods have been killed. It makes sure that at least 3 Pods are available and that at max 4 Pods in total are available. In case of a Deployment with 4 replicas, the number of Pods would be between 3 and 5.
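The bounds quoted above follow from how the two percentages are resolved against the replica count: maxSurge is rounded up and maxUnavailable is rounded down. A small annotated sketch of the default strategy, with the arithmetic for 4 replicas and for a single replica written out as comments:

```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    # replicas: 4 -> 25% of 4 = 1.00 -> rounded up   = 1 -> at most 5 Pods in total
    # replicas: 1 -> 25% of 1 = 0.25 -> rounded up   = 1 -> at most 2 Pods in total
    maxSurge: 25%
    # replicas: 4 -> 25% of 4 = 1.00 -> rounded down = 1 -> at least 3 Pods available
    # replicas: 1 -> 25% of 1 = 0.25 -> rounded down = 0 -> the old Pod stays until the new one is Ready
    maxUnavailable: 25%
```

So with replicas: 1 and the defaults, a rollout briefly has two Pods, the old one and its replacement.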
This is also explained here: https://kubernetes.io/docs/tutorials/kubernetes-basics/update/update-intro/
Users expect applications to be available all the time and developers are expected to deploy new versions of them several times a day. In Kubernetes this is done with rolling updates. Rolling updates allow Deployments' update to take place with zero downtime by incrementally updating Pods instances with new ones. The new Pods will be scheduled on Nodes with available resources.
To get the behaviour you describe, you would set spec.strategy.type to Recreate.
All existing Pods are killed before new ones are created when .spec.strategy.type==Recreate.
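As a minimal sketch, a single-replica Deployment using that strategy could look like this. The name, label and image are placeholders, and the setting only governs rollouts of a changed Pod template:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example          # placeholder name
spec:
  replicas: 1
  strategy:
    type: Recreate       # a rollingUpdate block may not be set with this type
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
      - name: example
        image: registry.example.com/example:1.0   # placeholder image
```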

Can a node autoscaler automatically start an extra pod when replica count is 1 & minAvailable is also 1?

Our autoscaling (horizontal and vertical) works pretty well, except that downscaling is somehow not working (yes, we checked the usual suspects like https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#i-have-a-couple-of-nodes-with-low-utilization-but-they-are-not-scaled-down-why ).
Since we want to save resources and have pods which are not ultra-sensitive, we are setting the following:
Deployment
replicas: 1
PodDisruptionBudget
minAvailable: 1
HorizontalPodAutoscaler
minReplicas: 1
maxReplicas: 10
But it now seems that this is the reason the autoscaler is not scaling down the nodes (even though the node is only at 30% CPU + memory utilization and we have other nodes with more than enough memory + CPU to take these pods).
Is it possible in general that the auto scaler starts an extra pod on the free node and removes the old pod from the old node?
Yes, that should be possible in general, but in order for the cluster autoscaler to remove a node, it must be possible to move all pods running on the node somewhere else.
According to the docs, there are a few types of pods that are not movable:
Pods with restrictive PodDisruptionBudget.
Kube-system pods that:
are not run on the node by default
don't have a pod disruption budget set or their PDB is too restrictive (since CA 0.6).
Pods that are not backed by a controller object (so not created by deployment, replica set, job, stateful set etc).
Pods with local storage.
Pods that cannot be moved elsewhere due to various constraints (lack of resources, non-matching node selectors or affinity, matching anti-affinity, etc)
Pods that have the following annotation set:
cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
You could check the cluster autoscaler logs, they may provide a hint to why no scale in happens:
kubectl -n kube-system logs -f deployment.apps/cluster-autoscaler
Without more information about your setup it is hard to guess what is going wrong, but unless you are using local storage, node selectors, affinity/anti-affinity rules, etc., pod disruption budgets are a likely candidate. Even if you are not using them explicitly, they can still prevent node scale-in if there are pods in the kube-system namespace that are missing pod disruption budgets (see this answer for an example of such a scenario in GKE).

POD affinity rule to schedule pods across all nodes

We are running 6 nodes in a K8s cluster. 2 of the 6 run RabbitMQ, Redis & Prometheus; we have used a node selector and cordoned those nodes so that no other pods are scheduled on them.
Application pods run on the remaining 4 nodes; we have around 18-19 microservices.
For GKE there is an open issue in the official K8s repo regarding automatic scale-down: https://github.com/kubernetes/kubernetes/issues/69696#issuecomment-651741837 . People there suggest setting a PDB, and we tested that approach on Dev/Stag.
What we are looking for now is to pin pods to a particular node pool that does not scale, as we are running single replicas of some services.
For now, we are thinking of applying affinity to those services that run with a single replica and have no requirement to scale,
while for scalable services we won't specify any rule, so by default the K8s scheduler will spread the pods across different nodes. This way, if any node scales down, we don't face downtime for a single-replica service.
Affinity example :
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    # the preferred form takes weighted preferences, not nodeSelectorTerms
    - weight: 100
      preference:
        matchExpressions:
        - key: do-not-scale
          operator: In
          values:
          - 'true'
We are planning to use affinity type preferredDuringSchedulingIgnoredDuringExecution instead of requiredDuringSchedulingIgnoredDuringExecution.
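The two affinity types take differently shaped values. As a sketch for comparison, reusing the do-not-scale label from the example above: the required form lists nodeSelectorTerms, while the preferred form lists weighted preferences. The weight of 100 is an arbitrary choice, and the two are shown together only for comparison; normally one of them would be used.

```yaml
affinity:
  nodeAffinity:
    # Hard rule: the Pod stays Pending if no node carries the label.
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: do-not-scale
          operator: In
          values:
          - 'true'
    # Soft rule: the scheduler favours labelled nodes but may place the Pod elsewhere.
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100          # allowed range is 1-100
      preference:
        matchExpressions:
        - key: do-not-scale
          operator: In
          values:
          - 'true'
```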
Note : Here K8s is not creating new replica first on another node during node drain (scaledown of any node) as we are running single replicas with rolling update & minAvailable: 25% strategy.
Why: If PodDisruptionBudget is not specified and we have a deployment with one replica, the pod will be terminated and then a new pod will be scheduled on a new node.
To make sure the application will be available during the node draining process we have to specify PodDisruptionBudget and create more replicas. If we have 1 pod with minAvailable: 30% it will refuse to drain node (scaledown).
Please point out a mistake if you are seeing anything wrong & suggest better option.
First of all, defining a PodDisruptionBudget doesn't make much sense when you have only one replica. minAvailable expressed as a percentage is rounded up to an integer, as it represents the minimum number of Pods which need to be available at all times.
Keep in mind that you have no guarantee for any High Availability when launching only one-replica Deployments.
Why: If PodDisruptionBudget is not specified and we have a deployment
with one replica, the pod will be terminated and then a new pod will
be scheduled on a new node.
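If you didn't explicitly define maxUnavailable in your Deployment's spec, it defaults to 25%. For a rolling update that percentage is rounded down to a whole number of Pods (maxSurge, also 25% by default, is rounded up), so with one replica it resolves to 0 unavailable and 1 surge Pod. A node drain, however, removes the Pod through eviction rather than through a rollout, so these settings don't come into play: with no PodDisruptionBudget nothing blocks the eviction, the Pod is deleted, and the ReplicaSet then creates its replacement.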
If we have 1 pod with minAvailable: 30% it will refuse to drain node
(scaledown).
Single replica with minAvailable: 30% is rounded up to 1 anyway. 1/1 should be still up and running so Pod cannot be evicted and node cannot be drained in this case.
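The rounding can be spelled out on the PDB itself. A minimal sketch, with a placeholder name and label, where the comments work through the single-replica case and, for contrast, a 4-replica case:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb        # placeholder name
spec:
  # 30% of 1 replica = 0.3 -> rounded up to 1 Pod that must stay available.
  # healthy Pods (1) - required Pods (1) = 0 allowed disruptions -> eviction is refused.
  # With 4 replicas: 30% of 4 = 1.2 -> 2 required -> 2 allowed disruptions.
  minAvailable: 30%
  selector:
    matchLabels:
      app: example         # placeholder label
```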
You can try the following solution; however, I'm not 100% sure it will work when your Pod is rescheduled to another node due to its eviction from the node it is currently running on.
But if you re-create your Pod, e.g. because you update its image to a new version, you can guarantee that at least one replica stays up and running (the old Pod won't be deleted until the new one enters the Ready state) by setting maxUnavailable: 0. As per the docs, it defaults to 25%, which for maxUnavailable is rounded down to a whole number of Pods, so the effective value depends on the replica count. Setting it explicitly to zero means the old Pod is never deleted before the new one becomes Ready, whatever the replica count. At the same time, maxSurge: 2 allows up to 2 additional Pods to exist temporarily during the update.
Your Deployment definition may begin as follows:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0  # 👈
      maxSurge: 2
  selector:
    ...
Compare it with this answer, provided by mdaniel, where I originally found it.

How to make sure Kubernetes autoscaler not deleting the nodes which runs specific pod

I am running a Kubernetes cluster (AWS EKS) with the Cluster Autoscaler pod, so that the cluster autoscales according to the resource requests within the cluster.
The cluster also shrinks the number of nodes when the load is reduced. As I observed, the Autoscaler can delete any node in this process.
I want to control this behavior, for example by asking the Autoscaler not to delete nodes that run a specific pod.
For example, if a node runs the Jenkins pod, the Autoscaler should skip that node and delete another matching node from the cluster.
Is there a way to achieve this? Please share your thoughts.
You can use the annotation "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" on the pod template:
...
  template:
    metadata:
      labels:
        app: jenkins
      annotations:
        "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
    spec:
      nodeSelector:
        failure-domain.beta.kubernetes.io/zone: us-west-2b
...
You should set a pod disruption budget that references specific pods by label. If you want to ensure that there is at least one Jenkins worker pod running at all times, for example, you could create a PDB like
# policy/v1 is available from Kubernetes 1.21; policy/v1beta1 was removed in 1.25
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: jenkins-worker-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: jenkins
      component: worker
(adapted from the basic example in Specifying a Disruption Budget in the Kubernetes docs).
Doing this won't prevent nodes from being destroyed; the cluster autoscaler is still free to scale things down. What it will do is temporarily delay destroying a node until the disruption budget can be met again.
For example, say you've configured your Jenkins setup so that there are three workers. Two get scheduled on the same node, and the autoscaler takes that node offline. The ordinary Kubernetes Deployment system will create two new replicas on nodes that still exist. If the autoscaler also decides it wants to destroy the node that has the last worker, the pod disruption budget above will prevent it from doing so until at least one other worker is running.
When you say "the Jenkins pod" in the question, there are two other important implications to this. One is that you should almost always configure your applications using higher-level objects like Deployments or StatefulSets and not bare Pods. The other is that it is generally useful to run multiple copies of things for redundancy if nothing else. Even absent the cluster autoscaler, disks fail, Amazon occasionally arbitrarily decommissions EC2 instances, and nodes otherwise can go offline outside of your control; you often don't want just one copy of something running in your cluster, especially if you're considering it a critical service.
In the autoscaler FAQ on GitHub you can read the following:
What types of pods can prevent CA from removing a node?
Pods with restrictive PodDisruptionBudget.
Kube-system pods that:
are not run on the node by default, *
don't have a pod disruption budget set or their PDB is too restrictive (since CA 0.6).
Pods that are not backed by a controller object (so not created by deployment, replica set, job, stateful set etc). *
Pods with local storage. *
Pods that cannot be moved elsewhere due to various constraints (lack of resources, non-matching node selectors or affinity, matching anti-affinity, etc)
Pods that have the following annotation set: "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
*Unless the pod has the following annotation (supported in CA 1.0.3 or later): "cluster-autoscaler.kubernetes.io/safe-to-evict": "true"
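For the cases marked with * above, the opt-in annotation goes on the Pod template, in the same place as the "false" variant. A minimal fragment, assuming a hypothetical workload labelled app: example:

```yaml
template:
  metadata:
    labels:
      app: example   # placeholder label
    annotations:
      # Tells the cluster autoscaler this Pod may be evicted during scale-down.
      cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
```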

Google Kubernetes: worker pool not scaling down to zero

I'm setting up a cluster on Google Kubernetes Engine (GKE) to run some heavy jobs. I have a render-pool of big machines that I want to autoscale from 0 to N (using the cluster autoscaler). My default-pool is a cheap g1-small to run the system pods (those never go away, so the default pool can't autoscale to 0, too bad).
My problem is that the render-pool doesn't want to scale down to 0. It has some system pods running on it; are those the problem? The default pool has plenty of resources to run all of them as far as I can tell. I've read the autoscaler FAQ, and it looks like it should delete my node after 10 min of inactivity. I've waited an hour though.
I created the render pool like this:
gcloud container node-pools create render-pool-1 --cluster=test-zero-cluster-2 \
--disk-size=60 --machine-type=n2-standard-8 --image-type=COS \
--disk-type=pd-standard --preemptible --num-nodes=1 --max-nodes=3 --min-nodes=0 \
--enable-autoscaling
The cluster-autoscaler-status configmap says ScaleDown: NoCandidates and it is probing the pool frequently, as it should.
What am I doing wrong, and how do I debug it? Can I see why the autoscaler doesn't think it can delete the node?
As pointed out in the comments, some pods, under specific circumstances, will prevent the CA from downscaling.
In GKE, you have logging pods (fluentd), kube-dns, monitoring, etc., all considered system pods. This means that any node where they're scheduled will not be a candidate for downscaling.
Considering this, it all boils down to creating a scenario where all the previous conditions for downscaling are met.
Since you only want to scale down a specific node pool, I'd use taints and tolerations to keep system pods in the default pool.
For GKE specifically, you can pick each app by their k8s-app label, for instance:
$ kubectl taint nodes GPU-NODE k8s-app=heapster:NoSchedule
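This will prevent the tainted nodes from scheduling Heapster. Note that a NoSchedule taint repels every pod that lacks a matching toleration, not just Heapster, so the workloads meant for that pool need a toleration for it.
Not recommended, but you can go broader and try to cover all the GKE system pods using kubernetes.io/cluster-service instead: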
$ kubectl taint nodes GPU-NODE kubernetes.io/cluster-service=true:NoSchedule
Just be careful, as the scope of this label is broader and you'll have to keep track of upcoming changes, as this label is possibly going to be deprecated someday.
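Whichever taint is used, the workloads that should still land on the tainted pool need a matching toleration. A sketch for the Pod spec of a render job, assuming the second taint above (kubernetes.io/cluster-service=true:NoSchedule) was applied; the nodeSelector uses the node-pool label that GKE sets on its nodes together with the pool name from the question:

```yaml
spec:
  nodeSelector:
    cloud.google.com/gke-nodepool: render-pool-1   # pool name taken from the question
  tolerations:
  # Must match the taint's key, value and effect exactly.
  - key: kubernetes.io/cluster-service
    operator: Equal
    value: "true"
    effect: NoSchedule
```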
Another thing that you might want to consider is using Pod Disruption Budgets. This might be more effective for stateless workloads, but setting it very tight is likely to cause instability.
The idea of a PDB is to tell GKE the minimum number of pods that must be running at any given time, which allows the CA to evict the rest. It can be applied to system pods like below:
# policy/v1 is available from Kubernetes 1.21; policy/v1beta1 was removed in 1.25
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: dns-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      k8s-app: kube-dns
This tells GKE that, although there are usually 3 replicas of kube-dns, the application can take 2 disruptions and run temporarily with only 1 replica, allowing the CA to evict these pods and reschedule them on other nodes.
As you probably noticed, this will put stress on DNS resolution in the cluster (in this particular example), so be careful.
Finally, regarding how to debug the CA: for now, consider that GKE is a managed version of Kubernetes where you don't really have direct access to tweak some features (for better or worse). You cannot set flags in the CA, and access to its logs may have to go through GCP support. The idea is to protect the workloads running in the cluster rather than to minimize cost.
Downscaling in GKE is more about using different features in Kubernetes together until the CA conditions for downscaling are met.