Can a node autoscaler automatically start an extra pod when replica count is 1 & minAvailable is also 1? - kubernetes

Our autoscaling (horizontal and vertical) works fine, except that scale-down is not working (yes, we checked the usual suspects, such as https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#i-have-a-couple-of-nodes-with-low-utilization-but-they-are-not-scaled-down-why ).
Since we want to save resources and our pods are not ultra-sensitive, we set the following:
Deployment
  replicas: 1
PodDisruptionBudget
  minAvailable: 1
HorizontalPodAutoscaler
  minReplicas: 1
  maxReplicas: 10
It now seems that this is why the autoscaler is not scaling down the nodes, even though the node is only at about 30% CPU and memory usage and other nodes have more than enough CPU and memory to take these pods.
Is it possible in general that the auto scaler starts an extra pod on the free node and removes the old pod from the old node?

Is it possible in general that the auto scaler starts an extra pod on the free node and removes the old pod from the old node?
Yes, that should be possible in general, but in order for the cluster autoscaler to remove a node, it must be possible to move all pods running on the node somewhere else.
According to the docs, a few types of pods are not movable:
- Pods with restrictive PodDisruptionBudget.
- Kube-system pods that:
  - are not run on the node by default
  - don't have a pod disruption budget set or their PDB is too restrictive (since CA 0.6).
- Pods that are not backed by a controller object (so not created by deployment, replica set, job, stateful set etc).
- Pods with local storage.
- Pods that cannot be moved elsewhere due to various constraints (lack of resources, non-matching node selectors or affinity, matching anti-affinity, etc)
- Pods that have the following annotation set:
  cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
You could check the cluster autoscaler logs; they may hint at why no scale-in happens:
kubectl -n kube-system logs -f deployment.apps/cluster-autoscaler
Without more information about your setup it is hard to guess what is going wrong, but unless you are using local storage, node selectors, affinity/anti-affinity rules, etc., pod disruption budgets are a likely candidate. Even if you are not using them explicitly, they can still prevent node scale-in if there are pods in the kube-system namespace that are missing pod disruption budgets (see this answer for an example of such a scenario in GKE).
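To see why the configuration in the question can block scale-down, it may help to work through the disruption-budget arithmetic. The following is a simplified Python model, not the controller's actual code: it assumes integer budgets (no percentages), and the function name `allowed_disruptions` is made up for illustration. The underlying point is that an eviction is only allowed while enough healthy pods remain, and the eviction itself does not start a replacement pod first.

```python
from typing import Optional


def allowed_disruptions(
    healthy_pods: int,
    desired_replicas: int,
    min_available: Optional[int] = None,
    max_unavailable: Optional[int] = None,
) -> int:
    """Simplified model of how many voluntary evictions a PDB permits right now.

    Assumes integer budgets (no percentages) and exactly one of
    min_available / max_unavailable, mirroring the PDB spec fields.
    """
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        must_stay_healthy = min_available
    else:
        must_stay_healthy = desired_replicas - max_unavailable
    return max(0, healthy_pods - must_stay_healthy)


# The setup from the question: replicas: 1, minAvailable: 1.
print(allowed_disruptions(healthy_pods=1, desired_replicas=1, min_available=1))    # 0
# Same Deployment, budget expressed as maxUnavailable: 1 instead.
print(allowed_disruptions(healthy_pods=1, desired_replicas=1, max_unavailable=1))  # 1
# Two replicas with minAvailable: 1.
print(allowed_disruptions(healthy_pods=2, desired_replicas=2, min_available=1))    # 1
```

Under this model the single pod can never be evicted voluntarily with minAvailable: 1, so the node it runs on cannot be drained. Relaxing the budget or running a second replica leaves room for one eviction, at the cost of a short gap in availability or of extra resources, respectively.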

Related

GKE node pool with Autoscaling does not scale down

I have a GKE cluster with two nodepools. I turned on autoscaling on one of my nodepools but it does not seem to automatically scale down.
I have enabled HPA and that works fine. It scales the pods down to 1 when I don't see traffic.
The API is currently not getting any traffic so I would expect the nodes to scale down as well.
But it still runs the maximum 5 nodes despite some nodes using less than 50% of allocatable memory/CPU.
What did I miss here? I am planning to move these pods to bigger machines but to do that I need the node autoscaling to work to control the monthly cost.
Many things can prevent the CA from scaling down successfully. To summarize how it should normally work:
Cluster autoscaler periodically checks (every 10 seconds) the utilization of the nodes.
If the utilization factor is less than 0.5, the node is considered underutilized.
The node is then marked for removal and monitored for the next 10 minutes to make sure the utilization factor stays below 0.5.
If it is still underutilized after 10 minutes, cluster autoscaler removes the node.
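The steps above can be sketched as a small Python model. This is only an illustration of the described thresholds (0.5 utilization, 10 minutes), not the autoscaler's real code: the `Node` type and function names are hypothetical, utilization is assumed to be the larger of the CPU and memory request ratios, and every other blocker (PDBs, local storage, annotations) is ignored.

```python
from dataclasses import dataclass

UTILIZATION_THRESHOLD = 0.5   # "utilization factor" from the steps above
UNNEEDED_SECONDS = 10 * 60    # the node must stay underutilized this long


@dataclass
class Node:
    cpu_requested: float      # sum of pod CPU requests, in cores
    cpu_allocatable: float
    mem_requested: float      # sum of pod memory requests, in GiB
    mem_allocatable: float


def utilization(node: Node) -> float:
    """Assumed definition: the larger of the CPU and memory request ratios."""
    return max(
        node.cpu_requested / node.cpu_allocatable,
        node.mem_requested / node.mem_allocatable,
    )


def is_removal_candidate(node: Node, underutilized_for_seconds: float) -> bool:
    """True if the node is below the threshold and has been for long enough."""
    return (
        utilization(node) < UTILIZATION_THRESHOLD
        and underutilized_for_seconds >= UNNEEDED_SECONDS
    )


quiet = Node(cpu_requested=1.0, cpu_allocatable=4.0, mem_requested=2.0, mem_allocatable=16.0)
busy = Node(cpu_requested=3.0, cpu_allocatable=4.0, mem_requested=2.0, mem_allocatable=16.0)
print(utilization(quiet), is_removal_candidate(quiet, 600))  # 0.25 True
print(is_removal_candidate(quiet, 300))                      # False
print(utilization(busy), is_removal_candidate(busy, 3600))   # 0.75 False
```

Note that the inputs are resource requests, not measured usage, so a node that looks idle in monitoring can still sit above the threshold.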
If that is not happening, something else is preventing your nodes from scaling down. In my experience, PDBs need to be applied to kube-system pods, and I would say that is the likely reason here; however, several things can cause scale-down issues:
1. No PDB is applied to your kube-system pods. Kube-system pods prevent Cluster Autoscaler from removing the nodes they run on. You can manually add Pod Disruption Budgets (PDBs) for the kube-system pods that can safely be rescheduled elsewhere, using the following command:
`kubectl create poddisruptionbudget PDB-NAME --namespace=kube-system --selector app=APP-NAME --max-unavailable 1`
2. Containers using local storage (volumes), even empty volumes. Kubernetes prevents scale-down events on nodes with pods that use local storage. Look for this kind of configuration.
3. Pods annotated with cluster-autoscaler.kubernetes.io/safe-to-evict: false. Look for pods with this annotation, which prevents their nodes from being scaled down.
4. Nodes annotated with cluster-autoscaler.kubernetes.io/scale-down-disabled: true. Look for nodes with this annotation, which excludes them from scale-down.
These are the configurations I suggest you check so that your cluster scales down underutilized nodes.
See also this page, which explains the configurations that prevent scale-down; one of them may be what is happening to you.

How to make sure Kubernetes autoscaler not deleting the nodes which runs specific pod

I am running a Kubernetes cluster (on AWS EKS) with an Autoscaler pod, so that the cluster autoscales according to the resource requests within the cluster.
The cluster also shrinks the number of nodes when the load is reduced. As I observed, the Autoscaler can delete any node in this process.
I want to control this behavior, for example by asking the Autoscaler not to delete nodes that run a specific pod.
For example, if a node runs the Jenkins pod, the Autoscaler should skip that node and delete another matching node from the cluster.
Is there a way to achieve this? Please share your thoughts.
You can use "cluster-autoscaler.kubernetes.io/safe-to-evict": "false":
...
template:
  metadata:
    labels:
      app: jenkins
    annotations:
      "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
  spec:
    nodeSelector:
      failure-domain.beta.kubernetes.io/zone: us-west-2b
...
You should set a pod disruption budget that references specific pods by label. If you want to ensure that there is at least one Jenkins worker pod running at all times, for example, you could create a PDB like
apiVersion: policy/v1  # use policy/v1beta1 on clusters older than Kubernetes 1.21
kind: PodDisruptionBudget
metadata:
  name: jenkins-worker-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: jenkins
      component: worker
(adapted from the basic example in Specifying a Disruption Budget in the Kubernetes docs).
Doing this won't prevent nodes from being destroyed; the cluster autoscaler is still free to scale things down. What it will do is temporarily delay destroying a node until the disruption budget can be met again.
For example, say you've configured your Jenkins setup so that there are three workers. Two get scheduled on the same node, and the autoscaler takes that node offline. The ordinary Kubernetes Deployment system will create two new replicas on nodes that still exist. If the autoscaler also decides it wants to destroy the node that has the last worker, the pod disruption budget above will prevent it from doing so until at least one other worker is running.
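The three-worker scenario can be traced step by step with a toy model. This is a sketch for illustration only: the `DisruptionBudget` class and its methods are hypothetical names, it assumes an integer minAvailable, and it leaves out scheduling entirely.

```python
class DisruptionBudget:
    """Toy model of a PDB with an integer minAvailable (illustrative only)."""

    def __init__(self, min_available: int, healthy: int) -> None:
        self.min_available = min_available
        self.healthy = healthy

    def try_evict(self) -> bool:
        """Evict one pod if the budget allows it; return whether it happened."""
        if self.healthy - 1 < self.min_available:
            return False
        self.healthy -= 1
        return True

    def replacement_ready(self) -> None:
        """A replacement pod created by the Deployment has become ready."""
        self.healthy += 1


workers = DisruptionBudget(min_available=1, healthy=3)
first_node = [workers.try_evict(), workers.try_evict()]  # node holding two workers
blocked = workers.try_evict()                            # node holding the last worker
workers.replacement_ready()                              # one replacement comes up elsewhere
retry = workers.try_evict()
print(first_node, blocked, retry)  # [True, True] False True
```

The third eviction is refused only until a replacement is ready, which matches the "temporarily delay" behaviour described above.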
When you say "the Jenkins pod" in the question, there are two other important implications to this. One is that you should almost always configure your applications using higher-level objects like Deployments or StatefulSets and not bare Pods. The other is that it is generally useful to run multiple copies of things for redundancy if nothing else. Even absent the cluster autoscaler, disks fail, Amazon occasionally arbitrarily decommissions EC2 instances, and nodes otherwise can go offline outside of your control; you often don't want just one copy of something running in your cluster, especially if you're considering it a critical service.
In the autoscaler FAQ on GitHub you can read the following:
What types of pods can prevent CA from removing a node?
- Pods with restrictive PodDisruptionBudget.
- Kube-system pods that:
  - are not run on the node by default, *
  - don't have a pod disruption budget set or their PDB is too restrictive (since CA 0.6).
- Pods that are not backed by a controller object (so not created by deployment, replica set, job, stateful set etc). *
- Pods with local storage. *
- Pods that cannot be moved elsewhere due to various constraints (lack of resources, non-matching node selectors or affinity, matching anti-affinity, etc)
- Pods that have the following annotation set: "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
*Unless the pod has the following annotation (supported in CA 1.0.3 or later): "cluster-autoscaler.kubernetes.io/safe-to-evict": "true"
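The FAQ list, including the starred exception, can be read as a decision procedure. The sketch below is one possible reading in Python, not the autoscaler's implementation: the pod is a plain dict with made-up keys (`pdb_too_restrictive`, `fits_elsewhere`, `runs_on_node_by_default`, `has_pdb`, `has_controller`, `uses_local_storage`), and only the annotation name comes from the FAQ.

```python
SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def blocking_reasons(pod: dict) -> list:
    """Return the FAQ reasons why this (simplified) pod would block node removal."""
    annotation = pod.get("annotations", {}).get(SAFE_TO_EVICT)
    if annotation == "false":
        return ["annotated safe-to-evict=false"]

    reasons = []
    if pod.get("pdb_too_restrictive", False):
        reasons.append("restrictive PodDisruptionBudget")
    if not pod.get("fits_elsewhere", True):
        reasons.append("cannot be moved elsewhere")

    # The starred items are waived when the pod carries safe-to-evict: "true".
    if annotation != "true":
        if (pod.get("namespace") == "kube-system"
                and not pod.get("runs_on_node_by_default", False)
                and not pod.get("has_pdb", False)):
            reasons.append("kube-system pod without a PDB")
        if not pod.get("has_controller", True):
            reasons.append("not backed by a controller")
        if pod.get("uses_local_storage", False):
            reasons.append("uses local storage")
    return reasons


print(blocking_reasons({"uses_local_storage": True}))
# ['uses local storage']
print(blocking_reasons({"uses_local_storage": True,
                        "annotations": {SAFE_TO_EVICT: "true"}}))
# []
print(blocking_reasons({"annotations": {SAFE_TO_EVICT: "false"}}))
# ['annotated safe-to-evict=false']
```

In this reading, "false" always pins the pod (and therefore its node), while "true" only lifts the starred restrictions; a restrictive PDB or a pod that fits nowhere else still blocks removal.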

Google Kubernetes: worker pool not scaling down to zero

I'm setting up a GKE cluster on Google Kubernetes Engine to run some heavy jobs. I have a render-pool of big machines that I want to autoscale from 0 to N (using the cluster autoscaler). My default-pool is a cheap g1-small to run the system pods (those never go away so the default pool can't autoscale to 0, too bad).
My problem is that the render-pool doesn't want to scale down to 0. It has some system pods running on it; are those the problem? The default pool has plenty of resources to run all of them as far as I can tell. I've read the autoscaler FAQ, and it looks like it should delete my node after 10 min of inactivity. I've waited an hour though.
I created the render pool like this:
gcloud container node-pools create render-pool-1 --cluster=test-zero-cluster-2 \
--disk-size=60 --machine-type=n2-standard-8 --image-type=COS \
--disk-type=pd-standard --preemptible --num-nodes=1 --max-nodes=3 --min-nodes=0 \
--enable-autoscaling
The cluster-autoscaler-status configmap says ScaleDown: NoCandidates and it is probing the pool frequently, as it should.
What am I doing wrong, and how do I debug it? Can I see why the autoscaler doesn't think it can delete the node?
As pointed out in the comments, some pods will, under specific circumstances, prevent the CA from scaling down.
In GKE, you have logging pods (fluentd), kube-dns, monitoring, etc., all considered system pods. This means that any node where they are scheduled will not be a candidate for scale-down.
Considering this, it all boils down to creating a scenario in which all of the conditions for scale-down are met.
Since you only want to scale down a specific node pool, I'd use taints and tolerations to keep system pods in the default pool.
For GKE specifically, you can pick each app by its k8s-app label, for instance:
$ kubectl taint nodes GPU-NODE k8s-app=heapster:NoSchedule
This will prevent Heapster from being scheduled on the tainted nodes.
Although it is not recommended, you can go broader and try to cover all the GKE system pods by using kubernetes.io/cluster-service instead:
$ kubectl taint nodes GPU-NODE kubernetes.io/cluster-service=true:NoSchedule
Just be careful: the scope of this label is broader, and you'll have to keep track of upcoming changes, as this label may be deprecated someday.
Another thing you might want to consider is using Pod Disruption Budgets. This might be more effective for stateless workloads, but setting the budget very tight is likely to cause instability.
The idea of a PDB is to tell GKE the minimum number of pods that must be running at any given time, which allows the CA to evict the rest. It can be applied to system pods as shown below:
apiVersion: policy/v1  # use policy/v1beta1 on clusters older than Kubernetes 1.21
kind: PodDisruptionBudget
metadata:
  name: dns-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      k8s-app: kube-dns
This tells GKE that, although there are usually 3 replicas of kube-dns, the application can take 2 disruptions and temporarily get by with only 1 replica, allowing the CA to evict these pods and reschedule them on other nodes.
As you probably noticed, this will put stress on DNS resolution in the cluster (in this particular example), so be careful.
Finally, regarding how to debug the CA: for now, consider that GKE is a managed version of Kubernetes where you don't really have direct access to tweak some features (for better or worse). You cannot set flags in the CA, and access to its logs may only be possible through GCP support. The idea is to protect the workloads running in the cluster rather than to minimize cost.
Scaling down in GKE is more about combining different Kubernetes features until the CA conditions for scale-down are met.

Resizing a google cloud Kubernetes cluster to zero not working

I try to resize a kubernetes cluster to zero nodes using
gcloud container clusters resize $CLUSTER_NAME --size=0 --zone $ZONE
I get a success message but the size of the node-pool remains the same (I use only one node pool)
Is it possible to resize the cluster to zero?
Sometimes you just need to wait 10-20 minutes before the autoscale operation takes effect.
In other cases, you may need to check whether the conditions for scaling down the node are met.
According to the autoscaler documentation:
Cluster autoscaler also measures the usage of each node against the node pool's total demand for capacity. If a node has had no new Pods scheduled on it for a set period of time, and all Pods running on that node can be scheduled onto other nodes in the pool, the autoscaler moves the Pods and deletes the node.
Note that cluster autoscaler works based on Pod resource requests, that is, how many resources your Pods have requested. Cluster autoscaler does not take into account the resources your Pods are actively using. Essentially, cluster autoscaler trusts that the Pod resource requests you've provided are accurate and schedules Pods on nodes based on that assumption.
Note: Beginning with Kubernetes version 1.7, you can specify a minimum size of zero for your node pool. This allows your node pool to scale down completely if the instances within aren't required to run your workloads. However, while a node pool can scale to a zero size, the overall cluster size does not scale down to zero nodes (as at least one node is always required to run system Pods)
Cluster autoscaler has following limitations:
- When scaling down, cluster autoscaler supports a graceful termination period for a Pod of up to 10 minutes. A Pod is always killed after a maximum of 10 minutes, even if the Pod is configured with a higher grace period.
Note: Every change you make to the cluster autoscaler causes the Kubernetes master to restart, which takes several minutes to complete.
However, there are cases mentioned in FAQ that can prevent CA from removing a node:
What types of pods can prevent CA from removing a node?
- Pods with restrictive PodDisruptionBudget.
- Kube-system pods that:
  - are not run on the node by default, *
  - don't have PDB or their PDB is too restrictive (since CA 0.6).
- Pods that are not backed by a controller object (so not created by deployment, replica set, job, stateful set etc). *
- Pods with local storage. *
- Pods that cannot be moved elsewhere due to various constraints (lack of resources, non-matching node selectors or affinity, matching anti-affinity, etc)
*Unless the pod has the following annotation (supported in CA 1.0.3 or later):
"cluster-autoscaler.kubernetes.io/safe-to-evict": "true"
How can I scale my cluster to just 1 node?
Prior to version 0.6, Cluster Autoscaler was not touching nodes that were running important kube-system pods like DNS, Heapster, Dashboard etc. If these pods landed on different nodes, CA could not scale the cluster down and the user could end up with a completely empty 3 node cluster. In 0.6, we added an option to tell CA that some system pods can be moved around. If the user configures a PodDisruptionBudget for the kube-system pod, then the default strategy of not touching the node running this pod is overridden with PDB settings. So, to enable kube-system pods migration, one should set minAvailable to 0 (or <= N if there are N+1 pod replicas.) See also I have a couple of nodes with low utilization, but they are not scaled down. Why?
How can I scale a node group to 0?
From CA 0.6 for GCE/GKE and CA 0.6.1 for AWS, it is possible to scale a node group to 0 (and obviously from 0), assuming that all scale-down conditions are met.
For AWS, if you are using nodeSelector, you need to tag the ASG with a node-template key "k8s.io/cluster-autoscaler/node-template/label/".
For example, for a node label of foo=bar, you would tag the ASG with:
{
  "ResourceType": "auto-scaling-group",
  "ResourceId": "foo.example.com",
  "PropagateAtLaunch": true,
  "Value": "bar",
  "Key": "k8s.io/cluster-autoscaler/node-template/label/foo"
}
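For more than one label, the same tag structure can be generated rather than written by hand. The helper below is a small Python sketch that only builds the JSON shown above; the function name `asg_label_tag` is hypothetical, and applying the tag to the ASG (for example through the AWS CLI or an SDK) is deliberately left out.

```python
import json

NODE_TEMPLATE_LABEL_PREFIX = "k8s.io/cluster-autoscaler/node-template/label/"


def asg_label_tag(asg_name: str, label_key: str, label_value: str) -> dict:
    """Build the ASG tag that advertises a node label to the cluster autoscaler."""
    return {
        "ResourceType": "auto-scaling-group",
        "ResourceId": asg_name,
        "PropagateAtLaunch": True,
        "Value": label_value,
        "Key": NODE_TEMPLATE_LABEL_PREFIX + label_key,
    }


# Reproduces the foo=bar example from the FAQ excerpt.
print(json.dumps(asg_label_tag("foo.example.com", "foo", "bar"), indent=2))
```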

What's the purpose of Kubernetes DaemonSet when replication controllers have node anti-affinity

DaemonSet is a Kubernetes beta resource that can ensure that exactly one pod is scheduled to a group of nodes. The group of nodes is all nodes by default, but can be limited to a subset using nodeSelector or the Alpha feature of node affinity/anti-affinity.
It seems that DaemonSet functionality can be achieved with replication controllers/replica sets with proper node affinity and anti-affinity.
Am I missing something? If that's correct should DaemonSet be deprecated before it even leaves Beta?
As you said, DaemonSet guarantees one pod per node for a subset of the nodes in the cluster. If you use ReplicaSet instead, you need to
1. use the node affinity/anti-affinity and/or node selector to control the set of nodes to run on (similar to how DaemonSet does it).
2. use inter-pod anti-affinity to spread the pods across the nodes.
3. make sure the number of pods is greater than the number of nodes in the set, so that every node has one pod scheduled.
However, ensuring (3) is a chore as the set of nodes can change over time. With DaemonSet, you don't have to worry about that, nor would you need to create extra, unschedulable pods. On top of that, DaemonSet does not rely on the scheduler to assign its pods, which makes it useful for cluster bootstrap (see How Daemon Pods are scheduled).
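The bookkeeping problem with a fixed replica count can be illustrated with a toy calculation in Python. It is a deliberately simplified model with hypothetical function names: it assumes strict one-pod-per-node anti-affinity and ignores resources, taints, and timing.

```python
def replica_set_coverage(node_count: int, replicas: int) -> dict:
    """One-pod-per-node anti-affinity with a fixed replica count (toy model)."""
    scheduled = min(node_count, replicas)
    return {
        "nodes_without_pod": node_count - scheduled,
        "pending_pods": replicas - scheduled,
    }


def daemon_set_coverage(node_count: int) -> dict:
    """A DaemonSet creates exactly one pod per eligible node."""
    return {"nodes_without_pod": 0, "pending_pods": 0}


# Three nodes, four replicas: every node is covered, one pod stays pending.
print(replica_set_coverage(node_count=3, replicas=4))
# {'nodes_without_pod': 0, 'pending_pods': 1}

# The pool grows to five nodes but replicas is still 4: one node has no pod.
print(replica_set_coverage(node_count=5, replicas=4))
# {'nodes_without_pod': 1, 'pending_pods': 0}

print(daemon_set_coverage(node_count=5))
# {'nodes_without_pod': 0, 'pending_pods': 0}
```

Either the replica count is too high and pods sit unschedulable, or it is too low and a node is missed; a DaemonSet avoids having to track that number at all.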
See the "Alternative to DaemonSet" section in the DaemonSet doc for more comparisons. DaemonSet is still the easiest way to run a per-node daemon without external tools.