GKE autoscaler not scaling down the nodes - kubernetes

I have created a Google Kubernetes Engine cluster with autoscaling enabled and minimum and maximum node counts set. A few days ago I deployed a couple of servers to production, which increased the node count as expected. But when I deleted those deployments, I expected the node pool to scale down. I waited more than an hour, but it still did not scale down.
All my other pods are managed by a ReplicaSet, since I deployed them with kind: Deployment.
All my StatefulSet pods use a PVC as their volume.
I'm not sure what prevented the nodes from scaling down, so I manually scaled the nodes for now. Since I made the changes manually, I cannot get the autoscaler logs now.
Does anyone know what could be the issue here?
GKE version is 1.16.15-gke.4300
As mentioned in this link
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node
I'm not using any local storage.
My pods don't have a PodDisruptionBudget (I don't know what that is).
The pods are created by Deployments (Helm charts).
The only thing is that I don't have the "cluster-autoscaler.kubernetes.io/safe-to-evict": "true" annotation. Is it required?

I have tested Cluster Autoscaler on my GKE cluster. It works a bit differently than you expected.
Background
You can enable autoscaling with a command, or enable it during cluster creation, as described in this documentation.
In the Cluster Autoscaler documentation you can find information such as operation criteria, limitations, etc.
As I mentioned in the comments, according to Cluster Autoscaler - Frequently Asked Questions, CA won't remove a node if it encounters one of the situations below:
Pods with restrictive PodDisruptionBudget.
Kube-system pods that:
are not run on the node by default, *
don't have a pod disruption budget set or their PDB is too restrictive (since CA 0.6).
Pods that are not backed by a controller object (so not created by deployment, replica set, job, statefulset etc). *
Pods with local storage. *
Pods that cannot be moved elsewhere due to various constraints (lack of resources, non-matching node selectors or affinity, matching anti-affinity, etc)
Pods that have the following annotation set:
"cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
* Unless the pod has the annotation "cluster-autoscaler.kubernetes.io/safe-to-evict": "true".
For my tests I used 6 nodes with an autoscaling range of 1-6, and an nginx application with requests of cpu: 200m and memory: 128Mi.
As the OP mentioned that they are not able to provide autoscaler logs, I will paste my logs from Logs Explorer. How to obtain them is described in the Viewing cluster autoscaler events documentation.
In those logs you should search for noScaleDown events. They contain several pieces of information, but the most important is:
reason: {
parameters: [
0: "kube-dns-66d6b7c877-hddgs"
]
messageId: "no.scale.down.node.pod.kube.system.unmovable"
As it's described in NoScaleDown node-level reasons for "no.scale.down.node.pod.kube.system.unmovable":
Pod is blocking scale down because it's a non-daemonset, non-mirrored, non-pdb-assigned kube-system pod. See the Kubernetes Cluster Autoscaler FAQ for more details.
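The same events can also be pulled from the command line. The following is only a sketch: the helper name noscaledown_filter is hypothetical, the cluster name cluster-1 is an example, and the log ID and the jsonPayload field path are assumptions based on the event structure shown above, so check them against your own log entries before relying on them.

```shell
# noscaledown_filter CLUSTER_NAME
# Prints a Cloud Logging filter for noScaleDown events of one cluster.
# Assumption: the events live in the cluster-autoscaler-visibility log
# under jsonPayload.noDecisionStatus.noScaleDown.
noscaledown_filter() {
  printf '%s\n' \
    'resource.type="k8s_cluster"' \
    "resource.labels.cluster_name=\"$1\"" \
    'log_id("container.googleapis.com/cluster-autoscaler-visibility")' \
    'jsonPayload.noDecisionStatus.noScaleDown:*'
}

# Example (not run here):
#   gcloud logging read "$(noscaledown_filter cluster-1)" --limit 5 --format json
```

Each line of the filter is an implicit AND, so the result is limited to noScaleDown entries of the named cluster.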
Solution
If you want Cluster Autoscaler to scale down on GKE, you have to create PodDisruptionBudgets (PDBs) with the proper settings. How to create them is described in How to set PDBs to enable CA to move kube-system pods?
kubectl create poddisruptionbudget <pdb name> --namespace=kube-system --selector app=<app name> --max-unavailable 1
Here you have to specify the correct selector, and choose --max-unavailable or --min-available depending on your needs. For more details, please read the Specifying a PodDisruptionBudget documentation.
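A PDB can also be written as a manifest and kept in version control. This is a minimal sketch, not the only way to do it: the helper name pdb_manifest is hypothetical, the default policy/v1beta1 API version is an assumption that matches 1.16 clusters (newer clusters serve policy/v1), and the PDB name and label in the example are placeholders.

```shell
# pdb_manifest NAME LABEL_KEY LABEL_VALUE [API_VERSION]
# Prints a PodDisruptionBudget for the kube-system namespace that allows
# one pod matching the label to be unavailable at a time.
pdb_manifest() {
  cat <<EOF
apiVersion: ${4:-policy/v1beta1}
kind: PodDisruptionBudget
metadata:
  name: $1
  namespace: kube-system
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      $2: $3
EOF
}

# Example (not run here):
#   pdb_manifest kube-dns-pdb k8s-app kube-dns | kubectl apply -f -
```

Check the labels of the target pods first with kubectl get pods -n kube-system --show-labels, since a selector that matches nothing gives no protection and no error.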
Tests
$ kubectl get deploy,nodes
NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/nginx-deployment   16/16   16           16          66m
NAME                                            STATUS   ROLES    AGE     VERSION
node/gke-cluster-1-default-pool-6d42fa0a-1ckn   Ready    <none>   11m     v1.16.15-gke.6000
node/gke-cluster-1-default-pool-6d42fa0a-2j4j   Ready    <none>   11m     v1.16.15-gke.6000
node/gke-cluster-1-default-pool-6d42fa0a-388n   Ready    <none>   3h33m   v1.16.15-gke.6000
node/gke-cluster-1-default-pool-6d42fa0a-5x35   Ready    <none>   3h33m   v1.16.15-gke.6000
node/gke-cluster-1-default-pool-6d42fa0a-pdfk   Ready    <none>   3h33m   v1.16.15-gke.6000
node/gke-cluster-1-default-pool-6d42fa0a-wqtm   Ready    <none>   11m     v1.16.15-gke.6000
$ kubectl get pdb -A
NAMESPACE     NAME      MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
kube-system   kubedns   1               N/A               1                     43m
Scale down the deployment
$ kubectl scale deploy nginx-deployment --replicas=2
deployment.apps/nginx-deployment scaled
After a while (~10-15 minutes) in the event viewer you will find the Decision event and inside you will find information that the node was deleted.
...
scaleDown: {
nodesToBeRemoved: [
0: {
node: {
mig: {
zone: "europe-west2-c"
nodepool: "default-pool"
name: "gke-cluster-1-default-pool-6d42fa0a-grp"
}
name: "gke-cluster-1-default-pool-6d42fa0a-wqtm"
Number of nodes decreased:
$ kubectl get nodes
NAME                                       STATUS   ROLES    AGE     VERSION
gke-cluster-1-default-pool-6d42fa0a-2j4j   Ready    <none>   30m     v1.16.15-gke.6000
gke-cluster-1-default-pool-6d42fa0a-388n   Ready    <none>   3h51m   v1.16.15-gke.6000
gke-cluster-1-default-pool-6d42fa0a-5x35   Ready    <none>   3h51m   v1.16.15-gke.6000
gke-cluster-1-default-pool-6d42fa0a-pdfk   Ready    <none>   3h51m   v1.16.15-gke.6000
Another place where you can confirm it's scaling down is kubectl get events --sort-by='.metadata.creationTimestamp'
Output:
5m16s Normal NodeNotReady node/gke-cluster-1-default-pool-6d42fa0a-wqtm Node gke-cluster-1-default-pool-6d42fa0a-wqtm status is now: NodeNotReady
4m56s Normal NodeNotReady node/gke-cluster-1-default-pool-6d42fa0a-1ckn Node gke-cluster-1-default-pool-6d42fa0a-1ckn status is now: NodeNotReady
4m Normal Deleting node gke-cluster-1-default-pool-6d42fa0a-wqtm because it does not exist in the cloud provider node/gke-cluster-1-default-pool-6d42fa0a-wqtm Node gke-cluster-1-default-pool-6d42fa0a-wqtm event: DeletingNode
3m55s Normal RemovingNode node/gke-cluster-1-default-pool-6d42fa0a-wqtm Node gke-cluster-1-default-pool-6d42fa0a-wqtm event: Removing Node gke-cluster-1-default-pool-6d42fa0a-wqtm from Controller
3m50s Normal Deleting node gke-cluster-1-default-pool-6d42fa0a-1ckn because it does not exist in the cloud provider node/gke-cluster-1-default-pool-6d42fa0a-1ckn Node gke-cluster-1-default-pool-6d42fa0a-1ckn event: DeletingNode
3m45s Normal RemovingNode node/gke-cluster-1-default-pool-6d42fa0a-1ckn Node gke-cluster-1-default-pool-6d42fa0a-1ckn event: Removing Node gke-cluster-1-default-pool-6d42fa0a-1ckn from Controller
Conclusion
By default, kube-system pods prevent CA from removing nodes on which they are running. Users can manually add PDBs for the kube-system pods that can be safely rescheduled elsewhere. It can be achieved using:
kubectl create poddisruptionbudget <pdb name> --namespace=kube-system --selector app=<app name> --max-unavailable 1
A list of possible reasons why CA won't scale down can be found in Cluster Autoscaler - Frequently Asked Questions.
To verify which pods could still block CA downscale, you can use Autoscaler Events.

Related

Kubernetes pod fail/restart simulation

We have a data visualization server hosted in Kubernetes pods. The dashboards in that data viz are displayed in the browser of different monitors/terminals for near-real time operational reporting. Sometimes the pods fail, and when they come alive again, the browser redirects to Single Sign-On page instead of going to the dashboard the URL is originally configured to.
The servers are hosted in, I presume, a ReplicaSet. There are two pods, as far as I can tell.
I was granted privilege on using kubectl to solve this problem, but still quite new with the whole Kubernetes thing. Using kubectl, how do I simulate pod failure/restart for testing purposes? Since the pods are in duplicate, shutting one of them will only redirect the traffic to the other pod. How to make both pods fail/restart at the same time? (I guess doing kubectl delete pod on both pods will do, but I want to make sure k8s will respawn the pods automatically, and not delete them forever).
If I understand the use case correctly, you might want to use the kubectl scale command. It lets you set the replica count anywhere from zero to N with a single command. See the examples below. Also, if you are using a Deployment, you can just run kubectl delete pod; the Deployment controller will spawn a new pod to satisfy the replica count.
kubectl scale deployment/<DEPLOYMENT-NAME> --replicas=<DESIRED-NUMBER-OF-REPLICA>
Short example:
kubectl scale deployment/deployment-web --replicas=0
deployment.apps/deployment-web scaled
Long Example:
# create a deployment called deployment-web with two replicas
kubectl create deployment deployment-web --image=nginx --replicas 2
deployment.apps/deployment-web created
# verify that both replicas are up
kubectl get deployments.apps
NAME READY UP-TO-DATE AVAILABLE AGE
deployment-web 2/2 2 2 13s
# expose the deployment with a service [OPTIONAL STEP, ONLY FOR EXPLANATION]
kubectl expose deployment deployment-web --port 80
service/deployment-web exposed
# verify that the service is created
kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
deployment-web ClusterIP 10.233.24.174 <none> 80/TCP 5s
# dump the list of endpoints for that service; there is one for each replica. Notice the two IPs in the 2nd column.
kubectl get ep
NAME ENDPOINTS AGE
deployment-web 10.233.111.6:80,10.233.115.9:80 12s
# scale the deployment down to 1 replica
kubectl scale --current-replicas=2 --replicas=1 deployment/deployment-web
deployment.apps/deployment-web scaled
# notice that the endpoint list is reduced from 2 to 1
kubectl get ep
NAME ENDPOINTS AGE
deployment-web 10.233.115.9:80 43s
# also note that there is only one pod remaining
kubectl get pod
NAME READY STATUS RESTARTS AGE
deployment-web-64c769b44-qh2qf 1/1 Running 0 105s
# scale down to zero replicas
kubectl scale --current-replicas=1 --replicas=0 deployment/deployment-web
deployment.apps/deployment-web scaled
# the endpoint list is empty
kubectl get ep
NAME ENDPOINTS AGE
deployment-web <none> 9m4s
# also, both pods are gone
kubectl get pod
No resources found in default namespace.
# when you are done with testing, restore the replicas
kubectl scale --current-replicas=0 --replicas=2 deployment/deployment-web
deployment.apps/deployment-web scaled
# endpoints and pods are restored
kubectl get ep
NAME ENDPOINTS AGE
deployment-web 10.233.111.8:80,10.233.115.11:80 10m
foo-svc 10.233.115.6:80 50m
kubernetes 192.168.22.9:6443 6d23h
kubectl get pod -l app=deployment-web
NAME READY STATUS RESTARTS AGE
deployment-web-64c769b44-b72k5 1/1 Running 0 8s
deployment-web-64c769b44-mt2dd 1/1 Running 0 8s
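To take all replicas down at the same moment without changing the replica count, the pods can be deleted by label in one call. A minimal sketch, assuming the pods carry the app=deployment-web label used above and are owned by a Deployment (or another controller) that recreates them; the helper name restart_pods_by_label and the KUBECTL override are hypothetical conveniences.

```shell
# restart_pods_by_label LABEL_SELECTOR
# Deletes every pod matching the selector in one call, then lists the
# replacement pods that the owning controller creates.
# Set KUBECTL=echo to print the commands instead of running them.
restart_pods_by_label() {
  "${KUBECTL:-kubectl}" delete pod -l "$1" &&
    "${KUBECTL:-kubectl}" get pod -l "$1"
}

# Example (not run here):
#   restart_pods_by_label app=deployment-web
```

If the pods had no controller behind them, the delete would be permanent, so it is worth confirming the owner before trying this outside a test namespace.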

Pods still there after running kubectl delete pods

I want to remove zk and kafka from my k8s
$ kubectl get pods
NAME               READY   STATUS             RESTARTS   AGE
kafka1-mvzch       1/1     Running            1          25s
kafka2-m292k       0/1     CrashLoopBackOff   8          20m
zookeeper1-qhmnf   1/1     Running            0          20m
zookeeper2-t7r8w   1/1     Running            0          20m
$ kubectl delete pod kafka1-mvzch kafka2-m292k zookeeper1-qhmnf zookeeper2-t7r8w
pod "kafka1-mvzch" deleted
pod "kafka2-m292k" deleted
pod "zookeeper1-qhmnf" deleted
pod "zookeeper2-t7r8w" deleted
But when I run get pods, it still shows the pods.
And I have no Services or Deployments:
$ kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.100.0.1 <none> 443/TCP 7h1m
$ kubectl get deployment
No resources found in default namespace.
You are removing the pods, and they will be deleted.
But there is some other construct that re-creates pods to replace the (now deleted) previous pods.
In fact, the names of the pods with the random-looking suffix suggest that there is another controller operating the pods.
Looking at the linked tutorial, you can see that a ReplicationController is created. It ensures that the pods keep running and replaces any that are deleted.
If you want to remove it, remove the replication controller; the pods will be deleted as well.
You can use kubectl get pod <pod-name> -o jsonpath='{.metadata.ownerReferences}' to identify the owner object of a pod. The owner might be a ReplicationController, ReplicaSet, StatefulSet, etc.
Looking at the medium.com guide that you mentioned, I see that they suggest to create ReplicationControllers.
You can cleanup your namespace by running kubectl delete replicationcontroller --all.
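To see the owner of every pod at once before deleting anything, the owner references can be printed as columns. A minimal sketch: the helper name pod_owners and the KUBECTL override are hypothetical, and only the first owner reference of each pod is shown.

```shell
# pod_owners [NAMESPACE]
# Lists each pod with the kind and name of its first owner reference.
# Set KUBECTL=echo to print the command instead of running it.
pod_owners() {
  "${KUBECTL:-kubectl}" get pods -n "${1:-default}" \
    -o 'custom-columns=POD:.metadata.name,OWNER_KIND:.metadata.ownerReferences[0].kind,OWNER_NAME:.metadata.ownerReferences[0].name'
}

# Example (not run here):
#   pod_owners default
```

Deleting the object named in the OWNER_KIND and OWNER_NAME columns is what removes the pods for good.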

Why isn't GKE scaling down cluster nodes even though I only have one pod?

I know there are some existing questions out there; they usually refer to this: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#i-have-a-couple-of-nodes-with-low-utilization-but-they-are-not-scaled-down-why
But I'm still having trouble debugging. I only have 1 pod running on my cluster, so I don't see why it wouldn't scale down to 1 node. How can I debug this further?
Here's some info:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
gke-qua-gke-foobar1234-default-pool-6302174e-4k84 Ready <none> 4h14m v1.14.10-gke.27
gke-qua-gke-foobar1234-default-pool-6302174e-6wfs Ready <none> 16d v1.14.10-gke.27
gke-qua-gke-foobar1234-default-pool-6302174e-74lm Ready <none> 4h13m v1.14.10-gke.27
gke-qua-gke-foobar1234-default-pool-6302174e-m223 Ready <none> 4h13m v1.14.10-gke.27
gke-qua-gke-foobar1234-default-pool-6302174e-srlg Ready <none> 66d v1.14.10-gke.27
kubectl get pods
NAME READY STATUS RESTARTS AGE
qua-gke-foobar1234-5959446675-njzh4 1/1 Running 0 14m
nodePools:
- autoscaling:
    enabled: true
    maxNodeCount: 10
    minNodeCount: 1
  config:
    diskSizeGb: 100
    diskType: pd-standard
    imageType: COS
    machineType: n1-highcpu-32
    metadata:
      disable-legacy-endpoints: 'true'
    oauthScopes:
    - https://www.googleapis.com/auth/datastore
    - https://www.googleapis.com/auth/devstorage.full_control
    - https://www.googleapis.com/auth/pubsub
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/monitoring
    serviceAccount: default
    shieldedInstanceConfig:
      enableIntegrityMonitoring: true
  initialNodeCount: 1
  instanceGroupUrls:
  - https://www.googleapis.com/compute/v1/projects/fooooobbbarrr-dev/zones/us-central1-a/instanceGroupManagers/gke-qua-gke-foobar1234-default-pool-6302174e-grp
  locations:
  - us-central1-a
  management:
    autoRepair: true
    autoUpgrade: true
  name: default-pool
  podIpv4CidrSize: 24
  selfLink: https://container.googleapis.com/v1/projects/ffoooobarrrr-dev/locations/us-central1/clusters/qua-gke-foobar1234/nodePools/default-pool
  status: RUNNING
  version: 1.14.10-gke.27
kubectl describe horizontalpodautoscaler
Name: qua-gke-foobar1234
Namespace: default
Labels: <none>
Annotations: autoscaling.alpha.kubernetes.io/conditions:
[{"type":"AbleToScale","status":"True","lastTransitionTime":"2020-03-17T19:59:19Z","reason":"ReadyForNewScale","message":"recommended size...
autoscaling.alpha.kubernetes.io/current-metrics:
[{"type":"External","external":{"metricName":"pubsub.googleapis.com|subscription|num_undelivered_messages","metricSelector":{"matchLabels"...
autoscaling.alpha.kubernetes.io/metrics:
[{"type":"External","external":{"metricName":"pubsub.googleapis.com|subscription|num_undelivered_messages","metricSelector":{"matchLabels"...
kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"autoscaling/v2beta1","kind":"HorizontalPodAutoscaler","metadata":{"annotations":{},"name":"qua-gke-foobar1234","namespace":...
CreationTimestamp: Tue, 17 Mar 2020 12:59:03 -0700
Reference: Deployment/qua-gke-foobar1234
Min replicas: 1
Max replicas: 10
Deployment pods: 1 current / 1 desired
Events: <none>
HorizontalPodAutoscaler will increase or decrease the number of pods, not nodes. It doesn't have anything to do with the node scaling.
Node scaling is handled by the cloud provider, in your case, by Google Cloud Platform.
You should check if you have node autoscaler enabled or not from the GCP console.
You should follow these steps:
1. Go to the Kubernetes clusters screen in the GCP console
2. Click on your cluster
3. From the bottom, click on the node pool you want to enable autoscaling for
4. Click "edit"
5. Enable autoscaling, define the minimum and maximum number of nodes, and save.
Alternatively, use the gcloud CLI:
gcloud container clusters update cluster-name --enable-autoscaling \
--min-nodes 1 --max-nodes 10 --zone compute-zone --node-pool default-pool
So the original problem with my debugging attempt was that I ran kubectl get pods and not kubectl get pods --all-namespaces, so I couldn't see the pods running in the system namespaces. I then added PDBs for all the system pods:
kubectl create poddisruptionbudget pdb-event --namespace=kube-system --selector k8s-app=event-exporter --max-unavailable 1 &&
kubectl create poddisruptionbudget pdb-fluentd-scaler --namespace=kube-system --selector k8s-app=fluentd-gcp-scaler --max-unavailable 1 &&
kubectl create poddisruptionbudget pdb-heapster --namespace=kube-system --selector k8s-app=heapster --max-unavailable 1 &&
kubectl create poddisruptionbudget pdb-dns --namespace=kube-system --selector k8s-app=kube-dns --max-unavailable 1 &&
kubectl create poddisruptionbudget pdb-dnsauto --namespace=kube-system --selector k8s-app=kube-dns-autoscaler --max-unavailable 1 &&
kubectl create poddisruptionbudget pdb-glbc --namespace=kube-system --selector k8s-app=glbc --max-unavailable 1
I then started getting this error in the events of some of the PDBs: controllermanager Failed to calculate the number of expected pods: found no controllers for pod. I saw it in the PDB events when I ran kubectl describe pdb --all-namespaces. I don't know why it occurred, but I removed those PDBs, and then everything started working!
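What was missing from the first debugging attempt was a view of every pod on a node, not only those in the default namespace. A minimal sketch of that check; the helper name pods_on_node and the KUBECTL override are hypothetical, and the node name in the example is taken from the listing above.

```shell
# pods_on_node NODE_NAME
# Lists pods from every namespace that are scheduled on the given node,
# including kube-system pods that a plain "kubectl get pods" hides.
# Set KUBECTL=echo to print the command instead of running it.
pods_on_node() {
  "${KUBECTL:-kubectl}" get pods --all-namespaces -o wide \
    --field-selector "spec.nodeName=$1"
}

# Example (not run here):
#   pods_on_node gke-qua-gke-foobar1234-default-pool-6302174e-4k84
```

Any non-DaemonSet kube-system pod in that list without a PDB is a candidate for what is keeping the node alive.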
I had the same issue and the cause was the lack of PDBs for workloads running in kube-system NS. You can check the "Autoscaler Logs" tab.
If you don't configure PDBs, cluster autoscaler won't remove surplus GKE nodes. https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node
There is an interesting discussion about whether there should be some default behaviour or PDB. https://github.com/kubernetes/kubernetes/issues/35318

Kubernetes pods are pending not active

If I run this:
kubectl get pods -n kube-system
I get this output:
NAME READY STATUS RESTARTS AGE
coredns-6fdd4f6856-6bl64 0/1 Pending 0 1h
coredns-6fdd4f6856-xgrbm 0/1 Pending 0 1h
kubernetes-dashboard-65c76f6c97-c69jg 0/1 Pending 0 13m
Supposedly I need a Kubernetes scheduler in order to actually launch containers? Does anyone know how to start a kube-scheduler?
More than a Kubernetes scheduler issue, it looks like it's more about not having enough resources on your nodes (or no nodes at all) in your cluster to schedule any workloads. You can check your nodes with:
$ kubectl get nodes
Also, you will probably not see any control plane components in the kube-system namespace, because you may be using a managed service such as EKS or GKE.
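The scheduler's own events usually say why a pod stays Pending. A minimal sketch that lists the Pending pods and the FailedScheduling events in a namespace; the helper name why_pending and the KUBECTL override are hypothetical.

```shell
# why_pending [NAMESPACE]
# Shows Pending pods and any FailedScheduling events in the namespace.
# Set KUBECTL=echo to print the commands instead of running them.
why_pending() {
  ns="${1:-kube-system}"
  "${KUBECTL:-kubectl}" get pods -n "$ns" --field-selector status.phase=Pending
  "${KUBECTL:-kubectl}" get events -n "$ns" --field-selector reason=FailedScheduling
}

# Example (not run here):
#   why_pending kube-system
```

If pods are Pending but there are no FailedScheduling events at all, that points toward no scheduler running; events mentioning insufficient resources or no available nodes point toward the node problem described above.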

Autoscaler not scaling up leaving nodes in NotReady state and pods in Unknown state

I am running a cluster on GKE with a single node pool. It has 3 nodes and can scale from 1 to 99 nodes. The cluster uses the nginx-ingress controller.
On this cluster, I want to deploy apps. An app is scoped by a namespace and consists of 3 deployments and one ingress (defining paths to access the application from the internet). Each deployment runs a single replica of a container.
Deploying a couple of apps works fine, but deploying many apps (requiring the node pool to scale up) breaks everything:
All pods start having warnings (including those successfully deployed earlier)
kubectl get pods --namespace bcd
NAME READY STATUS RESTARTS AGE
actions-664b7d79f5-7qdkw 1/1 Unknown 1 35m
actions-664b7d79f5-v8s2m 1/1 Running 1 18m
core-85cb74f89b-ns49z 1/1 Unknown 1 35m
core-85cb74f89b-qqzfp 1/1 Running 1 18m
nlu-77899ddbf-8pd7k 1/1 Running 1 27m
All nodes become NotReady:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
gke-clients-projects-default-pool-f9af73d4-gzwr NotReady <none> 42m v1.9.7-gke.6
gke-clients-projects-default-pool-f9af73d4-p5l2 NotReady <none> 21m v1.9.7-gke.6
gke-clients-projects-default-pool-f9af73d4-wnxc NotReady <none> 37m v1.9.7-gke.6
Deleting the namespace to remove all resources from the cluster also seems to fail as after a long while the pods remain active but still in an unknown state.
How can I safely add more apps and let the cluster autoscale?
The reason seems to be that, without knowing the resources each pod needs, the scheduler places pods on any available node, potentially exhausting that node's resources and putting the Docker daemon in an inconsistent state.
The solution is to specify resources requests and limits: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container
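For Deployments that already exist, requests and limits can be added with kubectl set resources. This is a minimal sketch: the helper name set_resources and the KUBECTL override are hypothetical, the CPU and memory values are illustrative placeholders rather than recommendations, and the Deployment name core in the example is an assumption inferred from the pod names above.

```shell
# set_resources NAMESPACE DEPLOYMENT
# Sets CPU/memory requests and limits on all containers of a Deployment.
# The values are illustrative; size them from the app's real usage.
# Set KUBECTL=echo to print the command instead of running it.
set_resources() {
  "${KUBECTL:-kubectl}" set resources deployment "$2" -n "$1" \
    --requests=cpu=100m,memory=128Mi \
    --limits=cpu=500m,memory=256Mi
}

# Example (not run here):
#   set_resources bcd core
```

With requests in place, pods that do not fit stay Pending instead of overloading a node, which is the signal the cluster autoscaler uses to add nodes.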