I want to run a CronJob on my GKE in order to perform a batch operation on a daily basis. The ideal scenario would be for my cluster to scale to 0 nodes when the job is not running and to dynamically scale to 1 node and run the job on it every time the schedule is met.
I am first trying to achieve this by using a simple CronJob found in the kubernetes doc that only prints the current time and terminates.
I first created a cluster with the following command:
gcloud container clusters create $CLUSTER_NAME \
--enable-autoscaling \
--min-nodes 0 --max-nodes 1 --num-nodes 1 \
--zone $CLUSTER_ZONE
Then, I created a CronJob with the following description:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox
            args:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster
          restartPolicy: Never
The job is scheduled to run every hour and to print the current time before terminating.
First, I wanted to create the cluster with 0 nodes, but setting --num-nodes 0 results in an error. Why is that? Note that I can manually scale the cluster down to 0 nodes after it has been created.
Second, if my cluster has 0 nodes, the job won't be scheduled because the cluster does not scale to 1 node automatically but instead gives the following error:
Cannot schedule pods: no nodes available to schedule pods.
Third, if my cluster has 1 node, the job runs normally, but afterwards the cluster won't scale down to 0 nodes and stays at 1 node instead. I let my cluster run for two successive jobs and it did not scale down in between. I assume one hour should be long enough for the cluster to do so.
What am I missing?
EDIT: I've got it to work and detailed my solution here.
Update:
Note: Beginning with Kubernetes version 1.7, you can specify a minimum
size of zero for your node pool. This allows your node pool to scale
down completely if the instances within aren't required to run your
workloads.
https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler
Old answer:
Scaling the entire cluster to 0 is not supported, because you always need at least one node for system pods:
See docs
You could create one node pool with a small machine for system pods, and an additional node pool with a big machine where you would run your workload. This way the second node pool can scale down to 0 and you still have space to run the system pods.
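As a rough sketch (the pool name and machine type here are just placeholders), that second pool could be created along these lines:
gcloud container node-pools create batch-pool \
    --cluster $CLUSTER_NAME \
    --zone $CLUSTER_ZONE \
    --machine-type n1-highmem-4 \
    --num-nodes 1 \
    --enable-autoscaling --min-nodes 0 --max-nodes 1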
After trying this, #xEc notes: Also note that there are scenarios in which my node pool wouldn't scale, such as when I created the pool with an initial size of 0 instead of 1.
Initial suggestion:
Perhaps you could run a micro VM, with cron to scale the cluster up, submit a Job (instead of CronJob), wait for it to finish and then scale it back down to 0?
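A minimal sketch of such a wrapper script, assuming a Job manifest named job.yaml and a Job called hello (both placeholders), might look like this:
#!/bin/sh
# Scale the pool up, run the Job, wait for it to finish, then scale back to zero.
gcloud container clusters resize $CLUSTER_NAME --node-pool default-pool \
    --num-nodes 1 --zone $CLUSTER_ZONE --quiet
kubectl create -f job.yaml
kubectl wait --for=condition=complete job/hello --timeout=3600s
kubectl delete job hello
gcloud container clusters resize $CLUSTER_NAME --node-pool default-pool \
    --num-nodes 0 --zone $CLUSTER_ZONE --quiet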
I do not think it's a good idea to tweak GKE for this kind of job. If you really need 0 instances I'd suggest you use either
App Engine Standard Environment, which allows you to scale instances to 0 (https://cloud.google.com/appengine/docs/standard/go/config/appref)
or
Cloud Functions, they are 'instanceless'/serverless anyway. You can use this unofficial guide to trigger your Cloud Functions (https://cloud.google.com/community/tutorials/using-stackdriver-uptime-checks-for-scheduling-cloud-functions)
Related
I'm running a Kubernetes cluster on GKE autopilot
I have pods that do the following: wait for a job, run the job (this can take minutes or hours), then go to the Pod Succeeded state, which causes Kubernetes to restart the pod.
The number of pods I need is variable depending on how many users are on the platform. Each user can request a job that needs a pod to run.
I don't want users to have to wait for pods to scale up so I want to keep a number of extra pods ready and waiting to execute.
The application my pods are running can be in 3 states - { waiting for job, running job, completed job}
Scaling up is fine, as I can just use the scale API and always request to have a certain percentage of pods in the waiting-for-job state.
When scaling down I want to ensure that Kubernetes doesn't kill any pods that are in the running job state.
Should I implement a Custom Horizontal Pod Autoscaler?
Can I configure custom probes for my pod's application state?
I could also use pod priority or a preStop hook.
You can configure horizontal Pod autoscaling to ensure that Kubernetes doesn't kill any pods.
Steps for configuring horizontal pod scaling:
Create the Deployment by applying the nginx.yaml manifest (a minimal sketch of nginx.yaml is shown below the command). Run the following command:
kubectl apply -f nginx.yaml
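If you don't already have an nginx.yaml, a minimal manifest along these lines should work (the name and image are only illustrative; the CPU request matters because the autoscaler below targets CPU utilization):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        resources:
          requests:
            cpu: 100m   # needed so CPU-based autoscaling can compute utilization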
Autoscaling based on resource utilization (the equivalent kubectl command is shown after these steps)
1-Go to the Workloads page in Cloud Console.
2-Click the name of the nginx Deployment.
3-Click Actions > Autoscale.
4-Specify the following values:
-Minimum number of replicas: 1
-Maximum number of replicas: 10
-Auto Scaling metric: CPU
-Target: 50
-Unit: %
5-Click Done.
6-Click Autoscale.
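If you prefer the command line over the Cloud Console, the same autoscaler can be created with kubectl (assuming the Deployment is named nginx):
kubectl autoscale deployment nginx --cpu-percent=50 --min=1 --max=10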
To get a list of Horizontal Pod Autoscalers in the cluster, use the following command:
kubectl get hpa
Guide on how to Configure horizontal pod autoscaling.
You can also refer to this link on auto-scaling rules for a GKE Autopilot cluster using a custom metric in the Cloud Console.
We are running 6 nodes in a K8s cluster. Out of the 6, 2 run RabbitMQ, Redis & Prometheus; we have used a node selector & cordoned those nodes so no other pods get scheduled on those particular nodes.
The application pods run on the remaining 4 nodes; we have around 18-19 microservices.
For GKE there is an open issue in the official K8s repo regarding automatic scale-down: https://github.com/kubernetes/kubernetes/issues/69696#issuecomment-651741837. However, people are suggesting the approach of setting a PDB, and we tested that on Dev/Staging.
What we are looking for now is to pin pods to a particular node pool which does not scale, as we are running single replicas of some services.
As of now, we are thinking of applying affinity to those services which run with a single replica and have no requirement for scaling.
For the scalable services we won't specify any rule, so by default the K8s scheduler will spread pods across different nodes; this way, if any node scales down, we don't face downtime for a service running a single replica.
Affinity example :
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 1
      preference:
        matchExpressions:
        - key: do-not-scale
          operator: In
          values:
          - 'true'
We are planning to use affinity type preferredDuringSchedulingIgnoredDuringExecution instead of requiredDuringSchedulingIgnoredDuringExecution.
Note: here K8s does not first create a new replica on another node during a node drain (scale-down of any node), as we are running single replicas with a rolling update & minAvailable: 25% strategy.
Why: If PodDisruptionBudget is not specified and we have a deployment with one replica, the pod will be terminated and then a new pod will be scheduled on a new node.
To make sure the application stays available during the node draining process, we have to specify a PodDisruptionBudget and create more replicas. If we have 1 pod with minAvailable: 30%, it will refuse to drain the node (scale down).
Please point out any mistakes if you see anything wrong & suggest a better option.
First of all, defining a PodDisruptionBudget does not make much sense when you have only one replica. minAvailable expressed as a percentage is rounded up to an integer, as it represents the minimum number of Pods which need to be available at all times.
Keep in mind that you have no guarantee for any High Availability when launching only one-replica Deployments.
Why: If PodDisruptionBudget is not specified and we have a deployment
with one replica, the pod will be terminated and then a new pod will
be scheduled on a new node.
If you didn't explicitly define the value of maxUnavailable in your Deployment's spec, by default it is set to 25%, which, rounded up to an integer (representing the number of Pods/replicas), equals 1. It means that 1 out of 1 replicas is allowed to be unavailable.
If we have 1 pod with minAvailable: 30% it will refuse to drain node
(scaledown).
A single replica with minAvailable: 30% is rounded up to 1 anyway. 1/1 replicas should still be up and running, so the Pod cannot be evicted and the node cannot be drained in this case.
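For reference, a PodDisruptionBudget matching that scenario could look like this (the name and label selector are just placeholders):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: single-replica-pdb
spec:
  minAvailable: "30%"   # rounded up to 1, so the only replica can never be evicted
  selector:
    matchLabels:
      app: my-app       # placeholder label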
You can try the following solution, however I'm not 100% sure if it will work when your Pod is re-scheduled to another node due to its eviction from the one it is currently running on.
But if you re-create your Pod, e.g. because you update its image to a new version, you can guarantee that at least one replica stays up and running (the old Pod won't be deleted unless the new one enters the Ready state) by setting maxUnavailable: 0. As per the docs, by default it is set to 25%, which is rounded up to 1. So by default you allow one of your replicas (which in your case happens to be 1/1) to become unavailable during the rolling update. If you set it to zero, the old Pod won't be deleted until the new one becomes Ready. At the same time, maxSurge: 2 allows 2 replicas to exist temporarily at the same time during the update.
Your Deployment definition may begin as follows:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0 👈
      maxSurge: 2
  selector:
    ...
Compare it with this answer, provided by mdaniel, where I originally found it.
I need to scale a set of pods that run queue-based workers. Jobs for workers can run for a long time (hours) and should not get interrupted. The number of pods is based on the length of the worker queue. Scaling would be either using the horizontal autoscaler using custom metrics, or a simple controller that changes the number of replicas.
The problem with either solution is that, when scaling down, there is no control over which pod(s) get terminated. At any given time, most workers are likely working on short running jobs, idle, or (more rarely) processing a long running job. I'd like to avoid killing the long running job workers; idle or short running job workers can be terminated without issue.
What would be a way to do this with low complexity? One thing I can think of is to do this based on CPU usage of the pods. Not ideal, but it could be good enough. Another method could be that workers somehow expose a priority indicating whether they are the preferred pod to be deleted. This priority could change every time a worker picks up a new job though.
Eventually all jobs will be short running and this problem will go away, but that is a longer term goal for now.
Since version 1.22 there is a beta feature that helps you do that. You can add the annotation controller.kubernetes.io/pod-deletion-cost with a value in the range [-2147483647, 2147483647] and this will cause pods with a lower value to be killed first. The default is 0, so a negative value on one pod will cause that pod to be killed first during downscaling, e.g.
kubectl annotate pods my-pod-12345678-abcde controller.kubernetes.io/pod-deletion-cost=-1000
Link to discussion about the implementation of this feature: Scale down a deployment by removing specific pods (PodDeletionCost) #2255
Link to the documentation: ReplicaSet / Pod deletion cost
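As a rough sketch of how a worker could apply this to itself (assuming kubectl is available in the container, the pod name is injected via the downward API as POD_NAME, and the pod's service account is allowed to patch pods):
# Mark this pod as expensive to delete while it processes a long job...
kubectl annotate pod "$POD_NAME" controller.kubernetes.io/pod-deletion-cost=1000 --overwrite
# ...and cheap to delete again once the job has finished.
kubectl annotate pod "$POD_NAME" controller.kubernetes.io/pod-deletion-cost=0 --overwrite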
During the process of termination of a pod, Kubernetes sends a SIGTERM signal to the container of your pod. You can use that signal to gracefully shutdown your app. The problem is that Kubernetes does not wait forever for your application to finish and in your case your app may take a long time to exit.
In this case I recommend you use a preStop hook, which is completed before Kubernetes sends the KILL signal to the container. There is an example here on how to use handlers:
apiVersion: v1
kind: Pod
metadata:
  name: lifecycle-demo
spec:
  containers:
  - name: lifecycle-demo-container
    image: nginx
    lifecycle:
      postStart:
        exec:
          command: ["/bin/sh", "-c", "echo Hello from the postStart handler > /usr/share/message"]
      preStop:
        exec:
          command: ["/bin/sh","-c","nginx -s quit; while killall -0 nginx; do sleep 1; done"]
There is a kind of workaround that can give some control over pod termination.
I'm not quite sure if it is the best practice, but at least you can try it and test whether it suits your app.
Increase the Deployment grace period with terminationGracePeriodSeconds: 3600, where 3600 is the time in seconds of the longest possible task in the app. This makes sure that the pods will not be forcibly terminated before the end of the grace period. Read the docs about the pod termination process in detail.
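In the pod template this sits at the same level as containers, roughly like the sketch below (container name and image are placeholders):
spec:
  terminationGracePeriodSeconds: 3600   # upper bound for the longest task, in seconds
  containers:
  - name: worker
    image: my-worker:latest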
Define a preStop handler. More details about lifecycle hooks can be found in docs as well as in the example. In my case, I've used the script below to create the file which will later be used as a trigger to terminate the pod (probably there are more elegant solutions).
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "touch /home/node/app/preStop"]
Stop your app as soon as the condition is met. When the app exits, the pod terminates as well. It is not possible to end the process with PID 1 from the preStop shell script, so you need to add some logic to the app to terminate itself. In my case, it is a NodeJS app with a scheduler that runs every 30 seconds and checks whether two conditions are met: !isNodeBusy identifies whether the app is allowed to finish, and fs.existsSync('/home/node/app/preStop') whether the preStop hook was triggered. It might be different logic for your app, but you get the basic idea.
schedule.scheduleJob('*/30 * * * * *', () => {
  if (!isNodeBusy && fs.existsSync('/home/node/app/preStop')) {
    process.exit();
  }
});
Keep in mind that this workaround works only with voluntary disruptions and obviously not helpful with involuntary disruptions. More info in docs.
I think running this type of workload using a Deployment or similar, and using a HorizontalPodAutoscaler for scaling, is the wrong way to go. One way you could go about this is to:
Define a controller (this could perhaps be a Deployment) whose task is to periodically create a Kubernetes Job object.
The spec of the Job should contain a value for .spec.parallelism equal to the maximum number of concurrent executions you will accept.
The Pods spawned by the Job then run your processing logic. They should each pull a message from the queue, process it, and then delete it from the queue (in the case of success).
The Pods must exit with the correct status (success or failure). This ensures that the Job recognises when the processing has completed, and so will not spin up additional Pods.
Using this method, .spec.parallelism controls the autoscaling based on how much work there is to be done, and scale-down is an automatic benefit of using a Job.
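A minimal sketch of such a Job (the name, image, queue address and parallelism value are placeholders) might look like this:
apiVersion: batch/v1
kind: Job
metadata:
  name: queue-workers
spec:
  parallelism: 5                # maximum number of concurrent worker Pods
  template:
    spec:
      containers:
      - name: worker
        image: my-worker:latest                 # pulls a message, processes it, deletes it
        env:
        - name: QUEUE_URL
          value: "amqp://queue.example.local"   # placeholder queue address
      restartPolicy: OnFailure
Leaving .spec.completions unset gives the work-queue pattern: once one Pod terminates successfully and the rest have exited, the Job is complete and no additional Pods are created.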
You are looking for Pod Priority and Preemption. By configuring a high priority PriorityClass for your pods you can ensure that they won't be removed to make space for other pods with a lower priority.
Create a new PriorityClass
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "This priority class will not cause other pods to be preempted."
Set your new PriorityClass in your pods
priorityClassName: high-priority
The value: 1000000 in the PriorityClass configures the scheduling priority of the pod. The higher the value the more important the pod is.
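For example, in a Deployment's pod template it sits at the same level as containers (the container name and image are placeholders):
spec:
  template:
    spec:
      priorityClassName: high-priority
      containers:
      - name: worker
        image: my-worker:latest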
For those who land on this page facing the issue of Pods getting killed while a node scales down:
This is expected behaviour of the Cluster Autoscaler, as the CA tries to optimize pod placement so that the cluster can run at its minimum size.
However, you can protect your Job pods from eviction (getting killed) by creating a PodDisruptionBudget with maxUnavailable=0 for them.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: sample-pdb
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: <your_app_name>
I understand that Kubernetes CronJobs create pods that run on the schedule specified inside the CronJob. However, the retention policy seems arbitrary and I don't see a way where I can retain failed/successful pods for a certain period of time.
I am not sure exactly what you are asking here.
A CronJob does not create pods. It creates Jobs (which it also manages), and those Jobs create the pods.
As per the Kubernetes Jobs documentation, if the Jobs are managed directly by a higher-level controller, such as a CronJob, the Jobs can be cleaned up by the CronJob based on the specified capacity-based cleanup policy. In short, pods and Jobs will not be deleted until you remove the CronJob. You will be able to check logs from the Pods/Jobs/CronJob. Just use kubectl describe.
By default a CronJob keeps a history of 3 successful Jobs and only 1 failed Job. You can change these limits in the CronJob spec with the following parameters:
spec:
  successfulJobsHistoryLimit: 10
  failedJobsHistoryLimit: 0
0 means that the CronJob will not keep any history of failed Jobs
10 means that the CronJob will keep a history of 10 succeeded Jobs
You will not be able to retain a pod from a failed Job because, when a Job fails, it is restarted until it succeeds or reaches the backoffLimit given in the spec.
Another option you have is to suspend the CronJob:
kubectl patch cronjob <name_of_cronjob> -p '{"spec":{"suspend":true}}'
If the value of suspend is true, the CronJob will not create any new Jobs or pods. You will still have access to the completed pods and Jobs.
If none of the above helped, could you please give more information about what exactly you expect?
The CronJob spec would be helpful.
I have created two node pools. A small one for all the google system jobs and a bigger one for my tasks. The bigger one should reduce its size to 0 after the job is done.
The problem is: even if there are no cron jobs, the node pool does not reduce its size to 0.
Creating cluster:
gcloud beta container --project "projectXY" clusters create "cluster" --zone "europe-west3-a" --username "admin" --cluster-version "1.9.6-gke.0" --machine-type "n1-standard-1" --image-type "COS" --disk-size "100" --scopes "https://www.googleapis.com/auth/cloud-platform" --num-nodes "1" --network "default" --enable-cloud-logging --enable-cloud-monitoring --subnetwork "default" --enable-autoscaling --enable-autoupgrade --min-nodes "1" --max-nodes "1"
Creating node pool:
The node pool should reduce its size to 0 after all tasks are done.
gcloud container node-pools create workerpool --cluster=cluster --machine-type=n1-highmem-8 --zone=europe-west3-a --disk-size=100 --enable-autoupgrade --num-nodes=0 --enable-autoscaling --max-nodes=2 --min-nodes=0
Create cron job:
kubectl create -f cronjob.yaml
Quoting from Google Documentation:
"Note: Beginning with Kubernetes version 1.7, you can specify a minimum size of zero for your node pool. This allows your node pool to scale down completely if the instances within aren't required to run your workloads. However, while a node pool can scale to a zero size, the overall cluster size does not scale down to zero nodes (as at least one node is always required to run system Pods)."
Notice also that:
"Cluster autoscaler also measures the usage of each node against the node pool's total demand for capacity. If a node has had no new Pods scheduled on it for a set period of time, and [this option does not work for you since it is the last node] all Pods running on that node can be scheduled onto other nodes in the pool , the autoscaler moves the Pods and deletes the node.
Note that cluster autoscaler works based on Pod resource requests, that is, how many resources your Pods have requested. Cluster autoscaler does not take into account the resources your Pods are actively using. Essentially, cluster autoscaler trusts that the Pod resource requests you've provided are accurate and schedules Pods on nodes based on that assumption."
Therefore I would check:
that your version of your Kubernetes cluster is at least 1.7
that there are no pods running on the last node (check every namespace; the pods that have to run on every node do not count: fluentd, kube-dns, kube-proxy) - the fact that there are no cronjobs is not enough
that autoscaling is NOT enabled for the corresponding managed instance groups, since they are different tools
that there are no pods stuck in any weird state still assigned to that node
that there are no pods waiting to be scheduled in the cluster (a couple of commands for these checks are sketched after this list)
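A couple of kubectl commands that may help with these checks (the node name is a placeholder):
# Pods still running on the last node, across all namespaces
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>
# Pods waiting to be scheduled anywhere in the cluster
kubectl get pods --all-namespaces --field-selector status.phase=Pending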
If everything above checks out, it is likely an issue with the autoscaler, and you can open a private issue with Google specifying your project ID, since there is not much the community can do.
If you are interested, post the link to the issue tracker in the comments and I will take a look at your project (I work for Google Cloud Platform Support).
I ran into the same issue and tested a number of different scenarios. I finally got it to work by doing the following:
Create your node pool with an initial size of 1 instead of 0:
gcloud container node-pools create ${NODE_POOL_NAME} \
--cluster ${CLUSTER_NAME} \
--num-nodes 1 \
--enable-autoscaling --min-nodes 0 --max-nodes 1 \
--zone ${ZONE} \
--machine-type ${MACHINE_TYPE}
Configure your CronJob in a similar fashion to this:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: cronjob800m
spec:
  schedule: "7 * * * *"
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 0
  successfulJobsHistoryLimit: 0
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: cronjob800m
            image: busybox
            args:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster
            resources:
              requests:
                cpu: "800m"
          restartPolicy: Never
Note that the resources are set in such a way that the job can only run on the large node pool, not on the small one. Also note that we set both failedJobsHistoryLimit and successfulJobsHistoryLimit to 0 so that the job is automatically cleaned from the node pool after success/failure.
That should be it.
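If you want to verify the behaviour, you can check which node the pod landed on and then watch the pool shrink (by default the autoscaler waits around 10 minutes before removing an unneeded node):
kubectl get pods -o wide        # confirm the job pod ran on the large node pool
kubectl get nodes --watch       # watch the extra node disappear after the job finishes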