I have a batch job that I want to run on a Kubernetes cluster on Google Cloud. That job has to run periodically, say once a week, and takes a day to complete. From the docs:
Beginning with Kubernetes version 1.7, you can specify a minimum size of zero for your node pool. This allows your node pool to scale down completely if the instances within aren't required to run your workloads. However, while a node pool can scale to a zero size, the overall cluster size does not scale down to zero nodes (as at least one node is always required to run system Pods).
My question is: will this cluster still generate costs if it is scaled down to 0 nodes? From what I understand, the cluster size won't be zero, hence costs would still be generated.
If that is the case, what would be the correct approach to minimize my costs? Should I periodically create/delete the cluster before/after running the job?
If you provision your Kubernetes cluster dynamically, i.e. you can rebuild the cluster environment from scratch without any dependencies on the worker nodes, autoscaling down to zero nodes is a good solution, and the Kubernetes master nodes (system Pods) are not charged in GKE, according to the pricing page.
You can create node-pools:
gcloud container node-pools create ${CLUSTER_NAME}-pool \
--cluster ${CLUSTER_NAME} \
--enable-autoscaling --min-nodes 0 --max-nodes 10 \
--zone ${INSTANCE_ZONE}
and then force scaling down on demand:
gcloud container clusters resize ${CLUSTER_NAME} --size=0 [--node-pool=${CLUSTER_NAME}-pool]
Also get yourself familiar with this document; it describes the types of Pods which can prevent Cluster Autoscaler from removing a node.
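For instance, if you are certain a particular workload's Pods can safely be moved, you can annotate them as safe to evict so they don't block scale-down (a minimal sketch; the deployment name is illustrative):
# Mark a workload's Pods as evictable so they don't keep a node alive
kubectl patch deployment my-batch-worker --patch \
  '{"spec":{"template":{"metadata":{"annotations":{"cluster-autoscaler.kubernetes.io/safe-to-evict":"true"}}}}}'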
Related
On AWS EKS
I'm adding a deployment with 17 replicas (each requesting and limited to 64Mi of memory) to a small cluster with 2 nodes of type t3.small.
Counting the kube-system pods, each node ends up with 11 running pods in total, and 1 pod is left pending, i.e.:
Node #1:
aws-node-1
coredns-5-1as3
coredns-5-2das
kube-proxy-1
+7 app pod replicas
Node #2:
aws-node-1
kube-proxy-1
+9 app pod replicas
I understand that t3.small is a very small instance. I'm only trying to understand what is limiting me here. Memory requests are not the limit; I'm well below the available resources.
I found that there is a per-node limit on IP addresses that depends on the instance type.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html?shortFooter=true#AvailableIpPerENI .
I didn't find any other documentation saying explicitly that this is limiting pod creation, but I'm assuming it does.
Based on the table, t3.small can have 12 IPv4 addresses. If this is the case and this is limiting factor, since I have 11 pods, where did 1 missing IPv4 address go?
The real maximum number of pods per EKS instance is actually listed in this document.
For t3.small instances, it is 11 pods per instance. That is, you can have a maximum of 22 pods in your cluster. 6 of these pods are system pods, so there remains a maximum of 16 workload pods.
You're trying to run 17 workload pods, so it's one too many. I guess 16 of these pods have been scheduled and 1 is left pending.
The formula for defining the maximum number of pods per instance is as follows:
N * (M-1) + 2
Where:
N is the number of Elastic Network Interfaces (ENI) of the instance type
M is the number of IP addresses of a single ENI
So, for t3.small, this calculation is 3 * (4-1) + 2 = 11.
Values of N and M for each instance type are listed in this document.
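As a quick sanity check, you can plug the t3.small numbers from that table into the formula yourself:
# N = number of ENIs for a t3.small, M = IPv4 addresses per ENI (both from the AWS table)
N=3; M=4
echo $(( N * (M - 1) + 2 ))   # prints 11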
For anyone who runs across this when searching Google: be advised that as of August 2021 it's now possible to increase the max pods on a node using the latest AWS CNI plugin, as described here.
Using the basic configuration explained there, a t3.medium node went from a max of 17 pods to a max of 110, which is more than adequate for what I was trying to do.
This is why we stopped using EKS in favor of a KOPS-deployed, self-managed cluster.
IMO, EKS, which employs the aws-cni, imposes too many constraints; it actually goes against one of the major benefits of using Kubernetes: efficient use of available resources.
EKS moves the system constraint away from CPU / memory usage into the realm of network IP limitations.
Kubernetes was designed to provide high density and manage resources efficiently. Not quite so with EKS's version, since a node could be idle, with almost its entire memory available, and yet the cluster will be unable to schedule pods on an otherwise underutilized node if pods > (N * (M-1) + 2).
One could be tempted to employ another CNI such as Calico; however, that would be limited to worker nodes, since access to master nodes is forbidden.
This causes the cluster to have two networks, and problems will arise when trying to access the K8s API or working with admission controllers.
It really does depend on workflow requirements; for us, high pod density, efficient use of resources, and having complete control of the cluster are paramount.
Connect to your EKS node
and run this:
/etc/eks/bootstrap.sh clusterName --use-max-pods false --kubelet-extra-args '--max-pods=50'
Ignore the "nvidia-smi not found" output.
The whole script is located at https://github.com/awslabs/amazon-eks-ami/blob/master/files/bootstrap.sh
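To verify the effective limit before and after the change, you can read the allocatable pod count off the node object (the node name below is illustrative):
# Print the kubelet's effective pod limit for a node
kubectl get node ip-10-0-1-23.ec2.internal -o jsonpath='{.status.allocatable.pods}'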
EKS allows you to increase the max number of pods per node, but this can be done only with Nitro instances; check the list here.
Make sure you have VPC CNI 1.9+
Enable prefix delegation for the VPC CNI plugin:
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
If you are using a self-managed node group, make sure to pass the following in BootstrapArguments:
--use-max-pods false --kubelet-extra-args '--max-pods=110'
or you could create the node group using eksctl:
eksctl create nodegroup --cluster my-cluster --managed=false --max-pods-per-node 110
If you are using a managed node group with a specified AMI, it has bootstrap.sh, so you could modify user_data to do something like this:
/etc/eks/bootstrap.sh my-cluster \
  --use-max-pods false \
  --kubelet-extra-args '--max-pods=110'
Or simply use eksctl by running:
eksctl create nodegroup --cluster my-cluster --max-pods-per-node 110
For more details, check the AWS documentation: https://docs.aws.amazon.com/eks/latest/userguide/cni-increase-ip-addresses.html
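As a hedged follow-up check, you can confirm the env var landed on the aws-node DaemonSet and that nodes now advertise the higher limit:
# Inspect the CNI container's environment and the per-node pod limits
kubectl get daemonset aws-node -n kube-system -o jsonpath='{.spec.template.spec.containers[0].env}'
kubectl get nodes -o custom-columns=NAME:.metadata.name,MAXPODS:.status.allocatable.pods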
I need to downsize a cluster from 3 to 2 nodes.
I have critical pods running on some nodes (0 and 1). As I found that the last node (2) in the cluster is the one that has the non critical pods, I have "cordoned" it so it won't get any new ones.
What I wonder is whether I can make sure that that last node (2) is the one that will be removed when I go to the Azure portal and downsize my cluster to 2 nodes (it is the last node and it is cordoned).
I have read that if I manually delete the node, the system will still consider there are 3 nodes running, so it's important to use the cluster management to downsize it.
You cannot control which node will be removed when scaling down the AKS cluster.
However, there are some workarounds for that:
Delete the cordoned node manually via the portal and then launch an upgrade. It would try to add the node back, but with no success, because the subnet has no space left.
Another option is to:
Set up cluster autoscaler with two nodes
Scale up the number of nodes in the UI
Drain the node you want to delete and wait for the autoscaler to do its job (see the commands sketched below)
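For reference, a hedged sketch of that drain-and-scale sequence (node, resource group, and cluster names are placeholders; on older kubectl the drain flag is --delete-local-data):
# Keep the node unschedulable, evict its pods, then ask AKS for 2 nodes
kubectl cordon aks-nodepool1-12345678-vmss000002
kubectl drain aks-nodepool1-12345678-vmss000002 --ignore-daemonsets --delete-emptydir-data
az aks scale --resource-group myResourceGroup --name myAKSCluster --node-count 2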
Here are some sources and useful info:
Scale the node count in an Azure Kubernetes Service (AKS) cluster
Support selection of nodes to remove when scaling down
az aks scale
Please let me know if that helped.
I have a GKE cluster that I want to have sitting at 0 nodes, scale up to 3 nodes to perform a task, and then after a certain amount of idle time, scale back down to 0. Is there a way to do this?
A GKE cluster can never scale down to 0 because of the system pods running in the cluster. The pods running in the kube-system namespace count against resource usage in your nodes, so the autoscaler will never make the decision to scale the entire cluster down to 0.
It is definitely possible to have individual node pools scale down to 0 though.
You may want to consider using 2 different node pools: 1 small one to hold all the system pods (minus daemonset pods) and another larger pool with autoscaling enabled from 0 to X. You can add a taint to this node pool to ensure system pods don't use it.
This will minimize your resource usage during down times, but there is no way to ensure k8s automatically resizes to 0.
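A hedged sketch of what that second, taint-protected pool could look like (pool, cluster, zone, taint key, and limits are all illustrative):
# Autoscaled pool for workloads; system pods stay on the small default pool
gcloud container node-pools create batch-pool \
  --cluster=my-cluster --zone=us-central1-a \
  --enable-autoscaling --min-nodes=0 --max-nodes=3 \
  --node-taints=dedicated=batch:NoSchedule
Workloads meant for that pool then need a matching toleration (and ideally a node selector) so they don't land on the small system pool.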
Alternatively, if you have a planned schedule for when the cluster should scale up or down, you can leverage Cloud Scheduler to launch a job that sends an API call to the Container API to resize your cluster.
Or you could configure a job in the cluster, or a preStop hook in your final job, to trigger a Cloud Function.
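If you go the Cloud Scheduler route, one possible sketch is an HTTP job that calls the GKE nodePools setSize method on a schedule; the project, location, cluster, pool, service account, and the exact endpoint path are assumptions here, so double-check them against the Container API reference and make sure the service account has the right GKE permissions:
# Scale the pool up to 3 nodes every Monday at 06:00 (all identifiers are placeholders)
gcloud scheduler jobs create http scale-up-batch-pool \
  --schedule="0 6 * * 1" \
  --uri="https://container.googleapis.com/v1/projects/PROJECT_ID/locations/us-central1/clusters/my-cluster/nodePools/batch-pool:setSize" \
  --http-method=POST \
  --message-body='{"nodeCount": 3}' \
  --oauth-service-account-email=scaler@PROJECT_ID.iam.gserviceaccount.com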
Yes, you can scale down pools after some idle condition.
You can use the gcloud CLI to scale the node pool up and down.
Resize the GKE pool using a Cloud Function:
You can use a Cloud Function to trigger API calls and use Cloud Scheduler to trigger that function at a specific time.
OR
You can write a Cloud Function with Pub/Sub as a trigger and pass a message to Pub/Sub; based on that, you can resize the GKE node pool as required.
An example, if anyone is looking for one: https://github.com/harsh4870/cloud-function-scale-gke
As we can read in the GKE documentation about the Cluster autoscaler:
Autoscaling limits
When you autoscale clusters, node pool scaling limits are determined by zone availability.
For example, the following command creates an autoscaling multi-zone cluster with six nodes across three zones, with a minimum of one node per zone and a maximum of four nodes per zone:
gcloud container clusters create example-cluster \
--zone us-central1-a \
--node-locations us-central1-a,us-central1-b,us-central1-f \
--num-nodes 2 --enable-autoscaling --min-nodes 1 --max-nodes 4
The total size of this cluster is between three and twelve nodes, spread across three zones. If one of the zones fails, the total size of the cluster becomes between two and eight nodes.
But there are limitations.
Occasionally, cluster autoscaler cannot scale down completely and an extra node exists after scaling down. This can occur when required system Pods are scheduled onto different nodes, because there is no trigger for any of those Pods to be moved to a different node. See I have a couple of nodes with low utilization, but they are not scaled down. Why?. To work around this limitation, you can configure a Pod disruption budget.
Here are the reasons why the nodes might not be scaled down:
the node group already has the minimum size,
node has the scale-down disabled annotation (see How can I prevent Cluster Autoscaler from scaling down a particular node?)
node was unneeded for less than 10 minutes (configurable by --scale-down-unneeded-time flag),
there was a scale-up in the last 10 min (configurable by --scale-down-delay-after-add flag),
there was a failed scale-down for this group in the last 3 minutes (configurable by --scale-down-delay-after-failure flag),
there was a failed attempt to remove this particular node, in which case Cluster Autoscaler will wait for extra 5 minutes before considering it for removal again,
using large custom value for --scale-down-delay-after-delete or --scan-interval, which delays CA action.
The following command would resize the cluster to zero nodes:
gcloud container clusters resize [cluster-name] --size 0 --zone [zone]
Now, it is up to you how you want to increase or decrease the size of the cluster.
Suppose you have a few things to deploy and you know how many resources they will need; increase the size of the cluster with the following command:
gcloud container clusters resize [cluster-name] --size 3 --zone [zone]
and once done with the task you wanted to perform, run the above-mentioned command again to resize it to zero. You can write a shell script to automate this (a sketch follows), provided you are certain about the time needed by the cluster to perform the tasks you want.
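A minimal sketch of such a script, assuming the workload is wrapped in a Kubernetes CronJob named weekly-batch (the cluster name, zone, sizes, and job names are all placeholders):
#!/bin/bash
set -euo pipefail
# Scale up, run the batch Job, wait for it to finish, then scale back to zero
gcloud container clusters resize my-cluster --size 3 --zone us-central1-a --quiet
kubectl create job batch-run --from=cronjob/weekly-batch
kubectl wait --for=condition=complete job/batch-run --timeout=24h
gcloud container clusters resize my-cluster --size 0 --zone us-central1-a --quiet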
I have a test cluster in GKE (it runs my non-essential dev services). I am using the following GKE features for the cluster:
preemptible nodes (~4x f1-micro)
dedicated ingress node(s)
node auto-upgrade
node auto-repair
auto-scaling node-pools
regional cluster
stackdriver healthchecks
I created my pre-emptible node-pool thusly (auto-scaling between 3 and 6 actual nodes across 3 zones):
gcloud beta container node-pools create default-pool-f1-micro-preemptible \
--cluster=dev --zone us-west1 --machine-type=f1-micro --disk-size=10 \
--preemptible --node-labels=preemptible=true --tags=preemptible \
--enable-autoupgrade --enable-autorepair --enable-autoscaling \
--num-nodes=1 --min-nodes=0 --max-nodes=2
It all works great, most of the time. However, around 3 or 4 times per day, I receive healthcheck notifications regarding downtime on some services running on the pre-emptible nodes. (exactly what I would expect ONCE per 24h when the nodes get reclaimed/regenerated. But not 3+ times.)
By the time I receive the email notification, the cluster has already recovered, but when checking kubectl get nodes I can see that the "age" on some of the pre-emptible nodes is ~5min, matching the approx. time of the outage.
I am not sure where to find the logs for what is happening, or WHY the resets were triggered (poorly set resource settings? unexpected preemptible scheduling? "auto-repair"?). I expect this is all in Stackdriver somewhere, but I can't find WHERE. The Kubernetes/GKE logs are quite chatty, and everything is at INFO level (either hiding the error text, or the error logs are elsewhere).
I must say, I do enjoy the self-healing nature of the setup, but in this case I would prefer to be able to inspect the broken pods/nodes before they are reclaimed. I would also prefer to troubleshoot without tearing-down/rebuilding the cluster, especially to avoid additional costs.
I was able to solve this issue through a brute force process, creating several test node-pools in GKE running the same workloads (I didn't bother connecting up ingress, DNS, etc), and varying the options supplied to gcloud beta container node-pools create.
Since I was paying for these experiments, I did not run them all simultaneously, although that would have produced a faster answer. I also preferred the tests which kept the --preemptible option, since that affects the cost significantly.
My results determined that the issue was with the --enable-autorepair argument, and removing it reduced failed health checks to an acceptable level (expected for preemptible nodes).
Preemptible VMs offer the same machine types and options as regular compute instances and last for up to 24 hours.
This means that a preemptible instance will die at least once per 24h, and 3-4 times is still well within expectations; nothing about preemption guarantees that it will happen only once.
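As a hedged way to check whether a given outage lines up with a preemption rather than an auto-repair, you can query Cloud Logging for Compute Engine preemption audit events; the filter below is a sketch and may need adjusting for your project:
# List recent preemption events for the cluster's instances
gcloud logging read 'resource.type="gce_instance" AND protoPayload.methodName="compute.instances.preempted"' \
  --limit=20 --format="table(timestamp, protoPayload.resourceName)"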
I have deployed an app using Kubernetes to a Google Cloud Container Engine Cluster.
I got into autoscaling, and I found the following options:
Kubernetes Horizontal Pod Autoscaling (HPA)
As explained here, Kubernetes offers the HPA on deployments. As per the docs:
Horizontal Pod Autoscaling automatically scales the number of pods in a replication controller, deployment or replica set based on observed CPU utilization
Google Cloud Container Cluster
Now I have a Google Cloud Container Cluster using 3 instances, with autoscaling enabled. As per the docs:
Cluster Autoscaler enables users to automatically resize clusters so that all scheduled pods have a place to run.
This means I have two places to define my autoscaling. Hence my questions:
Is a Pod the same as a VM instance inside my cluster, or can multiple Pods run inside a single VM instance?
Are these two parameters doing the same thing (i.e. creating/removing VM instances inside my cluster)? If not, how do they behave compared to one another?
What happens if, e.g., I have a number of pods between 3 and 10 and a cluster with a number of instances between 1 and 3, and autoscaling kicks in? When and how would both scale?
Many thanks!
Is a Pod the same as a VM instance inside my cluster, or can multiple Pods run inside a single VM instance?
Multiple Pods can run on the same instance (called a node in Kubernetes). You can define the maximum resources a Pod may consume in the deployment YAML; see the docs. This is an important prerequisite for autoscaling.
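For example, one hedged way to set those requests and limits on an existing Deployment (the name and values are illustrative):
# Set requests and limits so the schedulers know how much each Pod needs
kubectl set resources deployment my-app \
  --requests=cpu=100m,memory=128Mi --limits=cpu=500m,memory=256Mi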
Are these two parameters doing the same thing (i.e. creating/removing VM instances inside my cluster)? If not, how do they behave compared to one another?
The Kubernetes autoscaler will schedule additional Pods on your existing nodes. The Google autoscaler will add worker nodes (new instances) to your cluster. The Google autoscaler looks at queued-up Pods that cannot be scheduled because there is no space in your cluster, and when it finds those, it will add nodes.
What happens if, e.g., I have a number of pods between 3 and 10 and a cluster with a number of instances between 1 and 3, and autoscaling kicks in? When and how would both scale?
Based on the maximum resource usage you define for your Pods, the Google autoscaler will estimate how many new nodes are required to run all the queued-up, pending Pods.
Also read this article.
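To tie it to the numbers in the question, a hedged sketch of setting up both autoscalers (the deployment, cluster, pool, zone, and CPU target are illustrative):
# HPA: keep my-app between 3 and 10 replicas based on CPU utilization
kubectl autoscale deployment my-app --min=3 --max=10 --cpu-percent=70
# Cluster autoscaler: let GKE keep the pool between 1 and 3 nodes
gcloud container clusters update my-cluster --enable-autoscaling \
  --min-nodes=1 --max-nodes=3 --node-pool=default-pool --zone=us-central1-a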