I have a test cluster in GKE (it runs my non-essential dev services). I am using the following GKE features for the cluster:
preemptible nodes (~4x f1-micro)
dedicated ingress node(s)
node auto-upgrade
node auto-repair
auto-scaling node-pools
regional cluster
stackdriver healthchecks
I created my pre-emptible node-pool as follows (auto-scaling between 3 and 6 actual nodes across 3 zones):
gcloud beta container node-pools create default-pool-f1-micro-preemptible \
--cluster=dev --zone us-west1 --machine-type=f1-micro --disk-size=10 \
--preemptible --node-labels=preemptible=true --tags=preemptible \
--enable-autoupgrade --enable-autorepair --enable-autoscaling \
--num-nodes=1 --min-nodes=0 --max-nodes=2
It all works great, most of the time. However, around 3 or 4 times per day, I receive healthcheck notifications regarding downtime on some services running on the pre-emptible nodes. (This is exactly what I would expect ONCE per 24h, when the nodes get reclaimed/regenerated, but not 3+ times.)
By the time I receive the email notification, the cluster has already recovered, but when checking kubectl get nodes I can see that the "age" on some of the pre-emptible nodes is ~5min, matching the approx. time of the outage.
I am not sure where to find the logs for what is happening, or WHY the resets were triggered (poorly-set resource settings? unexpected pre-emptible scheduling? "auto-repair"?). I expect this is all in Stackdriver somewhere, but I can't find WHERE. The Kubernetes/GKE logs are quite chatty, and everything is at INFO level (either hiding the error text, or the error logs are elsewhere).
I must say, I do enjoy the self-healing nature of the setup, but in this case I would prefer to be able to inspect the broken pods/nodes before they are reclaimed. I would also prefer to troubleshoot without tearing-down/rebuilding the cluster, especially to avoid additional costs.
I was able to solve this issue through a brute force process, creating several test node-pools in GKE running the same workloads (I didn't bother connecting up ingress, DNS, etc), and varying the options supplied to gcloud beta container node-pools create.
Since I was paying for these experiments, I did not run them all simultaneously, although that would have produced a faster answer. I also preferred the tests which kept the --preemptible option, since that affects the cost significantly.
My results determined that the issue was with the --enable-autorepair argument; removing it reduced failed health checks to an acceptable level (what you would expect for preemptible nodes).
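For reference, auto-repair can also be switched off on an existing pool without recreating it (a sketch, assuming your gcloud version supports toggling node management flags via node-pools update; the pool, cluster, and zone names are the ones from the question above):
gcloud container node-pools update default-pool-f1-micro-preemptible \
  --cluster=dev --zone us-west1 --no-enable-autorepair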
Preemptible VMs offer the same machine types and options as regular compute instances and last for up to 24 hours.
This means that a preemptible instance will die at least once per 24h, but 3-4 times is still well within expectations. Preemption is not guaranteed, nor is it stated anywhere, to happen only once per day.
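To confirm whether a given outage was an actual preemption (rather than, say, an auto-repair), the preemption operations can be listed from Compute Engine; a sketch based on the preemptible VM documentation:
gcloud compute operations list --filter="operationType=compute.instances.preempted"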
Related
I have an EKS cluster with two worker nodes. I would like to "switch off" the nodes or do something to reduce the costs of my cluster outside working hours. Is there any way to turn off the nodes at night and turn them on again in the morning?
Thanks a lot.
This is a very common concern for anyone using a managed K8s cluster, and there are different approaches people take to address it. What works best for us is a combination of kube-downscaler and cluster-autoscaler.
kube-downscaler helps you scale down / "pause" Kubernetes workloads (Deployments, StatefulSets, and even HorizontalPodAutoscalers and CronJobs) during non-work hours.
cluster-autoscaler is a tool that automatically:
Scales down the size of the Kubernetes cluster when there are nodes in the cluster that have been underutilized for an extended period of time and their pods can be placed on other existing nodes.
Scales up the size of the Kubernetes cluster when there are pods that failed to run in the cluster due to insufficient resources.
So, essentially, at night, when kube-downscaler scales down the pods and other objects, cluster-autoscaler notices the underutilized nodes and kills them, placing any remaining pods on other nodes. It does the opposite in the morning.
Of course, some fine-tuning of the two tools' configuration might be needed to make this work best for you.
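For example, kube-downscaler can be driven by annotations on the workloads themselves. A minimal sketch (the deployment name my-app and the schedule are placeholders; the downscaler/uptime annotation key comes from the kube-downscaler README):
kubectl annotate deployment my-app downscaler/uptime="Mon-Fri 08:00-18:00 Europe/Berlin"
Anything outside that window is then scaled to zero replicas by kube-downscaler, and cluster-autoscaler removes the nodes that become idle.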
Unrelated to your specific question, but if you are in "savings" mode you may want to have a look at EC2 Spot Instances for EKS, assuming you can operate within their boundaries. See here for the details.
On AWS EKS
I'm adding a deployment with 17 replicas (each requesting and limited to 64Mi of memory) to a small cluster with 2 nodes of type t3.small.
Counting the kube-system pods, the total number of running pods per node is 11, and 1 is left pending, i.e.:
Node #1:
aws-node-1
coredns-5-1as3
coredns-5-2das
kube-proxy-1
+7 app pod replicas
Node #2:
aws-node-1
kube-proxy-1
+9 app pod replicas
I understand that t3.small is a very small instance. I'm only trying to understand what is limiting me here. Memory requests are not it; I'm way below the available resources.
I found that there is an IP address limit per node that depends on the instance type.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html?shortFooter=true#AvailableIpPerENI .
I didn't find any other documentation saying explicitly that this is limiting pod creation, but I'm assuming it does.
Based on the table, a t3.small can have 12 IPv4 addresses. If this is the case and it is the limiting factor, then since I have 11 pods, where did the 1 missing IPv4 address go?
The real maximum number of pods per EKS instance is actually listed in this document.
For t3.small instances, it is 11 pods per instance. That is, you can have a maximum of 22 pods in your cluster. 6 of these are system pods, so at most 16 workload pods remain.
You're trying to run 17 workload pods, so that's one too many. I guess 16 of them have been scheduled and 1 is left pending.
The formula for defining the maximum number of pods per instance is as follows:
N * (M-1) + 2
Where:
N is the number of Elastic Network Interfaces (ENI) of the instance type
M is the number of IP addresses of a single ENI
So, for t3.small, this calculation is 3 * (4-1) + 2 = 11.
Values of N and M for each instance type are listed in this document.
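As a quick sanity check of that formula in the shell (the t3.small values, 3 ENIs and 4 IPv4 addresses per ENI, are taken from the ENI table linked above):
ENIS=3; IPS_PER_ENI=4
echo $(( ENIS * (IPS_PER_ENI - 1) + 2 ))   # prints 11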
For anyone who runs across this when searching Google: be advised that as of August 2021 it's now possible to increase the max pods on a node using the latest AWS CNI plugin, as described here.
Using the basic configuration explained there, a t3.medium node went from a max of 17 pods to a max of 110, which is more than adequate for what I was trying to do.
This is why we stopped using EKS in favor of a kops-deployed, self-managed cluster.
IMO, EKS, which employs the aws-cni, imposes too many constraints; it actually goes against one of the major benefits of using Kubernetes: efficient use of available resources.
EKS moves the system constraint away from CPU / memory usage into the realm of network IP limitations.
Kubernetes was designed to provide high density and manage resources efficiently. Not quite so with EKS's version: a node could be almost idle, with nearly its entire memory available, and yet the cluster will be unable to schedule pods on that otherwise under-utilized node once pods > (N * (M-1) + 2).
One could be tempted to employ another CNI such as Calico; however, you would be limited to the worker nodes, since access to the master nodes is forbidden.
This causes the cluster to have two networks, and problems will arise when trying to access the K8s API or when working with Admission Controllers.
It really does depend on your workflow requirements; for us, high pod density, efficient use of resources, and having complete control of the cluster are paramount.
Connect to your EKS node
and run this:
/etc/eks/bootstrap.sh clusterName --use-max-pods false --kubelet-extra-args '--max-pods=50'
(Ignore the "nvidia-smi not found" output.)
The whole script is located at https://github.com/awslabs/amazon-eks-ami/blob/master/files/bootstrap.sh
EKS allows you to increase the max number of pods per node, but this can only be done with Nitro instances; check the list here.
Make sure you have VPC CNI 1.9+
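One way to confirm the installed VPC CNI version (aws-node is the default daemonset name for the plugin; this check follows the AWS documentation):
kubectl describe daemonset aws-node -n kube-system | grep amazon-k8s-cni: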
Enable prefix delegation for the VPC CNI plugin:
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
If you are using a self-managed node group, make sure to pass the following in BootstrapArguments:
--use-max-pods false --kubelet-extra-args '--max-pods=110'
or you could create the node group with eksctl:
eksctl create nodegroup --cluster my-cluster --managed=false --max-pods-per-node 110
If you are using a managed node group with a specified AMI, it has bootstrap.sh, so you could modify user_data to do something like this:
/etc/eks/bootstrap.sh my-cluster \
  --use-max-pods false \
  --kubelet-extra-args '--max-pods=110'
Or simply use eksctl by running:
eksctl create nodegroup --cluster my-cluster --max-pods-per-node 110
For more details, check the AWS documentation: https://docs.aws.amazon.com/eks/latest/userguide/cni-increase-ip-addresses.html
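Once a node has come up with the new settings, its pod capacity can be verified; a quick check (the node name my-node is a placeholder):
kubectl get node my-node -o jsonpath='{.status.allocatable.pods}'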
I have a GKE cluster running in us-central1 with a preemptible node pool. I have nodes in each zone (us-central1-b, us-central1-c, us-central1-f). For the last 10 hours, I get the following error for the underlying node VM:
Instance '[instance-name]' creation failed: The zone
'[instance-zone]'
does not have enough resources available to fulfill
the request. Try a different zone, or try again
later.
I tried creating new clusters in different regions with different machine types, using HA (multi-zone) settings and I get the same error for every cluster.
I saw an issue on the Google Cloud Status Dashboard and tried with the console, as recommended, but it errors out with a timeout error.
Is anyone else having this problem? Any idea what I may be doing wrong?
UPDATES
Nov 11
I stood up a cluster in us-west2; this was the only region that would work. I used the gcloud command line, as the UI did not seem to be effective. There was a note about a similar situation (use gcloud, not the UI) on the Google Cloud Status Dashboard.
I tried creating node pools in us-central1 with both the gcloud command line and the UI, to no avail.
I'm now federating deployments across regions and standing up multi-region ingress.
Nov. 12
Cannot create HA clusters in us-central1; same message as listed above.
Reached out via twitter and received a response.
Working with the K8s guide to federation to see if I can get multi-cluster running. Most likely going to use Kelsey Hightower's approach.
The only problem: I can't spin up clusters to federate.
Findings
Talked with Google support; you need a $150/mo. package to get a tech person to answer your questions.
Preemptible instances are not a good option for a primary node pool. I did this because I'm cheap, and it bit me hard.
The new architecture is a primary node pool with committed use VMs that do not autoscale, and a secondary node pool with preemptible instances for autoscale needs. The secondary pool will have minimum nodes = 0 and max nodes = 5 (for right now); this cluster is regional so instances are across all zones.
Cost for an n1-standard-1 with sustained use (assuming 24/7) is about a 30% discount off list.
Cost for a 1-year n1-standard-1 committed use is about a 37% discount off list.
Preemptible instances are re-provisioned every 24 hrs., if they are not taken from you earlier when resource needs spike in the region.
I believe I fell prey to a resource spike in us-central1.
A must-watch for people looking to federate K8s: Kelsey Hightower - CNCF Keynote | Kubernetes Federation
Issue appears to be resolved as of Nov 13th.
I have a batch job that I want to run on a Kubernetes cluster on Google Cloud. The job has to be run periodically, say once a week, and takes a day to complete. From the doc:
Beginning with Kubernetes version 1.7, you can specify a minimum size of zero for your node pool. This allows your node pool to scale down completely if the instances within aren't required to run your workloads. However, while a node pool can scale to a zero size, the overall cluster size does not scale down to zero nodes (as at least one node is always required to run system Pods).
My question is: will this cluster still generate costs if it is scaled down to 0 nodes? From what I understand, the cluster size won't be zero, hence costs would still be incurred.
If that is the case, what would be the correct approach to minimize my costs? Should I periodically create/delete the cluster before/after running the job?
If you provision your Kubernetes cluster dynamically, then as long as you can rebuild the cluster environment from scratch without any dependencies on the worker Nodes, autoscaling down to zero Nodes is a good solution; the Kubernetes master Nodes (which run the system Pods) are not charged in GKE, according to the pricing page.
You can create node-pools:
gcloud container node-pools create ${CLUSTER_NAME}-pool \
--cluster ${CLUSTER_NAME} \
--enable-autoscaling --min-nodes 0 --max-nodes 10 \
--zone ${INSTANCE_ZONE}
and then force scaling down on demand:
gcloud container clusters resize ${CLUSTER_NAME} --size=0 [--node-pool=${CLUSTER_NAME}-pool]
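For the weekly batch job, a possible wrapper is to scale the pool up before the run and back to zero afterwards; a sketch (the pool size of 10 and the job-submission step in the middle are placeholders):
gcloud container clusters resize ${CLUSTER_NAME} --node-pool=${CLUSTER_NAME}-pool --size=10 --quiet
# ... submit the batch job and wait for it to finish ...
gcloud container clusters resize ${CLUSTER_NAME} --node-pool=${CLUSTER_NAME}-pool --size=0 --quiet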
Also familiarize yourself with this document; it describes the types of Pods that can prevent the Cluster Autoscaler from removing a Node.
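Some of those cases can be overridden per Pod with the Cluster Autoscaler's safe-to-evict annotation (from its FAQ); a sketch, where my-pod is a placeholder and in practice you would usually set this in the Pod template instead:
kubectl annotate pod my-pod cluster-autoscaler.kubernetes.io/safe-to-evict="true"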
I am trying to run a job on GKE for 5 minutes on 50 nodes. However, when I scale down the instances, it happens sequentially and thus costs me much more for a 4-5 minute job.
Is there any way to delete GKE instances in parallel?
A GKE cluster's node pool has an underlying Instance Group.
I was able to delete the nodes in parallel by directly changing the number of nodes in Instance Group from 50 to 5.
All nodes were deleted within 30 seconds and GKE had also automatically updated the cluster size with the new value.
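For reference, the equivalent command-line operation is roughly the following (a sketch; the instance group name and zone are placeholders, and the actual group name can be found with gcloud compute instance-groups list):
gcloud compute instance-groups managed resize gke-mycluster-default-pool-12345678-grp --size=5 --zone=us-central1-a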
If you are already using the following to scale down your cluster:
$ gcloud container clusters resize [CLUSTER_NAME] --node-pool [NODE_POOL] --size [SIZE]
I believe there are no other options to speed up this process.
On the other hand, if you are relying on the autoscaler, you could check whether resizing with that command is faster and lets you meet your requirements. However, keep in mind that the purpose of Kubernetes is not to provide an infrastructure where it is extremely fast to spin up 50 nodes and then kill them when you are done.
Consider also, from the documentation:
The kubectl drain command should only be issued to a single node at a time. However, you can run multiple kubectl drain commands for different nodes in parallel, in different terminals or in the background. Multiple drain commands running concurrently will still respect the PodDisruptionBudget you specify.
Therefore, it follows from the documentation that draining nodes (an essential step in removing a node) in bulk is discouraged.
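For reference, a typical single-node drain looks like this (a sketch; the node name is a placeholder, and on older kubectl versions --delete-emptydir-data is spelled --delete-local-data):
kubectl drain gke-mycluster-default-pool-12345678-abcd --ignore-daemonsets --delete-emptydir-data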