I have a GKE cluster running in us-central1 with a preemptible node pool. I have nodes in each zone (us-central1-b, us-central1-c, us-central1-f). For the last 10 hours, I have been getting the following error for the underlying node VM:
Instance '[instance-name]' creation failed: The zone '[instance-zone]' does not have enough resources available to fulfill the request. Try a different zone, or try again later.
I tried creating new clusters in different regions with different machine types, using HA (multi-zone) settings, and I get the same error for every cluster.
I saw an issue on the Google Cloud Status Dashboard and, as recommended, tried again with the console, but it errors out with a timeout.
Is anyone else having this problem? Any idea what I may be doing wrong?
UPDATES
Nov 11
I stood up a cluster in us-west2; this was the only one that would work. I used the gcloud command line, as the UI seemed not to be effective. There was a note about this situation, recommending gcloud over the UI, on the Google Cloud Status Dashboard.
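For reference, the working command looked roughly like this (a sketch; the cluster name, zone, and sizes here are placeholders, not my exact values):

gcloud container clusters create dev-cluster \
    --zone us-west2-a \
    --machine-type=n1-standard-1 \
    --num-nodes=3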
I tried creating node pools in us-central1 with the gcloud command line and the UI, to no avail.
I'm now federating deployments across regions and standing up multi-region ingress.
Nov 12
Cannot create HA clusters in us-central1; same message as listed above.
Reached out via Twitter and received a response.
Working through the K8s guide to federation to see if I can get multi-cluster running. Most likely going to use Kelsey Hightower's approach.
Only problem: I can't spin up clusters to federate.
Findings
Talked with Google support; you need a $150/mo package to get a tech person to answer your questions.
Preemptible instances are not a good option for a primary node pool. I did this because I'm cheap, and it bit me hard.
The new architecture is a primary node pool with committed-use VMs that do not autoscale, and a secondary node pool with preemptible instances for autoscaling needs. The secondary pool will have minimum nodes = 0 and max nodes = 5 (for right now); this cluster is regional, so instances are spread across all zones. (A sketch of this layout follows this list.)
Cost for an n1-standard-1 with sustained use (assuming 24/7) is about a 30% discount off the list price.
Cost for a 1-year n1-standard-1 committed use is roughly a 37% discount off the list price.
Preemptible instances are re-provisioned every 24 hours, if they are not taken from you sooner when resource needs spike in the region.
I believe I fell prey to a resource spike in us-central1.
A must-watch for people looking to federate K8s: Kelsey Hightower - CNCF Keynote | Kubernetes Federation
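As noted above, a sketch of the two-pool layout (all names are placeholders; the committed-use discount itself is purchased in the billing console, not via these flags, and for a regional cluster the node-count flags are per zone):

# Primary pool: stable capacity, no autoscaling (covered by committed use)
gcloud container node-pools create primary-pool \
    --cluster=dev --region=us-central1 \
    --machine-type=n1-standard-1 --num-nodes=1

# Secondary pool: preemptible burst capacity, scaling down to 0 when idle
gcloud container node-pools create preemptible-pool \
    --cluster=dev --region=us-central1 \
    --machine-type=n1-standard-1 --preemptible \
    --enable-autoscaling --num-nodes=1 --min-nodes=0 --max-nodes=5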
Issue appears to be resolved as of Nov 13th.
Related
I have used GKE for years, and I wanted to experiment with GKE in Autopilot mode. My initial expectation was that it starts with 0 worker nodes and, whenever I deploy a workload, automatically scales the nodes based on requested memory and CPU. However, when I created a GKE cluster, there was nothing related to nodes in the UI, yet the kubectl get nodes output shows 2 nodes. Do you have any idea how to start that cluster with no nodes initially?
The principle of GKE Autopilot is that you do NOT worry about the nodes; they are managed for you. No matter whether there are 1, 2, or 10 nodes in your cluster, you don't pay for them; you pay only when a pod runs in your cluster (CPU and memory time usage).
So you can't manage the number of nodes, the number of pools, or low-level details like that; it is similar to a serverless product (Google prefers to say "nodeless" cluster).
On the flip side, it's great to already have provisioned resources in your cluster that you don't pay for: you can deploy and scale more quickly!
EDIT 1
You can have a look at the pricing. There is a flat fee of $74.40 per month ($0.10/hour) for the control plane, and then you pay for your pods (CPU + memory).
You get one free cluster per billing account.
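For completeness, creating an Autopilot cluster from the CLI looks roughly like this (a sketch; the cluster name and region are placeholders):

gcloud container clusters create-auto my-autopilot-cluster \
    --region=europe-west1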
Missing clusters after resolving GCP billing.
Our GCP project had unresolved billing.
This made the instances and Kubernetes resources inaccessible.
We resolved the billing.
All instances and (k8s) instance groups came back.
Problem: the clusters are nowhere to be seen.
gcloud container clusters list
*empty*
Before this, we had services running.
All load balancers are visible under Network Services.
How do I rebuild the cluster from the instances?
By any chance was the unresolved billing related to the end of the GCP free trial?
If that was the case, and it has been less than 30 days, there is still a good chance of getting your clusters back. At the end of the free trial, all services are stopped and scheduled for deletion, as outlined here, but there is a 30-day grace period during which these services can be recovered.
If it is still within the 30 days, and you have upgraded to a paid GCP plan and the clusters have not shown up, I would indeed recommend contacting GCP support to see what is going on.
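If you need to confirm which cluster an orphaned instance belonged to, GKE node VMs carry cluster metadata. A sketch, assuming the cluster-name metadata key that GKE sets on its nodes (the instance name and zone are placeholders):

gcloud compute instances describe gke-dev-default-pool-abc123 \
    --zone=us-central1-b \
    --format="value(metadata.items.cluster-name)"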
My Kubernetes Engine cluster keeps rebooting one of my nodes, even though all pods on the node are "well-behaved". I've tried to look at the cluster's Stackdriver logs, but was not able to find a reason. After a while, the continuous reboots usually stop, only to occur again a few hours or days later.
Usually only one single node is affected, while the other nodes are fine, but deleting that node and creating a new one in its place only helps temporarily.
I have already disabled node auto-repair to see if that makes a difference (it was turned on before). If I recall correctly, this started after upgrading my cluster to Kubernetes 1.13 (specifically version 1.13.5-gke). The issue has persisted after upgrading to 1.13.6-gke.0. Even creating a new node pool and migrating to it had no effect.
The cluster consists of four nodes with 1 CPU and 3 GB RAM each. I know that's small for a k8s cluster, but this has worked fine in the past.
I am using the new Stackdriver Kubernetes Monitoring as well as Istio on GKE.
Any pointers as to what could be the reason or where I look for possible causes would be appreciated.
Screenshots of the node event list were attached here (happy to provide other logs; I couldn't find anything meaningful in Stackdriver Logging yet).
Posting this answer as a community wiki to give some troubleshooting tips/steps, as the underlying issue wasn't found.
Feel free to expand it.
After the steps below, the issue with a node rebooting was no longer present:
Updating the Kubernetes version (GKE)
Uninstalling Istio
Using e2-medium instances as nodes (see the sketch after this list)
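A sketch of that last step, under hypothetical names: create an e2-medium pool, then drain the old nodes so workloads reschedule onto it:

gcloud container node-pools create e2-medium-pool \
    --cluster=my-cluster --zone=us-central1-b \
    --machine-type=e2-medium --num-nodes=4

# Repeat for each node in the old pool (older kubectl versions call the
# last flag --delete-local-data instead of --delete-emptydir-data)
kubectl cordon <old-node-name>
kubectl drain <old-node-name> --ignore-daemonsets --delete-emptydir-data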
As pointed out by user @aurelius:
I would start from posting the kubectl describe node maybe there is something going on before your Node gets rebooted and unhealthy. Also do you use resources and limits? Can this restarts be a result of some burstable workload? Also have you tried checking system logs after the restart on the Node itself? Can you post the results? – aurelius Jun 7 '19 at 15:38
The above comment could be a good starting point for troubleshooting issues with the cluster.
Options to troubleshoot the cluster pointed out in the comment:
$ kubectl describe node, focusing on the output in:
Conditions - KubeletReady, KubeletHasSufficientMemory, KubeletHasNoDiskPressure, etc.
Allocated resources - Requests and Limits of scheduled workloads
Checking system logs after the restart on the node itself:
GCP Cloud Console (Web UI) -> Logging -> Legacy Logs Viewer/Logs Explorer -> VM Instance/GCE Instance
It could also be beneficial to check the CPU/RAM usage in:
GCP Cloud Console (Web UI) -> Monitoring -> Metrics Explorer
You can also check if there are any operations on the cluster:
gcloud container operations list
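The node system logs can also be pulled from the CLI; a sketch, with the instance ID left as a placeholder:

# Recent log entries for a specific node VM
gcloud logging read \
    'resource.type="gce_instance" AND resource.labels.instance_id="<instance-id>"' \
    --limit=50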
Adding to the above points:
Creating a cluster with Istio on GKE
We suggest creating at least a 4 node cluster with the 2 vCPU machine type when using this add-on. You can deploy Istio itself with the default GKE new cluster setup but this may not provide enough resources to explore sample applications.
-- Cloud.google.com: Istio: Docs: Istio on GKE: Installing
Also, the official docs of Istio are stating:
CPU and memory
Since the sidecar proxy performs additional work on the data path, it consumes CPU and memory. As of Istio 1.7, a proxy consumes about 0.5 vCPU per 1000 requests per second.
-- Istio.io: Docs: Performance and scalability: CPU and memory
Additional resources:
Cloud.google.com: Kubernetes Engine: Docs: Troubleshooting
Kubernetes.io: Docs: Debug cluster
I have a micro service scaled out across several pods in a Google Cloud Kubernetes Engine. Being in a multi-cloud-shop, we have our logging/monitoring/telemetry in Azure Application Insights.
Our data should be kept inside Europe, so our GCP Kubernetes cluster is set up with
Master zone: europe-west1-b
Node zones: europe-west1-b
When I create a node pool on this cluster, the nodes apparently have the zone europe-west1-b (as expected), as seen from the Google Cloud Platform Console "Node details".
However, in Azure Application Insights, in the telemetry reported from the applications running in pods in this node pool, the client_City is reported as "Mountain View" and client_StateOrProvince as "California", and in some cases "Ann Arbor" in "Michigan".
At first I dismissed this strange location as just some inter-cloud issue (e.g., the receiving end defaulting to something strange when the information isn't filled out as expected, or similar).
But now Application Insights has actually pointed out that there is a quite significant performance difference depending on whether my pod is "running in Michigan or in California", which led me to believe that these fields are actually correct.
Is GCP fooling me? Am I looking at the wrong place? How can I make sure my GCP Kubernetes nodes are running in Europe?
This is essential for me to know, both from a GDPR perspective and, of course, performance-wise (latency).
Azure Application Insights is fooling you: the external IP range was registered by Google in California, and the GeoIP lookup does not consider that these addresses are used by data centers distributed all over the globe. I also have a GCE instance deployed in Frankfurt am Main while its IP appears as if it were in Mountain View. Stackdriver should report the actual locations (and not some vague GeoIP locations).
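To verify where the nodes actually run, you can read the zone label straight off the nodes; a sketch (on older cluster versions the label is failure-domain.beta.kubernetes.io/zone rather than topology.kubernetes.io/zone):

# Show each node together with the zone it runs in
kubectl get nodes -L topology.kubernetes.io/zone

# Cross-check against the underlying VMs
gcloud compute instances list --filter="name~^gke-" --format="table(name,zone)"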
I have a test cluster in GKE (it runs my non-essential dev services). I am using the following GKE features for the cluster:
preemptible nodes (~4x f1-micro)
dedicated ingress node(s)
node auto-upgrade
node auto-repair
auto-scaling node-pools
regional cluster
stackdriver healthchecks
I created my preemptible node-pool as follows (note that the node-count flags are per zone, so across the 3 zones the pool starts with 3 nodes and auto-scales between 0 and 6):
gcloud beta container node-pools create default-pool-f1-micro-preemptible \
--cluster=dev --zone us-west1 --machine-type=f1-micro --disk-size=10 \
--preemptible --node-labels=preemptible=true --tags=preemptible \
--enable-autoupgrade --enable-autorepair --enable-autoscaling \
--num-nodes=1 --min-nodes=0 --max-nodes=2
It all works great, most of the time. However, around 3 or 4 times per day I receive health-check notifications about downtime on some services running on the preemptible nodes. (This is exactly what I would expect ONCE per 24h, when the nodes get reclaimed/regenerated, but not 3+ times.)
By the time I receive the email notification, the cluster has already recovered, but when checking kubectl get nodes I can see that the "age" of some of the preemptible nodes is ~5 min, matching the approximate time of the outage.
I am not sure where to find the logs for what is happening, or WHY the resets were triggered (poorly set resource settings? unexpected preemptible scheduling? "auto-repair"?). I expect this is all in Stackdriver somewhere, but I can't find WHERE. The Kubernetes/GKE logs are quite chatty, and everything is at INFO level (either hiding the error text, or the error logs are elsewhere).
I must say, I do enjoy the self-healing nature of the setup, but in this case I would prefer to be able to inspect the broken pods/nodes before they are reclaimed. I would also prefer to troubleshoot without tearing down/rebuilding the cluster, especially to avoid additional costs.
I was able to solve this issue through a brute-force process: creating several test node pools in GKE running the same workloads (I didn't bother connecting up ingress, DNS, etc.) and varying the options supplied to gcloud beta container node-pools create.
Since I was paying for these experiments, I did not run them all simultaneously, although that would have produced a faster answer. I also preferred the tests that kept the --preemptible option, since that affects the cost significantly.
My results determined that the issue was with the --enable-autorepair argument; removing it reduced failed health checks to an acceptable level (the level expected for preemptible nodes).
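If you'd rather not recreate the pool, auto-repair can also be switched off on the existing one; a sketch using the pool from the question:

gcloud container node-pools update default-pool-f1-micro-preemptible \
    --cluster=dev --zone us-west1 --no-enable-autorepair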
Preemptible VMs offer the same machine types and options as regular compute instances and last for up to 24 hours.
This means that a preemptible instance will die at least once per 24h, but 3-4 times is still well within expectations. Preemption is not guaranteed, nor stated anywhere, to happen only once.
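To confirm that the restarts really were preemptions (rather than, say, auto-repair), you can list preemption operations; a sketch using the zones of the cluster above:

gcloud compute operations list \
    --filter="operationType=compute.instances.preempted" \
    --zones=us-west1-a,us-west1-b,us-west1-c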