RouteController failed to create a route on GKE - kubernetes

I have a cluster on GKE whose node pool I create when I want to use the cluster, and delete when I'm done with it.
It's a two-node cluster with the master in europe-west2-a and nodes in europe-west2-a and europe-west2-b.
The most recent creation resulted in the node in zone B failing with NetworkUnavailable because RouteController failed to create a route. The reported reason was: Could not create route xxx 10.244.1.0/24 for node xxx after 342.263706ms: instance not found.
Why would this be happening all of a sudden, and what can I do to fix it?!

You didn't mention which version of GKE you are using so just for clarification:
Changes in access scopes
Beginning with Kubernetes version 1.10, gcloud and GCP Console no longer grant the compute-rw access scope on new clusters and new node pools by default. Furthermore, if --scopes is specified in gcloud container create, gcloud no longer silently adds compute-rw or storage-ro.
You can still revert to the legacy access scopes, but this is not the recommended approach.
Hope this helps.

With GKE 1.13.6-gke.13, some of the default scopes changed, including the removal of compute-rw. I think that, because of the age of the cluster, this scope was necessary for routes to be created correctly between the nodes in a node pool.
In the end, my gcloud creation command had these scopes:
--scopes https://www.googleapis.com/auth/projecthosting,storage-rw,monitoring,trace,compute-rw
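As a minimal sketch of how that scope list could be kept in one place when the node pool is recreated regularly, the helper below builds the argument list for gcloud container node-pools create. The function name and the pool, cluster and zone values are hypothetical placeholders, and the guard on compute-rw simply encodes the fix described in this answer rather than any GKE requirement.

```python
import shlex

# Scopes copied from the command above; compute-rw is the one that
# appeared to be needed for route creation on this older cluster.
SCOPES = [
    "https://www.googleapis.com/auth/projecthosting",
    "storage-rw",
    "monitoring",
    "trace",
    "compute-rw",
]


def node_pool_create_args(pool, cluster, zone, scopes):
    """Build the argument list for `gcloud container node-pools create`.

    Hypothetical helper: it only assembles arguments, it does not run gcloud.
    """
    if "compute-rw" not in scopes:
        # Assumption from this thread: without compute-rw the route was not created.
        raise ValueError("compute-rw scope is missing")
    return [
        "gcloud", "container", "node-pools", "create", pool,
        "--cluster", cluster,
        "--zone", zone,
        "--scopes", ",".join(scopes),
    ]


# Placeholder names; substitute your own pool, cluster and zone.
args = node_pool_create_args("my-pool", "my-cluster", "europe-west2-a", SCOPES)
print(shlex.join(args))
```

The resulting list can be passed to subprocess.run, or printed and pasted into a shell.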

Related

terraform: how to upgrade Azure AKS default nodepool VM size without replacement of the cluster

I'm trying to upgrade the VM size of my AKS cluster using this approach with Terraform. Basically I create a new nodepool with the required amount of nodes, then I cordon the old node to disallow scheduling of new pods. Next, I drain the old node to reschedule all the pods in the newly created node. Then, I proceed to upgrade the VM size.
The problem I am facing is that the azurerm_kubernetes_cluster resource allows for the creation of the default_node_pool, and another resource, azurerm_kubernetes_cluster_node_pool, allows me to create new nodepools with extra nodes.
Everything works until I create the new nodepool, cordon and drain the old one. However, when I change default_node_pool.vm_size and run terraform apply, it tells me that the whole resource has to be recreated, including the new nodepool I just created, because it's linked to the cluster id, which will be replaced.
How should I manage this upgrade from the documentation with Terraform if upgrading the default node pool always forces replacement even if a new nodepool is in place?
terraform version
terraform v1.1.7
on linux_amd64
+ provider registry.terraform.io/hashicorp/azurerm v2.82.0
+ provider registry.terraform.io/hashicorp/local v2.2.2

Upgrading EKS node group from 1.17 to 1.18 failed due to AsgInstanceLaunchFailures - how do I fix it?

I have an EKS cluster that has gone through an upgrade from 1.17 to 1.18.
The cluster has 2 node groups (updated using the AWS console).
The EKS control plane and one of the node group upgrades were OK.
For the last node group, the upgrade is failing due to a health issue: AsgInstanceLaunchFailures - One or more target groups not found. Validating load balancer configuration failed. The node group is now marked as Degraded.
When I access the update ID I see the following error:
NodeCreationFailure - Couldn't proceed with upgrade process as new nodes are not joining node group {NODE_GROUP_NAME}
I tried accessing the ASG with that ID and I can see it has several load-balancing target groups attached to it.
I could not find any way to fix this in the AWS docs.
Any advice?
Issue resolved.
It appears there was an empty target group that had been added manually to the cluster (there were 3 other target groups created automatically). Once the empty target group was deleted, the upgrade completed successfully.
I am still unclear as to how EKS chooses the proper target group to update when there is more than one.
Are you able to launch new nodes that come up in the Ready state and join the cluster? According to the EKS public documentation, the upgrade request succeeds only when the ASG can launch new instances in the Ready state in all the AZs of the node group.
To debug this further, you can trigger a new upgrade request and check the health of new nodes which EKS brings up in your cluster.

How to enable Kubernetes API in GCP? not sorted out here by following the doc

I am learning GCP and wanted to create a Kubernetes cluster with an instance. Here is what I did and what I followed, with no success:
First set the region to my default us-east1-b:
xenonxie#cloudshell:~ (rock-perception-263016)$ gcloud config set compute/region us-east1-b
Updated property [compute/region].
Now proceed to create it:
xenonxie#cloudshell:~ (rock-perception-263016)$ gcloud container clusters create my-first-cluster --num-nodes 1
ERROR: (gcloud.container.clusters.create) One of [--zone, --region] must be supplied: Please specify location.
So it seems default region/zone us-east1-b is NOT picked up
I then ran the same command again with the zone specified explicitly:
xenonxie#cloudshell:~ (rock-perception-263016)$ gcloud container clusters create my-first-cluster --num-nodes 1 --zone us-east1-b
WARNING: Currently VPC-native is not the default mode during cluster creation. In the future, this will become the default mode and can be disabled using --no-enable-ip-alias flag. Use --[no-]enable-ip-alias flag to suppress this warning.
WARNING: Newly created clusters and node-pools will have node auto-upgrade enabled by default. This can be disabled using the --no-enable-autoupgrade flag.
WARNING: Starting in 1.12, default node pools in new clusters will have their legacy Compute Engine instance metadata endpoints disabled by default. To create a cluster with legacy instance metadata endpoints disabled in the default node pool,run clusters create with the flag --metadata disable-legacy-endpoints=true.
WARNING: Your Pod address range (--cluster-ipv4-cidr) can accommodate at most 1008 node(s).
This will enable the autorepair feature for nodes. Please see https://cloud.google.com/kubernetes-engine/docs/node-auto-repair for more information on node autorepairs.
ERROR: (gcloud.container.clusters.create) ResponseError: code=403, message=Kubernetes Engine API is not enabled for this project. Please ensure it is enabled in Google Cloud Console and try again: visit https://console.cloud.google.com/apis/api/container.googleapis.com/overview?project=rock-perception-263016 to do so.
From the warning/error it seems I need to enable the Kubernetes API, and a link is already provided, wonderful. I clicked the link and enabled the API. Right after I enabled it, I was prompted to create credentials before I could use the API.
Clicking into it and choosing the right API, as you can see from the screenshot, it doesn't give me a button to create the credential:
What is missing here?
Thank you very much.
Once the API is enabled, you can go ahead and create the cluster. The credentials are not needed when you use gcloud, since the SDK wraps the API call and uses your logged-in user credentials.
As long as the Kubernetes Engine API shows as enabled, you should be able to run the same command you used and the cluster will be created. Most of those messages are just warnings letting you know about default settings that you did not specify.
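A side note on the first error in the question: us-east1-b is a zone name, while us-east1 is the region, so setting it as compute/region probably did not give gcloud a usable default location. The sketch below classifies a location string using the usual GCP naming pattern (a region plus a one-letter suffix is a zone) and returns the matching config property. The function names and the regular expressions are illustrative assumptions, not part of gcloud.

```python
import re

# Assumed naming pattern: regions look like "us-east1",
# zones add a one-letter suffix, e.g. "us-east1-b".
_REGION = re.compile(r"^[a-z]+-[a-z]+[0-9]+$")
_ZONE = re.compile(r"^[a-z]+-[a-z]+[0-9]+-[a-z]$")


def location_kind(name):
    """Return "zone", "region" or "unknown" for a GCP location string."""
    if _ZONE.match(name):
        return "zone"
    if _REGION.match(name):
        return "region"
    return "unknown"


def config_property_for(name):
    """Return the gcloud config property that fits this location string."""
    kind = location_kind(name)
    if kind == "unknown":
        raise ValueError(f"not a recognisable region or zone: {name!r}")
    return f"compute/{kind}"


print(config_property_for("us-east1-b"))  # compute/zone
print(config_property_for("us-east1"))    # compute/region
```

With that in mind, gcloud config set compute/zone us-east1-b would be the property to try for a default zone; passing --zone explicitly, as in the second attempt, avoids the question altogether.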

GKE - Upgrading cluster master after cluster creation completes

Once we increase the load using a JMeter client, my deployed service is interrupted and the GCP/GKE console says -
Upgrading cluster master
The values shown below are going to change soon.
And my kubectl client throws this error during the upgrade -
Unable to connect to the server: dial tcp 35.236.238.66:443: connectex: No connection could be made because the target machine actively refused it.
How can I stop this upgrade or prevent my service from being interrupted? If the service is interrupted, there is no benefit to this autoscaling. I am new to GKE, so please let me know if I am missing any configuration or parameter here.
I am using this command to create my cluster-
gcloud container clusters create ajeet-gke --zone us-east4-b --node-locations us-east4-b --machine-type n1-standard-8 --num-nodes 1 --enable-autoscaling --min-nodes 4 --max-nodes 16
It is not upgrading the k8s version. It works fine with a smaller load, but as I increase the load the cluster starts upgrading the master. So it looks like the master is resizing itself for more nodes. After the upgrade I can see more nodes in the GCP console. https://github.com/terraform-providers/terraform-provider-google/issues/3385
The command below reports that autoscaling is not enabled on the instance group.
> gcloud compute instance-groups managed list
NAME                                  AUTOSCALED  LOCATION    SCOPE  ---
ajeet-gke-cluster-default-pool-4***0  no          us-east4-b  zone   ---
Workaround
Sorry, I forgot to update this here. I found a workaround: after splitting the cluster creation command into two steps, the cluster autoscales without restarting the master node:
gcloud container clusters create ajeet-ggs --zone us-east4-b --node-locations us-east4-b --machine-type n1-standard-8 --num-nodes 1
gcloud container clusters update ajeet-ggs --enable-autoscaling --min-nodes 1 --max-nodes 10 --zone us-east4-b --node-pool default-pool
To prevent this, you should always create your cluster with the cluster version hardcoded to the latest version available.
See the documentation: https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-architecture#master
This means that Google manages the master: if your master is not up to date, it will be updated to the latest version, which lets Google limit the number of versions it currently manages. https://cloud.google.com/kubernetes-engine/docs/concepts/regional-clusters
As for why you see an interruption of service during the update: you are in zonal mode with only one master. To prevent this, use a regional cluster with more than one master, which allows for a clean rolling update.
The master won't resize the node pool unless the autoscaling feature is enabled on it.
As mentioned in the answer above, this is a feature at the node-pool level. From the description of the issue, it does seem that autoscaling is enabled on your node pool, so GKE's cluster autoscaler automatically resizes the cluster based on the demands of the workloads you want to run (i.e. when there are pods that cannot be scheduled due to resource shortages such as CPU).
Additionally, Kubernetes cluster autoscaling does not use the Managed Instance Group autoscaler. It runs a cluster-autoscaler controller on the Kubernetes master that uses Kubernetes-specific signals to scale your nodes.
It is therefore highly recommended not to use Compute Engine's autoscaling feature on instance groups created by Kubernetes Engine, or to rely on the autoscaling status shown by the MIG.

kubernetes provisioner for pv in a statefulset with aws-ebs pv issue

I have followed the documentation on how to set up k8s on AWS, including:
Add the provider=aws
Make sure the Nodes have correct IAM permissions
I keep getting the following error, and I am unsure where to find logs that show the underlying error that is making the AWS query fail.
This is how the error looks:
Failed to provision volume with StorageClass "gp2": error querying for all zones: no instances returned
I faced the same issue and found the solution.
I hope this applies to your issue as well.
So every EC2 instance that is a node in your kubernetes cluster should have a tag
kubernetes.io/cluster/CLUSTERNAME = owned
When you request a new persistent volume, Kubernetes requests it from AWS. AWS then checks in which AZs you have worker nodes, so that it doesn't create the volume in an AZ where there are no nodes.
It seems to do this by listing all EC2 instances with the tag kubernetes.io/cluster/CLUSTERNAME = owned
But if you have changed or removed this tag, so that it no longer matches your cluster name, you will get the exact error message you got here.
Let's say you changed it to
kubernetes.io/cluster/CLUSTERNAME-default = owned
That would trigger the issue.
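To make the tag check concrete, here is a small sketch that looks for nodes whose tags lack the expected kubernetes.io/cluster/CLUSTERNAME key. The instance records are hand-written dictionaries in a made-up shape, and the function names are hypothetical; in practice the tags would come from the EC2 API or the console. Only the presence of the key is checked here, not its value.

```python
def cluster_tag_key(cluster_name):
    """Tag key that each node is expected to carry, per the answer above."""
    return f"kubernetes.io/cluster/{cluster_name}"


def nodes_missing_cluster_tag(instances, cluster_name):
    """Return the ids of instances that lack the expected cluster tag key.

    `instances` is a list of {"id": str, "tags": dict} records
    (an illustrative shape, not the EC2 API response format).
    """
    key = cluster_tag_key(cluster_name)
    return [inst["id"] for inst in instances if key not in inst["tags"]]


# Example data: the second node has the renamed tag from the scenario above.
nodes = [
    {"id": "i-aaa", "tags": {"kubernetes.io/cluster/CLUSTERNAME": "owned"}},
    {"id": "i-bbb", "tags": {"kubernetes.io/cluster/CLUSTERNAME-default": "owned"}},
]

print(nodes_missing_cluster_tag(nodes, "CLUSTERNAME"))  # ['i-bbb']
```

If every node shows up in that list, the volume provisioner would find no matching instances, which is consistent with the "no instances returned" error.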