How to check if AWS EKS cluster is active to be able to start a nodegroup? - aws-cloudformation

I started an AWS EKS cluster using AWS CloudFormation. I then waited for the stack creation to complete through the following command:
$ aws cloudformation wait stack-create-complete --stack-name "myclusterstack"
I also waited for the EKS cluster to be active through the following command:
aws eks wait cluster-active --name "mycluster"
After these commands completed, I started up an AWS EKS nodegroup using AWS CloudFormation. However, I got the following error in the CloudFormation stack events:
Resource handler returned message: "Cluster 'mycluster' is not in ACTIVE status (Service: Eks, Status Code: 400, Request ID: a9aa2e4e-1b17-4fd4-bd89-851037867b25)" (RequestToken: 410bd19e-7710-b637-a30c-889f5b5a8893, HandlerErrorCode: InvalidRequest)
I noticed, however, that if I wait for around 5 minutes after creation of the cluster, I don't get this error when I create the nodegroup.
It seems like a few more minutes are needed after aws eks wait cluster-active before a nodegroup can be launched successfully.
I also tried checking if the cluster endpoint is available as a means to check if the cluster is already active:
aws eks describe-cluster --name "mycluster" --query 'cluster.endpoint'
However, creation of the nodegroup still fails even when the cluster endpoint is already available.
Is sleeping for a few minutes the only way to be able to spin up a nodegroup after cluster creation? What should be the proper way to wait for a cluster to be active to be able to launch a nodegroup?
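Not from the thread itself, but as an illustration, here is a minimal polling sketch; the nodegroup stack name, template file, and the extra delay are placeholders, and the settle time simply mirrors the ~5 minutes observed above:
# Hypothetical sketch: poll until EKS reports ACTIVE, then wait a little longer
# before creating the nodegroup stack (names and delay are illustrative).
while [ "$(aws eks describe-cluster --name mycluster --query 'cluster.status' --output text)" != "ACTIVE" ]; do
  sleep 15
done
sleep 300   # extra settling time (~5 minutes, as observed above)
aws cloudformation deploy --stack-name mynodegroupstack --template-file nodegroup.yaml --capabilities CAPABILITY_IAM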

Related

GKE-kubevirt virtualmachineinstances-mutator.kubevirt.io issue

I'm trying to use KubeVirt in a Google Cloud Platform GKE cluster.
I learned that GKE doesn't support nested virtualisation out of the box, so I tried the steps below to install KubeVirt in GKE, following "Is there a way to enable nested virtualization in GKE cluster node?":
1. Start a GKE cluster with ubuntu/containerd, n1-standard nodes and a minimum CPU platform of Haswell.
2. Find the template used for your new cluster, then determine the proper source image:
gcloud compute instance-templates describe --format=json | jq ".properties.disks[0].initializeParams.sourceImage"
3. Create a copy of the source disk with nested virtualization enabled:
gcloud compute images --project $PROJECT create $NEW_IMAGE_NAME --source-image $SOURCE_IMAGE --source-image-project=$SOURCE_PROJECT --licenses "https://www.googleapis.com/compute/v1/projects/vm-options/global/licenses/enable-vmx"
4. Use "Create Similar" on the template for your GKE cluster. Change the boot disk to $NEW_IMAGE_NAME. You will also need to drill down to networking/alias and change the default subnet to your pod network.
5. Trigger a rolling update on the group for your GKE nodes to move them to the new template (a command sketch follows below).
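For reference (not from the question), triggering that rolling update from the CLI would look roughly like this; the managed instance group, zone, and template names are placeholders:
# Hypothetical sketch: roll the GKE node MIG onto the template that uses the new image
gcloud compute instance-groups managed rolling-action start-update $MIG_NAME --zone $ZONE --version template=$NEW_TEMPLATE_NAME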
I'm able to install KubeVirt and virtctl, but when I try to launch a basic VM using
kubectl apply -f https://kubevirt.io/labs/manifests/vm.yaml
I'm getting the below error:
Error from server (InternalError): error when creating "https://kubevirt.io/labs/manifests/vm.yaml": Internal error occurred: failed calling webhook "virtualmachines-mutator.kubevirt.io": failed to call webhook: Post "https://virt-api.kubevirt.svc:443/virtualmachines-mutate?timeout=10s": context deadline exceeded
Is there any way to debug the error?
How can I make KubeVirt work in GKE?
I am trying to get KubeVirt running in GKE.
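Not part of the question, but a few generic commands that are often useful for checking whether the virt-api backend behind that webhook is healthy and reachable; the kubevirt namespace and the label selector are assumptions based on a default KubeVirt install:
# Check that the KubeVirt control-plane pods are running
kubectl get pods -n kubevirt
# Look at virt-api logs for webhook or TLS errors (label selector assumed)
kubectl logs -n kubevirt -l kubevirt.io=virt-api
# Inspect the webhook configuration and the service it points at
kubectl get mutatingwebhookconfigurations
kubectl describe service virt-api -n kubevirt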

GKE - Upgrading cluster master after cluster creation completes

When we increase the load using a JMeter client, my deployed service is interrupted, and the GCP/GKE console says:
Upgrading cluster master
The values shown below are going to change soon.
And my kubectl client throws this error during the upgrade:
Unable to connect to the server: dial tcp 35.236.238.66:443: connectex: No connection could be made because the target machine actively refused it.
How can I stop this upgrade or prevent my service interruption? If the service is going to be interrupted, then there is no benefit to this autoscaling. I am new to GKE, please let me know if I am missing any configuration or parameter here.
I am using this command to create my cluster:
gcloud container clusters create ajeet-gke --zone us-east4-b --node-locations us-east4-b --machine-type n1-standard-8 --num-nodes 1 --enable-autoscaling --min-nodes 4 --max-nodes 16
It is not upgrading the k8s version, because it works fine with a smaller load; but as I increase the load, the cluster starts an upgrade of the master. So it looks like the master is resizing itself for more nodes. After the upgrade I can see more nodes on the GCP console. https://github.com/terraform-providers/terraform-provider-google/issues/3385
The command below says autoscaling is not enabled on the instance group:
> gcloud compute instance-groups managed list
NAME                                   AUTOSCALED  LOCATION    SCOPE
ajeet-gke-cluster-default-pool-4***0   no          us-east4-b  zone
Workaround
Sorry, I forgot to update it here. I found a workaround to fix it: after splitting the cluster creation command into two steps, the cluster autoscales without restarting the master node:
gcloud container clusters create ajeet-ggs --zone us-east4-b --node-locations us-east4-b --machine-type n1-standard-8 --num-nodes 1
gcloud container clusters update ajeet-ggs --enable-autoscaling --min-nodes 1 --max-nodes 10 --zone us-east4-b --node-pool default-pool
To prevent this, you should always create your cluster with the cluster version hardcoded to the latest available version.
See the documentation: https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-architecture#master
This means that Google is managing the master, meaning that if your master is not up to date it will be updated to the latest version; this allows Google to limit the number of versions it has to manage at any time.
Now, why do you have an interruption of service during the update? Because you are in zonal mode with only one master. To prevent this you should use regional cluster mode with more than one master, allowing for a clean rolling update. https://cloud.google.com/kubernetes-engine/docs/concepts/regional-clusters
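As an illustration only (the version string below is made up; list the versions actually available with gcloud container get-server-config), pinning the version and creating a regional cluster would look roughly like this:
# List versions available in the zone
gcloud container get-server-config --zone us-east4-b
# Create the cluster pinned to a specific version (example version string)
gcloud container clusters create ajeet-gke --zone us-east4-b --cluster-version "1.14.10-gke.27" --machine-type n1-standard-8 --num-nodes 1
# Or create a regional cluster with replicated masters instead of a zonal one
gcloud container clusters create ajeet-gke --region us-east4 --machine-type n1-standard-8 --num-nodes 1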
The master won't resize the nodes unless the autoscaling feature is enabled.
As mentioned in the above answer, this is a feature at the node-pool level. Looking at the description of the issue, it does seem like autoscaling is enabled on your node pool, and GKE's cluster autoscaler is automatically resizing the cluster based on the demands of the workloads you want to run (i.e. when there are pods that cannot be scheduled due to resource shortages such as CPU).
Additionally, Kubernetes cluster autoscaling does not use the Managed Instance Group autoscaler. It runs a cluster-autoscaler controller on the Kubernetes master that uses Kubernetes-specific signals to scale your nodes.
It is therefore highly recommended not to use Compute Engine's autoscaling feature (or rely on the autoscaling status shown by the MIG) on instance groups created by Kubernetes Engine.
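To confirm whether cluster autoscaling is actually enabled on the node pool (rather than relying on the MIG status), something along these lines should work; the cluster, pool, and zone names are taken from the question, and the exact output field path is an assumption:
# Shows the node pool's autoscaling settings (field path assumed)
gcloud container node-pools describe default-pool --cluster ajeet-gke --zone us-east4-b --format="value(autoscaling.enabled)"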

TLS handshake timeout with kubernetes in GKE

I've created a cluster on Google Kubernetes Engine (previously Google Container Engine) and installed the Google Cloud SDK and the Kubernetes tools with it on my Windows machine.
It worked well for some time, and, out of nowhere, it stopped working. Every command I'm issuing with kubectl provokes the following:
Unable to connect to the server: net/http: TLS handshake timeout
I've searched Google, the Kubernetes Github Issues, Stack Overflow, Server Fault ... without success.
I've tried the following:
Restart my computer
Change wifi connection
Check that I'm not somehow using a proxy
Delete and re-create my cluster
Uninstall the Google Cloud SDK (and kubectl) from my machine and re-install them
Delete my .kube folder (config and cache)
Check my .kube/config
Change my cluster's version (tried 1.8.3-gke.0 and 1.7.8-gke.0)
Retry several hours later
Tried both on PowerShell and cmd.exe
Note that the cluster seems to work perfectly, since I have my application running on it and can interact with it normally through the Google Cloud Shell.
Running:
gcloud container clusters get-credentials cluster-2 --zone europe-west1-b --project ___
kubectl get pods
works on Google Cloud Shell and provokes the TLS handshake timeout on my machine.
For others seeing this issue, there is another cause to consider.
After doing:
gcloud config set project $PROJECT_NAME
gcloud config set container/cluster $CLUSTER_NAME
gcloud config set compute/zone europe-west2
gcloud beta container clusters get-credentials $CLUSTER_NAME --region europe-west2 --project $PROJECT_NAME
I was then seeing:
kubectl cluster-info
Unable to connect to the server: net/http: TLS handshake timeout
I tried everything suggested here and elsewhere. When the above worked without issue from my home desktop, I discovered that the shared workspace wifi was disrupting TLS/VPN connections in order to control internet access!
This is what I did to solve the above problem.
I simply ran the following commands:
> gcloud container clusters get-credentials {cluster_name} --zone {zone_name} --project {project_name}
> gcloud auth application-default login
Replace the placeholders appropriately.
So this MAY NOT work for you on GKE, but Azure AKS (managed Kubernetes) has a similar problem with the same error message so who knows — this might be helpful to someone.
The solution to this for me was to scale the nodes in my Cluster from the Azure Kubernetes service blade web console.
Workaround / Solution
Log into the Azure (or GKE) Console — Kubernetes Service UI.
Scale your cluster up by 1 node.
Wait for scale to complete and attempt to connect (you should be able to).
Scale your cluster back down to the normal size to avoid cost increases.
Total time it took me ~2 mins.
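If you prefer the CLI to the portal on the AKS side, the same scale-up/scale-down can be done roughly like this (the resource group, cluster name, and node counts are placeholders):
# Scale up by one node, reconnect with kubectl, then scale back down
az aks scale --resource-group myResourceGroup --name myAKSCluster --node-count 4
az aks scale --resource-group myResourceGroup --name myAKSCluster --node-count 3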
More Background Info on the Issue
Added this to the full ticket description write up that I posted over here (if you want more info have a read):
'Unable to connect Net/http: TLS handshake timeout' — Why can't Kubectl connect to Azure AKS server?

Failed to stop job or delete job in dataproc google cloud platform

When I am trying to delete a Dataproc cluster in Google Cloud Platform, I am getting the below error:
Failed to stop job b021d29d-acc9-409d-8fca-52363076a63c Cluster not found
Could anyone help?
I'm guessing you are trying to delete the cluster via the Dataproc Clusters UI. In that case the problem could be a bug with the UI itself which always sets the cluster region argument to 'global'. If your cluster region is not set to 'global' you'll get the 'Cluster not found' error.
The solution is to use the gcloud CLI:
gcloud dataproc clusters delete NAME [--async] [--region=REGION] [GCLOUD_WIDE_FLAG …]
ref: https://cloud.google.com/sdk/gcloud/reference/dataproc/clusters/delete
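For example, if the cluster was created in a specific region rather than 'global' (the cluster name and region below are illustrative):
gcloud dataproc clusters delete my-cluster --region=us-central1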

Kubernetes unable to pull images from gcr.io

I am trying to set up Kubernetes for the first time. I am following the Fedora manual installation guide: http://kubernetes.io/v1.0/docs/getting-started-guides/fedora/fedora_manual_config.html
I am trying to get the Kubernetes addons running, specifically the kube-ui. I created the service and replication controller like so:
kubectl create -f cluster/addons/kube-ui/kube-ui-rc.yaml --namespace=kube-system
kubectl create -f cluster/addons/kube-ui/kube-ui-svc.yaml --namespace=kube-system
When I run
kubectl get events --namespace=kube-system
I see errors such as this:
Failed to pull image "gcr.io/google_containers/pause:0.8.0": image pull failed for gcr.io/google_containers/pause:0.8.0, this may be because there are no credentials on this request. details: (Authentication is required.)
How am I supposed to tell Kubernetes to authenticate? This isn't covered in the documentation. So how do I fix this?
This happened due to a recent outage of GCE storage, as a result of which all of us went through this error while pulling images from gcr.io (which uses GCE storage on the backend).
Are you still seeing this error?
As the message says, you need credentials. Are you using Google Container Engine? Then you need to run:
gcloud config set project <your-project>
gcloud config set compute/zone <your-zone, like us-central1-f>
gcloud beta container clusters get-credentials --cluster <your-cluster-name>
Then your Container Engine cluster will have the credentials.