GKE-kubevirt virtualmachineinstances-mutator.kubevirt.io issue - kubernetes

I'm trying to use KubeVirt in a Google Cloud Platform GKE cluster.
I learned that GKE doesn't support nested virtualization, so I followed the steps below (from "Is there a way to enable nested virtualization in GKE cluster node?") to install KubeVirt on GKE:
1. Start a GKE cluster with ubuntu/containerd, n1-standard nodes and a minimum CPU of Haswell.
2. Find the template used for your new cluster, then determine the proper source image:
   gcloud compute instance-templates describe --format=json | jq ".properties.disks[0].initializeParams.sourceImage"
3. Create a copy of the source disk with nested virtualization enabled:
   gcloud compute images --project $PROJECT create $NEW_IMAGE_NAME --source-image $SOURCE_IMAGE --source-image-project=$SOURCE_PROJECT --licenses "https://www.googleapis.com/compute/v1/projects/vm-options/global/licenses/enable-vmx"
4. Use "Create Similar" on the template for your GKE cluster. Change the boot disk to $NEW_IMAGE_NAME. You will also need to drill down to networking/alias and change the default subnet to your pod network.
5. Trigger a rolling update on the group for your GKE nodes to move them to the new template.
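Once the nodes have moved to the new template, it may be worth confirming that a node really sees the virtualization extensions before installing KubeVirt. A minimal sketch, assuming an Intel node (Haswell or later, as above) and shell access to it, for example over SSH. The function name has_vmx is invented for illustration.

```shell
# Hypothetical helper: succeeds if a cpuinfo-style file lists the Intel "vmx"
# CPU flag, which a guest should only see when nested virtualization is on.
has_vmx() {
    # $1: path to a cpuinfo-style file (defaults to the live /proc/cpuinfo)
    grep -E '^flags' "${1:-/proc/cpuinfo}" | grep -qw vmx
}
```

On a node, `has_vmx && echo "vmx exposed"` prints the message only when the flag is present. If it prints nothing, the node is probably still running the original image.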
I'm able to install KubeVirt and virtctl, but when I try to launch a basic VM using
kubectl apply -f https://kubevirt.io/labs/manifests/vm.yaml
I get the error below:
Error from server (InternalError): error when creating "https://kubevirt.io/labs/manifests/vm.yaml": Internal error occurred: failed calling webhook "virtualmachines-mutator.kubevirt.io": failed to call webhook: Post "https://virt-api.kubevirt.svc:443/virtualmachines-mutate?timeout=10s": context deadline exceeded
Is there any way to debug this error?
How can I make KubeVirt work on GKE?
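One possible starting point for debugging, offered as a sketch rather than a diagnosis: the failing webhook call targets the virt-api service in the kubevirt namespace (both names come from the URL in the error), so it can help to check that the virt-api pods are running and that the service has endpoints. The label selector kubevirt.io=virt-api and the function name check_virt_api are assumptions.

```shell
# Hypothetical helper; assumes kubectl is already pointed at the GKE cluster.
check_virt_api() {
    # Are the virt-api pods scheduled and Ready? (label selector is an assumption)
    kubectl -n kubevirt get pods -l kubevirt.io=virt-api -o wide
    # Does the service named in the webhook URL have any backing endpoints?
    kubectl -n kubevirt get endpoints virt-api
}
```

If the pods are Running and the endpoints list is populated, the timeout more likely points at the network path between the control plane and the nodes than at KubeVirt itself.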

Related

gpu worker node unable to join cluster

I have an EKS setup (v1.16) with two ASGs: one for compute ("c5.9xlarge") and the other for GPU ("p3.2xlarge").
Both are configured as Spot and set with desiredCapacity 0.
The Kubernetes Cluster Autoscaler works as expected and scales out each ASG when necessary. The issue is that the newly created GPU instance is not recognized by the master, and running kubectl get nodes shows nothing.
I can see that the EC2 instance is in the Running state, and I can SSH into the machine.
I double-checked the labels and tags and compared them to those of the "compute" group.
Both are configured almost identically; the only difference is that the GPU nodegroup has a few additional tags.
Since I'm using the eksctl tool (v0.35.0) and the GPU nodeGroup is basically a copy and paste of the compute nodeGroup, I can't figure out what the problem could be.
UPDATE:
After SSHing into the instance, I could see the following error (in /var/log/messages):
failed to run Kubelet: misconfiguration: kubelet cgroup driver: "systemd" is different from docker cgroup driver: "cgroupfs"
and the kubelet service crashed.
Could it be that my GPU nodegroup uses the wrong AMI (amazon-eks-gpu-node-1.18-v20201211)?
As a simple workaround, you can use this preBootstrapCommands entry in the eksctl YAML config file:
- name: test-node-group
  preBootstrapCommands:
    - "sed -i 's/cgroupDriver:.*/cgroupDriver: cgroupfs/' /etc/eksctl/kubelet.yaml"
There is an issue with EKS 1.16: even Graviton-based machines won't join the cluster. To fix it, first try upgrading your CNI version. Please refer to the documentation here:
https://docs.aws.amazon.com/eks/latest/userguide/cni-upgrades.html
If that doesn't work, upgrade your EKS version to the latest available version; it should then work.
I've found the issue. It seems to be a misalignment between eksctl (v0.35.0) and the AL2 GPU AMI.
The AWS team changed the cgroup driver in Docker to "systemd" instead of "cgroupfs" (github), while the eksctl version I used had not picked up the change.
A temporary solution is to edit the /etc/eksctl/kubelet.yaml file using preBootstrapCommands.
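To see what that edit does before baking it into a nodegroup, the sed expression can be tried against a scratch copy of the file. This is only a sketch: the function name set_cgroupfs_driver is invented for illustration, and only the sed expression itself is taken from the preBootstrapCommands answer above. It assumes GNU sed, since it relies on -i.

```shell
# Hypothetical helper: rewrite the cgroupDriver line of a kubelet config file
# to "cgroupfs", using the same sed expression as the preBootstrapCommands entry.
set_cgroupfs_driver() {
    # $1: path to a kubelet config file (on the node: /etc/eksctl/kubelet.yaml)
    sed -i 's/cgroupDriver:.*/cgroupDriver: cgroupfs/' "$1"
}
```

Running it on a copy that contains `cgroupDriver: systemd` should leave every other line untouched and change only that one, which brings the kubelet in line with the Docker driver named in the error message.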

GKE - Upgrading cluster master after cluster creation completes

Once we increase the load using a JMeter client, my deployed service is interrupted, and the GCP/GKE console says:
Upgrading cluster master
The values shown below are going to change soon.
And my kubectl client throws this error during the upgrade:
Unable to connect to the server: dial tcp 35.236.238.66:443: connectex: No connection could be made because the target machine actively refused it.
How can I stop this upgrade or prevent the service interruption? If the service is interrupted, there is no benefit to this autoscaling. I am new to GKE, so please let me know if I am missing any configuration or parameter here.
I am using this command to create my cluster:
gcloud container clusters create ajeet-gke --zone us-east4-b --node-locations us-east4-b --machine-type n1-standard-8 --num-nodes 1 --enable-autoscaling --min-nodes 4 --max-nodes 16
It is not upgrading the Kubernetes version: it works fine with a smaller load, but as I increase the load, the cluster starts an upgrade of the master. So it looks like the master is resizing itself to handle more nodes. After the upgrade I can see more nodes in the GCP console. https://github.com/terraform-providers/terraform-provider-google/issues/3385
The command below says autoscaling is not enabled on the instance group.
> gcloud compute instance-groups managed list
NAME                                  AUTOSCALED  LOCATION    SCOPE  ---
ajeet-gke-cluster-default-pool-4***0  no          us-east4-b  zone   ---
Workaround
Sorry, I forgot to update this here. I found a workaround: after splitting the cluster creation command into two steps, the cluster autoscales without restarting the master node:
gcloud container clusters create ajeet-ggs --zone us-east4-b --node-locations us-east4-b --machine-type n1-standard-8 --num-nodes 1
gcloud container clusters update ajeet-ggs --enable-autoscaling --min-nodes 1 --max-nodes 10 --zone us-east4-b --node-pool default-pool
To prevent this, you should always create your cluster with the cluster version hardcoded to the latest available version.
See the documentation: https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-architecture#master
This means that Google manages the master: if your master is not up to date, it will be updated to the latest version, which allows Google to limit the number of versions it currently manages. https://cloud.google.com/kubernetes-engine/docs/concepts/regional-clusters
Now, why do you have an interruption of service during the update? Because you are in zonal mode with only one master. To prevent this, you should use a regional cluster with more than one master, which allows a clean rolling update.
The master won't resize the nodes unless the autoscaling feature is enabled.
As mentioned in the answer above, this is a feature at the node-pool level. From the description of the issue, it does seem that autoscaling is enabled on your node pool, so GKE's cluster autoscaler automatically resizes the cluster based on the demands of the workloads you want to run (i.e., when there are pods that cannot be scheduled due to resource shortages such as CPU).
Additionally, Kubernetes cluster autoscaling does not use the Managed Instance Group autoscaler. It runs a cluster-autoscaler controller on the Kubernetes master that uses Kubernetes-specific signals to scale your nodes.
It is therefore highly recommended not to use Compute Engine's autoscaling feature on instance groups created by Kubernetes Engine, or to rely on the autoscaling status shown by the MIG.
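Since the AUTOSCALED column of the managed instance group does not reflect GKE's cluster autoscaler, the node pool itself is the place to look. A rough sketch, assuming gcloud is installed and authenticated. The function name nodepool_autoscaling is made up, and it assumes gcloud prints True for an enabled boolean field and nothing when the field is unset.

```shell
# Hypothetical helper: report whether GKE cluster autoscaling is enabled on a
# node pool, by asking GKE rather than the managed instance group.
nodepool_autoscaling() {
    # $1: cluster name, $2: zone, $3: node pool name
    enabled=$(gcloud container node-pools describe "$3" \
        --cluster "$1" --zone "$2" \
        --format='value(autoscaling.enabled)')
    # Assumption: gcloud renders an enabled boolean as "True".
    if [ "$enabled" = "True" ]; then
        echo "autoscaling enabled"
    else
        echo "autoscaling not enabled"
    fi
}
```

For the cluster from the workaround, this would be called as `nodepool_autoscaling ajeet-ggs us-east4-b default-pool`.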

Terraform Kubernetes provider with EKS fails on configmap

I've followed the instructions to create an EKS cluster in AWS using Terraform.
https://www.terraform.io/docs/providers/aws/guides/eks-getting-started.html
I've also copied the output for connecting to the cluster to ~/.kube/config-eks. I've verified that this works, as I've been able to connect to the cluster and manually deploy containers. However, now I'm trying to use the Terraform Kubernetes provider to connect to the cluster, and I cannot seem to configure the provider properly.
I've configured the provider to use my kubectl configuration, but when attempting to push a simple configmap, I get the following error:
configmaps is forbidden: User "system:anonymous" cannot create configmaps in the namespace "kube-system"
I know that the provider is picking up part of the configuration, but I cannot seem to get it to authenticate. I suspect this is because EKS uses Heptio for authentication, and I'm not sure whether the K8s Go client used by Terraform supports Heptio. However, given that Terraform released its AWS EKS support when EKS went GA, I doubt they wouldn't also update their Kubernetes provider to work with it.
Is it possible to even do this now? Are there alternatives?
Exec auth was added here: https://github.com/kubernetes/client-go/commit/19c591bac28a94ca793a2f18a0cf0f2e800fad04
This is what is utilized for custom authentication plugins and was published Feb 7th.
Right now, Terraform doesn't support the new exec-based authentication provider, but there is an issue open with a workaround: https://github.com/terraform-providers/terraform-provider-kubernetes/issues/161
That said, if I get some free time I will work on a PR.

TLS handshake timeout with kubernetes in GKE

I've created a cluster on Google Kubernetes Engine (previously Google Container Engine) and installed the Google Cloud SDK and the Kubernetes tools with it on my Windows machine.
It worked well for some time, and, out of nowhere, it stopped working. Every command I'm issuing with kubectl provokes the following:
Unable to connect to the server: net/http: TLS handshake timeout
I've searched Google, the Kubernetes Github Issues, Stack Overflow, Server Fault ... without success.
I've tried the following:
Restart my computer
Change wifi connection
Check that I'm not somehow using a proxy
Delete and re-create my cluster
Uninstall the Google Cloud SDK (and kubectl) from my machine and re-install them
Delete my .kube folder (config and cache)
Check my .kube/config
Change my cluster's version (tried 1.8.3-gke.0 and 1.7.8-gke.0)
Retry several hours later
Tried both on PowerShell and cmd.exe
Note that the cluster seems to work perfectly, since I have my application running on it and can interact with it normally through the Google Cloud Shell.
Running:
gcloud container clusters get-credentials cluster-2 --zone europe-west1-b --project ___
kubectl get pods
works on Google Cloud Shell and provokes the TLS handshake timeout on my machine.
For others seeing this issue, there is another cause to consider.
After doing:
gcloud config set project $PROJECT_NAME
gcloud config set container/cluster $CLUSTER_NAME
gcloud config set compute/zone europe-west2
gcloud beta container clusters get-credentials $CLUSTER_NAME --region europe-west2 --project $PROJECT_NAME
I was then seeing:
kubectl cluster-info
Unable to connect to the server: net/http: TLS handshake timeout
I tried everything suggested here and elsewhere. When the above worked without issue from my home desktop, I discovered that the shared workspace Wi-Fi was disrupting TLS/VPN traffic in order to control internet access!
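One way to tell a network problem like this apart from a kubectl configuration problem is to attempt a TLS connection to the API server directly, without kubectl. A sketch under assumptions: curl is available, the function name probe_api_tls is invented, and the /version path is used only as a convenient URL. Any HTTP response, even an authorization error, means the handshake itself succeeded.

```shell
# Hypothetical helper: try a TLS connection to a Kubernetes API server and
# report whether the handshake completed within 10 seconds.
probe_api_tls() {
    # $1: API server host or IP
    # -k skips certificate verification; only reachability is being tested.
    if curl -sk --max-time 10 -o /dev/null "https://$1/version"; then
        echo "TLS handshake OK"
    else
        echo "TLS handshake failed (curl exit $?)"
    fi
}
```

The endpoint can be looked up with `gcloud container clusters describe $CLUSTER_NAME --region europe-west2 --format='value(endpoint)'`. If the probe fails on one network and succeeds on another, the network is the likely culprit.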
This is what I did to solve the above problem.
I simply ran the following commands:
> gcloud container clusters get-credentials {cluster_name} --zone {zone_name} --project {project_name}
> gcloud auth application-default login
Replace the placeholders appropriately.
So this MAY NOT work for you on GKE, but Azure AKS (managed Kubernetes) has a similar problem with the same error message so who knows — this might be helpful to someone.
The solution to this for me was to scale the nodes in my Cluster from the Azure Kubernetes service blade web console.
Workaround / Solution
Log into the Azure (or GKE) Console — Kubernetes Service UI.
Scale your cluster up by 1 node.
Wait for scale to complete and attempt to connect (you should be able to).
Scale your cluster back down to the normal size to avoid cost increases.
Total time it took me ~2 mins.
More Background Info on the Issue
Added this to the full ticket description write up that I posted over here (if you want more info have a read):
'Unable to connect Net/http: TLS handshake timeout' — Why can't Kubectl connect to Azure AKS server?

Kubernetes unable to pull images from gcr.io

I am trying to set up Kubernetes for the first time. I am following the Fedora manual installation guide: http://kubernetes.io/v1.0/docs/getting-started-guides/fedora/fedora_manual_config.html
I am trying to get the Kubernetes addons running, specifically kube-ui. I created the service and replication controller like so:
kubectl create -f cluster/addons/kube-ui/kube-ui-rc.yaml --namespace=kube-system
kubectl create -f cluster/addons/kube-ui/kube-ui-svc.yaml --namespace=kube-system
When I run
kubectl get events --namespace=kube-system
I see errors such as this:
Failed to pull image "gcr.io/google_containers/pause:0.8.0": image pull failed for gcr.io/google_containers/pause:0.8.0, this may be because there are no credentials on this request. details: (Authentication is required.)
How am I supposed to tell Kubernetes to authenticate? This isn't covered in the documentation. So how do I fix this?
This happened due to a recent outage of GCE storage, as a result of which all of us hit this error while pulling images from gcr (which uses GCE storage on the backend).
Are you still seeing this error?
As the message says, you need credentials. Are you using Google Container Engine? Then you need to run
gcloud config set project <your-project>
gcloud config set compute/zone <your-zone, like us-central1-f>
gcloud beta container clusters get-credentials --cluster <your-cluster-name>
Then your GCE cluster will have the credentials.