Kubernetes (Azure's AKS) suddenly gives error "kubectl x509 certificate has expired or is not yet valid"

Suddenly an entire Kubernetes cluster (Azure's AKS offering) became unresponsive.
When running kubectl commands, the result is x509: certificate has expired or is not yet valid.
Nothing in Azure Portal indicates an unhealthy state.

The quick solution:
az aks rotate-certs -g $RESOURCE_GROUP_NAME -n $CLUSTER_NAME
When certificates have been rotated, you can use kubectl again.
Be ready to wait 30 minutes before the cluster fully recovers.
Full explanation can be found in this article:
https://learn.microsoft.com/en-us/azure/aks/certificate-rotation
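Before rotating, you can confirm that it really is an expired API server certificate by inspecting the certificate the endpoint presents. This is a minimal sketch; $CLUSTER_FQDN is a placeholder for the API server address from your kubeconfig or the Azure portal:
# Print the expiry date of the certificate served by the cluster's API server
echo | openssl s_client -connect $CLUSTER_FQDN:443 2>/dev/null | openssl x509 -noout -enddate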

AKS clusters created prior to May 2019 have certificates that expire after two years. Any cluster created after May 2019, or any cluster whose certificates have been rotated, has a Cluster CA certificate that expires after 30 years. All other AKS certificates, which use the Cluster CA for signing, expire after two years and are automatically rotated during any AKS version upgrade performed after 8/1/2021. To verify when your cluster was created, use kubectl get nodes to see the age of your node pools.
Here are the commands you can use to resolve the issue by rotating the certificates:
az account set --subscription $SUBSCRIPTION_ID
az aks get-credentials -g $RESOURCE_GROUP_NAME -n $CLUSTER_NAME
az aks rotate-certs -g $RESOURCE_GROUP_NAME -n $CLUSTER_NAME
Note: running get-credentials first is mandatory in order to rotate the certificates.
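Once the rotation has finished, a small verification step helps (sketch; --overwrite-existing simply replaces the stale entry in your kubeconfig):
# Re-fetch the cluster credentials and confirm kubectl responds again
az aks get-credentials -g $RESOURCE_GROUP_NAME -n $CLUSTER_NAME --overwrite-existing
kubectl get nodes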

Related

What causes x509 cert unknown authority in some Kubernetes clusters when using the Hashicorp Vault CLI?

I'm trying to deploy an instance of HashiCorp Vault with TLS and integrated storage using the official Helm chart. I've run through the official tutorial using minikube without any issues. I also tested this tutorial with a cluster created with kind. The tutorial went as expected on both minikube and kind; however, when I tried on a production cluster created with TKGI (Tanzu Kubernetes Grid Integrated) I ran into x509 errors running vault commands in the server pods. I can get past some of them by using -tls-skip-verify, but what might be different between these clusters that causes the warning? It seems to be causing additional problems when I try to join the replicas to the raft pool.
Here's an example showing the x509 error:
bash-3.2$ kubectl exec -n vault vault-0 -- vault operator init \
> -key-shares=1 \
> -key-threshold=1 \
> -format=json > /tmp/vault/cluster-keys.json
Get "https://127.0.0.1:8200/v1/sys/seal-status": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "ca")
Is there something that could be updated on the TKGI clusters so that these x509 errors could be avoided?
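One way to avoid -tls-skip-verify while debugging is to point the Vault CLI at the CA that signed the server certificate via the standard VAULT_CACERT variable. A rough sketch, assuming the tutorial's vault-ha-tls secret is mounted at /vault/userconfig/vault-ha-tls (adjust the path and names to your own chart values):
# Tell the Vault CLI inside the pod which CA to trust instead of skipping verification
kubectl exec -n vault vault-0 -- env VAULT_CACERT=/vault/userconfig/vault-ha-tls/vault.ca \
  vault operator init -key-shares=1 -key-threshold=1 -format=json
If this still fails, the CA the pods trust does not match the certificate the server presents, which narrows down where the clusters differ.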

Unable to connect with kubectl to a GKE cluster

I am currently trying to connect with kubectl to a GKE cluster.
I followed the steps in the documentation and executed successfully the following:
gcloud container clusters get-credentials <cluster_name> --zone <zone>
A few days ago it worked perfectly fine; I was able to set up a connection with kubectl.
The configuration has not changed in any way, and I am still trying to access the cluster through the same network. The cluster itself is running stable. Whatever I try, I run into a timeout.
I have already had a look into the kubectl configuration:
kubectl config view
It seems that the access token has expired.
...
expiry: "2022-08-01T12:12:35Z"
expiry-key: '{.credential.token_expiry}'
token-key: '{.credential.access_token}'
...
Is there any chance to update the token? I am not able to update the token with the get-credentials command. I already deleted the configuration completely and ran the command afterwards, but the token is still the same.
I am very thankful for any hints or ideas on this.
Have you tried rerunning your credentials command again to refresh your local kubeconfig?
gcloud container clusters get-credentials <cluster_name> --zone <zone>
Alternatively, try the beta variant:
gcloud beta container clusters get-credentials <cluster_name> --zone <zone>
(You may need to install the beta package using gcloud components install beta)
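If the token still does not refresh after get-credentials, re-authenticating gcloud itself before fetching credentials is worth a try (a sketch using standard gcloud auth commands; replace the placeholders as above):
gcloud auth login
gcloud auth application-default login
gcloud container clusters get-credentials <cluster_name> --zone <zone>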

No pods started after "kubeadm alpha certs renew"

I did a
kubeadm alpha certs renew
but after that, no pods get started. When creating a pod from a Deployment, kubectl get pod doesn't even list the pod; when explicitly starting a pod, it is stuck in Pending.
What am I missing?
Normally I would follow a pattern to debug such issues, starting with:
1. Check that all the certificate files were rotated by kubeadm, using sudo cat /etc/kubernetes/ssl/apiserver.crt | openssl x509 -text.
2. Make sure all the control plane services (api-server, controller-manager, scheduler etc.) have been restarted to use the new certificates.
3. If [1] and [2] are okay, you should be able to run kubectl get pods.
4. Now check the certificates for the kubelet and make sure you are not hitting https://github.com/kubernetes/kubeadm/issues/1753.
5. Make sure the kubelet is restarted to use the new certificate (see the sketch below).
I think of control plane certificate expiry (not being able to use kubectl) and kubelet certificate expiry (node status NotReady; you should see certificate attempts in the api-server logs from the node) separately, so I can quickly tell which of the two might be broken.
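As a concrete sketch of steps 1, 2 and 5, assuming a default kubeadm layout with static pod manifests in /etc/kubernetes/manifests:
# Show the expiry dates of all kubeadm-managed certificates
# (on newer kubeadm versions the command is `kubeadm certs check-expiration`)
sudo kubeadm alpha certs check-expiration
# Force the static control plane pods to restart with the renewed certificates,
# then restart the kubelet as well
sudo mv /etc/kubernetes/manifests /etc/kubernetes/manifests.bak
sleep 30
sudo mv /etc/kubernetes/manifests.bak /etc/kubernetes/manifests
sudo systemctl restart kubelet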

kubectl x509: certificate signed by unknown authority

We have a running instance on GKE. Since this afternoon we have started to receive an "x509: certificate signed by unknown authority" error from newly created Kubernetes clusters. kubectl works with the old clusters but not with the new ones.
What we tried:
gcloud update
kubectl update
gcloud re-authenticate
clean gcloud install & auth
.kube/config cert remove & gcloud container clusters get-credentials
remove and add new clusters
Thanks.
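One way to narrow this down is to compare the CA that kubectl trusts (embedded in .kube/config) with the certificate the new cluster's API server actually presents; if the issuers differ, something on the network path (a proxy, for example) is intercepting the connection. A sketch, where $CLUSTER_ENDPOINT is a placeholder for the endpoint shown by gcloud container clusters describe:
# CA that kubectl will use for the current context
kubectl config view --raw --minify -o jsonpath='{.clusters[0].cluster.certificate-authority-data}' | base64 -d | openssl x509 -noout -issuer -enddate
# Certificate actually presented by the cluster endpoint
echo | openssl s_client -connect $CLUSTER_ENDPOINT:443 2>/dev/null | openssl x509 -noout -issuer -enddate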

TLS handshake timeout with kubernetes in GKE

I've created a cluster on Google Kubernetes Engine (previously Google Container Engine) and installed the Google Cloud SDK and the Kubernetes tools with it on my Windows machine.
It worked well for some time, and, out of nowhere, it stopped working. Every command I'm issuing with kubectl provokes the following:
Unable to connect to the server: net/http: TLS handshake timeout
I've searched Google, the Kubernetes Github Issues, Stack Overflow, Server Fault ... without success.
I've tried the following:
Restart my computer
Change wifi connection
Check that I'm not somehow using a proxy
Delete and re-create my cluster
Uninstall the Google Cloud SDK (and kubectl) from my machine and re-install them
Delete my .kube folder (config and cache)
Check my .kube/config
Change my cluster's version (tried 1.8.3-gke.0 and 1.7.8-gke.0)
Retry several hours later
Tried both on PowerShell and cmd.exe
Note that the cluster seems to work perfectly, since I have my application running on it and can interact with it normally through the Google Cloud Shell.
Running:
gcloud container clusters get-credentials cluster-2 --zone europe-west1-b --project ___
kubectl get pods
works on Google Cloud Shell and provokes the TLS handshake timeout on my machine.
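One thing that can help separate a tooling problem from a network problem is attempting the TLS handshake directly against the cluster endpoint from the affected machine (a sketch for any machine with openssl available; the endpoint can be read with gcloud container clusters describe cluster-2 --zone europe-west1-b --format='value(endpoint)'):
# Attempt a raw TLS handshake against the cluster's public endpoint
openssl s_client -connect $CLUSTER_ENDPOINT:443 -servername $CLUSTER_ENDPOINT < /dev/null
If this also hangs, the timeout is happening on the network path rather than in kubectl or the kubeconfig.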
For others seeing this issue, there is another cause to consider.
After doing:
gcloud config set project $PROJECT_NAME
gcloud config set container/cluster $CLUSTER_NAME
gcloud config set compute/zone europe-west2
gcloud beta container clusters get-credentials $CLUSTER_NAME --region europe-west2 --project $PROJECT_NAME
I was then seeing:
kubectl cluster-info
Unable to connect to the server: net/http: TLS handshake timeout
I tried everything suggested here and elsewhere. When the above worked without issue from my home desktop, I discovered that the shared workspace wifi was disrupting TLS/VPN connections in order to control internet access!
This is what I did to solve the above problem.
I simply ran the following commands:
> gcloud container clusters get-credentials {cluster_name} --zone {zone_name} --project {project_name}
> gcloud auth application-default login
Replace the placeholders appropriately.
So this MAY NOT work for you on GKE, but Azure AKS (managed Kubernetes) has a similar problem with the same error message so who knows — this might be helpful to someone.
The solution to this for me was to scale the nodes in my cluster from the Azure Kubernetes Service blade in the web console.
Workaround / Solution
Log into the Azure (or GKE) Console — Kubernetes Service UI.
Scale your cluster up by 1 node.
Wait for scale to complete and attempt to connect (you should be able to).
Scale your cluster back down to the normal size to avoid cost increases.
Total time it took me ~2 mins.
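If you prefer the CLI over the portal for this workaround, the same scale up / scale down can be done with az aks scale (a sketch; the resource group, cluster name, and node counts are placeholders for your own values):
# Scale up by one node and wait for the operation to finish, then verify kubectl connectivity...
az aks scale -g $RESOURCE_GROUP_NAME -n $CLUSTER_NAME --node-count 4
# ...and scale back down to the original size to avoid extra cost
az aks scale -g $RESOURCE_GROUP_NAME -n $CLUSTER_NAME --node-count 3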
More Background Info on the Issue
I added this to the full ticket description write-up that I posted over here (if you want more info, have a read):
'Unable to connect Net/http: TLS handshake timeout' — Why can't Kubectl connect to Azure AKS server?