Cert-manager and Ingress pods in CrashLoopBackOff (AKS) - kubernetes

I was trying to upgrade the Kubernetes version of our cluster from 1.19.7 to 1.22, and some of the worker nodes failed to update, so I restarted the cluster. After the restart the upgrade was successful, but the cert-manager-webhook and cert-manager-cainjector pods went down along with the ingress pods, i.e. they are either in CrashLoopBackOff or Error state.
After checking the logs:
The cert-manager-webhook is throwing this error - "msg"="Failed to generate initial serving certificate, retrying..." "error"="failed verifying CA keypair: tls: failed to find any PEM data in certificate input" "interval"=1000000000
"msg"="Generating new ECDSA private key"
The cert-manager-cainjector is throwing this error- cert-manager/controller-runtime/manager "msg"="Failed to get API Group-Resources" "error"="an error on the server (\"\") has prevented the request from succeeding"
The nginx-ingress pod is throwing this error - SSL certificate chain completion is disabled (--enable-ssl-chain-completion=false)
Can anyone please help?
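One common cause of the "failed to find any PEM data in certificate input" error is a corrupted or emptied webhook CA secret left behind by the upgrade. A sketch of a recovery, assuming the default cert-manager namespace and the secret name used by recent cert-manager releases (older releases used different names; verify first):

```shell
# Inspect the webhook CA secret; an empty or truncated tls.crt would
# explain the "failed to find any PEM data" error
kubectl -n cert-manager get secret cert-manager-webhook-ca -o yaml

# Delete it so the webhook can generate a fresh serving CA on startup
kubectl -n cert-manager delete secret cert-manager-webhook-ca

# Restart the cert-manager components so they pick up the new keypair
kubectl -n cert-manager rollout restart deployment \
  cert-manager cert-manager-webhook cert-manager-cainjector
```

The cainjector's "Failed to get API Group-Resources" error usually clears on its own once the webhook is serving again.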

Related

gmp managed prometheus example not working on a brand new vanilla stable gke autopilot cluster

Google Managed Prometheus seems like a great service; however, at the moment it does not work even in the example... https://cloud.google.com/stackdriver/docs/managed-prometheus/setup-managed
Setup:
create a new autopilot cluster 1.21.12-gke.2200
enable manage prometheus via gcloud cli command
gcloud beta container clusters update <mycluster> --enable-managed-prometheus --region us-central1
add a firewall rule for port 8443 (for the webhook)
install ingress-nginx
try and use the PodMonitoring manifest to get metrics from ingress-nginx
Error from server (InternalError): error when creating "ingress-nginx/metrics.yaml": Internal error occurred: failed calling webhook "default.podmonitorings.gmp-operator.gke-gmp-system.monitoring.googleapis.com": Post "https://gmp-operator.gke-gmp-system.svc:443/default/monitoring.googleapis.com/v1/podmonitorings?timeout=10s": x509: certificate is valid for gmp-operator, gmp-operator.gmp-system, gmp-operator.gmp-system.svc, not gmp-operator.gke-gmp-system.svc
There is a thread suggesting this will all be fixed this week (8/11/2022), https://github.com/GoogleCloudPlatform/prometheus-engine/issues/300, but it seems like this should work regardless.
If I try to port-forward ...
kubectl -n gke-gmp-system port-forward svc/gmp-operator 8443
error: Pod 'gmp-operator-67d5fff8b9-p4n7t' does not have a named port 'webhook'
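The port-forward fails because the service's targetPort is the named port "webhook", which this build of the operator pod does not expose. A possible workaround, as a sketch, is to bypass the service and forward to the pod's numeric container port directly (the label selector and port number below are assumptions; check them first):

```shell
# Find the container port actually exposed by the operator pod
# (the label selector here is an assumption; adjust to your cluster)
kubectl -n gke-gmp-system get pod -l app.kubernetes.io/name=gmp-operator \
  -o jsonpath='{.items[0].spec.containers[0].ports}'

# Then forward to that numeric port, e.g. if the pod reported 8443:
kubectl -n gke-gmp-system port-forward deploy/gmp-operator 8443:8443
```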

Single-Node Kubernetes Cluster Has Cluster-Wide 401 Unauthorized Error in Microservices After CA Cert Rotation

What was done?
kubeadm init phase certs all
kubeadm init phase kubeconfig all
Daemon reloaded
Kubelet restarted
Calico CNI restarted
Now:
All Worker Nodes show Ready State
All Deployments and pods show Running state
Application has errors in logs:
akka.management.cluster.bootstrap.internal.BootstrapCoordinator -
Resolve attempt failed! Cause:
akka.discovery.kubernetes.KubernetesApiServiceDiscovery$KubernetesApiException:
Non-200 from Kubernetes API server: 401 Unauthorized
Kube Apiserver has logs:
Unable to authenticate the request due to an error: [invalid bearer token, square/go-jose: error in cryptographic primitive]
Could it be the old certs and tokens being cached by the services somewhere?
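Quite possibly: running pods keep using the service-account tokens they mounted at startup, and those were signed with the old CA, which matches the "invalid bearer token" error from the API server. A sketch of one recovery path (the `<app-namespace>` placeholder is yours to fill in):

```shell
# Delete the legacy service-account token Secrets so the controller
# manager re-issues them signed with the new CA (they are re-created
# automatically after deletion)
kubectl -n kube-system delete \
  $(kubectl -n kube-system get secret \
    --field-selector type=kubernetes.io/service-account-token -o name)

# Restart application pods so they mount the freshly signed tokens
kubectl -n <app-namespace> rollout restart deployment
```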

k8s dashboard: Metric client health check failed

I installed the k8s dashboard using the following command:
kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.0.4/aio/deploy/recommended.yaml
then I watch the log of dashboard pod:
$ kubectl -n kubernetes-dashboard logs -f kubernetes-dashboard-665f4c5ff-wcrj9
2020/09/12 04:19:10 Metric client health check failed: an error on the server ("unknown") has prevented the request from succeeding (get services dashboard-metrics-scraper). Retrying in 30 seconds.
2020/09/12 04:19:43 Metric client health check failed: an error on the server ("unknown") has prevented the request from succeeding (get services dashboard-metrics-scraper). Retrying in 30 seconds.
(the same message repeats every ~30 seconds)
kubeadm version: 1.19
kubectl version: 1.19
Can anyone help me?
To give a bit of background information: once you install the Kubernetes Dashboard you install a Pod that provides the Dashboard as well as a Pod that is in charge of scraping Metrics from the Kubernetes Metrics API, the Dashboard Metrics Scraper. The dashboard delegates to the scraper, expecting to address it via its K8s Service: "dashboard-metrics-scraper".
In your case, this service can't be found. Do a "kubectl get service -n kubernetes-dashboard" to see whether the scraper service was deleted or renamed. If it was deleted, reapply the Dashboard installation yamls to recreate it.
I was unable to replicate your issue but here are some steps you can try to debug the problem:
The Metric client health check failed: ... Retrying in 30 seconds error appears only once in the dashboard's source code, when the health check fails.
HealthCheck itself is a proxy request to api-server.
Use the following command to test whether the proxy is working correctly:
$ kubectl get --raw "/api/v1/namespaces/kubernetes-dashboard/services/dashboard-metrics-scraper/proxy/healthz"
It should return: URL: /healthz. If it doesn't, there is most probably something wrong with the dashboard-metrics-scraper service or pod. Make sure the service exists and the pod is running and ready.
If it works for you from the CLI but still doesn't work for kubernetes-dashboard, you should check kubernetes-dashboard's RBAC permissions. Make sure that kubernetes-dashboard has permission to proxy.
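The proxy permission can be checked with `kubectl auth can-i` by impersonating the dashboard's service account. A sketch, assuming the service-account name from the recommended.yaml manifest:

```shell
# Can the dashboard's service account proxy to services in its namespace?
# (service account name assumed from the recommended.yaml install)
kubectl auth can-i get services/proxy -n kubernetes-dashboard \
  --as=system:serviceaccount:kubernetes-dashboard:kubernetes-dashboard
```

A "no" here points to a missing or mis-bound Role/RoleBinding rather than a scraper problem.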
The second error you are seeing:
{"level":"error","msg":"Error scraping node metrics: the server could not find the requested resource (get nodes.metrics.k8s.io)","time":"2020-09-13T02:52:38Z"}
indicates that you don't have a metrics server deployed in your cluster. Check metrics-server github repo for more information.
I'm on Kubernetes 1.20.1-00, Ubuntu 20.04. I got the
{"level":"error","msg":"Error scraping node metrics: the server could not find the requested resource (get nodes.metrics.k8s.io)","time":"2020-09-13T02:52:38Z"}
error because I deployed the Kubernetes dashboard with the metrics scraper prior to deploying the metrics server. After a day of running in that configuration I was still getting the "Error scraping node..." message in my metrics scraper pod logs.
I resolved it by scaling the metrics scraper deployment to 0 (zero) and then scaling it back to the desired number of pods (in my case 3).
The error message in the logs went away immediately once the metrics scraper pods had spun up.
I'm not implying that this is the correct fix, just an observation from seeing an identical error. It could be caused by simply deploying the metrics server and Kubernetes dashboard in the wrong order, as I did.
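The scale-down/scale-up described above can be done with two commands (deployment name taken from the recommended.yaml install; replica count is whatever you run):

```shell
# Force a clean restart of the scraper by scaling to zero and back
kubectl -n kubernetes-dashboard scale deployment dashboard-metrics-scraper --replicas=0
kubectl -n kubernetes-dashboard scale deployment dashboard-metrics-scraper --replicas=3
```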

Timeouts in metrics-server right after installing ingress in AKS

Prerequisites:
New kubernetes cluster (Azure, v. 1.14.8) is set up
Metrics-server is set up automatically by AKS (v. 0.3.5)
Steps:
Install ingress into the cluster via helm install ingress stable/nginx-ingress --namespace ingress --create-namespace --set controller.replicaCount=1
Wait a few minutes
After some minutes (3-8) there are errors in metrics-server and it falls into a loop with a FailedDiscoveryCheck error: Failed to make webhook authorized request: Post https://...azmk8s.io:443/apis/authorization.k8s.io/v1beta1/subjectaccessreviews: read tcp %IP%: read: connection timed out.
Error in NGINX Ingress controller pod:
E0625 12:18:49.622522 6 leaderelection.go:320] error retrieving resource lock ingress/ingress-controller-leader-nginx: Get "https://10.0.0.1:443/api/v1/namespaces/ingress/configmaps/ingress-controller-leader-nginx": context deadline exceeded
I0625 12:18:49.622561 6 leaderelection.go:277] failed to renew lease ingress/ingress-controller-leader-nginx: timed out waiting for the condition
I0625 12:18:49.626143 6 leaderelection.go:242] attempting to acquire leader lease ingress/ingress-controller-leader-nginx...
E0625 12:34:13.890642 6 leaderelection.go:320] error retrieving resource lock ingress/ingress-controller-leader-nginx: Get "https://10.0.0.1:443/api/v1/namespaces/ingress/configmaps/ingress-controller-leader-nginx": read tcp 10.244.0.53:55144->10.0.0.1:443: read: connection timed out
The metrics-server does not work until it is restarted. After the restart, no issues are observed. Adding liveness/readiness probes to the metrics-server deployment fixes the late restart of metrics-server, but does not fix the root cause.
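The probe workaround mentioned above could be applied as a patch along these lines (a sketch only; the container name and HTTPS port are the AKS defaults for metrics-server v0.3.x, so verify them against your deployment first):

```shell
# Add a liveness probe so Kubernetes restarts metrics-server as soon as
# /healthz stops answering, instead of waiting for a manual restart
kubectl -n kube-system patch deployment metrics-server --patch '
spec:
  template:
    spec:
      containers:
      - name: metrics-server
        livenessProbe:
          httpGet:
            path: /healthz
            port: 443
            scheme: HTTPS
          initialDelaySeconds: 20
          periodSeconds: 10
'
```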
Why does the metrics-server stop working only a few minutes after installing ingress? How does installing ingress affect the cluster? It reproduces reliably: you can delete ingress, install it again, and the issue repeats.
Sometimes, metrics-server fails with error:
Message: endpoints for service/metrics-server in "kube-system" have no addresses
Reason: MissingEndpoints
The same behavior is also observed for another pod: if you install kubernetes-dashboard, it stops working after the installation of ingress, with a 500 context deadline exceeded error.
I want to understand and fix the root cause.
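Both symptoms mentioned above (FailedDiscoveryCheck and MissingEndpoints) are visible from the API aggregation layer, which can help narrow down when the breakage starts:

```shell
# The metrics APIService should show Available=True; FailedDiscoveryCheck
# appears in its conditions when the apiserver cannot reach metrics-server
kubectl get apiservice v1beta1.metrics.k8s.io
kubectl describe apiservice v1beta1.metrics.k8s.io

# The MissingEndpoints symptom can be confirmed directly:
kubectl -n kube-system get endpoints metrics-server
```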

Kubernetes-dashboard pod is crashing again and again

I have installed and configured Kubernetes on my Ubuntu machine, following this document.
After deploying the Kubernetes dashboard, the container keeps crashing:
kubectl create -f https://raw.githubusercontent.com/kubernetes/dashboard/master/src/deploy/recommended/kubernetes-dashboard.yaml
Started the proxy using:
kubectl proxy --address='0.0.0.0' --accept-hosts='.*' --port=8001
Pod status:
kubectl get pods -o wide --all-namespaces
....
....
kube-system kubernetes-dashboard-64576d84bd-z6pff 0/1 CrashLoopBackOff 26 2h 192.168.162.87 kb-node <none>
Kubernetes system log:
root@KB-master:~# kubectl -n kube-system logs kubernetes-dashboard-64576d84bd-z6pff --follow
2018/09/11 09:27:03 Starting overwatch
2018/09/11 09:27:03 Using apiserver-host location: http://192.168.33.30:8001
2018/09/11 09:27:03 Skipping in-cluster config
2018/09/11 09:27:03 Using random key for csrf signing
2018/09/11 09:27:03 No request provided. Skipping authorization
2018/09/11 09:27:33 Error while initializing connection to Kubernetes apiserver. This most likely means that the cluster is misconfigured (e.g., it has invalid apiserver certificates or service account's configuration) or the --apiserver-host param points to a server that does not exist. Reason: Get http://192.168.33.30:8001/version: dial tcp 192.168.33.30:8001: i/o timeout
Refer to our FAQ and wiki pages for more information: https://github.com/kubernetes/dashboard/wiki/FAQ
Getting this message when I try to hit the link below in the browser:
URL:http://192.168.33.30:8001/api/v1/namespaces/kube-system/services/https:kubernetes-dashboard:/proxy/#!/login
Error: 'dial tcp 192.168.162.87:8443: connect: connection refused'
Trying to reach: 'https://192.168.162.87:8443/'
Can anyone help me with this?
http://192.168.33.30:8001 is not a legitimate API server URL. All communications with the API server use TLS internally (https:// URL scheme). These communications are verified using the API server CA certificate and are authenticated by means of tokens signed by the same CA.
What you see is the result of a misconfiguration. At first sight it seems like you mixed the pod, service, and host networks.
Make sure you understand the difference between the host network, pod network, and service network. These three networks cannot overlap. For example, --pod-network-cidr=192.168.0.0/16 must not include the IP address of your host; change it to 10.0.0.0/16 or something smaller if necessary.
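As an illustration of non-overlapping ranges for this setup (the hosts here sit on 192.168.33.0/24, so the pod network must move off 192.168.0.0/16 entirely; the exact CIDRs below are example values, not requirements):

```shell
# Tear down the misconfigured control plane, then re-init with a pod
# network and service network that do not overlap the host subnet
kubeadm reset
kubeadm init --pod-network-cidr=10.244.0.0/16 --service-cidr=10.96.0.0/12
```

Remember to redeploy the CNI plugin with a matching pod CIDR afterwards.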
After you have a clear overview of the network topology, run the setup again and everything will be configured correctly, including the Kubernetes CA.