k3s - Metrics server doesn't work for worker nodes - kubernetes

I deployed a k3s cluster onto 2 Raspberry Pi 4 boards, one as a master and the second as a worker, using the script k3s offers, with the following options:
For the master node:
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC='server --bind-address 192.168.1.113' sh -   # 192.168.1.113 is the master node IP
For the agent node:
curl -sfL https://get.k3s.io | \
K3S_URL=https://192.168.1.113:6443 \
K3S_TOKEN=<master-token> \
INSTALL_K3S_EXEC='agent' sh -
Everything seems to work, but kubectl top nodes returns the following:
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
k3s-master 137m 3% 1285Mi 33%
k3s-node-01 <unknown> <unknown> <unknown> <unknown>
I also tried to deploy the k8s dashboard according to what is written in the docs, but it fails to work because it can't reach the metrics server and gets a timeout error:
"error trying to reach service: dial tcp 10.42.1.11:8443: i/o timeout"
and I see a lot of errors in the pod logs:
2021/09/17 09:24:06 Metric client health check failed: the server is currently unable to handle the request (get services dashboard-metrics-scraper). Retrying in 30 seconds.
2021/09/17 09:25:06 Metric client health check failed: the server is currently unable to handle the request (get services dashboard-metrics-scraper). Retrying in 30 seconds.
2021/09/17 09:26:06 Metric client health check failed: the server is currently unable to handle the request (get services dashboard-metrics-scraper). Retrying in 30 seconds.
2021/09/17 09:27:06 Metric client health check failed: the server is currently unable to handle the request (get services dashboard-metrics-scraper). Retrying in 30 seconds.
logs from the metrics-server pod:
E0917 14:03:24.767949 1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:k3s-node-01: unable to fetch metrics from Kubelet k3s-node-01 (k3s-node-01): Get https://k3s-node-01:10250/stats/summary?only_cpu_and_memory=true: dial tcp 192.168.1.106:10250: connect: no route to host
E0917 14:04:24.767960 1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:k3s-node-01: unable to fetch metrics from Kubelet k3s-node-01 (k3s-node-01): Get https://k3s-node-01:10250/stats/summary?only_cpu_and_memory=true: dial tcp 192.168.1.106:10250: connect: no route to host

Moving this out of comments for better visibility.
After creating a small cluster, I wasn't able to reproduce this behaviour: metrics-server worked fine for both nodes, and kubectl top nodes showed information and metrics about both available nodes (though it took some time to start collecting the metrics).
That leads to troubleshooting why it doesn't work. Checking the metrics-server logs is the most efficient way to figure this out:
$ kubectl logs metrics-server-58b44df574-2n9dn -n kube-system
Based on the logs, the next steps will differ. For instance, in the comments above:
first it was no route to host, which points to the network and an inability to resolve the hostname;
then i/o timeout, which means the route exists but the service did not respond. This may happen due to a firewall blocking certain ports/sources, the kubelet not running (it listens on port 10250), or, as it turned out for the OP, an issue with NTP that affected certificates and connections.
The errors may be different in other cases; what matters is to find the error and troubleshoot further based on it. A few commands that can help check these points are sketched below.
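For example, a few quick checks (the worker IP and port are the ones from the logs above; adjust to your cluster):
$ nc -vz 192.168.1.106 10250        # from the master: can it reach the worker's kubelet port?
$ sudo ss -tlnp | grep 10250        # on the worker: is the kubelet actually listening?
$ timedatectl status                # on both nodes: are the clocks in sync? (certificate validation depends on it)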

Related

k8s dashboard: Metric client health check failed

I installed the k8s dashboard using the following command:
kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.0.4/aio/deploy/recommended.yaml
then I watched the logs of the dashboard pod:
$ kubectl -n kubernetes-dashboard logs -f kubernetes-dashboard-665f4c5ff-wcrj9
2020/09/12 04:19:10 Metric client health check failed: an error on the server ("unknown") has prevented the request from succeeding (get services dashboard-metrics-scraper). Retrying in 30 seconds.
2020/09/12 04:19:43 Metric client health check failed: an error on the server ("unknown") has prevented the request from succeeding (get services dashboard-metrics-scraper). Retrying in 30 seconds.
2020/09/12 04:20:17 Metric client health check failed: an error on the server ("unknown") has prevented the request from succeeding (get services dashboard-metrics-scraper). Retrying in 30 seconds.
2020/09/12 04:20:50 Metric client health check failed: an error on the server ("unknown") has prevented the request from succeeding (get services dashboard-metrics-scraper). Retrying in 30 seconds.
2020/09/12 04:21:23 Metric client health check failed: an error on the server ("unknown") has prevented the request from succeeding (get services dashboard-metrics-scraper). Retrying in 30 seconds.
2020/09/12 04:21:56 Metric client health check failed: an error on the server ("unknown") has prevented the request from succeeding (get services dashboard-metrics-scraper). Retrying in 30 seconds.
2020/09/12 04:22:29 Metric client health check failed: an error on the server ("unknown") has prevented the request from succeeding (get services dashboard-metrics-scraper). Retrying in 30 seconds.
kubeadm version: 1.19
kubectl version: 1.19
Can anyone help me?
To give a bit of background information: when you install the Kubernetes Dashboard, you install a Pod that provides the Dashboard as well as a Pod that is in charge of scraping metrics from the Kubernetes Metrics API, the Dashboard Metrics Scraper. The Dashboard delegates to the scraper and expects to reach it via its Kubernetes Service, "dashboard-metrics-scraper".
In your case, this service can't be found. Run kubectl get service -n kubernetes-dashboard to see whether the scraper service was deleted or renamed. If it was deleted, reapply the Dashboard installation YAMLs to recreate it.
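A quick sketch of that check and the reapply, reusing the manifest URL from the question:
$ kubectl get service -n kubernetes-dashboard
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.0.4/aio/deploy/recommended.yaml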
I was unable to replicate your issue but here are some steps you can try to debug the problem:
The Metric client health check failed: ... Retrying in 30 seconds error appears only once in the dashboard's source code, when the health check fails.
The health check itself is a proxy request to the API server.
Use the following command to test whether the proxy is working correctly:
$ kubectl get --raw "/api/v1/namespaces/kubernetes-dashboard/services/dashboard-metrics-scraper/proxy/healthz"
It should return URL: /healthz. If it doesn't, there is most probably something wrong with the dashboard-metrics-scraper service or pod. Make sure that the service exists and that the pod is running and ready.
If it works for you from the CLI but still doesn't work for kubernetes-dashboard, you should check the kubernetes-dashboard RBAC permissions. Make sure that kubernetes-dashboard has permission to proxy.
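One way to check that, assuming the dashboard runs under the default kubernetes-dashboard service account (adjust the name if yours differs):
$ kubectl auth can-i get services/proxy -n kubernetes-dashboard --as=system:serviceaccount:kubernetes-dashboard:kubernetes-dashboard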
The second error you are seeing:
{"level":"error","msg":"Error scraping node metrics: the server could not find the requested resource (get nodes.metrics.k8s.io)","time":"2020-09-13T02:52:38Z"}
indicates that you don't have a metrics server deployed in your cluster. Check the metrics-server GitHub repo for more information.
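If it is indeed missing, it can usually be installed from the release manifest described in that repo (double-check the URL and version against the repo before applying):
$ kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml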
I'm on Kubernetes 1.20.1-00, Ubuntu 20.04. I got the
{"level":"error","msg":"Error scraping node metrics: the server could not find the requested resource (get nodes.metrics.k8s.io)","time":"2020-09-13T02:52:38Z"}
error because I deployed the Kubernetes dashboard with the metrics scraper prior to deploying the metrics server. After a day of running in that configuration I was still getting the "Error scraping node..." error in my metrics scraper pod logs.
I resolved it by scaling the metrics scraper deployment to 0 (zero) and then scaling it back to the desired number of pods (in my case 3).
The error message in the logs went away immediately once the metrics scraper pods had spun up.
I'm not implying that this is the correct fix, just an observation from seeing an identical error. It could be caused by simply deploying the metrics server and the Kubernetes dashboard in the wrong order, as I did.
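For reference, the scale-down/scale-up sequence would look roughly like this (the deployment name and namespace match the recommended.yaml manifest; the replica count is just my setup):
$ kubectl -n kubernetes-dashboard scale deployment dashboard-metrics-scraper --replicas=0
$ kubectl -n kubernetes-dashboard scale deployment dashboard-metrics-scraper --replicas=3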

How to deal with error "dial tcp 10.240.0.4:10250: i/o timeout" to see pod's logs in AKS?

I had been able to run kubectl logs <pod> without issue for many days/versions. However, after I pushed another image and deployed it recently, I got the error below:
Error from server: Get https://aks-agentpool-xxx-0:10250/containerLogs/default/<-pod->/<-service->: dial tcp 10.240.0.4:10250: i/o timeout
I tried to re-build and re-deploy, but it failed.
Below was the Node info for reference:
I'm not sure if your issue is caused by the problem described in this troubleshooting guide, but maybe you can give it a try. It says:
Make sure that the default network security group isn't modified and that both port 22 and 9000 are open for connection to the API server.
Check whether the tunnelfront pod is running in the kube-system namespace using the kubectl get pods --namespace kube-system command. If it isn't, force deletion of the pod and it will restart.
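A rough sketch of that check and the forced restart (the pod name placeholder follows the same convention as above; grep finds the actual name):
$ kubectl get pods --namespace kube-system | grep tunnelfront
$ kubectl delete pod <tunnelfront-pod-name> --namespace kube-system --grace-period=0 --force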

k8s: Unable to delete deployment due to lack of RAM

I got into a vicious circle. I was trying to deploy a few services on an AWS Ubuntu machine with 1 GB of RAM. By the end of the deployment, all RAM was used up. I decided to delete some of the deployments, but I was unable even to check the status of pods and deployments:
$ kubectl delete -f test.yaml
unable to recognize "test.yaml": Get https://172.31.38.138:6443/api?timeout=32s: dial tcp 172.31.38.138:6443: connect: connection refused
$ kubectl get deployments
Unable to connect to the server: dial tcp 172.31.38.138:6443: i/o timeout
I do understand that the issue is a lack of memory, and that kube-dns, kube-proxy, etc. therefore cannot work correctly. The question is:
How can I delete my test deployments without kubectl delete...?
Thanks
Stop the kubelet service, then run docker system prune to delete all the pods, and finally restart the kubelet.
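A minimal sketch of that sequence on a systemd-based node, assuming Docker is the container runtime (the -a flag also removes unused images; drop it if you want to keep them):
$ sudo systemctl stop kubelet
$ sudo docker system prune -a
$ sudo systemctl start kubelet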

Kubernetes Metrics unable to fetch pod/node metrics

I've installed metrics-server on kubernetes v1.11.2.
I'm running a bare-metal cluster using 3 nodes and 1 master
In the metrics-server log I have the following errors:
E0907 14:29:51.774592 1 manager.go:102] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:vps01: unable to fetch metrics from Kubelet vps01 (vps01): Get https://vps01:10250/stats/summary/: dial tcp: lookup vps01 on 10.96.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:vps04: unable to fetch metrics from Kubelet vps04 (vps04): Get https://vps04:10250/stats/summary/: dial tcp: lookup vps04 on 10.96.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:vps03: unable to fetch metrics from Kubelet vps03 (vps03): Get https://vps03:10250/stats/summary/: dial tcp: lookup vps03 on 10.96.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:vps02: unable to fetch metrics from Kubelet vps02 (vps02): Get https://vps02:10250/stats/summary/: dial tcp: lookup vps02 on 10.96.0.10:53: no such host]
E0907 14:30:01.694794 1 reststorage.go:98] unable to fetch pod metrics for pod boxweb/boxweb-deployment-7756c49688-fz625: no metrics known for pod "boxweb/boxweb-deployment-7756c49688-fz625"
E0907 14:30:10.517886 1 reststorage.go:112] unable to fetch node metrics for node "vps01": no metrics known for node "vps01"
I also can't get any metrics using
kubectl top node vps01
The same goes for the autoscaler; it is not working:
unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)
I found the following solution:
Change the metrics-server-deployment.yaml file and add:
command:
- /metrics-server
- --kubelet-preferred-address-types=InternalIP
- --kubelet-insecure-tls
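For context, this is roughly where those flags sit in the metrics-server container spec (a sketch only; the image tag is just an example, use whatever your manifest ships with):
containers:
- name: metrics-server
  image: k8s.gcr.io/metrics-server-amd64:v0.3.1
  command:
  - /metrics-server
  - --kubelet-preferred-address-types=InternalIP
  - --kubelet-insecure-tls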
It looks like you have a DNS issue in your metrics-server pod. You can connect to the pod:
kubectl exec -it metrics-server-xxxxxxxxxx-xxxxx -n kube-system -- sh
/ # ping vps01
If you can't ping it, you can't resolve your node's hostname.
CoreDNS or kube-dns also uses the /etc/resolv.conf on each of your nodes, so I would check whether the nodes can resolve each other. For example, can you ping vps01 from vps02 or vps03, etc.?
I had the same issue and resolved it by adding the hostnames to /etc/hosts on every node.
To collect metric data (CPU/memory usage), the metrics-server tries to access the nodes. However, the metrics-server cannot resolve the hostnames (vps01, vps02, vps03, and vps04) because they are not registered in DNS. As you mentioned, you cannot register the hostnames in DNS.
So you must add the hostnames to /etc/hosts on the node where the metrics-server pod is running.
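A minimal sketch of the /etc/hosts entries (the IP addresses are placeholders; use your actual node IPs):
192.168.1.101 vps01
192.168.1.102 vps02
192.168.1.103 vps03
192.168.1.104 vps04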
The autoscaler does not work because the metrics-server is not working and there is no metric data.

Kubernetes-dashboard pod is crashing again and again

I have installed and configured Kubernetes on my Ubuntu machine, following this document.
After deploying the Kubernetes dashboard, the container keeps crashing:
kubectl create -f https://raw.githubusercontent.com/kubernetes/dashboard/master/src/deploy/recommended/kubernetes-dashboard.yaml
Started the Proxy using:
kubectl proxy --address='0.0.0.0' --accept-hosts='.*' --port=8001
Pod status:
kubectl get pods -o wide --all-namespaces
....
....
kube-system kubernetes-dashboard-64576d84bd-z6pff 0/1 CrashLoopBackOff 26 2h 192.168.162.87 kb-node <none>
Kubernetes system log:
root@KB-master:~# kubectl -n kube-system logs kubernetes-dashboard-64576d84bd-z6pff --follow
2018/09/11 09:27:03 Starting overwatch
2018/09/11 09:27:03 Using apiserver-host location: http://192.168.33.30:8001
2018/09/11 09:27:03 Skipping in-cluster config
2018/09/11 09:27:03 Using random key for csrf signing
2018/09/11 09:27:03 No request provided. Skipping authorization
2018/09/11 09:27:33 Error while initializing connection to Kubernetes apiserver. This most likely means that the cluster is misconfigured (e.g., it has invalid apiserver certificates or service account's configuration) or the --apiserver-host param points to a server that does not exist. Reason: Get http://192.168.33.30:8001/version: dial tcp 192.168.33.30:8001: i/o timeout
Refer to our FAQ and wiki pages for more information: https://github.com/kubernetes/dashboard/wiki/FAQ
I get this message when I try to hit the link below in the browser:
URL:http://192.168.33.30:8001/api/v1/namespaces/kube-system/services/https:kubernetes-dashboard:/proxy/#!/login
Error: 'dial tcp 192.168.162.87:8443: connect: connection refused'
Trying to reach: 'https://192.168.162.87:8443/'
Can anyone help me with this?
http://192.168.33.30:8001 is not a legitimate API server URL. All communications with the API server use TLS internally (the https:// URL scheme). These communications are verified using the API server CA certificate and are authenticated by means of tokens signed by the same CA.
What you see is the result of a misconfiguration. At first sight it seems like you mixed pod, service and host networks.
Make sure you understand the difference between the host network, the pod network and the service network. These 3 networks cannot overlap. For example, --pod-network-cidr=192.168.0.0/16 must not include the IP address of your host; change it to 10.0.0.0/16 or something smaller if necessary.
After you have a clear overview of the network topology, run the setup again and everything will be configured correctly, including the Kubernetes CA.
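A minimal sketch of re-running the setup with a non-overlapping pod CIDR (the CIDR value is just an example; pick one that doesn't collide with your LAN, and re-join the worker nodes afterwards):
$ sudo kubeadm reset -f
$ sudo kubeadm init --pod-network-cidr=10.244.0.0/16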