Today minikube has suddenly started taking a very long time to respond to kubectl commands.
And occasionally even:
kubectl get pods
Unable to connect to the server: net/http: TLS handshake timeout
How can I diagnose this?
Some logs from minikube logs:
==> kube-scheduler <==
I0527 14:16:55.809859 1 serving.go:319] Generated self-signed cert in-memory
W0527 14:16:56.256478 1 authentication.go:387] failed to read in-cluster kubeconfig for delegated authentication: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
W0527 14:16:56.256856 1 authentication.go:249] No authentication-kubeconfig provided in order to lookup client-ca-file in configmap/extension-apiserver-authentication in kube-system, so client certificate authentication won't work.
W0527 14:16:56.257077 1 authentication.go:252] No authentication-kubeconfig provided in order to lookup requestheader-client-ca-file in configmap/extension-apiserver-authentication in kube-system, so request-header client certificate authentication won't work.
W0527 14:16:56.257189 1 authorization.go:177] failed to read in-cluster kubeconfig for delegated authorization: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
W0527 14:16:56.257307 1 authorization.go:146] No authorization-kubeconfig provided, so SubjectAccessReview of authorization tokens won't work.
I0527 14:16:56.264875 1 server.go:142] Version: v1.14.1
I0527 14:16:56.265228 1 defaults.go:87] TaintNodesByCondition is enabled, PodToleratesNodeTaints predicate is mandatory
W0527 14:16:56.286959 1 authorization.go:47] Authorization is disabled
W0527 14:16:56.286982 1 authentication.go:55] Authentication is disabled
I0527 14:16:56.286995 1 deprecated_insecure_serving.go:49] Serving healthz insecurely on [::]:10251
I0527 14:16:56.287397 1 secure_serving.go:116] Serving securely on 127.0.0.1:10259
I0527 14:16:57.417028 1 controller_utils.go:1027] Waiting for caches to sync for scheduler controller
I0527 14:16:57.524378 1 controller_utils.go:1034] Caches are synced for scheduler controller
I0527 14:16:57.827438 1 leaderelection.go:217] attempting to acquire leader lease kube-system/kube-scheduler...
E0527 14:17:10.865448 1 leaderelection.go:306] error retrieving resource lock kube-system/kube-scheduler: Get https://localhost:8443/api/v1/namespaces/kube-system/endpoints/kube-scheduler?timeout=10s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0527 14:17:43.418910 1 leaderelection.go:306] error retrieving resource lock kube-system/kube-scheduler: Get https://localhost:8443/api/v1/namespaces/kube-system/endpoints/kube-scheduler?timeout=10s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
I0527 14:18:01.447065 1 leaderelection.go:227] successfully acquired lease kube-system/kube-scheduler
I0527 14:18:29.044544 1 leaderelection.go:263] failed to renew lease kube-system/kube-scheduler: failed to tryAcquireOrRenew context deadline exceeded
E0527 14:18:38.999295 1 server.go:252] lost master
E0527 14:18:39.204637 1 leaderelection.go:306] error retrieving resource lock kube-system/kube-scheduler: Get https://localhost:8443/api/v1/namespaces/kube-system/endpoints/kube-scheduler?timeout=10s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
lost lease
Update:
To work around this issue I just did a minikube delete and minikube start, and the performance issue was resolved.
As a solution has been found, I am posting this as a Community Wiki for future users.
1) Debug issues with minikube by adding the --v flag and setting a debug level (0, 1, 2, 3, 7).
For example: minikube start --v=1 to set output to INFO level.
More detailed information here
2) Use logs command minikube logs
3) Because Minikube runs in a virtual machine, it is sometimes better to delete it and start it again (this helped in this case).
minikube delete
minikube start
4) It might be slow due to a lack of resources.
By default Minikube uses 2048MB of memory and 2 CPUs. More details about this can be found here
In addition, you can make Minikube allocate more using the command
minikube start --cpus 4 --memory 8192
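When reading minikube logs output (step 2), it can help to show only the error lines. The following is a minimal sketch, assuming the component logs use the klog line layout seen in the question (severity letter, date, time, process id, source file and line, then the message). The names KLOG_LINE, KlogEntry and problem_entries are made up for this illustration.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Layout assumed from the excerpt in the question, e.g.
# "E0527 14:18:38.999295 1 server.go:252] lost master"
KLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<date>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<pid>\d+) (?P<source>[^\s:]+:\d+)\] (?P<message>.*)$"
)


class KlogEntry(NamedTuple):
    severity: str
    time: str
    source: str
    message: str


def problem_entries(lines: Iterable[str], severities: str = "EF") -> Iterator[KlogEntry]:
    """Yield only the entries whose severity letter is in `severities`.

    Lines that do not match the assumed layout (section headers such as
    "==> kube-scheduler <==", plain text, blank lines) are skipped.
    """
    for line in lines:
        match = KLOG_LINE.match(line.strip())
        if match and match["severity"] in severities:
            yield KlogEntry(
                match["severity"], match["time"], match["source"], match["message"]
            )
```

One way to use it is to pipe minikube logs into a small script that passes sys.stdin to problem_entries and prints each entry. Passing severities="WEF" would include warnings as well.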
We are seeing some strange behavior in our K8s cluster.
When we try to deploy a new version of our applications we get:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "<container-id>" network for pod "application-6647b7cbdb-4tp2v": networkPlugin cni failed to set up pod "application-6647b7cbdb-4tp2v_default" network: Get "https://[10.233.0.1]:443/api/v1/namespaces/default": dial tcp 10.233.0.1:443: connect: connection refused
I used kubectl get cs and found the controller and scheduler in an Unhealthy state.
As described here, I updated /etc/kubernetes/manifests/kube-scheduler.yaml and
/etc/kubernetes/manifests/kube-controller-manager.yaml by commenting out --port=0
When I checked systemctl status kubelet it was working.
Active: active (running) since Mon 2020-10-26 13:18:46 +0530; 1 years 0 months ago
I restarted the kubelet service, and the controller and scheduler were then shown as healthy.
But systemctl status kubelet now shows the following (immediately after the restart it showed a running state):
Active: activating (auto-restart) (Result: exit-code) since Thu 2021-11-11 10:50:49 +0530; 3s ago
Docs: https://github.com/GoogleCloudPlatform/kubernetes
Process: 21234 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET
I tried adding Environment="KUBELET_SYSTEM_PODS_ARGS=--pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true --fail-swap-on=false" to /etc/systemd/system/kubelet.service.d/10-kubeadm.conf as described here, but it is still not working properly.
I also uncommented --port=0 in the manifests mentioned above and tried restarting, with the same result.
Edit: This issue was caused by an expired kubelet certificate and was fixed by following these steps. If someone faces this issue, make sure the certificate and key values from /var/lib/kubelet/pki/kubelet-client-current.pem are base64 encoded when placing them in /etc/kubernetes/kubelet.conf
Many others suggested running kubeadm init again, but this cluster was created using kubespray, with no manually added nodes.
We have bare-metal K8s running on Ubuntu 18.04.
K8: v1.18.8
We would welcome any suggestions for debugging and fixing this.
PS:
When we try to telnet 10.233.0.1 443 from any node, the first attempt fails and the second attempt succeeds.
Edit: Found this in the kubelet service logs:
Nov 10 17:35:05 node1 kubelet[1951]: W1110 17:35:05.380982 1951 docker_sandbox.go:402] failed to read pod IP from plugin/docker: networkPlugin cni failed on the status hook for pod "app-7b54557dd4-bzjd9_default": unexpected command output nsenter: cannot open /proc/12311/ns/net: No such file or directory
Posting the comment as a community wiki answer for better visibility.
This issue was caused by an expired kubelet certificate and was fixed by following these steps. If someone faces this issue, make sure the certificate and key values from /var/lib/kubelet/pki/kubelet-client-current.pem are base64 encoded when placing them in /etc/kubernetes/kubelet.conf
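The base64 step can be sketched as follows. This is only an illustration of the encoding, not the linked procedure: it assumes the PEM file holds a certificate block and a private key block, and that the values go into the kubeconfig fields client-certificate-data and client-key-data. The function name kubeconfig_fields is made up for this example.

```python
import base64
import re

# Matches one PEM block, e.g. "-----BEGIN CERTIFICATE----- ... -----END CERTIFICATE-----".
PEM_BLOCK = re.compile(
    r"-----BEGIN (?P<label>[A-Z0-9 ]+)-----.*?-----END (?P=label)-----",
    re.DOTALL,
)


def kubeconfig_fields(pem_text: str) -> dict:
    """Split a combined PEM file and base64-encode the certificate and key parts.

    Assumption: certificate blocks are labelled CERTIFICATE and the key block
    has a label ending in PRIVATE KEY (for example "EC PRIVATE KEY").
    """
    certs, keys = [], []
    for match in PEM_BLOCK.finditer(pem_text):
        block = match.group(0).strip() + "\n"
        if match["label"] == "CERTIFICATE":
            certs.append(block)
        elif match["label"].endswith("PRIVATE KEY"):
            keys.append(block)
    if not certs or not keys:
        raise ValueError("expected at least one certificate and one private key block")

    def encode(blocks):
        # b64encode produces a single line without embedded newlines.
        return base64.b64encode("".join(blocks).encode("ascii")).decode("ascii")

    return {
        "client-certificate-data": encode(certs),
        "client-key-data": encode(keys),
    }
```

Reading /var/lib/kubelet/pki/kubelet-client-current.pem as text and passing it to kubeconfig_fields gives two single-line strings to paste into the kubeconfig. Back up the existing file before editing it.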
I am trying to add a node to my (currently running) Kubernetes cluster.
When I run the kubeadm join command, I get the following error:
[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker
cgroup driver. The recommended driver is "systemd".
Please follow the guide at https://kubernetes.io/docs/setup/cri/
error execution phase preflight: couldn't validate the identity of the API Server:
Get "https://159.65.40.41:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s":
net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
To see the stack trace of this error execute with --v=5 or higher
Here is a snippet from the stack trace:
I0917 16:06:58.162180 2714 token.go:215] [discovery] Failed to request cluster-info, will try again: Get "https://*redacted*:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
What does this mean and how do I solve it?
I forgot that I had a firewall installed on my server.
I opened port 6443 following the instructions found here (Kubeadm join failed : Failed to request cluster-info) and all is well!
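Before running kubeadm join again, a quick way to tell whether the API server port is reachable from the joining node is a plain TCP connection test. Here is a minimal sketch using only the Python standard library. The function name can_connect is made up for this example, and the wording of the result strings is my own interpretation rather than kubeadm output.

```python
import socket
from typing import Tuple


def can_connect(host: str, port: int, timeout: float = 5.0) -> Tuple[bool, str]:
    """Try to open a TCP connection and report what happened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except socket.timeout:
        # No answer at all within the timeout; a firewall that drops
        # packets is one common reason, but not the only one.
        return False, "timed out"
    except OSError as exc:
        # Covers "connection refused", "no route to host", DNS failures, etc.
        return False, f"failed: {exc}"
```

For the case above one would call can_connect with the API server address and port 6443 from the node that is trying to join. A timeout while the API server is known to be running points towards something between the two machines, such as the firewall found here.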
My k8s 1.12.8 cluster (created via kops) has been running fine for 6+ months. Recently, something caused both kube-scheduler and kube-controller-manager on the master node to die and restart:
SyncLoop (PLEG): "kube-controller-manager-ip-x-x-x-x.z.compute.internal_kube-system(abc123)", event: &pleg.PodLifecycleEvent{ID:"abc123", Type:"ContainerDied", Data:"def456"}
hostname for pod:"kube-controller-manager-ip-x-x-x-x.z.compute.internal" was longer than 63. Truncated hostname to :"kube-controller-manager-ip-x-x-x-x.z.compute.inter"
SyncLoop (PLEG): "kube-scheduler-ip-x-x-x-x.z.compute.internal_kube-system(hij678)", event: &pleg.PodLifecycleEvent{ID:"hij678", Type:"ContainerDied", Data:"890klm"}
SyncLoop (PLEG): "kube-controller-manager-ip-x-x-x-x.eu-west-2.compute.internal_kube-system(abc123)", event: &pleg.PodLifecycleEvent{ID:"abc123", Type:"ContainerStarted", Data:"def345"}
SyncLoop (container unhealthy): "kube-scheduler-ip-x-x-x-x.z.compute.internal_kube-system(hjk678)"
SyncLoop (PLEG): "kube-scheduler-ip-x-x-x-x.z.compute.internal_kube-system(ghj567)", event: &pleg.PodLifecycleEvent{ID:"ghj567", Type:"ContainerStarted", Data:"hjk768"}
Ever since kube-scheduler and kube-controller-manager restarted, kubelet is completely unable to get or update any node status:
Error updating node status, will retry: failed to patch status "{"status":{"$setElementOrder/conditions":[{"type":"NetworkUnavailable"},{"type":"OutOfDisk"},{"type":"MemoryPressure"},{"type":"DiskPressure"},{"type":"PIDPressure"},{"type":"Ready"}],"conditions":[{"lastHeartbeatTime":"2020-08-12T09:22:08Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2020-08-12T09:22:08Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2020-08-12T09:22:08Z","type":"DiskPressure"},{"lastHeartbeatTime":"2020-08-12T09:22:08Z","type":"PIDPressure"},{"lastHeartbeatTime":"2020-08-12T09:22:08Z","type":"Ready"}]}}" for node "ip-172-20-60-88.eu-west-2.compute.internal": Patch https://127.0.0.1/api/v1/nodes/ip-172-20-60-88.eu-west-2.compute.internal/status?timeout=10s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Error updating node status, will retry: error getting node "ip-x-x-x-x.z.compute.internal": Get https://127.0.0.1/api/v1/nodes/ip-x-x-x-x.z.compute.internal?timeout=10s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Error updating node status, will retry: error getting node "ip-x-x-x-x.z.compute.internal": Get https://127.0.0.1/api/v1/nodes/ip-x-x-x-x.z.compute.internal?timeout=10s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Error updating node status, will retry: error getting node "ip-x-x-x-x.z.compute.internal": Get https://127.0.0.1/api/v1/nodes/ip-x-x-x-x.z.compute.internal?timeout=10s: context deadline exceeded
Error updating node status, will retry: error getting node "ip-x-x-x-x.z.compute.internal": Get https://127.0.0.1/api/v1/nodes/ip-x-x-x-x.z.compute.internal?timeout=10s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Unable to update node status: update node status exceeds retry count
The cluster is completely unable to perform any updates in this state.
What can cause the master node to lose connectivity to nodes like this?
Is the 2nd line in the first log output ('Truncated hostname..') a potential source of the issue?
How can I further diagnose what is actually causing the get/update node actions to fail?
As far as I remember, Kubernetes limits the hostname to fewer than 64 characters. Is it possible that the hostname was changed recently?
If so, it would be good to reconstruct the kubelet configuration using this documentation:
https://kubernetes.io/docs/tasks/administer-cluster/reconfigure-kubelet/
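To see whether a given pod hostname is over the limit quoted in the log message, a length check is enough. This is a small sketch: the constant 63 comes from the "longer than 63" message in the question, and the returned prefix is a plain cut at 63 characters, which may differ slightly from what the kubelet actually produces. The function name check_pod_hostname is made up for this example.

```python
HOSTNAME_MAX_LEN = 63  # limit quoted in the kubelet log line in the question


def check_pod_hostname(hostname: str) -> dict:
    """Report the hostname length and whether it exceeds the 63 character limit."""
    return {
        "length": len(hostname),
        "too_long": len(hostname) > HOSTNAME_MAX_LEN,
        # Plain cut for illustration only; the kubelet may trim further.
        "first_63_chars": hostname[:HOSTNAME_MAX_LEN],
    }
```

Names of the form kube-controller-manager-<node name> grow with the node name, so a long cloud-provider node name is enough to trigger the message. The check only shows whether truncation applies; it does not by itself explain the timeouts.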
I am trying to back up and restore a Rancher server (single-node install), as described here.
After the backup, I turned off the Rancher server node and ran a new Rancher container on a new node (in the same network, but with a different IP address), then restored from the backup file.
After restoring, I logged in to the rancher UI and it showed the error below:
So, I checked the logs of the rancher server and it showed as below:
2019-10-05 16:41:32.197641 I | http: TLS handshake error from 127.0.0.1:38388: EOF
2019-10-05 16:41:32.202442 I | http: TLS handshake error from 127.0.0.1:38380: EOF
2019-10-05 16:41:32.210378 I | http: TLS handshake error from 127.0.0.1:38376: EOF
2019-10-05 16:41:32.211106 I | http: TLS handshake error from 127.0.0.1:38386: EOF
2019/10/05 16:42:26 [ERROR] ClusterController c-4pgjl [user-controllers-controller] failed with : failed to start user controllers for cluster c-4pgjl: failed to contact server: Get https://192.168.94.154:6443/api/v1/namespaces/kube-system?timeout=30s: waiting for cluster agent to connect
2019/10/05 16:44:34 [ERROR] ClusterController c-4pgjl [user-controllers-controller] failed with : failed to start user controllers for cluster c-4pgjl: failed to contact server: Get https://192.168.94.154:6443/api/v1/namespaces/kube-system?timeout=30s: waiting for cluster agent to connect
2019/10/05 16:48:50 [ERROR] ClusterController c-4pgjl [user-controllers-controller] failed with : failed to start user controllers for cluster c-4pgjl: failed to contact server: Get https://192.168.94.154:6443/api/v1/namespaces/kube-system?timeout=30s: waiting for cluster agent to connect
2019-10-05 16:50:19.114475 I | mvcc: store.index: compact 75951
2019-10-05 16:50:19.137825 I | mvcc: finished scheduled compaction at 75951 (took 22.527694ms)
2019-10-05 16:55:19.120803 I | mvcc: store.index: compact 76282
2019-10-05 16:55:19.124813 I | mvcc: finished scheduled compaction at 76282 (took 2.746382ms)
After that, I checked the logs of the master nodes and found that the Rancher agent still tries to connect to the old Rancher server (the old IP address) instead of the new one, which makes the cluster unavailable.
How can I fix this?
You need to re-register the node in Rancher using the following steps.
Update the server-url in Rancher by going to Global -> Settings -> server-url
This should be the full URL with https://
Then use this script to re-register the node in Rancher https://github.com/mattmattox/cluster-agent-tool
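Since the server-url has to be the full URL including https://, a quick sanity check on the value before saving it can avoid a second round of re-registration. A minimal sketch follows. The function name is made up for this example, and it only checks the shape of the value, not whether the server is reachable.

```python
from urllib.parse import urlsplit


def is_full_https_url(value: str) -> bool:
    """Return True if the value has an https scheme and a host part."""
    parts = urlsplit(value.strip())
    return parts.scheme == "https" and bool(parts.hostname)
```

For example, is_full_https_url("https://rancher.example.com") is True, while the bare host name rancher.example.com or an http:// URL gives False (rancher.example.com is a placeholder, not a real server).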
I started with 3 master nodes and increased the count to 5. I am trying to add the new members to the existing cluster. My apiserver container stops working with the following error:
E1106 20:44:18.977854 1 cacher.go:274] unexpected ListAndWatch error: k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: Failed to list *storage.StorageClass: client: etcd cluster is unavailable or misconfigured
I1106 20:44:19.043807 1 logs.go:41] http: TLS handshake error from 10.0.118.9:52142: EOF
I1106 20:44:19.072129 1 logs.go:41] http: TLS handshake error from 10.0.118.9:52148: EOF
I1106 20:44:19.084461 1 logs.go:41] http: TLS handshake error from 10.0.118.9:52150: EOF
F1106 20:44:19.103677 1 controller.go:128] Unable to perform initial IP allocation check: unable to refresh the service IP block: client: etcd cluster is unavailable or misconfigured
From the already working master nodes I can see the new member:
azureuser@k8s-master-50639053-0:~$ etcdctl member list
99673c60d6c07e0e: name=k8s-master-50639053-2 peerURLs=http://10.0.118.7:2380 clientURLs=
b130aa7583380f88: name=k8s-master-50639053-3 peerURLs=http://10.0.118.8:2380 clientURLs=
b4b196cc0c9fca4a: name=k8s-master-50639053-1 peerURLs=http://10.0.118.6:2380 clientURLs=
c264b3b67880db3f: name=k8s-master-50639053-0 peerURLs=http://10.0.118.5:2380 clientURLs=
e6e511de7d665829: name=k8s-master-50639053-4 peerURLs=http://10.0.118.9:2380 clientURLs=
If I check the cluster health I get:
azureuser@k8s-master-50639053-0:~$ etcdctl cluster-health
member 99673c60d6c07e0e is healthy: got healthy result from http://10.0.118.7:2379
member b4b196cc0c9fca4a is healthy: got healthy result from http://10.0.118.6:2379
member c264b3b67880db3f is healthy: got healthy result from http://10.0.118.5:2379
member fd36b7acc85d92b8 is unhealthy: got unhealthy result from http://10.0.118.9:2379
cluster is healthy
It works if I stop the etcd service on the new master node and run etcd there manually:
sudo etcd --listen-client-urls http://10.0.118.9:2379 --advertise-client-urls http://10.0.118.9:2379 --listen-peer-urls http://10.0.118.9:2380
Could someone help me?
Thanks.
Update: According to GitHub, this is due to certificates, and it is not currently supported by ACS-ENGINE.
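For anyone debugging a similar setup, the two etcdctl outputs in the question can be cross-checked mechanically. The sketch below assumes the etcdctl (v2 API) output layout shown above, and the function name compare_etcd_outputs is made up for this example. It lists member IDs that appear in only one of the outputs and computes the quorum size as a majority of the listed members.

```python
import re
from typing import Dict

# Layouts assumed from the outputs in the question:
#   "99673c60d6c07e0e: name=k8s-master-50639053-2 peerURLs=... clientURLs="
#   "member 99673c60d6c07e0e is healthy: got healthy result from ..."
MEMBER_LINE = re.compile(r"^(?P<id>[0-9a-f]+): name=(?P<name>\S+)")
HEALTH_LINE = re.compile(r"^member (?P<id>[0-9a-f]+) is (?P<state>healthy|unhealthy)\b")


def compare_etcd_outputs(member_list: str, cluster_health: str) -> Dict[str, object]:
    """Cross-check `etcdctl member list` against `etcdctl cluster-health`."""
    members = {}
    for line in member_list.splitlines():
        match = MEMBER_LINE.match(line.strip())
        if match:
            members[match["id"]] = match["name"]

    health = {}
    for line in cluster_health.splitlines():
        match = HEALTH_LINE.match(line.strip())
        if match:
            health[match["id"]] = match["state"]

    return {
        # Listed as members but with no health line at all.
        "not_reported": sorted(set(members) - set(health)),
        # Reported by cluster-health but absent from the member list.
        "unknown_ids": sorted(set(health) - set(members)),
        "unhealthy": sorted(i for i, s in health.items() if s == "unhealthy"),
        "healthy_count": sum(1 for s in health.values() if s == "healthy"),
        # A majority of the listed members is needed for quorum.
        "quorum": len(members) // 2 + 1,
    }
```

Applied to the outputs above, this reports two listed members without a health line (b130aa7583380f88 and e6e511de7d665829), one unhealthy ID that is not in the member list (fd36b7acc85d92b8), and 3 healthy members against a quorum of 3 for 5 members. The two commands may have been run at different times, so a mismatch is a hint to look closer rather than proof of a misconfiguration.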