K3s Server Fails When Starting Up (Raspberry Pi)

I'm trying to start K3s on a master node running Ubuntu 22.04 Server on a Raspberry Pi and I am getting the following error:
I0118 18:44:48.223759 1119 event.go:294] "Event occurred" object="kube-system/ccm" fieldPath="" kind="Addon" apiVersion="k3s.cattle.io/v1" type="Normal" reason="ApplyingManifest" message="Applying manifest at \"/var/lib/rancher/k3s/server/manifests/ccm.yaml\""
I0118 18:44:48.292989 1119 kubelet_node_status.go:108] "Node was previously registered" node="m1"
I0118 18:44:48.293739 1119 kubelet_node_status.go:73] "Successfully registered node" node="m1"
INFO[0012] Starting the netpol controller version v1.5.2-0.20221026101626-e01045262706, built on 2022-12-21T00:01:25Z, go1.19.4
I0118 18:44:48.296470 1119 network_policy_controller.go:163] Starting network policy controller
I0118 18:44:48.343883 1119 kuberuntime_manager.go:1114] "Updating runtime config through cri with podcidr" CIDR="10.42.0.0/24"
I0118 18:44:48.403211 1119 kubelet_network.go:61] "Updating Pod CIDR" originalPodCIDR="" newPodCIDR="10.42.0.0/24"
E0118 18:44:48.458686 1119 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
I0118 18:44:48.462967 1119 setters.go:548] "Node became not ready" node="m1" condition={Type:Ready Status:False LastHeartbeatTime:2023-01-18 18:44:48.462712391 +0000 UTC m=+13.144141009 LastTransitionTime:2023-01-18 18:44:48.462712391 +0000 UTC m=+13.144141009 Reason:KubeletNotReady Message:[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]}
E0118 18:44:48.581158 1119 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
INFO[0013] Creating helm-controller event broadcaster
I0118 18:44:48.770207 1119 apiserver.go:52] "Watching apiserver"
I0118 18:44:48.849863 1119 kube.go:133] Node controller sync successful
I0118 18:44:48.850843 1119 vxlan.go:138] VXLAN config: VNI=1 Port=0 GBP=false Learning=false DirectRouting=false
FATA[0013] flannel exited: operation not supported
joesan@m1:/etc/systemd/system$
I'm simply using the following command to start it:
$ sudo k3s server
Any recommendations on what I'm missing here? I'm using:
joesan@m1:/etc/systemd/system$ k3s -version
k3s version v1.26.0+k3s1 (fae88176)
go version go1.19.4
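For what it's worth, the fatal error comes from flannel's VXLAN setup, and on Ubuntu 22.04 for Raspberry Pi the VXLAN kernel module ships in a separate package. A quick diagnostic sketch (not a confirmed fix for this exact setup) would be:

# Check whether the vxlan module can be loaded; flannel's default backend needs it
sudo modprobe vxlan || echo "vxlan module not available"
# On Raspberry Pi Ubuntu the extra kernel modules (including vxlan) live in this package
sudo apt install linux-modules-extra-raspi
sudo reboot

If modprobe fails, installing that package and rebooting before re-running sudo k3s server is worth trying.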

Related

kubelet won't start after kubernetes/manifest update

This is somewhat strange behavior in our K8s cluster.
When we try to deploy a new version of our applications we get:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "<container-id>" network for pod "application-6647b7cbdb-4tp2v": networkPlugin cni failed to set up pod "application-6647b7cbdb-4tp2v_default" network: Get "https://[10.233.0.1]:443/api/v1/namespaces/default": dial tcp 10.233.0.1:443: connect: connection refused
I ran kubectl get cs and found the controller-manager and scheduler in an Unhealthy state.
As described here, I updated /etc/kubernetes/manifests/kube-scheduler.yaml and
/etc/kubernetes/manifests/kube-controller-manager.yaml by commenting out --port=0.
When I checked systemctl status kubelet, it was running:
Active: active (running) since Mon 2020-10-26 13:18:46 +0530; 1 years 0 months ago
I restarted the kubelet service, and the controller-manager and scheduler were then shown as healthy.
But systemctl status kubelet now shows (right after the restart it briefly showed a running state):
Active: activating (auto-restart) (Result: exit-code) since Thu 2021-11-11 10:50:49 +0530; 3s ago
Docs: https://github.com/GoogleCloudPlatform/kubernetes
Process: 21234 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET
I tried adding Environment="KUBELET_SYSTEM_PODS_ARGS=--pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true --fail-swap-on=false" to /etc/systemd/system/kubelet.service.d/10-kubeadm.conf as described here, but it's still not working properly.
I also removed the --port=0 comment in the above-mentioned manifests and tried restarting; still the same result.
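For reference, changes to that systemd drop-in only take effect after reloading systemd and restarting the kubelet; a minimal sketch:

# Reload unit files so the edited drop-in is picked up, then restart the kubelet
sudo systemctl daemon-reload
sudo systemctl restart kubelet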
Edit: This issue was due to an expired kubelet certificate and was fixed by following these steps. If someone faces this issue, make sure the /var/lib/kubelet/pki/kubelet-client-current.pem certificate and key values are base64-encoded when placing them in /etc/kubernetes/kubelet.conf.
Many others suggested running kubeadm init again, but this cluster was created using Kubespray; no nodes were added manually.
We have bare-metal K8s running on Ubuntu 18.04.
K8s: v1.18.8
We would appreciate any debugging and fixing suggestions.
PS:
When we try telnet 10.233.0.1 443 from any node, the first attempt fails and the second attempt succeeds.
Edit: Found this in the kubelet service logs:
Nov 10 17:35:05 node1 kubelet[1951]: W1110 17:35:05.380982 1951 docker_sandbox.go:402] failed to read pod IP from plugin/docker: networkPlugin cni failed on the status hook for pod "app-7b54557dd4-bzjd9_default": unexpected command output nsenter: cannot open /proc/12311/ns/net: No such file or directory
Posting the comment as the community wiki answer for better visibility:
This issue was due to an expired kubelet certificate and was fixed by following these steps. If someone faces this issue, make sure the /var/lib/kubelet/pki/kubelet-client-current.pem certificate and key values are base64-encoded when placing them in /etc/kubernetes/kubelet.conf.
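As a rough sketch of how one might verify the certificate expiry and produce the base64-encoded values mentioned above (paths as in the question, commands are illustrative):

# Check when the current kubelet client certificate expires
openssl x509 -noout -enddate -in /var/lib/kubelet/pki/kubelet-client-current.pem
# Base64-encode the PEM contents without line wrapping before pasting the
# certificate and key values into /etc/kubernetes/kubelet.conf
base64 -w0 /var/lib/kubelet/pki/kubelet-client-current.pem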

Why might a gVisor-based node pool fail to bootstrap properly?

I'm trying to provision a new node pool with gVisor sandboxing in GKE. I use the GCP web console to add a new node pool, use the cos_containerd OS, and check the Enable gVisor Sandboxing checkbox, but the node pool provisioning fails each time with an "Unknown Error" in the GCP console notifications. The nodes never join the K8s cluster.
The GCE VM seems to boot fine, and when I look at journalctl on the node I see that cloud-init finished just fine, but the kubelet doesn't seem to be able to start. I see error messages like this:
Oct 12 16:58:07 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.184163 1143 kubelet.go:2271] node "gke-main-sanboxes-dd9b8d84-dmzz" not found
Oct 12 16:58:07 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.284735 1143 kubelet.go:2271] node "gke-main-sanboxes-dd9b8d84-dmzz" not found
Oct 12 16:58:07 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.385229 1143 kubelet.go:2271] node "gke-main-sanboxes-dd9b8d84-dmzz" not found
Oct 12 16:58:07 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.485626 1143 kubelet.go:2271] node "gke-main-sanboxes-dd9b8d84-dmzz" not found
Oct 12 16:58:07 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.522961 1143 eviction_manager.go:251] eviction manager: failed to get summary stats: failed to get node info: node "gke-main-sanboxes-dd9b8d84-dmzz" not found
Oct 12 16:58:07 gke-main-sanboxes-dd9b8d84-dmzz containerd[976]: time="2020-10-12T16:58:07.576735750Z" level=error msg="Failed to load cni configuration" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
Oct 12 16:58:07 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.577353 1143 kubelet.go:2191] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
Oct 12 16:58:07 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.587824 1143 kubelet.go:2271] node "gke-main-sanboxes-dd9b8d84-dmzz" not found
kubelet.go:2271] node "gke-main-sanboxes-dd9b8d84-dmzz" not found
Oct 12 16:58:07 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.989869 1143 kubelet.go:2271] node "gke-main-sanboxes-dd9b8d84-dmzz" not found
Oct 12 16:58:08 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:08.090287 1143
Oct 12 16:58:09 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:09.296365 1143 kubelet.go:2271] node "gke-main-sanboxes-dd9b8d84-dmzz" not found
Oct 12 16:58:09 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:09.396933 1143 kubelet.go:2271] node "gke-main-sanboxes-dd9b8d84-dmzz" not found
Oct 12 16:58:09 gke-main-sanboxes-dd9b8d84-dmzz node-problem-detector[1166]: F1012 16:58:09.449446 2481 main.go:71] cannot create certificate signing request: Post https://172.17.0.2/apis/certificates.k8s.io/v1beta1/certificatesigningrequests?timeout=5m0s: dial tcp 172.17.0.2:443: connect: no route
Oct 12 16:58:09 gke-main-sanboxes-dd9b8d84-dmzz node-problem-detector[1166]: E1012 16:58:09.450695 1166 manager.go:162] failed to update node conditions: Patch https://172.17.0.2/api/v1/nodes/gke-main-sanboxes-dd9b8d84-dmzz/status: getting credentials: exec: exit status 1
Oct 12 16:58:09 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:09.453825 2486 cache.go:125] failed reading existing private key: open /var/lib/kubelet/pki/kubelet-client.key: no such file or directory
Oct 12 16:58:09 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:09.543449 1143 kubelet.go:2271] node "gke-main-sanboxes-dd9b8d84-dmzz" not found
Oct 12 16:58:09 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:09.556623 2486 tpm.go:124] failed reading AIK cert: tpm2.NVRead(AIK cert): decoding NV_ReadPublic response: handle 1, error code 0xb : the handle is not correct for the use
I am not really sure what might be causing that, and I'd really like to be able to use autoscaling with this node pool, so I don't want to fix it manually for this node and then have to do the same for any new nodes that join. How can I configure the node pool so that the gVisor-based nodes provision correctly on their own?
My cluster details:
GKE version: 1.17.9-gke.6300
Cluster type: Regional
VPC-native
Private cluster
Shielded GKE Nodes
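For completeness, an equivalent node pool can also be created from the CLI; a minimal sketch, where the pool, cluster, and region names are placeholders:

gcloud container node-pools create gvisor-pool \
  --cluster=CLUSTER_NAME --region=REGION \
  --image-type=COS_CONTAINERD \
  --sandbox type=gvisor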
You can report issues with Google products by following the link below:
Cloud.google.com: Support: Docs: Issue Trackers
You will need to choose Create new Google Kubernetes Engine issue under the Compute section.
I can confirm that I stumbled upon the same issue when creating a cluster as described in the question (private, shielded, etc.):
Create a cluster with one node pool.
Add the node pool with gvisor enabled after the cluster has been successfully created.
Creating a cluster like the above pushes the GKE cluster into a RECONCILING state:
NAME LOCATION MASTER_VERSION MASTER_IP MACHINE_TYPE NODE_VERSION NUM_NODES STATUS
gke-gvisor europe-west3 1.17.9-gke.6300 XX.XXX.XXX.XXX e2-medium 1.17.9-gke.6300 6 RECONCILING
The changes in the cluster state:
Provisioning - creating the cluster
Running - created the cluster
Reconciling - added the node pool
Running - the node pool was added (for about a minute)
Reconciling - the cluster went into that state for about 25 minutes
GCP Cloud Console (Web UI) reports: Repairing Cluster

How can I run multiple network interfaces on a k8s node?

Running OpenShift 4.1 on K8s v1.13.4. I'm trying to add a second network (for NFS storage) to my compute nodes, and as soon as I do, the node stops reporting NodeReady.
See the kubelet logs below. I'm completely lost. How can I add another interface to my nodes?
FieldPath:""}): type: 'Normal' reason: 'NodeReady' Node compute-0 status is now: NodeReady
Jun 26 05:41:22 compute-0 hyperkube[923]: E0626 05:41:22.367174 923 kubelet_node_status.go:380] Error updating node status, will retry: failed to patch status "{\"status\":{\"$setElementOrder/addresses\":[{\"type\":\"ExternalIP\"},{\"type\":\"InternalIP\"},{\"type\":\"ExternalIP\"},{\"type\":\"InternalIP\"},{\"type\":\"Hostname\"}],\"$setElementOrder/conditions\":[{\"type\":\"MemoryPressure\"},{\"type\":\"Dis
...
Jun 26 05:41:22 compute-0 hyperkube[923]: [map[address:10.90.49.111 type:ExternalIP] map[type:ExternalIP address:10.90.51.94] map[address:10.90.49.111 type:InternalIP] map[address:10.90.51.94 type:InternalIP]]
Jun 26 05:41:22 compute-0 hyperkube[923]: doesn't match $setElementOrder list:
The resolution was to delete the node from the cluster (kubectl delete node compute-0), reboot it, and let Ignition rejoin it to the cluster.
This is a known bug.
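A related mitigation worth noting (an assumption on my part, not something the poster confirmed): the status-patch conflict comes from the second interface's address being reported alongside the first, so pinning the address the kubelet reports via its --node-ip flag can avoid it. On a generic kubeadm-style node that would look roughly like the following; OpenShift manages the kubelet through MachineConfigs, so this is illustrative only:

# Hypothetical drop-in pinning the kubelet's reported node IP (address taken from the logs above)
sudo mkdir -p /etc/systemd/system/kubelet.service.d
printf '[Service]\nEnvironment="KUBELET_EXTRA_ARGS=--node-ip=10.90.49.111"\n' \
  | sudo tee /etc/systemd/system/kubelet.service.d/20-node-ip.conf
sudo systemctl daemon-reload && sudo systemctl restart kubelet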

kubelet saying node "master01" not found

I'm trying to stand up my kubeadm cluster with three masters. I receive this problem from my init command:
[kubelet-check] Initial timeout of 40s passed.
Unfortunately, an error has occurred:
timed out waiting for the condition
This error is likely caused by:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
- 'systemctl status kubelet'
- 'journalctl -xeu kubelet'
Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI, e.g. docker.
Here is one example how you may list all Kubernetes containers running in docker:
- 'docker ps -a | grep kube | grep -v pause'
Once you have found the failing container, you can inspect its logs with:
- 'docker logs CONTAINERID'
error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
But I am not using cgroupfs; my cgroup driver is systemd.
And my kubelet complains that it does not know its node name:
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.251885 5620 kubelet.go:2266] node "master01" not found
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.352932 5620 kubelet.go:2266] node "master01" not found
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.453895 5620 kubelet.go:2266] node "master01" not found
Please let me know where the issue is.
The issue can be caused by the Docker version, as only Docker versions up to 18.06 are supported by the latest Kubernetes release, i.e. v1.13.x.
I also hit the same issue, and it was resolved after downgrading Docker from 18.09 to 18.06.
If the problem is not related to Docker, it might be because the kubelet service failed to establish a connection to the API server.
I would first of all check the status of the kubelet with systemctl status kubelet and consider restarting it with systemctl restart kubelet.
If this doesn't help, try re-installing kubeadm or running kubeadm init with another version (use the --kubernetes-version=X.Y.Z flag).
In my case, my k8s version was 1.21.1 and my Docker version was 19.03. I solved this by upgrading Docker to version 20.7.
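Since the thread touches on cgroupfs vs systemd, a quick sanity check (a generic sketch, not specific to this setup) is to confirm that Docker and the kubelet agree on the cgroup driver:

# What Docker is using, e.g. "Cgroup Driver: systemd"
docker info 2>/dev/null | grep -i 'cgroup driver'
# What the kubelet is configured with on a kubeadm node
grep -i cgroupDriver /var/lib/kubelet/config.yaml 2>/dev/null
grep -i cgroup-driver /var/lib/kubelet/kubeadm-flags.env 2>/dev/null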

Scaling-Master / Unable to perform initial IP allocation check

I started with 3 master nodes and increased the count to 5. I am trying to add the new members to the existing cluster. My apiserver container stops working with the following errors:
E1106 20:44:18.977854 1 cacher.go:274] unexpected ListAndWatch error: k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: Failed to list *storage.StorageClass: client: etcd cluster is unavailable or misconfigured
I1106 20:44:19.043807 1 logs.go:41] http: TLS handshake error from 10.0.118.9:52142: EOF
I1106 20:44:19.072129 1 logs.go:41] http: TLS handshake error from 10.0.118.9:52148: EOF
I1106 20:44:19.084461 1 logs.go:41] http: TLS handshake error from 10.0.118.9:52150: EOF
F1106 20:44:19.103677 1 controller.go:128] Unable to perform initial IP allocation check: unable to refresh the service IP block: client: etcd cluster is unavailable or misconfigured
From the already working master nodes I can see the new member:
azureuser@k8s-master-50639053-0:~$ etcdctl member list
99673c60d6c07e0e: name=k8s-master-50639053-2 peerURLs=http://10.0.118.7:2380 clientURLs=
b130aa7583380f88: name=k8s-master-50639053-3 peerURLs=http://10.0.118.8:2380 clientURLs=
b4b196cc0c9fca4a: name=k8s-master-50639053-1 peerURLs=http://10.0.118.6:2380 clientURLs=
c264b3b67880db3f: name=k8s-master-50639053-0 peerURLs=http://10.0.118.5:2380 clientURLs=
e6e511de7d665829: name=k8s-master-50639053-4 peerURLs=http://10.0.118.9:2380 clientURLs=
If I check the cluster health, I get:
azureuser@k8s-master-50639053-0:~$ etcdctl cluster-health
member 99673c60d6c07e0e is healthy: got healthy result from http://10.0.118.7:2379
member b4b196cc0c9fca4a is healthy: got healthy result from http://10.0.118.6:2379
member c264b3b67880db3f is healthy: got healthy result from http://10.0.118.5:2379
member fd36b7acc85d92b8 is unhealthy: got unhealthy result from http://10.0.118.9:2379
cluster is healthy
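One detail worth noting: the unhealthy member ID fd36b7acc85d92b8 does not appear in the member list above, which looks like a stale registration for 10.0.118.9. With the etcd v2 API used above, a stale member can be removed and re-added; this is a sketch only, and whether it applies here depends on how the cluster was provisioned:

etcdctl member remove fd36b7acc85d92b8
etcdctl member add k8s-master-50639053-4 http://10.0.118.9:2380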
It works if I stop the etcd service on the new master node and run etcd manually:
sudo etcd --listen-client-urls http://10.0.118.9:2379 --advertise-client-urls http://10.0.118.9:2379 --listen-peer-urls http://10.0.118.9:2380
Could someone help me?
Thanks.
Update: According to a GitHub issue, it's due to certificates, and this is not currently supported by acs-engine.