kubelet won't start after kuberntes/manifest update - kubernetes

This is sort of strange behavior in our K8 cluster.
When we try to deploy a new version of our applications we get:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "<container-id>" network for pod "application-6647b7cbdb-4tp2v": networkPlugin cni failed to set up pod "application-6647b7cbdb-4tp2v_default" network: Get "https://[10.233.0.1]:443/api/v1/namespaces/default": dial tcp 10.233.0.1:443: connect: connection refused
I used kubectl get cs and found controller and scheduler in Unhealthy state.
As describer here updated /etc/kubernetes/manifests/kube-scheduler.yaml and
/etc/kubernetes/manifests/kube-controller-manager.yaml by commenting --port=0
When I checked systemctl status kubelet it was working.
Active: active (running) since Mon 2020-10-26 13:18:46 +0530; 1 years 0 months ago
I had restarted kubelet service and controller and scheduler were shown healthy.
But systemctl status kubelet shows (soon after restart kubelet it showed running state)
Active: activating (auto-restart) (Result: exit-code) since Thu 2021-11-11 10:50:49 +0530; 3s ago<br>
Docs: https://github.com/GoogleCloudPlatform/kubernetes<br> Process: 21234 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET
Tried adding Environment="KUBELET_SYSTEM_PODS_ARGS=--pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true --fail-swap-on=false" to /etc/systemd/system/kubelet.service.d/10-kubeadm.conf as described here, but still its not working properly.
Also removed --port=0 comment in above mentioned manifests and tried restarting,still same result.
Edit: This issue was due to kubelet certificate expired and fixed following these steps. If someone faces this issue, make sure /var/lib/kubelet/pki/kubelet-client-current.pem certificate and key values are base64 encoded when placing on /etc/kubernetes/kubelet.conf
Many other suggested kubeadm init again. But this cluster was created using kubespray no manually added nodes.
We have baremetal k8 running on Ubuntu 18.04.
K8: v1.18.8
We would like to know any debugging and fixing suggestions.
PS:
When we try to telnet 10.233.0.1 443 from any node, first attempt fails and second attempt success.
Edit: Found this in kubelet service logs
Nov 10 17:35:05 node1 kubelet[1951]: W1110 17:35:05.380982 1951 docker_sandbox.go:402] failed to read pod IP from plugin/docker: networkPlugin cni failed on the status hook for pod "app-7b54557dd4-bzjd9_default": unexpected command output nsenter: cannot open /proc/12311/ns/net: No such file or directory

Posting comment as the community wiki answer for better visibility
This issue was due to kubelet certificate expired and fixed following these steps. If someone faces this issue, make sure /var/lib/kubelet/pki/kubelet-client-current.pem certificate and key values are base64 encoded when placing on /etc/kubernetes/kubelet.conf

Related

facing problem while running kubeadm init command

[WARNING Firewalld]: firewalld is active, please ensure ports [6443 10250] are open or your cluster may not function correctly
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR CRI]: container runtime is not running: output: time="2023-02-19T02:25:52Z" level=fatal msg="validate service connection: CRI v1 runtime API is not implemented for endpoint "unix:///var/run/containerd/containerd.sock": rpc error: code = Unimplemented desc = unknown service runtime.v1.RuntimeService"
, error: exit status 1
[preflight] If you know what you are doing, you can make a check non-fatal with --ignore-preflight-errors=...
To see the stack trace of this error execute with --v=5 or higher
I Enabled the port 6443 and 10250 in Security group.
Also tried the commands-
rm /etc/containerd/config.toml
systemctl restart containerd
kubeadm init
The containerd is not available in etc directory.

Cassandra pod fails after kubernetes node restart

I have successfully installed dse in my kubernetes environment using the Kubernetes Operator instructions:
With nodetool I checked that all pod successfully joined the ring
The problem is that when I reboot one of the kubernetes node the cassandra pod that was running on that node never recover:
[root#node1 ~]# kubectl exec -it -n cassandra cluster1-dc1-r2-sts-0 -c cassandra nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving/Stopped
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.244.166.132 153.82 KiB 1 77.9% 053cc18e-397c-4abe-bb1b-d48a3fef3c93 r3
DS 10.244.104.1 136.09 KiB 1 26.9% 8ae31e1c-856e-44a8-b081-c5c040b535b9 r1
UN 10.244.135.2 202.8 KiB 1 95.2% 06200794-298c-4122-b8ff-4239bc7a8ded r2
[root#node1 ~]# kubectl get pods -n cassandra
NAME READY STATUS RESTARTS AGE
cass-operator-56f5f8c7c-w6l2c 1/1 Running 0 17h
cluster1-dc1-r1-sts-0 1/2 Running 2 17h
cluster1-dc1-r2-sts-0 2/2 Running 0 17h
cluster1-dc1-r3-sts-0 2/2 Running 0 17h
I have looked into the logs but I can't figure out what is the problem.
The "kubectl logs"" command return the logs below:
INFO [nioEventLoopGroup-2-1] 2020-03-25 12:13:13,536 Cli.java:555 - address=/192.168.0.11:38590 url=/api/v0/probes/liveness status=200 OK
INFO [epollEventLoopGroup-6506-1] 2020-03-25 12:13:14,110 Clock.java:35 - Could not access native clock (see debug logs for details), falling back to Java system clock
WARN [epollEventLoopGroup-6506-2] 2020-03-25 12:13:14,111 Slf4JLogger.java:146 - Unknown channel option 'TCP_NODELAY' for channel '[id: 0x8a898bf3]'
WARN [epollEventLoopGroup-6506-2] 2020-03-25 12:13:14,116 Loggers.java:28 - [s6501] Error connecting to /tmp/dse.sock, trying next node
java.io.FileNotFoundException: null
at io.netty.channel.unix.Errors.throwConnectException(Errors.java:110)
at io.netty.channel.unix.Socket.connect(Socket.java:257)
at io.netty.channel.epoll.AbstractEpollChannel.doConnect0(AbstractEpollChannel.java:732)
at io.netty.channel.epoll.AbstractEpollChannel.doConnect(AbstractEpollChannel.java:717)
at io.netty.channel.epoll.EpollDomainSocketChannel.doConnect(EpollDomainSocketChannel.java:87)
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.connect(AbstractEpollChannel.java:559)
at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1366)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:545)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:530)
at io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:545)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:530)
at io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:50)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:545)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:530)
at io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:50)
at com.datastax.oss.driver.internal.core.channel.ConnectInitHandler.connect(ConnectInitHandler.java:57)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:545)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:530)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:512)
at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:1024)
at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:276)
at io.netty.bootstrap.Bootstrap$3.run(Bootstrap.java:252)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:375)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
INFO [nioEventLoopGroup-2-2] 2020-03-25 12:13:14,118 Cli.java:555 - address=/192.168.0.11:38592 url=/api/v0/probes/readiness status=500 Internal Server Error
The error java.io.FileNotFoundException: null appears also when cassandra starts successfully.
So what remains is the error:
address=/192.168.0.11:38592 url=/api/v0/probes/readiness status=500 Internal Server Error
Which doesn't say much to me.
The "kubectl describe" shows the following
Warning Unhealthy 4m41s (x6535 over 18h) kubelet, node2 Readiness probe failed: HTTP probe failed with statuscode: 500
In the cassandra container only this process is running:
java -Xms128m -Xmx128m -jar /opt/dse/resources/management-api/management-api-6.8.0.20200316-LABS-all.jar --dse-socket /tmp/dse.sock --host tcp://0.0.0.0```
And in the /var/log/cassandra/system.log I can't point out any error
Andrea, the error "java.io.FileNotFoundException: null" is a harmless message about a transient error during the Cassandra pod starting up and healthcheck.
I was able to reproduce the issue you ran into. If you run kubectl get pods you should see the affected pod showing 1/2 under "READY" column, this means the Cassandra container was not brought up in the auto-restarted pod. Only the management API container is running. I suspect this is a bug in the operator and I'll work with the developers to sort it out.
As a workaround you can run kubectl delete pod/<pod_name> to recover your Cassandra cluster back to a normal state (in your case kubectl delete pod/cluster1-dc1-r1-sts-0). This will redeploy the pod and remount the data volume automatically, without losing anything.
I got this error when CoreDNS pods were not running on the node, on which I had started Cassandra. The DNS resolutions were not working properly. So, debugging network connectivity may help.

How can I run multiple network interfaces on a k8s node?

Running Openshift 4.1 on K8s v1.13.4. I'm trying to add a second network (for NFS storage) to my compute nodes, and as soon as I do, the node stops reporting NodeReady.
See below logs from kubelet. Completely lost.. How can I add another interface to my nodes?
v1.13.4
FieldPath:""}): type: 'Normal' reason: 'NodeReady' Node compute-0 status is now: NodeReady
Jun 26 05:41:22 compute-0 hyperkube[923]: E0626 05:41:22.367174 923 kubelet_node_status.go:380] Error updating node status, will retry: failed to patch status "{\"status\":{\"$setElementOrder/addresses\":[{\"type\":\"ExternalIP\"},{\"type\":\"InternalIP\"},{\"type\":\"ExternalIP\"},{\"type\":\"InternalIP\"},{\"type\":\"Hostname\"}],\"$setElementOrder/conditions\":[{\"type\":\"MemoryPressure\"},{\"type\":\"Dis
...
Jun 26 05:41:22 compute-0 hyperkube[923]: [map[address:10.90.49.111 type:ExternalIP] map[type:ExternalIP address:10.90.51.94] map[address:10.90.49.111 type:InternalIP] map[address:10.90.51.94 type:InternalIP]]
Jun 26 05:41:22 compute-0 hyperkube[923]: doesn't match $setElementOrder list:
Resolution was to delete node from cluster kubectl delete node compute-0, reboot it, and let ignition rejoin it to the cluster.
This is a known bug

kubelet saying node "master01" not found

I try to stack up my kubeadm cluster with three masters. I receive this problem from my init command...
[kubelet-check] Initial timeout of 40s passed.
Unfortunately, an error has occurred:
timed out waiting for the condition
This error is likely caused by:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
- 'systemctl status kubelet'
- 'journalctl -xeu kubelet'
Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI, e.g. docker.
Here is one example how you may list all Kubernetes containers running in docker:
- 'docker ps -a | grep kube | grep -v pause'
Once you have found the failing container, you can inspect its logs with:
- 'docker logs CONTAINERID'
error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
But I do not use no cgroupfs but systemd
And my kubelet complain for not knowing his nodename.
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.251885 5620 kubelet.go:2266] node "master01" not found
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.352932 5620 kubelet.go:2266] node "master01" not found
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.453895 5620 kubelet.go:2266] node "master01" not found
Please let me know where is the issue.
The issue can be because of docker version, as docker version < 18.6 is supported in latest kubernetes version i.e. v1.13.xx.
Actually I also got the same issue but it get resolved after downgrading the docker version from 18.9 to 18.6.
If the problem is not related to Docker it might be because the Kubelet service failed to establish connection to API server.
I would first of all check the status of Kubelet: systemctl status kubelet and consider restarting with systemctl restart kubelet.
If this doesn't help try re-installing kubeadm or running kubeadm init with other version (use the --kubernetes-version=X.Y.Z flag).
In my case,my k8s version is 1.21.1 and my docker version is 19.03. I solved this bug by upgrading docker to version 20.7.

kubeadm init kubelet complains default bind address already in use

kubeadm version 1.12.2
$ sudo kubeadm init --config kubeadm_new.config --ignore-preflight-errors=all
/var/log/syslog shows:
Nov 15 08:44:13 khteh-T580 kubelet[5101]: I1115 08:44:13.438374 5101 server.go:1013] Started kubelet
Nov 15 08:44:13 khteh-T580 kubelet[5101]: I1115 08:44:13.438406 5101 server.go:133] Starting to listen on 0.0.0.0:10250
Nov 15 08:44:13 khteh-T580 kubelet[5101]: E1115 08:44:13.438446 5101 kubelet.go:1287] Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data in memory cache
Nov 15 08:44:13 khteh-T580 kubelet[5101]: E1115 08:44:13.438492 5101 server.go:753] Starting health server failed: listen tcp 127.0.0.1:10248: bind: address already in use
Nov 15 08:44:13 khteh-T580 kubelet[5101]: I1115 08:44:13.438968 5101 server.go:318] Adding debug handlers to kubelet server.
Nov 15 08:44:13 khteh-T580 kubelet[5101]: F1115 08:44:13.439455 5101 server.go:145] listen tcp 0.0.0.0:10250: bind: address already in use
I have tried sudo systemctl stop kubelet and manually kill kubelet process but to no avail. Any advice and insights are appreciated.
Here is what you can do:
Try the following command to find out which process is holding the port 10250
root#master admin]# ss -lntp | grep 10250
LISTEN 0 128 :::10250 :::* users:(("kubelet",pid=23373,fd=20))
It will give you PID of that process and name of that process. If it is unwanted process which is holding the port, you can always kill the process and that port becomes available to use by kubelet.
After killing the process again run the above command, it should return no value.
Just to be on safe side run kubeadm reset and then run kubeadm init and it should go through.
Have you tried using netstat to see what other process is running that has already bound to that port?
sudo netstat -tulpn | grep 10250
For me, later on I discovered that there were 2 extra containers "Terminating" of core-dns-xxxxx in my cluster.
After deleting them forcefully solved the problem for me:
kubectl delete core-dns-xxxx --force
Thanks.
I ditch kubeadm and use microk8s.