kube-apiserver unable to communicate with TLS enabled etcd - kubernetes

I'm trying to set up a kubernetes cluster on some raspberry pis. I have successfully set up an etcd cluster with TLS enabled, and I can access this cluster via etcdctl and curl.
However, when I try to run kube-apiserver with the same ca file, I get messages saying that the etcd cluster is misconfigured or unavailable.
My question is, why are curl and etcdctl able to view cluster health and add keys with the same ca file that kube-apiserver is trying to use, but the kube-apiserver cannot?
When I run kube-apiserver and hit 127.0.0.1 over HTTP, not HTTPS, I can start the api server.
If this and the information below are not enough to understand the problem, please let me know. I'm not experienced with TLS/x509 certificates at all. I've been using Kelsey Hightower's Kubernetes the Hard Way mixed with the CoreOS docs for spinning up a Kubernetes cluster, as well as looking at GitHub issues and the like.
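For reference, here is one way to check which hostnames and IPs the etcd server certificate actually covers; a mismatch there is a common reason one client works while another fails. This is only a sketch and assumes the certificate path from the unit file below:
# Print the Subject Alternative Names of the etcd server certificate
openssl x509 -in /etc/etcd/etcd.pem -noout -text | grep -A1 "Subject Alternative Name"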
Here is my etcd unit file:
[Unit]
Description=etcd
Documentation=https://github.com/coreos/etcd
[Service]
Environment=ETCD_UNSUPPORTED_ARCH=arm
ExecStart=/usr/bin/etcd \
--name etcd-master1 \
--cert-file=/etc/etcd/etcd.pem \
--key-file=/etc/etcd/etcd-key.pem \
--peer-cert-file=/etc/etcd/etcd.pem \
--peer-key-file=/etc/etcd/etcd-key.pem \
--trusted-ca-file=/etc/etcd/ca.pem \
--peer-trusted-ca-file=/etc/etcd/ca.pem \
--initial-advertise-peer-urls=https://10.0.1.200:2380 \
--listen-peer-urls https://10.0.1.200:2380 \
--listen-client-urls https://10.0.1.200:2379,http://127.0.0.1:2379 \
--advertise-client-urls https://10.0.1.200:2379 \
--initial-cluster-token etcd-cluster-1 \
--initial-cluster etcd-master1=https://10.0.1.200:2380,etcd-master2=https://10.0.1.201:2380 \
--initial-cluster-state new \
--data-dir=var/lib/etcd \
--debug
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
Here is the kube-apiserver command I'm trying to run:
#!/bin/bash
./kube-apiserver \
--etcd-cafile=/etc/etcd/ca.pem \
--etcd-certfile=/etc/etcd/etcd.pem \
--etcd-keyfile=/etc/etcd/etcd-key.pem \
--etcd-servers=https://10.0.1.200:2379,https://10.0.1.201:2379 \
--service-cluster-ip-range=10.32.0.0/24
Here is some output of that attempt. I think it's kind of weird that it is trying to list the etcd node, but nothing is printed out:
deploy@master1:~$ sudo ./run_kube_apiserver.sh
I1210 00:11:35.096887 20480 config.go:499] Will report 10.0.1.200 as public IP address.
I1210 00:11:35.842049 20480 trace.go:61] Trace "List *api.PodTemplateList" (started 2016-12-10 00:11:35.152287704 -0500 EST):
[134.479µs] [134.479µs] About to list etcd node
[688.376235ms] [688.241756ms] Etcd node listed
[689.062689ms] [686.454µs] END
E1210 00:11:35.860221 20480 cacher.go:261] unexpected ListAndWatch error: pkg/storage/cacher.go:202: Failed to list *api.PodTemplate: client: etcd cluster is unavailable or misconfigured
I1210 00:11:36.588511 20480 trace.go:61] Trace "List *api.LimitRangeList" (started 2016-12-10 00:11:35.273714755 -0500 EST):
[184.478µs] [184.478µs] About to list etcd node
[1.314010127s] [1.313825649s] Etcd node listed
[1.314362833s] [352.706µs] END
E1210 00:11:36.596092 20480 cacher.go:261] unexpected ListAndWatch error: pkg/storage/cacher.go:202: Failed to list *api.LimitRange: client: etcd cluster is unavailable or misconfigured
I1210 00:11:37.286714 20480 trace.go:61] Trace "List *api.ResourceQuotaList" (started 2016-12-10 00:11:35.325895387 -0500 EST):
[133.958µs] [133.958µs] About to list etcd node
[1.96003213s] [1.959898172s] Etcd node listed
[1.960393274s] [361.144µs] END
A successful cluster-health query:
deploy@master1:~$ sudo etcdctl --cert-file /etc/etcd/etcd.pem --key-file /etc/etcd/etcd-key.pem --ca-file /etc/etcd/ca.pem cluster-health
member 133c48556470c88d is healthy: got healthy result from https://10.0.1.200:2379
member 7acb9583fc3e7976 is healthy: got healthy result from https://10.0.1.201:2379
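As a further sanity check, the same client credentials that kube-apiserver is given can be exercised with curl against etcd's health endpoint; a minimal sketch, assuming the /health endpoint served on the client port:
# Same CA, cert and key as passed to kube-apiserver
curl --cacert /etc/etcd/ca.pem --cert /etc/etcd/etcd.pem --key /etc/etcd/etcd-key.pem https://10.0.1.200:2379/health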
I am also seeing a lot of timeouts on the etcd servers themselves trying to send heartbeats back:
Dec 10 00:19:56 master1 etcd[19308]: failed to send out heartbeat on time (exceeded the 100ms timeout for 790.808604ms)
Dec 10 00:19:56 master1 etcd[19308]: server is likely overloaded
Dec 10 00:22:40 master1 etcd[19308]: failed to send out heartbeat on time (exceeded the 100ms timeout for 122.586925ms)
Dec 10 00:22:40 master1 etcd[19308]: server is likely overloaded
Dec 10 00:22:41 master1 etcd[19308]: failed to send out heartbeat on time (exceeded the 100ms timeout for 551.618961ms)
Dec 10 00:22:41 master1 etcd[19308]: server is likely overloaded
I can still do etcd operations like gets and puts, but I'm wondering if this could be a contributing factor. Can I tell the kube-apiserver to wait longer for etcd? I've been trying to figure this out myself, but in my opinion the technical parts of the Kubernetes components aren't well documented, and a lot of the examples are very turnkey, without really explaining what everything is doing and why. I can find all kinds of diagrams and blog posts about the high-level architecture, but details such as how to run the actual binaries and which flags are and are not required are harder to come by.
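If slow storage on the Raspberry Pis is behind the overloaded-server messages, etcd's Raft timing can be relaxed; a minimal sketch of extra flags to add to the ExecStart above (the values are illustrative, not tuned):
# Relax etcd heartbeat/election timing for slow hardware (values in milliseconds)
--heartbeat-interval=500 \
--election-timeout=5000 \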

Related

kubelet won't start after kubernetes/manifest update

We are seeing some strange behavior in our K8s cluster.
When we try to deploy a new version of our applications, we get:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "<container-id>" network for pod "application-6647b7cbdb-4tp2v": networkPlugin cni failed to set up pod "application-6647b7cbdb-4tp2v_default" network: Get "https://[10.233.0.1]:443/api/v1/namespaces/default": dial tcp 10.233.0.1:443: connect: connection refused
I used kubectl get cs and found the controller-manager and scheduler in an Unhealthy state.
As described here, I updated /etc/kubernetes/manifests/kube-scheduler.yaml and
/etc/kubernetes/manifests/kube-controller-manager.yaml by commenting out --port=0.
When I checked systemctl status kubelet, it was working.
Active: active (running) since Mon 2020-10-26 13:18:46 +0530; 1 years 0 months ago
I restarted the kubelet service, and the controller-manager and scheduler were shown as healthy.
But systemctl status kubelet now shows the following (right after the restart it briefly showed a running state):
Active: activating (auto-restart) (Result: exit-code) since Thu 2021-11-11 10:50:49 +0530; 3s ago
Docs: https://github.com/GoogleCloudPlatform/kubernetes
Process: 21234 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET
Tried adding Environment="KUBELET_SYSTEM_PODS_ARGS=--pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true --fail-swap-on=false" to /etc/systemd/system/kubelet.service.d/10-kubeadm.conf as described here, but it's still not working properly.
I also removed the --port=0 comment in the above-mentioned manifests and tried restarting; still the same result.
Edit: This issue was due to an expired kubelet certificate and was fixed by following these steps. If someone faces this issue, make sure the /var/lib/kubelet/pki/kubelet-client-current.pem certificate and key values are base64 encoded when placing them in /etc/kubernetes/kubelet.conf.
Many others suggested running kubeadm init again, but this cluster was created using Kubespray, with no manually added nodes.
We have bare-metal K8s running on Ubuntu 18.04.
K8s: v1.18.8
We would like any debugging and fixing suggestions.
PS:
When we try telnet 10.233.0.1 443 from any node, the first attempt fails and the second succeeds.
Edit: Found this in kubelet service logs
Nov 10 17:35:05 node1 kubelet[1951]: W1110 17:35:05.380982 1951 docker_sandbox.go:402] failed to read pod IP from plugin/docker: networkPlugin cni failed on the status hook for pod "app-7b54557dd4-bzjd9_default": unexpected command output nsenter: cannot open /proc/12311/ns/net: No such file or directory
Posting the comment as a community wiki answer for better visibility:
This issue was due to an expired kubelet certificate and was fixed by following these steps. If someone faces this issue, make sure the /var/lib/kubelet/pki/kubelet-client-current.pem certificate and key values are base64 encoded when placing them in /etc/kubernetes/kubelet.conf.
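To confirm the expiry and produce the base64-encoded values mentioned above, something like the following should work; this is only a sketch that assumes the paths from the answer and GNU base64 (-w0 disables line wrapping):
# When does the kubelet client certificate expire?
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -enddate
# Base64-encode the PEM file for embedding in /etc/kubernetes/kubelet.conf
base64 -w0 /var/lib/kubelet/pki/kubelet-client-current.pem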

Cassandra pod fails after kubernetes node restart

I have successfully installed DSE in my Kubernetes environment using the Kubernetes Operator instructions.
With nodetool I checked that all pods successfully joined the ring.
The problem is that when I reboot one of the Kubernetes nodes, the Cassandra pod that was running on that node never recovers:
[root@node1 ~]# kubectl exec -it -n cassandra cluster1-dc1-r2-sts-0 -c cassandra nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving/Stopped
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.244.166.132 153.82 KiB 1 77.9% 053cc18e-397c-4abe-bb1b-d48a3fef3c93 r3
DS 10.244.104.1 136.09 KiB 1 26.9% 8ae31e1c-856e-44a8-b081-c5c040b535b9 r1
UN 10.244.135.2 202.8 KiB 1 95.2% 06200794-298c-4122-b8ff-4239bc7a8ded r2
[root@node1 ~]# kubectl get pods -n cassandra
NAME READY STATUS RESTARTS AGE
cass-operator-56f5f8c7c-w6l2c 1/1 Running 0 17h
cluster1-dc1-r1-sts-0 1/2 Running 2 17h
cluster1-dc1-r2-sts-0 2/2 Running 0 17h
cluster1-dc1-r3-sts-0 2/2 Running 0 17h
I have looked into the logs but I can't figure out what the problem is.
The "kubectl logs" command returns the logs below:
INFO [nioEventLoopGroup-2-1] 2020-03-25 12:13:13,536 Cli.java:555 - address=/192.168.0.11:38590 url=/api/v0/probes/liveness status=200 OK
INFO [epollEventLoopGroup-6506-1] 2020-03-25 12:13:14,110 Clock.java:35 - Could not access native clock (see debug logs for details), falling back to Java system clock
WARN [epollEventLoopGroup-6506-2] 2020-03-25 12:13:14,111 Slf4JLogger.java:146 - Unknown channel option 'TCP_NODELAY' for channel '[id: 0x8a898bf3]'
WARN [epollEventLoopGroup-6506-2] 2020-03-25 12:13:14,116 Loggers.java:28 - [s6501] Error connecting to /tmp/dse.sock, trying next node
java.io.FileNotFoundException: null
at io.netty.channel.unix.Errors.throwConnectException(Errors.java:110)
at io.netty.channel.unix.Socket.connect(Socket.java:257)
at io.netty.channel.epoll.AbstractEpollChannel.doConnect0(AbstractEpollChannel.java:732)
at io.netty.channel.epoll.AbstractEpollChannel.doConnect(AbstractEpollChannel.java:717)
at io.netty.channel.epoll.EpollDomainSocketChannel.doConnect(EpollDomainSocketChannel.java:87)
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.connect(AbstractEpollChannel.java:559)
at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1366)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:545)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:530)
at io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:545)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:530)
at io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:50)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:545)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:530)
at io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:50)
at com.datastax.oss.driver.internal.core.channel.ConnectInitHandler.connect(ConnectInitHandler.java:57)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:545)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:530)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:512)
at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:1024)
at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:276)
at io.netty.bootstrap.Bootstrap$3.run(Bootstrap.java:252)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:375)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
INFO [nioEventLoopGroup-2-2] 2020-03-25 12:13:14,118 Cli.java:555 - address=/192.168.0.11:38592 url=/api/v0/probes/readiness status=500 Internal Server Error
The error java.io.FileNotFoundException: null also appears when Cassandra starts successfully.
So what remains is the error:
address=/192.168.0.11:38592 url=/api/v0/probes/readiness status=500 Internal Server Error
Which doesn't say much to me.
The "kubectl describe" shows the following
Warning Unhealthy 4m41s (x6535 over 18h) kubelet, node2 Readiness probe failed: HTTP probe failed with statuscode: 500
In the Cassandra container, only this process is running:
java -Xms128m -Xmx128m -jar /opt/dse/resources/management-api/management-api-6.8.0.20200316-LABS-all.jar --dse-socket /tmp/dse.sock --host tcp://0.0.0.0
And in /var/log/cassandra/system.log I can't find any error.
Andrea, the error "java.io.FileNotFoundException: null" is a harmless message about a transient error during Cassandra pod startup and health checking.
I was able to reproduce the issue you ran into. If you run kubectl get pods you should see the affected pod showing 1/2 under the "READY" column; this means the Cassandra container was not brought up in the auto-restarted pod and only the management API container is running. I suspect this is a bug in the operator, and I'll work with the developers to sort it out.
As a workaround you can run kubectl delete pod/<pod_name> to recover your Cassandra cluster back to a normal state (in your case kubectl delete pod/cluster1-dc1-r1-sts-0). This will redeploy the pod and remount the data volume automatically, without losing anything.
I got this error when the CoreDNS pods were not running on the node on which I had started Cassandra. DNS resolution was not working properly, so debugging network connectivity may help.
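A quick way to check CoreDNS and in-cluster DNS resolution might look like the following; this is only a sketch, assuming a standard deployment where the CoreDNS pods carry the k8s-app=kube-dns label:
# Are the CoreDNS pods running, and on which nodes?
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
# Test DNS resolution from a throwaway pod inside the cluster
kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default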

Adding removed etcd member in Kubernetes master

I was following Kelsey Hightower's kubernetes-the-hard-way repo and successfully created a cluster with 3 master nodes and 3 worker nodes. Here are the problems I encountered when removing one of the etcd members and then adding it back, along with all the steps used:
3 master nodes:
10.240.0.10 controller-0
10.240.0.11 controller-1
10.240.0.12 controller-2
Step 1:
isaac@controller-0:~$ sudo ETCDCTL_API=3 etcdctl member list --endpoints=https://127.0.0.1:2379 --cacert=/etc/etcd/ca.pem --cert=/etc/etcd/kubernetes.pem --key=/etc/etcd/kubernetes-key.pem
Result:
b28b52253c9d447e, started, controller-2, https://10.240.0.12:2380, https://10.240.0.12:2379
f98dc20bce6225a0, started, controller-0, https://10.240.0.10:2380, https://10.240.0.10:2379
ffed16798470cab5, started, controller-1, https://10.240.0.11:2380, https://10.240.0.11:2379
Step 2 (Remove etcd member of controller-2):
isaac@controller-0:~$ sudo ETCDCTL_API=3 etcdctl member remove b28b52253c9d447e --endpoints=https://127.0.0.1:2379 --cacert=/etc/etcd/ca.pem --cert=/etc/etcd/kubernetes.pem --key=/etc/etcd/kubernetes-key.pem
Step 3 (Add the member back):
isaac@controller-0:~$ sudo ETCDCTL_API=3 etcdctl member add controller-2 --peer-urls=https://10.240.0.12:2380 --endpoints=https://127.0.0.1:2379 --cacert=/etc/etcd/ca.pem --cert=/etc/etcd/kubernetes.pem --key=/etc/etcd/kubernetes-key.pem
Result:
Member 66d450d03498eb5c added to cluster 3e7cc799faffb625
ETCD_NAME="controller-2"
ETCD_INITIAL_CLUSTER="controller-2=https://10.240.0.12:2380,controller-0=https://10.240.0.10:2380,controller-1=https://10.240.0.11:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://10.240.0.12:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
Step 4 (run member list command):
isaac@controller-0:~$ sudo ETCDCTL_API=3 etcdctl member list --endpoints=https://127.0.0.1:2379 --cacert=/etc/etcd/ca.pem --cert=/etc/etcd/kubernetes.pem --key=/etc/etcd/kubernetes-key.pem
Result:
66d450d03498eb5c, unstarted, , https://10.240.0.12:2380,
f98dc20bce6225a0, started, controller-0, https://10.240.0.10:2380, https://10.240.0.10:2379
ffed16798470cab5, started, controller-1, https://10.240.0.11:2380, https://10.240.0.11:2379
Step 5 (Run the command to start etcd in controller-2):
isaac@controller-2:~$ sudo etcd --name controller-2 --listen-client-urls https://10.240.0.12:2379,http://127.0.0.1:2379 --advertise-client-urls https://10.240.0.12:2379 --listen-peer-urls https://10.240.0.12:2380 --initial-advertise-peer-urls https://10.240.0.12:2380 --initial-cluster-state existing --initial-cluster controller-0=http://10.240.0.10:2380,controller-1=http://10.240.0.11:2380,controller-2=http://10.240.0.12:2380 --ca-file /etc/etcd/ca.pem --cert-file /etc/etcd/kubernetes.pem --key-file /etc/etcd/kubernetes-key.pem
Result:
2019-06-09 13:10:14.958799 I | etcdmain: etcd Version: 3.3.9
2019-06-09 13:10:14.959022 I | etcdmain: Git SHA: fca8add78
2019-06-09 13:10:14.959106 I | etcdmain: Go Version: go1.10.3
2019-06-09 13:10:14.959177 I | etcdmain: Go OS/Arch: linux/amd64
2019-06-09 13:10:14.959237 I | etcdmain: setting maximum number of CPUs to 1, total number of available CPUs is 1
2019-06-09 13:10:14.959312 W | etcdmain: no data-dir provided, using default data-dir ./controller-2.etcd
2019-06-09 13:10:14.959435 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2019-06-09 13:10:14.959575 C | etcdmain: cannot listen on TLS for 10.240.0.12:2380: KeyFile and CertFile are not presented
Clearly, the etcd service did not start as expected, so I did some troubleshooting as below:
isaac@controller-2:~$ sudo systemctl status etcd
Result:
● etcd.service - etcd
   Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Sun 2019-06-09 13:06:55 UTC; 29min ago
     Docs: https://github.com/coreos
  Process: 1876 ExecStart=/usr/local/bin/etcd --name controller-2 --cert-file=/etc/etcd/kubernetes.pem --key-file=/etc/etcd/kubernetes-key.pem --peer-cert-file=/etc/etcd/kubernetes.pem --peer-key-file=/etc/etcd/kube
 Main PID: 1876 (code=exited, status=0/SUCCESS)
Jun 09 13:06:55 controller-2 etcd[1876]: stopped peer f98dc20bce6225a0
Jun 09 13:06:55 controller-2 etcd[1876]: stopping peer ffed16798470cab5...
Jun 09 13:06:55 controller-2 etcd[1876]: stopped streaming with peer ffed16798470cab5 (writer)
Jun 09 13:06:55 controller-2 etcd[1876]: stopped streaming with peer ffed16798470cab5 (writer)
Jun 09 13:06:55 controller-2 etcd[1876]: stopped HTTP pipelining with peer ffed16798470cab5
Jun 09 13:06:55 controller-2 etcd[1876]: stopped streaming with peer ffed16798470cab5 (stream MsgApp v2 reader)
Jun 09 13:06:55 controller-2 etcd[1876]: stopped streaming with peer ffed16798470cab5 (stream Message reader)
Jun 09 13:06:55 controller-2 etcd[1876]: stopped peer ffed16798470cab5
Jun 09 13:06:55 controller-2 etcd[1876]: failed to find member f98dc20bce6225a0 in cluster 3e7cc799faffb625
Jun 09 13:06:55 controller-2 etcd[1876]: forgot to set Type=notify in systemd service file?
I have tried to start the etcd member using different commands, but it seems the etcd member on controller-2 is still stuck in the unstarted state. May I know the reason for that? Any pointers would be highly appreciated! Thanks.
Turned out I solved the problem as follows (credit to Matthew):
Delete the etcd data directory with the following command:
rm -rf /var/lib/etcd/*
To fix the message cannot listen on TLS for 10.240.0.12:2380: KeyFile and CertFile are not presented, I revised the command to start etcd as follows:
sudo etcd --name controller-2 --listen-client-urls https://10.240.0.12:2379,http://127.0.0.1:2379 --advertise-client-urls https://10.240.0.12:2379 --listen-peer-urls https://10.240.0.12:2380 --initial-advertise-peer-urls https://10.240.0.12:2380 --initial-cluster-state existing --initial-cluster controller-0=https://10.240.0.10:2380,controller-1=https://10.240.0.11:2380,controller-2=https://10.240.0.12:2380 --peer-trusted-ca-file /etc/etcd/ca.pem --cert-file /etc/etcd/kubernetes.pem --key-file /etc/etcd/kubernetes-key.pem --peer-cert-file /etc/etcd/kubernetes.pem --peer-key-file /etc/etcd/kubernetes-key.pem --data-dir /var/lib/etcd
A few points to note here:
The newly added arguments --cert-file and --key-file present the required certificate and key for controller-2.
The argument --peer-trusted-ca-file is also presented so that etcd can check whether the x509 certificates presented by controller-0 and controller-1 are signed by a known CA. If it is not presented, the error etcdserver: could not get cluster response from https://10.240.0.11:2380: Get https://10.240.0.11:2380/members: x509: certificate signed by unknown authority may be encountered (a quick way to verify the signing CA is shown after these notes).
The value presented for the argument --initial-cluster needs to be in line with what is shown in the systemd unit file.
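To check up front that a certificate really chains back to the CA passed via --peer-trusted-ca-file, a simple verification, assuming the same file paths as above, is:
# Verify that the serving/peer certificate is signed by the trusted CA
openssl verify -CAfile /etc/etcd/ca.pem /etc/etcd/kubernetes.pem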
If you are re-adding the member, an easier solution is the following:
rm -rf /var/lib/etcd/*
kubeadm join phase control-plane-join etcd --control-plane

kubelet saying node "master01" not found

I am trying to stand up my kubeadm cluster with three masters. I receive this problem from my init command:
[kubelet-check] Initial timeout of 40s passed.
Unfortunately, an error has occurred:
timed out waiting for the condition
This error is likely caused by:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
- 'systemctl status kubelet'
- 'journalctl -xeu kubelet'
Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI, e.g. docker.
Here is one example how you may list all Kubernetes containers running in docker:
- 'docker ps -a | grep kube | grep -v pause'
Once you have found the failing container, you can inspect its logs with:
- 'docker logs CONTAINERID'
error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
But I am not using cgroupfs, I am using systemd.
And my kubelet complains that it does not know its node name.
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.251885 5620 kubelet.go:2266] node "master01" not found
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.352932 5620 kubelet.go:2266] node "master01" not found
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.453895 5620 kubelet.go:2266] node "master01" not found
Please let me know where the issue is.
The issue can be caused by the Docker version, as Docker versions newer than 18.6 are not supported by the latest Kubernetes version, i.e. v1.13.xx.
I actually got the same issue, but it got resolved after downgrading the Docker version from 18.9 to 18.6.
If the problem is not related to Docker, it might be because the kubelet service failed to establish a connection to the API server.
I would first of all check the status of the kubelet with systemctl status kubelet and consider restarting it with systemctl restart kubelet.
If this doesn't help, try re-installing kubeadm or running kubeadm init with another version (use the --kubernetes-version=X.Y.Z flag).
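Since the question mentions cgroupfs vs. systemd, it may also be worth confirming that Docker and the kubelet agree on the cgroup driver; a rough check, assuming a kubeadm-provisioned kubelet whose config lives at /var/lib/kubelet/config.yaml:
# Cgroup driver Docker is using
docker info 2>/dev/null | grep -i "cgroup driver"
# Cgroup driver the kubelet is configured with
grep -i cgroupDriver /var/lib/kubelet/config.yaml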
In my case, my k8s version is 1.21.1 and my Docker version is 19.03. I solved this bug by upgrading Docker to version 20.7.

Scaling-Master / Unable to perform initial IP allocation check

I started with 3 master nodes and increased the count to 5. I am trying to add the new members to the existing cluster. My apiserver container stops working with the following error:
E1106 20:44:18.977854 1 cacher.go:274] unexpected ListAndWatch error: k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: Failed to list *storage.StorageClass: client: etcd cluster is unavailable or misconfigured
I1106 20:44:19.043807 1 logs.go:41] http: TLS handshake error from 10.0.118.9:52142: EOF
I1106 20:44:19.072129 1 logs.go:41] http: TLS handshake error from 10.0.118.9:52148: EOF
I1106 20:44:19.084461 1 logs.go:41] http: TLS handshake error from 10.0.118.9:52150: EOF
F1106 20:44:19.103677 1 controller.go:128] Unable to perform initial IP allocation check: unable to refresh the service IP block: client: etcd cluster is unavailable or misconfigured
From the already working master nodes I can see the new member:
azureuser@k8s-master-50639053-0:~$ etcdctl member list
99673c60d6c07e0e: name=k8s-master-50639053-2 peerURLs=http://10.0.118.7:2380 clientURLs=
b130aa7583380f88: name=k8s-master-50639053-3 peerURLs=http://10.0.118.8:2380 clientURLs=
b4b196cc0c9fca4a: name=k8s-master-50639053-1 peerURLs=http://10.0.118.6:2380 clientURLs=
c264b3b67880db3f: name=k8s-master-50639053-0 peerURLs=http://10.0.118.5:2380 clientURLs=
e6e511de7d665829: name=k8s-master-50639053-4 peerURLs=http://10.0.118.9:2380 clientURLs=
If I check the cluster health, I get:
azureuser@k8s-master-50639053-0:~$ etcdctl cluster-health
member 99673c60d6c07e0e is healthy: got healthy result from http://10.0.118.7:2379
member b4b196cc0c9fca4a is healthy: got healthy result from http://10.0.118.6:2379
member c264b3b67880db3f is healthy: got healthy result from http://10.0.118.5:2379
member fd36b7acc85d92b8 is unhealthy: got unhealthy result from http://10.0.118.9:2379
cluster is healthy
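Since the clientURLs here are plain HTTP, the unhealthy member can be probed directly and compared with a healthy one; a minimal check against etcd's health endpoint (addresses taken from the member list above):
# Healthy member vs. the newly added one
curl http://10.0.118.7:2379/health
curl http://10.0.118.9:2379/health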
It works if, on the new master node, I stop the etcd service and run:
sudo etcd --listen-client-urls http://10.0.118.9:2379 --advertise-client-urls http://10.0.118.9:2379 --listen-peer-urls http://10.0.118.9:2380
Could someone help me?
Thanks.
Update: According to GitHub, it's due to certificates, and it's not currently supported by acs-engine.