how to restart a Mon in a Ceph cluster - ceph

I am running a Ceph Cluster (pacific) consisting of 3 nodes. I have 3 Monitor Daemons started (one on each node).
Just for some test I wanted to remove node3 out of the list of monitors with the command
sudo ceph mon remove node3
The mon disappeared in the Dashboard and also when I call ceph status command:
$ sudo ceph status
cluster:
id: xxxxx-yyyyy-zzzzzzz
health: HEALTH_WARN
1 failed cephadm daemon(s)
services:
mon: 2 daemons, quorum node1,node2 (age 53m)
mgr: node1.gafnwm(active, since 102m), standbys: node2.jppwxo
osd: 3 osds: 3 up (since 15m), 3 in (since 96m)
But now I found no way to restart or add the monitor again into my cluster. I tried:
$ sudo ceph orch daemon add mon node3:10.0.0.4
Error EINVAL: name mon.node3 already in use
What does this mean? Why is the name mon.node3 already in use even if I can't see it?

Related

kubernetes api server not automatically start after master reboots

I have setup a small cluster with kubeadm, it was working fine and 6443 port was up. But after rebooting my system, the cluster is not getting up anymore.
What should I do?
Here is some information:
systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Sun 2020-04-05 14:16:44 UTC; 6s ago
Docs: https://kubernetes.io/docs/home/
Main PID: 31079 (kubelet)
Tasks: 20 (limit: 4915)
CGroup: /system.slice/kubelet.service
└─31079 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet
k8s.io/kubernetes/pkg/kubelet/kubelet.go:458: Failed to list *v1.Node: Get https://infra01.mydomainname.com:6443/api/v1/nodes?fieldSelector=metadata.name%3Dtest-infra01&limit=500&resourceVersion=0: dial tcp 116.66.187.210:6443: connect: connection refused
kubectl get nodes
The connection to the server infra01.mydomainname.com:6443 was refused - did you specify the right host or port?
kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:12:12Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
journalctl -xeu kubelet
6 18167 reflector.go:153] k8s.io/kubernetes/pkg/kubelet/kubelet.go:458:
Failed to list *v1.Node: Get https://infra01.mydomainname.com
1 18167 reflector.go:153]
k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46: Failed to list *v1.Pod: Get https://huawei-infra01.s
4 18167 aws_credentials.go:77] while getting AWS credentials
NoCredentialProviders: no valid providers in chain. Deprecated.
messaging see aws.Config.CredentialsChainVerboseErrors
6 18167 kuberuntime_manager.go:211] Container runtime docker initialized,
version: 19.03.7, apiVersion: 1.40.0
6 18167 server.go:1113] Started kubelet
1 18167 kubelet.go:1302] Image garbage collection failed once. Stats
initialization may not have completed yet: failed to get imageF
8 18167 server.go:144] Starting to listen on 0.0.0.0:10250
4 18167 server.go:778] Starting healthz server failed: listen tcp
127.0.0.1:10248: bind: address already in use
5 18167 fs_resource_analyzer.go:64] Starting FS ResourceAnalyzer
4 18167 volume_manager.go:265] Starting Kubelet Volume Manager
1 18167 desired_state_of_world_populator.go:138] Desired state populator
starts to run
3 18167 server.go:384] Adding debug handlers to kubelet server.
4 18167 server.go:158] listen tcp 0.0.0.0:10250: bind: address already in
use
Docker
docker run hello-world
Hello from Docker!
ubuntu
lsb_release -a
Ubuntu 18.04.2 LTS
swap && kubeconfig
swap is turned off and kubeconfig was correctly exported
Note
Things can be fixed by resetting the cluster, but this should be the final option.
Kubelet is not started because of port already in use and hence not able to create pod for api server.
Use following command to find out which process is holding the port 10250
root#master admin]# ss -lntp | grep 10250
LISTEN 0 128 :::10250 :::* users:(("kubelet",pid=23373,fd=20))
It will give you PID of that process and name of that process. If it is unwanted process which is holding the port, you can always kill the process and that port becomes available to use by kubelet.
After killing the process again run the above command, it should return no value.
Just to be on safe side run kubeadm reset and then run kubeadm init and it should go through
Edit:
Using snap stop kubelet did the trick of stopping kubelet on the node.

kubernetes worker node in "NotReady" status

I am trying to setup my first cluster using Kubernetes 1.13.1. The master got initialized okay, but both of my worker nodes are NotReady. kubectl describe node shows that Kubelet stopped posting node status on both worker nodes. On one of the worker nodes I get log output like
> kubelet[3680]: E0107 20:37:21.196128 3680 kubelet.go:2266] node
> "xyz" not found.
Here is the full details:
I am using Centos 7 & Kubernetes 1.13.1.
Initializing was done as follows:
[root#master ~]# kubeadm init --apiserver-advertise-address=10.142.0.4 --pod-network-cidr=10.142.0.0/24
Successfully initialized the cluster:
You can now join any number of machines by running the following on each node
as root:
`kubeadm join 10.142.0.4:6443 --token y0epoc.zan7yp35sow5rorw --discovery-token-ca-cert-hash sha256:f02d43311c2696e1a73e157bda583247b9faac4ffb368f737ee9345412c9dea4`
deployed the flannel CNI:
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
The join command worked fine.
[kubelet-start] Activating the kubelet service
[tlsbootstrap] Waiting for the kubelet to perform the TLS Bootstrap...
[patchnode] Uploading the CRI Socket information "/var/run/dockershim.sock" to the Node API object "node01" as an annotation
This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.
Run 'kubectl get nodes' on the master to see this node join the cluster.
Result of kubectl get nodes:
[root#master ~]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
master Ready master 9h v1.13.1
node01 NotReady <none> 9h v1.13.1
node02 NotReady <none> 9h v1.13.1
on both nodes:
[root#node01 ~]# service kubelet status
Redirecting to /bin/systemctl status kubelet.service
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Tue 2019-01-08 04:49:20 UTC; 32s ago
Docs: https://kubernetes.io/docs/
Main PID: 4224 (kubelet)
Memory: 31.3M
CGroup: /system.slice/kubelet.service
└─4224 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfi
`Jan 08 04:54:10 node01 kubelet[4224]: E0108 04:54:10.957115 4224 kubelet.go:2266] node "node01" not found`
I appreciate your advise on how to troubleshoot this.
The previous answer sounds correct. You can verify that by running
kubectl describe node node01 on the master, or wherever kubectl is correctly configured.
It seems like the reason of this error is due to incorrect subnet. In Flannel documentation it is written that you should use /16 not /24 for pod network.
NOTE: If kubeadm is used, then pass --pod-network-cidr=10.244.0.0/16
to kubeadm init to ensure that the podCIDR is set.
I tried to run kubeadm with /24 and although I had nodes in Ready state the flannel pods did not run properly which resulted in some issues.
You can check if your flannel pods are running properly by:
kubectl get pods -n kube-system if the status is other than running then it is incorrect behavior. In this case you can check details by running kubectl describe pod PODNAME -n kube-system. Try changing the subnet and update us if that fixed the problem.
I ran into almost the same problem, and in the end I found that the reason was that the firewall was not turned off. You can try the following commands:
sudo ufw disable
or
systemctl disable firewalld
or
setenforce 0

Kubernetes: specify CPUs for cpumanager

Is it possible to specify CPU ID list to the Kubernetes cpumanager? The goal is to make sure pods get CPUs from a single socket (0). I brought all the CPUs on the peer socket offline as mentioned here, for example:
$ echo 0 > /sys/devices/system/cpu/cpu5/online
After doing this, the Kubernetes master indeed sees the remaining online CPUs
kubectl describe node foo
Capacity:
cpu: 56 <<< socket 0 CPU count
ephemeral-storage: 958774760Ki
hugepages-1Gi: 120Gi
memory: 197524872Ki
pods: 110
Allocatable:
cpu: 54 <<< 2 system reserved CPUs
ephemeral-storage: 958774760Ki
hugepages-1Gi: 120Gi
memory: 71490952Ki
pods: 110
System Info:
Machine ID: 1155420082478559980231ba5bc0f6f2
System UUID: 4C4C4544-0044-4210-8031-C8C04F584B32
Boot ID: 7fa18227-748f-496c-968c-9fc82e21ecd5
Kernel Version: 4.4.13
OS Image: Ubuntu 16.04.4 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://17.3.3
Kubelet Version: v1.11.1
Kube-Proxy Version: v1.11.1
However, cpumanager still seems to think there are 112 CPUs (socket0 + socket1).
cat /var/lib/kubelet/cpu_manager_state
{"policyName":"static","defaultCpuSet":"0-111"}
As a result, the kubelet system pods are throwing the following error:
kube-system kube-proxy-nk7gc 0/1 rpc error: code = Unknown desc = failed to update container "eb455f81a61b877eccda0d35eea7834e30f59615346140180f08077f64896760": Error response from daemon: Requested CPUs are not available - requested 0-111, available: 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110 762 36d <IP address> foo <none>
I was able to get this working. Posting this as an answer so that someone in need might benefit.
It appears the CPU set is read from /var/lib/kubelet/cpu_manager_state file and it is not updated across kubelet restarts. So this file needs to be removed before restarting kubelet.
The following worked for me:
# On a running worker node, bring desired CPUs offline. (run as root)
$ cpu_list=`lscpu | grep "NUMA node1 CPU(s)" | awk '{print $4}'`
$ chcpu -d $cpu_list
$ rm -f /var/lib/kubelet/cpu_manager_state
$ systemctl restart kubelet.service
# Check the CPU set seen by the CPU manager
$ cat /var/lib/kubelet/cpu_manager_state
# Try creating pods and check the syslog:
Dec 3 14:36:05 k8-2-w1 kubelet[8070]: I1203 14:36:05.122466 8070 state_mem.go:84] [cpumanager] updated default cpuset: "0,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110"
Dec 3 14:36:05 k8-2-w1 kubelet[8070]: I1203 14:36:05.122643 8070 policy_static.go:198] [cpumanager] allocateCPUs: returning "2,4,6,8,58,60,62,64"
Dec 3 14:36:05 k8-2-w1 kubelet[8070]: I1203 14:36:05.122660 8070 state_mem.go:76] [cpumanager] updated desired cpuset (container id: 356939cdf32d0f719e83b0029a018a2ca2c349fc0bdc1004da5d842e357c503a, cpuset: "2,4,6,8,58,60,62,64")
I have reported a bug here as I think the CPU set should be updated after kubelet restarts.

kubectl get nodes shows NotReady

I have installed two nodes kubernetes 1.12.1 in cloud VMs, both behind internet proxy. Each VMs have floating IPs associated to connect over SSH, kube-01 is a master and kube-02 is a node. Executed export:
no_proxy=127.0.0.1,localhost,10.157.255.185,192.168.0.153,kube-02,192.168.0.25,kube-01
before running kubeadm init, but I am getting the following status for kubectl get nodes:
NAME STATUS ROLES AGE VERSION
kube-01 NotReady master 89m v1.12.1
kube-02 NotReady <none> 29s v1.12.2
Am I missing any configuration? Do I need to add 192.168.0.153 and 192.168.0.25 in respective VM's /etc/hosts?
Looks like pod network is not installed yet on your cluster . You can install weave for example with below command
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
After a few seconds, a Weave Net pod should be running on each Node and any further pods you create will be automatically attached to the Weave network.
You can install pod networks of your choice . Here is a list
after this check
$ kubectl describe nodes
check all is fine like below
Conditions:
Type Status
---- ------
OutOfDisk False
MemoryPressure False
DiskPressure False
Ready True
Capacity:
cpu: 2
memory: 2052588Ki
pods: 110
Allocatable:
cpu: 2
memory: 1950188Ki
pods: 110
next ssh to the pod which is not ready and observe kubelet logs. Most likely errors can be of certificates and authentication.
You can also use journalctl on systemd to check kubelet errors.
$ journalctl -u kubelet
Try with this
Your coredns is in pending state check with the networking plugin you have used and check the proper addons are added
check kubernates troubleshooting guide
https://kubernetes.io/docs/setup/independent/troubleshooting-kubeadm/#coredns-or-kube-dns-is-stuck-in-the-pending-state
https://kubernetes.io/docs/concepts/cluster-administration/addons/
And install the following with those
And check
kubectl get pods -n kube-system
On the off chance it might be the same for someone else, in my case, I was using the wrong AMI image to create the nodegroup.
Run
journalctl -u kubelet
Then check at node logs, if you get below error, disable the sawp using swapoff -a
"Failed to run kubelet" err="failed to run Kubelet: running with swap on is not supported, please disable swap! or set --fa
Main process exited, code=exited, status=1/FAILURE

Failed create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod network

Issue Redis POD creation on k8s(v1.10) cluster and POD creation stuck at "ContainerCreating"
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 30m default-scheduler Successfully assigned redis to k8snode02
Normal SuccessfulMountVolume 30m kubelet, k8snode02 MountVolume.SetUp succeeded for volume "default-token-f8tcg"
Warning FailedCreatePodSandBox 5m (x1202 over 30m) kubelet, k8snode02 Failed create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "redis_default" network: failed to find plugin "loopback" in path [/opt/loopback/bin /opt/cni/bin]
Normal SandboxChanged 47s (x1459 over 30m) kubelet, k8snode02 Pod sandbox changed, it will be killed and re-created.
When I used calico as CNI and I faced a similar issue.
The container remained in creating state, I checked for /etc/cni/net.d and /opt/cni/bin on master both are present but not sure if this is required on worker node as well.
root#KubernetesMaster:/opt/cni/bin# kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-5c7588df-5zds6 0/1 ContainerCreating 0 21m
root#KubernetesMaster:/opt/cni/bin# kubectl get nodes
NAME STATUS ROLES AGE VERSION
kubernetesmaster Ready master 26m v1.13.4
kubernetesslave1 Ready <none> 22m v1.13.4
root#KubernetesMaster:/opt/cni/bin#
kubectl describe pods
Name: nginx-5c7588df-5zds6
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: kubernetesslave1/10.0.3.80
Start Time: Sun, 17 Mar 2019 05:13:30 +0000
Labels: app=nginx
pod-template-hash=5c7588df
Annotations: <none>
Status: Pending
IP:
Controlled By: ReplicaSet/nginx-5c7588df
Containers:
nginx:
Container ID:
Image: nginx
Image ID:
Port: <none>
Host Port: <none>
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-qtfbs (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-qtfbs:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-qtfbs
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 18m default-scheduler Successfully assigned default/nginx-5c7588df-5zds6 to kubernetesslave1
Warning FailedCreatePodSandBox 18m kubelet, kubernetesslave1 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "123d527490944d80f44b1976b82dbae5dc56934aabf59cf89f151736d7ea8adc" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
Warning FailedCreatePodSandBox 18m kubelet, kubernetesslave1 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "8cc5e62ebaab7075782c2248e00d795191c45906cc9579464a00c09a2bc88b71" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
Warning FailedCreatePodSandBox 18m kubelet, kubernetesslave1 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "30ffdeace558b0935d1ed3c2e59480e2dd98e983b747dacae707d1baa222353f" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
Warning FailedCreatePodSandBox 18m kubelet, kubernetesslave1 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "630e85451b6ce2452839c4cfd1ecb9acce4120515702edf29421c123cf231213" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
Warning FailedCreatePodSandBox 18m kubelet, kubernetesslave1 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "820b919b7edcfc3081711bb78b79d33e5be3f7dafcbad29fe46b6d7aa22227aa" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
Warning FailedCreatePodSandBox 18m kubelet, kubernetesslave1 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "abbfb5d2756f12802072039dec20ba52f546ae755aaa642a9a75c86577be589f" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
Warning FailedCreatePodSandBox 18m kubelet, kubernetesslave1 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "dfeb46ffda4d0f8a434f3f3af04328fcc4b6c7cafaa62626e41b705b06d98cc4" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
Warning FailedCreatePodSandBox 18m kubelet, kubernetesslave1 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "9ae3f47bb0282a56e607779d3267127ee8b0ae1d7f416f5a184682119203b1c8" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
Warning FailedCreatePodSandBox 18m kubelet, kubernetesslave1 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "546d07f1864728b2e2675c066775f94d658e221ada5fb4ed6bf6689ec7b8de23" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
Normal SandboxChanged 18m (x12 over 18m) kubelet, kubernetesslave1 Pod sandbox changed, it will be killed and re-created.
Warning FailedCreatePodSandBox 3m39s (x829 over 18m) kubelet, kubernetesslave1 (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "f586be437843537a3082f37ad139c88d0eacfbe99ddf00621efd4dc049a268cc" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
root#KubernetesMaster:/etc/cni/net.d#
On worker node NGINX is trying to come up but getting exited, I am not sure what's going on here - I am newbie to kubernetes & not able to fix this issue -
root#kubernetesslave1:/home/ubuntu# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5ad5500e8270 fadcc5d2b066 "/usr/local/bin/kube…" 3 minutes ago Up 3 minutes k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
b1c9929ebe9e k8s.gcr.io/pause:3.1 "/pause" 3 minutes ago Up 3 minutes k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
ceb78340b563 k8s.gcr.io/pause:3.1 "/pause" 3 minutes ago Up 3 minutes k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
root#kubernetesslave1:/home/ubuntu# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5ad5500e8270 fadcc5d2b066 "/usr/local/bin/kube…" 3 minutes ago Up 3 minutes k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
b1c9929ebe9e k8s.gcr.io/pause:3.1 "/pause" 3 minutes ago Up 3 minutes k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
ceb78340b563 k8s.gcr.io/pause:3.1 "/pause" 3 minutes ago Up 3 minutes k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
root#kubernetesslave1:/home/ubuntu# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5ad5500e8270 fadcc5d2b066 "/usr/local/bin/kube…" 3 minutes ago Up 3 minutes k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
b1c9929ebe9e k8s.gcr.io/pause:3.1 "/pause" 3 minutes ago Up 3 minutes k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
ceb78340b563 k8s.gcr.io/pause:3.1 "/pause" 3 minutes ago Up 3 minutes k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
root#kubernetesslave1:/home/ubuntu# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
94b2994401d0 k8s.gcr.io/pause:3.1 "/pause" 1 second ago Up Less than a second k8s_POD_nginx-5c7588df-5zds6_default_677a722b-4873-11e9-a33a-06516e7d78c4_534
5ad5500e8270 fadcc5d2b066 "/usr/local/bin/kube…" 4 minutes ago Up 4 minutes k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
b1c9929ebe9e k8s.gcr.io/pause:3.1 "/pause" 4 minutes ago Up 4 minutes k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
ceb78340b563 k8s.gcr.io/pause:3.1 "/pause" 4 minutes ago Up 4 minutes k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
root#kubernetesslave1:/home/ubuntu# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5ad5500e8270 fadcc5d2b066 "/usr/local/bin/kube…" 4 minutes ago Up 4 minutes k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
b1c9929ebe9e k8s.gcr.io/pause:3.1 "/pause" 4 minutes ago Up 4 minutes k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
ceb78340b563 k8s.gcr.io/pause:3.1 "/pause" 4 minutes ago Up 4 minutes k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
root#kubernetesslave1:/home/ubuntu# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
f72500cae2b7 k8s.gcr.io/pause:3.1 "/pause" 1 second ago Up Less than a second k8s_POD_nginx-5c7588df-5zds6_default_677a722b-4873-11e9-a33a-06516e7d78c4_585
5ad5500e8270 fadcc5d2b066 "/usr/local/bin/kube…" 4 minutes ago Up 4 minutes k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
b1c9929ebe9e k8s.gcr.io/pause:3.1 "/pause" 4 minutes ago Up 4 minutes k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
ceb78340b563 k8s.gcr.io/pause:3.1 "/pause" 4 minutes ago Up 4 minutes k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
root#kubernetesslave1:/home/ubuntu# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5ad5500e8270 fadcc5d2b066 "/usr/local/bin/kube…" 5 minutes ago Up 5 minutes k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
b1c9929ebe9e k8s.gcr.io/pause:3.1 "/pause" 5 minutes ago Up 5 minutes k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
ceb78340b563 k8s.gcr.io/pause:3.1 "/pause" 5 minutes ago Up 5 minutes k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
I checked about /etc/cni/net.d & /opt/cni/bin on worker node as well, it is there -
root#kubernetesslave1:/home/ubuntu# cd /etc/cni
root#kubernetesslave1:/etc/cni# ls -ltr
total 4
drwxr-xr-x 2 root root 4096 Mar 17 05:19 net.d
root#kubernetesslave1:/etc/cni# cd /opt/cni
root#kubernetesslave1:/opt/cni# ls -ltr
total 4
drwxr-xr-x 2 root root 4096 Mar 17 05:19 bin
root#kubernetesslave1:/opt/cni# cd bin
root#kubernetesslave1:/opt/cni/bin# ls -ltr
total 107440
-rwxr-xr-x 1 root root 3890407 Aug 17 2017 bridge
-rwxr-xr-x 1 root root 3475802 Aug 17 2017 ipvlan
-rwxr-xr-x 1 root root 3520724 Aug 17 2017 macvlan
-rwxr-xr-x 1 root root 3877986 Aug 17 2017 ptp
-rwxr-xr-x 1 root root 3475750 Aug 17 2017 vlan
-rwxr-xr-x 1 root root 9921982 Aug 17 2017 dhcp
-rwxr-xr-x 1 root root 2605279 Aug 17 2017 sample
-rwxr-xr-x 1 root root 32351072 Mar 17 05:19 calico
-rwxr-xr-x 1 root root 31490656 Mar 17 05:19 calico-ipam
-rwxr-xr-x 1 root root 2856252 Mar 17 05:19 flannel
-rwxr-xr-x 1 root root 3084347 Mar 17 05:19 loopback
-rwxr-xr-x 1 root root 3036768 Mar 17 05:19 host-local
-rwxr-xr-x 1 root root 3550877 Mar 17 05:19 portmap
-rwxr-xr-x 1 root root 2850029 Mar 17 05:19 tuning
root#kubernetesslave1:/opt/cni/bin#
Ensure that /etc/cni/net.d and its /opt/cni/bin friend both exist and are correctly populated with the CNI configuration files and binaries on all Nodes. For flannel specifically, one might make use of the flannel cni repo
I had this issue with my GKE cluster on GCP with one of my preemptive node pools. Thanks to #mdaniel tip of checking the integrity of /etc/cni/net.d I could reproduce the issue again by ssh into the node of a testing cluster with the command gcloud compute ssh <name of some node> --zone <zone-of-cluster> --internal-ip. Then I simply edited the file /etc/cni/net.d/10-gke-ptp.conflist and messed with the values on the "routes": [ {"dst": "0.0.0.0/0"} ] (changed from 0.0.0.0/0 to 1.0.0.0/0).
After that, I deleted the pods that were running inside of it and they all got stuck with the ContainerCreating status forever generating kublet events with the error Failed create pod sandbox: rpc error: code...
Note that in order to test I've set up my nodepool to have maximum of 1 node. Otherwise it will scale up a new one and the pods will be recreated at the new node. In my production incident the nodepool reached maximum node count so setting my tests to max 1 node reproduced a similar situation.
Since that, deleting the node from GKE solved the issue in production, I created a Python script that lists all events on the cluster and filters the ones that have the keyword "Failed create pod sandbox: rpc error: code". Than I go over all events and get their pods, and then from the pods, I get the nodes. Finally I loop over the nodes deleting them both from Kubernetes API and from Compute API with it's respective Python clients. For the Python script I used the libs: kubernetes and google-cloud-compute.
This is a simpler version of the script. Test it before using it:
from kubernetes import client, config
from google.cloud.compute_v1.services.instances import InstancesClient
ERROR_KEYWORDS = [
'Failed to create pod sandbox'.lower()
]
config.load_kube_config()
v1 = client.CoreV1Api()
events_result = v1.list_event_for_all_namespaces()
filtered_events = []
# filter only the events containing ERROR_KEYWORDS
for event in events_result.items:
for error_keyword in ERROR_KEYWORDS:
if error_keyword in event.message.lower():
filtered_events.append(event)
# gets the list of pods from those events
pods_list = {}
for event in filtered_events:
try:
pod = v1.read_namespaced_pod(
event.involved_object.name,
namespace=event.involved_object.namespace
)
pod_dict = {
"name": event.involved_object.name,
"namespace": event.involved_object.namespace,
"node": pod.spec.node_name
}
pods_list[event.involved_object.name] = pod_dict
except Exception as e:
pass
# Get the nodes from those pods
broken_nodes = set()
for name, pod_dict in pods_list.items():
if pod_dict.get('node'):
broken_nodes.add(pod_dict["node"])
broken_nodes = list(broken_nodes)
# Deletes the nodes from both Kubernetes API and Compute Engine API
if broken_nodes:
broken_nodes_str = ", ".join(broken_nodes)
print(f'BROKEN NODES: "{broken_nodes_str}"')
for node in broken_nodes:
try:
api_response = v1.delete_node(node)
except Exception as e:
pass
time.sleep(30)
try:
result = gcp_client.delete(project=PROJECT_ID, zone=CLUSTER_ZONE, instance=node)
except Exception as e:
pass
AWS EKS doesn't yet support t3a, m5ad r5ad instances
kubectl drain node1 node2 --delete-local-data --force --ignore-daemonsets
Just when I was planning to expel pods on all nodes, I didn’t expect the pods with errors all the time to become Running. You can try to execute it, hope it will be useful to you
This problem appeared for me when I added a PVC on AWS EKS.
Updating the aws-node CNI plugin to the latest version resolved it -
https://docs.aws.amazon.com/eks/latest/userguide/managing-vpc-cni.html
Following steps resets kubernetes cluster and helped me to solve my problem.
Stop all running pods
Delete all worker nodes from cluster
Perform kubeadm reset on master and nodes
Initiate the master node
kubeadm init --apiserver-advertise-address
install Pod network “WeaveNet”
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')&env.IPALLOC_RANGE=192.168.0.0/16"
Join nodes to the cluster
Restart all nodes
#-------------------------------------
#Reset the kubernetes environment
#-----------------------------------
#[root#centos8-Master: ~]# k get nodes
#NAME STATUS ROLES AGE VERSION
#centos8-master Ready control-plane 14m v1.24.1
#centos8-slave Ready <none> 11m v1.24.3
#
#Master Node
#1. Delete the nodes
#First delete all pods, deployments, svc
#kubectl delete --all pods
#kubectl delete --all deployments
#kubectl delete --all svc
#kubectl drain centos8-slave --ignore-daemonsets --delete-emptydir-data --force
#kubectl delete node centos8-slave
#
#Worker Node
#2. Go to worker node, stop all the kubelet services.
#[root#centos8-Slave rprasads]# kubectl version --short
#Client Version: v1.24.3
#Kustomize Version: v4.5.4
#[root#centos8-Slave rprasads]# systemctl stop kubelet
#[root#centos8-Slave rprasads]# netstat -tulnp |grep kube
#kill -9 <pid> [kube-proxy]
#
#Master Node
#2. Reset the kubeadm.
#$ sudo kubeadm reset
#$ sudo swapoff -a
#
#Master Node
#3. Get you kubeadm version
#[root#centos8-Master: ~]# kubectl version --short
#Flag --short has been deprecated, and will be removed in the future. The --short output will become the default.
#Client Version: v1.24.1
#Kustomize Version: v4.5.4
#Server Version: v1.24.3
#
#Master Node
#4.On Master Initialize the kubeadm with proper network address and version
#$ kubeadm init --apiserver-advertise-address=192.168.56.101 --pod-network-cidr=192.168.0.0/16
##Download calico yaml file from the site: Refer the documentation https://projectcalico.docs.tigera.io/getting-started/kubernetes/self-managed-onprem/onpremises#install-calico-with-kubernetes-api-datastore-more-than-50-nodes
#
#$ curl https://projectcalico.docs.tigera.io/manifests/calico.yaml -O
#$ kubectl apply -f calico.yaml
#
#Worker Node
#5. Go to worker node and add the node with the command displayed.
# kubeadm join 192.168.56.101:6443 --token h0nuxq.zk9m731nc4ia93pq --discovery-token-ca-cert-hash sha256:1682644baf3433caeb0e6f9099ed487ef48b94ab6a0314df88e3ff42ae501a13
#
#Master Node
#6.On the master node run below commands.
#$ sudo rm -rf $HOME/.kube
#
#$ mkdir -p $HOME/.kube
#$ sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
#$ sudo chown $(id -u):$(id -g) $HOME/.kube/config
#
#$ sudo systemctl enable docker.service
#$ sudo service kubelet restart
#
#$ kubectl get nodes
#
#
#------------------------------------------------
#Test your new kubernetes cluster environment.
#-----------------------------------------------
#[root#centos8-Master: ~]# kubectl run nginx --image=nginx
#Wait for some time.
#
#[root#centos8-Master: ~]# k describe pods nginx
#Normal Scheduled 21s default-scheduler Successfully assigned default/nginx to centos8-slave
#
#[root#centos8-Master: ~]# k get pods
#NAME READY STATUS RESTARTS AGE
#nginx 1/1 Running 0 25s
#
#*************************************END*************************************