How to setup a kubernetes cluster in the purely IPV6 setting? - kubernetes

I have two PCs, A is with Windows 10 installed, B is with Ubuntu 20.04 installed.
In general, A and B are in the different local networks, for example, A is in 10.0.7.X, B is in 10.1.7.X, and they can not ping each other successfully, but they can visit each other by IPV6 address.
Now I want to setup a cluster in these two PCs, A is the master node and the B is the slave node.
In A, I have installed the Hyper-V and installed a Ubuntu 20.04 in the Hyper-V virtual machine and I can setup an IPV4 based Kubernetes cluster when putting them in the same local network (e.g., 10.0.7.X).
Now I'm trying to setup the cluster in purely IPV6 setting. I've tried the following command:
kubeadm init \
--apiserver-advertise-address=2400:dd01:1032:7:4070:XXXX:XXXX:XXXX\
--pod-network-cidr 10.244.0.0/16 \
--image-repository registry.cn-hangzhou.aliyuncs.com/google_containers \
--cri-socket /var/run/dockershim.sock \
--feature-gates IPv6DualStack=true
and add the network manager based on
curl https://docs.projectcalico.org/manifests/canal.yaml -O
kubectl apply -f canal.yaml
and the outputs of kubectl get pods -A are:
root#master:~# kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-kube-controllers-8f59968d4-jsw55 0/1 ContainerCreating 0 65s
kube-system canal-gwgjw 0/2 CrashLoopBackOff 2 66s
kube-system coredns-6c76c8bb89-6xjc2 0/1 ContainerCreating 0 98s
kube-system coredns-6c76c8bb89-ww6cg 0/1 ContainerCreating 0 98s
kube-system etcd-master 1/1 Running 0 104s
kube-system kube-apiserver-master 1/1 Running 0 104s
kube-system kube-controller-manager-master 1/1 Running 0 104s
kube-system kube-proxy-nd9bt 1/1 Running 0 98s
kube-system kube-scheduler-master 1/1 Running 0 104s
and the outputs of systemctl status kubelet is:
ubuntu#master:~$ systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Mon 2020-10-19 18:24:59 CST; 1min 24s ago
Docs: https://kubernetes.io/docs/home/
Main PID: 98042 (kubelet)
Tasks: 24 (limit: 4657)
Memory: 43.6M
CGroup: /system.slice/kubelet.service
├─ 98042 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --network-plugin=cni --pod-infra-container-image=registry.cn-hangzhou.aliyuncs.com/google_co>
└─100328 /opt/cni/bin/calico
10月 19 18:26:17 master kubelet[98042]: I1019 18:26:17.206835 98042 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: 8b35778f95a8f3fed7e15a1d29456f6067bba1e827ca584f1474e073a085b41e
10月 19 18:26:17 master kubelet[98042]: I1019 18:26:17.213580 98042 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: a8b573738e9b72c47c55f49b6946812206251770a4d8490002a04e80991680ef
10月 19 18:26:17 master kubelet[98042]: E1019 18:26:17.225830 98042 pod_workers.go:191] Error syncing pod fcdd3939-7388-4308-bad6-bd419adcd039 ("canal-gwgjw_kube-system(fcdd3939-7388-4308-bad6-bd419adcd039)"), skipping: failed to "StartContainer" for "kube-flannel" with >
10月 19 18:26:22 master kubelet[98042]: W1019 18:26:22.801044 98042 cni.go:333] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "ea68e0f25fbd137253c572e36bcb9cc9a29ebb4850571411a3ed55a5858f4ebc"
10月 19 18:26:23 master kubelet[98042]: W1019 18:26:23.828962 98042 cni.go:333] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "8f789a2f1bd4d9d6a6f590b6cbb218b2dda0589ce30242efbf8391e596f4050b"
10月 19 18:26:23 master kubelet[98042]: E1019 18:26:23.922994 98042 cni.go:387] Error deleting kube-system_calico-kube-controllers-8f59968d4-jsw55/ea68e0f25fbd137253c572e36bcb9cc9a29ebb4850571411a3ed55a5858f4ebc from network calico/k8s-pod-network: error getting ClusterI>
10月 19 18:26:23 master kubelet[98042]: E1019 18:26:23.925412 98042 remote_runtime.go:140] StopPodSandbox "ea68e0f25fbd137253c572e36bcb9cc9a29ebb4850571411a3ed55a5858f4ebc" from runtime service failed: rpc error: code = Unknown desc = networkPlugin cni failed to teardown>
10月 19 18:26:23 master kubelet[98042]: E1019 18:26:23.925499 98042 kuberuntime_manager.go:898] Failed to stop sandbox {"docker" "ea68e0f25fbd137253c572e36bcb9cc9a29ebb4850571411a3ed55a5858f4ebc"}
10月 19 18:26:23 master kubelet[98042]: E1019 18:26:23.925575 98042 kuberuntime_manager.go:677] killPodWithSyncResult failed: failed to "KillPodSandbox" for "bab09661-da02-414f-8faa-ccb02c8267dd" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin c>
10月 19 18:26:23 master kubelet[98042]: E1019 18:26:23.925617 98042 pod_workers.go:191] Error syncing pod bab09661-da02-414f-8faa-ccb02c8267dd ("calico-kube-controllers-8f59968d4-jsw55_kube-system(bab09661-da02-414f-8faa-ccb02c8267dd)"), skipping: failed to "KillPodSandb>
ubuntu#master:~$
Does anybody know how to setup a kubernetes cluster in purely ipv6 setting? Thanks.

Related

Kubernetes cluster on bare metal by kubeadm

I'm trying to create a single control-plane cluster with kubeadm on 3 bare metal nodes (1 master and 2 workers) running on Debian 10 with Docker as a container runtime. Each node has an external IP and internal IP.
I want to configure a cluster on the internal network and be accessible from the Internet.
Used this command for that (please correct me if something wrong):
kubeadm init --control-plane-endpoint=10.10.0.1 --apiserver-cert-extra-sans={public_DNS_name},10.10.0.1 --pod-network-cidr=192.168.0.0/16
I got:
kubectl get no -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
dev-k8s-master-0.public.dns Ready master 16h v1.18.2 10.10.0.1 <none> Debian GNU/Linux 10 (buster) 4.19.0-8-amd64 docker://19.3.8
Init phase complete successfully and the cluster is accessible from the Internet. All pods are up and running except coredns that should be running after networking will be applied.
kubectl apply -f https://docs.projectcalico.org/v3.11/manifests/calico.yaml
After networking applied, coredns pods still not ready:
kubectl get po -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-kube-controllers-75d56dfc47-g8g9g 0/1 CrashLoopBackOff 192 16h
kube-system calico-node-22gtx 1/1 Running 0 16h
kube-system coredns-66bff467f8-87vd8 0/1 Running 0 16h
kube-system coredns-66bff467f8-mv8d9 0/1 Running 0 16h
kube-system etcd-dev-k8s-master-0 1/1 Running 0 16h
kube-system kube-apiserver-dev-k8s-master-0 1/1 Running 0 16h
kube-system kube-controller-manager-dev-k8s-master-0 1/1 Running 0 16h
kube-system kube-proxy-lp6b8 1/1 Running 0 16h
kube-system kube-scheduler-dev-k8s-master-0 1/1 Running 0 16h
Some logs from failed pods:
kubectl -n kube-system logs calico-kube-controllers-75d56dfc47-g8g9g
2020-04-22 08:24:55.853 [INFO][1] main.go 88: Loaded configuration from environment config=&config.Config{LogLevel:"info", ReconcilerPeriod:"5m", CompactionPeriod:"10m", EnabledControllers:"node", WorkloadEndpointWorkers:1, ProfileWorkers:1, PolicyWorkers:1, NodeWorkers:1, Kubeconfig:"", HealthEnabled:true, SyncNodeLabels:true, DatastoreType:"kubernetes"}
2020-04-22 08:24:55.855 [INFO][1] k8s.go 228: Using Calico IPAM
W0422 08:24:55.855525 1 client_config.go:541] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
2020-04-22 08:24:55.856 [INFO][1] main.go 109: Ensuring Calico datastore is initialized
2020-04-22 08:25:05.857 [ERROR][1] client.go 255: Error getting cluster information config ClusterInformation="default" error=Get https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-04-22 08:25:05.857 [FATAL][1] main.go 114: Failed to initialize Calico datastore error=Get https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
coredns:
[INFO] plugin/ready: Still waiting on: "kubernetes"
I0422 08:29:12.275344 1 trace.go:116] Trace[1050055850]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go#v0.17.2/tools/cache/reflector.go:105 (started: 2020-04-22 08:28:42.274382393 +0000 UTC m=+59491.429700922) (total time: 30.000897581s):
Trace[1050055850]: [30.000897581s] [30.000897581s] END
E0422 08:29:12.275388 1 reflector.go:153] pkg/mod/k8s.io/client-go#v0.17.2/tools/cache/reflector.go:105: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
I0422 08:29:12.276163 1 trace.go:116] Trace[188478428]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go#v0.17.2/tools/cache/reflector.go:105 (started: 2020-04-22 08:28:42.275499997 +0000 UTC m=+59491.430818380) (total time: 30.000606394s):
Trace[188478428]: [30.000606394s] [30.000606394s] END
E0422 08:29:12.276198 1 reflector.go:153] pkg/mod/k8s.io/client-go#v0.17.2/tools/cache/reflector.go:105: Failed to list *v1.Namespace: Get https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
I0422 08:29:12.277424 1 trace.go:116] Trace[16697023]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go#v0.17.2/tools/cache/reflector.go:105 (started: 2020-04-22 08:28:42.276675998 +0000 UTC m=+59491.431994406) (total time: 30.000689778s):
Trace[16697023]: [30.000689778s] [30.000689778s] END
E0422 08:29:12.277452 1 reflector.go:153] pkg/mod/k8s.io/client-go#v0.17.2/tools/cache/reflector.go:105: Failed to list *v1.Endpoints: Get https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/ready: Still waiting on: "kubernetes"
Any thoughts what's wrong?
This answer is to call attention to #florin suggestion:
I've seen a similar behavior when I had multiple public interfaces on the node and calico selected the wrong one.
What I did is to set IP_AUTODETECT_METHOD in the calico config.
From Calico Configuration on IP_AUTO_DETECT_METHOD:
The method to use to autodetect the IPv4 address for this host. This is only used when the IPv4 address is being autodetected. See IP Autodetection methods for details of the valid methods.
Learn more Here: https://docs.projectcalico.org/reference/node/configuration#ip-autodetection-methods
I am also facing same problem, but following is work for me, try this in you master node.
$ sudo iptables -P INPUT ACCEPT
$ sudo iptables -P FORWARD ACCEPT
$ sudo iptables -P FORWARD ACCEPT
$ sudo iptables -F

Kubelet Master stays in KubeletNotReady because of cni missing

Kubelet has been initialized with pod network for Calico :
sudo kubeadm init --pod-network-cidr=192.168.0.0/16 --image-repository=someserver
Then i get calico.yaml v3.11 and applied it :
sudo kubectl --kubeconfig="/etc/kubernetes/admin.conf" apply -f calico.yaml
Right after i check on the pod status :
sudo kubectl --kubeconfig="/etc/kubernetes/admin.conf" get nodes
NAME STATUS ROLES AGE VERSION
master-1 NotReady master 7m21s v1.17.2
on describe i've got cni config unitialized, but i thought that calico should have done that ?
MemoryPressure False Fri, 21 Feb 2020 10:14:24 +0100 Fri, 21 Feb 2020 10:09:00 +0100 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 21 Feb 2020 10:14:24 +0100 Fri, 21 Feb 2020 10:09:00 +0100 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 21 Feb 2020 10:14:24 +0100 Fri, 21 Feb 2020 10:09:00 +0100 KubeletHasSufficientPID kubelet has sufficient PID available
Ready False Fri, 21 Feb 2020 10:14:24 +0100 Fri, 21 Feb 2020 10:09:00 +0100 KubeletNotReady runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
In fact i have nothing under /etc/cni/net.d/ so it seems it forgot something ?
ll /etc/cni/net.d/
total 0
sudo kubectl --kubeconfig="/etc/kubernetes/admin.conf" -n kube-system get pods
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-5644fb7cf6-f7lqq 0/1 Pending 0 3h
calico-node-f4xzh 0/1 Init:ImagePullBackOff 0 3h
coredns-7fb8cdf968-bbqbz 0/1 Pending 0 3h24m
coredns-7fb8cdf968-vdnzx 0/1 Pending 0 3h24m
etcd-master-1 1/1 Running 0 3h24m
kube-apiserver-master-1 1/1 Running 0 3h24m
kube-controller-manager-master-1 1/1 Running 0 3h24m
kube-proxy-9m879 1/1 Running 0 3h24m
kube-scheduler-master-1 1/1 Running 0 3h24m
As explained i'm running through a local repo and journalctl says :
kubelet[21935]: E0225 14:30:54.830683 21935 pod_workers.go:191] Error syncing pod cec2f72b-844a-4d6b-8606-3aff06d4a36d ("calico-node-f4xzh_kube-system(cec2f72b-844a-4d6b-8606-3aff06d4a36d)"), skipping: failed to "StartContainer" for "upgrade-ipam" with ErrImagePull: "rpc error: code = Unknown desc = Error response from daemon: Get https://repo:10000/v2/calico/cni/manifests/v3.11.2: no basic auth credentials"
kubelet[21935]: E0225 14:30:56.008989 21935 kubelet.go:2183] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Feels like it's not only CNI the issue
Core DNS pod will be pending and master will be NotReady till calico pods are successfully running and CNI is setup properly.
It seems to be network issue to download calico docker images from docker.io. So you can pull calico images from docker.io and and push it to your internal container registry and then modify the calico yaml to refer that registry in images section of calico.yaml and finally apply the modified calico yaml to the kubernetes cluster.
So the issue with Init:ImagePullBackOff was that it cannot apply image from my private repo automatically. I had to pull all images for calico from docker. Then i deleted the calico pod it's recreate itself with the newly pushed image
sudo docker pull private-repo/calico/pod2daemon-flexvol:v3.11.2
sudo docker pull private-repo/calico/node:v3.11.2
sudo docker pull private-repo/calico/cni:v3.11.2
sudo docker pull private-repo/calico/kube-controllers:v3.11.2
sudo kubectl -n kube-system delete po/calico-node-y7g5
After that the node re-do all the init phase and :
sudo kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-5644fb7cf6-qkf47 1/1 Running 0 11s
calico-node-mkcsr 1/1 Running 0 21m
coredns-7fb8cdf968-bgqvj 1/1 Running 0 37m
coredns-7fb8cdf968-v85jx 1/1 Running 0 37m
etcd-lin-1k8w1dv-vmh 1/1 Running 0 38m
kube-apiserver-lin-1k8w1dv-vmh 1/1 Running 0 38m
kube-controller-manager-lin-1k8w1dv-vmh 1/1 Running 0 38m
kube-proxy-9hkns 1/1 Running 0 37m
kube-scheduler-lin-1k8w1dv-vmh 1/1 Running 0 38m

failed to update node lease, error: Operation cannot be fulfilled on leases.coordination.k8s.io "yang"

I use two virtual machine based on VMware worstation pro 15.5 ,to learn and practice k8s.
OS :Ubuntu 18.04.3
docker :18.09.7
kubectl kubeadm kubelet v1.17.3
flannel:v0.11.0-amd64
After I execute kubeadm init --kubernetes-version=v1.17.3 --apisever-advertise-address 192.168.0.100 --pod-network-cidr=10.244.0.0/16 on the master node,everything is ok, kubectl get nodes show master node is READY.
But after I use the kubeadm join on the slave node, the master node's kube-system pods reduce, only exist coredns kube-flannel kube-proxy
systemctl status kubelet
show
failed to update node lease, error: Operation cannot be fulfilled on leases.coordination.k8s.io "yang";the object has been modified, please apply your changes to the latest version and try again
trying to delete pod kube-........
Also kubectl get nodes show only have master node .
Here is the scripts
The first is to set up docker kubeadm kubelet kubectl
#!/bin/bash
apt-get -y autoremove docker docker-engine docker.io docker-ce
apt-get update -y
apt-get install -y \
apt-transport-https \
ca-certificates \
curl \
software-properties-common
apt-get vim net-tools
curl -fsSL http://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | sudo apt-key add -
add-apt-repository "deb [arch=amd64] http://mirrors.aliyun.com/docker-ce/linux/ubuntu $(lsb_release -cs) stable"
apt-get update -y
# docker源加速
mkdir -p /etc/docker
echo '{"registry-mirrors":["https://vds6zmad.mirror.aliyuncs.com"]}' > /etc/docker/daemon.json
# 安装docker
apt-get install docker.io -y
# 启动和自启动
systemctl start docker
systemctl enable docker
apt-get update && apt-get install -y apt-transport-https curl
curl -s https://mirrors.aliyun.com/kubernetes/apt/doc/apt-key.gpg | sudo apt-key add -
#配置kubernetes阿里源
tee /etc/apt/sources.list.d/kubernetes.list <<-'EOF'
deb https://mirrors.aliyun.com/kubernetes/apt kubernetes-xenial main
EOF
apt-get update
apt-get install -y kubelet kubeadm kubectl
# 设置停止自动更新
apt-mark hold kubelet kubeadm kubectl
# kubelet开机自启动
systemctl enable kubelet && systemctl start kubelet
this is used to pull the images and tag
#! /usr/bin/python3
import os
images=[
"kube-apiserver:v1.17.3",
"kube-controller-manager:v1.17.3",
"kube-scheduler:v1.17.3",
"kube-proxy:v1.17.3",
"pause:3.1",
"etcd:3.4.3-0",
"coredns:1.6.5",
]
for i in images:
pullCMD = "docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/{}".format(i)
print("run cmd '{}', please wait ...".format(pullCMD))
os.system(pullCMD)
tagCMD = "docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/{} k8s.gcr.io/{}".format(i, i)
print("run cmd '{}', please wait ...".format(tagCMD ))
os.system(tagCMD)
rmiCMD = "docker rmi registry.cn-hangzhou.aliyuncs.com/google_containers/{}".format(i)
print("run cmd '{}', please wait ...".format(rmiCMD ))
os.system(rmiCMD)
When I only start the master, the commandkubectl get pods --all-namespacesshow these
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-6955765f44-gxq7p 1/1 Running 6 43h
kube-system coredns-6955765f44-xmcbq 1/1 Running 6 43h
kube-system etcd-yang 1/1 Running 61 14h
kube-system kube-apiserver-yang 1/1 Running 48 14h
kube-system kube-controller-manager-yang 1/1 Running 6 14h
kube-system kube-flannel-ds-amd64-v58g6 1/1 Running 5 43h
kube-system kube-proxy-2vcwg 1/1 Running 5 43h
kube-system kube-scheduler-yang 1/1 Running 6 14h
the command systemctl status kubeletshow these
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Fri 2020-02-28 13:19:08 CST; 9min ago
Docs: https://kubernetes.io/docs/home/
Main PID: 686 (kubelet)
Tasks: 0 (limit: 4634)
CGroup: /system.slice/kubelet.service
└─686 /usr/bin/kubelet --cgroup-driver=systemd --runtime-cgroups=/systemd/system.slice --kubelet-cgroups=/systemd/system.slice --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/v
2月 28 13:19:27 yang kubelet[686]: W0228 13:19:27.709246 686 pod_container_deletor.go:75] Container "900ab1df52d8bbc9b3b0fc035df30ae242d2c8943486dc21183a6ccc5bd22c9b" not found in pod's containers
2月 28 13:19:27 yang kubelet[686]: W0228 13:19:27.710478 686 cni.go:331] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "900ab1df52d8bbc9b3b0fc035df30ae242d2c8943486dc21183a6ccc5bd22c9b"
2月 28 13:19:28 yang kubelet[686]: E0228 13:19:28.512094 686 cni.go:364] Error adding kube-system_coredns-6955765f44-xmcbq/1f179bccaa042b92bd0f9ed97c0bf6f129bc986c574ca32c5435827eecee4f29 to network flannel/cbr0: open /run/flannel/subnet.env: no suc
2月 28 13:19:29 yang kubelet[686]: E0228 13:19:29.002294 686 remote_runtime.go:105] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to set up sandbox container "1f179bccaa042b92bd0f9ed97c0bf6f129bc986c574ca32c54358
2月 28 13:19:29 yang kubelet[686]: E0228 13:19:29.002345 686 kuberuntime_sandbox.go:68] CreatePodSandbox for pod "coredns-6955765f44-xmcbq_kube-system(7e9ca770-f27c-4143-a025-cd6316a1a7e4)" failed: rpc error: code = Unknown desc = failed to set up s
2月 28 13:19:29 yang kubelet[686]: E0228 13:19:29.002357 686 kuberuntime_manager.go:729] createPodSandbox for pod "coredns-6955765f44-xmcbq_kube-system(7e9ca770-f27c-4143-a025-cd6316a1a7e4)" failed: rpc error: code = Unknown desc = failed to set up
2月 28 13:19:29 yang kubelet[686]: E0228 13:19:29.002404 686 pod_workers.go:191] Error syncing pod 7e9ca770-f27c-4143-a025-cd6316a1a7e4 ("coredns-6955765f44-xmcbq_kube-system(7e9ca770-f27c-4143-a025-cd6316a1a7e4)"), skipping: failed to "CreatePodSan
2月 28 13:19:29 yang kubelet[686]: W0228 13:19:29.802166 686 docker_sandbox.go:394] failed to read pod IP from plugin/docker: networkPlugin cni failed on the status hook for pod "coredns-6955765f44-xmcbq_kube-system": CNI failed to retrieve network
2月 28 13:19:29 yang kubelet[686]: W0228 13:19:29.810632 686 pod_container_deletor.go:75] Container "1f179bccaa042b92bd0f9ed97c0bf6f129bc986c574ca32c5435827eecee4f29" not found in pod's containers
2月 28 13:19:29 yang kubelet[686]: W0228 13:19:29.823318 686 cni.go:331] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "1f179bccaa042b92bd0f9ed97c0bf6f129bc986c574ca32c5435827eecee4f29"
But when I start the slave node ,the pods reduce
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-6955765f44-gxq7p 1/1 Running 3 43h
kube-system coredns-6955765f44-xmcbq 1/1 Running 3 43h
kube-system kube-flannel-ds-amd64-v58g6 1/1 Running 3 43h
kube-system kube-proxy-2vcwg 1/1 Running 3 43h
Today I successfully set up a k8s cluster, using the same scripts. Before do it, make sure the two virtual machine can ssh each other, and the master node have set a static IP.I don't know the precise reason.
I met the same problem. Here is my way to solve it:
Check the name of master and worker node, if they are the same, modify /etc/hostname and /etc/hosts
kubeadm reset
kubeadm init --node-name //specify the node name to avoid duplicate name...
I set up k8s on two VMs, and the second VM is cloned from the other... So they have the same hostname, and conflict occurred when executing kubeadm join...

core_dns stuck in ContainerCreating status

I am trying to setup a basic k8s cluster
After doing a kubeadm init --pod-network-cidr=10.244.0.0/16, the coredns pods are stuck in ContainerCreating status
NAME READY STATUS RESTARTS AGE
coredns-6955765f44-2cnhj 0/1 ContainerCreating 0 43h
coredns-6955765f44-dnphb 0/1 ContainerCreating 0 43h
etcd-perf1 1/1 Running 0 43h
kube-apiserver-perf1 1/1 Running 0 43h
kube-controller-manager-perf1 1/1 Running 0 43h
kube-flannel-ds-amd64-smpbk 1/1 Running 0 43h
kube-proxy-6zgvn 1/1 Running 0 43h
kube-scheduler-perf1 1/1 Running 0 43h
OS-IMAGE: Ubuntu 16.04.6 LTS
KERNEL-VERSION: 4.4.0-142-generic
CONTAINER-RUNTIME: docker://19.3.5
Errors from journalctl -xeu kubelet command
Jan 02 10:31:44 perf1 kubelet[11901]: 2020-01-02 10:31:44.112 [INFO][10207] k8s.go 228: Using Calico IPAM
Jan 02 10:31:44 perf1 kubelet[11901]: E0102 10:31:44.118281 11901 cni.go:385] Error deleting kube-system_coredns-6955765f44-2cnhj/12cd9435dc905c026bbdb4a1954fc36c82ede1d703b040a3052ab3370445abbf from
Jan 02 10:31:44 perf1 kubelet[11901]: E0102 10:31:44.118828 11901 remote_runtime.go:128] StopPodSandbox "12cd9435dc905c026bbdb4a1954fc36c82ede1d703b040a3052ab3370445abbf" from runtime service failed:
Jan 02 10:31:44 perf1 kubelet[11901]: E0102 10:31:44.118872 11901 kuberuntime_manager.go:898] Failed to stop sandbox {"docker" "12cd9435dc905c026bbdb4a1954fc36c82ede1d703b040a3052ab3370445abbf"}
Jan 02 10:31:44 perf1 kubelet[11901]: E0102 10:31:44.118917 11901 kuberuntime_manager.go:676] killPodWithSyncResult failed: failed to "KillPodSandbox" for "e44bc42f-0b8d-40ad-82a9-334a1b1c8e40" with
Jan 02 10:31:44 perf1 kubelet[11901]: E0102 10:31:44.118939 11901 pod_workers.go:191] Error syncing pod e44bc42f-0b8d-40ad-82a9-334a1b1c8e40 ("coredns-6955765f44-2cnhj_kube-system(e44bc42f-0b8d-40ad-
Jan 02 10:31:47 perf1 kubelet[11901]: W0102 10:31:47.081709 11901 cni.go:331] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "747c3cc9455a7d
Jan 02 10:31:47 perf1 kubelet[11901]: 2020-01-02 10:31:47.113 [INFO][10267] k8s.go 228: Using Calico IPAM
Jan 02 10:31:47 perf1 kubelet[11901]: E0102 10:31:47.118526 11901 cni.go:385] Error deleting kube-system_coredns-6955765f44-dnphb/747c3cc9455a7db202ab14576d15509d8ef6967c6349e9acbeff2207914d3d53 from
Jan 02 10:31:47 perf1 kubelet[11901]: E0102 10:31:47.119017 11901 remote_runtime.go:128] StopPodSandbox "747c3cc9455a7db202ab14576d15509d8ef6967c6349e9acbeff2207914d3d53" from runtime service failed:
Jan 02 10:31:47 perf1 kubelet[11901]: E0102 10:31:47.119052 11901 kuberuntime_manager.go:898] Failed to stop sandbox {"docker" "747c3cc9455a7db202ab14576d15509d8ef6967c6349e9acbeff2207914d3d53"}
Jan 02 10:31:47 perf1 kubelet[11901]: E0102 10:31:47.119098 11901 kuberuntime_manager.go:676] killPodWithSyncResult failed: failed to "KillPodSandbox" for "52ffb25e-06c7-4cc6-be70-540049a6be20" with
Jan 02 10:31:47 perf1 kubelet[11901]: E0102 10:31:47.119119 11901 pod_workers.go:191] Error syncing pod 52ffb25e-06c7-4cc6-be70-540049a6be20 ("coredns-6955765f44-dnphb_kube-system(52ffb25e-06c7-4cc6-
I have tried kubdeadm reset as well but no luck so far
Looks like the issue was because I tried switching from calico to flannel cni. Following the steps mentioned here has resolved the issue for me
Pods failed to start after switch cni plugin from flannel to calico and then flannel
Additionally you may have to clear the contents of /etc/cni/net.d
CoreDNS will not start up before a CNI network is installed.
For flannel to work correctly, you must pass --pod-network-cidr=10.244.0.0/16 to kubeadm init.
Set /proc/sys/net/bridge/bridge-nf-call-iptables to 1 by running sysctl net.bridge.bridge-nf-call-iptables=1 to pass bridged IPv4 traffic to iptables’ chains. This is a requirement for some CNI plugins to work.
Make sure that your firewall rules allow UDP ports 8285 and 8472 traffic for all hosts participating in the overlay network. see here .
Note that flannel works on amd64, arm, arm64, ppc64le and s390x under Linux. Windows (amd64) is claimed as supported in v0.11.0 but the usage is undocumented
To deploy flannel as CNI network
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/2140ac876ef134e0ed5af15c65e414cf26827915/Documentation/kube-flannel.yml
After you have deployed flannel delete the core dns pods, Kubernetes will recreate the pods.
You have deployed flannel as CNI but the logs from kubelet shows that kubernetes is using calico.
[INFO][10207] k8s.go 228: Using Calico IPAM
Something wrong with container network. without that coredns doesnt succeed.
You might have to reinstall with correct CNI. Once CNI is deployed successfully, coreDNS gets deployed automatically
So here is my solution:
First, coreDNS will run on your [Master / Control-Plane] Nodes
Now let's run ifconfig to check for these 2 interfaces cni0 and flannel.1
Suppose cni0=10.244.1.1 & flannel.1=10.244.0.0 then your DNS will not be created
It should be cni0=10.244.0.1 & flannel.1=10.244.0.0. Which mean cni0 must follow flannel.1/24 patterns
Run the following 2 command to Down Interface and Remove it from your Master/Control-Plane Machines
sudo ifconfig cni0 down;
sudo ip link delete cni0;
Now check via ifconfig you will see 2 more vethxxxxxxxx Interface appears. This should fixed your problem.

Failed create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod network

Issue Redis POD creation on k8s(v1.10) cluster and POD creation stuck at "ContainerCreating"
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 30m default-scheduler Successfully assigned redis to k8snode02
Normal SuccessfulMountVolume 30m kubelet, k8snode02 MountVolume.SetUp succeeded for volume "default-token-f8tcg"
Warning FailedCreatePodSandBox 5m (x1202 over 30m) kubelet, k8snode02 Failed create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "redis_default" network: failed to find plugin "loopback" in path [/opt/loopback/bin /opt/cni/bin]
Normal SandboxChanged 47s (x1459 over 30m) kubelet, k8snode02 Pod sandbox changed, it will be killed and re-created.
When I used calico as CNI and I faced a similar issue.
The container remained in creating state, I checked for /etc/cni/net.d and /opt/cni/bin on master both are present but not sure if this is required on worker node as well.
root#KubernetesMaster:/opt/cni/bin# kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-5c7588df-5zds6 0/1 ContainerCreating 0 21m
root#KubernetesMaster:/opt/cni/bin# kubectl get nodes
NAME STATUS ROLES AGE VERSION
kubernetesmaster Ready master 26m v1.13.4
kubernetesslave1 Ready <none> 22m v1.13.4
root#KubernetesMaster:/opt/cni/bin#
kubectl describe pods
Name: nginx-5c7588df-5zds6
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: kubernetesslave1/10.0.3.80
Start Time: Sun, 17 Mar 2019 05:13:30 +0000
Labels: app=nginx
pod-template-hash=5c7588df
Annotations: <none>
Status: Pending
IP:
Controlled By: ReplicaSet/nginx-5c7588df
Containers:
nginx:
Container ID:
Image: nginx
Image ID:
Port: <none>
Host Port: <none>
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-qtfbs (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-qtfbs:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-qtfbs
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 18m default-scheduler Successfully assigned default/nginx-5c7588df-5zds6 to kubernetesslave1
Warning FailedCreatePodSandBox 18m kubelet, kubernetesslave1 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "123d527490944d80f44b1976b82dbae5dc56934aabf59cf89f151736d7ea8adc" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
Warning FailedCreatePodSandBox 18m kubelet, kubernetesslave1 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "8cc5e62ebaab7075782c2248e00d795191c45906cc9579464a00c09a2bc88b71" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
Warning FailedCreatePodSandBox 18m kubelet, kubernetesslave1 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "30ffdeace558b0935d1ed3c2e59480e2dd98e983b747dacae707d1baa222353f" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
Warning FailedCreatePodSandBox 18m kubelet, kubernetesslave1 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "630e85451b6ce2452839c4cfd1ecb9acce4120515702edf29421c123cf231213" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
Warning FailedCreatePodSandBox 18m kubelet, kubernetesslave1 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "820b919b7edcfc3081711bb78b79d33e5be3f7dafcbad29fe46b6d7aa22227aa" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
Warning FailedCreatePodSandBox 18m kubelet, kubernetesslave1 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "abbfb5d2756f12802072039dec20ba52f546ae755aaa642a9a75c86577be589f" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
Warning FailedCreatePodSandBox 18m kubelet, kubernetesslave1 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "dfeb46ffda4d0f8a434f3f3af04328fcc4b6c7cafaa62626e41b705b06d98cc4" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
Warning FailedCreatePodSandBox 18m kubelet, kubernetesslave1 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "9ae3f47bb0282a56e607779d3267127ee8b0ae1d7f416f5a184682119203b1c8" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
Warning FailedCreatePodSandBox 18m kubelet, kubernetesslave1 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "546d07f1864728b2e2675c066775f94d658e221ada5fb4ed6bf6689ec7b8de23" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
Normal SandboxChanged 18m (x12 over 18m) kubelet, kubernetesslave1 Pod sandbox changed, it will be killed and re-created.
Warning FailedCreatePodSandBox 3m39s (x829 over 18m) kubelet, kubernetesslave1 (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "f586be437843537a3082f37ad139c88d0eacfbe99ddf00621efd4dc049a268cc" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
root#KubernetesMaster:/etc/cni/net.d#
On worker node NGINX is trying to come up but getting exited, I am not sure what's going on here - I am newbie to kubernetes & not able to fix this issue -
root#kubernetesslave1:/home/ubuntu# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5ad5500e8270 fadcc5d2b066 "/usr/local/bin/kube…" 3 minutes ago Up 3 minutes k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
b1c9929ebe9e k8s.gcr.io/pause:3.1 "/pause" 3 minutes ago Up 3 minutes k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
ceb78340b563 k8s.gcr.io/pause:3.1 "/pause" 3 minutes ago Up 3 minutes k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
root#kubernetesslave1:/home/ubuntu# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5ad5500e8270 fadcc5d2b066 "/usr/local/bin/kube…" 3 minutes ago Up 3 minutes k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
b1c9929ebe9e k8s.gcr.io/pause:3.1 "/pause" 3 minutes ago Up 3 minutes k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
ceb78340b563 k8s.gcr.io/pause:3.1 "/pause" 3 minutes ago Up 3 minutes k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
root#kubernetesslave1:/home/ubuntu# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5ad5500e8270 fadcc5d2b066 "/usr/local/bin/kube…" 3 minutes ago Up 3 minutes k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
b1c9929ebe9e k8s.gcr.io/pause:3.1 "/pause" 3 minutes ago Up 3 minutes k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
ceb78340b563 k8s.gcr.io/pause:3.1 "/pause" 3 minutes ago Up 3 minutes k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
root#kubernetesslave1:/home/ubuntu# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
94b2994401d0 k8s.gcr.io/pause:3.1 "/pause" 1 second ago Up Less than a second k8s_POD_nginx-5c7588df-5zds6_default_677a722b-4873-11e9-a33a-06516e7d78c4_534
5ad5500e8270 fadcc5d2b066 "/usr/local/bin/kube…" 4 minutes ago Up 4 minutes k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
b1c9929ebe9e k8s.gcr.io/pause:3.1 "/pause" 4 minutes ago Up 4 minutes k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
ceb78340b563 k8s.gcr.io/pause:3.1 "/pause" 4 minutes ago Up 4 minutes k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
root#kubernetesslave1:/home/ubuntu# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5ad5500e8270 fadcc5d2b066 "/usr/local/bin/kube…" 4 minutes ago Up 4 minutes k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
b1c9929ebe9e k8s.gcr.io/pause:3.1 "/pause" 4 minutes ago Up 4 minutes k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
ceb78340b563 k8s.gcr.io/pause:3.1 "/pause" 4 minutes ago Up 4 minutes k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
root#kubernetesslave1:/home/ubuntu# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
f72500cae2b7 k8s.gcr.io/pause:3.1 "/pause" 1 second ago Up Less than a second k8s_POD_nginx-5c7588df-5zds6_default_677a722b-4873-11e9-a33a-06516e7d78c4_585
5ad5500e8270 fadcc5d2b066 "/usr/local/bin/kube…" 4 minutes ago Up 4 minutes k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
b1c9929ebe9e k8s.gcr.io/pause:3.1 "/pause" 4 minutes ago Up 4 minutes k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
ceb78340b563 k8s.gcr.io/pause:3.1 "/pause" 4 minutes ago Up 4 minutes k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
root#kubernetesslave1:/home/ubuntu# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5ad5500e8270 fadcc5d2b066 "/usr/local/bin/kube…" 5 minutes ago Up 5 minutes k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
b1c9929ebe9e k8s.gcr.io/pause:3.1 "/pause" 5 minutes ago Up 5 minutes k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
ceb78340b563 k8s.gcr.io/pause:3.1 "/pause" 5 minutes ago Up 5 minutes k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
I checked about /etc/cni/net.d & /opt/cni/bin on worker node as well, it is there -
root#kubernetesslave1:/home/ubuntu# cd /etc/cni
root#kubernetesslave1:/etc/cni# ls -ltr
total 4
drwxr-xr-x 2 root root 4096 Mar 17 05:19 net.d
root#kubernetesslave1:/etc/cni# cd /opt/cni
root#kubernetesslave1:/opt/cni# ls -ltr
total 4
drwxr-xr-x 2 root root 4096 Mar 17 05:19 bin
root#kubernetesslave1:/opt/cni# cd bin
root#kubernetesslave1:/opt/cni/bin# ls -ltr
total 107440
-rwxr-xr-x 1 root root 3890407 Aug 17 2017 bridge
-rwxr-xr-x 1 root root 3475802 Aug 17 2017 ipvlan
-rwxr-xr-x 1 root root 3520724 Aug 17 2017 macvlan
-rwxr-xr-x 1 root root 3877986 Aug 17 2017 ptp
-rwxr-xr-x 1 root root 3475750 Aug 17 2017 vlan
-rwxr-xr-x 1 root root 9921982 Aug 17 2017 dhcp
-rwxr-xr-x 1 root root 2605279 Aug 17 2017 sample
-rwxr-xr-x 1 root root 32351072 Mar 17 05:19 calico
-rwxr-xr-x 1 root root 31490656 Mar 17 05:19 calico-ipam
-rwxr-xr-x 1 root root 2856252 Mar 17 05:19 flannel
-rwxr-xr-x 1 root root 3084347 Mar 17 05:19 loopback
-rwxr-xr-x 1 root root 3036768 Mar 17 05:19 host-local
-rwxr-xr-x 1 root root 3550877 Mar 17 05:19 portmap
-rwxr-xr-x 1 root root 2850029 Mar 17 05:19 tuning
root#kubernetesslave1:/opt/cni/bin#
Ensure that /etc/cni/net.d and its /opt/cni/bin friend both exist and are correctly populated with the CNI configuration files and binaries on all Nodes. For flannel specifically, one might make use of the flannel cni repo
I had this issue with my GKE cluster on GCP with one of my preemptive node pools. Thanks to #mdaniel tip of checking the integrity of /etc/cni/net.d I could reproduce the issue again by ssh into the node of a testing cluster with the command gcloud compute ssh <name of some node> --zone <zone-of-cluster> --internal-ip. Then I simply edited the file /etc/cni/net.d/10-gke-ptp.conflist and messed with the values on the "routes": [ {"dst": "0.0.0.0/0"} ] (changed from 0.0.0.0/0 to 1.0.0.0/0).
After that, I deleted the pods that were running inside of it and they all got stuck with the ContainerCreating status forever generating kublet events with the error Failed create pod sandbox: rpc error: code...
Note that in order to test I've set up my nodepool to have maximum of 1 node. Otherwise it will scale up a new one and the pods will be recreated at the new node. In my production incident the nodepool reached maximum node count so setting my tests to max 1 node reproduced a similar situation.
Since that, deleting the node from GKE solved the issue in production, I created a Python script that lists all events on the cluster and filters the ones that have the keyword "Failed create pod sandbox: rpc error: code". Than I go over all events and get their pods, and then from the pods, I get the nodes. Finally I loop over the nodes deleting them both from Kubernetes API and from Compute API with it's respective Python clients. For the Python script I used the libs: kubernetes and google-cloud-compute.
This is a simpler version of the script. Test it before using it:
from kubernetes import client, config
from google.cloud.compute_v1.services.instances import InstancesClient
ERROR_KEYWORDS = [
'Failed to create pod sandbox'.lower()
]
config.load_kube_config()
v1 = client.CoreV1Api()
events_result = v1.list_event_for_all_namespaces()
filtered_events = []
# filter only the events containing ERROR_KEYWORDS
for event in events_result.items:
for error_keyword in ERROR_KEYWORDS:
if error_keyword in event.message.lower():
filtered_events.append(event)
# gets the list of pods from those events
pods_list = {}
for event in filtered_events:
try:
pod = v1.read_namespaced_pod(
event.involved_object.name,
namespace=event.involved_object.namespace
)
pod_dict = {
"name": event.involved_object.name,
"namespace": event.involved_object.namespace,
"node": pod.spec.node_name
}
pods_list[event.involved_object.name] = pod_dict
except Exception as e:
pass
# Get the nodes from those pods
broken_nodes = set()
for name, pod_dict in pods_list.items():
if pod_dict.get('node'):
broken_nodes.add(pod_dict["node"])
broken_nodes = list(broken_nodes)
# Deletes the nodes from both Kubernetes API and Compute Engine API
if broken_nodes:
broken_nodes_str = ", ".join(broken_nodes)
print(f'BROKEN NODES: "{broken_nodes_str}"')
for node in broken_nodes:
try:
api_response = v1.delete_node(node)
except Exception as e:
pass
time.sleep(30)
try:
result = gcp_client.delete(project=PROJECT_ID, zone=CLUSTER_ZONE, instance=node)
except Exception as e:
pass
AWS EKS doesn't yet support t3a, m5ad r5ad instances
kubectl drain node1 node2 --delete-local-data --force --ignore-daemonsets
Just when I was planning to expel pods on all nodes, I didn’t expect the pods with errors all the time to become Running. You can try to execute it, hope it will be useful to you
This problem appeared for me when I added a PVC on AWS EKS.
Updating the aws-node CNI plugin to the latest version resolved it -
https://docs.aws.amazon.com/eks/latest/userguide/managing-vpc-cni.html
Following steps resets kubernetes cluster and helped me to solve my problem.
Stop all running pods
Delete all worker nodes from cluster
Perform kubeadm reset on master and nodes
Initiate the master node
kubeadm init --apiserver-advertise-address
install Pod network “WeaveNet”
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')&env.IPALLOC_RANGE=192.168.0.0/16"
Join nodes to the cluster
Restart all nodes
#-------------------------------------
#Reset the kubernetes environment
#-----------------------------------
#[root#centos8-Master: ~]# k get nodes
#NAME STATUS ROLES AGE VERSION
#centos8-master Ready control-plane 14m v1.24.1
#centos8-slave Ready <none> 11m v1.24.3
#
#Master Node
#1. Delete the nodes
#First delete all pods, deployments, svc
#kubectl delete --all pods
#kubectl delete --all deployments
#kubectl delete --all svc
#kubectl drain centos8-slave --ignore-daemonsets --delete-emptydir-data --force
#kubectl delete node centos8-slave
#
#Worker Node
#2. Go to worker node, stop all the kubelet services.
#[root#centos8-Slave rprasads]# kubectl version --short
#Client Version: v1.24.3
#Kustomize Version: v4.5.4
#[root#centos8-Slave rprasads]# systemctl stop kubelet
#[root#centos8-Slave rprasads]# netstat -tulnp |grep kube
#kill -9 <pid> [kube-proxy]
#
#Master Node
#2. Reset the kubeadm.
#$ sudo kubeadm reset
#$ sudo swapoff -a
#
#Master Node
#3. Get you kubeadm version
#[root#centos8-Master: ~]# kubectl version --short
#Flag --short has been deprecated, and will be removed in the future. The --short output will become the default.
#Client Version: v1.24.1
#Kustomize Version: v4.5.4
#Server Version: v1.24.3
#
#Master Node
#4.On Master Initialize the kubeadm with proper network address and version
#$ kubeadm init --apiserver-advertise-address=192.168.56.101 --pod-network-cidr=192.168.0.0/16
##Download calico yaml file from the site: Refer the documentation https://projectcalico.docs.tigera.io/getting-started/kubernetes/self-managed-onprem/onpremises#install-calico-with-kubernetes-api-datastore-more-than-50-nodes
#
#$ curl https://projectcalico.docs.tigera.io/manifests/calico.yaml -O
#$ kubectl apply -f calico.yaml
#
#Worker Node
#5. Go to worker node and add the node with the command displayed.
# kubeadm join 192.168.56.101:6443 --token h0nuxq.zk9m731nc4ia93pq --discovery-token-ca-cert-hash sha256:1682644baf3433caeb0e6f9099ed487ef48b94ab6a0314df88e3ff42ae501a13
#
#Master Node
#6.On the master node run below commands.
#$ sudo rm -rf $HOME/.kube
#
#$ mkdir -p $HOME/.kube
#$ sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
#$ sudo chown $(id -u):$(id -g) $HOME/.kube/config
#
#$ sudo systemctl enable docker.service
#$ sudo service kubelet restart
#
#$ kubectl get nodes
#
#
#------------------------------------------------
#Test your new kubernetes cluster environment.
#-----------------------------------------------
#[root#centos8-Master: ~]# kubectl run nginx --image=nginx
#Wait for some time.
#
#[root#centos8-Master: ~]# k describe pods nginx
#Normal Scheduled 21s default-scheduler Successfully assigned default/nginx to centos8-slave
#
#[root#centos8-Master: ~]# k get pods
#NAME READY STATUS RESTARTS AGE
#nginx 1/1 Running 0 25s
#
#*************************************END*************************************