Kubelet process has high CPU usage over a long time - kubernetes

I have a Kubernetes cluster with the Weave CNI plugin, consisting of 3 nodes:
1 master node (virtual machine)
2 bare-metal worker nodes (4-core Xeon with hyperthreading - 8 logical cores)
The trouble is that top shows kubelet at 60-100% CPU usage on the first worker.
In journalctl -u kubelet I see a lot of messages (hundreds every minute):
May 19 09:57:38 kube-worker1 bash[3843]: E0519 09:57:38.075243 3843 docker_sandbox.go:205] Failed to stop sandbox "011cf10cf46dbc6bf2e11d1cb562af478eee21eba0c40521bf7af51ee5399640": Error response from daemon: {"message":"No such container: 011cf10cf46dbc6bf2e11d1cb562af478eee21eba0c40521bf7af51ee5399640"}
May 19 09:57:38 kube-worker1 bash[3843]: E0519 09:57:38.075360 3843 remote_runtime.go:109] StopPodSandbox "011cf10cf46dbc6bf2e11d1cb562af478eee21eba0c40521bf7af51ee5399640" from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "cron-task-2533948c46c1-p6kwb_namespace" network: CNI failed to retrieve network namespace path: Error: No such container: 011cf10cf46dbc6bf2e11d1cb562af478eee21eba0c40521bf7af51ee5399640
May 19 09:57:38 kube-worker1 bash[3843]: E0519 09:57:38.075380 3843 kuberuntime_gc.go:138] Failed to stop sandbox "011cf10cf46dbc6bf2e11d1cb562af478eee21eba0c40521bf7af51ee5399640" before removing: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "cron-task-2533948c46c1-p6kwb_namespace" network: CNI failed to retrieve network namespace path: Error: No such container: 011cf10cf46dbc6bf2e11d1cb562af478eee21eba0c40521bf7af51ee5399640
May 19 09:57:38 kube-worker1 bash[3843]: E0519 09:57:38.076549 3843 docker_sandbox.go:205] Failed to stop sandbox "0125de37634ef7f3aa852c999cfb5849750167b1e3d63293a085ceca416e4ebf": Error response from daemon: {"message":"No such container: 0125de37634ef7f3aa852c999cfb5849750167b1e3d63293a085ceca416e4ebf"}
May 19 09:57:38 kube-worker1 bash[3843]: E0519 09:57:38.076654 3843 remote_runtime.go:109] StopPodSandbox "0125de37634ef7f3aa852c999cfb5849750167b1e3d63293a085ceca416e4ebf" from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "cron-task-2533948c46c1-6g8jq_namespace" network: CNI failed to retrieve network namespace path: Error: No such container: 0125de37634ef7f3aa852c999cfb5849750167b1e3d63293a085ceca416e4ebf
May 19 09:57:38 kube-worker1 bash[3843]: E0519 09:57:38.076676 3843 kuberuntime_gc.go:138] Failed to stop sandbox "0125de37634ef7f3aa852c999cfb5849750167b1e3d63293a085ceca416e4ebf" before removing: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "cron-task-2533948c46c1-6g8jq_namespace" network: CNI failed to retrieve network namespace path: Error: No such container: 0125de37634ef7f3aa852c999cfb5849750167b1e3d63293a085ceca416e4ebf
May 19 09:57:38 kube-worker1 bash[3843]: E0519 09:57:38.079585 3843 docker_sandbox.go:205] Failed to stop sandbox "014135ede46ee45c176528da02782a38ded36bd10566f864c147ccb66a617772": Error response from daemon: {"message":"No such container: 014135ede46ee45c176528da02782a38ded36bd10566f864c147ccb66a617772"}
May 19 09:57:38 kube-worker1 bash[3843]: E0519 09:57:38.079805 3843 remote_runtime.go:109] StopPodSandbox "014135ede46ee45c176528da02782a38ded36bd10566f864c147ccb66a617772" from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "cron-task-2533948c46c1-r30cw_namespace" network: CNI failed to retrieve network namespace path: Error: No such container: 014135ede46ee45c176528da02782a38ded36bd10566f864c147ccb66a617772
This started happening after some broken Cronetes tasks failed during creation. I removed all the pods with --force, but kubelet still tries to remove them. I also restarted kubelet on that worker, with no result. How can I tell kubelet to forget them?
Version info
Kubernetes v1.6.1
Docker version 1.12.0, build 8eab29e
Linux kube-worker1 4.4.0-72-generic #93-Ubuntu SMP
Container manifest (without metadata)
job:
apiVersion: batch/v1
kind: Job
spec:
  template:
    spec:
      containers:
      - name: cron-task
        image: docker.company.ru/image:v2.3.2
        command: ["rake", "db:refresh_views"]
        env:
        - name: RAILS_ENV
          value: namespace
        - name: CONFIG_PATH
          value: /config
        volumeMounts:
        - name: config
          mountPath: /config
      volumes:
      - name: config
        configMap:
          name: task-conf
      restartPolicy: Never
Also, I didn't find any mention of this part of the pod's name (2533948c46c1) in the cluster's etcd.

Finally, I found the solution.
Kubelet stores information about all the pods running on it in
/var/lib/dockershim/sandbox
So when I ran ls in that folder, I found files for all the missing pods. I deleted those files, and the log messages disappeared and CPU usage returned to its normal value (even without a kubelet restart).
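For reference, a minimal cleanup sketch (the sandbox IDs below are taken from the log messages above; check that each file corresponds to a pod that no longer exists before deleting it):
ls /var/lib/dockershim/sandbox
# each file is named after a sandbox ID that kubelet keeps retrying, e.g.:
sudo rm /var/lib/dockershim/sandbox/011cf10cf46dbc6bf2e11d1cb562af478eee21eba0c40521bf7af51ee5399640
sudo rm /var/lib/dockershim/sandbox/0125de37634ef7f3aa852c999cfb5849750167b1e3d63293a085ceca416e4ebf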

This seems to be related to the "Pods with hostNetwork=true cannot be removed (and generate errors) when using CNI" issue in Kubernetes 1.6.x. Those messages are not critical in any way, but of course they're annoying when you're trying to find actual issues.
Try using the most recent version of Kubernetes to mitigate the issue.

I ran into the same problem as you, did some Go profiling of it, and found that the cause is kubelet's PLEG (Pod Lifecycle Event Generator) mechanism; removing the files under /var/lib/dockershim/sandbox did the magic.
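For anyone who wants to reproduce the profiling step, a rough sketch (this assumes kubelet's debugging handlers are enabled, which is the default, and that the API server can proxy to the node; the node name is just an example):
kubectl proxy &
# grab a 30s CPU profile from kubelet through the API server proxy,
# then use 'top' at the pprof prompt to see where the CPU time goes
go tool pprof http://localhost:8001/api/v1/nodes/kube-worker1/proxy/debug/pprof/profile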

Related

Kubernetes 1.17 containerd 1.2.0 with Calico CNI node not joining to master

I am setting up a Kubernetes cluster on CentOS 8 with containerd and Calico as the CNI. I set up the master node with the kubeadm command, and it's in Ready status.
When I join the worker node to the master, the node doesn't become Ready. I see the messages below in the log file.
Jan 14 20:17:29 node02 containerd[1417]: time="2020-01-14T20:17:29.416373526-05:00" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:calico-node-fbst8,Uid:9c7f6334-d106-48e1-af12-1bcdebc7c2c2,Namespace:kube-system,Attempt:0,} failed, error" error="failed to start sandbox container: failed to create containerd task: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:279: applying cgroup configuration for process caused \"Invalid unit name 'pod9c7f6334-d106-48e1-af12-1bcdebc7c2c2'\"": unknown"
Jan 14 20:17:29 node02 kubelet[30113]: E0114 20:17:29.416668 30113 remote_runtime.go:105] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to start sandbox container: failed to create containerd task: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:279: applying cgroup configuration for process caused \"Invalid unit name 'pod9c7f6334-d106-48e1-af12-1bcdebc7c2c2'\"": unknown
Jan 14 20:17:29 node02 kubelet[30113]: E0114 20:17:29.416742 30113 kuberuntime_sandbox.go:68] CreatePodSandbox for pod "calico-node-fbst8_kube-system(9c7f6334-d106-48e1-af12-1bcdebc7c2c2)" failed: rpc error: code = Unknown desc = failed to start sandbox container: failed to create containerd task: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:279: applying cgroup configuration for process caused \"Invalid unit name 'pod9c7f6334-d106-48e1-af12-1bcdebc7c2c2'\"": unknown
Jan 14 20:17:29 node02 kubelet[30113]: E0114 20:17:29.416761 30113 kuberuntime_manager.go:729] createPodSandbox for pod "calico-node-fbst8_kube-system(9c7f6334-d106-48e1-af12-1bcdebc7c2c2)" failed: rpc error: code = Unknown desc = failed to start sandbox container: failed to create containerd task: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:279: applying cgroup configuration for process caused \"Invalid unit name 'pod9c7f6334-d106-48e1-af12-1bcdebc7c2c2'\"": unknown
Jan 14 20:17:29 node02 kubelet[30113]: E0114 20:17:29.416819 30113 pod_workers.go:191] Error syncing pod 9c7f6334-d106-48e1-af12-1bcdebc7c2c2 ("calico-node-fbst8_kube-system(9c7f6334-d106-48e1-af12-1bcdebc7c2c2)"), skipping: failed to "CreatePodSandbox" for "calico-node-fbst8_kube-system(9c7f6334-d106-48e1-af12-1bcdebc7c2c2)" with CreatePodSandboxError: "CreatePodSandbox for pod \"calico-node-fbst8_kube-system(9c7f6334-d106-48e1-af12-1bcdebc7c2c2)\" failed: rpc error: code = Unknown desc = failed to start sandbox container: failed to create containerd task: OCI runtime create failed: container_linux.go:348: starting container process caused \"process_linux.go:279: applying cgroup configuration for process caused \\\"Invalid unit name 'pod9c7f6334-d106-48e1-af12-1bcdebc7c2c2'\\\"\": unknown"
Jan 14 20:17:30 node02 containerd[1417]: time="2020-01-14T20:17:30.541254039-05:00" level=error msg="Failed to load cni configuration" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
Jan 14 20:17:30 node02 kubelet[30113]: E0114 20:17:30.541394 30113 kubelet.go:2183] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
Jan 14 20:17:35 node02 containerd[1417]: time="2020-01-14T20:17:35.541792325-05:00" level=error msg="Failed to load cni configuration" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
Jan 14 20:17:35 node02 kubelet[30113]: E0114 20:17:35.541929 30113 kubelet.go:2183] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
Any tips to resolve this error?
Did you pass --pod-network-cidr=192.168.0.0/16 to kubeadm init?
Apparently, you need to set it.
https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/#pod-network
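For example (a sketch for a fresh control plane; an already-initialized cluster would have to be reset first):
# 192.168.0.0/16 is the pod CIDR the stock Calico manifest expects
kubeadm init --pod-network-cidr=192.168.0.0/16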
Because you are not using Docker, you need to set up the cgroup driver explicitly.
To use the systemd cgroup driver, set plugins.cri.systemd_cgroup = true in /etc/containerd/config.toml and run systemctl restart containerd.
You also have to modify the file kubeadm-flags.env in /var/lib/kubelet and set the cgroup driver there:
KUBELET_EXTRA_ARGS=--cgroup-driver=systemd
Make sure /etc/systemd/system/kubelet.service.d/10-kubeadm.conf points to the above file:
EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
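Putting it together, a minimal sketch of the whole sequence (assuming containerd 1.2.x, where the option lives under the plugins.cri section of config.toml):
# in /etc/containerd/config.toml:
[plugins.cri]
  systemd_cgroup = true
# then apply the changes:
systemctl restart containerd
echo 'KUBELET_EXTRA_ARGS=--cgroup-driver=systemd' >> /var/lib/kubelet/kubeadm-flags.env
systemctl daemon-reload
systemctl restart kubelet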

Failed to create pod sandbox [flannel]

I am running into this error on random pods. Thank you @matthew-l-daniel for the comment, as I didn't know where to start.
Here are the contents of /opt/cni/bin on the node:
:/opt/cni/bin$ ls
bridge host-local loopback
Here are the kubelet logs for a container that failed.
Jan 30 15:42:00 ip-172-20-39-216 kubelet[32233]: E0130 15:42:00.924370 32233 kuberuntime_sandbox.go:54] CreatePodSandbox for pod "postgres-core-0_service-master-459cf23(d8acae2f-24a2-11e9-b79c-0a0d1213cce2)" failed: rpc error: code = Unknown desc = failed to start sandbox container for pod "postgres-core-0": Error response from daemon: grpc: the connection is unavailable
Jan 30 15:42:00 ip-172-20-39-216 kubelet[32233]: E0130 15:42:00.924380 32233 kuberuntime_manager.go:647] createPodSandbox for pod "postgres-core-0_service-master-459cf23(d8acae2f-24a2-11e9-b79c-0a0d1213cce2)" failed: rpc error: code = Unknown desc = failed to start sandbox container for pod "postgres-core-0": Error response from daemon: grpc: the connection is unavailable
Jan 30 15:42:00 ip-172-20-39-216 kubelet[32233]: E0130 15:42:00.924427 32233 pod_workers.go:186] Error syncing pod d8acae2f-24a2-11e9-b79c-0a0d1213cce2 ("postgres-core-0_service-master-459cf23(d8acae2f-24a2-11e9-b79c-0a0d1213cce2)"), skipping: failed to "CreatePodSandbox" for "postgres-core-0_service-master-459cf23(d8acae2f-24a2-11e9-b79c-0a0d1213cce2)" with CreatePodSandboxError: "CreatePodSandbox for pod \"postgres-core-0_service-master-459cf23(d8acae2f-24a2-11e9-b79c-0a0d1213cce2)\" failed: rpc error: code = Unknown desc = failed to start sandbox container for pod \"postgres-core-0\": Error response from daemon: grpc: the connection is unavailable"
As for the flannel container logs: there are many flannel pods running, and all are healthy.
Kubernetes v1.10.11
Docker version 17.03.2-ce, build f5ec1e2
Flannel logs:
E0130 15:34:16.536354 1 vxlan_network.go:187] DelFDB failed: no such file or directory
E0130 15:34:16.536411 1 vxlan_network.go:191] failed to delete vxlanRoute (100.107.178.0/24 -> 100.107.178.0): no such process
E0130 17:33:44.848163 1 vxlan_network.go:187] DelFDB failed: no such file or directory
E0130 17:33:44.848219 1 vxlan_network.go:191] failed to delete vxlanRoute (100.107.201.0/24 -> 100.107.201.0): no such process

Kubernetes worker node staying in "NotReady" state

I have a two-node Kubernetes setup in VirtualBox. The master is up and running fine, but the worker node stays in the "NotReady" state.
[root@master ~]# kubectl get nodes
NAME     STATUS     ROLES    AGE   VERSION
master   Ready      master   1d    v1.10.2
node     NotReady   <none>   1h    v1.10.2
"journalctl -u kubelet" command on worker node is reporting networking related errors:
kuberuntime_manager.go:757] checking backoff for container "install-cni" in pod "kube-flannel-ds-zjlvn_kube-system(873fa36d-4b83-11e8-9997-080027afb5ab)"
remote_runtime.go:278] ContainerStatus "459643e54de7f82df8ada0f60e8f3d51d42c5ce348747a66e20ad5720155e63f" from runtime service failed: rpc error: code = U
kuberuntime_container.go:636] failed to remove pod init container "install-cni": failed to get container status "459643e54de7f82df8ada0f60e8f3d51d42c5ce34
kuberuntime_manager.go:757] checking backoff for container "install-cni" in pod "kube-flannel-ds-zjlvn_kube-system(873fa36d-4b83-11e8-9997-080027afb5ab)"
kuberuntime_manager.go:767] Back-off 10s restarting failed container=install-cni pod=kube-flannel-ds-zjlvn_kube-system(873fa36d-4b83-11e8-9997-080027afb5a
pod_workers.go:186] Error syncing pod 873fa36d-4b83-11e8-9997-080027afb5ab ("kube-flannel-ds-zjlvn_kube-system(873fa36d-4b83-11e8-9997-080027afb5ab)"), sk
cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
kubelet.go:2125] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni con
cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
kubelet.go:2125] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni con
cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
kubelet.go:2125] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni con
I am running Kubernetes version 1.10 and docker version 1.13.1. Could you please help me identify the root cause and resolution for this issue?
Well, the thing is: to form a Kubernetes cluster, you are required to deploy a CNI plugin that provides networking between your pods. The error you have shown here is due to a CNI plugin not being installed, or not being configured properly.
The kube-dns pod will stay in Pending state until the CNI plugin is deployed on your cluster. Once kube-dns moves to a Running state (after deploying the CNI provider), you can run your application workloads.
If you have not deployed a CNI plugin, there are several ones you can choose from.
Calico: Provides Pod networking via standard BGP. (Follow the documentation for further info)
kubectl apply -f https://docs.projectcalico.org/v3.1/getting-started/kubernetes/installation/hosted/kubeadm/1.7/calico.yaml
Weave: Creates an overlay network.
export kubever=$(kubectl version | base64 | tr -d '\n')
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$kubever"
Flannel: Creates an overlay network treating each host as a subnet.
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/v0.9.1/Documentation/kube-flannel.yml
Bridged container traffic also needs to be visible to iptables; you can enable that with
sysctl net.bridge.bridge-nf-call-iptables=1
This is required by Flannel and Weave to function.
Please do refer to the documentation of each CNI plugin to decide which would be suitable for your cluster.
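Once a plugin is deployed, a quick way to confirm the node recovers (plain kubectl, nothing plugin-specific):
kubectl get pods -n kube-system -w   # wait for the CNI and kube-dns pods to reach Running
kubectl get nodes                    # the worker should flip from NotReady to Ready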

Pods are not created on new nodes

When I create a sample nginx pod with some replicas to test my Kubernetes cluster, I get strange output. The pods create themselves on the first node, but on the 2 other nodes they are stuck at status "ContainerCreating".
When I describe the pods (only the ones on the other nodes), they give this error message:
Warning FailedCreatePodSandBox 1m kubelet, xploregroup Failed create pod sandbox.
Normal SandboxChanged 1m kubelet, xploregroup Pod sandbox changed, it will be killed and re-created.
The strange part is that all nodes have exactly the same configuration (I cloned the image from the master) and I joined them all exactly the same way.
The pods get distributed normally, but only the pods on node1 are running.
Can someone point me in the right direction? :(
[EDIT]
journalctl -u kubelet gives this error:
Mar 12 13:42:45 kubeMaster kubelet[16379]: W0312 13:42:45.824314 16379 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
Mar 12 13:42:45 kubeMaster kubelet[16379]: E0312 13:42:45.824816 16379 kubelet.go:2104] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
The problem seems to be with my network plugin. In my /etc/systemd/system/kubelet.service.d/10-kubeadm.conf the flags for the network plugin are present:
Environment="KUBELET_NETWORK_ARGS=--cni-bin-dir=/etc/cni/net.d --network-plugin=cni"
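One quick check worth doing (my assumption, since the log complains about /etc/cni/net.d rather than about the kubelet flags):
ls /etc/cni/net.d   # should contain a CNI config such as 10-flannel.conflist once flannel is deployed
ls /opt/cni/bin     # should contain the flannel plugin binary alongside bridge, host-local, loopback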

kubelet error: Failed to start ContainerManager failed to initialise top level QOS containers: root container /kubepods doesn't exist

I am trying to use kubelet to start the Kubernetes API server as a static pod, but it failed with the following errors:
I0523 11:13:41.192680 9248 remote_runtime.go:41] Connecting to runtime service /var/run/dockershim.sock
I0523 11:13:41.196764 9248 kuberuntime_manager.go:171] Container runtime docker initialized, version: 1.12.3, apiVersion: 1.24.0
E0523 11:13:41.199242 9248 kubelet.go:1165] Image garbage collection failed: unable to find data for container /
E0523 11:13:41.199405 9248 event.go:208] Unable to write event: 'Post https://127.0.0.1:8443/api/v1/namespaces/default/events: dial tcp 127.0.0.1:8443: getsockopt: connection refused' (may retry after sleeping)
I0523 11:13:41.199529 9248 server.go:869] Started kubelet v1.6.4
I0523 11:13:41.199711 9248 server.go:127] Starting to listen on 0.0.0.0:10250
I0523 11:13:41.200017 9248 kubelet_node_status.go:230] Setting node annotation to enable volume controller attach/detach
I0523 11:13:41.203018 9248 server.go:294] Adding debug handlers to kubelet server.
E0523 11:13:41.207486 9248 kubelet.go:1661] Failed to check if disk space is available for the runtime: failed to get fs info for "runtime": unable to find data for container /
E0523 11:13:41.207554 9248 kubelet.go:1669] Failed to check if disk space is available on the root partition: failed to get fs info for "root": unable to find data for container /
E0523 11:13:41.214231 9248 kubelet.go:1246] Failed to start ContainerManager failed to initialise top level QOS containers: root container /kubepods doesn't exist
The full log is here: https://travis-ci.org/reachlin/k8s0/jobs/235187507
This is the api server deployment yml: https://github.com/reachlin/k8s0/blob/master/roles/k8s/templates/apiserver.yml.j2
Later, I found that the error that actually matters is:
Failed to start ContainerManager failed to initialise top level QOS containers: root container /kubepods doesn't exist
After some research, I found the solution here: https://github.com/kubernetes/kubernetes/issues/43704
It is fixed by adding these two parameters to kubelet:
--cgroups-per-qos=false
--enforce-node-allocatable=""
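For example, if kubelet runs under systemd (a sketch; the drop-in path follows the kubeadm convention and may differ in your setup):
# /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
Environment="KUBELET_EXTRA_ARGS=--cgroups-per-qos=false --enforce-node-allocatable="
# then:
systemctl daemon-reload && systemctl restart kubelet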