Pods are not created on new nodes - raspberry-pi

When I create a sample nginx pod with a few replicas to test my Kubernetes cluster, I get strange output. The pods start on the first node, but on the two other nodes they are stuck in the "ContainerCreating" status.
When I describe the pods (only the ones on the other nodes), they show this error message:
Warning FailedCreatePodSandBox 1m kubelet, xploregroup Failed create pod sandbox.
Normal SandboxChanged 1m kubelet, xploregroup Pod sandbox changed, it will be killed and re-created.
The strange part is that all nodes have exactly the same configuration (cloned the image from the master) and I joined them all exactly the same way.
The pods get distributed normally, but only the pods on node1 are running.
Can someone point me in the right direction? :(
[EDIT]
journalctl -u kubelet gives this error
Mar 12 13:42:45 kubeMaster kubelet[16379]: W0312 13:42:45.824314 16379 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
Mar 12 13:42:45 kubeMaster kubelet[16379]: E0312 13:42:45.824816 16379 kubelet.go:2104] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
The problem seems to be with my network plugin. In my /etc/systemd/system/kubelet.service.d/10-kubeadm.conf the flags for the network plugin are present:
Environment="KUBELET_NETWORK_ARGS=--cni-bin-dir=/etc/cni/net.d --network-plugin=cni"
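Those two kubelet messages usually just mean that no CNI add-on has written a config into /etc/cni/net.d on the affected nodes yet. A minimal sketch of what to try, assuming Weave Net as the add-on (any CNI plugin works; the same command appears in an answer further down this page):
# on the master: deploy the CNI add-on
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
# on each worker: check that a network config now exists
ls /etc/cni/net.d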

Related

Ingress-nginx is in CrashLoopBackOff after K8s upgrade

After upgrading the Kubernetes node pool from 1.21 to 1.22, the ingress-nginx-controller pods started crashing. The same deployment works fine in EKS; I'm only having this issue in GKE. Does anyone have any ideas about the root cause?
$ kubectl logs ingress-nginx-controller-5744fc449d-8t2rq -c controller
-------------------------------------------------------------------------------
NGINX Ingress controller
Release: v1.3.1
Build: 92534fa2ae799b502882c8684db13a25cde68155
Repository: https://github.com/kubernetes/ingress-nginx
nginx version: nginx/1.19.10
-------------------------------------------------------------------------------
W0219 21:23:08.194770 8 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0219 21:23:08.194995 8 main.go:209] "Creating API client" host="https://10.1.48.1:443"
Ingress pod events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 27m default-scheduler Successfully assigned infra/ingress-nginx-controller-5744fc449d-8t2rq to gke-infra-nodep-ffe54a41-s7qx
Normal Pulling 27m kubelet Pulling image "registry.k8s.io/ingress-nginx/controller:v1.3.1@sha256:54f7fe2c6c5a9db9a0ebf1131797109bb7a4d91f56b9b362bde2abd237dd1974"
Normal Started 27m kubelet Started container controller
Normal Pulled 27m kubelet Successfully pulled image "registry.k8s.io/ingress-nginx/controller:v1.3.1@sha256:54f7fe2c6c5a9db9a0ebf1131797109bb7a4d91f56b9b362bde2abd237dd1974" in 6.443361484s
Warning Unhealthy 26m (x6 over 26m) kubelet Readiness probe failed: HTTP probe failed with statuscode: 502
Normal Killing 26m kubelet Container controller failed liveness probe, will be restarted
Normal Created 26m (x2 over 27m) kubelet Created container controller
Warning FailedPreStopHook 26m kubelet Exec lifecycle hook ([/wait-shutdown]) for Container "controller" in Pod "ingress-nginx-controller-5744fc449d-8t2rq_infra(c4c166ff-1d86-4385-a22c-227084d569d6)" failed - error: command '/wait-shutdown' exited with 137: , message: ""
Normal Pulled 26m kubelet Container image "registry.k8s.io/ingress-nginx/controller:v1.3.1@sha256:54f7fe2c6c5a9db9a0ebf1131797109bb7a4d91f56b9b362bde2abd237dd1974" already present on machine
Warning BackOff 7m7s (x52 over 21m) kubelet Back-off restarting failed container
Warning Unhealthy 2m9s (x55 over 26m) kubelet Liveness probe failed: HTTP probe failed with statuscode: 502
The Beta API versions (extensions/v1beta1 and networking.k8s.io/v1beta1) of Ingress are no longer served (removed) for GKE clusters created on versions 1.22 and later. Please refer to the official GKE ingress documentation for changes in the GA API version.
Also refer to Official Kubernetes documentation for API removals for Kubernetes v1.22 for more information.
Before upgrading your Ingress API as a client, make sure that every ingress controller that you use is compatible with the v1 Ingress API. See Ingress Prerequisites for more context about Ingress and ingress controllers.
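A quick, read-only way to confirm that your Ingress objects are being served through the GA API is to list them explicitly via the v1 resource; any manifest still declaring extensions/v1beta1 or networking.k8s.io/v1beta1 needs its apiVersion updated to networking.k8s.io/v1:
kubectl get ingresses.v1.networking.k8s.io -A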
Also check the possible causes of CrashLoopBackOff below (a sketch of the probe tweak follows this list):
Increasing the initialDelaySeconds value for the livenessProbe setting may help alleviate the issue, as it gives the container more time to start up and perform its initial work before the liveness probe checks its health.
Check the container restart policy: the spec of a Pod has a restartPolicy field with possible values Always, OnFailure, and Never. The default value is Always.
Out of memory or resources: try increasing the VM size. Containers may crash due to memory limits, new ones are spun up, the health check fails, and the Ingress serves up 502s.
Check whether externalTrafficPolicy=Local is set on the NodePort service; it prevents nodes from forwarding traffic to other nodes.
Refer to the GitHub issue Document how to avoid 502s #34 for more information.
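A minimal sketch of raising the liveness probe delay on the controller Deployment, assuming the infra namespace and deployment name implied by the events above and that the controller is the first container in the pod spec (verify the index before patching):
kubectl -n infra patch deployment ingress-nginx-controller --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds","value":30}]'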

Kubernetes: Calico not working on remote worker, local ok

I set up a Kubernetes cluster with Calico.
The setup is "simple"
1x master (local network, ok)
1x node (local network, ok)
1x node (cloud server, not ok)
All Debian Buster with Docker 19.03.
On the cloud server the calico pods do not come up:
calico-kube-controllers-token-x:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SandboxChanged 47m (x50 over 72m) kubelet Pod sandbox changed, it will be killed and re-created.
Warning FailedMount 43m kubelet MountVolume.SetUp failed for volume "calico-kube-controllers-token-x" : failed to sync secret cache: timed out waiting for the condition
Normal SandboxChanged 3m41s (x78 over 43m) kubelet Pod sandbox changed, it will be killed and re-created.
calico-node-x:
Warning Unhealthy 43m (x5 over 43m) kubelet Liveness probe failed: calico/node is not ready: Felix is not live: Get "http://localhost:9099/liveness": dial tcp [::1]:9099: connect: connection refused
Warning Unhealthy 14m (x77 over 43m) kubelet Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/bird/bird.ctl: connect: no such file or directory
Warning BackOff 4m26s (x115 over 39m) kubelet Back-off restarting failed container
My guess is that there is something wrong with the IP/network config, but I have not figured out what.
The required ports (k8s & BGP) are forwarded from the router; I also tried connecting the master directly to the internet.
--control-plane-endpoint is a hostname and publicly resolvable.
Calico is using BGP peering (using the public IP as peer).
This entry worries me the most:
kubectl get --raw /api displays the local IP.
I tried to find a way to change this to the public IP of the master, without success.
Anyone got a clue what to try next?
After some additional time spent on analysis, the problem turned out to be that the advertised API IP address was the local one, not the DNS name.
I created a VPN with WireGuard from the cloud node to the local master, so the local IP of the master is reachable from the cloud node.
Don't know if that is the cleanest solution, but it works.
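A quick way to confirm the WireGuard tunnel really makes the API server reachable from the cloud node (the address below is a placeholder for the master's local IP; -k only skips certificate verification for this connectivity check):
curl -k https://<master-local-ip>:6443/version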
Run this command to verify whether the IP_AUTODETECTION_METHOD environment variable has been set in the calico-node DaemonSet:
kubectl get daemonset/calico-node -n kube-system --output json | jq '.spec.template.spec.containers[].env[] | select(.name | startswith("IP"))'
Run this command on each of your k8s nodes to find the valid network interface:
ifconfig
Explicitly set the IP_AUTODETECTION_METHOD environment variable to make sure calico-node communicates over the correct network interface of the K8s node:
kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=interface=en.*
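After setting the variable the DaemonSet rolls its pods; you can watch that finish with (namespace matches the commands above):
kubectl -n kube-system rollout status daemonset/calico-node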

Restore a node after being purged due to resource pressure

I have a k8s cluster setup using kubespray.
Last week one of my k8s nodes ran very low on storage, so all the pods were evicted, including some important pods like calico-node and kube-proxy (I thought these pods were critical and would never be evicted no matter what).
After that, all the calico-node pods became not ready. When I check the log, it says:
Warning: Readiness probe failed: calico/node is not ready: BIRD is not ready: BGP not established with 192.168.0.xxx, where 192.168.0.xxx is the IP of the problematic node above.
My question is: how can I restore that node? Is it safe to just run kubespray's cluster.yml again?
My k8s version is v1.13.3
Thanks.
When a node has disk pressure, its status changes to NotReady and a taint is added to the node: Taints: node.kubernetes.io/disk-pressure:NoSchedule.
All pods running on this node get evicted, except the api-server, kube-controller-manager and kube-scheduler; the eviction manager saves those pods from eviction with the error message: cannot evict a critical static pod [...]
Once the node is freed from disk pressure, it will change its status to Ready and the previously added taint will be removed. You can check it by running kubectl describe node <node_name>. In the Conditions field you should see that DiskPressure has changed to False, which means the node has enough space available. Similar information can also be found in the Events field.
Normal NodeReady 1s kubelet, node1 Node node1 status is now: NodeReady
Normal NodeHasNoDiskPressure 1s (x2 over 1s) kubelet, node1 Node node1 status is now: NodeHasNoDiskPressure
After confirming that the node is Ready with sufficient disk space, you can restart the kubelet and run kubespray's cluster.yml; the pods will be redeployed on the node. You just have to make sure that the node is ready to handle deployments.
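A minimal sketch of those steps, assuming the node is called node1 and a standard kubespray inventory layout (adjust names and paths to your setup):
# on the recovered node: confirm the disk-pressure taint is gone, then restart the kubelet
kubectl describe node node1 | grep -i -A1 taints
sudo systemctl restart kubelet
# from your kubespray checkout: re-run the playbook against the existing inventory
ansible-playbook -i inventory/mycluster/hosts.yml --become cluster.yml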

kubernetes worker node in "NotReady" status

I am trying to set up my first cluster using Kubernetes 1.13.1. The master got initialized okay, but both of my worker nodes are NotReady. kubectl describe node shows that the kubelet stopped posting node status on both worker nodes. On one of the worker nodes I get log output like
> kubelet[3680]: E0107 20:37:21.196128 3680 kubelet.go:2266] node
> "xyz" not found.
Here is the full details:
I am using Centos 7 & Kubernetes 1.13.1.
Initializing was done as follows:
[root@master ~]# kubeadm init --apiserver-advertise-address=10.142.0.4 --pod-network-cidr=10.142.0.0/24
Successfully initialized the cluster:
You can now join any number of machines by running the following on each node
as root:
`kubeadm join 10.142.0.4:6443 --token y0epoc.zan7yp35sow5rorw --discovery-token-ca-cert-hash sha256:f02d43311c2696e1a73e157bda583247b9faac4ffb368f737ee9345412c9dea4`
Deployed the flannel CNI:
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
The join command worked fine.
[kubelet-start] Activating the kubelet service
[tlsbootstrap] Waiting for the kubelet to perform the TLS Bootstrap...
[patchnode] Uploading the CRI Socket information "/var/run/dockershim.sock" to the Node API object "node01" as an annotation
This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.
Run 'kubectl get nodes' on the master to see this node join the cluster.
Result of kubectl get nodes:
[root@master ~]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
master Ready master 9h v1.13.1
node01 NotReady <none> 9h v1.13.1
node02 NotReady <none> 9h v1.13.1
On both nodes:
[root@node01 ~]# service kubelet status
Redirecting to /bin/systemctl status kubelet.service
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Tue 2019-01-08 04:49:20 UTC; 32s ago
Docs: https://kubernetes.io/docs/
Main PID: 4224 (kubelet)
Memory: 31.3M
CGroup: /system.slice/kubelet.service
└─4224 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfi
`Jan 08 04:54:10 node01 kubelet[4224]: E0108 04:54:10.957115 4224 kubelet.go:2266] node "node01" not found`
I appreciate your advice on how to troubleshoot this.
The previous answer sounds correct. You can verify that by running
kubectl describe node node01 on the master, or wherever kubectl is correctly configured.
It seems the reason for this error is an incorrect subnet. The Flannel documentation states that you should use /16, not /24, for the pod network.
NOTE: If kubeadm is used, then pass --pod-network-cidr=10.244.0.0/16
to kubeadm init to ensure that the podCIDR is set.
I tried running kubeadm with /24 and, although the nodes were in the Ready state, the flannel pods did not run properly, which resulted in some issues.
You can check whether your flannel pods are running properly with:
kubectl get pods -n kube-system. If the status is anything other than Running, something is wrong. In this case you can check the details by running kubectl describe pod PODNAME -n kube-system. Try changing the subnet and update us if that fixed the problem.
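If you decide to redo the subnet, a minimal sketch of re-initializing with the Flannel-compatible CIDR (kubeadm reset wipes the cluster state on the node, and the workers will need a reset and a fresh kubeadm join afterwards):
sudo kubeadm reset
sudo kubeadm init --apiserver-advertise-address=10.142.0.4 --pod-network-cidr=10.244.0.0/16
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml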
I ran into almost the same problem, and in the end I found that the reason was that the firewall was not turned off. You can try the following commands:
sudo ufw disable
or
systemctl disable firewalld
or
setenforce 0
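If you would rather keep the firewall running, opening the standard worker-node ports is an alternative; a sketch for firewalld (see the Kubernetes "ports and protocols" documentation for the full list):
sudo firewall-cmd --permanent --add-port=10250/tcp
sudo firewall-cmd --permanent --add-port=30000-32767/tcp
sudo firewall-cmd --reload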

Kubernetes worker node staying in "NotReady" state

I have a two-node Kubernetes setup in VirtualBox. The master is up and running fine, but the worker node is staying in the "NotReady" state.
[root@master ~]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
master Ready master 1d v1.10.2
node NotReady <none> 1h v1.10.2
"journalctl -u kubelet" command on worker node is reporting networking related errors:
kuberuntime_manager.go:757] checking backoff for container "install-cni" in pod "kube-flannel-ds-zjlvn_kube-system(873fa36d-4b83-11e8-9997-080027afb5ab)"
remote_runtime.go:278] ContainerStatus "459643e54de7f82df8ada0f60e8f3d51d42c5ce348747a66e20ad5720155e63f" from runtime service failed: rpc error: code = U
kuberuntime_container.go:636] failed to remove pod init container "install-cni": failed to get container status "459643e54de7f82df8ada0f60e8f3d51d42c5ce34
kuberuntime_manager.go:757] checking backoff for container "install-cni" in pod "kube-flannel-ds-zjlvn_kube-system(873fa36d-4b83-11e8-9997-080027afb5ab)"
kuberuntime_manager.go:767] Back-off 10s restarting failed container=install-cni pod=kube-flannel-ds-zjlvn_kube-system(873fa36d-4b83-11e8-9997-080027afb5a
pod_workers.go:186] Error syncing pod 873fa36d-4b83-11e8-9997-080027afb5ab ("kube-flannel-ds-zjlvn_kube-system(873fa36d-4b83-11e8-9997-080027afb5ab)"), sk
cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
kubelet.go:2125] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni con
cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
kubelet.go:2125] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni con
cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
kubelet.go:2125] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni con
I am running Kubernetes version 1.10 and docker version 1.13.1. Could you please help me identify the root cause and resolution for this issue?
Well, the thing is, forming a Kubernetes cluster requires that you deploy a CNI plugin to provide networking between your pods. The error you have shown here is due to a CNI plugin not being installed or not being configured properly.
The kube-dns pod will be in the Pending state until a CNI plugin is deployed on your cluster. Once kube-dns moves to the Running state (after deploying the CNI provider), you can run your application workloads.
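A quick way to check that (kube-dns/CoreDNS pods carry the k8s-app=kube-dns label by default):
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide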
If you have not deployed a CNI plugin, there are several ones you can choose from.
Calico: Provides Pod networking via standard BGP. (Follow the documentation for further info)
kubectl apply -f https://docs.projectcalico.org/v3.1/getting-started/kubernetes/installation/hosted/kubeadm/1.7/calico.yaml
Weave: Creates an overlay network.
export kubever=$(kubectl version | base64 | tr -d '\n')
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$kubever"
Flannel: Creates an overlay network treating each host as a subnet.
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/v0.9.1/Documentation/kube-flannel.yml
Container traffic needs to be visible to iptables, and you can enable that with:
sysctl net.bridge.bridge-nf-call-iptables=1
This is required by Flannel and Weave to function.
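To make that setting survive a reboot you can persist it, for example (the file name is just a convention):
echo "net.bridge.bridge-nf-call-iptables = 1" | sudo tee /etc/sysctl.d/k8s.conf
sudo sysctl --system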
Please do refer to the documentation of each CNI plugin which would be suitable for your cluster.