CoreDNS pods failing to start after kubeadm init - Kubernetes

When I run the kubeadm init command, all pods end up running except the CoreDNS pods. When I describe those pods, they show an error saying CNI initialization failed.
Do I need to install a network plugin before running kubeadm init?

No, the network add-on is only added after kubeadm init; the CoreDNS pods are expected to stay Pending until a CNI plugin is installed. The documentation is explicit on this topic: https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/#pod-network
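In practice the order is: run kubeadm init, point kubectl at the cluster, then apply the add-on manifest. A minimal sketch (Flannel is used here purely as an example add-on; substitute the manifest for whichever CNI you choose, and note the CIDR below is Flannel's default):

```shell
# 1. Initialize the control plane (CoreDNS will sit in Pending after this)
sudo kubeadm init --pod-network-cidr=10.244.0.0/16

# 2. Set up kubectl for the current user, as kubeadm's output suggests
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# 3. Install the network add-on; CoreDNS should leave Pending shortly after
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
```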

Kubectl connection refused existing cluster

Hope someone can help me.
To describe the situation in short: I have a self-managed k8s cluster running on 3 machines (1 master, 2 worker nodes). To make it HA, I attempted to add a second master to the cluster.
After some failed attempts, I found out that I needed to add a controlPlaneEndpoint entry to the kubeadm-config ConfigMap. So I did, setting it to masternodeHostname:6443.
I generated the certificate and join command for the second master, and after running it on the second master machine, it failed with
error execution phase control-plane-join/etcd: error creating local etcd static pod manifest file: timeout waiting for etcd cluster to be available
Checking the first master now, I get connection refused for its IP on port 6443, so I cannot run any kubectl commands.
I tried recreating the .kube folder with all the config copied there; no luck.
I restarted kubelet and docker.
The containers running on the cluster seem ok, but I am locked out of any cluster configuration (dashboard is down, kubectl commands not working).
Is there any way I can make it work again, without losing any of the configuration or the deployments already present?
Thanks! Sorry if it’s a noob question.
Cluster information:
Kubernetes version: 1.15.3
Cloud being used: (put bare-metal if not on a public cloud) bare-metal
Installation method: kubeadm
Host OS: RHEL 7
CNI and version: weave 0.3.0
CRI and version: containerd 1.2.6
This is an old, known problem with Kubernetes 1.15 [1,2].
It is caused by a short etcd timeout period. As far as I'm aware it is a hard-coded value in the source and cannot be changed (a feature request to make it configurable is open for version 1.22).
Your best bet would be to upgrade to a newer version, and recreate your cluster.
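For reference, the controlPlaneEndpoint setting mentioned above lives in the ClusterConfiguration inside the kubeadm-config ConfigMap. A sketch of the relevant fragment (the apiVersion matches kubeadm 1.15; for real HA a load-balancer address is preferable to a single master's hostname, since pointing it at one master makes that master a single point of failure):

```yaml
# Fragment of the ClusterConfiguration stored in the kubeadm-config
# ConfigMap (kubectl -n kube-system edit cm kubeadm-config)
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
controlPlaneEndpoint: "masternodeHostname:6443"
```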

After uninstalling calico, new pods are stuck in container creating state

After uninstalling Calico with kubectl delete -f calico.yaml, I am not able to create new pods in the cluster. Any new pods are stuck in the ContainerCreating state. kubectl describe shows the errors below:
Warning FailedCreatePodSandBox 2m kubelet, 10.0.12.2 Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "f15743177fd70c5eabf70c60be5b8b354e5346837d1b5d59bf99d1d1b5d6416c" network for pod "test-9465-768b57b5df-fv9d4": NetworkPlugin cni failed to set up pod "test-9465-768b57b5df-fv9d4_policy-demo" network: error getting ClusterInformation: connection is unauthorized: Unauthorized, failed to clean up sandbox container "f15743177fd70c5eabf70c60be5b8b354e5346837d1b5d59bf99d1d1b5d6416c" network for pod "test-9465-768b57b5df-fv9d4": NetworkPlugin cni failed to teardown pod "test-9465-768b57b5df-fv9d4_policy-demo" network: error getting ClusterInformation: connection is unauthorized: Unauthorized]
The main issue is that Calico has an init container but does not have a cleanup container.
To undeploy Calico, we have to do the usual kubectl delete -f <yaml>, and then delete the Calico conf file in /etc/cni/net.d/ on each of the nodes. This configuration file, along with other binaries, is loaded onto the host by the init container.
https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/
From this link we can see that the kubelet reads configuration files from the default directory, and if there are multiple configuration files, it applies the CNI plugin from the config file that comes first in alphabetical order (why, oh god why??).
So, in our case, after uninstalling Calico it is stripped of its admin privileges, but the nodes still try to apply Calico rules based on the config file picked up from the default directory.
Removing the file and restarting the node (which also gets rid of the leftover iptables rules) brings things back to normal behavior. Another way to solve the same problem, if you are on a managed Kubernetes cluster, is to simply terminate the node: the public cloud infrastructure automatically boots up another node to keep the same state, and the replacement no longer has the Calico configuration file.
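The cleanup described above can be scripted. A sketch, assuming the default CNI config directory; the exact conflist file name varies, so list the directory first rather than trusting the name used here:

```shell
# On the control plane: remove the Calico resources
kubectl delete -f calico.yaml

# On each node: find and remove the leftover CNI config the kubelet
# keeps picking up
ls /etc/cni/net.d/
sudo rm /etc/cni/net.d/10-calico.conflist

# Reboot the node (clears Calico's iptables rules), or at minimum:
sudo systemctl restart kubelet
```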

How to change PodCIDR/PodCIDRs in my node

I have set up a Kubernetes cluster with kubeadm on 2 bare-metal machines. The cluster works well.
the kubeadm command I used:
sudo kubeadm init --pod-network-cidr=10.244.0.0/16
When I do performance testing, I need to run more pods on the worker node, but I hit a limit:
the PodCIDR on my worker node is PodCIDR: 10.244.1.0/24, so I can run 254 pods at most, even though I have raised max-pods to 500.
So my question is: how can I change the PodCIDR on my node?
I have tried editing /etc/kubernetes/manifests/kube-controller-manager.yaml to set
- --node-cidr-mask-size=16
and restarting the kubelet, but it had no effect.
The PodCIDR is managed by the CNI plugin, so it depends on which CNI you're using. You will likely need to change the CNI config and restart every node.
Here is the reference: https://capstonec.com/help-i-need-to-change-the-pod-cidr-in-my-kubernetes-cluster/
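The 254-pod ceiling in the question follows directly from the mask arithmetic: a /24 leaves 8 host bits, and two addresses (network and broadcast) are unusable. A quick sketch:

```shell
# Usable pod IPs for a given mask: 2^(32 - mask), minus the network
# and broadcast addresses
pods_for_mask() {
  local mask=$1
  echo $(( (1 << (32 - mask)) - 2 ))
}

pods_for_mask 24   # 254  -- the default per-node /24
pods_for_mask 16   # 65534
```

This is also why raising max-pods alone does not help: the node cannot schedule more pods than its PodCIDR has addresses for.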

Problem getting pods stats from kubelet and cri-o

We are running Kubernetes with the following configuration:
On-premise Kubernetes 1.11.3, cri-o 1.11.6 and CentOS7 with UEK-4.14.35
I can't make crictl stats return pod information; it only returns an empty list. Has anyone run into the same problem?
Another issue we have is that when I query the kubelet for stats/summary, it returns an empty pods list.
I think these two issues are related, although I am not sure which one of them is the root problem.
I would recommend checking the kubelet service to verify its health status and to debug any suspicious events within the cluster. Since the kubelet manages the Pod lifecycle, I assume the CRI-O runtime relies on it as the main provider of Pod information.
systemctl status kubelet -l
journalctl -u kubelet
If you find any errors or dubious events, share them in a comment below this answer.
Alternatively, you can use metrics-server, which collects Pod metrics in the cluster; note that it requires enabling the kube-apiserver flags for the Aggregation Layer. Here is a good article about Horizontal Pod Autoscaling and monitoring resources via Prometheus.
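To narrow down whether the kubelet itself can serve pod stats, you can also query its Summary API endpoint directly. A sketch, assuming the default ports (the secure port normally requires credentials; -k here only skips certificate verification for a quick local check):

```shell
# Read-only port, if enabled (plain HTTP on 10255)
curl -s http://localhost:10255/stats/summary

# Secure kubelet port (10250)
curl -sk https://localhost:10250/stats/summary
```

If the pods list is empty here too, the problem is on the kubelet/CRI-O side rather than in crictl.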

CoreOS v.1.6.1 not starting

I am working on setting up a new Kubernetes cluster using the CoreOS documentation. This one uses the CoreOS v1.6.1 image. I am following the documentation linked from CoreOS Master setup. Looking in the journalctl logs, I see that the kube-apiserver seems to exit and restart repeatedly.
The following journalctl entries show the kube-apiserver crash loop:
checking backoff for container "kube-apiserver" in pod "kube-apiserver-10.138.192.31_kube-system(16c7e04edcd7e775efadd4bdcb1940c4)"
Back-off 5m0s restarting failed container=kube-apiserver pod=kube-apiserver-10.138.192.31_kube-system(16c7e04edcd7e775efadd4bdcb1940c4)
Error syncing pod 16c7e04edcd7e775efadd4bdcb1940c4 ("kube-apiserver-10.138.192.31_kube-system(16c7e04edcd7e775efadd4bdcb1940c4)"), skipping: failed to "StartContainer" for "kube-apiserver" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=kube-apiserver pod=kube-apiserver-10.138.192.31_kube-system(16c7e04edcd7e775efadd4bdcb1940c4)"
I am wondering if it's because I need to start the new etcd3 version instead of the etcd2? Any hints or suggestion is appreciated.
The following is my cloud-config:
coreos:
  etcd2:
    # generate a new token for each unique cluster from https://discovery.etcd.io/new:
    discovery: https://discovery.etcd.io/33e3f7c20be0b57daac4d14d478841b4
    # multi-region deployments, multi-cloud deployments, and Droplets without
    # private networking need to use $public_ipv4:
    advertise-client-urls: http://$private_ipv4:2379,http://$private_ipv4:4001
    initial-advertise-peer-urls: http://$private_ipv4:2380
    # listen on the official ports 2379, 2380 and one legacy port 4001:
    listen-client-urls: http://0.0.0.0:2379,http://0.0.0.0:4001
    listen-peer-urls: http://$private_ipv4:2380
  fleet:
    public-ip: $private_ipv4 # used for fleetctl ssh command
  units:
    - name: etcd2.service
      command: start
However, I have tried the CoreOS v1.5 images and they work fine; it's with the CoreOS v1.6 images that I am not able to get the kube-apiserver running.
You are using etcd2, so you need to pass the flag --storage-backend=etcd2 to the kube-apiserver in your manifest.
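A sketch of what that could look like in the kube-apiserver manifest (the surrounding command line and etcd address are illustrative, not taken from the poster's setup; etcd2 also stored objects as JSON, hence the media-type flag):

```yaml
# Hypothetical kube-apiserver container command fragment
command:
  - /hyperkube
  - apiserver
  - --etcd-servers=http://127.0.0.1:2379
  - --storage-backend=etcd2
  - --storage-media-type=application/json
```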
You are using etcd2; I think you could try etcd3 instead.
You said:
I am wondering if it's because I need to start the new etcd3 version instead of the etcd2? Any hints or suggestion is appreciated.
I would like to recommend reading this doc to learn how to upgrade etcd.