Pods unable to resolve hostnames in Kubernetes cluster

I'm working on AWS EKS, and I'm having networking issues because none of the pods can resolve hostnames.
Checking the kube-system pods, I found this:
NAME READY STATUS RESTARTS AGE
alb-ingress-controller-68b7d9dd9-4mpnc 1/1 Running 0 3d20h
alb-ingress-controller-68b7d9dd9-kqmsl 1/1 Running 8 17d
alb-ingress-controller-68b7d9dd9-q4m87 1/1 Running 11 26d
aws-node-m6ncq 1/1 Running 34 31d
aws-node-r8bs5 1/1 Running 31 28d
aws-node-vfvjn 1/1 Running 34 31d
coredns-f47955f89-4bfb5 1/1 Running 11 26d
coredns-f47955f89-7mqp7 1/1 Running 8 17d
kube-flannel-ds-chv64 0/1 CrashLoopBackOff 1814 28d
kube-flannel-ds-fct45 0/1 CrashLoopBackOff 1831 31d
kube-flannel-ds-zs8z4 0/1 CrashLoopBackOff 1814 28d
kube-proxy-6lcst 1/1 Running 18 31d
kube-proxy-9qfkg 1/1 Running 17 28d
kube-proxy-l5qvd 1/1 Running 18 31d
and the kube-flannel logs show this:
I0208 21:01:23.542634 1 main.go:218] CLI flags config: {etcdEndpoints:http://127.0.0.1:4001,http://127.0.0.1:2379 etcdPrefix:/coreos.com/network etcdKeyfile: etcdCertfile: etcdCAFile: etcdUsername: etcdPassword: help:false version:false autoDetectIPv4:false autoDetectIPv6:false kubeSubnetMgr:true kubeApiUrl: kubeAnnotationPrefix:flannel.alpha.coreos.com kubeConfigFile: iface:[] ifaceRegex:[] ipMasq:true subnetFile:/run/flannel/subnet.env subnetDir: publicIP: publicIPv6: subnetLeaseRenewMargin:60 healthzIP:0.0.0.0 healthzPort:0 charonExecutablePath: charonViciUri: iptablesResyncSeconds:5 iptablesForwardRules:true netConfPath:/etc/kube-flannel/net-conf.json setNodeNetworkUnavailable:true}
W0208 21:01:23.542713 1 client_config.go:608] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0208 21:01:23.647398 1 kube.go:120] Waiting 10m0s for node controller to sync
I0208 21:01:23.647635 1 kube.go:378] Starting kube subnet manager
I0208 21:01:24.647775 1 kube.go:127] Node controller sync successful
I0208 21:01:24.647801 1 main.go:238] Created subnet manager: Kubernetes Subnet Manager - ip-10-23-21-240.us-east-2.compute.internal
I0208 21:01:24.647811 1 main.go:241] Installing signal handlers
I0208 21:01:24.647887 1 main.go:460] Found network config - Backend type: vxlan
I0208 21:01:24.647916 1 main.go:652] Determining IP address of default interface
I0208 21:01:24.648449 1 main.go:699] Using interface with name eth0 and address 10.23.21.240
I0208 21:01:24.648476 1 main.go:721] Defaulting external address to interface address (10.23.21.240)
I0208 21:01:24.648482 1 main.go:734] Defaulting external v6 address to interface address (<nil>)
I0208 21:01:24.648543 1 vxlan.go:137] VXLAN config: VNI=1 Port=0 GBP=false Learning=false DirectRouting=false
E0208 21:01:24.648871 1 main.go:326] Error registering network: failed to acquire lease: node "ip-10-23-21-240.us-east-2.compute.internal" pod cidr not assigned
W0208 21:01:24.649006 1 reflector.go:424] github.com/flannel-io/flannel/subnet/kube/kube.go:379: watch of *v1.Node ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
I0208 21:01:24.649060 1 main.go:440] Stopping shutdownHandler...
Any idea what could be the issue or what to try?
Thanks!

Amazon VPC CNI and Flannel cannot co-exist on EKS. Note that Flannel is not among the suggested alternate compatible CNI plugins. To get an idea of what it takes to use Flannel on EKS, check out this excellent blog.
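A quick way to confirm the conflict and remove the Flannel add-on (a sketch, assuming the Flannel manifest was applied to kube-system as the pod listing above suggests; the DaemonSet and ConfigMap names are the ones from the stock kube-flannel manifest):
$ kubectl -n kube-system get daemonset aws-node kube-flannel-ds   # both present means two CNIs are competing on the nodes
$ kubectl -n kube-system delete daemonset kube-flannel-ds         # keep aws-node, the EKS-managed VPC CNI
$ kubectl -n kube-system delete configmap kube-flannel-cfg        # created by the same manifest, if present
Affected pods may then need to be recreated so they pick up networking from the VPC CNI again.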

Related

How can I detect CNI type/version in Kubernetes cluster?

Is there a kubectl command or config map in the cluster that can help me find out what CNI is being used?
First of all, checking for the presence of exactly one config file in /etc/cni/net.d is a good start:
$ ls /etc/cni/net.d
10-flannel.conflist
ip a s or ifconfig are also helpful for checking the existence of network interfaces, e.g. the flannel CNI should set up a flannel.1 interface:
$ ip a s flannel.1
3: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
link/ether de:cb:d1:d6:e3:e7 brd ff:ff:ff:ff:ff:ff
inet 10.244.1.0/32 scope global flannel.1
valid_lft forever preferred_lft forever
inet6 fe80::dccb:d1ff:fed6:e3e7/64 scope link
valid_lft forever preferred_lft forever
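If the file is a .conflist with several chained plugins, the plugin types can be read straight out of it (a small sketch, assuming jq is installed; the file name is the one found above):
$ jq -r '.plugins[].type' /etc/cni/net.d/10-flannel.conflist   # the stock flannel config lists flannel and portmap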
When creating a cluster, the CNI is typically installed using:
kubectl apply -f <add-on.yaml>
thus the networking pod will be called kube-flannel*, kube-calico*, etc., depending on your networking configuration.
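If you don't know the CNI in advance, a broad filter over all namespaces will usually surface the networking pods (a hedged one-liner, not from the original answer; extend the pattern as needed):
$ kubectl get pods -A -o wide | grep -E 'flannel|calico|canal|cilium|weave|aws-node'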
Then crictl will help you inspect running pods and containers.
crictl pods ls
On a controller node in a healthy cluster you should have all pods in Ready state.
crictl pods ls
POD ID CREATED STATE NAME NAMESPACE ATTEMPT RUNTIME
dc90dd87e18cf 3 minutes ago Ready coredns-6d4b75cb6d-r2j9s kube-system 0 (default)
d1ab9d0aa815a 3 minutes ago Ready kubernetes-dashboard-cd4778d69-xmtkz kube-system 0 (default)
0c151fdd92e71 3 minutes ago Ready coredns-6d4b75cb6d-bn8hr kube-system 0 (default)
40f18ce56f776 4 minutes ago Ready kube-flannel-ds-d4fd7 kube-flannel 0 (default)
0e390a68380a5 4 minutes ago Ready kube-proxy-r6cq2 kube-system 0 (default)
cd93e58d3bf70 4 minutes ago Ready kube-scheduler-c01 kube-system 0 (default)
266a33aa5c241 4 minutes ago Ready kube-apiserver-c01 kube-system 0 (default)
0910a7a73f5aa 4 minutes ago Ready kube-controller-manager-c01 kube-system 0 (default)
If your cluster is properly configured you should be able to list the pods using kubectl:
kubectl get pods -n kube-system
If kubectl is not working (kube-apiserver is not running) you can fall back to crictl.
On an unhealthy cluster kubectl will show pods in CrashLoopBackOff state. The crictl pods ls command will give you a similar picture, only displaying pods from a single node. Also check the documentation for common CNI errors.
$ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-6d4b75cb6d-brb9d 0/1 ContainerCreating 0 25m
coredns-6d4b75cb6d-pcrcp 0/1 ContainerCreating 0 25m
kube-apiserver-cm01 1/1 Running 27 (18m ago) 26m
kube-apiserver-cm02 0/1 Running 31 (8m11s ago) 23m
kube-apiserver-cm03 0/1 CrashLoopBackOff 33 (2m22s ago) 26m
kube-controller-manager-cm01 0/1 CrashLoopBackOff 13 (50s ago) 24m
kube-controller-manager-cm02 0/1 CrashLoopBackOff 7 (15s ago) 24m
kube-controller-manager-cm03 0/1 CrashLoopBackOff 15 (3m45s ago) 26m
kube-proxy-2dvfg 0/1 CrashLoopBackOff 8 (97s ago) 25m
kube-proxy-7gnnr 0/1 CrashLoopBackOff 8 (39s ago) 25m
kube-proxy-cqmvz 0/1 CrashLoopBackOff 8 (19s ago) 25m
kube-scheduler-cm01 1/1 Running 28 (7m15s ago) 12m
kube-scheduler-cm02 0/1 CrashLoopBackOff 28 (4m45s ago) 18m
kube-scheduler-cm03 1/1 Running 36 (107s ago) 26m
kubernetes-dashboard-cd4778d69-g8jmf 0/1 ContainerCreating 0 2m27s
crictl ps will give you containers (like docker ps); watch for a high number of attempts:
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
d54c6f1e45dea 2ae1ba6417cbc 2 seconds ago Running kube-proxy 1 347fef3ae1e98 kube-proxy-7gnnr
d6048ef9e30c7 d521dd763e2e3 41 seconds ago Running kube-apiserver 27 640658b58d1ae kube-apiserver-cm03
b6b8c7a24914e 3a5aa3a515f5d 41 seconds ago Running kube-scheduler 28 c7b710a0acf30 kube-scheduler-cm03
b0a480d2c1baf 586c112956dfc 42 seconds ago Running kube-controller-manager 8 69504853ab81b kube-controller-manager-cm03
and check logs using
crictl logs d54c6f1e45dea
Last but not least, the /opt/cni/bin/ path usually contains the binaries required for networking. A different path might be defined in the add-on setup or CNI config.
$ ls /opt/cni/bin/
bandwidth bridge dhcp firewall flannel host-device host-local ipvlan loopback macvlan portmap ptp sbr static tuning vlan
Finally, crictl reads the /etc/crictl.yaml config; you should set the proper runtime and image endpoints to match your container runtime:
runtime-endpoint: unix:///var/run/containerd/containerd.sock
image-endpoint: unix:///var/run/containerd/containerd.sock
timeout: 10
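If editing /etc/crictl.yaml is not an option, the same endpoints can be passed per invocation through crictl's global flags (the CRI-O socket path below is only an example; substitute the socket of your runtime):
$ crictl --runtime-endpoint unix:///var/run/crio/crio.sock --image-endpoint unix:///var/run/crio/crio.sock ps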

Kubernetes can't access pod in multi worker nodes

I was following a tutorial on YouTube and the guy said that if you deploy your application in a multi-node cluster and your service is of type NodePort, you don't have to worry about which node your pod gets scheduled on. You can access it with any node's IP address, like
worker1IP:servicePort or worker2IP:servicePort or workerNIP:servicePort
But I just tried it and this is not the case: I can only access the pod on the node where it is scheduled and deployed. Is this the correct behavior?
kubectl version --short
> Client Version: v1.18.5
> Server Version: v1.18.5
kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-66bff467f8-6pt8s 0/1 Running 288 7d22h
coredns-66bff467f8-t26x4 0/1 Running 288 7d22h
etcd-redhat-master 1/1 Running 16 7d22h
kube-apiserver-redhat-master 1/1 Running 17 7d22h
kube-controller-manager-redhat-master 1/1 Running 19 7d22h
kube-flannel-ds-amd64-9mh6k 1/1 Running 16 5d22h
kube-flannel-ds-amd64-g2k5c 1/1 Running 16 5d22h
kube-flannel-ds-amd64-rnvgb 1/1 Running 14 5d22h
kube-proxy-gf8zk 1/1 Running 16 7d22h
kube-proxy-wt7cp 1/1 Running 9 7d22h
kube-proxy-zbw4b 1/1 Running 9 7d22h
kube-scheduler-redhat-master 1/1 Running 18 7d22h
weave-net-6jjd8 2/2 Running 34 7d22h
weave-net-ssqbz 1/2 CrashLoopBackOff 296 7d22h
weave-net-ts2tj 2/2 Running 34 7d22h
[root@redhat-master deployments]# kubectl logs weave-net-ssqbz -c weave -n kube-system
DEBU: 2020/07/05 07:28:04.661866 [kube-peers] Checking peer "b6:01:79:66:7d:d3" against list &{[{e6:c9:b2:5f:82:d1 redhat-master} {b2:29:9a:5b:89:e9 redhat-console-1} {e2:95:07:c8:a0:90 redhat-console-2}]}
Peer not in list; removing persisted data
INFO: 2020/07/05 07:28:04.924399 Command line options: map[conn-limit:200 datapath:datapath db-prefix:/weavedb/weave-net docker-api: expect-npc:true host-root:/host http-addr:127.0.0.1:6784 ipalloc-init:consensus=2 ipalloc-range:10.32.0.0/12 metrics-addr:0.0.0.0:6782 name:b6:01:79:66:7d:d3 nickname:redhat-master no-dns:true port:6783]
INFO: 2020/07/05 07:28:04.924448 weave 2.6.5
FATA: 2020/07/05 07:28:04.938587 Existing bridge type "bridge" is different than requested "bridged_fastdp". Please do 'weave reset' and try again
Update:
So basically the issue is that iptables is deprecated in RHEL 8. But after downgrading my OS to RHEL 7, I can only access the NodePort on the node where the pod is deployed.
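For anyone hitting the same symptom: NodePort access from every node depends on kube-proxy programming the forwarding rules on each node, so a quick per-node check is possible (a sketch, assuming kube-proxy runs in the default iptables mode; 30080 is a placeholder for your actual NodePort):
$ sudo iptables -t nat -L KUBE-NODEPORTS -n | grep 30080       # run on each worker; a missing rule points at kube-proxy or the iptables backend
$ kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=20  # kubeadm's kube-proxy pods carry the k8s-app=kube-proxy label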

Kubernetes Canal CNI error on masters

I'm setting up a Kubernetes cluster for a customer.
I've done this process multiple times before, including dealing with Vagrant specifics, and I've always been able to get a K8s cluster up and running without too much fuss.
Now, at this customer, I'm doing the same, but I've been running into a lot of issues while setting things up, which is completely unexpected.
Compared to the other places where I've set up Kubernetes, the only obvious difference I see is that I have a proxy server which I constantly have to battle with. Nothing that a NO_PROXY env var hasn't been able to handle.
The main issue I'm facing is setting up Canal (Calico + Flannel).
For some reason, on Masters 2 and 3 it just won't start.
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
kube-system canal-2pvpr 2/3 CrashLoopBackOff 7 14m 10.136.3.37 devmn2.cpdprd.pt
kube-system canal-rdmnl 2/3 CrashLoopBackOff 7 14m 10.136.3.38 devmn3.cpdprd.pt
kube-system canal-swxrw 3/3 Running 0 14m 10.136.3.36 devmn1.cpdprd.pt
kube-system kube-apiserver-devmn1.cpdprd.pt 1/1 Running 1 1h 10.136.3.36 devmn1.cpdprd.pt
kube-system kube-apiserver-devmn2.cpdprd.pt 1/1 Running 1 4h 10.136.3.37 devmn2.cpdprd.pt
kube-system kube-apiserver-devmn3.cpdprd.pt 1/1 Running 1 1h 10.136.3.38 devmn3.cpdprd.pt
kube-system kube-controller-manager-devmn1.cpdprd.pt 1/1 Running 0 15m 10.136.3.36 devmn1.cpdprd.pt
kube-system kube-controller-manager-devmn2.cpdprd.pt 1/1 Running 0 15m 10.136.3.37 devmn2.cpdprd.pt
kube-system kube-controller-manager-devmn3.cpdprd.pt 1/1 Running 0 15m 10.136.3.38 devmn3.cpdprd.pt
kube-system kube-dns-86f4d74b45-vqdb4 0/3 ContainerCreating 0 1h <none> devmn2.cpdprd.pt
kube-system kube-proxy-4j7dp 1/1 Running 1 2h 10.136.3.38 devmn3.cpdprd.pt
kube-system kube-proxy-l2wpm 1/1 Running 1 2h 10.136.3.36 devmn1.cpdprd.pt
kube-system kube-proxy-scm9g 1/1 Running 1 2h 10.136.3.37 devmn2.cpdprd.pt
kube-system kube-scheduler-devmn1.cpdprd.pt 1/1 Running 1 1h 10.136.3.36 devmn1.cpdprd.pt
kube-system kube-scheduler-devmn2.cpdprd.pt 1/1 Running 1 4h 10.136.3.37 devmn2.cpdprd.pt
kube-system kube-scheduler-devmn3.cpdprd.pt 1/1 Running 1 1h 10.136.3.38 devmn3.cpdprd.pt
Looking for the specific error, I've come to find out that the issue is with the kube-flannel container, which is throwing an error:
[exXXXXX@devmn1 ~]$ kubectl logs canal-rdmnl -n kube-system -c kube-flannel
I0518 16:01:22.555513 1 main.go:487] Using interface with name ens192 and address 10.136.3.38
I0518 16:01:22.556080 1 main.go:504] Defaulting external address to interface address (10.136.3.38)
I0518 16:01:22.565141 1 kube.go:130] Waiting 10m0s for node controller to sync
I0518 16:01:22.565167 1 kube.go:283] Starting kube subnet manager
I0518 16:01:23.565280 1 kube.go:137] Node controller sync successful
I0518 16:01:23.565311 1 main.go:234] Created subnet manager: Kubernetes Subnet Manager - devmn3.cpdprd.pt
I0518 16:01:23.565331 1 main.go:237] Installing signal handlers
I0518 16:01:23.565388 1 main.go:352] Found network config - Backend type: vxlan
I0518 16:01:23.565440 1 vxlan.go:119] VXLAN config: VNI=1 Port=0 GBP=false DirectRouting=false
E0518 16:01:23.565619 1 main.go:279] Error registering network: failed to acquire lease: node "devmn3.cpdprd.pt" pod cidr not assigned
I0518 16:01:23.565671 1 main.go:332] Stopping shutdownHandler...
I just can't understand why.
Some relevant info:
My clusterCIDR and podCIDR are: 192.168.151.0/25 (I know, it's weird, don't ask unless it's a huge issue)
I've set up etcd on systemd.
I've modified the kube-controller-manager.yaml to change the mask size to 25 (otherwise the CIDR mentioned above wouldn't work).
I'm installing everything with kubeadm. One weird thing I did notice was that, when viewing the config (kubeadm config view), much of the information that I had set up in the kubeadm config.yaml (for kubeadm init) was not present in the config view, including the paths to the etcd certs.
I'm also not sure why that happened, but I've fixed it (hopefully) by editing the kubeadm config map (kubectl edit cm kubeadm-config -n kube-system) and saving it.
Still no luck with canal.
Can anyone help me figure out what's wrong?
I have documented pretty much every step of the configuration I've done, so if required I may be able to provide it.
EDIT:
I figured out in the meantime that indeed my masters 2 and 3 do not have a podCIDR associated. Why would this happen? And how can I add it?
Try to edit:
/etc/kubernetes/manifests/kube-controller-manager.yaml
and add
--allocate-node-cidrs=true
--cluster-cidr=192.168.151.0/25
then, reload kubelet.
I found this information here and it was useful for me.
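Two additions that may help verify the fix (not part of the original answer). The per-node allocation can be checked directly:
$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
And the manifest edit amounts to adding the two flags to the kube-controller-manager command list; a sketch of the relevant part of /etc/kubernetes/manifests/kube-controller-manager.yaml (everything else stays as kubeadm generated it):
spec:
  containers:
  - command:
    - kube-controller-manager
    - --allocate-node-cidrs=true
    - --cluster-cidr=192.168.151.0/25
    # ...remaining flags unchanged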

flannel pods in CrashLoopBackoff Error in kubernetes

Using flannel as the CNI in Kubernetes, I am trying to set up a network for pod-to-pod communication spread across different Vagrant VMs. I am using this https://raw.githubusercontent.com/coreos/flannel/v0.9.0/Documentation/kube-flannel.yml to create the flannel pods. But the kube-flannel pods go into CrashLoopBackOff and do not start.
[root@flnode-04 ~]# kubectl get pods -o wide --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
diamanti-system collectd-v0.5-flnode-04 1/1 Running 0 3h 192.168.30.14 flnode-04
diamanti-system collectd-v0.5-flnode-05 1/1 Running 0 3h 192.168.30.15 flnode-05
diamanti-system collectd-v0.5-flnode-06 1/1 Running 0 3h 192.168.30.16 flnode-06
diamanti-system provisioner-d4kvf 1/1 Running 0 3h 192.168.30.16 flnode-06
kube-system kube-flannel-ds-2kqpv 0/1 CrashLoopBackOff 1 18m 192.168.30.14 flnode-04
kube-system kube-flannel-ds-xgqdm 0/1 CrashLoopBackOff 1 18m 192.168.30.16 flnode-06
kube-system kube-flannel-ds-z59jz 0/1 CrashLoopBackOff 1 18m 192.168.30.15 flnode-05
Here are the logs of one of the pods:
[root@flnode-04 ~]# kubectl logs kube-flannel-ds-2kqpv --namespace=kube-system
I0327 10:28:44.103425 1 main.go:483] Using interface with name mgmt0 and address 192.168.30.14
I0327 10:28:44.105609 1 main.go:500] Defaulting external address to interface address (192.168.30.14)
I0327 10:28:44.138132 1 kube.go:130] Waiting 10m0s for node controller to sync
I0327 10:28:44.138213 1 kube.go:283] Starting kube subnet manager
I0327 10:28:45.138509 1 kube.go:137] Node controller sync successful
I0327 10:28:45.138588 1 main.go:235] Created subnet manager: Kubernetes Subnet Manager - flnode-04
I0327 10:28:45.138596 1 main.go:238] Installing signal handlers
I0327 10:28:45.138690 1 main.go:348] Found network config - Backend type: vxlan
I0327 10:28:45.138767 1 vxlan.go:119] VXLAN config: VNI=1 Port=0 GBP=false DirectRouting=false
panic: assignment to entry in nil map
goroutine 1 [running]:
github.com/coreos/flannel/subnet/kube.(*kubeSubnetManager).AcquireLease(0xc420010cd0, 0x7f5314399bd0, 0xc420347880, 0xc4202213e0, 0x6, 0xf54, 0xc4202213e0)
/go/src/github.com/coreos/flannel/subnet/kube/kube.go:239 +0x1f7
github.com/coreos/flannel/backend/vxlan.(*VXLANBackend).RegisterNetwork(0xc4200b3480, 0x7f5314399bd0, 0xc420347880, 0xc420010c30, 0xc4200b3480, 0x0, 0x0, 0x4d0181)
/go/src/github.com/coreos/flannel/backend/vxlan/vxlan.go:141 +0x44e
main.main()
/go/src/github.com/coreos/flannel/main.go:278 +0x8ae
What exactly is the reason for the flannel pods going into CrashLoopBackOff, and what is the solution?
I was able to solve the problem by running the command
kubectl annotate node appserv9 flannel.alpha.coreos.com/public-ip=10.10.10.10 --overwrite=true
Reason for the bug: a nil map in the code (no key available).
It doesn't matter what IP you give, but this command has to be run individually on all the nodes so that the code above no longer tries to assign to a nil map.
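To avoid typing this once per node, the same annotation can be applied in a loop (a sketch; 10.10.10.10 is the placeholder IP from the answer above, although using each node's real address is also reasonable):
for n in $(kubectl get nodes -o name); do
  kubectl annotate "$n" flannel.alpha.coreos.com/public-ip=10.10.10.10 --overwrite=true
done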
If you deploy the cluster with kubeadm init --pod-network-cidr=network/mask, this network/mask should match the one in the ConfigMap in kube-flannel.yml.
My ConfigMap looks like:
kind: ConfigMap
apiVersion: v1
metadata:
  name: kube-flannel-cfg
  namespace: kube-system
data:
  ...
  net-conf.json: |
    {
      "Network": "10.244.0.0/16",
      "Backend": {
        "Type": "vxlan"
      }
    }
So the network/mask should equal 10.244.0.0/16
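If in doubt about which CIDR the cluster was actually initialized with, it can be read back from the cluster itself (a hedged check for kubeadm-built clusters; the ConfigMap name and pod label are the kubeadm defaults):
$ kubectl -n kube-system get cm kubeadm-config -o yaml | grep -i podSubnet
$ kubectl -n kube-system get pod -l component=kube-controller-manager -o yaml | grep -- --cluster-cidr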

Rancher Kubernetes Dashboard - Service Unavailable

I am new to Rancher and containers in general. While setting up a Kubernetes cluster using Rancher, I'm facing a problem accessing the Kubernetes dashboard.
rancher/server: 1.6.6
Single node Rancher server + External MySQL + 3 agent nodes
Infrastructure Stack versions:
healthcheck: v0.3.1
ipsec: net:v0.11.5
network-services: metadata:v0.9.2 / network-manager:v0.7.7
scheduler: k8s:v1.7.2-rancher5
kubernetes (if applicable): kubernetes-agent:v0.6.3
# docker info
Containers: 1
Running: 1
Paused: 0
Stopped: 0
Images: 1
Server Version: 17.03.1-ce
Storage Driver: overlay
Backing Filesystem: extfs
Supports d_type: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 4ab9917febca54791c5f071a9d1f404867857fcc
runc version: 54296cf40ad8143b62dbcaa1d90e520a2136ddfe
init version: 949e6fa
Security Options:
seccomp
Profile: default
Kernel Version: 4.9.34-rancher
Operating System: RancherOS v1.0.3
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.798 GiB
Name: ch7radod1
ID: IUNS:4WT2:Y3TV:2RI4:FZQO:4HYD:YSNN:6DPT:HMQ6:S2SI:OPGH:TX4Y
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Http Proxy: http://proxy.ch.abc.net:8080
Https Proxy: http://proxy.ch.abc.net:8080
No Proxy: localhost,.xyz.net,abc.net
Registry: https://index.docker.io/v1/
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Accessing UI URL http://10.216.30.10/r/projects/1a6633/kubernetes-dashboard:9090/# shows “Service unavailable”
If I use the CLI section from the UI, I get the following:
> kubectl get nodes
NAME STATUS AGE VERSION
ch7radod3 Ready 1d v1.7.2
ch7radod4 Ready 5d v1.7.2
ch7radod1 Ready 1d v1.7.2
> kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system heapster-4285517626-4njc2 0/1 ContainerCreating 0 5d
kube-system kube-dns-3942128195-ft56n 0/3 ContainerCreating 0 19d
kube-system kube-dns-646531078-z5lzs 0/3 ContainerCreating 0 5d
kube-system kubernetes-dashboard-716739405-lpj38 0/1 ContainerCreating 0 5d
kube-system monitoring-grafana-3552275057-qn0zf 0/1 ContainerCreating 0 5d
kube-system monitoring-influxdb-4110454889-79pvk 0/1 ContainerCreating 0 5d
kube-system tiller-deploy-737598192-f9gcl 0/1 ContainerCreating 0 5d
The setup uses a private registry (Artifactory). I checked Artifactory and I could see several Docker-related images present. I was going through the private registry section and I also saw this file. In case this file is required, where exactly do I keep it so that Rancher can fetch it and configure the Kubernetes dashboard?
UPDATE:
$ sudo ros engine switch docker-1.12.6
> ERRO[0031] Failed to load https://raw.githubusercontent.com/rancher/os-services/v1.0.3/index.yml: Get https://raw.githubusercontent.com/rancher/os-services/v1.0.3/index.yml: Proxy Authentication Required
> FATA[0031] docker-1.12.6 is not a valid engine
I thought maybe it's due to NGINX, so I stopped the NGINX container, but I am still getting the above error. Earlier I had tried the same command on this Rancher server and it used to work fine. It's working fine on the agent nodes, although they already have 1.12.6 configured.
UPDATE 2:
> kubectl -n kube-system get po
NAME READY STATUS RESTARTS AGE
heapster-4285517626-4njc2 1/1 Running 0 12d
kube-dns-2588877561-26993 0/3 ImagePullBackOff 0 5h
kube-dns-646531078-z5lzs 0/3 ContainerCreating 0 12d
kubernetes-dashboard-716739405-zq3s9 0/1 CrashLoopBackOff 67 5h
monitoring-grafana-3552275057-qn0zf 1/1 Running 0 12d
monitoring-influxdb-4110454889-79pvk 1/1 Running 0 12d
tiller-deploy-737598192-f9gcl 0/1 CrashLoopBackOff 72 12d
None of your pods are running; you need to resolve that issue first. Try restarting the whole cluster and check that all of the above pods reach Running status.
Based on @ivan.sim's suggestion, I posted 'UPDATE 2'. This finally got me looking in the right direction. I then started searching for the CrashLoopBackOff error online, came across this link, and tried the following command (using the CLI option from the Rancher console), which was actually quite similar to what @ivan.sim suggested above but helped me find the node where the dashboard process was running:
> kubectl get pods -a -o wide --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
kube-system heapster-4285517626-4njc2 1/1 Running 0 12d 10.42.224.157 radod4
kube-system kube-dns-2588877561-26993 0/3 ImagePullBackOff 0 5h <none> radod1
kube-system kube-dns-646531078-z5lzs 0/3 ContainerCreating 0 12d <none> radod4
kube-system kubernetes-dashboard-716739405-zq3s9 0/1 Error 70 5h 10.42.218.11 radod1
kube-system monitoring-grafana-3552275057-qn0zf 1/1 Running 0 12d 10.42.202.44 radod4
kube-system monitoring-influxdb-4110454889-79pvk 1/1 Running 0 12d 10.42.111.171 radod4
kube-system tiller-deploy-737598192-f9gcl 0/1 CrashLoopBackOff 76 12d 10.42.213.24 radod4
Then I went to the host where the process was executing and tried the following commands:
[rancher@radod1 ~]$
[rancher@radod1 ~]$ docker ps -a | grep dash
282334b0ed38 gcr.io/google_containers/kubernetes-dashboard-amd64@sha256:b537ce8988510607e95b8d40ac9824523b1f9029e6f9f90e9fccc663c355cf5d "/dashboard --insecur" About a minute ago Exited (1) 55 seconds ago k8s_kubernetes-dashboard_kubernetes-dashboard-716739405-zq3s9_kube-system_7b0afda7-8271-11e7-ae86-021bfe69c163_72
99836d7824fd gcr.io/google_containers/pause-amd64:3.0 "/pause" 5 hours ago Up 5 hours k8s_POD_kubernetes-dashboard-716739405-zq3s9_kube-system_7b0afda7-8271-11e7-ae86-021bfe69c163_1
[rancher@radod1 ~]$
[rancher@radod1 ~]$
[rancher@radod1 ~]$ docker logs 282334b0ed38
Using HTTP port: 8443
Creating API server client for https://10.43.0.1:443
Error while initializing connection to Kubernetes apiserver. This most likely means that the cluster is misconfigured (e.g., it has invalid apiserver certificates or service accounts configuration) or the --apiserver-host param points to a server that does not exist. Reason: the server has asked for the client to provide credentials
Refer to the troubleshooting guide for more information: https://github.com/kubernetes/dashboard/blob/master/docs/user-guide/troubleshooting.md
After I got the above error, I again searched online and tried a few things. Finally, this link helped. After I executed the following commands on all agent nodes, the Kubernetes dashboard finally started working!
docker volume rm etcd
rm -rf /var/etcd/backups/*