I am still new to Kubernetes and I was trying to set up a cluster on bare-metal servers according to the official documentation.
Right now I am running a one-master, one-worker configuration, but I am struggling to get all the pods running once the cluster initializes. The main problem is the coredns pods, which are stuck in the ContainerCreating state.
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-78fcd69978-4vtsp 0/1 ContainerCreating 0 5s
kube-system coredns-78fcd69978-wtn2c 0/1 ContainerCreating 0 12h
kube-system etcd-dcpoth24213118 1/1 Running 4 12h
kube-system kube-apiserver-dcpoth24213118 1/1 Running 0 12h
kube-system kube-controller-manager-dcpoth24213118 1/1 Running 0 12h
kube-system kube-proxy-8282p 1/1 Running 0 12h
kube-system kube-scheduler-dcpoth24213118 1/1 Running 0 12h
kube-system weave-net-6zz2j 2/2 Running 0 12h
After checking the logs I've noticed this error. The problem is I don't really know what the error is referring to.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 19s default-scheduler Successfully assigned kube-system/coredns-78fcd69978-4vtsp to dcpoth24213118
Warning FailedCreatePodSandBox 13s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "2521c9dd723f3fc50b3510791a8c35cbc9ec19768468eb3da3367274a4dfcbba" network for pod "coredns-78fcd69978-4vtsp": networkPlugin cni failed to set up pod "coredns-78fcd69978-4vtsp_kube-system" network: error getting ClusterInformation: Get "https://[10.43.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.43.0.1:443: connect: no route to host, failed to clean up sandbox container "2521c9dd723f3fc50b3510791a8c35cbc9ec19768468eb3da3367274a4dfcbba" network for pod "coredns-78fcd69978-4vtsp": networkPlugin cni failed to teardown pod "coredns-78fcd69978-4vtsp_kube-system" network: error getting ClusterInformation: Get "https://[10.43.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.43.0.1:443: connect: no route to host]
Normal SandboxChanged 10s (x2 over 12s) kubelet Pod sandbox changed, it will be killed and re-created.
I'm running the Kubernetes cluster behind a corporate proxy, and I've set the environment variables as follows.
export https_proxy=http://proxyIP:PORT
export http_proxy=http://proxyIP:PORT
export HTTP_PROXY="${http_proxy}"
export HTTPS_PROXY="${https_proxy}"
export NO_PROXY=localhost,127.0.0.1,master_node_IP,worker_node_IP,10.0.0.0/8,10.96.0.0/16
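For reference, shell exports like the ones above only affect processes started from that shell; services such as the container runtime usually need the proxy configured in their systemd unit as well. A sketch of handing the same settings to the runtime (assuming containerd managed by systemd; adjust the service name if you use Docker):
mkdir -p /etc/systemd/system/containerd.service.d
cat <<EOF > /etc/systemd/system/containerd.service.d/http-proxy.conf
[Service]
Environment="HTTP_PROXY=http://proxyIP:PORT"
Environment="HTTPS_PROXY=http://proxyIP:PORT"
Environment="NO_PROXY=localhost,127.0.0.1,master_node_IP,worker_node_IP,10.0.0.0/8,10.96.0.0/16"
EOF
systemctl daemon-reload && systemctl restart containerd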
[root@dcpoth24213118 ~]# kubectl get svc -A
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 12h
kube-system kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 12h
[root@dcpoth24213118 ~]# ip r s
default via 6.48.248.129 dev eth1
6.48.248.128/26 dev eth1 proto kernel scope link src 6.48.248.145
10.32.0.0/12 dev weave proto kernel scope link src 10.32.0.1
10.155.0.0/24 via 6.48.248.129 dev eth1
10.228.0.0/24 via 6.48.248.129 dev eth1
10.229.0.0/24 via 6.48.248.129 dev eth1
10.250.0.0/24 via 6.48.248.129 dev eth1
I've got the Weave network plugin installed. The issue is that I cannot create any other pods either; they all get stuck in the ContainerCreating state.
I've run out of ideas on how to fix it. Can someone give me a hint?
I have followed the instructions on this blog to create a simple container image and deploy it in a k8s cluster.
However, in my case the pods do not run:
student@master:~$ k get pod -o wide -l app=hello-python --field-selector spec.nodeName=master
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
hello-python-58547cf485-7l8dg 0/1 ErrImageNeverPull 0 2m26s 192.168.219.126 master <none> <none>
hello-python-598c594dc5-4c9zd 0/1 ErrImageNeverPull 0 2m26s 192.168.219.67 master <none> <none>
student@master:~$ sudo podman images hello-python
REPOSITORY TAG IMAGE ID CREATED SIZE
localhost/hello-python latest 11cf1e5a86b1 50 minutes ago 941 MB
student@master:~$ hostname
master
student@master:~$
I understand why it may not work on the worker node, but why does it not work on the same node where the image is cached, the master node?
student@master:~$ k describe pod hello-python-58547cf485-7l8dg | grep -A 10 'Events:'
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 10m default-scheduler Successfully assigned default/hello-python-58547cf485-7l8dg to master
Warning Failed 8m7s (x12 over 10m) kubelet Error: ErrImageNeverPull
Warning ErrImageNeverPull 4m59s (x27 over 10m) kubelet Container image "localhost/hello-python:latest" is not present with pull policy of Never
student@master:~$
My question is: how do I make the pod run on the master node with imagePullPolicy: Never, given that the image in question is available on the master node, as the podman images output attests?
EDIT 1
I am using a k8s cluster running on two VMs deployed in GCE. It was set up with a script provided in the context of the Linux Foundation Kubernetes Developer course LFD259.
EDIT 2
The master node is allowed to run workloads - this is how the LFD259 course sets it up. For example:
student@master:~$ k create deployment xyz --image=httpd
deployment.apps/xyz created
student@master:~$ k get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
xyz-6c6bd4cd89-qn4zr 1/1 Running 0 5m37s 192.168.171.66 worker <none> <none>
student@master:~$
student@master:~$ k scale deployment xyz --replicas=10
deployment.apps/xyz scaled
student@master:~$ k get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
xyz-6c6bd4cd89-c2xv4 1/1 Running 0 73s 192.168.219.71 master <none> <none>
xyz-6c6bd4cd89-g89k2 0/1 ContainerCreating 0 73s <none> master <none> <none>
xyz-6c6bd4cd89-jfftl 0/1 ContainerCreating 0 73s <none> worker <none> <none>
xyz-6c6bd4cd89-kbdnq 1/1 Running 0 73s 192.168.219.106 master <none> <none>
xyz-6c6bd4cd89-nm6rt 0/1 ContainerCreating 0 73s <none> worker <none> <none>
xyz-6c6bd4cd89-qn4zr 1/1 Running 0 7m22s 192.168.171.66 worker <none> <none>
xyz-6c6bd4cd89-vts6x 1/1 Running 0 73s 192.168.171.84 worker <none> <none>
xyz-6c6bd4cd89-wd2ls 1/1 Running 0 73s 192.168.171.127 worker <none> <none>
xyz-6c6bd4cd89-wv4jn 0/1 ContainerCreating 0 73s <none> worker <none> <none>
xyz-6c6bd4cd89-xvtlm 0/1 ContainerCreating 0 73s <none> master <none> <none>
student@master:~$
It depends on how you've set up your Kubernetes cluster; I assume you've installed it with kubeadm. By default, the master is not schedulable for workloads. From my understanding, the image you're talking about only exists on the master node, right? If that's the case, you can't start a pod with that image, as it only exists on the master node, which doesn't allow workloads by default.
If you were to copy the image to the worker node, your given command should work.
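For example, a minimal sketch of copying the image over SSH (assuming the worker host is reachable as worker and has podman installed; adjust the names to your setup):
podman save localhost/hello-python:latest -o hello-python.tar
scp hello-python.tar worker:/tmp/
ssh worker 'sudo podman load -i /tmp/hello-python.tar'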
However, if you want to make your master node schedulable, just remove the taint with (you may need to amend the last bit if it differs from yours):
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
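On older Kubernetes releases the control-plane taint is still named master, in which case the equivalent command is:
kubectl taint nodes --all node-role.kubernetes.io/master-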
I have a DaemonSet configuration that runs on all nodes.
Every pod listens on port 34567. I want a pod on a different node to be able to communicate with this pod. How can I achieve that?
Find the target Pod's IP address as shown below
controlplane $ k get po -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
coredns-fb8b8dccf-42pq8 1/1 Running 1 5m43s 10.88.0.4 node01 <none> <none>
coredns-fb8b8dccf-f9n5x 1/1 Running 1 5m43s 10.88.0.3 node01 <none> <none>
etcd-controlplane 1/1 Running 0 4m38s 172.17.0.23 controlplane <none> <none>
katacoda-cloud-provider-74dc75cf99-2jrpt 1/1 Running 3 5m42s 10.88.0.2 node01 <none> <none>
kube-apiserver-controlplane 1/1 Running 0 4m33s 172.17.0.23 controlplane <none> <none>
kube-controller-manager-controlplane 1/1 Running 0 4m45s 172.17.0.23 controlplane <none> <none>
kube-keepalived-vip-smkdc 1/1 Running 0 5m27s 172.17.0.26 node01 <none> <none>
kube-proxy-8sxkt 1/1 Running 0 5m27s 172.17.0.26 node01 <none> <none>
kube-proxy-jdcqc 1/1 Running 0 5m43s 172.17.0.23 controlplane <none> <none>
kube-scheduler-controlplane 1/1 Running 0 4m47s 172.17.0.23 controlplane <none> <none>
weave-net-8cxqg 2/2 Running 1 5m27s 172.17.0.26 node01 <none> <none>
weave-net-s4tcj 2/2 Running 1 5m43s 172.17.0.23 controlplane <none> <none>
Next "exec" into the originating pod - kube-proxy-8sxkt in my example
kubectl -n kube-system exec -it kube-proxy-8sxkt sh
Next, you will use the destination pod's IP and port (10256 - my example) number to connect. Please note that you may have to install curl/telnet if your originating container's image does not include the application
# curl telnet://172.17.0.23:10256
HTTP/1.1 400 Bad Request
Content-Type: text/plain; charset=utf-8
Connection: close
You can call the pod via its IP.
Note: this IP can only be used inside the k8s cluster.
The pod's IP address is an option you can use, but keep in mind that pod IPs can change from time to time due to deployment and scaling changes.
I would suggest exposing the DaemonSet with a Service of type NodePort if you have a fixed number of nodes and not much autoscaling (see the sketch after the link below).
If you want to connect to the pod on a specific node, you can use the IP of the node on which the pod is scheduled together with the NodePort service:
Node IP:Node port
Read more at : https://kubernetes.io/docs/concepts/services-networking/service/#type-nodeport
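A minimal sketch of such a Service (the name my-daemonset-svc, the selector app: my-daemonset, and the nodePort 30567 are placeholders; adjust them to your DaemonSet's labels):
kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: my-daemonset-svc
spec:
  type: NodePort
  selector:
    app: my-daemonset
  ports:
  - port: 34567
    targetPort: 34567
    nodePort: 30567
EOF
Other pods or external clients can then reach the DaemonSet at <NodeIP>:30567.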
If you don't need a specific pod and any of the DaemonSet's replicas will do, you can use the service name to connect pods with each other:
my-svc.my-namespace.svc.cluster-domain.example
Read more about Service and pod DNS:
https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/
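For example, with a ClusterIP Service named my-svc in namespace my-namespace selecting the DaemonSet's pods (placeholder names), another pod could reach it with:
curl http://my-svc.my-namespace.svc.cluster.local:34567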
I have one master and two worker nodes (worker-1 and worker-2). All the nodes are up and running without any issue. When I planned to install the Istio service mesh, I tried to deploy the sample Bookinfo application.
After deploying Bookinfo, I verified the pod status by running the command below:
root@master:~# kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
details-v1-79c697d759-9k98l 2/2 Running 0 11h 10.200.226.104 worker-1 <none> <none>
productpage-v1-65576bb7bf-zsf6f 2/2 Running 0 11h 10.200.226.107 worker-1 <none> <none>
ratings-v1-7d99676f7f-zxrtq 2/2 Running 0 11h 10.200.226.105 worker-1 <none> <none>
reviews-v1-987d495c-hsnmc 1/2 Running 0 21m 10.200.133.194 worker-2 <none> <none>
reviews-v2-6c5bf657cf-jmbkr 1/2 Running 0 11h 10.200.133.252 worker-2 <none> <none>
reviews-v3-5f7b9f4f77-g2s6p 2/2 Running 0 11h 10.200.226.106 worker-1 <none> <none>
I have noticed that two pods are not fully running; their status shows 1/2 (both on the worker-2 node). I've spent almost two days but have not been able to find anything to fix the issue. Here is the describe pod output:
Warning Unhealthy 63s (x14 over 89s) kubelet Readiness probe failed: Get "http://10.244.133.194:15021/healthz/ready":
dial tcp 10.200.133.194:15021: connect: connection refused
Then this morning I realized there was some issue with the worker-2 node, since the pods there were stuck at 1/2 status, so I decided to cordon the node as below:
kubectl cordon worker-2
kubectl delete pod <worker-2 pod>
kubectl get pod -o wide
After cordoning the worker-2 node, I could see all the pods come up with status 2/2 on the worker-1 node without any issue.
root@master:~# kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
details-v1-79c697d759-9k98l 2/2 Running 0 11h 10.200.226.104 worker-1 <none> <none>
productpage-v1-65576bb7bf-zsf6f 2/2 Running 0 11h 10.200.226.107 worker-1 <none> <none>
ratings-v1-7d99676f7f-zxrtq 2/2 Running 0 11h 10.200.226.105 worker-1 <none> <none>
reviews-v1-987d495c-2n4d9 2/2 Running 0 17s 10.200.226.113 worker-1 <none> <none>
reviews-v2-6c5bf657cf-wzqpt 2/2 Running 0 17s 10.200.226.112 worker-1 <none> <none>
reviews-v3-5f7b9f4f77-g2s6p 2/2 Running 0 11h 10.200.226.106 worker-1 <none> <none>
Could someone please help me fix this issue so that the (pending) pods can be scheduled on the worker-2 node as well?
Note: when I redeploy across all the nodes (worker-1 and worker-2) again, the pod status goes back to 1/2.
root@master:~/istio-1.9.1/samples# kubectl logs -f ratings-v1-b6994bb9-wfckn -c istio-proxy
ates: 0 successful, 0 rejected
2021-04-21T07:12:19.941679Z warn Envoy proxy is NOT ready: config not received from Pilot (is Pilot running?): cds updates: 0 successful, 0 rejected; lds updates: 0 successful, 0 rejected
2021-04-21T07:12:21.942096Z warn Envoy proxy is NOT ready: config not received from Pilot (is Pilot running?): cds updates: 0 successful, 0 rejected; lds updates: 0 successful, 0 rejected
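The warning seems to say the sidecar cannot reach istiod (Pilot). A check I could run from the affected node (assuming Istio is installed in the istio-system namespace and the proxy image includes curl; <pod-on-worker-2> is a placeholder) would be:
kubectl -n istio-system get pods -l app=istiod -o wide
kubectl exec <pod-on-worker-2> -c istio-proxy -- curl -sS istiod.istio-system.svc:15014/version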
While this error message is just a symptom, my problem is real.
My bare-metal cluster experienced a certificate-expired situation. I managed to renew all certificates, but upon restart most pods wouldn't work. The pod that seems responsible is the flannel one (CrashLoopBackOff).
Logs for the flannel pods show:
I1120 22:24:00.541277 1 main.go:475] Determining IP address of default interface
I1120 22:24:00.541546 1 main.go:488] Using interface with name eth0 and address xxx.xxx.xxx.xxx
I1120 22:24:00.541565 1 main.go:505] Defaulting external address to interface address (xxx.xxx.xxx.xxx)
E1120 22:24:03.572745 1 main.go:232] Failed to create SubnetManager: error retrieving pod spec for 'kube-system/kube-flannel-ds-amd64-dmrzh': Get https://10.96.0.1:443/api/v1/namespaces/kube-system/pods/kube-flannel-ds-amd64-dmrzh: dial tcp 10.96.0.1:443: getsockopt: network is unreachable
On the host there is not even a flannel interface anymore, nor a systemd unit file.
Running flanneld manually yields this output:
I1120 20:12:15.923966 26361 main.go:446] Determining IP address of default interface
I1120 20:12:15.924171 26361 main.go:459] Using interface with name eth0 and address xxx.xxx.xxx.xxx
I1120 20:12:15.924187 26361 main.go:476] Defaulting external address to interface address (xxx.xxx.xxx.xxx)
E1120 20:12:15.924285 26361 main.go:223] Failed to create SubnetManager: asn1: structure error: tags don't match (16 vs {class:0 tag:2 length:1 isCompound:false}) {optional:false explicit:false application:false defaultValue:<nil> tag:<nil> stringType:0 timeType:0 set:false omitEmpty:false} tbsCertificate #2
The available pieces of evidence point in several directions, but upon checking each of them out, it points somewhere else. So I need a pointer to which part causes the problem.
Is it etcd?
Is it the new etcd certificate?
Is it the missing flannel interface?
Is it the non-operational flanneld?
Is it something not listed here?
If there is information missing here, please ask; I can surely provide it.
key specs:
- host: ubuntu 18.04
- kubeadm 1.13.2
Thank you and best regards,
scones
UPDATE1
$ k get cs,po,svc
NAME STATUS MESSAGE ERROR
componentstatus/controller-manager Healthy ok
componentstatus/scheduler Healthy ok
componentstatus/etcd-0 Healthy {"health": "true"}
NAME READY STATUS RESTARTS AGE
pod/cert-manager-6dc5c68468-hkb6j 0/1 Error 51 89d
pod/coredns-86c58d9df4-dtdxq 0/1 Completed 23 304d
pod/coredns-86c58d9df4-k7h7m 0/1 Completed 23 304d
pod/etcd-redacted 1/1 Running 2506 304d
pod/hostpath-provisioner-5c6754fbd4-ckvnp 0/1 Error 12 222d
pod/kube-apiserver-redacted 1/1 Running 1907 304d
pod/kube-controller-manager-redacted 1/1 Running 2682 304d
pod/kube-flannel-ds-amd64-dmrzh 0/1 CrashLoopBackOff 338 372d
pod/kube-proxy-q8jgs 1/1 Running 15 304d
pod/kube-scheduler-redacted 1/1 Running 2694 304d
pod/metrics-metrics-server-65cd865c9f-dbh85 0/1 Error 2658 120d
pod/tiller-deploy-865b88d89-8ftzs 0/1 Error 170 305d
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP 372d
service/metrics-metrics-server ClusterIP 10.97.186.19 <none> 443/TCP 120d
service/tiller-deploy ClusterIP 10.103.184.226 <none> 44134/TCP 354d
Unfortunately I don't recall how I installed flannel a year ago.
The kubectl version is also 1.13.2, as is the cluster.
The post linked by @hanx is about renewing certificates, not broken network overlays, so it is not applicable.
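To narrow it down, these are checks I can run; a sketch assuming the default kubeadm paths (<apiserver-ip> is a placeholder for the API server's advertise address):
# compare the certificate the API server actually serves with the renewed file on disk
openssl s_client -connect <apiserver-ip>:6443 </dev/null 2>/dev/null | openssl x509 -noout -subject -dates
openssl x509 -noout -subject -dates -in /etc/kubernetes/pki/apiserver.crt
# locate the flannel service account token secret to inspect the ca.crt it mounts
kubectl -n kube-system get secrets -o name | grep flannel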
I am trying to install Kubernetes 1.14.3 in an IPv6-only environment.
I don't have any IPv4 interface in this environment, only IPv6.
I tried with a plain kubeadm config file and it seems to work, but when I try to apply the Calico CNI, the calico-node pod keeps failing.
2019-07-28 07:15:26.714 [INFO][9] startup.go 331: Hit error connecting to datastore - retry error=Get https://[fd20::4001]:443/api/v1/nodes/foo: dial tcp [fd20::4001]:443: connect: network is unreachable
This is the status of the pods at the moment:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system calico-kube-controllers-6894d6f4f4-hwsmc 0/1 ContainerCreating 0 79s <none> master-eran <none> <none>
kube-system calico-node-fj8q7 0/1 Running 1 79s 2001:df0:8800:4::7 master-eran <none> <none>
kube-system coredns-fb8b8dccf-8b995 0/1 ContainerCreating 0 5m53s <none> master-eran <none> <none>
kube-system coredns-fb8b8dccf-fbpwq 0/1 ContainerCreating 0 5m53s <none> master-eran <none> <none>
kube-system etcd-master-eran 1/1 Running 0 4m56s 2001:df0:8800:4::7 master-eran <none> <none>
kube-system kube-apiserver-master-eran 1/1 Running 0 4m53s 2001:df0:8800:4::7 master-eran <none> <none>
kube-system kube-controller-manager-master-eran 1/1 Running 0 5m7s 2001:df0:8800:4::7 master-eran <none> <none>
kube-system kube-proxy-4qzb8 1/1 Running 0 5m53s 2001:df0:8800:4::7 master-eran <none> <none>
kube-system kube-scheduler-master-eran 1/1 Running 0 4m50s 2001:df0:8800:4::7 master-eran <none> <none>
I guess that the coredns and controller pods will only start once calico-node is running, but it keeps failing with the error I pasted earlier.
In the kubeadm config file I chose ipvs in the proxy configuration.
Does anyone have any idea how to solve this?
Thanks
NEW STATUS:
I was able to resolve the calico-node issue, but now I am failing on the calico-kube-controllers pod:
7-30 07:58:22.979 [ERROR][1] client.go 255: Error getting cluster information config ClusterInformation="default" error=Get https://[fd20::4001]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp [fd20::4001]:443: connect: permission denied
2019-07-30 07:58:22.979 [FATAL][1] main.go 118: Failed to initialize Calico datastore error=Get https://[fd20::4001]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp [fd20::4001]:443: connect: permission denied
According to the Calico documentation, you may need to perform a few additional steps before you can start using it with IPv6 only. You can read about enabling IPv6 with Kubernetes here.
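As an illustration of what those steps typically involve, here is a sketch of switching calico-node to IPv6-only via its documented environment variables (the CIDR value is a placeholder; verify the variable names against the Calico manifest version you applied):
kubectl -n kube-system set env daemonset/calico-node \
  IP=none \
  IP6=autodetect \
  FELIX_IPV6SUPPORT=true \
  CALICO_IPV6POOL_CIDR=<your-ipv6-pod-cidr>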