K3s cluster not starting (anymore)

My local k3s playground suddenly stopped working. My gut feeling is that something is wrong with the HTTPS certs.
I start the cluster from docker-compose using:
version: '3.2'
services:
  server:
    image: rancher/k3s:latest
    command: server --disable-agent --tls-san 192.168.2.110
    environment:
      - K3S_CLUSTER_SECRET=somethingtotallyrandom
      - K3S_KUBECONFIG_OUTPUT=/output/kubeconfig.yaml
      - K3S_KUBECONFIG_MODE=666
    volumes:
      - k3s-server:/var/lib/rancher/k3s
      # get the kubeconfig file
      - .:/output
      - ./registries.yaml:/etc/rancher/k3s/registries.yaml
    ports:
      - 192.168.2.110:6443:6443
  node:
    image: rancher/k3s:latest
    volumes:
      - ./registries.yaml:/etc/rancher/k3s/registries.yaml
    tmpfs:
      - /run
      - /var/run
    privileged: true
    environment:
      - K3S_URL=https://server:6443
      - K3S_CLUSTER_SECRET=somethingtotallyrandom
    ports:
      - 31000-32000:31000-32000
volumes:
  k3s-server: {}
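For reference, I bring it up and follow both services' logs with the usual Compose commands (service names as defined above):
docker-compose up -d
docker-compose logs -f server node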
Nothing special. The registries.yaml can be commented in or out without making a difference. Its contents:
mirrors:
  "192.168.2.110:5055":
    endpoint:
      - "http://192.168.2.110:5055"
However, I now get a bunch of weird failures:
server_1 | E0516 22:58:03.264451 1 available_controller.go:420] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: Get https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
server_1 | E0516 22:58:08.265272 1 available_controller.go:420] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: Get https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
node_1 | I0516 22:58:12.695365 1 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: bb7ee4b14724692f4497e99716b68c4dc4fe77333b03801909092d42c00ef5a2
node_1 | I0516 22:58:15.006306 1 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: bb7ee4b14724692f4497e99716b68c4dc4fe77333b03801909092d42c00ef5a2
node_1 | I0516 22:58:15.006537 1 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: fc2e51300f2ec06949abf5242690cb36077adc409f0d7f131a9d4f911063b63c
node_1 | E0516 22:58:15.006757 1 pod_workers.go:191] Error syncing pod e127dc88-e252-4e2e-bbd5-2e93ce5e32ff ("helm-install-traefik-jfrjk_kube-system(e127dc88-e252-4e2e-bbd5-2e93ce5e32ff)"), skipping: failed to "StartContainer" for "helm" with CrashLoopBackOff: "back-off 1m20s restarting failed container=helm pod=helm-install-traefik-jfrjk_kube-system(e127dc88-e252-4e2e-bbd5-2e93ce5e32ff)"
server_1 | E0516 22:58:22.345501 1 resource_quota_controller.go:408] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
node_1 | I0516 22:58:27.695296 1 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: fc2e51300f2ec06949abf5242690cb36077adc409f0d7f131a9d4f911063b63c
node_1 | E0516 22:58:27.695989 1 pod_workers.go:191] Error syncing pod e127dc88-e252-4e2e-bbd5-2e93ce5e32ff ("helm-install-traefik-jfrjk_kube-system(e127dc88-e252-4e2e-bbd5-2e93ce5e32ff)"), skipping: failed to "StartContainer" for "helm" with CrashLoopBackOff: "back-off 1m20s restarting failed container=helm pod=helm-install-traefik-jfrjk_kube-system(e127dc88-e252-4e2e-bbd5-2e93ce5e32ff)"
server_1 | I0516 22:58:30.328999 1 request.go:621] Throttling request took 1.047650754s, request: GET:https://127.0.0.1:6444/apis/admissionregistration.k8s.io/v1beta1?timeout=32s
server_1 | W0516 22:58:31.081020 1 garbagecollector.go:644] failed to discover some groups: map[metrics.k8s.io/v1beta1:the server is currently unable to handle the request]
server_1 | E0516 22:58:36.442904 1 available_controller.go:420] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: Get https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
node_1 | I0516 22:58:40.695404 1 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: fc2e51300f2ec06949abf5242690cb36077adc409f0d7f131a9d4f911063b63c
node_1 | E0516 22:58:40.696176 1 pod_workers.go:191] Error syncing pod e127dc88-e252-4e2e-bbd5-2e93ce5e32ff ("helm-install-traefik-jfrjk_kube-system(e127dc88-e252-4e2e-bbd5-2e93ce5e32ff)"), skipping: failed to "StartContainer" for "helm" with CrashLoopBackOff: "back-off 1m20s restarting failed container=helm pod=helm-install-traefik-jfrjk_kube-system(e127dc88-e252-4e2e-bbd5-2e93ce5e32ff)"
server_1 | E0516 22:58:41.443295 1 available_controller.go:420] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: Get https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Also, it seems my node is not really connecting to the server anymore:
user@ipc:~/dev/test_mk3s_docker$ docker exec -it $(docker ps |grep "k3s server"|awk -F\ '{print $1}') kubectl cluster-info
Kubernetes master is running at https://127.0.0.1:6443
CoreDNS is running at https://127.0.0.1:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
Metrics-server is running at https://127.0.0.1:6443/api/v1/namespaces/kube-system/services/https:metrics-server:/proxy
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
user@ipc:~/dev/test_mk3s_docker$ docker exec -it $(docker ps |grep "k3s agent"|awk -F\ '{print $1}') kubectl cluster-info
error: Missing or incomplete configuration info. Please point to an existing, complete config file:
1. Via the command-line flag --kubeconfig
2. Via the KUBECONFIG environment variable
3. In your home directory as ~/.kube/config
To view or setup config directly use the 'config' command.
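As far as I know the agent container ships no kubeconfig by design. Given the compose file above (K3S_KUBECONFIG_OUTPUT plus the .:/output mount), the server writes a kubeconfig to the host's current directory, so a host-side check should work along these lines (the explicit --server is needed because port 6443 is only published on 192.168.2.110):
kubectl --kubeconfig ./kubeconfig.yaml --server https://192.168.2.110:6443 get nodes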
If I run `kubectl get apiservice` I get the following output:
NAME                        SERVICE                      AVAILABLE                      AGE
v1beta1.storage.k8s.io      Local                        True                           20m
v1beta1.scheduling.k8s.io   Local                        True                           20m
v1.storage.k8s.io           Local                        True                           20m
v1.k3s.cattle.io            Local                        True                           20m
v1.helm.cattle.io           Local                        True                           20m
v1beta1.metrics.k8s.io      kube-system/metrics-server   False (FailedDiscoveryCheck)   20m
Also, downgrading k3s to k3s:v1.0.1 only changes the error message:
server_1 | E0516 23:46:02.951073 1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1beta1.CSINode: no kind "CSINode" is registered for version "storage.k8s.io/v1" in scheme "k8s.io/kubernetes/pkg/api/legacyscheme/scheme.go:30"
server_1 | E0516 23:46:03.444519 1 status.go:71] apiserver received an error that is not an metav1.Status: &runtime.notRegisteredErr{schemeName:"k8s.io/kubernetes/pkg/api/legacyscheme/scheme.go:30", gvk:schema.GroupVersionKind{Group:"storage.k8s.io", Version:"v1", Kind:"CSINode"}, target:runtime.GroupVersioner(nil), t:reflect.Type(nil)}
After executing
docker exec -it $(docker ps |grep "k3s server"|awk -F\ '{print $1}') kubectl --namespace kube-system delete apiservice v1beta1.metrics.k8s.io
I only get:
node_1 | W0517 07:03:06.346944 1 info.go:51] Couldn't collect info from any of the files in "/etc/machine-id,/var/lib/dbus/machine-id"
node_1 | I0517 07:03:21.504932 1 log.go:172] http: TLS handshake error from 10.42.1.15:53888: remote error: tls: bad certificate


Readiness fails in the Eclipse Hono pods of the Cloud2Edge package

I am a bit desperate and I hope someone can help me. A few months ago I installed the Eclipse Cloud2Edge package on a Kubernetes cluster by following the installation instructions, creating a PersistentVolume and running the helm install command with these options:
helm install -n $NS --wait --timeout 15m $RELEASE eclipse-iot/cloud2edge --set hono.prometheus.createInstance=false --set hono.grafana.enabled=false --dependency-update --debug
The YAML of the PersistentVolume is the following; I create it in the same namespace in which I install the package.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-device-registry
spec:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 1Mi
  hostPath:
    path: /mnt/
    type: Directory
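For reference, I create it with kubectl (the filename is simply whatever the manifest above is saved as):
kubectl apply -f pv-device-registry.yaml -n $NS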
Everything worked perfectly, all pods were ready and running, until the other day when the cluster crashed and some pods stopped working.
The kubectl get pods -n $NS output is as follows:
NAME                                          READY   STATUS    RESTARTS   AGE
ditto-mongodb-7b78b468fb-8kshj                1/1     Running   0          50m
dt-adapter-amqp-vertx-6699ccf495-fc8nx        0/1     Running   0          50m
dt-adapter-http-vertx-545564ff9f-gx5fp        0/1     Running   0          50m
dt-adapter-mqtt-vertx-58c8975678-k5n49        0/1     Running   0          50m
dt-artemis-6759fb6cb8-5rq8p                   1/1     Running   1          50m
dt-dispatch-router-5bc7586f76-57dwb           1/1     Running   0          50m
dt-ditto-concierge-f6d5f6f9c-pfmcw            1/1     Running   0          50m
dt-ditto-connectivity-f556db698-q89bw         1/1     Running   0          50m
dt-ditto-gateway-589d8f5596-59c5b             1/1     Running   0          50m
dt-ditto-nginx-897b5bc76-cx2dr                1/1     Running   0          50m
dt-ditto-policies-75cb5c6557-j5zdg            1/1     Running   0          50m
dt-ditto-swaggerui-6f6f989ccd-jkhsk           1/1     Running   0          50m
dt-ditto-things-79ff869bc9-l9lct              1/1     Running   0          50m
dt-ditto-thingssearch-58c5578bb9-pwd9k        1/1     Running   0          50m
dt-service-auth-698d4cdfff-ch5wp              1/1     Running   0          50m
dt-service-command-router-59d6556b5f-4nfcj    0/1     Running   0          50m
dt-service-device-registry-7cf75d794f-pk9ct   0/1     Running   0          50m
The pods that fail all have the same error when running kubectl describe pod POD_NAME -n $NS:
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  53m                    default-scheduler  Successfully assigned digitaltwins/dt-service-command-router-59d6556b5f-4nfcj to node1
  Normal   Pulled     53m                    kubelet            Container image "index.docker.io/eclipse/hono-service-command-router:1.8.0" already present on machine
  Normal   Created    53m                    kubelet            Created container service-command-router
  Normal   Started    53m                    kubelet            Started container service-command-router
  Warning  Unhealthy  52m                    kubelet            Readiness probe failed: Get "https://10.244.1.89:8088/readiness": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  2m58s (x295 over 51m)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503
According to this, the readinessProbe fails. In the YAML definition of the affected deployments, the readinessProbe is defined as:
readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /readiness
    port: health
    scheme: HTTPS
  initialDelaySeconds: 45
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
I have tried increasing these values, raising the delay to 600 and the timeout to 10. I have also tried uninstalling the package and installing it again, but nothing changes: the installation fails because the pods are never ready and the timeout pops up. I have also exposed port 8088 (health) and called /readiness with wget, and the result is still 503. On the other hand, I have tested whether the livenessProbe works, and it works fine. I have also tried resetting the cluster: first I manually deleted everything in it, and then I used the following commands:
sudo kubeadm reset
sudo iptables -F && sudo iptables -t nat -F && sudo iptables -t mangle -F && sudo iptables -X
sudo systemctl stop kubelet
sudo systemctl stop docker
sudo rm -rf /var/lib/cni/
sudo rm -rf /var/lib/kubelet/*
sudo rm -rf /etc/cni/
sudo ifconfig cni0 down
sudo ifconfig flannel.1 down
sudo ifconfig docker0 down
sudo ip link set cni0 down
sudo brctl delbr cni0
sudo systemctl start docker
sudo kubeadm init --apiserver-advertise-address=192.168.44.11 --pod-network-cidr=10.244.0.0/16
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
kubectl --kubeconfig $HOME/.kube/config apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
The cluster itself seems to work fine, because the Eclipse Ditto part has no problem; it's just the Eclipse Hono part. I add a little more information in case it may be useful.
The kubectl logs dt-service-command-router-b654c8dcb-s2g6t -n $NS output:
12:30:06.340 [vert.x-eventloop-thread-1] ERROR io.vertx.core.net.impl.NetServerImpl - Client from origin /10.244.1.101:44142 failed to connect over ssl: javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_unknown
12:30:06.756 [vert.x-eventloop-thread-1] ERROR io.vertx.core.net.impl.NetServerImpl - Client from origin /10.244.1.100:46550 failed to connect over ssl: javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_unknown
12:30:07.876 [vert.x-eventloop-thread-1] ERROR io.vertx.core.net.impl.NetServerImpl - Client from origin /10.244.1.102:40706 failed to connect over ssl: javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_unknown
12:30:08.315 [vert.x-eventloop-thread-1] DEBUG o.e.h.client.impl.HonoConnectionImpl - starting attempt [#258] to connect to server [dt-service-device-registry:5671, role: Device Registration]
12:30:08.315 [vert.x-eventloop-thread-1] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - OpenSSL [available: false, supports KeyManagerFactory: false]
12:30:08.315 [vert.x-eventloop-thread-1] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - using JDK's default SSL engine
12:30:08.315 [vert.x-eventloop-thread-1] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - enabling secure protocol [TLSv1.3]
12:30:08.315 [vert.x-eventloop-thread-1] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - enabling secure protocol [TLSv1.2]
12:30:08.315 [vert.x-eventloop-thread-1] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - connecting to AMQP 1.0 container [amqps://dt-service-device-registry:5671, role: Device Registration]
12:30:08.339 [vert.x-eventloop-thread-1] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - can't connect to AMQP 1.0 container [amqps://dt-service-device-registry:5671, role: Device Registration]: Failed to create SSL connection
12:30:08.339 [vert.x-eventloop-thread-1] WARN o.e.h.client.impl.HonoConnectionImpl - attempt [#258] to connect to server [dt-service-device-registry:5671, role: Device Registration] failed
javax.net.ssl.SSLHandshakeException: Failed to create SSL connection
The kubectl logs dt-adapter-amqp-vertx-74d69cbc44-7kmdq -n $NS output:
12:19:36.686 [vert.x-eventloop-thread-0] DEBUG o.e.h.client.impl.HonoConnectionImpl - starting attempt [#19] to connect to server [dt-service-device-registry:5671, role: Credentials]
12:19:36.686 [vert.x-eventloop-thread-0] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - OpenSSL [available: false, supports KeyManagerFactory: false]
12:19:36.686 [vert.x-eventloop-thread-0] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - using JDK's default SSL engine
12:19:36.686 [vert.x-eventloop-thread-0] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - enabling secure protocol [TLSv1.3]
12:19:36.686 [vert.x-eventloop-thread-0] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - enabling secure protocol [TLSv1.2]
12:19:36.686 [vert.x-eventloop-thread-0] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - connecting to AMQP 1.0 container [amqps://dt-service-device-registry:5671, role: Credentials]
12:19:36.711 [vert.x-eventloop-thread-0] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - can't connect to AMQP 1.0 container [amqps://dt-service-device-registry:5671, role: Credentials]: Failed to create SSL connection
12:19:36.712 [vert.x-eventloop-thread-0] WARN o.e.h.client.impl.HonoConnectionImpl - attempt [#19] to connect to server [dt-service-device-registry:5671, role: Credentials] failed
javax.net.ssl.SSLHandshakeException: Failed to create SSL connection
The kubectl version output is as follows:
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.16", GitCommit:"e37e4ab4cc8dcda84f1344dda47a97bb1927d074", GitTreeState:"clean", BuildDate:"2021-10-27T16:20:18Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
Thanks in advance!
Based on the iconic "Failed to create SSL connection" output in the logs, I assume that you have run into the dreaded "demo certificates included in the Hono chart have expired" problem.
The Cloud2Edge package chart is currently being updated (https://github.com/eclipse/packages/pull/337) with the most recent versions of the Ditto and Hono charts (which include fresh certificates that are valid for two more years to come). As soon as that PR is merged and the Eclipse Packages chart repository has been rebuilt, you should be able to do a helm repo update and then (hopefully) successfully install the c2e package.
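Once that PR is merged, the sequence would presumably be a repo refresh plus a reinstall, reusing the install command from the question:
helm repo update
helm uninstall $RELEASE -n $NS
helm install -n $NS --wait --timeout 15m $RELEASE eclipse-iot/cloud2edge --set hono.prometheus.createInstance=false --set hono.grafana.enabled=false --dependency-update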

k3d: local repository not found

I've created my local registry:
$ docker container run -d \
    --name registry.localhost \
    --restart always \
    -p 5000:5000 \
    registry:2
It's up and running:
$ curl -s registry.localhost:5000/v2/_catalog | jq
{
  "repositories": [
    "greenplum-for-kubernetes",
    "greenplum-operator"
  ]
}
I'm trying to create a deployment. However, I'm getting:
4m7s Normal ScalingReplicaSet deployment/greenplum-operator Scaled up replica set greenplum-operator-76b544fbb9 to 1
4m7s Normal SuccessfulCreate replicaset/greenplum-operator-76b544fbb9 Created pod: greenplum-operator-76b544fbb9-pm7t2
<unknown> Normal Scheduled pod/greenplum-operator-76b544fbb9-pm7t2 Successfully assigned default/greenplum-operator-76b544fbb9-pm7t2 to k3d-k3s-default-agent-0
3m23s Normal Pulling pod/greenplum-operator-76b544fbb9-pm7t2 Pulling image "registry.localhost:5000/greenplum-operator:v2.2.0"
3m23s Warning Failed pod/greenplum-operator-76b544fbb9-pm7t2 Error: ErrImagePull
3m23s Warning Failed pod/greenplum-operator-76b544fbb9-pm7t2 Failed to pull image "registry.localhost:5000/greenplum-operator:v2.2.0": rpc error: code = Unknown desc = failed to pull and unpack image "registry.localhost:5000/greenplum-operator:v2.2.0": failed to resolve reference "registry.localhost:5000/greenplum-operator:v2.2.0": failed to do request: Head https://registry.localhost:5000/v2/greenplum-operator/manifests/v2.2.0: http: server gave HTTP response to HTTPS client
3m1s Warning Failed pod/greenplum-operator-76b544fbb9-pm7t2 Error: ImagePullBackOff
3m1s Normal BackOff pod/greenplum-operator-76b544fbb9-pm7t2 Back-off pulling image "registry.localhost:5000/greenplum-operator:v2.2.0"
In short:
http: server gave HTTP response to HTTPS client
My cluster is up and running as well:
$ k3d cluster create --agents 2 --k3s-server-arg --disable=traefik \
    --volume $HOME/.k3d/registries.yaml:/etc/rancher/k3s/my-registries.yaml
As you can see:
$ cat ${HOME}/.k3d/registries.yaml
mirrors:
  "registry.localhost:5000":
    endpoint:
      - "http://registry.localhost:5000"
Any ideas?
Do this on your workers:
Create or modify /etc/docker/daemon.json:
{ "insecure-registries": ["registry.localhost:5000"] }
Then restart the docker daemon:
sudo service docker restart
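To verify the daemon picked it up, docker info lists the configured insecure registries; a quick check:
docker info | grep -A 3 "Insecure Registries"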
Source: https://github.com/docker/distribution/issues/1874
The problem was related to k3s, and I had made a typo.
k3s needs access to the /etc/rancher/k3s/registries.yaml file, as you can see here.
The problem was that I mounted a my-registries.yaml file instead of registries.yaml:
$ k3d cluster create --agents 2 --k3s-server-arg --disable=traefik \
    --volume $HOME/.k3d/registries.yaml:/etc/rancher/k3s/my-registries.yaml
Mounting to the correct path solved the problem:
$ k3d cluster create --agents 2 --k3s-server-arg --disable=traefik \
    --volume $HOME/.k3d/registries.yaml:/etc/rancher/k3s/registries.yaml
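A quick sanity check that the file landed where k3s expects it (the node container names follow the k3d-k3s-default-* pattern visible in the events above, so adjust if yours differ):
docker exec k3d-k3s-default-server-0 cat /etc/rancher/k3s/registries.yaml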

kubernetes pods http: TLS handshake error from x.x.x.x:38676: EOF

When starting cert-manager I get the following message:
TLS handshake error from 10.42.152.128:38676: EOF
The full webhook log:
$ kubectl -n cert-manager logs cert-manager-webhook-8575f88c85-l4tlw
I0214 19:41:28.147106 1 main.go:64] "msg"="enabling TLS as certificate file flags specified"
I0214 19:41:28.147365 1 server.go:126] "msg"="listening for insecure healthz connections" "address"=":6080"
I0214 19:41:28.147418 1 server.go:138] "msg"="listening for secure connections" "address"=":10250"
I0214 19:41:28.147437 1 server.go:155] "msg"="registered pprof handlers"
I0214 19:41:28.147570 1 tls_file_source.go:144] "msg"="detected private key or certificate data on disk has changed. reloading certificate"
2020/02/14 19:43:32 http: TLS handshake error from 10.42.152.128:38676: EOF
Interestingly, there is no pod with that IP:
$ kubectl get pod -o wide --all-namespaces | grep 128
cert-manager cert-manager-webhook-8575f88c85-l4tlw 1/1 Running 0 4m56s 10.42.112.128 node002 <none> <none>
There is a similar error on the cert-manager pod:
E0214 19:38:22.540589 1 controller.go:131] cert-manager/controller/ingress-shim "msg"="re-queuing item due to error processing" "error"="Internal error occurred: failed calling webhook \"webhook.cert-manager.io\": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: net/http: TLS handshake timeout" "key"="kube-system/dashboard-kubernetes-dashboard"
I have two ClusterIssuers:
kubectl get ClusterIssuer --namespace cert-manager
NAME READY AGE
letsencrypt-prd True 42d
letsencrypt-stg True 42d
But no certificate yet:
kubectl get certificate --all-namespaces
No resources found
When I try to request a certificate I get the same error:
kubectl apply -f mycert.yml
Error from server (InternalError): error when creating "cert-wyssmann-dev.yml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: net/http: TLS handshake timeout
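A minimal Certificate manifest of that era looks like this (illustrative sketch, not my actual file; the issuer is one of the ClusterIssuers listed above):
apiVersion: cert-manager.io/v1alpha2
kind: Certificate
metadata:
  name: example-cert
spec:
  secretName: example-cert-tls
  dnsNames:
  - example.com
  issuerRef:
    name: letsencrypt-stg
    kind: ClusterIssuer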
I am not sure how exactly I can get to the bottom of the problem. I ran sonobuoy to see if this helps me; however, the tests failed on 2 of my 3 nodes.
Plugin: e2e
Status: failed
Total: 1
Passed: 0
Failed: 1
Skipped: 0
Failed tests:
Container e2e is in a terminated state (exit code 1) due to reason: Error:
Plugin: systemd-logs
Status: failed
Total: 3
Passed: 1
Failed: 2
Skipped: 0
Failed tests:
timeout waiting for results
For the failing nodes I can see this in the sonobuoy logs:
E0214 19:38:22.540589 1 controller.go:131] cert-manager/controller/ingress-shim "msg"="re-queuing item due to error processing" "error"="Internal error occurred: failed calling webhook \"webhook.cert-manager.io\": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: net/http: TLS handshake timeout" "key"="kube-system/dashboard-kubernetes-dashboard"
If you really don't need the webhook, then one quick way to solve this is to disable the webhook as per the documentation.
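With the Helm chart of that era this looked roughly like the following; the flag name should be checked against the docs for your chart version:
helm upgrade --install cert-manager jetstack/cert-manager --namespace cert-manager --set webhook.enabled=false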

kube-dns remains in "ContainerCreating"

I manually installed k8s-1.6.6 and deployed calico-2.3 (which uses etcd-3.0.17 with kube-apiserver) and kube-dns on bare metal (Ubuntu 16.04).
It doesn't have any problems without RBAC.
But after enabling RBAC by adding "--authorization-mode=RBAC" to kube-apiserver,
I can't deploy kube-dns; its status remains "ContainerCreating".
I checked "kubectl describe pod kube-dns..":
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
10m 10m 1 default-scheduler Normal Scheduled Successfully assigned kube-dns-1759312207-t35t3 to work01
9m 9m 1 kubelet, work01 Warning FailedSync Error syncing pod, skipping: rpc error: code = 2 desc = Error: No such container: 8c2585b1b3170f220247a6abffb1a431af58060f2bcc715fe29e7c2144d19074
8m 8m 1 kubelet, work01 Warning FailedSync Error syncing pod, skipping: rpc error: code = 2 desc = Error: No such container: c6962db6c5a17533fbee563162c499631a647604f9bffe6bc71026b09a2a0d4f
7m 7m 1 kubelet, work01 Warning FailedSync Error syncing pod, skipping: failed to "KillPodSandbox" for "f693931a-7335-11e7-aaa2-525400aa8825" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"kube-dns-1759312207-t35t3_kube-system\" network: CNI failed to retrieve network namespace path: Error: No such container: 9adc41d07a80db44099460c6cc56612c6fbcd53176abcc3e7bbf843fca8b7532"
5m 5m 1 kubelet, work01 Warning FailedSync Error syncing pod, skipping: rpc error: code = 2 desc = Error: No such container: 4c2d450186cbec73ea28d2eb4c51497f6d8c175b92d3e61b13deeba1087e9a40
4m 4m 1 kubelet, work01 Warning FailedSync Error syncing pod, skipping: failed to "KillPodSandbox" for "f693931a-7335-11e7-aaa2-525400aa8825" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"kube-dns-1759312207-t35t3_kube-system\" network: CNI failed to retrieve network namespace path: Error: No such container: 12df544137939d2b8af8d70964e46b49f5ddec1228da711c084ff493443df465"
3m 3m 1 kubelet, work01 Warning FailedSync Error syncing pod, skipping: rpc error: code = 2 desc = Error: No such container: c51c9d50dcd62160ffe68d891967d118a0f594885e99df3286d0c4f8f4986970
2m 2m 1 kubelet, work01 Warning FailedSync Error syncing pod, skipping: rpc error: code = 2 desc = Error: No such container: 94533f19952c7d5f32e919c03d9ec5147ef63d4c1f35dd4fcfea34306b9fbb71
1m 1m 1 kubelet, work01 Warning FailedSync Error syncing pod, skipping: rpc error: code = 2 desc = Error: No such container: 166a89916c1e6d63e80b237e5061fd657f091f3c6d430b7cee34586ba8777b37
16s 12s 2 kubelet, work01 Warning FailedSync (events with common reason combined)
10m 2s 207 kubelet, work01 Warning FailedSync Error syncing pod, skipping: failed to "CreatePodSandbox" for "kube-dns-1759312207-t35t3_kube-system(f693931a-7335-11e7-aaa2-525400aa8825)" with CreatePodSandboxError: "CreatePodSandbox for pod \"kube-dns-1759312207-t35t3_kube-system(f693931a-7335-11e7-aaa2-525400aa8825)\" failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod \"kube-dns-1759312207-t35t3_kube-system\" network: the server does not allow access to the requested resource (get pods kube-dns-1759312207-t35t3)"
10m 1s 210 kubelet, work01 Normal SandboxChanged Pod sandbox changed, it will be killed and re-created.
My kubelet unit:
[Unit]
Description=Kubernetes Kubelet
Documentation=https://github.com/kubernetes/kubernetes
[Service]
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
ExecStartPre=/bin/mkdir -p /var/log/containers
ExecStartPre=/bin/mkdir -p /etc/cni/net.d
ExecStartPre=/bin/mkdir -p /opt/cni/bin
ExecStart=/usr/local/bin/kubelet \
--api-servers=http://127.0.0.1:8080 \
--allow-privileged=true \
--pod-manifest-path=/etc/kubernetes/manifests \
--kubeconfig=/var/lib/kubelet/kubeconfig \
--cluster-dns=10.3.0.10 \
--cluster-domain=cluster.local \
--register-node=true \
--network-plugin=cni \
--cni-conf-dir=/etc/cni/net.d \
--cni-bin-dir=/opt/cni/bin \
--container-runtime=docker
My kube-apiserver manifest:
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-apiserver
    image: kube-apiserver:v1.6.6
    command:
    - kube-apiserver
    - --bind-address=0.0.0.0
    - --etcd-servers=http://127.0.0.1:2379
    - --allow-privileged=true
    - --service-cluster-ip-range=10.3.0.0/16
    - --secure-port=6443
    - --advertise-address=172.30.1.10
    - --admission-control=NamespaceLifecycle,LimitRanger,ServiceAccount,ResourceQuota
    - --tls-cert-file=/srv/kubernetes/apiserver.pem
    - --tls-private-key-file=/srv/kubernetes/apiserver-key.pem
    - --client-ca-file=/srv/kubernetes/ca.pem
    - --service-account-key-file=/srv/kubernetes/apiserver-key.pem
    - --kubelet-preferred-address-types=InternalIP,Hostname,ExternalIP
    - --anonymous-auth=false
    - --authorization-mode=RBAC
    - --token-auth-file=/srv/kubernetes/known_tokens.csv
    - --basic-auth-file=/srv/kubernetes/basic_auth.csv
    - --storage-backend=etcd3
    livenessProbe:
      httpGet:
        host: 127.0.0.1
        port: 8080
        path: /healthz
        scheme: HTTP
      initialDelaySeconds: 15
      timeoutSeconds: 15
    ports:
    - name: https
      hostPort: 6443
      containerPort: 6443
    - name: local
      hostPort: 8080
      containerPort: 8080
    volumeMounts:
    - name: srvkube
      mountPath: "/srv/kubernetes"
      readOnly: true
    - name: etcssl
      mountPath: "/etc/ssl"
      readOnly: true
  volumes:
  - name: srvkube
    hostPath:
      path: "/srv/kubernetes"
  - name: etcssl
    hostPath:
      path: "/etc/ssl"
I found the cause.
This issue is not related to kube-dns.
I had just missed applying the ClusterRole/ClusterRoleBinding before deploying calico.
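For anyone hitting the same thing, the missing pieces were Calico's RBAC manifests. A minimal sketch of their shape (illustrative only; apply the actual RBAC manifest matching your Calico version, which also defines the ClusterRole and ServiceAccount):
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: calico-node
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: calico-node
subjects:
- kind: ServiceAccount
  name: calico-node
  namespace: kube-system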

Error in starting pods- kubernetes. Pods remain in ContainerCreating state

I have installed a trial version of Kubernetes with Minikube on my desktop running Ubuntu. However, there seems to be some issue with bringing up the pods.
kubectl get pods --all-namespaces shows all the pods in ContainerCreating state, and they don't shift to Ready.
Even when I open the Kubernetes dashboard, I get:
Waiting, endpoint for service is not ready yet.
Minikube version : v0.20.0
Environment:
OS (e.g. from /etc/os-release): Ubuntu 12.04.5 LTS
VM Driver "DriverName": "virtualbox"
ISO version "Boot2DockerURL": "file:///home/nszig/.minikube/cache/iso/minikube-v0.20.0.iso"
I have installed minikube and kubectl on Ubuntu. However, I cannot access the dashboard either through the CLI or through the GUI.
http://127.0.0.1:8001/ui gives the below error:
{ "kind": "Status", "apiVersion": "v1", "metadata": {}, "status": "Failure", "message": "no endpoints available for service "kubernetes-dashboard"", "reason": "ServiceUnavailable", "code": 503 }
And minikube dashboard on the CLI does not open the dashboard. Output:
Waiting, endpoint for service is not ready yet...
Waiting, endpoint for service is not ready yet...
Waiting, endpoint for service is not ready yet...
Waiting, endpoint for service is not ready yet...
.......
Could not find finalized endpoint being pointed to by kubernetes-dashboard: Temporary Error: Endpoint for service is not ready yet
Temporary Error: Endpoint for service is not ready yet
Temporary Error: Endpoint for service is not ready yet
Temporary Error: Endpoint for service is not ready yet
kubectl version:
Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.0", GitCommit:"d3ada0119e776222f11ec7945e6d860061339aad", GitTreeState:"clean", BuildDate:"2017-06-29T23:15:59Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.4", GitCommit:"d6f433224538d4f9ca2f7ae19b252e6fcb66a3ae", GitTreeState:"dirty", BuildDate:"2017-06-22T04:31:09Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
minikube logs also reports the errors below:
.....
Jul 10 08:46:12 minikube localkube[3237]: I0710 08:46:12.901880 3237 kuberuntime_manager.go:458] Container {Name:php-redis Image:gcr.io/google-samples/gb-frontend:v4 Command:[] Args:[] WorkingDir: Ports:[{Name: HostPort:0 ContainerPort:80 Protocol:TCP HostIP:}] EnvFrom:[] Env:[{Name:GET_HOSTS_FROM Value:dns ValueFrom:nil}] Resources:{Limits:map[] Requests:map[cpu:{i:{value:100 scale:-3} d:{Dec:} s:100m Format:DecimalSI} memory:{i:{value:104857600 scale:0} d:{Dec:} s:100Mi Format:BinarySI}]} VolumeMounts:[{Name:default-token-gqtvf ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath:}] LivenessProbe:nil ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:nil Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
Jul 10 08:46:14 minikube localkube[3237]: E0710 08:46:14.139555 3237 remote_runtime.go:86] RunPodSandbox from runtime service failed: rpc error: code = 2 desc = unable to pull sandbox image "gcr.io/google_containers/pause-amd64:3.0": Error response from daemon: Get https://gcr.io/v1/_ping: x509: certificate signed by unknown authority
....
Name:           kubernetes-dashboard-2039414953-czptd
Namespace:      kube-system
Node:           minikube/192.168.99.102
Start Time:     Fri, 14 Jul 2017 09:31:58 +0530
Labels:         k8s-app=kubernetes-dashboard
                pod-template-hash=2039414953
Annotations:    kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"kube-system","name":"kubernetes-dashboard-2039414953","uid":"2eb39682-6849-11e7-8...
Status:         Pending
IP:
Created By:     ReplicaSet/kubernetes-dashboard-2039414953
Controlled By:  ReplicaSet/kubernetes-dashboard-2039414953
Containers:
  kubernetes-dashboard:
    Container ID:
    Image:           gcr.io/google_containers/kubernetes-dashboard-amd64:v1.6.1
    Image ID:
    Port:            9090/TCP
    State:           Waiting
      Reason:        ContainerCreating
    Ready:           False
    Restart Count:   0
    Liveness:        http-get http://:9090/ delay=30s timeout=30s period=10s #success=1 #failure=3
    Environment:
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kubernetes-dashboard-token-12gdj (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  kubernetes-dashboard-token-12gdj:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kubernetes-dashboard-token-12gdj
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:
Tolerations:     node-role.kubernetes.io/master:NoSchedule
Events:
  FirstSeen  LastSeen  Count  From               SubObjectPath  Type     Reason      Message
  ---------  --------  -----  ----               -------------  -------  ------      -------
  1h         11s       443    kubelet, minikube                 Warning  FailedSync  Error syncing pod, skipping: failed to "CreatePodSandbox" for "kubernetes-dashboard-2039414953-czptd_kube-system(2eb57d9b-6849-11e7-8a56-080027206461)" with CreatePodSandboxError: "CreatePodSandbox for pod \"kubernetes-dashboard-2039414953-czptd_kube-system(2eb57d9b-6849-11e7-8a56-080027206461)\" failed: rpc error: code = 2 desc = unable to pull sandbox image \"gcr.io/google_containers/pause-amd64:3.0\": Error response from daemon: Get https://gcr.io/v1/_ping: x509: certificate signed by unknown authority"
It's quite possible that the Pod container images are still being downloaded. The images are not very large, so they should download pretty quickly on a decent internet connection.
You can use kubectl describe pod --namespace kube-system <pod-name> to get more details on the pod bring-up status. Take a look at the Events section of the output.
Until all the Kubernetes components in the kube-system namespace are in the READY state, you will not be able to access the dashboard.
You can also try SSH'ing into the minikube vm with minikube ssh to debug the issue.
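For example, since the failing image in the logs above is the pause image, a quick check from inside the VM (hypothetical, but it matches the x509 error shown) would be:
minikube ssh
docker pull gcr.io/google_containers/pause-amd64:3.0
If that pull fails with the same x509 error, the VM cannot validate gcr.io's certificate; a proxy or restricted corporate network is a common cause.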
I was able to resolve this issue by doing a clean install using a VPN connection, as I had restrictions in my corporate network that were blocking the site from which the install was trying to pull the sandbox image.
Try using:
kubectl config use-context minikube
...as a preexisting configuration may have been initiated.
Guys, I did these steps and it worked for me.
ON MASTER ONLY
####################
kubeadm init --apiserver-advertise-address=0.0.0.0 --pod-network-cidr=10.244.0.0/16
(copy the join command that is printed)
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
ON WORKER NODE
###################
kubeadm reset
Execute the join command which you got from the master after kubeadm init:
#kubeadm join