Consul syncCatalog on k8s keeps falling into CrashLoopBackOff - kubernetes

I am deploying a consul cluster on k8s version 1.9:
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:21:50Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3+coreos.0", GitCommit:"f588569ed1bd4a6c986205dd0d7b04da4ab1a3b6", GitTreeState:"clean", BuildDate:"2018-02-10T01:42:55Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}
using hashicorp/consul-k8s:0.11.0 for syncCatalog:
Here is my SyncCatalog Deployment description:
Namespace: consul-presentation
CreationTimestamp: Sun, 29 Mar 2020 20:22:49 +0300
Labels: app=consul
chart=consul-helm
heritage=Tiller
release=consul-presentation
Annotations: deployment.kubernetes.io/revision=1
Selector: app=consul,chart=consul-helm,component=sync-catalog,release=consul-presentation
Replicas: 1 desired | 1 updated | 1 total | 0 available | 1 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: app=consul
chart=consul-helm
component=sync-catalog
release=consul-presentation
Annotations: consul.hashicorp.com/connect-inject=false
Service Account: consul-presentation-consul-sync-catalog
Containers:
consul-sync-catalog:
Image: hashicorp/consul-k8s:0.11.0
Port: <none>
Command:
/bin/sh
-ec
consul-k8s sync-catalog \
-k8s-default-sync=true \
-consul-domain=consul \
-k8s-write-namespace=${NAMESPACE} \
-node-port-sync-type=ExternalFirst \
-log-level=debug \
-add-k8s-namespace-suffix \
Liveness: http-get http://:8080/health/ready delay=30s timeout=5s period=5s #success=1 #failure=3
Readiness: http-get http://:8080/health/ready delay=10s timeout=5s period=5s #success=1 #failure=5
Environment:
HOST_IP: (v1:status.hostIP)
NAMESPACE: (v1:metadata.namespace)
CONSUL_HTTP_ADDR: http://consul-presentation.test:8500
Mounts: <none>
Volumes: <none>
Conditions:
Type Status Reason
---- ------ ------
Available False MinimumReplicasUnavailable
Progressing True ReplicaSetUpdated
OldReplicaSets: <none>
NewReplicaSet: consul-presentation-consul-sync-catalog-66b5756486 (1/1 replicas created)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 1m deployment-controller Scaled up replica set consul-presentation-consul-sync-catalog-66b5756486 to 1
And here is the description of the unhealthy pod:
kubectl describe pod consul-presentation-consul-sync-catalog-66b5756486-2h2s6 -n consul-presentation
Name: consul-presentation-consul-sync-catalog-66b5756486-2h2s6
Namespace: consul-presentation
Node: k8s-k4.test/10.99.1.10
Start Time: Sun, 29 Mar 2020 20:22:49 +0300
Labels: app=consul
chart=consul-helm
component=sync-catalog
pod-template-hash=2261312042
release=consul-presentation
Annotations: consul.hashicorp.com/connect-inject=false
Status: Running
IP: 10.195.5.53
Controlled By: ReplicaSet/consul-presentation-consul-sync-catalog-66b5756486
Containers:
consul-sync-catalog:
Container ID: docker://4f0c65a7be5f9b07cae51d798c532a066fb0784b28a7610dfe4f1a15a2fa5a7c
Image: hashicorp/consul-k8s:0.11.0
Image ID: docker-pullable://hashicorp/consul-k8s@sha256:8be1598ad3e71323509727162f20ed9c140c8cf09d5fa3dc351aad03ec2b0b70
Port: <none>
Command:
/bin/sh
-ec
consul-k8s sync-catalog \
-k8s-default-sync=true \
-consul-domain=consul \
-k8s-write-namespace=${NAMESPACE} \
-node-port-sync-type=ExternalFirst \
-log-level=debug \
-add-k8s-namespace-suffix \
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Sun, 29 Mar 2020 20:28:19 +0300
Finished: Sun, 29 Mar 2020 20:28:56 +0300
Ready: False
Restart Count: 6
Liveness: http-get http://:8080/health/ready delay=30s timeout=5s period=5s #success=1 #failure=3
Readiness: http-get http://:8080/health/ready delay=10s timeout=5s period=5s #success=1 #failure=5
Environment:
HOST_IP: (v1:status.hostIP)
NAMESPACE: consul-presentation (v1:metadata.namespace)
CONSUL_HTTP_ADDR: http://consul-presentation.test:8500
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from consul-presentation-consul-sync-catalog-token-jxw26 (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
consul-presentation-consul-sync-catalog-token-jxw26:
Type: Secret (a volume populated by a Secret)
SecretName: consul-presentation-consul-sync-catalog-token-jxw26
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 7m default-scheduler Successfully assigned consul-presentation-consul-sync-catalog-66b5756486-2h2s6 to k8s-k4.test
Normal SuccessfulMountVolume 7m kubelet, k8s-k4.test MountVolume.SetUp succeeded for volume "consul-presentation-consul-sync-catalog-token-jxw26"
Normal Pulled 6m (x2 over 7m) kubelet, k8s-k4.test Container image "hashicorp/consul-k8s:0.11.0" already present on machine
Normal Created 6m (x2 over 7m) kubelet, k8s-k4.test Created container
Normal Started 6m (x2 over 7m) kubelet, k8s-k4.test Started container
Normal Killing 6m kubelet, k8s-k4.test Killing container with id docker://consul-sync-catalog:Container failed liveness probe.. Container will be killed and recreated.
Warning Unhealthy 6m (x4 over 6m) kubelet, k8s-k4.test Liveness probe failed: HTTP probe failed with statuscode: 500
Warning Unhealthy 6m (x13 over 7m) kubelet, k8s-k4.test Readiness probe failed: HTTP probe failed with statuscode: 500
Warning BackOff 2m (x6 over 3m) kubelet, k8s-k4.test Back-off restarting failed container
I have tried the default setup as described in this helm chart: https://github.com/hashicorp/consul-helm
The only difference is that I use ClusterIPs and ingresses, which shouldn't have anything to do with the health of a pod.
Any ideas?

The liveness probe failing is telling you that the sync-catalog process cannot talk to Consul. Here is how the liveness/readiness probe is implemented in consul-k8s.
It looks like the Consul address you're providing to the sync-catalog process is http://consul-presentation.test:8500. Is this an external Consul server? Is it running and reachable from the pods on Kubernetes?
Also, are you deploying Consul clients on k8s? In the official Helm chart, sync-catalog talks to the Consul clients, which are deployed as a DaemonSet, via the host IP.
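One quick way to check reachability from inside the cluster is sketched below (not part of the original answer; the busybox image is an assumption, while the address and namespace are taken from the output above):
# If this hangs or errors, the sync-catalog pod cannot reach Consul either.
kubectl run consul-check -n consul-presentation --rm -it --restart=Never --image=busybox -- \
  wget -qO- http://consul-presentation.test:8500/v1/status/leader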

When using k8s ingresses with ClusterIPs, the Consul address should be set to the ingress host, since that is what is actually exposed, and without the port. That means the corresponding part of the k8s deployment should look like this:
Liveness: http-get http://:8080/health/ready delay=30s timeout=5s period=5s #success=1 #failure=3
Readiness: http-get http://:8080/health/ready delay=10s timeout=5s period=5s #success=1 #failure=5
Environment:
HOST_IP: (v1:status.hostIP)
NAMESPACE: (v1:metadata.namespace)
CONSUL_HTTP_ADDR: http://{INGRESS HOST}
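For a quick test without going through a Helm upgrade, the environment variable can be patched directly (a sketch; the deployment name comes from the describe output above and <ingress-host> is a placeholder):
kubectl set env deployment/consul-presentation-consul-sync-catalog \
  -n consul-presentation \
  CONSUL_HTTP_ADDR=http://<ingress-host>
Keep in mind that a later helm upgrade will render the chart again and overwrite this change, so the permanent fix belongs in the chart values.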

Related

How to resolve this error where nginx-ingress-controller fails to start in my k8s cluster?

Rancher v2.4.2
kubernetes version: v1.17.4
In my k8s cluster, nginx-ingress-controller doesn't work and keeps restarting. I don't get any useful information from the logs; thanks for your help.
cluster nodes:
> kubectl get nodes
NAME STATUS ROLES AGE VERSION
master1 Ready controlplane,etcd,worker 18d v1.17.4
master2 Ready controlplane,etcd,worker 17d v1.17.4
node1 Ready worker 17d v1.17.4
node2 Ready worker 17d v1.17.4
cluster pods in ingress-nginx namespace
> kubectl get pods -n ingress-nginx
NAME READY STATUS RESTARTS AGE
default-http-backend-5bb77998d7-k7gdh 1/1 Running 1 17d
nginx-ingress-controller-6l4jh 0/1 Running 10 27m
nginx-ingress-controller-bh2pg 1/1 Running 0 63m
nginx-ingress-controller-drtzx 1/1 Running 0 63m
nginx-ingress-controller-qndbw 1/1 Running 0 63m
the pod logs of nginx-ingress-controller-6l4jh
> kubectl logs nginx-ingress-controller-6l4jh -n ingress-nginx
-------------------------------------------------------------------------------
NGINX Ingress controller
Release: nginx-0.25.1-rancher1
Build:
Repository: https://github.com/rancher/ingress-nginx.git
nginx version: openresty/1.15.8.1
-------------------------------------------------------------------------------
describe info
> kubectl describe pod nginx-ingress-controller-6l4jh -n ingress-nginx
Name: nginx-ingress-controller-6l4jh
Namespace: ingress-nginx
Priority: 0
Node: node2/172.26.13.11
Start Time: Tue, 19 Apr 2022 07:12:16 +0000
Labels: app=ingress-nginx
controller-revision-hash=758cb9dbbc
pod-template-generation=8
Annotations: cattle.io/timestamp: 2022-04-19T07:08:51Z
field.cattle.io/ports:
[[{"containerPort":80,"dnsName":"nginx-ingress-controller-hostport","hostPort":80,"kind":"HostPort","name":"http","protocol":"TCP","source...
field.cattle.io/publicEndpoints:
[{"addresses":["172.26.13.130"],"nodeId":"c-wv692:m-d5802d05bbf0","port":80,"protocol":"TCP"},{"addresses":["172.26.13.130"],"nodeId":"c-w...
prometheus.io/port: 10254
prometheus.io/scrape: true
Status: Running
IP: 172.26.13.11
IPs:
IP: 172.26.13.11
Controlled By: DaemonSet/nginx-ingress-controller
Containers:
nginx-ingress-controller:
Container ID: docker://09a6248edb921b9c9cbab678c793fe1cc3d28322ea6abbb8f15c899351ce4b40
Image: 172.26.13.133:5000/rancher/nginx-ingress-controller:nginx-0.25.1-rancher1
Image ID: docker-pullable://172.26.13.133:5000/rancher/nginx-ingress-controller@sha256:fe50ceea3d1a0bc9a7ccef8d5845c9a30b51f608e411467862dff590185a47d2
Ports: 80/TCP, 443/TCP
Host Ports: 80/TCP, 443/TCP
Args:
/nginx-ingress-controller
--default-backend-service=$(POD_NAMESPACE)/default-http-backend
--configmap=$(POD_NAMESPACE)/nginx-configuration
--tcp-services-configmap=$(POD_NAMESPACE)/tcp-services
--udp-services-configmap=$(POD_NAMESPACE)/udp-services
--annotations-prefix=nginx.ingress.kubernetes.io
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 143
Started: Tue, 19 Apr 2022 07:40:12 +0000
Finished: Tue, 19 Apr 2022 07:41:32 +0000
Ready: False
Restart Count: 11
Liveness: http-get http://:10254/healthz delay=60s timeout=20s period=10s #success=1 #failure=3
Readiness: http-get http://:10254/healthz delay=60s timeout=20s period=10s #success=1 #failure=3
Environment:
POD_NAME: nginx-ingress-controller-6l4jh (v1:metadata.name)
POD_NAMESPACE: ingress-nginx (v1:metadata.namespace)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from nginx-ingress-serviceaccount-token-2kdbj (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
nginx-ingress-serviceaccount-token-2kdbj:
Type: Secret (a volume populated by a Secret)
SecretName: nginx-ingress-serviceaccount-token-2kdbj
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: :NoExecute
:NoSchedule
node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/network-unavailable:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/pid-pressure:NoSchedule
node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unschedulable:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned ingress-nginx/nginx-ingress-controller-6l4jh to node2
Normal Pulled 27m (x3 over 30m) kubelet, node2 Container image "172.26.13.133:5000/rancher/nginx-ingress-controller:nginx-0.25.1-rancher1" already present on machine
Normal Created 27m (x3 over 30m) kubelet, node2 Created container nginx-ingress-controller
Normal Started 27m (x3 over 30m) kubelet, node2 Started container nginx-ingress-controller
Normal Killing 27m (x2 over 28m) kubelet, node2 Container nginx-ingress-controller failed liveness probe, will be restarted
Warning Unhealthy 25m (x10 over 29m) kubelet, node2 Liveness probe failed: Get http://172.26.13.11:10254/healthz: dial tcp 172.26.13.11:10254: connect: connection refused
Warning Unhealthy 10m (x21 over 29m) kubelet, node2 Readiness probe failed: Get http://172.26.13.11:10254/healthz: dial tcp 172.26.13.11:10254: connect: connection refused
Warning BackOff 8s (x69 over 20m) kubelet, node2 Back-off restarting failed container
It sounds like the ingress controller pod is failing its liveness/readiness checks, but apparently only on one particular node. You could try:
checking that node for a firewall blocking the probe port
updating to a newer version than nginx-0.25.1
A couple of checks you could run on that node are sketched below.
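For example (assumptions: you can SSH to node2; the IP and port are taken from the describe output above):
# From node2 itself, try the endpoint the kubelet is probing:
curl -v http://172.26.13.11:10254/healthz
# Look for firewall rules that could be dropping traffic to the probe port:
sudo iptables -L -n | grep 10254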

Kubernetes CoreDNS in CrashLoopBackOff on worker node

I've searched for CoreDNS in CrashLoopBackOff but nothing has helped me.
My Setup
k8s - v1.20.2
CoreDNS 1.7.0
Installed with kubespray, following this guide: https://kubernetes.io/ko/docs/setup/production-environment/tools/kubespray
My Problem
The CoreDNS pods on the master node are in a Running state, but on the worker node the CoreDNS pods are in a CrashLoopBackOff state.
kubectl logs -f coredns-847f564ccf-msbvp -n kube-system
.:53
[INFO] plugin/reload: Running configuration MD5 = 5b233a0166923d642fdbca0794b712ab
CoreDNS-1.7.0
linux/amd64, go1.14.4, f59c03d
[INFO] SIGTERM: Shutting down servers then terminating
[INFO] plugin/health: Going into lameduck mode for 5s
The CoreDNS container runs the command "/coredns -conf /etc/resolv.conf" for a while and is then destroyed.
Here is Corefile
Corefile: |
.:53 {
errors
health {
lameduck 5s
}
ready
kubernetes cluster.local in-addr.arpa ip6.arpa {
pods insecure
fallthrough in-addr.arpa ip6.arpa
}
prometheus :9153
forward . /etc/resolv.conf {
prefer_udp
}
cache 30
loop
reload
loadbalance
}
And one of crashed pod's event
kubectl get event --namespace kube-system --field-selector involvedObject.name=coredns-847f564ccf-lqnxs
LAST SEEN TYPE REASON OBJECT MESSAGE
4m55s Warning Unhealthy pod/coredns-847f564ccf-lqnxs Liveness probe failed: Get "http://10.216.50.2:8080/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
9m59s Warning BackOff pod/coredns-847f564ccf-lqnxs Back-off restarting failed container
And Here is CoreDns Description
Containers:
coredns:
Container ID: docker://a174cb3a3800181d1c7b78831bfd37bbf69caf60a82051d6fb29b4b9deeacce9
Image: k8s.gcr.io/coredns:1.7.0
Image ID: docker-pullable://k8s.gcr.io/coredns@sha256:73ca82b4ce829766d4f1f10947c3a338888f876fbed0540dc849c89ff256e90c
Ports: 53/UDP, 53/TCP, 9153/TCP
Host Ports: 0/UDP, 0/TCP, 0/TCP
Args:
-conf
/etc/coredns/Corefile
State: Running
Started: Wed, 21 Apr 2021 21:51:44 +0900
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 21 Apr 2021 21:44:42 +0900
Finished: Wed, 21 Apr 2021 21:46:32 +0900
Ready: False
Restart Count: 9943
Limits:
memory: 170Mi
Requests:
cpu: 100m
memory: 70Mi
Liveness: http-get http://:8080/health delay=0s timeout=5s period=10s #success=1 #failure=10
Readiness: http-get http://:8181/ready delay=0s timeout=5s period=10s #success=1 #failure=10
Environment: <none>
Mounts:
/etc/coredns from config-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from coredns-token-qqhn6 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: coredns
Optional: false
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: node-role.kubernetes.io/control-plane:NoSchedule
node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 18m (x9940 over 30d) kubelet Container image "k8s.gcr.io/coredns:1.7.0" already present on machine
Warning Unhealthy 8m37s (x99113 over 30d) kubelet Liveness probe failed: Get "http://10.216.50.2:8080/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning BackOff 3m35s (x121901 over 30d) kubelet Back-off restarting failed container
At this point, any suggestion at all would be helpful.
I found something weird:
testing from node1, I can access the CoreDNS pod on node2, but I cannot access the CoreDNS pod on node1.
I use Calico for CNI.
in node1, coredns1 - 1.1.1.1
in node2, coredns2 - 2.2.2.2
in node1.
access 1.1.1.1:8080/health -> timeout
access 2.2.2.2:8080/health -> ok
in node2.
access 1.1.1.1:8080/health -> ok
access 2.2.2.2:8080/health -> timeout
If containerd and kubelet are behind a proxy, please add the private IP range 10.0.0.0/8 to the NO_PROXY configuration so that cluster-internal traffic (such as the kubelet's probes to pod IPs) does not go through the proxy.
E.g.:
[root@dev-systrdemo301z phananhtuan01]# cat /etc/systemd/system/containerd.service.d/proxy.conf
[Service]
Environment="HTTP_PROXY=dev-proxy.prod.xx.local:8300"
Environment="HTTPS_PROXY=dev-proxy.prod.xx.local:8300"
Environment="NO_PROXY=localhost,127.0.0.0/8,100.67.253.157/24,10.0.0.0/8"
Please refer to this article.
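After editing the drop-in, the services need to pick up the new environment (a sketch, assuming a systemd-based setup like the one shown above):
sudo systemctl daemon-reload
sudo systemctl restart containerd kubelet
# Confirm the environment the service actually sees:
systemctl show --property=Environment containerd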

Readiness probe failed: timeout: failed to connect service ":8080" within 1s

I am trying to build and deploy microservices images to a single-node Kubernetes cluster running on my development machine using minikube. I am using the cloud-native microservices demo application Online Boutique by Google to understand the use of technologies like Kubernetes, Istio etc.
Link to github repo: microservices-demo
I have followed the entire installation process to build and deploy the microservices locally, and I am able to access the web frontend through my browser. However, when I click on any of the product images, say, I see this error page.
HTTP Status: 500 Internal Server Error
On checking with kubectl get pods, I realize that one of my pods (the recommendation service) has status CrashLoopBackOff.
Running kubectl describe pods recommendationservice-55b4d6c477-kxv8r:
Namespace: default
Priority: 0
Node: minikube/192.168.99.116
Start Time: Thu, 23 Jul 2020 19:58:38 +0530
Labels: app=recommendationservice
app.kubernetes.io/managed-by=skaffold-v1.11.0
pod-template-hash=55b4d6c477
skaffold.dev/builder=local
skaffold.dev/cleanup=true
skaffold.dev/deployer=kubectl
skaffold.dev/docker-api-version=1.40
skaffold.dev/run-id=49913ced-e8df-40a7-9336-a227b56bcb5f
skaffold.dev/tag-policy=git-commit
Annotations: <none>
Status: Running
IP: 172.17.0.14
IPs:
IP: 172.17.0.14
Controlled By: ReplicaSet/recommendationservice-55b4d6c477
Containers:
server:
Container ID: docker://2d92aa966a82fbe58c8f40f6ecf9d6d55c29f8081cb40e0423a2397e1419350f
Image: recommendationservice:2216d526d249cc8363129aed9a09d752f9ad8f458e61e50a2a99c59d000606cb
Image ID: docker://sha256:2216d526d249cc8363129aed9a09d752f9ad8f458e61e50a2a99c59d000606cb
Port: 8080/TCP
Host Port: 0/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Thu, 23 Jul 2020 21:09:33 +0530
Finished: Thu, 23 Jul 2020 21:09:53 +0530
Ready: False
Restart Count: 29
Limits:
cpu: 200m
memory: 450Mi
Requests:
cpu: 100m
memory: 220Mi
Liveness: exec [/bin/grpc_health_probe -addr=:8080] delay=0s timeout=1s period=5s #success=1 #failure=3
Readiness: exec [/bin/grpc_health_probe -addr=:8080] delay=0s timeout=1s period=5s #success=1 #failure=3
Environment:
PORT: 8080
PRODUCT_CATALOG_SERVICE_ADDR: productcatalogservice:3550
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-sbpcx (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-sbpcx:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-sbpcx
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 44m (x15 over 74m) kubelet, minikube Container image "recommendationservice:2216d526d249cc8363129aed9a09d752f9ad8f458e61e50a2a99c59d000606cb" already present on machine
Warning Unhealthy 9m33s (x99 over 74m) kubelet, minikube Readiness probe failed: timeout: failed to connect service ":8080" within 1s
Warning BackOff 4m25s (x294 over 72m) kubelet, minikube Back-off restarting failed container
In Events, I see Readiness probe failed: timeout: failed to connect service ":8080" within 1s.
What is the reason and how can I resolve this?
Thanks for the help!
Answer
The timeout of the Readiness Probe (1 second) was too short.
More Info
The relevant Readiness Probe is defined such that /bin/grpc_health_probe -addr=:8080 is run inside the server container.
You would expect a 1-second timeout to be sufficient for such a probe, but this is running on Minikube, which could be affecting how long the probe takes to respond. A sketch of the change is shown below.
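A minimal sketch of what raising the timeout might look like in the recommendationservice deployment manifest (the 5-second values are an assumption, not taken from the demo repo):
readinessProbe:
  exec:
    command: ["/bin/grpc_health_probe", "-addr=:8080"]
  periodSeconds: 5
  timeoutSeconds: 5   # raised from the original 1s
livenessProbe:
  exec:
    command: ["/bin/grpc_health_probe", "-addr=:8080"]
  periodSeconds: 5
  timeoutSeconds: 5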

Istio Prometheus pod in CrashLoopBackOff State

I am trying to set up Istio (1.5.4) for the bookinfo example provided on their website. I have used the demo configuration profile. But when verifying the Istio installation it fails, since the Prometheus pod has entered a CrashLoopBackOff state.
NAME READY STATUS RESTARTS AGE
grafana-5f6f8cbf75-psk78 1/1 Running 0 21m
istio-egressgateway-7f9f45c966-g7k9j 1/1 Running 0 21m
istio-ingressgateway-968d69c8b-bhxk5 1/1 Running 0 21m
istio-tracing-9dd6c4f7c-7fm79 1/1 Running 0 21m
istiod-86884c8c45-sw96x 1/1 Running 0 21m
kiali-869c6894c5-wqgjb 1/1 Running 0 21m
prometheus-589c44dbfc-xkwmj 1/2 CrashLoopBackOff 8 21m
The logs for the prometheus pod:
level=warn ts=2020-05-15T09:07:53.113Z caller=main.go:283 deprecation_notice="'storage.tsdb.retention' flag is deprecated use 'storage.tsdb.retention.time' instead."
level=info ts=2020-05-15T09:07:53.114Z caller=main.go:330 msg="Starting Prometheus" version="(version=2.15.1, branch=HEAD, revision=8744510c6391d3ef46d8294a7e1f46e57407ab13)"
level=info ts=2020-05-15T09:07:53.114Z caller=main.go:331 build_context="(go=go1.13.5, user=root@4b1e33c71b9d, date=20191225-01:04:15)"
level=info ts=2020-05-15T09:07:53.114Z caller=main.go:332 host_details="(Linux 4.15.0-52-generic #56-Ubuntu SMP Tue Jun 4 22:49:08 UTC 2019 x86_64 prometheus-589c44dbfc-xkwmj (none))"
level=info ts=2020-05-15T09:07:53.114Z caller=main.go:333 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2020-05-15T09:07:53.114Z caller=main.go:334 vm_limits="(soft=unlimited, hard=unlimited)"
level=error ts=2020-05-15T09:07:53.157Z caller=query_logger.go:107 component=activeQueryTracker msg="Failed to create directory for logging active queries"
level=error ts=2020-05-15T09:07:53.157Z caller=query_logger.go:85 component=activeQueryTracker msg="Error opening query log file" file=data/queries.active err="open data/queries.active: no such file or directory"
panic: Unable to create mmap-ed active query log
goroutine 1 [running]:
github.com/prometheus/prometheus/promql.NewActiveQueryTracker(0x24dda5b, 0x5, 0x14, 0x2c62100, 0xc0005f63c0, 0x2c62100)
/app/promql/query_logger.go:115 +0x48c
main.main()
/app/cmd/prometheus/main.go:362 +0x5229
Describe pod output:
Name: prometheus-589c44dbfc-xkwmj
Namespace: istio-system
Priority: 0
Node: inspiron-7577/192.168.0.9
Start Time: Fri, 15 May 2020 14:21:14 +0530
Labels: app=prometheus
pod-template-hash=589c44dbfc
release=istio
Annotations: sidecar.istio.io/inject: false
Status: Running
IP: 172.17.0.11
IPs:
IP: 172.17.0.11
Controlled By: ReplicaSet/prometheus-589c44dbfc
Containers:
prometheus:
Container ID: docker://b6820a000ab67a5ce31d3a38f6f0d510bd150794b2792147fc17ef8f730c03bb
Image: docker.io/prom/prometheus:v2.15.1
Image ID: docker-pullable://prom/prometheus@sha256:169b743ceb4452266915272f9c3409d36972e41cb52f3f28644e6c0609fc54e6
Port: 9090/TCP
Host Port: 0/TCP
Args:
--storage.tsdb.retention=6h
--config.file=/etc/prometheus/prometheus.yml
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Fri, 15 May 2020 14:37:50 +0530
Finished: Fri, 15 May 2020 14:37:53 +0530
Ready: False
Restart Count: 8
Requests:
cpu: 10m
Liveness: http-get http://:9090/-/healthy delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:9090/-/ready delay=0s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/etc/istio-certs from istio-certs (rw)
/etc/prometheus from config-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from prometheus-token-cgqbc (ro)
istio-proxy:
Container ID: docker://fa756c93510b6f402d7d88c31a5f5f066d4c254590eab70886e7835e7d3871be
Image: docker.io/istio/proxyv2:1.5.4
Image ID: docker-pullable://istio/proxyv2@sha256:e16e2801b7fd93154e8fcb5f4e2fb1240d73349d425b8be90691d48e8b9bb944
Port: 15090/TCP
Host Port: 0/TCP
Args:
proxy
sidecar
--domain
$(POD_NAMESPACE).svc.cluster.local
--configPath
/etc/istio/proxy
--binaryPath
/usr/local/bin/envoy
--serviceCluster
istio-proxy-prometheus
--drainDuration
45s
--parentShutdownDuration
1m0s
--discoveryAddress
istio-pilot.istio-system.svc:15012
--proxyLogLevel=warning
--proxyComponentLogLevel=misc:error
--connectTimeout
10s
--proxyAdminPort
15000
--controlPlaneAuthPolicy
NONE
--dnsRefreshRate
300s
--statusPort
15020
--trust-domain=cluster.local
--controlPlaneBootstrap=false
State: Running
Started: Fri, 15 May 2020 14:21:31 +0530
Ready: True
Restart Count: 0
Readiness: http-get http://:15020/healthz/ready delay=1s timeout=1s period=2s #success=1 #failure=30
Environment:
OUTPUT_CERTS: /etc/istio-certs
JWT_POLICY: first-party-jwt
PILOT_CERT_PROVIDER: istiod
CA_ADDR: istio-pilot.istio-system.svc:15012
POD_NAME: prometheus-589c44dbfc-xkwmj (v1:metadata.name)
POD_NAMESPACE: istio-system (v1:metadata.namespace)
INSTANCE_IP: (v1:status.podIP)
SERVICE_ACCOUNT: (v1:spec.serviceAccountName)
HOST_IP: (v1:status.hostIP)
ISTIO_META_POD_NAME: prometheus-589c44dbfc-xkwmj (v1:metadata.name)
ISTIO_META_CONFIG_NAMESPACE: istio-system (v1:metadata.namespace)
ISTIO_META_MESH_ID: cluster.local
ISTIO_META_CLUSTER_ID: Kubernetes
Mounts:
/etc/istio-certs/ from istio-certs (rw)
/etc/istio/proxy from istio-envoy (rw)
/var/run/secrets/istio from istiod-ca-cert (rw)
/var/run/secrets/kubernetes.io/serviceaccount from prometheus-token-cgqbc (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: prometheus
Optional: false
istio-certs:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit: <unset>
istio-envoy:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit: <unset>
istiod-ca-cert:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: istio-ca-root-cert
Optional: false
prometheus-token-cgqbc:
Type: Secret (a volume populated by a Secret)
SecretName: prometheus-token-cgqbc
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned istio-system/prometheus-589c44dbfc-xkwmj to inspiron-7577
Warning FailedMount 17m kubelet, inspiron-7577 MountVolume.SetUp failed for volume "prometheus-token-cgqbc" : failed to sync secret cache: timed out waiting for the condition
Warning FailedMount 17m kubelet, inspiron-7577 MountVolume.SetUp failed for volume "config-volume" : failed to sync configmap cache: timed out waiting for the condition
Normal Pulled 17m kubelet, inspiron-7577 Container image "docker.io/istio/proxyv2:1.5.4" already present on machine
Normal Created 17m kubelet, inspiron-7577 Created container istio-proxy
Normal Started 17m kubelet, inspiron-7577 Started container istio-proxy
Warning Unhealthy 17m kubelet, inspiron-7577 Readiness probe failed: HTTP probe failed with statuscode: 503
Normal Pulled 16m (x4 over 17m) kubelet, inspiron-7577 Container image "docker.io/prom/prometheus:v2.15.1" already present on machine
Normal Created 16m (x4 over 17m) kubelet, inspiron-7577 Created container prometheus
Normal Started 16m (x4 over 17m) kubelet, inspiron-7577 Started container prometheus
Warning BackOff 2m24s (x72 over 17m) kubelet, inspiron-7577 Back-off restarting failed container
It is unable to create the directory for logging active queries. Please help with any ideas.
As Istio 1.5.4 has just been released, there are some issues with Prometheus on minikube when installed with istioctl manifest apply.
I checked it on GCP and everything works fine there.
As a workaround, you can use the Istio operator, which was tested by me and the OP and, as he mentioned in the comments, is working.
Thanks a lot @jt97! It did work.
Steps to install the Istio operator
Install the istioctl command.
Deploy the Istio operator: istioctl operator init.
Install Istio
To install the Istio demo configuration profile using the operator, run the following command:
kubectl create ns istio-system
kubectl apply -f - <<EOF
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
namespace: istio-system
name: example-istiocontrolplane
spec:
profile: demo
EOF
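Once the operator has reconciled the IstioOperator resource, the istio-system pods can be watched until Prometheus comes up healthy (a minimal check, not part of the original steps; istioctl operator init deploys the operator into the istio-operator namespace):
kubectl get pods -n istio-system -w
# The operator's own logs show reconciliation progress:
kubectl logs deployment/istio-operator -n istio-operator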
Could you tell me why the normal installation failed?
As I mentioned in the comments, I don't know yet. If I find a reason, I will update this question.

Helm init for tiller deploy is stuck at ContainerCreating status

NAME READY STATUS RESTARTS AGE
coredns-5644d7b6d9-289qz 1/1 Running 0 76m
coredns-5644d7b6d9-ssbb2 1/1 Running 0 76m
etcd-k8s-master 1/1 Running 0 75m
kube-apiserver-k8s-master 1/1 Running 0 75m
kube-controller-manager-k8s-master 1/1 Running 0 75m
kube-proxy-2q9k5 1/1 Running 0 71m
kube-proxy-dz9pk 1/1 Running 0 76m
kube-scheduler-k8s-master 1/1 Running 0 75m
tiller-deploy-7b875fbf86-8nxmk 0/1 ContainerCreating 0 17m
weave-net-nzb67 2/2 Running 0 75m
weave-net-t8kmk 2/2 Running 0 71m
Installed Kubernetes version v1.16.2, but when installing tiller using a new service account it gets stuck at ContainerCreating. I have tried all the solutions, such as RBAC, removing the tiller role and recreating it, reinstalling Kubernetes, etc.
The output of kubectl describe is as follows.
[02:32:50] root@k8s-master$ kubectl describe pods tiller-deploy-7b875fbf86-8nxmk --namespace kube-system
Name: tiller-deploy-7b875fbf86-8nxmk
Namespace: kube-system
Priority: 0
Node: worker-node1/172.17.0.1
Start Time: Thu, 24 Oct 2019 14:12:45 -0400
Labels: app=helm
name=tiller
pod-template-hash=7b875fbf86
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/tiller-deploy-7b875fbf86
Containers:
tiller:
Container ID:
Image: gcr.io/kubernetes-helm/tiller:v2.15.1
Image ID:
Ports: 44134/TCP, 44135/TCP
Host Ports: 0/TCP, 0/TCP
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Liveness: http-get http://:44135/liveness delay=1s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:44135/readiness delay=1s timeout=1s period=10s #success=1 #failure=3
Environment:
TILLER_NAMESPACE: kube-system
TILLER_HISTORY_MAX: 0
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from tiller-token-rr2jg (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
tiller-token-rr2jg:
Type: Secret (a volume populated by a Secret)
SecretName: tiller-token-rr2jg
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned kube-system/tiller-deploy-7b875fbf86-8nxmk to worker-node1
Warning FailedCreatePodSandBox 61s (x5 over 17m) kubelet, worker-node1 Failed create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Normal SandboxChanged 60s (x5 over 17m) kubelet, worker-node1 Pod sandbox changed, it will be killed and re-created.
FailedCreatePodSandBox means that worker-node1 (172.17.0.1) does not have CNI installed or configured; this is a frequently asked question. Whatever mechanism you used to install Kubernetes did not do a robust job, or else you missed a step along the way.
I am also fairly confident that the kubelet logs on worker-node1 are full of error messages, if you actually look at them. A few checks are sketched below.
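For example (a sketch; the paths assume a standard systemd/kubeadm-style node, and the weave-net pods come from the pod list above):
# The kubelet logs usually name the missing CNI plugin or config:
journalctl -u kubelet --no-pager | grep -i -E 'cni|network plugin'
# A working CNI setup leaves a config file here:
ls /etc/cni/net.d/
# And the weave-net pod scheduled on this node should be Running:
kubectl -n kube-system get pods -o wide | grep weave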