GKE throws invalid certificate when fetching logs - kubernetes

I'm trying to fetch the logs from a pod running in GKE, but I get this error:
I0117 11:42:54.468501 96671 round_trippers.go:466] curl -v -XGET -H "Accept: application/json, */*" -H "User-Agent: kubectl/v1.26.0 (darwin/arm64) kubernetes/b46a3f8" 'https://x.x.x.x/api/v1/namespaces/pleiades/pods/pleiades-0/log?container=server'
I0117 11:42:54.569122 96671 round_trippers.go:553] GET https://x.x.x.x/api/v1/namespaces/pleiades/pods/pleiades-0/log?container=server 500 Internal Server Error in 100 milliseconds
I0117 11:42:54.569170 96671 round_trippers.go:570] HTTP Statistics: GetConnection 0 ms ServerProcessing 100 ms Duration 100 ms
I0117 11:42:54.569186 96671 round_trippers.go:577] Response Headers:
I0117 11:42:54.569202 96671 round_trippers.go:580] Content-Type: application/json
I0117 11:42:54.569215 96671 round_trippers.go:580] Content-Length: 226
I0117 11:42:54.569229 96671 round_trippers.go:580] Date: Tue, 17 Jan 2023 19:42:54 GMT
I0117 11:42:54.569243 96671 round_trippers.go:580] Audit-Id: a25a554f-c3f5-4f91-9711-3f2970376770
I0117 11:42:54.569332 96671 round_trippers.go:580] Cache-Control: no-cache, private
I0117 11:42:54.571392 96671 request.go:1154] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Get \"https://10.6.128.40:10250/containerLogs/pleiades/pleiades-0/server\": x509: certificate is valid for 127.0.0.1, not 10.6.128.40","code":500}
I0117 11:42:54.572267 96671 helpers.go:246] server response object: [{
"metadata": {},
"status": "Failure",
"message": "Get \"https://10.6.128.40:10250/containerLogs/pleiades/pleiades-0/server\": x509: certificate is valid for 127.0.0.1, not 10.6.128.40",
"code": 500
}]
How do I prevent this from happening?

One of the reasons for this error could be because both metrics-server and kubelet listen on port 10250. This is usually not a problem because metrics-server runs in its own namespace but the conflict would have prevented metrics-server from starting when in the host network.
You can confirm this behavior by running the following command :
$ kubectl -n kube-system get pods -l k8s-app=metrics-server -o yaml | grep 10250
- --secure-port=10250
- containerPort: 10250
If you can see a hostPort: 10250 in the yaml file of the metrics-server, please run the following command to delete metrics-server deployment on that cluster :
$ kubectl -n kube-system delete deployment -l k8s-app=metrics-server
Metrics server will be recreated correctly by GKE infrastructure. It should be recreated in ~15 seconds on clusters with a new addon manager, but could take up to 15 minutes on very old clusters.

Related

Kubernetes service under default namespace is unreachable or throws connection refused error

I am trying to deploy the 'lighthouse' application in kubernetes cluster with 'v1.18.14' and the pods are up and running but the logs shows that there is a connection refused error on kubernettes service ip 10.233.0.1 on port 443.
Logs of lighthouse pod -
[centos#master elk_stack_6.x]$ kubectl logs lighthouse-webhooks-7f58c9897c-tbvfz -n jx
{"fields.level":"info","level":"info","msg":"setting the log level","time":"2021-01-21T03:35:52Z"}
{"level":"info","msg":"updating the Lighthouse core configuration","time":"2021-01-21T03:35:52Z"}
{"level":"warning","msg":"unknown plugin","plugin":"blunderbuss","time":"2021-01-21T03:35:52Z"}
{"level":"warning","msg":"unknown plugin","plugin":"heart","time":"2021-01-21T03:35:52Z"}
{"level":"info","msg":"updating the Lighthouse plugins configuration","time":"2021-01-21T03:35:52Z"}
{"level":"warning","msg":"not pushing metrics as there is no push_gateway defined in the config.yaml","time":"2021-01-21T03:35:52Z"}
{"level":"info","msg":"Lighthouse is now listening on path /hook and port 8080 for WebHooks","time":"2021-01-21T03:35:52Z"}
{"level":"info","msg":"Lighthouse is serving prometheus metrics on port 2112","time":"2021-01-21T03:35:52Z"}
E0129 07:23:34.030912 1 reflector.go:309] pkg/mod/k8s.io/client-go#v0.17.6/tools/cache/reflector.go:105: Failed to watch *v1.ConfigMap: Get https://10.233.0.1:443/api/v1/namespaces/jx/configmaps?watch=true: dial tcp 10.233.0.1:443: connect: connection refused
E0129 07:23:35.031560 1 reflector.go:309] pkg/mod/k8s.io/client-go#v0.17.6/tools/cache/reflector.go:105: Failed to watch *v1.ConfigMap: Get https://10.233.0.1:443/api/v1/namespaces/jx/configmaps?watch=true: dial tcp 10.233.0.1:443: connect: connection refused
Service under default namespace -
[centos#master elk_stack_6.x]$ kubectl get svc -n default
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.233.0.1 <none> 443/TCP 12d
Endpoints of default namespace -
[centos#master elk_stack_6.x]$ kubectl get endpoints -n default
NAME ENDPOINTS AGE
kubernetes 167.254.204.56:6443 12d
Trying to connect kubernetes service using curl -
[centos#master ~]$ curl -kv https://10.233.0.1:443
* About to connect() to 10.233.0.1 port 443 (#0)
* Trying 10.233.0.1...
* Connected to 10.233.0.1 (10.233.0.1) port 443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* NSS: client certificate not found (nickname not specified)
* SSL connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
* Server certificate:
* subject: CN=kube-apiserver
* start date: Jan 20 13:49:12 2021 GMT
* expire date: Jan 20 13:49:12 2022 GMT
* common name: kube-apiserver
* issuer: CN=kubernetes
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 10.233.0.1
> Accept: */*
>
< HTTP/1.1 403 Forbidden
< Cache-Control: no-cache, private
< Content-Type: application/json
< X-Content-Type-Options: nosniff
< Date: Tue, 02 Feb 2021 07:57:21 GMT
< Content-Length: 233
<
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {
},
"status": "Failure",
"message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
"reason": "Forbidden",
"details": {
},
"code": 403
* Connection #0 to host 10.233.0.1 left intact
I am facing 403 forbidden error may be because of some missing RBAC policies. Any suggestions would be appreciated.
Kindly try the HTTPS endpoint as mentioned in the error instead of HTTP.
400 error would mean that the request is indeed reaching the server but not in a format that the server is expecting.
Updated to reflect updated question:-
403 means the user via which the cluster is being accessed is not authorised to use the IP for accessing the resource.
Detailed explanation can be found protecting your kubernetes api server
As per pod logs and Curl command output, it may be due to permission issue with Service Account the pod is running with. Check if Service Account of the pod has proper permission to watch/read configMap in the default namespace. If not there, then RBAC role and role binding K8S objects need to be created.

kubectl exec error dialing backend: x509: certificate signed by unknown authority

After a long struggle I just created my cluster, deployed a sample container busybox now i am trying to run the command exec and i get the following error:
error dialing backend: x509: certificate signed by unknown authority
How do i solve this one: here is the command output with v=9 log level.
kubectl exec -v=9 -ti busybox -- nslookup kubernetes
I also noticed in the logs that this curl command that failed is actually the second command the first GET command passed and it return results without any issues.. ( GET https://myloadbalancer.local:6443/api/v1/namespaces/default/pods/busybox 200 OK)
curl -k -v -XPOST -H "X-Stream-Protocol-Version: v4.channel.k8s.io" -H "X-Stream-Protocol-Version: v3.channel.k8s.io" -H "X-Stream-Protocol-Version: v2.channel.k8s.io" -H "X-Stream-Protocol-Version: channel.k8s.io" -H "User-Agent: kubectl/v1.19.0 (linux/amd64) kubernetes/e199641" 'https://myloadbalancer.local:6443/api/v1/namespaces/default/pods/busybox/exec?command=nslookup&command=kubernetes&container=busybox&stdin=true&stdout=true&tty=true'
I1018 02:19:40.776134 129813 round_trippers.go:443] POST https://myloadbalancer.local:6443/api/v1/namespaces/default/pods/busybox/exec?command=nslookup&command=kubernetes&container=busybox&stdin=true&stdout=true&tty=true 500 Internal Server Error in 43 milliseconds
I1018 02:19:40.776189 129813 round_trippers.go:449] Response Headers:
I1018 02:19:40.776206 129813 round_trippers.go:452] Content-Type: application/json
I1018 02:19:40.776234 129813 round_trippers.go:452] Date: Sun, 18 Oct 2020 02:19:40 GMT
I1018 02:19:40.776264 129813 round_trippers.go:452] Content-Length: 161
I1018 02:19:40.776277 129813 round_trippers.go:452] Cache-Control: no-cache, private
I1018 02:19:40.777904 129813 helpers.go:216] server response object: [{
"metadata": {},
"status": "Failure",
"message": "error dialing backend: x509: certificate signed by unknown authority",
"code": 500
}]
F1018 02:19:40.778081 129813 helpers.go:115] Error from server: error dialing backend: x509: certificate signed by unknown authority
goroutine 1 [running]:
Adding more information:
This is on UBUNTU 20.04. I went through step by step creating my cluster manually as a beginner I need that experience instead of spinning up with tools like kubeadm or minikube
xxxx#master01:~$ kubectl exec -ti busybox -- nslookup kubernetes
Error from server: error dialing backend: x509: certificate signed by unknown authority
xxxx#master01:~$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
default busybox 1/1 Running 52 2d5h
kube-system coredns-78cb77577b-lbp87 1/1 Running 0 2d5h
kube-system coredns-78cb77577b-n7rvg 1/1 Running 0 2d5h
kube-system weave-net-d9jb6 2/2 Running 7 2d5h
kube-system weave-net-nsqss 2/2 Running 0 2d14h
kube-system weave-net-wnbq7 2/2 Running 7 2d5h
kube-system weave-net-zfsmn 2/2 Running 0 2d14h
kubernetes-dashboard dashboard-metrics-scraper-7b59f7d4df-dhcpn 1/1 Running 0 2d3h
kubernetes-dashboard kubernetes-dashboard-665f4c5ff-6qnzp 1/1 Running 7 2d3h
tinashe#master01:~$ kubectl logs busybox
Error from server: Get "https://worker01:10250/containerLogs/default/busybox/busybox": x509: certificate signed by unknown authority
xxxx#master01:~$
xxxx#master01:~$ kubectl version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:41:49Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
**Edited for simplicity:
my cluster operator kube-apiserver was degraded, causing my certificate failures. Resolving that degradation was necessary to resolve the overarching problem, resulting in x509 errors. Validate that all masters are in READY, pods in your apiserver projects are also scheduled and ready. See below KCS for more information:
https://access.redhat.com/solutions/4849711
**removed below outdated/incorrect information about local cert pull/export.

kubeadm init stuck at health check when deploying HA kubernetes master with haproxy

I am deploying HA kubernetes master(stacked etcd) with kubeadm ,I followed
the instructions on official website :
https://kubernetes.io/docs/setup/independent/high-availability/
four nodes are planned in my cluster for now:
One HAProxy server node used for master loadbalance.
three etcd stacked master nodes.
I deployed haproxy with following configuration:
global
daemon
maxconn 256
defaults
mode http
timeout connect 5000ms
timeout client 50000ms
timeout server 50000ms
frontend haproxy_kube
bind *:6443
mode tcp
option tcplog
timeout client 10800s
default_backend masters
backend masters
mode tcp
option tcplog
balance leastconn
timeout server 10800s
server master01 <master01-ip>:6443 check
my kubeadm-config.yaml is like this:
apiVersion: kubeadm.k8s.io/v1beta1
kind: InitConfiguration
nodeRegistration:
name: "master01"
---
apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
apiServer:
certSANs:
- "<haproxyserver-dns>"
controlPlaneEndpoint: "<haproxyserver-dns>:6443"
networking:
serviceSubnet: "172.24.0.0/16"
podSubnet: "172.16.0.0/16"
my initial command is:
kubeadm init --config=kubeadm-config.yaml -v 11
but after I running the command above on the master01, it kept logging the following information:
I0122 11:43:44.039849 17489 manifests.go:113] [control-plane] wrote static Pod manifest for component "kube-scheduler" to "/etc/kubernetes/manifests/kube-scheduler.yaml"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
I0122 11:43:44.041038 17489 local.go:57] [etcd] wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/manifests/etcd.yaml"
I0122 11:43:44.041068 17489 waitcontrolplane.go:89] [wait-control-plane] Waiting for the API server to be healthy
I0122 11:43:44.042665 17489 loader.go:359] Config loaded from file /etc/kubernetes/admin.conf
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
I0122 11:43:44.044971 17489 round_trippers.go:419] curl -k -v -XGET -H "Accept: application/json, */*" -H "User-Agent: kubeadm/v1.13.2 (linux/amd64) kubernetes/cff46ab" 'https://<haproxyserver-dns>:6443/healthz?timeout=32s'
I0122 11:43:44.120973 17489 round_trippers.go:438] GET https://<haproxyserver-dns>:6443/healthz?timeout=32s in 75 milliseconds
I0122 11:43:44.120988 17489 round_trippers.go:444] Response Headers:
I0122 11:43:44.621201 17489 round_trippers.go:419] curl -k -v -XGET -H "Accept: application/json, */*" -H "User-Agent: kubeadm/v1.13.2 (linux/amd64) kubernetes/cff46ab" 'https://<haproxyserver-dns>:6443/healthz?timeout=32s'
I0122 11:43:44.703556 17489 round_trippers.go:438] GET https://<haproxyserver-dns>:6443/healthz?timeout=32s in 82 milliseconds
I0122 11:43:44.703577 17489 round_trippers.go:444] Response Headers:
I0122 11:43:45.121311 17489 round_trippers.go:419] curl -k -v -XGET -H "Accept: application/json, */*" -H "User-Agent: kubeadm/v1.13.2 (linux/amd64) kubernetes/cff46ab" 'https://<haproxyserver-dns>:6443/healthz?timeout=32s'
I0122 11:43:45.200493 17489 round_trippers.go:438] GET https://<haproxyserver-dns>:6443/healthz?timeout=32s in 79 milliseconds
I0122 11:43:45.200514 17489 round_trippers.go:444] Response Headers:
I0122 11:43:45.621338 17489 round_trippers.go:419] curl -k -v -XGET -H "Accept: application/json, */*" -H "User-Agent: kubeadm/v1.13.2 (linux/amd64) kubernetes/cff46ab" 'https://<haproxyserver-dns>:6443/healthz?timeout=32s'
I0122 11:43:45.698633 17489 round_trippers.go:438] GET https://<haproxyserver-dns>:6443/healthz?timeout=32s in 77 milliseconds
I0122 11:43:45.698652 17489 round_trippers.go:444] Response Headers:
I0122 11:43:46.121323 17489 round_trippers.go:419] curl -k -v -XGET -H "Accept: application/json, */*" -H "User-Agent: kubeadm/v1.13.2 (linux/amd64) kubernetes/cff46ab" 'https://<haproxyserver-dns>:6443/healthz?timeout=32s'
I0122 11:43:46.199641 17489 round_trippers.go:438] GET https://<haproxyserver-dns>:6443/healthz?timeout=32s in 78 milliseconds
I0122 11:43:46.199660 17489 round_trippers.go:444] Response Headers:
after quitting the loop with Ctrl-C, I run the curl command mannually, but every thing seems ok:
curl -k -v -XGET -H "Accept: application/json, */*" -H "User-Agent: kubeadm/v1.13.2 (linux/amd64) kubernetes/cff46ab" 'https://<haproxyserver-dns>:6443/healthz?timeout=32s'
* About to connect() to <haproxyserver-dns> port 6443 (#0)
* Trying <haproxyserver-ip>...
* Connected to <haproxyserver-dns> (10.135.64.223) port 6443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* NSS: client certificate not found (nickname not specified)
* SSL connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate:
* subject: CN=kube-apiserver
* start date: Jan 22 03:43:38 2019 GMT
* expire date: Jan 22 03:43:38 2020 GMT
* common name: kube-apiserver
* issuer: CN=kubernetes
> GET /healthz?timeout=32s HTTP/1.1
> Host: <haproxyserver-dns>:6443
> Accept: application/json, */*
> User-Agent: kubeadm/v1.13.2 (linux/amd64) kubernetes/cff46ab
>
< HTTP/1.1 200 OK
< Date: Tue, 22 Jan 2019 04:09:03 GMT
< Content-Length: 2
< Content-Type: text/plain; charset=utf-8
<
* Connection #0 to host <haproxyserver-dns> left intact
ok
I don't know how to find out the essential cause of this issue, hoping someone who know about this can give me some suggestion. Thanks!
After several days of finding and trying, again, I can solve this problem by myself. In fact, the problem perhaps came with a very rare situation:
I set proxy on master node in both /etc/profile and docker.service.d, which made the request to haproxy don't work well.
I don't know which setting cause this problem. But after adding a no proxy rule, the problem solved and kubeadm successfully initialized a master after the haproxy load balancer. Here is my proxy settings :
/etc/profile:
...
export http_proxy=http://<my-proxy-server-dns:port>/
export no_proxy=<my-k8s-master-loadbalance-server-dns>,<my-proxy-server-dns>,localhost
/etc/systemd/system/docker.service.d/http-proxy.conf:
[Service]
Environment="HTTP_PROXY=http://<my-proxy-server-dns:port>/" "NO_PROXY<my-k8s-master-loadbalance-server-dns>,<my-proxy-server-dns>,localhost, 127.0.0.0/8, 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16"

Kubernetes v1.12 Problems with kubectl exec

I’ve been learning about Kubernetes using Kelsey Hightower’s excellent kubernetes-the-hard-way-guide.
Using this guide I’ve installed v1.12 on GCE. Everything works perfectly apart from kubectl exec:
$ kubectl exec -it shell-demo – /bin/bash --kubeconfig=/root/certsconfigs/admin.kubeconfig
error: unable to upgrade connection: Forbidden (user=kubernetes, verb=create, resource=nodes, subresource=proxy)
Note that I have set KUBECONFIG=/root/certsconfigs/admin.kubeconfig.
Apart from exec all other kubectl functions work as expected with this admin.kubeconfig file, so from that I deduce it valid for use with my cluster.
I’m pretty sure I have made a beginners mistake somewhere, but if somebody could advise where I have gone away, I should be most grateful.
TIA
Shaun
I have double checked that no .kube/config file exists anywhere on my master controller:
root#controller-1:/root/deployment/kubernetes# kubectl get pods
NAME READY STATUS
shell-demo 1/1 Running 0 23m
Here is the output with -v8:
root#controller-1:/root/deployment/kubernetes# kubectl -v8 exec -it shell-demo – /bin/bash
I1118 15:18:16.898428 11117 loader.go:359] Config loaded from file /root/certsconfigs/admin.kubeconfig
I1118 15:18:16.899531 11117 loader.go:359] Config loaded from file /root/certsconfigs/admin.kubeconfig
I1118 15:18:16.900611 11117 loader.go:359] Config loaded from file /root/certsconfigs/admin.kubeconfig
I1118 15:18:16.902851 11117 round_trippers.go:383] GET ://127.0.0.1:6443/api/v1/namespaces/default/pods/shell-demo
I1118 15:18:16.902946 11117 round_trippers.go:390] Request Headers:
I1118 15:18:16.903016 11117 round_trippers.go:393] Accept: application/json, /
I1118 15:18:16.903091 11117 round_trippers.go:393] User-Agent: kubectl/v1.12.0 (linux/amd64) kubernetes/0ed3388
I1118 15:18:16.918699 11117 round_trippers.go:408] Response Status: 200 OK in 15 milliseconds
I1118 15:18:16.918833 11117 round_trippers.go:411] Response Headers:
I1118 15:18:16.918905 11117 round_trippers.go:414] Content-Type: application/json
I1118 15:18:16.918974 11117 round_trippers.go:414] Content-Length: 2176
I1118 15:18:16.919053 11117 round_trippers.go:414] Date: Sun, 18 Nov 2018 15:18:16 GMT
I1118 15:18:16.919218 11117 request.go:942] Response Body: {“kind”:“Pod”,“apiVersion”:“v1”,“metadata”:{“name”:“shell-demo”,“namespace”:“default”,“selfLink”:"/api/v1/namespaces/default/pods/shell-demo",“uid”:“99f320f8-eb42-11e8-a053-42010af0000b”,“resourceVersion”:“13213”,“creationTimestamp”:“2018-11-18T14:59:51Z”},“spec”:{“volumes”:[{“name”:“shared-data”,“emptyDir”:{}},{“name”:“default-token-djprb”,“secret”:{“secretName”:“default-token-djprb”,“defaultMode”:420}}],“containers”:[{“name”:“nginx”,“image”:“nginx”,“resources”:{},“volumeMounts”:[{“name”:“shared-data”,“mountPath”:"/usr/share/nginx/html"},{“name”:“default-token-djprb”,“readOnly”:true,“mountPath”:"/var/run/secrets/kubernetes.io/serviceaccount"}],“terminationMessagePath”:"/dev/termination-log",“terminationMessagePolicy”:“File”,“imagePullPolicy”:“Always”}],“restartPolicy”:“Always”,“terminationGracePeriodSeconds”:30,“dnsPolicy”:“ClusterFirst”,“serviceAccountName”:“default”,“serviceAccount”:“default”,“nodeName”:“worker-1”,“securityContext”:{},“schedulerName”:“default-scheduler”,“tolerations”:[{“key”:"node.kubernet [truncated 1152 chars]
I1118 15:18:16.925240 11117 round_trippers.go:383] POST …
error: unable to upgrade connection: Forbidden (user=kubernetes, verb=create, resource=nodes, subresource=proxy)
According to your logs,the connection between kubectl and the apiserver is fine, and is being authenticated correctly.
To satisfy an exec request, the apiserver contacts the kubelet running the pod, and that connection is what is being forbidden.
Your kubelet is configured to authenticate/authorize requests, and the apiserver credential is not authorized to make the exec request against the kubelet's API.
Based on the forbidden message, your apiserver is authenticating as the "kubernetes" user to the kubelet.
You can grant that user full permissions to the kubelet API with the following command:
kubectl create clusterrolebinding apiserver-kubelet-admin --user=kubernetes --clusterrole=system:kubelet-api-admin
See the following docs for more information
https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet-authentication-authorization/#overview
https://kubernetes.io/docs/reference/access-authn-authz/rbac/#other-component-roles

Kubectl delete -f deployments/ --grace-period=0 --force does not work

What happened:
Force terminate does not work:
[root#master0 manifests]# kubectl delete -f prometheus/deployment.yaml --grace-period=0 --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
deployment.extensions "prometheus-core" force deleted
^C <---- Manual Quit due to hanging. Waited over 5 minutes with no change.
[root#master0 manifests]# kubectl -n monitoring get pods
NAME READY STATUS RESTARTS AGE
alertmanager-668794449d-6dppl 0/1 Terminating 0 22h
grafana-core-576c68c58d-7nvbt 0/1 Terminating 0 22h
kube-state-metrics-69b9d65dd5-rl8td 0/1 Terminating 0 3h
node-directory-size-metrics-6hcfc 2/2 Running 0 3h
node-directory-size-metrics-w7zxh 2/2 Running 0 3h
node-directory-size-metrics-z2m5j 2/2 Running 0 3h
prometheus-core-59778c7987-vh89h 0/1 Terminating 0 3h
prometheus-node-exporter-27fjg 1/1 Running 0 3h
prometheus-node-exporter-2t5v6 1/1 Running 0 3h
prometheus-node-exporter-hhxmv 1/1 Running 0 3h
Then
What you expected to happen:
Pod to be deleted
How to reproduce it (as minimally and precisely as possible):
We feel that the there might have been an IO error with the storage on the pods. Kubernetes has its own dedicated direct storage. All hosted on AWS. Use of t3.xl
Anything else we need to know?:
It seems to happen randomly but happens often enough as we have to reboot the entire cluster. Do stuck in termination can be ok to deal with, having no logs or no control to really force remove them and start again is frustrating.
Environment:
- Kubernetes version (use kubectl version):
kubectl version
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.0", GitCommit:"91e7b4fd31fcd3d5f436da26c980becec37ceefe", GitTreeState:"clean", BuildDate:"2018-06-27T20:08:34Z", GoVersion:"go1.10.2", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration:
AWS
- OS (e.g. from /etc/os-release):
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
Kernel (e.g. uname -a):
Linux 3.10.0-862.6.3.el7.x86_64 #1 SMP Tue Jun 26 16:32:21 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Install tools:
Kubernetes was deployed with Kuberpray with GlusterFS as a container volume and Weave as its networking.
Others:
2 master 1 node setup. We have redeployed the entire setup and still get hit by the same issue.
I have posted this question on their issues page:
https://github.com/kubernetes/kubernetes/issues/68829
But no reply.
Logs from API:
[root#master0 manifests]# kubectl -n monitoring delete pod prometheus-core-59778c7987-bl2h4 --force --grace-period=0 -v9
I0919 13:53:08.770798 19973 loader.go:359] Config loaded from file /root/.kube/config
I0919 13:53:08.771440 19973 loader.go:359] Config loaded from file /root/.kube/config
I0919 13:53:08.772681 19973 loader.go:359] Config loaded from file /root/.kube/config
I0919 13:53:08.780266 19973 loader.go:359] Config loaded from file /root/.kube/config
I0919 13:53:08.780943 19973 loader.go:359] Config loaded from file /root/.kube/config
I0919 13:53:08.781609 19973 loader.go:359] Config loaded from file /root/.kube/config
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
I0919 13:53:08.781876 19973 request.go:897] Request Body: {"gracePeriodSeconds":0,"propagationPolicy":"Foreground"}
I0919 13:53:08.781938 19973 round_trippers.go:386] curl -k -v -XDELETE -H "Accept: application/json" -H "Content-Type: application/json" -H "User-Agent: kubectl/v1.11.0 (linux/amd64) kubernetes/91e7b4f" 'https://10.1.1.28:6443/api/v1/namespaces/monitoring/pods/prometheus-core-59778c7987-bl2h4'
I0919 13:53:08.798682 19973 round_trippers.go:405] DELETE https://10.1.1.28:6443/api/v1/namespaces/monitoring/pods/prometheus-core-59778c7987-bl2h4 200 OK in 16 milliseconds
I0919 13:53:08.798702 19973 round_trippers.go:411] Response Headers:
I0919 13:53:08.798709 19973 round_trippers.go:414] Content-Type: application/json
I0919 13:53:08.798714 19973 round_trippers.go:414] Content-Length: 3199
I0919 13:53:08.798719 19973 round_trippers.go:414] Date: Wed, 19 Sep 2018 13:53:08 GMT
I0919 13:53:08.798758 19973 request.go:897] Response Body: {"kind":"Pod","apiVersion":"v1","metadata":{"name":"prometheus-core-59778c7987-bl2h4","generateName":"prometheus-core-59778c7987-","namespace":"monitoring","selfLink":"/api/v1/namespaces/monitoring/pods/prometheus-core-59778c7987-bl2h4","uid":"7647d17a-bc11-11e8-bd71-06b8eceafd88","resourceVersion":"676465","creationTimestamp":"2018-09-19T13:39:41Z","deletionTimestamp":"2018-09-19T13:40:18Z","deletionGracePeriodSeconds":0,"labels":{"app":"prometheus","component":"core","pod-template-hash":"1533473543"},"ownerReferences":[{"apiVersion":"apps/v1","kind":"ReplicaSet","name":"prometheus-core-59778c7987","uid":"75aba047-bc11-11e8-bd71-06b8eceafd88","controller":true,"blockOwnerDeletion":true}],"finalizers":["foregroundDeletion"]},"spec":{"volumes":[{"name":"config-volume","configMap":{"name":"prometheus-core","defaultMode":420}},{"name":"rules-volume","configMap":{"name":"prometheus-rules","defaultMode":420}},{"name":"api-token","secret":{"secretName":"api-token","defaultMode":420}},{"name":"ca-crt","secret":{"secretName":"ca-crt","defaultMode":420}},{"name":"prometheus-k8s-token-trclf","secret":{"secretName":"prometheus-k8s-token-trclf","defaultMode":420}}],"containers":[{"name":"prometheus","image":"prom/prometheus:v1.7.0","args":["-storage.local.retention=12h","-storage.local.memory-chunks=500000","-config.file=/etc/prometheus/prometheus.yaml","-alertmanager.url=http://alertmanager:9093/"],"ports":[{"name":"webui","containerPort":9090,"protocol":"TCP"}],"resources":{"limits":{"cpu":"500m","memory":"500M"},"requests":{"cpu":"500m","memory":"500M"}},"volumeMounts":[{"name":"config-volume","mountPath":"/etc/prometheus"},{"name":"rules-volume","mountPath":"/etc/prometheus-rules"},{"name":"api-token","mountPath":"/etc/prometheus-token"},{"name":"ca-crt","mountPath":"/etc/prometheus-ca"},{"name":"prometheus-k8s-token-trclf","readOnly":true,"mountPath":"/var/run/secrets/kubernetes.io/serviceaccount"}],"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File","imagePullPolicy":"IfNotPresent"}],"restartPolicy":"Always","terminationGracePeriodSeconds":30,"dnsPolicy":"ClusterFirst","serviceAccountName":"prometheus-k8s","serviceAccount":"prometheus-k8s","nodeName":"master1.infra.cde","securityContext":{},"schedulerName":"default-scheduler"},"status":{"phase":"Pending","conditions":[{"type":"Initialized","status":"True","lastProbeTime":null,"lastTransitionTime":"2018-09-19T13:39:41Z"},{"type":"Ready","status":"False","lastProbeTime":null,"lastTransitionTime":"2018-09-19T13:39:41Z","reason":"ContainersNotReady","message":"containers with unready status: [prometheus]"},{"type":"ContainersReady","status":"False","lastProbeTime":null,"lastTransitionTime":null,"reason":"ContainersNotReady","message":"containers with unready status: [prometheus]"},{"type":"PodScheduled","status":"True","lastProbeTime":null,"lastTransitionTime":"2018-09-19T13:39:41Z"}],"hostIP":"10.1.1.187","startTime":"2018-09-19T13:39:41Z","containerStatuses":[{"name":"prometheus","state":{"terminated":{"exitCode":0,"startedAt":null,"finishedAt":null}},"lastState":{},"ready":false,"restartCount":0,"image":"prom/prometheus:v1.7.0","imageID":""}],"qosClass":"Guaranteed"}}
pod "prometheus-core-59778c7987-bl2h4" force deleted
I0919 13:53:08.798864 19973 round_trippers.go:386] curl -k -v -XGET -H "Accept: application/json" -H "User-Agent: kubectl/v1.11.0 (linux/amd64) kubernetes/91e7b4f" 'https://10.1.1.28:6443/api/v1/namespaces/monitoring/pods/prometheus-core-59778c7987-bl2h4'
I0919 13:53:08.801386 19973 round_trippers.go:405] GET https://10.1.1.28:6443/api/v1/namespaces/monitoring/pods/prometheus-core-59778c7987-bl2h4 200 OK in 2 milliseconds
I0919 13:53:08.801403 19973 round_trippers.go:411] Response Headers:
I0919 13:53:08.801409 19973 round_trippers.go:414] Content-Type: application/json
I0919 13:53:08.801415 19973 round_trippers.go:414] Content-Length: 3199
I0919 13:53:08.801420 19973 round_trippers.go:414] Date: Wed, 19 Sep 2018 13:53:08 GMT
I0919 13:53:08.801465 19973 request.go:897] Response Body: {"kind":"Pod","apiVersion":"v1","metadata":{"name":"prometheus-core-59778c7987-bl2h4","generateName":"prometheus-core-59778c7987-","namespace":"monitoring","selfLink":"/api/v1/namespaces/monitoring/pods/prometheus-core-59778c7987-bl2h4","uid":"7647d17a-bc11-11e8-bd71-06b8eceafd88","resourceVersion":"676465","creationTimestamp":"2018-09-19T13:39:41Z","deletionTimestamp":"2018-09-19T13:40:18Z","deletionGracePeriodSeconds":0,"labels":{"app":"prometheus","component":"core","pod-template-hash":"1533473543"},"ownerReferences":[{"apiVersion":"apps/v1","kind":"ReplicaSet","name":"prometheus-core-59778c7987","uid":"75aba047-bc11-11e8-bd71-06b8eceafd88","controller":true,"blockOwnerDeletion":true}],"finalizers":["foregroundDeletion"]},"spec":{"volumes":[{"name":"config-volume","configMap":{"name":"prometheus-core","defaultMode":420}},{"name":"rules-volume","configMap":{"name":"prometheus-rules","defaultMode":420}},{"name":"api-token","secret":{"secretName":"api-token","defaultMode":420}},{"name":"ca-crt","secret":{"secretName":"ca-crt","defaultMode":420}},{"name":"prometheus-k8s-token-trclf","secret":{"secretName":"prometheus-k8s-token-trclf","defaultMode":420}}],"containers":[{"name":"prometheus","image":"prom/prometheus:v1.7.0","args":["-storage.local.retention=12h","-storage.local.memory-chunks=500000","-config.file=/etc/prometheus/prometheus.yaml","-alertmanager.url=http://alertmanager:9093/"],"ports":[{"name":"webui","containerPort":9090,"protocol":"TCP"}],"resources":{"limits":{"cpu":"500m","memory":"500M"},"requests":{"cpu":"500m","memory":"500M"}},"volumeMounts":[{"name":"config-volume","mountPath":"/etc/prometheus"},{"name":"rules-volume","mountPath":"/etc/prometheus-rules"},{"name":"api-token","mountPath":"/etc/prometheus-token"},{"name":"ca-crt","mountPath":"/etc/prometheus-ca"},{"name":"prometheus-k8s-token-trclf","readOnly":true,"mountPath":"/var/run/secrets/kubernetes.io/serviceaccount"}],"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File","imagePullPolicy":"IfNotPresent"}],"restartPolicy":"Always","terminationGracePeriodSeconds":30,"dnsPolicy":"ClusterFirst","serviceAccountName":"prometheus-k8s","serviceAccount":"prometheus-k8s","nodeName":"master1.infra.cde","securityContext":{},"schedulerName":"default-scheduler"},"status":{"phase":"Pending","conditions":[{"type":"Initialized","status":"True","lastProbeTime":null,"lastTransitionTime":"2018-09-19T13:39:41Z"},{"type":"Ready","status":"False","lastProbeTime":null,"lastTransitionTime":"2018-09-19T13:39:41Z","reason":"ContainersNotReady","message":"containers with unready status: [prometheus]"},{"type":"ContainersReady","status":"False","lastProbeTime":null,"lastTransitionTime":null,"reason":"ContainersNotReady","message":"containers with unready status: [prometheus]"},{"type":"PodScheduled","status":"True","lastProbeTime":null,"lastTransitionTime":"2018-09-19T13:39:41Z"}],"hostIP":"10.1.1.187","startTime":"2018-09-19T13:39:41Z","containerStatuses":[{"name":"prometheus","state":{"terminated":{"exitCode":0,"startedAt":null,"finishedAt":null}},"lastState":{},"ready":false,"restartCount":0,"image":"prom/prometheus:v1.7.0","imageID":""}],"qosClass":"Guaranteed"}}
I0919 13:53:08.801758 19973 round_trippers.go:386] curl -k -v -XGET -H "Accept: application/json" -H "User-Agent: kubectl/v1.11.0 (linux/amd64) kubernetes/91e7b4f" 'https://10.1.1.28:6443/api/v1/namespaces/monitoring/pods?fieldSelector=metadata.name%3Dprometheus-core-59778c7987-bl2h4&resourceVersion=676465&watch=true'
I0919 13:53:08.803409 19973 round_trippers.go:405] GET https://10.1.1.28:6443/api/v1/namespaces/monitoring/pods?fieldSelector=metadata.name%3Dprometheus-core-59778c7987-bl2h4&resourceVersion=676465&watch=true 200 OK in 1 milliseconds
I0919 13:53:08.803424 19973 round_trippers.go:411] Response Headers:
I0919 13:53:08.803430 19973 round_trippers.go:414] Date: Wed, 19 Sep 2018 13:53:08 GMT
I0919 13:53:08.803436 19973 round_trippers.go:414] Content-Type: application/json
After some investigation and help from the Kubernetes community over on github. We found the solution. The answer is, in 1.11.0 there is a known bug in relation to this issue. after upgrading to 1.12.0 the issue was resolved. The issue is noted to be resolved in 1.11.1
Thanks to cduchesne https://github.com/kubernetes/kubernetes/issues/68829#issuecomment-422878108
Some times a Kubernetes workers have problems like zombie process or kernel panics or IO waiting. But when you want to delete a pod that use Storage and have many IO/PS like Prometheus DB , your worker can not kill that pods.
I had same situation like you but on Container Linux without any cloud platform like AWS and Gcloud. i just rebooted my broke worker and after that delete them normally without --grace-period=0. --grace-period=0 is a very bad command when your nodes and pods are running without any problems.
workers can reboot when you use an K8S. This is a good fetcher of K8S.
For run Prometheus you should make some Prometheus with different config or use federation for scale Prometheus if you want to have a Monitoring System without IO problem .
After you issue the kubectl delete I would log into the nodes where the pods are running and debug with docker commands. (assuming your runtime is Docker)
docker logs <container-with-issue>
docker exec -it <container-with-with-issue> bash # maybe the application is hanging.
Are you mounting any volumes for Prometheus? It could be that it's trying to release an EBS volume and the AWS API is unresponsive.
Hope it helps!