Kubernetes: fix gke-metrics-agent stuck in Terminating state on GKE

GKE had an outage about 2 days ago in their London datacentre (https://status.cloud.google.com/incident/compute/20013), and since then one of my nodes has been acting up. I've had to manually terminate a number of pods running on it, and I'm having issues with a couple of sites, I assume because their liveness checks are failing intermittently, which might have something to do with the gke-metrics-agent problem below.
Looking at the system pods I can see one instance of gke-metrics-agent is stuck in a terminating state and has been since last night:
kubectl get pods -n kube-system
reports:
...
gke-metrics-agent-k47g8 0/1 Terminating 0 32d
gke-metrics-agent-knr9h 1/1 Running 0 31h
gke-metrics-agent-vqkpw 1/1 Running 0 32d
...
I've looked at the describe output for the pod but can't see anything that helps me understand what needs to be done:
kubectl describe pod gke-metrics-agent-k47g8 -n kube-system
Name: gke-metrics-agent-k47g8
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Node: <node-name>/<IP>
Start Time: Mon, 09 Nov 2020 03:41:14 +0000
Labels: component=gke-metrics-agent
controller-revision-hash=f8c5b8bfb
k8s-app=gke-metrics-agent
pod-template-generation=4
Annotations: components.gke.io/component-name: gke-metrics-agent
components.gke.io/component-version: 0.27.1
configHash: <config-hash>
Status: Terminating (lasts 15h)
Termination Grace Period: 30s
IP: <IP>
IPs:
IP: <IP>
Controlled By: DaemonSet/gke-metrics-agent
Containers:
gke-metrics-agent:
Container ID: docker://<id>
Image: gcr.io/gke-release/gke-metrics-agent:0.1.3-gke.0
Image ID: docker-pullable://gcr.io/gke-release/gke-metrics-agent@sha256:<hash>
Port: <none>
Host Port: <none>
Command:
/otelsvc
--config=/conf/gke-metrics-agent-config.yaml
--metrics-level=NONE
State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 09 Nov 2020 03:41:17 +0000
Finished: Thu, 10 Dec 2020 21:16:50 +0000
Ready: False
Restart Count: 0
Limits:
memory: 50Mi
Requests:
cpu: 3m
memory: 50Mi
Environment:
NODE_NAME: (v1:spec.nodeName)
POD_NAME: gke-metrics-agent-k47g8 (v1:metadata.name)
POD_NAMESPACE: kube-system (v1:metadata.namespace)
KUBELET_HOST: 127.0.0.1
ARG1: ${1}
ARG2: ${2}
Mounts:
/conf from gke-metrics-agent-config-vol (rw)
/etc/ssl/certs from ssl-certs (ro)
/var/run/secrets/kubernetes.io/serviceaccount from gke-metrics-agent-token-cn6ss (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
gke-metrics-agent-config-vol:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: gke-metrics-agent-conf
Optional: false
ssl-certs:
Type: HostPath (bare host directory volume)
Path: /etc/ssl/certs
HostPathType:
gke-metrics-agent-token-cn6ss:
Type: Secret (a volume populated by a Secret)
SecretName: gke-metrics-agent-token-cn6ss
Optional: false
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: :NoExecute
:NoSchedule
components.gke.io/gke-managed-components
node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/network-unavailable:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/pid-pressure:NoSchedule
node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unschedulable:NoSchedule
Events: <none>
I'm not used to having to work on system pods; in the past, my troubleshooting has often fallen back on force deleting pods when all else fails:
kubectl delete pod <pod-name> -n <ns> --grace-period=0 --force
My concern is that I don't fully understand what this might do to a system pod, so I was hoping someone with more expertise could advise on a sensible way forward.
I'm also looking at draining this node so Kubernetes can rebuild a new one. Would that potentially be the easiest way to go?
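Before resorting to the force delete, I assume the sensible first checks are whether the stuck pod has any finalizers on it and whether the kubelet on that node is still healthy, since a pod stuck in Terminating usually means the kubelet hasn't confirmed its containers are gone. Something along these lines (pod and node names as in the output above):
kubectl get pod gke-metrics-agent-k47g8 -n kube-system -o jsonpath='{.metadata.finalizers}'
kubectl get node <node-name>
kubectl describe node <node-name>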

Following up on this: the node that was experiencing the issues with gke-metrics-agent became even less stable as the day went on.
I therefore had to drain it. The resources it was running are now on new nodes, which are working as expected, and all system pods are running as expected (including gke-metrics-agent).
Prior to draining the node I ensured Pod Disruption Budgets were in place, as a number of services run on only 1 or 2 instances:
https://kubernetes.io/docs/tasks/run-application/configure-pdb/
This meant I could run:
kubectl drain <node-name>
The Deployments then ensured they had enough live pods before the bad node was taken offline, which seems to have avoided any downtime.
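For reference, the PDBs were roughly of this shape; the name and label here are placeholders rather than the real ones, and older clusters use policy/v1beta1 instead of policy/v1:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-site-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-site
In practice, a drain on a node running DaemonSet-managed pods such as gke-metrics-agent typically also needs --ignore-daemonsets, since kubectl drain refuses to proceed otherwise:
kubectl drain <node-name> --ignore-daemonsets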

Related

Getting CrashLoopBackOff error when deploying a pod

I am new to Kubernetes and am trying to deploy a pod from a private registry. Whenever I deploy this yaml it goes into a crash loop. I added a sleep with a large value, thinking that a quick exit might be the cause, but it still hasn't worked.
apiVersion: v1
kind: Pod
metadata:
name: privetae-image-testing
spec:
containers:
- name: private-image-test
image: buildforjenkin.azurecr.io/nginx:latest
imagePullPolicy: IfNotPresent
command: ['echo','success','sleep 1000000']
Here are the logs:
Name: privetae-image-testing
Namespace: default
Priority: 0
Node: docker-desktop/192.168.65.4
Start Time: Sun, 24 Oct 2021 15:52:25 +0530
Labels: <none>
Annotations: <none>
Status: Running
IP: 10.1.1.49
IPs:
IP: 10.1.1.49
Containers:
private-image-test:
Container ID: docker://46520936762f17b70d1ec92a121269e90aef2549390a14184e6c838e1e6bafec
Image: buildforjenkin.azurecr.io/nginx:latest
Image ID: docker-pullable://buildforjenkin.azurecr.io/nginx@sha256:7250923ba3543110040462388756ef099331822c6172a050b12c7a38361ea46f
Port: <none>
Host Port: <none>
Command:
echo
success
sleep 1000000
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Sun, 24 Oct 2021 15:52:42 +0530
Finished: Sun, 24 Oct 2021 15:52:42 +0530
Ready: False
Restart Count: 2
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ld6zz (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-ld6zz:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 34s default-scheduler Successfully assigned default/privetae-image-testing to docker-desktop
Normal Pulled 17s (x3 over 33s) kubelet Container image "buildforjenkin.azurecr.io/nginx:latest" already present on machine
Normal Created 17s (x3 over 33s) kubelet Created container private-image-test
Normal Started 17s (x3 over 33s) kubelet Started container private-image-test
Warning BackOff 2s (x5 over 31s) kubelet Back-off restarting failed container
I am running the cluster on docker-desktop on Windows. TIA
Notice you are using the standard nginx image? Try deleting your pod and re-applying with:
apiVersion: v1
kind: Pod
metadata:
name: private-image-testing
labels:
run: my-nginx
spec:
restartPolicy: Always
containers:
- name: private-image-test
image: buildforjenkin.azurecr.io/nginx:latest
imagePullPolicy: IfNotPresent
ports:
- containerPort: 80
name: http
If your pod runs, you should be able to remote into it with kubectl exec -it private-image-testing -- sh, followed by wget -O- localhost, which should print a welcome message. If it still fails, paste the output of kubectl logs -f -l run=my-nginx into your question.
Check my previous answer to understand step-by-step what's going on after you launch the container.
You are launching an nginx:latest container whose main process runs forever, as it should, so that the main process doesn't exit. Then you override that with a command that (to quote David) will print the words success and sleep 1000000, and having printed those words, then exit.
Instead of keeping your container running so it can serve traffic, you are explicitly shooting yourself in the foot by ending the process: sleep 1000000 here is just another argument passed to echo, not a command that runs.
And sure enough, your command is executed and the container exits. Check the output below: it exited cleanly with status 0, has already done so twice, and will keep doing so.
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Sun, 24 Oct 2021 15:52:42 +0530
Finished: Sun, 24 Oct 2021 15:52:42 +0530
You need to think carefully about whether you really need command: ['echo','success','sleep 1000000']
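If the intent was to print a message and then keep the container alive, one minimal sketch (not the only option) is to wrap the command in a shell, so that sleep actually runs as a command rather than being passed as an argument to echo:
apiVersion: v1
kind: Pod
metadata:
  name: private-image-testing
spec:
  containers:
  - name: private-image-test
    image: buildforjenkin.azurecr.io/nginx:latest
    imagePullPolicy: IfNotPresent
    # the shell prints the message and then blocks, so the container stays Running
    command: ['sh', '-c', 'echo success; sleep 1000000']
Dropping command entirely and keeping the ports block, as in the answer above, is the better choice if the goal is simply to run nginx.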

kubernetes: when pod in CrashLoopBackOff status, related events won't update?

I'm testing Kubernetes behavior when a pod hits an error.
I now have a pod in CrashLoopBackOff status caused by a failing liveness probe. From what I can see in the Kubernetes events, the pod goes into CrashLoopBackOff after 3 tries and begins to back off restarting, but the related Liveness probe failed events don't seem to update?
➜ ~ kubectl describe pods/my-nginx-liveness-err-59fb55cf4d-c6p8l
Name: my-nginx-liveness-err-59fb55cf4d-c6p8l
Namespace: default
Priority: 0
Node: minikube/192.168.99.100
Start Time: Thu, 15 Jul 2021 12:29:16 +0800
Labels: pod-template-hash=59fb55cf4d
run=my-nginx-liveness-err
Annotations: <none>
Status: Running
IP: 172.17.0.3
IPs:
IP: 172.17.0.3
Controlled By: ReplicaSet/my-nginx-liveness-err-59fb55cf4d
Containers:
my-nginx-liveness-err:
Container ID: docker://edc363b76811fdb1ccacdc553d8de77e9d7455bb0d0fb3cff43eafcd12ee8a92
Image: nginx
Image ID: docker-pullable://nginx@sha256:353c20f74d9b6aee359f30e8e4f69c3d7eaea2f610681c4a95849a2fd7c497f9
Port: 80/TCP
Host Port: 0/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 15 Jul 2021 13:01:36 +0800
Finished: Thu, 15 Jul 2021 13:02:06 +0800
Ready: False
Restart Count: 15
Liveness: http-get http://:8080/ delay=0s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r7mh4 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-r7mh4:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 37m default-scheduler Successfully assigned default/my-nginx-liveness-err-59fb55cf4d-c6p8l to minikube
Normal Created 35m (x4 over 37m) kubelet Created container my-nginx-liveness-err
Normal Started 35m (x4 over 37m) kubelet Started container my-nginx-liveness-err
Normal Killing 35m (x3 over 36m) kubelet Container my-nginx-liveness-err failed liveness probe, will be restarted
Normal Pulled 31m (x7 over 37m) kubelet Container image "nginx" already present on machine
Warning Unhealthy 16m (x32 over 36m) kubelet Liveness probe failed: Get "http://172.17.0.3:8080/": dial tcp 172.17.0.3:8080: connect: connection refused
Warning BackOff 118s (x134 over 34m) kubelet Back-off restarting failed container
The BackOff event was updated 118s ago, but the Unhealthy event was last updated 16m ago?
And why am I seeing a Restart Count of only 15 when the BackOff event shows a count of 134?
I'm using minikube and my deployment is like this:
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-nginx-liveness-err
spec:
selector:
matchLabels:
run: my-nginx-liveness-err
replicas: 1
template:
metadata:
labels:
run: my-nginx-liveness-err
spec:
containers:
- name: my-nginx-liveness-err
image: nginx
imagePullPolicy: IfNotPresent
ports:
- containerPort: 80
livenessProbe:
httpGet:
path: /
port: 8080
I think you might be confusing Status Conditions and Events. Events don't "update", they just exist; they are a stream of event data from the controllers, for debugging or alerting on. The Age column is the relative timestamp of the most recent instance of that event type, and you can see it does some basic de-duplication. Events also age out after a few hours to keep the database from exploding.
So your issue has nothing to do with the liveness probe, your container is crashing on startup.
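If it helps to see the raw counts and timestamps instead of the relative Age column from describe, the underlying Event objects can be listed directly; this is a minimal sketch using the pod name from the question:
kubectl get events --field-selector involvedObject.name=my-nginx-liveness-err-59fb55cf4d-c6p8l --sort-by=.lastTimestamp
kubectl get events --field-selector involvedObject.name=my-nginx-liveness-err-59fb55cf4d-c6p8l -o custom-columns=LAST:.lastTimestamp,COUNT:.count,REASON:.reason,MESSAGE:.message
The count field is what describe is summarizing as "(x134 over 34m)".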

Dashboard not running

I have set up Kubernetes on an Ubuntu server using this link.
Then I installed the Kubernetes dashboard using:
kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.0.0-rc6/aio/deploy/recommended.yaml
Then I changed the service type from ClusterIP to NodePort, using node port 32323.
But the container is not running.
uday@dockermaster:~$ kubectl -n kubernetes-dashboard get all
NAME READY STATUS RESTARTS AGE
pod/dashboard-metrics-scraper-779f5454cb-pqfrj 1/1 Running 0 50m
pod/kubernetes-dashboard-64686c4bf9-5jkwq 0/1 CrashLoopBackOff 14 50m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/dashboard-metrics-scraper ClusterIP 10.103.22.252 <none> 8000/TCP 50m
service/kubernetes-dashboard NodePort 10.102.48.80 <none> 443:32323/TCP 50m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/dashboard-metrics-scraper 1/1 1 1 50m
deployment.apps/kubernetes-dashboard 0/1 1 0 50m
NAME DESIRED CURRENT READY AGE
replicaset.apps/dashboard-metrics-scraper-779f5454cb 1 1 1 50m
replicaset.apps/kubernetes-dashboard-64686c4bf9 1 1 0 50m
uday@dockermaster:~$ kubectl -n kubernetes-dashboard describe svc kubernetes-dashboard
Name: kubernetes-dashboard
Namespace: kubernetes-dashboard
Labels: k8s-app=kubernetes-dashboard
Annotations: Selector: k8s-app=kubernetes-dashboard
Type: NodePort
IP: 10.102.48.80
Port: <unset> 443/TCP
TargetPort: 8443/TCP
NodePort: <unset> 32323/TCP
Endpoints:
Session Affinity: None
External Traffic Policy: Cluster
Events: <none>
Other apps are working fine with NodePort, whether Tomcat, nginx, or databases.
But here it is failing at container creation.
C:\Users\uday\Desktop>kubectl.exe get pods -n kubernetes-dashboard
NAME READY STATUS RESTARTS AGE
dashboard-metrics-scraper-779f5454cb-pqfrj 1/1 Running 1 20h
kubernetes-dashboard-64686c4bf9-g9z2k 0/1 CrashLoopBackOff 84 18h
C:\Users\uday\Desktop>kubectl.exe describe pod kubernetes-dashboard-64686c4bf9-g9z2k -n kubernetes-dashboard
Name: kubernetes-dashboard-64686c4bf9-g9z2k
Namespace: kubernetes-dashboard
Priority: 0
Node: slave-node/10.0.0.6
Start Time: Sat, 28 Mar 2020 14:16:54 +0000
Labels: k8s-app=kubernetes-dashboard
pod-template-hash=64686c4bf9
Annotations: <none>
Status: Running
IP: 182.244.1.12
IPs:
IP: 182.244.1.12
Controlled By: ReplicaSet/kubernetes-dashboard-64686c4bf9
Containers:
kubernetes-dashboard:
Container ID: docker://470ee8c61998c3c3dda86c58ad17817468f55aa73cd4feecf3b018977ce13ca3
Image: kubernetesui/dashboard:v2.0.0-rc6
Image ID: docker-pullable://kubernetesui/dashboard@sha256:61f9c378c427a3f8a9643f83baa9f96db1ae1357c67a93b533ae7b36d71c69dc
Port: 8443/TCP
Host Port: 0/TCP
Args:
--auto-generate-certificates
--namespace=kubernetes-dashboard
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Sun, 29 Mar 2020 09:01:31 +0000
Finished: Sun, 29 Mar 2020 09:02:01 +0000
Ready: False
Restart Count: 84
Liveness: http-get https://:8443/ delay=30s timeout=30s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/certs from kubernetes-dashboard-certs (rw)
/tmp from tmp-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kubernetes-dashboard-token-pzfbl (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kubernetes-dashboard-certs:
Type: Secret (a volume populated by a Secret)
SecretName: kubernetes-dashboard-certs
Optional: false
tmp-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kubernetes-dashboard-token-pzfbl:
Type: Secret (a volume populated by a Secret)
SecretName: kubernetes-dashboard-token-pzfbl
Optional: false
QoS Class: BestEffort
Node-Selectors: beta.kubernetes.io/os=linux
Tolerations: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 4m49s (x501 over 123m) kubelet, slave-node Back-off restarting failed container
kubectl.exe logs kubernetes-dashboard-64686c4bf9-g9z2k -n kubernetes-dashboard
2020/03/29 09:01:31 Starting overwatch
2020/03/29 09:01:31 Using namespace: kubernetes-dashboard
2020/03/29 09:01:31 Using in-cluster config to connect to apiserver
2020/03/29 09:01:31 Using secret token for csrf signing
2020/03/29 09:01:31 Initializing csrf token from kubernetes-dashboard-csrf secret
panic: Get https://10.96.0.1:443/api/v1/namespaces/kubernetes-dashboard/secrets/kubernetes-dashboard-csrf: dial tcp 10.96.0.1:443: i/o timeout
goroutine 1 [running]:
github.com/kubernetes/dashboard/src/app/backend/client/csrf.(*csrfTokenManager).init(0xc0004e2dc0)
/home/travis/build/kubernetes/dashboard/src/app/backend/client/csrf/manager.go:40 +0x3b0
github.com/kubernetes/dashboard/src/app/backend/client/csrf.NewCsrfTokenManager(...)
/home/travis/build/kubernetes/dashboard/src/app/backend/client/csrf/manager.go:65
github.com/kubernetes/dashboard/src/app/backend/client.(*clientManager).initCSRFKey(0xc00043ae80)
/home/travis/build/kubernetes/dashboard/src/app/backend/client/manager.go:499 +0xc6
github.com/kubernetes/dashboard/src/app/backend/client.(*clientManager).init(0xc00043ae80)
/home/travis/build/kubernetes/dashboard/src/app/backend/client/manager.go:467 +0x47
github.com/kubernetes/dashboard/src/app/backend/client.NewClientManager(...)
/home/travis/build/kubernetes/dashboard/src/app/backend/client/manager.go:548
main.main()
/home/travis/build/kubernetes/dashboard/src/app/backend/dashboard.go:105 +0x20d
The Problem
The reason the application is not coming up is that the Dashboard container itself is not running. If you look at the output you provided, you can see this:
pod/kubernetes-dashboard-64686c4bf9-5jkwq 0/1 CrashLoopBackOff 14
So how do we troubleshoot this? Well, there are three principal ways, one of which you'll probably be using more than the other two.
Describe
Describe is a command that allows you to fetch details about a resource in Kubernetes. This could be metadata, the number of replicas you assigned, or even events depicting why a resource is failing to start, for example when a referenced container image in your Pod manifest cannot be found in the usable container registries. The syntax for using Describe is like so:
kubectl describe pod -n kubernetes-dashboard kubernetes-dashboard-64686c4bf9-5jkwq
Here are some great docs on the tool as well.
Logs
The next troubleshooting step you'll likely take advantage of in Kubernetes is the logging architecture. As you're probably aware, when a Docker container is spawned it is common practice to have the logs produced by the application redirected to STDOUT or STDERR for the process. Kubernetes then captures this log data for you and provides an API abstraction layer through which you can interact with it. Sometimes your Describe events won't have any indication of why a process isn't running; in that case you can proceed to grab logs from the process to determine what is going wrong. An example syntax might look like:
kubectl logs -f -n kubernetes-dashboard kubernetes-dashboard-64686c4bf9-5jkwq
Exec
The last common troubleshooting technique is Exec. Exec effectively allows you to attach to a shell in a running container so that you can interact with the live environment to troubleshoot the application. This allows you to do things like see if configuration files were properly staged on the container's filesystem, determine whether environment variables were properly expanded and set, etc. An example syntax for Exec might look like:
kubectl exec -it -n kubernetes-dashboard kubernetes-dashboard-64686c4bf9-5jkwq sh
In this case, however, your pod is in a CrashLoopBackoff state. This means that you will not be able to exec into it, because the container is not running. The kubelet has recognized a pattern of failures and is automatically backing off its attempts to restart the container.
Here is a good thread on how to troubleshoot pods that enter this state.
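One extra trick that is specific to CrashLoopBackOff: even though you cannot exec into the container, you can still pull the logs of the previous, crashed instance, which is often where the real error message lives. A sketch using the pod from this question:
kubectl logs -n kubernetes-dashboard kubernetes-dashboard-64686c4bf9-5jkwq --previous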
Summary
So, now that I've said all of this, how do we answer your question? Well, we can't answer it directly, but I sort of did with the summary above, because the real answer you're looking for is how to properly troubleshoot Linux containers running in Kubernetes. These issues will be a recurring theme in your experience with Kubernetes, so it's essential to develop debugging skills in the ecosystem as soon as possible.
If the Describe, Logs, and Exec commands are unable to help you find out why the Dashboard pod is failing to come up, feel free to add a comment on this answer requesting additional support and I'll be happy to help where I can!

Kubernetes master doesn't attach FlexVolume

I'm trying to attach the dummy-attachable FlexVolume sample for Kubernetes which seems to initialize normally according to my logs on both the nodes and master:
Loaded volume plugin "flexvolume-k8s/dummy-attachable"
But when I try to attach the volume to a pod, the attach method never gets called from the master. The logs from the node read:
flexVolume driver k8s/dummy-attachable: using default GetVolumeName for volume dummy-attachable
operationExecutor.VerifyControllerAttachedVolume started for volume "dummy-attachable"
Operation for "\"flexvolume-k8s/dummy-attachable/dummy-attachable\"" failed. No retries permitted until 2019-04-22 13:42:51.21390334 +0000 UTC m=+4814.674525788 (durationBeforeRetry 500ms). Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"dummy-attachable\" (UniqueName: \"flexvolume-k8s/dummy-attachable/dummy-attachable\") pod \"nginx-dummy-attachable\"
Here's how I'm attempting to mount the volume:
apiVersion: v1
kind: Pod
metadata:
name: nginx-dummy-attachable
namespace: default
spec:
containers:
- name: nginx-dummy-attachable
image: nginx
volumeMounts:
- name: dummy-attachable
mountPath: /data
ports:
- containerPort: 80
volumes:
- name: dummy-attachable
flexVolume:
driver: "k8s/dummy-attachable"
Here is the output of kubectl describe pod nginx-dummy-attachable:
Name: nginx-dummy-attachable
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: [node id]
Start Time: Wed, 24 Apr 2019 08:03:21 -0400
Labels: <none>
Annotations: kubernetes.io/limit-ranger: LimitRanger plugin set: cpu request for container nginx-dummy-attachable
Status: Pending
IP:
Containers:
nginx-dummy-attachable:
Container ID:
Image: nginx
Image ID:
Port: 80/TCP
Host Port: 0/TCP
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Requests:
cpu: 100m
Environment: <none>
Mounts:
/data from dummy-attachable (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-hcnhj (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
dummy-attachable:
Type: FlexVolume (a generic volume resource that is provisioned/attached using an exec based plugin)
Driver: k8s/dummy-attachable
FSType:
SecretRef: nil
ReadOnly: false
Options: map[]
default-token-hcnhj:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-hcnhj
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 41s (x6 over 11m) kubelet, [node id] Unable to mount volumes for pod "nginx-dummy-attachable_default([id])": timeout expired waiting for volumes to attach or mount for pod "default"/"nginx-dummy-attachable". list of unmounted volumes=[dummy-attachable]. list of unattached volumes=[dummy-attachable default-token-hcnhj]
I added debug logging to the FlexVolume, so I was able to verify that the attach method was never called on the master node. I'm not sure what I'm missing here.
I don't know if this matters, but the cluster is being launched with KOPS. I've tried with both k8s 1.11 and 1.14 with no success.
So this is a fun one.
Even though kubelet initializes the FlexVolume plugin on master, kube-controller-manager, which is containerized in KOPs, is the application that's actually responsible for attaching the volume to the pod. KOPs doesn't mount the default plugin directory /usr/libexec/kubernetes/kubelet-plugins/volume/exec into the kube-controller-manager pod, so it doesn't know anything about your FlexVolume plugins on master.
There doesn't appear to be a non-hacky way to do this other than to use a different Kubernetes deployment tool until KOPs addresses this problem.
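For reference, the directory in question is configurable on both components, which is the mechanism behind the mismatch described above; these are the upstream flags and their defaults, not a KOPs-specific fix:
# kubelet: where FlexVolume drivers are discovered on each node
--volume-plugin-dir=/usr/libexec/kubernetes/kubelet-plugins/volume/exec/
# kube-controller-manager: where the attach/detach controller looks for the same drivers
--flex-volume-plugin-dir=/usr/libexec/kubernetes/kubelet-plugins/volume/exec/
Unless the containerized kube-controller-manager can actually see whatever directory its flag points at, attach will never be called, which matches the behavior in the question.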

MountVolume.SetUp failed for volume "nfs" : mount failed: exit status 32

This is the 2nd question, following my 1st question at
PersistentVolumeClaim is not bound: "nfs-pv-provisioning-demo"
I am setting up a Kubernetes lab using only one node and learning to set up Kubernetes NFS. I am following the Kubernetes NFS example step by step from the following link: https://github.com/kubernetes/examples/tree/master/staging/volumes/nfs
Based on feedback provided by 'helmbert', I modified the content of
https://github.com/kubernetes/examples/blob/master/staging/volumes/nfs/provisioner/nfs-server-gce-pv.yaml
It works and I don't see the event "PersistentVolumeClaim is not bound: “nfs-pv-provisioning-demo”" anymore.
$ cat nfs-server-local-pv01.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
name: pv01
labels:
type: local
spec:
capacity:
storage: 10Gi
accessModes:
- ReadWriteOnce
hostPath:
path: "/tmp/data01"
$ cat nfs-server-local-pvc01.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: nfs-pv-provisioning-demo
labels:
demo: nfs-pv-provisioning
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 5Gi
$ kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pv01 10Gi RWO Retain Available 4s
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
nfs-pv-provisioning-demo Bound pv01 10Gi RWO 2m
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
nfs-server-nlzlv 1/1 Running 0 1h
$ kubectl describe pods nfs-server-nlzlv
Name: nfs-server-nlzlv
Namespace: default
Node: lab-kube-06/10.0.0.6
Start Time: Tue, 21 Nov 2017 19:32:21 +0000
Labels: role=nfs-server
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"default","name":"nfs-server","uid":"b1b00292-cef2-11e7-8ed3-000d3a04eb...
Status: Running
IP: 10.32.0.3
Created By: ReplicationController/nfs-server
Controlled By: ReplicationController/nfs-server
Containers:
nfs-server:
Container ID: docker://1ea76052920d4560557cfb5e5bfc9f8efc3af5f13c086530bd4e0aded201955a
Image: gcr.io/google_containers/volume-nfs:0.8
Image ID: docker-pullable://gcr.io/google_containers/volume-nfs@sha256:83ba87be13a6f74361601c8614527e186ca67f49091e2d0d4ae8a8da67c403ee
Ports: 2049/TCP, 20048/TCP, 111/TCP
State: Running
Started: Tue, 21 Nov 2017 19:32:43 +0000
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/exports from mypvc (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-grgdz (ro)
Conditions:
Type Status
Initialized True
Ready True
PodScheduled True
Volumes:
mypvc:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: nfs-pv-provisioning-demo
ReadOnly: false
default-token-grgdz:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-grgdz
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.alpha.kubernetes.io/notReady:NoExecute for 300s
node.alpha.kubernetes.io/unreachable:NoExecute for 300s
Events: <none>
I continued with the rest of the steps and reached the "Setup the fake backend" section, where I ran the following command:
$ kubectl create -f examples/volumes/nfs/nfs-busybox-rc.yaml
Both nfs-busybox pods show status 'ContainerCreating' and never change to 'Running'. Is this because the container image is for Google Cloud, as shown in the yaml?
https://github.com/kubernetes/examples/blob/master/staging/volumes/nfs/nfs-server-rc.yaml
containers:
- name: nfs-server
image: gcr.io/google_containers/volume-nfs:0.8
ports:
- name: nfs
containerPort: 2049
- name: mountd
containerPort: 20048
- name: rpcbind
containerPort: 111
securityContext:
privileged: true
volumeMounts:
- mountPath: /exports
name: mypvc
Do I have to replace that 'image' line with something else because I don't use Google Cloud for this lab? I only have a single node in my lab. Do I have to rewrite the 'containers' definition above? What should I replace the 'image' line with? Do I need to download a dockerized NFS image from somewhere?
$ kubectl describe pvc nfs
Name: nfs
Namespace: default
StorageClass:
Status: Bound
Volume: nfs
Labels: <none>
Annotations: pv.kubernetes.io/bind-completed=yes
pv.kubernetes.io/bound-by-controller=yes
Capacity: 1Mi
Access Modes: RWX
Events: <none>
$ kubectl describe pv nfs
Name: nfs
Labels: <none>
Annotations: pv.kubernetes.io/bound-by-controller=yes
StorageClass:
Status: Bound
Claim: default/nfs
Reclaim Policy: Retain
Access Modes: RWX
Capacity: 1Mi
Message:
Source:
Type: NFS (an NFS mount that lasts the lifetime of a pod)
Server: 10.111.29.157
Path: /
ReadOnly: false
Events: <none>
$ kubectl get rc
NAME DESIRED CURRENT READY AGE
nfs-busybox 2 2 0 25s
nfs-server 1 1 1 1h
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
nfs-busybox-lmgtx 0/1 ContainerCreating 0 3m
nfs-busybox-xn9vz 0/1 ContainerCreating 0 3m
nfs-server-nlzlv 1/1 Running 0 1h
$ kubectl get service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 20m
nfs-server ClusterIP 10.111.29.157 <none> 2049/TCP,20048/TCP,111/TCP 9s
$ kubectl describe services nfs-server
Name: nfs-server
Namespace: default
Labels: <none>
Annotations: <none>
Selector: role=nfs-server
Type: ClusterIP
IP: 10.111.29.157
Port: nfs 2049/TCP
TargetPort: 2049/TCP
Endpoints: 10.32.0.3:2049
Port: mountd 20048/TCP
TargetPort: 20048/TCP
Endpoints: 10.32.0.3:20048
Port: rpcbind 111/TCP
TargetPort: 111/TCP
Endpoints: 10.32.0.3:111
Session Affinity: None
Events: <none>
$ kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
nfs 1Mi RWX Retain Bound default/nfs 38m
pv01 10Gi RWO Retain Bound default/nfs-pv-provisioning-demo 1h
I see repeating events - MountVolume.SetUp failed for volume "nfs" : mount failed: exit status 32
$ kubectl describe pod nfs-busybox-lmgtx
Name: nfs-busybox-lmgtx
Namespace: default
Node: lab-kube-06/10.0.0.6
Start Time: Tue, 21 Nov 2017 20:39:35 +0000
Labels: name=nfs-busybox
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"default","name":"nfs-busybox","uid":"15d683c2-cefc-11e7-8ed3-000d3a04e...
Status: Pending
IP:
Created By: ReplicationController/nfs-busybox
Controlled By: ReplicationController/nfs-busybox
Containers:
busybox:
Container ID:
Image: busybox
Image ID:
Port: <none>
Command:
sh
-c
while true; do date > /mnt/index.html; hostname >> /mnt/index.html; sleep $(($RANDOM % 5 + 5)); done
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/mnt from nfs (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-grgdz (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
nfs:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: nfs
ReadOnly: false
default-token-grgdz:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-grgdz
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.alpha.kubernetes.io/notReady:NoExecute for 300s
node.alpha.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 17m default-scheduler Successfully assigned nfs-busybox-lmgtx to lab-kube-06
Normal SuccessfulMountVolume 17m kubelet, lab-kube-06 MountVolume.SetUp succeeded for volume "default-token-grgdz"
Warning FailedMount 17m kubelet, lab-kube-06 MountVolume.SetUp failed for volume "nfs" : mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/15d8d6d6-cefc-11e7-8ed3-000d3a04ebcd/volumes/kubernetes.io~nfs/nfs --scope -- mount -t nfs 10.111.29.157:/ /var/lib/kubelet/pods/15d8d6d6-cefc-11e7-8ed3-000d3a04ebcd/volumes/kubernetes.io~nfs/nfs
Output: Running scope as unit run-43641.scope.
mount: wrong fs type, bad option, bad superblock on 10.111.29.157:/,
missing codepage or helper program, or other error
(for several filesystems (e.g. nfs, cifs) you might
need a /sbin/mount.<type> helper program)
In some cases useful info is found in syslog - try
dmesg | tail or so.
Warning FailedMount 9m (x4 over 15m) kubelet, lab-kube-06 Unable to mount volumes for pod "nfs-busybox-lmgtx_default(15d8d6d6-cefc-11e7-8ed3-000d3a04ebcd)": timeout expired waiting for volumes to attach/mount for pod "default"/"nfs-busybox-lmgtx". list of unattached/unmounted volumes=[nfs]
Warning FailedMount 4m (x8 over 15m) kubelet, lab-kube-06 (combined from similar events): Unable to mount volumes for pod "nfs-busybox-lmgtx_default(15d8d6d6-cefc-11e7-8ed3-000d3a04ebcd)": timeout expired waiting for volumes to attach/mount for pod "default"/"nfs-busybox-lmgtx". list of unattached/unmounted volumes=[nfs]
Warning FailedSync 2m (x7 over 15m) kubelet, lab-kube-06 Error syncing pod
$ kubectl describe pod nfs-busybox-xn9vz
Name: nfs-busybox-xn9vz
Namespace: default
Node: lab-kube-06/10.0.0.6
Start Time: Tue, 21 Nov 2017 20:39:35 +0000
Labels: name=nfs-busybox
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"default","name":"nfs-busybox","uid":"15d683c2-cefc-11e7-8ed3-000d3a04e...
Status: Pending
IP:
Created By: ReplicationController/nfs-busybox
Controlled By: ReplicationController/nfs-busybox
Containers:
busybox:
Container ID:
Image: busybox
Image ID:
Port: <none>
Command:
sh
-c
while true; do date > /mnt/index.html; hostname >> /mnt/index.html; sleep $(($RANDOM % 5 + 5)); done
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/mnt from nfs (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-grgdz (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
nfs:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: nfs
ReadOnly: false
default-token-grgdz:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-grgdz
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.alpha.kubernetes.io/notReady:NoExecute for 300s
node.alpha.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 59m (x6 over 1h) kubelet, lab-kube-06 Unable to mount volumes for pod "nfs-busybox-xn9vz_default(15d7fb5e-cefc-11e7-8ed3-000d3a04ebcd)": timeout expired waiting for volumes to attach/mount for pod "default"/"nfs-busybox-xn9vz". list of unattached/unmounted volumes=[nfs]
Warning FailedMount 7m (x32 over 1h) kubelet, lab-kube-06 (combined from similar events): MountVolume.SetUp failed for volume "nfs" : mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/15d7fb5e-cefc-11e7-8ed3-000d3a04ebcd/volumes/kubernetes.io~nfs/nfs --scope -- mount -t nfs 10.111.29.157:/ /var/lib/kubelet/pods/15d7fb5e-cefc-11e7-8ed3-000d3a04ebcd/volumes/kubernetes.io~nfs/nfs
Output: Running scope as unit run-59365.scope.
mount: wrong fs type, bad option, bad superblock on 10.111.29.157:/,
missing codepage or helper program, or other error
(for several filesystems (e.g. nfs, cifs) you might
need a /sbin/mount.<type> helper program)
In some cases useful info is found in syslog - try
dmesg | tail or so.
Warning FailedSync 2m (x31 over 1h) kubelet, lab-kube-06 Error syncing pod
I had the same problem. Running
sudo apt install nfs-kernel-server
directly on the nodes fixed it for Ubuntu 18.04 server (NFS server running on AWS EC2).
My pod was stuck in the ContainerCreating state.
I was facing this issue because the Kubernetes cluster node CIDR range was not present in the inbound rules of the Security Group of my AWS EC2 instance (where my NFS server was running).
Solution:
Added my Kubernetes cluster node CIDR range to the inbound rules of the Security Group.
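If it helps anyone scripting this, here is a hedged sketch with the AWS CLI; the security group ID and CIDR are placeholders, and the ports match the nfs-server service from the question (2049, 20048, 111):
for port in 2049 20048 111; do
  aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port "$port" --cidr 10.244.0.0/16
done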
Installing the following NFS libraries on the node machine worked for me on CentOS:
yum install -y nfs-utils nfs-utils-lib
Installing the nfs-common library on Ubuntu worked for me:
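On Debian/Ubuntu nodes that is, for example:
sudo apt-get update && sudo apt-get install -y nfs-common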