Kubernetes uses Docker, and kubelet dictates the compatible Docker versions for any given cluster.
My question is, given a Kubernetes cluster that is already configured and running, how would I find out what version of Docker is running in the cluster if I don't have direct access to the nodes?

You can find the container runtime and its version for a given node with:
kubectl get node <node> -o jsonpath="{.status.nodeInfo.containerRuntimeVersion}"

The following kubectl command shows detailed information about the nodes in the cluster:
kubectl describe nodes
One instance of a node is shown below:
Name: node3-virtualbox
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
Annotations: node.alpha.kubernetes.io/ttl=0
Taints: <none>
CreationTimestamp: Tue, 05 Dec 2017 07:01:42 +0100
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk False Tue, 05 Dec 2017 22:52:05 +0100 Tue, 05 Dec 2017 17:08:13 +0100 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Tue, 05 Dec 2017 22:52:05 +0100 Tue, 05 Dec 2017 21:08:21 +0100 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Tue, 05 Dec 2017 22:52:05 +0100 Tue, 05 Dec 2017 21:08:21 +0100 KubeletHasNoDiskPressure kubelet has no disk pressure
Ready False Tue, 05 Dec 2017 22:52:05 +0100 Tue, 05 Dec 2017 21:08:21 +0100 KubeletNotReady runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Hostname: node3-virtualbox
Capacity:
cpu: 1
memory: 2048268Ki
pods: 110
Allocatable:
cpu: 1
memory: 1945868Ki
pods: 110
System Info:
Machine ID: 9654f9402bfc4042b82b454e323cf46c
System UUID: 6EBA3E13-624C-4C82-A8EA-24FF86FA6E66
Boot ID: c7217654-8514-482c-9899-f04a3d3ce6d8
Kernel Version: 4.4.0-101-generic
OS Image: Ubuntu 16.04.1 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://1.13.1
Kubelet Version: v1.8.4
Kube-Proxy Version: v1.8.4
ExternalID: node3-virtualbox
Non-terminated Pods: (2 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
kube-system kube-proxy-sxp5s 0 (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system weave-net-6jf98 20m (2%) 0 (0%) 0 (0%) 0 (0%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
20m (2%) 0 (0%) 0 (0%) 0 (0%)
Events: <none>
The Docker version running on a node appears under System Info; in the example above:
Container Runtime Version: docker://1.13.1

Try kubectl get nodes -o wide and check the CONTAINER-RUNTIME column. Here is an example from my AKS cluster:
t#Azure:/$ kubectl get nodes -o wide
aks-agentpool-311111-0 Ready agent 4d1h v1.12.8 <none> Ubuntu 16.04.6 LTS 4.15.0-1042-azure docker://3.0.4
aks-agentpool-311111-1 Ready agent 4d v1.12.8 <none> Ubuntu 16.04.6 LTS 4.15.0-1042-azure docker://3.0.4
aks-agentpool-311111-3 Ready agent 4d1h v1.12.8 <none> Ubuntu 16.04.6 LTS 4.15.0-1042-azure docker://3.0.4
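
If you want the same information for every node in a script, here is a minimal sketch in Python. It assumes you feed it the text printed by kubectl get nodes -o json; the function name runtime_versions and the sample data are made up for illustration and are not part of kubectl.

```python
import json
from typing import Dict, Tuple


def runtime_versions(nodes_json: str) -> Dict[str, Tuple[str, str]]:
    """Map each node name to (runtime, version).

    Expects the text printed by `kubectl get nodes -o json`.
    """
    versions = {}
    for node in json.loads(nodes_json).get("items", []):
        name = node["metadata"]["name"]
        raw = node["status"]["nodeInfo"]["containerRuntimeVersion"]
        # The field looks like "<runtime>://<version>", e.g. "docker://1.13.1".
        runtime, _, version = raw.partition("://")
        versions[name] = (runtime, version)
    return versions


# Hand-written sample mirroring the node shown earlier (not real cluster output).
SAMPLE = json.dumps({
    "items": [{
        "metadata": {"name": "node3-virtualbox"},
        "status": {"nodeInfo": {"containerRuntimeVersion": "docker://1.13.1"}},
    }]
})

for name, (runtime, version) in runtime_versions(SAMPLE).items():
    print(name, runtime, version)
```

Splitting on "://" also makes it obvious when the runtime is not Docker at all, for example containerd.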


Kubernetes pod stuck pending, but lacks events that tell me why

I have a simple alpine:node kubernetes pod attempting to start from a deployment on a cluster with a large surplus of resources on every node. It's failing to move out of the pending status. When I run kubectl describe, I get no events that explain why this is happening. What are the next steps for debugging a problem like this?
Here are some commands:
kubectl get events
60m Normal SuccessfulCreate replicaset/frontend-r0ktmgn9-dcc95dfd8 Created pod: frontend-r0ktmgn9-dcc95dfd8-8wn9j
36m Normal ScalingReplicaSet deployment/frontend-r0ktmgn9 Scaled down replica set frontend-r0ktmgn9-6d57cb8698 to 0
36m Normal SuccessfulDelete replicaset/frontend-r0ktmgn9-6d57cb8698 Deleted pod: frontend-r0ktmgn9-6d57cb8698-q52h8
36m Normal ScalingReplicaSet deployment/frontend-r0ktmgn9 Scaled up replica set frontend-r0ktmgn9-58cd8f4c79 to 1
36m Normal SuccessfulCreate replicaset/frontend-r0ktmgn9-58cd8f4c79 Created pod: frontend-r0ktmgn9-58cd8f4c79-fn5q4
kubectl describe po/frontend-r0ktmgn9-58cd8f4c79-fn5q4 (some parts redacted)
Name: frontend-r0ktmgn9-58cd8f4c79-fn5q4
Namespace: default
Priority: 0
Node: <none>
Labels: app=frontend
Annotations: kubectl.kubernetes.io/restartedAt: 2021-05-14T20:02:11-05:00
Status: Pending
IPs: <none>
Controlled By: ReplicaSet/frontend-r0ktmgn9-58cd8f4c79
Image: [Redacted]
Port: 3000/TCP
Host Port: 0/TCP
Environment: [Redacted]
Mounts: <none>
Volumes: <none>
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
I use loft virtual clusters, so the above commands were run in a virtual cluster context, where this pod's deployment is the only resource. When run from the main cluster itself:
kubectl describe nodes
Name: autoscale-pool-01-8bwo1
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
Annotations: alpha.kubernetes.io/provided-node-ip:
csi.volume.kubernetes.io/nodeid: {"dobs.csi.digitalocean.com":"246129007"}
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Fri, 14 May 2021 19:56:44 -0500
Taints: <none>
Unschedulable: false
HolderIdentity: autoscale-pool-01-8bwo1
AcquireTime: <unset>
RenewTime: Fri, 14 May 2021 21:33:44 -0500
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Fri, 14 May 2021 19:57:01 -0500 Fri, 14 May 2021 19:57:01 -0500 CiliumIsUp Cilium is running on this node
MemoryPressure False Fri, 14 May 2021 21:30:33 -0500 Fri, 14 May 2021 19:56:44 -0500 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 14 May 2021 21:30:33 -0500 Fri, 14 May 2021 19:56:44 -0500 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 14 May 2021 21:30:33 -0500 Fri, 14 May 2021 19:56:44 -0500 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Fri, 14 May 2021 21:30:33 -0500 Fri, 14 May 2021 19:57:04 -0500 KubeletReady kubelet is posting ready status. AppArmor enabled
Hostname: autoscale-pool-01-8bwo1
Capacity:
cpu: 8
ephemeral-storage: 103176100Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32941864Ki
pods: 110
Allocatable:
cpu: 8
ephemeral-storage: 95087093603
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 29222Mi
pods: 110
System Info:
Machine ID: a98e294e721847469503cd531b9bc88e
System UUID: a98e294e-7218-4746-9503-cd531b9bc88e
Boot ID: a16de75d-7532-441d-885a-de90fb2cb286
Kernel Version: 4.19.0-11-amd64
OS Image: Debian GNU/Linux 10 (buster)
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.4.3
Kubelet Version: v1.20.2
Kube-Proxy Version: v1.20.2
ProviderID: digitalocean://246129007
Non-terminated Pods: (28 in total) [Redacted]
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 2727m (34%) 3202m (40%)
memory 9288341376 (30%) 3680Mi (12%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events: <none>
Name: autoscale-pool-02-8mly8
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
Annotations: alpha.kubernetes.io/provided-node-ip:
csi.volume.kubernetes.io/nodeid: {"dobs.csi.digitalocean.com":"237830322"}
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Sat, 20 Mar 2021 18:14:37 -0500
Taints: <none>
Unschedulable: false
HolderIdentity: autoscale-pool-02-8mly8
AcquireTime: <unset>
RenewTime: Fri, 14 May 2021 21:33:44 -0500
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Tue, 06 Apr 2021 16:24:45 -0500 Tue, 06 Apr 2021 16:24:45 -0500 CiliumIsUp Cilium is running on this node
MemoryPressure False Fri, 14 May 2021 21:33:35 -0500 Tue, 13 Apr 2021 18:40:21 -0500 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 14 May 2021 21:33:35 -0500 Wed, 05 May 2021 15:16:08 -0500 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 14 May 2021 21:33:35 -0500 Tue, 06 Apr 2021 16:24:40 -0500 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Fri, 14 May 2021 21:33:35 -0500 Tue, 06 Apr 2021 16:24:49 -0500 KubeletReady kubelet is posting ready status. AppArmor enabled
Hostname: autoscale-pool-02-8mly8
Capacity:
cpu: 2
ephemeral-storage: 51570124Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 16427892Ki
pods: 110
Allocatable:
cpu: 2
ephemeral-storage: 47527026200
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 13862Mi
pods: 110
System Info:
Machine ID: 7c8d577266284fa09f84afe03296abe8
System UUID: cf5f4cc0-17a8-4fae-b1ab-e0488675ae06
Boot ID: 6698c614-76a0-484c-bb23-11d540e0e6f3
Kernel Version: 4.19.0-16-amd64
OS Image: Debian GNU/Linux 10 (buster)
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.4.4
Kubelet Version: v1.20.5
Kube-Proxy Version: v1.20.5
ProviderID: digitalocean://237830322
Non-terminated Pods: (73 in total) [Redacted]
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1202m (60%) 202m (10%)
memory 2135Mi (15%) 5170Mi (37%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events: <none>

Using GPU with Kubernetes GKE and Node auto-provisioning

I am trying to do something fairly simple: run a GPU machine in a k8s cluster using node auto-provisioning. When I deploy a Pod with a limits: nvidia.com/gpu specification, auto-provisioning correctly creates a node pool and scales up an appropriate node. However, the Pod stays Pending with the following message:
Warning FailedScheduling 59s (x5 over 2m46s) default-scheduler 0/10 nodes are available: 10 Insufficient nvidia.com/gpu.
It seems like taints and tolerations are added correctly by GKE. It just doesn't scale up.
I've followed the instructions here:
To reproduce:
Create a new cluster in a zone with auto-provisioning that includes gpu (I have replaced my own project name with MYPROJECT). This command is what comes out of the console when these changes are done:
gcloud beta container --project "MYPROJECT" clusters create "cluster-2" --zone "europe-west4-a" --no-enable-basic-auth --cluster-version "1.18.12-gke.1210" --release-channel "regular" --machine-type "e2-medium" --image-type "COS" --disk-type "pd-standard" --disk-size "100" --metadata disable-legacy-endpoints=true --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" --num-nodes "1" --enable-stackdriver-kubernetes --enable-ip-alias --network "projects/MYPROJECT/global/networks/default" --subnetwork "projects/MYPROJECT/regions/europe-west4/subnetworks/default" --default-max-pods-per-node "110" --no-enable-master-authorized-networks --addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver --enable-autoupgrade --enable-autorepair --max-surge-upgrade 1 --max-unavailable-upgrade 0 --enable-autoprovisioning --min-cpu 1 --max-cpu 20 --min-memory 1 --max-memory 50 --max-accelerator type="nvidia-tesla-p100",count=1 --enable-autoprovisioning-autorepair --enable-autoprovisioning-autoupgrade --autoprovisioning-max-surge-upgrade 1 --autoprovisioning-max-unavailable-upgrade 0 --enable-vertical-pod-autoscaling --enable-shielded-nodes --node-locations "europe-west4-a"
Install NVIDIA drivers by installing DaemonSet:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
Deploy pod that requests GPU:
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0-runtime-ubuntu18.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 600; done;"]
    resources:
      limits:
        nvidia.com/gpu: 1
kubectl apply -f my-gpu-pod.yaml
Help would be really appreciated, as I've spent quite some time on this now :)
Edit: Here are the descriptions of the Pod and of the Node that was auto-scaled:
Name: my-gpu-pod
Namespace: default
Priority: 0
Node: <none>
Labels: <none>
Annotations: <none>
Status: Pending
IPs: <none>
Image: nvidia/cuda:11.0-runtime-ubuntu18.04
Port: <none>
Host Port: <none>
while true; do sleep 600; done;
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Environment: <none>
/var/run/secrets/kubernetes.io/serviceaccount from default-token-9rvjz (ro)
Type Status
PodScheduled False
Type: Secret (a volume populated by a Secret)
SecretName: default-token-9rvjz
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
nvidia.com/gpu:NoSchedule op=Exists
Type Reason Age From Message
---- ------ ---- ---- -------
Normal NotTriggerScaleUp 11m cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added):
Warning FailedScheduling 5m54s (x6 over 11m) default-scheduler 0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
Warning FailedScheduling 54s (x7 over 5m37s) default-scheduler 0/2 nodes are available: 2 Insufficient nvidia.com/gpu.
Name: gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
Annotations: container.googleapis.com/instance_id: 7877226485154959129
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Mon, 22 Mar 2021 11:32:17 +0100
Taints: nvidia.com/gpu=present:NoSchedule
Unschedulable: false
HolderIdentity: gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2
AcquireTime: <unset>
RenewTime: Mon, 22 Mar 2021 11:38:58 +0100
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
KernelDeadlock False Mon, 22 Mar 2021 11:37:25 +0100 Mon, 22 Mar 2021 11:32:23 +0100 KernelHasNoDeadlock kernel has no deadlock
ReadonlyFilesystem False Mon, 22 Mar 2021 11:37:25 +0100 Mon, 22 Mar 2021 11:32:23 +0100 FilesystemIsNotReadOnly Filesystem is not read-only
CorruptDockerOverlay2 False Mon, 22 Mar 2021 11:37:25 +0100 Mon, 22 Mar 2021 11:32:23 +0100 NoCorruptDockerOverlay2 docker overlay2 is functioning properly
FrequentUnregisterNetDevice False Mon, 22 Mar 2021 11:37:25 +0100 Mon, 22 Mar 2021 11:32:23 +0100 NoFrequentUnregisterNetDevice node is functioning properly
FrequentKubeletRestart False Mon, 22 Mar 2021 11:37:25 +0100 Mon, 22 Mar 2021 11:32:23 +0100 NoFrequentKubeletRestart kubelet is functioning properly
FrequentDockerRestart False Mon, 22 Mar 2021 11:37:25 +0100 Mon, 22 Mar 2021 11:32:23 +0100 NoFrequentDockerRestart docker is functioning properly
FrequentContainerdRestart False Mon, 22 Mar 2021 11:37:25 +0100 Mon, 22 Mar 2021 11:32:23 +0100 NoFrequentContainerdRestart containerd is functioning properly
NetworkUnavailable False Mon, 22 Mar 2021 11:32:18 +0100 Mon, 22 Mar 2021 11:32:18 +0100 RouteCreated NodeController create implicit route
MemoryPressure False Mon, 22 Mar 2021 11:37:49 +0100 Mon, 22 Mar 2021 11:32:17 +0100 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Mon, 22 Mar 2021 11:37:49 +0100 Mon, 22 Mar 2021 11:32:17 +0100 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Mon, 22 Mar 2021 11:37:49 +0100 Mon, 22 Mar 2021 11:32:17 +0100 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Mon, 22 Mar 2021 11:37:49 +0100 Mon, 22 Mar 2021 11:32:19 +0100 KubeletReady kubelet is posting ready status. AppArmor enabled
InternalDNS: gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2.c.exor-arctic.internal
Hostname: gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2.c.exor-arctic.internal
Capacity:
attachable-volumes-gce-pd: 127
cpu: 1
ephemeral-storage: 98868448Ki
hugepages-2Mi: 0
memory: 3776196Ki
pods: 110
Allocatable:
attachable-volumes-gce-pd: 127
cpu: 940m
ephemeral-storage: 47093746742
hugepages-2Mi: 0
memory: 2690756Ki
pods: 110
System Info:
Machine ID: 307671eefc01914a7bfacf17a48e087e
System UUID: 307671ee-fc01-914a-7bfa-cf17a48e087e
Boot ID: acd58f3b-1659-494c-b83d-427f834d23a6
Kernel Version: 5.4.49+
OS Image: Container-Optimized OS from Google
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://19.3.9
Kubelet Version: v1.18.12-gke.1210
Kube-Proxy Version: v1.18.12-gke.1210
ProviderID: gce://exor-arctic/europe-west4-a/gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2
Non-terminated Pods: (6 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system fluentbit-gke-k22gv 100m (10%) 0 (0%) 200Mi (7%) 500Mi (19%) 6m46s
kube-system gke-metrics-agent-5fblx 3m (0%) 0 (0%) 50Mi (1%) 50Mi (1%) 6m47s
kube-system kube-proxy-gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2 100m (10%) 0 (0%) 0 (0%) 0 (0%) 6m44s
kube-system nvidia-driver-installer-vmw8r 150m (15%) 0 (0%) 0 (0%) 0 (0%) 6m45s
kube-system nvidia-gpu-device-plugin-8vqsl 50m (5%) 50m (5%) 10Mi (0%) 10Mi (0%) 6m45s
kube-system pdcsi-node-k9brg 0 (0%) 0 (0%) 0 (0%) 0 (0%) 6m47s
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 403m (42%) 50m (5%)
memory 260Mi (9%) 560Mi (21%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
attachable-volumes-gce-pd 0 0
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 6m47s kubelet Starting kubelet.
Normal NodeAllocatableEnforced 6m47s kubelet Updated Node Allocatable limit across pods
Normal NodeHasSufficientMemory 6m46s (x4 over 6m47s) kubelet Node gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 6m46s (x4 over 6m47s) kubelet Node gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 6m46s (x4 over 6m47s) kubelet Node gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2 status is now: NodeHasSufficientPID
Normal NodeReady 6m45s kubelet Node gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2 status is now: NodeReady
Normal Starting 6m44s kube-proxy Starting kube-proxy.
Warning NodeSysctlChange 6m41s sysctl-monitor
Warning ContainerdStart 6m41s systemd-monitor Starting containerd container runtime...
Warning DockerStart 6m41s (x2 over 6m41s) systemd-monitor Starting Docker Application Container Engine...
Warning KubeletStart 6m41s systemd-monitor Started Kubernetes kubelet.
As per the Kubernetes Documentation https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#nvidia-gpu-device-plugin-used-by-gce, we are supposed to use https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/daemonset.yaml.
So can you run
kubectl delete -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/daemonset.yaml
A common issue on GKE is project quotas limiting resources, which can keep nodes from being auto-provisioned or scaled up because the resources cannot be assigned.
Your project's GPU quota (or specifically the quota for nvidia-tesla-p100) may be set to 0 or to a number well below the one requested.
In this link there's more information about how to check it and how to request more resources for your quota.
Also, I see that you're making use of shared-core E2 instances, which are not compatible with accelerators. It shouldn't be an issue as GKE should automatically change the machine type to N1 if it detects the workload contains a GPU, as seen in this link, but still maybe attempt to run the cluster with other machine types such as N1.
You might be having a problem with the scopes.
When using node auto-provisioning with GPUs, the auto-provisioned node pools by default do not have sufficient scopes to run the installation DaemonSet. You need to manually change the default autoprovisioning scopes to enable that.
In this case, the documented scopes required at the time of writing are:
[ "https://www.googleapis.com/auth/logging.write",
This article mentions this very issue: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#using_node_auto-provisioning_with_gpus
You might just have to expand the scopes and retry. Doing it manually works because you then have the necessary scopes.
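
Whatever the cause turns out to be, the node description in the question lists no nvidia.com/gpu among its resources, which fits the "Insufficient nvidia.com/gpu" message. Here is a minimal sketch in Python for checking which nodes advertise the resource. It assumes the input is the text printed by kubectl get nodes -o json; the function name gpu_allocatable and the sample data are hypothetical.

```python
import json
from typing import Dict


def gpu_allocatable(nodes_json: str, resource: str = "nvidia.com/gpu") -> Dict[str, int]:
    """Map each node name to the number of allocatable GPUs it advertises."""
    counts = {}
    for node in json.loads(nodes_json).get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Quantities are serialized as strings; a missing key is treated as zero,
        # meaning the node does not advertise the resource at all.
        counts[node["metadata"]["name"]] = int(allocatable.get(resource, "0"))
    return counts


# Hand-written sample (not real cluster output): one node without GPUs, one with.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "cpu-node"}, "status": {"allocatable": {"cpu": "940m"}}},
        {"metadata": {"name": "gpu-node"}, "status": {"allocatable": {"nvidia.com/gpu": "1"}}},
    ]
})

for name, count in gpu_allocatable(SAMPLE).items():
    print(name, count)
```

If the auto-provisioned GPU node keeps reporting zero, the driver installer or device plugin on that node has probably not finished, and its pod logs are the next place to look.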

Newly provisioned kubernetes nodes are inaccessible by kubectl

I am using Kubespray with Kubernetes 1.9
What I'm seeing is the following when I try to interact with pods on my new nodes in any way through kubectl. It is important to note that the nodes are considered healthy and are having pods scheduled on them appropriately. The pods are totally functional.
➜ Scripts k logs -f -n prometheus prometheus-prometheus-node-exporter-gckzj
Error from server: Get https://kubeworker-rwva1-prod-14:10250/containerLogs/prometheus/prometheus-prometheus-node-exporter-gckzj/prometheus-node-exporter?follow=true: dial tcp: lookup kubeworker-rwva1-prod-14 on no such host
I am able to ping to the kubeworker nodes both locally where I am running kubectl and from all masters by both IP and DNS.
➜ Scripts ping kubeworker-rwva1-prod-14
PING kubeworker-rwva1-prod-14 ( 56 data bytes
64 bytes from icmp_seq=0 ttl=63 time=88.972 ms
ubuntu#kubemaster-rwva1-prod-1:~$ ping kubeworker-rwva1-prod-14
PING kubeworker-rwva1-prod-14 ( 56(84) bytes of data.
64 bytes from kubeworker-rwva1-prod-14 ( icmp_seq=1 ttl=64 time=0.259 ms
64 bytes from kubeworker-rwva1-prod-14 ( icmp_seq=2 ttl=64 time=0.213 ms
➜ Scripts k get nodes
kubemaster-rwva1-prod-1 Ready master 174d v1.9.2+coreos.0
kubemaster-rwva1-prod-2 Ready master 174d v1.9.2+coreos.0
kubemaster-rwva1-prod-3 Ready master 174d v1.9.2+coreos.0
kubeworker-rwva1-prod-1 Ready node 174d v1.9.2+coreos.0
kubeworker-rwva1-prod-10 Ready node 174d v1.9.2+coreos.0
kubeworker-rwva1-prod-11 Ready node 174d v1.9.2+coreos.0
kubeworker-rwva1-prod-12 Ready node 174d v1.9.2+coreos.0
kubeworker-rwva1-prod-14 Ready node 16d v1.9.2+coreos.0
kubeworker-rwva1-prod-15 Ready node 14d v1.9.2+coreos.0
kubeworker-rwva1-prod-16 Ready node 6d v1.9.2+coreos.0
kubeworker-rwva1-prod-17 Ready node 4d v1.9.2+coreos.0
kubeworker-rwva1-prod-18 Ready node 4d v1.9.2+coreos.0
kubeworker-rwva1-prod-19 Ready node 6d v1.9.2+coreos.0
kubeworker-rwva1-prod-2 Ready node 174d v1.9.2+coreos.0
kubeworker-rwva1-prod-20 Ready node 6d v1.9.2+coreos.0
kubeworker-rwva1-prod-21 Ready node 6d v1.9.2+coreos.0
kubeworker-rwva1-prod-3 Ready node 174d v1.9.2+coreos.0
kubeworker-rwva1-prod-4 Ready node 174d v1.9.2+coreos.0
kubeworker-rwva1-prod-5 Ready node 174d v1.9.2+coreos.0
kubeworker-rwva1-prod-6 Ready node 174d v1.9.2+coreos.0
kubeworker-rwva1-prod-7 Ready node 174d v1.9.2+coreos.0
kubeworker-rwva1-prod-8 Ready node 174d v1.9.2+coreos.0
kubeworker-rwva1-prod-9 Ready node 174d v1.9.2+coreos.0
When I describe a broken node, it looks identical to one of my functioning ones.
➜ Scripts k describe node kubeworker-rwva1-prod-14
Name: kubeworker-rwva1-prod-14
Roles: node
Labels: beta.kubernetes.io/arch=amd64
Annotations: node.alpha.kubernetes.io/ttl=0
Taints: <none>
CreationTimestamp: Tue, 17 Jul 2018 19:35:08 -0700
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk False Fri, 03 Aug 2018 18:44:59 -0700 Tue, 17 Jul 2018 19:35:08 -0700 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Fri, 03 Aug 2018 18:44:59 -0700 Tue, 17 Jul 2018 19:35:08 -0700 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 03 Aug 2018 18:44:59 -0700 Tue, 17 Jul 2018 19:35:08 -0700 KubeletHasNoDiskPressure kubelet has no disk pressure
Ready True Fri, 03 Aug 2018 18:44:59 -0700 Tue, 17 Jul 2018 19:35:18 -0700 KubeletReady kubelet is posting ready status. AppArmor enabled
Hostname: kubeworker-rwva1-prod-14
Capacity:
cpu: 32
memory: 147701524Ki
pods: 110
Allocatable:
cpu: 31900m
memory: 147349124Ki
pods: 110
System Info:
Machine ID: da30025a3f8fd6c3f4de778c5b4cf558
System UUID: 5ACCBB64-2533-E611-97F0-0894EF1D343B
Boot ID: 6b42ba3e-36c4-4520-97e6-e7c6fed195e2
Kernel Version: 4.4.0-130-generic
OS Image: Ubuntu 16.04.4 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://17.3.1
Kubelet Version: v1.9.2+coreos.0
Kube-Proxy Version: v1.9.2+coreos.0
ExternalID: kubeworker-rwva1-prod-14
Non-terminated Pods: (5 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
kube-system calico-node-cd7qg 150m (0%) 300m (0%) 64M (0%) 500M (0%)
kube-system kube-proxy-kubeworker-rwva1-prod-14 150m (0%) 500m (1%) 64M (0%) 2G (1%)
kube-system nginx-proxy-kubeworker-rwva1-prod-14 25m (0%) 300m (0%) 32M (0%) 512M (0%)
prometheus prometheus-prometheus-node-exporter-gckzj 0 (0%) 0 (0%) 0 (0%) 0 (0%)
rabbit-relay rabbit-relay-844d6865c7-q6fr2 0 (0%) 0 (0%) 0 (0%) 0 (0%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
325m (1%) 1100m (3%) 160M (0%) 3012M (1%)
Events: <none>
➜ Scripts k describe node kubeworker-rwva1-prod-11
Name: kubeworker-rwva1-prod-11
Roles: node
Labels: beta.kubernetes.io/arch=amd64
Annotations: node.alpha.kubernetes.io/ttl=0
Taints: <none>
CreationTimestamp: Fri, 09 Feb 2018 21:03:46 -0800
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk False Fri, 03 Aug 2018 18:46:31 -0700 Fri, 09 Feb 2018 21:03:38 -0800 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Fri, 03 Aug 2018 18:46:31 -0700 Mon, 16 Jul 2018 13:24:58 -0700 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 03 Aug 2018 18:46:31 -0700 Mon, 16 Jul 2018 13:24:58 -0700 KubeletHasNoDiskPressure kubelet has no disk pressure
Ready True Fri, 03 Aug 2018 18:46:31 -0700 Mon, 16 Jul 2018 13:24:58 -0700 KubeletReady kubelet is posting ready status. AppArmor enabled
Hostname: kubeworker-rwva1-prod-11
Capacity:
cpu: 32
memory: 131985484Ki
pods: 110
Allocatable:
cpu: 31900m
memory: 131633084Ki
pods: 110
System Info:
Machine ID: 0ff6f3b9214b38ad07c063d45a6a5175
System UUID: 4C4C4544-0044-5710-8037-B3C04F525631
Boot ID: 4d7ec0fc-428f-4b4c-aaae-8e70f374fbb1
Kernel Version: 4.4.0-87-generic
OS Image: Ubuntu 16.04.3 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://17.3.1
Kubelet Version: v1.9.2+coreos.0
Kube-Proxy Version: v1.9.2+coreos.0
ExternalID: kubeworker-rwva1-prod-11
Non-terminated Pods: (6 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
ingress-nginx-internal default-http-backend-internal-7c8ff87c86-955np 10m (0%) 10m (0%) 20Mi (0%) 20Mi (0%)
kube-system calico-node-8fzk6 150m (0%) 300m (0%) 64M (0%) 500M (0%)
kube-system kube-proxy-kubeworker-rwva1-prod-11 150m (0%) 500m (1%) 64M (0%) 2G (1%)
kube-system nginx-proxy-kubeworker-rwva1-prod-11 25m (0%) 300m (0%) 32M (0%) 512M (0%)
prometheus prometheus-prometheus-kube-state-metrics-7c5cbb6f55-jq97n 0 (0%) 0 (0%) 0 (0%) 0 (0%)
prometheus prometheus-prometheus-node-exporter-7gn2x 0 (0%) 0 (0%) 0 (0%) 0 (0%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
335m (1%) 1110m (3%) 176730Ki (0%) 3032971520 (2%)
Events: <none>
What's going on?
➜ k logs -f -n prometheus prometheus-prometheus-node-exporter-gckzj
Error from server: Get https://kubeworker-rwva1-prod-14:10250/containerLogs/prometheus/prometheus-prometheus-node-exporter-gckzj/prometheus-node-exporter?follow=true: dial tcp: lookup kubeworker-rwva1-prod-14 on no such host
➜ cat /etc/hosts | head -n1 kubeworker-rwva1-prod-14
ubuntu#kubemaster-rwva1-prod-1:~$ ping kubeworker-rwva1-prod-14
PING kubeworker-rwva1-prod-14 ( 56(84) bytes of data.
64 bytes from kubeworker-rwva1-prod-14 ( icmp_seq=1 ttl=64 time=0.275 ms
--- kubeworker-rwva1-prod-14 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.275/0.275/0.275/0.000 ms
ubuntu#kubemaster-rwva1-prod-1:~$ kubectl logs -f -n prometheus prometheus-prometheus-node-exporter-gckzj
Error from server: Get https://kubeworker-rwva1-prod-14:10250/containerLogs/prometheus/prometheus-prometheus-node-exporter-gckzj/prometheus-node-exporter?follow=true: dial tcp: lookup kubeworker-rwva1-prod-14 on no such host
What's going on?
That name needs to be resolvable by the API server, not only from your workstation: for kubectl logs and kubectl exec, the API server itself opens the connection to the kubelet on the target Node (port 10250 in the URL above), which is why the message begins with "Error from server".
Thankfully, kubespray has a knob through which you can tell kubernetes to prefer the Node's ExternalIP (or, of course, InternalIP if you prefer): https://github.com/kubernetes-incubator/kubespray/blob/v2.5.0/roles/kubernetes/master/defaults/main.yml#L82
Insane problem. I don't know exactly how I fixed this, but I somehow put it back together by deleting one of my non-functional nodes and re-registering it with the full FQDN. This somehow fixed everything. I was then able to delete the FQDN-registered node and recreate it with the short name.
After a lot of tcpdump sessions, the best explanation I can come up with is that the error message was accurate, but in a really stupid and confusing way.
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Get https://kubeworker-rwva1-prod-14:10250/containerLogs/prometheus/prometheus-prometheus-node-exporter-gckzj/prometheus-node-exporter?follow=true: dial tcp: lookup kubeworker-rwva1-prod-14 on no such host","code":500}
The internal DNS of the cluster was not able to properly read the API to generate the necessary records. Without a name that the cluster DNS was authoritative for, the cluster fell back to my upstream DNS servers to resolve the name recursively. The upstream DNS server didn't know what to do with the short-form name without a TLD suffix.
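
To compare what different machines see, a small check like the following can be run wherever name resolution is in doubt. It is only a sketch in Python: the function name resolves and its lookup parameter are made up for illustration, and the result reflects the resolver of the process running it, which may differ from the one the API server uses.

```python
import socket


def resolves(hostname: str, port: int = 10250, lookup=socket.getaddrinfo) -> bool:
    """Return True if hostname resolves through the given lookup function.

    10250 is the kubelet port that appears in the error message above.
    The lookup parameter exists only so the check can be exercised without DNS.
    """
    try:
        lookup(hostname, port)
    except socket.gaierror:
        # This is the Python-side equivalent of "no such host".
        return False
    return True


# A numeric address needs no DNS lookup, so this works on any machine.
print(resolves("127.0.0.1"))
```

Calling resolves("kubeworker-rwva1-prod-14") on a master and on the workstation shows whether each host's resolver knows the short name.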

Why the actual CPU utilization percentage exceeds Pod CPU limit in Kubernetes

I am running a few Kubernetes pods in a 10-node cluster. Each pod contains only one container, which hosts one worker process. I have specified the CPU "limits" and "requests" for the container. The following is a description of one pod that is running on a node (crypt12).
Name: alexnet-worker-6-9954df99c-p7tx5
Namespace: default
Node: crypt12/
Start Time: Sun, 15 Jul 2018 22:26:57 -0400
Labels: job=worker
Annotations: <none>
Status: Running
Controlled By: ReplicaSet/alexnet-worker-6-9954df99c
Container ID: docker://214e30e87ed4a7240e13e764200a260a883ea4550a1b5d09d29ed827e7b57074
Image: alexnet-tf150-py3:v1
Image ID: docker://sha256:4f18b4c45a07d639643d7aa61b06bfee1235637a50df30661466688ab2fd4e6d
Port: 5000/TCP
Host Port: 0/TCP
State: Running
Started: Sun, 15 Jul 2018 22:26:59 -0400
Ready: True
Restart Count: 0
Limits:
cpu: 800m
memory: 6G
Requests:
cpu: 800m
memory: 6G
Environment: <none>
/var/run/secrets/kubernetes.io/serviceaccount from default-token-hfnlp (ro)
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Type: Secret (a volume populated by a Secret)
SecretName: default-token-hfnlp
Optional: false
QoS Class: Guaranteed
Node-Selectors: kubernetes.io/hostname=crypt12
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events: <none>
The following is the output when I run "kubectl describe node crypt12":
Name: crypt12
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
Annotations: kubeadm.alpha.kubernetes.io/cri-socket=/var/run/dockershim.sock
CreationTimestamp: Wed, 11 Jul 2018 23:07:41 -0400
Taints: <none>
Unschedulable: false
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk False Mon, 16 Jul 2018 16:25:43 -0400 Wed, 11 Jul 2018 22:57:22 -0400 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Mon, 16 Jul 2018 16:25:43 -0400 Wed, 11 Jul 2018 22:57:22 -0400 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Mon, 16 Jul 2018 16:25:43 -0400 Wed, 11 Jul 2018 22:57:22 -0400 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Mon, 16 Jul 2018 16:25:43 -0400 Wed, 11 Jul 2018 22:57:22 -0400 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Mon, 16 Jul 2018 16:25:43 -0400 Wed, 11 Jul 2018 22:57:42 -0400 KubeletReady kubelet is posting ready status. AppArmor enabled
Hostname: crypt12
Capacity:
cpu: 8
ephemeral-storage: 144937600Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 8161308Ki
pods: 110
Allocatable:
cpu: 8
ephemeral-storage: 133574491939
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 8058908Ki
pods: 110
System Info:
Machine ID: f0444e00ba2ed20e5314e6bc5b0f0f60
System UUID: 37353035-3836-5355-4530-32394E44414D
Boot ID: cf2a9daf-c959-4c7e-be61-5e44a44670c4
Kernel Version: 4.4.0-87-generic
OS Image: Ubuntu 16.04.3 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://1.13.1
Kubelet Version: v1.11.0
Kube-Proxy Version: v1.11.0
Non-terminated Pods: (3 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
default alexnet-worker-6-9954df99c-p7tx5 800m (10%) 800m (10%) 6G (72%) 6G (72%)
kube-system kube-proxy-7kdkd 0 (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system weave-net-dpclj 20m (0%) 0 (0%) 0 (0%) 0 (0%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 820m (10%) 800m (10%)
memory 6G (72%) 6G (72%)
Events: <none>
As shown in the "Non-terminated Pods" section of the node description, the CPU limit is 10%. However, when I run "ps" or "top" on the node (crypt12), the CPU utilization of the working process exceeds 10% (it is about 20%). Why does this happen? Could anyone shed light on this?
I found a GitHub issue discussion that answers my question: the CPU percentage shown by "kubectl describe node" is the CPU limit divided by the number of cores on the node. Since I set the CPU limit to 0.8, the 10% is the result of 0.8/8.
Here is the link: https://github.com/kubernetes/kubernetes/issues/24925
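To make the arithmetic concrete, here is a minimal sketch of the two scales involved. The function names are made up for illustration, and it assumes top is in its default mode where 100% means one full core. With the numbers from the question, the observed 20% in top corresponds to about 200m, which is well under the 800m limit.

```python
def describe_node_percent(millicores: float, node_cores: int) -> float:
    """Percentage on the scale used by `kubectl describe node`:
    CPU amount relative to the whole node (all cores)."""
    return 100.0 * millicores / (node_cores * 1000)


def top_percent_to_millicores(top_percent: float) -> float:
    """Convert a per-core percentage (assumed: 100% == one core == 1000m)
    into millicores."""
    return top_percent * 10.0


limit_m = 800      # limits.cpu from the pod description
node_cores = 8     # cpu capacity of crypt12
observed_top = 20  # approximate %CPU reported by top for the process

observed_m = top_percent_to_millicores(observed_top)

print(describe_node_percent(limit_m, node_cores))     # limit as share of the node
print(observed_m)                                     # observed usage in millicores
print(describe_node_percent(observed_m, node_cores))  # observed usage as share of the node
print(observed_m <= limit_m)                          # is the process within its limit?
```

So the 10% and the 20% are not on the same scale: the first is relative to all 8 cores, the second to a single core.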
Firstly, by default, top shows percentage utilisation per core, so with 8 cores you can have up to 800% utilisation.
If you're reading the top statistics right, then it might have something to do with the fact that your node is running more processes than just your pod. Think kube-proxy, kubelet and any other controllers. GKE also runs a dashboard and calls the API for statistics.
Also note that CPU limits are enforced per 100ms period. A container can briefly spike above 10 percent utilisation, but averaged over each period it never uses more than its allowed share.
In the official documentation it reads:
The spec.containers[].resources.limits.cpu is converted to its millicore value and multiplied by 100. The resulting value is the total amount of CPU time that a container can use every 100ms. A container cannot use more than its share of CPU time during this interval.
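As a rough sketch of the rule quoted above, applied to the 800m limit from the question: the microsecond unit and the 100000-microsecond period are my assumption, based on how the cgroup v1 settings cpu.cfs_quota_us and cpu.cfs_period_us are expressed. The function names are hypothetical.

```python
CFS_PERIOD_US = 100_000  # assumed: the 100ms period, expressed in microseconds


def cfs_quota_us(limit_millicores: int) -> int:
    """CPU time budget per period: millicore value multiplied by 100."""
    return limit_millicores * 100


def max_per_core_percent(limit_millicores: int) -> float:
    """Ceiling on sustained usage, on a scale where 100% is one full core."""
    return 100 * cfs_quota_us(limit_millicores) / CFS_PERIOD_US


quota = cfs_quota_us(800)
print(quota)                      # CPU time budget per 100ms period, in microseconds
print(max_per_core_percent(800))  # sustained ceiling as a per-core percentage
```

Under these assumptions an 800m limit allows 80ms of CPU time in every 100ms period, so a process showing about 20% of one core is nowhere near being throttled.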

kubernetes node not responding after restart

I have a Kubernetes cluster with one master and four nodes. kube-proxy was working fine on all four nodes, and I could access services on any of the nodes irrespective of where they were running; i.e., http://node1:30000 through http://node4:30000 all gave the same response.
After restarting node4 by running shutdown -r now, it came back up, but I noticed that the node was no longer responding to requests. I am running the following command:
curl http://node4:30000
If I run it from my PC, or from any other node in the cluster -- node1 through node3, or master -- I get:
curl: (7) Failed to connect to node4 port 30000: Connection timed out
However, if I run it from node4, it responds successfully. This leads me to believe that kube-proxy is running fine, but something is preventing external connections.
When I run kubectl describe node node4, my output looks normal:
Name: node4
Labels: beta.kubernetes.io/arch=amd64
Taints: <none>
CreationTimestamp: Tue, 21 Feb 2017 15:21:17 -0400
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk False Wed, 22 Feb 2017 08:03:40 -0400 Tue, 21 Feb 2017 15:21:18 -0400 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Wed, 22 Feb 2017 08:03:40 -0400 Tue, 21 Feb 2017 15:21:18 -0400 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Wed, 22 Feb 2017 08:03:40 -0400 Tue, 21 Feb 2017 15:21:18 -0400 KubeletHasNoDiskPressure kubelet has no disk pressure
Ready True Wed, 22 Feb 2017 08:03:40 -0400 Tue, 21 Feb 2017 15:21:28 -0400 KubeletReady kubelet is posting ready status. AppArmor enabled
Capacity:
alpha.kubernetes.io/nvidia-gpu: 0
cpu: 2
memory: 4028748Ki
pods: 110
Allocatable:
alpha.kubernetes.io/nvidia-gpu: 0
cpu: 2
memory: 4028748Ki
pods: 110
System Info:
Machine ID: dbc0bb6ba10acae66b1061f958220ade
System UUID: 4229186F-AA5C-59CE-E5A2-258C1BBE9D2C
Boot ID: a3968e6c-eba3-498c-957f-f29283af1cff
Kernel Version: 4.4.0-63-generic
OS Image: Ubuntu 16.04.1 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://1.13.0
Kubelet Version: v1.5.2
Kube-Proxy Version: v1.5.2
ExternalID: node4
Non-terminated Pods: (27 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
<< application pods listed here >>
kube-system kube-proxy-0p3lj 0 (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system weave-net-uqmr1 20m (1%) 0 (0%) 0 (0%) 0 (0%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
20m (1%) 0 (0%) 0 (0%) 0 (0%)
Is there anything specific I need to do to bring a node back online after a system restart?
My team was able to solve this one by downgrading docker to 1.12. It appears that the problem is related to this issue:
After downgrading docker to 1.12, everything is working now.