Jupyterhub helm install timed out waiting for the condition - kubernetes

Thanks everyone as always for the help, I come once again with my melted brain in my hat in my hand asking for some assistance.
Environment setup:
3 Ubuntu VMs running Kubernetes Kubeadm install
1 control node
2 worker nodes
MetalLB deployment as load balancer for K8s
No out of the box storage classes or volumes.
I have been trying to simply install Jupyterhub using helm and am constantly running into various issues but the most recent one is a real head scratcher when I try to install it.
At this point I really need some help on a plain "local-storage" install, then use NFS for the user data in the future. The Zero to Jupyterhub docs have been helpful but, not for this issue sorry to say.
dev#control1:~/jupyterhub$ helm upgrade --install jhub jupyterhub-1.2.0.tgz --values config.yml
Error: UPGRADE FAILED: pre-upgrade hooks failed: timed out waiting for the condition
dev#control1:~/jupyterhub$ helm list
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
jhub default 2 2022-01-10 18:16:58.570337598 +0000 UTC failed jupyterhub-1.2.0 1.5.0
Here is my config.yml for the values:
proxy:
secretToken: "xxxxxx"
singleuser:
storage:
type: dynamic
dynamic:
storageClass: local-storage
debug:
enabled: true
here is my storage class for the local storage:
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
annotations:
storageclass.kubernetes.io/is-default-class: "true"
name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
I have tried various persistent volume files and none work, so I'll list them here anyway:
hub persistent volume:
apiVersion: v1
kind: PersistentVolume
metadata:
name: hub-db-pv
spec:
capacity:
storage: 1Gi
# volumeMode field requires BlockVolume Alpha feature gate to be enabled.
volumeMode: Filesystem
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Retain
storageClassName: local-storage
local:
path: /tmp/jhub
nodeAffinity:
required:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- worker1
Upon starting the install process, these are the only pods that are spun up:
NAME READY STATUS RESTARTS AGE
pod/hook-image-awaiter--1-brl2q 1/1 Running 0 47s
pod/hook-image-puller-9fghk 1/1 Running 0 47s
pod/hook-image-puller-scp88 1/1 Running 0 47s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 46d
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/hook-image-puller 2 2 2 2 2 <none> 48s
NAME COMPLETIONS DURATION AGE
job.batch/hook-image-awaiter 0/1 48s 48s
The helm debug logs show:
Error: INSTALLATION FAILED: failed pre-install: timed out waiting for the condition
helm.go:88: [debug] failed pre-install: timed out waiting for the condition
INSTALLATION FAILED
main.newInstallCmd.func2
helm.sh/helm/v3/cmd/helm/install.go:127
github.com/spf13/cobra.(*Command).execute
github.com/spf13/cobra#v1.2.1/command.go:856
github.com/spf13/cobra.(*Command).ExecuteC
github.com/spf13/cobra#v1.2.1/command.go:974
github.com/spf13/cobra.(*Command).Execute
github.com/spf13/cobra#v1.2.1/command.go:902
main.main
helm.sh/helm/v3/cmd/helm/helm.go:87
runtime.main
runtime/proc.go:225
runtime.goexit
runtime/asm_amd64.s:1371
kubectl events show nothing out of the ordinary:
44m Normal SuccessfulCreate daemonset/hook-image-puller Created pod: hook-image-puller-scp88
44m Normal SuccessfulCreate daemonset/hook-image-puller Created pod: hook-image-puller-9fghk
35m Normal SuccessfulCreate daemonset/hook-image-puller Created pod: hook-image-puller-cx7sb
35m Normal SuccessfulCreate daemonset/hook-image-puller Created pod: hook-image-puller-mw6cd
9m10s Normal SuccessfulCreate daemonset/hook-image-puller Created pod: hook-image-puller-pgt4n
9m10s Normal SuccessfulCreate daemonset/hook-image-puller Created pod: hook-image-puller-p6h49
Thanks again for any help!

Related

Default Grafana K8s app PV issue: FailedBinding persistentvolume-controller no persistent volumes available for this claim and no storage class is set

I am simply trying to deploy this Grafana app as-is, no changes to the YAML have been made: https://grafana.com/docs/grafana/latest/setup-grafana/installation/kubernetes/
VMs are Ubuntu 20.04 LTS. The Kubernetes cluster is made up of the Control-Plane/Mstr & 3x Worker nodes:
root#k8s-master:~# kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-master Ready control-plane 35d v1.24.2
k8s-worker1 Ready worker 4h24m v1.24.2
k8s-worker2 Ready worker 4h24m v1.24.2
k8s-worker3 Ready worker 4h24m v1.24.2v
Other K8s Pods such as NGINX run without issue.
However, the Grafana pod cannot start and is stuck in a Pending state:
root#k8s-master:~# kubectl create -f grafana.yaml
persistentvolumeclaim/grafana-pvc created
deployment.apps/grafana created
service/grafana created
# time passed here...
root#k8s-master:~# kubectl get pods
NAME READY STATUS RESTARTS AGE
grafana-9bd5bbd6b-k7ljz 0/1 Pending 0 3h39m
Troubleshooting this, I found there is an issue with the storage PersistentVolumeClaim (the pvc):
root#k8s-master:~# kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
grafana-pvc Pending 2m22s
root#k8s-master:~#
root#k8s-master:~# kubectl describe pvc grafana-pvc
Name: grafana-pvc
Namespace: default
StorageClass:
Status: Pending
Volume:
Labels: <none>
Annotations: <none>
Finalizers: [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode: Filesystem
Used By: grafana-9bd5bbd6b-k7ljz
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal FailedBinding 6s (x11 over 2m30s) persistentvolume-controller no persistent volumes available for this claim and no storage class is set
UPDATE:
I created a StorageClass and set it as default:
root#k8s-master:~# kubectl get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
generic (default) no-provisioner Delete Immediate false 19m
I also created a PersistentVolume:
root#k8s-master:~# kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
task-pv-volume 10Gi RWO Retain Released default/task-pv-claim manual 12m
However, now when I try to deploy the Grafana PVC it is still stuck - why?
root#k8s-master:~# kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
grafana-pvc Pending generic 4m16s
root#k8s-master:~# kubectl describe pvc grafana-pvc
Name: grafana-pvc
Namespace: default
StorageClass: generic
Status: Pending
Volume:
Labels: <none>
Annotations: volume.beta.kubernetes.io/storage-provisioner: no-provisioner
volume.kubernetes.io/storage-provisioner: no-provisioner
Finalizers: [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode: Filesystem
Used By: grafana-9bd5bbd6b-mmqs6
grafana-9bd5bbd6b-pvhtm
grafana-9bd5bbd6b-rtwgj
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ExternalProvisioning 12s (x19 over 4m27s) persistentvolume-controller waiting for a volume to be created, either by external provisioner "no-provisioner" or manually created by system administrator
I have tried creating a Grafana configuration file from the documentation, and was able to create successfully. The pod has a Running state, also the PVC(PersistentVolumeClaim) shows the Storage class as standard.
The below is the output of PVC:
$ kubectl describe pvc grafana-pvc
Name: grafana-pvc
Namespace: default
StorageClass: standard
Status: Bound
Volume: pvc-ee20cc5d-6ca5-4075-b5f3-d1a6323a5241
Labels: <none>
Annotations: pv.kubernetes.io/bind-completed: yes
pv.kubernetes.io/bound-by-controller: yes
volume.beta.kubernetes.io/storage-provisioner: pd.csi.storage.gke.io
Finalizers: [kubernetes.io/pvc-protection]
Capacity: 1Gi
Access Modes: RWO
VolumeMode: Filesystem
Used By: grafana-75789d79d4-wbgtv
Events: <none>
But in your use case the StorageClass field is showing as empty. So, try deleting the existing one and recreate the Grafana configuration file. If you were not able to create and are still facing the same error message which is “no persistent volumes available for this claim and no storage class is set” then you will have to create PV(PersistentVolume).
Because, your error says, "Your PVC hasn't found a matching PV and you also haven't mentioned any storageClass name". After you create the PersistentVolumeClaim, the Kubernetes control plane looks for a PersistentVolume that satisfies the claim's requirements. If the control plane finds a suitable PersistentVolume with the same StorageClass, it binds the claim to the volume.
In order to resolve your issue you will need to create a StorageClass with no-provisioner and then create a PV(PersistentVolume) by defining this storageClassName. Then you have to create PVC and Pod/Deployment .
Refer to stackpost1 and stackpost2 for more information.

Metrics server is currently unable to handle the request

I am new to kubernetes and was trying to apply horizontal pod autoscaling to my existing application. and after following other stackoverflow details - got to know that I need to install metric-server - and I was able to - but some how it's not working and unable to handle request.
Further I followed few more things but unable to resolve the issue - I will really appreciate any help here.
Please let me know for any further details you need for helping me :) Thanks in advance.
Steps followed:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
serviceaccount/metrics-server created
clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader created
clusterrole.rbac.authorization.k8s.io/system:metrics-server created
rolebinding.rbac.authorization.k8s.io/metrics-server-auth-reader created
clusterrolebinding.rbac.authorization.k8s.io/metrics-server:system:auth-delegator created
clusterrolebinding.rbac.authorization.k8s.io/system:metrics-server created
service/metrics-server created
deployment.apps/metrics-server created
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io created
kubectl get deploy,svc -n kube-system | egrep metrics-server
deployment.apps/metrics-server 1/1 1 1 2m6s
service/metrics-server ClusterIP 10.32.0.32 <none> 443/TCP 2m6s
kubectl get pods -n kube-system | grep metrics-server
metrics-server-64cf6869bd-6gx88 1/1 Running 0 2m39s
vi ana_hpa.yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
name: ana-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: StatefulSet
name: common-services-auth
minReplicas: 1
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 80
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 160
k apply -f ana_hpa.yaml
horizontalpodautoscaler.autoscaling/ana-hpa created
k get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
ana-hpa StatefulSet/common-services-auth <unknown>/160%, <unknown>/80% 1 10 0 4s
k describe hpa ana-hpa
Name: ana-hpa
Namespace: default
Labels: <none>
Annotations: <none>
CreationTimestamp: Tue, 12 Apr 2022 17:01:25 +0530
Reference: StatefulSet/common-services-auth
Metrics: ( current / target )
resource memory on pods (as a percentage of request): <unknown> / 160%
resource cpu on pods (as a percentage of request): <unknown> / 80%
Min replicas: 1
Max replicas: 10
StatefulSet pods: 3 current / 0 desired
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True SucceededGetScale the HPA controller was able to get the target's current scale
ScalingActive False FailedGetResourceMetric the HPA was unable to compute the replica count: failed to get memory utilization: unable to get metrics for resource memory: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedGetResourceMetric 38s (x8 over 2m23s) horizontal-pod-autoscaler failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
Warning FailedComputeMetricsReplicas 38s (x8 over 2m23s) horizontal-pod-autoscaler invalid metrics (2 invalid out of 2), first error is: failed to get memory utilization: unable to get metrics for resource memory: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
Warning FailedGetResourceMetric 23s (x9 over 2m23s) horizontal-pod-autoscaler failed to get memory utilization: unable to get metrics for resource memory: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
kubectl get --raw /apis/metrics.k8s.io/v1beta1
Error from server (ServiceUnavailable): the server is currently unable to handle the request
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"
Error from server (ServiceUnavailable): the server is currently unable to handle the request
kubectl edit deployments.apps -n kube-system metrics-server
Add hostNetwork: true
deployment.apps/metrics-server edited
kubectl get pods -n kube-system | grep metrics-server
metrics-server-5dc6dbdb8-42hw9 1/1 Running 0 10m
k describe pod metrics-server-5dc6dbdb8-42hw9 -n kube-system
Name: metrics-server-5dc6dbdb8-42hw9
Namespace: kube-system
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: pusntyn196.apac.avaya.com/10.133.85.196
Start Time: Tue, 12 Apr 2022 17:08:25 +0530
Labels: k8s-app=metrics-server
pod-template-hash=5dc6dbdb8
Annotations: <none>
Status: Running
IP: 10.133.85.196
IPs:
IP: 10.133.85.196
Controlled By: ReplicaSet/metrics-server-5dc6dbdb8
Containers:
metrics-server:
Container ID: containerd://024afb1998dce4c0bd5f4e58f996068ea37982bd501b54fda2ef8d5c1098b4f4
Image: k8s.gcr.io/metrics-server/metrics-server:v0.6.1
Image ID: k8s.gcr.io/metrics-server/metrics-server#sha256:5ddc6458eb95f5c70bd13fdab90cbd7d6ad1066e5b528ad1dcb28b76c5fb2f00
Port: 4443/TCP
Host Port: 4443/TCP
Args:
--cert-dir=/tmp
--secure-port=4443
--kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
--kubelet-use-node-status-port
--metric-resolution=15s
State: Running
Started: Tue, 12 Apr 2022 17:08:26 +0530
Ready: True
Restart Count: 0
Requests:
cpu: 100m
memory: 200Mi
Liveness: http-get https://:https/livez delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get https://:https/readyz delay=20s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/tmp from tmp-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-g6p4g (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
tmp-dir:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-g6p4g:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 2s
node.kubernetes.io/unreachable:NoExecute op=Exists for 2s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m31s default-scheduler Successfully assigned kube-system/metrics-server-5dc6dbdb8-42hw9 to pusntyn196.apac.avaya.com
Normal Pulled 2m32s kubelet Container image "k8s.gcr.io/metrics-server/metrics-server:v0.6.1" already present on machine
Normal Created 2m31s kubelet Created container metrics-server
Normal Started 2m31s kubelet Started container metrics-server
kubectl get --raw /apis/metrics.k8s.io/v1beta1
Error from server (ServiceUnavailable): the server is currently unable to handle the request
kubectl get pods -n kube-system | grep metrics-server
metrics-server-5dc6dbdb8-42hw9 1/1 Running 0 10m
kubectl logs -f metrics-server-5dc6dbdb8-42hw9 -n kube-system
E0412 11:43:54.684784 1 configmap_cafile_content.go:242] kube-system/extension-apiserver-authentication failed with : missing content for CA bundle "client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
E0412 11:44:27.001010 1 configmap_cafile_content.go:242] key failed with : missing content for CA bundle "client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
k logs -f metrics-server-5dc6dbdb8-42hw9 -n kube-system
I0412 11:38:26.447305 1 serving.go:342] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
I0412 11:38:26.899459 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0412 11:38:26.899477 1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0412 11:38:26.899518 1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0412 11:38:26.899545 1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0412 11:38:26.899546 1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0412 11:38:26.899567 1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0412 11:38:26.900480 1 dynamic_serving_content.go:131] "Starting controller" name="serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key"
I0412 11:38:26.900811 1 secure_serving.go:266] Serving securely on [::]:4443
I0412 11:38:26.900854 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
W0412 11:38:26.900965 1 shared_informer.go:372] The sharedIndexInformer has started, run more than once is not allowed
I0412 11:38:26.999960 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0412 11:38:26.999989 1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController
I0412 11:38:26.999970 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
E0412 11:38:27.000087 1 configmap_cafile_content.go:242] kube-system/extension-apiserver-authentication failed with : missing content for CA bundle "client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
E0412 11:38:27.000118 1 configmap_cafile_content.go:242] key failed with : missing content for CA bundle "client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
kubectl top nodes
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)
kubectl top pods
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get pods.metrics.k8s.io)
Edit metrics server deployment yaml
Add - --kubelet-insecure-tls
k apply -f metric-server-deployment.yaml
serviceaccount/metrics-server unchanged
clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader unchanged
clusterrole.rbac.authorization.k8s.io/system:metrics-server unchanged
rolebinding.rbac.authorization.k8s.io/metrics-server-auth-reader unchanged
clusterrolebinding.rbac.authorization.k8s.io/metrics-server:system:auth-delegator unchanged
clusterrolebinding.rbac.authorization.k8s.io/system:metrics-server unchanged
service/metrics-server unchanged
deployment.apps/metrics-server configured
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io unchanged
kubectl get pods -n kube-system | grep metrics-server
metrics-server-5dc6dbdb8-42hw9 1/1 Running 0 10m
kubectl top pods
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get pods.metrics.k8s.io)
Also tried by adding below to metrics server deployment
command:
- /metrics-server
- --kubelet-insecure-tls
- --kubelet-preferred-address-types=InternalIP
This can easily be resolved by editing the deployment yaml files and adding the hostNetwork: true after the dnsPolicy: ClusterFirst
kubectl edit deployments.apps -n kube-system metrics-server
insert:
hostNetwork: true
I hope this help somebody for bare metal cluster:
$ helm --repo https://kubernetes-sigs.github.io/metrics-server/ --kubeconfig=$HOME/.kube/loc-cluster.config -n kube-system --set args='{--kubelet-insecure-tls}' upgrade --install metrics-server metrics-server
$ helm --kubeconfig=$HOME/.kube/loc-cluster.config -n kube-system uninstall metrics-server
Update: I deployed the metrics-server using the same command. Perhaps you can start fresh by removing existing resources and running:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
=======================================================================
It appears the --kubelet-insecure-tls flag was not configured correctly for the pod template in the deployment. The following should fix this:
Edit the existing deployment in the cluster with kubectl edit deployment/metrics-server -nkube-system.
Add the flag to the spec.containers[].args list, so that the deployment looks like this:
...
spec:
containers:
- args:
- --cert-dir=/tmp
- --secure-port=4443
- --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
- --kubelet-use-node-status-port
- --metric-resolution=15s
- --kubelet-insecure-tls <=======ADD IT HERE.
image: k8s.gcr.io/metrics-server/metrics-server:v0.6.1
...
Simply save your changes and let the deployment rollout the updated pods. You can use watch -n1 kubectl get deployment/kube-metrics -nkube-system and wait for UP-TO-DATE column to show 1.
Like this:
NAME READY UP-TO-DATE AVAILABLE AGE
metrics-server 1/1 1 1 16m
Verify with kubectl top nodes. It will show something like
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
docker-desktop 222m 5% 1600Mi 41%
I've just verified this to work on a local setup. Let me know if this helps :)
Please configuration aggregation layer correctly and carefully, you can use this link for help : https://kubernetes.io/docs/tasks/extend-kubernetes/configure-aggregation-layer/.
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
name: <name of the registration object>
spec:
group: <API group name this extension apiserver hosts>
version: <API version this extension apiserver hosts>
groupPriorityMinimum: <priority this APIService for this group, see API documentation>
versionPriority: <prioritizes ordering of this version within a group, see API documentation>
service:
namespace: <namespace of the extension apiserver service>
name: <name of the extension apiserver service>
caBundle: <pem encoded ca cert that signs the server cert used by the webhook>
It would be helpful to provide kubectl version return value.
For me on EKS with helmfile I had to write in the values.yaml using the metrics-server chart :
containerPort: 10250
The value was enforced by default to 4443 for an unknown reason when I first deployed the chart.
See doc:
https://github.com/kubernetes-sigs/metrics-server/blob/master/charts/metrics-server/values.yaml#L62
https://aws.amazon.com/premiumsupport/knowledge-center/eks-metrics-server/#:~:text=confirm%20that%20your%20security%20groups
Then kubectl top nodes and kubectl describe apiservice v1beta1.metrics.k8s.io were working.
First of all, execute the following command:
kubectl get apiservices
And checkout the availablity (status) of kube-system/metrics-server service.
In case the availability is True:
Add hostNetwork: true to the spec of your metrics-server deployment by executing the following command:
kubectl edit deployment -n kube-system metrics-server
It should look like the following:
...
spec:
hostNetwork: true
...
Setting hostNetwork to true means that Pod will have access to
the host where it's running.
In case the availability is False (MissingEndpoints):
Download metrics-server:
wget https://github.com/kubernetes-sigs/metrics-server/releases/download/v0.5.0/components.yaml
Remove (legacy) metrics server:
kubectl delete -f components.yaml
Edit downloaded file and add - --kubelet-insecure-tls to args list:
...
labels:
k8s-app: metrics-server
spec:
containers:
- args:
- --cert-dir=/tmp
- --secure-port=443
- --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
- --kubelet-use-node-status-port
- --metric-resolution=15s
- --kubelet-insecure-tls # add this line
...
Create service once again:
kubectl apply -f components.yaml

Failed to provision volume with StorageClass "google-storage"

I'm testing K8s with my custom cluster on GCP. I'm trying to create a StorageClass with this yaml:
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: google-storage
provisioner: kubernetes.io/gce-pd
parameters:
type: pd-ssd
replication-type: none
But I'm having this error when I tried to create the resource:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ProvisioningFailed 87s (x7 over 7m47s) persistentvolume-controller Failed to provision volume with StorageClass "google-storage": failed to get GCE GCECloudProvider with error <nil>
I'm not using GKE, but my own install and and installed everything by myself with kubeadm with the version 1.20 of kubernetes
NAME STATUS ROLES AGE VERSION
master-1 Ready control-plane,master 56m v1.20.1
worker-1 Ready <none> 55m v1.20.1
worker-2 Ready <none> 55m v1.20.1
Everything is working properly when I create pods, but I'm having an issue with StorageClass. Did I miss something during the creation?
I have found this same issue on stackoverflow, but the answer does not seem to applied to me, as I'm running a more recent version of the K8s Cluster without this problem describe :
Container-VM Image with GPD Volumes fails with "Failed to get GCE Cloud Provider. plugin.host.GetCloudProvider returned <nil> instead"

Cluster-autoscaler not triggering scale-up on Daemonset deployment

I deployed the Datadog agent using the Datadog Helm chart which deploys a Daemonset in Kubernetes. However when checking the state of the Daemonset I saw it was not creating all pods:
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
datadog-agent-datadog 5 2 2 2 2 <none> 1h
When describing the Daemonset to figure out what was going wrong I saw it did not have enough resources:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedPlacement 42s (x6 over 42s) daemonset-controller failed to place pod on "ip-10-0-1-124.eu-west-1.compute.internal": Node didn't have enough resource: cpu, requested: 200, used: 1810, capacity: 2000
Warning FailedPlacement 42s (x6 over 42s) daemonset-controller failed to place pod on "<ip>": Node didn't have enough resource: cpu, requested: 200, used: 1810, capacity: 2000
Warning FailedPlacement 42s (x5 over 42s) daemonset-controller failed to place pod on "<ip>": Node didn't have enough resource: cpu, requested: 200, used: 1860, capacity: 2000
Warning FailedPlacement 42s (x7 over 42s) daemonset-controller failed to place pod on "<ip>": Node didn't have enough resource: cpu, requested: 200, used: 1860, capacity: 2000
Normal SuccessfulCreate 42s daemonset-controller Created pod: datadog-agent-7b2kp
However, I have the Cluster-autoscaler installed in the cluster and configured properly (It does trigger on regular Pod deployments that do not have enough resources to schedule), but it does not seem to trigger on the Daemonset:
I0424 14:14:48.545689 1 static_autoscaler.go:273] No schedulable pods
I0424 14:14:48.545700 1 static_autoscaler.go:280] No unschedulable pods
The AutoScalingGroup has enough nodes left:
Did I miss something in the configuration of the Cluster-autoscaler? What can I do to make sure it triggers on Daemonset resources as well?
Edit:
Describe of the Daemonset
Name: datadog-agent
Selector: app=datadog-agent
Node-Selector: <none>
Labels: app=datadog-agent
chart=datadog-1.27.2
heritage=Tiller
release=datadog-agent
Annotations: deprecated.daemonset.template.generation: 1
Desired Number of Nodes Scheduled: 5
Current Number of Nodes Scheduled: 2
Number of Nodes Scheduled with Up-to-date Pods: 2
Number of Nodes Scheduled with Available Pods: 2
Number of Nodes Misscheduled: 0
Pods Status: 2 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: app=datadog-agent
Annotations: checksum/autoconf-config: 38e0b9de817f645c4bec37c0d4a3e58baecccb040f5718dc069a72c7385a0bed
checksum/checksd-config: 38e0b9de817f645c4bec37c0d4a3e58baecccb040f5718dc069a72c7385a0bed
checksum/confd-config: 38e0b9de817f645c4bec37c0d4a3e58baecccb040f5718dc069a72c7385a0bed
Service Account: datadog-agent
Containers:
datadog:
Image: datadog/agent:6.10.1
Port: 8125/UDP
Host Port: 0/UDP
Limits:
cpu: 200m
memory: 256Mi
Requests:
cpu: 200m
memory: 256Mi
Liveness: http-get http://:5555/health delay=15s timeout=5s period=15s #success=1 #failure=6
Environment:
DD_API_KEY: <set to the key 'api-key' in secret 'datadog-secret'> Optional: false
DD_LOG_LEVEL: INFO
KUBERNETES: yes
DD_KUBERNETES_KUBELET_HOST: (v1:status.hostIP)
DD_HEALTH_PORT: 5555
Mounts:
/host/proc from procdir (ro)
/host/sys/fs/cgroup from cgroups (ro)
/var/run/docker.sock from runtimesocket (ro)
/var/run/s6 from s6-run (rw)
Volumes:
runtimesocket:
Type: HostPath (bare host directory volume)
Path: /var/run/docker.sock
HostPathType:
procdir:
Type: HostPath (bare host directory volume)
Path: /proc
HostPathType:
cgroups:
Type: HostPath (bare host directory volume)
Path: /sys/fs/cgroup
HostPathType:
s6-run:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedPlacement 33m (x6 over 33m) daemonset-controller failed to place pod on "ip-10-0-2-144.eu-west-1.compute.internal": Node didn't have enough resource: cpu, requested: 200, used: 1810, capacity: 2000
Normal SuccessfulCreate 33m daemonset-controller Created pod: datadog-agent-7b2kp
Warning FailedPlacement 16m (x25 over 33m) daemonset-controller failed to place pod on "ip-10-0-1-124.eu-west-1.compute.internal": Node didn't have enough resource: cpu, requested: 200, used: 1810, capacity: 2000
Warning FailedPlacement 16m (x25 over 33m) daemonset-controller failed to place pod on "ip-10-0-2-174.eu-west-1.compute.internal": Node didn't have enough resource: cpu, requested: 200, used: 1860, capacity: 2000
Warning FailedPlacement 16m (x25 over 33m) daemonset-controller failed to place pod on "ip-10-0-3-250.eu-west-1.compute.internal": Node didn't have enough resource: cpu, requested: 200, used: 1860, capacity: 2000
You can add priorityClassName to point to a high priority PriorityClass to your DaemonSet. Kubernetes will then remove other pods in order to run the DaemonSet's pods. If that results in unschedulable pods, cluster-autoscaler should add a node to schedule them on.
See the docs (Most examples based on that) (For some pre-1.14 versions, the apiVersion is likely a beta (1.11-1.13) or alpha version (1.8 - 1.10) instead)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: high-priority
value: 1000000
globalDefault: false
description: "High priority class for essential pods"
Apply it to your workload
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
name: datadog-agent
spec:
template:
metadata:
labels:
app: datadog-agent
name: datadog-agent
spec:
priorityClassName: high-priority
serviceAccountName: datadog-agent
containers:
- image: datadog/agent:latest
############ Rest of template goes here
You should understand how cluster autoscaler works. It is responsible only for adding or removing nodes. It is not responsible for creating or destroying pods. So in your case cluster autoscaler is not doing anything because it's useless. Even if you add one more node - there will be still a requirement to run DaemonSet pods on nodes where is not enough CPU. That's why it is not adding nodes.
What you should do is to manually remove some pods from occupied nodes. Then it will be able to schedule DaemonSet pods.
Alternatively you can reduce CPU requests of Datadog to, for example, 100m or 50m. This should be enough to start those pods.

MountVolume.SetUp failed for volume "nfs" : mount failed: exit status 32

This is 2nd question following 1st question at
PersistentVolumeClaim is not bound: "nfs-pv-provisioning-demo"
I am setting up a kubernetes lab using one node only and learning to setup kubernetes nfs. I am following kubernetes nfs example step by step from the following link: https://github.com/kubernetes/examples/tree/master/staging/volumes/nfs
Based on feedback provided by 'helmbert', I modified the content of
https://github.com/kubernetes/examples/blob/master/staging/volumes/nfs/provisioner/nfs-server-gce-pv.yaml
It works and I don't see the event "PersistentVolumeClaim is not bound: “nfs-pv-provisioning-demo”" anymore.
$ cat nfs-server-local-pv01.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
name: pv01
labels:
type: local
spec:
capacity:
storage: 10Gi
accessModes:
- ReadWriteOnce
hostPath:
path: "/tmp/data01"
$ cat nfs-server-local-pvc01.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: nfs-pv-provisioning-demo
labels:
demo: nfs-pv-provisioning
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 5Gi
$ kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pv01 10Gi RWO Retain Available 4s
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
nfs-pv-provisioning-demo Bound pv01 10Gi RWO 2m
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
nfs-server-nlzlv 1/1 Running 0 1h
$ kubectl describe pods nfs-server-nlzlv
Name: nfs-server-nlzlv
Namespace: default
Node: lab-kube-06/10.0.0.6
Start Time: Tue, 21 Nov 2017 19:32:21 +0000
Labels: role=nfs-server
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"default","name":"nfs-server","uid":"b1b00292-cef2-11e7-8ed3-000d3a04eb...
Status: Running
IP: 10.32.0.3
Created By: ReplicationController/nfs-server
Controlled By: ReplicationController/nfs-server
Containers:
nfs-server:
Container ID: docker://1ea76052920d4560557cfb5e5bfc9f8efc3af5f13c086530bd4e0aded201955a
Image: gcr.io/google_containers/volume-nfs:0.8
Image ID: docker-pullable://gcr.io/google_containers/volume-nfs#sha256:83ba87be13a6f74361601c8614527e186ca67f49091e2d0d4ae8a8da67c403ee
Ports: 2049/TCP, 20048/TCP, 111/TCP
State: Running
Started: Tue, 21 Nov 2017 19:32:43 +0000
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/exports from mypvc (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-grgdz (ro)
Conditions:
Type Status
Initialized True
Ready True
PodScheduled True
Volumes:
mypvc:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: nfs-pv-provisioning-demo
ReadOnly: false
default-token-grgdz:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-grgdz
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.alpha.kubernetes.io/notReady:NoExecute for 300s
node.alpha.kubernetes.io/unreachable:NoExecute for 300s
Events: <none>
I continued the rest of steps and reached the "Setup the fake backend" section and ran the following command:
$ kubectl create -f examples/volumes/nfs/nfs-busybox-rc.yaml
I see status 'ContainerCreating' and never change to 'Running' for both nfs-busybox pods. Is this because the container image is for Google Cloud as shown in the yaml?
https://github.com/kubernetes/examples/blob/master/staging/volumes/nfs/nfs-server-rc.yaml
containers:
- name: nfs-server
image: gcr.io/google_containers/volume-nfs:0.8
ports:
- name: nfs
containerPort: 2049
- name: mountd
containerPort: 20048
- name: rpcbind
containerPort: 111
securityContext:
privileged: true
volumeMounts:
- mountPath: /exports
name: mypvc
Do I have to replace that 'image' line to something else because I don't use Google Cloud for this lab? I only have a single node in my lab. Do I have to rewrite the definition of 'containers' above? What should I replace the 'image' line with? Do I need to download dockerized 'nfs image' from somewhere?
$ kubectl describe pvc nfs
Name: nfs
Namespace: default
StorageClass:
Status: Bound
Volume: nfs
Labels: <none>
Annotations: pv.kubernetes.io/bind-completed=yes
pv.kubernetes.io/bound-by-controller=yes
Capacity: 1Mi
Access Modes: RWX
Events: <none>
$ kubectl describe pv nfs
Name: nfs
Labels: <none>
Annotations: pv.kubernetes.io/bound-by-controller=yes
StorageClass:
Status: Bound
Claim: default/nfs
Reclaim Policy: Retain
Access Modes: RWX
Capacity: 1Mi
Message:
Source:
Type: NFS (an NFS mount that lasts the lifetime of a pod)
Server: 10.111.29.157
Path: /
ReadOnly: false
Events: <none>
$ kubectl get rc
NAME DESIRED CURRENT READY AGE
nfs-busybox 2 2 0 25s
nfs-server 1 1 1 1h
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
nfs-busybox-lmgtx 0/1 ContainerCreating 0 3m
nfs-busybox-xn9vz 0/1 ContainerCreating 0 3m
nfs-server-nlzlv 1/1 Running 0 1h
$ kubectl get service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 20m
nfs-server ClusterIP 10.111.29.157 <none> 2049/TCP,20048/TCP,111/TCP 9s
$ kubectl describe services nfs-server
Name: nfs-server
Namespace: default
Labels: <none>
Annotations: <none>
Selector: role=nfs-server
Type: ClusterIP
IP: 10.111.29.157
Port: nfs 2049/TCP
TargetPort: 2049/TCP
Endpoints: 10.32.0.3:2049
Port: mountd 20048/TCP
TargetPort: 20048/TCP
Endpoints: 10.32.0.3:20048
Port: rpcbind 111/TCP
TargetPort: 111/TCP
Endpoints: 10.32.0.3:111
Session Affinity: None
Events: <none>
$ kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
nfs 1Mi RWX Retain Bound default/nfs 38m
pv01 10Gi RWO Retain Bound default/nfs-pv-provisioning-demo 1h
I see repeating events - MountVolume.SetUp failed for volume "nfs" : mount failed: exit status 32
$ kubectl describe pod nfs-busybox-lmgtx
Name: nfs-busybox-lmgtx
Namespace: default
Node: lab-kube-06/10.0.0.6
Start Time: Tue, 21 Nov 2017 20:39:35 +0000
Labels: name=nfs-busybox
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"default","name":"nfs-busybox","uid":"15d683c2-cefc-11e7-8ed3-000d3a04e...
Status: Pending
IP:
Created By: ReplicationController/nfs-busybox
Controlled By: ReplicationController/nfs-busybox
Containers:
busybox:
Container ID:
Image: busybox
Image ID:
Port: <none>
Command:
sh
-c
while true; do date > /mnt/index.html; hostname >> /mnt/index.html; sleep $(($RANDOM % 5 + 5)); done
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/mnt from nfs (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-grgdz (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
nfs:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: nfs
ReadOnly: false
default-token-grgdz:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-grgdz
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.alpha.kubernetes.io/notReady:NoExecute for 300s
node.alpha.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 17m default-scheduler Successfully assigned nfs-busybox-lmgtx to lab-kube-06
Normal SuccessfulMountVolume 17m kubelet, lab-kube-06 MountVolume.SetUp succeeded for volume "default-token-grgdz"
Warning FailedMount 17m kubelet, lab-kube-06 MountVolume.SetUp failed for volume "nfs" : mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/15d8d6d6-cefc-11e7-8ed3-000d3a04ebcd/volumes/kubernetes.io~nfs/nfs --scope -- mount -t nfs 10.111.29.157:/ /var/lib/kubelet/pods/15d8d6d6-cefc-11e7-8ed3-000d3a04ebcd/volumes/kubernetes.io~nfs/nfs
Output: Running scope as unit run-43641.scope.
mount: wrong fs type, bad option, bad superblock on 10.111.29.157:/,
missing codepage or helper program, or other error
(for several filesystems (e.g. nfs, cifs) you might
need a /sbin/mount.<type> helper program)
In some cases useful info is found in syslog - try
dmesg | tail or so.
Warning FailedMount 9m (x4 over 15m) kubelet, lab-kube-06 Unable to mount volumes for pod "nfs-busybox-lmgtx_default(15d8d6d6-cefc-11e7-8ed3-000d3a04ebcd)": timeout expired waiting for volumes to attach/mount for pod "default"/"nfs-busybox-lmgtx". list of unattached/unmounted volumes=[nfs]
Warning FailedMount 4m (x8 over 15m) kubelet, lab-kube-06 (combined from similar events): Unable to mount volumes for pod "nfs-busybox-lmgtx_default(15d8d6d6-cefc-11e7-8ed3-000d3a04ebcd)": timeout expired waiting for volumes to attach/mount for pod "default"/"nfs-busybox-lmgtx". list of unattached/unmounted volumes=[nfs]
Warning FailedSync 2m (x7 over 15m) kubelet, lab-kube-06 Error syncing pod
$ kubectl describe pod nfs-busybox-xn9vz
Name: nfs-busybox-xn9vz
Namespace: default
Node: lab-kube-06/10.0.0.6
Start Time: Tue, 21 Nov 2017 20:39:35 +0000
Labels: name=nfs-busybox
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"default","name":"nfs-busybox","uid":"15d683c2-cefc-11e7-8ed3-000d3a04e...
Status: Pending
IP:
Created By: ReplicationController/nfs-busybox
Controlled By: ReplicationController/nfs-busybox
Containers:
busybox:
Container ID:
Image: busybox
Image ID:
Port: <none>
Command:
sh
-c
while true; do date > /mnt/index.html; hostname >> /mnt/index.html; sleep $(($RANDOM % 5 + 5)); done
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/mnt from nfs (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-grgdz (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
nfs:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: nfs
ReadOnly: false
default-token-grgdz:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-grgdz
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.alpha.kubernetes.io/notReady:NoExecute for 300s
node.alpha.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 59m (x6 over 1h) kubelet, lab-kube-06 Unable to mount volumes for pod "nfs-busybox-xn9vz_default(15d7fb5e-cefc-11e7-8ed3-000d3a04ebcd)": timeout expired waiting for volumes to attach/mount for pod "default"/"nfs-busybox-xn9vz". list of unattached/unmounted volumes=[nfs]
Warning FailedMount 7m (x32 over 1h) kubelet, lab-kube-06 (combined from similar events): MountVolume.SetUp failed for volume "nfs" : mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/15d7fb5e-cefc-11e7-8ed3-000d3a04ebcd/volumes/kubernetes.io~nfs/nfs --scope -- mount -t nfs 10.111.29.157:/ /var/lib/kubelet/pods/15d7fb5e-cefc-11e7-8ed3-000d3a04ebcd/volumes/kubernetes.io~nfs/nfs
Output: Running scope as unit run-59365.scope.
mount: wrong fs type, bad option, bad superblock on 10.111.29.157:/,
missing codepage or helper program, or other error
(for several filesystems (e.g. nfs, cifs) you might
need a /sbin/mount.<type> helper program)
In some cases useful info is found in syslog - try
dmesg | tail or so.
Warning FailedSync 2m (x31 over 1h) kubelet, lab-kube-06 Error syncing pod
Had the same problem,
sudo apt install nfs-kernel-server
directly on the nodes fixed it for ubuntu 18.04 server.
NFS server running on AWS EC2.
My pod was stuck in ContainerCreating state
I was facing this issue because of the Kubernetes cluster node CIDR range was not present in the inbound rule of Security Group of my AWS EC2 instance(where my NFS server was running )
Solution:
Added my Kubernetes cluser Node CIDR range to inbound rule of Security Group.
Installed the following nfs libraries on node machine of CentOS worked for me.
yum install -y nfs-utils nfs-utils-lib
Installing the nfs-common library in ubuntu worked for me.