I am getting a 0/1 ready state when i run kubectl get pod -n mongodb
To install binami/mongo I ran:
helm repo add bitnami https://charts.bitnami.com/bitnami
kubectl create namespace mongodb
helm install mongodb -n mongodb bitnami/mongodb --values ./factory/yaml/mongodb/values.yaml
when I run kubectl get pod -n mongodb
NAME READY STATUS RESTARTS AGE
mongodb-79bf77f485-8bdm6 0/1 CrashLoopBackOff 69 (55s ago) 4h17m
Here I want the ready state to be 1/1 and status running
thenI ran kubectl describe pod -n mongodb to view the log, and I got
Name: mongodb-79bf77f485-8bdm6
Namespace: mongodb
Priority: 0
Node: ip-192-168-58-58.ec2.internal/192.168.58.58
Start Time: Tue, 19 Jul 2022 07:31:32 +0000
Labels: app.kubernetes.io/component=mongodb
app.kubernetes.io/instance=mongodb
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=mongodb
helm.sh/chart=mongodb-12.1.26
pod-template-hash=79bf77f485
Annotations: kubernetes.io/psp: eks.privileged
Status: Running
IP: 192.168.47.246
IPs:
IP: 192.168.47.246
Controlled By: ReplicaSet/mongodb-79bf77f485
Containers:
mongodb:
Container ID: docker://11999c2e13e382ceb0a8ba2ea8255ed3d4dc07ca18659ee5a1fe1a8d071b10c0
Image: docker.io/bitnami/mongodb:4.4.2-debian-10-r0
Image ID: docker-pullable://bitnami/mongodb#sha256:add0ef947bc26d25b12ee1b01a914081e08b5e9242d2f9e34e2881b5583ce102
Port: 27017/TCP
Host Port: 0/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 19 Jul 2022 11:46:12 +0000
Finished: Tue, 19 Jul 2022 11:47:32 +0000
Ready: False
Restart Count: 69
Liveness: exec [/bitnami/scripts/ping-mongodb.sh] delay=30s timeout=5s period=10s #success=1 #failure=6
Readiness: exec [/bitnami/scripts/readiness-probe.sh] delay=5s timeout=5s period=10s #success=1 #failure=6
Environment:
BITNAMI_DEBUG: false
MONGODB_ROOT_USER: root
MONGODB_ROOT_PASSWORD: <set to the key 'mongodb-root-password' in secret 'mongodb'> Optional: false
ALLOW_EMPTY_PASSWORD: no
MONGODB_SYSTEM_LOG_VERBOSITY: 0
MONGODB_DISABLE_SYSTEM_LOG: no
MONGODB_DISABLE_JAVASCRIPT: no
MONGODB_ENABLE_JOURNAL: yes
MONGODB_PORT_NUMBER: 27017
MONGODB_ENABLE_IPV6: no
MONGODB_ENABLE_DIRECTORY_PER_DB: no
Mounts:
/bitnami/mongodb from datadir (rw)
/bitnami/scripts from common-scripts (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nsr69 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
common-scripts:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: mongodb-common-scripts
Optional: false
datadir:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: mongodb
ReadOnly: false
kube-api-access-nsr69:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 30m (x530 over 4h19m) kubelet Readiness probe failed: /bitnami/scripts/readiness-probe.sh: line 9: mongosh: command not found
Warning BackOff 9m59s (x760 over 4h10m) kubelet Back-off restarting failed container
Warning Unhealthy 5m6s (x415 over 4h19m) kubelet Liveness probe failed: /bitnami/scripts/ping-mongodb.sh: line 2: mongosh: command not found.
Don't understand the log or where the problem is coming from
How can I make the ready state 1/1 and a running status
Related
I am trying to deploy AIrflow on EKS Fargate using Helm. I have the EKS cluster, SC, PV, and PVC, along with namespace and fargate-profile(dev) all set up.
My problem comes when I do helm install:
helm upgrade --install airflow apache-airflow/airflow -n dev --values values.yaml --set volumePermissions.enbled=true --debug
[![list of pods][1]][1]
Above is the list of pods. The last 3 keep going into Crashloopbackoff.
Here is the describe of webserver pod:
C:\Users\tanma>kubectl describe pods -n dev airflow-webserver-775d548b98-wd5x8
Name: airflow-webserver-775d548b98-wd5x8
Namespace: dev
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: airflow-webserver
Node: fargate-ip-192-168-161-147.us-west-2.compute.internal/192.168.161.147
Start Time: Thu, 13 Oct 2022 17:12:54 -0400
Labels: component=webserver
eks.amazonaws.com/fargate-profile=dev
pod-template-hash=775d548b98
release=airflow
tier=airflow
Annotations: CapacityProvisioned: 0.25vCPU 0.5GB
Logging: LoggingDisabled: LOGGING_CONFIGMAP_NOT_FOUND
checksum/airflow-config: 978d20ff42d3de620bee24f2e35b1769f20ebd948890bf474bd940624e39f150
checksum/extra-configmaps: 2e44e493035e2f6a255d08f8104087ff10d30aef6f63176f1b18f75f73295598
checksum/extra-secrets: bb91ef06ddc31c0c5a29973832163d8b0b597812a793ef911d33b622bc9d1655
checksum/metadata-secret: d9bd679df96f2631a8559d02cc528fd78c3d73c06289be9816d83fb332e05b5e
checksum/pgbouncer-config-secret: da52bd1edfe820f0ddfacdebb20a4cc6407d296ee45bcb500a6407e2261a5ba2
checksum/webserver-config: 4a2281a4e3ed0cc5e89f07aba3c1bb314ea51c17cb5d2b41e9b045054a6b5c72
checksum/webserver-secret-key: a1e18ebcc73a51b6bafe52d95eee84dcdf132559cac0248fff6e58e409b4505e
kubernetes.io/psp: eks.privileged
Status: Running
IP: 192.168.161.147
IPs:
IP: 192.168.161.147
Controlled By: ReplicaSet/airflow-webserver-775d548b98
Init Containers:
wait-for-airflow-migrations:
Container ID: containerd://bf4919f7a268bbeaf1a8f8779e4da1551d76f622d9ce970f18a3f2a1f14c24d7
Image: apache/airflow:2.4.1
Image ID: docker.io/apache/airflow#sha256:e077b68d81d56d773bddbcdc8941b7a2c16a2087a641005dfc5f1b8dcadec90a
Port: <none>
Host Port: <none>
Args:
airflow
db
check-migrations
--migration-wait-timeout=60
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 13 Oct 2022 17:14:40 -0400
Finished: Thu, 13 Oct 2022 17:15:12 -0400
Ready: True
Restart Count: 0
Environment:
AIRFLOW__CORE__FERNET_KEY: <set to the key 'fernet-key' in secret 'airflow-fernet-key'> Optional: false
AIRFLOW__CORE__SQL_ALCHEMY_CONN: <set to the key 'connection' in secret 'airflow-airflow-metadata'> Optional: false
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: <set to the key 'connection' in secret 'airflow-airflow-metadata'> Optional: false
AIRFLOW_CONN_AIRFLOW_DB: <set to the key 'connection' in secret 'airflow-airflow-metadata'> Optional: false
AIRFLOW__WEBSERVER__SECRET_KEY: <set to the key 'webserver-secret-key' in secret 'airflow-webserver-secret-key'> Optional: false
Mounts:
/opt/airflow/airflow.cfg from config (ro,path="airflow.cfg")
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-pntv6 (ro)
Containers:
webserver:
Container ID: containerd://e479b50af8eefc8c99971cc9cc9b6345f826c09d5f770276b33518340298359d
Image: apache/airflow:2.4.1
Image ID: docker.io/apache/airflow#sha256:e077b68d81d56d773bddbcdc8941b7a2c16a2087a641005dfc5f1b8dcadec90a
Port: 8080/TCP
Host Port: 0/TCP
Args:
bash
-c
exec airflow webserver
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 143
Started: Thu, 13 Oct 2022 17:40:25 -0400
Finished: Thu, 13 Oct 2022 17:42:19 -0400
Ready: False
Restart Count: 9
Liveness: http-get http://:8080/health delay=15s timeout=30s period=5s #success=1 #failure=20
Readiness: http-get http://:8080/health delay=15s timeout=30s period=5s #success=1 #failure=20
Environment:
AIRFLOW__CORE__FERNET_KEY: <set to the key 'fernet-key' in secret 'airflow-fernet-key'> Optional: false
AIRFLOW__CORE__SQL_ALCHEMY_CONN: <set to the key 'connection' in secret 'airflow-airflow-metadata'> Optional: false
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: <set to the key 'connection' in secret 'airflow-airflow-metadata'> Optional: false
AIRFLOW_CONN_AIRFLOW_DB: <set to the key 'connection' in secret 'airflow-airflow-metadata'> Optional: false
AIRFLOW__WEBSERVER__SECRET_KEY: <set to the key 'webserver-secret-key' in secret 'airflow-webserver-secret-key'> Optional: false
Mounts:
/opt/airflow/airflow.cfg from config (ro,path="airflow.cfg")
/opt/airflow/config/airflow_local_settings.py from config (ro,path="airflow_local_settings.py")
/opt/airflow/logs from logs (rw)
/opt/airflow/pod_templates/pod_template_file.yaml from config (ro,path="pod_template_file.yaml")
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-pntv6 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: airflow-airflow-config
Optional: false
logs:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: af-efs-fargate-1
ReadOnly: false
kube-api-access-pntv6:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning LoggingDisabled 31m fargate-scheduler Disabled logging because aws-logging configmap was not found. configmap "aws-logging" not found
Normal Scheduled 30m fargate-scheduler Successfully assigned dev/airflow-webserver-775d548b98-wd5x8 to fargate-ip-192-168-161-147.us-west-2.compute.internal
Normal Pulling 30m kubelet Pulling image "apache/airflow:2.4.1"
Normal Pulled 28m kubelet Successfully pulled image "apache/airflow:2.4.1" in 1m43.155801441s
Normal Created 28m kubelet Created container wait-for-airflow-migrations
Normal Started 28m kubelet Started container wait-for-airflow-migrations
Normal Pulled 28m kubelet Container image "apache/airflow:2.4.1" already present on machine
Normal Created 28m kubelet Created container webserver
Normal Started 28m kubelet Started container webserver
Warning Unhealthy 27m (x9 over 27m) kubelet Readiness probe failed: Get "http://192.168.161.147:8080/health": dial tcp 192.168.161.147:8080: connect: connection refused
Warning Unhealthy 10m (x156 over 27m) kubelet Liveness probe failed: Get "http://192.168.161.147:8080/health": dial tcp 192.168.161.147:8080: connect: connection refused
Warning BackOff 10s (x44 over 14m) kubelet Back-off restarting failed container
Any thoughts on why the pods keep restarting?
Appreciate your help here.
Thanks
[1]: https://i.stack.imgur.com/IPocP.png
Your host port is 0. I guess that could cause the webserver not to be able to expose its port.
However, you'd have to check the logs of the webserver pod itself to make sure this is the problem.
You need to make sure that this endpoint is available (which is not currently); http://192.168.161.147:8080/health
Ended up increasing the resources for webserver and this solved the problem.
THanks
I'm testing kubernetes behavior when pod getting error.
I now have a pod in CrashLoopBackOff status caused by liveness probe failed, from what I can see in kubernetes events, pod turns into CrashLoopBackOff after 3 times try and begin to back off restarting, but the related Liveness probe failed events won't update?
➜ ~ kubectl describe pods/my-nginx-liveness-err-59fb55cf4d-c6p8l
Name: my-nginx-liveness-err-59fb55cf4d-c6p8l
Namespace: default
Priority: 0
Node: minikube/192.168.99.100
Start Time: Thu, 15 Jul 2021 12:29:16 +0800
Labels: pod-template-hash=59fb55cf4d
run=my-nginx-liveness-err
Annotations: <none>
Status: Running
IP: 172.17.0.3
IPs:
IP: 172.17.0.3
Controlled By: ReplicaSet/my-nginx-liveness-err-59fb55cf4d
Containers:
my-nginx-liveness-err:
Container ID: docker://edc363b76811fdb1ccacdc553d8de77e9d7455bb0d0fb3cff43eafcd12ee8a92
Image: nginx
Image ID: docker-pullable://nginx#sha256:353c20f74d9b6aee359f30e8e4f69c3d7eaea2f610681c4a95849a2fd7c497f9
Port: 80/TCP
Host Port: 0/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 15 Jul 2021 13:01:36 +0800
Finished: Thu, 15 Jul 2021 13:02:06 +0800
Ready: False
Restart Count: 15
Liveness: http-get http://:8080/ delay=0s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r7mh4 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-r7mh4:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 37m default-scheduler Successfully assigned default/my-nginx-liveness-err-59fb55cf4d-c6p8l to minikube
Normal Created 35m (x4 over 37m) kubelet Created container my-nginx-liveness-err
Normal Started 35m (x4 over 37m) kubelet Started container my-nginx-liveness-err
Normal Killing 35m (x3 over 36m) kubelet Container my-nginx-liveness-err failed liveness probe, will be restarted
Normal Pulled 31m (x7 over 37m) kubelet Container image "nginx" already present on machine
Warning Unhealthy 16m (x32 over 36m) kubelet Liveness probe failed: Get "http://172.17.0.3:8080/": dial tcp 172.17.0.3:8080: connect: connection refused
Warning BackOff 118s (x134 over 34m) kubelet Back-off restarting failed container
BackOff event updated 118s ago, but Unhealthy event updated 16m ago?
and why I'm getting only 15 times Restart Count while BackOff events with 134 times?
I'm using minikube and my deployment is like this:
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-nginx-liveness-err
spec:
selector:
matchLabels:
run: my-nginx-liveness-err
replicas: 1
template:
metadata:
labels:
run: my-nginx-liveness-err
spec:
containers:
- name: my-nginx-liveness-err
image: nginx
imagePullPolicy: IfNotPresent
ports:
- containerPort: 80
livenessProbe:
httpGet:
path: /
port: 8080
I think you might be confusing Status Conditions and Events. Events don't "update", they just exist. It's a stream of event data from the controllers for debugging or alerting on. The Age column is the relative timestamp to the most recent instance of that event type and you can see if does some basic de-duplication. Events also age out after a few hours to keep the database from exploding.
So your issue has nothing to do with the liveness probe, your container is crashing on startup.
I deployed a kubernete cluster with two EC2 instances.
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-172-31-21-12 Ready control-plane,master 36h v1.20.2
ip-172-31-21-62 Ready <none> 12h v1.20.2
There is an error pod in the cluster:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
es-cluster-0 0/1 Error 141 12h
I see this error in the log of this pod:
$ kubectl logs pods/es-cluster-0
Error from server: Get "https://172.31.21.62:10250/containerLogs/default/es-cluster-0/elasticsearch": dial tcp 172.31.21.62:10250: i/o timeout
It is talking about having error with the node 172.31.21.62:10250. how can I fix the issue? I am not sure that this error mean?
Below command is from master node:
kubectl describe pods/es-cluster-0 command output:
$ kubectl describe pods/es-cluster-0
Name: es-cluster-0
Namespace: default
Priority: 0
Node: ip-172-31-21-62/172.31.21.62
Start Time: Wed, 17 Feb 2021 10:08:04 +0000
Labels: controller-revision-hash=es-cluster-55b6944c56
name=elasticsearch
statefulset.kubernetes.io/pod-name=es-cluster-0
Annotations: <none>
Status: Running
IP: 10.32.0.2
IPs:
IP: 10.32.0.2
Controlled By: StatefulSet/es-cluster
Containers:
elasticsearch:
Container ID: docker://838e02ff6fba31234656e68b804f49e86ec7fea0053e5d1062abdd9d24b728b9
Image: elasticsearch:7.10.1
Image ID: docker-pullable://elasticsearch#sha256:7cd88158f6ac75d43b447fdd98c4eb69483fa7bf1be5616a85fe556262dc864a
Ports: 9200/TCP, 9300/TCP
Host Ports: 0/TCP, 0/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 78
Started: Thu, 18 Feb 2021 01:40:23 +0000
Finished: Thu, 18 Feb 2021 01:40:39 +0000
Ready: False
Restart Count: 178
Environment: <none>
Mounts:
/usr/share/elasticsearch/config/elasticsearch.yml from elasticsearch-config (rw,path="elasticsearch.yml")
/var/run/secrets/kubernetes.io/serviceaccount from default-token-ntgwr (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
elasticsearch-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: es-config
Optional: false
default-token-ntgwr:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-ntgwr
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 42s (x4082 over 15h) kubelet Back-off restarting failed container
Below command is from master node:
$ netstat -tunlp| grep 10250
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp6 0 0 :::10250 :::* LISTEN -
Below command is from worker node:
$ kubectl describe pods/es-cluster-0
Name: es-cluster-0
Namespace: default
Priority: 0
Node: ip-172-31-21-62/172.31.21.62
Start Time: Wed, 17 Feb 2021 10:08:04 +0000
Labels: controller-revision-hash=es-cluster-55b6944c56
name=elasticsearch
statefulset.kubernetes.io/pod-name=es-cluster-0
Annotations: <none>
Status: Running
IP: 10.32.0.2
IPs:
IP: 10.32.0.2
Controlled By: StatefulSet/es-cluster
Containers:
elasticsearch:
Container ID: docker://0efa71da7b70725b248cfe8dfae5560c2fc95aaf40e3a3a899970e8579e7ac27
Image: elasticsearch:7.10.1
Image ID: docker-pullable://elasticsearch#sha256:7cd88158f6ac75d43b447fdd98c4eb69483fa7bf1be5616a85fe556262dc864a
Ports: 9200/TCP, 9300/TCP
Host Ports: 0/TCP, 0/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 78
Started: Thu, 18 Feb 2021 05:31:55 +0000
Finished: Thu, 18 Feb 2021 05:32:12 +0000
Ready: False
Restart Count: 221
Environment: <none>
Mounts:
/usr/share/elasticsearch/config/elasticsearch.yml from elasticsearch-config (rw,path="elasticsearch.yml")
/var/run/secrets/kubernetes.io/serviceaccount from default-token-ntgwr (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
elasticsearch-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: es-config
Optional: false
default-token-ntgwr:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-ntgwr
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 55m (x211 over 19h) kubelet Container image "elasticsearch:7.10.1" already present on machine
Warning BackOff 54s (x5087 over 19h) kubelet Back-off restarting failed container
Below command is from worker node:
$ netstat -tunlp| grep 10250
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp6 0 0 :::10250 :::* LISTEN -
My pod get red status and show this error:
Usage of EmptyDir volume "agent" exceeds the limit "100Mi".
after I fix this problem, the error did not disappear.
how to make the error tips disappear? This is the pod info:
dolphin#dolphins-MacBook-Pro ~ % kubectl describe pods soa-task-745d48d955-bd4j8
Name: soa-task-745d48d955-bd4j8
Namespace: dabai-fat
Priority: 0
Node: azshara-k8s03/172.19.150.82
Start Time: Tue, 17 Aug 2021 15:23:18 +0800
Labels: k8s-app=soa-task
pod-template-hash=745d48d955
Annotations: kubectl.kubernetes.io/restartedAt: 2021-04-20T07:42:03Z
Status: Running
IP: 172.30.184.3
IPs: <none>
Controlled By: ReplicaSet/soa-task-745d48d955
Init Containers:
init-agent:
Container ID: docker://25d947147300edba8bc1861d40cea314047674b74f82d7de9013eead41f1f20f
Image: registry.cn-hangzhou.aliyuncs.com/dabai_app_k8s/dabai_fat/skywalking-agent:6.5.0
Image ID: docker-pullable://registry.cn-hangzhou.aliyuncs.com/dabai_app_k8s/dabai_fat/skywalking-agent#sha256:eda5426bc7bc06fc184e740f4783f263f151ae25e55aae37eec8b67e5dbb2fb0
Port: <none>
Host Port: <none>
Command:
sh
-c
set -ex;mkdir -p /skywalking/agent;cp -r /opt/skywalking/agent/* /skywalking/agent;
State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 17 Aug 2021 15:23:19 +0800
Finished: Tue, 17 Aug 2021 15:23:19 +0800
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/skywalking/agent from agent (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-xnrwt (ro)
Containers:
soa-task:
Container ID: docker://5406a90606e0a3905fa8fa4827e19db0d8d58609c06c2fd4e756b718df5db3b9
Image: registry.cn-hangzhou.aliyuncs.com/dabai_app_k8s/dabai-fat/soa-task:v1.0.0
Image ID: docker-pullable://registry.cn-hangzhou.aliyuncs.com/dabai_app_k8s/dabai-fat/soa-task#sha256:aece36589aae4fdedcfb82d7e64e451e32ebb1169dccd2485f8fe4bd451944a8
Port: <none>
Host Port: <none>
State: Running
Started: Tue, 17 Aug 2021 15:23:34 +0800
Ready: True
Restart Count: 0
Liveness: http-get http://:11028/actuator/liveness delay=120s timeout=30s period=10s #success=1 #failure=3
Readiness: http-get http://:11028/actuator/health delay=90s timeout=30s period=10s #success=1 #failure=3
Environment:
SKYWALKING_ADDR: dabai-skywalking-skywalking-oap.apm.svc.cluster.local:11800
APOLLO_META: <set to the key 'apollo.meta' of config map 'fat-config'> Optional: false
ENV: <set to the key 'env' of config map 'fat-config'> Optional: false
Mounts:
/opt/skywalking/agent from agent (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-xnrwt (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
agent:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: 100Mi
default-token-xnrwt:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-xnrwt
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 360s
node.kubernetes.io/unreachable:NoExecute op=Exists for 360s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 13m default-scheduler Successfully assigned dabai-fat/soa-task-745d48d955-bd4j8 to azshara-k8s03
Normal Pulled 13m kubelet Container image "registry.cn-hangzhou.aliyuncs.com/dabai_app_k8s/dabai_fat/skywalking-agent:6.5.0" already present on machine
Normal Created 13m kubelet Created container init-agent
Normal Started 13m kubelet Started container init-agent
Normal Pulling 13m kubelet Pulling image "registry.cn-hangzhou.aliyuncs.com/dabai_app_k8s/dabai-fat/soa-task:v1.0.0"
Normal Pulled 13m kubelet Successfully pulled image "registry.cn-hangzhou.aliyuncs.com/dabai_app_k8s/dabai-fat/soa-task:v1.0.0"
Normal Created 13m kubelet Created container soa-task
Normal Started 13m kubelet Started container soa-task
Just installed a single master cluster using kubeadm v1.15.0. However, coredns seems stuck in pending mode:
coredns-5c98db65d4-4pm65 0/1 Pending 0 2m17s <none> <none> <none> <none>
coredns-5c98db65d4-55hcc 0/1 Pending 0 2m2s <none> <none> <none> <none>
the following is what shows up for the pod:
kubectl describe pods coredns-5c98db65d4-4pm65 --namespace=kube-system
Name: coredns-5c98db65d4-4pm65
Namespace: kube-system
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: <none>
Labels: k8s-app=kube-dns
pod-template-hash=5c98db65d4
Annotations: <none>
Status: Pending
IP:
Controlled By: ReplicaSet/coredns-5c98db65d4
Containers:
coredns:
Image: k8s.gcr.io/coredns:1.3.1
Ports: 53/UDP, 53/TCP, 9153/TCP
Host Ports: 0/UDP, 0/TCP, 0/TCP
Args:
-conf
/etc/coredns/Corefile
Limits:
memory: 170Mi
Requests:
cpu: 100m
memory: 70Mi
Liveness: http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
Readiness: http-get http://:8080/health delay=0s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/etc/coredns from config-volume (ro)
/var/run/secrets/kubernetes.io/serviceaccount from coredns-token-5t2wn (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: coredns
Optional: false
coredns-token-5t2wn:
Type: Secret (a volume populated by a Secret)
SecretName: coredns-token-5t2wn
Optional: false
QoS Class: Burstable
Node-Selectors: beta.kubernetes.io/os=linux
Tolerations: CriticalAddonsOnly
node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 61s (x4 over 5m21s) default-scheduler 0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate.
I removed the taint on the master node, to no avail. Shouldn't I be able to created a single node master without any problems like this. I know scheduling pods on the master is not possible without removing the taint, but this is odd.
I tried adding the latest calico cni, to no avail, too.
I get the following running journalctl (systemctl shows no errors):
sudo journalctl -xn --unit kubelet.service
[sudo] password for gms:
-- Logs begin at Fri 2019-07-12 04:31:34 CDT, end at Tue 2019-07-16 16:58:17 CDT. --
Jul 16 16:57:54 thalia0.ahc.umn.edu kubelet[11250]: E0716 16:57:54.122355 11250 kubelet.go:2169] Container runtime network not ready: NetworkReady=false reason:NetworkPl
Jul 16 16:57:54 thalia0.ahc.umn.edu kubelet[11250]: W0716 16:57:54.400606 11250 cni.go:213] Unable to update cni config: No networks found in /etc/cni/net.d
Jul 16 16:57:59 thalia0.ahc.umn.edu kubelet[11250]: E0716 16:57:59.124863 11250 kubelet.go:2169] Container runtime network not ready: NetworkReady=false reason:NetworkPl
Jul 16 16:57:59 thalia0.ahc.umn.edu kubelet[11250]: W0716 16:57:59.400924 11250 cni.go:213] Unable to update cni config: No networks found in /etc/cni/net.d
Jul 16 16:58:04 thalia0.ahc.umn.edu kubelet[11250]: E0716 16:58:04.127120 11250 kubelet.go:2169] Container runtime network not ready: NetworkReady=false reason:NetworkPl
Jul 16 16:58:04 thalia0.ahc.umn.edu kubelet[11250]: W0716 16:58:04.401266 11250 cni.go:213] Unable to update cni config: No networks found in /etc/cni/net.d
Jul 16 16:58:09 thalia0.ahc.umn.edu kubelet[11250]: E0716 16:58:09.129287 11250 kubelet.go:2169] Container runtime network not ready: NetworkReady=false reason:NetworkPl
Jul 16 16:58:09 thalia0.ahc.umn.edu kubelet[11250]: W0716 16:58:09.401520 11250 cni.go:213] Unable to update cni config: No networks found in /etc/cni/net.d
Jul 16 16:58:14 thalia0.ahc.umn.edu kubelet[11250]: E0716 16:58:14.133059 11250 kubelet.go:2169] Container runtime network not ready: NetworkReady=false reason:NetworkPl
Jul 16 16:58:14 thalia0.ahc.umn.edu kubelet[11250]: W0716 16:58:14.402008 11250 cni.go:213] Unable to update cni config: No networks found in /etc/cni/net.d
Indeed, when I look in /etc/cni/net.d there is nothing there -> yes, I ran kubectl apply -f https://docs.projectcalico.org/v3.8/manifests/calico.yaml... this is the output when I apply this:
configmap/calico-config created
customresourcedefinition.apiextensions.k8s.io/felixconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamblocks.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/blockaffinities.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamhandles.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamconfigs.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/bgppeers.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/bgpconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ippools.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/hostendpoints.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/clusterinformations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/globalnetworkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/globalnetworksets.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/networkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/networksets.crd.projectcalico.org created
clusterrole.rbac.authorization.k8s.io/calico-kube-controllers created
clusterrolebinding.rbac.authorization.k8s.io/calico-kube-controllers created
clusterrole.rbac.authorization.k8s.io/calico-node created
clusterrolebinding.rbac.authorization.k8s.io/calico-node created
daemonset.apps/calico-node created
serviceaccount/calico-node created
deployment.apps/calico-kube-controllers created
serviceaccount/calico-kube-controllers created
I ran the following on the pod for calico-node, which is stuck in the following state:
calico-node-tcfhw 0/1 Init:0/3 0 11m 10.32.3.158
describe pods calico-node-tcfhw --namespace=kube-system
Name: calico-node-tcfhw
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Node: thalia0.ahc.umn.edu/10.32.3.158
Start Time: Tue, 16 Jul 2019 18:08:25 -0500
Labels: controller-revision-hash=844ddd97c6
k8s-app=calico-node
pod-template-generation=1
Annotations: scheduler.alpha.kubernetes.io/critical-pod:
Status: Pending
IP: 10.32.3.158
Controlled By: DaemonSet/calico-node
Init Containers:
upgrade-ipam:
Container ID: docker://1e1bf9e65cb182656f6f06a1bb8291237562f0f5a375e557a454942e81d32063
Image: calico/cni:v3.8.0
Image ID: docker-pullable://docker.io/calico/cni#sha256:decba0501ab0658e6e7da2f5625f1eabb8aba5690f9206caba3bf98caca5094c
Port: <none>
Host Port: <none>
Command:
/opt/cni/bin/calico-ipam
-upgrade
State: Running
Started: Tue, 16 Jul 2019 18:08:26 -0500
Ready: False
Restart Count: 0
Environment:
KUBERNETES_NODE_NAME: (v1:spec.nodeName)
CALICO_NETWORKING_BACKEND: <set to the key 'calico_backend' of config map 'calico-config'> Optional: false
Mounts:
/host/opt/cni/bin from cni-bin-dir (rw)
/var/lib/cni/networks from host-local-net-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-b9c6p (ro)
install-cni:
Container ID:
Image: calico/cni:v3.8.0
Image ID:
Port: <none>
Host Port: <none>
Command:
/install-cni.sh
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
CNI_CONF_NAME: 10-calico.conflist
CNI_NETWORK_CONFIG: <set to the key 'cni_network_config' of config map 'calico-config'> Optional: false
KUBERNETES_NODE_NAME: (v1:spec.nodeName)
CNI_MTU: <set to the key 'veth_mtu' of config map 'calico-config'> Optional: false
SLEEP: false
Mounts:
/host/etc/cni/net.d from cni-net-dir (rw)
/host/opt/cni/bin from cni-bin-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-b9c6p (ro)
flexvol-driver:
Container ID:
Image: calico/pod2daemon-flexvol:v3.8.0
Image ID:
Port: <none>
Host Port: <none>
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/host/driver from flexvol-driver-host (rw)
/var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-b9c6p (ro)
Containers:
calico-node:
Container ID:
Image: calico/node:v3.8.0
Image ID:
Port: <none>
Host Port: <none>
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Requests:
cpu: 250m
Liveness: http-get http://localhost:9099/liveness delay=10s timeout=1s period=10s #success=1 #failure=6
Readiness: exec [/bin/calico-node -bird-ready -felix-ready] delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
DATASTORE_TYPE: kubernetes
WAIT_FOR_DATASTORE: true
NODENAME: (v1:spec.nodeName)
CALICO_NETWORKING_BACKEND: <set to the key 'calico_backend' of config map 'calico-config'> Optional: false
CLUSTER_TYPE: k8s,bgp
IP: autodetect
CALICO_IPV4POOL_IPIP: Always
FELIX_IPINIPMTU: <set to the key 'veth_mtu' of config map 'calico-config'> Optional: false
CALICO_IPV4POOL_CIDR: 192.168.0.0/16
CALICO_DISABLE_FILE_LOGGING: true
FELIX_DEFAULTENDPOINTTOHOSTACTION: ACCEPT
FELIX_IPV6SUPPORT: false
FELIX_LOGSEVERITYSCREEN: info
FELIX_HEALTHENABLED: true
Mounts:
/lib/modules from lib-modules (ro)
/run/xtables.lock from xtables-lock (rw)
/var/lib/calico from var-lib-calico (rw)
/var/run/calico from var-run-calico (rw)
/var/run/nodeagent from policysync (rw)
/var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-b9c6p (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
lib-modules:
Type: HostPath (bare host directory volume)
Path: /lib/modules
HostPathType:
var-run-calico:
Type: HostPath (bare host directory volume)
Path: /var/run/calico
HostPathType:
var-lib-calico:
Type: HostPath (bare host directory volume)
Path: /var/lib/calico
HostPathType:
xtables-lock:
Type: HostPath (bare host directory volume)
Path: /run/xtables.lock
HostPathType: FileOrCreate
cni-bin-dir:
Type: HostPath (bare host directory volume)
Path: /opt/cni/bin
HostPathType:
cni-net-dir:
Type: HostPath (bare host directory volume)
Path: /etc/cni/net.d
HostPathType:
host-local-net-dir:
Type: HostPath (bare host directory volume)
Path: /var/lib/cni/networks
HostPathType:
policysync:
Type: HostPath (bare host directory volume)
Path: /var/run/nodeagent
HostPathType: DirectoryOrCreate
flexvol-driver-host:
Type: HostPath (bare host directory volume)
Path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
HostPathType: DirectoryOrCreate
calico-node-token-b9c6p:
Type: Secret (a volume populated by a Secret)
SecretName: calico-node-token-b9c6p
Optional: false
QoS Class: Burstable
Node-Selectors: beta.kubernetes.io/os=linux
Tolerations: :NoSchedule
:NoExecute
CriticalAddonsOnly
node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/network-unavailable:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/pid-pressure:NoSchedule
node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unschedulable:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 9m15s default-scheduler Successfully assigned kube-system/calico-node-tcfhw to thalia0.ahc.umn.edu
Normal Pulled 9m14s kubelet, thalia0.ahc.umn.edu Container image "calico/cni:v3.8.0" already present on machine
Normal Created 9m14s kubelet, thalia0.ahc.umn.edu Created container upgrade-ipam
Normal Started 9m14s kubelet, thalia0.ahc.umn.edu Started container upgrade-ipam
I tried Flannel as a cni, but that was even worse. The kube-proxy wouldn't even start due to a taint!
EDIT ADDENDUM
Should the kube-controller-manager and kube-scheduler not have defined endpoints?
[gms#thalia0 ~]$ kubectl get ep --namespace=kube-system -o wide
NAME ENDPOINTS AGE
kube-controller-manager <none> 19h
kube-dns <none> 19h
kube-scheduler <none> 19h
[gms#thalia0 ~]$ kubectl get pods --namespace=kube-system
NAME READY STATUS RESTARTS AGE
coredns-5c98db65d4-nmn4g 0/1 Pending 0 19h
coredns-5c98db65d4-qv8fm 0/1 Pending 0 19h
etcd-thalia0.x.x.edu. 1/1 Running 0 19h
kube-apiserver-thalia0.x.x.edu 1/1 Running 0 19h
kube-controller-manager-thalia0.x.x.edu 1/1 Running 0 19h
kube-proxy-4hrdc 1/1 Running 0 19h
kube-proxy-vb594 1/1 Running 0 19h
kube-proxy-zwrst 1/1 Running 0 19h
kube-scheduler-thalia0.x.x.edu 1/1 Running 0 19h
Lastly, for sanity's sake, I tried v1.13.1, and voila! Success:
NAME READY STATUS RESTARTS AGE
calico-node-pbrps 2/2 Running 0 15s
coredns-86c58d9df4-g5944 1/1 Running 0 2m40s
coredns-86c58d9df4-zntjl 1/1 Running 0 2m40s
etcd-thalia0.ahc.umn.edu 1/1 Running 0 110s
kube-apiserver-thalia0.ahc.umn.edu 1/1 Running 0 105s
kube-controller-manager-thalia0.ahc.umn.edu 1/1 Running 0 103s
kube-proxy-qxh2h 1/1 Running 0 2m39s
kube-scheduler-thalia0.ahc.umn.edu 1/1 Running 0 117s
EDIT 2
Tried sudo kubeadm upgrade plan and got an error on api-server's health and bad certs.
Ran this on the api-server:
kubectl logs kube-apiserver-thalia0.x.x.edu --namespace=kube-system1
and got a ton of errors of the sort TLS handshake error from 10.x.x.157:52384: remote error: tls: bad certificate, which were from nodes that have long been deleted from the cluster and, long after several kubeadm resets on the master, along with uninstall/reinstall of kubelet, kubeadm, etc.
Why are these old nodes showing up? Don't the certs get recreated on a kubeadm init?
This issue https://github.com/projectcalico/calico/issues/2699 had similar symptoms and indicates that deleting /var/lib/cni/ fixed the issue. You could see if it exists and delete it if so.
Coreos-dns doesn't start until Calico is started, check if your worker nodes are ready with this command
kubectl get nodes -owide
kubectl describe node <your-node>
or
kubectl get node <your-node> -oyaml
Other thing to check is the following message in the log :
"Unable to update cni config: No networks found in /etc/cni/net.d"
what you have in that directory?
Maybe cni isn't configured properly.
That directory /etc/cni/net.d should contain 2 files :
10-calico.conflist calico-kubeconfig
Below is the content of this two files, check if you have files like this in your directory
[root#master net.d]# cat 10-calico.conflist
{
"name": "k8s-pod-network",
"cniVersion": "0.3.0",
"plugins": [
{
"type": "calico",
"log_level": "info",
"datastore_type": "kubernetes",
"nodename": "master",
"mtu": 1440,
"ipam": {
"type": "host-local",
"subnet": "usePodCidr"
},
"policy": {
"type": "k8s"
},
"kubernetes": {
"kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
}
},
{
"type": "portmap",
"snat": true,
"capabilities": {"portMappings": true}
}
]
}
[root#master net.d]# cat calico-kubeconfig
# Kubeconfig file for Calico CNI plugin.
apiVersion: v1
kind: Config
clusters:
- name: local
cluster:
server: https://[10.20.0.1]:443
certificate-authority-data: LSRt.... tLQJ=
users:
- name: calico
user:
token: "eUJh .... ZBoIA"
contexts:
- name: calico-context
context:
cluster: local
user: calico
current-context: calico-context