When a pod gets stuck in a Waiting state, what can I do to find out why it's Waiting?
For instance, I have a deployment on AKS that uses ACI.
When I deploy the YAML file, a number of the pods get stuck in a Waiting state. Running kubectl describe pod selenium121157nodechrome-7bf598579f-kqfqs returns:
State: Waiting
Reason: Waiting
Ready: False
Restart Count: 0
kubectl logs selenium121157nodechrome-7bf598579f-kqfqs returns nothing.
How can I find out what the pod is waiting for?
Here's the deployment YAML:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aci-helloworld2
spec:
  replicas: 20
  selector:
    matchLabels:
      app: aci-helloworld2
  template:
    metadata:
      labels:
        app: aci-helloworld2
    spec:
      containers:
      - name: aci-helloworld
        image: microsoft/aci-helloworld
        ports:
        - containerPort: 80
      nodeSelector:
        kubernetes.io/role: agent
        beta.kubernetes.io/os: linux
        type: virtual-kubelet
      tolerations:
      - key: virtual-kubelet.io/provider
        operator: Exists
      - key: azure.com/aci
        effect: NoSchedule
Here's the kubectl describe output for a pod that has been Waiting for 5 minutes:
matt#Azure:~/2020$ kubectl describe pod aci-helloworld2-86b8d7866d-b9hgc
Name: aci-helloworld2-86b8d7866d-b9hgc
Namespace: default
Priority: 0
Node: virtual-node-aci-linux/
Labels: app=aci-helloworld2
pod-template-hash=86b8d7866d
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/aci-helloworld2-86b8d7866d
Containers:
aci-helloworld:
Container ID: aci://95919def19c28c2a51a806928030d84df4bc6b60656d026d19d0fd5e26e3cd86
Image: microsoft/aci-helloworld
Image ID:
Port: 80/TCP
Host Port: 0/TCP
State: Waiting
Reason: Waiting
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-hqrj8 (ro)
Volumes:
default-token-hqrj8:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-hqrj8
Optional: false
QoS Class: BestEffort
Node-Selectors: beta.kubernetes.io/os=linux
kubernetes.io/role=agent
type=virtual-kubelet
Tolerations: azure.com/aci:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
virtual-kubelet.io/provider
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned default/aci-helloworld2-86b8d7866d-b9hgc to virtual-node-aci-linux
According to the official documentation, a pod in the Waiting state has been scheduled on a node but cannot run on that machine, and the most common cause is a failure to pull the image. You can try pulling and running the image manually with docker pull and docker run to rule out image problems.
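Since describe prints only the waiting reason, it can also help to look at the raw container status. The hypothetical helper below is a sketch that assumes you saved the output of kubectl get pod <pod-name> -o json; it collects each container's waiting reason and message, plus the exit code of the last terminated run if there was one. Whether the virtual-kubelet ACI provider fills in a waiting message is not something the output above shows.

```python
import json
from typing import Any


def container_states(pod_json: str) -> list[dict[str, Any]]:
    """Summarise each container's current and last state (hypothetical helper).

    pod_json is assumed to be the output of `kubectl get pod <name> -o json`.
    """
    pod = json.loads(pod_json)
    summary = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        entry: dict[str, Any] = {
            "name": cs.get("name"),
            "ready": cs.get("ready", False),
            "restarts": cs.get("restartCount", 0),
        }
        waiting = cs.get("state", {}).get("waiting")
        if waiting is not None:
            # reason is what describe prints; message, when present, may carry the detail
            entry["waiting_reason"] = waiting.get("reason")
            entry["waiting_message"] = waiting.get("message")
        terminated = cs.get("lastState", {}).get("terminated")
        if terminated is not None:
            # exit code and reason of the previous run, if the container has run before
            entry["last_exit_code"] = terminated.get("exitCode")
            entry["last_reason"] = terminated.get("reason")
        summary.append(entry)
    return summary
```

For example, save kubectl get pod aci-helloworld2-86b8d7866d-b9hgc -o json to a file and pass its contents to container_states().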
The output of kubectl describe pod <pod-name> should tell you more, especially the Events section at the bottom. Here's an example of what it can look like:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned default/testpod to cafe
Normal BackOff 50s (x6 over 2m16s) kubelet, cafe Back-off pulling image "busybox"
Normal Pulling 37s (x4 over 2m17s) kubelet, cafe Pulling image "busybox"
It could also be an issue with your nodeSelector and tolerations, but again, that would show up in the events when you describe the pod.
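To pull one pod's events out of kubectl get events -o json, a sketch could look like the hypothetical helper below. It keeps every event type rather than only Warning because, as in the example above, a "Back-off pulling image" event can be reported as Normal. kubectl can also do the filtering itself with kubectl get events --field-selector involvedObject.name=<pod-name>; the sketch mainly shows which fields matter.

```python
import json


def pod_events(events_json: str, pod_name: str) -> list[str]:
    """Return one line per event about pod_name, oldest first (hypothetical helper).

    events_json is assumed to be the output of `kubectl get events -o json`,
    i.e. a List object whose "items" are Event objects.
    """
    items = [
        e for e in json.loads(events_json).get("items", [])
        if e.get("involvedObject", {}).get("name") == pod_name
    ]
    # ISO 8601 timestamps sort correctly as strings; newer events may only set eventTime
    items.sort(key=lambda e: e.get("lastTimestamp") or e.get("eventTime") or "")
    return [
        f"{e.get('type')} {e.get('reason')} (x{e.get('count') or 1}): {e.get('message')}"
        for e in items
    ]
```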
Let me know if this helps, and share your output from kubectl describe pod.
Related
I've created a ReplicaSet on Kubernetes using a YAML file. The ReplicaSet is created, but the pods are not starting and show a CrashLoopBackOff error.
Please see the yaml file & the pod status below:
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: new-replica-set
  labels:
    app: new-replica-set
    type: front-end
spec:
  template:
    metadata:
      name: myimage
      labels:
        app: myimg-app
        type: front-end
    spec:
      containers:
      - name: my-busybody
        image: busybox
  replicas: 4
  selector:
    matchLabels:
      type: front-end
Here is the output when I list the pods:
new-replica-set-8v4l2 0/1 CrashLoopBackOff 10 (38s ago) 27m
new-replica-set-kd6nq 0/1 CrashLoopBackOff 10 (44s ago) 27m
new-replica-set-nkkns 0/1 CrashLoopBackOff 10 (21s ago) 27m
new-replica-set-psjcc 0/1 CrashLoopBackOff 10 (40s ago) 27m
Output of the describe command:
$ kubectl describe pods new-replica-set-8v4l2
Name: new-replica-set-8v4l2
Namespace: default
Priority: 0
Node: minikube/192.168.49.2
Start Time: Wed, 03 Nov 2021 19:57:54 -0700
Labels: app=myimg-app
type=front-end
Annotations: <none>
Status: Running
IP: 172.17.0.14
IPs:
IP: 172.17.0.14
Controlled By: ReplicaSet/new-replica-set
Containers:
my-busybody:
Container ID: docker://67dec2d3a1e6d73fa4e67222e5d57fd980a1e6bf6593fbf3f275474e36956077
Image: busybox
Image ID: docker-pullable://busybox#sha256:15e927f78df2cc772b70713543d6b651e3cd8370abf86b2ea4644a9fba21107f
Port: <none>
Host Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 03 Nov 2021 22:12:32 -0700
Finished: Wed, 03 Nov 2021 22:12:32 -0700
Ready: False
Restart Count: 16
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lvnh6 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-lvnh6:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 138m default-scheduler Successfully assigned default/new-replica-set-8v4l2 to minikube
Normal Pulled 138m kubelet Successfully pulled image "busybox" in 4.009613585s
Normal Pulled 138m kubelet Successfully pulled image "busybox" in 4.339635544s
Normal Pulled 138m kubelet Successfully pulled image "busybox" in 2.293243043s
Normal Created 137m (x4 over 138m) kubelet Created container my-busybody
Normal Started 137m (x4 over 138m) kubelet Started container my-busybody
Normal Pulled 137m kubelet Successfully pulled image "busybox" in 2.344639501s
Normal Pulling 136m (x5 over 138m) kubelet Pulling image "busybox"
Normal Pulled 136m kubelet Successfully pulled image "busybox" in 1.114394958s
Warning BackOff 61s (x231 over 138m) kubelet Back-off restarting failed container
How do I fix this?
Also, what is the best way to debug these errors?
The busybox image defaults to the command sh, which opens a shell. Because the container is not started with a terminal attached, the sh process exits immediately after the container starts, which leads to the CrashLoopBackOff status of your pods.
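The growing gap between restarts comes from the kubelet's back-off. The Kubernetes documentation describes it as an exponential delay (10s, 20s, 40s, ...) capped at five minutes, reset once a container has run for 10 minutes without problems. The hypothetical helper below is a simplified model of the doubling and the cap only; it does not model the reset or the kubelet's exact timing.

```python
def restart_delays(restarts: int, initial: int = 10, cap: int = 300) -> list[int]:
    """Approximate delays (seconds) before successive restarts (hypothetical model).

    Doubles the delay after each restart and never exceeds `cap`.
    """
    delays = []
    delay = initial
    for _ in range(restarts):
        delays.append(delay)
        delay = min(delay * 2, cap)  # exponential growth, capped at five minutes
    return delays
```

Under this model, restart_delays(7) gives [10, 20, 40, 80, 160, 300, 300]: after the fifth restart, a crash-looping container is retried roughly every five minutes.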
Try switching to an image that is intended to run a long-running process, e.g. nginx, or define a command (the equivalent of the Docker ENTRYPOINT) and args (the equivalent of the Docker CMD), e.g.:
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: new-replica-set
  labels:
    app: new-replica-set
    type: front-end
spec:
  template:
    metadata:
      name: myimage
      labels:
        app: myimg-app
        type: front-end
    spec:
      containers:
      - name: my-busybody
        image: busybox
        command: ["sh"]
        args: ["-c", "while true; do echo Hello from busybox; sleep 100;done"]
  replicas: 4
  selector:
    matchLabels:
      type: front-end
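How command and args combine with the image's defaults decides which process actually runs. The hypothetical function below sketches the documented Kubernetes rules: command replaces the image's ENTRYPOINT (and the image's CMD is then ignored unless args is also set), while args on its own replaces only the CMD. It assumes, as the answer states, that busybox defaults to CMD ["sh"] with no ENTRYPOINT.

```python
from typing import Optional


def effective_argv(
    image_entrypoint: list[str],
    image_cmd: list[str],
    command: Optional[list[str]] = None,
    args: Optional[list[str]] = None,
) -> list[str]:
    """Return the argv a container would run (hypothetical helper).

    Models the Kubernetes command/args rules relative to the image's
    ENTRYPOINT/CMD. Empty lists are treated as "not set".
    """
    if not command and not args:
        return list(image_entrypoint) + list(image_cmd)  # image defaults
    if command and not args:
        return list(command)  # image ENTRYPOINT and CMD are both ignored
    if args and not command:
        return list(image_entrypoint) + list(args)  # args replace CMD only
    return list(command) + list(args)  # both overridden
```

With the busybox defaults, effective_argv([], ["sh"]) is just ["sh"], the shell that exits at once, while the command/args in the manifest above yield ["sh", "-c", "while true; ...; done"], a loop that keeps the container running.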
I am practicing creating a PV and PVC with Minikube, but I ran into an error: my InfluxDB deployment can't find influxdb-pvc, and I can't solve it.
Looking at the message at the top of the events, I can see that my PVC cannot be found, so I checked the status of the PersistentVolumeClaim.
As far as I know, if the STATUS of influxdb-pv and influxdb-pvc is Bound, they were created correctly and the Deployment should be able to find influxdb-pvc. I don't know what's going on... Please help me 😢
The following is a description of the Pod:
> kubectl describe pod influxdb-5b769454b8-pksss
Name: influxdb-5b769454b8-pksss
Namespace: ft-services
Priority: 0
Node: minikube/192.168.49.2
Start Time: Thu, 25 Feb 2021 01:14:25 +0900
Labels: app=influxdb
pod-template-hash=5b769454b8
Annotations: <none>
Status: Running
IP: 172.17.0.5
IPs:
IP: 172.17.0.5
Controlled By: ReplicaSet/influxdb-5b769454b8
Containers:
influxdb:
Container ID: docker://be2eec32cca22ea84f4a0034f42668c971fefe62e361f2a4d1a74d92bfbf4d78
Image: service_influxdb
Image ID: docker://sha256:50693dcc4dda172f82c0dcd5ff1db01d6d90268ad2b0bd424e616cb84da64c6b
Port: 8086/TCP
Host Port: 0/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 25 Feb 2021 01:30:40 +0900
Finished: Thu, 25 Feb 2021 01:30:40 +0900
Ready: False
Restart Count: 8
Environment Variables from:
influxdb-secret Secret Optional: false
Environment: <none>
Mounts:
/var/lib/influxdb from var-lib-influxdb (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-lfzz9 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
var-lib-influxdb:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: influxdb-pvc
ReadOnly: false
default-token-lfzz9:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-lfzz9
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 20m (x2 over 20m) default-scheduler 0/1 nodes are available: 1 persistentvolumeclaim "influxdb-pvc" not found.
Normal Scheduled 20m default-scheduler Successfully assigned ft-services/influxdb-5b769454b8-pksss to minikube
Normal Pulled 19m (x5 over 20m) kubelet Container image "service_influxdb" already present on machine
Normal Created 19m (x5 over 20m) kubelet Created container influxdb
Normal Started 19m (x5 over 20m) kubelet Started container influxdb
Warning BackOff 43s (x93 over 20m) kubelet Back-off restarting failed container
The following is status information for PV and PVC:
> kubectl get pv,pvc
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
persistentvolume/influxdb-pv 10Gi RWO Recycle Bound ft-services/influxdb-pvc influxdb 104m
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
persistentvolumeclaim/influxdb-pvc Bound influxdb-pv 10Gi RWO influxdb 13m
I set things up in the following order:
Create a namespace.
kubectl create namespace ft-services
kubectl config set-context --current --namespace=ft-services
Apply my config files: influxdb-deployment.yaml, influxdb-secret.yaml, influxdb-service.yaml, influxdb-volume.yaml
influxdb-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: influxdb
  labels:
    app: influxdb
spec:
  replicas: 1
  selector:
    matchLabels:
      app: influxdb
  template:
    metadata:
      labels:
        app: influxdb
    spec:
      containers:
      - name: influxdb
        image: service_influxdb
        imagePullPolicy: Never
        ports:
        - containerPort: 8086
        envFrom:
        - secretRef:
            name: influxdb-secret
        volumeMounts:
        - mountPath: /var/lib/influxdb
          name: var-lib-influxdb
      volumes:
      - name: var-lib-influxdb
        persistentVolumeClaim:
          claimName: influxdb-pvc
influxdb-volume.yaml:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: influxdb-pv
  labels:
    app: influxdb
spec:
  storageClassName: influxdb
  claimRef:
    namespace: ft-services
    name: influxdb-pvc
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Recycle
  hostPath:
    path: "/mnt/influxdb"
    type: DirectoryOrCreate
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: influxdb-pvc
  labels:
    app: influxdb
spec:
  storageClassName: influxdb
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
Build my docker image: service_influxdb
Dockerfile:
FROM alpine:3.13.1
RUN apk update && apk upgrade --ignore busybox && \
apk add \
influxdb && \
sed -i "247s/ #/ /" /etc/influxdb.conf && \
sed -i "256s/ #/ /" /etc/influxdb.conf
EXPOSE 8086
ENTRYPOINT influxd & /bin/sh
Check my Minikube cluster with the dashboard:
> minikube dashboard
0/1 nodes are available: 1 persistentvolumeclaim "influxdb-pvc" not found.
Back-off restarting failed container
I've tested your YAMLs on my Minikube cluster.
Your configuration is correct; however, you missed one small detail. A container based on Alpine needs to "do something" inside, because a container exits when its main process exits. Once the container has done everything it was configured to do, the pod will be in Completed status.
Your pod is crashing because it starts up and then immediately exits, so Kubernetes restarts it and the cycle continues. For more details, see the Pod Lifecycle documentation.
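The reason a container that exits with code 0 still ends up in CrashLoopBackOff is the pod's restartPolicy. Pods managed by a Deployment or ReplicaSet can only use restartPolicy: Always, so the kubelet restarts the container even after a clean exit. The hypothetical function below sketches the documented decision for each policy; for work that is meant to run to completion, a Job (which allows OnFailure or Never) is the usual fit.

```python
def kubelet_restarts(restart_policy: str, exit_code: int) -> bool:
    """Whether a container that exited with exit_code is restarted (hypothetical helper)."""
    if restart_policy == "Always":
        return True  # restarted even after a successful (exit 0) run
    if restart_policy == "OnFailure":
        return exit_code != 0  # only failed runs are restarted
    if restart_policy == "Never":
        return False
    raise ValueError(f"unknown restartPolicy: {restart_policy}")
```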
Examples
Alpine example:
$ kubectl get po alipne-test -w
NAME READY STATUS RESTARTS AGE
alipne-test 0/1 Completed 2 36s
alipne-test 0/1 CrashLoopBackOff 2 36s
alipne-test 0/1 Completed 3 54s
alipne-test 0/1 CrashLoopBackOff 3 55s
alipne-test 0/1 Completed 4 101s
alipne-test 0/1 CrashLoopBackOff 4 113s
Nginx example:
$ kubectl get po nginx
NAME READY STATUS RESTARTS AGE
nginx 1/1 Running 0 5m23s
Nginx runs a web server as its main process, so it does not need an additional sleep command.
Your Current Configuration
Your pod with InfluxDB is created, has nothing to do, and exits.
$ kubectl get po -w
NAME READY STATUS RESTARTS AGE
influxdb-96bfd697d-wbkt7 0/1 CrashLoopBackOff 4 2m28s
influxdb-96bfd697d-wbkt7 0/1 Completed 5 3m8s
influxdb-96bfd697d-wbkt7 0/1 CrashLoopBackOff 5 3m19s
Solution
You just need to add, for example, a sleep command to keep the container alive. For testing, I used sleep 60 to keep the container alive for 60 seconds with the configuration below:
spec:
  containers:
  - name: influxdb
    image: service_influxdb
    imagePullPolicy: Never
    ports:
    - containerPort: 8086
    envFrom:
    - secretRef:
        name: influxdb-secret
    volumeMounts:
    - mountPath: /var/lib/influxdb
      name: var-lib-influxdb
    command: ["/bin/sh"] # additional command
    args: ["-c", "sleep 60"] # args to use sleep 60 command
And the output:
$ kubectl get po -w
NAME READY STATUS RESTARTS AGE
influxdb-65dc56f8df-9v76p 1/1 Running 0 7s
influxdb-65dc56f8df-9v76p 0/1 Completed 0 62s
influxdb-65dc56f8df-9v76p 1/1 Running 1 63s
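It ran for 60 seconds because the sleep command was set to 60. Once the container finished all of its configured commands, it exited and the status changed to Completed. Note that command replaces the image's ENTRYPOINT, so in this test influxd itself is not started; if the container's main process is one that keeps running (for example, influxd started in the foreground), you don't need sleep at all.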
PV issues
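Lastly, you mentioned an issue in the Minikube dashboard. I was not able to reproduce it, but it might be a leftover from a previous test. In your events, the FailedScheduling warning is followed by a successful Scheduled event, so the claim was eventually found.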
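To double-check by hand that a PV can satisfy a PVC, the hypothetical helper below compares the fields used in influxdb-volume.yaml. It is a simplified subset of what the PV controller checks: it ignores volumeMode, label selectors and node affinity, and only parses binary suffixes such as Gi. Note that the PVC manifest has no metadata.namespace, so it is created in the current context's namespace (ft-services after the set-context step), and the PV's claimRef has to name that same namespace.

```python
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def to_bytes(quantity: str) -> int:
    """Parse a storage quantity like '10Gi' (binary suffixes and plain integers only)."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def pv_pvc_mismatches(pv_spec: dict, pvc_spec: dict, pvc_name: str, pvc_namespace: str) -> list[str]:
    """List reasons pv_spec could not satisfy pvc_spec (hypothetical, simplified check).

    An empty list means none of the checked fields conflict.
    """
    problems = []
    if pv_spec.get("storageClassName", "") != pvc_spec.get("storageClassName", ""):
        problems.append("storageClassName differs")
    if not set(pvc_spec.get("accessModes", [])) <= set(pv_spec.get("accessModes", [])):
        problems.append("PV does not offer every requested access mode")
    if to_bytes(pv_spec["capacity"]["storage"]) < to_bytes(pvc_spec["resources"]["requests"]["storage"]):
        problems.append("PV capacity is smaller than the PVC request")
    claim_ref = pv_spec.get("claimRef")
    # a claimRef pre-binds the PV to exactly one claim (name + namespace)
    if claim_ref and (claim_ref.get("name"), claim_ref.get("namespace")) != (pvc_name, pvc_namespace):
        problems.append("PV claimRef reserves it for a different claim")
    return problems
```

With the values from the question, pv_pvc_mismatches returns an empty list for influxdb-pvc in ft-services, which agrees with the Bound status shown by kubectl get pv,pvc.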
Please let me know if you still have issue.
I have just moved my first cluster from Minikube to AWS EKS. Everything has gone smoothly so far, except that I seem to be running into some DNS issues, but only on one of the cluster nodes.
I have two nodes in the cluster running v1.14, with 4 pods of one type and 4 of another. 3 of each work, but 1 of each (both on the same node) starts and then errors (CrashLoopBackOff), with the script inside the container failing because it can't resolve the hostname of the database. Deleting the errored pod, or even all pods, results in one pod on the same node failing every time.
The database is in its own pod and has a service assigned, none of the other pods of the same type have problems resolving the name or connecting. The database pod is on the same node as the pods that can't resolve the hostname. I'm not sure how to migrate the pod to a different node, but that might be worth trying to see if the problem follows.
No errors in the coredns pods. I'm not sure where to start looking to discover the issue from here, and any help or suggestions would be appreciated.
The configs are below. As mentioned, they all work on Minikube, and they also work on one node.
kubectl get pods (note the ages: all pod1 pods were deleted at the same time and recreated themselves; 3 worked fine, the 4th does not):
NAME READY STATUS RESTARTS AGE
pod1-85f7968f7-2cjwt 1/1 Running 0 34h
pod1-85f7968f7-cbqn6 1/1 Running 0 34h
pod1-85f7968f7-k9xv2 0/1 CrashLoopBackOff 399 34h
pod1-85f7968f7-qwcrz 1/1 Running 0 34h
postgresql-865db94687-cpptb 1/1 Running 0 3d14h
rabbitmq-667cfc4cc-t92pl 1/1 Running 0 34h
pod2-94b9bc6b6-6bzf7 1/1 Running 0 34h
pod2-94b9bc6b6-6nvkr 1/1 Running 0 34h
pod2-94b9bc6b6-jcjtb 0/1 CrashLoopBackOff 140 11h
pod2-94b9bc6b6-t4gfq 1/1 Running 0 34h
postgresql service
apiVersion: v1
kind: Service
metadata:
  name: postgresql
spec:
  ports:
  - port: 5432
  selector:
    app: postgresql
pod1 deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod1
spec:
  replicas: 4
  selector:
    matchLabels:
      app: pod1
  template:
    metadata:
      labels:
        app: pod1
    spec:
      containers:
      - name: pod1
        image: us.gcr.io/gcp-project-8888888/pod1:latest
        env:
        - name: rabbitmquser
          valueFrom:
            secretKeyRef:
              name: rabbitmq-secrets
              key: rmquser
        volumeMounts:
        - mountPath: /data/files
          name: datafiles
      volumes:
      - name: datafiles
        persistentVolumeClaim:
          claimName: datafiles-pv-claim
      imagePullSecrets:
      - name: container-readonly
pod2 deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod2
spec:
  replicas: 4
  selector:
    matchLabels:
      app: pod2
  template:
    metadata:
      labels:
        app: pod2
    spec:
      containers:
      - name: pod2
        image: us.gcr.io/gcp-project-8888888/pod2:latest
        env:
        - name: rabbitmquser
          valueFrom:
            secretKeyRef:
              name: rabbitmq-secrets
              key: rmquser
        volumeMounts:
        - mountPath: /data/files
          name: datafiles
      volumes:
      - name: datafiles
        persistentVolumeClaim:
          claimName: datafiles-pv-claim
      imagePullSecrets:
      - name: container-readonly
CoreDNS ConfigMap, which forwards DNS queries to an external server if a name doesn't resolve internally. This is the only place I can think of that could be causing the issue, but as I said, it works on one node.
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          upstream
          fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        proxy . 8.8.8.8
        cache 30
        loop
        reload
        loadbalance
    }
Errored pod output. It is the same for both pods, as the error occurs in library code common to both. As mentioned, this does not happen for all pods, so the issue likely isn't in the code.
Error connecting to database (psycopg2.OperationalError) could not translate host name "postgresql" to address: Try again
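The trailing "Try again" appears to be the resolver's temporary-failure error (EAI_AGAIN) rather than a missing DNS record. To confirm from a pod on the affected node that name resolution itself fails, independent of psycopg2, a small probe like the hypothetical one below could be used. It only retries and reports; it does not fix the underlying cause.

```python
import socket
import time


def resolve_with_retry(host: str, port: int, attempts: int = 5, delay: float = 1.0) -> list[str]:
    """Resolve host to a sorted list of IP addresses, retrying on resolver errors.

    Hypothetical probe: returns [] if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
            # each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP
            return sorted({info[4][0] for info in infos})
        except socket.gaierror as exc:
            # resolver failures, including temporary ones (EAI_AGAIN), are raised as gaierror
            print(f"attempt {attempt}/{attempts}: {host}: {exc}")
            if attempt < attempts:
                time.sleep(delay)
    return []
```

Inside the cluster, resolve_with_retry("postgresql", 5432) should return the Service's ClusterIP; an empty list on only one node points at that node's networking rather than at the application.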
Errored Pod1 description:
Name: xyz-94b9bc6b6-jcjtb
Namespace: default
Priority: 0
Node: ip-192-168-87-230.us-east-2.compute.internal/192.168.87.230
Start Time: Tue, 15 Oct 2019 19:43:11 +1030
Labels: app=pod1
pod-template-hash=94b9bc6b6
Annotations: kubernetes.io/psp: eks.privileged
Status: Running
IP: 192.168.70.63
Controlled By: ReplicaSet/xyz-94b9bc6b6
Containers:
pod1:
Container ID: docker://f7dc735111bd94b7c7b698e69ad302ca19ece6c72b654057627626620b67d6de
Image: us.gcr.io/xyz/xyz:latest
Image ID: docker-pullable://us.gcr.io/xyz/xyz#sha256:20110cf126b35773ef3a8656512c023b1e8fe5c81dd88f19a64c5bfbde89f07e
Port: <none>
Host Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Wed, 16 Oct 2019 07:21:40 +1030
Finished: Wed, 16 Oct 2019 07:21:46 +1030
Ready: False
Restart Count: 139
Environment:
xyz: <set to the key 'xyz' in secret 'xyz-secrets'> Optional: false
Mounts:
/data/xyz from xyz (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-m72kz (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
xyz:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: xyz-pv-claim
ReadOnly: false
default-token-m72kz:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-m72kz
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 2m22s (x3143 over 11h) kubelet, ip-192-168-87-230.us-east-2.compute.internal Back-off restarting failed container
Errored Pod 2 description:
Name: xyz-85f7968f7-k9xv2
Namespace: default
Priority: 0
Node: ip-192-168-87-230.us-east-2.compute.internal/192.168.87.230
Start Time: Mon, 14 Oct 2019 21:19:42 +1030
Labels: app=pod2
pod-template-hash=85f7968f7
Annotations: kubernetes.io/psp: eks.privileged
Status: Running
IP: 192.168.84.69
Controlled By: ReplicaSet/pod2-85f7968f7
Containers:
pod2:
Container ID: docker://f7c7379f92f57ea7d381ae189b964527e02218dc64337177d6d7cd6b70990143
Image: us.gcr.io/xyz-217300/xyz:latest
Image ID: docker-pullable://us.gcr.io/xyz-217300/xyz#sha256:b9cecdbc90c5c5f7ff6170ee1eccac83163ac670d9df5febd573c2d84a4d628d
Port: <none>
Host Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Wed, 16 Oct 2019 07:23:35 +1030
Finished: Wed, 16 Oct 2019 07:23:41 +1030
Ready: False
Restart Count: 398
Environment:
xyz: <set to the key 'xyz' in secret 'xyz-secrets'> Optional: false
Mounts:
/data/xyz from xyz (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-m72kz (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
xyz:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: xyz-pv-claim
ReadOnly: false
default-token-m72kz:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-m72kz
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 3m28s (x9208 over 34h) kubelet, ip-192-168-87-230.us-east-2.compute.internal Back-off restarting failed container
At the suggestion of a k8s community member, I applied the following change to my coredns configuration to be more in line with the best practice:
The line proxy . 8.8.8.8 was changed to forward . /etc/resolv.conf 8.8.8.8
I then deleted the pods, and after they were recreated by k8s, the issue did not appear again.
EDIT:
It turns out that was not the issue at all: shortly afterwards, the problem reoccurred and persisted. In the end, it was this: https://github.com/aws/amazon-vpc-cni-k8s/issues/641
I rolled the Amazon VPC CNI plugin back to 1.5.3 as recommended by Amazon, restarted the cluster, and the issue was resolved.
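Because the failing pods all sat on one node, it helps to make that pattern explicit when debugging node-local problems like this one. kubectl get pods -o wide adds a NODE column; the hypothetical helper below does the same grouping from kubectl get pods -o json, listing pods whose containers are waiting in CrashLoopBackOff under their spec.nodeName. It is a sketch and only looks at the current waiting reason.

```python
import json
from collections import defaultdict


def failing_pods_by_node(pods_json: str) -> dict[str, list[str]]:
    """Group CrashLoopBackOff pods by the node they run on (hypothetical helper).

    pods_json is assumed to be the output of `kubectl get pods -o json`.
    """
    by_node = defaultdict(list)
    for pod in json.loads(pods_json).get("items", []):
        node = pod.get("spec", {}).get("nodeName", "<unscheduled>")
        for cs in pod.get("status", {}).get("containerStatuses", []):
            if cs.get("state", {}).get("waiting", {}).get("reason") == "CrashLoopBackOff":
                by_node[node].append(pod["metadata"]["name"])
                break  # count each pod once, even with several failing containers
    return dict(by_node)
```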
I am trying to deploy the back-end component of my application for testing REST APIs. I have dockerized the components and created an image in Minikube, and I have created a YAML file for deploying and creating services. When I deploy it with sudo kubectl create -f frontend-deployment.yaml, it deploys without any error, but when I check the status of the deployments, this is what is shown:
NAME READY UP-TO-DATE AVAILABLE AGE
back 0/3 3 0 2m57s
Interestingly the service corresponding to this deployment is available.
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
back ClusterIP 10.98.73.249 <none> 8080/TCP 3m9s
I also tried creating the deployment by running the deployment statements individually, like sudo kubectl run back --image=back --port=8080 --image-pull-policy Never, but the result was the same.
Here is what my deployment.yaml file looks like:
kind: Service
apiVersion: v1
metadata:
  name: back
spec:
  selector:
    app: back
  ports:
  - protocol: TCP
    port: 8080
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: back
spec:
  selector:
    matchLabels:
      app: back
  replicas: 3
  template:
    metadata:
      labels:
        app: back
    spec:
      containers:
      - name: back
        image: back
        imagePullPolicy: Never
        ports:
        - containerPort: 8080
How can I get this deployment up and running? Right now it causes an internal server error on the front-end side of my application.
Description of the back pod:
Name: back-7fd9995747-nlqhq
Namespace: default
Priority: 0
Node: minikube/10.0.2.15
Start Time: Mon, 15 Jul 2019 12:49:52 +0200
Labels: pod-template-hash=7fd9995747
run=back
Annotations: <none>
Status: Running
IP: 172.17.0.7
Controlled By: ReplicaSet/back-7fd9995747
Containers:
back:
Container ID: docker://8a46e16c52be24b12831bb38d2088b8059947d099299d15755d77094b9cb5a8b
Image: back:latest
Image ID: docker://sha256:69218763696932578e199b9ab5fc2c3e9087f9482ac7e767db2f5939be98a534
Port: 8080/TCP
Host Port: 0/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Mon, 15 Jul 2019 12:49:54 +0200
Finished: Mon, 15 Jul 2019 12:49:54 +0200
Ready: False
Restart Count: 1
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-c247f (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-c247f:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-c247f
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 6s default-scheduler Successfully assigned default/back-7fd9995747-nlqhq to minikube
Normal Pulled 4s (x2 over 5s) kubelet, minikube Container image "back:latest" already present on machine
Normal Created 4s (x2 over 5s) kubelet, minikube Created container back
Normal Started 4s (x2 over 5s) kubelet, minikube Started container back
Warning BackOff 2s (x2 over 3s) kubelet, minikube Back-off restarting failed container
As you can see, none of the three Pods has Ready status:
NAME READY AVAILABLE
back 0/3 0
To find out what is going on, check the underlying Pods:
$ kubectl get pods -l app=back
and then look at the Events in their description:
$ kubectl describe pod back-...
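One thing to watch when selecting the pods: the pod described in the question carries the labels run=back and pod-template-hash=7fd9995747 (it appears to come from the kubectl run attempt), not app=back, so kubectl get pods -l app=back would not list it, while -l run=back would. The hypothetical function below sketches equality-based label matching, the kind used by -l key=value and by matchLabels.

```python
def selector_matches(match_labels: dict[str, str], pod_labels: dict[str, str]) -> bool:
    """Equality-based selector (hypothetical helper).

    Every key/value pair in match_labels must be present on the pod; extra pod labels are fine.
    """
    return all(pod_labels.get(key) == value for key, value in match_labels.items())
```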
The last (3rd) pod is continuously being deleted and recreated by Kubernetes. It goes from the Running to the Terminating state. The Kubernetes UI shows the status as: 'Terminated: ExitCode:${state.terminated.exitCode}'
My deployment YAML:
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: openapi
spec:
  scaleTargetRef:
    kind: Deployment
    name: openapi
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 75
---
kind: Service
apiVersion: v1
metadata:
  name: openapi
spec:
  selector:
    app: openapi
  ports:
  - name: http
    protocol: TCP
    port: 80
    targetPort: 8080
  - name: https
    protocol: TCP
    port: 443
    targetPort: 8443
  type: LoadBalancer
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: openapi
spec:
  template:
    metadata:
      labels:
        app: openapi
    spec:
      containers:
      - name: openapi
        image: us.gcr.io/PROJECT_ID/openapi:latest
        imagePullPolicy: Always
        ports:
        - containerPort: 8080
Portion of the output of kubectl get events -n namespace:
Pod Normal Created kubelet Created container
Pod Normal Started kubelet Started container
Pod Normal Killing kubelet Killing container with id docker://openapi:Need to kill Pod
ReplicaSet Normal SuccessfulCreate replicaset-controller (combined from similar events): Created pod: openapi-7db5f8d479-p7mcl
ReplicaSet Normal SuccessfulDelete replicaset-controller (combined from similar events): Deleted pod: openapi-7db5f8d479-pgmxf
HorizontalPodAutoscaler Normal SuccessfulRescale horizontal-pod-autoscaler New size: 2; reason: Current number of replicas above Spec.MaxReplicas
HorizontalPodAutoscaler Normal SuccessfulRescale horizontal-pod-autoscaler New size: 3; reason: Current number of replicas below Spec.MinReplicas
Deployment Normal ScalingReplicaSet deployment-controller Scaled up replica set openapi-7db5f8d479 to 3
Deployment Normal ScalingReplicaSet deployment-controller Scaled down replica set openapi-7db5f8d479 to 2
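The two SuccessfulRescale events pull in opposite directions, and the first one ("New size: 2 ... above Spec.MaxReplicas") does not fit the HPA shown in the question, which has maxReplicas: 10. One hypothesis, not confirmed by the question, is that a second HPA with a maximum of 2 targets the same Deployment; kubectl get hpa in that namespace would show whether it exists. Each scale-down would then make the ReplicaSet delete a pod (the SuccessfulDelete event), and the HPA above would scale back up to 3. The sketch below models only the min/max clamping an HPA applies, not its CPU-based calculation; the second HPA's bounds (1, 2) are invented for illustration.

```python
def clamp(replicas: int, min_replicas: int, max_replicas: int) -> int:
    """Keep a replica count within [min_replicas, max_replicas], as an HPA's bounds do."""
    return max(min_replicas, min(replicas, max_replicas))


def simulate(start: int, hpa_bounds: list[tuple[int, int]], rounds: int) -> list[int]:
    """Let several hypothetical HPAs, given as (min, max) pairs, take turns resizing one Deployment."""
    sizes, current = [], start
    for i in range(rounds):
        low, high = hpa_bounds[i % len(hpa_bounds)]
        current = clamp(current, low, high)
        sizes.append(current)
    return sizes
```

Under that assumption, simulate(3, [(1, 2), (3, 10)], 4) produces [2, 3, 2, 3]: the Deployment flips between two and three replicas indefinitely, and one pod is killed on every flip.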
Output of kubectl describe pod -n default openapi-7db5f8d479-2d2nm for a pod that was spawned and then killed:
A different pod with a different unique id spawns each time after a pod gets killed by Kubernetes.
Name: openapi-7db5f8d479-2d2nm
Namespace: default
Node: gke-testproject-default-pool-28ce3836-t4hp/10.150.0.2
Start Time: Thu, 23 Nov 2017 11:50:17 +0000
Labels: app=openapi
pod-template-hash=3861948035
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"default","name":"openapi-7db5f8d479","uid":"b7b3e48f-ceb2-11e7-afe7-42010a960003"...
kubernetes.io/limit-ranger=LimitRanger plugin set: cpu request for container openapi
Status: Terminating (expires Thu, 23 Nov 2017 11:51:04 +0000)
Termination Grace Period: 30s
IP:
Created By: ReplicaSet/openapi-7db5f8d479
Controlled By: ReplicaSet/openapi-7db5f8d479
Containers:
openapi:
Container ID: docker://93d2f1372a7ad004aaeb34b0bc9ee375b6ed48609f505b52495067dd0dcbb233
Image: us.gcr.io/testproject-175705/openapi:latest
Image ID: docker-pullable://us.gcr.io/testproject-175705/openapi#sha256:54b833548cbed32db36ba4808b33c87c15c4ecde673839c3922577f30b
Port: 8080/TCP
State: Terminated
Reason: Error
Exit Code: 143
Started: Thu, 23 Nov 2017 11:50:18 +0000
Finished: Thu, 23 Nov 2017 11:50:35 +0000
Ready: False
Restart Count: 0
Requests:
cpu: 100m
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-61k6c (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
default-token-61k6c:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-61k6c
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.alpha.kubernetes.io/notReady:NoExecute for 300s
node.alpha.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 21s default-scheduler Successfully assigned openapi-7db5f8d479-2d2nm to gke-testproject-default-pool-28ce3836-t4hp
Normal SuccessfulMountVolume 21s kubelet, gke-testproject-default-pool-28ce3836-t4hp MountVolume.SetUp succeeded for volume "default-token-61k6c"
Normal Pulling 21s kubelet, gke-testproject-default-pool-28ce3836-t4hp pulling image "us.gcr.io/testproject-175705/openapi:latest"
Normal Pulled 20s kubelet, gke-testproject-default-pool-28ce3836-t4hp Successfully pulled image "us.gcr.io/testproject-175705/openapi:latest"
Normal Created 20s kubelet, gke-testproject-default-pool-28ce3836-t4hp Created container
Normal Started 20s kubelet, gke-testproject-default-pool-28ce3836-t4hp Started container
Normal Killing 3s kubelet, gke-testproject-default-pool-28ce3836-t4hp Killing container with id docker://openapi:Need to kill Pod
Check the pod's events and description using the commands below:
kubectl get events -w -n namespace
and
kubectl describe pod -n namespace pod_name
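When reading the describe output, the exit code is a useful clue. By the common 128 + N convention, Exit Code 143 means the process ended on signal 15 (SIGTERM), which is what the kubelet sends when stopping a container and fits the "Killing container with id docker://openapi:Need to kill Pod" event: the container was stopped from outside rather than crashing on its own. The hypothetical helper below sketches that decoding; a program can also choose to exit with such a number, so treat the result as a hint.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a container exit code (hypothetical helper).

    Uses the shell convention that 128 + N means "terminated by signal N".
    """
    if code == 0:
        return "exited successfully"
    if 128 < code < 256:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by {name} (signal {signum})"
    return f"exited with error status {code}"
```

For example, describe_exit_code(143) returns "killed by SIGTERM (signal 15)", while the back pod's Exit Code 1 reads as an ordinary error status from the application itself.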