Pods stuck in PodInitializing state indefinitely - kubernetes

I've got a k8s CronJob that consists of an init container and one app container. If the init container fails, the main container never gets started, and the Pod stays in "PodInitializing" indefinitely.
My intent is for the job to fail if the init container fails.
---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: job-name
  namespace: default
  labels:
    run: job-name
spec:
  schedule: "15 23 * * *"
  startingDeadlineSeconds: 60
  concurrencyPolicy: "Forbid"
  successfulJobsHistoryLimit: 30
  failedJobsHistoryLimit: 10
  jobTemplate:
    spec:
      # only try twice
      backoffLimit: 2
      activeDeadlineSeconds: 60
      template:
        spec:
          initContainers:
            - name: init-name
              image: init-image:1.0
              restartPolicy: Never
          containers:
            - name: some-name
              image: someimage:1.0
              restartPolicy: Never
A kubectl describe on the pod that's stuck results in:
Name:               job-name-1542237120-rgvzl
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               my-node-98afffbf-0psc/10.0.0.0
Start Time:         Wed, 14 Nov 2018 23:12:16 +0000
Labels:             controller-uid=ID
                    job-name=job-name-1542237120
Annotations:        kubernetes.io/limit-ranger:
                      LimitRanger plugin set: cpu request for container elasticsearch-metrics; cpu request for init container elasticsearch-repo-setup; cpu requ...
Status:             Failed
IP:                 10.0.0.0
Controlled By:      Job/job-1542237120
Init Containers:
  init-container-name:
    Container ID:   docker://ID
    Image:          init-image:1.0
    Image ID:       init-imageID
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 14 Nov 2018 23:12:21 +0000
      Finished:     Wed, 14 Nov 2018 23:12:32 +0000
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:  100m
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-wwl5n (ro)
Containers:
  some-name:
    Container ID:
    Image:          someimage:1.0
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:  100m
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-wwl5n (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True

To try to figure this out, I would run the command:
kubectl get pods (add the namespace param if required)
Then copy the pod name and run:
kubectl describe pod {POD_NAME}
That should give you some information as to why it's stuck in the initializing state.

A Pod can be stuck in Init status for many reasons.
PodInitializing or Init status means that the Pod contains an init container that hasn't finished (init containers are specialized containers that run before app containers in a Pod; they can contain utilities or setup scripts). If the pod's status is Init:0/1, one init container has not finished; Init:N/M means the Pod has M init containers and N have completed so far.
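For illustration, a hedged sketch of how these states show up in kubectl get pods (the pod names, counts, and ages here are made up):
$ kubectl get pods
NAME        READY   STATUS     RESTARTS   AGE
myapp-pod   0/1     Init:0/1   0          2m
other-pod   0/2     Init:1/2   0          45s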
Gathering information
For these scenarios, the best approach is to gather information, since the root cause can be different in every PodInitializing issue.
kubectl describe pods pod-XXX: with this command you can get much information about the pod and check whether there's any meaningful event. Save the init container name.
kubectl logs pod-XXX: this command prints the logs for a container in a pod or specified resource.
kubectl logs pod-XXX -c init-container-xxx: this is the most accurate, as it prints the logs of the init container itself. You can get the init container name by describing the pod, then substitute it for "init-container-XXX" (for example, "copy-default-config").
The output of kubectl logs pod-XXX -c init-container-xxx can reveal meaningful information about the issue.
In the referenced example, the root cause was that the init container couldn't download the plugins from Jenkins (a timeout); from there you can check the connection config, proxy, and DNS, or just modify the YAML to deploy the container without the plugins.
Additional:
kubectl describe node node-XXX describing the pod will give you the name of its node, which you can also inspect with this command.
kubectl get events to list the cluster events.
journalctl -xeu kubelet | tail -n 10 kubelet logs on systemd (journalctl -xeu docker | tail -n 1 for docker).
Solutions
The solution depends on the information gathered once the root cause is found.
When you find a log with an insight into the root cause, you can investigate that specific root cause.
Some examples:
1 > Here, this happened when an init container was deleted; it can be fixed by deleting the pod so it gets recreated, or by redeploying it. Same scenario in 1.1.
2 > If you found "bad address 'kube-dns.kube-system'", the PVC may not have been recycled correctly; the solution provided in 2 is to run /opt/kubernetes/bin/kube-restart.sh.
3 > There, an sh file was not found; the solution would be to modify the YAML file or remove the container if it's unnecessary.
4 > A FailedSync was found, and it was solved by restarting docker on the node.
In general you can modify the YAML (for example, to avoid using an outdated URL), try to recreate the affected resource, or just remove the init container that causes the issue from your deployment. The specific solution, however, will depend on the specific root cause.

I think you may have missed that this is the expected behavior of init containers.
The rule is that in the case of an initContainers failure, a Pod will not restart if restartPolicy is set to Never; otherwise Kubernetes will keep restarting it until it succeeds.
Also:
If the init container fails, the main container never gets started, and the Pod stays in "PodInitializing" indefinitely.
According to the documentation:
A Pod cannot be Ready until all Init Containers have succeeded. The ports on an Init Container are not aggregated under a service. A Pod that is initializing is in the Pending state but should have a condition Initializing set to true.
I can see that you tried to change this behavior, but I am not sure whether you can do that with a CronJob; I have seen examples with Jobs. I am just theorizing here, and if this post did not help you solve your issue, I can try to recreate it in a lab environment.
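As a side note, restartPolicy is a field of the pod spec, not of individual containers, so the placement in your YAML would typically be rejected by validation. A minimal sketch of the jobTemplate with the field where the API expects it (names and images kept from your manifest):
jobTemplate:
  spec:
    backoffLimit: 2
    activeDeadlineSeconds: 60
    template:
      spec:
        restartPolicy: Never  # pod-level; covers init and app containers alike
        initContainers:
          - name: init-name
            image: init-image:1.0
        containers:
          - name: some-name
            image: someimage:1.0
With restartPolicy: Never, a failing init container fails the whole pod, and the Job controller counts that attempt against backoffLimit, which matches your intent of letting the job fail.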

You have already figured out that init containers are meant to run to completion successfully. If you can't get rid of the init containers, what I would do in this case is make sure that the init container ends successfully all the time. The result of the init container can be written to an emptyDir volume, something like a status file, shared by both your init container and your work container.
I would delegate to the work container the responsibility of deciding what to do in case the init container does not end successfully. A sketch of this pattern follows.
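A minimal sketch of that pattern, assuming made-up image names, paths, and commands (do-setup and run-main-task are placeholders for your real workloads):
apiVersion: v1
kind: Pod
metadata:
  name: init-status-demo
spec:
  restartPolicy: Never
  volumes:
    - name: init-status
      emptyDir: {}
  initContainers:
    - name: init
      image: busybox:1.28
      # do-setup is hypothetical; the trailing 'true' guarantees the init
      # container itself always exits 0, so initialization never blocks the pod
      command: ['sh', '-c', 'do-setup; echo $? > /status/init.exit; true']
      volumeMounts:
        - name: init-status
          mountPath: /status
  containers:
    - name: work
      image: busybox:1.28
      # The work container reads the recorded exit code and decides what to do
      command: ['sh', '-c', 'if [ "$(cat /status/init.exit)" != "0" ]; then echo "init failed" >&2; exit 1; fi; run-main-task']
      volumeMounts:
        - name: init-status
          mountPath: /status
This way the pod always leaves PodInitializing, and the failure handling moves into the work container, where it can be observed through normal container logs and exit codes.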

Related

Init container not restarting on pod restart

I have an init container that does some things the main container needs in order to run correctly, like creating some directories, and a liveness probe that may fail if one of these directories is deleted. When the pod is restarted due to a liveness-probe failure, I expect the init container to be restarted as well, but it isn't.
This is what kubernetes documentation says about this:
If the Pod restarts, or is restarted, all init containers must execute again.
https://kubernetes.io/docs/concepts/workloads/pods/init-containers/
The easiest way to demonstrate this behavior was to take the example pod from the k8s documentation, add a liveness probe that always fails, and expect the init container to be restarted; but again, it does not behave as expected.
This is the example I'm working with:
apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
  labels:
    app: myapp
spec:
  restartPolicy: Always
  containers:
    - name: myapp-container
      image: busybox:1.28
      command: ['sh', '-c', 'echo "App started at $(date)" && tail -f /dev/null']
      livenessProbe:
        exec:
          command:
            - sh
            - -c
            - exit 1
        initialDelaySeconds: 1
        periodSeconds: 1
  initContainers:
    - name: myapp-init
      image: busybox:1.28
      command: ['/bin/sh', '-c', 'sleep 5 && echo "Init container started at $(date)"']
The sleep and date commands are there to confirm whether the init container was restarted.
The pod is being restarted:
NAME READY STATUS RESTARTS AGE
pod/myapp-pod 1/1 Running 4 2m57s
But from the logs it's clear that the init container doesn't:
$ k logs pod/myapp-pod myapp-init
Init container started at Thu Jun 16 12:12:03 UTC 2022
$ k logs pod/myapp-pod myapp-container
App started at Thu Jun 16 12:14:20 UTC 2022
I checked it on both v1.19.5 and v1.24.0 kubernetes servers.
The question is how to force the init container to restart on pod restart.
The restart count refers to container restarts, not pod restarts.
Init containers need to run only once in a pod's lifetime, and you need to design your containers with that in mind; you can read this PR, and especially this comment. If you really need the init logic to run again, see the workaround below.
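One hedged workaround, given that init containers run only when a pod object is created: delete the pod and recreate it, so a brand-new pod runs its init containers from scratch (myapp-pod.yaml is assumed to be the saved manifest from above):
kubectl delete pod myapp-pod
kubectl apply -f myapp-pod.yaml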

How to resolve this issue ErrImageNeverPull pod creation status kubernetes

I am creating a pod from an image that resides on the master node. When I create a pod on the master node to be scheduled on the worker node, the pod gets the status ErrImageNeverPull.
apiVersion: v1
kind: Pod
metadata:
  name: cloud-pipe
  labels:
    app: cloud-pipe
spec:
  containers:
    - name: cloud-pipe
      image: cloud-pipeline:latest
      command: ["sleep"]
      args: ["infinity"]
kubectl describe pod details:
Type     Reason             Age                   From               Message
----     ------             ----                  ----               -------
Normal   Scheduled          15m                   default-scheduler  Successfully assigned default/cloud-pipe to knode
Warning  ErrImageNeverPull  5m54s (x49 over 16m)  kubelet            Container image "cloud-pipeline:latest" is not present with pull policy of Never
Warning  Failed             51s (x72 over 16m)    kubelet            Error: ErrImageNeverPull
How do I resolve this issue? Also, does Kubernetes by default look on the worker node for the image to exist? Thanks.
When Kubernetes creates containers, it first looks at local images, and then tries the registry (the Docker registry by default).
You are getting this error because:
your image can't be found locally on your node;
you specified imagePullPolicy: Never, so Kubernetes will never try to download the image from a registry.
You have a few ways of resolving this, but all of them generally amount to getting the image onto the node and tagging it properly.
To get the image onto your node you can:
copy images from one node to another
build the image from an existing Dockerfile
Once you have the image, tag it and specify it in the deployment:
docker tag cloud-pipeline:latest mytest:mytest
apiVersion: v1
kind: Pod
metadata:
  name: cloud-pipe
  labels:
    app: cloud-pipe
spec:
  containers:
    - name: cloud-pipe
      image: mytest:mytest
      imagePullPolicy: Never
      command: ["sleep"]
      args: ["infinity"]
Or you can configure your own local registry, push the tagged image into it, and use imagePullPolicy: IfNotPresent; a sketch follows. More information in @dryairship's answer.
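A hedged sketch of the local-registry route (the port and names are illustrative; registry:2 is the stock Docker registry image):
# Run a throwaway local registry
docker run -d -p 5000:5000 --name registry registry:2
# Tag and push the image into it
docker tag cloud-pipeline:latest localhost:5000/cloud-pipeline:latest
docker push localhost:5000/cloud-pipeline:latest
# Then reference image: localhost:5000/cloud-pipeline:latest
# with imagePullPolicy: IfNotPresent in the pod spec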
Also, be sure to use eval $(minikube docker-env) when building images for imagePullPolicy: Never, in case you are using minikube (you haven't tagged the question with minikube, but it can be helpful); a sketch follows. More information in the Getting "ErrImageNeverPull" in pods question.
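A hedged sketch of that minikube workflow, which points your shell's docker client at minikube's internal daemon so locally built images are visible to the cluster:
# Point docker at minikube's daemon for this shell session
eval $(minikube docker-env)
# Build directly into minikube's image store
docker build -t cloud-pipeline:latest .
# Pods using imagePullPolicy: Never can now find the image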

Kubernetes job vs pod in containerCreating state

I am trying to figure out if there is a way to force a pod that is stuck in the ContainerCreating state (for valid reasons, such as being unable to mount an inaccessible NFS share) to move to a failed state after a specific amount of time.
I have Kubernetes jobs that I'm running through a Jenkins pipeline. I'm using the job state (type: Complete|Failed) to determine the outcome, and then I gather the results of the jobs (kubectl get pods + kubectl logs). It works well as long as the pods go into a known failed state, like ContainerCannotRun or the backoff limit being exceeded, so that the job state goes to failed.
The problem arises when a pod goes into the ContainerCreating state and stays that way; the job state then stays active and never changes. Is there something I can put in the job manifest to force a pod that's stuck in ContainerCreating to move to a failed state after a certain amount of time?
Example:
pod status
- image: myimage
  imageID: ""
  lastState: {}
  name: primary
  ready: false
  restartCount: 0
  state:
    waiting:
      reason: ContainerCreating
hostIP: x.y.z.y
phase: Pending
qosClass: BestEffort
startTime: "2020-05-06T17:09:58Z"
job status
active: 1
startTime: "2020-05-06T17:09:58Z"
Thanks for any input.
As documented here, use activeDeadlineSeconds or backoffLimit.
The activeDeadlineSeconds applies to the duration of the job, no matter how many Pods are created. Once a Job reaches activeDeadlineSeconds, all of its running Pods are terminated and the Job status will become type: Failed with reason: DeadlineExceeded.
Once backoffLimit has been reached the Job will be marked as failed and any running Pods will be terminated.
Note that a Job’s activeDeadlineSeconds takes precedence over its backoffLimit. Therefore, a Job that is retrying one or more failed Pods will not deploy additional Pods once it reaches the time limit specified by activeDeadlineSeconds, even if the backoffLimit is not yet reached.
apiVersion: batch/v1
kind: Job
metadata:
  name: pi-with-timeout
spec:
  backoffLimit: 5
  activeDeadlineSeconds: 100
  template:
    spec:
      containers:
        - name: pi
          image: perl
          command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never

Why the pod status is coming as crashloopbackoff in my case?

I created a pod which just calculates pi and exits, and it should run again. While monitoring it, I observed the status was Running, then turned to Completed, and finally CrashLoopBackOff.
I tried different images, but the issue is the same.
apiVersion: v1
kind: Pod
metadata:
  name: pi
spec:
  containers:
    - name: pi
      image: perl
      command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
I expected Running and Completed statuses in a series, but I see CrashLoopBackOff.
qvamjak@qvamjak:~/Jobs$ kubectl get pods
NAME READY STATUS RESTARTS AGE
pi 0/1 CrashLoopBackOff 16 63m
This is due to the restartPolicy of your pod: by default it is Always, which means the pod expects all its containers to be long-running (e.g., an HTTP server); if any container in the pod exits (even successfully, with code 0), it will be restarted. In this case, you want to set restartPolicy to OnFailure. See https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy.
In addition, Job, rather than Pod, is the resource type you want for run-to-completion applications, because a Job manages pods for you and offers more guarantees than an unmanaged Pod; a sketch follows.
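A minimal sketch of the same pi computation wrapped in a Job (only the kind, metadata, and the pod-template nesting change relative to your Pod):
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    spec:
      # a Job's pod template only allows Never or OnFailure here
      restartPolicy: OnFailure
      containers:
        - name: pi
          image: perl
          command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
Once the container exits with code 0, the Job is marked Complete instead of being restarted.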

My kubernetes pods keep crashing with "CrashLoopBackOff" but I can't find any log

This is what I keep getting:
[root@centos-master ~]# kubectl get pods
NAME READY STATUS RESTARTS AGE
nfs-server-h6nw8 1/1 Running 0 1h
nfs-web-07rxz 0/1 CrashLoopBackOff 8 16m
nfs-web-fdr9h 0/1 CrashLoopBackOff 8 16m
Below is output from describe pods
kubectl describe pods
Events:
  FirstSeen  LastSeen  Count  From                       SubobjectPath         Type     Reason      Message
  ---------  --------  -----  ----                       -------------         ----     ------      -------
  16m        16m       1      {default-scheduler}                              Normal   Scheduled   Successfully assigned nfs-web-fdr9h to centos-minion-2
  16m        16m       1      {kubelet centos-minion-2}  spec.containers{web}  Normal   Created     Created container with docker id 495fcbb06836
  16m        16m       1      {kubelet centos-minion-2}  spec.containers{web}  Normal   Started     Started container with docker id 495fcbb06836
  16m        16m       1      {kubelet centos-minion-2}  spec.containers{web}  Normal   Started     Started container with docker id d56f34ae4e8f
  16m        16m       1      {kubelet centos-minion-2}  spec.containers{web}  Normal   Created     Created container with docker id d56f34ae4e8f
  16m        16m       2      {kubelet centos-minion-2}                        Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "web" with CrashLoopBackOff: "Back-off 10s restarting failed container=web pod=nfs-web-fdr9h_default(461c937d-d870-11e6-98de-005056040cc2)"
I have two pods, nfs-web-07rxz and nfs-web-fdr9h, but if I do kubectl logs nfs-web-07rxz, with or without the -p option, I don't see any log in either pod.
[root@centos-master ~]# kubectl logs nfs-web-07rxz -p
[root@centos-master ~]# kubectl logs nfs-web-07rxz
This is my replicationController yaml file:
apiVersion: v1
kind: ReplicationController
metadata:
  name: nfs-web
spec:
  replicas: 2
  selector:
    role: web-frontend
  template:
    metadata:
      labels:
        role: web-frontend
    spec:
      containers:
        - name: web
          image: eso-cmbu-docker.artifactory.eng.vmware.com/demo-container:demo-version3.0
          ports:
            - name: web
              containerPort: 80
          securityContext:
            privileged: true
My Docker image was made from this simple docker file:
FROM ubuntu
RUN apt-get update
RUN apt-get install -y nginx
RUN apt-get install -y nfs-common
I am running my kubernetes cluster on CentOs-1611, kube version:
[root@centos-master ~]# kubectl version
Client Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.0", GitCommit:"86dc49aa137175378ac7fba7751c3d3e7f18e5fc", GitTreeState:"clean", BuildDate:"2016-12-15T16:57:18Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.0", GitCommit:"86dc49aa137175378ac7fba7751c3d3e7f18e5fc", GitTreeState:"clean", BuildDate:"2016-12-15T16:57:18Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}
If I run the docker image with docker run, I am able to run it without any issue; it's only through Kubernetes that I get the crash.
Can someone help me out? How can I debug this without seeing any log?
As @Sukumar commented, you need to have your Dockerfile specify a command to run, or have your ReplicationController specify one.
The pod is crashing because it starts up and then immediately exits; Kubernetes then restarts it and the cycle continues. See the sketch after the commands below.
# Show details of a specific pod
kubectl describe pod <pod name> -n <namespace-name>
# View logs for a specific pod
kubectl logs <pod name> -n <namespace-name>
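To illustrate the fix, a hedged sketch of the ReplicationController's container spec with an explicit long-running command (nginx in the foreground is an assumption, chosen because the Dockerfile in the question installs nginx):
spec:
  containers:
    - name: web
      image: eso-cmbu-docker.artifactory.eng.vmware.com/demo-container:demo-version3.0
      # Without this (or a CMD in the Dockerfile) the container has nothing
      # to run, exits immediately, and crash-loops
      command: ["nginx", "-g", "daemon off;"]
      ports:
        - name: web
          containerPort: 80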
If you have an application that takes longer to bootstrap, it could be related to the initial values of the readiness/liveness probes. I solved my problem by increasing initialDelaySeconds to 120s, as my Spring Boot application deals with a lot of initialization. The documentation does not mention the default of 0 (https://kubernetes.io/docs/api-reference/v1.9/#probe-v1-core).
service:
  livenessProbe:
    httpGet:
      path: /health/local
      scheme: HTTP
      port: 8888
    initialDelaySeconds: 120
    periodSeconds: 5
    timeoutSeconds: 5
    failureThreshold: 10
  readinessProbe:
    httpGet:
      path: /admin/health
      scheme: HTTP
      port: 8642
    initialDelaySeconds: 150
    periodSeconds: 5
    timeoutSeconds: 5
    failureThreshold: 10
A very good explanation about those values is given by What is the default value of initialDelaySeconds.
The health or readiness check algorithm works like this:
wait for initialDelaySeconds
perform the check and wait timeoutSeconds for a timeout
if the number of consecutive successes is greater than successThreshold, return success
if the number of consecutive failures is greater than failureThreshold, return failure; otherwise wait periodSeconds and start a new check
With the liveness values above, for example, a container whose endpoint never comes up gets roughly initialDelaySeconds + failureThreshold × periodSeconds = 120 + 10 × 5 = 170 seconds before its first restart.
In my case, my application now bootstraps comfortably within these windows, so I know I will not get periodic CrashLoopBackOff from probes that would otherwise fire right at the limit of those rates.
I needed to keep a pod running for subsequent kubectl exec calls, and as the comments above pointed out, my pod was getting killed by my k8s cluster because it had finished running all its tasks. I managed to keep my pod running by simply starting it with a command that never stops on its own, as in:
kubectl run YOUR_POD_NAME -n YOUR_NAMESPACE --image SOME_PUBLIC_IMAGE:latest --command -- tail -f /dev/null
My pod kept crashing and I was unable to find the cause. Luckily, there is a place where Kubernetes saves all the events that occurred before my pod crashed.
To see these events, sorted by timestamp, run the command:
kubectl get events --sort-by=.metadata.creationTimestamp
(add a --namespace mynamespace argument to the command if needed)
The events shown in the output of the command showed me why my pod kept crashing.
From this page: the container dies after running everything correctly but then crashes because all the commands have ended. Either you make your services run in the foreground, or you create a keep-alive script. By doing so, Kubernetes will show that your application is running. Note that in a plain Docker environment this problem is not encountered; it is only Kubernetes that expects a continuously running app.
Update (an example):
Here's how to avoid CrashLoopBackOff, when launching a Netshoot container:
kubectl run netshoot --image nicolaka/netshoot -- sleep infinity
In your yaml file, add command and args lines:
...
containers:
  - name: api
    image: localhost:5000/image-name
    command: [ "sleep" ]
    args: [ "infinity" ]
...
Works for me.
I observed the same issue and added the command and args block to my YAML file. I am copying a sample of my YAML file for reference:
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: ubuntu
  name: ubuntu
  namespace: default
spec:
  containers:
    - image: gcr.io/ow/hellokubernetes/ubuntu
      imagePullPolicy: Never
      name: ubuntu
      resources:
        requests:
          cpu: 100m
      command: ["/bin/sh"]
      args: ["-c", "while true; do echo hello; sleep 10;done"]
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
As mentioned in the posts above, the container exits right after creation if it has no long-running command.
If you want to test this without using a YAML file, you can pass the sleep command to the kubectl create deployment statement. The double hyphen -- indicates a command, and is the equivalent of command: in a Pod or Deployment YAML file.
The below command creates a deployment for debian with sleep 1234, so it doesn't exit immediately.
kubectl create deployment deb --image=debian:buster-slim -- "sh" "-c" "while true; do sleep 1234; done"
You can then create a service, etc., or, to test the container, you can kubectl exec -it <pod-name> -- sh (or -- bash) into the container you just created.
I solved this problem by increasing the memory resources:
resources:
  limits:
    cpu: 1
    memory: 1Gi
  requests:
    cpu: 100m
    memory: 250Mi
In my case the problem was what Steve S. mentioned:
The pod is crashing because it starts up then immediately exits, thus Kubernetes restarts and the cycle continues.
Namely I had a Java application whose main threw an exception (and something overrode the default uncaught exception handler so that nothing was logged). The solution was to put the body of main into try { ... } catch and print out the exception. Thus I could find out what was wrong and fix it.
(Another cause could be something in the app calling System.exit; you could use a custom SecurityManager with an overridden checkExit to prevent (or log the caller of) exit; see https://stackoverflow.com/a/5401319/204205.)
While troubleshooting the same issue, I found no logs when using kubectl logs <pod_id>.
Therefore I SSHed into the node instance to try running the container using plain docker. To my surprise, this failed as well.
When entering the container with:
docker exec -it faulty:latest /bin/sh
and poking around, I found that it wasn't the latest version: a faulty version of the docker image was already present on the instance.
When I removed the faulty:latest instance with:
docker rmi faulty:latest
everything started to work.
I had the same issue, and I finally resolved it. I am not using a docker-compose file.
I just added this line to my Dockerfile, and it worked:
ENV CI=true
Reference:
https://github.com/GoogleContainerTools/skaffold/issues/3882
Try rerunning the pod and running
kubectl get pods --watch
to watch the status of the pod as it progresses.
In my case, I would only see the end result, 'CrashLoopBackOff', even though the docker container ran fine locally. So I watched the pods using the above command, and I saw the container briefly progress into an OOMKilled state, which told me it required more memory.
In my case this error was specific to the hello-world docker image. I used the nginx image instead of the hello-world image and the error was resolved.
I solved this problem by removing the space between the quotes and the command value inside the array; this happened because the container exited right after starting, with no executable command present to run inside the container. The corrected array, with a container-spec sketch after it:
['sh', '-c', 'echo Hello Kubernetes! && sleep 3600']
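For context, a hedged sketch of that array inside a full container spec (busybox:1.28 is an illustrative image):
apiVersion: v1
kind: Pod
metadata:
  name: hello
spec:
  containers:
    - name: hello
      image: busybox:1.28
      # each array element is exactly one argv entry; a stray space inside a
      # quoted element (e.g. 'sh ' or ' -c') changes what the kubelet executes
      command: ['sh', '-c', 'echo Hello Kubernetes! && sleep 3600']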
I had a similar issue, but it got solved when I corrected my zookeeper.yaml file, which had the service name mismatched with the deployment's container names. It got resolved by making them the same:
apiVersion: v1
kind: Service
metadata:
  name: zk1
  namespace: nbd-mlbpoc-lab
  labels:
    app: zk-1
spec:
  ports:
    - name: client
      port: 2181
      protocol: TCP
    - name: follower
      port: 2888
      protocol: TCP
    - name: leader
      port: 3888
      protocol: TCP
  selector:
    app: zk-1
---
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: zk-deployment
  namespace: nbd-mlbpoc-lab
spec:
  template:
    metadata:
      labels:
        app: zk-1
    spec:
      containers:
        - name: zk1
          image: digitalwonderland/zookeeper
          ports:
            - containerPort: 2181
          env:
            - name: ZOOKEEPER_ID
              value: "1"
            - name: ZOOKEEPER_SERVER_1
              value: zk1
In my case, the issue was a misconstrued list of command-line arguments. I was doing this in my deployment file:
...
args:
  - "--foo 10"
  - "--bar 100"
Instead of the correct approach:
...
args:
  - "--foo"
  - "10"
  - "--bar"
  - "100"
I finally found the cause when I executed a 'docker run xxx' command and got the error there: it was caused by an incomplete platform (the image did not match the node's platform).
It seems there can be many reasons why a Pod ends up in the CrashLoopBackOff state.
In my case, one of the containers was terminating continuously due to a missing environment value.
So the best way to debug is to:
1. check the pod description output, i.e. kubectl describe pod abcxxx
2. check the events generated for the pod, i.e. kubectl get events | grep abcxxx
3. check whether endpoints have been created for the pod, i.e. kubectl get ep
4. check whether dependent resources are in place, e.g. CRDs, ConfigMaps, or any other resource that may be required.
kubectl logs -f POD will only produce logs from a running container. Append --previous to the command to get logs from a previous container; this is used mainly for debugging. Hope this helps.