Kubernetes Pod Stuck in Pending Without Indicating Any Reason

We are using client-go to create Kubernetes Jobs and Deployments. Today, in one of our clusters (Kubernetes v1.18.19), I ran into the following strange problem.
Pods of a Kubernetes Job are always stuck in Pending status, without any stated reason. kubectl describe pod shows no events. Creating Jobs from the host (via kubectl) works normally, and those pods eventually reach Running.
What surprises me is that creating Deployments works fine, and their pods eventually reach Running. Only Kubernetes Jobs fail. Why? How can I fix this? I have spent hours on it without making progress.
kubeconfig used by client-go:
Mounted from the host machine, path: /root/.kube/config
kubectl describe job shows:
Name: unittest
Namespace: default
Selector: controller-uid=f3cec901-c0f4-4098-86d7-f9a7d1fe6cd1
Labels: job-id=unittest
Annotations: <none>
Parallelism: 1
Completions: 1
Start Time: Sat, 19 Jun 2021 00:20:12 +0800
Pods Statuses: 1 Running / 0 Succeeded / 0 Failed
Pod Template:
Labels: controller-uid=f3cec901-c0f4-4098-86d7-f9a7d1fe6cd1
job-name=unittest
Containers:
unittest:
Image: ubuntu:18.04
Port: <none>
Host Port: <none>
Command:
echo hello
Environment: <none>
Mounts: <none>
Volumes: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 21m job-controller Created pod: unittest-tt5b2
kubectl describe on the target pod shows:
Name: unittest-tt5b2
Namespace: default
Priority: 0
Node: <none>
Labels: controller-uid=f3cec901-c0f4-4098-86d7-f9a7d1fe6cd1
job-name=unittest
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: Job/unittest
Containers:
unittest:
Image: ubuntu:18.04
Port: <none>
Host Port: <none>
Command:
echo hello
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-72g27 (ro)
Volumes:
default-token-72g27:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-72g27
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events: <none>
kubectl get events shows:
55m Normal ScalingReplicaSet deployment/job-scheduler Scaled up replica set job-scheduler-76b7465d74 to 1
19m Normal ScalingReplicaSet deployment/job-scheduler Scaled up replica set job-scheduler-74f8896f48 to 1
58m Normal SuccessfulCreate job/unittest Created pod: unittest-pp665
49m Normal SuccessfulCreate job/unittest Created pod: unittest-xm6ck
17m Normal SuccessfulCreate job/unittest Created pod: unittest-tt5b2

I fixed the issue.
We use a custom scheduler for NPU devices and the default scheduler for GPU devices. For GPU devices, the scheduler name is "default-scheduler", not "default". I passed "default" as the scheduler name for those Jobs, so no running scheduler ever picked up the pods. That is why they stayed in Pending with no events.
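A pre-submit check along these lines would catch this kind of mismatch. It is a minimal sketch written in Python for brevity, not the client-go code from the project: the function name scheduler_problem, the set KNOWN_SCHEDULERS, and the name "npu-scheduler" are made up for illustration, and the pod spec is modeled as a plain dict.

```python
# Scheduler names assumed to be running in the cluster.
# "npu-scheduler" is a hypothetical name for the custom scheduler.
KNOWN_SCHEDULERS = frozenset({"default-scheduler", "npu-scheduler"})


def scheduler_problem(pod_spec, known=KNOWN_SCHEDULERS):
    """Return a warning if no known scheduler would claim this pod, else None."""
    # Kubernetes fills in "default-scheduler" when schedulerName is omitted.
    name = pod_spec.get("schedulerName") or "default-scheduler"
    if name in known:
        return None
    return (
        f"schedulerName {name!r} matches no running scheduler; "
        "the pod would stay Pending with no events"
    )


if __name__ == "__main__":
    print(scheduler_problem({"schedulerName": "default"}))
    print(scheduler_problem({"schedulerName": "default-scheduler"}))
```

The list of known names has to be maintained by hand here; the point is only that a name like "default" is rejected before the Job is created instead of silently producing an unschedulable pod.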

Related

CrashLoopBackOff : Back-off restarting failed container for flask application

I am a beginner in Kubernetes and was trying to deploy my Flask application following this guide: https://medium.com/analytics-vidhya/build-a-python-flask-app-and-deploy-with-kubernetes-ccc99bbec5dc
I have successfully built a Docker image and pushed it to Docker Hub: https://hub.docker.com/repository/docker/beatrix1997/kubernetes_flask_app
However, I am having trouble debugging a pod.
This is my YAML file:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kubernetesflaskapp-deploy
  labels:
    app: kubernetesflaskapp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kubernetesflaskapp
  template:
    metadata:
      labels:
        app: kubernetesflaskapp
    spec:
      containers:
      - name: kubernetesflaskapp
        image: beatrix1997/kubernetes_flask_app
        ports:
        - containerPort: 5000
And this is the description of the pod:
Name: kubernetesflaskapp-deploy-5764bbbd44-8696k
Namespace: default
Priority: 0
Node: minikube/192.168.49.2
Start Time: Fri, 20 May 2022 11:26:33 +0100
Labels: app=kubernetesflaskapp
pod-template-hash=5764bbbd44
Annotations: <none>
Status: Running
IP: 172.17.0.12
IPs:
IP: 172.17.0.12
Controlled By: ReplicaSet/kubernetesflaskapp-deploy-5764bbbd44
Containers:
kubernetesflaskapp:
Container ID: docker://d500dc15e389190670a9273fea1d70e6bd6ab2e7053bd2480d114ad6150830f1
Image: beatrix1997/kubernetes_flask_app
Image ID: docker-pullable://beatrix1997/kubernetes_flask_app@sha256:1bfa98229f55b04f32a6b85d72860886abcc0f17295b14e173151a8e4b0f0334
Port: 5000/TCP
Host Port: 0/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 20 May 2022 11:58:38 +0100
Finished: Fri, 20 May 2022 11:58:38 +0100
Ready: False
Restart Count: 11
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zq8n7 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-zq8n7:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 33m default-scheduler Successfully assigned default/kubernetesflaskapp-deploy-5764bbbd44-8696k to minikube
Normal Pulled 33m kubelet Successfully pulled image "beatrix1997/kubernetes_flask_app" in 14.783413947s
Normal Pulled 33m kubelet Successfully pulled image "beatrix1997/kubernetes_flask_app" in 1.243534487s
Normal Pulled 32m kubelet Successfully pulled image "beatrix1997/kubernetes_flask_app" in 1.373217701s
Normal Pulling 32m (x4 over 33m) kubelet Pulling image "beatrix1997/kubernetes_flask_app"
Normal Created 32m (x4 over 33m) kubelet Created container kubernetesflaskapp
Normal Pulled 32m kubelet Successfully pulled image "beatrix1997/kubernetes_flask_app" in 1.239794774s
Normal Started 32m (x4 over 33m) kubelet Started container kubernetesflaskapp
Warning BackOff 3m16s (x138 over 33m) kubelet Back-off restarting failed container
I am using ubuntu as my OS if it matters at all.
Any help would be appreciated!
Many thanks!

I would check the following:
Check whether your Docker image works in plain Docker. You can start it with the docker run command (see the official Docker documentation).
If it doesn't work there, find out what is wrong in your app first.
If it does, check the readiness and liveness probes (see the official Kubernetes documentation).
You can find more hints about failing pods in the documentation on debugging pods.

The error may be caused by an issue in the application, since the reported reason is "Back-off restarting failed container". Please add the following logs to the question for further clarification (add --previous to get the output of the last terminated container):
kubectl logs -n <NS> <pod-name>
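The describe output in the question already narrows things down: the last state is Terminated with reason Completed and exit code 0, so the process finished without an error, and it was restarted anyway because containers in a Deployment are always restarted. The sketch below spells out that reading. The function name and the message texts are hypothetical, and it only separates the clean-exit case from everything else.

```python
def triage_last_state(reason, exit_code):
    """Suggest where to look, given a container's last terminated state.

    `reason` and `exit_code` are the "Reason" and "Exit Code" fields under
    "Last State: Terminated" in the output of kubectl describe pod.
    """
    if reason == "Completed" and exit_code == 0:
        # The process ended on its own without an error. A Deployment
        # restarts it regardless, which shows up as CrashLoopBackOff.
        return "clean exit: check that the container command keeps running in the foreground"
    # Any other combination: the process failed, so the logs are the next stop.
    return "failure: inspect the output of kubectl logs --previous for the error"


if __name__ == "__main__":
    print(triage_last_state("Completed", 0))
```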

A question about pods running on the Kubernetes (k8s) platform: the pods are running but the containers are not ready

I built a k8s cluster on my virtual machines (CentOS 7) with VirtualBox:
kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k8s-master Ready control-plane,master 8d v1.21.2 192.168.0.186 <none> CentOS Linux 7 (Core) 3.10.0-1160.31.1.el7.x86_64 docker://20.10.7
k8s-worker01 Ready <none> 8d v1.21.2 192.168.0.187 <none> CentOS Linux 7 (Core) 3.10.0-1160.31.1.el7.x86_64 docker://20.10.7
k8s-worker02 Ready <none> 8d v1.21.2 192.168.0.188 <none> CentOS Linux 7 (Core) 3.10.0-1160.31.1.el7.x86_64 docker://20.10.7
Several days ago I started some pods in the default namespace with a ReplicaSet.
They all worked fine at first, and then I shut down the VMs.
Today, after I restarted the VMs, I found that the pods are not working properly anymore:
kubectl get all
NAME READY STATUS RESTARTS AGE
pod/dnsutils 1/1 Running 3 5d13h
pod/kubapp-6qbfz 0/1 Running 0 5d13h
pod/kubapp-d887h 0/1 Running 0 5d13h
pod/kubapp-z6nw7 0/1 Running 0 5d13h
NAME DESIRED CURRENT READY AGE
replicaset.apps/kubapp 3 3 0 5d13h
Then I deleted the ReplicaSet and re-created it to create the pods again.
I ran this command to get more information:
[root@k8s-master ch04]# kubectl describe po kubapp-z887v
Name: kubapp-d887h
Namespace: default
Priority: 0
Node: k8s-worker02/192.168.0.188
Start Time: Fri, 23 Jul 2021 15:55:16 +0000
Labels: app=kubapp
Annotations: cni.projectcalico.org/podIP: 10.244.69.244/32
cni.projectcalico.org/podIPs: 10.244.69.244/32
Status: Running
IP: 10.244.69.244
IPs:
IP: 10.244.69.244
Controlled By: ReplicaSet/kubapp
Containers:
kubapp:
Container ID: docker://fc352ce4c6a826f2cf108f9bb9a335e3572509fd5ae2002c116e2b080df5ee10
Image: evalle/kubapp
Image ID: docker-pullable://evalle/kubapp@sha256:560c9c50b1d894cf79ac472a9925dc795b116b9481ec40d142b928a0e3995f4c
Port: <none>
Host Port: <none>
State: Running
Started: Fri, 23 Jul 2021 15:55:21 +0000
Ready: False
Restart Count: 0
Readiness: exec [ls /var/ready] delay=0s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m9rwr (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-m9rwr:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 30m default-scheduler Successfully assigned default/kubapp-d887h to k8s-worker02
Normal Pulling 30m kubelet Pulling image "evalle/kubapp"
Normal Pulled 30m kubelet Successfully pulled image "evalle/kubapp" in 4.049160061s
Normal Created 30m kubelet Created container kubapp
Normal Started 30m kubelet Started container kubapp
Warning Unhealthy 11s (x182 over 30m) kubelet Readiness probe failed: ls: cannot access /var/ready: No such file or directory
I don't know what is happening or what I should do to fix it.
So here I am, asking you for help.
I am a k8s newbie, so please give me a hand.
Thanks to paul-becotte for the help and recommendation. I think I should post the definition of the pods:
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  # here is the name of the replication controller (RC)
  name: kubapp
spec:
  replicas: 3
  # what pods the RC is operating on
  selector:
    matchLabels:
      app: kubapp
  # the pod template for creating new pods
  template:
    metadata:
      labels:
        app: kubapp
    spec:
      containers:
      - name: kubapp
        image: evalle/kubapp
        readinessProbe:
          exec:
            command:
            - ls
            - /var/ready
This is an example YAML definition from https://github.com/Evalle/k8s-in-action/blob/master/Chapter_4/kubapp-rs.yaml.
I don't know where to find the Dockerfile of the image evalle/kubapp,
and I don't know whether the image contains /var/ready.

Look at your event:
Warning Unhealthy 11s (x182 over 30m) kubelet Readiness probe failed: ls: cannot access /var/ready: No such file or directory
Your readiness probe is failing. It checks for the existence of a file at /var/ready.
Your next step is to ask: does that make sense? Is my container actually going to write a file at /var/ready when it is ready? If so, look at the logs from your pod and figure out why it is not writing the file. If it is NOT the correct check, look at the YAML you used to create your pod, Deployment, or ReplicaSet, and replace that check with something that does make sense.
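To make the mechanism concrete: an exec readiness probe runs a command inside the container and counts exit code 0 as ready. The sketch below imitates that locally with Python's subprocess module. It is an illustration only, not how the kubelet is implemented. The function name exec_probe_succeeds is made up, the 1-second timeout mirrors timeout=1s from the describe output, a temporary file stands in for /var/ready, and a Unix-like system with ls available is assumed.

```python
import os
import subprocess
import tempfile


def exec_probe_succeeds(command, timeout=1.0):
    """Run `command` and report True only if it exits with code 0 in time."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung or unstartable command counts as a failed probe.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        marker = os.path.join(tmp, "ready")  # stand-in for /var/ready
        print(exec_probe_succeeds(["ls", marker]))  # marker missing
        open(marker, "w").close()  # the app "becomes ready"
        print(exec_probe_succeeds(["ls", marker]))  # marker present
```

Until something inside the container creates the file, every probe run fails in the same way, which matches the x182 repeat count in the event.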

Kubernetes CronJob Not exited

I am running a CronJob in Kubernetes. The CronJob starts but never exits. The pod status is always Running.
Below are the logs:
kubectl get pods
cronjob-1623253800-xnwwx 1/1 Running 0 13h
When I describe the Job, I see the following:
kubectl describe job cronjob-1623300120
Name: cronjob-1623300120
Namespace: cronjob
Selector: xxxxx
Labels: xxxxx
Annotations: <none>
Controlled By: CronJob/cronjob
Parallelism: 1
Completions: 1
Start Time: Thu, 9 Jun 2021 10:12:03 +0530
Pods Statuses: 1 Running / 0 Succeeded / 0 Failed
Pod Template:
Labels: app=cronjob
controller-xxxx
job-name=cronjob-1623300120
Containers:
plannercronjob:
Image: xxxxxxxxxxxxx
Port: <none>
Host Port: <none>
Mounts: <none>
Volumes: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 13h job-controller Created pod: cronjob-1623300120
I noticed Pods Statuses: 1 Running / 0 Succeeded / 0 Failed. Does this mean the Job is only marked Succeeded or Failed once my code exits and returns a status (zero for success)? Is that correct?
When I enter the pod using the exec command:
kubectl exec --stdin --tty cronjob-1623253800-xnwwx -n cronjob -- /bin/bash
root@cronjob-1623253800-xnwwx:/# ps ax| grep python
1 ? Ssl 0:01 python -m sfit.src.app
18 pts/0 S+ 0:00 grep python
I found that the Python process is still running. Is this a code issue such as a deadlock, or something else?
Output of kubectl describe pod:
Name: cronjob-1623302220-xnwwx
Namespace: default
Priority: 0
Node: aks-agentpool-xxxxvmss000000/10.240.0.4
Start Time: Thu, 9 Jun 2021 10:47:02 +0530
Labels: app=cronjob
controller-uid=xxxxxx
job-name=cronjob-1623302220
Annotations: <none>
Status: Running
IP: 10.244.1.30
IPs:
IP: 10.244.1.30
Controlled By: Job/cronjob-1623302220
Containers:
plannercronjob:
Container ID: docker://xxxxxxxxxxxxxxxx
Image: xxxxxxxxxxx
Image ID: docker-xxxx
Port: <none>
Host Port: <none>
State: Running
Started: Thu, 9 Jun 2021 10:47:06 +0530
Ready: True
Restart Count: 0
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-97xzv (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
default-token-97xzv:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-97xzv
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 13h default-scheduler Successfully assigned cronjob/cronjob-1623302220-xnwwx to aks-agentpool-xxx-vmss000000
Normal Pulling 13h kubelet, aks-agentpool-xxx-vmss000000 Pulling image "xxxx.azurecr.io/xxx:1.1.1"
Normal Pulled 13h kubelet, aks-agentpool-xxx-vmss000000 Successfully pulled image "xxx.azurecr.io/xx:1.1.1"
Normal Created 13h kubelet, aks-agentpool-xxx-vmss000000 Created container cronjob
Normal Started 13h kubelet, aks-agentpool-xxx-vmss000000 Started container cronjob
@KrishnaChaurasia: I ran the Docker image on my own system. There is an error in my Python code, and under Docker the process exits with that error. In Kubernetes, however, it does not exit and never stops.
docker run xxxxx/cronjob:1
File "/usr/local/lib/python3.8/site-packages/azure/core/pipeline/transport/_requests_basic.py", line 261, in send
raise error
azure.core.exceptions.ServiceRequestError: <urllib3.connection.HTTPSConnection object at 0x7f113f6480a0>: Failed to establish a new connection: [Errno -2] Name or service not known
echo $?
1

If your pod is always running and never completes, try adding startingDeadlineSeconds.
https://medium.com/@hengfeng/what-does-kubernetes-cronjobs-startingdeadlineseconds-exactly-mean-cc2117f9795f

K8s tutorial fails on my local installation with i/o timeout

I'm working on a local Kubernetes installation with three nodes. They are installed via the geerlingguy/kubernetes Ansible role (with default settings). I've recreated all the VMs multiple times. I am following the Kubernetes tutorials at https://kubernetes.io/docs/tutorials/kubernetes-basics/explore/explore-interactive/ to get services up and running inside the cluster, and I am now trying to reach them.
# kubectl get nodes
NAME STATUS ROLES AGE VERSION
enceladus Ready <none> 162m v1.17.9
mimas Ready <none> 162m v1.17.9
titan Ready master 162m v1.17.9
I tried it with 1.17.9 and 1.18.6, with https://github.com/geerlingguy/ansible-role-kubernetes and https://github.com/kubernetes-sigs/kubespray, on fresh Debian Buster VMs. I tried it with the Flannel and Calico network plugins. There is no firewall configured.
I can deploy kubernetes-bootcamp and exec into it, but when I try to reach the pod via kubectl proxy and curl, I get an error.
# kubectl create deployment kubernetes-bootcamp --image=gcr.io/google-samples/kubernetes-bootcamp:v1
# kubectl describe pods
Name: kubernetes-bootcamp-69fbc6f4cf-nq4tj
Namespace: default
Priority: 0
Node: enceladus/192.168.10.12
Start Time: Thu, 06 Aug 2020 10:53:34 +0200
Labels: app=kubernetes-bootcamp
pod-template-hash=69fbc6f4cf
Annotations: <none>
Status: Running
IP: 10.244.1.4
IPs:
IP: 10.244.1.4
Controlled By: ReplicaSet/kubernetes-bootcamp-69fbc6f4cf
Containers:
kubernetes-bootcamp:
Container ID: docker://77eae93ca1e6b574ef7b0623844374a5b2f3054075025492b708b23fc3474a45
Image: gcr.io/google-samples/kubernetes-bootcamp:v1
Image ID: docker-pullable://gcr.io/google-samples/kubernetes-bootcamp@sha256:0d6b8ee63bb57c5f5b6156f446b3bc3b3c143d233037f3a2f00e279c8fcc64af
Port: <none>
Host Port: <none>
State: Running
Started: Thu, 06 Aug 2020 10:53:35 +0200
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-kkcvk (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
default-token-kkcvk:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-kkcvk
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 10s default-scheduler Successfully assigned default/kubernetes-bootcamp-69fbc6f4cf-nq4tj to enceladus
Normal Pulled 9s kubelet, enceladus Container image "gcr.io/google-samples/kubernetes-bootcamp:v1" already present on machine
Normal Created 9s kubelet, enceladus Created container kubernetes-bootcamp
Normal Started 9s kubelet, enceladus Started container kubernetes-bootcamp
Update service list
# kubectl get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 4d20h
I can exec curl inside the deployment. It is running.
# kubectl exec -ti kubernetes-bootcamp-69fbc6f4cf-nq4tj curl http://localhost:8080/
Hello Kubernetes bootcamp! | Running on: kubernetes-bootcamp-69fbc6f4cf-nq4tj | v=1
But when I try to curl from the master node, the request fails:
curl http://localhost:8001/api/v1/namespaces/default/pods/kubernetes-bootcamp-69fbc6f4cf-nq4tj/proxy/
Error trying to reach service: 'dial tcp 10.244.1.4:80: i/o timeout'
The curl call itself takes about 30 seconds to return. The version endpoint is available, so the proxy is running fine.
# curl http://localhost:8001/version
{
"major": "1",
"minor": "17",
"gitVersion": "v1.17.9",
"gitCommit": "4fb7ed12476d57b8437ada90b4f93b17ffaeed99",
"gitTreeState": "clean",
"buildDate": "2020-07-15T16:10:45Z",
"goVersion": "go1.13.9",
"compiler": "gc",
"platform": "linux/amd64"
}
In the tutorial, the kubectl describe pods output shows that the container has open ports (in my case it's <none>):
Port: 8080/TCP
Host Port: 0/TCP
OK, I then created a manifest file, bootcamp.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kubernetes-bootcamp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kubernetes-bootcamp
  template:
    metadata:
      labels:
        app: kubernetes-bootcamp
    spec:
      containers:
      - name: kubernetes-bootcamp
        image: gcr.io/google-samples/kubernetes-bootcamp:v1
        ports:
        - containerPort: 8080
          protocol: TCP
I removed the previous deployment
# kubectl delete deployments.apps kubernetes-bootcamp --force
# kubectl apply -f bootcamp.yaml
But after that I still get the same i/o timeout with the new deployment.
So, what is my problem?
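One detail in the error above: the API server dialed 10.244.1.4:80, while the app answers on port 8080 inside the pod. The pod proxy path accepts an explicit port in the form <pod-name>:<port>. The helper below only builds such a URL. Its name and parameters are made up for illustration, and naming the port would not by itself cure a timeout caused by the master being unable to reach the pod network.

```python
def pod_proxy_url(pod, port, namespace="default", base="http://localhost:8001"):
    """Build the kubectl-proxy URL for one pod, naming the target port explicitly."""
    return f"{base}/api/v1/namespaces/{namespace}/pods/{pod}:{port}/proxy/"


if __name__ == "__main__":
    # Pod name and proxy address are the ones used in the question.
    print(pod_proxy_url("kubernetes-bootcamp-69fbc6f4cf-nq4tj", 8080))
```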

Pods are in Pending state

My pods are staying in Pending state. As the other answers suggested, I looked at the describe output, but it gives no clue as to why they stay Pending:
k8s@k8s-master:~/deployment$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-master Ready master 12d v1.12.2
k8s-node-1 Ready <none> 12d v1.12.2
k8s-node-2 Ready <none> 12d v1.12.2
k8s@k8s-master:~/deployment$ kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx 0/1 Pending 0 62m
webserver 0/1 Pending 0 13m
k8s@k8s-master:~/deployment$ kubectl describe pod webserver
Name: webserver
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: <none>
Labels: creator=rithin
Annotations: <none>
Status: Pending
IP:
Containers:
apache:
Image: httpd
Port: <none>
Host Port: <none>
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-vdpls (ro)
Volumes:
default-token-vdpls:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-vdpls
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events: <none>
I already tried describing the pods, but that gave no information.

Your pods require manual scheduling.
In the YAML file for your pods, add
nodeName: k8s-master
at the same level as containers, under spec.
Your pods will then be scheduled on the k8s-master node. To schedule them on another node, replace "k8s-master" with the appropriate node name.
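As a sketch of where the field goes, here is the same change applied to a pod manifest held as a Python dict. The helper name pin_to_node is made up for illustration; the pod name, label, container name, and image are taken from the describe output in the question.

```python
import copy


def pin_to_node(pod_manifest, node_name):
    """Return a copy of the manifest with spec.nodeName set, bypassing the scheduler."""
    pinned = copy.deepcopy(pod_manifest)
    # nodeName sits directly under spec, next to containers.
    pinned.setdefault("spec", {})["nodeName"] = node_name
    return pinned


webserver = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "webserver", "labels": {"creator": "rithin"}},
    "spec": {"containers": [{"name": "apache", "image": "httpd"}]},
}

if __name__ == "__main__":
    print(pin_to_node(webserver, "k8s-master")["spec"])
```

Setting nodeName skips the scheduler entirely, so it works around a scheduling problem without explaining it.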

One possibility is that the worker nodes are not reachable from the master node, since no node has been assigned to the pod.

Well, I couldn't find any logs related to the failure, so I recreated the cluster, and now it is working. I assume it was a problem with flannel.
