Iguazio job is stuck on 'Pending' status - mlops

I have a job I am running in Iguazio. It starts and then the status is "Pending" and the icon is blue. It stays like this indefinitely and there is nothing in the logs that describes what is going on. How do I fix this?

A job stuck in this status is usually a Kubernetes issue. The reason there is no logs in the Iguazio dashboard for the job is because the pod never started, which is where the logs come from. You can navigate to the web shell / Jupyter service in Iguazio and use kubectl commands to find out what is going on in Kubernetes. Usually, I see this when there is an issue with the docker image for the pod, it either can’t be found or has bugs.
In a terminal: doing kubectl get pods and find your pod. It usually has ImagePullBackOff, or CrashLoopBackOff or some similar error. Check the docker image which is usually the culprit. You can kill the pod in Kubernetes, which in turn will error the job out. You can also “abort” the job from the menu in the dashboard under that specific job.

Related

Kubernetes: view logs of crashed Airflow worker pod

Pods on our k8s cluster are scheduled with Airflow's KubernetesExecutor, which runs all Tasks in a new pod.
I have a such a Task for which the pod instantly (after 1 or 2 seconds) crashes, and for which of course I want to see the logs.
This seems hard. As soon the pod crashes, it gets deleted, along with the ability to retrieve crash logs. I already tried all of:
kubectl logs -f <pod> -p: cannot be used since these pods are named uniquely
(courtesy of KubernetesExecutor).
kubectl logs -l label_name=label_value: I
struggle to apply the labels to the pod (if this is a known/used way of working, I'm happy to try further)
An shared nfs is mounted on all pods on a fixed log directory. The failing pod however, does not log to this folder.
When I am really quick I run kubectl logs -f -l dag_id=sample_dag --all-containers (dag_idlabel is added byAirflow)
between running and crashing and see Error from server (BadRequest): container "base" in pod "my_pod" is waiting to start: ContainerCreating. This might give me some clue but:
these are only but the last log lines
this is really backwards
I'm basically looking for the canonical way of retrieving logs from transient pods
You need to enable remote logging. Code sample below is for using S3. In airflow.cfg set the following:
remote_logging = True
remote_log_conn_id = my_s3_conn
remote_base_log_folder = s3://airflow/logs
The my_s3_conn can be set in airflow>Admin>Connections. In the Conn Type dropdown, select S3.

Kubernetes Deployment/Pod/Container statuses

I am currently working on a monitoring service that will monitor Kubernetes' deployments and their pods. I want to notify users when a deployment is not running the expected amount of replicas and also when pods' containers restart unexpectedly. This may not be the right things to monitor and I would greatly appreciate some feedback on what I should be monitoring.
Anyways, the main question is the differences between all of the Statuses of pods. And when I say Statuses I mean the Status column when running kubectl get pods. The statuses in question are:
- ContainerCreating
- ImagePullBackOff
- Pending
- CrashLoopBackOff
- Error
- Running
What causes pod/containers to go into these states?
For the first four Statuses, are these states recoverable without user interaction?
What is the threshold for a CrashLoopBackOff?
Is Running the only status that has a Ready Condition of True?
Any feedback would be greatly appreciated!
Also, would it be bad practice to use kubectl in an automated script for monitoring purposes? For example, every minute log the results of kubectl get pods to Elasticsearch?
You can see the pod lifecycle details in k8s documentation.
The recommended way of monitoring kubernetes cluster and applications are with prometheus
I will try to tell what I see hidden behind these terms
ContainerCreating
Showing when we wait to image be downloaded and the
container will be created by a docker or another system.
ImagePullBackOff
Showing when we have problem to download the image from a registry. Wrong credentials to log in to the docker hub for example.
Pending
The container starts (if start take time) or started but redinessProbe failed.
CrashLoopBackOff
This status showing when container restarts occur too much often. For example, we have process that tries to read not exists file and crash. Then the container will be recreated by Kube and repeat.
Error
This is pretty clear. We have some errors to run the container.
Running
All is good container running and livenessProbe is OK.

Heapster status stuck in Container Creating or Pending status

I am new to Kubernetes and started working with it from past one month.
When creating the setup of cluster, sometimes I see that Heapster will be stuck in Container Creating or Pending status. After this happens the only way have found here is to re-install everything from the scratch which has solved our problem. Later if I run the Heapster it would run without any problem. But I think this is not the optimal solution every time. So please help out in solving the same issue when it occurs again.
Heapster image is pulled from the github for our use. Right now the cluster is running fine, So could not send the screenshot of the heapster failing with it's status by staying in Container creating or Pending status.
Suggest any alternative for the problem to be solved if it occurs again.
Thanks in advance for your time.
A pod stuck in pending state can mean more than one thing. Next time it happens you should do 'kubectl get pods' and then 'kubectl describe pod '. However, since it works sometimes the most likely cause is that the cluster doesn't have enough resources on any of its nodes to schedule the pod. If the cluster is low on remaining resources you should get an indication of this by 'kubectl top nodes' and by 'kubectl describe nodes'. (Or with gke, if you are on google cloud, you often get a low resource warning in the web UI console.)
(Or if in Azure then be wary of https://github.com/Azure/ACS/issues/29 )

On what basis restart count in kubernetes increase

I have a kubernetes cluster running fine. It has 4 workers and 1 master with the dashboard to view the status. After running it for sometime, I looked at the Restart count of a node and it was 8. I immediately ran the describe command to get any events but there was no events for that pod. However when I checked the logs of the containers, I found out that the node itself was powered down and up 4 times but dont know why it didnt had any events.
In another node, while looking at the restart count, I got event as Sandbox changed which means probably the node was powered down for sometime and thus the master lost connection to it and so incremented the restart count by 2.
I wanted to know how can we get the logs/debug related to this restart count to know why it was restarted.
Whenever a pod is recreated, does it takes up a new name.? If so, how can we get the events of the previous pod.
Does sandbox changed event actually means that master actually lost connection.?
Step by step:
I'd check the kubelet and docker daemon logs, these restarts should appear somewhere in the logs and hopefully more info about what causes them.
Yes, the pod's name is unique thus it change everytime a pod is destroyed and recreated. You can try to find the pod with kubectl get po -a. Other solution is to get all events with kubectl get events and then filter to find your pod's events.
I've seen this error before and in my case it meant problem with the docker daemon networking. But I searched a bit in google and I saw many other reasons. Again, try to analyse the docker daemon and kubelet logs, and also dmesg. If you have doubts please add a link to the logs in your question and I'll try to help.

Pod Containers Keeps on Restarting

Container getting killed at node after pod creation
Issue was raised at github and asked me to move to SO
https://github.com/kubernetes/kubernetes/issues/24241
However i am briefing my issue here.After creating pod it doesnt run since i have to mention the container name in the kubelet args under --pod-infra-container-image as mentioned below.
I have solved the issue of Pods Status Container Creating by adding the container name in "--pod-infra-container-image= then pod creation was successful.
However I want to resolve this issue some other way instead of adding containers name in kubelet args. Kindly let me know how do I get this issue fixed.
Also after the pod creation is done. The containers keep on restarting. However if I check the logs via kubectl logs output shows the container expected output.
But the container restarts often. For restarting of pod what i did i have made the restartPolicy: never in spec file of pod and then it didnt restarted however container doesnt run. Kindly help me.
Your description is very confusing, can you please reply to this answer with:
1. What error you get when you do docker pull gcr.io/google_containers/pause:2.0
2. What cmd you're running in your container
For 2 you need a long running command, like while true; do echo SUCCESS; done, otherwise it'll just exit and get restarted by the kubelet with a RestartPolicy of Always.