How to find the reason of a pod crashing? - kubernetes

Is there a way to see why a Kubernetes pod is failing with the status "CrashLoopBackOff" under heavy load?
I have a HorizontalPodAutoscaler which never kicks in. Its status always shows low (under 50%) CPU and memory usage.
Tailing the application logs within the pods doesn't give any insights either.

Try looking at the Kubernetes events: kubectl get events --sort-by='.lastTimestamp'
If you don't get anything meaningful out of the events, go to the specific node and check the kubelet logs: journalctl -u kubelet
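If the event list is long, it can help to narrow it down offline. The following is a minimal sketch, assuming you saved the output of kubectl get events -o json to a file; the function name recent_warnings and the limit parameter are made up for illustration. It keeps only Warning events and orders them by lastTimestamp, falling back to the creation timestamp when lastTimestamp is empty.

```python
import json


def recent_warnings(events_json: str, limit: int = 20) -> list:
    """Return up to `limit` of the newest Warning events, oldest first."""
    items = json.loads(events_json).get("items", [])
    warnings = [e for e in items if e.get("type") == "Warning"]

    def when(event: dict) -> str:
        # lastTimestamp may be null on some events, so fall back to
        # the object's creationTimestamp. RFC 3339 UTC strings sort
        # correctly as plain text.
        return (event.get("lastTimestamp")
                or event.get("metadata", {}).get("creationTimestamp")
                or "")

    warnings.sort(key=when)
    return warnings[-limit:]
```

Usage would be something like recent_warnings(open("events.json").read()), where events.json is a hypothetical file name.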

To get logs from a pod you should use:
kubectl logs [podname] -p
You can also look at the kubelet logs, but those are mostly cluster-level logs.
If there are no logs, your application did not produce any output before the crash. You would need to change the app, for example to write a memory dump on crash.
You mentioned that the pod is dying under heavy load but the stats show only 50% utilization. You should log in to the pod and check the load yourself; also check how many files are open, because you may be hitting the open-file limit.
You can read the Kubernetes docs about Application Introspection and Debugging and go over Debugging CrashLoopBackoffs with Init-Containers.
You can also try running your image in Docker and checking the logs there. There is good documentation about Logs and troubleshooting available.
If you provide more details we might be more helpful.
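When the logs are empty, the container's last recorded state is often the next place to look. Here is a rough sketch, assuming you load the output of kubectl get pod [podname] -o json into a dict; the function name last_terminations is hypothetical. It reads status.containerStatuses[].lastState.terminated, which is where Kubernetes records why the previous run of a container ended.

```python
def last_terminations(pod: dict) -> list:
    """Collect (container, reason, exit_code, message) for each
    container whose previous run was terminated."""
    results = []
    statuses = pod.get("status", {}).get("containerStatuses") or []
    for cs in statuses:
        terminated = cs.get("lastState", {}).get("terminated")
        if terminated:
            results.append((
                cs.get("name"),
                terminated.get("reason"),
                terminated.get("exitCode"),
                terminated.get("message"),
            ))
    return results
```

A reason of OOMKilled (usually with exit code 137) would point at the container's memory limit rather than at the averaged usage the autoscaler reports.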

Below are some common reasons for CrashLoopBackOff that I have observed:
- waiting for some condition to be fulfilled, e.g. some secrets
- failing health check, etc.
- pod is running with Burstable or BestEffort QoS and is getting killed due to lack of resources on the node
You can run this script to find the possible issues for pods in a namespace: https://github.com/dguyhasnoname/k8s-day2-ops/blob/master/namespace_scripts/debug_app_namespace.sh
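To check the QoS point without a full script, here is a small sketch, assuming the output of kubectl get pods -o json is loaded into a dict; the function name pods_without_guaranteed_qos is made up. It relies on the status.qosClass field that Kubernetes sets on every pod.

```python
def pods_without_guaranteed_qos(pod_list: dict) -> list:
    """Return (pod_name, qos_class) for pods that are Burstable or
    BestEffort, i.e. the first candidates for eviction when a node
    runs short of resources."""
    flagged = []
    for pod in pod_list.get("items", []):
        qos = pod.get("status", {}).get("qosClass")
        if qos in ("Burstable", "BestEffort"):
            flagged.append((pod.get("metadata", {}).get("name"), qos))
    return flagged
```

Pods that show up here and also restart under load are worth a closer look at their requests and limits.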

Related

How to monitor pod preemption event

I have a bunch of Rancher clusters I take care of and on some of them developers use PriorityClasses to ensure that some of the more important workloads get scheduled. The 3 PriorityClasses are in 3 digits range so they will not interfere with the default ones. However, at present none of the PriorityClasses is set as default and neither is the preemptionPolicy set so it defaults to PreemptLowerPriority.
None of the rancher, longhorn, prometheus, grafana, etc., workloads have priorityClassName set.
Long story short, I believe this causes havoc on the cluster when resources are in short supply.
Before I take my opinion to the developers I would like to collect some data to back up my story.
The question: How do I detect if the pod was Terminated due to Preemption?
I tried to google the subject but couldn't find anything. I was hoping kube state metrics would have something but I didn't find anything.
Any help would be greatly appreciated.
You can try to look for convincing data, such as the pod termination reason, with the help of kubectl.
You can see the last restart logs of a container using the following command:
kubectl logs podname -c containername --previous
You can also use the following command to check the lifecycle events sent by the kubelet to the apiserver about the pod.
kubectl describe pod podname
Finally, you can also write a final message to /dev/termination-log, and this will show up as described in the docs.
To use kubectl commands with Rancher, refer to this documentation page.
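For the termination message, a minimal sketch of the application side could look like this. It assumes the container uses the default terminationMessagePath of /dev/termination-log; the function name write_termination_message is hypothetical, and the 4096-byte cut reflects the documented limit on a single termination message.

```python
TERMINATION_LOG = "/dev/termination-log"  # Kubernetes default path


def write_termination_message(message: str,
                              path: str = TERMINATION_LOG) -> bool:
    """Write a short final status message where the kubelet picks it up.

    Returns False instead of raising if the file cannot be written,
    so a failing write never hides the original error.
    """
    data = message.encode("utf-8")[:4096]  # keep within the size limit
    try:
        with open(path, "wb") as fh:
            fh.write(data)
        return True
    except OSError:
        return False
```

You would call it from the application's shutdown or fatal-error handler; the text should then appear in the message field of the container's last terminated state in kubectl describe pod.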

Kubernetes - keeping the execution logs of a pod

I'm trying to keep the execution logs of containers in Kubernetes.
I added successfulJobsHistoryLimit: 5 and failedJobsHistoryLimit: 5 to my CronJob YAML in order to see the execution history, but when I try to view the logs of the pods I get an error.
I assume it is because the pods have been deleted, because when I go to a running pod I can see the logs.
So is there a way of keeping the logs in this part of Kubernetes, or is there something that I have to set up in order to have this functionality?
Sorry if the question has been asked before, but I didn't really find anything and I'm new to Kubernetes.
Thanks for the replies.
Looking at the bigger picture, it's generally a good idea to have your logs collected by logging agents or pushed directly to an external service, as per the official documentation.
Taking advantage of the Kubernetes logging architecture explained here, you can also try to fetch the logs directly from the rotated log files on the node hosting the pods. Please note that this option depends on the specific Kubernetes implementation, as log files might be deleted when pod eviction is triggered.
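If you do read the files on the node, each line carries a small header in front of the application output. The sketch below assumes the node uses a CRI runtime such as containerd or CRI-O, which typically write files under /var/log/pods in the form "timestamp stream tag message"; with Docker's json-file driver the format is different. The function name parse_cri_line is made up.

```python
def parse_cri_line(line: str):
    """Split one CRI-format log line into its parts.

    Returns None when the line does not look like CRI output.
    """
    parts = line.rstrip("\n").split(" ", 3)
    if len(parts) < 3:
        return None
    timestamp, stream, tag = parts[:3]
    if stream not in ("stdout", "stderr") or tag not in ("P", "F"):
        return None
    return {
        "time": timestamp,
        "stream": stream,
        "partial": tag == "P",  # P = partial line, F = full line
        "message": parts[3] if len(parts) == 4 else "",
    }
```

A logging agent does essentially this before shipping the lines elsewhere, which is why the agent approach keeps working after the pod is gone.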

Kubernetes Deployment/Pod/Container statuses

I am currently working on a monitoring service that will monitor Kubernetes deployments and their pods. I want to notify users when a deployment is not running the expected number of replicas and also when pods' containers restart unexpectedly. These may not be the right things to monitor, and I would greatly appreciate some feedback on what I should be monitoring.
Anyways, the main question is the differences between all of the Statuses of pods. And when I say Statuses I mean the Status column when running kubectl get pods. The statuses in question are:
- ContainerCreating
- ImagePullBackOff
- Pending
- CrashLoopBackOff
- Error
- Running
What causes pod/containers to go into these states?
For the first four Statuses, are these states recoverable without user interaction?
What is the threshold for a CrashLoopBackOff?
Is Running the only status that has a Ready Condition of True?
Any feedback would be greatly appreciated!
Also, would it be bad practice to use kubectl in an automated script for monitoring purposes? For example, every minute log the results of kubectl get pods to Elasticsearch?
You can see the pod lifecycle details in the Kubernetes documentation.
The recommended way of monitoring a Kubernetes cluster and its applications is with Prometheus.
I will try to explain what I see behind these terms:
ContainerCreating
Shown while we wait for the image to be downloaded and the container to be created by Docker or another container runtime.
ImagePullBackOff
Shown when there is a problem downloading the image from a registry, for example wrong credentials for Docker Hub.
Pending
The pod has been accepted by the cluster but its containers are not all running yet, for example because it is still waiting to be scheduled onto a node or the image is still being pulled. A failing readinessProbe does not put a pod in this state; it shows up as a not-ready container in a Running pod.
CrashLoopBackOff
Shown when the container keeps crashing and being restarted. For example, a process tries to read a file that does not exist and crashes; Kubernetes then restarts the container, with an increasing delay between attempts, and the cycle repeats.
Error
This is pretty clear: the container exited with an error.
Running
The pod is bound to a node and at least one container is running. Whether it is Ready is reported separately and depends on the readinessProbe.
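For a monitoring service it may be easier to work from the pod JSON than from the STATUS column. The following is a simplified sketch of how that column can be approximated from kubectl get pod -o json output; it is not kubectl's exact logic (init containers and deletion are ignored), and the function names display_status and is_ready are made up. It also shows that readiness is a separate condition rather than a status value.

```python
def display_status(pod: dict) -> str:
    """Approximate the STATUS column: a container's waiting or
    terminated reason wins over the pod phase."""
    status = pod.get("status", {})
    result = status.get("reason") or status.get("phase", "Unknown")
    for cs in status.get("containerStatuses") or []:
        state = cs.get("state", {})
        waiting = state.get("waiting") or {}
        terminated = state.get("terminated") or {}
        if waiting.get("reason"):
            # e.g. ContainerCreating, ImagePullBackOff, CrashLoopBackOff
            return waiting["reason"]
        if terminated.get("reason"):
            result = terminated["reason"]  # e.g. Error, Completed
    return result


def is_ready(pod: dict) -> bool:
    """True when the pod's Ready condition has status "True"."""
    for cond in pod.get("status", {}).get("conditions") or []:
        if cond.get("type") == "Ready":
            return cond.get("status") == "True"
    return False
```

A pod can therefore report Running from display_status while is_ready is still False, which is the case a monitoring service usually cares about.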

On what basis restart count in kubernetes increase

I have a Kubernetes cluster running fine. It has 4 workers and 1 master, with the dashboard to view the status. After running it for some time, I looked at the restart count of a node and it was 8. I immediately ran the describe command to get any events, but there were no events for that pod. However, when I checked the logs of the containers, I found out that the node itself was powered down and up 4 times, but I don't know why there weren't any events.
On another node, while looking at the restart count, I got a Sandbox changed event, which probably means the node was powered down for some time, so the master lost connection to it and incremented the restart count by 2.
I wanted to know how we can get the logs/debug information related to this restart count, to find out why it was restarted.
Whenever a pod is recreated, does it take a new name? If so, how can we get the events of the previous pod?
Does the Sandbox changed event actually mean that the master lost connection?
Step by step:
I'd check the kubelet and Docker daemon logs; these restarts should appear somewhere in the logs, hopefully with more info about what causes them.
Yes, the pod's name is unique, so it changes every time a pod is destroyed and recreated. You can try to find the pod with kubectl get po -a. Another option is to get all events with kubectl get events and then filter to find your pod's events.
I've seen this error before, and in my case it meant a problem with the Docker daemon networking. But I searched a bit on Google and saw many other possible reasons. Again, try to analyse the Docker daemon and kubelet logs, and also dmesg. If you have doubts, please add a link to the logs in your question and I'll try to help.
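The event filtering can be sketched like this, assuming the output of kubectl get events -o json is loaded into a dict; the function name events_for_pod is hypothetical. It matches on a name prefix rather than the full name, so that earlier incarnations of a recreated pod (which share the controller's prefix) are included as well.

```python
def events_for_pod(event_list: dict, name_prefix: str) -> list:
    """Return (lastTimestamp, reason, message) for every event whose
    involved object is a Pod with a name starting with name_prefix."""
    matches = []
    for event in event_list.get("items", []):
        obj = event.get("involvedObject", {})
        if obj.get("kind") != "Pod":
            continue
        if obj.get("name", "").startswith(name_prefix):
            matches.append((
                event.get("lastTimestamp"),
                event.get("reason"),
                event.get("message"),
            ))
    return matches
```

Keep in mind that the API server only retains events for a limited time (one hour by default), which would explain an empty event list for restarts that happened earlier.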

How can I access the pod when it goes into CrashLoopBackOff?

Right now, I have deployed some pods on my Kubernetes cluster. But sometimes my image may have bugs which prevent the pod from starting correctly.
For example:
nats-1 0/1 CrashLoopBackOff 121 10h
I also cannot see any error in the kubectl logs output.
So is there any way to access this pod? Or is there any tool or technique that would allow me to enter the container?
Thanks a lot all! :)
You can use kubectl describe to get the events; it might show some errors there. Otherwise you can make the deployment/pod run a command like sleep 3600 to keep it alive so you can exec into it and investigate further.
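The sleep trick can be sketched as a small manifest patch. This assumes a Pod manifest loaded into a dict (for a Deployment the containers sit under spec.template.spec instead) and an image that actually contains a sleep binary; the function name with_sleep_command is made up. Probes are removed as well, since they would otherwise restart the idle container or mark it unready.

```python
import copy


def with_sleep_command(pod_manifest: dict, container_name: str,
                       seconds: int = 3600) -> dict:
    """Return a copy of the manifest in which the named container
    runs `sleep <seconds>` instead of its normal entrypoint."""
    patched = copy.deepcopy(pod_manifest)
    for container in patched["spec"]["containers"]:
        if container.get("name") == container_name:
            container["command"] = ["sleep", str(seconds)]
            container.pop("args", None)
            for probe in ("livenessProbe", "readinessProbe", "startupProbe"):
                container.pop(probe, None)
            return patched
    raise ValueError(f"container {container_name!r} not found")
```

After applying the patched manifest you can kubectl exec into the pod and start the real process by hand to watch it fail.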
Edited after clarification:
You could go into the worker node (kubectl get pod <pod-name> -o wide shows which one) and access the node syslogs or the pods' logs. That should give you more detailed information about what happened.
But @ho-man's approach is very valid and less cumbersome.