How do I check if a Kubernetes pod was killed for OOM or DEADLINE EXCEEDED? - kubernetes

I have some previously run pods that I think were killed by Kubernetes for OOM or DEADLINE EXCEEDED. What's the most reliable way to confirm that, especially if the pods weren't recent?

If the pods are still showing up when you type kubectl get pods -a, then you can run kubectl describe pod PODNAME and look at the reason for termination. The output will look similar to the following (I have extracted the parts of the output that are relevant to this discussion):
Containers:
  somename:
    Container ID:  docker://5f0d9e4c8e0510189f5f209cb09de27b7b114032cc94db0130a9edca59560c11
    Image:         ubuntu:latest
    ...
    State:         Terminated
      Reason:      Completed
      Exit Code:   0
In the sample output above, my pod's termination reason is Completed, but you will see other reasons there, such as OOMKilled.

If the pod has already been deleted, you can also check kubernetes events and see what's going on:
$ kubectl get events
LAST SEEN FIRST SEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
59m 59m 1 my-pod-7477dc76c5-p49k4 Pod spec.containers{my-service} Normal Killing kubelet Killing container with id docker://my-service:Need to kill Pod
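If the pod object still exists, you can also read the container's last termination reason straight from its status; a quick jsonpath check (the pod name below is a placeholder):
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
An output of OOMKilled confirms the container was killed by the out-of-memory killer.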

Related

Is it possible to get the details of the node where the pod ran before restart?

I'm running a Kubernetes cluster of 20+ nodes, and one pod in a namespace got restarted. The pod was killed due to OOM with exit code 137 and restarted again as expected. But I would like to know the node on which the pod was running earlier. Is there any place we could check the logs for this info, like tiller, kubelet, kube-proxy, etc.?
But would like to know the node in which the pod was running earlier.
If a pod is killed with ExitCode: 137, e.g. when it used more memory than its limit, it will be restarted on the same node - not re-scheduled. For this, check your metrics or container logs.
But Pods can also be killed due to over-committing a node, see e.g. How to troubleshoot Kubernetes OOM and CPU Throttle.
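Since an OOM-killed container is restarted in place rather than re-scheduled, the current pod object still shows the node; a quick way to check it (pod name and namespace are placeholders):
kubectl get pod <pod-name> -n <namespace> -o wide
The NODE column in the wide output is the node the pod is (and was) running on.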

How do I know why my SonarQube helm chart is getting auto-killed by Kubernetes

This question is about logging/monitoring.
I'm running a 3 node cluster on AKS, with 3 orgs, Dev, Test and Prod. The chart worked fine in Dev, but the same chart keeps getting killed by Kubernetes in Test, and it keeps getting recreated, and re-killed. Is there a way to extract details on why this is happening? All I see when I describe the pod is Reason: Killed
Please give me more details on this or some suggestions. Thanks!
List Events sorted by timestamp
kubectl get events --sort-by=.metadata.creationTimestamp
There might be various reasons for it being killed, e.g. insufficient resources or a failed liveness probe.
For SonarQube there are liveness and readiness probes configured, so one of them might be failing. Also, as described in the helm chart's values:
If an ingress path other than the root (/) is defined, it should be reflected here
A trailing "/" must be included
You can also check if there are sufficient resources on the node:
check which node the pods are running on: kubectl get pods -o wide, and
then run kubectl describe node <node-name> to check whether the node is under
disk or memory pressure.
You can also run kubectl logs <pod-name> and kubectl describe pod <pod-name>, which might give you some insight into the kill reason.
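As a rough sketch of the resource check mentioned above (the node name is a placeholder), kubectl describe node shows both the pressure conditions and how much of the node's allocatable CPU/memory is already requested:
kubectl describe node <node-name> | grep -A 5 "Conditions:"
kubectl describe node <node-name> | grep -A 10 "Allocated resources:"
Look for MemoryPressure or DiskPressure set to True, and compare the requested CPU/memory against the node's allocatable capacity.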

Reasons of Pod Status Failed

If a Pod's status is Failed, Kubernetes will keep trying to create new Pods until it reaches the terminated-pod-gc-threshold in kube-controller-manager. This leaves many Failed Pods in the cluster that need to be cleaned up.
Are there other reasons besides Evicted that will cause a Pod to be Failed?
There can be many causes for the Pod status to be Failed. You just need to check for problems (if any exist) by running the command
kubectl -n <namespace> describe pod <pod-name>
Carefully check the Events section, where all the events that occurred during Pod creation are listed. Hopefully you can pinpoint the cause of failure from there.
However, there are several reasons for Pod failure; some of them are the following:
Wrong image used for the Pod.
Wrong command/arguments passed to the Pod.
Kubelet failed to check the Pod's liveness (i.e., the liveness probe failed).
The Pod failed its health check.
Problem with the network CNI plugin (misconfiguration of the CNI plugin used for networking).
For example:
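A minimal Pod manifest sketch that reproduces this kind of failure (the image name not-so-busybox is the one referred to below; the pod name is just illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: image-pull-demo
spec:
  restartPolicy: Never
  containers:
  - name: main
    image: not-so-busybox   # this image does not exist, so the pull fails
Describing such a pod shows ErrImagePull / ImagePullBackOff in its container state and events.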
In the above example, the image "not-so-busybox" couldn't be pulled because it doesn't exist, so the pod failed to run. The pod status and events clearly describe the problem.
Simply do this:
kubectl get pods <pod_name> -o yaml
And in the output, towards the end, you can see something like this:
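For an OOM-killed container, that tail of the YAML might look roughly like the sketch below; the field names are real Pod status fields, but the values are illustrative:
status:
  containerStatuses:
  - name: my-container
    restartCount: 3
    state:
      waiting:
        reason: CrashLoopBackOff
    lastState:
      terminated:
        reason: OOMKilled
        exitCode: 137
        startedAt: "2024-01-01T10:00:00Z"
        finishedAt: "2024-01-01T10:05:00Z"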
This will give you a good idea of where exactly the pod failed and what happened.
Pods will not survive scheduling failures, node failures, or other evictions, such as lack of resources or node maintenance.
Pods should not be created manually but almost always via controllers like Deployments (self-healing, replication etc).
The reason why a pod failed or was terminated can be obtained by running
kubectl describe pod <pod_name>
Other situations I have encountered when a pod Failed:
Issues with the image (it no longer exists)
The pod attempts to access, e.g., a ConfigMap or Secret that is not found in the namespace
Liveness probe failure
A PersistentVolume fails to mount
Validation error
In addition, eviction is based on resources (EvictionPolicy). It can also be caused by draining the node (kubectl drain).
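If you then need to clean up the Failed pods mentioned in the question, the pod phase can be used as a field selector; a sketch (namespace is a placeholder):
# list Failed pods across all namespaces
kubectl get pods --all-namespaces --field-selector status.phase=Failed
# delete them in one namespace once you have inspected them
kubectl delete pods --field-selector status.phase=Failed -n <namespace>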

Automatic restart of a Kubernetes pod

I have a Kubernetes cluster on Google Cloud Platform. The Kubernetes cluster contains a deployment which has one pod. The pod has two containers. I have observed that the pod has been replaced by a new pod and the entire data is wiped out. I am not able to identify the reason behind it.
I have tried the below two commands:
kubectl logs [podname] -c [containername] --previous
Result: previous terminated container [containername] in pod [podname] not found
kubectl get pods
Result: I see that the number of restarts for my pod equals 0.
Is there anything I could do to get the logs from my old pod?
Try the command below to see the pod info:
kubectl describe po
There is not much chance you will retrieve this information, but try the following:
1) If you know the id of your failed container, try to find its old logs on the node at
/var/lib/docker/containers/<container id>/<container id>-json.log
2) look at kubelet's logs:
journalctl -u kubelet
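If you can reach the node that ran the old pod, the kubelet log can be filtered for the pod name; a sketch assuming a systemd-managed kubelet (the pod name is a placeholder; run this on the node itself):
journalctl -u kubelet --since "2 days ago" | grep -i "<podname>"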

"Ghost" kubernetes pod stuck in terminating

The situation
I have a kubernetes pod stuck in "Terminating" state that resists pod deletions
NAME READY STATUS RESTARTS AGE
...
funny-turtle-myservice-xxx-yyy 1/1 Terminating 1 11d
...
Where funny-turtle is the name of the helm release that has since been deleted.
What I have tried
try to delete the pod.
Output: pod "funny-turtle-myservice-xxx-yyy" deleted
Outcome: it still shows up in the same state.
I also tried with --force --grace-period=0; same outcome, with an extra warning:
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
try to read the logs (kubectl logs ...).
Outcome: Error from server (NotFound): nodes "ip-xxx.yyy.compute.internal" not found
try to delete the kubernetes deployment.
but it does not exist.
So I assume this pod somehow got "disconnected" from the aws API, reasoning from the error message that kubectl logs printed.
I'll take any suggestions or guidance to explain what happened here and how I can get rid of it.
EDIT 1
Tried to see if the "ghost" node was still there (kubectl delete node ip-xxx.yyy.compute.internal) but it does not exist.
Try removing the finalizers from the pod:
kubectl patch pod funny-turtle-myservice-xxx-yyy -p '{"metadata":{"finalizers":null}}'
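Before patching, it can help to see which finalizers are actually set on the pod; a quick check (pod name and namespace are placeholders):
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.metadata.finalizers}'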
In my case, the solution proposed by the accepted answer did not work; the pod stayed stuck in "Terminating" status. What did the trick for me was:
kubectl delete pods <pod> --grace-period=0 --force
The above solutions did not work in my case, except that I didn't try restarting all the nodes.
The error state for my pod was as follows (extra lines omitted):
$ kubectl -n myns describe pod/mypod
Status:        Terminating (lasts 41h)
Containers:
  runner:
    State:          Waiting
      Reason:       ContainerCreating
    Last State:     Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was deleted.
                    The container used to be Running
      Exit Code:    137
$ kubectl -n myns get pod/mypod -o json
"metadata": {
"deletionGracePeriodSeconds": 0,
"deletionTimestamp": "2022-06-07T22:17:20Z",
"finalizers": [
"actions.summerwind.dev/runner-pod"
],
I removed the entry under finalizers (leaving finalizers as an empty array), and then the pod was finally gone.
$ kubectl -n myns edit pod/mypod
pod/mypod edited
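If you prefer to avoid the interactive edit, the same finalizer removal can be done non-interactively with a JSON patch; a sketch using the names from the example above:
kubectl -n myns patch pod mypod --type=json -p '[{"op":"remove","path":"/metadata/finalizers"}]'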
In my case nothing worked: no logs, no delete, absolutely nothing. I had to restart all the nodes; after that the situation cleared up and there were no more Terminating pods.