How to monitor pod preemption event - kubernetes

I have a bunch of Rancher clusters I take care of and on some of them developers use PriorityClasses to ensure that some of the more important workloads get scheduled. The 3 PriorityClasses are in 3 digits range so they will not interfere with the default ones. However, at present none of the PriorityClasses is set as default and neither is the preemptionPolicy set so it defaults to PreemptLowerPriority.
None of the rancher, longhorn, prometheus, grafana, etc., workloads have priorityClassName set.
Long story short, I believe this causes havoc on the cluster when resources are in short supply.
Before I take my opinion to the developers I would like to collect some data to back up my story.
The question: How do I detect if the pod was Terminated due to Preemption?
I tried to google the subject but couldn't find anything. I was hoping kube state metrics would have something but I didn't find anything.
Any help would be greatly appreciated.

You can try to look for convincing data like the pod termination reason with help of kubectl.
You can see the last restart logs of a container using the following command:
kubectl logs podname -c containername --previous
You can also use the following command to check the lifecycle events sent by the kubelet to the apiserver about the pod.
kubectl describe pod podname
Finally, You can also write a final message to /dev/termination-log, and this will show up as described in the docs.
To use kubectl commands with rancher kindly refer to this documentation page.

Related

deployed a service on k8s but not showing any pods weven when it failed

I have deployed a k8s service, however its not showing any pods. This is what I see
kubectl get deployments
It should create on the default namespace
kubectl get nodes (this shows me nothing)
How do I troubleshoot a failed deployment. The test-control-plane is the one deployed by kind this is the k8s one I'm using.
kubectl get nodes
If above command is not showing anything which mean there is no Nodes in your cluster so where your workload will run ?
You need to have at least one worker node in K8s cluster so deployment can schedule the POD on it and run the application.
You can check worker node using same command
kubectl get nodes
You can debug more and check the reason of issue further using
kubectl describe deployment <name of your deployment>
To find out what really went wrong, first follow the steps described in Harsh Manvar in his answer. Perhaps obtaining that information can help you find the problem. If not, check the logs of your deployment. Try to list your pods and see which ones did not boot properly, then check their logs.
You can also use the kubectl describe on pods to see in more detail what went wrong. Since you are using kind, I include a list of known errors for you.
You can also see this visual guide on troubleshooting Kubernetes deployments and 5 Tips for Troubleshooting Kubernetes Deployments.

Getting Unkown target for HPA

Am actually new to kubernetes, but as now am good with the terms such as deployment, pods etc.
Well i was trying an example of HPA (Horizontal pod autoscaler), and as prerequisite metrics-servers is already integrated, but after all those things am not able to see HPA working as expected
enter image description here
When execute below cmd-
Kubectl get hpa
Unknown in the target, although i have tried all my luck referring online forum but didn't got any break through
Any help would be really appreciated
Thank you
I was getting same issue, fixed after adding cpu requests in my pod defination. Below points can be the reason in most of the cases
Metric server is not installed in your kubernates cluster, you can check with command
(kubectl get deploy,svc -n kube-system | egrep metrics-server)
Check if you have provided resources for deployment/sts/pod definations

How to find the reason of a pod crashing?

Is there a way to see why a kubernetes pod is failing with the status "craskLoopBackOff" under a heavy load?
I have a HorizontalPodAutoscaler which never kicks in. In its status it always shows low (Under 50%) cpu and memory usage.
Tailing the application logs within the pods doesnt give any insights either.
Try looking at the Kubernetes events kubectl get events --sort-by='.lastTimestamp'
If you don't get anything meaningful out of events go to the specific node and see the kubelet logs journalctl -u kubelet
To get logs from a pod you should use:
kubectl logs [podname] -p
You can also do kubelet logs but that's mostly for Cluster logs.
If there is no logs that means your application did not produces any logs before the crash. You would need to rewrite the app and for example add a memory dump on crush.
You mentioned that the pod is dying under heavy load but stats shows only 50% utilization. You should login to the pod and check yourself the load, maybe check how many files are being open because maybe you are hitting the limit.
You can read the Kubernetes docs about Application Introspection and Debugging and go over Debugging CrashLoopBackoffs with Init-Containers.
You can also try running your image in Docker and checking logs there. There is a nice documentation about Logs and troubleshooting available.
If you provide more details we might be more helpful.
Below are some obvious reasons for crashloopbackoff, which I have observed:
waiting for some condition to be full-filled e.g. some secrets,
failing healthcheck etc
pod is running with burstable or besteffort
QoS and is getting killed due to non-availability of resources on
node
You can run this script to find the possible issues for pods in a namespace: https://github.com/dguyhasnoname/k8s-day2-ops/blob/master/namespace_scripts/debug_app_namespace.sh

how to take a problematic pod offline to troubleshoot

HI I know there's a way i can pull out a problematic node out of loadbalancer to troubleshoot. But how can i pull a pod out of service to troubleshoot. What tools or command can do it ?
Change its labels so they no longer matches the selector: in the Service; we used to do that all the time. You can even put it back into rotation if you want to test a hypothesis. I don't recall exactly how quickly it takes effect, but I would guess "real quick" is a good approximation. :-)
## for example:
$ kubectl label pod $the_pod -app.kubernetes.io/name
## or, change it to non-matching
$ kubectl label pod $the_pod app.kubernetes.io/name=i-am-debugging-this-pod
As mentioned in Oreilly's "Kubernetes recipes: Maintenance and troubleshooting" page here
Removing a Pod from a Service
Problem
You have a well-defined service (see not available) backed by several
pods. But one of the pods is misbehaving, and you would like to take
it out of the list of endpoints to examine it at a later time.
Solution
Relabel the pod using the --overwrite option—this will allow you to
change the value of the run label on the pod. By overwriting this
label, you can ensure that it will not be selected by the service
selector (not available) and will be removed from the list of
endpoints. At the same time, the replica set watching over your pods
will see that a pod has disappeared and will start a new replica.
To see this in action, start with a straightforward deployment
generated with kubectl run (see not available):
For commands, check the recipes page mentioned above. There is also a section talking about "Debugging Pods" which will be helpful

How can I access the pod when it become CrashLoopBackOff?

Right now, I deployed some pods on my kubernetes cluster. But sometime, my image may has some bugs which make the pod cannot start correctly.
For example:
nats-1 0/1 CrashLoopBackOff 121 10h
I also cannot see any error in the kubectl log.
So is there any way to access this pod? Or is there any tools or tech can allow to to enter the container?
Thanks a lot all! :)
You can kubectl describe to get the events, it sometimes might show some errors there. Otherwise you can probably also make the deployment/pod run a command like sleep 3600 to keep it open for you to exec into it to investigate further.
Edited after clarification:
You could go into the worker (kubectl get pod <pod-name> -o wide to get which one) and access the node syslogs or pods' logs. That should show you a more detailed information of what happened.
But #ho-man approach is very valid and less cumbersome.