More detailed monitoring of Pod states - kubernetes

Our Pods usually spend at least a minute and up to several minutes in the Pending state, the events via kubectl describe pod x yield:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned testing/runner-2zyekyp-project-47-concurrent-0tqwl4 to host
Normal Pulled 55s kubelet, host Container image "registry.com/image:c1d98da0c17f9b1d4ca81713c138ee2e" already present on machine
Normal Created 55s kubelet, host Created container build
Normal Started 54s kubelet, host Started container build
Normal Pulled 54s kubelet, host Container image "gitlab/gitlab-runner-helper:x86_64-6214287e" already present on machine
Normal Created 54s kubelet, host Created container helper
Normal Started 54s kubelet, host Started container helper
The provided information is not detailed enough to figure out exactly what is happening.
Question:
How can we gather more detailed metrics about what exactly happens, and when, while a Pod is brought to the Running state, so we can troubleshoot which step takes how much time?
Of special interest would be how long it takes to mount a volume.

Check the kubelet and kube-scheduler logs: the kube-scheduler assigns the pod to a node, and the kubelet on that node starts the pod and reports its status as Ready.
journalctl -u kubelet # after logging into the kubernetes node
kubectl logs kube-scheduler-<node-name> -n kube-system # on kubeadm clusters the scheduler is a static pod named after its control-plane node
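For the volume-mount timing in particular, the kubelet log is usually the most direct source. A minimal sketch, assuming a systemd-managed kubelet that logs the usual MountVolume messages at default verbosity (the pod name is a placeholder):
journalctl -u kubelet -o short-iso | grep -i MountVolume # timestamped volume setup/mount messages
journalctl -u kubelet -o short-iso | grep <pod-name> # every kubelet message for one pod
The gap between the pod's "Successfully assigned" event and the first "MountVolume.SetUp succeeded" line gives a rough duration for volume setup.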
Describe the pod, deployment, and replicaset to get more details:
kubectl describe pod podname -n namespacename
kubectl describe deploy deploymentname -n namespacename
kubectl describe rs replicasetname -n namespacename
Check events:
kubectl get events -n namespacename
Describe the nodes and check the available resources and the node status, which should be Ready:
kubectl describe node nodename
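To see when each step happened for a single pod, you can also pull its events with timestamps and compute the gaps yourself. A sketch, using the same placeholder names as above:
kubectl get events -n namespacename --field-selector involvedObject.name=podname --sort-by=.lastTimestamp -o custom-columns=TIME:.lastTimestamp,REASON:.reason,MESSAGE:.message
The differences between the Scheduled, Pulled, Created, and Started timestamps show which phase consumed the time (event timestamps have one-second resolution, so very fast steps may show up as 0s).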

Can I get events from other resources in addition to the pod in Kubernetes?

When running this command for resources (Deployment, ReplicaSet, ...) other than a Pod:
$ kubectl describe deployment xxx-deployment
Events: <none>
I have deployed several resources, but so far I have only seen events for Pods.
What kinds of events occur for resources other than Pods?
Could you recommend any material to refer to?
A good explanation of what an event is in Kubernetes can be found in the Types of Kubernetes Events article; the author also covers the different types of events.
Kubernetes events are a resource type in Kubernetes that are automatically created when other resources have state changes, errors, or other messages that should be broadcast to the system. While there is not a lot of documentation available for events, they are an invaluable resource when debugging issues in your Kubernetes cluster.
You can describe not only a pod, deployment, or replicaset, but almost all resources in Kubernetes.
Examples:
kubectl describe job pi -n test
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 12s job-controller Created pod: pi-5rgbz
kubectl describe node ubuntu
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning MissingClusterDNS 22h (x98 over 23h) kubelet, ubuntu-18 kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.
Normal Starting 22h kubelet, ubuntu-18 Starting kubelet.
Warning InvalidDiskCapacity 22h kubelet, ubuntu-18 invalid capacity 0 on image filesystem
Normal NodeHasSufficientMemory 22h kubelet, ubuntu-18 Node ubuntu-18 status is now: NodeHasSufficientMemory
Normal NodeHasSufficientPID 22h
To list events for all resources you can use:
$ kubectl get events --all-namespaces
NAMESPACE LAST SEEN TYPE REASON OBJECT MESSAGE
default 50m Normal Starting node/gke-cluster-1-default-pool-XXXXXXXXXXXXX Starting kubelet.
default 50m Normal NodeHasSufficientMemory node/gke-cluster-1-default-pool-XXXXXXXXXXXXX Node gke-cluster-1-default-pool-XXXXXXXXXXXXX status is now: NodeHasSufficientMemory
default 2m47s Normal SuccessfulCreate job/pi Created pod: pi-5rgbz
kube-system 50m Normal ScalingReplicaSet deployment/fluentd-gcp-scaler Scaled up replica set fluentd-gcp-scaler-6855f55bcc to 1
The OBJECT column shows the resource type.
If you would like more detailed information, you can use the -o wide flag: $ kubectl get events --all-namespaces -o wide
$ kubectl get events -o wide
LAST SEEN TYPE REASON OBJECT SUBOBJECT SOURCE MESSAGE FIRST SEEN COUNT NAME
20m Normal Scheduled pod/hello-world-86d6c6f84d-8qz9d default-scheduler Successfully assigned default/hello-world-86d6c6f84d-8qz9d to ubuntu-18
A possible root cause:
I wasn't able to reproduce a deployment without any events, so at first glance I would guess that you have --event-ttl set, which is described in the kube-apiserver docs.
--event-ttl duration Default: 1h0m0s
Amount of time to retain events.
It was also mentioned in Github thread.
In short, all events disappear after the event TTL, which is 1 hour by default.
To check if you have this flag set in kube-apiserver you can check this StackOverflow thread.
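For example, on a kubeadm-based cluster you can look at the static pod manifest or the running process (a sketch; on managed control planes such as GKE the apiserver is not accessible, so this does not apply there):
grep event-ttl /etc/kubernetes/manifests/kube-apiserver.yaml # on a control-plane node
ps aux | grep kube-apiserver | grep -o -- '--event-ttl=[^ ]*'
If neither prints anything, the default of 1h0m0s applies.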
If this didn't help, please edit your question with information such as your configuration YAMLs, which version of K8s you are using, steps to reproduce, etc.
Well, yes, deployments do have events. But keep in mind that events are only retained for around 1 hour.
You can also filter by labels with -l/--selector when describing or listing resources.
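If you only care about events for a particular resource kind, or only about warnings, field selectors work too (a sketch; the namespace is a placeholder):
kubectl get events --field-selector involvedObject.kind=Deployment -n <namespace>
kubectl get events --field-selector type=Warning --all-namespaces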

Coredns in Crashloopbackoff state with calico network

I have an Ubuntu 16.04 VM running in VirtualBox. I installed Kubernetes on it as a single node using kubeadm.
But the coredns pods are in CrashLoopBackOff state.
All other pods are running.
Single interface (enp0s3) - bridged network
Applied calico using
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
Output of kubectl describe pod:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 41m default-scheduler Successfully assigned kube-system/coredns-66bff467f8-dxzq7 to kube
Normal Pulled 39m (x5 over 41m) kubelet, kube Container image "k8s.gcr.io/coredns:1.6.7" already present on machine
Normal Created 39m (x5 over 41m) kubelet, kube Created container coredns
Normal Started 39m (x5 over 41m) kubelet, kube Started container coredns
Warning BackOff 87s (x194 over 41m) kubelet, kube Back-off restarting failed container
I ran kubectl logs <coredns-pod>, found the error log below, and looked at the link it mentions.
As suggested there, I added resolv.conf = /etc/resolv.conf at the end of /etc/kubernetes/kubelet/conf.yaml and recreated the pod.
kubectl logs coredns-66bff467f8-dxzq7 -n kube-system
.:53 [INFO] plugin/reload: Running configuration MD5 = 4e235fcc3696966e76816bcd9034ebc7 CoreDNS-1.6.7 linux/amd64, go1.13.6, da7f65b [FATAL] plugin/loop: Loop (127.0.0.1:34536 -> :53) detected for zone ".", see coredns.io/plugins/loop#troubleshooting. Query: "HINFO 8322382447049308542.5528484581440387393."
root@kube:/home/kube#
I then commented out the line below in /etc/resolv.conf on the host machine and deleted the coredns pods in the kube-system namespace:
#nameserver 127.0.1.1
The new pods came up in Running state :)
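As an alternative, the loop troubleshooting page linked in the log also suggests pointing CoreDNS at an explicit upstream instead of the looping local resolver. A sketch of that approach (8.8.8.8 is only an example upstream):
kubectl -n kube-system edit configmap coredns
# in the Corefile, change: forward . /etc/resolv.conf
# to something like:       forward . 8.8.8.8
kubectl -n kube-system delete pod -l k8s-app=kube-dns # recreate the coredns pods so they pick up the change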

One pod in Kubernetes cluster crashes but other doesn't

Strangely, one pod in my Kubernetes cluster crashes but the other doesn't!
codingjediweb-6d77f46b56-5mffg 0/1 CrashLoopBackOff 3 81s
codingjediweb-6d77f46b56-vcr8q 1/1 Running 0 81s
They both use the same image and both should work. What could be the reason?
I suspect that the crashing pod has an old image, but I don't know why. I fixed an issue and expected the new code to work, which it does on one of the pods.
Is it possible that different pods are running different images? Is there a way to check which pod is running which image? Is there a way to "flush" an old image or force K8s to download it even if it is cached?
UPDATE
After Famen's suggestion, I looked at the images. The crashing pod seems to be using an image that was already present on the node (which might be old). How can I make K8s always pull an image?
manuchadha25@cloudshell:~ (copper-frame-262317)$ kubectl get pods
NAME READY STATUS RESTARTS AGE
busybox 1/1 Running 1 2d1h
codingjediweb-6d77f46b56-5mffg 0/1 CrashLoopBackOff 10 29m
codingjediweb-6d77f46b56-vcr8q 1/1 Running 0 29m
manuchadha25@cloudshell:~ (copper-frame-262317)$ kubectl describe pod codingjediweb-6d77f46b56-vcr8q | grep image
Normal Pulling 29m kubelet, gke-codingjediweb-cluste-default-pool-69be8339-wtjt Pulling image "docker.io/manuchadha25/codingjediweb:08072020v3"
Normal Pulled 29m kubelet, gke-codingjediweb-cluste-default-pool-69be8339-wtjt Successfully pulled image "docker.io/manuchadha25/codingjediweb:08072020v3"
manuchadha25@cloudshell:~ (copper-frame-262317)$ kubectl describe pod codingjediweb-6d77f46b56-5mffg | grep image
Normal Pulled 28m (x5 over 30m) kubelet, gke-codingjediweb-cluste-default-pool-69be8339-p5hx Container image "docker.io/manuchadha25/codingjediweb:08072020v3" already present on machine
manuchadha25@cloudshell:~ (copper-frame-262317)$
Also, the working pod has two event entries for the image (Pulling and Pulled). Why are there two?
When you create a deployment, a ReplicaSet is created in the background. Every pod of that ReplicaSet has the same properties (i.e. image, memory).
When you apply a change by updating the PodTemplateSpec of the Deployment, a new ReplicaSet is created and the Deployment controller moves the Pods from the old ReplicaSet to the new one at a controlled rate. At that point you may find pods from different ReplicaSets with different properties.
To check the image:
# get pod's yaml
$ kubectl get pods -n <namespace> <pod-name> -o yaml
# get deployment's yaml
$ kubectl get deployments -n <namespace> <deployment-name> -o yaml
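If you only want the image names, a jsonpath query is quicker than reading the whole YAML (a sketch; the pod name is taken from the question, adjust as needed):
# print just the image of one pod
$ kubectl get pod codingjediweb-6d77f46b56-5mffg -o jsonpath='{.spec.containers[*].image}'
# list every pod with its image(s)
$ kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'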
Set imagePullPolicy to Always in your deployment YAML to force a pull and use the updated image.
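For example, in the Deployment's pod template (a sketch built from the names in the question; the container name is a guess):
spec:
  template:
    spec:
      containers:
      - name: codingjediweb          # hypothetical container name
        image: docker.io/manuchadha25/codingjediweb:08072020v3
        imagePullPolicy: Always      # always pull from the registry instead of reusing the node's cached copy
Note that both pods reference the same tag, but the crashing one reports "already present on machine", so its node reused a cached copy of that tag. Setting imagePullPolicy: Always (or pushing each build with a new, unique tag) avoids that.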

My old windows pods are dead and don't respond to http requests / exec fails

I have an AKS cluster with a mix of Windows and Linux nodes and an nginx-ingress.
This all worked great, but a few days ago all my Windows pods became unresponsive.
Everything is still green on the K8s dashboard, but they don't respond to HTTP requests and kubectl exec fails.
All the Linux pods still work.
I created a new deployment with the exact same image and other properties, and this new pod works, responds to HTTP and kubectl exec works.
Q: How can I find out why my old pods died? How can I prevent this from occurring again in the future?
Note that this is a test cluster, so I have the luxury of being able to investigate, if this was prod I would have burned and recreated the cluster already.
Details:
https://aks-test.progress-cloud.com/eboswebApi/ is one of the old pods, https://aks-test.progress-cloud.com/eboswebApi2/ is the new pod.
When I look at the nginx log, I see a lot of connect() failed (111: Connection refused) while connecting to upstream.
When I try kubectl exec -it <podname> --namespace <namespace> -- cmd I get one of two behaviors:
Either the command immediately returns without printing anything, or I get an error:
container 1dfffa08d834953c29acb8839ea2d4c6b78b7a530371d98c16b15132d49f5c52 encountered an error during CreateProcess: failure in a Windows system call: The remote procedure call failed and did not execute. (0x6bf) extra info: {"CommandLine":"cmd","WorkingDirectory":"C:\\inetpub\\wwwroot","Environment":{...},"EmulateConsole":true,"CreateStdInPipe":true,"CreateStdOutPipe":true,"ConsoleSize":[0,0]}
command terminated with exit code 126
kubectl describe pod works on both.
The only difference I could find is that on the old pod I don't get any events:
Events: <none>
whereas on the new pod I get a bunch of them for pulling the image etc:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 39m default-scheduler Successfully assigned ingress-basic/ebos-webapi-test-2-78786968f4-xmvfw to aksnpwin000000
Warning Failed 38m kubelet, aksnpwin000000 Error: failed to start container "ebos-webapi-test-2": Error response from daemon: hcsshim::CreateComputeSystem ebos-webapi-test-2: The binding handle is invalid.
(extra info: {"SystemType":"Container","Name":"ebos-webapi-test-2","Owner":"docker","VolumePath":"\\\\?\\Volume{dac026db-26ab-11ea-bb33-e3730ff9432d}","IgnoreFlushesDuringBoot":true,"LayerFolderPath":"C:\\ProgramData\\docker\\windowsfilter\\ebos-webapi-test-2","Layers":[{"ID":"8c160b6e-685a-58fc-8c4b-beb407ad09b4","Path":"C:\\ProgramData\\docker\\windowsfilter\\12061f29088664dc41c0836c911ed7ced1f6d7ed38b1c932c25cd8ca85a3a88e"},{"ID":"6a230a46-a97c-5e30-ac4a-636e62cd9253","Path":"C:\\ProgramData\\docker\\windowsfilter\\8c0ce5a9990bc433c4d937aa148a4251ef55c1aa7caccf1b2025fd64b4feee97"},{"ID":"240d5705-d8fe-555b-a966-1fc304552b64","Path":"C:\\ProgramData\\docker\\windowsfilter\\2b334b769fe19d0edbe1ad8d1ae464c8d0103a7225b0c9e30fdad52e4b454b35"},{"ID":"5f5d8837-5f62-5a76-a706-9afb789e45e4","Path":"C:\\ProgramData\\docker\\windowsfilter\\3d1767755b0897aaae21e3fb7b71e2d880de22473f0071b0dca6301bb6110077"},{"ID":"978503cb-b816-5f66-ba41-ed154db333d5","Path":"C:\\ProgramData\\docker\\windowsfilter\\53d2e85a90d2b8743b0502013355df5c5e75448858f0c1f5b435281750653520"},{"ID":"d7d0d14e-b097-5104-a492-da3f9396bb06","Path":"C:\\ProgramData\\docker\\windowsfilter\\38830351b46e7a0598daf62d914eb2bf01e6eefde7ac560e8213f118d2bd648c"},{"ID":"90b1c608-be4c-55a1-a787-db3a97670149","Path":"C:\\ProgramData\\docker\\windowsfilter\\84b71fda82ea0eacae7b9382eae2a26f3c71bf118f5c80e7556496f21e754126"},{"ID":"700711b2-d578-5d7c-a17f-14165a5b3507","Path":"C:\\ProgramData\\docker\\windowsfilter\\08dd6f93c96c1ac6acd3d2e8b60697340c90efe651f805809dbe87b6bd26a853"},{"ID":"270de12a-461c-5b0c-8976-a48ae0de2063","Path":"C:\\ProgramData\\docker\\windowsfilter\\115de87074fadbc3c44fc33813257c566753843f8f4dd7656faa111620f71f11"},{"ID":"521250bb-4f30-5ac4-8fcd-b4cf45866627","Path":"C:\\ProgramData\\docker\\windowsfilter\\291e51f5f030d2a895740fae3f61e1333b7fae50a060788040c8d926d46dbe1c"},{"ID":"6dded7bf-8c1e-53bb-920e-631e78728316","Path":"C:\\ProgramData\\docker\\windowsfilter\\938e721c29d2f2d23a00bf83e5bc60d92f9534da409d0417f479bd5f06faa080"},{"ID":"90dec4e9-89fe-56ce-a3c2-2770e6ec362c","Path":"C:\\ProgramData\\docker\\windowsfilter\\d723ebeafd1791f80949f62cfc91a532cc5ed40acfec8e0f236afdbcd00bbff2"},{"ID":"94ac6066-b6f3-5038-9e1b-d5982fcefa00","Path":"C:\\ProgramData\\docker\\windowsfilter\\00d1bb6fc8abb630f921d3651b1222352510d5821779d8a53d994173a4ba1126"},{"ID":"037c6d16-5785-5bea-bab4-bc3f69362e0c","Path":"C:\\ProgramData\\docker\\windowsfilter\\c107cf79e8805e9ce6d81ec2a798bf4f1e3b9c60836a40025272374f719f2270"}],"ProcessorWeight":5000,"HostName":"ebos-webapi-test-2-78786968f4-xmvfw","MappedDirectories":[{"HostPath":"c:\\var\\lib\\kubelet\\pods\\c44f445c-272b-11ea-b9bc-ae0ece5532e1\\volumes\\kubernetes.io~secret\\default-token-n5tnc","ContainerPath":"c:\\var\\run\\secrets\\kubernetes.io\\serviceaccount","ReadOnly":true,"BandwidthMaximum":0,"IOPSMaximum":0,"CreateInUtilityVM":false}],"HvPartition":false,"NetworkSharedContainerName":"4c9bede623553673fde0da6e8dc92f9a55de1ff823a168a35623ad8128f83ecb"})
Normal Pulling 38m (x2 over 38m) kubelet, aksnpwin000000 Pulling image "progress.azurecr.io/eboswebapi:release-2019-11-11_16-41"
Normal Pulled 38m (x2 over 38m) kubelet, aksnpwin000000 Successfully pulled image "progress.azurecr.io/eboswebapi:release-2019-11-11_16-41"
Normal Created 38m (x2 over 38m) kubelet, aksnpwin000000 Created container ebos-webapi-test-2
Normal Started 38m kubelet, aksnpwin000000 Started container ebos-webapi-test-2

GCP GKE: View logs of terminated jobs/pods

I have a few cron jobs on GKE.
One of the pods did terminate and now I am trying to access the logs.
➣ $ kubectl get events
LAST SEEN TYPE REASON KIND MESSAGE
23m Normal SuccessfulCreate Job Created pod: virulent-angelfish-cronjob-netsuite-proservices-15622200008gc42
22m Normal SuccessfulDelete Job Deleted pod: virulent-angelfish-cronjob-netsuite-proservices-15622200008gc42
22m Warning DeadlineExceeded Job Job was active longer than specified deadline
23m Normal Scheduled Pod Successfully assigned default/virulent-angelfish-cronjob-netsuite-proservices-15622200008gc42 to staging-cluster-default-pool-4b4827bf-rpnl
23m Normal Pulling Pod pulling image "gcr.io/my-repo/myimage:v8"
23m Normal Pulled Pod Successfully pulled image "gcr.io/my-repo/my-image:v8"
23m Normal Created Pod Created container
23m Normal Started Pod Started container
22m Normal Killing Pod Killing container with id docker://virulent-angelfish-cronjob:Need to kill Pod
23m Normal SuccessfulCreate CronJob Created job virulent-angelfish-cronjob-netsuite-proservices-1562220000
22m Normal SawCompletedJob CronJob Saw completed job: virulent-angelfish-cronjob-netsuite-proservices-1562220000
So at least one CronJob run happened.
I would like to see the pod's logs, but there is nothing there:
➣ $ kubectl get pods
No resources found.
Given that my CronJob definition has:
failedJobsHistoryLimit: 1
successfulJobsHistoryLimit: 3
shouldn't at least one pod be there for me to do forensics?
Your pod is crashing or otherwise unhealthy
First, take a look at the logs of the current container:
kubectl logs ${POD_NAME} ${CONTAINER_NAME}
If your container has previously crashed, you can access the previous container’s crash log with:
kubectl logs --previous ${POD_NAME} ${CONTAINER_NAME}
Alternately, you can run commands inside that container with exec:
kubectl exec ${POD_NAME} -c ${CONTAINER_NAME} -- ${CMD} ${ARG1} ${ARG2} ... ${ARGN}
Note: -c ${CONTAINER_NAME} is optional. You can omit it for pods that only contain a single container.
As an example, to look at the logs from a running Cassandra pod, you might run:
kubectl exec cassandra -- cat /var/log/cassandra/system.log
If none of these approaches work, you can find the host machine that the pod is running on and SSH into that host.
Finally, check Logging on Google Stackdriver.
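Since this is GKE, Stackdriver (Cloud Logging) keeps the container logs even after the pod object is gone. A sketch of pulling them with gcloud (the pod-name filter is a placeholder, and the exact resource type and labels depend on which GKE logging integration your cluster uses):
gcloud logging read 'resource.type="k8s_container" AND resource.labels.pod_name:"virulent-angelfish-cronjob"' --limit 50 --format 'value(textPayload)'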
Debugging Pods
The first step in debugging a pod is taking a look at it. Check the current state of the pod and recent events with the following command:
kubectl describe pods ${POD_NAME}
Look at the state of the containers in the pod. Are they all Running? Have there been recent restarts?
Continue debugging depending on the state of the pods.
Debugging ReplicationControllers
ReplicationControllers are fairly straightforward. They can either create pods or they can’t. If they can’t create pods, then please refer to the instructions above to debug your pods.
You can also use kubectl describe rc ${CONTROLLER_NAME} to inspect events related to the replication controller.
I hope this helps you find the exact problem.
You can use the --previous flag to get the logs of the previous container instance in the pod.
So, you can use:
kubectl logs --previous virulent-angelfish-cronjob-netsuite-proservices-15622200008gc42
to get the logs from the container instance that ran (and terminated) before the current one in that pod.