One pod in Kubernetes cluster crashes but other doesn't - kubernetes

Strangely, one pod in my Kubernetes cluster crashes but the other doesn't!
codingjediweb-6d77f46b56-5mffg 0/1 CrashLoopBackOff 3 81s
codingjediweb-6d77f46b56-vcr8q 1/1 Running 0 81s
They should both have the same image and both should work. What could be the reason?
I suspect that the crashing pod has an old image, but I don't know why. I fixed an issue and expected the code to work (which it does on one of the pods).
Is it possible that different pods have different images? Is there a way to check which pod is running which image? Is there a way to "flush" an old image, or force K8s to pull the image even if it is cached?
UPDATE
After Famen's suggestion, I looked at the image. I can see that the crashing container seems to be using an existing image (which might be old). How can I make K8s always pull an image?
manuchadha25@cloudshell:~ (copper-frame-262317)$ kubectl get pods
NAME READY STATUS RESTARTS AGE
busybox 1/1 Running 1 2d1h
codingjediweb-6d77f46b56-5mffg 0/1 CrashLoopBackOff 10 29m
codingjediweb-6d77f46b56-vcr8q 1/1 Running 0 29m
manuchadha25@cloudshell:~ (copper-frame-262317)$ kubectl describe pod codingjediweb-6d77f46b56-vcr8q | grep image
Normal Pulling 29m kubelet, gke-codingjediweb-cluste-default-pool-69be8339-wtjt Pulling image "docker.io/manuchadha25/codingjediweb:08072020v3"
Normal Pulled 29m kubelet, gke-codingjediweb-cluste-default-pool-69be8339-wtjt Successfully pulled image "docker.io/manuchadha25/codingjediweb:08072020v3"
manuchadha25@cloudshell:~ (copper-frame-262317)$ kubectl describe pod codingjediweb-6d77f46b56-5mffg | grep image
Normal Pulled 28m (x5 over 30m) kubelet, gke-codingjediweb-cluste-default-pool-69be8339-p5hx Container image "docker.io/manuchadha25/codingjediweb:08072020v3" already present on machine
manuchadha25@cloudshell:~ (copper-frame-262317)$
Also, the working pod has two entries for the image (Pulling and Pulled). Why are there two?

When you create a Deployment, a ReplicaSet is created in the background. Each pod of that ReplicaSet has the same properties (i.e. image, memory).
When you apply changes by updating the PodTemplateSpec of the Deployment, a new ReplicaSet is created, and the Deployment controller manages moving the Pods from the old ReplicaSet to the new one at a controlled rate. At this point you may find pods from different ReplicaSets with different properties.
To check the image:
# get pod's yaml
$ kubectl get pods -n <namespace> <pod-name> -o yaml
# get deployment's yaml
$ kubectl get deployments -n <namespace> <deployment-name> -o yaml
Set imagePullPolicy to Always in your deployment YAML to force a pull, so that the updated image is used.
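For reference, a minimal sketch of where that field lives, using the image from the question above (the deployment name and labels are illustrative, and other fields are omitted):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: codingjediweb
spec:
  selector:
    matchLabels:
      app: codingjediweb
  template:
    metadata:
      labels:
        app: codingjediweb
    spec:
      containers:
        - name: codingjediweb
          image: docker.io/manuchadha25/codingjediweb:08072020v3
          imagePullPolicy: Always   # pull from the registry on every pod start
```

Note that the default policy is IfNotPresent for specifically-tagged images (which is why the node reused its cached copy in the question above), and Always for the :latest tag.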

Related

More detailed monitoring of Pod states

Our Pods usually spend at least a minute, and up to several minutes, in the Pending state; the events via kubectl describe pod x yield:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned testing/runner-2zyekyp-project-47-concurrent-0tqwl4 to host
Normal Pulled 55s kubelet, host Container image "registry.com/image:c1d98da0c17f9b1d4ca81713c138ee2e" already present on machine
Normal Created 55s kubelet, host Created container build
Normal Started 54s kubelet, host Started container build
Normal Pulled 54s kubelet, host Container image "gitlab/gitlab-runner-helper:x86_64-6214287e" already present on machine
Normal Created 54s kubelet, host Created container helper
Normal Started 54s kubelet, host Started container helper
The provided information is not detailed enough to figure out exactly what is happening.
Question:
How can we gather more detailed metrics about what exactly happens, and when, on the way to a running Pod, in order to troubleshoot which step needs how much time?
Special interest would be the metric of how long it takes to mount a volume.
Check the kubelet and kube-scheduler logs, because the kube-scheduler schedules the pod to a node, and the kubelet starts the pod on that node and reports its status as Ready.
journalctl -u kubelet # after logging into the kubernetes node
kubectl logs kube-scheduler -n kube-system
Describe the pod, deployment, and ReplicaSet to get more details:
kubectl describe pod podname -n namespacename
kubectl describe deploy deploymentname -n namespacename
kubectl describe rs replicasetname -n namespacename
Check events
kubectl get events -n namespacename
Describe the nodes and check the available resources and the status, which should be Ready.
kubectl describe node nodename
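For a rough timeline of the individual steps (scheduling, volume mount, image pull, container start), you can also list the events of a single pod sorted by time; a sketch, with the pod and namespace names as placeholders:

```shell
# events for one pod, oldest first, with per-step timestamps
kubectl get events -n namespacename \
  --field-selector involvedObject.name=podname \
  --sort-by=.metadata.creationTimestamp
```

Add -o yaml to see the exact firstTimestamp/lastTimestamp of each event, which is more precise than the coarse Age column of describe.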

Enabling NodeLocalDNS fails

We have 2 clusters on GKE: dev and production. I tried to run this command on the dev cluster:
gcloud beta container clusters update "dev" --update-addons=NodeLocalDNS=ENABLED
And everything went great: node-local-dns pods are running and all works. The next morning I decided to run the same command on the production cluster, and node-local-dns fails to run. I noticed that both PILLAR__LOCAL__DNS and PILLAR__DNS__SERVER in the YAML aren't changed to proper IPs. I tried to change those variables in the config YAML, but GKE keeps overwriting them back to the PILLAR__DNS__SERVER variables...
The only difference between clusters is that dev runs on 1.15.9-gke.24 and production 1.15.11-gke.1.
Apparently version 1.15.11-gke.1 has a bug.
I reproduced it on 1.15.11-gke.1 and can confirm that the node-local-dns Pods fall into the CrashLoopBackOff state:
node-local-dns-28xxt 0/1 CrashLoopBackOff 5 5m9s
node-local-dns-msn9s 0/1 CrashLoopBackOff 6 8m17s
node-local-dns-z2jlz 0/1 CrashLoopBackOff 6 10m
When I checked the logs:
$ kubectl logs -n kube-system node-local-dns-msn9s
2020/04/07 21:01:52 [FATAL] Error parsing flags - Invalid localip specified - "__PILLAR__LOCAL__DNS__", Exiting
Solution:
Upgrading to 1.15.11-gke.3 helped. First you need to upgrade your master node, and then your node pool. On this version everything runs nicely and smoothly:
$ kubectl get daemonsets -n kube-system node-local-dns
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
node-local-dns 3 3 3 3 3 addon.gke.io/node-local-dns-ds-ready=true 44m
$ kubectl get pods -n kube-system -l k8s-app=node-local-dns
NAME READY STATUS RESTARTS AGE
node-local-dns-8pjr5 1/1 Running 0 11m
node-local-dns-tmx75 1/1 Running 0 19m
node-local-dns-zcjzt 1/1 Running 0 19m
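The upgrade itself can be sketched with gcloud (the cluster and node-pool names here are illustrative; check the versions available to you with gcloud container get-server-config first):

```shell
# upgrade the control plane first...
gcloud container clusters upgrade production --master --cluster-version 1.15.11-gke.3
# ...then the node pool
gcloud container clusters upgrade production --node-pool default-pool --cluster-version 1.15.11-gke.3
```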
As for manually fixing this particular DaemonSet's YAML, I wouldn't recommend it: you can be sure that GKE's auto-repair and auto-upgrade features will overwrite it sooner or later anyway.
I hope it was helpful.

GCP GKE: View logs of terminated jobs/pods

I have a few cron jobs on GKE.
One of the pods was terminated, and now I am trying to access its logs.
➣ $ kubectl get events
LAST SEEN TYPE REASON KIND MESSAGE
23m Normal SuccessfulCreate Job Created pod: virulent-angelfish-cronjob-netsuite-proservices-15622200008gc42
22m Normal SuccessfulDelete Job Deleted pod: virulent-angelfish-cronjob-netsuite-proservices-15622200008gc42
22m Warning DeadlineExceeded Job Job was active longer than specified deadline
23m Normal Scheduled Pod Successfully assigned default/virulent-angelfish-cronjob-netsuite-proservices-15622200008gc42 to staging-cluster-default-pool-4b4827bf-rpnl
23m Normal Pulling Pod pulling image "gcr.io/my-repo/myimage:v8"
23m Normal Pulled Pod Successfully pulled image "gcr.io/my-repo/my-image:v8"
23m Normal Created Pod Created container
23m Normal Started Pod Started container
22m Normal Killing Pod Killing container with id docker://virulent-angelfish-cronjob:Need to kill Pod
23m Normal SuccessfulCreate CronJob Created job virulent-angelfish-cronjob-netsuite-proservices-1562220000
22m Normal SawCompletedJob CronJob Saw completed job: virulent-angelfish-cronjob-netsuite-proservices-1562220000
So the CJ ran at least once.
I would like to see the pod's logs, but there is nothing there:
➣ $ kubectl get pods
No resources found.
Given that in my cj definition, I have:
failedJobsHistoryLimit: 1
successfulJobsHistoryLimit: 3
shouldn't at least one pod be there for me to do forensics?
Your pod is crashing or otherwise unhealthy
First, take a look at the logs of the current container:
kubectl logs ${POD_NAME} ${CONTAINER_NAME}
If your container has previously crashed, you can access the previous container’s crash log with:
kubectl logs --previous ${POD_NAME} ${CONTAINER_NAME}
Alternately, you can run commands inside that container with exec:
kubectl exec ${POD_NAME} -c ${CONTAINER_NAME} -- ${CMD} ${ARG1} ${ARG2} ... ${ARGN}
Note: -c ${CONTAINER_NAME} is optional. You can omit it for pods that only contain a single container.
As an example, to look at the logs from a running Cassandra pod, you might run:
kubectl exec cassandra -- cat /var/log/cassandra/system.log
If none of these approaches work, you can find the host machine that the pod is running on and SSH into that host.
Finally, check Logging on Google Stackdriver.
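On GKE with Stackdriver, the logs of a deleted pod are usually still queryable; a sketch using the pod name from the events above (the filter labels may need adjusting to your setup):

```shell
gcloud logging read \
  'resource.type="k8s_container" AND resource.labels.pod_name="virulent-angelfish-cronjob-netsuite-proservices-15622200008gc42"' \
  --limit 50
```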
Debugging Pods
The first step in debugging a pod is taking a look at it. Check the current state of the pod and recent events with the following command:
kubectl describe pods ${POD_NAME}
Look at the state of the containers in the pod. Are they all Running? Have there been recent restarts?
Continue debugging depending on the state of the pods.
Debugging ReplicationControllers
ReplicationControllers are fairly straightforward. They can either create pods or they can’t. If they can’t create pods, then please refer to the instructions above to debug your pods.
You can also use kubectl describe rc ${CONTROLLER_NAME} to inspect events related to the replication controller.
Hope this helps you find the exact problem.
You can use the --previous flag to get the logs of the previous container instance.
So, you can use:
kubectl logs --previous virulent-angelfish-cronjob-netsuite-proservices-15622200008gc42
to get the logs of the container run that came before the current one. Note that this only works while the pod object still exists; once the pod has been deleted (as here), use the cluster's log aggregation (e.g. Stackdriver) instead.

Don't delete pods when rolling back a deployment

I would like to perform a rollback of a deployment in my environment.
Command:
kubectl rollout undo deployment/foo
The steps performed are:
create pods with the old configuration
delete old pods
Is there a way to not perform the last step? For example, a developer would like to check why the init command failed and debug.
I didn't find information about that in the documentation.
Yes, it is possible. Before doing the rollout, you first need to remove the labels (corresponding to the ReplicaSet controlling that pod) from the unhealthy pod. This way the pod won't belong to the deployment anymore, and even if you do a rollout, it will still be there. Example:
$kubectl get deployment
NAME READY UP-TO-DATE AVAILABLE AGE
sleeper 1/1 1 1 47h
$kubectl get pod --show-labels
NAME READY STATUS RESTARTS AGE LABELS
sleeper-d75b55fc9-87k5k 1/1 Running 0 5m46s pod-template-hash=d75b55fc9,run=sleeper
$kubectl label pod sleeper-d75b55fc9-87k5k pod-template-hash- run-
pod/sleeper-d75b55fc9-87k5k labeled
$kubectl get pod --show-labels
NAME READY STATUS RESTARTS AGE LABELS
sleeper-d75b55fc9-87k5k 1/1 Running 0 6m34s <none>
sleeper-d75b55fc9-swkj9 1/1 Running 0 3s pod-template-hash=d75b55fc9,run=sleeper
So what happens here: we have a pod, sleeper-d75b55fc9-87k5k, which belongs to the sleeper deployment. We remove all labels from it; the deployment detects that the pod "has gone", so it creates a new one, sleeper-d75b55fc9-swkj9, but the old one is still there and ready for debugging. Only the pod sleeper-d75b55fc9-swkj9 will be affected by the rollout.
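Once the debugging is done, remember that the de-labelled pod no longer has a controller, so you have to clean it up manually:

```shell
# the orphaned pod is not managed by any ReplicaSet anymore
kubectl delete pod sleeper-d75b55fc9-87k5k
```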

Kubernetes pods are pending not active

If I run this:
kubectl get pods -n kube-system
I get this output:
NAME READY STATUS RESTARTS AGE
coredns-6fdd4f6856-6bl64 0/1 Pending 0 1h
coredns-6fdd4f6856-xgrbm 0/1 Pending 0 1h
kubernetes-dashboard-65c76f6c97-c69jg 0/1 Pending 0 13m
Supposedly I need a Kubernetes scheduler in order to actually launch containers? Does anyone know how to start kube-scheduler?
More than a Kubernetes scheduler issue, this looks like your cluster not having enough resources on its nodes (or no nodes at all) to schedule any workloads. You can check your nodes with:
$ kubectl get nodes
Also, you are likely not able to see any control-plane resources in the kube-system namespace because you may be using a managed service like EKS or GKE.
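To see the concrete reason a pod stays Pending, describe it and read the Events section at the bottom; a sketch using one of the pods above (the messages shown are typical examples, not your actual output):

```shell
kubectl describe pod coredns-6fdd4f6856-6bl64 -n kube-system
# typical Events messages when scheduling fails:
#   0/1 nodes are available: 1 Insufficient cpu.
#   no nodes available to schedule pods
```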