Intermittent `CreateContainerError` on startup for cron jobs due to container runtime - kubernetes

I’m getting CreateContainerErrors for cron jobs and I want to better understand why this would happen in the container runtime.
I know this is an issue in the container runtime (docker-engine v20.10.6) because it is the only possible common cause that fits (see the common causes here). Specifically, subsequent containers are able to start up without issue.
The errors say something like:
Error from server (BadRequest): container "my-container" in pod "my-cron-27911235-69t4m"
is waiting to start: CreateContainerError
kubectl describe doesn't provide much more insight:
...
State: Waiting
Reason: CreateContainerError
Ready: False
Restart Count: 0
...
I'm running v1.22.15-gke.2500 on GCP.
Any help would be appreciated. Thanks!
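For reference, the usual places to dig for the full runtime error behind a CreateContainerError (namespace and node name are placeholders; this assumes a systemd-managed kubelet and Docker on the node):
kubectl get events -n <namespace> --field-selector involvedObject.name=my-cron-27911235-69t4m --sort-by=.lastTimestamp
# on the node that ran the pod (on GKE: gcloud compute ssh <node-name>)
journalctl -u kubelet --since "1 hour ago" | grep -i createcontainer
journalctl -u docker --since "1 hour ago"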

Related

Kubelet + prometheus: how to query if a pod is crashing?

I want to set up alerts for when any pod in my Kubernetes cluster is in a CrashLoopBackOff state. I'm running kubelet on Azure Kubernetes Service (AKS) and have set up a Prometheus Operator, which exposes metrics/cadvisor.
Other similar questions on this topic, such as this and this, are not relevant to kubelet-only setups. The recommended kube_pod_container_status_waiting_reason{}/kube_pod_status_phase{phase="Pending|Unknown|Failed"} and similar queries are not available to me with kubelet on AKS.
Kubelet exposes somewhat limited metrics; here is what I have tried:
Container state:
container_tasks_state{container='my_container', kubernetes_azure_com_cluster='my_cluster'}
This seems like it should be the right solution, but the state is always 0, whether the container is Running or in CrashLoopBackOff. This seems to be a known bug.
Time from start:
time() - container_start_time_seconds{kubernetes_azure_com_cluster='my_cluster', container='my_container'}
This lets us alert when a container has only been live for a short time; any pod that alerts repeatedly is crashing. It is inelegant, though: healthy containers also alert until they have lived long enough, and my alert channel becomes very noisy (a possible refinement is sketched below).
Detect exited containers:
kubelet_running_containers{kubernetes_azure_com_cluster='my_cluster', container_state='exited'}
Can detect a crashing container, but containers may also exit gracefully, so a notification on container exits is not very useful. We essentially get a 'container exited' alert and then need to manually check whether it was a crash or graceful exit.
Number of running pods:
kubelet_running_pods{kubernetes_azure_com_cluster='my_cluster'}
Does not change on a container crash.
Scrape error:
container_scrape_error{kubernetes_azure_com_cluster='my_cluster'}
Again, does not change on a container crash.
Which query will allow me to discover whether a pod has entered the CrashLoopBackOff state?
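Not from the question, but one possible refinement of the "time from start" approach above, assuming only kubelet/cAdvisor metrics are available: a CrashLoopBackOff shows up as the same pod's container being restarted over and over, and each restart typically creates a new container_start_time_seconds series (with a fresh id label), so the series are first collapsed per pod and the number of jumps is counted. The threshold (more than 2 restarts in 15 minutes) is arbitrary:
changes(max by (namespace, pod) (container_start_time_seconds{kubernetes_azure_com_cluster='my_cluster', container='my_container'})[15m:1m]) > 2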

My pods get SIGTERM and exit gracefully via their signal handler, but I can't find the root cause of why kubelet sends SIGTERM to them

My pods are getting SIGTERM automatically for an unknown reason. The issue I need to solve is that I can't find the root cause of why kubelet sends SIGTERM to my pods.
When I run kubectl describe pod <podname> -n <namespace>, only a Killing event is present under the Events section. I don't see any unhealthy status before the kill event.
Is there any way to debug further via pod events, or are there specific log files where we can find a trace of the reason for sending SIGTERM?
I tried to run kubectl describe on the Killing event, but there seems to be no such command to drill down into events further.
Any other approach to debugging this issue is appreciated. Thanks in advance!
kubectl describe pods snippet
Please can you share the YAML of your deployment so we can try to replicate your problem?
Based on your attached screenshot, it looks like your readiness probe failed to complete repeatedly (it didn't run and fail, it failed to complete entirely), and therefore the cluster killed it.
Without knowing what your Docker image is doing, it's hard to debug further from here.
As a first point of debugging, you can try doing kubectl logs -f -n {namespace} {pod-name} to see what the pod is doing and seeing if it's erroring there.
The error Client.Timeout exceeded while waiting for headers implies your container is proxying something? If so, perhaps the upstream you're trying to proxy to isn't responding.
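As a rough sketch of those first debugging steps (namespace and pod name are placeholders), the pod's events and the previous container's logs are usually the quickest way to see whether the probe timed out or the container died on its own:
kubectl logs -f -n <namespace> <pod-name>
kubectl logs --previous -n <namespace> <pod-name>
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].readinessProbe}'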

Kubernetes Deployment/Pod/Container statuses

I am currently working on a monitoring service that will monitor Kubernetes deployments and their pods. I want to notify users when a deployment is not running the expected number of replicas and also when pods' containers restart unexpectedly. These may not be the right things to monitor, and I would greatly appreciate some feedback on what I should be monitoring.
Anyway, the main question is about the differences between all of the statuses of pods, and when I say statuses I mean the Status column when running kubectl get pods. The statuses in question are:
- ContainerCreating
- ImagePullBackOff
- Pending
- CrashLoopBackOff
- Error
- Running
What causes pod/containers to go into these states?
For the first four Statuses, are these states recoverable without user interaction?
What is the threshold for a CrashLoopBackOff?
Is Running the only status that has a Ready Condition of True?
Any feedback would be greatly appreciated!
Also, would it be bad practice to use kubectl in an automated script for monitoring purposes? For example, every minute log the results of kubectl get pods to Elasticsearch?
You can see the pod lifecycle details in the k8s documentation.
The recommended way of monitoring a Kubernetes cluster and its applications is with Prometheus.
I will try to explain what I see hidden behind these terms:
ContainerCreating
Shown while we wait for the image to be downloaded and the container to be created by Docker or another runtime.
ImagePullBackOff
Shown when there is a problem downloading the image from a registry, for example wrong credentials for logging in to Docker Hub.
Pending
The container is starting (if startup takes time), or it has started but the readinessProbe failed.
CrashLoopBackOff
This status is shown when container restarts occur too often. For example, a process tries to read a file that doesn't exist and crashes; the container is then recreated by Kubernetes and the cycle repeats.
Error
This is pretty clear: there was some error running the container.
Running
All is good: the container is running and the livenessProbe is OK.
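On the monitoring half of the question: if kube-state-metrics is installed and scraped by Prometheus (an assumption, it is not mentioned in the question), the two things you want to watch can be expressed as queries rather than by scripting kubectl, for example:
# deployments not running the expected number of replicas
kube_deployment_spec_replicas != kube_deployment_status_replicas_available
# containers restarting unexpectedly (CrashLoopBackOff shows up as a climbing restart count)
increase(kube_pod_container_status_restarts_total[15m]) > 3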

Kubernetes Node NotReady: ContainerGCFailed / ImageGCFailed context deadline exceeded

Worker node is getting into "NotReady" state with an error in the output of kubectl describe node:
ContainerGCFailed rpc error: code = DeadlineExceeded desc = context deadline exceeded
Environment:
Ubuntu 16.04 LTS
Kubernetes version: v1.13.3
Docker version: 18.06.1-ce
There is a closed issue on this on the Kubernetes GitHub (k8 git), which was closed on the grounds of being related to a Docker issue.
Steps done to troubleshoot the issue:
kubectl describe node - the error in question was found (root cause isn't clear).
journalctl -u kubelet - shows this related message:
skipping pod synchronization - [container runtime status check may not have completed yet PLEG is not healthy: pleg has yet to be successful]
it is related to this open k8s issue: Ready/NotReady with PLEG issues
Check node health on AWS with cloudwatch - everything seems to be fine.
journalctl -fu docker.service: checked Docker for errors/issues -
the output doesn't show any errors related to that.
systemctl restart docker - after restarting docker, the node gets into "Ready" state but in 3-5 minutes becomes "NotReady" again.
It all seems to have started when I deployed more pods to the node (close to its resource capacity, though I don't think it is a direct dependency) or when I was stopping/starting instances (after a restart it is OK, but after some time the node becomes NotReady again).
Questions:
What is the root cause of the error?
How to monitor that kind of issue and make sure it doesn't happen?
Are there any workarounds to this problem?
What is the root cause of the error?
From what I was able to find it seems like the error happens when there is an issue contacting Docker, either because it is overloaded or because it is unresponsive. This is based on my experience and what has been mentioned in the GitHub issue you provided.
How to monitor that kind of issue and make sure it doesn't happen?
There seems to be no established mitigation or monitoring for this, but the best approach appears to be making sure your node will not be overloaded with pods. I have seen that the problem is not always reflected in the node's disk or memory pressure - it is probably a matter of Docker not having enough resources and failing to respond in time. The proposed solution is to set limits for your pods to prevent overloading the node.
In the case of managed Kubernetes on GKE (other vendors probably have a similar feature) there is node auto-repair. It will not prevent node pressure or Docker-related issues, but when it detects an unhealthy node it can drain and redeploy it.
If you already have requests and limits set, the best way to make sure this does not happen seems to be to increase the memory requests for your pods. This means fewer pods per node, so the actual memory used on each node should be lower.
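A minimal sketch of that suggestion, assuming a Deployment named my-app (a hypothetical name; the same values can be set in the manifest's resources block):
kubectl set resources deployment my-app --requests=memory=512Mi,cpu=250m --limits=memory=1Gi,cpu=500m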
Another way of monitoring/recognizing this is to SSH into the node and check memory, the processes with ps, the syslog, and the output of docker stats --all.
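Roughly, those node-level checks look like this when run on the affected worker over SSH:
docker stats --all --no-stream
ps aux --sort=-%mem | head -n 15
journalctl -u docker --since "30 min ago"
journalctl -u kubelet --since "30 min ago" | grep -i pleg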
I got the same issue. I cordoned the node and evicted the pods, then rebooted the server; the node automatically came back into the Ready state.
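For reference, that sequence looks roughly like this (node name is a placeholder; --delete-local-data matches kubectl v1.13, newer versions call it --delete-emptydir-data):
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-local-data
# reboot the node, then once it reports Ready again:
kubectl uncordon <node-name>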

Regarding Scheduling of K8S

This is about the kubelet status and the kube-scheduler's policy.
The kubelet on all eight of my workers reported Ready at the time I spawned containers via a ReplicationController (RC).
The scheduler reported that the RC was scheduled across all eight worker nodes, but the pod status stayed Pending.
I waited long enough for the image to be downloaded, but the state didn't change to Running. So I restarted the kubelet service on the worker that had the pending pod, and then all the pending pods changed to the Running state.
Scheduled (pod) -> Pending (pod) -> restart kubelet -> Running (pod)
Why was this resolved by restarting kubelet?
The kubelet log looks like this:
factory.go:71] Error trying to work out if we can handle /docker-daemon/docker: error inspecting container: No such container: docker
factory.go:71] Error trying to work out if we can handle /docker: error inspecting container: No such container: docker
factory.go:71] Error trying to work out if we can handle /: error inspecting container: unexpected end of JSON input
factory.go:71] Error trying to work out if we can handle /docker-daemon: error inspecting container: No such container: docker-daemon
factory.go:71] Error trying to work out if we can handle /kubelet: error inspecting container: No such container: kubelet
factory.go:71] Error trying to work out if we can handle /kube-proxy: error inspecting container: No such container: kube-proxy
Another symptom is shown in the picture below.
The scheduled pod works well, but the condition is False in the middle of the picture
(taken from "kubectl describe ~~").
It works well but shows False... what does the False mean?
Thanks
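As a side note on the False condition: in kubectl describe output that is normally the pod's Ready condition, which stays False until all containers are running and any readiness probes pass. It can be inspected directly (pod name is a placeholder):
kubectl get pod <pod-name> -o jsonpath='{.status.conditions}'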