EKS node moves to NodeNotReady state when running a batch job - kubernetes

I am running a batch job in my EKS cluster that trains an ML model, and the training runs for 8-10 hours. However, the node on which the job runs is killed and the job is restarted on a new node. I am monitoring the node in Prometheus, and there appears to have been no CPU or OOM issue.
My next bet was to look into the EKS CloudTrail logs, and right when the node is removed I see the events below.
kube-controller-manager log
controller_utils.go:179] Recording status change NodeNotReady event message for node XXX
controller_utils.go:121] Update ready status of pods on node [XXX]
event.go:274] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"XXX", UID:"1bf33ec8-41cd-434a-8579-3ba4b8cdd5f1", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeNotReady' Node XXX status is now: NodeNotReady
node_lifecycle_controller.go:917] Node XXX is unresponsive as of 2021-06-09 01:00:48.962450508 +0000 UTC m=+5151508.967069635. Adding it to the Taint queue.
I0609 01:00:48.962465 1 node_lifecycle_controller.go:917] Node XXX is unresponsive as of 2021-06-09 01:00:48.962450508 +0000 UTC m=+5151508.967069635. Adding it to the Taint queue.
node_lifecycle_controller.go:180] deleting node since it is no longer present in cloud provider: XXX
kube-scheduler log
node_tree.go:113] Removed node "XXX" in group "us-east-2:\x00:us-east-2b" from NodeTree
I checked the kubelet logs, but they do not contain any message about moving the node to NotReady status. I was at least expecting to see this message in the kubelet log: https://github.com/kubernetes/kubernetes/blob/e9de1b0221dd8687aba527e682fafc7c33370c09/pkg/kubelet/kubelet_node_status.go#L682
This makes me wonder whether the kubelet died, the node became unreachable, or the connection from the kube-apiserver to the kubelet on that node was lost.
I have been debugging this for days with no success.
Note: the batch job does eventually run successfully after the restart. Also, the issue is sporadic, i.e. sometimes the restart happens and sometimes it does not and the job finishes in the first run.

Are you using spot instances for your nodes? That could be one reason the node gets terminated: spot capacity is reclaimed when the spot/bid price changes. Try on-demand (dedicated) instances instead.
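If you are on EKS managed node groups, the capacity type is exposed as a node label, so you can quickly check whether the affected node is a spot instance. A minimal sketch, assuming managed node groups (self-managed nodes may not carry this label):
kubectl get nodes -L eks.amazonaws.com/capacityType
A value of SPOT in the extra column means AWS can reclaim that node at any time, which would match the sporadic NodeNotReady behaviour described above.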

Related

Kubelet + Prometheus: how to query if a pod is crashing?

I want to set up alerts when any pod in my Kubernetes cluster is in a CrashLoopBackOff state. I'm running the kubelet on Azure Kubernetes Service and have set up a Prometheus Operator, which exposes metrics/cadvisor.
Other similar questions on this topic, such as this and this, are not relevant to kubelet-only setups. The recommended kube_pod_container_status_waiting_reason{}/kube_pod_status_phase{phase="Pending|Unknown|Failed"} and similar queries are not available to me with the kubelet on AKS.
The kubelet exposes somewhat limited metrics; here is what I have tried:
Container state:
container_tasks_state{container='my_container', kubernetes_azure_com_cluster='my_cluster'}
This seems like it should be the right solution, but the state is always 0, whether the container is Running or in CrashLoopBackOff. This appears to be a known bug.
Time from start:
time() - container_start_time_seconds{kubernetes_azure_com_cluster='my_cluster', container='my_container'}
Here we can alert when the time a container has been live is low; any pod with a repeated alert is crashing. This is inelegant, as healthy containers will also alert until they have lived long enough, and my alert channel becomes very noisy.
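For what it's worth, this approach can be written as an alert expression directly. A rough sketch, reusing the label selectors from the query above (the 300-second threshold is an arbitrary choice, not something prescribed by Prometheus):
(time() - container_start_time_seconds{kubernetes_azure_com_cluster='my_cluster', container='my_container'}) < 300
Pairing this with the alerting rule's repeat behaviour, as described above, is what separates a genuine crash loop from a container that has simply just started.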
Detect exited containers:
kubelet_running_containers{kubernetes_azure_com_cluster='my_cluster', container_state='exited'}
This can detect a crashing container, but containers may also exit gracefully, so a notification on container exits is not very useful: we essentially get a 'container exited' alert and then need to check manually whether it was a crash or a graceful exit.
Number of running pods:
kubelet_running_pods{kubernetes_azure_com_cluster='my_cluster'}
Does not change on a container crash.
Scrape error:
container_scrape_error{kubernetes_azure_com_cluster='my_cluster'}
Again, does not change on a container crash.
Which query will allow me to discover whether a pod has entered the CrashLoopBackOff state?

Amazon EKS: pod scheduled on removed node

On our EKS cluster running K8s 1.20, we had an incident where the scheduler attempted to schedule a pod onto a node that had just been removed by the control plane.
Events log:
Node ip-10-65-94-144.eu-west-1.compute.internal event: Removing Node ip-10-65-94-144.eu-west-1.compute.internal from Controller
AttachVolume.Attach failed for volume "pvc-d80d24be-653f-45f0-90ca-321bda1ef0ab" : error finding instance ip-AA-BB-CC-DD.eu-west-1.compute.internal: "instance not found"
Control Plane:
controller_utils.go:185] Recording status change NodeNotReady event message for node ip-AA-BB-CC-DD.eu-west-1.compute.internal
This indicates a miscommunication between the Scheduler and the Control Plane.
Has anyone seen this before?
Any suggestions on how to prevent it, other than upgrading to a more recent K8s version?

Job still running even when deleting nodes

I created a two-node cluster and created a new Job using the busybox image that sleeps for 300 seconds. I checked which node this Job's pod was running on using
kubectl get pods -o wide
I deleted that node, but surprisingly the Job kept running on the same node until it finished. Is this normal behavior? If not, how can I fix it?
Jobs themselves aren't scheduled onto or run on nodes. The role of a Job is just to define a policy: it makes sure that a pod with a certain specification exists and runs until the task completes, whether it completes successfully or not.
When you create a Job, you declare a policy that the built-in job controller acts on by creating a pod. The built-in kube-scheduler then sees this pod without a node and patches it with a node's identity. The kubelet sees a pod bound to a node matching its own identity, and so a container is started. As long as the container is running, the control plane knows that the node and the pod still exist.
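As a concrete illustration of that flow, here is a minimal sketch using the busybox job from the question (the job name sleeper is made up for the example):
kubectl create job sleeper --image=busybox -- sleep 300
# the job controller creates a pod; the scheduler then binds it to a node
kubectl get pods -o wide --selector=job-name=sleeper
# the binding is recorded on the pod spec itself
kubectl get pod --selector=job-name=sleeper -o jsonpath='{.items[0].spec.nodeName}{"\n"}'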
There are two ways of taking a node away: with a drain and without one. Breaking a node without draining is equivalent to a network cut or a server crash: the API server keeps the node resource around for a while, but it ceases to be Ready, and its pods are then terminated slowly. When you drain a node, however, it is as if you first prevent new pods from being scheduled onto the node and then delete the existing pods with kubectl delete pod.
Either way, the pods end up deleted, and you are left with a Job that hasn't run to completion and no longer has a pod. The job controller therefore creates a new pod for the Job, the Job's failed-attempts counter is incremented by 1, and the loop starts over.
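For completeness, a sketch of the drain path described above; the node name is hypothetical, and on older kubectl versions --delete-emptydir-data is spelled --delete-local-data:
kubectl cordon my-node
# evict the pods (drain also cordons implicitly, so the previous step is optional)
kubectl drain my-node --ignore-daemonsets --delete-emptydir-data
# only now remove the node object
kubectl delete node my-node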

How can a pod have status ready and terminating?

Curiously, I saw that one of my pods had both READY 1/1 and a status of Terminating when I ran kubectl get pods. Aren't these states mutually exclusive? Why or why not?
For context, this was noticed immediately after I had killed skaffold, so these pods were in the middle of shutting down.
When pods are in the Terminating state, they can still be functioning. Termination can be delayed for many reasons (e.g. a PVC is attached, other pods are being terminated at the same time, etc.). You can test this by running the following on a pod that has a PVC attached or some other reason to terminate with a delay:
$ kubectl delete pod mypod-xxxxx-xxxxxx
pod mypod-xxxxx-xxxxxx deleted
$ kubectl delete pod mypod-xxxxx-xxxxxx
pod mypod-xxxxx-xxxxxx deleted
$ kubectl apply -f mypod.yaml
pod mypod-xxxxx-xxxxxx configured
Sometimes this happens because the pod is still within its termination period and is functioning normally, so it is treated as an existing pod that gets configured (setting aside the fact that you usually can't reconfigure pods like this, but you get the point).
The READY column shows how many of the pod's containers are up.
The status Terminating means no more traffic is being sent to that pod by the controllers. From the Kubernetes docs:
When a user requests deletion of a pod, the system records the intended grace period before the pod is allowed to be forcefully killed, and a TERM signal is sent to the main process in each container. Once the grace period has expired, the KILL signal is sent to those processes, and the pod is then deleted from the API server.
That's the state it is in: the containers are up, finishing whatever work they already had, and a TERM signal has been sent.
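You can see both pieces of state on the pod at once: a Terminating pod has metadata.deletionTimestamp set while its containers may still report ready. A small sketch, using the hypothetical pod name from above:
kubectl get pod mypod-xxxxx-xxxxxx -o jsonpath='{.metadata.deletionTimestamp}{"  "}{.status.containerStatuses[*].ready}{"\n"}'
A non-empty timestamp next to true is exactly the READY 1/1 plus Terminating combination from the question.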
I want to add to nrxr's answer:
The status Terminating means no more traffic is being sent to that pod by the controllers.
That is what we want, but in reality it is not always like that: the pod may have terminated completely while traffic is still being forwarded to it.
For details, please read this blog post: https://learnk8s.io/graceful-shutdown.

Kubernetes node NotReady perhaps due to invalid memory address or nil pointer dereference in reconciler.go

My experimental two-node Kubernetes 1.13.2 cluster has unfortunately entered a state where the second node is NotReady. I've tried systemctl restart kubelet on both nodes, but this has not helped so far.
journalctl -u kubelet.service on the master node ends with this line:
reconciler.go:154] Reconciler: start to sync state
journalctl -u kubelet.service on the second node contains these lines:
server.go:999] Started kubelet
kubelet.go:1308] Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data in memory cache
kubelet.go:1412] No api server defined - no node status update will be sent.
fs_resource_analyzer.go:66] Starting FS ResourceAnalyzer
status_manager.go:148] Kubernetes client is nil, not starting status manager.
kubelet.go:1829] Starting kubelet main sync loop.
kubelet.go:1846] skipping pod synchronization - [container runtime status check may not have completed yet PLEG is not healthy: pleg has yet to be successful]
server.go:137] Starting to listen on 0.0.0.0:10250
server.go:333] Adding debug handlers to kubelet server.
volume_manager.go:248] Starting Kubelet Volume Manager
desired_state_of_world_populator.go:130] Desired state populator starts to run
container.go:409] Failed to create summary reader for "/system.slice/atop.service": none of the resources are being tracked.
kubelet.go:1846] skipping pod synchronization - [container runtime status check may not have completed yet]
kubelet_node_status.go:278] Setting node annotation to enable volume controller attach/detach
cpu_manager.go:155] [cpumanager] starting with none policy
cpu_manager.go:156] [cpumanager] reconciling every 10s
policy_none.go:42] [cpumanager] none policy: Start
kubelet_node_status.go:278] Setting node annotation to enable volume controller attach/detach
reconciler.go:154] Reconciler: start to sync state
runtime.go:69] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/workspace/anago-v1.13.2-beta.0.75+cff46ab41ff0bb/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/workspace/anago-v1.13.2-beta.0.75+cff46ab41ff0bb/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/workspace/anago-v1.13.2-beta.0.75+cff46ab41ff0bb/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/asm_amd64.s:522
/usr/local/go/src/runtime/panic.go:513
/usr/local/go/src/runtime/panic.go:82
/usr/local/go/src/runtime/signal_unix.go:390
/workspace/anago-v1.13.2-beta.0.75+cff46ab41ff0bb/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/volumemanager/reconciler/reconciler.go:562
/workspace/anago-v1.13.2-beta.0.75+cff46ab41ff0bb/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/volumemanager/reconciler/reconciler.go:599
/workspace/anago-v1.13.2-beta.0.75+cff46ab41ff0bb/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/volumemanager/reconciler/reconciler.go:419
/workspace/anago-v1.13.2-beta.0.75+cff46ab41ff0bb/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/volumemanager/reconciler/reconciler.go:330
/workspace/anago-v1.13.2-beta.0.75+cff46ab41ff0bb/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/volumemanager/reconciler/reconciler.go:155
/workspace/anago-v1.13.2-beta.0.75+cff46ab41ff0bb/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/workspace/anago-v1.13.2-beta.0.75+cff46ab41ff0bb/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/workspace/anago-v1.13.2-beta.0.75+cff46ab41ff0bb/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/workspace/anago-v1.13.2-beta.0.75+cff46ab41ff0bb/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/volumemanager/reconciler/reconciler.go:143
/usr/local/go/src/runtime/asm_amd64.s:1333
kubelet_node_status.go:278] Setting node annotation to enable volume controller attach/detach
kubelet_node_status.go:278] Setting node annotation to enable volume controller attach/detach
...
What's going on here and how can I remedy this situation?
UPDATE: I've now stopped kubelet.service on both nodes and then removed all Docker containers (docker rm) and images (docker rmi), also on both nodes. I've then restarted kubelet.service on the master only. Kubernetes apparently pulls in all of its own images and starts them again, but I now see multiple connection refused errors in journalctl -u kubelet.service, such as:
kubelet_node_status.go:94] Unable to register node "my_node" with API server: Post https://my_master:6443/api/v1/nodes: dial tcp my_master:6443: connect: connection refused
kubectl get nodes can still access the API server normally, so perhaps this is closer to the root cause of what I am facing. Is it correct to assume that at this point the kubelet is trying to connect to the API server? How can I check whether the relevant credentials are still intact?
It is kind of hard to troubleshoot without seeing the whole configuration, but try to follow this checklist:
Make sure the kubeconfig is set up correctly.
Have you enabled all services on both the master and the worker node (systemctl enable {component})?
Check that your kubelet.service points at the right public IP of your api-server (this seems to match the connection refused error in your last update), unless everything runs on the same machine (localhost); a minimal sketch of such checks follows below.
These links were useful for me when bootstrapping a k8s cluster: https://medium.com/containerum/4-ways-to-bootstrap-a-kubernetes-cluster-de0d5150a1e4 and https://github.com/kelseyhightower/kubernetes-the-hard-way/blob/master/docs/09-bootstrapping-kubernetes-workers.md
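A minimal sketch of those checks on the worker node, assuming a kubeadm-style layout (the paths below are common kubeadm defaults and the my_master address comes from the log in the question; both may differ on your installation):
# where does the kubelet think the API server lives?
sudo grep server /etc/kubernetes/kubelet.conf
# is the API server reachable from this node at all?
curl -k https://my_master:6443/healthz
# has the kubelet's client certificate expired?
sudo openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates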
The issue remained unexplained and I had to recreate the cluster with kubeadm reset, etc.