How to debug an ECS Fargate service that occasionally restarts tasks due to unhealthy Elastic Load Balancer health checks

I'm hosting a Shiny app on ECS Fargate. It works fairly well, but occasionally the app crashes while in use. I traced it to the following in the events tab:
service YYYY has started 1 tasks: task XXX
service YYYY has stopped 1 running tasks: task XXX
service YYYY deregistered 1 targets in target-group (Name of Elastic Load Balancer)
service YYYY (port 3838) is unhealthy in target-group (Name of Elastic Load Balancer) due to (reason Request timed out).
Does anyone know what might be causing this?
Or alternatively how can I investigate this further?
Could this be linked to spikes in CPU utilization within the application?
I've seen that at certain times CPU utilization spikes to 100%.
So if a user uses the application in a way that causes this high utilization, could that cause the container to be deemed unhealthy?
Also, auto-scaling is enabled for the application for when CPU > 50% - however it is not being triggered in the moments when CPU utilization spikes to 100%. Any ideas?

You can get details about stopped tasks in the ECS Console:
Cluster -> Tasks -> Stopped, then click into the specific task.
Additionally, in that tab you can get the logs of the container if you have configured the appropriate log driver in the task definition.
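If you prefer the CLI, here is a rough sketch of pulling the stopped reason for recent tasks (the cluster name, service name and task ARN below are placeholders):
# List recently stopped tasks for the service
aws ecs list-tasks --cluster my-cluster --service-name my-service --desired-status STOPPED
# Describe one of them and look at stoppedReason and the container exit codes
aws ecs describe-tasks --cluster my-cluster --tasks <task-arn> \
  --query 'tasks[].{stoppedReason:stoppedReason,exitCodes:containers[].exitCode}'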

Does the application write any logs? Make sure those logs are sent to the container's console so they show up in CloudWatch Logs for ECS.
Add the following to your Dockerfile to get logs to output to the console:
# Symlink the application's log files to the container's stdout so the log driver picks them up
RUN ln -sf /proc/self/fd/1 /var/log/mylocation/mylogfile.log && \
    ln -sf /proc/self/fd/1 /var/log/mylocation/myerrorfile.log
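If the task definition uses the awslogs log driver, you can then confirm the output actually arrives in CloudWatch. A small sketch, assuming AWS CLI v2 and a placeholder log group name:
# Follow the service's CloudWatch log group to verify the app output shows up
aws logs tail /ecs/my-service --follow --since 1h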

Related

Periodic problem with kubernetes: An error: "The connection to the server x.x.x.:6443 was refused - did you specify the right host or port?"

Good afternoon!
Setting up my first k8s cluster :)
I set up a virtual machine in VMware, set up a control plane, connected a worker node, and set up kubectl on the master node and on a laptop (the one VMware is installed on). I observe the following problem: periodically, every 2-5 minutes, the api-server stops responding. When you run any kubectl command (for example, "kubectl get nodes"), an error appears: "The connection to the server 192.168.131.133:6443 was refused - did you specify the right host or port?" A few minutes pass and everything is restored; in response to "kubectl get nodes" the system shows the nodes. A few more minutes and the same error appears again. The error appears synchronously on both the master node and the laptop.
This is what it looks like (for about 10 minutes):
At the same time, if you execute commands on the master node
$ sudo systemctl stop kubelet
$ sudo systemctl start kubelet
everything is immediately restored. And after a few minutes the same error appears again.
I would be grateful if you could help interpret these logs and tell me how to fix this problem?
kubectl logs at the time of the error (20:42:55):
I could imagine that the process on 192.168.131.133 is restarting, which leads to a "connection refused" while it is no longer listening on the API port.
You should start by investigating whether you can see any hardware issues:
either CPU usage is climbing and leading to a restart, or there is a memory leak.
You can check the running processes with ps -ef, and use top to see CPU consumption.
There should be some logs and events available in k8s as well.
It does not seem to be a connectivity issue, since you are receiving a clear failure back.
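For example, on the master node you could check whether the kubelet or the API server container keeps getting restarted, and whether the machine is short on memory. A rough sketch, assuming a Docker-based kubeadm-style setup:
# Is the kubelet logging errors or restarts?
sudo journalctl -u kubelet --since "15 minutes ago" | grep -iE 'error|fail|oom'
# Is the kube-apiserver container being recreated?
sudo docker ps -a | grep kube-apiserver
# Memory pressure on the node
free -m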

How to find the reason of a pod crashing?

Is there a way to see why a Kubernetes pod is failing with the status "CrashLoopBackOff" under heavy load?
I have a HorizontalPodAutoscaler which never kicks in. In its status it always shows low (under 50%) CPU and memory usage.
Tailing the application logs within the pods doesn't give any insights either.
Try looking at the Kubernetes events: kubectl get events --sort-by='.lastTimestamp'
If you don't get anything meaningful out of the events, go to the specific node and check the kubelet logs: journalctl -u kubelet
To get logs from a pod you should use:
kubectl logs [podname] -p
You can also look at the kubelet logs, but those are mostly cluster-level logs.
If there are no logs, that means your application did not produce any logs before the crash. You would need to modify the app, for example to add a memory dump on crash.
You mentioned that the pod is dying under heavy load but the stats show only 50% utilization. You should log into the pod and check the load yourself; also check how many files are open, because you may be hitting a limit.
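For example, a rough sketch of checking the last crash reason and the open file count (the pod name my-pod is a placeholder, and the exec commands assume the image contains a shell):
# Why did the previous container instance die? (exit code, OOMKilled, etc.)
kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState}'
# Count open file descriptors of the main process inside the pod
kubectl exec my-pod -- sh -c 'ls /proc/1/fd | wc -l'
# Compare against the process limits
kubectl exec my-pod -- cat /proc/1/limits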
You can read the Kubernetes docs about Application Introspection and Debugging and go over Debugging CrashLoopBackoffs with Init-Containers.
You can also try running your image in Docker and checking logs there. There is a nice documentation about Logs and troubleshooting available.
If you provide more details we might be more helpful.
Below are some obvious reasons for CrashLoopBackOff which I have observed:
- waiting for some condition to be fulfilled, e.g. some secrets, a failing healthcheck, etc.
- the pod is running with Burstable or BestEffort QoS and is getting killed due to non-availability of resources on the node (see the commands after this list)
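A quick way to check the second case (pod name is a placeholder):
# Which QoS class did the pod end up with?
kubectl get pod my-pod -o jsonpath='{.status.qosClass}'
# Was the last restart an OOM kill?
kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'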
You can run this script to find the possible issues for pods in a namespace: https://github.com/dguyhasnoname/k8s-day2-ops/blob/master/namespace_scripts/debug_app_namespace.sh

Kubernetes Deployment/Pod/Container statuses

I am currently working on a monitoring service that will monitor Kubernetes' deployments and their pods. I want to notify users when a deployment is not running the expected amount of replicas and also when pods' containers restart unexpectedly. This may not be the right things to monitor and I would greatly appreciate some feedback on what I should be monitoring.
Anyways, the main question is the differences between all of the Statuses of pods. And when I say Statuses I mean the Status column when running kubectl get pods. The statuses in question are:
- ContainerCreating
- ImagePullBackOff
- Pending
- CrashLoopBackOff
- Error
- Running
What causes pod/containers to go into these states?
For the first four Statuses, are these states recoverable without user interaction?
What is the threshold for a CrashLoopBackOff?
Is Running the only status that has a Ready Condition of True?
Any feedback would be greatly appreciated!
Also, would it be bad practice to use kubectl in an automated script for monitoring purposes? For example, every minute log the results of kubectl get pods to Elasticsearch?
You can see the pod lifecycle details in the k8s documentation.
The recommended way of monitoring a Kubernetes cluster and its applications is with Prometheus.
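If you do script around kubectl rather than using Prometheus, a lighter-weight option than logging the full output every minute is to stream status changes as they happen, e.g.:
# Stream pod status changes instead of polling
kubectl get pods --all-namespaces -o wide --watch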
I will try to explain what is hidden behind these terms:
ContainerCreating
Shown while we wait for the image to be downloaded and the container to be created by Docker or another container runtime.
ImagePullBackOff
Shown when there is a problem downloading the image from a registry - wrong credentials for Docker Hub, for example.
Pending
The container is still starting (if startup takes time), or it has started but the readinessProbe is failing.
CrashLoopBackOff
This status shows when container restarts happen too often. For example, a process tries to read a file that does not exist and crashes; the container is then recreated by Kubernetes and the cycle repeats.
Error
This is pretty clear: there was some error running the container.
Running
All is good: the container is running and the livenessProbe is OK.
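To see which of these states a pod is in and why, kubectl describe and a jsonpath query are handy (the pod name is a placeholder):
# Human-readable state, events and probe failures
kubectl describe pod my-pod
# The Ready condition mentioned in the question, in scriptable form
kubectl get pod my-pod -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'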

Kubernetes Node NotReady: ContainerGCFailed / ImageGCFailed context deadline exceeded

Worker node is getting into "NotReady" state with an error in the output of kubectl describe node:
ContainerGCFailed rpc error: code = DeadlineExceeded desc = context deadline exceeded
Environment:
Ubuntu 16.04 LTS
Kubernetes version: v1.13.3
Docker version: 18.06.1-ce
There is a closed issue about this on the Kubernetes GitHub, which was closed on the grounds of being related to a Docker issue.
Steps done to troubleshoot the issue:
kubectl describe node - the error in question was found (the root cause isn't clear).
journalctl -u kubelet - shows this related message:
skipping pod synchronization - [container runtime status check may not have completed yet PLEG is not healthy: pleg has yet to be successful]
This is related to the open k8s issue "Ready/NotReady with PLEG issues".
Checked node health on AWS with CloudWatch - everything seems to be fine.
journalctl -fu docker.service - checked Docker for errors/issues; the output doesn't show any errors related to that.
systemctl restart docker - after restarting Docker, the node gets into the "Ready" state, but in 3-5 minutes becomes "NotReady" again.
It all seems to start when I deploy more pods to the node (close to its resource capacity, but I don't think that is a direct dependency) or when stopping/starting instances (after a restart it is OK, but after some time the node becomes NotReady again).
Questions:
What is the root cause of the error?
How to monitor that kind of issue and make sure it doesn't happen?
Are there any workarounds to this problem?
What is the root cause of the error?
From what I was able to find it seems like the error happens when there is an issue contacting Docker, either because it is overloaded or because it is unresponsive. This is based on my experience and what has been mentioned in the GitHub issue you provided.
How to monitor that kind of issue and make sure it doesn't happen?
There seems to be no clear mitigation or monitoring for this, but the best approach would be to make sure your node does not get overloaded with pods. I have seen that it is not always shown as disk or memory pressure on the Node - this is probably a matter of not enough resources being left for Docker, so it fails to respond in time. The proposed solution is to set limits for your pods to prevent overloading the Node.
In case of managed Kubernetes on GKE (not sure, but other vendors probably have a similar feature) there is a feature called node auto-repair. It will not prevent node pressure or Docker-related issues, but when it detects an unhealthy node it can drain and redeploy the node(s).
If you already have requests and limits, it seems like the best way to make sure this does not happen is to increase memory requests for the pods. This will mean fewer pods per node, and the actual used memory on each node should be lower.
Another way of monitoring/recognizing this is to SSH into the node and check the memory, the processes with ps, the syslog, and the output of docker stats --all.
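As a minimal sketch of the requests/limits suggestion above (the deployment name and the values are placeholders you would tune for your workload):
# Give each pod an explicit cpu/memory request and limit
kubectl set resources deployment my-deployment \
  --requests=cpu=250m,memory=512Mi --limits=cpu=500m,memory=1Gi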
I got the same issue. I cordoned the node and evicted the pods, then rebooted the server; the node automatically came back into the Ready state.
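For reference, that cordon/evict/uncordon sequence looks roughly like this (the node name is a placeholder):
# Stop new pods from landing on the node, then evict the existing ones
kubectl cordon my-node
kubectl drain my-node --ignore-daemonsets
# ... reboot the server, then allow scheduling again ...
kubectl uncordon my-node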

Kubernetes event logs

As part of debugging, I need to track down events like pod creation and removal. In my Kubernetes setup I am using logging level 5.
The kube API server, scheduler, controller and etcd are running on the master node, and the minion nodes run the kubelet and Docker.
I am using journalctl to get K8s logs on the master node as well as on the worker nodes. On the worker nodes I can see logs from Docker and the kubelet. These logs contain the events I would expect as I create and destroy pods.
However, on the master node I don't see any relevant logs which would indicate a pod creation or removal request being handled.
What other logs or methods can I use to get such logs from the Kubernetes master components (API server, controller, scheduler, etcd)?
I have checked the logs from the API server, controller, scheduler and etcd pods; they don't seem to have such information.
Thanks
System component logs:
There are two types of system components:
those that run in a container
and those that do not run in a container.
For example:
The Kubernetes scheduler and kube-proxy run in a container
The kubelet and container runtime, for example Docker, do not run in containers.
On machines with systemd, the kubelet and container runtime write to journald. If systemd is not present, they write to .log files in the /var/log directory. System components inside containers always write to the /var/log directory, bypassing the default logging mechanism. They use the klog logging library.
Master component logs:
Get them from the containers running on the master nodes.
$ docker ps | grep apiserver
d6af65a248f1 af20925d51a3 "kube-apiserver --ad…" 2 weeks ago Up 2 weeks k8s_kube-apiserver_kube-apiserver-minikube_kube-system_177a3eb80503eddadcdf8ec0423d04b9_0
5f0e6b33a29f k8s.gcr.io/pause-amd64:3.1 "/pause" 2 weeks ago Up 2 weeks k8s_POD_kube-apiserver-minikube_kube-system_177a3eb80503eddadcdf8ec0423d04b9_0
$ docker logs -f d6a
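On kubeadm-style clusters the control-plane components run as static pods, so you can also read the same logs through kubectl instead of docker (the pod name suffix is your master node's name, shown here as a placeholder):
# List the control-plane pods, then follow the API server log
kubectl -n kube-system get pods
kubectl -n kube-system logs -f kube-apiserver-my-master-node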
But all of these approaches to logging are just for testing; you should stream all the logs (app logs, container logs, cluster-level logs, everything) to a central logging system such as ELK or EFK.