filebeat pod restarting multiple times and not getting logs in kibana - elastic-stack

We are using ELK for logging and monitoring of our AKS cluster, but sometimes the Filebeat pod restarts and fails to ship logs to Elasticsearch.
(screenshot of the Filebeat pod restarts: https://i.stack.imgur.com/h5ABD.png)
Here is the pod log as well:
2021-08-09T12:10:04.191Z INFO [monitoring] log/log.go:145 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":5128640,"time":{"ms":722}},"total":{"ticks":10563900,"time":{"ms":1266},"value":10563900},"user":{"ticks":5435260,"time":{"ms":544}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":12},"info":{"ephemeral_id":"12737862-0ffc-4805-8e49-d06e61ae95ad","uptime":{"ms":228300048}},"memstats":{"gc_next":75621616,"memory_alloc":43143552,"memory_total":11734470608},"runtime":{"goroutines":30}},"filebeat":{"harvester":{"open_files":0,"running":0}},"libbeat":{"config":{"module":{"running":0}},"pipeline":{"clients":1,"events":{"active":1}}},"registrar":{"states":{"current":15061}},"system":{"load":{"1":2.41,"15":2.22,"5":2.36,"norm":{"1":0.6025,"15":0.555,"5":0.59}}}}}}
Could anybody suggest what might be causing the pod to restart multiple times, and what the ways to resolve this are?

Sometimes this happens because of the CPU and memory limits that have been set. Describe your pod to find the reason for the restarts; if the reason is OOMKilled, run the command below to check the current utilization of the Filebeat pod.
kubectl top pods -n <namespace>
Based on the output of the top command, adjust (increase or decrease) the CPU and memory limits in your pod manifest.
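As an illustration, here is a minimal sketch of where those limits live in a Filebeat DaemonSet manifest. The name, namespace, image tag and resource values below are assumptions for the example, not measured recommendations; adapt them to your own workload.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: filebeat                 # hypothetical name; match your actual DaemonSet
  namespace: logging             # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: filebeat
  template:
    metadata:
      labels:
        app: filebeat
    spec:
      containers:
        - name: filebeat
          image: docker.elastic.co/beats/filebeat:7.14.0   # example image tag
          resources:
            requests:
              cpu: 100m          # illustrative baseline request
              memory: 200Mi
            limits:
              cpu: 500m          # raise these if kubectl top shows the pod near
              memory: 500Mi      # its limit or the container is OOMKilled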

Related

Kubernetes on GCP - Getting message "Does not have minimum availability"

I deployed the images of the microservices currency-conversion and currency-exchange on Google Cloud, but in Kubernetes Engine I see that the pods/replica sets are not available.
When I check under the Workloads tab, I see that the service shows the message "Does not have minimum availability".
I added an additional availability zone to increase the resources, but that did not help.
How do I fix this?
There could be many reasons behind the failure:
Low resources, so pods are not starting or are stuck in Pending
Liveness or readiness probes failing for the pods
A ConfigMap or Secret that the pod requires to start is not available
You can describe the pod or check the pod's logs to debug the issue further:
kubectl describe pod <POD name> -n <Namespace name>
The pod is crashing, which is why you're getting "Does not have minimum availability".
You should look at the logs of the container first and see why it's crashing:
kubectl logs -n default {name of pod}
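Beyond the container logs, the events recorded for the pod usually state the crash reason (OOMKilled, failed probes, image pull errors, and so on). A small sketch of filtering them, with the pod name as a placeholder:
kubectl get events -n default --field-selector involvedObject.name=<pod-name> --sort-by=.metadata.creationTimestamp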

how to see the kubernetes container service log with a restarting pod

My Kubernetes (v1.15.x) deployment keeps restarting all the time. From the log output in the Kubernetes dashboard I could not see anything useful. I want to log into the pod and check the logs in my service's log directory, but the pod keeps restarting all the time and I never get a chance to log in.
Is there any way to log into the restarting pod, or to dump or inspect files inside it? I want to find out why the pod restarts all the time.
If you are running on GKE and logging is enabled, you get all container logs by default in the Stackdriver Logging dashboard.
For now, you can run kubectl describe pod <pod name> to check the exit status code of the container that terminated. The status code can help you understand the reason for the restart, i.e. whether it was due to an Error or being OOMKilled.
You can also use the --previous flag to get the logs of the previous (restarted) container.
Example :
kubectl logs <POD name> --previous
Note that in the above case --previous only works while the pod still exists inside the cluster.
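If you only want the exit code and reason rather than the full describe output, they can also be read directly from the pod status; a quick sketch, with the pod name and container index as placeholders:
kubectl get pod <POD name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'
kubectl get pod <POD name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'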
@HarshManvar is right, but I would like to provide you with some more options:
Debugging with an ephemeral debug container: Ephemeral containers are useful for interactive troubleshooting when kubectl exec is insufficient because a container has crashed or a container image doesn't include debugging utilities, such as with distroless images.
Debugging via a shell on the node: If none of these approaches work, you can find the host machine that the pod is running on and SSH into that host.
These two methods are useful when checking logs or exec'ing into the container would not be enough.
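As a rough sketch of both options (note that kubectl debug needs a reasonably recent kubectl and a cluster with ephemeral containers enabled, so it may not be available on an older cluster such as v1.15); pod, container and node names are placeholders:
# Attach an ephemeral debug container that shares the process namespace of the crashing container
kubectl debug -it <pod-name> --image=busybox --target=<container-name>
# Start an interactive debugging pod on the node itself; the node's filesystem is mounted under /host
kubectl debug node/<node-name> -it --image=busybox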

Is it possible to get the details of the node where the pod ran before restart?

I'm running a Kubernetes cluster of 20+ nodes, and one pod in a namespace got restarted. The pod was killed due to OOM with exit code 137 and restarted again as expected, but I would like to know the node on which the pod was running earlier. Is there any place we could check the logs for that info, like tiller, kubelet, kube-proxy etc.?
But would like to know the node in which the pod was running earlier.
If a pod is killed with ExitCode: 137, e.g. because it used more memory than its limit, it will be restarted on the same node - not rescheduled. So for this, check your metrics or container logs.
But pods can also be killed due to over-committing a node; see e.g. How to troubleshoot Kubernetes OOM and CPU Throttle.
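Since an OOM-killed container is restarted in place, the node shown for the pod right now is also the node it was running on before the restart; a quick sketch for reading it, with pod name and namespace as placeholders:
kubectl get pod <pod-name> -n <namespace> -o wide
# or just the node name:
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.nodeName}{"\n"}'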

How do I know why my SonarQube helm chart is getting auto-killed by Kubernetes

This question is about logging/monitoring.
I'm running a 3-node cluster on AKS, with 3 orgs: Dev, Test and Prod. The chart worked fine in Dev, but the same chart keeps getting killed by Kubernetes in Test; it keeps getting recreated and then killed again. Is there a way to extract details on why this is happening? All I see when I describe the pod is Reason: Killed.
Please share more details or suggestions on this. Thanks!
List Events sorted by timestamp
kubectl get events --sort-by=.metadata.creationTimestamp
There might be various reasons for it to be killed, e.g. insufficient resources or a failing liveness probe.
For SonarQube there are liveness and readiness probes configured, so one of them might be failing. Also, as described in the Helm chart's values:
If an ingress path other than the root (/) is defined, it should be reflected here
A trailing "/" must be included
You can also check whether there are sufficient resources on the node: check which node the pods are running on with kubectl get pods -o wide, and then run kubectl describe node <node-name> to check whether there is any disk or memory pressure.
You can also run kubectl logs <pod-name> and kubectl describe pod <pod-name>, which might give you some insight into the reason for the kill.
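If the kill turns out to be resource related, one option is to raise the chart's requests/limits when upgrading the release. This is only a sketch: it assumes your chart version exposes a top-level resources block in its values (check the chart's values.yaml), and the release name, chart, namespace and sizes below are placeholders:
helm upgrade <release-name> <chart> -n <namespace> --reuse-values \
  --set resources.requests.memory=2Gi \
  --set resources.limits.memory=4Gi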

HPA could not get CPU metric during GKE node auto-scaling

Cluster information:
Kubernetes version: 1.12.8-gke.10
Cloud being used: GKE
Installation method: gcloud
Host OS: (machine type) n1-standard-1
CNI and version: default
CRI and version: default
During node scaling, HPA couldn't get CPU metric.
At the same time, kubectl top pod and kubectl top node output is:
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get pods.metrics.k8s.io)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)
For more detail, here is the flow in which my problem occurs:
Suddenly many requests arrive at the GKE server (from a testing tool).
HPA detects that current CPU usage is above the target CPU usage (50%), and thus tries to scale the pods up incrementally.
An Insufficient CPU warning occurs while creating pods, and thus GKE tries to scale the nodes up incrementally.
Soon the HPA fails to get the metric, and kubectl top node or kubectl top pod does not get a response.
- At this time one or more OutOfcpu pods are found, and several pods are in ContainerCreating (from the Pending state).
After node scale-up is complete and some time has elapsed (about a few minutes), HPA starts to fetch the CPU metric successfully and tries to scale up/down based on the metric.
The same situation happens when nodes scale down.
This causes pod scaling to stall and leads to some failures in responding to clients' requests. Is this normal?
I think HPA should get the CPU metric (or other metrics) of running pods even during node scaling, to keep track of the optimal number of pods at that moment, so that when node scaling is done, HPA can create the necessary pods at once (rather than incrementally).
Can I make my cluster work like this?
Maybe your node runs out of one resource, either memory or CPU. There are ConfigMaps that describe how add-ons are scaled depending on the cluster size. You need to edit the metrics-server-config ConfigMap in the kube-system namespace:
kubectl edit cm/metrics-server-config -n kube-system
You should add
baseCPU
cpuPerNode
baseMemory
memoryPerNode
to the NannyConfiguration (an extensive manual for it is available).
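For reference, a sketch of what the edited ConfigMap might look like; the structure follows the addon-resizer's NannyConfiguration, and the resource values below are illustrative assumptions, not recommendations:
apiVersion: v1
kind: ConfigMap
metadata:
  name: metrics-server-config
  namespace: kube-system
data:
  NannyConfiguration: |-
    apiVersion: nannyconfig/v1alpha1
    kind: NannyConfiguration
    baseCPU: 100m        # illustrative values; tune for your cluster size
    cpuPerNode: 5m
    baseMemory: 100Mi
    memoryPerNode: 8Mi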
Heapster also suffers from the same OOM issue: too many pods to handle all metrics within the assigned resources. Please modify Heapster's ConfigMap accordingly:
kubectl edit cm/heapster-config -n kube-system