I had a few pods restarting in my EKS cluster. I could see that they were SIGKILL'ed by K8s. Now I would like to know the reason but I can't because the Kubernetes events TTL is only one hour.
I am checking the control plane logs for the EKS cluster in CloudWatch now but don't know which of them contains these messages as well.
Which of the logs does contain these events form K8s?
Yes you are right, the default value of --event-ttl is 60m00s, and unfortunately, there is currently no any native option to change that value in EKS. The github issue is still opened without any promising timeframes.
As per guide you sent and as per Streaming EKS Metrics and Logs to CloudWatch, if you configured everything correctly, you can find logs under “Container Insights” from the drop-down menu.
Logs you might want to check are
Control plane logs consist of scheduler logs, API server logs, and
audit logs.
Data plane logs consist of kubelet and container runtime
engine logs.
Can you please specify what exact logs you have in your cloudwatch control plane logs and what you already checked? Maybe that will help
Related
When I try to retrieve logs from my pods, I note that K8s does not print all the logs, and I know that because I observe that logs about microservice initialization are not present in the head of logs.
Considering that my pods print a lot of logs in a long observation period, does someone know if K8s has a limit in showing all logs?
I also tried to set --since parameter in the kubectl logs command to get all logs in a specific time range, but it seems to have no effect.
Thanks.
The container runtime engine typically manages container (pod) logs. Do check the settings on the runtime engine in use.
There seems to be an issue with the logging earlier. Attaching the link for the same. https://github.com/kubernetes/kubernetes/pull/78071
There are some answers, I'll add more details and sources.
The answer is quite short. There is no limit but free space. By default kubernetes is not responsible for log rotation:
An important consideration in node-level logging is implementing log
rotation, so that logs don't consume all available storage on the
node. Kubernetes is not responsible for rotating logs, but rather a
deployment tool should set up a solution to address that. For example,
in Kubernetes clusters, deployed by the kube-up.sh script, there is a
logrotate tool configured to run each hour. You can also set up a
container runtime to rotate an application's logs automatically.
As it was stated by William, Kubernetes itself doesn’t provide log aggregation of its own and it relies on container runtime by default.
When a container running on Kubernetes writes its logs to stdout or
stderr streams, they are picked up by the kubelet service running on
that node, and are delegated to the container engine for handling
based on the logging driver configured in Kubernetes.
In most cases, Docker container logs will end up in the
/var/log/containers directory on your host. Docker supports multiple
logging drivers but, unfortunately, Kubernetes API does not support
driver configuration.
Once a container terminates or restarts, kubelet keeps its logs on the
node. To prevent these files from consuming all of the host’s storage,
a log rotation mechanism should be set on the node.
Kubernetes doesn’t provide built-in log rotation, but this
functionality is available in many tools, such as Docker’s log-opt, or
standard file shippers or even a simple custom cron job. When a
container is evicted from the node, so are its corresponding log files
That means you can try to find full logs in /var/log/containers and var/log/pods. This part is from official documentation and more precise:
By default, if a container restarts, the kubelet keeps one terminated
container with its logs. If a pod is evicted from the node, all
corresponding containers are also evicted, along with their logs.
To have a good visibility and accessibility of logs you may consider having a dedicated solution for logs storing. E.g. node logging agent or streaming to a sidecar
Please find articles and official kubernetes documentation with concepts and examples:
Kubernetes logging architecture
Practical guide to kubernetes
I have a Kubernetes cluster and I am figuring out in what numbers have pods got scaled up using the Kubectl command.
what is the possible way to get the details of all the scaled-up and scaled-down pods within a month?
That is not information Kubernetes records. The Events system keeps some debugging messages that include stuff about pod startup and sometimes shutdown, but that's only kept for a few hours. For long term metrics look at something like Prometheus + kube-state-metrics.
If you have k8s audit Policy in place, you can find the Events that are passed through the k8s API by filtering them in Cloud Audit Logs or elasticsearch! it depends on you current setup!
So there's this page about auditing-logs and I'm very confused about:
The k8s.io service is used for Kubernetes audit logs. These logs are generated by the Kubernetes API Server component and they contain information about actions performed using the Kubernetes API. For example, any changes you make on a Kubernetes resource by using the kubectl command are recorded by the k8s.io service. For more information, see Auditing in the Kubernetes documentation.
The container.googleapis.com service is used for GKE control plane audit logs. These logs are generated by the GKE internal components and they contain information about actions performed using the GKE API. For example, any changes you perform on a GKE cluster configuration using a gcloud command are recorded by the container.googleapis.com service.
which one shall I pick to get:
/var/log/kube-apiserver.log - API Server, responsible for serving the API
/var/log/kube-controller-manager.log - Controller that manages replication controllers
or these are all similar to EKS where audit logs means a separate thing?
Audit (audit) – Kubernetes audit logs provide a record of the individual users, administrators, or system components that have affected your cluster. For more information, see Auditing in the Kubernetes documentation.
If the cluster still exists, you should be able to do the following on GKE
kubectl proxy
curl http://localhost:8001/logs/kube-apiserver.log
AFAIK, there's no way to get server logs for clusters that have been deleted.
Logs for GKE control-plane components are available since November 29, 2022 for clusters with versions 1.22.0 and later.
You simply need to activate it on the clusters. Either
via CLI:
gcloud container clusters update [CLUSTER_NAME] \
--region=[REGION] \
--monitoring=SYSTEM,WORKLOAD,API_SERVER,SCHEDULER,CONTROLLER_MANAGER
via GCP web-console: Open the cluster-details, in the section "Features" edit the entry "Cloud Logging" and add the "Control Plane" components.
See documentation for details.
Note the notes in the solutions documentation, especially about reaching the logging.googleapis.com/write_requests quota (quick link).
You cannot. GKE does not make them available. Audit logs are different, those are a record of API actions.
I have a scenario where I need to push application logs running on EKS Cluster to separate cloudwatch log streams. I have followed the below link, which pushes all logs to cloudwatch using fluentd. But the issue is, it pushes logs to a single log stream only.
https://github.com/aws-samples/aws-workshop-for-kubernetes
It also pushes all the logs under /var/lib/docker/container/*.log. How Can I filter this to can only application specific logs?
Collectord now supports AWS CloudWatch Logs (and S3/Athena/Glue). It gives you flexibility to choose to what LogGroup and LogStream you want to forward the data (if the default does work for you).
Installation instructions for CloudWatch
How you can specify LogGroup and LogStream with annotations
Highly recommend to read Setting up comprehensive centralized logging with AWS Services for Kubernetes
Is there a way to monitor the pod status and restart count of pods running in a GKE cluster with Stackdriver?
While I can see CPU, memory and disk usage metrics for all pods in Stackdriver there seems to be no way of getting metrics about crashing pods or pods in a replica set being restarted due to crashes.
I'm using a Kubernetes replica set to manage the pods, hence they are respawned and created with a new name when they crash. As far as I can tell the metrics in Stackdriver appear by pod-name (which is unique for the lifetime of the pod) which doesn't sound really sensible.
Alerting upon pod failures sounds like such a natural thing that it sounds hard to believe that this is not supported at the moment. The monitoring and alerting capabilities that I get from Stackdriver for Google Container Engine as they stand seem to be rather useless as they are all bound to pods whose lifetime can be very short.
So if this doesn't work out of the box are there known workarounds or best practices on how to monitor for continuously crashing pods?
You can achieve this manually with the following:
In Logs Viewer, creating the following filter:
resource.labels.project_id="<PROJECT_ID>"
resource.labels.cluster_name="<CLUSTER_NAME>"
resource.labels.namespace_name="<NAMESPACE, or default>"
jsonPayload.message:"failed liveness probe"
Create a metric by clicking on the Create Metric button above the filter input and filling up the details.
You may now track this metric in Stackdriver.
Would be happy to be informed of a built-in metric instead of this.
There is a built in metric now, so it's easy to dashboard and/or alert on it without setting up custom metrics
Metric: kubernetes.io/container/restart_count
Resource type: k8s_container
In my cluster (a bare-metal k8s cluster),I use kube-state-metrics https://github.com/kubernetes/kube-state-metrics to do what you want. This project belongs to kubernetes repo and it is quite easy to use. Once deployed u can use kube_pod_container_status_restarts this metrics to know if a container restarts
Others have commented on how to do this with metrics, which is the right solution if you have a very large number of crashing pods.
An alernative approach is to treat crashing pods as discrete events or even log-lines. You can do this with Robusta (disclaimer, I wrote this) with YAML like this:
triggers:
- on_pod_update: {}
actions:
- restart_loop_reporter:
restart_reason: CrashLoopBackOff
- image_pull_backoff_reporter:
rate_limit: 3600
sinks:
- slack
Here we're triggering an action named restart_loop_reporter whenever a pod updates. The data stream comes from the APIServer.
The restart_loop_reporter is an action which filters out non-crashing pods. Above it's configured to report only on CrashLoopBackOffs but you could remove that to report all crashes.
A benefit of doing it this way is that you can gather extra data about the crash automatically. For example, the above will fetch the pod's logs and forward them along with the crash report.
I'm sending the result here to Slack, but you could just as well send it to a structured output like Kafka (already builtin) or Stackdriver (not yet supported, but I can fix that if you like).
Remember that, you can always raise feature request if the options available are not enough.