Realtime monitoring of CPU usage/CPU limit in k8s container - kubernetes

I am trying to measure the CPU usage of a container in Kubernetes, represented as the ratio between actual usage and the usage limit over a short time window. Ideally this should be close to real-time (up to a 5 s delay).
I have full control of the container code and I can also extend the pod with a sidecar container to do reporting for me.
I have looked at Prometheus deployed using the Prometheus operator, but I am seeing the data land with large delays or even not show up at all for some pods.
I was hoping somebody could shed some light on how to implement any of those:
sidecar container that can query cpu usage/cpu limit and send the data to another service (I am worried that this is impossible, because containers run in isolated file systems).
another process within the main container that can do the reporting. Maybe dividing $(cat /sys/fs/cgroup/cpu/cpuacct.usage) / $(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us) would do the trick?
use some existing software tool/service to achieve this. Any recommendation would be appreciated.
Thank you very much!

Deploy a sidecar container alongside the container you want to monitor. The sidecar should monitor the main container's CPU and push the readings to Prometheus or some other monitoring service. With alerting you can set thresholds, and if the CPU goes over a threshold Prometheus will trigger an alert action through the Alertmanager service.
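As a rough sketch of the in-container reporting idea from the question (reading the container's own cgroup files), something like the following could compute the usage/limit ratio over a short window. It assumes cgroup v1 paths (on cgroup v2 the relevant files are cpu.max and cpu.stat instead), and the print call stands in for pushing the value to whatever monitoring service you use:

# Minimal sketch: report CPU usage vs. CPU limit from inside the container.
# Assumes cgroup v1 paths; cpu.cfs_quota_us is -1 when no CPU limit is set.
import time

CPUACCT_USAGE = "/sys/fs/cgroup/cpu/cpuacct.usage"    # cumulative CPU time, nanoseconds
CFS_QUOTA = "/sys/fs/cgroup/cpu/cpu.cfs_quota_us"     # quota per period, microseconds
CFS_PERIOD = "/sys/fs/cgroup/cpu/cpu.cfs_period_us"   # period length, microseconds

def read_int(path):
    with open(path) as f:
        return int(f.read().strip())

def cpu_ratio(window_s=5.0):
    """Return usage/limit over the sampling window (1.0 means running at the limit)."""
    quota, period = read_int(CFS_QUOTA), read_int(CFS_PERIOD)
    if quota <= 0:
        return None  # no CPU limit configured for this container
    limit_cores = quota / period
    start = read_int(CPUACCT_USAGE)
    time.sleep(window_s)
    used_cores = (read_int(CPUACCT_USAGE) - start) / 1e9 / window_s
    return used_cores / limit_cores

if __name__ == "__main__":
    while True:
        ratio = cpu_ratio()
        if ratio is not None:
            print(f"cpu usage / limit = {ratio:.2%}")  # replace with a push to your monitoring service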

Related

Kubernetes - How to calculate resources we need for each container?

How do I figure out the minimum and maximum resources to allocate for each application deployment? I'm setting up a cluster, and I haven't set any resource requests or limits yet, just letting it run freely.
I guess I could use the top command to figure out the load at peak time and work from that, but top reports something like 6% or 10%, and I'm not sure how to convert that into values like 0.5 CPU or 100 MB. Is there a method/formula to determine the max and min based on top output?
I'm running two t3.medium nodes with the following pods: httpd and tomcat in namespace1, mysql in namespace2, jenkins and gitlab in namespace3. Is there any guide to the minimum resources each needs, or do I have to figure it out based on top or some other method?
There are a few things to discuss here:
Unix top and kubectl top are different:
Unix top uses the proc virtual filesystem and reads the /proc/meminfo file to get information about current memory usage.
kubectl top shows metrics based on reports from cAdvisor, which collects resource usage. For example, kubectl top pod POD_NAME --containers shows metrics for a given pod and its containers, and kubectl top node NODE_NAME shows metrics for a given node.
You can use the metrics-server to get the CPU and memory usage of the pods. With it you will be able to Assign CPU Resources to Containers and Pods.
Optimally, your pods should be using exactly the amount of resources you requested, but that's almost impossible to achieve. If the usage is lower than your request, you are wasting resources. If it's higher, you are risking performance issues. Consider a 25% margin up and down the request value as a good starting point. Regarding limits, achieving a good setting depends on trying and adjusting. There is no optimal value that would fit everyone, as it depends on many factors related to the application itself, the demand model, the tolerance for errors, etc.
As a supplement I recommend going through the Managing Resources for Containers docs.
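To make the arithmetic concrete, here is a rough sketch of the conversion. The numbers are assumptions based on the t3.medium mentioned in the question (2 vCPUs, 4 GiB of RAM) and on top's defaults, where %CPU is relative to a single core and %MEM to total node memory:

# Rough conversion of `top` percentages into Kubernetes-style resource values.
# Assumptions: %CPU is relative to one core (top's default), %MEM is relative to
# total node memory, and the node is a t3.medium with 4 GiB of RAM.
NODE_MEM_MIB = 4096

cpu_percent = 10   # top shows the process at 10% CPU
mem_percent = 6    # and at 6% MEM

cpu_request_millicores = cpu_percent / 100 * 1000        # 10% of one core -> 100m
mem_request_mib = mem_percent / 100 * NODE_MEM_MIB       # 6% of 4 GiB -> ~246 MiB

print(f"requests: cpu={cpu_request_millicores:.0f}m, memory={mem_request_mib:.0f}Mi")

You would then round these values up and apply the 25% margin mentioned above when setting requests, and pick limits somewhat higher.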

Kubernetes - Monitoring pod IO

I would like to monitor the IO my pod is doing. Using commands like 'kubectl top pods/nodes', I can monitor CPU and memory, but I am not sure how to monitor the IO my pod does, especially disk IO.
Any suggestions?
Since you already used the kubectl top command, I assume you have the metrics server. For a more advanced monitoring solution I would suggest cAdvisor, Prometheus, or Elasticsearch.
For getting started with Prometheus you can check this article.
Elasticsearch has System diskio and Docker diskio metricsets. You can easily deploy it using a Helm chart.
Part 3 of the series about Kubernetes monitoring is especially focused on monitoring container metrics using cAdvisor, although the whole series is worth checking.
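As an illustration of what the cAdvisor route gives you once Prometheus scrapes it, a small sketch like this could pull per-pod disk write throughput. The Prometheus address is an assumption, container_fs_reads_bytes_total and container_fs_writes_bytes_total are the usual cAdvisor disk IO counters, and the label name ("pod" vs. "pod_name") can vary with your setup:

# Sketch: per-pod disk IO rates from Prometheus (cAdvisor metrics assumed to be scraped).
import requests

PROM_URL = "http://prometheus.monitoring.svc:9090"  # assumed in-cluster address

def query(promql):
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Bytes written per second over the last 5 minutes, grouped by pod.
for series in query('sum by (pod) (rate(container_fs_writes_bytes_total[5m]))'):
    print(series["metric"].get("pod", "<unknown>"), series["value"][1], "B/s")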
Let me know if this helps.

Monitoring Rancher containers by hosts through Prometheus cAdvisor NodeExporter

I have a setup where I manage to monitor every container of my Rancher 1.6 environment with a Prometheus (2.4.3)/Grafana stack (with cAdvisor v0.27.4 and NodeExporter v0.16.0).
Here is my issue: I can monitor the consumption of every container, but I can't relate a container's consumption to its host.
For example, if I want to show information about CPU usage, I use container_cpu_user_seconds_total from cAdvisor, which provides the container's CPU usage as a percentage relative to its host, but I can't tell which host is concerned (I have 4 hosts in this environment), so the cumulative CPU consumption tends to go over 100%.
I would like to show charts by host (I saw I could create dynamic charts in Grafana, but that seems a bit hard, so creating them manually doesn't bother me).
Should I try to create my own metrics in the prom-conf file? That seems a bit overkill for this.
I find it very strange that this information seems to interest only me, which is why I'm asking here.
I'm new to all of these tools.
Thank you in advance.
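A possible starting point, sketched under the assumption that Prometheus scrapes one cAdvisor endpoint per host (so each series carries an instance label identifying that host), is to aggregate the rate by that label. The Prometheus address below is a placeholder and the label name may differ in your setup:

# Sketch: per-host CPU usage from cAdvisor metrics, grouped by the scrape target.
# Assumes each cAdvisor target corresponds to one host, exposed as the `instance`
# label; adjust the label name if your setup differs.
import requests

PROM_URL = "http://prometheus:9090"  # placeholder address

promql = 'sum by (instance) (rate(container_cpu_user_seconds_total[5m]))'
result = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}).json()

for series in result["data"]["result"]:
    host = series["metric"].get("instance", "<unknown>")
    print(f"{host}: {float(series['value'][1]):.2f} cores")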

Kubernetes - NodeUnderMemoryPressure Issue

I'm very new to Kubernetes. We are using Kubernetes cluster on Google Cloud Platform.
I have created Cluster, Services, Pod, Replica controllers.
I have created a Horizontal Pod Autoscaler based on CPU parameters.
Cluster details
Default running node count is set to 3
3GB allocatable memory per node
After running for 1 hour, the services and nodes show NodeUnderMemoryPressure issues.
How do I resolve this?
If you need any more details, please ask.
Thanks
I don't know how much traffic is hitting your cluster, but I would highly recommend running Prometheus in your cluster.
Prometheus is an open-source monitoring and alerting tool, and integrates very well with Kubernetes.
This tool should give you a much better view of memory consumption, CPU usage, amongst many other monitoring capabilities, that will allow you to effectively troubleshoot these types of issues.
There are several ways to address this issue, depending on the type of your workloads.
The easiest is simply to scale your nodes, but that can be useless if there is a memory leak. Even if you are not affected by one now, you should always consider the possibility of a memory leak, so the best practice is to always set memory limits for pods and namespaces.
Scale the cluster
If you have many pods running and none of them are much bigger than the others, it is useful to scale your cluster horizontally; that way the number of running pods per node goes down and the NodeUnderMemoryPressure warning should disappear.
If you are running few pods, or some of them can make the cluster suffer on their own, then the only option is to scale the nodes vertically, adding a new node pool with Compute Engine instances that have more memory and possibly deleting the old one.
If your workload is correct and memory suffers because at certain moments of the day you receive 100 times the usual traffic and create more pods to support it, you should consider making use of the autoscaler.
Check for memory leaks
On the other hand, if it is not a "healthy" situation and you have pods consuming far more RAM than expected, then you should follow grizzthedj's advice, understand why your pods are consuming so much, and check whether some of your containers are affected by a memory leak; in that case scaling up the amount of RAM is useless, since at some point you will run out of it anyway.
Therefore, start by identifying which pods are consuming too much and then troubleshoot why they behave this way; if you do not want to use Prometheus, simply SSH into the container and check with the classic Linux commands.
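If you go the route of checking from inside the container, a quick sketch along these lines compares current memory usage to the configured limit. It assumes cgroup v1 paths; note that when no limit is set, memory.limit_in_bytes reports a huge placeholder value, so the ratio will look near zero:

# Quick in-container check: memory usage vs. memory limit (cgroup v1 paths assumed).
def read_int(path):
    with open(path) as f:
        return int(f.read().strip())

usage = read_int("/sys/fs/cgroup/memory/memory.usage_in_bytes")
limit = read_int("/sys/fs/cgroup/memory/memory.limit_in_bytes")  # huge value if no limit is set

print(f"memory: {usage / 2**20:.0f} MiB used / {limit / 2**20:.0f} MiB limit ({usage / limit:.0%})")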
Limit the RAM consumed by pods
To prevent this from happening in the future, I advise you, when writing your YAML files, to always limit the amount of RAM the pods can use; this way you keep them under control and you can be sure there is no risk of them causing the Kubernetes "node agent" to fail because it runs out of memory.
Also consider limiting CPU and introducing minimum requests for both RAM and CPU, so the scheduler can place the pods properly and you avoid hitting NodeUnderMemoryPressure under high load.

Monitoring and alerting on pod status or restart with Google Container Engine (GKE) and Stackdriver

Is there a way to monitor the pod status and restart count of pods running in a GKE cluster with Stackdriver?
While I can see CPU, memory and disk usage metrics for all pods in Stackdriver there seems to be no way of getting metrics about crashing pods or pods in a replica set being restarted due to crashes.
I'm using a Kubernetes replica set to manage the pods, hence they are respawned and created with a new name when they crash. As far as I can tell, the metrics in Stackdriver appear by pod name (which is unique only for the lifetime of the pod), which doesn't seem very sensible.
Alerting on pod failures sounds like such a natural thing that it is hard to believe it is not supported at the moment. The monitoring and alerting capabilities I get from Stackdriver for Google Container Engine seem rather useless as they stand, since they are all bound to pods whose lifetime can be very short.
So if this doesn't work out of the box are there known workarounds or best practices on how to monitor for continuously crashing pods?
You can achieve this manually with the following:
In Logs Viewer, create the following filter:
resource.labels.project_id="<PROJECT_ID>"
resource.labels.cluster_name="<CLUSTER_NAME>"
resource.labels.namespace_name="<NAMESPACE, or default>"
jsonPayload.message:"failed liveness probe"
Create a metric by clicking on the Create Metric button above the filter input and filling in the details.
You may now track this metric in Stackdriver.
I would be happy to be informed of a built-in metric instead of this.
There is a built-in metric now, so it's easy to dashboard and/or alert on it without setting up custom metrics.
Metric: kubernetes.io/container/restart_count
Resource type: k8s_container
In my cluster (a bare-metal k8s cluster), I use kube-state-metrics (https://github.com/kubernetes/kube-state-metrics) to do what you want. This project belongs to the kubernetes repo and is quite easy to use. Once deployed, you can use the kube_pod_container_status_restarts metric to know whether a container has restarted.
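For completeness, here is a sketch of how that metric could be used once kube-state-metrics is scraped by Prometheus. In recent kube-state-metrics versions the counter is exposed as kube_pod_container_status_restarts_total; the Prometheus address below is a placeholder:

# Sketch: list containers that restarted in the last hour, via kube-state-metrics.
import requests

PROM_URL = "http://prometheus:9090"  # placeholder address
promql = 'increase(kube_pod_container_status_restarts_total[1h]) > 0'

result = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}).json()
for series in result["data"]["result"]:
    m = series["metric"]
    print(f"{m.get('namespace')}/{m.get('pod')}/{m.get('container')}: "
          f"{float(series['value'][1]):.0f} restarts in the last hour")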
Others have commented on how to do this with metrics, which is the right solution if you have a very large number of crashing pods.
An alternative approach is to treat crashing pods as discrete events or even log lines. You can do this with Robusta (disclaimer, I wrote this) with YAML like this:
triggers:
  - on_pod_update: {}
actions:
  - restart_loop_reporter:
      restart_reason: CrashLoopBackOff
  - image_pull_backoff_reporter:
      rate_limit: 3600
sinks:
  - slack
Here we're triggering an action named restart_loop_reporter whenever a pod updates. The data stream comes from the APIServer.
The restart_loop_reporter is an action which filters out non-crashing pods. Above it's configured to report only on CrashLoopBackOffs but you could remove that to report all crashes.
A benefit of doing it this way is that you can gather extra data about the crash automatically. For example, the above will fetch the pod's logs and forward them along with the crash report.
I'm sending the result here to Slack, but you could just as well send it to a structured output like Kafka (already builtin) or Stackdriver (not yet supported, but I can fix that if you like).
Remember that you can always raise a feature request if the available options are not enough.