I am trying to aggregate the time pods spend in the Running status, grouped by another custom business-logic tag, so that I can calculate how much it costs me to run this service.
I have tried to use docker.uptime, but it has not been fruitful, as I imagine multiple containers can run in parallel per node. I saw that Datadog KSM provides pods.age and pods.uptime metrics, but they do not appear in the Datadog metric explorer.
I do not want to use Prometheus/Grafana to do this, because I think this should be possible with Datadog.
avg:container.uptime{kube_namespace:<your namespace>} by {pod_name}
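If your custom business-logic tag is applied to the pods, you should be able to group by that tag instead of pod_name; something along these lines, where your_custom_tag is only a placeholder:
sum:container.uptime{kube_namespace:<your namespace>} by {your_custom_tag}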
I run Kubernetes under EKS on AWS. I want to be alerted when enough pods from a deployment are in a failed phase. I do have Prometheus, but I need the alert to be in CloudWatch, so I am exporting the metric to CloudWatch with the CloudWatch Agent.
I was planning to use the kube_pod_status_phase metric and group it by phase and by another label identifying my deployment.
But I realized that the kube_pod_status_phase metric only has namespace, pod, phase, and a couple of other labels that are useless in my case, so it is not enough to achieve what I need.
I see that with Prometheus PromQL I could write a join query, which seems to solve my issue. But since I am using CloudWatch metrics, I cannot use PromQL like this (or at least I do not know how).
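For reference, the kind of join I have in mind would look something like this in PromQL (label_app is just an example label pulled in from kube_pod_labels; the real label depends on how the deployment's pods are labelled):
sum by (label_app) (kube_pod_status_phase{phase="Failed"} * on (namespace, pod) group_left(label_app) kube_pod_labels)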
Does anyone have a suggestion on how to solve this issue?
How, with AWS CloudWatch, can I list the pods in a failed state for one specific deployment?
I would like to monitor the IO my pod is doing. Using commands like 'kubectl top pods/nodes', I can monitor CPU and memory, but I am not sure how to monitor the IO my pod is doing, especially disk IO.
Any suggestions?
Since you have already used the kubectl top command, I assume you have the Metrics Server. For a more advanced monitoring solution, I would suggest using cAdvisor, Prometheus, or Elasticsearch.
For getting started with Prometheus you can check this article.
Elasticsearch has System diskio and Docker diskio metricsets. You can easily deploy it using a Helm chart.
Part 3 of the series about Kubernetes monitoring is especially focused on monitoring container metrics using cAdvisor, although the whole series is worth checking.
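For example, if you end up scraping cAdvisor with Prometheus, queries along these lines would give per-pod disk throughput (the container_fs_* metric names and the pod label are assumptions about your setup; the pod name is a placeholder):
sum(rate(container_fs_reads_bytes_total{pod=~"<your-pod>.*"}[5m])) by (pod)
sum(rate(container_fs_writes_bytes_total{pod=~"<your-pod>.*"}[5m])) by (pod)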
Let me know if this helps.
For example, the kubelet (cAdvisor) metric container_cpu_usage_seconds_total has values broken down by labels such as pod and namespace.
I wonder how to summarize these kinds of values per Service (for example, CPU usage per service). I understand that a Service is a set of Pods, so it should just be a matter of aggregating the per-Pod values up to the Service, but I do not know how.
Is there any aggregation method per Service? Or is process_cpu_seconds_total a kind of per-service aggregation of container_cpu_usage_seconds_total?
Thank you for your help!
What about
sum(rate(container_cpu_usage_seconds_total{job="kubelet", cluster="", namespace="default", pod_name=~"your_pod_name.*"}[3m]))
Taken from kubernetes-mixin
In general, cAdvisor collects metrics about containers and doesn't know anything about Services. If you want to aggregate by Service, you need to manually select the metrics of the Pods that belong to that Service.
For example, if your cAdvisor metrics are in Prometheus, you can use this PromQL query:
sum(rate(container_cpu_usage_seconds_total{pod_name=~"myprefix-.*"}[2m]))
This adds up the CPU usages of all containers of all Pods that have a name starting with myprefix-.
Or if you have the Resource Metrics API enabled (i.e. the Metrics Server installed), you can query the CPU usage of a specific Pod (in fractions of a CPU core) with:
kubectl get --raw="/apis/metrics.k8s.io/v1beta1/namespaces/{namespace}/pods/{pod}"
To get the total usage of a Service, you would need to iterate through all the Pods of the Service, extract the values, and add them together.
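A rough sketch of that iteration, assuming the Service selects its Pods with a label such as app=myservice (the label, namespace, and pod names are placeholders):
kubectl get pods -n <namespace> -l app=myservice -o name
kubectl get --raw="/apis/metrics.k8s.io/v1beta1/namespaces/<namespace>/pods/<pod-from-previous-step>"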
In general, Service is a Kubernetes concept and does not exist in cAdvisor, which is an independent project and just happens to be used in Kubernetes.
Is it possible to start or stop Kubernetes Pods based on some event, like a Kafka event?
For example, if there is an event that some work is complete, I want to bring a Pod down (or up) based on that. In my case, the minimum replicas of the Pods keep running even though they are not needed for most of the day.
Pods with Horizontal Pod Autoscaling based on custom metrics are the option you are looking for.
You would probably instrument your code with custom Prometheus metrics. In your case that means publishing a metric to Prometheus that reports the number of messages available for processing at a point in time, and then using that custom Prometheus metric to scale the pods.
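As a rough sketch, a HorizontalPodAutoscaler driven by such a custom/external metric (exposed through something like the Prometheus Adapter) could look like this; the deployment name and the metric name kafka_messages_ready are assumptions:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker                     # assumed deployment name
  minReplicas: 1                     # a plain HPA cannot scale to zero; tools like KEDA can
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: kafka_messages_ready # assumed metric exposed via the adapter
        target:
          type: AverageValue
          averageValue: "5"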
Inside a namespace, I have created a pod whose spec includes memory limit and memory request parameters. Once it is up and running, I would like to know how I can get the memory utilization of the pod, in order to figure out whether the memory utilization is within the specified limit or not. The "kubectl top" command returns a services-related error.
kubectl top pod <pod-name> -n <fed-name> --containers
FYI, this is on v1.16.2
You need to install the Metrics Server to get the metrics. Follow the thread below:
Error from server (NotFound): podmetrics.metrics.k8s.io "mem-example/memory-demo" not found
kubectl top pod POD_NAME --containers
shows metrics for a given pod and its containers.
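To check whether that usage stays within the limit you configured, you can also read the limit back from the Pod spec and compare the two by hand, e.g.:
kubectl get pod POD_NAME -n <namespace> -o jsonpath='{.spec.containers[*].resources.limits.memory}'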
If you want to see graphs of memory and CPU utilization, you can view them through the Kubernetes dashboard.
A better solution would be to install the Metrics Server along with Prometheus and Grafana in your cluster. Prometheus will scrape the metrics, which Grafana can then display as graphs.
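With Prometheus and kube-state-metrics in place, a sketch of memory usage as a fraction of the configured limit could look like this (the exact metric and label names vary between kube-state-metrics versions, and the pod name is a placeholder):
sum(container_memory_working_set_bytes{pod="<pod-name>", container!=""}) by (container) / sum(kube_pod_container_resource_limits{pod="<pod-name>", resource="memory"}) by (container)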
This might be useful.
Instead of building ad-hoc metric snapshots, a much better way is to install and work with third-party data collector programs which, if managed well, give you a great solution for monitoring your systems and a neat Grafana UI (or similar) you can play with. One of them is Prometheus, which comes highly recommended.
Using such plug-and-play systems, you can not only create a robust monitoring pipeline, but the consumption of the metrics, and hence the reaction to problems, is also much better managed and executed than relying on top alone.