How to get number of pods running in prometheus - kubernetes

I am scraping Kubernetes metrics with Prometheus and need to extract the number of running pods.
I can see the container_last_seen metric, but how do I get the number of running pods? Can someone help with this?

If you need the number of running pods, you can use one of the pod metrics listed at https://github.com/kubernetes/kube-state-metrics/blob/master/docs/pod-metrics.md (to get information purely about pods, it makes sense to use pod-specific metrics).
For example, if you need the number of pods per namespace, it would be:
count(kube_pod_info{namespace="$namespace_name"}) by (namespace)
To get the number of all pods in the cluster, just do:
count(kube_pod_info)
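Note that kube_pod_info counts every pod the API server knows about, including Pending and Completed ones. If you specifically want pods in the Running phase, a query along these lines should work (a minimal sketch, assuming your kube-state-metrics version exposes kube_pod_status_phase, which it does by default):
# pods currently in phase Running, cluster-wide
sum(kube_pod_status_phase{phase="Running"})
# the same, broken down per namespace
sum(kube_pod_status_phase{phase="Running"}) by (namespace)
The metric is a 0/1 gauge per pod and phase, so summing it yields the count of pods in that phase.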

Assuming you want to display that in Grafana according to your question tags, from this Kubernetes App Metrics dashboard for example:
count(count(container_memory_usage_bytes{container_name="$container", namespace="$namespace"}) by (pod_name))
You can just import the dashboard and play with the queries.
Depending on your configuration/deployment, you can adjust the container_name and namespace variables. Grouping by (pod_name) and counting the result does the trick. A label other than pod_name can also be used, as long as it's shared by the pods you want to count.
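One caveat, as an assumption about your cluster version: on newer Kubernetes versions (roughly 1.16+) the cAdvisor labels were renamed from container_name/pod_name to container/pod, so there the same counting trick would look like this (the variable values are placeholders):
count(count(container_memory_usage_bytes{container="$container", namespace="$namespace"}) by (pod))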

If you want to see only the number of "deployed" pods in some namespace, you can use the solutions in previous answers.
My use case was to see the current running pods in some namespace and below is my solution:
min_over_time(sum(group(kube_pod_container_status_ready{namespace="BC_NAME"}) by (pod,uid)) [5m:1m]) OR on() vector(0)
Please replace BC_NAME with your namespace name.
The [5m:1m] time span gives you fine-grained data.
If no data is found (no pod currently running), it returns 0.
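For instance, with a hypothetical namespace called team-a, the query becomes the one below. The [5m:1m] subquery evaluates the inner sum once per minute over the last five minutes, and min_over_time then takes the lowest value seen, so pods that only existed briefly during that window don't inflate the count:
min_over_time(sum(group(kube_pod_container_status_ready{namespace="team-a"}) by (pod,uid)) [5m:1m]) OR on() vector(0)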

Related

Grafana consolidate pod metrics

I have a Kubernetes Pod which serves metrics for prometheus.
Once in a while I update the release and thus the pod gets restarted.
Prometheus saves the metrics but labels them according to the new pod name:
this is by Prometheus' design, so it's ok.
But if I display this data with Grafana, I'm getting this (the pods have been redeployed twice):
So, for example, the metric "Registered Users" now has 3 different colors because its data comes from 3 different pods.
I have some options. Maybe disregard the pod name in Prometheus, but I consider that bad practice because I don't want to lose data.
So I think I have to consolidate this in Grafana. But how can I tell Grafana that I want to merge all data with container name api-gateway-narkuma and disregard the pod label?
You can do something like
max(users) without (instance, pod)
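Applied to the question, and assuming the container name ends up in a label called container (the exact label depends on your scrape/relabel configuration, so treat it as a placeholder), that would be:
max(users{container="api-gateway-narkuma"}) without (instance, pod)
This collapses the per-pod series into one, so redeployments no longer create new colored lines in Grafana.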

Argo Workflow metrics

I am trying to understand the metrics emitted by Argo Workflows, but their explanation isn't helping enough:
For example
argo_workflows_pods_count
It is possible for a workflow to start, but no pods be running (e.g.
cluster is too busy to run them). This metric sheds light on actual
work being done.
Does it mean the count of all running pods for all workflows across all namespaces (if so, the value doesn't seem correct, at least to me)?
Is there a difference between this metric and the kubernetes_state.pod.* metrics (which would give me the pods in different states, e.g. running)?
Enabling and scraping the endpoint shows the following data exposed:
# HELP argo_workflows_pods_count Number of Pods from Workflows currently accessible by the controller by status (refreshed every 15s)
# TYPE argo_workflows_pods_count gauge
argo_workflows_pods_count{status="Pending"} 0
argo_workflows_pods_count{status="Running"} 0
As we are querying the workflow controller here and there are no additional labels attached to the metric, we can assume that this is indeed the total number of pods created by Argo. However, this is not necessarily the same as kubernetes_state.pod.* as this will also include pods created by other processes.
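To put this on a graph, querying the gauge directly should be enough; a small sketch, assuming the controller metrics endpoint is already being scraped:
# Argo workflow pods currently in the Running state
argo_workflows_pods_count{status="Running"}
# total across all statuses (Pending + Running + ...)
sum(argo_workflows_pods_count)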

How to monitor pod preemption event

I have a bunch of Rancher clusters I take care of, and on some of them developers use PriorityClasses to ensure that some of the more important workloads get scheduled. The 3 PriorityClasses are in the 3-digit range, so they will not interfere with the default ones. However, at present none of the PriorityClasses is set as the default, and neither is preemptionPolicy set, so it defaults to PreemptLowerPriority.
None of the rancher, longhorn, prometheus, grafana, etc., workloads have priorityClassName set.
Long story short, I believe this causes havoc on the cluster when resources are in short supply.
Before I take my opinion to the developers I would like to collect some data to back up my story.
The question: How do I detect if the pod was Terminated due to Preemption?
I tried to google the subject but couldn't find anything. I was hoping kube state metrics would have something but I didn't find anything.
Any help would be greatly appreciated.
You can try to look for convincing data, such as the pod termination reason, with the help of kubectl.
You can see the last restart logs of a container using the following command:
kubectl logs podname -c containername --previous
You can also use the following command to check the lifecycle events sent by the kubelet to the apiserver about the pod.
kubectl describe pod podname
Finally, you can also write a final message to /dev/termination-log, and this will show up as described in the docs.
To use kubectl commands with rancher kindly refer to this documentation page.
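As a rough sketch for detecting preemption itself: the scheduler records an event on the victim pod when it preempts it, so filtering cluster events by reason can surface those cases while the events are still retained (events are kept for roughly an hour by default, and the exact reason string may vary between Kubernetes versions, so treat this as a starting point):
# cluster-wide, look for preemption events
kubectl get events --all-namespaces --field-selector reason=Preempted
# or inspect recent warnings in one namespace
kubectl get events -n <namespace> --field-selector type=Warning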

How to find metrics about CPU/MEM for the pod running on a Kubernetes cluster on Prometheus

I have Prometheus set up via Helm from Terraform, and it is configured to connect to my Kubernetes cluster. I open Prometheus, but I am not sure which metric to choose from the list to view the CPU/MEM usage of running pods/jobs.
Here are all the pods running with the command (test1 is the kube namespace):
kubectl -n test1 get pods
[screenshot: list of running pods]
When, I am on Prometheus, I see many metrics related to CPU, but not sure which one to choose:
[screenshot: Prometheus metric list]
I tried to choose one, but the namespace is prometheus and it uses prometheus-node-exporter; I don't see my cluster or my namespace test1 anywhere here.
[screenshot: node-exporter metrics]
Could you please help me? Thank you very much in advance.
UPDATE: [screenshot]
I need to concentrate on this specific namespace, normally with the command:
kubectl get pods --all-namespaces | grep hermatwin
I see the first line with namespace = jobs; I think this is the namespace.
No result when I set the calendar to last Friday:
UPDATE (April 20): [screenshot]
I tried to select 2 days starting on last Saturday, 17 April, but I don't see any result:
And if I remove the (namespace="jobs") condition, I don't see any result either:
I reran the job (simulation jobs) just now and executed the Prometheus query while the job was still running, but I don't get any result :-( Here you can see my jobs were running.
I don't get any result:
When using a simple filter, just container_cpu_usage_seconds_total, I can see namespace="jobs".
node_cpu_seconds_total is a metric from node-exporter, the exporter that provides machine statistics; its metrics are prefixed with node_. You need metrics from cAdvisor, which produces container-related metrics prefixed with container_:
container_cpu_usage_seconds_total
container_cpu_load_average_10s
container_memory_usage_bytes
container_memory_rss
Here are some basic queries for you to get started. Be aware that they may require tweaking (you may have different label names):
CPU Utilisation Per Pod
sum(irate(container_cpu_usage_seconds_total{container!="POD", container=~".+"}[2m])) by (pod)
RAM Usage Per Pod
sum(container_memory_usage_bytes{container!="POD", container=~".+"}) by (pod)
In/Out Traffic Rate Per Pod
Beware that pods with host network mode (not isolated) show traffic rate for the whole node. * 8 is to convert bytes to bits for convenience (MBit/s, GBit/s, etc).
# incoming
sum(irate(container_network_receive_bytes_total[2m])) by (pod) * 8
# outgoing
sum(irate(container_network_transmit_bytes_total[2m])) by (pod) * 8
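One more hedged note on memory: container_memory_working_set_bytes is often preferred over container_memory_usage_bytes because it excludes reclaimable page cache and is closer to what the kubelet uses for eviction decisions, so you may want this variant instead:
sum(container_memory_working_set_bytes{container!="POD", container=~".+"}) by (pod)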

Kubernetes / Prometheus Metrics Mismatch

I have an application running in Kubernetes (Azure AKS) in which each pod contains two containers. I also have Grafana set up to display various metrics some of which are coming from Prometheus. I'm trying to troubleshoot a separate issue and in doing so I've noticed that some metrics don't appear to match up between data sources.
For example, kube_deployment_status_replicas_available returns the value 30 whereas kubectl -n XXXXXXXX get pod lists 100 all of which are Running, and kube_deployment_status_replicas_unavailable returns a value of 0. Also, if I get the deployment in question using kubectl I'm seeing the expected value.
$ kubectl get deployment XXXXXXXX
NAME       DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
XXXXXXXX   100       100       100          100         49d
There are other applications (namespaces) in the same cluster where all the values correlate correctly so I'm not sure where the fault may be or if there's any way to know for sure which value is the correct one. Any guidance would be appreciated. Thanks
Based on having the kube_deployment_status_replicas_available metric I assume that you have Prometheus scraping your metrics from kube-state-metrics. It sounds like there's something quirky about its deployment. It could be:
Cached metric data
And/or it simply can't pull current metrics from the kube-apiserver
I would:
Check the version of kube-state-metrics that you are running and see if it's compatible with your K8s version.
Restart the kube-state-metrics pod.
Check the logs: kubectl logs <kube-state-metrics-pod>
Check the Prometheus logs
If you don't see anything try starting Prometheus with the --log.level=debug flag.
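As a quick sanity check, you can also compare the deployment-level figure against a pod-level readiness count directly in Prometheus (namespace and deployment names below are placeholders; the second query counts all ready pods in the namespace, so narrow it further if the namespace hosts more than this one deployment):
# what kube-state-metrics reports as available
kube_deployment_status_replicas_available{namespace="XXXXXXXX", deployment="XXXXXXXX"}
# how many pods in the namespace report Ready
count(kube_pod_status_ready{namespace="XXXXXXXX", condition="true"} == 1)
If these two numbers disagree, the problem is almost certainly in kube-state-metrics rather than in your deployment.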
Hope it helps.