Argo Workflow metrics - argo-workflows

I am trying to understand the metrics emitted by Argo Workflows, but their explanations aren't helping enough.
For example:
It is possible for a workflow to start, but no pods be running (e.g.
cluster is too busy to run them). This metric sheds light on actual
work being done.
Does it mean the count of all running pods for all workflows, from all namespaces? (If so, that doesn't seem correct, at least to me.)
Is there a difference between this metric and the kubernetes_state.pod.* metrics (which would give me the pods in different states, e.g. running)?

Enabling and scraping the endpoint shows the following data exposed:
# HELP argo_workflows_pods_count Number of Pods from Workflows currently accessible by the controller by status (refreshed every 15s)
# TYPE argo_workflows_pods_count gauge
argo_workflows_pods_count{status="Pending"} 0
argo_workflows_pods_count{status="Running"} 0
As we are querying the workflow controller here and there are no additional labels (such as a namespace) attached to the metric, we can assume that this is indeed the total number of pods created by Argo that the controller can see. However, this is not necessarily the same as kubernetes_state.pod.*, since those metrics also include pods created by other processes.
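To check this yourself, here is a minimal Python sketch that parses the scraped text and lists which labels the metric actually carries. The helper names (parse_samples, pods_by_status) are made up for illustration, and the regex only handles simple label values like the ones in the output above; it is not a full parser for the exposition format.

```python
import re

# Sample taken from the scrape output quoted above.
SAMPLE = '''# HELP argo_workflows_pods_count Number of Pods from Workflows currently accessible by the controller by status (refreshed every 15s)
# TYPE argo_workflows_pods_count gauge
argo_workflows_pods_count{status="Pending"} 0
argo_workflows_pods_count{status="Running"} 0
'''

# Simplified patterns: label values containing '}' or escaped quotes are not handled.
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text, metric):
    """Return a list of (labels_dict, value) for one metric name."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if not match or match.group('name') != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group('labels') or ''))
        samples.append((labels, float(match.group('value'))))
    return samples


def pods_by_status(text):
    """Map each status label value to its pod count."""
    samples = parse_samples(text, 'argo_workflows_pods_count')
    return {labels.get('status', ''): value for labels, value in samples}


def label_names(text):
    """Collect every label name used by the metric."""
    names = set()
    for labels, _ in parse_samples(text, 'argo_workflows_pods_count'):
        names.update(labels)
    return sorted(names)
```

With the sample above, label_names(SAMPLE) contains only status, which is why the value cannot be broken down per namespace or per workflow.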


check if grafana agent operator is up and running

I have a grafana agent operator and I was trying to create some metrics to monitor if it's up.
If I had a simple Grafana Agent process I would just use something along the lines of absent(up{instance=""} == 1), but with the Grafana Agent operator the components are dynamic.
I don't see issues with monitoring the metrics part. For example, if the grafana-agent-0 stateful set for metrics goes down and a new pod is built the name would be the same.
But for logs, the Grafana Agent operator runs a pod (daemon set) for every node with a different name each time.
In the logs case, if a pod such as grafana-agent-log-vsq5r goes down, or a new node is added to the cluster, I would have a new pod to monitor with a different name, which makes it hard to monitor changes in the cluster. Has anyone already had this issue, or does anyone know a good way of tackling it?
I would like to suggest using Labels in Grafana Alerting

Is interrupt functionality available in Kubernetes minikube autoscaling?

I have used the custom metrics API and have successfully autoscaled a service based on some metrics. The catch is how the autoscaling works: minikube's HPA (Horizontal Pod Autoscaler) checks a particular URL/API and fetches the metric value repeatedly with some polling period. For example, the HPA checks the value every 15 seconds. So there is continuous polling from the HPA to the URL/API to fetch the value of that metric. After that it simply compares the value with the given target value and scales accordingly.
What I want is for the API/URL to trigger the minikube HPA whenever needed. In simple words, the HPA should work like an interrupt here.
The call to autoscale should go from the Service/API to the HPA, not from the HPA to the Service!
Is this feature available in Kubernetes? Or do you have any comments on this scenario? Please share your view on this; I am currently in the last stage and this question is blocking my progress!
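For reference, the comparison step that the HPA runs on each poll can be sketched in Python. This follows the formula given in the Kubernetes HPA documentation, desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). The function name is made up, and the 0.1 tolerance is assumed to be the controller default; it may be configured differently on a given cluster.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Replica count the HPA would aim for after one polling cycle.

    tolerance=0.1 is an assumption (the usual default); within that band
    around the target the replica count is left unchanged.
    """
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to the target, do not scale
    return math.ceil(current_replicas * ratio)
```

For example, 2 replicas reporting a metric of 200 against a target of 100 gives 4 replicas, while a metric of 105 against the same target stays within tolerance and leaves the count as it is.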

How to get status history/lineage for Kubernetes pods

I was wondering if there is a kubectl command to quickly get the history of all STATUS for a given pod?
For example, let's say a pod my-test-pod went from ContainerCreating to Running to OomKill to Terminating.
I was wondering if there is a command that experts use to get this lineage. Appreciate a nudge..
With kubectl get events you can only see events from the last hour (the default retention). If you want to persist events for a longer duration you can use eventrouter. The event router serves as an active watcher of the event resource in the Kubernetes system; it takes those events and pushes them to a user-specified sink. This is useful for a number of different purposes, most notably long-term behavioral analysis of the workloads running on your Kubernetes cluster.
Use kubectl get events, or kubectl describe pod, which shows the events for the pod at the bottom. However, events are only kept for a little while, so it's not a permanent history. For that you would need some webhooks or a tool like Prometheus.
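While the events are still retained, a rough timeline for one pod can be built from the JSON output of kubectl get events -o json. The Python sketch below is only an illustration: the function name and the sample data are made up, and it assumes the timestamps are RFC 3339 strings in UTC, so that sorting them as text matches chronological order.

```python
import json

# Made-up sample in the shape of `kubectl get events -o json` output.
SAMPLE_EVENTS = json.dumps({
    "items": [
        {"involvedObject": {"kind": "Pod", "name": "my-test-pod"},
         "reason": "Started", "lastTimestamp": "2024-01-01T10:00:05Z"},
        {"involvedObject": {"kind": "Pod", "name": "my-test-pod"},
         "reason": "Scheduled", "lastTimestamp": "2024-01-01T10:00:00Z"},
        {"involvedObject": {"kind": "Pod", "name": "other-pod"},
         "reason": "Killing", "lastTimestamp": "2024-01-01T10:00:03Z"},
    ]
})


def pod_timeline(events_json, pod_name):
    """Return (timestamp, reason) pairs for one pod, oldest first."""
    items = json.loads(events_json).get("items", [])
    rows = []
    for event in items:
        obj = event.get("involvedObject", {})
        if obj.get("kind") != "Pod" or obj.get("name") != pod_name:
            continue
        # Some events carry eventTime instead of lastTimestamp.
        timestamp = event.get("lastTimestamp") or event.get("eventTime") or ""
        rows.append((timestamp, event.get("reason", "")))
    return sorted(rows)
```

Note that this lists event reasons (Scheduled, Started, Killing and so on), which is close to, but not the same as, the STATUS column shown by kubectl get pods.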

How to get number of pods running in prometheus

I am scraping the Kubernetes metrics with Prometheus and need to extract the number of running pods.
I can see the container_last_seen metric, but how should I get the number of pods running? Can someone help with this?
If you need to get the number of running pods, you can use a metric from the list of pod metrics for that (to get information purely about pods, it makes sense to use pod-specific metrics).
For example if you need to get the number of pods per namespace, it'll be:
count(kube_pod_info{namespace="$namespace_name"}) by (namespace)
To get the number of all pods running on the cluster, then just do:
count(kube_pod_info)
Assuming you want to display that in Grafana according to your question tags, from this Kubernetes App Metrics dashboard for example:
count(count(container_memory_usage_bytes{container_name="$container", namespace="$namespace"}) by (pod_name))
You can just import the dashboard and play with the queries.
Depending on your configuration/deployment, you can adjust the variables container_name and namespace, grouping by (pod_name) and count'ing it does the trick. Some other label than pod_name can be used as long as it's shared between the pods you want to count.
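To see why the nested count works, here is a small Python model of it. The inner count(...) by (pod_name) collapses all container series into one series per pod_name value, and the outer count counts those groups. The function name and the sample label sets are invented for illustration.

```python
def count_distinct(series, label):
    """Model of count(count(...) by (label)): number of distinct label values.

    `series` is a list of label dicts, one per time series. As in PromQL,
    a missing label is treated like an empty value, so such series all
    land in a single group.
    """
    return len({labels.get(label, "") for labels in series})


# Made-up example: two containers in pod "a", one container in pod "b".
SERIES = [
    {"pod_name": "a", "container_name": "app"},
    {"pod_name": "a", "container_name": "sidecar"},
    {"pod_name": "b", "container_name": "app"},
]
```

Here count_distinct(SERIES, "pod_name") counts 2 pods even though there are 3 series. One difference from the real query: for an empty input this function returns 0, whereas Prometheus returns an empty result.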
If you want to see only the number of "deployed" pods in some namespace, you can use the solutions in previous answers.
My use case was to see the current running pods in some namespace and below is my solution:
'min_over_time(sum(group(kube_pod_container_status_ready{namespace="BC_NAME"}) by (pod,uid)) [5m:1m]) OR on() vector(0)'
Please replace BC_NAME with your namespace name.
The time range ([5m:1m]) controls how fine-grained the data is.
If no data is found (no pod is currently running), it returns '0'.

What is stored in a kubernetes job and how do I check resource use of old job(s)?

This morning I learned about the (unfortunate) default in kubernetes of all previously run cronjobs' jobs instances being retained in the cluster. Mea culpa for not reading that detail in the documentation. I also notice that deleting jobs (kubectl delete job [<foo> or --all]) takes quite a long time. Further, I noticed that even a reasonably provisioned kubernetes cluster with three large nodes appears to fail (get timeouts of all sorts when trying to use kubectl) when there are just ~750 such old jobs in the system (plus some other active containers that otherwise had not entailed heavy load) [Correction: there were also ~7k pods associated with those old jobs that were also retained :-o]. (I did learn about the configuration settings to limit/avoid storing old jobs from cronjobs, so this won't be a problem [for me] in the future.)
So, since I couldn't find documentation for kubernetes about this, my (related) questions are:
what exactly is stored when kubernetes retains old jobs? (Presumably it's the associated pod's logs and some metadata, but this doesn't explain why they seemed to place such a load on the cluster.)
is there a way to see the resources (disk only, I assume, but maybe
there is some other resource) that individual or collective old jobs
are using?
why does deleting a kubernetes job take on the order of a minute?
I don't know if k8s provides that kind of detail about which job is consuming how much disk space, but here is something you can try.
Try to find the pods associated with the job:
kubectl get pods --selector=job-name=<job name> --output=jsonpath={}
Once you know the pod, find the Docker container associated with it:
kubectl describe pod <pod name>
In the above output, look for Node and Container ID. Now go to that node and, on that node, go to the path /var/lib/docker/containers/<container id found above>. There you can do some investigation to find out what is wrong.