Check if the Grafana Agent Operator is up and running

I have a Grafana Agent Operator and I was trying to create some metrics to monitor whether it's up.
If I had a simple Grafana Agent process I would just use something along the lines of absent(up{instance="1.2.3.4:8000"} == 1), but with the Grafana Agent Operator the components are dynamic.
I don't see issues with monitoring the metrics part. For example, if the grafana-agent-0 pod of the metrics StatefulSet goes down and a new pod is created, the name stays the same.
But for logs, the Grafana Agent Operator runs a pod (DaemonSet) on every node, with a different name each time.
In the log case, if a pod like grafana-agent-log-vsq5r goes down, or a new node is added to the cluster, I get a new pod to monitor with a different name, which makes it hard to track changes in the cluster. Has anyone already had this issue, or does anyone know a good way of tackling it?

I would like to suggest using Labels in Grafana Alerting
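The idea is to key the alert on a label that all the agent pods share, rather than on individual instance or pod names. A minimal sketch (the job value "grafana-agent-logs" is an assumption; use whatever job label your setup actually assigns, and note the second query assumes kube-state-metrics is installed):

  # Fires when no log-agent pod is reporting up at all.
  absent(up{job="grafana-agent-logs"} == 1)

  # Or compare the number of healthy agents to the number of nodes:
  count(up{job="grafana-agent-logs"} == 1) < count(kube_node_info)

Because the selector matches on the job label instead of the pod name, re-created DaemonSet pods and pods on newly added nodes are picked up automatically.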

Related

How can I get the correct metric for pod status?

I'm trying to get the pod status in Grafana through Prometheus in a GKE cluster.
kube-state-metrics has been installed together with Prometheus by using the prometheus-community/prometheus and grafana Helm charts.
I tried to get the pod status through kube_pod_status_phase{exported_namespace=~".+-my-namespace", pod=~"my-server-.+"}, but I get only "Running" as a result.
In other words, in the obtained graph I can see only a straight line at the value 1 for the running server. I can't get when the given pod was pending or in another state different from Running.
I am interested in the starting phase, after the pod is created, but before it is running.
Am I using the query correctly? Is there another query or it could be due to something in the installation?
If you mean the Pending status for the Pod, I think you should instead use kube_pod_status_phase{exported_namespace=~".+-my-namespace", pod=~"my-server-.+", phase="Pending"}. When you don't put a phase in your request, kube-state-metrics returns one series per phase, each either 0 or 1, and the series for the pod's current phase (Running, in your case) is always 1.
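One caveat: if the pod is Pending for only a few seconds, the scrape interval can miss it entirely. A hedged workaround is to take the maximum over a window (label names as in the question; adjust the window to your scrape interval):

  # 1 if the pod was Pending at any point in the last 5 minutes
  max_over_time(kube_pod_status_phase{exported_namespace=~".+-my-namespace", pod=~"my-server-.+", phase="Pending"}[5m])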

What is the difference between the CoreOS projects kube-prometheus and prometheus-operator?

The GitHub repo of the Prometheus Operator project (https://github.com/coreos/prometheus-operator/) says that
The Prometheus Operator makes the Prometheus configuration Kubernetes native and manages and operates Prometheus and Alertmanager clusters. It is a piece of the puzzle regarding full end-to-end monitoring.
kube-prometheus combines the Prometheus Operator with a collection of manifests to help getting started with monitoring Kubernetes itself and applications running on top of it.
Can someone elaborate this?
I've always had this exact same question and repeatedly bumped into both, but to be honest reading the above answer didn't clarify it for me; I needed a shorter explanation. I found this GitHub issue that made it crystal clear to me:
https://github.com/coreos/prometheus-operator/issues/2619
Quoting nicgirault of GitHub:
At last I realized that the prometheus-operator chart was packaging the kube-prometheus stack, but it took me around 10 hours playing around to realize this.
Here's my summarized explanation:
"kube-prometheus" and the "Prometheus Operator Helm Chart" both do the same thing: basically the Ingress / Ingress Controller concept, applied to Metrics / Prometheus Operator.
Both are a means of easily configuring, installing, and managing a huge distributed application (the Kubernetes Prometheus Stack) on Kubernetes.
What is the entire Kube Prometheus Stack, you ask? Prometheus, Grafana, Alertmanager, CRDs (Custom Resource Definitions), the Prometheus Operator (a software bot app), IaC alert rules, IaC Grafana dashboards, and IaC ServiceMonitor CRDs (which auto-generate Prometheus metric collection configuration and hot-import it into the Prometheus server).
(Also, when I say easily configuring, I mean 1,000-10,000++ lines of easy-for-humans-to-understand config that generates and auto-manages 10,000-100,000 lines of machine config, plus sensible defaults, monitoring configuration self-service, and distributed configuration sharding, with an operator/controller to combine the config and generate verbose boilerplate machine-readable config from nice human-readable config.)
If they achieve the same end goal, you might ask what's the difference between them?
https://github.com/coreos/kube-prometheus
https://github.com/helm/charts/tree/master/stable/prometheus-operator
Basically, CoreOS's kube-prometheus deploys the Prometheus Stack using Ksonnet.
The Prometheus Operator Helm Chart wraps kube-prometheus and achieves the same end result, but with Helm.
So which one to use?
Doesn't matter + they achieve the same end result + shouldn't be crazy difficult to start with 1 and switch to the other.
Helm tends to be faster to learn/develop basic mastery of.
Ksonnet is harder to learn/develop basic mastery of, but:
it's more idempotent (better for CICD automation) (but it's only a difference of 99% idempotent vs 99.99% idempotent.)
it has built-in templating, which means that if you have multiple clusters to manage that you want to keep consistent with each other, you can leverage ksonnet's templating to manage multiple instances of the Kube Prometheus Stack (for multiple envs) using a DRY code base with lots of code reuse. (If you only have a few envs and Prometheus doesn't need to change often, it's not completely unreasonable to keep 4 Helm values files in sync by hand. I've also seen Jinja2 templating used to template out Helm values files, but if you're going to bother with that you may as well just consider ksonnet.)
Kubernetes operators are Kubernetes-specific applications (pods) that configure, manage, and optimize other Kubernetes deployments automatically. They are implemented as custom controllers.
According to official coreOS website:
Operators were introduced by CoreOS as a class of software that operates other software, putting operational knowledge collected by humans into software.
The Prometheus Operator provides an easy way to deploy, configure, and monitor your Prometheus instances on a Kubernetes cluster. To do so, the Prometheus Operator introduces three types of custom resource definitions (CRDs) in Kubernetes:
Prometheus
Alertmanager
ServiceMonitor
Now, with the help of the above CRDs, you can directly create a Prometheus instance by providing kind: Prometheus, and the instance is ready to serve; likewise for Alertmanager. Without this you would have to set up a Deployment for Prometheus yourself, with its image, configuration, and much more.
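As a rough illustration of how little you have to write, a minimal Prometheus custom resource might look like this (the name, namespace, replica count, service account, and selector labels are placeholders to adapt):

  apiVersion: monitoring.coreos.com/v1
  kind: Prometheus
  metadata:
    name: example                    # placeholder name
    namespace: monitoring
  spec:
    replicas: 2
    serviceAccountName: prometheus   # assumes this ServiceAccount exists
    serviceMonitorSelector:
      matchLabels:
        team: frontend               # picks up ServiceMonitors with this label

The operator watches for this object and creates and maintains the underlying StatefulSet, configuration, and config reloading for you.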
The Prometheus Operator serves to make running Prometheus on top of Kubernetes as easy as possible, while preserving Kubernetes-native configuration options.
Now, kube-prometheus builds on the Prometheus Operator and provides minimal YAML files to create a basic setup of Prometheus, Alertmanager, and Grafana by running a single command:
git clone https://github.com/coreos/prometheus-operator.git
kubectl apply -f prometheus-operator/contrib/kube-prometheus/manifests/
By running the above command from the kube-prometheus directory, you get a monitoring namespace containing an instance of Alertmanager, an instance of Prometheus, and Grafana for the UI. This is enough for most basic setups, and if you need anything more specific for your application, you can add more YAMLs for the exporters you need.
Kube-prometheus is more of a contribution to the prometheus-operator project: it uses the Prometheus Operator's functionality and provides you a complete monitoring setup for your Kubernetes cluster. You can start with kube-prometheus and extend the functionality of your monitoring setup from there, according to your application.
You can learn more about prometheus-operator here
As of today, 28-09-2020, this is the way to install Prometheus in a Kubernetes cluster
https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack#kube-prometheus-stack
According to the official documentation, kube-prometheus-stack is a rename of the prometheus-operator chart.
As I understand it, kube-prometheus-stack also ships with preinstalled Grafana dashboards and Prometheus rules.
Note: This chart was formerly named prometheus-operator chart, now renamed to more clearly reflect that it installs the kube-prometheus project stack, within which Prometheus Operator is only one component.
Taken from https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
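For completeness, installing it via Helm typically looks like this (release name and namespace are placeholders):

  helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
  helm repo update
  helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace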
Architecturally, the containers run on Docker.
Container logs are managed by Docker by default, and the default log driver is json-file ("log-driver": "json-file").
https://docs.docker.com/config/containers/logging/configure/
If the default json-file driver is used to manage container logs, log rotation is not performed by default. For containers that generate a large amount of output, the log files stored by the driver can therefore consume a large amount of disk space and eventually cause the disk to fill up.
In this case, one approach is to ship the logs to ES (Elasticsearch), store them separately there, and periodically delete old indices using Curator, run as a scheduled task in K8s.
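A minimal sketch of such a scheduled task as a Kubernetes CronJob (the image, schedule, Elasticsearch host, index prefix, and retention are all assumptions to adapt; the flags match Curator 5.x's curator_cli, so check them against your Curator version):

  apiVersion: batch/v1        # use batch/v1beta1 on Kubernetes < 1.21
  kind: CronJob
  metadata:
    name: curator-cleanup
    namespace: logging
  spec:
    schedule: "0 3 * * *"     # every day at 03:00
    jobTemplate:
      spec:
        template:
          spec:
            restartPolicy: OnFailure
            containers:
            - name: curator
              image: untergeek/curator:5.8.4   # placeholder image/tag
              command: ["curator_cli"]
              args:
              - --host
              - elasticsearch.logging.svc      # assumed ES service name
              - --port
              - "9200"
              - delete_indices
              - --filter_list
              - '[{"filtertype":"pattern","kind":"prefix","value":"logstash-"},{"filtertype":"age","source":"name","direction":"older","timestring":"%Y.%m.%d","unit":"days","unit_count":7}]'

This deletes indices with the assumed logstash- prefix once they are older than 7 days.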
Another solution for disk space is to periodically rotate old logs in the json-files. Typically we cap the size and number of log files. The daemon settings below allow at most 10 log files, each at most 20 MB, so a container keeps at most 200 MB of logs:
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "20m",
    "max-file": "10"
  }
Note: in general, Docker places container logs by default under
/var/lib/docker/containers/
In the same vein, Kubernetes also saves logs and creates a directory structure to help you find pod-based logs, so you can find the container logs for each Pod running on a node:
/var/log/pods/<namespace>_<pod_name>_<pod_id>/<container_name>/
When a pod is removed, both the Docker logs under /var/lib/docker/containers/ and the logs Kubernetes created under /var/log/pods/ are deleted.
For example, if a pod is restarted in production, its logs are deleted whether the pod comes back on the original node or jumps to another node.
Therefore, these logs need to be saved in ES for centralized management; in most cases R&D will check the logs there for troubleshooting.

How to get the number of pods running in Prometheus

I am scraping Kubernetes metrics from Prometheus and need to extract the number of running pods.
I can see the container_last_seen metric, but how should I get the number of pods running? Can someone help with this?
If you need to get the number of running pods, you can use a metric from the list of pod metrics at https://github.com/kubernetes/kube-state-metrics/blob/master/docs/pod-metrics.md (to get info purely on pods, it makes sense to use pod-specific metrics).
For example if you need to get the number of pods per namespace, it'll be:
count(kube_pod_info{namespace="$namespace_name"}) by (namespace)
To get the number of all pods running on the cluster, then just do:
count(kube_pod_info)
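Note that kube_pod_info counts every pod known to kube-state-metrics regardless of phase. If you only want pods currently in the Running phase, a variant along these lines should work (still assuming kube-state-metrics is scraped):

  count(kube_pod_status_phase{phase="Running"} == 1) by (namespace)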
Assuming you want to display that in Grafana according to your question tags, from this Kubernetes App Metrics dashboard for example:
count(count(container_memory_usage_bytes{container_name="$container", namespace="$namespace"}) by (pod_name))
You can just import the dashboard and play with the queries.
Depending on your configuration/deployment, you can adjust the variables container_name and namespace; grouping by (pod_name) and counting does the trick. A label other than pod_name can be used, as long as it's shared between the pods you want to count.
If you want to see only the number of "deployed" pods in some namespace, you can use the solutions in previous answers.
My use case was to see the current running pods in some namespace and below is my solution:
min_over_time(sum(group(kube_pod_container_status_ready{namespace="BC_NAME"}) by (pod,uid)) [5m:1m]) OR on() vector(0)
Please replace BC_NAME with your namespace name.
The subquery window ([5m:1m]) controls how fine-grained the data is.
If no data is found (no pod currently running), it returns 0.

Grafana variable still catches old metrics info

I use Grafana + Prometheus to monitor K8s pods. When a pod is removed, I clean up all the metrics belonging to the removed pod, but I can still see it in a Grafana variable.
For example, I defined a variable named node whose query expression is {instance=~"(.+)", job="node status"}; it catches all metrics, and I use the regex /instance="([^"]+):9100"/ to match the IP of each monitored target. When I click the node label on the dashboard it displays all target IPs, and when one of these targets is removed I use the HTTP API provided by Prometheus to clean all metrics belonging to that target. But when I click the node label, it still displays the removed target's IP. Why, and how can I delete this IP?
It seems that the Prometheus targets are not updated, even though some of the pods were evicted. You can check this on your Prometheus /targets page (http://yourprometheus/targets).
Does Prometheus run inside the K8s cluster?
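For reference, deleting series via Prometheus's HTTP API requires the TSDB admin API to be enabled (--web.enable-admin-api), and deleted series only fully disappear after tombstones are cleaned. A hedged example (host and instance value are placeholders):

  # delete every series for the removed target
  curl -X POST -g 'http://yourprometheus:9090/api/v1/admin/tsdb/delete_series?match[]={instance="10.0.0.5:9100"}'
  # compact away the tombstones so the series vanish immediately
  curl -X POST 'http://yourprometheus:9090/api/v1/admin/tsdb/clean_tombstones'

Also note that if the target is still registered in Prometheus's service discovery, the series will simply be re-created on the next scrape, so the stale variable entry may come from the target still existing rather than from old data.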

Having an issue while creating a custom dashboard in Grafana (data source is Prometheus)

I have set up Prometheus and Grafana for monitoring my Kubernetes cluster and everything works fine. Then I created a custom dashboard in Grafana for my application. The metric available in Prometheus is as follows, and I have added the same in Grafana as a query:
sum(irate(container_cpu_usage_seconds_total{namespace="test", pod_name="my-app-65c7d6576b-5pgjq", container_name!="POD"}[1m])) by (container_name)
The issue is that my application runs as a pod in Kubernetes, so when the pod is deleted or recreated, its name changes and no longer matches the pod name hard-coded in the query above ("my-app-65c7d6576b-5pgjq"). The data for the query then stops working, and I have to add a new query in Grafana again. Please let me know how I can overcome this situation.
Answer was provided by manu thankachan:
I have done it. Made some changes in the query as follows:
sum(irate(container_cpu_usage_seconds_total{namespace="test", container_name="my-app", container_name!="POD"}[1m])) by (container_name)
If a pod is created directly (not as part of a Deployment), then its name stays exactly as you specified it.
If the pod is part of a Deployment, it will carry a unique string from the ReplicaSet and also end with 5 random characters to keep the name unique.
So always try to filter on the container_name label, or, if your Kubernetes version is v1.16.0 or newer, the container label.
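If you do need to match on pod names (for example, when several apps share a container name), a regex matcher on the stable name prefix survives pod re-creation. A sketch, assuming the Deployment name my-app as the prefix:

  # pre-1.16 cadvisor labels (pod_name/container_name), as used in the question;
  # on v1.16.0+ use pod=~"my-app-.*" and container!="POD" instead
  sum(irate(container_cpu_usage_seconds_total{namespace="test", pod_name=~"my-app-.*", container_name!="POD"}[1m])) by (pod_name)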