I'm working on a production K8s cluster with an HTTP-based application and I'd like to setup monitoring and alerting for HTTP errors. It's clear how to check the uptime of the service (using monitoring e.g. stackdriver), but absolutely not regarding HTTP failure rate.
I've got an nginx-ingress-controller as an end-point (with external load balancer).
How to collect and view the metrics such as latency, HTTP failures, etc. from this load balancer?
In particular I need to now, when HTTP response failure rate exceeds some percentage.
If you are looking at monitoring HTTP 4XX and 5XX errors for example I believe the best way is to aggregate the load balancer and the nginx ingress controller logs in some logging tool. If you are looking at open source solutions you could use something like Elasticsearch with Kibana to visualize the errors over a time frame. To send the logs you can use a forwarder like fluent-bit or Fluentd.
If you have a budget for a paid tool you can use a commercially available solution like:
Loggly
Datadog logging
Papertrail
etc.
Then you can set up alerts with any of these tools. For Elasticsearch you can use something like elastalert
If you are using GCP you can also use their Logging tool, create a custom metric, and alert on that metric.
Another alternative, but may not have the metrics that you are looking for is to use Prometheus with an Nginx ingress Prometheus exporter to monitor nginx metrics (it depends on what metrics you'd like to monitor)
Related
I want to know if it's possible to get metrics for the services inside the pods using Prometheus.
I don't mean monitoring the pods but the processes inside those pods. For example, containers which have apache or nginx running inside them along other main services, so I can retrieve metrics for the web server and the other main service (for example a wordpress image which aso comes with an apache configured).
The cluster already has running kube-state-metrics, node-exporter and blackbox exporter.
Is it possible? If so, how can I manage to do it?
Thanks in advance
Prometheus works by scraping an HTTP endpoint that provides the actual metrics. That's where you get the term "exporter". So if you want to get metrics from the processes running inside of pods you have three primary steps:
You must modify those processes to export the metrics you care about. This is inherently something that must be custom for each kind of application. The good news is that there are lots of pre-built ones including things like nginx and apache that you mention . Most application frameworks also have capability to export prometheus metrics. ex: Microprofile, Quarkus, and many more.
You must then modify your pod definition to expose the HTTP endpoint that those processes are now providing. Very straightfoward, but will depend on the configuration you specify for your exporters.
You must then modify your Prometheus to scrape those targets. This will depend on your monitoring stack. For Openshift you will find the docs here for enabling user workload monitoring, and here for providing exporter details.
I would like to monitor all ELK service running in our kubernetes clusters to be sure, that is still running properly.
I am able to monitor Kibana portal via URL. ElasticSearch via Prometheus and his metrics (ES have some interested metrics to be sure, that ES is working well).
But exist something similar for Filebeat, Logstash, ... ? Have these daemons some exposed metrics for Prometheus, which is possible to watching and analizing it states?
Thank you very much for all hints.
There is an exporter for ElasticSearch found here: https://github.com/prometheus-community/elasticsearch_exporter and an exporter for Kibana found here: https://github.com/pjhampton/kibana-prometheus-exporter These will enable your Prometheus to scrape the endpoints and collect metrics.
We are also working on a new profiler inside of OpenSearch which will provide much more detailed metrics and fix a lot of bugs. That will also natively provide an exporter for Prometheus to scrape : https://github.com/opensearch-project/OpenSearch/issues/539 you can follow along here, this is in active development if you are looking for an open-source alternative to ElasticSearch and Kibana.
Yes, both the beats and logstash have metrics endpoint for monitoring.
These monitoring endpoints are built to be consumed using metricbeat, but since they return a json you can use other tools to monitor it.
For logstash the metrics endpoint is enabled by default, listening on localhost at port 9600, and from the documentation you have these two endpoints:
node
node_stats
For the beats family you need to enable it as if you would consume the metrics using metricbeat, this documentation explains how to do that.
Then you will have two endpoints:
stats
state
So you would just need to use those endpoints to collect the metrics.
I was looking into Kubernetes Heapster and Metrics-server for getting metrics from the running pods. But the issue is, I need some custom metrics which might vary from pod to pod, and apparently Heapster only provides cpu and memory related metrics. Is there any tool already out there, which would provide me the functionality I want, or do I need to build one from scratch?
What you're looking for is application & infrastructure specific metrics. For this, the TICK stack could be helpful! Specifically Telegraf can be set up to gather detailed infrastructure metrics like Memory- and CPU pressure or even the resources used by individual docker containers, network and IO metrics etc... But it can also scrape Prometheus metrics from pods. These metrics are then shipped to influxdb and visualized using either chronograph or grafana.
Not sure if this is still open.
I would classify metrics into 3 types.
Events or Logs - System and Applications events which are sent to logs. These are non-deterministic.
Metrics - CPU and Memory utilization on the node the app is hosted. This is deterministic and are collected periodically.
APM - Applicaton Performance Monitoring metrics - these are application level metrics like requests received vs failed vs responded etc.
Not all the platforms do everything. ELK for instance does both the Metrics and Log Monitoring and does not do APM. Some of these tools have plugins into collect daemons which collect perfmon metrics of the node.
APM is a completely different area as it requires developer tool to provider metrics as Springboot does Actuator, Nodejs does AppMetrics etc. This carries the request level data. Statsd is an open source library which application can consume to provide APM metrics too Statsd agents installed in the node.
AWS offers CloudWatch agents for log shipping and sink and Xray for distributed tracing which can be used for APM.
would like to see k8 Service level metrics in Grafana from underlying prometheus server.
For instance:
1) If i have 3 application pods exposed through a service i would like to see service level metrics for CPU,memory & network I/O pressure ,Total # of requests,# of requests failed
2)Also if i have group of pods(replicas) related to an application which doesn"t have Service on top of them would like to see the aggregated metrics of the pods related to that application in a single view on grafana
What would be the prometheus queries to achieve the same
Service level metrics for CPU, memory & network I/O pressure
If you have Prometheus installed on your Kubernetes cluster, all those statistics are being already collected by Prometheus. There are many good articles about how to install and how to use Kubernetes+Prometheus, try to check that one, as an example.
Here is an example of a request to fetch container memory usage:
container_memory_usage_bytes{image="CONTAINER:VERSION"}
Total # of requests,# of requests failed
Those are service-level metrics, and for collecting them, you need to use Prometheus Exporter created especially for your service. Check the list with exporters, find one which you need for your service and follow its instruction.
If you cannot find an Exporter for your application, you can write it yourself, here is an official documentation about it.
application which doesn"t have Service on top of them would like to see the aggregated metrics of the pods related to that application in a single view on grafana
It is possible to combine any graphics in a single view in Grafana using Dashboards and Panels. Check an official documentation, all that topics pretty detailed and easy to understand.
Aggregation can be done by Prometheus itself by aggregation operations.
All metrics from Kubernetes has labels, so you can group by them:
sum(http_requests_total) by (application, group), where application and group is labels.
Also, here is an official Prometheus instruction about how to add Prometheus to Grafana as a Datasourse.
I am trying to figure out how to best collect metrics from a set of spring boot based services running within a Kubernetes cluster. Looking at the various docs, it seems that the choice for internal monitoring is between Actuator or Spectator with metrics being pushed to an external collection store such as Redis or StatsD or pulled, in the case of Prometheus.
Since the number of instances of a given service is going to vary, I dont see how Prometheus can be configured to poll those running services since it will lack knowledge of them. I am also building around a Eureka service registry so not sure if that is polled first in this configuration.
Any real world insight into this kind of approach would be welcome.
You should use the Prometheus java client (https://www.robustperception.io/instrumenting-java-with-prometheus/) for instrumenting. Approaches like redis and statsd are to be avoided, as they mean hitting the network on every single event - greatly limiting what you can monitor.
Use file_sd service discovery in Prometheus to provide it with a list of targets from Eureka (https://www.robustperception.io/using-json-file-service-discovery-with-prometheus/), though if you're using Kubernetes like your tag hints Prometheus has a direct integration there.