Dynamically update prometheus scrape config based on pod labels - kubernetes

I'm trying to enhance my monitoring and want to expand the amount of metrics pulled into Prometheus from our Kube estate. We already have a standalone Prom implementation with a hard-coded config file monitoring some bare-metal servers, and it hooks into cAdvisor for generic Pod metrics.
What I would like to do is configure Kube to monitor the apache_exporter metrics from a webserver deployed in the cluster, but also dynamically add a 2nd, 3rd, etc. webserver as the instances are scaled up.
I've looked at the kube-prometheus project, but this seems to be more geared to instances where there is no established Prometheus deployed. Is there a simple way to get Prometheus to scrape the Kube API or etcd to pull in the current list of pods which match certain criteria (i.e. a label like deploymentType=webserver), scrape the apache_exporter metrics for these pods, and scrape the mysqld_exporter metrics where deploymentType=mysql?
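For reference, this kind of label-based discovery is what plain Prometheus offers via kubernetes_sd_configs plus relabelling; a minimal sketch (the deploymentType label is taken from the question, the job name is an assumption, and each matching pod is assumed to expose its exporter on its declared port):
scrape_configs:
  - job_name: 'apache-exporter-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods carrying the label deploymentType=webserver
      - source_labels: [__meta_kubernetes_pod_label_deploymentType]
        regex: webserver
        action: keep
      # Use the pod name as the instance label for readability
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: instance
A second job with regex: mysql would cover the mysqld_exporter pods in the same way.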

There's a project called kube-prometheus-stack (formerly prometheus-operator): https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
It has concepts called ServiceMonitor and PodMonitor:
https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/design.md#servicemonitor
https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/design.md#podmonitor
Basically, these are selectors that point your Prometheus instance to scrape targets. In the case of a service selector, it discovers all the pods behind the service. In the case of a pod selector, it discovers pods directly. The Prometheus scrape config is updated and reloaded automatically in both cases.
Example PodMonitor:
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: example
  namespace: monitoring
spec:
  podMetricsEndpoints:
  - interval: 30s
    path: /metrics
    port: http
  namespaceSelector:
    matchNames:
    - app
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app
Note that this PodMonitor object itself must be discovered by the controller. To achieve this you write a podMonitorSelector on the Prometheus resource. This additional explicit linkage is done intentionally: if you have 2 Prometheus instances on your cluster (say Infra and Product), you can control which Prometheus gets which Pods in its scrape config.
The same applies to a ServiceMonitor.
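A minimal sketch of that linkage on the Prometheus resource (the instance name and label values here are assumptions):
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: infra
  namespace: monitoring
spec:
  # Only PodMonitors carrying this label are picked up by this instance
  podMonitorSelector:
    matchLabels:
      prometheus: infra
  # Likewise for ServiceMonitors
  serviceMonitorSelector:
    matchLabels:
      prometheus: infra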

Related

Horizontal pod Autoscaling without custom metrics

We want to scale our pods horizontally based on the number of messages in our Kafka topic. The standard solution is to publish the metrics to the custom metrics API of Kubernetes. However, due to company guidelines we are not allowed to use the custom metrics API of Kubernetes. We are only allowed to use non-admin functionality. Is there a solution for this with Kubernetes-native features, or do we need to implement a customized solution?
I'm not exactly sure if this would fit your needs but you could use Autoscaling on metrics not related to Kubernetes objects.
Applications running on Kubernetes may need to autoscale based on metrics that don’t have an obvious relationship to any object in the Kubernetes cluster, such as metrics describing a hosted service with no direct correlation to Kubernetes namespaces. In Kubernetes 1.10 and later, you can address this use case with external metrics.
Using external metrics requires knowledge of your monitoring system; the setup is similar to that required when using custom metrics. External metrics allow you to autoscale your cluster based on any metric available in your monitoring system. Just provide a metric block with a name and selector, as above, and use the External metric type instead of Object. If multiple time series are matched by the metricSelector, the sum of their values is used by the HorizontalPodAutoscaler. External metrics support both the Value and AverageValue target types, which function exactly the same as when you use the Object type.
For example if your application processes tasks from a hosted queue service, you could add the following section to your HorizontalPodAutoscaler manifest to specify that you need one worker per 30 outstanding tasks.
- type: External
  external:
    metric:
      name: queue_messages_ready
      selector:
        matchLabels:
          queue: "worker_tasks"
    target:
      type: AverageValue
      averageValue: 30
When possible, it’s preferable to use the custom metric target types instead of external metrics, since it’s easier for cluster administrators to secure the custom metrics API. The external metrics API potentially allows access to any metric, so cluster administrators should take care when exposing it.
You may also have a look at zalando-incubator/kube-metrics-adapter and use its Prometheus collector for external metrics.
This is an example of an HPA configured to get metrics based on a Prometheus query. The query is defined in the annotation metric-config.external.prometheus-query.prometheus/processed-events-per-second, where processed-events-per-second is the query name which will be associated with the result of the query. A matching query-name label must be defined in the matchLabels of the metric definition. This allows having multiple Prometheus queries associated with a single HPA.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
  annotations:
    # This annotation is optional.
    # If specified, then this prometheus server is used,
    # instead of the prometheus server specified as the CLI argument `--prometheus-server`.
    metric-config.external.prometheus-query.prometheus/prometheus-server: http://prometheus.my-namespace.svc
    # metric-config.<metricType>.<metricName>.<collectorName>/<configKey>
    # <configKey> == query-name
    metric-config.external.prometheus-query.prometheus/processed-events-per-second: |
      scalar(sum(rate(event-service_events_count{application="event-service",processed="true"}[1m])))
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: custom-metrics-consumer
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: prometheus-query
        selector:
          matchLabels:
            query-name: processed-events-per-second
      target:
        type: AverageValue
        averageValue: "10"

Configuring Prometheus to access another cluster where applications are deployed

I am a newbie at using monitoring tools such as Prometheus in k8s. We have two separate clusters: one where our applications are deployed, and one where we only want to deploy monitoring and logging tools.
But I have some confusion about how to handle this:
1. How can the cluster that serves Prometheus connect to the application cluster and pull metrics?
2. How should I specify the namespace if I would like to set a network policy?
3. What should I do on the application side in the Helm chart, apart from exporting metrics?
# Allow traffic from pods with label app=prometheus in namespace with label name=monitoring
# to any pod in <YOUR_APPLICATION_NAMESPACE>
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: monitoring.prometheus.all
  namespace: <YOUR_APPLICATION_NAMESPACE>
spec:
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: monitoring
      podSelector:
        matchLabels:
          app: prometheus
  podSelector: {}
  policyTypes:
  - Ingress
Isn't that what you want?
1) Prometheus federation
Prometheus federation is a Prometheus server that can scrape data from other Prometheus servers. It supports hierarchical federation, which in our case resembles a tree.
A default version of the Prometheus server is installed in each one of our clusters and a Prometheus federation server is deployed together with Grafana in a central monitoring cluster. Prometheus federation scrapes data from all the other Prometheus servers that run in our clusters. For future expansion, a central Prometheus federation can be used to scrape data from multiple Prometheus federation servers that scrape data from groups of tens of clusters.
More info here: https://developers.mattermost.com/blog/cloud-monitoring/
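A minimal sketch of the federation scrape job on the central Prometheus (the target addresses and the match[] selector are assumptions):
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        # Pull only series from the kubernetes-* jobs of each cluster-local Prometheus
        - '{job=~"kubernetes-.*"}'
    static_configs:
      - targets:
        - 'prometheus.cluster-a.example.com:9090'
        - 'prometheus.cluster-b.example.com:9090'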
2) Prometheus configuration to scrape Kubernetes outside the cluster yaml example
3) Linkedin Monitoring Kubernetes with Prometheus - outside the cluster! article and Reddit Monitoring K8s by Prometheus Outside Cluster related discussion

prometheus operator - enable monitoring for everything in all namespaces

I want to monitor a couple applications running on a Kubernetes cluster in namespaces named development and production through prometheus-operator.
Installation command used (as per Github) is:
helm install prometheus-operator stable/prometheus-operator -n production --set prometheusOperator.enabled=true,prometheus.service.type=NodePort,prometheusOperator.service.type=NodePort,alertmanager.service.type=NodePort,grafana.service.type=NodePort,grafana.service.nodePort=30906
What parameters do I need to add to above command to have prometheus-operator discover and monitor all apps/services/pods running in all namespaces?
With this, Service Discovery only shows some prometheus-operator related services, but not the app that I am running within the 'production' namespace, even though prometheus-operator is installed in the same namespace.
Anything I am missing?
Note - I am performing all actions using the same user (which uses the $HOME/.kube/config file), so I assume permissions are not an issue.
kubectl version - v1.17.3
helm version - 3.1.2
P.S. There are numerous articles on this on different forums, but am still not finding simple and direct answers for this.
I had the same problem. After some investigation, I am answering with more details.
I've installed Prometheus stack via Helm charts which include Prometheus operator chart directly as a sub-project. Prometheus operator monitors namespaces specified by the following helm values:
prometheusOperator:
  namespaces: ''
  denyNamespaces: ''
  prometheusInstanceNamespaces: ''
  alertmanagerInstanceNamespaces: ''
  thanosRulerInstanceNamespaces: ''
The namespaces value specifies monitored namespaces for ServiceMonitor and PodMonitor CRDs. Other CRDs have their own settings, which if not set, default to namespaces. Helm values are passed as command-line arguments to the operator. See here and here.
Prometheus CRDs are picked up by the operator from the mentioned namespaces, by default everywhere. However, as the operator is designed with multiple simultaneous Prometheus releases in mind, what gets picked up by a particular Prometheus app instance is controlled by the corresponding Prometheus CRD. CRD selectors and the corresponding namespace selectors are controlled via the following Helm values:
prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: true
    serviceMonitorSelector: {}
    serviceMonitorNamespaceSelector: {}
Similar values are present for other CRDs: alertmanagerConfigXXX, ruleNamespaceXXX, podMonitorXXX, probeXXX. XXXSelectorNilUsesHelmValues set to true means the operator only looks for CRDs carrying a particular release label, e.g. release=myrelease. See here.
Empty selector (for a namespace, CRD, or any other object) means no filtering. So for the Prometheus object to pick up a ServiceMonitor from other namespaces there are a few options (an example of the second one follows the list):
Set serviceMonitorSelectorNilUsesHelmValues: false. This leaves serviceMonitorSelector empty.
Apply the release label, e.g. release=myrelease, to your ServiceMonitor CRD.
Set a non-empty serviceMonitorSelector that matches your ServiceMonitor.
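As an illustration of the second option, a minimal sketch of a ServiceMonitor carrying the release label (names and selectors here are assumptions):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: development
  labels:
    # Must match the release label the operator's Prometheus selects on
    release: myrelease
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: http-metrics
    interval: 30s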
For the curious ones here are links to the operator sources:
Enqueue of Prometheus CRD processing
Processing of Prometheus CRD
I used values.yaml from https://github.com/helm/charts/blob/master/stable/prometheus-operator/values.yaml, modified the *NilUsesHelmValues parameters to false, and it seems to work fine with that.
helm install prometheus-operator stable/prometheus-operator -n monitoring -f values.yaml
Also, like https://stackoverflow.com/users/7889479/anish-kumar-mourya stated, the services do show in the Grafana dashboard even though they don't appear in the Prometheus UI under Service Discovery or Targets.
Hope this helps other newbies like me.
No, it's fine, but it would be good to create a new namespace for monitoring and install Prometheus there, to keep everything related to monitoring in one place.
helm install prometheus-operator stable/prometheus-operator -n monitoring
You need to create a Service for the pod and a ServiceMonitor custom resource to configure which services in which namespaces need to be discovered by Prometheus.
kube-state-metrics Service example
apiVersion: v1
kind: Service
metadata:
  labels:
    app: kube-state-metrics
    k8s-app: kube-state-metrics
  annotations:
    alpha.monitoring.coreos.com/non-namespaced: "true"
  name: kube-state-metrics
spec:
  ports:
  - name: http-metrics
    port: 8080
    targetPort: metrics
    protocol: TCP
  selector:
    app: kube-state-metrics
This Service targets all Pods with the label app: kube-state-metrics (the Service's spec.selector).
Generic ServiceMonitor example
This ServiceMonitor targets all Services that have the label k8s-app (spec.selector) with any value, in the namespaces kube-system and monitoring (spec.namespaceSelector).
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: k8s-apps-http
  labels:
    k8s-apps: http
spec:
  jobLabel: k8s-app
  selector:
    matchExpressions:
    - {key: k8s-app, operator: Exists}
  namespaceSelector:
    matchNames:
    - kube-system
    - monitoring
  endpoints:
  - port: http-metrics
    interval: 15s
https://github.com/coreos/prometheus-operator/blob/master/Documentation/user-guides/running-exporters.md

Avoiding Prometheus calling all instances of a k8s service (only one, app-wide metrics collection)

I need to expose application-wide metrics for Prometheus collection from a Kubernetes application that is deployed with multiple instances, e.g. scaled by Horizontal Pod Autoscaler.
The scrape point is exposed by every instance of the pod for fail-over purposes; however, I do not want Prometheus to actually call the scrape endpoint on every pod instance, only on one instance at a time, failing over to another instance only if necessary.
The statistics are application-wide, not per pod instance; all instance endpoints report the same data, and calling them in parallel would serve no useful purpose and only increase the workload on the backend system that has to be queried for statistics. I do not want 30 calls to the backend (assuming the app is scaled up to 30 pods) where just one call would suffice.
I hoped that exposing the scrape endpoint as a k8s service (and annotating the service for scraping) would do the trick. However, instead of going through the service proxy and letting it route the request to one of the pods, Prometheus seems to go directly to the instances behind the service, and to all of them, rather than only one at a time.
Is there a way to avoid Prometheus calling all the instances, and have it call only one?
The service is defined as:
apiVersion: v1
kind: Service
metadata:
  name: k8worker-msvc
  labels:
    app: k8worker-msvc
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/path: '/metrics'
    prometheus.io/port: '3110'
spec:
  selector:
    app: k8worker
  type: LoadBalancer
  ports:
  - protocol: TCP
    port: 3110
    targetPort: 3110
In case this is not possible, what are my options other than running leader election inside the app and reporting empty metrics data from non-leader instances?
Thanks for advice.
This implies the metrics are coming from some kind of backend database rather than a usual in-process exporter. Move the metrics endpoint to a new service connected to the same DB and only run one copy of it.
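A minimal sketch of that approach, assuming a dedicated exporter that queries the shared backend (names, image, and port are assumptions):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: k8worker-metrics-exporter
spec:
  replicas: 1                      # only one copy queries the shared backend
  selector:
    matchLabels:
      app: k8worker-metrics-exporter
  template:
    metadata:
      labels:
        app: k8worker-metrics-exporter
    spec:
      containers:
      - name: exporter
        image: example/k8worker-exporter:latest   # hypothetical exporter image
        ports:
        - containerPort: 3110
---
apiVersion: v1
kind: Service
metadata:
  name: k8worker-metrics
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/path: '/metrics'
    prometheus.io/port: '3110'
spec:
  selector:
    app: k8worker-metrics-exporter
  ports:
  - protocol: TCP
    port: 3110
    targetPort: 3110
Prometheus then discovers only this single endpoint, and the application pods no longer need to expose metrics at all.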

How to configure a Kubernetes Multi-Pod Deployment

I would like to deploy an application cluster by managing my deployment via k8s Deployment object. The documentation has me extremely confused. My basic layout has the following components that scale independently:
API server
UI server
Redis cache
Timer/Scheduled task server
Technically, all 4 above belong in separate pods that are scaled independently.
My questions are:
Do I need to create pod.yml files and then somehow reference them in the deployment.yml file, or can a deployment file also embed pod definitions?
K8s documentation seems to imply that the spec portion of Deployment is equivalent to defining one pod. Is that correct? What if I want to declaratively describe multi-pod deployments? Do I need multiple deployment.yml files?
Pagid's answer has most of the basics. You should create 4 Deployments for your scenario. Each deployment will create a ReplicaSet that schedules and supervises the collection of PODs for the Deployment.
Each Deployment will most likely also require a Service in front of it for access. I usually create a single yaml file that has a Deployment and the corresponding Service in it. Here is an example for an nginx.yaml that I use:
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
  name: nginx
  labels:
    app: nginx
spec:
  type: NodePort
  ports:
  - port: 80
    name: nginx
    targetPort: 80
    nodePort: 32756
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginxdeployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginxcontainer
        image: nginx:latest
        imagePullPolicy: Always
        ports:
        - containerPort: 80
Here is some additional information for clarification:
A POD is not a scalable unit. A Deployment that schedules PODs is.
A Deployment is meant to represent a single group of PODs fulfilling a single purpose together.
You can have many Deployments work together in the virtual network of the cluster.
For accessing a Deployment that may consist of many PODs running on different nodes you have to create a Service.
Deployments are meant to contain stateless services. If you need to store state you need to create a StatefulSet instead (e.g. for a database service).
You can use the Kubernetes API reference for the Deployment and you'll find that the spec->template field is of type PodTemplateSpec, along with the related comment ("Template describes the pods that will be created."); this answers your questions. A longer description can of course be found in the Deployment user guide.
To answer your questions...
1) The Pods are managed by the Deployment and defining them separately doesn't make sense as they are created on demand by the Deployment. Keep in mind that there might be more replicas of the same pod type.
2) For each of the applications in your list, you'd have to define one Deployment - which also makes sense when it comes to different replica counts and application rollouts.
3) you haven't asked that but it's related - along with separate Deployments each of your applications will also need a dedicated Service so the others can access it.
Additional information:
API server: use a Deployment
UI server: use a Deployment
Redis cache: use a StatefulSet (see the sketch below)
Timer/Scheduled task server: maybe use a StatefulSet (if your service keeps some state)
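A minimal sketch of a StatefulSet plus headless Service for the Redis cache (names, image, and storage size are assumptions):
apiVersion: v1
kind: Service
metadata:
  name: redis
  labels:
    app: redis
spec:
  clusterIP: None        # headless Service, required by the StatefulSet
  ports:
  - port: 6379
    name: redis
  selector:
    app: redis
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
spec:
  serviceName: redis
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:6
        ports:
        - containerPort: 6379
          name: redis
        volumeMounts:
        - name: data
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi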