Hey, I'm currently trying to determine the uptime of a pod with kube-state-metrics, specifically when a pod has started or stopped. I am using a Prometheus Deployment together with kube-state-metrics to capture these start and stop events.
Specifically I want to get the following metrics:
kube_pod_completion_time
kube_pod_created
As a test I've configured Prometheus to gather metrics with the following config.yml file:
global:
scrape_interval: 10m
scrape_timeout: 10s
evaluation_interval: 10m
scrape_configs:
- job_name: kubernetes-nodes-cadvisor
honor_timestamps: true
scrape_interval: 10m
scrape_timeout: 10s
metrics_path: /metrics
scheme: https
authorization:
type: Bearer
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
follow_redirects: true
enable_http2: true
relabel_configs:
- separator: ;
regex: __meta_kubernetes_node_label_(.+)
replacement: $1
action: labelmap
- separator: ;
regex: (.*)
target_label: __address__
replacement: kubernetes.default.svc:443
action: replace
- source_labels: [__meta_kubernetes_node_name]
separator: ;
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor
action: replace
metric_relabel_configs:
- source_labels: [__name__]
regex: '(container_cpu_usage_seconds_total|container_fs_reads_bytes_total|container_fs_writes_bytes_total|container_memory_max_usage_bytes|container_network_receive_bytes_total|container_network_transmit_bytes_total)'
action: keep
kubernetes_sd_configs:
- role: node
kubeconfig_file: ''
follow_redirects: true
enable_http2: true
- job_name: 'kube-state-metrics'
scrape_interval: 10m
static_configs:
- targets: ['kube-state-metrics.kube-system.svc.cluster.local:8080']
metric_relabel_configs:
- source_labels: [__name__]
regex: '(kube_pod_labels|kube_pod_created|kube_pod_completion_time|kube_pod_container_resource_limits)'
action: keep
remote_write:
- url: http://example.com
remote_timeout: 30s
follow_redirects: true
enable_http2: true
oauth2:
token_url: https://example.com
client_id: myCoolID
client_secret: myCoolPassword
queue_config:
capacity: 2500
max_shards: 200
min_shards: 1
max_samples_per_send: 10
batch_send_deadline: 5s
min_backoff: 30ms
max_backoff: 5s
metadata_config:
send: false
Additionally, I have the following test pod Deployment running:
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: busy-box-test
spec:
replicas: 1
selector:
matchLabels:
app: busy-box-test
template:
metadata:
labels:
app: busy-box-test
spec:
containers:
- command:
- sleep
- '300'
image: busybox
name: test-box
However, when I search for kube_pod_completion_time in my remote-write destination, I cannot find any samples, while all the other metrics specified in the regex (kube_pod_labels|kube_pod_created ... kube_pod_container_resource_limits) are present.
Additionally, I've tried the following commands to see if the metrics are present in the cluster:
kubectl get --raw '/metrics' | grep kube_ and kubectl get --raw 'kube-state-metrics.kube-system.svc.cluster.local:8080', but I don't find anything definitive. I suspect the commands are looking in the wrong location.
So, beyond anything obvious that I missed, I have the following open questions:
Is there an endpoint I should hit inside the cluster that should return the completion time? Is there an issue with the polling interval being once every 10 minutes for a pod that comes up and down every 5? (If anyone knows how long a terminated pod's history sticks around in kube-state-metrics, that would be great to know as well.)
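One thing I plan to try: since kube_pod_completion_time presumably only appears once a pod has actually run to completion, and a Deployment restarts an exited container rather than letting the pod complete, I may switch the test to a Job so the pod can reach the Succeeded phase. A sketch of the Job variant (the name is illustrative):

apiVersion: batch/v1
kind: Job
metadata:
  name: busy-box-test-job   # illustrative name
spec:
  template:
    spec:
      restartPolicy: Never   # lets the pod actually complete
      containers:
        - command:
            - sleep
            - '300'
          image: busybox
          name: test-box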
I've included the configuration for kube state metrics here: https://gist.github.com/twosdai/12607c8459bdb73fc98edbbcb17b5eb5 in order to keep the post a bit more concise. The cluster is running in AWS EKS Version: 1.22
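On the interval question, I may also tighten the scrape interval for just the kube-state-metrics job, so that a pod living roughly five minutes cannot start and terminate entirely between two scrapes; a sketch (the 30s value is an arbitrary choice):

- job_name: 'kube-state-metrics'
  # scrape more often than the lifetime of the shortest-lived pods
  scrape_interval: 30s
  static_configs:
    - targets: ['kube-state-metrics.kube-system.svc.cluster.local:8080']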
I am trying to get logs from a single namespace through Promtail and scrape_configs, but I am not getting results. I am installing in k8s with
helm install loki grafana/loki-stack -n loki-test -f ~/loki-stack-values.yml
and the contents of my values file are:
loki:
enabled: true
promtail:
enabled: true
pipelineStages:
- cri: {}
- json:
expressions:
is_even: is_even
level: level
version: version
scrape_configs:
- job_name: kubernetes-pods
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_namespace]
action: keep
regex: mongodb-test
# [...]
- job_name: kubernetes-pods-app
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_namespace]
action: keep
regex: mongodb-test
grafana:
enabled: true
sidecar:
datasources:
enabled: true
image:
tag: 8.3.5
My expectation was that I would only get logs from the mongodb-test namespace, but I can view logs from every namespace present.
I also tried with drop, but it did not do anything.
What should I do here?
Thank you so much
Using a match stage to drop the other namespaces under pipelineStages worked for me. In your case:
config:
snippets:
pipelineStages:
- cri: {}
- match:
selector: '{namespace!~"mongodb-test"}'
action: drop
common:
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: namespace
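For completeness, in a loki-stack values file this would most likely sit under the promtail key, since the chart passes those values through to the Promtail subchart; a hedged sketch (the exact key layout depends on the chart version):

promtail:
  enabled: true
  config:
    snippets:
      pipelineStages:
        - cri: {}
        - match:
            # drop everything that is not from the mongodb-test namespace
            selector: '{namespace!~"mongodb-test"}'
            action: drop
      common:
        # make sure the namespace label the selector matches on exists
        - action: replace
          source_labels:
            - __meta_kubernetes_namespace
          target_label: namespace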
I installed Prometheus on my Kubernetes cluster with Helm, using the community chart kube-prometheus-stack, and I get some beautiful dashboards in the bundled Grafana instance. I now wanted the recommender from the Vertical Pod Autoscaler to use Prometheus as a data source for historic metrics, as described here. That meant changing the Prometheus scraper settings for cAdvisor, and this answer pointed me in the right direction: after making that change I can now see the correct job tag on metrics from cAdvisor.
Unfortunately, now some of the charts in the Grafana dashboards are broken. It looks like it no longer picks up the CPU metrics - and instead just displays "No data" for the CPU-related charts.
So, I assume I have to tweak the charts to be able to pick up the metrics correctly again, but I don't see any obvious places to do this in Grafana?
Not sure if it is relevant for the question, but I am running my Kubernetes cluster on Azure Kubernetes Service (AKS).
This is the full values.yaml I supply to the Helm chart when installing Prometheus:
kubeControllerManager:
enabled: false
kubeScheduler:
enabled: false
kubeEtcd:
enabled: false
kubeProxy:
enabled: false
kubelet:
serviceMonitor:
# Disables the normal cAdvisor scraping, as we add it with the job name "kubernetes-cadvisor" under additionalScrapeConfigs
# The reason for doing this is to enable the VPA to use the metrics for the recommender
# https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/FAQ.md#how-can-i-use-prometheus-as-a-history-provider-for-the-vpa-recommender
cAdvisor: false
prometheus:
prometheusSpec:
retention: 15d
storageSpec:
volumeClaimTemplate:
spec:
# the azurefile storage class is created automatically on AKS
storageClassName: azurefile
accessModes: ["ReadWriteMany"]
resources:
requests:
storage: 50Gi
additionalScrapeConfigs:
- job_name: 'kubernetes-cadvisor'
scheme: https
metrics_path: /metrics/cadvisor
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
Kubernetes version: 1.21.2
kube-prometheus-stack version: 18.1.1
helm version: version.BuildInfo{Version:"v3.6.3", GitCommit:"d506314abfb5d21419df8c7e7e68012379db2354", GitTreeState:"dirty", GoVersion:"go1.16.5"}
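For reference, the VPA FAQ linked above has the recommender read its history from Prometheus via command-line flags, and it expects the cAdvisor metrics under the job name kubernetes-cadvisor, which is why the scrape job is renamed. A sketch of the relevant recommender container args (the Prometheus address is an assumption for my setup):

containers:
  - name: recommender
    args:
      - --storage=prometheus
      # the address is an assumption; point it at your Prometheus service
      - --prometheus-address=http://prometheus-operated.monitoring.svc:9090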
Unfortunately, I don't have access to Azure AKS, so I've reproduced this issue on my GKE cluster. Below I'll provide some explanations that may help to resolve your problem.
First, you can try to execute the node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate recording rule to see if it returns any result.
If it doesn't return any records, please read the following paragraphs.
Creating a scrape configuration for cAdvisor
Rather than creating a completely new scrape configuration for cAdvisor, I would suggest reusing the one that is generated by default when kubelet.serviceMonitor.cAdvisor is set to true, with a few modifications such as changing the label to job=kubernetes-cadvisor.
In my example, the 'kubernetes-cadvisor' scrape configuration looks like this:
NOTE: I added this config under the additionalScrapeConfigs in the values.yaml file (the rest of the values.yaml file may be like yours).
- job_name: 'kubernetes-cadvisor'
honor_labels: true
honor_timestamps: true
scrape_interval: 30s
scrape_timeout: 10s
metrics_path: /metrics/cadvisor
scheme: https
authorization:
type: Bearer
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
follow_redirects: true
relabel_configs:
- source_labels: [job]
separator: ;
regex: (.*)
target_label: __tmp_prometheus_job_name
replacement: $1
action: replace
- source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
separator: ;
regex: kubelet
replacement: $1
action: keep
- source_labels: [__meta_kubernetes_service_label_k8s_app]
separator: ;
regex: kubelet
replacement: $1
action: keep
- source_labels: [__meta_kubernetes_endpoint_port_name]
separator: ;
regex: https-metrics
replacement: $1
action: keep
- source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
separator: ;
regex: Node;(.*)
target_label: node
replacement: ${1}
action: replace
- source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
separator: ;
regex: Pod;(.*)
target_label: pod
replacement: ${1}
action: replace
- source_labels: [__meta_kubernetes_namespace]
separator: ;
regex: (.*)
target_label: namespace
replacement: $1
action: replace
- source_labels: [__meta_kubernetes_service_name]
separator: ;
regex: (.*)
target_label: service
replacement: $1
action: replace
- source_labels: [__meta_kubernetes_pod_name]
separator: ;
regex: (.*)
target_label: pod
replacement: $1
action: replace
- source_labels: [__meta_kubernetes_pod_container_name]
separator: ;
regex: (.*)
target_label: container
replacement: $1
action: replace
- separator: ;
regex: (.*)
target_label: endpoint
replacement: https-metrics
action: replace
- source_labels: [__metrics_path__]
separator: ;
regex: (.*)
target_label: metrics_path
replacement: $1
action: replace
- source_labels: [__address__]
separator: ;
regex: (.*)
modulus: 1
target_label: __tmp_hash
replacement: $1
action: hashmod
- source_labels: [__tmp_hash]
separator: ;
regex: "0"
replacement: $1
action: keep
kubernetes_sd_configs:
- role: endpoints
kubeconfig_file: ""
follow_redirects: true
namespaces:
names:
- kube-system
Modifying Prometheus Rules
By default, the Prometheus rules that fetch data from cAdvisor use job="kubelet" in their PromQL expressions.
After changing the job name from kubelet to kubernetes-cadvisor in the scrape configuration, we also need to update this label in the Prometheus rules.
NOTE: We only need to modify the rules that have metrics_path="/metrics/cadvisor" (these are the rules that retrieve data from cAdvisor).
$ kubectl get prometheusrules prom-1-kube-prometheus-sta-k8s.rules -o yaml
...
- name: k8s.rules
rules:
- expr: |-
sum by (cluster, namespace, pod, container) (
irate(container_cpu_usage_seconds_total{job="kubernetes-cadvisor", metrics_path="/metrics/cadvisor", image!=""}[5m])
) * on (cluster, namespace, pod) group_left(node) topk by (cluster, namespace, pod) (
1, max by(cluster, namespace, pod, node) (kube_pod_info{node!=""})
)
record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
...
There are a few more rules in the same group to modify in the same way...
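Note that changes made by hand to the generated PrometheusRule objects tend to be reverted on the next helm upgrade. The chart can instead disable the default k8s rule group and take a patched copy from values; a hedged sketch (key names as of kube-prometheus-stack around 18.x, so verify against your chart version):

defaultRules:
  rules:
    k8s: false
additionalPrometheusRulesMap:
  patched-k8s-rules:
    groups:
      - name: k8s.rules
        rules:
          # same recording rule as above, with the job label updated
          - record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
            expr: |-
              sum by (cluster, namespace, pod, container) (
                irate(container_cpu_usage_seconds_total{job="kubernetes-cadvisor", metrics_path="/metrics/cadvisor", image!=""}[5m])
              ) * on (cluster, namespace, pod) group_left(node) topk by (cluster, namespace, pod) (
                1, max by(cluster, namespace, pod, node) (kube_pod_info{node!=""})
              )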
After modifying Prometheus rules and waiting some time, we can see if it works as expected. We can try to execute node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate as in the beginning.
Additionally, let's check our Grafana to make sure it has started displaying our dashboards correctly again.
I am currently running a metrics server inside my pod. The data is exposed inside the pod at localhost:9090, and I am able to fetch it from within the pod via curl. The deployment.yaml is annotated so the data gets scraped, but I don't see any new metrics in Prometheus. What am I doing wrong?
Metrics I see inside the pod:
cpu_usage{process="COMMAND", pid="PID"} %CPU
cpu_usage{process="/bin/sh", pid="1"} 0.0
cpu_usage{process="sh", pid="8"} 0.0
cpu_usage{process="/usr/share/filebeat/bin/filebeat-god", pid="49"} 0.0
cpu_usage{process="/usr/share/filebeat/bin/filebeat", pid="52"} 0.0
cpu_usage{process="php-fpm:", pid="66"} 0.0
cpu_usage{process="php-fpm:", pid="67"} 0.0
cpu_usage{process="php-fpm:", pid="68"} 0.0
cpu_usage{process="nginx:", pid="69"} 0.0
cpu_usage{process="nginx:", pid="70"} 0.0
cpu_usage{process="nginx:", pid="71"} 0.0
cpu_usage{process="/bin/sh", pid="541"} 0.0
cpu_usage{process="bash", pid="556"} 0.0
cpu_usage{process="/bin/sh", pid="1992"} 0.0
cpu_usage{process="ps", pid="1993"} 0.0
cpu_usage{process="/bin/sh", pid="1994"} 0.0
deployment.yaml
template:
metadata:
labels:
app: supplier-service
annotations:
prometheus.io/path: /
prometheus.io/scrape: 'true'
prometheus.io/port: '9090'
ports:
- containerPort: 80
- containerPort: 443
- containerPort: 9090
prometheus.yml
global:
scrape_interval: 15s # By default, scrape targets every 15 seconds. # Attach these labels to any time series or alerts when # communicating with external systems (federation, remote storage, Alertmanager).
external_labels:
monitor: 'codelab-monitor'
# Scraping Prometheus itself
scrape_configs:
- job_name: 'prometheus'
scrape_interval: 5s
static_configs:
- targets: ['localhost:9090']
- job_name: 'kubernetes-service-endpoints'
scrape_interval: 5s
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__address__]
action: replace
regex: ([^:]+)(?::\d+)?
replacement: $1:9090
target_label: __address__
- source_labels: [__meta_kubernetes_service_name]
action: replace
target_label: kubernetes_name
Port numbers are correct. What am I doing wrong?
Your kubernetes_sd_configs is configured with role: endpoints, and endpoints are created by Services. Do you have an Endpoints object created for your service? You can check with kubectl get endpoints in your namespace. If you don't want to create a Service, I suppose you could configure Prometheus to scrape pod targets instead; check the docs for more info.
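If the Service is indeed missing, here is a minimal sketch of one that would create the Endpoints for the deployment above (the name is an assumption; the selector reuses the app label from the pod template):

apiVersion: v1
kind: Service
metadata:
  name: supplier-service   # assumed name
spec:
  selector:
    app: supplier-service   # matches the pod template label shown earlier
  ports:
    - name: metrics
      port: 9090
      targetPort: 9090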
Also, the documentation for metric and label naming says a metric name must match the regular expression [a-zA-Z_:][a-zA-Z0-9_:]*, so a dash (-) in a metric name would be an issue too.
I'm using Prometheus to scrape metrics from my pods. The application I'm interested in is replicated a couple of times, with one Service providing access. Prometheus uses this Service to scrape the metrics. In my app the metrics are set up as follows:
import * as Prometheus from 'prom-client';
const httpRequestDurationMicroseconds = new Prometheus.Histogram({
name: 'transaction_amounts',
help: 'Amount',
labelNames: ['amount'],
buckets: [0, 5, 15, 50, 100, 200, 300, 400, 500, 10000],
});
const totalPayments = new Prometheus.Counter('transaction_totals', 'Total payments');
I'm using helm to install Prometheus and the scrape config looks like this:
prometheus.yml:
rule_files:
- /etc/config/rules
- /etc/config/alerts
scrape_configs:
- job_name: prometheus
static_configs:
- targets:
- localhost:9090
- job_name: transactions
scrape_interval: 1s
static_configs:
- targets:
- transaction-metrics-service:3001
I can see the metrics inside Prometheus, but they seem to come from just one pod; for example, when I query for transaction_totals, the result carries only a single instance label.
I don't think that the instance label can uniquely identify my pods. What should I do to be able to query all pods?
Instead of using a static_config that scrapes just one host, try using Kubernetes service discovery (kubernetes_sd_configs) as provided by Prometheus.
Your config file would look something like this:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
# only scrape when annotation prometheus.io/scrape: 'true' is set
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: ${1}:${2}
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
and then add the annotation to your Kubernetes Deployment yaml config like this:
kind: Deployment
...
spec:
template:
metadata:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "<< PORT OF YOUR CONTAINER >>"
You can see a full working example here.
Add Prometheus annotations to your Service, since Prometheus will only scrape a service that:
Exposes the exporter port
Has a prometheus.io/scrape: "true" annotation
Has a prometheus.io/port: "<exporter_port_here>" annotation
Here is an official example.
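A hedged sketch of such an annotated Service, reusing the service name and port from the question (the selector is an assumption):

apiVersion: v1
kind: Service
metadata:
  name: transaction-metrics-service
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/port: '3001'   # the exporter port from the question
spec:
  selector:
    app: transactions   # assumed pod label
  ports:
    - port: 3001
      targetPort: 3001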
The scraped pod is probably Prometheus itself.
I am running Prometheus on a Kubernetes cluster and trying to scrape pods, nodes, and services. I am getting the following error when I reload the config by sending a POST request:
failed to reload config: couldn't load configuration (-config.file=/etc/prometheus/conf/prometheus.yml): unknown fields in kubernetes_sd_config: api_server
While trying to follow the official docs for writing the config file, I am not able to understand the relabel_configs, source_labels, target_label, action, keep, and regex parts. Can somebody explain these parts, and also the use of labels in Prometheus? Thanks in advance.
The following is the prometheus.yml file:
scrape_configs:
- job_name: 'kubernetes-nodes'
# Default to scraping over https. If required, just disable this or change to
# `http`.
scheme: https
# This TLS & bearer token file config is used to connect to the actual scrape
# endpoints for cluster components. This is separate to discovery auth
# configuration because discovery & scraping are two separate concerns in
# Prometheus. The discovery auth config is automatic if Prometheus runs inside
# the cluster. Otherwise, more config options have to be provided within the
# <kubernetes_sd_config>.
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
# If your node certificates are self-signed or use a different CA to the
# master CA, then disable certificate verification below. Note that
# certificate verification is an integral part of a secure infrastructure
# so this should only be disabled in a controlled environment. You can
# disable certificate verification by uncommenting the line below.
#
# insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
kubernetes_sd_configs:
- api_server: "https://kubernetes.default.svc"
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics
# Scrape config for service endpoints.
#
# The relabeling allows the actual service scrape endpoint to be configured
# via the following annotations:
#
# * `prometheus.io/scrape`: Only scrape services that have a value of `true`
# * `prometheus.io/scheme`: If the metrics endpoint is secured then you will need
# to set this to `https` & most likely set the `tls_config` of the scrape config.
# * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
# * `prometheus.io/port`: If the metrics are exposed on a different port to the
# service then set this appropriately.
# Example scrape config for probing services via the Blackbox Exporter.
#
# The relabeling allows the actual service scrape endpoint to be configured
# via the following annotations:
#
# * `prometheus.io/probe`: Only probe services that have a value of `true`
- job_name: 'kubernetes-services'
metrics_path: /probe
params:
module: [http_2xx]
kubernetes_sd_configs:
- api_server: "https://kubernetes.default.svc"
- role: service
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
action: keep
regex: true
- source_labels: [__address__]
target_label: __param_target
- target_label: __address__
replacement: blackbox
- source_labels: [__param_target]
target_label: instance
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_service_name]
target_label: kubernetes_name
# Example scrape config for pods
#
# The relabeling allows the actual pod scrape endpoint to be configured via the
# following annotations:
#
# * `prometheus.io/scrape`: Only scrape pods that have a value of `true`
# * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
# * `prometheus.io/port`: Scrape the pod on the indicated port instead of the
# pod's declared ports (default is a port-free target if none are declared).
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- api_server: "https://kubernetes.default.svc"
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
Your YAML file is off; try this:
- job_name: 'kubernetes-services'
...
kubernetes_sd_configs:
- api_server: "https://kubernetes.default.svc"
role: service
...
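Since the question also asks what the relabelling directives mean, here is a commented sketch based on the pod job from the question (the label values are illustrative):

relabel_configs:
  # source_labels are joined with the separator (";" by default) and the
  # resulting string is matched against regex
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep        # keep: drop every target whose value does NOT match
    regex: true
  # replace (the default action) copies what regex captures from the
  # source_labels into target_label
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  # labelmap copies every discovery label matching the regex onto the target,
  # using the capture group as the new label name (here: all pod labels)
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)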
This is a working Prometheus ConfigMap example file, fwiw.
https://github.com/kayrus/prometheus-kubernetes/blob/master/prometheus-configmap.yaml#L214-L241
I found that using yamllint reduces the noise around what kubectl thinks it is doing. If you get the ConfigMap with -o yaml, kubectl puts everything that is meant to be data inside the data: section, and when reading the file back in it should know to ignore the other three top-level sections (apiVersion, kind, and metadata) as far as your configuration is concerned.
So make sure your Prometheus configuration lives only under the data: section when/if you load it as a new ConfigMap.
apiVersion: v1
data:
kind: ConfigMap
metadata:
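When you load it back, the file should still be a complete ConfigMap with your Prometheus config under data; a minimal sketch using the name and namespace from the command below:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: prometheus
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s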
Command to get the ConfigMap:
kubectl get configmap prometheus-config --namespace prometheus -o yaml > prometheus.yml
Take out all the excess comments and extra blank lines in both files (yours and the example), save them as prometheus[#].yml, then get yamllint and run it on the file(s):
yamllint -d relaxed prometheus[#].yml
Most of the time yamllint will complain lines are > 80 characters long.
If it is a JSON syntax issue then it will show up quickly.
HTH