I'm trying to get deployment frequency in Kubernetes. Counting ReplicaSets doesn't work for us, since we want to see the frequency of 'code deploys'. For this we can look at how many different images were deployed in a period of time. I'm trying to get this data from Prometheus, which has the Kubernetes metrics; we also have kube-state-metrics deployed.
I have tried things like:
count(
  group by (image) (
    (
      label_replace(kube_pod_owner{owner_kind="ReplicaSet", owner_name=~"app-.*"}, "replicaset", "$1", "owner_name", "(.*)")
      * on (replicaset) group_left kube_replicaset_spec_replicas{}
      > 0
    )
    * on (pod) group_right kube_pod_container_info{container="app"}
  )
)
count(
  group by (image) (
    (
      count_over_time(kube_pod_container_info{container="app", pod=~"app.*"}[1d])
      * on (pod) group_left kube_pod_container_status_ready{}
    ) > 1
  )
)
These aren't giving me quite what I want. I get a spike in the graph for the period where two images exist simultaneously, but that's not quite what I'm looking for.
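For reference, a distinct count of images over an interval, using the last_over_time() pattern discussed in the label-counting question further down, would be a sketch along these lines, assuming the image label on kube_pod_container_info is what identifies a code deploy:

count(
  count by (image) (
    last_over_time(kube_pod_container_info{container="app", pod=~"app.*"}[1d])
  )
)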
I'm creating a custom k8s Grafana dashboard with InfluxDB (v1.8.6) as the datasource. I have gone through the InfluxDB documentation and learned that the analogous construct for Prometheus rate() in Influx is non_negative_derivative(mean(value), interval). But on trying to convert the Prometheus query to InfluxQL, the resulting values differ when executed against the same time intervals. I'm basically trying to compute the k8s cluster network I/O pressure.
PromQL :
sum (rate (container_network_receive_bytes_total{kubernetes_io_hostname=~"^$Node$", job="kubernetes-nodes-cadvisor"}[1m]))
Output is in bytes: 7321180
InfluxQL :
SELECT SUM(bytes_used)
FROM (
  SELECT non_negative_derivative(mean(value), 1s) AS bytes_used
  FROM container_network_receive_bytes_total
  WHERE ("job" = 'kubernetes-nodes-cadvisor' AND "kubernetes_io_hostname" =~ /^$Node$/) AND $timeFilter
  GROUP BY time(1m)
)
GROUP BY time($__interval)
Output is in MB/s: 36.7 MB/s
Could someone help identify the issue and correct me?
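One thing I suspect but haven't verified: without grouping by tags, the inner InfluxQL query merges every series into one before mean() is applied, whereas PromQL's rate() is computed per series and then summed. Grouping the inner query by all tags (GROUP BY time(1m), *) would compute a per-series rate first:

SELECT SUM(bytes_used)
FROM (
  SELECT non_negative_derivative(mean(value), 1s) AS bytes_used
  FROM container_network_receive_bytes_total
  WHERE ("job" = 'kubernetes-nodes-cadvisor' AND "kubernetes_io_hostname" =~ /^$Node$/) AND $timeFilter
  GROUP BY time(1m), *
)
GROUP BY time($__interval)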
prometheus:v2.15.2
kubernetes:v1.14.9
I have a query that shows exactly the maximum over time during a set period.
But I would like to join it with the requests already configured in the kube_pod_container_resource_requests_cpu_cores metric.
I would like to know whether the observed usage is close to the configured value, displayed as a percentage.
I have other examples working with this same metric structure:
jvm_memory_bytes_used{instance="url.instance.com.br"} / jvm_memory_bytes_max{area="heap"} * 100 > 80
but this one is not working.
max_over_time(sum(rate(container_cpu_usage_seconds_total{pod="pod-name-here",container_name!="POD", container_name!=""}[1m])) [1h:1s]) / kube_pod_container_resource_requests_cpu_cores * 100 < 70
Well, the first idea was to create a query that collects the maximum historical CPU usage of a container in a pod over a brief period:
max_over_time(sum(rate(container_cpu_usage_seconds_total{pod="xpto-92838241",container_name!="POD", container_name!=""}[1m])) [1h:1s])
Element: {} Value: 0.25781324101515
If we execute it this way:
container_cpu_usage_seconds_total{pod="xpto-92838241",container_name!="POD", container_name!=""}
Element: container_cpu_usage_seconds_total{beta_kubernetes_io_arch="amd64",beta_kubernetes_io_instance_type="t3.small",beta_kubernetes_io_os="linux",cluster="teste.k8s.xpto",container="xpto",container_name="xpto",cpu="total",failure_domain_beta_kubernetes_io_region="sa-east-1",failure_domain_beta_kubernetes_io_zone="sa-east-1c",generic="true",id="/kubepods/burstable/poda9999e9999e999e9-/99999e9999999e9",image="nginx",instance="kubestate-dev.internal.xpto",job="kubernetes-cadvisor",kops_k8s_io_instancegroup="nodes",kubernetes_io_arch="amd64",kubernetes_io_hostname="ip-99-999-9-99.sa-east-1.compute.internal",kubernetes_io_os="linux",kubernetes_io_role="node",name="k8s_nginx_nginx-99999e9999999e9",namespace="nmpc",pod="pod-92838241",pod_name="pod-92838241",spot="false"} Value: 22533.2
Now we have what is configured:
kube_pod_container_resource_requests_cpu_cores{pod="xpto-92838241"}
Element: kube_pod_container_resource_requests_cpu_cores{container="xpto",instance="kubestate-dev.internal.xpto",job="k8s-http",namespace="nmpc",node="ip-99-999-999-99.sa-east-1.compute.internal",pod="pod-92838241"} Value: 1
Well, my idea was to use these two metrics together to get something close to the percentage, like this:
max_over_time(sum(rate(container_cpu_usage_seconds_total{pod="xpto-dev-92838241",container_name!="POD", container_name!=""}[1m])) [1h:1s]) / kube_pod_container_resource_requests_cpu_cores * 100 < 70
Element: no data Value:
But these two metrics do not interact; I cannot understand why, and I cannot find anything about it in the documentation.
Regards
As you can see here, only in Kubernetes 1.16 were the cAdvisor metric labels pod_name and container_name removed and substituted by pod and container respectively.
As you are using Kubernetes 1.14, you should still use pod_name and container_name.
Let me know if it helps.
Here's the Prometheus Operator, with the documentation and this blog post walking through CPU aggregation.
I solved my problem with vector matching.
max_over_time(sum(rate(container_cpu_usage_seconds_total{pod="pod-name-here",container_name!="POD", container_name!=""}[1m])) [1h:1s]) / on(pod_name) group_left(container_name) kube_pod_container_resource_requests_cpu_cores{pod="pod-name-here"}
thank you all
The following PromQL query returns per-pod CPU usage as a percentage of its configured requests:
100 * sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod)
/
sum(kube_pod_container_resource_requests{resource="cpu"}) by (pod)
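If you need the usage relative to limits instead, the same query shape should work with the corresponding kube-state-metrics series for limits:

100 * sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod)
/
sum(kube_pod_container_resource_limits{resource="cpu"}) by (pod)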
The following query returns the maximum per-pod CPU usage over the last hour:
max_over_time((
  100 * sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod)
  /
  sum(kube_pod_container_resource_requests{resource="cpu"}) by (pod)
)[1h:5m])
Note that the first query is basically wrapped into max_over_time((...)[1h:5m]). Such a construction is called a subquery. It may work slower than the original query.
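If the subquery is too slow, the inner expression can be captured in a recording rule and queried with a plain range selector instead; the rule name pod:cpu_usage:percent_of_requests below is just illustrative:

groups:
  - name: cpu-usage
    rules:
      - record: pod:cpu_usage:percent_of_requests
        expr: |
          100 * sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod)
          /
          sum(kube_pod_container_resource_requests{resource="cpu"}) by (pod)

max_over_time(pod:cpu_usage:percent_of_requests[1h])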
I can't seem to figure out the Prometheus query to calculate a single value of, say, average CPU usage per instance over a time period, and create a Grafana table out of it:
Period: last 3h
Instance A: CPU usage A
Instance B: CPU usage B
Simply put, I want to:
select a time period in Grafana
have Prometheus average the values per instance within that period to a single value
use that data to populate a Grafana table
Any hints?
Thanks!
To answer myself:
avg_over_time(instance:cpu_usage:irate[$__range])
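Note that instance:cpu_usage:irate is a recording rule we already had, not a built-in metric. If you don't have one, a minimal sketch (the source metric and labels are just an example) could be:

groups:
  - name: instance-cpu
    rules:
      - record: instance:cpu_usage:irate
        expr: sum by (instance) (irate(node_cpu_seconds_total{mode!="idle"}[5m]))

With that in place, avg_over_time(instance:cpu_usage:irate[$__range]) averages each instance's usage over whatever time range is selected in Grafana.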
So, for example, if I would like to get the CPU utilisation, does it mean that this PromQL will work well?
(
  (
    count(count(node_cpu_seconds_total{job="vi-prod-node-exporter-ec2-vsat2-spen"}) by (cpu))
    - avg(sum by (mode) (rate(node_cpu_seconds_total{mode='idle',job="vi-prod-node-exporter-ec2-vsat2-spen"}[$__rate_interval])))
  ) * 100
) / count(count(node_cpu_seconds_total{job="vi-prod-node-exporter-ec2-vsat2-spen"}) by (cpu))
I want to count the number of unique label values. Kind of like:
select count (distinct a) from hello_info
For example, if my metric hello_info has labels a and b, I want to count the number of unique a's. Here the count would be 3, for a = "1", "2", "3":
hello_info(a="1", b="ddd")
hello_info(a="2", b="eee")
hello_info(a="1", b="fff")
hello_info(a="3", b="ggg")
count(count by (a) (hello_info))
First you want an aggregator with a result per value of a, and then you can count them.
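With the four series above, the inner aggregation yields one result per value of a, and the outer count counts those results:

count by (a) (hello_info)
=> {a="1"} 2
   {a="2"} 1
   {a="3"} 1

count(count by (a) (hello_info))
=> {} 3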
Another example:
If you want to count the number of apps deployed in a Kubernetes cluster based on the different values of a label (e.g. app):
count(count(kube_pod_labels{app=~".*"}) by (app))
The count(count(hello_info) by (a)) is equivalent to the following SQL:
SELECT
time_bucket('5 minutes', timestamp) AS t,
COUNT(DISTINCT a)
FROM hello_info
GROUP BY t
See time_bucket() function description.
E.g. it returns the number of distinct values for a label per 5-minute interval by default - see the staleness docs for details about the 5-minute lookbehind window.
If you need to calculate the number of unique values for a label over custom interval (for example, over the last day), then the following PromQL query must be used instead:
count(count(last_over_time(hello_info[1d])) by (a))
The custom interval - 1d in the case above - can be changed to an arbitrary value - see these docs for the possible values that can be used there.
This query uses the last_over_time() function for selecting all the time series which were active during the last day. Time series can stop receiving new samples and become inactive at any time; such time series aren't captured with a simple count(...) by (a) after 5 minutes of inactivity. New deployments in Kubernetes and horizontal pod autoscaling are the most frequent sources of large numbers of inactive time series (aka high churn rate).
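To make the difference concrete with the hello_info example: count(count by (a) (hello_info)) counts only values of a from series that received samples within the last 5 minutes, while the following also counts values from series that went stale earlier in the day:

count(count by (a) (last_over_time(hello_info[1d])))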
I want to create a Grafana 'singlestat' Panel that shows the Uptime or SLA 'percentage', based on the presence or absence of test failure metrics.
I already have the appropriate metric, e2e_tests_failure_count, for different test frameworks.
This means that the following query returns the sum of observed test failures:
sum(e2e_tests_failure_count{kubernetes_name=~"test-framework-1|test-framework-2|test-framework-3",kubernetes_namespace="platform-edge"})
I already managed to create a graph that is "1" if everything is ok and "0" if there are any test failures:
1 - clamp_max(sum(e2e_tests_failure_count{kubernetes_name=~"test-framework-1|test-framework-2|test-framework-3",kubernetes_namespace="platform-edge"}), 1)
I now want to have a single percentage value that shows the "uptime" (= amount of time the environment was 'healthy') over a period of time, e.g. the last 5 days. Something like "99.5%" or "65%".
I tried something like this:
(1 - clamp_max(sum(e2e_tests_failure_count{kubernetes_name=~"service-cvi-e2e-tests|service-svhb-e2e-tests|service-svh-roundtrip-e2e-tests",kubernetes_namespace="platform-edge"}), 1))[5d]
but this only results in parser errors. Googling didn't really get me any further, so I'm hoping I can find help here :)
Just figured this out and I believe it is producing correct results. You have to use recording rules, because you cannot create a range vector from the instant vector result of a function in a single query, as you have already discovered (you get a parse error). So we record the function result (which will be an instant vector) as a new time series and use that as the metric name in a different query, where you can then add the [5d] to select a range.
We run our tests multiple times per minute against all our services, and each service ("service" is a label where each service's name is the label value) has a different number of tests associated with it, but if any of the tests for a given service fails, we consider that a "down moment". (The number of test failures for a given service is captured in the metrics with the status="failure" label value.) We clamp the number of failures to 1 so we only have zeroes and ones for our values and can therefore convert a "failure values time series" into a "success values time series" instead, using an inequality operator and the bool modifier. (See this post for a discussion about the use of bool.) So the result of the first recorded metric is 1 for every service where all its tests succeeded during that scrape interval, and 0 where there was at least one test failure for that service.
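(Concretely: a scrape that sees 3 failures for a service gives clamp_max(3, 1) = 1, and 1 != bool 1 = 0, i.e. "down"; a scrape with 0 failures gives clamp_max(0, 1) = 0, and 0 != bool 1 = 1, i.e. "up".)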
If the number of failures for a service is > 0 for all the values returned for any given minute, we consider that service to be "down" for that minute. (So if we have both a failure and a success in a given minute, that does not count as downtime.) That is why we have the second recorded metric to produce the actual "up for this minute" boolean values. The second recorded metric builds on the first, which is OK since the Prometheus documentation says the recorded metrics are run in series within each group.
So "Uptime" for any given duration is the sum of "up for this minute" values (i.e. 1 for each minute up) divided by the total number of minutes in the duration, whatever that duration happens to be.
Since we have defined a recorded metric named "minute_up_bool", we can then create an uptime graph over whatever range we want. (BTW, recorded metrics are only generated for times after you first define them, so you won't get yesterday's time series data included in a recorded metric you define today.) Here's a query you can put in Grafana to show uptime % over a moving window of the last 5 days:
sum_over_time(minute_up_bool[5d]) * 100 / (5 * 24 * 60)
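If you want the window to follow the dashboard's time picker instead of a fixed 5 days, the same idea should work with Grafana's built-in $__range and $__range_s variables:

sum_over_time(minute_up_bool[$__range]) * 100 / ($__range_s / 60)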
So this is our recording rule configuration:
groups:
  - name: uptime
    interval: 1m
    # Each rule here builds on the previous one.
    rules:
      # Get test results as pass/fail => 1/0
      # (label_replace() removes the confusing status="failure" label value)
      - record: test_success_bool
        expr: label_replace(clamp_max(test_statuses_total{status="failure"}, 1), "status", "", "", "") != bool 1
      # Get the uptime as a 1-minute range where the sum of successes is not zero
      - record: minute_up_bool
        expr: clamp_max(sum_over_time(test_success_bool[1m]), 1)
You have to use recording rules because you cannot create a range vector from the instant vector result of a function in a single query
Actually you can, by using a subquery:
(...some complicated instant subexpression...)[5d:1m]
This gives the same results as if you'd used a recording rule with a 1 minute evaluation interval. The recording rule is still beneficial though, as it avoids recomputing the subexpression every time.
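For illustration, the two recording rules above could be inlined as nested subqueries; the results only approximately match, since rule evaluation timestamps differ slightly:

sum_over_time(
  (
    clamp_max(
      sum_over_time(
        (label_replace(clamp_max(test_statuses_total{status="failure"}, 1), "status", "", "", "") != bool 1)[1m:]
      ),
      1
    )
  )[5d:1m]
) * 100 / (5 * 24 * 60)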