Prometheus query to count unique label values

I want to count the number of unique label values, kind of like
select count(distinct a) from hello_info
For example, if my metric 'hello_info' has labels a and b, I want to count the number of unique values of a. For the series below, the count would be 3, for a = "1", "2", "3":
hello_info{a="1", b="ddd"}
hello_info{a="2", b="eee"}
hello_info{a="1", b="fff"}
hello_info{a="3", b="ggg"}

count(count by (a) (hello_info))
First you want an aggregation that produces one result per value of a, and then you count those results.
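For the sample series above, the inner aggregation returns one series per value of a, and the outer count() then counts those series. A sketch of what the expression browser would show:
count by (a) (hello_info)
# {a="1"} 2
# {a="2"} 1
# {a="3"} 1
count(count by (a) (hello_info))
# 3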

Another example: if you want to count the number of apps deployed in a Kubernetes cluster based on the different values of a label (e.g. app):
count(count(kube_pod_labels{app=~".+"}) by (app))
(Note .+ rather than .*: since =~".*" also matches series where the app label is missing or empty, .+ restricts the count to pods that actually carry an app label.)

The count(count(hello_info) by (a)) query is roughly equivalent to the following SQL (using the TimescaleDB time_bucket() function):
SELECT
  time_bucket('5 minutes', timestamp) AS t,
  COUNT(DISTINCT a)
FROM hello_info
GROUP BY t
See the time_bucket() function description.
That is, it returns the number of distinct values for a label per 5-minute interval by default; see the staleness docs for details about the 5-minute lookback window.
If you need to calculate the number of unique values for a label over a custom interval (for example, over the last day), then the following PromQL query must be used instead:
count(count(last_over_time(hello_info[1d])) by (a))
The custom interval (1d in the case above) can be changed to an arbitrary value; see these docs for the supported duration syntax.
This query uses the last_over_time() function to select all the time series that were active during the last day. A time series can stop receiving new samples and become inactive at any time; such series are no longer captured by a simple count(...) by (a) after 5 minutes of inactivity. New deployments in Kubernetes and horizontal pod autoscaling are the most frequent sources of large numbers of inactive time series (aka high churn rate).

Related

sum by groups in KDB

I have a PnL table with date, region, and product columns, plus the PnL values.
I'm trying to group all PnL rows by region and product. One way I've tried is to sum by region and product, as follows:
select PnL: sum PnL by region, product from table where date within (d1;d2)
The issue I have is unexpected results: for a given date range (d1;d2) I'm getting the results I'm expecting, but for the date range (d1;d2+1) I'm getting 0 everywhere.
I checked the data availability on d2+1 and data is already available on that day.
Please note that the server is stateless, so it is not possible to keep intermediate results in variables.
What is the best way to achieve a grouped sum in kdb+?
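One way to narrow this down (a hedged diagnostic sketch, using the table and column names from the question): sum per date over the extended range to check whether the d2+1 data contributes the values you expect:
select sum PnL by date from table where date within (d1;d2+1)
If the d2+1 row looks wrong on its own, the issue lies in that day's data rather than in the grouping.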

Tableau Count Distinct when graphed shows chronological last date, when deduplicated, not first

I'm doing a break-fix on a Tableau report visualization that shows the outcomes of clients by client ID for a given year, as a running sum of the distinct count of client IDs: RUNNING_SUM(COUNTD([ID])). The X axis of the visualization is the initial date of contact with the client. Occasionally, due to errors in the data or weird behavior, a client has two initial dates, listed as two separate data rows where the Initial Date column has different values but the rows share an ID.
Currently, the visualization shows such people with their chronologically last Initial Date, and I need it to dedup such that the visualization shows them as starting from the chronologically first Initial Date.
I could create a calculated field so that if an ID has multiple non-identical Initial Dates, the first one is used, but I'm not sure how to create a calculated field that can group by or otherwise check multiple dates per ID.
In Python/pseudocode (pandas, assuming a DataFrame df with columns "ID" and "InitialDate"), it would be something like:
# for every row, take the earliest InitialDate seen for that row's ID
df["Initial_Date"] = df.groupby("ID")["InitialDate"].transform("min")
But I have to do the transformation in Tableau.
Keep everything the same, but create a calculated field named "Initial Contact Date" with the calculation:
{FIXED [ID]: MIN([Initial Date])}
Then replace the date field on the X axis (Columns) with this date field instead.
That LOD expression computes, for each ID, the minimum Initial Date across all of that ID's rows, and returns that date on every row.

Prometheus query equivalent to SQL DISTINCT

I have multiple Prometheus instances providing the same metric, such as:
my_metric{app="foo", state="active", instance="server-1"} 20
my_metric{app="foo", state="inactive", instance="server-1"} 30
my_metric{app="foo", state="active", instance="server-2"} 20
my_metric{app="foo", state="inactive", instance="server-2"} 30
Now I want to display this metric in a Grafana singlestat widget. When I use the following query...
sum(my_metric{app="foo", state="active"})
...it, of course, sums up all values and returns 40. So I tell Prometheus to sum it by instance...
sum(my_metric{app="foo", state="active"}) by (instance)
...which results in a "Multiple Series Error" in Grafana. Is there a way to tell Prometheus/Grafana to only use the first of the results?
I don't know of a distinct, but I think this would work too:
topk(1, sum(my_metric{app="foo", state="active"}) by (instance))
Check out the second to last example in here:
https://prometheus.io/docs/prometheus/latest/querying/examples/
One way I just found is to additionally take the average over the per-instance sums:
avg(sum(my_metric{app="foo", state="active"}) by(instance))
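Assuming every instance reports the same value (as in the example above), min() or max() over the same per-instance sums would give the same single-series result:
max(sum(my_metric{app="foo", state="active"}) by (instance))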
If you need to return an arbitrary time series out of multiple matching time series, then this can be done with topk() or bottomk() functions. For example, the following query returns a single time series with the maximum value out of multiple time series which match my_metric{app="foo", state="active"}:
topk(1, my_metric{app="foo", state="active"})
You need to set instant query option in Grafana when using topk(). Otherwise topk(1, ...) may return multiple time series when it is used for building a graph with range query. This is because topk(1, ...) selects a single time series with the max value individually per each point on the graph. Different points on the graph may have different time series with the max value. There is a workaround, which allows returning a single series out of many series on a graph in alternative Prometheus-like systems such as VictoriaMetrics. It provides topk_* and bottomk_* functions for this purpose. See, for example, topk_last or topk_avg.
Note that topk() has nothing in common with DISTINCT from SQL. If you need to select distinct label values with PromQL, then you need to use count(...) by (label). It will return the unique values for the given label alongside the number of unique time series per label value. For example, count(my_metric) by (app) will return the unique app label values for time series with the my_metric name. This is roughly equivalent to the following SQL with a DISTINCT clause:
SELECT DISTINCT app FROM my_metric
See count() docs for details.
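For the sample series above, a sketch of the output: all four series share the single app value "foo", so one group comes back, with the series count as its value.
count(my_metric) by (app)
# {app="foo"} 4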

Grafana: combining two queries from two prometheus exporters

I have two exporters for feeding data into prometheus - the node exporter and the elasticsearch exporter. I'm trying to combine sources from both exporters into one query, but unfortunately get "No data points" in the graph.
Each of the series successfully shows data:
elasticsearch_jvm_memory_max_bytes{cluster="$cluster", name=~"$node"}
node_memory_MemTotal{name=~"$node"}
This is the query I use to subtract the two series from one another:
node_memory_MemTotal{name=~"$node"} - elasticsearch_jvm_memory_max_bytes{cluster="$cluster", name=~"$node"}
What am I missing here?
Thanks.
The subtraction you are trying here is more complex than it reads at first. On both sides of the - operator are queries that can result in one or more time series. (A time series is a unique combination of a metric and all its labels and their values.) The operation works as follows: the query on the left-hand side is executed and yields one or more time series; then the query on the right-hand side is executed and also yields one or more time series. To calculate the result, only those combinations with matching label sets on both sides are used.
For your example this means that the metrics from node_exporter and from the elasticsearch_exporter have different label names (or just different values for the same labels). When no combinations exist on both sides, you see an empty result. For details on how operators are applied, please see the Prometheus docs.
To solve your problem, you could do the following:
Check the metrics of both left and right side on their own
Evaluate if there are additional labels that could be ignored
See if there is a good label to match on (e.g. instance / node / hostname)
Use ignoring(a,b,c) on the required side(s) to drop superfluous dimensions, e.g. the job label (see the sketch after this list)
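A hedged sketch of the ignoring() approach; which labels actually have to be dropped depends on the label sets your two exporters emit:
node_memory_MemTotal{name=~"$node"}
  - ignoring(cluster, job)
elasticsearch_jvm_memory_max_bytes{cluster="$cluster", name=~"$node"}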
Try the following query:
node_memory_MemTotal{name=~"$node"}
- on(name)
sum(elasticsearch_jvm_memory_max_bytes{cluster="$cluster", name=~"$node"}) by (name)
It works in the following way:
1. It selects all the time series matching the node_memory_MemTotal{name=~"$node"} time series selector.
2. It selects all the time series matching the elasticsearch_jvm_memory_max_bytes{cluster="$cluster", name=~"$node"} selector.
3. It groups the time series found at step 2 by name label value and sums the series in each group with the sum() aggregate function. The end result of sum(...) by (name) is one sum per name.
4. It finds pairs of time series with identical name label values from step 1 and step 3 and calculates the difference between the first and the second time series in each pair. The on(name) modifier is used for limiting the set of labels used for finding time series pairs with matching labels. See more details about this process here.

Prometheus Uptime or SLA percentage over sliding window in Grafana

I want to create a Grafana 'singlestat' Panel that shows the Uptime or SLA 'percentage', based on the presence or absence of test failure metrics.
I already have the appropriate metric, e2e_tests_failure_count, for different test frameworks.
This means that the following query returns the sum of observed test failures:
sum(e2e_tests_failure_count{kubernetes_name=~"test-framework-1|test-framework-2|test-framework-3",kubernetes_namespace="platform-edge"})
I already managed to create a graph that is "1" if everything is ok and "0" if there are any test failures:
1 - clamp_max(sum(e2e_tests_failure_count{kubernetes_name=~"test-framework-1|test-framework-2|test-framework-3",kubernetes_namespace="platform-edge"}), 1)
I now want to have a single percentage value that shows the "uptime" (= the amount of time the environment was 'healthy') over a period of time, e.g. the last 5 days. Something like "99.5%" or, matching the screenshot, "65%".
I tried something like this:
(1 - clamp_max(sum(e2e_tests_failure_count{kubernetes_name=~"test-framework-1|test-framework-2|test-framework-3",kubernetes_namespace="platform-edge"}), 1))[5d]
but this only results in parser errors. Googling didn't really get me any further, so I'm hoping I can find help here :)
Just figured this out and I believe it is producing correct results. You have to use recording rules, because you cannot create a range vector from the instant vector result of a function in a single query, as you have already discovered (you get a parse error). So we record the function result (which will be an instant vector) as a new time series and use that as the metric name in a different query, where you can then add the [5d] to select a range.
We run our tests multiple times per minute against all our services, and each service ("service" is a label where each service's name is the label value) has a different number of tests associated with it, but if any of the tests for a given service fails, we consider that a "down moment". (The number of test failures for a given service is captured in the metrics with the status="failure" label value.) We clamp the number of failures to 1 so we only have zeroes and ones for our values and can therefore convert a "failure values time series" into a "success values time series" instead, using an inequality operator and the bool modifier. (See this post for a discussion about the use of bool.) So the result of the first recorded metric is 1 for every service where all its tests succeeded during that scrape interval, and 0 where there was at least one test failure for that service.
If the number of failures for a service is > 0 for all the values returned for any given minute, we consider that service to be "down" for that minute. (So if we have both a failure and a success in a given minute, that does not count as downtime.) That is why we have the second recorded metric to produce the actual "up for this minute" boolean values. The second recorded metric builds on the first, which is OK since the Prometheus documentation says the recorded metrics are run in series within each group.
So "Uptime" for any given duration is the sum of "up for this minute" values (i.e. 1 for each minute up) divided by the total number of minutes in the duration, whatever that duration happens to be.
Since we have defined a recorded metric named "minute_up_bool", we can then create an uptime graph over whatever range we want. (BTW, recorded metrics are only generated for times after you first define them, so you won't get yesterday's time series data included in a recorded metric you define today.) Here's a query you can put in Grafana to show uptime % over a moving window of the last 5 days:
sum_over_time(minute_up_bool[5d]) * 100 / (5 * 24 * 60)
So this is our recording rule configuration:
groups:
  - name: uptime
    interval: 1m
    # Each rule here builds on the previous one.
    rules:
      # Get test results as pass/fail => 1/0
      # (label_replace() removes confusing status="failure" label value)
      - record: test_success_bool
        expr: label_replace(clamp_max(test_statuses_total{status="failure"}, 1), "status", "", "", "") != bool 1
      # Get the uptime as 1 minute range where the sum of successes is not zero
      - record: minute_up_bool
        expr: clamp_max(sum_over_time(test_success_bool[1m]), 1)
"You have to use recording rules because you cannot create a range vector from the instant vector result of a function in a single query"
Actually you can, by using a subquery:
(...some complicated instant subexpression...)[5d:1m]
This gives the same results as if you'd used a recording rule with a 1 minute evaluation interval. The recording rule is still beneficial though, as it avoids recomputing the subexpression every time.
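Applied to the question's original expression, the whole 5-day uptime percentage then fits in one query. A sketch, assuming a 1-minute subquery resolution; avg_over_time() over the 0/1 series yields the fraction of "up" samples:
avg_over_time(
  (1 - clamp_max(sum(e2e_tests_failure_count{kubernetes_name=~"test-framework-1|test-framework-2|test-framework-3",kubernetes_namespace="platform-edge"}), 1))[5d:1m]
) * 100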