I want a CloudWatch alarm for an API Gateway endpoint. Based on the documentation, I created the template below. The way I read it, any 5XX error pushes the metric above the threshold of 0, which should trigger the alarm. I'm not sure about the "TreatMissingData" attribute below: what data does it refer to? I'm also not sure about the "EvaluationPeriods" attribute. Can someone explain them?
loudAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    Namespace: AWS/ApiGateway
    MetricName: 5XXError
    Period: '60'
    EvaluationPeriods: '1'
    Threshold: 0
    Statistic: Sum
    ComparisonOperator: GreaterThanThreshold
    TreatMissingData: ?????????
    AlarmActions:
      ...
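TreatMissingData tells the alarm how to handle periods that have no data points.
The possible values are:
breaching: missing data is treated as breaching the threshold
notBreaching: missing data is treated as being within the threshold
ignore: missing data is ignored and the current alarm state is maintained
missing: missing data is not counted; if all data points in the evaluation range are missing, the alarm goes to INSUFFICIENT_DATA
If it is not specified, missing is used.
EvaluationPeriods is the number of most recent periods that are compared against the threshold. Unless DatapointsToAlarm is also set, all of them must breach for the alarm to fire, so with '1' a single breaching 60-second period is enough.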
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html
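The four settings can be compared with a toy model. The sketch below is only an illustration and not CloudWatch's actual evaluation algorithm, which does more (for example, it can look at extra data points beyond the evaluation periods). The function name evaluate_alarm and the use of None for a period without data are assumptions made for this example, and it assumes every evaluated period must breach, as with EvaluationPeriods of 1 and no DatapointsToAlarm.

```python
from typing import Optional, Sequence


def evaluate_alarm(
    datapoints: Sequence[Optional[float]],
    threshold: float,
    treat_missing: str = "missing",
    current_state: str = "OK",
) -> str:
    """Toy model of GreaterThanThreshold with different TreatMissingData values.

    datapoints holds one value per evaluated period; None marks a period
    with no data. This is a simplification, not the real CloudWatch logic.
    """
    if treat_missing == "breaching":
        # A missing period counts as if it were above the threshold.
        flags = [True if v is None else v > threshold for v in datapoints]
    elif treat_missing == "notBreaching":
        # A missing period counts as if it were within the threshold.
        flags = [False if v is None else v > threshold for v in datapoints]
    elif treat_missing in ("ignore", "missing"):
        # Missing periods are left out of the evaluation.
        flags = [v > threshold for v in datapoints if v is not None]
        if not flags:
            # Nothing left to evaluate: keep the state or report no data.
            return current_state if treat_missing == "ignore" else "INSUFFICIENT_DATA"
    else:
        raise ValueError(f"unknown TreatMissingData value: {treat_missing}")
    return "ALARM" if all(flags) else "OK"


# One evaluated period with no data at all, threshold 0 as in the template.
for mode in ("breaching", "notBreaching", "ignore", "missing"):
    print(mode, evaluate_alarm([None], 0, mode))
```

An error count such as 5XXError may report no data points at all while there is no traffic, so notBreaching is often a reasonable choice for this kind of alarm, but that depends on what you want the alarm to mean.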
I am using Grafana v8.3.4 with InfluxDB, and I want to pass dynamic values to my notification for the query/alert condition. I couldn't find any documentation for this. Can anyone suggest how?
This is what I have used:
It alerts if the last() value is greater than the threshold value. I want to pass the last() value and the threshold value dynamically to the notification, whose format is:
Please suggest how this can be achieved.
I'm not sure I understood your question, but if you want to access the value in notifications/annotations, you can use {{ $values.B0.Value }}, where B is your query and 0 is the condition number. (In this case it is 0 because you only have one condition.)
See the Grafana documentation page "Annotations and labels for alerting rules".
[Screenshots not included: Akshay's example; alarm definition in Grafana]
I'm using Prometheus with Grafana. I have a use case where variables are supplied dynamically, and I need to perform a division for each of them so that I can plot a graph per variable.
For example, the first metric is:
rate(container_cpu_usage_seconds_total{id="/",instance=~'${INSTANCE:pipe}'}[5m])
where ${INSTANCE:pipe} is supplied dynamically,
and it needs to be divided by:
machine_cpu_cores{kubernetes_io_hostname=~'${INSTANCE:pipe}'}
I want the result in this format, with one entry per variable, e.g.:
vars result
var1 - 102
var2 - 23
var3 - 453
Note: var1, var2 and var3 stand for the dynamically passed variables, and result is the value returned by the division.
Thanks in advance.
After trying some queries, I found the solution.
My use case has 2 metrics:
container_cpu_usage_seconds_total
machine_cpu_cores
Both metrics share a common label, kubernetes_io_hostname.
I grouped each metric by that label with the following queries:
sort_desc(max(rate(container_cpu_usage_seconds_total{id="/",kubernetes_io_role="node"}[5m])) BY (kubernetes_io_hostname))
sort_desc(max(machine_cpu_cores{kubernetes_io_role="node"}) BY (kubernetes_io_hostname))
After grouping, each result carries only one label, kubernetes_io_hostname.
I then divided the first result by the second, which gives one value per kubernetes_io_hostname.
If you need more info on this, let me know in the comment section.
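Why grouping both sides by the same single label makes the division line up can be shown with a small Python sketch. It only mimics the idea of matching samples by label; the host names and numbers are made up for illustration, and divide_by_label is a hypothetical helper, not a Prometheus API.

```python
from typing import Dict

# Hypothetical per-node samples, keyed by the shared label kubernetes_io_hostname.
cpu_usage_rate: Dict[str, float] = {"node-a": 1.5, "node-b": 0.5, "node-c": 3.0}
cpu_cores: Dict[str, float] = {"node-a": 4, "node-b": 2, "node-d": 8}


def divide_by_label(numerator: Dict[str, float], denominator: Dict[str, float]) -> Dict[str, float]:
    """Divide samples that share the same label value.

    Hosts present on only one side have no partner and are dropped.
    Assumes the denominator values are non-zero (a core count here).
    """
    return {
        host: numerator[host] / denominator[host]
        for host in numerator
        if host in denominator
    }


usage_per_core = divide_by_label(cpu_usage_rate, cpu_cores)
for host, value in usage_per_core.items():
    print(host, value)
```

Once each side is reduced to one sample per hostname, the division yields one entry per host, which is the "1 entry per variable" shape asked for in the question.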
I can't seem to figure out the Prometheus query to calculate the single value of, say, average CPU usage per instance over a time period and create the Grafana table out of it:
Period: last 3h
Instance A: CPU usage A
Instance B: CPU usage B
Simply put, I want to:
select a time period in Grafana
have Prometheus average the values per instance within that period to a single value
use that data to populate a Grafana table
Any hints?
Thanks!
To answer myself:
avg_over_time(instance:cpu_usage:irate[$__range])
So, for example, if I want to get the CPU utilisation, does that mean this PromQL will work well?
(((count(count(node_cpu_seconds_total{job="vi-prod-node-exporter-ec2-vsat2-spen"}) by (cpu))) - avg(sum by (mode)(rate(node_cpu_seconds_total{mode='idle',job="vi-prod-node-exporter-ec2-vsat2-spen"}[$__rate_interval])))) * 100) / count(count(node_cpu_seconds_total{job="vi-prod-node-exporter-ec2-vsat2-spen"}) by (cpu))
I want to create a Grafana 'singlestat' Panel that shows the Uptime or SLA 'percentage', based on the presence or absence of test failure metrics.
I already have the appropriate metric, e2e_tests_failure_count, for different test frameworks.
This means that the following query returns the sum of observed test failures:
sum(e2e_tests_failure_count{kubernetes_name=~"test-framework-1|test-framework-2|test-framework-3",kubernetes_namespace="platform-edge"})
I already managed to create a graph that is "1" if everything is ok and "0" if there are any test failures:
1 - clamp_max(sum(e2e_tests_failure_count{kubernetes_name=~"test-framework-1|test-framework-2|test-framework-3",kubernetes_namespace="platform-edge"}), 1)
I now want to have a single percentage value that shows the "uptime" (= the amount of time the environment was 'healthy') over a period of time, e.g. the last 5 days. Something like "99.5%" or, more appropriate for the screenshot, "65%".
I tried something like this:
(1 - clamp_max(sum(e2e_tests_failure_count{kubernetes_name=~"service-cvi-e2e-tests|service-svhb-e2e-tests|service-svh-roundtrip-e2e-tests",kubernetes_namespace="platform-edge"}), 1))[5d]
but this only results in parser errors. Googling didn't really get me any further, so I'm hoping I can find help here :)
Just figured this out and I believe it is producing correct results. You have to use recording rules because you cannot create a range vector from the instant vector result of a function in a single query, as you have already discovered (you get a parse error). So we record the function result (which will be an instant vector) as a new time series and use that as the metric name in a different query, where you can then add the [5d] to select a range.
We run our tests multiple times per minute against all our services, and each service ("service" is a label where each service's name is the label value) has a different number of tests associated with it, but if any of the tests for a given service fails, we consider that a "down moment". (The number of test failures for a given service is captured in the metrics with the status="failure" label value.) We clamp the number of failures to 1 so we only have zeroes and ones for our values and can therefore convert a "failure values time series" into a "success values time series" instead, using an inequality operator and the bool modifier. (See this post for a discussion about the use of bool.) So the result of the first recorded metric is 1 for every service where all its tests succeeded during that scrape interval, and 0 where there was at least one test failure for that service.
If the number of failures for a service is > 0 for all the values returned for any given minute, we consider that service to be "down" for that minute. (So if we have both a failure and a success in a given minute, that does not count as downtime.) That is why we have the second recorded metric to produce the actual "up for this minute" boolean values. The second recorded metric builds on the first, which is OK since the Prometheus documentation says the recorded metrics are run in series within each group.
So "Uptime" for any given duration is the sum of "up for this minute" values (i.e. 1 for each minute up) divided by the total number of minutes in the duration, whatever that duration happens to be.
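The arithmetic described above can be sketched in a few lines of Python. The function name uptime_percent and the sample numbers are assumptions for illustration only; the input stands for the per-minute 0/1 "up for this minute" values.

```python
from typing import Optional, Sequence


def uptime_percent(minute_up: Sequence[int], window_minutes: Optional[int] = None) -> float:
    """Uptime % = minutes marked up / total minutes in the window * 100.

    minute_up holds one 0/1 value per minute. If window_minutes is not
    given, the number of values is used as the window length.
    """
    total = len(minute_up) if window_minutes is None else window_minutes
    if total <= 0:
        raise ValueError("window must contain at least one minute")
    return 100.0 * sum(minute_up) / total


# Hypothetical 5-day window: 7200 minutes, 36 of them down.
five_days = 5 * 24 * 60
values = [1] * (five_days - 36) + [0] * 36
print(uptime_percent(values, five_days))
```

Dividing by the fixed window length rather than by the number of samples means minutes with no recorded value count as downtime, which is a choice worth being aware of.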
Since we have defined a recorded metric named "minute_up_bool", we can then create an uptime graph over whatever range we want. (BTW, recorded metrics are only generated for times after you first define them, so you won't get yesterday's time series data included in a recorded metric you define today.) Here's a query you can put in Grafana to show uptime % over a moving window of the last 5 days:
sum_over_time(minute_up_bool[5d]) * 100 / (5 * 24 * 60)
So this is our recording rule configuration:
groups:
  - name: uptime
    interval: 1m
    # Each rule here builds on the previous one.
    rules:
      # Get test results as pass/fail => 1/0
      # (label_replace() removes confusing status="failure" label value)
      - record: test_success_bool
        expr: label_replace(clamp_max(test_statuses_total{status="failure"}, 1), "status", "", "", "") != bool 1
      # Get the uptime as 1 minute range where the sum of successes is not zero
      - record: minute_up_bool
        expr: clamp_max(sum_over_time(test_success_bool[1m]), 1)
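To make the two-step logic of these rules easier to follow, here is a rough Python equivalent of what each expression computes for one service. The function names success_flag and minute_up and the sample failure counts are hypothetical; the sketch ignores labels and assumes at least one sample per minute.

```python
from typing import Sequence


def success_flag(failure_count: float) -> int:
    """Mirror of: clamp_max(failures, 1) != bool 1.

    Any failure count of 1 or more clamps to 1 and yields 0 (not a success);
    a failure count of 0 yields 1 (success).
    """
    return 0 if min(failure_count, 1) == 1 else 1


def minute_up(success_flags: Sequence[int]) -> int:
    """Mirror of: clamp_max(sum_over_time(test_success_bool[1m]), 1).

    The minute counts as up if at least one sample in it was a success.
    """
    return min(sum(success_flags), 1)


# Hypothetical failure counts scraped within two different minutes.
mixed_minute = [success_flag(f) for f in (0, 3, 0)]   # successes and a failure
failed_minute = [success_flag(f) for f in (2, 5)]     # only failures
print(minute_up(mixed_minute), minute_up(failed_minute))
```

A minute with both a success and a failure still counts as up, which matches the rule stated earlier that only minutes where every value shows failures are treated as downtime.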
You have to use recording rules because you cannot create a range
vector from the instant vector result of a function in a single
query
Actually you can, by using a subquery:
(...some complicated instant subexpression...)[5d:1m]
This gives the same results as if you'd used a recording rule with a 1 minute evaluation interval. The recording rule is still beneficial though, as it avoids recomputing the subexpression every time.
After installing Heapster in my Kubernetes cluster, I can access Grafana, but the graphs are empty.
I can build a new graph with a specific value, e.g. "cpu/limits", but the predefined graphs that use $interval do not display. For example:
SELECT mean(value) FROM "cpu/limit_gauge" WHERE "container_name" = 'machine' AND $timeFilter GROUP BY time($interval), "hostname"
https://grafana.com/docs/grafana/latest/variables/variable-types/add-interval-variable/
$interval is a built-in automatic variable in Grafana; it is set automatically based on the time range.
If the graphs are empty, your query may be wrong.
For future readers -
I want to add that sometimes the "group-by" "fill" option, which is NULL by default, can cause no values to be displayed depending on the resolution.
If you have a query you think should work but still see no data, try changing the fill value to another setting (e.g. none) and see if data shows up.
Grafana tries to use that configuration to fill in for missing data, and occasionally, depending on your interval and data collection rate, you might end up with oddness.
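The effect of the fill setting can be illustrated with a small Python sketch of time bucketing. This is a simplified model, not InfluxDB's or Grafana's implementation; group_by_time and the sample points are made up for the example. It shows that when the interval is finer than the collection rate, a null fill produces empty buckets between real points, while none omits them.

```python
from typing import List, Optional, Tuple


def group_by_time(
    points: List[Tuple[int, float]],
    start: int,
    end: int,
    interval: int,
    fill: str = "null",
) -> List[Tuple[int, Optional[float]]]:
    """Average points into fixed time buckets covering [start, end).

    fill="null" keeps empty buckets with a None value;
    fill="none" leaves empty buckets out of the result.
    """
    buckets = {t: [] for t in range(start, end, interval)}
    for ts, value in points:
        if start <= ts < end:
            buckets[start + (ts - start) // interval * interval].append(value)

    rows: List[Tuple[int, Optional[float]]] = []
    for t, values in buckets.items():
        if values:
            rows.append((t, sum(values) / len(values)))
        elif fill == "null":
            rows.append((t, None))
        elif fill != "none":
            raise ValueError(f"unsupported fill: {fill}")
    return rows


# Data collected every 60 s, but grouped into 30 s buckets.
samples = [(0, 1.0), (60, 3.0)]
print(group_by_time(samples, 0, 90, 30, fill="null"))
print(group_by_time(samples, 0, 90, 30, fill="none"))
```

With null fill every real point is separated from the next by an empty bucket, so a panel that does not connect across nulls may appear to show nothing; how a given panel draws such gaps depends on its own display settings.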