Handling alerts triggered by a boolean condition; keeping Alertmanager from auto-resolving until manually cleared - kubernetes

Given the following PromQL expression used with an Alertmanager integration:
sum(rate(kube_pod_container_status_restarts_total[2m])*100) by (namespace,container) > 0
What's a good way to alter this expression so that when a pod restart is indicated and an alert fires (as expected), the alert isn't automatically resolved on the next evaluation? On the next evaluation the restart will most likely no longer register (unless a pod is flapping or something), but we want the alert to stay open so we can check it out and resolve it manually.
A potential Prometheus alerting rule would be the following, but we don't yet have an actual rule in place:
- alert: Unexpected POD restart(s) occurring
  expr: sum(rate(kube_pod_container_status_restarts_total[2m])*100) by (namespace,container) > 0
  for: 10s
  annotations:
    summary: POD restarts occurring

An alert is raised when your query returns at least one series, and the alert is resolved when the query no longer returns anything.
sum by (...) (rate(kube_pod_container_status_restarts_total[2m])) > 0
calculates the rate over the last two minutes. So once a restart occurs, this expression drops back to 0 about two minutes later and your alert is resolved.
What you could do is simply increase the time interval:
sum by (...) (rate(kube_pod_container_status_restarts_total[1h])) > 0
so the alert stays on for an hour.
Maybe you don't want to be alerted on every single restart, but only when pods restart repeatedly within a given time window; in that case this might be the better expression:
sum by (...) (increase(kube_pod_container_status_restarts_total[24h])) > 2
meaning that a container restarting more than twice within a day is probably a situation you want to take a look at.
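Putting that together, a complete Prometheus rule built around the increase-based expression could look roughly like the sketch below (the group name, alert name, severity label and annotation text are illustrative, not taken from the question):
groups:
  - name: pod-restarts
    rules:
      - alert: PodRestartingRepeatedly
        expr: sum by (namespace, container) (increase(kube_pod_container_status_restarts_total[24h])) > 2
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} in namespace {{ $labels.namespace }} restarted more than twice in the last 24h"
Because the 24h range keeps the expression above the threshold long after the restarts themselves, the alert keeps firing (and stays visible in Alertmanager) for up to a day instead of resolving on the next evaluation.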

Related

Prometheus alerting expression for a metric that increases once a day

I have a custom counter metric that should increase once a day, during a specific time window, and I have to create an alerting rule that fires if this metric doesn't increase.
Infrastructure: Kubernetes
So for my case, I used the following alerting expression:
increase(my_custom_metric[1d]) == 0
But there seem to be some issues with using it. Suppose the pod restarted and the counter was reset; after a couple of hours the counter increased by one, but the value of the expression above still remains 0. What expression should be used to account for such situations?
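For reference, the rule described in the question could be written out as a full alerting rule along these lines (the metric name and expression are taken from the question; the group and alert names are made up):
groups:
  - name: daily-counter
    rules:
      - alert: DailyCounterDidNotIncrease
        expr: increase(my_custom_metric[1d]) == 0
        annotations:
          summary: my_custom_metric has not increased in the last day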

Why won't the alert count drop to 0 for a GCP alert even though the time threshold is 1 minute?

I have created an alert from a log-based metric for the error count of a Kubernetes container. The alert count value never drops below 1, even with a timeframe of 1 minute, and I think it should drop to 0 since the errors aren't that frequent. I have created other alerts that do drop to 0, so I'm not sure whether it's the frequency of the alert or whether I have misconfigured something. Is there something wrong with my log-based metric here, or with my alert?

Is there a way to query Prometheus to count failed jobs in a time range?

There are several metrics collected for cron jobs; unfortunately, I'm not sure how to use them properly.
I wanted to use the kube_job_status_failed == 1 metric. I can use a regex like job=~".+myjobname.+" to aggregate all failed attempts for a cron job.
This is where I got stuck. Is there a way to count the number of distinct labels (= number of failed attempts) in a given time period?
Or can I use the metrics the other way around, i.e. check whether there was a kube_job_status_succeeded{job=~".+myjobname.+"} == 1 in a given time period?
I feel like I’m so close to solving this but I just can’t wrap my head around it.
EDIT: Added a picture. It shows that there clearly are several succeeded jobs over time; I just have no clue how to count them.
This should give you the number of failed jobs matching the job name over a 1h period:
count_over_time((kube_job_status_failed{job=~".+myjobname.+"} == 1)[1h:])
I searched for this answer myself and found that offset works for my purpose:
kube_job_status_failed{job_name=~"^your_job_name.*", namespace="your_teamspace"} - kube_job_status_failed{job_name=~"^your_job_name.*", namespace="your_teamspace"} offset 6h > 2
I needed 6h rather than 1h, and the number of failed jobs had to be larger than 2 in that time range.
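If the goal is to alert on this rather than run the query ad hoc, the offset expression can be dropped into a rule such as the following sketch (the job name, namespace and threshold are the placeholders from the answer above; everything else is illustrative):
groups:
  - name: cronjob-failures
    rules:
      - alert: TooManyFailedJobRuns
        expr: >
          kube_job_status_failed{job_name=~"^your_job_name.*", namespace="your_teamspace"}
          - kube_job_status_failed{job_name=~"^your_job_name.*", namespace="your_teamspace"} offset 6h > 2
        annotations:
          summary: More than 2 failed runs of your_job_name in the last 6h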

What does Kubernetes cronjobs `startingDeadlineSeconds` exactly mean?

In Kubernetes cronjobs, It is stated in the limitations section that
Jobs may fail to run if the CronJob controller is not running or broken for a span of time from before the start time of the CronJob to start time plus startingDeadlineSeconds, or if the span covers multiple start times and concurrencyPolicy does not allow concurrency.
What I understand from this is that, If the startingDeadlineSeconds is set to 10 and the cronjob couldn't start for some reason at its scheduled time, then it can still be attempted to start again as long as those 10 seconds haven't passed, however, after the 10 seconds, it for sure won't be started, is this correct?
Also, If I have concurrencyPolicy set to Forbid, does K8s count it as a fail if a cronjob tries to be scheduled, when there is one already running?
After investigating the code base of the Kubernetes repo, this is how the CronJob controller works:
Every 10 seconds, the CronJob controller checks the list of CronJobs via the Kubernetes client.
For every CronJob, it checks how many schedules were missed in the window from lastScheduleTime until now. If there are more than 100 missed schedules, it doesn't start the job and records the event:
"FailedNeedsStart", "Cannot determine if job needs to be started. Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew."
It is important to note that if the field startingDeadlineSeconds is set (not nil), the controller instead counts how many schedules were missed in the last startingDeadlineSeconds. For example, if startingDeadlineSeconds = 200, it counts how many schedules were missed in the last 200 seconds. The exact implementation of counting missed schedules can be found here.
If there are no more than 100 missed schedules from the previous step, the CronJob controller then checks whether the current time is not later than scheduledTime + startingDeadlineSeconds, i.e. that it isn't too late to start the job (the deadline hasn't passed). If it isn't too late, the CronJob controller will attempt to start the job. If it is already too late, it doesn't start the job and records the event:
"Missed starting window for {cronjob name}. Missed scheduled time to start a job {scheduledTime}"
It is also important to note that if the field startingDeadlineSeconds is not set, there is no deadline at all, and the CronJob controller will attempt to start the job without checking whether it is too late.
Therefore to answer the questions above:
1. If the startingDeadlineSeconds is set to 10 and the cronjob couldn't start for some reason at its scheduled time, then it can still be attempted to start again as long as those 10 seconds haven't passed, however, after the 10 seconds, it for sure won't be started, is this correct?
The CronJob controller will attempt to start the job, and it will be successfully scheduled if the 10 seconds after its scheduled time haven't passed yet. However, if the deadline has passed, it won't be started for this run, and it will be counted as a missed schedule in later evaluations.
2. If I have concurrencyPolicy set to Forbid, does K8s count it as a fail if a cronjob tries to be scheduled, when there is one already running?
Yes, it will be counted as a missed schedule, since missed schedules are calculated as described in point 2 above.
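For reference, a CronJob manifest that sets both of the fields discussed above might look like this (the schedule, names and image are illustrative):
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-cronjob
spec:
  schedule: "*/5 * * * *"          # run every 5 minutes
  startingDeadlineSeconds: 200     # a missed run may still be started up to 200s late
  concurrencyPolicy: Forbid        # skip a run while the previous one is still running
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: example
              image: busybox
              command: ["sh", "-c", "echo hello"]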

Dropwizard Metrics: how to count an action without carrying forward the counter value

I have a very simple case where I want to see how many times a user clicks ButtonA in my app. I'm using a Dropwizard Metrics counter to achieve this and the Coursera reporter to report the metrics to Datadog every minute.
registry.counter("buttonA").inc();
But this counter doesn't behave like I thought it would. For example, if buttonA has been clicked 4 times, the counter keeps the value 4 until the app restarts, which is not very useful.
Is there another metric type I'm not aware of that keeps a count and resets to 0 at each report? That way I could easily sum all the counts on the Datadog dashboard and get exact numbers, and an app restart wouldn't affect the metrics.
I don't think there is something that does this for you automatically. You have to reset the counter yourself at each reporting interval. Something like this should work:
long count = registry.counter("buttonA").getCount();
dataDogReporter.report("buttonA", count);
registry.counter("buttonA").dec(count);