Is there a way to query Prometheus to count failed jobs in time range? - kubernetes

There are several metrics collected for cron jobs; unfortunately I'm not sure how to use them properly.
I wanted to use the kube_job_status_failed == 1 metric. I can use a regex like job=~".+myjobname.+" to aggregate all failed attempts for a cron job.
This is where I got stuck. Is there a way to count the number of distinct labels (= number of failed attempts) in a given time period?
Or can I use the metrics the other way around, i.e. check whether there was a kube_job_status_succeeded{job=~".+myjobname.+"} == 1 in a given time period?
I feel like I’m so close to solving this but I just can’t wrap my head around it.
EDIT: Added a picture. It shows that there clearly are several succeeded jobs over time; I just have no clue how to count them.

This should give you the number of failed jobs matching the job name in a 1h period:
count_over_time((kube_job_status_failed{job=~".+myjobname.+"} == 1)[1h:])

I searched for this answer myself and found offset working for my purpose.
kube_job_status_failed{job_name=~"^your_job_name.*", namespace="your_teamspace"} - kube_job_status_failed{job_name=~"^your_job_name.*", namespace="your_teamspace"} offset 6h > 2
I needed 6h, not 1h, and the number of failed jobs to be larger than 2 in this time range.
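If you want to pull such a count out programmatically rather than in the Prometheus UI, a minimal sketch against the Prometheus HTTP API could look like the following. The Prometheus URL and the job_name regex are placeholders, and the query just wraps the count_over_time approach from the first answer in an instant query:

# Sketch: evaluate the failed-job count query via the Prometheus HTTP API.
# PROM_URL and the job_name regex are placeholders for your environment.
import requests

PROM_URL = "http://prometheus.example:9090"
QUERY = 'count_over_time((kube_job_status_failed{job_name=~"^your_job_name.*"} == 1)[6h:])'

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY})
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    labels = series["metric"]
    _, value = series["value"]  # an instant query returns a single [timestamp, value] pair
    print(f"{labels.get('job_name', labels)}: {value} failed samples in the last 6h")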

Related

Error in dataflow plugins.adfprod.AutoResolveIntegrationRuntime.45

I am getting the error below while running my dataflow. This dataflow was running fine until yesterday; from today onwards I am getting this error:
Operation on target LoadAccount failed:
[plugins.adfprod.AutoResolveIntegrationRuntime.45 WorkspaceType: CCID:<1a11d7e0-b019-4845-ab29-641100c79f04>] The job has surpassed the max number of seconds it can be in ResourceAcquisition state [1000s], so ending the job.
Error Message - The job has surpassed the max number of seconds it can
be in ResourceAcquisition state [1000s], so ending the job.
In a lot of cases in Data Factory, the MAX limitations are only soft restrictions that can easily be lifted via a support ticket.
There is no such thing as a limitless cloud platform.
Refer to this article by MRPAULANDREW.

How can I get down time of a specific deployment in kubernetes?

I have a use case where I need to collect the downtime of each deployment (i.e. when all replicas (pods) are down at the same time).
My goal is to maintain the total downtime for each deployment since it was created.
I tried getting it from the deployment status, but the problem is that I need to make frequent calls to get the deployment and check for any downtime.
Also, the deployment status stores only the latest change, so I will end up missing changes that occurred between calls if there is more than one change (i.e., downtime). I will also end up making multiple calls for multiple deployments frequently, which will consume more compute resources.
Is there any reliable method to collect the downtime data of a deployment?
Thanks in advance.
A monitoring tool like Prometheus would be a better solution to handle this.
As an example, below is a graph from one of our deployments for the last 2 days.
If you look at the blue line for unavailable replicas, we had one replica unavailable from about 17:00 to 10:30 (ideally the unavailable count should be zero).
This seems pretty close to what you are looking for.
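If you want to turn a graph like that into an actual downtime number, one rough option (a sketch of mine, not an official recipe) is to pull the kube-state-metrics series for available replicas over a time range via the Prometheus HTTP API and add up the intervals where it was zero. The Prometheus URL, deployment name and namespace below are placeholders, and it assumes kube-state-metrics exposes kube_deployment_status_replicas_available for your deployments:

# Rough sketch: sum up the time a deployment had zero available replicas,
# based on the kube-state-metrics series kube_deployment_status_replicas_available.
# PROM_URL, deployment and namespace are placeholders for your environment.
import time
import requests

PROM_URL = "http://prometheus.example:9090"
QUERY = 'kube_deployment_status_replicas_available{deployment="my-app", namespace="default"}'
STEP = 30  # resolution in seconds

end = time.time()
start = end - 2 * 24 * 3600  # last 2 days, like the graph above

resp = requests.get(
    f"{PROM_URL}/api/v1/query_range",
    params={"query": QUERY, "start": start, "end": end, "step": STEP},
)
resp.raise_for_status()

downtime_seconds = 0
for series in resp.json()["data"]["result"]:
    # Each sample is [timestamp, value]; count samples where no replica was available.
    downtime_seconds += sum(STEP for _, value in series["values"] if float(value) == 0)

print(f"approximate downtime: {downtime_seconds / 60:.1f} minutes")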

How to measure containers execution time inside a kubernetes POD?

I am running jobs in a Kubernetes pod and I need to measure the execution time of each job.
I want to get it through some API.
Does anyone know how I can get it?
A Job has a property named status of type JobStatus.
The properties you are looking for in the JobStatus type are startTime and completionTime, which, as the names suggest, indicate the moment the job started/completed. The difference between these values gives you the duration of the job's execution.
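For example, with the official Kubernetes Python client (the job name and namespace below are placeholders), reading those two fields and taking the difference could look roughly like this:

# Sketch: read startTime/completionTime from a Job's status and compute its duration.
# Assumes the official kubernetes Python client and a job named "my-job" in "default".
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster
batch = client.BatchV1Api()

job = batch.read_namespaced_job(name="my-job", namespace="default")
start = job.status.start_time
end = job.status.completion_time  # None while the job is still running

if start and end:
    print(f"job took {(end - start).total_seconds():.0f} seconds")
else:
    print("job has not completed yet")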

What does Kubernetes cronjobs `startingDeadlineSeconds` exactly mean?

In Kubernetes cronjobs, It is stated in the limitations section that
Jobs may fail to run if the CronJob controller is not running or broken for a span of time from before the start time of the CronJob to start time plus startingDeadlineSeconds, or if the span covers multiple start times and concurrencyPolicy does not allow concurrency.
What I understand from this is: if startingDeadlineSeconds is set to 10 and the cronjob couldn't start for some reason at its scheduled time, then it can still be attempted to start as long as those 10 seconds haven't passed; however, after the 10 seconds, it for sure won't be started. Is this correct?
Also, If I have concurrencyPolicy set to Forbid, does K8s count it as a fail if a cronjob tries to be scheduled, when there is one already running?
After investigating the code base of the Kubernetes repo, this is how the CronJob controller works:
Every 10 seconds, the CronJob controller checks the list of CronJobs via the given Kubernetes client.
For every CronJob, it checks how many schedules it missed in the duration from the lastScheduleTime till now. If there are more than 100 missed schedules, then it doesn't start the job and records the event:
"FailedNeedsStart", "Cannot determine if job needs to be started. Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew."
It is important to note that if the field startingDeadlineSeconds is set (not nil), the controller will count how many schedules were missed in the last startingDeadlineSeconds seconds. For example, if startingDeadlineSeconds = 200, it will count how many schedules were missed in the last 200 seconds. The exact implementation of counting missed schedules can be found here.
In case there are no more than 100 missed schedules from the previous step, the CronJob controller will check whether the current time is after its scheduledTime + startingDeadlineSeconds, i.e. whether it is too late to start the job (the deadline has passed). If it isn't too late, the CronJob controller will continue to attempt to start the job. However, if it is already too late, it doesn't start the job and records the event:
"Missed starting window for {cronjob name}. Missed scheduled time to start a job {scheduledTime}"
It is also important to note that if the field startingDeadlineSeconds is not set, then there is no deadline at all. This means the CronJob controller will attempt to start the job without checking whether it is too late.
Therefore to answer the questions above:
1. If startingDeadlineSeconds is set to 10 and the cronjob couldn't start for some reason at its scheduled time, then it can still be attempted to start as long as those 10 seconds haven't passed; however, after the 10 seconds, it for sure won't be started. Is this correct?
The CronJob controller will attempt to start the job, and it will be successfully scheduled if the 10 seconds after its scheduled time haven't passed yet. However, if the deadline has passed, it won't be started in this run, and it will be counted as a missed schedule in later executions.
2. If I have concurrencyPolicy set to Forbid, does K8s count it as a fail if a cronjob tries to be scheduled, when there is one already running?
Yes, it will be counted as a missed schedule, since missed schedules are calculated as I stated above.
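To make that counting logic a bit more concrete, here is a small Python sketch of the idea described above. It is only an illustration: the real controller is written in Go and parses the cron expression itself, so the fixed schedule interval, the 100-miss cutoff and the function name below are simplifications of mine.

# Illustrative sketch of the missed-schedule logic described above (not the real Go code).
# A fixed schedule interval stands in for parsing the actual cron expression.
from datetime import datetime, timedelta, timezone

def missed_schedules(last_schedule_time, now, interval, starting_deadline_seconds=None):
    earliest = last_schedule_time
    if starting_deadline_seconds is not None:
        # Only look back as far as startingDeadlineSeconds allows.
        earliest = max(earliest, now - timedelta(seconds=starting_deadline_seconds))
    missed = []
    t = earliest + interval
    while t <= now:
        missed.append(t)
        if len(missed) > 100:
            return None  # mirrors the "FailedNeedsStart" case: too many missed start times
        t += interval
    return missed

now = datetime.now(timezone.utc)
starting_deadline_seconds = 200
missed = missed_schedules(now - timedelta(minutes=7), now,
                          interval=timedelta(minutes=1),
                          starting_deadline_seconds=starting_deadline_seconds)
if missed is None:
    print("too many missed start times (> 100); not starting the job")
elif not missed:
    print("no missed schedules; nothing to do")
else:
    latest = missed[-1]
    # "Not too late" check: now must not be after scheduledTime + startingDeadlineSeconds.
    if (now - latest).total_seconds() <= starting_deadline_seconds:
        print(f"start the job for scheduled time {latest}")
    else:
        print("missed the starting window; counted as a missed schedule")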

hiveQL counter limit exceeded error

I am running a CREATE TABLE query in HiveQL and get the following error when it runs:
Status: Failed
Counters limit exceeded: Too many counters: 2001 max=2000
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Counters limit exceeded: Too many counters: 2001 max=2000
I have attempted to set the counter limit to a greater number, i.e.
set tez.counters.max=16000;
However, it still falls over with the same error.
My query incorporates 13 left joins, but the data sets are relatively small (thousands of rows). The query did work when there were roughly 10 joins, but since I added the additional joins it has started to fail.
Any suggestions on how I can configure this to work would be greatly appreciated!
You need to find the real initial error log from a failed container. The error you have shown here is not the initial error: 2001 containers (including their restart attempts) failed because of some other error (which you really need to fix), then the whole job was terminated and all other containers were killed because of the failed-counters limit. Go to the job tracker, find a failed (not killed) container, and read its log. The real problem is not the limit, and changing the failed-counters limit will not help.
Divide your query into multiple steps and then run it.
As you said your query works with 10 joins, first create a table holding the data from the first 10 joins, and then, using that new table, create the final table from the first table and the three remaining tables.
I faced the same issue when applying a UNION ALL statement over 100 tables, but when I started to run only 10 tables at a time it worked.
Hope this helps!