How to measure a container's execution time inside a Kubernetes Pod?

I am running jobs with Kubernetes Pods and I need to measure the execution time for each job.
I want to get it through some API.
Does anyone know how I can get it?

A Job has a property named status of type JobStatus.
The properties you are looking for in the JobStatus type are startTime and completionTime, which, as their names suggest, indicate the moments when the job started and completed. The difference between these two values gives you the duration of the job's execution.
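For example, assuming a Job named my-job in the current namespace (a placeholder name), both timestamps can be read straight from the API with kubectl, and subtracting one from the other gives the duration:

kubectl get job my-job -o jsonpath='{.status.startTime} {.status.completionTime}'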

Related

Is there a way to query Prometheus to count failed jobs in a time range?

There are several metrics collected for cron jobs, but unfortunately I'm not sure how to use them properly.
I wanted to use the kube_job_status_failed == 1 metric. I can use a regex like job=~".+myjobname.+" to aggregate all failed attempts for a cron job.
This is where I got stuck. Is there a way to count the number of distinct labels (= the number of failed attempts) in a given time period?
Or can I use the metric the other way around, i.e. check whether there was kube_job_status_succeeded{job=~".+myjobname.+"} == 1 in a given time period?
I feel like I’m so close to solving this but I just can’t wrap my head around it.
EDIT: Added a picture. This shows that there clearly are several succeeded jobs over time; I just have no clue how to count them.
This should give you the number of failed jobs matching the job name in a 1h period:
count_over_time((kube_job_status_failed{job=~".+myjobname.+"} == 1)[1h:])
I searched for this answer myself and found that offset works for my purpose.
kube_job_status_failed{job_name=~"^your_job_name.*", namespace="your_teamspace"} - kube_job_status_failed{job_name=~"^your_job_name.*", namespace="your_teamspace"} offset 6h > 2
I needed 6h, not 1h, and the number of failed jobs to be larger than 2 in this time range.

k8s job with an unlimited/unknown number of work-items (completions)

I have a queue which is filled (continuously) with work items by one of my k8s pods.
I want to use a k8s Job to process each work item, as I want each item to be handled by a new pod and I want to use multiple containers, as suggested here, but Jobs don't seem to support an infinite number of completions.
When I use spec.completions: null I get BackoffLimitExceeded.
Any idea how to implement a Job without the need to specify the number of work items?
Is there an alternative to a Job for implementing a background worker in k8s?
Thank you
My suggestion is to use Kubernetes resources in the way they have been designed: Job resources are one-off tasks that can be triggered many times, but they differ from the background workers you're eager to implement.
If your application is popping work items from a queue/backend, it's better to put it in a Deployment with a for loop (YMMV according to programming language) and, eventually, scale it down with another component if you don't want to allocate unused resources.
Another solution could be the specialization of each Job, using a UUID as the name of each Job and labelling them in order to group them; anyway, the first suggestion of using Kubernetes in the Kubernetes way is strongly recommended.
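A minimal sketch of the Deployment-based worker mentioned above; the name and image are placeholders, and the pop-from-queue loop is assumed to run inside the container:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: queue-worker                 # hypothetical name
spec:
  replicas: 1                        # scale up/down according to queue depth
  selector:
    matchLabels:
      app: queue-worker
  template:
    metadata:
      labels:
        app: queue-worker
    spec:
      containers:
      - name: worker
        image: registry.example.com/queue-worker:latest   # placeholder image running the pop-and-process loop
        resources:
          requests:
            cpu: 100m
            memory: 128Mi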

How can I get the downtime of a specific deployment in Kubernetes?

I have a use case where I need to collect the downtime of each deployment (i.e. when all the replicas (pods) are down at the same point in time).
My goal is to maintain the total downtime for each deployment since it was created.
I tried getting it from the deployment status, but the problem is that I need to make frequent calls to get the deployment and check for any downtime.
Also, the deployment status stores only the latest change, so I will end up missing changes that occurred between calls if there is more than one change (i.e., downtime). I will also end up making frequent calls for multiple deployments, which will consume more compute resources.
Is there any reliable method to collect the downtime data of a deployment?
Thanks in advance.
A monitoring tool like Prometheus would be a better solution to handle this.
As an example, below is a graph from one of our deployments for the last 2 days.
If you look at the blue line for unavailable replicas, we had one replica unavailable from about 17:00 to 10:30 (ideally the unavailable count should be zero).
This seems pretty close to what you are looking for.
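As a rough sketch, assuming kube-state-metrics is being scraped and using placeholder deployment/namespace labels, the "all replicas down" condition the question describes can be expressed as:

kube_deployment_status_replicas_available{namespace="default", deployment="my-deployment"} == 0

and an approximate total downtime in seconds over the last day can then be built with a subquery:

sum_over_time((kube_deployment_status_replicas_available{namespace="default", deployment="my-deployment"} == bool 0)[1d:1m]) * 60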

What does Kubernetes cronjobs `startingDeadlineSeconds` exactly mean?

In Kubernetes CronJobs, it is stated in the limitations section that
Jobs may fail to run if the CronJob controller is not running or broken for a span of time from before the start time of the CronJob to start time plus startingDeadlineSeconds, or if the span covers multiple start times and concurrencyPolicy does not allow concurrency.
What I understand from this is that, if startingDeadlineSeconds is set to 10 and the cronjob couldn't start for some reason at its scheduled time, then it can still be attempted to start again as long as those 10 seconds haven't passed; however, after the 10 seconds, it for sure won't be started. Is this correct?
Also, if I have concurrencyPolicy set to Forbid, does K8s count it as a failure if a cronjob tries to be scheduled when there is one already running?
After investigating the code base of the Kubernetes repo, this is how the CronJob controller works:
The CronJob controller checks the list of cronjobs every 10 seconds via the given Kubernetes client.
For every CronJob, it checks how many schedules it missed in the duration from the lastScheduleTime till now. If there are more than 100 missed schedules, then it doesn't start the job and records the event:
"FailedNeedsStart", "Cannot determine if job needs to be started. Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew."
It is important to note that if the field startingDeadlineSeconds is set (not nil), the controller will count how many schedules were missed from startingDeadlineSeconds ago until now. For example, if startingDeadlineSeconds = 200, it will count how many schedules were missed in the last 200 seconds. The exact implementation of counting the missed schedules can be found here.
If there are not more than 100 missed schedules from the previous step, the CronJob controller then checks that the current time is not after scheduledTime + startingDeadlineSeconds, i.e. that it is not too late to start the job (the deadline has not passed). If it is not too late, the CronJob controller will keep attempting to start the job. However, if it is already too late, it doesn't start the job and records the event:
"Missed starting window for {cronjob name}. Missed scheduled time to start a job {scheduledTime}"
It is also important to note that if the field startingDeadlineSeconds is not set, there is no deadline at all. This means the CronJob controller will attempt to start the job without checking whether it is too late.
Therefore to answer the questions above:
1. If startingDeadlineSeconds is set to 10 and the cronjob couldn't start for some reason at its scheduled time, can it still be attempted to start again as long as those 10 seconds haven't passed, but after the 10 seconds it for sure won't be started?
The CronJob controller will attempt to start the job, and it will be successfully scheduled if the 10 seconds after its scheduled time haven't passed yet. However, if the deadline has passed, it won't be started on this run, and it will be counted as a missed schedule in later executions.
2. If I have concurrencyPolicy set to Forbid, does K8s count it as a failure if a cronjob tries to be scheduled when there is one already running?
Yes, it will be counted as a missed schedule, since missed schedules are calculated as I stated above in point 2.
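For reference, a minimal CronJob sketch combining the two fields discussed above; the name, schedule, and image are placeholders (on clusters older than 1.21 the apiVersion would be batch/v1beta1):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-cron               # hypothetical name
spec:
  schedule: "*/5 * * * *"          # every 5 minutes
  startingDeadlineSeconds: 10      # a run that cannot start within 10s of its scheduled time is skipped
  concurrencyPolicy: Forbid        # skip a run while the previous one is still running
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: task
            image: busybox         # placeholder image
            command: ["sh", "-c", "echo hello"]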

Checking the state of nodes during a past job execution

Is there a way to check in Azure Batch whether a node went to the unusable state while running a specific job on a specific pool? The context is that, when running a job and checking the pool on which it was running at that time, there were some nodes that went to the unusable state during the job execution, but we wouldn't have any indication that this happened if we weren't checking the heatmap of the pool during the job execution. Thus, how can I check if nodes went to the unusable state during some job run?
Also, I see that there are metrics collected about the state of nodes in the Azure portal, but I am not sure why these metrics are always zero for me, even though I am running jobs and tasks that fail.
I had a quick look for you: (I hope this helps :))
For node state monitoring you can do something like what is mentioned here:
https://learn.microsoft.com/en-us/azure/batch/batch-efficient-list-queries
PoolOperations: https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.batch.pooloperations?view=azurebatch-7.0.1
ListComputeNodes: Enumerates the ComputeNodes of the specified pool.
I think if you filter at the detail level with the correct clause you will get the ComputeNode information, then you can loop through it and check the state.
https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.batch.common.computenodestate?view=azurebatch-7.0.1
Possible sample implementation: (Please note this specific code is probably for the pool health) https://github.com/Azure/azure-batch-samples/blob/master/CSharp/Common/GettingStartedCommon.cs#L31
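A rough sketch of what that could look like with the .NET client; the account URL, credentials, and pool id are placeholders, and note that this only reports nodes that are currently unusable, so to catch the state during a job run you would have to poll and record it while the job executes:

using System;
using Microsoft.Azure.Batch;
using Microsoft.Azure.Batch.Auth;

class UnusableNodeCheck
{
    static void Main()
    {
        var credentials = new BatchSharedKeyCredentials(
            "https://<account>.<region>.batch.azure.com",   // placeholder account URL
            "<accountName>",                                // placeholder account name
            "<accountKey>");                                // placeholder key

        using (BatchClient batchClient = BatchClient.Open(credentials))
        {
            // Efficient list query: only fetch nodes whose state is 'unusable'
            var detail = new ODATADetailLevel(
                filterClause: "state eq 'unusable'",
                selectClause: "id,state,stateTransitionTime");

            foreach (ComputeNode node in batchClient.PoolOperations.ListComputeNodes("<poolId>", detail))
            {
                Console.WriteLine($"{node.Id} is {node.State} since {node.StateTransitionTime}");
            }
        }
    }
}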
With regards to the metrics, how are you getting the metrics back? I am sure I will get corrected if I said anything doubtful or incorrect. Thanks!