How to determine if a job is failed - kubernetes

How can I programmatically determine if a job has failed for good and will not retry any more? I've seen the following on failed jobs:
status:
  conditions:
  - lastProbeTime: 2018-04-25T22:38:34Z
    lastTransitionTime: 2018-04-25T22:38:34Z
    message: Job has reach the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed
However, the documentation doesn't explain why conditions is a list. Can there be multiple conditions? If so, which one do I rely on? Is it a guarantee that there will only be one with status: "True"?

JobConditions are similar to PodConditions. You can read about PodConditions in the official docs.
Anyway, to determine whether a Job finished successfully, I follow another way. Let's look at it.
There are two fields in the Job spec.
One is spec.completions (default value 1), which says:
Specifies the desired number of successfully finished pods the job should be run with.
Another is spec.backoffLimit (default value 6), which says:
Specifies the number of retries before marking this job failed.
Now, in JobStatus:
There are two fields in JobStatus too, Succeeded and Failed. Succeeded is the number of pods which completed successfully, and Failed is the number of pods which reached phase Failed.
Once Succeeded is greater than or equal to spec.completions, the job is marked completed.
Once Failed is greater than or equal to spec.backoffLimit, the job is marked failed.
So the logic looks like this:
if job.Status.Succeeded >= *job.Spec.Completions {
    return "completed"
} else if job.Status.Failed >= *job.Spec.BackoffLimit {
    return "failed"
}
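For completeness, here is a minimal sketch of applying that check against a live cluster with client-go. The kubeconfig location, the default namespace, and the Job name my-job are placeholders of mine, not part of the original answer:
package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Load the kubeconfig from the default location (~/.kube/config).
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        panic(err)
    }

    // Fetch the Job and apply the same counter comparison as above.
    job, err := clientset.BatchV1().Jobs("default").Get(context.TODO(), "my-job", metav1.GetOptions{})
    if err != nil {
        panic(err)
    }

    state := "running"
    switch {
    case job.Spec.Completions != nil && job.Status.Succeeded >= *job.Spec.Completions:
        state = "completed"
    case job.Spec.BackoffLimit != nil && job.Status.Failed >= *job.Spec.BackoffLimit:
        state = "failed"
    }
    fmt.Println(state)
}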

If so, which one do I rely on?
You might not have to choose, considering commit dd84bba64:
When a job is complete, the controller will indefinitely update its conditions with a Complete condition.
This change makes the controller exit the reconciliation as soon as the job is already found to be marked as complete.

As https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.26/#jobstatus-v1-batch says:
The latest available observations of an object's current state. When a
Job fails, one of the conditions will have type "Failed" and status
true. When a Job is suspended, one of the conditions will have type
"Suspended" and status true; when the Job is resumed, the status of
this condition will become false. When a Job is completed, one of the
conditions will have type "Complete" and status true. More info:
https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
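Building on the quoted JobStatus documentation, here is a hedged sketch (mine, not from the answers above) of scanning the conditions list for a terminal Complete or Failed condition using the client-go types:
package jobstatus

import (
    batchv1 "k8s.io/api/batch/v1"
    corev1 "k8s.io/api/core/v1"
)

// jobFinished reports whether the Job has reached a terminal state and,
// if so, whether it failed, by scanning for the "Complete"/"Failed"
// conditions described in the documentation quoted above.
func jobFinished(job *batchv1.Job) (finished bool, failed bool) {
    for _, c := range job.Status.Conditions {
        if c.Status != corev1.ConditionTrue {
            continue
        }
        switch c.Type {
        case batchv1.JobComplete:
            return true, false
        case batchv1.JobFailed:
            return true, true
        }
    }
    return false, false
}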

Related

How to wait for Tekton pipelineRun conditions

I have the following code within a gitlab pipeline which results in some kind of race condition:
kubectl apply -f pipelineRun.yaml
tkn pipelinerun logs -f pipeline-run
The tkn command immediately exits, since the pipelineRun object is not yet created. A seemingly nice solution for this problem would be:
kubectl apply -f pipelineRun.yaml
kubectl wait --for=condition=Running --timeout=60s pipelinerun/pipeline-run
tkn pipelinerun logs -f pipeline-run
Unfortunately this does not work as expected, since Running seems not to be a valid condition for a pipelineRun object. So my question is: what are the valid conditions of a pipelineRun object?
I didn't search too far and wide, but it looks like they only have two condition types imported from the knative.dev project?
https://github.com/tektoncd/pipeline/blob/main/vendor/knative.dev/pkg/apis/condition_types.go#L32
The link above points to the condition types imported from knative.dev in the pipeline source code; of these, it looks like Tekton only uses "Ready" and "Succeeded".
const (
    // ConditionReady specifies that the resource is ready.
    // For long-running resources.
    ConditionReady ConditionType = "Ready"
    // ConditionSucceeded specifies that the resource has finished.
    // For resource which run to completion.
    ConditionSucceeded ConditionType = "Succeeded"
)
But there may be other imports of this nature elsewhere in the project.
Tekton TaskRuns and PipelineRuns only use a condition of type Succeeded.
Example:
conditions:
- lastTransitionTime: "2020-05-04T02:19:14Z"
  message: "Tasks Completed: 4, Skipped: 0"
  reason: Succeeded
  status: "True"
  type: Succeeded
The different statuses and messages of the Succeeded condition are described in the documentation:
TaskRun: https://tekton.dev/docs/pipelines/taskruns/#monitoring-execution-status
PipelineRun: https://tekton.dev/docs/pipelines/pipelineruns/#monitoring-execution-status
As a side note, there is an activity timeout available in the API. That timeout is not surfaced to the CLI options though. You could create a tkn feature request for that.
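If you prefer to wait programmatically instead of via kubectl wait, here is a minimal sketch of mine using the Kubernetes dynamic client; the v1beta1 API version, the 2-second poll interval, and the function name are my assumptions, not Tekton's API:
package tektonwait

import (
    "context"
    "fmt"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
)

var pipelineRunGVR = schema.GroupVersionResource{
    Group:    "tekton.dev",
    Version:  "v1beta1",
    Resource: "pipelineruns",
}

// waitForPipelineRunStarted polls until the PipelineRun exists and reports
// its "Succeeded" condition (status "Unknown" while running, "True"/"False"
// once finished), which is enough to avoid the race before starting
// `tkn pipelinerun logs -f`.
func waitForPipelineRunStarted(ctx context.Context, client dynamic.Interface, namespace, name string) (string, error) {
    for {
        pr, err := client.Resource(pipelineRunGVR).Namespace(namespace).Get(ctx, name, metav1.GetOptions{})
        if err == nil {
            conditions, _, _ := unstructured.NestedSlice(pr.Object, "status", "conditions")
            for _, c := range conditions {
                if cond, ok := c.(map[string]interface{}); ok && cond["type"] == "Succeeded" {
                    return fmt.Sprintf("%v", cond["status"]), nil
                }
            }
        }
        select {
        case <-ctx.Done():
            return "", ctx.Err()
        case <-time.After(2 * time.Second):
        }
    }
}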

How can I find the jobId that caused node failure in slurm?

I know that the sacctmgr command can list the event history of nodes along with the reason.
sacctmgr show event Start=09/01-00:00 format=nodename,timestart,timeend,state,reason,user
This command gives the following output:
gnodeXX 2020-09-04T20:21:34 2020-09-05T01:21:38 DRAIN Kill task failed root(ZZ)
gnodeXX 2020-09-09T16:44:55 2020-09-09T17:50:21 DOWN* Not responding slurm(DDDD)
Is there any way I can get the jobId or username that caused the node failure, or any info on the task for which the kill failed? The user column gives only one of two values across all results, root(ZZ) and slurm(DDDD), and I am not sure what these imply.

Kubernetes jobs and back-off limit values: is the value a number of retries or minutes?

I was reading the Kubernetes documentation about jobs and retries. I found this:
There are situations where you want to fail a Job after some amount of
retries due to a logical error in configuration etc. To do so, set
.spec.backoffLimit to specify the number of retries before considering
a Job as failed. The back-off limit is set by default to 6. Failed
Pods associated with the Job are recreated by the Job controller with
an exponential back-off delay (10s, 20s, 40s …) capped at six minutes.
The back-off count is reset if no new failed Pods appear before the
Job’s next status check.
I had two questions about the above quote:
Is the back-off limit value in minutes or a number of retries? The documentation example using the value 6 (six) is confusing, because it initially states that the value is the number of retries but then says "capped at six minutes".
Is there a way to define the back-off delay time? As I understand it, this behavior (10s, 20s, 40s …) is the default and can't be changed.
There is no confusion here: .spec.backoffLimit is the number of retries.
The Job controller recreates the failed Pods (associated with the Job) with an exponential back-off delay (10s, 20s, 40s, ..., 360s). And of course, this delay time is set by the Job controller; a rough sketch of the schedule follows the list below.
If the Pod fails, a new Pod is created after 10s
If it fails again, a new one is created after 20s
If it fails again, the next one comes after 40s
If it fails again, the next one comes after 80s (1m 20s)
If it fails again, the next one comes after 160s (2m 40s)
If it fails again, a new Pod comes after 320s (5m 20s)
If it fails again, the next one appears after 360s (not 640s, because that would exceed the 360s / 6m cap)
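As a rough illustration of that schedule (a sketch of mine, mirroring the behaviour described above rather than the controller's actual code):
package backoff

import "time"

// backoffDelay returns the delay before the Job controller re-creates the
// pod after the given number of failures, doubling from 10s and capping at
// 6 minutes. It mirrors the schedule described above, not the controller's
// actual implementation.
func backoffDelay(failures int) time.Duration {
    if failures <= 0 {
        return 0
    }
    delay := 10 * time.Second
    for i := 1; i < failures; i++ {
        delay *= 2
        if delay >= 6*time.Minute {
            return 6 * time.Minute
        }
    }
    return delay
}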
By looking at the source code, it seems like the backoffLimit attribute specifies the failure count rather than failure time.
Excerpt of the code mentioned above:
func (jm *Controller) syncJob(ctx context.Context, key string) (forget bool, rErr error) {
    // ...
    succeeded, failed := getStatus(&job, pods, uncounted, expectedRmFinalizers)
    // ...
    jobHasNewFailure := failed > job.Status.Failed
    exceedsBackoffLimit := jobHasNewFailure && (active != *job.Spec.Parallelism) &&
        (failed > *job.Spec.BackoffLimit)
    // ...
}

What is the difference between a resourceVersion and a generation?

In Kubernetes object metadata, there are the concepts of resourceVersion and generation. I understand the notion of resourceVersion: it is an optimistic concurrency control mechanism—it will change with every update. What, then, is generation for?
resourceVersion changes on every write, and is used for optimistic concurrency control
in some objects, generation is incremented by the server as part of persisting writes affecting the spec of an object.
some objects' status fields have an observedGeneration subfield for controllers to persist the generation that was last acted on.
In a Deployment context:
In Short
resourceVersion is the version of a k8s resource, while generation is the version of the deployment, which you can use to undo, pause, and so on using the kubectl CLI.
Source code for kubectl rollout: https://github.com/kubernetes/kubectl/blob/master/pkg/cmd/rollout/rollout.go#L50
The Long Version
resourceVersion
The k8s API server saves all modifications to any k8s resource. Each modification has a version, which is called resourceVersion.
The k8s client libraries provide a way to receive, in real time, ADDED, DELETED, and MODIFIED events for any resource. There is also a BOOKMARK event, but let's leave that aside for the moment.
On any modification operation, you receive the new k8s resource with an updated resourceVersion. You can use this resourceVersion to start a watch from it, so you won't miss any events between the time the k8s server sent you back the first response and the time the watch started.
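As an illustration (a minimal client-go sketch of mine; the Pod resource and the plain printing of event types are arbitrary choices), starting a watch from a known resourceVersion looks roughly like this:
package watchexample

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// watchPodsFrom starts a watch on Pods in the given namespace beginning at
// resourceVersion rv (for example, the one returned by a previous List),
// so no events between that List and the watch are missed.
func watchPodsFrom(ctx context.Context, clientset kubernetes.Interface, namespace, rv string) error {
    watcher, err := clientset.CoreV1().Pods(namespace).Watch(ctx, metav1.ListOptions{ResourceVersion: rv})
    if err != nil {
        return err
    }
    for event := range watcher.ResultChan() {
        fmt.Println(event.Type) // ADDED, MODIFIED, DELETED, BOOKMARK, or ERROR
    }
    return nil
}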
K8s doesn't preserve the history of every resource forever. I think it is kept for about 5m, but I'm not sure exactly.
resourceVersion will change after any modification of the object.
The reason for its existence is to avoid concurrency problems where multiple clients try to modify the same k8s resource. This pattern is also pretty common in databases, and you can find more info about it here:
Optimistic concurrency control (https://en.wikipedia.org/wiki/Optimistic_concurrency_control)
https://www.programmersought.com/article/1104647506/
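With client-go, this optimistic concurrency shows up as Conflict errors on update, and the usual pattern is a read-modify-write retry loop. A minimal sketch of mine (the Deployment name, namespace, and the replica-scaling use case are illustrative assumptions):
package retryexample

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/util/retry"
)

// scaleWithRetry bumps the replica count of a Deployment, re-reading the
// object (and therefore its latest resourceVersion) and retrying whenever
// the server rejects the update with a Conflict error.
func scaleWithRetry(ctx context.Context, clientset kubernetes.Interface, namespace, name string, replicas int32) error {
    return retry.RetryOnConflict(retry.DefaultRetry, func() error {
        deployment, err := clientset.AppsV1().Deployments(namespace).Get(ctx, name, metav1.GetOptions{})
        if err != nil {
            return err
        }
        deployment.Spec.Replicas = &replicas
        // If another client modified the Deployment since the Get above,
        // this Update fails with a Conflict and the closure runs again.
        _, err = clientset.AppsV1().Deployments(namespace).Update(ctx, deployment, metav1.UpdateOptions{})
        return err
    })
}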
observedGeneration
You didn't mention it in your question, but it's an important piece of information we need to clarify before moving on to generation.
It is the version of the ReplicaSet that this Deployment is currently tracking.
When the deployment is first being created, this value won't exist (a good discussion on this can be found here: https://github.com/kubernetes/kubernetes/issues/47871).
This value can be found under status:
....
apiVersion: apps/v1
kind: Deployment
.....
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2021-02-07T19:04:17Z"
    lastUpdateTime: "2021-02-07T19:04:17Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2021-02-07T19:04:15Z"
    lastUpdateTime: "2021-02-07T19:17:09Z"
    message: ReplicaSet "deployment-bcb437a4-59bb9f6f69" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 3    # <<<--------------------
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
P.S. From this article: https://thenewstack.io/kubernetes-deployments-work/
observedGeneration is equal to the deployment.kubernetes.io/revision annotation.
This looks correct, because deployment.kubernetes.io/revision does not exist when the deployment is first created and not yet ready, and it has the same value as observedGeneration when the deployment is updated.
generation
It represents the version of the "new" ReplicaSet that this Deployment should track.
When a deployment is created for the first time, the value of generation will be equal to 1. When observedGeneration is then set to 1, it means that the ReplicaSet is ready. (This question is not about how to know whether a deployment was successful or not, so I won't get into what "ready" means here; it is just terminology I made up for this answer. Be aware that there are additional conditions to check whether a deployment was successful.)
The same goes for any change in the Deployment resource that triggers a re-deployment: the generation value will be incremented by 1, and then it will take some time until observedGeneration becomes equal to the generation value.
More info on observedGeneration and generation in the context of kubectl rollout status (to check whether a deployment has "finished"), from the kubectl source code:
https://github.com/kubernetes/kubectl/blob/a2d36ec6d62f756e72fb3a5f49ed0f720ad0fe83/pkg/polymorphichelpers/rollout_status.go#L75
if deployment.Generation <= deployment.Status.ObservedGeneration {
    cond := deploymentutil.GetDeploymentCondition(deployment.Status, appsv1.DeploymentProgressing)
    if cond != nil && cond.Reason == deploymentutil.TimedOutReason {
        return "", false, fmt.Errorf("deployment %q exceeded its progress deadline", deployment.Name)
    }
    if deployment.Spec.Replicas != nil && deployment.Status.UpdatedReplicas < *deployment.Spec.Replicas {
        return fmt.Sprintf("Waiting for deployment %q rollout to finish: %d out of %d new replicas have been updated...\n", deployment.Name, deployment.Status.UpdatedReplicas, *deployment.Spec.Replicas), false, nil
    }
    if deployment.Status.Replicas > deployment.Status.UpdatedReplicas {
        return fmt.Sprintf("Waiting for deployment %q rollout to finish: %d old replicas are pending termination...\n", deployment.Name, deployment.Status.Replicas-deployment.Status.UpdatedReplicas), false, nil
    }
    if deployment.Status.AvailableReplicas < deployment.Status.UpdatedReplicas {
        return fmt.Sprintf("Waiting for deployment %q rollout to finish: %d of %d updated replicas are available...\n", deployment.Name, deployment.Status.AvailableReplicas, deployment.Status.UpdatedReplicas), false, nil
    }
    return fmt.Sprintf("deployment %q successfully rolled out\n", deployment.Name), true, nil
}
return fmt.Sprintf("Waiting for deployment spec update to be observed...\n"), false, nil
I must say that I'm not sure when an observedGeneration can be higher than the generation. Maybe folks can help me out in the comments.
To sum it all up, see the illustration in this great article: https://thenewstack.io/kubernetes-deployments-work/
More Info:
https://kubernetes.slack.com/archives/C2GL57FJ4/p1612651711106700?thread_ts=1612650049.105300&cid=C2GL57FJ4
Some more information about Bookmark events (which are related to resourceVersion): What k8s bookmark solves?
How do you roll back deployments in Kubernetes: https://learnk8s.io/kubernetes-rollbacks

Inspect and retry resque jobs via redis-cli

I am unable to run resque-web on my server due to some issues I still have to work on, but I still need to check and retry failed jobs in my resque queues.
Does anyone have experience with peeking into the failed jobs queue to see what the error was, and then retrying the jobs, using the redis-cli command line?
Found a solution on the following link:
http://ariejan.net/2010/08/23/resque-how-to-requeue-failed-jobs
In the Rails console, we can use these commands to check and retry failed jobs:
1 - Get the number of failed jobs:
Resque::Failure.count
2 - Check the errors exception class and backtrace
Resque::Failure.all(0, 20).each { |job|
  puts "#{job["exception"]} #{job["backtrace"]}"
}
The job object is a hash with information about the failed job; you may inspect it for more details. Also note that this only lists the first 20 failed jobs. I'm not sure how to list them all, so you will have to vary the values (0, 20) to page through the whole list.
3 - Retry all failed jobs:
(Resque::Failure.count-1).downto(0).each { |i| Resque::Failure.requeue(i) }
4 - Reset the failed jobs count:
Resque::Failure.clear
Retrying all the jobs does not reset the counter; we must clear it so that it goes back to zero.