I have a cluster running some jobs, and one job executes a pod. That pod completes partway through its work: for example, if it computes 1 + 3 it should log the result, but it stops before doing so and the pod's status is Completed. I don't know what can cause a pod to complete without executing the whole code. Any help or thoughts on it would help a lot.
Detail:
Here is the case I'm seeing:
console.log("Opening in ECS ");<<--in one case pod successfully terminates here -->>
try {
await funcy1();<<-- an async function -->>
console.log("opening in ECS end");<--in second case pod successfully terminates here-->>
} catch (error) {
throw error;
}
The pod completes at the indicated line. If there were an error it should be thrown (and then logged), but I cannot see any log; the pod just completes at that line, which shouldn't be the case.
Some errors from the pod descriptions are:
State: Terminated
Reason: Error
Exit Code: 255
and
State: Terminated
Reason: Error
Exit Code: 137
and
State: Terminated
Reason: Completed
Exit Code: 0
So the issue was that I had not specified resources for the pod; just posting this in case it helps someone.
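For reference, exit code 137 is 128 + 9 (SIGKILL), which typically means the container was killed, for example under memory pressure, so adding requests and limits is a sensible fix. A minimal sketch of what that could look like in the Job's pod template (the container name, image, and values are placeholders, not taken from the original manifest):

containers:
- name: worker            # placeholder name
  image: my-app:latest    # placeholder image
  resources:
    requests:
      cpu: 250m
      memory: 256Mi
    limits:
      cpu: "1"
      memory: 1Gi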
Related
I have a Python script that is executed via Rundeck. I have already implemented handlers for signal.SIGINT and signal.SIGTERM, but when the script is terminated via Rundeck's KILL JOB button it does not catch the signal.
Does anyone know what the KILL JOB button in Rundeck uses under the hood to kill the process?
Example of how I'm catching signals; it works in a standard command-line execution:
import logging
import os
import sys
from signal import SIGINT

import psutil

# Error and SIGINT_EXIT are defined elsewhere in the script.

def sigint_handler(signum, frame):
    proc = psutil.Process(os.getpid())
    children_procs = proc.children(recursive=True)
    children_procs.reverse()
    for child_proc in children_procs:
        try:
            if child_proc.is_running():
                msg = f'removing: {child_proc.pid}, {child_proc.name()}'
                logging.debug(msg)
                os.kill(child_proc.pid, SIGINT)
        except OSError as exc:
            raise Error('Error removing processes', detail=str(exc))
    sys.exit(SIGINT_EXIT)
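For completeness, a handler like this gets wired up with signal.signal; a minimal sketch (reusing sigint_handler for SIGTERM is an assumption here, the real script may have a dedicated SIGTERM handler):

import signal

signal.signal(signal.SIGINT, sigint_handler)
signal.signal(signal.SIGTERM, sigint_handler)  # assumption: the real script may use a separate SIGTERM handler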
Adding the debug logging level in Rundeck, I get this:
[wf:7bb0cd58-7dc6-4a55-bb0f-62399533396c] Interrupted: Engine interrupted, stopping engine...
Disconnecting from 9.11.56.44 port 22
[wf:7bb0cd58-7dc6-4a55-bb0f-62399533396c] WillShutdown: Workflow engine shutting down (interrupted? true)
[wf:7bb0cd58-7dc6-4a55-bb0f-62399533396c] OperationFailed: operation failed: java.util.concurrent.CancellationException: Task was cancelled.
SSH command execution error: Interrupted: Connection was interrupted
Caught an exception, leaving main loop due to Socket closed
Failed: Interrupted: Connection was interrupted
[workflow] finishExecuteNodeStep(mario): NodeDispatch: Interrupted: Connection was interrupted
1: Workflow step finished, result: Dispatch failed on 1 nodes: [mario: Interrupted: Connection was interrupted + {dataContext=MultiDataContextImpl(map={ContextView(step:1, node:mario)=BaseDataContext{{exec={exitCode=-1}}}, ContextView(node:mario)=BaseDataContext{{exec={exitCode=-1}}}}, base=null)} ]
[workflow] Finish step: 1,NodeDispatch
[wf:7bb0cd58-7dc6-4a55-bb0f-62399533396c] Complete: Workflow complete: [Step{stepNum=1, label='null'}: CancellationException]
[wf:7bb0cd58-7dc6-4a55-bb0f-62399533396c] Cancellation while running step [1]
[workflow] Finish execution: node-first: [Workflow result: , Node failures: {mario=[]}, status: failed]
[Workflow result: , Node failures: {mario=[]}, status: failed]
Execution failed: 57 in project iLAB: [Workflow result: , Node failures: {mario=[]}, status: failed]
Is it just closing the connection?
Rundeck can't manage internal threads in that way (directly); with the kill button you can only kill the Rundeck job. The only way to manage that is to put all the logic in your script (detect the thread and, depending on some option/behavior, kill it). That was requested here and here.
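If you go the "all the logic in your script" route, one possible pattern (a speculative sketch, not the asker's code; the command at the bottom is hypothetical) is to run the real work in its own process group and tear the whole group down from a signal handler, also catching SIGHUP since the log above shows Rundeck simply cancelling the workflow and dropping the SSH connection:

import os
import signal
import subprocess
import sys

def run_with_cleanup(cmd):
    # Start the real work in its own process group so the whole tree can be killed at once.
    proc = subprocess.Popen(cmd, preexec_fn=os.setsid)

    def cleanup(signum, frame):
        os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
        sys.exit(1)

    # SIGHUP is included because dropping the SSH session may deliver it to the script.
    for sig in (signal.SIGINT, signal.SIGTERM, signal.SIGHUP):
        signal.signal(sig, cleanup)

    return proc.wait()

# run_with_cleanup(["python3", "worker.py"])  # hypothetical command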
Situation:
I've got a CronJob that often fails (this is expected at the moment). Because the container performing the job has a side-car, the dependencies between the containers are expressed through bash scripts and a shared emptyDir mount at /etc/liveness:
spec:
  containers:
  - args:
    - -c
    - set -x;
      ...
      ./process;        # execute the main process
      rc=$?;
      rm /etc/liveness; # clean-up
      exit $rc;
    command:
    - /bin/bash
Problem:
In the scenarios where the job fails, I see the following in the logs:
+ rc=255
+ rm /etc/liveness
+ exit 255
With restartPolicy set to Never, the failed pod enters the Completed status, which is misleading:
scheduler-1594015200-wl9xc 0/2 Completed 0 24m
According to the official docs,
A Job creates one or more Pods and ensures that a specified number of
them successfully terminate.
And containers enter the terminated state when
it has successfully completed execution or when it has failed for some
reason.
So if you set restartPolicy to Never, this is what will happen.
A Pod's status field is a PodStatus object, which has a phase field.
Ref: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase
Status and phase are not the same thing. So I learned that what happens above is that my pods end up with status Completed and phase Failed.
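To see the distinction directly, you can read the pod's status.phase field itself instead of relying on the STATUS column of kubectl get pods; a small sketch using the official kubernetes Python client (the namespace here is a placeholder):

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod("scheduler-1594015200-wl9xc", "default")  # placeholder namespace
print(pod.status.phase)  # prints "Failed" for the pod above, even though kubectl shows Completed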
I use Celery 4.4.0 in my project (Ubuntu 18.04.2 LTS). When I raise Exception('too few functions in features to classify'), the Celery worker is lost and I get logs like these:
[2020-02-11 15:42:07,364] [ERROR] [Main ] Task handler raised error: WorkerLostError('Worker exited prematurely: exitcode 0.')
Traceback (most recent call last):
File "/var/lib/virtualenvs/simus_classifier_new/lib/python3.7/site-packages/billiard/pool.py", line 1267, in mark_as_worker_lost human_status(exitcode)), billiard.exceptions.WorkerLostError: Worker exited prematurely: exitcode 0.
[2020-02-11 15:42:07,474] [DEBUG] [ForkPoolWorker-61] Closed channel #1
Do you have any idea how to solve this problem?
WorkerLostErrors are almost like OutOfMemory errors - they can't be solved. They will continue to happen from time to time. What you should do is make your task(s) idempotent and let Celery retry tasks that failed due to a worker crash.
It sounds trivial, but in many cases it is not - not all tasks can be idempotent, for example. Celery also still has bugs in the way it handles WorkerLostError. Therefore you need to monitor your Celery cluster closely, react to these events, and try to minimize them. In other words, find out why the worker crashed: was it killed by the system because it was consuming all the memory? Was it killed simply because it was running on an AWS spot instance that got terminated? Was it killed by someone executing kill -9 <worker pid>? All these circumstances can be handled one way or another...
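To illustrate the "make the task idempotent and let Celery retry" advice, here is a minimal sketch (the broker URL, threshold, and the stand-in work are placeholders, not from the asker's project):

from celery import Celery

app = Celery("classifier", broker="redis://localhost:6379/0")  # placeholder broker URL

MIN_FEATURES = 3  # hypothetical threshold

# acks_late + reject_on_worker_lost make the broker redeliver the task if the worker
# dies mid-execution, so the task body must be safe to run more than once.
@app.task(bind=True, acks_late=True, reject_on_worker_lost=True,
          autoretry_for=(Exception,), retry_backoff=True, max_retries=3)
def classify(self, features):
    if len(features) < MIN_FEATURES:
        raise Exception("too few functions in features to classify")
    return sorted(features)  # stand-in for the real, idempotent classification work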
How can I programmatically determine if a job has failed for good and will not retry any more? I've seen the following on failed jobs:
status:
  conditions:
  - lastProbeTime: 2018-04-25T22:38:34Z
    lastTransitionTime: 2018-04-25T22:38:34Z
    message: Job has reach the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed
However, the documentation doesn't explain why conditions is a list. Can there be multiple conditions? If so, which one do I rely on? Is it a guarantee that there will only be one with status: "True"?
JobConditions are similar to PodConditions. You can read about PodConditions in the official docs.
Anyway, to determine a successful Job I follow a different approach. Let's look at it.
There are two fields in the Job spec.
One is spec.completions (default value 1), which says:
Specifies the desired number of successfully finished pods the job should be run with.
Another is spec.backoffLimit (default value 6), which says:
Specifies the number of retries before marking this job failed.
Now, in JobStatus:
There are two relevant fields in JobStatus too: Succeeded and Failed. Succeeded is the number of pods which completed successfully, and Failed is the number of pods which reached phase Failed.
Once Succeeded is equal to or greater than spec.completions, the job becomes completed.
Once Failed is equal to or greater than spec.backoffLimit, the job becomes failed.
So the logic looks like this:
if job.Status.Succeeded >= *job.Spec.Completions {
    return "completed"
} else if job.Status.Failed >= *job.Spec.BackoffLimit {
    return "failed"
}
If so, which one do I rely on?
You might not have to choose, considering commit dd84bba64
When a job is complete, the controller will indefinitely update its conditions
with a Complete condition.
This change makes the controller exit the
reconcilation as soon as the job is already found to be marked as complete.
As https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.26/#jobstatus-v1-batch says:
The latest available observations of an object's current state. When a
Job fails, one of the conditions will have type "Failed" and status
true. When a Job is suspended, one of the conditions will have type
"Suspended" and status true; when the Job is resumed, the status of
this condition will become false. When a Job is completed, one of the
conditions will have type "Complete" and status true. More info:
https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
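Putting that together, a terminal-state check based on the conditions list can look like this (a sketch using the official kubernetes Python client; the job name and namespace are placeholders):

from kubernetes import client, config

config.load_kube_config()
batch = client.BatchV1Api()

def job_finished(name, namespace="default"):
    # Returns "failed", "complete", or None if the Job is still running.
    job = batch.read_namespaced_job(name, namespace)
    for cond in job.status.conditions or []:
        if cond.status == "True" and cond.type == "Failed":
            return "failed"
        if cond.status == "True" and cond.type == "Complete":
            return "complete"
    return None

print(job_finished("my-job"))  # hypothetical job name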
What does this error mean please?
Stack named 'awseb-eea9ufee4ak-stack' aborted operation. Current state: 'CREATE_FAILED' Reason: The following resource(s) failed to create: [AWSEBInstanceLaunchWaitCondition]. (Service: AmazonCloudFormation; Status Code: 400; Error Code: OperationError; Request ID: null)
This error means that launching your environment timed out while waiting to hear back from the EC2 instance; the instance did not report whether it successfully launched the environment or not. I would recommend pulling the snapshot logs to see detailed error messages from the instance.