Is there a better way to wait for Kubernetes job completion?

I currently use the following script to wait for job completion:
ACTIVE=$(kubectl get jobs my-job -o jsonpath='{.status.active}')
until [ -z "$ACTIVE" ]; do ACTIVE=$(kubectl get jobs my-job -o jsonpath='{.status.active}'); sleep 30; done
The problem is that the job can either fail or succeed, as it is a test job.
Is there a better way to achieve the same?

Yes. As I pointed out in "kubectl tip of the day: wait like a boss", you can use the kubectl wait command.
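For example, a minimal single-job version might look like this (the job name and timeout are illustrative, not from the original question):
# exits 0 once the job's Complete condition becomes true, or non-zero after the timeout
kubectl wait --for=condition=complete --timeout=600s job/my-job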

Related

Kubectl wait for multiple jobs to complete (fail or success)

I want to wait for multiple jobs, which can fail or succeed. I wrote a simple script based on an answer from Sebastian N. Its purpose is to wait for either success or failure of a job, and it works fine for a single job (which can obviously only fail or succeed).
Now for the problem... I need to wait for multiple jobs identified by the same label. The script works fine when all the jobs fail or all of them succeed, but when some jobs fail and some succeed, one of the kubectl wait commands times out.
For what I intend to do next it is not necessary to know which jobs failed or succeeded; I just want to know when they end. Here is the "wait part" of the script I wrote (LABEL is the label that identifies the jobs I want to wait for):
kubectl wait --for=condition=complete job -l LABEL --timeout 14400s && exit 0 &
completion_pid=$!
kubectl wait --for=condition=failed job -l LABEL --timeout 14400s && exit 1 &
failure_pid=$!
wait -n $completion_pid $failure_pid
exit_code=$?
if (( exit_code == 0 )); then
    echo "Job succeeded"
    pkill -P $failure_pid
else
    echo "Job failed"
    pkill -P $completion_pid
fi
If someone is curious why I kill the other kubectl wait command, it's because of the timeout I set. When the job succeeds, that process ends, but the other one keeps waiting until the timeout is reached. To stop it from running in the background, I simply kill it.
I found a workaround that fits my purpose. It turns out that kubectl logs with the --follow (or -f) flag, redirected to /dev/null, effectively "waits" until all jobs are done.
Further explanation:
The --follow flag means that the logs are streamed continuously, regardless of the job's finishing state. In addition, redirecting the logs to /dev/null avoids printing any unwanted text. I needed to print the log output via Python, so I added another kubectl logs call at the end (which I think is not ideal, but it serves the purpose). I use sleep because I assume there is some processing after all jobs are completed; without it the logs are not printed. Finally, I use the --tail=-1 flag because my logs are expected to have a large output.
Here is my updated script (this part replaces everything from the script specified in question):
# wait until all jobs are completed; it doesn't matter whether they failed or succeeded
kubectl logs -f -l LABEL > /dev/null
# sleep for some time, then print the final logs
sleep 2 && kubectl logs -l LABEL --tail=-1

How to make a workflow run for an infinitely long duration when running it using the command line?

I am running a Cadence workflow using the command line. I don't want my workflow to time out (i.e., I want it to run for an infinitely long duration). How can I do so?
You can set the startToCloseTimeout to a very large number; for example, 100 years can effectively represent an infinite duration.
Also, there are two ways to start workflows from the command line: start or run.
./cadence --address <> --domain <> workflow run --tl helloWorldGroup --wt <WorkflowTypeName> --et 31536000 -i '<inputInJson>'
or
./cadence --address <> --domain <> workflow start --tl helloWorldGroup --wt <WorkflowTypeName> --et 31536000 -i '<inputInJson>'
Note that --et is short for execution_timeout, which is the startToCloseTimeout in seconds (31536000 seconds is one year).
Start returns once the server accepts the start request, while Run waits for the workflow to complete and returns the result at the end. In your case you need to use Start, because you don't know when the workflow will complete. But if you also want to get the workflow result after it has started, you can use the observe command.
./cadence --address <> --domain <> workflow observe --workflow_id <>

When does status.lastScheduleTime change?

I'm working on a Kubernetes CronJob that relies on knowing when the last successful run happened in order to correctly notify users about new events since that run. That said, I need to know whether I can rely on status.lastScheduleTime as my "last successful run" timestamp.
When does it change? Does it depend on exit codes? Is it even a good idea to use it for that purpose?
No, you can't rely on that as an indicator of a successful run. That value changes whenever your CronJob runs; it doesn't mean that run was successful, and it doesn't change depending on exit codes.
A CronJob essentially runs a Job whose name is <cronjob name>-<unix epoch>. The epoch is what you would get from the date +%s command on Unix/Linux, and it is a timestamp slightly later than the lastScheduleTime (it is the time at which the Job resource gets created).
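For example, assuming the suffix really is a seconds-since-epoch value as described above, you can turn it back into a readable time with GNU date (the value below is made up):
date -d @1616400000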
To find out whether your last cron job ran successfully, you can do something like the following.
You can get the name of the last Job run/started, including its epoch, with something like this:
$ kubectl get jobs | tail -1 | awk '{print $1}'
Then, after that, you can check whether that Job succeeded with something like:
$ kubectl get job <job-name> -o=jsonpath='{.status.succeeded}'
It should return 1.
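Putting the two together, a minimal sketch might look like this (sorting by creation time so that "last" is well defined; the --sort-by flag is my addition, not part of the original answer):
# name of the most recently created Job, e.g. job.batch/my-cronjob-1616400000
last_job=$(kubectl get jobs --sort-by=.metadata.creationTimestamp -o name | tail -1)
# .status.succeeded is 1 when the Job completed successfully
succeeded=$(kubectl get "$last_job" -o jsonpath='{.status.succeeded}')
if [ "$succeeded" = "1" ]; then
    echo "last job succeeded"
else
    echo "last job did not succeed"
fi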

How to store PBSPRO jobs in an array and check for job completion?

I'm trying to build a system that allows me to check whether multiple jobs have finished running on a cluster.
This bash code should work to wait until all PBS jobs have completed:
# create the array
ALLMYJOBS=()
# loop through scripts, submit them and store the job IDs in the array
for i in 1 2 3 4 5
do
    ALLMYJOBS[${i}]=$(qsub script${i}.bash)
done
JOBSR=true
# check if all jobs have completed:
while [ ${JOBSR} ]; do
    JOBSR=false
    for m in "${ALLMYJOBS[@]}"
    do
        if qstat ${m} &> /dev/null; then
            JOBSR=true
        fi
    done
done
Am I missing something obvious?
Actually, the problem was with the check itself: [ ${JOBSR} ] is true for any non-empty string, including the literal string "false", so the loop never exits. Comparing the value explicitly fixes it:
while [ "${JOBSR}" = "true" ]; do
This works, apparently.
The way you implemented it will keep polling the scheduler, which is undesirable on a large cluster.
If I were to implement this, I would use a job dependency: define another job that depends on all of your jobs, then either check the output of that last job or use email notification.
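For illustration, a minimal sketch of that approach, assuming PBS Pro's -W depend=afterany syntax (afterany releases the dependent job once the listed jobs have finished, regardless of their exit status); sentinel.bash is a hypothetical script that checks the results:
# submit the worker jobs and collect their IDs
ALLMYJOBS=()
for i in 1 2 3 4 5
do
    ALLMYJOBS+=("$(qsub script${i}.bash)")
done

# join the job IDs with ':' to build the dependency list
DEPLIST=$(IFS=:; echo "${ALLMYJOBS[*]}")

# submit a sentinel job that only starts after all the others have finished
qsub -W depend=afterany:${DEPLIST} sentinel.bash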

Catching the error status while running scripts in parallel on Jenkins

I'm running two Perl scripts in parallel on Jenkins, plus one more script that should be executed only if the first two succeed. If script 1 fails, script 2 still runs, and hence the exit status becomes successful.
I want to run them in such a way that if either of the parallel scripts fails, the job stops with a failure status.
Currently my setup looks like
perl_script_1 &
perl_script_2 &
wait
perl_script_3
If script 1 or 2 fails in the middle, the job should be terminated with a failure status without executing script 3.
Note: I'm using tcsh shell in Jenkins.
I have a similar setup where I run several Java processes (tests) in parallel and wait for them to finish. If any of them fail, I fail the rest of my script.
Each test process writes its result to a file that is checked once the process is done.
Note - the code examples below are written in bash, but it should be similar in tcsh.
To do this, I get the process id for every execution:
test1 &
test1_pid=$!
# test1 will write pass or fail to file test1_result
test2 &
test2_pid=$!
...
Now, I wait for the processes to finish by checking them periodically with the kill -0 PID command (kill -0 sends no signal; it only tests whether the process still exists).
For example test1:
# Check test1
kill -0 $test1_pid
# Check if process is done or not
if [ $? -ne 0 ]
then
echo process test1 finished
# check results
grep fail test1_result
if [ $? -eq 0 ]
then
echo test1 failed
mark_whole_build_failed
fi
fi
Do the same for the other tests (you can use a loop to check all running processes periodically).
Later, condition the rest of the execution on whether mark_whole_build_failed was triggered.
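For illustration, a minimal sketch of that polling loop in bash, reusing the PIDs and result files from above; the failure handling (a simple FAILED flag followed by exit 1 before running script 3) is my assumption, since mark_whole_build_failed is left unspecified:
FAILED=0
# poll until both test processes have exited
while kill -0 $test1_pid 2>/dev/null || kill -0 $test2_pid 2>/dev/null
do
    sleep 5
done

# inspect the result file written by each test
for result in test1_result test2_result
do
    if grep -q fail "$result"
    then
        echo "$result reports a failure"
        FAILED=1
    fi
done

# fail the build before the third script ever runs
if [ $FAILED -ne 0 ]
then
    exit 1
fi
perl_script_3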
I hope this helps.