Kubectl wait for multiple jobs to complete (fail or succeed) - kubernetes

I want to wait for multiple jobs which can fail or succeed. I wrote a simple script based on an answer from Sebastian N. Its purpose is to wait for either success or failure of a job. The script works fine for one job (which, obviously, can only fail or succeed).
Now for the problem... I need to wait for multiple jobs identified by the same label. The script works fine when all jobs fail or all jobs succeed. But when some jobs fail and some succeed, the kubectl wait commands time out.
For what I intend to do next it's not necessary to know which jobs failed or succeeded; I just want to know when they end. Here is the "wait part" of the script I wrote (LABEL is the label that identifies the jobs I want to wait for):
# each background pipeline exits 0 on completion, 1 on failure
kubectl wait --for=condition=complete job -l LABEL --timeout 14400s && exit 0 &
completion_pid=$!
kubectl wait --for=condition=failed job -l LABEL --timeout 14400s && exit 1 &
failure_pid=$!
# return the exit status of whichever process finishes first
# (passing PIDs to wait -n needs a reasonably recent bash)
wait -n $completion_pid $failure_pid
exit_code=$?
if (( exit_code == 0 )); then
    echo "Job succeeded"
    pkill -P $failure_pid    # kill the leftover kubectl wait
else
    echo "Job failed"
    pkill -P $completion_pid # kill the leftover kubectl wait
fi
If someone is curious why I kill the other kubectl wait command: it's because of the timeout I set. When the jobs succeed, the first process ends, but the other one keeps waiting until the timeout is reached. To stop it from running in the background I simply kill it.

I found a workaround that fits my purpose: kubectl logs with the --follow (-f) flag, redirected to /dev/null, effectively "waits" until all jobs are done.
Further explanation:
The --follow flag streams the logs continuously until the pods terminate, regardless of whether the job succeeds or fails. Redirecting the output to /dev/null discards the unwanted text. I needed to print the output of the logs via Python, so I added another kubectl logs call at the end (which I think is not ideal, but it serves the purpose). I use sleep because I assume there is some short procedure after all jobs complete; without it, the final logs are not printed. Finally, I use the --tail=-1 flag because my jobs are expected to produce large output and I want all of the lines.
Here is my updated script (this part replaces everything from the script specified in question):
# wait until all jobs are completed; it doesn't matter whether they failed or succeeded
kubectl logs -f -l LABEL > /dev/null
# sleep for a moment, then print the final logs
sleep 2 && kubectl logs -l LABEL --tail=-1
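If you later do need to know how each job ended, one option (a sketch, assuming the jobs still exist and carry the same label) is to read the jobs' status fields after the log stream returns:
# print name, succeeded count, and failed count for each job matching LABEL
kubectl get jobs -l LABEL -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.succeeded}{"\t"}{.status.failed}{"\n"}{end}'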

Related

How to make a workflow run for an infinitely long duration when running it using command line?

I am running a Cadence workflow using the command line. I don't want my workflow to time out (i.e., I want it to run for an infinitely long duration). How can I do so?
You can set the startToCloseTimeout to a very large number; e.g., 100 years can represent an infinite duration for you.
Also, there are two ways to start workflows using the command line -- start or run.
./cadence --address <> --domain <> workflow run --tl helloWorldGroup --wt <WorkflowTypeName> --et 31536000 -i '<inputInJson>'
or
./cadence --address <> --domain <> workflow start --tl helloWorldGroup --wt <WorkflowTypeName> --et 31536000 -i '<inputInJson>'
Note that --et is short for execution_timeout, which is the startToCloseTimeout in seconds (the 31536000 above is 365 days).
Start returns as soon as the server accepts the start request, while Run waits for the workflow to complete and returns the result at the end. In your case you need to use Start, because you don't know when the workflow will complete. But if you also want to get the workflow result after it's started, you can use the observe command:
./cadence --address <> --domain <> workflow observe --workflow_id <>

qsub -t job "has died"

I am trying to submit an array job to our cluster using qsub. The script looks like this:
#!/bin/bash
#PBS -l nodes=1:ppn=1 # Number of nodes and processors per node
#..... (other options)
#PBS -t 0-50 # Array job: run tasks 0 through 50
cd $PBS_O_WORKDIR
./programname << EOF
some parameters
EOF
This script runs without a problem when the -t option is removed. But every time I add -t, I get the following output:
---------------------------------------------
Check nodes and clean them of stray processes
---------------------------------------------
Checking node XXXXXXXXXX
-> User XXXX running job XXXXX.XXX:state=X:ncpus=X
-> Job XXX.XXX has died
Done clearing all the allocated nodes
------------------------------------------------------
Concluding PBS prologue script - XX-XX-XXXX XX:XX:XX
------------------------------------------------------
-------------- Job will be requeued --------------
It dies there and the requeue starts. No error message. I did not find any similar issue online. Has anyone experienced this before? Thank you!
(I wrote another "manual" array qsub script which works. But I do wish to get -t to work, as it is a single command option and much cleaner.)
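For reference, a "manual" array submission is typically just a loop over qsub calls, something like the sketch below (jobscript.pbs and the INDEX variable are placeholders, not from the original post):
# submit one job per index, passing the index as an environment variable
for i in $(seq 0 50); do
    qsub -v INDEX=$i jobscript.pbs
done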

Is there a better way to wait for Kubernetes job completion?

I currently use the following script to wait for the job completion
ACTIVE=$(kubectl get jobs my-job -o jsonpath='{.status.active}')
until [ -z "$ACTIVE" ]; do
    ACTIVE=$(kubectl get jobs my-job -o jsonpath='{.status.active}')
    sleep 30
done
The problem is that the job can either fail or succeed, since it is a test job.
Is there a better way to achieve the same?
Yes. As I pointed out in "kubectl tip of the day: wait like a boss", you can use the kubectl wait command.
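For example, a minimal sketch (the job name and timeout here are placeholders):
# block until the job completes successfully, or give up after the timeout
kubectl wait --for=condition=complete job/my-job --timeout=600s
# a non-zero exit status means the condition was not met (failed or timed out)
echo $?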

Catching the error status while running scripts in parallel on Jenkins

I'm running two Perl scripts in parallel on Jenkins, and one more script which should only execute if the first two succeed. If I get an error in script 1, script 2 still runs, and hence the exit status becomes successful.
I want to run them in such a way that if either of the parallel scripts fails, the job stops with a failure status.
Currently my setup looks like
perl_script_1 &
perl_script_2 &
wait
perl_script_3
If script 1 or 2 fails in the middle, the job should terminate with a failure status without executing script 3.
Note: I'm using tcsh shell in Jenkins.
I have a similar setup where I run several Java processes (tests) in parallel and wait for them to finish. If any fail, I fail the rest of my script.
Each test process writes its result to a file that is checked once the process is done.
Note - the code examples below are written in bash, but it should be similar in tcsh.
To do this, I get the process id for every execution:
test1 &
test1_pid=$!
# test1 will write pass or fail to file test1_result
test2 &
test2_pid=$!
...
Now, I wait for the processes to finish by polling with the kill -0 PID command. For example, for test1:
# Check test1: kill -0 succeeds only while the process is still alive
kill -0 $test1_pid 2>/dev/null
if [ $? -ne 0 ]
then
    echo "process test1 finished"
    # check results
    grep fail test1_result
    if [ $? -eq 0 ]
    then
        echo "test1 failed"
        mark_whole_build_failed
    fi
fi
Do the same for the other tests (you can run a loop that polls all running processes periodically); a simpler alternative based on plain wait is sketched below.
Afterwards, condition the rest of the execution on whether mark_whole_build_failed was triggered.
I hope this helps.
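For completeness, here is a minimal bash sketch of that simpler variant: wait PID returns that process's own exit status, so you can fail fast without result files (perl_script_1/2/3 are the names from the question):
perl_script_1 & pid1=$!
perl_script_2 & pid2=$!
fail=0
wait $pid1 || fail=1    # wait returns each script's own exit status
wait $pid2 || fail=1
if [ $fail -ne 0 ]; then
    echo "a parallel script failed" >&2
    exit 1              # fail the Jenkins job; script 3 never runs
fi
perl_script_3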

How to detect if a process is running on any of the nodes in a multiple node Grid in UNIX

My server is using a grid. We have 3 nodes; any one of them could execute my script when I kick off the Autosys job.
Now my problem: I am trying to stop a job from running if it is already running. My code works when both instances of the script execute on the same node (I mean the first instance and the second instance):
ps -ead -o %U%p%a | egrep '(ksh|perl)' | grep -v egrep | grep "perl .*myprocess.pl"
Is there a way ps could list all instances of the process across all nodes in the grid?
Please help!
Since ps only lists processes on the local node, one workaround is to create a start.flag file in a common location visible to all nodes. Keep the below conditions:
1. If the flag exists, remove it and execute the script. After the execution completes, the script touches the flag again.
2. If the flag does not exist, the script just exits, reporting that it's already running.
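A minimal shell sketch of that flag-file logic (the path /shared/myprocess.start.flag is a placeholder; note that the rm/touch pair is not atomic, so two instances starting at exactly the same moment could still race):
FLAG=/shared/myprocess.start.flag    # common location visible to all nodes
# the flag must be created once up front: touch "$FLAG"
if [ -f "$FLAG" ]; then
    rm "$FLAG"                       # take the "lock"
    # ... run the actual work here ...
    touch "$FLAG"                    # release the "lock" when done
else
    echo "myprocess is already running; exiting"
    exit 1
fi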
Best luck :)