I'm working on a Kubernetes CronJob that relies on knowing when the last successful run happened in order to correctly notify users about new events since that run. I need to know whether I can rely on status.lastScheduleTime as my "last successful run" timestamp.
When does it change? Does it depend on exit codes? Is it even a good idea to use it for that purpose?
No, you can't rely on that as an indicator of a successful run. That value changes whenever your CronJob schedules a run. It doesn't mean that the run was successful, and it doesn't change depending on exit codes.
A CronJob essentially creates a Job named <cronjob name>-<unix epoch>. The epoch is the Unix timestamp you would get from, for example, the date +%s command. That epoch is a timestamp slightly later than lastScheduleTime (it's when the Job resource gets created).
To find out if your last cron job ran successfully you can do something like the following.
You can get the last Job run/started name including its epoch with something like this:
$ kubectl get jobs | tail -1 | awk '{print $1}'
Then after that, you could check whether that job is successful with something like:
$ kubectl get job <job-name> -o=jsonpath='{.status.succeeded}'
This should return 1 if the job succeeded.
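The two steps above can be combined into one small helper. This is only a sketch: last_job_succeeded is a hypothetical function name, it assumes kubectl is already pointed at the right cluster and namespace, and it sorts by creation time instead of relying on the order of the default listing. Any extra arguments (for example a label selector) are passed to the first kubectl call.

```shell
#!/usr/bin/env bash
# Sketch: report whether the most recently created Job succeeded.
# last_job_succeeded is a hypothetical helper, not part of kubectl.
last_job_succeeded() {
  local job succeeded
  # Sort by creation time so the newest Job is the last item.
  job=$(kubectl get jobs "$@" --sort-by=.metadata.creationTimestamp \
        -o jsonpath='{.items[-1:].metadata.name}') || return 2
  # No Jobs found at all.
  [ -n "$job" ] || return 2
  succeeded=$(kubectl get job "$job" -o jsonpath='{.status.succeeded}') || return 2
  # .status.succeeded is empty until at least one pod has finished successfully.
  [ "${succeeded:-0}" -ge 1 ]
}
```

You could then write something like: if last_job_succeeded -l app=my-cron; then ...; fi (the label is made up for the example). A return code of 2 means the lookup itself failed, which is different from "the job did not succeed".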
I want to wait for multiple jobs which can fail or succeed. I wrote a simple script based on an answer from Sebastian N. Its purpose is to wait for either the success or the failure of a job. The script works fine for one job (it can obviously only fail or succeed).
Now for the problem... I need to wait for multiple jobs identified by the same label. The script works fine when all jobs fail or all jobs succeed. But when some jobs fail and some succeed, the kubectl wait will time out.
For what I intend to do next, it's not necessary to know which jobs failed or succeeded; I just want to know when they end. Here is the "wait part" of the script I wrote (LABEL is the label by which the jobs I want to wait for are identified):
kubectl wait --for=condition=complete job -l LABEL --timeout 14400s && exit 0 &
completion_pid=$!
kubectl wait --for=condition=failed job -l LABEL --timeout 14400s && exit 1 &
failure_pid=$!
wait -n $completion_pid $failure_pid
exit_code=$?
if (( exit_code == 0 )); then
  echo "Job succeeded"
  pkill -P $failure_pid
else
  echo "Job failed"
  pkill -P $completion_pid
fi
If someone is curious why I kill the other kubectl wait command, it's because of the timeout I set. When the job succeeds, that process ends, but the other one keeps waiting until the timeout is reached. To stop it from running in the background, I simply kill it.
I found a workaround that fits my purpose. I found out that kubectl logs with the --follow or -f flag pointed to /dev/null actually "waits" until all jobs are done.
Further explanation:
The --follow flag means that the logs are streamed continuously, regardless of the state in which the job finishes. In addition, redirecting the logs to /dev/null doesn't leave any unwanted text. I needed to print the output of the logs via Python, so I added another kubectl logs at the end (which I think is not ideal, but it serves the purpose). I use sleep because I assume there is some procedure after all jobs are completed; without it the logs are not printed. Finally, I use the --tail=-1 flag because my logs are expected to have large output.
Here is my updated script (this part replaces everything from the script specified in question):
#wait until all jobs are completed, doesn't matter if failed or succeeded
kubectl logs -f -l LABEL > /dev/null
#sleep for some time && print final logs
sleep 2 && kubectl logs -l LABEL --tail=-1
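If you would rather not rely on the log stream, another option is to poll the Job conditions yourself. The following is a rough sketch, not something from the script above: wait_for_jobs_done is a hypothetical helper, the timeout and interval defaults are arbitrary, and it assumes each finished Job carries a condition of type Complete or Failed with status True. Note that it returns immediately if no Job matches the selector.

```shell
#!/usr/bin/env bash
# Sketch: wait until every Job matching a label selector has finished,
# no matter whether it ended as Complete or Failed.
# Usage: wait_for_jobs_done SELECTOR [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
wait_for_jobs_done() {
  local selector=$1 timeout=${2:-14400} interval=${3:-10}
  local waited=0 lines pending
  while :; do
    # One line per Job: name, a tab, then the condition types that are True.
    lines=$(kubectl get jobs -l "$selector" \
      -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.status=="True")].type}{"\n"}{end}') \
      || return 2
    # Count Jobs that have neither a Complete nor a Failed condition yet.
    pending=$(printf '%s\n' "$lines" |
      awk -F'\t' 'NF && $2 !~ /(^| )(Complete|Failed)( |$)/ { n++ } END { print n + 0 }')
    [ "$pending" -eq 0 ] && return 0
    [ "$waited" -ge "$timeout" ] && return 1
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

For example, wait_for_jobs_done LABEL 14400 && kubectl logs -l LABEL --tail=-1 would print the logs once nothing is pending any more. A return code of 1 means the timeout was reached, 2 means kubectl itself failed.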
I usually get the job id with:
MY_CONDOR_JOB_ID
but I don't see it set if it's an interactive job. Is there a way to set it? When I am given the resources I see that there is a job id for my job. Is there a way to get it?
Here is what it should be
Submitting job(s).
1 job(s) submitted to cluster 4869.
Waiting for job to start...
HTCondor proper doesn't set MY_CONDOR_JOB_ID, so either your submit file or your administrator has set this up.
If your submit file contains
environment = CONDOR_JOB_ID=$(Cluster)
then HTCondor will insert the job cluster id into the environment variable CONDOR_JOB_ID. To get this into a condor_submit -i session, you'll need to pass the name of this submit file to condor_submit. So, try putting that line into a submit file, maybe named env.sub, and run
condor_submit -i env.sub
Or, if you already have a submit file which sets this, pass the name of that submit file to condor_submit -i.
Halfway through a calculation, I found that the runtime limit of 50:00 may not be sufficient. So I used $ bstop 1234 to stop job 1234, and I am trying to change the old runtime limit from -W 50:00 to -W 100:00.
Can you suggest a command to do so?
I tried
$ bmod -W 100:00 1234
Please request for a minimum of 32 cores!
For more information, please contact XXX#XXX.
Request aborted by esub. Job not modified.
$ bmod [-W 100:00| -Wn ] 1234
-bash: -Wn]: command not found
100:00[8217]: Illegal job ID.
. Job not modified.
according to
[-W [hour:]minute[/host_name | /host_model] | -Wn]
from http://www.cisl.ucar.edu/docs/LSF/7.0.3/command_reference/bmod.cmdref.html
I don't quite understand the syntax. Does -Wn mean "wall time new"?
Many thanks for your help!
The first command fails because LSF calls the mandatory esub defined by your administrator to do some preprocessing on the command line, and this is returning an error. Here's the relevant quote from the page you linked:
Like bsub, bmod calls the master esub (mesub), which invokes any
mandatory esub executables configured by an LSF administrator, and any
executable named esub (without .application) if it exists in
LSF_SERVERDIR.
You're going to have to come up with a bmod command line that passes the esub checks. That might cause other problems, though, because some parameters (like -n, I believe) can't be changed at runtime by default, so bmod will reject the request if you specify them.
The -Wn option is used to remove the run limit from the job entirely rather than change it to a different value. In the syntax description, the square brackets and the | only mean "optionally one of these"; they are not meant to be typed.
I'm not sure how to title that more succinctly and still have it be meaningful.
(Note that this works fine when run mid-day, via cron or manually, so I "know" the script itself is sound.)
I have a cron job (Ubuntu 13.04).
It runs as my user (not root).
The job itself runs at 6:01 in the morning. It's the first 'business level' job of the day.
1 6 * * 1-5 /home/me/bin/run_perl_job
run_perl_job is just:
#!/bin/bash
cd /home/me/bin
./script.pl
The script copies a file to "/mnt/shared_drive/outputfile.xls"
The mount point is defined in fstab as:
//fileserver/share /mnt/shared_drive cifs user=domain/me%password,iocharset=utf8,gid=1000,uid=1000,sec=ntlm,file_mode=0777,dir_mode=0777 0 0
Now. Given that:
When I run the script in a normal shell, it works fine.
When I look at the mount point first thing in the morning (via a normal terminal) it shows up (and is writeable) without event.
When I copy the crontab line and set it to run in a couple minutes, to see the symptom, it works fine (creates the file quite happily.)
The ONLY time this fails is if it's running in its normal time slot (6:01). The rest of the script functions (the file itself has to be pulled down via sftp, etc.), so I know it's not dying.
It's driving me batty because the test cycle is 24 hours.
I just added the following couple lines to the beginning of the 'run_perl_job' script, hoping it exposes something tomorrow:
cd /mnt/shared_drive
ls -lrt >>/home/me/bin/process.log 2>&1
But I'm stumped. "It's almost as though" the mount point had gotten stale overnight and is waiting for some kind of access attempt before remounting. I'd run "mount -a" at the top of the 'run_perl_job' script if I could reasonably do it. But given that it's got to be sudo'ed, that doesn't seem reasonable to me.
Thoughts? I'm running out of ideas and this test cycle is awful.
How about putting a
umount -f -v /mnt/shared_drive
mount -v -a
into a root cron job that runs just before your script? That way you don't need to sudo in your script and have the password in plain sight. The -v output might give you a hint about what is making the mount go stale.
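As a rough sketch, the root-side job could be a small script like the one below. The function name remount_share, the script path and the log path in the comment are all made up for illustration; only the umount and mount commands come from the suggestion above.

```shell
#!/usr/bin/env bash
# Sketch of a remount helper meant to be called from root's crontab,
# a few minutes before the user job, e.g. (hypothetical paths):
#   55 5 * * 1-5 /usr/local/sbin/remount_share.sh >> /var/log/remount_share.log 2>&1
remount_share() {
  local dir=$1
  # -f forces the unmount of an unreachable share; -v prints what happens.
  umount -f -v "$dir" || echo "umount of $dir failed (maybe it was not mounted)" >&2
  # Remount everything listed in /etc/fstab, verbosely.
  mount -v -a || return 1
  # Succeed only if the directory is a mount point again.
  mountpoint -q "$dir"
}
```

The script file would end with a call such as remount_share /mnt/shared_drive. Keeping the verbose output in a log gives you something to read the next morning instead of waiting another 24 hours.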
When starting an instance on Amazon EC2, how would I detect a failure, for instance, if there's no machine available to fulfill my request? I'm using one of the less-common machine types and am concerned it won't start up, but am having trouble finding out what message to look for to detect this.
I'm using the EC2 commandline tools to do this. I know I can look for 'running' when I do ec2-describe-instance to see if the machine is up, but don't know what to look for to see if the startup failed.
Thanks!
The output from ec2-start-instances only gives you the previous and current state (stopped, pending), so as you say, you need to use ec2-describe-instances to retrieve the state afterwards.
For that, you have a couple of choices. You can use a loop to check instance-state-name, looking for a result of running or stopped. Alternatively, you could look at either the reason or the state-reason-code field. Unfortunately, you'll need to trigger the failure you're worried about in order to obtain the values that indicate failure.
The batch file I use to wait for a successful startup (fill in the underscores):
@echo off
set EC2_HOME=C:\tools\ec2-api-tools
set EC2_PRIVATE_KEY=C:\_\pk-_.pem
set EC2_CERT=C:\_\cert-_.pem
set JAVA_HOME=C:\Program Files (x86)\Java\jre6
%EC2_HOME%\bin\ec2-start-instances i-_
:docheck
%EC2_HOME%\bin\ec2-describe-instances | C:\tools\gnuwin32\bin\grep.exe -c stopped > %EC2_HOME%\temp.txt
findstr /m "1" %EC2_HOME%\temp.txt > nul
if %errorlevel%==0 (c:\tools\gnuwin32\bin\echo -n "."
goto docheck)
del %EC2_HOME%\temp.txt
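The same polling idea can be sketched in bash. Note that this swaps in the newer AWS CLI (aws ec2 describe-instances) instead of the ec2-api-tools used above, so it assumes that CLI is installed and configured. wait_for_instance is a hypothetical helper, and the retry count and interval are arbitrary defaults. It does not know which state-reason codes mean "no capacity"; it only prints the code so you can see what you get.

```shell
#!/usr/bin/env bash
# Sketch: poll an instance after starting it until it is running or has
# dropped back out of "pending".
# Usage: wait_for_instance INSTANCE_ID [TRIES] [INTERVAL_SECONDS]
wait_for_instance() {
  local id=$1 tries=${2:-60} interval=${3:-5} state i
  for ((i = 0; i < tries; i++)); do
    state=$(aws ec2 describe-instances --instance-ids "$id" \
      --query 'Reservations[0].Instances[0].State.Name' --output text) || return 2
    case "$state" in
      running)
        return 0 ;;
      stopping|stopped|shutting-down|terminated)
        # The start did not stick: print the state-reason code for inspection.
        aws ec2 describe-instances --instance-ids "$id" \
          --query 'Reservations[0].Instances[0].StateReason.Code' --output text >&2
        return 1 ;;
    esac
    sleep "$interval"
  done
  # Still pending after all tries.
  return 3
}
```

Called right after the start command, a return code of 0 means the instance came up, 1 means it fell back to a stopped or terminated state (with the reason code on stderr), and 3 means it was still pending when the helper gave up.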
ec2-start-instances will return the previous state (after the last command sent to the instance) and the current state (after your command). ec2-stop-instances does the same thing. THE PROBLEM IS what happens if you are scripting and you use -start- on a 'stopping' instance -OR- you use -stop- on a 'pending' instance. These will cause exceptions in the command line tool and NASTILY exit your scripts all the way to the original console (VERY BAD BEHAVIOR, AMAZON). So you have to go all the way through parsing the ec2-describe-instances [instance-id] result. HOWEVER, that still leaves you vulnerable to that tiny little bit of time between when you GET the status from your instance and when you APPLY A COMMAND. If someone else, or Amazon, puts you into pending or stopping, and you then do 'stop' or 'start' respectively, your script will break. I really don't know how to catch such an exception in a script. Bad Amazon AWS, BAD DOG!