Extract details for past jobs in SLURM - hpc

In PBS, one can query a specific job with qstat -f and obtain (all?) info and details to reproduce the job:
# qstat -f 1234
Job Id: 1234.login
Job_Name = job_name_here
Job_Owner = user@pbsmaster
...
Resource_List.select = 1:ncpus=24:mpiprocs=24
Resource_List.walltime = 23:59:59
...
Variable_List = PBS_O_HOME=/home/user,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=user,...
etime = Mon Apr 20 16:38:27 2020
Submit_arguments = run_script_here --with-these flags
How may I extract the same information from SLURM?
scontrol show job <jobid> only works for currently running jobs or those terminated up to 5 minutes ago.
Edit: I'm currently using the following to obtain some information, but it's not as complete as a qstat -f:
sacct -u $USER \
-S 2020-05-13 \
-E 2020-05-15 \
--format "Account,JobID%15,JobName%20,State,ExitCode,Submit,CPUTime,MaxRSS,ReqMem,MaxVMSize,AllocCPUs,ReqTres%25"
…usually piped into | (head -n 2; grep -v COMPLETED) | sort -k12 to inspect only failed runs.
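A variant that lets sacct filter by state directly, instead of grepping out COMPLETED (this assumes a Slurm version whose sacct supports -s/--state; the state names listed are examples to adjust for your site):
sacct -u $USER \
-S 2020-05-13 \
-E 2020-05-15 \
-s FAILED,TIMEOUT,OUT_OF_MEMORY \
--format "Account,JobID%15,JobName%20,State,ExitCode,Submit,CPUTime,MaxRSS,ReqMem,MaxVMSize,AllocCPUs,ReqTres%25"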

You can get a list of all jobs that started since a certain date like so:
sacct --starttime 2020-01-01
Then pick the job you are interested in (e.g. job 1234) and print its details with sacct:
sacct -j 1234 --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist
See the sacct manual page under --helpformat for a complete list of available fields.
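Two more sacct options may get you closer to a qstat -f-style dump (a sketch: --long and --helpformat are long-standing options, but the SubmitLine field that records the submission command only exists in newer Slurm releases, so treat that one as an assumption for your site):
sacct --helpformat                    # list every field sacct can report
sacct -j 1234 --long                  # verbose predefined set of fields for job 1234
sacct -j 1234 --format=SubmitLine%60  # submission command, if your Slurm records it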

Related

Is there a way to get information about kubernetes cronjob timetable?

I'd like to get information about k8s CronJob times. There are many Jobs in my k8s cluster, so it's hard to see which times they are concentrated on. I want to distribute my jobs evenly. Is there a way to count CronJob run times or sort them by time?
I have tried to find a suitable tool that can help with your case. Unfortunately, I did not find anything that was both suitable and easy to use.
It is possible to use Prometheus + Grafana to monitor CronJobs, e.g. using this Kubernetes Cron and Batch Job monitoring dashboard. However, I don't think you will find the information you need this way, just a dashboard that displays the number of CronJobs in the cluster.
For this reason, I decided to write a Bash script that displays the last few CronJob runs in a readable manner.
As described in the Kubernetes CronJob documentation:
A CronJob creates Jobs on a repeating schedule.
To find out how long a specific Job was running, we can check its startTime and completionTime e.g. using the commands below:
# kubectl get job <JOB_NAME> --template '{{.status.startTime}}' # "startTime"
# kubectl get job <JOB_NAME> --template '{{.status.completionTime}}' # "completionTime"
To get the duration of Jobs in seconds, we can convert startTime and completionTime dates to epoch:
# date -d "<SOME_DATE>" +%s
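For example, with a pair of illustrative timestamps in the format Kubernetes uses (the values below are made up, not from a real Job):
start=$(date -d "2021-03-18T14:23:00Z" +%s)  # 1616077380
end=$(date -d "2021-03-18T14:23:12Z" +%s)    # 1616077392
echo $(( end - start ))                      # 12 seconds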
And this is the entire Bash script:
NOTE: We need to pass the namespace name as an argument.
#!/bin/bash
# script name: cronjobs_timetable.sh <NAMESPACE>
namespace=$1
for cronjob_name in $(kubectl get cronjobs -n $namespace --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}'); do
  echo "===== CRONJOB_NAME: ${cronjob_name} ==========="
  printf "%-15s %-15s %-15s %-15s\n" "START_TIME" "COMPLETION_TIME" "DURATION" "JOB_NAME"
  for job_name in $(kubectl get jobs -n $namespace --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep -w "${cronjob_name}-[0-9]*$"); do
    startTime="$(kubectl get job ${job_name} -n $namespace --template '{{.status.startTime}}')"
    completionTime="$(kubectl get job ${job_name} -n $namespace --template '{{.status.completionTime}}')"
    # skip Jobs that have not completed yet
    if [[ "$completionTime" == "<no value>" ]]; then
      continue
    fi
    duration=$(( $(date -d "$completionTime" +%s) - $(date -d "$startTime" +%s) ))
    printf "%-15s %-15s %-15s %-15s\n" "$(date -d ${startTime} +%X)" "$(date -d ${completionTime} +%X)" "${duration} s" "$job_name"
  done
done
By default, this script only displays the last three Jobs per CronJob, because that is how many successful Jobs a CronJob retains by default; the retention may be modified in the CronJob configuration using the .spec.successfulJobsHistoryLimit and .spec.failedJobsHistoryLimit fields (for more information, see Kubernetes Jobs History Limits).
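If you want more history to be visible to the script, one way to raise those limits on an existing CronJob is kubectl patch (a sketch using the hello CronJob from the demo below; the limit values are arbitrary):
# keep more finished Jobs around so the script can report on them:
kubectl patch cronjob hello -n default --type merge \
  -p '{"spec":{"successfulJobsHistoryLimit":10,"failedJobsHistoryLimit":5}}'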
We can check how it works:
$ ./cronjobs_timetable.sh default
===== CRONJOB_NAME: hello ===========
START_TIME COMPLETION_TIME DURATION JOB_NAME
02:23:00 PM 02:23:12 PM 12 s hello-1616077380
02:24:02 PM 02:24:13 PM 11 s hello-1616077440
02:25:03 PM 02:25:15 PM 12 s hello-1616077500
===== CRONJOB_NAME: hello-2 ===========
START_TIME COMPLETION_TIME DURATION JOB_NAME
02:23:01 PM 02:23:23 PM 22 s hello-2-1616077380
02:24:02 PM 02:24:24 PM 22 s hello-2-1616077440
02:25:03 PM 02:25:25 PM 22 s hello-2-1616077500
===== CRONJOB_NAME: hello-3 ===========
START_TIME COMPLETION_TIME DURATION JOB_NAME
02:23:01 PM 02:23:32 PM 31 s hello-3-1616077380
02:24:02 PM 02:24:34 PM 32 s hello-3-1616077440
02:25:03 PM 02:25:35 PM 32 s hello-3-1616077500
===== CRONJOB_NAME: hello-4 ===========
START_TIME COMPLETION_TIME DURATION JOB_NAME
02:23:01 PM 02:23:44 PM 43 s hello-4-1616077380
02:24:02 PM 02:24:44 PM 42 s hello-4-1616077440
02:25:03 PM 02:25:45 PM 42 s hello-4-1616077500
Additionally, you'll likely want to add argument checks and error handling to make this script work as expected in all cases; a minimal guard is sketched below.
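As a minimal example of such a guard, something like this could go right after the namespace=$1 line:
# fail early when no namespace argument is given:
if [[ -z "$namespace" ]]; then
  echo "usage: $0 <NAMESPACE>" >&2
  exit 1
fi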

How can I login to any pod within any namespace in kubernetes and run any command? [closed]

My requirement is to log into a pod without first selecting the namespace and the pod name, and then provide a command to be run inside it. There was no thread or question that answered this, so below is a solution that I came up with. If there can be improvements to it, please provide them.
What I was looking for was a way to build automation, or a krew plugin, for logging into any pod in any namespace and running any command inside it. It is purely automation based; there are plenty of plugins that do similar work, like k9s and other krew plugins, but my requirement was more generic and lightweight, something that could be used in a pipeline without any third-party tools.
Create a Shell Script
login.sh
#!/bin/bash
# $1 is the namespace, $2 is the row number where the pod is located,
# $3 is the command to run inside the pod: bash, sh, or something like echo "test"
# Example: ./login.sh test-ns 1 bash   or   ./login.sh test-ns 1 'echo "test"'
pod=$(kubectl get pods -n $1 | grep -v NAME | awk -v i=1 -v j=$2 'FNR == j {print $i}')
kubectl exec -it $pod -n $1 -- $3
How to use it?
login.sh $namespace $pod_number $command
1. Create the above script, make it executable and copy it into a directory that is on your $PATH, such as /usr/local/bin
chmod +x login.sh
#Check if /usr/local/bin exists in PATH variable, if not then run the export command
echo $PATH
export PATH=$PATH:/usr/local/bin
cp -p login.sh /usr/local/bin/login.sh
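Optionally, a quick sanity check that the shell can now find the script:
command -v login.sh   # should print /usr/local/bin/login.sh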
2. Run kubectl get pods to get the number of the pod in which you want the command to run
helloworld-v1-5dfcf5d5cd-v7xtw 1/1 Running 0 8d
httpbin-56db79f4f5-j2cxz 1/1 Running 1 8d
test-cc5b6bfd-2hhmr 1/1 Running 0 39h
And say I want to exec into the second pod with bash as the shell.
3. Run the command to use the script
login.sh default 2 bash
Demo Output
The output would look something like:
shubham.yadav@my-MAC:~/k8s $ ./login.sh default 2 bash
root@httpbin-56db79f4f5-j2cxz:/#
Another example, just to make the implementation clearer:
shubham.yadav@my-MAC:~/k8s $ login.sh qa-test 2 'echo "From the pod Login Succeeded"'
"From the pod Login Succeeded"
Note: In case you have a large number of pods in the namespace, you can use awk's NR variable to print the row number next to each pod.
(kp is an alias for kubectl get pods)
kp | grep -v NAME | awk '{print NR, $1}'
1 details-v1-78d78fbddf-zksf7
2 helloworld-v1-5dfcf5d5cd-wl2nx
3 httpbin-56db79f4f5-wj4ql
4 load-generator-5cdbd66865-fxpbm
5 productpage-v1-85b9bf9cd7-qp79l
6 ratings-v1-6c9dbf6b45-xm6ww
7 reviews-v1-564b97f875-cx86c
8 reviews-v2-568c7c9d8f-xxc86
9 reviews-v3-67b4988599-p98ft
10 test-cc5b6bfd-68tqg
11 test-cc5b6bfd-8q694
12 unset-deployment-7896c75bf6-5w27b
13 web-v1-fc4d58bdc-pcv9p
14 web-v2-7bf5dd654d-684t9
15 web-v3-7567d5d6b9-sqrpg
Creating a Binary from the Shell Script
In case you are interested in creating a binary from the above script, there are guides that let you compile a shell script into a standalone binary.
You can also rename the binary from login.sh to something friendlier, such as execute:
shubham.yadav@my-MAC:/usr/local/bin $ mv login.sh execute
shubham.yadav@my-MAC:/usr/local/bin $ execute qa-test 2 'echo "Binary Name Changed"'
"Binary Name Changed"

Kubernetes - delete all jobs in bulk

I can delete all jobs inside a cluster by running:
kubectl delete jobs --all
However, jobs are deleted one after another, which is pretty slow (for ~200 jobs I had the time to write this question and it was not even done).
Is there a faster approach?
It's a little easier to set up an alias for this bash command:
kubectl delete jobs `kubectl get jobs -o custom-columns=:.metadata.name`
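For instance, as an actual alias definition (the alias name kdelalljobs is made up; pick your own):
# e.g. in ~/.bashrc:
alias kdelalljobs='kubectl delete jobs `kubectl get jobs -o custom-columns=:.metadata.name`'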
I have a script for deleting jobs which is quite a bit faster:
$ cat deljobs.sh
#!/bin/bash
set -x
for j in $(kubectl get jobs -o custom-columns=:.metadata.name)
do
  kubectl delete jobs $j &
done
And for creating 200 jobs, I used the following script with the command for i in {1..200}; do ./jobs.sh; done
$ cat jobs.sh
kubectl run memhog-$(cat /dev/urandom | tr -dc 'a-z0-9' | fold -w 8 | head -n 1) --restart=OnFailure --record --image=derekwaynecarr/memhog --command -- memhog -r100 20m
If you are using CronJobs and the Jobs are piling up quickly, you can let Kubernetes delete them automatically by configuring the job history limits described in the documentation. This is valid starting from version 1.6.
...
spec:
  ...
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
This works really well for me:
kubectl delete jobs $(kubectl get jobs -o custom-columns=:.metadata.name)
There is an easier way to do it:
To delete successful jobs:
kubectl delete jobs --field-selector status.successful=1
To delete failed or long-running jobs:
kubectl delete jobs --field-selector status.successful=0
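If your kubectl is recent enough for delete to accept --all-namespaces (worth verifying on older clusters), the same selector works cluster-wide:
kubectl delete jobs --all-namespaces --field-selector status.successful=1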
I use this script, it's fast but it can trash CPU (a process per job), you can always adjust the sleep parameter:
#!/usr/bin/env bash
echo "Deleting all jobs (in parallel - it can trash CPU)"
kubectl get jobs --all-namespaces | sed '1d' | awk '{ print $2, "--namespace", $1 }' | while read line; do
  echo "Running with: ${line}"
  kubectl delete jobs ${line} &
  sleep 0.05
done
The best way for me is (for completed jobs older than a day):
kubectl get jobs | grep 1/1 | gawk 'match($0, / ([0-9]*)h/, ary) { if(ary[1]>24) print $1}' | parallel -r --bar -P 32 kubectl delete jobs
grep 1/1 for completed jobs
gawk 'match($0, / ([0-9]*)h/, ary) { if(ary[1]>24) print $1}' for jobs older than a day
-P number of parallel processes
It is faster than kubectl delete jobs --all, has a progress bar and you can use it when some jobs are still running.
kubectl delete jobs --all --cascade=false is fast, but won't delete associated resources, such as Pods
https://github.com/kubernetes/kubernetes/issues/8598
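If you do go the --cascade=false route, the orphaned Pods can still be cleaned up afterwards; a sketch relying on the job-name label that the Job controller has historically put on its Pods (newer clusters also use batch.kubernetes.io/job-name):
# delete Pods that carry a job-name label, i.e. Pods created by Jobs:
kubectl delete pods --all-namespaces -l job-name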
Parallelize using GNU parallel
parallel --jobs=5 "echo {}; kubectl delete jobs {} -n core-services;" ::: $(kubectl get job -o=jsonpath='{.items[?(@.status.succeeded==1)].metadata.name}' -n core-services)
kubectl get jobs -o custom-columns=:.metadata.name | grep specific* | xargs kubectl delete jobs
kubectl get jobs -o custom-columns=:.metadata.name gives you the list of job names | then you can grep the specific ones you need with a regexp | then xargs uses the output to delete them one by one from the list.
Probably there's no other way to delete all jobs at once, because even kubectl delete jobs queries one job at a time; what Norbert van Nobelen is suggesting might get a faster result, but it won't make much difference.
The Kubectl bulk (bulk-action on krew) plugin may be useful for you; it gives you bulk operations on selected resources.
This is the command for deleting jobs:
kubectl bulk jobs delete
You can find the details at
https://github.com/emreodabas/kubectl-plugins/blob/master/README.md#kubectl-bulk-aka-bulk-action

Rundeck REST endpoint for fetching project details like jobs and node

I am using Rundeck version 1.6 and just want to check whether there is a REST endpoint, in 1.6 or any later version, that solves my requirement below:
If I pass a project name, it gives me all the jobs created under that project, with the node names on which they are configured to run.
Thanks
-Sam
There is. The following code uses token authentication; you can also use Password Authentication.
#!/bin/bash
RUNDECK_URL='localhost:4440' #<-----change it to your rundeck url
API_TOKEN='OyFXX1q4UzhTUe7deOUIPJKkrUnEwZlo' #<-----change it to your api token
PROJECTS=`curl -H "Accept: application/json" http://$RUNDECK_URL/api/1/projects?authtoken=$API_TOKEN | tr "}" "\n" | tr "," "\n" | grep name | cut -d":" -f2 | tr -d "\""`
for proj in $PROJECTS; do
  # get all Jobs in the project:
  echo "Project: $proj"
  PROJECT_OUTPUT=`curl -sS "http://$RUNDECK_URL/api/1/jobs?authtoken=$API_TOKEN&project=${proj}"`
  # get each job definition and parse out its name and node filter
  JOB_IDS=`echo $PROJECT_OUTPUT | grep -oP "(?<=<job id=')[^']+"`
  for id in $JOB_IDS; do
    echo $id # job id
    JOB_OUTPUT=`curl -sS "http://$RUNDECK_URL/api/1/job/$id?authtoken=$API_TOKEN"`
    echo $JOB_OUTPUT | grep -oP "(?<=<name>)[^<]+" # job name
    echo $JOB_OUTPUT | grep -oP "(?<=<filter>)[^<]+" # job node filter
  done
done
Output:
$ sh rundeck_test.sh
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 98 0 98 0 0 6292 0 --:--:-- --:--:-- --:--:-- 6533
Project: TestProject
02a41aaa-eb50-4831-8762-80b798468cbe <-------- job id
TestJob <-------- job name; this job doesn't have a node filter - it runs on the rundeck server
9b2ac9e9-0350-4494-a463-b43ba1e458ab
TestJob2
node1.exmple.com <-------- node filter value
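As a side note, if jq is available, the hand-rolled JSON parsing above can be made less fragile. A sketch for the projects call only, assuming the endpoint returns the JSON array of project objects that newer Rundeck API versions document:
# same projects call as above, but parsed with jq instead of tr/grep/cut:
PROJECTS=$(curl -sS -H "Accept: application/json" \
  "http://$RUNDECK_URL/api/1/projects?authtoken=$API_TOKEN" | jq -r '.[].name')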

upstart script. shell arithmetic in script stanza producing incorrect values. equivalent /bin/sh script works

I have an upstart init script, but my dev/testing/production machines have different numbers of cpus/cores. I'd like to compute the number of worker processes as 4 * the number of cores within the init script.
The upstart docs say that the script stanzas use /bin/sh syntax.
I created a /bin/sh script to see what was going on. I'm getting drastically different results than from my upstart script.
script stanza from my upstart script:
script
    # get the number of cores
    CORES=`lscpu | grep -v '#' | wc -l`
    # set the number of worker processes to 4 * num cores
    WORKERS=$(($CORES * 4))
    echo exec gunicorn -b localhost:8000 --workers $WORKERS tutalk_site.wsgi > tmp/gunicorn.txt
end script
which outputs:
exec gunicorn -b localhost:8000 --workers 76 tutalk_site.wsgi
my equivalent /bin/sh script
#!/bin/sh
CORES=`lscpu -p | grep -v '#' | wc -l`
WORKERS=$(($CORES * 4))
echo exec gunicorn -b localhost:8000 --workers $WORKERS tutalk_site.wsgi
which outputs:
exec gunicorn -b localhost:8000 --workers 8 tutalk_site.wsgi
I'm hoping this is a rather simple problem and a few other pairs of eyes will locate the issue.
Any help would be appreciated.
I suppose I should have answered this several days ago. I first attempted using environment variables instead but didn't have any luck.
I solved the issue by replacing the computation with a Python one-liner:
WORKERS=$(python -c "import os; print os.sysconf('SC_NPROCESSORS_ONLN') * 2")
and that worked out just fine.
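A shell-only equivalent, assuming nproc (coreutils) or getconf is available on the box, would have avoided spawning Python (shown here with the 4x multiplier from the original goal):
WORKERS=$(( $(nproc) * 4 ))
# or, without nproc:
WORKERS=$(( $(getconf _NPROCESSORS_ONLN) * 4 ))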
I'm still curious why my Bourne-shell script came up with the correct value while the upstart script, whose docs say it uses Bourne-shell syntax, didn't. (One visible difference between the two listings above: the upstart stanza calls lscpu without the -p flag, so it counts the lines of lscpu's human-readable summary rather than one line per CPU, which by itself would explain 76 versus 8.)