TORQUE jobs hang on Ubuntu 16.04

I have TORQUE installed on Ubuntu 16.04, and I am having trouble because my jobs hang. I have a test script test.pbs:
#PBS -N test
#PBS -l nodes=1:ppn=1
#PBS -l walltime=0:01:00
cd $PBS_O_WORKDIR
touch done.txt
echo "done"
And I run it with
qsub test.pbs
The job writes done.txt and echoes "done" just fine, but then it hangs in the C state.
Job id Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
46.localhost test wlandau 00:00:00 C batch
Edit: some diagnostic info on another job from qstat -f 55
qstat -f 55
Job Id: 55.localhost
Job_Name = test
Job_Owner = wlandau@localhost
resources_used.cput = 00:00:00
resources_used.mem = 0kb
resources_used.vmem = 0kb
resources_used.walltime = 00:00:00
job_state = C
queue = batch
server = haggunenon
Checkpoint = u
ctime = Mon Oct 30 07:35:00 2017
Error_Path = localhost:/home/wlandau/Desktop/test.e55
exec_host = localhost/2
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Mon Oct 30 07:35:00 2017
Output_Path = localhost:/home/wlandau/Desktop/test.o55
Priority = 0
qtime = Mon Oct 30 07:35:00 2017
Rerunable = True
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=1
Resource_List.walltime = 00:01:00
session_id = 5115
Variable_List = PBS_O_QUEUE=batch,PBS_O_HOST=localhost,
PBS_O_HOME=/home/wlandau,PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=wlandau,
PBS_O_PATH=/home/wlandau/bin:/home/wlandau/.local/bin:/usr/local/sbin
:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/ga
mes:/snap/bin,PBS_O_SHELL=/bin/bash,PBS_SERVER=localhost,
PBS_O_WORKDIR=/home/wlandau/Desktop
comment = Job started on Mon Oct 30 at 07:35
etime = Mon Oct 30 07:35:00 2017
exit_status = 0
submit_args = test.pbs
start_time = Mon Oct 30 07:35:00 2017
Walltime.Remaining = 60
start_count = 1
fault_tolerant = False
comp_time = Mon Oct 30 07:35:00 2017
And a similar tracejob -n2 62:
/var/spool/torque/server_priv/accounting/20171029: No matching job records located
/var/spool/torque/server_logs/20171029: No matching job records located
/var/spool/torque/mom_logs/20171029: No matching job records located
/var/spool/torque/sched_logs/20171029: No matching job records located
Job: 62.localhost
10/30/2017 17:20:25 S enqueuing into batch, state 1 hop 1
10/30/2017 17:20:25 S Job Queued at request of wlandau@localhost, owner =
wlandau@localhost, job name = jobe945093c2e029c5de5619d6bf7922071,
queue = batch
10/30/2017 17:20:25 S Job Modified at request of Scheduler@Haggunenon
10/30/2017 17:20:25 S Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb
resources_used.vmem=0kb resources_used.walltime=00:00:00
10/30/2017 17:20:25 L Job Run
10/30/2017 17:20:25 S Job Run at request of Scheduler@Haggunenon
10/30/2017 17:20:25 S Not sending email: User does not want mail of this type.
10/30/2017 17:20:25 S Not sending email: User does not want mail of this type.
10/30/2017 17:20:25 M job was terminated
10/30/2017 17:20:25 M obit sent to server
10/30/2017 17:20:25 A queue=batch
10/30/2017 17:20:25 M scan_for_terminated: job 62.localhost task 1 terminated, sid=17917
10/30/2017 17:20:25 A user=wlandau group=wlandau
jobname=jobe945093c2e029c5de5619d6bf7922071 queue=batch
ctime=1509398425 qtime=1509398425 etime=1509398425 start=1509398425
owner=wlandau@localhost exec_host=localhost/0 Resource_List.ncpus=1
Resource_List.neednodes=1 Resource_List.nodect=1
Resource_List.nodes=1 Resource_List.walltime=01:00:00
10/30/2017 17:20:25 A user=wlandau group=wlandau
jobname=jobe945093c2e029c5de5619d6bf7922071 queue=batch
ctime=1509398425 qtime=1509398425 etime=1509398425 start=1509398425
owner=wlandau@localhost exec_host=localhost/0 Resource_List.ncpus=1
Resource_List.neednodes=1 Resource_List.nodect=1
Resource_List.nodes=1 Resource_List.walltime=01:00:00 session=17917
end=1509398425 Exit_status=0 resources_used.cput=00:00:00
resources_used.mem=0kb resources_used.vmem=0kb
resources_used.walltime=00:00:00
EDIT: jobs now hanging in the E state
After some tinkering, I am now using the settings below. I have moved on to a tiny pipeline workflow in which some TORQUE jobs wait for other TORQUE jobs to finish. Unfortunately, all the jobs hang in the E state, and any jobs beyond the first 4 just stay queued. To keep things from hanging indefinitely, I have to sudo qdel -p each one, which I think is causing real problems on the project's file system, in addition to being an inconvenience.
Job id Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
113.localhost ...b73ec2cda6dca wlandau 00:00:00 E batch
114.localhost ...b6c8e6da05983 wlandau 00:00:00 E batch
115.localhost ...9123b8e20850b wlandau 00:00:00 E batch
116.localhost ...e6d49a3d7d822 wlandau 00:00:00 E batch
117.localhost ...8c3f6cb68927b wlandau 0 Q batch
118.localhost ...40b1d0cab6400 wlandau 0 Q batch
qmgr -c "list server" shows
Server haggunenon
server_state = Active
scheduling = True
max_running = 300
total_jobs = 5
state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:1 Exiting:3
acl_hosts = localhost
managers = root@localhost
operators = root@localhost
default_queue = batch
log_events = 511
mail_from = adm
query_other_jobs = True
resources_assigned.ncpus = 4
resources_assigned.nodect = 4
scheduler_iteration = 600
node_check_rate = 150
tcp_timeout = 6
mom_job_sync = True
pbs_version = 2.4.16
keep_completed = 0
submit_hosts = SERVER
allow_node_submit = True
next_job_number = 119
net_counter = 118 94 93
And qmgr -c "list queue batch"
Queue batch
queue_type = Execution
total_jobs = 5
state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:4
max_running = 300
resources_max.ncpus = 4
resources_max.nodes = 2
resources_min.ncpus = 1
resources_default.ncpus = 1
resources_default.nodect = 1
resources_default.nodes = 1
resources_default.walltime = 01:00:00
mtime = Wed Nov 1 07:40:45 2017
resources_assigned.ncpus = 4
resources_assigned.nodect = 4
keep_completed = 0
enabled = True
started = True

The C state means the job has completed and its status is being kept in the system. Usually the status is kept after job completion for the period of time specified by the keep_completed parameter. However, certain types of failure may result in the job being kept in this state to provide the information needed to examine the cause of the failure.
Check the output of qstat -f 46 to see if there is anything indicating an error.
To check the current value of keep_completed on your system, run
qmgr -c "print queue batch keep_completed"
If you have administrative privileges on the Torque server, you can also change this value, for example to keep jobs in the completed state for 2 minutes after completion:
qmgr -c "set queue batch keep_completed=120"
In general, having keep_completed set is useful: advanced schedulers use the information on completed jobs to schedule around failures.
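As a quick check on the completed job from the question (job id 46 and the grep pattern below are only illustrative), the fields most likely to explain a stuck or failed job can be pulled straight out of qstat:
qstat -f 46 | grep -E 'job_state|exit_status|comment'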

Related

Celery: How to schedule worker processes/children restart

I am trying to figure out how to set up my celery workers to restart after living for a day.
I have already configured my worker children/processes to restart after executing a task.
But in some cases there are no tasks to execute for 3-4 days, so I need a way to restart the long-living children anyway.
Do you know how to do this?
This is my current Celery app setup:
app = Celery(
    "celery",
    broker=f"amqp://bla@blabla/blablabla",
    backend="rpc://",
)
app.conf.task_serializer = "pickle"
app.conf.result_serializer = "pickle"
app.conf.accept_content = ["pickle", "application/json"]
app.conf.broker_connection_max_retries = 5
app.conf.broker_pool_limit = 1
app.conf.worker_max_tasks_per_child = 1  # Ensure 1 task is executed before restarting child
app.conf.worker_max_living_time_before_restart = 60 * 60 * 24  # The conf I want
Thank you :)

PBS jobs queue then immediately exit sometimes

I've been using a PBS-managed computing cluster at my school for a few years now. A few months ago I ran into this problem, and the administrators were never able to figure it out. When I submit jobs, they queue and then some run immediately. I believe the jobs that should stay queued for lack of resources instead die almost immediately. This happens intermittently, depending on how many nodes I can use at one time. Sometimes I submit, say, 10 jobs: the first two will run, the next three will fail, then the next five will run.
I do not get either stdout or stderr files for the failed jobs; the jobs that do run create these files. I get an email when the failed jobs die, which I've attached here with some identifying information removed. Exit status -9 means "Could not create/open stdout stderr files", but I don't know how to fix that problem since it's so intermittent.
PBS Job Id: 11335.pearl.hpcc.XXX.edu
Job Name: mc1055
Exec host: m09/5
Aborted by PBS Server
Job cannot be executed
See Administrator for help
Exit_status=-9
resources_used.cput=00:00:00
resources_used.vmem=0kb
resources_used.walltime=00:00:02
resources_used.mem=0kb
resources_used.energy_used=0
req_information.task_count.0=1
req_information.lprocs.0=1
req_information.thread_usage_policy.0=allowthreads
req_information.hostlist.0=m09:ppn=1
req_information.task_usage.0.task.0={"task":{"cpu_list":"9","mem_list":"0","cores":0,"threads":1,"host":"m09"}}
Error_Path: CLUSTERNAME.hpcc.XXX.edu:/PATH/TOSCRIPT/run/mc1055.e11335
Output_Path: CLUSTERNAME.hpcc.XXX.edu:/PATH/TOSCRIPT/run/mc1055.o11335
I also looked at qstat -f right when a job failed; its output is below. If I don't catch it right away, the job disappears from qstat.
Job Id: 11339.pearl.hpcc.XXX.edu
Job_Name = mc1059
Job_Owner = USERNAME@CLUSTERNAME.hpcc.XXX.edu
resources_used.cput = 00:00:00
resources_used.vmem = 0kb
resources_used.walltime = 00:00:00
resources_used.mem = 0kb
resources_used.energy_used = 0
job_state = C
queue = default
server = CLUSTERNAME.hpcc.XXX.edu
Account_Name = ADVISOR
Checkpoint = u
ctime = Mon Jan 4 20:02:25 2021
Error_Path = CLUSTERNAME.hpcc.XXX.edu/PATH/TOSCRIPT/mc1059.e11339
exec_host = m09/9
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Mon Jan 4 20:03:14 2021
Output_Path = CLUSTERNAME.hpcc.XXX.edu/PATH/TOSCRIPT/mc1059.o11339
Priority = 0
qtime = Mon Jan 4 20:02:25 2021
Rerunable = True
Resource_List.nodes = 1:ppn=1
Resource_List.walltime = 50:00:00
Resource_List.var = mkuuid:1e94a3e50dd44803bab2d3a7c2286ee2
Resource_List.nodect = 1
session_id = 0
Variable_List = PBS_O_QUEUE=largeq,
PBS_O_HOME=/PATH,PBS_O_LOGNAME=USERNAME,
PBS_O_PATH=lots of things
PBS_O_MAIL=/var/spool/mail/USERNAME,PBS_O_SHELL=/bin/bash,
PBS_O_LANG=en_US,KRB5CCNAME=FILE:/tmp/krb5cc_404112_hd4Yty,
PBS_O_WORKDIR=/PATH/TOSCRIPT/run,
PBS_O_HOST=CLUSTERNAME.hpcc.XXX.edu,
PBS_O_SERVER=CLUSTERNAME.hpcc.XXX.edu
euser = USERNAME
egroup = physics
queue_type = E
etime = Mon Jan 4 20:02:25 2021
exit_status = -9
submit_args = -l var=mkuuid:1e94a3e50dd44803bab2d3a7c2286ee2 -v KRB5CCNAME
/PATH/TOSCRIPT/run/tmp/montec_1059
start_time = Mon Jan 4 20:03:14 2021
start_count = 1
fault_tolerant = False
comp_time = Mon Jan 4 20:03:14 2021
job_radix = 0
total_runtime = 7.218811
submit_host = CLUSTERNAME.hpcc.XXX.edu
init_work_dir = /PATH/TOSCRIPT/run
request_version = 1
req_information.task_count.0 = 1
req_information.lprocs.0 = 1
req_information.thread_usage_policy.0 = allowthreads
req_information.hostlist.0 = m09:ppn=1
req_information.task_usage.0.task.0.cpu_list = 5
req_information.task_usage.0.task.0.mem_list = 1
req_information.task_usage.0.task.0.cores = 0
req_information.task_usage.0.task.0.threads = 1
req_information.task_usage.0.task.0.host = m09

Audit daemon does not take rules from audit.rules

I am unable to add rules to the audit daemon using /etc/audit/audit.rules.
Every time I add rules using auditctl, they get removed on reboot or when the audit daemon restarts. I have attached /etc/audit/audit.rules and /etc/audit/auditd.conf below.
$ cat /etc/audit/auditd.conf
#
# This file controls the configuration of the audit daemon
#
local_events = yes
write_logs = yes
log_file = /NU_Application/audit.log
log_group = root
log_format = RAW
flush = INCREMENTAL_ASYNC
freq = 50
max_log_file = 8
num_logs = 5
priority_boost = 4
disp_qos = lossy
dispatcher = /sbin/audispd
name_format = NONE
##name = mydomain
max_log_file_action = ROTATE
space_left = 75
space_left_action = SYSLOG
verify_email = yes
action_mail_acct = root
admin_space_left = 50
admin_space_left_action = SUSPEND
disk_full_action = SUSPEND
disk_error_action = SUSPEND
use_libwrap = yes
##tcp_listen_port = 22
tcp_listen_queue = 5
tcp_max_per_addr = 1
##tcp_client_ports = 1024-65535
tcp_client_max_idle = 0
enable_krb5 = no
krb5_principal = auditd
##krb5_key_file = /etc/audit/audit.key
distribute_network = no
$ cat /etc/audit/audit.rules
## First rule - delete all
## Increase the buffers to survive stress events.
## Make this bigger for busy systems
-b 8192
## This determine how long to wait in burst of events
--backlog_wait_time 0
## Set failure mode to syslog
-f 1
-w /var/log/lastlog -p wa
root@iWave-G22M:~# auditctl
When I restart the audit daemon (i.e. /etc/init.d/auditd restart) and try to list the rules, I get the message "No rules".
$ /etc/init.d/auditd restart
Restarting audit daemon auditd
type=1305 audit(1558188111.980:3): audit_pid=0 old=1148 auid=4294967295 ses=4294967295
res=1
type=1305 audit(1558188112.010:4): audit_enabled=1 old=1 auid=4294967295 ses=4294967295
res=1
type=1305 audit(1558188112.020:5): audit_pid=30342 old=0 auid=4294967295 ses=4294967295
res=1
$ auditctl -l
No rules
OS INFO
$ uname -a
Linux iWave-G22M 3.10.31-ltsi-svn743 #5 SMP PREEMPT Mon May 27 18:28:01 IST 2019 armv7l GNU/Linux
audit_2.8.4.bb file was used to install auditd daemon via yocto
path of audit_2.8.4.bb -- http://git.yoctoproject.org/cgit/cgit.cgi/meta-selinux/tree/recipes-security/audit/audit_2.8.4.bb?h=master
Audit rules added via /etc/audit/audit.rules or the auditctl command are not permanent. To make them permanent across reboots, you have to add them to the /etc/audit/rules.d/audit.rules file.
After adding the rule, restart the auditd service and run auditctl -l; it will list all the rules, and they will also be reflected in the /etc/audit/audit.rules file.
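For example, a minimal sketch of that workflow (run as root; the watch rule is the one from the question):
echo '-w /var/log/lastlog -p wa' >> /etc/audit/rules.d/audit.rules
/etc/init.d/auditd restart
auditctl -l    # the lastlog watch rule should now be listed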

Mojolicious: long subroutine have no effect on my variable

I am having some trouble with a very long subroutine under hypnotoad.
I need to run a roughly one-minute subroutine (a hardware-related requirement).
First, I discovered this behavior with:
my $var = 0;
Mojo::IOLoop->recurring(60 => sub {
    $log->debug("starting loop: var: $var");
    if ($var == 0) {
        # ...
        # make some long (30 to 50 seconds) communication with external hardware
        sleep 50;    # fake reproduction of this long communication
        $var = 1;
    }
    $log->debug("ending loop: var: $var");
});
Log:
14:13:45 2018 [debug] starting loop: var: 0
14:14:26 2018 [debug] ending loop: var: 1 #duration: 41 seconds
14:14:26 2018 [debug] starting loop: var: 0
14:15:08 2018 [debug] ending loop: var: 1 #duration: 42 seconds
14:15:08 2018 [debug] starting loop: var: 0
14:15:49 2018 [debug] ending loop: var: 1 #duration: 42 seconds
14:15:50 2018 [debug] starting loop: var: 0
14:16:31 2018 [debug] ending loop: var: 1 #duration: 41 seconds
...
Three problems:
1) Where do these 42 seconds come from? (Yes, I know, 42 is the answer to the universe...)
2) Why does the recurring IOLoop lose its pace?
3) Why is my variable set to 1, yet just one second later the if sees a variable equal to 0?
When the looped job needs 20 or 25 seconds, there is no problem.
When the looped job needs 60 seconds and runs under morbo, there is no problem.
When the looped job needs more than 40 seconds and runs under hypnotoad (1 worker), I get the behavior described here.
If I increase the "not needed" time (e.g. a 120-second IOLoop for 60-second jobs), the behaviour is always the same.
It's not a problem with IOLoop; I can reproduce the same behavior outside the loop.
I suspect the worker is being killed over heartbeats, but I have no log about that.

Getting description of a PBS job queue

Is there any command that would allow me to query the description of a running/queued PBS job for its attributes such as RAM, number of processors, GPUs, etc.?
Use the qstat command:
qstat -f job_id
Expanding on the answer posted by dimm:
If a job is registered in a queue, you can query its attributes with the qstat command. If the job has already finished, you can only grep the relevant information from the log files; there is a handy tracejob command to do the grepping for you.
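For example (job_id is just a placeholder; -n 3 asks tracejob to scan the last 3 days of logs):
qstat -f job_id       # attributes of a queued or running job
tracejob -n 3 job_id  # dig a finished job out of the accounting/server/MOM logs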
In PBS Pro and Torque each job registered with a queue has two sets of attributes:
Resource_List has resources requested for a running or queued job
resources_used holds actual resource usage for a running job.
For example, in PBS Pro you could get the following attributes for Resource_List:
Resource_List.mem = 2000mb
Resource_List.mpiprocs = 8
Resource_List.ncpus = 8
Resource_List.nodect = 1
Resource_List.place = free
Resource_List.qlist = queue1
Resource_List.select = 1:ncpus=8:mpiprocs=8
Resource_List.walltime = 02:00:00
 
And the following values for resources_used
resources_used.cpupercent = 800
resources_used.cput = 00:03:31
resources_used.mem = 529992kb
resources_used.ncpus = 8
resources_used.vmem = 3075580kb
resources_used.walltime = 00:00:28
For finished jobs, tracejob can fetch only some of the requested resources:
ncpus=8:mem=2048000kb
and the final values for resources_used
resources_used.cpupercent=799
resources_used.cput=00:54:29
resources_used.mem=725520kb
resources_used.ncpus=8
resources_used.vmem=3211660kb
resources_used.walltime=00:06:53