PBS jobs queue then immediately exit sometimes - hpc

I've been using a PBS managed computing cluster for a few years now at my school. A few months ago I ran into this problem and they were never able to figure it out. When I submit jobs, they queue and then some run immediately. I believe that the jobs that should stay queued because of lack of resources die pretty much immediately. This happens intermittently depending on how many nodes I can use at one time. Sometimes I submit say 10 jobs, the first two will run, the next three will fail, then the next five will run.
I do not get either stdout or stderr files created for these failed jobs. The ones that do run do create these files. I get an email when these jobs die which I've attached here with some identifying information removed. Exit status -9 means "Could not create/open stdout stderr files" but I don't know how to fix that problem since it's so intermittent.
PBS Job Id: 11335.pearl.hpcc.XXX.edu
Job Name: mc1055
Exec host: m09/5
Aborted by PBS Server
Job cannot be executed
See Administrator for help
Exit_status=-9
resources_used.cput=00:00:00
resources_used.vmem=0kb
resources_used.walltime=00:00:02
resources_used.mem=0kb
resources_used.energy_used=0
req_information.task_count.0=1
req_information.lprocs.0=1
req_information.thread_usage_policy.0=allowthreads
req_information.hostlist.0=m09:ppn=1
req_information.task_usage.0.task.0={"task":{"cpu_list":"9","mem_list":"0","cores":0,"threads":1,"host":"m09"}}
Error_Path: CLUSTERNAME.hpcc.XXX.edu:/PATH/TOSCRIPT/run/mc1055.e11335
Output_Path: CLUSTERNAME.hpcc.XXX.edu:/PATH/TOSCRIPT/run/mc1055.o11335
I also looked at qstat -f right when a job failed and it's below. If I don't catch it right away it disappears from qstat.
Job Id: 11339.pearl.hpcc.XXX.edu
Job_Name = mc1059
Job_Owner = USERNAME#CLUSTERNAME.hpcc.XXX.edu
resources_used.cput = 00:00:00
resources_used.vmem = 0kb
resources_used.walltime = 00:00:00
resources_used.mem = 0kb
resources_used.energy_used = 0
job_state = C
queue = default
server = CLUSTERNAME.hpcc.XXX.edu
Account_Name = ADVISOR
Checkpoint = u
ctime = Mon Jan 4 20:02:25 2021
Error_Path = CLUSTERNAME.hpcc.XXX.edu/PATH/TOSCRIPT/mc1059.e11339
exec_host = m09/9
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Mon Jan 4 20:03:14 2021
Output_Path = CLUSTERNAME.hpcc.XXX.edu/PATH/TOSCRIPT/mc1059.o11339
Priority = 0
qtime = Mon Jan 4 20:02:25 2021
Rerunable = True
Resource_List.nodes = 1:ppn=1
Resource_List.walltime = 50:00:00
Resource_List.var = mkuuid:1e94a3e50dd44803bab2d3a7c2286ee2
Resource_List.nodect = 1
session_id = 0
Variable_List = PBS_O_QUEUE=largeq,
PBS_O_HOME=/PATH,PBS_O_LOGNAME=USERNAME,
PBS_O_PATH=lots of things
PBS_O_MAIL=/var/spool/mail/USERNAME,PBS_O_SHELL=/bin/bash,
PBS_O_LANG=en_US,KRB5CCNAME=FILE:/tmp/krb5cc_404112_hd4Yty,
PBS_O_WORKDIR=/PATH/TOSCRIPT/run,
PBS_O_HOST=CLUSTERNAME.hpcc.XXX.edu,
PBS_O_SERVER=CLUSTERNAME.hpcc.XXX.edu
euser = USERNAME
egroup = physics
queue_type = E
etime = Mon Jan 4 20:02:25 2021
exit_status = -9
submit_args = -l var=mkuuid:1e94a3e50dd44803bab2d3a7c2286ee2 -v KRB5CCNAME
/PATH/TOSCRIPT/run/tmp/montec_1059
start_time = Mon Jan 4 20:03:14 2021
start_count = 1
fault_tolerant = False
comp_time = Mon Jan 4 20:03:14 2021
job_radix = 0
total_runtime = 7.218811
submit_host = CLUSTERNAME.hpcc.XXX.edu
init_work_dir = /PATH/TOSCRIPT/run
request_version = 1
req_information.task_count.0 = 1
req_information.lprocs.0 = 1
req_information.thread_usage_policy.0 = allowthreads
req_information.hostlist.0 = m09:ppn=1
req_information.task_usage.0.task.0.cpu_list = 5
req_information.task_usage.0.task.0.mem_list = 1
req_information.task_usage.0.task.0.cores = 0
req_information.task_usage.0.task.0.threads = 1
req_information.task_usage.0.task.0.host = m09

Related

Celery: How to schedule worker processes/children restart

I try to figure out how to setup my "celery workers" to restart after a living day.
Indeed, I configured my worker children/processes to restart after executing a task.
But in some cases, there are no tasks to execute in 3-4 days. So I need to restart the long-living children.
Do you how to do this ?
This is my actual celery app setup:
app = Celery(
"celery",
broker=f"amqp://bla#blabla/blablabla",
backend="rpc://",
)
app.conf.task_serializer = "pickle"
app.conf.result_serializer = "pickle"
app.conf.accept_content = ["pickle", "application/json"]
app.conf.broker_connection_max_retries = 5
app.conf.broker_pool_limit = 1
app.conf.worker_max_tasks_per_child = 1 # Ensure 1 task is executed before restarting child
app.conf.worker_max_living_time_before_restart = 60 * 60 * 24 # The conf I want
Thank you :)

Add a constraint of renewable ressource

I am new to these optimisation problems, I just found the or-tools library and saw that cp_model can fix problems that are close to mine.
I have printers and some tasks, that I want to schedule in order to finish the production the earliest. The tasks uses time on machine and raw material, that I must refill at the end of coil. For the moment, I don't consider changing a plastic coil before using all the material.
Here is some information about my situation:
1- The printers are all same, they can do every task, with the same efficiency.
2- A printer can only print one task at a time.
3- A printer cannot start without human around, so tasks can start only at certain hours (in the code below, from 0AM to 10 AM).
4- A task can finish at any time.
5- If a printer has no more material, it needs to be change, this can happen only on opening hours.
6- If a printer has no more material, the task is paused until new material is put.
7- I consider having unlimited quantity of material coil.
Thanks to examples and some search in the documentation, I have been able to fix all the issues that are not related to material. I have been able to set a maximum quantity per machine, but it is not my issue.
I don't understand how I can pause/resume my intervals (for the moment I set the duration to a fixed one).
from ortools.sat.python import cp_model
from ortools.util.python import sorted_interval_list
import random
Domain = sorted_interval_list.Domain
def main():
random.seed(0)
nb_jobs = 10
nb_machine = 2
horizon = 30000000
job_list = [] #Job : (id,time,quantity)
listOfEnds = []
quantityPerMachine = []
maxQuantity = 5
#create the jobs
for i in range(nb_jobs):
time = random.randrange(1,24)
quantity = random.randrange(1,4)
job_list.append([i,time,quantity])
print([job[1] for job in job_list])
print("total time to print = ",horizon)
print("quantity")
print([job[2] for job in job_list])
print("total quantity = ",sum([job[2] for job in job_list]))
model = cp_model.CpModel()
makespan = model.NewIntVar(0, horizon, 'makespan')
machineForJob = {}
#boolean variable for each combination of machine and job. if True then machine works on this job
for machine in range(nb_machine):
for job in job_list:
j = job[0]
machineForJob[(machine,j)]=(model.NewBoolVar(f'x[{machine},{j}]'))
#for each job, apply sum()==1 to ensure one machine works on each job only
for j in range(nb_jobs):
model.Add(sum([machineForJob[(machine,j)] for machine in range(nb_machine)])==1)
#set the affectation of the jobs
time_intervals={}
starts = {}
ends = {}
#Time domain represents working hours when someone can start the taks
timeDomain = []#[[0],[1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[24],[25],[26]]
for i in range(20):
for j in range(10):
t= [j +i*24]
timeDomain.append(t)
for machine in range(nb_machine):
time_intervals[machine] = []
for job in job_list:
j = job[0]
duration = job[1]
starts[(machine,j)] = model.NewIntVarFromDomain(Domain.FromIntervals(timeDomain),f'start {machine},{j}')
ends[(machine,j)] = (model.NewIntVar(0, horizon, f'end {machine},{j}'))
time_intervals[machine].append(model.NewOptionalIntervalVar(starts[(machine,j)], duration, ends[(machine,j)],
machineForJob[(machine,j)],f'interval {machine},{j} '))
#time should not overlap, quantity of raw material is limited,
for machine in range(nb_machine):
#model.Add(quantityPerMachine[machine] <= maxQuantity) Not working as expected as the raw material cannot be refilled
model.AddNoOverlap(time_intervals[machine])
#calculate time per machine
time_per_machine = []
for machine in range(nb_machine):
q = 0
s = 0
for job in job_list:
s+= job[1]*machineForJob[(machine,job[0])]
listOfEnds.append(ends[(machine,job[0])])
q+= job[2]*machineForJob[(machine,job[0])]
time_per_machine.append(s)
quantityPerMachine.append(q)
#Goal is to finish all taks earliest
model.AddMaxEquality(makespan,listOfEnds)
model.Minimize(makespan)
solver = cp_model.CpSolver()
# Sets a time limit of 10 seconds.
solver.parameters.max_time_in_seconds = 600.0
#Solve and prints
status = solver.Solve(model)
if status == cp_model.OPTIMAL or status == cp_model.FEASIBLE:
print("optimal =",status == cp_model.OPTIMAL )
print(f'Total cost = {solver.ObjectiveValue()}')
for i in range(nb_machine):
for j in range(nb_jobs):
if solver.BooleanValue(machineForJob[(i,j)]):
print(
f'Machine {i} assigned to Job {j} Time = {job_list[j][1]},Quantity = {job_list[j][2]}')
print(f"[{solver.Value(starts[(i,j)])} ,{solver.Value(ends[(i,j)])}]")
else:
print('No solution found.')
# Statistics.
print('\nStatistics')
print(' - conflicts: %i' % solver.NumConflicts())
print(' - branches : %i' % solver.NumBranches())
print(' - wall time: %f s' % solver.WallTime())

Odoo 12 : Not enought limit time to finish the backup?

I use the auto_backup to backup production database everyday.
It was working well until now.
Now, the backup can't finish until the end, I mean, I get the half size of the .zip file and it is impossible to restore it.
Normaly, the backup takes about 15mn.
I think that it's related to the Odoo configuration.
Here it is :
workers = 3
longpolling_port = 8072
limit_memory_soft = 2013265920
limit_memory_hard = 2415919104
limit_request = 8192
limit_time_cpu = 600
limit_time_real = 3600
limit_time_real_cron = 3600
proxy_mode = True
Can you help me?
I have another question, What does mean limit_time_real_cron = -1 if the limit_time_real_cron = 0 is unlimited?
Try to increase limit_time_cpu.

Audit daemon does not take rules from audit.rules

I am unable to add rules to audit daemon using /etc/audit/audit.rules
Every time i add the rules using auditctl it gets removed on reboot or audit daemon restart I have attached the /etc/audit/audit.rules and /etc/audit/auditd.conf
cat /etc/audit/auditd.conf
$ cat /etc/audit/auditd.conf
#
# This file controls the configuration of the audit daemon
#
local_events = yes
write_logs = yes
log_file = /NU_Application/audit.log
log_group = root
log_format = RAW
flush = INCREMENTAL_ASYNC
freq = 50
max_log_file = 8
num_logs = 5
priority_boost = 4
disp_qos = lossy
dispatcher = /sbin/audispd
name_format = NONE
##name = mydomain
max_log_file_action = ROTATE
space_left = 75
space_left_action = SYSLOG
verify_email = yes
action_mail_acct = root
admin_space_left = 50
admin_space_left_action = SUSPEND
disk_full_action = SUSPEND
disk_error_action = SUSPEND
use_libwrap = yes
##tcp_listen_port = 22
tcp_listen_queue = 5
tcp_max_per_addr = 1
##tcp_client_ports = 1024-65535
tcp_client_max_idle = 0
enable_krb5 = no
krb5_principal = auditd
##krb5_key_file = /etc/audit/audit.key
distribute_network = no
cat /etc/audit/audit.rules
$ cat /etc/audit/audit.rules
## First rule - delete all
## Increase the buffers to survive stress events.
## Make this bigger for busy systems
-b 8192
## This determine how long to wait in burst of events
--backlog_wait_time 0
## Set failure mode to syslog
-f 1
-w /var/log/lastlog -p wa
root#iWave-G22M:~# auditctl
When i restart the audit daemon ( i.e /etc/init.d/auditd restart ) and try to list the rules i get the message No rules
$ /etc/init.d/auditd restart
Restarting audit daemon auditd
type=1305 audit(1558188111.980:3): audit_pid=0 old=1148 auid=4294967295 ses=4294967295
res=1
type=1305 audit(1558188112.010:4): audit_enabled=1 old=1 auid=4294967295 ses=4294967295
res=1
type=1305 audit(1558188112.020:5): audit_pid=30342 old=0 auid=4294967295 ses=4294967295
res=1
1
$ auditctl -l
No rules
OS INFO
$ uname -a
Linux iWave-G22M 3.10.31-ltsi-svn743 #5 SMP PREEMPT Mon May 27 18:28:01 IST 2019 armv7l GNU/Linux
audit_2.8.4.bb file was used to install auditd daemon via yocto
path of audit_2.8.4.bb -- http://git.yoctoproject.org/cgit/cgit.cgi/meta-selinux/tree/recipes-security/audit/audit_2.8.4.bb?h=master
audit rules add via /etc/audit/audit.rules and auditctl command are not permanent. to make them permanent across reboot you have to add them /etc/audit/rules.d/audit.rules file.
after adding the rule, restart auditd service and run command auditctl -l, it will list all the rules and also reflect in /etc/audit/audit.rules file.

TORQUE jobs hang on Ubuntu 16.04

I have TORQUE installed on Ubuntu 16.04, and I am having trouble because my jobs hang. I have a test script test.pbs:
#PBS -N test
#PBS -l nodes=1:ppn=1
#PBS -l walltime=0:01:00
cd $PBS_O_WORKDIR
touch done.txt
echo "done"
And I run it with
qsub test.pbs
The job writes done.txt and echoes "done" just fine, but the job hangs in the C state.
Job id Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
46.localhost test wlandau 00:00:00 C batch
Edit: some diagnostic info on another job from qstat -f 55
qstat -f 55
Job Id: 55.localhost
Job_Name = test
Job_Owner = wlandau#localhost
resources_used.cput = 00:00:00
resources_used.mem = 0kb
resources_used.vmem = 0kb
resources_used.walltime = 00:00:00
job_state = C
queue = batch
server = haggunenon
Checkpoint = u
ctime = Mon Oct 30 07:35:00 2017
Error_Path = localhost:/home/wlandau/Desktop/test.e55
exec_host = localhost/2
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Mon Oct 30 07:35:00 2017
Output_Path = localhost:/home/wlandau/Desktop/test.o55
Priority = 0
qtime = Mon Oct 30 07:35:00 2017
Rerunable = True
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=1
Resource_List.walltime = 00:01:00
session_id = 5115
Variable_List = PBS_O_QUEUE=batch,PBS_O_HOST=localhost,
PBS_O_HOME=/home/wlandau,PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=wlandau,
PBS_O_PATH=/home/wlandau/bin:/home/wlandau/.local/bin:/usr/local/sbin
:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/ga
mes:/snap/bin,PBS_O_SHELL=/bin/bash,PBS_SERVER=localhost,
PBS_O_WORKDIR=/home/wlandau/Desktop
comment = Job started on Mon Oct 30 at 07:35
etime = Mon Oct 30 07:35:00 2017
exit_status = 0
submit_args = test.pbs
start_time = Mon Oct 30 07:35:00 2017
Walltime.Remaining = 60
start_count = 1
fault_tolerant = False
comp_time = Mon Oct 30 07:35:00 2017
And a similar tracejob -n2 62:
/var/spool/torque/server_priv/accounting/20171029: No matching job records located
/var/spool/torque/server_logs/20171029: No matching job records located
/var/spool/torque/mom_logs/20171029: No matching job records located
/var/spool/torque/sched_logs/20171029: No matching job records located
Job: 62.localhost
10/30/2017 17:20:25 S enqueuing into batch, state 1 hop 1
10/30/2017 17:20:25 S Job Queued at request of wlandau#localhost, owner =
wlandau#localhost, job name = jobe945093c2e029c5de5619d6bf7922071,
queue = batch
10/30/2017 17:20:25 S Job Modified at request of Scheduler#Haggunenon
10/30/2017 17:20:25 S Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb
resources_used.vmem=0kb resources_used.walltime=00:00:00
10/30/2017 17:20:25 L Job Run
10/30/2017 17:20:25 S Job Run at request of Scheduler#Haggunenon
10/30/2017 17:20:25 S Not sending email: User does not want mail of this type.
10/30/2017 17:20:25 S Not sending email: User does not want mail of this type.
10/30/2017 17:20:25 M job was terminated
10/30/2017 17:20:25 M obit sent to server
10/30/2017 17:20:25 A queue=batch
10/30/2017 17:20:25 M scan_for_terminated: job 62.localhost task 1 terminated, sid=17917
10/30/2017 17:20:25 A user=wlandau group=wlandau
jobname=jobe945093c2e029c5de5619d6bf7922071 queue=batch
ctime=1509398425 qtime=1509398425 etime=1509398425 start=1509398425
owner=wlandau#localhost exec_host=localhost/0 Resource_List.ncpus=1
Resource_List.neednodes=1 Resource_List.nodect=1
Resource_List.nodes=1 Resource_List.walltime=01:00:00
10/30/2017 17:20:25 A user=wlandau group=wlandau
jobname=jobe945093c2e029c5de5619d6bf7922071 queue=batch
ctime=1509398425 qtime=1509398425 etime=1509398425 start=1509398425
owner=wlandau#localhost exec_host=localhost/0 Resource_List.ncpus=1
Resource_List.neednodes=1 Resource_List.nodect=1
Resource_List.nodes=1 Resource_List.walltime=01:00:00 session=17917
end=1509398425 Exit_status=0 resources_used.cput=00:00:00
resources_used.mem=0kb resources_used.vmem=0kb
resources_used.walltime=00:00:00
EDIT: jobs now hanging in E
After some tinkering, I am now using these settings. I have moved on to this tiny pipeline workflow, where some TORQUE jobs wait for other TORQUE jobs to finish. Unfortunately, all the jobs hang in the E state, and any number of jobs more than 4 will just stay queued. To keep things from hanging indefinitely, I have to sudo qdel -p each one, which I think is causing legitimate problems with the project's filesystem as well as an inconvenience.
Job id Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
113.localhost ...b73ec2cda6dca wlandau 00:00:00 E batch
114.localhost ...b6c8e6da05983 wlandau 00:00:00 E batch
115.localhost ...9123b8e20850b wlandau 00:00:00 E batch
116.localhost ...e6d49a3d7d822 wlandau 00:00:00 E batch
117.localhost ...8c3f6cb68927b wlandau 0 Q batch
118.localhost ...40b1d0cab6400 wlandau 0 Q batch
qmgr -c "list server" shows
Server haggunenon
server_state = Active
scheduling = True
max_running = 300
total_jobs = 5
state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:1 Exiting:3
acl_hosts = localhost
managers = root#localhost
operators = root#localhost
default_queue = batch
log_events = 511
mail_from = adm
query_other_jobs = True
resources_assigned.ncpus = 4
resources_assigned.nodect = 4
scheduler_iteration = 600
node_check_rate = 150
tcp_timeout = 6
mom_job_sync = True
pbs_version = 2.4.16
keep_completed = 0
submit_hosts = SERVER
allow_node_submit = True
next_job_number = 119
net_counter = 118 94 93
And qmgr -c "list queue batch"
Queue batch
queue_type = Execution
total_jobs = 5
state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:4
max_running = 300
resources_max.ncpus = 4
resources_max.nodes = 2
resources_min.ncpus = 1
resources_default.ncpus = 1
resources_default.nodect = 1
resources_default.nodes = 1
resources_default.walltime = 01:00:00
mtime = Wed Nov 1 07:40:45 2017
resources_assigned.ncpus = 4
resources_assigned.nodect = 4
keep_completed = 0
enabled = True
started = True
C state means the job has completed and its status is kept in the system. Usually the status is kept after job completion for a period of time specified by the keep_completed parameter. However certain types of failure may result in the job being kept in this state to provide the information necessary to examine the cause of failure.
Check the output of qstat -f 46 to see if there is anything indicating an error.
To tune the keep_completed parameter you can execute the following command to check the value of this parameter on your system.
qmgr -c "print queue batch keep_completed"
If you have administrative privileges on the Torque server you could also change this value with
qmgr -c "set queue batch keep_completed=120"
To keep jobs in completed state for 2 minutes after completion.
In general having keep_completed set is a useful feature. Advanced schedulers use the information on completed jobs to schedule around failures.