Getting description of a PBS job queue - hpc

Is there any command that would allow me to query the description of a running or queued PBS job for its attributes, such as RAM, number of processors, GPUs, etc.?

Use the qstat command:
qstat -f job_id

Expanding on the answer posted by dimm.
If a job is registered in a queue, you can query its attributes with the qstat command. If the job has already finished, you can only grep the relevant information from the log files. There is a handy tracejob command that does the grepping for you.
In PBS Pro and Torque each job registered with a queue has two sets of attributes:
Resource_List has resources requested for a running or queued job
resources_used holds actual resource usage for a running job.
For example, in PBS Pro you could get the following attributes for Resource_List:
Resource_List.mem = 2000mb
Resource_List.mpiprocs = 8
Resource_List.ncpus = 8
Resource_List.nodect = 1
Resource_List.place = free
Resource_List.qlist = queue1
Resource_List.select = 1:ncpus=8:mpiprocs=8
Resource_List.walltime = 02:00:00
 
And the following values for resources_used:
resources_used.cpupercent = 800
resources_used.cput = 00:03:31
resources_used.mem = 529992kb
resources_used.ncpus = 8
resources_used.vmem = 3075580kb
resources_used.walltime = 00:00:28
For finished jobs, tracejob can fetch only some of the requested resources:
ncpus=8:mem=2048000kb
and the final values for resources_used:
resources_used.cpupercent=799
resources_used.cput=00:54:29
resources_used.mem=725520kb
resources_used.ncpus=8
resources_used.vmem=3211660kb
resources_used.walltime=00:06:53
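If you need these attributes programmatically, here is a minimal sketch that shells out to qstat and collects the Resource_List and resources_used attributes. It assumes plain qstat -f text output, where "attr = value" pairs may wrap onto indented continuation lines; the job id is a placeholder:
import subprocess

def job_attributes(job_id):
    out = subprocess.run(["qstat", "-f", job_id],
                         capture_output=True, text=True, check=True).stdout
    # Re-join wrapped continuation lines (they start with whitespace).
    lines = []
    for line in out.splitlines():
        if line.startswith(("\t", " ")) and lines:
            lines[-1] += line.strip()
        else:
            lines.append(line.rstrip())
    attrs = {}
    for line in lines:
        if " = " in line:
            key, value = line.split(" = ", 1)
            attrs[key.strip()] = value.strip()
    return attrs

attrs = job_attributes("1234.pbsserver")  # hypothetical job id
requested = {k: v for k, v in attrs.items() if k.startswith("Resource_List.")}
used = {k: v for k, v in attrs.items() if k.startswith("resources_used.")}
print(requested)
print(used)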

Related

Celery: How to schedule worker processes/children restart

I am trying to figure out how to set up my Celery workers so that they restart after living for a day.
I have already configured my worker children/processes to restart after executing a task.
But in some cases there are no tasks to execute for 3-4 days, so I need to restart the long-living children.
Do you know how to do this?
This is my current Celery app setup:
app = Celery(
    "celery",
    broker=f"amqp://bla#blabla/blablabla",
    backend="rpc://",
)
app.conf.task_serializer = "pickle"
app.conf.result_serializer = "pickle"
app.conf.accept_content = ["pickle", "application/json"]
app.conf.broker_connection_max_retries = 5
app.conf.broker_pool_limit = 1
app.conf.worker_max_tasks_per_child = 1  # Ensure 1 task is executed before restarting child
app.conf.worker_max_living_time_before_restart = 60 * 60 * 24  # The conf I want
Thank you :)
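As far as I know there is no built-in worker_max_living_time_before_restart setting. One possible workaround, sketched below under the assumption that the pool_restart remote control command fits your deployment (it must be enabled with worker_pool_restarts = True), is to broadcast a pool restart from a daily beat schedule; the task name, broker URL and schedule are illustrative, not from the original post:
from celery import Celery
from celery.schedules import crontab

app = Celery("celery", broker="amqp://user:password@host/vhost", backend="rpc://")
app.conf.worker_pool_restarts = True  # enable the pool_restart control command

@app.task(name="restart_worker_pools")
def restart_worker_pools():
    # Ask every worker to restart its pool, recycling the child processes.
    app.control.pool_restart(reload=False)

app.conf.beat_schedule = {
    "daily-pool-restart": {
        "task": "restart_worker_pools",
        "schedule": crontab(hour=3, minute=0),  # once per day
    },
}
# A beat process must be running alongside the workers for the schedule to fire.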

Add a constraint for a renewable resource

I am new to these optimisation problems. I just found the or-tools library and saw that cp_model can solve problems that are close to mine.
I have printers and some tasks that I want to schedule so that production finishes as early as possible. The tasks use machine time and raw material, which I must refill at the end of each coil. For the moment, I don't consider changing a plastic coil before all of its material is used.
Here is some information about my situation:
1- The printers are all the same; they can do every task with the same efficiency.
2- A printer can only print one task at a time.
3- A printer cannot start without a human around, so tasks can start only at certain hours (in the code below, from 0 AM to 10 AM).
4- A task can finish at any time.
5- If a printer has no more material, the coil needs to be changed, and this can happen only during opening hours.
6- If a printer has no more material, the task is paused until new material is loaded.
7- I assume an unlimited supply of material coils.
Thanks to the examples and some searching in the documentation, I have been able to solve all the issues that are not related to material. I have been able to set a maximum quantity per machine, but that is not my issue.
I don't understand how I can pause/resume my intervals (for the moment the duration is fixed).
from ortools.sat.python import cp_model
from ortools.util.python import sorted_interval_list
import random

Domain = sorted_interval_list.Domain

def main():
    random.seed(0)
    nb_jobs = 10
    nb_machine = 2
    horizon = 30000000
    job_list = []  # Job: (id, time, quantity)
    listOfEnds = []
    quantityPerMachine = []
    maxQuantity = 5
    # create the jobs
    for i in range(nb_jobs):
        time = random.randrange(1, 24)
        quantity = random.randrange(1, 4)
        job_list.append([i, time, quantity])
    print([job[1] for job in job_list])
    print("total time to print = ", horizon)
    print("quantity")
    print([job[2] for job in job_list])
    print("total quantity = ", sum([job[2] for job in job_list]))
    model = cp_model.CpModel()
    makespan = model.NewIntVar(0, horizon, 'makespan')
    machineForJob = {}
    # boolean variable for each combination of machine and job; if True, the machine works on this job
    for machine in range(nb_machine):
        for job in job_list:
            j = job[0]
            machineForJob[(machine, j)] = model.NewBoolVar(f'x[{machine},{j}]')
    # for each job, apply sum() == 1 to ensure exactly one machine works on it
    for j in range(nb_jobs):
        model.Add(sum([machineForJob[(machine, j)] for machine in range(nb_machine)]) == 1)
    # set the assignment of the jobs
    time_intervals = {}
    starts = {}
    ends = {}
    # Time domain represents working hours when someone can start the tasks
    timeDomain = []  # [[0],[1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[24],[25],[26]]
    for i in range(20):
        for j in range(10):
            t = [j + i * 24]
            timeDomain.append(t)
    for machine in range(nb_machine):
        time_intervals[machine] = []
        for job in job_list:
            j = job[0]
            duration = job[1]
            starts[(machine, j)] = model.NewIntVarFromDomain(Domain.FromIntervals(timeDomain), f'start {machine},{j}')
            ends[(machine, j)] = model.NewIntVar(0, horizon, f'end {machine},{j}')
            time_intervals[machine].append(model.NewOptionalIntervalVar(
                starts[(machine, j)], duration, ends[(machine, j)],
                machineForJob[(machine, j)], f'interval {machine},{j} '))
    # time should not overlap; quantity of raw material is limited
    for machine in range(nb_machine):
        # model.Add(quantityPerMachine[machine] <= maxQuantity)  # Not working as expected, as the raw material cannot be refilled
        model.AddNoOverlap(time_intervals[machine])
    # calculate time per machine
    time_per_machine = []
    for machine in range(nb_machine):
        q = 0
        s = 0
        for job in job_list:
            s += job[1] * machineForJob[(machine, job[0])]
            listOfEnds.append(ends[(machine, job[0])])
            q += job[2] * machineForJob[(machine, job[0])]
        time_per_machine.append(s)
        quantityPerMachine.append(q)
    # Goal is to finish all tasks earliest
    model.AddMaxEquality(makespan, listOfEnds)
    model.Minimize(makespan)
    solver = cp_model.CpSolver()
    # Sets a time limit of 600 seconds.
    solver.parameters.max_time_in_seconds = 600.0
    # Solve and print
    status = solver.Solve(model)
    if status == cp_model.OPTIMAL or status == cp_model.FEASIBLE:
        print("optimal =", status == cp_model.OPTIMAL)
        print(f'Total cost = {solver.ObjectiveValue()}')
        for i in range(nb_machine):
            for j in range(nb_jobs):
                if solver.BooleanValue(machineForJob[(i, j)]):
                    print(f'Machine {i} assigned to Job {j} Time = {job_list[j][1]}, Quantity = {job_list[j][2]}')
                    print(f"[{solver.Value(starts[(i, j)])}, {solver.Value(ends[(i, j)])}]")
    else:
        print('No solution found.')
    # Statistics.
    print('\nStatistics')
    print(' - conflicts: %i' % solver.NumConflicts())
    print(' - branches : %i' % solver.NumBranches())
    print(' - wall time: %f s' % solver.WallTime())

if __name__ == '__main__':
    main()
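Not from the original post, but as a hedged sketch of one common way to model a pause (for example a coil change) with cp_model: split a task into two chained sub-intervals whose durations sum to the full processing time, and add both parts to the machine's NoOverlap list instead of a single fixed-duration interval:
from ortools.sat.python import cp_model

model = cp_model.CpModel()
horizon = 480
total_duration = 100

d1 = model.NewIntVar(1, total_duration, "d1")
d2 = model.NewIntVar(0, total_duration, "d2")  # 0 means no pause is needed
model.Add(d1 + d2 == total_duration)

s1 = model.NewIntVar(0, horizon, "s1")
e1 = model.NewIntVar(0, horizon, "e1")
part1 = model.NewIntervalVar(s1, d1, e1, "part1")

s2 = model.NewIntVar(0, horizon, "s2")
e2 = model.NewIntVar(0, horizon, "e2")
part2 = model.NewIntervalVar(s2, d2, e2, "part2")

model.Add(s2 >= e1)  # part2 resumes only after part1 has stopped

# Both parts (rather than one fixed interval) would then go into
# model.AddNoOverlap(...) for the machine, and s2 could be restricted
# to the opening-hours domain just like the start variables above.
solver = cp_model.CpSolver()
status = solver.Solve(model)
if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print(solver.Value(s1), solver.Value(e1), solver.Value(s2), solver.Value(e2))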

Limit the number of single processes in Nextflow workflows

I have the following simple workflow:
workflow {
    Channel.fromPath(params.file_list)
        .splitText(){ it.trim() }
        .set { file_list }

    data = GetFromHPSS(file_list)
    data_pairs = CoupleDETXToFile(data, file(params.detx_path))
    SingleDUTimeResFit(data_pairs)
}
Here, file_list is a list of paths on a tape-drive system. GetFromHPSS is the process that retrieves files from the tape system, and I need to limit its parallel processes to a fairly low number.
Currently, I am using
executor {
    queueSize = 100
}
in the configuration file, but there are two problems:
it limits the overall maximum number of parallel jobs, while I could run thousands of SingleDUTimeResFit processes in parallel
it always waits until everything from GetFromHPSS has been processed instead of continuing with the subsequent processes
Here is an example:
N E X T F L O W ~ version 21.04.3
Launching `workflows/singledu_timeresfit.nf` [wise_galileo] - revision: 8084ac1482
executor > sge (502)
[13/ca3e8a] process > GetFromHPSS (426) [ 18%] 402 of 22840
[- ] process > CoupleDETXToFile [ 0%] 0 of 402
[- ] process > SingleDUTimeResFit -
Is there a way to limit GetFromHPSS to a specific number of parallel executions and let the remaining processes run with another queue-limit set?
EDIT: This is one of my best tries, I guess, but the configuration is not accepted:
process {
    executor {
        queueSize = 100
        submitRateLimit = "10sec"
    }
    withName: GetFromHPSS {
        executor.queueSize = 10
    }
}
With this process top-level configuration, I get:
N E X T F L O W ~ version 21.04.3
Launching `workflows/singledu_timeresfit.nf` [confident_pasteur] - revision: 8084ac1482
Unknown config attribute `process.withName:GetFromHPSS` -- check config file: /sps/km3net/users/tgal/dev/PhD/workflows/nextflow.config
I think what you're looking for here is the maxForks directive, which can be applied to just the 'GetFromHPSS' process without the need to change the executor's queueSize:
process 'GetFromHPSS' {
    maxForks 1

    """
    <your script here>
    """
}
You could even parameterize it, if you think it makes sense:
params.hpss_forks = 5

process 'GetFromHPSS' {
    maxForks params.hpss_forks

    """
    <your script here>
    """
}

Airflow tasks timeout in one hour even the setting is larger than 1 hour

Currently I'm using Airflow with the Celery executor and Redis to run DAGs, and I have set execution_timeout to 12 hours in an S3 key sensor, but it fails within one hour on each retry.
I have tried updating visibility_timeout = 64800 in airflow.cfg, but the issue still exists.
file_sensor = CorrectedS3KeySensor(
    task_id = 'listen_for_file_drop', dag = dag,
    aws_conn_id = 'aws_default',
    poke_interval = 15,
    timeout = 64800, # 18 hours
    bucket_name = EnvironmentConfigs.S3_SFTP_BUCKET_NAME,
    bucket_key = dag_config[ConfigurationConstants.FILE_S3_PATTERN],
    wildcard_match = True,
    execution_timeout = timedelta(hours=12)
)
To my understanding, execution_timeout should let each run last up to 12 hours, with four runs in total (retries = 3). But the issue is that each retry fails within an hour, so the task only lasts a bit over 4 hours in total.
[2019-08-06 13:00:08,597] {{base_task_runner.py:101}} INFO - Job 9: Subtask listen_for_file_drop [2019-08-06 13:00:08,595] {{timeout.py:41}} ERROR - Process timed out
[2019-08-06 13:00:08,612] {{models.py:1788}} ERROR - Timeout
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/airflow/models.py", line 1652, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.6/site-packages/airflow/sensors/base_sensor_operator.py", line 97, in execute
while not self.poke(context):
File "/usr/local/airflow/dags/ProcessingStage/sensors/sensors.py", line 91, in poke
time.sleep(30)
File "/usr/local/lib/python3.6/site-packages/airflow/utils/timeout.py", line 42, in handle_timeout
raise AirflowTaskTimeout(self.error_message)
airflow.exceptions.AirflowTaskTimeout: Timeout
I figured it out a few days ago.
Since I'm using AWS to deploy Airflow with the Celery executor, a few improperly configured CloudWatch alarms kept scaling the workers and the webserver/scheduler up and down :(
After those alarms were updated, it works well now!

TORQUE jobs hang on Ubuntu 16.04

I have TORQUE installed on Ubuntu 16.04, and I am having trouble because my jobs hang. I have a test script test.pbs:
#PBS -N test
#PBS -l nodes=1:ppn=1
#PBS -l walltime=0:01:00
cd $PBS_O_WORKDIR
touch done.txt
echo "done"
And I run it with
qsub test.pbs
The job writes done.txt and echoes "done" just fine, but the job hangs in the C state.
Job id Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
46.localhost test wlandau 00:00:00 C batch
Edit: some diagnostic info on another job from qstat -f 55
qstat -f 55
Job Id: 55.localhost
Job_Name = test
Job_Owner = wlandau#localhost
resources_used.cput = 00:00:00
resources_used.mem = 0kb
resources_used.vmem = 0kb
resources_used.walltime = 00:00:00
job_state = C
queue = batch
server = haggunenon
Checkpoint = u
ctime = Mon Oct 30 07:35:00 2017
Error_Path = localhost:/home/wlandau/Desktop/test.e55
exec_host = localhost/2
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Mon Oct 30 07:35:00 2017
Output_Path = localhost:/home/wlandau/Desktop/test.o55
Priority = 0
qtime = Mon Oct 30 07:35:00 2017
Rerunable = True
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=1
Resource_List.walltime = 00:01:00
session_id = 5115
Variable_List = PBS_O_QUEUE=batch,PBS_O_HOST=localhost,
PBS_O_HOME=/home/wlandau,PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=wlandau,
PBS_O_PATH=/home/wlandau/bin:/home/wlandau/.local/bin:/usr/local/sbin
:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/ga
mes:/snap/bin,PBS_O_SHELL=/bin/bash,PBS_SERVER=localhost,
PBS_O_WORKDIR=/home/wlandau/Desktop
comment = Job started on Mon Oct 30 at 07:35
etime = Mon Oct 30 07:35:00 2017
exit_status = 0
submit_args = test.pbs
start_time = Mon Oct 30 07:35:00 2017
Walltime.Remaining = 60
start_count = 1
fault_tolerant = False
comp_time = Mon Oct 30 07:35:00 2017
And a similar tracejob -n2 62:
/var/spool/torque/server_priv/accounting/20171029: No matching job records located
/var/spool/torque/server_logs/20171029: No matching job records located
/var/spool/torque/mom_logs/20171029: No matching job records located
/var/spool/torque/sched_logs/20171029: No matching job records located
Job: 62.localhost
10/30/2017 17:20:25 S enqueuing into batch, state 1 hop 1
10/30/2017 17:20:25 S Job Queued at request of wlandau#localhost, owner =
wlandau#localhost, job name = jobe945093c2e029c5de5619d6bf7922071,
queue = batch
10/30/2017 17:20:25 S Job Modified at request of Scheduler#Haggunenon
10/30/2017 17:20:25 S Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb
resources_used.vmem=0kb resources_used.walltime=00:00:00
10/30/2017 17:20:25 L Job Run
10/30/2017 17:20:25 S Job Run at request of Scheduler#Haggunenon
10/30/2017 17:20:25 S Not sending email: User does not want mail of this type.
10/30/2017 17:20:25 S Not sending email: User does not want mail of this type.
10/30/2017 17:20:25 M job was terminated
10/30/2017 17:20:25 M obit sent to server
10/30/2017 17:20:25 A queue=batch
10/30/2017 17:20:25 M scan_for_terminated: job 62.localhost task 1 terminated, sid=17917
10/30/2017 17:20:25 A user=wlandau group=wlandau
jobname=jobe945093c2e029c5de5619d6bf7922071 queue=batch
ctime=1509398425 qtime=1509398425 etime=1509398425 start=1509398425
owner=wlandau#localhost exec_host=localhost/0 Resource_List.ncpus=1
Resource_List.neednodes=1 Resource_List.nodect=1
Resource_List.nodes=1 Resource_List.walltime=01:00:00
10/30/2017 17:20:25 A user=wlandau group=wlandau
jobname=jobe945093c2e029c5de5619d6bf7922071 queue=batch
ctime=1509398425 qtime=1509398425 etime=1509398425 start=1509398425
owner=wlandau#localhost exec_host=localhost/0 Resource_List.ncpus=1
Resource_List.neednodes=1 Resource_List.nodect=1
Resource_List.nodes=1 Resource_List.walltime=01:00:00 session=17917
end=1509398425 Exit_status=0 resources_used.cput=00:00:00
resources_used.mem=0kb resources_used.vmem=0kb
resources_used.walltime=00:00:00
EDIT: jobs now hanging in E
After some tinkering, I am now using these settings. I have moved on to a tiny pipeline workflow in which some TORQUE jobs wait for other TORQUE jobs to finish. Unfortunately, all the jobs hang in the E state, and any jobs beyond the first 4 just stay queued. To keep things from hanging indefinitely, I have to sudo qdel -p each one, which I think is causing real problems with the project's filesystem as well as being an inconvenience.
Job id Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
113.localhost ...b73ec2cda6dca wlandau 00:00:00 E batch
114.localhost ...b6c8e6da05983 wlandau 00:00:00 E batch
115.localhost ...9123b8e20850b wlandau 00:00:00 E batch
116.localhost ...e6d49a3d7d822 wlandau 00:00:00 E batch
117.localhost ...8c3f6cb68927b wlandau 0 Q batch
118.localhost ...40b1d0cab6400 wlandau 0 Q batch
qmgr -c "list server" shows
Server haggunenon
server_state = Active
scheduling = True
max_running = 300
total_jobs = 5
state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:1 Exiting:3
acl_hosts = localhost
managers = root#localhost
operators = root#localhost
default_queue = batch
log_events = 511
mail_from = adm
query_other_jobs = True
resources_assigned.ncpus = 4
resources_assigned.nodect = 4
scheduler_iteration = 600
node_check_rate = 150
tcp_timeout = 6
mom_job_sync = True
pbs_version = 2.4.16
keep_completed = 0
submit_hosts = SERVER
allow_node_submit = True
next_job_number = 119
net_counter = 118 94 93
And qmgr -c "list queue batch"
Queue batch
queue_type = Execution
total_jobs = 5
state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:4
max_running = 300
resources_max.ncpus = 4
resources_max.nodes = 2
resources_min.ncpus = 1
resources_default.ncpus = 1
resources_default.nodect = 1
resources_default.nodes = 1
resources_default.walltime = 01:00:00
mtime = Wed Nov 1 07:40:45 2017
resources_assigned.ncpus = 4
resources_assigned.nodect = 4
keep_completed = 0
enabled = True
started = True
The C state means the job has completed and its status is kept in the system. Usually the status is kept after job completion for a period of time specified by the keep_completed parameter. However, certain types of failure may result in the job being kept in this state to provide the information necessary to examine the cause of the failure.
Check the output of qstat -f 46 to see if there is anything indicating an error.
To tune the keep_completed parameter, you can execute the following command to check its current value on your system:
qmgr -c "print queue batch keep_completed"
If you have administrative privileges on the Torque server, you can also change this value with
qmgr -c "set queue batch keep_completed=120"
to keep jobs in the completed state for 2 minutes after completion.
In general, having keep_completed set is useful: advanced schedulers use the information on completed jobs to schedule around failures.