Torque PBS jobs going to the debug queue

At my new job, I administer a cluster that uses Torque as the resource manager and Maui as the scheduler.
I am currently facing a recurring problem where a specific user's jobs are always sent to the debug queue. Here is a list of the active queues on the system:
Queue            Memory CPU Time Walltime Node   Run   Que  Lm  State
---------------- ------ -------- -------- ----  -----  ---  --  -----
debug              --      --    00:20:00  --       0    0  12  E R
intel              --      --       --     --       0    0  --  E R
medium             --      --    72:00:00  --       0    0  12  E R
bighuge            --      --       --     --       0    0  --  E R
long               --      --       --     --       0    0  12  E R
                                                 ----- -----
                                                     0     0
The walltime requested for this user's jobs is measured in hours, so I am puzzled why they end up in the debug queue, which has a 20-minute walltime limit.
Here is the tracejob output for one of the affected jobs:
04/08/2016 15:46:48 S enqueuing into intel, state 1 hop 1
04/08/2016 15:46:48 S dequeuing from intel, state QUEUED
04/08/2016 15:46:48 S enqueuing into debug, state 1 hop 1
04/08/2016 15:46:48 S Job Queued at request of dawn@cm01, owner = dawn@cm01, job name = run01_submit.script, queue = debug
04/08/2016 15:46:49 S Job Run at request of root@cm01
04/08/2016 15:46:49 S child reported success for job after 0 seconds (dest=n20), rc=0
04/08/2016 15:46:49 S preparing to send 'b' mail for job 15631.cm01 to dawn@cm01 (---)
04/08/2016 15:46:49 S Not sending email: User does not want mail of this type.
04/08/2016 15:46:49 S obit received - updating final job usage info
04/08/2016 15:46:49 S job exit status 1 handled
04/08/2016 15:46:49 S preparing to send 'e' mail for job 15631.cm01 to dawn@cm01 (Exit_status=1
04/08/2016 15:46:49 S Not sending email: User does not want mail of this type.
04/08/2016 15:46:49 S Exit_status=1 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:00
04/08/2016 15:46:49 S on_job_exit task assigned to job
04/08/2016 15:46:49 S req_jobobit completed
04/08/2016 15:46:49 S JOB_SUBSTATE_EXITING
04/08/2016 15:46:49 S JOB_SUBSTATE_STAGEOUT
04/08/2016 15:46:49 S about to copy stdout/stderr/stageout files
04/08/2016 15:46:49 S JOB_SUBSTATE_STAGEOUT
04/08/2016 15:46:49 S JOB_SUBSTATE_STAGEDEL
04/08/2016 15:46:49 S JOB_SUBSTATE_EXITED
04/08/2016 15:46:49 S JOB_SUBSTATE_COMPLETE
04/08/2016 15:50:54 S Request invalid for state of job COMPLETE
04/08/2016 15:51:00 S Request invalid for state of job COMPLETE
04/08/2016 15:51:49 S dequeuing from debug, state COMPLETE
For now, my workaround is to manually change the assigned queue for these jobs with the qalter command.
Any ideas?

Because the job immediately jumps from the intel queue to debug, I'd suspect you have automatic routing configured, either in qmgr or in Maui. If the intel queue is configured as a routing queue, that would explain it.
Run qmgr -c "print queue intel" to check that.
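For reference, a routing queue in TORQUE is typically defined along these lines (illustrative qmgr commands only; I'm using the queue names route and test, which also appear in the tracejob output further down):

create queue route
set queue route queue_type = Route
set queue route route_destinations = test
set queue route enabled = True
set queue route started = True

In the print queue output for intel, the giveaways would be queue_type = Route and a route_destinations entry (in your case presumably pointing at debug).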
If it's not a routing queue, you can probably increase loglevel to better see what's going on in the pbs_server logs.
When I create a routing queue that way, I get the same type of tracejob output when submitting a job:
05/20/2016 20:04:05.439 S enqueuing into route, state 1 hop 1
05/20/2016 20:04:05.440 S dequeuing from route, state QUEUED
05/20/2016 20:04:05.440 S enqueuing into test, state 1 hop 1
05/20/2016 20:04:05.737 S Job Run at request of root@testserver
Otherwise, examine the Maui config and logs for clues.

Related

Airflow : Job <job-id> was killed before it finished (likely due to running out of memory)

I have a linear DAG with two tasks: the first task's truthy/falsy return value decides whether the second task is executed. I am using ShortCircuitOperator for the first task so that the second task can be bypassed if needed. Following is my DAG code:
DAG_VERSION = "1.0.0"

with DAG(
    "sample_dag",
    catchup=False,
    tags=[DAG_VERSION],
    max_active_runs=1,
    schedule_interval=None,
    default_args=DEFAULT_ARGS,
) as dag:
    dag.doc_md = "Sample DAG"

    TASK_1 = ShortCircuitOperator(
        task_id="task_1",
        python_callable=test_script_1,
        executor_config=EXECUTOR_CONFIG,
    )

    TASK_2 = PythonOperator(
        task_id="task_2",
        python_callable=test_script_2,
        executor_config=EXECUTOR_CONFIG,
    )

    TASK_1 >> TASK_2
However, when I run the DAG, I get the following in the log for the first task when it returns a truthy value:
task_1 logs
Marking task as SUCCESS. dag_id=sample_dag, task_id=task_1, execution_date=20220606T060000, start_date=20220606T070012, end_date=20220606T070014
[2022-06-06, 07:00:17 UTC] State of this instance has been externally set to success. Terminating instance.
[2022-06-06, 07:00:17 UTC] Sending Signals.SIGTERM to GPID 18
[2022-06-06, 07:01:17 UTC] process psutil.Process(pid=18, name='airflow task runner: sample_dag task_1 scheduled__2022-06-06T06:00:00+00:00 7037', status='sleeping', started='07:00:12') did not respond to SIGTERM. Trying SIGKILL
[2022-06-06, 07:01:17 UTC] Process psutil.Process(pid=18, name='airflow task runner: sample_dag task_1 scheduled__2022-06-06T06:00:00+00:00 7037', status='terminated', exitcode=<Negsignal.SIGKILL: -9>, started='07:00:12') (18) terminated with exit code Negsignal.SIGKILL
[2022-06-06, 07:01:17 UTC] Job 7037 was killed before it finished (likely due to running out of memory)
I am using the return value of the first task in the second task. When I try to log the XCom value of the first task inside the second task, I get None, which causes the second task to fail. This is my code for accessing the first task's XCom value inside the second task:
def test_script_2(**context: models.xcom) -> List[str]:
    task_instance = context["task_instance"]
    return_value = task_instance.xcom_pull(task_ids="task_1")
    print("logging return value of first task ", return_value)
I am running Airflow 2.2.2 with the Kubernetes executor.
Is the None XCom value due to an out-of-memory issue in the first task? I tried returning a fixed value from the first task, but again None was returned in the second task, with the following log:
task_2 logs
Marking task as SUCCESS. dag_id=sample_dag, task_id=task_2, execution_date=20220606T095611, start_date=20220606T095637, end_date=20220606T095638
[2022-06-06, 09:56:42 UTC] State of this instance has been externally set to success. Terminating instance.
[2022-06-06, 09:56:42 UTC] Sending Signals.SIGTERM to GPID 18
[2022-06-06, 09:57:42 UTC] process psutil.Process(pid=18, name='airflow task runner: sample_dag task_2 manual__2022-06-06T09:56:11.804005+00:00 7051', status='sleeping', started='09:56:37') did not respond to SIGTERM. Trying SIGKILL
[2022-06-06, 09:57:42 UTC] Process psutil.Process(pid=18, name='airflow task runner: sample_dag task_2 manual__2022-06-06T09:56:11.804005+00:00 7051', status='terminated', exitcode=<Negsignal.SIGKILL: -9>, started='09:56:37') (18) terminated with exit code Negsignal.SIGKILL
[2022-06-06, 09:57:42 UTC] Job 7051 was killed before it finished (likely due to running out of memory)
I am unable to find the issue in the code. I would appreciate any hint on where I am going wrong and how to get the XCom value of the first task.
Thanks

Camunda Cockpit and Rest API down but application up/JobExecutor config

We are facing a major incident with our Camunda orchestrator. Once we reach 100 running process instances, Camunda Cockpit takes an eternity to load and never responds.
We have the same issue when calling /app/engine/.
A few messages are consumed from RabbitMQ, and then everything stops.
The application itself, however, is not down.
I suspect a process engine configuration issue, because I am unable to get the job executor log.
When I set JobExecutorActivate to false, everything works fine for Cockpit and queue consumption, but processes stop at the end of the first subprocess.
We see this log loop non-stop:
2018/11/17 14:47:33.258 DEBUG ENGINE-14012 Job acquisition thread woke up
2018/11/17 14:47:33.258 DEBUG ENGINE-14022 Acquired 0 jobs for process engine 'default': []
2018/11/17 14:47:33.258 DEBUG ENGINE-14023 Execute jobs for process engine 'default': [8338]
2018/11/17 14:47:33.258 DEBUG ENGINE-14023 Execute jobs for process engine 'default': [8217]
2018/11/17 14:47:33.258 DEBUG ENGINE-14023 Execute jobs for process engine 'default': [8256]
2018/11/17 14:47:33.258 DEBUG ENGINE-14011 Job acquisition thread sleeping for 100 millis
2018/11/17 14:47:33.359 DEBUG ENGINE-14012 Job acquisition thread woke up
And this log too (for queue consumption):
2018/11/17 15:04:19.582 DEBUG Waiting for message from consumer. {"null":null}
2018/11/17 15:04:19.582 DEBUG Retrieving delivery for Consumer@5d05f453: tags=[{amq.ctag-0ivcbc2QL7g-Duyu2Rcbow=queue_response}], channel=Cached Rabbit Channel: AMQChannel(amqp://guest@127.0.0.1:5672/,4), conn: Proxy@77a5983d Shared Rabbit Connection: SimpleConnection@17a1dd78 [delegate=amqp://guest@127.0.0.1:5672/, localPort= 49812], acknowledgeMode=AUTO local queue size=0 {"null":null}
Environment:
Spring Boot 2.0.3.RELEASE, Camunda 7.9.0 with PostgreSQL, RabbitMQ.
Camunda BPM listens to and pushes to 165 RabbitMQ queues.
Configuration:
# Data source (PostgreSql)
com.campDo.fr.camunda.datasource.url=jdbc:postgresql://localhost:5432/campDo
com.campDo.fr.camunda.datasource.username=campDo
com.campDo.fr.camunda.datasource.password=password
com.campDo.fr.camunda.datasource.driver-class-name=org.postgresql.Driver
com.campDo.fr.camunda.bpm.database.jdbc-batch-processing=false
oms.camunda.retry.timer=1
oms.camunda.retry.nb-max=2
SpringProcessEngineConfiguration:
@Bean
public SpringProcessEngineConfiguration processEngineConfiguration() throws IOException {
    final SpringProcessEngineConfiguration config = new SpringProcessEngineConfiguration();
    config.setDataSource(camundaDataSource);
    config.setDatabaseSchemaUpdate("true");
    config.setTransactionManager(transactionManager());
    config.setHistory("audit");
    config.setJobExecutorActivate(true);
    config.setMetricsEnabled(false);
    final Resource[] resources = resourceLoader.getResources(CLASSPATH_ALL_URL_PREFIX + "/processes/*.bpmn");
    config.setDeploymentResources(resources);
    return config;
}
Pom dependencies:
<dependency>
    <groupId>org.camunda.bpm.springboot</groupId>
    <artifactId>camunda-bpm-spring-boot-starter</artifactId>
</dependency>
<dependency>
    <groupId>org.camunda.bpm.springboot</groupId>
    <artifactId>camunda-bpm-spring-boot-starter-webapp</artifactId>
</dependency>
<dependency>
    <groupId>org.camunda.bpm.springboot</groupId>
    <artifactId>camunda-bpm-spring-boot-starter-rest</artifactId>
</dependency>
I am quite sure that my job executor config is wrong.
Update:
I can start Cockpit and make Camunda consume messages by setting JobExecutorActivate to false, but processes still stop at the first step that requires the job executor:
config.setJobExecutorActivate(false);
Thanks for your help.
First: if your process contains async steps (jobs), it will pause there. Activating the job executor just tells Camunda to manage how these jobs are worked on. If you disable the executor, your processes will still stop, and since no one executes the jobs, they remain stopped.
Disabling job execution is only sensible during testing, or when you have multiple nodes and only some of them should do the processing.
To your main issue: the job executor works with a thread pool. From what you describe, it is very likely that all threads in the pool block forever, so they never finish and never return, meaning your system is stuck.
This happened to us a while ago when working with an SMTP server: there was an infinite timeout on the connection, so the threads kept waiting even though the machine was not available.
Since job execution in Camunda is highly reliable and well tested per se, I would suggest you double-check everything you do in your delegates. If you are lucky (and I am right), you will find the spot where you just wait forever...
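As an illustration only (this is not code from your system; the class name and URL are made up), the key point is that any blocking call inside a delegate should have finite timeouts, so a job-executor thread can never hang forever:

import java.net.HttpURLConnection;
import java.net.URL;

import org.camunda.bpm.engine.delegate.DelegateExecution;
import org.camunda.bpm.engine.delegate.JavaDelegate;

public class CallExternalServiceDelegate implements JavaDelegate {

    @Override
    public void execute(DelegateExecution execution) throws Exception {
        // Hypothetical external call; the important part is the finite timeouts.
        HttpURLConnection connection =
                (HttpURLConnection) new URL("http://example.org/api").openConnection();
        connection.setConnectTimeout(5_000);  // fail fast instead of blocking the executor thread
        connection.setReadTimeout(10_000);    // never wait indefinitely for a response
        try {
            execution.setVariable("statusCode", connection.getResponseCode());
        } finally {
            connection.disconnect();
        }
    }
}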

How to see where in the queue I am for a job on a cluster?

I am using a compute cluster to run an "lmem" job. I submitted my job a day ago, and while normally a job begins running immediately and I can monitor how long it has been running with qstat, this job remains in the queue.
I used qstat -q and see this:
Queue            Memory CPU Time Walltime Node   Run  Que  Lm  State
---------------- ------ -------- -------- ----  ----  ---  --  -----
lmem               --      --       --     --     36  235  --  E R
batch              --      --       --     --      8    0  --  E R
express           4gb      --    06:00:00  --     17    0  --  E R
test               --      --       --     --      0    0  --  D S
production        16gb     --       --     --     66  157  --  E R
route              --      --       --     --      0    0  --  E R
Someone must have submitted A LOT of lmem jobs. I was wondering if there is a way to see where in that line of 235 queued jobs my job sits.
When using a scheduler like Moab or Maui, you can run commands such as checkjob -v and mdiag -p (or have an admin run them for you) to see whether the job has an advance/priority reservation and how many jobs are ahead of it. The default ordering of showq places jobs with a reservation at or near the top of the Eligible/Idle list. If you are only using pbs_sched, then the order shown in a plain qstat is the order in which jobs will run, although that does not tell you how soon yours can start.
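For example (the job id is a placeholder; these are the standard Maui/Moab client commands):

checkjob -v 123456    # priority, reservations, and why the job is not yet running
mdiag -p              # priority breakdown for queued jobs
showq -i              # idle/eligible jobs, listed in priority order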

How to stop and resume a spring batch job

Goal: I am using Spring Batch for data processing and I want an option to stop a job and resume it where it left off.
Issue: I am able to send a stop signal to a running job and it gets stopped successfully. But when I then send a start signal to the same job, it creates a new job instance and starts it as a fresh job.
My question is: how can we achieve resume functionality for a stopped job in Spring Batch?
You just have to run it again with the same job parameters. Just make sure you haven't marked the job as non-restartable and that you're not using RunIdIncrementer or similar to automatically generate unique job parameters.
See, for instance, this example. After the first run, we have:
INFO: Job: [SimpleJob: [name=myJob]] completed with the following parameters: [{}] and the following status: [STOPPED]
Status is: STOPPED, job execution id 0
#1 step1 COMPLETED
#2 step2 STOPPED
And after the second:
INFO: Job: [SimpleJob: [name=myJob]] completed with the following parameters: [{}] and the following status: [COMPLETED]
Status is: COMPLETED, job execution id 1
#3 step2 COMPLETED
#4 step3 COMPLETED
Note that stopped steps will be re-executed. If you're using chunk-oriented steps, make sure that at least the ItemReader implements ItemStream (and does so with the correct semantics).
Steps marked with allowStartIfComplete will always be re-run.
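As a minimal sketch of the restart itself (the bean and parameter names here are assumptions, not taken from the example above): launching the same Job with the exact same JobParameters makes Spring Batch resume the STOPPED execution instead of creating a new JobInstance.

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public class JobRestartExample {

    // Call this once to start the job, and again with the same parameter values to resume it.
    public static JobExecution runOrResume(JobLauncher jobLauncher, Job myJob) throws Exception {
        JobParameters parameters = new JobParametersBuilder()
                .addString("input.file", "/data/input.csv") // hypothetical identifying parameter
                .toJobParameters();
        // Identical parameter values identify the same JobInstance, so a second
        // launch picks up the STOPPED execution at the step where it left off.
        return jobLauncher.run(myJob, parameters);
    }
}

If the job was stopped via JobOperator.stop(executionId), the same relaunch call resumes it, as long as the parameter values match.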

Inspect and retry resque jobs via redis-cli

I am unable to run resque-web on my server due to some issues I still have to work on, but I still need to check and retry failed jobs in my Resque queues.
Does anyone have experience with peeking at the failed-jobs queue to see what the error was, and then retrying a job, using the redis-cli command line?
Thanks,
I found a solution at the following link:
http://ariejan.net/2010/08/23/resque-how-to-requeue-failed-jobs
In the Rails console we can use these commands to check and retry failed jobs:
1 - Get the number of failed jobs:
Resque::Failure.count
2 - Check the errors' exception class and backtrace:
Resque::Failure.all(0,20).each { |job|
  puts "#{job["exception"]} #{job["backtrace"]}"
}
The job object is a hash with information about the failed job; you can inspect it for more details. Also note that this only lists the first 20 failed jobs. I am not sure how to list them all, so you will have to vary the (0, 20) offset and limit to walk through the whole list.
3 - Retry all failed jobs:
(Resque::Failure.count-1).downto(0).each { |i| Resque::Failure.requeue(i) }
4 - Reset the failed jobs count:
Resque::Failure.clear
Retrying all the jobs does not reset the counter; we must clear it so it goes back to zero.
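If you really want to peek directly with redis-cli, as originally asked, the failed jobs are stored as JSON blobs in a single Redis list; the key below assumes Resque's default "resque" namespace:

redis-cli LLEN resque:failed          # number of failed jobs
redis-cli LRANGE resque:failed 0 19   # raw JSON payloads of the first 20 failures

Requeueing from raw redis-cli is much more fiddly, so for retries the Rails console approach above is usually the easier route.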