How can I see where my job is in the queue on a cluster?

I am using a cluster computer to run an "lmem" job. I submitted the job a day ago; normally my jobs begin running immediately and I can monitor how long they have been running with qstat, but this job is still sitting in the queue.
I used qstat -q and saw the following:
Queue Memory CPU Time Walltime Node Run Que Lm State
---------------- ------ -------- -------- ---- --- --- -- -----
lmem -- -- -- -- 36 235 -- E R
batch -- -- -- -- 8 0 -- E R
express 4gb -- 06:00:00 -- 17 0 -- E R
test -- -- -- -- 0 0 -- D S
production 16gb -- -- -- 66 157 -- E R
route -- -- -- -- 0 0 -- E R
Someone must have submitted a lot of lmem jobs. Is there a way to see where in that line of 235 my job is?

When using a scheduler like Moab or Maui, you can run commands such as checkjob -v and mdiag -p (or have an admin do it for you) to see whether the job has an advance/priority/job reservation and how many jobs are ahead of it. The default ordering of showq places jobs with a job reservation at or near the top of the Eligible/Idle list. If you're only using pbs_sched, then the order shown in a plain qstat is the order in which jobs will run, although that still may not tell you how soon yours can start.
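For example (a sketch; the job ID 12345 is hypothetical, and the commands assume Moab/Maui is the scheduler):
checkjob -v 12345   # shows the job's priority, any reservation, and why it is still idle
showq -i            # lists eligible/idle jobs in priority order; find your job in that list
mdiag -p            # shows how each idle job's priority is calculated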

Related

Raspberry Pi 4 I2C bus was working, but fails after reboot

I have an SSD1306 OLED display and a Raspberry Pi 4. I used the tutorial from https://maker.pro/raspberry-pi/projects/raspberry-pi-monitoring-system-via-oled-display-module which (after a small mod) was working, up until I updated /etc/rc.local and rebooted.
Since then, I get nothing.
sudo i2cdetect -y 1 now gives me:
0 1 2 3 4 5 6 7 8 9 a b c d e f
00: -- -- -- -- -- -- -- -- -- -- -- -- --
10: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
20: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
30: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
40: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
50: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
60: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
70: -- -- -- -- -- -- -- --
I had created an image of the SD card before setting up the OLED, but even after restoring that image and trying again I still can't detect the OLED.
I'd appreciate any tips or thoughts on resolving this.
The solution is unknown. I left the Pi running for a while, then tried again and it just worked.
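For anyone hitting similar symptoms, some common first checks (a sketch; the paths and module names are the usual Raspberry Pi OS defaults):
grep i2c /boot/config.txt   # dtparam=i2c_arm=on should be present and not commented out
lsmod | grep i2c            # the i2c_bcm2835 and i2c_dev modules should be loaded
dmesg | grep -i i2c         # look for bus errors logged after boot
sudo i2cdetect -y 1         # a working SSD1306 usually appears at 0x3c (or 0x3d)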

Coral Dev Board Mini I2C connection to Adafruit Motor Shield V2

Anyone have any success connecting the Coral Dev Board Mini to Adafruit's Motor Shield V2 (AMS) with I2C?
My first solution was to use the only three PWM pins on the Dev Board Mini to control the speed and direction of one motor, but I need two motors.
So I decided to go the I2C route, coupled with Adafruit's Motor Shield V2.
I've connected the following pins between the board and the shield:
Mini 5V -> AMS Vin
Mini GND -> AMS GND
Mini SDA (pin 3) -> AMS SDA pin
Mini SCL (pin 5) -> AMS SCL pin
On the Mini, the SDA/SCL pins 3 & 5 are associated with the device path /dev/i2c-3:
sudo i2cdetect -y 3
0 1 2 3 4 5 6 7 8 9 a b c d e f
00: -- -- -- -- -- -- -- -- -- -- -- -- --
10: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
20: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
30: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
40: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
50: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
60: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
70: -- -- -- -- -- -- -- --
and I should be seeing 0x60.
The default Mini pin states are HIGH (4.7K pull-up).
Any ideas?
Or any thoughts on another solution to control the speed and direction of two motors?
I don't know the Adafruit motor driver, but if you connect +5V to its Vin it may well pull up the I2C lines to that voltage, which might not be compatible with the 3.3V the Dev Board Mini uses. Have you checked this?
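Alongside the voltage question, it may also be worth confirming that /dev/i2c-3 really is the bus wired to pins 3 and 5, and probing the expected address directly (a sketch; 0x60 is the shield's default address mentioned in the question):
i2cdetect -l            # list all I2C buses and their descriptions
sudo i2cdetect -y -r 3  # re-scan bus 3 using read-byte probing instead of quick-write
sudo i2cget -y 3 0x60   # attempt a direct byte read from the expected address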
I have tried something similar, but with the SparkFun Auto pHAT (which does 3.3V <-> 5V conversion on the I2C bus). I could not get I2C working between the Auto pHAT and the Dev Board Mini. Maybe something in the I2C implementation of the Mini is not working out of the box. I could control the Auto pHAT with a (3.3V) ESP32 without problems.
I have also developed my own motor driver which is controlled over I2C, and the prototype works well with the ESP32. I have not yet tried it with the Dev Board Mini; I'll give it a try and let you know.

How to register a Celery task to a specific worker?

I am developing a web application in Python/Django, and I have several tasks which run in Celery.
I have to run task A one at a time, so I created a worker with --concurrency=1 and routed task A to that worker using the following command.
celery -A proj worker -Q A -c 1 -l INFO
Everything works fine, as this worker handles task A while other tasks are routed to the default queue.
However, this worker returns all tasks when I use the inspect command to list its registered tasks. That matches what I see at startup: the worker displays every task of the project as registered, even though it only handles task A.
Following is the output of the worker when I start it.
$ celery -A proj worker -Q A -c 1 -l INFO
-------------- celery@pet_sms v4.0.2 (latentcall)
---- **** -----
--- * *** * -- Linux-4.8.10-040810-generic-x86_64-with-Ubuntu-16.04-xenial 2018-04-26 14:11:49
-- * - **** ---
- ** ---------- [config]
- ** ---------- .> app: proj:0x7f298a10d208
- ** ---------- .> transport: redis://localhost:6379/0
- ** ---------- .> results: disabled://
- *** --- * --- .> concurrency: 1 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
-------------- [queues]
.> A exchange=A(direct) key=A
[tasks]
. task_one
. task_two
. task_three
. A
. task_four
. task_five
Is there any way to register only specific tasks to a given worker in Celery?
Notice the following two parts in your worker log:
[queues]
.> A exchange=A(direct) key=A
[tasks]
. task_one
. task_two
. task_three
. A
. task_four
. task_five
The first part, [queues], shows the queues your worker consumes.
It shows A exchange=A(direct) key=A, indicating that this worker only consumes tasks from queue A, which is exactly what you want. You achieved this by specifying -Q A when you started the worker with the command $ celery -A proj worker -Q A -c 1 -l INFO.
The second part, [tasks], shows all the registered tasks of this app.
Although the other tasks, task_one through task_five, are all registered, they are never sent to queue A, so this worker does not consume them.
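To see this distinction from the command line, you can compare what is registered with what each worker actually consumes (a sketch using Celery's built-in inspect subcommands):
celery -A proj inspect registered      # every worker reports the full set of tasks registered with the app
celery -A proj inspect active_queues   # shows that this particular worker only consumes queue A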

Torque PBS jobs going to the debug queue

At my new job, I administer a cluster that uses Torque as the resource manager and Maui as the scheduler.
Currently, I am facing a recurring problem where a specific user's jobs are always sent to the debug queue. Here is a list of the active queues on the system:
Queue Memory CPU Time Walltime Node Run Que Lm State
---------------- ------ -------- -------- ---- --- --- -- -----
debug -- -- 00:20:00 -- 0 0 12 E R
intel -- -- -- -- 0 0 -- E R
medium -- -- 72:00:00 -- 0 0 12 E R
bighuge -- -- -- -- 0 0 -- E R
long -- -- -- -- 0 0 12 E R
----- -----
0 0
The walltime for the jobs submitted by this user is on the order of hours, so I am puzzled why they are being sent to the debug queue, which has a 20-minute walltime limit.
Also, here is the output of tracejob:
04/08/2016 15:46:48 S enqueuing into intel, state 1 hop 1
04/08/2016 15:46:48 S dequeuing from intel, state QUEUED
04/08/2016 15:46:48 S enqueuing into debug, state 1 hop 1
04/08/2016 15:46:48 S Job Queued at request of dawn@cm01, owner = dawn@cm01, job name = run01_submit.script, queue = debug
04/08/2016 15:46:49 S Job Run at request of root@cm01
04/08/2016 15:46:49 S child reported success for job after 0 seconds (dest=n20), rc=0
04/08/2016 15:46:49 S preparing to send 'b' mail for job 15631.cm01 to dawn@cm01 (---)
04/08/2016 15:46:49 S Not sending email: User does not want mail of this type.
04/08/2016 15:46:49 S obit received - updating final job usage info
04/08/2016 15:46:49 S job exit status 1 handled
04/08/2016 15:46:49 S preparing to send 'e' mail for job 15631.cm01 to dawn@cm01 (Exit_status=1
04/08/2016 15:46:49 S Not sending email: User does not want mail of this type.
04/08/2016 15:46:49 S Exit_status=1 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:00
04/08/2016 15:46:49 S on_job_exit task assigned to job
04/08/2016 15:46:49 S req_jobobit completed
04/08/2016 15:46:49 S JOB_SUBSTATE_EXITING
04/08/2016 15:46:49 S JOB_SUBSTATE_STAGEOUT
04/08/2016 15:46:49 S about to copy stdout/stderr/stageout files
04/08/2016 15:46:49 S JOB_SUBSTATE_STAGEOUT
04/08/2016 15:46:49 S JOB_SUBSTATE_STAGEDEL
04/08/2016 15:46:49 S JOB_SUBSTATE_EXITED
04/08/2016 15:46:49 S JOB_SUBSTATE_COMPLETE
04/08/2016 15:50:54 S Request invalid for state of job COMPLETE
04/08/2016 15:51:00 S Request invalid for state of job COMPLETE
04/08/2016 15:51:49 S dequeuing from debug, state COMPLETE
A workaround for now is to manually change the assigned queue for these jobs using the qalter command.
Any ideas?
Because the job immediately jumps from the intel queue to debug, I'd suspect you have automatic routing configured, either in qmgr or in Maui. If the intel queue is configured as a routing queue, that would explain it.
Run qmgr -c "print queue intel" to check that.
If it's not a routing queue, you can increase the server's log level to get a better picture of what's going on in the pbs_server logs.
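For reference, a routing queue shows up in print queue output roughly like this (a sketch; the destination here is made up to match the symptom):
# qmgr -c "print queue intel"
create queue intel
set queue intel queue_type = Route
set queue intel route_destinations = debug
set queue intel enabled = True
set queue intel started = True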
When I create a routing queue that way, I get the same type of tracejob output when submitting a job:
05/20/2016 20:04:05.439 S enqueuing into route, state 1 hop 1
05/20/2016 20:04:05.440 S dequeuing from route, state QUEUED
05/20/2016 20:04:05.440 S enqueuing into test, state 1 hop 1
05/20/2016 20:04:05.737 S Job Run at request of root@testserver
Otherwise, examine the Maui config and logs for clues.

Solaris svcs command shows wrong status

I have freshly installed an application on Solaris 5.10. When checked with ps -ef | grep hyperic | grep agent, the processes are up and running. When I check the status with the svcs hyperic-agent command, however, the output shows that the agent is in maintenance mode. The application is working fine and I don't have any issues with it. Please help.
There are several reasons that can lead to this behavior:
The start method (the start/exec property of the service) returned a status other than SMF_EXIT_OK (zero). Then you may check the logs:
# svcs -x ssh
...
See: /var/svc/log/network-ssh:default.log
If you check the logs, you may see messages like the following, which mean the start script failed or is written incorrectly:
[ Aug 11 18:40:30 Method "start" exited with status 96 ]
Another reason for this behavior is that the service faults while it is running (i.e. one of its processes dumps core or receives a kill signal, or all of its processes exit), as described here: https://blogs.oracle.com/lianep/entry/smf_5_fault_retry_models
The facility that actually provides SMF's process monitoring is System Contracts. You can determine the contract ID of an online service with svcs -v (the CTID field):
# svcs -vp svc:/network/smtp:sendmail
STATE NSTATE STIME CTID FMRI
online - Apr_14 68 svc:/network/smtp:sendmail
Apr_14 1679 sendmail
Apr_14 1681 sendmail
Then watch its events with ctwatch:
# ctwatch 68
CTID EVID CRIT ACK CTTYPE SUMMARY
68 28 crit no process contract empty
Then there are two options to handle that:
There is a real problem with the service, so it eventually faults. In that case, debug the application.
It is normal behavior for this service, so you should edit and re-import your service manifest to make SMF less strict, i.e. configure the ignore_error and duration properties (see the sketch below).
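For the second option, a minimal sketch of those property changes, assuming the service FMRI is simply hyperic-agent as in the question (substitute the real FMRI of your agent):
# create the startd property group first if it does not exist yet
svccfg -s hyperic-agent addpg startd framework
# treat core dumps and external kill signals as non-fatal
svccfg -s hyperic-agent setprop startd/ignore_error = astring: '"core,signal"'
# or declare the service transient so SMF does not track its processes at all
# svccfg -s hyperic-agent setprop startd/duration = astring: transient
svcadm refresh hyperic-agent
svcadm clear hyperic-agent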