Celery: CELERYD_CONCURRENCY and number of workers

Following another Stack Overflow answer, I tried to limit Celery's number of worker processes.
After terminating all the Celery workers, I restarted Celery with the new configuration:
CELERYD_CONCURRENCY = 1 (in Django's settings.py)
Then I ran the following command to check how many Celery workers were running:
ps auxww | grep 'celery worker' | grep -v grep | awk '{print $2}'
It returns two PIDs like 24803, 24817.
Then I changed the configuration to CELERYD_CONCURRENCY = 2 and restarted Celery.
The same command returns three PIDs, like 24944, 24958, 24959. (As you can see, the last two PIDs are sequential.)
This implies that the number of workers increased as I expected.
However, why does it return two PIDs even when only one Celery worker should be running?
Is there some subsidiary process that helps the worker?

I believe one process always acts as a controller that listens for tasks and then distributes them to its child processes, which actually perform the work. Therefore, you will always have one more process than the concurrency setting.
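You can check the parent/child relationship yourself by including the parent PID in the ps output; a minimal sketch (the PIDs shown are illustrative):

ps axo pid,ppid,cmd | grep '[c]elery worker'   # the [c] trick keeps grep from matching itself
# Illustrative output with CELERYD_CONCURRENCY = 1:
#   24803      1  celery worker ...   <- controller (master) process
#   24817  24803  celery worker ...   <- child whose PPID is the master's PID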

Related

qsub -t job "has died"

I am trying to submit an array job to our cluster using qsub. The script is like:
#!/bin/bash
#PBS -l nodes=1:ppn=1 # Number of nodes and processors per node
#..... (Other options)
#PBS -t 0-50 # Array job with indices 0-50
cd $PBS_O_WORKDIR
./programname << EOF
some parameters
EOF
This script runs without a problem when the -t option is removed. But every time I add -t, I get the following output:
---------------------------------------------
Check nodes and clean them of stray processes
---------------------------------------------
Checking node XXXXXXXXXX
-> User XXXX running job XXXXX.XXX:state=X:ncpus=X
-> Job XXX.XXX has died
Done clearing all the allocated nodes
------------------------------------------------------
Concluding PBS prologue script - XX-XX-XXXX XX:XX:XX
------------------------------------------------------
-------------- Job will be requeued --------------
At this point the job died and was requeued, with no error message. I did not find any similar issue online. Has anyone experienced this before? Thank you!
(I wrote another "manual" array qsub script which works, as sketched below. But I do wish to get the -t option working, since it keeps everything in the command options and is much cleaner.)
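For reference, a minimal sketch of the kind of "manual" array submission mentioned above (jobscript.pbs and the INDEX variable name are hypothetical):

# Submit 51 independent jobs, passing the index via -v instead of -t
for i in $(seq 0 50); do
    qsub -v INDEX=$i jobscript.pbs
done
# Inside jobscript.pbs, read $INDEX where $PBS_ARRAYID would have been used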

Celery worker --exclude-queues option is not affected

I use celery 4.0.2.
I want my Celery worker to consume only specific queues (or to exclude a specific queue).
So I started Celery as below:
celery -A mycelery worker -Q queue1,queue2 -E --logfile=./celery.log --pidfile=./celery.pid &
But when I run the following code, 'testqueue' is consumed just fine!
mycelery.control.add_consumer('testqueue', reply=False)
myfunc.apply_async(queue='testqueue')
So I changed the option as below and ran the code again:
celery -A mycelery worker -X testqueue -E --logfile=./celery.log --pidfile=./celery.pid &
myfunc still runs.
The -Q option means 'consume only the named queues',
and the -X option means 'never consume the named queues'... doesn't it?
What's wrong?

How to iterate over a sequence in systemd?

We're migrating from Ubuntu 14 to Ubuntu 16.
I have the following Upstart task:
description "Start multiple Resque workers."
start on runlevel [2345]
task
env NUM_WORKERS=4
script
  for i in `seq 1 $NUM_WORKERS`
  do
    start resque-work-app ID=$i
  done
end script
As you can see, I start 4 workers. There is then a second Upstart script that starts each of these workers:
description "Resque work app"
respawn
respawn limit 5 30
instance $ID
pre-start script
  test -e /home/opera/bounties || { stop; exit 0; }
end script
exec sudo -u opera sh -c "<DO THE WORK>"
How do I do something similar in systemd? I'm particularly interested in how to iterate over a sequence of 4, and start a worker for each - this way, I'd have a cluster of 4 workers.
systemd doesn't have an iteration syntax, but it still has features to help solve this problem. The related concepts that systemd provides are:
Target Units, which allow you to treat a related group of services as a single service.
Template Units, which allow you to easily launch new copies of an app based on a variable like an ID.
With systemd, you could run a one-time bash loop as part of setting up the service that would enable the desired number of workers:
for i in $(seq 1 4); do systemctl enable resque-work-app@$i; done
That presumes you have a resque-work-app@.service template file that includes something like:
[Install]
WantedBy=resque-work-app.target
And that you have a resque-work-app.target that contains something like:
[Unit]
Description=Resque Work App
[Install]
WantedBy=multi-user.target
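For completeness, a minimal sketch of what the resque-work-app@.service template unit might look like (the User and ExecStart values are assumptions carried over from the Upstart script above; %i is the instance ID):

[Unit]
Description=Resque work app %i
PartOf=resque-work-app.target

[Service]
User=opera
ExecStart=/bin/sh -c "<DO THE WORK>"
Restart=on-failure

[Install]
WantedBy=resque-work-app.target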
See Also
How to create a virtual systemd service to stop/start several instances together?
man systemd.target
man systemd.unit
About Instances and Template Units

GNU parallel --jobs option using multiple nodes on cluster with multiple cpus per node

I am using gnu parallel to launch code on a high performance (HPC) computing cluster that has 2 CPUs per node. The cluster uses TORQUE portable batch system (PBS). My question is to clarify how the --jobs option for GNU parallel works in this scenario.
When I run a PBS script calling GNU parallel without the --jobs option, like this:
#PBS -lnodes=2:ppn=2
...
parallel --env $PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
matlab -nodiplay -r "\"cd $PBS_O_WORKDIR,primes1({})\"" ::: 10 20 30 40
it looks like it only uses one CPU per node, and also produces the following error stream:
bash: parallel: command not found
parallel: Warning: Could not figure out number of cpus on galles087 (). Using 1.
bash: parallel: command not found
parallel: Warning: Could not figure out number of cpus on galles108 (). Using 1.
This looks like one error for each node. I don't understand the first part (bash: parallel: command not found), but the second part tells me it's using only one CPU per node.
When I add the option -j2 to the parallel call, the errors go away, and I think it uses two CPUs per node. I am still a newbie to HPC, so my way of checking this is to output date-time stamps from my code (the dummy matlab code takes tens of seconds to complete). My questions are:
Am I using the --jobs option correctly? Is it correct to specify -j2 because I have 2 CPUs per node? Or should I be using -jN where N is the total number of CPUs (number of nodes multiplied by number of CPUs per node)?
It appears that GNU parallel attempts to determine the number of CPUs per node on its own. Is there a way that I can make this work properly?
Is there any meaning to the bash: parallel: command not found message?
Yes: -j is the number of jobs per node.
Yes: install parallel in your $PATH on the remote hosts.
Yes: it is a consequence of parallel missing from the $PATH.
GNU Parallel logs into the remote machine and tries to determine the number of cores (by running parallel --number-of-cores), which fails because parallel is not in the remote $PATH, so it defaults to 1 CPU core per host. By giving -j2, GNU Parallel will not try to determine the number of cores.
Did you know that you can also give the number of cores in the --sshlogin as 4/myserver? This is useful if you have a mix of machines with different numbers of cores.
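For example, a quick sketch with hypothetical host names:

parallel --sshlogin 4/bignode,2/smallnode command ::: arg1 arg2 arg3

This runs up to 4 simultaneous jobs on bignode and 2 on smallnode.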
This is not an answer to the 3 primary questions, but I'd like to point out some other problems with the parallel statement in the first code block.
parallel --env $PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
matlab -nodiplay -r "\"cd $PBS_O_WORKDIR,primes1({})\"" ::: 10 20 30 40
The shell expands $PBS_O_WORKDIR before parallel is executed. This means two things happen: (1) --env sees the expanded directory path rather than an environment variable name, and so essentially does nothing; and (2) the path is expanded as part of the command string, eliminating the need to pass $PBS_O_WORKDIR to the remote side, which is why there wasn't an error.
The latest version of parallel, 20151022, has a --workdir option (although the tutorial lists it as alpha testing), which is probably the easiest solution. The parallel command line would look something like:
parallel --workdir $PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
matlab -nodisplay -r "primes1({})" ::: 10 20 30 40
Final note: PBS_NODEFILE may contain hosts listed multiple times if more than one processor per node is requested from qsub. This may have implications for the number of jobs run, etc., as illustrated below.
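For instance, with the #PBS -lnodes=2:ppn=2 request above, the node file might look like this (host names taken from the earlier error output, for illustration only):

$ cat $PBS_NODEFILE
galles087
galles087
galles108
galles108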

How to detect if a process is running on any of the nodes in a multiple node Grid in UNIX

My server runs on a grid; we have 3 nodes, and any one of them could execute my script when I kick off the Autosys job.
My problem: I am trying to stop a job from running if it is already running. My code works when both instances execute on the same node (I mean the first instance and the second instance):
ps -ead -o %U%p%a | egrep '(ksh|perl)' | grep -v egrep | grep "perl .*myprocess.pl"
Is there a way ps could list all instances of the process across all nodes in the grid?
Please help!
You can create a start.flag file in a common location (e.g. on a filesystem shared by all nodes) and keep the conditions below:
If the flag exists, remove it and execute the script; after the execution completes, the script touches the flag again.
If the flag does not exist, the script just exits, reporting that it is already running.
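A minimal sketch of this flag-file approach in shell (the path /shared/start.flag is a hypothetical shared location):

FLAG=/shared/start.flag
if [ -e "$FLAG" ]; then
    rm "$FLAG"            # claim the flag so other nodes see the job as running
    perl myprocess.pl     # do the actual work
    touch "$FLAG"         # release the flag when done
else
    echo "myprocess.pl appears to be already running; exiting." >&2
    exit 1
fi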
Best luck :)