MPI+OpenMP job submission script on LSF - sockets

I am very new to LSF. I have 4 nodes with 2 sockets per node, and each node has 8 cores. I have developed a hybrid MPI+OpenMP code. I am submitting the job like the following, which asks each core to perform one MPI task, so I lose the power of OpenMP.
##BSUB -n 64
I wish to submit the job so that each socket runs one MPI task (rather than each core), so that the cores inside a socket can be used for OpenMP. How can I build a job submission script that exploits the hybrid parallelism in my code?

First of all, the BSUB sentinels have to be preceded by a single # sign, otherwise they are skipped over as regular comments.
The correct way to start a hybrid job with older LSF versions is to pass the span resource request and to request nodes exclusively. To start a job with 8 MPI processes and 4 OpenMP threads each, you should use the following:
#BSUB -n 8
#BSUB -x
#BSUB -R "span[ptile=2]"
The parameters are as follows:
-n 8 - requests 8 slots for MPI processes
-x - requests nodes exclusively
-R "span[ptile=2]" - instructs LSF to span the job over two slots per node
You should request nodes exclusively, otherwise LSF will schedule other jobs to the same nodes since only two slots per node will be used.
Then you have to set the OMP_NUM_THREADS environment variable to 4 (the number of cores per socket), tell the MPI library to pass the variable to the MPI processes, and make the library limit each MPI process to its own CPU socket. This is unfortunately very implementation-specific, e.g.:
Open MPI 1.6.x or older:
export OMP_NUM_THREADS=4
mpiexec -x OMP_NUM_THREADS --bind-to-socket --bysocket ./program.exe
Open MPI 1.7.x or newer:
export OMP_NUM_THREADS=4
mpiexec -x OMP_NUM_THREADS --bind-to socket --map-by socket ./program.exe
Intel MPI (not sure about this one as I don't use IMPI very often):
mpiexec -genv OMP_NUM_THREADS 4 -genv I_MPI_PIN 1 \
-genv I_MPI_PIN_DOMAIN socket -genv I_MPI_PIN_ORDER scatter \
./program.exe
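Putting these pieces together, a complete submission script might look like the sketch below. The job name, output file, and the Open MPI 1.7+ launch line are assumptions based on the examples above; adjust them for your MPI library and site.
#!/bin/bash
#BSUB -J hybrid_test            # job name (example)
#BSUB -o hybrid_test.%J.out     # stdout file, %J expands to the job ID
#BSUB -n 8                      # 8 slots -> 8 MPI processes
#BSUB -x                        # exclusive node access
#BSUB -R "span[ptile=2]"        # 2 processes per node, i.e. one per socket

# 4 cores per socket -> 4 OpenMP threads per MPI process
export OMP_NUM_THREADS=4

# Open MPI 1.7.x or newer; use the variants above for older Open MPI or Intel MPI
mpiexec -x OMP_NUM_THREADS --bind-to socket --map-by socket ./program.exe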

Related

bind9 (named) does not start in multi-threaded mode

From the bind9 man page, I understand that the named process starts one worker thread per CPU if it is able to determine the number of CPUs. If it is unable to determine it, a single worker thread is started.
My question is: how does it calculate the number of CPUs? I presume by CPU it means cores. The Linux machine I work on is customized, runs kernel 2.6.34, and does not have the lscpu or nproc utilities. named starts a single thread even if I give the -n 4 option. Is there any other way to force named to start multiple threads?
Thanks in advance.
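As a rough sketch for a box without lscpu or nproc, you can at least check how many cores the kernel exposes (which is the count named would be trying to detect), assuming a standard Linux /proc:
$ grep -c ^processor /proc/cpuinfo   # one 'processor' entry per logical CPU
$ getconf _NPROCESSORS_ONLN          # number of CPUs currently online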

In celery, what would be the purpose of having multiple workers process the same queue?

In the documentation for celeryd-multi, we find this example:
# Advanced example starting 10 workers in the background:
# * Three of the workers process the images and video queue
# * Two of the workers process the data queue with loglevel DEBUG
# * The rest process the 'default' queue.
$ celeryd-multi start 10 -l INFO -Q:1-3 images,video -Q:4,5 data
    -Q default -L:4,5 DEBUG
(From here: http://docs.celeryproject.org/en/latest/reference/celery.bin.celeryd_multi.html#examples)
What would be a practical example of why it would be good to have more than one worker on a single host process the same queue, as in the above example? Isn't that what setting the concurrency is for?
More specifically, would there be any practical difference between the following two lines (A and B)?:
A:
$ celeryd-multi start 10 -c 2 -Q data
B:
$ celeryd-multi start 1 -c 20 -Q data
I am concerned that I am missing some valuable bit of knowledge about task queues by my not understanding this practical difference, and I would greatly appreciate if somebody could enlighten me.
Thanks!
What would be a practical example of why it would be good to have more than one worker on a single host process the same queue, as in the above example?
Answer:
So, you may want to run multiple worker instances on the same machine node if:
You're using the multiprocessing pool and want to consume messages in parallel. Some report better performance using multiple worker instances instead of running a single instance with many pool workers.
You're using the eventlet/gevent pool (and, due to the infamous GIL, also the 'threads' pool), and you want to execute tasks on multiple CPU cores.
Reference: http://www.quora.com/Celery-distributed-task-queue/What-is-the-difference-between-workers-and-processes
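As a hedged illustration of the difference (the pool flag is assumed to be forwarded to each worker by celeryd-multi): with the prefork/multiprocessing pool a single instance can already use several cores through its pool processes, whereas with eventlet/gevent each instance is effectively limited to one core, so multiple instances are how you spread work across CPUs.
# one prefork worker, 4 pool processes -> can use up to 4 cores
$ celeryd-multi start 1 -c 4 -Q data
# four gevent workers, 100 greenlets each -> roughly one core per instance
$ celeryd-multi start 4 -P gevent -c 100 -Q data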

Using celeryd as a daemon with multiple django apps?

I'm just starting to use django-celery and I'd like to set celeryd up to run as a daemon. The instructions, however, appear to suggest that it can be configured for only one site/project at a time. Can celeryd handle more than one project, or can it handle only one? And, if it can, is there a clean way to set up celeryd to be started automatically for each configuration, without requiring me to create a separate init script for each one?
Like all interesting questions, the answer is it depends. :)
It is definitely possible to come up with a scenario in which celeryd can be used by two independent sites. If multiple sites are submitting tasks to the same exchange, and the tasks do not require access to any specific database -- say, they operate on email addresses, or credit card numbers, or something other than a database record -- then one celeryd may be sufficient. Just make sure that the task code is in a shared module that is loaded by all sites and the celery server.
Usually, though, you'll find that celery needs access to the database -- either it loads objects based on the ID that was passed as a task parameter, or it has to write some changes to the database, or, most often, both. And multiple sites / projects usually don't share a database, even if they share the same apps, so you'll need to keep the task queues separate.
In that case, what will usually happen is that you set up a single message broker (RabbitMQ, for example) with multiple exchanges. Each exchange receives messages from a single site. Then you run one or more celeryd processes somewhere for each exchange (in the celery config settings, you have to specify the exchange. I don't believe celeryd can listen to multiple exchanges). Each celeryd server knows its exchange, the apps it should load, and the database that it should connect to.
To manage these, I would suggest looking into cyme -- it's by @asksol and manages multiple celeryd instances, on multiple servers if necessary. I haven't tried it, but it looks like it should handle different configurations for different instances.
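A sketch of that layout with django-celery (the project paths and queue names here are hypothetical): each worker is started from its own project, so it loads that project's settings and database and consumes only that project's queue.
# site1's worker loads site1's settings/database and consumes only site1's queue
$ cd /path/to/site1 && python manage.py celeryd -Q site1_tasks -l INFO
# site2's worker: separate settings, separate queue, possibly a separate exchange
$ cd /path/to/site2 && python manage.py celeryd -Q site2_tasks -l INFO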
I have not tried it, but with Celery 3.1.x, which does not need django-celery, the documentation says you can instantiate a Celery app like this:
from celery import Celery
from django.conf import settings

app1 = Celery('app1')
app1.config_from_object('django.conf:settings')
# pick up tasks.py from every installed Django app
app1.autodiscover_tasks(lambda: settings.INSTALLED_APPS)

@app1.task(bind=True)
def debug_task(self):
    print('Request: {0!r}'.format(self.request))
But you can use celery multi to launch several workers, each with a single configuration; you can see examples here. So you can launch several workers with different --app appX parameters, and each will use different tasks and settings:
# 3 workers: two with 3 processes, and one with 10 processes.
$ celery multi start 3 -c 3 -c:1 10
celery worker -n celery1@myhost -c 10 --config celery1.py --app app1
celery worker -n celery2@myhost -c 3 --config celery2.py --app app2
celery worker -n celery3@myhost -c 3 --config celery3.py --app app3

mpirun on CPUs with specified IDs

Does anyone know how to execute mpirun on specified CPUs? While "mpirun -np 4" specifies the number of CPUs used, what I want to do here is specify the CPU IDs.
The OS is CentOS 5.6 and MVAPICH2 is used on a single node with 6x2 cores.
Thank you for your cooperation.
Yes; new versions of mvapich2 use the hwloc library to enable CPU affinity and binding.
From the User Guide:
For example, if you want to run 4 processes per node and utilize cores 0, 1, 4, 5 on each node, you can specify:
$ mpirun_rsh -np 64 -hostfile hosts MV2_CPU_MAPPING=0:1:4:5 ./a.out
or
$ mpiexec -n 64 -f hosts -env MV2_CPU_MAPPING 0:1:4:5 ./a.out
In this way, process 0 on each node will be mapped to core 0, process 1 will be mapped to core 1, process 2 will be mapped to core 4, and process 3 will be mapped to core 5. For each process, the mapping is separated by a single “:”.
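Adapted to the single node with 6x2 cores in the question (the binary name, the hostfile, and the assumption that cores 0-5 and 6-11 sit on different sockets are just examples), pinning 4 ranks to two cores on each socket could look like:
$ mpirun_rsh -np 4 -hostfile hosts MV2_CPU_MAPPING=0:1:6:7 ./a.out   # 'hosts' lists only the local node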

MPI has only master nodes

I'm trying to use MPI with my 4-core processor.
I have followed this tutorial : http://debianclusters.org/index.php/MPICH:_Starting_a_Global_MPD_Ring
But at the end, when I run the hello.out program, I get only server processes (master nodes):
mpiexec -np 4 ./hello.out
Hello MPI from the server process!
Hello MPI from the server process!
Hello MPI from the server process!
Hello MPI from the server process!
I have searched all around the web but couldn't find any clues for this problem.
Here is my mpdtrace result :
[nls@debian] ~ $ mpd --ncpus=4 --daemon
[nls@debian] ~ $ mpdtrace -l
debian_52063 (127.0.0.1)
Shouldn't I get one trace line per core?
Thanks for your help,
Malchance
95% of the time, when you see this problem -- MPI tasks not getting the "right" rank IDs, usually all ending up as rank zero -- it means there's a mismatch in MPI libraries. Either the mpiexec doing the launching isn't from the same MPI installation as the mpicc (or whatever) used to compile the program, or the MPI libraries the child processes pick up at launch (if linked dynamically) are different from those intended. So I'd start by double-checking those things.
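A few quick checks for such a mismatch (standard commands; the binary name is taken from the question):
$ which mpiexec mpicc            # do both come from the same MPI installation?
$ mpicc -show                    # MPICH/MVAPICH-style wrappers print the link line (Open MPI uses --showme)
$ ldd ./hello.out | grep -i mpi  # which libmpi the binary resolves at run time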