Accessing multiple nodes in an MPI cluster using IPython

This is a continuation of the thread ipython-with-mpi-clustering-using-machinefile. It is slightly more focused and hopefully clearer as to what the issue might be.
I have 3 nodes running as a cluster using mpich/mpi4py, with a machinefile and all libraries in a virtualenv, all on an NFS share. My goal is to use ipython/ipyparallel to distribute jobs across multiple nodes, each running multiple ipython engines.
I am able to run ipcluster start --profile=mpi -n 4 on one node (in this case, worker2) and, from another node (in this case worker1), run ipython --profile=mpi and list the engines that are running using the following commands:
import ipyparallel as ipp

client = ipp.Client()
dview = client[:]

with dview.sync_imports():
    import socket

@dview.remote(block=True)
def engine_hostname():
    return socket.gethostname()

results = engine_hostname()
for r in results:
    print r
As expected, I get 4 instances of the hostname of the host running the engines printed:
In [7]: for r in results:
   ...:     print r
   ...:
worker2
worker2
worker2
worker2
However, if I start ipcluster on another node (in this case head), those are the only engines/nodes that show up when I query them as outlined above, even though the first set of engines is still running on the other node:
In [7]: for r in results:
   ...:     print r
   ...:
head
head
head
head
My question is: how can I get ipython to see all of the engines on all of the nodes that are running; in other words, how do I actually distribute the load across the different nodes?
Running MPI on its own works fine (head, worker1 and worker2 are the respective nodes in the cluster):
(venv)gms@head:~/development/mpi$ mpiexec -f machinefile -n 10 ipython test.py
head[21506]: 0/10
worker1[7809]: 1/10
head[21507]: 3/10
worker2[8683]: 2/10
head[21509]: 9/10
worker2[8685]: 8/10
head[21508]: 6/10
worker1[7811]: 7/10
worker2[8684]: 5/10
worker1[7810]: 4/10
So, at least I know this is not the problem.
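(test.py itself isn't shown above; a minimal mpi4py sketch that would produce output of that hostname[pid]: rank/size shape is given below, purely as an assumption about what the script looks like.)

import os
import socket
from mpi4py import MPI

comm = MPI.COMM_WORLD
# Print "hostname[pid]: rank/size", matching the shape of the output above.
print("%s[%d]: %d/%d" % (socket.gethostname(), os.getpid(),
                         comm.Get_rank(), comm.Get_size()))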

Resolved. I recreated my ipcluster_config.py file and added c.MPILauncher.mpi_args = ["-machinefile", "path_to_file/machinefile"] to it and this time it worked - for some bizarre reason. I could swear I had this in it before, but alas...
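For anyone hitting the same wall, a minimal sketch of the relevant lines in the profile's ipcluster_config.py follows. Only the mpi_args line is quoted above; the file location and the engine_launcher_class line are assumptions based on the stock ipyparallel config, and the machinefile path is still a placeholder.

# ~/.ipython/profile_mpi/ipcluster_config.py  (location assumed)
c = get_config()

# Launch the engines through MPI so they can land on every node.
c.IPClusterEngines.engine_launcher_class = 'MPI'

# Hand the machinefile to mpiexec so the engines are spread across the
# nodes listed in it instead of all starting on the local host.
c.MPILauncher.mpi_args = ["-machinefile", "path_to_file/machinefile"]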

Related

Ceph Mgr not responding to certain commands

We have a Ceph cluster built with Rook (2 mgrs, 3 mons, 2 MDS per CephFS, 24 OSDs; rook: 1.9.3, ceph: 16.2.7, kubelet: 1.24.1). Our workload requires constantly creating and deleting CephFilesystems. Over time we experienced issues with rook-ceph-mgr. Within a week or two of the cluster being built, rook-ceph-mgr failed to respond to certain ceph commands, like ceph osd pool autoscale-status and ceph fs subvolumegroup ls, while other commands, like ceph -s, worked fine. We have to restart rook-ceph-mgr to get it going again. Now we have around 30 CephFilesystems and the issue happens more frequently.
We tried disabling the mgr modules dashboard, prometheus and iostat, setting ceph progress off, and increasing mgr_stats_period & mon_mgr_digest_period. That didn't help much. The issue happened again after one or two create-and-delete cycles.

How can I fix ceph commands hanging after a reboot?

I'm pretty new to Ceph, so I've included all the steps I used to set up my cluster, since I'm not sure what is or isn't useful information for fixing my problem.
I have 4 CentOS 8 VMs in VirtualBox set up to teach myself how to bring up Ceph. One is a client and three are Ceph monitors. Each Ceph node has six 8 GB drives. Once I learned how the networking worked, it was pretty easy.
I set each VM to have a NAT (for downloading packages) and an internal network that I called "ceph-public". This network would be accessed by each VM on the 10.19.10.0/24 subnet. I then copied the ssh keys from each VM to every other VM.
I followed this documentation to install cephadm, bootstrap my first monitor, and added the other two nodes as hosts. Then I added all available devices as OSDs, created my pools, then created my images, then copied my /etc/ceph folder from the bootstrapped node to my client node. On the client, I ran rbd map mypool/myimage to mount the image as a block device, then used mkfs to create a filesystem on it, and I was able to write data and see the IO from the bootstrapped node. All was well.
Then, as a test, I shut down and restarted the bootstrapped node. When it came back up, I ran ceph status but it just hung with no output. Every single ceph and rbd command now hangs and I have no idea how to recover or properly reset or fix my cluster.
Has anyone ever had the ceph command hang on their cluster, and what did you do to solve it?
Let me share a similar experience. Some time ago I also tried to run some tests on Ceph (Mimic, I think) and my VirtualBox VMs acted very strangely, nothing comparable to actual bare-metal servers, so please bear this in mind... the tests are not quite relevant.
Regarding your problem, check the following:
have at least 3 monitors (an odd number). It's possible the hang is because of a monitor election.
make sure the networking part is OK (separate VLANs for Ceph servers and clients)
DNS is resolving OK (you have added the server names in /etc/hosts; a quick check is sketched below)
...just my 2 cents...
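For that last point, here is a small sketch to confirm from the client that every cluster hostname resolves; the node names are placeholders for your actual monitor hosts.

import socket

# Placeholders -- substitute your real monitor/client hostnames.
nodes = ["ceph-node1", "ceph-node2", "ceph-node3"]

for name in nodes:
    try:
        print("%s -> %s" % (name, socket.gethostbyname(name)))
    except socket.gaierror as err:
        print("%s FAILED to resolve: %s" % (name, err))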

Load generated by distributed worker processes is not equivalent to that generated by a single process

I started another test trying to figure out how users are allocated to worker nodes.
Here is my locust file:
from locust import HttpUser, task, between

@task
def mytask(self):
    self.client.get("/")

class QuickstartUser(HttpUser):
    wait_time = between(1, 2)
    tasks = [mytask]
It does nothing but access a Chinese search engine website that never fails to respond.
When I start 30 users running on a single node, the RPS is 20.
locust -f try_baidu.py
and got the running status and results shown below.
I then switched to distributed mode using the following commands in 3 terminals on my computer:
locust -f try_baidu.py --master #for master node
locust -f try_baidu.py --worker --master-host=127.0.0.1 #for worker node each
I then entered the same number of users and the same hatch rate in the Locust UI as above: 30 users and a hatch rate of 10.
I got roughly the same RPS, around 20, and each worker node ran 15 users.
This shows that the number of users entered in the UI is the total to be simulated, spread across the worker nodes, something like load balancing the load generation.
But I don't know why the same number of users gives two different RPS figures when running on a single node (Scenario 1) versus distributed (Scenario 2). They should result in the same, or very close, RPS as in the test above.
The only difference I can tell is that the comparison above was on the same computer, while Scenario 2 has its worker nodes on 2 remote Linux VMs. But is that the real reason?
The question may not be asked very clearly, so I've added some test results here to show what I get when running distributed versus on a single node with the specified users.
Scenario 1: Single Node
Scenario 2: Trying to simulate 3 worker processes, each running 30 users, but getting an even lower RPS.
From the console I can see that each worker process starts 30 users as expected, but I have no idea why the RPS is only about 1/3 of the single-node figure.
Scenario 3: Tripling the users to 90 for each worker process and getting almost the same RPS as running on a single node.
It seems Scenario 3 is what I expected for triple the simulated load. But why does the Locust web panel show each worker process running 90 users?
Scenario 4: To make sure Locust truly distributes the specified users across the worker nodes, I put 30 users on a single worker node and got the same RPS as the single-node (non-distributed) run.
Do I have to add up the users to be distributed across the worker nodes and enter that total?
The problem was solved to some extent by moving the master node to another system. I guess something was interfering with the master collecting the request stats sent from one of the workers, which resulted in half the expected RPS. This may suggest a possible fix when the RPS is lower than it should be.
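To see the dispatch behaviour without the web UI, here is a hedged sketch using Locust's programmatic API; it assumes a recent Locust 2.x (where the hatch rate is called spawn rate), that try_baidu.py with QuickstartUser is importable, and the worker count and timings are made up. The point is that the user_count handed to the master is the total, which the master splits across whatever workers have connected.

import gevent
from locust.env import Environment
from try_baidu import QuickstartUser  # the user class from the locustfile above

env = Environment(user_classes=[QuickstartUser])
runner = env.create_master_runner(master_bind_host="*", master_bind_port=5557)

# Wait until 3 workers (each started with
# `locust -f try_baidu.py --worker --master-host=<master-ip>`) have connected.
while len(runner.clients.ready) < 3:
    gevent.sleep(1)

# 90 users in total -> roughly 30 per worker, not 90 per worker.
runner.start(user_count=90, spawn_rate=10)
gevent.spawn_later(60, runner.quit)  # stop the test after 60 seconds
runner.greenlet.join()

If the RPS here still falls well short of three times the single-node figure, that points at the master host or the network between master and workers rather than at how the users are allocated.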

MPI Processes Communication error

I've got a Raspberry Pi cluster with three nodes. I've installed MPI on it and tried to execute an example program named cpi. The thing is, I get this error:
The command executed on the master node:
mpiexec -f machinefile -n 2 mpi-build/examples/cpi
The result:
Process 0 of 2 is on Pi01
Fatal error in PMPI_Reduce: A process has failed,
error stack:PMPI_Reduce(1259)...............:MPI_Reduce(sbuf=0xbebc6630,rbuf=0xbebc6638,count=1, MPI_DOUBLE, MPI_SUM, root=0, MPI_COMM_WORLD) failed
MPIR_Reduce_impl(1071)..........:
MPIR_Reduce_intra(877)..........:
MPIR_Reduce_binomial(184).......:
MPIDI_CH3U_Recvq_FDU_or_AEP(630): Communication error with rank 1
Process 1 of 2 is on Pi02
I've set up SSH keys between the master and each slave, so there is no need to use a password to log in from the master to each node. (But if a slave connects to another slave it must log in using a password; I shared the SSH keys only between the master and the slaves, not between the slaves themselves.)
Programs that print "hello world" with the process rank and the host that executed it work properly, but when a process needs to communicate with another, I get the error above.
What should I do?
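One way to narrow this down is to run the same Reduce pattern as cpi from a tiny script and see whether it fails in the same place. A hedged mpi4py sketch, assuming mpi4py is installed on every Pi; the script name is made up and it is launched the same way as cpi, e.g. mpiexec -f machinefile -n 2 python reduce_test.py.

# reduce_test.py (hypothetical name)
import socket
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
print("Process %d of %d is on %s" % (rank, size, socket.gethostname()))

# Each rank contributes its rank number; only rank 0 receives the sum.
# If this hangs or dies the way cpi does, the problem is inter-node MPI
# communication (ports, firewall, /etc/hosts), not the example program.
total = comm.reduce(rank, op=MPI.SUM, root=0)
if rank == 0:
    print("sum of ranks: %d" % total)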

MPI has only master nodes

I'm trying to use MPI with my 4-core processor.
I have followed this tutorial : http://debianclusters.org/index.php/MPICH:_Starting_a_Global_MPD_Ring
But at the end, when I run the hello.out program, I only get the server process (master nodes):
mpiexec -np 4 ./hello.out
Hello MPI from the server process!
Hello MPI from the server process!
Hello MPI from the server process!
Hello MPI from the server process!
I have searched all around the web but couldn't find any clues for this problem.
Here is my mpdtrace result:
[nls@debian] ~ $ mpd --ncpus=4 --daemon
[nls@debian] ~ $ mpdtrace -l
debian_52063 (127.0.0.1)
Shouldn't I get one trace line per core?
Thanks for your help,
Malchance
95% of the time, when you see this problem -- MPI tasks not getting the "right" rank IDs, usually all ending up as rank zero -- it means there's a mismatch in MPI libraries. Either the mpiexec doing the launching isn't the same as the mpicc (or whatever) used to compile the program, or the MPI libraries the child processes pick up at launch (if linked dynamically) are different from those intended. So I'd start by double-checking those things.
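If mpi4py happens to be available, here is a quick hedged check of the symptom (the script name is made up): launch it with the same mpiexec you use for hello.out and compare what it reports against the mpicc you compiled with.

# rank_check.py (hypothetical name): run as `mpiexec -np 4 python rank_check.py`
from mpi4py import MPI

comm = MPI.COMM_WORLD
# With a matching launcher and library you should see ranks 0..3 of 4.
# If every process reports "rank 0 of 1", the mpiexec doing the launching
# and the MPI library the code was built against do not match.
print("rank %d of %d (built against %s)"
      % (comm.Get_rank(), comm.Get_size(), str(MPI.get_vendor())))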