I have successfully configured MPI with mpi4py support across three nodes, as verified by running the helloworld.py script from the mpi4py demo directory:
gms@host:~/development/mpi$ mpiexec -f machinefile -n 10 python ~/development/mpi4py/demo/helloworld.py
Hello, World! I am process 3 of 10 on host.
Hello, World! I am process 1 of 10 on worker1.
Hello, World! I am process 6 of 10 on host.
Hello, World! I am process 2 of 10 on worker2.
Hello, World! I am process 4 of 10 on worker1.
Hello, World! I am process 9 of 10 on host.
Hello, World! I am process 5 of 10 on worker2.
Hello, World! I am process 7 of 10 on worker1.
Hello, World! I am process 8 of 10 on worker2.
Hello, World! I am process 0 of 10 on host.
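For reference, the demo script is essentially the following (paraphrased from memory, not the verbatim file from the mpi4py repo):
from mpi4py import MPI

comm = MPI.COMM_WORLD
print("Hello, World! I am process %d of %d on %s." % (
    comm.Get_rank(), comm.Get_size(), MPI.Get_processor_name()))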
I am now trying to get this working in ipython and have added my machinefile to my $IPYTHON_DIR/profile_mpi/ipcluster_config.py file, as follows:
c.MPILauncher.mpi_args = ["-machinefile", "/home/gms/development/mpi/machinefile"]
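The machinefile itself is just an MPICH/Hydra-style host list; mine is roughly the following (one hostname per line, matching the hosts in the output above; a slot count can also be appended as hostname:N, but I let Hydra round-robin the ranks):
host
worker1
worker2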
I then start IPython notebook on my head node using the command: ipython notebook --profile=mpi --ip=* --port=9999 --no-browser &
and, voila, I can access it just fine from another device on my local network. However, when I run helloworld.py from the IPython notebook, I only get a response from the head node: Hello, World! I am process 0 of 10 on host.
I started MPI from IPython with 10 engines, but...
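For reference, this is roughly the kind of notebook cell I use to exercise the engines (a sketch, using the stock IPython.parallel Client API; the hello function just mirrors the demo script):
from IPython.parallel import Client

rc = Client(profile='mpi')   # connect to the controller started for this profile
view = rc[:]                 # DirectView over all registered engines

def hello():
    import socket
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    return "Hello, World! I am process %d of %d on %s." % (
        comm.Get_rank(), comm.Get_size(), socket.gethostname())

for line in view.apply_sync(hello):
    print(line)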
I further configured these parameters, just in case:
in $IPYTHON_DIR/profile_mpi/ipcluster_config.py
c.IPClusterEngines.engine_launcher_class = 'MPIEngineSetLauncher'
in $IPYTHON_DIR/profile_mpi/ipengine_config.py
c.MPI.use = 'mpi4py'
in $IPYTHON_DIR/profile_mpi/ipcontroller_config.py
c.HubFactory.ip = '*'
However, these did not help, either.
What am I missing to get this working correctly?
EDIT UPDATE 1
I now have NFS-mounted directories on my worker nodes and am thus fulfilling the requirement "Currently ipcluster requires that the IPYTHONDIR/profile_/security directory live on a shared filesystem that is seen by both the controller and engines." This lets me use ipcluster to start my controller and engines with the command ipcluster start --profile=mpi -n 6 &.
So, I issue this on my head node, and then get:
2016-03-04 20:31:26.280 [IPClusterStart] Starting ipcluster with [daemon=False]
2016-03-04 20:31:26.283 [IPClusterStart] Creating pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcluster.pid
2016-03-04 20:31:26.284 [IPClusterStart] Starting Controller with LocalControllerLauncher
2016-03-04 20:31:27.282 [IPClusterStart] Starting 6 Engines with MPIEngineSetLauncher
2016-03-04 20:31:57.301 [IPClusterStart] Engines appear to have started successfully
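At this point, a quick sanity check from a Python session on the head node (again a sketch with the stock IPython.parallel API) shows how many engines the controller actually sees:
from IPython.parallel import Client

rc = Client(profile='mpi')
print(rc.ids)   # expect [0, 1, 2, 3, 4, 5] if all 6 engines registered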
Then, I proceed to issue the same command to start the engines on the other nodes, but I get:
2016-03-04 20:31:33.092 [IPClusterStart] Removing pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcluster.pid
2016-03-04 20:31:33.095 [IPClusterStart] Starting ipcluster with [daemon=False]
2016-03-04 20:31:33.100 [IPClusterStart] Creating pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcluster.pid
2016-03-04 20:31:33.111 [IPClusterStart] Starting Controller with LocalControllerLauncher
2016-03-04 20:31:34.098 [IPClusterStart] Starting 6 Engines with MPIEngineSetLauncher
[1]+ Stopped ipcluster start --profile=mpi -n 6
with no confirmation that the Engines appear to have started successfully ...
Even more confusing, when I do a ps au on the worker nodes, I get:
gms 3862 0.1 2.5 38684 23740 pts/0 T 20:31 0:01 /usr/bin/python /usr/bin/ipcluster start --profile=mpi -n 6
gms 3874 0.1 1.7 21428 16772 pts/0 T 20:31 0:01 /usr/bin/python -c from IPython.parallel.apps.ipcontrollerapp import launch_new_instance; launch_new_instance() --profile-dir /home/gms/.co
gms 3875 0.0 0.2 4768 2288 pts/0 T 20:31 0:00 mpiexec -n 6 -machinefile /home/gms/development/mpi/machinefile /usr/bin/python -c from IPython.parallel.apps.ipengineapp import launch_new
gms 3876 0.0 0.4 5732 4132 pts/0 T 20:31 0:00 /usr/bin/ssh -x 192.168.1.1 "/usr/bin/hydra_pmi_proxy" --control-port 192.168.1.200:36753 --rmk user --launcher ssh --demux poll --pgid 0 -
gms 3877 0.0 0.1 4816 1204 pts/0 T 20:31 0:00 /usr/bin/hydra_pmi_proxy --control-port 192.168.1.200:36753 --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --proxy-id 1
gms 3878 0.0 0.4 5732 4028 pts/0 T 20:31 0:00 /usr/bin/ssh -x 192.168.1.201 "/usr/bin/hydra_pmi_proxy" --control-port 192.168.1.200:36753 --rmk user --launcher ssh --demux poll --pgid 0
gms 3879 0.0 0.6 8944 6008 pts/0 T 20:31 0:00 /usr/bin/python -c from IPython.parallel.apps.ipengineapp import launch_new_instance; launch_new_instance() --profile-dir /home/gms/.config
gms 3880 0.0 0.6 8944 6108 pts/0 T 20:31 0:00 /usr/bin/python -c from IPython.parallel.apps.ipengineapp import launch_new_instance; launch_new_instance() --profile-dir /home/gms/.config
The IP addresses in processes 3876 and 3878 are from the other hosts in the cluster. But...
When I run a similar test directly using ipython, all I get is a response from localhost (even though, without ipython, this works directly with MPI and mpi4py, as noted in my original post):
gms@head:~/development/mpi$ ipython test.py
head[3834]: 0/1
gms@head:~/development/mpi$ mpiexec -f machinefile -n 10 ipython test.py
worker1[3961]: 4/10
worker1[3962]: 7/10
head[3946]: 6/10
head[3944]: 0/10
worker2[4054]: 5/10
worker2[4055]: 8/10
head[3947]: 9/10
worker1[3960]: 1/10
worker2[4053]: 2/10
head[3945]: 3/10
I still seem to be missing something obvious, although I am convinced my configuration is now correct. One thing that stands out is that when I start ipcluster on my worker nodes, I get this: 2016-03-04 20:31:33.092 [IPClusterStart] Removing pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcluster.pid
EDIT UPDATE 2
This is more to document what is happening and, hopefully, ultimately what gets this working:
I cleaned out my log files and reissued ipcluster start --profile=mpi -n 6 &
I now see 6 log files for my engines, and 1 for my controller:
drwxr-xr-x 2 gms gms 12288 Mar 6 03:28 .
drwxr-xr-x 7 gms gms 4096 Mar 6 03:31 ..
-rw-r--r-- 1 gms gms 1313 Mar 6 03:28 ipcontroller-15664.log
-rw-r--r-- 1 gms gms 598 Mar 6 03:28 ipengine-15669.log
-rw-r--r-- 1 gms gms 598 Mar 6 03:28 ipengine-15670.log
-rw-r--r-- 1 gms gms 499 Mar 6 03:28 ipengine-4405.log
-rw-r--r-- 1 gms gms 499 Mar 6 03:28 ipengine-4406.log
-rw-r--r-- 1 gms gms 499 Mar 6 03:28 ipengine-4628.log
-rw-r--r-- 1 gms gms 499 Mar 6 03:28 ipengine-4629.log
Looking in the ipcontroller log, it looks like only one engine registered:
2016-03-06 03:28:12.469 [IPControllerApp] Hub listening on tcp://*:34540 for registration.
2016-03-06 03:28:12.480 [IPControllerApp] Hub using DB backend: 'NoDB'
2016-03-06 03:28:12.749 [IPControllerApp] hub::created hub
2016-03-06 03:28:12.751 [IPControllerApp] writing connection info to /home/gms/.config/ipython/profile_mpi/security/ipcontroller-client.json
2016-03-06 03:28:12.754 [IPControllerApp] writing connection info to /home/gms/.config/ipython/profile_mpi/security/ipcontroller-engine.json
2016-03-06 03:28:12.758 [IPControllerApp] task::using Python leastload Task scheduler
2016-03-06 03:28:12.760 [IPControllerApp] Heartmonitor started
2016-03-06 03:28:12.808 [IPControllerApp] Creating pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcontroller.pid
2016-03-06 03:28:14.792 [IPControllerApp] client::client 'a8441250-d3d7-4a0b-8210-dae327665450' requested 'registration_request'
2016-03-06 03:28:14.800 [IPControllerApp] client::client '12fd0bcc-24e9-4ad0-8154-fcf1c7a0e295' requested 'registration_request'
2016-03-06 03:28:18.764 [IPControllerApp] registration::finished registering engine 1:'12fd0bcc-24e9-4ad0-8154-fcf1c7a0e295'
2016-03-06 03:28:18.768 [IPControllerApp] engine::Engine Connected: 1
2016-03-06 03:28:20.800 [IPControllerApp] registration::purging stalled registration: 0
Shouldn't each of the 6 engines be registered?
Two of the engine logs look like they registered fine:
2016-03-06 03:28:13.746 [IPEngineApp] Initializing MPI:
2016-03-06 03:28:13.746 [IPEngineApp] from mpi4py import MPI as mpi
mpi.size = mpi.COMM_WORLD.Get_size()
mpi.rank = mpi.COMM_WORLD.Get_rank()
2016-03-06 03:28:14.735 [IPEngineApp] Loading url_file u'/home/gms/.config/ipython/profile_mpi/security/ipcontroller-engine.json'
2016-03-06 03:28:14.780 [IPEngineApp] Registering with controller at tcp://127.0.0.1:34540
2016-03-06 03:28:15.282 [IPEngineApp] Using existing profile dir:
u'/home/gms/.config/ipython/profile_mpi'
2016-03-06 03:28:15.286 [IPEngineApp] Completed registration with id 1
while the other registered with id 0
But the other 4 engines gave a timeout error:
2016-03-06 03:28:14.676 [IPEngineApp] Initializing MPI:
2016-03-06 03:28:14.689 [IPEngineApp] from mpi4py import MPI as mpi
mpi.size = mpi.COMM_WORLD.Get_size()
mpi.rank = mpi.COMM_WORLD.Get_rank()
2016-03-06 03:28:14.733 [IPEngineApp] Loading url_file u'/home/gms/.config/ipython/profile_mpi/security/ipcontroller-engine.json'
2016-03-06 03:28:14.805 [IPEngineApp] Registering with controller at tcp://127.0.0.1:34540
2016-03-06 03:28:16.807 [IPEngineApp] Registration timed out after 2.0 seconds
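One quick check I can run from a worker node is whether the controller's registration port is even reachable (a sketch; 192.168.1.200 is my head node and 34540 is the registration port from the ipcontroller log above):
import socket

s = socket.create_connection(("192.168.1.200", 34540), timeout=5)  # raises on failure
print("registration port reachable")
s.close()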
Hmmm... I think I may try a reinstall of ipython tomorrow.
EDIT UPDATE 3
Conflicting versions of ipython were installed (it looks like through both apt-get and pip). Uninstalling and reinstalling using pip install ipython[all]...
EDIT UPDATE 4
I hope someone is finding this useful AND I hope someone can weigh in at some point to help clarify a few things.
Anywho, I installed a virtualenv to isolate my environment, and it looks like I have had some degree of success. I fired up ipcluster start -n 4 --profile=mpi on each of my nodes, then ssh'ed back into my head node and ran a test script, which first calls ipcluster. The output shows it is doing some parallel computing.
However, when I run my test script that queries all the nodes, I only get a response from the head node.
But, again, if I just run the straight-up mpiexec command, everything is hunky dory.
To add to the confusion, if I look at the processes on the nodes, I see all sorts of behavior indicating they are working together.
And there is nothing out of the ordinary in my logs. Why am I not getting all the nodes returned by my second test script (code included below)?
# test_mpi.py
import os
import socket
from mpi4py import MPI

comm = MPI.COMM_WORLD  # use the world communicator (rather than shadowing the MPI module)

print("{host}[{pid}]: {rank}/{size}".format(
    host=socket.gethostname(),  # node this rank is running on
    pid=os.getpid(),            # process id of this rank
    rank=comm.rank,             # MPI rank
    size=comm.size,             # total number of ranks
))
Not sure why, but I recreated my ipcluster_config.py file and again added c.MPILauncher.mpi_args = ["-machinefile", "path_to_file/machinefile"] to it, and this time it worked, for some bizarre reason. I could swear I had this in it before, but alas...
Related
There is a container in my Kubernetes cluster which I want to debug.
But there is no netstat, no ip, and no apk.
Is there a way to upgrade this image, so that the common tools are installed?
In this case it is the nginx container image in a K8s 1.23 cluster.
Alpine is a stripped-down image intended to reduce the footprint, so the absence of those tools is expected. However, since Kubernetes 1.23 you can use the kubectl debug command to attach a debug container to the pod in question.
Syntax:
kubectl debug -it <POD_TO_DEBUG> --image=ubuntu --target=<CONTAINER_TO_DEBUG> --share-processes
Example:
In the example below, the ubuntu container is attached to the nginx-alpine pod that requires debugging. Also, note that the ps -eaf output shows the nginx processes running and cat /etc/os-release shows Ubuntu, indicating that the process namespace is shared/visible between the two containers.
ps@kube-master:~$ kubectl debug -it nginx --image=ubuntu --target=nginx --share-processes
Targeting container "nginx". If you don't see processes from this container, the container runtime doesn't support this feature.
Defaulting debug container name to debugger-2pgtt.
If you don't see a command prompt, try pressing enter.
root@nginx:/# ps -eaf
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 19:50 ? 00:00:00 nginx: master process nginx -g daemon off;
101 33 1 0 19:50 ? 00:00:00 nginx: worker process
101 34 1 0 19:50 ? 00:00:00 nginx: worker process
101 35 1 0 19:50 ? 00:00:00 nginx: worker process
101 36 1 0 19:50 ? 00:00:00 nginx: worker process
root 248 0 1 20:00 pts/0 00:00:00 bash
root 258 248 0 20:00 pts/0 00:00:00 ps -eaf
root@nginx:/#
Debugging with Ubuntu, as seen here, arms us with all sorts of tools:
root@nginx:/# cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.3 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.3 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
root@nginx:/#
If ephemeral containers need to be enabled in your cluster, you can enable them via feature gates as described here.
The whole point of using containers is to optimize the resource utilization in your cluster. The images used should only include the packages that are needed to run your app.
Unwanted packages should be removed from your images (especially in prod) to reduce compute utilization and to reduce the attack surface.
This appears to be a stripped-down image that has only the libraries needed to run that application.
In order to debug, you will have to create a new container in the same PID and network namespace as the container you are trying to debug.
Build container first
Dockerfile
FROM alpine
RUN apk update && apk add strace
CMD ["strace", "-p", "1"]
Build
$ docker build -t strace .
Run
docker run -t --pid=container:<targetContainer> \
--net=container:<targetContainer> \
--cap-add sys_admin \
--cap-add sys_ptrace \
strace
strace: Process 1 attached
futex(0xd72e90, FUTEX_WAIT, 0, NULL
https://rothgar.medium.com/how-to-debug-a-running-docker-container-from-a-separate-container-983f11740dc6
I'm having a problem with an init.d script on my Raspberry Pi 4 (4GB) with Raspbian 10.
I've followed the guide in the official docs and compiled RethinkDB without any problem.
Then I configured it as described in the Deployment docs:
Created a conf file in /etc/rethinkdb/instances.d/<conf_name>.conf;
Copied the init.d script: sudo cp /home/pi/rethinkdb-2.4.1/packaging/assets/init/rethinkdb /etc/init.d/rethinkdb
Added the default runlevels: sudo update-rc.d rethinkdb defaults
I can start the server with the command rethinkdb --config-file /etc/rethinkdb/instances.d/instance1.conf and it gives me no problem:
pi@homeserverpi:~ $ rethinkdb --config-file /etc/rethinkdb/instances.d/instance1.conf
WARNING: ignoring --server-name because this server already has a name.
Running rethinkdb 2.4.1 (CLANG 7.0.1 (tags/RELEASE_701/final))...
Running on Linux 5.4.72-v7l+ armv7l
Loading data from directory /home/pi/rethinkdb_data
Listening for intracluster connections on port 29015
Listening for client driver connections on port 28015
Listening for administrative HTTP connections on port 8182
Listening on cluster addresses: 127.0.0.1, 192.168.1.3, ::1, fe80::38b8:6928:e4fd:1a9c%3
Listening on driver addresses: 127.0.0.1, 192.168.1.3, ::1, fe80::38b8:6928:e4fd:1a9c%3
Listening on http addresses: 127.0.0.1, 192.168.1.3, ::1, fe80::38b8:6928:e4fd:1a9c%3
Server ready, "homeserverpi_9x0" 00eb027b-181c-4a15-a170-8ba8299f4f3f
But when I try to start the service, it gives me this:
sudo /etc/init.d/rethinkdb start rethinkdb: instance1: Starting instance. (logging to '/var/lib/rethinkdb/instance1/data/log_file')
/etc/init.d/rethinkdb: 224: /etc/init.d/rethinkdb: /usr/bin/rethinkdb: Permission denied
Permissions
pi@homeserverpi:~ $ ls -alh /etc/init.d/rethinkdb
-rwxr-xr-x 1 root root 7.5K Nov 30 00:20 /etc/init.d/rethinkdb
pi@homeserverpi:~ $ ls -alh /usr/bin/rethinkdb/
total 40K
drwxr-xr-x 2 root root 4.0K Nov 29 23:06 .
drwxr-xr-x 3 root root 36K Nov 29 23:06 ..
Can someone please help me on this?
Thank you
After some time, the Postgres database on my live server stopped working. I have been working on this server for the last 8 months, and now it has suddenly stopped working.
When I try to run psql, it produces an error:
psql: could not connect to server: No such file or directory
Is the server running locally and accepting
connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?
I'm using Odoo 8.
First, you need to make sure the socket file is located at /var/run/postgresql/.s.PGSQL.5432. To check that:
$ ls -l /var/run/postgresql/.s.PGSQL.5432
If the file is listed, the problem lies elsewhere. But if the file is not there, you need to check the /tmp directory (especially for macOS Homebrew users):
$ cd /tmp
$ l
total 16
drwxrwxrwt 7 root wheel 224B Mar 11 08:03 .
drwxr-xr-x 6 root wheel 192B Jan 23 18:35 ..
-rw-r--r-- 1 root wheel 65B Nov 7 22:59 .BBE72B41371180178E084EEAF106AED4F350939DB95D3516864A1CC62E7AE82F
srwxrwxrwx 1 shiva wheel 0B Mar 11 08:03 .s.PGSQL.5432
-rw------- 1 shiva wheel 57B Mar 11 08:03 .s.PGSQL.5432.lock
drwx------ 3 shiva wheel 96B Mar 10 17:11 com.apple.launchd.C1tUB2MvF8
drwxr-xr-x 2 root wheel 64B Mar 10 17:10 powerlog
Now, there are two ways you can solve the error
Solution One
You can change the application configuration to look for the socket at /tmp/.s.PGSQL.5432.
For Rails Users
# config/database.yml
default: &default
adapter: postgresql
pool: 5
# port:
timeout: 5000
encoding: utf8
# min_messages: warning
socket: /tmp/.s.PGSQL.5432
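For Python users (e.g. Odoo), the equivalent is to point the client at the socket directory instead; a minimal sketch with psycopg2, assuming a default postgres database and user just for the connectivity check:
# psycopg2/libpq treat a "host" beginning with "/" as a Unix-socket directory
import psycopg2

conn = psycopg2.connect(host="/tmp", dbname="postgres", user="postgres")
print(conn.get_dsn_parameters())  # shows which socket/host was actually used
conn.close()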
Solution Two
You can create symlinks to the expected location
$ sudo mkdir /var/pgsql_socket
$ sudo ln /tmp/.s.PGSQL.5432 /var/pgsql_socket/
Then the error should go away.
Hope this helps.
Note: Your default socket directory may not be /tmp
Did you update/upgrade your database?
Did you start a docker container that interfered with any of your data-store/socket file locations?
This probably doesn't fit your situation exactly, but maybe it will provide some insight:
Sometimes when you try
sudo systemctl start postgresql.service
and the systemd status says it is started but you still get that error message when trying to connect, try this instead:
sudo pg_ctlcluster <version> <cluster> <action>
which in my case had been
sudo pg_ctlcluster 13 main start
I am trying to use the ipyparallel library to run an ipcontroller and ipengine on different machines.
My setup is as follows:
Remote machine:
Windows Server 2012 R2 x64, running an ipcontroller, listening on port 5900 and ip=0.0.0.0.
Local machine:
Windows 10 x64, running an ipengine, listening on the remote machine's ip and port 5900.
Controller start command:
ipcontroller --ip=0.0.0.0 --port=5900 --reuse --log-to-file=True
Engine start command:
ipengine --file=/c/Users/User/ipcontroller-engine.json --timeout=10 --log-to-file=True
I've changed the interface field in ipcontroller-engine.json from "tcp://127.0.0.1" to "tcp://" for ipengine.
On startup, here is a snapshot of the ipcontroller log:
2016-10-10 01:14:00.651 [IPControllerApp] Hub listening on tcp://0.0.0.0:5900 for registration.
2016-10-10 01:14:00.677 [IPControllerApp] Hub using DB backend: 'DictDB'
2016-10-10 01:14:00.956 [IPControllerApp] hub::created hub
2016-10-10 01:14:00.957 [IPControllerApp] task::using Python leastload Task scheduler
2016-10-10 01:14:00.959 [IPControllerApp] Heartmonitor started
2016-10-10 01:14:00.967 [IPControllerApp] Creating pid file: C:\Users\Administrator\.ipython\profile_default\pid\ipcontroller.pid
2016-10-10 01:14:02.102 [IPControllerApp] client::client b'\x00\x80\x00\x00)' requested 'connection_request'
2016-10-10 01:14:02.102 [IPControllerApp] client::client [b'\x00\x80\x00\x00)'] connected
2016-10-10 01:14:47.895 [IPControllerApp] client::client b'82f5efed-52eb-46f2-8c92-e713aee8a363' requested 'registration_request'
2016-10-10 01:15:05.437 [IPControllerApp] client::client b'efe6919d-98ac-4544-a6b8-9d748f28697d' requested 'registration_request'
2016-10-10 01:15:17.899 [IPControllerApp] registration::purging stalled registration: 1
And the ipengine log:
2016-10-10 13:44:21.037 [IPEngineApp] Registering with controller at tcp://172.17.3.14:5900
2016-10-10 13:44:21.508 [IPEngineApp] Starting to monitor the heartbeat signal from the hub every 3010 ms.
2016-10-10 13:44:21.522 [IPEngineApp] Completed registration with id 1
2016-10-10 13:44:27.529 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).
2016-10-10 13:44:30.539 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (2 time(s) in a row).
...
2016-10-10 13:46:52.009 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (49 time(s) in a row).
2016-10-10 13:46:55.028 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (50 time(s) in a row).
2016-10-10 13:46:55.028 [IPEngineApp] CRITICAL | Maximum number of heartbeats misses reached (50 times 3010 ms), shutting down.
(There is a 12.5 hour time difference between the local machine and the remote VM)
Any idea why this may happen?
If you are using --reuse, make sure to remove the files if you change settings. It's possible that it doesn't behave well when --reuse is given and you change things like --ip, as the connection file may be overriding your command-line arguments.
When setting --ip=0.0.0.0, it may be useful to also set --location=a.b.c.d where a.b.c.d is an ip address of the controller that you know is accessible to the engines. Changing the
If registration works and subsequent connections don't, this may be due to a firewall only opening one port, e.g. 5900. The machine running the controller needs to have all the ports listed in the connection file open. You can specify these to be a port-range by manually entering port numbers in the connection files.
I have a setup with 3 Mesos masters and 3 Mesos slaves. After making all the required configurations, I can see that the 3 Mesos masters are part of a cluster maintained by ZooKeeper.
Now I have set up the 3 Mesos slaves, and when I start the mesos-slave service I expect the slaves to show up on the Mesos masters' web UI page. But I cannot see any of them in the slaves tab.
SELinux, the firewall, and iptables are all disabled, and I am able to SSH between nodes.
[cloud-user@slave1 ~]$ sudo systemctl status mesos-slave -l
mesos-slave.service - Mesos Slave
Loaded: loaded (/usr/lib/systemd/system/mesos-slave.service; enabled)
Active: active (running) since Sat 2016-01-16 16:11:55 UTC; 3s ago
Main PID: 2483 (mesos-slave)
CGroup: /system.slice/mesos-slave.service
├─2483 /usr/sbin/mesos-slave --master=zk://10.0.0.2:2181,10.0.0.6:2181,10.0.0.7:2181/mesos --log_dir=/var/log/mesos --containerizers=docker,mesos --executor_registration_timeout=5mins
├─2493 logger -p user.info -t mesos-slave[2483]
└─2494 logger -p user.err -t mesos-slave[2483]
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.628670 2497 detector.cpp:482] A new leading master (UPID=master@127.0.0.1:5050) is detected
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.628732 2497 slave.cpp:729] New master detected at master@127.0.0.1:5050
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.628825 2497 slave.cpp:754] No credentials provided. Attempting to register without authentication
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.628844 2497 slave.cpp:765] Detecting new master
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.628872 2497 status_update_manager.cpp:176] Pausing sending status updates
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: E0116 16:11:55.628922 2503 process.cpp:1911] Failed to shutdown socket with fd 11: Transport endpoint is not connected
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.629093 2502 slave.cpp:3215] master@127.0.0.1:5050 exited
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: W0116 16:11:55.629107 2502 slave.cpp:3218] Master disconnected! Waiting for a new master to be elected
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: E0116 16:11:55.983531 2503 process.cpp:1911] Failed to shutdown socket with fd 11: Transport endpoint is not connected
Jan 16 16:11:57 slave1.novalocal mesos-slave[2494]: E0116 16:11:57.465049 2503 process.cpp:1911] Failed to shutdown socket with fd 11: Transport endpoint is not connected
So the problematic line is:
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.629093 2502 slave.cpp:3215] master@127.0.0.1:5050 exited
Specifically, note that it detects the master as having the IP address 127.0.0.1. The Mesos agent[1] sees that IP address and tries to connect, which fails (the master isn't running on the same machine as the agent).
This happens because the master announces what it thinks its IP address is into ZooKeeper. In your case, the master thinks its IP is 127.0.0.1 and stores that into zk. Mesos has several configuration flags to control this behavior, mainly --hostname, --no-hostname_lookup, --ip, --ip_discovery_command, and the environment variable LIBPROCESS_IP. See http://mesos.apache.org/documentation/latest/configuration/ for details about them and what they do.
The best thing you can do to make sure things work out of the box is to make sure the machines have resolvable hostnames. Mesos does a reverse-DNS lookup of the box's hostname in order to figure out what IP people will contact it from.
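A quick way to see what that resolution yields on a given box is the following Python sketch; if it prints 127.0.0.1 (or 127.0.1.1), that would explain the master announcing the loopback address:
import socket

hostname = socket.gethostname()
print("%s -> %s" % (hostname, socket.gethostbyname(hostname)))  # the IP the hostname resolves to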
If you can't get the hostnames set up properly, I would recommend setting --hostname and --ip manually, which should cause Mesos to announce exactly what you want.
[1] The Mesos slave has been renamed to agent; see: https://issues.apache.org/jira/browse/MESOS-1478