ipyparallel displaying "registration: purging stalled registration" - ipython

I am trying to use the ipyparallel library to run an ipcontroller and ipengine on different machines.
My setup is as follows:
Remote machine:
Windows Server 2012 R2 x64, running an ipcontroller, listening on port 5900 and ip=0.0.0.0.
Local machine:
Windows 10 x64, running an ipengine, listening on the remote machine's ip and port 5900.
Controller start command:
ipcontroller --ip=0.0.0.0 --port=5900 --reuse --log-to-file=True
Engine start command:
ipengine --file=/c/Users/User/ipcontroller-engine.json --timeout=10 --log-to-file=True
I've changed the interface field in ipcontroller-engine.json from "tcp://127.0.0.1" to "tcp://" for ipengine.
On startup, here is a snapshot of the ipcontroller log:
2016-10-10 01:14:00.651 [IPControllerApp] Hub listening on tcp://0.0.0.0:5900 for registration.
2016-10-10 01:14:00.677 [IPControllerApp] Hub using DB backend: 'DictDB'
2016-10-10 01:14:00.956 [IPControllerApp] hub::created hub
2016-10-10 01:14:00.957 [IPControllerApp] task::using Python leastload Task scheduler
2016-10-10 01:14:00.959 [IPControllerApp] Heartmonitor started
2016-10-10 01:14:00.967 [IPControllerApp] Creating pid file: C:\Users\Administrator\.ipython\profile_default\pid\ipcontroller.pid
2016-10-10 01:14:02.102 [IPControllerApp] client::client b'\x00\x80\x00\x00)' requested 'connection_request'
2016-10-10 01:14:02.102 [IPControllerApp] client::client [b'\x00\x80\x00\x00)'] connected
2016-10-10 01:14:47.895 [IPControllerApp] client::client b'82f5efed-52eb-46f2-8c92-e713aee8a363' requested 'registration_request'
2016-10-10 01:15:05.437 [IPControllerApp] client::client b'efe6919d-98ac-4544-a6b8-9d748f28697d' requested 'registration_request'
2016-10-10 01:15:17.899 [IPControllerApp] registration::purging stalled registration: 1
And the ipengine log:
2016-10-10 13:44:21.037 [IPEngineApp] Registering with controller at tcp://172.17.3.14:5900
2016-10-10 13:44:21.508 [IPEngineApp] Starting to monitor the heartbeat signal from the hub every 3010 ms.
2016-10-10 13:44:21.522 [IPEngineApp] Completed registration with id 1
2016-10-10 13:44:27.529 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).
2016-10-10 13:44:30.539 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (2 time(s) in a row).
...
2016-10-10 13:46:52.009 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (49 time(s) in a row).
2016-10-10 13:46:55.028 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (50 time(s) in a row).
2016-10-10 13:46:55.028 [IPEngineApp] CRITICAL | Maximum number of heartbeats misses reached (50 times 3010 ms), shutting down.
(There is a 12.5 hour time difference between the local machine and the remote VM)
Any idea why this may happen?

If you are using --reuse, make sure to remove the files if you change settings. It's possible that it doesn't behave well when --reuse is given and you change things like --ip, as the connection file may be overriding your command-line arguments.
When setting --ip=0.0.0.0, it may be useful to also set --location=a.b.c.d where a.b.c.d is an ip address of the controller that you know is accessible to the engines. Changing the
If registration works and subsequent connections don't, this may be due to a firewall only opening one port, e.g. 5900. The machine running the controller needs to have all the ports listed in the connection file open. You can specify these to be a port-range by manually entering port numbers in the connection files.

Related

Hyperledger fabric chaincode connection with peer getting dropped

I have a hyperledger fabric network version 2.4.4 running on Kubernetes. The peers and other components are running behind istio ingress. The chaincode is running on dind (docker-in-docker) container and connects to peer through its URL. The problem is the chaincode connection is being dropped after few minutes. Below are the logs:
2022-07-14T04:31:13.057Z info [c-api:lib/handler.js] [assetschannel-ddc183b4] Calling chaincode Invoke() succeeded. Sending COMPLETED message back to peer
2022-07-14T04:33:04.197Z error [c-api:lib/handler.js] Chat stream with peer - on error: %j "Error: 14 UNAVAILABLE: Connection dropped\n at Object.callErrorFromStatus (/usr/local/src/node_modules/#grpc/grpc-js/build/src/call.js:31:26)\n at Object.onReceiveStatus (/usr/local/src/node_modules/#grpc/grpc-js/build/src/client.js:391:49)\n at Object.onReceiveStatus (/usr/local/src/node_modules/#grpc/grpc-js/build/src/client-interceptors.js:328:181)\n at /usr/local/src/node_modules/#grpc/grpc-js/build/src/call-stream.js:187:78\n at processTicksAndRejections (node:internal/process/task_queues:78:11)"
I did set the following environment variables in the peer pod to keep the connection alive:
CORE_CHAINCODE_KEEPALIVE: 60000
CORE_PEER_KEEPALIVE_CLIENT_INTERVAL: 600s
CORE_PEER_KEEPALIVE_CLIENT_TIMEOUT: 2s
CORE_PEER_KEEPALIVE_DELIVERYCLIENT_INTERVAL: 20s
CORE_PEER_KEEPALIVE_MININTERVAL: 15s
but this did not resolve the issue.
Any suggestions would be appreciated.
It appears to be an issue with aws elb. The idle timeout was set to 60s which was breaking the connection between chaincode and peer when there was no communication between them. Increasing this time fixed the issue.

Hyperledger Fabric in Kubernetes: Not able to instantiate chaincode

Hello Everyone i am working on setup of fabric default first-network in kubernetes. But when i am instantiate the chaincode it gives me error. Please check below are my peer logs.
2019-07-22 07:25:02.134 UTC [endorser] SimulateProposal -> ERRO 066 [mychannel][c4b4e2ae] failed to invoke chaincode name:"lscc" , error: container exited with 0
github.com/hyperledger/fabric/core/chaincode.(*RuntimeLauncher).Launch.func1
/opt/gopath/src/github.com/hyperledger/fabric/core/chaincode/runtime_launcher.go:63
runtime.goexit
/opt/go/src/runtime/asm_amd64.s:1333
chaincode registration failed
Getting this error on Cli :-
2019-07-22 07:24:58.263 UTC [chaincodeCmd] checkChaincodeCmdParams -> INFO 001 Using default escc
2019-07-22 07:24:58.264 UTC [chaincodeCmd] checkChaincodeCmdParams -> INFO 002 Using default vscc
Error: could not assemble transaction, err proposal response was not successful, error code 500, msg chaincode registration failed: container exited with 0
Once check if all your docker containers are up and running and if you are simply running the sample network without making any changes to the smart contract and the docker files then you can simply stop your network and freshly start the network(it worked in my case).
I have check with my configuration files it was due to wrong CORE_PEER_CHAINCODELISTENADDRESS env variable value for the peer.

Flink: HA mode killing leading jobmanager terminating standby jobmanagers

I am trying to get Flink to run in HA mode using Zookeeper, but when I try to test it by killing the leader JobManager all my standby jobmanagers get killed too.
So instead of a standby jobmanager taking over as the new Leader, they all get killed which isn't supposed to happen.
My setup:
4 servers, 3 of those servers have Zookeeper running, but only 1 server will host all the JobManagers.
ad011.local: Zookeeper + Jobmanagers
ad012.local: Zookeeper + Taskmanager
ad013.local: Zookeeper
ad014.local: nothing interesting
My masters file looks like this:
ad011.local:8081
ad011.local:8082
ad011.local:8083
My flink-conf.yaml:
jobmanager.rpc.address: ad011.local
blob.server.port: 6130,6131,6132
jobmanager.heap.mb: 512
taskmanager.heap.mb: 128
taskmanager.numberOfTaskSlots: 4
parallelism.default: 2
taskmanager.tmp.dirs: /var/flink/data
metrics.reporters: jmx
metrics.reporter.jmx.class: org.apache.flink.metrics.jmx.JMXReporter
metrics.reporter.jmx.port: 8789,8790,8791
high-availability: zookeeper
high-availability.zookeeper.quorum: ad011.local:2181,ad012.local:2181,ad013.local:2181
high-availability.zookeeper.path.root: /flink
high-availability.zookeeper.path.cluster-id: /cluster-one
high-availability.storageDir: /var/flink/recovery
high-availability.jobmanager.port: 50000,50001,50002
When I run flink by using start-cluster.sh script I see my 3 JobManagers running, and going to the WebUI they all point to ad011.local:8081, which is the leader. Which is okay I guess?
I then try to test the failover by killing the leader using kill and then all my other standby JobManagers stop too.
This is what I see in my standby JobManager logs:
2017-09-29 08:08:41,590 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager at akka.tcp://flink#ad011.local:50002/user/jobmanager.
2017-09-29 08:08:41,590 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService#72d546c8.
2017-09-29 08:08:41,598 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Starting with JobManager akka.tcp://flink#ad011.local:50002/user/jobmanager on port 8083
2017-09-29 08:08:41,598 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService.
2017-09-29 08:08:41,645 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink#ad011.local:50000/user/jobmanager:f7dc2c48-dfa5-45a4-a63e-ff27be21363a.
2017-09-29 08:08:41,651 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService.
2017-09-29 08:08:41,722 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Received leader address but not running in leader ActorSystem. Cancelling registration.
2017-09-29 09:26:13,472 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#ad011.local:50000] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2017-09-29 09:26:14,274 INFO org.apache.flink.runtime.jobmanager.JobManager - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2017-09-29 09:26:14,284 INFO org.apache.flink.runtime.blob.BlobServer - Stopped BLOB server at 0.0.0.0:6132
Any help would be appreciated.
Solved it by running my cluster using ./bin/start-cluster.sh instead of using service files (which calls the same script), the service file kills the other jobmanagers apparently.

ipython with MPI clustering using machinefile

I have successfully configured mpi with mpi4py support across three nodes, as per testing of the hellowworld.py script in the mpi4py demo directory:
gms#host:~/development/mpi$ mpiexec -f machinefile -n 10 python ~/development/mpi4py/demo/helloworld.py
Hello, World! I am process 3 of 10 on host.
Hello, World! I am process 1 of 10 on worker1.
Hello, World! I am process 6 of 10 on host.
Hello, World! I am process 2 of 10 on worker2.
Hello, World! I am process 4 of 10 on worker1.
Hello, World! I am process 9 of 10 on host.
Hello, World! I am process 5 of 10 on worker2.
Hello, World! I am process 7 of 10 on worker1.
Hello, World! I am process 8 of 10 on worker2.
Hello, World! I am process 0 of 10 on host.
I am now trying to get this working in ipython and have added my machinefile to my $IPYTHON_DIR/profile_mpi/ipcluster_config.py file, as follows:
c.MPILauncher.mpi_args = ["-machinefile", "/home/gms/development/mpi/machinefile"]
I then start iPython notebook on my head node using the command: ipython notebook --profile=mpi --ip=* --port=9999 --no-browser &
and, voila, I can access it just fine from another device on my local network. However, when I run helloworld.py from iPython notebook, I only get a response from the head node: Hello, World! I am process 0 of 10 on host.
I started mpi from iPython with 10 engines, but...
I further configured these parameters, just in case
in $IPYTHON_DIR/profile_mpi/ipcluster_config.py
c.IPClusterEngines.engine_launcher_class = 'MPIEngineSetLauncher'
in $IPYTHON_DIR/profile_mpi/ipengine_config.py
c.MPI.use = 'mpi4py'
in $IPYTHON_DIR/profile_mpi/ipcontroller_config.py
c.HubFactory.ip = '*'
However, these did not help, either.
What am I missing to get this working correctly?
EDIT UPDATE 1
I now have NFS mounted directories on my worker nodes, and thus, am fulfilling the requirement "Currently ipcluster requires that the IPYTHONDIR/profile_/security directory live on a shared filesystem that is seen by both the controller and engines." to be able to use ipcluster to start my controller and engines, using the command ipcluster start --profile=mpi -n 6 &.
So, I issue this on my head node, and then get:
2016-03-04 20:31:26.280 [IPClusterStart] Starting ipcluster with [daemon=False]
2016-03-04 20:31:26.283 [IPClusterStart] Creating pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcluster.pid
2016-03-04 20:31:26.284 [IPClusterStart] Starting Controller with LocalControllerLauncher
2016-03-04 20:31:27.282 [IPClusterStart] Starting 6 Engines with MPIEngineSetLauncher
2016-03-04 20:31:57.301 [IPClusterStart] Engines appear to have started successfully
Then, proceed to issue the same command to start the engines on the other nodes, but I get:
2016-03-04 20:31:33.092 [IPClusterStart] Removing pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcluster.pid
2016-03-04 20:31:33.095 [IPClusterStart] Starting ipcluster with [daemon=False]
2016-03-04 20:31:33.100 [IPClusterStart] Creating pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcluster.pid
2016-03-04 20:31:33.111 [IPClusterStart] Starting Controller with LocalControllerLauncher
2016-03-04 20:31:34.098 [IPClusterStart] Starting 6 Engines with MPIEngineSetLauncher
[1]+ Stopped ipcluster start --profile=mpi -n 6
with no confirmation that the Engines appear to have started successfully ...
Even more confusing, when I do a ps au on the worker nodes, I get:
gms 3862 0.1 2.5 38684 23740 pts/0 T 20:31 0:01 /usr/bin/python /usr/bin/ipcluster start --profile=mpi -n 6
gms 3874 0.1 1.7 21428 16772 pts/0 T 20:31 0:01 /usr/bin/python -c from IPython.parallel.apps.ipcontrollerapp import launch_new_instance; launch_new_instance() --profile-dir /home/gms/.co
gms 3875 0.0 0.2 4768 2288 pts/0 T 20:31 0:00 mpiexec -n 6 -machinefile /home/gms/development/mpi/machinefile /usr/bin/python -c from IPython.parallel.apps.ipengineapp import launch_new
gms 3876 0.0 0.4 5732 4132 pts/0 T 20:31 0:00 /usr/bin/ssh -x 192.168.1.1 "/usr/bin/hydra_pmi_proxy" --control-port 192.168.1.200:36753 --rmk user --launcher ssh --demux poll --pgid 0 -
gms 3877 0.0 0.1 4816 1204 pts/0 T 20:31 0:00 /usr/bin/hydra_pmi_proxy --control-port 192.168.1.200:36753 --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --proxy-id 1
gms 3878 0.0 0.4 5732 4028 pts/0 T 20:31 0:00 /usr/bin/ssh -x 192.168.1.201 "/usr/bin/hydra_pmi_proxy" --control-port 192.168.1.200:36753 --rmk user --launcher ssh --demux poll --pgid 0
gms 3879 0.0 0.6 8944 6008 pts/0 T 20:31 0:00 /usr/bin/python -c from IPython.parallel.apps.ipengineapp import launch_new_instance; launch_new_instance() --profile-dir /home/gms/.config
gms 3880 0.0 0.6 8944 6108 pts/0 T 20:31 0:00 /usr/bin/python -c from IPython.parallel.apps.ipengineapp import launch_new_instance; launch_new_instance() --profile-dir /home/gms/.config
Where the ip addresses in processes 3376 and 3378 are from the other hosts in the cluster. But...
When I run a similar test directly using ipython, all I get is a response from the localhost (even though, minus, ipython, this works directly with mpi and mpi4py as noted in my original post):
gms#head:~/development/mpi$ ipython test.py
head[3834]: 0/1
gms#head:~/development/mpi$ mpiexec -f machinefile -n 10 ipython test.py
worker1[3961]: 4/10
worker1[3962]: 7/10
head[3946]: 6/10
head[3944]: 0/10
worker2[4054]: 5/10
worker2[4055]: 8/10
head[3947]: 9/10
worker1[3960]: 1/10
worker2[4053]: 2/10
head[3945]: 3/10
I still seem to be missing something obvious, although I am convinced my configuration is now correct. One thing that pops out, is when I start ipcluster on my worker nodes, I get this: 2016-03-04 20:31:33.092 [IPClusterStart] Removing pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcluster.pid
EDIT UPDATE 2
This is more to document what is happening and, hopefully, ultimately what gets this working:
I cleaned out my log files and reissued ipcluster start --profile=mpi -n 6 &
And now see 6-log files for my engines, and 1 for my controller:
drwxr-xr-x 2 gms gms 12288 Mar 6 03:28 .
drwxr-xr-x 7 gms gms 4096 Mar 6 03:31 ..
-rw-r--r-- 1 gms gms 1313 Mar 6 03:28 ipcontroller-15664.log
-rw-r--r-- 1 gms gms 598 Mar 6 03:28 ipengine-15669.log
-rw-r--r-- 1 gms gms 598 Mar 6 03:28 ipengine-15670.log
-rw-r--r-- 1 gms gms 499 Mar 6 03:28 ipengine-4405.log
-rw-r--r-- 1 gms gms 499 Mar 6 03:28 ipengine-4406.log
-rw-r--r-- 1 gms gms 499 Mar 6 03:28 ipengine-4628.log
-rw-r--r-- 1 gms gms 499 Mar 6 03:28 ipengine-4629.log
Looking in the log for ipcontroller it looks like only one engine registered:
2016-03-06 03:28:12.469 [IPControllerApp] Hub listening on tcp://*:34540 for registration.
2016-03-06 03:28:12.480 [IPControllerApp] Hub using DB backend: 'NoDB'
2016-03-06 03:28:12.749 [IPControllerApp] hub::created hub
2016-03-06 03:28:12.751 [IPControllerApp] writing connection info to /home/gms/.config/ipython/profile_mpi/security/ipcontroller-client.json
2016-03-06 03:28:12.754 [IPControllerApp] writing connection info to /home/gms/.config/ipython/profile_mpi/security/ipcontroller-engine.json
2016-03-06 03:28:12.758 [IPControllerApp] task::using Python leastload Task scheduler
2016-03-06 03:28:12.760 [IPControllerApp] Heartmonitor started
2016-03-06 03:28:12.808 [IPControllerApp] Creating pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcontroller.pid
2016-03-06 03:28:14.792 [IPControllerApp] client::client 'a8441250-d3d7-4a0b-8210-dae327665450' requested 'registration_request'
2016-03-06 03:28:14.800 [IPControllerApp] client::client '12fd0bcc-24e9-4ad0-8154-fcf1c7a0e295' requested 'registration_request'
2016-03-06 03:28:18.764 [IPControllerApp] registration::finished registering engine 1:'12fd0bcc-24e9-4ad0-8154-fcf1c7a0e295'
2016-03-06 03:28:18.768 [IPControllerApp] engine::Engine Connected: 1
2016-03-06 03:28:20.800 [IPControllerApp] registration::purging stalled registration: 0
Shouldn't each of the 6 engines be registered?
2 of the engine's logs look like they registered fine:
2016-03-06 03:28:13.746 [IPEngineApp] Initializing MPI:
2016-03-06 03:28:13.746 [IPEngineApp] from mpi4py import MPI as mpi
mpi.size = mpi.COMM_WORLD.Get_size()
mpi.rank = mpi.COMM_WORLD.Get_rank()
2016-03-06 03:28:14.735 [IPEngineApp] Loading url_file u'/home/gms/.config/ipython/profile_mpi/security/ipcontroller-engine.json'
2016-03-06 03:28:14.780 [IPEngineApp] Registering with controller at tcp://127.0.0.1:34540
2016-03-06 03:28:15.282 [IPEngineApp] Using existing profile dir:
u'/home/gms/.config/ipython/profile_mpi'
2016-03-06 03:28:15.286 [IPEngineApp] Completed registration with id 1
while the other registered with id 0
But, the other 4 engines gave a time out error:
2016-03-06 03:28:14.676 [IPEngineApp] Initializing MPI:
2016-03-06 03:28:14.689 [IPEngineApp] from mpi4py import MPI as mpi
mpi.size = mpi.COMM_WORLD.Get_size()
mpi.rank = mpi.COMM_WORLD.Get_rank()
2016-03-06 03:28:14.733 [IPEngineApp] Loading url_file u'/home/gms/.config/ipython/profile_mpi/security/ipcontroller-engine.json'
2016-03-06 03:28:14.805 [IPEngineApp] Registering with controller at tcp://127.0.0.1:34540
2016-03-06 03:28:16.807 [IPEngineApp] Registration timed out after 2.0 seconds
Hmmm... I think I may try a reinstall of ipython tomorrow.
EDIT UPDATE 3
Conflicting versions of ipython were installed (looks like through apt-get and pip). Uninstalling and reinstall using pip install ipython[all]...
EDIT UPDATE 4
I hope someone is finding this useful AND I hope someone can weigh in at some point to help clarify a few things.
Anywho, I installed a virtualenv to deal isolate my environment, and it looks like some degree of success, I think. I fired up 'ipcluster start -n 4 --profile=mpi' on each of my nodes, then ssh'ed back into my head node and ran a test script, which first calls ipcluster. The following output: So, it is doing some parallel computing.
However, when I run my test script that queries all the nodes, I just get the head node:
But, again, if I just run the straight up mpiexec command, everything is hunky dory.
To add to the confusion, if I look at the processes on the nodes, I see all sorts of behavior to indicate they are working together:
And nothing out of the ordinary in my logs. Why am I not getting nodes returned in my second test script (code included here:):
# test_mpi.py
import os
import socket
from mpi4py import MPI
MPI = MPI.COMM_WORLD
print("{host}[{pid}]: {rank}/{size}".format(
host=socket.gethostname(),
pid=os.getpid(),
rank=MPI.rank,
size=MPI.size,
))
Not sure why, but I recreated my ipcluster_config.py file and again added c.MPILauncher.mpi_args = ["-machinefile", "path_to_file/machinefile"] to it and this time it worked - for some bizarre reason. I could swear I had this in it before, but alas...

mesos slaves are not connecting with mesos masters cluster

i have a setup where i am using 3 mesos masters and 3 mesos slasves. after making all the required configurations i can see 3 mesos masters are part of a cluster which is maintained by zookeepers.
now i have setup 3 mesos slaves and when i am starting mesos-slave service, i am expecting that mesos slaves will be available to the mesos masters web UI page. But i can not see any of them in the slaves tab.
selinux, firewall, iptalbes all are disabled. able to perform ssh between node.
[cloud-user#slave1 ~]$ sudo systemctl status mesos-slave -l
mesos-slave.service - Mesos Slave
Loaded: loaded (/usr/lib/systemd/system/mesos-slave.service; enabled)
Active: active (running) since Sat 2016-01-16 16:11:55 UTC; 3s ago
Main PID: 2483 (mesos-slave)
CGroup: /system.slice/mesos-slave.service
├─2483 /usr/sbin/mesos-slave --master=zk://10.0.0.2:2181,10.0.0.6:2181,10.0.0.7:2181/mesos --log_dir=/var/log/mesos --containerizers=docker,mesos --executor_registration_timeout=5mins
├─2493 logger -p user.info -t mesos-slave[2483]
└─2494 logger -p user.err -t mesos-slave[2483]
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.628670 2497 detector.cpp:482] A new leading master (UPID=master#127.0.0.1:5050) is detected
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.628732 2497 slave.cpp:729] New master detected at master#127.0.0.1:5050
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.628825 2497 slave.cpp:754] No credentials provided. Attempting to register without authentication
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.628844 2497 slave.cpp:765] Detecting new master
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.628872 2497 status_update_manager.cpp:176] Pausing sending status updates
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: E0116 16:11:55.628922 2503 process.cpp:1911] Failed to shutdown socket with fd 11: Transport endpoint is not connected
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.629093 2502 slave.cpp:3215] master#127.0.0.1:5050 exited
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: W0116 16:11:55.629107 2502 slave.cpp:3218] Master disconnected! Waiting for a new master to be elected
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: E0116 16:11:55.983531 2503 process.cpp:1911] Failed to shutdown socket with fd 11: Transport endpoint is not connected
Jan 16 16:11:57 slave1.novalocal mesos-slave[2494]: E0116 16:11:57.465049 2503 process.cpp:1911] Failed to shutdown socket with fd 11: Transport endpoint is not connected
So the problematic line is:
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.629093 2502 slave.cpp:3215] master#127.0.0.1:5050 exited
Specifically, note it's detecting the master as having the IP address 127.0.0.1. The Mesos Agent[1] sees that IP address, and tries to connect which fails (The master isn't running on the same machine as the agent).
This happens because the master announces what it thinks it's IP address is into Zookeeper. In your case, the master is thinking it's IP is 127.0.0.1 and then storing that into zk. Mesos has several configuration flags to control this behavior, mainly --hostname, --no-hostname_lookup, --ip, --ip_discovery_command, and via setting the environment variable LIBPROCESS_IP. See http://mesos.apache.org/documentation/latest/configuration/ for details about them and what they do.
The best thing you can do to make sure things work out of the box is to make sure the machines have resolvable hostnames. Mesos does a reverse-DNS lookup of the boxes hostname in order to figure out what IP people will contact it from.
If you can't get the hostnames setup properly, I would recommend setting --hostname and --ip manually which should cause mesos to announce exactly what you want.
[1]The mesos slave has been renamed to agent, see: https://issues.apache.org/jira/browse/MESOS-1478