What am I missing getting mpirun to schedule across multiple nodes? - raspberry-pi

TL;DR: I'm having trouble getting MPI to schedule jobs across more than a single node. There seems to be a communication error between the nodes at the MPI level that isn't a problem at the TCP or Slurm level. Ultimately, I seem to be missing something about setting up MPI communication, and I'm not sure what.
I'm beginning to teach myself some High Performance Computing using a cluster of Raspberry Pis. I've followed the steps in https://glmdev.medium.com/building-a-raspberry-pi-cluster-f5f2446702e8 (all three blog posts in the series) to get a two-node cluster set up (with plans to add more later).
The trouble started when I used Slurm to schedule a Python MPI job (the pi calculation from the blog posts). When scheduling the job as described there, I get the following message:
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
After testing things out, I've confirmed that Slurm works as I expected. For example:
pi@cluster-master:/clusterfs/calc-pi $ srun -n 6 hostname
cluster-master
cluster-master
cluster-01
cluster-01
cluster-01
cluster-01
Running mpirun works when targeting the master (mpirun --host cluster-master -n 1 hostname), but targeting the worker (mpirun --host cluster-01 -n 1 hostname) hangs. Reducing the task count in the pi calculation example also works (and actually runs on cluster-01):
#!/bin/bash
#SBATCH --ntasks=4
cd $SLURM_SUBMIT_DIR
mpiexec -n 4 /clusterfs/usr/bin/python3 calculate.py
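Since mpirun launches its orted daemons over ssh when it isn't running under a resource manager, a quick sanity check is that non-interactive ssh from the master to the worker works and that orted is on the worker's non-interactive PATH. A minimal sketch, assuming the pi user and the hostnames above:
# mpirun outside Slurm starts orted on each --host via ssh,
# so this must succeed with no password prompt:
ssh cluster-01 true && echo "ssh ok"
# orted must also be findable by a non-interactive shell on the worker:
ssh cluster-01 which orted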
EDIT 1
pi@cluster-master:/clusterfs/calc-pi $ srun orted
[cluster-01:01960] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 147
[cluster-01:01960] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file util/session_dir.c at line 106
[cluster-01:01960] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file util/session_dir.c at line 345
[cluster-01:01960] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file base/ess_base_std_orted.c at line 270
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_session_dir failed
--> Returned value Bad parameter (-5) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
srun: error: cluster-01: task 0: Exited with exit code 213
MPI was installed via the command srun --nodes=2 apt install openmpi-bin openmpi-common libopenmpi3 libopenmpi-dev -y. If I'm understanding this correctly, srun will run the apt command on both nodes (which is all I have).
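It's also worth confirming that both nodes ended up with identical Open MPI builds, since mismatched versions can produce orte_init failures of their own. A minimal check, assuming the Slurm setup above:
# run one task on each node and compare the reported versions
srun --nodes=2 mpirun --version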
Adding a full path to the submission script did not change the result.
#!/bin/bash
#SBATCH --ntasks=6
cd $SLURM_SUBMIT_DIR
/usr/bin/mpiexec -n 6 /clusterfs/usr/bin/python3 calculate.py
EDIT 2
ifconfig output:
pi@cluster-master:/clusterfs/calc-pi $ ifconfig -a
eth0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
ether dc:a6:32:f9:16:89 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 3546 bytes 371661 (362.9 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 3546 bytes 371661 (362.9 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
wlan0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.15.30.222 netmask 255.255.255.0 broadcast 10.15.30.255
inet6 fe80::10f0:f020:1af7:b503 prefixlen 64 scopeid 0x20<link>
ether dc:a6:32:f9:16:8a txqueuelen 1000 (Ethernet)
RX packets 46530 bytes 10782530 (10.2 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 11017 bytes 2184717 (2.0 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
pi@cluster-01:~ $ ifconfig -a
eth0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
ether dc:a6:32:f9:15:6c txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 1103 bytes 77867 (76.0 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 1103 bytes 77867 (76.0 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
wlan0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.15.30.235 netmask 255.255.255.0 broadcast 10.15.30.255
inet6 fe80::9489:150a:535d:4485 prefixlen 64 scopeid 0x20<link>
ether dc:a6:32:f9:15:6d txqueuelen 1000 (Ethernet)
RX packets 92797 bytes 27212632 (25.9 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 60789 bytes 48894420 (46.6 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
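Note that on both nodes eth0 is UP but carries no IPv4 address, while all cluster traffic rides on wlan0. One experiment worth trying, a sketch using Open MPI's oob_tcp_if_include and btl_tcp_if_include MCA parameters, is to restrict both the out-of-band channel and the TCP transport to wlan0 so the unconfigured interface is never considered:
mpirun --mca oob_tcp_if_include wlan0 \
       --mca btl_tcp_if_include wlan0 \
       --host cluster-01 -n 1 hostname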
EDIT 3/4
Running with oob_base_verbose. I should note that 10.15.30.222 is cluster-master, which is where I'm running these commands. Edit 4 replaced the bad output with the output of a now-working mpirun.
pi@cluster-master:/clusterfs/calc-pi $ mpirun --mca oob_base_verbose 10 --host cluster-01 -n 1 hostname
[cluster-master:08881] mca: base: components_register: registering framework oob components
[cluster-master:08881] mca: base: components_register: found loaded component ud
[cluster-master:08881] mca: base: components_register: component ud register function successful
[cluster-master:08881] mca: base: components_register: found loaded component tcp
[cluster-master:08881] mca: base: components_register: component tcp register function successful
[cluster-master:08881] mca: base: components_open: opening oob components
[cluster-master:08881] mca: base: components_open: found loaded component ud
[cluster-master:08881] mca: base: components_open: component ud open function successful
[cluster-master:08881] mca: base: components_open: found loaded component tcp
[cluster-master:08881] mca: base: components_open: component tcp open function successful
[cluster-master:08881] mca:oob:select: checking available component ud
[cluster-master:08881] mca:oob:select: Querying component [ud]
[cluster-master:08881] oob:ud: component_available called
[cluster-master:08881] mca:oob:select: Skipping component [ud] - failed to startup
[cluster-master:08881] mca:oob:select: checking available component tcp
[cluster-master:08881] mca:oob:select: Querying component [tcp]
[cluster-master:08881] oob:tcp: component_available called
[cluster-master:08881] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[cluster-master:08881] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
[cluster-master:08881] [[63716,0],0] oob:tcp:init adding 10.15.30.222 to our list of V4 connections
[cluster-master:08881] [[63716,0],0] TCP STARTUP
[cluster-master:08881] [[63716,0],0] attempting to bind to IPv4 port 0
[cluster-master:08881] [[63716,0],0] assigned IPv4 port 60821
[cluster-master:08881] mca:oob:select: Adding component to end
[cluster-master:08881] mca:oob:select: Found 1 active transports
[cluster-master:08881] [[63716,0],0]: get transports
[cluster-master:08881] [[63716,0],0]:get transports for component tcp
[cluster-01:06357] mca: base: components_register: registering framework oob components
[cluster-01:06357] mca: base: components_register: found loaded component ud
[cluster-01:06357] mca: base: components_register: component ud register function successful
[cluster-01:06357] mca: base: components_register: found loaded component tcp
[cluster-01:06357] mca: base: components_register: component tcp register function successful
[cluster-01:06357] mca: base: components_open: opening oob components
[cluster-01:06357] mca: base: components_open: found loaded component ud
[cluster-01:06357] mca: base: components_open: component ud open function successful
[cluster-01:06357] mca: base: components_open: found loaded component tcp
[cluster-01:06357] mca: base: components_open: component tcp open function successful
[cluster-01:06357] mca:oob:select: checking available component ud
[cluster-01:06357] mca:oob:select: Querying component [ud]
[cluster-01:06357] oob:ud: component_available called
[cluster-01:06357] mca:oob:select: Skipping component [ud] - failed to startup
[cluster-01:06357] mca:oob:select: checking available component tcp
[cluster-01:06357] mca:oob:select: Querying component [tcp]
[cluster-01:06357] oob:tcp: component_available called
[cluster-01:06357] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[cluster-01:06357] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
[cluster-01:06357] [[63716,0],1] oob:tcp:init adding 10.15.30.235 to our list of V4 connections
[cluster-01:06357] [[63716,0],1] TCP STARTUP
[cluster-01:06357] [[63716,0],1] attempting to bind to IPv4 port 0
[cluster-01:06357] [[63716,0],1] assigned IPv4 port 35529
[cluster-01:06357] mca:oob:select: Adding component to end
[cluster-01:06357] mca:oob:select: Found 1 active transports
[cluster-01:06357] [[63716,0],1]: get transports
[cluster-01:06357] [[63716,0],1]:get transports for component tcp
[cluster-01:06357] [[63716,0],1] OOB_SEND: rml_oob_send.c:265
[cluster-01:06357] [[63716,0],1] oob:base:send to target [[63716,0],0] - attempt 0
[cluster-01:06357] [[63716,0],1] oob:base:send unknown peer [[63716,0],0]
[cluster-01:06357] [[63716,0],1]:set_addr checking if peer [[63716,0],0] is reachable via component tcp
[cluster-01:06357] [[63716,0],1] oob:tcp: working peer [[63716,0],0] address tcp://10.15.30.222:60821
[cluster-01:06357] [[63716,0],1]: peer [[63716,0],0] is reachable via component tcp
[cluster-01:06357] [[63716,0],1] oob:tcp:send_nb to peer [[63716,0],0]:10 seq = -1
[cluster-01:06357] [[63716,0],1]:[oob_tcp.c:204] processing send to peer [[63716,0],0]:10 seq_num = -1 via [[63716,0],0]
[cluster-01:06357] [[63716,0],1]:[oob_tcp.c:225] queue pending to [[63716,0],0]
[cluster-01:06357] [[63716,0],1] tcp:send_nb: initiating connection to [[63716,0],0]
[cluster-01:06357] [[63716,0],1]:[oob_tcp.c:239] connect to [[63716,0],0]
[cluster-01:06357] [[63716,0],1] orte_tcp_peer_try_connect: attempting to connect to proc [[63716,0],0]
[cluster-01:06357] [[63716,0],1] orte_tcp_peer_try_connect: attempting to connect to proc [[63716,0],0] on socket 19
[cluster-01:06357] [[63716,0],1] orte_tcp_peer_try_connect: attempting to connect to proc [[63716,0],0] on 10.15.30.222:60821 - 0 retries
[cluster-01:06357] [[63716,0],1] waiting for connect completion to [[63716,0],0] - activating send event
[cluster-master:08881] [[63716,0],0] mca_oob_tcp_listen_thread: incoming connection: (24, 0) 10.15.30.235:58444
[cluster-master:08881] [[63716,0],0] connection_handler: working connection (24, 0) 10.15.30.235:58444
[cluster-master:08881] [[63716,0],0] accept_connection: 10.15.30.235:58444
[cluster-01:06357] [[63716,0],1] tcp:send_handler called to send to peer [[63716,0],0]
[cluster-01:06357] [[63716,0],1] tcp:send_handler CONNECTING
[cluster-01:06357] [[63716,0],1]:tcp:complete_connect called for peer [[63716,0],0] on socket 19
[cluster-01:06357] [[63716,0],1] tcp_peer_complete_connect: sending ack to [[63716,0],0]
[cluster-master:08881] [[63716,0],0]:tcp:recv:handler called
[cluster-master:08881] [[63716,0],0] RECV CONNECT ACK FROM UNKNOWN ON SOCKET 24
[cluster-master:08881] [[63716,0],0] waiting for connect ack from UNKNOWN
[cluster-master:08881] [[63716,0],0] connect ack received from UNKNOWN
[cluster-master:08881] [[63716,0],0] connect-ack recvd from UNKNOWN
[cluster-master:08881] [[63716,0],0] mca_oob_tcp_recv_connect: connection from new peer
[cluster-master:08881] [[63716,0],0] connect-ack header from [[63716,0],1] is okay
[cluster-master:08881] [[63716,0],0] waiting for connect ack from [[63716,0],1]
[cluster-master:08881] [[63716,0],0] connect ack received from [[63716,0],1]
[cluster-master:08881] [[63716,0],0] connect-ack version from [[63716,0],1] matches ours
[cluster-master:08881] [[63716,0],0] tcp:peer_accept called for peer [[63716,0],1] in state UNKNOWN on socket 24
[cluster-master:08881] [[63716,0],0] SEND CONNECT ACK
[cluster-master:08881] [[63716,0],0] send blocking of 72 bytes to socket 24
[cluster-master:08881] [[63716,0],0] blocking send complete to socket 24
[cluster-master:08881] [[63716,0],0]-[[63716,0],1] tcp_peer_connected on socket 24
[cluster-master:08881] [[63716,0],0]-[[63716,0],1] accepted: 10.15.30.222 - 10.15.30.235 nodelay 1 sndbuf 44800 rcvbuf 131072 flags 00000802
[cluster-master:08881] [[63716,0],0] tcp:set_module called for peer [[63716,0],1]
[cluster-01:06357] [[63716,0],1] SEND CONNECT ACK
[cluster-01:06357] [[63716,0],1] send blocking of 72 bytes to socket 19
[cluster-01:06357] [[63716,0],1] blocking send complete to socket 19
[cluster-01:06357] [[63716,0],1] tcp_peer_complete_connect: setting read event on connection to [[63716,0],0]
[cluster-01:06357] [[63716,0],1]:tcp:recv:handler called for peer [[63716,0],0]
[cluster-01:06357] [[63716,0],1] RECV CONNECT ACK FROM [[63716,0],0] ON SOCKET 19
[cluster-01:06357] [[63716,0],1] waiting for connect ack from [[63716,0],0]
[cluster-01:06357] [[63716,0],1] connect ack received from [[63716,0],0]
[cluster-01:06357] [[63716,0],1] connect-ack recvd from [[63716,0],0]
[cluster-01:06357] [[63716,0],1] connect-ack header from [[63716,0],0] is okay
[cluster-01:06357] [[63716,0],1] waiting for connect ack from [[63716,0],0]
[cluster-01:06357] [[63716,0],1] connect ack received from [[63716,0],0]
[cluster-01:06357] [[63716,0],1] connect-ack version from [[63716,0],0] matches ours
[cluster-01:06357] [[63716,0],1]-[[63716,0],0] tcp_peer_connected on socket 19
[cluster-01:06357] [[63716,0],1]-[[63716,0],0] connected: 10.15.30.235 - 10.15.30.222 nodelay 1 sndbuf 44800 rcvbuf 131072 flags 00000802
[cluster-master:08881] [[63716,0],0]:tcp:recv:handler called for peer [[63716,0],1]
[cluster-master:08881] [[63716,0],0]:tcp:recv:handler CONNECTED
[cluster-master:08881] [[63716,0],0]:tcp:recv:handler allocate new recv msg
[cluster-master:08881] [[63716,0],0]:tcp:recv:handler read hdr
[cluster-master:08881] [[63716,0],0]:tcp:recv:handler allocate data region of size 2743
[cluster-master:08881] [[63716,0],0] RECVD COMPLETE MESSAGE FROM [[63716,0],1] (ORIGIN [[63716,0],1]) OF 2743 BYTES FOR DEST [[63716,0],0] TAG 10
[cluster-master:08881] [[63716,0],0] DELIVERING TO RML tag = 10 seq_num = -1
[cluster-master:08881] [[63716,0],0] OOB_SEND: rml_oob_send.c:265
[cluster-master:08881] [[63716,0],0] oob:base:send to target [[63716,0],1] - attempt 0
[cluster-master:08881] [[63716,0],0] oob:base:send known transport for peer [[63716,0],1]
[cluster-master:08881] [[63716,0],0] oob:tcp:send_nb to peer [[63716,0],1]:1 seq = -1
[cluster-master:08881] [[63716,0],0]:[oob_tcp.c:204] processing send to peer [[63716,0],1]:1 seq_num = -1 via [[63716,0],1]
[cluster-master:08881] [[63716,0],0] tcp:send_nb: already connected to [[63716,0],1] - queueing for send
[cluster-master:08881] [[63716,0],0]:[oob_tcp.c:218] queue send to [[63716,0],1]
[cluster-master:08881] [[63716,0],0] tcp:send_handler called to send to peer [[63716,0],1]
[cluster-master:08881] [[63716,0],0] tcp:send_handler SENDING TO [[63716,0],1]
[cluster-master:08881] oob:tcp:send_handler SENDING MSG
[cluster-master:08881] [[63716,0],0] MESSAGE SEND COMPLETE TO [[63716,0],1] OF 13 BYTES ON SOCKET 24
[cluster-01:06357] [[63716,0],1]:tcp:recv:handler starting send/recv events
[cluster-01:06357] [[63716,0],1] tcp:set_module called for peer [[63716,0],0]
[cluster-01:06357] [[63716,0],1] tcp:send_handler called to send to peer [[63716,0],0]
[cluster-01:06357] [[63716,0],1] tcp:send_handler SENDING TO [[63716,0],0]
[cluster-01:06357] oob:tcp:send_handler SENDING MSG
[cluster-01:06357] [[63716,0],1] MESSAGE SEND COMPLETE TO [[63716,0],0] OF 2743 BYTES ON SOCKET 19
[cluster-01:06357] [[63716,0],1]:tcp:recv:handler called for peer [[63716,0],0]
[cluster-01:06357] [[63716,0],1]:tcp:recv:handler CONNECTED
[cluster-01:06357] [[63716,0],1]:tcp:recv:handler allocate new recv msg
[cluster-master:08881] [[63716,0],0] OOB_SEND: rml_oob_send.c:265
[cluster-master:08881] [[63716,0],0] oob:base:send to target [[63716,0],1] - attempt 0
[cluster-master:08881] [[63716,0],0] oob:base:send known transport for peer [[63716,0],1]
[cluster-master:08881] [[63716,0],0] oob:tcp:send_nb to peer [[63716,0],1]:15 seq = -1
[cluster-master:08881] [[63716,0],0]:[oob_tcp.c:204] processing send to peer [[63716,0],1]:15 seq_num = -1 via [[63716,0],1]
[cluster-master:08881] [[63716,0],0] tcp:send_nb: already connected to [[63716,0],1] - queueing for send
[cluster-master:08881] [[63716,0],0]:[oob_tcp.c:218] queue send to [[63716,0],1]
[cluster-master:08881] [[63716,0],0] tcp:send_handler called to send to peer [[63716,0],1]
[cluster-master:08881] [[63716,0],0] tcp:send_handler SENDING TO [[63716,0],1]
[cluster-master:08881] oob:tcp:send_handler SENDING MSG
[cluster-master:08881] [[63716,0],0] MESSAGE SEND COMPLETE TO [[63716,0],1] OF 694 BYTES ON SOCKET 24
[cluster-01:06357] [[63716,0],1]:tcp:recv:handler read hdr
[cluster-01:06357] [[63716,0],1]:tcp:recv:handler allocate data region of size 13
[cluster-01:06357] [[63716,0],1] RECVD COMPLETE MESSAGE FROM [[63716,0],0] (ORIGIN [[63716,0],0]) OF 13 BYTES FOR DEST [[63716,0],1] TAG 1
[cluster-01:06357] [[63716,0],1] DELIVERING TO RML tag = 1 seq_num = -1
[cluster-01:06357] [[63716,0],1]:tcp:recv:handler called for peer [[63716,0],0]
[cluster-01:06357] [[63716,0],1]:tcp:recv:handler CONNECTED
[cluster-01:06357] [[63716,0],1]:tcp:recv:handler allocate new recv msg
[cluster-01:06357] [[63716,0],1]:tcp:recv:handler read hdr
[cluster-01:06357] [[63716,0],1]:tcp:recv:handler allocate data region of size 694
[cluster-01:06357] [[63716,0],1] RECVD COMPLETE MESSAGE FROM [[63716,0],0] (ORIGIN [[63716,0],0]) OF 694 BYTES FOR DEST [[63716,0],1] TAG 15
[cluster-01:06357] [[63716,0],1] DELIVERING TO RML tag = 15 seq_num = -1
[cluster-01:06357] [[63716,0],1] OOB_SEND: rml_oob_send.c:265
[cluster-01:06357] [[63716,0],1] oob:base:send to target [[63716,0],0] - attempt 0
[cluster-01:06357] [[63716,0],1] oob:base:send known transport for peer [[63716,0],0]
[cluster-01:06357] [[63716,0],1] oob:tcp:send_nb to peer [[63716,0],0]:5 seq = -1
[cluster-01:06357] [[63716,0],1]:[oob_tcp.c:204] processing send to peer [[63716,0],0]:5 seq_num = -1 via [[63716,0],0]
[cluster-01:06357] [[63716,0],1] tcp:send_nb: already connected to [[63716,0],0] - queueing for send
[cluster-01:06357] [[63716,0],1]:[oob_tcp.c:218] queue send to [[63716,0],0]
[cluster-01:06357] [[63716,0],1] tcp:send_handler called to send to peer [[63716,0],0]
[cluster-01:06357] [[63716,0],1] tcp:send_handler SENDING TO [[63716,0],0]
[cluster-master:08881] [[63716,0],0]:tcp:recv:handler called for peer [[63716,0],1]
[cluster-master:08881] [[63716,0],0]:tcp:recv:handler CONNECTED
[cluster-master:08881] [[63716,0],0]:tcp:recv:handler allocate new recv msg
[cluster-master:08881] [[63716,0],0]:tcp:recv:handler read hdr
[cluster-master:08881] [[63716,0],0]:tcp:recv:handler allocate data region of size 54
[cluster-master:08881] [[63716,0],0] RECVD COMPLETE MESSAGE FROM [[63716,0],1] (ORIGIN [[63716,0],1]) OF 54 BYTES FOR DEST [[63716,0],0] TAG 5
[cluster-master:08881] [[63716,0],0] DELIVERING TO RML tag = 5 seq_num = -1
[cluster-01:06357] oob:tcp:send_handler SENDING MSG
[cluster-01:06357] [[63716,0],1] MESSAGE SEND COMPLETE TO [[63716,0],0] OF 54 BYTES ON SOCKET 19
[cluster-master:08881] [[63716,0],0]:tcp:recv:handler called for peer [[63716,0],1]
[cluster-master:08881] [[63716,0],0]:tcp:recv:handler CONNECTED
[cluster-master:08881] [[63716,0],0]:tcp:recv:handler allocate new recv msg
[cluster-master:08881] [[63716,0],0]:tcp:recv:handler read hdr
[cluster-master:08881] [[63716,0],0]:tcp:recv:handler allocate data region of size 33
[cluster-master:08881] [[63716,0],0] RECVD COMPLETE MESSAGE FROM [[63716,0],1] (ORIGIN [[63716,0],1]) OF 33 BYTES FOR DEST [[63716,0],0] TAG 2
[cluster-master:08881] [[63716,0],0] DELIVERING TO RML tag = 2 seq_num = -1
[cluster-master:08881] [[63716,0],0]:tcp:recv:handler called for peer [[63716,0],1]
[cluster-master:08881] [[63716,0],0]:tcp:recv:handler CONNECTED
[cluster-master:08881] [[63716,0],0]:tcp:recv:handler allocate new recv msg
[cluster-master:08881] [[63716,0],0]:tcp:recv:handler read hdr
[cluster-master:08881] [[63716,0],0]:tcp:recv:handler allocate data region of size 54
[cluster-master:08881] [[63716,0],0] RECVD COMPLETE MESSAGE FROM [[63716,0],1] (ORIGIN [[63716,0],1]) OF 54 BYTES FOR DEST [[63716,0],0] TAG 5
[cluster-master:08881] [[63716,0],0] DELIVERING TO RML tag = 5 seq_num = -1
[cluster-master:08881] [[63716,0],0] OOB_SEND: rml_oob_send.c:265
[cluster-master:08881] [[63716,0],0] oob:base:send to target [[63716,0],1] - attempt 0
[cluster-master:08881] [[63716,0],0] oob:base:send known transport for peer [[63716,0],1]
[cluster-master:08881] [[63716,0],0] oob:tcp:send_nb to peer [[63716,0],1]:15 seq = -1
[cluster-master:08881] [[63716,0],0]:[oob_tcp.c:204] processing send to peer [[63716,0],1]:15 seq_num = -1 via [[63716,0],1]
[cluster-master:08881] [[63716,0],0] tcp:send_nb: already connected to [[63716,0],1] - queueing for send
[cluster-master:08881] [[63716,0],0]:[oob_tcp.c:218] queue send to [[63716,0],1]
[cluster-master:08881] [[63716,0],0] tcp:send_handler called to send to peer [[63716,0],1]
[cluster-master:08881] [[63716,0],0] tcp:send_handler SENDING TO [[63716,0],1]
[cluster-master:08881] oob:tcp:send_handler SENDING MSG
[cluster-master:08881] [[63716,0],0] MESSAGE SEND COMPLETE TO [[63716,0],1] OF 31 BYTES ON SOCKET 24
[cluster-01:06357] [[63716,0],1] OOB_SEND: rml_oob_send.c:265
[cluster-01:06357] [[63716,0],1] oob:base:send to target [[63716,0],0] - attempt 0
[cluster-01:06357] [[63716,0],1] oob:base:send known transport for peer [[63716,0],0]
[cluster-01:06357] [[63716,0],1] oob:tcp:send_nb to peer [[63716,0],0]:2 seq = -1
[cluster-01:06357] [[63716,0],1]:[oob_tcp.c:204] processing send to peer [[63716,0],0]:2 seq_num = -1 via [[63716,0],0]
[cluster-01:06357] [[63716,0],1] tcp:send_nb: already connected to [[63716,0],0] - queueing for send
[cluster-01:06357] [[63716,0],1]:[oob_tcp.c:218] queue send to [[63716,0],0]
[cluster-01:06357] [[63716,0],1] tcp:send_handler called to send to peer [[63716,0],0]
[cluster-01:06357] [[63716,0],1] tcp:send_handler SENDING TO [[63716,0],0]
[cluster-01:06357] oob:tcp:send_handler SENDING MSG
[cluster-01:06357] [[63716,0],1] MESSAGE SEND COMPLETE TO [[63716,0],0] OF 33 BYTES ON SOCKET 19
[cluster-01:06357] [[63716,0],1] OOB_SEND: rml_oob_send.c:265
[cluster-01:06357] [[63716,0],1] oob:base:send to target [[63716,0],0] - attempt 0
[cluster-01:06357] [[63716,0],1] oob:base:send known transport for peer [[63716,0],0]
[cluster-01:06357] [[63716,0],1] oob:tcp:send_nb to peer [[63716,0],0]:5 seq = -1
[cluster-01:06357] [[63716,0],1]:[oob_tcp.c:204] processing send to peer [[63716,0],0]:5 seq_num = -1 via [[63716,0],0]
[cluster-01:06357] [[63716,0]
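If an MCA setting such as the interface restriction above is what got mpirun working, it can be made permanent so every job picks it up without extra flags. A sketch, assuming the Debian/Raspbian package layout (the file lives elsewhere on other installs):
# system-wide MCA defaults read by every Open MPI process
echo "oob_tcp_if_include = wlan0" | sudo tee -a /etc/openmpi/openmpi-mca-params.conf
echo "btl_tcp_if_include = wlan0" | sudo tee -a /etc/openmpi/openmpi-mca-params.conf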

Related

establishing IKE_SA failed, peer not responding - Strongswan with Centos 7

I've been working on this VPN tunnel for over a week now and keep getting "peer not responding" when I bring up the tunnel.
I have Strongswan installed, and I have already created one tunnel that works fine with its connection established. Then I added a second one, but whenever I bring up the second tunnel, after 5 attempts I get "peer not responding". The remote server cannot see any connection from me.
Here's my ipsec.conf file:
config setup
    charondebug="all"

conn %default
    ikelifetime=24h
    keylife=20m
    rekeymargin=3m
    keyingtries=1
    authby=secret
    mobike=no

conn Foo-to-Bar
    keyexchange=ikev1
    left=196.xxx.xxx.xx          # PUBLIC IP of my server
    leftsubnet=196.xxx.xx.xx/32  # PUBLIC IP of my server
    leftid=196.xxx.xx.xx         # PUBLIC IP of my server
    leftfirewall=yes
    right=41.xxx.xx.xx           # Remote Peer IP
    rightsubnet=41.xxx.xx.xx/32  # Remote Host IP
    rightid=41.xxx.xx.xx
    auto=route
    esp=3des-sha1
    ike=3des-sha1-modp1024
    type=tunnel
    lifetime=24h
    dpdaction=clear
    ike_dhgroup=group2

conn Foo-to-Bar2
    also=Foo-to-Bar
    rightsubnet=xxx.xxx.xx.xx/32  # Another remote host
    esp=3des-sha1
When I do strongswan up Foo-to-Bar here's what I get:
strongswan up Foo-to-Bar
initiating Main Mode IKE_SA Foo-to-Bar[2] to xxx.xxx.xx.xx
generating ID_PROT request 0 [ SA V V V V V ]
sending packet: from 196.xxx.xxx.xx[500] to 41.xxx.xx.xx[500] (248 bytes)
sending retransmit 1 of request message ID 0, seq 1
sending packet: from 196.xxx.xxx.xx[500] to 41.xxx.xx.xx[500] (248 bytes)
sending retransmit 2 of request message ID 0, seq 1
sending packet: from 196.xxx.xxx.xx[500] to 41.xxx.xx.xx[500] (248 bytes)
sending retransmit 3 of request message ID 0, seq 1
sending packet: from 196.xxx.xxx.xx[500] to 41.xxx.xx.xx[500] (248 bytes)
sending retransmit 4 of request message ID 0, seq 1
sending packet: from 196.xxx.xxx.xx[500] to 41.xxx.xx.xx[500] (248 bytes)
sending retransmit 5 of request message ID 0, seq 1
sending packet: from 196.xxx.xxx.xx[500] to 41.xxx.xx.xx[500] (248 bytes)
giving up after 5 retransmits
establishing IKE_SA failed, peer not responding
establishing connection 'Foo-to-Bar' failed
Also, when I check my /var/log/messages, I get:
# localhost charon: 04[NET] sending packet: from 196.xxx.xxx.xx[500] to 41.xxx.xx.xx[500] (248 bytes)
# localhost charon: 03[NET] error writing to socket: Network is unreachable
What could be the cause?
I am a bit confused that one config just above this one in the same file is able to establish its connection, while this one does not, and the remote host cannot see my connection attempts in their logs.
I'd appreciate any help.
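One place to look: "error writing to socket: Network is unreachable" from charon usually points at the kernel having no route to the peer rather than at an IKE-level problem. A minimal check, keeping the peer address as the placeholder from the post:
ip route get 41.xxx.xx.xx    # which route, if any, the kernel would pick
ip route show                # compare against the route the working tunnel uses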

How to fix pgAdmin4 connection refused error

I'm getting this error when attempting to setup a new server on pgAdmin4:
Unable to connect to server:
could not connect to server: Connection refused (0x0000274D/10061)
Is the server running on host "192.168.210.146" and accepting
TCP/IP connections on port 5432?
I have Postgres 12.7 running on CentOS 8 inside a VirtualBox 6.1 VM, which is running on my Windows 10 21H1 laptop. I can connect to the OS using PuTTY and the CentOS web client just fine.
Here is some network info via the CentOS web client terminal:
# nmap localhost
Starting Nmap 7.70 ( https://nmap.org ) at 2021-07-14 16:59 PDT
Nmap scan report for localhost (127.0.0.1)
Host is up (0.000014s latency).
Other addresses for localhost (not scanned): ::1
Not shown: 996 closed ports
PORT STATE SERVICE
22/tcp open ssh
111/tcp open rpcbind
5432/tcp open postgresql
9090/tcp open zeus-admin
Nmap done: 1 IP address (1 host up) scanned in 1.68 seconds
netstat -tlpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN 1/systemd
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 954/sshd
tcp 0 0 127.0.0.1:5432 0.0.0.0:* LISTEN 972/postmaster
tcp 0 0 127.0.0.1:37753 0.0.0.0:* LISTEN 1620/cockpit-bridge
# firewall-cmd --list-all
public (active)
target: default
icmp-block-inversion: no
interfaces: enp0s3
sources:
services: cockpit dhcpv6-client postgresql ssh
ports: 5432/tcp
protocols:
masquerade: no
forward-ports:
source-ports:
icmp-blocks:
rich rules:
#
# ifconfig
enp0s3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.210.146 netmask 255.255.254.0 broadcast 192.168.211.255
inet6 fe80::a00:27ff:fecb:8d2d prefixlen 64 scopeid 0x20<link>
ether 08:00:27:cb:8d:2d txqueuelen 1000 (Ethernet)
RX packets 4704 bytes 512333 (500.3 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 3757 bytes 2510585 (2.3 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 7252 bytes 2161674 (2.0 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 7252 bytes 2161674 (2.0 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
In the pgAdmin server-setup screen I'm using the IP address listed above (192.168.210.146), the postgres user and its password, port 5432, and the database set to postgres.
I get this same error trying to establish ODBC and JDBC connections from my laptop, but I'm not sure what in the postgres environment needs to be fixed. I did add one entry to the pg_hba.conf file as shown below, but that didn't help:
# IPv4 local connections:
host all all 127.0.0.1/32 ident
host all all 192.168.210.146/32 trust #added;not helping
Is there another file or setting that needs to be fixed?
Thanks.
The solution was to first un-comment the listen_addresses entry in postgresql.conf and then set it to the necessary IP address. Everything connects just fine now. Thanks.
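For reference, a sketch of that change (the data-directory path and service name vary by install; note also that the pg_hba.conf rule must match the client's address, and the /23 below is derived from the 255.255.254.0 netmask shown earlier):
# postgresql.conf (e.g. /var/lib/pgsql/12/data/postgresql.conf):
#   listen_addresses = '*'            # or a specific interface address
# pg_hba.conf entry covering the laptop's subnet, not the server's own IP:
#   host  all  all  192.168.210.0/23  md5
sudo systemctl restart postgresql-12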

rsyslogd client not closing the TCP connection when server rsyslogd goes down

I have configured rsyslogd on a remote server to receive the logs from a client machine running rsyslogd, using the TCP protocol. After configuring and restarting the rsyslogd daemon on both client and server, I am able to send packets to the server and everything works fine. But when I later restart rsyslogd on the server, the client keeps sending packets over the old TCP connection; the client retries 16 times and fails to send the packet. On the retry for the next packet, the client creates a new connection and communication works fine from then on.
When I restart rsyslogd on the server, I captured the packets with tcpdump, and we can see that the server sends flag [F] to the client and the client acknowledges it. But when the client sends the next packet, it does not create a new connection.
restart rsyslog on server:
Server side tcpdump:
*09:54:50.012933 IP x.x.x.101.514 > y.y.y.167.37141: Flags [F.], seq 1, ack 31, win 229, length 0
09:54:50.013050 IP y.y.y.167.37141 > x.x.x.101.514: Flags [.], ack 2,
win 115, length 0*
For the very next packet sent from the client, the server sends flag [R], but the client keeps retrying 16 times:
tcpdump from server:
*03:55:11.811611 IP y.y.y.167.37141 > x.x.x.101.514: Flags [P.], seq 31:61, ack 2, win 115, length 30
03:55:11.811647 IP x.x.x.101.514 > y.y.y.167.37141: Flags [R], seq
1863584583, win 0, length 0
03:55:12.014158 IP y.y.y.167.37141 > x.x.x.101.514: Flags [P.], seq
31:61, ack 2, win 115, length 30
03:55:12.014189 IP x.x.x.101.514 > y.y.y.167.37141: Flags [R], seq
1863584583, win 0, length 0*
<this repeated 6 times on server>
At the same time, on the client, we do not see the response arriving from the server:
09:55:11.811077 IP y.y.y.167.37141 > x.x.x.101.514: Flags [P.], seq
31:61, ack 2, win 115, length 30
09:55:12.013639 IP y.y.y.167.37141 > x.x.x.101.514: Flags [P.], seq
31:61, ack 2, win 115, length 30
09:55:12.421627 IP y.y.y.167.37141 > x.x.x.101.514: Flags [P.], seq
31:61, ack 2, win 115, length 30
<this retried 16 times>
Now, after the 16th retry (which took ~13 minutes), if we send a new packet it is sent correctly.
Here we see a new session being created:
*10:16:43.873325 IP y.y.y.167.39859 > x.x.x.101.514: Flags [S], seq 1000783963, win 14600, options [mss 1460,nop,wscale 7], length 0
10:16:43.873658 IP x.x.x.101.514 > y.y.y.167.39859: Flags [S.],
seq 231452091, ack 1000783964, win 29200, options [mss 1460,nop,wscale
7], length 0
10:16:43.873740 IP y.y.y.167.39859 > x.x.x.101.514: Flags [.], ack 1,
win 115, length 0
10:16:43.873904 IP y.y.y.167.39859 > x.x.x.101.514: Flags [P.], seq
1:31, ack 1, win 115, length 30
10:16:43.874084 IP x.x.x.101.514 > y.y.y.167.39859: Flags [.], ack 31,
win 229, length 0*
Has anyone faced such an issue? Can anyone tell me why the client does not fully close the connection when the server sends flag [F]? Do we have any configuration parameter in rsyslogd to create a new session when the server sends flag [F]?
Why is the client sending data after receiving the FIN and ACKing it?
TCP connection termination is a four-way handshake: once a client receives a FIN from the server, it acknowledges it and may still send any remaining data before sending its own FIN and waiting for the matching ACK to fully close the connection.
The logs you have provided show that the connection was half-open when the server restarted (which it should not have done before the connection was fully closed), and that is why the client sends the remaining data before completing the handshake.
What is the correct way to terminate abruptly?
When an endpoint needs to abruptly terminate a connection while data is still in transit, it should send an RST packet instead of a FIN.
Why is the RST packet sent by the server after the restart not received by the client?
It may have been discarded because the connection was already half-open from the FIN received earlier, or it may have been dropped by the client's firewall as a potential TCP reset attack.
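On the configuration side, a hedged sketch using rsyslog's omfwd action (parameter names are from the rsyslog documentation; the values are illustrative): action.resumeRetryCount="-1" retries indefinitely instead of discarding messages, and RebindInterval tears down and re-opens the TCP session after a given number of messages, bounding how long the client can sit on a dead connection:
# /etc/rsyslog.d/60-forward.conf on the client
# retry forever and re-open the TCP session every 1000 messages
*.* action(type="omfwd" target="x.x.x.101" port="514" protocol="tcp"
    action.resumeRetryCount="-1" RebindInterval="1000")
Reload with: sudo systemctl restart rsyslog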

Output of ns3 script where evalvid server is sending packets to client over 802.11

The evalvid server is sending packets to the evalvid client over an 802.11 link. I have two questions:
When the evalvid server is sending packets, the output looks like this:
EvalvidServer: Send packet at 1.10406s id: 1 udp 1460 to 192.168.0.1
EvalvidServer: Send packet at 1.10406s id: 2 udp 1460 to 192.168.0.1
EvalvidServer: Send packet at 1.10406s id: 3 udp 1460 to 192.168.0.1
EvalvidServer: Send packet at 1.10406s id: 4 udp 1460 to 192.168.0.1
EvalvidServer: Send packet at 1.10406s id: 5 udp 1460 to 192.168.0.1
EvalvidServer: Send packet at 1.10406s id: 6 udp 1460 to 192.168.0.1
... to id: 64
EvalvidClient: Received packet at 1.11958s id: 1 udp 1460
EvalvidClient: Received packet at 1.13261s id: 2 udp 1460
EvalvidClient: Received packet at 1.14567s id: 3 udp 1460
EvalvidClient: Received packet at 1.15866s id: 65 udp 1460
What happened to packets id 4-64? Were they dropped? Where? The ASCII trace contained no information about these packets being sent.
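One way to check, assuming the default ns-3 ASCII trace format where the first column is the event type (and with mytrace.tr standing in for the actual trace file):
grep '^d ' mytrace.tr | head    # lines starting with 'd' are drop events
grep -c '^d ' mytrace.tr        # count the drops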
Why is there retransmission when using the evalvid server (or plain UDP) over 802.11?
EvalvidClient: Received packet at 1.32891s id: 91 udp 1460
drop packet because signal power too Small (3.15207e-14<2.51189e-13)
EvalvidClient: Received packet at 1.33504s id: 92 udp 1460
The packet is dropped, yet a retransmission occurs, so there is no loss; but a UDP flow has no retransmissions of its own (presumably the retransmission happens below UDP, at the 802.11 MAC layer?).
Thank you

Error setting Coturn Server on Google Compute Cloud VM instance

I am trying to set up the coturn server on a Google Compute Cloud VM instance running Ubuntu 14.04. I think this is a firewall issue, but I have opened ports tcp:3478 and udp:3478 in the firewall settings, and no other process is running on port 3478. Has anyone tried setting this up on GCloud?
What am I doing wrong?
$ sudo turnserver -a -v -n -r north.gov -L 35.189.173.237
0: log file opened: /var/log/turn_2564_2017-07-06.log
0:
RFC 3489/5389/5766/5780/6062/6156 STUN/TURN Server
Version Coturn-4.5.0.6 'dan Eider'
0:
Max number of open files/sockets allowed for this process: 65536
0:
Due to the open files/sockets limitation,
max supported number of TURN Sessions possible is: 32500 (approximately)
0:
==== Show him the instruments, Practical Frost: ====
0: TLS supported
0: DTLS supported
0: DTLS 1.2 supported
0: TURN/STUN ALPN supported
0: Third-party authorization (oAuth) supported
0: GCM (AEAD) supported
0: OpenSSL compile-time version: OpenSSL 1.0.2g 1 Mar 2016 (0x1000207f)
0:
0: SQLite supported, default database location is /usr/local/var/db/turndb
0: Redis is not supported
0: PostgreSQL is not supported
0: MySQL is not supported
0: MongoDB is not supported
0:
0: Default Net Engine version: 3 (UDP thread per CPU core)
=====================================================
0: Listener address to use: 35.189.173.237
0: Domain name:
0: Default realm: north.gov
0: WARNING: cannot find certificate file: turn_server_cert.pem (1)
0: WARNING: cannot start TLS and DTLS listeners because certificate file is not set properly
0: WARNING: cannot find private key file: turn_server_pkey.pem (1)
0: WARNING: cannot start TLS and DTLS listeners because private key file is not set properly
0: Relay address to use: 35.189.173.237
0: pid file created: /var/run/turnserver.pid
0: IO method (main listener thread): epoll (with changelist)
0: WARNING: I cannot support STUN CHANGE_REQUEST functionality because only one IP address is provided
0: Wait for relay ports initialization...
0: relay 35.189.173.237 initialization...
0: relay 35.189.173.237 initialization done
0: Relay ports initialization done
0: IO method (general relay thread): epoll (with changelist)
0: turn server id=1 created
bind: Cannot assign requested address
0: IO method (general relay thread): epoll (with changelist)
0: turn server id=0 created
bind: Cannot assign requested address
bind: Cannot assign requested address
0: Trying to bind fd 16 to <35.189.173.237:3478>: errno=99
Cannot bind local socket to addr: Cannot assign requested address
0: Trying to bind fd 20 to <35.189.173.237:3478>: errno=99
Cannot bind local socket to addr: Cannot assign requested address
0: Cannot bind DTLS/UDP listener socket to addr 35.189.173.237:3478
0: Trying to bind DTLS/UDP listener socket to addr 35.189.173.237:3478, again...
0: Cannot bind TLS/TCP listener socket to addr 35.189.173.237:3478
0: Trying to bind TLS/TCP listener socket to addr 35.189.173.237:3478, again...
0: Trying to bind fd 18 to <35.189.173.237:3478>: errno=99
Cannot bind local socket to addr: Cannot assign requested address
0: Cannot bind TLS/TCP listener socket to addr 35.189.173.237:3478
0: Trying to bind TLS/TCP listener socket to addr 35.189.173.237:3478, again...
bind: Cannot assign requested address
0: Trying to bind fd 20 to <35.189.173.237:3478>: errno=99
Cannot bind local socket to addr: Cannot assign requested address
0: Cannot bind DTLS/UDP listener socket to addr 35.189.173.237:3478
0: Trying to bind DTLS/UDP listener socket to addr 35.189.173.237:3478, again...
bind: Cannot assign requested address
0: Trying to bind fd 16 to <35.189.173.237:3478>: errno=99
bind: Cannot assign requested address
0: Trying to bind fd 18 to <35.189.173.237:3478>: errno=99
Cannot bind local socket to addr: Cannot assign requested address
Cannot bind local socket to addr: Cannot assign requested address
0: Cannot bind TLS/TCP listener socket to addr 35.189.173.237:3478
0: Trying to bind TLS/TCP listener socket to addr 35.189.173.237:3478, again...
0: Cannot bind TLS/TCP listener socket to addr 35.189.173.237:3478
0: Trying to bind TLS/TCP listener socket to addr 35.189.173.237:3478, again...
bind: Cannot assign requested address
0: Trying to bind fd 20 to <35.189.173.237:3478>: errno=99
Cannot bind local socket to addr: Cannot assign requested address
0: Cannot bind DTLS/UDP listener socket to addr 35.189.173.237:3478
0: Trying to bind DTLS/UDP listener socket to addr 35.189.173.237:3478, again...
bind: Cannot assign requested address
bind: Cannot assign requested address
I figured out I was using the wrong command to start the turnserver. If anyone is facing problems setting up a coturn server on Amazon EC2/GCloud, this really helped me a lot: https://blog.knoldus.com/2013/10/24/configure-turn-server-for-webrtc-on-amazon-ec2/
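The underlying issue on EC2/GCloud-style VMs is that the public IP is NAT'd and never appears on the instance's interfaces, so turnserver cannot bind() to it directly. A sketch of the kind of invocation that works, with 10.132.0.2 standing in for the VM's internal address (check yours with hostname -I): listen on the internal address and advertise the public one via coturn's -X/--external-ip mapping:
# placeholder internal IP; substitute the address shown by `hostname -I`
sudo turnserver -a -v -n -r north.gov \
    -L 10.132.0.2 \
    -X 35.189.173.237/10.132.0.2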