What causes "Transport endpoint is not connected" in ZeroMQ? - sockets

I am working on a product which uses ZeroMQ (version 4.0.1).
The server and client communicate based on ZeroMQ ROUTER-socket.
To read socket events, server and client also create socket-monitor sockets (PAIR). There are three ports on which server binds and listens. Out of these three ports, one port is in a non-secured mode. Other two ports are using md5-authentication.
The issue I am facing is that, both the server and the client spontaneously receive socket disconnect for one of the secure port sockets (please see a log below). I have checked multiple times that server and client both have L3 reachability to each other.
What else I should check for?
What really triggers this error scenario?
zmq_print_callback:ZmQ: int zmq::stream_engine_t::read(void*, size_t):923
Stream engine recv():
TCP socket (187) to unknown:0 was disconnected
with error 107 [Transport endpoint is not connected]

Below sequence of events can trigger this error on server
Server receives ACCEPTED event for clientY and gets FD1.
Link-flap/network issue happens and clientY disconnects but server does not receive this disconnect.
Network recovers and clientY connects back to server.
Server receives ACCEPTED event for clientY and gets FD2. However, packets sent to this sockets does not go out of the server.
After 1 min or so, clientY receives "Transport endpoint is not connected error" for FD1.
Application can use this to treat as client disconnect.

Related

ZeroMQ broadcast to specific PULL client across firewall

I'm building a message broker which communicates with clients over ZeroMQ PUSH/PULL sockets and has the ability to exclude clients from messages they're not subscribed to from the server side (unlike ZeroMQ pub/sub which excludes messages on the client side).
Currently, I implement it in the following way:
Server: Binds ZeroMQ PULL socket on a fixed port
Client: Binds a ZeroMQ PULL socket on a random or fixed port
Client: Connects to the server's PULL socket and sends a handshake message containing the new client's address and port.
Server: Recieves handshake from client and connects a PUSH socket to the client's PULL server. Sends handshake response to the client's socket.
Client: Recieves handshake. Connected!
Now the client and server can communicate bidirectionally and the server can send messages to only a certain subset of clients. It works great!
However, this model doesn't work if the clients binding PULL sockets are unable to open a port in their firewall so the server can connect to them. How can I resolve this with minimal re-architecting (as the current model works very well when the firewall can be configured correctly)
I've considered the following:
Router/dealer pattern? I'm fairly ignorant on this and documentation I found was sparse.
Some sort of transport bridging? The linked example provides an example for PUB/SUB.
I was hoping to get some advice from someone who knows more about ZeroMQ than me.
tl;dr: I implemented a message broker that communicates with clients via bidirectional push/pull sockets. Each client binds a PULL socket and the server keeps a map of PUSH sockets so that it can address specific subscribers. How do I deal with a firewall blocking the client ports?
You can use the router/dealer to do this like you say. By default the ROUTER socket tracks every connection it has. The way it does this is by having the caller stick the connection identity information in front of each message it recieves. This makes things like pub/sub fairly trivial as all you need to do is handle a few messages server side that the DEALER socket sends it. In the past I have done something like
1.) Server side is a ROUTER socket. The ROUTER handles 2 messages from DEALER sockets SUB/UNSUB. This alongside the identity info sent as the first part of a frame allows the router to know the messages that a client is interested in.
2.) The server checks the mapping to see which clients should be sent a particular type of data using the map and then forwards the message to the correct client by appending the identity again to the start of the message.
This is nice in that it allows a single port to be exposed on the server. Client side we do not need to expose ports, simply just connect to the server ROUTER socket.
See https://zguide.zeromq.org/docs/chapter3/ for more info.

Why does the server application send RST after having gone through SYN->SYN,ACK->ACK?

I have a system with server/client applications. The client will send in socket connection request and the server will accept the socket connection when it's working correctly. However, in some situations (most likely due to ungraceful socket disconnection like system shutdown on client side or crash), the client will not be able to reconnect to the server application. The Wireshark capture shows the client will continue to try to connect; but after going through SYN->SYN,ACK->ACK, the server application will send RST. At this point, sometimes the netstat -an will show the connection is in CLOSE_WAIT state and other times would not show this connection. The capture shows 'Acknowledgment Number: Broken TCP. The ackowledge field is nonzero while the ACK flag is not set.
My questions is why the server application would send this RST?

Are application level Retransmission and Acknowledgement needed over TCP?

I have the following queries:
1) Does TCP guarantee delivery of packets and thus is thus application level re-transmission ever required if transport protocol used is TCP. Lets say I have established a TCP connection between a client and server, and server sends a message to the client. However the client goes offline and comes back only after say 10 hours, so will TCP stack handle re-transmission and delivering message to the client or will the application running on the server need to handle it?
2) Related to the above question, is application level ACK needed if transport protocol is TCP. One reason for application ACK would be that without it, the application would not know when the remote end received the message. Is there any reason other than that? Meaning is the delivery of the message itself guaranteed?
Does TCP guarantee delivery of packets and thus is thus application level re-transmission ever required if transport protocol used is TCP
TCP guarantees delivery of message stream bytes to the TCP layer on the other end of the TCP connection. So an application shouldn't have to bother with the nuances of retransmission. However, read the rest of my answer before taking that as an absolute.
However the client goes offline and comes back only after say 10 hours, so will TCP stack handle re-transmission and delivering message to the client or will the application running on the server need to handle it?
No, not really. Even though TCP has some degree of retry logic for individual TCP packets, it can not perform reconnections if the remote endpoint is disconnected. In other words, it will eventually "time out" waiting to get a TCP ACK from the remote side and do a few retries. But will eventually give up and notify the application through the socket interface that the remote endpoint connection is in a dead or closed state. Typical pattern is that when a client application detects that it lost the socket connection to the server, it either reports an error to the user interface of the application or retries the connection. Either way, it's application level decision on how to handle a failed TCP connection.
is application level ACK needed if transport protocol is TCP
Yes, absolutely. Most client-server protocols has some notion of a request/response pair of messages. A TCP socket can only indicate to the application if data "sent" by the application is successfully queued to the kernel's network stack. It provides no guarantees that the application on top of the socket on the remote end actually "got it" or "processed it". Your protocol on top of TCP should provide some sort of response indication when ever a message is processed. Use HTTP as a good example here. Imagine if an application would send an HTTP POST message to the server, but there was not acknowledgement (e.g. 200 OK) from the server. How would the client know the server processed it?
In a world of Network Address Translators (NATs) and proxy servers, TCP connections that are idle (no data between each other) can fail as the NAT or proxy closes the connection on behalf of the actual endpoint because it perceives a lack of data being sent. The solution is to have some sort of periodic "ping" and "pong" protocol by which the applications can keep the TCP connection alive in the absences of having no data to send.

tcp connection issue for unreachable server after connection

I am facing an issue with tcp connection..
I have a number of clients connected to the a remote server over tcp .
Now,If due to any issue i am not able to reach my server , after the successful establishment of the tcp connection , i do not receive any error on the client side .
On client end if i do netstat , it shows me that clients are connected the remote server , even though i am not able to ping the server.
So,now i am in the case where the server shows it is not connected to any client and on another end the client shows it is connected the server.
I have tested this for websocket also with node.js , but the same behavior persists over there also .
I have tried to google it around , but no luck .
Is there any standard solution for that ?
This is by design.
If two endpoints have a successful socket (TCP) connection between each other, but aren't sending any data, then the TCP state machines on both endpoints remains in the CONNECTED state.
Imagine if you had a shell connection open in a terminal window on your PC at work to a remote Unix machine across the Internet. You leave work that evening with the terminal window still logged in and at the shell prompt on the remote server.
Overnight, some router in between your PC and the remote computer goes out. Hours later, the router is fixed. You come into work the next day and start typing at the shell prompt. It's like the loss of connectivity never happened. How is this possible? Because neither socket on either endpoint had anything to send during the outage. Given that, there was no way that the TCP state machine was going to detect a connectivity failure - because no traffic was actually occurring. Now if you had tried to type something at the prompt during the outage, then the socket connection would eventually time out within a minute or two, and the terminal session would end.
One workaround is to to enable the SO_KEEPALIVE option on your socket. YMMV with this socket option - as this mode of TCP does not always send keep-alive messages at a rate in which you control.
A more common approach is to just have your socket send data periodically. Some protocols on top of TCP that I've worked with have their own notion of a "ping" message for this very purpose. That is, the client sends a "ping" message over the TCP socket every minute and the server responds back with "pong" or some equivalent. If neither side gets the expected ping/pong message within N minutes, then the connection, regardless of socket error state, is assumed to be dead. This approach of sending periodic messages also helps with NATs that tend to drop TCP connections for very quiet protocols when it doesn't observe traffic over a period of time.

Is TCP Reset (RST) two way?

I have a client-server (Java) application using persistent TCP connections, but sometimes the Server receives java.io.IOException: Connection reset by peer exception when trying to write on the socket, however I don't see any error in the Client log.
This RST is probably caused by an intermediate proxy/router, but if that's the case, should this be seen on the client as well?
If the RST is sent by the client, it can be seen on it using a packet sniffer such as wireshark. However, it won't show up in any user-level sockets since it's sent by the OS as a response to various erroneous inputs (such as connection attempts to a closed port).
If the RST is sent by the network, then it's pretending to be the client to sever the connection. It can do so in one direction, or in both of them. In that case, the client might not see anything, except for a RST sent by the actual server when the client continues to send data to a connection it perceives as open, while the server sees it as closed.
Try capturing the traffic on both the server and the client, see where the resets are coming from.