Using boost::asio::tcp, how to get notified when socket connection is broken

Scenario:
I am using the boost::asio::tcp protocol between 2 peers connected over the network. My code runs on Linux, macOS and iOS.
I have a watchdog ping-pong mechanism implemented on both sides to check that the socket connection between the peers is okay. This is done by sending a dummy packet every 2 seconds, which I believe is a well-known approach.
Challenge:
But is there a way I can avoid writing this watchdog myself?
Is there a way to make asio or the TCP stack itself do this for me and trigger an event as soon as the socket connection is not okay? I have been trying to understand the keep_alive functionality in the TCP stack, because it seems to hold the answer to my question.
But then again, it looks like I can tweak the keep_alive parameters, yet boost::asio::tcp does not seem to give me an API to do that.
Question:
Will tweaking the keep_alive parameters help me achieve my goal? My goal is to get a notification from asio or the TCP stack when the socket is no longer connected, for whatever weird reason on the peer side. Note that the peer can go down in a weird way, like a kernel panic or something really crazy.
Just setting socket.set_option(boost::asio::socket_base::keep_alive(true)); does not seem to help. The default keep-alive time on Linux is too high: I do get a notification, but only many minutes after the peer has gone down.
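For what it's worth, asio only exposes the on/off switch portably; the probe timers can still be tuned per-socket through the native handle. Below is a minimal sketch assuming Linux (TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are Linux-specific names; macOS and iOS expose the idle time as TCP_KEEPALIVE instead), with a hypothetical helper name:

#include <boost/asio.hpp>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>  // TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT (Linux)

// Hypothetical helper; the socket must already be open/connected.
void enableFastKeepAlive(boost::asio::ip::tcp::socket& socket)
{
    socket.set_option(boost::asio::socket_base::keep_alive(true));

    int fd = socket.native_handle();
    int idle = 2;      // seconds of idleness before the first probe
    int interval = 2;  // seconds between unanswered probes
    int count = 3;     // unanswered probes before the connection is dropped
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof(interval));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count));
}

With values like these, an idle but broken connection is torn down after roughly idle + interval * count seconds, at which point any pending async operation completes with an error.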

Related

How to use the TCP keep_alive property to get notified of an unresponsive peer?

Scenario:
I have a client and server written using boost::asio 1.63. Generally, the connection and communication parts work well.
I have written a Watchdog for both sides which sends a dummy packet to the peer every 2 seconds. The objective of the watchdog is that a peer reports a connection error if it does not receive the dummy packet it is expecting within the next 2 seconds. This is even more important for me because the 2 peers might not be exchanging packets for any user purpose, yet each of them is still required to report a connection error if the other peer goes down. A peer can go down even because of a kernel crash, in which case it would not be able to send a message. This is a classic problem, of course, which exists even beyond asio and TCP.
My Watchdog works perfectly well. No issues at all.
But recently I read about the keep_alive feature of sockets. I tried out the following code, and it seems I can set a property called keep_alive on the TCP socket by getting the native handle to the socket from within my boost::asio code.
#include <boost/asio.hpp>
#include <sys/socket.h>   // setsockopt, SOL_SOCKET, SO_KEEPALIVE
#include <netinet/tcp.h>  // TCP_KEEPALIVE (macOS/iOS; Linux names this option TCP_KEEPIDLE)

boost::asio::io_service ioService;
boost::asio::ip::tcp::socket mySocket(ioService);
// ... mySocket must be open/connected before native_handle() is usable ...
int on = 1;
int delay = 120;  // idle seconds before the first keep-alive probe
setsockopt(mySocket.native_handle(), SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
setsockopt(mySocket.native_handle(), IPPROTO_TCP, TCP_KEEPALIVE, &delay, sizeof(delay));
Question:
The above code compiles fine on macOS, Linux and iOS. That looks great. But how do I benefit from this? Does this give me a callback or event when the peer goes down? Does this free me from writing the Watchdog that I described above?
I have used boost::asio::async_connect to connect to the peer. Can I get a callback to my connectionHandler when the peer goes down, after the defined timeout interval?
Having set the keep_alive options, how do I then get to know that my peer is not responding anymore?
If the disconnection is detected while an async operation is pending, your socket's completion handler will be invoked with the appropriate error code.
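For instance, a pending read's handler fires with an error such as boost::asio::error::eof or boost::asio::error::connection_reset. A minimal sketch (mySocket and the buffer are placeholders):

char readBuffer[1024];
mySocket.async_read_some(boost::asio::buffer(readBuffer),
    [](const boost::system::error_code& ec, std::size_t /*bytesRead*/)
    {
        if (ec == boost::asio::error::eof ||
            ec == boost::asio::error::connection_reset)
        {
            // The peer closed the connection or it was reset: handle the loss.
        }
    });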
The problem is that the TCP keep_alive option does not always detect disconnects.
In general, there is no reliable way to detect sudden disconnection, other than by implementing application-level ping/heartbeat.
You can also see this thread.
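For completeness, here is a minimal sketch of such an application-level heartbeat built on a boost::asio timer (all names are hypothetical and error handling is trimmed; the receiving side would pair this with a deadline on its reads):

#include <boost/asio.hpp>

boost::asio::io_service ioService;
boost::asio::deadline_timer heartbeatTimer(ioService);

void sendHeartbeat(boost::asio::ip::tcp::socket& socket)
{
    static const char ping[1] = { 0 };  // dummy one-byte payload
    boost::asio::async_write(socket, boost::asio::buffer(ping),
        [&socket](const boost::system::error_code& ec, std::size_t)
        {
            if (ec)
                return;  // the write failed: report the connection as broken
            heartbeatTimer.expires_from_now(boost::posix_time::seconds(2));
            heartbeatTimer.async_wait(
                [&socket](const boost::system::error_code& waitEc)
                {
                    if (!waitEc)
                        sendHeartbeat(socket);  // schedule the next ping
                });
        });
}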

Why do many libraries not detect dead TCP connections?

TCP has a keep-alive mechanism to detect dead connections, but it surprised me that this option is turned off by default and that many libraries/tools do not utilize this feature.
If I understand correctly, a process blocked in a recv call on a TCP connection won't be able to detect that the connection has actually been aborted by the peer if all the FIN/RST packets from the peer have been lost.
A timeout parameter on the client side may alleviate the issue, but many libraries do not have an option to set a timeout either. One example is that the mysql-python connector does not have a recv timeout option. Another example is an Nginx server talking to a gunicorn backend with proxy_pass: the gunicorn workers may stop responding due to dead connections, but there is no way for the gunicorn workers to detect this.
Could anyone explain the reason, or correct me if I am wrong?
The term "dead connection" is a bit ambiguous -- it could mean any of the following:
The peer program closed its socket (or the peer program exited or crashed, and the peer computer's OS closed the socket as part of its standard process-cleanup)
Connectivity to the peer computer has suddenly been lost (this could happen because the peer computer lost power, or somebody pulled out the Ethernet cord that was connecting the peer computer to the router, or the peer's ISP had a router failure, or your ISP had a router failure, or etc)
The peer program is still running but simply decided (for some reason, probably due to a bug) to stop calling recv() on its TCP socket.
The packet-path between your program and the remote peer still exists, sort of, but something along that path is dropping so many packets that the effective transmission rate of the TCP connection has dropped to approximately zero.
So the first question to answer is, which of the above conditions will the TCP layer detect on its own?
Condition (1) is the easy case -- the peer's TCP stack will send you the FIN packets, and when your program's network stack receives them, it will know for sure that the TCP connection is closed and act accordingly, and therefore your recv() call will return 0 very quickly.
In condition (2), the answer is "sometimes" -- in particular, if your program has any TCP data in the socket's output buffer that it is trying to send to the peer, and it never gets any ACK packets back for that data, then after a certain number of timeouts (and subsequent packet-resend attempts) your computer's TCP stack will give up, declare the connection dead, and unilaterally close the TCP connection; at which point recv() will return 0.
If there are no outgoing TCP data packets trying to be sent, on the other hand, then the local TCP stack won't be waiting for any ACKs to come back, and therefore it won't time out when it doesn't get them, and therefore it won't ever give up and close the TCP connection. In this scenario, your recv() call could well block indefinitely, because the TCP connection is idle and the TCP stack has no way of knowing that the peer is gone (as opposed to simply not sending any data right now).
It is this scenario that the SO_KEEPALIVE option was meant to handle, but since the designers of the SO_KEEPALIVE option wanted to conserve bandwidth by default, and sending automatic keepalive packets uses up additional bandwidth, they decided to make the keepalive option disabled by default. Also, the default send-a-keepalive interval is often quite long by modern standards (e.g. hours), and on some OSes it is difficult to change except on a system-wide basis, which makes SO_KEEPALIVE of limited usefulness for many applications.
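As an editorial aside to condition (2): on Linux there is also a per-socket TCP_USER_TIMEOUT option (kernel 2.6.37+) that bounds how long transmitted data may remain unacknowledged before the stack gives up, so the give-up time described above need not be left at its default. A sketch, assuming fd is a connected socket descriptor:

#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>  // TCP_USER_TIMEOUT (Linux-specific)

// Declare the connection dead if sent data stays unacknowledged
// for 10 seconds, instead of the default multi-minute schedule.
unsigned int timeoutMs = 10000;  // milliseconds
setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, &timeoutMs, sizeof(timeoutMs));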
For conditions (3) and (4), the TCP connection isn't really "dead", it's just that some device (either the peer program, or a piece of networking gear somewhere between your program and the peer) is being uncooperative. Since the TCP layer can't know what the applications that are using it are trying to achieve, it wisely doesn't try to second-guess them in this regard, and it leaves the TCP connection open unless you explicitly tell it to close() the connection.
So now that we've described the TCP layer's behavior, what about the applications and API's that use it? i.e. why don't they try to improve on the basic TCP-stack behavior by offering better detection? The answer is that some of them do; e.g. by periodically sending dummy "ping" messages across any socket that would otherwise be idle, simply to "stimulate" the TCP stack into detecting when no ACKs are coming back as described in the paragraph about condition (2), above. Some go even further and expect the remote peer to send a corresponding "pong" message to come back on the same socket within (so many) seconds, and if it doesn't, the program will unilaterally close the socket.
This sort-of works, but it also makes assumptions about the performance of your network, and that can lead to false positives and therefore unwanted disconnections when the peer is connecting via a slow or unreliable network, which is why many applications/libraries don't implement this (or at least don't enable it by default).
It's not surprising to me that keep-alive is turned off by default.
Because it's always possible that the peer program freezes due to a bug or error, etc. In that case recv also blocks forever even though the TCP connection is alive. So keep-alive may not be so useful after all (except to prevent a router from dropping the connection). Various reasons might cause your recv to block forever anyway.
Besides, a low-level underlying protocol for general purpose should probably be kept as simple as possible.
In addition, I'm not surprised by your examples about not being able to set a timeout either. Look at the most popular software tools in this world. They are polished, evolved, optimized, and have been used for a long time. Yet many of them still freeze, crash, or misbehave rather frequently. Writing correct code is meticulous work, not to mention further requirements like security, cross-platform support and backward compatibility. A programmer's life is not easy.

select() is not returning on the client side

I have written a client socket program using Linux sockets only. Here is an outline of what I am doing in my program (sketched in code below):
Creating the socket
Making a connection to the server socket
Assigning that socket to the read set and the exception set for select()
Calling select() with a NULL timeout value, in a separate thread
The server is running on an external device.
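A minimal sketch of that setup (the server address and port are hypothetical; error checking is omitted):

#include <sys/select.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

int main()
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);              // create the socket

    sockaddr_in server = {};                               // connect to the server
    server.sin_family = AF_INET;
    server.sin_port = htons(5000);
    inet_pton(AF_INET, "192.168.1.10", &server.sin_addr);
    connect(fd, reinterpret_cast<sockaddr*>(&server), sizeof(server));

    fd_set readSet, exceptSet;
    for (;;)
    {
        FD_ZERO(&readSet);   FD_SET(fd, &readSet);         // read + exception sets
        FD_ZERO(&exceptSet); FD_SET(fd, &exceptSet);
        select(fd + 1, &readSet, nullptr, &exceptSet, nullptr);  // NULL timeout: block
        if (FD_ISSET(fd, &readSet))
        {
            char buf[1024];
            ssize_t n = recv(fd, buf, sizeof(buf), 0);
            if (n == 0)
                break;  // server closed the connection (FIN received)
        }
    }
    close(fd);
    return 0;
}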
This program works fine for reading and everything else. Now I am facing a problem when I unplug the power cable of that device.
I assumed that when we remove the power cable of the device, all its sockets would be abruptly closed and the connected client sockets would get a read event; when we then try to read, we would receive zero bytes, meaning the connection was closed by the server.
But in my program, when I unplug the power cable of the device, select in my client program does not return, meaning the client socket is not getting any event. I don't understand why.
Any suggestion on how we can come to know that the connection was closed by the server, or any information on the socket's behaviour when the power supply is shut off, would be appreciated.
I need your help; it's very critical.
Thank you.
When a remote machine is suddenly cut off from the network (network cable unplugged or power loss), there is no way for it to inform the other side of the connection about that. What is more, a client side that performs only reads from a half-open socket (like in your case) won't be able to detect this either.
The only way to know about a connection loss is to send a packet. Since all data being sent must be acknowledged by the other side, TCP on the client computer will keep retrying to send the unconfirmed portion of data until the number of attempts is exhausted. Then an ETIMEDOUT error should be returned (via a socket that is expecting read events). You can create one more socket for sending such messages periodically to detect a peer's disappearance (a heartbeat connection) on the client side. But all these retries might still take some time.
Another option could be to use the SO_KEEPALIVE socket option. After a connection has been idle for some time, TCP starts sending probe messages to the server and can detect its disappearance. The default values for the idle time are usually enormously large, so they need to be modified. Some other parameters that might be relevant: TCP_KEEPCNT, TCP_KEEPINTVL, TCP_KEEPIDLE. This option may be implemented differently on different systems, or may simply be absent.
I've never personally tried to solve this problem, so all this is just a bunch of thoughts that might give you some ideas. Here is one more source of ideas.
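To make the keep-alive route concrete: once SO_KEEPALIVE (and, on Linux, TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT) has been set on the socket and the probes go unanswered, the blocked select() wakes up and the subsequent recv() fails, typically with ETIMEDOUT. A sketch, assuming fd is the connected socket from the question:

#include <sys/select.h>
#include <sys/socket.h>
#include <cerrno>

fd_set readSet;
FD_ZERO(&readSet);
FD_SET(fd, &readSet);
select(fd + 1, &readSet, nullptr, nullptr, nullptr);  // wakes up when the probes fail

char buf[1024];
ssize_t n = recv(fd, buf, sizeof(buf), 0);
if (n < 0 && errno == ETIMEDOUT)
{
    // The keep-alive probes went unanswered: treat the peer as gone.
}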

How to detect when socket connection is lost?

I have a script (I don't have the code example here at the moment, but I used IO::Async) which connects to a socket on a remote server and listens. The client usually just listens for new data.
The problem is that the client is not able to detect when network problems occur and the socket connection is gone.
I used IO::Async, and I also tried it with IO::Socket. The handle always appears "connected" after the initial connection is established.
If the network connection is established again, the socket connection is naturally still lost, because the script has no idea that it should reconnect.
I was thinking of creating some kind of "keepAlive" which "pings" (syswrite) the socket every X seconds (if nothing new came through the socket) to check whether the connection is still there.
Is this the correct way to do it, or is there maybe another, more creative or cleaner solution?
You can set the SO_KEEPALIVE socket option which, for TCP, sends periodic keepalive messages, and may help detect this condition. If this is detected, you will be delivered an EOF condition (most likely causing the containing IO::Async::Stream to fire on_read_eof).
For a better solution you might consider some sort of application-level keepalive message, such as IRC's PING command.
The short answer is that there is no default way to automatically detect a dropped socket in Perl.
Your approach of pinging would probably work pretty well; you could run a continuous thread in the background that sends ping requests, and if it doesn't receive a response, the main thread can be notified and a reconnect can be issued.
If you want to get messy, you can work with select() to detect keep-alive messages; however, this may require some OS configuration depending on your platform.
See this thread for more details: http://www.perlmonks.org/?node_id=566568

Is it possible to check if a TCP connection disconnected without writing to it?

I am wondering if it is possible to determine whether an accepted socket connection has been disconnected without trying to write to it.
IO::Select still indicates that the socket can be written to with can_write, even after the socket connection has been lost.
Is it possible to check whether a TCP connection has been disconnected without writing to it (in a situation where there is an unplanned internet outage)?
This is more a TCP than a Perl issue.
Events like a disconnected cable or a lost internet connection do not produce a TCP event. Thus you must write to a TCP connection to be sure that it is still connected. You might add a ping/echo message for the sole purpose of knowing that the connection is still available.
Generally, no. You'll usually only get a failure when you write: if you never write, it will just sit there. If you entirely lose network connectivity, I've seen errors pop up (on Windows; I haven't tried it on Linux), but you're typically required to try writing to the socket to verify that it's alive.
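To illustrate that last point, a write-side check might look like the sketch below (fd, data and len are placeholders; MSG_NOSIGNAL suppresses SIGPIPE on Linux so the failure surfaces as errno instead of a signal). Note that the first send() after the loss may still appear to succeed, because TCP buffers the data; the error often arrives on a later call.

#include <sys/socket.h>
#include <cerrno>

ssize_t n = send(fd, data, len, MSG_NOSIGNAL);
if (n < 0 && (errno == EPIPE || errno == ECONNRESET || errno == ETIMEDOUT))
{
    // The connection is dead: close fd and reconnect.
}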