Network packet loss causes client code to act strange - sockets

I am facing an issue and need some help coming up with the best way to resolve it. Here is the problem:
I have server code running which has a socket that is listening to accept new incoming connections.
I then attempt to start a client, which also has a socket that is listening to accept new incoming connections.
The client code begins with accepting a new connection on the listening socket file descriptor and gets a new socket file descriptor for I/O.
The server does the same thing and gets a new socket file descriptor for I/O.
Note: The client is not completely up yet. It needs to receive some bytes from the server, and send some, before it can start.
I then introduce some packet loss over the TCP/IP network connection. This causes certain errors (for example, the recv() system call in the client process sees zero received bytes, so the client closes the socket connection on its side and the associated socket file descriptor). However, this leaves the client process hanging, since there are other descriptors in the FD_SET but none of them are ready for I/O, so pselect() keeps returning 0 file descriptors ready for I/O. The client needs to send and receive certain bytes over the connection before it can start up.
My question is more of what should I do here ?
I researched the SO_KEEPALIVE option, which I could enable on the new socket returned by the accept() system call, but I do not think it would resolve my problem here, especially if the network packet loss is ongoing.
Should I kill the client process if I realize there are no file descriptors ready for I/O and never will be? Is there a better way to approach this?

If I'm reading the question correctly, the core of the question is: "what should your client program do when a TCP connection that is central to its functionality has been broken?"
The answer to that question is really a matter of preference -- what would you like your client program to do in that case? Or to put it another way, what behavior would your users find most useful?
In many of my own client programs, I have logic included such that if the TCP connection to the server is ever broken, the client will automatically try to create a new TCP connection to the server and thereby recover its connectivity and useful functionality as soon as possible.
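For what it's worth, here is a bare-bones sketch of that automatic-reconnect loop in C; the server address, port, and five-second retry delay are made-up placeholders, and error handling is kept to a minimum:

/* Sketch: reconnect to the server whenever the connection is lost.
 * The address, port, and retry delay below are placeholders. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

static int connect_to_server(const char *ip, unsigned short port)
{
    struct sockaddr_in addr = {0};
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, ip, &addr.sin_addr);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Loop forever, (re)connecting and servicing the connection. */
void run_client(void)
{
    for (;;) {
        int fd = connect_to_server("192.0.2.1", 3000);
        if (fd < 0) {
            sleep(5);          /* wait before the next attempt */
            continue;
        }
        /* ... perform the startup handshake and normal I/O here;
         * when recv() returns 0 or an error, fall through ... */
        close(fd);             /* connection broken: clean up and retry */
    }
}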
The other obvious option would be to just have the client quit when the connection is broken; perhaps with some sort of error indication so that the user will know why the client went away. (perhaps an error dialog that asks if the user would like to try to reconnect?)
SO_KEEPALIVE is probably not going to help you much in this scenario, by the way -- despite its name, its purpose is to help a program discover in a more timely manner that TCP connectivity has been lost, not to try harder to keep a TCP connection from being lost. (And it doesn't even serve that purpose particularly well, since in many TCP stacks only one keepalive packet is sent per hour or so, which means that even with SO_KEEPALIVE enabled it can be a very long time before your program starts receiving error messages reflecting the loss of network connectivity.)
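If you do decide to experiment with keepalives anyway, note that on Linux the probe timing can be tightened on a per-socket basis, so the hour-long default need not apply. A rough sketch, with purely illustrative values, assuming a Linux system where the TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT options are available:

#include <netinet/in.h>
#include <netinet/tcp.h>   /* TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT (Linux) */
#include <sys/socket.h>

/* Enable keepalive probes on 'fd' and (on Linux) tighten their timing so a
 * vanished peer is noticed within roughly idle + cnt*intvl seconds. */
static int enable_fast_keepalive(int fd)
{
    int on = 1, idle = 30, intvl = 10, cnt = 3;   /* illustrative values */

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,  &idle,  sizeof(idle));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,   &cnt,   sizeof(cnt));
    return 0;
}

Even with settings like these, keepalives only speed up detection of a broken connection; they do nothing to keep the connection working through ongoing packet loss.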

Related

How to find out that the client is reading from the TCP buffer in Go

I have been using Go for quite some time on a project. In my project I have to implement a TCP server which responds to TCP clients. The server has to send a number of messages to a client.
The problem is that when the server writes a message to a client connection, it has to wait until the client has read that message from the buffer before it sends another message (the server has to wait until the client calls the reader.ReadString('\n') method).
In my server code I wrote:
for {
    data := <-client.outgoing
    client.writer.WriteString(data + "\n")
    client.writer.Flush()
}
but the server sends all the messages to the client without waiting for the client's ReadString calls.
How can I make the server wait until the client has read a message before sending the next one?
I think that either the assignment is ambiguous or you're misinterpreting it and solving the XY problem.
The short answer is that you can never know whether the client has read a message just by looking at the TCP conversation. You have to implement this "protocol" in your application.
Here are a few problems:
From your application you don't really have access to what TCP is doing. You get a stream on which you can perform I/O.
The fact that a write to your stream "succeeds" only means that TCP has agreed to try to transport your stuff and has an independent copy. It doesn't say anything about whether the data has been received, and it doesn't even mean the data has been sent yet.
You may find certain mechanisms to peer into TCP's inner workings (such as ioctls, SIOCINQ, SIOCOUTQ, or various setsockopt options): these won't help.
Even if you find out what your TCP is doing, this only tells you what the remote TCP is doing. So if you have full control over your TCP and even see the acknowledgments from the peer, you still don't know what the application is doing. It's very possible the application didn't read the data yet (it might not have requested the data, the TCP might be withholding it in a buffer for some weird reason, the scheduler might not have scheduled the remote process etc.)
Going back to your question, a way to really know whether the remote application has received your message is to have the remote application tell you. This means you have to restructure your protocol (sketched after the list below) to:
Send stuff from the server
Wait for a message from the application telling you it received your stuff
Send more stuff (because you know from point 2 it's safe to do so)
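A rough sketch of that send-then-wait-for-acknowledgement step, written against the plain C sockets API used elsewhere on this page (the structure maps directly onto Go's bufio reader and writer); the single ACK byte and the newline framing are invented conventions, not part of any standard:

#include <string.h>
#include <sys/socket.h>

/* Send one newline-terminated message, then block until the peer sends a
 * single ACK byte ('\x06' here, purely by convention) before returning. */
static int send_and_wait_ack(int fd, const char *msg)
{
    char ack;

    if (send(fd, msg, strlen(msg), 0) < 0)   /* step 1: send stuff */
        return -1;
    if (send(fd, "\n", 1, 0) < 0)
        return -1;
    if (recv(fd, &ack, 1, 0) != 1)           /* step 2: wait for the peer's ACK */
        return -1;
    return ack == '\x06' ? 0 : -1;           /* step 3: caller may now send more */
}

On the Go side, the client would read one line with reader.ReadString('\n') and then write the ACK byte back before the server sends the next message.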

Why do many libraries not detect dead TCP connections?

TCP has a keep-alive mechanism to detect dead connections, but it surprised me that this option is turned off by default and many libraries/tools do not utilize this feature.
If I understand correctly, a recv call blocked on a TCP connection won't be able to detect that the connection has actually been aborted by the peer if all the FIN/RST packets from the peer have been lost.
A timeout parameter on the client side may alleviate the issue, but many libraries do not offer an option to set a timeout either. One example is that the mysql-python connector does not have a recv timeout option. Another is an Nginx server talking to a gunicorn backend with proxy_pass: gunicorn workers may stop responding due to dead connections, but there is no way for the workers to detect this.
Could anyone explain the reason, or correct me if I am wrong?
The term "dead connection" is a bit ambiguous -- it could mean any of the following:
The peer program closed its socket (or the peer program exited or crashed, and the peer computer's OS closed the socket as part of its standard process-cleanup)
Connectivity to the peer computer has suddenly been lost (this could happen because the peer computer lost power, or somebody pulled out the Ethernet cord that was connecting the peer computer to the router, or the peer's ISP had a router failure, or your ISP had a router failure, or etc)
The peer program is still running but simply decided (for some reason, probably due to a bug) to stop calling recv() on its TCP socket.
The packet-path between your program and the remote peer still exists, sort of, but something along that path is dropping so many packets that the effective transmission rate of the TCP connection has dropped to approximately zero.
So the first question to answer is, which of the above conditions will the TCP layer detect on its own?
Condition (1) is the easy case -- the peer's TCP stack will send you the FIN packets, and when your program's network stack receives them, it will know for sure that the TCP connection is closed and act accordingly, and therefore your recv() call will return 0 very quickly.
In condition (2), the answer is "sometimes" -- in particular, if your program has any TCP data in the socket's output buffer that it is trying to send to the peer, and it never gets any ACK packets back regarding that data, then after a certain number of timeouts (and subsequent packet-resend attempts) your computer's TCP stack will give up, declare the connection dead, and unilaterally close the TCP connection; at which point recv() will return 0. If there are no outgoing TCP data packets trying to be sent, on the other hand, then the local TCP stack won't be waiting for any ACKs to come back, and therefore it won't time out when it doesn't get them, and therefore it won't ever give up and close the TCP connection. In this scenario, your recv() call could well block indefinitely, because the TCP connection is idle and the TCP stack has no way of knowing that the peer is gone (as opposed to simply not sending any data right now).
It is this second scenario that the SO_KEEPALIVE option was meant to handle, but since the designers of the SO_KEEPALIVE option wanted to conserve bandwidth by default, and sending automatic keepalive packets uses up additional bandwidth, they decided to make the keepalive option disabled by default. Also, the default send-a-keepalive interval is often quite long by modern standards (e.g. hours), and on some OSes it is difficult to change except on a system-wide basis, all of which makes SO_KEEPALIVE of limited usefulness for many applications.
For conditions (3) and (4), the TCP connection isn't really "dead", it's just that some device (either the peer program, or a piece of networking gear somewhere between your program and the peer) is being uncooperative. Since the TCP layer can't know what the applications that are using it are trying to achieve, it wisely doesn't try to second-guess them in this regard, and it leaves the TCP connection open unless you explicitly tell it to close() the connection.
So now that we've described the TCP layer's behavior, what about the applications and API's that use it? i.e. why don't they try to improve on the basic TCP-stack behavior by offering better detection? The answer is that some of them do; e.g. by periodically sending dummy "ping" messages across any socket that would otherwise be idle, simply to "stimulate" the TCP stack into detecting when no ACKs are coming back as described in the paragraph about condition (2), above. Some go even further and expect the remote peer to send a corresponding "pong" message to come back on the same socket within (so many) seconds, and if it doesn't, the program will unilaterally close the socket. This sort-of works, but it also makes assumptions about the performance of your network, and that can lead to false positives and therefore unwanted disconnections when the peer is connecting via a slow or unreliable network, which is why many applications/libraries don't implement this (or at least don't enable it by default).
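As an illustration only, here is one way such a ping/pong check might look with the C sockets API, using a receive timeout on the socket; the "PING"/"PONG" strings and the timeout value are made-up conventions:

#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/* Send a "PING\n" and require a "PONG\n" back within 'seconds'.
 * Returns 0 if the peer answered, -1 if it did not (treat as dead). */
static int ping_peer(int fd, int seconds)
{
    struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };
    char buf[8];
    ssize_t n;

    setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
    if (send(fd, "PING\n", 5, 0) != 5)
        return -1;
    n = recv(fd, buf, sizeof(buf) - 1, 0);   /* blocks for at most 'seconds' */
    if (n <= 0)
        return -1;                           /* timeout, error, or EOF */
    buf[n] = '\0';
    return strncmp(buf, "PONG", 4) == 0 ? 0 : -1;
}

As noted above, whatever timeout you pick is an assumption about the network, so a slow or congested link can trip it and cause an unwanted disconnect.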
It's not surprising to me that keep-alive is turned off by default.
It's always possible that the peer program freezes due to a bug or error, etc.; in that case recv() also blocks forever even though the TCP connection is alive. So keep-alive may not be so useful after all (except to prevent a router from dropping the connection). Various reasons might cause your recv() to block forever anyway.
Besides, a low-level, general-purpose protocol should probably be kept as simple as possible.
In addition, I'm not surprised by your examples of not being able to set timeouts either. Look at the most popular software tools in the world: they are polished, evolved, optimized, and have been used for a very long time, yet many of them still freeze, crash, or misbehave rather frequently. Writing correct code is meticulous work, not to mention further requirements like security, cross-platform support, and backward compatibility. A programmer's life is not easy.

How to close a TCP socket abruptly and reconnect

We use Linux 2.6.33 on the device. A PC application connects to the device over a TCP socket. A maximum of 1 MB of data is transferred upon request from the application.
Consider the case where the application is connected and working fine, and then the IP address of the device gets changed (which is possible with a command). The connection is now broken abruptly at the application, but not on the device. The user, however, expects to reconnect to the device at its new IP address. In some cases a send() was halfway done and may not have sent all of its data; then the close does not happen immediately, and the user will not be able to reconnect until the socket is closed on the device.
I use the shutdown(sock, SHUT_RDWR) followed by close(sock) sequence.
The netstat output shows:
$ netstat -nt
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 29200 xx.xx.xx.xx:3000 xx.xx.xx.xx:50639 ESTABLISHED
As it shows, the Send-Q is still not empty, so the connection is still in the ESTABLISHED state. This socket is closed after some time, and I am not sure how long that is; it is probably defined by the socket implementation.
How can I close this socket immediately so that a new connection over the same TCP port is possible from the PC application?
It always takes time to close a socket, but you can use setsockopt() to reuse the same port. You can find more detail at this link.
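Assuming the option being suggested is SO_REUSEADDR (the link referred to above was not preserved here), it is set on the listening socket before bind(); a minimal sketch with most error handling omitted:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Allow bind() on the port to succeed even while an old connection on that
 * port is still draining or sitting in TIME_WAIT on the device. */
static int make_listener(unsigned short port)
{
    int one = 1;
    struct sockaddr_in addr = {0};
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return -1;
    if (listen(fd, 5) < 0)
        return -1;
    return fd;
}

Note that this only lets the device bind and listen on the port again; it does not make the old, half-dead connection drain its Send-Q any faster.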
"How can I close this socket immediately so that a new connection over the same TCP port is possible from the PC application?"
You can close it by calling close() of course, but what you really want to know is how to detect this condition so that you know when to call close(). The answer to that is: not easily -- as far as your Linux device is concerned, the client at (the_old_ip_address) has simply stopped responding. The Linux device has no way of knowing that this condition is due to the client's IP address having changed, rather than (say) a temporary network outage, so the Linux device will keep trying for a few minutes before it times out.
As far as resolving the problem you are having, probably the best way is to design your server so that it can accept and service multiple TCP connections at once. If you do that, then it won't really matter if the old TCP connection sticks around for a few minutes, because the user will still be able to reconnect immediately from his new IP address and resume his work without having to wait.
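To make that concrete, here is a compressed sketch of such a server in C, built around a select() loop so that a fresh connection from the PC's new IP address can be accepted and serviced even while the stale connection is still timing out; listen_fd is assumed to come from the usual socket()/bind()/listen() sequence, and real code would need proper error handling:

#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_CLIENTS 16

/* Serve many clients from one loop: new connections are accepted as they
 * arrive, and a client whose recv() reports EOF or an error is dropped. */
void serve(int listen_fd)
{
    int clients[MAX_CLIENTS], nclients = 0;
    char buf[4096];

    for (;;) {
        fd_set rfds;
        int i, maxfd = listen_fd;

        FD_ZERO(&rfds);
        FD_SET(listen_fd, &rfds);
        for (i = 0; i < nclients; i++) {
            FD_SET(clients[i], &rfds);
            if (clients[i] > maxfd)
                maxfd = clients[i];
        }
        if (select(maxfd + 1, &rfds, NULL, NULL, NULL) < 0)
            continue;

        if (FD_ISSET(listen_fd, &rfds)) {
            int c = accept(listen_fd, NULL, NULL);
            if (c >= 0 && nclients < MAX_CLIENTS)
                clients[nclients++] = c;     /* start servicing the new client */
            else if (c >= 0)
                close(c);                    /* table full: refuse politely */
        }

        for (i = 0; i < nclients; i++) {
            if (!FD_ISSET(clients[i], &rfds))
                continue;
            if (recv(clients[i], buf, sizeof(buf), 0) <= 0) {
                close(clients[i]);                    /* closed or dead client */
                clients[i--] = clients[--nclients];   /* compact, recheck slot */
            }
            /* else: parse the request in buf and send the reply here */
        }
    }
}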
That said, the other approach would be to add timeouts to your server, such that if no data is sent or received on the TCP socket for N seconds (for whatever value of N you think is appropriate), the server explicitly calls close() on the socket and listens for a new incoming connection. This isn't a very good solution, though, since it makes an assumption about the performance of the TCP connection (i.e. that when in its "working" state it will always be sending or receiving data at least once every N seconds), and that assumption might not be true -- e.g. if your client and server don't want to send anything for a while, or if the network connectivity is bad enough that no data gets through for more than N seconds. The risk is that this will lead to a false positive, causing your server to close a TCP connection that was still valid. Therefore I recommend the first approach rather than this one, if possible.

How to detect when socket connection is lost?

I have a script (I don't have the code example here at the moment, but I used IO::Async) which connects to a socket on a remote server and listens. The client usually just listens for new data.
The problem is that the client is not able to detect when network problems occur and the socket connection is gone.
I used IO::Async and I also tried it with IO::Socket. The handle always appears "connected" after the initial connection is established.
Even if the network connection is established again, the socket connection is naturally still lost, because the script has no idea that it should reconnect.
I was thinking of creating some kind of "keepAlive" which "pings" (syswrite) the socket every X seconds (if nothing new has come through the socket) to check whether the connection is still there.
Is this the correct way to do it, or is there maybe another, more creative or cleaner solution?
You can set the SO_KEEPALIVE socket option which, for TCP, sends periodic keepalive messages, and may help detect this condition. If this is detected, you will be delivered an EOF condition (most likely causing the containing IO::Async::Stream to fire on_read_eof).
For a better solution you might consider some sort of application-level keepalive message, such as IRC's PING command.
The short answer is that there is no default way to automatically detect a dropped socket in Perl.
Your approach of pinging would probably work pretty well; you could run a continuous thread in the background that sends ping requests, and if it doesn't receive a response the main thread can be notified and a reconnect can be issued.
If you want to get messy, you can work with select() to detect keep-alive messages; however, this may require some OS configuration depending upon your platform.
See this thread for more details: http://www.perlmonks.org/?node_id=566568

Will a TCP RST cause a host to drop the receive buffer?

Upon receiving a TCP RST packet, will the host drop all the remaining data in the receive buffer that has already been ACKed by the remote host but not read by the application process using the socket?
I'm wondering if it's dangerous to close a socket as soon as I'm not interested in what the other host has to say anymore (e.g. to conserve resources); that is, whether it could cause the other party to lose data I've already sent but which it has not yet read.
Should RSTs generally be avoided and indicate a complete, bidirectional failure of communication, or are they a relatively safe way to unidirectionally force a connection teardown as in the example above?
I've found some nice explanations of the topic; they indicate that data loss is quite possible in that case:
http://blog.olivierlanglois.net/index.php/2010/02/06/tcp_rst_flag_subtleties
http://blog.netherlabs.nl/articles/2009/01/18/the-ultimate-so_linger-page-or-why-is-my-tcp-not-reliable also gives some more information on the topic, and offers a solution that I've used in my code. So far, I've not seen any RSTs sent by my server application.
An application-level close(2) on a socket does not produce an RST but a FIN packet sent to the other side, which results in the normal four-way connection tear-down. RSTs are generated by the network stack in response to packets targeting a non-existent TCP connection.
On the other hand, if you close the socket but the other side still has some data to write, its next send(2) will result in EPIPE.
With all of the above in mind, you are much better off designing your own protocol on top of TCP that includes an explicit "logout" or "disconnect" message.
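As an illustration, such a disconnect might combine an explicit application-level goodbye with the drain-before-close technique described in the SO_LINGER article linked above; the "BYE" message is an invented placeholder:

#include <sys/socket.h>
#include <unistd.h>

/* Tell the peer we are done, stop sending, then read until the peer closes
 * its side, so we never discard data the peer has already sent. */
static void polite_disconnect(int fd)
{
    char buf[4096];

    send(fd, "BYE\n", 4, 0);          /* application-level "logout" message */
    shutdown(fd, SHUT_WR);            /* send FIN; we will not write again */
    while (recv(fd, buf, sizeof(buf), 0) > 0)
        ;                             /* drain whatever the peer still sends */
    close(fd);                        /* recv() returned 0: peer closed too */
}

The shutdown(SHUT_WR) sends the FIN, and the final recv() loop ensures we do not call close() while the peer is still sending us data, which is the situation that tends to provoke an RST.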