How to know if server rebooted or crashed reliably on the client

I have a server and a client that use TCP for communication and file transfer. In the middle of a file transfer from server to client, the server crashes or reboots. The client was in a read on the connection's file descriptor when this happened, and the read returned 0, indicating EOF. This seems incorrect, since the server may not have sent everything. How can the client tell that the server did not shut down gracefully but crashed instead?

Related

Why does the server application send RST after having gone through SYN->SYN,ACK->ACK?

I have a system with server/client applications. The client sends a socket connection request, and the server accepts it when everything is working correctly. However, in some situations (most likely after an ungraceful disconnection, such as a system shutdown or crash on the client side), the client is unable to reconnect to the server application. The Wireshark capture shows that the client keeps trying to connect, but after going through SYN->SYN,ACK->ACK, the server application sends RST. At this point, netstat -an sometimes shows the connection in the CLOSE_WAIT state and other times does not show the connection at all. The capture shows 'Acknowledgment Number: Broken TCP. The acknowledge field is nonzero while the ACK flag is not set.'
My question is: why would the server application send this RST?

epoll_wait missing EPOLLIN events on a TCP socket fd

On the server side: I am using epoll_wait to monitor the possible read IO on a TCP socket.
On the client side: I have a single threaded app to write to the socket that's connected to the server.
The problem is that epoll_wait sometimes doesn't report that there is new data to read, even after the client has sent a new message. (I confirmed with Wireshark that the message is indeed received by the server.) So the client hangs waiting for the server's response. BUT: if I kill the client connection, epoll_wait does get notified!
Originally I was using EPOLLET and thought that might be the problem, but the issue still exists after removing EPOLLET.
Is there any tool I can use to debug this (e.g., from outside the server process, to confirm that there is data in the server socket's queue that epoll_wait isn't reporting)? Any thoughts or guidance on how to debug this would be appreciated.

tcp connection issue for unreachable server after connection

I am facing an issue with TCP connections.
I have a number of clients connected to a remote server over TCP.
If the server becomes unreachable for any reason after the TCP connection has been successfully established, I do not receive any error on the client side.
If I run netstat on the client, it shows the clients as connected to the remote server, even though I cannot ping the server.
So I end up in a state where the server shows no connected clients while the client shows it is connected to the server.
I have also tested this with WebSockets in node.js, and the same behavior persists there.
I have tried googling it, but with no luck.
Is there a standard solution for this?

This is by design.
If two endpoints have a successful socket (TCP) connection between each other, but aren't sending any data, then the TCP state machines on both endpoints remain in the ESTABLISHED state.
Imagine if you had a shell connection open in a terminal window on your PC at work to a remote Unix machine across the Internet. You leave work that evening with the terminal window still logged in and at the shell prompt on the remote server.
Overnight, some router in between your PC and the remote computer goes out. Hours later, the router is fixed. You come into work the next day and start typing at the shell prompt. It's like the loss of connectivity never happened. How is this possible? Because neither socket on either endpoint had anything to send during the outage. Given that, there was no way that the TCP state machine was going to detect a connectivity failure, because no traffic was actually occurring. Now if you had tried to type something at the prompt during the outage, then the socket connection would eventually have timed out once its retransmissions went unacknowledged (typically within minutes, depending on the OS's retransmission settings), and the terminal session would have ended.
One workaround is to enable the SO_KEEPALIVE option on your socket. YMMV with this socket option: the default keep-alive idle time is typically two hours, and not every platform lets you control the probe rate per socket.
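As a rough illustration of the SO_KEEPALIVE workaround, here is a minimal Python sketch. The function name `enable_keepalive` and the timing values are made up for the example. The `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` options are Linux-style per-socket knobs that are not available everywhere, so the sketch only sets them when the `socket` module exposes them.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keep-alive for sock (hypothetical helper, illustrative values).

    idle_s:     seconds of silence before the first probe is sent
    interval_s: seconds between unanswered probes
    probes:     unanswered probes before the connection is declared dead
    """
    # Portable part: ask the OS to send keep-alive probes at all.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Platform-specific part: per-socket timing, only where the constants exist.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
```

With values like these, a dead peer would be noticed after roughly idle_s + interval_s * probes seconds of silence; a blocked read then fails with an error instead of waiting forever.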
A more common approach is to just have your socket send data periodically. Some protocols on top of TCP that I've worked with have their own notion of a "ping" message for this very purpose. That is, the client sends a "ping" message over the TCP socket every minute and the server responds with "pong" or some equivalent. If either side fails to get the expected ping/pong message within N minutes, then the connection, regardless of socket error state, is assumed to be dead. This approach of sending periodic messages also helps with NATs, which tend to drop TCP connections for very quiet protocols when they don't observe traffic over a period of time.
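A ping/pong check of this kind could look roughly like the sketch below. The message bytes `PING`/`PONG` and the function names are invented for the example and are not part of any particular protocol. A real protocol would also need to tell heartbeat replies apart from ordinary data, which this sketch ignores.

```python
import socket

PING = b"PING\n"  # hypothetical wire format for the example
PONG = b"PONG\n"


def peer_is_alive(sock, timeout_s=5.0):
    """Send one PING and wait up to timeout_s per read for the PONG reply."""
    sock.settimeout(timeout_s)
    try:
        sock.sendall(PING)
        reply = b""
        while len(reply) < len(PONG):
            chunk = sock.recv(len(PONG) - len(reply))
            if not chunk:
                return False  # peer closed the connection
            reply += chunk
        return reply == PONG
    except OSError:
        # Covers timeouts as well as resets and broken pipes.
        return False


def answer_one_ping(sock):
    """Other side: read one PING and answer it with PONG."""
    request = b""
    while len(request) < len(PING):
        chunk = sock.recv(len(PING) - len(request))
        if not chunk:
            return False
        request += chunk
    if request != PING:
        return False
    sock.sendall(PONG)
    return True
```

The caller would run `peer_is_alive` on a timer and drop the connection when it returns False, whatever the socket's own error state says.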

Is TCP Reset (RST) two way?

I have a client-server (Java) application using persistent TCP connections, but sometimes the server gets a java.io.IOException: Connection reset by peer when trying to write to the socket, while I don't see any error in the client log.
This RST is probably caused by an intermediate proxy/router, but if that's the case, shouldn't it be seen on the client as well?

If the RST is sent by the client, it can be seen on it using a packet sniffer such as wireshark. However, it won't show up in any user-level sockets since it's sent by the OS as a response to various erroneous inputs (such as connection attempts to a closed port).
If the RST is sent by the network, then it's pretending to be the client to sever the connection. It can do so in one direction, or in both of them. In that case, the client might not see anything, except for a RST sent by the actual server when the client continues to send data to a connection it perceives as open, while the server sees it as closed.
Try capturing the traffic on both the server and the client to see where the resets are coming from.

socket error 10054

I have a client/server program. The client uses a socket to send a file to the server. After sending a little more than 700 KB of data, the client (on Windows 7) gets socket error 10054, which means "Connection reset by peer".
The server runs on CentOS 5.4; the client is a Windows 7 virtual machine running in VirtualBox. The client and server communicate via a virtual network interface.
The command port (which sends logs) works normally, but the data port (which sends the file) has the problem.
Could it be caused by a wrong socket buffer size configuration, or by something else?
Can anyone help me track down the problem? Thanks.
Each call to send passes a 4096-byte buffer:
send(socket, buffer, 4096, 0 )
CentOS socket config.
#sysctl -a
...
net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_wmem = 4096 16384 4194304
net.ipv4.tcp_mem = 196608 262144 393216
net.ipv4.tcp_dsack = 1
net.ipv4.tcp_ecn = 0
net.ipv4.tcp_reordering = 3
net.ipv4.tcp_fack = 1
I don't quite understand what the socket buffer configuration means. Could it cause the incomplete receive?

It's almost definitely a bug in your code. Most likely, one side thinks the other side has timed out and so closes the connection abnormally. The most common way this happens is that you call a receive function to get data, but you actually already got that data and just didn't realize it. So you're waiting for data that you have already received and thus time out.
For example:
1) Client sends a message.
2) Client sends another message.
3) Server reads both messages but thinks it only got one, sends an acknowledge.
4) Client receives acknowledge, waits for second acknowledge which server will never send.
5) Server waits for second message which it actually already received.
Now the server is waiting for the client and the client is waiting for the server. The server was coded incorrectly and didn't realize that it actually got two messages in one go. TCP does not preserve message boundaries.
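Since TCP delivers a byte stream, the usual fix is to put the boundaries into the protocol yourself. The sketch below uses a 4-byte big-endian length prefix. That is only one common convention and not necessarily what the poster's protocol does, and the names `send_msg`, `recv_exact` and `recv_msg` are made up for the example.

```python
import socket
import struct

HEADER = struct.Struct("!I")  # 4-byte unsigned length, network byte order


def send_msg(sock, payload):
    """Send one message as <length><payload>."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes, looping because recv may return fewer."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            # EOF before the promised byte count: the sender went away early.
            raise ConnectionError(f"peer closed after {len(buf)} of {n} bytes")
        buf.extend(chunk)
    return bytes(buf)


def recv_msg(sock):
    """Read one complete message, however the bytes were split or merged."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)
```

With this framing, two messages that arrive in a single read are still returned one at a time by `recv_msg`. It also lets a receiver notice a transfer that ended early, because an EOF before the announced length raises instead of looking like a normal end of data.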
If you tell me more about your protocol, I can probably tell you in more detail what went wrong. What constitutes a message? Which side sends when? Are there any acknowledgements? And so on.
But the short version is that each side is probably waiting for the other.
Most likely, the connection reset by peer is a symptom: the deadlock occurs, one side times out and aborts the connection, and the other side then sees a connection reset.
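To see how an abort on one side surfaces as a reset on the other, here is a small loopback experiment in Python. It forces an abortive close with SO_LINGER set to a zero timeout. That is just one way to trigger an RST and is an assumption of this sketch, not necessarily what the poster's server does. The `struct.pack("ii", ...)` layout matches `struct linger` on Linux/BSD; Windows uses a different layout. The function names are invented.

```python
import socket
import struct


def abortive_close(sock):
    """Close sock so that the stack sends RST instead of a normal FIN."""
    # l_onoff=1, l_linger=0: discard unsent data and reset on close.
    # Layout assumption: two C ints, as on Linux/BSD.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    sock.close()


def observe_peer_abort():
    """Return 'reset', 'eof' or 'data' depending on what the reader sees."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # any free loopback port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server_side, _ = listener.accept()
    try:
        abortive_close(server_side)  # the "aborting" side
        try:
            data = client.recv(4096)
            return "eof" if data == b"" else "data"
        except ConnectionResetError:
            return "reset"  # the errno behind Winsock's 10054
    finally:
        client.close()
        listener.close()
```

A graceful `close()` without the linger option would instead make the reader see an empty read (EOF), which is the difference between the two cases.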