What causes the ENOTCONN error? - sockets

I'm currently maintaining some web server software and I need to perform a lot of I/O operations. The read(), write(), close() and shutdown() calls, when used on a socket, may sometimes raise an ENOTCONN error. What exactly does this error mean? What are the conditions that would trigger it? I can never seem to reproduce it locally but there are users who can.
Right now I just ignore ENOTCONN when raised by close() and shutdown() because it seems harmless, but I'm not entirely sure.
EDIT:
I am absolutely sure that the connect() call succeeded. I check for its return value.
ENOTCONN is most often raised by close() and shutdown(). I've only very rarely seen a read() and write() raising ENOTCONN.

If you are sure that nothing on your side of the TCP connection is closing the connection, then it sounds to me like the remote side is closing the connection.
ENOTCONN, as others have pointed out, simply means that the socket is not connected. This doesn't necessarily mean that connect failed. The socket may well have been connected previously, it just wasn't at the time of the call that resulted in ENOTCONN.
This differs from:
ECONNRESET: the other end of the connection sent a TCP reset packet. This can happen if the other end is refusing a connection, or doesn't acknowledge that it is already connected, among other things.
ETIMEDOUT: this generally applies only to connect. This can happen if the connection attempt is not successful within a system-dependent amount of time.
EPIPE can sometimes be returned by some socket-related system calls under conditions that are more or less the same as ENOTCONN. For example, on some systems, EPIPE and ENOTCONN are synonymous when returned by send.
While it's not unusual for shutdown to return ENOTCONN, since this function is supposed to tear down the TCP connection, I would be surprised to see close return ENOTCONN. It really should never do that.
Finally, as dwc mentioned, EBADF shouldn't apply in your scenario unless you are attempting some operation on a file descriptor that has already been closed. Having a socket get disconnected (i.e. the TCP connection has broken) is not the same as closing the file descriptor associated with that socket.
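The difference between a disconnected socket and a closed descriptor can be seen in a small experiment. The following is only a sketch, assuming a POSIX system; `shutdown_errno` and `demo_enotconn_vs_ebadf` are made-up names for illustration. It calls shutdown() once on a TCP socket that was never connected and once on the same descriptor number after close().

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: errno produced by shutdown(fd, SHUT_RDWR), or 0 on success. */
static int shutdown_errno(int fd)
{
    if (shutdown(fd, SHUT_RDWR) == 0)
        return 0;
    return errno;
}

/* Hypothetical demo: returns 0 if both cases behave as described in the text. */
static int demo_enotconn_vs_ebadf(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    assert(fd >= 0);

    /* Valid descriptor, but no connection behind it: ENOTCONN is expected. */
    int never_connected = shutdown_errno(fd);

    /* After close() the number no longer names a socket: EBADF is expected. */
    close(fd);
    int already_closed = shutdown_errno(fd);

    return (never_connected == ENOTCONN && already_closed == EBADF) ? 0 : 1;
}
```

In a multi-threaded server the second case is less predictable, because another thread may have been handed the same descriptor number in the meantime.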

It's because, at the moment you call shutdown() on the socket, you still have data in the socket's buffer waiting to be delivered to a remote party that has already closed or shut down its receiving socket.
I don't fully understand how sockets work, I'm rather a noob, and I couldn't even find the files where this "shutdown" function is implemented. But since there's practically no user manual for the whole sockets thing, I tried all the possibilities until I got the error in a "controlled" environment. It could be something else, but after much trying these are the explanations I settled on:
If you sent data after the remote side closed the connection, you get the error when you call shutdown().
If you sent data before the remote side closed the connection but it was never received on the other end, you can call shutdown() once; the next time you call shutdown(), you get the error.
If you didn't send any data, you can call shutdown() as many times as you want, as long as the remote side hasn't called shutdown(); once the remote side has shut down, calling shutdown() on a socket that was already shut down gives you the error.

I believe ENOTCONN is returned, because shutdown() is not supposed to return ECONNRESET or other more accurate errors.
It is wrong to assume that the other side "just" closed the connection. At the TCP level, the other side can only half-close a connection (or abort it). The connection is fully closed in the ordinary way only if both sides do a shutdown() (or close()). If both sides do that, shutdown() actually succeeds for both of them!
The problem is that shutdown() did not succeed in an ordinary (half-)close of the connection, neither as the first side to close it nor as the second. Of the errors listed in the POSIX docs for shutdown(), ENOTCONN is the least inappropriate, because the others indicate problems with the arguments passed to shutdown() (or a lack of local resources to handle the request).
So what happened? These days, a NAT device somewhere between the two parties involved might have dropped the association and now sends out RST packets in response. Reset connections are so common for IPv4 that you will get them anywhere in your code, even masked as ENOTCONN in shutdown().
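A coding bug might also be the reason. On a non-blocking socket, for example, connect() usually returns -1 with errno set to EINPROGRESS. That does not mean the connection failed, but it does not mean the connection is established yet either; treating the socket as connected at that point can produce ENOTCONN on later calls.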
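For the non-blocking case, the usual pattern is to treat EINPROGRESS as "not connected yet", wait until the socket becomes writable, and then read SO_ERROR to learn the real outcome. A minimal sketch, assuming a POSIX system; `connect_nonblocking` and `demo_connect_loopback` are hypothetical names, and the demo connects to a throwaway listener on the loopback interface.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 once the connection is really established,
 * otherwise an errno value describing why it is not. */
static int connect_nonblocking(int fd, const struct sockaddr *addr,
                               socklen_t len, int timeout_ms)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        return errno;

    if (connect(fd, addr, len) == 0)
        return 0;                 /* connected immediately */
    if (errno != EINPROGRESS)
        return errno;             /* genuine failure */

    /* Still in progress: wait until the socket is reported writable. */
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    int n;
    do {
        n = poll(&pfd, 1, timeout_ms);
    } while (n < 0 && errno == EINTR);
    if (n == 0)
        return ETIMEDOUT;
    if (n < 0)
        return errno;

    /* Writable does not imply success; SO_ERROR holds the final result. */
    int err = 0;
    socklen_t errlen = sizeof err;
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen) < 0)
        return errno;
    return err;
}

/* Hypothetical demo: connect to a temporary loopback listener. */
static int demo_connect_loopback(void)
{
    struct sockaddr_in sa;
    socklen_t slen = sizeof sa;
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    int cfd = socket(AF_INET, SOCK_STREAM, 0);
    assert(lfd >= 0 && cfd >= 0);

    memset(&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sa.sin_port = 0;              /* let the kernel pick a free port */

    int err = -1;
    if (bind(lfd, (struct sockaddr *)&sa, sizeof sa) == 0 &&
        listen(lfd, 1) == 0 &&
        getsockname(lfd, (struct sockaddr *)&sa, &slen) == 0)
        err = connect_nonblocking(cfd, (struct sockaddr *)&sa, sizeof sa, 2000);

    close(cfd);
    close(lfd);
    return err;
}
```

Only after this helper returns 0 is it safe to start calling read(), write() or shutdown() on the socket.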

Transport endpoint is not connected
The socket is associated with a connection-oriented protocol and has not been connected. This is usually a programming flaw.
From: http://www.wlug.org.nz/ENOTCONN

If you're sure you've connected properly in the first place, ENOTCONN is most likely to be caused by either the fd being closed on your end (perhaps in another thread?) while you're in the middle of a request, or by the connection dropping while you're in the middle of the request.
In any case, it means that the socket is not connected. Go ahead and clean up that socket. It's dead. No problem calling close() or shutdown() on it.
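A cleanup routine that follows this advice could look roughly like the sketch below. It assumes a POSIX system; `close_connection` and `demo_close_unconnected` are invented names. ENOTCONN from shutdown() is treated as "already disconnected" and ignored, while any other error is still reported.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: tear down a socket, tolerating ENOTCONN from shutdown().
 * Returns 0 on success, otherwise the errno of the first unexpected failure. */
static int close_connection(int fd)
{
    int shutdown_err = 0;

    /* ENOTCONN only says the connection is already gone, so it is ignored. */
    if (shutdown(fd, SHUT_RDWR) < 0 && errno != ENOTCONN)
        shutdown_err = errno;

    /* The descriptor must be released in every case. */
    if (close(fd) < 0)
        return shutdown_err ? shutdown_err : errno;
    return shutdown_err;
}

/* Hypothetical demo: a never-connected TCP socket still cleans up without error. */
static int demo_close_unconnected(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    assert(fd >= 0);
    return close_connection(fd);
}
```

An EBADF coming out of this function is worth logging, since it points at a descriptor that was closed twice or closed by another thread.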

Related

What is correct procedure following a failure to connect a TCP socket?

I'm writing a TCP client using asynchronous calls. If the server is active when the app starts then it connects and talks OK. However if the first connect fails, then every subsequent call to connect() fails with WSAENOTCONN(10057) without producing any network traffic (checked with Wireshark).
Currently the code does not close the socket when it gets the error. The TCP state diagram does not seem to require it. It simply waits for 30 seconds and tries the connect() again.
That subsequent connect() and the following read() both return the WSAENOTCONN error.
Do I need to close the socket and open a new one? If so, which errors require me to close the socket, since there are a lot of different errors, some of which I will probably never see on the test bench.
You can assume this is MS Winsock2, although it is actually Interval Zero RTX 2009, which is subtly different in some places.
Do I need to close the socket and open a new one?
Yes.
If so, which errors require me to close the socket, since there are a lot of different errors, some of which I will probably never see on the test bench.
Almost all errors are fatal to the connection and should result in you closing the socket. EAGAIN/EWOULDBLOCK is a prominent exception, as is EINTR, but I can't think of any others offhand.
Do I need to close the socket and open a new one?
Yes.
You should close the socket under every error condition that means the connection is gone for good (for example, when the peer has closed the connection).
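On POSIX systems the same rule is spelled out for connect() itself: after a failed connect() the state of the socket is unspecified, so the descriptor should be closed and a new socket created before retrying. The sketch below illustrates that with BSD sockets rather than Winsock or RTX, whose details may differ; `connect_with_retry` and `demo_retry_loopback` are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: try to connect up to `attempts` times, using a fresh
 * socket for every attempt. Returns a connected descriptor, or -1. */
static int connect_with_retry(const struct sockaddr *addr, socklen_t len,
                              int attempts, unsigned delay_seconds)
{
    for (int i = 0; i < attempts; i++) {
        int fd = socket(addr->sa_family, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;
        if (connect(fd, addr, len) == 0)
            return fd;            /* success: caller owns the descriptor */

        int saved = errno;
        close(fd);                /* never reuse a socket whose connect() failed */
        errno = saved;
        if (i + 1 < attempts)
            sleep(delay_seconds);
    }
    return -1;
}

/* Hypothetical demo: connect to a temporary loopback listener. */
static int demo_retry_loopback(void)
{
    struct sockaddr_in sa;
    socklen_t slen = sizeof sa;
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    assert(lfd >= 0);

    memset(&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sa.sin_port = 0;              /* let the kernel pick a free port */

    int fd = -1;
    if (bind(lfd, (struct sockaddr *)&sa, sizeof sa) == 0 &&
        listen(lfd, 1) == 0 &&
        getsockname(lfd, (struct sockaddr *)&sa, &slen) == 0)
        fd = connect_with_retry((struct sockaddr *)&sa, sizeof sa, 3, 0);

    int ok = (fd >= 0);
    if (fd >= 0)
        close(fd);
    close(lfd);
    return ok ? 0 : 1;
}
```

The 30 second wait from the question would simply be passed as `delay_seconds`.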

potential for file id collision in C when doing pthread network io

I have an app in c that listens on a port and creates a pthread upon connection and goes back to the listen. The pthread functions reads from the socket, writes a response and then waits 1/10th of a sec followed by a shutdown() and a close() then pthread_exit(). This can happen very rapidly resulting in possibly hundreds of threads at the same time. My question is can the system reuse a file id before I do the final close()? I'm concerned about the possibility of the socket closing prematurely for some reason. On the listening side the file id cannot be reused until I do the close() call even if the underlying connection is long gone, right? I'm fairly sure that this is how it works but I can't confirm.
On the listening side the file id cannot be reused until I do the
close() call even if the underlying connection is long gone, right?
Yes, this is correct - the file descriptor is not released for re-use until it has been passed to close() (or is an FD_CLOEXEC file descriptor being closed automatically at execve()).
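The flip side is that the number becomes available again the moment close() returns. The sketch below, with the made-up name `demo_fd_reuse`, shows this on a typical POSIX system where new descriptors get the lowest free number; it is single-threaded on purpose so that nothing else can grab the number in between.

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical demo: returns 1 if a freshly created socket received the
 * descriptor number that was released by close() a moment earlier. */
static int demo_fd_reuse(void)
{
    int a = socket(AF_INET, SOCK_STREAM, 0);
    int b = socket(AF_INET, SOCK_STREAM, 0);
    assert(a >= 0 && b >= 0);

    int released = a;
    close(a);                     /* only now may the number be handed out again */

    int c = socket(AF_INET, SOCK_STREAM, 0);
    assert(c >= 0);
    int reused = (c == released); /* lowest free number is normally chosen */

    close(b);
    close(c);
    return reused;
}
```

This is why a thread that keeps using a descriptor after another thread has closed it can end up talking on somebody else's connection.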
If all the threads enter a critical region without a semaphore, mutex, or monitor protecting it, they may end up using the same id, and the data you read from the byte stream may even be corrupted. I advise you to use a semaphore, mutex, or monitor, and to read about the dining philosophers problem, because this is a very common situation. Good luck; I hope this gives you a clue about your problem.

Avoiding dataloss in Go when writing with CLOSE_WAIT socket

start listening client with netcat -l
go program opens a conn with net.DialTCP to said client.
kill the netcat
in go program, do conn.Write() with a []byte -> it runs fine without error!
it takes another conn.Write to get the error: broken pipe
The first write is the one where data loss happens, and I want to avoid. if i only get an error I know i can just keep the data and try again later.
I've seen https://stackoverflow.com/a/15071574/2757887 which is a very similar case and the explanation seems to apply here, but it still doesn't explain how to deal with the issue, if the tcp protocol I need to implement only does one-way communication.
I've sniffed the traffic with wireshark, and when i kill the netcat, I can see that it sends FIN to the go program, to which the go program replies with ACK. For some reason the go program doesn't immediately reply with its own FIN - and i'm curious why that is, it might help with my problem - but there's probably a good reason for it.
Either way, from the "connection termination" section # http://en.wikipedia.org/wiki/Transmission_Control_Protocol, I conclude that the socket is in the CLOSE_WAIT state at this point, which I also confirmed with "netstat -np", which shows the socket going from ESTABLISHED to CLOSE_WAIT after killing netcat.
Looking at wireshark, the first conn.Write results in a packet with push and ack fields set, and of course my payload. this is the write that succeeds fine in go.
then the old socket that used to belong to netcat sends RST,
which makes sure that as soon as i try to write in go (2nd write) it fails.
So my question is:
A) why can't I get an error on the first write? if the socket received the FIN and is in CLOSE_WAIT why does Go let me write to the socket and tell me all is fine?
B) is there any way I can check in Go whether the socket is in CLOSE_WAIT? and if so, I could for this purpose consider it closed and not do the write.
thanks,
Dieter
Fundamentally, a successful write only tells you that data has been queued to be sent to the other end. If you need to make sure the other end gets that data, even if the connection closes or errors, you must store a copy of the data until the other end provides you with an application-level acknowledgment.
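As for checking whether the peer has already sent its FIN: at the socket level this can be approximated with a non-blocking peek, since a read that returns 0 means the other side has shut down its sending direction. The sketch below is in C rather than Go and only illustrates the underlying idea; `peer_has_closed` and `demo_peer_has_closed` are hypothetical names, and the demo uses a Unix-domain socket pair instead of TCP. The check is a hint only, because the peer can still close right after it.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: 1 if the peer has closed its sending side,
 * 0 if not (or if unread data is still queued), -1 on a socket error. */
static int peer_has_closed(int fd)
{
    char byte;
    ssize_t n = recv(fd, &byte, 1, MSG_PEEK | MSG_DONTWAIT);

    if (n == 0)
        return 1;                 /* end of stream: peer closed or shut down */
    if (n > 0)
        return 0;                 /* data pending, nothing consumed by the peek */
    if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
        return 0;                 /* nothing to read yet, connection still open */
    return -1;                    /* e.g. ECONNRESET */
}

/* Hypothetical demo: returns 0 if the helper reports "open" before the peer
 * closes and "closed" afterwards. */
static int demo_peer_has_closed(void)
{
    int sv[2];
    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    assert(rc == 0);

    int before = peer_has_closed(sv[0]);
    close(sv[1]);
    int after = peer_has_closed(sv[0]);

    close(sv[0]);
    return (before == 0 && after == 1) ? 0 : 1;
}
```

Even with such a check in front of every write, data can still be lost in the window between the check and the write, so keeping a copy until the other end acknowledges it remains the only reliable approach.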

close() socket directly after send(): unsafe?

Is it wise/safe to close() a socket directly after the last send()?
I know that TCP is supposed to try to deliver all remaining data in the send buffer even after closing the socket, but can I really count on that?
I'm making sure that there is no remaining data in my receive buffer so that no RST will be sent following my close.
In my case, the close is actually the very last statement of code before calling exit().
Will the TCP stack really continue to try and transmit the data even after the process sending it has terminated? Is that as reliable as waiting for an arbitrary timeout myself before calling close() by setting SO_LINGER?
That is, do the same TCP timeouts apply, or are they shorter? With a big send buffer and a slow connection, the time to actually transfer all the buffered data could be substantial, after all.
I'm not interested at all in being notified of the last byte sent; I just want them to eventually arrive at the remote host as reliably as possible.
Application layer acknowledgements are not an option (the protocol is HTTP, and I'm writing a small server).
I've been reading the "The ultimate SO_LINGER page, or: why is my tcp not reliable" blog post a lot. I recommend you read it too. It discusses edge cases of large data transfers with regard to TCP sockets.
I'm not the expert at SO_LINGER, but on my server code (still in active development) I do the following:
After the last byte is sent via send(), I call shutdown(sock, SHUT_WR) to trigger a FIN to be sent.
Then wait for a subsequent recv() call on that socket to return 0 (or for recv to return -1 with errno set to anything other than EAGAIN/EWOULDBLOCK).
Then the server does a close() on the socket.
The assumption is that the client will close his socket first after it has received all the bytes of the response.
But I do have a timeout enforced between the final send() and when recv() indicates EOF. If the client never closes his end of the connection, the server will give up waiting and close the connection anyway. I'm at 45-90 seconds for this timeout.
All of my sockets are non-blocking and I use poll/epoll to be notified of connection events as a hint to see if it's time to try calling recv() or send() again.
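The sequence described in this answer can be sketched as follows. This is not the answerer's actual code, only an illustration assuming a POSIX system; `graceful_close` and `demo_graceful_close` are invented names, and the timeout here applies to each wait rather than to the whole exchange.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: send FIN, wait for the peer's EOF, then close.
 * Returns 0 if EOF was seen, -1 on timeout or error. The fd is always closed. */
static int graceful_close(int fd, int timeout_ms)
{
    int result = 0;
    char discard[4096];
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    if (shutdown(fd, SHUT_WR) < 0 && errno != ENOTCONN)
        result = -1;

    while (result == 0) {
        int n = poll(&pfd, 1, timeout_ms);
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0) {             /* timed out or poll failed: give up waiting */
            result = -1;
            break;
        }
        ssize_t r = recv(fd, discard, sizeof discard, 0);
        if (r == 0)
            break;                /* EOF: the peer has closed its side */
        if (r < 0 && errno != EAGAIN && errno != EWOULDBLOCK && errno != EINTR) {
            result = -1;          /* e.g. ECONNRESET */
            break;
        }
        /* r > 0: late data from the peer is read and discarded */
    }
    close(fd);
    return result;
}

/* Hypothetical demo on a socket pair: the peer half-closes first, then checks
 * that the payload sent before graceful_close() still arrived in full. */
static int demo_graceful_close(void)
{
    int sv[2];
    char buf[8];
    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    assert(rc == 0);

    if (send(sv[0], "hi", 2, 0) != 2)
        return -1;
    shutdown(sv[1], SHUT_WR);     /* peer signals it will send nothing more */

    int closed = graceful_close(sv[0], 1000);
    ssize_t got = recv(sv[1], buf, sizeof buf, 0);
    ssize_t eof = recv(sv[1], buf, sizeof buf, 0);
    close(sv[1]);

    return (closed == 0 && got == 2 && memcmp(buf, "hi", 2) == 0 && eof == 0) ? 0 : 1;
}
```

In an event-driven server the poll() call would be replaced by the existing poll/epoll loop plus a timer, but the order of operations stays the same.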
Application layer acknowledgements are not an option (the protocol is HTTP, and I'm writing a small server).
HTTP protocol doesn't suffer from this problem. A HTTP server is not supposed to close the connection in any normal operation. The client closes it after recv(), and it knows exactly how many bytes it expects.
And just to be clear, the answer is "no".
Yes, it is safe to call send() and then close() immediately.
The kernel will send out all the data in the buffer and wait for the ACK, then close the socket gracefully with a FIN.

Socket Read In Multi-Threaded Application Returns Zero Bytes or EINTR (104)

Am a c-coder for a while now - neither a newbie nor an expert. Now, I have a certain daemoned application in C on a PPC Linux. I use PHP's socket_connect as a client to connect to this service locally. The server uses epoll for multiplexing connections via a Unix socket. A user submitted string is parsed for certain characters/words using strstr() and if found, spawns 4 joinable threads to different websites simultaneously. I use socket, connect, write and read, to interact with the said webservers via TCP on their port 80 in each thread. All connections and writes seems successful. Reads to the webserver sockets fail however, with either (A) all 3 threads seem to hang, and only one thread returns -1 and errno is set to 104. The responding thread takes like 10 minutes - an eternity long:-(. *I read somewhere that the 104 (is EINTR?), which in the network context suggests that ...'the connection was reset by peer'; or (B) 0 bytes from 3 threads, and only 1 of the 4 threads actually returns some data. Isn't the socket read/write thread-safe? I use thread-safe (and reentrant) libc functions such as strtok_r, gethostbyname_r, etc.
*I doubt that the said webhosts are actually resetting the connection, because when I run a single-threaded standalone (everything else equal) all things works perfectly right, but of course in series not parallel.
There's a second problem too (oops), I can't write back to the client who connect to my epoll-ed Unix socket. My daemon application will hang and hog CPU > 100% for ever. Yet nothing is written to the clients end. Am sure the client (a very typical PHP socket application) hasn't closed the connection whenever this is happening - no error(s) detected either. Any ideas?
I cannot figure-out whatever is wrong even with Valgrind, GDB or much logging. Kindly help where you can.
Yes, read/write are thread-safe. But beware of gethostbyname() and getservbyname() if you're using them - they return pointers to static data, and may not be thread-safe.
errno 104 is ECONNRESET on Linux (not EINTR). Use strerror or perror to get the textual error message (like 'Connection reset by peer') for a particular errno code.
The best way to figure out what's going wrong is often to do very detailed logging - log the results of every operation, plus details like the IP address/port connecting to, the number of bytes read/written, the thread id, and so forth. And, of course, make sure your logging code is thread-safe :-)
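A log line built with strerror might look like the sketch below. `describe_socket_error` and `demo_describe_error` are hypothetical names; the helper writes into a caller-supplied buffer so each thread can use its own. Note that strerror itself is not required by POSIX to be thread-safe, so a heavily threaded program may prefer strerror_r.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format "<op> failed: errno <n> (<text>)" into buf. */
static void describe_socket_error(const char *op, int err, char *buf, size_t n)
{
    assert(buf != NULL && n > 0);
    snprintf(buf, n, "%s failed: errno %d (%s)", op, err, strerror(err));
}

/* Hypothetical demo: returns 0 if the formatted line has the expected prefix. */
static int demo_describe_error(void)
{
    char line[128];
    describe_socket_error("read", ECONNRESET, line, sizeof line);
    return strstr(line, "read failed: errno") == line ? 0 : 1;
}
```

Logging the numeric value next to the text avoids the kind of guesswork about what 104 means that came up in the question.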
Getting an ECONNRESET after 10 minutes sounds like the result of your connection timing out. Either the web server isn't sending the data or your app isn't receiving it.
To test the former, hook up a program like Wireshark to the local loopback device and look for traffic to and from the port you are using.
For the latter, take a look at the epoll man page. It mentions a scenario where using edge-triggered events could result in a lockup, because there is still data in the buffer, but no new data comes in so no new event is triggered.
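The usual remedy for that edge-triggered lockup is to keep reading a non-blocking descriptor until read() reports EAGAIN, so that nothing is left in the buffer when the handler returns. A minimal sketch of such a drain loop follows; `drain_nonblocking` and `demo_drain` are made-up names, and the demo uses a socket pair instead of a real epoll loop.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read everything currently available on a non-blocking
 * fd. Returns the number of bytes consumed, or -1 on a real error. */
static ssize_t drain_nonblocking(int fd)
{
    char buf[4096];
    ssize_t total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            total += n;           /* a real handler would process buf here */
            continue;
        }
        if (n == 0)
            return total;         /* EOF */
        if (errno == EINTR)
            continue;
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;         /* buffer empty: safe to wait for the next event */
        return -1;
    }
}

/* Hypothetical demo: queue 10000 bytes, then drain them in one call. */
static int demo_drain(void)
{
    int sv[2];
    char payload[10000];
    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    assert(rc == 0);

    memset(payload, 'x', sizeof payload);
    if (send(sv[0], payload, sizeof payload, 0) != (ssize_t)sizeof payload)
        return -1;

    int flags = fcntl(sv[1], F_GETFL, 0);
    if (flags < 0 || fcntl(sv[1], F_SETFL, flags | O_NONBLOCK) < 0)
        return -1;

    ssize_t drained = drain_nonblocking(sv[1]);
    close(sv[0]);
    close(sv[1]);
    return drained == (ssize_t)sizeof payload ? 0 : 1;
}
```

With level-triggered epoll the loop is optional, because the event is reported again as long as data remains.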