SO_ERROR value after successful socket operation - sockets

I'm curious about the behavior of the SO_ERROR socket option used with getsockopt() after a successful socket operation
The Open Group specification:
SO_ERROR
Reports information about error status and clears it. This option shall store an int value.
Usually I see SO_ERROR used after a socket operation returns -1, but what happens if the previous socket operation succeeded (thus not returning -1). Does the getsockopt() call fail? Does it return 0 as the int value?

It's ok for non-blocking connect, see connect(2)
The socket is nonblocking and the connection cannot be completed immediately. It is possible to select(2) or poll(2) for completion by selecting the socket for writing. After select(2) indicates writability, use getsockopt(2) to read the SO_ERROR option at level SOL_SOCKET to determine whether connect() completed successfully (SO_ERROR is zero) or unsuccessfully (SO_ERROR is one of the usual error codes listed here, explaining the reason for the failure).
other is undefined.

It's undefined. You should only call this option when you already know that there has been an error. Not as a means to discover whether there was one.

I've learned more about SO_ERROR in Unix Networking Programmig Vol 1, it's become clear to me. SO_ERROR is used to report asynchronous errors that are the result of events within the network stack and not synchronous errors that are the result of a library call (send/recv/connect). Synchronous results are reported via errno.
Calling getsockopt() with SO_ERROR after a library call returns -1 is incorrect from the POSIX implementation.
Learning of the non-blocking connect result via select is an example of discovering when the asynchronous result is ready (which can then be retrieved via SO_ERROR)

Related

Non-blocking TCP socket fails to connect using socket2

I am currently in the process of converting some of my code from blocking to non-blocking using the sockets2 crate, however I am running into issues with connecting the socket. The socket always fails to connect before the timeout is exceeded. Despite my attempts to search for examples, I have yet to find any Rust code showing how a non-blocking TCP stream is created.
To give you an idea what I am attempting to do, the code I am currently converting looks looks roughly like this. This gives me no issues and works fine, but it is getting too costly to create a new thread for every socket.
let address = SocketAddr::from(([x, y, z, v], port));
let mut socket = TcpStream::connect_timeout(&address, timeout)?;
At the moment, my code to connect the socket looks like this. Since connect_timeout can only be executed in blocking mode, I use connect instead and regularly poll the socket to check if it is connected. At the moment, I keep getting WouldBlock errors when calling connect, but I do not know what this means. At first I assumed that the connect was proceeding, but returning the result immediately would require blocking so a WouldBlock error was given instead. However, due to the issues getting the socket to connect, I am second guessing those assumptions.
let address = SocketAddr::from(([x, y, z, v], port));
// Create socket equivalent to TcpStream
let socket = Socket::new(Domain::IPV4, Type::STREAM, Some(Protocol::TCP))?;
// Enable non-blocking mode on the socket
socket.set_nonblocking(true)?;
// What response should I expect? Do I need to bind an address first?
match socket.connect(&address.into()) {
Ok(_) => {}
Err(e) if e.kind() == ErrorKind::WouldBlock => {
// I keep getting this error, but I don't know what this means.
// Is non-blocking connect unavailable?
// Do I need to keep trying to connect until it succeeds?
},
// Are there any other types of errors I should be looking for before failing the connection?
Err(e) => return Err(e),
}
I am also unsure what the correct approach is to determine if a socket is connected. At the moment, I attempt to read to a zero length buffer and check if I get a NotConnected error. However, I am unsure what WouldBlock means in this context and I have never gotten a positive response from this approach.
let mut buffer = [0u8; 0];
// I also tried self.socket.peer_addr(), but ran into issues where it returned a positive
// response despite not being connected.
match self.socket.read(&mut buffer) {
Ok(_) => Ok(true),
// What does WouldBlock mean in this context?
Err(e) if e.kind() == ErrorKind::WouldBlock => Ok(false),
Err(e) if e.kind() == ErrorKind::NotConnected => Ok(false),
Err(e) => Err(e),
}
Each socket is periodically checked until an arbitrary timeout is reached to determine if it has connected. So far, no socket has passed the connected before reaching its timeout (20 sec) when connecting to a known-good server. These tests are all performed in a single threaded application on Windows using a known-good server that has been checked to work with the blocking version of my program.
Edit: Here is a minimum reproducible example for this issue. However, it likely won't work if you run it on Rust playground due to network restrictions. https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=a08c22574a971c0032fd9dd37e10fd94
WouldBlock is the expected error when a non-blocking connect() (or other operation) is successfully started in the background. You can then wait up to your desired timeout interval for the operation to finish (use select() or epoll() or other platform-specific notification to detect this). If the timeout elapses, close the socket and handle the timeout accordingly. Otherwise, check the socket's SO_ERROR option to see if the operation was successful or failed, and act accordingly.
To give you an idea what I am attempting to do, the code I am currently converting looks looks roughly like this. This gives me no issues and works fine, but it is getting too costly to create a new thread for every socket.
This sounds to me strongly like an XY-Problem.
I think you misunderstand what 'nonblocking' means. What it does not mean is that you can simply and without worrying run multiple sockets in parallel. What it does mean is that every operation that would block returns an error instead and you have to retry it at a later time.
Actual non-blocking sockets usually don't get used at enduser level. They are meant for libraries that depend on them and provide some higher level interface for asynchronism. Non-blocking sockets are hard to get right. They need to be paired with events, because otherwise you can only implement them with 100% cpu hungry busy loops, which is most likely not what you want.
There's good news, though! Remember the high-level libraries I talked about that use nonblocking sockets internally? The most famous one right now is called tokio and does exactly what you want. It will require you to learn a programming mechanism called asynchronism, but you will grasp it, I'm sure :)
I recommend this read: https://tokio.rs/tokio/tutorial

How to manage the error queue in openssl (SSL_get_error and ERR_get_error)

In OpenSSl, The man pages for The majority of SSL_* calls indicate an error by returning a value <= 0 and suggest calling SSL_get_error() to get the extended error.
But within the man pages for these calls as well as for other OpenSSL library calls, there are vague references to using the "error queue" in OpenSSL - Such is the case in the man page for SSL_get_error:
The current thread's error queue must be empty before the TLS/SSL I/O
operation is attempted, or SSL_get_error() will not work reliably.
And in that very same man page, the description for SSL_ERROR_SSL says this:
SSL_ERROR_SSL
A failure in the SSL library occurred, usually a protocol error.
The OpenSSL error queue contains more information on the error.
This kind of implies that there is something in the error queue worth reading. And failure to read it makes a subsequent call to SSL_get_error unreliable. Presumably, the call to make is ERR_get_error.
I plan to use non-blocking sockets in my code. As such, it's important that I reliably discover when the error condition is SSL_ERROR_WANT_READ or SSL_ERROR_WANT_WRITE so I can put the socket in the correct polling mode.
So my questions are this:
Does SSL_get_error() call ERR_get_error() implicitly for me? Or do I need to use both?
Should I be calling ERR_clear_error prior to every OpenSSL library call?
Is it possible that more than one error could be in the queue after an OpenSSL library call completes? Hence, are there circumstances where the first error in the queue is more relevant than the last error?
SSL_get_error does not call ERR_get_error. So if you just call SSL_get_error, the error stays in the queue.
You should be calling ERR_clear_error prior to ANY SSL-call(SSL_read, SSL_write etc) that is followed by SSL_get_error, otherwise you may be reading an old error that occurred previously in the current thread.

POLLHUP during non-blocking connect?

I open a socket and try to connect() to non-existent peer. The connect() is non blocking.
Then I epoll on the socket.
Sometimes I get EPOLLERR|EPOLLHUP events and subsequent getsockopt(SO_ERROR) returns ECONNREFUSED. Which is what I would expect.
However, sometimes I get EPOLLHUP alone and subsequent getsockopt(SO_ERROR) returns 0.
Anyone any idea what the latter case is supposed to mean?

GetQueuedCompletionStatus returns ERROR_NETNAME_DELETED on remote socket closure

I am writing a small server-client-stuff using an I/O-Completion Port.
I get the server and client connected successfully via AcceptEx over my completion port.
After the client has connected the client socket is associated with the completion port and an overlapped call to WSARecv on that socket is invoked.
Everything works just fine, until I close the client test program.
GetQueuedCompletionStatus() returns FALSE and GetLastError returns
ERROR_NETNAME_DELETED
, which makes sense to me (after I read the description on the MSDN).
But my problem is, that I thought the call to GetQueuedCompletionStatus would return me a packet indicating the failure due to closure of the socket, because WSARecv would return the apropriate return value.
Since this is not the case I don´t know which clients´ socket caused the error and cant act the way i need to (freeing structures , cleanup for this particular connection, etc)...
Any suggestion on how to solve this, Or hints?
Thanks:)
EDIT: http://codepad.org/WeYINasO <- the code responsible... the "error" occures at the first functions beginning of the while-loop (the call to GetCompletionStatus() which is only a wrapper for GetQueuedCompletionStatus() working fine in other cases) [Did post it there, because it looks shitty & messy in here]
Here are the scenarios that you need to watch for when calling GetQueuedCompletionStatus:
GetQueuedCompletionStatus returns TRUE: A successful completion packet has been received, all the out parameters have been populated.
GetQueuedCompletionStatus returns FALSE, lpOverlapped == NULL: No packet was dequeued. The other out parameters contain indeterminate values.
GetQueuedCompletionStatus returns FALSE, lpOverlapped != NULL: The function has dequeued a failed completion packet. The error code is available via GetLastError.
To answer your question, when GetQueuedCompletionStatus returns FALSE and lpOverlapped != NULL, there was a failed I/O completion. It's the value of lpOverlapped that you need to be concerned about.
I know this is an old question, but I found this page while fruitlessly googling for details about ERROR_NETNAME_DELETED. It is an error which I get while doing an overlapped Readfile().
After some debugging it turned out that the problem was caused by a program which was writing to a socket but forgetting to call closesocket() before using ExitProcess() (due to garbage collection issues). Calling CloseHandle() did not prevent the error, nor did adding WSACleanup() before ExitProcess(). However, adding a short sleep before the client exited did prevent the error. Maybe avoiding ExitProcess() would have prevented the problem also.
So I suspect your problem is caused by the program exiting without closing down the socket properly.
I don't think this would be an issue on Unix where sockets are just ordinary file descriptors.

What causes the ENOTCONN error?

I'm currently maintaining some web server software and I need to perform a lot of I/O operations. The read(), write(), close() and shutdown() calls, when used on a socket, may sometimes raise an ENOTCONN error. What exactly does this error mean? What are the conditions that would trigger it? I can never seem to reproduce it locally but there are users who can.
Right now I just ignore ENOTCONN when raised by close() and shutdown() because it seems harmless, but I'm not entirely sure.
EDIT:
I am absolutely sure that the connect() call succeeded. I check for its return value.
ENOTCONN is most often raised by close() and shutdown(). I've only very rarely seen a read() and write() raising ENOTCONN.
If you are sure that nothing on your side of the TCP connection is closing the connection, then it sounds to me like the remote side is closing the connection.
ENOTCONN, as others have pointed out, simply means that the socket is not connected. This doesn't necessarily mean that connect failed. The socket may well have been connected previously, it just wasn't at the time of the call that resulted in ENOTCONN.
This differs from:
ECONNRESET: the other end of the connection sent a TCP reset packet. This can happen if the other end is refusing a connection, or doesn't acknowledge that it is already connected, among other things.
ETIMEDOUT: this generally applies only to connect. This can happen if the connection attempt is not successful within a system-dependent amount of time.
EPIPE can sometimes be returned by some socket-related system calls under conditions that are more or less the same as ENOTCONN. For example, on some systems, EPIPE and ENOTCONN are synonymous when returned by send.
While it's not unusual for shutdown to return ENOTCONN, since this function is supposed to tear down the TCP connection, I would be surprised to see close return ENOTCONN. It really should never do that.
Finally, as dwc mentioned, EBADF shouldn't apply in your scenario unless you are attempting some operation on a file descriptor that has already been closed. Having a socket get disconnected (i.e. the TCP connection has broken) is not the same as closing the file descriptor associated with that socket.
It's because, at the moment of shutting() the socket, you have data in the socket's buffer waiting to be delivered to the remote party which has closed() or shutted down() its receiving socket.
I don't finish understanding how sockets work, I am rather a noob, and I've failed to even find the files where this "shutdown" function is implemented, but seeing that there's practically no user manual for the whole sockets thing I started trying all possibilities until I got the error in a "controlled" environment. It could be something else, but after much trying these are the explanations I settled for:
If you sent data after the remote side closed the connection, when you shutdown(), you get the error.
If you sent data before the remote side closed the connection but it didn't get received() on the other end, you can shutdown() once, the next time you try to shutdown(), you get the error.
If you didn't send any data, you can shutdown all the times you want, as long as the remote side doesn't shutdown(); once the remote side has shutdown(), if you try to shutdown() and the socket was already shutdown(), you get the error.
I believe ENOTCONN is returned, because shutdown() is not supposed to return ECONNRESET or other more accurate errors.
It is wrong to assume that the other side “just” closed the connection. On the TCP-level, the other side can only half-close a connection (or abort it). The connection is ordinary fully closed if both sides do a shutdown() (or close()). If both side do that, shutdown() actually succeeds for both of them!
The problem is that shutdown() did not succeed in ordinary (half-)closing the connection, neither as the first one to close it, nor as the second one. – From the errors listed in the POSIX docs for shutdown(), ENOTCONN is the least inappropriate, because the others indicate problems with arguments passed to shutdown() (or local resource problems to handle the request).
So what happened? These days, a NAT device somewhere between the two parties involved might have dropped the association and sends out RESET packets as a reaction. Reset connections are so common for IPv4, that you will get them anywhere in your code, even masked as ENOTCONN in shutdown().
A coding bug might also be the reason. On a non-blocking socket, for example, a connect() can return 0 without indicating a successful connection yet.
Transport endpoint is not connected
The socket is associated with a connection-oriented protocol and has not been connected. This is usually a programming flaw.
From: http://www.wlug.org.nz/ENOTCONN
If you're sure you've connected properly in the first place, ENOTCONN is most likely to be caused by either the fd being closed on your end (perhaps in another thread?) while you're in the middle of a request, or by the connection dropping while you're in the middle of the request.
In any case, it means that the socket is not connected. Go ahead and clean up that socket. It's dead. No problem calling close() or shutdown() on it.