I am currently converting some of my code from blocking to non-blocking sockets using the socket2 crate, but I am running into issues connecting the socket: it always fails to connect before the timeout is exceeded. Despite searching for examples, I have yet to find any Rust code showing how a non-blocking TCP stream is created.
To give you an idea of what I am attempting to do, the code I am currently converting looks roughly like this. It gives me no issues and works fine, but it is getting too costly to create a new thread for every socket.
let address = SocketAddr::from(([x, y, z, v], port));
let mut socket = TcpStream::connect_timeout(&address, timeout)?;
At the moment, my code to connect the socket looks like this. Since connect_timeout can only be executed in blocking mode, I use connect instead and regularly poll the socket to check whether it is connected. I keep getting WouldBlock errors when calling connect, but I do not know what that means. At first I assumed the connect was proceeding in the background and that WouldBlock was returned only because reporting the result immediately would require blocking. However, given the trouble getting the socket to connect, I am second-guessing that assumption.
let address = SocketAddr::from(([x, y, z, v], port));
// Create socket equivalent to TcpStream
let socket = Socket::new(Domain::IPV4, Type::STREAM, Some(Protocol::TCP))?;
// Enable non-blocking mode on the socket
socket.set_nonblocking(true)?;
// What response should I expect? Do I need to bind an address first?
match socket.connect(&address.into()) {
    Ok(_) => {}
    Err(e) if e.kind() == ErrorKind::WouldBlock => {
        // I keep getting this error, but I don't know what this means.
        // Is non-blocking connect unavailable?
        // Do I need to keep trying to connect until it succeeds?
    }
    // Are there any other types of errors I should be looking for before failing the connection?
    Err(e) => return Err(e),
}
I am also unsure what the correct approach is for determining whether a socket is connected. At the moment, I attempt a read into a zero-length buffer and check whether I get a NotConnected error. However, I am unsure what WouldBlock means in this context, and I have never gotten a positive result from this approach.
let mut buffer = [0u8; 0];
// I also tried self.socket.peer_addr(), but ran into issues where it returned a positive
// response despite not being connected.
match self.socket.read(&mut buffer) {
    Ok(_) => Ok(true),
    // What does WouldBlock mean in this context?
    Err(e) if e.kind() == ErrorKind::WouldBlock => Ok(false),
    Err(e) if e.kind() == ErrorKind::NotConnected => Ok(false),
    Err(e) => Err(e),
}
Each socket is periodically checked until an arbitrary timeout is reached to determine whether it has connected. So far, no socket has passed the connected check before reaching its timeout (20 s), even when connecting to a known-good server. These tests are all performed in a single-threaded application on Windows, against a server that has been verified to work with the blocking version of my program.
Edit: Here is a minimal reproducible example for this issue. However, it likely won't work if you run it on the Rust Playground due to network restrictions. https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=a08c22574a971c0032fd9dd37e10fd94
WouldBlock is the expected error when a non-blocking connect() (or other operation) is successfully started in the background. You can then wait up to your desired timeout interval for the operation to finish (use select() or epoll() or other platform-specific notification to detect this). If the timeout elapses, close the socket and handle the timeout accordingly. Otherwise, check the socket's SO_ERROR option to see if the operation was successful or failed, and act accordingly.
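For reference, here is a rough sketch of that pattern in terms of the underlying BSD sockets C API, which is what socket2's socket()/set_nonblocking()/connect() calls map onto (helper name and error handling are mine and abbreviated). On Windows the flow is the same with ioctlsocket(FIONBIO) and select(); a non-blocking connect there reports WSAEWOULDBLOCK, which Rust surfaces as ErrorKind::WouldBlock, and a failed connect is additionally signalled through select()'s except set. In Rust, std::net::TcpStream exposes SO_ERROR as take_error().

#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/* Sketch only: non-blocking connect with a timeout (POSIX flavour). */
int connect_with_timeout(const struct sockaddr *addr, socklen_t addrlen, int timeout_sec)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Put the socket into non-blocking mode (what set_nonblocking(true) does). */
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

    if (connect(fd, addr, addrlen) == 0)
        return fd;                      /* connected immediately (rare, but possible) */

    if (errno != EINPROGRESS) {         /* EINPROGRESS/WouldBlock means "started, still in progress" */
        close(fd);
        return -1;
    }

    /* Wait until the socket becomes writable, i.e. the connect has finished
     * (successfully or not), or until the timeout elapses. */
    fd_set wfds;
    FD_ZERO(&wfds);
    FD_SET(fd, &wfds);
    struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };

    int r = select(fd + 1, NULL, &wfds, NULL, &tv);
    if (r <= 0) {                       /* 0 = timed out, < 0 = select error */
        close(fd);
        return -1;
    }

    /* The connect finished; check SO_ERROR to see whether it succeeded. */
    int err = 0;
    socklen_t len = sizeof err;
    getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len);
    if (err != 0) {
        errno = err;
        close(fd);
        return -1;
    }
    return fd;                          /* connected */
}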
To give you an idea of what I am attempting to do, the code I am currently converting looks roughly like this. It gives me no issues and works fine, but it is getting too costly to create a new thread for every socket.
This sounds strongly like an XY problem to me.
I think you misunderstand what 'non-blocking' means. It does not mean that you can simply run multiple sockets in parallel without any further work. It does mean that every operation that would block returns an error instead, and you have to retry it at a later time.
Raw non-blocking sockets usually don't get used at the end-user level. They are meant for libraries that build on them and provide some higher-level interface for asynchrony. Non-blocking sockets are hard to get right: they need to be paired with readiness events, because otherwise the only way to use them is a CPU-hungry busy loop, which is most likely not what you want.
There's good news, though! Remember the high-level libraries I mentioned that use non-blocking sockets internally? The most famous one right now is called tokio, and it does exactly what you want. It will require you to learn asynchronous programming (async/await), but you will grasp it, I'm sure :)
I recommend this read: https://tokio.rs/tokio/tutorial
Related
I've been using ZMQ in some Python applications for a while, but only very recently I decided to reimplement one of them in Go and I realized that ZMQ sockets are not thread-safe.
The original Python implementation uses an event loop that looks like this:
while running:
    socks = dict(poller.poll(TIMEOUT))
    if socks.get(router) == zmq.POLLIN:
        client_id = router.recv()
        _ = router.recv()
        data = router.recv()
        requests.append((client_id, data))
    for req in requests:
        rep = handle_request(req)
        if rep:
            replies.append(rep)
            requests.remove(req)
    for client_id, data in replies:
        router.send(client_id, zmq.SNDMORE)
        router.send(b'', zmq.SNDMORE)
        router.send(data)
    del replies[:]
The problem is that the reply might not be ready on the first pass, so whenever I have pending requests, I have to poll with a very short timeout or the clients will wait for more than they should, and the application ends up using a lot of CPU for polling.
When I decided to reimplement it in Go, I thought it would be as simple as this, avoiding the problem by using infinite timeout on polling:
for {
    sockets, _ := poller.Poll(-1)
    for _, socket := range sockets {
        switch s := socket.Socket; s {
        case router:
            msg, _ := s.RecvMessage(0)
            client_id := msg[0]
            data := msg[2]
            go handleRequest(router, client_id, data)
        }
    }
}
But that ideal implementation only works when I have a single client connected, or a light load. Under heavy load I get random assertion errors inside libzmq. I tried the following:
1. Following the zmq4 docs, I tried adding a sync.Mutex and locking/unlocking around all socket operations. It fails. I assume it's because ZMQ uses its own threads for flushing.
2. Creating one goroutine for polling/receiving and one for sending, and using channels in the same way I used the req/rep queues in the Python version. It fails, as I'm still sharing the socket.
3. Same as 2, but setting GOMAXPROCS=1. It fails, and throughput was very limited because replies were being held back until the Poll() call returned.
4. Using the req/rep channels as in 2, but using runtime.LockOSThread to keep all socket operations in the same thread as the socket. It doesn't fail, but throughput was very limited, the same problem as above.
5. Same as 4, but using the poll timeout strategy from the Python version. It works, but has the same problem the Python version does.
6. Sharing the context instead of the socket and creating one socket for sending and one for receiving in separate goroutines, communicating with channels. It works, but I'll have to rewrite the client libs to use two sockets instead of one.
7. Getting rid of ZMQ and using raw TCP sockets, which are thread-safe. It works perfectly, but I'll also have to rewrite the client libs.
So, it looks like 6 is how ZMQ was really intended to be used, as that's the only way I got it to work seamlessly with goroutines, but I wonder if there's any other way I haven't tried. Any ideas?
Update
With the answers here I realized I can just add an inproc PULL socket to the poller and have a goroutine connect and push a byte to break out of the infinite wait. It's not as versatile as the solutions suggested here, but it works and I can even backport it to the Python version.
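For anyone after the concrete shape of that trick, here is a rough sketch in terms of the plain libzmq C API (the same structure carries over to pebbe/zmq4's Poller); the endpoint name and function names are only illustrative:

#include <zmq.h>

/* Sketch of the inproc wake-up trick: the polling loop owns a PULL socket
 * bound to an inproc endpoint; any other thread can connect a PUSH socket
 * and send one byte to make the blocking zmq_poll() return. */
void *ctx;            /* shared zmq context (zmq_ctx_new()) */
void *router;         /* the ROUTER socket owned by the polling thread */
void *wake_pull;      /* inproc PULL, also owned by the polling thread */

void poll_loop(void)
{
    wake_pull = zmq_socket(ctx, ZMQ_PULL);
    zmq_bind(wake_pull, "inproc://wake");

    zmq_pollitem_t items[] = {
        { router,    0, ZMQ_POLLIN, 0 },
        { wake_pull, 0, ZMQ_POLLIN, 0 },
    };

    for (;;) {
        zmq_poll(items, 2, -1);              /* block forever; no busy timeout */

        if (items[0].revents & ZMQ_POLLIN) {
            /* receive and queue the client request as before */
        }
        if (items[1].revents & ZMQ_POLLIN) {
            char b;
            zmq_recv(wake_pull, &b, 1, 0);   /* drain the wake-up byte */
            /* a reply became ready: send queued replies from this thread */
        }
    }
}

/* Called from another thread/goroutine when a reply is ready. */
void wake(void)
{
    void *push = zmq_socket(ctx, ZMQ_PUSH);  /* or keep one per worker */
    zmq_connect(push, "inproc://wake");
    zmq_send(push, "x", 1, 0);
    zmq_close(push);
}

The key point is that the ZMQ context is thread-safe even though individual sockets are not, so each worker can open its own PUSH socket to the inproc endpoint to wake the poller.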
I opened an issue about 1.5 years ago to introduce a port of https://github.com/vaughan0/go-zmq/blob/master/channels.go to pebbe/zmq4. Ultimately the author decided against it, but we have used this in production (under VERY heavy workloads) for a long time now.
This is a gist of the file that had to be added to the pebbe/zmq4 package (since it adds methods to the Socket). It could be rewritten so that the methods on the Socket receiver instead took a Socket as an argument, but since we vendor our code anyway, this was an easy way forward.
The basic usage is to create your Socket like normal (call it s for example) then you can:
channels := s.Channels()
outBound := channels.Out()
inBound := channels.In()
Now you have two channels of type [][]byte that you can use between goroutines, while a single goroutine (managed within the channels abstraction) is responsible for managing the Poller and communicating with the socket.
The blessed way to do this with pebbe/zmq4 is with a Reactor. Reactors have the ability to listen on Go channels, but you don't want to do that because they do so by polling the channel periodically using a poll timeout, which reintroduces the same exact problem you have in your Python version. Instead you can use zmq inproc sockets, with one end held by the reactor and the other end held by a goroutine that passes data in from a channel. It's complicated, verbose, and unpleasant, but I have used it successfully.
Ubuntu Linux, 2.6.32-45 kernel, 64b, Perl 5.10.1
I connect many new IO::Socket::UNIX stream sockets to a server, and mostly they work fine. But sometimes, in a heavily threaded environment on a faster processor, they return "Resource temporarily unavailable" (EAGAIN/EWOULDBLOCK). I use a timeout on the connect, which causes the sockets to be put into non-blocking mode during the connect. But my timeout period isn't being honored: the call doesn't wait any noticeable time, it returns quickly.
I see that inside IO::Socket, it tries the connect, and if it fails with EINPROGRESS or EAGAIN/EWOULDBLOCK, it does a select to wait for the write bit to be set. This seems normal so far. In my case the select quickly succeeds, implying that the write bit is set, and the code then tries a re-connect. (I guess this is an attempt to get any error via error slippage?) Anyway, the re-connect fails again with the EAGAIN/EWOULDBLOCK.
In my code this is easy to fix with a retry loop. But I don't understand why, when the socket becomes writable, the socket is not re-connectable. I thought the select guard was always sufficient for a non-blocking connect. Apparently not; so my questions are:
What conditions cause the connect to fail when the select works (the write bit gets set)?
Is there a better way than spinning and retrying, to wait for the connect to succeed? The spinning is wasting cycles. Instead I'd like it to block on something like a select/poll, but I still need a timeout.
Thanx,
-- Steve
But I don't understand why, when the socket becomes writable, the socket is not re-connectable.
I imagine it's because whatever resource became free was snatched up by someone else before you were able to connect again. Replacing the select with a spin loop would not help that.
Following this RPG socket tutorial, we created a socket client in RPG that calls a Java socket server.
The problem is that the connect()/send() operations block, and we have a requirement that if the connect/send can't be done within, say, a second, we have to just log it and finish.
If I set the socket to non-blocking mode (I think with fcntl), we don't fully understand how to proceed, and we can't find any useful documentation with examples for it.
I think that if I connect with a non-blocking socket, I have to do select(..., timeout), which tells us whether the connect succeeded and whether we are able to send(bytes). But if we then send(bytes) on what is now a non-blocking socket (so the call returns immediately), how do I know that send() actually delivered the bytes to the server before closing the socket?
I could fall back to implementing the client socket on the AS/400 as a Java or C procedure, but I really want to keep it in a simple RPG program.
Would somebody help me understand how to do that, please?
Thanks!
In my opinion, that RPG tutorial you mention has a slight defect. What I believe is causing your confusion is the following section's code:
...
Consequently, we typically call the send() API like this:
D miscdata S 25A
D rc S 10I 0
C eval miscdata = 'The data to send goes here'
C eval rc = send(s: %addr(miscdata): 25: 0)
c if rc < 25
C* for some reason we weren't able to send all 25 bytes!
C endif
...
If you read the documentation of send(), you will see that the return value does not indicate an error if it is greater than -1, yet the code above treats it as if an error has occurred. In fact, the sum of the return values must equal the size of the buffer, provided that you keep moving the pointer into the buffer to reflect what has already been sent. Look in Beej's Guide to Network Programming. You might also like to look at Richard Stevens' book UNIX Network Programming, Volume 1 for really detailed explanations.
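In C terms, the usual fix is a small loop that keeps calling send() until the whole buffer has been accepted, advancing the pointer by whatever each call reports; this is essentially the sendall() helper from Beej's guide. A minimal sketch (the function name is mine):

#include <sys/socket.h>
#include <sys/types.h>

/* Keep calling send() until the whole buffer has been handed to the kernel.
 * Returns 0 on success, -1 on error (with errno set by send()). */
int send_all(int fd, const char *buf, size_t len)
{
    size_t sent = 0;
    while (sent < len) {
        ssize_t n = send(fd, buf + sent, len - sent, 0);
        if (n < 0)
            return -1;      /* real error; on a non-blocking socket,
                               EAGAIN/EWOULDBLOCK means "try again later" */
        sent += (size_t)n;  /* move the pointer past what was accepted */
    }
    return 0;
}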
As to the problem of determining whether the last send before close() actually went out: well, the paragraph above explains how to determine what portion of the data was sent. However, calling close() will attempt to send all unsent data unless SO_LINGER is set.
fcntl() is used to control blocking, while setsockopt() is used to set SO_LINGER.
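A minimal sketch of both calls in C (RPG prototypes these same APIs on the system); the descriptor and values are illustrative:

#include <fcntl.h>
#include <sys/socket.h>

/* Illustrative only: put a socket into non-blocking mode and ask close()
 * to wait up to 5 seconds for unsent data before tearing the socket down. */
static void configure_socket(int fd)
{
    /* fcntl() controls blocking behaviour via the O_NONBLOCK flag. */
    int flags = fcntl(fd, F_GETFL, 0);
    fcntl(fd, F_SETFL, flags | O_NONBLOCK);

    /* setsockopt() sets SO_LINGER: l_onoff=1 with a positive l_linger makes
     * close() linger for up to that many seconds while data is flushed. */
    struct linger lg = { .l_onoff = 1, .l_linger = 5 };
    setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof lg);
}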
The abstraction of network communications being used is BSD sockets. There are some slight differences in implementations across OS's but it is generally quite homogeneous. This means that one can generally use documentation written for other OS's for the broad overview. Most of the time.
I've been a C coder for a while now; neither a newbie nor an expert. I have a daemonized application in C on a PPC Linux box. I use PHP's socket_connect as a client to connect to this service locally. The server uses epoll for multiplexing connections over a Unix socket. A user-submitted string is parsed for certain characters/words using strstr(), and if they are found, 4 joinable threads are spawned to contact different websites simultaneously. I use socket, connect, write and read to interact with those web servers over TCP on port 80 in each thread. All connections and writes seem successful. Reads from the web server sockets fail, however, in one of two ways: (A) the other 3 threads seem to hang, and only one thread returns -1 with errno set to 104 (is that EINTR?), which in the network context supposedly means 'the connection was reset by peer'; the responding thread takes about 10 minutes, an eternity :-(. Or (B) 3 threads return 0 bytes, and only 1 of the 4 threads actually returns some data. Isn't socket read/write thread-safe? I use thread-safe (and reentrant) libc functions such as strtok_r, gethostbyname_r, etc.
I doubt that the web hosts in question are actually resetting the connection, because when I run a single-threaded standalone version (everything else equal), everything works perfectly, though of course in series rather than in parallel.
There's a second problem too (oops): I can't write back to the clients that connect to my epoll-ed Unix socket. My daemon application hangs and hogs the CPU at 100% forever, yet nothing is written to the client's end. I am sure the client (a very typical PHP socket application) hasn't closed the connection when this happens, and no errors are detected either. Any ideas?
I cannot figure out what is wrong, even with Valgrind, GDB, or extensive logging. Kindly help where you can.
Yes, read/write are thread-safe. But beware of gethostbyname() and getservbyname() if you're using them - they return pointers to static data, and may not be thread-safe.
errno 104 is ECONNRESET (not EINTR). Use strerror or perror to get the textual error message (like 'Connection reset by peer') for a particular errno code.
The best way to figure out what's going wrong is often to do very detailed logging - log the results of every operation, plus details like the IP address/port connecting to, the number of bytes read/written, the thread id, and so forth. And, of course, make sure your logging code is thread-safe :-)
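For example, a small sketch of such a log line; note that strerror() itself may not be thread-safe on every platform, so strerror_r() (or a lock around logging) is safer in threaded code, in line with the point above:

#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Log a failed socket call with its errno value and message.
 * In threaded code, prefer strerror_r() or serialize access to the logger. */
static void log_errno(const char *what)
{
    fprintf(stderr, "%s failed: errno=%d (%s)\n", what, errno, strerror(errno));
}

/* Or, as a quick one-liner: perror("read from web server"); */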
Getting an ECONNRESET after 10 minutes sounds like the result of your connection timing out. Either the web server isn't sending the data or your app isn't receiving it.
To test the former, hook up a program like Wireshark to the local loopback device and look for traffic to and from the port you are using.
For the latter, take a look at the epoll() man page. It mentions a scenario where using edge-triggered events can result in a lockup, because there is still data in the buffer but no new data arrives, so no new event is triggered.
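The lockup described there is specific to edge-triggered (EPOLLET) registration: you are only notified on a state change, so after an event you must keep reading until the call reports EAGAIN/EWOULDBLOCK, or you may never be woken again. A rough sketch, assuming the descriptor has been made non-blocking:

#include <errno.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Register a socket for edge-triggered input events. */
static void add_edge_triggered(int epfd, int fd)
{
    struct epoll_event ev;
    memset(&ev, 0, sizeof ev);
    ev.events = EPOLLIN | EPOLLET;   /* edge-triggered: fires only on new data */
    ev.data.fd = fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* On EPOLLIN, keep reading until the kernel says "no more for now". */
static void drain(int fd)
{
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            /* handle n bytes */
        } else if (n == 0) {
            break;                   /* peer closed the connection */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            break;                   /* buffer drained; wait for the next event */
        } else {
            break;                   /* real error; log and handle it */
        }
    }
}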
I'm currently maintaining some web server software and I need to perform a lot of I/O operations. The read(), write(), close() and shutdown() calls, when used on a socket, may sometimes raise an ENOTCONN error. What exactly does this error mean? What are the conditions that would trigger it? I can never seem to reproduce it locally but there are users who can.
Right now I just ignore ENOTCONN when raised by close() and shutdown() because it seems harmless, but I'm not entirely sure.
EDIT:
I am absolutely sure that the connect() call succeeded. I check for its return value.
ENOTCONN is most often raised by close() and shutdown(). I've only very rarely seen read() or write() raise ENOTCONN.
If you are sure that nothing on your side of the TCP connection is closing the connection, then it sounds to me like the remote side is closing the connection.
ENOTCONN, as others have pointed out, simply means that the socket is not connected. This doesn't necessarily mean that connect failed. The socket may well have been connected previously, it just wasn't at the time of the call that resulted in ENOTCONN.
This differs from:
ECONNRESET: the other end of the connection sent a TCP reset packet. This can happen if the other end is refusing a connection, or doesn't acknowledge that it is already connected, among other things.
ETIMEDOUT: this generally applies only to connect. This can happen if the connection attempt is not successful within a system-dependent amount of time.
EPIPE can sometimes be returned by some socket-related system calls under conditions that are more or less the same as ENOTCONN. For example, on some systems, EPIPE and ENOTCONN are synonymous when returned by send.
While it's not unusual for shutdown to return ENOTCONN, since this function is supposed to tear down the TCP connection, I would be surprised to see close return ENOTCONN. It really should never do that.
Finally, as dwc mentioned, EBADF shouldn't apply in your scenario unless you are attempting some operation on a file descriptor that has already been closed. Having a socket get disconnected (i.e. the TCP connection has broken) is not the same as closing the file descriptor associated with that socket.
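To make those distinctions concrete, here is one (purely illustrative) way a caller might classify errno after a failed socket call; the policy is a sketch, not a prescription:

#include <errno.h>

/* Sketch: classify errno after a failed socket call, following the
 * distinctions drawn above. */
enum sock_failure { SOCK_DEAD, SOCK_RETRY, SOCK_OTHER };

static enum sock_failure classify(int err)
{
    switch (err) {
    case ECONNRESET:   /* the peer sent a TCP reset */
    case ENOTCONN:     /* the socket is not (or no longer) connected */
    case EPIPE:        /* used by send() on some systems in place of ENOTCONN */
    case ETIMEDOUT:    /* the connection attempt or transfer timed out */
        return SOCK_DEAD;          /* treat the connection as gone; clean up */
    case EINTR:
    case EAGAIN:
        return SOCK_RETRY;         /* transient; try the call again */
    default:
        return SOCK_OTHER;         /* unexpected; log it (EBADF points at a bug) */
    }
}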
It's because, at the moment you shutdown() the socket, there is data in the socket's buffer waiting to be delivered to the remote party, which has already closed or shut down its receiving socket.
I still don't fully understand how sockets work (I am rather a noob), and I haven't even managed to find the files where this "shutdown" function is implemented, but seeing that there's practically no user manual for the whole sockets business, I started trying all possibilities until I got the error in a "controlled" environment. It could be something else, but after much experimenting these are the explanations I settled on:
If you sent data after the remote side closed the connection, you get the error when you shutdown().
If you sent data before the remote side closed the connection but it wasn't received on the other end, you can shutdown() once; the next time you try to shutdown(), you get the error.
If you didn't send any data, you can shutdown() as many times as you want, as long as the remote side doesn't shutdown(); once the remote side has shutdown(), if you try to shutdown() an already shut-down socket, you get the error.
I believe ENOTCONN is returned, because shutdown() is not supposed to return ECONNRESET or other more accurate errors.
It is wrong to assume that the other side "just" closed the connection. At the TCP level, the other side can only half-close a connection (or abort it). The connection is ordinarily fully closed when both sides do a shutdown() (or close()). If both sides do that, shutdown() actually succeeds for both of them!
The problem is that shutdown() did not succeed in ordinarily (half-)closing the connection, neither as the first side to close it nor as the second. Of the errors listed in the POSIX docs for shutdown(), ENOTCONN is the least inappropriate, because the others indicate problems with the arguments passed to shutdown() (or local resource problems in handling the request).
So what happened? These days, a NAT device somewhere between the two parties involved might have dropped the association and sent out RST packets as a reaction. Reset connections are so common for IPv4 that you will see them anywhere in your code, even masked as ENOTCONN in shutdown().
A coding bug might also be the reason. On a non-blocking socket, for example, a connect() can return 0 without indicating a successful connection yet.
Transport endpoint is not connected
The socket is associated with a connection-oriented protocol and has not been connected. This is usually a programming flaw.
From: http://www.wlug.org.nz/ENOTCONN
If you're sure you've connected properly in the first place, ENOTCONN is most likely to be caused by either the fd being closed on your end (perhaps in another thread?) while you're in the middle of a request, or by the connection dropping while you're in the middle of the request.
In any case, it means that the socket is not connected. Go ahead and clean up that socket. It's dead. No problem calling close() or shutdown() on it.
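In code, that tolerant cleanup might look something like this sketch:

#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* Sketch: tear down a socket, treating ENOTCONN from shutdown() as harmless,
 * as discussed above. */
static void close_socket(int fd)
{
    if (shutdown(fd, SHUT_RDWR) < 0 && errno != ENOTCONN)
        perror("shutdown");
    if (close(fd) < 0)
        perror("close");
}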