Using multiple sockets, is non-blocking or blocking with select better?

Let's say I have a server program that can accept connections from 10 (or more) different clients. The clients send data at random times, which is received by the server, but it is certain that at least one client will be sending data every update. The server cannot wait for information to arrive because it has other processing to do. Aside from using asynchronous sockets, I see two options:
Make all sockets non-blocking. In a loop, call recv() on each socket and allow it to fail with WSAEWOULDBLOCK if there is no data available; if I happen to get some data, keep it.
Leave the sockets as blocking. Add all sockets to an fd_set and call select(). If the return value is non-zero (which it will be most of the time), loop through all the sockets to find the appropriate number of readable sockets with FD_ISSET() and only call recv() on the readable sockets.
The first option will create a lot more calls to the recv() function. The second method is a bigger pain from a programming perspective because of all the FD_SET and FD_ISSET looping.
Which method (or another method) is preferred? Is avoiding the overhead of letting recv() fail on a non-blocking socket worth the hassle of calling select()?
I think I understand both methods and I have tried both with success, but I don't know if one way is considered better or optimal.

I would recommend using overlapped IO instead. You can then kick off a WSARecv(), and provide a callback function to be invoked when the operation completes. What's more, since it'll only be invoked when your program is in an alertable wait state, you don't need to worry about locks like you would in a threaded application (assuming you run them on your main thread).
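Roughly, a minimal sketch of that pattern (PostRecv, OnRecvComplete, and the buffer are placeholder names; 'sock' is assumed to be a connected SOCKET with Winsock already initialized):

#include <winsock2.h>

static char          g_buf[4096];
static WSAOVERLAPPED g_ovl;

static void CALLBACK OnRecvComplete(DWORD error, DWORD bytes,
                                    LPWSAOVERLAPPED ovl, DWORD flags)
{
    if (error == 0 && bytes > 0) {
        /* process g_buf[0..bytes), then post the next WSARecv() */
    }
}

static int PostRecv(SOCKET sock)
{
    WSABUF wb = { sizeof(g_buf), g_buf };
    DWORD  recvFlags = 0;
    ZeroMemory(&g_ovl, sizeof(g_ovl));
    int rc = WSARecv(sock, &wb, 1, NULL, &recvFlags, &g_ovl, OnRecvComplete);
    if (rc == SOCKET_ERROR && WSAGetLastError() != WSA_IO_PENDING)
        return -1;   /* immediate failure */
    return 0;        /* pending or completed; the callback fires only while
                        this thread is in an alertable wait (see below) */
}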
Note, however, that you do need to enter such an alertable wait state frequently. If this is your UI thread, make sure to use MsgWaitForMultipleObjectsEx() in your message loop, with the MWMO_ALERTABLE flag. This will give your callbacks a chance to run. On non-UI threads, regularly call one of the wait functions that put you into an alertable wait state.
Note also that modal dialogs generally will not enter an alertable wait state, as they have their own message loop which doesn't call MsgWaitForMultipleObjectsEx(). If you need to process network IO when showing a dialog box, do all of your network IO on a dedicated thread, which does enter an alertable wait state regularly.
If, for whatever reason, you can't use overlapped IO - definitely use blocking select(). Using non-blocking recv() like that in an infinite loop is an inexcusable waste of CPU time. However, do put the sockets in non-blocking mode - as otherwise, if one byte arrives and you try to read two, you might end up blocking unexpectedly.
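For illustration, a rough sketch of the select() approach on Winsock ('clients' and 'nclients' are placeholders for however you keep the connected sockets):

fd_set readable;
FD_ZERO(&readable);
for (int i = 0; i < nclients; ++i)
    FD_SET(clients[i], &readable);

struct timeval tv = { 0, 0 };  /* poll and return immediately; pass NULL to block */
if (select(0, &readable, NULL, NULL, &tv) > 0) {   /* first argument is ignored on Winsock */
    for (int i = 0; i < nclients; ++i) {
        if (FD_ISSET(clients[i], &readable)) {
            char buf[512];
            int n = recv(clients[i], buf, sizeof(buf), 0);
            if (n > 0) { /* consume buf[0..n) */ }
            /* SOCKET_ERROR with WSAEWOULDBLOCK just means nothing left to read */
        }
    }
}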
You might also want to consider using a library to abstract away the finicky details. For example, libevent or boost::asio.

The IO should either be completely blocking, with one thread per connection (in which case the event loop is essentially the OS scheduler), or completely non-blocking (in which case a select/WaitForMultipleObjects-based event loop lives in your application).
All intermediate variants are hard to maintain and error-prone.
The completely non-blocking approach scales much better as the number of concurrent connections grows and does not have thread context-switch overhead, so it is preferable where the number of concurrent connections is not fixed. This approach has higher implementation complexity compared to the completely blocking one.
For completely non-blocking IO, the core of the application is a select/WaitForMultipleObjects-based event loop, all sockets are in non-blocking mode, and all reads/writes are generally done from within the event-loop thread (for top performance, writes can first be attempted directly from the thread requesting the write).


Asynchronous blocking thread magic

I've been learning Play, and I'm getting most of the major concepts, but I'm struggling with what magic the platform is doing to enable all of these things.
In particular, let's say I have a controller that does something time-intensive. Now I understand how using Futures and asynchronous processing I can make these things appear not to block, but if it's something resource intensive, of course in the end it must block somewhere. Per the documentation:
You can’t magically turn synchronous IO into asynchronous by wrapping it in a Future. If you can’t change the application’s architecture to avoid blocking operations, at some point that operation will have to be executed, and that thread is going to block. So in addition to enclosing the operation in a Future, it’s necessary to configure it to run in a separate execution context that has been configured with enough threads to deal with the expected concurrency.
This bit I'm not understanding: if some task that I'm doing via a Future is possibly being handled in a separate thread pool, what magic is Scala/Play doing in the framework to coordinate these threads, such that whichever thread is listening to the HTTP socket blocks long enough for all of the complex processing (DB loads, serialization to JSON, etc.) to happen in separate threads, and yet the result somehow returns to the original blocking thread that has to send something back to the client for that request?
Disclaimer: this is a simplified answer for the general problem, I don't want to make this even more complex by going inside Play and Akka internals.
One method is to have a thread listening to the socket, but not writing to it; let's call it A. A spawns a Future that itself contains all the data needed for the computation. It is important that you don't confuse the thread that does the processing with the data that is being processed, as the data (memory) is shared by all threads (and sometimes needs explicit synchronization). The Future will be processed (eventually) by a thread B.
Now, does A need to block until B is done? It could (and in many general cases that might be the right solution), but in this case we hardly want to stop listening to our socket. So no, we don't: A forgets everything about the message and carries on with its life.
So when B is done, the Future might be mapped or have a listener that sends the proper response. B itself can send it, given the information that it has about the original message! You just need to be careful synchronizing access to the socket, to avoid colliding with a possible thread C that might be processing a previous or later message in parallel.
Things can obviously get more complex by having threads spawning even more threads, queues where some threads write data and other read data, etc. (Play, being based in Akka, certainly includes a lot of message queues). But I hope to have convinced you that while this statement is correct:
You can’t magically turn synchronous IO into asynchronous by wrapping it in a Future. If you can’t change the application’s architecture to avoid blocking operations...
Such a change in the application's architecture is certainly possible in many (most?) cases, and it has certainly been done inside Play.

Can a server handle multiple sockets in a single thread?

I'm writing a test program that needs to emulate several connections between virtual machines, and it seems like the best way to do that is to use Unix domain sockets, for various reasons. It doesn't really matter whether I use SOCK_STREAM or SOCK_DGRAM, but it seems like SOCK_STREAM is easier/simpler for my usage.
My problem seems to be a little backwards from the typical scenario. I want to have a single client communicating with the server over 4 distinct sockets. (I could have 4 clients with one socket each, but that distinction shouldn't matter.) Now, the thing I'm emulating doesn't have multiple threads and gets an interrupt whenever a data packet is received over one of the "sockets". Is there some easy way to emulate this with Unix sockets?
I believe that I have to do the socket(), bind(), and listen() for all 4 sockets first, then do an accept() for all 4, and do fcntl(fd, F_SETFL, FNDELAY) for each one to make them non-blocking, so that I can check each one for data with read() in a round-robin fashion. Is there any way to make it interrupt-driven or event-driven, so that my main loop only checks a socket if there's data there? Or is it better to poll them all like this?
Yes. Handling multiple connections is almost synonymous with "server", and servers are often single-threaded -- but please not this way:
check each one for data with read() in a round-robin fashion
That would require, as you mention, non-blocking sockets and some kind of delay to prevent your "round-robin" from becoming a system killing busy loop.
A major problem with that is the granularity of the delay. You can't make it too small, or the loop will still hog too much CPU time when nothing is happening. But what about when something is happening, and that something is data incoming simultaneously on multiple connections? Now your delay can produce a snowballing backlog of unhandled data, leading to refused connections, etc.
It just is not feasible, and no one writes a server that way, although I am sure anyone would give it serious thought if they were unaware of the library functions intended to tackle the problem. Note that networking is a platform specific issue, so these are not actually part of the C standard (which does not deal with sockets at all).
The functions are select(), poll(), and epoll(); the last one is Linux-specific and the other two are POSIX. The basic idea is that the call blocks, waiting until one or more of any number of active connections is ready to read or write. Waiting for a socket to be ready to write only meaningfully applies to non-blocking sockets. You don't have to use non-blocking mode, however; the select() call blocks regardless. Using non-blocking sockets makes the implementation more complex, but increases performance potential in a single-threaded server -- this is the idea behind asynchronous servers (such as nginx), a paradigm which contrasts with the more traditional threaded synchronous model.
However, I would recommend that you not use non-blocking sockets initially because of the added complexity. When/if it ends up being called for, you'll know. You still do not need threads.
There are many, many, many examples and tutorials around about how to use select() in particular.
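To give a feel for it, here is a rough sketch using poll() (a select()-based loop is structured the same way), assuming 'fds[4]' already holds the four accepted descriptors:

#include <poll.h>
#include <unistd.h>

struct pollfd pfds[4];
for (int i = 0; i < 4; ++i) {
    pfds[i].fd = fds[i];
    pfds[i].events = POLLIN;
}

for (;;) {
    if (poll(pfds, 4, -1) < 0)      /* block until at least one fd is readable */
        break;                      /* handle the error */
    for (int i = 0; i < 4; ++i) {
        if (pfds[i].revents & POLLIN) {
            char buf[256];
            ssize_t n = read(pfds[i].fd, buf, sizeof(buf));
            if (n > 0) { /* handle buf[0..n) */ }
        }
    }
}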

Should I use IOCPs or overlapped WSASend/Receive?

I am investigating the options for asynchronous socket I/O on Windows. There is obviously more than one option: I can use WSASend... with an overlapped structure providing either a completion callback or an event, or I could use IOCPs and the (new) thread pool. From what I usually read, the latter option is the recommended one.
However, it is not clear to me, why I should use IOCPs if the completion routine suffices for my goal: tell the socket to send this block of data and inform me if it is done.
I understand that the IOCP stuff in combination with CreateThreadpoolIo etc. uses the OS thread pool. However, doesn't the "normal" overlapped I/O also have to use separate threads? So what is the difference/disadvantage? Is my callback called by an I/O thread, blocking other stuff?
You can use either but, for servers, IOCP with the 'completion queue' will have better performance, in general, because it can use multiple client<>server threads, either with CreateThreadpoolIo or some user-space thread pool. Obviously, in this case, dedicated handler threads are usual.
Overlapped completion-routine I/O is more useful for clients, IMHO. The completion-routine is fired by an Asynchronous Procedure Call that is queued to the thread that initiated the I/O request, (WSASend, WSARecv). This implies that that thread must be in a position to process the APC and typically this means a while(true) loop around some 'blahEx()' call. This can be useful because it's fairly easy to wait on a blocking queue, or other inter-thread signal, that allows the thread to be supplied with data to send and the completion routine is always handled by that thread. This I/O mechanism leaves the 'hEvent' OVL parameter free to use - ideal for passing a comms buffer object pointer into the completion routine.
Overlapped I/O using an actual synchro event/Semaphore/whatever for the overlapped hEvent parameter should be avoided.
Windows IOCP documentation recommends no more than one thread per available core per completion port. Hyperthreading doubles the number of cores. Since using IOCPs results in what is, for all practical purposes, an event-driven application, the use of thread pools adds unnecessary processing to the scheduler.
If you think about it you'll understand why: an event should be serviced in its entirety (or placed in some queue after initial processing) as quickly as possible. Suppose five events are queued to an IOCP on a 4-core computer. If there are eight threads associated with the IOCP you run the risk of the scheduler interrupting one event to begin servicing another by using another thread which is inefficient. It can be dangerous too if the interrupted thread was inside a critical section. With four threads you can process four events simultaneously and as soon as one event has been completed you can start on the last remaining event in the IOCP queue.
Of course, you may have thread pools for non-IOCP related processing.
EDIT________________
The socket (file handles work fine too) is associated with an IOCP. The completion routine waits on the IOCP. As soon as a requested read from or write to the socket completes, the OS - via the IOCP - releases the completion routine waiting on the IOCP and returns with the additional information you provided when you called the read or write (I usually pass a pointer to a control block). So the completion routine immediately "knows" where to find information pertinent to the completion.
If you passed information referring to a control block (or similar), then that control block (probably) needs to keep track of what operation has completed so it knows what to do next. The IOCP itself neither knows nor cares.
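Roughly, such a wait-and-dispatch loop might look like the sketch below ('iocp' is assumed to be the completion port the socket was associated with via CreateIoCompletionPort; IoBlock is a made-up control block):

#include <winsock2.h>
#include <windows.h>

typedef struct {
    OVERLAPPED ovl;      /* first member, so the OVERLAPPED* maps back to the block */
    WSABUF     wsabuf;
    char       buf[4096];
    int        op;       /* application-defined: which operation was in flight */
} IoBlock;

void WorkerLoop(HANDLE iocp)
{
    for (;;) {
        DWORD       bytes = 0;
        ULONG_PTR   key = 0;
        OVERLAPPED *povl = NULL;
        BOOL ok = GetQueuedCompletionStatus(iocp, &bytes, &key, &povl, INFINITE);
        if (povl == NULL)
            break;                      /* nothing dequeued: error or shutdown */
        IoBlock *io = (IoBlock *)povl;  /* recover the control block */
        if (ok && bytes > 0) {
            /* io->op tells us which operation finished; process io->buf,
               then issue the next WSARecv()/WSASend() and wait again */
        }
    }
}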
If you're writing a server attached to the internet, the server would issue a read to wait for client input. That input may arrive a millisecond or a week later, and when it does, the IOCP releases the completion routine, which analyzes the input. Typically it responds with a write containing the data requested in the input and then waits on the IOCP. When the write completes, the IOCP again releases the completion routine, which sees that the write has completed, (typically) issues a new read, and a new cycle starts.
So an IOCP-based application typically consumes very little (or no) CPU until the moment a completion occurs at which time the completion routine goes full tilt until it has finished processing, sends a new I/O request and again waits on the completion port. Apart from the IOCP timeout (which can be used to signal house-keeping or such) all I/O-related stuff occurs in the OS.
To further complicate (or simplify) things it is not necessary that sockets be serviced using the WSA routines, the Win32 functions ReadFile and WriteFile work just fine.

What is the benefit of using non-blocking sockets with the "select" function?

I'm writing a server in Linux that will have to support simultaneous read/write operations from multiple clients. I want to use the select function to manage read/write availability.
What I don't understand is this: Suppose I want to wait until a socket has data available to be read. The documentation for select states that it blocks until there is data available to read, and that the read function will not block.
So if I'm using select and I know that the read function will not block, why would I need to set my sockets to non-blocking?
There might be cases where a socket is reported as ready, but by the time you get to check it, it has changed its state.
One of the good examples is accepting connections. When a new connection arrives, a listening socket is reported as ready for read. By the time you get to call accept, the connection might be closed by the other side before ever sending anything and before we called accept. Of course, the handling of this case is OS-dependent, but it's possible that accept will simply block until a new connection is established, which will cause our application to wait for an indefinite amount of time, preventing processing of other sockets. If your listening socket is in non-blocking mode, this won't happen and you'll get EWOULDBLOCK or some other error, but accept will not block anyway.
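A rough sketch of guarding against that, assuming 'listen_fd' is the listening socket select() just reported readable:

#include <sys/socket.h>
#include <fcntl.h>
#include <errno.h>

/* put the listening socket into non-blocking mode once, up front */
int flags = fcntl(listen_fd, F_GETFL, 0);
fcntl(listen_fd, F_SETFL, flags | O_NONBLOCK);

/* later, after select() reports listen_fd readable */
int conn = accept(listen_fd, NULL, NULL);
if (conn < 0) {
    if (errno == EWOULDBLOCK || errno == EAGAIN || errno == ECONNABORTED) {
        /* the connection went away before we got to it; just return to select() */
    } else {
        /* a real error */
    }
}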
Some kernels used to have (I hope it's fixed now) an interesting bug with UDP and select. When a datagram arrives, select wakes up with the socket holding the datagram marked as ready for read. The datagram checksum validation is postponed until user code calls recvfrom (or some other API capable of receiving UDP datagrams). When the code calls recvfrom and the validating code detects a checksum mismatch, the datagram is simply dropped and recvfrom ends up being blocked until the next datagram arrives. One of the patches fixing this problem (along with the problem description) can be found here.
Other than the kernel bugs mentioned by others, a different reason for choosing non-blocking sockets, even with a polling loop, is that it allows for greater performance with fast-arriving data. Think what happens when a blocking socket is marked as "readable". You have no idea how much data has arrived, so you can safely read it only once. Then you have to get back to the event loop to have your poller check whether the socket is still readable. This means that for every single read from or write to the socket you have to do at least two system calls: the select to tell you it's safe to read, and the reading/writing call itself.
With non-blocking sockets you can skip the unnecessary calls to select after the first one. When a socket is flagged as readable by select, you have the option of reading from it as long as it returns data, which allows faster processing of quick bursts of data.
This is going to sound snarky, but it isn't. The best reason to make them non-blocking is so you don't block.
Think about it. select() tells you there is something to read, but you don't know how much. Could be 2 bytes, could be 2,000. In most cases it's more efficient to drain whatever data is there before going back to select. So you enter a while loop to read:
while (1)
{
    n = read(sock, buffer, 200);
    // check the return code, handle the data, etc.
}
What happens on the last read when there is nothing left to read? If the socket isn't non-blocking you will block, thereby defeating (at least partially) the point of the select().
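With a non-blocking socket, that last read simply fails instead of hanging; a rough sketch of how the drain loop can then terminate ('sock' and 'buffer' as above):

#include <errno.h>
#include <unistd.h>

for (;;) {
    ssize_t n = read(sock, buffer, 200);
    if (n > 0)
        continue;                                  /* handle the data, keep draining */
    if (n == 0)
        break;                                     /* peer closed the connection */
    if (errno == EWOULDBLOCK || errno == EAGAIN)
        break;                                     /* drained; go back to select() */
    break;                                         /* real error; handle it */
}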
One of the benefits is that it will catch any programming errors you make, because if you try to read a socket that would normally block you, you'll get EWOULDBLOCK instead. For objects other than sockets, the exact API behaviour may change; see http://www.scottklement.com/rpg/socktut/nonblocking.html.

How does I/O work in Akka?

How does the actor model (in Akka) work when you need to perform I/O (i.e. a database operation)?
It is my understanding that a blocking operation will throw an exception (and essentially ruin all concurrency due to the evented nature of Netty, which Akka uses). Hence I would have to use a Future or something similar - however I don't understand the concurrency model.
Can one actor be processing multiple messages simultaneously?
If an actor makes a blocking call in a future (i.e. future.get()), does that block only the current actor's execution, or will it prevent execution on all actors until the blocking call has completed?
If it blocks all execution, how does using a future assist concurrency (i.e. wouldn't invoking blocking calls in a future still amount to creating an actor and executing the blocking call)?
What is the best way to deal with a multi-staged process (i.e. read from the database; call a blocking web service; read from the database; write to the database) where each step is dependent on the last?
The basic context is this:
I'm using a WebSocket server which will maintain thousands of sessions.
Each session has some state (e.g. authentication details);
The Javascript client will send a JSON-RPC message to the server, which will pass it to the appropriate session actor, which will execute it and return a result.
Execution of the RPC call will involve some I/O and blocking calls.
There will be a large number of concurrent requests (each user will be making a significant amount of requests over the WebSocket connection and there will be a lot of users).
Is there a better way to achieve this?
Blocking operations do not throw exceptions in Akka. You can do blocking calls from an Actor (which you probably want to minimize, but that's another story).
No, one actor instance cannot.
It will not block any other actors. You can influence this by using a specific Dispatcher. Futures use the default dispatcher (the global event-driven one normally), so they run on a thread in a pool. You can choose which dispatcher you want to use for your actors (per actor, or for all). I guess if you really wanted to create a problem, you might be able to pass exactly the same (thread-based) dispatcher to futures and actors, but that would take some intent on your part. I guess if you have a huge number of futures blocking indefinitely and the executor service has been configured with a fixed number of threads, you could blow up the executor service. So, a lot of 'ifs'. A call to f.get blocks only if the Future has not completed yet. It will block the 'current thread' of the Actor from which you call it (if you call it from an Actor, which is not necessary, by the way).
You do not necessarily have to block. You can use a callback instead of f.get. You can even compose Futures without blocking. Check out the talk by Viktor on 'the promising future of Akka' for more details: http://skillsmatter.com/podcast/scala/talk-by-viktor-klang
I would use async communication between the steps (if the steps are meaningful processes on their own): use an actor for every step, where every actor sends a one-way message to the next, and possibly also one-way messages to some other, non-blocking actor that can supervise the process. This way you could create chains of actors, of which you could make many; in front of them you could put a load-balancing actor, so that if one actor blocks in one chain, another of the same type in another chain might not. That would also work for your 'context' question: pass off workload to local actors and chain them up behind a load-balancing actor.
As for Netty (and I assume you mean Remote Actors, because this is the only thing that Netty is used for in Akka), pass off your work as soon as possible to a local actor or a future (with callback) if you are worried about timing or about preventing Netty from doing its job in some way.
Blocking operations will generally not throw exceptions, but waiting on a future (for example by using the !! or !!! send methods) can throw a time-out exception. That's why you should stick with fire-and-forget as much as possible, use a meaningful time-out value, and prefer callbacks when possible.
An Akka actor cannot explicitly process several messages in a row, but you can play with the throughput value via the config file. The actor will then process several messages (i.e. its receive method will be called several times sequentially) if its message queue is not empty: http://akka.io/docs/akka/1.1.3/scala/dispatchers.html#id5
Blocking operations inside an actor will not "block" all actors, but if you share threads among actors (the recommended usage), one of the dispatcher's threads will be blocked until the operation resumes. So try composing futures as much as possible, and beware of the time-out value.
For 3 and 4, I agree with Raymond's answers.
What Raymond and paradigmatic said, but also, if you want to avoid starving the thread pool, you should wrap any blocking operations in scala.concurrent.blocking.
It's of course best to avoid blocking operations, but sometimes you need to use a library that blocks. If you wrap said code in blocking, it will let the execution context know you may be blocking this thread so it can allocate another one if needed.
The problem is worse than paradigmatic describes since if you have several blocking operations you may end up blocking all threads in the thread pool and have no free threads. You could end up with deadlock if all your threads are blocked on something that won't happen until another actor/future gets scheduled.
Here's an example:
import scala.concurrent.blocking
...
Future {
  val image = blocking { load_image_from_potentially_slow_media() }
  val enhanced = image.enhance()
  blocking {
    if (oracle.queryBetter(image, enhanced)) {
      write_new_image(enhanced)
    }
  }
  enhanced
}
Documentation is here.