Why is epoll faster than select?

I have seen a lot of comparisons that say select has to walk through the fd list, and that this is slow. But why doesn't epoll have to do this?

There's a lot of misinformation about this, but the real reason is this:
A typical server might be dealing with, say, 200 connections. It will service every connection that needs to have data written or read and then it will need to wait until there's more work to do. While it's waiting, it needs to be interrupted if data is received on any of those 200 connections.
With select, the kernel has to add the process to 200 wait lists, one for each connection. To do this, it needs a "thunk" to attach the process to the wait list. When the process finally does wake up, it needs to be removed from all 200 wait lists and all those thunks need to be freed.
By contrast, with epoll, the epoll descriptor itself has a wait list. The process needs to be put on only that one wait list using only one thunk. When the process wakes up, it needs to be removed from only one wait list and only one thunk needs to be freed.
To be clear, with epoll, the epoll descriptor itself has to be attached to each of those 200 connections. But this is done once, for each connection, when it is accepted in the first place. And this is torn down once, for each connection, when it is removed. By contrast, each call to select that blocks must add the process to every wait queue for every socket being monitored.
Ironically, with select, the largest cost comes from checking if sockets that have had no activity have had any activity. With epoll, there is no need to check sockets that have had no activity because if they did have activity, they would have informed the epoll descriptor when that activity happened. In a sense, select polls each socket each time you call select to see if there's any activity, while epoll rigs it so that the socket activity itself notifies the process.
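To make the difference in calling convention concrete, here is a minimal sketch in Python. It assumes three socket pairs standing in for client connections (the names pairs, servers and clients are made up for the example) and uses selectors.DefaultSelector, which is backed by epoll on Linux but may use another mechanism elsewhere. It only shows the API shape, not the kernel-side cost.

```python
import select
import selectors
import socket

# Hypothetical setup: three connected socket pairs standing in for client connections.
pairs = [socket.socketpair() for _ in range(3)]
servers = [a for a, _ in pairs]
clients = [b for _, b in pairs]

def wait_with_select(socks):
    # select() is handed the whole interest list on every single call.
    readable, _, _ = select.select(socks, [], [], 1.0)
    return readable

# Selector style: each socket is registered once, up front.
sel = selectors.DefaultSelector()
for s in servers:
    sel.register(s, selectors.EVENT_READ)

def wait_with_selector():
    # No interest list is passed here; only the ready sockets come back.
    return [key.fileobj for key, _events in sel.select(timeout=1.0)]

clients[1].send(b"ping")
ready_select = wait_with_select(servers)
ready_selector = wait_with_selector()
print(ready_select == [servers[1]], ready_selector == [servers[1]])

# Clean up the example's descriptors.
sel.close()
for a, b in pairs:
    a.close()
    b.close()
```

With select the interest list travels into the kernel on every call; with the registered selector it is set up once and each wait only reports what is ready.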

The main difference between epoll and select is that in select() the list of file descriptors to wait on only exists for the duration of a single select() call, and the calling task only stays on the sockets' wait queues for the duration of a single call. In epoll, on the other hand, you create a single file descriptor that aggregates events from multiple other file descriptors you want to wait on, and so the list of monitored fd's is long-lasting, and tasks stay on socket wait queues across multiple system calls. Furthermore, since an epoll fd can be shared across multiple tasks, it is no longer a single task on the wait queue, but a structure that itself contains another wait queue, containing all processes currently waiting on the epoll fd. (In terms of implementation, this is abstracted over by the sockets' wait queues holding a function pointer and a void* data pointer to pass to that function).
So, to explain the mechanics a little more:
An epoll file descriptor has a private struct eventpoll that keeps track of which fd's are attached to this fd. struct eventpoll also has a wait queue that keeps track of all processes that are currently epoll_waiting on this fd. struct eventpoll also has a list of all file descriptors that are currently available for reading or writing.
When you add a file descriptor to an epoll fd using epoll_ctl(), epoll adds the struct eventpoll to that fd's wait queue. It also checks if the fd is currently ready for processing and adds it to the ready list, if so.
When you wait on an epoll fd using epoll_wait, the kernel first checks the ready list, and returns immediately if any file descriptors are already ready. If not, it adds itself to the single wait queue inside struct eventpoll, and goes to sleep.
When an event occurs on a socket that is being epoll()ed, it calls the epoll callback, which adds the file descriptor to the ready list, and also wakes up any waiters that are currently waiting on that struct eventpoll.
Obviously, a lot of careful locking is needed on struct eventpoll and the various lists and wait queues, but that's an implementation detail.
The important thing to note is that at no point above did I describe a step that loops over all file descriptors of interest. By being entirely event-based and by using a long-lasting set of fd's and a ready list, epoll can avoid ever taking O(n) time for an operation, where n is the number of file descriptors being monitored.
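The mechanics above can be mimicked with a toy model in plain Python. This is not kernel code: ToyFile and ToyEventPoll are invented names, there is no locking, and "sleeping" is modelled by simply remembering the task name. It only illustrates that wait() looks at the ready list and never at the full interest set.

```python
from collections import deque

class ToyFile:
    """Stand-in for a socket: holds callbacks to run when it becomes ready."""
    def __init__(self, name):
        self.name = name
        self.wait_queue = []      # callbacks, like the function pointer + data pair
        self.ready = False

    def signal_ready(self):
        self.ready = True
        for callback in self.wait_queue:
            callback(self)

class ToyEventPoll:
    """Toy model of struct eventpoll: interest set, ready list, waiters."""
    def __init__(self):
        self.interest = set()
        self.ready_list = deque()
        self.waiters = []         # tasks currently blocked in wait()
        self.woken = []           # tasks that a callback has woken up

    def ctl_add(self, f):
        # Done once per fd: hook our callback into the file's wait queue.
        self.interest.add(f)
        f.wait_queue.append(self._on_ready)
        if f.ready:
            self.ready_list.append(f)

    def _on_ready(self, f):
        # Runs when the file signals activity; it never walks the interest set.
        if f not in self.ready_list:
            self.ready_list.append(f)
        self.woken.extend(self.waiters)
        self.waiters.clear()

    def wait(self, task):
        # Only the ready list is inspected here.
        if self.ready_list:
            ready = list(self.ready_list)
            self.ready_list.clear()   # simplification of the real ready-list handling
            return ready
        self.waiters.append(task)     # a real kernel would put the task to sleep
        return []

ep = ToyEventPoll()
files = [ToyFile(f"fd{i}") for i in range(200)]
for f in files:
    ep.ctl_add(f)

first = ep.wait("task-A")             # nothing ready yet, so task-A is queued
files[42].signal_ready()              # activity on one of the 200 files
second = ep.wait("task-A")
print(first, ep.woken, [f.name for f in second])
```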


Interrupting gen_tcp:recv in erlang

We have a gen_server process that manages a pool of passive sockets on the client side by creating them and lending them to other processes. Any other process can borrow a socket, send a request to the server using the socket, get a reply through gen_tcp:recv, and then release the socket back to the gen_server socket pool process.
The socket pool process monitors all processes that borrow sockets. If any borrowing process goes down, the pool receives a 'DOWN' message for it:
handle_info({'DOWN', Ref, process, _Pid, _Reason}, State) ->
In this case we would like to drain the borrowed socket and reuse it by putting it back into the pool. The problem is that when we try to drain the socket using gen_tcp:recv(Socket, 0, 0), we get the inet error ealready, meaning that a recv operation is already in progress.
So the question is how to interrupt the previous recv, successfully drain the socket, and reuse it for other processes.
One more level of indirection will greatly simplify the situation.
Instead of passing sockets to processes that need to use them, have each socket controlled by a separate process that owns it and represents the socket within the system. Route Erlang-side messages to and from sockets as necessary to implement the "borrowing" of sockets (even more flexibly, pass the socket controller a callback module that speaks a given protocol, so as soon as data comes over the network it is interpreted as Erlang messages internally).
If this is done you will not lose control of sockets or have them in indeterminate states -- they will instead be held by a single, owning process the entire time. Instead of having the route-manager/pool-manager process receive the 'DOWN' messages, have each socket controller monitor the process that is currently using it. When a 'DOWN' is received, the controller can change state as necessary.
You can get yourself into some weird situations by passing open file descriptors, sockets and other types of ports around among processes that aren't designated as their owner. Passing ports and sockets around also becomes a problem if you need to scale a program across several nodes (suddenly you have to care where things are passed and what node they are on, etc.).

epoll and multiple processes

Assume I have a process that binds to a socket, then forks itself to create 4 instances of the current process.
The new processes inherit the parent's socket file descriptor and are able to call accept() on it. If I add the socket descriptor to epoll and then connect to the socket, all 4 workers are notified (EPOLLIN) that there is something to read/accept. All workers try to accept; 3 of them fail and only one succeeds.
How can I get around this behaviour?
Letting most of the workers fail on every incoming connection is too big a performance penalty.
How can this be avoided?
If accept() returns -1/EWOULDBLOCK/EAGAIN, just return to the poll loop.
Use EPOLLONESHOT when adding the socket to the epoll fd, so that only one process is notified per event (the descriptor then has to be re-armed with EPOLL_CTL_MOD after each event).
Or you can guard the epoll fd so that only one worker waits on it at a time.
This is the thundering herd problem, which can be addressed with the EPOLLEXCLUSIVE flag (available since Linux 4.5). With it, the kernel wakes one or more of the waiting processes per event rather than all of them.
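As a rough illustration of the first suggestion (treat EAGAIN/EWOULDBLOCK from accept() as "someone else got it" and go back to the poll loop), here is a small Python sketch. It simulates the four workers inside a single process rather than forking, and try_accept is a made-up helper name; in Python the EAGAIN case surfaces as BlockingIOError.

```python
import select
import socket

def try_accept(listener):
    """Return a connected socket, or None if there is nothing left to accept."""
    try:
        conn, _addr = listener.accept()
        return conn
    except BlockingIOError:       # EAGAIN / EWOULDBLOCK on a non-blocking socket
        return None

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
listener.listen(16)
listener.setblocking(False)

client = socket.create_connection(listener.getsockname())

# Wait until the listener is readable, as each worker's poll loop would.
select.select([listener], [], [], 2.0)

# Four "workers" (simulated in one process) all react to the same wakeup.
results = [try_accept(listener) for _ in range(4)]
accepted = [c for c in results if c is not None]
print(len(accepted), results.count(None))

# Clean up the example's sockets.
for c in accepted:
    c.close()
client.close()
listener.close()
```

The losing workers pay for one failed system call each and then go back to waiting. The flags mentioned above reduce how often that happens, but handling the failure is still needed for correctness.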

Should I use IOCPs or overlapped WSASend/Receive?

I am investigating the options for asynchronous socket I/O on Windows. There is obviously more than one option: I can use WSASend... with an overlapped structure providing either a completion callback or an event, or I could use IOCPs and the (new) thread pool. From what I usually read, the latter option is the recommended one.
However, it is not clear to me why I should use IOCPs if the completion routine suffices for my goal: tell the socket to send this block of data and inform me when it is done.
I understand that the IOCP stuff in combination with CreateThreadpoolIo etc. uses the OS thread pool. However, the "normal" overlapped I/O must also use separate threads? So what is the difference/disadvantage? Is my callback called by an I/O thread and blocks other stuff?
You can use either but, for servers, IOCP with the 'completion queue' will have better performance, in general, because it can use multiple client<>server threads, either with CreateThreadpoolIo or some user-space thread pool. Obviously, in this case, dedicated handler threads are usual.
Overlapped completion-routine I/O is more useful for clients, IMHO. The completion-routine is fired by an Asynchronous Procedure Call that is queued to the thread that initiated the I/O request, (WSASend, WSARecv). This implies that that thread must be in a position to process the APC and typically this means a while(true) loop around some 'blahEx()' call. This can be useful because it's fairly easy to wait on a blocking queue, or other inter-thread signal, that allows the thread to be supplied with data to send and the completion routine is always handled by that thread. This I/O mechanism leaves the 'hEvent' OVL parameter free to use - ideal for passing a comms buffer object pointer into the completion routine.
Overlapped I/O using an actual synchro event/Semaphore/whatever for the overlapped hEvent parameter should be avoided.
Windows IOCP documentation recommends no more than one thread per available core per completion port. Hyperthreading doubles the number of logical cores. Since an application built on IOCPs is, for all practical purposes, event-driven, the use of thread pools adds unnecessary work for the scheduler.
If you think about it you'll understand why: an event should be serviced in its entirety (or placed in some queue after initial processing) as quickly as possible. Suppose five events are queued to an IOCP on a 4-core computer. If there are eight threads associated with the IOCP you run the risk of the scheduler interrupting one event to begin servicing another by using another thread which is inefficient. It can be dangerous too if the interrupted thread was inside a critical section. With four threads you can process four events simultaneously and as soon as one event has been completed you can start on the last remaining event in the IOCP queue.
Of course, you may have thread pools for non-IOCP related processing.
The socket (file handles work fine too) is associated with an IOCP. The completion routine waits on the IOCP. As soon as a requested read from or write to the socket completes, the OS - via the IOCP - releases the completion routine waiting on the IOCP and returns with the additional information you provided when you called the read or write (I usually pass a pointer to a control block). So the completion routine immediately "knows" where to find information pertinent to the completion.
If you passed information referring to a control block (or similar), then that control block (probably) needs to keep track of what operation has completed so it knows what to do next. The IOCP itself neither knows nor cares.
If you're writing a server attached to the internet, the server would issue a read to wait for client input. That input may arrive a millisecond or a week later, and when it does the IOCP will release the completion routine, which analyzes the input. Typically it responds with a write containing the data requested in the input and then waits on the IOCP. When the write completes, the IOCP again releases the completion routine, which sees that the write has completed, (typically) issues a new read, and a new cycle starts.
So an IOCP-based application typically consumes very little (or no) CPU until the moment a completion occurs at which time the completion routine goes full tilt until it has finished processing, sends a new I/O request and again waits on the completion port. Apart from the IOCP timeout (which can be used to signal house-keeping or such) all I/O-related stuff occurs in the OS.
To further complicate (or simplify) things, it is not necessary that sockets be serviced using the WSA routines; the Win32 functions ReadFile and WriteFile work just fine.
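The read/write cycle described above can be sketched without any Windows APIs. In the following Python model a queue.Queue stands in for the completion port, ControlBlock is a hypothetical per-operation context, and issue() pretends the OS completes every request immediately. None of this is the real IOCP API; it only shows how the control block tells the handler what just finished and what to do next.

```python
import queue
from dataclasses import dataclass

@dataclass
class ControlBlock:
    """Hypothetical per-operation context passed along with each I/O request."""
    conn_id: int
    op: str                  # "read" or "write"
    data: bytes = b""

port = queue.Queue()         # stands in for the completion port
log = []

def issue(cb):
    # Pretend the OS finishes the request instantly and posts the completion.
    port.put(cb)

def worker(max_events):
    for _ in range(max_events):
        try:
            cb = port.get(timeout=0.1)   # like waiting on the port with a timeout
        except queue.Empty:
            log.append("timeout: housekeeping")
            continue
        log.append(f"{cb.op} done on {cb.conn_id}")
        if cb.op == "read":
            # Input arrived: answer it with a write.
            issue(ControlBlock(cb.conn_id, "write", b"reply to " + cb.data))
        else:
            # Write finished: wait for the next request with a new read.
            issue(ControlBlock(cb.conn_id, "read", b"next request"))

issue(ControlBlock(1, "read", b"hello"))
worker(3)
print(log)
```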


This is a server with sockets using IOCP.
I initialize a pool of OVERLAPPED structures which I use for WSASend() calls.
Every WSASend() call takes a single OVERLAPPED pointer out of the pool, and the IOCP worker thread puts it back when the completion notification arrives.
However, when a client disconnects, SOME of the pending WSASend() calls get dropped, and therefore I have no chance to recycle the OVERLAPPED pointers that were taken out of the pool.
How can I cancel 100% of the pending WSASend() calls while making sure that they won't reach the IOCP worker, so I can manually recycle the OVERLAPPED pointers on disconnection?
That's not how IOCPs work.
If you have pending operations that you want to cancel then close the corresponding socket and the operations will either complete or fail and all of the completions (including the failures) will come out of the IOCP eventually.
You need to wait for that to occur and once it has then you are good to shut down.
What I tend to do is have a 'per connection' structure which contains the socket and which is used as the completion key. I then have 'per operation' structures which include the OVERLAPPED and which also include details of the operation type, the I/O buffer used and other stuff. Both of these structures are reference counted.
When an operation is initiated you increment the reference count on both the connection object and the operation object. When you get a completion you process it and then decrement the counts. When the counts reach 0 you're not doing any work with the objects and they can be recycled to the pool for reuse.
To aid in clean shutdown I have a counter that I can wait on that tracks the number of 'active' 'per connection' objects (sockets).
To shut down you abort all connections and then wait for the connection counter to hit zero. At that point all of your objects are either destroyed or in your pools and you can clean up.
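The bookkeeping described above can be sketched in a few lines of Python. This is a simulation only: RefCounted, OperationPool, the stats dictionary and the deque used as a completion queue are all invented for the example and are not taken from the author's code. The point is that operations are recycled only when their completion (success or failure) comes out of the queue, never by hand at disconnect time.

```python
from collections import deque

class RefCounted:
    """Minimal manual reference count; on_zero runs when the count drops to 0."""
    def __init__(self, on_zero):
        self.refs = 0
        self._on_zero = on_zero

    def add_ref(self):
        self.refs += 1

    def release(self):
        self.refs -= 1
        if self.refs == 0:
            self._on_zero(self)

class OperationPool:
    """Pool of per-operation objects (each would embed an OVERLAPPED)."""
    def __init__(self, size):
        self.free = [RefCounted(self._recycle) for _ in range(size)]

    def _recycle(self, op):
        self.free.append(op)

    def take(self):
        op = self.free.pop()
        op.add_ref()
        return op

stats = {"active_connections": 0}

def connection_finished(_conn):
    stats["active_connections"] -= 1

completion_queue = deque()        # stands in for the completion port
pool = OperationPool(4)

conn = RefCounted(connection_finished)
conn.add_ref()                    # reference held while the socket is open
stats["active_connections"] += 1

def start_send(conn, pool):
    op = pool.take()              # operation object leaves the pool
    conn.add_ref()                # the pending operation keeps the connection alive
    return op

pending = [start_send(conn, pool) for _ in range(3)]

# Client disconnects: closing the socket makes every pending operation
# complete or fail, and each one still shows up as a completion.
for op in pending:
    completion_queue.append((conn, op, "aborted"))
conn.release()                    # drop the "socket is open" reference

while completion_queue:
    c, op, _status = completion_queue.popleft()
    op.release()                  # count hits 0, so the operation returns to the pool
    c.release()

print(len(pool.free), stats["active_connections"])
```

Shutdown then amounts to waiting for the active connection counter to reach zero, at which point every operation object is back in the pool.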
I have some example code, here, which is a set of full featured IOCP server examples which may help - it's working code that you can step through and get ideas from if nothing else.

Using multiple sockets, is non-blocking or blocking with select better?

Let's say I have a server program that can accept connections from 10 (or more) different clients. The clients send data at random, which is received by the server, but it is certain that at least one client will be sending data every update. The server cannot wait for information to arrive because it has other processing to do. Aside from using asynchronous sockets, I see two options:
1. Make all sockets non-blocking. In a loop, call recv() on each socket and allow it to fail with WSAEWOULDBLOCK if there is no data available; if I happen to get some data, keep it.
2. Leave the sockets as blocking. Add all sockets to an FD_SET and call select(). If the return value is non-zero (which it will be most of the time), loop through all the sockets, find the readable ones with FD_ISSET(), and call recv() only on those.
The first option will create a lot more calls to the recv() function. The second method is a bigger pain from a programming perspective because of all the FD_SET and FD_ISSET looping.
Which method (or another method) is preferred? Is avoiding the overhead of letting recv() fail on a non-blocking socket worth the hassle of calling select()?
I think I understand both methods and I have tried both with success, but I don't know if one way is considered better or optimal.
I would recommend using overlapped IO instead. You can then kick off a WSARecv(), and provide a callback function to be invoked when the operation completes. What's more, since it'll only be invoked when your program is in an alertable wait state, you don't need to worry about locks like you would in a threaded application (assuming you run them on your main thread).
Note, however, that you do need to enter such an alertable wait state frequently. If this is your UI thread, make sure to use MsgWaitForMultipleObjectsEx() in your message loop, with the MWMO_ALERTABLE flag. This will give your callbacks a chance to run. On non-UI threads, call on a regular basis any of the wait functions that put you into an alertable wait state.
Note also that modal dialogs generally will not enter an alertable wait state, as they have their own message loop which doesn't call MsgWaitForMultipleObjectsEx(). If you need to process network IO when showing a dialog box, do all of your network IO on a dedicated thread, which does enter an alertable wait state regularly.
If, for whatever reason, you can't use overlapped IO - definitely use blocking select(). Using non-blocking recv() like that in an infinite loop is an inexcusable waste of CPU time. However, do put the sockets in non-blocking mode - as otherwise, if one byte arrives and you try to read two, you might end up blocking unexpectedly.
You might also want to consider using a library to abstract away the finicky details. For example, libevent or boost::asio.
The IO should be either completely blocking, with one thread per connection (in which case the event loop is essentially the OS scheduler), or completely non-blocking, in which case a select/WaitForMultipleObjects-based event loop lives in your application.
All intermediate variants are error prone and not very maintainable.
The completely non-blocking approach scales much better as the number of concurrent connections grows and has no thread context-switch overhead, so it is preferable where the number of concurrent connections is not fixed. It has higher implementation complexity than the completely blocking approach.
For completely non-blocking IO, the core of the application is a select/WaitForMultipleObjects-based event loop, all sockets are in non-blocking mode, and all reads/writes are generally done from within the event loop thread (for top performance, writes can first be attempted directly from the thread requesting the write).
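For what it's worth, here is a minimal Python sketch of one iteration of such a loop: block in select(), keep the sockets themselves non-blocking, and call recv() only on the ones reported readable. The socket pairs and the poll_once helper are assumptions made up for the example; a real server would use its accepted client sockets and run this in a loop alongside its other processing.

```python
import select
import socket

# Hypothetical setup: three socket pairs standing in for connected clients.
pairs = [socket.socketpair() for _ in range(3)]
servers = [a for a, _ in pairs]
clients = [b for _, b in pairs]
for s in servers:
    s.setblocking(False)          # recv() can then never block unexpectedly

def poll_once(socks, timeout):
    """Block in select(), then read only the sockets reported readable."""
    received = {}
    readable, _, _ = select.select(socks, [], [], timeout)
    for s in readable:
        try:
            data = s.recv(4096)
        except BlockingIOError:   # spurious wakeup: nothing to read after all
            continue
        received[socks.index(s)] = data
    return received

clients[0].send(b"a")
clients[2].send(b"ccc")
got = poll_once(servers, 1.0)
print(got)

# Clean up the example's sockets.
for a, b in pairs:
    a.close()
    b.close()
```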