epoll and multiple processes - sockets

Assume I have a process that binds to a socket and then forks itself to create 4 instances of the current process.
The new processes inherit the file descriptor of the parent's listening socket and are able to call accept() on it. If I put the socket descriptor into epoll and then connect to the socket, all 4 workers are notified (EPOLLIN) that there is something to read/accept. All workers then try to accept(): three of them fail and only one succeeds.
How can I get around this behaviour?
Letting most of the workers fail on every incoming connection is too big a performance penalty.
How can this be avoided?

If accept() returns -1/EWOULDBLOCK/EAGAIN, just return to the poll loop.
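For example, assuming the listening socket has been made non-blocking, the accept path in each worker might look like this minimal sketch (handle_client is a stub standing in for the real per-connection logic):

#include <sys/socket.h>
#include <errno.h>
#include <unistd.h>

static void handle_client(int conn) { close(conn); }   /* stub: real per-connection logic goes here */

/* Accept whatever is pending; if another worker raced us and won,
   accept() returns EAGAIN/EWOULDBLOCK and we simply go back to epoll_wait(). */
static void accept_pending(int listen_fd)
{
    for (;;) {
        int conn = accept(listen_fd, NULL, NULL);
        if (conn >= 0) { handle_client(conn); continue; }
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;          /* nothing left for this worker */
        if (errno == EINTR)
            continue;       /* interrupted by a signal, retry */
        break;              /* real error: report it and return to the poll loop */
    }
}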

Use EPOLLONESHOT when adding the socket to the epoll fd, so that only one process is notified per event (the descriptor then has to be re-armed with epoll_ctl after each notification).
Alternatively, you can guard the epoll fd so that only one worker waits on it at a time.

This is the thundering herd problem, which can be solved with the EPOLLEXCLUSIVE flag (available since Linux 4.5). It ensures that a single wakeup event does not wake every waiting process; typically only one is woken per event.
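For example, each worker can create its own epoll instance and register the inherited listening socket with EPOLLEXCLUSIVE; a minimal sketch (error handling omitted):

#include <sys/epoll.h>

/* Called in each worker after the fork; listen_fd is the inherited listening socket. */
static int register_listener(int listen_fd)
{
    int epfd = epoll_create1(0);
    struct epoll_event ev;
    ev.events = EPOLLIN | EPOLLEXCLUSIVE;   /* wake only some (typically one) of the waiters per event */
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);
    return epfd;                            /* the worker then loops on epoll_wait(epfd, ...) */
}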

Related

Can non-blocking sockets be used with epoll's level triggered mode?

Currently I have a server application which supports multiple client sessions. The server runs with epoll in edge-triggered mode, and all sockets used inside the server are non-blocking.
The main epoll loop looks something like this,
n = epoll_wait(epfd, events, MAX_EVENTS, -1);
for (i = 0; i < n; i++) {
    if (events[i].events & EPOLLIN) {   /* assume the client has written some data */
        /* ET mode: keep reading until EAGAIN, handling buf as it arrives */
        while ((r = read(events[i].data.fd, buf, sizeof buf)) > 0)
            ;                           /* ... handle buf ... */
        if (r < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            continue;                   /* drained; move on to the next event */
        /* otherwise the peer closed or a real error occurred */
    }
}
The problem arises when data keeps flowing in continuously and the buffer never drains: the other sessions do not get a chance to be served.
Because of this possible starvation of the other clients, I am thinking of using level-triggered mode, which would let the server serve all active sessions in a round-robin way.
Can I just switch to level-triggered mode by removing EPOLLET from the subscribed events and reading the buffer data once per wakeup (as LT mode allows)?
Any comments/references are appreciated.
Thanks!
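For reference, the LT variant being proposed would look roughly like this (a fragment reusing the names from the loop above; the comments mark where per-session handling would go):

/* Level-triggered: one bounded read per wakeup. If more data remains,
   epoll_wait() reports the fd as readable again on the next pass,
   so the other sessions get a turn in between. */
for (i = 0; i < n; i++) {
    if (events[i].events & EPOLLIN) {
        r = read(events[i].data.fd, buf, sizeof buf);
        if (r > 0) {
            /* ... handle buf ... */
        } else if (r == 0 || (errno != EAGAIN && errno != EWOULDBLOCK)) {
            /* peer closed or real error: tear down this session */
        }
    }
}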

Interrupting gen_tcp:recv in erlang

We have a gen_server process that manages a pool of passive sockets on the client side by creating them and lending them to other processes. Any other process can borrow a socket, send a request to the server over that socket, get the reply through gen_tcp:recv, and then release the socket back to the gen_server pool process.
The socket pool process monitors all processes that borrow sockets. If any borrowing process dies, the pool gets a 'DOWN' message from it:
handle_info({'DOWN', Ref, process, _Pid, _Reason}, State) ->
In this case we would like to drain the borrowed socket and reuse it by putting it back into the pool. The problem is that while trying to drain the socket using gen_tcp:recv(Socket, 0, 0), we get an inet ealready error, meaning that a recv operation is already in progress.
So the question is how to interrupt the previous recv, successfully drain the socket, and reuse it for other processes.
Thanks.
One more level of indirection will greatly simplify the situation.
Instead of passing sockets to processes that need to use them, have each socket controlled by a separate process that owns it and represents the socket within the system. Route Erlang-side messages to and from sockets as necessary to implement the "borrowing" of sockets (even more flexibly, pass the socket controller a callback module that speaks a given protocol, so as soon as data comes over the network it is interpreted as Erlang messages internally).
If this is done you will not lose control of sockets or have them in indeterminate states -- they will instead be held by a single owning process the entire time. Instead of having the route-manager/pool-manager process receive the 'DOWN' messages, have each socket controller monitor its current borrower. When a 'DOWN' is received, the controller can change state however is necessary.
You can get yourself into some weird situations by passing open file descriptors, sockets, and other types of ports around among processes that aren't designated as their owners. Passing ports and sockets around also becomes a problem if you need to scale a program across several nodes (suddenly you have to care about where things are passed and which node they are on, etc.).

Why is epoll faster than select?

I have seen a lot of comparisons which say that select has to walk through the fd list, and that this is slow. But why doesn't epoll have to do this?
There's a lot of misinformation about this, but the real reason is this:
A typical server might be dealing with, say, 200 connections. It will service every connection that needs to have data written or read and then it will need to wait until there's more work to do. While it's waiting, it needs to be interrupted if data is received on any of those 200 connections.
With select, the kernel has to add the process to 200 wait lists, one for each connection. To do this, it needs a "thunk" to attach the process to the wait list. When the process finally does wake up, it needs to be removed from all 200 wait lists and all those thunks need to be freed.
By contrast, with epoll, the epoll socket itself has a wait list. The process needs to be put on only that one wait list using only one thunk. When the process wakes up, it needs to be removed from only one wait list and only one thunk needs to be freed.
To be clear, with epoll, the epoll socket itself has to be attached to each of those 200 connections. But this is done once, for each connection, when it is accepted in the first place. And this is torn down once, for each connection, when it is removed. By contrast, each call to select that blocks must add the process to every wait queue for every socket being monitored.
Ironically, with select, the largest cost comes from checking if sockets that have had no activity have had any activity. With epoll, there is no need to check sockets that have had no activity because if they did have activity, they would have informed the epoll socket when that activity happened. In a sense, select polls each socket each time you call select to see if there's any activity while epoll rigs it so that the socket activity itself notifies the process.
The main difference between epoll and select is that in select() the list of file descriptors to wait on only exists for the duration of a single select() call, and the calling task only stays on the sockets' wait queues for the duration of a single call. In epoll, on the other hand, you create a single file descriptor that aggregates events from multiple other file descriptors you want to wait on, and so the list of monitored fd's is long-lasting, and tasks stay on socket wait queues across multiple system calls. Furthermore, since an epoll fd can be shared across multiple tasks, it is no longer a single task on the wait queue, but a structure that itself contains another wait queue, containing all processes currently waiting on the epoll fd. (In terms of implementation, this is abstracted over by the sockets' wait queues holding a function pointer and a void* data pointer to pass to that function).
So, to explain the mechanics a little more:
An epoll file descriptor has a private struct eventpoll that keeps track of which fd's are attached to this fd. struct eventpoll also has a wait queue that keeps track of all processes that are currently epoll_waiting on this fd, and a ready list of all file descriptors that are currently available for reading or writing.
When you add a file descriptor to an epoll fd using epoll_ctl(), epoll adds the struct eventpoll to that fd's wait queue. It also checks if the fd is currently ready for processing and adds it to the ready list, if so.
When you wait on an epoll fd using epoll_wait, the kernel first checks the ready list, and returns immediately if any file descriptors are already ready. If not, it adds itself to the single wait queue inside struct eventpoll, and goes to sleep.
When an event occurs on a socket that is being epoll()ed, it calls the epoll callback, which adds the file descriptor to the ready list, and also wakes up any waiters that are currently waiting on that struct eventpoll.
Obviously, a lot of careful locking is needed on struct eventpoll and the various lists and wait queues, but that's an implementation detail.
The important thing to note is that at no point above did I describe a step that loops over all file descriptors of interest. By being entirely event-based and by using a long-lasting set of fd's and a ready list, epoll can avoid ever taking O(n) time for an operation, where n is the number of file descriptors being monitored.
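To make the user-space side of that concrete, here is a minimal sketch: the interest set is registered once with epoll_ctl() and kept by the kernel across calls, and epoll_wait() returns only the descriptors on the ready list (handle_event is a stub; error handling omitted):

#include <sys/epoll.h>

#define MAX_EVENTS 64

static void handle_event(int fd, unsigned int what) { (void)fd; (void)what; }  /* stub dispatcher */

static void event_loop(int listen_fd)
{
    int epfd = epoll_create1(0);
    struct epoll_event ev, events[MAX_EVENTS];

    /* Register interest once; the kernel keeps this across epoll_wait() calls. */
    ev.events = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    for (;;) {
        /* Only descriptors on the ready list come back; nothing here scans the full set. */
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++)
            handle_event(events[i].data.fd, events[i].events);
    }
}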

Adding a socket descriptor to the io_service dynamically and removing it

I am writing a gateway service which listens on a network socket and routes the packets received to separate daemons. I am planning to use boost asio but I am stuck on a few questions. Here is the design of the server I am planning to implement:
The gateway will be listening for TCP connections using boost asio.
The gateway will also listen for streamed Unix domain connections from daemons using boost asio.
Whenever there is a packet on a TCP connection, the gateway looks at the protocol tag in the packet and puts the packet on the Unix domain connection on which the corresponding service is listening.
Whenever there is a packet on a service connection, the gateway looks at the client tag and puts the packet on the respective client connection.
Every descriptor in the gateway will be non-blocking.
I am stuck on one particular problem: when the gateway is writing to a service connection, there is a chance of getting an EAGAIN or EWOULDBLOCK error if the service socket's buffer is full. I plan to tackle this by queuing the buffers and "waiting for the service connection to become ready for writing".
If I were using the select system call, "waiting for the service connection to become ready for writing" would translate to adding the fd to the write fd_set and passing it to select. Once the service connection is ready for writing, I would write the enqueued buffers to the connection and remove it from select's write fd_set.
How do I do the same thing with boost asio? Is such a thing possible?
If you want to go with that approach, then use boost::asio::null_buffers to enable Reactor-Style operations. Additionally, set the Boost.Asio socket to non-blocking through the socket::non_blocking() member function. This option will set the synchronous socket operations to be non-blocking. This is different from setting the native socket as non-blocking, as Boost.Asio sets the native socket as non-blocking, and emulates blocking for synchronous operations.
However, if Proactor-Style operations are an option, then consider using them, as it allows the application to ignore some of the lower level details. When using proactor style operations, Boost.Asio will perform the I/O on the application's behalf, properly handling EWOULDBLOCK, EAGAIN, and ERROR_RETRY logic. For example, when Boost.Asio incurs one of the previously mentioned errors, it pushes the I/O operation back into its internal queue, deferring its reattempt, allowing other operations to be attempted.
Often times, there are two constraints which require the use of Reactor-Style operations instead of Proactor-Style operations:
Another library expects to perform the I/O operations itself.
Memory limitations. With a Proactor, the lifespan of a buffer must exceed the duration of a read or write operation, and concurrent operations may require their own buffer. A Reactor allows for the lifetime of a buffer to begin when data is ready to be read, and end when data is no longer being used.
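To make the reactor-style option concrete, here is a minimal sketch along those lines for a Unix-domain service socket (flush_queued_buffers is a stub standing in for draining the queued buffers with write_some() until would_block; newer Boost releases expose socket::async_wait() as the non-deprecated equivalent of null_buffers):

#include <boost/asio.hpp>

namespace asio = boost::asio;

static void flush_queued_buffers(asio::local::stream_protocol::socket&)
{
    /* stub: call write_some() on the queued buffers until it reports would_block */
}

static void wait_until_writable(asio::local::stream_protocol::socket& service_socket)
{
    // Synchronous calls on this socket now return would_block instead of blocking.
    service_socket.non_blocking(true);

    // Reactor-style: the handler fires when the socket becomes ready for writing,
    // but Boost.Asio performs no I/O on our behalf.
    service_socket.async_write_some(asio::null_buffers(),
        [&service_socket](const boost::system::error_code& ec, std::size_t /*unused*/)
        {
            if (!ec)
                flush_queued_buffers(service_socket);
        });
}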
Using boost::asio you don't need to mess with non-blocking mode or with return codes such as EAGAIN, EWOULDBLOCK, etc. Also, you are not "adding a socket to a poll loop" or anything like that; that is all hidden from you, since asio is a higher-level framework.
The typical pattern is:
You create io_service object
You create socket with binding to io_service
You start some async operation (async_connect, async_read, async_write, and so on) on the socket.
You run dispatching with io_service::run or similar methods.
asio will invoke your handler when the operation completes.
Check out the examples on the boost::asio page. I think the async echo server illustrates the technique for your task.
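As a sketch of how that pattern maps onto the gateway's queued writes (names are placeholders and error handling is trimmed; Boost.Asio retries EWOULDBLOCK/EAGAIN internally, so the handler only fires on completion or a real error):

#include <boost/asio.hpp>
#include <deque>
#include <string>

namespace asio = boost::asio;

struct ServiceConnection
{
    asio::local::stream_protocol::socket socket;
    std::deque<std::string> queue;           // packets waiting to be written

    explicit ServiceConnection(asio::io_service& io) : socket(io) {}

    void send(std::string packet)
    {
        bool idle = queue.empty();
        queue.push_back(std::move(packet));
        if (idle)
            write_next();                    // keep only one async_write in flight
    }

    void write_next()
    {
        asio::async_write(socket, asio::buffer(queue.front()),
            [this](const boost::system::error_code& ec, std::size_t /*bytes*/)
            {
                if (ec) return;              // real error: close/reset in a full implementation
                queue.pop_front();
                if (!queue.empty())
                    write_next();            // continue draining the queue
            });
    }
};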
If multiple threads will be writing to the same socket object used for a connection, then you need to use a mutex (or critical section if using Windows) to single thread the code.
As for - "when the gateway is writing to the service connection, there are chances of getting an EAGAIN or EWOULDBLOCK error if the service socket is full", I believe that ASIO handles that for you internally so you don't have to worry about it.

In socket programming, can multiple processes (or threads) call accept() on the same listening socket?

i.e.:
open a listening socket in the parent process
call epoll_wait(listening_socket) in child1, child2, child3, ...
call accept in each child if there is a connection request
In general, it's not a good idea to have multiple threads performing IO on the same socket without some kind of synchronization between them. In your scenario, it's possible you'd see something like:
incoming connection request wakes up epoll_wait in all N child threads
all N threads call accept, 1 call succeeds, N-1 block (or fail, if your listening socket is non-blocking)
The more usual approach is to have the parent thread loop calling accept on the listening socket, and starting a child thread for each incoming request. (Or, if you're concerned about thread creation overhead, you can have a pool of child threads that wait on a condition variable when idle; the parent adds the newly-accepted socket to a queue and uses pthread_cond_signal to wake a child to handle it.)
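A minimal sketch of that usual approach, with the parent accepting and handing each connection off to a thread (handle_client is a stub standing in for the per-connection logic):

#include <sys/socket.h>
#include <unistd.h>
#include <thread>

static void handle_client(int conn) { close(conn); }   /* stub: per-connection logic goes here */

static void accept_loop(int listen_fd)
{
    for (;;) {
        int conn = accept(listen_fd, NULL, NULL);       /* blocking accept in the parent thread */
        if (conn < 0)
            continue;                                   /* EINTR and friends: just retry */
        std::thread(handle_client, conn).detach();      /* or hand it to a pre-started worker pool */
    }
}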
Yes, you can, but your example is a little incomplete:
Create listening_socket
Create an epoll set using epoll_create
Register listening_socket with the epoll set using epoll_ctl
call epoll_wait(epoll set) in child1, child2, child3, ...
call accept in each child if there is a connection request
epoll_wait makes sure that only one thread gets the connection event, and only that thread will call accept.
If you create two epoll sets and register your listening_socket to both of them you will receive the event twice, once per epoll set, and this is not recommended.
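Putting those steps together, a rough sketch with one epoll set shared by the children (error handling omitted; handle_client is a stub):

#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

static void handle_client(int conn) { close(conn); }   /* stub: per-connection logic goes here */

/* The parent does this once; the children inherit/share the returned epfd. */
static int make_epoll_for(int listening_socket)
{
    int epfd = epoll_create1(0);
    struct epoll_event ev;
    ev.events = EPOLLIN;
    ev.data.fd = listening_socket;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listening_socket, &ev);
    return epfd;
}

/* Each child runs this loop. */
static void child_loop(int epfd, int listening_socket)
{
    struct epoll_event events[16];
    for (;;) {
        int n = epoll_wait(epfd, events, 16, -1);
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == listening_socket) {
                int conn = accept(listening_socket, NULL, NULL);
                if (conn >= 0)
                    handle_client(conn);
            }
        }
    }
}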
You may refer to this tutorial http://www.devshed.com/c/a/BrainDump/Linux-Files-and-the-Event-Poll-Interface/ and search for some interesting discussions about epoll in this forum http://www.developerweb.net/forum/ to learn more.
For more elaborate examples you can always refer to the libev, libevent or nginx source code.