Under winsock2, what alternative is there to select()? - winsock

I have a working multi-client, single-threaded TCP/IP server application built in C++ over bare winsock2. The heart of it uses select() to wait for new work to do. I'm thinking of extending the number of simultaneous clients to some hundreds or thousands, in practice all mostly idle. My architecture uses very little memory for a connected, idle client.
Before each select(), I build an fd_set of client sockets in read state, plus my listening socket (for accepting new connections); and another fd_set of sockets in write state. Then, after the select(), I scan these to reconstruct, from the socket number, which of my client that was for. This fd_set building and scanning, though objectively not the current CPU bottleneck, makes me uneasy: the amount of work per transaction grows linearly with the number of clients; and while I see how to go over the default 64-sockets limit in an fd_set, I'm reluctant to go that route.
I vaguely see how I could use two threads, one handling the few most active clients, and another for the bulk of idle clients. That seems workable, but a tad complex.
So: what are the alternatives to select() under winsock2?

As you have seen, select() has a max limit for the number of sockets it can handle in a single call. If scalability is an issue for you then you should use Overlapped I/O or I/O Completion Ports instead. That way, you can issue read/write operations on individual sockets when needed and the OS will notify you when the work is finished, there is no need to poll for it.

Related

Ajax polling vs SSE (performance on server side)

I'm curious about if there is some type of standard limit on when is better to use Ajax Polling instead of SSE, from a server side viewpoint.
1 request every second: I'm pretty sure is better SSE
1 request per minute: I'm pretty sure is better Ajax
But what about 1 request every 5 seconds? How can we calculate where is the limit frequency for Ajax or SSE?
No way is 1 request per minute always better for Ajax, so that assumption is flawed from the start. Any kind of frequent polling is nearly always a costly choice. It seems from our previous conversation in comments of another question that you start with a belief that an open TCP socket (whether SSE connection or webSocket connection) is somehow costly to server performance. An idle TCP connection takes zero CPU (maybe every once in a long while, a keep alive might be sent, but other than that, an idle socket does not use CPU). It does use a bit of server memory to handle the socket descriptor, but a highly tuned server can have 1,000,000 open sockets at once. So, your CPU usage is going to be more about how many connections are being established and what are they asking the server to do every time they are established than it is about how many open (and mostly idle) connections there are.
Remember, every http connection has to create a TCP socket (which is roundtrips between client/server), then send the http request, then get the http response, then close the socket. That's a lot of roundtrips of data to do every minute. If the connection is https, it's even more work and roundtrips to establish the connection because of the crypto layer and endpoint certification. So doing all that every minute for hundreds of thousands of clients seems like a massive waste of resources and bandwidth when you could create one SSE connection and the client just listen for data to stream from the server over that connection.
As I said in our earlier comment exchange on a different question, these types of questions are not really answerable in the abstract. You have to have specific requirements of both client and server and a specific understanding of the data being delivered and how urgent it is on the client and therefore a specific polling interval and a specific scale in order to begin to do some calculations or test harnesses to evaluate which might be the more desirable way to do things. There are simply too many variables to come up with a purely hypothetical answer. You have to define a scenario and then analyze different implementations for that specific scenario.
Number of requests per second is only one of many possible variables. For example, if most the time you poll there's actually nothing new, then that gives even more of an advantage to the SSE case because it would have nothing to do at all (zero load on the server other than a little bit of memory used for an open socket most of the time) whereas the polling creates continual load, even when nothing to do.
The #1 advantage to server push (whether implement with SSE or webSocket) is that the server only has to do anything with the client when there is actually pertinent data to send to that specific client. All the rest of the time, the socket is just sitting there idle (perhaps occasionally on a long interval, sending a keep-alive).
The #1 disadvantage to polling is that there may be lots of times that the client is polling the server and the server has to expend resources to deal with the polling request only to inform that client that it has nothing new.
How can we calculate where is the limit frequency for Ajax or SSE?
It's a pretty complicated process. Lots of variables in a specific scenario need to be defined. It's not as simple as just requests/sec. Then, you have to decide what you're attempting to measure or evaluate and at what scale? "Server performance" is the only thing you mention, but that has to be completely defined and different factors such as CPU usage and memory usage have to be weighted into whatever you're measuring or calculating. Then, you may even need to run some test harnesses if the calculations don't yield an obvious answer or if the decision is so critical that you want to verify your calculations with real metrics.
It sounds like you're looking for an answer like "at greater than x requests/min, you should use polling instead of SSE" and I don't think there is an answer that simple. It depends upon far more things than requests/min or requests/sec.
"Polling" incurs overhead on all parties. If you can avoid it, don't poll.
If SSE is an option, it might be a good choice. "It depends".
Q: What (if any) kind of "event(s)" will your app need to handle?

Which kind of inter process communication (ipc) mechanism should I use at which moment?

I know that there are several methods of inter-process communication (ipc), like:
File
Signal
Socket
Message Queue
Pipe
Named pipe
Semaphore
Shared memory
Message passing
Memory-mapped file
However I was unable to find a list or a paper comparing these mechanism to each other and pointing out the benefits of them in different environemnts.
E.g I know that if I use a file which gets written by process A and process B reads it out it will work on any OS and is pretty robust, on the other hand - why shouldn't I use TCP Socket ? Has anyone a kind of overview in which cases which methods are the most suitable ?
Long story short:
Use lock files, mutexes, semaphores and barriers when processes compete for a scarce resource. They operate in a similar manner: several process try to acquire a synchronisation primitive, some of them acquire it, others are put in sleeping state until the primitive is available again. Use semaphores to limit the amount of processes working with a resource. Use a mutex to limit the amount to 1.
You can partially avoid using synchronisation primitives by using non-blocking thread-safe data structures.
Use signals, queues, pipes, events, messages, unix sockets when processes need to exchange data. Signals and events are usually used for notifying a process of something (for instance, ctrl+c in unix terminal sends a SIGINT signal to a process). Pipes, shared memory and unix sockets are for transmitting data.
Use sockets for networking (or, speaking formally, for exchanging data between processes located on different machines).
Long story long: take a look at Modern Operating Systems book by Tanenbaum & Bos, namely IPC chapter. The topic is vast and can't be completely covered within a list or a paper.

Can a server handle multiple sockets in a single thread?

I'm writing a test program that needs to emulate several connections between virtual machines, and it seems like the best way to do that is to use Unix domain sockets, for various reasons. It doesn't really matter whether I use SOCK_STREAM or SOCK_DGRAM, but it seems like SOCK_STREAM is easier/simpler for my usage.
My problem seems to be a little backwards from the typical scenario. I want to have a single client communicating with the server over 4 distinct sockets. (I could have 4 clients with one socket each, but that distinction shouldn't matter.) Now, the thing I'm emulating doesn't have multiple threads and gets an interrupt whenever a data packet is received over one of the "sockets". Is there some easy way to emulate this with Unix sockets?
I believe that I have to do the socket(), bind(), and listen() for all 4 sockets first, then do an accept() for all 4, and do fcntl( fd, F_SETFF, FNDELAY ) for each one to make them nonblocking, so that I can check each one for data with read() in a round-robin fashion. Is there any way to make it interrupt-driven or event-driven, so that my main loop only checks for data in the socket if there's data there? Or is it better to poll them all like this?
Yes. Handling multiple connections is almost synonymous with "server", and they are often single threaded -- but please not this way:
check each one for data with read() in a round-robin fashion
That would require, as you mention, non-blocking sockets and some kind of delay to prevent your "round-robin" from becoming a system killing busy loop.
A major problem with that is the granularity of the delay. You can't make it too small, or the loop will still hog too much CPU time when nothing is happening. But what about when something is happening, and that something is data incoming simultaneously on multiple connections? Now your delay can produce a snowballing backlog of tish leading to refused connections, etc.
It just is not feasible, and no one writes a server that way, although I am sure anyone would give it serious thought if they were unaware of the library functions intended to tackle the problem. Note that networking is a platform specific issue, so these are not actually part of the C standard (which does not deal with sockets at all).
The functions are select(), poll(), and epoll(); the last one is linux specific and the other two are POSIX. The basic idea is that the call blocks, waiting until one or more of any number of active connections is ready to read or write. Waiting for a socket to be ready to write only meaningfully applies to NON_BLOCK sockets. You don't have to use NON_BLOCK, however, and the select() call blocks regardless. Using NON_BLOCK on the individual sockets makes the implementation more complex, but increases performance potential in a single threaded server -- this is the idea behind asynchronous servers (such as nginx), a paradigm which contrasts with the more traditional threaded synchronous model.
However, I would recommend that you not use NON_BLOCK initially because of the added complexity. When/if it ends up being called for, you'll know. You still do not need threads.
There are many, many, many examples and tutorials around about how to use select() in particular.

kernel-based (Linux) data relay between two TCP sockets

I wrote TCP relay server which works like peer-to-peer router (supernode).
The simplest case are two opened sockets and data relay between them:
clientA <---> server <---> clientB
However the server have to serve about 2000 such A-B pairs, ie. 4000 sockets...
There are two well known data stream relay implementations in userland (based on socketA.recv() --> socketB.send() and socketB.recv() --> socketA.send()):
using of select / poll functions (non-blocking method)
using of threads / forks (blocking method)
I used threads so in the worst case the server creates 2*2000 threads! I had to limit stack size and it works but is it right solution?
Core of my question:
Is there a way to avoid active data relaying between two sockets in userland?
It seems there is a passive way. For example I can create file descriptor from each socket, create two pipes and use dup2() - the same method like stdin/out redirecting. Then two threads are useless for data relay and can be finished/closed.
The question is if the server should ever close sockets and pipes and how to know when the pipe is broken to log the fact?
I've also found "socket pairs" but I am not sure about it for my purpose.
What solution would you advice to off-load the userland and limit amount fo threads?
Some extra explanations:
The server has defined static routing table (eg. ID_A with ID_B - paired identifiers). Client A connects to the server and sends ID_A. Then the server waits for client B. When A and B are paired (both sockets opened) the server starts the data relay.
Clients are simple devices behind symmetric NAT therefore N2N protocol or NAT traversal techniques are too complex for them.
Thanks to Gerhard Rieger I have the hint:
I am aware of two kernel space ways to avoid read/write, recv/send in
user space:
sendfile
splice
Both have restrictions regarding type of file descriptor.
dup2 will not help to do something in kernel, AFAIK.
Man pages: splice(2) splice(2) vmsplice(2) sendfile(2) tee(2)
Related links:
Understanding sendfile() and splice()
http://blog.superpat.com/2010/06/01/zero-copy-in-linux-with-sendfile-and-splice/
http://yarchive.net/comp/linux/splice.html (Linus)
C, sendfile() and send() difference?
bridging between two file descriptors
Send and Receive a file in socket programming in Linux with C/C++ (GCC/G++)
http://ogris.de/howtos/splice.html
OpenBSD implements SO_SPLICE:
relayd asiabsdcon2013 slides / paper
http://www.manualpages.de/OpenBSD/OpenBSD-5.0/man2/setsockopt.2.html
http://metacpan.org/pod/BSD::Socket::Splice .
Does Linux support something similar or only own kernel-module is the solution?
TCPSP
SP-MOD described here
TCP-Splicer described here
L4/L7 switch
HAProxy
Even for loads as tiny as 2000 concurrent connections, I'd never go with threads. They have the highest stack and switching overhead, simply because it's always more expensive to ensure that you can be interrupted anywhere than when you can only be interrupted at specific places. Just use epoll() and splice (if your sockets are TCP, which seems to be the case) and you'll be fine. You can even make epoll work in event triggered mode, where you only register your fds once.
If you absolutely want to use threads, use one thread per CPU core to spread the load, but if you need to do this, it means you're playing at speeds where affinity, RAM location on each CPU socket etc... plays a significant role, which doesn't seem to be the case in your question. So I'm assuming that a single thread is more than enough in your case.

Are socket operations performed in parallel at the system level?

If I have a quad core machine and 4 open and active sockets is it faster to have 4 threads to read from these sockets in parallel or just have one thread that reads the sockets iteratively? Can the socket read operation be performed in parallel or is it synchronized somewhere at the system level?
Analogously, if I have 4 sockets to which I need to write some data, is it faster to have 4 threads that write to these sockets at the same time, or is it better to have one thread that does all the writing?
It seems that multiple cores should enable higher throughput (of course if the network bandwidth is not the bottleneck), however I wonder if this is the case, since after all, in the end there is only a single ethernet cord (or other medium), so at some time, all these writes will have to be synchronized anyway, and the packets will be sent sequentially...
Specifically, if I have a machine running n threads which generate requests that need to be sent to another machine is it better to open n sockets and let each thread write to it? Or is it better to synchronize all the threads and send all the data through a single socket?
And on the receiving side is it better to have a single socket to read from? Or is it better to have n sockets read in parallel? What if all the readers need to be synchronized after reading anyway?
Socket operations are largely asynchronous to the application; buffered; and, if you're lucky, concurrent inside the kernel: the only significant state to synchronize on is per-connection. However the rate-determining step is really the bandwidth of the network, which you can easily saturate with even just one thread. Web browsers typically open between 4 and 8 threads per Web page to fetch the HTML itself and the images etc.